Hybrid Transactional and Analytical Processing, or HTAP, is an advanced database capability that allows both transactional and analytical workloads to run on the same system without one impacting the performance of the other.
In this Video Blog, I cover some of the history of HTAP, some of the challenges and benefits of these systems, and where you can find them in Azure.
In this video blog post I cover the serving layer step of building your Modern Data Warehouse in Azure. There are certainly some decisions to be made around how you want to structure your schema as you get it ready for presentation in your business intelligence tool of choice (Power BI in this example), so I discuss some of the areas you should focus on:
What is your schema type? Star, snowflake, or something else?
Where should you serve up the data? SQL Server, Synapse, ADLS, Databricks, or something else?
What are your service level agreements for the business? What are your data processing times?
Can you save cost by using an option that’s less compute heavy?
For developers, it’s very beneficial to have a source code repository. A
source code repository helps you keep track of all your changes, manage tasks
and branches, share the code with a team and, simply put, keep it in a safe
place.
In this post, I’ll tell you why you should connect your Azure
Data Factories to a source code repository, and I’ll demo how to do so:
To do this, I’ll start in my data factory inside my Azure portal.
When I go into Author & Monitor, I have the ability to either:
Set up a code repository from the landing page (the main page here), or
Go directly into my data factory, open the drop-down in the
left-hand corner and select ‘Set up code repository’.
One thing to note is that the code repository itself has been supported
for a while, but recently, with the release of Data Flows, GitHub is now
supported for repositories as well (previously it was only Azure DevOps).
So, I select my GitHub account and fill in the information. (If
you’re doing this for the first time, it’s going to prompt you to log
into your GitHub account. In this case, I’ve already connected it, so it
knows about my repositories.)
Next, I select my repository name, go to my playground branch and use my existing playground.
One field will ask me ‘Branch to import resources into:’ so if I’m
importing resources, I can select an existing one or create a new one.
For this demo I’m going to pick my playground.
Before I hit Save, notice on the left-hand side I’ve got zero
pipelines, one data set and zero data flows. But when I connect to my
playground, it’s going to bring in everything I’ve previously saved or
checked into that playground branch. So, now you’ll see I have 6
pipelines, 23 data sets and 4 data flows.
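If you’d rather script this step than click through the portal, the same repository association can be made through the Data Factory management API. Here’s a minimal sketch, assuming the azure-identity and azure-mgmt-datafactory Python packages; the subscription, resource group, factory, account and repository names are all hypothetical placeholders.

```python
# Sketch: associate an existing data factory with a GitHub repository using the
# Data Factory management SDK. All names below are hypothetical placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    FactoryGitHubConfiguration,
    FactoryRepoUpdate,
)

subscription_id = "<subscription-id>"
client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

repo_config = FactoryGitHubConfiguration(
    account_name="my-github-account",       # GitHub account or organization
    repository_name="adf-playground",       # repo that will hold the JSON definitions
    collaboration_branch="playground",      # branch to work in / import resources into
    root_folder="/",                        # folder inside the repo for ADF artifacts
)

update = FactoryRepoUpdate(
    factory_resource_id=(
        "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
        "/providers/Microsoft.DataFactory/factories/<factory-name>"
    ),
    repo_configuration=repo_config,
)

# The call is made against the factory's Azure region
client.factories.configure_factory_repo("eastus", update)
```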
Another nice piece of having source control shows up when I want to add a
new data set. I just select the SQL Server I was previously connected to,
leave it on the defaults for now and connect.
I then select one of the tables; here I picked the Product Category table.
You’ll see at the top you have the option to Save All or Publish.
Save All is going to save any of the changes across the tabs, and you can
tell when there’s been a change to a data set, pipeline or data flow
because a star appears next to its name.
Now, instead of needing to publish every time you’re doing
development, you can just save, and it will save here. So, rather than
having to publish the entire pipeline, run all the error checking
and make sure the pipeline is in good standing, you can now save
part way through. This is a huge advantage over having to publish the
entire pipeline, which could cause challenges and isn’t efficient for
development.
The addition of GitHub support gives us another great option if you weren’t previously using Azure DevOps (formerly known as Visual Studio Team Services). I’ll be doing more upcoming Azure Every Day blogs around Data Factory that cover some of its nuances, as the product has greatly evolved since the release of Data Flows.
Pragmatic Works is considered an expert in the Microsoft Data Platform, both on-premises and in Azure, so we often get asked questions like: how can a certain technology benefit my company? One technology we are asked about a lot is Azure Databricks. It was released over a year ago in preview in the Azure portal, and we’re starting to see massive adoption by many companies, but not everyone is ready to delve into data science and deep analytics, so they haven’t had much exposure to what Databricks is and what it can do for their business.
There are some barriers preventing organizations from adopting data
science and machine learning, even though these can be applied to solve many
common business challenges. One example is collaboration between data
scientists, data engineers and business analysts who are all working with
data (structured and unstructured) from a multitude of sources.
In addition, there’s complexity involved when you try to do things with these massive volumes of data. Then add in some cultural aspects, multiple teams and the use of consultants, and with all these factors, how do you get that one common theme and common platform where everybody can work and be on the same page? Azure Databricks is one answer.
Here’s an overview of 3 common use cases that we’re beginning to see and how they can benefit your organization:
1. Recommendation Engines – Recommendation
engines are becoming an integral part of applications and software
products as mobile apps and other advances in technology continue to
change the way users choose and utilize information. Most likely, when
you’re shopping on any major retail site, it will recommend related
products based on the products you’ve selected or are looking at
(a minimal Spark sketch of this approach follows these three use cases).
2. Churn Analysis – Commonly known as
customer attrition; basically, it’s when we lose customers. Using
Databricks, there are ways to find out what some of the warning signs
behind that are. Think about it: if you can correlate the data
that leads up to a customer leaving your company, you have
a better chance of saving that customer.
And we all know that keeping a customer, and giving them the service they need or the product they want, is significantly less costly than having to acquire new customers.
3. Intrusion Detection – This is needed to
monitor networks, systems and activities for malicious activity or
policy violations, and to produce electronic reports to a
dashboard, a management station or wherever that information is captured.
With the combination of streaming and batch technologies tightly
integrated with Databricks and the Azure Data Platform, we are getting
access to more real-time and static data correlations that help us make
faster decisions and avoid some of these intrusions.
Once we’re alerted that there is a problem, we can shut it off very quickly, or use automation options to do that as well.
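To make the recommendation engine scenario a little more concrete, here’s a minimal sketch of how it might look in a Databricks notebook using Spark MLlib’s ALS algorithm. The table and column names are hypothetical, and this is just one way to approach it rather than a prescribed solution; `spark` is the SparkSession Databricks provides in a notebook.

```python
# Sketch of a recommendation engine using Spark MLlib's ALS (alternating least
# squares). Table and column names are hypothetical placeholders.
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

# One row per (user, product, rating) interaction
ratings = spark.table("sales.product_ratings")
train, test = ratings.randomSplit([0.8, 0.2], seed=42)

als = ALS(
    userCol="user_id",
    itemCol="product_id",
    ratingCol="rating",
    coldStartStrategy="drop",   # drop users/items unseen during training when scoring
)
model = als.fit(train)

# Check accuracy with RMSE, then produce the top 5 product recommendations per user
rmse = RegressionEvaluator(
    metricName="rmse", labelCol="rating", predictionCol="prediction"
).evaluate(model.transform(test))
top_5_per_user = model.recommendForAllUsers(5)
```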
Today I wanted to highlight some of the ways that you can utilize
Databricks to help your organization. If you have questions or would
like to break down some of these barriers to adopting machine learning
and data science for your business, we can help.
We are using all the Azure technologies and talking about them with
our customers all the time, as well as deploying real-world workload
scenarios.
Today
I’d like to tell you about some very cool intelligence features within
Azure SQL Database. Azure SQL Database delivers
intelligent capabilities through a range of built-in machine learning
and adaptive technologies that monitor and manage performance and
security for you.
Using telemetry from the millions of databases that have run in Azure over
the years, Microsoft has built a truly intelligent and autonomous database
capability that can learn and adapt to your workload. This intelligent
performance gives you deeper insight into database performance. Plus, it
eliminates the hassle of making ongoing improvements, allowing you to focus
more on driving your business and less on “chores”.
Features like query performance insights and automatic tuning continuously monitor database usage and detect disruptive events and then they take steps to improve performance.
Three examples of intelligent performance features that can collectively optimize your memory usage and improve overall query performance are:
Row mode memory grant feedback – this
gives you the ability to expand on batch-mode memory grant feedback by
adjusting memory grant sizes for both batch and row mode operators.
Approximate query processing – this is
designed to provide aggregations across large datasets where
responsiveness is more critical than absolute precision; it returns an
approximate value with the focus on performance (see the sketch after this list).
Table variable deferred compilation – this
improves plan quality and overall performance for queries referencing
table variables by propagating cardinality estimates that are based on
actual table variable row counts. In turn, this optimizes your
downstream plan operations.
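To give a feel for approximate query processing, here’s a minimal sketch that compares an exact distinct count with APPROX_COUNT_DISTINCT from Python using pyodbc. The connection string, table and column names are hypothetical placeholders.

```python
# Sketch: exact vs. approximate distinct counts against Azure SQL Database via pyodbc.
# The connection string, table and column names are hypothetical placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:myserver.database.windows.net,1433;"
    "Database=mydb;Uid=myuser;Pwd=<password>;Encrypt=yes;"
)
cursor = conn.cursor()

# Exact count: precise, but memory- and time-intensive on very large tables
cursor.execute("SELECT COUNT(DISTINCT OrderId) FROM dbo.FactSales;")
print("exact:", cursor.fetchone()[0])

# Approximate count: trades a small margin of error for speed and a low memory footprint
cursor.execute("SELECT APPROX_COUNT_DISTINCT(OrderId) FROM dbo.FactSales;")
print("approximate:", cursor.fetchone()[0])
```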
Along with all those features, Azure SQL Database intelligent protection
allows you to efficiently and productively meet your data’s security and
compliance requirements by proactively monitoring for potential threats and
vulnerabilities. You can flag things such as PII or a cross-site scripting
attack, and there are detection mechanisms in there that can help you avoid
these.
Through features like information protection, vulnerability
assessment and threat detection, you can proactively discover and
protect sensitive data, as well as uncover potential vulnerabilities and
detect anomalous activities that could indicate a threat to your data.
In short, Microsoft has built these intelligent features over years
of machine learning and is applying them to all their Platform as a
Service offerings, as well as some of their on-premises ones. These are
really cool features and we’ve gotten great responses about them and how
well they work.
I recommend you give these features a try, but remember, always try them out in your test or dev environments prior to bringing them into production.
I’m
excited to announce that Azure Data Factory Data Flow is now in public
preview and I’ll give you a look at it here. Data Flow is a new feature
of Azure Data Factory (ADF) that allows you to develop graphical data
transformation logic that can be executed as activities within ADF
pipelines.
The intent of ADF Data Flows is to provide a fully visual experience
with no coding required. Your Data Flow will execute on your own Azure
Databricks cluster for scaled-out data processing using Spark. ADF
handles all the code translation, Spark optimization and execution of
the transformations in Data Flows, so it can handle massive amounts of data
very quickly.
In the current public preview, the Data Flow activities available are:
Joins – where you can join data from 2 streams based on a condition
Conditional Splits – allow you to route data to different streams based on conditions
Union – collecting data from multiple data streams
Lookups – looking up data from another stream
Derived Columns – create new columns based on existing ones
Aggregates – calculating aggregations on the stream
Surrogate Keys – this will add a surrogate key column to output streams from a specific value
Exists – check to see if data exists in another stream
Select – choose columns to flow into the next stream that you’re running
Filter – you can filter streams based on a condition
Sort – order data in the stream based on columns
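To give a rough sense of what these transformations do, here are PySpark analogues of a few of them. This is only an illustration of the kind of Spark work ADF generates on your behalf; the table and column names are hypothetical, and it is not the code ADF itself emits.

```python
# Rough PySpark analogues of a few Data Flow transformations, with hypothetical
# table and column names. ADF generates and runs its own Spark code; this just
# illustrates the kind of work involved.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
orders = spark.table("staging.orders")
customers = spark.table("staging.customers")

joined = orders.join(customers, on="customer_id", how="inner")        # Join
derived = joined.withColumn(                                          # Derived Column
    "line_total", F.col("quantity") * F.col("unit_price"))
filtered = derived.filter(F.col("order_status") == "shipped")         # Filter
aggregated = (filtered                                                 # Aggregate
              .groupBy("customer_id")
              .agg(F.sum("line_total").alias("total_spend")))
result = aggregated.orderBy(F.col("total_spend").desc())              # Sort
```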
Getting Started:
To get started with Data Flow, you’ll need to sign up for the Preview by emailing adfdataflowext@microsoft.com with your ID from the subscription you want to do your development in. You’ll receive a reply when it’s been added and then you’ll be able to go in and add new Data Flow activities.
At this point, when you go in and create a Data Factory, you’ll now
have 3 options: Version 1, Version 2 and Version 2 with Data Flow.
Next, go to aka.ms/adfdataflowdocs; this will give you all the
documentation you need for building your first Data Flows, as well as
samples you can work and play around with. You can then
create your own Data Flows, add a Data Flow activity to your pipeline,
and execute and test it in debug mode in the pipeline, or use Trigger Now
in the pipeline to test your Data Flow from a pipeline activity.
Ultimately, you can operationalize your Data Flow by scheduling and
monitoring your Data Factory pipeline that is executing the Data Flow
activity.
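If you want to script that operationalization step, the pipeline containing your Data Flow activity can also be triggered and monitored programmatically. Here’s a minimal sketch using the azure-identity and azure-mgmt-datafactory Python packages; the subscription, resource group, factory and pipeline names are hypothetical placeholders.

```python
# Sketch: trigger the pipeline that contains the Data Flow activity, then poll its
# run status. Resource names are hypothetical placeholders.
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

run = client.pipelines.create_run(
    resource_group_name="rg-demo",
    factory_name="adf-demo",
    pipeline_name="DataFlowPipeline",
)

# Poll until the run reaches a terminal state, then report it
status = client.pipeline_runs.get("rg-demo", "adf-demo", run.run_id)
while status.status in ("Queued", "InProgress"):
    time.sleep(30)
    status = client.pipeline_runs.get("rg-demo", "adf-demo", run.run_id)
print(status.status)  # e.g. Succeeded or Failed
```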
With Data Flow we have the data orchestration and transformation
piece we’ve been missing. It gives us a complete picture for the ETL/ELT
scenarios we want to do in the cloud or in hybrid environments, whether
on-premises to cloud or cloud to cloud.
With Data Flow, Azure Data Factory has become the true cloud replacement for SSIS, and this should be in GA by year’s end. It is well designed and has some neat features, especially how you build your expressions, which in my opinion works better than in SSIS.
When you get a chance, check out Azure Data Factory and its Data Flow features and let me know if you have any questions!
If
you know about or are already using Databricks, I’m excited to tell you
about Databricks Delta. As most of you know, Apache Spark is the
underlying technology for Databricks, so about 75-80% of all the code
in Databricks is still Apache Spark. You get that super-fast, in-memory
processing of both streaming and batch data, since some of the
founders of Spark built Databricks.
The ability to offer Databricks Delta is one big difference between
Spark and Databricks, aside from the workspaces and the collaboration
options that come native to Databricks. Databricks Delta delivers a
powerful transactional storage layer by harnessing the power of Spark
and Databricks DBFS.
The core abstraction of Databricks Delta is an optimized Spark table
that stores data as Parquet files in DBFS and maintains a
transaction log that efficiently tracks changes to the table. So, you
can read and write data stored in the Delta format using the same Spark SQL
batch and streaming APIs that you use to work with Hive tables and DBFS
directories.
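As a minimal sketch of what that looks like in a Databricks notebook (with a hypothetical source table and DBFS path; `spark` is the SparkSession Databricks provides):

```python
# Sketch: writing and reading Delta-format data in a Databricks notebook.
# The DBFS path and source table are hypothetical placeholders.
events = spark.table("raw.events")

# Write in the Delta format: Parquet files plus a transaction log in DBFS
events.write.format("delta").mode("overwrite").save("/delta/events")

# Batch read with the same API you would use for Parquet or Hive-style tables
events_delta = spark.read.format("delta").load("/delta/events")

# The same location can also be consumed as a stream
events_stream = spark.readStream.format("delta").load("/delta/events")
```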
With the addition of the transaction log, as well as other enhancements, Databricks Delta offers some significant benefits:
ACID Transactions – a big one for consistency. Multiple writers can simultaneously modify a dataset and see consistent views. Also, writers can modify a dataset without interfering with jobs reading the dataset.
Faster Read Access – automatic file management organizes data into large files that can be read efficiently. Plus, statistics enable speeding up reads by 10-100x, and data skipping avoids reading irrelevant information. This is not available in Apache Spark, only in Databricks.
Databricks Delta is another great feature of Azure Databricks that is not available in traditional Spark, further separating the capabilities of the products and providing a great platform for your big data, data science and data engineering needs.
Microsoft
has recently announced another Platform as a Service (PaaS) offering
with the release of MariaDB in Preview in Azure. I’d like to tell you
more about that offering and some of its advantages.
First, a little history on MariaDB. MariaDB is a community-developed
fork of MySQL. Essentially, when Oracle acquired MySQL through its purchase
of Sun, some of the MySQL developers were concerned that the acquisition
would lead to changes or, down the road, to the product no longer being
open source.
So, they went ahead and forked off to MariaDB with the intent to
maintain high compatibility with MySQL. Also, contributors are
required to share their copyright with the MariaDB Foundation, which
in a nutshell means the foundation intends the project to always remain open source.
Now take the open source technology of MariaDB, which has proven
valuable for many companies, and combine that with the fact that
it’s in the Azure platform as a managed offering. Consumers of this
offering from Azure get to take advantage of some of the standard
capabilities of the PaaS model, such as:
Built-in high availability with no extra cost
3 tiers (basic, general purpose and memory optimized) which you can
choose from depending on whether your workload is transactional or
analytical.
99.99% availability SLA
Predictable performance through built-in monitoring
and alerting, giving you the ability to quickly assess the effects of
scaling vCores up or down based on current or projected performance
needs – through automation or manually, in seconds.
Secure protection of sensitive data at rest and in motion, using
256-bit encryption on secured disks in the Azure data centers and
enforcing an SSL connection for data in transit (a minimal connection
sketch follows this list). Note: you can turn off the SSL requirement if
your application doesn’t support it – I don’t recommend it, but it can be done.
Automatic backups so you can have point-in-time restore for up to 35 days
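Here’s a minimal sketch of connecting over an enforced SSL connection from Python with the pymysql driver. The server, user, database and certificate path are hypothetical, and the user@servername format and Baltimore CyberTrust root certificate are assumptions based on how Azure’s managed MySQL-family endpoints typically work.

```python
# Sketch: SSL connection to Azure Database for MariaDB with pymysql.
# Server, credentials, database and CA path are hypothetical placeholders.
import pymysql

conn = pymysql.connect(
    host="mydemoserver.mariadb.database.azure.com",
    user="myadmin@mydemoserver",      # assumed user@servername format for the Azure gateway
    password="<password>",
    database="inventory",
    ssl={"ca": "/path/to/BaltimoreCyberTrustRoot.crt.pem"},  # assumed CA for Azure endpoints
)
with conn.cursor() as cur:
    cur.execute("SELECT VERSION();")
    print(cur.fetchone())
```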
These are all the standard advantages you get with Azure’s database
Platform as a Service offerings, like SQL DB or Mongo DB. Azure
Database for MariaDB is just another option Microsoft has added to the
database portfolio, and a great option to check out if you use
MariaDB. It’s in Preview now, but I’m sure it will be made generally
available pretty soon.
Are you looking to gain speed on your Apache Spark jobs? How does a 9X performance improvement sound? Today I’m excited to tell you how engineers at Microsoft were able to gain that speed on HDInsight Apache Spark clusters.
If you’re unfamiliar with HDInsight, it’s Microsoft’s premium managed
offering for running open source workloads on Azure. You can run things
like Spark, Hadoop, Hive and LLAP, among others. You create clusters,
spin them up when you need them and spin them down when you’re not using them.
The big news here is the recently released preview of HDInsight IO
Cache, which is a new transparent data caching feature that provides
customers with up to 9X performance improvement for Spark jobs, without
an increase in costs.
There are many open source caching products in the
ecosystem: Alluxio, Ignite and RubiX, to name a few big ones. IO
Cache is based on RubiX, and what differentiates RubiX from other
comparable caching products is its approach of using SSDs and eliminating
the need for explicit memory management, while other comparable caching
products reserve operating memory for caching the data.
Because SSDs typically provide more than 1 gigabyte per second of
bandwidth, and the operating system’s in-memory file cache can be leveraged
on top of that, there is enough bandwidth to feed big data processing
engines like Spark. This allows Spark to run optimally, handle bigger
memory workloads and deliver better overall performance by speeding up
jobs that read data from remote cloud storage – the dominant
architecture pattern in the cloud.
In benchmark tests comparing a Spark cluster with and without the IO
Cache running, they performed 99 SQL queries against a 1 terabyte
dataset and got as much as 9X performance improvement with IO Cache
turned on.
Let’s face it, data is growing everywhere and the requirement for
processing that data is increasing every day. And we want
faster, closer-to-real-time results. To do this, we need to
think more creatively about how we can improve performance in other
ways, without the age-old recipe of throwing hardware at the problem
instead of tuning it or trying a new approach.
This is a great way to leverage existing hardware and help
it run more efficiently. So, if you’re running HDInsight, try this out
in a test environment. It’s as simple as a checkbox (off by
default): go in, spin up your cluster, check the box to include IO
Cache and see what performance gains you can achieve with your
HDInsight Spark clusters.