Category Archives: Database

What is HTAP in Azure

Hybrid Transactional and Analytical Processing, or HTAP, is an advanced database capability that allows transactional and analytical workloads to run against the same data without one impacting the performance of the other.

In this video blog, I cover some of the history of HTAP, some of the challenges and benefits of these systems, and where you can find them in Azure.

The Modern Data Warehouse in Azure Part 4: The Serving Layer

In this video blog post, I cover the serving layer step of building your Modern Data Warehouse in Azure. There are certainly decisions to be made around how you want to structure your schema as you get it ready for presentation in your business intelligence tool of choice (Power BI in this example), so I discuss some of the areas you should focus on:

  • What is your schema type? Snowflake or star, or something else? (A small sketch of the difference follows this list.)
  • Where should you serve up the data? SQL Server, Synapse, ADLS, Databricks, or something else?
  • What are your service-level agreements for the business? What are your data processing times?
  • Can you save cost by using an option that’s less compute-heavy?
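Here is a minimal sketch of the star vs. snowflake trade-off using PySpark, with tiny, hypothetical tables standing in for a real serving layer. In a star schema the dimension is denormalized, so a report needs only one join; in a snowflake schema the same attribute sits in a normalized outrigger table, which saves storage but adds joins at query time. Table and column names here are illustrative assumptions, not part of the original post.

```python
# A minimal star-schema sketch; tables and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("serving-layer-sketch").getOrCreate()

spark.createDataFrame(
    [(1, 100, 25.0), (2, 101, 99.0)], ["SaleId", "ProductKey", "SalesAmount"]
).createOrReplaceTempView("FactSales")

# Star: the category is denormalized onto the product dimension, so one join
# serves the report. A snowflake design would move ProductCategory to a
# separate DimProductCategory table and require a second join.
spark.createDataFrame(
    [(100, "Bikes"), (101, "Helmets")], ["ProductKey", "ProductCategory"]
).createOrReplaceTempView("DimProduct")

spark.sql("""
    SELECT d.ProductCategory, SUM(f.SalesAmount) AS TotalSales
    FROM FactSales f
    JOIN DimProduct d ON f.ProductKey = d.ProductKey
    GROUP BY d.ProductCategory
""").show()
```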

How and Why to Add a Source Code Repository to Azure Data Factory

For developers, it’s very beneficial to have a source code repository. A source code repository helps you keep track of all your changes, manage tasks and branches, share code with a team and, simply put, keep it in a safe place.

In this post, I’ll tell you why you should connect your Azure Data Factories to a source code repository, and I’ll demo how to do so:

  • To do this, I’ll start in my data factory inside my Azure portal.
  • When I go into Author & Monitor, I have the ability to either:
    • Set up a code repository within the landing page or main page here or,
    • I can go directly into my data factory, use the drop-down in the left-hand corner and select ‘Set up code repository’.
  • One thing to note: the code repository itself has been supported for a while, but with the recent release of Data Flows, GitHub is now supported as a repository type as well (previously it was only Azure DevOps).
  • So, I select my GitHub account and fill in the information. (If you’re doing this for the first time, it will prompt you to log into your GitHub account.) In this case, I’ve connected previously, so it already knows about my repositories.
  • Next, I select my repository name and choose my existing playground branch.
  • One field will ask me ‘Branch to import resources into:’ so if I’m importing resources, I can select an existing one or create a new one. For this demo I’m going to pick my playground.
  • Before I hit Save, notice on the left-hand side I’ve got zero pipelines, one data set and zero data flows. But when I connect to my playground branch, it brings in everything I’ve previously saved or checked into that branch. So, now you’ll see I have 6 pipelines, 23 data sets and 4 data flows.
  • Another nice piece of having source control is when I want to add a new data set. I just select the SQL Server I was previously connected to, leave the settings on default for now and connect.
  • I then select one of the tables; I selected the Product Category Table.
  • You’ll see at the top you have the option to Save All or Publish. Save All saves the changes across all tabs; you can tell when something has changed, whether it’s a data set, pipeline or data flow, by the star next to its name.
  • Now, instead of needing to publish every time you’re doing development, you can just save your work as you go. Rather than publishing the entire pipeline, which runs validation and requires the pipeline to be in good standing, you can save part way through. That’s a huge advantage for development. (If you’d rather script this setup than click through the portal, see the sketch below.)
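For those who prefer scripting the same repository association, here is a rough, hypothetical sketch using the azure-mgmt-datafactory Python SDK. The subscription, resource group, factory name, GitHub account and region are placeholders, and model or method names may differ slightly between SDK versions, so treat this as a starting point rather than a definitive implementation.

```python
# Hypothetical sketch: associate an existing data factory with a GitHub repo.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    FactoryRepoUpdate,
    FactoryGitHubConfiguration,
)

subscription_id = "<subscription-id>"          # placeholder
factory_resource_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.DataFactory/factories/<factory-name>"
)

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

repo_config = FactoryGitHubConfiguration(
    account_name="<github-account>",           # GitHub account or organization
    repository_name="<repository-name>",
    collaboration_branch="playground",         # the branch used in the demo above
    root_folder="/",
)

client.factories.configure_factory_repo(
    "eastus",                                  # region of the data factory
    FactoryRepoUpdate(
        factory_resource_id=factory_resource_id,
        repo_configuration=repo_config,
    ),
)
```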

The newly added GitHub support gives you another great option if you didn’t previously use Azure DevOps (formerly known as Visual Studio Team Services). I’ll be doing more Azure Every Day blogs around Data Factory that cover some of the nuances, as the product has greatly evolved since the release of Data Flows.

3 Common Analytics Use Cases for Azure Databricks

At Pragmatic Works, we’re considered experts in the Microsoft Data Platform, both on-premises and in Azure. That being said, we often get asked questions like: how can a certain technology benefit my company? One technology we’re asked about a lot is Azure Databricks. It was released in preview in the Azure portal over a year ago and we’re starting to see massive adoption by many companies, but not everyone is ready to delve into data science and deep analytics, so they haven’t had much exposure to what Databricks is and what it can do for their business.

There are some barriers preventing organizations from adopting data science and machine learning, which can be applied to solve many common business challenges. Collaboration between data scientists, data engineers and business analysts who are working with data (structured and unstructured) from a multitude of sources is one example of those barriers.

In addition, there’s complexity involved when you try to work with these massive volumes of data. Then add in cultural aspects, such as having multiple teams and using consultants, and with all these factors, how do you get one common platform where everybody can work and be on the same page? Azure Databricks is one answer.

Here’s an overview of 3 common use cases that we’re beginning to see and how they can benefit your organization:

1. Recommendation Engines – Recommendation engines are becoming an integral part of applications and software products as mobile apps and other advances in technology continue to change the way users choose and consume information. When you’re shopping on any major retail site, it will most likely recommend related products based on the products you’ve selected or are looking at.
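As a rough illustration of how this looks in Databricks, here is a minimal sketch using Spark MLlib’s ALS (alternating least squares) recommender. The ratings data and column names are hypothetical; you would swap in your own purchase or click data.

```python
# Minimal, hypothetical recommendation-engine sketch with Spark MLlib ALS.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("recommendations-sketch").getOrCreate()

ratings = spark.createDataFrame(
    [(1, 101, 5.0), (1, 102, 3.0), (2, 101, 4.0), (2, 103, 2.0)],
    ["userId", "productId", "rating"],
)

als = ALS(
    userCol="userId",
    itemCol="productId",
    ratingCol="rating",
    coldStartStrategy="drop",   # avoid NaN predictions for unseen users/items
)
model = als.fit(ratings)

# Top 3 product recommendations per user
model.recommendForAllUsers(3).show(truncate=False)
```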

2. Churn Analysis – Commonly known as customer attrition; basically, it’s when we lose customers. Using Databricks, there are ways to find the warning signs behind that. Think about it: if you can correlate the data that leads to a customer leaving your company, you have a much better chance of saving that customer.

And we all know that keeping a customer and giving them the service they need or the product they want is significantly less costly than having to acquire new customers.
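For intuition, here is a minimal, hypothetical churn-analysis sketch in Databricks: train a simple classifier to surface which signals correlate with customers leaving. The customer DataFrame and its columns are illustrative assumptions only.

```python
# Minimal, hypothetical churn-analysis sketch with Spark MLlib.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("churn-sketch").getOrCreate()

customers = spark.createDataFrame(
    [(12, 3, 1, 1.0), (48, 0, 0, 0.0), (6, 5, 2, 1.0), (36, 1, 0, 0.0)],
    ["tenure_months", "support_tickets", "late_payments", "churned"],
)

features = VectorAssembler(
    inputCols=["tenure_months", "support_tickets", "late_payments"],
    outputCol="features",
).transform(customers)

model = LogisticRegression(featuresCol="features", labelCol="churned").fit(features)

# The coefficients hint at which signals correlate most with churn.
print(model.coefficients)
```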

3. Intrusion Detection – This is needed to monitor networks or systems for malicious activity or policy violations and to produce reports to a dashboard, a management station or wherever that information is captured.

With the combination of streaming and batch technologies tightly integrated with Databricks and the Azure Data Platform, we get access to more real-time and static data correlations that help us make faster decisions and avoid some of these intrusions.

Once we’re alerted that there is a problem, we can shut it off very quickly, or use automation options to do that as well.
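Here is a minimal sketch of what streaming intrusion detection could look like with Spark Structured Streaming in Databricks: read events as they land, flag suspicious ones against a simple rule, and surface the alerts for a dashboard. The landing path, schema and threshold are hypothetical assumptions, not a production detection pipeline.

```python
# Hypothetical streaming intrusion-detection sketch with Structured Streaming.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("intrusion-sketch").getOrCreate()

events = (
    spark.readStream.format("json")
    .schema("source_ip STRING, action STRING, failures INT, ts TIMESTAMP")
    .load("/mnt/logs/network-events/")          # hypothetical landing folder
)

# Flag bursts of failed logins per source IP within a 5-minute window.
alerts = (
    events.where(F.col("action") == "login_failed")
    .groupBy(F.window("ts", "5 minutes"), "source_ip")
    .agg(F.sum("failures").alias("failed_attempts"))
    .where(F.col("failed_attempts") > 20)
)

# Keep results in an in-memory table a dashboard or notebook can query.
(alerts.writeStream.outputMode("complete")
    .format("memory")
    .queryName("intrusion_alerts")
    .start())
```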

Today I wanted to highlight some of the ways that you can utilize Databricks to help your organization. If you have questions or would like to break down some of these barriers to adopting machine learning and data science for your business, we can help.

We are using all the Azure technologies and talking about them with our customers all the time, as well as deploying real-world workload scenarios.

A Look at Some of Azure SQL Database’s Intelligence Features

Today I’d like to tell you about some very cool intelligence features within the Azure SQL Database. Azure SQL Database technologies deliver intelligent capabilities through a range of built-in machine learning and adaptive technologies that monitor and manage performance and security for you.

Using telemetry from millions of databases running in Azure over the years, Microsoft has built a truly intelligent and autonomous database service that learns and adapts to your workload. This intelligent performance gives you deeper insight into database performance. Plus, it eliminates the hassle of making ongoing improvements, allowing you to focus more on driving your business and less on “chores”.

Features like Query Performance Insight and automatic tuning continuously monitor database usage, detect disruptive events and then take steps to improve performance.

Three examples of intelligent performance features that collectively optimize your memory usage and improve overall query performance are:

  • Row mode memory grant feedback – this expands on batch-mode memory grant feedback by adjusting memory grant sizes for both batch-mode and row-mode operators.
  • Approximate query processing – this is designed for aggregations across large datasets where responsiveness is more critical than absolute precision; it returns an approximate value with the focus on performance (see the sketch after this list).
  • Table variable deferred compilation – this improves plan quality and overall performance for queries referencing table variables, by propagating cardinality estimates that are based on actual table variable row counts. In turn, this optimizes your downstream plan operations.
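To make approximate query processing concrete, here is a small, hypothetical example run from Python against an Azure SQL Database via pyodbc. The server, credentials and the dbo.Sales table are placeholders; APPROX_COUNT_DISTINCT is the T-SQL function that trades absolute precision for responsiveness.

```python
# Hypothetical illustration of approximate query processing from Python.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:<server-name>.database.windows.net,1433;"
    "Database=<database-name>;Uid=<user>;Pwd=<password>;Encrypt=yes;"
)

cursor = conn.cursor()
# Approximate distinct count over a large (placeholder) table.
cursor.execute(
    "SELECT APPROX_COUNT_DISTINCT(CustomerId) AS ApproxCustomers FROM dbo.Sales"
)
print(cursor.fetchone()[0])
```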

Alongside all those features, Azure SQL Database’s intelligent protection allows you to efficiently and productively meet your data’s security and compliance requirements by proactively monitoring for potential threats and vulnerabilities. You can flag things such as PII exposure or a cross-site scripting attack, and there are detection mechanisms that can help you avoid these.

Through features like information protection, vulnerability assessment and threat detection, you can proactively discover and protect sensitive data, as well as uncover potential vulnerabilities and detect anomalous activities that could indicate a threat to your data.

In short, Microsoft has built these intelligent features over years of machine learning and is applying them to all their Platform as a Service offerings, as well as some of their on-premises ones. These are really cool features, and we’ve gotten great feedback about them and how well they work.

I recommend you give these features a try, but remember, always try them out in your test or dev environments prior to bringing them into production.

Azure Data Factory – Data Flow

I’m excited to announce that Azure Data Factory Data Flow is now in public preview and I’ll give you a look at it here. Data Flow is a new feature of Azure Data Factory (ADF) that allows you to develop graphical data transformation logic that can be executed as activities within ADF pipelines.

The intent of ADF Data Flows is to provide a fully visual experience with no coding required. Your Data Flows will execute on your own Azure Databricks cluster for scaled-out data processing using Spark. ADF handles all the code translation, Spark optimization and execution of the transformations in Data Flows; it can handle massive amounts of data in very rapid succession.

In the current public preview, the Data Flow transformations available are as follows (a rough PySpark sketch of a few of them appears after the list):

  • Joins – where you can join data from 2 streams based on a condition
  • Conditional Splits – allow you to route data to different streams based on conditions
  • Union – collecting data from multiple data streams
  • Lookups – looking up data from another stream
  • Derived Columns – create new columns based on existing ones
  • Aggregates – calculating aggregations on the stream
  • Surrogate Keys – this will add a surrogate key column to output streams from a specific value
  • Exists – check to see if data exists in another stream
  • Select – choose columns to flow into the next stream that you’re running
  • Filter – you can filter streams based on a condition
  • Sort – order data in the stream based on columns
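Since Data Flows are translated to Spark code and run on Databricks for you, here is a rough PySpark equivalent of a few of the transformations above (Join, Derived Column, Filter, Aggregate), purely for intuition. The DataFrames and column names are illustrative assumptions; ADF generates and runs the actual Spark code itself.

```python
# Purely for intuition: a hand-written PySpark analogue of a few Data Flow
# transformations. ADF Data Flows produce equivalent Spark code visually.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataflow-intuition").getOrCreate()

orders = spark.createDataFrame(
    [(1, 10, 2, 25.0), (2, 11, 1, 99.0), (3, 10, 5, 25.0)],
    ["OrderId", "ProductId", "Quantity", "UnitPrice"],
)
products = spark.createDataFrame(
    [(10, "Bikes"), (11, "Helmets")], ["ProductId", "Category"]
)

result = (
    orders.join(products, "ProductId")                                # Join
    .withColumn("LineTotal", F.col("Quantity") * F.col("UnitPrice"))  # Derived Column
    .filter(F.col("LineTotal") > 50)                                  # Filter
    .groupBy("Category")                                              # Aggregate
    .agg(F.sum("LineTotal").alias("TotalSales"))
)
result.show()
```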

Getting Started:

To get started with Data Flow, you’ll need to sign up for the preview by emailing adfdataflowext@microsoft.com with the ID of the subscription you want to do your development in. You’ll receive a reply when it’s been added, and then you’ll be able to go in and add new Data Flow activities.

At this point, when you go in and create a Data Factory, you’ll now have 3 options: Version 1, Version 2 and Version 2 with Data Flow.

Next, go to aka.ms/adfdataflowdocs. This gives you all the documentation you need for building your first Data Flows, as well as for working with some samples that are already built. You can then create your own Data Flows and add a Data Flow activity to your pipeline to execute and test it in debug mode, or use Trigger Now in the pipeline to test your Data Flow from a pipeline activity.

Ultimately, you can operationalize your Data Flow by scheduling and monitoring your Data Factory pipeline that is executing the Data Flow activity.

With Data Flow, we have the data orchestration and transformation piece we’ve been missing. It gives us a complete picture for the ETL/ELT scenarios we want to do in the cloud or in hybrid environments, whether on-premises to cloud or cloud to cloud.

With Data Flow, Azure Data Factory has become the true cloud replacement for SSIS, and this should be generally available by year’s end. It is well designed and has some neat features, especially how you build your expressions, which works better than SSIS in my opinion.

When you get a chance, check out Azure Data Factory and its Data Flow features and let me know if you have any questions!

Intro to Azure Databricks Delta

If you know about or are already using Databricks, I’m excited to tell you about Databricks Delta. As most of you know, Apache Spark is the underlying technology for Databricks, so about 75-80% of all the code in Databricks is still Apache Spark. You get that super-fast, in-memory processing of both streaming and batch data; after all, some of the founders of Spark built Databricks.

The ability to offer Databricks Delta is one big difference between Spark and Databricks, aside from the workspaces and the collaboration options that come native to Databricks. Databricks Delta delivers a powerful transactional storage layer by harnessing the power of Spark and Databricks DBFS.

The core abstraction of Databricks Delta is an optimized Spark table that stores data as Parquet files in DBFS and maintains a transaction log that efficiently tracks changes to the table. So, you can read and write data stored in the Delta format using the same Spark SQL batch and streaming APIs that you use to work with Hive tables and DBFS directories.
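As a minimal sketch, here is what reading and writing the Delta format looks like on Databricks, where the Delta libraries are built in. The DBFS path and table contents are placeholders.

```python
# Minimal Delta read/write sketch (paths and data are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-sketch").getOrCreate()

events = spark.range(0, 1000).withColumnRenamed("id", "event_id")

# Batch write to a Delta table: Parquet files plus a transaction log.
events.write.format("delta").mode("overwrite").save("/delta/events")

# Batch read.
print(spark.read.format("delta").load("/delta/events").count())

# Streaming read from the same table, using the familiar streaming API.
stream = (
    spark.readStream.format("delta")
    .load("/delta/events")
    .writeStream.format("console")
    .start()
)
```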

With the addition of the transaction log, as well as other enhancements, Databricks Delta offers some significant benefits:

ACID Transactions – a big one for consistency. Multiple writers can simultaneously modify a dataset and see consistent views. Also, writers can modify a dataset without interfering with jobs reading the dataset.

Faster Read Access – automatic file management organizes data into large files that can be read efficiently. Plus, statistics enable speeding up reads by 10-100x, and data skipping avoids reading irrelevant information. These capabilities are not available in Apache Spark, only in Databricks.

Databricks Delta is another great feature of Azure Databricks that is not available in traditional Spark, further separating the capabilities of the two products and providing a great platform for your big data, data science and data engineering needs.

Azure Database for MariaDB in Preview

Microsoft has recently announced another Platform as a Service (PaaS) offering with the release of Azure Database for MariaDB in preview. I’d like to tell you more about that offering and some of its advantages.

First, a little history on MariaDB. MariaDB is a community-developed fork of MySQL. Essentially, when Oracle acquired MySQL as part of its purchase of Sun, some of the MySQL developers were concerned that the acquisition would lead to changes, or down a road where it would no longer be open source.

So, they went ahead and forked off to MariaDB with the intent of maintaining high compatibility with MySQL. Also, contributors are required to share their copyright with the MariaDB Foundation, which in a nutshell means the foundation intends for it to always remain open source.

Now take the open source technology of MariaDB, which has proven valuable for many companies, and combine it with the fact that it’s in the Azure platform as a managed offering. Consumers using this offering from Azure get to take advantage of some of the standard capabilities of the PaaS model, such as:

  • Built-in high availability with no extra cost
  • 3 tiers (Basic, General Purpose and Memory Optimized) that you can choose from depending on your workload, whether transactional or analytical
  • 99.99% availability SLA
  • Predictable performance, delivered by built-in monitoring and alerting, with the ability to quickly assess the effects of scaling vCores up or down based on current or projected performance needs – through automation or manually, in seconds
  • Secure protection of sensitive data at rest and in motion, using 256-bit encryption on secured disks in the Azure data centers and enforcing an SSL connection for data in transit (a connection sketch follows this list). Note: you can turn off the SSL requirement if your application doesn’t support it – I don’t recommend it, but it can be done
  • Automatic backups so you can have point-in-time restore for up to 35 days

These are all standard advantages you get when you turn on Azure Database Platform as a Service offerings, like SQL DB or Mongo DB. Azure Database for MariaDB is just another option Microsoft has added to its database portfolio, and a great one to check out if you use MariaDB. It’s in preview now, but I’m sure it will be made generally available pretty soon.

How to Gain Up to 9X Speed on Apache Spark Jobs

Are you looking to gain speed on your Apache Spark jobs? How does 9X performance speed sound? Today I’m excited to tell you about how engineers at Microsoft were able to gain that speed on HDInsight Apache Spark Clusters.

If you’re unfamiliar with HDInsight, it’s Microsoft’s premium managed offering for running open source workloads on Azure. You can run things like Spark, Hadoop, Hive and LLAP, among others. You create clusters, spin them up when you need them and spin them down when you’re not using them.

The big news here is the recently released preview of HDInsight IO Cache, which is a new transparent data caching feature that provides customers with up to 9X performance improvement for Spark jobs, without an increase in costs.

Many open source caching products exist in the ecosystem: Alluxio, Ignite and RubiX, to name a few big ones. IO Cache is based on RubiX, and what differentiates RubiX from other comparable caching products is its approach of using SSDs, which eliminates the need for explicit memory management, while other comparable caching products reserve operating memory for caching the data.

Because SSDs typically provide more than 1 gigabit/second of bandwidth, and the cache also leverages the operating system’s in-memory file cache, this provides enough bandwidth to feed big data compute engines like Spark. This allows Spark to run optimally, handle bigger memory workloads and deliver better overall performance by speeding up jobs that read data from remote cloud storage, the dominant architecture pattern in the cloud.

In benchmark tests comparing a Spark cluster with and without the IO Cache running, they performed 99 SQL queries against a 1 terabyte dataset and got as much as 9X performance improvement with IO Cache turned on.

Let’s face it, data is growing everywhere and the requirement for processing that data increases every day, and we want results that are faster and closer to real time. To do this, we need to think more creatively about how to improve performance in other ways, beyond the age-old recipe of throwing hardware at the problem instead of tuning it or trying a new approach.

This is a great approach to leverage some existing hardware and help it run more efficiently. So, if you’re running HDInsight, try this out in a test environment. It’s as simple as a check box (that’s off by default); go in, spin up your cluster and hit the checkbox to include IO Cache and see what performance gains you can achieve with your HDInsight Spark clusters.