Azure Data Factory Pipelines and Activities

Yesterday’s Azure Every Day post covered how Azure Data Factory pricing works. In today’s post I’d like to go a bit deeper into Azure Data Factory Version 2 and review pipelines and activities. In essence, a pipeline is a logical grouping of activities. If you’re familiar with SSIS, think of an SSIS package being a grouping of activities that are happening with the data.

An example of a pipeline would look like: you want to pull data from a website, file server or database up into Azure and do some kind of transformation on that data, then report from it. Within the pipeline, multiple activities can be defined. If there’s no activity dependency on a set of activities – so you have one activity running and there’s no dependency on the next activity -then they can run in parallel.

This is good to keep in mind as you’re performing these activities because you may need to schedule them or figure out a way, so they don’t run in parallel or that one runs after another.

There are 3 main types of activities:

1. Data Movement Activities – This is the sources where you’re pulling in data from such as Azure Blob Storage, Azure Data Lake, Azure DB and DW. You can also set up an on premises gateway and pull in databases, such as commonly used DB2, MySQL, Oracle, SAP, Sybase and Teradata, as well as NoSQL databases like Cassandra and MongoDB.

I also mentioned files; you can pull from Amazon, S3, file systems, FTP, HTTP, etc. You also have the Software as a Service (SaaS) options: Dynamics, HubSpot, Marketo, QuickBooks, and Salesforce, to name a few. You can check a complete list on the Azure online documentation.

2. Data Transformation Activities – Here is where you’re taking your data after it’s ingested into Azure and doing something with it. Some common ones are HDInsight, HIVE, PIG, MapReduce, Hadoop Streaming and Spark transformations. These allow you to transform your big data in your Azure environment and stage it for your reporting.

Other common uses would be machine learning into an Azure VM, as well as stored procedures. You can have your stored procedures in SQL Server defined in Azure, and then run that stored procedure, and also use U-SQL for your Data Lake Analytics.

3. Control Activities – In these activities you can do things like execute your pipelines or run a ForEach statement or Look-up activities, the types of things where you’re controlling how the pipeline is working and interacting with the data.


How Azure Data Factory Pricing Works

In today’s post I’d like to discuss how Azure Data Factory pricing works with the Version 2 model which was just released. The pricing is broken down into four ways that you’re paying for this service. I hope that by pointing these out, you can gain an understanding of not only how it works, but how you can keep an eye on your spending.

1. Azure activity runs vs self-hosted activity runs – there are different pricing models for these. For the Azure activity runs it’s about copying activity, so you’re moving data from an Azure Blob to an Azure SQL database or Hive activity running high script on an Azure HDInsight cluster.

With self-hosted, you want to copy activity moving from an on premises SQL Server to an Azure Blob Storage, a stored procedure to an Azure Blob Storage or a stored procedure activity running a stored procedure on an on premises SQL Server.

2. Volume of data moved – this is measured in DMUs (data movement units). This is one you should be aware of as this will default to auto, which is basically using all the DMUs it can use and this is paid for by the hour. Let’s say you specify and use 2 DMUs and it takes an hour to move that data. The other option is you could use 8 DMUs and it takes 15 minutes, this price is going to end up the same. You’re using 4X the DMUs but it’s happening in a quarter of the time.

This is good to look at and do some comparisons since how many DMUs you’re using is where the bulk of your spend if going to be.

3. SSIS integration run times – here you’re using A-series and D-series compute levels. When you go through these, it depends on what the compute needs are to invoke the process (how much CPU, how much RAM, how much attempt storage you need).

4. The inactive pipeline – you’re paying a small account for pipelines (about 40 cents currently). A pipeline is considered inactive if it’s not associated with a trigger and hasn’t been run for over a week. Yes, it’s a minimal charge, but they do add up and when you start to wonder where some of those charges come from it’s good to keep this in mind.

Also, each of the components inside the Azure Data Factory, whether it’s blob storage, SQL Server, HDInsight or any kind of storage or compute resources you’re using as part of your pipeline, will also incur charges. These are billed separately based specifically around what those resources are.

Something to keep in mind as you start of build workloads, like if you spin up an HDInsight cluster or a SQL data warehouse as part of a pipeline, make sure you shut down, pause it or destroy that cluster afterwards. So, there are opportunities to get your data moved but also keep the cost down but not keeping it running all the time.

A Guide to GDPR Compliance with Microsoft Data Platform

As most people know, the GDPR is approaching quickly. May 25th to be exact. Most companies will need to review or modify their database management and data handling procedures, especially focusing on the security of data processing. In a recent webinar hosted by 3 experts in the Azure, SQL Data Platform and software arenas: Abraham Samuel, Technical Support Personnel, Microsoft; Brian Knight, Founder and CEO, Pragmatic Works; and Myself, Sr. Principal Architect, Pragmatic Works, offered an informational session on steps you need to take now to help in your journey with compliance.

This 2-hour webinar covered the key changes needed to be addressed for GDPR: Controls, Modifications, Transparent Policies and IT and Training. It also discusses how modernizing your data platform, on-premises and in Azure, will immediately reduce areas out of compliance, as well as what Azure tools and services are offered to help ensure you remain in compliance.

It also taps into experience from the Pragmatic Works team on some of the danger areas customers face and how the suite of software tools can help you expose areas of concern in your environment. Still using SQL Server 2008 or 2008 R2? Here you’ll learn what it means for 2008/2008 R2 end of support and paths to upgrade your SQL Server.

Take some time and watch this information packed webinar that will help eliminate confusion around GDPR and discuss the steps you need to take to be in compliance, as well as how to make your plans actionable. GDPR goes into effect this month. This webinar will educate you and give you options to move along your journey into GDPR and a Microsoft modern data platform.