ADFv2, Azure, Azure Data Factory, Training

Azure Data Factory Pipelines and Activities

May 30, 2018 cseferlis Leave a comment

Yesterday’s Azure Every Day post covered how Azure Data Factory pricing works. In today’s post I’d like to go a bit deeper into Azure Data Factory Version 2 and review pipelines and activities. In essence, a pipeline is a logical grouping of activities. If you’re familiar with SSIS, think of an SSIS package being a grouping of activities that are happening with the data.

An example of a pipeline would look like: you want to pull data from a website, file server or database up into Azure and do some kind of transformation on that data, then report from it. Within the pipeline, multiple activities can be defined. If there’s no activity dependency on a set of activities – so you have one activity running and there’s no dependency on the next activity -then they can run in parallel.

This is good to keep in mind as you’re performing these activities because you may need to schedule them or figure out a way, so they don’t run in parallel or that one runs after another.

There are 3 main types of activities:

1. Data Movement Activities – This is the sources where you’re pulling in data from such as Azure Blob Storage, Azure Data Lake, Azure DB and DW. You can also set up an on premises gateway and pull in databases, such as commonly used DB2, MySQL, Oracle, SAP, Sybase and Teradata, as well as NoSQL databases like Cassandra and MongoDB.

I also mentioned files; you can pull from Amazon, S3, file systems, FTP, HTTP, etc. You also have the Software as a Service (SaaS) options: Dynamics, HubSpot, Marketo, QuickBooks, and Salesforce, to name a few. You can check a complete list on the Azure online documentation.

2. Data Transformation Activities – Here is where you’re taking your data after it’s ingested into Azure and doing something with it. Some common ones are HDInsight, HIVE, PIG, MapReduce, Hadoop Streaming and Spark transformations. These allow you to transform your big data in your Azure environment and stage it for your reporting.

Other common uses would be machine learning into an Azure VM, as well as stored procedures. You can have your stored procedures in SQL Server defined in Azure, and then run that stored procedure, and also use U-SQL for your Data Lake Analytics.

3. Control Activities – In these activities you can do things like execute your pipelines or run a ForEach statement or Look-up activities, the types of things where you’re controlling how the pipeline is working and interacting with the data.

ADFv2, Azure Data Factory, Strategy, Training

How Azure Data Factory Pricing Works

May 28, 2018 cseferlis 1 Comment

In today’s post I’d like to discuss how Azure Data Factory pricing works with the Version 2 model which was just released. The pricing is broken down into four ways that you’re paying for this service. I hope that by pointing these out, you can gain an understanding of not only how it works, but how you can keep an eye on your spending.

1. Azure activity runs vs self-hosted activity runs – there are different pricing models for these. For the Azure activity runs it’s about copying activity, so you’re moving data from an Azure Blob to an Azure SQL database or Hive activity running high script on an Azure HDInsight cluster.

With self-hosted, you want to copy activity moving from an on premises SQL Server to an Azure Blob Storage, a stored procedure to an Azure Blob Storage or a stored procedure activity running a stored procedure on an on premises SQL Server.

2. Volume of data moved – this is measured in DMUs (data movement units). This is one you should be aware of as this will default to auto, which is basically using all the DMUs it can use and this is paid for by the hour. Let’s say you specify and use 2 DMUs and it takes an hour to move that data. The other option is you could use 8 DMUs and it takes 15 minutes, this price is going to end up the same. You’re using 4X the DMUs but it’s happening in a quarter of the time.

This is good to look at and do some comparisons since how many DMUs you’re using is where the bulk of your spend if going to be.

3. SSIS integration run times – here you’re using A-series and D-series compute levels. When you go through these, it depends on what the compute needs are to invoke the process (how much CPU, how much RAM, how much attempt storage you need).

4. The inactive pipeline – you’re paying a small account for pipelines (about 40 cents currently). A pipeline is considered inactive if it’s not associated with a trigger and hasn’t been run for over a week. Yes, it’s a minimal charge, but they do add up and when you start to wonder where some of those charges come from it’s good to keep this in mind.

Also, each of the components inside the Azure Data Factory, whether it’s blob storage, SQL Server, HDInsight or any kind of storage or compute resources you’re using as part of your pipeline, will also incur charges. These are billed separately based specifically around what those resources are.

Something to keep in mind as you start of build workloads, like if you spin up an HDInsight cluster or a SQL data warehouse as part of a pipeline, make sure you shut down, pause it or destroy that cluster afterwards. So, there are opportunities to get your data moved but also keep the cost down but not keeping it running all the time.

BizDataViz

Monthly Archives: May 2018

Azure Data Factory Pipelines and Activities

How Azure Data Factory Pricing Works

Azure and SQL Data Blog

Share this:

Share this:

Share this:

Azure and SQL Data Blog