Category Archives: Azure Data Factory

How and Why to Add a Source Code Repository to Azure Data Factory

For developers, a source code repository is hugely beneficial. It keeps a history of all your changes, helps you manage tasks and branches, lets you share code with a team and, simply put, keeps your work in a safe place.

In this post, I’ll tell you why you should connect your Azure Data Factories to a source code repository, and I’ll demo how to do so:

  • To do this, I’ll start in my data factory inside my Azure portal.
  • When I go into Author & Monitor, I have the ability to either:
    • Set up a code repository within the landing page or main page here or,
    • I can go directly into my data factory, use the drop-down in the left-hand corner and select 'Set up code repository'.
  • One thing to note is that code repositories have been supported for a while, but with the recent release of Data Flows, GitHub is now supported as well (previously it was only Azure DevOps).
  • So, I select my GitHub account and fill in the information. *If you're doing this for the first time, it will prompt you to log into your GitHub account. In this case, I've already connected it previously, so it knows about my repositories.
  • Next, I select my repository name and choose my existing playground branch.
  • One field asks 'Branch to import resources into:', so if I'm importing resources, I can select an existing branch or create a new one. For this demo I'm going to pick my playground branch. (The sketch after this list shows the same Git configuration applied programmatically.)
  • Before I hit Save, notice on the left-hand side I've got zero pipelines, one data set and zero data flows. But when I connect to my playground branch, it's going to bring in everything I've previously saved or checked into that branch. So now you'll see I have 6 pipelines, 23 data sets and 4 data flows.
  • One of the other nice pieces of having source control shows up when I want to add a new data set. I just select the SQL Server I was previously connected to, leave the settings at their defaults for now and connect.
  • I then select one of the tables; in this case, the Product Category table.
  • You'll see at the top you have the option to Save All or Publish. Save All saves any changes across all open tabs, and you can tell when something has changed, whether it's a data set, pipeline or data flow, by the star next to its name.
  • Now, instead of needing to publish every time during development, you can just save your work. Rather than having to publish the entire pipeline, run all the error checking and make sure the pipeline is in good standing, you can save part way through. This is a huge advantage over having to publish the whole pipeline, which can cause challenges and isn't efficient for development.
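If you'd rather script the repository hookup than click through the portal, here's a minimal sketch of the same GitHub configuration using the azure-mgmt-datafactory Python SDK. The account, repository, branch and region names are placeholders, and the exact model and method names can vary a little between SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import FactoryRepoUpdate, FactoryGitHubConfiguration

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Point the factory at a GitHub repo and a collaboration ("playground") branch.
repo_update = FactoryRepoUpdate(
    factory_resource_id=(
        "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
        "/providers/Microsoft.DataFactory/factories/<factory-name>"
    ),
    repo_configuration=FactoryGitHubConfiguration(
        account_name="<github-account>",
        repository_name="<repository-name>",
        collaboration_branch="playground",   # branch to import resources into
        root_folder="/",                     # folder in the repo that holds the ADF JSON
    ),
)

# configure_factory_repo is called against the factory's Azure region.
adf_client.factories.configure_factory_repo("<region-eg-eastus>", repo_update)
```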

The addition of GitHub support gives us another great option if you weren't previously using Azure DevOps (formerly known as Visual Studio Team Services). I'll be doing more upcoming Azure Every Day blogs around Data Factory that cover some of its nuances, as the product has evolved greatly since the release of Data Flows.

Azure Data Factory – Data Flow

I’m excited to announce that Azure Data Factory Data Flow is now in public preview and I’ll give you a look at it here. Data Flow is a new feature of Azure Data Factory (ADF) that allows you to develop graphical data transformation logic that can be executed as activities within ADF pipelines.

The intent of ADF Data Flows is to provide a fully visual experience with no coding required. Your Data Flow executes on your own Azure Databricks cluster for scaled-out data processing using Spark. ADF handles all the code translation, Spark optimization and execution of the transformations in your Data Flows, so it can handle massive amounts of data very rapidly.

In the current public preview, the Data Flow activities available are:

  • Joins – where you can join data from 2 streams based on a condition
  • Conditional Splits – allow you to route data to different streams based on conditions
  • Union – collecting data from multiple data streams
  • Lookups – looking up data from another stream
  • Derived Columns – create new columns based on existing ones
  • Aggregates – calculating aggregations on the stream
  • Surrogate Keys – add a surrogate key column to the output stream, starting from a value you specify
  • Exists – check to see if data exists in another stream
  • Select – choose which columns flow into the next stream
  • Filter – you can filter streams based on a condition
  • Sort – order data in the stream based on columns

Getting Started:

To get started with Data Flow, you'll need to sign up for the preview by emailing adfdataflowext@microsoft.com with the ID of the subscription you want to do your development in. You'll receive a reply when it's been added, and then you'll be able to go in and add new Data Flow activities.

At this point, when you go in and create a Data Factory, you’ll now have 3 options: Version 1, Version 2 and Version 2 with Data Flow.

Next, go to aka.ms/adfdataflowdocs. This gives you all the documentation you need for building your first Data Flows, as well as some already-built samples you can work and play around with. You can then create your own Data Flows, add a Data Flow activity to your pipeline, and execute and test your Data Flow in debug mode from the pipeline. Or you can use Trigger Now in the pipeline to test your Data Flow from a pipeline activity.
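If you want to wire a Data Flow into a pipeline from code rather than the designer, here's a rough sketch using the azure-mgmt-datafactory Python SDK. It assumes a Data Flow named 'MyDataFlow' already exists in the factory and that your SDK version includes the ExecuteDataFlowActivity and DataFlowReference models; those names may differ depending on the version you have installed.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource,
    ExecuteDataFlowActivity,
    DataFlowReference,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Wrap an existing Data Flow in an Execute Data Flow activity.
run_data_flow = ExecuteDataFlowActivity(
    name="RunMyDataFlow",
    data_flow=DataFlowReference(reference_name="MyDataFlow"),
)

# Put that activity into a pipeline so it can be debugged or triggered like anything else.
pipeline = PipelineResource(activities=[run_data_flow])
adf_client.pipelines.create_or_update(
    "<resource-group>", "<factory-name>", "DataFlowTestPipeline", pipeline
)
```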

Ultimately, you can operationalize your Data Flow by scheduling and monitoring your Data Factory pipeline that is executing the Data Flow activity.

With Data Flow we have the data orchestration and transformation piece we've been missing. It gives us a complete picture for the ETL/ELT scenarios we want to run in the cloud or in hybrid environments, whether on-prem to cloud or cloud to cloud.

With Data Flow, Azure Data Factory has become the true cloud replacement for SSIS and this should be in GA by year’s end. It is well designed and has some neat features, especially how you build your expressions which works better than SSIS in my opinion.

When you get a chance, check out Azure Data Factory and its Data Flow features and let me know if you have any questions!

Sharing Integration Runtimes Among Azure Data Factories

In this post I’ll talk about self-hosted integration runtimes and the ability to share them across Data Factories. Also, I’ll tell you about a new capability that was announced in the Azure Data Factory space.

The integration runtime is essentially the connector that allows you to connect back to your on-premises environment and safely and securely move data between Azure and that on-prem environment with Data Factory. It's a dedicated application for Azure Data Factory that's similar to the on-premises data gateway.

Here's where this new feature helps. Until now, Data Factories could not share integration runtimes, so if you needed different Data Factories to connect back to on-prem data (databases, flat files, etc.), you had to set up individual integration runtimes for each of those Data Factories, or for pipelines spread across multiple Data Factories.

With this newly announced feature comes some new terminology:

  • Shared integration runtime – basically the standard self-hosted integration runtime you're used to, except now it can be shared with other Data Factories
  • Linked integration runtime – an integration runtime with a 'linked' sub-type that lives in another Data Factory and references a shared integration runtime

So, you'll have your main shared integration runtime, and on top of that you'll have linked integration runtimes that reference the infrastructure of that self-hosted IR. Each link points back to the shared IR, which lets you use it from multiple Data Factories.

The process is straightforward: you install the integration runtime in your environment, set up a linked service within your Azure Data Factory and then connect through that linked service. Then you're ready to pull the data you need into the cloud, do transformations and push it out to Azure SQL Data Warehouse, Azure Databricks, etc.
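Here's a minimal sketch of what the sharing side could look like with the azure-mgmt-datafactory Python SDK: the "consumer" factory creates a self-hosted IR whose linked_info points at the shared IR's resource ID in the "owner" factory. You still have to grant the consumer factory permission on the shared IR first, and the model names here may differ by SDK version; all names are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    SelfHostedIntegrationRuntime,
    LinkedIntegrationRuntimeRbacAuthorization,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Resource ID of the shared self-hosted IR that lives in the "owner" Data Factory.
shared_ir_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<owner-rg>"
    "/providers/Microsoft.DataFactory/factories/<owner-factory>"
    "/integrationRuntimes/<shared-ir-name>"
)

# In the consuming factory, create a linked IR that references the shared one.
linked_ir = IntegrationRuntimeResource(
    properties=SelfHostedIntegrationRuntime(
        description="Linked IR pointing at the shared self-hosted IR",
        linked_info=LinkedIntegrationRuntimeRbacAuthorization(resource_id=shared_ir_id),
    )
)

adf_client.integration_runtimes.create_or_update(
    "<consumer-rg>", "<consumer-factory>", "LinkedSelfHostedIR", linked_ir
)
```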

This cool new technology lets you get your data to the cloud much more easily and efficiently, and I highly recommend you give it a try!

Azure Data Factory V2 in GA and New Features

Today I’m excited to talk about the general availability of Azure Data Factory V2, as well as some new features that have been added over the last couple months. If you don’t know, Azure Data Factory Version 2 added some new features that V1 didn’t have.

With ADF V2 you get a browser-based interface using drag and drop technology; V1 was primarily done in the Visual Studio IDE. It also added triggers for scheduling, so you can schedule your jobs when required and in additional ways (which I’ll discuss further in a bit).

Some other features of ADF V2 that came out as it became generally available:

  • Lift and Shift operations for your SSIS packages, so if you have SSIS packages local, you can now Lift and Shift those into compute with the integration runtime service in Data Factory.
  • This also allows for cloud to cloud, cloud to prem, prem to prem and some third-party tools are supported within that as well.
  • Control flow activities like branching, looping, conditional execution and parameterization.
  • Integration with HDInsight Spark and Databricks for big data workloads and data science.

Some features that have come out more recently:

  • Integration with Key Vault, which gives you the ability to keep encryption keys and small secrets like passwords in a vault. You can create a linked service to a Key Vault and reference the passwords you need, rather than storing them in source files, text files or a PowerShell script where they sit in the open. So, you can use Key Vault references to run workloads without exposing those passwords. (See the sketch after this list.)
  • The ability to monitor Data Factory using OMS, Microsoft's cloud-based management solution that helps you manage and protect your on-prem and cloud infrastructure. This is quick and easy to set up, lets you reach into different types of applications in Azure and gives you additional visibility and control for things like log analytics, automation, data protection and recovery, as well as security and compliance.
  • You can monitor the overall health of your Data Factories and drill in to see the details and troubleshoot if you're having problems. This is all enabled through Azure Log Analytics: you turn on analytics for your Data Factory, hook it into your OMS suite and monitor everything from that central management point.
  • Event-based triggering integrated with Data Factory. Now you have an event-driven architecture for a common data integration pattern. Instead of scheduling a timed trigger, you can watch for a blob being created or deleted; when a file lands in (or is removed from) that location, your pipeline can be triggered by that event.
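As a rough illustration of the Key Vault integration mentioned above, here's a sketch with the azure-mgmt-datafactory Python SDK: one linked service points at the vault, and a SQL linked service pulls its connection string from a named secret instead of embedding it. Vault, secret and factory names are placeholders, and the model names may vary a little by SDK version.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource,
    LinkedServiceReference,
    AzureKeyVaultLinkedService,
    AzureKeyVaultSecretReference,
    AzureSqlDatabaseLinkedService,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, factory = "<resource-group>", "<factory-name>"

# 1. Linked service that points at the Key Vault itself.
kv_ls = LinkedServiceResource(
    properties=AzureKeyVaultLinkedService(base_url="https://<vault-name>.vault.azure.net/")
)
adf_client.linked_services.create_or_update(rg, factory, "MyKeyVault", kv_ls)

# 2. A SQL linked service whose connection string is a secret reference, not plain text.
sql_ls = LinkedServiceResource(
    properties=AzureSqlDatabaseLinkedService(
        connection_string=AzureKeyVaultSecretReference(
            store=LinkedServiceReference(reference_name="MyKeyVault"),
            secret_name="SqlConnectionString",
        )
    )
)
adf_client.linked_services.create_or_update(rg, factory, "AzureSqlDatabaseLS", sql_ls)
```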

Azure Data Factory V2 is a neat technology and I’m interested to see where it goes as I’m sure that more features will be coming. If you have questions about Azure Data Factory or any of the new Azure resources, we are the people to talk with. We’re doing a lot of work with our clients using Azure tools and we’d love to talk to you about how we can get you using Azure in your organization.

Azure Data Factory vs Logic Apps

Customers often ask, should I use Logic Apps or Data Factory for this? Of course, the answer I give is the same as with most technology, it depends. What is the business use case we’re talking about?

Logic Apps can help you simplify how you build automated, scalable workflows that integrate apps and data across cloud and on premises services. Azure Data Factory is a cloud-based data integration service that allows you to create data driven workflows in the cloud for orchestrating and automating data movement and data transformation. Similar definitions, so that probably didn’t help at all, right?

Let me try to clear up some confusion. There are situations where the best-case scenario is to use both, such as when a feature is lacking in Data Factory but available in Logic Apps, which has been around longer. A great example is alerting and notifications: you can call out from Data Factory via its web activity and have a Logic App send a notification email back to a user to say a job has completed or failed.
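That alerting pattern can be as simple as a Web activity at the end of the pipeline posting to the HTTP endpoint of a Logic App that sends the email. Here's a hedged sketch with the azure-mgmt-datafactory Python SDK; the Logic App trigger URL and all the names are placeholders, not real endpoints.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import PipelineResource, WebActivity

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Web activity that POSTs a small JSON payload to a Logic App's HTTP trigger URL;
# the Logic App then emails the user that the job has completed (or failed).
notify = WebActivity(
    name="NotifyViaLogicApp",
    method="POST",
    url="https://<logic-app-http-trigger-url>",
    headers={"Content-Type": "application/json"},
    body={"pipeline": "NightlyLoad", "status": "Completed"},
)

pipeline = PipelineResource(activities=[notify])
adf_client.pipelines.create_or_update(
    "<resource-group>", "<factory-name>", "NotifyPipeline", pipeline
)
```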

To answer the question of why I would use one over the other, I’d say it comes down to how much data we’re moving and how much transformation we need to do on that data to make it ready for consumption. Are we reporting on it, putting it in Azure Data Warehouse, building some facts and dimensions and creating our enterprise data warehouse then reporting off of that with Power BI? This would all require a decent amount of heavy lifting. I would not suggest a Logic App for that.

If you're monitoring a folder on-prem or in OneDrive, watching for files to be posted there, and you simply want to move the file to another location or send a notification about an action on the file, that's a great use case for a Logic App.

However, the real sweet spot is when you can use them together, as it helps you maximize cost efficiency. Depending on what the operation is, it can be more or less expensive depending upon whether you’re using Data Factory or Logic Apps.

You can also make your operations more efficient. Utilize the power of Azure Data Factory with its SSIS integration runtimes and feature sets that include Databricks and HDInsight clusters, where you can process huge amounts of data with massively parallel processing, or use your Hadoop file stores for reporting off structured, unstructured or semi-structured data. Logic Apps can then help you enhance the process around that.

Clear as mud, right? Hopefully I was able to break it down a bit better. To put it in simple terms: when you think about Logic Apps, think about business applications; when you think about Azure Data Factory, think about moving and transforming data, especially large data sets, and building data warehouses.

 

Continuous Integration and Deployment Using Azure Data Factory

Today I’m excited to talk about one of the new releases in Azure that gives you continuous integration and deployment using Azure Data Factory. This new release is an Azure Data Factory visual interface that allows you to export any of your Data Factory components as an Azure Resource Manager (ARM) template.

When you do these exports from your Data Factory, it generates two files: a template file, which contains all the Data Factory metadata for the pipelines, data sets and so on, and a configuration file, which contains the environment parameters that will be different in each of your environments. So, if you're going to create development, test and production environments, each one will have its own configuration.

You can also specify things like storage containers, Databricks clusters, etc. After you've deployed this, you're going to create a new factory for your environment. You're also going to associate your Visual Studio Team Services (VSTS) Git repository with that Data Factory, enabling source control, versioning and collaboration.

Next, you’ll set up your Data Factory with VSTS. This is where all the developers can author data factory resources, such as pipelines, data sets and other components. Once you have this development area set up, developers can modify the resources and debug them right in the interface, along with checking performance. They’ll also have the option to create a PR from their branch to master or create a collaborative branch to get the changes reviewed by peers.

Once they're satisfied with the changes and ready to go to production, they merge them into the master branch and can then publish to the development Data Factory. From there, they can promote to each environment by exporting the ARM templates from the master branch (or any other branch) when they're ready.

So, you export the template and it gets deployed with different environment parameters to the test and production environments. From there, you can also set up VSTS release definitions to automate the deployment of your Data Factory to multiple environments.
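As a rough sketch of the automated side, the exported ARM template plus a per-environment parameters file can be deployed with the azure-mgmt-resource Python SDK (a release pipeline would typically do the same thing with its own tasks). The file names and resource group are placeholders, assumed to match whatever you exported.

```python
import json

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.resource.resources.models import Deployment, DeploymentProperties

rm_client = ResourceManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Load the exported factory template and the parameters for the target environment.
with open("arm_template.json") as f:
    template = json.load(f)
with open("arm_template_parameters_test.json") as f:
    parameters = json.load(f)["parameters"]

deployment = Deployment(
    properties=DeploymentProperties(
        mode="Incremental",   # don't delete resources missing from the template
        template=template,
        parameters=parameters,
    )
)

# Deploy the Data Factory resources into the test environment's resource group.
poller = rm_client.deployments.begin_create_or_update(
    "<test-resource-group>", "adf-test-deployment", deployment
)
poller.wait()
```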

The benefit is that this opens the opportunity to bring the true dev, test and production environments you're used to in your local environment with SSIS or other ETL tools up to Azure. This tool offers a tremendous amount of power and it's getting better all the time.

Azure Data Factory Integration Runtimes

This week I've been talking about Azure Data Factory. In today's post I'd like to talk about the much-awaited Azure Data Factory integration runtime. The integration runtime is the compute infrastructure that provides data movement, connectors, data conversion and activity dispatching, so you can dispatch and monitor activities running on HDInsight, SQL DB or DW and more.

A big part of V2 is that you can now lift and shift your SSIS packages up into Azure and run them from your Azure data portal inside of Data Factory. There are 3 integration runtime types:

1. Azure Integration Runtime – This is set up entirely in Azure, and you would use it if you're copying between two cloud data sources.

2. Self-Hosted Integration Runtime – This can run copy and transformation activities between cloud and on-premises sources, including moving data from an IaaS virtual machine. Use this if you're copying between a cloud source and an on-prem, private network source (a sketch of registering one programmatically follows this list).

So, use it if your environment is behind a firewall rather than in the public cloud, you want to move data from your environment to Azure, and a gateway won't work for you. Likewise, because an IaaS virtual machine is isolated and you can't get at its data storage directly, you would set up this integration runtime to bridge the two sites.

3. SSIS Integration Runtime – Use this when you're lifting and shifting your SSIS packages into Azure Data Factory. A key thing to mention is that it does not yet support third-party tools for SSIS, but that support will be added eventually.
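Here's a minimal sketch of registering a self-hosted integration runtime with the azure-mgmt-datafactory Python SDK: you create the IR entry in the factory, then use one of its authentication keys when you install the runtime software on the on-prem machine. All names are placeholders and method names may vary slightly by SDK version.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import IntegrationRuntimeResource, SelfHostedIntegrationRuntime

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, factory = "<resource-group>", "<factory-name>"

# Create the self-hosted IR definition in the Data Factory.
ir = IntegrationRuntimeResource(
    properties=SelfHostedIntegrationRuntime(description="Bridge to the on-prem network")
)
adf_client.integration_runtimes.create_or_update(rg, factory, "OnPremIR", ir)

# Retrieve the auth keys; paste one of them into the runtime installer on the on-prem machine.
keys = adf_client.integration_runtimes.list_auth_keys(rg, factory, "OnPremIR")
print(keys.auth_key1)
```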

Where are these located? Azure Data Factories are available in a limited set of regions at this time, but they can access data stores and compute services globally. With the Azure Integration Runtime, the location of the runtime defines the backend compute resources being used. This is optimized for data compliance, efficiency and reduced network egress costs, to ensure the best available services are used in the region where they're needed.

The self-hosted runtime is installed in the environment on that private network. The SSIS Integration Runtime's location is determined by where the SQL DB or managed instance hosting the SSIS DB catalog lives. Its available locations are currently limited, but it does not have to be in the same place as the Data Factory; it will sit as close to the data sources as possible and run as optimally as it can.

Azure Data Factory – Data Sets, Linked Services and Pipeline Executions

In my posts this week, I’ve been going a little deeper into Azure Data Factory (be sure to check out my others from Tuesday and Wednesday). I’m excited today to dig into getting data from one source into Azure.

First, let's talk about data sets. Consider a data set to be anything you're using as your input or output. An Azure blob data set is one example; it's defined by the blob container itself, the folder, the file and so on. In this case you can have that as your source or your destination.

There are many data set options, such as Azure Services (other data services or blobs, databases, data warehouses or Data Lake), as well as on premises databases like SQL Server, MySQL or Oracle. You also have your NoSQL components like Cassandra and MongoDB and you can get your files from an FTP server, Amazon S3 or internal file systems.

Lastly, you have your SaaS offerings like MS Dynamics, Salesforce or Marketo. There is a grid available in the Microsoft documentation which outlines what can be a source or a destination or both, in some cases.

Linked services are the way you link to your source or your destination. A linked service defines the connection to the data store, whether it's the input or the output. Think of it like a connection string in SQL Server: you're connecting to a specific place, whether you're using it to read data from a source or write it to a destination.
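To make that concrete, here's a hedged sketch of a linked service and a blob data set defined with the azure-mgmt-datafactory Python SDK. The connection string, container and file paths are placeholders, and the dataset model names vary a little between SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource,
    LinkedServiceReference,
    AzureStorageLinkedService,
    DatasetResource,
    AzureBlobDataset,
    SecureString,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, factory = "<resource-group>", "<factory-name>"

# Linked service: the "connection string" to the storage account (source or destination).
storage_ls = LinkedServiceResource(
    properties=AzureStorageLinkedService(
        connection_string=SecureString(
            value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        )
    )
)
adf_client.linked_services.create_or_update(rg, factory, "MyBlobStorage", storage_ls)

# Data set: a specific container/folder/file reachable through that linked service.
blob_ds = DatasetResource(
    properties=AzureBlobDataset(
        linked_service_name=LinkedServiceReference(reference_name="MyBlobStorage"),
        folder_path="input/sales",
        file_name="orders.csv",
    )
)
adf_client.datasets.create_or_update(rg, factory, "SalesOrdersBlob", blob_ds)
```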

Now, let's look at pipeline executions. A pipeline is a collection of activities that you've built, and an execution runs that pipeline to move data from one place to another or to do some transformation on it. There are two ways to execute a pipeline:

1. Manual (or On-Demand) Executions. These can be triggered through the Azure portal, a REST API, as part of a PowerShell script, or as part of your .NET application.

2. Setting Up Triggers. With this execution, you set up a trigger as part of your Data Factory; this was an exciting new addition in Azure Data Factory V2. Triggers can be scheduled, so you can set a job to run at a particular time each day, or you can set a tumbling window trigger. Using a tumbling window, you can set up your hours of operation (let's say you want your data to run 8-5, Monday through Friday, every hour on the hour), and the trigger runs continuously for the times and hours you've specified. (The sketch below shows both a manual run and a tumbling window trigger.)

Doing this lets us lower the cost of running the job in Azure by using compute time only when we need it, and not during downtimes like outside of business hours.
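As promised above, here's a hedged sketch of both execution styles with the azure-mgmt-datafactory Python SDK: a manual (on-demand) run of a pipeline, and a tumbling window trigger that fires every hour between a start and end time. The pipeline and trigger names are placeholders, and the trigger model details may differ by SDK version.

```python
from datetime import datetime

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    TriggerResource,
    TumblingWindowTrigger,
    TriggerPipelineReference,
    PipelineReference,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, factory = "<resource-group>", "<factory-name>"

# 1. Manual (on-demand) execution: kick off a single run and keep the run id for monitoring.
run = adf_client.pipelines.create_run(rg, factory, "CopySalesPipeline", parameters={})
print("Started run:", run.run_id)

# 2. Tumbling window trigger: fire the pipeline every hour between a start and end time.
trigger = TriggerResource(
    properties=TumblingWindowTrigger(
        pipeline=TriggerPipelineReference(
            pipeline_reference=PipelineReference(reference_name="CopySalesPipeline")
        ),
        frequency="Hour",
        interval=1,
        start_time=datetime(2019, 1, 7, 8, 0),
        end_time=datetime(2019, 1, 11, 17, 0),
        max_concurrency=1,
    )
)
adf_client.triggers.create_or_update(rg, factory, "HourlyWindowTrigger", trigger)
# Depending on SDK version this is triggers.start(...) or triggers.begin_start(...).
adf_client.triggers.begin_start(rg, factory, "HourlyWindowTrigger").wait()
```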

Azure Data Factory Pipelines and Activities

Yesterday’s Azure Every Day post covered how Azure Data Factory pricing works. In today’s post I’d like to go a bit deeper into Azure Data Factory Version 2 and review pipelines and activities. In essence, a pipeline is a logical grouping of activities. If you’re familiar with SSIS, think of an SSIS package being a grouping of activities that are happening with the data.

An example of a pipeline: you want to pull data from a website, file server or database up into Azure, do some kind of transformation on that data, and then report from it. Within the pipeline, multiple activities can be defined. If there's no dependency between activities (one activity runs and the next doesn't depend on it), they can run in parallel.

This is good to keep in mind as you're building these activities, because you may need to schedule them, figure out a way to keep them from running in parallel, or make one run after another (the sketch at the end of this post shows how such a dependency is defined).

There are 3 main types of activities:

1. Data Movement Activities – These cover the sources you're pulling data from, such as Azure Blob Storage, Azure Data Lake, Azure SQL DB and DW. You can also set up an on-premises gateway and pull from databases such as the commonly used DB2, MySQL, Oracle, SAP, Sybase and Teradata, as well as NoSQL databases like Cassandra and MongoDB.

I also mentioned files: you can pull from Amazon S3, file systems, FTP, HTTP, etc. You also have the Software as a Service (SaaS) options: Dynamics, HubSpot, Marketo, QuickBooks and Salesforce, to name a few. You can find the complete list in the Azure online documentation.

2. Data Transformation Activities – Here is where you take your data after it's been ingested into Azure and do something with it. Some common ones are the HDInsight Hive, Pig, MapReduce, Hadoop Streaming and Spark transformations. These allow you to transform your big data in your Azure environment and stage it for reporting.

Other common uses are machine learning on an Azure VM, as well as stored procedures. You can have your stored procedures defined in SQL Server in Azure and run them from the pipeline, and you can also use U-SQL for Data Lake Analytics.

3. Control Activities – With these you can do things like execute other pipelines, run ForEach loops or Lookup activities; these are the activities where you're controlling how the pipeline works and interacts with the data.
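Here's a hedged sketch that ties this together with the azure-mgmt-datafactory Python SDK: a copy (data movement) activity loads a blob data set into a SQL data set, and a stored procedure (data transformation) activity is set to run only after the copy succeeds, which is how you keep two activities from running in parallel. The data set, linked service and procedure names are placeholders I've made up for illustration.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource,
    CopyActivity,
    DatasetReference,
    BlobSource,
    SqlSink,
    SqlServerStoredProcedureActivity,
    LinkedServiceReference,
    ActivityDependency,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Data movement: copy from a blob data set into a SQL data set.
copy_step = CopyActivity(
    name="CopyBlobToSql",
    inputs=[DatasetReference(reference_name="SalesOrdersBlob")],
    outputs=[DatasetReference(reference_name="SalesOrdersSqlTable")],
    source=BlobSource(),
    sink=SqlSink(),
)

# Transformation: run a stored procedure, but only after the copy succeeds
# (the dependency is what stops the two activities from running in parallel).
transform_step = SqlServerStoredProcedureActivity(
    name="BuildSalesAggregates",
    stored_procedure_name="dbo.usp_BuildSalesAggregates",
    linked_service_name=LinkedServiceReference(reference_name="AzureSqlDatabaseLS"),
    depends_on=[ActivityDependency(activity="CopyBlobToSql", dependency_conditions=["Succeeded"])],
)

pipeline = PipelineResource(activities=[copy_step, transform_step])
adf_client.pipelines.create_or_update(
    "<resource-group>", "<factory-name>", "CopySalesPipeline", pipeline
)
```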

 

How Azure Data Factory Pricing Works

In today’s post I’d like to discuss how Azure Data Factory pricing works with the Version 2 model which was just released. The pricing is broken down into four ways that you’re paying for this service. I hope that by pointing these out, you can gain an understanding of not only how it works, but how you can keep an eye on your spending.

1. Azure activity runs vs self-hosted activity runs – there are different pricing models for these. Azure activity runs cover things like a copy activity moving data from an Azure Blob to an Azure SQL database, or a Hive activity running a Hive script on an Azure HDInsight cluster.

Self-hosted activity runs cover things like a copy activity moving data from an on-premises SQL Server to Azure Blob Storage, or a stored procedure activity running a stored procedure on an on-premises SQL Server.

2. Volume of data moved – this is measured in DMUs (data movement units). You should be aware of this one, as it defaults to auto, which basically uses all the DMUs it can, and you pay per DMU-hour. Say you specify 2 DMUs and it takes an hour to move the data: that's 2 DMU-hours. The other option is to use 8 DMUs and have it take 15 minutes: that's also 2 DMU-hours, so the price ends up the same. You're using four times the DMUs, but it's happening in a quarter of the time.

This is worth looking at and doing some comparisons, since the DMUs you use are where the bulk of your spend is going to be (see the quick sketch after this list).

3. SSIS integration runtimes – here you're choosing A-series and D-series compute levels. Which you pick depends on what compute is needed to run the process (how much CPU, how much RAM, how much temporary storage you need).

4. The inactive pipeline – you pay a small amount for inactive pipelines (about 40 cents currently). A pipeline is considered inactive if it's not associated with a trigger and hasn't been run for over a week. Yes, it's a minimal charge, but these add up, and when you start to wonder where some of your charges are coming from, it's good to keep this in mind.
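Here's the quick back-of-the-envelope sketch of that DMU comparison from item 2. The per-DMU-hour rate below is a made-up placeholder, not the actual price; check the Azure pricing page for real numbers.

```python
# Copy cost is roughly DMUs x hours x rate, so these two scenarios cost about the same.
RATE_PER_DMU_HOUR = 0.25  # hypothetical rate for illustration only

scenarios = {
    "2 DMUs for 60 minutes": 2 * 1.0,
    "8 DMUs for 15 minutes": 8 * 0.25,
}

for name, dmu_hours in scenarios.items():
    print(f"{name}: {dmu_hours} DMU-hours -> ${dmu_hours * RATE_PER_DMU_HOUR:.2f}")
```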

Also, each of the components used inside your Azure Data Factory pipelines, whether it's Blob Storage, SQL Server, HDInsight or any other storage or compute resource, will incur its own charges. These are billed separately, based on what those resources are.

Something to keep in mind as you start to build workloads: if you spin up an HDInsight cluster or a SQL data warehouse as part of a pipeline, make sure you shut it down, pause it or destroy it afterwards. That way you get your data moved and still keep costs down by not keeping those resources running all the time.