Azure Data Factory – Data Sets, Linked Services and Pipeline Executions

In my posts this week, I’ve been going a little deeper into Azure Data Factory (be sure to check out my other posts from Tuesday and Wednesday). Today I’m excited to dig into getting data from a source into Azure.

First, let’s talk about data sets. Think of a data set as anything you’re using as an input or an output. An Azure blob data set is one example: it’s defined by the blob container itself, the folder, the file, the file format, and so on. A data set like that can serve as either your source or your destination.

There are many data set options, such as Azure services (other data stores like Blob storage, databases, data warehouses, or Data Lake), as well as on-premises databases like SQL Server, MySQL, or Oracle. You also have NoSQL stores like Cassandra and MongoDB, and you can pull files from an FTP server, Amazon S3, or internal file systems.

Lastly, you have SaaS offerings like Microsoft Dynamics, Salesforce, or Marketo. There is a grid in the Microsoft documentation that outlines which of these can be a source, a destination, or, in some cases, both.

Linked services are how you link your source to your destination. A linked service defines the connection to the data store, whether it’s the input or the output. Think of it like a connection string in SQL Server: you’re connecting to a specific place, whether you’re using it to pull data from a source or push it into a destination.
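To extend the connection string analogy, here’s a minimal sketch of creating an Azure Storage linked service with the Python SDK. All of the credentials and names below are placeholders:

```python
from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource,
    AzureStorageLinkedService,
    SecureString,
)

# Placeholder service principal credentials and subscription ID.
credential = ClientSecretCredential(
    tenant_id="<tenant-id>", client_id="<client-id>", client_secret="<client-secret>"
)
adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

# The linked service holds the connection details -- much like a connection string in SQL Server.
storage_linked_service = LinkedServiceResource(
    properties=AzureStorageLinkedService(
        connection_string=SecureString(
            value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        )
    )
)

adf_client.linked_services.create_or_update(
    "myResourceGroup", "myDataFactory", "AzureStorageLinkedService", storage_linked_service
)
```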

Now, let’s look at pipeline executions. A pipeline is a collection of activities that you’ve built, and an execution runs that pipeline, moving data from one place to another or doing some transformation on that data. There are two ways to execute a pipeline:

1. Manual (or On Demand) Executions. These can be triggered through the Azure portal, a REST API, a PowerShell script, or your .NET application (there’s a sketch of an on-demand run after this list).

2. Setting Up Triggers. With this execution, you set up a trigger as part of your Data Factory. This was an exciting new change in Azure Data Factory V2. Triggers can be scheduled, so you can set a job to run at a particular time each day, or you can set up a tumbling window trigger. Using a tumbling window, you can set up your hours of operation (let’s say you want your data to run from 8 to 5, Monday through Friday, every hour on the hour). The tumbling window trigger runs continuously over the times/hours you’ve specified (see the sketch at the end of this post).
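Pulling the pieces together, here’s a hedged sketch of a one-activity pipeline (a copy from one blob data set to another) and a manual, on-demand run of it via the Python SDK. The dataset, pipeline, resource group, and factory names are placeholders carried over from the earlier sketches:

```python
from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource,
    CopyActivity,
    DatasetReference,
    BlobSource,
    BlobSink,
)

# Placeholder credentials -- or reuse the adf_client from the linked service sketch.
credential = ClientSecretCredential(
    tenant_id="<tenant-id>", client_id="<client-id>", client_secret="<client-secret>"
)
adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

# A pipeline is a collection of activities; here it's a single copy activity
# that moves data from the input blob data set to the output blob data set.
copy_activity = CopyActivity(
    name="CopyFromBlobToBlob",
    inputs=[DatasetReference(type="DatasetReference", reference_name="InputBlobDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="OutputBlobDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)
pipeline = PipelineResource(activities=[copy_activity])
adf_client.pipelines.create_or_update("myResourceGroup", "myDataFactory", "CopyPipeline", pipeline)

# Manual (on-demand) execution: kick off a single run of the pipeline.
run = adf_client.pipelines.create_run(
    "myResourceGroup", "myDataFactory", "CopyPipeline", parameters={}
)
print(run.run_id)
```

From there you can check on the run (for example with adf_client.pipeline_runs.get and the run ID above) to see whether it succeeded.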

Doing this lessens your cost to run the job in Azure, because you’re using compute time only when you need it, not during downtime such as outside of business hours.
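As an illustration of that second option, here’s a rough sketch of an hourly tumbling window trigger over a fixed start/end window in the Python SDK. The window times, credentials, and names are all placeholders, and the exact required parameters can differ between SDK versions:

```python
from datetime import datetime
from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    TriggerResource,
    TumblingWindowTrigger,
    TriggerPipelineReference,
    PipelineReference,
)

# Placeholder credentials and names, as in the earlier sketches.
credential = ClientSecretCredential(
    tenant_id="<tenant-id>", client_id="<client-id>", client_secret="<client-secret>"
)
adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

# An hourly tumbling window over a fixed period, e.g. 8:00 to 5:00 on a given day.
start = datetime(2024, 1, 1, 8, 0)
end = datetime(2024, 1, 1, 17, 0)

trigger = TriggerResource(
    properties=TumblingWindowTrigger(
        pipeline=TriggerPipelineReference(
            pipeline_reference=PipelineReference(
                type="PipelineReference", reference_name="CopyPipeline"
            ),
            parameters={},
        ),
        frequency="Hour",   # fire once per one-hour window...
        interval=1,         # ...back to back, with no gaps or overlaps
        start_time=start,
        end_time=end,
        max_concurrency=1,
    )
)

adf_client.triggers.create_or_update(
    "myResourceGroup", "myDataFactory", "HourlyTumblingTrigger", trigger
)
# Triggers are created in a stopped state; start the trigger before it will fire
# (triggers.start or triggers.begin_start, depending on SDK version).
```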