How and Why to Add a Source Code Repository to Azure Data Factory

For developers, it’s very beneficial to have a source code repository. A source code repository helps to keep all your changes, to manage tasks, branches, share the code with a team and simply put, to keep it in safe place.

In this post, I’ll tell you why you should connect your Azure Data Factories to a source code repository, and I’ll demo how to do so:

  • To do this, I’ll start in my data factory inside my Azure portal.
  • When I go into Author & Monitor, I have the ability to either:
    • Set up a code repository within the landing page or main page here or,
    • I can go directly into my data factory and I can add it in the left-hand corner and pull down to where it says, ‘set up code repository’.
  • One thing to note is the code repository itself has been supported for a while, but recently with the release of Data Flows, they’re now supporting GitHub within the repositories as well (previously is was only Azure DevOps).
  • So, I select my GitHub account and fill in the information. *If you’re doing this for the first time, it’s going to prompt you to log into your GitHub account when you do this. In this case, I’ve already previously connected this so it’s going to know about my repositories.
  • Next, I select my repository name and I’ll go to my playground branch and I’m going to use my existing playground.
  • One field will ask me ‘Branch to import resources into:’ so if I’m importing resources, I can select an existing one or create a new one. For this demo I’m going to pick my playground.
  • Before I hit Save, notice on the left-hand side I’ve got zero pipelines, one data set and zero data flows. But when I connect to my playground, it’s going to bring everything in I’ve previously connected to within any of my areas I’ve saved up into or checked my code into anything in that playground branch. So, now you’ll see I have 6 pipelines, 23 data sets and 4 data flows.
  • One of the other nice pieces of being able to add source control is if I want to add a new data set. I just select my SQL Server I was previously connected to; I leave it on default for now and connect in.
  • I then select one of the tables; I selected the Product Category Table.
  • You’ll see at the top you have the option to Save All or Publish. Save All is going to save any of the changes across the tabs, so you can tell when there’s been a change, whether it be data set or pipeline or data flow by having a star next to the name.
  • Now instead of needing to publish every time you’re doing development, you can just save it and it will save here. So, rather than having to publish the entire pipeline and do any of the error checking and make sure the pipeline is in good standing, you can now just save part way through. This is a huge advantage over having to publish the entire pipeline which could cause some challenges which might not be efficient for development and such.

The new features of GitHub being added in gives us another great opportunity if you didn’t previously use Azure Dev Ops (formerly known as Visual Studio Team Services). I’ll be doing more upcoming blogs around Data Factory in Azure Every Day that will be beneficial to you with some of the nuances as the product has greatly evolved since releasing Data Flows.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.