I’m excited to announce that Azure Data Factory Data Flow is now in public preview and I’ll give you a look at it here. Data Flow is a new feature of Azure Data Factory (ADF) that allows you to develop graphical data transformation logic that can be executed as activities within ADF pipelines.
The intent of ADF Data Flows is to provide a fully visual experience with no coding required. Your Data Flow executes on your own Azure Databricks cluster for scaled-out data processing using Spark. ADF handles all the code translation, Spark optimization and execution of the transformations in your Data Flows, so it can process large volumes of data quickly.
In the current public preview, the Data Flow transformations available are:
- Joins – join data from 2 streams based on a condition
- Conditional Splits – route data to different streams based on conditions
- Union – collect data from multiple data streams
- Lookups – look up data from another stream
- Derived Columns – create new columns based on existing ones
- Aggregates – calculate aggregations on the stream
- Surrogate Keys – add a surrogate key column to the output stream, starting from a specific value
- Exists – check whether data exists in another stream
- Select – choose which columns flow into the next stream
- Filter – filter a stream based on a condition
- Sort – order data in the stream based on columns
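Data Flow itself needs no code, but it can help to see what a few of these transformations do to rows of data. The snippet below is only a rough plain-Python illustration. The sample rows, column names and variable names are made up for this sketch, and it is not what ADF generates or runs on Spark.

```python
# Illustration only: plain Python, not ADF or Spark code.
# The sample rows and column names below are made up for this sketch.
from collections import defaultdict

orders = [
    {"order_id": 1, "customer_id": "C1", "amount": 120.0},
    {"order_id": 2, "customer_id": "C2", "amount": 35.5},
    {"order_id": 3, "customer_id": "C1", "amount": 60.0},
    {"order_id": 4, "customer_id": "C9", "amount": 10.0},  # no matching customer
]
customers = [
    {"customer_id": "C1", "region": "East"},
    {"customer_id": "C2", "region": "West"},
]

# Join: combine two streams on a condition (here, an inner join on customer_id).
region_by_customer = {c["customer_id"]: c["region"] for c in customers}
joined = [
    {**o, "region": region_by_customer[o["customer_id"]]}
    for o in orders
    if o["customer_id"] in region_by_customer
]

# Derived Column: create a new column based on existing ones.
derived = [{**row, "is_large": row["amount"] >= 100} for row in joined]

# Conditional Split: route rows to different streams based on a condition.
large = [row for row in derived if row["is_large"]]
small = [row for row in derived if not row["is_large"]]

# Aggregate: total amount per region.
totals = defaultdict(float)
for row in derived:
    totals[row["region"]] += row["amount"]

# Surrogate Key: add an incrementing key column, starting from a chosen value.
keyed = [{"sk": key, **row} for key, row in enumerate(derived, start=1000)]

# Sort: order the stream by a column (largest amount first).
by_amount = sorted(keyed, key=lambda row: row["amount"], reverse=True)
```

In Data Flow you would configure each of these steps visually on the design surface rather than writing them by hand.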
To get started with Data Flow, you’ll need to sign up for the Preview by emailing email@example.com with the ID of the subscription you want to do your development in. You’ll receive a reply once the subscription has been added, and then you’ll be able to add new Data Flow activities.
At this point, when you create a Data Factory, you’ll have 3 options: Version 1, Version 2 and Version 2 with Data Flow.
Next, go to aka.ms/adfdataflowdocs. There you’ll find all the documentation you need for building your first Data Flows, as well as some prebuilt samples to work and play around with. You can then create your own Data Flows and add a Data Flow activity to your pipeline to execute and test your Data Flow in the pipeline’s debug mode. Or you can use Trigger Now in the pipeline to test your Data Flow from a pipeline activity.
Ultimately, you can operationalize your Data Flow by scheduling and monitoring the Data Factory pipeline that executes the Data Flow activity.
With Data Flow, we have the data orchestration and transformation piece we’ve been missing. It gives us a complete picture for the ETL/ELT scenarios that we want to run in cloud or hybrid environments, whether on-premises to cloud or cloud to cloud.
With Data Flow, Azure Data Factory has become the true cloud replacement for SSIS, and Data Flow should reach general availability (GA) by year’s end. It is well designed and has some neat features, especially the way you build expressions, which in my opinion works better than in SSIS.
When you get a chance, check out Azure Data Factory and its Data Flow features and let me know if you have any questions!