In my latest video blog I discuss getting started on the newly Generally Available Spark Pools as a part of Azure Synapse, another great option for Data Engineering/Preparation, Data Exploration, and Machine learning workloads
Without going too deep into the history of Apache Spark, I’ll start with the basics. Essentially, in the early days of Big Data workloads, a basis for machine learning and deep learning for advanced analytics and AI, we would use a Hadoop cluster and move all these datasets across disks, but the disks were always the bottleneck in the process. So, the creators of Spark said hey, why don’t we do this in memory and remove that bottleneck. So they developed Apache Spark as an in memory data processing engine as a faster way to process these massive datasets.
When the Azure Synapse team wanted to make sure that they were offering the best possible data solution for all different kinds of workloads, Spark gave the ability to have an option for their customers that were already familiar with the Spark environment, and included this feature as part of the complete Azure Synapse Analytics offering.
Behind the scenes, the Synapse team is managing many of the components you’d find in Open-Sourced Spark such as:
Apache Hadoop Yarn – for the management of the clusters where the data is being processed
Apache Livy – for the job orchestration
Anaconda – a package manager, environment manager, Python/R data science distribution and a collection of over 7500 open source packages for increasing the capabilities of the Spark clusters
I hope you enjoy the post. Let me know your thoughts or questions!
In my latest video blog I discuss and demonstrate some of the ways to connect to external data in Azure Synapse if there isn’t a need to import the data to the database or you want to do some ad-hoc analysis. I also talk about using COPY and CTAS statements if the requirement is to import the data after all. Check it out here
In this vLog, I cover the reasons why you might consider using Azure Data Factory, a mature cloud service for orchestration and processing of data over the newly GA Azure Synapse Studio.
Synapse has all of the same features as Azure Data Factory, but if you have a large development team working on ELT operations, or a simple data processing activity, it could make sense for the less-cluttered Azure Data Factory.
Take a look at the vLog here and let me know your thoughts on other scenarios for you!
In this video blog post I covered the serving layer step of building your Modern Data Warehouse in Azure. There are certainly some decisions to be made around how you want to structure your schema as you get it ready for presentation with whatever your business intelligence tool of choice, for this example I used Power BI, so I discuss some of the areas you should focus on:
What is your schema type? Snowflake or Star, or something else?
Where should you serve up the data? SQL Server, Synapse, ADLS, Databricks, or Something Else?
What are your Service level agreements for the business? What are your data processing times?
Can you save cost by using an option that’s less compute heavy?