In my latest video blog I discuss getting started with the newly generally available Spark pools in Azure Synapse, another great option for data engineering and preparation, data exploration, and machine learning workloads.
Without going too deep into the history of Apache Spark, I’ll start with the basics. In the early days of Big Data workloads, which form the basis for machine learning and deep learning in advanced analytics and AI, we would use a Hadoop cluster and move all these datasets across disks, but the disks were always the bottleneck in the process. So the creators of Spark asked: why don’t we do this in memory and remove that bottleneck? They developed Apache Spark as an in-memory data processing engine, a faster way to process these massive datasets.
When the Azure Synapse team set out to offer the best possible data solution for all different kinds of workloads, Spark gave them an option for customers who were already familiar with the Spark environment, so they included it as part of the complete Azure Synapse Analytics offering.
Behind the scenes, the Synapse team manages many of the components you’d find in open-source Spark, such as:
Apache Hadoop YARN – for management of the clusters where the data is processed
Apache Livy – for job orchestration
Anaconda – a package manager, environment manager, and Python/R data science distribution with a collection of over 7,500 open-source packages that extend the capabilities of the Spark clusters
I hope you enjoy the post. Let me know your thoughts or questions!
In my latest video blog I discuss and demonstrate some of the ways to connect to external data in Azure Synapse when there isn’t a need to import the data into the database, or when you want to do some ad-hoc analysis. I also talk about using COPY and CTAS statements if the requirement is to import the data after all. Check it out here.
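As a rough sketch of what those two statements look like in a Synapse dedicated SQL pool, here they are held as Python strings for illustration. The table, storage account, and container names are made up; substitute your own.

```python
# Hypothetical COPY statement: bulk-loads CSV files from blob storage
# into an existing table in a Synapse dedicated SQL pool.
copy_sql = """
COPY INTO dbo.Sales
FROM 'https://mystorageacct.blob.core.windows.net/salesdata/*.csv'
WITH (FILE_TYPE = 'CSV', FIRSTROW = 2);
"""

# Hypothetical CTAS (CREATE TABLE AS SELECT) statement: creates and
# loads a new table from a SELECT over an external table in one step.
# A DISTRIBUTION option is required in a dedicated SQL pool.
ctas_sql = """
CREATE TABLE dbo.SalesByRegion
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT Region, SUM(Amount) AS TotalAmount
FROM ext.Sales
GROUP BY Region;
"""

print(copy_sql, ctas_sql)
```

COPY is the simpler path when you just need to land files in a table, while CTAS is handy when the import and a transformation can happen in one statement.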
In this vLog I give an overview of Azure Data Explorer and the Kusto Query Language (KQL). Born from analyzing the logs behind Power BI, ADX is a great way to quickly analyze large datasets and get actionable insights from that data.
Find more details about Azure Data Explorer here: https://azure.microsoft.com/en-us/services/data-explorer/
And get started with these great tutorials: https://docs.microsoft.com/en-us/azure/data-explorer/create-cluster-database-portal
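For a quick taste of KQL, here is the kind of query those tutorials walk through against the StormEvents sample table, held here as a Python string for illustration:

```python
# A sample KQL query against the StormEvents demo table from the
# Azure Data Explorer tutorials: filter rows, aggregate, then rank.
kql_query = """
StormEvents
| where State == 'TEXAS'
| summarize EventCount = count() by EventType
| top 5 by EventCount desc
"""
print(kql_query)
```

The pipe-based style reads top to bottom, with each operator refining the result of the previous one, which is a big part of why KQL is so quick for ad-hoc exploration of large datasets.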
In this vLog, I cover the reasons why you might consider using Azure Data Factory, a mature cloud service for the orchestration and processing of data, over the newly GA Azure Synapse Studio.
Synapse has all of the same features as Azure Data Factory, but if you have a large development team working on ELT operations, or just a simple data processing activity, it could make sense to use the less-cluttered Azure Data Factory.
Take a look at the vLog here and let me know your thoughts on other scenarios for you!
In this video blog post I cover the serving layer step of building your Modern Data Warehouse in Azure. There are certainly some decisions to be made around how you want to structure your schema as you get it ready for presentation in your business intelligence tool of choice (for this example I used Power BI), so I discuss some of the areas you should focus on:
What is your schema type? Snowflake or star, or something else?
Where should you serve up the data? SQL Server, Synapse, ADLS, Databricks, or something else?
What are your service-level agreements for the business? What are your data processing times?
Can you save cost by using an option that’s less compute heavy?
I’d like to discuss the recently announced Azure Firewall service, which has just been released in GA. Azure Firewall is a managed, cloud-based network security service that protects your Azure Virtual Network resources. It is a fully stateful PaaS firewall with built-in high availability and unrestricted cloud scalability.
Because it lives in the cloud and the Azure ecosystem, it comes with built-in capability. With Azure Firewall you can centrally create, enforce, and log application and network connectivity policies across subscriptions and virtual networks, giving you a lot of flexibility. It is also fully integrated with Azure Monitor for log analytics. That’s big, because many firewalls are not fully integrated with log analytics, which means you can’t centralize their logs in OMS, for instance, which would give you a single pane of glass for monitoring many of the technologies being used in Azure.
Some of the features:
Built-in high availability, so there are no additional load balancers to build and nothing to configure.
Unrestricted cloud scalability. It can scale up as much as you need to accommodate changing network traffic flows, so there’s no need to budget for your peak traffic; it will accommodate any peaks or valleys.
It has application FQDN filtering rules. You can limit outbound HTTP/S traffic to a specified list of fully qualified domain names, including wildcards. And the feature does not require SSL termination.
There are network traffic filtering rules, so you can centrally create allow or deny network filtering rules by source and destination IP address, port, and protocol. Those rules are enforced and logged across multiple subscriptions and virtual networks. This is another great example of having the availability and elasticity to manage many components at one time.
It has fully qualified domain name tagging. If you’re running Windows Update across multiple servers, for example, you can tag that service as allowed to come through, and it then becomes a set standard for all your services behind that firewall.
Outbound SNAT and inbound DNAT support. You can identify and allow traffic originating from your virtual network to remote Internet destinations, while inbound network traffic to your firewall’s public IP address is translated (Destination Network Address Translation) and filtered to the private IP addresses on your virtual networks.
Then there’s that integration with Azure Monitor I mentioned: all events are integrated with Azure Monitor, allowing you to archive logs to a storage account, stream events to your Event Hub, or send them to Log Analytics.
Another nice thing to note: when you set up ExpressRoute or a VPN from your on-premises environment to Azure, you can use this as your single firewall for all those virtual networks, allowing traffic in and out from there and monitoring it all from that single place.
This was just released in GA, so there are a few hiccups, but if none of the service challenges affect you, I suggest you give it a try. It will only continue to get better, as with all Azure services. I think it’s going to be a great firewall service option for many.