Tag Archives: Azure

Getting started with Spark Pools in Azure Synapse

In my latest video blog I discuss getting started on the newly Generally Available Spark Pools as a part of Azure Synapse, another great option for Data Engineering/Preparation, Data Exploration, and Machine learning workloads

Without going too deep into the history of Apache Spark, I’ll start with the basics. Essentially, in the early days of Big Data workloads, a basis for machine learning and deep learning for advanced analytics and AI, we would use a Hadoop cluster and move all these datasets across disks, but the disks were always the bottleneck in the process. So, the creators of Spark said hey, why don’t we do this in memory and remove that bottleneck. So they developed Apache Spark as an in memory data processing engine as a faster way to process these massive datasets.

When the Azure Synapse team wanted to make sure that they were offering the best possible data solution for all different kinds of workloads, Spark gave the ability to have an option for their customers that were already familiar with the Spark environment, and included this feature as part of the complete Azure Synapse Analytics offering.

Behind the scenes, the Synapse team is managing many of the components you’d find in Open-Sourced Spark such as:

  • Apache Hadoop Yarn – for the management of the clusters where the data is being processed
  • Apache Livy – for the job orchestration
  • Anaconda – a package manager, environment manager, Python/R data science distribution and a collection of over 7500 open source packages for increasing the capabilities of the Spark clusters

I hope you enjoy the post. Let me know your thoughts or questions!

Connecting to External Data with Azure Synapse

In my latest video blog I discuss and demonstrate some of the ways to connect to external data in Azure Synapse if there isn’t a need to import the data to the database or you want to do some ad-hoc analysis. I also talk about using COPY and CTAS statements if the requirement is to import the data after all. Check it out here

Comparing Azure Synapse, Snowflake, and Databricks for common data workloads

In this vLog post I discuss how Azure Synapse, Databricks and Snowflake compare when it comes to common data workloads:

Data Science

Business Intelligence

Ad-Hoc data analysis

Data Warehousing

and more!

Where does Azure Data Explorer fit in the rest of the Data Platform?

In this vLog I give an overview of Azure Data Explorer and the Kusto Query Language (KQL). Born from analyzing logs behind Power BI, ADX is a great way to take large sets of data and quickly analyze those datasets and get actionable insights on that data.

Find more details about Azure Data Explorer here: https://azure.microsoft.com/en-us/services/data-explorer/

And get started with these great tutorials: https://docs.microsoft.com/en-us/azure/data-explorer/create-cluster-database-portal

Should I Choose Azure Data Factory or Synapse Studio

In this vLog, I cover the reasons why you might consider using Azure Data Factory, a mature cloud service for orchestration and processing of data over the newly GA Azure Synapse Studio.

Synapse has all of the same features as Azure Data Factory, but if you have a large development team working on ELT operations, or a simple data processing activity, it could make sense for the less-cluttered Azure Data Factory.

Take a look at the vLog here and let me know your thoughts on other scenarios for you!

The Modern Data Warehouse in Azure Part 4: The Serving Layer

In this video blog post I covered the serving layer step of building your Modern Data Warehouse in Azure. There are certainly some decisions to be made around how you want to structure your schema as you get it ready for presentation with whatever your business intelligence tool of choice, for this example I used Power BI, so I discuss some of the areas you should focus on:

  • What is your schema type? Snowflake or Star, or something else?
  • Where should you serve up the data? SQL Server, Synapse, ADLS, Databricks, or Something Else?
  • What are your Service level agreements for the business? What are your data processing times?
  • Can you save cost by using an option that’s less compute heavy?

What is Azure Firewall?

I’d like to discuss the recently announced Azure Firewall service that is now just released in GA. Azure Firewall is a managed, cloud-based network security service that protects your Azure Virtual Network resources. It is a fully stateful PaaS firewall with built-in high availability and unrestricted cloud scalability.

It’s in the cloud and Azure ecosystem and it has some of that built-in capability. With Azure Firewall you can centrally create, enforce and log application and network connectivity policies across subscriptions and virtual networks, giving you a lot of flexibility.

It is also fully integrated with Azure Monitor for log analytics. That’s big because a lot of firewalls are not fully integrated with log analytics which means you can’t centralize these logs in OMS, for instance, which would give you a great platform in a single pane of glass for monitoring many of the technologies being used in Azure.

Some of the features within:

  • Built in high availability, so there’s no additional load balances that need to be built and nothing to configure.
  • Unrestricted cloud scalability. It can scale up as much as you need to accommodate changing network traffic flows – no need to budget for your peak traffic, it will accommodate any peaks or valleys automatically.
  • It has application FQDN filtering rules. You can limit outbound HTTP/S traffic to specified lists of fully qualified domain names including wildcards. And the feature does not require SSL termination.
  • There are network traffic filtering rules, so you can create, allow or deny network filtering rules by source and destination IP address, port and protocol. Those rules are enforced and logged across multiple subscriptions and virtual networks. This is another great example of having availability and elasticity to be able to manage many components at one time.
  • It has fully qualified domain name tagging. If you’re running Windows updates across multiple servers, you can tag that service as an allowed service to come through and then it becomes a set standard for all your services behind that firewall.
  • Outbound SNAT and inbound DNAT support, so you can identify and allow traffic originating from your virtual network to remote Internet destinations, as well as inbound network traffic to your firewall public IP address is translated (Destination Network Address Translation) and filtered to the private IP addresses on your virtual networks.
  • That integration with Azure Monitor that I mentioned in which all events are integrated with Azure Monitor, allowing you to archive logs to a storage account, stream events to your Event Hub, or send them to Log Analytics.

Another nice thing to note is when you set up an express route or a VPN from your on premises environment to Azure, you can use this as your single firewall for all those virtual networks and allow traffic in and out from there and monitor it all from that single place.

This was just released in GA so there are a few hiccups, but if none of the service challenges effect you, I suggest you give it a try. It will only continue to come along and get better as with all the Azure services. I think it’s going to be a great firewall service option for many.