How to Gain Up to 9X Speed on Apache Spark Jobs

Are you looking to gain speed on your Apache Spark jobs? How does up to 9X performance sound? Today I’m excited to tell you how engineers at Microsoft were able to achieve that speedup on HDInsight Apache Spark clusters.

If you’re unfamiliar with HDInsight, it’s Microsoft’s premium managed offering for running open source workloads on Azure. You can run things like Spark, Hadoop, Hive, and LLAP, among others. You create clusters, spin them up when you need them, and spin them down when you don’t.

The big news here is the recently released preview of HDInsight IO Cache, which is a new transparent data caching feature that provides customers with up to 9X performance improvement for Spark jobs, without an increase in costs.

Many open source caching products exist in this ecosystem: Alluxio, Ignite, and RubiX, to name a few big ones. IO Cache is based on RubiX, and what differentiates RubiX from comparable caching products is its approach of using SSDs for the cache, eliminating the need for explicit memory management, whereas other comparable products reserve operating memory for caching the data.

Because SSDs typically provide more than 1 gigabit/second of bandwidth, and RubiX also leverages the operating system’s in-memory file cache, there is enough bandwidth to feed big data processing engines like Spark. This lets Spark run optimally, handle larger memory workloads, and deliver better overall performance by speeding up jobs that read data from remote cloud storage, the dominant architecture pattern in the cloud.

In benchmark tests comparing a Spark cluster with and without IO Cache, the team ran 99 SQL queries against a 1 terabyte dataset and saw as much as a 9X performance improvement with IO Cache turned on.

Let’s face it, data is growing everywhere and the demand for processing that data increases every day, along with the push for faster, closer-to-real-time results. To get there, we need to think more creatively about how to improve performance, rather than falling back on the age-old recipe of throwing hardware at the problem instead of tuning it or trying a new approach.

This is a great approach to leverage your existing hardware and help it run more efficiently. So, if you’re running HDInsight, try this out in a test environment. It’s as simple as a checkbox (off by default): go in, spin up your cluster, check the box to include IO Cache, and see what performance gains you can achieve with your HDInsight Spark clusters.
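Because IO Cache is transparent, your Spark code doesn’t change at all; a read-heavy job like the minimal PySpark sketch below simply gets its repeated reads of remote storage served from local SSD once the feature is enabled. The storage path, table, and column names here are placeholders, not part of the benchmark above.

```python
from pyspark.sql import SparkSession

# A minimal, read-heavy Spark SQL job. IO Cache is transparent, so nothing here
# changes when the cache is on; repeated scans of the same remote files are
# served from the cluster's local SSDs instead of going back to cloud storage.
spark = SparkSession.builder.appName("io-cache-demo").getOrCreate()

# Placeholder path: any Parquet data in the cluster's attached storage account.
sales = spark.read.parquet("wasbs://data@mystorageaccount.blob.core.windows.net/sales/")
sales.createOrReplaceTempView("sales")

# Queries that rescan the same remote data are where the cache pays off.
spark.sql("""
    SELECT item_id, SUM(quantity) AS total_qty
    FROM sales
    GROUP BY item_id
    ORDER BY total_qty DESC
    LIMIT 10
""").show()
```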

Using Azure to Drive Security in Banking Using Biometrics

In the digital world we live in today, it’s getting harder to verify identity in industries such as banking. We do fewer and fewer transactions in person. No longer do we go into the bank, passbook in hand, to make deposits or withdrawals face to face with a teller. Many of us have moved from ATM transactions to digital banking.

With this move, banks have tried many approaches to two-factor authentication, some better than others, and the need for secure forms of user authentication is obvious. Let me tell you how Azure is driving identity security in banking using biometric identification. By combining biometrics with artificial intelligence, banks can now take new approaches to verifying the digital identity of their customers and prospects.

If you don’t know, biometrics is the process of uniquely identifying a person by their physical and personal traits. Those images or features are captured by an electronic device, recorded in a database, and used as a unique form of identification. Common biometric methods include fingerprint and facial recognition, hand geometry, iris or eye scans, and even odor or scent.

Because of their uniqueness, these traits are much more reliable for confirming a person’s identity than a password or access card. So, how do you verify a person is who they say they are if they’re not there in person? Microsoft partners are now leveraging Azure platform offerings such as the Cognitive Services Vision API and Azure Machine Learning tools to perform multi-factor authentication in the banking industry.

The way this works is that the user provides a government-issued ID (a license or passport, for example), which is validated against standards provided by the ID issuer. The partners build verification logic for each ID type and store it in a database, so when someone submits an ID from a particular state, the system knows what that ID is supposed to look like and checks for all of its distinguishing features.

To take this a step further, the second factor uses facial recognition software on your phone or computer, much like Face ID on the iPhone. It takes your photo, but it also takes a video of you and prompts you to move your head in certain motions in order to validate that it is really you, that you’re not wearing a mask or something, and that you’re alive.

It takes the picture from your ID, matches it to your facial features, and compares them side by side; this becomes your digital signature. This is considered extremely secure, since you now have two forms of verification and you’re using biometrics. Crazy stuff when you think about it, but in the digital world we live in, you must go to these lengths to verify someone’s identity when they’re not right in front of you.
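To make the face-matching step concrete, here’s a minimal sketch using the Azure Face client library for Python. The endpoint, key, and image URLs are placeholders, and the real banking flows described above layer liveness checks (the video and head movements) on top of a simple verification like this.

```python
from azure.cognitiveservices.vision.face import FaceClient
from msrest.authentication import CognitiveServicesCredentials

# Placeholders: your Cognitive Services endpoint/key and two images to compare,
# e.g. the photo extracted from the ID document and a selfie frame.
ENDPOINT = "https://<your-face-resource>.cognitiveservices.azure.com/"
KEY = "<your-key>"
ID_PHOTO_URL = "https://example.com/id-photo.jpg"
SELFIE_URL = "https://example.com/selfie.jpg"

face_client = FaceClient(ENDPOINT, CognitiveServicesCredentials(KEY))

# Detect a face in each image; each detection returns a temporary face ID.
id_faces = face_client.face.detect_with_url(url=ID_PHOTO_URL)
selfie_faces = face_client.face.detect_with_url(url=SELFIE_URL)

# Compare the two detected faces and get back a match decision plus confidence.
result = face_client.face.verify_face_to_face(
    face_id1=id_faces[0].face_id,
    face_id2=selfie_faces[0].face_id,
)
print(f"Same person: {result.is_identical}, confidence: {result.confidence:.2f}")
```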

This is still the early phase of what we’ll see, but it’s cool to see how it’s being used, and it will be interesting to see how it progresses in the future. We’ve got great consultants working with Cognitive Services and Machine Learning. Anything data or Azure related, we’re doing it.

Introducing Azure SQL Database Hyperscale Service Tier

If your current SQL Database service tier is not well suited to your needs, I’m excited to tell you about a newly created service tier in Azure called Hyperscale. Hyperscale is a highly scalable storage and compute performance tier that leverages the Azure architecture to scale out resources for Azure SQL Database beyond the current limitations of general purpose and business critical service tiers.

The Hyperscale service tier provides the following capabilities:

  • Support for up to 100 terabytes of database size (and this will grow over time)
  • Faster large database backups which are based on file snapshots
  • Faster database restores (also based on file snapshots)
  • Higher overall performance due to higher log throughput and faster transaction commit time regardless of the data volumes
  • The ability to rapidly scale out. You can provision one or more read-only nodes for offloading your read workload or for use as hot standbys.
  • The ability to rapidly scale up your compute resources (in constant time) to accommodate heavy workloads, so you can scale compute up and down as needed, much like Azure Data Warehouse (see the sketch below).
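As a rough illustration of that scaling story, here’s a minimal sketch that changes a database’s service objective from Python with pyodbc. The server name, credentials, database name, and the HS_Gen5_4 target are placeholders; ALTER DATABASE has to run outside a transaction against master, and keep in mind that moving a database to Hyperscale was, at the time of writing, a one-way operation.

```python
import pyodbc

# Placeholders: your logical SQL server, admin credentials, and database name.
SERVER = "<your-server>.database.windows.net"
DATABASE_TO_SCALE = "mydb"

# ALTER DATABASE cannot run inside a transaction (hence autocommit=True),
# and service-objective changes are issued while connected to master.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    f"SERVER={SERVER};DATABASE=master;UID=<admin-user>;PWD=<password>",
    autocommit=True,
)

# Move an existing database to Hyperscale, or, if it is already Hyperscale,
# change the compute size by picking a different HS_Gen5_* service objective.
conn.execute(
    f"ALTER DATABASE [{DATABASE_TO_SCALE}] "
    "MODIFY (EDITION = 'Hyperscale', SERVICE_OBJECTIVE = 'HS_Gen5_4');"
)
conn.close()
```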

Who should consider moving to the Hyperscale tier? This is not an inexpensive tier, but it’s a great choice for companies with large databases that haven’t been able to use Azure SQL Database in the past because of its 4-terabyte limit, as well as for customers who hit performance and scalability limitations with the other two service tiers.

It is primarily designed for transactional (OLTP) workloads. It does support hybrid and OLAP workloads, but that’s something to keep in mind when designing your databases and services. It’s also important to note that elastic pools do not support the Hyperscale service tier.

How does it work?

  • Compute and storage are separated into 4 types of nodes, similar to Azure Data Warehouse.
  • The compute node is where the relational engine lives and where query processing happens.
  • The page server nodes are where the scaled-out storage engine resides. Database pages are served out to the compute nodes on demand, and the page servers keep pages updated as transactions change data, so these nodes are moving the data around for you.
  • The log service node is where log records are kept as they come in from the compute node, held in a durable cache, and then forwarded along to the other compute nodes and their caches to ensure consistency. Once the log has been consistently applied across the compute nodes, it is stored in Azure storage for long-term retention.
  • Lastly, the Azure storage node is where all the data is pushed from the page servers. All the data that eventually lands in the database gets pushed over to Azure storage, which is also the storage used for backups and where replication between availability groups happens.

The Hyperscale tier is an exciting opportunity for customers whose requirements weren’t met by the prior service tiers. It’s another great Microsoft offering that’s worth checking out if you’ve run into these service tier issues up to now. It also helps draw a line of distinction between Azure Data Warehouse and Azure SQL Database: you can now scale out and up and hold tons of data, but Hyperscale is still built for transactional processing, as opposed to Azure Data Warehouse, which is for analytical, massively parallel processing.

What is Azure Automation?

So, what do you know about Azure Automation? In this post, I’ll fill you in on this cool, cloud-based automation service that lets you configure process automation, update management, and system configuration, managed across your on-premises resources as well as your Azure cloud-based resources.

Azure Automation provides complete control over the deployment, operation, and decommissioning of workloads and resources in your hybrid environment, so you have a single pane of glass for managing all your resources through automation.

Some features I’d like to point out are:

  • It allows you to automate those mundane, error-prone activities that you perform as part of your system configuration and maintenance.
  • You can create runbooks in PowerShell or Python that reduce the chance of misconfiguration errors and lower the operational cost of maintaining those systems, since you script tasks to run when you need them instead of doing them manually (see the sketch after this list).
  • Runbooks can be developed for on-premises or Azure resources, and they support webhooks that let you trigger automation from things such as ITSM, DevOps, and monitoring systems, so you can run them remotely and trigger them from wherever you need to.
  • On the configuration management side, you can build desired state configurations for your enterprise environment. These help you set a baseline for how your systems should operate and identify when there’s a variance from the initial system configuration, alerting you to any anomalies that could be problematic.
  • It has a rich reporting back end and alerting interface for full visibility into what’s happening in your Windows and Linux systems – on-premises and in Azure.
  • It gives you update management (for Windows and Linux) to help define how updates are applied. Administrators can specify which updates will be deployed, which updates should not be deployed, and review successful and unsuccessful deployments, all through PowerShell or Python scripts.
  • It has shared capabilities, so when you’re using multiple resources or building runbooks for automation, you can share resources to simplify management. You can build multiple scripts but reuse the same resources over and over, things like role-based access control, variables, credentials, certificates, connections, schedules, and PowerShell modules, and you can check these in and out of source control like any code-based project.
  • Lastly, and one of the coolest features in my opinion: since these are templates you’re deploying to your systems and everyone faces similar challenges, there’s a community gallery where you can download templates others have created or upload ones you’ve created to share. With a few basic configuration tweaks and a review to make sure they’re secure, this is a great way to speed up the process by finding an existing script, cleaning it up, and deploying it in your environment.
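Here’s a minimal sketch of what a Python runbook can look like. It assumes the automationassets helper module that Azure Automation makes available inside its sandbox and on Hybrid Workers, and the asset names ("MaintenanceWindow", "OpsCredential") are placeholders you would define in your own Automation account.

```python
#!/usr/bin/env python
# A minimal Azure Automation Python runbook sketch. The automationassets module
# is only available inside the Automation sandbox / Hybrid Worker, and the asset
# names below are placeholders defined in your Automation account.
import sys
import automationassets

# Read a shared variable and a shared credential from the Automation account.
window = automationassets.get_automation_variable("MaintenanceWindow")
cred = automationassets.get_automation_credential("OpsCredential")

# Runbook input parameters arrive as ordinary command-line arguments.
target = sys.argv[1] if len(sys.argv) > 1 else "all-servers"

print(f"Running maintenance on {target} during window {window} "
      f"as {cred['username']}")
```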

So, there’s a lot you can do with this service and I think it’s worth checking out as it can make your maintenance and management much simpler.

What is Azure Firewall?

I’d like to discuss the recently announced Azure Firewall service, which has just been released to general availability. Azure Firewall is a managed, cloud-based network security service that protects your Azure Virtual Network resources. It is a fully stateful PaaS firewall with built-in high availability and unrestricted cloud scalability.

Because it lives in the cloud and the Azure ecosystem, a lot of that capability is built in. With Azure Firewall you can centrally create, enforce, and log application and network connectivity policies across subscriptions and virtual networks, which gives you a lot of flexibility.

It is also fully integrated with Azure Monitor for logging and analytics. That’s big, because many firewalls are not integrated with Log Analytics, which means you can’t centralize their logs in OMS, for instance; with that integration you get a single pane of glass for monitoring many of the technologies you’re using in Azure.

Some of the features within:

  • Built-in high availability, so there are no additional load balancers to build and nothing to configure.
  • Unrestricted cloud scalability. It can scale up as much as you need to accommodate changing network traffic flows – no need to budget for your peak traffic, it will accommodate any peaks or valleys automatically.
  • It has application FQDN filtering rules. You can limit outbound HTTP/S traffic to specified lists of fully qualified domain names including wildcards. And the feature does not require SSL termination.
  • There are network traffic filtering rules, so you can create allow or deny rules by source and destination IP address, port, and protocol. Those rules are enforced and logged across multiple subscriptions and virtual networks. This is another great example of having the availability and elasticity to manage many components at one time.
  • It has fully qualified domain name tagging. If you’re running Windows Update across multiple servers, for example, you can tag that service as allowed, and it becomes a set standard for all your services behind that firewall.
  • Outbound SNAT and inbound DNAT support, so you can identify and allow traffic originating from your virtual network to remote Internet destinations, while inbound traffic to your firewall’s public IP address is translated (Destination Network Address Translation) and filtered to the private IP addresses on your virtual networks.
  • The integration with Azure Monitor that I mentioned, in which all events are integrated with Azure Monitor, allowing you to archive logs to a storage account, stream events to your Event Hub, or send them to Log Analytics (see the query sketch below).
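As an example of that last point, here’s a minimal sketch that pulls recent Azure Firewall application-rule log entries out of a Log Analytics workspace using the azure-monitor-query and azure-identity Python packages. The workspace ID is a placeholder, and the AzureDiagnostics category and column names are assumptions you should verify against your own workspace schema.

```python
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

# Placeholder: the workspace ID of the Log Analytics workspace that the
# firewall's diagnostic settings send logs to.
WORKSPACE_ID = "<your-log-analytics-workspace-id>"

client = LogsQueryClient(DefaultAzureCredential())

# Pull the most recent application-rule log entries written by Azure Firewall.
query = """
AzureDiagnostics
| where Category == "AzureFirewallApplicationRule"
| project TimeGenerated, msg_s
| take 20
"""
response = client.query_workspace(WORKSPACE_ID, query, timespan=timedelta(hours=1))

for table in response.tables:
    for row in table.rows:
        print(row)
```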

Another nice thing to note: when you set up ExpressRoute or a VPN from your on-premises environment to Azure, you can use this as the single firewall for all those virtual networks, allow traffic in and out from there, and monitor it all from that single place.

This was just released to GA, so there are a few hiccups, but if none of the current service challenges affect you, I suggest you give it a try. As with all Azure services, it will only continue to improve. I think it’s going to be a great firewall service option for many.

Shell Chooses Azure Platform for AI

Artificial Intelligence (AI) is making its way into many industries today, helping to solve business problems and improve efficiency. In this post, I’d like to share an interesting story about Shell choosing Azure for their AI platform. Shell Oil Company chose C3 IoT for their IoT device management and Azure for their predictive analytics.

Let’s look at how Shell is using this technology:

  • The work required to fix a drill or piece of equipment in the field is much more significant when the failure is unexpected. Shell can use AI to predict when maintenance is required on compressors, valves, and other equipment used for oil drilling, which helps reduce unplanned downtime and repair effort. If they can keep up with maintenance before equipment fails, they can plan the downtime and do so at much lower cost (see the sketch after this list).
  • They’ll use AI to help steer the drill bits through shale to find the best quality deposits.
  • Failures of equipment of this size, such as drilling equipment, can cause a lot of related damage and danger. This technology will improve the safety of employees and customers by helping to reduce unexpected failures.
  • AI-enabled drills will help chart a course for the well itself as it’s being drilled, while providing constant data from the drill bits about the type of material being drilled through. The benefits here are two-fold: they get data on quality deposits and they reduce wear and tear on the drill. If an IoT device on the drill detects a harder material, they have the knowledge to drill in a different area or figure out the best path to reduce wear and tear.
  • It will free up geologists and engineers to manage more drills at one time, making them more efficient and better able to react to problems as they arise while drilling.
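To give a feel for the predictive-maintenance piece, here’s a purely illustrative Python sketch, not Shell’s actual model: it trains an anomaly detector on synthetic "healthy" sensor readings with scikit-learn and flags readings that drift away from normal so maintenance can be scheduled before a failure.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Purely illustrative: synthetic vibration/temperature/pressure readings for a
# compressor. The idea is to learn what "normal" sensor behavior looks like,
# then flag drift toward failure early enough to plan the downtime.
rng = np.random.default_rng(42)
normal = rng.normal(loc=[0.5, 70.0, 30.0], scale=[0.05, 2.0, 1.0], size=(1000, 3))

model = IsolationForest(contamination=0.01, random_state=42)
model.fit(normal)

# New readings: the second row drifts well outside the learned normal range.
new_readings = np.array([
    [0.52, 71.0, 30.5],   # healthy
    [0.95, 88.0, 24.0],   # likely needs maintenance
])
flags = model.predict(new_readings)  # 1 = normal, -1 = anomaly
for reading, flag in zip(new_readings, flags):
    status = "schedule maintenance" if flag == -1 else "ok"
    print(reading, "->", status)
```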

As with everything in Azure, this is a highly scalable platform that will allow Shell to grow with what is required, plus have the flexibility to take on new workloads. With IoT and AI, these workloads are very easily scaled using Azure as the platform and all the services available with it.

I wanted to share this interesting use case about Shell because it really displays the capabilities of the Azure Platform to solve the mundane and enable the unthinkable.

Sharing Integration Runtimes Among Azure Data Factories

In this post I’ll talk about self-hosted integration runtimes and the ability to share them across Data Factories, a new capability that was recently announced in the Azure Data Factory space.

The integration runtime is essentially the connector that allows you to connect back to your on-premises environment and safely and securely move data between Azure and that on-premises environment with Data Factory. It’s a dedicated application for Azure Data Factory that’s similar to the on-premises data gateway.

Here’s where this new feature helps. Until now, Data Factories could not share integration runtimes, so if you had different Data Factories connecting back to on-premises data, databases, flat files, etc., you had to set up an individual integration runtime for each of those Data Factories, even when pipelines spanned multiple Data Factories.

With this newly announced feature comes some new terminology:

  • Shared integration runtime – basically the standard self-hosted integration runtime you’re used to, except that it can now be shared with other Data Factories.
  • Linked integration runtime – an integration runtime with a ‘linked’ sub-type that lives in another Data Factory and points back to a shared integration runtime.

So, you’ll have your main shared integration runtime, and on top of that you’ll have one or more linked integration runtimes that reference the infrastructure of that self-hosted IR. The link points back to the shared IR and allows you to use it from multiple Data Factories (see the sketch below).
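Here’s a minimal sketch of creating that link from Python with the azure-mgmt-datafactory SDK, using the model names as I understand them (SelfHostedIntegrationRuntime, LinkedIntegrationRuntimeRbacAuthorization). All IDs and names are placeholders, and the factory that owns the shared IR must first grant the consuming factory permission on it.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    SelfHostedIntegrationRuntime,
    LinkedIntegrationRuntimeRbacAuthorization,
)

# Placeholders: your subscription, the factory that will *use* the shared IR,
# and the full resource ID of the shared self-hosted IR in the sharing factory.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
CONSUMER_FACTORY = "<data-factory-that-links>"
SHARED_IR_RESOURCE_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/<rg>/providers/"
    "Microsoft.DataFactory/factories/<sharing-factory>/integrationRuntimes/<shared-ir>"
)

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Create a linked integration runtime in the consuming factory that points
# back to the shared self-hosted IR (RBAC access must already be granted).
linked_ir = IntegrationRuntimeResource(
    properties=SelfHostedIntegrationRuntime(
        linked_info=LinkedIntegrationRuntimeRbacAuthorization(
            resource_id=SHARED_IR_RESOURCE_ID
        )
    )
)
client.integration_runtimes.create_or_update(
    RESOURCE_GROUP, CONSUMER_FACTORY, "LinkedSelfHostedIR", linked_ir
)
```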

The process is straightforward: you install the integration runtime in your environment, set up your linked service within Azure Data Factory, and then connect through that linked service. Then you’re ready to pull the data you need into the cloud, do transformations, and push it out to Azure Data Warehouse, Azure Databricks, etc.

This cool new capability allows you to get your data to the cloud much more easily and efficiently, and I highly recommend giving it a try!

What is Azure Data Box and Data Box Disk?

Are you looking to move large amounts of data into Azure? How does doing it for free, with an easier process, sound? Today I’m here to tell you how to do just that with Azure Data Box.

Picture this: you have a ton of data, let’s say 50 terabytes on-premises, and you need to get it into Azure because, for instance, you’re going to start doing incremental backups of a SQL database. You have two options to get this done.

The first option is to move that data manually over the network, which means you have to chunk it up, push it with AzCopy or a similar Azure data tool, land it in blob storage, then extract it and continue with the process. Sounds pretty painful, right? (The rough math below shows why.)
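Here’s a quick back-of-the-envelope estimate in Python. The link speeds and the 80% sustained-utilization figure are illustrative assumptions, not measurements, but they show why shipping a box can beat uploading.

```python
# Back-of-the-envelope estimate (illustrative numbers, not a benchmark):
# how long does 50 TB take to upload over a typical business internet link
# versus copying onto a Data Box over a local 10 Gb/s connection?
TB = 10**12  # decimal terabyte in bytes

def transfer_days(size_bytes, link_mbps, efficiency=0.8):
    """Days to move size_bytes over a link, assuming ~80% sustained utilization."""
    bits = size_bytes * 8
    seconds = bits / (link_mbps * 10**6 * efficiency)
    return seconds / 86400

data = 50 * TB
print(f"Over a 200 Mb/s uplink: ~{transfer_days(data, 200):.0f} days")
print(f"Over a 1 Gb/s uplink:   ~{transfer_days(data, 1000):.0f} days")
print(f"To the Data Box at 10 Gb/s locally: ~{transfer_days(data, 10000):.1f} days")
```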

Your second option is to use Azure Data Box which allows you to move large chunks of data up into Azure. Here’s how simple it is:

  • You order the Data Box through Azure (currently available in the US and EU)
  • Once received, you connect it to your environment however you plan to move that data
  • It uses standard protocols like SMB and CIFS
  • You copy the data you want to move and ship the Data Box back, and then Microsoft uploads the data into your storage container(s)
  • Once the data is uploaded, they will securely erase that Data Box

With the Data Box you get:

  • 256-bit encryption
  • A super tough, hardened box that can withstand drops or water, etc.
  • It can be pushed into Azure Blob
  • You can copy data to up to 10 storage accounts
  • There are two 1 gigabit/second and two 10 gigabit/second connections to allow quick movement of data off your network onto the box

In addition, Microsoft has recently announced the Data Box Disk, a small 8 terabyte disk, and you can order up to five of them per Data Box Disk order.

With Data Box Disk you get:

  • 35 terabytes of usable capacity per order
  • Supports Azure Blobs
  • A USB SATA 2 and 3 interface
  • Uses 128-bit encryption
  • Like Data Box, it’s a simple process: connect it, unlock it, copy the data onto the disk, and send it back, and the data is copied into a single storage account for you

Here comes the best part: while Azure Data Box and Data Box Disk are in preview, this is a free service. Yes, you heard it right, Microsoft will send you the Data Box or Data Box Disk for free and you can move your data up into Azure at no cost.

Sure, it will cost you money when you buy your storage account and start storing large sums of data, but storage is cheap in Azure, so that won’t break the bank.

 

What is Azure Virtual WAN?

In today’s post I’d like to talk about a site-to-site networking service. Azure already has a site-to-site VPN service, but Azure Virtual WAN is a newer service currently in preview. This networking service is optimized for branch-to-Azure connectivity and offers the ability to use devices supplied by preferred partners (currently Riverbed and Cisco) or to manually configure this connectivity with your environment.

Azure Virtual WAN has some big differences to consider:

  • Automated setup and configuration of the preferred partner devices makes them much easier to configure. You simply set up the connections, export the configuration directly from the device into Azure, and it’s set up for you automatically.
  • It is designed for large scalability and more throughput. The site-to-site VPN service is great for smaller workloads, but this new service opens up the pipe and lets the data flow through much faster.
  • It’s designed as a hub-and-spoke model, with Azure as the hub and your branch offices as the spokes, all managed within Azure.

Let’s look at the 4 main components of this service:

  • The Virtual WAN service itself – This asset is where the resources are collected, and it represents a virtual overlay of the Azure network. Think of it as a top-down view of the connectivity between all the components in Azure and in your offices.
  • A site represents the on-premises VPN device and its settings. I mentioned those preferred devices from Riverbed and Cisco (with more to come); if you’re using a supported device, you can easily drop its configuration into Azure.
  • The hub is the connection point in Azure for those sites. The site connects to the hub, and the virtual WAN oversees all of these components.
  • The hub virtual network connection is the connection point from your hub to your virtual network (a rough SDK sketch of creating the WAN and a hub follows this list).
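As a rough sketch of the Azure-side pieces, here’s how creating a virtual WAN and a hub might look with the azure-mgmt-network Python SDK. The model and method names here reflect my reading of the SDK and should be verified against your SDK version; the subscription, resource group, resource names, and address prefix are all placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.network.models import VirtualWAN, VirtualHub, SubResource

# Placeholders throughout; verify model/method names against your SDK version.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
LOCATION = "eastus"

client = NetworkManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# 1. The virtual WAN itself: the overlay that collects hubs, sites, and connections.
wan = client.virtual_wans.begin_create_or_update(
    RESOURCE_GROUP, "contoso-vwan", VirtualWAN(location=LOCATION)
).result()

# 2. A hub in a region: the Azure-side connection point for branch sites and VNets.
client.virtual_hubs.begin_create_or_update(
    RESOURCE_GROUP,
    "contoso-hub-eastus",
    VirtualHub(
        location=LOCATION,
        virtual_wan=SubResource(id=wan.id),
        address_prefix="10.100.0.0/24",
    ),
).result()
```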

So, your hub and your virtual network are connected through that virtual network connection. This allows the communication between your virtual networks in Azure and your site to site virtual WAN.

This offering changes the landscape a bit for how people do connectivity into Azure and connect their remote offices, consolidating what that network looks like and making setup easier by offering these preferred devices.

Again, this is still in Preview but definitely something I would suggest checking out.

Informatica Enterprise Data Catalog in Azure

If you’re like many Azure customers, you’ve been on the lookout for a data catalog and data lineage tool with all the key capabilities you need. Today, I’d like to tell you more about the Informatica Data Catalog, which was discussed briefly in a previous Azure Every Day post.

The Informatica tool helps you to analyze, consolidate and understand large volumes of metadata in your enterprise. It allows you to extract both physical and business metadata for objects and organize it based on business concepts, as well as view data lineage and relationships for each of those objects.

Sources include databases, data warehouses, business glossaries, data integration and Business Intelligence reports, and more – anything data related. The catalog maintains an indexed inventory of all the data objects, or ‘assets’, in your enterprise, such as tables, columns, reports, views and schemas.

Metadata and statistical information in the catalog include things like profile results, as well as info about data domains and data relationships. It’s really the who, what, when, where and how of the data in your enterprise.

Informatica Data Catalog can be used for tasks such as:

  • Find assets by scouring your network or cloud space for assets that aren’t yet cataloged.
  • View lineage for those assets, as well as relationships between assets.
  • Enrich assets by tagging them with additional attributes, for example tagging a specific report as a critical item.

There are lots of useful features in the Data Catalog. Some key ones are:

  • Data Discovery – Semantic search, dynamic filtering, and data lineage and relationship views for assets across your enterprise (see the sketch after this list).
  • Data Classification – Automatically or manually annotate data classifications to help with governance and discovery – who should have access to what data and what does the data contain.
  • Resource Administration – Resource, schedule and attribute management, as well as connection and profile configuration management; all the items that surround the data and help you manage the data and the metadata around it.
  • Create and edit reusable profile definition settings.
  • Monitor resources and tasks within your environment.
  • Data domain management, where you can create and edit domains and group together like data and reports.
  • Assign logical data domains to data groups.
  • Build composite data domains for management purposes.
  • Monitor the status of tasks in progress and look at some transformation logic for assets.
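To make the discovery-and-enrichment idea concrete, here’s a purely illustrative Python sketch of searching a catalog and tagging an asset over REST. The endpoint paths, parameters, and field names are hypothetical stand-ins, not Informatica’s documented API, so treat this as the shape of the workflow rather than working integration code.

```python
import requests

# Purely illustrative: the endpoint paths, parameters, and field names below are
# hypothetical stand-ins for a catalog REST API, not Informatica's documented API.
CATALOG_URL = "https://catalog.example.com/api"
AUTH = ("catalog_user", "catalog_password")  # placeholder credentials

# Semantic search: find table assets whose name or description mentions "customer".
resp = requests.get(
    f"{CATALOG_URL}/assets",
    params={"q": "customer", "type": "table"},
    auth=AUTH,
    timeout=30,
)
resp.raise_for_status()

# Enrich the first result by tagging it as a critical item.
for asset in resp.json().get("items", [])[:1]:
    requests.post(
        f"{CATALOG_URL}/assets/{asset['id']}/tags",
        json={"tag": "critical"},
        auth=AUTH,
        timeout=30,
    ).raise_for_status()
```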

On top of this, you can look at how frequently the data is accessed and how valuable it is to your business users. Surfacing this type of information around your data lets you, for instance, trim reports that aren’t being used.

When we talk about modern data warehousing in the Azure cloud, this is something we’ve been looking for. It’s a useful and valuable tool for those who want data governance and lineage capabilities.