Category Archives: Big Data

Microsoft and BlackRock Announce Retirement Planning Partnership

The state of retirement planning is approaching a major crisis: current and future generations are reaching retirement age with little or no savings.

As we’ve moved away from the pensions of old, the responsibility previously held by companies has shifted to individuals, who now have to save and invest on their own to make sure they’re set up for retirement.

I wanted to share a recent press release from Microsoft, which announced a partnership with BlackRock to help reimagine the way people manage their retirement planning. BlackRock is a world leader in wealth management, including providing solutions to consumers, and currently manages approximately $6.5 trillion in assets for investors worldwide.

The goal of this alliance is to find ways for people to interact with their retirement assets more, so they know what kind of contributions they’re making. BlackRock will design and manage a suite of next generation investment tools that aim to provide a ‘lifetime’ of income in retirement. This would be made available to US workers through their employer’s workplace savings plan.

The press release did not share much detail about what exactly the two firms will partner on, but the following is a quote from Microsoft CEO, Satya Nadella: “Together with BlackRock, we will apply the power of the cloud and AI to introduce new solutions that address this important challenge and reimagine retirement planning.”

As we know, AI, deep learning and machine learning and all their related technologies, can have a profound impact on information gathering, processing and the intelligence we can extract from it. This helps us make better decisions.

The idea here is to offer technology options to businesses for their employees to consume and promote fiduciary responsibility. There will be more complex options that have been shunned previously by employers because of their complexity and costliness.

In recent years, BlackRock has shown through acquisitions and investments that it wants to move its technology footprint forward. In 2015, it acquired a robo-advisor company, and it also invested in Acorns, a company that helps millennials save their spare change and put it into a savings account.

Last year, BlackRock acquired Investment, a company that gives them more sophisticated online investment tooling. It is also believed that additional partnerships will come along to help support any of these new investment options, the plans and the employees.

When it comes to how the world is changing, AI is one of the biggest conversations of 2019. At the heart of AI is data, specifically data quality and consistency. These are factors we focus on at Pragmatic Works, because we know our clients need to rely on them.

This press release shows where we’re going with some of the AI technology that’s a huge topic of conversation in organizations today.

3 Common Analytics Use Cases for Azure Databricks

Pragmatic Works is considered an expert in the Microsoft Data Platform, both on-premises and in Azure. As a result, we’re often asked how a certain technology can benefit a company. One technology we’re asked about a lot is Azure Databricks. It was released in preview in the Azure portal over a year ago, and we’re starting to see massive adoption by many companies. But not everyone is ready to delve into data science and deep analytics, so many haven’t had much exposure to what Databricks is and what it can do for their business.

Several barriers prevent organizations from adopting data science and machine learning, which can be applied to solve many common business challenges. One example is collaboration between data scientists, data engineers and business analysts who are working with data (structured and unstructured) from a multitude of sources.

In addition, there’s a complexity involved when you try to do things with these massive volumes of data. Then add in some cultural aspects, having multiple teams and using consultants, and with all these factors, how do you get that one common theme and common platform where everybody can work and be on the same page? Azure Databricks is one answer.

Here’s an overview of 3 common use cases that we’re beginning to see and how they can benefit your organization:

1. Recommendation Engines – Recommendation Engines are becoming an integral part of applications and software products as mobile apps and other advances in technology continue to change the way users choose and utilize information. Most likely when you’re shopping on any major retail site, they are going to make recommendations to related products based on the products you’ve selected or that you’re looking at.

2. Churn Analysis – Commonly known as customer attrition; basically, it’s when we lose customers. Using Databricks, you can find some of the warning signs behind that. If you can correlate the data that leads up to a customer leaving your company, you have a better chance of saving that customer.

And we all know that keeping a customer and giving them the service they need or the product they want is significantly less costly than having to acquire new customers.

3. Intrusion Detection – This is needed to monitor networks or systems and activities for malicious activity or policy violations and produce electronic reports to some kind of dashboard or management station or wherever that is captured.

With the combination of streaming and batch technologies tightly integrated with Databricks and the Azure Data Platform, we are getting access to more real-time and static data correlations that are helping to make faster decisions and try to avoid some of these intrusions.

Once we’re alerted that there is a problem, we can shut it off very quickly or use automation options to do that as well.
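To make the streaming-plus-batch idea from the intrusion detection use case a little more concrete, here is a minimal sketch in plain Python. It is not Databricks code: the event format, the 3x multiplier, the 60-second window and the sample IP addresses are all assumptions made up for illustration. A "batch" step derives a threshold from historical counts, and a "streaming" step flags any source that exceeds it inside a sliding window.

```python
from collections import deque


def baseline_threshold(history, multiplier=3.0):
    """Batch step: derive an alert threshold from historical per-window counts."""
    mean = sum(history) / len(history)
    return mean * multiplier


def detect(events, threshold, window_seconds=60):
    """Streaming step: flag a source once its failed logins inside the
    sliding window exceed the threshold. Events must be sorted by time."""
    recent = {}  # source -> deque of timestamps still inside the window
    alerts = []
    for ts, source in events:
        window = recent.setdefault(source, deque())
        window.append(ts)
        # Drop timestamps that have fallen out of the window.
        while window and ts - window[0] >= window_seconds:
            window.popleft()
        if len(window) > threshold:
            alerts.append((ts, source))
    return alerts


# Hypothetical history: failed logins per minute for a typical source.
history = [1, 2, 1, 0, 1]
threshold = baseline_threshold(history)

# Hypothetical stream of (timestamp_in_seconds, source_ip) failed logins.
events = [
    (0, "10.0.0.5"),
    (5, "10.0.0.9"),
    (10, "10.0.0.5"),
    (20, "10.0.0.5"),
    (30, "10.0.0.5"),
    (200, "10.0.0.5"),
]
alerts = detect(events, threshold)
print(alerts)
```

In a real deployment the alert would feed a dashboard or an automated shut-off rather than a print statement.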

Today I wanted to highlight some of the ways that you can utilize Databricks to help your organization. If you have questions or would like to break down some of these barriers to adopting machine learning and data science for your business, we can help.

We are using all the Azure technologies and talking about them with our customers all the time, as well as deploying real-world workload scenarios.

What is Azure Data Box Heavy?

You may have seen my previous Azure Every Day post on Azure Data Box and Azure Data Box Disk. These are a great option for getting smaller workloads, up to 80 terabytes of data, quickly up into Azure. Rather than moving it over the wire, you can send a box and bring it up.

The Data Box Heavy works the same way, but it handles much larger amounts of data, with up to a petabyte of space.

Let’s review the Data Box process:

  • You order the box through the Azure Portal and specify the region that you’re going to use.
  • Once you receive it, connect it into your network, set up network shares and then you copy your data over. It has fast performance with up to 40 gigabits/second transfer rates.
  • Then you return the box to Microsoft and they will load the data directly into your Azure tenant.
  • Lastly, they will securely erase the disk as per the National Institute of Standards and Technology (NIST) guidelines.

The Data Box Heavy is ideally suited to transfer data sizes larger than 500 terabytes. If you used a Data Box with its 80 terabytes, you’d need 5 or 6 of those in place of the Heavy. When you have those larger data sizes, it makes more sense to have it on one machine.
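The sizing argument is easy to check with a little arithmetic. The sketch below is only a back-of-the-envelope calculation: the 80 TB figure is the usable Data Box capacity mentioned above, the 100 TB raw capacity and the 1 Gbps link are assumptions, and it assumes the link is fully saturated the whole time. Depending on whether you count usable or raw capacity, the number of standard boxes for 500 TB lands between five and seven.

```python
import math


def boxes_needed(total_tb, capacity_tb):
    """Number of devices needed to hold total_tb, rounding up."""
    return math.ceil(total_tb / capacity_tb)


def transfer_days(total_tb, link_gbps):
    """Days to push total_tb over the wire at a sustained link speed.
    Uses decimal units: 1 TB = 1e12 bytes, 1 Gbps = 1e9 bits/second."""
    bits = total_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9)
    return seconds / 86400


print(boxes_needed(500, 80))    # counting 80 TB usable per box
print(boxes_needed(500, 100))   # counting an assumed 100 TB raw per box
print(round(transfer_days(500, 1.0), 1))  # 500 TB over an assumed 1 Gbps link
```

Real transfers rarely sustain full line rate, so the over-the-wire estimate is a best case.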

The data movement can be a one time or periodic thing, depending on the use case. So, if you want to do an initial bulk data load, you can move that over and then follow that up with periodic transfers.

Some scenarios or use cases would be:

  • You have a huge amount of data on prem and you want to move it up into Azure – maybe a media library of offline tapes or tape backups for some kind of online library.
  • You’re migrating an entire cabinet – you have a ton of data in there with your virtual machine farm, your SQL Server and applications – over to Azure. You can move that over into your tenant, migrate your virtual machines first, then you can do an incremental restore of data from there.
  • Moving historical data to Azure for doing deeper analysis using Databricks or HDInsight, etc.
  • A scenario where you have a massive amount of data and you want to do the initial bulk load to push it up, then from there you want to do incremental loads of additional data as it gets generated across the wire.
  • You have an organization that’s using IoT or video data with a drone – inspecting rail lines or power lines for instance. They are capturing tremendous amounts of data (video and graphic files can be huge) and they want to be able to move that up in batches. Data Box Heavy would be a great solution to quickly move these up rather than moving the files individually or over the wire.

This is a very cool technology and an exceptional solution for moving data up in a more efficient manner when you have huge, terabyte-scale amounts of data to push to Azure.

Azure Data Factory – Data Flow

I’m excited to announce that Azure Data Factory Data Flow is now in public preview and I’ll give you a look at it here. Data Flow is a new feature of Azure Data Factory (ADF) that allows you to develop graphical data transformation logic that can be executed as activities within ADF pipelines.

The intent of ADF Data Flows is to provide a fully visual experience with no coding required. Your Data Flow will execute on your own Azure Databricks cluster for scaled-out data processing using Spark. ADF handles all the code translation, Spark optimization and execution of transformations in Data Flows, so it can process massive amounts of data very quickly.

In the current public preview, the Data Flow activities available are:

  • Joins – where you can join data from 2 streams based on a condition
  • Conditional Splits – allow you to route data to different streams based on conditions
  • Union – collecting data from multiple data streams
  • Lookups – looking up data from another stream
  • Derived Columns – create new columns based on existing ones
  • Aggregates – calculating aggregations on the stream
  • Surrogate Keys – this will add a surrogate key column to output streams from a specific value
  • Exists – check to see if data exists in another stream
  • Select – choose columns to flow into the next stream that you’re running
  • Filter – you can filter streams based on a condition
  • Sort – order data in the stream based on columns
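Data Flow itself is a visual designer with no coding, but it can help to see what a few of these transformations do to the data. The following is a plain Python sketch, not ADF or Spark code, that chains a join, a filter, a derived column, an aggregate and a sort over two tiny made-up streams. The column names, the 7% tax rate and the filter cutoff are hypothetical.

```python
from collections import defaultdict

# Two small input "streams" (hypothetical sample data).
orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 120.0},
    {"order_id": 2, "customer_id": "c2", "amount": 35.0},
    {"order_id": 3, "customer_id": "c1", "amount": 60.0},
    {"order_id": 4, "customer_id": "c3", "amount": 15.0},
]
customers = [
    {"customer_id": "c1", "region": "East"},
    {"customer_id": "c2", "region": "West"},
]

# Join: inner join of orders to customers on customer_id.
by_id = {c["customer_id"]: c for c in customers}
joined = [
    {**o, **by_id[o["customer_id"]]}
    for o in orders
    if o["customer_id"] in by_id
]

# Filter: keep rows that meet a condition.
filtered = [row for row in joined if row["amount"] >= 30]

# Derived Column: add a new column computed from an existing one.
derived = [
    {**row, "amount_with_tax": round(row["amount"] * 1.07, 2)}
    for row in filtered
]

# Aggregate: total amount per region.
totals = defaultdict(float)
for row in derived:
    totals[row["region"]] += row["amount"]

# Sort: order the aggregated output by region.
result = sorted(totals.items())
print(result)
```

In Data Flow each of these steps would be a box on the design surface, and ADF would generate the equivalent Spark logic for you.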

Getting Started:

To get started with Data Flow, you’ll need to sign up for the Preview by emailing adfdataflowext@microsoft.com with your ID from the subscription you want to do your development in. You’ll receive a reply when it’s been added and then you’ll be able to go in and add new Data Flow activities.

At this point, when you go in and create a Data Factory, you’ll now have 3 options: Version 1, Version 2 and Version 2 with Data Flow.

Next, go to aka.ms/adfdataflowdocs and this will give you all the documentation you need for building your first Data Flows, as well as work and play around with some samples already built. You can then create your own Data Flows and add a Data Flow activity to your pipeline to execute and test your own Data Flow in debug mode in the pipeline. Or you can use Trigger Now in the pipeline to test your Data Flow from a pipeline activity.

Ultimately, you can operationalize your Data Flow by scheduling and monitoring your Data Factory pipeline that is executing the Data Flow activity.

With Data Flow we have the data orchestration and transformation piece we’ve been missing. It gives us a complete picture for the ETL/ELT scenarios that we want to do in the cloud or hybrid environments, your on prem to cloud or cloud to cloud.

With Data Flow, Azure Data Factory has become the true cloud replacement for SSIS and this should be in GA by year’s end. It is well designed and has some neat features, especially how you build your expressions which works better than SSIS in my opinion.

When you get a chance, check out Azure Data Factory and its Data Flow features and let me know if you have any questions!

Intro to Azure Databricks Delta

If you know about or are already using Databricks, I’m excited to tell you about Databricks Delta. As most of you know, Apache Spark is the underlying technology for Databricks, so about 75-80% of all the code in Databricks is still Apache Spark. You get that super-fast, in-memory processing of both streaming and batch data, and Databricks was built by some of the founders of Spark.

The ability to offer Databricks Delta is one big difference between Spark and Databricks, aside from the workspaces and the collaboration options that come native to Databricks. Databricks Delta delivers a powerful transactional storage layer by harnessing the power of Spark and Databricks DBFS.

The core abstraction of Databricks Delta is an optimized Spark table that stores data as Parquet files in DBFS and maintains a transaction log that efficiently tracks changes to the table. So, you can read and write data stored in the Delta format using the same Spark SQL batch and streaming APIs that you use to work with Hive tables and DBFS directories.

With the addition of the transaction log, as well as other enhancements, Databricks Delta offers some significant benefits:

ACID Transactions – a big one for consistency. Multiple writers can simultaneously modify a dataset and see consistent views. Also, writers can modify a dataset without interfering with jobs reading the dataset.

Faster Read Access – automatic file management organizes data into large files that can be read efficiently. Plus, there are statistics that enable speeding up reads by 10-100x and data skipping avoids reading irrelevant information. This is not available in Apache Spark, only in Databricks.
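To show why a transaction log helps with consistency, here is a deliberately simplified toy model in plain Python. It is not how Databricks Delta is implemented; the class name, the commit format and the file names are all invented for illustration. The idea it demonstrates is that a table's contents at any version can be reconstructed by replaying an append-only log of "add file" and "remove file" actions, so a reader pinned to one version is not disturbed by a later write.

```python
class ToyDeltaTable:
    """Toy model: an append-only log of commits, where each commit
    adds and/or removes data files. The list index is the version."""

    def __init__(self):
        self.log = []

    def commit(self, add=(), remove=()):
        """Append one atomic commit and return its version number."""
        self.log.append({"add": list(add), "remove": list(remove)})
        return len(self.log) - 1

    def snapshot(self, version=None):
        """Replay the log up to a version to get the live set of files."""
        if version is None:
            version = len(self.log) - 1
        files = set()
        for entry in self.log[: version + 1]:
            files |= set(entry["add"])
            files -= set(entry["remove"])
        return sorted(files)


table = ToyDeltaTable()

# Version 0: a writer adds two small files.
v0 = table.commit(add=["part-0.parquet", "part-1.parquet"])
reader_view = table.snapshot(v0)  # a reader pins version 0

# Version 1: a compaction job rewrites the small files into one larger file.
v1 = table.commit(
    add=["part-2.parquet"],
    remove=["part-0.parquet", "part-1.parquet"],
)

print(reader_view)             # what the pinned reader still sees
print(table.snapshot())        # what new readers see
```

The compaction commit here also hints at the "large files that can be read efficiently" point: small files get replaced by bigger ones without readers ever seeing a half-finished state.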

Databricks Delta is another great feature of Azure Databricks that is not available in traditional Spark, further separating the capabilities of the products and providing a great platform for your big data, data science and data engineering needs.

How to Gain Up to 9X Speed on Apache Spark Jobs

Are you looking to gain speed on your Apache Spark jobs? How does 9X performance speed sound? Today I’m excited to tell you about how engineers at Microsoft were able to gain that speed on HDInsight Apache Spark Clusters.

If you’re unfamiliar with HDInsight, it’s Microsoft’s premium managed offering for running open source workloads on Azure. You can run things like Spark, Hadoop, Hive and LLAP, among others. You create clusters and spin them up when you need them and spin them down when you’re not using them.

The big news here is the recently released preview of HDInsight IO Cache, which is a new transparent data caching feature that provides customers with up to 9X performance improvement for Spark jobs, without an increase in costs.

There are many open source caching products in the ecosystem: Alluxio, Ignite and RubiX, to name a few big ones. IO Cache is based on RubiX. What differentiates RubiX from comparable caching products is its approach of using SSDs and eliminating the need for explicit memory management, whereas other comparable caching products reserve operating memory for caching the data.

Because SSDs typically provide more than 1 gigabyte/second of bandwidth, and the cache also leverages the operating system’s in-memory file cache, this gives us enough bandwidth to feed big data compute engines like Spark. This lets Spark run optimally, handle bigger workloads and perform better overall by speeding up jobs that read data from remote cloud storage, the dominant architecture pattern in the cloud.

In benchmark tests comparing a Spark cluster with and without the IO Cache running, they performed 99 SQL queries against a 1 terabyte dataset and got as much as 9X performance improvement with IO Cache turned on.
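The general technique behind this kind of speedup is a read-through cache: the first read of a block goes to remote storage, and repeat reads are served from fast local storage. The sketch below is a generic illustration in plain Python, not RubiX or IO Cache code. The class name, the least-recently-used eviction policy and the tiny capacity are assumptions chosen to keep the example short.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Serve repeat reads locally; fall back to the slow source on a miss.
    Evicts the least recently used block when capacity is exceeded."""

    def __init__(self, fetch, capacity):
        self.fetch = fetch          # function that reads from remote storage
        self.capacity = capacity    # max number of cached blocks
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(key)  # mark as most recently used
            return self.blocks[key]
        self.misses += 1
        data = self.fetch(key)
        self.blocks[key] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used
        return data


remote_calls = []


def fetch_from_remote(key):
    """Stand-in for a slow read from remote cloud storage."""
    remote_calls.append(key)
    return f"data-for-{key}"


cache = ReadThroughCache(fetch_from_remote, capacity=2)
for key in ["a", "b", "a", "b", "c", "a"]:
    cache.read(key)

print(cache.hits, cache.misses, remote_calls)
```

Workloads that re-read the same data, like a batch of SQL queries over one dataset, benefit the most because more reads turn into cache hits.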

Let’s face it, data is growing all over and the requirement for processing that data is increasing more and more every day. And we want to get faster and closer to real time results. To do this, we need to think more creatively about how we can improve performance in other ways, without the age-old recipe of throwing hardware at it instead of tuning it or trying a new approach.

This is a great approach to leverage some existing hardware and help it run more efficiently. So, if you’re running HDInsight, try this out in a test environment. It’s as simple as a check box (that’s off by default); go in, spin up your cluster and hit the checkbox to include IO Cache and see what performance gains you can achieve with your HDInsight Spark clusters.

Shell Chooses Azure Platform for AI

Artificial Intelligence (AI) is making its way into many industries today, helping to solve business problems and helping with efficiency. In this post, I’d like to share an interesting story about Shell choosing Azure for their AI platform. Shell Oil Company chose to use C3 IoT for their IoT device management and Azure for their predictive analytics.

Let’s look at how Shell is using this technology:

  • The operations required to fix a drill or piece of equipment in the field are much more significant when the failure is unexpected. Shell can use AI to look at when maintenance is required on compressors, valves and other equipment that’s used for oil drilling. This will help to reduce unplanned downtime and repair efforts. If they can keep up with maintenance before equipment fails, they can plan downtime and do so at much less cost.
  • They’ll use AI to help steer the drill bits through shale deposits to find the best quality shale deposits.
  • Failures of equipment of great size, such as drilling equipment, can have a lot of related damage and danger. This technology will improve the safety of employees and customers by helping to reduce unexpected failures.
  • AI enabled drills will help chart a course for the well itself as it’s being drilled, as well as providing constant data from the drill bits on what type of material is being drilled through. The benefits here are 2-fold; they will get data on quality deposits and reduce the wear and tear on the drill. If the drill is using an IoT device to detect a harder material, they’ll have the knowledge to drill in a different area or to figure out the best path to reduce the wear and tear.
  • It will free up the geologists and engineers to be able to manage more drills at one time, making them more efficient, as well as reactive to deal with problems as they arise while drilling.

As with everything in Azure, this platform is a highly scalable platform that will allow Shell to grow with what is required, plus have the flexibility to take on new workloads. With IoT and AI, these workloads are very easily scaled using Azure as a platform and all the services available with it.

I wanted to share this interesting use case about Shell because it really displays the capabilities of the Azure Platform to solve the mundane and enable the unthinkable.

Overview of Azure Elastic Database Jobs Service

Today I’ll give an overview of Microsoft’s newly released (in preview) Elastic Database Jobs service. This is a fully hosted Azure service, whereas the previous iteration was a custom hosted and managed version available on SQL DB and SQL DW within Azure.

It’s similar in capability to an on prem SQL Server Agent, but it can reach across multiple servers, subscriptions and regions. SQL Agent is limited to just the instance on the server for the database that you’re managing. This gives you a much wider range across all your different Azure services.

Other benefits and features:

  • Significant capability added that can enable automation and execution of T-SQL jobs with PowerShell, REST API or T-SQL APIs against a group of databases.
  • Can be used for a wide variety of maintenance tasks, such as rebuilding indexes, schema changes, collecting query results and performance monitoring. Think of it in terms of a developer who’s managing many databases across multiple subscriptions to support multiple lines of business or web applications with the same database schema and they want to make a change to it.
  • The capability to maintain a larger number of databases with similar operations and it allows management for whatever databases you specify and that will ensure an optimum customer experience. You’ll also ensure maximum efficiency to maintain your databases without having to set up specific jobs on each of those servers, and to tap into them and make changes more efficiently during off hours and scale up/down when you need to. Plus, you can change that schema across all those databases with a simple interface.
  • Schedule administrative tasks that otherwise would have to be manually done.
  • Allows for some small schema changes, credential management, performance data collection, or even telemetry collection if you want insight into what people are doing on the databases.
  • Build indexes off hours.
  • Collect query results from multiple databases for central performance management, so you can collect this info into one place, then render info into a portal like Power BI.

Basically, it reduces management maintenance overhead with its ability to go across subscriptions. Normally, you’d have to have that job run on a specific server; but now within Azure, where you are running managed databases, you can run operations across those databases without having to set up separate jobs.
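The core idea, one job definition fanned out across a group of databases, can be sketched without touching Azure at all. The example below uses in-memory SQLite databases from Python's standard library as stand-ins for the target group. It does not use the Elastic Database Jobs API; the function name, the tenant names and the table schema are hypothetical.

```python
import sqlite3


def run_job(targets, script):
    """Run the same SQL script against every database in the target group
    and record a per-database outcome."""
    results = {}
    for name, conn in targets.items():
        try:
            conn.executescript(script)
            conn.commit()
            results[name] = "succeeded"
        except sqlite3.Error as exc:
            results[name] = f"failed: {exc}"
    return results


def index_names(conn):
    """List the indexes defined in a database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index'"
    ).fetchall()
    return [row[0] for row in rows]


# Stand-in target group: three databases that share the same schema.
targets = {}
for name in ["tenant_a", "tenant_b", "tenant_c"]:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
    targets[name] = conn

# One job step, written to be safe to re-run.
script = "CREATE INDEX IF NOT EXISTS ix_orders_customer ON orders (customer);"
results = run_job(targets, script)
print(results)
```

Writing the step so it can be re-run safely (here with IF NOT EXISTS) matters when a job may be retried against some databases and not others.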

So, a cool feature – it’s only in preview for now, so it’s sure to grow, and I’m excited about the direction.

Key Terminology in Azure Databricks

In a previous post, I talked about Azure Databricks and what it is. In review, Azure Databricks is a managed platform for running Apache Spark jobs. As it’s managed, that means you don’t have to worry about managing the cluster or running performance maintenance to use Spark, like you would if you were going to deploy a full HDInsight Spark cluster.

Databricks provides a simple-to-operate user interface for data scientists and analysts when building models, as well as a powerful API that allows for some automation. You can also run role-based access control with Active Directory for better user integration at a more granular scale. You don’t have to tear down an HDInsight cluster to use Spark jobs, as you can pause (or start) your resources on demand and scale up/out as needed.

In this post, I’ll run through some key Databricks terms to give you an overview of the different points you’ll use when running Databricks jobs:

    • Workspace – This is the central place that will allow you to organize all the work that’s being done. You can think of it as a ‘folder’ structure where you can save Notebooks and Libraries that you want to operate on and manipulate data with, and then share them securely with other users. Workspace is not meant for storing data; data should be stored in the data storage.
    • Notebooks – This is a set of any number of cells that allow you to execute commands with a programming language, such as Scala, Python, R or SQL; you can specify the language when you open a cell at the top of the Notebook. Here you can also create a dashboard that allows the output of the code to be shared rather than the code itself, and they can be scheduled as jobs for running pipelines, updating models or dashboards.
    • Libraries – These are packages or modules that provide additional functionality for developing various models for different types of analysis. This is similar to a traditional IDE like Visual Studio, where you have libraries you can plug in and add.
    • Tables – This is where the structured data is stored that you and your team will use for analysis. Tables can live in cloud storage or on the cluster that’s being used, or they can be cached in memory for faster processing of the data.
    • Clusters – Essentially a group of compute resources being used for operations like executing the code from Notebooks or Libraries. You can also pull in data from raw sources like cloud or structured/semi structured data or the data in the tables I mentioned above. Clusters can be controlled via access policies using Active Directory integration.
    • Jobs – Jobs are a tool that’s used to schedule execution within a cluster. These can be scripts using Python or JAR assemblies and you can create manual triggers that will send the jobs off or run them through a REST API.
    • Apps – Think of these as the third-party components that can tap into your Databricks cluster. A good scenario is visualizing the data with apps like Tableau or Power BI. You can consume the modules that you built and the output of the Notebooks or script that you ran to visualize that data.
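As a rough illustration of triggering a job through the REST API, the sketch below builds (but does not send) a request using only Python's standard library. To my understanding the Jobs API exposes a run-now endpoint at /api/2.0/jobs/run-now that takes a job_id, but check the current API reference before relying on it. The workspace URL, token placeholder and job ID here are hypothetical.

```python
import json
import urllib.request


def build_run_now_request(workspace_url, token, job_id):
    """Build a POST request that asks Databricks to run an existing job.
    The request is only constructed here, not sent."""
    body = json.dumps({"job_id": job_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{workspace_url.rstrip('/')}/api/2.0/jobs/run-now",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical workspace URL, token placeholder and job ID.
req = build_run_now_request(
    "https://example.azuredatabricks.net",
    "<personal-access-token>",
    42,
)
print(req.get_method(), req.full_url)

# Sending it would be: urllib.request.urlopen(req)
# That call is left out so the sketch runs without network access.
```

Keeping the token out of source code (for example, reading it from an environment variable or a secret store) is the safer pattern in practice.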

Azure Data Factory V2 in GA and New Features

Today I’m excited to talk about the general availability of Azure Data Factory V2, as well as some new features that have been added over the last couple months. If you don’t know, Azure Data Factory Version 2 added some new features that V1 didn’t have.

With ADF V2 you get a browser-based interface using drag and drop technology; V1 was primarily done in the Visual Studio IDE. It also added triggers for scheduling, so you can schedule your jobs when required and in additional ways (which I’ll discuss further in a bit).

Some other features of ADF V2 that came out as it became generally available:

  • Lift and Shift operations for your SSIS packages, so if you have SSIS packages local, you can now Lift and Shift those into compute with the integration runtime service in Data Factory.
  • This also allows for cloud to cloud, cloud to prem, prem to prem and some third-party tools are supported within that as well.
  • Control flow activities like branching, looping, conditional execution and parameterization.
  • Integration with HDInsight Spark and Databricks for big data workloads and data science.

Some features that have come out more recently:

  • Integration with Key Vault, which gives you the ability to encrypt keys and small secrets like passwords. You can create a Linked Service to a Key Vault and reference the passwords you need rather than storing them in source files, text files or a PowerShell script where they’re exposed. So, you can use Key Vault to reference back and run workloads without having to expose those passwords.
  • The ability to monitor Data Factory using OMS, Microsoft’s cloud-based management solution that helps you manage and protect your on-prem and cloud infrastructure. This is quick and easy to set up and allows you to reach in to different types of applications in Azure and give you additional visibility and control for things like log analytics, automation, data protection and recovery, as well as security and compliance.
  • You can monitor the overall health of your Data Factories and be able to drill in, see the details and troubleshoot if you’re having problems. This is all enabled through Azure Analytics, so you turn on your Azure Analytics and Data Factory, then hook those into your OMS suite and you can monitor it from that central management point.
  • Event based triggering with integration through Data Factory. Event-driven architecture is a common data integration pattern that involves the production, detection and consumption of events, and the reaction to them. Instead of having to schedule a timed trigger, you can monitor for a blob creation or deletion, so when a file lands in storage you can trigger your pipeline based on that.
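The event-based trigger pattern can be sketched in a few lines of plain Python. This is not Data Factory code: the event dictionaries, the container names, the file suffix filter and the pipeline function are all hypothetical. It simply shows a pipeline starting in reaction to a matching "blob created" event rather than on a timer.

```python
def make_trigger(container, suffix, pipeline):
    """Return a handler that starts the pipeline only for blob-created
    events in the given container whose name ends with the suffix."""

    def on_event(event):
        if (
            event["type"] == "BlobCreated"
            and event["container"] == container
            and event["blob"].endswith(suffix)
        ):
            return pipeline(event["blob"])
        return None  # event ignored

    return on_event


runs = []


def load_sales_pipeline(blob_name):
    """Stand-in for a pipeline run; just records what it was started with."""
    runs.append(blob_name)
    return f"run started for {blob_name}"


trigger = make_trigger("landing", ".csv", load_sales_pipeline)

# Hypothetical storage events arriving over time.
events = [
    {"type": "BlobCreated", "container": "landing", "blob": "sales/2019-01.csv"},
    {"type": "BlobDeleted", "container": "landing", "blob": "sales/2018-12.csv"},
    {"type": "BlobCreated", "container": "archive", "blob": "sales/2019-01.csv"},
    {"type": "BlobCreated", "container": "landing", "blob": "notes.txt"},
]
outcomes = [trigger(event) for event in events]
print(runs)
```

Only the first event matches all three conditions, so the pipeline runs once and the other events are ignored.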

Azure Data Factory V2 is a neat technology and I’m interested to see where it goes as I’m sure that more features will be coming. If you have questions about Azure Data Factory or any of the new Azure resources, we are the people to talk with. We’re doing a lot of work with our clients using Azure tools and we’d love to talk to you about how we can get you using Azure in your organization.