Overview of HDInsight Kafka

Continuing with my HDInsight series, today I’ll be talking about Kafka. HDInsight Kafka will sound much like Storm, but as I get into the nuts and bolts you’ll see the differences. Kafka is an open-source distributed streaming platform that can be used to build real-time data streaming pipelines and applications, with message broker functionality similar to a message queue.

Some specific Kafka improvements with HDInsight:

  • A 99.9% uptime SLA from HDInsight
  • You get 16-terabyte managed disks, which increases the scale and reduces the number of required nodes compared with traditional Kafka clusters, which would have a limit of 1 terabyte.
  • Kafka takes a single-rack view, but Azure is designed in two dimensions, with update and fault domains. So Microsoft designed special tools to rebalance the partitions and replicas across them. Once you scale out, you repartition your data so you can take advantage of the additional nodes; the same applies when you scale down.
  • Kafka allows you to change the number of worker nodes to scale up or down depending on the workload, and this can be done through the portal, PowerShell or any automation tool within Azure.
  • Direct integration with Azure Log Analytics. This looks at virtual machine-level information, like disk and network metrics, and lets you roll that up into the Microsoft OMS suite for global log analytics. So, when you’re looking at all your resources in Azure through OMS, you can see things at a high level and also drill in for more detail.
  • ZooKeeper manages the state of the cluster, which helps with concurrency, resiliency and low-latency transactions, as well as the orchestration of data through the nodes and clusters.
  • Records are stored in topics, which are produced to by producers and consumed by consumers. Producers send records to Kafka brokers, and each worker node in the cluster is a broker. These brokers are what help the data move around inside the cluster (see the sketch right after this list).
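
To make that producer/consumer flow concrete, here’s a minimal sketch using the open-source kafka-python client. The broker address and topic name are placeholders I made up for the example; on an HDInsight Kafka cluster you’d point it at your worker-node broker endpoints instead.

```python
# Minimal Kafka producer/consumer sketch using the kafka-python client.
# The broker address and topic name below are placeholders, not real endpoints.
from kafka import KafkaProducer, KafkaConsumer

BROKERS = ["wn0-kafka.example.com:9092"]  # hypothetical broker endpoint
TOPIC = "sensor-events"                   # hypothetical topic

# Producer: send a few records to the topic.
producer = KafkaProducer(bootstrap_servers=BROKERS)
for i in range(5):
    producer.send(TOPIC, value=f"reading {i}".encode("utf-8"))
producer.flush()  # make sure the records actually reach the brokers

# Consumer: read the records back from the beginning of the topic.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating if no new records arrive
)
for record in consumer:
    print(record.partition, record.offset, record.value.decode("utf-8"))
```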

Again, Kafka and Storm sound relatively similar; here are some major differences:

    • Storm came out of Twitter; Kafka out of LinkedIn. But both run on the Hadoop platform and are open source, so companies can build their own iterations.
    • Storm is meant more for real-time message processing; Kafka is for distributed message processing.
    • Storm can take data from Kafka and other database systems and process it; Kafka takes in streams from sources like Facebook, Twitter and LinkedIn.
    • Kafka is a message broker; Storm’s primary use is stream processing.
    • In Storm there is no data storage; you can only stream data through it. Kafka stores the data on the file system. As those streams are processed, Storm can do it much faster, at a micro-batch level, while Kafka works in small batches that are larger than micro-batches.
    • As far as dependencies, Kafka requires ZooKeeper for all of its orchestration; Storm keeps no external data store of its own, though an HDInsight Storm cluster also uses ZooKeeper nodes for coordination (more on that in the Storm overview below).
    • Storm has a latency of milliseconds; with Kafka it depends on the source of the data, but it typically takes somewhere just under 1-2 seconds. So with Kafka you’re keeping the data local, processing it, then pushing it somewhere else, whereas with Storm you’re processing the data in motion as you push it somewhere else.

Basically, two different ways to solve similar problems depending on the use case. It apparently worked better for LinkedIn to design it this way as opposed to the way that Twitter handles their data.

 

Overview of HDInsight Storm

Next in my series on HDInsight, today I’ll be talking about Storm. HDInsight Storm is a distributed stream processing computational framework. It uses spouts, which define information sources, and bolts, which perform manipulations on the data, to allow distributed processing of streaming data.

Think of its topology in the shape of a directed acyclic graph (DAG), where the edges are named streams that direct the data from node to node. When you put it all together, it creates the data transformation pipeline.

When you break it down, its topology is like that of map/reduce jobs; the difference is that map/reduce jobs run in individual batches while Storm processes data continuously in real time.

The Storm cluster has two different types of nodes. There’s a master node, which runs Nimbus, a service that assigns tasks to machines and monitors their performance. Worker nodes each run a Supervisor service, which starts and stops worker processes on that node as Nimbus directs.

Nimbus and the Supervisors can’t keep track of the cluster’s state and health on their own, so the cluster also deploys ZooKeeper nodes, which coordinate between Nimbus and the Supervisors and keep an eye on things.

The 3 main components of Storm are:

1. The topology, which is basically a network of spouts and bolts connected by streams.

2. The stream, which is an unbounded sequence of tuples.

3. The spout, which is the source of the data; it converts the data into streams of tuples and sends them on to bolts to be processed (see the conceptual sketch after this list).
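
Storm’s own APIs are Java/Scala (with other languages available through multi-language adapters), but the spout-to-bolt flow is easy to picture with a plain Python sketch. This is only a conceptual illustration of the data flow, not the Storm API; all the names here are made up for the example.

```python
# Conceptual illustration of the spout -> bolt -> bolt data flow in a topology.
# This is NOT the Storm API; it just mimics how tuples move along named streams.

def sensor_spout():
    """Spout: an unbounded source that emits tuples (here, a small finite sample)."""
    readings = [("device-1", 21.5), ("device-2", 48.0), ("device-1", 22.1)]
    for device_id, temperature in readings:
        yield (device_id, temperature)  # each yielded item is one tuple

def filter_bolt(tuples, threshold=40.0):
    """Bolt: keep only readings above a threshold."""
    for device_id, temperature in tuples:
        if temperature > threshold:
            yield (device_id, temperature)

def alert_bolt(tuples):
    """Bolt: terminal step that acts on each tuple (here, just print an alert)."""
    for device_id, temperature in tuples:
        print(f"ALERT: {device_id} reported {temperature} degrees")

# Wiring the 'topology': spout -> filter bolt -> alert bolt.
alert_bolt(filter_bolt(sensor_spout()))
```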

What makes this effective is that the processing engine guarantees that every tuple will be fully processed and delivered, and Microsoft backs it with a 99.9% uptime SLA. It does this by tracking the lineage of each tuple as it makes its way through the topology. It works like a queuing system in that messages can be replayed if there’s a failure in delivery.

Some use cases for Storm:

    • Writing the data into an Azure Data Lake Store after it gets processed.
    • Writing to Azure Event Hubs, as well as processing events from it. It can take vehicle sensor data, for instance, pull it in from Event Hubs, process it, then send the results to Cosmos DB or an Azure Storage blob.
    • Twitter is using Storm in a variety of ways. They use it for discovery on their data, running real time analytics and personalization in real time, so when you log into Twitter it knows your preferences based on past visits. It also works for real time Search and for their own internal revenue optimization.

As with other HDInsight components, it’s used alongside various topologies to solve and satisfy big data requirements and workloads. For example, if you were doing customer churn analysis in real time based on a Twitter feed, this would be a technology you would use alongside Hadoop.

Overview of HDInsight HBase

In continuation of my series on HDInsight and the different clusters within it, today I’ll cover HBase. HBase is a NoSQL database that provides random access and strong consistency for structured, unstructured and semi-structured data.

It’s a schema-less database, organized by families of columns. Another way to describe it is that it’s modeled after Google’s Bigtable, where data is stored in the rows of a table and then grouped by column family. As it’s schema-less, neither the columns themselves nor the data types inside the columns need to be defined before using the data.
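
To make the column-family model concrete, here’s a minimal sketch using the happybase Python client, which talks to an HBase Thrift gateway. The host name, table, row keys and column qualifiers are placeholders, and the assumption is that a Thrift endpoint is reachable for your cluster.

```python
# Minimal HBase sketch with the happybase client (talks to an HBase Thrift gateway).
# Host, table and column names below are placeholders for the example.
import happybase

connection = happybase.Connection("hbase-thrift.example.com")  # hypothetical host

# Create a table with a single column family 'info' (skip if it already exists).
if b"contacts" not in connection.tables():
    connection.create_table("contacts", {"info": dict()})

table = connection.table("contacts")

# Rows are keyed, and each cell lives under 'family:qualifier'.
# New qualifiers can be added at any time -- no schema change needed.
table.put(b"row-001", {b"info:name": b"Ada", b"info:city": b"Boston"})
table.put(b"row-002", {b"info:name": b"Grace", b"info:email": b"grace@example.com"})

# Random access by row key, with strong consistency for the read.
print(table.row(b"row-001"))

# Scan a range of row keys.
for key, data in table.scan(row_start=b"row-", row_stop=b"row-999"):
    print(key, data)
```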

Some other key things to be aware of with HBase:

  • As with all the HDInsight components, this gets implemented as a managed cluster and a Platform as a Service offering, in which we can separate compute nodes from storage.
  • It has a scale-out architecture that provides automatic sharding, or horizontal partitioning, of tables, where rows of a table are stored separately across nodes rather than splitting out columns as we would in typical table normalization.
  • Strong consistency for reads and writes, as it’s part of the architecture of HBase.
  • Automatic failover built in, so workloads can fail over across multiple nodes and even multiple clusters.
  • In-memory caching for reads and writes, which helps with performance, as well as moving your data in and out more quickly.

Some of the most common workloads:

    • A search engine like I mentioned with Google’s Bigtable, which builds indexes that map terms to webpages that contain them.
    • A key value store. Facebook uses HBase for their messaging system because it’s ideal for storing and managing internet communications.
    • Also, a good repository for collecting sensor data, where large amounts of data are pulled into this NoSQL table and can be used to build dashboards for reporting.

I still have a few HDInsight technologies to cover in this series. Many of these are interrelated and work together to complete and update a data architecture.

 

Overview of HDInsight Spark

Today I’m continuing my series on HDInsight with the focus on Spark clusters. HDInsight Spark clusters provide the required baseline for in-memory cluster computing. This technology has gained momentum over the last few years as hardware has improved and the amounts of memory available have increased.

So, being able to load a large amount of data into memory has become much more practical. Holding the data in memory allows us to load and cache it, so it’s much more responsive when we’re working with it, querying it or visualizing it, for instance.
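
Here’s a minimal PySpark sketch of that load-and-cache pattern. The file path and column names are placeholders I made up for the example.

```python
# Minimal PySpark sketch: load a dataset, cache it in memory, then query it twice.
# The path and column names are placeholders for the example.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Read a CSV into a DataFrame; header/inferSchema keep the example self-contained.
sales = (spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("wasbs:///data/sales.csv"))  # hypothetical path on cluster storage

sales.cache()        # mark the DataFrame for in-memory caching
sales.count()        # materialize the cache with a first action

# Subsequent queries hit the cached data instead of re-reading from storage.
sales.groupBy("region").sum("amount").show()
sales.filter(sales["amount"] > 1000).count()
```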

Some benefits and features of HDInsight Spark are:

  • Spark provides access to the Scala programming language. This allows us to work with distributed data sets as if they were local collections, and it doesn’t require us to structure everything as map and reduce operations, making our operations more responsive and efficient.
  • Quick deployment. You can deploy a Spark cluster, as with other Azure PaaS offerings, through the Azure portal. You can also do it through scripting, PowerShell or Azure Automation.
  • Native integration with Zeppelin and Jupyter notebooks for your processing and visualization.
  • A REST API service allows for remote orchestration and job submission.
  • Azure Data Lake support, allowing us to separate compute from storage, which lends itself to scalability. When compute and storage are handled separately, you can tear down your compute clusters, or nodes, and add new ones if you want to scale up/down. Then you can reattach to that storage without losing any of the work that you’ve done.
  • As a PaaS offering, it integrates easily with other Azure services, like Event Hubs or HDInsight Kafka (which I’ll cover later this week) for data streaming applications.
  • Support for concurrent queries, which allows us to take better advantage of the processing power of the nodes.
  • Native Power BI integration for visualization purposes; connecting directly to a Spark cluster from Power BI.
  • Pre-loaded with Anaconda, which provides about 200 libraries for things like Machine Learning, advanced analytics and visualizations.

Best uses for Spark:

    • As with other workloads for big data, the in-memory processing allows us to do interactive data analysis and create business solutions. It uses that in-memory processing engine to deliver more responsive reports and data visualizations.
    • It has machine learning capability with built-in support for Jupyter and Zeppelin notebooks.
    • It comes pre-loaded with the Anaconda distribution, with about 200 canned libraries, so you can jump in and start using it quickly.
    • It handles streaming and real-time data workloads. You can hook into your Event Hubs or Kafka queue, bring in your data and report on it in real-time scenarios (see the streaming sketch after this list). This is great if you’re using IoT; it’s much more responsive than waiting for an ETL refresh.
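
As a sketch of that streaming pattern, here’s a minimal PySpark Structured Streaming job that reads from a Kafka topic (for instance, the HDInsight Kafka cluster covered earlier) and writes a running aggregate to the console. The broker address and topic name are placeholders, and the job assumes the spark-sql-kafka connector package is available on the cluster.

```python
# Minimal Structured Streaming sketch: read a Kafka topic and aggregate in real time.
# Broker address and topic are placeholders; requires the spark-sql-kafka package.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "wn0-kafka.example.com:9092")
          .option("subscribe", "sensor-events")
          .load())

# Kafka gives key/value as binary; cast the value to a string for counting.
counts = (events
          .select(col("value").cast("string").alias("reading"))
          .groupBy("reading")
          .count())

# Stream the running counts to the console; in practice this could go to a sink
# like Cosmos DB or blob storage instead.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())

query.awaitTermination()
```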

Be sure to check out my next post on HDInsight HBase.

Overview of HDInsight Hadoop

In upcoming posts, I’ll begin a series focusing on Big Data and the Azure HDInsight offerings. If you don’t know, HDInsight is a fully managed, full-spectrum, open-source analytics service for enterprises that allows you to use open source frameworks such as Hadoop, Spark and Hive, among others. It was introduced to Azure in 2013, and more recent options have been added, such as domain-joined cluster capabilities.

Today’s focus is on HDInsight Hadoop. What we’re talking about here is being able to work with big data workloads. These large amounts of data can be structured, unstructured or semi-structured data, like table structures, documents or photos.

It can be historical data that you’re looking to analyze or streaming data that’s arriving in real time. The goal is for you to process the data and generate information from it. Some advantages are:

  • It’s a cloud-native Platform as a Service (PaaS) offering within Azure.
  • Lower cost and scalability because of the separation of compute and storage. You can store your data there but tear down the clusters, so you’re not paying for compute when they’re not running. You can also keep your storage and reattach to it with additional nodes to get scalability.
  • Security and compliance with government regulations.
  • You can monitor the system within Azure. If you add on the Enterprise Security Package, you have the capability to do some monitoring within the system, as well as set up user accounts that tie into your Active Directory.
  • It’s globally available, including the Azure Government, China and Germany clouds.

Some of the uses for Hadoop HDInsight are:

  • Batch processing ETL
  • Data Warehousing
  • IoT
  • Streaming of data and processing – A use case example here is Toyota. They used this for their Connected Car Architecture Program, where they were able to monitor their cars and stream the data into an HDInsight cluster.
  • It’s also increasingly used for data science workloads, as you get these massive data sets that you want to do data processing and analytics on, or a combination of things, like wanting to run data science and machine learning on streaming data to do predictive analytics on what might happen next.

Another benefit is that HDInsight clusters support multiple programming languages, like Java, Python, Scala, Pig Latin, HiveQL and Spark SQL; basically, all the common programming languages in the open source community, which allow you to take advantage of this great, high-performing technology for big data workloads.

Coming up, I’ll discuss some of the cluster types available, such as HDInsight Spark, HBase, Storm, Kafka, Interactive Query and R-Server.

 

Hybrid Cloud Strategies and Management

Are you running a hybrid environment between on-premises and Azure? Do you want to be? In a recent webinar, Sr. Principal Architect Chris Seferlis answered the question: How can my organization begin using hybrid cloud today? In this webinar, he defines the four key pillars of a true hybrid approach (identity, security, data platform and development) and shows actionable resources to help get you started down the hybrid road.

Hybrid cloud presents great opportunity for your organization and is the path most are going down:

80% of enterprises see themselves operating hybrid clouds for the foreseeable future

58% of enterprises have a hybrid cloud strategy (up from 55% a year ago)

87% of organizations are planning to integrate on-premises datacenters with public cloud

In this in-depth webinar, Chris covers:

Hybrid Identity with Windows Server Active Directory and Azure Active Directory – Identity is the new control plane. We’ve all got lots of services, devices and internal apps, and firewalls do not protect your users in the cloud.

With Azure AD you:

  • Have 1000s of apps with 1 identity
  • Enable business without borders
  • Manage access at scale
  • Have cloud-powered protection

Security – Better security starts at the OS – protect identity, protect the OS on-premises or in the cloud, help secure virtual machines.

Coupling Windows Server 2016 with Azure enables security for your environment at cloud speed.

Azure enables rapid deployment of built-in security controls, as well as products and services from security partners, and provides integration of partner solutions. Microsoft is making a major commitment to integration with third-party tools for ease of transition and a true hybrid approach.

Data and AI – AI investment increased by 300% in 2017. Organizations that harness data, cloud and AI out-perform and out-innovate their peers, with nearly double the operating margin.

This webinar will tell you how to transform your business with a modern data estate.

Other areas covered are:

Azure Stack – the 1st consistent hybrid cloud platform

Hybrid Protection with Azure Site Recovery – Azure reduces the common challenges of cost and complexity with increased compliance.

Azure File Sync – If you’re using a file server on-prem, let’s make it better with Azure.

Project Honolulu – A modern management platform to manage on-prem and Azure.

This webinar is chock-full of information to get you on the right path to running a hybrid environment between on-premises and Azure. Watch the complete webinar here and click here to download the slides from the session. If you want to learn more about hybrid cloud strategies, contact us – we’re here to help.

Azure Enterprise Security Package for HDInsight

In today’s post I’d like to talk about the Enterprise Security Package for Azure HDInsight. HDInsight is a managed cloud Platform as a Service offering built on the Hadoop framework. It allows you to build big data solutions using Hadoop, Spark, Hive, LLAP and R, among others.

Let’s think about the traditional deployment of Hadoop. In a traditional deployment, you would deploy a cluster and give users local admin and SSH access to that cluster. Then you would hand it over to the data scientists so they could do what they needed to run their data science workloads: train models, run scripts and such.

As these types of big data workloads were adopted into the enterprise, the need for enterprise security grew. There was a need for role-based access control with Active Directory permissions. Admins wanted greater visibility into who was accessing the data and when, as well as what they tried to get into and whether they were successful – basically all those audit requirements that come with working with large data sets.

Who is the leader in enterprise security? Microsoft, of course, with Active Directory. The Enterprise Security Package allows you to add the cluster to the domain during the creation process, as a sort of ‘add-on’ in the Azure portal. Other things it allows you to do are:

  • Add an HDI cluster with Active Directory Domain Services.
  • Role-based access control for Hive, Spark and Interactive Hive using Apache Ranger.
  • Specific file and folder permissions for the data inside an Azure Data Lake Store.
  • Auditing of logs to see who has access to what and when.

Currently, these features are only available for Spark, Hadoop and Interactive Query workloads, but more workloads will be adopted soon.

How and When to Scale Up/Out Using Azure Analysis Services

Some of you may not know when or how to scale up or scale out your Azure Analysis Services server. Today I’d like to help you understand when and how to do each. First, you need to decide which tier you should be using. You can do that by looking at the QPUs (Query Processing Units) of each tier on Azure. Here’s a quick breakdown:

  • Developer Tier – gives you up to 20 QPUs
  • Basic Tier – a mid-scale tier, not meant for heavy loads
  • Standard Tier (currently the highest available) – allows you more capability and flexibility

Let’s start with when to scale up. You need to scale up when your reports are slow; say you’re reporting out of Power BI and the throughput isn’t meeting your needs. What you’re doing with scaling up is adding more resources. The QPU measure is a combination of your CPU, memory and other factors like the number of users.

Memory checks are straightforward. You run the metrics in the Azure portal and you can see what your memory usage is and whether your memory limit or memory hard settings are being saturated. If so, you need to either upgrade your tier or adjust the level within your current tier.

CPU bottlenecks are a bit tougher to figure out. You can get an idea by watching your QPUs to see if you’re saturating them, using the metrics and logs within the Azure portal. Then you want to watch your processing pool job queue length and your processing pool busy non-I/O threads. This should give you an idea of how it’s performing.

For the most part, you’re going to want to scale up when the processing engine is taking too long to process the data to build your models.

Next up, scaling out. You’ll want to scale out if you’re having problems with report responsiveness because the reporting requirements are saturating what you currently have available. Typically, in cases with a large number of users, you can fix this by scaling out and adding more nodes.

You can add up to 7 additional query replicas; these are read-only replicas that you can report off of, while the processing is handled on the initial instance of Azure Analysis Services and subsequent queries are distributed across those query replicas. Hence, processing doesn’t affect the responsiveness of the reports.

Once the model processing is separated from the query engine, you can measure performance by watching the log analytics and query processing units to see how they’re performing. If you’re still saturating those, you’ll need to re-evaluate whether you need additional QPUs or to upgrade your tier.

Something to keep in mind is that once you’ve processed your data, you must resynchronize it across all of those query replicas. So, if you’re going to be processing data throughout the day, it’s a good idea not only to run that processing, but also to strategically schedule the synchronizations as well.

Also important to know is that scale-out requires the Standard tier; Basic and Developer will not work for this purpose. There are some interesting resources out there that allow you to auto-scale on a schedule using a PowerShell runbook. It uses your Azure Automation account to schedule when it’s going to scale up or out based on the needs of the environment. For example, if you know you’ll need additional processing power to run your queries efficiently on Monday mornings, you’ll want to set up a schedule for that time and then scale back afterward.
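
The approach above uses a PowerShell runbook, but to illustrate the idea, here’s a rough Python sketch that changes the server’s tier and replica count by patching its SKU through the Azure Resource Manager REST API. The subscription, resource group, server name, API version and token handling are all placeholders and assumptions; in practice you’d use the official Az PowerShell modules or Azure SDK instead.

```python
# Illustrative sketch only: scale an Azure Analysis Services server by PATCHing its
# SKU through the Azure Resource Manager REST API. The IDs, api-version and token
# acquisition below are placeholders/assumptions; a PowerShell runbook with the
# official Az modules is the approach described in the post.
import requests

SUBSCRIPTION = "00000000-0000-0000-0000-000000000000"  # hypothetical
RESOURCE_GROUP = "analytics-rg"                         # hypothetical
SERVER = "myaasserver"                                  # hypothetical
API_VERSION = "2017-08-01"                              # assumed ARM api-version

url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION}"
    f"/resourceGroups/{RESOURCE_GROUP}"
    f"/providers/Microsoft.AnalysisServices/servers/{SERVER}"
    f"?api-version={API_VERSION}"
)

def scale(token: str, tier_name: str, instances: int) -> None:
    """Scale up by changing the SKU name, scale out by changing the instance count."""
    body = {"sku": {"name": tier_name, "capacity": instances}}
    resp = requests.patch(
        url,
        json=body,
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()

# Example: Monday morning, move to S1 with 2 instances; scale back down later.
# scale(token, "S1", 2)
# scale(token, "S0", 1)
```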

Another note is that you can scale up to a higher tier, but you cannot automatically scale back down if you’re running a script. Still, this ability allows you to be prepared for additional requirements in that environment.

I hope this helped with questions you have about scaling up and out.

 

Azure Data Factory vs Logic Apps

Customers often ask, should I use Logic Apps or Data Factory for this? Of course, the answer I give is the same as with most technology: it depends. What is the business use case we’re talking about?

Logic Apps can help you simplify how you build automated, scalable workflows that integrate apps and data across cloud and on-premises services. Azure Data Factory is a cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation. Similar definitions, so that probably didn’t help at all, right?

Let me try to clear up some confusion. There are some situations where the best-case scenario is to use both, such as where a feature is lacking in Data Factory but can be found in Logic Apps, since Logic Apps has been around longer. A great use case is alerting and notifications, for instance. You can use a Web activity in Data Factory to call a Logic App, which sends an email notification back to a user to say a job has completed or failed (a sketch of that call follows below).
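
As a rough sketch of that hand-off, here’s the kind of HTTP call a Data Factory Web activity would make to a Logic App that starts with a "When an HTTP request is received" trigger and ends with a "Send an email" action. The callback URL and payload fields are placeholders I made up; in Data Factory itself this would be configured as a Web activity rather than Python code.

```python
# Sketch of calling a Logic App HTTP trigger to send a job-status notification.
# The URL and payload fields are placeholders; in practice the Logic App's
# 'When an HTTP request is received' trigger provides the real callback URL,
# and a Data Factory Web activity would make this call instead of Python.
import requests

LOGIC_APP_URL = "https://prod-00.eastus.logic.azure.com/workflows/<id>/triggers/manual/paths/invoke"  # placeholder

payload = {
    "pipelineName": "CopySalesData",      # hypothetical pipeline name
    "status": "Failed",                   # e.g. Succeeded / Failed
    "message": "Copy activity timed out after 30 minutes.",
    "recipient": "data-team@example.com",
}

resp = requests.post(LOGIC_APP_URL, json=payload, timeout=30)
resp.raise_for_status()
print("Notification request accepted:", resp.status_code)
```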

To answer the question of why I would use one over the other, I’d say it comes down to how much data we’re moving and how much transformation we need to do on that data to make it ready for consumption. Are we reporting on it, putting it in Azure Data Warehouse, building some facts and dimensions and creating our enterprise data warehouse then reporting off of that with Power BI? This would all require a decent amount of heavy lifting. I would not suggest a Logic App for that.

If you’re monitoring a folder on-prem or in OneDrive and you’re looking to see when files get posted there, and you want to simply move that file to another location or send a notification about an action on the file, this is a great use case for a Logic App.

However, the real sweet spot is when you can use them together, as it helps you maximize cost efficiency. Depending on what the operation is, it can be more or less expensive to run it in Data Factory versus Logic Apps.

You can also make your operations more efficient. Utilize the power of Azure Data Factory, with its SSIS integration runtime and a feature set that includes things like Databricks and HDInsight clusters, where you can process huge amounts of data with massively parallel processing, or use your Hadoop file stores for reporting on structured, unstructured or semi-structured data. Logic Apps can then help you enhance the process.

Clear as mud, right? Hopefully I was able to break it down a bit better. To put it in simple terms: when you think about Logic Apps, think about business applications; when you think about Azure Data Factory, think about moving data, especially large data sets, transforming the data and building data warehouses.