Overview of HDInsight Kafka

Continuing with my HDInsight series, today I’ll be talking about Kafka. HDInsight Kafka will sound much like Storm, but as I get into the nuts and bolts you’ll see the differences. Kafka is an open source distributed streaming platform that can be used to build real-time data streaming pipelines and applications, and it provides message broker functionality, like a message queue.

Some specific Kafka improvements with HDInsight:

  • 99.9% uptime from HDInsight
  • You get 16 terabyte managed disks, which increases the scale and reduces the number of required nodes compared to traditional Kafka clusters, which would have a limit of 1 terabyte.
  • Kafka takes a single-rack view, but Azure is designed in two dimensions, with update and fault domains. To bridge that, Microsoft designed special tools to rebalance the partitions and replicas. Once you scale out, you repartition your data so you can take advantage of the additional nodes; the same applies when you scale down.
  • Kafka allows you to change the number of worker nodes to scale up/down depending on the workload, and this can be done through the portal, PowerShell or any automation tool within Azure.
  • Direct integration with Azure Log Analytics. This looks at virtual machine-level information like the disk and the network. The importance of this is it allows you to roll that up into the Microsoft OMS suite for global log analytics. So, when you’re looking at all your resources in Azure through OMS, it helps you see things at a high level and also drill in for more details.
  • ZooKeeper manages the state of the cluster, which helps with concurrency, resiliency and low-latency transactions, as well as the orchestration of the data through the nodes and clusters.
  • Records are stored in topics; they’re produced by producers and consumed by consumers. The producers send records to Kafka brokers, and each worker node in the cluster is a broker. These brokers are what help the data move around inside the cluster (see the sketch below).
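
To make that producer/consumer flow concrete, here is a minimal sketch using the kafka-python client. The broker address and topic name are placeholders I’ve assumed for illustration; on an HDInsight cluster you’d point at your cluster’s broker hosts instead.

    # Minimal producer/consumer sketch with kafka-python (pip install kafka-python).
    # "broker-host:9092" and "sensor-readings" are assumed placeholders, not real endpoints.
    from kafka import KafkaProducer, KafkaConsumer

    # Producer: send a few records to a topic on the broker.
    producer = KafkaProducer(bootstrap_servers="broker-host:9092")
    for i in range(3):
        producer.send("sensor-readings", key=str(i).encode(), value=f"reading {i}".encode())
    producer.flush()  # make sure everything is actually written to the brokers

    # Consumer: read the records back from the beginning of the topic.
    consumer = KafkaConsumer(
        "sensor-readings",
        bootstrap_servers="broker-host:9092",
        auto_offset_reset="earliest",
        consumer_timeout_ms=5000,  # stop iterating when no new messages arrive
    )
    for msg in consumer:
        print(msg.key, msg.value)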

Again, Kafka and Storm sound relatively similar; here are some major differences:

    • Storm came out of Twitter; Kafka out of LinkedIn. Both are open source projects in the Hadoop ecosystem, so each company could build its own iterations.
    • Storm is meant more for real-time message processing; Kafka is for distributed message processing.
    • Storm can take data from Kafka and other database systems and process it; Kafka takes in streams from sources like Facebook, Twitter and LinkedIn.
    • Kafka is a message broker; Storm’s primary use is stream processing.
    • In Storm there is no data storage; you can only stream data through it. Kafka stores the data on the file system. When it comes to processing those streams, Storm can do it much faster, at a micro-batch level, while Kafka works in small batches that are larger than micro-batches.
    • As far as dependencies, Kafka requires ZooKeeper for all the orchestration; Storm does not depend on anything external.
    • Storm has a latency of milliseconds; with Kafka it depends on the source of the data, but it typically takes 1-2 seconds or slightly less. So with Kafka you’re keeping the data local, processing it, then pushing it somewhere else, whereas with Storm you’re processing the data in motion as you push it somewhere else.

Basically, two different ways to solve similar problems depending on the use case. It apparently worked better for LinkedIn to design it this way as opposed to the way that Twitter handles their data.

 

Overview of HDInsight Storm

Next in my series on HDInsight, today I’ll be talking about Storm. HDInsight Storm is a distributed stream processing computational framework. It uses spouts, which define information sources, and bolts, which perform manipulations and processing, to allow distributed processing of streaming data.

Think of its topology in the shape of a directed acyclic graph (DAG), where the edges are named streams and direct the data from node to node. When you put it all together, it creates the data transformation pipeline.

When you break it down, its topology is similar to that of map/reduce jobs; the difference is that map/reduce jobs run in individual batches, while Storm processes data continuously, in real time.

The Storm cluster has two different types of nodes. The master node runs a daemon called Nimbus, which assigns tasks to machines and monitors their performance. Each worker node runs a daemon called Supervisor, which starts and stops worker processes on that node as Nimbus assigns it work.

The Storm cluster can’t track its own state and health on its own, so it deploys ZooKeeper nodes to coordinate between Nimbus and the Supervisors and keep an eye on things.

The 3 main components of Storm are:

1. The topology, which is basically a network of spouts, bolts and streams.

2. The stream, which is an unbounded pipeline of tuples.

3. The spout, which is the source of the data; it converts the data into a stream of tuples and sends them on to bolts to be processed (see the sketch below).
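
To make the spout/bolt/stream idea concrete, here is a tiny, framework-free Python sketch of the same shape. It isn’t Storm code (a real topology would use Storm’s Java or Python APIs); it just illustrates a spout emitting tuples that flow through two bolts in a DAG.

    # Conceptual sketch of a Storm-style topology: spout -> split bolt -> count bolt.
    # This is plain Python for illustration only, not the Storm API.
    from collections import Counter

    def sentence_spout():
        """Spout: the source of the data, emitting a stream of tuples
        (here just a short, finite list of sentences)."""
        for sentence in ["storm processes streams", "kafka stores streams"]:
            yield (sentence,)

    def split_bolt(stream):
        """Bolt: splits each sentence tuple into word tuples."""
        for (sentence,) in stream:
            for word in sentence.split():
                yield (word,)

    def count_bolt(stream):
        """Bolt: terminal step that aggregates the word tuples."""
        return Counter(word for (word,) in stream)

    # Wiring the edges of the DAG: spout -> split -> count.
    print(count_bolt(split_bolt(sentence_spout())))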

What makes this effective is that the processing engine guarantees every tuple will be fully processed and delivered, and Microsoft backs the service with a 99.9% uptime SLA. It does this by tracking the lineage of each tuple as it makes its way through the topology. It works like a queuing system in that messages can be replayed if there’s a failure in delivery.

Some use cases for Storm:

    • Writing the data after it gets processed into an Azure Data Lake Store.
    • As a source and sink for Azure Event Hubs, processing events as they arrive. It can take vehicle sensor data, for instance, process it from Event Hubs, then send the results to Cosmos DB or an Azure Storage blob.
    • Twitter is using Storm in a variety of ways. They use it for discovery on their data, running analytics and personalization in real time, so when you log into Twitter it knows your preferences based on past visits. It also works for real-time search and for their own internal revenue optimization.

As with other HDInsight components, it’s used among various topologies to solve and satisfy big data requirements and workloads. For example, if you were doing a customer churn analysis in real time based on a Twitter feed, this would be a technology you would use alongside Hadoop.

Overview of HDInsight HBase

In continuation of my series on HDInsight and the different clusters within it, today I’ll cover HBase. HBase is a NoSQL database that provides random access and strong consistency for structured, unstructured and semi-structured data.

It’s a schema-less (or organized by families of columns) database. Another way to describe it is that it’s modeled after Google’s Bigtable, where data is stored in the rows of a table and then grouped by a column family. As it’s schema-less, neither the columns themselves nor the data types inside the columns need to be defined before using the data.
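
As a quick illustration of that row key/column family model, here is a minimal sketch using the happybase Python client for HBase. The host, table name and column family are assumptions for the example, and the table is assumed to already exist with a 'cf' column family.

    # Minimal HBase sketch with happybase (pip install happybase).
    # Host, table name and the 'cf' column family are assumed placeholders.
    import happybase

    connection = happybase.Connection("hbase-host")  # Thrift endpoint for the cluster
    table = connection.table("web_pages")

    # Write: a row key, plus values grouped under a column family ("cf").
    # Columns don't have to be declared ahead of time; you just write them.
    table.put(b"example.com", {b"cf:title": b"Example Domain", b"cf:status": b"200"})

    # Read the row back by key; HBase gives strongly consistent reads on that row.
    row = table.row(b"example.com")
    print(row[b"cf:title"], row[b"cf:status"])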

Some other key things to be aware of with HBase:

  • As with all the HDInsight components, this gets implemented as a managed cluster and Platform as a Service offering, in which we can separate the compute nodes from storage.
  • It has a scale-out architecture that provides automatic sharding, or horizontal partitioning, of tables, where rows of a table are held separately rather than splitting out columns as we would in typical table normalization.
  • Strong consistency for reads and writes, as it’s part of the architecture of HBase.
  • Automatic failover is built in, so work can fail over across multiple nodes in the cluster.
  • In-memory caching for reads and writes, which helps with performance and moves your data in and out more quickly.

Some of the most common workloads:

    • A search engine like I mentioned with Google’s Bigtable, which builds indexes that map terms to webpages that contain them.
    • A key value store. Facebook uses HBase for their messaging system because it’s ideal for storing and managing internet communications.
    • It’s also a good repository for collecting sensor data, where large amounts of data are pulled into this NoSQL table, which can then be used to build dashboards for reporting.

I still have a few HDInsight technologies to cover in this series. Many of these are interrelated and work together to form a complete and up-to-date data architecture.

 

Overview of HDInsight Spark

Today I’m continuing my series on HDInsight with a focus on Spark clusters. HDInsight Spark clusters provide the required baseline for in-memory cluster computing. This technology has gained momentum over the last few years as the amount of available memory, and the hardware behind it, have increased.

So, being able to load a large amount of data into memory has become much more feasible. In-memory computing allows us to load and cache the data, so it’s much more responsive when you’re working with the data, querying off of it or visualizing it, for instance.
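
Here is a small PySpark sketch of that load-and-cache pattern. The file path is a placeholder I’ve assumed; on HDInsight it would typically point at the cluster’s attached storage (a wasb:// or adl:// path).

    # Minimal PySpark sketch: load a dataset, cache it in memory, query it twice.
    # "/data/events.csv" is an assumed placeholder path.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-demo").getOrCreate()

    events = spark.read.csv("/data/events.csv", header=True, inferSchema=True)
    events.cache()  # keep the DataFrame in memory after the first action computes it

    # Both of these reuse the cached data instead of re-reading from storage,
    # which is what makes interactive querying and visualization feel responsive.
    events.groupBy("event_type").count().show()
    print(events.filter(events["event_type"] == "error").count())

    spark.stop()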

Some benefits and features of HDInsight Spark are:

  • Spark provides access to the Scala programming language. This allows us to work with distributed data sets like local collections, and it doesn’t require us to structure everything as map and reduce operations, making our operations more responsive and efficient.
  • Quick deployment. You can deploy a Spark cluster, as with other Azure PaaS offerings, through the Azure portal. You can also do it through scripting, PowerShell or Azure Automation.
  • Native integration with Zeppelin and Jupyter notebooks for your processing and visualization.
  • The REST API service allows for remote orchestration and job submission (see the sketch after this list).
  • Azure Data Lake support, allowing us to separate compute from storage, which lends itself to scalability. When compute and storage are handled separately, you can tear down your compute clusters, or nodes, and add new ones if you want to scale up/down. Then you can reattach to that storage without losing any of the work that you’ve done.
  • As a PaaS offering, it integrates easily with other Azure services, like Event Hubs or HDInsight Kafka (which I’ll cover later this week) for data streaming applications.
  • Support for concurrent queries, which allows us to take better advantage of the processing power of the nodes.
  • Native Power BI integration for visualization purposes; connecting directly to a Spark cluster from Power BI.
  • Pre-loaded with Anaconda, which provides about 200 libraries for things like Machine Learning, advanced analytics and visualizations.
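
As an example of that REST orchestration, the sketch below submits a PySpark script to an HDInsight Spark cluster through its Livy endpoint. The cluster name, credentials and script path are all assumptions for illustration; check the current HDInsight documentation for the exact endpoint and payload your cluster version expects.

    # Sketch: submit a Spark batch job remotely via the cluster's Livy REST endpoint.
    # Cluster name, admin credentials and the wasb:// script path are placeholders.
    import requests

    cluster = "https://mysparkcluster.azurehdinsight.net"
    auth = ("admin", "cluster-password")  # HDInsight cluster login, not an Azure account

    payload = {
        "file": "wasb:///example/jobs/wordcount.py",  # PySpark script in cluster storage
        "args": ["wasb:///example/data/input.txt"],
    }

    resp = requests.post(
        f"{cluster}/livy/batches",
        json=payload,
        auth=auth,
        headers={"X-Requested-By": "admin"},
    )
    print(resp.status_code, resp.json())  # returns the batch id and its current state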

Best uses for Spark:

    • As with other workloads for big data, the in-memory processing allows us to do interactive data analysis and create business solutions. It uses that in-memory processing engine to have more responsive reports and data visualization.
    • It has machine learning capability, with built-in support for the Jupyter and Zeppelin notebooks.
    • It’s pre-loaded with the Anaconda distribution and its roughly 200 canned libraries, so you can jump in and start using it quickly.
    • It handles streaming and real-time data workloads. You can extend your Event Hub queue, so you can bring in your data and report on it in real-time scenarios. This is great if you’re using IoT; it’s much more responsive than waiting for an ETL refresh.

Be sure to check out my next post on HDInsight HBase.

Overview of HDInsight Hadoop

With this post, I’m beginning a series focusing on Big Data and the Azure HDInsight offerings. If you don’t know, HDInsight is a fully managed, full-spectrum open source analytics service for enterprises that allows you to use open source frameworks such as Hadoop, Spark and Hive, among others. It was introduced to Azure in 2013, and more recent options have been added, such as domain-joined cluster capabilities.

Today’s focus is on HDInsight Hadoop. What we’re talking about here is being able to work with big data workloads. These large amounts of data can be structured, unstructured or semi-structured data, like table structures, documents or photos.

It can be historical data that you’re looking to analyze or stream data that’s coming in real time. The goal of this is for you to process the data and generate information from it. Some advantages are:

  • It’s a cloud native Platform as a Service (PaaS) offering within the Azure workplace.
  • Lower cost and scalability because of the capability of separation of compute and storage. You can store your data there but can tear down the clusters so you’re not paying anything when they’re not running. You can also keep your storage and reattach to it with additional nodes to get scalability.
  • Security and compliance with government regulations.
  • You can do monitoring of the system within Azure. If you add on the Enterprise Security Package, you have the capability to do some monitoring within the system, as well as set up user accounts that tie into your Active Directory.
  • It’s globally available, including the Azure Government, China and Germany regions.

Some of the uses for Hadoop HDInsight are:

  • Batch processing ETL
  • Data Warehousing
  • IoT
  • Streaming and processing of data – A use case example here is Toyota. They used this for their Connected Car Architecture Program, where they were able to monitor their cars and stream that data into an HDInsight cluster.
  • It’s increasingly being used for data science workloads, where you have massive data sets that you want to process and analyze, or a combination of scenarios, like running data science and machine learning on streaming data to predict what might happen next.

Another benefit is that HDInsight clusters support multiple programming languages, like Java, Python, Scala, Pig Latin, HiveQL and Spark SQL. Basically, all the common languages in the open source community are available, letting you take advantage of this high-performing technology for big data workloads.

Coming up, I’ll discuss some of the cluster types available, such as HDInsight Spark, HBase, Storm, Kafka, Interactive Query and R-Server.

 

Azure Common Data Services

What do you know about Azure Common Data Services? Today I’d like to talk about this product for apps which was recently re-done by Microsoft to expand upon the product’s vision. Common Data Services is an Azure-based business application platform that enables you to easily build and extend applications with your customer’s business data.

Common Data Services helps you bring together your data from across the Dynamics 365 Suite (CRM, AX, Nav, GP) and use this common data service to more easily extract data rather than having to get into the core of those applications. It also allows you to focus on building and delivering the apps that you want and insights and process automation that will help you run more efficiently. Plus it integrates nicely with PowerApps, Power BI and Microsoft Flow.

Some other key things:

  • If you want to build Power BI reports from your Dynamics 365 CRM data, there are pre-canned entities provided by Microsoft.
  • Data within the Common Data Services (CDS) is stored within a set of entities. An entity is just a set of records that’s used to store data, similar to how a table stores data within a database.
  • CDS should be thought of as a managed database service. You don’t have to create indexes or do any kind of database tuning; you’re not managing a database server as you would with SQL Server or a data warehouse. It’s designed to be somewhat of a centralized data repository from which you can report or do further things with.
  • PowerApps is quickly becoming a good replacement for things like Microsoft Access, given the functionality and feature sets it comes with. A common use for PowerApps is extending that data rather than having to dig into the underlying platform.
  • This technology is easy to use, to share and to secure. You set up your user account as you would with Azure Services, giving specific permissions/access based on the user.
  • It gives you the metadata you need based on that data and you can specify what kind of field or column you’re working with within that entity.
  • It gives you the ability to add logic and validation; you can create business process rules and data flows from entity to entity, or from an app to an entity in PowerApps.
  • You can create workflows that automate business processes, such as data cleansing or record updating; these workflows can run in the background without having to be managed manually.
  • It has good connectivity with Excel, which makes it user friendly for people comfortable with that platform.
  • For power users, there’s an SDK available for developers, which allows you to extend the product and build some cool custom apps.

I don’t think of this as a replacement for Azure SQL DW or DB but it does give you the capability to have table-based data in the cloud that has some nice hooks into the Dynamics 365 space, as well as outputting to PowerApps and Power BI.

Continuous Integration and Deployment Using Azure Data Factory

Today I’m excited to talk about one of the new releases in Azure that gives you continuous integration and deployment using Azure Data Factory. This new release is an Azure Data Factory visual interface that allows you to export any of your Data Factory components as an Azure Resource Manager (ARM) template.

When you do these exports from your Data Factory, it will generate two files: the template file, which contains all the Data Factory metadata for the pipelines, data sets, etc., and a configuration file, which contains the environment parameters that will be different for each of your environments. So, if you’re going to create development, test and production environments, each one will have its own parameter values.
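
As a rough sketch of what that deployment step can look like, here is how you might push the exported template to a target environment with the Azure Python SDK (azure-mgmt-resource). The resource group, subscription, file names and parameter values are assumptions; in practice the same thing is often done from a VSTS release definition or the Azure CLI.

    # Sketch: deploy an exported Data Factory ARM template with per-environment parameters.
    # Resource group, subscription id and file names are assumed placeholders.
    import json
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.resource import ResourceManagementClient

    subscription_id = "00000000-0000-0000-0000-000000000000"
    client = ResourceManagementClient(DefaultAzureCredential(), subscription_id)

    with open("arm_template.json") as f:
        template = json.load(f)
    with open("arm_template_parameters_test.json") as f:  # test-environment config file
        parameters = json.load(f)["parameters"]

    poller = client.deployments.begin_create_or_update(
        "rg-datafactory-test",          # target environment's resource group
        "adf-release-001",              # deployment name
        {
            "properties": {
                "mode": "Incremental",  # only add/update resources defined in the template
                "template": template,
                "parameters": parameters,
            }
        },
    )
    print(poller.result().properties.provisioning_state)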

You can also specify things like storage containers, Databricks clusters, etc. After you’ve deployed this, you’re going to create a new factory for your environment. You’re also going to associate your Visual Studio Team Services (VSTS) Git repository with that Data Factory, enabling source control, versioning and collaboration.

Next, you’ll set up your Data Factory with VSTS. This is where all the developers can author data factory resources, such as pipelines, data sets and other components. Once you have this development area set up, developers can modify the resources and debug them right in the interface, along with checking performance. They’ll also have the option to create a PR from their branch to master or create a collaborative branch to get the changes reviewed by peers.

Once they’re satisfied with the changes and ready to go to production, they merge them into the master branch and can then publish to the development Data Factory. From there, they can promote to each environment by exporting the ARM templates from the master branch, or any other branch, when they’re ready.

So, you export the template and it gets deployed with different environment parameters to test and production environments. From there, you can also set up VSTS release definitions to automate the deployment of your Data Factory to multiple environments.

The benefit of this is it opens up the opportunity to bring the true dev, test and production environments that you’re used to locally with SSIS or other ETL tools into Azure. This tool offers a tremendous amount of power and it’s getting better all the time.

Overview of Azure Stream Analytics

Analytics is the key to making your data useful and supporting decision making. Today I’m excited to talk about Azure Stream Analytics. Azure Stream Analytics is an event processing engine that allows you to capture and examine high volumes of data from all kinds of connections, like devices, websites and social media feeds.

You can examine those data streams and it allows you to trigger things like alerts, as well as take action with reporting or storage. So, whether you want to report on it with Power BI or store the data for down the road, you have these options. Stream analytics is used a lot with IoT or streaming feeds through social media, where people want to keep an eye on what’s happening with the data.

Here’s how it works. It starts with a data source such as Event Hub, IoT Hub or Azure Blob Storage, and it uses a SQL-like query language that allows transformation on the fly. It handles operations like filtering, sorting, aggregating and joining the data together to make it more usable, turning data into information.
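
As a sketch of the front end of that pipeline, here is how events might be pushed into an Event Hub using the azure-eventhub Python package; a Stream Analytics job would then read from that hub and apply its SQL-like query (for example, a filter plus a windowed aggregate). The connection string and event hub name are placeholders.

    # Sketch: send events into an Event Hub for a Stream Analytics job to pick up.
    # Connection string and event hub name are assumed placeholders.
    import json
    from azure.eventhub import EventHubProducerClient, EventData

    producer = EventHubProducerClient.from_connection_string(
        "Endpoint=sb://example.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=...",
        eventhub_name="device-telemetry",
    )

    batch = producer.create_batch()
    for reading in [{"deviceId": "sensor-1", "temp": 71.3}, {"deviceId": "sensor-2", "temp": 98.6}]:
        batch.add(EventData(json.dumps(reading)))
    producer.send_batch(batch)
    producer.close()

    # Downstream, the Stream Analytics job would run something like:
    #   SELECT deviceId, AVG(temp) AS avgTemp
    #   FROM deviceinput TIMESTAMP BY EventEnqueuedUtcTime
    #   GROUP BY deviceId, TumblingWindow(second, 30)
    # and route the results to Power BI or storage.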

From there, once you’ve identified the data you want or need to use, you can send it downstream to a queue for triggering workflows or further processing. You can also send that data to Power BI for real-time visualization. For example, let’s say you want to pull certain keywords out of Twitter to see how they’re being used and watch that as it happens. By connecting to the Twitter API, you can capture that data, stream it, and then report on it with a Power BI report.

Of course, the other option is to archive it for further processing down the road if you want to do something with that data.

This was designed to be easy to use and spin up. It has source and sink integration and an easy to use, declarative, SQL-like query language. Also, it’s a managed service, so it’s pay as you use, as with many Azure services. There’s no need to buy hardware or software up front. And it has an enterprise-grade service level agreement, so it’s robust, reliable and you can run it in multiple locations.

Another big positive is that its in-memory processing with multi-node capabilities offers tremendous scalability and performance benefits. Plus, unlike on-prem solutions, it’s fairly elastic, so you can add nodes as you need them to process more data and bring them back down when you’re not using them.

There are a lot of cool things being done with stream analytics and IoT; it’s an exciting time to be in this arena.

Azure Data Factory Integration Runtimes

This week I’ve been talking about Azure Data Factory. In today’s post I’d like to talk about the much-awaited Azure Data Factory Integration Runtime. The integration runtime is the compute infrastructure that provides data movement, connectors, data conversion and data transfer, as well as activity dispatching, so you can dispatch and monitor activities running on HDInsight, SQL DB or SQL DW and more.

A big part of V2 is that you can now lift and shift your SSIS packages up into Azure and run them from your Azure data portal inside of Data Factory. There are 3 integration runtime types:

1. Azure Integration Runtime – This is set up entirely in Azure, and you would use it if you’re going to be copying between two cloud data sources.

2. Self-Hosted Integration Runtime – This can run copy and transformation activities between cloud and on-premises resources, including moving data from an IaaS virtual machine. Use this if you’re going to be copying between a cloud source and an on-prem, private network source.

So, use it if your environment is behind a firewall, not in the public cloud, and you want to move data from your environment to Azure but a gateway will not work for you. Also, if an IaaS virtual machine is isolated and you can’t get into that data storage directly, you would set up this integration runtime to bridge the two sites.

3. SSIS Integration Runtime – Use this when you’re lifting and shifting your SSIS packages into Azure Data Factory. A key thing to mention is that it does not yet support third party tools for SSIS, though that support will be added eventually. The sketch below summarizes how the three types line up with these scenarios.
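
Here is a small, framework-free Python sketch that restates the decision logic above. It’s just a summary of the three scenarios, not anything from the Data Factory SDK.

    # Sketch: which Data Factory integration runtime fits a given scenario.
    # Pure Python restatement of the three cases described above.

    def choose_integration_runtime(source_is_cloud: bool,
                                   sink_is_cloud: bool,
                                   lifting_ssis_packages: bool = False) -> str:
        """Return the integration runtime type for a copy/dispatch scenario."""
        if lifting_ssis_packages:
            # Lift-and-shift of existing SSIS packages into Data Factory.
            return "SSIS Integration Runtime"
        if source_is_cloud and sink_is_cloud:
            # Both ends are cloud data stores.
            return "Azure Integration Runtime"
        # At least one end sits on-premises / in a private network (or an isolated IaaS VM).
        return "Self-Hosted Integration Runtime"

    print(choose_integration_runtime(source_is_cloud=True, sink_is_cloud=True))
    print(choose_integration_runtime(source_is_cloud=False, sink_is_cloud=True))
    print(choose_integration_runtime(True, True, lifting_ssis_packages=True))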

Where are these located? Azure Data Factories are available in a limited set of regions at this time, but they can access data stores and compute services globally. With the Azure Integration Runtime, the location of that runtime defines the backend compute resources being used. This is optimized for data compliance, efficiency and reduced network egress costs, to ensure you’re using the best services available in the region you need.

The Self-Hosted Runtime is installed inside the private network environment. The SSIS Integration Runtime’s location is determined by where the SQL DB or managed instance hosting the SSISDB catalog lives. Its possible locations are currently limited, but it does not have to be in the same place as the Data Factory; place it as close to the data sources as possible so it runs as optimally as it can.

An Overview of Azure File Sync

I have a question… Who is still using a file server? No need to answer, I know that most of us still are and need to use them for various reasons. We love them—well, we also hate them, as they are a pain to manage.

The pains with Windows File Server:

  • They never seem to have enough storage.
  • They never seem to be properly cleaned up; users don’t delete the files they’re supposed to.
  • The data never seems accessible when and where you need it.

In this blog, I’d like to walk you through Azure File Sync, so you can see for yourself how much better it is.

    • Let’s say I’m setting up a file server in my Seattle headquarters and that file server begins having problems, maybe I’m running out of space for example.
    • I decide to hook this up to a file share in Azure.
    • I can set up cloud tiering and set a threshold (say 50%), so that once the volume crosses that threshold, files will start moving up into Azure (see the sketch after this list).
    • When I set this threshold, it will start taking the oldest files and graying them out as far as users are concerned. The files still appear to be there, but their contents have been pushed off to the cloud, so that space has been freed up on the file server.
    • If users ever need those files, they can click on them and redownload.
    • Now, let’s say I want to bring on another server at a branch office. I can simply bring up that server and synchronize it with the files already in Azure.
    • From here, I can hook up my SMB and NFS shares for my users and applications, as well as my Work Folders, using multi-site technology. All my files are synchronized, and it gives me direct cloud access to these files.
    • I can hook up my IaaS and PaaS solutions with my REST API or my SMB shares to be able to access these files.
    • With everything synchronized, I’m able to have a rapid file server disaster/data recovery. If my server in Seattle goes down, I simply remove it; my files are already up in Azure.
    • I bring on a new server and sync it back to Azure. My folders start to populate, and as they get used, people download the files back and the tiering rules that were set up are maintained.
    • The great thing is it can be used with Windows Server 2012 R2, as well as Windows Server 2016.
    • Now I have an all-encompassing solution (with integrated cloud backup within Azure) with better availability, better DR capability and essentially bottomless storage. Backups go to the Azure Backup vault automatically, and the storage is super cheap.
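
To illustrate the cloud tiering idea from the list above, here is a small, purely conceptual Python sketch: when used space crosses the threshold, the oldest (least recently accessed) files are the ones that get tiered to the cloud and left as placeholders. This is just the logic in miniature, not anything from the Azure File Sync agent.

    # Conceptual sketch of cloud tiering: when the volume is too full, tier the
    # least recently accessed files to the cloud until we're back under the threshold.
    # Sizes are in GB; this is illustration only, not the Azure File Sync agent.

    def tier_files(files, volume_size_gb, used_threshold=0.5):
        """files: list of dicts like {"name": ..., "size_gb": ..., "last_access": ...}.
        Returns the names of files that would be tiered (kept as cloud placeholders)."""
        used = sum(f["size_gb"] for f in files)
        tiered = []
        # Oldest files first, as described in the walkthrough above.
        for f in sorted(files, key=lambda f: f["last_access"]):
            if used <= volume_size_gb * used_threshold:
                break
            used -= f["size_gb"]      # contents move to Azure; a placeholder stays local
            tiered.append(f["name"])
        return tiered

    files = [
        {"name": "archive_2015.zip", "size_gb": 40, "last_access": "2016-01-10"},
        {"name": "budget.xlsx",      "size_gb": 1,  "last_access": "2018-05-02"},
        {"name": "old_photos.tar",   "size_gb": 30, "last_access": "2015-07-19"},
    ]
    print(tier_files(files, volume_size_gb=100, used_threshold=0.5))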

With Azure File Sync I get:

1. A centralized file service in Azure storage.

2. Cache in multiple locations for fast, local performance.

3. Cloud-based backup and fast data/disaster recovery.