Category Archives: Strategy

Overview of HDInsight Kafka

Continuing with my HDInsight series, today I’ll be talking about Kafka. HDInsight Kafka will sound much like Storm, but as I get into the nuts and bolts you’ll see the differences. Kafka is an open source distributed streaming platform that can be used to build real-time data streaming pipelines and applications, with message broker functionality like a message queue.

Some specific Kafka improvements with HDInsight:

  • 99.9% uptime from HDInsight
  • You get 16 terabyte managed disks, which increases the scale and reduces the number of nodes required compared with traditional Kafka clusters, which would have a limit of 1 terabyte.
  • Kafka takes a single rack view, but Azure is designed in 2 dimensions for update and fault domains. Thus, Microsoft designed special tools to rebalance the partitions and replicas. Once you scale out, you would repartition your data and then you’d be able to take advantage of the additional nodes, as well as when you scale down.
  • Kafka allows you to change the number of worker nodes for scaling up/down, depending on the workload and this can be done through the portal or PowerShell or any automation tool within Azure.
  • Direct integration with Azure Log Analytics. This looks at virtual machine level information like the disk and the network. The importance of this is that it allows you to roll that up into the Microsoft OMS suite for global log analytics. So, when you’re looking at all your resources in Azure through OMS, it helps you to see them at a high level and also drill in for more details.
  • ZooKeeper manages the state of the cluster, which helps with concurrency, resiliency and low-latency transactions, as well as the orchestration of the data through the nodes and clusters.
  • Records are stored in topics; they are produced by producers and consumed by consumers. The producers send records to Kafka brokers, and each worker node in the cluster is considered a broker. These brokers are what move the data around inside the cluster.
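
To make the topic, producer and consumer vocabulary concrete, here is a toy in-memory model. It is only a sketch of the idea and not the Kafka client API: the `Topic` and `Consumer` classes, the partition count and the key hashing are all assumptions made up for illustration.

```python
from collections import defaultdict


class Topic:
    """Toy stand-in for a Kafka topic: one append-only log per partition."""

    def __init__(self, name, partitions=2):
        self.name = name
        self.partitions = [[] for _ in range(partitions)]

    def append(self, key, value):
        # Records with the same key always land in the same partition.
        index = sum(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append((key, value))
        return index, len(self.partitions[index]) - 1  # (partition, offset)


class Consumer:
    """Tracks its own read position (offset) in each partition."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = defaultdict(int)

    def poll(self):
        records = []
        for index, log in enumerate(self.topic.partitions):
            records.extend(log[self.offsets[index]:])
            self.offsets[index] = len(log)
        return records


# The "producer" side: append records to the topic.
clicks = Topic("clicks")
clicks.append("user-1", "home")
clicks.append("user-2", "search")
clicks.append("user-1", "checkout")

reader = Consumer(clicks)
first_batch = reader.poll()    # all three records
second_batch = reader.poll()   # nothing new since the last poll

late_reader = Consumer(clicks)
replayed = late_reader.poll()  # the log is stored, so a new consumer can replay it
```

The point of the sketch is the last line: because records are kept in the log rather than handed off and forgotten, a consumer that shows up later can still read them.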

Again, Kafka and Storm sound relatively similar, so here are some major differences:

    • Storm was invented by Twitter; Kafka by LinkedIn. But both use the Hadoop platform and are open source, so companies can build their own iterations.
    • Storm is meant more for real time message processing; Kafka is for distributed messaging processing.
    • Storm can take data from Kafka and other database systems and process the data; Kafka is taking in those streams from things like Facebook, Twitter and LinkedIn.
    • Kafka is a message broker; Storm’s primary use is stream processing.
    • In Storm there is no data storage; you can only stream data through it, whereas Kafka stores the data on the file system. As those streams are processed, Storm can do it much faster, on a micro-batch processing level. Kafka is doing small batches, larger than micro.
    • As far as dependency, Kafka requires Zookeeper for all the orchestration; Storm does not depend on anything externally.
    • Storm has a latency of milliseconds; with Kafka it depends on the source of the data, but typically takes slightly less than 1-2 seconds. So, you’re keeping the data local in Kafka, processing it, then pushing it somewhere else. Whereas with Storm, you’re processing the data in motion as you’re pushing it somewhere else.

Basically, two different ways to solve similar problems depending on the use case. It apparently worked better for LinkedIn to design it this way as opposed to the way that Twitter handles their data.

 

Hybrid Cloud Strategies and Management

Are you running a hybrid environment between on-premises and Azure? Do you want to be? In a recent webinar, Sr. Principal Architect Chris Seferlis answered the question: How can my organization begin using hybrid cloud today? In this webinar, he defines the four key pillars of true hybrid (identity, security, data platform and development) and shows actionable resources to help get you started down the hybrid road.

Hybrid cloud presents great opportunity for your organization and is the path most are going down:

80% of enterprises see themselves operating hybrid clouds for the foreseeable future

58% of enterprises have a hybrid cloud strategy (up from 55% a year ago)

87% of organizations are planning to integrate on-premises datacenters with public cloud

In this in-depth webinar, Chris covers:

Hybrid Identity with Windows Server Active Directory and Azure Active Directory – Identity is the new control plane. We’ve all got lots of services, devices and internal apps, and firewalls do not protect your users in the cloud.

With Azure AD you:

  • Have 1000s of apps with 1 identity
  • Enable business without borders
  • Manage access at scale
  • Have cloud-powered protection

Security – Better security starts at the OS – protect identity, protect the OS on-premises or in the cloud, help secure virtual machines.

Coupling Server 2016 with Azure enables security for your environment at cloud speed.

Azure enables rapid deployment of built-in security controls, as well as products and services from security partners, and provides integration of partner solutions. Microsoft is making a major commitment to integration with 3rd party tools for ease of transition and a true hybrid approach.

Data and AI – AI investment increased by 300% in 2017. Organizations that harness data, cloud and AI out-perform and out-innovate with nearly double operating margin.

This webinar will tell you how to transform your business with a modern data estate.

Other areas covered are:

Azure Stack – the 1st consistent hybrid cloud platform

Hybrid Protection with Azure Site Recovery – Azure reduces the common challenges of cost and complexity with increased compliance.

Azure File Sync – If you’re using a file server on-prem, let’s make it better with Azure.

Project Honolulu – A modern management platform to manage on-prem and Azure.

This webinar is chock-full of information to get you on the right path to running a hybrid environment between on-premises and Azure. Watch the complete webinar here and click here to download the slides from the session. If you want to learn more about hybrid cloud strategies, contact us – we’re here to help.

Azure Enterprise Security Package for HDInsight

In today’s post I’d like to talk about the Enterprise Security Package for Azure HDInsight. HDInsight is a managed cloud Platform as a Service offering built on the Hadoop framework. It allows you to build big data solutions using Hadoop, Spark, Hive, LLAP and R, among others.

Let’s think about the traditional deployment of Hadoop. In traditional deployment, you would deploy a cluster, give local admin access to users with SSH access to that cluster. Then you would hand it over to the data scientists, so they could do what they needed to run those data science workloads; train the models, run scripts and such.

With the adoption of these types of big data workloads into the enterprise, it became much more reliant on enterprise security. There was a need for role-based access control with Active Directory permissions. Admins wanted to get greater visibility into who was accessing the data and when, as well as what they tried to get into and were they successful in their attempts or not – basically all those audit requirements when we’re working with large data sets.

Who is the leader in enterprise security? Microsoft, of course, for Active Directory. The Enterprise Security Package allows you to add the cluster to the domain within the creation process, as a sort of ‘add-on’ in your Azure portal. Other things it allows you to do are:

  • Add an HDI cluster with Active Directory Domain Services.
  • Role-based access control for Hive, Spark and Interactive Hive using Apache Ranger.
  • Specific file and folder permissions for the data inside of an Azure Data Lake Store.
  • Auditing of logs to see who has access to what and when.

Currently, these features are only available for Spark, Hadoop and Interactive Query workloads, but more workloads will be adopted soon.

How and When to Scale Up/Out Using Azure Analysis Services

Some of you may not know when or how to scale up your queries or scale out your processing. Today I’d like to help with understanding when and how using Azure Analysis Services. First, you need to decide which tier you should be using. You can do that by looking at the QPUs (Query Processing Units) of each tier on Azure. Here’s a quick breakdown:

  • Developer Tier – gives you up to 20 QPUs
  • Basic Tier – is a mid-scale tier, not meant for heavy loads
  • Standard Tier (currently the highest available) – allows you more capability and flexibility

Let’s start with when to scale up your queries. You need to scale up when your reports are slow; for example, you’re reporting out of Power BI and the throughput isn’t working for your needs. What you’re doing with scaling up is adding more resources. The QPU is a combination of your CPU, memory and other factors like the number of users.

Memory checks are straightforward. You run the metrics in the Azure portal and you can see what your memory usage is, and whether your Memory Limit High or Memory Limit Hard settings are being saturated. If so, you need to either upgrade your tier or adjust the level within your current tier.

CPU bottlenecks are a bit tougher to figure out. You can get an idea by watching your QPUs to see if you’re saturating them, using those metrics and looking at the logs within the Azure portal. Then you want to watch your Processing Pool Job Queue Length and your Processing Pool Busy Non-I/O Threads. This should give you an idea of how it’s performing.

For the most part, you’re going to want to scale up when the processing engine is taking too long to process the data to build your models.

Next up, scaling out. You’ll want to scale out if you’re having problems with responsiveness with reporting because the reporting requirements are saturating what you currently have available. Typically, in cases with a large number of users, you can fix this by scaling out and adding more nodes.

You can add up to 7 additional query replicas; these are Read-only replicas that you can report off, but the processing is handled on the initial instance of Azure Analysis Services and subsequent queries are being handled as part of those query replicas. Hence, any processing is not affecting the responsiveness of the reports.

After it separates the model processing from query engine, then you can measure the performance by watching the log analytics and query processing units and see how they’re performing. If you’re still saturating those, you’ll need to re-evaluate whether you need additional QPUs or to upgrade your tiers.

Something to keep in mind is once you’ve processed your data, you must resynchronize it across all of those query replicas. So, if you’re going to be processing data throughout the day, it’s a good idea not only to schedule that processing, but also to strategically synchronize the replicas as well.

Also important to know is that scale out does require the Standard Edition Tier; Basic and Developer will not work for this purpose. There are some interesting resources out there that allow you to auto scale. It will be based on a schedule using a PowerShell runbook. It uses your Azure automation account to schedule when it’s going to scale up or out based on the needs of the environment. For example, if you know Monday mornings you’re going to need additional processing power to run your queries efficiently, you’ll want to set up a schedule for that time and then you can scale it back.

Another note is that you can scale up to a higher tier, but you cannot scale those back automatically if you’re running a script. But with this ability it does allow you to be prepared for additional requirements in that environment.
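
As a rough illustration of the schedule-based approach, here is a minimal sketch of just the decision logic: given a point in time, which tier should the server be on? The schedule entries, the default and the tier labels are placeholders I made up for the example, and the actual resize call a runbook would make is deliberately left out.

```python
from datetime import datetime

# Hypothetical schedule: (weekday, start_hour, end_hour, tier). Monday is 0.
# The tier labels are placeholders; check the tiers available to you in Azure.
SCHEDULE = [
    (0, 6, 12, "S2"),  # Monday morning: extra capacity for the reporting rush
]
DEFAULT_TIER = "S1"


def tier_for(moment, schedule=SCHEDULE, default=DEFAULT_TIER):
    """Return the tier the server should be running at the given time."""
    for weekday, start_hour, end_hour, tier in schedule:
        if moment.weekday() == weekday and start_hour <= moment.hour < end_hour:
            return tier
    return default


monday_rush = tier_for(datetime(2018, 6, 4, 8, 30))     # a Monday morning
monday_evening = tier_for(datetime(2018, 6, 4, 18, 0))  # same day, after the rush
wednesday = tier_for(datetime(2018, 6, 6, 8, 30))       # an ordinary weekday
```

A scheduled job would call something like `tier_for(datetime.now())` and only trigger a resize when the answer differs from the tier currently in use.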

I hope this helped with questions you have about scaling up and out.

 

Azure Data Factory vs Logic Apps

Customers often ask, should I use Logic Apps or Data Factory for this? Of course, the answer I give is the same as with most technology, it depends. What is the business use case we’re talking about?

Logic Apps can help you simplify how you build automated, scalable workflows that integrate apps and data across cloud and on premises services. Azure Data Factory is a cloud-based data integration service that allows you to create data driven workflows in the cloud for orchestrating and automating data movement and data transformation. Similar definitions, so that probably didn’t help at all, right?

Let me try to clear up some confusion. There are some situations where the best-case scenario is to use both, such as when a feature is lacking in Data Factory but can be found in Logic Apps, since it’s been around longer. A great use case is alerting and notifications, for instance. You can use the web API out of Data Factory and send a notification through a Logic App via email back to a user to say a job has completed or failed.
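
To give a feel for the notification hand-off, here is a minimal sketch of the kind of HTTP POST a pipeline step could send to a Logic App that starts on an HTTP request. The URL, the field names and the `build_notification` helper are all hypothetical; the Logic App would define its own request schema and then handle the email. The request is only built here, never sent.

```python
import json
import urllib.request

# Hypothetical Logic App HTTP trigger URL; a real one is generated by Azure.
LOGIC_APP_URL = "https://example.invalid/workflows/notify"


def build_notification(pipeline_name, status, recipient):
    """Build (but do not send) the POST request a pipeline step would make."""
    body = {
        "pipeline": pipeline_name,
        "status": status,        # e.g. "Succeeded" or "Failed"
        "recipient": recipient,
    }
    return urllib.request.Request(
        LOGIC_APP_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_notification("LoadSales", "Failed", "ops@example.com")
payload = json.loads(request.data)
```

Keeping the payload this small is a deliberate choice: Data Factory only reports what happened, and the Logic App decides who gets told and how.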

To answer the question of why I would use one over the other, I’d say it comes down to how much data we’re moving and how much transformation we need to do on that data to make it ready for consumption. Are we reporting on it, putting it in Azure Data Warehouse, building some facts and dimensions and creating our enterprise data warehouse then reporting off of that with Power BI? This would all require a decent amount of heavy lifting. I would not suggest a Logic App for that.

If you’re monitoring a folder on-prem or in OneDrive to see when files get posted there, and you simply want to move a file to another location or send a notification about an action on the file, this is a great use case for a Logic App.

However, the real sweet spot is when you can use them together, as it helps you maximize cost efficiency. Depending on what the operation is, it can be more or less expensive depending upon whether you’re using Data Factory or Logic Apps.

You can also make your operations more efficient. Utilize the power of Azure Data Factory with its SSIS integration runtimes and feature sets that include things like Databricks and HDInsight clusters, where you can process huge amounts of data with massively parallel processing. Or use your Hadoop file stores for reporting off structured, unstructured or semi-structured data. But Logic Apps can help you enhance the process.

Clear as mud, right? Hopefully I was able to break it down a bit better. To put it in simple terms: when you think about Logic Apps, think about business applications; when you think about Azure Data Factory, think about moving data, especially large data sets, transforming the data and building data warehouses.

 

Continuous Integration and Deployment Using Azure Data Factory

Today I’m excited to talk about one of the new releases in Azure that gives you continuous integration and deployment using Azure Data Factory. This new release is an Azure Data Factory visual interface that allows you to export any of your Data Factory components as an Azure Resource Manager (ARM) template.

When you do these exports from your Data Factory, it will generate 2 files.  The template file, which will contain all the Data Factory metadata for the pipelines, data sets, etc., as well as a configuration file, which will contain environment parameters that will be different for each of your environments. So, if you’re going to create a development, a test and a production environment, each one will be different.
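
Here is a minimal sketch of the "one template, one parameter file per environment" idea. The parameter names and values are made up for illustration, and the `$schema` line a real parameter file carries is left out; the only thing the sketch relies on is the usual shape of wrapping each value as `{"value": ...}` under `parameters`.

```python
import json

# Hypothetical per-environment values; the parameter names are illustrative.
ENVIRONMENTS = {
    "dev":  {"factoryName": "adf-sales-dev",  "storageContainer": "raw-dev"},
    "test": {"factoryName": "adf-sales-test", "storageContainer": "raw-test"},
    "prod": {"factoryName": "adf-sales-prod", "storageContainer": "raw-prod"},
}


def parameter_file(values):
    """Wrap plain values in the {"parameters": {name: {"value": ...}}} shape."""
    return {
        "contentVersion": "1.0.0.0",
        "parameters": {name: {"value": value} for name, value in values.items()},
    }


# One parameter document per environment, all meant for the same template.
files = {env: parameter_file(values) for env, values in ENVIRONMENTS.items()}
prod_json = json.dumps(files["prod"], indent=2)
```

The template itself never changes between environments; only the small parameter document does, which is what makes promoting from dev to test to production repeatable.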

You can also specify things like storage containers, Databricks clusters, etc. After you’ve deployed this, you’re going to create a new factory for your environment. You’re also going to associate your Visual Studio Team Services (VSTS) Git repository with that Data Factory, enabling source control, versioning and collaboration.

Next, you’ll set up your Data Factory with VSTS. This is where all the developers can author data factory resources, such as pipelines, data sets and other components. Once you have this development area set up, developers can modify the resources and debug them right in the interface, along with checking performance. They’ll also have the option to create a PR from their branch to master or create a collaborative branch to get the changes reviewed by peers.

Once they are satisfied with the changes and are ready to go to production, they merge them into the master branch and can then publish to the development Data Factory. Or they can promote to each of those environments by exporting the ARM templates when they’re ready, from the master branch or any other branch.

So, you export the template and it gets deployed with different environment parameters to test and production environments. From there, you can also set up VSTS release definitions to automate the deployment of your Data Factory to multiple environments.

The benefit with this is it opens the opportunity to bring your true dev test and production environments, that you’re used to in your local environment using SSIS or other ETL tools, to Azure. This tool offers a tremendous amount of power and it’s getting better all the time.

Overview of Azure Stream Analytics

Analytics is the key to making your data useful and supporting decision making. Today I’m excited to talk about Azure Stream Analytics. Azure Stream Analytics is an event processing engine that allows you to capture and examine high volumes of data from all kinds of connections, like devices, websites and social media feeds.

You can examine those data streams and it allows you to trigger things like alerts, as well as take action with reporting or storage. So, whether you want to report on it with Power BI or store the data for down the road, you have these options. Stream analytics is used a lot with IoT or streaming feeds through social media, where people want to keep an eye on what’s happening with the data.

Here’s how it works. It starts with a data source such as Event Hub, IoT Hub or Azure Blob Storage, and it uses SQL-like query language that allows transformation on the fly. It helps you process operations like filtering, sorting, aggregating and joining the data together to make it more useable—turning data into information.
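
To show what filtering and aggregating a stream means in practice, here is a small sketch in plain Python rather than the Stream Analytics query language. The events, the threshold and the 10-second window are assumptions for the example; the function keeps only the hot readings and counts them per device in fixed, non-overlapping (tumbling) time windows.

```python
from collections import Counter

# Hypothetical events: (seconds since start, device id, temperature reading).
EVENTS = [
    (1, "sensor-a", 71.0),
    (4, "sensor-b", 98.5),
    (12, "sensor-a", 99.1),
    (14, "sensor-a", 101.3),
    (27, "sensor-b", 97.2),
]


def hot_readings_per_window(events, threshold=95.0, window_seconds=10):
    """Count readings above the threshold per device and time window."""
    counts = Counter()
    for timestamp, device, temperature in events:
        if temperature > threshold:                      # filter
            window_start = (timestamp // window_seconds) * window_seconds
            counts[(window_start, device)] += 1          # group and aggregate
    return sorted(counts.items())


result = hot_readings_per_window(EVENTS)
# result == [((0, "sensor-b"), 1), ((10, "sensor-a"), 2), ((20, "sensor-b"), 1)]
```

In the real service you would express the same thing declaratively in the SQL-like query, and the engine would keep producing results as new events arrive instead of working over a fixed list.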

From there, when you identify the data that you want/need to use, you can then send that data downstream to be sent to a queue for triggering workflows or further processing of the data. You can also send that data to Power BI for real-time visualization. For example, let’s say you’re looking at a data quality stream and you want to pull certain key words out of Twitter to see how they’re used and watch how that’s being done. By connecting to the Twitter API, you can capture that data, stream it, and then report from it with a Power BI report.

Of course, the other option is to archive it for further processing down the road if you want to do something with that data.

This was designed to be easy to use and spin up. It has source and sink integration and an easy-to-use, declarative, SQL-like query language. Also, it’s a managed service so it’s pay-as-you-use, as with many Azure services. There’s no need to buy hardware or software up front. And it has an enterprise-grade service level agreement, so it’s robust and reliable, and you can run it in multiple locations.

Another big positive is that its in-memory processing with multi-node capabilities offers tremendous scalability and performance benefits. Plus, unlike on-prem solutions, it can be fairly elastic, so you can buy nodes as you need them to process more data and you can bring them back down when you’re not using them.

There are a lot of cool things being done with stream analytics and IoT; it’s an exciting time to be in this arena.

How Azure Data Factory Pricing Works

In today’s post I’d like to discuss how Azure Data Factory pricing works with the Version 2 model which was just released. The pricing is broken down into four ways that you’re paying for this service. I hope that by pointing these out, you can gain an understanding of not only how it works, but how you can keep an eye on your spending.

1. Azure activity runs vs self-hosted activity runs – there are different pricing models for these. Examples of Azure activity runs are a copy activity moving data from an Azure Blob to an Azure SQL database, or a Hive activity running a Hive script on an Azure HDInsight cluster.

With self-hosted, examples are a copy activity moving data from an on-premises SQL Server to Azure Blob Storage, or a stored procedure activity running a stored procedure on an on-premises SQL Server.

2. Volume of data moved – this is measured in DMUs (data movement units). This is one you should be aware of, as it defaults to auto, which basically uses all the DMUs it can, and this is paid for by the hour. Let’s say you specify 2 DMUs and it takes an hour to move that data. The other option is to use 8 DMUs and have it take 15 minutes; the price is going to end up the same. You’re using 4X the DMUs but it’s happening in a quarter of the time.

This is good to look at and do some comparisons on, since how many DMUs you’re using is where the bulk of your spend is going to be.

3. SSIS integration runtimes – here you’re using A-series and D-series compute levels. Which one you choose depends on what the compute needs are to invoke the process (how much CPU, how much RAM, how much temp storage you need).

4. The inactive pipeline – you’re paying a small amount for inactive pipelines (about 40 cents currently). A pipeline is considered inactive if it’s not associated with a trigger and hasn’t been run for over a week. Yes, it’s a minimal charge, but they do add up, and when you start to wonder where some of those charges come from it’s good to keep this in mind.
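
The data movement arithmetic from point 2 can be checked with a few lines. This is only a sketch: the hourly rate is a made-up placeholder, and it assumes the copy really does speed up in proportion to the DMUs you give it, which will not always hold.

```python
# Hypothetical price per DMU-hour; look up the current rate for your region.
RATE_PER_DMU_HOUR = 0.25


def copy_cost(dmus, hours, rate=RATE_PER_DMU_HOUR):
    """Cost of a copy run billed as DMUs x duration x hourly rate."""
    return dmus * hours * rate


slow_run = copy_cost(dmus=2, hours=1.0)   # 2 DMUs for a full hour
fast_run = copy_cost(dmus=8, hours=0.25)  # 8 DMUs for 15 minutes
same_price = slow_run == fast_run         # 4X the DMUs, a quarter of the time
```

If the faster run finishes in more than a quarter of the time, the 8 DMU option costs more, so it is worth timing a real copy before settling on a number.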

Also, each of the components inside the Azure Data Factory, whether it’s blob storage, SQL Server, HDInsight or any kind of storage or compute resources you’re using as part of your pipeline, will also incur charges. These are billed separately based specifically around what those resources are.

Something to keep in mind as you start to build workloads: if you spin up an HDInsight cluster or a SQL data warehouse as part of a pipeline, make sure you shut down, pause or destroy it afterwards. So, there are opportunities to get your data moved while keeping the cost down by not keeping resources running all the time.

A Guide to GDPR Compliance with Microsoft Data Platform

As most people know, the GDPR is approaching quickly: May 25th, to be exact. Most companies will need to review or modify their database management and data handling procedures, especially focusing on the security of data processing. A recent webinar was hosted by 3 experts in the Azure, SQL Data Platform and software arenas: Abraham Samuel, Technical Support Personnel, Microsoft; Brian Knight, Founder and CEO, Pragmatic Works; and myself, Sr. Principal Architect, Pragmatic Works. We offered an informational session on steps you need to take now to help in your journey with compliance.

This 2-hour webinar covered the key changes needed to be addressed for GDPR: Controls, Modifications, Transparent Policies and IT and Training. It also discusses how modernizing your data platform, on-premises and in Azure, will immediately reduce areas out of compliance, as well as what Azure tools and services are offered to help ensure you remain in compliance.

It also taps into experience from the Pragmatic Works team on some of the danger areas customers face and how the suite of software tools can help you expose areas of concern in your environment. Still using SQL Server 2008 or 2008 R2? Here you’ll learn what it means for 2008/2008 R2 end of support and paths to upgrade your SQL Server.

Take some time and watch this information packed webinar that will help eliminate confusion around GDPR and discuss the steps you need to take to be in compliance, as well as how to make your plans actionable. GDPR goes into effect this month. This webinar will educate you and give you options to move along your journey into GDPR and a Microsoft modern data platform.

 

The 5 Stages of Cloud Adoption

So, still not in the cloud and the thought of doing so feels like you’re taking a huge jump into unknown waters? We’re seeing more enterprises starting to dip their toe in the water with Microsoft Azure. Microsoft has shown their commitment to where they’re going with their cloud infrastructure and the growth has been tremendous.

Let’s look at the 5 stages or steps of cloud adoption to ease the fear of taking the leap:

Step 1 – Chaos – In most cases, there’s some chaotic event that makes businesses start looking at alternative ways to service their customers or their business. Maybe a server dies, or software comes to the end of life or support. The cloud then becomes a viable option and people start to consider it.

Step 2 – Awareness – Once the cloud is on the plate, people start by ramping up their cloud knowledge. They may start with training, hackathons, POCs or try some hands-on opportunities, like setting up an Azure Active Directory to sync with their on-premises AD. Building knowledge leads to Step 3.

Step 3 – Security – Most companies get hung up on security concerns around the cloud. I can tell you that Microsoft has spent more money than any other company worldwide on security, with over one billion spent in 2017. They are committed to making their customers’ security a top priority. Through their commitment they have 72 government and standardization certificates; their closest competitor, AWS, only has 44.

To overcome your fear, you need to realize that with Azure, you have an entire team of security experts watching your data and servers, as well as implementing best practices and creating new ways and policies to help companies avoid any kind of breach.

Step 4 – Governance – So, you’ve gotten over security concerns and have put your trust in the Azure public cloud, now you must develop best practices, policies and procedures around governance. The good news is when you start looking at service offerings, whether it’s PaaS, SaaS or IaaS, Microsoft has the best in class offerings and they’re managing a good portion of that security for you.

Step 5 – Optimization – Once you’ve got your environment in the cloud, how do you optimize it for performance and cost effectiveness? Take the time to choose the best services for your business and optimize your servers to minimize cost and run those servers in the best way; this can become a differentiator against your competitors.