Azure Enterprise Security Package for HDInsight

In today’s post I’d like to talk about the Enterprise Security Package for Azure HDInsight. HDInsight is a managed cloud Platform as a Service offering built on the Hadoop framework. It allows you to build big data solutions using Hadoop, Spark, Hive, LLAP and R, among others.

Let’s think about a traditional Hadoop deployment. You would stand up a cluster and give users local admin rights and SSH access to it. Then you would hand it over to the data scientists so they could do what they needed to run their data science workloads: train models, run scripts and so on.

As these big data workloads were adopted into the enterprise, enterprise-grade security became a much bigger concern. There was a need for role-based access control tied to Active Directory permissions. Admins wanted greater visibility into who was accessing the data and when, what they tried to get into, and whether those attempts succeeded – basically all the audit requirements that come with working with large data sets.

Who is the leader in enterprise security? Microsoft, of course, with Active Directory. The Enterprise Security Package lets you join the cluster to a domain as part of the cluster creation process, as a sort of ‘add-on’ in the Azure portal. Other things it allows you to do are:

  • Join an HDInsight cluster to Active Directory Domain Services.
  • Role-based access control for Hive, Spark and Interactive Hive (LLAP) using Apache Ranger.
  • Specific file and folder permissions for the data inside an Azure Data Lake Store.
  • Audit logs to see who accessed what, and when.

Currently, these features are only available for Spark, Hadoop and Interactive Query workloads, but support for more workload types is coming soon.

How and When to Scale Up/Out Using Azure Analysis Services

Some of you may not know when or how to scale up or scale out Azure Analysis Services. Today I’d like to help you understand both. First, you need to decide which tier you should be using. You can do that by looking at the QPUs (Query Processing Units) each tier offers in Azure. Here’s a quick breakdown:

  • Developer Tier – gives you up to 20 QPUs
  • Basic Tier – a mid-scale tier, not meant for heavy loads
  • Standard Tier (currently the highest available) – gives you the most capacity and flexibility

Let’s start with when to scale up your queries. You need to scale up when your reports are slow: you’re reporting out of Power BI and the throughput isn’t keeping up with your needs. Scaling up simply adds more resources; the QPU figure reflects a combination of CPU, memory and other factors such as the number of users.

Memory checks are straightforward. Run the metrics in the Azure portal and you can see what your memory usage is and whether your memory limit or memory hard limit settings are being saturated. If so, you need to either upgrade your tier or adjust the level within your current tier.

CPU bottlenecks are a bit tougher to figure out. You can get an idea by watching your QPU metrics to see if you’re saturating them and by looking at the logs within the Azure portal. Then you want to watch your query pool job queue length and your processing pool busy non-I/O threads. Together these should give you an idea of how the server is performing.
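
If you’d rather pull those numbers programmatically than eyeball the portal, here’s a minimal Python sketch using the azure-identity and azure-monitor-query packages. The resource ID and the metric names (memory_metric, qpu_metric) are assumptions you’d verify against the metrics listed for your server in the portal.

    # Sketch: pull recent memory and QPU metrics for an Azure Analysis Services server.
    # Assumes: pip install azure-identity azure-monitor-query
    # The resource ID and metric names below are placeholders/assumptions - check the
    # metric names exposed for your server in the Azure portal before relying on them.
    from datetime import timedelta

    from azure.identity import DefaultAzureCredential
    from azure.monitor.query import MetricsQueryClient

    AAS_RESOURCE_ID = (
        "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
        "/providers/Microsoft.AnalysisServices/servers/<server-name>"
    )

    client = MetricsQueryClient(DefaultAzureCredential())

    response = client.query_resource(
        AAS_RESOURCE_ID,
        metric_names=["memory_metric", "qpu_metric"],  # assumed metric names
        timespan=timedelta(hours=1),
    )

    for metric in response.metrics:
        for series in metric.timeseries:
            for point in series.data:
                if point.average is not None:
                    print(f"{metric.name} @ {point.timestamp}: avg={point.average}")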

For the most part, you’re going to want to scale up when the processing engine is taking too long to process the data to build your models.

Next up, scaling out. You’ll want to scale out if report responsiveness is suffering because the reporting demand is saturating what you currently have available. Typically, in cases with a large number of users, you can fix this by scaling out and adding more nodes.

You can add up to 7 additional query replicas. These are read-only replicas that you can report off; processing is handled on the initial instance of Azure Analysis Services, while subsequent queries are handled by the query replicas. Hence, processing does not affect the responsiveness of the reports.

Once model processing is separated from the query engine, you can measure performance by watching Log Analytics and the query processing unit metrics to see how the replicas are performing. If you’re still saturating those, you’ll need to re-evaluate whether you need additional QPUs or an upgraded tier.

Something to keep in mind is that once you’ve processed your data, you must resynchronize it across all of those query replicas. So, if you’re going to be processing data throughout the day, it’s a good idea not only to run the processing, but also to schedule those synchronizations strategically.
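
If you’d like to script that synchronization as part of your processing job, the service exposes a sync REST endpoint. Below is a minimal Python sketch using the requests library; the region, server and model names are placeholders, and the exact URL format should be confirmed against the Analysis Services REST documentation for your rollout region.

    # Sketch: trigger a synchronize operation across query replicas after processing.
    # Assumes: pip install requests, plus an Azure AD access token for the
    # *.asazure.windows.net resource. Region, server and model names are placeholders.
    import requests

    REGION = "westus"          # your server's rollout region
    SERVER = "myaasserver"     # Analysis Services server name
    MODEL = "AdventureWorks"   # model (database) to synchronize
    ACCESS_TOKEN = "<azure-ad-bearer-token>"

    sync_url = f"https://{REGION}.asazure.windows.net/servers/{SERVER}/models/{MODEL}/sync"

    resp = requests.post(sync_url, headers={"Authorization": f"Bearer {ACCESS_TOKEN}"})
    resp.raise_for_status()

    # The service replies with an operation you can poll to track the
    # synchronization status.
    print(resp.status_code, resp.text)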

Also important to know is that scale-out requires the Standard tier; Basic and Developer will not work for this purpose. There are some interesting resources out there that let you auto-scale on a schedule using a PowerShell runbook, driven by an Azure Automation account that scales up or out based on the needs of the environment. For example, if you know that on Monday mornings you’re going to need additional processing power to run your queries efficiently, you can set up a schedule for that window and then scale back afterward.
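
The post above mentions a PowerShell runbook; if Python is more your speed, here’s a very rough sketch of the scaling call itself, assuming the azure-identity and azure-mgmt-analysisservices packages. The exact model and method names can vary between SDK versions, so treat this as a sketch rather than something to copy and paste.

    # Sketch: bump an Azure Analysis Services server to a bigger SKU (scale up) and
    # add query replicas (scale out), e.g. from a scheduled job before Monday's load.
    # Assumes: pip install azure-identity azure-mgmt-analysisservices
    # Method/model names are assumptions - verify against your installed SDK version.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.analysisservices import AnalysisServicesManagementClient
    from azure.mgmt.analysisservices.models import (
        AnalysisServicesServerUpdateParameters,
        ResourceSku,
    )

    SUBSCRIPTION_ID = "<subscription-id>"
    RESOURCE_GROUP = "<resource-group>"
    SERVER_NAME = "<server-name>"

    client = AnalysisServicesManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

    # Standard S1 with capacity 2 (primary instance plus one read-only query replica).
    update = AnalysisServicesServerUpdateParameters(sku=ResourceSku(name="S1", capacity=2))

    poller = client.servers.begin_update(RESOURCE_GROUP, SERVER_NAME, update)
    server = poller.result()
    print("Server now running SKU:", server.sku.name, "capacity:", server.sku.capacity)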

Another note: you can scale up to a higher tier, but you cannot automatically scale back down to a lower tier from a script. Still, this ability lets you be prepared for additional demands on the environment.

I hope this helped answer your questions about scaling up and out.

 

Azure Data Factory vs Logic Apps

Customers often ask, should I use Logic Apps or Data Factory for this? Of course, the answer I give is the same as with most technology: it depends. What is the business use case we’re talking about?

Logic Apps can help you simplify how you build automated, scalable workflows that integrate apps and data across cloud and on-premises services. Azure Data Factory is a cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation. Similar definitions, so that probably didn’t help at all, right?

Let me try to clear up some of the confusion. There are situations where the best approach is to use both, for instance where a feature is lacking in Data Factory but can be found in Logic Apps, since Logic Apps has been around longer. A great use case is alerting and notifications: you can call out from Data Factory with a web request and send a notification through a Logic App via email, telling a user that a job has completed or failed.
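
As a minimal sketch of that hand-off, here’s the kind of HTTP call a Data Factory Web activity would make to a Logic App’s HTTP trigger, shown in Python with the requests library. The trigger URL and the payload fields are hypothetical placeholders for whatever your Logic App’s request trigger expects.

    # Sketch: the kind of call a Data Factory Web activity would make to a Logic App
    # HTTP trigger to send a pipeline success/failure email. URL and fields are
    # placeholders for whatever your Logic App's request trigger expects.
    import requests

    LOGIC_APP_TRIGGER_URL = "https://<region>.logic.azure.com/workflows/<id>/triggers/manual/paths/invoke?<sig>"

    payload = {
        "pipelineName": "CopySalesData",
        "status": "Failed",
        "runId": "<data-factory-run-id>",
        "emailTo": "data-team@contoso.com",
    }

    resp = requests.post(LOGIC_APP_TRIGGER_URL, json=payload, timeout=30)
    resp.raise_for_status()
    print("Notification accepted:", resp.status_code)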

To answer the question of why I would use one over the other, I’d say it comes down to how much data we’re moving and how much transformation we need to do on that data to make it ready for consumption. Are we reporting on it, putting it in Azure Data Warehouse, building some facts and dimensions and creating our enterprise data warehouse then reporting off of that with Power BI? This would all require a decent amount of heavy lifting. I would not suggest a Logic App for that.

If you’re monitoring a folder on-prem or in OneDrive, watching for files to be posted there, and you simply want to move each file to another location or send a notification about an action on the file, that is a great use case for a Logic App.

However, the real sweet spot is when you can use them together, as it helps you maximize cost efficiency. Depending on the operation, the same task can be more or less expensive to run in Data Factory than in Logic Apps.

You can also make your operations more efficient. Use the power of Azure Data Factory with its SSIS integration runtime and feature sets that include things like Databricks and HDInsight clusters, where you can process huge amounts of data with massively parallel processing, or use your Hadoop file stores for reporting on structured, semi-structured or unstructured data. Logic Apps can then help you enhance that process.

Clear as mud, right? Hopefully I was able to break it down a bit better. To put it in simple terms: when you think about Logic Apps, think about business applications; when you think about Azure Data Factory, think about moving data, especially large data sets, transforming that data and building data warehouses.

 

Azure Common Data Services

What do you know about Azure Common Data Services? Today I’d like to talk about this application platform, which Microsoft recently reworked to expand upon the product’s vision. Common Data Services is an Azure-based business application platform that enables you to easily build and extend applications using your business data.

Common Data Services helps you bring together your data from across the Dynamics 365 suite (CRM, AX, NAV, GP) and gives you a common layer from which to extract data, rather than having to get into the core of those applications. It also allows you to focus on building and delivering the apps, insights and process automation that will help you run more efficiently. Plus, it integrates nicely with PowerApps, Power BI and Microsoft Flow.

Some other key things:

  • If you want to build Power BI reports from your Dynamics 365 CRM data, there are pre-canned entities provided by Microsoft.
  • Data within the Common Data Services (CDS) is stored within a set of entities. An entity is just a set of records used to store data, similar to how a table stores data within a database.
  • CDS should be thought of as a managed database service. You don’t have to create indexes or do any kind of database tuning; you’re not managing a database server as you would with SQL Server or a data warehouse. It’s designed to be a somewhat centralized data repository that you can report from or build on further.
  • PowerApps is quickly becoming a good replacement for things like Microsoft Access, given the functionality and feature sets it comes with. A common use for PowerApps is extending that data rather than having to dig into the underlying platform.
  • This technology is easy to use, to share and to secure. You set up user accounts as you would with other Azure services, granting specific permissions/access per user.
  • It gives you the metadata you need about that data, and you can specify what kind of field or column you’re working with within each entity.
  • It gives you logic and validation; you can create business process rules and data flows from entity to entity, or from an app to an entity to PowerApps.
  • You can create workflows that automate business processes, such as data cleansing or record updating; these workflows can run in the background without having to be managed manually.
  • It has good connectivity with Excel, which makes it user-friendly for people comfortable with that tool.
  • For power users, there’s an SDK available for developers, which allows you to extend the product and build some cool custom apps.

I don’t think of this as a replacement for Azure SQL DW or DB, but it does give you table-based data in the cloud with some nice hooks into the Dynamics 365 space, as well as outputs to PowerApps and Power BI.

Overview of Azure Data Catalog

In today’s post, I’ll give you an overview of Azure Data Catalog and an example of how you might use it in your organization. Azure Data Catalog is used to discover the data sources in your environment, work out what those data sources contain, and describe the sources you’ve already found.

It provides the ability to add metadata and annotations around your data. So, if you want to describe a column or a data source, or attach documentation or a schema, you can do all of this in Azure Data Catalog. It also provides a cloud-based service in which data sources can be registered.

The data remains in its existing location, but a copy of its metadata is added to the data catalog, along with a reference to the data source location, so you’ll know where to find it when you need it. The metadata is also indexed, ensuring that each data source is easily discoverable through search and understandable to the users who discover it.

The primary purpose of registering data sources in the data catalog is the discovery and understanding of them. Enterprise users may need data for business intelligence, application development, data science or other tasks where the right data is required. The Data Catalog discovery experience lets them quickly find data that matches their needs.

Users can also evaluate the data to see whether it serves their purpose, then consume it by opening the data source in their tool of choice. At the same time, users can contribute to the catalog by adding metadata or annotations. They can register new data sources as well, which can then be discovered, understood and consumed by other users who have permission to do so. All of this is locked down by permission and can be secured with Active Directory.

Here’s a basic example of using Azure Data Catalog:

Let’s say we’re moving towards a self-service BI idea, whether it’s a data team or IT team setting up the data, so users can create their own dashboards in Power BI. The IT or data team has already secured the data by making sure users only have access to what they need/should have. Now the information workers and analysts can create their own reports, workbooks and dashboards without having any restrictions from IT.

As new data gets created by workers and analysts, it can be challenging to share information about that data, where it lives, for instance. Let’s say I save it into a SharePoint repository. I may not remember to tell everyone about it, and even if I did, I’d probably have to remind them six months from now. Obviously, this is ineffective and a big waste of time.

This is where Data Catalog comes in: it gives data creators the ability to catalog and tag data, making it easier for all users with permission to find it. The data gets registered in a centralized catalog while remaining where it lives, and users can go in and add whatever annotations, tags or metadata apply.

Azure Data Catalog is a great tool that we highly recommend for all your data projects in Azure!

Continuous Integration and Deployment Using Azure Data Factory

Today I’m excited to talk about one of the new releases in Azure that gives you continuous integration and deployment using Azure Data Factory. This new release is an Azure Data Factory visual interface that allows you to export any of your Data Factory components as an Azure Resource Manager (ARM) template.

When you export from your Data Factory, it generates two files: a template file, which contains all the Data Factory metadata for the pipelines, data sets and so on, and a configuration file, which contains the environment parameters that will differ between environments. So, if you’re going to create a development, a test and a production environment, each one gets its own parameter values.

You can also parameterize things like storage containers, Databricks clusters, etc. After you’ve deployed this, you’re going to create a new factory for your environment. You’re also going to associate your Visual Studio Team Services (VSTS) Git repository with that Data Factory, enabling source control, versioning and collaboration.

Next, you’ll set up your Data Factory with VSTS. This is where all the developers can author Data Factory resources, such as pipelines, data sets and other components. Once you have this development area set up, developers can modify the resources and debug them right in the interface, along with checking performance. They’ll also have the option to create a pull request from their branch to master, or create a collaboration branch, to get the changes reviewed by peers.

Once they’re satisfied with the changes and ready to go to production, they merge them into the master branch and can then publish to the development Data Factory. From there, they can promote to each of the other environments by exporting the ARM templates from the master branch, or any other branch, when they’re ready.

So, you export the template and it gets deployed with different environment parameters to test and production environments. From there, you can also set up VSTS release definitions to automate the deployment of your Data Factory to multiple environments.
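
Here’s a rough idea of what that automated deployment step can look like in Python: take the exported ARM template plus an environment-specific parameters file and push them to a target resource group. It assumes the azure-identity and azure-mgmt-resource packages, and the file names and resource group are placeholders.

    # Sketch: deploy an exported Data Factory ARM template with environment-specific
    # parameters (e.g. the test or production parameter file generated on export).
    # Assumes: pip install azure-identity azure-mgmt-resource
    import json

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.resource import ResourceManagementClient
    from azure.mgmt.resource.resources.models import (
        Deployment,
        DeploymentMode,
        DeploymentProperties,
    )

    SUBSCRIPTION_ID = "<subscription-id>"
    TARGET_RESOURCE_GROUP = "rg-datafactory-test"   # placeholder

    with open("arm_template.json") as f:                  # exported template file
        template = json.load(f)
    with open("arm_template_parameters_test.json") as f:  # per-environment parameters
        parameters = json.load(f)["parameters"]

    client = ResourceManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

    deployment = Deployment(
        properties=DeploymentProperties(
            mode=DeploymentMode.INCREMENTAL,
            template=template,
            parameters=parameters,
        )
    )

    poller = client.deployments.begin_create_or_update(
        TARGET_RESOURCE_GROUP, "adf-cicd-deployment", deployment
    )
    print("Deployment finished with state:", poller.result().properties.provisioning_state)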

The benefit of this is that it lets you bring the true dev, test and production environments you’re used to on-premises with SSIS or other ETL tools into Azure. This tooling offers a tremendous amount of power, and it’s getting better all the time.

Device Management with Azure IoT Hub

Yesterday’s post covered what Azure IoT Hub is and what it brings. Today I’m going a bit deeper and talking about how the devices you’re bringing to the table get managed. IoT Hub provides the features and extensibility to give devices, and the people who program those devices and design their architectures, a robust device management solution.

Devices are all over the place: sensors, microcontrollers and Raspberry Pi computers, as well as the gateways that route communications for groups of devices. They’re installed on a local network and can work peer to peer or rely on a router that passes information back and forth.

Azure IoT Hub offers a flexible platform that supports many different uses, industries and device types, so you get that compatibility no matter the industry you’re in. Whatever you’re using the devices for, a significant part of the work lies in planning how the devices and gateways will work together in IoT Hub.

Let’s look at some things to be aware of:

1. Device Management Principles – Here you’ve got scale and automation. You need simple tools to automate routine tasks, and you need the ability to manage millions of devices simply, remotely and in bulk, so you can make sweeping changes across a whole fleet of devices.

In addition, you don’t need to be alerted for every change or notification, but you do need to be alerted when there’s a problem. There are many different devices, protocols and patterns, and IoT Hub needs to accommodate them all; with the wide range of devices from single-purpose chips to fully functional computers, we need the flexibility to accommodate all of those systems.

Other things you need to know are:

  • Context awareness to accommodate the SLA and maintenance windows for when there’s downtime.
  • The network and power states.
  • The in-use conditions – What are the expectations while the devices themselves are working?
  • Where the device is – Is it in a building or out in the field on a utility pole?

These devices serve many roles and must work within the IT operations of your group. They need to be easily managed by that group, or an extension of it, and be able to surface alerts when required. Most importantly, this all needs to work within your internal IT ecosystem to keep continuity and consistency inside the business.

2. Device Lifecycle – We start with a plan: how will we use the devices, how will they be managed, and which devices will we use for our specific scenario? Next, we provision them by adding them to the IoT Hub identity registry, so that in the next step they’re recognized by the system (a rough code sketch of that step follows below). Our next step is to configure them. We want to maintain the health of each device even while we’re pushing updates and configuration changes, and we can send and confirm those updates securely.

We also need to monitor each device’s health to be aware if it’s beginning to fail; many are small, simple devices with a limited lifespan. We monitor the status of the device and need the ability to get alerts when a device begins to have issues. Then, ultimately, we need to remove old devices that are no longer effective, so they’re not cluttering up the IoT Hub interface.
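
To make that provisioning step a bit more concrete, here’s a minimal Python sketch that registers a device in the IoT Hub identity registry, assuming the azure-iot-hub package and a placeholder service connection string (method names may differ slightly between SDK versions).

    # Sketch: add a new device to the IoT Hub identity registry so it can authenticate
    # and be managed. Assumes: pip install azure-iot-hub
    # The connection string and device ID are placeholders; API details may vary by version.
    import base64
    import os

    from azure.iot.hub import IoTHubRegistryManager

    IOTHUB_CONNECTION_STRING = "HostName=<hub>.azure-devices.net;SharedAccessKeyName=registryReadWrite;SharedAccessKey=<key>"
    DEVICE_ID = "field-sensor-001"

    registry_manager = IoTHubRegistryManager(IOTHUB_CONNECTION_STRING)

    # Generate SAS keys for the device; status "enabled" means it is allowed to connect.
    primary_key = base64.b64encode(os.urandom(32)).decode()
    secondary_key = base64.b64encode(os.urandom(32)).decode()

    device = registry_manager.create_device_with_sas(
        DEVICE_ID, primary_key, secondary_key, "enabled"
    )
    print("Provisioned device:", device.device_id, "status:", device.status)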

3. Device Management Patterns – How are we interacting with devices after they’ve been deployed? If you’re going to reboot, factory-reset or redeploy a device, you’ll need to reconfigure it so that it can be brought back up in the system (I’ll sketch the reboot case in code below). You’ll also need to push simple configuration changes to alter how the devices behave, and these need to be doable in bulk.

To ensure you’re staying on top of bug fixes and new functionality for your devices, you’ll need to send firmware updates. Lastly, you need reporting on the progress and status of the devices themselves. It’s important that you have visibility into how the devices are performing and know if there are any problems.
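
As a rough illustration of one of those patterns, the reboot, here’s a Python sketch that invokes a ‘reboot’ direct method on a device through IoT Hub. It assumes the azure-iot-hub package, that the device-side code registers a handler for that method name, and placeholder connection details.

    # Sketch: ask a device to reboot by invoking a direct method through IoT Hub.
    # Assumes: pip install azure-iot-hub, and that the device-side code registers a
    # handler for a method named "reboot". Connection string and device ID are placeholders.
    from azure.iot.hub import IoTHubRegistryManager
    from azure.iot.hub.models import CloudToDeviceMethod

    IOTHUB_CONNECTION_STRING = "HostName=<hub>.azure-devices.net;SharedAccessKeyName=service;SharedAccessKey=<key>"
    DEVICE_ID = "field-sensor-001"

    registry_manager = IoTHubRegistryManager(IOTHUB_CONNECTION_STRING)

    reboot = CloudToDeviceMethod(
        method_name="reboot",
        payload={"delaySeconds": 30},
        response_timeout_in_seconds=60,
    )

    result = registry_manager.invoke_device_method(DEVICE_ID, reboot)
    print("Device responded with status:", result.status)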

This has been a high-level overview of device management with Azure IoT Hub. I hope you found it informative and helpful.

How Does Azure IoT Hub Work?

Today I’d like to talk about the Internet of Things (IoT) and the Azure IoT Hub. IoT devices are not your typical devices like mobile phones, tablets or laptops. IoT devices are designed to respond to the sensor activity they’re built for, a glass-break sensor, for instance.

These devices are meant for specific communications, whereas a typical device acts more like a server waiting to receive information from everywhere, which can open up security threats if IoT devices are deployed in that manner. We can use firewalls and software to protect our equipment, but the whole idea with IoT is that these are low-power, no-frills devices, so you don’t have a lot of that capability on the device itself.

Also, the traditional PKI trust model is inefficient and ineffective for the IoT model; the TTL (time to live) on certificates is too long to make sense for these devices. On top of that, promiscuous mode being turned on by default defeats the purpose of trying to run a secure environment.

Azure IoT Hub implements a service-assisted communication methodology that mediates the interaction between backend systems and devices. With this you have a bi-directional, trustworthy communication setup, and security is the number one priority of this configuration.

Devices will not accept unsolicited information; they must regularly check in for instructions, and authorization is based on per-device identity. For devices in areas with network coverage or power issues, IoT Hub provides queues for the messages destined for those devices. Essentially, it will hold the message and validate the device before anything is sent or received, sending the necessary data only after validation.
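
On the device side, that model looks something like the following Python sketch using the azure-iot-device package: the device authenticates with its own identity, connects outbound to the hub, and sends telemetry rather than listening for unsolicited inbound traffic. The connection string and payload are placeholders.

    # Sketch: a device authenticating with its per-device identity and pushing telemetry
    # outbound to IoT Hub (it never has to accept unsolicited inbound connections).
    # Assumes: pip install azure-iot-device. Connection string and payload are placeholders.
    import json
    import time

    from azure.iot.device import IoTHubDeviceClient, Message

    DEVICE_CONNECTION_STRING = "HostName=<hub>.azure-devices.net;DeviceId=glass-break-01;SharedAccessKey=<device-key>"

    client = IoTHubDeviceClient.create_from_connection_string(DEVICE_CONNECTION_STRING)
    client.connect()

    try:
        for reading in range(3):
            payload = {"sensor": "glass-break-01", "triggered": False, "reading": reading}
            client.send_message(Message(json.dumps(payload)))
            time.sleep(5)  # check in / report on the device's own schedule
    finally:
        client.shutdown()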

The application payload data is secured separately, so any data flowing through is protected in transit through the gateways; it’s wrapped prior to being sent and received between devices. Devices can be configured to work peer to peer before they reach a gateway, in order to extend the range, and that gateway is what communicates with your Azure IoT Hub.

All that traffic is designed to flow to and from the gateway and then on to the IoT Hub, where you can collect the data for big data uses, Power BI reports or many other purposes.

Overview of Azure Stream Analytics

Analytics is the key to making your data useful and supporting decision making. Today I’m excited to talk about Azure Stream Analytics. Azure Stream Analytics is an event processing engine that allows you to capture and examine high volumes of data from all kinds of connections, like devices, websites and social media feeds.

You can examine those data streams and trigger things like alerts, as well as take action through reporting or storage. So, whether you want to report on the data with Power BI or store it for later, you have those options. Stream Analytics is used a lot with IoT and streaming social media feeds, where people want to keep an eye on what’s happening with the data.

Here’s how it works. It starts with a data source such as Event Hubs, IoT Hub or Azure Blob Storage, and it uses a SQL-like query language that allows transformation on the fly. It helps you with operations like filtering, sorting, aggregating and joining the data together to make it more usable: turning data into information.
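
As a quick illustration of the front end of that pipeline, here’s a small Python sketch that pushes JSON events into an Event Hub, which a Stream Analytics job could then filter and aggregate with its SQL-like query language. It assumes the azure-eventhub package and placeholder connection details.

    # Sketch: send a few JSON events into an Event Hub that serves as the input of a
    # Stream Analytics job. Assumes: pip install azure-eventhub
    # Connection string, hub name and the event shape are placeholders.
    import json

    from azure.eventhub import EventData, EventHubProducerClient

    EVENTHUB_CONNECTION_STRING = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=send;SharedAccessKey=<key>"
    EVENTHUB_NAME = "tweets"

    events = [
        {"user": "alice", "text": "loving #azure stream analytics", "lang": "en"},
        {"user": "bob", "text": "big data day", "lang": "en"},
    ]

    producer = EventHubProducerClient.from_connection_string(
        EVENTHUB_CONNECTION_STRING, eventhub_name=EVENTHUB_NAME
    )

    with producer:
        batch = producer.create_batch()
        for event in events:
            batch.add(EventData(json.dumps(event)))
        producer.send_batch(batch)

    # A Stream Analytics query could then, for example, count events per keyword over
    # a tumbling window and send the results on to Power BI.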

From there, once you identify the data you want or need to use, you can send it downstream to a queue for triggering workflows or further processing, or send it to Power BI for real-time visualization. For example, let’s say you’re watching a Twitter stream and you want to pull certain keywords out to see how they’re being used. By connecting to the Twitter API, you can capture that data, stream it, and then report on it with a Power BI report.

Of course, the other option is to archive it for further processing down the road if you want to do something with that data.

Stream Analytics was designed to be easy to use and quick to spin up. It has source and sink integration and an easy-to-use, declarative, SQL-like query language. Also, it’s a managed service, so it’s pay-as-you-use, as with many Azure services; there’s no need to buy hardware or software up front. And it has an enterprise-grade service level agreement, so it’s robust, reliable and can run across multiple locations.

Another big positive is that its in-memory processing with multi-node capabilities offers tremendous scalability and performance benefits. Plus, unlike on-prem solutions, it can be fairly elastic: you can add nodes as you need them to process more data and bring them back down when you’re not using them.

There are a lot of cool things being done with stream analytics and IoT; it’s an exciting time to be in this arena.

Azure Data Factory Integration Runtimes

This week I’ve been talking about Azure Data Factory. In today’s post I’d like to talk about the much-awaited Azure Data Factory Integration Runtime. The integration runtime is the compute infrastructure that provides data movement (connectors, data conversion and data transfer) as well as activity dispatch, so you can dispatch and monitor activities running on HDInsight, SQL DB or DW, and more.

A big part of V2 is that you can now lift and shift your SSIS packages into Azure and run them from the Azure portal inside Data Factory. There are 3 integration runtime types:

1. Azure Integration Runtime – This runs entirely in Azure; use it if you’re going to be copying between two cloud data sources.

2. Self-Hosted Integration Runtime – This can run copy and transformation activities between cloud and on-premises sources, including moving data from an IaaS virtual machine. Use it if you’re going to be copying between a cloud source and a source on an on-prem, private network.

So, if your environment is behind a firewall rather than in the public cloud, and you want to move data from that environment to Azure but a gateway won’t work for you, or if an IaaS virtual machine is isolated and you can’t reach its data storage directly, you would set up this runtime to bridge the two sites (a rough creation sketch follows after this list).

3. SSIS Integration Runtime – Use this when you’re lifting and shifting your SSIS packages into Azure Data Factory. One key thing to mention: it does not yet support third-party SSIS components, but that support will be added eventually.
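
Here’s a rough Python sketch of registering a self-hosted integration runtime against a factory, assuming the azure-identity and azure-mgmt-datafactory packages (resource names are placeholders and exact model/method names may vary by SDK version). After creating it, you’d still install the runtime software on a machine inside the private network and register it with the generated authentication key.

    # Sketch: create a self-hosted integration runtime definition on a Data Factory.
    # Assumes: pip install azure-identity azure-mgmt-datafactory
    # Resource names are placeholders; model/method names may vary by SDK version.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        IntegrationRuntimeResource,
        SelfHostedIntegrationRuntime,
    )

    SUBSCRIPTION_ID = "<subscription-id>"
    RESOURCE_GROUP = "<resource-group>"
    FACTORY_NAME = "<data-factory-name>"
    IR_NAME = "OnPremIR"

    client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

    ir = IntegrationRuntimeResource(
        properties=SelfHostedIntegrationRuntime(description="Bridges the on-prem network to Azure")
    )

    created = client.integration_runtimes.create_or_update(
        RESOURCE_GROUP, FACTORY_NAME, IR_NAME, ir
    )
    print("Created integration runtime:", created.name)

    # Next step (outside this script): install the self-hosted runtime software on an
    # on-prem machine and register it using this runtime's authentication key.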

Where are these located? Azure Data Factory is available in a limited set of regions at this time, but it can access data stores and compute services globally. With the Azure Integration Runtime, the location of the runtime defines where the backend compute resources run. This is optimized for data compliance, efficiency and reduced network egress costs, to ensure the best available services are used in the region you need.

The Self-Hosted Integration Runtime is installed in the environment on that private network. The SSIS Integration Runtime’s location is determined by where the Azure SQL DB or managed instance hosting the SSISDB catalog lives. Its available locations are currently limited, but it does not have to be in the same place as the Data Factory; place it as close to the data sources as possible and it will run as optimally as it can.