Category Archives: Training

Continuous Integration and Deployment Using Azure Data Factory

Today I’m excited to talk about one of the new releases in Azure that gives you continuous integration and deployment using Azure Data Factory. This new release is an Azure Data Factory visual interface that allows you to export any of your Data Factory components as an Azure Resource Manager (ARM) template.

When you do these exports from your Data Factory, it will generate 2 files.  The template file, which will contain all the Data Factory metadata for the pipelines, data sets, etc., as well as a configuration file, which will contain environment parameters that will be different for each of your environments. So, if you’re going to create a development, a test and a production environment, each one will be different.

You also can specify things like storage containers, Databricks clusters, etc. After you’ve deployed this, you’re going to create a new factory for your environment. You’re also going to associate your Visual Studio team services get repository to that Data Factory, enabling source control versioning and collaboration uses.

Next, you’ll set up your Data Factory with VSTS. This is where all the developers can author data factory resources, such as pipelines, data sets and other components. Once you have this development area set up, developers can modify the resources and debug them right in the interface, along with checking performance. They’ll also have the option to create a PR from their branch to master or create a collaborative branch to get the changes reviewed by peers.

Once they are satisfied with the changes and are ready to go to production, they set it in the master branch and can then publish it to the development Data Factory. Or they can promote each of those environments through exporting those ARM templates when they’re ready from the master branch, or any other branch.

So, you export the template and it gets deployed with different environment parameters to test and production environments. From there, you can also set up VSTS release definitions to automate the deployment of your Data Factory to multiple environments.

The benefit with this is it opens the opportunity to bring your true dev test and production environments, that you’re used to in your local environment using SSIS or other ETL tools, to Azure. This tool offers a tremendous amount of power and it’s getting better all the time.

Device Management with Azure IoT Hub

Yesterday’s post covered what Azure IoT Hub is and what it brings. Today I’m going a bit deeper and talking about how the devices you’re bringing to the table get managed. IoT Hub provides the features and extensibility that enables devices, as well as the people who program those devices and their architectures, with a robust device management solution.

Devices are all over the place; they are sensors, microcontrollers and Raspberry Pi computers. It’s also the gateways that route the communications for groups of devices. They’re installed on a local network and can work in peer to peer networks or have a router that passes information back and forth.

Azure IoT Hub offers a flexible platform for the many different uses across many different industries and devices themselves to be able to have that compatibility no matter the industry you’re in. No matter what you’re using the devices for, a significant part lies in the planning of how the devices and gateways will work together in the IoT Hub.

Let’s look at some things to be aware of:

1. Device Management Principles – Here you’ve got your scale and automation. You need to have simple tools to automate routine tasks. And you need the ability to manage millions of devices simply, as well as remotely and in bulk, so you can make sweeping changes across a whole suite of devices.

In addition, you don’t need to be alerted for every change or notification, but you do need to be alerted when there’s a problem. There are many different devices, protocols and patterns. IoT Hub needs to accommodate all those changes; with the wide range of devices from single process chips to fully functional computers, we need to have the flexibility to accommodate those systems.

Other things you need to know are:

  • Context awareness to accommodate the SLA and maintenance windows for when there’s downtime.
  • The network and power states.
  • The in-use conditions – What are the expectations while the devices themselves are working?
  • Where the device is – Is it in a building or out in the field on a utility pole?

These devices serve many roles and must work within the IT operations of your group. They need to be easily managed from that group or an extension from that group, as well as be able to surface alerts when it’s required. Most importantly, this all needs to work within your internal IT ecosystem to keep that continuity and consistency inside the business.

2. Device Lifecycle – So, we start with a plan – how will we use the devices; how will they be managed; and what will the devices be for our specific instance? Next, we need to provision them by adding them into the IoT Hub identity registry, so when we get to the next step they are being acknowledged in the system. Our next step is to configure them. We want to maintain the health of the device, even when we’re doing these updates and configurations, and we can send confirm updates securely.

Also, we need to monitor the device’s health to be aware if it’s beginning to fail. Many are small, simple devices that have a certain lifespan. We also monitor the status of the device and we need the ability to get alerts when the device begins to have issues. Then, ultimately, we need to remove old devices that are no longer effective so they’re not showing up or cluttering up the space of that IoT Hub interface.

3. Device Management Patterns – How are we interacting with devices after they’ve been deployed? So, if you’re going to reboot, factory reset or redeploy a device, you’ll need to reconfigure it so that it can be brought back up in the system. You’ll need to do simple configurations to change how the devices behave, and these need to be done in bulk.

To ensure you’re staying on top of bug fixes and new functionality and features for your devices, you’ll need to send firmware updates. Lastly, you need to show reporting progress and statuses of the devices themselves. It’s important that you have visibility into how the devices are performing and know if there are any problems.

This has been a high-level overview of device management with Azure IoT Hub. I hope you found it informative and helpful.

How Does Azure IoT Hub Work?

Today I’d like to talk about Internet of Things (IoT) and the Azure IoT Hub. IoT devices are not your typical devices like mobile phones, tablets or laptops. IoT devices are designed to respond to sensor activity that the device is being used for, like a glass break sensor for instance.

These devices are meant to be used for specific communications, whereas the typical device acts more like a server waiting to receive information from everywhere. This can cause some security threats if they are deployed in that manner. We can use firewalls and software to protect our equipment, but the whole idea with IoT is that these low power, no frills devices are what’s being deployed, so you don’t have a lot of that capability.

Also, the traditional PKI trust model is inefficient and ineffective for the IoT model; the TTL (time to live) certificates are too long and it doesn’t make sense for these devices. As well as the fact that promiscuous mode is turned on by default, which defeats the purpose of trying to have a secure environment.

Azure IoT Hub implements a service assisted communication methodology and this mediates interaction between backend systems and devices. With this you have a bi-directional, trust worthy communication set up and security is the number one priority of this configuration.

Devices will not accept unsolicited information; they must regularly check in for instructions, and authorization is based on per device identity. For devices in areas where there are network coverage or power issues, IoT provides cues for the messages that are set up for communication with the devices. Essentially, it will hold the message and validate the device before anything is sent/received; it will send the necessary data after it’s validated.

This also sets up an application payload data, which is secured separately, so any data that’s flowing through is going to be secured for protected transit through the gateways. The data is wrapped prior to sending and receiving between devices. Devices can be configured to work peer to peer before they get to a gateway to be able to extend out the range. That gateway is what communicates with your Azure IoT Hub.

All that traffic is designed to flow to and from the gateway and then communicate with the IoT Hub, which you can use to collect the data for big data uses, setting up Power BI reports or many other ways to use that data.

Overview of Azure Stream Analytics

Analytics is the key to making your data useful and supporting decision making. Today I’m excited to talk about Azure Stream Analytics. Azure Stream Analytics is an event processing engine that allows you to capture and examine high volumes of data from all kinds of connections, like devices, websites and social media feeds.

You can examine those data streams and it allows you to trigger things like alerts, as well as take action with reporting or storage. So, whether you want to report on it with Power BI or store the data for down the road, you have these options. Stream analytics is used a lot with IoT or streaming feeds through social media, where people want to keep an eye on what’s happening with the data.

Here’s how it works. It starts with a data source such as Event Hub, IoT Hub or Azure Blob Storage, and it uses SQL-like query language that allows transformation on the fly. It helps you process operations like filtering, sorting, aggregating and joining the data together to make it more useable—turning data into information.

From there, when you identify the data that you want/need to use, you can then send that data downstream to be sent to a queue for triggering workflows or further processing of the data. You can also send that data to Power BI for real-time visualization. For example, let’s say you’re looking at a data quality stream and you want to pull certain key words out of Twitter to see how they’re used and watch how that’s being done. By connecting to the Twitter API, you can capture that data, stream it, and then report from it with a Power BI report.

Of course, the other option is to archive it for further processing down the road if you want to do something with that data.

This was designed to be easy to use and spin up. It has source and sync integration and an easy to use declarative SQL query-like language. Also, it’s a managed service so it’s pay-as-you-use, as with many Azure services. There’s no need to buy hardware or software up front. And it has an enterprise grade service level agreement so it’s robust, reliable and you can have multi-locations.

Another big positive is it’s in-memory processing with multi-node capabilities offers tremendous scalability and performance benefits. Plus, unlike on prem solutions it can be fairly elastic, so you can buy nodes as you need them to process more data and you can bring them back down when you’re not using them.

There are a lot of cool things being done with stream analytics and IoT; it’s an exciting time to be in this arena.

Azure Data Factory Integration Runtimes

This week I’ve been talking about Azure Data Factory. In today’s post I’d like to talk about the much-awaited Azure Data Factory Integration Runtime. The compute infrastructure provides data movement, connectors, data conversions and data transfers, as well as activity dispatching, so you can dispatch and monitor activities running HDInsight, SQL DB or DW and more.

A big part of V2 is that you can now lift and shift your SSIS packages up into Azure and run them from your Azure data portal inside of Data Factory. There are 3 integration runtime types:

1. Azure Integration Runtime – This is clearly set up in Azure and you would use this if you’re going to be copying between two cloud data sources.

2. Self-Hosted Integration Runtime – This can run a copy of transformation activities between cloud and on-premises, including moving data from an IaaS virtual machine. Use this if you’re going to be copying between a cloud source and an on-prem, private network source.

So, if you’re environment is behind a firewall, not in the public cloud and you want to move data from your environment to Azure and the gateway will not work for you. Also, because that IaaS virtual machine is isolated, and you can’t get into that data storage, you would set up that integration between sites.

3. SSIS Integration Runtime – Use this when you’re lifting and shifting your SSIS packages into Azure Data Factory. But a key thing to mention is that this does not yet support third party tools for SSIS, but that support will be added eventually.

Where are these located? Azure Data Factories are located in limited regions at this time, but they can access data stores and compute services globally. With Azure Integration Runtime, the location of that runtime will define the backend compute resources where those are being used. This is optimized for data compliance, efficiency and reduced network egress costs, to ensure that they’re using the best services available in the region that is needed.

The Self-Hosted Runtime is installed in the environment in that private network. The SSIS Integration Runtime is determined based on where the SQL DB or managed instance is hosting that SSIS DB catalog. It is currently limited where it can be located, but it does not have to be in the same place as the Data Factory. It will be as close to the data sources as possible and will run as optimally as it can.

Azure Data Factory – Data Sets, Linked Services and Pipeline Executions

In my posts this week, I’ve been going a little deeper into Azure Data Factory (be sure to check out my others from Tuesday and Wednesday). I’m excited today to dig into getting data from one source into Azure.

First let’s talk about data sets. Consider a data set anything that you’re using as your input or output. An Azure blob data set is one example and is defined by the blob container itself, the folder, the file, the documentation, etc. In this case you can have that as your source or your destination.

There are many data set options, such as Azure Services (other data services or blobs, databases, data warehouses or Data Lake), as well as on premises databases like SQL Server, MySQL or Oracle. You also have your NoSQL components like Cassandra and MongoDB and you can get your files from an FTP server, Amazon S3 or internal file systems.

Lastly, you have your SaaS offerings like MS Dynamics, Salesforce or Marketo. There is a grid available in the Microsoft documentation which outlines what can be a source or a destination or both, in some cases.

Using Linked Services is the way that you link from your source to your destination. This defines the connection to the data source, whether it be the input or the output. Think of it like your connection string in SQL Server; you’re connecting to a specific place, whether you’re using that to output data from a source or input it to a destination.

Now, let’s look at Pipeline Executions. A pipeline is a collection of activities that you’ve built, and the executions run that pipeline moving the data from one place to another or to do some transformation with that data. There are 2 defined Pipeline Executions:

1. Manual (or On Demand) Executions. These can be triggered through the Azure portal, a REST API, as part of a PowerShell script, or as part of your .net application.

2. Setting Up Triggers. With this execution, you set up a trigger as part of your Data Factory. This was an exciting new change in Azure Data Factory V2. Triggers can be scheduled, so you can set a job to run at a particular time each day or you can set a tumbling window trigger. Using a tumbling window, you can set up your hours of operation (let’s say you want your data to run from 8-5, Monday through Friday, every hour on the hour). The tumbling window trigger runs continuously for the times/hours you’ve specified.

Doing this allows us to lessen our costs to run the job in Azure by using the compute time only when you need it, and not using compute time during downtimes like outside of business hours.

Azure Data Factory Pipelines and Activities

Yesterday’s Azure Every Day post covered how Azure Data Factory pricing works. In today’s post I’d like to go a bit deeper into Azure Data Factory Version 2 and review pipelines and activities. In essence, a pipeline is a logical grouping of activities. If you’re familiar with SSIS, think of an SSIS package being a grouping of activities that are happening with the data.

An example of a pipeline would look like: you want to pull data from a website, file server or database up into Azure and do some kind of transformation on that data, then report from it. Within the pipeline, multiple activities can be defined. If there’s no activity dependency on a set of activities – so you have one activity running and there’s no dependency on the next activity -then they can run in parallel.

This is good to keep in mind as you’re performing these activities because you may need to schedule them or figure out a way, so they don’t run in parallel or that one runs after another.

There are 3 main types of activities:

1. Data Movement Activities – This is the sources where you’re pulling in data from such as Azure Blob Storage, Azure Data Lake, Azure DB and DW. You can also set up an on premises gateway and pull in databases, such as commonly used DB2, MySQL, Oracle, SAP, Sybase and Teradata, as well as NoSQL databases like Cassandra and MongoDB.

I also mentioned files; you can pull from Amazon, S3, file systems, FTP, HTTP, etc. You also have the Software as a Service (SaaS) options: Dynamics, HubSpot, Marketo, QuickBooks, and Salesforce, to name a few. You can check a complete list on the Azure online documentation.

2. Data Transformation Activities – Here is where you’re taking your data after it’s ingested into Azure and doing something with it. Some common ones are HDInsight, HIVE, PIG, MapReduce, Hadoop Streaming and Spark transformations. These allow you to transform your big data in your Azure environment and stage it for your reporting.

Other common uses would be machine learning into an Azure VM, as well as stored procedures. You can have your stored procedures in SQL Server defined in Azure, and then run that stored procedure, and also use U-SQL for your Data Lake Analytics.

3. Control Activities – In these activities you can do things like execute your pipelines or run a ForEach statement or Look-up activities, the types of things where you’re controlling how the pipeline is working and interacting with the data.

 

How Azure Data Factory Pricing Works

In today’s post I’d like to discuss how Azure Data Factory pricing works with the Version 2 model which was just released. The pricing is broken down into four ways that you’re paying for this service. I hope that by pointing these out, you can gain an understanding of not only how it works, but how you can keep an eye on your spending.

1. Azure activity runs vs self-hosted activity runs – there are different pricing models for these. For the Azure activity runs it’s about copying activity, so you’re moving data from an Azure Blob to an Azure SQL database or Hive activity running high script on an Azure HDInsight cluster.

With self-hosted, you want to copy activity moving from an on premises SQL Server to an Azure Blob Storage, a stored procedure to an Azure Blob Storage or a stored procedure activity running a stored procedure on an on premises SQL Server.

2. Volume of data moved – this is measured in DMUs (data movement units). This is one you should be aware of as this will default to auto, which is basically using all the DMUs it can use and this is paid for by the hour. Let’s say you specify and use 2 DMUs and it takes an hour to move that data. The other option is you could use 8 DMUs and it takes 15 minutes, this price is going to end up the same. You’re using 4X the DMUs but it’s happening in a quarter of the time.

This is good to look at and do some comparisons since how many DMUs you’re using is where the bulk of your spend if going to be.

3. SSIS integration run times – here you’re using A-series and D-series compute levels. When you go through these, it depends on what the compute needs are to invoke the process (how much CPU, how much RAM, how much attempt storage you need).

4. The inactive pipeline – you’re paying a small account for pipelines (about 40 cents currently). A pipeline is considered inactive if it’s not associated with a trigger and hasn’t been run for over a week. Yes, it’s a minimal charge, but they do add up and when you start to wonder where some of those charges come from it’s good to keep this in mind.

Also, each of the components inside the Azure Data Factory, whether it’s blob storage, SQL Server, HDInsight or any kind of storage or compute resources you’re using as part of your pipeline, will also incur charges. These are billed separately based specifically around what those resources are.

Something to keep in mind as you start of build workloads, like if you spin up an HDInsight cluster or a SQL data warehouse as part of a pipeline, make sure you shut down, pause it or destroy that cluster afterwards. So, there are opportunities to get your data moved but also keep the cost down but not keeping it running all the time.

An Overview of Azure File Sync

I have a question… Who is still using a file server? No need to answer, I know that most of us still are and need to use them for various reasons. We love them—well, we also hate them, as they are a pain to manage.

The pains with Windows File Server:

  • They never seem to have enough storage.
  • They never seem to be properly cleaned up; users don’t delete the files they’re supposed to.
  • The data never seems accessible when and where you need it.

In this blog, I’d like to walk you through Azure File Sync, so you can see for yourself how much better it is.

    • Let’s say I’m setting up a file server in my Seattle headquarters and that file server begins having problems, maybe I’m running out of space for example.
    • I decide to hook this up in a file share in Azure space.
    • I can set up cloud tiering and set up a threshold (say 50%), so that everything beyond that threshold, those files will start moving up into Azure.
    • When I set this threshold, it will start taking the oldest files and graying them out as far as users are concerned. The files are still there and visible as there, but they’ve been pushed off to the cloud, so that space has now been freed up on the file server.
    • If users ever need those files, they can click on them and redownload.
    • Now, let’s say I want to bring on another server at a branch office. I can simply bring up that server, synchronize it with the branch office based on those files in Azure.
    • From here, I can hook up my SMBs and NFS shares for my users and applications, as well as my work folders using multi-site technology. I have all my files synchronized and it’s going to give me direct cloud access to these files.
    • I can hook up my IaaS and PaaS solutions with my REST API or my SMB shares to be able to access these files.
    • With everything synchronized, I’m able to have a rapid file server disaster/data recovery. If my server in Seattle goes down, I simply remove it; my files are already up in Azure.
    • I bring on a new server, sync it back to Azure. My folders start to populate, and as they get used, people will download the files back and the rules that were set up will maintain.
    • The great thing is it can be used with SQL Server 2012 R2, as well as SQL Server 2016.
    • Now I have an all-encompassing solution (with integrated cloud back up within Azure) with better availability, better DR capability and essentially bottomless storage. Azure Backup Vault gets backed up automatically and storage is super cheap.

With Azure File Sync I get:

1. A centralize file service in Azure storage.

2. Cache in multiple locations for fast, local performance.

3.  I can utilize cloud based backup and fast data/disaster recovery.

Overview and Benefits of Azure Cognitive Services

With Artificial Intelligence and Machine Learning, the possibilities for your applications are endless. Would you like to be able to infuse your apps, websites and bots with intelligent algorithms to see, hear, speak, understand and interpret your user needs through natural methods of communication, all without having any data science expertise?

Continue reading Overview and Benefits of Azure Cognitive Services