In my latest video blog I discuss and demonstrate some of the ways to connect to external data in Azure Synapse if there isn’t a need to import the data to the database or you want to do some ad-hoc analysis. I also talk about using COPY and CTAS statements if the requirement is to import the data after all. Check it out here
In this video blog post I cover the serving layer step of building your Modern Data Warehouse in Azure. There are decisions to be made about how to structure your schema as you get it ready for presentation in your business intelligence tool of choice. For this example I used Power BI, and I discuss some of the areas you should focus on:
- What is your schema type? Snowflake or Star, or something else?
- Where should you serve up the data? SQL Server, Synapse, ADLS, Databricks, or something else?
- What are your service level agreements (SLAs) for the business? What are your data processing times?
- Can you save cost by using an option that’s less compute heavy?
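To make the schema question a little more concrete, here is a minimal sketch of a star schema: one fact table joined to two dimension tables and aggregated the way a BI tool might. The table and column names are invented for illustration, and an in-memory SQLite database simply stands in for whichever serving layer you choose.

```python
import sqlite3


def build_star_schema() -> sqlite3.Connection:
    """Create a tiny, hypothetical star schema in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);
        CREATE TABLE fact_sales (
            product_id INTEGER REFERENCES dim_product(product_id),
            date_id INTEGER REFERENCES dim_date(date_id),
            amount REAL
        );
        INSERT INTO dim_product VALUES (1, 'Bikes'), (2, 'Helmets');
        INSERT INTO dim_date VALUES (10, 2019), (11, 2020);
        INSERT INTO fact_sales VALUES (1, 10, 500.0), (1, 11, 700.0), (2, 11, 50.0);
        """
    )
    return conn


def sales_by_category_and_year(conn: sqlite3.Connection):
    """Join the fact table to both dimensions and total the sales amounts."""
    query = """
        SELECT p.category, d.year, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        JOIN dim_date AS d ON d.date_id = f.date_id
        GROUP BY p.category, d.year
        ORDER BY p.category, d.year
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in sales_by_category_and_year(build_star_schema()):
        print(row)
```

In a snowflake schema the dimensions would be normalized further (for example, a separate category table), which adds joins to every report query. That is one of the trade-offs to weigh against your processing times.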
Simplified Managed Disk Migration in Azure
In the past, migrating managed disks could be a bit of a challenge. Today I’d like to talk about how Azure has simplified the process. Microsoft recently added the ability to migrate the disks through their portal instead of having to use a command line interface or a PowerShell script.
First off, why would you want a managed disk over an unmanaged one?
- Greater scalability due to much higher IOPS and storage limits. You no longer need to add storage accounts when you add disk space, which had been a challenge for users running large virtual machines that required a lot of storage.
- Better availability and reliability, because disks are isolated from each other in different storage scale units.
- Managed disks offer over 99.99% uptime and are always stored with 3 replicas of the data.
- More granular access control through role-based access control (RBAC), so you can assign access to specific people in your organization.
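To put the uptime figure in the list above into perspective, here is a small back-of-the-envelope calculation of how much downtime per year a given availability percentage still permits. It is plain arithmetic, not a statement of Azure’s actual SLA terms.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def max_downtime_minutes(availability_percent: float) -> float:
    """Return the downtime per year that a given availability still permits."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for pct in (99.9, 99.99):
        print(f"{pct}% uptime allows about {max_downtime_minutes(pct):.1f} minutes of downtime per year")
```

At 99.99%, that works out to a bit under an hour per year.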
Here’s how it works:
- When you look at the overview of a VM that uses unmanaged disks, you’ll see a banner at the top alerting you that you’re not using managed disks and recommending that you switch. Sure, they cost a bit more, but the payback is better resiliency and reliability.
- When you click on that banner, it will give you a wizard to walk you through how to perform that migration. It will also remind you that when you migrate, you can’t go back. Your virtual machine will remain unchanged, but you’ll want to take that into account.
- It will reboot your VM once complete, so keep this in mind so you can plan to do this during off hours.
- Another note: if your VM is in an availability set, you’ll be prompted to migrate that availability set first, and then your VM.
- Once you’re done and back up and running, you’ll see the new disks and the old unmanaged disks, even though they can’t be mounted. You can later clean those up and delete them.
- You’ll have a managed disk for the OS and one for each data disk in that resource group, and you’re ready to go, with more availability plus the comfort of knowing you’re running in a more resilient configuration.
So, look at your virtual machines and do that migration when you have a chance. This great wizard-based feature makes it much easier. The reliability benefits will greatly outweigh the added cost.
Monitoring your website performance is key to gaining insight into your customers and users, as well as keeping an eye on the website’s health. In today’s post I’d like to tell you about Azure Application Insights.
Application Insights is an application performance management service for web applications that enables you to do all the monitoring of your website performance in Azure. It’s designed to ensure you’re getting optimal performance and the best in class user experience from your website. It also has a powerful analytic tool that helps you diagnose issues and gain an understanding of how people are using your web application.
You can use it with many web platforms, and although you’re sending the information about your website to Azure, the website or application itself doesn’t have to be hosted in Azure. For those who work on DevOps processes, it helps you enable continuous improvement of your web application, with connectivity to a bunch of development tools.
How does it work?
Application Insights works by having you add a small instrumentation package to your application and set up an Application Insights resource in Azure. The package monitors the web app and sends telemetry data to the Application Insights portal (the portal itself is in Azure but, as I mentioned, the application can be pretty much anywhere).
Along with the Application Insights from the web app, you can pull in your host environmental data, allowing you to look at performance logs, Azure diagnostics and container logs, giving you a full look at what’s going on inside the application, as well as in the environment where it lives.
You can set up periodic web tests that send requests to the web server to ensure that it’s responding properly and that the website is working the way it’s supposed to. It’s a very straightforward implementation with a light footprint: the tracking calls are non-blocking, and the telemetry is batched together and sent on a separate thread.
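The batching idea is easier to see in code. The following is a minimal sketch of the general pattern only, not the Application Insights SDK: `TelemetryBatcher`, `track` and `send_batch` are hypothetical names. The tracking call just drops an event on a queue, and a background thread groups events into batches and ships them.

```python
import queue
import threading


class TelemetryBatcher:
    """Hypothetical client: queue events cheaply, ship them in batches off-thread."""

    def __init__(self, send_batch, batch_size=3):
        self._send_batch = send_batch  # callable that receives a list of events
        self._batch_size = batch_size
        self._events = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def track(self, event):
        """Called from request code; only enqueues, so it never waits on the network."""
        self._events.put(event)

    def close(self):
        """Signal shutdown and wait until the remaining events have been sent."""
        self._events.put(None)  # sentinel that tells the worker to stop
        self._worker.join()

    def _run(self):
        batch = []
        while True:
            event = self._events.get()
            if event is None:
                break
            batch.append(event)
            if len(batch) >= self._batch_size:
                self._send_batch(batch)
                batch = []
        if batch:
            self._send_batch(batch)  # flush the final partial batch


if __name__ == "__main__":
    sent = []  # stands in for the network call to the monitoring service
    batcher = TelemetryBatcher(sent.append, batch_size=3)
    for i in range(7):
        batcher.track({"request": i})
    batcher.close()
    print([len(batch) for batch in sent])
```

A real client would also flush on a timer so that a quiet application does not hold events indefinitely. This sketch batches by count only to keep it short.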
Some of the things you can track or collect are:
- What are the most popular webpages in your application, at what time of day and where is that traffic coming from?
- Dependency rates, response times and failure rates, to find out whether an external service is causing performance issues in your app. For instance, a user may be going through a portal to reach your application and hitting response time issues there.
- Exceptions for both server and browser information, as well as page views and load performance from the end users’ side.
- Session info – who, what, when, where.
- Performance and host diagnostics – giving you a complete picture of what’s happening in your application.
- Trace logs for correlating trace events with requests to help you get a deeper insight into the data and dig deeper into the diagnostics to improve performance.
It also gives you flexibility, so you can write custom snippets of code to collect other pieces of data that aren’t part of the usual pieces collected. And all your reports can be viewed through the Azure suite of reporting tools, such as Power BI, to get visualizations and fine-grained analytical info about your application.
Application Insights is an incredibly useful tool for anyone who has an application or website and wants to track and manage all the info that’s put out there – who’s viewing what, what’s the most popular, etc.
Today I’ll wrap up my series on HDInsight with R Server. When you create an HDInsight cluster, you can select R Server as an option, and it will provide data scientists, statisticians and R programmers with on-demand access to scalable, distributed methods of analytics on HDInsight.
Because it is open source, R allows you to leverage any of the 8,000+ open source packages. And because R Server falls within Microsoft’s big data analytics offering, it also includes the ScaleR routines. These routines provide things such as descriptive statistics, generalized linear models, logistic regression, classification and regression trees, as well as decision forests.
You can run an edge node that provides a great place to connect to the cluster. You can also run your R scripts there, which gives you the option of running parallel distributed functions. The models that are built can be downloaded for on-prem use and can also be sent to Azure Machine Learning Studio for further processing and scoring.
So, why would you choose the Microsoft R Server over other options?
- Microsoft is putting a lot behind AI and R Server and this big data offering as part of the HDInsight suite.
- It provides an internally built set of algorithms and when you combine that with the open source community offerings, you create a bridge for cutting edge AI, machine and deep learning applications.
- As with other Azure offerings, you’re getting a simplified, secure, highly scalable environment, so instead of wasting time building those clusters in-house, you can focus on the capabilities of the platform itself by quickly and easily spinning up a cluster.
Many of these topics have been discussed throughout this series about the capabilities of HDInsight and what each has to offer. Looking at R, some key features are:
- R enabled, with runtime infrastructure for R script execution.
- Python enabled as well, with runtime infrastructure for Python scripting.
- Pre-trained models for visual analysis and text sentiment analysis that are ready to score the data you provide.
- You can put the server into operations and deploy solutions as a web service very quickly; so you spin up your cluster, turn everything on, hook it into your domain, use your domain credentials and start training your models.
- Remote web execution allows us to work from our workstation and train models, rather than having to log directly into the server or use SSH or other means. It allows you to build your scripts locally and then execute them remotely, giving you more flexibility in the way you operate.
R Server fits within the Azure and HDInsight ecosystems, so you can easily integrate these technologies, for example with Azure Data Factory or Azure Databricks.
Last week I began a series on HDInsight. Today I’m continuing that series with a focus on Interactive Query. Interactive Query leverages Hive with LLAP (Live Long and Process), also known as Low Latency Analytical Processing. This allows for interactive, complex data warehouse-style queries on big data that is stored in commodity storage, such as Blob storage or Data Lake Store.
This stand-alone cluster is separate from HDI Hadoop clusters; it only contains the Hive service. LLAP replaces direct interaction with the HDFS DataNode, allowing for caching, prefetching, some light query processing and access control. Heavier query processing workloads still happen in YARN containers, with Tez orchestrating the overall execution.
Obviously, it’s much more efficient to query the data interactively where it is prepared, rather than needing to move the data from one storage location to another, as we normally would with data warehousing. It allows for faster insight and resiliency, as well as reduced effort and a simplified architecture: fewer components means more simplicity.
There are several ways to execute Hive queries from Interactive Query:
- Power BI, so you can tap right into it with your Power BI reports
- Zeppelin notebooks
- Visual Studio
- Ambari with Hive View
- Beeline from head node or an empty edge node
You can also leverage existing workloads, so if you’re running batch or ETL workloads using HDInsight, you can attach your Interactive Query cluster to an existing metastore and data storage without any additional overhead.
There may be a need to convert CSV or JSON files into ORC, Parquet or Avro format, as these can be more efficient for Hadoop processing. But with Interactive Query, that need is either lessened or eliminated because it can load that data into memory. The queries now determine what is cached, and they can run quickly because they run in memory instead of from a storage area.
It also uses the Enterprise Security Package and Azure Log Analytics. Together, these two features make it more of a true enterprise offering and allow your users to sign in with their Active Directory domain credentials. Users can connect using Interactive Query and run their workloads without needing a separate set of credentials, and you can monitor your nodes from the Log Analytics piece. This helps you bring that data into OMS (Operations Management Suite) for a top-down view and an understanding of what the whole environment looks like.
Interactive Query offers some great opportunities to run things more efficiently and smaller workloads can be run very quickly.
Next in my series on HDInsight, today I’ll be talking about Storm. HDInsight Storm is a distributed stream processing computational framework. It uses spouts which define information sources and bolts which are manipulations in processing to allow batch distributed processing of streaming data.
Think of its topology in the shape of a directed acyclic graph. It’s a DAG where the edges are named streams that direct the data from node to node. When you put it all together, it creates the data transformation pipeline.
When you break it down, its topology is like that of map/reduce jobs; the difference is that map/reduce jobs run in individual batches, while Storm processes data continuously in real time.
The Storm cluster has 2 different types of nodes. The Master node runs Nimbus, which assigns tasks to machines and monitors their performance. Each Worker node runs a Supervisor, which starts and stops worker processes on that node as Nimbus directs.
The Storm cluster can’t monitor its own state and health, so it relies on ZooKeeper nodes, which coordinate between Nimbus and the Supervisors and keep an eye on things.
The 3 main components of Storm are:
1. The topology, which is basically the network of spouts and bolts that the streams flow through.
2. The stream, which is an unbounded pipeline of tuples.
3. The spout, which is the source of the data. It converts the data into a stream of tuples and then sends them to the bolts to be processed.
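To illustrate how these pieces fit together, here is a toy, single-process sketch of a spout feeding two bolts, wired up as a small directed acyclic graph. It uses plain Python generators rather than the Storm API, all function names are invented for the example, and a real Storm stream would be unbounded and spread across worker nodes.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def sentence_spout(lines: Iterable[str]) -> Iterator[Tuple[str]]:
    """Spout: turn a raw source into a stream of one-field tuples."""
    for line in lines:
        yield (line,)


def split_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str]]:
    """Bolt: split each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)


def count_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str, int]]:
    """Bolt: emit a running count each time a word arrives."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run_topology(lines: Iterable[str]):
    """Wire spout -> split bolt -> count bolt and collect the emitted tuples."""
    return list(count_bolt(split_bolt(sentence_spout(lines))))


if __name__ == "__main__":
    for emitted in run_topology(["storm is fast", "storm is live"]):
        print(emitted)
```

Each tuple flows through the pipeline as soon as the spout emits it, which is the continuous processing that distinguishes Storm from batch-style map/reduce.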
What makes this effective is guaranteed data processing: every tuple will be fully processed and delivered, and the service comes with a 99.9% uptime SLA from Microsoft. Storm does this by tracking the lineage of each tuple as it makes its way through the topology. It works like a queue system, as messages can be replayed if there’s a failure in delivery.
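The replay idea can be sketched in a few lines. This is a deliberately simplified, hypothetical model (`process_with_replay` and `make_flaky_bolt` are made-up names), not Storm’s actual tracking mechanism: a message that fails is put back on the pending queue and retried until it succeeds or runs out of attempts.

```python
from collections import deque


def process_with_replay(messages, bolt, max_attempts=3):
    """Process every message; on failure, put it back so it is replayed later.

    Returns (results, attempts), where attempts maps message id -> tries used.
    """
    pending = deque(enumerate(messages))  # (id, payload) pairs awaiting an ack
    attempts = {}
    results = {}
    while pending:
        msg_id, payload = pending.popleft()
        attempts[msg_id] = attempts.get(msg_id, 0) + 1
        try:
            results[msg_id] = bolt(payload)  # success counts as the "ack"
        except Exception:
            if attempts[msg_id] >= max_attempts:
                raise  # give up after repeated failures
            pending.append((msg_id, payload))  # "fail": schedule a replay
    return [results[i] for i in sorted(results)], attempts


def make_flaky_bolt(fail_once_for):
    """Build a bolt that fails the first time it sees one particular payload."""
    seen = set()

    def bolt(payload):
        if payload == fail_once_for and payload not in seen:
            seen.add(payload)
            raise RuntimeError("simulated delivery failure")
        return payload.upper()

    return bolt


if __name__ == "__main__":
    results, attempts = process_with_replay(["a", "b", "c"], make_flaky_bolt("b"))
    print(results, attempts)
```

Note that replay gives at-least-once processing: a message may be handled more than once, so the processing step should tolerate duplicates.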
Some use cases for Storm:
- Writing the data after it gets processed into an Azure Data Lake Store.
- Writing events to Azure Event Hubs, as well as processing events from it. For instance, it can take vehicle sensor data that arrives through Event Hubs, process it, and then send it to Cosmos DB or Azure Blob storage.
- Twitter is using Storm in a variety of ways. They use it for discovery on their data, running real time analytics and personalization in real time, so when you log into Twitter it knows your preferences based on past visits. It also works for real time Search and for their own internal revenue optimization.
As with other HDInsight components, it’s used within various topologies to satisfy big data requirements and workloads. For example, if you were doing a customer churn analysis in real time based on a Twitter feed, this would be a technology you would use alongside Hadoop.
In continuation of my series on HDInsight and the different clusters within it, today I’ll cover HBase. HBase is a NoSQL database that provides random access and strong consistency for structured, unstructured and semi-structured data.
It’s a schema-less (or organized by families of columns) database. Another way to describe it is that it’s modeled after Google’s Bigtable, where data is stored in the rows of a table and then grouped by column family. As it’s schema-less, neither the columns themselves nor the data types inside the columns need to be defined before using the data.
Some other key things to be aware of with HBase:
- As with all the HDInsight components, this gets implemented as a managed cluster and a Platform as a Service offering in which we can separate compute nodes from storage.
- It has a scale out architecture that helps provide automatic sharding or horizontal partitioning of tables, where essentially rows of a table are held separately rather than splitting those columns as we would in a typical table normalization.
- Strong consistency for read and write as it’s part of the architecture of HBase.
- Automatic failover built in, so the cluster has multiple nodes that it can fail over to.
- In-memory caching for reads and writes, which helps with performance, as well as moving your data in and out quicker.
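To picture the data model and the sharding described above, here is a toy in-memory sketch. It is not the HBase client API: `ToyColumnFamilyTable` and its methods are hypothetical. Only column families are declared up front, columns are created on write, and rows are assigned to shards by row-key range.

```python
from bisect import bisect_right


class ToyColumnFamilyTable:
    """Hypothetical in-memory table: row key -> column family -> column -> value."""

    def __init__(self, families, split_keys):
        self.families = set(families)  # only families are declared up front
        self.split_keys = sorted(split_keys)  # boundaries between row-key ranges
        # Shard 0 holds keys below the first split key; shard i holds keys
        # from split_keys[i-1] (inclusive) up to split_keys[i] (exclusive).
        self.shards = [dict() for _ in range(len(self.split_keys) + 1)]

    def shard_for(self, row_key):
        """Pick the shard whose key range contains row_key."""
        return bisect_right(self.split_keys, row_key)

    def put(self, row_key, family, column, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        row = self.shards[self.shard_for(row_key)].setdefault(row_key, {})
        row.setdefault(family, {})[column] = value  # columns need no prior definition

    def get(self, row_key):
        return self.shards[self.shard_for(row_key)].get(row_key, {})


if __name__ == "__main__":
    table = ToyColumnFamilyTable(["info", "metrics"], split_keys=["m"])
    table.put("device-01", "metrics", "temp", 21.5)
    table.put("device-01", "info", "site", "plant-a")
    table.put("sensor-09", "metrics", "humidity", 40)
    print(table.shard_for("device-01"), table.get("device-01"))
    print(table.shard_for("sensor-09"), table.get("sensor-09"))
```

Because whole rows live in one shard, a lookup by row key touches a single shard, which is why row-key design matters so much for this kind of store.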
Some of the most common workloads:
- A search engine like I mentioned with Google’s Bigtable, which builds indexes that map terms to webpages that contain them.
- A key value store. Facebook uses HBase for their messaging system because it’s ideal for storing and managing internet communications.
- Also, a good repository for collecting sensor data, where large amounts of data are pulled into a NoSQL table that can then be used to build dashboards for reporting.
I still have a few HDInsight technologies to cover in this series. Many of these are interrelated and work together to form a complete, up-to-date data architecture.
Today I’m continuing my series on HDInsight with the focus on Spark clusters. HDInsight Spark clusters provide the required baseline for in-memory cluster computing. This technology has gained momentum over the last few years as the amount of memory available, and hardware capability in general, has increased.
So, loading a large amount of data into memory has become much more feasible. In-memory data allows us to load and cache the data, so it’s much more responsive when we work with it, whether querying it or visualizing it, for instance.
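As a rough illustration of why caching in memory helps, here is a minimal single-machine sketch, not Spark itself: `CachedDataset` is a made-up class. The data is read from slow storage once, and every later query is served from memory.

```python
class CachedDataset:
    """Hypothetical wrapper: load rows from slow storage once, then serve from memory."""

    def __init__(self, loader):
        self._loader = loader  # callable that reads all rows from storage
        self._rows = None  # filled on first access
        self.load_count = 0  # how many times storage was actually read

    def rows(self):
        if self._rows is None:
            self._rows = list(self._loader())
            self.load_count += 1
        return self._rows

    def total(self, column):
        """Sum one column across all cached rows."""
        return sum(row[column] for row in self.rows())

    def count_where(self, column, value):
        """Count the cached rows whose column equals the given value."""
        return sum(1 for row in self.rows() if row[column] == value)


if __name__ == "__main__":
    def load_from_storage():
        # Stands in for an expensive read from blob storage or a data lake.
        return [
            {"region": "east", "sales": 100},
            {"region": "west", "sales": 250},
            {"region": "east", "sales": 50},
        ]

    dataset = CachedDataset(load_from_storage)
    print(dataset.total("sales"), dataset.count_where("region", "east"))
    print("storage reads:", dataset.load_count)
```

Spark applies the same idea at cluster scale, keeping partitions of a dataset in the memory of the worker nodes so that repeated queries avoid going back to storage.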
Some benefits and features of HDInsight Spark are:
- Spark provides access to the Scala programming language. This allows us to work with distributed data sets like collections, and it doesn’t require us to structure everything as map and reduce operations, thus making our operations more responsive and efficient.
- Quick deployment. You can deploy a Spark cluster, as with other Azure PaaS offerings, through the Azure portal. You can also do it through scripting, PowerShell or Azure Automation.
- Native integration with Zeppelin and Jupyter notebooks for your processing and visualization.
- The REST API Service allows for remote orchestration and job processing.
- Azure Data Lake support, allowing us to separate compute from storage, which lends itself to scalability. When compute and storage are handled separately, you can tear down your compute clusters, or nodes, and add new ones if you want to scale up/down. Then you can reattach to that storage without losing any of the work that you’ve done.
- As a PaaS offering, it integrates easily with other Azure services, like Event Hubs or HDInsight Kafka (which I’ll cover later this week) for data streaming applications.
- Support of concurrent queries which allows us to take better advantage of the processing power of the nodes.
- Native Power BI integration for visualization purposes; connecting directly to a Spark cluster from Power BI.
- Pre-loaded with Anaconda, which provides about 200 libraries for things like Machine Learning, advanced analytics and visualizations.
Best uses for Spark:
- As with other workloads for big data, the in-memory processing allows us to do interactive data analysis and create business solutions. It uses that in-memory processing engine to have more responsive reports and data visualization.
- It has machine learning capability with built-in support for Jupyter and Zeppelin notebooks.
- Pre-loaded with Anaconda distributed with 200 canned libraries so you can jump in and start using it quickly.
- It handles streaming and real-time data workloads. You can extend your Event Hub queue, so you can bring in your data and report on it in real time scenarios. This is great if you’re using IoT; much more responsive than waiting for that refresh of ETL.
Be sure to check out my next post on HDInsight HBase.
In upcoming posts, I’ll begin a series focusing on Big Data and the Azure HDInsight offerings. If you don’t know, HDInsight is a fully managed, full spectrum open source analytics service for enterprises that allows you to use open source frameworks such as Hadoop, Spark and Hive, among others. It was introduced to Azure in 2013, and more recent options have been added since, such as domain-joined cluster capabilities.
Today’s focus is on HDInsight Hadoop. What we’re talking about here is being able to work with big data workloads. These large amounts of data can be structured, unstructured or semi-structured data, like table structures, documents or photos.
It can be historical data that you’re looking to analyze or stream data that’s coming in real time. The goal of this is for you to process the data and generate information from it. Some advantages are:
- It’s a cloud native Platform as a Service (PaaS) offering within the Azure workplace.
- Lower cost and scalability because of the capability of separation of compute and storage. You can store your data there but can tear down the clusters so you’re not paying anything when they’re not running. You can also keep your storage and reattach to it with additional nodes to get scalability.
- Security and compliance with government regulations.
- You can do monitoring of the system within Azure. If you hook on the Enterprise Security Package, you have a capability to do some monitoring within the system, as well as setting up user accounts that tie into your Active Directory.
- It’s globally available, including the Azure Government, China and Germany regions.
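The cost point in the list above can be put into rough numbers. The rates below are made up purely for illustration (they are not Azure prices). The only assumption carried over from the text is that compute is billed while the cluster runs and storage is billed all month.

```python
def monthly_cost(compute_rate_per_hour, hours_running, storage_cost_per_month):
    """Compute is paid only for the hours the cluster runs; storage is paid all month."""
    return compute_rate_per_hour * hours_running + storage_cost_per_month


if __name__ == "__main__":
    # Hypothetical figures: 4.00 per cluster-hour, 20.00 per month of storage.
    always_on = monthly_cost(4.0, 730, 20.0)  # cluster left running all month
    on_demand = monthly_cost(4.0, 80, 20.0)  # cluster spun up for 80 hours of jobs
    print(f"always on: {always_on:.2f}, on demand: {on_demand:.2f}")
```

With these illustrative numbers, tearing the cluster down between jobs cuts the bill by close to 90 percent, while the data stays in storage ready to be reattached.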
Some of the uses for Hadoop HDInsight are:
- Batch processing ETL
- Data Warehousing
- Streaming of data and processing – A use case example here is Toyota. They used this for their Connected Car Architecture Program where they were able to monitor their cars and stream it into an HDInsight cluster.
- Data science workloads, which are becoming more common as you get massive data sets that you want to do data processing and analytics on. This can also be a combination of items, like running data science and machine learning on streaming data to do predictive analytics on what might happen next.
Another benefit is that HDInsight clusters support multiple programming languages, like Java, Python, Scala, Pig Latin, HiveQL and Spark SQL. Basically, these are all the common programming languages in the open source community, and they allow you to take advantage of this great, high-performing technology for big data workloads.
Coming up, I’ll discuss some of the cluster types available, such as HDInsight Spark, HBase, Storm, Kafka, Interactive Query and R Server.