In upcoming posts, I’ll begin a series focusing on Big Data and the Azure HDInsight offerings. If you’re not familiar with it, HDInsight is a fully managed, full-spectrum, open source analytics service for enterprises that lets you use open source frameworks such as Hadoop, Spark and Hive, among others. It was introduced to Azure in 2013, and Microsoft has since added newer options, such as domain-joined clusters.
Today’s focus is on HDInsight Hadoop, which is about working with big data workloads. These large volumes of data can be structured, semi-structured or unstructured, such as tables, documents or photos.
It can be historical data that you’re looking to analyze or streaming data that’s coming in real time. The goal is to process the data and generate information from it. Some advantages are:
- It’s a cloud-native Platform as a Service (PaaS) offering within Azure.
- Lower cost and scalability, because compute and storage are separated. You can keep your data in storage and tear down the clusters, so you’re not paying for compute when they’re not running. You can also keep your storage and reattach to it with additional nodes when you need to scale.
- Security and compliance with government regulations.
- You can monitor the system from within Azure. If you add the Enterprise Security Package, you get some additional monitoring within the system, as well as the ability to set up user accounts that tie into your Active Directory.
- It’s globally available, including the Azure Government, China and Germany regions.
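To make the cost point concrete, here is a rough Python sketch of how tearing down a cluster between jobs changes a monthly bill. The hourly node rate, the storage rate and the function name are made-up placeholders for illustration, not Azure prices.

```python
# Hypothetical rates for illustration only; these are NOT real Azure prices.
NODE_RATE_PER_HOUR = 0.50    # assumed cost of one cluster node per hour
STORAGE_RATE_PER_GB = 0.02   # assumed cost of one GB of storage per month


def monthly_cost(nodes, hours_running, storage_gb):
    """Return (compute, storage, total) cost for one month.

    Compute is charged only for the hours the cluster exists; storage is
    charged for the whole month whether or not a cluster is attached.
    """
    compute = nodes * hours_running * NODE_RATE_PER_HOUR
    storage = storage_gb * STORAGE_RATE_PER_GB
    return compute, storage, compute + storage


# The same 4-node cluster and 1,000 GB of data, run two different ways.
always_on = monthly_cost(nodes=4, hours_running=24 * 30, storage_gb=1000)
on_demand = monthly_cost(nodes=4, hours_running=2 * 30, storage_gb=1000)

print("Always on:       ", always_on)
print("Two hours a day: ", on_demand)
```

In both cases the storage charge is identical; only the compute portion shrinks when the cluster is removed between runs.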
Some of the uses for HDInsight Hadoop are:
- Batch processing ETL
- Data Warehousing
- Streaming and processing of data: one example is Toyota, which used it for its Connected Car Architecture Program to monitor its cars and stream that data into an HDInsight cluster.
- Data science workloads, which are becoming more common: you have massive data sets that you want to process and analyze, or a combination of needs, such as running data science and machine learning on streaming data to predict what might happen next.
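Batch processing on Hadoop is classically expressed as MapReduce: a map step emits key/value pairs, the framework groups them by key, and a reduce step aggregates each group. The sketch below imitates that flow in plain Python on a couple of in-memory lines. It only illustrates the model and does not use Hadoop or HDInsight; the function names and sample data are invented for the example.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


# Made-up sample input standing in for files in cluster storage.
sample = ["big data on Azure", "big clusters process big data"]
counts = word_count(sample)
print(counts)
```

On a real cluster the map and reduce steps run in parallel across nodes and the input comes from storage rather than a Python list, but the shape of the job is the same.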
Another benefit is that HDInsight clusters support multiple languages, such as Java, Python, Scala, Pig Latin, HiveQL and Spark SQL. Basically, the common languages of the open source community are available, so you can take advantage of this high-performing technology for your big data workloads.
Coming up, I’ll discuss some of the cluster types available, such as HDInsight Spark, HBase, Storm, Kafka, Interactive Query and R Server.