Overview of HDInsight Spark

Today I’m continuing my series on HDInsight with a focus on Spark clusters. HDInsight Spark clusters provide the baseline for in-memory cluster computing. This technology has gained momentum over the last few years as memory requirements have grown and hardware has kept pace.

As a result, loading a large amount of data into memory has become much more practical. Keeping data in memory lets us load it once and cache it, so it’s much more responsive when we work with it, whether we’re querying it or visualizing it.
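
To make the load-and-cache idea concrete, here is a minimal PySpark sketch. It assumes a `spark` session already exists (notebook kernels normally provide one), and the path and column name you pass in are hypothetical placeholders, not anything specific to this series.

```python
def load_and_cache(spark, path):
    """Read a dataset once and mark it for in-memory caching.

    Assumptions: `spark` is an existing SparkSession and `path` points
    at JSON data the cluster can read. Both are supplied by the caller.
    """
    df = spark.read.json(path)
    # cache() is lazy: nothing is loaded until the first action runs.
    return df.cache()


def explore(df, column):
    """Run two queries against the same cached DataFrame."""
    total = df.count()                      # first action fills the cache
    per_value = df.groupBy(column).count()  # later queries can reuse the cached data
    return total, per_value.collect()
```

In a notebook you might call `df = load_and_cache(spark, "<your data path>")` and then `explore(df, "<some column>")`. The point of the sketch is only that the data is read once and later queries hit memory instead of storage.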

Some benefits and features of HDInsight Spark are:

  • Spark provides access to the Scala programming language. This allows us to work with distributed data sets as if they were collections, and it doesn’t require us to structure everything as map and reduce operations, which makes our operations more responsive and efficient.
  • Quick deployment. You can deploy a Spark cluster, as with other Azure PaaS offerings, through the Azure portal. You can also do it through scripting, PowerShell or Azure Automation.
  • Native integration with Zeppelin and Jupyter notebooks for your processing and visualization.
  • The REST API Service allows for remote orchestration and job processing.
  • Azure Data Lake support, allowing us to separate compute from storage, which lends itself to scalability. When compute and storage are handled separately, you can tear down your compute clusters, or nodes, and add new ones if you want to scale up/down. Then you can reattach to that storage without losing any of the work that you’ve done.
  • As a PaaS offering, it integrates easily with other Azure services, like Event Hubs or HDInsight Kafka (which I’ll cover later this week) for data streaming applications.
  • Support for concurrent queries, which allows us to take better advantage of the processing power of the nodes.
  • Native Power BI integration for visualization purposes; you can connect directly to a Spark cluster from Power BI.
  • Pre-loaded with Anaconda, which provides about 200 libraries for things like Machine Learning, advanced analytics and visualizations.
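
As an illustration of the remote orchestration point above, here is a rough sketch of building a job submission request in Python. I’m assuming the REST service is Apache Livy exposed at `/livy/batches` on the cluster’s public endpoint, which is how I understand HDInsight Spark to work; the function name, cluster name, credentials and application file are all hypothetical. The sketch only builds the request and does not send it.

```python
import base64
import json
import urllib.request


def build_livy_batch_request(cluster, user, password, app_file, args=()):
    """Build (but do not send) a POST request that submits a Spark batch job.

    Assumptions: the cluster exposes Livy at /livy/batches and accepts
    HTTP basic authentication with the cluster login.
    """
    url = f"https://{cluster}.azurehdinsight.net/livy/batches"
    # Minimal Livy batch payload: the application file plus its arguments.
    payload = {"file": app_file, "args": list(args)}
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Content-Type": "application/json",
        "X-Requested-By": user,  # assumed to be required by the endpoint
        "Authorization": f"Basic {token}",
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

To actually submit the job you would pass the returned object to `urllib.request.urlopen`. Keeping the build step separate makes it easy to inspect the URL and payload before anything touches a real cluster.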

Best uses for Spark:

  • As with other big data workloads, in-memory processing allows us to do interactive data analysis and build business solutions. The in-memory engine makes reports and data visualizations more responsive.
  • Machine learning, with built-in support for Jupyter and Zeppelin notebooks.
  • It comes pre-loaded with Anaconda and its roughly 200 libraries, so you can jump in and start using it quickly.
  • It handles streaming and real-time data workloads. You can connect it to your Event Hubs queue, so you can bring in your data and report on it in real-time scenarios. This is great if you’re using IoT; it’s much more responsive than waiting for the next ETL refresh.

Be sure to check out my next post on HDInsight HBase.
