Last week I began a series on HDInsight. Today I’m continuing that series with a focus on Interactive Query. Interactive Query leverages Hive which uses LLAP (Long Live and Process), also known as low latency analytical processing. This allows for interactivity with complex data warehouse-style queries on big data, that is stored in commodity storage, such as a blob or Data Lake Store.
This stand-alone cluster is separate from HDI Hadoop clusters; it only contains the Hive service. The LLAP replaces the direct interaction with the HDFS data node, allowing for caching, prefetching, some light query processing and access control. Heavier query processing workloads are still happening at the yarn container with text orchestration, and that helps with the overall execution.
Obviously, it’s much more efficient to be able to query the data interactively where the data is prepared, rather than needing to move the data from one storage location to another, as we normally would with data warehousing. It allows for faster insight and resiliency, as well as reduced effort and simplified architecture – less components meets more simplicity.
There are several ways to execute Hive queries from Interactive Query:
- Power BI, so you can tap right into it with your Power BI reports
- Zeppelin notebooks
- Visual Studio
- Ambari with Hive View
- Beeline from head node or an empty edge node
You can also leverage existing workloads, so if you’re running batch or ETL workloads using HDInsight, you can attach your Interactive Query cluster to an existing metastore and data storage without any additional overhead.
There may be a need to convert CSV or JSON files into ORC, Parquet or Avro field as they can be more efficient for Hadoop processing. But with Interactive Query, that need is either lessened or eliminated because they can load that data into memory. The queries now determine what is cached and what can just run quickly since it’s running in memory instead of running from a storage area.
It also uses the Enterprise Security Package and Azure Log Analytics. These two features get wrapped into more of a true enterprise offering and allows your users to use their simplified Active Directory domain log in. Users can connect using Interactive Query and run their workloads without having to have a separate set of credentials, plus you can monitor your nodes from the Log Analytics piece. This helps you bring that data into OMS for a top down view and an understanding of what the whole environment looks like.
Interactive Query offers some great opportunities to run things more efficiently and smaller workloads can be run very quickly.