In continuation of my series on HDInsight and the different clusters within it, today I’ll cover HBase. HBase is a NoSQL database that provides random access and strong consistency for structured, unstructured and semi-structured data.
It’s a schema-less database organized by families of columns. Another way to describe it is that it’s modeled after Google’s Bigtable, where data is stored in the rows of a table and the columns are grouped into column families. As it’s schema-less, neither the columns themselves nor the data types inside the columns need to be defined before using the data.
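To make that data model a bit more concrete, here’s a minimal sketch in plain Python. It is not the HBase API; `ToyTable` and its methods are hypothetical names I made up for illustration. It assumes, as HBase does, that column families are declared when the table is created, while the individual columns inside a family appear on first write and values are stored as raw bytes.

```python
from collections import defaultdict


class ToyTable:
    """Toy in-memory model: row key -> column family -> column -> value."""

    def __init__(self, column_families):
        # Assumption: families are declared up front; columns are not.
        self.column_families = set(column_families)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        # The column is created on first write; the value is just bytes.
        self.rows[row_key][family][column] = value

    def get(self, row_key, family, column):
        # Plain .get() calls so that a lookup never creates empty rows.
        return self.rows.get(row_key, {}).get(family, {}).get(column)


table = ToyTable(["info", "metrics"])
table.put("user#1", "info", "name", b"Ada")
table.put("user#2", "info", "email", b"grace@example.com")  # different column, same family
table.put("user#1", "metrics", "logins", b"42")
```

Notice that the two rows carry different columns inside the same `info` family, and nothing had to be declared for `name`, `email` or `logins` ahead of time.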
Some other key things to be aware of with HBase:
- As with all the HDInsight components, HBase is implemented as a managed cluster and a Platform as a Service offering in which we can separate compute nodes from storage.
- It has a scale-out architecture that provides automatic sharding, or horizontal partitioning, of tables. Essentially, the rows of a table are split up and held separately, rather than the columns being split apart as we would in typical table normalization.
- Strong consistency for reads and writes, as it’s part of the architecture of HBase.
- Automatic failover is built in, so if one node goes down, its work can fail over to other nodes in the cluster.
- In-memory caching for reads and writes, which helps with performance by moving your data in and out more quickly.
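The sharding idea in the list above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the split points and node names are made up, and a real cluster picks and moves its split points automatically. The point is simply that each partition holds a contiguous, sorted range of row keys, so finding the right partition for a row is a lookup against the split points.

```python
import bisect

# Hypothetical split points, giving 4 key ranges:
# [..g), [g..n), [n..t), [t..]
SPLIT_KEYS = ["g", "n", "t"]
# Hypothetical node names, one per key range.
PARTITION_NODES = ["node-0", "node-1", "node-2", "node-3"]


def partition_for(row_key: str) -> int:
    """Return the index of the key range that contains row_key."""
    # bisect_right puts a key equal to a split point into the range it starts.
    return bisect.bisect_right(SPLIT_KEYS, row_key)


def node_for(row_key: str) -> str:
    """Return the (hypothetical) node that serves row_key."""
    return PARTITION_NODES[partition_for(row_key)]


for key in ["apple", "grape", "melon", "zebra"]:
    print(key, "->", node_for(key))
```

Because whole rows are routed by key, adding capacity means splitting a key range and handing one half to another node, with no change to the table’s columns.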
Some of the most common workloads:
- Search engines, like I mentioned with Google’s Bigtable, which build indexes that map terms to the webpages that contain them.
- A key value store. Facebook uses HBase for their messaging system because it’s ideal for storing and managing internet communications.
- A repository for collecting sensor data, where large amounts of data are pulled into a NoSQL table and can then be used to build dashboards for reporting.
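For the sensor data workload, one common way to lay out the data is to build the row key from the sensor ID plus a timestamp, so that all readings for one sensor sit next to each other in key order. Here’s a minimal sketch of that idea in plain Python. Again, this is not the HBase API: `ToySensorStore`, `row_key` and the `#` separator are my own hypothetical choices, and it assumes sensor IDs don’t contain `#` and timestamps fit in ten digits.

```python
import bisect


def row_key(sensor_id: str, epoch_seconds: int) -> str:
    # Zero-padding keeps lexicographic order the same as time order.
    return f"{sensor_id}#{epoch_seconds:010d}"


class ToySensorStore:
    """Toy store that keeps row keys sorted, like a key-ordered table."""

    def __init__(self):
        self._keys = []    # sorted row keys
        self._values = {}  # row key -> reading

    def put(self, sensor_id, epoch_seconds, reading):
        key = row_key(sensor_id, epoch_seconds)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = reading

    def scan(self, sensor_id, start, stop):
        """Return (row_key, reading) pairs with start <= timestamp < stop."""
        lo = bisect.bisect_left(self._keys, row_key(sensor_id, start))
        hi = bisect.bisect_left(self._keys, row_key(sensor_id, stop))
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


store = ToySensorStore()
store.put("s1", 100, 20.5)
store.put("s2", 150, 19.0)
store.put("s1", 200, 21.0)
store.put("s1", 300, 22.0)
window = store.scan("s1", 100, 300)  # readings for s1 at t=100 and t=200
```

A dashboard query for “sensor s1 over the last hour” then becomes a single range scan over neighboring keys instead of a search through the whole table.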
I still have a few HDInsight technologies to cover in this series. Many of these are interrelated and work together to complete and update your data architecture.