In a previous post, I talked about Azure Databricks and what it is. To recap, Azure Databricks is a managed platform for running Apache Spark jobs. Because it’s managed, you don’t have to worry about managing the cluster or running performance maintenance to use Spark, as you would if you deployed a full HDInsight Spark cluster.
Databricks provides a simple-to-operate user interface for data scientists and analysts building models, as well as a powerful API that allows for some automation. You can also use role-based access control with Active Directory for more granular user management. And unlike with HDInsight, you don’t have to tear down a cluster when you’re not running Spark jobs; you can pause (or start) your resources on demand and scale up or out as needed.
In this post, I’ll run through some key Databricks terms to give you an overview of the different components you’ll use when running Databricks jobs:
- Workspace – This is the central place where you organize all the work that’s being done. You can think of it as a ‘folder’ structure where you save the Notebooks and Libraries you use to operate on and manipulate data, and then share them securely with other users. The Workspace is not meant for storing data; data belongs in your data storage.
- Notebooks – A Notebook is a set of any number of cells in which you execute commands in a programming language such as Scala, Python, R or SQL; you can specify the language at the top of a cell. Here you can also create a dashboard so that the output of the code is shared rather than the code itself. Notebooks can be scheduled as jobs for running pipelines or updating models or dashboards.
- Libraries – These are packages or modules that provide additional functionality for developing various models for different types of analysis. This works much like a traditional IDE such as Visual Studio, where you have libraries you can plug in and add.
- Tables – This is where the structured data that you and your team will use for analysis is stored. Tables can live in cloud storage or on the cluster that’s being used, or you can keep them in memory for faster processing of the data.
- Clusters – Essentially a group of compute resources used for operations like executing the code from Notebooks or Libraries. You can also pull in data from raw sources such as cloud storage, from structured or semi-structured data, or from the tables I mentioned above. Access to clusters can be controlled via access policies using Active Directory integration.
- Jobs – Jobs are the tool used to schedule execution on a cluster. They can run Python scripts or JAR assemblies, and you can trigger them manually or run them through a REST API.
- Apps – Think of these as the third-party components that can tap into your Databricks cluster. A good scenario is visualizing the data with apps like Tableau or Power BI. You can consume the modules that you built and the output of the Notebooks or scripts that you ran to visualize that data.
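
To make the Jobs item a little more concrete, here is a minimal Python sketch of what triggering an existing job through the REST API might look like. It only builds the HTTP request and does not send it. Treat the details as assumptions rather than a definitive recipe: the `/api/2.1/jobs/run-now` path and the `job_id` / `notebook_params` fields reflect the Jobs API version 2.1 as I understand it, the helper name `build_run_now_request` is made up for this post, and the workspace URL, token and job ID are placeholders. Check the API version your workspace exposes before relying on it.

```python
import json
import urllib.request


def build_run_now_request(workspace_url, token, job_id, notebook_params=None):
    """Build (but do not send) a request that triggers an existing job.

    Assumes the Jobs API 2.1 'run-now' endpoint and a personal access token.
    """
    payload = {"job_id": job_id}
    if notebook_params:
        # Optional key/value parameters handed to a notebook task.
        payload["notebook_params"] = notebook_params

    return urllib.request.Request(
        url=f"{workspace_url.rstrip('/')}/api/2.1/jobs/run-now",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values only; substitute your own workspace URL, token and job ID.
    request = build_run_now_request(
        "https://example.azuredatabricks.net",
        "EXAMPLE-TOKEN",
        42,
        {"run_date": "2020-01-01"},
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

To actually fire the job you would pass the request to `urllib.request.urlopen`. Keeping the building and sending steps separate makes it easy to inspect the payload before anything runs on a cluster.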