Databricks is a cloud-based data engineering platform used for data transformation, data exploration, and machine learning. Azure Databricks is the Microsoft Azure platform's implementation of Databricks.
Evolution of Databricks:
A short timeline of the technology's evolution gives an overview of the underlying stack.
2003: Google published the Google File System paper.
2004: This was followed by the Google MapReduce paper. MapReduce takes a large analytics workload and distributes it across cheap compute instances.
2006: This led to the creation of Apache Hadoop. Hadoop had a drawback: many of its operations were heavy on hard-disk input/output.
2012: Matei Zaharia started the Spark project, which built on Apache Hadoop and performed much of its computation in memory.
2013: Spark was donated to the Apache Software Foundation by Matei. Matei and his Berkeley colleagues founded Databricks. The platform provides a management layer over Apache Spark, including the following:
- Spark, providing MapReduce in memory
- DataFrames, which generate the MapReduce steps for you. You can also write SQL statements, which are compiled into DataFrame calls that in turn perform MapReduce.
- Graph API
- Stream API
- Machine Learning
Databricks Implementations:
Since Databricks is a third-party tool, all major cloud platforms offer their own implementation of it. Google Cloud, Amazon AWS, and Microsoft Azure are the major Databricks providers.
Azure Databricks Environments:
Azure Databricks offers three environments for developing data-intensive applications.
Azure Databricks SQL
It provides an easy-to-use platform for analysts who want to:
- Run SQL queries on their data lake.
- Create multiple visualization types to explore query results
- Build and share dashboards
Azure Databricks Data Science & Engineering
Provides an interactive workspace for
- Enabling collaboration between data engineers, data scientists, and machine learning engineers.
- Ingesting data (raw or structured) into Azure via Data Factory in batches, or streaming it through Apache Kafka, Event Hubs, or IoT Hub; this data lands in either Blob storage or Data Lake Storage.
- Reading data from multiple data sources and turning it into insights.
Azure Databricks Machine Learning
Provides managed services for
- Experiment tracking
- Model Training
- Feature Development and Management
- Feature and model serving
Databricks Clusters
Types of Clusters
There are two types of clusters:
All-purpose clusters
Typically used to run notebooks. They remain active until you terminate them.
Job clusters
They run when you create a job and are terminated after the job completes.
Creating a Cluster
Creating a cluster is necessary for getting started with Databricks.
Databricks Notebooks
After creating the workspace in the first step, we can now create a notebook to execute our functions. This can be done via the UI or the CLI.
Clicking Create -> Notebook is all there is to creating a notebook: choose a language and a cluster, then click the Create button.
Databricks Cluster Navigation
Clicking on Clusters leads us to a helpful interface that gives us the following options:
Configuration
Lists the basic configuration used to build the cluster.
Notebooks
Notebooks lists all the notebooks attached to the cluster.
Libraries
The Libraries interface allows us to attach the appropriate SDK packages to the cluster for our use cases.
Event Log
Gives a complete log of cluster events.
Spark UI
This is the holy grail: it shows all the Spark jobs actually running under the Databricks shell.
Driver Logs
Driver logs give a detailed log of Spark events.
Metrics
Metrics interface opens a separate application called Ganglia UI.
Apps
Apps gives an option to configure RStudio.