Databricks is a cloud-based data engineering platform used for data transformation, data exploration, and machine learning. Azure Databricks is the Microsoft Azure platform's implementation of Databricks.
Evolution of Databricks:
A short timeline of the technology's evolution gives an overview of the underlying stack.
2003: Google published the Google File System paper.
2004: This was followed by the Google MapReduce paper. MapReduce takes a large analytics workload and distributes it across cheap compute instances.
2006: This led to the creation of Apache Hadoop. Hadoop had a drawback: many of its operations were heavy on hard-disk input/output.
2012: Matei Zaharia started the Spark project, which built on Apache Hadoop and performed much of its computation in memory.
2013: Spark was donated to the Apache Software Foundation by Matei. Matei and his Berkeley colleagues founded Databricks. The platform provides a management layer over Apache Spark, including the following:
- Spark, providing MapReduce in memory
- DataFrames, which generate the MapReduce steps for you. You can also write SQL statements, which are compiled into DataFrame calls that in turn perform MapReduce.
- Graph API
- Stream API
- Machine Learning
Databricks Implementations:
Since Databricks is a third-party tool, all major cloud platforms offer their own implementation of it. Google Cloud, Amazon AWS, and Microsoft Azure are the major Databricks providers.
Azure Databricks Environments:
Azure Databricks offers three environments for developing data-intensive applications.
Azure Databricks SQL
It provides an easy-to-use platform for analysts who want to:
- Run SQL queries on their data lake.
- Create multiple visualization types to explore query results
- Build and share dashboards
Azure Databricks Data Science & Engineering
Provides an interactive workspace for
- Enabling collaboration between data engineers, data scientists, and machine learning engineers.
- Ingesting data (raw or structured) into Azure via Data Factory in batches, or streaming it through Apache Kafka, Event Hubs, or IoT Hub; this data lands in either Blob storage or Data Lake Storage.
- Reading data from multiple data sources and turning it into insights.
Azure Databricks Machine Learning
Provides managed services for
- Experiment tracking
- Model Training
- Feature Development and Management
- Feature and model serving
Databricks Clusters
Types of Clusters
There are two types of clusters:
All-purpose clusters
Typically used to run notebooks. They remain active until you terminate them.
Job clusters
They run when you create a job and are terminated after the job completes.
Creating a Cluster
Creating a cluster is necessary for getting started with Databricks.
Databricks Notebooks
After creating the workspace in the first step, we can now create a notebook to execute our functions. This can be done via the UI or the CLI.
Clicking Create -> Notebook is all there is to creating a notebook: choose a language and a cluster, then click the Create button.
Databricks Cluster Navigation
Clicking on Clusters leads us to a helpful interface that gives us the following options:
Configuration
Lists the basic configuration used to build the cluster.
Notebooks
Notebooks lists all the notebooks attached to the cluster.
Libraries
The Libraries interface allows us to attach the appropriate SDK packages to the cluster for our use cases.
Event Log
Gives a complete log of cluster events.
Spark UI
This is the holy grail: it shows all the Spark jobs actually running under the Databricks shell.
Driver Logs
Driver logs give a detailed log of Spark events.
Metrics
Metrics interface opens a separate application called Ganglia UI.
Apps
Apps gives an option to configure RStudio.