Data Engineering and MLOps specialist: Streamlining EDW & Data Pipelines for ML & AI products.

Databricks is a cloud-based data engineering platform used for data transformation and for exploring data through machine learning models. Azure Databricks is the Microsoft Azure platform's implementation of Databricks. 

Evolution of Databricks:  

A short timeline of the technology's evolution gives an overview of the underlying stack. 

2003: Google published the Google File System paper.  

2004: Google followed up with the MapReduce paper. MapReduce takes a large analytics workload and distributes it across many cheap compute instances.  
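The MapReduce idea can be sketched in plain Python: a "map" step that processes each piece of data independently (and so can run on separate machines), followed by a "reduce" step that merges the partial results. This is only an illustrative single-machine sketch, not how Hadoop or Spark actually distribute work.

```python
from collections import Counter
from functools import reduce

# "Map": each document (which could live on a separate machine) is
# turned into partial word counts independently of the others.
def map_count(doc: str) -> Counter:
    return Counter(doc.split())

# "Reduce": partial results are merged into the final answer.
def reduce_counts(a: Counter, b: Counter) -> Counter:
    a.update(b)
    return a

docs = ["spark spark hadoop", "hadoop mapreduce", "spark"]
partials = map(map_count, docs)           # map phase
totals = reduce(reduce_counts, partials)  # reduce phase

print(totals["spark"])   # 3
print(totals["hadoop"])  # 2
```

Because the map step has no shared state, a framework can shard the documents across a cluster and only pay coordination cost in the reduce phase.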

2006: These papers led to the creation of Apache Hadoop. Hadoop MapReduce wrote intermediate results to disk between stages, which made jobs expensive in hard-disk input/output. 

2009: Matei Zaharia started the Spark project at UC Berkeley, which improved on Hadoop MapReduce by performing much of the computation in memory.  

2013: Spark was donated to the Apache Software Foundation, and Matei Zaharia and his Berkeley colleagues founded Databricks. The platform provides a management layer over Apache Spark, whose components include the following:  

  1. Spark Core, which provides MapReduce in memory. 
  1. DataFrames, which generate the MapReduce work for you. You can also write SQL statements, which are compiled into DataFrame calls that in turn run as MapReduce. 
  1. Graph API  
  1. Streaming API 
  1. Machine Learning 

Databricks Implementations:  

Since Databricks is a third-party tool, all major cloud platforms offer their own implementation of it. Google Cloud, Amazon AWS, and Microsoft Azure are the major Databricks providers.  

Azure Databricks Environments:  

Azure Databricks offers three environments for developing data-intensive applications.  

Azure Databricks SQL 

It provides an easy-to-use platform for analysts who want to  

  1. Run SQL queries on their data lake. 
  1. Create multiple visualization types to explore query results. 
  1. Build and share dashboards. 

Azure Databricks Data Science & Engineering 

Provides an interactive workspace that  

  1. Enables collaboration between data engineers, data scientists, and machine learning engineers. 
  1. Ingests data (raw or structured) through Azure Data Factory in batches, or streams it using Apache Kafka, Event Hubs, or IoT Hub; this data lands in either Blob Storage or Data Lake Storage.  
  1. Reads data from multiple data sources and turns it into insights. 

 

Azure Databricks Machine Learning 

Provides managed services for  

  1. Experiment tracking 
  1. Model training 
  1. Feature development and management 
  1. Feature and model serving 

 

Databricks Clusters 

 

Types of Clusters 

There are two types of clusters:  

All-purpose clusters 

Typically used to run notebooks. They remain active until you terminate them.  

Job clusters 

A job cluster is created when its job runs and is terminated once the job completes.  

 

Creating a Cluster 

Creating a cluster is necessary for getting started with Databricks. 
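Besides the UI, a cluster can be described as a JSON specification following the Databricks Clusters API. The values below are placeholders for illustration (the node type shown is an Azure VM size; pick a Spark version and size appropriate to your workspace):

```json
{
  "cluster_name": "demo-cluster",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 2,
  "autotermination_minutes": 60
}
```

Setting `autotermination_minutes` shuts an idle all-purpose cluster down automatically, which avoids paying for compute nobody is using.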

Databricks Notebooks 

After creating the workspace in the first step, we can now create a notebook to execute our functions. This can be done via the UI or the CLI.  

 

Clicking Create -> Notebook is all it takes to create a notebook. Choose a language and a cluster, then click the Create button. 

 

 

Databricks Cluster Navigation 

Clicking on Clusters leads to a helpful interface that gives us the following options: 

Configuration 

Lists the basic configuration used to build the cluster. 

Notebooks 

Notebooks lists all the notebooks available in the cluster.  

Libraries 

The Libraries interface allows us to attach the appropriate SDK packages to the cluster for our use cases.  

Event Log 

Gives a complete log of cluster events.

Spark UI 

This is the holy grail: a view of all the Spark jobs actually running under the Databricks shell. 

 

Driver Logs 

Driver Logs gives a detailed log of Spark events from the driver.

Metrics 

The Metrics interface opens a separate application, the Ganglia UI. 

Apps 

Apps gives the option to configure RStudio.