
My Learning Space

A space to take notes, learn, and share.

How Apache Spark Works

Understanding Apache Spark Architecture. Apache Spark is a distributed computing system designed for big data processing and analytics. Here's a breakdown of how it works. Core Components. Driver Program: the central coordinator that manages the execution of...
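As a rough illustration of the driver's role, the Scala sketch below creates a SparkSession, which launches the driver program that coordinates work on the cluster. The app name, the "local[*]" master, and the sample data are placeholders for illustration, not part of the original post.

```scala
import org.apache.spark.sql.SparkSession

object DriverSketch {
  def main(args: Array[String]): Unit = {
    // The SparkSession lives in the driver program, the central coordinator.
    // "local[*]" is a placeholder master; on a real cluster this would point
    // to YARN, Kubernetes, or a standalone master instead.
    val spark = SparkSession.builder()
      .appName("driver-sketch")
      .master("local[*]")
      .getOrCreate()

    // The driver turns this job into tasks and schedules them on executors.
    val counts = spark.sparkContext
      .parallelize(Seq("a", "b", "a", "c"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach(println) // results are gathered back to the driver

    spark.stop()
  }
}
```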


Spark Motivation

Motivation Behind Apache Spark. The limitations of MapReduce led to the development of Apache Spark, addressing key challenges in modern data processing. 1. Distributed data processing began with Hadoop and MapReduce. 2. Over time, specialized solutions were built...


Data Abstractions in Spark (RDD, Dataset, DataFrame)

Spark provides three abstractions for handling data. RDDs: distributed collections of objects that can be cached in memory across cluster nodes (e.g., a large array can be partitioned across the nodes of a cluster). DataFrames: distributed collections...
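For context, here is a minimal Scala sketch contrasting the three abstractions. The Person case class and the sample records are made up for illustration; only the RDD, DataFrame, and Dataset APIs themselves come from Spark.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type used only for this illustration.
case class Person(name: String, age: Int)

object AbstractionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("abstractions-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // RDD: a distributed collection of raw objects, partitioned across the cluster.
    val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 34), Person("Bob", 41)))

    // DataFrame: distributed rows with a named schema, optimized by Catalyst.
    val df = rdd.toDF()
    df.filter($"age" > 35).show()

    // Dataset: a DataFrame with a typed, compile-time checked interface.
    val ds = rdd.toDS()
    ds.filter(_.age > 35).show()

    spark.stop()
  }
}
```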
