
How Apache Spark Works

Understanding Apache Spark Architecture

Apache Spark is a distributed computing system designed for big data processing and analytics. Here's a breakdown of how it works:

 

Core Components

  • Driver Program: The central coordinator that manages the execution of Spark applications
  • Cluster Manager: Allocates resources across applications (can be Standalone, YARN, Mesos, or Kubernetes)
  • Worker Nodes: Execute tasks and store data
  • Executors: Processes that run on worker nodes to execute tasks

Data Processing Model

Spark processes data through:

  • RDDs (Resilient Distributed Datasets): The fundamental data structure, representing a distributed collection of elements
  • DataFrames: Structured data organized into named columns
  • Datasets: Strongly-typed version of DataFrames

Execution Flow

  1. Application submission to cluster manager
  2. Resource allocation to worker nodes
  3. Driver program creates execution plan (DAG)
  4. Tasks distributed to executors
  5. Data processing and transformation
  6. Result collection and aggregation
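
To make this flow concrete, here is a minimal sketch of a Spark application; the input path and the `level` column are hypothetical. The transformations only build the DAG on the driver, and the action (`count`) is what triggers task distribution to the executors and result collection.

```scala
import org.apache.spark.sql.SparkSession

object ExecutionFlowSketch {
  def main(args: Array[String]): Unit = {
    // The driver program starts here and registers with the cluster manager.
    val spark = SparkSession.builder()
      .appName("execution-flow-sketch")
      .getOrCreate()

    // Transformations only describe the computation; the driver records them as a DAG.
    val events = spark.read.json("hdfs:///data/events.json") // hypothetical path
    val errors = events.filter("level = 'ERROR'")             // assumed 'level' column

    // The action triggers DAG scheduling: tasks are shipped to executors,
    // run in parallel, and the result is collected back to the driver.
    val errorCount = errors.count()
    println(s"Errors: $errorCount")

    spark.stop()
  }
}
```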

 

graph TD;
    A["Driver Program"] --> B["Cluster Manager"];
    B --> C["Worker Node 1"];
    B --> D["Worker Node 2"];
    B --> E["Worker Node N"];
    C --> F["Executor 1"];
    D --> G["Executor 2"];
    E --> H["Executor N"];
    F --> I["Tasks"];
    G --> I;
    H --> I;

 

Key Features

  • In-Memory Processing: Keeps data in RAM for faster processing
  • Fault Tolerance: Automatically recovers from node failures
  • Lazy Evaluation: Defers computation until an action is called, which lets Spark optimize the entire execution plan
  • Multiple Language Support: Scala, Java, Python, and R APIs
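
As a small illustration of lazy evaluation (assuming a `spark` session is already available, for example in `spark-shell`), the transformations below are only recorded; nothing runs on the cluster until the action at the end:

```scala
// Assuming an existing SparkSession named `spark` (e.g. in spark-shell).
val numbers = spark.range(1, 1000001)                // lazily describes the numbers 1..1000000
val evens   = numbers.filter("id % 2 = 0")           // transformation: recorded, not executed
val doubled = evens.selectExpr("id * 2 AS doubled")  // still nothing has run on the cluster

doubled.explain()            // inspect the optimized physical plan Spark has built
val total = doubled.count()  // the action: tasks are scheduled and executed here
println(total)
```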

Performance Optimization

Spark achieves high performance through:

  • Parallel processing across cluster nodes
  • Data caching and persistence
  • Advanced DAG optimization
  • Efficient memory management
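
For instance, caching an intermediate result that several actions reuse avoids recomputing it from the source. This is a sketch only; the input path and column names are invented for the example:

```scala
// Assuming an existing SparkSession `spark`; the path and columns are hypothetical.
import org.apache.spark.sql.functions.desc
import org.apache.spark.storage.StorageLevel

val logs   = spark.read.option("header", "true").csv("hdfs:///data/access_logs.csv")
val errors = logs.filter("status = '500'")
  .persist(StorageLevel.MEMORY_AND_DISK)   // keep in memory, spill to disk if needed

val errorCount = errors.count()            // first action: reads, filters, and caches

errors.groupBy("endpoint").count()         // second action reuses the cached partitions
  .orderBy(desc("count"))
  .show(10)

errors.unpersist()                         // release the cached blocks when done
```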

Common Use Cases

  • Batch Processing
  • Stream Processing
  • Machine Learning
  • Interactive Analytics
  • Graph Processing

Spark Motivation

Motivation Behind Apache Spark

The limitations of MapReduce led to the development of Apache Spark, addressing key challenges in modern data processing.

 

1. Distributed data processing began with Hadoop and MapReduce.

2. Over time, specialized solutions were built on top of Hadoop for streaming, SQL operations, and machine learning.

3. Eventually, these capabilities were unified into a single engine: Apache Spark.

MapReduce Limitations

  • Disk-Based Processing: Every operation requires disk I/O, causing significant latency
  • Two-Step Only: Limited to map and reduce operations
  • Batch Processing Focus: Not suitable for interactive or real-time analysis
  • Complex Implementation: Multi-step operations require multiple MapReduce jobs

Feature Comparison

| Feature | MapReduce | Spark |
|---|---|---|
| Processing Speed | Slower (disk-based) | Up to 100x faster (in-memory) |
| Programming Model | Map and Reduce only | 80+ high-level operators |
| Real-time Processing | No (only through Storm) | Yes (Spark Streaming, roughly 2 to 5 times faster than Storm) |
| Machine Learning | No built-in support | MLlib library |
| Graph Processing | No built-in support (only via the Pregel API) | GraphX component, which implements the Pregel API in about 20 lines of code |
| Interactive Analysis | No | Yes (Spark Shell) |
| SQL Support | Through Hive only | Native Spark SQL (Hive-compatible and up to 100 times faster than Hive on MapReduce) |
| Recovery Speed | Slow (checkpoint-based) | Fast (lineage-based) |
| Language Support | Java | Scala, Java, Python, R |

Key Spark Innovations

  • Resilient Distributed Datasets (RDD): In-memory data structures for efficient processing
  • DAG Execution Engine: Optimized workflow planning and execution
  • Unified Platform: Single framework for diverse processing needs
  • Rich Ecosystem: Integrated libraries for various use cases

These improvements make Spark a more versatile and efficient framework for modern big data processing requirements.

 

Spark Framework

 

 

Data Abstractions in Spark (RDD, DataFrame, Dataset)

Spark provides three abstractions for handling data:

RDDs

Distributed collections of objects that can be cached in memory across cluster nodes (for example, a large array can be partitioned and spread across the nodes of a cluster rather than held on a single machine).
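
A minimal sketch of creating, transforming, and caching an RDD, assuming a `SparkContext` named `sc` is already available (as in `spark-shell`):

```scala
// Assuming an existing SparkContext named `sc` (e.g. in spark-shell).
val numbers = sc.parallelize(1 to 1000000, numSlices = 8)  // distribute the data over 8 partitions

val squares = numbers.map(n => n.toLong * n)               // transformation, evaluated lazily
squares.cache()                                            // keep the partitions in executor memory

println(squares.sum())   // first action: computes and caches the RDD
println(squares.max())   // second action: served from the cached partitions
```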

DataFrame

DataFrames are distributed collections of data organized into named columns, similar to tables in a relational database. They provide a powerful abstraction that supports structured and semi-structured data with optimized execution through the Catalyst optimizer.

DataFrames are schema-aware and can be created from various data sources including structured files (CSV, JSON), Hive tables, or external databases, offering SQL-like operations for data manipulation and analysis.
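
As a sketch of this (the file path and column names are hypothetical), a DataFrame can be loaded from a CSV file and queried either through the DSL or with plain SQL:

```scala
// Assuming an existing SparkSession named `spark`; path and columns are hypothetical.
import org.apache.spark.sql.functions.avg

val employees = spark.read
  .option("header", "true")        // first line contains column names
  .option("inferSchema", "true")   // let Spark infer column types
  .csv("hdfs:///data/employees.csv")

employees.printSchema()            // show the inferred, named columns

employees
  .filter("salary > 50000")
  .groupBy("department")
  .agg(avg("salary").alias("avg_salary"))
  .show()

// The same DataFrame can also be queried with plain SQL.
employees.createOrReplaceTempView("employees")
spark.sql("SELECT department, COUNT(*) AS n FROM employees GROUP BY department").show()
```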

Dataset

Datasets are a type-safe, object-oriented programming interface that provides the benefits of RDDs (static typing and lambda functions) while also leveraging Spark SQL's optimized execution engine.

They offer a unique combination of type safety and ease of use, making them particularly useful for applications where type safety is important and the data fits into well-defined schemas using case classes in Scala or Java beans.
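
A minimal sketch of a typed Dataset backed by a Scala case class (the class and values are invented for the example; a `spark` session is assumed):

```scala
// Assuming an existing SparkSession named `spark` (e.g. in spark-shell).
import org.apache.spark.sql.Dataset
import spark.implicits._

case class Employee(name: String, department: String, salary: Double)

val staff: Dataset[Employee] = Seq(
  Employee("Asha", "engineering", 95000.0),
  Employee("Ravi", "sales", 60000.0)
).toDS()

// Fields are checked at compile time: a typo such as `_.salry` would not compile.
val wellPaid = staff.filter(_.salary > 70000)
wellPaid.map(_.name).show()
```

In a compiled application the case class would normally be defined at the top level of a file or object so that Spark can derive its encoder.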

 

 

Comparison of Spark Data Abstractions

| Feature | RDD | DataFrame | Dataset |
|---|---|---|---|
| Type Safety | Type-safe | Not type-safe | Type-safe |
| Schema | No schema | Schema-based | Schema-based |
| API Style | Functional API | Domain-specific language (DSL) | Both functional and DSL |
| Optimization | Basic | Catalyst Optimizer | Catalyst Optimizer |
| Memory Usage | High | Efficient | Moderate |
| Serialization | Java serialization | Custom encoders | Custom encoders |
| Language Support | All Spark languages | All Spark languages | Scala and Java only |

graph LR
    A["Data Storage in Spark"]
    
    A --> B["RDD"]
    A --> C["DataFrame"]
    A --> D["Dataset"]
    
    B --> B1["Raw Java/Scala Objects"]
    B --> B2["Distributed Collection"]
    B --> B3["No Schema Information"]
    
    C --> C1["Row Objects"]
    C --> C2["Schema-based Structure"]
    C --> C3["Column Names & Types"]
    
    D --> D1["Typed JVM Objects"]
    D --> D2["Schema + Type Information"]
    D --> D3["Strong Type Safety"]
    
    style B fill:#f9d6d6
    style C fill:#d6e5f9
    style D fill:#d6f9d6

 

This diagram illustrates how data is stored in different Spark abstractions:

  • RDD stores data as raw Java/Scala objects with no schema information
  • DataFrame organizes data in rows with defined column names and types
  • Dataset combines schema-based structure with strong type safety using JVM objects
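
The three abstractions are interoperable. The sketch below converts between them, reusing the hypothetical `Employee` case class from the Dataset example above and assuming an existing `spark` session:

```scala
// Assuming an existing SparkSession `spark` and the hypothetical Employee case class above.
import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(
  Employee("Asha", "engineering", 95000.0),
  Employee("Ravi", "sales", 60000.0)
))

val df: DataFrame         = rdd.toDF()       // RDD -> DataFrame: rows plus a schema, typed as Row
val ds: Dataset[Employee] = df.as[Employee]  // DataFrame -> Dataset: a typed view over the same data
val back                  = ds.rdd           // Dataset -> RDD of plain JVM objects
```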

Use Case Scenarios and Recommendations

| Abstraction | Best Use Cases | Why Choose This? |
|---|---|---|
| RDD | Low-level transformations; custom data types; legacy code maintenance | Complete control over data processing; working with unstructured data; need for custom optimization |
| DataFrame | SQL-like operations; machine learning pipelines; structured data processing | Better performance through optimization; familiar SQL-like interface; integration with BI tools |
| Dataset | Complex business logic; type-safe operations; domain object manipulation | Compile-time type safety; object-oriented programming; balance of performance and control |

Key Takeaways:

  • Use RDDs when you need low-level control or are working with unstructured data
  • Choose DataFrames for structured data and when performance is critical
  • Opt for Datasets when you need both type safety and performance optimization

Initialization and Operations Examples

| Operation | RDD | DataFrame | Dataset |
|---|---|---|---|
| Creation | `sc.parallelize(List(1, 2, 3))` | `spark.createDataFrame(data)` | `spark.createDataset(data)` |
| Reading Files | `sc.textFile("path")` | `spark.read.csv("path")` | `spark.read.csv("path").as[Case]` |
| Filtering | `rdd.filter(x => x > 10)` | `df.filter($"col" > 10)` | `ds.filter(_.value > 10)` |
| Mapping | `rdd.map(x => x * 2)` | `df.select($"col" * 2)` | `ds.map(x => x * 2)` |
| Grouping | `rdd.groupBy(x => x)` | `df.groupBy("col")` | `ds.groupByKey(_.key)` |

Note: The above examples assume the necessary imports and SparkSession/SparkContext initialization; a sketch of that setup follows below.
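
For completeness, this is roughly the setup those snippets assume; the application name and `local[*]` master are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// The application name and master are placeholders; drop .master(...) when using spark-submit.
val spark = SparkSession.builder()
  .appName("data-abstraction-examples")
  .master("local[*]")              // run locally using all available cores
  .getOrCreate()

val sc = spark.sparkContext        // the `sc` used in the RDD column above

import spark.implicits._           // enables $"col", .toDS(), and .as[Case]

// ...operations from the table go here...

spark.stop()
```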