
Apache Spark provides three core abstractions for handling data:

RDD (Resilient Distributed Dataset)

RDDs are distributed collections of objects that can be cached in memory across the nodes of a cluster; for example, a large array can be partitioned and spread across many worker nodes instead of living on a single machine.
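
As a minimal sketch (assuming an existing SparkSession named spark), the snippet below distributes a local collection across the cluster as an RDD and caches it in executor memory:

```scala
// Minimal sketch, assuming an existing SparkSession named `spark`.
val sc = spark.sparkContext

// Distribute a local collection across the cluster as an RDD and cache it in memory.
val numbers = sc.parallelize(1 to 1000000)
numbers.cache()

// Transformations run in parallel on the cached partitions.
val evenCount = numbers.filter(_ % 2 == 0).count()
println(evenCount)
```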

DataFrame

DataFrames are distributed collections of data organized into named columns, similar to tables in a relational database. They provide a powerful abstraction that supports structured and semi-structured data with optimized execution through the Catalyst optimizer.

DataFrames are schema-aware and can be created from various data sources including structured files (CSV, JSON), Hive tables, or external databases, offering SQL-like operations for data manipulation and analysis.
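
As a small illustration, the sketch below loads a DataFrame from a JSON file and runs a few SQL-like operations on it; the file name and the name/age columns are assumptions made for the example, not fixed requirements:

```scala
// Minimal sketch of DataFrame creation and querying; file and columns are illustrative.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("dataframe-sketch").getOrCreate()

// Spark infers the schema (column names and types) from the JSON data.
val people = spark.read.json("people.json")
people.printSchema()

// SQL-like operations are planned through the Catalyst optimizer before execution.
people
  .filter(col("age") > 21)
  .groupBy("name")
  .agg(avg("age").alias("avg_age"))
  .show()
```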

Dataset

Datasets provide a type-safe, object-oriented programming interface that combines the benefits of RDDs (static typing and lambda functions) with Spark SQL's optimized execution engine.

They offer a combination of type safety and ease of use, which makes them particularly useful when compile-time checks matter and the data fits a well-defined schema expressed as Scala case classes or Java beans.
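
Here is a minimal sketch of a typed Dataset built from a Scala case class (the Person class and the sample rows are purely illustrative):

```scala
// Minimal sketch: a Dataset typed by a Scala case class.
import org.apache.spark.sql.{Dataset, SparkSession}

case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("dataset-sketch").getOrCreate()
import spark.implicits._   // provides the encoders for case classes

val people: Dataset[Person] = Seq(Person("Ana", 34), Person("Bo", 19)).toDS()

// Fields are checked at compile time: people.filter(_.agee > 21) would not compile.
val adults = people.filter(_.age > 21)
adults.show()
```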


Comparison of Spark Data Abstractions

| Feature | RDD | DataFrame | Dataset |
| --- | --- | --- | --- |
| Type Safety | Type-safe | Not type-safe | Type-safe |
| Schema | No schema | Schema-based | Schema-based |
| API Style | Functional API | Domain-specific language (DSL) | Both functional and DSL |
| Optimization | Basic | Catalyst Optimizer | Catalyst Optimizer |
| Memory Usage | High | Efficient | Moderate |
| Serialization | Java Serialization | Custom encoders | Custom encoders |
| Language Support | All Spark languages | All Spark languages | Scala and Java only |
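
To make the type-safety row concrete, the sketch below (reusing the illustrative Person case class and assuming import spark.implicits._) shows that a misspelled DataFrame column only fails at runtime, while the same mistake on a Dataset does not compile:

```scala
// Sketch of the type-safety difference; assumes `import spark.implicits._`
// and the illustrative Person case class from the Dataset example above.
val df = Seq(Person("Ana", 34), Person("Bo", 19)).toDF()   // columns: name, age
val ds = df.as[Person]                                      // typed view of the same data

// DataFrame: a wrong column name compiles but fails at runtime with AnalysisException.
// df.select($"agee")

// Dataset: the same kind of mistake is rejected at compile time.
// ds.map(_.agee)          // does not compile: Person has no field `agee`
ds.map(_.age * 2).show()
```
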
graph LR
    A["Data Storage in Spark"]
    
    A --> B["RDD"]
    A --> C["DataFrame"]
    A --> D["Dataset"]
    
    B --> B1["Raw Java/Scala Objects"]
    B --> B2["Distributed Collection"]
    B --> B3["No Schema Information"]
    
    C --> C1["Row Objects"]
    C --> C2["Schema-based Structure"]
    C --> C3["Column Names & Types"]
    
    D --> D1["Typed JVM Objects"]
    D --> D2["Schema + Type Information"]
    D --> D3["Strong Type Safety"]
    
    style B fill:#f9d6d6
    style C fill:#d6e5f9
    style D fill:#d6f9d6


This diagram illustrates how data is stored in different Spark abstractions:

  • RDD stores data as raw Java/Scala objects with no schema information
  • DataFrame organizes data in rows with defined column names and types
  • Dataset combines schema-based structure with strong type safety using JVM objects
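
One quick way to see this for yourself, assuming the spark session, the illustrative Person case class from earlier, and import spark.implicits._:

```scala
// RDD: no schema, just distributed JVM objects.
val rdd = spark.sparkContext.parallelize(Seq(Person("Ana", 34), Person("Bo", 19)))
// rdd has no printSchema(); Spark only sees opaque objects.

// DataFrame: Row objects plus column names and types.
val df = rdd.toDF()
df.printSchema()            // shows the inferred column names and types

// Dataset: the same schema plus compile-time typing of each record.
val ds = df.as[Person]
val ages = ds.map(_.age)    // _.age is checked by the compiler
```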

Use Case Scenarios and Recommendations

| Abstraction | Best Use Cases | Why Choose This? |
| --- | --- | --- |
| RDD | Low-level transformations; custom data types; legacy code maintenance | Complete control over data processing; working with unstructured data; need for custom optimization |
| DataFrame | SQL-like operations; machine learning pipelines; structured data processing | Better performance through optimization; familiar SQL-like interface; integration with BI tools |
| Dataset | Complex business logic; type-safe operations; domain object manipulation | Compile-time type safety; object-oriented programming; balance of performance and control |

Key Takeaways:

  • Use RDDs when you need low-level control or are working with unstructured data
  • Choose DataFrames for structured data and when performance is critical
  • Opt for Datasets when you need both type safety and performance optimization (a conversion sketch follows below)
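
Since the right abstraction can change as a pipeline evolves, it is worth knowing that the three convert into one another with a single call. A minimal sketch, with illustrative names and files, assuming the usual spark session and import spark.implicits._:

```scala
// Converting between abstractions; the Click class and file name are illustrative.
import org.apache.spark.sql.Dataset

case class Click(userId: String, durationMs: Long)

val ds: Dataset[Click] = spark.read.json("clicks.json").as[Click]  // typed Dataset
val df  = ds.toDF()                                                // DataFrame view for SQL-style work
val rdd = ds.rdd                                                   // RDD for low-level control

val totalsDf  = df.groupBy("userId").sum("durationMs")
val totalsRdd = rdd.map(c => (c.userId, c.durationMs)).reduceByKey(_ + _)
```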

Initialization and Operations Examples

| Operation | RDD | DataFrame | Dataset |
| --- | --- | --- | --- |
| Creation | `sc.parallelize(List(1,2,3))` | `spark.createDataFrame(data)` | `spark.createDataset(data)` |
| Reading Files | `sc.textFile("path")` | `spark.read.csv("path")` | `spark.read.csv("path").as[Case]` |
| Filtering | `rdd.filter(x => x > 10)` | `df.filter($"col" > 10)` | `ds.filter(_.value > 10)` |
| Mapping | `rdd.map(x => x * 2)` | `df.select($"col" * 2)` | `ds.map(x => x * 2)` |
| Grouping | `rdd.groupBy(x => x)` | `df.groupBy("col")` | `ds.groupByKey(_.key)` |

Note: The above examples assume the necessary imports and SparkSession/SparkContext initialization; a minimal setup is sketched below.
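
A minimal sketch of that setup might look like this (the application name and local master are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-abstractions-demo")   // placeholder application name
  .master("local[*]")                   // local mode for experimenting; omit when using spark-submit
  .getOrCreate()

val sc = spark.sparkContext             // SparkContext used by the RDD examples
import spark.implicits._                // enables $"col", .toDF, .toDS and the .as[Case] conversions

// ... run the RDD / DataFrame / Dataset operations from the table above ...

spark.stop()
```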