
Motivation Behind Apache Spark

The limitations of MapReduce led to the development of Apache Spark, which addresses key challenges in modern data processing.

 

1. Distributed data processing began with Hadoop and MapReduce.

2. Over time, specialized solutions were built on top of Hadoop for streaming, SQL operations, and machine learning.

3. Finally, these capabilities were unified in a single engine: Apache Spark, originally developed at UC Berkeley's AMPLab and later donated to the Apache Software Foundation.

MapReduce Limitations

  • Disk-Based Processing: Intermediate results are written to disk between phases and between chained jobs, causing significant latency
  • Two-Step Only: Limited to map and reduce operations; anything else must be expressed awkwardly in those terms
  • Batch Processing Focus: Not suitable for interactive or real-time analysis
  • Complex Implementation: Multi-step operations require chaining multiple MapReduce jobs (see the sketch after this list)
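
A minimal sketch of that last point, assuming a local Spark installation and a hypothetical input.txt: a word count followed by a frequency sort is two chained jobs in classic MapReduce, each writing its intermediate results to disk, but a single job in Spark.

```scala
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("multi-step-pipeline")
      .master("local[*]") // local mode, for illustration only
      .getOrCreate()
    val sc = spark.sparkContext

    // Count words, then sort by frequency: one Spark job, with the
    // intermediate (word, count) pairs kept in memory between steps.
    val topWords = sc.textFile("input.txt") // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .sortBy({ case (_, count) => count }, ascending = false)
      .take(10)

    topWords.foreach(println)
    spark.stop()
  }
}
```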

Feature Comparison

| Feature | MapReduce | Spark |
| --- | --- | --- |
| Processing Speed | Slower (disk-based) | Up to 100x faster (in-memory) |
| Programming Model | Map and Reduce only | 80+ high-level operators |
| Real-time Processing | No (only via external systems such as Storm) | Yes (Spark Streaming API, 2 to 5 times faster than Storm) |
| Machine Learning | No built-in support | MLlib library |
| Graph Processing | No built-in support (only via systems implementing the Pregel model) | GraphX component, which implements the Pregel API in about 20 lines of code |
| Interactive Analysis | No | Yes (Spark Shell) |
| SQL Support | Through Hive only | Native Spark SQL (Hive-compatible, up to 100 times faster than Hive on MapReduce) |
| Recovery Speed | Slow (checkpoint-based) | Fast (lineage-based) |
| Language Support | Java | Scala, Java, Python, R |
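
To make the speed, interactive-analysis, and SQL rows concrete, here is a minimal spark-shell sketch; events.csv is a hypothetical file, and `spark` and `sc` are the session objects the shell predefines.

```scala
// Inside spark-shell, `spark` (a SparkSession) is predefined.
val df = spark.read.option("header", "true").csv("events.csv") // hypothetical file

df.cache()   // mark the dataset for in-memory caching
df.count()   // first action: reads from disk and populates the cache
df.count()   // later actions reuse the in-memory copy, hence the speedup

// Native Spark SQL over the same cached data:
df.createOrReplaceTempView("events")
spark.sql("SELECT COUNT(*) FROM events").show()
```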

Key Spark Innovations

  • Resilient Distributed Datasets (RDD): Immutable, partitioned collections that can be cached in memory and rebuilt from their lineage after a failure (sketched below)
  • DAG Execution Engine: Plans a whole chain of transformations as a directed acyclic graph before executing it, enabling optimized scheduling
  • Unified Platform: Single framework for batch, streaming, SQL, machine learning, and graph workloads
  • Rich Ecosystem: Integrated libraries (Spark SQL, Spark Streaming, MLlib, GraphX) for various use cases
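
A small spark-shell sketch of the first two points: transformations only extend the lineage graph, which the DAG scheduler uses both to plan execution and to recompute lost partitions.

```scala
// Inside spark-shell. Transformations are lazy: each one extends the
// lineage graph rather than running immediately.
val nums    = sc.parallelize(1 to 1000000)
val squares = nums.map(n => n.toLong * n)
val evens   = squares.filter(_ % 2 == 0)

// Print the recorded lineage, the basis of both DAG planning and
// lineage-based recovery (lost partitions are recomputed from it).
println(evens.toDebugString)

evens.count() // only this action triggers the DAG scheduler to run stages
```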

These improvements make Spark a more versatile and efficient framework for modern big data processing.

 

Spark Framework