When a Spark application is submitted, transformations are not executed immediately as they are encountered. Instead, Spark constructs a logical execution plan, represented as a Directed Acyclic Graph (DAG), which captures the computation flow and its dependencies before any physical execution begins.
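
A minimal PySpark sketch of this lazy planning (the path and column names are illustrative): transformations only add nodes to the plan, explain() shows the DAG Spark has built, and nothing actually runs until an action is called.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dag-demo").getOrCreate()

# Transformations only describe the computation; nothing runs yet.
df = (spark.read.parquet("/data/events")      # illustrative path
        .filter(F.col("status") == "ok")      # narrow transformation
        .withColumn("day", F.to_date("ts")))  # narrow transformation

# Inspect the logical and physical plans Spark has built so far.
df.explain()

# Only an action (count, collect, write, ...) triggers actual execution.
df.count()
```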

  1. Job Trigger
    • A job starts only when you run an action (e.g., collect(), count()).
    • This job is then broken into stages.
  2. Stages
    • Stages are separated by shuffle points (caused by wide transformations like groupBy or join).
    • Inside a stage, Spark can pipeline operations (e.g., map, filter) without shuffling data.
  3. Tasks
    • Each stage is made of tasks, the smallest unit of execution.
    • A task processes one partition of data and is sent to an executor slot.
    • In short, one task per partition: each task works on one chunk of the data (the sketches after this list show the stage split and the task-per-partition rule in practice).
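
A sketch of how a wide transformation splits a job into stages (the table path and columns are illustrative): the Exchange node in the physical plan marks the shuffle boundary, and the narrow operations on either side of it are pipelined within a single stage.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stage-demo").getOrCreate()

orders = spark.read.parquet("/data/orders")        # illustrative path

# filter/withColumn are narrow: they stay in the same stage as the scan.
revenue = (orders
           .filter(F.col("status") == "complete")
           .withColumn("total", F.col("price") * F.col("qty"))
           .groupBy("customer_id")                  # wide: forces a shuffle
           .agg(F.sum("total").alias("revenue")))

# The physical plan shows an Exchange operator at the shuffle boundary;
# that boundary is where Spark splits the job into two stages.
revenue.explain()

# The action triggers one job; the Spark UI shows its two stages.
revenue.count()
```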
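And a sketch of the task-per-partition rule (partition counts are illustrative): the stage that reads the data runs one task per input partition, while the post-shuffle stage runs one task per shuffle partition, controlled by spark.sql.shuffle.partitions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("task-demo").getOrCreate()

df = spark.range(1_000_000).repartition(8)   # 8 partitions -> 8 tasks in this stage
print(df.rdd.getNumPartitions())             # 8

# After a shuffle, the next stage gets spark.sql.shuffle.partitions tasks
# (200 by default, unless Adaptive Query Execution coalesces them).
spark.conf.set("spark.sql.shuffle.partitions", "50")
grouped = df.groupBy((df.id % 10).alias("bucket")).count()
grouped.collect()                            # post-shuffle stage runs 50 tasks
print(grouped.rdd.getNumPartitions())        # 50 (may differ if AQE is enabled)
```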
