When a Spark application is submitted, it does not execute statements eagerly, one by one. Instead, Spark constructs a logical execution plan, represented as a Directed Acyclic Graph (DAG), which captures the computation flow and dependencies before physical execution begins.
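As a minimal sketch (assuming a local `SparkSession` named `spark` and a toy dataset built with `spark.range`), the snippet below chains a few transformations and then prints the plan Spark has built so far; nothing runs yet because no action has been called:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dag-demo").getOrCreate()

# Transformations only describe the computation; no data is processed yet.
df = spark.range(1_000_000)                      # toy dataset for illustration
doubled = df.withColumn("x2", F.col("id") * 2)   # narrow transformation
filtered = doubled.filter(F.col("x2") % 3 == 0)  # narrow transformation

# Inspect the logical and physical plans derived from the DAG.
# This prints the plan without launching a job.
filtered.explain(True)
```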
- Job Trigger
  - A job starts only when you run an action (e.g., `collect()`, `count()`).
  - This job is then broken into stages (see the first sketch after this list).
- Stages
  - Stages are separated by shuffle points, caused by wide transformations like `groupBy` or `join`.
  - Inside a stage, Spark can pipeline operations (e.g., `map`, `filter`) without shuffling data (see the shuffle-boundary sketch below).
- Tasks
  - Each stage is made of tasks, the smallest unit of execution.
  - A task processes one partition of data and is sent to an executor slot.
  - In short: one partition = one task = work on one chunk of data (see the last sketch below).
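A hedged sketch of the job trigger, reusing the hypothetical `spark` session from above: every step here is a lazy transformation except the final `collect()`, which is the action that actually launches a job (visible in the Spark UI, by default at port 4040):

```python
# Everything below is lazy: groupBy(...).count() returns a new DataFrame
# and only extends the plan; no job has started yet.
per_bucket = (
    spark.range(100)
         .withColumn("bucket", F.col("id") % 5)
         .groupBy("bucket")
         .count()
)

# The action triggers the job; Spark now breaks the plan into stages and runs it.
rows = per_bucket.collect()
print(rows)
```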
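To see the stage boundary described above, another small sketch under the same assumptions: the narrow `filter`/`select` steps pipeline together within one stage, while the `groupBy` introduces a shuffle, which appears as an `Exchange` node in the physical plan and starts a new stage:

```python
# Narrow transformations: pipelined within a single stage, no data movement.
narrow = (
    spark.range(10_000)
         .filter(F.col("id") % 2 == 0)
         .select((F.col("id") * 10).alias("v"))
)

# Wide transformation: groupBy forces a shuffle, so a new stage begins here.
wide = narrow.groupBy((F.col("v") % 7).alias("k")).count()

# The physical plan shows an Exchange at the shuffle boundary.
wide.explain()
```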
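Finally, a sketch of the partition-to-task mapping (the partition count of 8 is an arbitrary choice for illustration): the number of tasks Spark schedules for a stage equals the number of partitions that stage reads:

```python
# Repartitioning controls how many chunks (partitions) the data is split into.
df8 = spark.range(1_000_000).repartition(8)

# 8 partitions -> Spark schedules 8 tasks for the stage that reads this data.
print(df8.rdd.getNumPartitions())   # 8

# Running an action launches those tasks on executor slots;
# the Spark UI shows 8 tasks for the corresponding stage.
print(df8.count())
```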