When a Spark application is submitted, transformations are not executed immediately as they are encountered. Instead, Spark constructs a logical execution plan, represented as a Directed Acyclic Graph (DAG), which captures the computation flow and its dependencies before any physical execution begins.
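
A minimal PySpark sketch of this lazy planning (the path and column names are illustrative): transformations only add nodes to the plan, explain() shows the DAG Spark has built, and nothing actually runs until an action is called.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dag-demo").getOrCreate()

# Transformations only describe the computation; nothing runs yet.
df = (spark.read.parquet("/data/events")      # illustrative path
        .filter(F.col("status") == "ok")      # narrow transformation
        .withColumn("day", F.to_date("ts")))  # narrow transformation

# Inspect the logical and physical plans Spark has built so far.
df.explain()

# Only an action (count, collect, write, ...) triggers actual execution.
df.count()
```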

  1. Job Trigger
    • A job starts only when you run an action (e.g., collect(), count()).
    • This job is then broken into stages.
  2. Stages
    • Stages are separated by shuffle points (caused by wide transformations like groupBy or join).
    • Inside a stage, Spark can pipeline operations (e.g., map, filter) without shuffling data.
  3. Tasks
    • Each stage is made of tasks, the smallest unit of execution.
    • A task processes one partition of data and is sent to an executor slot.
    • In short, one task per partition: each task works on one chunk of the data (the sketches after this list show the stage split and the task-per-partition rule in practice).
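
A sketch of how a wide transformation splits a job into stages (the table path and columns are illustrative): the Exchange node in the physical plan marks the shuffle boundary, and the narrow operations on either side of it are pipelined within a single stage.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stage-demo").getOrCreate()

orders = spark.read.parquet("/data/orders")        # illustrative path

# filter/withColumn are narrow: they stay in the same stage as the scan.
revenue = (orders
           .filter(F.col("status") == "complete")
           .withColumn("total", F.col("price") * F.col("qty"))
           .groupBy("customer_id")                  # wide: forces a shuffle
           .agg(F.sum("total").alias("revenue")))

# The physical plan shows an Exchange operator at the shuffle boundary;
# that boundary is where Spark splits the job into two stages.
revenue.explain()

# The action triggers one job; the Spark UI shows its two stages.
revenue.count()
```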
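And a sketch of the task-per-partition rule (partition counts are illustrative): the stage that reads the data runs one task per input partition, while the post-shuffle stage runs one task per shuffle partition, controlled by spark.sql.shuffle.partitions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("task-demo").getOrCreate()

df = spark.range(1_000_000).repartition(8)   # 8 partitions -> 8 tasks in this stage
print(df.rdd.getNumPartitions())             # 8

# After a shuffle, the next stage gets spark.sql.shuffle.partitions tasks
# (200 by default, unless Adaptive Query Execution coalesces them).
spark.conf.set("spark.sql.shuffle.partitions", "50")
grouped = df.groupBy((df.id % 10).alias("bucket")).count()
grouped.collect()                            # post-shuffle stage runs 50 tasks
print(grouped.rdd.getNumPartitions())        # 50 (may differ if AQE is enabled)
```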
