
PySpark exposes Apache Spark's DataFrameWriter to manage how data is saved.

To write data, you start with the .write attribute of a DataFrame, which returns a DataFrameWriter that controls the save process.

Basic Approach to Writing Data

There are two main syntax styles in PySpark, both shown in the sketch after this list:

  1. Generic API
    Uses the chain: .format().option().save()
  2. Format-specific shortcuts
    Uses direct methods like .csv(), .json(), .parquet(), etc.
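Here is a minimal sketch of both styles; the DataFrame contents and the /tmp output paths are illustrative assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-demo").getOrCreate()
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# 1. Generic API: format/option/save chain
df.write.format("csv").option("header", "true").save("/tmp/out_generic")

# 2. Format-specific shortcut: .csv() takes the same options as keyword arguments
df.write.csv("/tmp/out_shortcut", header=True)
```

Both calls produce the same CSV output; the shortcut methods are thin wrappers over the generic chain.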

Key Components

  • .write: Returns a DataFrameWriter object
  • .format(): Defines the output format (e.g., CSV, JSON, Parquet)
  • .option(): Controls file-specific settings like headers, delimiters, and quotes (see the sketch after this list)
  • .save(): Triggers the actual data write

Mode Behaviour with .mode()

The .mode() method controls how PySpark behaves when data already exists at the output location:

  • "overwrite": Deletes and replaces any existing data
  • "append": Adds new rows to the existing output
  • "ignore": Skips the write if data already exists
  • "error" / "errorifexists": (Default) Throws an error if data exists

Choose the mode based on whether you’re doing a full refresh or incremental load.
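A short sketch of the four modes in use; the Parquet format and /tmp paths are illustrative:

```python
df.write.mode("overwrite").parquet("/tmp/daily_snapshot")     # full refresh
df.write.mode("append").parquet("/tmp/event_log")             # incremental load
df.write.mode("ignore").parquet("/tmp/reference_data")        # write once, then no-op
df.write.mode("errorifexists").parquet("/tmp/strict_output")  # default: fail if data exists
```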

Partitioning Output Files

  • .partitionBy("column"): Splits the output into folders based on column values (see the sketch after this list)
  • Improves query performance when filtering on partitioned columns
  • Avoid partitioning on columns with high cardinality, which produces a large number of small files
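
As an example, the sketch below partitions a hypothetical events DataFrame by a date column; the DataFrame name, column name, and path are assumptions:

```python
# Each distinct event_date value becomes its own subfolder, e.g.
#   /tmp/events/event_date=2024-01-01/part-*.parquet
(events.write
       .partitionBy("event_date")
       .mode("overwrite")
       .parquet("/tmp/events"))
```

A query that filters on event_date can then skip every non-matching folder entirely (partition pruning), which is where the performance gain comes from.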
