
PySpark exposes Apache Spark's DataFrameWriter to manage how data is saved.

To write data, you start with the .write attribute of a DataFrame, which returns a DataFrameWriter that controls the save process.

Basic Approach to Writing Data

There are two main syntax styles in PySpark, both shown in the sketch after this list:

  1. Generic API
    Uses the chain: .format().option().save()
  2. Format-specific shortcuts
    Uses direct methods like .csv(), .json(), .parquet(), etc.
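Here is a minimal sketch of both styles; the DataFrame contents and the /tmp output paths are illustrative assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-demo").getOrCreate()
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# 1. Generic API: format/option/save chain
df.write.format("csv").option("header", "true").save("/tmp/out_generic")

# 2. Format-specific shortcut: .csv() takes the same options as keyword arguments
df.write.csv("/tmp/out_shortcut", header=True)
```

Both calls produce the same CSV output; the shortcut methods are thin wrappers over the generic chain.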

Key Components

  • .write: Returns a DataFrameWriter object
  • .format(): Defines the output format (e.g., CSV, JSON, Parquet)
  • .option(): Controls file-specific settings like headers, delimiters, and quotes (see the sketch after this list)
  • .save(): Triggers the actual data write

Mode Behaviour with .mode()

The .mode() method controls how PySpark behaves when data already exists at the output location:

  • "overwrite": Deletes and replaces any existing data
  • "append": Adds new rows to the existing output
  • "ignore": Skips the write if data already exists
  • "error" / "errorifexists": (Default) Throws an error if data exists

Choose the mode based on whether you’re doing a full refresh or incremental load.
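A short sketch of the four modes in use; the Parquet format and /tmp paths are illustrative:

```python
df.write.mode("overwrite").parquet("/tmp/daily_snapshot")     # full refresh
df.write.mode("append").parquet("/tmp/event_log")             # incremental load
df.write.mode("ignore").parquet("/tmp/reference_data")        # write once, then no-op
df.write.mode("errorifexists").parquet("/tmp/strict_output")  # default: fail if data exists
```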

Partitioning Output Files

  • .partitionBy("column"): Splits the output into folders based on column values (see the sketch after this list)
  • Improves query performance when filtering on partitioned columns
  • Avoid partitioning on columns with high cardinality, which produces a large number of small files
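
As an example, the sketch below partitions a hypothetical events DataFrame by a date column; the DataFrame name, column name, and path are assumptions:

```python
# Each distinct event_date value becomes its own subfolder, e.g.
#   /tmp/events/event_date=2024-01-01/part-*.parquet
(events.write
       .partitionBy("event_date")
       .mode("overwrite")
       .parquet("/tmp/events"))
```

A query that filters on event_date can then skip every non-matching folder entirely (partition pruning), which is where the performance gain comes from.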
