In PySpark, saving data is handled by the DataFrameWriter. You access it through the .write attribute of a DataFrame, which returns a DataFrameWriter that manages the save process.
Basic Approach to Writing Data
There are two main syntax styles in PySpark, both shown in the sketch below:
- Generic API: uses the chain .format().option().save()
- Format-specific shortcuts: use direct methods like .csv(), .json(), .parquet(), etc.
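A minimal sketch of both styles (the sample DataFrame and output paths are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-demo").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# Generic API: pick a format, set options, then save to a path
df.write.format("csv").option("header", "true").save("/tmp/out_generic")

# Format-specific shortcut: the .csv() convenience method does the same
df.write.csv("/tmp/out_shortcut", header=True)
```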
Key Components
- .write: Returns a DataFrameWriter object
- .format(): Defines the output format (e.g., CSV, JSON, Parquet)
- .option(): Controls file-specific settings like headers, delimiters, quotes
- .save(): Triggers the actual data write
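Putting the components together, each .option() call sets one writer property, and nothing is written until .save() runs. A sketch reusing the df from above (the option values and path are illustrative):

```python
(df.write
    .format("csv")
    .option("header", "true")    # write column names as the first row
    .option("delimiter", ";")    # use semicolons instead of commas
    .option("quote", '"')        # character used to quote field values
    .save("/tmp/out_options"))
```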
Mode Behaviour with .mode()
The .mode() method controls how PySpark behaves when data already exists at the output location:
- "overwrite": Deletes and replaces existing data
- "append": Adds new rows to the existing data
- "ignore": Skips the write if data already exists
- "error" (default, also "errorifexists"): Throws an error if data exists
Choose the mode based on whether you're doing a full refresh or an incremental load.
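A short sketch of the modes in practice, again reusing df from above with an illustrative path:

```python
# With the default mode ("error"/"errorifexists"), a second write to the
# same path would raise an AnalysisException instead.
df.write.mode("overwrite").parquet("/tmp/out_modes")  # full refresh
df.write.mode("append").parquet("/tmp/out_modes")     # incremental load
df.write.mode("ignore").parquet("/tmp/out_modes")     # no-op: data already exists
```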
Partitioning Output Files
- .partitionBy("column"): Splits the output into folders based on column values
- Improves query performance when filtering on partitioned columns
- Avoid partitioning on columns with high cardinality, which produces many small files
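A sketch of partitioned output, using a hypothetical low-cardinality "country" column and an illustrative path:

```python
df_geo = spark.createDataFrame(
    [(1, "US"), (2, "US"), (3, "DE")], ["id", "country"])

# Creates one subfolder per value, e.g. /tmp/out_partitioned/country=US/
df_geo.write.partitionBy("country").mode("overwrite").parquet("/tmp/out_partitioned")
```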