PySpark manages how data is saved through the DataFrameWriter. To write data, you start with the .write attribute of a DataFrame, which returns a DataFrameWriter that controls the save process.
Basic Approach to Writing Data
There are two main syntax styles in PySpark, both shown in the sketch after this list:
- Generic API
 Uses the chain .format().option().save()
- Format-specific shortcuts
 Uses direct methods like .csv(), .json(), .parquet(), etc.
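A minimal sketch of both styles, assuming a SparkSession and a small demo DataFrame (the output paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-demo").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# Generic API: chain .format(), .option(), and .save()
df.write.format("csv").option("header", "true").save("/tmp/out/generic_csv")

# Format-specific shortcut: .csv() takes the same options as keyword arguments
df.write.csv("/tmp/out/shortcut_csv", header=True)
```

Both calls produce the same kind of output; the shortcut is just a more compact spelling of the generic chain.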
Key Components
- .write: Returns a DataFrameWriter object
- .format(): Defines the output format (e.g., CSV, JSON, Parquet)
- .option(): Controls format-specific settings such as headers, delimiters, and quote characters
- .save(): Triggers the actual data write
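The components compose into one chained call. A sketch, reusing the df from above (the delimiter and quote values are arbitrary examples):

```python
# Each step maps to one component: format, options, then the save trigger
(
    df.write
      .format("csv")               # output format
      .option("header", "true")    # write column names as the first row
      .option("sep", "|")          # use a pipe delimiter instead of a comma
      .option("quote", '"')        # character used to quote values
      .save("/tmp/out/piped_csv")  # nothing is written until .save() runs
)
```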
Mode Behaviour with .mode()
The .mode() method controls how PySpark behaves when data already exists at the output location:
- "overwrite": Deletes the existing data and replaces it
- "append": Adds new rows to the data already at the location
- "ignore": Skips the write if data already exists
- "error" (default, also accepted as "errorifexists"): Raises an error if data exists
Choose the mode based on whether you’re doing a full refresh or an incremental load.
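A sketch of the four modes against one placeholder path; the first two writes are the common full-refresh and incremental patterns:

```python
path = "/tmp/out/events"  # placeholder output location

df.write.mode("overwrite").parquet(path)  # full refresh: replace existing data
df.write.mode("append").parquet(path)     # incremental load: add new rows
df.write.mode("ignore").parquet(path)     # no-op here, since the path now exists
# df.write.mode("error").parquet(path)    # default: raises AnalysisException
```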
Partitioning Output Files
- .partitionBy("column"): Splits the output into folders based on column values
- Improves query performance when filtering on partitioned columns
- Avoid partitioning on columns with high cardinality to prevent small files
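A sketch of partitioned output, assuming a hypothetical sales DataFrame with a low-cardinality country column:

```python
sales = spark.createDataFrame(
    [("2024-01-01", "US", 100), ("2024-01-01", "DE", 80), ("2024-01-02", "US", 120)],
    ["date", "country", "amount"],
)

# Writes folders like /tmp/out/sales/country=US/ and /tmp/out/sales/country=DE/
sales.write.partitionBy("country").mode("overwrite").parquet("/tmp/out/sales")

# A filter on the partition column can skip whole folders (partition pruning)
spark.read.parquet("/tmp/out/sales").filter("country = 'US'").show()
```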
