In PySpark, writing data starts with a DataFrame's .write attribute, which returns a DataFrameWriter that manages the save process.
Basic Approach to Writing Data
There are two main syntax styles in PySpark:
- Generic API: uses the chain .format().option().save()
- Format-specific shortcuts: uses direct methods like .csv(), .json(), .parquet(), etc.
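Both styles produce the same result. Here is a minimal sketch, assuming a local SparkSession and illustrative output paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-demo").getOrCreate()
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Generic API: format given as a string, options set explicitly
df.write.format("csv").option("header", True).save("/tmp/out/users_generic")

# Format-specific shortcut: same result, options passed as keyword arguments
df.write.csv("/tmp/out/users_shortcut", header=True)
```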
Key Components
- .write: Returns a DataFrameWriter object
- .format(): Defines the output format (e.g., CSV, JSON, Parquet)
- .option(): Controls file-specific settings like headers, delimiters, quotes
- .save(): Triggers the actual data write
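For instance, several options can be chained before the write is triggered; the path and option values here are illustrative:

```python
(df.write
    .format("csv")
    .option("header", True)   # write column names as the first row
    .option("sep", "|")       # use pipe instead of comma as the delimiter
    .option("quote", '"')     # character used to quote fields containing the delimiter
    .save("/tmp/out/events_csv"))
```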
Mode Behaviour with .mode()
The .mode() method controls how PySpark behaves when data already exists at the output location:
- "overwrite": Deletes and replaces existing data
- "append": Adds new rows to the existing data
- "ignore": Skips the write if data already exists
- "error" (default, also "errorifexists"): Throws an error if data exists
Choose the mode based on whether you’re doing a full refresh or incremental load.
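A minimal sketch contrasting the two cases, reusing the df above with an illustrative path:

```python
# Full refresh: replace everything at the target path
df.write.mode("overwrite").parquet("/tmp/out/users")

# Incremental load: add new rows alongside what is already there
df.write.mode("append").parquet("/tmp/out/users")
```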
Partitioning Output Files
- .partitionBy("column"): Splits the output into folders based on column values, as sketched below
- Improves query performance when filtering on partitioned columns
- Avoid partitioning on columns with high cardinality to prevent small files
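For example, partitioning on a hypothetical low-cardinality "country" column yields one folder per value:

```python
# Hypothetical DataFrame with a low-cardinality "country" column
orders = spark.createDataFrame(
    [(1, "US", 9.99), (2, "DE", 4.50), (3, "US", 12.00)],
    ["order_id", "country", "amount"],
)

# Produces folders like /tmp/out/orders/country=US/ and /tmp/out/orders/country=DE/
orders.write.partitionBy("country").mode("overwrite").parquet("/tmp/out/orders")
```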