For any data operation in Apache Spark, one of the first tasks is creating a DataFrame. This is done with the createDataFrame() API. There are two main things to consider:
- Data Source – Data can come from a list of dictionaries, a list of lists or tuples, or even directly from a Pandas DataFrame.
- Schema – The structure of your data. This can be:
  - Auto-Inferred: Spark automatically detects column names and data types.
  - Explicit: You define the schema yourself. For simple cases, a string schema works (e.g. 'name STRING, age INT'). For more complex cases, a StructType gives you full control, including nullability.
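As a quick illustration, here is a minimal sketch of all three approaches (the column names and rows are made up for the example):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("create-dataframe-demo").getOrCreate()

data = [("Alice", 30), ("Bob", 25)]  # hypothetical rows

# 1. Auto-inferred schema: Spark derives the column types from the data
df_inferred = spark.createDataFrame(data, ["name", "age"])

# 2. Simple string schema
df_string = spark.createDataFrame(data, schema="name STRING, age INT")

# 3. Explicit StructType schema, including nullability
schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("age", IntegerType(), nullable=True),
])
df_explicit = spark.createDataFrame(data, schema=schema)
```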
PySpark uses the DataFrameWriter to manage how data is saved. To write data, you start with the .write attribute of a DataFrame, which returns a DataFrameWriter that manages the save process.
Basic Approach to Writing Data
There are two main syntax styles in PySpark:
- Generic API – uses the chain .format().option().save()
- Format-specific shortcuts – use direct methods like .csv(), .json(), .parquet(), etc.
Key Components
- .write: Returns a DataFrameWriter object
- .format(): Defines the output format (e.g., CSV, JSON, Parquet)
- .option(): Controls file-specific settings like headers, delimiters, quotes
- .save(): Triggers the actual data write
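Putting these together, a minimal sketch in both styles (the output paths are placeholders):

```python
# Generic API: format().option().save() chain
df.write.format("csv") \
    .option("header", "true") \
    .option("delimiter", ",") \
    .save("/tmp/output/people_csv")  # hypothetical path

# Format-specific shortcut: roughly equivalent one-liner
df.write.csv("/tmp/output/people_csv_v2", header=True, sep=",")
```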
Mode Behaviour with .mode()
The .mode() method controls how PySpark behaves when data already exists at the output location:
- "overwrite": Deletes and replaces the existing data
- "append": Adds new rows to the existing data
- "ignore": Skips the write if data already exists
- "error" (also "errorifexists"): (Default) Throws an error if data exists
Choose the mode based on whether you’re doing a full refresh or incremental load.
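For example (the paths are hypothetical):

```python
# Full refresh: replace whatever is already at the target location
df.write.mode("overwrite").parquet("/tmp/output/daily_snapshot")

# Incremental load: add today's rows to the existing data
df.write.mode("append").parquet("/tmp/output/event_log")
```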
Partitioning Output Files
- .partitionBy("column"): Splits the output into folders based on column values
- Improves query performance when filtering on partitioned columns
- Avoid partitioning on columns with high cardinality, which creates many small files
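A short sketch, assuming the DataFrame has a low-cardinality country column:

```python
# Creates one subfolder per distinct value, e.g. country=US/, country=DE/
df.write \
    .partitionBy("country") \
    .mode("overwrite") \
    .parquet("/tmp/output/users_by_country")  # hypothetical path
```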
PySpark makes it easy to load data from different sources into DataFrames. At first, the process can seem a little overwhelming, but this guide is designed as a visual walkthrough to simplify the concepts.
The data reading process begins with the .read attribute of a SparkSession, which provides access to a DataFrameReader object.
Basic approach to reading data
There are two main syntax styles for reading data in PySpark:
- Generic API using format().option().load() chain
- Convenience methods using wrapper methods like csv(), json(), or parquet()
Key components of data reading
- DataFrameReader object: created by accessing the .read attribute of a SparkSession
- Format specification: defines the file type such as CSV, JSON, or Parquet
- Options: format-specific settings that control how data is interpreted
- Schema definition: defines the structure of the resulting DataFrame (column names and types)
- Loading: calling .load() or a method like .csv() performs the read and returns a DataFrame
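A minimal sketch of both reading styles (the file path and options are placeholders):

```python
# Generic API: format().option().load() chain
df_csv = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/tmp/input/people.csv")  # hypothetical path

# Convenience method: roughly equivalent shortcut
df_csv2 = spark.read.csv("/tmp/input/people.csv", header=True, inferSchema=True)
```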
Schema handling approaches
- Explicit schema definition: use .schema() for precise control over data types
- Schema inference: use inferSchema=True to automatically detect data types
Tip: For large datasets or production environments, explicit schema definition is recommended for better performance and data consistency.
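For instance, a sketch of reading with an explicit schema (the column names are assumptions):

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

people_schema = StructType([
    StructField("name", StringType(), nullable=True),
    StructField("age", IntegerType(), nullable=True),
])

# Explicit schema: Spark skips the extra pass over the data needed for inference
df_people = spark.read.schema(people_schema).csv("/tmp/input/people.csv", header=True)
```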
When a Spark application is submitted, it does not execute statements sequentially. Instead, Spark constructs a logical execution plan, represented as a Directed Acyclic Graph (DAG), which captures the computation flow and dependencies before physical execution begins.
- Job Trigger
  - A job starts only when you run an action (e.g., collect(), count()).
  - This job is then broken into stages.
- Stages
  - Stages are separated by shuffle points (caused by wide transformations like groupBy or join).
  - Inside a stage, Spark can pipeline operations (e.g., map, filter) without shuffling data.
- Tasks
  - Each stage is made up of tasks, the smallest unit of execution.
  - A task processes one partition of data and is sent to an executor slot.
  - So: one partition = one task = work on one chunk of data.
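To make this concrete, here is a small sketch (the column names are made up):

```python
from pyspark.sql import functions as F

# Narrow transformations (filter, withColumn) are pipelined within a single stage
filtered = df.filter(F.col("age") > 21).withColumn("age_next_year", F.col("age") + 1)

# Wide transformation (groupBy) introduces a shuffle, i.e. a stage boundary
grouped = filtered.groupBy("country").count()

# Nothing has executed yet: the action below triggers the job,
# which Spark splits into stages at the shuffle boundary
result = grouped.collect()
```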
The .persist() method in Apache Spark is used to store intermediate data so that Spark doesn’t have to recompute it every time. This can make your jobs run much faster when the same data is used in multiple actions.
- Without .persist()
  - Every action (e.g., count(), collect()) recomputes the entire DataFrame or RDD.
- With .persist()
  - Data is saved in memory (or on disk, depending on the storage level).
  - Subsequent actions reuse this stored data instead of recomputing it.
- After .unpersist()
  - The data is removed from memory/disk, freeing resources.
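A minimal sketch of that lifecycle (the columns are hypothetical):

```python
from pyspark.sql import functions as F

# An intermediate result that several actions will reuse
aggregated = df.groupBy("country").agg(F.avg("age").alias("avg_age"))

aggregated.persist()      # mark for caching; materialized by the first action

print(aggregated.count())                              # computes and caches the data
top5 = aggregated.orderBy(F.desc("avg_age")).take(5)   # reuses the cached data

aggregated.unpersist()    # free memory/disk once the data is no longer needed
```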
Storage Levels
Spark provides different storage levels to balance memory use and speed:
- MEMORY_ONLY → Fastest, but data is lost if it doesn’t fit in memory.
- MEMORY_AND_DISK → Stores in memory; spills to disk if too large.
- DISK_ONLY → Slower, but uses less memory.
- Serialized options → Save space but cost extra CPU to serialize and deserialize the data.
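A short sketch of choosing a storage level explicitly:

```python
from pyspark import StorageLevel

# Keep the data in memory, spilling to disk if it doesn't fit
aggregated.persist(StorageLevel.MEMORY_AND_DISK)

# Or trade speed for memory by keeping it only on disk:
# aggregated.persist(StorageLevel.DISK_ONLY)
```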
Why Use .persist()?
- Reduce latency → Avoid repeating heavy computations.
- Improve throughput → Reuse datasets for multiple actions.
- Stability → Prevent failures from repeatedly recalculating big datasets.
- Control resources → Helps manage memory vs computation trade-offs.