
Creating DataFrames in PySpark: A Visual Guide

For any data operation in Apache Spark, one of the first tasks is creating a DataFrame. This is done with the createDataFrame() API. There are two main things to consider:

  1. Data Source – Data can come from a list of dictionaries, a list of lists or tuples, or even directly from a Pandas DataFrame.
  2. Schema – The structure of your data. This can be:
    • Auto-Inferred: Spark automatically detects column names and data types.
    • Explicit: You define the schema yourself. For simple cases, a string schema works (e.g. 'name STRING, age INT'). For more complex cases, a StructType gives you full control, including nullability.
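
A minimal sketch of all three approaches, assuming a local SparkSession; the column names and row values are purely illustrative:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.appName("create-df-example").getOrCreate()

    # 1) Auto-inferred schema from a list of dictionaries
    df_inferred = spark.createDataFrame([{"name": "Alice", "age": 29},
                                         {"name": "Bob", "age": 35}])

    # 2) String schema with a list of tuples
    df_string = spark.createDataFrame([("Alice", 29), ("Bob", 35)],
                                      schema="name STRING, age INT")

    # 3) Explicit StructType schema, including nullability
    schema = StructType([
        StructField("name", StringType(), nullable=False),
        StructField("age", IntegerType(), nullable=True),
    ])
    df_struct = spark.createDataFrame([("Alice", 29), ("Bob", 35)], schema=schema)

    df_struct.printSchema()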

Writing data with PySpark: A Visual Guide

To write data in PySpark, you start with the .write attribute of a DataFrame, which returns a DataFrameWriter that manages the save process.

Basic Approach to Writing Data

There are two main syntax styles in PySpark:

  1. Generic API
    Uses the chain: .format().option().save()
  2. Format-specific shortcuts
    Uses direct methods like .csv(), .json(), .parquet(), etc.
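
As a sketch of both styles, assuming df is a DataFrame like the one created earlier and the /tmp/output/... paths are placeholders:

    # Generic API: .format().option().save()
    df.write \
        .format("csv") \
        .option("header", True) \
        .option("delimiter", ",") \
        .save("/tmp/output/people_csv")  # placeholder path

    # Format-specific shortcut: the same options become keyword arguments
    df.write.csv("/tmp/output/people_csv_shortcut", header=True)

    # Parquet shortcut (column names and types are stored in the file itself)
    df.write.parquet("/tmp/output/people_parquet")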

Key Components

  • .write: Returns a DataFrameWriter object
  • .format(): Defines the output format (e.g., CSV, JSON, Parquet)
  • .option(): Controls file-specific settings like headers, delimiters, quotes
  • .save(): Triggers the actual data write

Mode Behaviour with .mode()

The .mode() method controls how PySpark behaves when data already exists at the output location:

  • "overwrite": Deletes existing data at the path and replaces it
  • "append": Adds new rows alongside the existing data
  • "ignore": Skips the write if data already exists
  • "error" (default, also "errorifexists"): Throws an error if data exists

Choose the mode based on whether you’re doing a full refresh or incremental load.
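A quick sketch of the four modes, again with df and the output path as placeholders:

    # Full refresh: replace whatever already exists at the path
    df.write.mode("overwrite").parquet("/tmp/output/people_parquet")

    # Incremental load: add new rows alongside the existing files
    df.write.mode("append").parquet("/tmp/output/people_parquet")

    # Write only if the path is empty; otherwise do nothing
    df.write.mode("ignore").parquet("/tmp/output/people_parquet")

    # Default behaviour: fail if the path already contains data
    df.write.mode("error").parquet("/tmp/output/people_parquet")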

Partitioning Output Files

  • .partitionBy("column"): Splits the output into folders based on column values
  • Improves query performance when filtering on partitioned columns
  • Avoid partitioning on columns with high cardinality to prevent small files
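
For illustration, assuming df has a low-cardinality country column (hypothetical):

    # One sub-folder per distinct value of the partition column
    df.write \
        .mode("overwrite") \
        .partitionBy("country") \
        .parquet("/tmp/output/people_by_country")

    # Resulting layout (roughly):
    # /tmp/output/people_by_country/country=UK/part-....parquet
    # /tmp/output/people_by_country/country=US/part-....parquet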

Reading data in PySpark: A Visual Guide

PySpark makes it easy to load data from different sources into DataFrames. At first, the process can seem a little overwhelming, but this guide is designed as a visual walkthrough to simplify the concepts.

The data reading process begins with the .read attribute of a SparkSession, which provides access to a DataFrameReader object.

Basic approach to reading data

There are two main syntax styles for reading data in PySpark:

  1. Generic API using format().option().load() chain
  2. Convenience methods using wrapper methods like csv(), json(), or parquet()
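
A sketch of both styles, assuming spark is an active SparkSession and the /tmp/input/... paths are placeholders:

    # Generic API: .format().option().load()
    df_csv = spark.read \
        .format("csv") \
        .option("header", True) \
        .option("inferSchema", True) \
        .load("/tmp/input/people.csv")  # placeholder path

    # Convenience wrappers: the same options become keyword arguments
    df_csv2 = spark.read.csv("/tmp/input/people.csv", header=True, inferSchema=True)
    df_json = spark.read.json("/tmp/input/people.json")
    df_parquet = spark.read.parquet("/tmp/input/people.parquet")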

Key components of data reading

  • DataFrameReader object: created by accessing the .read attribute of a SparkSession
  • Format specification: defines the file type such as CSV, JSON, or Parquet
  • Options: format-specific settings that control how data is interpreted
  • Schema definition: defines the structure of the resulting DataFrame (column names and types)
  • Loading: calling .load() or a method like .csv() performs the read and returns a DataFrame

Schema handling approaches

  • Explicit schema definition: use .schema() for precise control over data types
  • Schema inference: use inferSchema=True to automatically detect data types

Tip: For large datasets or production environments, explicit schema definition is recommended for better performance and data consistency.
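
A small sketch contrasting the two approaches on the same (placeholder) CSV file:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # Explicit schema: no extra pass over the data, and the types are guaranteed
    schema = StructType([
        StructField("name", StringType(), nullable=True),
        StructField("age", IntegerType(), nullable=True),
    ])
    df_explicit = spark.read.schema(schema).csv("/tmp/input/people.csv", header=True)

    # Schema inference: convenient, but Spark scans the data to guess the types
    df_inferred = spark.read.csv("/tmp/input/people.csv", header=True, inferSchema=True)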

Apache Spark Execution Flow: A Visual Guide

When a Spark application is submitted, it does not execute statements sequentially. Instead, Spark constructs a logical execution plan, represented as a Directed Acyclic Graph (DAG), which captures the computation flow and dependencies before physical execution begins.

  1. Job Trigger
    • A job starts only when you run an action (e.g., collect(), count()).
    • This job is then broken into stages.
  2. Stages
    • Stages are separated by shuffle points (caused by wide transformations like groupBy or join).
    • Inside a stage, Spark can pipeline operations (e.g., map, filter) without shuffling data.
  3. Tasks
    • Each stage is made of tasks, the smallest unit of execution.
    • A task processes one partition of data and is sent to an executor slot.
    • In short: one task per partition, each working on its own chunk of data.
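
A small sketch of how this plays out in code, assuming df has age and country columns (both hypothetical):

    # Narrow transformation: recorded in the DAG, nothing runs yet
    filtered = df.filter(df.age > 30)

    # Wide transformation: introduces a shuffle, so the job will have two stages
    grouped = filtered.groupBy("country").count()

    # Action: triggers the job; each stage runs one task per partition on the executors
    grouped.show()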

Spark Optimizations: A Technical Guide to .persist()

The .persist() method in Apache Spark is used to store intermediate data so that Spark doesn’t have to recompute it every time. This can make your jobs run much faster when the same data is used in multiple actions.

  1. Without .persist()
    • Every action (e.g., count(), collect()) recomputes the entire DataFrame or RDD.
  2. With .persist()
    • Data is saved in memory (or disk, depending on storage level).
    • Subsequent actions reuse this stored data instead of recomputing.
  3. After .unpersist()
    • The data is removed from memory/disk, freeing resources.
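
A minimal sketch of this lifecycle, assuming df and other_df are existing DataFrames that share a name column (all hypothetical):

    # An intermediate result that is reused by more than one action
    enriched = df.filter(df.age > 30).join(other_df, "name")

    enriched.persist()       # cache using the default storage level

    print(enriched.count())  # first action: computes the result and caches it
    enriched.show()          # second action: served from the cached copy

    enriched.unpersist()     # release memory/disk once the data is no longer needed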

Storage Levels

Spark provides different storage levels to balance memory use and speed:

  • MEMORY_ONLY → Fastest; partitions that don’t fit in memory are recomputed when needed.
  • MEMORY_AND_DISK → Stores in memory; spills to disk if too large.
  • DISK_ONLY → Slower, but uses less memory.
  • Serialized options → Save space but cost extra CPU to deserialize the data on access.
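
A sketch of passing an explicit storage level, with df as a placeholder DataFrame:

    from pyspark import StorageLevel

    # Memory only: fastest, but partitions that don't fit are recomputed on demand
    df.persist(StorageLevel.MEMORY_ONLY)
    df.unpersist()

    # Memory first, spilling to disk when the data is too large to cache fully
    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.unpersist()

    # Disk only: slower, but keeps executor memory free for computation
    df.persist(StorageLevel.DISK_ONLY)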

Why Use .persist()?

  • Reduce latency → Avoid repeating heavy computations.
  • Improve throughput → Reuse datasets for multiple actions.
  • Stability → Prevent failures from repeatedly recalculating big datasets.
  • Control resources → Helps manage memory vs computation trade-offs.