
My Learning Space

Space to take notes, learn and share.

Creating a DataFrame in PySpark: A Visual Guide

For any data operation in Apache Spark, one of the first tasks is creating a DataFrame. This is done using the createDataFrame() API. There are two main things to consider: Data Source – Data can come from a list of dictionaries, a list of lists or tuples, or even...


Writing data with PySpark: A Visual Guide

PySpark in Apache Spark uses the DataFrameWriter to manage how data is saved. To write data in PySpark, you start with the .write attribute of a DataFrame, which gives you a DataFrameWriter to manage the save process. Basic Approach to...


Reading Data in PySpark: A Visual Guide

PySpark makes it easy to load data from different sources into DataFrames. At first, the process can seem a little overwhelming, but this guide is designed as a visual walkthrough to simplify the concepts. The data...


Apache Spark Execution Flow: A Visual Guide

When a Spark application is submitted, it does not execute statements sequentially. Instead, Spark constructs a logical execution plan, represented as a Directed Acyclic Graph (DAG), which captures the computation flow and dependencies before...


Spark Optimizations: A Technical Guide to .persist()

The .persist() method in Apache Spark is used to store intermediate data so that Spark doesn’t have to recompute it every time. This can make your jobs run much faster when the same data is used in multiple actions. Without .persist() Every action (e.g., count(),...
