
My Learning Space

Space to take notes, learn and share.

Writing data with PySpark: A Visual Guide

PySpark uses the DataFrameWriter to control how data is saved. To write data, you start with the .write attribute of a DataFrame, which returns a DataFrameWriter for configuring the save. Basic Approach to...


Reading data in PySpark: A Visual Guide

PySpark makes it easy to load data from different sources into DataFrames. At first, the process can seem a little overwhelming, but this guide is designed as a visual walkthrough to simplify the concepts. The data...


Apache Spark Execution Flow: A Visual Guide

When a Spark application is submitted, it does not execute statements sequentially. Instead, Spark constructs a logical execution plan, represented as a Directed Acyclic Graph (DAG), which captures the computation flow and dependencies before...


Spark Optimizations: Technical guide to .persist()

The .persist() method in Apache Spark is used to store intermediate data so that Spark doesn’t have to recompute it every time. This can make your jobs run much faster when the same data is used in multiple actions. Without .persist(): Every action (e.g., count(),...


Vectorisation

A vector in the context of NLP is an ordered array of numbers, one per dimension, that represents linguistic units such as words, characters, sentences, or documents. Motivation for Vectorisation: Machine learning algorithms require numerical inputs rather than raw text....
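To make the idea of turning text into numbers concrete, here is a minimal sketch using a simple bag-of-words count. This is only one of many possible vectorisation schemes and is not necessarily the one the full post covers; the function names `build_vocabulary` and `vectorise` and the two sample sentences are made up for illustration.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct lowercase token to a fixed column index."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def vectorise(document, vocabulary):
    """Return a count vector; tokens outside the vocabulary are ignored."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


# Hypothetical sample documents, split on whitespace only.
docs = ["the cat sat", "the cat saw the dog"]
vocab = build_vocabulary(docs)
vectors = [vectorise(d, vocab) for d in docs]

print(vocab)    # token -> column index
print(vectors)  # one count vector per document
```

Each document becomes a list with one slot per vocabulary word, so two texts of different lengths end up as vectors of the same size that a model can consume.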
