My Learning Space
A space to take notes, learn, and share.
Writing data with PySpark: A Visual Guide
PySpark uses the DataFrameWriter to manage how data is saved. To write data, you start with the .write attribute of a DataFrame, which returns a DataFrameWriter for configuring the save process. Basic Approach to...
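A minimal sketch of that save flow, assuming a local SparkSession; the column names, format, and output path are illustrative rather than taken from the post:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-example").getOrCreate()

df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45)],
    ["name", "age"],
)

# .write returns a DataFrameWriter; chain mode, format, and options, then save.
(
    df.write
      .mode("overwrite")            # replace any existing data at the target path
      .format("parquet")            # storage format; csv/json/orc also work
      .save("/tmp/people_parquet")  # hypothetical output location
)
```

The mode() call controls behaviour when the target already exists ("overwrite", "append", "ignore", or "error").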
Reading data in PySpark: A Visual Guide
PySpark makes it easy to load data from different sources into DataFrames. At first, the process can seem a little overwhelming, but this guide is designed as a visual walkthrough to simplify the concepts. The data...
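A minimal sketch of loading a file into a DataFrame, assuming a local SparkSession; the CSV options and input path are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-example").getOrCreate()

# spark.read returns a DataFrameReader; chain format-specific options, then load.
df = (
    spark.read
         .option("header", "true")       # first line contains column names
         .option("inferSchema", "true")  # let Spark guess column types
         .csv("/tmp/people.csv")         # hypothetical input file
)

df.printSchema()
df.show(5)
```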
Apache Spark Execution Flow: A Visual Guide
When a Spark application is submitted, it does not execute statements sequentially. Instead, Spark constructs a logical execution plan, represented as a Directed Acyclic Graph (DAG), which captures the computation flow and dependencies before...
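A small sketch of that behaviour, assuming a local SparkSession and toy data: the transformations only extend the plan, explain() prints the plan Spark has built, and nothing runs until an action such as collect() is called:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dag-example").getOrCreate()

df = spark.createDataFrame([(i, i % 3) for i in range(100)], ["value", "bucket"])

# These transformations only extend the logical plan; nothing executes yet.
result = (
    df.filter(F.col("value") > 10)
      .groupBy("bucket")
      .agg(F.sum("value").alias("total"))
)

# Inspect the logical/physical plan Spark has constructed.
result.explain()

# Only an action triggers execution of the DAG.
print(result.collect())
```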
Spark Optimizations: A Technical Guide to .persist()
The .persist() method in Apache Spark stores intermediate results so that Spark does not have to recompute them for every action. This can make jobs significantly faster when the same data is reused across multiple actions. Without .persist(), every action (e.g., count(),...
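A minimal sketch of the pattern, assuming a local SparkSession; the filter expression and storage level are illustrative:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-example").getOrCreate()

df = spark.range(1_000_000).withColumnRenamed("id", "value")

# Mark the filtered data to be kept after it is first computed.
filtered = df.filter("value % 7 = 0").persist(StorageLevel.MEMORY_AND_DISK)

# First action computes the lineage and materialises the cached blocks.
print(filtered.count())

# Second action reuses the persisted data instead of recomputing the filter.
print(filtered.agg({"value": "max"}).collect())

# Release the cached blocks when they are no longer needed.
filtered.unpersist()
```

MEMORY_AND_DISK spills partitions that do not fit in memory to disk; .cache() is shorthand for persisting with the default storage level.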
MapReduce: The Fundamental Big Data Algorithm behind Hadoop and Spark
MapReduce is a fundamental algorithmic model used in distributed computing to process and generate large datasets efficiently. It was popularized by Google and later adopted by the open-source community through Hadoop. The model simplifies parallel processing...
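A single-machine sketch of the word-count example usually used to introduce the model; real Hadoop or Spark jobs run the same map, shuffle, and reduce phases in parallel across a cluster:

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit (key, value) pairs -- here (word, 1) for every word."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle_phase(mapped_pairs):
    """Shuffle: group all values by key so each reducer sees one key's values."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: aggregate the values for each key -- here, sum the counts."""
    return {key: sum(values) for key, values in grouped.items()}

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]
mapped = (pair for doc in documents for pair in map_phase(doc))
print(reduce_phase(shuffle_phase(mapped)))
# {'the': 3, 'quick': 2, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 2}
```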
Vectorisation
A vector in the context of NLP is a multi-dimensional array of numbers that represents linguistic units such as words, characters, sentences, or documents. Motivation for Vectorisation: machine learning algorithms require numerical inputs rather than raw text...
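A minimal bag-of-words sketch of the idea, with made-up sentences; production pipelines typically rely on libraries such as scikit-learn or learned embeddings, but the principle is the same:

```python
sentences = ["spark processes big data", "spark caches data in memory"]

# Build the vocabulary: one dimension per unique word across the corpus.
vocabulary = sorted({word for s in sentences for word in s.split()})

def vectorise(sentence, vocab):
    """Turn raw text into a numeric vector that a model can consume."""
    words = sentence.split()
    return [words.count(word) for word in vocab]

for s in sentences:
    print(s, "->", vectorise(s, vocabulary))
```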