
Creating a data pipeline with dbt & Snowflake

In this article we will follow along the complete dbt (Data Build Tool) bootcamp to create an end-to-end project pipeline using dbt and Snowflake. We begin by outlining the steps we will perform in this hands-on project.

1. Loading Data from AWS to Snowflake:

  • Data from S3 is loaded into Snowflake using the COPY INTO command.
  • The raw data is stored in a dedicated schema (Airbnb.RAW).

2. Configuring Python & DBT:

  • Set up a virtual environment and install dbt-snowflake.
  • Create a user in Snowflake with appropriate permissions.
  • Initialize a dbt project and connect it to Snowflake.

3. Creating Staging Layer Models:

  • Models in dbt are SQL files representing transformations.
  • Staging models transform raw data into a cleaner, more structured format.
  • Use dbt commands to build and verify models.

4. Creating Core Layer Models & Materializations:

  • Core models apply further transformations and materializations.
  • Understand different materializations like views, tables, incremental, and ephemeral.
  • Create core models with appropriate materializations and verify them in Snowflake.

5. Creating DBT Tests:

  • Implement generic and custom tests to ensure data quality.
  • Use source freshness tests to monitor data timeliness.

6. Creating DBT Documentation:

  • Document models using schema.yml files and markdown files.
  • Generate and serve documentation to provide a clear understanding of data models and transformations.

The complete sources and code for this article can be found in the original course here.

Let's begin the hands-on implementation step by step.

1. Loading data from AWS cloud to Snowflake

S3 bucket –> Snowflake Platform –> Db.Schema (Airbnb.Raw)

Create the RAW schema in Snowflake

In order to get the data ready and loaded into Snowflake, we will do a few steps (a minimal SQL sketch follows the list).

  1. Create a database named AIRBNB in our Snowflake instance.
  2. Create a schema named RAW. The fully qualified name will then be AIRBNB.RAW.
  3. Create tables within the schema AIRBNB.RAW to load the data into.
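
The SQL below is a minimal sketch of these three steps, shown for the listings table only (raw_reviews and raw_hosts follow the same pattern). The column data types are assumptions based on the CSV contents rather than taken from the course material.

-- Create the database and the RAW schema
CREATE DATABASE IF NOT EXISTS AIRBNB;
CREATE SCHEMA IF NOT EXISTS AIRBNB.RAW;

-- Example target table for the listings data (types are assumed)
CREATE TABLE IF NOT EXISTS AIRBNB.RAW.raw_listings (
    id INTEGER,
    listing_url STRING,
    name STRING,
    room_type STRING,
    minimum_nights INTEGER,
    host_id INTEGER,
    price STRING,            -- loaded as text; parsed to a number later in dbt
    created_at DATETIME,
    updated_at DATETIME
);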

We will create the following three tables and then populate the data from S3 using Snowflake's COPY INTO command.

Target table               Source file
AIRBNB.RAW.raw_listings    s3://dbtlearn/listings.csv
AIRBNB.RAW.raw_reviews     s3://dbtlearn/reviews.csv
AIRBNB.RAW.raw_hosts       s3://dbtlearn/hosts.csv

The data is loaded with the following statements:

COPY INTO raw_listings (id, listing_url, name, room_type, minimum_nights, host_id, price, created_at, updated_at)
FROM 's3://dbtlearn/listings.csv'
FILE_FORMAT = (type = 'CSV' skip_header = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"');

COPY INTO raw_reviews (listing_id, date, reviewer_name, comments, sentiment)
FROM 's3://dbtlearn/reviews.csv'
FILE_FORMAT = (type = 'CSV' skip_header = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"');

COPY INTO raw_hosts (id, name, is_superhost, created_at, updated_at)
FROM 's3://dbtlearn/hosts.csv'
FILE_FORMAT = (type = 'CSV' skip_header = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"');
Loading raw data from AWS cloud to Snowflake

We have now sourced the raw data that we want to build our data pipeline on.

2. Configuring Python & DBT

The next step is to create a temporary staging area that stores the data without transformations. Since downstream processing of data can put considerable load on the systems, this staging area helps decouple that load. It also helps for auditing purposes.

The flow for this step will be :

Flow

S3 bucket –> Snowflake Platform –> Db.Schema (Airbnb.RAW) –> [dbt] –>Db.Schema(Airbnb.DEV)

In the first step, we created 3 tables in the RAW schema manually and copied data into them. In this step, we will use dbt to create the staging area.

Setting up Python and a Virtual Environment

The first step is to set up Python and a virtual environment. You can use the global environment as well; however, a virtual environment is preferred. I am using Visual Studio Code for this tutorial and assuming that you can configure VS Code with a virtual environment, so I will not go into that detail. If you need help setting it up, please follow this guide for setup.

After setting up the virtual environment, please install dbt-snowflake. dbt has adapters (packages) for several data warehouses. Since we are using Snowflake in this example, we will use the following package:

pip install dbt-snowflake

Creating user in Snowflake

To connect to Snowflake, we create a user named dbt in Snowflake, grant it a role called transform, and grant "ALL" privileges on the AIRBNB database. (This is by no means a standard or recommended practice; we grant blanket privileges only for the purpose of this project.) We use the following SQL in Snowflake to create the user and grant privileges.

-- Create the `transform` role and the `dbt` user, and assign the user to the role
CREATE ROLE IF NOT EXISTS transform;
CREATE USER IF NOT EXISTS dbt
  PASSWORD='dbtPassword123'
  LOGIN_NAME='dbt'
  MUST_CHANGE_PASSWORD=FALSE
  DEFAULT_WAREHOUSE='COMPUTE_WH'
  DEFAULT_ROLE='transform'
  DEFAULT_NAMESPACE='AIRBNB.RAW'
  COMMENT='DBT user used for data transformation';
GRANT ROLE transform TO USER dbt;


-- Set up permissions to role `transform`
GRANT ALL ON WAREHOUSE COMPUTE_WH TO ROLE transform; 
GRANT ALL ON DATABASE AIRBNB to ROLE transform;
GRANT ALL ON ALL SCHEMAS IN DATABASE AIRBNB to ROLE transform;
GRANT ALL ON FUTURE SCHEMAS IN DATABASE AIRBNB to ROLE transform;
GRANT ALL ON ALL TABLES IN SCHEMA AIRBNB.RAW to ROLE transform;
GRANT ALL ON FUTURE TABLES IN SCHEMA AIRBNB.RAW to ROLE transform;

Now that we have the credentials for Snowflake, we can provide this information to dbt for connection.

Creating DBT Project

Once dbt-snowflake and its required dependencies are installed, you can proceed with setting up the dbt project. Inside the virtual environment, initiate the dbt Core project with dbt init [project name]:

dbt init mydbt

In my case, I am building this dbt project within a DBTCOURSE folder and naming my dbt Core project mydbt, so my folder structure will be:

DBTCOURSE -> mydbt -> all dbt folders and files

Since it is the first time we are running dbt init, it will ask for the name of the project, the database we are connecting to, and the authentication credentials for the platform (Snowflake).

Connecting DBT to Snowflake

After the project is initiated, dbt will ask for user access credentials to connect to Snowflake with appropriate permissions on the database and schemas we want to work on. Given below is a summary of the information the command prompt will ask for to connect.

Prompt             Value
platform (number)  [1] Snowflake
account            Your Snowflake account identifier
user               The user we created to connect to Snowflake (dbt in this case)
password           The user's password
role               The role assigned to the above user, which dbt will open the connection with (transform in this case)
database           The database we want our connection established with (AIRBNB in this case)
schema             The default schema where dbt will build all models (DEV in this case)

At times, finding Snowflake's account identifier can be tricky. You can find it from Snowflake > Admin > Account > hover over the account row.
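
Behind the scenes, dbt stores these answers in a profiles.yml file (by default under ~/.dbt/). The snippet below is a sketch of what the generated profile for this project roughly looks like; the account value is a placeholder for your own identifier.

mydbt:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: <your_account_identifier>
      user: dbt
      password: dbtPassword123
      role: transform
      database: AIRBNB
      warehouse: COMPUTE_WH
      schema: DEV
      threads: 1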

Verifying DBT connection

We can verify that our dbt project has been configured properly and that it is able to connect to Snowflake using the dbt debug command:

dbt debug

The verbose output is quite long; however, I have provided screenshots of only the key checks which enable us to start working with our project.

3. Creating data models in Staging Layer with DBT

Concept : Models in DBT

In dbt (Data Build Tool), models are SQL files that contain SELECT statements. These models define transformations on your raw data and are the core building blocks of dbt projects. Each model represents a transformation step that takes raw or intermediate data, applies some logic, and outputs the transformed data as a table or view in your data warehouse. Models promote modularity by breaking down complex transformations into simpler, reusable parts. This makes the transformation logic easier to manage and understand.

We will explore how models in dbt function within a dbt project to build and manage a data warehouse pipeline. An overview of the key model characteristics and functions we will look at:

  1. Data Transformation: Models allow you to transform raw data into meaningful, structured formats. This includes cleaning, filtering, aggregating, and joining data from various sources.
  2. Incremental Processing: Models can be configured to run incrementally, which means they only process new or updated data since the last run. This improves efficiency and performance.
  3. Materializations:
    • Models in dbt can be materialized in different ways:
      • Views: Create virtual tables that are computed on the fly when queried.
      • Tables: Persist the transformed data as physical tables in the data warehouse.
      • Incremental Tables: Only update rows that have changed or been added since the last run.
      • Ephemeral: Used for subqueries that should not be materialized but instead embedded directly into downstream models.
  4. Testing: dbt allows us to write tests for our models to ensure data quality and integrity. These can include uniqueness, non-null, and custom tests that check for specific conditions.
  5. Documentation: Models can be documented within dbt, providing descriptions and context for each transformation. This helps in understanding the data lineage and makes the project more maintainable.
  6. Dependencies and DAGs: Models can reference other models using the ref function, creating dependencies between them. dbt automatically builds a Directed Acyclic Graph (DAG) to manage these dependencies and determine the order of execution.
  7. Version Control: Because dbt models are just SQL files, they can be version controlled using Git or any other version control system, enabling collaboration and change tracking.

Concept: Staging Layer

The staging layer is an abstraction used to denote the purpose of data models in enterprise data warehouses. The concept of a staging layer is prevalent in both the Kimball and Inmon methodologies for data modeling. For dbt purposes, we will use a folder src under models to create the staging layer. Using the staging layer we decouple the source data from further processing. In the staging layer we create 3 simple models under models/src/.

#  Model                     Transformation       Materialization
1  AIRBNB.DEV.src_listings   Column name changes  View
2  AIRBNB.DEV.src_reviews    Column name changes  View
3  AIRBNB.DEV.src_hosts      Column name changes  View
Models in staging layer

Creating staging layer models

We create a new file src_listings.sql and use the following SQL to create our first model:

WITH raw_listings AS (
    SELECT *
    FROM AIRBNB.RAW.RAW_LISTINGS
)
SELECT
    id AS listing_id,
    name AS listing_name,
    listing_url,
    room_type,
    minimum_nights,
    host_id,
    price AS price_str,
    created_at,
    updated_at
FROM raw_listings

We repeat the same for src_reviews and src_hosts

-- models/src/src_reviews.sql
WITH raw_reviews AS (
    SELECT *
    FROM AIRBNB.RAW.RAW_REVIEWS
)
SELECT
    listing_id,
    date AS review_date,
    reviewer_name,
    comments AS review_text,
    sentiment AS review_sentiment
FROM raw_reviews

-- models/src/src_hosts.sql
WITH raw_hosts AS (
    SELECT *
    FROM AIRBNB.RAW.RAW_HOSTS
)
SELECT
    id AS host_id,
    NAME AS host_name,
    is_superhost,
    created_at,
    updated_at
FROM raw_hosts

Our file structure after creating the 3 models will be:

Finally, we can now run the dbt run command to build our first 3 models.

dbt run

It will start building the models and show any errors that occur. In case of a successful run, it will confirm on the prompt.

To confirm the models in our Snowflake database, we can now navigate to Snowflake and check whether our new models appear there.

As you can see, dbt has successfully created 3 new models under the AIRBNB.DEV schema, thus completing our staging layer.

4. Creating core layer models & materializations

In this step, we will create the "Core" layer with four models. We will also explore materializations in dbt and choose which materialization we want each model to have. A summary of the models we will create in this step, alongside their materialization choice:

#  Model in Core Layer                 Transformation  Materialization
1  AIRBNB.DEV.dim_listings_cleansed    Cleaning        View
2  AIRBNB.DEV.dim_hosts_cleansed       Cleaning        View
3  AIRBNB.DEV.dim_listings_with_hosts  Join & Clean    Table
4  AIRBNB.DEV.fct_reviews              –               Incremental
Models in core layer

Flow Update

S3 bucket –> Snowflake Platform –> Db.Schema (Airbnb.RAW) –> [dbt] –>Db.Schema(Airbnb.DEV)–>[dbt]–>Db.Schema.Tables(Airbnb.DEV.DIM_*)

Concept: Materializations in DBT

dbt supports four types of materialization. We can specify the materialization of a model either in the model file (.sql) or within the dbt_project.yml file. For our exercise, we will use both approaches to get familiar with them.

A comparative view of materializations supported by dbt and when to use them is as follows:

View
  • Description: Creates a virtual table that is computed on the fly when queried.
  • Use case: When you need to frequently refresh data without the need for storage.
  • Pros: No storage costs. Always shows the latest data. Quick to set up.
  • Cons: Slower query performance for large datasets. Depends on the underlying data's performance.

Table
  • Description: Creates a physical table that stores the results of the transformation.
  • Use case: When you need fast query performance and can afford to periodically refresh the data.
  • Pros: Fast query performance. Data is stored and doesn't need to be recomputed each time.
  • Cons: Higher storage costs. Requires periodic refreshing to keep data up to date.

Incremental
  • Description: Updates only the new or changed rows since the last run.
  • Use case: When dealing with large datasets where only a subset of the data changes frequently.
  • Pros: Efficient processing by updating only changed data. Reduces processing time and costs.
  • Cons: More complex setup. Requires careful handling of change-detection logic.

Ephemeral
  • Description: Creates temporary subqueries embedded directly into downstream models.
  • Use case: For intermediate transformations that don't need to be materialized.
  • Pros: No storage costs. Simplifies complex transformations by breaking them into manageable parts.
  • Cons: Can lead to complex and slow queries if overused. Not materialized, so each downstream query must recompute the subquery.
Materializations in dbt

Creating Core Layer models

We will apply transformations to our staging layer models and create the core layer models discussed above. Our project structure after creating these models will change as shown below.

dim_hosts_cleansed

{{
  config(
    materialized = 'view'
  )
}}

WITH src_hosts AS (
    SELECT *
    FROM {{ ref('src_hosts') }}
)
SELECT
    host_id,
    NVL(host_name, 'Anonymous') AS host_name,
    is_superhost,
    created_at,
    updated_at
FROM src_hosts

dim_listings_cleansed

WITH src_listings AS (
    SELECT *
    FROM {{ ref('src_listings') }}
)
SELECT
    listing_id,
    listing_name,
    room_type,
    CASE
        WHEN minimum_nights = 0 THEN 1
        ELSE minimum_nights
    END AS minimum_nights,
    host_id,
    REPLACE(price_str, '$') :: NUMBER(10, 2) AS price,
    created_at,
    updated_at
FROM src_listings

dim_listings_with_hosts

WITH
l AS (
SELECT
*
FROM
{{ ref('dim_listings_cleansed') }}
),
h AS (
SELECT *
FROM {{ ref('dim_hosts_cleansed') }}
)
SELECT
l.listing_id,
l.listing_name,
l.room_type,
l.minimum_nights,
l.price,
l.host_id,
h.host_name,
h.is_superhost as host_is_superhost,
l.created_at,
GREATEST(l.updated_at, h.updated_at) as updated_at
FROM l
LEFT JOIN h ON (h.host_id = l.host_id)

fct_reviews

{{
  config(
    materialized = 'incremental',
    on_schema_change = 'fail'
  )
}}
WITH src_reviews AS (
SELECT * FROM {{ ref('src_reviews') }}
)
SELECT * FROM src_reviews
WHERE review_text is not null
{% if is_incremental() %}
AND review_date > (select max(review_date) from {{ this }})
{% endif %}

In the models above, we have explicitly set the materialization configuration within the model files for fct_reviews and dim_hosts_cleansed. For the others, we use the dbt_project.yml file to specify the materialization.
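
As a sketch, the models: block in dbt_project.yml could look like the following, assuming the project is named mydbt and the configuration keys mirror the folder/file layout under models/ (adjust the nesting if your models sit in subfolders): views are the project-wide default, and dim_listings_with_hosts is overridden to a table.

models:
  mydbt:
    +materialized: view              # project-wide default
    dim_listings_with_hosts:
      +materialized: table           # override for the joined dimension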

With the above materialization specifications in the YAML file, we run dbt and get the following output.

We can check the same in the Snowflake interface, which will show us the new models created.

Adding sources to the models

Concept : Source in DBT

Sources in dbt are aliases for the actual tables. They are an additional abstraction over the underlying tables which makes it possible to name and describe the data loaded into the warehouse. Sources enable the following:

  • We can calculate the freshness of source data.
  • We can test our assumptions about source data.
  • We can select from source tables in our models using the {{ source() }} function, which helps define the lineage of the data.

Adding Sources

To add sources to our models, we create a file "sources.yml" (the filename is arbitrary). Once we have created the config file, we can go into the src_* files and replace the existing table names with our "sources".

Model         Table name without sources (key: identifier)  Table name with sources (key: name)
src_listings  SELECT * FROM AIRBNB.RAW.RAW_LISTINGS         SELECT * FROM {{ source('airbnb', 'listings') }}
src_hosts     SELECT * FROM AIRBNB.RAW.RAW_HOSTS            SELECT * FROM {{ source('airbnb', 'hosts') }}
src_reviews   SELECT * FROM AIRBNB.RAW.RAW_REVIEWS          SELECT * FROM {{ source('airbnb', 'reviews') }}
Replacing table names with sources

Our sources.yml file now looks like this

version: 2

sources:
  - name: airbnb
    schema: raw
    tables:
      - name: listings
        identifier: raw_listings

      - name: hosts
        identifier: raw_hosts

      - name: reviews
        identifier: raw_reviews

Source Freshness

A benefit of using "sources" in dbt is being able to monitor the freshness of data. Source freshness is a mechanism in dbt that monitors the timeliness and update frequency of data in our sources, and it can raise a "warning" or an "error" when data goes stale.

Here are the steps to configure source freshness.

  1. In our sources.yml file, we choose a date/time field that acts as the reference point for source monitoring (loaded_at_field).
  2. We specify a maximum allowable data age (as an interval, e.g. hours or days) before a warning or error is triggered (warn_after or error_after).
  3. We execute the dbt source freshness command to check the freshness of sources.
  4. If the data exceeds the freshness thresholds, dbt raises warnings or errors.

Let's configure source freshness on the reviews source.

      - name: reviews
        identifier: raw_reviews
        loaded_at_field: date
        freshness:
          warn_after: {count: 1, period: hour}
          error_after: {count: 24, period: hour}

Here is what the above YAML configuration does.

Parameter        Setting                     Description
loaded_at_field  date                        Use the "date" field to determine when data was last loaded
warn_after       {count: 1, period: hour}    Issues a warning if the data in the date field is older than 1 hour
error_after      {count: 24, period: hour}   Issues an error if the data in the date field is older than 24 hours
source freshness configuration

Now when we go to the command prompt and run:

dbt source freshness

We get the following output

Creating tests in DBT

Concept: DBT tests

dbt provides a comprehensive test framework to ensure data quality and reliability. Here's a summary of the tests that dbt supports.

Generic tests (built-in)
  • Description: Pre-built, out-of-the-box tests for common data issues such as uniqueness, not-null values, and referential integrity.
  • Usage: unique, not_null, relationships, accepted_values
  • Implementation: schema.yml, using the tests: key under a model/column.

Custom generic tests
  • Description: Custom tests written by users that can be applied to any column or model, similar to built-in tests but with custom logic.
  • Usage: Custom reusable tests with user-defined SQL logic.
  • Implementation: 1. dbt automatically picks them up from tests/generic/test_name.sql, starting with a {% test %} block. 2. Defined in a .sql file and applied to models/columns in schema.yml. 3. macros: historically, this was the only place they could be defined.

Singular tests
  • Description: Designed to validate a single condition and not intended to be reusable, e.g. checking that total sales reported in a period match the sum of individual sales records.
  • Usage: Custom SQL conditions specific to a model or field.
  • Implementation: dbt automatically picks up tests from tests/test_name.sql.

Source freshness tests
  • Description: Tests that check the timeliness of the data in your sources by comparing the current timestamp with the last-updated timestamp.
  • Usage: dbt source freshness to monitor data update intervals.
  • Implementation: schema.yml (sources file), using the freshness: key under a source/table.
DBT testing framework

Implementing Built-In Tests

We will begin by writing generic tests. To keep this article readable, I have moved the testing to an article of its own; please continue with implementing built-in tests at this link.
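
To give a flavour of the built-in tests before jumping to the linked article, here is a minimal schema.yml sketch applying unique, not_null and accepted_values tests to dim_listings_cleansed; the exact columns and accepted values tested in the course may differ.

version: 2

models:
  - name: dim_listings_cleansed
    columns:
      - name: listing_id
        tests:
          - unique
          - not_null
      - name: room_type
        tests:
          - accepted_values:
              values: ['Entire home/apt', 'Private room', 'Shared room', 'Hotel room']

Running dbt test then executes every test defined in the project and reports passes and failures.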

Documenting models in DBT

In any data warehousing project, documentation acts as a blueprint and reference guide for the data structures. It should capture key information about the data models, essentially explaining what data is stored, how it is organised, and how it relates to other data. dbt solves the problem of documenting models by providing a documentation framework within the tool itself.

Writing documentations for our Models

dbt provides two methods for writing documentation. We can either write the documentation within the schema.yml file as text, or write it in separate markdown files and link them back to the schema.yml file. We will explore both options. Please follow this link to see hands-on documentation in dbt.
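
As a minimal sketch of the two approaches (the linked post walks through them in full), a description can either be written inline in schema.yml or kept in a markdown file as a docs block and referenced with the doc() function:

models:
  - name: dim_listings_cleansed
    description: Cleansed table which contains Airbnb listings.
    columns:
      - name: listing_id
        description: '{{ doc("listing_id_doc") }}'

The referenced docs block lives in a markdown file in the project, for example a docs.md under models/:

{% docs listing_id_doc %}
Primary key for the listing, generated by Airbnb.
{% enddocs %}

The documentation site is then built and browsed locally with dbt docs generate followed by dbt docs serve.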

Conclusion

In this article, we created an end-to-end project pipeline using dbt and Snowflake. dbt makes it easier for data engineers to effectively build and manage reliable and scalable data pipelines.

We started by loading data from S3 into Snowflake using the COPY INTO command, storing the raw data in a dedicated schema (Airbnb.RAW). Next, we configured Python and dbt: setting up a virtual environment, installing dbt-snowflake, creating a user in Snowflake with appropriate permissions, and initializing a dbt project connected to Snowflake. We then created staging layer models, which are SQL files representing transformations, to clean and structure the raw data, and used dbt commands to build and verify these models.

Moving on to the core layer, we applied further transformations and materializations, exploring the different types: views, tables, incremental, and ephemeral. We created core models with appropriate materializations and verified them in Snowflake. To ensure data quality, we implemented generic and custom tests, as well as source freshness tests to monitor data timeliness. Lastly, we documented our models using schema.yml files and markdown files, generating and serving the documentation to provide a clear understanding of the data models and transformations.

By following these steps, data engineers can leverage dbt to build scalable and maintainable data pipelines, ensuring data integrity and ease of use for downstream analytics.

Architectural patterns for managing slowly changing dimensions

What are SCDs ?

Slowly Changing Dimensions (SCDs) are an approach in data warehousing used to manage and track changes to dimensions over time. They play an important role in data modeling, especially in the context of a data warehouse, where maintaining the historical accuracy of data over time is essential. The term "slowly" in SCDs refers to the rate of change and the method of handling these changes. For example, changes to a customer's address or marital status happen infrequently, making these "slowly" changing dimensions. This is in contrast to "fast-changing" scenarios where data elements like stock prices or inventory levels might update several times a day or even minute by minute.

Types of SCDs

In the context of data modeling, there are three commonly used types of slowly changing dimensions. Choosing the right type of SCD depends on business requirements. Given below is a brief overview of the most frequently used types of SCDs in data modeling.

Type 1 SCD (No History)

Overwrites old data with new data. It’s used when the history of changes isn’t necessary. For example, correcting a misspelled last name of a customer.

Scenario:

A customer changes their email address.

Before the Change:

Customer ID  Name          Email
001          Manan Younas  myemail@oldmail.com

After the Change:

The old email is overwritten with the new one in the same record.

Customer ID  Name          Email
001          Manan Younas  myemail@newmail.com

Explanation:

In this table, Manan's email address is updated directly in the database, replacing the old email with the new one. No historical record of the old email is maintained.
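
In SQL terms, a Type 1 change is a plain in-place update. The sketch below assumes a hypothetical customers table matching the columns above.

-- Type 1: overwrite the old value; no history is kept
UPDATE customers
SET    email = 'myemail@newmail.com'
WHERE  customer_id = '001';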


Type 2 SCD (Full History)

Adds new records for changes, keeping historical versions. It’s crucial for auditing purposes and when analyzing the historical context of data, like tracking price changes over time.

Scenario:

A customer changes their subscription plan.

Before the Change:

Originally, the customer is subscribed to the “Basic” plan.

Customer ID  Name          Subscription Plan  Start Date  End Date  Current
001          Manan Younas  Basic              2023-01-01  NULL      Yes

After the Change:

The customer upgrades to the “Premium” plan on 2023-06-01.

  1. Update the existing record to set the end date and change the “Current” flag to “No.”
  2. Add a new record with the new subscription plan, starting on the date of the change.

Customer ID  Name          Subscription Plan  Start Date  End Date    Current
001          Manan Younas  Basic              2023-01-01  2023-06-01  No
001          Manan Younas  Premium            2023-06-01  NULL        Yes

Explanation:

Before the Change: The table shows Manan Younas on the "Basic" plan, starting from January 1, 2023. The "End Date" is NULL, indicating that this record is currently active.

After the Change: Two changes are made to manage the subscription upgrade:

  1. The original "Basic" plan record is updated with an "End Date" of June 1, 2023, and the "Current" status is set to "No," marking it as historical.
  2. A new record for the "Premium" plan is added with a "Start Date" of June 1, 2023. This record is marked as "Current" with a NULL "End Date," indicating it is the active record.

This method of handling SCDs is beneficial for businesses that need to track changes over time for compliance, reporting, or analytical purposes, providing a clear and traceable history of changes.
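
A sketch of the same change in SQL, assuming a hypothetical customer_subscriptions table with the columns shown above; in practice the two statements would run in a single transaction or be handled by warehouse tooling (dbt snapshots, for example, implement exactly this Type 2 pattern).

-- Type 2: close the current record, then insert the new version
UPDATE customer_subscriptions
SET    end_date = '2023-06-01',
       current_flag = 'No'
WHERE  customer_id = '001'
  AND  current_flag = 'Yes';

INSERT INTO customer_subscriptions
    (customer_id, name, subscription_plan, start_date, end_date, current_flag)
VALUES
    ('001', 'Manan Younas', 'Premium', '2023-06-01', NULL, 'Yes');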


Type 3 SCD (Limited History)

Type 3 SCDs add new columns to store both the current and at least one previous value of the attribute, which is useful for tracking limited history without the need for multiple records. It is less commonly used, but useful when only the most recent change is relevant.

Scenario:

A customer moves from one city to another.

Before the Move:

Initially, only current information is tracked.

Customer ID  Name          Current City
001          Manan Younas  Sydney

After the Move:

A new column is added to keep the previous city alongside the current city.

Customer ID  Name          Current City  Previous City
001          Manan Younas  Melbourne     Sydney

Explanation:

In this table, when Manan moves from Sydney to Melbourne, the “Current City” column is updated with the new city, and “Previous City” is added to record his last known location. This allows for tracking the most recent change without creating a new record.
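
A sketch in SQL, again assuming a hypothetical customers table: the extra column is added once, and each subsequent move shifts the current value into it.

-- Type 3: keep only the most recent previous value in an extra column
ALTER TABLE customers ADD COLUMN previous_city VARCHAR(100);

UPDATE customers
SET    previous_city = current_city,
       current_city  = 'Melbourne'
WHERE  customer_id = '001';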

These examples illustrate the methods by which Type 1, Type 2, and Type 3 SCDs manage data changes. Type 1 SCDs are simpler and focus on the most current data, discarding historical changes. Type 3 SCDs, meanwhile, provide a way to view a snapshot of both current and previous data states without maintaining the full historical record that Type 2 SCDs do.

Architectural considerations for Managing SCDs

The management of Slowly Changing Dimensions (SCDs) in data warehousing requires careful architectural planning to ensure data accuracy, completeness, and relevance. Let's discuss the implementation considerations for each type of SCD and the architectural setups required to support these patterns effectively.

Type 1 Implementation consideration

Scenarios Where Most Effective

Type 1 SCDs are most effective in scenarios where historical data is not needed for analysis and only the current state is relevant. Common use cases include:

  • Correcting data errors in attributes, such as fixing misspelled names or incorrect product attributes.
  • Situations where the business does not require tracking of historical changes, such as current status updates or latest measurements.

Architectural Setup

Database Design: A simple design where each record holds the latest state of the data. Historical data tracking mechanisms are not needed.

Data Update Mechanism: The implementation requires a straightforward update mechanism where old values are directly overwritten by new ones without the need for additional fields or complex queries.

Performance Considerations: Since this pattern only involves updating existing records, it typically has minimal impact on performance and does not significantly increase storage requirements.

Type 2 Implementation consideration

Scenarios Where Most Effective

Type 2 SCDs are crucial when the full history of changes must be preserved for compliance, reporting, or analytical purposes. They are widely used in:

  • Customer information management, where it is necessary to track address history, status changes, or subscription details.
  • Product information tracking, where changes over time can provide insights into lifecycle management and evolution.

Architectural Setup

Database Design: Requires a more complex setup with additional fields for managing historical data, such as start date, end date, and a current flag.

Data Management Strategy: Insertion of new records for each change, while updating or closing previous records to indicate they are no longer current. This setup can be managed through triggers or application logic.

Versioning and Timestamping: Implementation of version control and timestamping to ensure each change is accurately recorded with its validity period.

Performance and Storage Considerations: Type 2 can significantly increase the volume of data stored, which may impact performance. Indexing on key fields and partitioning historical data can help optimize query performance.

Type 3 Implementation Pattern

Scenarios Where Most Effective:

Type 3 SCDs are used when tracking a limited history is sufficient. This can be applicable in cases like:

  • Tracking a previous address alongside the current one for a short-term promotional campaign.
  • Monitoring recent changes in terms and conditions for services, where only the most recent previous state is relevant.

Architectural Setup:

Database Design: Includes additional columns to store both the current and previous values of the tracked attributes. This setup is simpler than Type 2 but more complex than Type 1.

Data Update Mechanism: Updates involve changing multiple fields within a single record—both updating the previous value fields and writing new current values.

Performance Considerations: This method increases the size of each record but does not increase the total number of records as Type 2 does. Performance impacts are generally moderate but can involve more complex queries than Type 1.

Conclusion

Each type of SCD requires a different architectural approach based on the business requirements for historical data. While Type 1 is simpler and less resource-intensive, Types 2 and 3 provide data tracking capabilities at the cost of additional complexity and potentially higher resource requirements. Properly choosing and implementing these patterns will depend on the specific needs of the business, the criticality of historical data, and the performance impacts on the data warehouse system.

Association Rule Learning in Marketing Analytics

Association Rule Learning is used to find relationships between items or events in large data sets. The primary goal of Association Rule Learning is to identify frequently occurring patterns in data to reveal hidden relationships.

How Association Rule Learning works?

Association rules are typically represented in the form "If {A} then {B}", where A and B are sets of items or events. The strength of an association rule is usually measured by three key metrics, written out formally after the list below:

  1. Support: The proportion of transactions in the dataset that contain both A and B. A high support value indicates that the rule occurs frequently in the data.
  2. Confidence: The probability of B occurring given that A has occurred. A high confidence value means that if A is present, there is a high likelihood of B also being present.
  3. Lift: The ratio of the observed support to the expected support if A and B were independent events. A lift value greater than 1 indicates that A and B occur together more frequently than would be expected if they were unrelated.
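
Written out for a rule "If {A} then {B}" over a set of transactions:

Support(A ⇒ B) = (transactions containing both A and B) ÷ (total transactions)
Confidence(A ⇒ B) = Support(A ⇒ B) ÷ Support(A)
Lift(A ⇒ B) = Confidence(A ⇒ B) ÷ Support(B)

For example, if 10 of 100 transactions contain both A and B, 20 contain A, and 25 contain B, then support = 0.10, confidence = 0.10 / 0.20 = 0.50, and lift = 0.50 / 0.25 = 2.0, meaning A and B occur together twice as often as independence would predict.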

Use cases in Marketing Analytics

There are several applications of Association Rule Learning in the context of marketing analytics. Listed below are common use cases, their examples, and a very brief summary of the association rules used in them.

# Category Use Case Example Association Rule
1 Cross-channel Marketing Channel strategy optimization Identifying effective combinations of marketing channels for engagement or conversions If {Email, Social_Media, Paid_Search} then {High_Conversions}
2 Email Marketing Content and offer optimization Identifying effective content and discount code combinations If {Product_A_Recommendation} then {Discount_Code_X}
3 Content Marketing Content strategy optimization Identifying popular combinations of blog topics to engage users If {Blog_Topic_A, Blog_Topic_B} then {High_Engagement}
4 Social Media Marketing Social media strategy optimization Finding effective combinations of social media posts, hashtags, or influencers If {Post_Type_A, Hashtag_X} then {High_Engagement}
5 Search Engine Optimization (SEO) SEO strategy optimization Identifying effective combinations of keywords and content types for organic traffic If {Keyword_A, Content_Type_B} then {High_Organic_Traffic}
6 Landing Page Optimization Conversion optimization Analyzing the effectiveness of different landing page elements If {Image_A, Headline_B, CTA_C} then {High_Conversions}
7 Affiliate Marketing Affiliate strategy optimization Identifying the most effective combinations of affiliate offers and traffic sources If {Affiliate_Offer_A, Traffic_Source_B} then {High_Conversions}
8 Online Advertising Ad strategy optimization Analyzing the impact of different ad creatives and targeting options If {Ad_Creative_A, Targeting_B} then {High_Clicks}
9 Customer Segmentation Targeted marketing optimization Analyzing associations between customer attributes and marketing responsiveness If {Demographics_A, Browsing_Behavior_B} then {High_Engagement}
10 Product Recommendations Personalization Analyzing associations between products in an online store to inform recommendations If {Product_A, Product_B} then {High_Likelihood_of_Purchase}

Algorithms for Association Rule Learning

To get an idea of various algorithms based on Association Rule Learning, given below is a summary of such algorithms developed over the last 3 decades.

Association Rule Learning – Algorithms Summary
Algorithm # What it does Developed by Year of development Total Citations
Apriori 1 Mines frequent itemsets using a breadth-first search Rakesh Agrawal, Tomasz Imieliński, Arun Swami 1993 ~22,000
Eclat 2 Mines frequent itemsets using a depth-first search Mohammed J. Zaki 1997 ~3,000
FP-Growth 3 Mines frequent itemsets without candidate generation Jiawei Han, Jian Pei, Yiwen Yin 2000 ~12,000
H-mine 4 Improves upon FP-growth using a hyper-structure approach J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, D. Yang 2001 ~800
RElim 5 Eliminates items recursively to mine frequent itemsets Christian Borgelt 2004 ~200
LCM 6 Mines closed frequent itemsets in linear time Takeaki Uno, Tatsuya Asai, Hiroki Arimura 2004 ~400
FARMER 7 Uses a matrix-based data structure to mine frequent itemsets Roberto J. Bayardo Jr. 2004 ~100
OPUS Miner 8 Discovers the top-K association rules with the highest overall utility Geoffrey I. Webb, Shichao Zhang 2013 ~100

Classification model using AutoML in Vertex AI

What is Vertex AI?

Vertex AI is Google's platform for solving problems in the Artificial Intelligence domain. To put things into context, Microsoft provides the Azure Machine Learning platform for artificial intelligence problem solving, and Amazon has SageMaker for AI workloads.

Google’s Vertex AI supports two processes for model training.

  1. AutoML: This is the easy version. It lets you train models with low effort and little machine learning expertise. The downside is that the parameters you can tweak in this process are very limited.
  2. Custom training: This is free space for data science engineers to go wild with machine learning. You can train models using TensorFlow, scikit-learn, XGBoost, etc.

In this blog post, we will use AutoML to train a classification model, deploy it to a GCP endpoint, and then consume it using the Google Cloud Shell.

Supported data types

Image

  1. Image classification single-label
  2. Image classification multi-label
  3. Image object detection
  4. Image segmentation

Tabular

  1. Regression / classification
  2. Forecasting

Text

  1. Text classification single-label
  2. Text classification Multi-label
  3. Text entity extraction
  4. Text sentiment analysis

Video

  1. Video action recognition
  2. Video classification
  3. Video object tracking

Supported Data sources

You can upload data to Vertex AI from three sources:

  1. From a local computer
  2. From Google Cloud Storage
  3. From BigQuery

Training a model using AutoML

    Training a model in AutoML is straightforward. Once you have created your dataset, you can use a point-and-click interface to create a model.

    Training Method

    Model Details

    Training options

    Feature Selection

    Factor weightage

    You have  an option to weigh your factors equally.

    Optimization objective

    Optimization objective options vary for each workload type. In our case, we are doing classification, so the options shown are the optimization objectives relevant to a classification workload. For more details, this optimization objective documentation is very helpful.

    Compute and pricing

    Lastly, we have to select a budget in terms of how many node hours we want our model to train for. Google's Vertex AI pricing guide is helpful in understanding the pricing.

    Once you have completed these steps, your model will move into training mode. You can view the progress from the Training link in the navigation menu. Once training is finished, the model will appear in the Model Registry.

     

    Deploying the model

    Model deployment is done via Deploy and Test tab on the model page itself.

    1. Click on Deploy to End-point
    2. Select a machine type for deployment
    3. Click deploy.

    Consuming the model

    To consume the model, we need a few parameters. We can set these parameters as environment variables using the Google Cloud Shell (a sketch of the exports is shown after the table) and then invoke the model with ./smlproxy.

    Setting environment variables

    ENVIRONMENT VARIABLE VALUE
    AUTH_TOKEN Use the value from the previous section
    ENDPOINT https://sml-api-vertex-kjyo252taq-uc.a.run.app/vertex/predict/tabular_classification
    INPUT_DATA_FILE INPUT-JSON
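
    For example, the variables can be exported in Cloud Shell roughly as follows; the token value is whatever you generated in the previous section, and INPUT-JSON is the request file prepared for the model:

    export AUTH_TOKEN=<token-from-previous-section>
    export ENDPOINT=https://sml-api-vertex-kjyo252taq-uc.a.run.app/vertex/predict/tabular_classification
    export INPUT_DATA_FILE=INPUT-JSON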

    Getting predictions

    ./smlproxy tabular -a $AUTH_TOKEN -e $ENDPOINT -d $INPUT_DATA_FILE

     

    Practical Application of Logarithms

    What is Logarithm

    A logarithm is the inverse function of exponentiation. A simpler way of saying the same thing: a logarithm is the opposite of an exponent. For example:

    The exponent form 2^3 = 8 can be written in its (opposite) logarithm form as \log_2 8 = 3

    Let us describe the above mathematical forms in their simple English equivalents.

    2^3 = 8 asks the question: what will be the value if 2 grows 3 times? The answer is 8.

    \log_2 8 = 3 asks the question: how many times will 2 have to grow to reach 8? The answer is 3 times.

    What is Natural Logarithm?

    A natural logarithm is a special case of the logarithm which has Euler's constant (e) as its base. Mathematically, an exponential with Euler's constant as its base is written as e^x; converting this into logarithmic form gives us the natural log.

    e^x = a     ==>    \log_e a = x

    The logarithm form is denoted as ln(a), which is called natural log.

    Let us describe these mathematical forms in their simple English equivalent.

    e^x = a asks the question: what will be the value if e grows x times? The answer is a.

    ln(a) = x, i.e. \log_e a = x, asks the question: how many times will e have to grow to reach a? The answer is x times.

    Practical Application of Logarithm

    Logarithms have practical applications in the following:

    Scaling in Graphs

    In graphs, logarithms are used to display a large range of values in a compact space. For example, consider this COVID death data plotted on both linear and logarithmic axes. The left-hand graph shows a linear scale of deaths on the Y-axis; the right-hand graph shows a logarithmic scale of deaths on the Y-axis. We have used the logarithm to compress the range of the Y-axis.

    It is interesting to note that about 60% of the respondents in a recent survey by LSE did not understand what was being explained by a logarithmic graph. In another interesting article, Mordechai Rorvig explains the dynamics of log-scale graphing and the caution required in reading a logarithmic axis.

    Measurement scales

    Logarithms are used in measurement scales where calculations need to take values spanning a very large range into account. A few examples of measurement scales that use logarithms:

    Richter scale for earthquake monitoring

    The Richter scale is logarithmic and uses a base of 10, meaning that whole-number jumps indicate a tenfold increase. In this case, the increase is in wave amplitude. That is, the wave amplitude in a level 6 earthquake is 10 times greater than in a level 5 earthquake, and the amplitude increases 100 times between a level 7 earthquake and a level 9 earthquake. 

    Google page rank algorithm

    Google's PageRank algorithm expresses the relative authority and traffic of a website on a scale of 0-10. A higher number means higher traffic. The PageRank score is a logarithmic score. The actual calculation takes many factors into account, e.g. backlinks, referring domains, PR quality, authority score, etc.; for this illustration, I have only taken traffic as an indicator. The two log columns calculate the rank with base 10 and base 6 respectively (a worked example follows the table). The main idea is that we can use logarithms to represent large-scale variation with a smaller number that makes intuitive sense for readers.

    Domain Monthly visits (similarweb.com) log10(x)=Pagerank log6(x)=Pagerank Estimated Google Pagerank (checkpagerank.net)
    cnn.com 524,000,000 8.7 11.2 9
    latrobe.edu.au 22,000,000 7.3 9.4 6
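
    As a worked example for the first row: \log_{10}(524{,}000{,}000) \approx 8.7, and \log_6(524{,}000{,}000) = \ln(524{,}000{,}000) / \ln(6) \approx 11.2, which is how the two log-based PageRank columns above are calculated.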

    Growth & decay

    Logarithmic growth is the inverse of exponential growth and is very slow. Exponential growth starts slowly and then speeds up. Logarithmic growth starts fast and then slows down.

    Logarithmic growth describes a relationship where the output is a log function of the input:

    f(t) = A \log(t) + B

    The confusion between logarithmic growth and exponential growth can be explained by the fact that exponential growth curves may be straightened by plotting them using a logarithmic scale for the growth axis.