

MapReduce : Fundamental BigData algorithm behind Hadoop and Spark

 

MapReduce is a fundamental algorithmic model used in distributed computing to process and generate large datasets efficiently. It was popularized by Google and later adopted by the open-source community through Hadoop. The model simplifies parallel processing by dividing work into map and reduce tasks, making it easier to scale across a cluster of machines. Whether you’re working in big data or learning Apache Spark, understanding MapReduce helps in building a strong foundation for distributed data processing.

What Is MapReduce?

MapReduce is all about breaking a massive job into smaller tasks so they can run in parallel.

  • Map: Each worker reads a bit of the data and tags it with a key.
  • Reduce: Another worker collects all the items with the same tag and processes them.

That’s it. Simple structure, massive scale.
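Here’s a toy sketch of that pattern in Scala, purely illustrative: a plain in-memory list stands in for the distributed input, and ordinary collection operations stand in for the workers.

// Map: tag each word with a key and the value 1
val words  = "spark hadoop spark flink spark".split(" ").toList
val mapped = words.map(w => (w, 1))                   // (spark,1), (hadoop,1), (spark,1), ...

// Shuffle: bring together everything that shares a key
val grouped = mapped.groupBy(_._1)                    // spark -> List((spark,1), (spark,1), (spark,1)), ...

// Reduce: process each group, here by summing the 1s
val counts = grouped.map { case (w, pairs) => (w, pairs.map(_._2).sum) }
// Map(spark -> 3, hadoop -> 1, flink -> 1)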

Why Was MapReduce Needed?

MapReduce emerged when traditional data processing systems couldn’t handle the explosive growth of data in the early 2000s, requiring a new paradigm that could scale horizontally across thousands of commodity machines. Below is a brief rundown of the problems that existed for handling big data and how MapReduce solved them.

| Problem | MapReduce Solution |
| --- | --- |
| Data volume explosion: single machines couldn’t handle terabytes/petabytes of data | Automatic data splitting and distributed processing across multiple machines |
| Network failures and system crashes | Built-in fault tolerance with automatic retry mechanisms and task redistribution |
| Complex load balancing requirements | Automated task distribution and workload management across nodes |
| Data consistency issues in distributed systems | Structured programming model ensuring consistent data processing across nodes |
| Manual recovery processes after failures | Automatic failure detection and recovery without manual intervention |
| Difficult parallel programming requirements | Simple programming model with just Map and Reduce functions |
| Poor code reusability in distributed systems | Standardized framework allowing easy code reuse across applications |
| Complex testing of distributed applications | Simplified testing due to the standardized programming model |
| Expensive vertical scaling (adding more CPU/RAM) | Cost-effective horizontal scaling by adding more commodity machines |
| Data transfer bottlenecks | Data locality optimization by moving computation to data |

Schema-on-Read:

Schema-on-read is the backbone of MapReduce. It means you don’t need to structure your data before storing it. Think of it like this:

  • Your input data can be messy – log files, CSVs, or JSON – MapReduce doesn’t care
  • Mappers read and interpret the data on the fly, pulling out just the pieces they need
  • If some records are broken or formatted differently, the mapper can skip them and keep going

Example: Let’s say you’re processing server logs. Some might have IP addresses, others might not. Some might include user agents, others might be missing them. With schema-on-read, your mapper just grabs what it can find and moves on – no need to clean everything up first.

MapReduce doesn’t care what your data looks like until it reads it. That’s schema-on-read. Store raw files however you want, and worry about structure only when you process them.
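A minimal sketch of that tolerance, assuming hypothetical log lines where some fields are missing or broken; the mapper-style function pulls out just the IP address and skips anything it can’t parse:

// Hypothetical raw log lines: "ip timestamp [userAgent]" - some fields may be missing
val rawLines = Seq(
  "10.0.0.1 2024-01-01T10:00:00 Mozilla/5.0",
  "10.0.0.2 2024-01-01T10:00:05",            // no user agent
  "### corrupted line ###"                   // broken record
)

// Schema is applied only while reading: grab what's there, skip what isn't parseable
val ips = rawLines.flatMap { line =>
  line.split("\\s+") match {
    case Array(ip, _, _*) if ip.matches("""\d+\.\d+\.\d+\.\d+""") => Some(ip)
    case _ => None                           // malformed record: skip and keep going
  }
}
// ips == Seq("10.0.0.1", "10.0.0.2")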

How MapReduce Works in Hadoop

Let’s walk through a realistic example. Imagine a CSV file with employee info, where we want to count the number of employees in each department:

101, Alice, HR
102, Bob, IT
103, Clara, HR
104, Dave, Finance

Step 1: Input Split and Mapping

First, Hadoop splits the input file into blocks—say, 128MB each. Why? So they can be processed in parallel across multiple machines. That’s how Hadoop scales.

Each Map Task reads lines from its assigned block. In our case, each line represents an employee record, like this:

101, Alice, HR

 

What do we extract from this? It depends on our goal. Since we’re counting people per department, we group by department. That’s why we emit:

("HR", 1)

We don’t emit ("Alice", 1) because we’re counting departments, not individuals. This key-value pair structure prepares us perfectly for the reduce phase.

Step 2: Shuffle and Sort

These key-value pairs then move into the shuffle phase, where they’re grouped by key:

("HR", [1, 1])

("IT", [1])

("Finance", [1])

Step 3: Reduce

Each reducer receives one of these groups and sums the values:

("HR", 2)

("IT", 1)

("Finance", 1)

The final output shows the headcount per department.
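For comparison, the same three steps can be written almost literally with Spark’s low-level RDD API. This is just a sketch: it assumes the four lines above are saved as employees.csv (no header row) and that a SparkContext named sc is available.

// Step 1 (Map): read each line of the block and emit ("department", 1)
val pairs = sc.textFile("employees.csv")
  .map(_.split(",").map(_.trim))              // Array("101", "Alice", "HR")
  .map(fields => (fields(2), 1))              // ("HR", 1)

// Steps 2 and 3 (Shuffle + Reduce): group by key and sum the 1s per department
val headcount = pairs.reduceByKey(_ + _)

headcount.collect().foreach(println)          // (HR,2), (IT,1), (Finance,1)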

MapReduce Today: The Evolution to DAG-Based Processing

While MapReduce revolutionized distributed computing, its rigid two-stage model had limitations. Each operation required writing to disk between stages, creating performance bottlenecks. To address this, Apache Spark introduced the concept of the DAG (Directed Acyclic Graph).

The DAG model represents a fundamental shift in how distributed computations are planned and executed:

  • Flexible Pipeline Design: Instead of forcing everything into map and reduce steps, Spark creates a graph of operations that can have any number of stages
  • Memory-First Architecture: Data stays in memory between operations whenever possible, dramatically reducing I/O overhead
  • Intelligent Optimization: The DAG scheduler can analyze the entire computation graph and optimize it before execution
  • Lazy Evaluation: Operations are planned but not executed until necessary, allowing for better optimization

This evolution meant that while MapReduce jobs had to be explicitly structured as a sequence of separate map and reduce operations, Spark could automatically determine the most efficient way to execute a series of transformations. This made programs both easier to write and significantly faster to execute.
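Lazy evaluation is easy to see in code. In the sketch below (hypothetical log file name), the first two lines only record transformations in the DAG; nothing is read or computed until the action on the last line, at which point Spark optimizes the whole graph and runs it.

val logs   = spark.read.textFile("app.log")        // transformation: just recorded in the DAG
val errors = logs.filter(_.contains("ERROR"))      // transformation: still nothing executed
val count  = errors.count()                        // action: the DAG is optimized and executed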

 

MapReduce vs. DAG: Comparison

Here’s a high-level overview of how the DAG in Spark differs from MapReduce:

| Feature | MapReduce | DAG (Spark) |
| --- | --- | --- |
| Processing Stages | Fixed two stages (Map and Reduce) | Multiple flexible stages based on operation needs |
| Execution Flow | Linear flow with mandatory disk writes between stages | Optimized flow with in-memory operations between stages |
| Job Complexity | Multiple jobs needed for complex operations | Single job can handle complex multi-stage operations |
| Performance | Slower due to disk I/O between stages | Faster due to in-memory processing and stage optimization |

 

 

Key Advantages of DAG over MapReduce:

  • DAGs can optimize the execution plan by combining and reordering operations
  • Simple operations can complete in a single stage instead of requiring both map and reduce
  • Complex operations don’t need to be broken into separate MapReduce jobs
  • In-memory data sharing between stages improves performance significantly

This evolution from MapReduce to DAG-based processing represents a significant advancement in distributed computing, enabling more efficient and flexible data processing pipelines. For example, you can chain a complex set of operations in just a few lines of code, producing a multi-stage DAG as shown below:

// Example of a complex operation in Spark creating a multi-stage DAG
// (the ... placeholders stand for real predicates, columns and expressions)
val result = data.filter(...)
                 .groupBy(...)
                 .agg(...)
                 .join(otherData)
                 .select(...)

 

Understanding the Employee Count Code Example

The employee count by department example we discussed above can be written in Spark using these two lines.

val df = spark.read.option("header", "true").csv("people.csv")
df.groupBy("department").count().show()

 

This code performs a simple but common data analysis task: counting employees in each department from a CSV file. Here’s what each part does:

  • Line 1: Reads a CSV file containing employee data, telling Spark that the file has headers
  • Line 2: Groups the data by department and counts how many employees are in each group

 

Let’s break down this Spark code to understand how it implements MapReduce concepts and leverages DAG advantages:

1. Schema-on-Read:

  • The line spark.read.option("header", "true").csv("people.csv") demonstrates schema-on-read – Spark only interprets the CSV structure when reading, not before.
  • It handles messy data without requiring pre-defined schemas; adding .option("inferSchema", "true") would also let Spark detect column types instead of reading every column as a string.

2. Hidden Map Phase with DAG Optimization:

  • When groupBy("department") runs, Spark creates an optimized DAG of tasks instead of a rigid map-reduce pattern.
  • Each partition is processed efficiently with the ability to combine operations in a single stage when possible.

3. Improved Shuffle and Sort:

  • The groupBy operation triggers a shuffle, but unlike MapReduce, data stays in memory between stages.
  • DAG enables Spark to optimize the shuffle pattern and minimize data movement across the cluster.

4. Enhanced Reduce Phase:

  • The count() operation is executed as part of the DAG, potentially combining with other operations.
  • Unlike MapReduce’s mandatory disk writes between stages, Spark keeps intermediate results in memory.

5. DAG Benefits in Action:

  • The entire operation is treated as a single job with multiple stages, not separate MapReduce jobs.
  • Spark’s DAG optimizer can reorder and combine operations for better performance.

While the code looks simpler, it’s more than a syntactic simplification – the DAG-based execution model provides fundamental performance improvements over traditional MapReduce while maintaining the same logical pattern.
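If you want to see the plan Spark builds for this job, you can ask for it with explain(). The exact output depends on your Spark version, but it typically shows a single job with a hash aggregation and one shuffle (exchange) stage rather than separate map and reduce jobs.

// Print the logical and physical plans produced by Spark's optimizer
df.groupBy("department").count().explain(true)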

Code Comparison: Spark vs Hadoop MapReduce

Let’s compare how the same department counting task would be implemented in both Spark and traditional Hadoop MapReduce:

1. Reading Data

Spark:

val df = spark.read.option("header", "true").csv("people.csv")

Hadoop MapReduce:

public class EmployeeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text department = new Text();
    
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        department.set(fields[2].trim());  // department is third column
        context.write(department, one);
    }
}

 

 

2. Processing/Grouping

Spark:

df.groupBy("department").count()

 

Hadoop MapReduce:

public class DepartmentReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

3. Job Configuration and Execution

Spark:

// Just execute the transformation
df.groupBy("department").count().show()

 

Hadoop MapReduce:

public class DepartmentCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "department count");
        job.setJarByClass(DepartmentCount.class);
        
        job.setMapperClass(EmployeeMapper.class);
        job.setReducerClass(DepartmentReducer.class);
        
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

 

Key Differences:

  • Spark requires significantly less boilerplate code
  • Hadoop MapReduce needs explicit type declarations and separate Mapper/Reducer classes
  • Spark handles data types and conversions automatically
  • Hadoop MapReduce requires manual serialization/deserialization of data types
  • Spark’s fluent API makes the data processing pipeline more readable and maintainable

Performance Implications:

  • Spark keeps intermediate results in memory between operations
  • Hadoop MapReduce writes to disk between map and reduce phases
  • Spark’s DAG optimizer can combine operations for better efficiency
  • Hadoop MapReduce follows a strict two-stage execution model

 

Vectorisation

A vector in the context of NLP is a multi-dimensional array of numbers that represents linguistic units such as words, characters, sentences, or documents.

Motivation for Vectorisation

Machine learning algorithms require numerical inputs rather than raw text. Therefore, we first convert text into a numerical representation.

Tokens

The most primitive representation of language in NLP is a token. Tokenisation breaks raw text into atomic units – typically words, subwords or characters. These tokens form the basis of all downstream processing. A token is typically assigned a token ID, e.g. “cat” → 310. However, tokens themselves carry no meaning unless they’re transformed into numeric vectors.

Vectors

Although tokens and token IDs are numeric representations, they lack inherent meaning. To make these numbers meaningful mathematically, they are used as building blocks for vectors.

A vector is a mathematical representation in high-dimensional space, which means it carries more context than a bare number. As a simplified example, consider “cat” represented as a 5-dimensional vector.

 

"cat" → [0.62, -0.35, 0.12, 0.88, -0.22] 

 

 

| Dimension | Value | Implied Meaning (not labeled in real models, just illustrative) |
| --- | --- | --- |
| 1 | 0.62 | Animal-relatedness |
| 2 | -0.35 | Wild vs Domestic (negative = domestic) |
| 3 | 0.12 | Size (positive = small) |
| 4 | 0.88 | Closeness to human-associated terms (like "pet", "owner", "feed") |
| 5 | -0.22 | Abstract vs Concrete (negative = more physical/visible) |

Embeddings

If we consider our example of the word "cat", its embedding vector consists of values that are shaped by exposure to language data, such as frequent co-occurrence with words like "meow", "pet", and "kitten". This contextual usage informs how the embedding is constructed, positioning "cat" closer to semantically similar words in the vector space.

More broadly, while vectors provide a numeric way to represent tokens, embeddings are a specialised form of vector that is learned from data to capture linguistic meaning. Unlike sparse or manually defined vectors, embeddings are dense, low-dimensional, and trainable.

 

| Dimension | Value (Generic Vector) | Value (Embedding) | Implied Meaning (illustrative only) |
| --- | --- | --- | --- |
| 1 | 0.62 | 0.10 | Animal-relatedness |
| 2 | -0.35 | 0.05 | Wild vs Domestic |
| 3 | 0.12 | -0.12 | Size |
| 4 | 0.88 | 0.02 | Closeness to human-associated terms (e.g., pet, owner) |
| 5 | -0.22 | -0.05 | Abstract vs Concrete |

Learned Embedding vs Generic Vector for "cat"
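One way to see what “closer in the vector space” means is cosine similarity: related words score close to 1, unrelated ones score much lower. The sketch below uses made-up 5-dimensional vectors purely for illustration; real embeddings have hundreds of dimensions.

// Cosine similarity: dot(a, b) / (|a| * |b|)
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)
}

// Illustrative (not real) embeddings
val cat = Array(0.62, -0.35, 0.12, 0.88, -0.22)
val dog = Array(0.58, -0.30, 0.25, 0.85, -0.18)
val car = Array(-0.40, 0.10, 0.70, -0.05, 0.60)

println(cosine(cat, dog))   // ~0.99: semantically close
println(cosine(cat, car))   // negative: unrelated concepts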

 

Vectorisation Algorithms

Given below is a brief summary of major vectorisation algorithms and their timeline.

Early Algorithms: Sparse Representations

Traditional NLP approaches like Bag of Words (BoW) and TF-IDF relied on token-level frequency information. They represent each document as a high-dimensional vector based on token counts.

Bag of Words (BoW)

  • Represents a document by counting how often each token appears.
  • Treats all tokens independently; ignores order and meaning.
  • Output: sparse vector with many zeros.

TF-IDF (Term Frequency-Inverse Document Frequency)

  • Extends BoW by scaling down tokens that appear in many documents.
  • Aims to highlight unique or important tokens.
  • Output: still sparse and high-dimensional.

These approaches produce sparse vectors. As vocabulary size grows, vectors become inefficient and incapable of generalising across related words like "cat" and "feline."
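A minimal bag-of-words sketch (toy sentences, whitespace tokenisation) shows where that sparsity comes from: every distinct word in the corpus becomes one dimension, and each document fills in only a handful of them.

val docs = Seq(
  "the cat sat on the mat",
  "my dog likes the cat"
)

// Vocabulary: every distinct token across the corpus defines one dimension
val vocab = docs.flatMap(_.split(" ")).distinct.sorted

// Bag of Words: one count per vocabulary entry, per document
val bow = docs.map { doc =>
  val counts = doc.split(" ").groupBy(identity).map { case (w, ws) => (w, ws.length) }
  vocab.map(w => counts.getOrElse(w, 0))
}

bow.foreach(v => println(v.mkString(" ")))
// Each row is that document's vector; with a real vocabulary of tens of
// thousands of words, almost every entry would be 0 - hence "sparse".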


Transition to Dense Vectors: Embeddings

To overcome the limitations of sparse representations, researchers introduced dense embeddings. These are fixed-size, real-valued vectors that place semantically similar words closer together in the vector space. Unlike count-based vectors, embeddings are learned through training on large corpora.


Early Embedding Algorithms – Dense Representation

Word2Vec (2013, Google – Mikolov)

  • Learns dense embeddings using a shallow neural network.
  • Words that appear in similar contexts get similar embeddings.
  • Two training strategies:
    • CBOW (Continuous Bag of Words): Predicts the target word from its surrounding context.
    • Skip-Gram: Predicts surrounding words from the target word.
  • Efficient training using negative sampling.
  • Limitation: Produces static embeddings. A word has one vector regardless of its context.
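Spark’s MLlib ships a Word2Vec implementation; the sketch below (toy sentences and arbitrary parameter values, so the learned vectors are meaningless) shows the general shape of training a model and querying it for nearest neighbours.

import org.apache.spark.ml.feature.Word2Vec

// Each row is one tokenized "document"
val docs = spark.createDataFrame(Seq(
  "the cat sat on the mat".split(" "),
  "my pet kitten likes to meow".split(" "),
  "the dog chased the cat".split(" ")
).map(Tuple1.apply)).toDF("text")

val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("embedding")
  .setVectorSize(50)        // dimensionality of the learned embeddings
  .setMinCount(0)

val model = word2Vec.fit(docs)
model.findSynonyms("cat", 3).show()   // nearest words by cosine similarity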

GloVe (2014, Stanford)

  • Stands for Global Vectors.
  • Learns embeddings by factorising a global co-occurrence matrix.
  • Combines global corpus statistics with local context windows.
  • Strength: Captures broader semantic patterns than Word2Vec.
  • Limitation: Still produces static embeddings.

 

Embedding Algorithms – Contextual Embeddings

Even though Word2Vec and GloVe marked a huge advancement, they had a major drawback: they generate one embedding per token, regardless of context. For example, the word "bank" will have the same vector whether it refers to a financial institution or a riverbank.

This limitation led to contextual embeddings such as:

  • ELMo (Embeddings from Language Models): Learns context from both directions using RNNs.
  • BERT (Bidirectional Encoder Representations from Transformers): Uses transformers to generate context-aware embeddings where each token’s representation changes depending on its surrounding words.

Sparse vs Dense

| Feature | Sparse Vectors (BoW/TF-IDF) | Dense Embeddings (Word2Vec, GloVe) |
| --- | --- | --- |
| Dimensionality | Very high | 100–300 |
| Vector content | Mostly zeros | Fully populated |
| Captures word similarity | No | Yes |
| Context awareness | No | Partially |
| Efficient for learning | No | Yes |

Summary Timeline of Key Algorithms

| Year | Algorithm | Key Idea | Embedding Type |
| --- | --- | --- | --- |
| Pre-2010 | BoW, TF-IDF | Token count or frequency | Sparse Vector |
| 2013 | Word2Vec | Predict words using neural networks | Static Embedding |
| 2014 | GloVe | Factorize co-occurrence matrix | Static Embedding |
| 2018 | ELMo | Deep contextual embeddings via language modeling | Contextual Embedding |
| 2018 | BERT | Transformer-based contextual learning | Contextual Embedding |

 

Data Abstractions in Spark (RDD, DataFrame, Dataset)

Spark provides three abstractions for handling data:

RDDs

Distributed collections of objects that can be cached in memory across cluster nodes (e.g., a large collection can be partitioned and spread across multiple nodes of the cluster).

DataFrame

DataFrames are distributed collections of data organized into named columns, similar to tables in a relational database. They provide a powerful abstraction that supports structured and semi-structured data with optimized execution through the Catalyst optimizer.

DataFrames are schema-aware and can be created from various data sources including structured files (CSV, JSON), Hive tables, or external databases, offering SQL-like operations for data manipulation and analysis.

Dataset

Datasets are a type-safe, object-oriented programming interface that provides the benefits of RDDs (static typing and lambda functions) while also leveraging Spark SQL's optimized execution engine.

They offer a unique combination of type safety and ease of use, making them particularly useful for applications where type safety is important and the data fits into well-defined schemas using case classes in Scala or Java beans.
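As a sketch, the employee CSV from the earlier MapReduce example could be read as a typed Dataset; the Employee case class and its field names are assumptions about the file’s header.

import spark.implicits._   // enables .as[...] and typed lambdas

// One row of the CSV; without inferSchema, every CSV column is read as a String
case class Employee(id: String, name: String, department: String)

val employees = spark.read
  .option("header", "true")
  .csv("people.csv")
  .as[Employee]            // DataFrame -> Dataset[Employee]

// Compile-time checked field access instead of string column names
employees.filter(_.department == "HR").count()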

 

 

Comparison of Spark Data Abstractions

| Feature | RDD | DataFrame | Dataset |
| --- | --- | --- | --- |
| Type Safety | Type-safe | Not type-safe | Type-safe |
| Schema | No schema | Schema-based | Schema-based |
| API Style | Functional API | Domain-specific language (DSL) | Both functional and DSL |
| Optimization | Basic | Catalyst Optimizer | Catalyst Optimizer |
| Memory Usage | High | Efficient | Moderate |
| Serialization | Java serialization | Custom encoders | Custom encoders |
| Language Support | All Spark languages | All Spark languages | Scala and Java only |

graph LR
    A["Data Storage in Spark"]
    
    A --> B["RDD"]
    A --> C["DataFrame"]
    A --> D["Dataset"]
    
    B --> B1["Raw Java/Scala Objects"]
    B --> B2["Distributed Collection"]
    B --> B3["No Schema Information"]
    
    C --> C1["Row Objects"]
    C --> C2["Schema-based Structure"]
    C --> C3["Column Names & Types"]
    
    D --> D1["Typed JVM Objects"]
    D --> D2["Schema + Type Information"]
    D --> D3["Strong Type Safety"]
    
    style B fill:#f9d6d6
    style C fill:#d6e5f9
    style D fill:#d6f9d6

 

This diagram illustrates how data is stored in different Spark abstractions:

  • RDD stores data as raw Java/Scala objects with no schema information
  • DataFrame organizes data in rows with defined column names and types
  • Dataset combines schema-based structure with strong type safety using JVM objects

Use Case Scenarios and Recommendations

| Abstraction | Best Use Cases | Why Choose This? |
| --- | --- | --- |
| RDD | Low-level transformations; custom data types; legacy code maintenance | Complete control over data processing; working with non-structured data; need for custom optimization |
| DataFrame | SQL-like operations; machine learning pipelines; structured data processing | Better performance through optimization; familiar SQL-like interface; integration with BI tools |
| Dataset | Complex business logic; type-safe operations; domain object manipulation | Compile-time type safety; object-oriented programming; balance of performance and control |

Key Takeaways:

  • Use RDDs when you need low-level control and are working with unstructured data
  • Choose DataFrames for structured data and when performance is critical
  • Opt for Datasets when you need both type safety and performance optimization

Initialization and Operations Examples

| Operation | RDD | DataFrame | Dataset |
| --- | --- | --- | --- |
| Creation | sc.parallelize(List(1,2,3)) | spark.createDataFrame(data) | spark.createDataset(data) |
| Reading Files | sc.textFile("path") | spark.read.csv("path") | spark.read.csv("path").as[Case] |
| Filtering | rdd.filter(x => x > 10) | df.filter($"col" > 10) | ds.filter(_.value > 10) |
| Mapping | rdd.map(x => x * 2) | df.select($"col" * 2) | ds.map(x => x * 2) |
| Grouping | rdd.groupBy(x => x) | df.groupBy("col") | ds.groupByKey(_.key) |

Note: The above examples assume necessary imports and SparkSession/SparkContext initialization.
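For reference, a typical local-mode initialization looks roughly like this (the app name and master setting are arbitrary choices for experimentation):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("data-abstraction-examples")
  .master("local[*]")               // run locally on all cores; omit when submitting to a cluster
  .getOrCreate()

val sc = spark.sparkContext          // SparkContext, used by the RDD examples
import spark.implicits._             // enables the $"col" syntax and .as[Case]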

 

Snowflake Object Model


The Snowflake object model is a hierarchical framework that organizes and manages data within the Snowflake cloud data platform. An “object” itself refers to a logical container or structure that is used to either

  1. Store data,
  2. Organize data, or
  3. Manage data.

From the top-level organization and account objects down to the granular elements like tables and views, the Snowflake object model provides a structured framework for data storage, access, and security. The following is a detailed overview of the key objects in the Snowflake object model and their respective functions.

Organisation

In Snowflake, an organisation is a top-level entity that groups together related accounts, providing a way to manage billing, usage, and access at a higher level.

Example: A multinational corporation might have a separate Snowflake organisation for each region it operates in, with individual accounts for each country.

Account

An account in Snowflake represents an independent environment with its own set of users, databases, and resources. It is the primary unit of access and billing.

Example: A retail company might have a Snowflake account dedicated to its e-commerce analytics, separate from other accounts used for different business functions.

Role

A role in Snowflake is a collection of permissions that define what actions a user or group of users can perform. Roles are used to enforce security and access control.

Example: A “Data Analyst” role might have permissions to query and view data in specific databases and schemas but not to modify or delete data.

User

A user in Snowflake is an individual or service that interacts with the platform, identified by a unique username. Users are assigned roles that determine their access and capabilities.

Example: A user named “john.doe” might be a data scientist with access to analytical tools and datasets within the Snowflake environment.

Share

A share in Snowflake is a mechanism for securely sharing data between different accounts or organisations. It allows for controlled access to specific objects without copying or moving the data.

Example: A company might create a share to provide its partner with read-only access to a specific dataset for collaboration purposes.

Network Policy

A network policy in Snowflake is a set of rules that define allowed IP addresses or ranges for accessing the Snowflake account, enhancing security by restricting access to authorized networks.

Example: A financial institution might configure a network policy to allow access to its Snowflake account only from its corporate network.

Warehouse

In Snowflake, a warehouse is a cluster of compute resources used for executing data processing tasks such as querying and loading data. Warehouses can be scaled up or down to manage performance and cost.

Example: A marketing team might use a small warehouse for routine reporting tasks and a larger warehouse for more intensive data analysis during campaign launches.

Resource Monitor

A resource monitor in Snowflake is a tool for tracking and controlling the consumption of compute resources. It can be used to set limits and alerts to prevent overspending.

Example: A company might set up a resource monitor to ensure that its monthly compute costs do not exceed a predetermined budget.

Database

A database in Snowflake is a collection of schemas and serves as the primary container for storing and organizing data. It is similar to a database in traditional relational database systems.

Example: A healthcare organization might have a database called “PatientRecords” that contains schemas for different types of medical data.

Schema

A schema in Snowflake is a logical grouping of database objects such as tables, views, and functions. It provides a way to organize and manage related objects within a database.

Example: In a “Sales” database, there might be a schema called “Transactions” that contains tables for sales orders, invoices, and payments.

UDF (User-Defined Function)

A UDF in Snowflake is a custom function created by users to perform specific operations or calculations that are not available as built-in functions.

Example: A retail company might create a UDF to calculate the total sales tax for an order based on different tax rates for each product category.

Task

A task in Snowflake is a scheduled object that automates the execution of SQL statements, including data loading, transformation, and other maintenance operations.

Example: A data engineering team might set up a task to automatically refresh a materialized view every night at midnight.

Pipe

A pipe in Snowflake is an object used for continuous data ingestion from external sources into Snowflake tables. It processes and loads streaming data in near real-time.

Example: A streaming service might use a pipe to ingest real-time user activity data into a Snowflake table for analysis.

Procedure

A procedure in Snowflake is a stored sequence of SQL statements that can be executed as a single unit. It is used to encapsulate complex business logic and automate repetitive tasks.

Example: A finance team might create a procedure to generate monthly financial reports by aggregating data from various sources and applying specific calculations.

Stages

 In Snowflake, stages are objects used to stage data files before loading them into tables. They can be internal (managed by Snowflake) or external (located in cloud storage).

Example: A data integration process might use a stage to temporarily store CSV files before loading them into a Snowflake table for analysis.

External Stage

An external stage in Snowflake is a reference to a location in cloud storage (such as Amazon S3, Google Cloud Storage, or Azure Blob Storage) where data files are staged before loading.

Example: A company might use an external stage pointing to an Amazon S3 bucket to stage log files before loading them into Snowflake for analysis.

Internal Stage

An internal stage in Snowflake is a built-in storage location managed by Snowflake for staging data files before loading them into tables.

Example: An analytics team might use an internal stage to temporarily store JSON files before transforming and loading them into a Snowflake table for analysis.

Table

A table in Snowflake is a structured data object that stores data in rows and columns. Tables can be of different types, such as permanent, temporary, or external.

Example: A logistics company might have a permanent table called “Shipments” that stores detailed information about each shipment, including origin, destination, and status.

External Tables

External tables in Snowflake are tables that reference data stored in external stages, allowing for querying data directly from cloud storage without loading it into Snowflake.

Example: A data science team might use external tables to query large datasets stored in Amazon S3 without importing the data into Snowflake, saving storage costs.

Transient Tables

 Transient tables in Snowflake are similar to permanent tables but with a shorter lifespan and lower storage costs. They are suitable for temporary or intermediate data.

Example: During a data transformation pipeline, a transient table might be used to store intermediate results that are needed for a short period before being discarded.

Temporary Tables

 Temporary tables in Snowflake are session-specific tables that are automatically dropped at the end of the session. They are useful for temporary calculations or intermediate steps.

Example: In an ad-hoc analysis session, a data analyst might create a temporary table to store query results for further exploration without affecting the permanent dataset.

Permanent Tables

 Permanent tables in Snowflake are tables that persist data indefinitely and are the default table type for long-term data storage.

Example: A financial institution might use permanent tables to store historical transaction data for compliance and reporting purposes.

View

A view in Snowflake is a virtual table that is defined by a SQL query. Views can be standard, secured, or materialized, each serving different purposes.

Example: A sales dashboard might use a view to present aggregated sales data by region and product category, based on a query that joins multiple underlying tables.

Secured Views

Secured views in Snowflake are views that enforce column-level security, ensuring that sensitive data is only visible to authorized users.

Example: In a multi-tenant application, a secured view might be used to ensure that each tenant can only see their own data, even though the underlying table contains data for all tenants.

Standard Views

Standard views in Snowflake are the default view type, providing a simple way to create a virtual table based on a SQL query without any additional security features.

Example: A marketing team might use a standard view to create a simplified representation of a complex query that combines customer data from multiple underlying tables.

Materialized Views

Materialized views in Snowflake are views that store the result set of the query physically, providing faster access to precomputed data.

Example: To speed up reporting on large datasets, a data warehouse might use materialized views to pre-aggregate daily sales data by store and product category.