
Vectorisation

A vector in the context of NLP is an array of numbers, a point in a multi-dimensional space, that represents linguistic units such as words, characters, sentences, or documents.

Motivation for Vectorisation

Machine learning algorithms require numerical inputs rather than raw text. Therefore, we first convert text into a numerical representation.

Tokens

The most primitive representation of language in NLP is a token. Tokenisation breaks raw text into atomic units, typically words, subwords, or characters. These tokens form the basis of all downstream processing. A token is typically assigned a token ID, e.g. "cat" → 310. However, tokens themselves carry no meaning unless they are transformed into numeric vectors.
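As a rough sketch of the idea (a toy example, not how production tokenisers such as BPE or WordPiece work), the Python snippet below splits text into word tokens and assigns each distinct token an arbitrary integer ID; a mapping like "cat" → 310 would simply be one entry in such a vocabulary.

# Toy tokenisation sketch: word-level tokens mapped to arbitrary integer IDs.
text = "The cat sat on the mat"
tokens = text.lower().split()            # word-level tokenisation

vocab = {}                               # token -> token ID (IDs here are arbitrary)
for tok in tokens:
    vocab.setdefault(tok, len(vocab) + 1)

token_ids = [vocab[tok] for tok in tokens]
print(tokens)       # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(token_ids)    # [1, 2, 3, 4, 1, 5]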

Vectors

Although tokens and token IDs are numeric representations, they lack inherent meaning. To make these numbers meaningful mathematically, they are used as building blocks for vectors.

A vector is a mathematical representation in a high-dimensional space, which means it can carry far more context than a bare number. As a simplified example, consider "cat" represented as a 5-dimensional vector.

 

"cat" → [0.62, -0.35, 0.12, 0.88, -0.22] 

 

 

Dimension | Value | Implied Meaning (not labeled in real models, just illustrative)
1 | 0.62 | Animal-relatedness
2 | -0.35 | Wild vs Domestic (negative = domestic)
3 | 0.12 | Size (positive = small)
4 | 0.88 | Closeness to human-associated terms (like "pet", "owner", "feed")
5 | -0.22 | Abstract vs Concrete (negative = more physical/visible)

Embeddings

If we consider our example of the word "cat", its embedding vector consists of values that are shaped by exposure to language data, such as frequent co-occurrence with words like "meow", "pet", and "kitten". This contextual usage informs how the embedding is constructed, positioning "cat" closer to semantically similar words in the vector space.

More broadly, while vectors provide a numeric way to represent tokens, embeddings are a specialised form of vector that is learned from data to capture linguistic meaning. Unlike sparse or manually defined vectors, embeddings are dense, low-dimensional, and trainable.

 

Dimension | Value (Generic Vector) | Value (Embedding) | Implied Meaning (illustrative only)
1 | 0.62 | 0.10 | Animal-relatedness
2 | -0.35 | 0.05 | Wild vs Domestic
3 | 0.12 | -0.12 | Size
4 | 0.88 | 0.02 | Closeness to human-associated terms (e.g., pet, owner)
5 | -0.22 | -0.05 | Abstract vs Concrete

Learned Embedding vs Generic Vector for "cat"

 

Vectorisation Algorithms

Given below is a brief summary of major vectorisation algorithms and their timeline.

Early Algorithms: Sparse Representations

Traditional NLP approaches like Bag of Words (BoW) and TF-IDF relied on token-level frequency information. They represent each document as a high-dimensional vector based on token counts.

Bag of Words (BoW)

  • Represents a document by counting how often each token appears.
  • Treats all tokens independently; ignores order and meaning.
  • Output: sparse vector with many zeros.

TF-IDF (Term Frequency-Inverse Document Frequency)

  • Extends BoW by scaling down tokens that appear in many documents.
  • Aims to highlight unique or important tokens.
  • Output: still sparse and high-dimensional.

These approaches produce sparse vectors. As vocabulary size grows, vectors become inefficient and incapable of generalising across related words like "cat" and "feline."
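As a hedged illustration, the sketch below builds both representations with scikit-learn (one common library for this) on a two-document toy corpus, so the mostly-zero, count-based output is easy to see.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog chased the cat"]

# Bag of Words: raw token counts per document (sparse matrix)
bow = CountVectorizer()
bow_matrix = bow.fit_transform(docs)
print(bow.get_feature_names_out())
print(bow_matrix.toarray())

# TF-IDF: counts reweighted so tokens common to many documents are scaled down
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray().round(2))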


Transition to Dense Vectors: Embeddings

To overcome the limitations of sparse representations, researchers introduced dense embeddings. These are fixed-size, real-valued vectors that place semantically similar words closer together in the vector space. Unlike count-based vectors, embeddings are learned through training on large corpora.
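"Closer together" is usually measured with cosine similarity. The sketch below computes it with NumPy over made-up vectors (the numbers are illustrative only, in the spirit of the "cat" example above).

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: near 1.0 = very similar direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative, made-up 5-dimensional embeddings
cat    = np.array([0.62, -0.35, 0.12, 0.88, -0.22])
feline = np.array([0.58, -0.30, 0.15, 0.80, -0.18])
bank   = np.array([-0.10, 0.70, -0.40, 0.05, 0.60])

print(cosine_similarity(cat, feline))   # high: semantically close
print(cosine_similarity(cat, bank))     # much lower: semantically distant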


Early Embedding Algorithms – Dense Representation

Word2Vec (2013, Google – Mikolov)

  • Learns dense embeddings using a shallow neural network.
  • Words that appear in similar contexts get similar embeddings.
  • Two training strategies:
    • CBOW (Continuous Bag of Words): Predicts the target word from its surrounding context.
    • Skip-Gram: Predicts surrounding words from the target word.
  • Efficient training using negative sampling.
  • Limitation: Produces static embeddings. A word has one vector regardless of its context (a minimal training sketch follows below).
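As a minimal sketch, the snippet below trains Word2Vec with the gensim library on a three-sentence toy corpus (a real corpus would contain millions of sentences, and the parameter values here are only illustrative).

from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens
sentences = [
    ["the", "cat", "chased", "the", "mouse"],
    ["the", "kitten", "chased", "a", "mouse"],
    ["dogs", "and", "cats", "are", "pets"],
]

# sg=1 selects Skip-Gram (sg=0 would be CBOW); negative=5 enables negative sampling
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 sg=1, negative=5, epochs=50)

print(model.wv["cat"][:5])              # static 50-dimensional embedding for "cat"
print(model.wv.most_similar("cat"))     # nearest words in the learned space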

GloVe (2014, Stanford)

  • Stands for Global Vectors.
  • Learns embeddings by factorising a global co-occurrence matrix.
  • Combines global corpus statistics with local context windows.
  • Strength: Captures broader semantic patterns than Word2Vec.
  • Limitation: Still produces static embeddings (a short loading sketch follows below).
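One convenient way to experiment with GloVe is to load a pre-trained model. The sketch below assumes gensim's downloader and its "glove-wiki-gigaword-50" vectors (a one-off download over the internet).

import gensim.downloader as api

# Download pre-trained 50-dimensional GloVe vectors (one-off download)
glove = api.load("glove-wiki-gigaword-50")

print(glove["cat"][:5])                   # static vector for "cat"
print(glove.most_similar("cat", topn=3))  # words that appear in similar contexts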

 

Embedding Algorithms – Contextual Embeddings

Even though Word2Vec and GloVe marked a huge advancement, they had a major drawback: they generate one embedding per token, regardless of context. For example, the word "bank" will have the same vector whether it refers to a financial institution or a riverbank.

This limitation led to contextual embeddings such as the following (a short sketch follows the list):

  • ELMo (Embeddings from Language Models): Learns context from both directions using RNNs.
  • BERT (Bidirectional Encoder Representations from Transformers): Uses transformers to generate context-aware embeddings where each token’s representation changes depending on its surrounding words.
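As a hedged sketch of the "bank" example, the snippet below uses the Hugging Face transformers library with the public bert-base-uncased checkpoint to show that the same word receives a different vector in each sentence.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["I deposited cash at the bank",
             "We sat on the bank of the river"]

with torch.no_grad():
    for text in sentences:
        inputs = tokenizer(text, return_tensors="pt")
        outputs = model(**inputs)
        # One contextual vector per token; pick out the vector for "bank"
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        bank_vector = outputs.last_hidden_state[0, tokens.index("bank")]
        print(text, "->", bank_vector[:5])   # differs between the two sentences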

Sparse vs Dense

Feature | Sparse Vectors (BoW/TF-IDF) | Dense Embeddings (Word2Vec, GloVe)
Dimensionality | Very high | 100–300
Vector content | Mostly zeros | Fully populated
Captures word similarity | No | Yes
Context awareness | No | Partially
Efficient for learning | No | Yes

Summary Timeline of Key Algorithms

Year | Algorithm | Key Idea | Embedding Type
Pre-2010 | BoW, TF-IDF | Token count or frequency | Sparse Vector
2013 | Word2Vec | Predict words using neural networks | Static Embedding
2014 | GloVe | Factorize co-occurrence matrix | Static Embedding
2018 | ELMo | Deep contextual embeddings via language modeling | Contextual Embedding
2018 | BERT | Transformer-based contextual learning | Contextual Embedding

 

Data Abstractions in Spark (RDD, DataFrame, Dataset)

Spark provides three abstractions for handling data:

RDDs

Distributed collections of objects that can be cached in memory across cluster nodes (e.g., if a collection is large, it can be partitioned across multiple nodes of the cluster).

DataFrame

DataFrames are distributed collections of data organized into named columns, similar to tables in a relational database. They provide a powerful abstraction that supports structured and semi-structured data with optimized execution through the Catalyst optimizer.

DataFrames are schema-aware and can be created from various data sources including structured files (CSV, JSON), Hive tables, or external databases, offering SQL-like operations for data manipulation and analysis.

Dataset

Datasets are a type-safe, object-oriented programming interface that provides the benefits of RDDs (static typing and lambda functions) while also leveraging Spark SQL's optimized execution engine.

They offer a unique combination of type safety and ease of use, making them particularly useful for applications where type safety is important and the data fits into well-defined schemas using case classes in Scala or Java beans.

 

 

Comparison of Spark Data Abstractions

Feature | RDD | DataFrame | Dataset
Type Safety | Type-safe | Not type-safe | Type-safe
Schema | No schema | Schema-based | Schema-based
API Style | Functional API | Domain-specific language (DSL) | Both functional and DSL
Optimization | Basic | Catalyst Optimizer | Catalyst Optimizer
Memory Usage | High | Efficient | Moderate
Serialization | Java Serialization | Custom encoders | Custom encoders
Language Support | All Spark languages | All Spark languages | Scala and Java only

graph LR
    A["Data Storage in Spark"]
    
    A --> B["RDD"]
    A --> C["DataFrame"]
    A --> D["Dataset"]
    
    B --> B1["Raw Java/Scala Objects"]
    B --> B2["Distributed Collection"]
    B --> B3["No Schema Information"]
    
    C --> C1["Row Objects"]
    C --> C2["Schema-based Structure"]
    C --> C3["Column Names & Types"]
    
    D --> D1["Typed JVM Objects"]
    D --> D2["Schema + Type Information"]
    D --> D3["Strong Type Safety"]
    
    style B fill:#f9d6d6
    style C fill:#d6e5f9
    style D fill:#d6f9d6

 

This diagram illustrates how data is stored in different Spark abstractions:

  • RDD stores data as raw Java/Scala objects with no schema information
  • DataFrame organizes data in rows with defined column names and types
  • Dataset combines schema-based structure with strong type safety using JVM objects

Use Case Scenarios and Recommendations

Abstraction | Best Use Cases | Why Choose This?
RDD | Low-level transformations; custom data types; legacy code maintenance | Complete control over data processing; working with non-structured data; need for custom optimization
DataFrame | SQL-like operations; machine learning pipelines; structured data processing | Better performance through optimization; familiar SQL-like interface; integration with BI tools
Dataset | Complex business logic; type-safe operations; domain object manipulation | Compile-time type safety; object-oriented programming; balance of performance and control

Key Takeaways:

  • Use RDDs when you need low-level control or are working with unstructured data
  • Choose DataFrames for structured data and when performance is critical
  • Opt for Datasets when you need both type safety and performance optimization

Initialization and Operations Examples

Operation | RDD | DataFrame | Dataset
Creation | sc.parallelize(List(1,2,3)) | spark.createDataFrame(data) | spark.createDataset(data)
Reading Files | sc.textFile("path") | spark.read.csv("path") | spark.read.csv("path").as[Case]
Filtering | rdd.filter(x => x > 10) | df.filter($"col" > 10) | ds.filter(_.value > 10)
Mapping | rdd.map(x => x * 2) | df.select($"col" * 2) | ds.map(x => x * 2)
Grouping | rdd.groupBy(x => x) | df.groupBy("col") | ds.groupByKey(_.key)

Note: The above examples assume necessary imports and SparkSession/SparkContext initialization.
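The table above uses Scala syntax. As a rough PySpark equivalent (Datasets exist only in Scala/Java, so in Python the DataFrame plays that role), a minimal sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("abstractions-demo").getOrCreate()
sc = spark.sparkContext

# RDD: low-level distributed collection of raw Python objects
rdd = sc.parallelize([1, 5, 12, 20])
print(rdd.filter(lambda x: x > 10).map(lambda x: x * 2).collect())   # [24, 40]

# DataFrame: named, typed columns with a schema, optimised by Catalyst
df = spark.createDataFrame([(1, "cat"), (2, "dog")], ["id", "label"])
df.filter(df["id"] > 1).show()

spark.stop()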

 

Snowflake Object Model


The Snowflake object model is a hierarchical framework that organizes and manages data within the Snowflake cloud data platform. An "object" refers to a logical container or structure that is used to:

  1. Store data,
  2. Organize data, or
  3. Manage data.

From the top-level organization and account objects down to the granular elements like tables and views, the Snowflake object model provides a structured framework for data storage, access, and security. The following is a detailed overview of the key objects in the Snowflake object model and their respective functions.

Organisation

In Snowflake, an organisation is a top-level entity that groups together related accounts, providing a way to manage billing, usage, and access at a higher level.

Example: A multinational corporation might have a separate Snowflake organisation for each region it operates in, with individual accounts for each country.

Account

An account in Snowflake represents an independent environment with its own set of users, databases, and resources. It is the primary unit of access and billing.

Example: A retail company might have a Snowflake account dedicated to its e-commerce analytics, separate from other accounts used for different business functions.

Role

A role in Snowflake is a collection of permissions that define what actions a user or group of users can perform. Roles are used to enforce security and access control.

Example: A “Data Analyst” role might have permissions to query and view data in specific databases and schemas but not to modify or delete data.

User

A user in Snowflake is an individual or service that interacts with the platform, identified by a unique username. Users are assigned roles that determine their access and capabilities.

Example: A user named “john.doe” might be a data scientist with access to analytical tools and datasets within the Snowflake environment.

Share

A share in Snowflake is a mechanism for securely sharing data between different accounts or organisations. It allows for controlled access to specific objects without copying or moving the data.

Example: A company might create a share to provide its partner with read-only access to a specific dataset for collaboration purposes.

Network Policy

A network policy in Snowflake is a set of rules that define allowed IP addresses or ranges for accessing the Snowflake account, enhancing security by restricting access to authorized networks.

Example: A financial institution might configure a network policy to allow access to its Snowflake account only from its corporate network.

Warehouse

In Snowflake, a warehouse is a cluster of compute resources used for executing data processing tasks such as querying and loading data. Warehouses can be scaled up or down to manage performance and cost.

Example: A marketing team might use a small warehouse for routine reporting tasks and a larger warehouse for more intensive data analysis during campaign launches.

Resource Monitor

A resource monitor in Snowflake is a tool for tracking and controlling the consumption of compute resources. It can be used to set limits and alerts to prevent overspending.

Example: A company might set up a resource monitor to ensure that its monthly compute costs do not exceed a predetermined budget.

Database

A database in Snowflake is a collection of schemas and serves as the primary container for storing and organizing data. It is similar to a database in traditional relational database systems.

Example: A healthcare organization might have a database called “PatientRecords” that contains schemas for different types of medical data.

Schema

A schema in Snowflake is a logical grouping of database objects such as tables, views, and functions. It provides a way to organize and manage related objects within a database.

Example: In a “Sales” database, there might be a schema called “Transactions” that contains tables for sales orders, invoices, and payments.

UDF (User-Defined Function)

A UDF in Snowflake is a custom function created by users to perform specific operations or calculations that are not available as built-in functions.

Example: A retail company might create a UDF to calculate the total sales tax for an order based on different tax rates for each product category.

Task

A task in Snowflake is a scheduled object that automates the execution of SQL statements, including data loading, transformation, and other maintenance operations.

Example: A data engineering team might set up a task to automatically refresh a materialized view every night at midnight.

Pipe

A pipe in Snowflake is an object used for continuous data ingestion from external sources into Snowflake tables. It processes and loads streaming data in near real-time.

Example: A streaming service might use a pipe to ingest real-time user activity data into a Snowflake table for analysis.

Procedure

A procedure in Snowflake is a stored sequence of SQL statements that can be executed as a single unit. It is used to encapsulate complex business logic and automate repetitive tasks.

Example: A finance team might create a procedure to generate monthly financial reports by aggregating data from various sources and applying specific calculations.

Stages

 In Snowflake, stages are objects used to stage data files before loading them into tables. They can be internal (managed by Snowflake) or external (located in cloud storage).

Example: A data integration process might use a stage to temporarily store CSV files before loading them into a Snowflake table for analysis.

External Stage

An external stage in Snowflake is a reference to a location in cloud storage (such as Amazon S3, Google Cloud Storage, or Azure Blob Storage) where data files are staged before loading.

Example: A company might use an external stage pointing to an Amazon S3 bucket to stage log files before loading them into Snowflake for analysis.

Internal Stage

An internal stage in Snowflake is a built-in storage location managed by Snowflake for staging data files before loading them into tables.

Example: An analytics team might use an internal stage to temporarily store JSON files before transforming and loading them into a Snowflake table for analysis.

Table

A table in Snowflake is a structured data object that stores data in rows and columns. Tables can be of different types, such as permanent, temporary, or external.

Example: A logistics company might have a permanent table called “Shipments” that stores detailed information about each shipment, including origin, destination, and status.

External Tables

External tables in Snowflake are tables that reference data stored in external stages, allowing for querying data directly from cloud storage without loading it into Snowflake.

Example: A data science team might use external tables to query large datasets stored in Amazon S3 without importing the data into Snowflake, saving storage costs.

Transient Tables

 Transient tables in Snowflake are similar to permanent tables but with a shorter lifespan and lower storage costs. They are suitable for temporary or intermediate data.

Example: During a data transformation pipeline, a transient table might be used to store intermediate results that are needed for a short period before being discarded.

Temporary Tables

 Temporary tables in Snowflake are session-specific tables that are automatically dropped at the end of the session. They are useful for temporary calculations or intermediate steps.

Example: In an ad-hoc analysis session, a data analyst might create a temporary table to store query results for further exploration without affecting the permanent dataset.

Permanent Tables

 Permanent tables in Snowflake are tables that persist data indefinitely and are the default table type for long-term data storage.

Example: A financial institution might use permanent tables to store historical transaction data for compliance and reporting purposes.

View

A view in Snowflake is a virtual table that is defined by a SQL query. Views can be standard, secured, or materialized, each serving different purposes.

Example: A sales dashboard might use a view to present aggregated sales data by region and product category, based on a query that joins multiple underlying tables.

Secured Views

Secured views in Snowflake are views that enforce column-level security, ensuring that sensitive data is only visible to authorized users.

Example: In a multi-tenant application, a secured view might be used to ensure that each tenant can only see their own data, even though the underlying table contains data for all tenants.

Standard Views

Standard views in Snowflake are the default view type, providing a simple way to create a virtual table based on a SQL query without any additional security features.

Example: A marketing team might use a standard view to create a simplified representation of a complex query that combines customer data from several underlying tables.

Materialized Views

Materialized views in Snowflake are views that store the result set of the query physically, providing faster access to precomputed data.

Example: To speed up reporting on large datasets, a data warehouse might use materialized views to pre-aggregate daily sales data by store and product category.

Exporting GA4 data from BigQuery to Snowflake


In a previous article, we explored how to export data from GA4 to BigQuery. In instances where we want to migrate data from BigQuery to another platform like Snowflake, BigQuery offers a few options.

BigQuery Export options

Explore with Sheets:

Directly analyze and visualize your BigQuery data using Google Sheets.

Explore with Looker Studio:

Utilize Looker Studio (formerly Data Studio) for more advanced data exploration and interactive dashboard creation.

Export to GCS:

Save BigQuery datasets to Google Cloud Storage for storage or further processing with other tools.

Scan with Sensitive Data Protection:

Check your datasets for sensitive information before exporting, to ensure compliance and data privacy.

In our case, since we want to export the Google Analytics 4 data into Snowflake, we will first need to export it to Google Cloud Storage (GCS). From this storage, we can then ingest the data into Snowflake.

To understand the process flow, here is what we will be doing.

GA4 -> BigQuery -> Google Cloud Storage -> Snowflake

A. Exporting from BigQuery to GCS

Deciding on Export Format

Before exporting, we want to decide on the format in which data will be exported for consumption. You can choose CSV, JSON, Avro, or Parquet depending on the use case; we will go with Parquet in this example. A brief comparison of these four data formats is given in the table below.

Feature | CSV | JSON | Avro | Parquet
Data Structure | Flat | Hierarchical | Hierarchical | Columnar
Readability | High (text-based) | High (text-based) | Low (binary) | Low (binary)
File Size | Medium | Large | Small | Small
Performance | Low | Medium | High | Very high
Schema Evolution | Not supported | Not supported | Supported | Supported
Use Case | Simple analytics | Web APIs, complex data | Long-term storage, evolving schemas | Analytical queries, large datasets
Compression | Low | Medium | High | Very high

Why Parquet?

Here's a brief summary of why we chose Parquet for exporting the GA4 data from BigQuery.

Columnar Efficiency

We benefit from Parquet’s columnar storage, optimizing our query execution by accessing only the necessary data.

Cost-Effective Storage

Our expenditure on data storage is minimized due to Parquet’s superior compression and encoding capabilities.

Support for Hierarchical Data

It supports our GA4 hierarchical data structures, ensuring the integrity of our analytical insights.

Seamless Integration

We utilize Snowflake’s native support for Parquet for straightforward data processing.

Schema Evolution Support

Since GA4 is in its early stage and new features keep on coming, we can gracefully manage changes in our data schema, avoiding costly downtime and data migrations.

Exporting Single Table

Clicking on Export -> Export to GCS will open a dialog to pick the export format and compression. I have also specified a GCS storage location to store the export.

Exporting Multiple tables

The visual interface only allows exporting a single table. Google Analytics 4, however, stores each day's data as a separate table. Therefore, we will have to find an alternative to the visual export.

Shell script for Multiple table export

We can write a shell script which can export all our tables into our bucket. At a high level, we want our script to do the following :

  1. Set Parameters: Define the dataset, table prefix, and GCS bucket.
  2. List Tables: Use BigQuery to list all events_ tables in the dataset.
  3. Export Tables: Loop through each table and export it to the designated GCS bucket as a Parquet file.

Here's what the exact script looks like:

#!/bin/bash

# Define your dataset and table prefix
DATASET="bigquery-public-data:ga4_obfuscated_sample_ecommerce"
TABLE_PREFIX="events_"
BUCKET_NAME="tempv2"

# Get the list of tables using the bq ls command, filtering with grep for the table prefix
TABLES=$(bq ls --max_results 1000 $DATASET | grep $TABLE_PREFIX | awk '{print $1}')

# Loop through the tables and run the bq extract command on each one
for TABLE in $TABLES
do
    bq extract --destination_format=PARQUET $DATASET.$TABLE gs://$BUCKET_NAME/${TABLE}.parquet
done

Save the script and give it a name. I named it export_tables.sh. Make it executable with chmod +x export_tables.sh.

Execute the shell script with ./export_tables.sh

If everything works correctly, you will start to see output for each exported table.

You can check whether data has been exported by inspecting the contents of the storage bucket.


Allow appropriate access so that you can read the data in Snowflake. You can do this by opening the bucket > Permissions and then clicking on Grant Access.

In this example, I have granted access to allUsers. This will make the bucket readable publicly.


To ingest the data from Google Cloud Storage into Snowflake, we will create a storage integration between GCS and Snowflake and then create an external stage. The storage integration streamlines the authentication flow between GCS and Snowflake; the external stage allows the Snowflake database to ingest the data.


B. Create Storage Integration in Snowflake

Creating a storage integration in Snowflake will create a Snowflake-managed GCP service account. We will then provision access to that service account from GCP.

  1. Log into Snowflake: Open the Snowflake Web UI and switch to the role with privileges to create storage integrations (e.g., ACCOUNTADMIN).
  2. Create the Storage Integration:

CREATE STORAGE INTEGRATION gcs_storage_integration
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = GCS
  ENABLED = TRUE
  STORAGE_ALLOWED_LOCATIONS = ('gcs://tempv2/');

  3. Retrieve Integration Details:

DESC STORAGE INTEGRATION gcs_storage_integration;

Snowflake will automatically create a GCP service account. We will then go back to Google Cloud to provision the necessary access for this service account.

  1. Navigate to GCS: Open the Google Cloud Console, go to your GCS bucket’s permissions page.
  2. Add a New Member: Use the STORAGE_GCP_SERVICE_ACCOUNT value as the new member’s identity.
  3. Assign Roles: Grant roles such as Storage Object Viewer for read access

C. Create External Stage in Snowflake

The external stage will allow the Snowflake database to ingest data from the external GCS location.

  1. Define the File Format (if not already defined):

CREATE OR REPLACE FILE FORMAT my_parquet_format
  TYPE = 'PARQUET';

  2. Create External Stage:

CREATE OR REPLACE STAGE my_external_stage
  URL = 'gcs://tempv2/'
  STORAGE_INTEGRATION = gcs_storage_integration
  FILE_FORMAT = my_parquet_format;

  3. Verify the external stage with the LIST command; the staged Parquet files should appear in the output.
  4. Create a table to load data into

We will create a table so that we can load data from the stage into it.

CREATE TABLE raw_ga4 (
  data VARIANT
);

  5. Load data from the stage

Finally, we can load data from the external stage into a Snowflake table using the COPY INTO command.

There are two primary ways to ingest data into a Snowflake database:

  1. With an upfront Schema
  2. Without an upfront Schema

In this case, we will ingest the data without having an upfront schema.

D. Loading data without Schema

Snowflake provides a VARIANT data type. It is used to store semi-structured data such as JSON, Avro, or Parquet. It is useful because it allows you to ingest and store data without needing to define a schema upfront. A VARIANT column can hold structured and semi-structured data in the same table, enabling you to flexibly query the data using standard SQL alongside Snowflake's semi-structured data functions.

Therefore, in step 4, we created a simple table with a single VARIANT column named data.

To load the data into our raw_ga4 table, we use the following command.

COPY INTO GA4ECOMMERCE.PUBLIC.RAW_GA4
FROM '@my_external_stage/'
FILE_FORMAT = (FORMAT_NAME = 'my_parquet_format');

This will load all files into the data column.

You can also view the data from the Data Preview tab of the RAW_GA4 table.


Implementing tests in DBT (Data Build Tool)

Data Build Tool (dbt) offers functionality for testing our data models to ensure their reliability and accuracy.

Overall, dbt testing helps achieve these objectives:

  • Improved Data Quality: By catching errors and inconsistencies early on, we prevent issues from propagating downstream to reports and dashboards.
  • Reliable Data Pipelines: Tests ensure that our data transformations work as expected, reducing the risk of regressions when code is modified.
  • Stronger Data Culture: A focus on testing instills a culture of data quality within the organization.

This article is a continuation of our previous hands-on implementation of a dbt model. In this article, we will explore the concept of tests in dbt and how we can implement them.

Data Build Tool (DBT) provides a comprehensive test framework to ensure data quality and reliability. Here’s a summary of tests that dbt supports.

Test Type | Subtype | Description | Usage | Implementation method
Generic Tests | Built-in | Pre-built, out-of-the-box tests for common data issues such as uniqueness, not-null, and referential integrity. | unique, not_null, relationships, accepted_values | schema.yml, using the tests: key under a model/column
Generic Tests | Custom | Custom tests written by users that can be applied to any column or model, similar to built-in tests but with custom logic. | Custom reusable tests with user-defined SQL logic. | 1. dbt automatically picks them up from tests/generic/test_name.sql, starting with {% test %}. 2. schema.yml, by defining the test in a .sql file and applying it in schema.yml. 3. macros: historically, this was the only place they could be defined.
Singular Tests | - | Designed to validate a single condition and not intended to be reusable, e.g. checking that total sales reported in a period matches the sum of individual sales records. | Custom SQL conditions specific to a model or field. | dbt automatically picks up tests from tests/test_name.sql
Source Freshness Tests | - | Tests that check the timeliness of the data in your sources by comparing the current timestamp with the last-updated timestamp. | dbt source freshness to monitor data update intervals. | schema.yml, using the freshness: key under a source/table
DBT testing framework

Implementing Built-In Tests

We will begin by writing generic tests. As we saw in the table above, there are two types: built-in and custom. We will start with a built-in test.

Built-in Test 1: unique

We will explore this test with our model DEV.DIM_LISTING_CLEANSED. The objective is to make sure that all values in the column LISTING_ID are unique.

To implement tests, we will create a new file, schema.yml, and define our tests in it.

version: 2

models:
  - name: dim_listings_cleansed
    columns: 

      - name: listing_id
        tests:
          - unique

When we now run the dbt test command in the terminal, we will see the test execute and pass.

Built-in Test 2: not_null

In the not_null example, we do not want any of the values in the minimum_nights column to be null. We will add another column name and test under the same hierarchy.

      - name: minimum_nights
        tests:
          - not_null

Now, when we execute dbt test, we will see another test added.

Built-in Test 3: accepted_values

In the accepted_values test, we will ensure that the column ROOM_TYPE in DEV.DIM_LISTING_CLEANSED can have preselected values only.

To add that test, we use the following entry in our schema.yml file under the model dim_listings_cleansed.

      - name: room_type
        tests:
          - accepted_values:
              values: ['Entire home/apt',
              'Private room',
              'Shared room',
              'Hotel room']

Now when we run dbt test, we can see another test executed, which checks whether the column contains only the predetermined values specified in accepted_values.

Built-in Test 4: relationships

In the relationships test, we will ensure that each HOST_ID in DEV.DIM_LISTING_CLEANSED has a reference in DEV.DIM_HOSTS_CLEANSED.

To implement this test, we can use the following code in our schema.yml file. We are creating a reference relationship to dim_hosts_cleansed based on the host_id field in that table. The entry will be:

      - name: host_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_hosts_cleansed')
              field: host_id

Now when we run dbt test, we see the fourth test added.

Implementing Singular Tests

A singular test consists of a SQL query that passes when no rows are returned and fails if the query returns any rows. It is used to validate assumptions about the data in a specific model. We implement a singular test by writing a SQL query in a .sql file in the tests folder, in this case tests/dim_listings_minimum_nights.sql.

In this test we will check whether any rows in the table have minimum_nights less than 1. If there are none, the query will return no rows and our test will pass.

Checking in Snowflake itself, we get zero rows.

Here is the test we have written in tests/dim_listings_minimum_nights.sql:

-- tests/dim_listings_minimum_nights.sql
SELECT *
FROM {{ ref('dim_listings_cleansed') }}
WHERE minimum_nights < 1

We can now implement the same test in dbt and run dbt test. In this case, I am executing this single test only.

Since the result set of the query was empty, the test passed. (In other words, we were checking for an exception; it did not occur, so the test passed.)

Implementing Custom Generic Tests

Custom generic tests are tests that can accept parameters. They are defined in a SQL file and identified by the {% test %} block at the beginning of the file. Let's write two tests using the two different methods supported by dbt.

Custom Generic Test with Macros

First, we create a .sql file under the macros folder. In this case we are creating macros/positive_value.sql. The test accepts two parameters and returns rows if the column passed as a parameter has a value of less than 1.

{% test positive_value(model, column_name) %}

SELECT *
FROM {{ model }}
WHERE {{ column_name }} < 1

{% endtest %}

After writing the test, we will specify the test within the schema.yml file.

To contrast built-in generic tests with custom generic tests, let's review the schema.yml configuration.

We have defined two tests on the minimum_nights column. The first was the built-in not_null test from the last section. In this section, we created a macro named positive_value(model, column_name). The macro accepts two parameters, both of which are passed from the schema.yml file: dbt passes the model under which the test is specified, and likewise passes the column_name under which the test is listed.

It is important to remember that, whether the test is defined as a macro or is built-in, the test passes only when no rows are returned. If any rows are returned, the test fails.

Custom Generic Test with generic subfolder

The other way of specifying a custom generic test is under tests/generic. Also remember that singular tests are created directly under the tests folder. Here's the difference:

  1. Singular tests are saved under tests/
  2. Custom generic tests are saved under tests/generic

Since it is a generic test, it starts with the {% test %} block. The code for our test is:

-- tests/generic/non_negative_value.sql

{% test non_negative_value(model, column_name) %}
SELECT
  *
FROM
  {{ model }}
WHERE
  {{ column_name }} < 0
{% endtest %}

I want to implement this test on listing_id in dim_listings_cleansed, and I also want the test to fail (just to clarify the concept). I will go to schema.yml and define the test under the listing_id column.

When I now run dbt test, dbt finds the test under tests/generic; however, the test fails because several rows are returned.

Conclusion

In this hands-on exploration, we implemented various DBT testing methods to ensure data quality and reliability. We implemented built-in schema tests for core data properties, created reusable custom generic tests, and wrote singular tests for specific conditions. By leveraging these techniques, we can establish a robust testing strategy that safeguards our data warehouse, promotes high-quality data, and strengthens the foundation for reliable data pipelines.