A vector in the context of NLP is a multi-dimensional array of numbers that represents linguistic units such as words, characters, sentences, or documents.
Motivation for Vectorisation
Machine learning algorithms require numerical inputs rather than raw text. Therefore, we first convert text into a numerical representation.
Tokens
The most primitive representation of language in NLP is a token. Tokenisation breaks raw text into atomic units – typically words, subwords or characters. These tokens form the basis of all downstream processing. A token is typically assigned a token ID, e.g. "cat" -> 310. However, tokens themselves carry no meaning unless they are transformed into numeric vectors.
Vectors
Although tokens and token IDs are numeric representations, they lack inherent meaning. To make these numbers meaningful mathematically, they are used as building blocks for vectors.
A vector is a mathematical representation in high-dimensional space, which means it carries more context than a bare number. As a simplified example, consider "cat" represented as a 5-dimensional vector.
"cat" → [0.62, -0.35, 0.12, 0.88, -0.22]
| Dimension | Value | Implied Meaning (not labeled in real models, just illustrative) |
|---|---|---|
| 1 | 0.62 | Animal-relatedness |
| 2 | -0.35 | Wild vs domestic (negative = domestic) |
| 3 | 0.12 | Size (positive = small) |
| 4 | 0.88 | Closeness to human-associated terms (like "pet", "owner", "feed") |
| 5 | -0.22 | Abstract vs concrete (negative = more physical/visible) |
Embeddings
If we consider our example of the word "cat", its embedding vector consists of values that are shaped by exposure to language data, such as frequent co-occurrence with words like "meow", "pet", and "kitten". This contextual usage informs how the embedding is constructed, positioning "cat" closer to semantically similar words in the vector space.
More broadly, while vectors provide a numeric way to represent tokens, embeddings are a specialised form of vector that is learned from data to capture linguistic meaning. Unlike sparse or manually defined vectors, embeddings are dense, low-dimensional, and trainable.
| Dimension | Value (Generic Vector) | Value (Embedding) | Implied Meaning (illustrative only) |
|---|---|---|---|
| 1 | 0.62 | 0.10 | Animal-relatedness |
| 2 | -0.35 | 0.05 | Wild vs domestic |
| 3 | 0.12 | -0.12 | Size |
| 4 | 0.88 | 0.02 | Closeness to human-associated terms (e.g., pet, owner) |
| 5 | -0.22 | -0.05 | Abstract vs concrete |
Learned Embedding vs Generic Vector for "cat"
Vectorisation Algorithms
Given below is a brief summary of major vectorisation algorithms and their timeline.
Early Algorithms: Sparse Representations
Traditional NLP approaches like Bag of Words (BoW) and TF-IDF relied on token-level frequency information. They represent each document as a high-dimensional vector based on token counts.
Bag of Words (BoW)
Represents a document by counting how often each token appears.
Treats all tokens independently; ignores order and meaning.
TF-IDF
Extends BoW by scaling down tokens that appear in many documents.
Aims to highlight unique or important tokens.
Output: still sparse and high-dimensional.
These approaches produce sparse vectors. As vocabulary size grows, vectors become inefficient and incapable of generalising across related words like "cat" and "feline."
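As a sketch of how these sparse representations are built, the following pure-Python example computes Bag-of-Words counts and a simple TF-IDF weighting over a toy corpus. The smoothing used in the IDF term is one of several common variants, chosen arbitrarily here:

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Build the vocabulary: one vector dimension per unique token.
vocab = sorted({token for doc in docs for token in doc.split()})

def bow_vector(doc):
    """Bag of Words: raw count of each vocabulary token in the document."""
    counts = Counter(doc.split())
    return [counts[token] for token in vocab]

def tfidf_vector(doc):
    """TF-IDF: term frequency scaled down by document frequency."""
    tokens = doc.split()
    counts = Counter(tokens)
    vec = []
    for token in vocab:
        tf = counts[token] / len(tokens)
        df = sum(1 for d in docs if token in d.split())
        idf = math.log(len(docs) / df) + 1  # smoothed variant (illustrative choice)
        vec.append(tf * idf)
    return vec

bow = bow_vector(docs[0])
```

Note that the vector length equals the vocabulary size, so it grows with every new word the corpus contains: the sparsity problem described above.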
Transition to Dense Vectors: Embeddings
To overcome the limitations of sparse representations, researchers introduced dense embeddings. These are fixed-size, real-valued vectors that place semantically similar words closer together in the vector space. Unlike count-based vectors, embeddings are learned through training on large corpora.
Early Embedding Algorithms – Dense Representation
Word2Vec (2013, Google – Mikolov)
Learns dense embeddings using a shallow neural network.
Words that appear in similar contexts get similar embeddings.
Two training strategies:
CBOW (Continuous Bag of Words): Predicts the target word from its surrounding context.
Skip-Gram: Predicts surrounding words from the target word.
Efficient training using negative sampling.
Limitation: Produces static embeddings. A word has one vector regardless of its context.
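The Skip-Gram training data itself is easy to picture: for each target word, the words nearby become its prediction targets. A minimal sketch of pair generation (the sentence and window size are arbitrary; real Word2Vec then trains a network on these pairs):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs as used by Skip-Gram.

    For each position, every token within `window` words on either side
    becomes a context word for the target at that position.
    """
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the cat sat on the mat".split()
pairs = skipgram_pairs(sentence, window=1)
```

CBOW uses the same windows in reverse: the context words jointly predict the target.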
GloVe (2014, Stanford)
Stands for Global Vectors.
Learns embeddings by factorising a global co-occurrence matrix.
Combines global corpus statistics with local context windows.
Strength: Captures broader semantic patterns than Word2Vec.
Limitation: Still produces static embeddings.
Embedding Algorithms – Contextual Embeddings
Even though Word2Vec and GloVe marked a huge advancement, they had a major drawback: they generate one embedding per token, regardless of context. For example, the word "bank" will have the same vector whether it refers to a financial institution or a riverbank.
This limitation led to contextual embeddings such as:
ELMo (Embeddings from Language Models): Learns context from both directions using RNNs.
BERT (Bidirectional Encoder Representations from Transformers): Uses transformers to generate context-aware embeddings where each token’s representation changes depending on its surrounding words.
Spark provides three abstractions for handling data
RDDs
Distributed collections of objects that can be cached in memory across cluster nodes (e.g., a large array can be partitioned across multiple nodes).
DataFrame
DataFrames are distributed collections of data organized into named columns, similar to tables in a relational database. They provide a powerful abstraction that supports structured and semi-structured data with optimized execution through the Catalyst optimizer.
DataFrames are schema-aware and can be created from various data sources including structured files (CSV, JSON), Hive tables, or external databases, offering SQL-like operations for data manipulation and analysis.
Dataset
Datasets are a type-safe, object-oriented programming interface that provides the benefits of RDDs (static typing and lambda functions) while also leveraging Spark SQL's optimized execution engine.
They offer a unique combination of type safety and ease of use, making them particularly useful for applications where type safety is important and the data fits into well-defined schemas using case classes in Scala or Java beans.
Comparison of Spark Data Abstractions
| Feature | RDD | DataFrame | Dataset |
|---|---|---|---|
| Type Safety | Type-safe | Not type-safe | Type-safe |
| Schema | No schema | Schema-based | Schema-based |
| API Style | Functional API | Domain-specific language (DSL) | Both functional and DSL |
| Optimization | Basic | Catalyst Optimizer | Catalyst Optimizer |
| Memory Usage | High | Efficient | Moderate |
| Serialization | Java serialization | Custom encoders | Custom encoders |
| Language Support | All Spark languages | All Spark languages | Scala and Java only |
```mermaid
graph LR
    A["Data Storage in Spark"]
    A --> B["RDD"]
    A --> C["DataFrame"]
    A --> D["Dataset"]
    B --> B1["Raw Java/Scala Objects"]
    B --> B2["Distributed Collection"]
    B --> B3["No Schema Information"]
    C --> C1["Row Objects"]
    C --> C2["Schema-based Structure"]
    C --> C3["Column Names & Types"]
    D --> D1["Typed JVM Objects"]
    D --> D2["Schema + Type Information"]
    D --> D3["Strong Type Safety"]
    style B fill:#f9d6d6
    style C fill:#d6e5f9
    style D fill:#d6f9d6
```
This diagram illustrates how data is stored in different Spark abstractions:
RDD stores data as raw Java/Scala objects with no schema information
DataFrame organizes data in rows with defined column names and types
Dataset combines schema-based structure with strong type safety using JVM objects
The Snowflake object model is a hierarchical framework that organizes and manages data within the Snowflake cloud data platform. An "object" refers to a logical container or structure that is used to either
Store data,
Organize data, or
Manage data.
From the top-level organization and account objects down to the granular elements like tables and views, the Snowflake object model provides a structured framework for data storage, access, and security. The following is a detailed overview of the key objects in the Snowflake object model and their respective functions.
Organisation
In Snowflake, an organisation is a top-level entity that groups together related accounts, providing a way to manage billing, usage, and access at a higher level.
Example: A multinational corporation might have a separate Snowflake organisation for each region it operates in, with individual accounts for each country.
Account
An account in Snowflake represents an independent environment with its own set of users, databases, and resources. It is the primary unit of access and billing.
Example: A retail company might have a Snowflake account dedicated to its e-commerce analytics, separate from other accounts used for different business functions.
Role
A role in Snowflake is a collection of permissions that define what actions a user or group of users can perform. Roles are used to enforce security and access control.
Example: A “Data Analyst” role might have permissions to query and view data in specific databases and schemas but not to modify or delete data.
User
A user in Snowflake is an individual or service that interacts with the platform, identified by a unique username. Users are assigned roles that determine their access and capabilities.
Example: A user named “john.doe” might be a data scientist with access to analytical tools and datasets within the Snowflake environment.
Share
A share in Snowflake is a mechanism for securely sharing data between different accounts or organisations. It allows for controlled access to specific objects without copying or moving the data.
Example: A company might create a share to provide its partner with read-only access to a specific dataset for collaboration purposes.
Network Policy
A network policy in Snowflake is a set of rules that define allowed IP addresses or ranges for accessing the Snowflake account, enhancing security by restricting access to authorized networks.
Example: A financial institution might configure a network policy to allow access to its Snowflake account only from its corporate network.
Warehouse
In Snowflake, a warehouse is a cluster of compute resources used for executing data processing tasks such as querying and loading data. Warehouses can be scaled up or down to manage performance and cost.
Example: A marketing team might use a small warehouse for routine reporting tasks and a larger warehouse for more intensive data analysis during campaign launches.
Resource Monitor
A resource monitor in Snowflake is a tool for tracking and controlling the consumption of compute resources. It can be used to set limits and alerts to prevent overspending.
Example: A company might set up a resource monitor to ensure that its monthly compute costs do not exceed a predetermined budget.
Database
A database in Snowflake is a collection of schemas and serves as the primary container for storing and organizing data. It is similar to a database in traditional relational database systems.
Example: A healthcare organization might have a database called “PatientRecords” that contains schemas for different types of medical data.
Schema
A schema in Snowflake is a logical grouping of database objects such as tables, views, and functions. It provides a way to organize and manage related objects within a database.
Example: In a “Sales” database, there might be a schema called “Transactions” that contains tables for sales orders, invoices, and payments.
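To make the database/schema/table hierarchy concrete, here is a hedged SQL sketch based on the "Sales" example above. All object names and columns are illustrative assumptions:

```sql
-- Database -> schema -> table hierarchy from the "Sales" example.
-- All object and column names are illustrative.
CREATE DATABASE IF NOT EXISTS sales;
CREATE SCHEMA IF NOT EXISTS sales.transactions;
CREATE TABLE IF NOT EXISTS sales.transactions.sales_orders (
    order_id     NUMBER,
    customer_id  NUMBER,
    order_total  NUMBER(10, 2),
    ordered_at   TIMESTAMP_NTZ
);
```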
UDF (User-Defined Function)
A UDF in Snowflake is a custom function created by users to perform specific operations or calculations that are not available as built-in functions.
Example: A retail company might create a UDF to calculate the total sales tax for an order based on different tax rates for each product category.
Task
A task in Snowflake is a scheduled object that automates the execution of SQL statements, including data loading, transformation, and other maintenance operations.
Example: A data engineering team might set up a task to automatically refresh a materialized view every night at midnight.
Pipe
A pipe in Snowflake is an object used for continuous data ingestion from external sources into Snowflake tables. It processes and loads streaming data in near real-time.
Example: A streaming service might use a pipe to ingest real-time user activity data into a Snowflake table for analysis.
Procedure
A procedure in Snowflake is a stored sequence of SQL statements that can be executed as a single unit. It is used to encapsulate complex business logic and automate repetitive tasks.
Example: A finance team might create a procedure to generate monthly financial reports by aggregating data from various sources and applying specific calculations.
Stages
In Snowflake, stages are objects used to stage data files before loading them into tables. They can be internal (managed by Snowflake) or external (located in cloud storage).
Example: A data integration process might use a stage to temporarily store CSV files before loading them into a Snowflake table for analysis.
External Stage
An external stage in Snowflake is a reference to a location in cloud storage (such as Amazon S3, Google Cloud Storage, or Azure Blob Storage) where data files are staged before loading.
Example: A company might use an external stage pointing to an Amazon S3 bucket to stage log files before loading them into Snowflake for analysis.
Internal Stage
An internal stage in Snowflake is a built-in storage location managed by Snowflake for staging data files before loading them into tables.
Example: An analytics team might use an internal stage to temporarily store JSON files before transforming and loading them into a Snowflake table for analysis.
Table
A table in Snowflake is a structured data object that stores data in rows and columns. Tables can be of different types, such as permanent, temporary, or external.
Example: A logistics company might have a permanent table called “Shipments” that stores detailed information about each shipment, including origin, destination, and status.
External Tables
External tables in Snowflake are tables that reference data stored in external stages, allowing for querying data directly from cloud storage without loading it into Snowflake.
Example: A data science team might use external tables to query large datasets stored in Amazon S3 without importing the data into Snowflake, saving storage costs.
Transient Tables
Transient tables in Snowflake are similar to permanent tables but with a shorter lifespan and lower storage costs. They are suitable for temporary or intermediate data.
Example: During a data transformation pipeline, a transient table might be used to store intermediate results that are needed for a short period before being discarded.
Temporary Tables
Temporary tables in Snowflake are session-specific tables that are automatically dropped at the end of the session. They are useful for temporary calculations or intermediate steps.
Example: In an ad-hoc analysis session, a data analyst might create a temporary table to store query results for further exploration without affecting the permanent dataset.
Permanent Tables
Permanent tables in Snowflake are tables that persist data indefinitely and are the default table type for long-term data storage.
Example: A financial institution might use permanent tables to store historical transaction data for compliance and reporting purposes.
View
A view in Snowflake is a virtual table that is defined by a SQL query. Views can be standard, secured, or materialized, each serving different purposes.
Example: A sales dashboard might use a view to present aggregated sales data by region and product category, based on a query that joins multiple underlying tables.
Secured Views
Secured views in Snowflake are views that enforce column-level security, ensuring that sensitive data is only visible to authorized users.
Example: In a multi-tenant application, a secured view might be used to ensure that each tenant can only see their own data, even though the underlying table contains data for all tenants.
Standard Views
Standard views in Snowflake are the default view type, providing a simple way to create a virtual table based on a SQL query without any additional security features.
Example: A marketing team might use a standard view to create a simplified representation of a complex query that combines customer data.
Materialized Views
Materialized views in Snowflake are views that store the result set of the query physically, providing faster access to precomputed data.
Example: To speed up reporting on large datasets, a data warehouse might use materialized views to pre-aggregate daily sales data by store and product category.
In a previous article, we explored how to export data from GA4 to BigQuery. In instances where we want to migrate data from BigQuery to another platform like Snowflake, BigQuery offers a few options.
BigQuery Export options
Explore with Sheets:
Directly analyze and visualize your BigQuery data using Google Sheets.
Explore with Looker Studio:
Utilize Looker Studio (formerly Data Studio) for more advanced data exploration and interactive dashboard creation.
Export to GCS:
Save BigQuery datasets to Google Cloud Storage for storage or further processing with other tools.
Scan with Sensitive Data Protection:
Check your datasets for sensitive information before exporting, to ensure compliance and data privacy.
In our case, since we want to export the Google Analytics 4 data into Snowflake, we will need to first export it to Google Cloud Storage (GCS). From this storage, we can then ingest the data into Snowflake.
To understand the process flow, here is what we will be doing.
GA4 -> BigQuery -> Google Cloud Storage -> Snowflake
A. Exporting from BigQuery to GCS
Deciding on Export Format
Before exporting, we need to decide on the format in which data will be exported for consumption. You can choose any format from CSV, JSON, Avro and Parquet depending on the use case; we will go with Parquet in this example. A brief comparison of these four data formats is given in the table below.
| Feature | CSV | JSON | Avro | Parquet |
|---|---|---|---|---|
| Data Structure | Flat | Hierarchical | Hierarchical | Columnar |
| Readability | High (text-based) | High (text-based) | Low (binary) | Low (binary) |
| File Size | Medium | Large | Small | Small |
| Performance | Low | Medium | High | Very high |
| Schema Evolution | Not supported | Not supported | Supported | Supported |
| Use Case | Simple analytics | Web APIs, complex data | Long-term storage, evolving schemas | Analytical queries, large datasets |
| Compression | Low | Medium | High | Very high |
Why Parquet?
Here's a brief summary of why we chose Parquet for exporting GA4 data from BigQuery.
Columnar Efficiency
We benefit from Parquet’s columnar storage, optimizing our query execution by accessing only the necessary data.
Cost-Effective Storage
Our expenditure on data storage is minimized due to Parquet’s superior compression and encoding capabilities.
Support for Hierarchical Data
It supports our GA4 hierarchical data structures, ensuring the integrity of our analytical insights.
Seamless Integration
We utilize Snowflake’s native support for Parquet for straightforward data processing.
Schema Evolution Support
Since GA4 is in its early stage and new features keep on coming, we can gracefully manage changes in our data schema, avoiding costly downtime and data migrations.
Exporting Single Table
Clicking on the Export -> Export to GCS option will give us an option box to pick export format and compression. I have also specified a GCS storage location to store the export.
Exporting Multiple tables
Visual interface only allows export of a single table. Google Analytics 4, however, stores each day’s data separately as a single table. Therefore, we will have to find an alternative to visual export.
Shell script for Multiple table export
We can write a shell script which can export all our tables into our bucket. At a high level, we want our script to do the following :
Set Parameters: Define the dataset, table prefix, and GCS bucket.
List Tables: Use BigQuery to list all events_ tables in the dataset.
Export Tables: Loop through each table and export it to the designated GCS bucket as a Parquet file.
Here’s what the exact script looks like
#!/bin/bash
# Define your dataset and table prefix
DATASET="bigquery-public-data:ga4_obfuscated_sample_ecommerce"
TABLE_PREFIX="events_"
BUCKET_NAME="tempv2"
# Get the list of tables using the bq ls command, filtering with grep for your table prefix
TABLES=$(bq ls --max_results 1000 $DATASET | grep $TABLE_PREFIX | awk '{print $1}')
# Loop through the tables and run the bq extract command on each one
for TABLE in $TABLES
do
bq extract --destination_format=PARQUET $DATASET.$TABLE gs://$BUCKET_NAME/${TABLE}.parquet
done
Save the script and give it a name; I named it export_tables.sh. Make it executable with chmod +x export_tables.sh.
Execute the shell script with ./export_tables.sh
If everything works correctly, you will start to see output:
You can check whether data has been exported by inspecting the contents of the storage bucket.
Allow appropriate access so that you can read the data in Snowflake. You can do this by opening the bucket > Permissions and then clicking Grant Access.
In this example, I have granted access to allUsers. This will make the bucket readable publicly.
To ingest the data from Google Cloud Storage into Snowflake, we will create a storage integration between GCS and Snowflake and then create an external stage. The storage integration streamlines the authentication flow between GCS and Snowflake; the external stage allows the Snowflake database to ingest the data.
B. Create Storage Integration in Snowflake
The storage integration in Snowflake will create a Snowflake-managed GCP service account. We will then provision access to that service account from GCP.
Log into Snowflake: Open the Snowflake Web UI and switch to the role with privileges to create storage integrations (e.g., ACCOUNTADMIN).
Snowflake will automatically create a GCP service account. We will then go back to Google Cloud to provision the necessary access for this service account.
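The integration itself is created with a statement along these lines. The integration name matches the one the external stage references later, and the bucket path is from our example; treat the exact values as assumptions for your environment:

```sql
-- Run as a role with CREATE INTEGRATION privilege (e.g., ACCOUNTADMIN).
CREATE STORAGE INTEGRATION gcs_storage_integration
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'GCS'
  ENABLED = TRUE
  STORAGE_ALLOWED_LOCATIONS = ('gcs://tempv2/');

-- Retrieve the service account Snowflake created for this integration;
-- the STORAGE_GCP_SERVICE_ACCOUNT property is what we grant access to in GCP.
DESC STORAGE INTEGRATION gcs_storage_integration;
```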
Navigate to GCS: Open the Google Cloud Console, go to your GCS bucket’s permissions page.
Add a New Member: Use the STORAGE_GCP_SERVICE_ACCOUNT value as the new member’s identity.
Assign Roles: Grant roles such as Storage Object Viewer for read access
C. Create External Stage in Snowflake
The external stage will allow the Snowflake database to ingest data from the external GCS source.
Define the File Format (if not already defined):
CREATE OR REPLACE FILE FORMAT my_parquet_format
TYPE = 'PARQUET';
Create External Stage
CREATE OR REPLACE STAGE my_external_stage
URL = 'gcs://tempv2/'
STORAGE_INTEGRATION = gcs_storage_integration
FILE_FORMAT = my_parquet_format;
Verify the external stage with the LIST command; you should see the staged files in the output.
Create table to load data into
We will create a table so that we can load data from the stage into it.
create table raw_ga4 (
data VARIANT
)
Load data from stage
Finally, we can load data from the external stage into the Snowflake database table using the COPY INTO command.
There are two primary ways to ingest data into a Snowflake database:
With an upfront Schema
Without an upfront Schema
In this case, we will ingest the data without having an upfront schema.
D. Loading data without Schema
Snowflake provides a VARIANT data type. It is used to store semi-structured data such as JSON, Avro or Parquet. It's useful because it allows you to ingest and store data without needing to define a schema upfront. A VARIANT column can hold structured and semi-structured data in the same table, enabling you to query the data flexibly using standard SQL alongside Snowflake's powerful semi-structured data functions.
Therefore, in the table-creation step above, we created a simple table with a single VARIANT column named data.
To load the data into our raw_ga4 table, we use the following command.
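A minimal form of that command, using the stage, table, and file format created earlier, might look like the sketch below. Each Parquet record lands as one VARIANT value; the file pattern is an assumption based on the GA4 table names and should be adjusted to your bucket contents:

```sql
COPY INTO raw_ga4
  FROM @my_external_stage
  PATTERN = '.*events_.*[.]parquet'
  FILE_FORMAT = (FORMAT_NAME = 'my_parquet_format');
```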
Data Build Tool (DBT) offers functionality for testing our data models to ensure their reliability and accuracy. Overall, dbt testing helps achieve these objectives:
Improved Data Quality: By catching errors and inconsistencies early on, we prevent issues from propagating downstream to reports and dashboards.
Reliable Data Pipelines: Tests ensure that our data transformations work as expected, reducing the risk of regressions when code is modified.
Stronger Data Culture: A focus on testing instills a culture of data quality within an organization.
This article is a continuation of the previous hands-on implementation of a DBT model. In this article, we will explore the concept of tests in DBT and how we can implement them.
Data Build Tool (DBT) provides a comprehensive test framework to ensure data quality and reliability. Here’s a summary of tests that dbt supports.
| Test Type | Subtype | Description | Usage | Implementation method |
|---|---|---|---|---|
| Generic Tests | Built-in | Pre-built, out-of-the-box tests for common data issues such as uniqueness, not-null, and referential integrity. | unique, not_null, relationships, accepted_values | schema.yml, using the tests: key under a model/column |
| Generic Tests | Custom | Custom tests written by users that can be applied to any column or model, similar to built-in tests but with custom logic. | Custom reusable tests with user-defined SQL logic. | 1. dbt automatically picks them up from tests/generic/test_name.sql files starting with {% test %}. 2. Defined in a .sql file and applied in schema.yml. 3. macros: historically, this was the only place they could be defined. |
| Singular Tests | | Designed to validate a single condition and not intended to be reusable, e.g. check that total sales reported in a period match the sum of individual sales records. | Custom SQL conditions specific to a model or field. | dbt automatically picks up the test from tests/test_name.sql |
| Source Freshness Tests | | Tests that check the timeliness of data in your sources by comparing the current timestamp with the last-updated timestamp. | dbt source freshness to monitor data update intervals. | schema.yml, using the freshness: key under a source/table |
DBT testing framework
Implementing Built-In Tests
We will begin by writing generic tests. As we saw in the table above, there are two types: built-in and custom. We will begin with a built-in test.
Built-in Test 1: unique
We will explore this test with our model DEV.DIM_LISTING_CLEANSED. The objective is to make sure that all values in the column LISTING_ID are unique.
To implement the tests, we will create a new file, schema.yml, and define our tests in it.
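A schema.yml for this first test might look like the sketch below. The model and column names come from our example; the surrounding file layout (version key, nesting) is the standard dbt convention:

```yaml
version: 2

models:
  - name: dim_listings_cleansed
    columns:
      - name: listing_id
        tests:
          - unique
```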
When we now run the dbt test command in the terminal, we will see the test execute and pass.
Built-in Test 2: not_null
In the not_null example, we do not want any of the values in the minimum_nights column to be null. We will add another column name and test under the same hierarchy.
      - name: minimum_nights
        tests:
          - not_null
Now, when we execute dbt test, we will see another test added
Built-in Test 3: accepted_values
In the accepted_values test, we will ensure that the column ROOM_TYPE in DEV.DIM_LISTING_CLEANSED can have preselected values only.
To add that test, we use the following entry in our schema.yml file under the model dim_listings_cleansed.
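The entry could look like the following. The accepted room types listed here are assumptions for illustration, so substitute the actual values present in your data:

```yaml
      - name: room_type
        tests:
          - accepted_values:
              values: ['Entire home/apt', 'Private room', 'Shared room', 'Hotel room']
```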
Now when we run dbt test, we can see another test executed, which checks whether the column contains only the predetermined values specified in accepted_values.
Built-in Test 4: relationships
In the relationships test, we will ensure that each id in HOST_ID in DEV.DIM_LISTING_CLEANSED has a reference in DEV.DIM_HOSTS_CLEANSED.
To implement this test, we can use the following code in our schema.yml file, since we are creating a reference relationship to dim_hosts_cleansed based on the field host_id in that table. The entry will be:
      - name: host_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_hosts_cleansed')
              field: host_id
Now when we run dbt test, we see the fourth test added.
Implementing Singular Tests
A singular test consists of a SQL query which passes when no rows are returned; the test fails if the query returns results. It is used to validate assumptions about the data in a specific model. We implement singular tests by writing a SQL query in a .sql file in the tests folder, in this case dim_listings_minimum_nights.sql.
In the test, we will check whether there are any rows with minimum_nights below 1 in the table. If there are none, the query will return no rows and our test will pass.
Checking in Snowflake itself, we get zero rows:
The test we have written in tests/dim_listings_minimum_nights.sql
-- tests/dim_listings_minimum_nights.sql
SELECT *
FROM {{ ref('dim_listings_cleansed') }}
WHERE minimum_nights < 1
We can now implement the same test in dbt and run dbt test. In this case, I am executing a single test only, for which the output will be:
Since the result set of the query was empty, the test passed. (In other words, we were checking for an exception; it did not occur, therefore the test passed.)
Implementing Custom Generic Tests
Custom generic tests are tests which can accept parameters. They are defined in a SQL file and are identified by the {% test %} tag at the beginning of the file. Let's write two tests using two different methods supported by DBT.
Custom Generic Test with Macros
First, we create a .sql file under the macros folder, in this case macros/positive_value.sql. The test accepts two parameters and returns rows if the column passed in has a value of less than 1.
{% test positive_value(model, column_name) %}
SELECT
    *
FROM
    {{ model }}
WHERE
    {{ column_name }} < 1
{% endtest %}
After writing the test, we will specify the test within schema.yml file
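Applying the custom test looks the same as applying a built-in one. A sketch of the relevant schema.yml fragment (indentation assumes the same models/columns layout as before):

```yaml
      - name: minimum_nights
        tests:
          - not_null
          - positive_value
```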
To contrast the difference between a generic built-in test and a generic custom test, let's review the schema.yml configuration.
We have defined two tests on the column minimum_nights. The first was the built-in not_null test from the last section. In this section, we created a macro by the name of positive_value(model, column_name). The macro accepts two parameters, both passed from the schema.yml file: dbt looks up the model under which the test is specified and passes it as model, and similarly passes the column under which the test is mentioned as column_name.
It is important to remember that whether the test is defined in macros or is built-in, the test passes only when no rows are returned. If any rows are returned, the test fails.
Custom Generic Test with generic subfolder
The other way of specifying a custom generic test is under tests/generic. Also remember that we create singular tests directly under the tests folder. Here's a visual indication of the difference:
Singular tests are saved under tests/
Custom generic tests are saved under tests/generic
Since it's a generic test, it will start with the {% test %} tag. The code for our test is:
-- tests/generic/non_negative_value.sql
{% test non_negative_value(model, column_name) %}
SELECT
*
FROM
{{ model }}
WHERE
{{ column_name }} < 0
{% endtest %}
I want to implement this test on listing_id in dim_listings_cleansed, and I also want the test to fail (just to clarify the concept). I will go to schema.yml and define the test under the column listing_id.
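The corresponding schema.yml fragment could be as simple as the sketch below (indentation assumes the same models/columns layout used for the earlier tests):

```yaml
      - name: listing_id
        tests:
          - non_negative_value
```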
When I now run dbt test, DBT finds the test under tests/generic; however, the test fails because several rows are returned.
Conclusion
In this hands-on exploration, we implemented various DBT testing methods to ensure data quality and reliability. We implemented built-in schema tests for core data properties, created reusable custom generic tests, and wrote singular tests for specific conditions. By leveraging these techniques, we can establish a robust testing strategy that safeguards our data warehouse, promotes high-quality data, and strengthens the foundation for reliable data pipelines.