The Snowflake object model is a hierarchical framework that organizes and manages data within the Snowflake cloud data platform. An “object” itself refers to a logical container or structure that is used to either
Store data,
Organize data, or
Manage data.
From the top-level organization and account objects down to the granular elements like tables and views, the Snowflake object model provides a structured framework for data storage, access, and security. The following is a detailed overview of the key objects in the Snowflake object model and their respective functions.
Organisation
In Snowflake, an organisation is a top-level entity that groups together related accounts, providing a way to manage billing, usage, and access at a higher level.
Example: A multinational corporation might have a separate Snowflake organisation for each region it operates in, with individual accounts for each country.
Account
An account in Snowflake represents an independent environment with its own set of users, databases, and resources. It is the primary unit of access and billing.
Example: A retail company might have a Snowflake account dedicated to its e-commerce analytics, separate from other accounts used for different business functions.
Role
A role in Snowflake is a collection of permissions that define what actions a user or group of users can perform. Roles are used to enforce security and access control.
Example: A “Data Analyst” role might have permissions to query and view data in specific databases and schemas but not to modify or delete data.
User
A user in Snowflake is an individual or service that interacts with the platform, identified by a unique username. Users are assigned roles that determine their access and capabilities.
Example: A user named “john.doe” might be a data scientist with access to analytical tools and datasets within the Snowflake environment.
Share
A share in Snowflake is a mechanism for securely sharing data between different accounts or organisations. It allows for controlled access to specific objects without copying or moving the data.
Example: A company might create a share to provide its partner with read-only access to a specific dataset for collaboration purposes.
Network Policy
A network policy in Snowflake is a set of rules that define allowed IP addresses or ranges for accessing the Snowflake account, enhancing security by restricting access to authorized networks.
Example: A financial institution might configure a network policy to allow access to its Snowflake account only from its corporate network.
Warehouse
In Snowflake, a warehouse is a cluster of compute resources used for executing data processing tasks such as querying and loading data. Warehouses can be scaled up or down to manage performance and cost.
Example: A marketing team might use a small warehouse for routine reporting tasks and a larger warehouse for more intensive data analysis during campaign launches.
Resource Monitor
A resource monitor in Snowflake is a tool for tracking and controlling the consumption of compute resources. It can be used to set limits and alerts to prevent overspending.
Example: A company might set up a resource monitor to ensure that its monthly compute costs do not exceed a predetermined budget.
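As a hedged sketch of how that could look (the monitor name, quota, and warehouse name below are made up for illustration, not taken from the article), a resource monitor with a monthly credit quota and a suspend trigger can be created and attached like this:
CREATE RESOURCE MONITOR monthly_budget
  WITH CREDIT_QUOTA = 100              -- illustrative monthly credit limit
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS
    ON 90 PERCENT DO NOTIFY            -- alert before the budget is exhausted
    ON 100 PERCENT DO SUSPEND;         -- suspend the assigned warehouse at the limit

-- Attach the monitor to a (hypothetical) warehouse
ALTER WAREHOUSE my_wh SET RESOURCE_MONITOR = monthly_budget;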
Database
A database in Snowflake is a collection of schemas and serves as the primary container for storing and organizing data. It is similar to a database in traditional relational database systems.
Example: A healthcare organization might have a database called “PatientRecords” that contains schemas for different types of medical data.
Schema
A schema in Snowflake is a logical grouping of database objects such as tables, views, and functions. It provides a way to organize and manage related objects within a database.
Example: In a “Sales” database, there might be a schema called “Transactions” that contains tables for sales orders, invoices, and payments.
UDF (User-Defined Function)
A UDF in Snowflake is a custom function created by users to perform specific operations or calculations that are not available as built-in functions.
Example: A retail company might create a UDF to calculate the total sales tax for an order based on different tax rates for each product category.
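As a hedged illustration of the idea (the function and column names are hypothetical, and a real tax calculation would be more involved), a simple SQL UDF might look like this:
-- Hypothetical SQL UDF: total including tax for a given amount and rate
CREATE OR REPLACE FUNCTION total_with_tax(amount NUMBER(10,2), tax_rate NUMBER(5,4))
  RETURNS NUMBER(10,2)
  AS 'amount * (1 + tax_rate)';

-- Example usage
SELECT total_with_tax(100.00, 0.10);   -- returns 110.00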
Task
A task in Snowflake is a scheduled object that automates the execution of SQL statements, including data loading, transformation, and other maintenance operations.
Example: A data engineering team might set up a task to automatically refresh a materialized view every night at midnight.
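For example, a task along the following lines could run a refresh statement every night at midnight; the warehouse, table, and query here are placeholders rather than objects from the article:
CREATE OR REPLACE TASK nightly_refresh
  WAREHOUSE = compute_wh                         -- compute used to run the task
  SCHEDULE = 'USING CRON 0 0 * * * UTC'          -- every night at midnight UTC
AS
  INSERT INTO daily_sales_summary
  SELECT order_date, SUM(amount) AS total_amount
  FROM sales_orders
  GROUP BY order_date;

-- Tasks are created suspended and must be resumed before they start running
ALTER TASK nightly_refresh RESUME;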
Pipe
A pipe in Snowflake is an object used for continuous data ingestion from external sources into Snowflake tables. It processes and loads streaming data in near real-time.
Example: A streaming service might use a pipe to ingest real-time user activity data into a Snowflake table for analysis.
Procedure
A procedure in Snowflake is a stored sequence of SQL statements that can be executed as a single unit. It is used to encapsulate complex business logic and automate repetitive tasks.
Example: A finance team might create a procedure to generate monthly financial reports by aggregating data from various sources and applying specific calculations.
Stages
In Snowflake, stages are objects used to stage data files before loading them into tables. They can be internal (managed by Snowflake) or external (located in cloud storage).
Example: A data integration process might use a stage to temporarily store CSV files before loading them into a Snowflake table for analysis.
External Stage
An external stage in Snowflake is a reference to a location in cloud storage (such as Amazon S3, Google Cloud Storage, or Azure Blob Storage) where data files are staged before loading.
Example: A company might use an external stage pointing to an Amazon S3 bucket to stage log files before loading them into Snowflake for analysis.
Internal Stage
An internal stage in Snowflake is a built-in storage location managed by Snowflake for staging data files before loading them into tables.
Example: An analytics team might use an internal stage to temporarily store JSON files before transforming and loading them into a Snowflake table for analysis.
Table
A table in Snowflake is a structured data object that stores data in rows and columns. Tables can be of different types, such as permanent, temporary, or external.
Example: A logistics company might have a permanent table called “Shipments” that stores detailed information about each shipment, including origin, destination, and status.
External Tables
External tables in Snowflake are tables that reference data stored in external stages, allowing for querying data directly from cloud storage without loading it into Snowflake.
Example: A data science team might use external tables to query large datasets stored in Amazon S3 without importing the data into Snowflake, saving storage costs.
Transient Tables
Transient tables in Snowflake are similar to permanent tables but with a shorter lifespan and lower storage costs. They are suitable for temporary or intermediate data.
Example: During a data transformation pipeline, a transient table might be used to store intermediate results that are needed for a short period before being discarded.
Temporary Tables
Temporary tables in Snowflake are session-specific tables that are automatically dropped at the end of the session. They are useful for temporary calculations or intermediate steps.
Example: In an ad-hoc analysis session, a data analyst might create a temporary table to store query results for further exploration without affecting the permanent dataset.
Permanent Tables
Permanent tables in Snowflake are tables that persist data indefinitely and are the default table type for long-term data storage.
Example: A financial institution might use permanent tables to store historical transaction data for compliance and reporting purposes.
View
A view in Snowflake is a virtual table that is defined by a SQL query. Views can be standard, secured, or materialized, each serving different purposes.
Example: A sales dashboard might use a view to present aggregated sales data by region and product category, based on a query that joins multiple underlying tables.
Secured Views
Secured views in Snowflake are views that hide their definition and restrict exposure of the underlying data, ensuring that sensitive data is only visible to authorized users.
Example: In a multi-tenant application, a secured view might be used to ensure that each tenant can only see their own data, even though the underlying table contains data for all tenants.
Standard Views
Standard views in Snowflake are the default view type, providing a simple way to create a virtual table based on a SQL query without any additional security features.
Example: A marketing team might use a standard view to create a simplified representation of a complex query that combines customer data from multiple underlying tables.
Materialized Views
Materialized views in Snowflake are views that store the result set of the query physically, providing faster access to precomputed data.
Example: To speed up reporting on large datasets, a data warehouse might use materialized views to pre-aggregate daily sales data by store and product category.
In a previous article, we have already explored how to export data from GA4 to BigQuery. In instances where we want to migrate data from BigQuery to another platform like Snowflake, BigQuery offers a few options.
BigQuery Export options
Explore with Sheets:
Directly analyze and visualize your BigQuery data using Google Sheets.
Explore with Looker Studio:
Utilize Looker Studio (formerly Data Studio) for more advanced data exploration and interactive dashboard creation.
Export to GCS:
Save BigQuery datasets to Google Cloud Storage for storage or further processing with other tools.
Scan with Sensitive Data Protection:
Check your datasets for sensitive information before exporting, to ensure compliance and data privacy.
In our case, since we want to export the Google Analytics 4 data into Snowflake, we will need to first export it to Google Cloud Storage (GCS). From this storage, we can then ingest the data into Snowflake.
To understand the process flow, here is what we will be doing.
GA4 -> BigQuery -> Google Cloud Storage -> Snowflake
A. Exporting from BigQuery to GCS
Deciding on Export Format
Before exporting, we want to decide on the format in which data will be exported for consumption. You can choose from CSV, JSON, Avro, or Parquet depending on the use case; we will go with Parquet in this example. A brief comparison of these four data formats is given in the table below.
| Feature | CSV | JSON | Avro | Parquet |
| --- | --- | --- | --- | --- |
| Data Structure | Flat | Hierarchical | Hierarchical | Columnar |
| Readability | High (text-based) | High (text-based) | Low (binary) | Low (binary) |
| File Size | Medium | Large | Small | Small |
| Performance | Low | Medium | High | Very High |
| Schema Evolution | Not supported | Not supported | Supported | Supported |
| Use Case | Simple analytics | Web APIs, complex data | Long-term storage, evolving schemas | Analytical queries, large datasets |
| Compression | Low | Medium | High | Very High |
Why Parquet?
Here's a brief summary of why we chose Parquet for exporting GA4 data from BigQuery.
Columnar Efficiency
We benefit from Parquet’s columnar storage, optimizing our query execution by accessing only the necessary data.
Cost-Effective Storage
Our expenditure on data storage is minimized due to Parquet’s superior compression and encoding capabilities.
Support for Hierarchical Data
It supports our GA4 hierarchical data structures, ensuring the integrity of our analytical insights.
Seamless Integration
We utilize Snowflake’s native support for Parquet for straightforward data processing.
Schema Evolution Support
Since GA4 is in its early stage and new features keep on coming, we can gracefully manage changes in our data schema, avoiding costly downtime and data migrations.
Exporting Single Table
Clicking on Export -> Export to GCS opens a dialog to pick the export format and compression. I have also specified a GCS storage location to store the export.
Exporting Multiple tables
The visual interface only allows exporting a single table. Google Analytics 4, however, stores each day's data as a separate table, so we will have to find an alternative to the visual export.
Shell script for Multiple table export
We can write a shell script that exports all our tables into our bucket. At a high level, we want the script to do the following:
Set Parameters: Define the dataset, table prefix, and GCS bucket.
List Tables: Use BigQuery to list all events_ tables in the dataset.
Export Tables: Loop through each table and export it to the designated GCS bucket as a Parquet file.
Here’s what the exact script looks like
#!/bin/bash
# Define your dataset and table prefix
DATASET="bigquery-public-data:ga4_obfuscated_sample_ecommerce"
TABLE_PREFIX="events_"
BUCKET_NAME="tempv2"
# Get the list of tables using the bq ls command, filtering with grep for your table prefix
TABLES=$(bq ls --max_results 1000 $DATASET | grep $TABLE_PREFIX | awk '{print $1}')
# Loop through the tables and run the bq extract command on each one
for TABLE in $TABLES
do
bq extract --destination_format=PARQUET $DATASET.$TABLE gs://$BUCKET_NAME/${TABLE}.parquet
done
Save the script and give it a name; I named it export_tables.sh. Make it executable with chmod +x export_tables.sh.
Execute the shell script with ./export_tables.sh
If everything works out correctly, you will start to see output:
You can check whether data has been exported by inspecting the contents of the storage bucket.
Allow appropriate access so that you can read the data in Snowflake. You can do this by opening the bucket > Permissions and then clicking Grant Access.
In this example, I have granted access to allUsers. This will make the bucket readable publicly.
To ingest the data from Google Cloud Storage into Snowflake, we will create a storage integration between GCS and Snowflake and then create an external stage. The storage integration streamlines the authentication flow between GCS and Snowflake, while the external stage allows the Snowflake database to ingest the data.
B. Create Storage Integration in Snowflake
A storage integration in Snowflake creates a GCP service account on our behalf. We will then provision access to that service account from GCP.
Log into Snowflake: Open the Snowflake Web UI and switch to the role with privileges to create storage integrations (e.g., ACCOUNTADMIN).
Create the storage integration: Snowflake will automatically create a GCP service account for it (a sample statement is sketched after these steps). We will then go back to Google Cloud to provision the necessary access to this service account.
Navigate to GCS: Open the Google Cloud Console, go to your GCS bucket’s permissions page.
Add a New Member: Use the STORAGE_GCP_SERVICE_ACCOUNT value as the new member’s identity.
Assign Roles: Grant roles such as Storage Object Viewer for read access
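The exact statement was shown in the original as a screenshot; a minimal sketch, assuming the integration name used later for the stage (gcs_storage_integration) and the tempv2 bucket from the export step, could look like this:
CREATE STORAGE INTEGRATION gcs_storage_integration
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'GCS'
  ENABLED = TRUE
  STORAGE_ALLOWED_LOCATIONS = ('gcs://tempv2/');

-- Retrieve the service account Snowflake created for this integration;
-- this is the STORAGE_GCP_SERVICE_ACCOUNT value to grant access to in GCP
DESC STORAGE INTEGRATION gcs_storage_integration;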
C. Create External Stage in Snowflake
The external stage will allow the Snowflake database to ingest data from the external GCS source.
Define the File Format (if not already defined):
CREATE OR REPLACE FILE FORMAT my_parquet_format
TYPE = 'PARQUET';
Create External Stage
CREATE OR REPLACE STAGE my_external_stage
URL = 'gcs://tempv2/'
STORAGE_INTEGRATION = gcs_storage_integration
FILE_FORMAT = my_parquet_format;
Verify the external stage with the LIST command and check the output:
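For example, listing the stage should show the exported Parquet files:
LIST @my_external_stage;   -- should list the .parquet files exported to the bucket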
Create table to load data into
We will create a table so that we can load data from the stage into it.
create table raw_ga4 (
data VARIANT
)
Load data from stage
Finally, we can load data from the external stage into a Snowflake database table using the COPY INTO command.
There are two primary ways we can ingest data into a Snowflake database:
With an upfront Schema
Without an upfront Schema
In this case, we will ingest the data without having an upfront schema.
D. Loading data without Schema
Snowflake provides a VARIANT data type. It is used to store semi-structured data such as JSON, Avro, or Parquet. It is useful because it allows you to ingest and store data without needing to define a schema upfront. A VARIANT column can hold structured and semi-structured data in the same table, enabling you to flexibly query the data using standard SQL alongside Snowflake's semi-structured data functions.
Therefore, in the table-creation step above, we created a simple table with a single VARIANT column.
To load the data into our raw_ga4 table, we use the following command.
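The original showed this command as a screenshot; a minimal sketch, assuming the stage and Parquet file format created earlier, would be:
COPY INTO raw_ga4
  FROM @my_external_stage
  PATTERN = '.*[.]parquet';   -- the Parquet file format is inherited from the stage definition

Because raw_ga4 has a single VARIANT column, each Parquet record lands as one VARIANT value that can later be queried with Snowflake's semi-structured data functions.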
Data Build Tool (DBT) offers functionality for testing our data models to ensure their reliability and accuracy. Overall, dbt testing helps achieve these objectives:
Improved Data Quality: By catching errors and inconsistencies early on, we prevent issues from propagating downstream to reports and dashboards.
Reliable Data Pipelines: Tests ensure that our data transformations work as expected, reducing the risk of regressions when code is modified.
Stronger Data Culture: A focus on testing instills a culture of data quality within the organization.
This article is a continuation of the previous hands-on implementation of a DBT model. In this article, we will explore the concept of tests in DBT and how we can implement them.
Data Build Tool (DBT) provides a comprehensive test framework to ensure data quality and reliability. Here’s a summary of tests that dbt supports.
| Test Type | Subtype | Description | Usage | Implementation method |
| --- | --- | --- | --- | --- |
| Generic Tests | Built-in | Pre-built, out-of-the-box tests for common data issues such as uniqueness, not-null, and referential integrity. | unique, not_null, relationships, accepted_values | schema.yml, using the tests: key under a model/column |
| Custom Generic Tests |  | Custom tests written by users that can be applied to any column or model, similar to built-in tests but with custom logic. | Custom reusable tests with user-defined SQL logic. | 1. dbt automatically picks them up from tests/generic/test_name.sql files starting with {% test %}. 2. schema.yml, by defining the test in a .sql file and applying it in schema.yml. 3. macros: historically, this was the only place they could be defined. |
| Singular Tests |  | Designed to validate a single condition and not intended to be reusable, e.g. check that the total sales reported in a period match the sum of individual sales records. | Custom SQL conditions specific to a model or field. | dbt automatically picks up tests from tests/test_name.sql |
| Source Freshness Tests |  | Tests that check the timeliness of the data in your sources by comparing the current timestamp with the last updated timestamp. | dbt source freshness to monitor data update intervals. | schema.yml, using the freshness: key under a source/table |
DBT testing framework
Implementing Built-In Tests
We will begin by writing generic tests. As we saw in the table above, there are two types: built-in and custom. We start with a built-in test.
Built-in Test 1. unique
We will explore this test with our model DEV.DIM_LISTINGS_CLEANSED. The objective is to make sure that all values in the column LISTING_ID are unique.
To implement the tests, we will create a new file, schema.yml, and define our tests in it.
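A minimal sketch of that schema.yml (assuming it sits alongside the models) could look like this:
version: 2

models:
  - name: dim_listings_cleansed
    columns:
      - name: listing_id
        tests:
          - unique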
When we now run the dbt test command in the terminal, we will see the test execute and pass.
Built-in Test 2. not_null
In the not_null example, we do not want any values in the minimum_nights column to be null. We will add another column name and test under the same hierarchy:
      - name: minimum_nights
        tests:
          - not_null
Now, when we execute dbt test, we will see another test added
Built-in Test 3. accepted_values
In the accepted_values test, we will ensure that the column ROOM_TYPE in DEV.DIM_LISTINGS_CLEANSED can only hold a preselected set of values.
To add this test, we use the following entry in our schema.yml file under the dim_listings_cleansed model.
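A hedged sketch of that entry; the exact list of values depends on your data (the room types below are an assumption based on the Airbnb sample dataset):
      - name: room_type
        tests:
          - accepted_values:
              values: ['Entire home/apt', 'Private room', 'Shared room', 'Hotel room']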
Now when we run dbt test, we can see another test executed, which checks whether the column contains only the values specified in accepted_values.
Built-in Test 4. relationships
In the relationships test, we will ensure that each id in HOST_ID in DEV.DIM_LISTINGS_CLEANSED has a corresponding reference in DEV.DIM_HOSTS_CLEANSED.
To implement this test, we can use the following code in our schema.yml file. Since we are creating a reference relationship to dim_hosts_cleansed based on the host_id field in that table, the entry will be:
      - name: host_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_hosts_cleansed')
              field: host_id
Now when we run dbt test, we see the fourth test added.
Implementing Singular Tests
A singular test consists of a SQL query that passes when no rows are returned and fails if the query returns any rows. It is used to validate assumptions about the data in a specific model. We implement singular tests by writing a SQL query in a .sql file in the tests folder, in this case dim_listings_minimum_nights.sql.
We will check in the test whether there are any rows with minimum_nights = 0 in the table. If there are none, the query will return no rows and our test will pass.
Checking in Snowflake itself, we get zero rows:
The test we have written in tests/dim_listings_minimum_nights.sql:
-- tests/dim_listings_minimum_nights.sql
SELECT *
FROM {{ ref('dim_listings_cleansed') }}
WHERE minimum_nights <1
We can now implement the same test in dbt and run dbt test. In this case, I am executing a single test only; the output is shown below.
Since the result set of the query was empty, the test passed. (In other words, we were checking for an exception; it did not occur, therefore the test passed.)
Implementing Custom Generic Tests
Custom generic tests are tests that can accept parameters. They are defined in a SQL file and are identified by the {% test %} block at the beginning of the file. Let's write two tests using the two different methods supported by DBT.
Custom Generic Test with Macros
First, we create a .sql file under the macros folder; in this case we are creating macros/positive_value.sql. The test accepts two parameters and returns rows if the column passed in has any value of less than 1.
{% test positive_value(model,column_name) %}
SELECT
*
FROM
{{model}}
WHERE
{{column_name}} < 1
{% endtest %}
After writing the test, we specify it within the schema.yml file.
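A sketch of the schema.yml fragment, matching the description below, adds the custom test next to the built-in one under minimum_nights:
      - name: minimum_nights
        tests:
          - not_null
          - positive_value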
To contrast the built-in generic test with the custom generic test, let's review the schema.yml configuration.
We have defined two tests on the minimum_nights column. The first is the built-in not_null test from the last section. In this section, we created a macro named positive_value(model, column_name), which accepts two parameters. Both parameters are passed in from the schema.yml file: dbt looks up the model under which the test is specified and passes it in, and similarly passes in the column_name under which the test is listed.
It's important to remember that whether a test is defined in macros or built-in, it passes only when no rows are returned. If any rows are returned, the test fails.
Custom Generic Test with generic subfolder
The other way of specifying a custom generic test is under tests/generic. Also remember that singular tests live directly under the tests folder. Here's a visual indication of the difference:
Singular tests are saved under tests/
Custom generic tests are saved under tests/generic
Since it's a generic test, it starts with the {% test %} block. The code for our test is:
-- tests/generic/non_negative_value.sql
{% test non_negative_value(model, column_name) %}
SELECT
*
FROM
{{ model }}
WHERE
{{ column_name }} < 0
{% endtest %}
I want to implement this test on listing_id in dim_listings_cleansed, and I also want the test to fail (just to clarify the concept). I will go to schema.yml and define the test under the listing_id column.
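The schema.yml fragment for this is a sketch; your file may already list other tests on this column:
      - name: listing_id
        tests:
          - unique
          - non_negative_value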
When I now run dbt test, DBT finds the test under tests/generic; however, the test fails because several rows are returned.
Conclusion
In this hands-on exploration, we implemented various DBT testing methods to ensure data quality and reliability. We implemented built-in schema tests for core data properties, created reusable custom generic tests, and wrote singular tests for specific conditions. By leveraging these techniques, we can establish a robust testing strategy that safeguards our data warehouse, promotes high-quality data, and strengthens the foundation for reliable data pipelines.
In this article, we will follow along with the complete dbt (Data Build Tool) bootcamp to create an end-to-end project pipeline using DBT and Snowflake. We begin by outlining the steps we will perform in this hands-on project.
1. Loading Data from AWS to Snowflake:
Data from S3 is loaded into Snowflake using the COPY INTO command.
The raw data is stored in a dedicated schema (Airbnb.RAW).
2. Configuring Python & DBT:
Set up a virtual environment and install dbt-snowflake.
Create a user in Snowflake with appropriate permissions.
Initialize a dbt project and connect it to Snowflake.
3. Creating Staging Layer Models:
Models in dbt are SQL files representing transformations.
Staging models transform raw data into a cleaner, more structured format.
Use dbt commands to build and verify models.
4. Creating Core Layer Models & Materializations:
Core models apply further transformations and materializations.
Understand different materializations like views, tables, incremental, and ephemeral.
Create core models with appropriate materializations and verify them in Snowflake.
5. Creating DBT Tests:
Implement generic and custom tests to ensure data quality.
Use source freshness tests to monitor data timeliness.
6. Creating DBT Documentation:
Document models using schema.yml files and markdown files.
Generate and serve documentation to provide a clear understanding of data models and transformations.
The complete sources and code for this article can be found in the original course here.
Let's begin the hands-on implementation step by step.
We have now sourced the raw data that we want to build our data pipeline on.
2. Configuring Python & DBT
The next step is to create a staging area to store the data without transformations. Since downstream processing of data can put considerable load on the systems, this staging area helps decouple the load. It also helps with auditing.
In the first step, we created 3 tables in the RAW schema manually and copied data into them. In this step, we will use DBT to create the staging area.
Setting up Python and a Virtual Environment
The first step is to set up Python and a virtual environment. You can use the global environment as well; however, it is preferable to go with a virtual environment. I am using Visual Studio Code for this tutorial and assume that you can configure VS Code with a virtual environment, so I am not going into the details. If you need help setting it up, please follow this guide for setup.
After setting up the virtual environment, install dbt-snowflake. DBT has adapters (packages) for several data warehouses. Since we are using Snowflake in this example, we will use the following package:
pip install dbt-snowflake
Creating user in Snowflake
To connect to Snowflake, we created a user dbt in Snowflake, granted it a role called transform, and granted ALL privileges on the database AIRBNB. (This is by no means a standard or recommended practice; we grant blanket privileges only for the purpose of this project.) We use the following SQL in Snowflake to create the user and grant privileges.
-- Create the `dbt` user and assign to role
CREATE USER IF NOT EXISTS dbt
PASSWORD='dbtPassword123'
LOGIN_NAME='dbt'
MUST_CHANGE_PASSWORD=FALSE
DEFAULT_WAREHOUSE='COMPUTE_WH'
DEFAULT_ROLE='transform'
DEFAULT_NAMESPACE='AIRBNB.RAW'
COMMENT='DBT user used for data transformation';
GRANT ROLE transform to USER dbt;
-- Set up permissions to role `transform`
GRANT ALL ON WAREHOUSE COMPUTE_WH TO ROLE transform;
GRANT ALL ON DATABASE AIRBNB to ROLE transform;
GRANT ALL ON ALL SCHEMAS IN DATABASE AIRBNB to ROLE transform;
GRANT ALL ON FUTURE SCHEMAS IN DATABASE AIRBNB to ROLE transform;
GRANT ALL ON ALL TABLES IN SCHEMA AIRBNB.RAW to ROLE transform;
GRANT ALL ON FUTURE TABLES IN SCHEMA AIRBNB.RAW to ROLE transform;
Now that we have the credentials for Snowflake, we can provide this information to dbt for connection.
Creating DBT Project
Once dbt-snowflake and its required dependencies are installed, you can proceed with setting up the dbt project. Inside the virtual environment, initialize the dbt Core project with dbt init [project name]:
dbt init mydbt
In my case, I am building this dbt project within the DBTCOURSE folder and naming my dbt Core project mydbt, so my folder structure is:
DBTCOURSE -> mydbt -> [dbt folders and files]
Since it's the first time we are running dbt init, it will ask for the name of the project, the database platform we are connecting to, and the authentication credentials for that platform (Snowflake).
Connecting DBT to Snowflake
After the project is initialized, DBT will ask for user credentials to connect to Snowflake with appropriate permissions on the database and schemas we want to work on. Given below is a summary of the information the command prompt will ask for.
| Prompt | Value |
| --- | --- |
| platform (number) | [1] Snowflake |
| account | This is your Snowflake account identifier. |
| user | User we created to connect to Snowflake (dbt in this case) |
| password | User's password |
| role | Role we assigned to the above user, which we want dbt to open the connection with |
| database | Database we want our connection established with (AIRBNB in this case) |
| schema | The default schema where dbt will build all models (DEV in this case) |
At times, finding Snowflake's account identifier can be tricky. You can find it in Snowflake > Admin > Account by hovering over the account row.
Verifying DBT connection
We can verify whether our dbt project has been configured properly and is able to connect to Snowflake using the dbt debug command:
dbt debug
The verbose output is quite long; I have provided screenshots of only the key checks that enable us to start working with our project.
3. Creating data models in the Staging Layer with DBT
Concept : Models in DBT
In dbt (Data Build Tool), models are SQL files that contain SELECT statements. These models define transformations on your raw data and are the core building blocks of dbt projects. Each model represents a transformation step that takes raw or intermediate data, applies some logic, and outputs the transformed data as a table or view in your data warehouse. Models promote modularity by breaking down complex transformations into simpler, reusable parts. This makes the transformation logic easier to manage and understand.
We will explore how models in dbt function within a dbt project to build and manage a data warehouse pipeline. The key model characteristics and functions we will look at are:
Data Transformation: Models allow you to transform raw data into meaningful, structured formats. This includes cleaning, filtering, aggregating, and joining data from various sources.
Incremental Processing: Models can be configured to run incrementally, which means they only process new or updated data since the last run. This improves efficiency and performance.
Materializations:
Models in dbt can be materialized in different ways:
Views: Create virtual tables that are computed on the fly when queried.
Tables: Persist the transformed data as physical tables in the data warehouse.
Incremental Tables: Only update rows that have changed or been added since the last run.
Ephemeral: Used for subqueries that should not be materialized but instead embedded directly into downstream models.
Testing: dbt allows us to write tests for our models to ensure data quality and integrity. These can include uniqueness, non-null, and custom tests that check for specific conditions.
Documentation: Models can be documented within dbt, providing descriptions and context for each transformation. This helps in understanding the data lineage and makes the project more maintainable.
Dependencies and DAGs: Models can reference other models using the ref function, creating dependencies between them. dbt automatically builds a Directed Acyclic Graph (DAG) to manage these dependencies and determine the order of execution.
Version Control: Because dbt models are just SQL files, they can be version controlled using Git or any other version control system, enabling collaboration and change tracking.
Concept: Staging Layer
The staging layer is an abstraction used to denote the purpose of data models in enterprise data warehouses. The concept of a staging layer is prevalent in both the Kimball and Inmon methodologies for data modeling. For dbt purposes, we will use a folder src under models to create the staging layer. Using the staging layer, we decouple the source data for further processing. In the staging layer we create 3 simple models under models/src/:
| # | Model | Transformation | Materialization |
| --- | --- | --- | --- |
| 1 | Airbnb.DEV.src_listings | Column name changes | View |
| 2 | Airbnb.DEV.src_reviews | Column name changes | View |
| 3 | Airbnb.DEV.src_hosts | Column name changes | View |
Models in staging layer
Creating staging layer models
We create a new file src_listings.sql and use the following SQL to create our first model:
WITH raw_listings AS (
SELECT
*
FROM
AIRBNB.RAW.RAW_LISTINGS
)
SELECT
id AS listing_id,
name AS listing_name,
listing_url,
room_type,
minimum_nights,
host_id,
price AS price_str,
created_at,
updated_at
FROM
raw_listings
We repeat the same for src_reviews and src_hosts
WITH raw_reviews AS (
SELECT
*
FROM
AIRBNB.RAW.RAW_REVIEWS
)
SELECT
listing_id,
date AS review_date,
reviewer_name,
comments AS review_text,
sentiment AS review_sentiment
FROM
raw_reviews
WITH raw_hosts AS (
SELECT
*
FROM
AIRBNB.RAW.RAW_HOSTS
)
SELECT
id AS host_id,
NAME AS host_name,
is_superhost,
created_at,
updated_at
FROM
raw_hosts
Our file structure after creating the 3 models will be:
Finally, we can now run the dbt run command to build our first 3 models.
dbt run
It will start building the models and show any errors it encounters. In case of a successful run, it confirms this at the prompt.
To confirm the models in our Snowflake database, we can now navigate to Snowflake and see whether our new models appear there.
As you can see, dbt has successfully created 3 new models under the AIRBNB.DEV schema, thus completing our staging layer.
4. Creating core layer models & materializations
In this step, we will create the "Core" layer models. We will also explore materializations in dbt and choose which materialization we want each model to have. A summary of the models we will create in this step, alongside their materialization choices, is:
dbt supports four types of materialization. We can specify the materialization of a model either in the model file (.sql) or within the dbt_project.yml file. For our exercise, we will use both approaches to get familiar with each.
A comparative view of materializations supported by dbt and when to use them is as follows:
| Feature | View | Table | Incremental | Ephemeral |
| --- | --- | --- | --- | --- |
| Description | Creates a virtual table that is computed on the fly when queried. | Creates a physical table that stores the results of the transformation. | Updates only the new or changed rows since the last run. | Creates temporary subqueries embedded directly into downstream models. |
| Use Case | Use when you need to frequently refresh data without the need for storage. | Use when you need fast query performance and can afford to periodically refresh the data. | Use when dealing with large datasets and only a subset of the data changes frequently. | Use for intermediate transformations that don't need to be materialized. |
| Pros | No storage costs. Always shows the latest data. Quick to set up. | Fast query performance. Data is stored and doesn't need to be recomputed each time. | Efficient processing by updating only changed data. Reduces processing time and costs. | No storage costs. Simplifies complex transformations by breaking them into manageable parts. |
| Cons | Slower query performance for large datasets. Depends on the underlying data's performance. | Higher storage costs. Requires periodic refreshing to keep data up-to-date. | More complex setup. Requires careful handling of change detection logic. | Can lead to complex and slow queries if overused. Not materialized, so each downstream query must recompute the subquery. |
Materializations in dbt
Creating Core Layer models
We will apply transformations to our staging layer models and create the core layer models as discussed above. Our project structure after creating these models changes as shown below.
dim_hosts_cleansed
{{
config(
materialized = 'view'
)
}}
WITH src_hosts AS (
SELECT
*
FROM
{{ ref('src_hosts') }}
)
SELECT
host_id,
NVL(
host_name,
'Anonymous'
) AS host_name,
is_superhost,
created_at,
updated_at
FROM
src_hosts
dim_listings_cleansed
WITH src_listings AS (
SELECT
*
FROM
{{ ref('src_listings') }}
)
SELECT
listing_id,
listing_name,
room_type,
CASE
WHEN minimum_nights = 0 THEN 1
ELSE minimum_nights
END AS minimum_nights,
host_id,
REPLACE(
price_str,
'$'
) :: NUMBER(
10,
2
) AS price,
created_at,
updated_at
FROM
src_listings
dim_listings_with_hosts
WITH
l AS (
SELECT
*
FROM
{{ ref('dim_listings_cleansed') }}
),
h AS (
SELECT *
FROM {{ ref('dim_hosts_cleansed') }}
)
SELECT
l.listing_id,
l.listing_name,
l.room_type,
l.minimum_nights,
l.price,
l.host_id,
h.host_name,
h.is_superhost as host_is_superhost,
l.created_at,
GREATEST(l.updated_at, h.updated_at) as updated_at
FROM l
LEFT JOIN h ON (h.host_id = l.host_id)
fct_reviews
{{
config(
materialized = 'incremental',
on_schema_change='fail'
)
}}
WITH src_reviews AS (
SELECT * FROM {{ ref('src_reviews') }}
)
SELECT * FROM src_reviews
WHERE review_text is not null
{% if is_incremental() %}
AND review_date > (select max(review_date) from {{ this }})
{% endif %}
In the models above, we have explicitly set the materialization configuration within the model files for fct_reviews and dim_hosts_cleansed. For the others, we have used the dbt_project.yml file to specify the materialization.
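The YAML itself was shown as a screenshot; a minimal sketch, assuming the project is named mydbt and the core models live in a models/dim subfolder, could look like this:
models:
  mydbt:
    +materialized: view        # project-wide default (staging/src models stay views)
    dim:
      +materialized: table     # core dimension models are built as tables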
With the above materialization specifications in the YAML file, we run dbt run and get the following output.
We can check the same in the Snowflake interface, which shows us the newly created models.
Adding sources to the models
Concept : Source in DBT
Sources in dbt are aliases given to the actual tables. They are an additional abstraction added over the raw tables that makes it possible to name and describe the data loaded into the warehouse. Sources enable the following:
We can calculate the freshness of source data.
Test our assumptions about source data.
Select from source tables in our models using the {{ source() }} function, which helps in defining the lineage of the data.
Adding Sources
To add sources to our models, we will create a file "sources.yml" (the filename is arbitrary). Once we have created the config file, we can go into the src_* model files and replace the existing table names with our "sources".
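A hedged sketch of such a sources.yml (the source name and table aliases below are assumptions for illustration, and assume the connection's database is AIRBNB):
version: 2

sources:
  - name: airbnb
    schema: raw
    tables:
      - name: listings
        identifier: raw_listings
      - name: hosts
        identifier: raw_hosts
      - name: reviews
        identifier: raw_reviews

With this in place, a line like FROM AIRBNB.RAW.RAW_LISTINGS in src_listings.sql becomes FROM {{ source('airbnb', 'listings') }}.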
A benefit of using "sources" in DBT is being able to monitor the freshness of data. Source freshness is a mechanism in DBT that enables monitoring the timeliness and update frequency of the data in our sources. The source freshness mechanism allows for "warning" or "error" notifications for data freshness.
Here are the steps to configure source freshness.
In our sources.yml file, we choose a date/time field that acts as the reference point for source monitoring (loaded_at_field).
We specify a maximum allowable data age (as an interval, e.g. hours or days) before a warning or error is triggered (warn_after or error_after).
We execute dbt source freshness command to check the freshness of sources.
If the data exceeds the freshness thresholds, DBT raises warnings or errors.
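A sketch of the freshness block in sources.yml, using the settings summarized in the table below (the reviews table and its date column are the assumed target here):
      - name: reviews
        identifier: raw_reviews
        loaded_at_field: date
        freshness:
          warn_after: {count: 1, period: hour}
          error_after: {count: 24, period: hour}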
Here is what we have told the above yml file to do.
| Parameter | Setting | Description |
| --- | --- | --- |
| loaded_at_field | date | Check the "date" field to see when the data was last loaded |
| warn_after | {count: 1, period: hour} | Issues a warning if the data in the date field is older than 1 hour. |
| error_after | {count: 24, period: hour} | Issues an error if the data in the date field is older than 24 hours. |
source freshness configuration
Now when we go to the command prompt and run:
dbt source freshness
We get the following output
Creating tests in DBT
Concept: DBT tests
Dbt provides a comprehensive test framework to ensure data quality and reliability. Here’s a summary of tests that dbt supports.
| Test Type | Subtype | Description | Usage | Implementation method |
| --- | --- | --- | --- | --- |
| Generic Tests | Built-in | Pre-built, out-of-the-box tests for common data issues such as uniqueness, not-null, and referential integrity. | unique, not_null, relationships, accepted_values | schema.yml, using the tests: key under a model/column |
| Custom Generic Tests |  | Custom tests written by users that can be applied to any column or model, similar to built-in tests but with custom logic. | Custom reusable tests with user-defined SQL logic. | 1. dbt automatically picks them up from tests/generic/test_name.sql files starting with {% test %}. 2. schema.yml, by defining the test in a .sql file and applying it in schema.yml. 3. macros: historically, this was the only place they could be defined. |
| Singular Tests |  | Designed to validate a single condition and not intended to be reusable, e.g. check that the total sales reported in a period match the sum of individual sales records. | Custom SQL conditions specific to a model or field. | dbt automatically picks up tests from tests/test_name.sql |
| Source Freshness Tests |  | Tests that check the timeliness of the data in your sources by comparing the current timestamp with the last updated timestamp. | dbt source freshness to monitor data update intervals. | schema.yml, using the freshness: key under a source/table |
DBT testing framework
Implementing Built-In Tests
We will begin by writing generic tests. To improve the readability of this article, I have moved the testing to an article of its own; please continue with implementing built-in tests at this link.
Documenting models in DBT
In any data warehousing project, documentation acts as a blueprint and reference guide for data structures. It should capture key information about data models, essentially explaining what data is stored, how it is organised, and how it relates to other data. DBT solves the problem of documenting models by providing a framework for implementing documentation alongside the models themselves.
Writing documentation for our Models
DBT provides two methods for writing documentation. We can either write the documentation within the schema.yml file as text, or we can write documentation in separate markdown files and link them back to the schema.yml file. We will explore both of these options. Please follow this link to see hands-on documentation in DBT.
Conclusion
In this article, we created an end-to-end project pipeline using dbt and Snowflake. DBT makes it easier for data engineers to effectively build and manage reliable and scalable data pipelines. We started by loading data from S3 into Snowflake using the COPY INTO command, storing the raw data in a dedicated schema (Airbnb.RAW). Next, we configured Python and dbt, setting up a virtual environment, installing dbt-snowflake, creating a user in Snowflake with appropriate permissions, and initializing a dbt project connected to Snowflake. We then created staging layer models, which are SQL files representing transformations, to clean and structure the raw data. Using dbt commands, we built and verified these models. Moving on to the core layer, we applied further transformations and materializations, exploring different types like views, tables, incremental, and ephemeral. We created core models with appropriate materializations and verified them in Snowflake. To ensure data quality, we implemented generic and custom tests, as well as source freshness tests to monitor data timeliness. Lastly, we documented our models using schema.yml files and markdown files, generating and serving the documentation to provide a clear understanding of data models and transformations. By following these steps, data engineers can leverage dbt to build scalable and maintainable data pipelines, ensuring data integrity and ease of use for downstream analytics.
Slowly Changing Dimensions (SCDs) are an approach in data warehousing used to manage and track changes to dimensions over time. They play an important role in data modeling, especially in the context of a data warehouse, where maintaining historical accuracy of data over time is essential. The term "slow" in SCDs refers to the rate of change and the method of handling these changes. For example, changes to a customer's address or marital status happen infrequently, making these "slowly" changing dimensions. This is in contrast to "fast-changing" scenarios where data elements like stock prices or inventory levels might update several times a day or even minute by minute.
Types of SCDs
In the context of data modeling, there are three commonly used types of slowly changing dimensions. Choosing the right type of SCD depends on business requirements. Given below is a brief overview of the most frequently used types of SCDs in data modeling.
Type 1 SCD (No History)
Overwrites old data with new data. It’s used when the history of changes isn’t necessary. For example, correcting a misspelled last name of a customer.
In this scenario, Manan's email address is updated directly in the database, replacing the old email with the new one. No historical record of the old email is maintained.
Type 2 SCD (Full History)
Adds new records for changes, keeping historical versions. It’s crucial for auditing purposes and when analyzing the historical context of data, like tracking price changes over time.
Scenario:
A customer changes their subscription plan.
Before the Change:
Originally, the customer is subscribed to the “Basic” plan.
| Customer ID | Name | Subscription Plan | Start Date | End Date | Current |
| --- | --- | --- | --- | --- | --- |
| 001 | Manan Younas | Basic | 2023-01-01 | NULL | Yes |
After the Change:
The customer upgrades to the “Premium” plan on 2023-06-01.
Update the existing record to set the end date and change the “Current” flag to “No.”
Add a new record with the new subscription plan, starting on the date of the change.
| Customer ID | Name | Subscription Plan | Start Date | End Date | Current |
| --- | --- | --- | --- | --- | --- |
| 001 | Manan Younas | Basic | 2023-01-01 | 2023-06-01 | No |
| 001 | Manan Younas | Premium | 2023-06-01 | NULL | Yes |
Explanation:
Before the Change: The table shows Manan Younas with a "Basic" plan, starting from January 1, 2023. The "End Date" is NULL, indicating that this record is currently active.
After the Change: Two changes are made to manage the subscription upgrade:
The original "Basic" plan record is updated with an "End Date" of June 1, 2023, and the "Current" status is set to "No," marking it as historical.
A new record for the "Premium" plan is added with a "Start Date" of June 1, 2023. This record is marked as "Current" with a NULL "End Date," indicating it is the active record.
This method of handling SCDs is beneficial for businesses that need to track changes over time for compliance, reporting, or analytical purposes, providing a clear and traceable history of changes.
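A minimal SQL sketch of this pattern follows; the dim_customer table and its columns are hypothetical, mirroring the example table above rather than any real schema:
-- Close out the current "Basic" record
UPDATE dim_customer
SET end_date = '2023-06-01',
    current_flag = 'No'
WHERE customer_id = '001'
  AND current_flag = 'Yes';

-- Insert the new "Premium" record as the active row
INSERT INTO dim_customer
  (customer_id, name, subscription_plan, start_date, end_date, current_flag)
VALUES
  ('001', 'Manan Younas', 'Premium', '2023-06-01', NULL, 'Yes');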
Type 3 SCD (Limited History)
Type 3 SCDs add new columns to store both the current and at least one previous value of the attribute. This approach is less commonly used, but it works well for tracking a limited history where only the most recent change is relevant, without the need for multiple records.
Scenario:
A customer moves from one city to another.
Before the Move:
Initially, only current information is tracked.
| Customer ID | Name | Current City |
| --- | --- | --- |
| 001 | Manan Younas | Sydney |
After the Move:
A new column is added to keep the previous city alongside the current city.
| Customer ID | Name | Current City | Previous City |
| --- | --- | --- | --- |
| 001 | Manan Younas | Melbourne | Sydney |
Explanation:
In this table, when Manan moves from Sydney to Melbourne, the “Current City” column is updated with the new city, and “Previous City” is added to record his last known location. This allows for tracking the most recent change without creating a new record.
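A minimal SQL sketch of the same update (again using a hypothetical dim_customer table based on the example above):
-- Shift the old value into the history column, then overwrite the current value
UPDATE dim_customer
SET previous_city = current_city,
    current_city  = 'Melbourne'
WHERE customer_id = '001';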
These examples illustrate the methods by which Type 1, Type 2, and Type 3 SCDs manage data changes. Type 1 SCDs are simpler and focus on the most current data, discarding historical changes. Type 3 SCDs, meanwhile, provide a way to view a snapshot of both current and previous data states without maintaining full historical records as Type 2 SCDs do.
Architectural considerations for Managing SCDs
The management of Slowly Changing Dimensions (SCDs) in data warehousing requires careful architectural planning to ensure data accuracy, completeness, and relevance. Let's discuss the implementation considerations for each type of SCD and the architectural setups required to support these patterns effectively.
Type 1 Implementation consideration
Scenarios Where Most Effective
Type 1 SCDs are most effective in scenarios where historical data is not needed for analysis and only the current state is relevant. Common use cases include:
Correcting data errors in attributes, such as fixing misspelled names or incorrect product attributes.
Situations where business does not require tracking of historical changes, such as current status updates or latest measurements.
Architectural Setup
Database Design: A simple design where each record holds the latest state of the data. Historical data tracking mechanisms are not needed.
Data Update Mechanism: The implementation requires a straightforward update mechanism where old values are directly overwritten by new ones without the need for additional fields or complex queries.
Performance Considerations: Since this pattern only involves updating existing records, it typically has minimal impact on performance and does not significantly increase storage requirements.
Type 2 Implementation consideration
Scenarios Where Most Effective
Type 2 SCDs are crucial when the full history of changes must be preserved for compliance, reporting, or analytical purposes. They are widely used in:
Customer information management, where it is necessary to track address history, status changes, or subscription details.
Product information tracking, where changes over time can provide insights into lifecycle management and evolution.
Architectural Setup
Database Design: Requires a more complex setup with additional fields for managing historical data, such as start date, end date, and a current flag.
Data Management Strategy: Insertion of new records for each change, while updating or closing previous records to indicate they are no longer current. This setup can be managed through triggers or application logic.
Versioning and Timestamping: Implementation of version control and timestamping to ensure each change is accurately recorded with its validity period.
Performance and Storage Considerations: Type 2 can significantly increase the volume of data stored, which may impact performance. Indexing on key fields and partitioning historical data can help optimize query performance.
Type 3 Implementation Pattern
Scenarios Where Most Effective:
Type 3 SCDs are used when tracking a limited history is sufficient. This can be applicable in cases like:
Tracking a previous address alongside the current one for a short-term promotional campaign.
Monitoring recent changes in terms and conditions for services, where only the most recent previous state is relevant.
Architectural Setup:
Database Design: Includes additional columns to store both the current and previous values of the tracked attributes. This setup is simpler than Type 2 but more complex than Type 1.
Data Update Mechanism: Updates involve changing multiple fields within a single record—both updating the previous value fields and writing new current values.
Performance Considerations: This method increases the size of each record but does not increase the total number of records as Type 2 does. Performance impacts are generally moderate but can involve more complex queries than Type 1.
Conclusion
Each type of SCD requires a different architectural approach based on the business requirements for historical data. While Type 1 is simpler and less resource-intensive, Types 2 and 3 provide data tracking capabilities at the cost of additional complexity and potentially higher resource requirements. Properly choosing and implementing these patterns will depend on the specific needs of the business, the criticality of historical data, and the performance impacts on the data warehouse system.