
Snowflake Object Model


The Snowflake object model is a hierarchical framework that organizes and manages data within the Snowflake cloud data platform. An “object” refers to a logical container or structure that is used to either

  1. Store data,
  2. Organize data, or
  3. Manage data.

From the top-level organization and account objects down to the granular elements like tables and views, the Snowflake object model provides a structured framework for data storage, access, and security. The following is a detailed overview of the key objects in the Snowflake object model and their respective functions.

Organisation

In Snowflake, an organisation is a top-level entity that groups together related accounts, providing a way to manage billing, usage, and access at a higher level.

Example: A multinational corporation might have a separate Snowflake organisation for each region it operates in, with individual accounts for each country.

Account

An account in Snowflake represents an independent environment with its own set of users, databases, and resources. It is the primary unit of access and billing.

Example: A retail company might have a Snowflake account dedicated to its e-commerce analytics, separate from other accounts used for different business functions.

Role

A role in Snowflake is a collection of permissions that define what actions a user or group of users can perform. Roles are used to enforce security and access control.

Example: A “Data Analyst” role might have permissions to query and view data in specific databases and schemas but not to modify or delete data.
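To make this concrete, here is a minimal sketch of how such a role could be created and granted in SQL. The database, schema, and user names are illustrative only.

CREATE ROLE data_analyst;
GRANT USAGE ON DATABASE sales TO ROLE data_analyst;
GRANT USAGE ON SCHEMA sales.reporting TO ROLE data_analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA sales.reporting TO ROLE data_analyst;
-- Assign the role to a user; "john.doe" is only an example username
GRANT ROLE data_analyst TO USER "john.doe";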

User

A user in Snowflake is an individual or service that interacts with the platform, identified by a unique username. Users are assigned roles that determine their access and capabilities.

Example: A user named “john.doe” might be a data scientist with access to analytical tools and datasets within the Snowflake environment.

Share

A share in Snowflake is a mechanism for securely sharing data between different accounts or organisations. It allows for controlled access to specific objects without copying or moving the data.

Example: A company might create a share to provide its partner with read-only access to a specific dataset for collaboration purposes.
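As a rough sketch of that scenario (using made-up database, table, and account names), a read-only share could be set up like this:

CREATE SHARE partner_share;
GRANT USAGE ON DATABASE sales TO SHARE partner_share;
GRANT USAGE ON SCHEMA sales.public TO SHARE partner_share;
GRANT SELECT ON TABLE sales.public.orders TO SHARE partner_share;
-- Add the consumer account (organisation and account names are placeholders)
ALTER SHARE partner_share ADD ACCOUNTS = partner_org.partner_account;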

Network Policy

A network policy in Snowflake is a set of rules that define allowed IP addresses or ranges for accessing the Snowflake account, enhancing security by restricting access to authorized networks.

Example: A financial institution might configure a network policy to allow access to its Snowflake account only from its corporate network.
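A minimal sketch of such a policy might look like the following; the CIDR range is a placeholder for the corporate network.

CREATE NETWORK POLICY corporate_only
  ALLOWED_IP_LIST = ('203.0.113.0/24');  -- example corporate range
-- Apply the policy at the account level
ALTER ACCOUNT SET NETWORK_POLICY = corporate_only;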

Warehouse

In Snowflake, a warehouse is a cluster of compute resources used for executing data processing tasks such as querying and loading data. Warehouses can be scaled up or down to manage performance and cost.

Example: A marketing team might use a small warehouse for routine reporting tasks and a larger warehouse for more intensive data analysis during campaign launches.
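For illustration, a small warehouse with auto-suspend might be created like this (the name and settings are just examples):

CREATE WAREHOUSE reporting_wh
  WAREHOUSE_SIZE = 'SMALL'
  AUTO_SUSPEND = 300          -- suspend after 5 minutes of inactivity
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;

-- Scale up temporarily for heavier analysis
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE';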

Resource Monitor

A resource monitor in Snowflake is a tool for tracking and controlling the consumption of compute resources. It can be used to set limits and alerts to prevent overspending.

Example: A company might set up a resource monitor to ensure that its monthly compute costs do not exceed a predetermined budget.
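A sketch of such a monitor, assuming a hypothetical quota of 1,000 credits per month, could look like:

CREATE RESOURCE MONITOR monthly_budget
  WITH CREDIT_QUOTA = 1000
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 80 PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND;

-- Attach the monitor to a warehouse (reporting_wh is an example name)
ALTER WAREHOUSE reporting_wh SET RESOURCE_MONITOR = monthly_budget;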

Database

A database in Snowflake is a collection of schemas and serves as the primary container for storing and organizing data. It is similar to a database in traditional relational database systems.

Example: A healthcare organization might have a database called “PatientRecords” that contains schemas for different types of medical data.

Schema

A schema in Snowflake is a logical grouping of database objects such as tables, views, and functions. It provides a way to organize and manage related objects within a database.

Example: In a “Sales” database, there might be a schema called “Transactions” that contains tables for sales orders, invoices, and payments.
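The database/schema hierarchy from this example could be created with a few statements (the table definition is purely illustrative):

CREATE DATABASE sales;
CREATE SCHEMA sales.transactions;
-- Objects such as tables then live inside the schema
CREATE TABLE sales.transactions.sales_orders (
  order_id   NUMBER,
  order_date DATE,
  amount     NUMBER(10,2)
);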

UDF (User-Defined Function)

A UDF in Snowflake is a custom function created by users to perform specific operations or calculations that are not available as built-in functions.

Example: A retail company might create a UDF to calculate the total sales tax for an order based on different tax rates for each product category.
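As a simplified sketch of the sales-tax example, a SQL UDF might look like this; the category names, tax rates, and the order_items table are invented for illustration:

CREATE OR REPLACE FUNCTION sales_tax(amount NUMBER, category VARCHAR)
RETURNS NUMBER
AS
$$
  CASE
    WHEN category = 'GROCERY'     THEN amount * 0.05
    WHEN category = 'ELECTRONICS' THEN amount * 0.10
    ELSE amount * 0.08
  END
$$;

-- Example usage against a hypothetical order_items table
SELECT order_id, SUM(sales_tax(price, category)) AS total_tax
FROM order_items
GROUP BY order_id;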

Task

A task in Snowflake is a scheduled object that automates the execution of SQL statements, including data loading, transformation, and other maintenance operations.

Example: A data engineering team might set up a task to automatically refresh a materialized view every night at midnight.
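A hedged sketch of a nightly refresh task (the summary and source table names are hypothetical) might be:

CREATE OR REPLACE TASK nightly_refresh
  WAREHOUSE = reporting_wh
  SCHEDULE = 'USING CRON 0 0 * * * UTC'   -- every night at midnight UTC
AS
  INSERT OVERWRITE INTO daily_sales_summary
  SELECT sale_date, SUM(amount) AS total_sales
  FROM sales_orders
  GROUP BY sale_date;

-- Tasks are created suspended; resume to start the schedule
ALTER TASK nightly_refresh RESUME;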

Pipe

A pipe in Snowflake is an object used for continuous data ingestion from external sources into Snowflake tables. It processes and loads streaming data in near real-time.

Example: A streaming service might use a pipe to ingest real-time user activity data into a Snowflake table for analysis.
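A minimal sketch of a pipe, assuming a stage and target table already exist (names here are placeholders), could look like:

CREATE OR REPLACE PIPE user_activity_pipe
  AUTO_INGEST = TRUE
AS
  -- user_activity is assumed to have a single VARIANT column for the raw JSON events
  COPY INTO user_activity
  FROM @activity_stage
  FILE_FORMAT = (TYPE = 'JSON');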

Procedure

A procedure in Snowflake is a stored sequence of SQL statements that can be executed as a single unit. It is used to encapsulate complex business logic and automate repetitive tasks.

Example: A finance team might create a procedure to generate monthly financial reports by aggregating data from various sources and applying specific calculations.

Stages

In Snowflake, stages are objects used to stage data files before loading them into tables. They can be internal (managed by Snowflake) or external (located in cloud storage).

Example: A data integration process might use a stage to temporarily store CSV files before loading them into a Snowflake table for analysis.

External Stage

An external stage in Snowflake is a reference to a location in cloud storage (such as Amazon S3, Google Cloud Storage, or Azure Blob Storage) where data files are staged before loading.

Example: A company might use an external stage pointing to an Amazon S3 bucket to stage log files before loading them into Snowflake for analysis.

Internal Stage

An internal stage in Snowflake is a built-in storage location managed by Snowflake for staging data files before loading them into tables.

Example: An analytics team might use an internal stage to temporarily store JSON files before transforming and loading them into a Snowflake table for analysis.

Table

A table in Snowflake is a structured data object that stores data in rows and columns. Tables can be of different types, such as permanent, temporary, or external.

Example: A logistics company might have a permanent table called “Shipments” that stores detailed information about each shipment, including origin, destination, and status.

External Tables

External tables in Snowflake are tables that reference data stored in external stages, allowing for querying data directly from cloud storage without loading it into Snowflake.

Example: A data science team might use external tables to query large datasets stored in Amazon S3 without importing the data into Snowflake, saving storage costs.

Transient Tables

Transient tables in Snowflake are similar to permanent tables but with a shorter lifespan and lower storage costs. They are suitable for temporary or intermediate data.

Example: During a data transformation pipeline, a transient table might be used to store intermediate results that are needed for a short period before being discarded.

Temporary Tables

Temporary tables in Snowflake are session-specific tables that are automatically dropped at the end of the session. They are useful for temporary calculations or intermediate steps.

Example: In an ad-hoc analysis session, a data analyst might create a temporary table to store query results for further exploration without affecting the permanent dataset.

Permanent Tables

Permanent tables in Snowflake are tables that persist data indefinitely and are the default table type for long-term data storage.

Example: A financial institution might use permanent tables to store historical transaction data for compliance and reporting purposes.
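The three table types differ only by a keyword at creation time. A quick sketch, with illustrative column definitions:

-- Permanent table (default): long-term storage with full data protection
CREATE TABLE shipments (shipment_id NUMBER, origin STRING, destination STRING, status STRING);

-- Transient table: persists until dropped, but with reduced data-protection costs
CREATE TRANSIENT TABLE stg_shipments (shipment_id NUMBER, payload VARIANT);

-- Temporary table: visible only to the current session and dropped when it ends
CREATE TEMPORARY TABLE tmp_shipment_stats (status STRING, shipment_count NUMBER);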

View

A view in Snowflake is a virtual table that is defined by a SQL query. Views can be standard, secured, or materialized, each serving different purposes.

Example: A sales dashboard might use a view to present aggregated sales data by region and product category, based on a query that joins multiple underlying tables.

Secured Views

Secured views in Snowflake are views that enforce column-level security, ensuring that sensitive data is only visible to authorized users.

Example: In a multi-tenant application, a secured view might be used to ensure that each tenant can only see their own data, even though the underlying table contains data for all tenants.

Standard Views

Standard views in Snowflake are the default view type, providing a simple way to create a virtual table based on a SQL query without any additional security features.

Example: A marketing team might use a standard view to create a simplified representation of a complex query that combines customer data from multiple underlying tables.

Materialized Views

Materialized views in Snowflake are views that store the result set of the query physically, providing faster access to precomputed data.

Example: To speed up reporting on large datasets, a data warehouse might use materialized views to pre-aggregate daily sales data by store and product category.
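To illustrate the three view types side by side, here is a rough sketch; the underlying sales table and its columns are assumptions, not a prescribed design.

-- Standard view: just a saved query
CREATE VIEW sales_by_region AS
  SELECT region, product_category, SUM(amount) AS total_sales
  FROM sales
  GROUP BY region, product_category;

-- Secure view: the definition and underlying data are hidden from non-privileged roles
CREATE SECURE VIEW customer_sales AS
  SELECT customer_id, region, amount FROM sales;

-- Materialized view: results are precomputed and maintained by Snowflake
CREATE MATERIALIZED VIEW daily_store_sales AS
  SELECT store_id, sale_date, SUM(amount) AS total_sales
  FROM sales
  GROUP BY store_id, sale_date;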

Exporting GA4 data from BigQuery to Snowflake

In a previous article, we explored how to export data from GA4 to BigQuery. In instances where we want to migrate data from BigQuery to another platform like Snowflake, BigQuery offers a few options.

BigQuery Export options

Explore with Sheets:

Directly analyze and visualize your BigQuery data using Google Sheets.

Explore with Looker Studio:

Utilize Looker Studio (formerly Data Studio) for more advanced data exploration and interactive dashboard creation.

Export to GCS:

Save BigQuery datasets to Google Cloud Storage for storage or further processing with other tools.

Scan with Sensitive Data Protection:

Check your datasets for sensitive information before exporting, to ensure compliance and data privacy.

In our case, since we want to export the Google Analytics 4 data into Snowflake, we will first need to export it to Google Cloud Storage (GCS). From this storage, we can then ingest the data into Snowflake.

To understand the process flow, here is what we will be doing.

GA4 -> BigQuery -> Google Cloud Storage -> Snowflake

A. Exporting from BigQuery to GCS

Deciding on Export Format

Before exporting, we want to decide on the format in which data will be exported for consumption. You can choose any of CSV, JSON, Avro, and Parquet depending on the use case; we will go with Parquet in this example. A brief comparison of these four data formats is given in the table below.

| Feature | CSV | JSON | Avro | Parquet |
| Data Structure | Flat | Hierarchical | Hierarchical | Columnar |
| Readability | High (text-based) | High (text-based) | Low (binary) | Low (binary) |
| File Size | Medium | Large | Small | Small |
| Performance | Low | Medium | High | Very High |
| Schema Evolution | Not supported | Not supported | Supported | Supported |
| Use Case | Simple analytics | Web APIs, complex data | Long-term storage, evolving schemas | Analytical queries, large datasets |
| Compression | Low | Medium | High | Very High |

Why Parquet?

Here’s a brief summary of why we chose Parquet for exporting GA4 data from BigQuery to Snowflake.

Columnar Efficiency

We benefit from Parquet’s columnar storage, optimizing our query execution by accessing only the necessary data.

Cost-Effective Storage

Our expenditure on data storage is minimized due to Parquet’s superior compression and encoding capabilities.

Support for Hierarchical Data

It supports our GA4 hierarchical data structures, ensuring the integrity of our analytical insights.

Seamless Integration

We utilize Snowflake’s native support for Parquet for straightforward data processing.

Schema Evolution Support

Since GA4 is in its early stage and new features keep on coming, we can gracefully manage changes in our data schema, avoiding costly downtime and data migrations.

Exporting Single Table

Clicking on the Export -> Export to GCS option will give us an option box to pick export format and compression. I have also specified a GCS storage location to store the export.

Exporting Multiple tables

The visual interface only allows export of a single table. Google Analytics 4, however, stores each day’s data as a separate table. Therefore, we will have to find an alternative to the visual export.

Shell script for Multiple table export

We can write a shell script which can export all our tables into our bucket. At a high level, we want our script to do the following:

  1. Set Parameters: Define the dataset, table prefix, and GCS bucket.
  2. List Tables: Use BigQuery to list all events_ tables in the dataset.
  3. Export Tables: Loop through each table and export it to the designated GCS bucket as a Parquet file.

Here’s what the exact script looks like:

#!/bin/bash

# Define your dataset and table prefix
DATASET="bigquery-public-data:ga4_obfuscated_sample_ecommerce"
TABLE_PREFIX="events_"
BUCKET_NAME="tempv2"

# Get the list of tables using the bq ls command, filtering with grep for the table prefix
TABLES=$(bq ls --max_results 1000 $DATASET | grep $TABLE_PREFIX | awk '{print $1}')

# Loop through the tables and run the bq extract command on each one
for TABLE in $TABLES
do
    bq extract --destination_format=PARQUET $DATASET.$TABLE gs://$BUCKET_NAME/${TABLE}.parquet
done

Save the script and give it a name. I named it export_tables.sh. Make the script executable with chmod +x export_tables.sh.

Execute the shell script with ./export_tables.sh

If everything works out correctly, you will start to see the export output in the terminal.

You can check whether data has been exported by inspecting the contents of the storage bucket.


Allow appropriate access so that you can read the data in Snowflake. You can do this by opening the bucket > Permissions and then clicking on Grant Access.

In this example, I have granted access to allUsers. This will make the bucket publicly readable.


To ingest the data from Google Cloud Storage into Snowflake, we will create a storage integration between GCS and Snowflake and then create an external stage. The storage integration streamlines the authentication flow between GCS and Snowflake, while the external stage allows the Snowflake database to ingest the data.


B. Create Storage Integration in Snowflake

Creating a storage integration in Snowflake generates a Snowflake-managed GCP service account. We will then provision access to that service account from GCP.

  1. Log into Snowflake: Open the Snowflake Web UI and switch to the role with privileges to create storage integrations (e.g., ACCOUNTADMIN).
  2. Create the Storage Integration:

CREATE STORAGE INTEGRATION gcs_storage_integration
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = GCS
  ENABLED = TRUE
  STORAGE_ALLOWED_LOCATIONS = ('gcs://tempv2/');

  3. Retrieve Integration Details:

DESC STORAGE INTEGRATION gcs_storage_integration;

Snowflake will automatically create a GCP service account. We will then go back to Google Cloud to provision the necessary access to this service account.

  1. Navigate to GCS: Open the Google Cloud Console and go to your GCS bucket’s permissions page.
  2. Add a New Member: Use the STORAGE_GCP_SERVICE_ACCOUNT value as the new member’s identity.
  3. Assign Roles: Grant roles such as Storage Object Viewer for read access.

C. Create External Stage in Snowflake

The external stage will allow the Snowflake database to ingest data from the external GCS source.

  1. Define the File Format (if not already defined):

CREATE OR REPLACE FILE FORMAT my_parquet_format
  TYPE = 'PARQUET';

  2. Create External Stage

CREATE OR REPLACE STAGE my_external_stage
  URL = 'gcs://tempv2/'
  STORAGE_INTEGRATION = gcs_storage_integration
  FILE_FORMAT = my_parquet_format;

  3. Verify the external stage with the LIST command; the output will show the staged files.
  4. Create a table to load data into

We will create a table so that we can load data from the stage into it.

CREATE TABLE raw_ga4 (
  data VARIANT
);

  5. Load data from the stage

Finally, we can load data from the external stage into the Snowflake table using the COPY INTO command.

There are two primary ways that we can ingest data into a Snowflake database:

  1. With an upfront Schema
  2. Without an upfront Schema

In this case, we will ingest the data without having an upfront schema.

D. Loading data without Schema

Snowflake provides a VARIANT data type. It is used to store semi-structured data such as JSON, Avro, or Parquet. It is useful because it allows you to ingest and store data without needing to define a schema upfront. A VARIANT column can hold structured and semi-structured data in the same table, enabling you to flexibly query the data using standard SQL alongside Snowflake’s semi-structured data functions.

Therefore, in step 4, we created a simple table with a single VARIANT column named data.

To load the data into our raw_ga4 table, we use the following command.

COPY INTO GA4ECOMMERCE.PUBLIC.RAW_GA4
  FROM @my_external_stage
  FILE_FORMAT = (FORMAT_NAME = 'my_parquet_format');

This will load all files into the data (VARIANT) column.
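Once loaded, the VARIANT column can be queried with Snowflake’s path syntax. The field names below assume the standard GA4 export schema (event_date, event_name, user_pseudo_id); adjust them to whatever your Parquet files actually contain.

SELECT
    data:event_date::STRING     AS event_date,
    data:event_name::STRING     AS event_name,
    data:user_pseudo_id::STRING AS user_pseudo_id
FROM GA4ECOMMERCE.PUBLIC.RAW_GA4
LIMIT 10;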

You can also view the data from the Data Preview tab of the RAW_GA4 table.


Architectural patterns for managing slowly changing dimensions

What are SCDs?

Slowly Changing Dimensions (SCDs) are an approach in data warehousing used to manage and track changes to dimensions over time. They play an important role in data modeling, especially in the context of a data warehouse where maintaining the historical accuracy of data over time is essential. The term “slowly” in SCDs refers to the rate of change and the method of handling these changes. For example, changes to a customer’s address or marital status happen infrequently, making these “slowly” changing dimensions. This is in contrast to “fast-changing” scenarios where data elements like stock prices or inventory levels might update several times a day or even minute by minute.

Types of SCDs

In the context of data modeling, there are three main types of slowly changing dimensions. Choosing the right type of SCD depends on business requirements. Given below is a brief overview of the most frequently used types of SCDs in data modeling.

Type 1 SCD (No History)

Overwrites old data with new data. It’s used when the history of changes isn’t necessary. For example, correcting a misspelled last name of a customer.

Scenario:

A customer changes their email address.

Before the Change:

| Customer ID | Name | Email |
| 001 | Manan Younas | myemail@oldmail.com |

After the Change:

The old email is overwritten with the new one in the same record.

| Customer ID | Name | Email |
| 001 | Manan Younas | myemail@newmail.com |

Explanation:

In this table, Manan’s email address is updated directly in the database, replacing the old email with the new one. No historical record of the old email is maintained.


Type 2 SCD (Full History)

Adds new records for changes, keeping historical versions. It’s crucial for auditing purposes and when analyzing the historical context of data, like tracking price changes over time.

Scenario:

A customer changes their subscription plan.

Before the Change:

Originally, the customer is subscribed to the “Basic” plan.

| Customer ID | Name | Subscription Plan | Start Date | End Date | Current |
| 001 | Manan Younas | Basic | 2023-01-01 | NULL | Yes |

After the Change:

The customer upgrades to the “Premium” plan on 2023-06-01.

  1. Update the existing record to set the end date and change the “Current” flag to “No.”
  2. Add a new record with the new subscription plan, starting on the date of the change.
| Customer ID | Name | Subscription Plan | Start Date | End Date | Current |
| 001 | Manan Younas | Basic | 2023-01-01 | 2023-06-01 | No |
| 001 | Manan Younas | Premium | 2023-06-01 | NULL | Yes |

Explanation:

Before the Change: The table shows Manan Younas with a “Basic” plan, starting from January 1, 2023. The “End Date” is NULL, indicating that this record is currently active.

After the Change: Two changes are made to manage the subscription upgrade:

  1. The original “Basic” plan record is updated with an “End Date” of June 1, 2023, and the “Current” status is set to “No,” marking it as historical.
  2. A new record for the “Premium” plan is added with a “Start Date” of June 1, 2023. This record is marked as “Current” with a NULL “End Date,” indicating it is the active record.

This method of handling SCDs is beneficial for businesses that need to track changes over time for compliance, reporting, or analytical purposes, providing a clear and traceable history of changes.


Type 3 SCD (Limited History)

Type 3 SCDs add new columns to store both the current and at least one previous value of the attribute, which is useful for tracking limited history without the need for multiple records. It is less commonly used but useful for tracking a limited history where only the most recent change is relevant.

Scenario:

A customer moves from one city to another.

Before the Move:

Initially, only current information is tracked.

| Customer ID | Name | Current City |
| 001 | Manan Younas | Sydney |

After the Move:

A new column is added to keep the previous city alongside the current city.

| Customer ID | Name | Current City | Previous City |
| 001 | Manan Younas | Melbourne | Sydney |

Explanation:

In this table, when Manan moves from Sydney to Melbourne, the “Current City” column is updated with the new city, and “Previous City” is added to record his last known location. This allows for tracking the most recent change without creating a new record.

These examples illustrate the methods by which Type 1, Type 2, and Type 3 SCDs manage data changes. Type 1 SCDs are simpler and focus on the most current data, discarding historical changes. Type 3 SCDs, meanwhile, provide a way to view a snapshot of both the current and previous data states without maintaining the full historical records that Type 2 SCDs do.

Architectural considerations for Managing SCDs

The management of Slowly Changing Dimensions (SCDs) in data warehousing requires careful architectural planning to ensure data accuracy, completeness, and relevance. Let’s discuss the implementation considerations for each type of SCD and the architectural setups required to support these patterns effectively.

Type 1 Implementation Considerations

Scenarios Where Most Effective

Type 1 SCDs are most effective in scenarios where historical data is not needed for analysis and only the current state is relevant. Common use cases include:

  • Correcting data errors in attributes, such as fixing misspelled names or incorrect product attributes.
  • Situations where business does not require tracking of historical changes, such as current status updates or latest measurements.

Architectural Setup

Database Design: A simple design where each record holds the latest state of the data. Historical data tracking mechanisms are not needed.

Data Update Mechanism: The implementation requires a straightforward update mechanism where old values are directly overwritten by new ones without the need for additional fields or complex queries.

Performance Considerations: Since this pattern only involves updating existing records, it typically has minimal impact on performance and does not significantly increase storage requirements.
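As a minimal sketch of a Type 1 update in SQL, assuming a dim_customer dimension and a stg_customer staging table (both hypothetical names), new values simply overwrite the old ones:

MERGE INTO dim_customer AS d
USING stg_customer AS s
  ON d.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET d.email = s.email            -- overwrite; no history kept
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, email)
  VALUES (s.customer_id, s.name, s.email);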

Type 2 Implementation Considerations

Scenarios Where Most Effective

Type 2 SCDs are crucial when the full history of changes must be preserved for compliance, reporting, or analytical purposes. They are widely used in:

  • Customer information management, where it is necessary to track address history, status changes, or subscription details.
  • Product information tracking, where changes over time can provide insights into lifecycle management and evolution.

Architectural Setup

Database Design: Requires a more complex setup with additional fields for managing historical data, such as start date, end date, and a current flag.

Data Management Strategy: Insertion of new records for each change, while updating or closing previous records to indicate they are no longer current. This setup can be managed through triggers or application logic.

Versioning and Timestamping: Implementation of version control and timestamping to ensure each change is accurately recorded with its validity period.

Performance and Storage Considerations: Type 2 can significantly increase the volume of data stored, which may impact performance. Indexing on key fields and partitioning historical data can help optimize query performance.
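A simplified Type 2 sketch for the subscription example above (table and column names are assumptions) closes the current record and inserts a new version:

-- Close out the currently active record
UPDATE dim_subscription
SET end_date = '2023-06-01', is_current = 'No'
WHERE customer_id = '001' AND is_current = 'Yes';

-- Insert the new version as the active record
INSERT INTO dim_subscription
  (customer_id, name, subscription_plan, start_date, end_date, is_current)
VALUES
  ('001', 'Manan Younas', 'Premium', '2023-06-01', NULL, 'Yes');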

Type 3 Implementation Considerations

Scenarios Where Most Effective:

Type 3 SCDs are used when tracking a limited history is sufficient. This can be applicable in cases like:

  • Tracking a previous address alongside the current one for a short-term promotional campaign.
  • Monitoring recent changes in terms and conditions for services, where only the most recent previous state is relevant.

Architectural Setup:

Database Design: Includes additional columns to store both the current and previous values of the tracked attributes. This setup is simpler than Type 2 but more complex than Type 1.

Data Update Mechanism: Updates involve changing multiple fields within a single record—both updating the previous value fields and writing new current values.

Performance Considerations: This method increases the size of each record but does not increase the total number of records as Type 2 does. Performance impacts are generally moderate but can involve more complex queries than Type 1.
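A Type 3 change is a single-row update that shifts the current value into the “previous” column before writing the new value (again, names are illustrative):

UPDATE dim_customer
SET previous_city = current_city,   -- keep the last known value
    current_city  = 'Melbourne'
WHERE customer_id = '001';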

Conclusion

Each type of SCD requires a different architectural approach based on the business requirements for historical data. While Type 1 is simpler and less resource-intensive, Types 2 and 3 provide data tracking capabilities at the cost of additional complexity and potentially higher resource requirements. Properly choosing and implementing these patterns will depend on the specific needs of the business, the criticality of historical data, and the performance impacts on the data warehouse system.

Association Rule Learning in Marketing Analytics

Association Rule Learning is used to find relationships between items or events in large data sets. The primary goal of Association Rule Learning is to identify frequently occurring patterns in data to reveal hidden relationships.

How Association Rule Learning works

Association rules are typically represented in the form of “If {A} then {B}”, where A and B are sets of items or events. The strength of an association rule is usually measured by three key metrics:

  1. Support: The proportion of transactions in the dataset that contain both A and B. A high support value indicates that the rule occurs frequently in the data.
  2. Confidence: The probability of B occurring given that A has occurred. A high confidence value means that if A is present, there is a high likelihood of B also being present.
  3. Lift: The ratio of the observed support to the expected support if A and B were independent events. A lift value greater than 1 indicates that the occurrence of A and B together is more frequent than what would be expected if they were unrelated.
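These three metrics can be computed directly in SQL. The sketch below assumes a simple transactions table with one row per (transaction_id, item) pair and measures the rule If {Product_A} then {Product_B}; the table and item names are placeholders.

WITH totals AS (
  SELECT COUNT(DISTINCT transaction_id) AS n FROM transactions
),
a  AS (SELECT DISTINCT transaction_id FROM transactions WHERE item = 'Product_A'),
b  AS (SELECT DISTINCT transaction_id FROM transactions WHERE item = 'Product_B'),
ab AS (SELECT a.transaction_id FROM a JOIN b ON a.transaction_id = b.transaction_id)
SELECT
  (SELECT COUNT(*) FROM ab) / t.n                                   AS support_ab,
  (SELECT COUNT(*) FROM ab) / NULLIF((SELECT COUNT(*) FROM a), 0)   AS confidence_a_to_b,
  ((SELECT COUNT(*) FROM ab) / t.n) /
    NULLIF(((SELECT COUNT(*) FROM a) / t.n) * ((SELECT COUNT(*) FROM b) / t.n), 0) AS lift
FROM totals AS t;

A lift above 1 in this output would indicate that Product_A and Product_B co-occur more often than chance would suggest.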

Use cases in Marketing Analytics

There are several applications of Association Rule Learning in the context of marketing analytics. Listed below are common use cases, examples, and a brief summary of how association rules are used in each.

| Category | Use Case | Example | Association Rule |
| Cross-channel Marketing | Channel strategy optimization | Identifying effective combinations of marketing channels for engagement or conversions | If {Email, Social_Media, Paid_Search} then {High_Conversions} |
| Email Marketing | Content and offer optimization | Identifying effective content and discount code combinations | If {Product_A_Recommendation} then {Discount_Code_X} |
| Content Marketing | Content strategy optimization | Identifying popular combinations of blog topics to engage users | If {Blog_Topic_A, Blog_Topic_B} then {High_Engagement} |
| Social Media Marketing | Social media strategy optimization | Finding effective combinations of social media posts, hashtags, or influencers | If {Post_Type_A, Hashtag_X} then {High_Engagement} |
| Search Engine Optimization (SEO) | SEO strategy optimization | Identifying effective combinations of keywords and content types for organic traffic | If {Keyword_A, Content_Type_B} then {High_Organic_Traffic} |
| Landing Page Optimization | Conversion optimization | Analyzing the effectiveness of different landing page elements | If {Image_A, Headline_B, CTA_C} then {High_Conversions} |
| Affiliate Marketing | Affiliate strategy optimization | Identifying the most effective combinations of affiliate offers and traffic sources | If {Affiliate_Offer_A, Traffic_Source_B} then {High_Conversions} |
| Online Advertising | Ad strategy optimization | Analyzing the impact of different ad creatives and targeting options | If {Ad_Creative_A, Targeting_B} then {High_Clicks} |
| Customer Segmentation | Targeted marketing optimization | Analyzing associations between customer attributes and marketing responsiveness | If {Demographics_A, Browsing_Behavior_B} then {High_Engagement} |
| Product Recommendations | Personalization | Analyzing associations between products in an online store to inform recommendations | If {Product_A, Product_B} then {High_Likelihood_of_Purchase} |

Algorithms for Association Rule Learning

To get an idea of the various algorithms based on Association Rule Learning, given below is a summary of algorithms developed over the last three decades.

Association Rule Learning – Algorithms Summary
| Algorithm | What it does | Developed by | Year of development | Total Citations |
| Apriori | Mines frequent itemsets using a breadth-first search | Rakesh Agrawal, Tomasz Imieliński, Arun Swami | 1993 | ~22,000 |
| Eclat | Mines frequent itemsets using a depth-first search | Mohammed J. Zaki | 1997 | ~3,000 |
| FP-Growth | Mines frequent itemsets without candidate generation | Jiawei Han, Jian Pei, Yiwen Yin | 2000 | ~12,000 |
| H-mine | Improves upon FP-Growth using a hyper-structure approach | J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, D. Yang | 2001 | ~800 |
| RElim | Eliminates items recursively to mine frequent itemsets | Christian Borgelt | 2004 | ~200 |
| LCM | Mines closed frequent itemsets in linear time | Takeaki Uno, Tatsuya Asai, Hiroki Arimura | 2004 | ~400 |
| FARMER | Uses a matrix-based data structure to mine frequent itemsets | Roberto J. Bayardo Jr. | 2004 | ~100 |
| OPUS Miner | Discovers the top-K association rules with the highest overall utility | Geoffrey I. Webb, Shichao Zhang | 2013 | ~100 |

Classification model using AutoML in Vertex AI

What is Vertex AI?

Vertex AI is Google’s solution for problem-solving in the artificial intelligence domain. To put things into context, Microsoft provides the Azure Machine Learning platform for AI problem solving, and Amazon has SageMaker for AI workloads.

Google’s Vertex AI supports two processes for model training.

  1. AutoML: This is the easy version. It lets you train models with low effort and little machine learning expertise. The downside is that the parameters you can tweak in this process are very limited.
  2. Custom training: This gives data science engineers free rein to go wild with machine learning. You can train models using TensorFlow, scikit-learn, XGBoost, etc.

In this blog post, we will use AutoML to train a classification model, deploy it to a GCP endpoint, and then consume it using the Google Cloud Shell.

Supported data types

Image

  1. Image classification single-label
  2. Image classification multi-label
  3. Image object detection
  4. Image segmentation

Tabular

  1. Regression / classification
  2. Forecasting

Text

  1. Text classification single-label
  2. Text classification Multi-label
  3. Text entity extraction
  4. Text sentiment analysis

Video

  1. Video action recognition
  2. Video classification
  3. Video object tracking

Supported Data sources

You can upload data to Vertex AI from three sources:

  1. From a local computer
  2. From Google Cloud Storage
  3. From BigQuery

Training a model using AutoML

Training a model in AutoML is straightforward. Once you have created your dataset, you can use a point-and-click interface to create a model.

Training Method

Model Details

Training options

Feature Selection

Factor weightage

You have an option to weigh your factors equally.

Optimization objective

Optimization objective options vary for each workload type. In our case, we are doing classification, so the options shown are those relevant to a classification workload. For more details on optimization objectives, the optimization objective documentation is very helpful.

Compute and pricing

Lastly, we have to select a budget in terms of how many node hours we want our model to train for. Google’s Vertex AI pricing guide is helpful in understanding the pricing.

Once you have completed these steps, your model will move into training mode. You can view the progress from the Training link in the navigation menu. Once the training is finished, the model will start to appear in the Model Registry.

Deploying the model

Model deployment is done via the Deploy and Test tab on the model page itself.

  1. Click on Deploy to Endpoint
  2. Select a machine type for deployment
  3. Click Deploy.

Consuming the model

To consume the model, we need a few parameters. We can set these parameters as environment variables using the Google Cloud Shell and then invoke the model with ./smlproxy.

Setting environment variables

| Environment Variable | Value |
| AUTH_TOKEN | Use the value from the previous section |
| ENDPOINT | https://sml-api-vertex-kjyo252taq-uc.a.run.app/vertex/predict/tabular_classification |
| INPUT_DATA_FILE | INPUT-JSON |

Getting predictions

./smlproxy tabular -a $AUTH_TOKEN -e $ENDPOINT -d $INPUT_DATA_FILE