Select Page

Association Rule Learning in Marketing Analytics

Association Rule Learning is used to find relationships between items or events in large data sets. The primary goal of Association Rule Learning is to identify frequently occurring patterns in data to reveal hidden relationships.

How Association Rule Learning works?

Association rules are typically represented in the form of “If {A} then {B}”, where A and B are sets of items or events. The strength of an association rule is usually measured by three key metrics:

  1. Support: The proportion of transactions in the dataset that contain both A and B. A high support value indicates that the rule occurs frequently in the data.
  2. Confidence: The probability of B occurring given that A has occurred. A high confidence value means that if A is present, there is a high likelihood of B also being present.
  3. Lift The ratio of the observed support to the expected support if A and B were independent events. A lift value greater than 1 indicates that the occurrence of A and B together is more frequent than what would be expected if they were unrelated.

Use cases in Marketing Analytics

There are several applications of the Association Learning Rule, in the context of marketing analytics, listed below is a list of common use cases, their examples, and a very brief summary of how Association Rules are used in them.

wdt_ID Category Use Case Example Association Rule
1 Cross-channel Marketing Channel strategy optimization Identifying effective combinations of marketing channels for engagement or conversions If {Email, Social_Media, Paid_Search} then {High_Conversions}
2 Email Marketing Content and offer optimization Identifying effective content and discount code combinations If {Product_A_Recommendation} then {Discount_Code_X}
3 Content Marketing Content strategy optimization Identifying popular combinations of blog topics to engage users If {Blog_Topic_A, Blog_Topic_B} then {High_Engagement}
4 Social Media Marketing Social media strategy optimization Finding effective combinations of social media posts, hashtags, or influencers If {Post_Type_A, Hashtag_X} then {High_Engagement}
5 Search Engine Optimization (SEO) SEO strategy optimization Identifying effective combinations of keywords and content types for organic traffic If {Keyword_A, Content_Type_B} then {High_Organic_Traffic}
6 Landing Page Optimization Conversion optimization Analyzing the effectiveness of different landing page elements If {Image_A, Headline_B, CTA_C} then {High_Conversions}
7 Affiliate Marketing Affiliate strategy optimization Identifying the most effective combinations of affiliate offers and traffic sources If {Affiliate_Offer_A, Traffic_Source_B} then {High_Conversions}
8 Online Advertising Ad strategy optimization Analyzing the impact of different ad creatives and targeting options If {Ad_Creative_A, Targeting_B} then {High_Clicks}
9 Customer Segmentation Targeted marketing optimization Analyzing associations between customer attributes and marketing responsiveness If {Demographics_A, Browsing_Behavior_B} then {High_Engagement}
10 Product Recommendations Personalization Analyzing associations between products in an online store to inform recommendations If {Product_A, Product_B} then {High_Likelihood_of_Purchase}

Algorithms for Association Rule Learning

To get an idea of various algorithms based on Association Rule Learning, given below is a summary of such algorithms developed over the last 3 decades.

Association Rule Learning – Algorithms Summary
Algorithm wdt_ID What it does Developed by Year of development Total Citations
Apriori 1 Mines frequent itemsets using a breadth-first search Rakesh Agrawal, Tomasz Imieliński, Arun Swami 1993 ~22,000
Eclat 2 Mines frequent itemsets using a depth-first search Mohammed J. Zaki 1997 ~3,000
FP-Growth 3 Mines frequent itemsets without candidate generation Jiawei Han, Jian Pei, Yiwen Yin 2000 ~12,000
H-mine 4 Improves upon FP-growth using a hyper-structure approach J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, D. Yang 2001 ~800
RElim 5 Eliminates items recursively to mine frequent itemsets Christian Borgelt 2004 ~200
LCM 6 Mines closed frequent itemsets in linear time Takeaki Uno, Tatsuya Asai, Hiroki Arimura 2004 ~400
FARMER 7 Uses a matrix-based data structure to mine frequent itemsets Roberto J. Bayardo Jr. 2004 ~100
OPUS Miner 8 Discovers the top-K association rules with the highest overall utility Geoffrey I. Webb, Shichao Zhang 2013 ~100

Classification model using AutoML in Vertex AI

What is Vertex AI?

Vertex AI is Google’s solution for problem-solving in Artificial Intelligence domain. To put things into context, Microsoft provides Azure Machine Learning platform for artificial intelligence problem solving and Amazon has Sage Maker for solving AI workloads.

Google’s Vertex AI supports two processes for model training.

  1. AutoML: This is the easy version. It lets you, train models with low effort and machine learning expertise. The downside is that the parameters you can tweak in this process are very limited.
  2. Custom training: This is free space for data science engineers to go wild with machine learning. You can train models using TensorFlow, sickit-learn, XGBoost etc.

In this blog post, we will use AutoML to train a classification model, deploy it to a GCP endpoint, and then consume it using the GS cloud shell.

Supported data types

Image

  1. Image classification single-label
  2. Image classification multi-label
  3. Image object detection
  4. Image segmentation

Tabular

  1. Regression / classification
  2. Forecasting

Text

  1. Text classification single-label
  2. Text classification Multi-label
  3. Text entity extraction
  4. Text sentiment analysis

Video

  1. Video action recognition
  2. video classification
  3. video object tracking

Supported Data sources

You can upload data to Vertex AI from 3 sources

  1. from Local computer
  2. From google cloud storage
  3. From Bigquery

Training a model using AutoML

    Training a model in AutoML is straightforward. Once you have created your dataset, you can use a click-point interface for creating a model.

    Training Method

    Model Details

    Training options

    Feature Selection

    Factor weightage

    You have  an option to weigh your factors equally.

    Optimization objective

    Optimization objectives options vary for each workload type. In our case, we are doing classification, hence it has given options relevant to an optimization workload. For more details on optimzation objectives, this optimization objective documentation is very helpful.

    Compute and pricing

    Lastly, we have to select Budget in terms of how many node hours do we want our model to run for. Google’s vertex AI pricing guide is helpful in understanding the pricing.

    Once you have completed these steps, your model will move into training mode. You can view the progress from the Training link in the navigation menu.  Once the training is finished, the model will start to appear in Model Registry.

     

    Deploying the model

    Model deployment is done via Deploy and Test tab on the model page itself.

    1. Click on Deploy to End-point
    2. Select a machine type for deployment
    3. Click deploy.

    Consuming the model

    To consume the model, we need a few parameters. We can set these parameters as environment variables using the Google cloud shell and then invoke the model with ./smlproxy

    Setting environment variables

    ENVIRONMENT VARIABLE VALUE
    AUTH_TOKEN Use the value from the previous section
    ENDPOINT https://sml-api-vertex-kjyo252taq-uc.a.run.app/vertex/predict/tabular_classification
    INPUT_DATA_FILE INPUT-JSON

    Getting predictions

    ./smlproxy tabular -a $AUTH_TOKEN -e $ENDPOINT -d $INPUT_DATA_FILE

     

    Practical Application of Logarithms

    What is Logarithm

    A logarithm is an inverse function of exponentiation. A simplified way of saying the same would be, that a logarithm is the opposite of an exponent. E.g.

    The exponent form 2^3 = 8 can be written in its (opposite) logarithm form as \log_2 8 = 3

    Let us describe the above mathematical forms in their simple English equivalents.

    2^3 = 8,  asks the question what will be the value if  2  grows  3 times? Answer is 8

    \log_2 8 = 3, asks the question how many times will 2 have to grow to reach 8? The answer is 3 times

    What is Natural Logarithm?

    A natural logarithm is a special case of logarithm which has Euler’s constant (e) as its base. Mathematically Euler’s constant can be written as  e^x. If we convert this into logarithmic form, we get the natural log.

    e^x = a     ==>    \log_e a = x

    The logarithm form is denoted as ln(a), which is called natural log.

    Let us describe these mathematical forms in their simple English equivalent.

    e^x = a , asks the question what will be the value if e grows x times ? Answer is a

    ln(a) ==>    \log_e a = x ,  asks the question how many times will  e have to grow to reach a> Answer is x times.

    Practical Application of Logarithm

    Logarithms have practical applications in the following:

    Scaling in Graphs

    In graphs, logarithms are used for displaying a large range of values in compact space.  For example, consider this covid death data plotted in both linear and log axis. The left side shows linear scale of deaths on Y-Axis. The right side graph shows logarithmic scale of deaths on Y-Axis.  We have used logarithm to compact the size of the Y-Axis.

    It is interesting to note that about 60% of the respondent in a recent survey by LSE did not understand what was being explained by a logarithmic graph. In another interesting article, Mordechai Rorvig  explains the dynamics of log scale graphing and the cautions required in reading the logarithm axis.

    Measurement scales

    Logarithms are used in measurement scales where calculations require taking values over a large range into consideration. A few examples of measurement scales engaging logarithms are

    Ritcher scale for earthquake monitoring

    The Richter scale is logarithmic and uses a base of 10, meaning that whole-number jumps indicate a tenfold increase. In this case, the increase is in wave amplitude. That is, the wave amplitude in a level 6 earthquake is 10 times greater than in a level 5 earthquake, and the amplitude increases 100 times between a level 7 earthquake and a level 9 earthquake. 

    Google page rank algorithm

    Google’s page rank algorithm shows the relative authority and traffic of a website on a scale between 0 -10. A higher number means higher traffic. The PageRank score is a logarithmic score. The actual calculation takes a lot of factors into account e.g. backlinks, referring domains, pr quality, authority score, etc. For this illustration, I have only taken traffic as an indicator. The two columns with log values calculate the rank with base 10 and 6 respectively. The main idea here is that we can use logarithms to represent a large-scale variation with a smaller number which makes intuitive sense for readers.

    Domain Monthly visits (similarweb.com) log10(x)=Pagerank log6(x)=Pagerank Estimated Google Pagerank (checkpagerank.net)
    cnn.com 524,000,000 8.7 11.2 9
    latrobe.edu.au 22,000,000 7.3 9.4 6

    Growth & decay

    Logarithmic growth is the inverse of exponential growth and is very slow. Exponential growth starts slowly and then speeds up. Logarithmic growth starts fast and then slows down.

    The function of logarithmic growth describes a relationship where the result can be described as a log function of input

    f(t)= A.Log(t) + B

    The confusion between logarithmic growth and exponential growth can be explained by the fact that exponential growth curves may be straightened by plotting them using a logarithmic scale for the growth axis.

     

    Facebook Campaigns for Marketing Analytics

    This article aims to summarise attributes available in any facebook campaign at the time of configuration. Having an overview of campaign attributes available is handy in understanding what can be analyzed,configured and managed within facebook campaigns. Depending on your role, you can go through the article to achieve the following objectives:

    Data Analysts: If you are a data analyst, you can use the attributes below to understand what are your possible dimensions of analysis.

    Data Scientists : If you are a data scientist, you can use the attributes to identify features for your data science model.

    Campaign Managers : If you are a campaign manager, you can use the attributes as options available to configure,manage and optimise your campaign.

     

    Facebook Campaign

    Facebook Ads Manager  is used for creating campaigns. The first step in creating a campaign is to select its Objectives. Clicking on a Create Campaign button, opens selection choice for objective.

    Campaign attribute summary

    Attribute summary of campaign is given in the table below:

    Category Campaign Objectives Description
    Awareness Brand Awareness Show your ads to people who are most likely to remember them.
    Reach Show your ads to the maximum number of people.
    Consideration Traffic Send people to a website, app or Facebook event, or let them tap to call you.
    Engagement Get more Page likes, event responses, or post reacts, comments or shares.
    App Installs Show your ad to people most likely to download and engage with your app. Learn more
    Video Views Show people video ads.
    Lead Generation Use forms, calls or chats to gather info from people interested in your business. Learn more
    Messages Show people ads that allow them to engage with you on Messenger, WhatsApp and Instagram.
    Conversions Conversions Show your ads to the people who are most likely to take action, such as buying something or calling you from your website.
    Catalogues Sales Use your target audience to show people ads with items from your catalogue.
    Store Traffic Show your ad to people most likely to visit your physical shops when they're near them. Learn more

    Facebook Bidding Strategies

    Facebook Ad ecosystem is Auction based. Advertisers compete for ad space ( placement ). Facebook provides the following strategies to accomplish the Ad auctions.

    Category Bid Strategy Description
    Spend-based Lowest Cost Get the most result for your budget
    Highest Value Spend your budget and focus on highest value purchases
    Goal-Based Cost Cap Strive to keep costs around the cost amount regardless of market conditions
    Minimum ROAS Budget Type
    Manual Bid Cap Manually control bids

    Facebook Budget

    Budget is the total amount you want to spend for the campaign.  Budget can be configured at two levels.

    1. Campaign Budget : Budget for entire campaign.
    2. Ad Set Budget : Budget for individual Ad Sets within campaigns.

    Campaign Budget

    Campaign budget applies to all Ad Sets within a campaign. It can be set on a daily basis or lifetime of campaign basis.

     

    Ad Set Budget

    Ad Set budget is defined at an Adset level. After you create a campaign, you can choose whether you want to create budget for Ad Set. It is optional. If it is not set, then campaign budget will determine Ad Set Budget.

    Facebook Audiences

    Facebook supports few types of audiences.

    1. Core Audience: Defined from Adset panel, with limited options.
    2. Custom Audience: Defined from Audience Builder, with a lot more options than core audience.
    3. Look Alike Audience : Uses facebook’s algorithm to find new people similar to another audience.
    4. Special Ad Audience : Only available for users in special ads categories of credit, employment and housing ads.

     

    You can create last 3 audiences from Audience manager.

    Core Audience

    Core Audeinces is defined from default adset.  We can select from Location, Age, Gender, Demographic, interest, behavior and connections.

    Custom Audience

    Custom audiences provide a lot more attributes for selection.  You can create custom audiences from audience manager or from the Ad set option

    Lookalike Audience

    A lookalike audience is built from an existing audience source. We provide the source, facebook find similar audiences. We can control the level  of similarity ( or difference ) between source and target audience. For analysis purposes, three attributes of interest would be :

    1. Lookalike source
    2. Location
    3. Similarity Percentage.

    Within facebook, creation of lookalike audience will be done from this screen.

     

    Audience attribute summary

    Attribute summary of audiences is given in the table below:

    Audience Type Category Attributes Description
    Core Audience Defined using default audience panel within the AdSet creation section.
    Custom Audience Your Sources Website Custom audiences which can be created in facebook from our own audience data or from facebook audience data.
    Customer List
    App Activity
    Offline Activity
    Catalogue
    Meta Sources Video
    Instagram Account
    Lead form
    Events
    Instant Experience
    Facebook Page
    Shopping
    On-Facebook Listings
    Lookalike Audiences Mixed Lookalike source Using machine learning algoritm to create audiences which have relatively similar attributes than original audiences.
    Location
    Similarity percentage
    Special Ad Audience Lookalike source Restricted for Credit, Employment or Housing
    Location
    Similarity percentage

    Facebook Placements

    Placement is where our Ads show up. It is of significant importance in any campaign performance analysis. Same ad shown on top placement may have different efficiency as compared to a bottom placement.

    There are two  placement  selection choices  in facebook.

    1. Automatic placement : Facebook’s delivery system allocates your ad set’s budget across multiple placements. You do not have much control over it.
    2. Manual placement : You choose which placements you want your ad to be displayed on.

    There are eight categories of placement  in facebook, each comprising of several types of placements. Let us review all possible attributes in facebook placement.

    Placements attribute summary

    Placement Type Placement Example
    Feeds Facebook feed Facebook Feed Example
    Instagram feed Instagram feed example
    Facebook marketplace Facebook marketplace example
    Facebook video feeds Facebook video feed example
    Facebook right column Facebook right column example
    Instagram explore Instagram explore example
    Instagram shop Instagram shop example
    Messenger Inbox Messenger Inbox example
    Facebook groups feed Facebook group feeds example
    Facebook Business Explore Facebook Business explore example
    Stories and Reels Instagram Stories Instagram stories example
    Facebook Stories Facebook stories example
    Messenger Stories Messenger stories example
    Instagram Reels Instagram Reels example
    Facebook Reels Facebook Reels example
    In-stream Facebook In-stream videos Facebook Instream Videos example
    Reels overlay Facebook overlay ads in Reels Facebook Reels overlay example
    Search Facebook search results Facebook Search Results example
    Messages Messenger sponsored messages Sponsored message example
    In-article Facebook Instant Articles Facebook Instant Article example
    Apps and sites Audience Network native, banner and interstitial Facebook Audience Network banner example
    Audience Network rewarded videos Audience Network rewarded videos example
    Audience Network in-stream videos Audience network instream videos example

    Mobile Devices & operating systems.

    Within the Adset configuration, we can also choose which mobile devices and operating systems you want to target your ads on. If you are looking to analyse apple users(iOS) vs samsung users(Android), this will be your starting point.  There are 4 options available at the time of writing.

    Mobile Devices attribute summary

    1. All mobile devices
    2. Android devices only
    3. iOS devices only
    4. Feature phones only

    Another option available for mobile delivery is whether you want to display Ads, when audience is connected to wi-fi.

    Optimization and Delivery

    By selecting an optimisation method, you inform facebook on what does success mean for your campaign. Based on your “success criteria”, facebook’s machine learning algorithm can work to (optimise)  tweak campaign parameters to get desired success results.

    Optimisation attribute summary

    Optimisation methods vary for each campaign objective. Here is a quick summary of what optimisation methods are available for each campaign type.

    Category Campaign Objective Optimisation method Optimisation method description
    Awareness Brand Awareness Ad recall lift Facebook will serve your ads to maximise the total number of people who will remember seeing your ads.
    Reach Reach Facebook will serve your ads to the maximum number of people
    Impressions Facebook will deliver your ads to people as many times as possible
    Consideration Traffic Landing page views Facebook will deliver your ads to people who are more likely to click on your ad's link and load the website or Instant Experience. (Pixel)
    Link clicks Facebook will deliver your ads to the people most likely to click on them.
    Impressions Facebook will deliver your ads to people as many times as possible.
    Daily unique reach Facebook will deliver your ads to people up to once a day.
    Engagement Impressions   Facebook will deliver your ads to people as many times as possible.
    Post Engagement   Facebook will deliver your ads to the right people to help you get the most likes, shares or comments on your post at the lowest cost.
    Daily unique reach   Facebook will deliver your ads to people up to once a day.
    Page likes   Facebook will deliver your ads to the right people to help you get more Page likes at the lowest cost.
    Event responses   Facebook will deliver your ads to the right people to help you get the most event interest at the lowest cost.
    App Installs App Installs  Facebook will deliver your ads to the people most likely to install your app.
    Link clicks   Facebook willdeliver your ads to the people most likely to install your app.
    App Events  Facebook will deliver your ads to the people most likely to install your app.
    Value   Facebook willdeliver your ads to the people most likely to install your app.
    Video Views ThruPlay Facebook will deliver your ads to help you get the most completed video plays if the video is 15 seconds or shorter. For longer videos, this will optimise for people most likely to play at least 15 seconds.
    2-second continuous video views Facebook will deliver your ads to get the most video views of two continuous seconds or more. Most 2-second continuous video views will have at least 50% of the video pixels on screen.
    Lead Generation Leads We'll deliver your ads to the right people to help you get the most leads at the lowest cost.
    Conversion leads Facebook will deliver your ads to help you get leads that are most likely to convert.
    Messages Conversations   Facebook will deliver your ads to people most likely to have a conversation with you through messaging.
    Leads   Facebook will deliver your ads to the right people to help you get the most leads at the lowest cost.
    Link clicks   Facebook will deliver your ads to the people most likely to click on them.
    Replies   Facebook will deliver your ads to people most likely to have a conversation with you through messages.
    Conversions Conversions Conversions   Facebook will deliver your ads to the right people to help you get the most website conversions.
    Conversations   Facebook will deliver your ads to people most likely to have a conversation with you through messaging.
    Link clicks   Facebook will deliver your ads to the people most likely to click on them.
    Impressions   Facebook will deliver your ads to people as many times as possible.
    Daily unique reach   Facebook will deliver your ads to people up to once a day.
    Catalogues Sales Value   Facebook will deliver your ads to people to maximise the total purchase value generated and get the highest return on ad spend (ROAS).
    Conversion events (suggested option)   Facebook will deliver your ads to people more likely to take action when they see a product from your catalogue.
    Link clicks   Facebook will deliver your ads to the people most likely to click on them.
    Impressions   Facebook will deliver your ads to people as many times as possible.
    Store Traffic Daily unique reach   Facebook will deliver your ads to people up to once a day.
    Store visit   Facebook will deliver your ads to people more likely to visit your business locations.

    Predicting Customer Life time Value with Google BigQuery ML

    Exploring Google Analytics E-commerce Data.

    We can explore machine learning capabilities of Google BigQuery by creating a classification model. The model will predict whether or not a new user is likely to purchase in future.

    BigQuery ML supports following machine learning models.

    A. Regression Models

    1. Forecasting using linear regression

    2. Classification using logistic regression

    B. Classification Models

    1. Deep Neural Networks

    2. Boosted Decision Trees

    3. AutoML Tables Models

    4. Custom TensorFlow Models

    E-commerce Conversion Rate.

    #standardSQL
    WITH visitors AS(
    SELECT
    COUNT(DISTINCT fullVisitorId) AS total_visitors
    FROM `data-to-insights.ecommerce.web_analytics`
    ),
    purchasers AS(
    SELECT
    COUNT(DISTINCT fullVisitorId) AS total_purchasers
    FROM `data-to-insights.ecommerce.web_analytics`
    WHERE totals.transactions IS NOT NULL
    )
    SELECT
    total_visitors,
    total_purchasers,
    total_purchasers / total_visitors AS conversion_rate
    FROM visitors, purchasers

    The result will show us a conversion rate of 2.69%

    Row
    total_visitors
    total_purchasers
    conversion_rate
    1
    741721
    20015
    0.026984540008979117

    Top 5 selling products

    SELECT
    p.v2ProductName,
    p.v2ProductCategory,
    SUM(p.productQuantity) AS units_sold,
    ROUND(SUM(p.localProductRevenue/1000000),2) AS revenue
    FROM `data-to-insights.ecommerce.web_analytics`,
    UNNEST(hits) AS h,
    UNNEST(h.product) AS p
    GROUP BY 1, 2
    ORDER BY revenue DESC
    LIMIT 5;

    Result shows the following table

    Row
    v2ProductName
    v2ProductCategory
    units_sold
    revenue
    1
    Nest® Learning Thermostat 3rd Gen-USA – Stainless Steel
    Nest-USA
    17651
    870976.95
    2
    Nest® Cam Outdoor Security Camera – USA
    Nest-USA
    16930
    684034.55
    3
    Nest® Cam Indoor Security Camera – USA
    Nest-USA
    14155
    548104.47
    4
    Nest® Protect Smoke + CO White Wired Alarm-USA
    Nest-USA
    6394
    178937.6
    5
    Nest® Protect Smoke + CO White Battery Alarm-USA
    Nest-USA
    6340
    178572.4

    Subsequent Visit Purchasers

    # visitors who bought on a return visit (could have bought on first as well
    WITH all_visitor_stats AS (
    SELECT
    fullvisitorid, # 741,721 unique visitors
    IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
    FROM `data-to-insights.ecommerce.web_analytics`
    GROUP BY fullvisitorid
    )
    SELECT
    COUNT(DISTINCT fullvisitorid) AS total_visitors,
    will_buy_on_return_visit
    FROM all_visitor_stats
    GROUP BY will_buy_on_return_visit
    Row
    total_visitors
    will_buy_on_return_visit
    1
    729848
    0
    2
    11873
    1

    About 1.6% of total visitors will return and purchase from the website.

    Feature Engineering.

    We will build two models by selecting different features and compare their performance

    Feature Set 1 for First model

    We will use two input fields for the classification model.

    Bounces given by field  totals.bounces. A visit is counted as a bounce if a visitor does not engage in any activity on the website after opening the page and leaves.

    Time on site given by field totals.timeOnSite. This field determins total time visitor was on our website.

     

    SELECT
    * EXCEPT(fullVisitorId)
    FROM
    # features
    (SELECT
    fullVisitorId,
    IFNULL(totals.bounces, 0) AS bounces,
    IFNULL(totals.timeOnSite, 0) AS time_on_site
    FROM
    `data-to-insights.ecommerce.web_analytics`
    WHERE
    totals.newVisits = 1)
    JOIN
    (SELECT
    fullvisitorid,
    IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
    FROM
    `data-to-insights.ecommerce.web_analytics`
    GROUP BY fullvisitorid)
    USING (fullVisitorId)
    ORDER BY time_on_site DESC
    LIMIT 10;

    We get the following output

    Row
    bounces
    time_on_site
    will_buy_on_return_visit
    1
    0
    15047
    0
    2
    0
    12136
    0
    3
    0
    11201
    0
    4
    0
    10046
    0
    5
    0
    9974
    0
    6
    0
    9564
    0
    7
    0
    9520
    0
    8
    0
    9275
    1
    9
    0
    9138
    0
    10
    0
    8872
    0

    Feature set 2 for second model:

    For our second model, we will use entirely different set of features.

    1. Visitors Journey progress given by field hits.eCommerceAction.action_type. It denotes progress with integer field where 6 = completed purchase
    2. Traffic source by trafficsource.source and trafficsource.medium
    3. Device category given by field device.deviceCategory
    4. Country given by field geoNetwork.country

    Creating model in BigQuery.

    Creating dataset for the model

    Before creating a model in BigQuery, we will create a BigQuery dataset to store the models.

    Selecting Model type

    Since we are predicting a binary class ( whether user will buy or not) , we will go with classification using logistic regression.

    Classification using Logistic Regression Model:
    CREATE OR REPLACE MODEL `ecommerce.classification_model`
    OPTIONS
    (
    model_type='logistic_reg',
    labels = ['will_buy_on_return_visit']
    )
    AS
    #standardSQL
    SELECT
    * EXCEPT(fullVisitorId)
    FROM
    # features
    (SELECT
    fullVisitorId,
    IFNULL(totals.bounces, 0) AS bounces,
    IFNULL(totals.timeOnSite, 0) AS time_on_site
    FROM
    `data-to-insights.ecommerce.web_analytics`
    WHERE
    totals.newVisits = 1
    AND date BETWEEN '20160801' AND '20170430') # train on first 9 months
    JOIN
    (SELECT
    fullvisitorid,
    IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
    FROM
    `data-to-insights.ecommerce.web_analytics`
    GROUP BY fullvisitorid)
    USING (fullVisitorId)
    ;

     

    We can now create model in our newly created ecommerce table. It will show the following information.

    Classification using Logistic Regression with Feature Set 2

    CREATE OR REPLACE MODEL `ecommerce.classification_model_2`
    OPTIONS
    (model_type='logistic_reg', labels = ['will_buy_on_return_visit']) AS
    WITH all_visitor_stats AS (
    SELECT
    fullvisitorid,
    IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
    FROM `data-to-insights.ecommerce.web_analytics`
    GROUP BY fullvisitorid
    )
    # add in new features
    SELECT * EXCEPT(unique_session_id) FROM (
    SELECT
    CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,
    # labels
    will_buy_on_return_visit,
    MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
    # behavior on the site
    IFNULL(totals.bounces, 0) AS bounces,
    IFNULL(totals.timeOnSite, 0) AS time_on_site,
    totals.pageviews,
    # where the visitor came from
    trafficSource.source,
    trafficSource.medium,
    channelGrouping,
    # mobile or desktop
    device.deviceCategory,
    # geographic
    IFNULL(geoNetwork.country, "") AS country
    FROM `data-to-insights.ecommerce.web_analytics`,
    UNNEST(hits) AS h
    JOIN all_visitor_stats USING(fullvisitorid)
    WHERE 1=1
    # only predict for new visits
    AND totals.newVisits = 1
    AND date BETWEEN '20160801' AND '20170430' # train 9 months
    GROUP BY
    unique_session_id,
    will_buy_on_return_visit,
    bounces,
    time_on_site,
    totals.pageviews,
    trafficSource.source,
    trafficSource.medium,
    channelGrouping,
    device.deviceCategory,
    country
    );

    Classification using XGBoost with Feature Set 2

    We can use the feature set used in the second model to create another model, however, this time using XGBoost.

    CREATE OR REPLACE MODEL `ecommerce.classification_model_3`
    OPTIONS
    (model_type='BOOSTED_TREE_CLASSIFIER' , l2_reg = 0.1, num_parallel_tree = 8, max_tree_depth = 10,
    labels = ['will_buy_on_return_visit']) AS
    WITH all_visitor_stats AS (
    SELECT
    fullvisitorid,
    IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
    FROM `data-to-insights.ecommerce.web_analytics`
    GROUP BY fullvisitorid
    )
    # add in new features
    SELECT * EXCEPT(unique_session_id) FROM (
    SELECT
    CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,
    # labels
    will_buy_on_return_visit,
    MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
    # behavior on the site
    IFNULL(totals.bounces, 0) AS bounces,
    IFNULL(totals.timeOnSite, 0) AS time_on_site,
    totals.pageviews,
    # where the visitor came from
    trafficSource.source,
    trafficSource.medium,
    channelGrouping,
    # mobile or desktop
    device.deviceCategory,
    # geographic
    IFNULL(geoNetwork.country, "") AS country
    FROM `data-to-insights.ecommerce.web_analytics`,
    UNNEST(hits) AS h
    JOIN all_visitor_stats USING(fullvisitorid)
    WHERE 1=1
    # only predict for new visits
    AND totals.newVisits = 1
    AND date BETWEEN '20160801' AND '20170430' # train 9 months
    GROUP BY
    unique_session_id,
    will_buy_on_return_visit,
    bounces,
    time_on_site,
    totals.pageviews,
    trafficSource.source,
    trafficSource.medium,
    channelGrouping,
    device.deviceCategory,
    country
    );

    Evaluating performance of trained model.

    We can evaluate classification models using a few metrics. Let us go with the basics.

    ROC for logistic regression model with Feature Set 1

    When we create a logistic regression classification model, we can access roc_auc field in BigQuery to automatically draw an ROC curve.

    SELECT
    roc_auc,
    CASE
    WHEN roc_auc > .9 THEN 'good'
    WHEN roc_auc > .8 THEN 'fair'
    WHEN roc_auc > .7 THEN 'not great'
    ELSE 'poor' END AS model_quality
    FROM
    ML.EVALUATE(MODEL ecommerce.classification_model, (
    SELECT
    * EXCEPT(fullVisitorId)
    FROM
    # features
    (SELECT
    fullVisitorId,
    IFNULL(totals.bounces, 0) AS bounces,
    IFNULL(totals.timeOnSite, 0) AS time_on_site
    FROM
    `data-to-insights.ecommerce.web_analytics`
    WHERE
    totals.newVisits = 1
    AND date BETWEEN '20170501' AND '20170630') # eval on 2 months
    JOIN
    (SELECT
    fullvisitorid,
    IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
    FROM
    `data-to-insights.ecommerce.web_analytics`
    GROUP BY fullvisitorid)
    USING (fullVisitorId)
    ));

    The output we get is :

    Row
    roc_auc
    model_quality
    1
    0.72386313686313686
    not great

    ROC for logistic regression model with Feature Set 2

    #standardSQL
    SELECT
    roc_auc,
    CASE
    WHEN roc_auc > .9 THEN 'good'
    WHEN roc_auc > .8 THEN 'fair'
    WHEN roc_auc > .7 THEN 'not great'
    ELSE 'poor' END AS model_quality
    FROM
    ML.EVALUATE(MODEL ecommerce.classification_model_2, (
    WITH all_visitor_stats AS (
    SELECT
    fullvisitorid,
    IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
    FROM `data-to-insights.ecommerce.web_analytics`
    GROUP BY fullvisitorid
    )
    # add in new features
    SELECT * EXCEPT(unique_session_id) FROM (
    SELECT
    CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,
    # labels
    will_buy_on_return_visit,
    MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
    # behavior on the site
    IFNULL(totals.bounces, 0) AS bounces,
    IFNULL(totals.timeOnSite, 0) AS time_on_site,
    totals.pageviews,
    # where the visitor came from
    trafficSource.source,
    trafficSource.medium,
    channelGrouping,
    # mobile or desktop
    device.deviceCategory,
    # geographic
    IFNULL(geoNetwork.country, "") AS country
    FROM `data-to-insights.ecommerce.web_analytics`,
    UNNEST(hits) AS h
    JOIN all_visitor_stats USING(fullvisitorid)
    WHERE 1=1
    # only predict for new visits
    AND totals.newVisits = 1
    AND date BETWEEN '20170501' AND '20170630' # eval 2 months
    GROUP BY
    unique_session_id,
    will_buy_on_return_visit,
    bounces,
    time_on_site,
    totals.pageviews,
    trafficSource.source,
    trafficSource.medium,
    channelGrouping,
    device.deviceCategory,
    country
    )
    ));

    The result we get this time are :

    Row
    roc_auc
    model_quality
    1
    0.90948851148851151
    good

    ROC for XGBoost Classifier with feature set 2:

    #standardSQL
    SELECT
    roc_auc,
    CASE
    WHEN roc_auc > .9 THEN 'good'
    WHEN roc_auc > .8 THEN 'fair'
    WHEN roc_auc > .7 THEN 'not great'
    ELSE 'poor' END AS model_quality
    FROM
    ML.EVALUATE(MODEL ecommerce.classification_model_3, (
    WITH all_visitor_stats AS (
    SELECT
    fullvisitorid,
    IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
    FROM `data-to-insights.ecommerce.web_analytics`
    GROUP BY fullvisitorid
    )
    # add in new features
    SELECT * EXCEPT(unique_session_id) FROM (
    SELECT
    CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,
    # labels
    will_buy_on_return_visit,
    MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
    # behavior on the site
    IFNULL(totals.bounces, 0) AS bounces,
    IFNULL(totals.timeOnSite, 0) AS time_on_site,
    totals.pageviews,
    # where the visitor came from
    trafficSource.source,
    trafficSource.medium,
    channelGrouping,
    # mobile or desktop
    device.deviceCategory,
    # geographic
    IFNULL(geoNetwork.country, "") AS country
    FROM `data-to-insights.ecommerce.web_analytics`,
    UNNEST(hits) AS h
    JOIN all_visitor_stats USING(fullvisitorid)
    WHERE 1=1
    # only predict for new visits
    AND totals.newVisits = 1
    AND date BETWEEN '20170501' AND '20170630' # eval 2 months
    GROUP BY
    unique_session_id,
    will_buy_on_return_visit,
    bounces,
    time_on_site,
    totals.pageviews,
    trafficSource.source,
    trafficSource.medium,
    channelGrouping,
    device.deviceCategory,
    country
    )
    ));/* Your code... */

    Our roc_auc metric has declined slightly  ( by .02).

    Row
    roc_auc
    model_quality
    1
    0.907965034965035
    good

    Making Predictions

    Lastly, we can predict using our model which new visitors will come back and purchase.

    BigQuery ML uses ml.PREDICT() function to make a prediction.

    Our total dataset is 12 months. We are making prediction for only last 1 month.

    Prediction with logistic regression model

    We will use our logistic regression model to make predictions first.

    SELECT
    *
    FROM
    ml.PREDICT(MODEL `ecommerce.classification_model_2`,
    (
    WITH all_visitor_stats AS (
    SELECT
    fullvisitorid,
    IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
    FROM `data-to-insights.ecommerce.web_analytics`
    GROUP BY fullvisitorid
    )
    SELECT
    CONCAT(fullvisitorid, '-',CAST(visitId AS STRING)) AS unique_session_id,
    # labels
    will_buy_on_return_visit,
    MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
    # behavior on the site
    IFNULL(totals.bounces, 0) AS bounces,
    IFNULL(totals.timeOnSite, 0) AS time_on_site,
    totals.pageviews,
    # where the visitor came from
    trafficSource.source,
    trafficSource.medium,
    channelGrouping,
    # mobile or desktop
    device.deviceCategory,
    # geographic
    IFNULL(geoNetwork.country, "") AS country
    FROM `data-to-insights.ecommerce.web_analytics`,
    UNNEST(hits) AS h
    JOIN all_visitor_stats USING(fullvisitorid)
    WHERE
    # only predict for new visits
    totals.newVisits = 1
    AND date BETWEEN '20170701' AND '20170801' # test 1 month
    GROUP BY
    unique_session_id,
    will_buy_on_return_visit,
    bounces,
    time_on_site,
    totals.pageviews,
    trafficSource.source,
    trafficSource.medium,
    channelGrouping,
    device.deviceCategory,
    country
    )
    )
    ORDER BY
    predicted_will_buy_on_return_visit DESC;

    The output  will be :

    Models prediction : predicted_will_buy_on_return_visit ( 1=yes)

    Models confidence : predicted_will_buy_on_return_visit_probs.prob 

    Row
    predicted_will_buy_on_return_visit
    predic….label
    predic….prob
    unique_session_id
    will_buy_on_return_visit
    latest_ecommerce_progress
    bounces
    time_on_site
    pageviews
    source
    medium
    channelGrouping
    deviceCategory
    country
    1
    1
    1
    0.56384838909667623
    5584156807534326199-1499456348
    0
    6
    0
    571
    24
    gdeals.googleplex.com
    referral
    Referral
    desktop
    United States
    0
    0.43615161090332377
    2
    1
    1
    0.53345977038719161
    0608643970260729792-1499785056
    1
    6
    0
    620
    14
    gdeals.googleplex.com
    referral
    Referral
    desktop
    United States
    0
    0.46654022961280839
    3
    1
    1
    0.56460044526931086
    5177128356817703126-1500354345
    0
    6
    0
    397
    18
    sites.google.com
    referral
    Referral
    desktop
    United States

     

    Predictions with XGBoost model

    SELECT
    *
    FROM
    ml.PREDICT(MODEL `ecommerce.classification_model_3`,
    (
    WITH all_visitor_stats AS (
    SELECT
    fullvisitorid,
    IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
    FROM `data-to-insights.ecommerce.web_analytics`
    GROUP BY fullvisitorid
    )
    SELECT
    CONCAT(fullvisitorid, '-',CAST(visitId AS STRING)) AS unique_session_id,
    # labels
    will_buy_on_return_visit,
    MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
    # behavior on the site
    IFNULL(totals.bounces, 0) AS bounces,
    IFNULL(totals.timeOnSite, 0) AS time_on_site,
    totals.pageviews,
    # where the visitor came from
    trafficSource.source,
    trafficSource.medium,
    channelGrouping,
    # mobile or desktop
    device.deviceCategory,
    # geographic
    IFNULL(geoNetwork.country, "") AS country
    FROM `data-to-insights.ecommerce.web_analytics`,
    UNNEST(hits) AS h
    JOIN all_visitor_stats USING(fullvisitorid)
    WHERE
    # only predict for new visits
    totals.newVisits = 1
    AND date BETWEEN '20170701' AND '20170801' # test 1 month
    GROUP BY
    unique_session_id,
    will_buy_on_return_visit,
    bounces,
    time_on_site,
    totals.pageviews,
    trafficSource.source,
    trafficSource.medium,
    channelGrouping,
    device.deviceCategory,
    country
    )
    )
    ORDER BY
    predicted_will_buy_on_return_visit DESC;
    Row
    predicted_will_buy_on_return_visit
    predic….label
    predic….prob
    unique_session_id
    will_buy_on_return_visit
    latest_ecommerce_progress
    bounces
    time_on_site
    pageviews
    source
    medium
    channelGrouping
    deviceCategory
    country
    1
    1
    1
    0.52540105581283569
    8015915032010696677-1500581774
    1
    5
    0
    1198
    16
    sites.google.com
    referral
    Referral
    desktop
    United States
    0
    0.47459891438484192
    2
    1
    1
    0.52540105581283569
    8168421017407336619-1500948344
    1
    5
    0
    1043
    19
    sites.google.com
    referral
    Referral
    desktop
    United States
    0
    0.47459891438484192
    3
    1
    1
    0.52540105581283569
    8860823110638301167-1499466860
    0
    5
    0
    1176
    17
    sites.google.com
    referral
    Referral
    desktop
    United States
    0
    0.47459891438484192
    4
    1
    1
    0.547279417514801
    1860094959713788549-1499748085
    0
    3
    0
    1084
    22
    mall.googleplex.com
    referral
    Referral
    desktop
    Canada
    0
    0.45272055268287659

     

    Reference : https://www.coursera.org/learn/gcp-big-data-ml-fundamentals

    self-notes