Exploring Google Analytics E-commerce Data.
We can explore machine learning capabilities of Google BigQuery by creating a classification model. The model will predict whether or not a new user is likely to purchase in future.
BigQuery ML supports following machine learning models.
A. Regression Models
1. Forecasting using linear regression
2. Classification using logistic regression
B. Classification Models
E-commerce Conversion Rate.
#standardSQL
WITH visitors AS(
SELECT
COUNT(DISTINCT fullVisitorId) AS total_visitors
FROM `data-to-insights.ecommerce.web_analytics`
),
purchasers AS(
SELECT
COUNT(DISTINCT fullVisitorId) AS total_purchasers
FROM `data-to-insights.ecommerce.web_analytics`
WHERE totals.transactions IS NOT NULL
)
SELECT
total_visitors,
total_purchasers,
total_purchasers / total_visitors AS conversion_rate
FROM visitors, purchasers
The result will show us a conversion rate of 2.69%
Row |
total_visitors
|
total_purchasers
|
conversion_rate
|
|
---|---|---|---|---|
1 |
741721
|
20015
|
0.026984540008979117
|
Top 5 selling products
SELECT
p.v2ProductName,
p.v2ProductCategory,
SUM(p.productQuantity) AS units_sold,
ROUND(SUM(p.localProductRevenue/1000000),2) AS revenue
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h,
UNNEST(h.product) AS p
GROUP BY 1, 2
ORDER BY revenue DESC
LIMIT 5;
Result shows the following table
Row |
v2ProductName
|
v2ProductCategory
|
units_sold
|
revenue
|
|
---|---|---|---|---|---|
1 |
NestĀ® Learning Thermostat 3rd Gen-USA – Stainless Steel
|
Nest-USA
|
17651
|
870976.95
|
|
2 |
NestĀ® Cam Outdoor Security Camera – USA
|
Nest-USA
|
16930
|
684034.55
|
|
3 |
NestĀ® Cam Indoor Security Camera – USA
|
Nest-USA
|
14155
|
548104.47
|
|
4 |
NestĀ® Protect Smoke + CO White Wired Alarm-USA
|
Nest-USA
|
6394
|
178937.6
|
|
5 |
NestĀ® Protect Smoke + CO White Battery Alarm-USA
|
Nest-USA
|
6340
|
178572.4
|
Subsequent Visit Purchasers
# visitors who bought on a return visit (could have bought on first as well
WITH all_visitor_stats AS (
SELECT
fullvisitorid, # 741,721 unique visitors
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
SELECT
COUNT(DISTINCT fullvisitorid) AS total_visitors,
will_buy_on_return_visit
FROM all_visitor_stats
GROUP BY will_buy_on_return_visit
Row |
total_visitors
|
will_buy_on_return_visit
|
|
---|---|---|---|
1 |
729848
|
0
|
|
2 |
11873
|
1
|
About 1.6% of total visitors will return and purchase from the website.
Feature Engineering.
We will build two models by selecting different features and compare their performance
Feature Set 1 for First model
We will use two input fields for the classification model.
Bounces given by field totals.bounces. A visit is counted as a bounce if a visitor does not engage in any activity on the website after opening the page and leaves.
Time on site given by field totals.timeOnSite. This field determins total time visitor was on our website.
SELECT
* EXCEPT(fullVisitorId)
FROM
# features
(SELECT
fullVisitorId,
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site
FROM
`data-to-insights.ecommerce.web_analytics`
WHERE
totals.newVisits = 1)
JOIN
(SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM
`data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid)
USING (fullVisitorId)
ORDER BY time_on_site DESC
LIMIT 10;
We get the following output
Row |
bounces
|
time_on_site
|
will_buy_on_return_visit
|
|
---|---|---|---|---|
1 |
0
|
15047
|
0
|
|
2 |
0
|
12136
|
0
|
|
3 |
0
|
11201
|
0
|
|
4 |
0
|
10046
|
0
|
|
5 |
0
|
9974
|
0
|
|
6 |
0
|
9564
|
0
|
|
7 |
0
|
9520
|
0
|
|
8 |
0
|
9275
|
1
|
|
9 |
0
|
9138
|
0
|
|
10 |
0
|
8872
|
0
|
Feature set 2 for second model:
For our second model, we will use entirely different set of features.
- Visitors Journey progress given by field hits.eCommerceAction.action_type. It denotes progress with integer field where 6 = completed purchase
- Traffic source by trafficsource.source and trafficsource.medium
- Device category given by field device.deviceCategory
- Country given by field geoNetwork.country
Creating model in BigQuery.
Creating dataset for the model
Before creating a model in BigQuery, we will create a BigQuery dataset to store the models.
Selecting Model type
Since we are predicting a binary class ( whether user will buy or not) , we will go with classification using logistic regression.
Classification using Logistic Regression Model:
CREATE OR REPLACE MODEL `ecommerce.classification_model`
OPTIONS
(
model_type='logistic_reg',
labels = ['will_buy_on_return_visit']
)
AS
#standardSQL
SELECT
* EXCEPT(fullVisitorId)
FROM
# features
(SELECT
fullVisitorId,
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site
FROM
`data-to-insights.ecommerce.web_analytics`
WHERE
totals.newVisits = 1
AND date BETWEEN '20160801' AND '20170430') # train on first 9 months
JOIN
(SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM
`data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid)
USING (fullVisitorId)
;
We can now create model in our newly created ecommerce table. It will show the following information.
Classification using Logistic Regression with Feature Set 2
CREATE OR REPLACE MODEL `ecommerce.classification_model_2`
OPTIONS
(model_type='logistic_reg', labels = ['will_buy_on_return_visit']) AS
WITH all_visitor_stats AS (
SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
# add in new features
SELECT * EXCEPT(unique_session_id) FROM (
SELECT
CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,
# labels
will_buy_on_return_visit,
MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
# behavior on the site
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
totals.pageviews,
# where the visitor came from
trafficSource.source,
trafficSource.medium,
channelGrouping,
# mobile or desktop
device.deviceCategory,
# geographic
IFNULL(geoNetwork.country, "") AS country
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h
JOIN all_visitor_stats USING(fullvisitorid)
WHERE 1=1
# only predict for new visits
AND totals.newVisits = 1
AND date BETWEEN '20160801' AND '20170430' # train 9 months
GROUP BY
unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
);
Classification using XGBoost with Feature Set 2
We can use the feature set used in the second model to create another model, however, this time using XGBoost.
CREATE OR REPLACE MODEL `ecommerce.classification_model_3`
OPTIONS
(model_type='BOOSTED_TREE_CLASSIFIER' , l2_reg = 0.1, num_parallel_tree = 8, max_tree_depth = 10,
labels = ['will_buy_on_return_visit']) AS
WITH all_visitor_stats AS (
SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
# add in new features
SELECT * EXCEPT(unique_session_id) FROM (
SELECT
CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,
# labels
will_buy_on_return_visit,
MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
# behavior on the site
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
totals.pageviews,
# where the visitor came from
trafficSource.source,
trafficSource.medium,
channelGrouping,
# mobile or desktop
device.deviceCategory,
# geographic
IFNULL(geoNetwork.country, "") AS country
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h
JOIN all_visitor_stats USING(fullvisitorid)
WHERE 1=1
# only predict for new visits
AND totals.newVisits = 1
AND date BETWEEN '20160801' AND '20170430' # train 9 months
GROUP BY
unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
);
Evaluating performance of trained model.
We can evaluate classification models using a few metrics. Let us go with the basics.
ROC for logistic regression model with Feature Set 1
When we create a logistic regression classification model, we can access roc_auc field in BigQuery to automatically draw an ROC curve.
SELECT
roc_auc,
CASE
WHEN roc_auc > .9 THEN 'good'
WHEN roc_auc > .8 THEN 'fair'
WHEN roc_auc > .7 THEN 'not great'
ELSE 'poor' END AS model_quality
FROM
ML.EVALUATE(MODEL ecommerce.classification_model, (
SELECT
* EXCEPT(fullVisitorId)
FROM
# features
(SELECT
fullVisitorId,
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site
FROM
`data-to-insights.ecommerce.web_analytics`
WHERE
totals.newVisits = 1
AND date BETWEEN '20170501' AND '20170630') # eval on 2 months
JOIN
(SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM
`data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid)
USING (fullVisitorId)
));
The output we get is :
Row |
roc_auc
|
model_quality
|
|
---|---|---|---|
1 |
0.72386313686313686
|
not great
|
ROC for logistic regression model with Feature Set 2
#standardSQL
SELECT
roc_auc,
CASE
WHEN roc_auc > .9 THEN 'good'
WHEN roc_auc > .8 THEN 'fair'
WHEN roc_auc > .7 THEN 'not great'
ELSE 'poor' END AS model_quality
FROM
ML.EVALUATE(MODEL ecommerce.classification_model_2, (
WITH all_visitor_stats AS (
SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
# add in new features
SELECT * EXCEPT(unique_session_id) FROM (
SELECT
CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,
# labels
will_buy_on_return_visit,
MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
# behavior on the site
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
totals.pageviews,
# where the visitor came from
trafficSource.source,
trafficSource.medium,
channelGrouping,
# mobile or desktop
device.deviceCategory,
# geographic
IFNULL(geoNetwork.country, "") AS country
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h
JOIN all_visitor_stats USING(fullvisitorid)
WHERE 1=1
# only predict for new visits
AND totals.newVisits = 1
AND date BETWEEN '20170501' AND '20170630' # eval 2 months
GROUP BY
unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
)
));
The result we get this time are :
Row |
roc_auc
|
model_quality
|
|
---|---|---|---|
1 |
0.90948851148851151
|
good
|
ROC for XGBoost Classifier with feature set 2:
#standardSQL
SELECT
roc_auc,
CASE
WHEN roc_auc > .9 THEN 'good'
WHEN roc_auc > .8 THEN 'fair'
WHEN roc_auc > .7 THEN 'not great'
ELSE 'poor' END AS model_quality
FROM
ML.EVALUATE(MODEL ecommerce.classification_model_3, (
WITH all_visitor_stats AS (
SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
# add in new features
SELECT * EXCEPT(unique_session_id) FROM (
SELECT
CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,
# labels
will_buy_on_return_visit,
MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
# behavior on the site
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
totals.pageviews,
# where the visitor came from
trafficSource.source,
trafficSource.medium,
channelGrouping,
# mobile or desktop
device.deviceCategory,
# geographic
IFNULL(geoNetwork.country, "") AS country
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h
JOIN all_visitor_stats USING(fullvisitorid)
WHERE 1=1
# only predict for new visits
AND totals.newVisits = 1
AND date BETWEEN '20170501' AND '20170630' # eval 2 months
GROUP BY
unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
)
));/* Your code... */
Our roc_auc metric has declined slightly ( by .02).
Row |
roc_auc
|
model_quality
|
|
---|---|---|---|
1 |
0.907965034965035
|
good
|
Making Predictions
Lastly, we can predict using our model which new visitors will come back and purchase.
BigQuery ML uses ml.PREDICT() function to make a prediction.
Our total dataset is 12 months. We are making prediction for only last 1 month.
Prediction with logistic regression model
We will use our logistic regression model to make predictions first.
SELECT
*
FROM
ml.PREDICT(MODEL `ecommerce.classification_model_2`,
(
WITH all_visitor_stats AS (
SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
SELECT
CONCAT(fullvisitorid, '-',CAST(visitId AS STRING)) AS unique_session_id,
# labels
will_buy_on_return_visit,
MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
# behavior on the site
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
totals.pageviews,
# where the visitor came from
trafficSource.source,
trafficSource.medium,
channelGrouping,
# mobile or desktop
device.deviceCategory,
# geographic
IFNULL(geoNetwork.country, "") AS country
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h
JOIN all_visitor_stats USING(fullvisitorid)
WHERE
# only predict for new visits
totals.newVisits = 1
AND date BETWEEN '20170701' AND '20170801' # test 1 month
GROUP BY
unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
)
)
ORDER BY
predicted_will_buy_on_return_visit DESC;
The output will be :
Models prediction : predicted_will_buy_on_return_visit ( 1=yes)
Models confidence : predicted_will_buy_on_return_visit_probs.prob
Row |
predicted_will_buy_on_return_visit
|
predic….label
|
predic….prob
|
unique_session_id
|
will_buy_on_return_visit
|
latest_ecommerce_progress
|
bounces
|
time_on_site
|
pageviews
|
source
|
medium
|
channelGrouping
|
deviceCategory
|
country
|
|
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 |
1
|
1
|
0.56384838909667623
|
5584156807534326199-1499456348
|
0
|
6
|
0
|
571
|
24
|
gdeals.googleplex.com
|
referral
|
Referral
|
desktop
|
United States
|
|
0
|
0.43615161090332377
|
||||||||||||||
2 |
1
|
1
|
0.53345977038719161
|
0608643970260729792-1499785056
|
1
|
6
|
0
|
620
|
14
|
gdeals.googleplex.com
|
referral
|
Referral
|
desktop
|
United States
|
|
0
|
0.46654022961280839
|
||||||||||||||
3 |
1
|
1
|
0.56460044526931086
|
5177128356817703126-1500354345
|
0
|
6
|
0
|
397
|
18
|
sites.google.com
|
referral
|
Referral
|
desktop
|
United States
|
Predictions with XGBoost model
SELECT
*
FROM
ml.PREDICT(MODEL `ecommerce.classification_model_3`,
(
WITH all_visitor_stats AS (
SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
SELECT
CONCAT(fullvisitorid, '-',CAST(visitId AS STRING)) AS unique_session_id,
# labels
will_buy_on_return_visit,
MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
# behavior on the site
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
totals.pageviews,
# where the visitor came from
trafficSource.source,
trafficSource.medium,
channelGrouping,
# mobile or desktop
device.deviceCategory,
# geographic
IFNULL(geoNetwork.country, "") AS country
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h
JOIN all_visitor_stats USING(fullvisitorid)
WHERE
# only predict for new visits
totals.newVisits = 1
AND date BETWEEN '20170701' AND '20170801' # test 1 month
GROUP BY
unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
)
)
ORDER BY
predicted_will_buy_on_return_visit DESC;
Row |
predicted_will_buy_on_return_visit
|
predic….label
|
predic….prob
|
unique_session_id
|
will_buy_on_return_visit
|
latest_ecommerce_progress
|
bounces
|
time_on_site
|
pageviews
|
source
|
medium
|
channelGrouping
|
deviceCategory
|
country
|
|
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 |
1
|
1
|
0.52540105581283569
|
8015915032010696677-1500581774
|
1
|
5
|
0
|
1198
|
16
|
sites.google.com
|
referral
|
Referral
|
desktop
|
United States
|
|
0
|
0.47459891438484192
|
||||||||||||||
2 |
1
|
1
|
0.52540105581283569
|
8168421017407336619-1500948344
|
1
|
5
|
0
|
1043
|
19
|
sites.google.com
|
referral
|
Referral
|
desktop
|
United States
|
|
0
|
0.47459891438484192
|
||||||||||||||
3 |
1
|
1
|
0.52540105581283569
|
8860823110638301167-1499466860
|
0
|
5
|
0
|
1176
|
17
|
sites.google.com
|
referral
|
Referral
|
desktop
|
United States
|
|
0
|
0.47459891438484192
|
||||||||||||||
4 |
1
|
1
|
0.547279417514801
|
1860094959713788549-1499748085
|
0
|
3
|
0
|
1084
|
22
|
mall.googleplex.com
|
referral
|
Referral
|
desktop
|
Canada
|
|
0
|
0.45272055268287659
|
Reference : https://www.coursera.org/learn/gcp-big-data-ml-fundamentals
self-notes