Messing with models: k-means clustering of Google Analytics 4 data

machine learning
unsupervised learning
k-means clustering
A tutorial on how to perform k-means clustering using Google Analytics data
Author

Collin K. Berke, Ph.D.

Published

September 14, 2024

Photo by Daniel Fazio

Marketing campaigns target specific audiences. Some form of customer segmentation is performed to identify these audiences. These segments can be identified using several approaches. This post describes one such approach: using data and machine learning to identify clusters within Google Analytics data for an e-commerce store.

The goal is simple: create customer segments a marketing team could use to craft and target specific marketing campaigns. To achieve this goal, this post overviews the use of a straightforward, useful machine learning algorithm called k-means clustering. Packages and functions from R, a statistical programming language, are used for the segmentation process.

library(tidyverse)
library(bigrquery)
library(skimr)
library(janitor)
library(factoextra)
library(ids)
library(here)
library(psych)
library(gt)
ggplot_theme <- 
  theme_bw() +
  theme(
    text = element_text(size = 12),
    title = element_text(size = 16, face = "bold")
  )
# Override the default ggplot2 theme
theme_set(ggplot_theme)

Extract and explore data

First, we need some data. This section briefly discusses the data extraction process. I’ve detailed the extraction of Google Analytics data in a previous post. For the sake of time, then, this topic won’t be discussed in-depth. If you’re already familiar with how to access this data, feel free to skip this section.

The data used here is obfuscated Google Analytics data for the Google Merchandise Store. This data is stored and accessed via BigQuery, a cloud-based data warehouse useful for analytics purposes.

Warning

This data is useful for example tutorials, so the conclusions drawn here should not be used to infer anything about true purchasing behavior. The purpose of this post is to serve as a tutorial on how to perform k-means clustering, not to draw real conclusions about Google Merchandise Store customers.

Before data extraction, let’s do some exploration of the source data. The focus here is to get a sense of what’s in the data, while also creating an understanding of the source data’s structure. This mostly involves identifying the available fields and data types.

The bigrquery package provides the bq_table_fields() function to retrieve this information. The following code example shows how to use this function to return field names within the dataset’s tables:

table_ecommerce <-
  bq_table(
    "bigquery-public-data",
    "ga4_obfuscated_sample_ecommerce",
    "events_20210101"
  )

bq_table_fields(table_ecommerce)

Then, the following code submits a SQL query to return the ga4_obfuscated_sample_ecommerce data from BigQuery. It’s important to note that, similar to past posts, I’m querying data associated with transactions occurring over the U.S. Christmas holiday season. Here’s the query string if you want specific details:

query <- "
  select 
    event_date,
    user_pseudo_id,
    ecommerce.transaction_id,
    items.item_category,
    items.quantity,
    ecommerce.purchase_revenue_in_usd
  from `<your-project-name>.ga4_obfuscated_sample_ecommerce.events_*`,
  unnest(items) as items
  where _table_suffix between '20201101' and '20201231' and 
  event_name = 'purchase'
  order by user_pseudo_id, transaction_id
"
data_ga_transactions <- bq_project_query(
  "<your-project-name>",
  query
) |>
bq_table_download()

Before moving forward, some data validation is in order. Columns containing missing values are the biggest concern.

# Verify if there are any missing values
map(data_ga_transactions, \(x) any(is.na(x)))
$event_date
[1] FALSE

$user_pseudo_id
[1] FALSE

$transaction_id
[1] TRUE

$item_category
[1] FALSE

$quantity
[1] TRUE

$purchase_revenue_in_usd
[1] FALSE

No surprise: two features, transaction_id and quantity, contain missing values. Our data wrangling step will need to address these issues. Let’s look closer and explore why missing values might be present within the data. This code can be used to do this:

# Examples with a missing `transaction_id`
data_ga_transactions |> filter(is.na(transaction_id))
# A tibble: 59 × 6
   event_date user_pseudo_id transaction_id item_category quantity purchase_revenue_in_usd
        <dbl>          <dbl> <chr>          <chr>            <dbl>                   <dbl>
 1   20201101       1494019. <NA>           New                  1                      25
 2   20201101       2422026. <NA>           New                  2                      72
 3   20201101       2422026. <NA>           Fun                  1                      72
 4   20201101      29640693. <NA>           Apparel              1                      55
 5   20201101      29640693. <NA>           Shop by Brand        2                      59
 6   20201101       3297047. <NA>           Apparel              1                      87
 7   20201101       3297047. <NA>           Apparel              1                      87
 8   20201101       3297047. <NA>           Apparel              1                      87
 9   20201101       3297047. <NA>           Apparel              1                      87
10   20201101      33027284. <NA>           Accessories          1                      63
# ℹ 49 more rows
# Examples where a `quantity` is missing
data_ga_transactions |> filter(is.na(quantity))
# A tibble: 148 × 6
   event_date user_pseudo_id transaction_id item_category quantity purchase_revenue_in_usd
        <dbl>          <dbl> <chr>          <chr>            <dbl>                   <dbl>
 1   20201217       1086663. (not set)      (not set)           NA                       0
 2   20201120      11080045. (not set)      (not set)           NA                       0
 3   20201204      12422651. (not set)      (not set)           NA                       0
 4   20201213      13309292. (not set)      (not set)           NA                       0
 5   20201126      13429550. (not set)      (not set)           NA                       0
 6   20201213       1396855. (not set)      (not set)           NA                       0
 7   20201127       1423291. (not set)      (not set)           NA                       0
 8   20201210    1540308157. (not set)      (not set)           NA                       0
 9   20201201      15915163. (not set)      (not set)           NA                       0
10   20201120      15980073. (not set)      (not set)           NA                       0
# ℹ 138 more rows

For the transaction_id column, 59 rows contain missing values. If you dig a little further, you’ll notice the majority of these missing transaction_ids occur near the beginning of data collection (i.e., 2020-11-01). This was likely a measurement issue with the Google Analytics setup, which would warrant further exploration. Since I don’t have access to information about how Google Analytics was set up for the Google Merchandise Store (i.e., I can’t speak with the developers), this issue can’t be explored further. The only option, then, is to simply drop these rows.
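
If you want to verify this yourself, here’s a quick side check (not part of the original wrangling) that counts the missing transaction_ids by date:

# Count missing transaction_ids by event date
data_ga_transactions |>
  filter(is.na(transaction_id)) |>
  count(event_date, sort = TRUE)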

Missing quantity values seem to be associated with examples where transaction_id and item_category both contain a (not set) character string. Again, this could be a measurement issue worth further exploration. Given the available information, these examples will also be dropped.
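
As another quick sanity check, we can confirm that the rows missing a quantity also carry the (not set) placeholder:

# Confirm rows missing `quantity` also have "(not set)" ids and categories
data_ga_transactions |>
  filter(is.na(quantity)) |>
  count(transaction_id, item_category)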

At this point, dplyr’s glimpse() and skimr’s skim() functions are used for some additional exploratory analysis. glimpse() is great to get info about the data’s structure. skim() provides summary statistics for each variable within the dataset, regardless of type.

glimpse(data_ga_transactions)
Rows: 13,113
Columns: 6
$ event_date              <dbl> 20201210, 20201210, 20201210, 20201103, 20201103, 20201103, 202011…
$ user_pseudo_id          <dbl> 10111056, 10111056, 10111056, 1014825, 1014825, 1014825, 1014825, …
$ transaction_id          <chr> "741471", "741471", "741471", "(not set)", "(not set)", "(not set)…
$ item_category           <chr> "Apparel", "Apparel", "Apparel", "Shop by Brand", "Office", "Shop …
$ quantity                <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ purchase_revenue_in_usd <dbl> 94, 94, 94, 183, 183, 183, 183, 183, 183, 86, 86, 86, 86, 86, 86, …
skim(data_ga_transactions)
Data summary
Name data_ga_transactions
Number of rows 13113
Number of columns 6
_______________________
Column type frequency:
character 2
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
transaction_id 59 1 2 9 0 3563 0
item_category 0 1 0 23 189 22 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
event_date 0 1.00 20201167.86 4.740000e+01 20201101 20201121 20201202 20201212 20201231 ▇▁▁▂▇
user_pseudo_id 0 1.00 289012023.38 1.274012e+09 1014825 5706716 25060616 62921408 9202815833 ▇▁▁▁▁
quantity 148 0.99 1.49 2.910000e+00 1 1 1 1 160 ▇▁▁▁▁
purchase_revenue_in_usd 0 1.00 106.68 1.171100e+02 0 40 71 137 1530 ▇▁▁▁▁

The data contains a total of 13,113 rows with six features. After reviewing the data types and comparing some of the example values output by glimpse(), it’s concerning that some variables are of type character rather than type numeric. Values of type integer would be best in this case. Let’s do some more digging.

# Only printing the first 10 values for brevity
unique(data_ga_transactions$transaction_id) |>
  head(n = 10)
 [1] "741471"    "(not set)" "983645"    "406646"    "2105"      "886501"    "76937"     "29460"    
 [9] "339943"    "614046"   

As expected, some transaction_id values contain the (not set) character string, which keeps the column from being a true numeric variable. Since the transaction_id is critical for our analysis, these rows will need to be filtered out during data wrangling.
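
As a quick, hedged side check, here’s one way to count how many rows carry this placeholder before any filtering:

# Count rows where transaction_id is the "(not set)" placeholder
data_ga_transactions |>
  summarise(n_not_set = sum(transaction_id == "(not set)", na.rm = TRUE))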

Initial data exploration is complete. We now have enough information to begin data wrangling. The wrangling step will strive to format the data into a structure necessary for performing k-means clustering. The following code chunk contains the code to do this. After, there’s a step-by-step explanation of what this code is doing.

data_user_items <- 
  data_ga_transactions |>
  filter(transaction_id != "(not set)") |>
  drop_na() |>
  mutate(
    user_id = random_id(), 
    .by = user_pseudo_id, 
    .after = user_pseudo_id
  ) |>
  select(-user_pseudo_id) |>
  summarise(
    quantity = sum(quantity), 
    .by = c(user_id, item_category)
  ) |>
  mutate(
    item_category = case_when(
      item_category == "" ~ "Unknown",
      TRUE ~ as.character(item_category)
    )
  ) |>
  pivot_wider(
    names_from = item_category, 
    values_from = quantity,
    values_fill = 0
  ) |>
  clean_names()

In the code above, the (not set) issue in the transaction_id feature is addressed first. Any row with the (not set) value is simply filtered from the data. drop_na() follows. This function drops any rows that contain an NA value. As a result of dropping NAs, we’re left with 11,615 rows: a loss of 1,498 examples (~11.4%).
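
If you want to confirm those counts yourself, here’s a small check, separate from the pipeline above, of how many rows survive the filtering and drop_na() steps:

# Rows remaining after removing "(not set)" transactions and NAs
data_ga_transactions |>
  filter(transaction_id != "(not set)") |>
  drop_na() |>
  nrow()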

Note

Merely dropping examples is convenient in this case, but it may not be ideal or even valid in every context. You’ll want to explore what is appropriate for the data you’re working with before applying this strategy.

Some mutating and summarization of the data is next. The mutation step applies a random id in place of the user_pseudo_id. This provides an extra layer of privacy for users, since the data is being used outside of its original source system.

Note

This may seem a little excessive, but I argue it’s important. It’s best to do what we can to add another layer of privacy to information being used outside of the original source system. Since this data is public, it’s not too much of a concern here. However, this may be an important consideration when you’re working with ‘real world’ Google Analytics data.

I’m not a privacy expert, so you’ll want to identify best practice for the context you work in.

With the modelling goal top-of-mind, aggregation will sum values to individual users rather than transactions. This way, we’ll be able to use k-means clustering to identify customer cohorts, rather than transaction cohorts. To do this, the data is grouped by user_id and item_category and the quantity feature is summed.

Post aggregation, item_category values that don’t contain a string need to be addressed. To do this, any empty string is replaced with ‘Unknown’ using the case_when() function. Finally, tidyr’s pivot_wider() is used to transform the data from long format to wide format. Taken together, the transformation results in a data set where each feature, other than user_id, is a sum of the item categories purchased by each user during the period. Each example, then, could comprise multiple transactions that occur over the period.

Post wrangling, readr’s write_csv() function is used to save the data to disk. This way a request isn’t sent to BigQuery every time the post is built. You don’t have to do this, but it’s useful for limiting query costs associated with the service, even though querying data this way is pretty economical.

write_csv(data_user_items, "data_user_items.csv")
data_user_items <- read_csv("data_user_items.csv")
Rows: 2941 Columns: 22
── Column specification ────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): user_id
dbl (21): apparel, bags, shop_by_brand, drinkware, new, clearance, accessories, campus_collectio...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data exploration for modeling

Before modeling, let’s do some data exploration. The goal during exploration is to ensure data meets the assumptions of k-means clustering. First, let’s get a general sense of the structure of our wrangled data. dplyr’s glimpse() is once again useful here.

glimpse(data_user_items)
Rows: 2,941
Columns: 22
$ user_id                 <chr> "vqnzdyytdx", "lhkctjeylp", "fkgfswdpur", "bycrrxogpw", "xdecczvgb…
$ apparel                 <dbl> 3, 0, 3, 0, 0, 0, 4, 15, 1, 1, 0, 1, 2, 0, 2, 3, 0, 0, 0, 0, 1, 6,…
$ bags                    <dbl> 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ shop_by_brand           <dbl> 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 6, 0, 0, 0, 2, 0, 0, 0, 0, 4, 0, 0, …
$ drinkware               <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, …
$ new                     <dbl> 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 6, 2, 0, 0, 0, 0, 1, 1, 0, 4, 0, 0, …
$ clearance               <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, …
$ accessories             <dbl> 0, 0, 0, 5, 0, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 3, 0, 0, 1, 0, 0, 0, …
$ campus_collection       <dbl> 0, 0, 0, 1, 0, 0, 0, 27, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 1, 0,…
$ office                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ lifestyle               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ small_goods             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ uncategorized_items     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, …
$ stationery              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
$ google                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
$ writing_instruments     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ fun                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ unknown                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ electronics_accessories <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ notebooks_journals      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ gift_cards              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ black_lives_matter      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

For this time period, the data contains 2,941 customers and 21 features available for the k-means clustering model. The user_id feature will be excluded from the modeling. Features represent counts of items purchased within each item category for each user during the period. skimr::skim() is again handy for getting a sense of the shape of the data.

skim(data_user_items)
Data summary
Name data_user_items
Number of rows 2941
Number of columns 22
_______________________
Column type frequency:
character 1
numeric 21
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
user_id 0 1 10 10 0 2941 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
apparel 0 1 1.42 2.13 0 0 1 2 28 ▇▁▁▁▁
bags 0 1 0.28 2.02 0 0 0 0 75 ▇▁▁▁▁
shop_by_brand 0 1 0.30 1.32 0 0 0 0 40 ▇▁▁▁▁
drinkware 0 1 0.30 1.49 0 0 0 0 30 ▇▁▁▁▁
new 0 1 0.56 2.32 0 0 0 0 87 ▇▁▁▁▁
clearance 0 1 0.23 1.03 0 0 0 0 20 ▇▁▁▁▁
accessories 0 1 0.49 2.23 0 0 0 0 48 ▇▁▁▁▁
campus_collection 0 1 0.58 3.40 0 0 0 0 128 ▇▁▁▁▁
office 0 1 0.56 3.21 0 0 0 0 80 ▇▁▁▁▁
lifestyle 0 1 0.21 0.97 0 0 0 0 20 ▇▁▁▁▁
small_goods 0 1 0.05 0.28 0 0 0 0 4 ▇▁▁▁▁
uncategorized_items 0 1 0.21 1.32 0 0 0 0 48 ▇▁▁▁▁
stationery 0 1 0.19 2.16 0 0 0 0 100 ▇▁▁▁▁
google 0 1 0.19 3.39 0 0 0 0 160 ▇▁▁▁▁
writing_instruments 0 1 0.10 0.99 0 0 0 0 40 ▇▁▁▁▁
fun 0 1 0.04 1.40 0 0 0 0 75 ▇▁▁▁▁
unknown 0 1 0.16 1.08 0 0 0 0 40 ▇▁▁▁▁
electronics_accessories 0 1 0.01 0.24 0 0 0 0 12 ▇▁▁▁▁
notebooks_journals 0 1 0.01 0.30 0 0 0 0 12 ▇▁▁▁▁
gift_cards 0 1 0.02 0.40 0 0 0 0 15 ▇▁▁▁▁
black_lives_matter 0 1 0.00 0.02 0 0 0 0 1 ▇▁▁▁▁

Upon visual inspection, the shapes of the distributions are concerning. Each histogram exhibits highly skewed data. Outliers are most likely present, and they will need to be addressed. Otherwise, they’ll negatively impact the k-means clustering algorithm. The mean values for the features also indicate a likely issue: limited purchase frequency for certain item categories.

Before moving ahead with removing examples, dropping some features might be worth exploring. The objective here is to identify item categories with limited purchase frequency. Any item category with a limited purchase frequency could likely be dropped.

summarise(
  data_user_items, 
  across(apparel:black_lives_matter, sum)
) |>
  glimpse()
Rows: 1
Columns: 21
$ apparel                 <dbl> 4169
$ bags                    <dbl> 832
$ shop_by_brand           <dbl> 889
$ drinkware               <dbl> 893
$ new                     <dbl> 1634
$ clearance               <dbl> 684
$ accessories             <dbl> 1438
$ campus_collection       <dbl> 1700
$ office                  <dbl> 1633
$ lifestyle               <dbl> 607
$ small_goods             <dbl> 143
$ uncategorized_items     <dbl> 608
$ stationery              <dbl> 554
$ google                  <dbl> 572
$ writing_instruments     <dbl> 291
$ fun                     <dbl> 122
$ unknown                 <dbl> 463
$ electronics_accessories <dbl> 35
$ notebooks_journals      <dbl> 43
$ gift_cards              <dbl> 46
$ black_lives_matter      <dbl> 1

Reviewing the output, small_goods, fun, electronics_accessories, notebooks_journals, gift_cards, and black_lives_matter have a small enough purchase frequency to be dropped. Since we don’t have information on what is being purchased, the unknown feature could also be dropped. Here’s the code to drop these features:

data_items <- 
  data_user_items |> 
  select(
    -c(
      small_goods,
      fun,
      electronics_accessories,
      notebooks_journals,
      gift_cards,
      black_lives_matter,
      unknown
    )
  )

While the skimr::skim() output includes histograms, we’ll use ggplot2 to examine the distributions in more detail.

data_item_hist <- data_items |>
  pivot_longer(
    cols = apparel:stationery,
    values_to = "items"
  ) 

ggplot(data_item_hist, aes(x = items)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(~name, ncol = 4, scales = "free") +
  labs(
    title = "Distribution of purchases by item category",
    y = "",
    x = ""
  )

These histograms further confirm the presence of skewed data. They also provide evidence of another characteristic of concern: low item purchase frequency. Both of these issues will need to be addressed before applying k-means clustering.

Let’s first standardize the data, so values across features fall within a similar range. To do this, we’ll transform each feature into z-scores. The summary() function can be used to confirm the transformation was applied. This step can be done with the following code:

Note

scale() accepts a data frame with only numeric features. So, the user_id column has to be removed, then added back.

data_items_stnd <-
  as_tibble(scale(select(data_items, -user_id))) |>
  mutate(user_id = data_items$user_id, .before = 1)

# Verify the standardization was applied
summary(data_items_stnd)
   user_id             apparel             bags         shop_by_brand       drinkware      
 Length:2941        Min.   :-0.6655   Min.   :-0.1399   Min.   :-0.2297   Min.   :-0.2032  
 Class :character   1st Qu.:-0.6655   1st Qu.:-0.1399   1st Qu.:-0.2297   1st Qu.:-0.2032  
 Mode  :character   Median :-0.1960   Median :-0.1399   Median :-0.2297   Median :-0.2032  
                    Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
                    3rd Qu.: 0.2734   3rd Qu.:-0.1399   3rd Qu.:-0.2297   3rd Qu.:-0.2032  
                    Max.   :12.4797   Max.   :36.9390   Max.   :30.1695   Max.   :19.8776  
      new            clearance        accessories     campus_collection     office       
 Min.   :-0.2391   Min.   :-0.2252   Min.   :-0.219   Min.   :-0.1701   Min.   :-0.1728  
 1st Qu.:-0.2391   1st Qu.:-0.2252   1st Qu.:-0.219   1st Qu.:-0.1701   1st Qu.:-0.1728  
 Median :-0.2391   Median :-0.2252   Median :-0.219   Median :-0.1701   Median :-0.1728  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.:-0.2391   3rd Qu.:-0.2252   3rd Qu.:-0.219   3rd Qu.:-0.1701   3rd Qu.:-0.1728  
 Max.   :37.2019   Max.   :19.1369   Max.   :21.277   Max.   :37.4975   Max.   :24.7254  
   lifestyle       uncategorized_items   stationery           google         writing_instruments
 Min.   :-0.2135   Min.   :-0.1566     Min.   :-0.08721   Min.   :-0.05737   Min.   :-0.1002    
 1st Qu.:-0.2135   1st Qu.:-0.1566     1st Qu.:-0.08721   1st Qu.:-0.05737   1st Qu.:-0.1002    
 Median :-0.2135   Median :-0.1566     Median :-0.08721   Median :-0.05737   Median :-0.1002    
 Mean   : 0.0000   Mean   : 0.0000     Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000    
 3rd Qu.:-0.2135   3rd Qu.:-0.1566     3rd Qu.:-0.08721   3rd Qu.:-0.05737   3rd Qu.:-0.1002    
 Max.   :20.4745   Max.   :36.2012     Max.   :46.21115   Max.   :47.13470   Max.   :40.4114    

Reviewing the summary() output post normalization, the maximum values are concerning. There are some users who purchased items within categories at a very significant rate. Although this is a nice problem to have for the merchandise store (i.e., big purchases are good for the bottom line), it may cause problems when specifying the clustering algorithm.

Indeed, a couple of approaches could be taken here. For one, the outliers could be retained. This would likely hamper the k-means algorithm’s ability to identify useful clusters, though. The second option is to drop examples that we consider to be outliers. Let’s think this through a bit.

More than likely, any example where the number of items purchased is more than 3 standard deviations above the mean should be dropped. The merchandise store is likely meant for business-to-consumer (B2C) sales, rather than business-to-business (B2B) sales. As such, the number of items purchased during a typical customer transaction will likely be of a volume that is reasonable for consumer purchases (e.g., who needs more than 50 items of stationery?). Such large purchases are likely B2B transactions, where large volumes of items are being bought. Given the store’s B2C intent, examples exhibiting such large purchase volumes should be dropped.

Additional information could be used to further verify this assumption. A Customer Relationship Management (CRM) system, for example, would likely contain information about who made these purchases, which could confirm whether a purchase was for a business or an individual. Since obtaining this additional information isn’t possible here, dropping these outliers before clustering is the best option. With all that said, here’s the code to further explore the customers considered to be outliers. head() is used here to limit the output.

data_items_stnd |>
  filter(if_any(-c(user_id), ~ . > 3)) |>
  head(n = 10)
# A tibble: 10 × 15
   user_id    apparel   bags shop_by_brand drinkware    new clearance accessories campus_collection
   <chr>        <dbl>  <dbl>         <dbl>     <dbl>  <dbl>     <dbl>       <dbl>             <dbl>
 1 qifwtgogoz   6.38  -0.140        -0.230    -0.203 -0.239    -0.225      -0.219             7.78 
 2 dootkphyen  -0.665 -0.140         4.33     -0.203  2.34     -0.225      -0.219            -0.170
 3 qksqqiqfjm  -0.665 -0.140         1.29     -0.203  0.191     0.743       0.677             0.124
 4 cdykzwveuk  -0.665 -0.140        -0.230    10.5   -0.239    -0.225      -0.219            -0.170
 5 yigpconufy  -0.665 -0.140        -0.230    -0.203  6.65     -0.225      -0.219            -0.170
 6 ubblgupnrd   4.97  -0.140        -0.230     5.82  -0.239    -0.225      -0.219             1.30 
 7 shyjuzbuci  -0.665 36.9          -0.230    -0.203 -0.239    -0.225      -0.219            -0.170
 8 cbpqtertvd  -0.665 -0.140         4.33     -0.203 -0.239    -0.225      -0.219            -0.170
 9 djoibydgth   0.743 -0.140         0.530     2.47   1.05     -0.225      -0.219            -0.170
10 zqgnutolrn   3.09  -0.140        -0.230    -0.203 12.7      -0.225      -0.219            -0.170
# ℹ 6 more variables: office <dbl>, lifestyle <dbl>, uncategorized_items <dbl>, stationery <dbl>,
#   google <dbl>, writing_instruments <dbl>

Here’s the code to filter out customers considered to be outliers:

data_items_stnd <- data_items_stnd |>
  select(-user_id) |>
  filter(!if_any(everything(), ~ . > 3)) 

summary(data_items_stnd)
    apparel              bags          shop_by_brand        drinkware             new          
 Min.   :-0.66549   Min.   :-0.13986   Min.   :-0.22973   Min.   :-0.20324   Min.   :-0.23910  
 1st Qu.:-0.66549   1st Qu.:-0.13986   1st Qu.:-0.22973   1st Qu.:-0.20324   1st Qu.:-0.23910  
 Median :-0.19602   Median :-0.13986   Median :-0.22973   Median :-0.20324   Median :-0.23910  
 Mean   :-0.09975   Mean   :-0.05728   Mean   :-0.07586   Mean   :-0.08145   Mean   :-0.08105  
 3rd Qu.: 0.27345   3rd Qu.:-0.13986   3rd Qu.:-0.22973   3rd Qu.:-0.20324   3rd Qu.:-0.23910  
 Max.   : 2.62080   Max.   : 2.82645   Max.   : 2.81020   Max.   : 2.47420   Max.   : 2.77339  
   clearance         accessories       campus_collection      office           lifestyle       
 Min.   :-0.22516   Min.   :-0.21897   Min.   :-0.17010   Min.   :-0.17281   Min.   :-0.21349  
 1st Qu.:-0.22516   1st Qu.:-0.21897   1st Qu.:-0.17010   1st Qu.:-0.17281   1st Qu.:-0.21349  
 Median :-0.22516   Median :-0.21897   Median :-0.17010   Median :-0.17281   Median :-0.21349  
 Mean   :-0.07933   Mean   :-0.08889   Mean   :-0.05533   Mean   :-0.09425   Mean   :-0.07426  
 3rd Qu.:-0.22516   3rd Qu.:-0.21897   3rd Qu.:-0.17010   3rd Qu.:-0.17281   3rd Qu.:-0.21349  
 Max.   : 2.67916   Max.   : 2.91591   Max.   : 2.77268   Max.   : 2.93947   Max.   : 2.88970  
 uncategorized_items   stationery           google         writing_instruments
 Min.   :-0.15659    Min.   :-0.08721   Min.   :-0.05737   Min.   :-0.10021   
 1st Qu.:-0.15659    1st Qu.:-0.08721   1st Qu.:-0.05737   1st Qu.:-0.10021   
 Median :-0.15659    Median :-0.08721   Median :-0.05737   Median :-0.10021   
 Mean   :-0.04983    Mean   :-0.05096   Mean   :-0.03856   Mean   :-0.05943   
 3rd Qu.:-0.15659    3rd Qu.:-0.08721   3rd Qu.:-0.05737   3rd Qu.:-0.10021   
 Max.   : 2.87322    Max.   : 2.69069   Max.   : 1.71234   Max.   : 2.93816   

Let’s take a look at the histograms again, just to get a sense if dropping outliers helped.

data_items_stnd |>
  pivot_longer(
    cols = apparel:stationery,
    values_to = "items"
  ) |> 
  ggplot(aes(x = items)) +
    geom_histogram(binwidth = 1) +
    facet_wrap(~name, ncol = 4, scales = "free") +
    labs(
      title = "Distribution of purchases by item category post wrangling",
      y = "",
      x = ""
    )

Although this helped with the issues caused by outliers, we still have to contend with the fact that some customers just don’t buy certain items. We’ll want to keep this in mind when drawing conclusions from the final clusters.
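
To get a rough sense of just how sparse these purchases are, here’s a quick sketch using the pre-standardized data_items object that reports the share of customers with zero purchases in each category:

# Share of customers with zero purchases in each item category
data_items |>
  summarise(across(-user_id, \(x) round(mean(x == 0), 2))) |>
  glimpse()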

Determine a k-value

Now that the data is in a format acceptable for modeling, we need to explore a value for the number of cluster centers, the k-value. Various methods can be used to determine this value. I’ll rely on three methods here: the elbow method, the average silhouette method, and the realities imposed by the business case (Lantz 2023). Each will be discussed more in-depth in the following sections.

The first method is the elbow method, where we visually examine an elbow plot of the total within-cluster sum of squares against the number of potential clusters used for the model. The fviz_nbclust() function from the factoextra package is useful here. We first pass in the data we intend to use for modeling, then the kmeans() function from base R’s stats package. We’re also interested in creating the plot using the within sum of squares method, so we specify that using the method argument.

fviz_nbclust(data_items_stnd, kmeans, method = "wss")

Based on a visual examination of the plot, six clusters seems to be a good starting point. The average silhouette method is another visualization useful for confirming the number of cluster groups for our cluster modelling.

fviz_nbclust(data_items_stnd, kmeans, method = "silhouette")

Just as expected, the silhouette method provides additional evidence for six clusters.

Considering the business case is another useful method for determining the k-value. For instance, say our goal is to cluster the data based on the number of campaigns our marketing team has capacity to manage. The number used here is completely arbitrary (i.e., I don’t have a marketing team to confirm capacity). Thus, for the sake of example, let’s say we have a marketing team capable of managing 3 campaigns over the holiday season.

So, based on the information above, let’s examine k-means clustering using 3 and 6 groups.

Specify the model

Now the modeling step. We’ll use the kmeans() function from base R’s stats package. Since clustering models with 3 and 6 clusters are being explored here, purrr’s map() function is used to iterate the model specification. map() returns a list object, where each element is a model output: one for the three-cluster model and one for the six-cluster model. set.seed() is also used for reproducibility of the results.

set.seed(20240722)
mdl_clusters <- 
  map(c(3, 6), \(x) kmeans(data_items_stnd, centers = x))

Evaluating model performance

To assess model fit, we’ll look at the cluster sizes for both clustering models. purrr’s map() function makes this easy. Use the following code to return the size element of the kmeans() output:

map(mdl_clusters, "size")
[[1]]
[1] 1642  762  278

[[2]]
[1]  223  296 1284  660   80  139

Identifying imbalanced groups is the priority here. Some imbalance is tolerable, but major imbalances might indicate the presence of model fit issues. The six-cluster model looks fairly balanced, with only one group including a small subset of customers (~80 customers). The three-cluster model has one large group followed by progressively smaller groups. Overall, the balance across the different groups seems to be acceptable here.
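
A quick way to eyeball this balance (a small addition, not part of the original assessment) is to express each cluster’s size as a share of all customers:

# Cluster sizes as a share of all customers for each model
map(mdl_clusters, \(m) round(m$size / sum(m$size), 2))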

Visualizing the clusters is also useful for model assessment. The factoextra package is once again helpful. The fviz_cluster() function from the package can be used to visualize the clusters. The function takes our model output (mdl_clusters[[1]]) and the initial data object (data_items_stnd) as arguments. Both the three- and six-cluster models are visualized using the example code below.

fviz_cluster(mdl_clusters[[1]], data = data_items_stnd, pointsize = 3) +
  labs(
    title = "Three-cluster model of Google Merchandise Store customers",
    subtitle = "View of cluster groupings during the U.S. holiday season"
  ) +
  theme_replace(ggplot_theme)

fviz_cluster(mdl_clusters[[2]], data = data_items_stnd, pointsize = 3) +
  labs(
    title = "Six-cluster model of Google Merchandise Store customers",
    subtitle = "View of cluster groupings during the U.S. holiday season"
  ) +
  theme_replace(ggplot_theme)

fviz_cluster() uses dimension reduction to project a multi-dimensional dataset into a two-dimensional representation. The output is useful for identifying general clustering patterns within the data, which could provide additional information about the shape of the clusters. For instance, this type of visualization can be used to visually identify any outliers which may be influencing the shape clusters take.
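
For context, fviz_cluster() projects the data onto its first two principal components when more than two features are present. A rough, hedged equivalent using base R’s prcomp() (a sketch, not the package’s exact internals) looks like this:

# Project the standardized data onto its principal components
pca <- prcomp(data_items_stnd)

# Proportion of variance captured by the first two components
summary(pca)$importance[, 1:2]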

Indeed, the plot for both models shows some cluster overlap. This is an indication that the clusters for this data may not be as distinct as we would like. There might be some common products every customer buys, and a few peripheral products that other clusters purchase beyond these common products. These initial results indicate the presence of ‘upsell’ opportunities for the marketing team. You have your core items that most customers purchase, but some clusters seem to purchase items beyond the core products. Thus, some of the marketing campaigns might strategise ways to highlight the upselling of some products.

Draw conclusions about clusters

The next step is to examine the cluster centers and factor loadings. The goal is to derive conclusions about each cluster from this information. Let’s first draw conclusions using our three-cluster model. We’ll look to identify specific audience segments based on item category loadings.

The kmeans() output has an element labelled centers, which is a matrix of cluster centers. We can then inspect the center values for each cluster (the rows, numbered on the left) across each feature (the columns).

mdl_clusters[[1]]$centers 
     apparel        bags shop_by_brand   drinkware         new  clearance  accessories
1 -0.4476288 -0.06097525   -0.06495514 -0.08339455 -0.09966977 -0.2251558 -0.096510610
2  0.7127262 -0.06395056   -0.09907281 -0.07850703 -0.07644876 -0.1933938 -0.103778685
3 -0.2720182 -0.01715296   -0.07663600 -0.07803938  0.01632421  1.0946694 -0.003105749
  campus_collection        office   lifestyle uncategorized_items  stationery      google
1       -0.05701557 -0.1077973177 -0.09379899        -0.073095026 -0.06268202 -0.04012104
2       -0.09325080 -0.0992918176 -0.05330947        -0.005497098 -0.05926368 -0.04110829
3        0.05854479 -0.0004035005 -0.01628649        -0.033980814  0.04102364 -0.02235330
  writing_instruments
1        -0.062586565
2        -0.072300004
3        -0.005490095
mdl_clusters[[1]]$centers |>
  as_tibble() |>
  mutate(
    cluster = 1:3,
    across(where(is.double), \(x) round(x, digits = 4)),
    .before = 1
  ) |>
  gt() |>
  data_color(
    columns = !("cluster"),
    rows = 1:3,
    direction = "row",
    method = "numeric",
    palette = "Blues"
  ) |>
  tab_header(
    title = md(
      "**Three-cluster k-means model of Google Merchandise store customers factor loadings**"
    )
  ) |>
  tab_source_note(
    source_note = md(
      "Source: Google Analytics data for the Google Merchandise Store"
    )
  ) |>
  opt_interactive()
Table 1: Three-cluster model factor loadings
Three-cluster k-means model of Google Merchandise store customers factor loadings
Source: Google Analytics data for the Google Merchandise Store

After inspecting the three-cluster k-means model factor loadings, a few groups emerge. Cluster 1 (n = 1,642) seems to be the ‘anything but apparel’ customers. This customer segment really doesn’t purchase any specific item, but when they shop, they’re purchasing items other than apparel. Perhaps a campaign could be created to improve customers’ familiarity with products other than apparel that are available in the merchandise store.

Cluster 2 (n = 762) are the ‘fashion fanatics’. In fact, it seems this group is mainly purchasing apparel. Maybe our product and marketing teams could consider releasing a new apparel line around the holiday season. A campaign focused on highlighting different apparel pieces could also be explored.

Cluster 3 (n = 278) are ‘discount diggers’. Indeed, this data covers the holiday season, so maybe some customers around this time are trying to find a unique gift, but want to do so on a budget. Perhaps a campaign focused on ‘holiday gift deals’ might appeal to these types of customers.

If more nuance is required for the segmentation discussion, the factor loadings for the six-cluster model can be examined. Again, these results suggest the presence of ‘anything but apparel’ customers, ‘fashion fanatics’, and ‘discount diggers’. However, three additional groups emerge from the six-cluster model.

mdl_clusters[[2]]$centers |>
  as_tibble() |>
  mutate(
    cluster = 1:6,
    across(where(is.double), \(x) round(x, digits = 4)),
    .before = 1
  ) |>
  gt() |>
  data_color(
    columns = !("cluster"),
    rows = 1:6,
    direction = "row",
    method = "numeric",
    palette = "Blues"
  ) |>
  tab_header(
    title = md(
      "**Six-cluster k-means model of Google Merchandise store customers factor loadings**"
    )
  ) |>
  tab_source_note(
    source_note = md(
      "Source: Google Analytics data for the Google Merchandise Store"
    )
  ) |>
  opt_interactive()
Table 2: Six-cluster model factor loadings
Six-cluster k-means model of Google Merchandise store customers factor loadings
Source: Google Analytics data for the Google Merchandise Store

The first additional segment emerging from the six-cluster model includes ‘lifestyle looters’ (n = 223): the customers who purchase products that fit or enhance their lifestyle. Perhaps there’s room for a campaign focused on highlighting how the Google Merchandise Store’s products fit within the lives of its customers: most likely people who work in tech.

The second segment are the ‘brand buyers’ (n = 296). These customers are mostly interested in purchasing branded items. Thus, a campaign highlighting the various branded items that are available might be explored.

The final group to emerge is our ‘accessory enthusiasts’ (n = 139). These are customers most interested in purchasing accessories. Perhaps a campaign focused on accessories is another one our marketing team might look to create.

Depending on the model reviewed, clustering resulted in the identification of three customer segments to which campaigns could be targeted: ‘anything but apparel’, ‘fashion fanatics’, and ‘discount diggers’. If an expanded list of segments is required, the six-cluster model provides additional information. This includes segments like ‘lifestyle looters’, ‘brand buyers’, and ‘accessory enthusiasts’. Indeed, the segment names are up for debate. I would lean on my marketing team to workshop them some more. Analysts are poor at naming things.
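
As a final, hedged sketch of how these segments might be handed off, here’s one way the three-cluster assignments could be joined back to user ids for campaign targeting. It re-applies the standardization and outlier filter while keeping user_id, and assumes the resulting rows line up with the data used to fit mdl_clusters[[1]]:

# Rebuild the standardized data while retaining user_id, then attach cluster labels
data_user_segments <-
  as_tibble(scale(select(data_items, -user_id))) |>
  mutate(user_id = data_items$user_id, .before = 1) |>
  filter(!if_any(-user_id, ~ . > 3)) |>
  mutate(segment = mdl_clusters[[1]]$cluster) |>
  select(user_id, segment)

head(data_user_segments)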

Wrap up

This post was a tutorial on how to perform clustering using Google Analytics e-commerce data. The k-means algorithm was used to identify clusters within obfuscated analytics data from the Google Merchandise store. Information about the clusters was used to generate various customer segments. The intent was to use these clusters to better inform future marketing campaigns and targeting strategies.

The kmeans() function from the base R stats package was used to specify two clustering models. The elbow method, the silhouette method, and information about the business case were used to determine the k-values for the clustering models. The fviz_nbclust() function from the factoextra package was useful for creating visualizations to further confirm the selected k-values. It was determined that both a three- and a six-cluster model would be effective for meeting our modelling goal with this data. Lastly, the fviz_cluster() function from the factoextra package was used to create visualizations of each model’s clusters.

In truth, this dataset lacked the presence of any interesting cluster groups. I was hoping for more cross-product segments that could be used for customer segmentation. Unfortunately, this wasn’t the case. Many of the Google Merchandise Store’s customers fell within a single item category, likely due to low purchase frequency for item categories, which in turn likely reflects customers buying only one or two products with each purchase. Nonetheless, we were still able to identify various customer segments useful for targeting purposes and provide some additional support for potential marketing campaigns.

So there you have it, another messing with models post. I hope you found something useful. If not, I hope it was somewhat informative and you found a few takeaways.

Until next time, keep messing with models.

References

Lantz, Brett. 2023. Machine Learning with R. 4th ed. Packt Publishing. https://www.oreilly.com/library/view/machine-learning-with/9781801071321/.

Reuse

CC BY 4.0

Citation

BibTeX citation:
@misc{berke2024,
  author = {Berke, Collin K},
  title = {Messing with Models: K-Means Clustering of {Google}
    {Analytics} 4 Data},
  date = {2024-09-14},
  langid = {en}
}
For attribution, please cite this work as:
Berke, Collin K. 2024. “Messing with Models: K-Means Clustering of Google Analytics 4 Data.” September 14, 2024.