K-Means

Introduction

Imagine you work in the marketing department of a company, and your manager wants to better understand customer behavior in order to design targeted campaigns. In marketing (and, more broadly, in business), knowing your customers is essential, as different customer segments respond to different approaches. For example, a customer who frequently visits the company website but never makes a purchase might be looking for a specific item and checking its availability; in contrast, a customer who visits less often but buys different products each time is likely driven by a different set of needs or habits.

As a data scientist, your goal is to identify and describe these customer profiles based on available data. Suppose the company is able to put together a dataset with two key variables: the number of visits to the website and the number of products purchased. The following scatterplot visualizes this dataset:

Figure 36.1: Scatterplot of website visits and purchased products.

Looking at the plot, you might notice that the points appear to form distinct groups. This suggests the presence of clusters—groups of customers with similar behaviors. For instance, customers with fewer than 150 website visits might belong to a cluster of “low visit frequency,” while the rest could be grouped under “high visit frequency.” We could illustrate this manually as follows:

Figure 36.2: Scatterplot of website visits and purchased products with cluster example.

Of course, this kind of manual grouping is subjective. In practice, we use more systematic and scalable techniques, especially nowadays, in the era of machine learning and artificial intelligence. Grouping data points based on patterns within the data is a fundamental task of unsupervised machine learning, where no labeled output is given.

In this chapter, we introduce one of the most popular and intuitive clustering techniques in unsupervised learning: k-means clustering.

K-Means in Depth

To understand how the k-means algorithm works, let us revisit the data introduced earlier:

Figure 36.3: Revisiting the previous scatterplot.

To start implementing the k-means method, we create \(k\) random points, which serve as centers in the multi-dimensional space. Each data point is then assigned to its nearest center. To visualize this, suppose we set \(k\) equal to 2, meaning that we get two random centers, or centroids. Let’s refer to these centroids as Centroid 1 and Centroid 2, which appear as the blue and green points respectively:

Figure 36.4: Scatterplot with the initial centroids.

The two centroids are randomly generated points that were not part of the original dataset. They carry no special meaning at this stage; we have simply introduced two artificial random points into the data.

The next step is to measure the distance of each data point from each centroid to determine which centroid is closest. As we discussed in Chapter K-Nearest Neighbors, we can measure the distance between two points using the Euclidean distance, calculated using the following formula:

\[\text{dist}(A,B) = \sqrt{(x_B - x_A)^2 + (y_B - y_A)^2}\]

In this formula, \(A\) and \(B\) represent the two points, while \(x\) and \(y\) are the two variables used to calculate the distance between them. Of course, we can expand this equation to include more than two variables and calculate distances in the multi-dimensional space. In our example, we have two variables—the website visits and number of purchased products—meaning we use only these two variables for the calculation of distance.
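
To make the formula concrete, here is a small helper function (a sketch; base R’s dist() computes the same quantity) that works for any number of variables:

# Euclidean distance between two points given as numeric coordinate vectors;
# the formula extends naturally to any number of dimensions
euclidean <- function(a, b) {
  sqrt(sum((a - b)^2))
}

# Quick check: the distance between (1, 2) and (4, 6) is 5
euclidean(c(1, 2), c(4, 6))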

The Unsupervised Learning Version of KNN

This is actually the same logic as KNN (see Chapter K-Nearest Neighbors). Intuitively, k-means can be seen as the unsupervised counterpart of KNN. In KNN, a data point is classified by looking at the \(k\) closest points in the dataset and assigning it the majority label of those neighbors. The label we predict is the target variable. In k-means, however, there is no target variable. Instead, we group data points based on their similarity, measured by distance, so that points within the same cluster are more similar to each other than to points in other clusters.

As an example, we can visualize the distance of a data point (colored red) from each centroid:

Figure 36.5: Scatterplot with the initial centroids and a selected data point (red).

In this example, the distance between the red point and the green point (Centroid 2) is:

\[\begin{align*} \text{dist}(\text{Red}, \text{Centroid 2}) &= \sqrt{(x_{\text{Red}} - x_{\text{Centroid 2}})^2 + (y_{\text{Red}} - y_{\text{Centroid 2}})^2} \\ &= \sqrt{(125 - 63)^2 + (10 - 10)^2} \\ &= 62 \end{align*}\]

Similarly, the distance between the red point and the blue point (Centroid 1) is:

\[\begin{align*} \text{dist}(\text{Red}, \text{Centroid 1}) &= \sqrt{(x_{\text{Red}} - x_{\text{Centroid 1}})^2 + (y_{\text{Red}} - y_{\text{Centroid 1}})^2} \\ &= \sqrt{(125 - 208)^2 + (10 - 22)^2} \\ &\approx 83.9 \end{align*}\]

Because the distance between the red point and Centroid 2 is shorter than that between the red point and Centroid 1, the red point is assigned to Centroid 2 (the green cluster).
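
We can verify both calculations with the euclidean() helper sketched earlier, plugging in the illustrative coordinates used above:

# Coordinates taken from the worked example above
red       <- c(125, 10)
centroid1 <- c(208, 22)
centroid2 <- c(63, 10)

euclidean(red, centroid2)  # 62
euclidean(red, centroid1)  # 83.86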

However, as we discussed in Chapter K-Nearest Neighbors, using the raw values to calculate distances is not recommended if we want to give each variable equal weight. Hence, we need to scale the variables before calculating distances.

After standardization, both variables are on a common scale, so the axis values now show standardized units rather than raw counts:

Figure 36.6: Scatterplot after standardization with the initial centroids and a selected data point (red).

Most of the data points with standardized website visits below 0 (remember, these are standardized values, not actual counts) would be assigned to Centroid 2, and the rest would be assigned to Centroid 1.

Even if the resulting clusters seem reasonable, we cannot stop the process here. There is a potential problem in how we started: the centroids can appear anywhere, so there is a high risk of creating segments that do not make sense. To see how this could happen, suppose we have the following results:

Figure 36.7: Scatterplot after standardization with poor placement of the initial centroids.

In this case, all data points would be assigned to Centroid 1, meaning that we would view all customers as having the same profile. Of course, such an assignment would be useless. To make sure we end up with meaningful clusters, we start an iterative process of moving each centroid toward its assigned data points. In other words, the data points “pull” their centroid toward their center, and we then re-assign the data points to the centers. In our example, after the “pull”, the plot looks like this:

Figure 36.8: Scatterplot with the centroids moved towards the center of their corresponding assigned data points.

So, after the initial assignment to the clusters, each centroid moves to the “center” (the mean) of its assigned data points. After Centroid 1 moves, however, some data points end up closer to Centroid 2 than to Centroid 1. The next step is to repeat the whole process: we calculate the Euclidean distance of every data point from each centroid and re-assign each point to its closest center. We repeat this until the centroids stabilize. The number of times we repeat this process is called the number of iterations.

To see the whole process from the beginning, suppose we restart the process in the example above, setting the number of iterations to three. How the results change throughout this process can be seen below:

Figure 36.9: Initial placement of the centroids.

Figure 36.10: Assigning all data points to Centroid 1 because it is closer than Centroid 2.

Figure 36.11: Moving Centroid 1 toward the center of its assigned data points.

Figure 36.12: Re-assigning data points to their nearest centroids.

Figure 36.13: Moving both centroids toward the centers of their assigned data points.

Figure 36.14: Re-assignment of data points to their nearest centroids.

The first two plots above show the initial creation of the two centroids and the assignment of the data points to “Centroid 1”. The third and fourth plots reflect the relocation of “Centroid 1” and the re-assignment of the data points to the centroids. The last two plots show the results after one more iteration.

Therefore, each iteration consists of one assignment and one update of the centroids (in that order). After three iterations, we have the option to stop or continue. After some iterations, the centers will stop moving since no new assignment of the data points to the centroids will occur. When we decide to stop the iterations, we have our final clusters. We have the option to begin the same process again with the same or a different \(k\) to check how our results would change.
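
To make the loop explicit, the following is a minimal sketch of the algorithm just described, written as a hypothetical helper simple_kmeans() for a numeric matrix. It is illustrative only; in practice we use R’s built-in kmeans() function, shown later in this chapter:

# A bare-bones k-means loop: initialize, assign, update, repeat
simple_kmeans <- function(X, k, iterations = 10) {
  X <- as.matrix(X)
  # Initialization: sample k existing points as starting centroids
  # (the illustration above used arbitrary random points instead)
  centroids <- X[sample(nrow(X), k), , drop = FALSE]
  for (i in seq_len(iterations)) {
    # Assignment step: label each point with the index of its nearest centroid
    distances <- sapply(seq_len(k), function(j) {
      sqrt(rowSums(sweep(X, 2, centroids[j, ])^2))
    })
    labels <- max.col(-distances)  # row-wise index of the smallest distance
    # Update step: move each centroid to the mean of its assigned points
    for (j in seq_len(k)) {
      if (any(labels == j)) {
        centroids[j, ] <- colMeans(X[labels == j, , drop = FALSE])
      }
    }
  }
  list(cluster = labels, centers = centroids)
}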

Iterative Stability

From a statistical perspective, additional iterations eventually stop changing the results because each assignment and update step can only decrease, or leave unchanged, the total within-cluster sum of squares. Since there are finitely many possible assignments of the data points to the clusters, the centroids are guaranteed to stabilize, although possibly at a local rather than the global optimum.

Assumptions

As with any clustering method, k-means is based on a set of assumptions about the structure of the data in the feature space. These assumptions describe the types of groupings the algorithm is designed to detect and, therefore, the conditions under which it is expected to perform well. When they are approximately satisfied, k-means typically produces meaningful and interpretable clusters, but when they are strongly violated, the resulting groupings can become unstable or misleading (Hastie et al., 2009; Jain, 2010). The main assumptions are the following:

  • Cluster Size Similarity: K-means assumes that clusters are roughly equal in size. This means that each group has about the same number of points and occupies a similar volume in space. If one cluster is much larger or denser than the others, it may pull centroids toward itself, causing misclassification. For example, a large group may “absorb” nearby smaller groups if they are close enough, even if they are conceptually distinct.

  • Similar Variability: K-means assumes that the data points in each cluster have similar spread (or variance). This means the points in each cluster are about equally tight around their center. If one cluster has high variability and another has low variability, the centroid of the wider cluster may not represent it well. It may also pull points from neighboring clusters just because the boundaries are more “blurred.”

  • Linearity: K-means works best when the clusters can be separated by straight lines (or planes in higher dimensions). That is, the boundaries between groups are linear. K-means assigns points to the nearest centroid based on Euclidean distance. This results in linear decision boundaries. If the actual cluster shapes are curved or nested (like a circle within a circle), k-means won’t be able to separate them correctly.

Although violations of these assumptions do not automatically make k-means unusable, they do affect the reliability and interpretability of its results. In practice, even when the assumptions are only approximately satisfied, the algorithm can still provide useful exploratory insights, as long as its limitations are clearly recognized and taken into account during analysis and interpretation.
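
To make the linearity assumption concrete, the following sketch (with made-up data, not the chapter’s dataset) builds two concentric rings and applies k-means. Because nearest-centroid assignment can only produce linear boundaries, each resulting cluster mixes points from both rings:

# Two concentric rings: the true groups are not linearly separable
set.seed(1)
theta  <- runif(200, 0, 2 * pi)
radius <- rep(c(1, 4), each = 100)  # inner ring and outer ring
rings  <- data.frame(x = radius * cos(theta) + rnorm(200, sd = 0.1),
                     y = radius * sin(theta) + rnorm(200, sd = 0.1))

# K-means with k = 2 splits the plane in half instead of finding the rings
km <- kmeans(rings, centers = 2, nstart = 20)
table(km$cluster, rep(c("inner", "outer"), each = 100))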

K-Means in R

To implement k-means in R, we’ll use the same dataset discussed throughout this chapter. Our goal is to apply the k-means algorithm and interpret the resulting clusters. Since base R already includes a function for k-means, we only need to load the tidyverse package (for easier data manipulation) and import the dataset website_visits, available on GitHub:

# Libraries
library(tidyverse)

# Importing the dataset
website_visits <- read.csv("https://raw.githubusercontent.com/DataKortex/Data-Sets/refs/heads/main/website_visits.csv")

As previously mentioned, this dataset includes just two columns:

  • Website_Visits: the number of times a customer visited the website

  • Purchased_Products: the number of products a customer has purchased

To apply k-means, we use the function kmeans(). Before we do so, however, it’s important to standardize our variables. Here’s a simple function to standardize a variable, which we then apply to both columns:

# Creating the standardization function
standardization <- function(x) {
  (x - mean(x)) / sd(x)
}

# Applying standardization on the variables
website_visits <- website_visits %>%   
  mutate(Website_Visits = standardization(Website_Visits),    
         Purchased_Products = standardization(Purchased_Products))
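
As a side note, base R’s scale() function performs the same centering and scaling in a single call; a sketch of the equivalent step (scale() returns a matrix, hence the conversion back to a data frame):

# Base R alternative: standardize every numeric column at once
website_visits_scaled <- as.data.frame(scale(website_visits))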

Now that the values of Website_Visits and Purchased_Products are standardized, we can apply the k-means method using kmeans(). The main arguments to consider are:

  • centers: the number of clusters (k)

  • iter.max: the maximum number of iterations

  • nstart: how many times to repeat the process with different initial centroids

Although we can run k-means just once (nstart = 1), using a larger value increases the chance of finding a better solution, since the starting points are random. For this example, we’ll set centers = 2, iter.max = 50, and nstart = 20. Also, we set a seed to ensure the reproducibility of the results:

# Setting seed
set.seed(123) 

# Applying k-means 
kmeans_results <- kmeans(website_visits, centers = 2, iter.max = 50, nstart = 20)

The resulting object is a list. With the dollar sign ($), we can extract the information of interest, such as the cluster that each observation belongs to. We can add the cluster number to the dataset as a column with the mutate() function (we name it Cluster in this example):

# Printing the results with the clusters
website_visits <- website_visits %>% 
  mutate(Cluster = as.factor(kmeans_results$cluster))  

# Printing the first 10 rows 
head(website_visits, n = 10)
   Website_Visits Purchased_Products Cluster
1      -1.0976463          0.7091550       1
2      -0.9932219          0.2745556       1
3      -0.7843732          0.3832055       1
4      -0.5755246          0.3832055       1
5       1.7218109          0.7091550       2
6       0.8864162         -0.7032931       2
7      -1.0976463         -0.8119430       1
8      -0.4711002          0.2745556       1
9      -1.3064949          1.1437545       1
10     -0.9932219          1.4697040       1

The Cluster column contains values 1 or 2, consistent with the k = 2 setting. These numerical values can later be relabeled with more descriptive names for interpretation. We can visualize the clusters using ggplot2:

# Visualizing the results 
website_visits %>%   
  ggplot(aes(x = Website_Visits, 
             y = Purchased_Products, 
             color = Cluster)) +  
  geom_point(size = 2.5) +  
  scale_colour_manual(values = c("grey", "darkgreen")) +
  theme_light()

Figure 36.15: Scatterplot with the resulting clusters.

The visualization confirms that the data has been divided into two groups. But what happens if we change the number of clusters? Let’s try k = 3 and k = 4, and plot all results together:

Figure 36.16: Scatterplots of the resulting clusters, facetted by k.
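
For reference, a comparison like the one in the figure above can be produced along the following lines (a sketch; the exact code behind the figure is not shown). The hypothetical object comparison stacks the cluster assignments for each k and facets the plot:

# Running k-means for k = 2, 3, 4 and stacking the labeled results
set.seed(123)
comparison <- map_dfr(2:4, function(k) {
  km <- kmeans(select(website_visits, Website_Visits, Purchased_Products),
               centers = k, iter.max = 50, nstart = 20)
  website_visits %>%
    mutate(Cluster = as.factor(km$cluster),
           K = paste("k =", k))
})

# One panel per value of k
comparison %>%
  ggplot(aes(x = Website_Visits, y = Purchased_Products, color = Cluster)) +
  geom_point(size = 2.5) +
  facet_wrap(~ K) +
  theme_light()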

This raises the question: which number of clusters (k) should we choose? Since this is an unsupervised learning method, there is no single “correct” answer. However, one common technique to guide this choice is the Elbow Method. This method involves plotting the total within-cluster sum of squares (WSS) against different values of \(k\) (Ketchen & Shook, 1996). The “elbow” indicates diminishing returns in WSS, representing the point where adding more clusters no longer significantly reduces the total within-cluster variation:

# Setting seed
set.seed(123)  

# Method to find the optimal number of clusters: Look for the "elbow" 
## 1) Creating an empty vector for the total within sum of squares error: wss
wss <- vector()  

## 2) Applying For-loop for 15 cluster centers
for (i in 1:15) { 
  # Using only the numeric columns, since the Cluster factor added earlier
  # cannot be passed to kmeans()
  km.out <- kmeans(select(website_visits, Website_Visits, Purchased_Products),
                   centers = i, iter.max = 50, nstart = 20)
  
  # Saving the total within sum of squares in wss
  wss[i] <- km.out$tot.withinss
}

## 3) Plotting total within sum of squares vs. number of clusters
plot(1:15, wss, type = "b",  
     xlab = "Number of Clusters",  
     ylab = "Within Sum of Squares")

Figure 36.17: Determining the Optimal number of clusters using the Elbow Method.

Based on the plot above, the recommended number of clusters is 4, because this is where the line starts to flatten, indicating that adding more clusters beyond this point does not significantly improve the clustering. Therefore, we can set \(k\) equal to 4, apply the kmeans() function one last time, and then add the results to our dataset:

# Set seed 
set.seed(12) 

# Set k equal to 4
k <- 4 

# K-means (again excluding the Cluster column added earlier)
kmeans_results <- kmeans(select(website_visits, Website_Visits, Purchased_Products),
                         centers = k, iter.max = 50, nstart = 20)

# Assign the results
website_visits <- website_visits %>%    
  mutate(Cluster = as.factor(kmeans_results$cluster)) 

# Visualize the results
website_visits %>%   
  ggplot(aes(x = Website_Visits, y = Purchased_Products, color = Cluster)) +  
  geom_point(size = 2.5) +  
  labs(x = "Website Visits",   
       y = "Purchased Products") + 
  scale_colour_manual(values = c("grey", "darkgreen", "#ff7f0e", "#1f77b4"))

Figure 36.18: Scatterplot with the final cluster choice.

The empirical results therefore suggest keeping four clusters. The next question is how to proceed from there. Based on the characteristics of the clusters, we try to give each one an appropriate name and description in order to provide insights that make sense to the people who are interested. In this case, a plausible description could be the following (a relabeling sketch follows the list):

  • Cluster 1 – Casual Buyers: Customers with few website visits and few purchased products
  • Cluster 2 – Selective Spenders: Customers with few website visits but many purchased products
  • Cluster 3 – Shopping Champions: Customers with many website visits and many purchased products
  • Cluster 4 – Explorers: Customers with many website visits but few purchased products
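
Once we settle on these profiles, the numeric labels can be replaced with the descriptive names, for instance with forcats’ fct_recode() (a sketch; which number corresponds to which profile depends on the random initialization, so inspect your own centroids before renaming):

# Relabeling the cluster numbers with descriptive profile names
website_visits <- website_visits %>%
  mutate(Cluster = fct_recode(Cluster,
                              "Casual Buyers"      = "1",
                              "Selective Spenders" = "2",
                              "Shopping Champions" = "3",
                              "Explorers"          = "4"))
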
Describing Clusters

Since centroids are used as the “center” of a cluster, we can use their characteristics (e.g., the average value of website visits or purchased products) to describe each cluster quantitatively. Therefore, we can summarize the resulting clusters using centroid values, cluster size, and other summary statistics.
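
For example, the centroid coordinates and cluster sizes are stored in the kmeans() output, and the same per-cluster summaries can be recomputed with dplyr (a sketch using the objects created above):

# Centroid coordinates (in standardized units) and points per cluster
kmeans_results$centers
kmeans_results$size

# Equivalent per-cluster summary computed directly from the data
website_visits %>%
  group_by(Cluster) %>%
  summarise(Mean_Visits   = mean(Website_Visits),
            Mean_Products = mean(Purchased_Products),
            Size          = n())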

Advantages and Limitations

K-means is one of the most widely used clustering algorithms in practice, mainly because of its simplicity and efficiency (MacQueen, 1967; Lloyd, 1982). However, like any method, it comes with both strengths and limitations that should be taken into account when applying it.

K-means is relatively easy to understand and implement. The algorithm follows a clear and intuitive process: initialize centroids, assign data points, update centroids, and repeat. Additionally, k-means clustering is computationally efficient, even when applied to large datasets (Lantz, 2023). Its iterative nature and relatively simple calculations (mainly distance computations and averages) allow it to scale well compared to more complex clustering methods. The results are easy to interpret, as each cluster is represented by its centroid, which can be used to summarize the characteristics of the group. This makes it particularly useful in business contexts, such as customer segmentation, where interpretability is essential (Wedel & Kamakura, 2000).

One of the main challenges of k-means is selecting the appropriate number of clusters. Since the algorithm requires the value of k in advance, methods such as the Elbow Method are often used, but they do not always provide a clear answer. The initial placement of centroids can also significantly affect the final results (Lantz, 2023). Poor initialization may lead to suboptimal clustering, which is why techniques such as multiple random starts (nstart) are commonly used. In addition, k-means assumes that clusters are evenly sized and roughly spherical; combined with nearest-centroid assignment, this implies linear boundaries between clusters. If the true clusters have irregular shapes, varying densities, or different sizes, the algorithm may fail to identify them correctly (Ester et al., 1996). Lastly, outliers can strongly influence the position of centroids, as the mean is affected by extreme values. This can distort the resulting clusters and reduce their interpretability.
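
To make the last point concrete, here is a tiny sketch with made-up numbers showing how a single extreme value drags the mean, and therefore a centroid coordinate:

# Hypothetical visit counts: five typical customers and one extreme outlier
visits <- c(10, 12, 11, 9, 13, 300)

mean(visits)    # 59.17: pulled far away from every typical customer
median(visits)  # 11.5: a robust summary, by comparison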

Although these limitations do not prevent k-means from being useful, they highlight the importance of understanding the structure of the data and carefully interpreting the results. In practice, k-means is often used as an exploratory tool, providing a first insight into potential groupings within the data.

Recap

K-Means Clustering is a simple and effective method for grouping data points based on similarity, making it well-suited for tasks such as customer segmentation. It helps businesses uncover patterns in behavior, enabling targeted campaigns and personalized experiences. K-means is easy to implement even on large datasets, and its results can be interpreted using the characteristics of the cluster centroids. However, like any machine learning algorithm, it has limitations: the choice of k influences the outcome, the initial placement of centroids can affect clustering quality, and outliers can skew the centroids. The key is to standardize the data properly, select k thoughtfully, and interpret clusters using centroid characteristics to gain actionable business insights.