4  Data Manipulation

In this chapter, we discuss the various functions we can use to handle our data. More specifically, we introduce the tidyverse family of packages and explore how we can use two of those packages for data manipulation.

4.1 From Data Imports to Data Manipulation

Generally, after we import data, we want to transform them in such a way that they become easy to analyze, a process often called data manipulation. In the real world, data sets are usually not in the format we would like them to be. Even when data are orderly stored in databases, they are typically not structured with our specific analysis in mind. For example, do we want to analyze sales per country for a specific calendar year, or all calendar years of sales for a specific country? In this example, it is up to us to decide on the best approach, depending on our goal. This illustrates that we need to be able to bring an imported data set into an appropriate format that facilitates our analysis.

4.1.1 Data Manipulation with the Tidyverse Family

Before diving into details on data manipulation, we need to discuss the most well-known family of packages within the R community: the tidyverse. In simple terms, tidyverse refers to a collection of packages that are designed to work together seamlessly, helping us write code that is both easy to read and consistent.

While more information on the tidyverse collection of packages can be found on its official website (https://www.tidyverse.org/), the two packages that we will discuss here are dplyr and tidyr. Specifically, dplyr is the primary package that we will use throughout this chapter, and its main theme is the use of the pipe operator (or pipe), %>%, a concept that makes our code clearer and more efficient and helps us transform and manipulate data easily. Although the use of pipes may seem complicated at the beginning, it becomes increasingly intuitive with practice. From the tidyr package, we will use only a couple of necessary functions. Because we do not want to remember exactly which function comes from which package, we simply install and load both packages, and recommend that you do so as well. Additionally, it makes sense to use a single data set throughout this chapter to demonstrate the concepts. This will be the same data set that we imported (using the read_csv() function from the readr package) in the chapter on Importing Data: the customer_churn data set.

Our data set has information on 4,000 customers, of which 1,000 have churned, meaning that these customers are known to have stopped purchasing at a given point in time. The columns Recency and Recency_Level contain information about how recently a customer made his/her last purchase. Moreover, the columns Frequency and Frequency_Level show how many purchases (in total) a customer has made in the past. Also, the columns Monetary_Value and Monetary_Value_Level show how much money the company has received from a customer. According to the academic marketing literature (see, for example, Blattberg, Kim, & Neslin, 2008; Fader & Hardie, 2009), these variables, known by the acronym RFM, can be used as strong indicators of whether a customer will churn in the future. We can imagine that the more purchases that have taken place recently and the higher their value, the higher the probability that a customer will not churn. Lastly, the column Observation_Period shows for how long a customer is known to have been a customer of the company.

We load all three mentioned packages from the tidyverse family as well as the respective data set (assigning it to an object called main_data) using the code below:

# Libraries
library(readr)
library(dplyr)
library(tidyr)

# Importing the data set
main_data <- read_csv("https://raw.githubusercontent.com/GeorgeOrfanos/Data-Sets/refs/heads/main/customer_churn.csv")

# Have a look at the data set
head(main_data)
# A tibble: 6 × 9
     ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
  <dbl>   <dbl> <chr>             <dbl> <chr>                    <dbl>
1     1      46 Low                  26 Low                      3010.
2     2      40 Low                  56 Medium                  57347.
3     3      35 Low                 293 High                    14496.
4     4      50 Low                  18 Low                      1416.
5     5      77 Medium               14 Low                       524.
6     6      55 Medium               39 Low                      8830.
# ℹ 3 more variables: Monetary_Value_Level <chr>, Observation_Period <dbl>,
#   Churn <dbl>
str(main_data)
spc_tbl_ [4,000 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ID                  : num [1:4000] 1 2 3 4 5 6 7 8 9 10 ...
 $ Recency             : num [1:4000] 46 40 35 50 77 55 48 193 96 167 ...
 $ Recency_Level       : chr [1:4000] "Low" "Low" "Low" "Low" ...
 $ Frequency           : num [1:4000] 26 56 293 18 14 39 18 36 24 20 ...
 $ Frequency_Level     : chr [1:4000] "Low" "Medium" "High" "Low" ...
 $ Monetary_Value      : num [1:4000] 3010 57347 14496 1416 524 ...
 $ Monetary_Value_Level: chr [1:4000] "Medium" "High" "High" "Medium" ...
 $ Observation_Period  : num [1:4000] 742 2301 2411 813 1 ...
 $ Churn               : num [1:4000] 0 0 0 0 0 0 0 1 0 1 ...
 - attr(*, "spec")=
  .. cols(
  ..   ID = col_double(),
  ..   Recency = col_double(),
  ..   Recency_Level = col_character(),
  ..   Frequency = col_double(),
  ..   Frequency_Level = col_character(),
  ..   Monetary_Value = col_double(),
  ..   Monetary_Value_Level = col_character(),
  ..   Observation_Period = col_double(),
  ..   Churn = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

As mentioned above, the imported data set has 4,000 observations (as rows) and 9 variables (as columns). Because it is helpful to have a brief description of each variable, the contents of each column are described below:

  • ID: customer ID (unique number)
  • Recency: number of days that have passed since last purchase
  • Recency_Level: the level of recency (low, medium or high)
  • Frequency: number of purchases
  • Frequency_Level: the level of frequency (low, medium or high)
  • Monetary_Value: total monetary value (e.g., total amount in euros)
  • Monetary_Value_Level: the level of monetary value (low, medium or high)
  • Observation_Period: number of days between the first and the last purchase
  • Churn: whether the customer churns or not (a value of 1 implies that the customer churns)

Evidently, the imported data are structured as a tibble. Tibbles retain all the functionality of data frames while offering additional features, such as improved printing and better handling of large data sets. Therefore, a tibble is essentially a “modern” or advanced version of a data frame, and we can use it in exactly the same way as we would use a data frame. Looking closely, we observe that one of the classes of the imported data set is also data.frame (first line of the output of the str() function). This is important because it confirms that tibbles are built on top of base R data frames, ensuring that functions expecting traditional data frames will still work seamlessly with tibbles.
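We can verify this inheritance directly. The calls below are not part of the chapter's examples, but class() and is.data.frame() are base R functions, and the classes they report match the first line of the str() output above:

```r
# A tibble produced by read_csv() carries several classes;
# the last of them is the base R data.frame
class(main_data)
# "spec_tbl_df" "tbl_df" "tbl" "data.frame"

# Base R therefore still recognizes it as a data frame
is.data.frame(main_data)  # TRUE
```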

Now that we understand more about the content of our data set, we can start learning about the main symbol of the dplyr package: the pipe.

4.2 The Use of Pipes: %>%

As stated earlier, pipes help us write clear and efficient code. With pipes, we can take the data output of one function and send it directly to the next. To see exactly how pipes work, let us use one with the summary() function.

# Summary function without pipes
summary(main_data)
       ID          Recency      Recency_Level        Frequency     
 Min.   :   1   Min.   : 30.0   Length:4000        Min.   : 12.00  
 1st Qu.:1001   1st Qu.: 38.0   Class :character   1st Qu.: 22.00  
 Median :2000   Median : 54.0   Mode  :character   Median : 31.00  
 Mean   :2000   Mean   :102.1                      Mean   : 37.82  
 3rd Qu.:3000   3rd Qu.:129.0                      3rd Qu.: 45.00  
 Max.   :4000   Max.   :519.0                      Max.   :860.00  
 Frequency_Level    Monetary_Value  Monetary_Value_Level Observation_Period
 Length:4000        Min.   :  109   Length:4000          Min.   :   1.0    
 Class :character   1st Qu.: 1780   Class :character     1st Qu.: 852.8    
 Mode  :character   Median : 4210   Mode  :character     Median :1803.0    
                    Mean   : 8418                        Mean   :1529.7    
                    3rd Qu.: 9163                        3rd Qu.:2288.0    
                    Max.   :99011                        Max.   :2435.0    
     Churn     
 Min.   :0.00  
 1st Qu.:0.00  
 Median :0.00  
 Mean   :0.25  
 3rd Qu.:0.25  
 Max.   :1.00  
# Summary function with pipes
main_data %>% summary()
       ID          Recency      Recency_Level        Frequency     
 Min.   :   1   Min.   : 30.0   Length:4000        Min.   : 12.00  
 1st Qu.:1001   1st Qu.: 38.0   Class :character   1st Qu.: 22.00  
 Median :2000   Median : 54.0   Mode  :character   Median : 31.00  
 Mean   :2000   Mean   :102.1                      Mean   : 37.82  
 3rd Qu.:3000   3rd Qu.:129.0                      3rd Qu.: 45.00  
 Max.   :4000   Max.   :519.0                      Max.   :860.00  
 Frequency_Level    Monetary_Value  Monetary_Value_Level Observation_Period
 Length:4000        Min.   :  109   Length:4000          Min.   :   1.0    
 Class :character   1st Qu.: 1780   Class :character     1st Qu.: 852.8    
 Mode  :character   Median : 4210   Mode  :character     Median :1803.0    
                    Mean   : 8418                        Mean   :1529.7    
                    3rd Qu.: 9163                        3rd Qu.:2288.0    
                    Max.   :99011                        Max.   :2435.0    
     Churn     
 Min.   :0.00  
 1st Qu.:0.00  
 Median :0.00  
 Mean   :0.25  
 3rd Qu.:0.25  
 Max.   :1.00  

In this example, we “repositioned” the data frame main_data outside the parentheses (before the pipe). However, the final outcome is exactly the same. We can confirm this using the function identical() from base R:

# Check whether the two approaches lead to the exact same results
identical(summary(main_data), main_data %>% summary())
[1] TRUE

Notice that we could achieve the same result using nested parentheses: in R, as in mathematical operations, expressions within parentheses always take precedence. However, as we start nesting multiple functions within parentheses, the code quickly becomes cluttered and hard to follow.

Pipes are extremely helpful when applying multiple functions sequentially, especially on a data frame, as they allow us to transform it step by step into the desired form. Imagine using 10 different functions, with some of them targeting specific columns — handling this with nested parentheses would make the code almost unreadable and very challenging to debug. By adopting pipes, our code becomes far more readable and efficient.
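As a small illustration, using only functions we have already seen, the same two-step computation can be written with nested parentheses or with a pipe:

```r
# Nested parentheses: read from the inside out
nrow(head(main_data))

# Pipe: read from left to right, one step at a time
main_data %>% head() %>% nrow()  # both return 6
```

With two functions the difference is small, but the piped version already reads in the order the steps actually happen.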

4.3 Exploring Functions for Data Manipulation

For the rest of this chapter, we will see how we can apply dplyr and tidyr functions to handle the imported data set. For the first five functions, we check the results both with and without the use of pipes (to become more familiar with the concept).

4.3.1 Function: select()

This function allows us to select specific columns to keep in, or drop from, our data frame. Specifically, we include the names of the columns we want to keep. On the other hand, if we want to exclude some columns from the resulting data set, we include their names preceded by a minus sign (-).

In the following example, we want to keep only the columns Recency, Frequency, Monetary_Value and Churn. Therefore, we can either select to keep these columns or select to remove all other columns.

# Select Recency, Frequency, Monetary_Value and Churn (without pipes)
head(select(main_data, Recency, Frequency, Monetary_Value, Churn))
# A tibble: 6 × 4
  Recency Frequency Monetary_Value Churn
    <dbl>     <dbl>          <dbl> <dbl>
1      46        26          3010.     0
2      40        56         57347.     0
3      35       293         14496.     0
4      50        18          1416.     0
5      77        14           524.     0
6      55        39          8830.     0
# Select Recency, Frequency, Monetary_Value and Churn (with pipes)
main_data %>% 
  select(Recency, Frequency, Monetary_Value, Churn) %>% head()
# A tibble: 6 × 4
  Recency Frequency Monetary_Value Churn
    <dbl>     <dbl>          <dbl> <dbl>
1      46        26          3010.     0
2      40        56         57347.     0
3      35       293         14496.     0
4      50        18          1416.     0
5      77        14           524.     0
6      55        39          8830.     0
# Drop ID, Recency_Level, Frequency_Level and Observation_Period (with pipes)
main_data %>% 
  select(-ID, -Recency_Level, -Frequency_Level, -Observation_Period) %>%
  head()
# A tibble: 6 × 5
  Recency Frequency Monetary_Value Monetary_Value_Level Churn
    <dbl>     <dbl>          <dbl> <chr>                <dbl>
1      46        26          3010. Medium                   0
2      40        56         57347. High                     0
3      35       293         14496. High                     0
4      50        18          1416. Medium                   0
5      77        14           524. Low                      0
6      55        39          8830. Medium                   0

Note that when we specify the names of the columns we want to keep or exclude, we do not use quotes or special characters; we just type the name of the column “as is”. Additionally, the first argument of the select() function is our data set, and the rest of the arguments are the columns we want to select (or drop). When we used pipes, however, the function call was written in a more straightforward fashion, as it only included the names of the selected columns. Lastly, we saw how we can chain the function head() directly with pipes.

Hopefully, this example clarifies the usefulness of pipes in writing consistent and efficient code. The more complicated the data manipulation, the more useful the functionality of dplyr becomes.

4.3.2 Function: rename()

Another simple yet useful function is rename(), which simply changes the name of a column. To do so, we write the new name of the column, followed by the equals sign (=) and the old name of the corresponding column (the one we want to change). In the following example, we change the name of the column ID to Customer_ID, to make it more intuitive.

# Change the name of ID (without pipes)
head(rename(main_data, Customer_ID = ID))
# A tibble: 6 × 9
  Customer_ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
        <dbl>   <dbl> <chr>             <dbl> <chr>                    <dbl>
1           1      46 Low                  26 Low                      3010.
2           2      40 Low                  56 Medium                  57347.
3           3      35 Low                 293 High                    14496.
4           4      50 Low                  18 Low                      1416.
5           5      77 Medium               14 Low                       524.
6           6      55 Medium               39 Low                      8830.
# ℹ 3 more variables: Monetary_Value_Level <chr>, Observation_Period <dbl>,
#   Churn <dbl>
# Change the name of ID (with pipes)
main_data %>% rename(Customer_ID = ID) %>% head()
# A tibble: 6 × 9
  Customer_ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
        <dbl>   <dbl> <chr>             <dbl> <chr>                    <dbl>
1           1      46 Low                  26 Low                      3010.
2           2      40 Low                  56 Medium                  57347.
3           3      35 Low                 293 High                    14496.
4           4      50 Low                  18 Low                      1416.
5           5      77 Medium               14 Low                       524.
6           6      55 Medium               39 Low                      8830.
# ℹ 3 more variables: Monetary_Value_Level <chr>, Observation_Period <dbl>,
#   Churn <dbl>

The select() function also includes a trick that renames the columns we choose, so using rename() is not strictly required in most cases. Inside select(), we can give new names to the selected columns directly. For example, we can select the column ID from our data set and name it Customer_ID at the same time:

# Select and change the name of ID (without pipes)
head(select(main_data, Customer_ID = ID))
# A tibble: 6 × 1
  Customer_ID
        <dbl>
1           1
2           2
3           3
4           4
5           5
6           6
# Select and change the name of ID (with pipes)
main_data %>% select(Customer_ID = ID) %>% head()
# A tibble: 6 × 1
  Customer_ID
        <dbl>
1           1
2           2
3           3
4           4
5           5
6           6

As we would expect, given that we used the select() function, the output is a data frame with a single column, since we selected only one column. With the rename() function, in contrast, we rename the stated column without changing which columns are returned.

4.3.3 Function: filter()

Often, we want to remove certain rows from a data set. The filter() function helps us retain only the rows that meet a specified condition. It is important to note that, while select() applies to columns, filter() applies to rows. In the filter() function, the condition typically involves specific columns, but its effect impacts entire rows. In other words, if a row is filtered out based on a condition in a particular column, the corresponding data in all other columns for that row are also removed (the entire row is removed). In the example below, we use filter() to keep only customers who have churned (i.e., where the value in the Churn column is 1). Since 1,000 customers have churned in our data set, we use nrow() to confirm the result.

# Keep rows for which Churn is 1 (without pipes)
filter(main_data, Churn == 1)
# A tibble: 1,000 × 9
      ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
   <dbl>   <dbl> <chr>             <dbl> <chr>                    <dbl>
 1     8     193 High                 36 Low                     15585.
 2    10     167 High                 20 Low                      1913.
 3    11     466 High                 22 Low                      1670.
 4    14     132 High                 13 Low                       713.
 5    15     435 High                 20 Low                       721.
 6    19      96 Medium               31 Low                      1797.
 7    20     112 High                 41 Low                      7435.
 8    24     230 High                 20 Low                       289.
 9    39     231 High                 28 Low                      2894.
10    44     111 High                 16 Low                       449.
# ℹ 990 more rows
# ℹ 3 more variables: Monetary_Value_Level <chr>, Observation_Period <dbl>,
#   Churn <dbl>
# Confirm the number of rows
nrow(filter(main_data, Churn == 1))
[1] 1000
# Keep rows for which Churn is 1 (with pipes)
main_data %>% filter(Churn == 1)
# A tibble: 1,000 × 9
      ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
   <dbl>   <dbl> <chr>             <dbl> <chr>                    <dbl>
 1     8     193 High                 36 Low                     15585.
 2    10     167 High                 20 Low                      1913.
 3    11     466 High                 22 Low                      1670.
 4    14     132 High                 13 Low                       713.
 5    15     435 High                 20 Low                       721.
 6    19      96 Medium               31 Low                      1797.
 7    20     112 High                 41 Low                      7435.
 8    24     230 High                 20 Low                       289.
 9    39     231 High                 28 Low                      2894.
10    44     111 High                 16 Low                       449.
# ℹ 990 more rows
# ℹ 3 more variables: Monetary_Value_Level <chr>, Observation_Period <dbl>,
#   Churn <dbl>
# Confirm the number of rows
main_data %>% filter(Churn == 1 ) %>% nrow()
[1] 1000
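Although we used a single condition here, filter() also accepts several conditions at once. The sketch below is illustrative (the cutoffs of 100 days and 100 purchases are arbitrary values, not from the chapter):

```r
# Conditions separated by commas (or combined with &) must all hold:
# churned customers whose last purchase was more than 100 days ago
main_data %>%
  filter(Churn == 1, Recency > 100) %>%
  nrow()

# Use | for "or": customers that churned or purchased very frequently
main_data %>%
  filter(Churn == 1 | Frequency > 100) %>%
  nrow()
```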

4.3.4 Function: slice()

We saw that we can remove entire rows by specifying a condition on a particular field, using the function filter(). But what if we just want to keep a specified number of rows from the data set? For example, what if we want to keep the first five or ten rows? The function slice() is appropriate for cases like this. For instance, we could use this function to preview a very large data set, or to split a single data set into parts without including any row more than once. Essentially, we slice (or cut) the data set and keep a specified part of it. We just need to specify the row numbers that we want to keep.

# Keep the first 5 rows (without pipes)
slice(main_data, 1:5)
# A tibble: 5 × 9
     ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
  <dbl>   <dbl> <chr>             <dbl> <chr>                    <dbl>
1     1      46 Low                  26 Low                      3010.
2     2      40 Low                  56 Medium                  57347.
3     3      35 Low                 293 High                    14496.
4     4      50 Low                  18 Low                      1416.
5     5      77 Medium               14 Low                       524.
# ℹ 3 more variables: Monetary_Value_Level <chr>, Observation_Period <dbl>,
#   Churn <dbl>
# Keep the first 5 rows (with pipes)
main_data %>% slice(1:5)
# A tibble: 5 × 9
     ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
  <dbl>   <dbl> <chr>             <dbl> <chr>                    <dbl>
1     1      46 Low                  26 Low                      3010.
2     2      40 Low                  56 Medium                  57347.
3     3      35 Low                 293 High                    14496.
4     4      50 Low                  18 Low                      1416.
5     5      77 Medium               14 Low                       524.
# ℹ 3 more variables: Monetary_Value_Level <chr>, Observation_Period <dbl>,
#   Churn <dbl>

The slice() function can be viewed as a more flexible alternative to the head() function. With head(), we specify the number of rows to keep starting from the top of the dataset. In contrast, slice() allows us to define both the starting and ending indices of the rows we want to keep, giving us greater control over the selection.
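For instance, to keep rows from the middle of the data set, something head() cannot do on its own, we can pass any range of row indices to slice(). The ranges below are arbitrary illustrative values:

```r
# Keep rows 101 through 105 (with pipes)
main_data %>% slice(101:105)

# Negative indices drop rows instead: everything except the first 100
main_data %>% slice(-(1:100)) %>% nrow()  # 3900 rows remain
```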

4.3.5 Function: mutate()

We have seen how we can remove columns and rows from a data set. But what if we want to create a new column or transform the data of an existing one? This is where the function mutate() comes into play. For instance, suppose that, in our data set, we want to create a column that shows the number of months since a customer’s last purchase. We have the column Recency, which shows how many days have passed since a customer’s last purchase. Assuming that every month consists of 30 days, we can create a new column that shows the recency in months and then select only this column to print (with the function select() that we learned previously).

# Create a new column: Recency_months (without pipes) 
head(select(mutate(main_data, Recency_months = Recency/30), Recency_months))
# A tibble: 6 × 1
  Recency_months
           <dbl>
1           1.53
2           1.33
3           1.17
4           1.67
5           2.57
6           1.83
# Create a new column: Recency_in_months (with pipes) 
main_data %>% 
  mutate(Recency_in_months = Recency/30) %>% 
  select(Recency_in_months) %>% 
  head()
# A tibble: 6 × 1
  Recency_in_months
              <dbl>
1              1.53
2              1.33
3              1.17
4              1.67
5              2.57
6              1.83

This example also illustrates why pipes are much better when we want to apply several functions sequentially. The first code chunk is hardly readable, while the second is much clearer: we create a new column (mutate()), then select the created column (select()), and then print the first 6 rows of that column (head()).

Had we used the original name of the column (Recency) instead of a new name such as Recency_months, we would have transformed the existing column rather than created a new one.

# Transform column Recency (without pipes) 
head(mutate(main_data, Recency = Recency/30))
# A tibble: 6 × 9
     ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
  <dbl>   <dbl> <chr>             <dbl> <chr>                    <dbl>
1     1    1.53 Low                  26 Low                      3010.
2     2    1.33 Low                  56 Medium                  57347.
3     3    1.17 Low                 293 High                    14496.
4     4    1.67 Low                  18 Low                      1416.
5     5    2.57 Medium               14 Low                       524.
6     6    1.83 Medium               39 Low                      8830.
# ℹ 3 more variables: Monetary_Value_Level <chr>, Observation_Period <dbl>,
#   Churn <dbl>
# Transform column Recency (with pipes) 
main_data %>% mutate(Recency = Recency/30) %>% head()
# A tibble: 6 × 9
     ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
  <dbl>   <dbl> <chr>             <dbl> <chr>                    <dbl>
1     1    1.53 Low                  26 Low                      3010.
2     2    1.33 Low                  56 Medium                  57347.
3     3    1.17 Low                 293 High                    14496.
4     4    1.67 Low                  18 Low                      1416.
5     5    2.57 Medium               14 Low                       524.
6     6    1.83 Medium               39 Low                      8830.
# ℹ 3 more variables: Monetary_Value_Level <chr>, Observation_Period <dbl>,
#   Churn <dbl>

In combination with the mutate() function, we can use the row_number() function to create a column that shows the index of a row.

# Create a new column: Row_Number (without pipes) 
head(mutate(main_data, Row_Number = row_number()))
# A tibble: 6 × 10
     ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
  <dbl>   <dbl> <chr>             <dbl> <chr>                    <dbl>
1     1      46 Low                  26 Low                      3010.
2     2      40 Low                  56 Medium                  57347.
3     3      35 Low                 293 High                    14496.
4     4      50 Low                  18 Low                      1416.
5     5      77 Medium               14 Low                       524.
6     6      55 Medium               39 Low                      8830.
# ℹ 4 more variables: Monetary_Value_Level <chr>, Observation_Period <dbl>,
#   Churn <dbl>, Row_Number <int>
# Create a new column: Row_Number (with pipes) 
main_data %>% mutate(Row_Number = row_number()) %>% head()
# A tibble: 6 × 10
     ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
  <dbl>   <dbl> <chr>             <dbl> <chr>                    <dbl>
1     1      46 Low                  26 Low                      3010.
2     2      40 Low                  56 Medium                  57347.
3     3      35 Low                 293 High                    14496.
4     4      50 Low                  18 Low                      1416.
5     5      77 Medium               14 Low                       524.
6     6      55 Medium               39 Low                      8830.
# ℹ 4 more variables: Monetary_Value_Level <chr>, Observation_Period <dbl>,
#   Churn <dbl>, Row_Number <int>

In conclusion, we use the mutate() function to create or modify columns. With this function, we can perform various operations on the columns of a data set, or even apply other functions such as row_number().
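Note also that a single mutate() call can create several columns at once, and later columns may refer to ones defined earlier in the same call. The sketch below is illustrative (the column names and the 30-day month assumption are our own):

```r
# Two derived columns in one mutate() call; the second reuses the first
main_data %>%
  mutate(Recency_in_months = Recency / 30,
         Observation_in_months = Observation_Period / 30) %>%
  select(Recency_in_months, Observation_in_months) %>%
  head()
```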

4.3.6 Functions: group_by(), summarize() and ungroup()

Aggregated data are summarizations of our raw data. For example, how many customers exist within each category of recency level? Or, what is the average recency for customers that churn compared to customers that do not? Essentially, we “summarize” our results, losing some case-specific information, in order to extract valuable conclusions for the whole data set.

For starters, let’s see how many levels of recency we have:

# Check distinct Recency Level
main_data %>%
  select(Recency_Level) %>%
  distinct()
# A tibble: 3 × 1
  Recency_Level
  <chr>        
1 Low          
2 Medium       
3 High         

The function distinct() (which is similar to the function unique() from base R) shows only the unique values of the selected column(s). Had we only selected the column, the output would still have 4,000 rows; it is distinct() that reduces the output to the unique values. Using distinct(), we clearly identify the three levels above: “Low”, “Medium” and “High”.
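The same function works on combinations of columns: if we select two columns before calling distinct(), we obtain each unique pair of values that appears in the data. This sketch is illustrative, not from the chapter:

```r
# Unique combinations of recency and frequency levels
main_data %>%
  select(Recency_Level, Frequency_Level) %>%
  distinct()
```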

Now, suppose we want to count how many customers we have by level of recency. This is where the group_by() function comes into play.

# Group by Recency Level
main_data %>%
  group_by(Recency_Level)
# A tibble: 4,000 × 9
# Groups:   Recency_Level [3]
      ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
   <dbl>   <dbl> <chr>             <dbl> <chr>                    <dbl>
 1     1      46 Low                  26 Low                      3010.
 2     2      40 Low                  56 Medium                  57347.
 3     3      35 Low                 293 High                    14496.
 4     4      50 Low                  18 Low                      1416.
 5     5      77 Medium               14 Low                       524.
 6     6      55 Medium               39 Low                      8830.
 7     7      48 Low                  18 Low                      1493.
 8     8     193 High                 36 Low                     15585.
 9     9      96 Medium               24 Low                      1608.
10    10     167 High                 20 Low                      1913.
# ℹ 3,990 more rows
# ℹ 3 more variables: Monetary_Value_Level <chr>, Observation_Period <dbl>,
#   Churn <dbl>

Notice that, by using the group_by() function, it seems at first glance as if nothing changes. This is simply because, even though we specified the groups, we did not yet use any other function to make changes; we just grouped the data. Yet, observing closely, we see that there is an extra line showing the name of the grouping variable (with the square brackets showing the number of groups of the selected variable). The group_by() function is very useful when we want to summarize descriptive results for each category or group separately. With the group_by() function, we can create aggregated features using the summarize() function, as shown below.

# Group by Recency_Level and summarize Number_of_Customers
main_data %>% 
  group_by(Recency_Level) %>% 
  summarize(Number_of_Customers = n())
# A tibble: 3 × 2
  Recency_Level Number_of_Customers
  <chr>                       <int>
1 High                         1160
2 Low                          1877
3 Medium                        963

To count the data, we also used another function from the dplyr package, the n() function, which simply counts the number of observations (note that we do not include any argument within the parentheses). Now, suppose we wanted to calculate the average value of recency for the customers that churn and compare it with the respective value for customers that do not churn. In that case, we would follow a similar approach but, instead of the function n(), we would use the function mean() with the Recency column inside the parentheses.

# Group by Recency Churn and Summarize Average Recency
main_data %>%
  group_by(Churn) %>% 
  summarize(Average_Recency = mean(Recency))
# A tibble: 2 × 2
  Churn Average_Recency
  <dbl>           <dbl>
1     0            61.7
2     1           223. 

As seen, the average value of recency is much higher for customers that churn than for those that do not. We managed to create these summary results with just a short (and quite intuitive) piece of R code. In fact, we could easily store outputs like this in another object, in case we need them later in our analysis.

Regarding the outputs of the two examples above, we see that, in both cases, each output has one row per group (the number of customers or the average recency per specified group). It is important to note that we can give names of our choice to these newly created (aggregated) columns.
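As a small sketch of this idea (using a toy tibble rather than the customer_churn data, so the values here are made up), several named summaries can be combined in a single summarize() call:

```r
library(dplyr)

# A toy tibble for illustration (not the customer_churn data)
toy_data <- tibble(
  Churn   = c(0, 0, 1, 1),
  Recency = c(40, 50, 200, 240)
)

# One row per group, with column names of our choice
toy_data %>%
  group_by(Churn) %>%
  summarize(Average_Recency     = mean(Recency),
            Number_of_Customers = n())
```

Each name on the left of the equals sign becomes a column in the aggregated output.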

In case we do not need to work with grouped data anymore, we use the function ungroup(), without arguments. In the example below, we use group_by() and ungroup() in the same pipeline to see that we revert to the original data set main_data (no group line appears above the column names).

# Group by Churn and Ungroup
main_data %>% 
  group_by(Churn) %>% 
  ungroup()
# A tibble: 4,000 × 9
      ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
   <dbl>   <dbl> <chr>             <dbl> <chr>                    <dbl>
 1     1      46 Low                  26 Low                      3010.
 2     2      40 Low                  56 Medium                  57347.
 3     3      35 Low                 293 High                    14496.
 4     4      50 Low                  18 Low                      1416.
 5     5      77 Medium               14 Low                       524.
 6     6      55 Medium               39 Low                      8830.
 7     7      48 Low                  18 Low                      1493.
 8     8     193 High                 36 Low                     15585.
 9     9      96 Medium               24 Low                      1608.
10    10     167 High                 20 Low                      1913.
# ℹ 3,990 more rows
# ℹ 3 more variables: Monetary_Value_Level <chr>, Observation_Period <dbl>,
#   Churn <dbl>

4.3.7 Functions: arrange()

A simple, yet very useful function is the arrange() function. As the name suggests, it helps us arrange the order of rows in a data frame. For example, we may want to review our data set in descending order of Monetary_Value. To use this function, we include in the parentheses the name of the column by which we want to arrange our data. If we want the order to be descending, we wrap the column in the desc() function inside arrange(); otherwise, an ascending order is applied.

To check the results of the arrange() function, let us use the functions group_by() and summarize() to calculate the average monetary value per recency level. We then use the arrange() function to arrange the order of the resulting rows.

# Group by Recency_Level
# Summarize Average_Monetary_Value
# Arrange on Average_Monetary_Value
main_data %>% 
  group_by(Recency_Level) %>%
  summarize(Average_Monetary_Value = mean(Monetary_Value)) %>%
  arrange(Average_Monetary_Value)
# A tibble: 3 × 2
  Recency_Level Average_Monetary_Value
  <chr>                          <dbl>
1 High                           3687.
2 Medium                         6336.
3 Low                           12409.
# Group by Recency_Level
# Summarize Average_Monetary_Value
# Arrange on Average_Monetary_Value (descending order)
main_data %>% 
  group_by(Recency_Level) %>%
  summarize(Average_Monetary_Value = mean(Monetary_Value)) %>%
  arrange(desc(Average_Monetary_Value))
# A tibble: 3 × 2
  Recency_Level Average_Monetary_Value
  <chr>                          <dbl>
1 Low                           12409.
2 Medium                         6336.
3 High                           3687.

The Average_Monetary_Value column is numeric, so the resulting order is straightforward. For character columns, such as Recency_Level, the rows are ordered alphabetically. To see this, let us rewrite the above code, this time using the column Recency_Level to arrange the rows.

# Group by Recency_Level
# Summarize Average_Monetary_Value
# Arrange on Recency_Level
main_data %>% 
  group_by(Recency_Level) %>%
  summarize(Average_Monetary_Value = mean(Monetary_Value)) %>%
  arrange(Recency_Level)
# A tibble: 3 × 2
  Recency_Level Average_Monetary_Value
  <chr>                          <dbl>
1 High                           3687.
2 Low                           12409.
3 Medium                         6336.
# Group by Recency_Level
# Summarize Average_Monetary_Value
# Arrange on Recency_Level (descending order)
main_data %>% 
  group_by(Recency_Level) %>%
  summarize(Average_Monetary_Value = mean(Monetary_Value)) %>%
  arrange(desc(Recency_Level))
# A tibble: 3 × 2
  Recency_Level Average_Monetary_Value
  <chr>                          <dbl>
1 Medium                         6336.
2 Low                           12409.
3 High                           3687.

4.3.8 Functions: select() and pull()

We have seen how we can use the function select() to choose one or more columns to keep in our data set. Every time we use this function, the result is always a data frame. This is important to remember because, even if we select a single column, we still get a data frame (consisting of that single column). We saw that this was the case when we started the discussion about the aggregation functions group_by() and summarize(). However, we sometimes need this column in vector form (instead of data frame form). This is when we use the pull() function; its name comes from the fact that we “pull” a single column out of a data set (as a vector). Of course, we need to include the name of the column that we want to “pull” as an argument. To see how pull() works and how it differs from select(), let us extract only the column Churn from our main data set. We see in the console the different outputs from the two functions.

# Select column Churn
main_data %>% select(Churn) %>% head()
# A tibble: 6 × 1
  Churn
  <dbl>
1     0
2     0
3     0
4     0
5     0
6     0
# Pull column Churn
main_data %>% pull(Churn) %>% head()
[1] 0 0 0 0 0 0

As shown, no column name is kept when we use the pull() function; we get a vector, not the one-column data frame that select() returns.
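Because pull() returns a plain vector, base R functions can be applied to it directly. A minimal sketch with a toy tibble (not the customer_churn data):

```r
library(dplyr)

# A toy tibble for illustration
toy_data <- tibble(Recency = c(46, 40, 35, 50))

# pull() gives a vector, so mean() applies directly
toy_data %>% pull(Recency) %>% mean()

# select(Recency) would instead return a one-column tibble
```

This is a common reason to reach for pull(): passing a single column to functions that expect a vector rather than a data frame.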

4.3.9 Functions: mutate() with if_else() and case_when()

We have already seen how we can create a new column using the function mutate(). We can use this function to create a new column based on specific conditions using the if_else() function. Let us use this function to create a new column Churn_Label where, if the value in the column Churn is 1, it takes the value “customer churns” in the column Churn_Label; otherwise it takes the value “customer does not churn”.

# Create a new column: Churn_Label with if_else()
main_data %>% 
  mutate(Churn_Label = if_else(Churn == 1, 
                               "customer churns", 
                               "customer does not churn")) %>%
  select(Churn, Churn_Label)
# A tibble: 4,000 × 2
   Churn Churn_Label            
   <dbl> <chr>                  
 1     0 customer does not churn
 2     0 customer does not churn
 3     0 customer does not churn
 4     0 customer does not churn
 5     0 customer does not churn
 6     0 customer does not churn
 7     0 customer does not churn
 8     1 customer churns        
 9     0 customer does not churn
10     1 customer churns        
# ℹ 3,990 more rows

With the if_else() function we essentially create the Churn_Label column: if the column Churn has a value of 1, the corresponding value is “customer churns”; otherwise it is “customer does not churn”. In this example, we also used the select() function just to get a clear output of the two churn columns, and we see indeed that their values are aligned.

But what if we need to combine multiple conditions and outputs? Suppose we want to create a column Value_Label_if_Customer_Churns which takes the following values:

  • If the monetary value label is low and the customer churns, then we want to get the value “low value”
  • If the monetary value label is medium and the customer churns, then we want to get the value “medium value”
  • If the monetary value label is high and the customer churns, then we want to get the value “high value”
  • Otherwise we get the value “customer does not churn”

Although we could nest several if_else() calls, this would create messy code. For such cases, it is much better to use the case_when() function. Inside its parentheses, we specify the condition (the “if”), include the tilde sign (~) and then the result (the “then”). Separating them with commas, we can specify different conditions with different outcomes. Let us then create the column that we just mentioned using the piece of code below:

# Create a new column: Value_Label_if_Customer_Churns with case_when()
main_data %>% 
  mutate(Value_Label_if_Customer_Churns = case_when(
    Monetary_Value_Level == "Low" & Churn == 1 ~ "low value",
    Monetary_Value_Level == "Medium" & Churn == 1  ~ "medium value",
    Monetary_Value_Level == "High" & Churn == 1 ~ "high value",
    TRUE ~ "customer does not churn")) %>% 
  select(Value_Label_if_Customer_Churns, Monetary_Value_Level, Churn)
# A tibble: 4,000 × 3
   Value_Label_if_Customer_Churns Monetary_Value_Level Churn
   <chr>                          <chr>                <dbl>
 1 customer does not churn        Medium                   0
 2 customer does not churn        High                     0
 3 customer does not churn        High                     0
 4 customer does not churn        Medium                   0
 5 customer does not churn        Low                      0
 6 customer does not churn        Medium                   0
 7 customer does not churn        Medium                   0
 8 high value                     High                     1
 9 customer does not churn        Medium                   0
10 medium value                   Medium                   1
# ℹ 3,990 more rows

We see how we can use special characters, such as the ampersand (&), to combine conditions (here, the ampersand introduces an AND condition, requiring both to hold). Additionally, note that, for the last condition, we do not refer to any column; we simply write TRUE. In the case_when() function, this matches whenever none of the other conditions has been met, which is why it should be used only as the last, “catch-all” condition.

In the last example, if we wanted to specify that if at least one of the conditions is TRUE then the whole statement would be TRUE (what is often called an “OR” condition), we would need to use the vertical bar (|) instead of the ampersand (&).
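A minimal sketch of such an OR condition, using a toy tibble (the At_Risk column and its labels are illustrative, not part of the customer_churn analysis):

```r
library(dplyr)

# A toy tibble for illustration
toy_data <- tibble(
  Recency_Level = c("Low", "High", "Medium"),
  Churn         = c(1, 0, 0)
)

# The vertical bar (|) flags a row if EITHER condition holds
toy_data %>%
  mutate(At_Risk = if_else(Recency_Level == "High" | Churn == 1,
                           "at risk",
                           "not at risk"))
```

Only the third row fails both conditions, so only it is labeled “not at risk”.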

4.3.10 Functions: group_by(), summarize() and count()

With the aggregation functions group_by() and summarize(), we saw how to calculate summary statistics such as the number of customers within each recency level. Although we used the function n() for this purpose, we can do the exact same thing with the count() function, without any prior grouping. We simply include the name of the column of interest as an argument. In the example below we see how both approaches lead to the exact same results.

# Group by Recency_Level and summarize Number_of_Customers
main_data %>%
  group_by(Recency_Level) %>% 
  summarize(n = n()) %>% 
  ungroup() 
# A tibble: 3 × 2
  Recency_Level     n
  <chr>         <int>
1 High           1160
2 Low            1877
3 Medium          963
# Count number of Recency_Level 
main_data %>% 
  count(Recency_Level)
# A tibble: 3 × 2
  Recency_Level     n
  <chr>         <int>
1 High           1160
2 Low            1877
3 Medium          963

The count() function is very handy when we just want the number of times a group has appeared and nothing more (i.e., just the frequency); we see that the code is much simpler and easier to read.
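As a further sketch (again with toy data), count() also accepts a sort argument that orders the groups by frequency:

```r
library(dplyr)

# A toy tibble for illustration
toy_data <- tibble(Recency_Level = c("Low", "Low", "High", "Medium", "Low"))

# sort = TRUE puts the most frequent group first
toy_data %>% count(Recency_Level, sort = TRUE)
```

This saves a separate arrange(desc(n)) step when we only need a frequency table.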

4.3.11 Functions: bind_cols()

The function bind_cols() allows us to put the columns of one data frame next to another's, so as to form a larger data frame. It is, however, necessary that the number of rows in each input is equal. With bind_cols(), we can bind as many inputs as we want; we just include their names in the parentheses, separated by commas. In this example, we extract the columns Recency, Recency_Level and Churn of our main data set as three separate vectors. Then, we “bind” those vectors together to form a new data frame.

# Extract the 3 columns as vectors
example_data_1 <- main_data$Recency
example_data_2 <- main_data$Recency_Level
example_data_3 <- main_data$Churn

# Bind the 3 data frames together
example_data_1 %>%
  bind_cols(example_data_2, example_data_3) %>% 
  head()
# A tibble: 6 × 3
   ...1 ...2    ...3
  <dbl> <chr>  <dbl>
1    46 Low        0
2    40 Low        0
3    35 Low        0
4    50 Low        0
5    77 Medium     0
6    55 Medium     0

That was a simple example showing how the bind_cols() function works. Because we bound unnamed vectors, the resulting columns received the placeholder names ...1, ...2 and ...3. Of course, if we only wanted to keep these columns, we could simply use the select() function.

In the example above, the columns appear in the following order: Recency, Recency_Level, Churn. This is intentional, as the order of the columns in the resulting data frame aligns with the order in which they were specified in the bind_cols() function. The columns from the original data frame (i.e., those outside the bind_cols() function) will always appear first (on the left).
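One way to keep readable column names (instead of the placeholder names that unnamed vectors receive) is to name the inputs inside bind_cols(); a minimal sketch with made-up values:

```r
library(dplyr)

# Naming each vector gives the resulting columns those names
bind_cols(Recency       = c(46, 40, 35),
          Recency_Level = c("Low", "Low", "Low"),
          Churn         = c(0, 0, 0))
```

The resulting tibble then has the columns Recency, Recency_Level and Churn, in that order.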

4.3.12 Functions: bind_rows()

Similar to bind_cols(), the bind_rows() function allows us to put the rows of one data frame below another's, so as to form (again) a larger data frame. This time, the main condition is that the column names should match between the different data frames. The intuition is the same as with bind_cols(), but instead of binding columns, we bind (or append) rows. As an example, we assign the first 9 rows of our main data set (extracted with the slice() function) to 3 small data frames (3 rows per data frame). Then, we “bind” those rows together to form a new data frame.

# Create the 3 small data sets
example_data_1 <- main_data %>% slice(1:3)
example_data_2 <- main_data %>% slice(4:6)
example_data_3 <- main_data %>% slice(7:9)

# Bind the 3 data frames together
example_data_1 %>%
  bind_rows(example_data_2, example_data_3)
# A tibble: 9 × 9
     ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
  <dbl>   <dbl> <chr>             <dbl> <chr>                    <dbl>
1     1      46 Low                  26 Low                      3010.
2     2      40 Low                  56 Medium                  57347.
3     3      35 Low                 293 High                    14496.
4     4      50 Low                  18 Low                      1416.
5     5      77 Medium               14 Low                       524.
6     6      55 Medium               39 Low                      8830.
7     7      48 Low                  18 Low                      1493.
8     8     193 High                 36 Low                     15585.
9     9      96 Medium               24 Low                      1608.
# ℹ 3 more variables: Monetary_Value_Level <chr>, Observation_Period <dbl>,
#   Churn <dbl>

4.3.13 Functions: pivot_wider() & pivot_longer()

The pivot_wider() function makes a table “wider”, giving it more columns and fewer rows, while the pivot_longer() function makes a table “longer”, giving it fewer columns and more rows. It is much easier to understand these functions with a simple example. Suppose we want to check how many customers churn (or do not churn) within each level of monetary value. For this, let us create a new data frame (called summarized_data) with the techniques we have already discussed.

# Create Data Frame for Monetary Value Level per Customer Churn Label
summarized_data <- main_data %>% 
  count(Churn, Monetary_Value_Level) %>%
  rename(Number = n)

# Print the results
summarized_data
# A tibble: 6 × 3
  Churn Monetary_Value_Level Number
  <dbl> <chr>                 <int>
1     0 High                    811
2     0 Low                     235
3     0 Medium                 1954
4     1 High                     85
5     1 Low                     347
6     1 Medium                  568

We now have the two columns that we included in the parentheses (Churn and Monetary_Value_Level) plus the column Number, which shows the number of observations within each category. Notice how the values 0 and 1 each appear three times in the Churn column. This happens because each churn value is combined with the three different values of the Monetary_Value_Level column.

What if, instead of one column for monetary value levels, we had a separate column for each level? This is where we use the pivot_wider() function, as we want our data frame to have more columns and fewer rows. Besides the data frame, pivot_wider() requires two key arguments: names_from and values_from. In the first, we put the column whose values will be used as the names of the new columns. In the second, we put the column whose values will populate these new columns. In our example, we set these arguments to Monetary_Value_Level and Number. The output is the following:

# Create a new data frame with pivot_wider()
wider_summarized_data <- summarized_data %>%
  pivot_wider(names_from = Monetary_Value_Level, values_from = Number)

# Print results
wider_summarized_data 
# A tibble: 2 × 4
  Churn  High   Low Medium
  <dbl> <int> <int>  <int>
1     0   811   235   1954
2     1    85   347    568

Now, each monetary value level has its own column, with values taken from the column Number, which does not appear anymore. This transformation reduces rows and increases columns, thus making the table ‘wider’.

The pivot_longer() function can be seen as the inverse of the pivot_wider() function: it typically takes multiple columns and reshapes them into two: one column for the variable names (specified with the names_to argument) and another for the corresponding values (specified with the values_to argument). To specify the columns to transform, we list their names in a vector in the cols argument.

In the following example, we use the function pivot_longer() on the data frame wider_summarized_data that we created above.

# Create a new data frame with pivot_longer()
longer_summarized_data <- wider_summarized_data %>% 
  pivot_longer(cols = c(High, Low, Medium),
               names_to = "Levels", 
               values_to = "Number")

# Print results 
longer_summarized_data
# A tibble: 6 × 3
  Churn Levels Number
  <dbl> <chr>   <int>
1     0 High      811
2     0 Low       235
3     0 Medium   1954
4     1 High       85
5     1 Low       347
6     1 Medium    568

Notice how the column names in names_to and values_to must be in double quotes (""), but those in cols do not need them.

We see that the final output is essentially the same as the data frame summarized_data that we initially created; only the column name differs (Levels instead of Monetary_Value_Level).
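As a closing sketch, the cols argument of pivot_longer() also accepts tidyselect-style specifications, such as selecting every column except one (toy data used here for illustration):

```r
library(dplyr)
library(tidyr)

# A toy "wide" tibble for illustration
toy_wide <- tibble(Churn = c(0, 1), High = c(811, 85), Low = c(235, 347))

# cols = -Churn means: pivot every column except Churn
toy_wide %>%
  pivot_longer(cols = -Churn, names_to = "Levels", values_to = "Number")
```

This avoids listing every level column by hand when there are many of them.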