In this chapter, we discuss various functions we can use
to handle our data. More specifically, we introduce the
tidyverse family of packages and explore how we can
use two of those packages for data manipulation.
4.2 From Data Imports to
Data Manipulation
Generally, after we import the data, we want to transform them in
such a way that they become easy to analyze, a process often called
data manipulation. In the real world, data sets are
usually not in the format we would like them to be. Even when data
are orderly stored in databases, they are typically not structured
with our specific analysis in mind. For example, do we want to
analyze sales per country for a specific calendar year or all
calendar years of sales for a specific country? In this example, it
is up to us to decide on the best approach, depending on our goal.
This illustrates that we need to be able to bring an imported data
set into an appropriate format to facilitate our analysis.
4.3 Data Manipulation
with the Tidyverse Family
Before diving into details on data manipulation, we need to discuss
the most well-known family of packages within the R community: the
tidyverse. In simple terms,
tidyverse refers to a
collection of packages that are designed to work together
seamlessly, helping us write code that is both easy to read and
consistent.
While more information on the tidyverse collection of packages can
be found on its official website (https://www.tidyverse.org/), the two packages that we will discuss here are the
dplyr and the tidyr package. Specifically,
the dplyr package is the primary package that we will
use throughout this chapter, with its main theme being the use of
pipe operators, or pipes (%>%), a concept that makes our code clearer and more
efficient while also helping us transform and manipulate data easily.
Although the use of pipes may seem complicated at the beginning, it
becomes increasingly intuitive as we practice. From the
tidyr package, we will use only a couple of necessary
functions. Because we do not want to remember exactly which function
comes from which package, we simply install and load both packages,
and recommend that you do so as well. Additionally, it makes sense
to have a specific data set throughout this chapter to demonstrate
the concepts. This data set will be the same as the one we imported
(using the read_csv() function from the
readr package) in Chapter
Importing Data: the customer_churn data set.
Our data set has information for 4,000 customers, out of which 1,000
churn, meaning that these customers are known to have stopped
purchasing at a given point in time. The columns
Recency and Recency_Level show information
about how recently a customer made his/her last purchase. Moreover,
the columns Frequency and
Frequency_Level show us how many purchases (in total) a
customer has made in the past. Also, the columns
Monetary_Value and
Monetary_Value_Level show how much money the company
received from a customer. According to academic marketing literature
(see, for example, Blattberg, Kim, & Neslin, 2008; Fader &
Hardie, 2009), these general variables—known by their acronym
RFM—can be used as strong indicators of whether a
customer will churn in the future. Intuitively, the more recent and
frequent a customer's purchases, and the higher their value, the higher
the probability that the customer will not churn. Lastly, the column
Observation_Period shows for how long someone is known
to have been a customer of the company.
We load all three mentioned packages from the tidyverse family as
well as the respective data set (assigning it to an object called
main_data) using the code below:
# Libraries
library(readr)
library(dplyr)
library(tidyr)

# Importing the data set
main_data <- read_csv("https://raw.githubusercontent.com/GeorgeOrfanos/Data-Sets/refs/heads/main/customer_churn.csv")

# Have a look at the data set
head(main_data)
str(main_data)

# A tibble: 6 × 9
ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
<dbl> <dbl> <chr> <dbl> <chr> <dbl>
1 1 46 Low 26 Low 3010.
2 2 40 Low 56 Medium 57347.
3 3 35 Low 293 High 14496.
4 4 50 Low 18 Low 1416.
5 5 77 Medium 14 Low 524.
6 6 55 Medium 39 Low 8830.
# ℹ 3 more variables: Monetary_Value_Level <chr>, Observation_Period <dbl>,
# Churn <dbl>
As mentioned above, the imported data set has 4,000 observations (as
rows) and 9 variables (as columns). Because it is really helpful to
have a brief description of each variable, the contents of each
column are described below:
ID: customer ID (unique number)
Recency: number of days that have passed since last
purchase
Recency_Level: the level of recency (low, medium or
high)
Frequency: number of purchases
Frequency_Level: the level of frequency (low, medium
or high)
Monetary_Value: total monetary value (e.g., total
amount in euros)
Monetary_Value_Level: the level of monetary value
(low, medium or high)
Observation_Period: number of days between the first
and the last purchase
Churn: whether the customer churns or not (a value of
1 implies that the customer churns)
Evidently, the data imported are structured as a tibble.
Tibbles retain all the functionalities of data
frames while offering additional features, such as improved printing
and better handling of large datasets. Therefore, a tibble is
essentially a “modern” or an advanced version of a data frame and so
we can use it in the exact same way as we would use a data frame.
Looking closely at the first line of the output of the
str() function, we observe that one of the classes of the
imported object is also data.frame. This is important because it
confirms that tibbles are built on top of base R data frames,
ensuring that functions expecting traditional data frames will still
work seamlessly with tibbles.
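If we want to confirm this for our own object, a quick check (a minimal sketch using base R) is to inspect its class attribute directly:
# Inspect the classes of the imported object
class(main_data)
# the last class listed is "data.frame", confirming that the tibble
# is built on top of a base R data frame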
Now that we understand more about the content of our data set, we
can start learning about the main symbol of the
dplyr package: the pipe.
4.4 The Use of Pipes:
%>%
As stated earlier, pipes can help us create clear and efficient
code. With pipes, we are able to
take the data output of one function and send it directly to the
next. To understand exactly how pipes are used, let us apply them to the
summary() function.
# Summary function without pipes
summary(main_data)
ID Recency Recency_Level Frequency
Min. : 1 Min. : 30.0 Length:4000 Min. : 12.00
1st Qu.:1001 1st Qu.: 38.0 Class :character 1st Qu.: 22.00
Median :2000 Median : 54.0 Mode :character Median : 31.00
Mean :2000 Mean :102.1 Mean : 37.82
3rd Qu.:3000 3rd Qu.:129.0 3rd Qu.: 45.00
Max. :4000 Max. :519.0 Max. :860.00
Frequency_Level Monetary_Value Monetary_Value_Level Observation_Period
Length:4000 Min. : 109 Length:4000 Min. : 1.0
Class :character 1st Qu.: 1780 Class :character 1st Qu.: 852.8
Mode :character Median : 4210 Mode :character Median :1803.0
Mean : 8418 Mean :1529.7
3rd Qu.: 9163 3rd Qu.:2288.0
Max. :99011 Max. :2435.0
Churn
Min. :0.00
1st Qu.:0.00
Median :0.00
Mean :0.25
3rd Qu.:0.25
Max. :1.00
# Summary function with pipes
main_data %>%
  summary()
ID Recency Recency_Level Frequency
Min. : 1 Min. : 30.0 Length:4000 Min. : 12.00
1st Qu.:1001 1st Qu.: 38.0 Class :character 1st Qu.: 22.00
Median :2000 Median : 54.0 Mode :character Median : 31.00
Mean :2000 Mean :102.1 Mean : 37.82
3rd Qu.:3000 3rd Qu.:129.0 3rd Qu.: 45.00
Max. :4000 Max. :519.0 Max. :860.00
Frequency_Level Monetary_Value Monetary_Value_Level Observation_Period
Length:4000 Min. : 109 Length:4000 Min. : 1.0
Class :character 1st Qu.: 1780 Class :character 1st Qu.: 852.8
Mode :character Median : 4210 Mode :character Median :1803.0
Mean : 8418 Mean :1529.7
3rd Qu.: 9163 3rd Qu.:2288.0
Max. :99011 Max. :2435.0
Churn
Min. :0.00
1st Qu.:0.00
Median :0.00
Mean :0.25
3rd Qu.:0.25
Max. :1.00
In this example, we “repositioned” the data frame
main_data to the outside of the parentheses (before
the pipe). However, the final outcome is exactly the same. We can
also confirm this by using the function
identical() from base R:
# Check whether the two approaches lead to the exact same results
identical(summary(main_data), main_data %>% summary())
[1] TRUE
Notice that we could achieve the same result using parentheses: in
R, as with mathematical operations, expressions within parentheses
always take precedence. However, as we start nesting multiple
functions within parentheses, the code will quickly become cluttered
and hard to follow.
Pipes are extremely helpful when applying multiple functions
sequentially, especially on a data frame, as they allow us to
transform it step by step into the desired form. Imagine using 10
different functions, with some of them targeting specific columns —
handling this with nested parentheses would make the code almost
unreadable and very challenging to debug. By adopting pipes, our
code becomes far more readable and efficient.
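As a small illustration of this point, here is a sketch that contrasts a nested call with its piped equivalent, using only the functions we have met so far:
# Nested calls: read from the inside out
summary(head(main_data))

# Piped calls: read from left to right, one step at a time
main_data %>%
  head() %>%
  summary()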
4.5 Exploring Functions
for Data Manipulation
For the rest of this chapter, we will see how we can apply
dplyr and tidyr functions to handle the
imported data set. For the first five functions, we check the
results with and without the use of pipes (to get more familiar with
the concept).
4.5.1 Function:
select()
This function allows us to select specific columns that we want to
keep in or drop from our data frame. Specifically, we include the
name of the columns we want to keep. On the other hand, if we want
to exclude some of the columns from the resulting dataset, we
include their names but with a minus sign (-) preceding them.
In the following example, we want to keep only the columns
Recency, Frequency,
Monetary_Value, Monetary_Value_Level and Churn. Therefore, we
can either select to keep these columns or select to remove all
other columns.
# Drop ID, Recency_Level, Frequency_Level and Observation_Period (with pipes)
main_data %>%
  select(-ID, -Recency_Level, -Frequency_Level, -Observation_Period) %>%
  head()
# A tibble: 6 × 5
Recency Frequency Monetary_Value Monetary_Value_Level Churn
<dbl> <dbl> <dbl> <chr> <dbl>
1 46 26 3010. Medium 0
2 40 56 57347. High 0
3 35 293 14496. High 0
4 50 18 1416. Medium 0
5 77 14 524. Low 0
6 55 39 8830. Medium 0
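For comparison, a sketch of the two alternatives mentioned above: keeping the desired columns explicitly, and the equivalent call without pipes, where the data set is the first argument.
# Keep Recency, Frequency, Monetary_Value, Monetary_Value_Level and Churn (with pipes)
main_data %>%
  select(Recency, Frequency, Monetary_Value, Monetary_Value_Level, Churn) %>%
  head()

# The same selection without pipes
head(select(main_data, Recency, Frequency, Monetary_Value, Monetary_Value_Level, Churn))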
Note that when we specify the names of the columns we want to keep
or exclude, we do not use quotes or special characters; we just
type the name of the column “as is”. Additionally, when
select() is used without pipes, the first argument is our data set and
the rest of the arguments are the columns we want to select
(or drop). However, when we used pipes, our call was written
in a more straightforward fashion, as it only included the names
of the selected columns. Lastly, note how we could apply the
function head() directly within the pipe.
Hopefully, this example clarifies the usefulness of pipes in
writing consistent and efficient code. The more complicated the
content of data manipulation, the more useful the functionality of
dplyr becomes.
4.5.2 Function:
rename()
Another easy yet useful function is the function
rename(), which simply changes the name of a column.
In order to do so, we write the new name of a column, followed by
the equal sign (=) and the old name of the corresponding column
(the one we want to change). In the following example, we change
the name of the column ID to
Customer_ID, to make it more intuitive.
# Change the name of ID (without pipes)
head(rename(main_data, Customer_ID = ID))
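The piped equivalent (a sketch following the same pattern as before) is:
# Change the name of ID (with pipes)
main_data %>%
  rename(Customer_ID = ID) %>%
  head()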
The select() function also includes a trick that
renames the columns that we choose, hence the use of
rename() is not really required for most cases.
Inside the function select(), we can give new names
to the selected columns directly. For example, we can select the
column ID from our data set and give it the name
Customer_ID directly:
# Select and change the name of ID (without pipes)
head(select(main_data, Customer_ID = ID))
As we would expect though, given that we used the
select() function, the output is a data frame with a
single column, since we selected only one column. With the
rename() function, by contrast, all columns are kept and
only the stated column is renamed.
4.5.3 Function:
filter()
Often, we want to remove certain rows from a dataset. The
filter() function helps us retain only the rows that
meet a specified condition. It’s important to note that, while
select() applies to columns,
filter() applies to rows. In the
filter() function, the condition typically involves
specific columns, but its effect impacts entire rows. In other
words, if a row is filtered out based on a condition in a
particular column, the corresponding data in
all other columns for that row are also removed
(the entire row is removed). In the example
below, we use filter() to keep only customers who
have churned (i.e., where the value in the
Churn column is 1). Since 1,000 customers have
churned in our dataset, we use nrow() to confirm the
result.
# Keep rows for which Churn is 1 (without pipes)
filter(main_data, Churn == 1)
# A tibble: 1,000 × 9
ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
<dbl> <dbl> <chr> <dbl> <chr> <dbl>
1 8 193 High 36 Low 15585.
2 10 167 High 20 Low 1913.
3 11 466 High 22 Low 1670.
4 14 132 High 13 Low 713.
5 15 435 High 20 Low 721.
6 19 96 Medium 31 Low 1797.
7 20 112 High 41 Low 7435.
8 24 230 High 20 Low 289.
9 39 231 High 28 Low 2894.
10 44 111 High 16 Low 449.
# ℹ 990 more rows
# ℹ 3 more variables: Monetary_Value_Level <chr>, Observation_Period <dbl>,
# Churn <dbl>
# Confirm the number of rows
nrow(filter(main_data, Churn == 1))
[1] 1000
# Keep rows for which Churn is 1 (with pipes)
main_data %>%
  filter(Churn == 1)
# A tibble: 1,000 × 9
ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
<dbl> <dbl> <chr> <dbl> <chr> <dbl>
1 8 193 High 36 Low 15585.
2 10 167 High 20 Low 1913.
3 11 466 High 22 Low 1670.
4 14 132 High 13 Low 713.
5 15 435 High 20 Low 721.
6 19 96 Medium 31 Low 1797.
7 20 112 High 41 Low 7435.
8 24 230 High 20 Low 289.
9 39 231 High 28 Low 2894.
10 44 111 High 16 Low 449.
# ℹ 990 more rows
# ℹ 3 more variables: Monetary_Value_Level <chr>, Observation_Period <dbl>,
# Churn <dbl>
# Confirm the number of rows
main_data %>%
  filter(Churn == 1) %>%
  nrow()
[1] 1000
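Although not shown above, filter() also accepts several conditions at once; conditions separated by commas must all hold for a row to be kept. A minimal sketch (the 100-day cut-off is arbitrary and chosen here only for illustration):
# Keep churned customers whose last purchase was more than 100 days ago
main_data %>%
  filter(Churn == 1, Recency > 100) %>%
  nrow()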
4.5.4 Function: slice()
We saw that we can remove entire rows by specifying a condition on a
specific column, using the function filter(). What if
we just want to keep a specified number of rows from the data set?
For example, what if we want to keep the first five or ten rows?
The function slice() is appropriate for cases like
this. For instance, we could use this function to preview a very
large data set or split a single data set without including any of
its rows more than once. Essentially, we slice (or cut) the data
set and keep a specified part of it. We just need to specify the
row numbers that we want to keep.
# Keep the first 5 rows (without pipes)
slice(main_data, 1:5)
# A tibble: 5 × 9
ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
<dbl> <dbl> <chr> <dbl> <chr> <dbl>
1 1 46 Low 26 Low 3010.
2 2 40 Low 56 Medium 57347.
3 3 35 Low 293 High 14496.
4 4 50 Low 18 Low 1416.
5 5 77 Medium 14 Low 524.
# ℹ 3 more variables: Monetary_Value_Level <chr>, Observation_Period <dbl>,
# Churn <dbl>
# Keep the first 5 rows (with pipes)
main_data %>%
  slice(1:5)
# A tibble: 5 × 9
ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
<dbl> <dbl> <chr> <dbl> <chr> <dbl>
1 1 46 Low 26 Low 3010.
2 2 40 Low 56 Medium 57347.
3 3 35 Low 293 High 14496.
4 4 50 Low 18 Low 1416.
5 5 77 Medium 14 Low 524.
# ℹ 3 more variables: Monetary_Value_Level <chr>, Observation_Period <dbl>,
# Churn <dbl>
The slice() function can be viewed as a more flexible
alternative to the head() function. With
head(), we specify the number of rows to keep
starting from the top of the dataset. In contrast,
slice() allows us to define both the starting and
ending indices of the rows we want to keep, giving us greater
control over the selection.
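To make this extra control concrete, a short sketch that keeps rows 11 to 15, a selection that head() alone cannot produce:
# Keep rows 11 to 15 (with pipes)
main_data %>%
  slice(11:15)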
4.5.5 Function:
mutate()
We have seen how we can remove columns and rows from a data set.
But what if we want to create a new column or transform the data
of an existing one? This is where the function
mutate() comes into play. For instance, suppose that,
in our previous data set, we want to express the time since a
customer's last purchase in months. We
have the column Recency, which shows how many days
have passed since a customer's last purchase. Assuming that every
month consists of 30 days, we can create a new column that shows
the recency in months and then only select this column to print
(with the function select() that we learnt
previously).
# Create a new column: Recency_months (without pipes)
head(select(mutate(main_data, Recency_months = Recency/30), Recency_months))
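The piped equivalent of the same chunk (a sketch) reads as follows:
# Create a new column: Recency_months (with pipes)
main_data %>%
  mutate(Recency_months = Recency / 30) %>%
  select(Recency_months) %>%
  head()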
This example also illustrates why pipes are much better when we
want to use several functions sequentially. The first code chunk
is hardly readable while the second is much clearer: we create a
new column (mutate()), then we select the
created column (select()) and then print the
first 6 rows of that column (head()).
Had we used the original name of the column (Recency)
instead of the name Recency_months, we would have
essentially transformed the existing column in place, as shown in the output below.
# A tibble: 6 × 9
ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
<dbl> <dbl> <chr> <dbl> <chr> <dbl>
1 1 1.53 Low 26 Low 3010.
2 2 1.33 Low 56 Medium 57347.
3 3 1.17 Low 293 High 14496.
4 4 1.67 Low 18 Low 1416.
5 5 2.57 Medium 14 Low 524.
6 6 1.83 Medium 39 Low 8830.
# ℹ 3 more variables: Monetary_Value_Level <chr>, Observation_Period <dbl>,
# Churn <dbl>
In combination with the mutate() function, we can use
the row_number() function to create a column that
shows the index of a row.
# Create a new column: Row_Number (without pipes)
head(mutate(main_data, Row_Number = row_number()))
# A tibble: 6 × 10
ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
<dbl> <dbl> <chr> <dbl> <chr> <dbl>
1 1 46 Low 26 Low 3010.
2 2 40 Low 56 Medium 57347.
3 3 35 Low 293 High 14496.
4 4 50 Low 18 Low 1416.
5 5 77 Medium 14 Low 524.
6 6 55 Medium 39 Low 8830.
# ℹ 4 more variables: Monetary_Value_Level <chr>, Observation_Period <dbl>,
# Churn <dbl>, Row_Number <int>
# Create a new column: Row_Number (with pipes)
main_data %>%
  mutate(Row_Number = row_number()) %>%
  head()
# A tibble: 6 × 10
ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
<dbl> <dbl> <chr> <dbl> <chr> <dbl>
1 1 46 Low 26 Low 3010.
2 2 40 Low 56 Medium 57347.
3 3 35 Low 293 High 14496.
4 4 50 Low 18 Low 1416.
5 5 77 Medium 14 Low 524.
6 6 55 Medium 39 Low 8830.
# ℹ 4 more variables: Monetary_Value_Level <chr>, Observation_Period <dbl>,
# Churn <dbl>, Row_Number <int>
In conclusion, we use the mutate() function to create
or modify a column. With this function, we can perform various
operations on the columns of a dataset, or even apply other
functions such as row_number().
4.5.6 Functions:
group_by(), summarize() and ungroup()
Aggregated data are summarizations of our raw data. For example,
how many customers exist within each category of recency level?
Or, what is the average recency for customers that churn in
comparison to customers that do not churn? Essentially, we
“summarize” our results - losing some case-specific information -
in order to be able to extract valuable conclusions for the whole
data set.
For starters, let’s see how many levels of recency we have:
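A one-line call that produces this output (a sketch, assuming the distinct() function discussed right after the output) is:
# Show the unique values of Recency_Level
main_data %>%
  distinct(Recency_Level)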
# A tibble: 3 × 1
Recency_Level
<chr>
1 Low
2 Medium
3 High
The function distinct() (which is equivalent to the
function unique() from base R) shows only the unique
values of the selected column(s). Had we merely selected this single
column with select(), the number of rows would have stayed the same;
using the function distinct(), we clearly identify the
unique levels above: “Low”, “Medium” and “High”.
Now, suppose we want to count how many customers we have by level
of recency. This is where the group_by() function
comes into play.
# Group by Recency Level
main_data %>%
  group_by(Recency_Level)
# A tibble: 4,000 × 9
# Groups: Recency_Level [3]
ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
<dbl> <dbl> <chr> <dbl> <chr> <dbl>
1 1 46 Low 26 Low 3010.
2 2 40 Low 56 Medium 57347.
3 3 35 Low 293 High 14496.
4 4 50 Low 18 Low 1416.
5 5 77 Medium 14 Low 524.
6 6 55 Medium 39 Low 8830.
7 7 48 Low 18 Low 1493.
8 8 193 High 36 Low 15585.
9 9 96 Medium 24 Low 1608.
10 10 167 High 20 Low 1913.
# ℹ 3,990 more rows
# ℹ 3 more variables: Monetary_Value_Level <chr>, Observation_Period <dbl>,
# Churn <dbl>
Notice that, by using the group_by() function, it
seems as if nothing changes - at first glance. This is simply the
case because, even though we specified the groups, we did not use
any other function to make changes; we just grouped the
rows. Yet, observing closely, we see that there is an extra
line that shows the name of the groups (with the square brackets
showing the number of the groups of the selected variable). The
group_by() function is very useful when we want to
summarize descriptive results for each category or group
separately. With the group_by() function, we can
create aggregated features using the
summarize() function, as shown below.
# Group by Recency_Level and summarize Number_of_Customers
main_data %>%
  group_by(Recency_Level) %>%
  summarize(Number_of_Customers = n())
# A tibble: 3 × 2
Recency_Level Number_of_Customers
<chr> <int>
1 High 1160
2 Low 1877
3 Medium 963
To count the data, we also used another function from the
dplyr package, the n() function, which
simply counts the number of observations (note that we do not
include any argument within the parentheses). Now, suppose we
wanted to calculate the average value of recency for the customers
that churn and compare it with the respective value for customers
that do not churn. In that case, we would follow a similar
approach but, instead of the function n(), we would
use the function mean() with the
Recency column inside the parentheses.
# Group by Churn and summarize Average_Recency
main_data %>%
  group_by(Churn) %>%
  summarize(Average_Recency = mean(Recency))
As seen, the average value of recency is much higher for customers
that churn than for customers that do not. We managed to create
these summary results with just a simple (and quite intuitive)
line of R code. In fact, we could even easily store outputs like
this to another object, in case we need them later in our
analysis.
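For instance, a sketch of such an assignment (the object name is chosen here only for illustration):
# Store the aggregated result in an object for later use
average_recency_by_churn <- main_data %>%
  group_by(Churn) %>%
  summarize(Average_Recency = mean(Recency))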
Regarding the outputs of the two mentioned examples, we see that,
in both cases, each output has 1 row per group (the number of
customers or the average recency per specified group). It is
important to note that we can give names of our choice to these
newly created (aggregated) columns.
In case we do not need to work with grouped data anymore, we use
the function ungroup(), without arguments. In the
example below, we use the function group_by() and then
ungroup() (in the same line of code) to see that we
revert to the original data set main_data (no groups
will appear above the names of the columns).
# Group by Churn and Ungroup
main_data %>%
  group_by(Churn) %>%
  ungroup()
# A tibble: 4,000 × 9
ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
<dbl> <dbl> <chr> <dbl> <chr> <dbl>
1 1 46 Low 26 Low 3010.
2 2 40 Low 56 Medium 57347.
3 3 35 Low 293 High 14496.
4 4 50 Low 18 Low 1416.
5 5 77 Medium 14 Low 524.
6 6 55 Medium 39 Low 8830.
7 7 48 Low 18 Low 1493.
8 8 193 High 36 Low 15585.
9 9 96 Medium 24 Low 1608.
10 10 167 High 20 Low 1913.
# ℹ 3,990 more rows
# ℹ 3 more variables: Monetary_Value_Level <chr>, Observation_Period <dbl>,
# Churn <dbl>
4.5.7 Function:
arrange()
A simple, yet very useful function is the
arrange() function. As the name suggests, this
function helps us arrange the order of rows in a data frame. For
example, we may want to review our data set in a descending order
based on Monetary_Value. In order to use this
function, we need to include in the parenthesis the name of the
column based on which we want to arrange our data. If we want the
order to be descending, we include the function
desc() inside the function arrange();
otherwise, an ascending ordering is applied.
To check the results of the arrange() function, let
us use the functions group_by() and
summarize() to calculate the average monetary value
per recency level. We then use the arrange() function
to arrange the order of the resulting rows.
# Group by Recency_Level
# Summarize Average_Monetary_Value
# Arrange on Average_Monetary_Value
main_data %>%
  group_by(Recency_Level) %>%
  summarize(Average_Monetary_Value = mean(Monetary_Value)) %>%
  arrange(Average_Monetary_Value)
# A tibble: 3 × 2
Recency_Level Average_Monetary_Value
<chr> <dbl>
1 High 3687.
2 Medium 6336.
3 Low 12409.
# Group by Recency_Level
# Summarize Average_Monetary_Value
# Arrange on Average_Monetary_Value (descending order)
main_data %>%
  group_by(Recency_Level) %>%
  summarize(Average_Monetary_Value = mean(Monetary_Value)) %>%
  arrange(desc(Average_Monetary_Value))
# A tibble: 3 × 2
Recency_Level Average_Monetary_Value
<chr> <dbl>
1 Low 12409.
2 Medium 6336.
3 High 3687.
The Monetary_Value column is numeric, so the
resulting order is straightforward. For character columns, such as
Recency_Level, the rows are ordered alphabetically.
To see this, let us rewrite the above code, using the column
Recency_Level to arrange the rows.
# Group by Recency_Level
# Summarize Average_Monetary_Value
# Arrange on Recency_Level
main_data %>%
  group_by(Recency_Level) %>%
  summarize(Average_Monetary_Value = mean(Monetary_Value)) %>%
  arrange(Recency_Level)
# A tibble: 3 × 2
Recency_Level Average_Monetary_Value
<chr> <dbl>
1 High 3687.
2 Low 12409.
3 Medium 6336.
# Group by Recency_Level
# Summarize Average_Monetary_Value
# Arrange on Recency_Level (descending order)
main_data %>%
  group_by(Recency_Level) %>%
  summarize(Average_Monetary_Value = mean(Monetary_Value)) %>%
  arrange(desc(Recency_Level))
# A tibble: 3 × 2
Recency_Level Average_Monetary_Value
<chr> <dbl>
1 Medium 6336.
2 Low 12409.
3 High 3687.
4.5.8 Functions:
select() and pull()
We have seen how we can use the function select() to
choose one or more columns to keep in our data set. Every time we
use this function, the new data set will always be a data
frame. This is important to remember because, even if we select a
single column, we will still have a data frame (which obviously
consists of a single column). Actually, we saw that this is the
case when we started the discussion about the aggregation
functions group_by() and summarize().
However, we sometimes need to keep this column in a vector form
(instead of a data frame form). This is when we use the
pull() function. Imagine that the name of the
pull() function comes from the fact that we want to
“pull” only one column from a data set (as a vector). Of course,
we need to include the name of the column that we want to “pull”
as an argument. To see how the function pull() works
and how this is different from the function select(),
let us extract and keep only the column Churn from
our main data set. We see in the console the different outputs
from using the two functions.
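A sketch of the two calls being compared (their console outputs are omitted here):
# Keep the Churn column as a one-column data frame
main_data %>%
  select(Churn) %>%
  head()

# Keep the Churn column as a vector
main_data %>%
  pull(Churn) %>%
  head()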
As shown, no column name is kept when we use the
pull() function; we get a vector, not a
one-column data frame, as is the case when we use the
select() function.
4.5.9 Functions:
mutate() with if_else() and case_when()
We have already seen how we can create a new column using the
function mutate(). We can use this function to create
a new column based on specific conditions using the
if_else() function. Let us use this function to
create a new column Churn_Label where, if the value
in the column Churn is 1, it takes the value
“customer churns” in the column Churn_Label;
otherwise it takes the value “customer does not churn”.
# Create a new column: Churn_Label with if_else()
main_data %>%
  mutate(Churn_Label = if_else(Churn == 1, "customer churns", "customer does not churn")) %>%
  select(Churn, Churn_Label)
# A tibble: 4,000 × 2
Churn Churn_Label
<dbl> <chr>
1 0 customer does not churn
2 0 customer does not churn
3 0 customer does not churn
4 0 customer does not churn
5 0 customer does not churn
6 0 customer does not churn
7 0 customer does not churn
8 1 customer churns
9 0 customer does not churn
10 1 customer churns
# ℹ 3,990 more rows
With the if_else() function we essentially create the
Churn_Label column whereby if the column
Churn has a value of 1, the corresponding value is
“customer churns”, otherwise the value is “customer does not
churn”. In this example, we also used the function
select() just to get a clear output of the two churn
columns. We see indeed that the values are aligned: if the value
in the column Churn is 1 then the value in the column
Churn_Label is “customer churns”, otherwise it is
“customer does not churn”.
But what if we need to combine multiple conditions and outputs?
Suppose we want to create a column
Value_Label_if_Customer_Churns which takes the
following values:
If the monetary value label is low and the customer churns, then
we want to get the value “low value”
If the monetary value label is medium and the customer churns,
then we want to get the value “medium value”
If the monetary value label is high and the customer churns,
then we want to get the value “high value”
Otherwise we get the value “customer does not churn”
Although we could nest multiple if_else()
calls, this would create messy code. For such cases, it is
much better to use the case_when() function. Inside
the parentheses of this function, we specify the condition (the
“if”), include the tilde sign (“~”) and then the result (the
“then”). With the help of commas, we can create different
conditions with different outcomes. Let us then create the column
that we just mentioned using the piece of code below:
# Create a new column: Value_Label_if_Customer_Churns with case_when()
main_data %>%
  mutate(Value_Label_if_Customer_Churns = case_when(
    Monetary_Value_Level == "Low" & Churn == 1 ~ "low value",
    Monetary_Value_Level == "Medium" & Churn == 1 ~ "medium value",
    Monetary_Value_Level == "High" & Churn == 1 ~ "high value",
    TRUE ~ "customer does not churn")) %>%
  select(Value_Label_if_Customer_Churns, Monetary_Value_Level, Churn)
# A tibble: 4,000 × 3
Value_Label_if_Customer_Churns Monetary_Value_Level Churn
<chr> <chr> <dbl>
1 customer does not churn Medium 0
2 customer does not churn High 0
3 customer does not churn High 0
4 customer does not churn Medium 0
5 customer does not churn Low 0
6 customer does not churn Medium 0
7 customer does not churn Medium 0
8 high value High 1
9 customer does not churn Medium 0
10 medium value Medium 1
# ℹ 3,990 more rows
We see how we can also use special characters, such as the
ampersand (&), to introduce simultaneous conditions (in our
example, we used the ampersand (&) to introduce an AND
condition). Additionally, note that, for the last condition, we do
not specify the value of any column, but we simply use the word
“TRUE”. In the case_when() function, this stands for
“if none of the other conditions have been met”; that is why it is
best to use it only as the last “catch-all” condition.
OR condition
In the last example, if we wanted to specify that if at least
one of the conditions is TRUE then the whole statement would
be TRUE (what is often called an “OR” condition), we would
need to use the vertical bar (|) instead of the ampersand
(&).
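A minimal sketch of such an OR condition (the new column name and labels are chosen here only for illustration):
# Flag customers that either churn or have a low monetary value level
main_data %>%
  mutate(At_Risk = if_else(Churn == 1 | Monetary_Value_Level == "Low",
                           "at risk", "not at risk")) %>%
  count(At_Risk)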
4.5.10 Functions:
group_by(), summarize() and count()
In the aggregation functions group_by() and
summarize(), we saw how we can calculate summary
statistics such as the number of customers within each recency
level. Although we used the function n() for this
purpose, we can do the exact same thing with the
count() function. With this function, we need to
include the name of the column of our interest as an argument of
the function. In the example below we see how both approaches lead
to the exact same results.
# Group by Recency_Level and summarize Number_of_Customers
main_data %>%
  group_by(Recency_Level) %>%
  summarize(n = n()) %>%
  ungroup()
# A tibble: 3 × 2
Recency_Level n
<chr> <int>
1 High 1160
2 Low 1877
3 Medium 963
# Count number of Recency_Level
main_data %>%
  count(Recency_Level)
# A tibble: 3 × 2
Recency_Level n
<chr> <int>
1 High 1160
2 Low 1877
3 Medium 963
The count() function is very handy when we just want
the number of times a group has appeared and nothing more (i.e.,
just the frequency); we see that the code is much simpler and
easier to read.
4.5.11 Function:
bind_cols()
The function bind_cols() allows us to put the columns
of one data frame next to another's, so as to form a larger
data frame. It is necessary, though, that the number of rows in each
data set is equal. With the function bind_cols(), we can bind as
many data frames as we want; we just include their names
in the parentheses, separated by commas. In
this example, we assign the columns Recency,
Recency_Level and Churn of our main data
set to 3 different data frames (one column in each data frame).
Then, we “bind” those columns together to form a new data frame.
# Create the 3 small data sets (one column each)
example_data_1 <- main_data %>% select(Recency)
example_data_2 <- main_data %>% select(Recency_Level)
example_data_3 <- main_data %>% select(Churn)

# Bind the 3 data frames together
example_data_1 %>%
  bind_cols(example_data_2, example_data_3) %>%
  head()
That was a simple example showing how the
bind_cols() function works. If we wanted to just keep
the mentioned columns, we could simply use the
select() function.
In the example above, the columns appear in the following order:
Recency, Recency_Level,
Churn. This is intentional, as the order of the
columns in the resulting data frame aligns with the order in which
they were specified in the bind_cols() function. The
columns from the data frame that is piped in (i.e., the one outside
the parentheses of bind_cols()) will always appear first (on
the left).
4.5.12 Function:
bind_rows()
Similar to the bind_cols(), the
bind_rows() function allows us to put the rows of one
data frame below another’s, so as to form (again) a larger data
frame. This time, the main condition is that the number of columns
in each data set is equal and that the names of the columns should
be the same between the different data frames. The intuition is
the same as with bind_cols(), but instead of binding
column(s), we bind (or append) rows. As an example, we assign the
first 9 rows of our main data set (with the
slice() function) to 3 different small data frames (3
rows per data frame). Then, we “bind” those rows together to form
a new data frame.
# Create the 3 small data sets
example_data_1 <- main_data %>% slice(1:3)
example_data_2 <- main_data %>% slice(4:6)
example_data_3 <- main_data %>% slice(7:9)

# Bind the 3 data frames together
example_data_1 %>%
  bind_rows(example_data_2, example_data_3)
# A tibble: 9 × 9
ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
<dbl> <dbl> <chr> <dbl> <chr> <dbl>
1 1 46 Low 26 Low 3010.
2 2 40 Low 56 Medium 57347.
3 3 35 Low 293 High 14496.
4 4 50 Low 18 Low 1416.
5 5 77 Medium 14 Low 524.
6 6 55 Medium 39 Low 8830.
7 7 48 Low 18 Low 1493.
8 8 193 High 36 Low 15585.
9 9 96 Medium 24 Low 1608.
# ℹ 3 more variables: Monetary_Value_Level <chr>, Observation_Period <dbl>,
# Churn <dbl>
4.5.13 Functions:
pivot_wider() & pivot_longer()
The pivot_wider() function makes a table “wider”, so
that we have more columns and fewer rows, while the
pivot_longer() function makes a table “longer”, so
that we have fewer columns and more rows. It is much easier to
understand these functions with a simple example. Suppose we want
to check how many customers churn (or do not churn) within each
level of monetary value. For this, let’s create a new data frame
(called summarized_data) with the techniques we have
already discussed.
# Create Data Frame for Monetary Value Level per Customer Churn Label
summarized_data <- main_data %>%
  count(Churn, Monetary_Value_Level) %>%
  rename(Number = n)

# Print the results
summarized_data
# A tibble: 6 × 3
Churn Monetary_Value_Level Number
<dbl> <chr> <int>
1 0 High 811
2 0 Low 235
3 0 Medium 1954
4 1 High 85
5 1 Low 347
6 1 Medium 568
We now have the two columns that we included in the parentheses
(Churn and Monetary_Value_Level) plus
the column Number, which shows the number of
observations within each category. Notice how the values of 0 and
1 appear twice in the Churn column. This happens
because each value corresponds to two different values of the
Monetary_Value_Level column.
What if, instead of one column for monetary value levels, we had a
separate column for each level? This is where we use the
pivot_wider() function as we want to make our data
frame have more columns and fewer rows. Besides the data frame,
pivot_wider() requires two key arguments: the
names_from and values_from. In the first
argument, we put the column whose values will be used to create
the different new columns. The second argument specifies the
values that will populate these new columns. In our example, we
set these arguments to Monetary_Value_Level and
Number. The output is the following:
# Create a new data frame with pivot_wider()
wider_summarized_data <- summarized_data %>%
  pivot_wider(names_from = Monetary_Value_Level, values_from = Number)

# Print results
wider_summarized_data
# A tibble: 2 × 4
Churn High Low Medium
<dbl> <int> <int> <int>
1 0 811 235 1954
2 1 85 347 568
Now, each monetary value level has its own column, with values
taken from the column Number, which does not appear
anymore. This transformation reduces rows and increases columns,
thus making the table ‘wider’.
The pivot_longer() function can be seen as the
inverse of the pivot_wider() function: it
typically takes multiple columns and reshapes them into two
columns: one column for the variable names (with the
names_to argument) and another for the corresponding
values (with the values_to argument). To specify the
columns to transform, we list their names in a vector in the
cols argument.
In the following example, we use the function
pivot_longer() on the data frame
wider_summarized_data that we created above.
# Create a new data frame with pivot_longer()
longer_summarized_data <- wider_summarized_data %>%
  pivot_longer(cols = c(High, Low, Medium),
               names_to = "Levels", values_to = "Number")

# Print results
longer_summarized_data
# A tibble: 6 × 3
Churn Levels Number
<dbl> <chr> <int>
1 0 High 811
2 0 Low 235
3 0 Medium 1954
4 1 High 85
5 1 Low 347
6 1 Medium 568
Notice how the column names in names_to and
values_to must be given in double quotes (" "), but those in
cols do not need them.
We see that the final output is essentially the same as the data frame
summarized_data that we initially created; only the name of the
column holding the levels differs (Levels instead of Monetary_Value_Level).