Data Manipulation

Introduction

In this chapter, we will discuss various functions and methods used to handle data. More specifically, we introduce the NumPy package and discuss in much more detail the Pandas package.

From Data Imports to Data Manipulation

Generally, after we import the data, we want to transform them in such a way that it becomes easy to analyze, a process often called data manipulation. In the real world, datasets are usually not in the format we would like them to be. Even when data are orderly stored in databases, they are typically not structured with our specific analysis in mind. For example, do we want to analyze sales per country for a specific calendar year or all calendar years of sales for a specific country? In this example, it is up to us to decide on the best approach, depending on our goal. This illustrates that we need to be able to bring an imported dataset into an appropriate format to facilitate our analysis.

Data Manipulation with NumPy

Perhaps the two of the most important packages for data manipulation in the Python data science ecosystem are NumPy and Pandas.

The NumPy package (short for Numerical Python) provides efficient tools for working with numerical data. In particular, it introduces the NumPy array, which is similar to a Python list but designed for fast mathematical computation. NumPy arrays support vectorized operations, meaning that calculations can be applied to an entire collection of values at once, without needing loops. This makes NumPy extremely important for data science, statistics, machine learning, and scientific computing. Many other Python libraries (including Pandas) are built on top of NumPy.

To begin, let us create a regular Python list:

import numpy as np

our_list = [1, 4, 5, 6, 6, 6, 6, 4100, 100, 100]

A regular Python list is flexible, but it is not optimized for mathematical calculations. For example, multiplying a list by a number repeats the list rather than multiplying each value:

# Multiplying our_list by 2
our_list * 2
[1, 4, 5, 6, 6, 6, 6, 4100, 100, 100, 1, 4, 5, 6, 6, 6, 6, 4100, 100, 100]

To perform mathematical operations element by element, we first convert the list into a NumPy array. Once transformed into an array, calculations are automatically applied to every element, something that is known as vectorization. In the code below, each value in the array is now multiplied by 2 individually:

# Vectorized multiplication
np.array(our_list) * 2
array([2, 8, 10, 12, 12, 12, 12, 8200, 200, 200])

We can also perform calculations between arrays of the same length. In the example below, each element of the first array is added to the corresponding element of the second array:

# Element-wise addition
np.array(our_list) + np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
array([2, 6, 8, 10, 11, 12, 13, 4108, 109, 110])

Just like Python lists, NumPy arrays support indexing and slicing. For instance, we can select the first 2 elements of a NumPy array in the following way:

# Selecting the first two elements
np.array(our_list)[0:2]
array([1, 4])

Like standard Python lists, NumPy arrays can also contain multiple dimensions. For example, a double list can be transformed into a two-dimensional array:

# Creating a double list
double_list = [[1, 2, 3], [4, 5, 6]]

# Transforming double_list to a numpy array
np.array(double_list)
array([[1, 2, 3],
       [4, 5, 6]])

This creates a matrix-like structure with two rows (number of lists) and three columns (number of elements of each list).

We can also use NumPy to apply conditions element by element. For example, suppose we want to classify values in a list as "Large" or "Small", depending on whether each value is above 5. To do this, we can use the function np.where():

# Applying condition
np.where(np.array(our_list) > 5, "large", "small")
array(['small', 'small', 'small', 'large', 'large', 'large',
       'large', 'large', 'large', 'large'], dtype='<U5')

Lastly, NumPy can also be used to calculate summary statistics such as the mean or sum. Although Python already provides built-in functions such as sum(), NumPy functions are specifically optimized for numerical data and work efficiently with NumPy arrays. In practice, NumPy operations are generally faster and more convenient when working with large numerical datasets or performing mathematical computations across arrays.

# Calculating sum with standard Python
sum(our_list)
4334
# Calculating sum with numpy
np.sum(np.array(our_list))
np.int64(4334)

In both cases, the output is the same (4334). The main difference is not the result itself, but how the calculation is performed internally. The built-in sum() function works with general Python objects such as lists, while np.sum() is optimized for NumPy arrays and numerical computations. This becomes especially useful when working with large datasets or multidimensional arrays, where NumPy operations are generally faster and more memory-efficient. Note that the output of np.sum() has a slightly different format because it returns a NumPy numerical type (such as numpy.int64) instead of a standard built-in Python integer.

Data Manipulation with Pandas

Even though NumPy provides many additional capabilities, we will focus mostly on using Pandas (short for Panel Data or Python Data Analysis), whose primary focus is data manipulation with data frames. A data frame is a rectangular table of data organized into rows and columns, similar to a spreadsheet. While NumPy is optimized for fast numerical computation, Pandas focuses on practical data analysis tasks. In most real-world data science projects, Pandas is the primary tool used for manipulating datasets, while NumPy operates behind the scenes to provide efficient numerical operations.

To learn how to use Pandas, it makes sense to have a specific dataset throughout this chapter. This dataset will be the same as the one we imported (using the read_csv() function from the Pandas package) in Chapter Importing Data: the customer_churn dataset.

Our dataset has information for 4,000 customers, out of which 1,000 churn, meaning that these customers are known to have stopped purchasing at a given point in time. The columns Recency and Recency_Level show information about how recently a customer made his/her last purchase. Moreover, The columns Frequency and Frequency_Level show us how many purchases (in total) a customer has made in the past. Also, the columns Monetary_Value and Monetary_Value_Level show how much money the company received from a customer. According to academic marketing literature (see, for example, Blattberg, Kim, & Neslin, 2008; Fader & Hardie, 2009), these general variables—known by their acronym RFM—can be used as strong indicators of whether a customer will churn in the future. We can imagine that the more purchases take place recently and the higher their value, the higher the probability that a customer will not churn. Lastly, the column Observation_Period shows how long a customer is known to have been a customer with the company.

We load the Pandas package and the respective dataset (assigning it to an object called customer_churn) using the code below:

import pandas as pd

# Importing customer_churn
customer_churn = pd.read_csv("https://raw.githubusercontent.com/Datakortex/Datasets/refs/heads/main/customer_churn.csv")

# Printing the first few rows
customer_churn.head()
ID Recency Recency_Level ... Monetary_Value_Level Observation_Period Churn
0 1 46 Low ... Medium 742 0
1 2 40 Low ... High 2301 0
2 3 35 Low ... High 2411 0
3 4 50 Low ... Medium 813 0
4 5 77 Medium ... Low 1 0

5 rows × 9 columns

As mentioned, the imported dataset has 4,000 observations (as rows) and 9 variables (as columns). Because it is really helpful to have a brief description of each variable, the contents of each column are described below:

  • ID: customer ID (unique number)

  • Recency: number of days that have passed since last purchase

  • Recency_Level: the level of recency (low, medium or high)

  • Frequency: number of purchases

  • Frequency_Level: the level of frequency (low, medium or high)

  • Monetary_Value: total monetary value (e.g. total number of euros)

  • Monetary_Value_Level: the level of monetary value (low, medium or high)

  • Observation_Period: number of days between the first and the last purchase

  • Churn: whether the customer churns or not (a value of 1 implies that the customer churns)

Evidently, the data imported are structured as a Pandas DataFrame. Each row is an observation (a customer) while each column is a different variable (characteristic).

Output Display

The first few rows of the DataFrame are shown as a formatted table rather than as text within the code output. This is a consequence of how the Python environment renders tabular objects. Depending on the software you use to run the code, the appearance and placement of the output may vary. These variations are purely cosmetic and do not affect the results.

The Use of Method Chaining: .

The main characteristics with Pandas is that we can manipulate a DataFrame in steps. For example, in customer_churn we may want firstly to keep a particular group of customers and then calculate the average recency. Such a process would require two distinct steps: first, filtering the rows that satisfy a condition and, second, applying a calculation on the resulting subset of data. With Pandas, we apply method chaining, which relies on the dot (.) operator. This allows us to take the output of one operation and immediately apply another operation to it, creating a sequence (or “chain”) of transformations. Put differently, the dot lets us move step by step through a data object by calling methods directly on it.

To illustrate this concept, let’s use a very simple example. So far, we have used the head() method which typically returns the first 5 rows of a DataFrame. In this method, we can also specify the number of rows inside the parentheses. For instance, the code below returns the first ten rows:

# Printing the first ten rows
customer_churn.head(10)
ID Recency Recency_Level ... Monetary_Value_Level Observation_Period Churn
0 1 46 Low ... Medium 742 0
1 2 40 Low ... High 2301 0
2 3 35 Low ... High 2411 0
3 4 50 Low ... Medium 813 0
4 5 77 Medium ... Low 1 0
5 6 55 Medium ... Medium 2077 0
6 7 48 Low ... Medium 1317 0
7 8 193 High ... High 1612 1
8 9 96 Medium ... Medium 1037 0
9 10 167 High ... Medium 1373 1

10 rows × 9 columns

Now, to illustrate the idea of method chaining, we can apply another head() method directly after the first one:

# Printing the first few rows
customer_churn.head(10).head()
ID Recency Recency_Level ... Monetary_Value_Level Observation_Period Churn
0 1 46 Low ... Medium 742 0
1 2 40 Low ... High 2301 0
2 3 35 Low ... High 2411 0
3 4 50 Low ... Medium 813 0
4 5 77 Medium ... Low 1 0

5 rows × 9 columns

As expected, the first five rows are printed. This happens because the first operation (head(10)) produces a new DataFrame containing only the first ten rows. The second operation (head()) is then applied to this intermediate result, extracting the first five rows from that subset.

In this way, each dot represents a step in a sequence of transformations: the output of one operation becomes the input to the next. This allows us to build data processing workflows step by step, making code more structured, readable, and easy to extend.

Exploring Operations for Data Manipulation

For the rest of this chapter, we see how we can manipulate the DataFrame customer_churn using Pandas. The methods info() and describe() display dataset structure and summary statistics, respectively. In this way, we can quickly build an initial understanding of the data we are working with.

# Printing structure
customer_churn.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   ID                    4000 non-null   int64  
 1   Recency               4000 non-null   int64  
 2   Recency_Level         4000 non-null   object 
 3   Frequency             4000 non-null   int64  
 4   Frequency_Level       4000 non-null   object 
 5   Monetary_Value        4000 non-null   float64
 6   Monetary_Value_Level  4000 non-null   object 
 7   Observation_Period    4000 non-null   int64  
 8   Churn                 4000 non-null   int64  
dtypes: float64(1), int64(5), object(3)
memory usage: 281.4+ KB
# Printing summary statistics
customer_churn.describe()
ID Recency Frequency Monetary_Value Observation_Period Churn
count 4000.00 4000.00 4000.00 4000.00 4000.00 4000.00
mean 2000.50 102.06 37.82 8417.93 1529.73 0.25
std 1154.84 101.46 30.55 12498.18 825.79 0.43
min 1.00 30.00 12.00 109.00 1.00 0.00
25% 1000.75 38.00 22.00 1779.81 852.75 0.00
50% 2000.50 54.00 31.00 4209.97 1803.00 0.00
75% 3000.25 129.00 45.00 9163.19 2288.00 0.25
max 4000.00 519.00 860.00 99011.17 2435.00 1.00

Each column is assigned a specific data type. For example, int64 indicates a numeric (integer) variable, while object typically represents categorical or text data. Additionally, the summary statistics include measures such as the number of observations (count), the average value (mean), and the standard deviation (std). These values provide a quick overview of the distribution of each numeric variable. Since the variable (column) Churn only takes the values 0 and 1, it can be interpreted as a binary indicator. In this case, the mean of Churn represents the proportion of customers who churned. Therefore, a mean of 0.25 implies that 25% of customers in the dataset have churned.

Selecting columns:
To select specific columns in a Pandas DataFrame, we use square brackets ([]) with the column name(s). This allows us to extract either a single column or multiple columns from the dataset.

customer_churn["Recency"]
0        46
1        40
2        35
3        50
4        77
       ... 
3995    189
3996    152
3997    359
3998     69
3999     31
Name: Recency, Length: 4000, dtype: int64

Notice that the resulting output is a Pandas Series, even though it may look like a single-column DataFrame. When we select a single column using a single square brackets, the result is a Series, a one-dimensional object that represents a single variable, whereas a DataFrame is two-dimensional and can contain multiple variables. Although both may appear similar when printed, they are fundamentally different data structures and are handled differently in Pandas.

This distinction matters because certain operations behave differently depending on whether we are working with a Series or a DataFrame. For example, a Series is often used when we want to work with a single variable directly (such as computing statistics, applying functions, or passing data into models), while a DataFrame is preferred when we need to preserve tabular structure or work with multiple columns simultaneously.

To keep the DataFrame structure, we need to use double square brackets ([[]]):

# Selecting Recency
customer_churn[["Recency"]]
Recency
0 46
1 40
2 35
3 50
4 77
... ...
3995 189
3996 152
3997 359
3998 69
3999 31

4000 rows × 1 columns

Now, we have a DataFrame with only the Recency column. In the same way, we can select multiple columns by passing a list of column names inside the brackets:

# Selecting Recency, Frequency, Monetary_Value and Churn
customer_churn[["Recency", "Frequency", "Monetary_Value", "Churn"]]
Recency Frequency Monetary_Value Churn
0 46 26 3009.60 0
1 40 56 57347.28 0
2 35 293 14496.16 0
3 50 18 1416.20 0
4 77 14 523.72 0
... ... ... ... ...
3995 189 17 683.18 1
3996 152 23 3205.66 0
3997 359 17 251.00 1
3998 69 43 9438.84 1
3999 31 45 5822.22 0

4000 rows × 4 columns

This returns a new DataFrame containing only the selected columns, allowing us to focus on specific variables of interest while keeping the data in a structured format.

Dropping columns:
Instead of selecting the columns we want to keep, it is sometimes easier to remove the columns we do not need. In Pandas, the method drop() is used for this purpose. To delete columns, we use the columns argument and provide a list containing the names of the columns we want to remove. For instance, suppose that we want a DataFrame containing only the columns Recency, Frequency, Monetary_Value, and Churn. Rather than selecting these columns directly, we can instead drop all the remaining columns:

# Dropping all other columns
customer_churn.drop(columns = ["ID", 
                               "Recency_Level",
                               "Frequency_Level", 
                               "Monetary_Value_Level",
                               "Observation_Period"])
Recency Frequency Monetary_Value Churn
0 46 26 3009.60 0
1 40 56 57347.28 0
2 35 293 14496.16 0
3 50 18 1416.20 0
4 77 14 523.72 0
... ... ... ... ...
3995 189 17 683.18 1
3996 152 23 3205.66 0
3997 359 17 251.00 1
3998 69 43 9438.84 1
3999 31 45 5822.22 0

4000 rows × 4 columns

Renaming columns:
Another easy yet useful function is the method rename(), which is used to change the names of columns. To do this, we provide a dictionary (see Chapter Python Basics) inside the columns argument, where the keys represent the original column names and the values represent the new names we want to assign. As an example, the code below changes the name of the column Frequency to Number_of_Purchases:

# Renaming column Frequency to Number_of_Purchases
customer_churn.rename(columns = {"Frequency": "Number_of_Purchases"})[["ID", "Number_of_Purchases", "Churn"]].head()
ID Number_of_Purchases Churn
0 1 26 0
1 2 56 0
2 3 293 0
3 4 18 0
4 5 14 0

Note that the structure of the data remains the same; only the column label is updated.

Sorting values:
A simple, yet very useful method is sort_values(). As the name suggests, this method helps us sort the order of rows in a DataFrame based on the values of one or more specified columns. For example, we may want to review our dataset in a descending order based on Monetary_Value. In order to use this method, we need to include in the parenthesis the name of the column based on which we want to sort our data. We also need to specify the argument ascending if we prefer an ascending or descending order.

# Sorting on Monetary_Value
customer_churn.sort_values("Monetary_Value", ascending = True)[["ID", "Monetary_Value", "Churn"]].head()
ID Monetary_Value Churn
540 541 109.00 0
3957 3958 115.00 1
400 401 118.00 1
832 833 124.00 1
1482 1483 134.98 0

To include more than one variables, we just need to include them in a list. In this case though, we also need to specify the ascending argument—True or False—also with a list:

# Sorting on Frequency and Monetary_Value
customer_churn.sort_values(["Frequency", "Monetary_Value"], ascending = [True, False])[["ID", "Frequency", "Monetary_Value", "Churn"]].head()
ID Frequency Monetary_Value Churn
418 419 12 789.08 1
2876 2877 12 670.94 1
1501 1502 12 535.52 0
1108 1109 12 496.08 1
221 222 12 490.35 1

Filtering rows:
Often, we want to remove certain rows from a dataset. To do this, we rely on conditional expressions (see Chapter Python Basics), which allow us to create logical statements that evaluate to either True or False for each row. As an example, suppose we want to keep only those customers with recency of less than 60 days. We start by using the following expression:

# Checking which rows satisfy the condition
customer_churn["Recency"] < 60
0        True
1        True
2        True
3        True
4       False
        ...  
3995    False
3996    False
3997    False
3998    False
3999     True
Name: Recency, Length: 4000, dtype: bool

This expression creates a Pandas Boolean Series, where each entry is either True or False depending on whether the condition is met. In this case, each row is checked to see whether the value in the Recency column is less than 60. We can then use this Boolean Series to filter the original DataFrame. When we place the condition inside the square brackets of the DataFrame, Pandas keeps only the rows where the condition is True:

# Filtering rows based on condition
customer_churn[customer_churn["Recency"] < 60].head()
ID Recency Recency_Level ... Monetary_Value_Level Observation_Period Churn
0 1 46 Low ... Medium 742 0
1 2 40 Low ... High 2301 0
2 3 35 Low ... High 2411 0
3 4 50 Low ... Medium 813 0
5 6 55 Medium ... Medium 2077 0

5 rows × 9 columns

This operation effectively subsets the DataFrame, returning only the observations that satisfy the condition. In this way, conditional expressions allow us to flexibly remove or select rows based on any rule defined on the data.

We can extend conditional expressions by combining multiple conditions. To do this, we place each condition inside parentheses and connect them using logical operators such as &:

# Filtering rows based on multiple condition
customer_churn[(customer_churn["Recency"] == 30) & (customer_churn["Frequency_Level"] == "High")][["ID", "Recency", "Frequency_Level", "Churn"]].head()
ID Recency Frequency_Level Churn
71 72 30 High 0
3005 3006 30 High 0

In this example, we keep only the customers who satisfy both conditions simultaneously: customers with a recency value of 30 days and a frequency level equal to "High".

When working with variables that contain many categories, writing multiple equality conditions is not efficient. In such cases, the method isin() provides a cleaner alternative. This method checks whether each value belongs to a specified list of categories. The code below illustrates this idea:

# Filtering rows based on multiple condition
customer_churn[(customer_churn["Recency"] == 30) & (customer_churn["Monetary_Value_Level"].isin(["Medium", "High"]))][["ID", "Recency", "Frequency_Level", "Churn"]].head()
ID Recency Frequency_Level Churn
71 72 30 High 0
1883 1884 30 Low 0
2309 2310 30 Low 0
2349 2350 30 Low 0
3005 3006 30 High 0

Here, we keep customers with a recency value of 30 days and with a monetary value level that is either "Medium" or "High".

We can also select rows and columns using the special Pandas indexers loc[] and iloc[]. These provide a more explicit and flexible way to subset DataFrames. The loc[] indexer selects rows and columns based on labels (that is, column names and row labels). Its general structure is:

# General structure of .loc[]
dataframe.loc[row_selection, column_selection]

For example, suppose we want to keep only the rows where Recency is below 60 and display only the variables Recency, Frequency_Level, and Monetary_Value. We can write:

# Filtering rows and selecting columns with .loc[]
customer_churn.loc[
    customer_churn["Recency"] < 60,
    ["Recency", "Frequency_Level", "Monetary_Value"]
    ].head()
Recency Frequency_Level Monetary_Value
0 46 Low 3009.60
1 40 Medium 57347.28
2 35 High 14496.16
3 50 Low 1416.20
5 55 Low 8830.32

Here, the first part inside loc[] specifies the rows we want to keep using a Boolean condition, while the second part specifies the columns to display. This is, of course, just an alternative way of selecting rows and columns. We could instead perform the row filtering and column selection separately; the output would be exactly the same.

# Filtering rows and selecting columns without .loc[]
customer_churn[customer_churn["Recency"] < 60][["Recency", "Frequency_Level", "Monetary_Value"]].head()
Recency Frequency_Level Monetary_Value
0 46 Low 3009.60
1 40 Medium 57347.28
2 35 High 14496.16
3 50 Low 1416.20
5 55 Low 8830.32

The iloc[] indexer works differently because it selects rows and columns based on their integer positions rather than labels. Its structure is similar:

# General structure of .iloc[]
dataframe.iloc[row_positions, column_positions]

For example, suppose we want to display the first five rows and the first three columns of the dataset:

# Selecting rows and columns by position with .iloc[]
customer_churn.iloc[0:5, 0:3]
ID Recency Recency_Level
0 1 46 Low
1 2 40 Low
2 3 35 Low
3 4 50 Low
4 5 77 Medium

In this example, 0:5 selects rows from position 0 up to (but not including) position 5, while 0:3 selects the first three columns.

So, the important distinction between loc[] and iloc[] is that loc[] uses labels and includes the ending label when slicing, whereas iloc[] uses integer positions and excludes the ending position, just like standard Python slicing.

Creating new columns:
We have seen how we can remove columns and rows from a dataset. But what if we want to create a new column or transform the values of an existing one? In Pandas, this can be achieved by assigning values directly to a column name. The vectorization property we introduced earlier with NumPy also applies to Pandas. This means that operations are automatically performed element by element across an entire column, without requiring loops. Assuming that a month consists of 30 days, we can create a new variable called Recency_Months in the following way:

# Creating column Recency_Months
customer_churn["Recency_Months"] = customer_churn["Recency"] / 30

# Printing the first few rows
customer_churn.head()
ID Recency Recency_Level ... Observation_Period Churn Recency_Months
0 1 46 Low ... 742 0 1.53
1 2 40 Low ... 2301 0 1.33
2 3 35 Low ... 2411 0 1.17
3 4 50 Low ... 813 0 1.67
4 5 77 Medium ... 1 0 2.57

5 rows × 10 columns

A new column is now added to the DataFrame containing the transformed values. Each value in the Recency column is divided by 30, and the result is stored in the newly created column Recency_Months.

We can also create new variables by combining multiple columns. For example, suppose we want to calculate the average monetary value per purchase. We can divide the Monetary_Value column by the Frequency column:

# Creating column Average_Monetary_Value
customer_churn["Average_Monetary_Value"] = customer_churn["Monetary_Value"] / customer_churn["Frequency"]

# Printing the first few rows
customer_churn[["ID", "Monetary_Value", "Frequency", "Average_Monetary_Value"]].head()
ID Monetary_Value Frequency Average_Monetary_Value
0 1 3009.60 26 115.75
1 2 57347.28 56 1024.06
2 3 14496.16 293 49.47
3 4 1416.20 18 78.68
4 5 523.72 14 37.41

Essentially, each value in Monetary_Value is divided by the corresponding value in Frequency. Since Pandas applies operations element-wise, the calculation is automatically performed row by row across the entire DataFrame.

Additionally, we can create a new column based on multiple conditions. For example, suppose we want to classify customers as "Inactive", "Active", or "Very Active" depending on how recently they made a purchase. Instead of writing multiple nested conditions with np.where(), we can use a more structured approach similar to CASE WHEN in the programming language SQL, which can be implemented in Pandas using case_when():

# Creating column Customer_Status using case_when
customer_churn["Customer_Status"] = customer_churn["Recency"].case_when([
  (customer_churn["Recency"] > 180, "Inactive"),
  (customer_churn["Recency"].between(90, 180), "Active"),
  (customer_churn["Recency"] < 90, "Very Active")
  ])

# Printing the first few rows
customer_churn[["ID", "Recency", "Customer_Status"]].head()
ID Recency Customer_Status
0 1 46 Very Active
1 2 40 Very Active
2 3 35 Very Active
3 4 50 Very Active
4 5 77 Very Active

In the code above, each condition is evaluated row by row, and the first condition that is satisfied determines the value assigned to the new column. This means that for every customer, Pandas checks the conditions in order (starting from "Inactive", then "Active", and finally "Very Active"), and once a match is found, the corresponding label is assigned. This makes the logic both sequential and transparent, and it closely mirrors how a CASE WHEN statement works in SQL. As a result, the code is easier to read and particularly useful when working with multiple categories or more complex classification rules.

Besides creating transformed variables, we can also create columns based on the row positions of the DataFrame. The method reset_index() generates a new column containing the row numbers (index values):

# Dropping previous created columns
customer_churn = customer_churn.drop(
  columns = ["Recency_Months",
             "Average_Monetary_Value",
             "Customer_Status"]
             )

# Creating an index column
customer_churn.reset_index()
index ID Recency ... Monetary_Value_Level Observation_Period Churn
0 0 1 46 ... Medium 742 0
1 1 2 40 ... High 2301 0
2 2 3 35 ... High 2411 0
3 3 4 50 ... Medium 813 0
4 4 5 77 ... Low 1 0
... ... ... ... ... ... ... ...
3995 3995 3996 189 ... Low 264 1
3996 3996 3997 152 ... Medium 1675 0
3997 3997 3998 359 ... Low 1 1
3998 3998 3999 69 ... Medium 2234 1
3999 3999 4000 31 ... Medium 2392 0

4000 rows × 10 columns

This adds the current index as a regular column in the DataFrame while simultaneously resetting the index itself. These values were already visible on the left side of the DataFrame as the index, but they were not part of the actual data structure. Essentially, reset_index() converts this index into a regular column so that it becomes an explicit variable in the DataFrame and can be used like any other feature in the dataset.

Aggregating results:
Aggregated data are summarizations of our raw data. For example, how many customers exist within each category of recency level? Or, what is the average recency for customers that churn in comparison to customers that do not churn? Essentially, we “summarize” our results—loose some case-specific information—in order to be able to extract valuable conclusions for the whole set.

The simplest way to summarize a variable is to apply an aggregation method directly after selecting a column. For instance, we can calculate the average value of the variable Frequency using the method mean():

# Calculating the average frequency
customer_churn["Frequency"].mean()

# Calculating the average frequency while preserving DataFrame structure
customer_churn[["Frequency"]].mean()
Frequency    37.82
dtype: float64

Both approaches compute the same average value. However, the first expression returns a single numeric result because the selected column is treated as a Pandas Series. The second expression preserves the DataFrame structure and therefore returns a Pandas Series containing the column name together with its corresponding mean value. In practice, the first approach is often simpler when working with a single variable, while the second approach becomes useful when combining multiple aggregations or working consistently with DataFrame objects.

A more advanced way to aggregate data is to summarize it by group. In Pandas, this is achieved using the method groupby(), inside of which we provide one or more variables that define the groups we want to create.

For instance, suppose that we want to calculate the average monetary value for each combination of Recency_Level and Frequency_Level. In this case, Pandas first splits the data into groups based on these categories and then applies the selected calculation to each group separately:

# Calculating average monetary value by groups
customer_churn.groupby(["Recency_Level", "Frequency_Level"])["Monetary_Value"].mean()
Recency_Level  Frequency_Level
High           High               32589.66
               Low                 3191.59
               Medium             21914.37
Low            High               37084.83
               Low                 6261.37
               Medium             23735.66
Medium         High               51118.21
               Low                 4897.72
               Medium             20290.33
Name: Monetary_Value, dtype: float64

Here, the DataFrame is grouped according to all combinations of Recency_Level and Frequency_Level, and the mean() method is applied to the Monetary_Value column within each group.

The resulting output is not a regular DataFrame but a Pandas Series with a hierarchical index (also called a MultiIndex). The grouping variables (Recency_Level and Frequency_Level) become the index levels, while the calculated means are stored as the values of the Series. This structure is useful for many analytical tasks, but it can sometimes be less convenient for further data manipulation, filtering, or exporting results.

To convert the output into a standard DataFrame with ordinary columns, we can use the reset_index() method. This moves the grouping variables from the index back into regular columns and returns a clean tabular structure:

# Calculating average monetary value by groups
customer_churn.groupby(["Recency_Level", "Frequency_Level"])["Monetary_Value"].mean().reset_index()
Recency_Level Frequency_Level Monetary_Value
0 High High 32589.66
1 High Low 3191.59
2 High Medium 21914.37
3 Low High 37084.83
4 Low Low 6261.37
5 Low Medium 23735.66
6 Medium High 51118.21
7 Medium Low 4897.72
8 Medium Medium 20290.33

Additionally, the method agg() can be used when we want to apply one or multiple aggregation calculations to the grouped data. Let’s add the calculation of median value along with the mean:

# Calculating average and median monetary value by groups
customer_churn.groupby(["Recency_Level", "Frequency_Level"])["Monetary_Value"].agg(["mean", "median"])
mean median
Recency_Level Frequency_Level
High High 32589.66 38336.25
Low 3191.59 1716.67
Medium 21914.37 19691.79
Low High 37084.83 37121.19
Low 6261.37 4666.78
Medium 23735.66 17398.98
Medium High 51118.21 61814.51
Low 4897.72 3476.98
Medium 20290.33 15396.38

Counting values:
Even though we just discussed how to perform calculations such as averages and sums by groups, one of the most common and simplest aggregation tasks is simply counting the number of observations within each category. In Pandas, this can be done using the method value_counts():

# Counting observations by category
customer_churn["Recency_Level"].value_counts(sort = True)
Recency_Level
Low       1877
High      1160
Medium     963
Name: count, dtype: int64

This method counts how many times each category appears in the selected column. By default, the argument sort = True sorts the categories from the most frequent to the least frequent.

In many situations, absolute counts are useful, but proportions can be even more informative. For this reason, value_counts() also includes the argument normalize, which converts the counts into relative frequencies (proportions) when set to True.

# Counting observations by category
customer_churn["Recency_Level"].value_counts(sort = True, normalize = True)
Recency_Level
Low       0.47
High      0.29
Medium    0.24
Name: proportion, dtype: float64

Because we set the argument normalize equal to True, the counts are divided by the total number of observations. As a result, the output represents proportions rather than raw counts. For example, a value of 0.35 means that 35% of the observations belong to that category. This is especially useful when comparing category distributions across datasets of different sizes or when interpreting results in percentage terms.

Pivot tables:
Pivot tables provide an excellent way to aggregate data in a structured and highly flexible format. While we have already seen how to compute summaries using methods such as groupby() and agg(), pivot tables offer a more intuitive, spreadsheet-like alternative for producing grouped summaries. They are especially useful when we want to reorganize data into a clear tabular structure with groups displayed as rows (and optionally columns), making results easier to interpret and compare.

To create a pivot table in Pandas, we use the method pivot_table(). The three main arguments are index, values, and aggfunc.

  • index specifies the variable used to define the rows of the pivot table.

  • values specifies the variable that will be summarized.

  • aggfunc specifies the aggregation function applied to the selected values, such as "mean", "sum", or "count".

For example, the code below calculates the average recency for each category of Recency_Level:

# Calculating average Recency by Recency_Level
customer_churn.pivot_table(index = "Recency_Level", 
                           values = "Recency",
                           aggfunc = "mean")
Recency
Recency_Level
High 232.08
Low 38.56
Medium 69.20

Here, the observations are grouped according to Recency_Level, and the mean of the variable Recency is computed within each group.

Pivot tables become even more useful when we introduce the argument columns. This allows us to create summaries across two grouping dimensions simultaneously, producing a matrix-like table. For instance, the following code calculates the average recency for each combination of Recency_Level and Frequency_Level:

# Calculating average Recency by Recency_Level and Frequency_Level
customer_churn.pivot_table(index = "Recency_Level",
                           columns = "Frequency_Level",
                           values = "Recency",
                           aggfunc = "mean")
Frequency_Level High Low Medium
Recency_Level
High 216.67 233.06 191.35
Low 34.63 39.20 37.66
Medium 60.50 69.76 63.29

In this case the rows are defined by Recency_Level, the columns are defined by Frequency_Level, and the values are the average values of Recency.

Finally, we also have the option to include the argument margins = True, which adds overall summary totals to the pivot table:

# Adding overall summary margins
customer_churn.pivot_table(index = "Recency_Level",
                           columns = "Frequency_Level",
                           values = "Recency",
                           aggfunc = "mean",
                           margins = True)
Frequency_Level High Low Medium All
Recency_Level
High 216.67 233.06 191.35 232.08
Low 34.63 39.20 37.66 38.56
Medium 60.50 69.76 63.29 69.20
All 40.70 113.69 47.86 102.06

The additional row and column labeled All contain aggregated results across all groups, providing an overall summary alongside the grouped statistics.

Transforming to and from a DataFrame:
Before finalizing this chapter, it is important to highlight that Pandas provides considerable flexibility when converting between different data structures. For example, Pandas objects can be transformed into NumPy arrays and vice versa. Additionally, Pandas can create DataFrames from structures such as lists, dictionaries, tuples, and NumPy arrays.

To transform a Pandas DataFrame into a NumPy array, we can use the method to_numpy(). Applying this method to the DataFrame customer_churn creates a large matrix-like NumPy array with 4,000 rows and 9 columns.

One important limitation, however, is that the column names are not preserved in the resulting array because NumPy arrays only store the underlying values. For this reason, it is often useful to save the column names separately using the columns attribute:

# Transforming the DataFrame into a numpy array
customer_churn_in_numpy_array = customer_churn.to_numpy()

# Printing customer_churn_in_numpy_array
customer_churn_in_numpy_array

# Extracting the column names
customer_churn_column_names = customer_churn.columns

# Printing customer_churn_column_names
customer_churn_column_names
Index(['ID', 'Recency', 'Recency_Level', 'Frequency',
       'Frequency_Level', 'Monetary_Value',
       'Monetary_Value_Level', 'Observation_Period', 'Churn'],
      dtype='object')

Methods vs Attributes

Some Pandas operations (e.g. describe()) require parentheses while others (e.g. columns) do not. The reason is that methods perform an action or calculation and therefore need parentheses to execute. In contrast, columns is an attribute, meaning it simply stores information about the object and can be accessed directly without parentheses.

To transform a NumPy array, a list, a dictionary, or other compatible data structures into a Pandas DataFrame, we use the function DataFrame() from the Pandas package. This function creates a tabular object with rows and columns that can then be manipulated using Pandas operations.

To illustrate this idea, we can transform the object customer_churn_in_numpy_array back into a DataFrame:

# Transforming numpy array back to a DataFrame
customer_churn_in_pandas = pd.DataFrame(customer_churn_in_numpy_array)

# Printing the first few rows
customer_churn_in_pandas.head()
0 1 2 ... 6 7 8
0 1 46 Low ... Medium 742 0
1 2 40 Low ... High 2301 0
2 3 35 Low ... High 2411 0
3 4 50 Low ... Medium 813 0
4 5 77 Medium ... Low 1 0

5 rows × 9 columns

Notice that the column names are now different. Since NumPy arrays only store raw values, the original column names were lost during the conversion process. As a result, Pandas automatically assigns integer column labels such as 0, 1, 2, and so on.

To restore the original variable names, we can use the columns argument and pass the object containing the saved column names:

# Restoring original column names
customer_churn_in_pandas = pd.DataFrame(
  customer_churn_in_numpy_array,
  columns = customer_churn_column_names
  )

# Printing the first few rows
customer_churn_in_pandas.head()
ID Recency Recency_Level ... Monetary_Value_Level Observation_Period Churn
0 1 46 Low ... Medium 742 0
1 2 40 Low ... High 2301 0
2 3 35 Low ... High 2411 0
3 4 50 Low ... Medium 813 0
4 5 77 Medium ... Low 1 0

5 rows × 9 columns

In this way, we recreate the original DataFrame structure while preserving the variable names from the initial dataset.

As stated, it is also possible to create a DataFrame from a dictionary. In fact, converting a dictionary to a DataFrame is very common in practice, as its structure maps naturally onto that of a DataFrame: the keys correspond to column names, while the values correspond to column values. In this sense, dictionaries provide a very intuitive way of constructing tabular data in Python. Moreover, each key must contain the same number of elements, which aligns with the fundamental requirement of a DataFrame, where all columns must have equal length.

To illustrate this idea, we can use the following example:

# Creating a dictionary
customer_dict = {
  "Recency": [46, 40, 35, 50, 55],
  "Frequency_Level": ["Low", "Medium", "High", "Low", "Low"],
  "Monetary_Value": [3009.6000, 57347.2800, 14496.1575, 1416.2000, 8830.3200]
  }

# Converting dictionary into a pandas DataFrame
customer_df = pd.DataFrame(customer_dict)

# Printing the first few rows
customer_df.head()
Recency Frequency_Level Monetary_Value
0 46 Low 3009.60
1 40 Medium 57347.28
2 35 High 14496.16
3 50 Low 1416.20
4 55 Low 8830.32

In this example, each dictionary key corresponds to a column in the resulting DataFrame, and the associated lists become the column values. Because all lists contain the same number of elements, Pandas can align them row by row to construct the DataFrame correctly.