Data Manipulation

Introduction

In this chapter, we will discuss various functions and methods used to handle data. More specifically, we introduce the NumPy package and discuss in much more detail the Pandas package.

From Data Imports to Data Manipulation

Generally, after we import the data, we want to transform them in such a way that it becomes easy to analyze, a process often called data manipulation. In the real world, datasets are usually not in the format we would like them to be. Even when data are orderly stored in databases, they are typically not structured with our specific analysis in mind. For example, do we want to analyze sales per country for a specific calendar year or all calendar years of sales for a specific country? In this example, it is up to us to decide on the best approach, depending on our goal. This illustrates that we need to be able to bring an imported dataset into an appropriate format to facilitate our analysis.

Data Manipulation with NumPy

Perhaps the two of the most important packages for data manipulation in the Python data science ecosystem are NumPy and Pandas.

The NumPy package (short for Numerical Python) provides efficient tools for working with numerical data. In particular, it introduces the NumPy array, which is similar to a Python list but designed for fast mathematical computation. NumPy arrays support vectorized operations, meaning that calculations can be applied to an entire collection of values at once, without needing loops. This makes NumPy extremely important for data science, statistics, machine learning, and scientific computing. Many other Python libraries (including Pandas) are built on top of NumPy.

To begin, let us create a regular Python list:

import numpy as np

our_list = [1, 4, 5, 6, 6, 6, 6, 4100, 100, 100]

A regular Python list is flexible, but it is not optimized for mathematical calculations. For example, multiplying a list by a number repeats the list rather than multiplying each value:

# Multiplying our_list by 2
our_list * 2

[1, 4, 5, 6, 6, 6, 6, 4100, 100, 100, 1, 4, 5, 6, 6, 6, 6, 4100, 100, 100]

To perform mathematical operations element by element, we first convert the list into a NumPy array. Once transformed into an array, calculations are automatically applied to every element, something that is known as vectorization. In the code below, each value in the array is now multiplied by 2 individually:

# Vectorized multiplication
np.array(our_list) * 2

array([2, 8, 10, 12, 12, 12, 12, 8200, 200, 200])

We can also perform calculations between arrays of the same length. In the example below, each element of the first array is added to the corresponding element of the second array:

# Element-wise addition
np.array(our_list) + np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

array([2, 6, 8, 10, 11, 12, 13, 4108, 109, 110])

Just like Python lists, NumPy arrays support indexing and slicing. For instance, we can select the first 2 elements of a NumPy array in the following way:

# Selecting the first two elements
np.array(our_list)[0:2]

array([1, 4])

Like standard Python lists, NumPy arrays can also contain multiple dimensions. For example, a double list can be transformed into a two-dimensional array:

# Creating a double list
double_list = [[1, 2, 3], [4, 5, 6]]

# Transforming double_list to a numpy array
np.array(double_list)

array([[1, 2, 3],
       [4, 5, 6]])

This creates a matrix-like structure with two rows (number of lists) and three columns (number of elements of each list).

We can also use NumPy to apply conditions element by element. For example, suppose we want to classify values in a list as "Large" or "Small", depending on whether each value is above 5. To do this, we can use the function np.where():

# Applying condition
np.where(np.array(our_list) > 5, "large", "small")

array(['small', 'small', 'small', 'large', 'large', 'large',
       'large', 'large', 'large', 'large'], dtype='<U5')

Lastly, NumPy can also be used to calculate summary statistics such as the mean or sum. Although Python already provides built-in functions such as sum(), NumPy functions are specifically optimized for numerical data and work efficiently with NumPy arrays. In practice, NumPy operations are generally faster and more convenient when working with large numerical datasets or performing mathematical computations across arrays.

# Calculating sum with standard Python
sum(our_list)

# Calculating sum with numpy
np.sum(np.array(our_list))

np.int64(4334)

In both cases, the output is the same (4334). The main difference is not the result itself, but how the calculation is performed internally. The built-in sum() function works with general Python objects such as lists, while np.sum() is optimized for NumPy arrays and numerical computations. This becomes especially useful when working with large datasets or multidimensional arrays, where NumPy operations are generally faster and more memory-efficient. Note that the output of np.sum() has a slightly different format because it returns a NumPy numerical type (such as numpy.int64) instead of a standard built-in Python integer.

Data Manipulation with Pandas

Even though NumPy provides many additional capabilities, we will focus mostly on using Pandas (short for Panel Data or Python Data Analysis), whose primary focus is data manipulation with data frames. A data frame is a rectangular table of data organized into rows and columns, similar to a spreadsheet. While NumPy is optimized for fast numerical computation, Pandas focuses on practical data analysis tasks. In most real-world data science projects, Pandas is the primary tool used for manipulating datasets, while NumPy operates behind the scenes to provide efficient numerical operations.

To learn how to use Pandas, it makes sense to have a specific dataset throughout this chapter. This dataset will be the same as the one we imported (using the read_csv() function from the Pandas package) in Chapter Importing Data: the customer_churn dataset.

Our dataset has information for 4,000 customers, out of which 1,000 churn, meaning that these customers are known to have stopped purchasing at a given point in time. The columns Recency and Recency_Level show information about how recently a customer made his/her last purchase. Moreover, The columns Frequency and Frequency_Level show us how many purchases (in total) a customer has made in the past. Also, the columns Monetary_Value and Monetary_Value_Level show how much money the company received from a customer. According to academic marketing literature (see, for example, Blattberg, Kim, & Neslin, 2008; Fader & Hardie, 2009), these general variables—known by their acronym RFM—can be used as strong indicators of whether a customer will churn in the future. We can imagine that the more purchases take place recently and the higher their value, the higher the probability that a customer will not churn. Lastly, the column Observation_Period shows how long a customer is known to have been a customer with the company.

We load the Pandas package and the respective dataset (assigning it to an object called customer_churn) using the code below:

import pandas as pd

# Importing customer_churn
customer_churn = pd.read_csv("https://raw.githubusercontent.com/Datakortex/Datasets/refs/heads/main/customer_churn.csv")

# Printing the first few rows
customer_churn.head()

	ID	Recency	Recency_Level	...	Monetary_Value_Level	Observation_Period
0	1	46	Low	...	Medium	742
1	2	40	Low	...	High	2301
2	3	35	Low	...	High	2411
3	4	50	Low	...	Medium	813
4	5	77	Medium	...	Low	1

5 rows × 9 columns

As mentioned, the imported dataset has 4,000 observations (as rows) and 9 variables (as columns). Because it is really helpful to have a brief description of each variable, the contents of each column are described below:

ID: customer ID (unique number)
Recency: number of days that have passed since last purchase
Recency_Level: the level of recency (low, medium or high)
Frequency: number of purchases
Frequency_Level: the level of frequency (low, medium or high)
Monetary_Value: total monetary value (e.g. total number of euros)
Monetary_Value_Level: the level of monetary value (low, medium or high)
Observation_Period: number of days between the first and the last purchase
Churn: whether the customer churns or not (a value of 1 implies that the customer churns)

Evidently, the data imported are structured as a Pandas DataFrame. Each row is an observation (a customer) while each column is a different variable (characteristic).

Output Display

The first few rows of the DataFrame are shown as a formatted table rather than as text within the code output. This is a consequence of how the Python environment renders tabular objects. Depending on the software you use to run the code, the appearance and placement of the output may vary. These variations are purely cosmetic and do not affect the results.

The Use of Method Chaining: .

The main characteristics with Pandas is that we can manipulate a DataFrame in steps. For example, in customer_churn we may want firstly to keep a particular group of customers and then calculate the average recency. Such a process would require two distinct steps: first, filtering the rows that satisfy a condition and, second, applying a calculation on the resulting subset of data. With Pandas, we apply method chaining, which relies on the dot (.) operator. This allows us to take the output of one operation and immediately apply another operation to it, creating a sequence (or “chain”) of transformations. Put differently, the dot lets us move step by step through a data object by calling methods directly on it.

To illustrate this concept, let’s use a very simple example. So far, we have used the head() method which typically returns the first 5 rows of a DataFrame. In this method, we can also specify the number of rows inside the parentheses. For instance, the code below returns the first ten rows:

# Printing the first ten rows
customer_churn.head(10)

	ID	Recency	Recency_Level	...	Monetary_Value_Level	Observation_Period	Churn
0	1	46	Low	...	Medium	742	0
1	2	40	Low	...	High	2301	0
2	3	35	Low	...	High	2411	0
3	4	50	Low	...	Medium	813	0
4	5	77	Medium	...	Low	1	0
5	6	55	Medium	...	Medium	2077	0
6	7	48	Low	...	Medium	1317	0
7	8	193	High	...	High	1612	1
8	9	96	Medium	...	Medium	1037	0
9	10	167	High	...	Medium	1373	1

10 rows × 9 columns

Now, to illustrate the idea of method chaining, we can apply another head() method directly after the first one:

# Printing the first few rows
customer_churn.head(10).head()

	ID	Recency	Recency_Level	...	Monetary_Value_Level	Observation_Period
0	1	46	Low	...	Medium	742
1	2	40	Low	...	High	2301
2	3	35	Low	...	High	2411
3	4	50	Low	...	Medium	813
4	5	77	Medium	...	Low	1

5 rows × 9 columns

As expected, the first five rows are printed. This happens because the first operation (head(10)) produces a new DataFrame containing only the first ten rows. The second operation (head()) is then applied to this intermediate result, extracting the first five rows from that subset.

In this way, each dot represents a step in a sequence of transformations: the output of one operation becomes the input to the next. This allows us to build data processing workflows step by step, making code more structured, readable, and easy to extend.

Exploring Operations for Data Manipulation

For the rest of this chapter, we see how we can manipulate the DataFrame customer_churn using Pandas. The methods info() and describe() display dataset structure and summary statistics, respectively. In this way, we can quickly build an initial understanding of the data we are working with.

# Printing structure
customer_churn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   ID                    4000 non-null   int64  
 1   Recency               4000 non-null   int64  
 2   Recency_Level         4000 non-null   object 
 3   Frequency             4000 non-null   int64  
 4   Frequency_Level       4000 non-null   object 
 5   Monetary_Value        4000 non-null   float64
 6   Monetary_Value_Level  4000 non-null   object 
 7   Observation_Period    4000 non-null   int64  
 8   Churn                 4000 non-null   int64  
dtypes: float64(1), int64(5), object(3)
memory usage: 281.4+ KB

# Printing summary statistics
customer_churn.describe()

	ID	Recency	Frequency	Monetary_Value	Observation_Period	Churn
count	4000.00	4000.00	4000.00	4000.00	4000.00	4000.00
mean	2000.50	102.06	37.82	8417.93	1529.73	0.25
std	1154.84	101.46	30.55	12498.18	825.79	0.43
min	1.00	30.00	12.00	109.00	1.00	0.00
25%	1000.75	38.00	22.00	1779.81	852.75	0.00
50%	2000.50	54.00	31.00	4209.97	1803.00	0.00
75%	3000.25	129.00	45.00	9163.19	2288.00	0.25
max	4000.00	519.00	860.00	99011.17	2435.00	1.00

Each column is assigned a specific data type. For example, int64 indicates a numeric (integer) variable, while object typically represents categorical or text data. Additionally, the summary statistics include measures such as the number of observations (count), the average value (mean), and the standard deviation (std). These values provide a quick overview of the distribution of each numeric variable. Since the variable (column) Churn only takes the values 0 and 1, it can be interpreted as a binary indicator. In this case, the mean of Churn represents the proportion of customers who churned. Therefore, a mean of 0.25 implies that 25% of customers in the dataset have churned.

Selecting columns:
To select specific columns in a Pandas DataFrame, we use square brackets ([]) with the column name(s). This allows us to extract either a single column or multiple columns from the dataset.

customer_churn["Recency"]

0        46
1        40
2        35
3        50
4        77
       ... 
3995    189
3996    152
3997    359
3998     69
3999     31
Name: Recency, Length: 4000, dtype: int64

Notice that the resulting output is a Pandas Series, even though it may look like a single-column DataFrame. When we select a single column using a single square brackets, the result is a Series, a one-dimensional object that represents a single variable, whereas a DataFrame is two-dimensional and can contain multiple variables. Although both may appear similar when printed, they are fundamentally different data structures and are handled differently in Pandas.

This distinction matters because certain operations behave differently depending on whether we are working with a Series or a DataFrame. For example, a Series is often used when we want to work with a single variable directly (such as computing statistics, applying functions, or passing data into models), while a DataFrame is preferred when we need to preserve tabular structure or work with multiple columns simultaneously.

To keep the DataFrame structure, we need to use double square brackets ([[]]):

# Selecting Recency
customer_churn[["Recency"]]

	Recency
0	46
1	40
2	35
3	50
4	77
...	...
3995	189
3996	152
3997	359
3998	69
3999	31

4000 rows × 1 columns

Now, we have a DataFrame with only the Recency column. In the same way, we can select multiple columns by passing a list of column names inside the brackets:

# Selecting Recency, Frequency, Monetary_Value and Churn
customer_churn[["Recency", "Frequency", "Monetary_Value", "Churn"]]

	Recency	Frequency	Monetary_Value	Churn
0	46	26	3009.60	0
1	40	56	57347.28	0
2	35	293	14496.16	0
3	50	18	1416.20	0
4	77	14	523.72	0
...	...	...	...	...
3995	189	17	683.18	1
3996	152	23	3205.66	0
3997	359	17	251.00	1
3998	69	43	9438.84	1
3999	31	45	5822.22	0

4000 rows × 4 columns

This returns a new DataFrame containing only the selected columns, allowing us to focus on specific variables of interest while keeping the data in a structured format.

Dropping columns:
Instead of selecting the columns we want to keep, it is sometimes easier to remove the columns we do not need. In Pandas, the method drop() is used for this purpose. To delete columns, we use the columns argument and provide a list containing the names of the columns we want to remove. For instance, suppose that we want a DataFrame containing only the columns Recency, Frequency, Monetary_Value, and Churn. Rather than selecting these columns directly, we can instead drop all the remaining columns:

# Dropping all other columns
customer_churn.drop(columns = ["ID", 
                               "Recency_Level",
                               "Frequency_Level", 
                               "Monetary_Value_Level",
                               "Observation_Period"])

	Recency	Frequency	Monetary_Value	Churn
0	46	26	3009.60	0
1	40	56	57347.28	0
2	35	293	14496.16	0
3	50	18	1416.20	0
4	77	14	523.72	0
...	...	...	...	...
3995	189	17	683.18	1
3996	152	23	3205.66	0
3997	359	17	251.00	1
3998	69	43	9438.84	1
3999	31	45	5822.22	0

4000 rows × 4 columns

Renaming columns:
Another easy yet useful function is the method rename(), which is used to change the names of columns. To do this, we provide a dictionary (see Chapter Python Basics) inside the columns argument, where the keys represent the original column names and the values represent the new names we want to assign. As an example, the code below changes the name of the column Frequency to Number_of_Purchases:

# Renaming column Frequency to Number_of_Purchases
customer_churn.rename(columns = {"Frequency": "Number_of_Purchases"})[["ID", "Number_of_Purchases", "Churn"]].head()

	ID	Number_of_Purchases
0	1	26
1	2	56
2	3	293
3	4	18
4	5	14

Note that the structure of the data remains the same; only the column label is updated.

Sorting values:
A simple, yet very useful method is sort_values(). As the name suggests, this method helps us sort the order of rows in a DataFrame based on the values of one or more specified columns. For example, we may want to review our dataset in a descending order based on Monetary_Value. In order to use this method, we need to include in the parenthesis the name of the column based on which we want to sort our data. We also need to specify the argument ascending if we prefer an ascending or descending order.

# Sorting on Monetary_Value
customer_churn.sort_values("Monetary_Value", ascending = True)[["ID", "Monetary_Value", "Churn"]].head()

	ID	Monetary_Value	Churn
540	541	109.00	0
3957	3958	115.00	1
400	401	118.00	1
832	833	124.00	1
1482	1483	134.98	0

To include more than one variables, we just need to include them in a list. In this case though, we also need to specify the ascending argument—True or False—also with a list:

# Sorting on Frequency and Monetary_Value
customer_churn.sort_values(["Frequency", "Monetary_Value"], ascending = [True, False])[["ID", "Frequency", "Monetary_Value", "Churn"]].head()

	ID	Frequency	Monetary_Value	Churn
418	419	12	789.08	1
2876	2877	12	670.94	1
1501	1502	12	535.52	0
1108	1109	12	496.08	1
221	222	12	490.35	1

Filtering rows:
Often, we want to remove certain rows from a dataset. To do this, we rely on conditional expressions (see Chapter Python Basics), which allow us to create logical statements that evaluate to either True or False for each row. As an example, suppose we want to keep only those customers with recency of less than 60 days. We start by using the following expression:

# Checking which rows satisfy the condition
customer_churn["Recency"] < 60

0        True
1        True
2        True
3        True
4       False
        ...  
3995    False
3996    False
3997    False
3998    False
3999     True
Name: Recency, Length: 4000, dtype: bool

This expression creates a Pandas Boolean Series, where each entry is either True or False depending on whether the condition is met. In this case, each row is checked to see whether the value in the Recency column is less than 60. We can then use this Boolean Series to filter the original DataFrame. When we place the condition inside the square brackets of the DataFrame, Pandas keeps only the rows where the condition is True:

# Filtering rows based on condition
customer_churn[customer_churn["Recency"] < 60].head()

	ID	Recency	Recency_Level	...	Monetary_Value_Level	Observation_Period
0	1	46	Low	...	Medium	742
1	2	40	Low	...	High	2301
2	3	35	Low	...	High	2411
3	4	50	Low	...	Medium	813
5	6	55	Medium	...	Medium	2077

5 rows × 9 columns

This operation effectively subsets the DataFrame, returning only the observations that satisfy the condition. In this way, conditional expressions allow us to flexibly remove or select rows based on any rule defined on the data.

We can extend conditional expressions by combining multiple conditions. To do this, we place each condition inside parentheses and connect them using logical operators such as &:

# Filtering rows based on multiple condition
customer_churn[(customer_churn["Recency"] == 30) & (customer_churn["Frequency_Level"] == "High")][["ID", "Recency", "Frequency_Level", "Churn"]].head()

	ID	Recency	Frequency_Level	Churn
71	72	30	High	0
3005	3006	30	High	0

In this example, we keep only the customers who satisfy both conditions simultaneously: customers with a recency value of 30 days and a frequency level equal to "High".

When working with variables that contain many categories, writing multiple equality conditions is not efficient. In such cases, the method isin() provides a cleaner alternative. This method checks whether each value belongs to a specified list of categories. The code below illustrates this idea:

# Filtering rows based on multiple condition
customer_churn[(customer_churn["Recency"] == 30) & (customer_churn["Monetary_Value_Level"].isin(["Medium", "High"]))][["ID", "Recency", "Frequency_Level", "Churn"]].head()

	ID	Recency	Frequency_Level
71	72	30	High
1883	1884	30	Low
2309	2310	30	Low
2349	2350	30	Low
3005	3006	30	High

Here, we keep customers with a recency value of 30 days and with a monetary value level that is either "Medium" or "High".

We can also select rows and columns using the special Pandas indexers loc[] and iloc[]. These provide a more explicit and flexible way to subset DataFrames. The loc[] indexer selects rows and columns based on labels (that is, column names and row labels). Its general structure is:

# General structure of .loc[]
dataframe.loc[row_selection, column_selection]

For example, suppose we want to keep only the rows where Recency is below 60 and display only the variables Recency, Frequency_Level, and Monetary_Value. We can write:

# Filtering rows and selecting columns with .loc[]
customer_churn.loc[
    customer_churn["Recency"] < 60,
    ["Recency", "Frequency_Level", "Monetary_Value"]
    ].head()

	Recency	Frequency_Level	Monetary_Value
0	46	Low	3009.60
1	40	Medium	57347.28
2	35	High	14496.16
3	50	Low	1416.20
5	55	Low	8830.32

Here, the first part inside loc[] specifies the rows we want to keep using a Boolean condition, while the second part specifies the columns to display. This is, of course, just an alternative way of selecting rows and columns. We could instead perform the row filtering and column selection separately; the output would be exactly the same.

# Filtering rows and selecting columns without .loc[]
customer_churn[customer_churn["Recency"] < 60][["Recency", "Frequency_Level", "Monetary_Value"]].head()

	Recency	Frequency_Level	Monetary_Value
0	46	Low	3009.60
1	40	Medium	57347.28
2	35	High	14496.16
3	50	Low	1416.20
5	55	Low	8830.32

The iloc[] indexer works differently because it selects rows and columns based on their integer positions rather than labels. Its structure is similar:

# General structure of .iloc[]
dataframe.iloc[row_positions, column_positions]

For example, suppose we want to display the first five rows and the first three columns of the dataset:

# Selecting rows and columns by position with .iloc[]
customer_churn.iloc[0:5, 0:3]

	ID	Recency	Recency_Level
0	1	46	Low
1	2	40	Low
2	3	35	Low
3	4	50	Low
4	5	77	Medium

In this example, 0:5 selects rows from position 0 up to (but not including) position 5, while 0:3 selects the first three columns.

So, the important distinction between loc[] and iloc[] is that loc[] uses labels and includes the ending label when slicing, whereas iloc[] uses integer positions and excludes the ending position, just like standard Python slicing.

Creating new columns:
We have seen how we can remove columns and rows from a dataset. But what if we want to create a new column or transform the values of an existing one? In Pandas, this can be achieved by assigning values directly to a column name. The vectorization property we introduced earlier with NumPy also applies to Pandas. This means that operations are automatically performed element by element across an entire column, without requiring loops. Assuming that a month consists of 30 days, we can create a new variable called Recency_Months in the following way:

# Creating column Recency_Months
customer_churn["Recency_Months"] = customer_churn["Recency"] / 30

# Printing the first few rows
customer_churn.head()

	ID	Recency	Recency_Level	...	Observation_Period	Recency_Months
0	1	46	Low	...	742	1.53
1	2	40	Low	...	2301	1.33
2	3	35	Low	...	2411	1.17
3	4	50	Low	...	813	1.67
4	5	77	Medium	...	1	2.57

5 rows × 10 columns

A new column is now added to the DataFrame containing the transformed values. Each value in the Recency column is divided by 30, and the result is stored in the newly created column Recency_Months.

We can also create new variables by combining multiple columns. For example, suppose we want to calculate the average monetary value per purchase. We can divide the Monetary_Value column by the Frequency column:

# Creating column Average_Monetary_Value
customer_churn["Average_Monetary_Value"] = customer_churn["Monetary_Value"] / customer_churn["Frequency"]

# Printing the first few rows
customer_churn[["ID", "Monetary_Value", "Frequency", "Average_Monetary_Value"]].head()

	ID	Monetary_Value	Frequency	Average_Monetary_Value
0	1	3009.60	26	115.75
1	2	57347.28	56	1024.06
2	3	14496.16	293	49.47
3	4	1416.20	18	78.68
4	5	523.72	14	37.41

Essentially, each value in Monetary_Value is divided by the corresponding value in Frequency. Since Pandas applies operations element-wise, the calculation is automatically performed row by row across the entire DataFrame.

Additionally, we can create a new column based on multiple conditions. For example, suppose we want to classify customers as "Inactive", "Active", or "Very Active" depending on how recently they made a purchase. Instead of writing multiple nested conditions with np.where(), we can use a more structured approach similar to CASE WHEN in the programming language SQL, which can be implemented in Pandas using case_when():

# Creating column Customer_Status using case_when
customer_churn["Customer_Status"] = customer_churn["Recency"].case_when([
  (customer_churn["Recency"] > 180, "Inactive"),
  (customer_churn["Recency"].between(90, 180), "Active"),
  (customer_churn["Recency"] < 90, "Very Active")
  ])

# Printing the first few rows
customer_churn[["ID", "Recency", "Customer_Status"]].head()

	ID	Recency	Customer_Status
0	1	46	Very Active
1	2	40	Very Active
2	3	35	Very Active
3	4	50	Very Active
4	5	77	Very Active

In the code above, each condition is evaluated row by row, and the first condition that is satisfied determines the value assigned to the new column. This means that for every customer, Pandas checks the conditions in order (starting from "Inactive", then "Active", and finally "Very Active"), and once a match is found, the corresponding label is assigned. This makes the logic both sequential and transparent, and it closely mirrors how a CASE WHEN statement works in SQL. As a result, the code is easier to read and particularly useful when working with multiple categories or more complex classification rules.

Besides creating transformed variables, we can also create columns based on the row positions of the DataFrame. The method reset_index() generates a new column containing the row numbers (index values):

# Dropping previous created columns
customer_churn = customer_churn.drop(
  columns = ["Recency_Months",
             "Average_Monetary_Value",
             "Customer_Status"]
             )

# Creating an index column
customer_churn.reset_index()

	index	ID	Recency	...	Monetary_Value_Level	Observation_Period	Churn
0	0	1	46	...	Medium	742	0
1	1	2	40	...	High	2301	0
2	2	3	35	...	High	2411	0
3	3	4	50	...	Medium	813	0
4	4	5	77	...	Low	1	0
...	...	...	...	...	...	...	...
3995	3995	3996	189	...	Low	264	1
3996	3996	3997	152	...	Medium	1675	0
3997	3997	3998	359	...	Low	1	1
3998	3998	3999	69	...	Medium	2234	1
3999	3999	4000	31	...	Medium	2392	0

4000 rows × 10 columns

This adds the current index as a regular column in the DataFrame while simultaneously resetting the index itself. These values were already visible on the left side of the DataFrame as the index, but they were not part of the actual data structure. Essentially, reset_index() converts this index into a regular column so that it becomes an explicit variable in the DataFrame and can be used like any other feature in the dataset.

Aggregating results:
Aggregated data are summarizations of our raw data. For example, how many customers exist within each category of recency level? Or, what is the average recency for customers that churn in comparison to customers that do not churn? Essentially, we “summarize” our results—loose some case-specific information—in order to be able to extract valuable conclusions for the whole set.

The simplest way to summarize a variable is to apply an aggregation method directly after selecting a column. For instance, we can calculate the average value of the variable Frequency using the method mean():

# Calculating the average frequency
customer_churn["Frequency"].mean()

# Calculating the average frequency while preserving DataFrame structure
customer_churn[["Frequency"]].mean()

Frequency    37.82
dtype: float64

Both approaches compute the same average value. However, the first expression returns a single numeric result because the selected column is treated as a Pandas Series. The second expression preserves the DataFrame structure and therefore returns a Pandas Series containing the column name together with its corresponding mean value. In practice, the first approach is often simpler when working with a single variable, while the second approach becomes useful when combining multiple aggregations or working consistently with DataFrame objects.

A more advanced way to aggregate data is to summarize it by group. In Pandas, this is achieved using the method groupby(), inside of which we provide one or more variables that define the groups we want to create.

For instance, suppose that we want to calculate the average monetary value for each combination of Recency_Level and Frequency_Level. In this case, Pandas first splits the data into groups based on these categories and then applies the selected calculation to each group separately:

# Calculating average monetary value by groups
customer_churn.groupby(["Recency_Level", "Frequency_Level"])["Monetary_Value"].mean()

Recency_Level  Frequency_Level
High           High               32589.66
               Low                 3191.59
               Medium             21914.37
Low            High               37084.83
               Low                 6261.37
               Medium             23735.66
Medium         High               51118.21
               Low                 4897.72
               Medium             20290.33
Name: Monetary_Value, dtype: float64

Here, the DataFrame is grouped according to all combinations of Recency_Level and Frequency_Level, and the mean() method is applied to the Monetary_Value column within each group.

The resulting output is not a regular DataFrame but a Pandas Series with a hierarchical index (also called a MultiIndex). The grouping variables (Recency_Level and Frequency_Level) become the index levels, while the calculated means are stored as the values of the Series. This structure is useful for many analytical tasks, but it can sometimes be less convenient for further data manipulation, filtering, or exporting results.

To convert the output into a standard DataFrame with ordinary columns, we can use the reset_index() method. This moves the grouping variables from the index back into regular columns and returns a clean tabular structure:

# Calculating average monetary value by groups
customer_churn.groupby(["Recency_Level", "Frequency_Level"])["Monetary_Value"].mean().reset_index()

	Recency_Level	Frequency_Level	Monetary_Value
0	High	High	32589.66
1	High	Low	3191.59
2	High	Medium	21914.37
3	Low	High	37084.83
4	Low	Low	6261.37
5	Low	Medium	23735.66
6	Medium	High	51118.21
7	Medium	Low	4897.72
8	Medium	Medium	20290.33

Additionally, the method agg() can be used when we want to apply one or multiple aggregation calculations to the grouped data. Let’s add the calculation of median value along with the mean:

# Calculating average and median monetary value by groups
customer_churn.groupby(["Recency_Level", "Frequency_Level"])["Monetary_Value"].agg(["mean", "median"])

		mean	median
Recency_Level	Frequency_Level
High	High	32589.66	38336.25
	Low	3191.59	1716.67
	Medium	21914.37	19691.79
Low	High	37084.83	37121.19
	Low	6261.37	4666.78
	Medium	23735.66	17398.98
Medium	High	51118.21	61814.51
	Low	4897.72	3476.98
	Medium	20290.33	15396.38

Counting values:
Even though we just discussed how to perform calculations such as averages and sums by groups, one of the most common and simplest aggregation tasks is simply counting the number of observations within each category. In Pandas, this can be done using the method value_counts():

# Counting observations by category
customer_churn["Recency_Level"].value_counts(sort = True)

Recency_Level
Low       1877
High      1160
Medium     963
Name: count, dtype: int64

This method counts how many times each category appears in the selected column. By default, the argument sort = True sorts the categories from the most frequent to the least frequent.

In many situations, absolute counts are useful, but proportions can be even more informative. For this reason, value_counts() also includes the argument normalize, which converts the counts into relative frequencies (proportions) when set to True.

# Counting observations by category
customer_churn["Recency_Level"].value_counts(sort = True, normalize = True)

Recency_Level
Low       0.47
High      0.29
Medium    0.24
Name: proportion, dtype: float64

Because we set the argument normalize equal to True, the counts are divided by the total number of observations. As a result, the output represents proportions rather than raw counts. For example, a value of 0.35 means that 35% of the observations belong to that category. This is especially useful when comparing category distributions across datasets of different sizes or when interpreting results in percentage terms.

Pivot tables:
Pivot tables provide an excellent way to aggregate data in a structured and highly flexible format. While we have already seen how to compute summaries using methods such as groupby() and agg(), pivot tables offer a more intuitive, spreadsheet-like alternative for producing grouped summaries. They are especially useful when we want to reorganize data into a clear tabular structure with groups displayed as rows (and optionally columns), making results easier to interpret and compare.

To create a pivot table in Pandas, we use the method pivot_table(). The three main arguments are index, values, and aggfunc.

index specifies the variable used to define the rows of the pivot table.
values specifies the variable that will be summarized.
aggfunc specifies the aggregation function applied to the selected values, such as "mean", "sum", or "count".

For example, the code below calculates the average recency for each category of Recency_Level:

# Calculating average Recency by Recency_Level
customer_churn.pivot_table(index = "Recency_Level", 
                           values = "Recency",
                           aggfunc = "mean")

	Recency
Recency_Level
High	232.08
Low	38.56
Medium	69.20

Here, the observations are grouped according to Recency_Level, and the mean of the variable Recency is computed within each group.

Pivot tables become even more useful when we introduce the argument columns. This allows us to create summaries across two grouping dimensions simultaneously, producing a matrix-like table. For instance, the following code calculates the average recency for each combination of Recency_Level and Frequency_Level:

# Calculating average Recency by Recency_Level and Frequency_Level
customer_churn.pivot_table(index = "Recency_Level",
                           columns = "Frequency_Level",
                           values = "Recency",
                           aggfunc = "mean")

Frequency_Level	High	Low	Medium
Recency_Level
High	216.67	233.06	191.35
Low	34.63	39.20	37.66
Medium	60.50	69.76	63.29

In this case the rows are defined by Recency_Level, the columns are defined by Frequency_Level, and the values are the average values of Recency.

Finally, we also have the option to include the argument margins = True, which adds overall summary totals to the pivot table:

# Adding overall summary margins
customer_churn.pivot_table(index = "Recency_Level",
                           columns = "Frequency_Level",
                           values = "Recency",
                           aggfunc = "mean",
                           margins = True)

Frequency_Level	High	Low	Medium	All
Recency_Level
High	216.67	233.06	191.35	232.08
Low	34.63	39.20	37.66	38.56
Medium	60.50	69.76	63.29	69.20
All	40.70	113.69	47.86	102.06

The additional row and column labeled All contain aggregated results across all groups, providing an overall summary alongside the grouped statistics.

Transforming to and from a DataFrame:
Before finalizing this chapter, it is important to highlight that Pandas provides considerable flexibility when converting between different data structures. For example, Pandas objects can be transformed into NumPy arrays and vice versa. Additionally, Pandas can create DataFrames from structures such as lists, dictionaries, tuples, and NumPy arrays.

To transform a Pandas DataFrame into a NumPy array, we can use the method to_numpy(). Applying this method to the DataFrame customer_churn creates a large matrix-like NumPy array with 4,000 rows and 9 columns.

One important limitation, however, is that the column names are not preserved in the resulting array because NumPy arrays only store the underlying values. For this reason, it is often useful to save the column names separately using the columns attribute:

# Transforming the DataFrame into a numpy array
customer_churn_in_numpy_array = customer_churn.to_numpy()

# Printing customer_churn_in_numpy_array
customer_churn_in_numpy_array

# Extracting the column names
customer_churn_column_names = customer_churn.columns

# Printing customer_churn_column_names
customer_churn_column_names

Index(['ID', 'Recency', 'Recency_Level', 'Frequency',
       'Frequency_Level', 'Monetary_Value',
       'Monetary_Value_Level', 'Observation_Period', 'Churn'],
      dtype='object')

Methods vs Attributes

Some Pandas operations (e.g. describe()) require parentheses while others (e.g. columns) do not. The reason is that methods perform an action or calculation and therefore need parentheses to execute. In contrast, columns is an attribute, meaning it simply stores information about the object and can be accessed directly without parentheses.

To transform a NumPy array, a list, a dictionary, or other compatible data structures into a Pandas DataFrame, we use the function DataFrame() from the Pandas package. This function creates a tabular object with rows and columns that can then be manipulated using Pandas operations.

To illustrate this idea, we can transform the object customer_churn_in_numpy_array back into a DataFrame:

# Transforming numpy array back to a DataFrame
customer_churn_in_pandas = pd.DataFrame(customer_churn_in_numpy_array)

# Printing the first few rows
customer_churn_in_pandas.head()

	0	1	2	...	6	7
0	1	46	Low	...	Medium	742
1	2	40	Low	...	High	2301
2	3	35	Low	...	High	2411
3	4	50	Low	...	Medium	813
4	5	77	Medium	...	Low	1

5 rows × 9 columns

Notice that the column names are now different. Since NumPy arrays only store raw values, the original column names were lost during the conversion process. As a result, Pandas automatically assigns integer column labels such as 0, 1, 2, and so on.

To restore the original variable names, we can use the columns argument and pass the object containing the saved column names:

# Restoring original column names
customer_churn_in_pandas = pd.DataFrame(
  customer_churn_in_numpy_array,
  columns = customer_churn_column_names
  )

# Printing the first few rows
customer_churn_in_pandas.head()

	ID	Recency	Recency_Level	...	Monetary_Value_Level	Observation_Period
0	1	46	Low	...	Medium	742
1	2	40	Low	...	High	2301
2	3	35	Low	...	High	2411
3	4	50	Low	...	Medium	813
4	5	77	Medium	...	Low	1

5 rows × 9 columns

In this way, we recreate the original DataFrame structure while preserving the variable names from the initial dataset.

As stated, it is also possible to create a DataFrame from a dictionary. In fact, converting a dictionary to a DataFrame is very common in practice, as its structure maps naturally onto that of a DataFrame: the keys correspond to column names, while the values correspond to column values. In this sense, dictionaries provide a very intuitive way of constructing tabular data in Python. Moreover, each key must contain the same number of elements, which aligns with the fundamental requirement of a DataFrame, where all columns must have equal length.

To illustrate this idea, we can use the following example:

# Creating a dictionary
customer_dict = {
  "Recency": [46, 40, 35, 50, 55],
  "Frequency_Level": ["Low", "Medium", "High", "Low", "Low"],
  "Monetary_Value": [3009.6000, 57347.2800, 14496.1575, 1416.2000, 8830.3200]
  }

# Converting dictionary into a pandas DataFrame
customer_df = pd.DataFrame(customer_dict)

# Printing the first few rows
customer_df.head()

	Recency	Frequency_Level	Monetary_Value
0	46	Low	3009.60
1	40	Medium	57347.28
2	35	High	14496.16
3	50	Low	1416.20
4	55	Low	8830.32

In this example, each dictionary key corresponds to a column in the resulting DataFrame, and the associated lists become the column values. Because all lists contain the same number of elements, Pandas can align them row by row to construct the DataFrame correctly.