In previous chapters, we discussed a range of machine learning methods, with the goal of understanding what happens behind the scenes: their assumptions, as well as their advantages and limitations. However, each of these methods typically relies on its own package, syntax, and set of arguments. In practice, this can make it challenging to apply multiple methods to the same dataset in a consistent way.
Although it is technically possible to work directly with these individual packages, it is often more practical and efficient to rely on a well-established framework that brings them together under a unified interface.
In this chapter, we focus on such a framework that integrates seamlessly with the tidyverse: tidymodels. Tidymodels can be seen as the tidyverse counterpart for machine learning. The underlying idea is straightforward: just as we use tidy data principles and pipe-based workflows to manipulate data, we can also construct complete machine learning workflows in a consistent, modular, and organized way.
Our goal in this chapter is to explore the differences between manual modeling and tidy modeling, and to examine the advantages and limitations of this approach.
Manual Modeling vs Tidy Modeling
In Chapter Linear Regression in Machine Learning, we loaded the dataset eshop_revenues and used linear regression to fit a model on a training set and make predictions on a test set. The full code was as follows:
# Libraries
library(tidyverse)

# Importing eshop_revenues
eshop_revenues <- read_csv("https://raw.githubusercontent.com/DataKortex/Data-Sets/refs/heads/main/eshop_revenues.csv")

# Splitting the dataset into training_set and test_set
training_set <- eshop_revenues %>%
  slice(1:100)

test_set <- eshop_revenues %>%
  slice(101:130)

# Fitting a linear model in the training_set using all predictors
lm_model_simple <- lm(Revenue ~ ., data = training_set)

# Making predictions on the test_set
lm_predictions_simple <- predict(lm_model_simple, test_set)

# Printing results
lm_predictions_simple
We now follow a similar approach using tidymodels. To start, we need to install and load the tidymodels package. After loading it, we use the function tidymodels_prefer() to ensure that functions from tidymodels take priority in case of conflicts with other loaded packages. Like tidyverse, tidymodels also loads the readr package, so we can use read_csv() to import datasets. Since we have already imported eshop_revenues, though, we skip this step in the code below:
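A minimal sketch of this setup step could look like the following (the install.packages() call is commented out, assuming the package is already installed):

```r
# Installing tidymodels (only needed once)
# install.packages("tidymodels")

# Loading tidymodels and resolving function-name conflicts in its favor
library(tidymodels)
tidymodels_prefer()
```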
Some R packages contain functions with the same name, which can lead to unexpected behavior if the wrong version is used. For example, both the dplyr package (loaded by tidymodels) and the MASS package have a function named select(). While select() in dplyr is used to choose columns from a data frame, the one in MASS has a completely different purpose. By calling tidymodels_prefer(), we ensure that when such conflicts occur, the version from tidymodels (and its core packages like dplyr) is used by default. This avoids errors or unexpected results when manipulating data frames.
The next step is to split the dataset into training and test sets. In tidymodels, we use the function initial_split() for this purpose, which randomly partitions the data. This makes the previous use of slice() unnecessary. Random splitting is especially useful when the rows of a dataset are not already randomly ordered. We also set a seed for reproducibility and set the argument prop to 0.75 to keep 75% of the data in the training set:
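The split might be written as the following sketch; the specific seed value is illustrative, not taken from the original code:

```r
# Setting a seed for reproducibility (the value 123 is illustrative)
set.seed(123)

# Randomly splitting eshop_revenues: 75% training, 25% test
split_data <- initial_split(eshop_revenues, prop = 0.75)

# Printing the split summary
split_data
```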
The resulting object split_data is of class rsplit, which contains both the training and test sets. To extract these datasets, we use the functions training() and testing():
# Extracting the training set
training_set <- split_data %>%
  training()

# Extracting the test set
test_set <- split_data %>%
  testing()
Next, we specify the engine of the model. In tidymodels, the engine determines the computational implementation used to fit the model. For example, "lm" uses base R’s lm() function for linear regression, while "ranger" uses the ranger() function for random forest models. The mode indicates the type of prediction task (e.g., regression). The engine maps to a function internally, but we do not call the function or package directly—tidymodels handles the computation automatically. Let’s initiate the engine:
# Initiating linear regression engine
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")
The resulting object, lm_spec, is of class model_spec. It contains the model type and engine information but has not yet been fitted to the data. To fit the model, we use the fit() function and specify the formula:
# Fitting linear regression on the training_set
lm_fit <- lm_spec %>%
  fit(Revenue ~ ., data = training_set)
Notice that the fit() function performs similarly to lm() in fitting the model to the training data. However, the behavior of predict() is slightly different from what we would normally expect:
# Making predictions on the test set
predictions_tm <- predict(lm_fit, test_set)

# Printing the first 6 rows
predictions_tm %>%
  head(n = 6)
The output of predict() is now a tibble instead of a numeric vector. Although this seems like a minor difference, it reflects the tidy modeling philosophy: all outputs are treated as structured data frames, facilitating further analysis and integration with the rest of the tidy workflow. The predicted values are returned in the same row order as the observations in the test set.
Finally, we assess the model predictions using the mean absolute error (MAE):
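One way to compute the MAE is with the mae() function from yardstick (loaded with tidymodels); the sketch below assumes the predictions_tm tibble created above:

```r
# Combining the actual revenues with the predictions
lm_results <- test_set %>%
  select(Revenue) %>%
  bind_cols(predictions_tm)

# Calculating the mean absolute error
lm_results %>%
  mae(truth = Revenue, estimate = .pred)
```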
The MAE obtained may differ from the previous manual approach, not because of model performance, but because the training and test sets are now created using random sampling.
Up to now, tidymodels may seem like it requires more code than the manual approach, and the previous example does not yet clearly demonstrate why we should use it. Indeed, for very simple cases like a basic linear regression on a clean dataset, a manual approach may be preferred. However, there are several important points to consider:
The dataset we used was already very clean: there were no missing values, no unusual ordering of rows, and no problematic data types. Most real-world datasets are messier—they often contain missing values, inconsistent strings, outliers, or poorly formatted data. Tidymodels provides a structured framework for preprocessing that integrates seamlessly with model fitting, making these steps more consistent and less error-prone.
We did not need to apply any transformations, such as scaling or log transformations, which can improve model performance. Tidymodels allows these preprocessing steps to be applied systematically and consistently across datasets.
Linear regression has no hyperparameters, so there was no need for tuning. Many machine learning algorithms, however, such as random forests, have multiple hyperparameters that can greatly affect performance. Tidymodels provides tools to tune these parameters systematically without rewriting code for each case.
Different machine learning models often require different formulas, arguments, or fitting functions. For example, the syntax of lm() and kknn() differs significantly, requiring different inputs for training, testing, and prediction.
Complex models (e.g., random forests, gradient boosting, neural networks) require different specifications and formulas. Tidymodels unifies these into a consistent interface.
While simple examples are easy to implement manually, real projects often require adding new data, modifying features, or comparing multiple models. Tidymodels allows these extensions to be handled systematically, without breaking existing code.
In short, tidymodels may initially look more verbose, but it pays off when datasets are messy, models are complex, or experiments need to be reproducible, scalable, and consistent.
Recipes and Workflow
Let’s now look at a more complex example. In Chapter K-Nearest Neighbors, we used the dataset customer_churn and the kknn() function to fit a KNN model to the training set. In contrast to lm(), kknn() also requires the test set, reflecting the fact that it is a “lazy” learning method. Additionally, we needed to apply standardization after the initial split, to the training and test sets separately, because the original scale of the variables plays an important role in calculating distances between neighbors.
These two points highlight that this modeling approach differs from the one we used earlier. By accident, we could apply standardization before the split, in which case our results would be technically biased, as information from the test set would have influenced the training data. This is a subtle but important issue in machine learning: all preprocessing steps should be calculated only on the training set and then applied to the test set. Tidymodels, however, provides a feature called workflows, which handles this correctly and consistently, preventing such mistakes.
Definition
Data leakage refers to the phenomenon in which information from the test set “leaks” into the training process (or vice versa). When the two datasets are not kept fully independent in the code, the model may inadvertently gain access to information from the test set, leading to overly optimistic performance during evaluation.
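The definition can be made concrete with a small base-R sketch using synthetic numbers: if the scaling parameters are computed on the full data, the test rows have already influenced them.

```r
# Synthetic data (values are illustrative)
set.seed(1)
x <- rnorm(10, mean = 50, sd = 10)

# A simple 7/3 split
x_train <- x[1:7]
x_test  <- x[8:10]

# Leaky: mean and sd computed on ALL values, including the test rows
leaky_test <- (x_test - mean(x)) / sd(x)

# Correct: mean and sd computed on the training rows only
correct_test <- (x_test - mean(x_train)) / sd(x_train)

# The two versions differ, because the leaky one "peeked" at the test rows
leaky_test
correct_test
```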
To understand the difference (and the benefits of workflows), let’s repeat the machine learning approach we followed in Chapter K-Nearest Neighbors. The full code is as follows:
# Libraries
library(tidyverse)
library(kknn)

# Importing customer_churn
customer_churn <- read_csv("https://raw.githubusercontent.com/DataKortex/Data-Sets/refs/heads/main/customer_churn.csv")

# Pre-processing the dataset
customer_churn <- customer_churn %>%
  select(Recency, Frequency, Monetary_Value, Churn) %>%
  mutate(Churn_Label = as.factor(if_else(Churn == 1, "Churn", "No Churn")))

# Training and test set split
training_set <- customer_churn %>%
  slice(1:3000)

test_set <- customer_churn %>%
  slice(3001:4000)

# Standardization function
standardization <- function(x) {
  return((x - mean(x)) / sd(x))
}

# Applying standardization on the predictors in the training_set
training_set <- training_set %>%
  mutate(
    Recency = standardization(Recency),
    Frequency = standardization(Frequency),
    Monetary_Value = standardization(Monetary_Value)
  )

# Applying standardization on the predictors in the test_set
test_set <- test_set %>%
  mutate(
    Recency = standardization(Recency),
    Frequency = standardization(Frequency),
    Monetary_Value = standardization(Monetary_Value)
  )

# KNN classifier with k equal to 2
knn_classifier <- kknn(Churn_Label ~ Recency + Frequency + Monetary_Value,
                       train = training_set, test = test_set, k = 2)

# Calculating accuracy
mean(knn_classifier$fitted.values == test_set$Churn_Label)
[1] 0.815
Now, let’s try to follow the same modeling approach using tidymodels. The following code sets up a KNN model using the “kknn” engine with 2 neighbors, sets the mode to "classification", and splits the data into training and test sets. Note that because we set the mode to "classification", the data type of the target variable should be a factor, not numeric. For this reason, we use the variable Churn_Label for this classification task:
# Initiating engine
knn_spec <- nearest_neighbor(neighbors = 2) %>%
  set_engine("kknn") %>%
  set_mode("classification")

# Splitting customer_churn
split_data <- initial_split(customer_churn, prop = 0.75, strata = Churn_Label)

# Extracting the training set
training_set <- split_data %>%
  training()

# Extracting the test set
test_set <- split_data %>%
  testing()
Using Strata for Balanced Splits
The strata argument in the initial_split() function ensures a balanced split between the training and test sets based on the specified variable. This is especially important for unbalanced datasets, where the outcome variable is unevenly distributed. Using strata helps guarantee that both the training and test sets reflect the same distribution of the target variable. In very large datasets, random sampling would theoretically achieve a similar balance, so strata may not be strictly necessary. However, there is generally little downside to using it (Kuhn & Silge, 2022).
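To see the effect of strata, we can compare the class proportions in the two sets. The following is a sketch, assuming the split above has been run:

```r
# Class proportions in the training set
training_set %>%
  count(Churn_Label) %>%
  mutate(prop = n / sum(n))

# Class proportions in the test set; with strata, these should be very close
test_set %>%
  count(Churn_Label) %>%
  mutate(prop = n / sum(n))
```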
The next step is to apply standardization to the two datasets. With tidymodels, we can perform such transformations using recipes. Recipes are perhaps one of the features that make tidymodels not only efficient but also fun! Just as a chef follows a recipe in the kitchen, we create feature engineering recipes, meaning we prepare our dataset for training. At this point, we only use the training set, ensuring that all preprocessing steps are calculated without “peeking” at the test data.
Using functions that start with step_*, we can define sequential preprocessing steps. These steps are conceptually similar to the tidyverse functions we discussed in Chapter Data Manipulation. For example, mutate() allows us to transform variables; in recipes, we have step_mutate() for the same purpose, but applied in a structured, reusable pipeline. Because there are so many recipe steps, we cannot cover all of them in this chapter. However, the official tidymodels website provides a complete list of available steps and their descriptions: https://www.tidymodels.org/find/recipes/.
Let’s stick to scaling for now and use the variables Recency, Frequency, and Monetary_Value. To apply standardization, we use the recipe function step_normalize(). Inside the recipe, we can either include individual predictors by name, or select all of them by type. Because all predictors in our example are numeric, we use all_numeric_predictors():
# Creating recipe
knn_recipe <- recipe(Churn_Label ~ Recency + Frequency + Monetary_Value,
                     data = training_set) %>%
  step_normalize(all_numeric_predictors())
This creates a recipe object that contains all the instructions for preprocessing, but has not yet applied them to the data. The recipe object merely stores the steps we want to apply to the training and test sets.
Normalization vs. Standardization in Recipes
In Chapter Scaling, we mentioned that normalization usually refers to min-max scaling. In recipes, however, normalization refers to standardization—subtracting the mean from each value and dividing by the standard deviation.
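The difference is easy to see with a small base-R sketch (the values are illustrative):

```r
x <- c(2, 4, 6, 8, 10)

# Min-max scaling: what "normalization" usually means elsewhere
minmax <- (x - min(x)) / (max(x) - min(x))

# Standardization: what step_normalize() actually applies
standardized <- (x - mean(x)) / sd(x)

minmax        # values between 0 and 1
standardized  # mean 0, standard deviation 1
```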
Selecting Predictors by Type in Recipes
Because we may want to preprocess different data types with different recipe steps, we can type all_ in RStudio to see all available options. For example, all_nominal_predictors() selects all character or factor variables.
To prepare the data according to the recipe, we use the function prep(), which calculates the necessary parameters (such as means and standard deviations) from the training set. Then, we use bake() to apply the recipe and extract the results:
# Preparing the recipe
knn_recipe_prepped <- prep(knn_recipe, training = training_set)

# Applying the recipe to the training set
training_prepared <- bake(knn_recipe_prepped, new_data = training_set)

# Applying the recipe to the test set
test_prepared <- bake(knn_recipe_prepped, new_data = test_set)

# Printing the preprocessed training set
training_prepared
# Printing the preprocessed test set
test_prepared
# A tibble: 1,000 × 4
Recency Frequency Monetary_Value Churn_Label
<dbl> <dbl> <dbl> <fct>
1 -0.0650 -0.501 -0.544 No Churn
2 0.633 -0.649 -0.520 Churn
3 -0.301 -0.169 0.520 No Churn
4 -0.684 0.458 0.124 No Churn
5 -0.144 -0.243 -0.158 No Churn
6 -0.331 -0.354 -0.359 No Churn
7 -0.350 -0.0956 0.220 No Churn
8 -0.635 -0.0587 -0.196 No Churn
9 1.26 -0.354 -0.444 Churn
10 -0.655 0.384 0.663 No Churn
# ℹ 990 more rows
By separating definition (recipe() and step_*) from application (prep() and bake()), tidymodels ensures that preprocessing is consistent, reproducible, and learned only from the training set. This approach prevents the accidental leakage of information from the test set and makes it easy to add new preprocessing steps later without rewriting code.
Now that we have defined the KNN model and the preprocessed recipe, the next step is to combine everything into a workflow. A workflow connects preprocessing and model in a single object, ensuring that all steps are applied consistently. To create a workflow, we use the function workflow() and the add_*() functions to insert the model and the recipe we created earlier:
# Combining model and recipe into a workflow
knn_workflow <- workflow() %>%
  add_model(knn_spec) %>%
  add_recipe(knn_recipe)

# Printing workflow output
knn_workflow
The output shows the chosen machine learning method, the preprocessing steps, and the mode. At this point, we have just added the recipe object to the workflow. When using a workflow, there is no need to call prep() and bake() manually—these preprocessing steps are applied automatically when the workflow is fitted. We can now fit the workflow on the training data, and tidymodels will apply the recipe steps to the training set before fitting the KNN model:
# Fitting the model
knn_fit <- fit(knn_workflow, data = training_set)
Once the model is fitted, we can generate predictions on the test set using the predict() function. Tidymodels ensures that the same preprocessing steps are applied to the test set before predicting:
# Making predictions on the test set
knn_predictions <- predict(knn_fit, new_data = test_set)

# Printing the first 6 rows
knn_predictions %>%
  head()
# A tibble: 6 × 1
.pred_class
<fct>
1 No Churn
2 Churn
3 No Churn
4 No Churn
5 Churn
6 No Churn
The result is a tibble containing the predicted classes for the test set. Because the predictions are returned in the same row order as the test set, we can add them back to the test set to evaluate the results using the bind_cols() function:
# Adding predictions to the test set
test_results <- test_set %>%
  bind_cols(knn_predictions)

# Printing the first 6 rows
test_results %>%
  head()
# A tibble: 6 × 6
Recency Frequency Monetary_Value Churn Churn_Label .pred_class
<dbl> <dbl> <dbl> <dbl> <fct> <fct>
1 96 24 1608. 0 No Churn No Churn
2 167 20 1913. 1 Churn Churn
3 72 33 15183. 0 No Churn No Churn
4 33 50 10136. 0 No Churn No Churn
5 88 31 6531. 0 No Churn Churn
6 69 28 3967. 0 No Churn No Churn
This approach guarantees that preprocessing (standardization) is applied exactly the same way for both training and test sets, preventing information leakage and making the analysis fully reproducible.
We can check the accuracy of the results using the following code:
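One way to do this is with the accuracy() function from yardstick (loaded with tidymodels), applied to the test_results tibble created above:

```r
# Calculating classification accuracy on the test set
test_results %>%
  accuracy(truth = Churn_Label, estimate = .pred_class)
```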
In this chapter, we introduced the tidy modeling approach, comparing it with the manual approaches we followed in previous chapters and discussing how it provides a solid framework for combining different machine learning methods with preprocessing steps, while ensuring reproducibility, consistency, and protection against data leakage.
It is understandable that the case for adopting the tidymodels framework may not yet be fully clear; it will become more apparent in the next chapters, where we explore more complex datasets and advanced modeling techniques. Our goal in this chapter was to make a smooth transition from the previous manual approaches to the tidymodels framework, focusing on recipes and workflows. With these foundations in place, you are now ready to build more sophisticated, scalable, and reproducible machine learning pipelines.