In previous chapters, we discussed a range of machine learning methods, with the goal of understanding what happens behind the scenes: their assumptions, as well as their advantages and limitations. However, each of these methods typically relies on its own package, syntax, and set of arguments. In practice, this can make it challenging to apply multiple methods to the same dataset in a consistent way.
Although it is technically possible to work directly with these individual packages, it is often more practical and efficient to rely on a well-established framework that brings them together under a unified interface.
In this chapter, we focus on such a framework that integrates seamlessly with the tidyverse: tidymodels. Tidymodels can be seen as the tidyverse counterpart for machine learning. The underlying idea is straightforward: just as we use tidy data principles and pipe-based workflows to manipulate data, we can also construct complete machine learning workflows in a consistent, modular, and organized way.
Our goal in this chapter is to explore the differences between manual modeling and tidy modeling, and to examine the advantages and limitations of this approach.
Manual Modeling vs Tidy Modeling
In Chapter Linear Regression in Machine Learning, we loaded the dataset eshop_revenues and used linear regression to fit a model on a training set and make predictions on a test set. The full code was as follows:
# Libraries
library(tidyverse)

# Importing eshop_revenues
eshop_revenues <- read_csv("https://raw.githubusercontent.com/DataKortex/Data-Sets/refs/heads/main/eshop_revenues.csv")

# Splitting the dataset into training_set and test_set
training_set <- eshop_revenues %>%
  slice(1:100)

test_set <- eshop_revenues %>%
  slice(101:130)

# Fitting a linear model in the training_set using all predictors
lm_model_simple <- lm(Revenue ~ ., data = training_set)

# Making predictions on the test_set
lm_predictions_simple <- predict(lm_model_simple, test_set)

# Printing results
lm_predictions_simple
We now follow a similar approach using tidymodels. To start, we need to install and load the tidymodels package. After loading it, we use the function tidymodels_prefer() to ensure that functions from tidymodels take priority in case of conflicts with other loaded packages. Like tidyverse, tidymodels also loads the readr package, so we can use read_csv() to import datasets. Since we have already imported eshop_revenues, though, we skip this step in the code below:
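A minimal sketch of this setup step could look like the following (the install.packages() call is commented out, assuming the package is already installed):

```r
# Installing tidymodels (only needed once)
# install.packages("tidymodels")

# Loading tidymodels and resolving function-name conflicts in its favor
library(tidymodels)
tidymodels_prefer()
```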
Some R packages contain functions with the same name, which can lead to unexpected behavior if the wrong version is used. For example, both the dplyr package (loaded by tidymodels) and the MASS package have a function named select(). While select() in dplyr is used to choose columns from a data frame, the one in MASS has a completely different purpose. By calling tidymodels_prefer(), we ensure that when such conflicts occur, the version from tidymodels (and its core packages like dplyr) is used by default. This avoids errors or unexpected results when manipulating data frames.
The next step is to split the dataset into training and test sets. In tidymodels, we use the function initial_split() for this purpose, which randomly partitions the data. This makes the previous use of slice() unnecessary. Random splitting is especially useful when the rows of a dataset are not already randomly ordered. We also set a seed for reproducibility and set the argument prop to 0.75 to keep 75% of the data in the training set:
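The split might be written as the following sketch; the specific seed value is illustrative, not taken from the original code:

```r
# Setting a seed for reproducibility (the value 123 is illustrative)
set.seed(123)

# Randomly splitting eshop_revenues: 75% training, 25% test
split_data <- initial_split(eshop_revenues, prop = 0.75)

# Printing the split summary
split_data
```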
The resulting object split_data is of class rsplit, which contains both the training and test sets. To extract these datasets, we use the functions training() and testing():
# Extracting the training set
training_set <- split_data %>%
  training()

# Extracting the test set
test_set <- split_data %>%
  testing()
Next, we specify the engine of the model. In tidymodels, the engine determines the computational implementation used to fit the model. For example, "lm" uses base R’s lm() function for linear regression, while "ranger" uses the ranger() function for random forest models. The mode indicates the type of prediction task (e.g., regression). The engine maps to a function internally, but we do not call the function or package directly—tidymodels handles the computation automatically. Let’s initiate the engine:
# Initiating linear regression engine
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")
The resulting object, lm_spec, is of class model_spec. It contains the model type and engine information but has not yet been fitted to the data. To fit the model, we use the fit() function and specify the formula:
# Fitting linear regression on the training_set
lm_fit <- lm_spec %>%
  fit(Revenue ~ ., data = training_set)
Notice that the fit() function performs similarly to lm() in fitting the model to the training data. However, the behavior of predict() is slightly different from what we would normally expect:
# Making predictions on the test set
predictions_tm <- predict(lm_fit, test_set)

# Printing the first 6 rows
predictions_tm %>%
  head(n = 6)
The output of predict() is now a tibble instead of a numeric vector. Although this seems like a minor difference, it reflects the tidy modeling philosophy: all outputs are treated as structured data frames, facilitating further analysis and integration with the rest of the tidy workflow. The predicted values are returned in the same row order as the observations in the test set.
Finally, we assess the model predictions using the mean absolute error (MAE):
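One way to compute the MAE is with the mae() function from yardstick (loaded with tidymodels); the sketch below assumes the predictions_tm tibble created above:

```r
# Combining the actual revenues with the predictions
lm_results <- test_set %>%
  select(Revenue) %>%
  bind_cols(predictions_tm)

# Calculating the mean absolute error
lm_results %>%
  mae(truth = Revenue, estimate = .pred)
```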
The MAE obtained may differ from the previous manual approach, not because of model performance, but because the training and test sets are now created using random sampling.
Up to now, tidymodels may seem like it requires more code than the manual approach, and the previous example does not yet clearly demonstrate why we should use it. Indeed, for very simple cases like a basic linear regression on a clean dataset, a manual approach may be preferred. However, there are several important points to consider:
The dataset we used was already very clean: there were no missing values, no unusual ordering of rows, and no problematic data types. Most real-world datasets are messier—they often contain missing values, inconsistent strings, outliers, or poorly formatted data. Tidymodels provides a structured framework for preprocessing that integrates seamlessly with model fitting, making these steps more consistent and less error-prone.
We did not need to apply any transformations, such as scaling or log transformations, which can improve model performance. Tidymodels allows these preprocessing steps to be applied systematically and consistently across datasets.
Linear regression has no hyperparameters, so there was no need for tuning. Many machine learning algorithms, however, such as random forests, have multiple hyperparameters that can greatly affect performance. Tidymodels provides tools to tune these parameters systematically without rewriting code for each case.
Different machine learning models often require different formulas, arguments, or fitting functions. For example, the syntax of lm() and kknn() differs significantly, requiring different inputs for training, testing, and prediction.
Complex models (e.g., random forests, gradient boosting, neural networks) require different specifications and formulas. Tidymodels unifies these into a consistent interface.
While simple examples are easy to implement manually, real projects often require adding new data, modifying features, or comparing multiple models. Tidymodels allows these extensions to be handled systematically, without breaking existing code.
In short, tidymodels may initially look more verbose, but it pays off when datasets are messy, models are complex, or experiments need to be reproducible, scalable, and consistent.
Recipes and Workflow
Let’s now look at a more complex example. In Chapter K-Nearest Neighbors, we used the dataset customer_churn and the kknn() function to fit a KNN model to the training set. In contrast to lm(), kknn() also requires the test set, reflecting the fact that it is a “lazy” learning method. Additionally, we needed to apply standardization after the initial split, to the training and test sets separately, because the original scale of the variables plays an important role in calculating distances between neighbors.
These two points highlight that this modeling approach differs from the one we used earlier. By accident, we could apply standardization before the split, in which case our results would be technically biased, as information from the test set would have influenced the training data. This is a subtle but important issue in machine learning: all preprocessing steps should be calculated only on the training set and then applied to the test set. Tidymodels, however, provides a feature called workflows, which handles this correctly and consistently, preventing such mistakes.
Definition
Data leakage refers to the phenomenon in which information from the test set “leaks” into the training process (or vice versa). When the two datasets are not kept fully independent in the code, the model may inadvertently gain access to information from the test set, leading to overly optimistic performance during evaluation.
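The definition can be made concrete with a small base-R sketch using synthetic numbers: if the scaling parameters are computed on the full data, the test rows have already influenced them.

```r
# Synthetic data (values are illustrative)
set.seed(1)
x <- rnorm(10, mean = 50, sd = 10)

# A simple 7/3 split
x_train <- x[1:7]
x_test  <- x[8:10]

# Leaky: mean and sd computed on ALL values, including the test rows
leaky_test <- (x_test - mean(x)) / sd(x)

# Correct: mean and sd computed on the training rows only
correct_test <- (x_test - mean(x_train)) / sd(x_train)

# The two versions differ, because the leaky one "peeked" at the test rows
leaky_test
correct_test
```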
To understand the difference (and the benefits of workflows), let’s repeat the machine learning approach we followed in Chapter K-Nearest Neighbors. The full code is as follows:
# Libraries
library(tidyverse)
library(kknn)

# Importing customer_churn
customer_churn <- read_csv("https://raw.githubusercontent.com/DataKortex/Data-Sets/refs/heads/main/customer_churn.csv")

# Pre-processing the dataset
customer_churn <- customer_churn %>%
  select(Recency, Frequency, Monetary_Value, Churn) %>%
  mutate(Churn_Label = as.factor(if_else(Churn == 1, "Churn", "No Churn")))

# Training and test set split
training_set <- customer_churn %>%
  slice(1:3000)

test_set <- customer_churn %>%
  slice(3001:4000)

# Standardization function
standardization <- function(x) {
  return((x - mean(x)) / sd(x))
}

# Applying standardization on the predictors in the training_set
training_set <- training_set %>%
  mutate(
    Recency = standardization(Recency),
    Frequency = standardization(Frequency),
    Monetary_Value = standardization(Monetary_Value)
  )

# Applying standardization on the predictors in the test_set
test_set <- test_set %>%
  mutate(
    Recency = standardization(Recency),
    Frequency = standardization(Frequency),
    Monetary_Value = standardization(Monetary_Value)
  )

# KNN classifier with k equal to 2
knn_classifier <- kknn(Churn_Label ~ Recency + Frequency + Monetary_Value,
                       train = training_set, test = test_set, k = 2)

# Calculating accuracy
mean(knn_classifier$fitted.values == test_set$Churn_Label)
[1] 0.815
Now, let’s try to follow the same modeling approach using tidymodels. The following code sets up a KNN model using the “kknn” engine with 2 neighbors, sets the mode to "classification", and splits the data into training and test sets. Note that because we set the mode to "classification", the data type of the target variable should be a factor, not numeric. For this reason, we use the variable Churn_Label for this classification task:
# Initiating engine
knn_spec <- nearest_neighbor(neighbors = 2) %>%
  set_engine("kknn") %>%
  set_mode("classification")

# Splitting customer_churn
split_data <- initial_split(customer_churn, prop = 0.75, strata = Churn_Label)

# Extracting the training set
training_set <- split_data %>%
  training()

# Extracting the test set
test_set <- split_data %>%
  testing()
Using Strata for Balanced Splits
The strata argument in the initial_split() function ensures a balanced split between the training and test sets based on the specified variable. This is especially important for unbalanced datasets, where the outcome variable is unevenly distributed. Using strata helps guarantee that both the training and test sets reflect the same distribution of the target variable. In very large datasets, random sampling would theoretically achieve a similar balance, so strata may not be strictly necessary. However, there is generally little downside to using it (Kuhn & Silge, 2022).
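To see the effect of strata, we can compare the class proportions in the two sets. The following is a sketch, assuming the split above has been run:

```r
# Class proportions in the training set
training_set %>%
  count(Churn_Label) %>%
  mutate(prop = n / sum(n))

# Class proportions in the test set; with strata, these should be very close
test_set %>%
  count(Churn_Label) %>%
  mutate(prop = n / sum(n))
```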
The next step is to apply standardization to the two datasets. With tidymodels, we can perform such transformations using recipes. Recipes are perhaps one of the features that make tidymodels not only efficient but also fun! Just as a chef follows a recipe in the kitchen, we create feature engineering recipes, meaning we prepare our dataset for training. At this point, we only use the training set, ensuring that all preprocessing steps are calculated without “peeking” at the test data.
Using functions that start with step_*, we can define sequential preprocessing steps. These steps are conceptually similar to the tidyverse functions we discussed in Chapter Data Manipulation. For example, mutate() allows us to transform variables; in recipes, we have step_mutate() for the same purpose, but applied in a structured, reusable pipeline. Because there are so many recipe steps, we cannot cover all of them in this chapter. However, the official tidymodels website provides a complete list of available steps and their descriptions: https://www.tidymodels.org/find/recipes/.
Let’s stick to scaling for now and use the variables Recency, Frequency, and Monetary_Value. To apply standardization, we use the recipe function step_normalize(). Inside the recipe, we can either include individual predictors by name, or select all of them by type. Because all predictors in our example are numeric, we use all_numeric_predictors():
# Creating recipe
knn_recipe <- recipe(Churn_Label ~ Recency + Frequency + Monetary_Value,
                     data = training_set) %>%
  step_normalize(all_numeric_predictors())
This creates a recipe object that contains all the instructions for preprocessing, but has not yet applied them to the data. The recipe object merely stores the steps we want to apply to the training and test sets.
Normalization vs. Standardization in Recipes
In Chapter Scaling, we mentioned that normalization usually refers to min-max scaling. In recipes, however, normalization refers to standardization—subtracting the mean from each value and dividing by the standard deviation.
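The difference is easy to see with a small base-R sketch (the values are illustrative):

```r
x <- c(2, 4, 6, 8, 10)

# Min-max scaling: what "normalization" usually means elsewhere
minmax <- (x - min(x)) / (max(x) - min(x))

# Standardization: what step_normalize() actually applies
standardized <- (x - mean(x)) / sd(x)

minmax        # values between 0 and 1
standardized  # mean 0, standard deviation 1
```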
Selecting Predictors by Type in Recipes
Because we may want to preprocess different data types with different recipe steps, we can type all_ in RStudio to see all available options. For example, all_nominal_predictors() selects all character or factor variables.
To prepare the data according to the recipe, we use the function prep(), which calculates the necessary parameters (such as means and standard deviations) from the training set. Then, we use bake() to apply the recipe and extract the results:
# Preparing the recipe
knn_recipe_prepped <- prep(knn_recipe, training = training_set)

# Applying the recipe to the training set
training_prepared <- bake(knn_recipe_prepped, new_data = training_set)

# Applying the recipe to the test set
test_prepared <- bake(knn_recipe_prepped, new_data = test_set)

# Printing the preprocessed training set
training_prepared
# Printing the preprocessed test set
test_prepared
# A tibble: 1,000 × 4
Recency Frequency Monetary_Value Churn_Label
<dbl> <dbl> <dbl> <fct>
1 -0.0650 -0.501 -0.544 No Churn
2 0.633 -0.649 -0.520 Churn
3 -0.301 -0.169 0.520 No Churn
4 -0.684 0.458 0.124 No Churn
5 -0.144 -0.243 -0.158 No Churn
6 -0.331 -0.354 -0.359 No Churn
7 -0.350 -0.0956 0.220 No Churn
8 -0.635 -0.0587 -0.196 No Churn
9 1.26 -0.354 -0.444 Churn
10 -0.655 0.384 0.663 No Churn
# ℹ 990 more rows
By separating definition (recipe() and step_*) from application (prep() and bake()), tidymodels ensures that preprocessing is consistent, reproducible, and learned only from the training set. This approach prevents the accidental leakage of information from the test set and makes it easy to add new preprocessing steps later without rewriting code.
Now that we have defined the KNN model and the preprocessed recipe, the next step is to combine everything into a workflow. A workflow connects preprocessing and model in a single object, ensuring that all steps are applied consistently. To create a workflow, we use the function workflow() and the add_*() functions to insert the model and the recipe we created earlier:
# Combining model and recipe into a workflow
knn_workflow <- workflow() %>%
  add_model(knn_spec) %>%
  add_recipe(knn_recipe)

# Printing workflow output
knn_workflow
The output shows the chosen machine learning method, the preprocessing steps, and the mode. At this point, we have just added the recipe object to the workflow. When using a workflow, there is no need to call prep() and bake() manually—these preprocessing steps are applied automatically when the workflow is fitted. We can now fit the workflow on the training data, and tidymodels will apply the recipe steps to the training set before fitting the KNN model:
# Fitting the model
knn_fit <- fit(knn_workflow, data = training_set)
Once the model is fitted, we can generate predictions on the test set using the predict() function. Tidymodels ensures that the same preprocessing steps are applied to the test set before predicting:
# Making predictions on the test set
knn_predictions <- predict(knn_fit, new_data = test_set)

# Printing the first 6 rows
knn_predictions %>%
  head()
# A tibble: 6 × 1
.pred_class
<fct>
1 No Churn
2 Churn
3 No Churn
4 No Churn
5 Churn
6 No Churn
The result is a tibble containing the predicted classes for the test set. Because the predictions are returned in the same row order as the test set, we can add them back to the test set to evaluate the results using the bind_cols() function:
# Adding predictions to the test set
test_results <- test_set %>%
  bind_cols(knn_predictions)

# Printing the first 6 rows
test_results %>%
  head()
# A tibble: 6 × 6
Recency Frequency Monetary_Value Churn Churn_Label .pred_class
<dbl> <dbl> <dbl> <dbl> <fct> <fct>
1 96 24 1608. 0 No Churn No Churn
2 167 20 1913. 1 Churn Churn
3 72 33 15183. 0 No Churn No Churn
4 33 50 10136. 0 No Churn No Churn
5 88 31 6531. 0 No Churn Churn
6 69 28 3967. 0 No Churn No Churn
This approach guarantees that preprocessing (standardization) is applied exactly the same way for both training and test sets, preventing information leakage and making the analysis fully reproducible.
We can check the accuracy of the results using the following code:
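One way to do this is with the accuracy() function from yardstick (loaded with tidymodels), applied to the test_results tibble created above:

```r
# Calculating classification accuracy on the test set
test_results %>%
  accuracy(truth = Churn_Label, estimate = .pred_class)
```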
In this chapter, we introduced the tidy modeling approach, comparing it with the manual approaches we followed in previous chapters and discussing how it provides a solid framework for combining different machine learning methods with preprocessing steps, while ensuring reproducibility, consistency, and protection against data leakage.
It is understandable that the case for adopting the tidymodels framework may not yet be fully clear; it will become more apparent in the next chapters, where we explore more complex datasets and advanced modeling techniques. Our goal in this chapter was to make a smooth transition from the previous manual approaches to the tidymodels framework, focusing on recipes and workflows. With these foundations in place, you are now ready to build more sophisticated, scalable, and reproducible machine learning pipelines.