# Linear Regression in Machine Learning

## Introduction

In Chapters Simple Linear Regression and Multiple Linear Regression, we discussed how linear regression can be used for statistical inference. Specifically, we showed that, under the Gauss–Markov assumptions, the estimators are BLUE (Best Linear Unbiased Estimators), and we can use them to assess whether the association between an independent variable and the dependent variable is statistically significant.

However, linear regression is not limited to statistical inference; it can also be applied as a machine learning tool. In the previous chapter, we introduced linear regression in the context of supervised learning, where the primary goal is to generate accurate predictions for a target variable.

In this chapter, we focus on using linear regression from a predictive perspective. We demonstrate how to train a model on one dataset and evaluate its performance on new, unseen data. Along the way, we show that the model best suited for prediction may differ from the one best suited for statistical inference, and that choices such as adding predictors or interactions can substantially affect predictive performance.

## Simple Linear Regression in Machine Learning

Let’s start by loading the packages and importing the eshop_revenues dataset:

```r
# Libraries
library(tidyverse)
library(broom)
# Importing eshop_revenues
eshop_revenues <- read_csv("https://raw.githubusercontent.com/DataKortex/Data-Sets/refs/heads/main/eshop_revenues.csv")
```
Technically, we should first proceed with exploratory data analysis, data pre-processing, and feature engineering. However, we are already familiar with this dataset from previous chapters, so, to keep things simple, we focus on the training and test phase. Since our task is to make good predictions, we must ensure that these predictions are evaluated on data that the model has not seen during estimation (training).

There are many ways to split a dataset into a training set and a test set. For simplicity, we use the slice() function, which is appropriate here because each row is fully independent of the others; the rows have no particular order, such as being sorted by revenue.

So, we keep the first 100 rows (approximately 77% of the original dataset) as a training set and leave the remaining 30 rows as a test set. There is no universal rule about the relative size of the training set.
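Because our rows have no particular order, taking the first 100 works here; with ordered data, a random split is safer. Below is a minimal sketch of such a random split using a synthetic stand-in tibble (the names eshop, id, and the values are illustrative assumptions, not the real dataset):

```r
library(dplyr)

set.seed(42)  # make the random split reproducible
# Synthetic stand-in for eshop_revenues (values are illustrative only)
eshop <- tibble(id = 1:130, Revenue = rnorm(130, mean = 20000, sd = 5000))

# Randomly draw 100 rows for training; the remaining 30 form the test set
training <- eshop %>% slice_sample(n = 100)
testing  <- eshop %>% anti_join(training, by = "id")

nrow(training)  # 100
nrow(testing)   # 30
```

The anti_join() guarantees that no row appears in both sets, which is the essential property of any train/test split.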
```r
# Splitting the dataset into training_set and test_set
training_set <- eshop_revenues %>% slice(1:100)
test_set <- eshop_revenues %>% slice(101:130)
```

Just like we did in Chapter Simple Linear Regression, we fit a simple linear regression model on the training set using the lm() function, taking Ad_Spend as the independent variable and Revenue as the dependent variable:
```r
# Fitting a linear model in the training_set
lm_model_simple <- lm(Revenue ~ Ad_Spend, data = training_set)
```

The slope and intercept of this model are the following:
```r
# Model summary
lm_model_simple %>% summary()
```

```
Call:
lm(formula = Revenue ~ Ad_Spend, data = training_set)

Residuals:
   Min     1Q Median     3Q    Max 
-37687  -7172  -3323   6734  47172 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6391.58    1764.21   3.623 0.000464 ***
Ad_Spend       50.53       6.45   7.835  5.7e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12190 on 98 degrees of freedom
Multiple R-squared:  0.3851,	Adjusted R-squared:  0.3789 
F-statistic: 61.38 on 1 and 98 DF,  p-value: 5.704e-12
```
These estimates differ from those obtained using the full dataset. Both the intercept and the slope are statistically significantly different from 0 at the 1% significance level, indicating that the predictor has a meaningful relationship with the response variable. However, we are not focused on hypothesis testing here but on prediction accuracy. Mathematically, we estimated the following:

\[ \widehat{\text{Revenue}} = 6391.58 + 50.53 \times \text{AdSpend} \]
To test the model’s accuracy, we use it to make predictions on the test set. This is possible because the test set contains all the variables (features) from the original dataset. For example, if a test set observation has Ad_Spend = 100, the predicted revenue will be:
\[ 6391.58 + 50.53 \times 100 = 11444.58 \]
As with every regression task, we will never predict the actual revenues exactly; the goal is to produce predictions as close as possible to the true values.
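The by-hand calculation above (intercept plus slope times the feature value) can be checked against R's predict() function. The toy data below are synthetic, not the eshop figures:

```r
set.seed(1)
# Synthetic data: y is roughly 3 + 2x plus noise
d <- data.frame(x = 1:50, y = 3 + 2 * (1:50) + rnorm(50))
m <- lm(y ~ x, data = d)

# Manual prediction at x = 100: intercept + slope * x
manual <- unname(coef(m)[1] + coef(m)[2] * 100)
# Same prediction via predict()
auto <- unname(predict(m, newdata = data.frame(x = 100)))

all.equal(manual, auto)  # TRUE
```

For a simple linear model the two computations coincide; predict() becomes essential once the model has many terms and interactions.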
We can use the lm_model_simple object to make predictions on the test set with the predict() function. This function takes the model object as its first argument, followed by the dataset on which we want to make predictions:
```r
# Making predictions on the test_set
lm_predictions_simple <- predict(lm_model_simple, test_set)
# Printing results
lm_predictions_simple
```

```
        1         2         3         4         5         6         7         8 
 9182.152 10078.651 15360.620 13355.877 13341.222 35352.455  9931.593  9257.955 
        9        10        11        12        13        14        15        16 
13053.170 23338.656 14540.429 11207.614 20178.672 22030.292 12782.299  9881.563 
       17        18        19        20        21        22        23        24 
13516.580 10645.154 14371.640 14690.014 17392.651 12526.084 18915.790 13696.486 
       25        26        27        28        29        30 
10519.826  9116.456 24820.861  9283.728 11876.704  9480.311 
```
The output of predict() is a vector of predicted values, in the same order as the test_set tibble. This order is important because we can directly use the mutate() function to add the predictions to the test set:
```r
# Adding the predicted values to the test_set
test_set <- test_set %>%
  mutate(Predicted_Revenues = lm_predictions_simple)
# Printing the first five rows
test_set %>% head(n = 5)
```

```
# A tibble: 5 × 8
  Ad_Spend Website_Visitors Product_Quality_Score Customer_Satisfaction Price
     <dbl>            <dbl>                 <dbl>                 <dbl> <dbl>
1     55.2              925                   5.3                   7.4  24.7
2     73.0             1046                   6.6                   7.9  75.5
3    177.               964                   8.6                   8.4  57.5
4    138.              1125                   6.1                   7.4  85.4
5    138.              1750                   6.9                   7.9 146. 
# ℹ 3 more variables: New_Product <dbl>, Revenue <dbl>,
#   Predicted_Revenues <dbl>
```
Now, we have an extra column containing the predicted revenues, which we can use to assess prediction accuracy. Keeping the predictions in the same data frame makes it easier to compare predicted and actual values observation-by-observation and to identify where the model performs well or poorly.
At this point, the key question is how we can make such an assessment. We discuss more advanced evaluation metrics in later chapters, but one simple and intuitive metric is the mean absolute error (MAE). This is the average absolute difference between the predicted revenues and the actual revenues:
\[ \text{MAE} = \frac{1}{N} \sum^N_{i = 1}|y_i - \hat{y}_i| \]
This metric is expressed in the same units as the target variable (revenue) and measures, on average, how far predictions are from the true values. Therefore, the lower the value of MAE, the better our predictions are on average. Because it uses absolute values, over-predictions (predicted revenues higher than actual) and under-predictions (predicted revenues lower than actual) do not cancel each other out, which makes the metric easy to interpret.
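The formula translates directly into a one-line helper function (a sketch; in the rest of the chapter we compute the same quantity with summarize() on the tibble):

```r
# MAE: average absolute difference between actual and predicted values
mae <- function(actual, predicted) mean(abs(actual - predicted))

# Over- and under-predictions both contribute positively
mae(c(100, 200, 300), c(110, 190, 310))  # 10
```

Here one over-prediction (+10) and two under-predictions (-10 each) would cancel to -10/3 under a signed average, but MAE correctly reports an average error of 10.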
```r
# Calculating MAE
test_set %>%
  summarize(MAE = mean(abs(Revenue - Predicted_Revenues)))
```

```
# A tibble: 1 × 1
    MAE
  <dbl>
1 9676.
```
The result shows that, on average, our predictions differ from the actual revenues by approximately €9,700. Whether this is a “good” result depends on the context: the scale of revenues, the business application, and the cost of prediction errors. In machine learning, model performance is always relative to the problem being solved.
As a baseline comparison, we can consider a very simple model: using only the average revenue of the training set as the prediction for every observation in the test set. In this case, regardless of the advertising spend, the predicted revenue is always the same constant value.
```r
# Computing the average revenue from the training set
average_revenue <- training_set %>%
  summarize(Average_Revenue = mean(Revenue)) %>%
  pull(Average_Revenue)
# Calculating MAE
test_set %>% summarize(MAE = mean(abs(Revenue - average_revenue)))
```

```
# A tibble: 1 × 1
     MAE
   <dbl>
1 13255.
```
The MAE is now much higher. This shows that even a simple linear regression model can produce substantially better predictions than a naive baseline that ignores all features and uses only the average.
## Multiple Linear Regression in Machine Learning
Up to now, we have not introduced any entirely new concept. Instead of creating a linear regression model using the whole original dataset, we split the dataset into two parts: one to estimate coefficients and the other to make predictions and assess accuracy. We continue the same approach here to see how including more predictors can affect prediction performance.
This time, we add the feature Product_Quality_Score to the model and train a multiple linear regression:
```r
# Fitting a linear model in the training_set
lm_model_multiple <- lm(Revenue ~ Ad_Spend + Product_Quality_Score, data = training_set)
```

We follow the same steps as before to calculate the MAE on the test set:
```r
# Making predictions on the test_set
lm_predictions <- predict(lm_model_multiple, test_set)
# Adding the predicted values to the test_set
test_set <- test_set %>%
  mutate(Predicted_Revenues = lm_predictions)
# Calculating MAE
test_set %>% summarize(MAE = mean(abs(Revenue - Predicted_Revenues)))
```

```
# A tibble: 1 × 1
    MAE
  <dbl>
1 6651.
```
This time, the MAE is even lower! Adding the additional predictor improves the model’s predictions. Can we do even better? Let’s create a more advanced model by including both Ad_Spend and Product_Quality_Score along with their interaction:
```r
# Fitting a model with interaction
lm_model_interaction <- lm(Revenue ~ Ad_Spend * Product_Quality_Score,
                           data = training_set)
# Making predictions on the test set
lm_predictions <- predict(lm_model_interaction, test_set)
test_set <- test_set %>%
  mutate(Predicted_Revenues = lm_predictions)
# Calculating MAE
test_set %>% summarize(MAE = mean(abs(Revenue - Predicted_Revenues)))
```

```
# A tibble: 1 × 1
    MAE
  <dbl>
1 6214.
```
The MAE decreases even further, showing that including the interaction of the two predictors improves model performance.
Now, let’s push things to the extreme by including all available features and their pairwise interactions, and then evaluate the model’s accuracy on the test set:
```r
# Fitting a model with all features and interactions
lm_model_all <- lm(Revenue ~ . * ., data = training_set)
# Making predictions on the test set
lm_predictions <- predict(lm_model_all, test_set)
test_set <- test_set %>%
  mutate(Predicted_Revenues = lm_predictions)
# Calculating MAE
test_set %>%
  summarize(MAE = mean(abs(Revenue - Predicted_Revenues)))
```

```
# A tibble: 1 × 1
    MAE
  <dbl>
1 4750.
```
The performance looks solid. On average, predicted revenues deviate by roughly €4,750 from the actual revenues.
Does this mean we should always include all features and interactions? Not necessarily. The No Free Lunch theorem and the bias–variance trade-off remind us that adding more features and interactions can increase variance and reduce generalization, leading to worse test accuracy.
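A quick synthetic illustration of this trade-off: as model complexity grows (here, polynomial degree on made-up data), the fit to the training data keeps improving, while the test error need not follow. The data and degrees below are illustrative assumptions, unrelated to the eshop dataset:

```r
set.seed(7)
# Synthetic data: a simple linear signal plus noise
x <- runif(200, 0, 10)
y <- 2 * x + rnorm(200, sd = 3)
train <- data.frame(x = x[1:150],  y = y[1:150])
test  <- data.frame(x = x[151:200], y = y[151:200])

mae <- function(actual, predicted) mean(abs(actual - predicted))

# Higher degree always fits the training data at least as closely,
# but the test MAE can move in the opposite direction
for (deg in c(1, 5, 15)) {
  m <- lm(y ~ poly(x, deg), data = train)
  cat("degree", deg,
      "train MAE:", round(mae(train$y, predict(m)), 2),
      "test MAE:",  round(mae(test$y, predict(m, newdata = test)), 2), "\n")
}
```

Running this a few times with different seeds makes the pattern clear: training error is an overly optimistic guide to how a model will perform on new data.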
To see this in action, let’s follow the same process as before—use all available features and their interactions, but this time exclude the feature New_Product:
```r
# Removing New_Product from the training set
training_set_except_new <- training_set %>%
  select(-New_Product)
# Fitting a model with all remaining features and interactions
lm_model_all_except_new <- lm(Revenue ~ . * .,
                              data = training_set_except_new)
# Making predictions on the test set
lm_predictions <- predict(lm_model_all_except_new, test_set)
test_set <- test_set %>%
  mutate(Predicted_Revenues = lm_predictions)
# Calculating MAE
test_set %>% summarize(MAE = mean(abs(Revenue - Predicted_Revenues)))
```

```
# A tibble: 1 × 1
    MAE
  <dbl>
1 4402.
```
Surprisingly, the MAE is even lower (approximately €4,400) when we use a model with fewer predictors. This is a classic example of the bias–variance trade-off: a more complex model may fit the training data very well (low bias) but generalize worse (high variance). No model is best for all situations; the No Free Lunch theorem reminds us that there is no single algorithm that outperforms all others in every problem.
We might think that this feature simply wasn’t important in the first place. Just like including irrelevant independent variables in a regression model for statistical inference can increase the standard errors of all coefficients without improving estimates (see Chapter Linear Regression – Statistical Inference), including such a feature in a machine learning model can hurt predictive performance.
But was New_Product really irrelevant? Let’s check the R-squared and the p-values of lm_model_all and lm_model_all_except_new:
```r
# Printing R-squared of lm_model_all
summary(lm_model_all)$r.squared
```

```
[1] 0.9187999
```

```r
# Printing p-values of lm_model_all
lm_model_all %>% tidy()
```

```
# A tibble: 22 × 5
   term                            estimate std.error statistic   p.value
   <chr>                              <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)                    50966.    22703.        2.24  0.0276   
 2 Ad_Spend                        -143.       41.4      -3.45  0.000908 
 3 Website_Visitors                 -20.1      20.4      -0.985 0.328    
 4 Product_Quality_Score          -6067.     3961.       -1.53  0.130    
 5 Customer_Satisfaction          -9255.     3392.       -2.73  0.00786  
 6 Price                             78.9      31.5       2.50  0.0144   
 7 New_Product                    11131.    12840.        0.867 0.389    
 8 Ad_Spend:Website_Visitors          0.0344     0.0114   3.03  0.00334  
 9 Ad_Spend:Product_Quality_Score    15.3       3.67      4.17  0.0000777
10 Ad_Spend:Customer_Satisfaction     9.56      4.25      2.25  0.0271   
# ℹ 12 more rows
```
```r
# Printing R-squared of lm_model_all_except_new
summary(lm_model_all_except_new)$r.squared
```

```
[1] 0.8850834
```

```r
# Printing summary of lm_model_all_except_new
lm_model_all_except_new %>% tidy()
```

```
# A tibble: 16 × 5
   term                                     estimate std.error statistic p.value
   <chr>                                       <dbl>     <dbl>     <dbl>   <dbl>
 1 (Intercept)                               3.94e+4   2.52e+4     1.56  1.21e-1
 2 Ad_Spend                                 -1.94e+2   4.18e+1    -4.65  1.24e-5
 3 Website_Visitors                         -2.05e+1   2.18e+1    -0.942 3.49e-1
 4 Product_Quality_Score                    -2.18e+3   4.42e+3    -0.493 6.23e-1
 5 Customer_Satisfaction                    -7.88e+3   3.81e+3    -2.07  4.16e-2
 6 Price                                     7.40e+1   3.46e+1     2.14  3.56e-2
 7 Ad_Spend:Website_Visitors                 4.08e-2   1.17e-2     3.48  7.85e-4
 8 Ad_Spend:Product_Quality_Score            1.88e+1   3.84e+0     4.90  4.67e-6
 9 Ad_Spend:Customer_Satisfaction            1.31e+1   4.61e+0     2.84  5.70e-3
10 Ad_Spend:Price                           -1.91e-1   3.95e-2    -4.83  6.01e-6
11 Website_Visitors:Product_Quality_Score   -2.50e+0   2.85e+0    -0.876 3.83e-1
12 Website_Visitors:Customer_Satisfaction    4.77e+0   2.87e+0     1.66  1.00e-1
13 Website_Visitors:Price                    5.08e-3   2.57e-2     0.198 8.44e-1
14 Product_Quality_Score:Customer_Satisfac…  8.10e+2   4.89e+2     1.66  1.01e-1
15 Product_Quality_Score:Price              -1.79e+0   5.42e+0    -0.331 7.42e-1
16 Customer_Satisfaction:Price              -9.89e+0   5.68e+0    -1.74  8.51e-2
```
Judging by in-sample fit, lm_model_all actually seemed better: not only was its R-squared (a goodness-of-fit metric) higher, but it also included three marginally statistically significant interaction effects. Nevertheless, lm_model_all_except_new achieved better accuracy on the test set. This highlights an important lesson: a model optimized for statistical inference is not necessarily the same as one optimized for prediction. Different purposes may require different models.
## Stepwise Regression
The previous section highlighted an important point—more predictors don’t always mean a better model. Deciding which features to include is not always straightforward, and manually testing all possible combinations can be tedious.
One approach to automate feature selection is stepwise regression (Hocking, 1976; Draper & Smith, 1998). Just as we experimented manually with different models before, stepwise regression systematically adds or removes predictors based on a selected criterion, helping to identify a model that balances complexity and predictive performance. By default, R uses a metric called the Akaike Information Criterion (AIC) to evaluate models, but there are other metrics we could use. There is no need to go into the details of AIC at this point—the main takeaway is that, similar to MAE, lower AIC values correspond to better models.
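R exposes this criterion directly through the AIC() function, so we can compare two candidate models by hand before automating the search. The data below are synthetic, with x2 deliberately constructed as pure noise:

```r
set.seed(3)
# Synthetic data: y depends on x1 only; x2 is irrelevant noise
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 2 * d$x1 + rnorm(100)

m1 <- lm(y ~ x1, data = d)
m2 <- lm(y ~ x1 + x2, data = d)

# Lower AIC is better; the noise predictor typically raises AIC
AIC(m1)
AIC(m2)
```

AIC penalizes each extra parameter, so adding a predictor that barely improves the fit usually makes the criterion worse, which is exactly the behavior stepwise regression exploits.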
The main approaches for adding and removing predictors are the following:

- **Forward Selection:** This approach starts with a minimal model, usually just an intercept. Then, it adds one predictor at a time, selecting the feature that improves the model the most according to the chosen criterion (e.g., lowers AIC). The process continues until adding any remaining predictor no longer improves the model.
- **Backward Elimination:** This approach starts with a full model containing all predictors and their interactions. It then removes one predictor at a time, dropping the variable that least contributes to the model. This continues until removing any more variables would worsen the model according to the selection criterion.
- **Both Directions (Stepwise):** This approach combines forward selection and backward elimination. At each step, the algorithm evaluates both adding and removing predictors, choosing the change that best improves the model according to the chosen criterion. More specifically, the algorithm may add a new predictor (not currently in the model) while simultaneously removing a predictor that is already included. This dual approach makes stepwise regression more flexible than the purely forward or backward procedures, often finding a model that balances predictive performance and complexity better than either of the other two approaches.
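As a small self-contained sketch of backward elimination, consider the synthetic data below, where x3 is constructed to carry no signal (the data and variable names are illustrative assumptions):

```r
set.seed(9)
# Synthetic data: y depends on x1 and x2; x3 is pure noise
d <- data.frame(x1 = rnorm(80), x2 = rnorm(80), x3 = rnorm(80))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(80)

full <- lm(y ~ ., data = d)
# Backward elimination: drop predictors as long as AIC improves
back <- step(full, direction = "backward", trace = FALSE)
formula(back)  # the noise predictor is usually dropped
```

Because the search starts from the full model, backward elimination can only ever remove terms; the "both" direction used below additionally allows a removed term to re-enter later.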
In R, stepwise regression is performed with the step() function. Besides the starting model, step() requires specifying the scope of the search—that is, the range of models to explore—by passing a list to the scope argument:

- `lower` corresponds to the simplest model (e.g., `null_model`).
- `upper` corresponds to the most complex model (e.g., `full_model`).

Additionally, we set the direction argument, which determines whether the algorithm performs forward selection, backward elimination, or both directions (stepwise). In our case, we use "both", so at each step the algorithm evaluates whether adding a new predictor or removing an existing one improves the model according to AIC. We also set trace = FALSE to suppress the detailed step-by-step output, keeping the console output cleaner.
Now, let’s apply stepwise regression to the training_set dataset:
```r
# Full model with all predictors
full_model <- lm(Revenue ~ . * ., data = training_set)
# Null model with only intercept
null_model <- lm(Revenue ~ 1, data = training_set)
# Stepwise regression
step_model <- step(null_model,
                   scope = list(lower = null_model, upper = full_model),
                   direction = "both",
                   trace = FALSE)
summary(step_model)
```
```
Call:
lm(formula = Revenue ~ Ad_Spend + Product_Quality_Score + Customer_Satisfaction + 
    Price + New_Product + Website_Visitors + Ad_Spend:Product_Quality_Score + 
    Ad_Spend:Price + Product_Quality_Score:Customer_Satisfaction + 
    Ad_Spend:New_Product + Customer_Satisfaction:Price + Price:New_Product + 
    Customer_Satisfaction:New_Product + Product_Quality_Score:New_Product, 
    data = training_set)

Residuals:
     Min       1Q   Median       3Q      Max 
-12037.1  -3043.2   -196.8   2229.8  12394.8 

Coefficients:
                                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                                  3.410e+04  1.922e+04   1.774  0.07962 .  
Ad_Spend                                    -3.148e+01  1.751e+01  -1.798  0.07578 .  
Product_Quality_Score                       -1.295e+04  3.036e+03  -4.265 5.16e-05 ***
Customer_Satisfaction                       -5.195e+03  2.513e+03  -2.067  0.04176 *  
Price                                        8.831e+01  2.802e+01   3.151  0.00224 ** 
New_Product                                  7.999e+03  1.284e+04   0.623  0.53489    
Website_Visitors                             4.736e+00  2.366e+00   2.001  0.04855 *  
Ad_Spend:Product_Quality_Score               1.599e+01  2.921e+00   5.475 4.35e-07 ***
Ad_Spend:Price                              -2.135e-01  3.420e-02  -6.242 1.63e-08 ***
Product_Quality_Score:Customer_Satisfaction  1.786e+03  3.975e+02   4.492 2.20e-05 ***
Ad_Spend:New_Product                         2.822e+01  8.020e+00   3.518  0.00070 ***
Customer_Satisfaction:Price                 -1.168e+01  3.851e+00  -3.032  0.00322 ** 
Price:New_Product                           -2.867e+01  1.683e+01  -1.703  0.09213 .  
Customer_Satisfaction:New_Product           -2.607e+03  1.410e+03  -1.849  0.06797 .  
Product_Quality_Score:New_Product            2.071e+03  1.142e+03   1.813  0.07342 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5111 on 85 degrees of freedom
Multiple R-squared:  0.9062,	Adjusted R-squared:  0.8908 
F-statistic: 58.68 on 14 and 85 DF,  p-value: < 2.2e-16
```
The step_model object is the final model selected by the algorithm. It is an object of the same class as models created with lm(), which means we can use it just like any linear model.
Let’s now use the step_model to make predictions on the test set:
```r
# Making predictions on the test set
lm_predictions <- predict(step_model, test_set)
test_set <- test_set %>%
  mutate(Predicted_Revenues = lm_predictions)
# Calculating MAE
test_set %>% summarize(MAE = mean(abs(Revenue - Predicted_Revenues)))
```

```
# A tibble: 1 × 1
    MAE
  <dbl>
1 4672.
```
Even though this stepwise-selected model provides a lower MAE than lm_model_all (the model including all predictors and interactions), it still does not outperform lm_model_all_except_new (the model without New_Product). This example highlights that automated feature selection is helpful, but it does not guarantee the best possible predictive performance.
Stepwise regression is a well-known and historically popular method for feature selection, particularly in the context of linear models. It is valuable to learn because it illustrates how algorithms can systematically automate model selection, providing an alternative to manually testing multiple combinations of predictors. In practice, stepwise regression is still used in some applied settings, but many modern machine learning practitioners avoid relying on it exclusively. This is because it can be unstable, sensitive to small changes in the data, and prone to overfitting. Deciding which features to include in a machine learning model remains an open question in research, with no universal agreement on the best approach.
Regardless of whether stepwise regression is recommended in practice, learning how it works gives important insights into model selection, the bias-variance trade-off, and the need to balance fit and complexity. In other words, even if we rarely use stepwise regression in practice, it remains a valuable pedagogical tool for understanding why blindly adding predictors is not always the best approach.
## Recap
In this chapter, we explored simple and multiple linear regression for predicting e-shop revenues, emphasizing the importance of evaluating models on unseen data. We started with a single predictor (Ad_Spend) and progressively added features, interactions, and all available predictors to observe their effect on predictive accuracy using MAE. While more complex models often improved training fit, we saw that including all predictors does not always yield the best test performance, illustrating the bias-variance trade-off and the No Free Lunch theorem. We also introduced stepwise regression as a systematic, automated approach to feature selection, highlighting its pedagogical value for understanding model selection, even if it is not always recommended in modern practice.