Statistical Tests

Introduction

In this chapter, we explore several foundational statistical tests used in hypothesis testing. These include the one-sample t-test, the independent and paired samples t-tests, the F-test for equality of variances, the F-test as used for Analysis of Variance (ANOVA), and the chi-square test. Each of these tests serves a specific purpose, but they all rely on the same core idea: using probability distributions to assess how likely it is that an observed result does not occur as a result of pure chance.

The logic behind these tests builds directly on the distributions we studied in previous chapters. Because we are now familiar with the normal, t-, F-, and chi-square distributions, the reasoning behind these tests becomes quite intuitive: in every case, we compute a test statistic and then calculate a p-value, the probability of obtaining a result as extreme as, or more extreme than, the one observed in our sample if the null hypothesis were true (see Chapter Hypothesis Testing). Alongside the theoretical explanation, we will also demonstrate how each of these tests can be performed in practice using R.

One Sample t-Test

In Chapter Hypothesis Testing, we introduced the concept of the one-sample t-test. Now, we will demonstrate how to perform this test in practice, using R.

The one-sample t-test is used when we want to test whether the mean of a single sample differs significantly from a known (or hypothesized) population mean. This is particularly useful when we do not know the population standard deviation and must therefore rely on the sample standard deviation.

To illustrate how to conduct a one-sample t-test, we will use the student_performance dataset, available on GitHub. This dataset contains observations from 1,000 students and includes the following six variables:

Hours_Studied: Average number of hours the student studies per day.
Previous_Score: Average score (out of 100) in previous assessment.
Extracurricular_Activities: Participation in extracurricular activities, recorded as "Yes" or "No".
Sleep_Hours: Average number of hours the student sleeps per day.
Sample_Question_Papers_Practiced: Number of practice question papers the student solved before the final assessment (used to evaluate preparation for the test).
Performance_Index: Overall performance metric, scaled from 0 to 100, where 100 represents the best possible performance.

We begin by loading the dataset and inspecting the first few rows:

# Libraries
library(tidyverse)

# Importing dataset
student_performance <- read_csv("https://raw.githubusercontent.com/DataKortex/Data-Sets/refs/heads/main/student_performance.csv")

# Printing the first 10 rows
head(student_performance, n = 10)

# A tibble: 10 × 6
   Hours_Studied Previous_Score Extracurricular_Activities Sleep_Hours
           <dbl>          <dbl> <chr>                            <dbl>
 1             7             99 Yes                                  9
 2             4             82 No                                   4
 3             8             51 Yes                                  7
 4             5             52 Yes                                  5
 5             7             75 No                                   8
 6             3             78 No                                   9
 7             7             73 Yes                                  5
 8             8             45 Yes                                  4
 9             5             77 No                                   8
10             4             89 No                                   4
# ℹ 2 more variables: Sample_Question_Papers_Practiced <dbl>,
#   Performance_Index <dbl>

In this dataset, we would like to test whether the sample mean performance index is statistically significantly higher than 50, which we assume to be the minimum performance level required to pass.

Because we are interested in testing whether the sample mean is greater than 50, we will perform a one-sided, one-sample, t-test.

We formulate the hypotheses as follows:

$H_0: \mu = 50$
$H_1: \mu > 50$

To keep things simple, we set the confidence level to 90%. This means that if the p-value is below 0.10, we will reject the null hypothesis and conclude that there is a statistically significant difference between the sample mean and the value of 50. Alternatively, we will not have evidence to conclude that there is a statistically significant difference.

To perform the one-sample t-test, we use the built-in t.test() function from base R. To do this, we first need to specify:

x: the variable of interest (in this case, student_performance$Performance_Index),
alternative = "greater", to indicate a one-sided test,
mu = 50 to set the hypothesized population mean

The code is the following:

# Applying one-sample t-test (greater than 50)
t.test(x = student_performance$Performance_Index,
       alternative = "greater", 
       mu = 50, 
       conf.level = 0.90)


    One Sample t-test

data:  student_performance$Performance_Index
t = 27.195, df = 9999, p-value < 2.2e-16
alternative hypothesis: true mean is greater than 50
90 percent confidence interval:
 54.97856      Inf
sample estimates:
mean of x 
  55.2248

The output of the t.test() function provides several important statistics. First, it reports the t-value, which quantifies how many standard errors the sample mean is (in this case) above the hypothesized value of 50. It also displays the degrees of freedom, which is equal to the sample size minus one. The p-value indicates the probability of obtaining a test statistic as extreme as the one observed, assuming that the null hypothesis is true. Additionally, the output includes a one-sided confidence interval, which gives a lower bound for the population mean under the specified confidence level. Finally, the function returns the sample mean, which is the observed average performance index in our dataset.

Here, the p-value is extremely small (much lower than 0.10), so we reject the null hypothesis and thus conclude that our evidence can support the claim that the average student performance is (statistically) significantly higher than 50. The one-sided 90% confidence interval starts at approximately 54.98, which means that any value below 54.98 would not be considered a plausible value for the population mean at the 90% confidence level.

Independent Two-Sample t-Test

In the previous example, we worked with a single sample and tested whether its mean was statistically different from a hypothesized value. Now, we turn to a different scenario: comparing two unrelated (i.e., independent) samples, to determine whether their means are statistically significantly different from each other. This is the purpose of the independent two-sample t-test.

To perform this test, we modify the formula for the t-statistic to account for two sample means rather than one. The test statistic is thus calculated as:

\[t = \frac{\bar{X}_{1} - \bar{X}_{2}}{SE_{\bar{X}_{1} - \bar{X}_{2}}}\]

where:

$\bar{X}_{1}$ and $\bar{X}_{2}$ are the sample means of the two independent groups,
$SE_{\bar{X}_{1} - \bar{X}_{2}}$ is the standard error of the difference between the two sample means.

Unlike the one-sample case, calculating the standard error here is slightly more complex, as each group contributes its own variance. The formula for the standard error of the difference between independent means is:

\[SE_{\bar{X}_{1} - \bar{X}_{2}} = \sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}\]

In this expression:

$s^2_1$ and $s^2_2$ are the sample variances of the two groups,
$n_1$ and $n_2$ are the sample sizes of each group.

This formula essentially combines the variability of each group by squaring their individual standard errors ($s_1 / \sqrt{n_1}$ and $s_2 / \sqrt{n_2}$ respectively), adding them together, and then taking the square root of the result. This gives us a measure of how much variability we expect in the difference between two sample means, assuming the null hypothesis is true.

Degrees of Freedom in the Independent Two-Sample t-Test

The degrees of freedom are still needed for the probability calculation of the t-value. In the independent two-sample t-test, the degrees of freedom are calculated as the total number of observations across both groups, minus 2:

\[df = n_1 + n_2 - 2\]

This subtraction of 2 accounts for the estimation of two group means from the data, which uses up two degrees of freedom

It is important to keep in mind that this version of the t-test assumes that the two samples are independent and that their sizes are approximately equal. If the sample sizes are very different, the group with the larger sample will tend to dominate the calculation of the standard error. This can make the test more sensitive to the variability in one group over the other, which in turn affects both the t-value and the resulting p-value. In such cases, we use Welch’s t-test, which adjusts both the test statistic and the degrees of freedom to account for the possibility that the two groups have different variances.

The formula for the t-statistic remains the same:

\[t = \frac{\bar{X}_{1} - \bar{X}_{2}}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\]

However, the degrees of freedom are no longer simply $n_1 + n_2 - 1$. Instead, Welch’s test uses an approximation known as the Welch–Satterthwaite equation, which adjusts the degrees of freedom based on both the sample sizes and sample variances:

\[df = \frac{\left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2}{\frac{ \left( \frac{s_1^2}{n_1} \right)^2 }{n_1 - 1} + \frac{ \left( \frac{s_2^2}{n_2} \right)^2 }{n_2 - 1}}\]

This expression looks complicated, but it is used to give a more accurate estimate of the sampling distribution under the null hypothesis when variances cannot be assumed unequal. The denominator contains two terms that reflect the variability introduced by each group’s sample variance as well as its size. The resulting degrees of freedom are typically non-integer, and they tend to be lower than the value we would use in the equal-variance case.

Therefore, the Welch test adjusts the degrees of freedom based on the variances and sample sizes of the two groups, effectively reducing the influence of less reliable variance estimates, making the test more robust if variances cannot be assumed equal (Urdan, 2022). This adjustment leads to a slightly more conservative test, reducing the risk of falsely declaring a difference significant when it is not, thus avoiding type I error.

Why Large Samples Stabilize the Standard Error

If the sample sizes in both groups are very large, the terms in the denominator of the standard error become small, reducing the overall standard error. In such cases, the t-test becomes more robust—even if the sample sizes or variances between the two groups differ. This is because large samples provide more precise estimates of the population parameters, making violations of assumptions (like equal variances) less problematic.

Thankfully, we do not need to compute this by hand. In R, the default behavior of the t.test() function for two independent samples is to use Welch’s t-test automatically, unless we explicitly request the equal-variance version. This means that R handles the adjustment for us behind the scenes, making the test both more accurate and easy to perform.

To see how we can implement this test, let us use the variable Extracurricular_Activities, which takes the value "Yes" if the student participates in extracurricular activities, and "No" otherwise. We will compare the two groups of students—those who participate in extracurricular activities and those who do not—in terms of their performance.

Before implementing the t-test, let’s calculate the sample size and variance of each group. With the count() function, we calculate the number of students in each group, while with the group_by() and summarize() functions we calculate the variance per group:

# Counting the number of students in each group
student_performance %>% count(Extracurricular_Activities)

# A tibble: 2 × 2
  Extracurricular_Activities     n
  <chr>                      <int>
1 No                          5052
2 Yes                         4948

# Calculating the variance of performance in each group
student_performance %>%
  group_by(Extracurricular_Activities) %>% 
  summarize(Variance_Performance_Index = var(Performance_Index))

# A tibble: 2 × 2
  Extracurricular_Activities Variance_Performance_Index
  <chr>                                           <dbl>
1 No                                               367.
2 Yes                                              371.

The sample sizes and variances are roughly equal in the two groups, meaning that there is no strong need to adjust the standard error. We can therefore safely assume equal variances for this example.

To implement the t-test, we once again use the function t.test() and include the following additional arguments:

y: the second group of observations, to compare against x.
var.equal: a logical argument specifying whether we assume equal variances across groups. If set to TRUE, R performs the standard independent two-sample t-test. If set to FALSE (or omitted), R performs Welch’s t-test by default.

Because we need two different vectors to assign to the arguments x and y, we store all the observations with no extracurricular activities in the data frame student_performance_no_extra_act, and the rest in the data frame student_performance_yes_extra_act:

# Filtering and extracting performance index for each group
dat_no_extra_act <- student_performance %>%
  filter(Extracurricular_Activities == "No") %>% 
  pull(Performance_Index)

dat_yes_extra_act <- student_performance %>%
  filter(Extracurricular_Activities == "Yes") %>% 
  pull(Performance_Index)

This time, we set the confidence level to 95%, meaning that a p-value below 0.05 would reflect statistically significant results. Our two hypotheses are:

$H_0: \bar{X}_{1} - \bar{X}_{2} = 0$. The mean difference between the two groups is statistically zero.
$H_1: \bar{X}_{1} - \bar{X}_{2} \ne 0$. The mean difference between the two groups is statistically different than (not equal to) zero.

Because we are testing whether the difference between the two means are equal, we set the mu argument to 0:

# Applying Independent Two-Sample t-test
t.test(x = dat_no_extra_act,
       y = dat_yes_extra_act,
       alternative = "two.sided", 
       var.equal = TRUE,
       mu = 0, 
       conf.level = 0.95)


    Two Sample t-test

data:  dat_no_extra_act and dat_yes_extra_act
t = -2.453, df = 9998, p-value = 0.01418
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.6954392 -0.1893163
sample estimates:
mean of x mean of y 
 54.75851  55.70089

The results may be surprising: there is a statistically significant difference between the students not performing extracurricular activities and those performing extracurricular activities. The means are 54.76 and 55.70, which seem very close to each other. However, the confidence interval is approximately [-1.70, -0.19], meaning that, any value outside that range (including 0), is considered significantly different (statistically) from the observed difference. The negative sign comes from the fact that the mean of the second group (students with extracurricular activities) is larger than that of the first group.

This example illustrates how the two-sample t-test allows us to formally compare the means of two independent groups. Even when the numerical difference between means seems small, statistical testing provides a framework to assess whether such a difference is likely due to chance. By carefully checking assumptions, such as equal variances, and using appropriate methods like Welch’s adjustment when needed, we ensure that our conclusions are both valid and reliable.

Paired Samples t-Test

There are many situations in which the two samples we want to compare are not independent but paired in a meaningful way (Urdan, 2022). This occurs, for example, when the same individuals are measured twice: once before and once after a treatment, or when observations are naturally matched, such as siblings, twins, or matched case-control designs. In these cases, we use the paired samples t-test (also called the dependent samples t-test) to determine whether the mean difference between the two sets of observations is statistically significant.

The key idea behind the paired t-test is that we do not treat the two samples as separate groups. Instead, we focus on the difference between the paired values for each subject. For example, if a person’s score before treatment is 55 and after treatment is 60, we consider the difference of +5 as the primary quantity of interest.

The formula for the t-statistic in a paired t-test is:

\[t = \frac{\bar{X}_{pre} - \bar{X}_{post}}{SE_{\bar{D}}} = \frac{\bar{X}_{pre} - \bar{X}_{post}}{\frac{s_{\bar{D}}}{\sqrt{n}}}\]

where:

$\bar{X}_{pre}$ is the sample mean from the first measurement (e.g., before a treatment)
$\bar{X}_{post}$ is the sample mean from the second measurement (e.g., after a treatment)
$s_{\bar{D}}$ is the standard deviation of the differences,
$n$ is the number of pairs,
$SE_{\bar{D}}$ is the standard error of the mean difference.

Even though the samples are dependent though, our logic stays the same: we want to find whether the differences between the two samples means—before and after treatment—is statistically significant. Technically, the only difference between the paired samples t-test and the independent sample t-test is the calculation of the standard error.

As with the one-sample t-test, the degrees of freedom for the paired t-test are $n - 1$, which makes sense: at the end of the day, we have one sample. Additionally, there is no need to worry about unequal sample sizes or variances.

As an example, we will test whether the latest performance scores were higher or lower than the previous scores for the same students. Our null hypothesis assumes that there is no difference between the means of the two related samples, implying that the students’ performance did not change over time. The alternative hypothesis is that there is a statistically significant difference between the two means:

$H_0: \bar{X}_{pre} - \bar{X}_{post} = 0$. The mean difference between the two groups is zero.
$H_1: \bar{X}_{pre} - \bar{X}_{post} \ne 0$. The mean difference between the two groups is not zero.

Since we are interested in two-sided differences (increase or decrease that is), we conduct a two-sided test with a 95% confidence level. The paired t-test is appropriate here because each student’s previous score is naturally paired with their latest score. In R, this pairing is explicitly indicated by setting the argument paired = TRUE in the t.test() function:

# Applying Paired Samples t-test
t.test(x = student_performance$Previous_Score,
       y = student_performance$Performance_Index,
       alternative = "two.sided", 
       mu = 0, 
       paired = TRUE,
       conf.level = 0.95)


    Paired t-test

data:  student_performance$Previous_Score and student_performance$Performance_Index
t = 183.57, df = 9999, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 14.06905 14.37275
sample estimates:
mean difference 
        14.2209

The resulting p-value is much smaller than 0.05, indicating that the difference between the mean previous scores and the mean latest scores is statistically significant. The 95% confidence interval for the mean difference ranges approximately between 14.07 and 14.37, suggesting that, on average, the latest performance scores are lower than the previous scores by this amount.

Chi-Square Test of Independence

This type of test is different from the ones we have discussed so far. In the previous sections, we compared means of numeric variables. Here, we work with categorical variables and want to understand whether there is a statistical association between them.

For example, suppose the students in our dataset fall into groups based on two categorical variables: how much they study and how much they sleep. We might define the groups as:

Study a lot (more than 5 hours) and sleep a lot (more than 7 hours)
Study a lot (more than 5 hours) and sleep little (less than or equal to 7 hours)
Study little (less than or equal to 5 hours) and sleep a lot (more than 7 hours)
Study little (less than or equal to 5 hours) and sleep little (less than or equal to 7 hours)

The following code creates a data frame mapping these categories using the dataset we have used throughout this chapter:

# Categorizing Hours_Studied and Sleep_Hours into two groups each
dat_categories <- student_performance %>%
  mutate(Hours_Studied = if_else(Hours_Studied <= 5,
                                 "<= 5hrs",
                                 "> 5hrs"),
         Sleep_Hours = if_else(Sleep_Hours <= 7,
                               "<= 7hrs",
                               "> 7hrs")) %>%
  select(Hours_Studied, 
         Sleep_Hours)

# Printing the first 10 rows
head(dat_categories, n = 10)

# A tibble: 10 × 2
   Hours_Studied Sleep_Hours
   <chr>         <chr>      
 1 > 5hrs        > 7hrs     
 2 <= 5hrs       <= 7hrs    
 3 > 5hrs        <= 7hrs    
 4 <= 5hrs       <= 7hrs    
 5 > 5hrs        > 7hrs     
 6 <= 5hrs       > 7hrs     
 7 > 5hrs        <= 7hrs    
 8 > 5hrs        <= 7hrs    
 9 <= 5hrs       > 7hrs     
10 <= 5hrs       <= 7hrs

The question we want to answer is: are these two variables related? Or are they independent of each other? This is actually a question we can answer using the chi-square distribution. Recall from Chapter Chi-Square and F-Distributions that the chi-square statistic is calculated using the formula:

\[\chi^2 = \sum^n_{i = 1} \left( \frac{x_i - \bar{x}}{s} \right) ^ 2\]

For categorical data, this formula is adapted to compare observed counts with expected counts under the assumption of independence (each value is independent from the other in the distribution):

\[\chi^2 = \sum\frac{(O - E)^2}{E}\]

where:

$O$ is the observed frequency (i.e., the count we actually see in our data)
$E$ is the expected frequency (i.e., the count we would expect if there were no association between the two variables)

This test is called a test of independence because it checks whether the distribution of one categorical variable is independent of the distribution of another (McDonald, 2014). In other words, it tests whether knowing the value of one variable gives us information about the other.

Why the Chi-Square Statistic Can Be Used with Counts

In the last formula, we can think of each term as a squared standardized deviation: the difference between the observed and expected counts, scaled by the expected count. When sample sizes are large, the difference $(O - E)$ can be approximated by a normal distribution, thanks to the Central Limit Theorem.

If we standardize this difference—just like we do in a z-score—we get an approximately normal quantity:

\[Z \approx \frac{(O - E)}{\sqrt{E}}\]

Squaring this gives:

\[Z^2 = \frac{(O - E) ^ 2}{E}\]

So under the hood, each term in the chi-square test statistic is like a squared normal deviation—approximately—as long as the expected counts are large enough. This connection helps explain why the chi-square distribution emerges when we sum these terms across all categories.

To illustrate this, contingency tables are very helpful, as they summarize how many cases fall into each combination of categories. We can create such a table using the function table():

# Creating a contingency table
table(dat_categories)

             Sleep_Hours
Hours_Studied <= 7hrs > 7hrs
      <= 5hrs    3627   1908
      > 5hrs     2947   1518

It is apparent from the table that the majority of the students are in the category that study up to 5 hours and sleep up to 7 hours. If we had created these four categories randomly, we would expect that approximately 2,500 students would fall into each one, since there are 10,000 students in total. However, the chi-square test helps us determine whether these observed differences are statistically significant or simply due to chance.

Chi-Square Test: A Test for Only Two Variables

Contingency tables highlight an important aspect of the chi-square test of independence: it is specifically designed to evaluate the relationship between two categorical variables at a time. Each combination of categories is displayed in a two-dimensional table, where one variable defines the rows and the other defines the columns. Trying to include a third categorical variable would make the table difficult to interpret and visualize. Although more advanced statistical models can handle multiple categorical variables, the standard chi-square test of independence cannot.

Because we calculate a chi-square statistic, we use the chi-square distribution to test our hypothesis. As with other hypothesis tests, we start with formulating our null and alternative hypotheses:

Null hypothesis ($H_0$): The two categorical variables are independent.
Alternative hypothesis ($H_1$): The two categorical variables are associated (not independent).

The degrees of freedom ($df$) for a test of independence are calculated using:

\[df = (r - 1) \times (c - 1)\]

where:

$r$ is the number of rows in the contingency table (i.e., categories of the first variable)
$c$ is the number of columns (i.e., categories of the second variable)

The meaning of the degrees of freedom in a contingency table is that it tells us how many values in the table are free to vary once the row and column totals are fixed.

To perform the test of independence in R, we use the chisq.test() function and set the arguments x and y to the two vectors of the categories:

# Performing chi-square test of independence
chisq.test(x = dat_categories$Hours_Studied, 
           y = dat_categories$Sleep_Hours)


    Pearson's Chi-squared test with Yates' continuity correction

data:  dat_categories$Hours_Studied and dat_categories$Sleep_Hours
X-squared = 0.22572, df = 1, p-value = 0.6347

If the p-value is less than 0.05, we reject the null hypothesis and conclude that the variables are associated.

In this case, however, the p-value is much higher than the threshold of 0.05, meaning that there is no statistically significant association between the amount students study and the amount they sleep, even if we had used a more lenient confidence threshold such as 0.10.

F-Test for Equality of Variances

As discussed in Chapter Chi-Square and F-Distributions, the F-value is calculated as the ratio of two sample variances. If the two variances are approximately equal, the F-value will be close to 1. Conversely, a value far from 1 suggests a difference in variability between the two groups. Respectively, the F-test provides a formal method to test the following hypotheses regarding population variances:

$H_0: \frac{s^2_1}{s^2_2} = 1 \Leftrightarrow s^2_1 = s^2_2$
$H_1: s^2_1 \ne s^2_2$

where $s^2_1$ and $s^2_2$ represent the variances of the two populations from which the samples are drawn. By performing the F-test, we assess whether the observed difference between the sample variances is statistically significant or can be attributed to random sampling variability.

In R, this test can be performed using the var.test() function. Recall our earlier comparison of student performance between two groups: one with extracurricular activities and one without. This time, instead of comparing means, we will examine whether the variability in performance differs between these groups, using the sample data objects we previously created. We set the confidence level to 95% and conduct a two-sided test, reflecting the alternative hypothesis that the variances may differ in either direction:

# Performing F-test
var.test(x = dat_no_extra_act,
         y = dat_yes_extra_act,
         alternative = "two.sided",
         conf.level = 0.95)


    F test to compare two variances

data:  dat_no_extra_act and dat_yes_extra_act
F = 0.98837, num df = 5051, denom df = 4947, p-value = 0.6792
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.9350483 1.0447159
sample estimates:
ratio of variances 
         0.9883702

The resulting F-value is approximately 0.99, which is very close to 1. This indicates that the variances of the two groups are very similar given their degrees of freedom, the confidence level, and the two-sided nature of the test. Therefore, there is no strong statistical evidence to reject the null hypothesis of equal variances in this case.

F-Test Analysis of Variance (ANOVA)

While the F-test for equality of variances is used to compare two variances, the F-test in the context of Analysis of Variance (ANOVA) compares the means of two or more groups (Urdan, 2022). Its purpose is to determine whether there are statistically significant differences among group means. For example, if we have three school classes, ANOVA allows us to test whether the average exam scores differ significantly across these classes. This approach is conceptually similar to the independent samples t-test, which compares the means of two groups. The key difference is that the F-test in ANOVA extends this idea to compare means across multiple groups simultaneously.

An important question naturally arises though: Why not simply perform several t-tests instead of using ANOVA?

The key issue with running several t-tests is that it increases the chance of a Type I error, meaning we might conclude that a difference exists when it actually does not. Each test has its own false-positive risk (typically 5%), and when we run many tests on the same data, these risks add up. As a result, the probability of finding at least one “significant” result by chance becomes much higher, leading us to report differences that are not truly there (Urdan, 2022).

ANOVA solves this problem by comparing all group means at the same time. This keeps the overall Type I error under control and avoids the higher error risk from doing many separate tests. Instead of running multiple tests, ANOVA uses a single F-test to check if any group means differ more than we would expect by chance.

The F-statistic in ANOVA is computed as the ratio of these two variances:

\[F = \frac{\text{Variance} \ \text{Between} \ \text{Groups}}{\text{Variance} \ \text{Within} \ \text{Groups}} = \frac{MS_{\text{between}}}{MS_{\text{within}}}\]

where:

$MS_{\text{between}}$ (Mean Square Between) measures the average variability between the group means.
$MS_{\text{within}}$ (Mean Square Within) measures the average variability within each group.

At first glance, it may seem counter-intuitive to use variances when our goal is to compare means. However, the logic becomes clearer when we realize that we are not comparing raw individual values directly, but rather examining the distribution of group means. In other words, the group means form their own distribution, and the variance of this distribution is what appears in the numerator of the F-statistic. The denominator represents the pooled variance within each group, a concept familiar from earlier tests, such as the independent samples t-test. Thus, we still compare the difference between sample means relative to a measure of variance.

We can write the previous formula as:

\[F = \frac{MS_{\text{between}}}{MS_{\text{within}}} = \frac{SS_{\text{between}}/df_{\text{between}}}{SS_{\text{within}}/df_{\text{within}}}\]

where $SS_{\text{between}}$ (sum of squares between) measures the variability of group means around the overall mean, and $SS_{\text{within}}$ (sum of squares within) measures the variability of individual values around their own group means.

Now, let’s discuss how we can calculate the F-statistic, starting with $SS_{\text{between}}$. The formula of this variable is:

\[SS_{\text{between}} = \sum^g_{i = 1}n_i(\bar{X}_{i} - \bar{X})^2\]

where:

$\bar{X}$ is the grand mean, the mean of the means
$\bar{X_i}$ is the mean of group $i$
$n_i$ is the sample size of group $i$
$g$ is the number of groups

The reason why $n_i$ appears in the formula, is to weight the squared deviation of each group by its sample size. Larger groups should have a greater influence on the total between-group variability. The degrees of freedom within the groups are:

\[df_{\text{between}} = g - 1\]

Similarly, the sum of squared differences between each individual value and the group mean within a group is:

\[SS_{\text{within}} = \sum^g_{i = 1}\sum^{n_{i}}_{j = 1}(X_{ij} - \bar{X}_{i})^2\]

where:

$\bar{X}_{i}$ is the mean of group $i$
$X_{ij}$ is the individual score of observation $j$ in group $i$
$n_i$ is the sample size of group $i$
$g$ is the number of groups

The sample size now is used indirectly, as each group contributes $n_i - 1$ degrees of freedom to the total variability within groups. When we sum across all groups, the total degrees of freedom for the within-group variation becomes:

\[df_{\text{within}} = \sum^g_{i = 1}(n_i - g) = N - g\]

where $N$ is the total sample size (number of observations).

Connection Between F-Test and t-Test

With two groups, the formula for the F-statistic is:

\[F = \frac{\frac{n_1(\bar{X}_{1} - \bar{X}) + n_2(\bar{X}_{2} - \bar{X})}{ df_{\text{between}}}}{\frac{\sum^{n_{1}}_{j = 1}(X_{1j} - \bar{X}_{1})^2 + \sum^{n_{2}}_{j = 1}(X_{2j} - \bar{X}_{2})^2}{ df_{\text{within}}}}\]

where $\bar{X}_{1}$ and $\bar{X}_{2}$ are the sample means for groups 1 and 2, respectively, $\bar{X}$ is the overall mean across both groups, $n_1$ and $n_2$ are the group sample sizes, and the degrees of freedom are $df_{\text{between}} = 1$ (since there are two groups) and $df_{\text{within}} = n_1 + n_2 - 2$. For the special case of two groups, this expression can be simplified and shown to be equivalent to the square of the independent samples t-statistic. Recall that the t-statistic for comparing two independent means is:

\[t = \frac{\bar{X}_{1} - \bar{X}_{2}}{\sqrt{s_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}\]

where $s^2_p$ is the pooled variance estimate, calculated as:

\[s_p^2 = \frac{\sum_{j=1}^{n_1}(X_{1j} - \bar{X}_{1})^2 + \sum_{j=1}^{n_2}(X_{2j} - \bar{X}_{2})^2}{n_1 + n_2 - 2}\]

By substituting the formulas for between-group and within-group variances and simplifying, the F-statistic reduces to:

\[F = t^2\]

This equivalence means that for comparing exactly two groups, the ANOVA F-test and the independent samples t-test produce the same results (and therefore conclusions) in terms of statistical inference. The F-test extends naturally to cases with more than two groups, where no direct t-test equivalent exists.

To see how we could perform these calculations in practice, Suppose we have the math test scores of 15 students, divided equally across three classrooms (5 students per classroom). The scores are:

Classroom A: 72, 74, 71, 73, 70
Classroom B: 78, 76, 80, 79, 77
Classroom C: 85, 88, 84, 86, 87

The variability between classrooms (how different the class averages are) is captured by a chi-square variable, scaled by its degrees of freedom. Similarly, the variability within classrooms (how spread out individual scores are inside each class) is also captured by another chi-square variable with its own degrees of freedom. The F-statistic compares these two scaled chi-square values to determine if differences between classrooms are substantially large compared to the natural variation within each classroom. If the ratio is high, teaching styles can be assumed to be causing real differences in score variability.

We want to check if there is more variability between classrooms as compared to within classrooms using the F-statistic.

Step 1: Compute Group Means and Grand Mean

Mean of classroom A:

\[\bar{X}_{A} = \frac{72 + 74 + 71 + 73 + 70}{5} = 72\]
Mean of classroom B:

\[\bar{X}_{B} = \frac{78 + 76 + 80 + 79 + 77}{5} = 78\]
Mean of classroom C:

\[\bar{X}_{C} = \frac{85 + 88 + 84 + 86 + 87}{5} = 86\]
Grand Mean (mean of all 15 scores):

\[\bar{X} = \frac{72 + 74 + 71 + 73 + 70 + 78 + 76 + 80 + 79 + 77 + 85 + 88 + 84 + 86 + 87}{15} = 78.67\]

Step 2: Compute Between-Group Sum of Squares (SSB):

We use the formula:

\[SS_{\text{between}} = \sum^{g}_{i = 1} n_i(\bar{X}_{i} - \bar{X})^2\]

Notably, each group has 5 students:

\[SS_{\text{between}} = 5(72 - 78.67)^2 + 5(78 - 78.67)^2 + 5(86 - 78.67)^2 = 493.3\]

Step 3: Compute Within-Group Sum of Squares (SSW)

We use the formula:

\[SS_{\text{within}} = \sum^g_{i = 1}\sum^{n_{i}}_{j = 1}(X_{ij} - \bar{X}_{i})^2\]

For simplicity, we compute the squared deviations of each score from its own group mean:

Classroom A (mean = 72):

\[(72 - 72)^2 + (74 - 72)^2 + (71 - 72)^2 + (73 - 72)^2 + (70 - 72)^2 = 10\]

Classroom B (mean = 78):

\[(78 - 78)^2 + (76 - 78)^2 + (80 - 78)^2 + (79 - 78)^2 + (77 - 78)^2 = 10\]

Classroom C (mean = 86):

\[(85 - 86)^2 + (88 - 86)^2 + (84 - 86)^2 + (86 - 86)^2 + (87 - 86)^2 = 10\]

So we have:

\[SS_{\text{within}} = 10 + 10 + 10 = 30\]

Step 4: Compute Degrees of Freedom

Between-groups:

\[df_{\text{between}} = g - 1 = 3 - 1 = 2\]
Within-groups:

\[df_{\text{within}} = N - g = 15 - 3 = 12\]

Step 5: Compute Mean Squares and F-statistic

\[F = \frac{SS_{\text{between}}/df_{\text{between}}}{SS_{\text{within}}/df_{\text{within}}} = \frac{493.3 / 2}{30 / 12} = \frac{246.65}{2.5} = 98.66\]

An F-statistic of 98.66 is very large, meaning the between-classroom variability is much greater than the within-classroom variability. This suggests that the differences in classroom averages (and possibly teaching methods) are too large to be explained by chance alone.

In R, we can use the functions aov() and anova() to perform the ANOVA F-test. After creating the same dataset as in the previous example, we use the aov() function to specify the model. In this function, the continuous outcome variable is placed on the left-hand side of the tilde (~), and the categorical grouping variable is placed on the right-hand side:

# Creating the dataset
school_example <- tibble(A = c(72, 74, 71, 73, 70),
                         B = c(78, 76, 80, 79, 77),
                         C = c(85, 88, 84, 86, 87)) %>%
  pivot_longer(cols = c(A, B, C),
               names_to = "School",
               values_to = "Score")

# Printing the first 10 rows
head(school_example, n = 10)

# A tibble: 10 × 2
   School Score
   <chr>  <dbl>
 1 A         72
 2 B         78
 3 C         85
 4 A         74
 5 B         76
 6 C         88
 7 A         71
 8 B         80
 9 C         84
10 A         73

# Fit the ANOVA model and display the results
aov(formula = Score ~ School,
    data = school_example) %>%
  anova()

Analysis of Variance Table

Response: Score
          Df Sum Sq Mean Sq F value    Pr(>F)    
School     2 493.33  246.67  98.667 3.549e-08 ***
Residuals 12  30.00    2.50                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As expected, the F-value in the output matches the one calculated manually. Since we are using a significance level of 0.05, and the resulting p-value is below this threshold, we conclude that there are statistically significant differences in mean scores across the three school groups.

Statistical Tests and the Normality Assumption

Before applying any statistical test, it is important to understand the conditions under which the test produces valid and reliable results. These are known as the assumptions of the test. If these assumptions are not met, the results of the test, especially the p-value and conclusions drawn from it, may be misleading.

Let’s review the common assumptions that apply across the most frequently used hypothesis tests: the t-tests, the F-test, and the chi-square test of independence.

1. Type of Data

Each test is designed for specific types of data:

t-test and the F-test require numerical (continuous) data, measured on an interval or ratio scale.
The chi-square test works with categorical data, where observations fall into distinct groups.

This distinction is essential: using a test with the wrong type of data violates its basic logic.

2. Independence of Observations

All hypothesis tests assume that the observations are independent. This means that any one observation does not influence another. For example, when testing whether two groups have different means (using a t-test), we assume the values in one group are conceptually and practically unrelated to the values in the other.

In the chi-square test, each individual should appear in only one cell of the contingency table. If the same person appears in multiple categories, the assumption is violated, and the test results become invalid.

3. Normality (for t-tests and F-test)

When comparing means using t-tests or variances using the F-test, the data is assumed to come (to be sampled) from populations that are normally distributed. This assumption is especially important when sample sizes are small (e.g., fewer than 30 observations). For large samples, the Central Limit Theorem helps reduce the impact of non-normality, and the tests become more robust.

The chi-square test of independence, however, does not require normally distributed data, since it deals with counts, not continuous measurements.

4. Equality of Variances

The standard independent samples t-test assumes that the two groups have equal variances. If the spread of the data is very different between groups, this assumption is violated. In that case, we use Welch’s t-test, which adjusts for unequal variances.

The F-test, by definition, compares variances, so the assumption of equal variances doesn’t apply. This assumption also does not apply to the chi-square test, which is used for categorical data and focuses on frequencies rather than variability in numerical values.

5. Expected Frequencies (for chi-square test)

For the chi-square test of independence, it is important that the expected count in each cell of the contingency table is at least 5 (this is just a rule of thumb). When expected frequencies are too small, the test becomes unreliable, and we may need to use alternatives.

6. Random Sampling

All tests assume that the data was obtained through random sampling. If the data collection process is biased or non-random, any statistical inference we make may not generalize to the broader population.

Recap

Throughout the chapter, we emphasized the shared structure behind these statistical tests: each one begins by formulating a null hypothesis that represents a default or baseline assumption: typically, that there is no difference or no association. We then calculate a test statistic that captures how far the observed data deviate from what we would expect if the null hypothesis were true. This test statistic is then evaluated against a theoretical probability distribution, such as the t-, F-, or chi-square distribution, which allows us to quantify how extreme the result is. The resulting p-value represents the probability of observing a test statistic as extreme as (or more extreme than) the one obtained, purely by chance.

What differs across these tests is not the underlying logic, but the type of data we work with, the specific assumptions we make, and the choice of distribution used to evaluate the evidence. For instance, comparing two group means involves the t-distribution, comparing more than two groups leads us to the F-distribution, and examining relationships between categorical variables calls for the chi-square distribution. Despite these differences, the foundational idea remains the same: using the language of probability to decide whether an observed effect (outcome) is statistically meaningful.

In addition to understanding the theoretical structure of each test, we also demonstrated how to implement them in R using real data. This practical element is crucial: knowing how to set up the data, apply the appropriate function, and interpret the output equips you to carry out hypothesis testing in applied settings. Together, these theoretical and computational tools provide a strong foundation for analyzing data with confidence and clarity.