In statistics, we often want to make inferences about a population
based on a sample. If we knew the population mean and standard
deviation, we could use the Normal distribution to
make these inferences, assuming that the distribution in the
population is normal. But in real life, we rarely have access to the
whole population—so we rely on samples instead.
When we estimate the population mean using a sample, we also have to
estimate the standard deviation. This adds extra uncertainty to our
calculations. The t-distribution accounts for this
added uncertainty. It is similar to the normal distribution in
shape: symmetric and bell-shaped, but it has heavier tails, which
means it gives higher probability to extreme values. These heavier
tails help protect
against underestimating variability when the sample size is small.
The t-distribution gradually becomes more like the normal
distribution as the sample size increases. This is because our
estimate of the standard deviation becomes more accurate with more
data. In fact, when the sample size is large enough, the
t-distribution and the normal distribution are almost the same.
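To see this convergence numerically, here is a quick sketch using base R (the specific df values are just illustrative):

# Upper-tail probability P(X > 2) approaches the normal value as df grows
sapply(c(2, 5, 30, 100), function(df) pt(q = 2, df = df, lower.tail = FALSE))
pnorm(q = 2, lower.tail = FALSE)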
Interestingly, the t-distribution was developed in 1908 by William
Sealy Gosset, a statistician working at the Guinness Brewery in
Dublin. Due to restrictive company policies, he published his work
under the pseudonym “Student,” which is why the distribution is
commonly called the Student’s t-distribution. Gosset
created it to address the problem of making reliable inferences from
small sample sizes, which was common in quality control processes at
the brewery.
18.2 When do we encounter
the t-distribution?
Suppose we want to estimate the distribution of IQ scores of
students at a university, but we only have data from a small sample
of 10 students. A common approach is to calculate the sample mean
and sample standard deviation and, based on these estimates, make
inferences about the population.
For this example, we generate 10 IQ scores randomly from a Uniform
distribution between 90 and 110. We then simulate 100,000 IQ scores
from a normal distribution with the calculated sample mean and
standard deviation, using the rnorm() function:
# Libraries
library(tidyverse)

# Setting the theme
theme_set(theme_light())

# Setting seed
set.seed(568)

# Generate a sample of 10 IQ scores
iq_scores <- tibble(Score = runif(10, min = 90, max = 110))

# Calculate mean and standard deviation
mean_score <- mean(iq_scores$Score)
sd_score <- sd(iq_scores$Score)

# Simulate from the normal distribution
simulated_iq_scores_n <- tibble(IQ_n = rnorm(n = 100000, mean = mean_score, sd = sd_score))

# Plot the distribution
simulated_iq_scores_n %>%
  ggplot(aes(x = IQ_n)) +
  geom_density(fill = "skyblue", alpha = 0.6) +
  labs(x = "Score", y = "Density")
As expected, the distribution looks normal, because we simulated it
from a normal distribution in the first place. However, there’s a
key issue here: we don’t actually know the true standard deviation
of all students’ IQ scores. Of course, we have used the
sample standard deviation as an estimate, but our
sample size is small. This means that our estimate is probably not
that accurate, even though it is our best guess.
To demonstrate, let’s simulate values using the
t-distribution instead (we discuss the code later
in this chapter):
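For reference, a minimal sketch of what that simulation looks like (using rt() with df = 9, matching our sample of 10; the function itself is covered later in the chapter):

# Simulate from the t-distribution with df = n - 1 = 9
simulated_iq_scores_t <- tibble(IQ_t = rt(n = 100000, df = 9))

# Plot the distribution
simulated_iq_scores_t %>%
  ggplot(aes(x = IQ_t)) +
  geom_density(fill = "darkgreen", alpha = 0.6) +
  labs(x = "t-value", y = "Density")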
Notice how the center of the distribution is around zero (0). This
is because the t-distribution is standardized, meaning it
assumes a mean of 0. To compare it fairly to the normal
distribution, we need to standardize the normally simulated data and
plot both distributions together:
# Standardize the normal distribution sample
simulated_iq_scores_n <- simulated_iq_scores_n %>%
  mutate(IQ_n_std = (IQ_n - mean(IQ_n)) / sd(IQ_n))

# Plot both distributions
simulated_iq_scores_n %>%
  bind_cols(simulated_iq_scores_t) %>%
  ggplot() +
  geom_density(aes(x = IQ_t), fill = "darkgreen", alpha = 0.6) +
  geom_density(aes(x = IQ_n_std), fill = "lightblue", alpha = 0.6) +
  labs(x = "Standardized Score", y = "Density")
We can now see the difference clearly: the dark green distribution
(t-distribution) has slightly heavier tails and a lower peak than
the light blue distribution (normal distribution).
Degrees of freedom and sample size
In the example above, we generated 100,000 values from a
t-distribution to visualize its shape, but this does not mean
our sample size was 100,000. The sample size refers to the
number of actual data points we collected—in this case, just 10
IQ scores. The degrees of freedom used for the
t-distribution reflect this original sample size. So even though
we create many simulated values, the distribution’s shape still
depends on the small sample, not the number of values drawn for
plotting.
18.3 What are the
parameters and shape of the t-distribution?
We write the t-distribution using the following notation:
\[ X \sim t(df) \]
This means that the random variable
\(X\) follows a
t-distribution with df degrees of
freedom. Since the t-distribution is always standardized, it is
implied that the mean is always 0. Its spread, however, is not fixed
at 1: as we will see below, the variance depends on the degrees of
freedom.
In the chapter
Statistical Distributions, we discussed how the degrees of freedom are used to take into
account the fact that we use the sample mean instead of the
population mean to calculate the standard deviation of a sample. As
a reminder, the degrees of freedom (df) is equal to the sample size
minus one:
\[ df = n - 1 \]
The important thing to keep in mind is that the smaller the number
of degrees of freedom, the higher the uncertainty.
The t-distribution is a continuous probability distribution
characterized by its probability density function (pdf), which is
given by the following formula:
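\[ f(x) = \frac{\Gamma\left(\frac{df+1}{2}\right)}{\sqrt{df \, \pi} \; \Gamma\left(\frac{df}{2}\right)} \left(1 + \frac{x^2}{df}\right)^{-\frac{df+1}{2}} \]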
The first part of the formula is a constant that ensures the total
area under the curve equals 1. This is just like the constant
\(\frac{1}{\sqrt{2 \pi}}\) in the normal
distribution. It contains the gamma function, which
is a generalization of the factorial function. This part helps shape
the height of the curve appropriately for different values of
df. The second part of the formula controls the shape
of the distribution. It works a bit like the exponential part of the
normal distribution. The term
\(\frac{x^2}{df}\) measures how
far a value \(x\) is from the
center, and the exponent determines how quickly the density drops
off as you move further from zero.
When the degrees of freedom are small, the drop is slow, meaning
more probability is placed in the tails. This reflects greater
uncertainty when we have little data. As the degrees of freedom
increase, the drop becomes steeper, and the distribution starts to
resemble the normal distribution.
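A sketch of this effect, overlaying t-densities for two df values on the standard normal curve (the chosen df values are illustrative; assumes the tidyverse is loaded as above):

# Compare t-densities with different df against the standard normal
tibble(x = seq(-4, 4, by = 0.01)) %>%
  mutate(`t, df = 2` = dt(x, df = 2),
         `t, df = 30` = dt(x, df = 30),
         Normal = dnorm(x)) %>%
  pivot_longer(-x, names_to = "Distribution", values_to = "Density") %>%
  ggplot(aes(x = x, y = Density, color = Distribution)) +
  geom_line() +
  labs(x = "x", y = "Density")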
From the student example, suppose we want to calculate the
likelihood of \(x = 0\) given
\(df = 9\). Using the formula, we
have:

\[ f(0) = \frac{\Gamma\left(\frac{9+1}{2}\right)}{\sqrt{9\pi}\,\Gamma\left(\frac{9}{2}\right)} \left(1 + \frac{0^2}{9}\right)^{-\frac{9+1}{2}} = \frac{\Gamma(5)}{\sqrt{9\pi}\,\Gamma(4.5)} \approx 0.388 \]

which is approximately the maximum height of the distribution we
saw in the last plot.
Now let's look at the expected value and
variance of the t-distribution. As expected, the
expected value is:

\[ E(X) = 0 \]

The variance (defined for \(df > 2\)) is:

\[ Var(X) = \frac{df}{df - 2} \]

This variance is greater than 1 when
df is small, which reflects the heavier tails. For
example:
With \(df = 5\), the variance is
\(\frac{5}{5 - 2} \approx 1.67\).

With \(df = 30\), the variance is
\(\frac{30}{30 - 2} \approx 1.07\).
As df increases, the variance approaches 1, just like
the standard normal distribution.
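We can verify this numerically with a quick simulation (a sketch; the seed and number of draws are arbitrary):

# Empirical variance of t-draws approaches df / (df - 2)
set.seed(42)
var(rt(n = 1e6, df = 5))   # close to 5/3 ≈ 1.67
var(rt(n = 1e6, df = 30))  # close to 30/28 ≈ 1.07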
However, when \(df \le 2\), the variance is no longer finite: it is
infinite for \(1 < df \le 2\) and undefined for \(df \le 1\). This
again highlights how small samples carry more risk and uncertainty.
18.4 Calculating and
Simulating in R
R provides built-in functions that make it easy to work with the
t-distribution in simulations and probability calculations. Let’s
use these functions to explore a scenario involving a sample of 20
university students, where we want to estimate the average IQ.
Suppose we randomly select 20 students and calculate the sample mean
and sample standard deviation of their IQ scores. Since the
population standard deviation is unknown and the sample size is
relatively small, we use the t-distribution with
\(df = 19\).
To simulate 10,000 possible sample means under this setup, we draw
from a t-distribution with 19 degrees of freedom. Note that the
simulated values are standardized t-values rather than raw IQ
scores:
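A minimal sketch of that simulation (the object and column names here are illustrative):

# Simulate 10,000 standardized sample means from t(df = 19)
simulated_means_t <- tibble(t_value = rt(n = 10000, df = 19))

# Plot a histogram of the simulated t-values
simulated_means_t %>%
  ggplot(aes(x = t_value)) +
  geom_histogram(bins = 50, fill = "skyblue", color = "white") +
  labs(x = "t-value", y = "Count")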
This histogram shows the distribution of simulated sample means. The
shape resembles a normal distribution, but with slightly heavier
tails, which reflect the added uncertainty due to estimating the
population standard deviation from a small sample.
To calculate the probability density at a specific t-value (e.g.,
the likelihood of observing a t-value of 1.5 with 19 degrees of
freedom), we use the dt() function:
# Density at t = 1.5 with 19 degrees of freedom
dt(x = 1.5, df = 19)
[1] 0.1285716
To compute the cumulative probability up to a certain t-value (e.g.,
the probability of getting a t-value up to 1.5), we apply the
function pt():
# Cumulative probability of t ≤ 1.5
pt(q = 1.5, df = 19)
[1] 0.9249757
Finally, to find the critical value corresponding to the 97.5th
percentile, which is useful for confidence intervals, we use the
qt() function:
# 97.5th percentile of the t-distribution with 19 df
qt(p = 0.975, df = 19)
[1] 2.093024
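As a usage example, here is a sketch of how this critical value feeds into a 95% confidence interval for the mean (the sample statistics below are hypothetical):

# Hypothetical sample statistics for n = 20 students
n <- 20
x_bar <- 102.5 # hypothetical sample mean
s <- 12.3      # hypothetical sample standard deviation

# 95% confidence interval: x_bar ± t_crit * s / sqrt(n)
t_crit <- qt(p = 0.975, df = n - 1)
x_bar + c(-1, 1) * t_crit * s / sqrt(n)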
These functions allow us to simulate data and calculate
probabilities when working with small samples and unknown population
standard deviations, which is exactly the kind of scenario the
t-distribution was designed for.
18.5 Recap
In this chapter, we explored the t-distribution, a critical tool in
statistics when working with small samples and an unknown population
standard deviation. The t-distribution is similar in shape to the
normal distribution, meaning that it is symmetric and bell-shaped,
but it has heavier tails. These heavier tails account for the extra
uncertainty introduced when we estimate the population standard
deviation from the sample.
The distribution is defined by a single parameter: degrees of
freedom, typically equal to the sample size minus one. When the
degrees of freedom are small, the t-distribution is noticeably wider
than the normal. As the sample size increases and our estimate of
the standard deviation becomes more reliable, the t-distribution
gradually converges to the normal distribution.
The mean of the t-distribution is 0 and its variance is larger than
1 when the degrees of freedom are small, but this variance
approaches 1 as the degrees of freedom increase. This flexibility
allows the t-distribution to adapt to the sample size and to make
more cautious inferences, which is particularly useful in hypothesis
testing and confidence intervals for means when we do not know the
population standard deviation.