7  Introduction to Statistics

7.1 Introduction

In this chapter, we discuss the concept of statistics. Statistics is one of the main pillars of research and data science and, as data accumulate exponentially in the digital era, it is becoming increasingly important.

In many fields, such as medicine, business and engineering, decisions must be made based on incomplete information, under uncertainty. For example, a doctor may need to decide on the best treatment for a patient based on clinical trial results and patient history. A business leader may rely on market data to launch a new product or change strategy. In all these cases, drawing reliable conclusions from data is critical. Making decisions without solid evidence can lead to poor outcomes, wasted resources, or even harm. Statistics provides the tools to do just that: to make informed, data-driven decisions rather than relying on guesswork or intuition alone.

Consider a scenario where you work at a large company and want to assess how its employees' salaries compare to those of competing firms (this is, by the way, typical work for an HR function). This inquiry is not merely about compensation or costs: it concerns fairness, pay equity, market competitiveness and even strategic direction.

People could argue about what exactly should be considered fair. For example, should we increase salaries for new hires to attract fresh talent, even if it means they earn more than long-standing employees in similar roles? Such arguments are often based, to a large extent, on gut feelings and biases. Intuition and anecdote are insufficient, though, if we truly want to make informed and objective decisions. To do so, we require data and methods for analyzing and interpreting them properly. This is the role of statistics as a discipline.

Statistics can be thought of both as a tool and as a framework for reasoning with data. It enables us to answer questions impartially, identify patterns, and make informed decisions. Importantly, one need not be a mathematician to engage with statistics.

7.2 Populations and Samples

Imagine we want to compare the average salary of employees at our company with that of its competitors. Before we can apply statistics, though, we need to collect data. Ideally, we would have all the relevant information from every employee in the market. However, with thousands of employees across firms (even when limiting ourselves to direct competitors), roles, and locations (and with salaries changing over time), collecting complete salary data is often impractical, if not impossible. Instead, we gather data from a smaller group in the hope of generalizing to the broader group we are interested in. This subset is called a sample, while the full group of interest (all the employees in the market at a given point in time) is called the population.

  • Population: The entire group about which we aim to draw conclusions (e.g., all relevant employees in the industry).

  • Sample: A smaller group selected from the population, for which we actually collect data.

It’s important to note that in statistics, “population” does not refer solely to people: it can describe any complete set of items or entities we are studying—such as products in a factory, transactions in a bank, or days in a year. The key is that the population includes all units about which we want to draw conclusions.

Definition

Population includes all units about which we want to draw conclusions.

Even though we usually work with sample data, we use statistics to draw conclusions about the population as if we had data on everyone. If the sample is representative, it can yield valid inferences about the population. The process is analogous to tasting a spoonful of soup to assess the flavor of the entire pot. The main assumption, of course, is that the spoonful of soup accurately represents the whole pot.
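To make the spoonful analogy concrete, here is a minimal Python sketch using an entirely invented salary population; the population size and salary figures are assumptions for illustration only.

```python
import random
import statistics

random.seed(42)  # reproducibility

# Hypothetical population: annual salaries (in $) of 10,000 employees.
population = [random.gauss(52_000, 9_000) for _ in range(10_000)]

# Taste a "spoonful": a simple random sample of 200 employees.
sample = random.sample(population, 200)

print(f"population mean: {statistics.mean(population):,.0f}")
print(f"sample mean:     {statistics.mean(sample):,.0f}")
```

With a representative sample, the two means typically land close together, which is exactly what lets us reason about the whole pot from the spoonful.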

In contrast to representative samples, biased samples lead to misleading conclusions. If we sample only from a single department or a single firm, we overlook variation across roles, seniority levels, and even organizations. For example, if we aim to study IT managers but sample only new managers within our company, the results would likely misrepresent the broader population. Explicitly defining the population of interest and ensuring a representative sample is essential to avoid bias.

7.3 Sample Types

As mentioned, we need to make sure that our sample represents our population of interest as well as possible. This is a critical step, as poor sampling introduces bias, thus limiting the reliability of our conclusions. However, a perfectly representative sample is more of a theoretical ideal than a practical reality; there are, nonetheless, practical methods that help approximate it. The most common approach in practice is simple random sampling, where every individual in the population has an equal chance of being selected.

A related method that can make random sampling more effective, by ensuring the representation of smaller groups, is stratified random sampling. When the population contains meaningful subgroups (e.g., departments, job levels, companies), stratified sampling is often preferable (and practically feasible). Here, the population is divided into strata, and random samples are drawn from each one. This ensures that all relevant groups are represented proportionately, thus improving the accuracy and relevance of results. For example, if we believe there are three meaningful groups with respect to the salaries of financial analysts (junior, mid-senior and senior), we need subsamples from all three groups (assuming, of course, that we have defined our population as "financial analysts", without further conditions or constraints). It is important to note that, when we apply stratified random sampling, we must know the meaningful subgroups beforehand, typically from domain knowledge or previous studies.
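As a rough illustration, the sketch below builds an invented population of financial analysts in three strata and draws from each stratum in proportion to its size; the group sizes and salary levels are assumptions, not real data.

```python
import random
import statistics

random.seed(5)

# Invented strata of financial analysts with different salary levels.
strata = {
    "junior":     [random.gauss(40_000, 4_000) for _ in range(6_000)],
    "mid-senior": [random.gauss(55_000, 5_000) for _ in range(3_000)],
    "senior":     [random.gauss(75_000, 7_000) for _ in range(1_000)],
}
total = sum(len(group) for group in strata.values())

# Proportional allocation: each stratum contributes its fair share of 200.
sample = []
for name, group in strata.items():
    k = round(200 * len(group) / total)
    sample.extend(random.sample(group, k))

population = [salary for group in strata.values() for salary in group]
print(f"population mean:        {statistics.mean(population):,.0f}")
print(f"stratified sample mean: {statistics.mean(sample):,.0f}")
```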

With very large samples, simple random sampling and stratified sampling tend to produce similar results: simple random sampling alone yields a sample in which all meaningful groups are represented roughly proportionately. This is because random variation tends to balance out across a large number of observations.

Let’s contrast the above with a common pitfall in sampling: convenience sampling. Convenience sampling involves selecting the individuals who are easiest to reach (e.g., employees present in the office). While this method is by definition quick and inexpensive, it tends to produce biased samples that do not reflect the broader population. It can even be applied by accident: we may think we are using simple random sampling while essentially employing convenience sampling. For instance, suppose we collect data about the population of interest by searching Google and taking the figures from the first (non-sponsored) website that appears. Although we may still end up with a representative sample, there is a significant chance that we will not; a specific subgroup, i.e., websites with significant online presence and visibility, may be over-represented. As a result, our conclusions may be skewed, even though the sampling process seemed neutral at first glance.
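The sketch below shows how convenience sampling can mislead, again with invented numbers: the easy-to-reach subgroup (say, highly visible startups) pays well above the rest of the population, so sampling only from it inflates the estimate.

```python
import random
import statistics

random.seed(9)

# Invented population: most firms pay around $52k; a visible minority
# of startups pays around $100k.
typical  = [random.gauss(52_000, 8_000)   for _ in range(9_000)]
startups = [random.gauss(100_000, 15_000) for _ in range(1_000)]
population = typical + startups

# Convenience sample: only the easy-to-reach, highly visible subgroup.
convenient = random.sample(startups, 200)

print(f"population mean:         {statistics.mean(population):,.0f}")
print(f"convenience sample mean: {statistics.mean(convenient):,.0f}")
```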

This last example highlights the importance of being deliberate and transparent in how we select our samples. The chosen sampling method should align with our goals, the population structure, and our available resources. While simple random sampling is often sufficient, stratified sampling provides greater precision when subgroup differences are expected. Careful sampling ensures that findings are not skewed by over- or under-represented groups.

Technically, any sample represents some population. For example, if we collect data only from very experienced financial analysts in the banking sector, our sample - by definition - represents that particular group. The real challenge is ensuring that the population our sample represents closely matches the population we actually intend to study.

7.4 Sample Size

In addition to how we select our sample, the number of individual cases we include - known as the sample size - is equally important. A small sample, even if chosen randomly, may not reflect the true characteristics of a large population, simply due to chance. Larger samples tend to provide more stable and accurate estimates, reducing the impact of random variation. This is intuitive: the larger the sample, the more clearly we see the overall population. A useful analogy is a puzzle: connecting more pieces gives us a better idea of the full picture, even if some pieces are still missing.
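A small simulation makes this visible. In the sketch below (synthetic data, invented figures), we repeatedly draw samples of different sizes and measure how much the resulting estimates fluctuate from one draw to the next.

```python
import random
import statistics

random.seed(13)
population = [random.gauss(52_000, 9_000) for _ in range(100_000)]

for n in (10, 100, 1_000):
    # Repeat the sampling 500 times and see how much the estimates vary.
    estimates = [statistics.mean(random.sample(population, n))
                 for _ in range(500)]
    spread = statistics.stdev(estimates)
    print(f"n={n:5d}: estimates fluctuate by about +/- {spread:,.0f}")
```

Larger samples produce estimates that cluster much more tightly around the true mean.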

That said, not all populations are large. In some studies, the population itself may be relatively small — for instance, all employees in a small company, or patients with a rare condition. In such cases, a small sample may still be sufficient to describe the population meaningfully, especially if it includes a large proportion of the total group.

However, when working with large populations, collecting large samples can be costly or time-consuming. Therefore, choosing an appropriate sample size requires a balance between precision and practicality.

7.5 Parameters and Statistics

Along with selecting individual cases for our sample, we must also decide which variables to study - those characteristics that directly relate to our research questions. In the employee salary example, relevant variables might include total compensation, department, job title, education level, or years of experience. The right variables depend on what we want to learn. For instance, comparing salaries across departments requires department-related data; studying the effect of education on pay makes education a key variable. Because our sample is meant to represent a broader population, the data we collect should reflect real relationships that exist in that population. Otherwise, our conclusions may be inaccurate or incomplete.

Suppose we survey 200 employees and find an average salary of $52,000. This number is called a statistic — a summary calculated from our sample. Our broader goal, however, might be to estimate the true average salary across an entire industry. That unknown value is called a parameter.

  • Parameter: A numerical summary of a population. Usually unknown.

  • Statistic: A numerical summary of a sample. Known and calculated from data.

We use statistics to estimate parameters. The more representative and larger our sample, the more accurate our estimate is likely to be. In practice, collecting data from an entire population is often unrealistic — making sampling essential to research and decision-making.
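In code terms, the distinction looks like this. Note that in a real study only the sample, and hence the statistic, is available; the population below is synthetic, so we can peek at the parameter for comparison.

```python
import random
import statistics

random.seed(1)
# Synthetic population standing in for "all employees in the industry".
population = [random.gauss(52_000, 9_000) for _ in range(50_000)]

parameter = statistics.mean(population)  # unknown in real studies
sample    = random.sample(population, 200)
statistic = statistics.mean(sample)      # what we can actually compute

print(f"statistic (our estimate):        {statistic:,.0f}")
print(f"parameter (unknown in practice): {parameter:,.0f}")
print(f"estimation error:                {statistic - parameter:+,.0f}")
```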

7.6 Descriptive and Inferential Statistics

At this point, it’s also important to emphasize that statistics come in two broad types: descriptive and inferential. Descriptive statistics help us summarize and organize data so we can understand their structure. They include measures like averages and percentiles, and tools like graphs - anything that helps paint a clearer picture of the data we’ve collected. For example, the average salary, the most common job title, and the range of years of experience in our sample are all descriptive statistics.
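A few lines of Python suffice to compute typical descriptive statistics; the sample values below are invented.

```python
import statistics

# A small invented sample of salaries and job titles.
salaries = [41_000, 48_500, 52_000, 52_000, 55_000, 61_000, 75_000]
titles = ["analyst", "analyst", "engineer", "analyst",
          "manager", "engineer", "manager"]

print("mean salary:      ", statistics.mean(salaries))
print("median salary:    ", statistics.median(salaries))
print("salary range:     ", max(salaries) - min(salaries))
print("most common title:", statistics.mode(titles))
```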

Inferential statistics, on the other hand, allow us to go a step further. They help us use the sample data to make educated guesses (that is, to draw well-founded conclusions) about the larger population. If we estimate the industry-wide average salary based on our sample - or compare salaries between departments and test whether the difference is meaningful - we’re doing inferential statistics. Both types are important: descriptive statistics help us understand what we are working with, while inferential statistics help us make defensible claims about what we can’t directly observe.
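Formal inferential tools (confidence intervals, hypothesis tests) are covered later; as a rough preview, the sketch below attaches a crude measure of uncertainty to a sample mean via the standard error. The data are simulated, and the 1.96 multiplier is the conventional value for an approximate 95% interval.

```python
import random
import statistics

random.seed(7)
# Simulated survey data standing in for 200 observed salaries.
sample = [random.gauss(52_000, 9_000) for _ in range(200)]

mean = statistics.mean(sample)
se = statistics.stdev(sample) / len(sample) ** 0.5  # standard error of the mean

# A rough 95% interval for the unknown population mean.
print(f"estimated industry mean: {mean:,.0f} +/- {1.96 * se:,.0f}")
```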

7.7 Putting It All Together: A Salary Example

To see how the key ideas we’ve discussed — population, sample, statistic, and parameter — come together in practice, let’s walk through a simplified example.

Suppose we want to estimate the average salary in our industry. This is a common question for companies seeking to understand their positioning in the talent market and ensure competitive compensation. However, collecting salary data from every employee across all firms is unrealistic. We therefore take a practical approach: we gather information from a sample of 200 employees working in various roles at various companies.

Compensation Studies

While we are using the average salary in an industry as the parameter of interest, in practice companies would be interested in much more detailed breakdowns - for example, by job family, seniority, etc.

Let’s assume this sample is drawn from multiple, non-overlapping, reliable data sources. The data providers rely on simple random sampling, meaning that each individual had an equal chance of being included in the sample. This doesn’t guarantee perfection, but it does reduce the chance of bias. Once we have the data, we calculate the average salary for the 200 employees. The result is $52,000.

This number is a statistic — a summary measure calculated from our sample. It tells us something about the people we actually observed.

But what we truly care about is the parameter: the average salary across the entire industry. This is a summary measure of the full population. Unfortunately, we don’t have access to this value because we didn’t (and likely couldn’t) collect data from everyone.

If our sample is representative — meaning it includes employees from various sectors, company sizes, and roles in proportions that reflect the broader industry — then our statistic ($52,000) is probably a good estimate of the unknown parameter. This is the core strength of statistical reasoning: using sample data to learn about population truths.

But consider a contrasting scenario. Suppose our sample includes only employees from high-paying startups, perhaps due to the data sources we used. In that case, our sample average might be $100,000. While still technically a statistic, it’s no longer a reliable estimate of the true industry average — it’s likely an overestimate due to sampling bias.

This simple example highlights why sampling design matters so much. The quality of the sample directly affects the accuracy of our conclusions. A poorly chosen sample can lead to misleading statistics and flawed decisions.

Here’s how the process unfolds in summary (a short simulation sketch follows the list):

  • We define a population — all employees in the industry.

  • We draw a sample — 200 employees selected through (ideally) random methods.

  • We compute a statistic — the sample’s average salary, $52,000.

  • We use that statistic to estimate a parameter — the unknown true industry average.
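The sketch below replays these four steps on a synthetic "industry"; all figures are invented, and the parameter is only computable here because the population is simulated.

```python
import random
import statistics

random.seed(3)

# Step 1: define a population - all employees in the (synthetic) industry.
industry = [random.gauss(52_000, 12_000) for _ in range(100_000)]

# Step 2: draw a sample - 200 employees via simple random sampling.
sample = random.sample(industry, 200)

# Step 3: compute a statistic - the sample's average salary.
statistic = statistics.mean(sample)

# Step 4: use the statistic to estimate the parameter.
parameter = statistics.mean(industry)  # unknowable in a real study
print(f"statistic (sample mean): {statistic:,.0f}")
print(f"parameter (true mean):   {parameter:,.0f}")
```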

Let’s take this a step further. If our own company’s average salary is $50,000, and we estimate that the industry average is $52,000, this information could guide strategic decisions. It might raise questions like: Are we falling behind the market? Should we adjust our pay structure? Are there differences across roles or departments that explain the gap?

This illustrates the broader point: statistics allow us to make informed, data-driven decisions, even when we can’t measure everything directly.

7.8 The Law of Large Numbers

At this point, it is necessary to emphasize one of the most fundamental results in statistics: the Law of Large Numbers (LLN). This law tells us that, as the sample size increases, the sample mean gets closer and closer to the true population mean. In other words, the more data we collect, the more reliable our estimate of the population average becomes.
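Stated slightly more formally (a standard formulation of the law, assuming independent draws from a population with mean $\mu$):

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \;\longrightarrow\; \mu \quad \text{as } n \to \infty.$$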

To see why this matters, imagine trying to estimate the average monthly salary in a population of 1,000 people. If we only look at 5 people, our sample mean could easily be far from the true average. But if we had data from 500 people—or even 999—their average salary would almost certainly be very close to the actual population mean.
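The sketch below simulates exactly this scenario with an invented population of 1,000 monthly salaries, comparing sample means at several sample sizes.

```python
import random
import statistics

random.seed(11)
# Invented population: monthly salaries of 1,000 people.
population = [random.gauss(3_000, 800) for _ in range(1_000)]
true_mean = statistics.mean(population)

for n in (5, 50, 500, 999):
    sample_mean = statistics.mean(random.sample(population, n))
    print(f"n={n:3d}: sample mean = {sample_mean:5,.0f} "
          f"(true mean = {true_mean:,.0f})")
```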

This idea is powerful because it reassures us that statistics based on larger samples are generally more trustworthy than those based on smaller samples.

7.9 Why Statistical Thinking Matters

Statistics is not just a collection of formulas — it is a way of thinking. It gives us a structured approach for understanding the world when full information is out of reach. In real life, we almost never have access to an entire population. Instead, we work with samples and try to make reasonable estimates about the larger group. This process allows us to move from specific observations to broader conclusions with a certain degree of confidence.

At its core, statistics begins with a question and ends with better understanding. It does this by connecting samples to populations, and by using known values — statistics — to make informed guesses about unknown truths — parameters. If the sample is carefully chosen and the assumptions behind our methods are sound, then the statistics we compute can bring meaningful insight into complex systems.

But the real power of statistics lies in the mindset it promotes. Statistical thinking encourages us to look beneath the surface, to ask how the data were gathered, and to question which conclusions are actually warranted. This critical lens is essential in a world filled with data.

There’s a popular belief that “statistics can be made to argue for anything.” While it’s true that numbers can be misused or misinterpreted, the problem usually lies not in the math, but in the assumptions and context behind it. When the data are valid and the methods sound, the numbers don’t lie - but they do require careful interpretation to reveal the truth. Even experienced statisticians may disagree, not because the math is wrong, but because they interpret the same numbers through different lenses.

Moreover, when complex data are boiled down into a single figure, such as an average, individual variation is lost. Statistics describe groups, not individuals. A population average says nothing about how any one person feels, earns, or performs. This is why statistics should be seen as tools for acquiring a broader understanding, not personalized truth.

In short, to think statistically is to recognize uncertainty, ask thoughtful questions, and seek evidence-based conclusions. It means being skeptical without being cynical, and curious without being careless. It equips us to read headlines more wisely, to make decisions more responsibly, and to better understand the world around us.