## How to Define ‘Statistics’: A Simple Guide to Inference

You may have taken a statistics course in high school or college, but could you define the word ‘statistic’? Classes often dive into calculating specific statistics without first explaining what a statistic is. In this post, I will discuss what statistics are and how we can use them to draw conclusions.

## What is Statistical Inference?

The field of statistics uses collected data to learn something about a specific group or **population**. Statisticians define a population as the entire group of individuals or elements you are interested in analyzing. For example, if you want to examine U.S. opinions about global warming, the population you want to analyze is everyone living in the United States. If you want to study how mice grow when given high-protein food, your population of interest is all mice. A study of everyone in a population of interest is called a **census**. In principle, a census provides the most accurate data about a population because it contains information from everyone in the population.

Unfortunately, it can be difficult, expensive, or even impossible to study everyone in a population. Think about how difficult it would be to interview everyone in the U.S. or find and examine all mice. This is why we usually study small portions of a population, called **samples**, instead. The image below shows how a sample relates to a population. **Statistical inference** is the process of using a sample to learn about the whole population.

## Statistics vs Parameters

We now know what statistical inference is, but we still haven’t defined a statistic. In the image above, there are two # symbols: one under the population circle and one under the sample circle. The # under the population circle represents what we want to learn about a population. We call this a **parameter**. In math classes, we usually write this as a Greek letter (often θ, theta). However, here I use # for simplicity and as a reminder that this is a number.

The image also has a # under the sample circle. This # is different because it has a ^ symbol above it. Mathematicians call this symbol a **hat**, probably because it sits on the symbol’s head. The hat above a symbol indicates that it is an estimate. In this case, our #-hat is an estimate of #, our parameter. We call this estimate a **statistic**. More precisely, a statistic is any number calculated from a sample; here we use it to estimate a parameter of the population. You can remember the difference between a statistic and a parameter by their first letters. A **p**arameter is a characteristic of a **p**opulation. **S**tatistics are characteristics of a **s**ample.
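To make the parameter/statistic distinction concrete, here is a minimal Python sketch. The population is simulated, and the numbers (10,000 heights centered on 170 cm, a sample of 100) are assumptions chosen only for illustration. In real studies we almost never get to see the parameter; the simulation lets us peek at it.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 made-up heights in centimeters.
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Parameter: the mean of the whole population (normally unknown).
parameter = statistics.mean(population)

# Statistic: the same calculation applied to a random sample of 100.
sample = rng.sample(population, 100)
statistic = statistics.mean(sample)

print(f"parameter (population mean): {parameter:.2f}")
print(f"statistic (sample mean):     {statistic:.2f}")
```

The two printed numbers should be close but not identical: the statistic is our #-hat, an estimate of the # we actually care about.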

## Types of Statistics

These ‘characteristics’ can take different forms. Imagine I want to understand the average height of Asian elephants. I would calculate the mean height in my sample of elephants and use it as my statistic. Then, I can use it to estimate the mean height of all Asian elephants, which is my parameter.

The mean is not the only statistic we can calculate. We may want to know the most common survey answer in a sample, in which case we would use the mode as our statistic. Similarly, maybe we want to know how similar test scores are for a group of third-graders, in which case we might use variance or standard deviation as our statistic.
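As a quick sketch, the statistics named above can all be computed with Python's built-in `statistics` module. The survey answers and test scores below are made up for illustration.

```python
import statistics

# Hypothetical sample data, invented for this example.
survey_answers = ["agree", "neutral", "agree", "disagree", "agree"]
test_scores = [70, 75, 80, 85, 90]

most_common = statistics.mode(survey_answers)      # most frequent answer
mean_score = statistics.mean(test_scores)          # average score
score_variance = statistics.variance(test_scores)  # sample variance (divides by n - 1)
score_stdev = statistics.stdev(test_scores)        # square root of the sample variance

print(most_common, mean_score, score_variance, round(score_stdev, 2))
```

Note that `statistics.variance` and `statistics.stdev` are the sample versions, which is what we want when the data is a sample rather than the full population.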

Which statistic we choose depends on what we want to learn about our sample and, through it, about our population. When you learn things like mean, median, and mode in your statistics class, it is because these can describe a sample, and therefore are all types of statistics!

## When Things Go Wrong

Statisticians use statistics to make an educated guess (or inference) about what the parameter might be based on the data. This entire process will only work under two conditions.

First, the sample size must be large enough. The larger the sample size, the more accurate the statistician’s guess tends to be. This is because, with a larger sample size, there is a better chance of getting an accurate representation of the population of interest. For example, if we only selected two people from the whole U.S., we would miss a lot of meaningful variation.
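The effect of sample size can be seen in a small simulation. This is only a sketch: the population is simulated, and `spread_of_sample_means` is a helper name I made up for this example. It draws many samples of a given size and measures how much the sample mean bounces around from one sample to the next.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 20,000 made-up measurements.
population = [rng.gauss(50, 15) for _ in range(20_000)]

def spread_of_sample_means(n, repeats=500):
    """Standard deviation of the sample mean across `repeats` samples of size n."""
    means = [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]
    return statistics.stdev(means)

small = spread_of_sample_means(2)    # samples of two, like the example above
large = spread_of_sample_means(200)  # much larger samples

print(f"spread with n=2:   {small:.2f}")
print(f"spread with n=200: {large:.2f}")
```

The spread for samples of two should be far larger than for samples of 200, meaning a single tiny sample can easily land a long way from the parameter.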

There is also a chance that we are selecting two people who provide a skewed view of the U.S. as a whole. Perhaps they both live in a city and we have no representation of the rural population.

This brings us to the second condition: the sample needs to accurately reflect the population. For example, if we want to know about all mice, but we only sample male mice, our sample is different from our population in a meaningful way. What we learn about male mice may not apply to all mice. Similarly, if we want to know about the whole U.S., but only sample people from New York, our sample is not sufficiently representative of the whole U.S. This is why it is critical to be careful about choosing a sample.
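Here is a sketch of the mouse example in Python. The weights and the size of the male/female difference are invented for illustration and are not real biological values. The point is that a large sample does not help if it only covers part of the population.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: weights in grams; the sex difference is made up.
males = [rng.gauss(30, 3) for _ in range(5_000)]
females = [rng.gauss(24, 3) for _ in range(5_000)]
population = males + females

parameter = statistics.mean(population)  # close to 27 by construction

# Same sample size, two different ways of choosing the sample.
males_only = statistics.mean(rng.sample(males, 500))
random_sample = statistics.mean(rng.sample(population, 500))

print(f"parameter:           {parameter:.2f}")
print(f"males-only estimate: {males_only:.2f}")
print(f"random estimate:     {random_sample:.2f}")
```

Both samples contain 500 mice, but the males-only estimate should miss the parameter by several grams while the random sample lands much closer.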

## Bias and Generalizability

When a sample meets these two conditions, it is representative of the population, and we call it **generalizable**. This means that what we learn from the sample, we can *generalize* to the population. It is common for generalizable samples to be randomly selected from the population so that no group is selected more often than another. However, there are many more complicated sampling methods that also create generalizable samples. Understanding how data was sampled is extremely important. Some data may need to be checked for **bias** before calculating statistics. **Biased** samples are the opposite of **generalizable** samples and should be avoided. If you want to learn more, I dedicated a whole blog post to sampling and sampling biases.

## Summary

Statistical inference is the process of learning about a population parameter by investigating statistics calculated from a sample of data. It works well when the sample is large enough and representative of the population, so that the findings can be generalized to the whole population.
