What is a Sample and How To Take One?

A data scientist needs to understand where their data comes from and how it was collected. This understanding comes from learning statistical inference and sampling techniques. In this post, I will discuss ways of choosing a sample (called sampling designs) and the pros and cons of each. To learn more about statistical inference, please see my post about the definition of statistics.

What is a Sample?

Samples are small selections of data taken when it is not feasible to study an entire population. By studying a sample, we can get an estimate of what the whole population looks like. This only works if that sample is representative of the population. A sample is representative when it is selected in such a way that it accurately reflects the characteristics of the population from which it was taken.

Mathematicians tend to use N to represent the total number of elements (also called units or people) in a population and n to represent the number in a sample, or sample size. The image below shows a population of N = 100 grey dots on the left side. On the right side, n = 10 of these dots are colored maroon to show that they are in the sample. I chose these 10 dots randomly when making the image. The sampling design I used ensured that all 100 grey dots had the same chance of being selected and, more precisely, that every possible group of 10 dots was equally likely to be chosen. We call any sampling design with this property a simple random sample (SRS for short).

Comparison of Simple Random Sample and Population
My original set of images, made for this online data science textbook.
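As a rough illustration of the dot example, here is a minimal Python sketch of a simple random sample. The population of 100 numbered dots and the seed value are assumptions made for the example, not part of the original image.

```python
import random

# Hypothetical population: 100 dots labeled 0..99 (N = 100).
population = list(range(100))

# Fixed seed so the sketch is repeatable; any seed works.
rng = random.Random(42)

# random.sample draws without replacement, and every possible
# group of n = 10 dots is equally likely to be chosen.
srs = rng.sample(population, k=10)

print(sorted(srs))
```

Note that the code needs the full list of dots before it can draw anything, which is the same requirement an SRS places on a real population.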

Representative Sampling Designs

An SRS is an easy and efficient way of choosing a representative sample from a population. Because you choose every element randomly, no portion of the population is systematically favored over another. The major downside of using an SRS is that you need a complete list of the elements in the population (often called a sampling frame) before you can draw the sample. For example, if I were to take an SRS from the entire U.S. population, I would first need a list of everyone in the U.S. to sample from. For large populations, this is usually not feasible.

Stratified Samples

Sometimes, we want to ensure portions of a population are represented equally. For example, we may want to make sure that we are equally representing the opinions of males and females or staff from different departments in a survey. In these cases, it is useful to use a stratified random sample, depicted below.

Depiction of a stratified random sample

In a stratified random sample, we group the population based on characteristics of interest and call these groups strata. Then, we take a random sample within each stratum. To represent the strata equally, as in this example, we sample the same number of elements from each one. In the image above, I have stratified the grey dots by pattern and then randomly sampled 2 dots from each of the 5 strata to make my sample of n = 10 maroon dots.
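The stratified design described above can be sketched in a few lines of Python. The pattern names and the way dots are assigned to patterns are hypothetical; they only stand in for the five patterns in the image.

```python
import random

rng = random.Random(0)

# Hypothetical population: 100 dots, each tagged with one of five patterns.
patterns = ["solid", "striped", "dotted", "checked", "hatched"]
population = [(dot_id, patterns[dot_id % 5]) for dot_id in range(100)]

# Group the dots into strata by pattern.
strata = {}
for dot_id, pattern in population:
    strata.setdefault(pattern, []).append(dot_id)

# Take a simple random sample of 2 dots inside every stratum.
stratified_sample = {
    pattern: rng.sample(members, k=2) for pattern, members in strata.items()
}

sample_size = sum(len(chosen) for chosen in stratified_sample.values())
print(stratified_sample)
print(sample_size)  # 5 strata x 2 dots = 10
```

Every stratum contributes to the sample, which is what guarantees that each pattern is represented.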

Cluster Samples

I consistently see students confuse stratified random sampling with another sampling design known as cluster sampling, depicted below. When you take a cluster sample, you also divide the population into groups. However, you normally form these groups based on proximity rather than on a shared characteristic. As a result, each cluster tends to contain a wide variety of elements, whereas the elements within a stratum are similar by design. Once we make clusters, we take a random sample of those clusters and include every element in the chosen clusters. The main difference between a cluster sample and a stratified sample is that a cluster sample chooses entire clusters, while a stratified sample chooses elements within every stratum.

Depiction of a cluster sample
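To contrast with stratified sampling, here is a minimal Python sketch of a cluster sample. The layout of 10 clusters with 10 neighboring dots each is an assumption made for the example.

```python
import random

rng = random.Random(1)

# Hypothetical population: 100 dots arranged in 10 clusters of 10 neighbors.
clusters = {c: list(range(c * 10, c * 10 + 10)) for c in range(10)}

# Randomly choose 2 whole clusters...
chosen_clusters = rng.sample(sorted(clusters), k=2)

# ...and keep every dot inside them.
cluster_sample = [dot for c in chosen_clusters for dot in clusters[c]]

print(chosen_clusters)
print(len(cluster_sample))  # 2 clusters x 10 dots = 20
```

Here the randomness is in which groups are picked, not in which elements are picked within a group.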

Those taking samples often use a combination of these techniques. This is known as a multistage sample. For example, they may take a simple random sample of states in the United States. Then, they may group all people within those states into clusters by county and select a random sample of those clusters. Lastly, they may stratify by whether a person lives in an urban or rural area and randomly sample people within those strata. This would result in a multistage sample of the U.S. population.

Biased Samples

Each of the sampling designs we have discussed is designed to produce a representative sample of a population. However, many sampling designs do not produce samples that are representative of the population. We call these types of samples biased. Results from biased samples are difficult to generalize to the broader population because the sample is, by definition, different from the population in some meaningful way.

Systematic Samples

An example of a sampling scheme that straddles the line between biased and representative is the systematic sample. As the name suggests, we choose these samples in a systematic or predetermined way. For example, I may choose the first 50 people in line or every third person who calls me. Whether the result is biased depends on how the list is ordered. If I told my students to line up by GPA and then selected the first 10 students to take a test, these 10 students would all come from one end of the GPA range and would not be representative of the whole group. However, if I took every other student in line, the sample would be less biased because it would include students from the front, middle, and back of the line.
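The GPA example can be checked with a small Python sketch. The class of 30 students and their evenly spaced GPAs are made up for illustration, and the sketch takes every third student (rather than every other) so that both samples contain 10 students.

```python
# Hypothetical class of 30 students lined up from highest to lowest GPA.
gpas = [round(4.0 - 0.1 * i, 1) for i in range(30)]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


first_ten = gpas[:10]    # the first 10 students in line
every_third = gpas[::3]  # every third student, starting at the front

print(f"whole class: {mean(gpas):.2f}")
print(f"first ten:   {mean(first_ten):.2f}")
print(f"every third: {mean(every_third):.2f}")
```

With this ordering, the first ten students sit well above the class mean, while every third student lands close to it.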

Convenience and Voluntary Samples

A convenience sample is almost always biased. Again, as the name suggests, we choose who to include in these samples based on what is easiest. Psychology studies frequently use convenience samples. Many psychology departments require undergraduate students to participate in research studies so faculty can get participants for their research. Unfortunately, it is difficult to believe that undergraduate psychology majors are representative of the overall population, which is commonly listed as a limitation of this work.

Another example of a biased sample is the voluntary response sample, which is also fairly self-explanatory. In a voluntary response sample, participants choose to be part of the sample. Online reviews are a common example. Those who review restaurants or businesses online through Google or Yelp normally have strong positive or negative opinions about those businesses. This is why you are not likely to see reviews from patrons who did not feel strongly either way about their experience.

Summary

There are many ways to sample a population, many of which are not discussed here. Some of these ways create samples that are representative of the population. These allow for easy generalization of results from the sample to the population (see my post about statistical inference). However, there are also many biased sampling methods. These make it difficult to learn about the underlying population. It is important for anyone working with data to understand how their data might be biased, because biased data can lead to misleading results.


If you enjoyed this post, please subscribe at this link to receive notifications when I post new articles. Also, please check out my YouTube channel for more data science content!

How to Define ‘Statistics’: A Simple Guide to Inference

You may have taken a statistics course in high school or college, but could you define the word ‘statistic’? Classes often fail to explain the definition of a statistic at the beginning of the course, before diving into calculating specific statistics. In this post, I will discuss what statistics are, and how we can use them to make conclusions.

What is Statistical Inference?

The field of statistics uses collected data to learn something about a specific group or population. Statisticians define a population as the entire group of individuals or elements you are interested in analyzing. For example, if you want to examine U.S. opinions about global warming, the group you want to analyze is everyone in the United States. If you want to study how mice grow when given high-protein food, your population of interest is all mice. A study of everyone in a population of interest is called a census. In principle, a census provides the most accurate data about a population because it contains information from everyone in the population.

Unfortunately, it can be difficult, expensive, or even impossible to study everyone in a population. Think about how difficult it would be to interview everyone in the U.S. or find and examine all mice. This is why we usually study small portions of a population, called samples, instead. The image below shows how a sample relates to a population. Statistical inference is the process of using a sample to learn about the whole population.

Image showing a sample (and corresponding statistic) taken from a population

Statistics vs Parameters

We now know what statistical inference is, but we still haven’t defined a statistic. In the image above, there are two #’s: one under the population circle and one under the sample circle. The # under the population circle represents what we want to learn about a population. We call this a parameter. In math classes, we usually depict this using a Greek letter (often theta, θ). However, here I use # for simplicity and as a reminder that this is a number.

The image also has a # under the sample circle. This # is different because it has a ^ symbol above it. Mathematicians call this symbol a hat, probably because it sits on the symbol’s head. The hat above a symbol indicates that it is an estimate. In this case, our #-hat is an estimate of #, our parameter. We call this estimate a statistic. A statistic is an estimate of a parameter based on a sample of the population. You can remember the difference between a statistic and a parameter by their first letters: a Parameter is a characteristic of a Population, and a Statistic is a characteristic of a Sample.

Types of Statistics

These ‘characteristics’ can take different forms. Imagine I want to understand the average height of Asian elephants. I would calculate the mean of the heights in my sample of elephants and use it as my statistic. Then, I can use it to estimate the mean height of all Asian elephants, which is my parameter.
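To make the parameter and statistic concrete, here is a small Python sketch of the elephant example. The heights are simulated, and the numbers used to generate them (a mean of 2.7 m and a spread of 0.2 m) are assumptions for illustration only, not real measurements.

```python
import random
import statistics

rng = random.Random(7)

# Made-up population of 5,000 elephant heights in meters (illustrative only).
population_heights = [rng.gauss(2.7, 0.2) for _ in range(5000)]

# Parameter: the mean of the whole population (usually unknown in practice).
parameter = statistics.mean(population_heights)

# Statistic: the mean of a simple random sample of 50 elephants.
sample_heights = rng.sample(population_heights, k=50)
statistic = statistics.mean(sample_heights)

print(f"parameter = {parameter:.3f} m, statistic = {statistic:.3f} m")
```

In a real study only the statistic can be computed, because the full population is never measured.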

The mean is not the only statistic we can calculate. We may want to know the most common survey answer in a sample, in which case we would use the mode as our statistic. Similarly, maybe we want to know how similar test scores are for a group of third-graders, in which case we might use variance or standard deviation as our statistic.

The statistic itself depends on what we want to learn about our sample, and as a result our population. When you learn things like mean, median, and mode in your statistics class, it is because these can describe a sample, and therefore are all types of statistics!
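The statistics mentioned above can all be computed with Python's built-in statistics module. The survey answers and test scores below are invented for the example.

```python
import statistics

# Invented sample data for illustration.
survey_answers = ["agree", "neutral", "agree", "disagree", "agree", "neutral"]
test_scores = [70, 75, 80, 85, 90]

most_common_answer = statistics.mode(survey_answers)  # "agree"
middle_score = statistics.median(test_scores)         # 80
score_variance = statistics.variance(test_scores)     # 62.5 (divides by n - 1)
score_sd = statistics.stdev(test_scores)              # square root of the variance

print(most_common_answer, middle_score, score_variance, score_sd)
```

Each value describes the sample, so each one is a statistic that estimates the matching characteristic of the population.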

When Things Go Wrong

Statisticians use statistics to make an educated guess (or inference) about what the parameter might be based on the data. This entire process will only work under a few conditions.

First, the sample size must be large enough. The larger the sample size, the more accurate the statistician’s guess tends to be. This is because, with a larger sample size, there is a better chance of getting an accurate representation of the population of interest. For example, if we only selected two people from the whole U.S., we would miss a lot of meaningful variation.

There is also a chance that we are selecting two people who provide a skewed view of the U.S. as a whole. Perhaps they both live in a city and we have no representation of the rural population.
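The effect of sample size can be seen in a short simulation. This is a sketch under stated assumptions: the population is simulated, and the sample sizes of 2 and 200 and the 500 repetitions are arbitrary choices for the example.

```python
import random
import statistics

rng = random.Random(3)

# Made-up population of 10,000 values (illustrative only).
population = [rng.gauss(50, 10) for _ in range(10_000)]
true_mean = statistics.mean(population)


def average_error(n, repeats=500):
    """Average distance between the sample mean and the true mean."""
    errors = []
    for _ in range(repeats):
        sample = rng.sample(population, k=n)
        errors.append(abs(statistics.mean(sample) - true_mean))
    return statistics.mean(errors)


small_n_error = average_error(2)
large_n_error = average_error(200)

print(f"n = 2:   average error {small_n_error:.2f}")
print(f"n = 200: average error {large_n_error:.2f}")
```

Samples of two people miss the true mean by much more, on average, than samples of 200.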

This brings us to the second condition: the sample needs to accurately reflect the population. For example, if we want to know about all mice, but we only sample male mice, our sample is different from our population in a meaningful way. What we learn about male mice may not apply to all mice. Similarly, if we want to know about the whole U.S. but only sample people from New York, our sample is not sufficiently representative of the whole U.S. This is why it is critical to be careful about choosing a sample.
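The mouse example can be sketched the same way. The weights below are simulated, and the assumption that male mice are heavier on average is made up purely to show what happens when a sample leaves out part of the population.

```python
import random
import statistics

rng = random.Random(11)

# Made-up weights in grams; the two group means are assumptions.
males = [rng.gauss(30, 2) for _ in range(1000)]
females = [rng.gauss(25, 2) for _ in range(1000)]
all_mice = males + females

# Parameter: the mean weight of all mice in this simulated population.
parameter = statistics.mean(all_mice)

# Two samples of the same size: one from males only, one from all mice.
males_only = statistics.mean(rng.sample(males, k=100))
random_mix = statistics.mean(rng.sample(all_mice, k=100))

print(f"parameter:   {parameter:.2f}")
print(f"males only:  {males_only:.2f}")
print(f"all mice:    {random_mix:.2f}")
```

Both samples have 100 mice, so the gap comes from who was eligible to be sampled rather than from the sample size.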

Bias and Generalizability

When a sample meets these two conditions, it is representative of the population, and we call it generalizable. This means that what we learn from the sample, we can generalize to the population. It is common for generalizable samples to be randomly selected from the population so that no group is selected more often than another. However, there are many more complicated sampling methods that also create generalizable samples. Understanding how data was sampled is extremely important. Some data may need to be checked for bias before calculating statistics. Biased samples are the opposite of generalizable samples and should be avoided. If you want to learn more, I dedicated a whole blog post to sampling and sampling biases.

Summary

Statistical inference is the process of learning about a population parameter by investigating statistics calculated from a sample of data. The method works well only when the sample is large enough and representative of the population, so that the findings can be generalized to that population.

