## What is a Sample and How To Take One?

A data scientist needs to understand where their data comes from and how it was collected. This understanding comes from learning statistical inference and sampling techniques. In this post, I will discuss ways of choosing a sample (called **sampling designs**) and the pros and cons of each. To learn more about statistical inference, please see my post about the definition of statistics.

## What is a Sample?

**Samples** are small selections of data taken when it is not feasible to study an entire population. By studying a sample, we can get an estimate of what the whole population looks like. This only works if that sample is **representative** of the population. A sample is representative when it is selected in such a way that it accurately reflects the characteristics of the population from which it was taken.

Mathematicians tend to use N to represent the total number of elements (also called units or people) in a population and n to represent the number in a sample, or **sample size**. The image below shows a population of N = 100 grey dots on the left side. On the right side, some of these dots are maroon. These are the n = 10 dots in the sample. I chose these 10 dots randomly when making the image. The sampling design I used ensured all 100 grey dots had the *same chance* of being selected. We call a sampling design where all elements in the population have the same chance of being in the sample, and every possible group of n elements is equally likely to be chosen, a **simple random sample** (SRS for short).
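As a minimal sketch of how one might draw an SRS like this in Python, assume the population is just the integers 0 through 99. That labeling is my own stand-in for the dots, and the seed is arbitrary:

```python
import random

# Hypothetical population: 100 dots labeled 0..99 (N = 100).
population = list(range(100))

# A fixed, arbitrary seed so the sketch gives the same draw each run.
rng = random.Random(42)

# sample() draws without replacement, so no dot is picked twice and
# every group of 10 dots is equally likely to be chosen.
sample = rng.sample(population, k=10)  # n = 10

print(sorted(sample))
```

Note that the code needs the full list `population` before it can draw anything.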

## Representative Sampling Designs

An SRS is an easy and efficient way of choosing a representative sample from a population. Because you choose every element randomly, there is little chance of choosing any portion of the population significantly more often than another. The major downside of using an SRS is that you need to know every possible element in a population to take a random sample. For example, if I were to take an SRS from the entire U.S. population, I would first need a list of everyone in the U.S. to sample from. For large populations, this is usually not feasible.

### Stratified Samples

Sometimes, we want to ensure portions of a population are represented equally. For example, we may want to make sure that we are equally representing the opinions of males and females or staff from different departments in a survey. In these cases, it is useful to use a **stratified random sample**, depicted below.

In a stratified random sample, we group the population based on characteristics of interest and call these groups *strata*. Then, we sample the same number of elements from each stratum. In the image above, I have *stratified* the grey dots by pattern and then randomly sampled 2 dots *from each stratum* to make my sample of size n = 10 maroon dots.
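The same idea can be sketched in Python. The pattern names, the `stratified_sample` helper, and the even split of 20 dots per pattern are all assumptions I made up for illustration:

```python
import random
from collections import defaultdict

rng = random.Random(0)  # arbitrary seed for reproducibility

# Hypothetical population: 100 dots, each tagged with one of five patterns.
patterns = ["solid", "striped", "dotted", "checked", "hatched"]
population = [{"id": i, "pattern": patterns[i % 5]} for i in range(100)]

def stratified_sample(units, key, per_stratum, rng):
    """Draw the same number of units at random from every stratum."""
    strata = defaultdict(list)
    for unit in units:
        strata[unit[key]].append(unit)  # group units by the stratifying trait
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, k=per_stratum))
    return sample

# 5 strata x 2 dots each gives a sample of size n = 10.
sample = stratified_sample(population, key="pattern", per_stratum=2, rng=rng)

for unit in sample:
    print(unit["id"], unit["pattern"])
```

Because the draw happens separately inside each stratum, every pattern is guaranteed to appear in the sample. An SRS of the same size cannot promise that.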

### Cluster Samples

I consistently see students confuse stratified random sampling with another sampling design known as **cluster sampling**, depicted below. When you take a cluster sample, you also divide the population into groups. However, you normally choose these groups based on proximity. Clusters usually contain a wide variety of elements (as opposed to strata, which are formed because their elements are similar). Once we make clusters, we take a random sample *of those clusters*. The main difference between a cluster and a stratified sample is that a cluster sample chooses *entire* clusters. In contrast, a stratified sample chooses elements *within* strata.
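To make the contrast with stratified sampling concrete, here is a rough Python sketch of a cluster sample. The 10 clusters of 10 dots and the `cluster_sample` helper are hypothetical choices of mine:

```python
import random

rng = random.Random(1)  # arbitrary seed for reproducibility

# Hypothetical population: 100 dots grouped by proximity into
# 10 clusters of 10 dots each (cluster 0 holds dots 0..9, and so on).
clusters = {c: [c * 10 + i for i in range(10)] for c in range(10)}

def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick whole clusters and keep every unit inside them."""
    chosen = rng.sample(sorted(clusters), k=n_clusters)
    sample = [unit for c in chosen for unit in clusters[c]]
    return chosen, sample

chosen, sample = cluster_sample(clusters, n_clusters=2, rng=rng)

print("clusters chosen:", chosen)
print("sample size:", len(sample))
```

Here the random draw is over cluster labels, not individual dots. Every dot in a chosen cluster ends up in the sample.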

Those taking samples often use a combination of these techniques. This is known as a **multistage sample**. For example, they may take a simple random sample of states in the United States. Then, they group all people within those states into clusters by county and select a sample of those clusters. Lastly, they stratify by whether a person lives in an urban or rural area and randomly sample people within those strata. This results in a multistage sample of the U.S. population.
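The three stages in that example can be sketched as nested random draws. Everything about the toy population below (6 states, 4 counties per state, 20 people per urban or rural group) is invented for illustration, as is the `multistage_sample` helper:

```python
import random

rng = random.Random(7)  # arbitrary seed for reproducibility

# Hypothetical toy population:
# 6 states x 4 counties x {urban, rural} x 20 people.
population = {
    f"state{s}": {
        f"county{c}": {
            area: [f"s{s}-c{c}-{area}-{p}" for p in range(20)]
            for area in ("urban", "rural")
        }
        for c in range(4)
    }
    for s in range(6)
}

def multistage_sample(pop, n_states, n_counties, n_per_stratum, rng):
    sample = []
    # Stage 1: simple random sample of states.
    for state in rng.sample(sorted(pop), k=n_states):
        counties = pop[state]
        # Stage 2: random sample of county clusters within each chosen state.
        for county in rng.sample(sorted(counties), k=n_counties):
            # Stage 3: stratify by urban/rural and sample people in each stratum.
            for people in counties[county].values():
                sample.extend(rng.sample(people, k=n_per_stratum))
    return sample

sample = multistage_sample(
    population, n_states=2, n_counties=2, n_per_stratum=3, rng=rng
)
print(len(sample))  # 2 states x 2 counties x 2 strata x 3 people
```

Only the final stage needs a list of individual people, and only for the counties that were actually chosen.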

## Biased Samples

Each of the sampling designs we have discussed creates representative samples of a population. However, many sampling designs are not representative of the population. We call these types of samples **biased**. Biased samples are difficult to generalize to the broader population because they are, by definition, different in some meaningful way.

### Systematic Samples

An example of a sampling scheme that straddles the line between biased and representative is the **systematic sample**. As the name suggests, we choose these samples in a *systematic or preconceived* way. For example, I may choose the first 50 people in line or every third person who calls me. If I told my students to line up by GPA and then selected the first 10 students to take a test, these 10 students would not be representative of the whole group. However, if I took every other student in line, the sample would be less biased because I would be sure to get students from both the front and the back of the line.
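A small simulation can illustrate the GPA example. The class size of 40, the made-up GPAs, and the choice to take every fourth student from a random starting position are my own assumptions:

```python
import random

rng = random.Random(3)  # arbitrary seed for reproducibility

# Hypothetical class of 40 students with made-up GPAs,
# lined up from highest GPA to lowest.
gpas = sorted((round(rng.uniform(2.0, 4.0), 2) for _ in range(40)), reverse=True)

def mean(values):
    return sum(values) / len(values)

# Systematic choice 1: the first 10 students in line (the top 10 GPAs).
first_ten = gpas[:10]

# Systematic choice 2: every k-th student, starting from a random position
# among the first k, so the sample is spread along the whole line.
k = len(gpas) // 10
start = rng.randrange(k)
every_kth = gpas[start::k]

print(f"class mean GPA:     {mean(gpas):.2f}")
print(f"first ten in line:  {mean(first_ten):.2f}")
print(f"every {k}th student:  {mean(every_kth):.2f}")
```

Both samples contain 10 students. Only the second draws from the whole line, which is why its mean should sit closer to the class mean.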

### Convenience and Voluntary Samples

A **convenience sample** is almost always biased. Again, as the name suggests, we choose who to include in these samples based on what is easiest. Psychology studies frequently use convenience samples. Many psychology departments require undergraduate students to participate in research studies so faculty can get participants for their research. Unfortunately, it is difficult to believe that undergraduate psychology majors are representative of the overall population, which is commonly listed as a downside of this work.

Another example of a biased sample is the **voluntary response sample**, which is also fairly self-explanatory. In a voluntary response sample, participants choose to be part of the sample. Online reviews are a common example. Those who review restaurants or businesses online through Google or Yelp normally have strong opinions, positive or negative, about those businesses. You are much less likely to see reviews from patrons who did not feel strongly either way about their experience.
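To see how self-selection can distort a sample, here is a toy simulation. The distribution of star ratings and the chance of leaving a review at each rating are numbers I made up, not real Google or Yelp data:

```python
import random

rng = random.Random(11)  # arbitrary seed for reproducibility

# Hypothetical customers: most experiences are middling (3 or 4 stars).
ratings = rng.choices([1, 2, 3, 4, 5], weights=[5, 15, 40, 30, 10], k=10_000)

# Assumed (made-up) chance that a customer posts a review, by rating:
# people with strong feelings are far more likely to bother.
review_prob = {1: 0.30, 2: 0.10, 3: 0.02, 4: 0.08, 5: 0.25}

# Each customer decides independently whether to review.
reviews = [r for r in ratings if rng.random() < review_prob[r]]

def share(values, star):
    """Fraction of values equal to the given star rating."""
    return values.count(star) / len(values)

print(f"3-star share among all customers: {share(ratings, 3):.2f}")
print(f"3-star share among reviewers:     {share(reviews, 3):.2f}")
```

Under these assumptions, middling experiences make up a much smaller share of the posted reviews than of the customers, so the reviews alone give a skewed picture.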

## Summary

There are many ways to sample a population, many of which are not discussed here. Some of these ways create samples that are representative of the population. These allow for easy generalization of results from the sample to the population (see my post about statistical inference). However, there are also many biased sampling methods. These make it difficult to learn about the underlying population. It is important for anyone working with data to understand how their data might be biased, because biased data can lead to misleading results.
