Introduction to Sampling Distributions

The goal of statistics is to extract meaningful information from samples to make inferences about a larger population. Since it is often impractical to measure every individual in a population, we rely on samples—smaller subsets of the population—to carry out statistical analysis. In doing so, we investigate random variables or collections of random variables that characterize the population.
Numerical measures that describe a population (e.g., mean, variance, standard deviation) are called parameters.

Numerical measures that describe a sample are called statistics.

The goal of inferential statistics is to use sample statistics to estimate population parameters. Sampling distributions play a critical role in this process.

Definition:

Sampling Distribution

A sampling distribution is the probability distribution of a statistic, such as the sample mean or proportion, calculated from all possible samples of a fixed size drawn from the population.
Just as we use mass functions to describe discrete random variables and density functions to describe continuous ones, the sampling distribution is a theoretical distribution of a sample statistic. It provides the foundation for many statistical techniques, including hypothesis testing and confidence intervals, both of which rely on the sample mean as a key component of their calculations.
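To make the definition concrete, here is a small sketch in Python (the population values are made up purely for illustration): for a tiny population we can enumerate every possible sample of size $n = 2$ and tally the resulting sample means into a sampling distribution.

```python
from itertools import combinations
from statistics import mean

# Hypothetical population of N = 4 values (illustrative only)
population = [2, 4, 6, 8]

# Enumerate every possible sample of size n = 2 (without replacement)
samples = list(combinations(population, 2))
sample_means = [mean(s) for s in samples]

# Tally the sampling distribution of the mean: P(sample mean = x) for each x
counts = {}
for m in sample_means:
    counts[m] = counts.get(m, 0) + 1
dist = {m: c / len(samples) for m, c in sorted(counts.items())}

print(dist)                                    # each possible sample mean with its probability
print(mean(sample_means) == mean(population))  # the distribution is centered at the population mean
```

Even in this toy example, the average of all possible sample means equals the population mean, previewing the first property of the Central Limit Theorem below.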
Understanding sampling distributions is essential because they enable us to:

  • Estimate Population Parameters: Sampling distributions bridge the gap between sample statistics (e.g., the sample mean) and population parameters (e.g., the population mean).
  • Assess Sampling Variability: They quantify how much sample statistics are expected to vary from sample to sample.
  • Conduct Hypothesis Testing: Sampling distributions help determine probabilities and critical values for decision-making.
  • Calculate Confidence Intervals: They provide a basis for computing intervals within which population parameters are likely to lie with a certain degree of confidence.
Sampling Distributions of the Mean

    One of the most important sampling distributions in statistics is the sampling distribution of the mean.

    When we collect data, we often calculate the sample mean as a summary of the data. However, the value of the sample mean can vary depending on the sample we select. By understanding the behavior of sample means, we can make inferences about the population mean.
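This sample-to-sample variability is easy to see in a quick simulation (the population below is generated for illustration; the numbers are not from the text): several random samples of the same size drawn from one population produce different sample means.

```python
import random
from statistics import mean

random.seed(1)  # reproducible illustration

# Hypothetical population: 1,000 values spread uniformly between 0 and 100
population = [random.uniform(0, 100) for _ in range(1000)]

# Draw five independent random samples of size n = 25; each yields its own mean
sample_means = [mean(random.sample(population, 25)) for _ in range(5)]

for i, m in enumerate(sample_means, start=1):
    print(f"sample {i}: mean = {m:.2f}")
print(f"population mean = {mean(population):.2f}")
```

Each run of the loop gives a different sample mean, all scattered around the population mean; the sampling distribution describes exactly how that scatter behaves.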

    Definition:

    Sampling Distribution of the Mean

    A sampling distribution of the sample mean is a probability distribution of all possible sample means from all possible samples of size $n$.
    This distribution helps us answer key questions, such as:
  • How likely is it for a sample mean to fall within a certain range?
  • How much can we expect the sample mean to vary from the population mean?
The Central Limit Theorem (CLT)

    The Central Limit Theorem (CLT) is a fundamental concept in statistics that explains the behavior of the sampling distribution of the mean. It states that, regardless of the shape of the population distribution, the sampling distribution of the sample mean will be approximately normally distributed if the sample size is sufficiently large. This is a powerful result because it allows us to make inferences about the population mean using the normal distribution, even when the population distribution is unknown.

    Theorem:

    The Central Limit Theorem (CLT)

    Let $X_1, X_2, \dots, X_n$ be a random sample of size $n$ from a population with finite mean $\mu$ and finite standard deviation $\sigma$. Then the sampling distribution of the sample mean $\bar{X}$ approaches a normal distribution as the sample size $n$ becomes large, regardless of the population's original distribution.

    More specifically,

  • The mean of the sampling distribution of the sample mean is equal to the population mean $\mu$: $$\mu_{\bar{X}}=\mu$$
  • The standard deviation of the sampling distribution of the sample mean, also known as the standard error, is equal to the population standard deviation divided by the square root of the sample size $n$. That is, $$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
  • As $n\rightarrow \infty$, the sampling distribution of the sample mean approaches a normal distribution with mean $\mu$ and standard deviation $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$: $$\bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$
Remark:

    In practice, the approximation to normality is typically sufficient when $n$ is reasonably large (e.g., $n\geq 30$), though the required sample size depends on the skewness and shape of the original population distribution. If the population is already normal, the sampling distribution of the mean will be exactly normal for any sample size.
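The theorem can be checked empirically. The sketch below is a simulation with assumed parameters (not taken from the text): it draws many samples of size $n = 30$ from a strongly skewed exponential population with $\mu = \sigma = 1$, then compares the mean and standard deviation of the resulting sample means against the CLT predictions $\mu$ and $\sigma/\sqrt{n}$.

```python
import random
from statistics import mean, stdev

random.seed(0)  # reproducible illustration

# Exponential population with rate 1, so mu = 1 and sigma = 1 (skewed, not normal)
mu, sigma, n = 1.0, 1.0, 30

# Draw many samples of size n and record each sample mean
sample_means = [mean(random.expovariate(1.0) for _ in range(n))
                for _ in range(20_000)]

# CLT predictions: center ~= mu, spread (standard error) ~= sigma / sqrt(n)
print(f"mean of sample means: {mean(sample_means):.3f}  (mu = {mu})")
print(f"std of sample means:  {stdev(sample_means):.3f}  (sigma/sqrt(n) = {sigma / n ** 0.5:.3f})")
```

Despite the heavy right skew of the exponential population, the distribution of the 20,000 sample means is centered near $\mu = 1$ with spread near $\sigma/\sqrt{30} \approx 0.183$, as the theorem predicts.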