Introduction to Parameter Estimation

Inferential statistics enables us to draw conclusions about a population based on data from a sample. Unlike descriptive statistics, which summarizes data, inferential techniques use probability theory to make predictions, test hypotheses, and estimate population parameters.

Key methods in inferential techniques include:

  • Estimation: Determining population parameters (like mean or proportion) through confidence intervals.
  • Hypothesis Testing: Evaluating claims about population parameters using sample data.
  • Regression Analysis: Exploring relationships between variables and making predictions.

These techniques rely on principles such as sampling distributions and the Central Limit Theorem to generalize findings from a representative sample to the broader population. By incorporating uncertainty and variability, inferential statistics provide a foundation for decision-making and scientific discovery.

    Parameter Estimation and Confidence Intervals

    In statistics, parameter estimation refers to a process by which information obtained from a sample is used to make inferences about a population. There are two types of estimators that can be used to estimate the value of a population parameter: point and interval.

    Definition:

    Point Estimate

    A point estimate uses a single value of a statistic to approximate the value of a population parameter.
    For example, the sample mean, $\bar{x}$, is often used as a point estimate for the population mean, $\mu$; the sample variance, $s^2$, serves as a point estimate for the population variance, $\sigma^2$, and so on.

    Remark

    Using a single point to estimate the value of a population parameter is neither ideal nor practical: the probability that the calculated statistic exactly equals the population parameter is extremely small, and a single value gives no indication of how close the estimate is likely to be.
    Another drawback of point estimates is that they do not reflect sampling variability. A sample mean, for instance, is random insofar as the sample itself is random: every time a new sample is drawn from the population, a different sample mean will generally result. For these reasons, interval estimates are preferred.

    Definition:

    Interval Estimate

    An interval estimate provides a range, or interval, of values within which the parameter is likely to be found.

    This interval of values is called a confidence interval.
    For example, $a<\mu<b$ is an interval estimate for $\mu$, and indicates that the population mean is somewhere between $a$ and $b$.

    Remark

    A natural question which arises is what should we add or subtract to the point estimate in order to generate the lower and upper bounds of the interval? The answer depends on two considerations:

  1. the standard error of the statistic, and
  2. the level of confidence that is to be attached to the interval.

    Definition:

    Standard Error

    The standard error quantifies the precision of the sample statistic as an estimate of the population parameter.
    A smaller standard error indicates that the sample statistic is a more precise estimate of the population parameter, and a larger standard error suggests that the sample statistic is less precise.

    Example

    Suppose a random sample of 100 students has a mean test score of 85, and the population standard deviation is 10. The standard error of the mean is: $$SE_{\bar{X}}=\frac{\sigma}{\sqrt{n}}=\frac{10}{\sqrt{100}}=1$$

    Solution

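    The standard error of the mean is $1$. This implies that, across repeated samples of 100 students, the sample mean typically differs from the true population mean by about 1 point. For a normally distributed sample mean, about $95\%$ of sample means fall within roughly 2 standard errors (here, about 2 points) of the population mean.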

    Example

    Suppose you conduct a survey of 200 people to determine the proportion who prefer tea over coffee. Out of the 200 people, 120 say they prefer tea. The sample proportion $p$ is: $$ p=\frac{\text { Number of people preferring tea }}{\text { Total number of people surveyed }}=\frac{120}{200}=0.6 $$ The formula for the standard error of the sample proportion is: $$SE_p=\sqrt{\frac{p(1-p)}{n}}=\sqrt{\frac{(0.6)(1-0.6)}{200}}=0.0346$$

    Solution

    The standard error of the proportion is approximately 0.0346. This means that if you repeated the survey many times, the sample proportion would typically vary by about 0.0346 (or 3.46 percentage points) from the true population proportion.
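
    The two standard error calculations above can be checked with a few lines of Python. This is a minimal sketch; the function names `se_mean` and `se_proportion` are illustrative choices, not part of the original text.

```python
import math

def se_mean(sigma, n):
    """Standard error of the sample mean when the population sigma is known."""
    return sigma / math.sqrt(n)

def se_proportion(p, n):
    """Standard error of a sample proportion p based on n observations."""
    return math.sqrt(p * (1 - p) / n)

# Test-score example: sigma = 10, n = 100
print(se_mean(10, 100))

# Tea survey example: 120 of 200 respondents prefer tea
print(round(se_proportion(120 / 200, 200), 4))
```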

    Confidence Intervals

    A confidence interval is a statistical tool used to estimate the range within which a population parameter is likely to fall, based on sample data. It provides a measure of uncertainty around the estimate, allowing researchers to make inferences about the population with a specified level of confidence, such as 95% or 99%.

    The interval is constructed using the sample statistic (e.g., mean or proportion), the standard error, and a critical value determined by the desired confidence level.

    Confidence intervals are widely used in research and decision-making to quantify the precision of an estimate and communicate the reliability of conclusions drawn from data.

    Definition:

    Confidence Level

    The confidence level is the long-run proportion of intervals, constructed by the same procedure from repeated random samples, that contain the true value of the parameter. The confidence level is denoted as $(1-\alpha)\, 100 \%$, where $\alpha$ denotes the level of significance.
    Since confidence intervals are constructed from data obtained from random samples, they are random as well. As a result, we can never be certain that a particular interval contains the value of the parameter that we are trying to estimate. However, intervals are constructed in such a way that we have a high degree of confidence that they contain the actual value of the parameter that we are interested in.
    In interval estimation, the level of significance ($\alpha$) is the complement of the confidence level: it is the probability that an interval constructed in this way fails to contain the true population parameter because of random sampling error. Meanwhile, the margin of error (ME) represents the maximum expected difference between a sample statistic, such as the mean or proportion, and the true population parameter. It captures the uncertainty in an estimate caused by sampling variability and is essential for constructing confidence intervals.
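
    The repeated-sampling meaning of the confidence level can be illustrated with a small simulation. The sketch below assumes a normal population with known $\sigma$ and builds each interval as the sample mean plus or minus a normal critical value times $\sigma/\sqrt{n}$. The population values, the seed, and the function name `coverage` are arbitrary illustrative choices.

```python
import random
from statistics import NormalDist, fmean

def coverage(mu=50.0, sigma=10.0, n=25, level=0.95, reps=2000, seed=1):
    """Fraction of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)
    # Two-sided normal critical value for the requested confidence level
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * sigma / n ** 0.5
    hits = 0
    for _ in range(reps):
        xbar = fmean(rng.gauss(mu, sigma) for _ in range(n))
        if xbar - margin <= mu <= xbar + margin:
            hits += 1
    return hits / reps

# The result should be close to 0.95, though not exactly equal to it
print(coverage())
```

    Each simulated interval either contains $\mu$ or it does not; the confidence level describes how often the procedure succeeds over many samples.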

    Definition:

    Margin of Error

    The margin of error (ME) is the maximum expected difference between a sample statistic and the true population parameter. It quantifies the uncertainty in an estimate due to sampling variability.
    The margin of error is the product of two numbers: the standard error, and a critical value. Critical values are essentially cut-off values that define regions where the test statistic is unlikely to lie, and are obtained by looking up tables which describe how the statistic is distributed.

    Remark

    Factors that can influence the size of the margin of error:
  • Sample Size ($n$): Larger samples reduce the standard error, leading to a smaller margin of error.
  • Confidence Level: Higher confidence levels require a larger critical value $\left(Z^*\right)$, increasing the margin of error.
  • Variability: Greater variability in the population (higher $\sigma$ or $p(1-p)$ ) leads to a larger margin of error.
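
    The three factors listed above can be seen numerically. The following sketch assumes the known-$\sigma$ case for a mean, where the margin of error is a normal critical value times $\sigma/\sqrt{n}$; the function name `margin_of_error` and the baseline numbers are illustrative.

```python
from statistics import NormalDist

def margin_of_error(sigma, n, level):
    """Margin of error for a mean with known sigma at the given confidence level."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return z * sigma / n ** 0.5

base = margin_of_error(sigma=10, n=100, level=0.95)
print(round(base, 3))

# Larger sample size gives a smaller margin of error
print(margin_of_error(10, 400, 0.95) < base)
# Higher confidence level gives a larger margin of error
print(margin_of_error(10, 100, 0.99) > base)
# Greater population variability gives a larger margin of error
print(margin_of_error(20, 100, 0.95) > base)
```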

    Remark

    Information about the precision of the estimation is conveyed by the length of the interval. A short interval implies a precise estimation; and a long interval implies an imprecise estimation.

    Remark

    As confidence increases, precision decreases, and vice versa.
    Putting all these elements together, we can now consider the complete definition of a confidence interval.

    Definition:

    Confidence Interval

    A confidence interval is a range of values that is likely to contain the true value of a population parameter. It is constructed using a sample statistic, the standard error, and a critical value based on the desired confidence level. The formula for a confidence interval is: $$\text{CI} = \text{point estimate} \pm \text{margin of error}$$

    Rule of Thumb

    All confidence intervals must be accompanied by a probabilistic statement that is interpreted in the context of the problem. The general forms of some acceptable interpretations are:

  • ``We are _____% confident that the true ______ lies between ______ and _____.``
  • ``With repeated sampling, we are _____% confident that the true value of the population ______ is between ______ and ______.``

    Estimating the Mean of A Population (Known Variance)

    In many situations, we would like to estimate the mean of a population, $\mu$. Estimating $\mu$ involves one of two scenarios: the population variance, $\sigma^2$, is known, or it is unknown. For a random sample $X_1, X_2, \dots, X_n$ of size $n$ from a population with mean $\mu$ and variance $\sigma^2$, the sample mean, $\bar{X}$, follows a normal distribution under certain conditions (the population is normal, or the sample is large enough for the Central Limit Theorem to apply): $$\overline{X}\sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$ where the second argument is the standard deviation of $\bar{X}$.
    When the population variance is known, the standard error of the sample mean is $\frac{\sigma}{\sqrt{n}}$, and the sample mean $\bar{X}$ serves as a point estimate for $\mu$. Using this standard error and a critical Z-value, we can construct a confidence interval that provides a range where the true population mean is likely to fall, offering a precise and reliable method for estimation.

    Remark

    The sampling distribution of the mean when the population variance is known is the Normal distribution.

    Formula:

    Confidence Interval for the Mean; $\sigma$ Known

    Let $\overline{X}$ be the sample mean obtained from a random sample of size $n$ from a normal population with known standard deviation, $\sigma$. Then the $(1-\alpha) \,100 \%$ confidence interval for $\mu$ is $$ \bar{x} \pm Z_{\alpha / 2} \frac{\sigma}{\sqrt{n}}$$ where $Z_{\alpha / 2}$ is the critical value associated with the level of confidence.
    The critical value, $Z_{\alpha / 2}$, is found by blocking off the middle $(1-\alpha)\, 100 \%$ of the area under the standard normal distribution, and determining the value of $k$ which satisfies $P(-k<Z<k)=1-\alpha$; then $Z_{\alpha / 2}=k$.

    Remark

    The margin of error is $Z_{\alpha / 2} \frac{\sigma}{\sqrt{n}}$ and represents the maximum likely difference between the sample mean and the true population mean.
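
    Critical values are usually read from a table, but they can also be computed. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `z_critical` is an illustrative choice.

```python
from statistics import NormalDist

def z_critical(alpha, two_sided=True):
    """Upper critical value of the standard normal distribution.

    For a two-sided interval alpha is split between both tails (Z_{alpha/2});
    for a one-sided bound all of alpha sits in one tail (Z_alpha).
    """
    tail = alpha / 2 if two_sided else alpha
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.98, 0.99):
    print(level, round(z_critical(1 - level), 3))
```

    Tables often round these values (for example 2.33 or 2.58), so hand calculations may differ from software in the last digit.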

    Example

    A random sample of 25 students is taken from a population of students with a known variance of 100. The sample mean is 75. Find the 95% confidence interval for the population mean.

    Solution

    Given: $n=25$, $\sigma=\sqrt{100}=10$, $\bar{x}=75$, $\alpha=0.05$. The critical value for a 95% confidence interval is $Z_{0.025}=1.96$. The margin of error is $1.96 \times \frac{10}{\sqrt{25}}=3.92$. The 95% confidence interval for the population mean is $75 \pm 3.92 = (71.08, 78.92)$. Interpretation: We are 95% confident that the true population mean lies between 71.08 and 78.92.

    Example

    A biologist is studying the average weight of a specific species of frogs in a population. From previous research, the population variance ( $\sigma^2$ ) is known to be $4 g^2$ (so $\sigma=2 g$ ). The biologist collects a random sample of $n=25$ frogs and finds that the sample mean weight is $\bar{x}=12.5g$. Construct a $98 \%$ confidence interval for the true mean weight of the frogs $(\mu)$.

    Solution

    Given: $n=25$, $\sigma=2$, $\bar{x}=12.5$, $\alpha=0.02$. The critical value for a 98% confidence interval is $Z_{0.01}=2.33$. The margin of error is $2.33 \times \frac{2}{\sqrt{25}}=0.932$. The 98% confidence interval for the population mean is $12.5 \pm 0.932 = (11.568, 13.432)$. Interpretation: We are 98% confident that the true population mean weight of the frogs is between $11.568$ and $13.432$ grams.
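
    The interval formula for known $\sigma$ translates directly into code. This is a minimal sketch under the same assumptions as the formula (normal population or large sample, $\sigma$ known); the function name `z_interval` is hypothetical.

```python
import math
from statistics import NormalDist

def z_interval(xbar, sigma, n, level):
    """Two-sided confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # Z_{alpha/2}
    margin = z * sigma / math.sqrt(n)               # margin of error
    return xbar - margin, xbar + margin

# Student example: n = 25, sigma = 10, sample mean 75, 95% confidence
print(tuple(round(v, 2) for v in z_interval(75, 10, 25, 0.95)))

# Frog example: n = 25, sigma = 2, sample mean 12.5, 98% confidence
print(tuple(round(v, 3) for v in z_interval(12.5, 2, 25, 0.98)))
```

    Because the code uses an unrounded critical value, the frog interval can differ from the table-based answer in the third decimal place.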

    Example

    A physicist measures the speed of sound in a different medium over $n=25$ trials, finding a sample mean $\bar{x}=343 m/s$. The population standard deviation is known to be $\sigma=3 m/s$.

    Sample Sizes

    There are two ways to increase the precision of our estimates: lowering the confidence level, or increasing the sample size.

    Lowering the level of confidence is not advisable because it produces less reliable results. Therefore, the preferable option is to increase the sample size.
    To determine the sample size needed to meet a required precision, we solve the margin of error part of the confidence interval or bound for $n$.

    Formula:

    Sample Size for Estimating the Mean; $\sigma$ Known

    To estimate the sample size needed to estimate the population mean with a specified margin of error, we use the formula: $$n=\left(\frac{Z_{\alpha / 2} \sigma}{ME}\right)^2$$ where $n$ is the sample size, $Z_{\alpha / 2}$ is the critical value, $\sigma$ is the population standard deviation, and $ME$ is the margin of error.

    Example

    A researcher wants to estimate the mean weight of a population of frogs with a margin of error of $0.5g$. The population standard deviation is known to be $2g$. What sample size is needed to achieve this margin of error with a $95 \%$ confidence level?

    Solution

    Given: $ME=0.5$, $\sigma=2$, $\alpha=0.05$. The critical value for a 95% confidence interval is $Z_{0.025}=1.96$. The sample size needed is $n=\left(\frac{1.96 \times 2}{0.5}\right)^2=61.4656$. Therefore, the researcher should take a sample size of $n=62$ to estimate the mean weight of the population with a margin of error of $0.5g$ and a $95 \%$ confidence level.

    Example

    A scientist wants to estimate the mean speed of sound in a medium with a margin of error of $0.5 m/s$. The population standard deviation is known to be $3 m/s$. What sample size is needed to achieve this margin of error with a $99 \%$ confidence level?

    Solution

    Given: $ME=0.5$, $\sigma=3$, $\alpha=0.01$. The critical value for a 99% confidence interval is $Z_{0.005}=2.58$. The sample size needed is $n=\left(\frac{2.58 \times 3}{0.5}\right)^2=239.6304$. Therefore, rounding up, the scientist should take a sample size of $n=240$ to estimate the mean speed of sound in the medium with a margin of error of $0.5 m/s$ and a $99 \%$ confidence level.
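
    The sample size formula is easy to evaluate in code, which also guards against arithmetic slips. The sketch below takes the critical value as an input, as one would when reading it from a table; the function name `sample_size_mean` is an illustrative choice. The result is always rounded up, since rounding down would give a margin of error slightly larger than requested.

```python
import math

def sample_size_mean(z, sigma, margin):
    """Smallest n with z * sigma / sqrt(n) <= margin (sigma known)."""
    return math.ceil((z * sigma / margin) ** 2)

# Frog example: 95% confidence (z = 1.96), sigma = 2 g, margin 0.5 g
print(sample_size_mean(1.96, 2, 0.5))

# Speed of sound example: 99% confidence (z = 2.58), sigma = 3 m/s, margin 0.5 m/s
print(sample_size_mean(2.58, 3, 0.5))
```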

    Example

    A manufacturer claims that their lightbulbs have an average lifetime of $\mu=1200$ hours. A quality control team tests a random sample of $n=40$ lightbulbs and finds a sample mean lifetime of $\bar{x}=$ 1185 hours. The population standard deviation is known to be $\sigma=100$ hours.

    Single Sided Confidence Bounds for the Mean; $\sigma$ Known

    In general, two sided confidence intervals are used to conduct an interval estimate. But, one-sided confidence bounds also exist; these are often used in conjunction with hypothesis tests.
    The mechanics of finding a one-sided confidence bound are exactly the same as those for a two-sided confidence interval; the only difference is that the critical value $Z_{\alpha / 2}$ is replaced by $Z_\alpha$ to reflect that all of $\alpha$ is placed in one tail of the distribution.

    Formula:

    One-Sided Confidence Bound for the Mean; $\sigma$ Known

    Let $\overline{X}$ be the sample mean obtained from a random sample of size $n$ from a normal population with known standard deviation, $\sigma$. Then a
  • $(1-\alpha) 100 \%$ lower confidence bound for $\mu$ is $$\bar{x}-Z_\alpha \frac{\sigma}{\sqrt{n}} \leq \mu $$
  • $(1-\alpha) 100 \%$ upper confidence bound for $\mu$ is $$\mu \leq \bar{x}+Z_\alpha \frac{\sigma}{\sqrt{n}} $$

    Example

    An engineering team is testing the tensile strength of a new type of steel alloy. From a sample of $n=50$ test pieces, they measure a sample mean tensile strength of $\bar{x}=850 MPa$. The population standard deviation is known to be $\sigma=40 MPa$. Construct and interpret a $98 \%$ lower confidence bound for the true mean tensile strength of the steel alloy $(\mu)$.

    Solution

    Given: $n=50$, $\sigma=40$, $\bar{x}=850$, $\alpha=0.02$. The critical value for a 98% one-sided bound is $Z_{0.02}=2.05$. The margin of error is $2.05 \times \frac{40}{\sqrt{50}}=11.60$. The 98% lower confidence bound for the population mean is $850-11.60 = 838.40$. Interpretation: With repeated sampling, we are 98% confident that the true population mean tensile strength of the steel alloy is at least 838.40 MPa. This lower bound indicates that the true average tensile strength is unlikely to fall below this value, providing engineers with a conservative estimate for performance guarantees.

    Example

    A veterinarian is studying the weight of a new breed of puppies at 3 months old. From a sample of $n=30$ puppies, the average weight is found to be $\bar{x}=12.5 kg$, with a population standard deviation of $\sigma=2 kg$. Construct and interpret a $95 \%$ upper confidence bound for the true mean weight of the puppies $(\mu)$.

    Solution

    Given: $n=30$, $\sigma=2$, $\bar{x}=12.5$, $\alpha=0.05$. The critical value for a 95% one-sided bound is $Z_{0.05}=1.645$. The margin of error is $1.645 \times \frac{2}{\sqrt{30}}=0.601$. The 95% upper confidence bound for the population mean is $12.5+0.601 = 13.101$. Interpretation: With repeated sampling, we are 95% confident that the true population mean weight of the puppies is at most 13.101 kg. This upper bound indicates that the true average weight is unlikely to exceed this value, providing veterinarians with a conservative estimate for feeding guidelines.
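
    One-sided bounds differ from two-sided intervals only in the critical value, which the following sketch makes explicit. It assumes known $\sigma$; the function names `z_lower_bound` and `z_upper_bound` are hypothetical.

```python
import math
from statistics import NormalDist

def z_lower_bound(xbar, sigma, n, level):
    """Lower confidence bound for the mean (sigma known): all alpha in one tail."""
    z = NormalDist().inv_cdf(level)   # Z_alpha, not Z_{alpha/2}
    return xbar - z * sigma / math.sqrt(n)

def z_upper_bound(xbar, sigma, n, level):
    """Upper confidence bound for the mean (sigma known)."""
    z = NormalDist().inv_cdf(level)
    return xbar + z * sigma / math.sqrt(n)

# Steel alloy example: 98% lower bound
print(round(z_lower_bound(850, 40, 50, 0.98), 2))

# Puppy example: 95% upper bound
print(round(z_upper_bound(12.5, 2, 30, 0.95), 3))
```

    Results computed this way use unrounded critical values, so they may differ slightly from answers based on table values rounded to two decimals.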

    Estimating the Mean of A Population; Variance Unknown

    In most situations, the population variance is unknown. When the variance is not provided, the sample standard deviation, $s$, is used as an estimate of $\sigma$, which introduces additional variability. Consequently, the standardized sample mean, $\frac{\bar{X}-\mu}{s/\sqrt{n}}$, follows a t-distribution rather than a normal distribution.

    The t-distribution adjusts for this added uncertainty and depends on the sample size through its degrees of freedom $(n-1)$. This method allows researchers to construct confidence intervals and perform hypothesis tests for the population mean, even in the absence of precise knowledge about the population variance, making it a widely applicable and robust statistical approach.

    Remark

    When the population variance is unknown, the sampling distribution used for inference about the mean is the t-distribution. The t-distribution is similar to the normal distribution but has heavier tails, which account for the additional variability introduced by using the sample standard deviation as an estimate of the population standard deviation.

    The $t-$Distribution

    The t-distribution is a family of distributions that depend on the degrees of freedom $(n-1)$, which adjust for the additional variability introduced by using the sample standard deviation as an estimate of the population variance. The t-distribution is symmetric and bell-shaped, similar to the normal distribution, but has heavier tails. As the sample size increases, the t-distribution approaches the normal distribution. The t-distribution is used to construct confidence intervals and perform hypothesis tests for the population mean when the population variance is unknown.

    Remark

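    If the desired degrees of freedom falls between two values listed in the t-table, the smaller value should be used (i.e. round down), since this gives a slightly larger and therefore more conservative critical value.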

    Example

    Find the $t-$value of a $t-$distribution with $12$ degrees of freedom and $5\%$ of the area in the right tail.

    Solution

    $t_{0.05,12}=1.782$

    Example

    Find the $t-$value of a $t-$distribution with $58$ degrees of freedom with $12.5\%$ of the area in the right tail.

    Solution

    $t_{0.125,58}\approx 1.162$

    Example

    Find the $t-$value of a $t-$distribution with $20$ degrees of freedom and $0.005$ of the area in the left tail.

    Solution

    By symmetry, the value is $-t_{0.005,20}=-2.845$

    Example

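    Find the $t-$value of a $t-$distribution with $1500$ degrees of freedom with $1\%$ of the area in the left tail.

    Solution

    With $1500$ degrees of freedom the $t-$distribution is essentially standard normal, so the $\infty$ row of the table is used. By symmetry, the value is $-t_{0.01,\infty}=-2.326$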
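
    Table lookups like those above can be cross-checked numerically. The Python standard library has no t-distribution, so the sketch below integrates the t density with Simpson's rule and finds the critical value by bisection. This is an illustrative approach, not the method used to build printed tables; the function names, the step count, and the search range of 0 to 100 are all assumptions made for the example.

```python
import math

def t_upper_tail(x, df, steps=2000):
    """P(T > x) for x >= 0, integrating the t density on [0, x] by Simpson's rule."""
    # Log of the normalizing constant, using lgamma to avoid overflow for large df
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

    h = x / steps                      # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 - total * h / 3         # half the area lies to the right of 0

def t_critical(alpha, df):
    """Value t with right-tail area alpha (0 < alpha < 0.5), found by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if t_upper_tail(mid, df) > alpha:
            lo = mid                   # tail still too large: move right
        else:
            hi = mid
    return (lo + hi) / 2

for alpha, df in [(0.05, 12), (0.125, 58), (0.005, 20), (0.01, 1500)]:
    print(alpha, df, round(t_critical(alpha, df), 3))
```

    Left-tail values follow by symmetry: the value with area $\alpha$ in the left tail is the negative of the value with area $\alpha$ in the right tail.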

    Confidence Interval for the Population Mean; $\sigma$ Unknown

    The mechanics of constructing interval estimates for the mean, $\mu$, when the population variance is unknown are exactly the same as when we do know the value of $\sigma^2$. The only differences are:
  1. The sample standard deviation, $s$, is used in place of the population standard deviation, $\sigma$.
  2. The t-distribution is used in place of the normal distribution to account for the additional variability introduced by using $s$ as an estimate of $\sigma$.

    Formula:

    Confidence Interval for the Population Mean; $\sigma$ Unknown

    Let $\bar{x}$ be the sample mean obtained from a random sample of size $n$ with unknown population variance $\sigma^2$. Then the $(1-\alpha) 100 \%$ confidence interval for $\mu$ is $$\bar{x} \pm t_{\alpha/2, n-1} \left(\frac{s}{\sqrt{n}}\right)$$ where $t_{\alpha/2, n-1}$ is the $t-$value that corresponds to the desired confidence level and degrees of freedom $(n-1)$.
    The standard error of the mean is calculated as $SE(\bar{x})=\frac{s}{\sqrt{n}}$ and the margin of error is $ME=t_{\alpha/2, n-1} \left(\frac{s}{\sqrt{n}}\right)$.

    Remark

    The same assumptions for normality hold as in the case when the population variance is known.
  • The underlying population from which the sample is drawn should follow a normal distribution.
  • If the sample size is small (typically $n<30$ ), this assumption is critical, and normality should be assessed using plots (e.g., histograms, Q-Q plots) or tests (e.g., Shapiro-Wilk test).
  • If the sample size is large ( $n \geq 30$ ), the Central Limit Theorem ensures that the sampling distribution of the mean is approximately normal, even if the population is not perfectly normal.

    Example

    A biologist is studying the wing length of a specific butterfly species. From a random sample of $n=$ 15 butterflies, the average wing length is found to be $\bar{x}=12.4 cm$, with a sample standard deviation of $s=0.8 cm$. Construct and interpret a $90\%$ confidence interval for the true average wing length of this butterfly species.

    Solution

    With $n-1=14$ degrees of freedom, the critical value is $t_{0.05,14}=1.761$. The $90\%$ confidence interval for the true average wing length of this butterfly species is $12.4 \pm 1.761 \left(\frac{0.8}{\sqrt{15}}\right) = 12.4 \pm 0.364$ cm. We are $90\%$ confident that the true average wing length of this butterfly species falls between $12.036$ and $12.764$ cm.

    Example

    A random sample of $n=$ 25 students is taken to estimate the average number of hours students spend studying per week. The sample mean is $\bar{x}=10.5$ hours, and the sample standard deviation is $s=2.3$ hours. Construct a $95\%$ confidence interval for the true average number of hours students spend studying per week.

    Solution

    With $n-1=24$ degrees of freedom, the critical value is $t_{0.025,24}=2.064$. The $95\%$ confidence interval for the true average number of hours students spend studying per week is $10.5 \pm 2.064 \left(\frac{2.3}{\sqrt{25}}\right) = 10.5 \pm 0.949$ hours. We are $95\%$ confident that the true average number of hours students spend studying per week falls between $9.551$ and $11.449$ hours.
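
    The arithmetic in these two solutions can be verified with a short function. The sketch below takes the critical value as an argument, as read from a t-table, because the Python standard library does not provide t quantiles; the function name `t_interval` is an illustrative choice.

```python
import math

def t_interval(xbar, s, n, t_crit):
    """Two-sided interval for the mean with sigma unknown.

    t_crit is t_{alpha/2, n-1}, supplied by the caller from a table or software.
    """
    margin = t_crit * s / math.sqrt(n)
    return xbar - margin, xbar + margin

# Butterfly example: n = 15, s = 0.8, 90% confidence, t_{0.05,14} = 1.761
print(tuple(round(v, 3) for v in t_interval(12.4, 0.8, 15, 1.761)))

# Study-hours example: n = 25, s = 2.3, 95% confidence, t_{0.025,24} = 2.064
print(tuple(round(v, 3) for v in t_interval(10.5, 2.3, 25, 2.064)))
```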

    One Sided Confidence Bounds for the Population Mean; $\sigma$ Unknown

    As seen before, one-sided confidence bounds are used when the researcher is only interested in the lower or upper limit of the confidence interval. The one-sided confidence bounds for the population mean when the population variance is unknown are constructed similarly to the two-sided confidence interval, but the critical value is adjusted accordingly.

    Formula:

    One Sided Confidence Bound for the Population Mean; $\sigma$ Unknown

    Let $\bar{x}$ be the sample mean obtained from a random sample of size $n$ with unknown population variance $\sigma^2$. Then a
  • $(1-\alpha) 100 \%$ lower confidence bound for $\mu$ is $$ \bar{x}-t_{\alpha, n-1} \frac{s}{\sqrt{n}} \leq \mu $$
  • $(1-\alpha) 100 \%$ upper confidence bound for $\mu$ is $$ \mu \leq \bar{x}+t_{\alpha, n-1} \frac{s}{\sqrt{n}}$$

    where $t_{\alpha, n-1}$ is the $t-$value that corresponds to the desired confidence level and degrees of freedom $(n-1)$.

    Example

    A chemist is studying the purity percentage of a newly synthesized chemical compound. From a random sample of $n=12$ batches, the sample mean purity is found to be $\bar{x}=98.6 \%$, with a sample standard deviation of $s=0.5 \%$. Calculate and interpret a $95 \%$ lower bound for the true mean purity ( $\mu$ ) of the compound.

    Solution

    With $n-1=11$ degrees of freedom, the critical value is $t_{0.05,11}=1.796$. The $95 \%$ lower bound for the true mean purity of the compound is $98.6 - 1.796 \left(\frac{0.5}{\sqrt{12}}\right) = 98.6 - 0.259 = 98.341\%$. With 95% confidence, the true mean purity of the chemical compound is at least 98.34%. This lower bound provides a conservative estimate of the compound's purity, giving the chemist a reliable minimum value for quality assurance.

    Example

    A random sample of $n=20$ patients is taken to estimate the average time it takes for a new medication to take effect. The sample mean is $\bar{x}=3.5$ hours, and the sample standard deviation is $s=0.8$ hours. Calculate and interpret a $90 \%$ upper bound for the true average time it takes for the medication to take effect.

    Solution

    With $n-1=19$ degrees of freedom and all of $\alpha=0.10$ in one tail, the critical value is $t_{0.10,19}=1.328$. The $90 \%$ upper bound for the true average time it takes for the medication to take effect is $3.5 + 1.328 \left(\frac{0.8}{\sqrt{20}}\right) = 3.5 + 0.238 = 3.738$ hours. We are $90 \%$ confident that the true average time it takes for the medication to take effect is at most $3.738$ hours.
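
    The one-sided bounds can be checked the same way. In this sketch the caller supplies $t_{\alpha, n-1}$ from a table, with the degrees of freedom equal to $n-1$ and all of $\alpha$ in one tail; the function names are hypothetical.

```python
import math

def t_lower_bound(xbar, s, n, t_crit):
    """Lower confidence bound for the mean; t_crit is t_{alpha, n-1}."""
    return xbar - t_crit * s / math.sqrt(n)

def t_upper_bound(xbar, s, n, t_crit):
    """Upper confidence bound for the mean; t_crit is t_{alpha, n-1}."""
    return xbar + t_crit * s / math.sqrt(n)

# Purity example: n = 12, so 11 degrees of freedom; t_{0.05,11} = 1.796
print(round(t_lower_bound(98.6, 0.5, 12, 1.796), 3))

# Medication example: n = 20, so 19 degrees of freedom; t_{0.10,19} = 1.328
print(round(t_upper_bound(3.5, 0.8, 20, 1.328), 3))
```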

    When to Use the t-Distribution

    The t-distribution should be used when

  • The population variance is unknown. If the population variance $\left(\sigma^2\right)$ or standard deviation $(\sigma)$ is unknown, and you must estimate it using the sample standard deviation $(s)$, the additional uncertainty requires the use of the t-distribution.
  • The sample size is small (typically $n<30$). If the sample size is small $(n<30)$, the t-distribution is better suited because it accounts for the extra variability introduced by estimating the population standard deviation.

    For larger samples ( $n \geq 30$ ), the Central Limit Theorem ensures that the sampling distribution of the sample mean is approximately normal, so the $t$-distribution and normal distribution give nearly identical results.

    When to Use the Normal Distribution

    The normal distribution should be used when

  • The population variance is known. If the population variance $\left(\sigma^2\right)$ or standard deviation $(\sigma)$ is known, you can use the normal distribution, regardless of the sample size.
  • The sample size is large $(n \geq 30)$. If the sample size is large $(n \geq 30)$, the $t$-distribution approaches the normal distribution, and either can be used. However, in practice, the t-distribution is often used regardless of sample size if the population variance is unknown, as it is a more conservative choice.

    Rule of Thumb

  • Use the $t-$distribution when the population variance is unknown and when the sample size is small $(n<30)$.
  • Use the normal distribution when the population variance is known or when the sample size is large $(n \geq 30)$.

    Confidence Interval for A Population Proportion

    Another parameter that we often want to estimate is the population proportion or percentage. Each element in the population either has a certain characteristic or it doesn't, so the number of elements in a sample that have the characteristic can be viewed as a binomial random variable.

    Recall that a binomial distribution can be completely described by the number of independent trials in the experiment, $n$, and by the probability of success in each trial, $p$. Moreover, if $n p>5$ and $n(1-p)>5$, then the normal distribution can be used to approximate the binomial distribution.

    Point Estimator for a Population Proportion

    The point estimator for a population proportion is the sample proportion, $\hat{p}$. The sample proportion is calculated as the number of elements in the sample that have the characteristic of interest divided by the sample size. The sample proportion is an unbiased estimator of the population proportion, $p$. The sample proportion is also a maximum likelihood estimator of the population proportion.

    Definition:

    Point Estimate for a Population Proportion

    Let the population proportion be denoted by $p$. Then the estimator for the population proportion, $\hat{p}$, is defined to be $$\hat{p}=\frac{x}{n} $$ where $x$ is the number of elements in the sample that have the characteristic of interest and $n$ is the sample size.

    Sampling Distribution For Population Proportion Statistic, $\hat{p}$

    Since $\hat{p}$ is computed from sample data, it has a sampling distribution. For a sufficiently large sample (i.e. $n p>5$ and $n(1-p)>5$), the distribution of the sample proportion is approximately normal and has the following properties:

  • The mean of the sample proportion is equal to the population proportion, $p$.
  • The standard error of the sample proportion is given by $$\sqrt{\frac{p(1-p)}{n}}$$

    As with the sampling distribution of the mean, if we repeatedly take large samples and calculate the proportion of elements in each sample that have a particular characteristic, then the values of these proportions will be approximately normally distributed. That is, $$\hat{p}\sim N\left(\mu_{\hat{p}}, \sigma_{\hat{p}}\right)=N\left(p, \sqrt{\frac{p(1-p)}{n}}\right)$$

    Confidence Intervals for a Population Proportion

    A confidence interval for a population proportion is an interval estimate for the population proportion.

    Formula:

    Confidence Interval for a Population Proportion

    Let $\hat{p}=\frac{x}{n}$ be the point estimate for the population proportion, $p$. Then the $(1-\alpha) 100 \%$ confidence interval for $p$ is $$ \hat{p} \pm Z_{\alpha / 2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$ where $Z_{\alpha / 2}$ is the critical value associated with the level of confidence.

    Remark

    The sampling distribution for the proportion is the Normal distribution. Therefore, the confidence interval for the population proportion is based on the normal distribution.

    Rule of Thumb

    The interval is valid provided that the sample size is large enough, generally when $np>5$ and $n(1-p)>5$ (in practice, checked using $\hat{p}$ in place of the unknown $p$). This is the same condition that enables us to use the normal distribution to approximate the binomial distribution.

    Example

    A physicist is testing a batch of LED bulbs to determine the proportion that meets the required energy efficiency standards. Out of a random sample of $n=200$ bulbs, $x=170$ bulbs are found to be energy-efficient. Construct and interpret a $90\%$ confidence interval for the true proportion of energy-efficient LED bulbs in the batch.

    Solution

    The point estimate for the population proportion is $\hat{p}=\frac{x}{n}=\frac{170}{200}=0.85$. The critical value for a $90\%$ confidence interval is $Z_{\alpha / 2}=1.645$. The margin of error is $Z_{\alpha / 2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}=1.645 \sqrt{\frac{0.85(1-0.85)}{200}}=0.042$. Therefore, the $90\%$ confidence interval for the true proportion of energy-efficient LED bulbs in the batch is $0.85 \pm 0.042$ or $(0.808, 0.892)$. This means that we are $90\%$ confident that the true proportion of energy-efficient LED bulbs in the batch is between $80.8\%$ and $89.2\%$.

    Example

    A software company is testing a new algorithm for detecting malware. In a random sample of $n=$ 500 files, the algorithm correctly identifies $x=460$ malware-infected files. Construct and interpret a $96\%$ confidence interval for the true proportion of malware-infected files that the algorithm can detect.

    Solution

    The point estimate for the population proportion is $\hat{p}=\frac{x}{n}=\frac{460}{500}=0.92$. The critical value for a $96\%$ confidence interval is $Z_{\alpha / 2}=2.05$. The margin of error is $Z_{\alpha / 2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}=2.05 \sqrt{\frac{0.92(1-0.92)}{500}} \approx 0.025$. Therefore, the $96\%$ confidence interval for the true proportion of malware-infected files that the algorithm can detect is $0.92 \pm 0.025$, or $(0.895, 0.945)$. This means that we are $96\%$ confident that the true proportion of malware-infected files that the algorithm can detect is between $89.5\%$ and $94.5\%$.

    Example

    A biologist is studying a population of frogs in a wetland to determine the proportion that are carriers of a specific gene mutation. Out of a random sample of $n=400$ frogs, $x=72$ are found to carry the mutation.

    In certain cases, we are interested in the sample size required to estimate a population proportion with a specified margin of error. There are two cases to consider: when we have a preliminary estimate of the population proportion and when we do not.

    Formula:

    Sample Size for Estimating a Population Proportion

    Let $p$ be the population proportion, $1-\alpha$ be the level of confidence, $Z_{\alpha / 2}$ be the critical value associated with the level of confidence, and $E$ be the margin of error. Then the sample size required to estimate the population proportion with the specified margin of error is given by $$n=\frac{Z_{\alpha / 2}^{2} p(1-p)}{E^{2}}$$ rounded up to the next whole number.
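
    The formula follows from setting the margin of error of the confidence interval equal to the target value $E$ and solving for $n$. A brief sketch of the algebra, using the same symbols as above: $$E=Z_{\alpha / 2} \sqrt{\frac{p(1-p)}{n}} \quad \Longrightarrow \quad E^{2}=\frac{Z_{\alpha / 2}^{2}\, p(1-p)}{n} \quad \Longrightarrow \quad n=\frac{Z_{\alpha / 2}^{2}\, p(1-p)}{E^{2}}$$ Since a sample size must be a whole number and the margin of error must not exceed $E$, any fractional result is rounded up.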

    Remark

    In the absence of a preliminary estimate of the population proportion, we set $p=0.5$. This value maximizes $p(1-p)$ and therefore yields the most conservative (largest) sample size that satisfies the given constraints.
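
    Both cases, with and without a preliminary estimate, can be handled by one function that defaults to $p=0.5$. This is a minimal sketch under stated assumptions: the function name `required_sample_size` and the inputs (a margin of error of $0.03$, $95\%$ confidence, and a preliminary estimate of $0.2$) are hypothetical and chosen only for illustration.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(margin_of_error, confidence, p=0.5):
    """Smallest n whose margin of error does not exceed the target.

    p defaults to 0.5, the conservative choice when no preliminary
    estimate of the population proportion is available.
    """
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value Z_{alpha/2}
    n = z**2 * p * (1 - p) / margin_of_error**2
    return ceil(n)  # always round up to a whole number of observations


# Hypothetical inputs: E = 0.03 at 95% confidence.
n_with_estimate = required_sample_size(0.03, 0.95, p=0.2)
n_conservative = required_sample_size(0.03, 0.95)
print(n_with_estimate, n_conservative)
```

    Because the function uses an unrounded critical value, its result can differ by a few observations from a hand calculation based on a table value such as $2.58$.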

    Example

    A wildlife conservationist wants to estimate the proportion of birds in a region that are affected by a particular parasite. To ensure the estimate is accurate, the conservationist wants the margin of error to be no more than 5 percentage points ($0.05$) with $95\%$ confidence. Based on previous studies, the estimated proportion ($\hat{p}$) of affected birds is approximately $0.3$. What is the minimum sample size needed to achieve this level of precision?

    Solution

    The preliminary estimate of the population proportion is $p=0.3$, the level of confidence is $95\%$, the margin of error is $E=0.05$, and the critical value for a $95\%$ confidence interval is $Z_{\alpha / 2}=1.96$. The sample size required to estimate the population proportion with the specified margin of error is given by $$n=\frac{Z_{\alpha / 2}^{2} p(1-p)}{E^{2}}=\frac{1.96^{2} \cdot 0.3(1-0.3)}{0.05^{2}} \approx 322.69$$ Rounding up, the minimum sample size needed to achieve this level of precision is 323.

    Example

    A biologist wants to estimate the proportion of a specific fish species in a lake that is infected with a particular parasite. The biologist requires a margin of error of no more than 4 percentage points ($0.04$) with $99\%$ confidence. What is the minimum sample size needed to achieve this level of precision?

    Solution

    The population proportion is unknown, so we set $p=0.5$. The level of confidence is $99\%$, the margin of error is $E=0.04$, and the critical value for a $99\%$ confidence interval is $Z_{\alpha / 2}=2.58$. The sample size required to estimate the population proportion with the specified margin of error is given by $$n=\frac{Z_{\alpha / 2}^{2} p(1-p)}{E^{2}}=\frac{2.58^{2} \cdot 0.5(1-0.5)}{0.04^{2}} \approx 1040.06$$ Rounding up, the minimum sample size needed to achieve this level of precision is 1041.

    One-Sided Confidence Bounds for the Population Proportion

    In some cases, we may be interested in estimating the proportion of a population that has a certain characteristic, but we are only interested in the lower or upper bound of the proportion. In such cases, we can use a one-sided confidence interval to estimate the lower or upper bound of the population proportion.

    Formula:

    One-Sided Confidence Interval for a Population Proportion

    Let $\hat{p}=\frac{x}{n}$ be the point estimate for the population proportion, $p$. Then the $100(1-\alpha)\%$ one-sided confidence interval for $p$ is given by $$\hat{p} + Z_{\alpha} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$ for the upper bound and $$\hat{p} - Z_{\alpha} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$ for the lower bound, where $Z_{\alpha}$ is the critical value associated with the level of confidence.
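
    The one-sided bounds differ from the two-sided interval only in the critical value: all of $\alpha$ is placed in one tail, so $Z_{\alpha}$ replaces $Z_{\alpha / 2}$. The sketch below illustrates this; the function name `one_sided_bound`, the `side` argument, the clamping of results to $[0, 1]$, and the data ($x=30$, $n=100$, $90\%$ confidence) are assumptions made for the example.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_bound(x, n, confidence, side):
    """Normal-approximation one-sided confidence bound for a proportion.

    side="lower" returns the lower bound; side="upper" returns the upper bound.
    """
    p_hat = x / n
    # All of alpha sits in one tail, so the critical value is Z_alpha.
    z = NormalDist().inv_cdf(confidence)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    if side == "lower":
        return max(0.0, p_hat - margin)  # a proportion cannot fall below 0
    if side == "upper":
        return min(1.0, p_hat + margin)  # a proportion cannot exceed 1
    raise ValueError("side must be 'lower' or 'upper'")


# Hypothetical data: 30 successes in 100 trials, 90% confidence.
lower_bound = one_sided_bound(30, 100, 0.90, side="lower")
upper_bound = one_sided_bound(30, 100, 0.90, side="upper")
print(f"lower bound: {lower_bound:.3f}, upper bound: {upper_bound:.3f}")
```

    Each bound is read on its own: the lower bound corresponds to the interval from that value up to $1$, and the upper bound to the interval from $0$ up to that value.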

    Example

    An astrophysicist is studying a sample of distant stars to determine the proportion that exhibit unusual fluctuations in brightness, which might indicate the presence of exoplanets. Out of a sample of $n=120$ stars, $x=18$ stars show such fluctuations. Construct and interpret a $95\%$ one-sided lower confidence bound for the true proportion of stars that exhibit unusual brightness fluctuations.

    Solution

    The point estimate for the population proportion is $\hat{p}=\frac{x}{n}=\frac{18}{120}=0.15$. The critical value for a $95\%$ one-sided confidence bound is $Z_{\alpha}=1.645$. The margin of error is $Z_{\alpha} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}=1.645 \sqrt{\frac{0.15(1-0.15)}{120}} \approx 0.054$. Therefore, the $95\%$ one-sided lower confidence bound for the true proportion of stars that exhibit unusual brightness fluctuations is $0.15 - 0.054 = 0.096$, which corresponds to the interval $(0.096, 1)$. This means that we are $95\%$ confident that the true proportion of stars that exhibit unusual brightness fluctuations is at least $9.6\%$.

    Example

    A computer scientist is evaluating a new algorithm for detecting spam emails. In a random sample of $n=250$ spam emails, the algorithm correctly identifies $x=200$ as spam. Construct and interpret an $80\%$ one-sided upper confidence bound for the true proportion of spam emails that the algorithm can detect.

    Solution

    The point estimate for the population proportion is $\hat{p}=\frac{x}{n}=\frac{200}{250}=0.8$. For an $80\%$ one-sided confidence bound, $\alpha=0.20$, so the critical value is $Z_{\alpha}=0.842$. The margin of error is $Z_{\alpha} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}=0.842 \sqrt{\frac{0.8(1-0.8)}{250}} \approx 0.021$. Therefore, the $80\%$ one-sided upper confidence bound for the true proportion of spam emails that the algorithm can detect is $0.8 + 0.021 = 0.821$, which corresponds to the interval $(0, 0.821)$. This means that we are $80\%$ confident that the true proportion of spam emails that the algorithm can detect is at most $82.1\%$.