Introduction to Chi-Square Tests

Chi-square tests are a family of statistical procedures for categorical data that compare observed frequencies with the frequencies expected under a hypothesis. Two common members of the family are:
  • The chi-square test of independence is used to determine whether there is a significant association between two categorical variables.
  • The chi-square goodness-of-fit test is used to determine whether the distribution of a categorical variable differs from a hypothesized distribution.

    Goodness of Fit Test

    When building a model to describe a set of data, it is important to check that the model fits the data. The chi-square goodness of fit test is a statistical test used to determine whether the observed frequencies in a set of categories differ significantly from the frequencies expected under the model.
    In our context, we can use the coefficient of dispersion, $CD$, to guide us in the selection of a modeling distribution.

    Definition:

    Coefficient of Dispersion

    The coefficient of dispersion, $CD$, is a measure of the relative variability of a distribution. It is defined as the ratio of the standard deviation to the mean: $$CD = \frac{\sigma}{\mu}$$ where $\sigma$ is the standard deviation and $\mu$ is the mean.

    The coefficient of dispersion $CD$ for a sample of $n$ observations is given by $$CD = \frac{s}{\bar{x}}$$ where $s$ is the sample standard deviation and $\bar{x}$ is the sample mean.
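
    The sample formula $CD = s/\bar{x}$ can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `coefficient_of_dispersion` and the data values are hypothetical and are not taken from these notes.

```python
import statistics


def coefficient_of_dispersion(sample):
    """Return CD = s / x_bar for a sample of numbers."""
    x_bar = statistics.mean(sample)
    # statistics.stdev is the sample standard deviation (n - 1 denominator).
    s = statistics.stdev(sample)
    if x_bar == 0:
        raise ValueError("CD is undefined when the sample mean is zero")
    return s / x_bar


# Hypothetical data, chosen only to illustrate the calculation.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
cd = coefficient_of_dispersion(sample)
print(f"CD = {cd:.3f}")
```

    For this made-up sample the mean is $5$ and the sample standard deviation is about $2.14$, so $CD$ is about $0.43$.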

    Remark

    The coefficient of dispersion is a dimensionless quantity: it indicates how much the data are spread out relative to the mean.

    The $CD$ value can be interpreted as follows:

  • A $CD$ value close to $0$ indicates that the data are tightly clustered around the mean.
  • A larger $CD$ value indicates greater variability.

  • As a rough guideline, a $CD$ value of less than $0.2$ suggests a normal distribution, while a $CD$ value greater than $0.2$ suggests a non-normal distribution.

    Chi-Square Goodness of Fit Test

    To test whether the selected distribution fits the data, we use a $\chi^2$ (chi-square) goodness of fit test.

    Statistical tests are formulated in terms of a null hypothesis, $H_0$, and an alternative hypothesis, $H_1$. For a $\chi^2$ goodness of fit test, the null hypothesis is the statement that the model is appropriate, and the alternative hypothesis is the statement that the model is not appropriate. The $\chi^2$ value defined next is computed from the data and is used to decide whether to reject the null hypothesis and discard the model.

    Formula:

    Chi-Square Goodness of Fit Test

    The test statistic for the $\chi^2$ goodness of fit test is given by $$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$ where $O_i$ is the observed frequency and $E_i$ is the expected frequency in category $i$.
    The number of degrees of freedom for this test is $df=k-m-1$, where $k$ is the number of categories (cells) and $m$ is the number of parameters estimated in the model.
    The $P$-value for the test is the probability of observing a test statistic at least as large as the one we observed, assuming the null hypothesis is true. A large test statistic means that the observed and expected frequencies are very different.
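
    The statistic, the degrees of freedom, and the $P$-value can be computed together, as in the sketch below. It uses only the Python standard library. The helper names `chi_square_sf` and `goodness_of_fit` and the three-category data are hypothetical. The upper-tail probability is evaluated with the finite-sum expressions that hold for integer degrees of freedom, which is an implementation choice for this sketch rather than the method of these notes; a statistics library would normally be used instead.

```python
import math


def chi_square_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable, integer df >= 1."""
    if df < 1 or x < 0:
        raise ValueError("need df >= 1 and x >= 0")
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        total = sum(half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_{k=1}^{(df-1)/2} (x/2)^(k - 1/2) / Gamma(k + 1/2)
    total = sum(half ** (k - 0.5) / math.gamma(k + 0.5)
                for k in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


def goodness_of_fit(observed, expected, m=0):
    """Return (chi-square statistic, df, P-value); m = number of estimated parameters."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - m - 1
    return stat, df, chi_square_sf(stat, df)


# Hypothetical counts for three equally likely categories (60 observations).
observed = [18, 22, 20]
expected = [20, 20, 20]
stat, df, p_value = goodness_of_fit(observed, expected)
print(f"chi-square = {stat:.3f}, df = {df}, P-value = {p_value:.4f}")
```

    Here $\chi^2 = 0.4$ with $df = 2$, and the large $P$-value (about $0.82$) gives no reason to reject the model for these made-up counts.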

    Long Example 1

    A teacher suspects a die may be biased and asks students to roll it $120$ times. The observed outcomes are: $$\begin{array}{|c|c|c|c|c|c|c|} \hline \text{Face} & 1 & 2 & 3 & 4 & 5 & 6 \\ \hline \text{Frequency} & 15 & 25 & 20 & 18 & 22 & 20 \\ \hline \end{array}$$
    Assuming the die is fair, each outcome should have an equal probability. Conduct a chi-square goodness-of-fit test at a $5\%$ significance level.
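
    One way to carry out the calculation for this example is sketched below. It assumes no parameters are estimated ($m=0$) and compares the statistic with the tabulated upper $5\%$ critical value of the chi-square distribution with $5$ degrees of freedom ($11.070$); the variable names are illustrative.

```python
# Observed frequencies for faces 1..6, taken from the table above.
observed = [15, 25, 20, 18, 22, 20]
n = sum(observed)                              # 120 rolls
expected = [n / len(observed)] * len(observed) # 20 per face if the die is fair

# Per-face contributions (O_i - E_i)^2 / E_i to the statistic.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_sq = sum(contributions)

df = len(observed) - 0 - 1   # k - m - 1 with m = 0 estimated parameters
critical_value = 11.070      # tabulated chi-square value, df = 5, upper 5%
reject = chi_sq > critical_value

for face, (o, e, c) in enumerate(zip(observed, expected, contributions), start=1):
    print(f"face {face}: O = {o}, E = {e:.0f}, contribution = {c:.2f}")
print(f"chi-square = {chi_sq:.2f}, df = {df}, reject H0: {reject}")
```

    The statistic is $\chi^2 = 2.9$, well below $11.070$, so at the $5\%$ level these data give no evidence that the die is biased.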

    Long Example 2

    A pet store surveys 100 customers to find out which type of pet they prefer. The store expects preferences to be evenly distributed across four categories: cats, dogs, fish, and birds. However, the survey results are as follows: $$ \begin{array}{|c|c|c|c|c|} \hline \text{Pet} & \text{Cats} & \text{Dogs} & \text{Fish} & \text{Birds} \\ \hline \text{Frequency} & 30 & 20 & 25 & 25 \\ \hline \end{array} $$
    At a $5\%$ significance level, test whether there is any evidence that the distribution of pet preferences differs from the store's claim.
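
    A worked sketch of the calculation for this example, taking the store's claim of equal preferences as the null hypothesis with no estimated parameters ($m=0$). Each category has expected frequency $E_i = 100/4 = 25$, so $$\chi^2 = \frac{(30-25)^2}{25} + \frac{(20-25)^2}{25} + \frac{(25-25)^2}{25} + \frac{(25-25)^2}{25} = 1 + 1 + 0 + 0 = 2$$ with $df = 4 - 0 - 1 = 3$. The tabulated upper $5\%$ critical value for $3$ degrees of freedom is $7.815$. Since $2 < 7.815$, the null hypothesis is not rejected: the survey gives no evidence at the $5\%$ level that preferences differ from the store's claim.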

    Example 3

    A die is rolled 60 times and the following frequencies are obtained: $$\begin{array}{|c|c|c|c|c|c|c|} \hline \text{Face} & 1 & 2 & 3 & 4 & 5 & 6 \\ \hline \text{Frequency} & 8 & 10 & 12 & 9 & 11 & 10 \\ \hline \end{array}$$

    Example 4

    A candy company produces bags with candies in four colors: red, green, blue, and yellow. They claim each color appears equally often. To verify this, a consumer group randomly selects $200$ candies and observes: $$\begin{array}{|c|c|c|c|c|} \hline \text{Color} & R & G & B & Y \\ \hline \text{Frequency} & 55 & 45 & 50 & 50 \\ \hline \end{array}$$

    Example 5

    A car company claims that its car colors are equally distributed among five colors: red, black, blue, white, and brown. To verify this, a consumer group randomly selects $100$ cars and observes: $$\begin{array}{|c|c|c|c|c|c|} \hline \text{Color} & \text{Red} & \text{Black} & \text{Blue} & \text{White} & \text{Brown} \\ \hline \text{Frequency} & 10 & 25 & 20 & 30 & 15\\ \hline \end{array}$$

    Example 6

    A beverage company claims that customer preferences for its five drink flavors (Cola, Lemon, Orange, Grape, and Mango) are equally likely. A marketing researcher surveys $200$ customers and records the following responses: $$\begin{array}{|c|c|c|c|c|c|} \hline \text{Flavor} & \text{Cola} & \text{Lemon} & \text{Orange} & \text{Grape} & \text{Mango} \\ \hline \text{Frequency} & 60 & 25 & 45 & 30 & 40\\ \hline \end{array}$$
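
    Examples 3 to 6 state no significance level, and Example 3 states no hypothesis. The sketch below therefore makes two assumptions: the null hypothesis in each case is that all categories are equally likely, and the test is run at the $5\%$ level. The critical values are the usual tabulated ones; the helper name `uniform_gof_statistic` is hypothetical.

```python
# Tabulated upper 5% critical values of the chi-square distribution, keyed by df.
CRITICAL_5_PERCENT = {3: 7.815, 4: 9.488, 5: 11.070}


def uniform_gof_statistic(observed):
    """Chi-square statistic and df when every category is equally likely under H0."""
    n = sum(observed)
    e = n / len(observed)  # same expected count in every category
    stat = sum((o - e) ** 2 / e for o in observed)
    df = len(observed) - 1  # k - m - 1 with m = 0
    return stat, df


# Observed frequencies copied from the tables in Examples 3 to 6.
examples = {
    "Example 3 (die, 60 rolls)": [8, 10, 12, 9, 11, 10],
    "Example 4 (candy colors)": [55, 45, 50, 50],
    "Example 5 (car colors)": [10, 25, 20, 30, 15],
    "Example 6 (drink flavors)": [60, 25, 45, 30, 40],
}

results = {}
for name, counts in examples.items():
    stat, df = uniform_gof_statistic(counts)
    reject = stat > CRITICAL_5_PERCENT[df]  # assumed 5% significance level
    results[name] = (stat, df, reject)
    print(f"{name}: chi-square = {stat:.2f}, df = {df}, reject H0: {reject}")
```

    Under these assumptions the statistics are $1.0$, $1.0$, $12.5$, and $18.75$, so the equal-likelihood claim is rejected for the car colors and the drink flavors but not for the die or the candy colors.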

    Chi-Square Test of Independence

    The chi-square test for independence is a statistical test used to determine whether there is a significant association between two categorical variables. It evaluates whether the distribution of one variable differs across the levels of another variable. In other words, it helps us see whether changes in one variable are related to changes in another.
    The null hypothesis states that the two variables are independent (i.e. there is no association between them). The alternative hypothesis states that there is an association between the two variables. The test statistic is a chi-square random variable defined as follows:

    Formula:

    Chi-Square Test for Independence

    The data for the test of independence are organized in a contingency table with $r$ rows and $c$ columns. The observed count in row $i$ and column $j$ is denoted by $O_{ij}$, and $n=\sum_{i,j} O_{ij}$ is the total sample size. The marginal counts in the table are used to calculate the expected frequency for each table cell under the assumption of independence (row total times column total, divided by $n$): $$E_{ij}=\frac{1}{n}\left(\sum_{j} O_{ij}\right)\left(\sum_{i} O_{ij}\right)$$ The test statistic for the $\chi^2$ test of independence is given by: $$\chi^2=\sum_{i, j} \frac{\left(O_{ij}-E_{ij}\right)^2}{E_{ij}}$$ with $df=(r-1)(c-1)$ degrees of freedom.
    The test statistic is compared to the critical value from the chi-square distribution with $df$ degrees of freedom at a specified significance level (e.g., 0.05). If the test statistic exceeds the critical value, we reject the null hypothesis and conclude that there is a significant association between the two variables.
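
    The procedure can be sketched as a small function that builds the expected counts from the marginal totals and then sums the cell contributions. The function name `independence_test` and the $2 \times 2$ table are hypothetical, and the critical value $3.841$ is the tabulated upper $5\%$ value for $1$ degree of freedom.

```python
def independence_test(table):
    """Return (expected counts, chi-square statistic, df) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # E_ij = (row i total) * (column j total) / n under independence.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for o_row, e_row in zip(table, expected)
               for o, e in zip(o_row, e_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, stat, df


# Hypothetical 2 x 2 table of counts (rows and columns are two categorical variables).
table = [[10, 20],
         [20, 10]]
expected, stat, df = independence_test(table)
critical_value = 3.841  # tabulated chi-square value, df = 1, upper 5%
reject = stat > critical_value
print(f"expected = {expected}")
print(f"chi-square = {stat:.3f}, df = {df}, reject H0: {reject}")
```

    For this made-up table every expected count is $15$ and $\chi^2 \approx 6.667$, which exceeds $3.841$, so independence would be rejected at the $5\%$ level.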

    Remark

    The chi-square test of independence is a non-parametric test: it does not assume that the data follow a particular distribution such as the normal distribution. It does, however, rely on a large-sample approximation, so the sample should be large enough for the results to be valid; a common rule of thumb is that every expected cell frequency should be at least $5$.

    Example 1

    The following table shows the distribution of bison in Yellowstone National Park by age and location. $$\begin{array}{|c|c|c|c|} \hline \text { Age } & \text { North } & \text { South } & \text { Total } \\ \hline 0-1 & 10 & 20 & 30 \\ \hline 2-3 & 15 & 25 & 40 \\ \hline 4-5 & 20 & 30 & 50 \\ \hline \text { Total } & 45 & 75 & 120 \\ \hline \end{array} $$

    Example 2

    A public health researcher wants to investigate whether there is an association between age group and usage of a new over-the-counter pain medication. A sample of 150 individuals is surveyed and categorized as follows: $$\begin{array}{|c|c|c|c|} \hline \text { Age Group } & \text { Used Medication } & \text { Did Not Use Medication } & \text { Total } \\ \hline 18-35 & 30 & 10 & 40 \\ \hline 36-55 & 25 & 25 & 50 \\ \hline 56+ & 10 & 50 & 60 \\ \hline \text { Total } & 65 & 85 & 150 \\ \hline \end{array} $$

    Example 3

    An auto industry analyst wants to know whether car type (SUV, Sedan, or Truck) is associated with a preferred fuel type (Gasoline or Electric). A random sample of 120 car buyers is surveyed, and the results are shown below: $$\begin{array}{|c|c|c|c|} \hline \text { Car Type } & \text { Gasoline } & \text { Electric } & \text { Total } \\ \hline \text { SUV } & 30 & 10 & 40 \\ \hline \text { Sedan } & 25 & 15 & 40 \\ \hline \text { Truck } & 20 & 20 & 40 \\ \hline \text { Total } & 75 & 45 & 120 \\ \hline \end{array} $$
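
    The three tables above can be analyzed with one short script. The examples state no significance level, and Example 1 states no hypothesis. The sketch below assumes that the null hypothesis in each case is independence of the row and column variables and that the test is run at the $5\%$ level. Every table is $3 \times 2$, so $df = 2$, and for $2$ degrees of freedom the $P$-value has the exact closed form $e^{-\chi^2/2}$. The function name is hypothetical.

```python
import math


def chi_square_independence(table):
    """Chi-square statistic and df for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


# Observed counts copied from the three tables above (totals omitted).
tables = {
    "Example 1 (bison: age by location)": [[10, 20], [15, 25], [20, 30]],
    "Example 2 (age group by medication use)": [[30, 10], [25, 25], [10, 50]],
    "Example 3 (car type by fuel type)": [[30, 10], [25, 15], [20, 20]],
}

alpha = 0.05  # assumed significance level; not stated in the examples
results = {}
for name, table in tables.items():
    stat, df = chi_square_independence(table)
    assert df == 2  # the closed form below is valid only for df = 2
    p_value = math.exp(-stat / 2)
    results[name] = (stat, p_value, p_value < alpha)
    print(f"{name}: chi-square = {stat:.3f}, P-value = {p_value:.4g}, "
          f"reject H0: {p_value < alpha}")
```

    Under these assumptions the statistics are about $0.356$, $34.615$, and $5.333$. Only Example 2 shows a significant association at the $5\%$ level; Example 3 falls just short, with a $P$-value of about $0.07$.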