Numerical Measures of Data

One of the primary goals of statistics is to extract information from data. To do this, we need summary statistics: numerical descriptors that effectively and efficiently describe the whole of the data set or distribution under investigation. These descriptors need to be easy to manage, simple to interpret, and able to adequately reflect the trends within the data.

Typically, these numerical measures are divided into three categories: measures of central tendency, measures of dispersion, and measures of relative standing.

Measures of Central Tendency

One way to summarize data is with a measure of central tendency, i.e. a number that describes the most central or most typical value within the data pool or probability distribution.

Measures of central tendency are important in statistics because they allow us to:
  • Find a representative value in the data set.
  • Condense a vast amount of information (e.g. data and figures) into one number.
  • Make comparisons between two or more distributions by using representative values from each that describe these distributions.

The three most common measures of central tendency are: mean, median, and mode.

    Mean (Average)

    Definition:

    Mean (Average)

    The mean is the arithmetic average of a data set: the sum of all the values divided by the number of values.

    Formula:

    Population Mean

    The population mean, denoted by $\mu$, is calculated by adding up all the values in the population and then dividing by the number of values in the population. $$\mu=\frac{\sum^N_{i=1} x_i}{N}=\frac{x_1+x_2+\cdots +x_N}{N} $$ where

    $x_i$= individual values in the population
    $N$= number of values in the population.

    Formula:

    Sample Mean

    The sample mean, denoted by $\bar{x}$, is calculated by adding up all the values in the sample and then dividing by the number of values in the sample. $$\bar{x}=\frac{\sum^n_{i=1} x_i}{n}=\frac{x_1+x_2+\cdots +x_n}{n} $$ where

    $x_i$= individual values in the data set
    $n$= number of values in the data set.

    Remark

    Although the population mean, $\mu$, and the sample mean, $\bar{x}$, denote two different things, they are calculated in the same way.

    Example 1

    Calculate the sample mean of the following data set: 1, 2, 3, 4, 5.

    The mean is calculated as follows: $$\overline{x}=\frac{\sum^n_{i=1} x_i}{n}=\frac{1+2+3+4+5}{5}=\frac{15}{5}=3$$

    Solution

    Example 2

    Professor X has just finished grading papers written by his students on the causes of genetic mutation in humans. The papers are marked out of 100, and the grades are shown below. $$ \begin{array}{lllllll} 60 & 95 & 75 & 88 & 93 & 84 & 60 \\ 65 & 99 & 72 & 81 & 77 & 89 & 91\end{array}$$ Calculate the average grade of the papers.

    $$\overline{x}=\frac{\sum^n_{i=1} x_i}{n}=\frac{60+95+75+\cdots +91}{14}=\frac{1129}{14}\approx 80.64$$

    Solution
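    The calculation in Example 2 can be checked with a short Python sketch. The variable names (`grades`, `mean_grade`) are illustrative and not part of the original notes.

```python
# Grades from Example 2 (marked out of 100).
grades = [60, 95, 75, 88, 93, 84, 60,
          65, 99, 72, 81, 77, 89, 91]

# Sample mean: add up all the values, then divide by how many there are.
mean_grade = sum(grades) / len(grades)

print(sum(grades))           # 1129
print(round(mean_grade, 2))  # 80.64
```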

    Median

    Definition:

    Median

    The median is the middle value of a data set.

    Formula:

    Median

    To find the median, the data set must first be ordered from least to greatest. If the number of values in the data set is odd, the median is the middle value. If the number of values in the data set is even, the median is the average of the two middle values.

    Example 3

    There are seven sections of Probability and Statistics this semester. The teacher responsible for Section 3 of the course is relaxed and patient. The number of minutes it takes his students to complete a class exercise is presented below. $$ \begin{array}{lllllllllll} 82 & 42 & 55 & 58 & 56 & 21 & 83 & 80 & 67 & 35 & 79 \end{array}$$ Determine the median.

    Arranging the data in ascending order: $$\begin{array}{lllllllllll} 21 & 35 & 42 & 55 & 56 & \mathbf{58} & 67 & 79 & 80 & 82 & 83\end{array}$$ There are 11 values, so the median is the $6^{th}$ value, $58$.

    Solution

    Example 4

    The teacher who is responsible for Section 1 of Business Data Analysis is on a schedule. The time, in minutes, that it takes her students to complete a class exercise is shown below. $$ \begin{array}{llllllllll} 25 & 42 & 31 & 29 & 43 & 26 & 15 & 22 & 36 & 26\end{array}$$ Determine the median.

    Arranging the data in ascending order: $$\begin{array}{llllllllll} 15 & 22 & 25 & 26 & 26 & 29 & 31 & 36 & 42 & 43\end{array}$$ There are 10 values, so the median is the average of the $5^{th}$ and $6^{th}$ values: $\frac{26+29}{2}=27.5$.

    Solution
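    The odd and even cases of the median rule can be sketched in Python as follows. The function name `median` and the list names are illustrative choices; the data are those of Examples 3 and 4.

```python
def median(values):
    """Middle value of the ordered data; mean of the two middle values if the count is even."""
    ordered = sorted(values)          # the data must be ordered first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two middle values

section_3 = [82, 42, 55, 58, 56, 21, 83, 80, 67, 35, 79]  # Example 3 (11 values)
section_1 = [25, 42, 31, 29, 43, 26, 15, 22, 36, 26]      # Example 4 (10 values)

print(median(section_3))  # 58
print(median(section_1))  # 27.5
```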

    Remark

    In the presence of outliers (i.e. extreme values that warrant another look), the median is a better indicator of central tendency than the mean. This is because extreme values in the data pool can unduly affect the value of the mean, artificially inflating or deflating the average depending on which end of the spectrum they fall. The median, on the other hand, is resistant to outliers because its value is determined solely by its position within the ordered data set. Identifying outliers is covered in the section on measures of relative standing.

    Mode

    Definition:

    Mode

    The mode is the value that appears most frequently in a data set.
    If all values in the data set appear with the same frequency, the data set is said to be uniform and has no mode. If two or more values share the highest frequency, the data set is said to be multimodal and has more than one mode.

    Example 5

    The teacher responsible for Section 2 of this course is known to be easygoing and allows the students to work in groups. The time it takes for his students to complete a class exercise is presented below. $$\begin{array}{lllllllllll}22 & 23 & 23 & 27 & 27 & 27 & 27 & 27 & 31 & 35 & 36 \end{array} $$ Determine the mode for this data set.

    Mode: $27$

    Solution

    Example 6

    The teacher responsible for Section 5 of this course is known to be tough and demanding. The time it takes for her students to complete a class exercise is presented below. $$\begin{array}{llllllllllllll}8 & 8 & 9 & 9 & 9 & 9 & 10 & 10 & 12 & 12 & 12 & 12 & 13 & 13 \end{array} $$ Determine the mode for this data set.

    Mode: $9$ and $12$

    Solution
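    A minimal Python sketch of the mode, including the multimodal and "no mode" cases described in the definition, is given below. The function name `modes` is a hypothetical helper; returning an empty list when every value occurs equally often is a choice made here to mirror the definition above.

```python
from collections import Counter

def modes(values):
    """Return all values that share the highest frequency (empty list if there is no mode)."""
    counts = Counter(values)
    highest = max(counts.values())
    # If several distinct values all occur equally often, the data set has no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

section_2 = [22, 23, 23, 27, 27, 27, 27, 27, 31, 35, 36]           # Example 5
section_5 = [8, 8, 9, 9, 9, 9, 10, 10, 12, 12, 12, 12, 13, 13]     # Example 6

print(modes(section_2))     # [27]
print(modes(section_5))     # [9, 12]
print(modes([1, 2, 3, 4]))  # []
```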

    Remark

    The mode is the only measure of central tendency that can be used with nominal data. Nominal data are data divided into categories that have no numerical value, for example the colors of cars in a parking lot (red, blue, green, etc.).

    Example 7

    The weights (in kg) of five dogs are: $12,\, 15,\, 10,\, 20, \,$ and $18$.

    Example 8

    The ages of six children are: $7, \,9,\, 10,\, 8,\, 6,$ and $11$.

    Example 9

    The following data represent the number of books read by students in a class: $3, 5, 2, 5, 3, 5, 4, 3$

    Example 10

    Consider the following dataset: $6,\, 6,\, 8,\, 10,\, 10,\, 10,\, 12$. Which measure of central tendency is the largest?

    Example 11

    Which measure of central tendency is most appropriate for determining the average salary in a company where most employees earn between $ \$40,000$ and $\$60,000$, but a few executives earn over $\$1,000,000$?

    Median, since the mean would be skewed by the few executives' salaries that are over $\$1,000,000$.

    Solution

    Example 12

    Which measure of central tendency is most appropriate for determining the average number of children in a family?

    Mode, since the number of children in a family is a whole number and the mode is always a value that actually occurs in the data, whereas the mean (e.g. $2.4$ children) usually is not.

    Solution

    Means For Frequency Tables

    When data is presented in the form of a frequency table, the calculation of the mean needs to be slightly modified.

    Means For Ungrouped Data

    Formula:

    Mean for Ungrouped Data

    The mean for ungrouped data is obtained by multiplying each distinct value of $x$ by the number of times it occurs, $f$, taking the sum of these products, and dividing by the total number of data points in the set. $$ \bar{x} = \frac{\sum_{i=1}^{k} x_i f_i}{n} $$ where

    $x_i=$ the $i^{th}$ distinct value in the data set
    $f_i=$ the frequency of the $i^{th}$ value
    $k=$ the number of distinct values (rows in the frequency table)
    $n=\sum f_i$, the total number of data points.

    Example 13

    Last year, Facebook developed and trained an AI to design clothes after scanning millions of images on the internet. One of its creations was a pair of pants with two extra legs. Shoppers at a fast fashion retailer were asked how much they would spend on such a pair of pants. Their responses are presented below: $$\begin{array}{cc}\hline \text { Number of Shoppers } & \text { Amount (\$) } \\ \hline 13 & 5.00 \\ 27 & 8.00 \\ 24 & 10.00 \\ 31 & 15.00 \\ 15 & 20.00 \\ \hline \end{array}$$

    Remark

    In the previous example, it is possible to calculate the mean "the old-fashioned way", because we can recover the number of times each value of $x$ occurs from the table and expand the table into a list.
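    Both routes to the mean of a frequency table can be sketched in Python using the figures from Example 13. The names `table`, `mean_from_table` and `mean_from_list` are illustrative and not part of the original notes.

```python
# (amount in dollars, number of shoppers) pairs from the table in Example 13.
table = [(5.00, 13), (8.00, 27), (10.00, 24), (15.00, 31), (20.00, 15)]

# Frequency-table formula: sum of x_i * f_i divided by n, where n is the sum of the frequencies.
n = sum(f for _, f in table)
mean_from_table = sum(x * f for x, f in table) / n

# Expanding the table into a plain list of 110 responses gives the same answer.
expanded = [x for x, f in table for _ in range(f)]
mean_from_list = sum(expanded) / len(expanded)

print(n)                          # 110
print(round(mean_from_table, 2))  # 11.69
print(round(mean_from_list, 2))   # 11.69
```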

    Example 14

    Telegram is Russia's most popular instant messaging system. But because it is heavily encrypted and therefore completely private, it is used by terrorists, drug dealers and, most worrying for the Kremlin, opponents of the Putin regime. To prevent its citizens from using the system, Roskomnadzor, the federal agency responsible for monitoring internet communications, tried to block Telegram's key IP addresses. It failed spectacularly. In its efforts to shut down Telegram, the agency also blocked Google, MasterCard, Volvo, Nintendo, Amazon, and its own website. Meanwhile, Telegram servers remained open.

    Telegram's app has been downloaded thousands of times from Apple's App Store. The table below shows how users rated the app. $$ \begin{array}{|c|c|} \hline \text { Rating (in stars) } & \text { Number of Telegram Users } \\ \hline 5 & 1392 \\ 4 & 212 \\ 3 & 24 \\ 2 & 15 \\ 1 & 268 \\ \hline \end{array}$$

    Means For Grouped Data

    When data has been prearranged into groups, we cannot "see" the actual data values to recover the number of times each value occurs. In this case, we can only calculate an approximate mean for the data set. We will use the midpoint of each class interval as a representative value for the class interval.

    Formula:

    Mean for Grouped Data

    The mean for grouped data is calculated by multiplying the midpoint of each class interval, $m_i$, by the frequency of the class interval, $f_i$, taking the sum of these products, and dividing by the total number of data points in the set. $$ \bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{n} $$ where

    $m_i=$ the midpoint of the $i^{th}$ class interval
    $f_i=$ the frequency of the $i^{th}$ class interval
    $k=$ the number of class intervals
    $n=\sum f_i$, the total number of data points.

    Example 15

    The table below shows the number of hours students spend studying for an exam. $$\begin{array}{cc}\hline \text { Number of Hours } & \text { Number of Students } \\ \hline 0-2 & 5 \\ 3-5 & 10 \\ 6-8 & 15 \\ 9-11 & 20 \\ 12-14 & 10 \\ \hline \end{array}$$
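    As a sketch of the grouped-data formula, the approximate mean for the table in Example 15 could be computed as follows. Each class is represented by its midpoint (lower limit plus upper limit, divided by two); the names `classes` and `approx_mean` are illustrative.

```python
# ((lower limit, upper limit), number of students) for each class in Example 15.
classes = [((0, 2), 5), ((3, 5), 10), ((6, 8), 15), ((9, 11), 20), ((12, 14), 10)]

n = sum(f for _, f in classes)  # total number of students

# Replace every class by its midpoint m_i and apply sum(m_i * f_i) / n.
approx_mean = sum(((low + high) / 2) * f for (low, high), f in classes) / n

print(n)            # 60
print(approx_mean)  # 8.0
```

    The result is only an approximation, since the individual study times inside each class are unknown.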

    Example 16

    Starting in April 2019, H&R Block agents will be doubling up as therapists. The company, which offers tax-preparation services, is now putting all of its tax pros through an "empathy training" program so that they can comfort clients upset by small refunds or, worse, surprised by a large tax bill. The table shows the number of clients that needed comforting after getting hit with a large tax bill. $$ \begin{array}{|c|c|} \hline \text { Amount Due } (\$)& \text { Number of Clients }\\ \hline 0 \leq x<500 & 5 \\ 500 \leq x<1000 & 16 \\ 1000 \leq x<1500 & 23 \\ 1500 \leq x<2000 & 17 \\ 2000 \leq x<2500 & 14 \\ 2500 \leq x<3000 & 4 \\ \hline \end{array} $$

    Weighted Mean

    In the standard calculation of the average, every data point contributes equally to the final value of the mean. However, in some instances, we want to assign more importance to one or some of the numbers than others. In situations like these, we want to compute a weighted average.

    Formula:

    Weighted Mean

    The weighted mean is calculated by multiplying each value of $x$ by its corresponding weight, $w$, taking the sum of these products, and dividing by the sum of the weights. $$ \bar{x} = \frac{\sum_{i=1}^{n} x_i w_i}{\sum_{i=1}^{n} w_i} $$ where $x_i$ is the $i^{th}$ value of the data set, $w_i$ is the weight of the $i^{th}$ value, and $n$ is the total number of data points.

    Example 17

    In your Statistics course, the final mark is based on several components: two in-class tests, one paper, and a final exam. There are a total of 100 points available: each test is worth $25 \%$ of your final grade, the paper is worth $15 \%$, and the final exam is worth $35 \%$. Calculate your final mark in this course if you got $85 \%$ on test 1, $70 \%$ on test 2, $90 \%$ on the paper, and $77 \%$ on the final.

    The final mark in the course is calculated as follows: $$\begin{aligned} \bar{x} = \frac{\sum_{i=1}^{n} x_i w_i}{\sum_{i=1}^{n} w_i} &= \frac{(85 \times 25) + (70 \times 25) + (90 \times 15) + (77 \times 35)}{25 + 25 + 15 + 35} \\ &= \frac{2125 + 1750 + 1350 + 2695}{100} \\ &= 79.20 \end{aligned}$$

    Solution
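    The weighted mean of Example 17 can be reproduced with a few lines of Python. The helper name `weighted_mean` is hypothetical; it simply implements the formula above.

```python
def weighted_mean(values, weights):
    """Sum of x_i * w_i divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("each value needs exactly one weight")
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)

marks = [85, 70, 90, 77]     # test 1, test 2, paper, final exam (in %)
weights = [25, 25, 15, 35]   # share of the final grade (in %)

print(weighted_mean(marks, weights))  # 79.2
```

    When all the weights are equal, the function returns the ordinary mean.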

    Example 18

    A conservation biologist is studying the average lifespan of bird species in three different ecosystems. The ecosystems differ in the number of species studied and their average lifespans:

  • Forest Ecosystem: Average lifespan $=5$ years; 12 species studied
  • Grassland Ecosystem: Average lifespan $=3$ years; 8 species studied
  • Wetland Ecosystem: Average lifespan $=7$ years; 10 species studied

    Measures of Dispersion

    An average is an attempt to summarize a set of data with just one number, but by itself it does not tell the whole story. To better understand the nature of the data, a companion statistic that measures spread is needed. The most common measures of spread are the range, variance, standard deviation, and interquartile range.

    Range

    The simplest measure of statistical dispersion is the range. It reports the length of the interval spanned by the data values.

    Definition:

    Range

    The range of a data set is the difference between the largest and smallest value in the set.

    Formula:

    Range

    Given a set of ordered data, $$\text{Range} = \text{Highest value} - \text{Lowest value}$$

    Example 19

    The average Canadian worker wastes about 2 hours a day surfing the internet, talking to colleagues, conducting personal business, and taking long lunches. Administrators have long felt that because they get paid more, they work harder. But is that really the case? The number of minutes that six administrators and six teachers from the math department spent not working last Monday is presented below.

    Administrators: $125,125,125,125,125,139$

    Teachers: $9,10,13,15,22,23$

    One of the drawbacks of using the range as a measure of dispersion is that it fails to provide any information on how the values are spread out between the endpoints. In the previous example, the range is the same for both groups, but the data sets exhibit very different distributions.

    In the administrators' data set, the values are almost all identical, whereas in the teachers' data set, the values fluctuate more. As a result, measures that are more informative about how the majority of the data is spread are preferred.
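    A quick Python check of Example 19 confirms that the two groups have the same range even though their values are distributed very differently. The function name `data_range` is an illustrative choice.

```python
def data_range(values):
    """Difference between the largest and the smallest value."""
    return max(values) - min(values)

administrators = [125, 125, 125, 125, 125, 139]  # minutes not working
teachers = [9, 10, 13, 15, 22, 23]

print(data_range(administrators))  # 14
print(data_range(teachers))        # 14
```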

    Variance and Standard Deviation

    Variance and standard deviation are two key concepts in statistics that measure the spread or dispersion of a dataset. They provide insight into how much the data values differ from the mean (average).

    Variance

    Definition:

    Variance

    Variance is a measure of how much the values in a dataset differ from the mean. It is the average of the squared deviations from the mean, providing a mathematical representation of the data's spread.

    Formula:

    Population Variance

    Given a set of data with $N$ values, $x_1, x_2, x_3, ..., x_N$, the population variance, denoted by $\sigma^2$, is calculated by taking each data value, subtracting the mean from it, squaring the result, and then averaging the squared differences: $$\sigma^2=\frac{\sum^N_{i=1}(x_i-\mu)^2}{N}=\frac{1}{N}\left[\sum x_i^2-\frac{\left(\sum x_i\right)^2}{N}\right]$$ where

    $\mu =$ the population mean
    $x_i = $ the $i^{th}$ data value,
    $N =$ the total number of data values.

    Formula:

    Sample Variance

    Given a set of data with $n$ values, $x_1, x_2, x_3, ..., x_n$, the sample variance, denoted by $s^2$, is calculated by taking each data value, subtracting the mean from it, squaring the result, summing the squared differences, and then dividing by $n-1$: $$s^2=\frac{\sum^n_{i=1}(x_i-\bar{x})^2}{n-1}=\frac{1}{n-1}\left[\sum x_i^2-\frac{\left(\sum x_i\right)^2}{n}\right]$$ where

    $\bar{x}=$ the sample mean
    $x_i=$ the $i^{th}$ data value,
    $n =$ the total number of data values.

    Remark

    The sample variance formula uses $n-1$ in the denominator instead of $n$ to correct for the bias that arises when the sample mean is used in place of the unknown population mean. This correction is known as Bessel's correction.
    One of the advantages of using the variance as a measure of dispersion is that it treats all deviations from the mean the same way, regardless of direction. At the same time, squaring the deviations makes the variance susceptible to outliers.
    Another disadvantage of the variance is that it is not easy to interpret, because it is expressed in squared units. As a result, the standard deviation is often preferred as a measure of dispersion because it is in the same units as the data.

    Standard Deviation

    The standard deviation is the square root of the variance. It measures the typical distance of data points from the mean and is expressed in the same units as the data. It provides an intuitive sense of the data's variability.

    Definition:

    Standard Deviation

    The standard deviation is the square root of the variance. It is a measure of the dispersion of a set of data points around the mean.

    Formula:

    Population Standard Deviation

    The population standard deviation, denoted by $\sigma$, is obtained by taking the square root of the population variance: $$\sigma=\sqrt{\frac{\sum^N_{i=1}(x_i-\mu)^2}{N}}=\sqrt{\frac{1}{N}\left[\sum x_i^2-\frac{\left(\sum x_i\right)^2}{N}\right]}$$ where

    $\mu =$ the population mean
    $x_i =$ the $i^{th}$ data value
    $N =$ the total number of data values.

    Formula:

    Sample Standard Deviation

    The sample standard deviation, denoted by $s$, is calculated by taking the square root of the sample variance $$s=\sqrt{\frac{\sum^n_{i=1}(x_i-\bar{x})^2}{n-1}}=\sqrt{\frac{1}{n-1}\left[\sum x_i^2-\frac{\left(\sum x_i\right)^2}{n}\right]}$$ where

    $\bar{x}=$ the sample mean
    $x_i =$ the $i^{th}$ data value
    $n =$ the total number of data values.
    The standard deviation serves as a measuring stick to indicate how far the data is spread out relative to the mean.
  • The larger the standard deviation, the more spread out the data is.
  • Conversely, the smaller the standard deviation, the more closely the data points cluster around the mean.

    Rule of Thumb

    If the distribution is bell-shaped, then the Empirical Rule states that almost all of the data can be found within three standard deviations of the mean. More specifically,

  • Approximately 68% of the data falls within one standard deviation of the mean.
  • Approximately 95% of the data falls within two standard deviations of the mean.
  • Approximately 99.7% of the data falls within three standard deviations of the mean.

    Example 20

    Capsicum Ivanovii Mathematica is a hot variety of red peppers native to Bulgaria. A colleague of mine likes to grow them in his backyard, and he has entered a few specimens into a local competition. Below are lengths of the peppers that he submitted $$\begin{array}{lllllll} 30 & 35 & 42 & 45 & 36 & 43 & 28\end{array}$$
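    Assuming the seven peppers are treated as a sample (the notes do not say), the sample variance and standard deviation of the lengths in Example 20 could be computed as in the sketch below. The variable names are illustrative.

```python
import math

lengths = [30, 35, 42, 45, 36, 43, 28]  # pepper lengths from Example 20

n = len(lengths)
mean = sum(lengths) / n                                # 37.0

# Sum of squared deviations from the mean.
squared_devs = sum((x - mean) ** 2 for x in lengths)   # 260.0

sample_variance = squared_devs / (n - 1)               # divide by n - 1 for a sample
sample_std = math.sqrt(sample_variance)

print(round(sample_variance, 2))  # 43.33
print(round(sample_std, 2))       # 6.58
```

    Dividing `squared_devs` by `n` instead of `n - 1` would give the population variance.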

    Variance and Standard Deviation for Frequency Distributions

    When dealing with frequency distributions, the formulas for variance and standard deviation are slightly modified. The formulas are adjusted to account for the frequency of each data value.

    Variance and Standard Deviation for Ungrouped Data

    Formula:

    Variance for Frequency Distributions (Ungrouped Data)

    The variance for ungrouped data is calculated as follows: $$s^2=\frac{\sum^k_{i=1}(x_i-\bar{x})^2 f_i}{n-1}=\frac{1}{n-1}\left[\sum f_i x_i^2-\frac{\left(\sum f_i x_i\right)^2}{n}\right]$$ where

    $f_i=$ the frequency of the $i^{th}$ distinct data value
    $x_i=$ the $i^{th}$ distinct data value
    $k=$ the number of distinct data values
    $n=\sum f_i$, the total number of data values.

    Remark

    The second formula for the variance of ungrouped data is easier to work with; but both formulas are equivalent.

    Example 21

    At Steer Clear Driving School, 40 students just completed a theoretical exam to see if they qualify for a learner's license. The exam consisted of 30 questions and each question was worth one point. Below are the scores $$ \begin{array}{cc} \text { Test Score } & \text { Number of Students } \\ \hline 20 & 1 \\ 21 & 2 \\ 23 & 7 \\ 24 & 3 \\ 27 & 10 \\ 28 & 3 \\ 29 & 4 \\ 30 & 10 \end{array}$$
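    As an illustration, the second (computational) form of the formula can be applied to the table in Example 21. The sketch treats the 40 students as a sample, which is an assumption; the variable names are illustrative.

```python
import math

# (test score, number of students) pairs from Example 21.
scores = [(20, 1), (21, 2), (23, 7), (24, 3), (27, 10), (28, 3), (29, 4), (30, 10)]

n = sum(f for _, f in scores)                # total number of students
sum_fx = sum(f * x for x, f in scores)       # sum of f_i * x_i
sum_fx2 = sum(f * x * x for x, f in scores)  # sum of f_i * x_i^2

# Computational formula: [sum(f x^2) - (sum(f x))^2 / n] / (n - 1)
variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
std_dev = math.sqrt(variance)

print(n, sum_fx, sum_fx2)  # 40 1065 28719
print(round(variance, 2))  # 9.32
print(round(std_dev, 2))   # 3.05
```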

    Example 22

    If a regular alarm just doesn't rouse you in time, then how about a “device for waking persons from sleep”? Patented in 1882, the invention was intended to rouse heavy sleepers by dropping wooden or cork blocks onto their faces at a set time. The device was connected to a clock, ensuring a timely, albeit startling, wake-up call.

    Depending on the complexity of the invention submitted for consideration, the number of pages on the application varies considerably as shown in the table below. $$\begin{array}{cc} \text { Number of Pages } & \text { Number of Applications } \\ \hline 10 & 5 \\ 12 & 15 \\ 23 & 20 \\ 45 & 25 \\ 15 & 30 \\ \hline \end{array}$$

    Variance and Standard Deviation for Grouped Data

    When dealing with grouped data, the $x_i$'s in the formulas for variance and standard deviation are replaced with the midpoints of each class interval.

    Formula:

    Variance for Frequency Distributions (Grouped Data)

    The variance for grouped data is calculated as follows: $$s^2=\frac{\sum^k_{i=1}(m_i-\bar{x})^2 f_i}{n-1}=\frac{1}{n-1}\left[\sum f_i m_i^2-\frac{\left(\sum f_i m_i\right)^2}{n}\right]$$ where

    $f_i=$ the frequency of the $i^{th}$ class interval
    $m_i=$ the midpoint of the $i^{th}$ class interval
    $k=$ the number of class intervals
    $n=\sum f_i$, the total number of data values.

    Example 23

    It seems that fans of Taylor Swift will listen to just about anything that the singer releases. In 2014, the singer accidentally released 8 seconds of white noise on iTunes in Canada and it immediately shot to the top of the charts. Simply titled "Track 3", it is found on the album 1989, sandwiched between "Welcome to New York" and "Shake It Off".

    The Deluxe version of Taylor's 1989 album has 19 songs on it. Below is a table showing the run times of the songs in seconds. $$\begin{array}{cc} \text { Run Time (s) } & \text { Number of Songs } \\ \hline 100 \leq x < 150 & 2 \\ 150 \leq x < 200 & 2 \\ 200\leq x<250 & 11 \\ 250\leq x<300 & 6 \end{array}$$

    Example 24

    The first recorded use of the word "computer" was in 1613 to describe a person who performed calculations. The term was later used to describe a machine that performed calculations. The first computer was the Analytical Engine, designed by Charles Babbage in 1837. The Analytical Engine was never completed, but it was the first machine that could be considered a computer.

    The table below shows the number of computers sold by a local electronics store in the last month. $$\begin{array}{cc} \text { Number of Computers Sold } & \text { Number of Days } \\ \hline 0 \leq x < 5 & 2 \\ 5 \leq x < 10 & 3 \\ 10\leq x<15 & 4 \\ 15\leq x<20 & 5 \end{array}$$
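    For grouped data the same calculation is carried out on the class midpoints. The sketch below applies the definitional form of the formula to the table in Example 24, treating the 14 days as a sample (an assumption); the results are approximations because the daily figures inside each class are unknown.

```python
import math

# ((lower bound, upper bound), number of days) for each class in Example 24.
classes = [((0, 5), 2), ((5, 10), 3), ((10, 15), 4), ((15, 20), 5)]

# Represent each class by its midpoint m_i.
grouped = [((low + high) / 2, f) for (low, high), f in classes]

n = sum(f for _, f in grouped)
mean = sum(m * f for m, f in grouped) / n

# Definitional formula: sum of f_i * (m_i - mean)^2 divided by n - 1.
variance = sum(f * (m - mean) ** 2 for m, f in grouped) / (n - 1)
std_dev = math.sqrt(variance)

print(n)                   # 14
print(round(mean, 2))      # 11.79
print(round(variance, 2))  # 30.22
print(round(std_dev, 2))   # 5.5
```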

    Measures of Relative Standing

    Measures of relative standing are designed to provide information about where individual data points sit in relation to the entire data set.
    The most common measures of relative standing are percentiles, quartiles, and z-scores (which we will cover in the section on the Normal Distribution).

    Example

    Each year, thousands of students write the SAT exam for college admission. Alice just scored 1060 on the exam, which puts her in the $90^{th}$ percentile. Interpret the meaning of this statement.

    This means that $90\%$ of the scores are below what Alice received, and $10\%$ are above hers.

    Solution

    How to Calculate Percentiles

    Formula:

    Percentile

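    Given a data set that is ordered from smallest to largest, the location $L$ of a percentile is given by the formula: $$ L = \frac{(N+1)P_{i}}{100} $$ where $N$ is the number of data points in the data set, and $P_{i}$ is the percentile of interest (e.g. $P_i=30$ for the $30^{th}$ percentile). If $L$ is a whole number, the percentile is the $L^{th}$ value in the ordered data set. Otherwise, the percentile is found by interpolating between the two neighboring values.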

    Example

    The time it takes for 33 students to complete the 2-hour Probability and Stats exam is given below. Times are rounded to the nearest minute. $$\begin{array}{rrrrrrrrrrr} 80 & 80 & 80 & 81 & 82 & 85 & 88 & 90 & 91 & 91 & 93 \\ 93 & 94 & 94 & 95 & 97 & 97 & 97 & 99 & 105 & 108 & 110 \\ 110 & 110 & 112 & 113 & 113 & 116 & 116 & 117 & 118 & 119 & 120\end{array}$$ Determine the $30^{th}$ percentile of the data set.

    $$ L = \frac{(N+1)P_{i}}{100} = \frac{(33+1)(30)}{100} = 10.20 $$ The location falls between the $10^{th}$ value ($91$) and the $11^{th}$ value ($93$) of the ordered data set, so the $30^{th}$ percentile is $$P_{30}=91+0.20(93-91)=91.40$$

    Solution

    Example

    Consider again the time it takes for 33 students to complete the 2-hour Probability and Stats exam. Times are rounded to the nearest minute. $$\begin{array}{rrrrrrrrrrr} 80 & 80 & 80 & 81 & 82 & 85 & 88 & 90 & 91 & 91 & 93 \\ 93 & 94 & 94 & 95 & 97 & 97 & 97 & 99 & 105 & 108 & 110 \\ 110 & 110 & 112 & 113 & 113 & 116 & 116 & 117 & 118 & 119 & 120\end{array}$$ Determine the $55^{th}$ percentile of the data set.

    $$ L = \frac{(N+1)P_{i}}{100} = \frac{(33+1)(55)}{100} = 18.70 $$ The location falls between the $18^{th}$ value ($97$) and the $19^{th}$ value ($99$) of the ordered data set, so the $55^{th}$ percentile is $$P_{55}=97+0.7(99-97)=98.40$$

    Solution

    Example

    The time it takes for 33 students to complete the 2-hour Probability and Stats exam is given below. Times are rounded to the nearest minute. $$\begin{array}{rrrrrrrrrrr} 80 & 80 & 80 & 81 & 82 & 85 & 88 & 90 & 91 & 91 & 93 \\ 93 & 94 & 94 & 95 & 97 & 97 & 97 & 99 & 105 & 108 & 110 \\ 110 & 110 & 112 & 113 & 113 & 116 & 116 & 117 & 118 & 119 & 120\end{array}$$ Determine the second quartile of the data set.

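    The second quartile is the $50^{th}$ percentile, $Q_2=P_{50}$. $$L=\frac{(N+1)(50)}{100}=\frac{(33+1)(50)}{100}= 17 \quad \Rightarrow \quad Q_2=97$$ The location is a whole number, so $Q_2$ is the $17^{th}$ value in the ordered data set, which is $97$.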

    Solution
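    The three percentile calculations above can be reproduced with a short Python sketch of the $(N+1)P/100$ location rule with linear interpolation. The function name `percentile` is a hypothetical helper, and clamping locations that fall outside the data to the smallest or largest value is an assumption made here, since the notes do not discuss that case.

```python
import math

def percentile(data, p):
    """p-th percentile using the location L = (N + 1) * p / 100 with linear interpolation."""
    ordered = sorted(data)
    n = len(ordered)
    location = (n + 1) * p / 100
    if location <= 1:              # assumption: clamp to the smallest value
        return ordered[0]
    if location >= n:              # assumption: clamp to the largest value
        return ordered[-1]
    lower = math.floor(location)   # position of the value just below L (1-based)
    fraction = location - lower    # how far L lies past that position
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)

times = [80, 80, 80, 81, 82, 85, 88, 90, 91, 91, 93,
         93, 94, 94, 95, 97, 97, 97, 99, 105, 108, 110,
         110, 110, 112, 113, 113, 116, 116, 117, 118, 119, 120]

print(round(percentile(times, 30), 2))  # 91.4
print(round(percentile(times, 55), 2))  # 98.4
print(round(percentile(times, 50), 2))  # 97.0
```

    Statistical software often uses slightly different interpolation conventions, so library results may not match these values exactly.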

    Outliers

    Outliers are data points that are significantly different from the rest of the data set. They can have a large impact on the measures of central tendency and spread. Outliers can be identified using the following formula which makes use of the interquartile range:

    Formula:

    Interquartile Range (IQR)

    Let $Q_1$ be the first quartile and $Q_3$ be the third quartile. The interquartile range is given by the formula: $$IQR=Q_3-Q_1$$.

    Formula:

    Outliers

    An observation is considered an outlier if it is either less than $Q_1-1.5(Q_3-Q_1)$ or greater than $Q_3+1.5(Q_3-Q_1)$.

    Example

    The time it takes for 33 students to complete the 2-hour Probability and Stats exam is given below. Times are rounded to the nearest minute. $$\begin{array}{rrrrrrrrrrr} 80 & 80 & 80 & 81 & 82 & 85 & 88 & 90 & 91 & 91 & 93 \\ 93 & 94 & 94 & 95 & 97 & 97 & 97 & 99 & 105 & 108 & 110 \\ 110 & 110 & 112 & 113 & 113 & 116 & 116 & 117 & 118 & 119 & 120\end{array}$$ Determine if there are any outliers in the data set.

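    Using the percentile location formula, $Q_1=P_{25}$ is at location $L=\frac{(33+1)(25)}{100}=8.5$ and $Q_3=P_{75}$ is at location $L=\frac{(33+1)(75)}{100}=25.5$, so $Q_1=90+0.5(91-90)=90.5$ and $Q_3=112+0.5(113-112)=112.5$. $$Q_1-1.5(Q_3-Q_1)=90.5-1.5(112.5-90.5)=57.5$$ $$Q_3+1.5(Q_3-Q_1)=112.5+1.5(112.5-90.5)=145.5$$ Every observation lies between $57.5$ and $145.5$, so there are no outliers in the data set.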

    Solution

    Example

    Money can't buy you love, but dressing well certainly helps. According to a recent survey, $85 \%$ of women said that a man who was well dressed was far more attractive than one who was rich. The survey also revealed that for $63 \%$ of the participants, well dressed was synonymous with a well-tailored suit. A random sample of 30 businessmen was asked how much the suit they were wearing cost. Their answers are presented below. $$\begin{array}{rrrrrrrrrrrrrrr} 90 & 90 & 92 & 92 & 93 & 96 & 96 & 99 & 100 & 101 & 102 & 106 & 108 & 109 & 112 \\ 113 & 113 & 113 & 114 & 115 & 116 & 117 & 117 & 117 & 118 & 119 & 119 & 119 & 120 & 150 \end{array}$$
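    As a sketch of how the $1.5 \times IQR$ rule could be applied to the suit prices, the Python code below computes the quartiles with the $(N+1)P/100$ location rule from the percentile section and then lists any observation outside the fences. The helper names `percentile` and `find_outliers` are hypothetical, and the clamping of out-of-range locations is an assumption.

```python
import math

def percentile(data, p):
    """p-th percentile using the location L = (N + 1) * p / 100 with linear interpolation."""
    ordered = sorted(data)
    n = len(ordered)
    location = (n + 1) * p / 100
    if location <= 1:              # assumption: clamp to the smallest value
        return ordered[0]
    if location >= n:              # assumption: clamp to the largest value
        return ordered[-1]
    lower = math.floor(location)
    fraction = location - lower
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

def find_outliers(data):
    """Return (lower fence, upper fence, list of observations outside the fences)."""
    q1, q3 = percentile(data, 25), percentile(data, 75)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return low_fence, high_fence, [x for x in data if x < low_fence or x > high_fence]

suits = [90, 90, 92, 92, 93, 96, 96, 99, 100, 101, 102, 106, 108, 109, 112,
         113, 113, 113, 114, 115, 116, 117, 117, 117, 118, 119, 119, 119, 120, 150]

print(percentile(suits, 25), percentile(suits, 75))  # 98.25 117.0
print(find_outliers(suits))                          # (70.125, 145.125, [150])
```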

    Example

    Russian President Vladimir Putin might be willing to joke about things like climate change and meddling with US elections, but when it comes to his masculinity, he doesn't play around. The 72-year-old, who likes being photographed shirtless and holding big guns, recently said in an interview that he doesn't have any "bad days as President" because "he's not a woman" and offered this pseudoscientific explanation of why that was the case: "I am not trying to insult anyone. That's just the nature of things. There are certain natural cycles". Several men and women were asked how many bad days they experienced on the job. Their responses are shown below:

    Men $$\begin{array}{llllllllllll} 13 & 15 & 15 & 15 & 17 & 18 & 19 & 19 & 19 & 20 & 21 & 22 \\ 24 & 24 & 26 & 27 & 28 & 28 & 32 & 33 & 33 & 34 & 55 \end{array}$$
    Women $$\begin{array}{rrrrrrrrrrrrrr}2 & 6 & 7 & 10 & 12 & 12 & 12 & 13 & 14 & 17 & 17 & 18 & 19 & 20 \\ 21 & 22 & 23 & 23 & 26 & 28 & 29 & 29 & 30 & 31 & 34 & 39 & 54 & 59 \end{array}$$

    Exercises

    Question 1

    Explain the difference between the mean, median, and mode. In what type of dataset is the median a better measure of central tendency than the mean?

    The mean is the average of all the numbers in a dataset. The median is the middle number in a dataset when the numbers are arranged in order. The mode is the number that appears most frequently in a dataset. The median is a better measure of central tendency than the mean when the dataset contains outliers, or extreme values, that would skew the mean.

    Solution

    Question 2

    Define the range and standard deviation. How do these two measures provide different information about the spread of data?

    The range of a dataset is the difference between the maximum and minimum values in the dataset. It provides a simple measure of the spread of the data, but because it depends only on the two most extreme values, it says nothing about how the values in between are distributed. The standard deviation is a more sophisticated measure of spread: it uses every data point and reports the typical distance of the data points from the mean.

    Solution

    Question 3

    When would you use a percentile to describe a dataset instead of the mean or median?

    You would use a percentile to describe a dataset when you want to know the percentage of data points that fall below a certain value in the dataset. This can be useful for comparing individual data points to the rest of the dataset.

    Solution

    Question 4

    What is the empirical rule, and how can it help interpret data that follows a normal distribution?

    The empirical rule states that for a dataset that follows a normal distribution, approximately 68% of the data points fall within one standard deviation of the mean, 95% fall within two standard deviations, and 99.7% fall within three standard deviations. This rule can help interpret data by providing a quick estimate of the spread of the data.

    Solution
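    The percentages quoted in the empirical rule can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module (available from Python 3.8) to compute the share of a standard normal distribution lying within one, two and three standard deviations of the mean; the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of the distribution within k standard deviations of the mean, for k = 1, 2, 3.
shares = {}
for k in (1, 2, 3):
    shares[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(k, round(100 * shares[k], 1))
# 1 68.3
# 2 95.4
# 3 99.7
```

    The rounded figures 68%, 95% and 99.7% in the empirical rule are approximations of these values.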

    Question 5

    The test scores for a group of students are heavily skewed to the left. Which measure of central tendency is the most appropriate to describe the data, and why?

    When the data is heavily skewed to the left, the median is the most appropriate measure of central tendency to describe the data. This is because the median is not affected by extreme values or outliers, which can skew the mean.

    Solution

    Question 6

    The test scores for a group of students are heavily skewed to the right. Order the three measures of central tendency, the mean, the median, and the mode from the smallest to the largest.

    When a dataset is heavily skewed to the right: the mean is pulled toward the higher (right-hand) tail because it is influenced by the extreme values; the median is less affected by extreme values but still shifts slightly toward the tail compared to the mode; the mode is located at the peak of the distribution, which is on the left side of the data. Order from smallest to largest: Mode < Median < Mean.

    Solution

    Question 7

    If a dataset has outliers, which measure of spread—range or interquartile range (IQR)—would be more appropriate, and why?

    If a dataset has outliers, the interquartile range (IQR) would be more appropriate than the range as a measure of spread. The IQR is less sensitive to outliers because it is based on the middle 50% of the data, which makes it more robust to extreme values.
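
    To see the difference in sensitivity, the sketch below adds one extreme value to a hypothetical data set and recomputes both measures. It assumes the default (exclusive) quartile method of Python's `statistics.quantiles`; other quartile conventions can give slightly different IQR values.

```python
import statistics

def data_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)

def iqr(values):
    """Interquartile range = Q3 - Q1 (default quartile method of statistics.quantiles)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

# Hypothetical data, then the same data with one outlier added
clean = [10, 12, 13, 14, 15, 16, 17, 18, 20]
with_outlier = clean + [95]

range_clean, range_out = data_range(clean), data_range(with_outlier)
iqr_clean, iqr_out = iqr(clean), iqr(with_outlier)

print(f"range: {range_clean} -> {range_out}")
print(f"IQR:   {iqr_clean} -> {iqr_out}")
```

    The single outlier inflates the range many times over, while the IQR changes only slightly.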

    Solution

    Question 8

    A dataset follows a normal distribution with a mean of 50 and a standard deviation of 5. According to the empirical rule, what percentage of data falls between 40 and 60?

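    With a mean of 50 and a standard deviation of 5, the values 40 and 60 lie two standard deviations below and above the mean (50 - 2(5) = 40 and 50 + 2(5) = 60). According to the empirical rule, approximately 95% of the data falls within two standard deviations of the mean. Therefore, approximately 95% of the data falls between 40 and 60 in this dataset.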

    Solution

    Question 9

    Describe a situation where variance would be more useful than standard deviation in comparing two datasets.

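    Variance would be more useful than standard deviation when the variability has to be combined or split into parts, because the variances of independent sources add together while standard deviations do not. For example, when comparing how much of the total variability in two datasets comes from different sources (as in an analysis of variance), you would work with variances. Variance measures the average squared deviation of the data points from the mean, so it is expressed in squared units.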

    Solution

    Question 10

    The test scores for a group of students are heavily skewed to the left. Order the three measures of central tendency, the mean, the median, and the mode from the smallest to the largest.

    When a dataset is heavily skewed to the left: the mean is pulled toward the lower (left-hand) tail because it is affected by extreme values; the median is less influenced by the skewness compared to the mean but still shifts slightly toward the left; the mode is located at the peak of the distribution, which is on the right side of the data. Order from smallest to largest: Mean < Median < Mode.

    Solution

    Question 11

    Following the #MeToo movement, Netflix implemented a "five-second rule" on its sets: not for dropped food, but for eye contact. Cast and crew were instructed not to stare at each other for more than five seconds, a policy inspired by the discovery that prolonged gazes are officially "creepy." The rule, along with bans on hugging, flirting, and asking for phone numbers, came after Kevin Spacey's misconduct scandal, which cost Netflix millions and slowed House of Cards production.

    On the sets of The Crown, Stranger Things, and Orange is the New Black, cast and crew members were asked, "How many seconds of sustained eye contact do you consider creepy?" Their responses are summarized in the table below: $$\begin{array}{cc} \text { Number of Seconds } & \text { Number of People } \\ \hline 5 & 17 \\ 7 & 23 \\ 9 & 36 \\ 10 & 44 \\ 11 & 12 \\ 15 & 3 \\ \hline \end{array}$$
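
    As one illustration of how a frequency table like this can be summarized, the sketch below computes the mean, median, and mode from the counts. The choice of these three measures is an assumption, not a stated requirement of the exercise.

```python
import statistics

# Frequency table from the exercise: seconds -> number of people
freq = {5: 17, 7: 23, 9: 36, 10: 44, 11: 12, 15: 3}

n = sum(freq.values())

# Mean: each value is weighted by how many people gave it
mean = sum(x * f for x, f in freq.items()) / n

# Median: expand the table into a sorted list of individual responses
values = [x for x, f in sorted(freq.items()) for _ in range(f)]
median = statistics.median(values)

# Mode: the value with the largest frequency
mode = max(freq, key=freq.get)

print(f"n={n}, mean={mean:.2f}, median={median}, mode={mode}")
```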

    Question 12

    The Bois de Vincennes, a 2,000-acre park in Paris's 12th arrondissement, is the city's largest public park, and one of its most revealing. Certain areas are designated for naturists, who, last summer, had to call in reinforcements after their peaceful, clothing-free frolicking was disrupted by Peeping Toms, exhibitionists, and "bush-dwelling perverts." Naturally, French police beefed up patrols to protect the sanctity of the naked experience. The table below shows how many minutes several nudists managed to enjoy the Bois de Vincennes last Friday before, presumably, heading for cover.

    $$\begin{array}{|c|c|} \text { Time (in min.) } & \text { Number of Nudists } \\ \hline [0,60) & 15 \\ [60,120) & 50 \\ [120,180) & 65 \\ [180,240) & 155 \\ [240,300) & 70 \\ [300,360) & 45 \\ [360,420) & 5 \\ \hline \end{array}$$
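
    For grouped data such as this, a common approximation treats every observation in a class as if it sat at the class midpoint. The sketch below applies that assumption to estimate the mean and to identify the modal class; it is an illustration of the technique, not a stated requirement of the exercise.

```python
# Grouped frequency table from the exercise: ((lower, upper), frequency)
classes = [
    ((0, 60), 15),
    ((60, 120), 50),
    ((120, 180), 65),
    ((180, 240), 155),
    ((240, 300), 70),
    ((300, 360), 45),
    ((360, 420), 5),
]

n = sum(f for _, f in classes)

# Approximate mean: class midpoint weighted by class frequency
grouped_mean = sum(((lo + hi) / 2) * f for (lo, hi), f in classes) / n

# Modal class: the interval with the highest frequency
modal_class = max(classes, key=lambda c: c[1])[0]

print(f"n={n}, approximate mean={grouped_mean:.2f} min, modal class={modal_class}")
```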

    Question 13

    For the German dub of The Terminator, Arnold Schwarzenegger wasn't allowed to voice himself because his accent was deemed too rural; apparently, even killer robots have linguistic standards. Producers figured it'd be hard to take a futuristic death machine seriously if it sounded like a hillbilly. On Twitter, though, Arnold leans into his unstoppable legacy. With 4.26 million followers, his bio reads: "Former Mr. Olympia, Conan, Terminator, and Governor of California. I killed the Predator. I told you I'd be back." Clearly, no accent can stop the Terminator online.

    Below are the run-times for several of Arnie's most notable movies. $$\begin{array}{lc} \text { Movie } & \text { Length (minutes) } \\ \hline \text { Conan the Barbarian } & 129 \\ \text { Conan the Destroyer } & 101 \\ \text { Terminator } & 107 \\ \text { Predator } & 107 \\ \text { Terminator 2: Judgment Day } & 137 \\ \text { Terminator 3: Rise of the Machines } & 109 \\ \hline \end{array} $$

    Question 14

    Last March, Brighton's Big Cheese Festival hit a rather ironic snag: it ran out of cheese. Bad weather delayed several traders, leaving hundreds of attendees staring at empty tables and questioning their life choices. One unimpressed visitor took to social media to write, "Hmm, was expecting more cheese," while another quipped, "Should've just gone to the supermarket, shorter queues and way more cheese."

    The table below shows the number of cheeses sampled by attendees at the festival. $$\begin{array}{c|c} \text { Number of Cheeses Sampled } & \text { Number of Attendees } \\ \hline 1-5 & 20 \\ 6-10 & 35 \\ 11-15 & 50 \\ 16-20 & 45 \\ 21-25 & 30 \\ 26-30 & 15 \\ 31-35 & 5 \end{array}$$

    Question 15

    Swiss parents with a twisted sense of humour can now hire an "evil birthday clown" to stalk and harass their children for up to a week before their birthdays. Dominic Deville says that he got the idea to dress up as a creepy clown and scare the daylights out of unsuspecting children after reading Stephen King's It and watching Killer Klowns from Outer Space. He was also quick to point out that the "fun" can be called off at any time, which is great for any parent who has second thoughts about the service or hasn't saved up enough for therapy sessions.

    The hourly rates for 10 evil birthday clowns are shown below: $$\begin{array}{ccccc} 40 & 40 & 45 & 45 & 65\\ 65 & 70 & 70 & 75 & 80 \end{array}$$

    Question 16

    Uber has found itself at the center of a messy—and potentially very expensive—divorce. A Frenchman is suing the ride-share giant for €45 million after a glitch in the app tipped off his wife about his extramarital activities. According to the unnamed businessman, even after logging out, the app continued sending notifications to his wife's phone, detailing the dates, times, and locations of his romantic escapades. The result? She sued him for divorce, and he turned around and sued Uber for failing to protect his privacy—because apparently, discretion costs extra. Android users can relax, though; the bug only outed iPhone users.

    After news of the glitch broke, Uber created an update for the faulty app. The number of seconds that it took for iPhone users to download and install the update onto their phones is shown in the data below. $$\begin{array}{llllllllllllllll} 15 & 16 & 17 & 17 & 17 & 18 & 18 & 19 & 19 & 20 & 23 & 23 & 23 & 23 & 23 & 24 \\ 25 & 25 & 25 & 25 & 26 & 27 & 28 & 28 & 29 & 29 & 29 & 30 & 30 & 30 & 30 & 33 \\ 33 & 33 & 34 & 34 & 34 & 34 & 34 & 35 & 36 & 36 & 37 & 37 & 37 & 40 & 40 & 41 \\ 42 & 44 & 44 & 45 & 45 & 45 & 47 & 47 & 47 & 47 & 50 & 51 & 51 & 53 & 54 & 57 \end{array}$$

    Question 17

    Peppa Pig, Britain's gift to preschoolers, has some American parents in a panic after their kids started speaking with British accents and saying "biscuits" instead of "cookies". Psychologists assure everyone that the Peppa effect is temporary, though it might make snack time feel oddly formal. Tea, anyone?

    The data below shows the number of hours that a group of children spent watching Peppa Pig last year. $$\begin{array}{|c|c|} \text { Number of Hours } & \text { Number of Children } \\ \hline 0-2 & 10 \\ 3-5 & 15 \\ 6-8 & 20 \\ 9-11 & 25 \\ 12-14 & 30 \\ 15-17 & 25 \\ 18-20 & 15 \\ \hline \end{array}$$

    Question 18

    In 2021, American Airlines decided to crack down on emotional support animals. Goats, snakes, spiders, and anything with hoofs, tusks, or horns were officially grounded—because nothing says “relaxing flight” like a goat trying to claim the aisle seat. Non-household birds were added to the banned list, much to the dismay of one woman who was turned away with her emotional support peacock.

    Emotional support dogs are still allowed on flights, but they now have to travel in the cargo hold. The data below shows the weights of 32 emotional support dogs that were allowed to fly last month.

    $$\begin{array}{llllllllllllllll} 10 & 10 & 15 & 15 & 20 & 20 & 25 & 25 & 30 & 30 & 35 & 35 & 40 & 40 & 45 & 45 \\ 50 & 50 & 55 & 55 & 60 & 60 & 65 & 65 & 70 & 70 & 75 & 75 & 80 & 80 & 85 & 85 \end{array}$$