Correlation and Regression

We often seek to determine whether a relationship exists between two or more variables. If such a relationship is present, the next step is to evaluate its strength and, if feasible, construct a model that captures the trends in the data. This section aims to introduce methods for assessing the strength and direction of relationships in data, along with an overview of linear regression.

Paired Data and Scatter Plots

The investigation into the relationship between two variables begins with a scatter plot of the paired data.

Definition:

Explanatory and Response Variables

For each data pair $(x,y)$,

  • $x$ is called the explanatory variable and
  • $y$ is called the response variable.

    Definition:

    Scatter Plot

    A scatter plot is a graphical representation of paired data that can reveal the relationship between two variables.
    [Figure: example scatter plots showing no relationship, a linear relationship, a quadratic relationship, and a polynomial relationship.]

    Remark

    For the purpose of this course, we are interested only in linear relationships: plots in which the data resemble a straight line. But data can also exhibit non-linear relationships, such as quadratic, exponential, or logarithmic ones.
    Once a scatter-plot of the data has been made, we can visually assess:

    • The strength of the relationship between the two variables, by gauging how tightly or loosely the data points follow a line.
    • Whether the relationship between the variables is positive, negative, or non-existent, by examining the overall direction exhibited by the data points.
    [Figure: example scatter plots showing no linear relationship, and weak and strong positive and negative linear relationships.]
    Scatter plots can reveal the underlying relationship between two variables, but they do not quantify the strength of that relationship or how the explanatory and response variables move with or against each other. To overcome these issues, we need numerical measures of strength and direction.

    Measures of Strength and Direction

    When two variables are linearly related, we want to express the strength and direction of the relationship with a numerical value. The statistical measures which quantify the degree to which paired data relate to each other are: covariance and correlation.

    Covariance

    Definition:

    Covariance

    The covariance measures the joint variability of two variables.
    In other words, the covariance measures how two variables move together. If the large values of one variable correspond to the large values of the other, and the same is true for the smaller values, then the covariance is positive, reflecting the like-with-like behaviour. Conversely, if the large values of one variable correspond to the small values of the other, then the covariance is negative, reflecting the opposing behaviour.

    Formula:

    Population Covariance

    The population covariance, $\sigma_{xy}$, is defined as follows $$\sigma_{xy}=\frac{\sum(x_i-\mu_x)(y_i-\mu_y)}{N}$$ where

    $x_i=$ the data points for the explanatory variable, $x$
    $y_i=$ the data points for the response variable, $y$
    $\mu_x=$ the mean for $x$
    $\mu_y=$ the mean for $y$
    $N=$ the number of data points.
    In the vast majority of cases, we will be working with sample data. Therefore, two versions of the formula for the sample covariance are presented. The one on the right is a short-cut version, and may be preferable when the data values are unwieldy.

    Formula:

    Sample Covariance

    The sample covariance, $s_{xy}$, is defined as follows $$s_{xy}=\frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{n-1}= \frac{1}{n-1} \left[ \sum x_iy_i-\frac{\sum x_i\sum y_i}{n} \right]$$ where

    $x_i=$ the data points for the explanatory variable, $x$
    $y_i=$ the data points for the response variable, $y$
    $\bar{x}=$ the sample mean for $x$
    $\bar{y}=$ the sample mean for $y$
    $n=$ the number of data points.
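    As a quick illustration, both versions of the sample covariance formula can be computed directly. The numbers below are made up for demonstration; any paired data would do.

```python
# Sample covariance computed two ways; the data here are illustrative only.
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 5.0, 11.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Definitional form: sum of products of deviations, divided by n - 1
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Short-cut form: uses raw sums instead of deviations
s_xy_shortcut = (sum(xi * yi for xi, yi in zip(x, y))
                 - sum(x) * sum(y) / n) / (n - 1)

print(s_xy, s_xy_shortcut)  # both forms give the same value
```

    The positive result reflects the like-with-like behaviour of the two lists: large values of $x$ pair with large values of $y$.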
    Regardless of which covariance measure we are interested in, the two pieces of information that we want to extract from it are its sign, which tells us the nature of the relationship, and its magnitude, which tells us the strength of the association. Thus,

    • When two variables move in the same direction, the covariance will be a large positive number.
    • When two variables move in the opposite direction, the covariance will be a large negative number.
    • When two variables do not exhibit any particular patterns, the covariance will be a small number.

    Correlation

    Covariance is difficult to interpret because it can range anywhere from $-\infty$ to $+\infty$. With no absolute maximum or minimum values to contain it, we have no way of using the number to make a meaningful statement about the strength of the relationship. To overcome the complications associated with large numbers, we rescale the numerical measures of strength and direction to fit between two manageable values. The resulting measure is the coefficient of correlation.

    Definition:

    Coefficient of Correlation

    The coefficient of correlation is a numerical measure of the strength and direction of the linear relationship between two variables. Its values are found between $-1$ and $+1$.
    Essentially, the coefficient of correlation rescales the values of the covariance to fit between $-1$ and $+1$. This greatly improves the manageability and readability of the measure by using a scale that we can readily interpret, much as raw data counts are converted into relative frequencies. So now, if we are told that the coefficient of correlation between two variables is $-0.91$, we can conclude that there is a strong negative relationship.
    The coefficient of correlation is calculated by dividing the covariance of the two variables by the product obtained from multiplying the standard deviations of each together.

    Formula:

    Coefficient of Correlation (Population)

    The coefficient of correlation for population data, denoted $\rho$, is defined to be $$\rho=\frac{\sigma_{xy}}{\sigma_x\sigma_y}$$ where

    $\sigma_{xy}=$ the population covariance
    $\sigma_x=$ the population standard deviation for $x$
    $\sigma_y=$ the population standard deviation for $y$
    More often than not, we will be working with sample data; therefore, two versions of the formula for calculating the coefficient of correlation for sample data are presented. The second is a short-cut version, and may be preferable when the data values are unwieldy.

    Formula:

    Coefficient of Correlation (Sample)

    The coefficient of correlation for sample data, denoted $r$, is defined to be $$r=\frac{s_{xy}}{s_x s_y} = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum(x_i-\bar{x})^2\sum(y_i-\bar{y})^2}}$$ where

    $s_{xy}=$ the sample covariance
    $s_x=$ the sample standard deviation for $x$
    $s_y=$ the sample standard deviation for $y$

    Formula:

    Alternate Coefficient of Correlation (Sample)

    The coefficient of correlation for sample data, denoted $r$, is defined to be $$r=\frac{s_{xy}}{s_x s_y} = \frac{n \sum x y-\left(\sum x\right)\left(\sum y\right)}{\sqrt{n \sum x^2-\left(\sum x\right)^2} \sqrt{n \sum y^2-\left(\sum y\right)^2}}$$ where

    $\sum x=$ the sum of the data points for the explanatory variable, $x$
    $\sum y=$ the sum of the data points for the response variable, $y$
    $\sum x^2=$ the sum of the squares of the data points for the explanatory variable, $x$
    $\sum y^2=$ the sum of the squares of the data points for the response variable, $y$
    $n=$ the number of data points.
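    To see that the definitional and short-cut formulas agree, here is a minimal sketch; the data values are made up for illustration.

```python
import math

# Coefficient of correlation computed from the definitional form and the
# short-cut form; the data are illustrative only.
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 5.0, 11.0]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Definitional form: covariance-style numerator over the product of spreads
num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
den = math.sqrt(sum((xi - x_bar) ** 2 for xi in x)
                * sum((yi - y_bar) ** 2 for yi in y))
r = num / den

# Short-cut form: raw sums only, no means required
sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx = sum(xi * xi for xi in x)
syy = sum(yi * yi for yi in y)
r_shortcut = ((n * sxy - sx * sy)
              / (math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)))

print(r, r_shortcut)  # both forms give the same value
```

    Unlike the covariance, the result already lies on the $-1$ to $+1$ scale, so it can be read off directly as a strong positive relationship.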
    The coefficient of correlation is a unitless measure, and is always a value between $-1$ and $+1$. The sign of the coefficient of correlation tells us the direction of the relationship, while the magnitude tells us the strength of the relationship. The closer the coefficient of correlation is to $+1$, the stronger the positive relationship. The closer the coefficient of correlation is to $-1$, the stronger the negative relationship. The closer the coefficient of correlation is to $0$, the weaker the relationship.
    [Figure: example scatter plots illustrating weak, moderate, and strong positive and negative correlations, and no correlation.]
    How to Interpret the Coefficient of Correlation:

    • If $r=1$, then there is a perfect positive linear relationship between the two variables.
    • If $r=-1$, then there is a perfect negative linear relationship between the two variables.
    • If $r=0$, then there is no linear relationship between the two variables.
    The same can be said about $\rho$. It should be noted that

    • Positive values of $r$ and $\rho$ imply that as $x$ increases, $y$ tends to increase.
    • Negative values of $r$ and $\rho$ indicate that as $x$ increases, $y$ tends to decrease.
    • The values of $r$ and $\rho$ stay the same regardless of which variable is designated as the explanatory variable and which as the response.
    • The values of $r$ and $\rho$ remain the same, even if the variables are converted into different units.

    Remark

    When examining a scatter plot and interpreting a coefficient of correlation, there are several important things to keep in mind.

    • Since it is not possible to obtain or graph all of the data points from a population, a scatter plot provides only one snapshot of the data, captured from a random sample. Because of this, the value of $r$ can change from sample to sample, even though the samples are drawn from the same population.
    • The value of $r$ is sensitive to the omission of small or large data values in a random sample, meaning that the exclusion of these data points can impact the final value of $r$.
    • Correlation does not imply causation. The coefficient of correlation only measures the strength of relationship between two variables and does not make any implications about cause and effect. The fact that two variables increase or decrease together does not mean that change in one is causing changes in the other.

    Example

    After years of stagnation, the housing market in the U.S. is beginning to show signs of recovery. Last year, the median price of a home in Chicago was $\$230{,}000$, up $8.5\%$. Below are the ages and selling prices of six homes in the suburb of West Englewood.

    $$\begin{array} {c|c} \text { Age of Property } & \text { Selling Price of Home } \\ \text { (years) } & \text { (thousands) } \\ \hline 5 & 321 \\ 7 & 315 \\ 15 & 267 \\ 25 & 266 \\ 34 & 242 \\ 37 & 208 \\ \hline \end{array}$$

    The Coefficient of Determination

    Recall that the coefficient of correlation, $r$, is a measure of the strength and direction of the relationship between two variables. The values of $r$ are constrained to fit between $-1$ and $+1$, and the strength of the relationship is deduced from the value of $r$ relative to this scale. Using this information, the coefficient of correlation can be transformed into another measure, one that provides insight into the quality of the model.
    The coefficient of determination, denoted by $R^2$, is a measure of how well the least-squares line fits the data. It is a number between $0$ and $1$, and is interpreted as the proportion of the total variation in the response variable that is explained by the explanatory variable.

    Definition:

    The Coefficient of Determination

    The coefficient of determination is the square of the correlation coefficient, and is a measure of how well the least-squares line fits the data.

    Formula:

    The Coefficient of Determination

    The coefficient of determination, $R^2$, is given by $$R^2 = r^2$$ where

    $r=$ the correlation coefficient.
    The coefficient of determination can be interpreted as the proportion of the total variation in the response variable that is explained by the explanatory variable. The remaining proportion of the variation is attributed to random error.

    Formula:

    Percentage of Variance Accounted

    The percentage of variance in the response variable that can be accounted for and unaccounted for by the explanatory variable is calculated as follows:

  • Accounted for: $R^2$
  • Unaccounted for: $1-R^2$

    The coefficient of determination can be used to evaluate the quality of the model. The closer the value of $R^2$ is to 1, the better the model fits the data. Conversely, the closer the value of $R^2$ is to 0, the worse the model fits the data.
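    For instance, using the correlation of $-0.91$ from the earlier illustration, the split between explained and unexplained variation is immediate:

```python
# Coefficient of determination from the correlation coefficient.
r = -0.91          # correlation from the earlier illustration
R2 = r ** 2        # proportion of variation accounted for
unaccounted = 1 - R2

print(f"R^2 = {R2:.4f}")  # about 0.8281: ~82.8% explained, ~17.2% not
```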

    Remark

    Although the coefficient of determination is a useful metric to gauge the level of linear association between two paired variables, it should never be used in isolation to make an assessment of how well the model predicts the future or fits the data. Here are some things to keep in mind when interpreting the value of $R^2$:

    • A large $R^2$ value should not always be interpreted as meaning that the estimated regression line fits the data well. It is quite possible that another function might better describe the trend in the data.
    • The coefficient of determination, $R^2$, and the correlation coefficient, $r$, can both be greatly affected by just one data point (or a few data points). Adding or removing data points can change the slope of the line, which causes changes in the values of $r$ and $R^2$.
    • The $R^2$ cannot determine whether the coefficient estimates and the predictions offered by the model are biased; for that, we need to consult the residual plots.
    • The source of unexplained variation can be due to chance or the presence of a lurking variable; one that is neither an explanatory nor a response variable, but can be responsible for changes in both $x$ and $y$.

    Example

    Last January, two men in Cambridgeshire were arrested for growing cannabis and attempted to convince a sceptical court that they had mistaken their crop for bonsai trees. A bold claim, considering their ``bonsais`` were flourishing to such an extent that when the police raided their house, the suspects managed to hide among them.

    At Cambridge University, researchers are studying how sunlight influences the carbon dioxide emissions of a newly discovered bonsai species. The table below displays the hours of sunlight exposure and the corresponding carbon dioxide volume, measured in cubic centimetres, produced by a single tree across five different observations. $$\begin{array}{cc} \text { Exposure to Sunlight } & \text { Amount of Carbon Dioxide } \\ \text { (hours) } & \left(\mathrm{cm}^3\right) \\ \hline 1 & 3 \\ 3 & 6 \\ 5 & 8 \\ 7 & 9 \\ 8 & 10 \\ \hline \end{array}$$

    Example

    China is embracing facial recognition on an epic scale. At traffic junctions, jaywalkers are shamed by having their faces projected onto giant screens, and at Ming-dynasty temples it is used to stop toilet paper theft - it's so good that it can tell if you've had plastic surgery. In schools, surveillance is ramped up: one high school scans students every 30 seconds to spot yawners or daydreamers, while universities use it to control dorm access—blocking ``strangers`` and, inconveniently, boyfriends.

    At one college, facial recognition tracks attendance and absenteeism. The table below shows the number of classes missed by five Data Analysis students and their final grades. $$\begin{array}{cc} \text { Classes Missed } & \text { Final Mark (out of 100) } \\ \hline 10 & 75 \\ 15 & 65 \\ 20 & 50 \\ 25 & 40 \\ 30 & 30 \end{array}$$

    Regression

    After establishing that two variables are related, the next step is to construct a model which describes the relationship between them. The procedure often begins by making a scatter-plot of the data and then fitting a curve through it. The properties of the curve are then extracted to generate a mathematical equation, which can be used to make predictions about the variables and forecast future values.
    For paired data that are linearly related, the data are fitted with a straight line, and the relationship is modelled with a linear equation. Indeed, there are many lines to choose from; but if the goal is to maximize the predictive power of the model, and minimize the overall error produced by it, then not just any line will do.

    The Least Squares Method

    The least squares method is a technique used to find the best-fitting line through a set of data points. The method minimizes the sum of the squares of the vertical distances between the data points and the line. The line that minimizes this sum is the best-fitting line, and is the line that is used to model the relationship between the variables.

    Definition:

    The Least-Squares Line

    The least-squares line is the line that best represents the data on a scatter plot, by minimizing the sum of the squares of the residual errors.
    The diagram below shows a least-squares line passing through a set of data points. The vertical offsets (residual errors) occur whenever the least-squares line and the data points do not line up precisely.
    In order to use the least-squares line to make predictions about the data, we need to be able to come up with the equation that describes the line itself. This calls for two computations to be carried out; one for the slope, and one for the intercept.

    • The slope is a number which describes the rate of change between the two variables. It tells us how a change in one unit of the explanatory variable affects the value of the response variable. The size of the change (large/small) is reflected in the numerical value of the slope, and the direction of change (increasing/decreasing) by its sign.
    • The intercept is where the graph of the line intersects the $y$-axis.

    Formula:

    The Slope of the Least-Squares Line

    $$b = \frac{n\sum x_iy_i - \sum x_i \sum y_i}{n\sum x_i^2 - (\sum x_i)^2}$$ where

    $n=$ the number of data points
    $x_i=$ each value of the explanatory variable, $x$
    $y_i=$ each value of the response variable, $y$

    Formula:

    The Intercept of the Least-Squares Line

    $$a = \frac{\sum y_i - b\sum x_i}{n}$$ where

    $n=$ the number of data points
    $x_i=$ each value of the explanatory variable, $x$
    $y_i=$ each value of the response variable, $y$
    $b=$ the slope of the least-squares line

    Formula:

    The Equation of the Least-Squares Line

    $$\hat{y} = a + bx$$ where

    $\hat{y}=$ the predicted value of the response variable
    $x=$ the explanatory variable
    $a=$ the intercept of the least-squares line
    $b=$ the slope of the least-squares line
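    Putting the three formulas together, a short sketch (with made-up data) computes the slope, the intercept, and a prediction:

```python
# Least-squares slope, intercept, and prediction from the raw-sum formulas;
# the data values are illustrative only.
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 5.0, 11.0]
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx = sum(xi * xi for xi in x)

b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # slope
a = (sy - b * sx) / n                          # intercept

def y_hat(x_new):
    """Predicted response for a given value of the explanatory variable."""
    return a + b * x_new

print(a, b, y_hat(5.0))
```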

    Example

    McDonald's, the world's largest fast food chain, has worked to clean up its image by publishing nutritional guides for its menu items. While salads and fruits were added to attract health-conscious customers, most still prefer their iconic burgers. Below are some of McDonald's most popular burgers, along with their fat content and calorie counts.

    $$\begin{array}{lcc} \text { Sandwich } & \text { Fat }(g) & \text { Calories } \\ \hline \text { Big Mac } & 28 & 520 \\ \text { Cheeseburger } & 11 & 290 \\ \text { Double Cheeseburger } & 20 & 420 \\ \text { Double Quarter Pounder } & 43 & 740 \\ \text { Hamburger } & 8 & 240 \\ \text { McDouble } & 17 & 370 \\ \text { Quarter Pounder With Cheese } & 26 & 520 \end{array}$$

    Even though the least-squares line does a respectable job of minimizing distances, differences between the predicted value of $y$ (i.e. $\hat{y}$) and the actual value of $y$ inevitably occur.

    Definition:

    Residual Error

    The residual error is the difference between the predicted value of $y$ and the actual value of $y$. Geometrically, the residual is the vertical distance between the data point and the least-squares line.

    Formula:

    Residual Error

    Let $(x,y)$ be a pair of data values. Then, $$e = y - \hat{y}$$ where

    $e=$ the residual error
    $y=$ the actual value of the response variable
    $\hat{y}=$ the predicted value of the response variable.
    [Figure: residual errors $e$ shown as vertical distances between data points and the least-squares line.]
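    Given a fitted line, the residual for each data pair is computed as $e = y - \hat{y}$. A minimal sketch, with an illustrative slope and intercept:

```python
# Residual errors for each data pair, given an illustrative fitted line.
a, b = -3.0, 1.6   # intercept and slope (made up for demonstration)
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 5.0, 11.0]

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# A positive residual means the line underestimates the actual y;
# a negative residual means it overestimates. For a least-squares fit,
# the residuals sum to (essentially) zero.
print(residuals)
```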
    The least-squares line can also be used to make predictions for values of the explanatory variable found within and outside of the data set.

    Definition:

    Interpolation and Extrapolation

    Predicting $\hat{y}$ for values of $x$ that are between observed values of $x$ is called interpolation, and predicting $\hat{y}$ for values of $x$ that are outside of the data set is called extrapolation.
    It should be noted that extrapolating can lead to unrealistic results, so caution should be exercised when using this method to forecast trends or predict future behaviours.
    A graph of a scatter plot with the least-squares line sketched through it is not enough when it comes to evaluating the quality of the model. As a result, a goodness-of-fit measure is needed.

    Example

    Using the data from the McDonald's example, we generated the following least-squares line: $$ \begin{align} \hat{y} & =a+b x \\ & =131.6777+14.237 x\end{align}$$
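    As a sanity check, the quoted slope and intercept can be reproduced from the fat and calorie columns of the McDonald's table (rounding during the hand calculation explains the small difference in the intercept):

```python
# Recomputing the least-squares coefficients for the McDonald's data.
fat      = [28, 11, 20, 43, 8, 17, 26]            # x: fat (g)
calories = [520, 290, 420, 740, 240, 370, 520]    # y: calories
n = len(fat)

sx, sy = sum(fat), sum(calories)
sxy = sum(xi * yi for xi, yi in zip(fat, calories))
sxx = sum(xi * xi for xi in fat)

b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # slope, about 14.237
a = (sy - b * sx) / n                          # intercept, about 131.68
print(round(a, 3), round(b, 3))
```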

    Exercises

    Question 1

    What does a correlation coefficient of 0 indicate about the relationship between two variables?

    It indicates that there is no linear relationship between the two variables.

    Solution

    Question 2

    If the correlation coefficient between two variables is -0.85, what can you say about their relationship?

    The two variables have a strong negative linear relationship.

    Solution

    Question 3

    True or False: A high correlation coefficient implies causation between two variables.

    False. Correlation does not imply causation.

    Solution

    Question 4

    What does a positive covariance between two variables indicate?

    It indicates that as one variable increases, the other variable tends to increase as well.

    Solution

    Question 5

    How is covariance different from correlation?

    Covariance measures the direction of the relationship between two variables, while correlation standardizes this measure to a scale from -1 to +1, showing both strength and direction.

    Solution

    Question 6

    True or False: Covariance is unaffected by changes in the scale of measurement of the variables.

    False. Covariance depends on the units of measurement.

    Solution

    Question 7

    What is the primary purpose of regression analysis?

    The primary purpose is to model the relationship between a dependent variable and one or more independent variables, and to make predictions.

    Solution

    Question 8

    What does an $R^2=0.85$ indicate?

    It indicates that 85% of the variation in the dependent variable is explained by the independent variable(s) in the model. The other 15% is due to other factors.

    Solution

    Question 9

    What does a negative covariance between two variables indicate?

    It indicates that as one variable increases, the other variable tends to decrease.

    Solution

    Question 10

    If a residual error is negative, what does this indicate?

    It indicates that the model underestimates the actual value of the dependent variable.

    Solution

    Question 11

    An etiquette course in Turkey has churned up controversy on social media for advising women against licking ice-cream cones, deeming it ``unladylike.`` Apparently, the organizers have yet to clarify what exactly makes licking an ice-cream cone so scandalous—or suggest a more genteel way to savor the frosty indulgence. The course, a finishing school for good manners, also dishes out advice on dressing, talking, and walking in public.

    Meanwhile, over at the Häagen-Dazs in Istanbul, they're tracking ice-cream sales against daily temperatures, proving that, etiquette or not, people still scream for ice cream. Here's a scoop of their data from the first week:

    $$\begin{array} {l|ccccccc} \text{Temperature} (°C) & 25 & 27 & 30 & 32& 35& 37 & 40 \\ \hline \text{Ice-Cream Sales (units)} & 200& 220 & 250 & 280 & 300 & 320 & 350 \end{array}$$

    Question 12

    Thor and Captain America might be able to save the world, but maybe they should save room for a salad, too. Researchers at Binghamton University analyzed the body mass indexes (BMI) of over 3,700 comic book characters and discovered that many male superheroes teeter on the edge of obesity, while their female counterparts are often startlingly underweight. The study also revealed that a third of Marvel's heroes should rethink their dietary choices, and most exhibit body proportions that defy reality. In fact, some female superheroes boast measurements more extreme than those seen in the adult film industry.

    Below, you'll find data on the BMI and body fat percentages of nine male superheroes, alongside some other jaw-dropping insights from the study:

    $$\begin{array}{l|ccccccccc} x: \text { Percentage Body Fat (%) } & 5.8 & 6.5 & 7.1 & 7.4 & 8.2 & 8.5 & 9.2 & 9.4 & 9.6 \\ \hline y: \text { BMI }(\mathrm{kg} / \mathrm{m}^2) & 29.7 & 31.4 & 31.7 & 32.0 & 32.2 & 33.0 & 33.2 & 33.6 & 33.8 \end{array}$$

    Question 13

    CrossFit isn't just a fitness program—it's an identity. As any enthusiast will tell you (at least three times in one conversation), “It's not a workout, bro. It's a lifestyle.” That might explain why CrossFit is muscling in on the meal-kit game alongside PX90 and Weight Watchers. But unlike their competitors, CrossFit's kits ditch the veggies and seasonings in favor of... meat. Lots of meat. Each kit includes 60 ounces of organic chicken, 3 pounds of ground beef, 10 ounces of filet mignon, two 10-ounce rib-eyes, two 10-ounce strip steaks, and two 6-ounce sirloins. That's 10 pounds of protein! — enough fuel to flip tires and lug sandbags like a pro bro!

    The data below shows the number of calories burned at fivve Crossfit session for one 40 year old woman weighing 120 pounds. $$\begin{array}{c|ccccc} x: \text { Number of Minutes: } & 20 & 30 & 40 & 50 & 60 \\ \hline y \text { : Number of Calories: } & 255 & 420 & 485 & 663 & 675 \end{array}$$