Correlation and Regression
Paired Data and Scatter Plots
Definition:
Explanatory and Response Variables
Definition:
Scatter Plot
Remark
- A scatter plot shows the strength of the relationship between the two variables: gauge how tightly or loosely the data points follow a line.
- A scatter plot also shows whether the relationship between the variables is positive, negative, or nonexistent: examine the overall direction in which the data points flow.
Measures of Strength and Direction
Covariance
Definition:
Covariance
Formula:
Population Covariance
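In its standard form, assumed here to match the symbols listed below, the population covariance (written $\sigma_{xy}$, the symbol used for it in the correlation formula later in this section) is

$$\sigma_{xy}=\frac{\sum_{i=1}^{N}\left(x_i-\mu_x\right)\left(y_i-\mu_y\right)}{N}$$

where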
$x_i=$ the data points for the explanatory variable, $x$
$y_i=$ the data points for the response variable, $y$
$\mu_x=$ the mean for $x$
$\mu_y=$ the mean for $y$
$N=$ the number of data points.
Formula:
Sample Covariance
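In its standard form, assumed here to match the symbols listed below, the sample covariance (written $s_{xy}$, the symbol used for it in the correlation formula later in this section) divides by $n-1$ rather than $n$:

$$s_{xy}=\frac{\sum_{i=1}^{n}\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)}{n-1}$$

where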
$x_i=$ the data points for the explanatory variable, $x$
$y_i=$ the data points for the response variable, $y$
$\bar{x}=$ the sample mean for $x$
$\bar{y}=$ the sample mean for $y$
$n=$ the number of data points.
- When two variables move in the same direction, the covariance will be a positive number.
- When two variables move in opposite directions, the covariance will be a negative number.
- When two variables do not exhibit any particular pattern, the covariance will be close to zero.
- The size of the covariance also depends on the units in which the variables are measured, so its magnitude alone is hard to interpret.
Correlation
Definition:
Coefficient of Correlation
Formula:
Coefficient of Correlation (Population)
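Using the symbols listed below, the standard form of the population coefficient (written $\rho$, as in the remarks that follow) is assumed to be

$$\rho=\frac{\sigma_{xy}}{\sigma_x \sigma_y}$$

where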
$\sigma_{xy}=$ the population covariance
$\sigma_x=$ the population standard deviation for $x$
$\sigma_y=$ the population standard deviation for $y$
Formula:
Coefficient of Correlation (Sample)
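Using the symbols listed below, the standard form of the sample coefficient (written $r$, as in the remarks that follow) is assumed to be

$$r=\frac{s_{xy}}{s_x s_y}$$

where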
$s_{xy}=$ the sample covariance
$s_x=$ the sample standard deviation for $x$
$s_y=$ the sample standard deviation for $y$
Formula:
Alternate Coefficient of Correlation (Sample)
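The usual computational form, assumed here to be the one intended, works directly from sums of the raw data. It also needs $\sum xy$, the sum of the products of the paired values, which is not among the symbols listed below:

$$r=\frac{n \sum x y-\left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n \sum x^2-\left(\sum x\right)^2\right]\left[n \sum y^2-\left(\sum y\right)^2\right]}}$$

where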
$\sum x=$ the sum of the data points for the explanatory variable, $x$
$\sum y=$ the sum of the data points for the response variable, $y$
$\sum x^2=$ the sum of the squares of the data points for the explanatory variable, $x$
$\sum y^2=$ the sum of the squares of the data points for the response variable, $y$
$n=$ the number of data points.
- If $r=1$, then there is a perfect positive linear relationship between the two variables.
- If $r=-1$, then there is a perfect negative linear relationship between the two variables.
- If $r=0$, then there is no linear relationship between the two variables.
- Positive values of $r$ and $\rho$ imply that as $x$ increases, $y$ tends to increase.
- Negative values of $r$ and $\rho$ indicate that as $x$ increases, $y$ tends to decrease.
- The values of $r$ and $\rho$ stay the same regardless of which variable has been designated as the explanatory and which one is labelled as the response.
- The values of $r$ and $\rho$ remain the same, even if the variables are converted into different units.
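The last two properties can be checked numerically. The sketch below is only an illustration: the helper `pearson_r` and the height/weight data are hypothetical and do not come from the text. It computes $r$ three ways: as given, with the roles of $x$ and $y$ swapped, and after converting both variables to different units.

```python
import math

def pearson_r(x, y):
    """Sample correlation coefficient computed from deviations about the means."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    # The (n - 1) divisors of the covariance and standard deviations cancel.
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical illustration data (not from the text): inches and pounds.
height_in = [60, 62, 65, 68, 71]
weight_lb = [115, 120, 140, 155, 170]

r_original = pearson_r(height_in, weight_lb)
r_swapped = pearson_r(weight_lb, height_in)        # explanatory/response exchanged
height_cm = [h * 2.54 for h in height_in]          # inches -> centimetres
weight_kg = [w * 0.45359237 for w in weight_lb]    # pounds -> kilograms
r_converted = pearson_r(height_cm, weight_kg)

print(round(r_original, 4), round(r_swapped, 4), round(r_converted, 4))
```

All three printed values should agree, apart from floating-point rounding far beyond the fourth decimal place.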
Remark
- Since it is usually not possible to obtain or graph all of the data points from a population, a scatter plot provides only one snapshot of the data, captured from a random sample. Because of this, the value of $r$ can change from sample to sample, even though the samples are drawn from the same population.
- The value of $r$ is sensitive to the omission of unusually small or large data values in a random sample, meaning that excluding these data points can change the final value of $r$.
- Correlation does not imply causation. The coefficient of correlation only measures the strength of relationship between two variables and does not make any implications about cause and effect. The fact that two variables increase or decrease together does not mean that change in one is causing changes in the other.
Example
After years of stagnation, the housing market in the U.S. is beginning to show signs of recovery. Last year, the median price of a home in Chicago was $\$230{,}000$, up $8.5\%$. Below are the ages and selling prices of six homes in the suburb of West Englewood.
$$\begin{array} {c|c} \text { Age of Property } & \text { Selling Price of Home } \\ \text { (years) } & \text { (thousands of dollars) } \\ \hline 5 & 321 \\ 7 & 315 \\ 15 & 267 \\ 25 & 266 \\ 34 & 242 \\ 37 & 208 \\ \hline \end{array}$$
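One natural calculation for this table is the sample covariance and the sample coefficient of correlation. The sketch below assumes the standard definitions (deviations about the sample means, divisor $n-1$); the variable names are illustrative only.

```python
import math

# Data from the table above: age in years, selling price as tabulated.
age = [5, 7, 15, 25, 34, 37]
price = [321, 315, 267, 266, 242, 208]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(price) / n

# Sample covariance: sum of cross-deviations divided by n - 1.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, price)) / (n - 1)

# Sample standard deviations of x and y.
s_x = math.sqrt(sum((x - x_bar) ** 2 for x in age) / (n - 1))
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in price) / (n - 1))

# Sample coefficient of correlation.
r = s_xy / (s_x * s_y)

print(f"s_xy = {s_xy:.1f}, r = {r:.4f}")
```

The covariance is negative and $r$ comes out close to $-0.95$, which points to a strong negative linear relationship: in this sample, older homes tend to sell for less.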
The Coefficient of Determination
Definition:
The Coefficient of Determination
Formula:
The Coefficient of Determination
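For a least-squares line with a single explanatory variable, the coefficient of determination is assumed here to be the square of the correlation coefficient:

$$R^2=r^2$$

where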
$r=$ the correlation coefficient.
Formula:
Percentage of Variance Accounted
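Assuming the usual convention, the coefficient of determination is reported as a percentage by multiplying it by 100:

$$\text{Percentage of variance accounted for}=r^2 \times 100 \%$$

For instance, $r=0.9$ gives $r^2=0.81$, so $81\%$ of the variation in $y$ is accounted for by the linear relationship with $x$.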
Remark
- A large $R^2$ value should not always be interpreted as meaning that the estimated regression line fits the data well. It is quite possible that another function describes the trend in the data better.
- The coefficient of determination, $R^2$, and the correlation coefficient, $r$, can both be greatly affected by just one data point (or a few data points). Adding or removing data points can change the slope of the line, which in turn changes the values of $r$ and $R^2$.
- $R^2$ cannot tell us whether the coefficient estimates and the predictions offered by the model are biased; for that, we need to consult the residual plots.
- Unexplained variation can be due to chance or to the presence of a lurking variable, one that is neither an explanatory nor a response variable but can be responsible for changes in both $x$ and $y$.
Example
Last January, two men in Cambridgeshire were arrested for growing cannabis and attempted to convince a sceptical court that they had mistaken their crop for bonsai trees. A bold claim, considering their "bonsais" were flourishing to such an extent that when the police raided their house, the suspects managed to hide among them.
At Cambridge University, researchers are studying how sunlight influences the carbon dioxide emissions of a newly discovered bonsai species. The table below displays the hours of sunlight exposure and the corresponding carbon dioxide volume, measured in cubic centimetres, produced by a single tree across five different observations. $$\begin{array}{cc} \text { Exposure to Sunlight } & \text { Amount of Carbon Dioxide } \\ \text { (hours) } & \left(\mathrm{cm}^3\right) \\ \hline 1 & 3 \\ 3 & 6 \\ 5 & 8 \\ 7 & 9 \\ 8 & 10 \\ \hline \end{array}$$
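A sketch of how $r$ and the coefficient of determination could be obtained for this table, assuming the usual computational (raw-sums) formula for $r$ and $R^2=r^2$. The variable names are illustrative only.

```python
import math

# Data from the table above.
sunlight_hours = [1, 3, 5, 7, 8]
co2_cm3 = [3, 6, 8, 9, 10]

n = len(sunlight_hours)
sum_x = sum(sunlight_hours)
sum_y = sum(co2_cm3)
sum_xy = sum(x * y for x, y in zip(sunlight_hours, co2_cm3))
sum_x2 = sum(x ** 2 for x in sunlight_hours)
sum_y2 = sum(y ** 2 for y in co2_cm3)

# Raw-sums form of the sample correlation coefficient.
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

# Coefficient of determination and its percentage form.
r_squared = r ** 2
percent_explained = 100 * r_squared

print(f"r = {r:.4f}, R^2 = {r_squared:.4f} ({percent_explained:.1f}%)")
```

Here $R^2$ is roughly $0.96$, so about $96\%$ of the variation in carbon dioxide volume is accounted for by its linear relationship with hours of sunlight.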
Example
China is embracing facial recognition on an epic scale. At traffic junctions, jaywalkers are shamed by having their faces projected onto giant screens, and at Ming-dynasty temples it is used to stop toilet paper theft; it's so good that it can tell if you've had plastic surgery. In schools, surveillance is ramped up: one high school scans students every 30 seconds to spot yawners or daydreamers, while universities use it to control dorm access, blocking "strangers" and, inconveniently, boyfriends.
At one college, facial recognition tracks attendance and absenteeism. The table below shows the number of classes missed by five Data Analysis students and their final grades. $$\begin{array}{cc} \text { Classes Missed } & \text { Final Mark (out of 100) } \\ \hline 10 & 75 \\ 15 & 65 \\ 20 & 50 \\ 25 & 40 \\ 30 & 30 \end{array}$$
Regression
The Least Squares Method
Definition:
The Least-Squares Line
- The slope is a number which describes the rate of change between the two variables. It tells us how a change in one unit of the explanatory variable affects the value of the response variable. The size of the change (large/small) is reflected in the numerical value of the slope, and the direction of change (increasing/decreasing) by its sign.
- The intercept is the point where the graph of the line crosses the $y$-axis; it is the predicted value of the response variable when $x=0$.
Formula:
The Slope of the Least-Squares Line
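Written with the symbols listed below, the standard least-squares slope (denoted $b$, as in the rest of this section) is assumed to be

$$b=\frac{n \sum x_i y_i-\left(\sum x_i\right)\left(\sum y_i\right)}{n \sum x_i^2-\left(\sum x_i\right)^2}$$

where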
$n=$ the number of data points
$x_i=$ each value of the explanatory variable, $x$
$y_i=$ each value of the response variable, $y$
Formula:
The Intercept of the Least-Squares Line
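Written with the symbols listed below, the standard least-squares intercept (denoted $a$, as in the rest of this section) is assumed to be

$$a=\frac{\sum y_i-b \sum x_i}{n}=\bar{y}-b \bar{x}$$

where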
$n=$ the number of data points
$x_i=$ each value of the explanatory variable, $x$
$y_i=$ each value of the response variable, $y$
$b=$ the slope of the least-squares line
Formula:
The Equation of the Least-Squares Line
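In the form used for the worked example later in this section, the line is

$$\hat{y}=a+b x$$

where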
$\hat{y}=$ the predicted value of the response variable
$x=$ the explanatory variable
$a=$ the intercept of the least-squares line
$b=$ the slope of the least-squares line
Example
McDonald's, the world's largest fast food chain, has worked to clean up its image by publishing nutritional guides for its menu items. While salads and fruits were added to attract health-conscious customers, most still prefer their iconic burgers. Below are some of McDonald's most popular burgers, along with their fat content and calorie counts.
$$\begin{array}{lcc} \text { Sandwich } & \text { Fat }(g) & \text { Calories } \\ \hline \text { Big Mac } & 28 & 520 \\ \text { Cheeseburger } & 11 & 290 \\ \text { Double Cheeseburger } & 20 & 420 \\ \text { Double Quarter Pounder } & 43 & 740 \\ \text { Hamburger } & 8 & 240 \\ \text { McDouble } & 17 & 370 \\ \text { Quarter Pounder With Cheese } & 26 & 520 \end{array}$$
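A sketch of how the least-squares slope and intercept could be computed for this table, assuming the standard raw-sums formulas with fat as the explanatory variable and calories as the response. The variable names are illustrative only.

```python
# Data from the table above, in the same sandwich order.
fat_g = [28, 11, 20, 43, 8, 17, 26]
calories = [520, 290, 420, 740, 240, 370, 520]

n = len(fat_g)
sum_x = sum(fat_g)
sum_y = sum(calories)
sum_xy = sum(x * y for x, y in zip(fat_g, calories))
sum_x2 = sum(x ** 2 for x in fat_g)

# Slope: b = (n*Sxy - Sx*Sy) / (n*Sx2 - Sx^2)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Intercept: a = (Sy - b*Sx) / n, equivalently y_bar - b * x_bar
a = (sum_y - b * sum_x) / n

print(f"y-hat = {a:.4f} + {b:.4f} x")
```

The slope of about $14.24$ says that each additional gram of fat is associated with roughly $14$ more calories. The intercept agrees with the line quoted later in this section to two decimal places; the small difference beyond that comes from rounding the slope before computing the intercept.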
Definition:
Residual Error
Formula:
Residual Error
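Under the usual convention, assumed here, the residual is the actual value minus the predicted value:

$$e=y-\hat{y}$$

where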
$e=$ the residual error
$y=$ the actual value of the response variable
$\hat{y}=$ the predicted value of the response variable.
Definition:
Interpolation and Extrapolation
Example
Using the data from the McDonald's example, we generated the following least-squares line: $$\begin{aligned} \hat{y} & =a+b x \\ & =131.6777+14.237 x\end{aligned}$$
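A sketch of how this line might be used, assuming the residual is defined as actual minus predicted. Predicting inside the observed range of fat values (8 g to 43 g) is interpolation, and predicting outside it is extrapolation. The helper names `predict`, `residual`, and `is_extrapolation` are hypothetical.

```python
# Coefficients of the least-squares line quoted above.
A = 131.6777   # intercept
B = 14.237     # slope

# Observed data from the McDonald's table.
fat_g = [28, 11, 20, 43, 8, 17, 26]
calories = [520, 290, 420, 740, 240, 370, 520]

def predict(x):
    """Predicted calories for a sandwich with x grams of fat."""
    return A + B * x

def residual(x, y):
    """Residual error: actual value minus predicted value."""
    return y - predict(x)

def is_extrapolation(x):
    """True when x lies outside the observed range of the explanatory variable."""
    return x < min(fat_g) or x > max(fat_g)

# Big Mac: 28 g of fat, 520 calories observed.
big_mac_prediction = predict(28)
big_mac_residual = residual(28, 520)
print(f"predicted = {big_mac_prediction:.4f}, residual = {big_mac_residual:.4f}")

# 20 g is inside the observed range (interpolation); 50 g is outside it.
print(is_extrapolation(20), is_extrapolation(50))
```

The Big Mac residual is negative, so the line overestimates its calories. A prediction at 50 g would be an extrapolation and should be treated with caution, since the data give no evidence that the linear trend continues beyond 43 g.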
Exercises
Question 1
What does a correlation coefficient of 0 indicate about the relationship between two variables?
It indicates that there is no linear relationship between the two variables.
Solution
Question 2
If the correlation coefficient between two variables is -0.85, what can you say about their relationship?
The two variables have a strong negative linear relationship.
Solution
Question 3
True or False: A high correlation coefficient implies causation between two variables.
False. Correlation does not imply causation.
Solution
Question 4
What does a positive covariance between two variables indicate?
It indicates that as one variable increases, the other variable tends to increase as well.
Solution
Question 5
How is covariance different from correlation?
Covariance measures the direction of the relationship between two variables, while correlation standardizes this measure to a scale from -1 to +1, showing both strength and direction.
Solution
Question 6
True or False: Covariance is unaffected by changes in the scale of measurement of the variables.
False. Covariance depends on the units of measurement.
Solution
Question 7
What is the primary purpose of regression analysis?
The primary purpose is to model the relationship between a dependent variable and one or more independent variables, and to make predictions.
Solution
Question 8
What does an $R^2=0.85$ indicate?
It indicates that 85% of the variation in the dependent variable is explained by the independent variable(s) in the model. The other 15% is due to other factors.
Solution
Question 9
What does a negative covariance between two variables indicate?
It indicates that as one variable increases, the other variable tends to decrease.
Solution
Question 10
If a residual error is negative, what does this indicate?
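It indicates that the model overestimates the actual value of the dependent variable: since the residual is the actual value minus the predicted value, a negative residual means the predicted value is larger than the actual value.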
Solution
Question 11
An etiquette course in Turkey has churned up controversy on social media for advising women against licking ice-cream cones, deeming it "unladylike." Apparently, the organizers have yet to clarify what exactly makes licking an ice-cream cone so scandalous, or to suggest a more genteel way to savor the frosty indulgence. The course, a finishing school for good manners, also dishes out advice on dressing, talking, and walking in public.
Meanwhile, over at the Häagen-Dazs in Istanbul, they're tracking ice-cream sales against daily temperatures, proving that, etiquette or not, people still scream for ice cream. Here's a scoop of their data from the first week:
$$\begin{array} {l|ccccccc} \text{Temperature} (°C) & 25 & 27 & 30 & 32& 35& 37 & 40 \\ \hline \text{Ice-Cream Sales (units)} & 200& 220 & 250 & 280 & 300 & 320 & 350 \end{array}$$
Question 12
Thor and Captain America might be able to save the world, but maybe they should save room for a salad, too. Researchers at Binghamton University analyzed the body mass indexes (BMI) of over 3,700 comic book characters and discovered that many male superheroes teeter on the edge of obesity, while their female counterparts are often startlingly underweight. The study also revealed that a third of Marvel's heroes should rethink their dietary choices, and most exhibit body proportions that defy reality. In fact, some female superheroes boast measurements more extreme than those seen in the adult film industry.
Below, you'll find data on the BMI and body fat percentages of nine male superheroes, alongside some other jaw-dropping insights from the study:
$$\begin{array}{l|ccccccccc} x: \text { Percentage Body Fat } (\%) & 5.8 & 6.5 & 7.1 & 7.4 & 8.2 & 8.5 & 9.2 & 9.4 & 9.6 \\ \hline y: \text { BMI }(\mathrm{kg} / \mathrm{m}^2) & 29.7 & 31.4 & 31.7 & 32.0 & 32.2 & 33.0 & 33.2 & 33.6 & 33.8 \end{array}$$
Question 13
CrossFit isn't just a fitness program; it's an identity. As any enthusiast will tell you (at least three times in one conversation), “It's not a workout, bro. It's a lifestyle.” That might explain why CrossFit is muscling in on the meal-kit game alongside P90X and Weight Watchers. But unlike their competitors, CrossFit's kits ditch the veggies and seasonings in favor of... meat. Lots of meat. Each kit includes 60 ounces of organic chicken, 3 pounds of ground beef, 10 ounces of filet mignon, two 10-ounce rib-eyes, two 10-ounce strip steaks, and two 6-ounce sirloins. That's over 10 pounds of protein, enough fuel to flip tires and lug sandbags like a pro bro!
The data below show the number of calories burned at five CrossFit sessions by one 40-year-old woman weighing 120 pounds. $$\begin{array}{c|ccccc} x: \text { Number of Minutes } & 20 & 30 & 40 & 50 & 60 \\ \hline y: \text { Number of Calories } & 255 & 420 & 485 & 663 & 675 \end{array}$$