WEBVTT
00:00:01.680 --> 00:00:09.800
In this lesson, we’ll learn how to deal with linear correlation and distinguish between different types of correlation.
00:00:11.040 --> 00:00:14.920
Let’s think about what happens when we plot a scatter diagram.
00:00:15.720 --> 00:00:25.320
A scatter diagram can be used to represent bivariate data, where one set of data is paired with another set of data.
00:00:26.240 --> 00:00:35.680
For instance, we might look to plot the daily precipitation in New York City versus fried-chicken sales in pounds.
00:00:36.640 --> 00:00:40.800
Looking at the scatter diagram, there appears to be a pattern, or trend.
00:00:41.600 --> 00:00:48.360
In this case, as the daily precipitation increases, so do the fried-chicken sales.
00:00:49.400 --> 00:00:58.000
We might then say that these two data sets are correlated, meaning there appears to be some sort of relationship between them.
00:00:59.280 --> 00:01:07.320
It’s worth noting though that whilst we might appear to find correlation, that doesn’t necessarily mean that causation exists.
00:01:08.000 --> 00:01:16.360
In other words, we cannot necessarily assume that daily precipitation actually causes fried-chicken sales to increase.
00:01:17.600 --> 00:01:22.840
Now, with that in mind, let’s fully define the word correlation.
00:01:24.040 --> 00:01:30.120
We say that two data sets are correlated when there appears to be a relationship between them.
00:01:30.960 --> 00:01:35.840
We can use a scatter diagram to identify whether this correlation exists.
00:01:36.680 --> 00:01:50.520
Now, more specifically, if we plot these points on a scatter diagram and they mainly appear to lie along a straight line, then they’re said to be linearly correlated.
00:01:51.680 --> 00:02:03.920
Similarly, if they follow some nonlinear trend, such as a curve or a logarithmic trend, then they’re said to be nonlinearly correlated.
00:02:05.080 --> 00:02:13.480
And, of course, if no such trend exists, there’s said to be no correlation.
00:02:14.920 --> 00:02:17.960
Consider the linear correlation we discussed.
00:02:18.920 --> 00:02:24.120
A scatter diagram showing two variables that are linearly correlated might look a little bit like this.
00:02:24.960 --> 00:02:27.360
Similarly, it could look a little something like this.
00:02:28.400 --> 00:02:33.280
The data points in either case appear to lie approximately along a straight line.
00:02:34.600 --> 00:02:37.800
In our second case, nonlinear correlation, the points might look a little something like this.
00:02:38.720 --> 00:02:42.560
In this case, the line of best fit is a curve.
00:02:43.680 --> 00:02:49.160
Finally, if there is no correlation, our scatter diagram might look a little something like this.
00:02:50.160 --> 00:02:56.480
In each of these cases, we’ve considered whether we can actually draw a line of best fit through our points.
00:02:57.040 --> 00:03:01.480
The shape of the line of best fit then tells us information about the type of correlation, if it exists.
00:03:02.840 --> 00:03:11.920
So, with this in mind, let’s look at how to compare a line of best fit with data on a scatter diagram.
00:03:12.760 --> 00:03:18.720
And this will help us determine whether the data is linearly correlated.
00:03:20.320 --> 00:03:23.960
Can we use the line of best fit to describe the trend in the data?
00:03:25.080 --> 00:03:25.640
Why?
00:03:27.080 --> 00:03:30.440
And then we have a scatter diagram with a line of best fit drawn.
00:03:31.160 --> 00:03:34.360
Let’s imagine this supposed line of best fit wasn’t drawn on the diagram.
00:03:35.240 --> 00:03:37.680
How would we construct our own line of best fit?
00:03:38.560 --> 00:03:47.040
How would we find a line that more accurately describes the trend in the data given by the blue points?
00:03:48.280 --> 00:03:50.280
Well, it might look a little something like this.
00:03:51.040 --> 00:03:55.440
Yes, as the values of 𝑥 increase, the values of 𝑦 also increase.
00:03:56.320 --> 00:04:01.640
But we can see that this is not necessarily in a straight line.
00:04:02.360 --> 00:04:06.120
This means 𝑥 and 𝑦 do appear to be correlated.
00:04:06.760 --> 00:04:09.040
But we would say they are nonlinearly correlated.
00:04:09.440 --> 00:04:12.440
The line of best fit is not a straight line.
00:04:13.400 --> 00:04:19.320
And so this would not be a sensible line of best fit to describe the trend in the data.
00:04:20.120 --> 00:04:32.240
We certainly wouldn’t want to use this line of best fit to make predictions or estimates based on the data we’re given, because this data is not linearly correlated.
00:04:32.560 --> 00:04:35.000
It doesn’t approximately follow a straight line.
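As a rough illustration of why a straight line is a poor description of a curved trend, here is a minimal sketch. The data values are made up, and fitting the line by least squares is an assumption, since the lesson does not say how its line of best fit was constructed.

```python
# Hypothetical data: y increases with x, but along a curve (y = x squared).
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x * x for x in xs]


def fit_line(xs, ys):
    """Return slope m and intercept c of the least-squares line y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


m, c = fit_line(xs, ys)

# Residual = actual y minus the y predicted by the straight line.
residuals = [y - (m * x + c) for x, y in zip(xs, ys)]

print("slope:", m, "intercept:", c)
print("residuals:", [round(r, 2) for r in residuals])
# The residuals are positive at both ends and negative in the middle.
# That systematic pattern suggests the trend is curved rather than straight.
```

The slope is positive, which matches the observation that 𝑦 increases as 𝑥 increases, but the pattern in the residuals shows that the points bend away from the line in a consistent way.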
00:04:36.520 --> 00:04:52.160
Now, whilst this wouldn’t be a sensible line of best fit to describe the trend in the data, we did say that both the line of best fit and the apparent trend in the data show that as the values of 𝑥 increase, the values of 𝑦 also appear to increase.
00:04:53.160 --> 00:04:56.800
And there are some phrases we can use to describe this.
00:04:57.720 --> 00:05:07.280
We say that two data sets are positively correlated, or directly correlated, if one data set increases as the other increases.
00:05:08.960 --> 00:05:13.680
In the case of positive linear correlation, the data points might look a little something like this.
00:05:14.400 --> 00:05:26.080
If data sets are negatively correlated, or inversely correlated, then as one set increases, the other will decrease, and vice versa.
00:05:27.160 --> 00:05:36.680
In the case of two data sets that have negative linear correlation, the points appear to follow a line which slopes downwards, as we see.
00:05:37.920 --> 00:05:51.400
So, with this in mind, let’s determine whether data is positively or negatively correlated or not at all correlated using a line of best fit.
00:05:52.440 --> 00:05:56.440
What type of correlation exists between the two variables in the scatter plot shown?
00:05:58.000 --> 00:06:18.280
When we think about correlation, we think about linear correlation, in other words, points that approximately follow a straight line, and nonlinear correlation, where the points might follow a different type of trend, for example, a curve.
00:06:19.320 --> 00:06:32.680
And if things are linearly correlated, we say that they can be positively linearly correlated or negatively linearly correlated, depending on the direction of the line of best fit.
00:06:33.800 --> 00:06:40.760
So, let’s consider the graph we’ve been given here and see if we can draw a line of best fit.
00:06:41.800 --> 00:06:51.360
The line of best fit, of course, does not need to go through the origin, the point zero, zero, although here it does appear that it might.
00:06:52.360 --> 00:06:56.280
And that line of best fit should roughly follow the trend of our points.
00:06:57.120 --> 00:07:00.680
We might now notice that our line of best fit slopes upwards.
00:07:01.360 --> 00:07:03.400
In other words, it has a positive slope.
00:07:04.000 --> 00:07:09.960
So this tells us that as the values of 𝑥 increase, so do the values of 𝑦.
00:07:10.680 --> 00:07:15.280
In this case then, the variables 𝑥 and 𝑦 are positively correlated.
00:07:16.320 --> 00:07:23.880
Specifically, since these points also approximately follow a straight line, we can say that the correlation is linear.
00:07:24.600 --> 00:07:27.080
And so we’ve fully answered the question.
00:07:27.840 --> 00:07:32.360
The type of correlation that exists is positive linear correlation.
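The reasoning in this example, reading the direction of correlation from the slope of the line of best fit, could be sketched as follows. The function names and data are hypothetical, and the least-squares slope stands in for a line drawn by eye.

```python
def best_fit_slope(xs, ys):
    """Slope of the least-squares line through the points (an assumed method)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def direction(xs, ys):
    """Classify the direction of the trend from the sign of the slope."""
    m = best_fit_slope(xs, ys)
    if m > 0:
        return "positive"
    if m < 0:
        return "negative"
    return "no linear trend"


# Made-up points that roughly follow an upward-sloping line.
rising = direction([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.0, 9.8])

# Made-up points that roughly follow a downward-sloping line.
falling = direction([1, 2, 3, 4, 5], [10.0, 8.1, 5.9, 4.2, 2.0])

print(rising, falling)
```

The sign of the slope only gives the direction. Whether the correlation is linear still depends on the points lying approximately along the line, which has to be checked separately, for example by looking at the scatter diagram.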
00:07:34.000 --> 00:07:38.160
Now, in this example, we were given a scatter diagram of a data set.
00:07:38.760 --> 00:07:40.480
This might not always be the case.
00:07:40.960 --> 00:07:44.800
We might instead be given a description of the type of variables.
00:07:45.640 --> 00:07:59.400
As we’ll now see, we’ll then need to use our understanding of how variables relate to one another as a way of determining whether they are positively or negatively correlated or not correlated at all.
00:08:00.560 --> 00:08:08.240
Suppose variable 𝑥 is the number of hours you work and variable 𝑦 is your salary.
00:08:08.960 --> 00:08:13.240
You suspect that the more hours you work, the higher your salary is.
00:08:14.160 --> 00:08:20.040
Does this follow a positive correlation, a negative correlation, or no correlation?
00:08:21.240 --> 00:08:27.640
We’re told that variable 𝑥 is the number of hours worked, whilst variable 𝑦 is the salary.
00:08:28.120 --> 00:08:32.000
And we’re looking to find a relationship, if it exists, between these two variables.
00:08:33.000 --> 00:08:37.280
Now, in fact, the suspicion is that the more hours you work, the higher your salary is.
00:08:37.840 --> 00:08:41.520
So, let’s attempt to plot this on a scatter graph.
00:08:42.680 --> 00:08:50.840
Variable 𝑥 is the number of hours worked, whilst 𝑦 is the salary, so we can label the axes as shown.
00:08:51.680 --> 00:08:53.680
Let’s make up some starting figures.
00:08:54.240 --> 00:09:00.240
Let’s imagine that if you work 15 hours a week, you earn an annual salary of 20,000 pounds.
00:09:00.840 --> 00:09:10.160
You might then assume that if you work 30 hours a week, you earn an annual salary of 40,000 pounds.
00:09:11.160 --> 00:09:21.120
Assuming that the more hours you work, the higher your salary is, we could add extra points on our scatter graph as shown.
00:09:22.240 --> 00:09:29.440
We notice that the points plotted approximately follow a straight line and that this straight line has a positive slope.
00:09:29.920 --> 00:09:31.720
It slopes upward.
00:09:32.920 --> 00:09:41.160
Since this line slopes upwards, we can say that the two variables 𝑥 and 𝑦 must have positive correlation.
00:09:41.760 --> 00:09:48.480
Now, we also assumed that this was positive linear correlation, but that might not be the case.
00:09:49.080 --> 00:09:57.480
We only know that the higher the number of hours, the higher the salary, which means that this is an example of positive correlation.
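The point that we only know the salary rises with the hours, not that it rises in a straight line, can be sketched with a simple check. The first two data pairs are the figures from the example; the other pairs and the function name are made up for illustration.

```python
# (hours worked per week, annual salary in pounds).
# The first two pairs come from the example; the rest are hypothetical.
points = [(15, 20_000), (30, 40_000), (20, 27_000), (25, 33_500), (35, 46_000)]


def increases_together(points):
    """True if salary goes up every time hours go up (idealised model only)."""
    ordered = sorted(points)  # sort by hours worked
    salaries = [salary for _, salary in ordered]
    return all(a < b for a, b in zip(salaries, salaries[1:]))


print(increases_together(points))
```

This check says nothing about whether the points lie on a straight line, only that one variable increases as the other does. Real data is rarely this tidy, so a strict step-by-step check like this suits a made-up model rather than measured data.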
00:09:59.000 --> 00:10:06.080
Now, in this example, we modeled our data points as lying very closely to some straight line.
00:10:06.960 --> 00:10:12.840
How close the data points lie to a line of best fit describes the strength of the correlation.
00:10:13.720 --> 00:10:18.320
For instance, suppose we’re interested in positive linear correlation.
00:10:19.200 --> 00:10:28.160
If all the points lie very close to the line of best fit, as in this example, we can say that’s an example of strong correlation.
00:10:28.960 --> 00:10:36.680
If, however, the points are quite far away from the line of best fit, as in this example, then we say that there is weak correlation.
00:10:37.800 --> 00:10:46.400
Of course, eventually this weak correlation turns into no correlation as the points get further and further away from the line of best fit.
00:10:47.440 --> 00:10:53.000
With this in mind, let’s determine the strength of correlation in our next example.
00:10:54.160 --> 00:10:58.520
State which of the scatter diagrams shows bivariate data with a stronger correlation.
00:10:59.760 --> 00:11:01.920
And then there are two diagrams to choose from.
00:11:02.600 --> 00:11:10.640
Remember, when we think about the strength of a correlation, we’re determining how close the points are to a line of best fit.
00:11:11.640 --> 00:11:15.040
The closer the points are, the stronger the correlation.
00:11:15.640 --> 00:11:19.680
So, it makes sense to begin by drawing the line of best fit on both of our diagrams.
00:11:20.680 --> 00:11:24.680
The line of best fit on diagram one might look a little something like this.
00:11:25.640 --> 00:11:30.960
The points approximately follow a straight line, so there is linear correlation here.
00:11:31.720 --> 00:11:36.400
Specifically, as the values of 𝑥 increase, so do the values of 𝑦.
00:11:37.120 --> 00:11:43.120
So, we can say that 𝑥 and 𝑦 are positively linearly correlated.
00:11:44.680 --> 00:11:48.200
In diagram two, our line of best fit looks quite similar.
00:11:49.040 --> 00:11:53.000
But we notice that all of the points are a little bit further away from the line itself.
00:11:53.640 --> 00:11:57.160
This means in diagram two, the correlation is less strong.
00:11:57.800 --> 00:11:59.160
We might say it’s weak.
00:11:59.560 --> 00:12:02.880
And so the answer is diagram one.
00:12:03.680 --> 00:12:07.880
Scatter diagram one shows bivariate data with a stronger correlation.
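One way to put a number on "how close the points are to the line of best fit" is sketched below. The two data sets are invented stand-ins for the two diagrams, the line is fitted by least squares, and closeness is measured as the average vertical distance from the line. All of these are assumptions made for illustration, not the method used in the lesson.

```python
def mean_distance_from_line(xs, ys):
    """Average vertical distance of the points from their least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return sum(abs(y - (m * x + c)) for x, y in zip(xs, ys)) / n


xs = [1, 2, 3, 4, 5, 6]
diagram_one = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]  # points hug the line
diagram_two = [2.0, 0.9, 4.1, 2.8, 6.2, 4.9]  # points are more spread out

d1 = mean_distance_from_line(xs, diagram_one)
d2 = mean_distance_from_line(xs, diagram_two)

# The smaller average distance indicates the stronger correlation.
stronger = "diagram one" if d1 < d2 else "diagram two"
print(round(d1, 3), round(d2, 3), stronger)
```

The average distance is only meaningful when both data sets use the same scale, as they do here.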
00:12:09.040 --> 00:12:18.560
We’ve now looked at how two different variables can be related and what it means for them to have a linear or nonlinear relationship.
00:12:19.280 --> 00:12:29.200
We’ve considered how to describe the relationship between variables in terms of positive, negative, or no correlation.
00:12:30.280 --> 00:12:36.960
And we’ve looked at how strongly correlated variables are based on how close they are to a line of best fit.
00:12:37.760 --> 00:12:43.400
With all this in mind, let’s recap the key points from this lesson.
00:12:45.000 --> 00:12:53.160
In this video, we learned that if two variables follow a trend of some description, they’re said to be correlated.
00:12:54.040 --> 00:13:03.040
If we model these points on a scatter diagram and they appear to follow approximately a straight line, then linear correlation exists.
00:13:03.920 --> 00:13:12.240
Then, if the line of best fit constructed appears to slope upwards, in other words, its slope is positive, then the variables have positive correlation.
00:13:12.680 --> 00:13:21.040
And if that line of best fit slopes downwards, if it has negative slope, then those variables are said to be negatively correlated.
00:13:21.880 --> 00:13:32.880
Now, if no trend exists at all, in other words, if no line or curve of best fit can be constructed, then we say that there is no correlation.
00:13:33.560 --> 00:13:42.400
Finally, we saw that we can determine the strength of the correlation by considering how close all of the points lie to the LOBF, the line of best fit.