WEBVTT
00:00:02.430 --> 00:00:04.910
In this video, we’re gonna learn about linear correlation.
00:00:05.340 --> 00:00:12.160
There are lots of situations where we have two sets of data related to individuals or events, and we call this bivariate data.
00:00:12.810 --> 00:00:15.950
For example, students’ scores on math tests and English tests.
00:00:16.190 --> 00:00:18.020
Each student took both tests.
00:00:18.060 --> 00:00:21.220
So we have two sets of numbers related to individual students.
00:00:22.250 --> 00:00:28.610
We can use one set for the “𝑥”-coordinates and the other for the “𝑦”-coordinates and plot all the data as points on a scatterplot.
00:00:29.540 --> 00:00:35.790
Then we can examine any patterns that may emerge in the scatterplots to see if they suggest any association between the two data sets.
00:00:37.020 --> 00:00:40.370
One type of pattern that can emerge is a straight line relationship.
00:00:40.810 --> 00:00:51.230
This has turned out to be so useful in scientific and statistical analysis that techniques have been developed to quantify and interpret linear correlation between two associated sets of data.
00:00:52.140 --> 00:00:56.910
So we’re gonna talk about linear correlation and the terminology that we use to describe it.
00:00:58.150 --> 00:01:01.370
Let’s start by describing an experiment that I do with my math students.
00:01:01.780 --> 00:01:08.960
I give each student a different-sized circle and ask them to measure the diameter and circumference and then we gather in all the results.
00:01:09.840 --> 00:01:13.830
This sounds pretty easy maybe, but they only have straight rulers to measure with.
00:01:14.030 --> 00:01:20.670
So they need to be quite creative about how they measure the circumference, and I don’t let them calculate it if they happen to know about “𝜋” and the formula.
00:01:21.700 --> 00:01:32.270
So we have two bits of data about each circle and we use the diameters as the “𝑥”-coordinates and the circumferences as the “𝑦”-coordinates and we plot all these points on a scatterplot.
00:01:33.160 --> 00:01:37.690
So here’s the data that I gather for one class, and here’s the scatterplot.
00:01:39.040 --> 00:01:44.010
Now the first thing that jumps off the page is this point here, which looks very different to all the others.
00:01:45.340 --> 00:01:51.540
Most points are close to a straight line running something like this, but the other point is a long way from the pack.
00:01:52.150 --> 00:01:56.880
In fact, it turned out to be due to a student who read out their diameter and circumference the wrong way round.
00:01:57.290 --> 00:02:00.520
So we were able to swap the “𝑥”- and “𝑦”-coordinates over to correct them.
00:02:02.080 --> 00:02:08.170
But if the student who made the mistake hadn’t been in the room to explain what they’d done, then we’d have had a tricky decision to make.
00:02:08.490 --> 00:02:10.490
Why’s that point so far away from the others?
00:02:10.640 --> 00:02:16.100
Was it a genuine circle that was very different from all the others, or was there some kind of mistake?
00:02:17.220 --> 00:02:19.610
You shouldn’t just throw away data because it looks different.
00:02:19.850 --> 00:02:23.670
You need to find out more about it: is it real or is it a mistake?
00:02:23.900 --> 00:02:27.130
If it’s real, then you need to take it into account in your analysis.
00:02:28.130 --> 00:02:32.590
So after our correction, this is what the scatterplot looked like with a new line of best fit.
00:02:33.700 --> 00:02:41.720
The line of best fit that we’ve drawn is positioned in such a way that it minimises the overall vertical distance to all of the points, like these orange lines here.
00:02:42.370 --> 00:02:44.380
It’s called a least squares regression line.
00:02:44.850 --> 00:02:47.740
But we’re not gonna go into the detail of how we calculate that just now.
00:02:47.930 --> 00:03:01.340
We’re just gonna draw it by eye, trying our ruler in lots of different positions until we find a route that is as close as possible to as many of the points as possible, with a nice even balance of points above and below the line along its entire length.
00:03:02.300 --> 00:03:11.310
So we’ve got points above and below here, we’ve got points above and below here, and we’ve also got points above and below in the middle here.
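Although the line is fitted by eye in the video, the least squares idea mentioned above can also be done numerically. Here’s a minimal sketch using the standard closed-form formula, with made-up diameter and circumference measurements rather than the class’s actual data:

```python
def least_squares_line(xs, ys):
    """Fit y = m*x + b, minimising the sum of squared vertical distances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - m * mean_x  # the line passes through (mean_x, mean_y)
    return m, b

# Hypothetical measurements (inches), standing in for the class data.
diameters = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
circumferences = [3.2, 6.2, 9.5, 12.4, 15.9, 18.7]

m, b = least_squares_line(diameters, circumferences)
print(f"slope = {m:.2f}, intercept = {b:.2f}")  # slope comes out close to pi
```

With these made-up numbers the slope lands near 3.1, which is the point of the circle experiment.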
00:03:12.800 --> 00:03:15.860
And now we can use the line of best fit to make predictions.
00:03:16.190 --> 00:03:26.640
For example, if we had a circle that had a diameter of “two” inches, we could draw a line up to our line of best fit and across to the “𝑦”-axis.
00:03:27.730 --> 00:03:31.740
And that looks like it would have a circumference of between “six” and “six and a half” inches.
00:03:32.830 --> 00:03:41.510
So if you know the diameter of a circle, you can use this graph to make a prediction about what its circumference would be, without actually having to do any measurements on the circle.
00:03:42.020 --> 00:03:45.540
And likewise if we know the circumference, we could make a prediction about the diameter.
00:03:45.810 --> 00:03:57.380
So if we had a circle with a circumference of “twenty” inches, we could draw a line across from the “𝑦”-axis to our line of best fit and then down to the “𝑥”-axis.
00:03:59.050 --> 00:04:02.990
And it looks like that’s just under “six point five” inches in diameter.
00:04:05.110 --> 00:04:10.620
We could even go as far as calculating the equation of that line of best fit and using that to make our predictions.
00:04:11.450 --> 00:04:15.980
So for example, if we had a diameter of “three” inches, “𝑥 will be equal to three”.
00:04:16.200 --> 00:04:27.550
We can plug that into our equation and then that would give us an answer of “nine point four” inches for the circumference, which is a bit easier and probably more accurate than reading off of that scale.
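The video only quotes the slope of that equation, so as a stand-in, assume a hypothetical line of best fit of y = 3.1x + 0.1, which reproduces the prediction just described:

```python
# Hypothetical line of best fit: y = 3.1x + 0.1
# (the video quotes the slope 3.1; the intercept here is an assumption).
slope, intercept = 3.1, 0.1

def predict_circumference(diameter):
    # Plug the diameter in as x and read off y, the predicted circumference.
    return slope * diameter + intercept

print(round(predict_circumference(3), 1))  # 9.4 inches, as in the video
```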
00:04:29.280 --> 00:04:35.670
Now looking at the equation, we can see that the slope or the gradient is “positive three point one”.
00:04:36.170 --> 00:04:50.600
And because the pattern of dots makes a pretty close fit to a straight line, and that line has this positive slope as we’ve just seen, we say that the points are positively correlated, or if you want to be really accurate, positively linearly correlated.
00:04:52.290 --> 00:04:57.720
And if the points had suggested a line with a negative slope, then we’d have said that they had negative correlation.
00:04:59.230 --> 00:05:04.300
So the terms positive and negative correlation are statements about bivariate data.
00:05:05.080 --> 00:05:17.590
So if higher values on one aspect of data are associated with higher values on the other aspect of data and lower values on one aspect of data are associated with lower values on the other aspect of data, we call that positive correlation.
00:05:18.730 --> 00:05:26.810
And if high values on one aspect of data are associated with low values on the other aspect of data, we call that negative correlation.
00:05:27.940 --> 00:05:34.450
And some people call positive correlation direct correlation and negative correlation inverse correlation.
00:05:34.580 --> 00:05:36.710
So they’re terms that you might also come across.
00:05:38.020 --> 00:05:39.350
But that doesn’t really cover it all.
00:05:39.530 --> 00:05:42.260
Sometimes there’s no correlation between two data sets.
00:05:42.660 --> 00:05:51.800
For example, if you plotted the number of doughnuts people can eat without licking their lips against the number of books that they’ve read over the past year, you might expect a scatterplot looking something like this.
00:05:52.360 --> 00:05:56.150
There’s no association between the two at all; there’s no correlation.
00:05:57.440 --> 00:06:05.410
Knowing how many books someone has read over the past year tells you nothing about how many doughnuts they’re likely to be able to eat without licking their lips and vice versa.
00:06:07.080 --> 00:06:15.620
Okay then, we’ve got a basic idea of what correlation is now: it’s a way to describe apparent associations between data sets or even the lack of association between them.
00:06:16.040 --> 00:06:18.500
Let’s go through a summary of what the basic types are.
00:06:19.870 --> 00:06:26.220
We’ve got positive or direct correlation, negative or inverse correlation, and no correlation.
00:06:26.720 --> 00:06:29.380
But there are also different strengths of correlation.
00:06:30.550 --> 00:06:35.660
So strong correlation is when the points are closer to a line of best fit.
00:06:35.880 --> 00:06:43.130
Weaker correlation is when they’re scattered a bit more randomly further away from that line of best fit; there’s a bit more variation going on there.
00:06:44.310 --> 00:06:54.750
So for example with weak positive correlation, you still get higher data values on one data aspect associated with higher values on the other data aspect, and lower with lower and so on.
00:06:55.080 --> 00:06:59.980
But the picture is a little bit more confused; it’s not quite so clear that they’re correlated.
00:07:01.280 --> 00:07:08.410
And likewise with negative correlation, you’ve still got high values on one data aspect associated with low values on the other data aspect.
00:07:08.790 --> 00:07:12.800
But those points don’t conform to that line of best fit so clearly.
00:07:14.830 --> 00:07:18.610
Now this strong and weak correlation idea is all a bit fluffy and woolly.
00:07:20.070 --> 00:07:29.550
If we drew the axes slightly differently and used a different scale, we could make correlation look stronger or weaker by having the points more spaced out or closer to the line.
00:07:29.920 --> 00:07:31.220
So that’s not really that great.
00:07:32.300 --> 00:07:38.230
But luckily we have something called a correlation coefficient which quantifies the strength of the correlation.
00:07:38.580 --> 00:07:48.790
And this is a number that runs on a scale from “negative one” for perfect negative correlation through “zero” for no correlation up to “positive one” for perfect positive correlation.
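The video doesn’t name a specific coefficient, but the most common one fitting this description is Pearson’s r (an assumption here); a minimal sketch of computing it:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient: runs from -1 through 0 to +1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # all points on a rising line: ~ 1.0
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # all points on a falling line: ~ -1.0
```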
00:07:50.100 --> 00:07:56.210
So perfect negative correlation would be when all of the points exactly sit on that line of best fit.
00:07:56.670 --> 00:08:01.590
In perfect positive correlation, all the points would exactly fit on that line of best fit.
00:08:02.890 --> 00:08:08.510
So in both of those cases, our line of best fit would make perfect predictions of one thing from the other.
00:08:09.850 --> 00:08:19.240
So going back to our circle measuring task that we did with my students, that should’ve given us perfect positive correlation between the diameter and the circumference of a circle.
00:08:19.440 --> 00:08:25.470
We know that there’s a formula that exactly describes this relationship: the circumference is “𝜋 times the diameter”.
00:08:26.390 --> 00:08:32.370
Now the only reason that it didn’t come out perfect was that the students weren’t able to measure the circles with “a hundred percent” accuracy.
00:08:32.800 --> 00:08:35.480
But we did see a pretty strong positive correlation.
00:08:35.780 --> 00:08:46.560
And we had a good deal of confidence that the predictions of one aspect based on the other using our line of best fit were going to be quite reliable because all the data points were close to that line.
00:08:47.380 --> 00:08:51.020
The line was a good predictor for the data points that we gathered.
00:08:52.380 --> 00:08:56.710
So going back to our scale, we had correlation which was quite strong.
00:08:56.710 --> 00:08:59.980
It was probably in this region, not “one” but approaching “one”.
00:09:01.300 --> 00:09:03.510
So in the real world, things are quite messy.
00:09:03.510 --> 00:09:08.500
So we would probably never expect to get perfect positive or perfect negative correlation.
00:09:08.810 --> 00:09:21.170
We would always be operating somewhere in this kind of zone in between, and we’d be looking at the tendency: are we generally closer to “negative one”, generally closer to “zero”, or generally closer to “one”?
00:09:22.460 --> 00:09:29.270
So the value of the correlation coefficient tells us how reliable the predictions made using our line of best fit are.
00:09:29.790 --> 00:09:34.090
Close to “negative one” or “positive one”, that means they’re quite reliable.
00:09:34.410 --> 00:09:37.000
Closer to “zero”, they’re totally unreliable.
00:09:38.410 --> 00:09:40.810
So let’s have a look at these two scatterplots.
00:09:41.320 --> 00:09:45.680
So there are two classes, A and B, and they both did a math test and an English test.
00:09:45.680 --> 00:09:49.630
And we’ve used the English scores as our “𝑥”-coordinates and the math scores as our “𝑦”-coordinates.
00:09:50.140 --> 00:09:53.310
So for class A, we’ve got this particular pattern.
00:09:53.310 --> 00:09:57.180
Everybody scored about “fifty” on English, but there’s a complete range of scores on math.
00:09:57.510 --> 00:10:02.340
And for class B everybody scored about “fifty” on math, but there’s a complete range of scores on English.
00:10:03.290 --> 00:10:06.940
Now those points suggest a pretty clear line of best fit in each case.
00:10:07.210 --> 00:10:13.790
So for class A, the line of best fit would be vertical; and for class B, the line of best fit would be horizontal.
00:10:14.770 --> 00:10:17.880
So how strong do you think the correlation is in each case?
00:10:19.650 --> 00:10:24.150
Well, in fact, in both cases we’ve got “zero” or no correlation.
00:10:25.400 --> 00:10:29.670
And that’s because knowing one of the scores tells you nothing about the other.
00:10:29.940 --> 00:10:33.730
There’s no predictability of one score based on the other score.
00:10:34.790 --> 00:10:41.890
In class A, if I know someone scored “fifty” for English, that doesn’t tell me anything at all about what they might have scored on their math test.
00:10:41.890 --> 00:10:46.770
People who scored “fifty” for English scored a whole range of different scores on their math test.
00:10:47.420 --> 00:11:01.830
And likewise for class B, if I know somebody scored “fifty” on math, that doesn’t enable me to predict what score they got on their English test because people who scored “fifty” on math scored the complete range of different scores on their English test.
00:11:03.340 --> 00:11:15.580
This means that although the points suggest a pretty good line of best fit, because it’s exactly horizontal or exactly vertical you can’t use one score to make a prediction about the other for any individual student.
00:11:16.030 --> 00:11:18.520
This means there is no correlation between the two.
00:11:19.600 --> 00:11:24.540
Correlation is all about the predictive power of one piece of data for another piece of data.
00:11:26.230 --> 00:11:30.640
Now correlation is also about association between data within a certain range.
00:11:31.210 --> 00:11:37.490
For example, one March I planted some sunflower seeds in my garden and I measured how tall the plants were every day.
00:11:38.090 --> 00:11:40.810
By the end of September, I’d gathered a lot of data.
00:11:41.080 --> 00:11:51.100
And there was a pretty strong positive correlation between the number of days that had passed since I planted the seeds and the height of my plants, which were about “twelve” feet tall by that stage.
00:11:51.810 --> 00:12:01.090
Now by extending that pattern, I confidently predicted that by the end of the following January my plants would be “twenty” feet tall, and I wondered if that would be a world record.
00:12:01.950 --> 00:12:03.170
Of course I was wrong.
00:12:03.420 --> 00:12:04.270
Autumn came.
00:12:04.460 --> 00:12:07.520
They stopped growing, they died, they fell over, and they rotted.
00:12:08.820 --> 00:12:21.680
Although the data that I gathered was very good at estimating how tall the plants would have been over the time that I was gathering the data in this region here, it turned out to be very bad at making predictions about the future.
00:12:23.310 --> 00:12:28.490
Using patterns to make estimations within the range of data you’ve collected is called interpolation.
00:12:28.780 --> 00:12:34.690
And this can be very reliable if the data has strong positive or strong negative correlation.
00:12:35.330 --> 00:12:42.580
But trying to use those patterns to make predictions about the future, or beyond the range of the data that you’ve collected, is called extrapolation.
00:12:42.770 --> 00:12:48.340
And it could be very unreliable even in data that was perfectly correlated within the data range that you gathered.
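The sunflower story above can be sketched with made-up numbers: a linear trend fitted over the growing season interpolates well but extrapolates badly.

```python
# Toy illustration of interpolation vs extrapolation (made-up numbers).
# Suppose growth looked linear over the observed days 0..180,
# reaching about 12 feet by the end of September.
slope, intercept = 12 / 180, 0.0

def predict_height(day):
    """Linear prediction from the observed trend (feet)."""
    return slope * day + intercept

print(round(predict_height(90), 1))   # interpolation, within the observed range
print(round(predict_height(300), 1))  # extrapolation: ~20 ft, but the plants had died
```

The second prediction is exactly the confident (and wrong) “twenty feet by January” estimate from the story: the line knows nothing about autumn.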
00:12:49.410 --> 00:12:59.270
Another thing, although we’ve been talking about correlation in this video, really — and as we mentioned this a couple of times — we mean linear correlation: how well the data fits a straight line pattern.
00:12:59.840 --> 00:13:04.570
Sometimes though the data doesn’t fit a straight line so well, but maybe it would fit a curve.
00:13:05.970 --> 00:13:12.080
Take this data about the number of visits to the UK between “nineteen seventy-eight” and “nineteen ninety-nine” for example.
00:13:12.490 --> 00:13:28.410
If we fit a linear pattern through the middle here, we can see that although it’s quite a good line of best fit, there’s a pattern emerging: at the ends the line is tending to underpredict the number of thousands of visits made each year, but in the middle it’s overpredicting.
00:13:28.890 --> 00:13:36.410
So although it looks like a reasonable line of best fit, there’s a pattern to the way in which it’s making errors in its predictions.
00:13:37.520 --> 00:13:43.590
If we fitted more of a curve like this, then there’s a mix of underestimates and overestimates moving along that line.
00:13:43.810 --> 00:13:48.880
So it’s a slightly better predictor of the number of visits based on what year it is.
00:13:49.670 --> 00:13:56.410
So although nonlinear correlation is beyond the scope of this video, we did just want you to be aware that it is something that does exist.
00:14:03.610 --> 00:14:13.010
So we’ve seen strong or weak positive or direct correlation: the closer the correlation coefficient is to “one”, the stronger the correlation.
00:14:14.410 --> 00:14:24.830
And we’ve seen strong or weak negative or inverse correlation: in this case, the closer the correlation coefficient is to “negative one”, the stronger the correlation.
00:14:26.140 --> 00:14:28.920
And we’ve seen examples of no correlation.
00:14:29.450 --> 00:14:37.260
Now this can happen if you’ve got a random splatter of points that looks like this, or if you’ve got a completely vertical or completely horizontal line of best fit.
00:14:38.320 --> 00:14:44.800
When the correlation coefficient is close to “zero”, knowing one piece of data doesn’t help you to predict what the other one would be.
00:14:45.080 --> 00:14:54.700
So for example, if we knew what a student’s math score was, it wouldn’t help us to predict what their English score was, because there’s a whole range of different values that it could’ve been.
00:14:56.180 --> 00:15:07.290
We’ve also seen that when we’ve got good strong correlation, interpolation, that is, making predictions of one piece of data based on the other within the range of data we’ve got, can be quite reliable.
00:15:08.320 --> 00:15:15.680
But trying to do extrapolation or make predictions beyond the data range that we’ve gathered can give us very bad results indeed.
00:15:17.050 --> 00:15:22.060
One last thing, correlation tells you about association, not necessarily causality.
00:15:22.850 --> 00:15:30.380
It could just be a coincidence that two sets of data correlate or maybe there’s some other underlying factor affecting both sets of data.
00:15:31.700 --> 00:15:45.130
For example, between “two thousand” and “two thousand and nine”, an analysis of the average amount of margarine consumed per person by people in the United States each year correlated very strongly with the divorce rate per thousand people in the state of Maine that year.
00:15:45.830 --> 00:15:47.410
That’s just a coincidence.
00:15:48.810 --> 00:15:55.540
How could the number of divorces in one particular state be affected by how much margarine was being consumed elsewhere in the country?
00:15:56.950 --> 00:16:03.240
There’s also a very weak negative correlation between how yellow people’s teeth are and how long they live.
00:16:04.480 --> 00:16:06.590
Now there’s no causal link between the two.
00:16:06.930 --> 00:16:12.340
But shorter lifespans and having yellow teeth are both caused by smoking tobacco.
00:16:12.720 --> 00:16:19.440
So perhaps that underlying factor is what’s causing the apparent weak correlation between those two other pieces of data.