WEBVTT
00:00:00.600 --> 00:00:10.360
In this video, we’re gonna look at the least squares regression method of finding the equation of a line of best fit through points on a scatter plot.
00:00:11.080 --> 00:00:19.880
We’ll also talk a little bit about the theory behind the method and how to use and interpret the regression equation once you’ve worked out what it is.
00:00:20.360 --> 00:00:27.360
Let’s look at an example of a situation where you might want to calculate the equation of the least squares regression line.
00:00:28.080 --> 00:00:32.760
Some students did an experiment in which they hung objects of various masses from a spring.
00:00:33.360 --> 00:00:35.240
And they measured the length of the spring in each case.
00:00:35.800 --> 00:00:39.440
Find the equation of the least squares regression line.
00:00:40.440 --> 00:00:48.360
Well we can see, for example, when they hung a mass of ten kilograms from the spring, it had a length of twelve centimetres.
00:00:48.920 --> 00:00:55.760
When they hung twenty kilograms, it was sixteen centimetres and so on.
00:00:56.440 --> 00:00:58.680
So this is bivariate data.
00:00:59.120 --> 00:01:00.920
There are two variables: mass and length.
00:01:01.480 --> 00:01:05.360
And each pair of pieces of data relates to a specific event.
00:01:06.080 --> 00:01:12.640
So placing a mass of twenty kilograms on the spring extends it to a length of sixteen centimetres.
00:01:13.560 --> 00:01:19.120
Now the first thing that we need to think about is which is the dependent variable.
00:01:19.480 --> 00:01:21.880
And which is the independent variable.
00:01:22.760 --> 00:01:27.680
Now the independent variable is the one that you’d normally control or change.
00:01:28.320 --> 00:01:33.480
And it causes changes in the value of the other variable, the dependent variable.
00:01:34.000 --> 00:01:37.800
So the dependent variable depends on the value of the other variable.
00:01:38.160 --> 00:01:41.160
Now we’re choosing which masses to put on the spring.
00:01:41.880 --> 00:01:45.000
And that’s causing the change in the length of the spring.
00:01:45.720 --> 00:01:48.000
So length is the dependent variable.
00:01:48.360 --> 00:01:50.560
And the mass is the independent variable.
00:01:51.280 --> 00:01:56.080
Now we tend to call the independent variable 𝑥 and the dependent variable 𝑦.
00:01:56.720 --> 00:01:59.160
So let’s add those letters to our table.
00:02:00.040 --> 00:02:03.360
Now we can plot these points on a scatter plot.
00:02:04.080 --> 00:02:07.800
And it looks like we’ve got very strong positive correlation between the two variables.
00:02:08.560 --> 00:02:16.520
In fact, when you calculate the Pearson’s product moment correlation coefficient, it comes out to be 0.99 to two decimal places.
00:02:17.360 --> 00:02:20.040
So yep, indeed that is very strong positive correlation.
00:02:20.800 --> 00:02:24.040
So how do we go about drawing a suitable line of best fit?
00:02:24.760 --> 00:02:32.360
Well we could just try laying down our ruler in various positions until it looks like the points are generally as close as they can be to the line.
00:02:33.000 --> 00:02:37.640
Then we can read off a couple of pairs of coordinates from the line and work out its equation.
00:02:38.240 --> 00:02:42.760
However, luckily there’s a more methodical and consistent way of going about it.
00:02:43.160 --> 00:02:47.160
Calculating the equation of the least squares regression line.
00:02:47.920 --> 00:02:55.120
And the way this works is to specify a line which minimizes the sum of the squares of the residuals of each of the points.
00:02:55.800 --> 00:03:05.120
So, for example, if we call that point one, this distance here, this vertical distance between the point and the line, is called the residual.
00:03:05.640 --> 00:03:17.200
So if the equation of our line of best fit was 𝑦 equals 𝑎 plus 𝑏𝑥, we could plug 𝑥-coordinates in there and make predictions about what the corresponding 𝑦-coordinates would be.
00:03:18.000 --> 00:03:27.240
So the residual of each point is the difference between that prediction, our estimate, and the actual value that we observed when we did the experiment.
00:03:27.960 --> 00:03:31.840
So for these points here, our estimates were underestimates of the actual values.
00:03:32.240 --> 00:03:38.200
And for these points here, our estimates were overestimates of the actual values that we observed.
00:03:38.840 --> 00:03:44.600
So we could think of some of the residuals as being positive and some of the residuals as being negative.
00:03:45.320 --> 00:03:52.520
So if we tried out lots and lots of different lines of best fit, and for each one added up all the residuals.
00:03:53.080 --> 00:04:05.200
Then you might expect that lines with a sum of residuals close to zero would be better lines of best fit than lines with larger sums of residuals.
00:04:05.760 --> 00:04:10.320
But the thing is some spectacularly bad lines of best fit might be drawn.
00:04:10.920 --> 00:04:18.880
So that all the positive residuals exactly balance all of the negative residuals and give us a sum of zero.
00:04:19.720 --> 00:04:29.880
So to get around this potential problem, the least squares regression model takes the squares of all the residuals, so that the results are all positive.
00:04:30.440 --> 00:04:35.760
And then it finds the line that minimizes the sum of the squares of all these residuals.
00:04:36.240 --> 00:04:40.960
Hence the name least squares regression.
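As a rough sketch of why squaring helps, here is a small Python example. The three points are made up purely for illustration (they are not the spring data), and the helper names `residuals` and `sum_of_squares` are assumptions of this sketch. It takes the usual convention that a residual is the observed value minus the predicted value.

```python
# Made-up points for illustration only (NOT the spring data from the video).
points = [(1, 2), (2, 4), (3, 6)]

def residuals(a, b, data):
    """Residual = observed y minus the prediction a + b*x, for each point."""
    return [y - (a + b * x) for x, y in data]

def sum_of_squares(values):
    """Add up the squares of the values, so signs cannot cancel."""
    return sum(v * v for v in values)

good = residuals(0, 2, points)   # line y = 2x passes through every point
bad = residuals(8, -2, points)   # line y = 8 - 2x slopes the wrong way

print(sum(good), sum_of_squares(good))  # 0 0
print(sum(bad), sum_of_squares(bad))    # 0 32
```

Both lines have residuals that sum to zero, but only the sum of squares shows that the second line is a poor fit.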
00:04:41.640 --> 00:04:45.160
Now we’re not gonna go into detail here about how to derive the exact formula.
00:04:45.560 --> 00:04:47.920
But we are gonna talk about how to use it.
00:04:48.560 --> 00:05:04.440
The formula then for the least squares regression line is 𝑦 is equal to 𝑎 plus 𝑏𝑥, where 𝑏 is 𝑆𝑥𝑦 over 𝑆𝑥𝑥, where 𝑆𝑥𝑦 is the covariance of 𝑥 and 𝑦 multiplied by 𝑛.
00:05:05.080 --> 00:05:09.080
And 𝑆𝑥𝑥 is the variance of 𝑥 multiplied by 𝑛.
00:05:09.800 --> 00:05:16.600
Also the value of 𝑎 is equal to the mean 𝑦-value minus 𝑏 times the mean 𝑥-value.
00:05:17.560 --> 00:05:26.640
Now remember, to work out the value of 𝑆𝑥𝑦, we have to multiply each pair of 𝑥- and 𝑦-coordinates together and then add up all of those products.
00:05:26.960 --> 00:05:32.800
And then we need to take the sum of all the 𝑥s and times that by the sum of all the 𝑦s.
00:05:33.400 --> 00:05:36.480
And then take that result, divide it by the number of pieces of data that we’ve got, and subtract it from the sum of the products.
00:05:37.320 --> 00:05:41.800
And to work out the value of 𝑆𝑥𝑥, we square each of the pieces of 𝑥-data.
00:05:42.400 --> 00:05:44.520
And then we add up all those squares.
00:05:44.920 --> 00:05:51.440
And then from that, we take away the square of the sum of all the 𝑥-values divided by the number of pieces of data that we’ve got.
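Written out in symbols, the formulae just described look like this. This is a restatement of the spoken description, with the bar notation for the means being the only assumption:

```latex
\begin{align*}
y &= a + bx, \qquad b = \frac{S_{xy}}{S_{xx}}, \qquad a = \bar{y} - b\,\bar{x} \\
S_{xy} &= \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n} \\
S_{xx} &= \sum x^2 - \frac{\left(\sum x\right)^2}{n}
\end{align*}
```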
00:05:52.120 --> 00:05:55.440
Okay it all looks pretty horrible on paper at the moment.
00:05:55.960 --> 00:06:00.200
But when we go through the example, I’m sure you’ll find it much easier.
00:06:01.240 --> 00:06:04.320
Firstly, we take the table of values that we were given in the question.
00:06:04.760 --> 00:06:07.440
And then we need to extend it a bit.
00:06:07.920 --> 00:06:12.320
We need to create a couple of extra columns and an extra row at the bottom.
00:06:13.080 --> 00:06:20.480
So firstly, the columns of the 𝑥 squared values, so that’s each individual 𝑥-value squared, and then the 𝑥𝑦 values.
00:06:21.240 --> 00:06:25.720
So that’s for each data pair, we take the 𝑥-value and multiply it by the 𝑦-value.
00:06:26.360 --> 00:06:30.120
And then we create a row at the bottom for all of our totals.
00:06:30.960 --> 00:06:34.640
So first of all, let’s add up all the 𝑥s and then add up all the 𝑦s.
00:06:35.520 --> 00:06:41.440
Well ten plus twenty plus thirty plus forty plus fifty is a hundred and fifty.
00:06:42.200 --> 00:06:46.400
And if I add up all five 𝑦-values, I get an answer of 120.
00:06:47.360 --> 00:06:50.160
Now, I’ve got five pairs of data.
00:06:51.000 --> 00:06:53.080
So that’s 𝑛 equals five.
00:06:53.520 --> 00:06:57.560
And squaring each 𝑥-value, 10 squared is 100.
00:06:58.000 --> 00:07:01.320
20 squared is 400 and so on.
00:07:02.000 --> 00:07:06.040
Then adding them all up, I get 5500.
00:07:06.840 --> 00:07:08.960
Now I’m gonna do 𝑥 times 𝑦.
00:07:09.520 --> 00:07:12.080
So 10 times 12 is 120.
00:07:12.680 --> 00:07:16.840
20 times 16 is 320 and so on.
00:07:17.320 --> 00:07:22.960
And then if I add all of those up, I get 4230.
00:07:23.720 --> 00:07:26.240
Now I can calculate the individual values.
00:07:26.840 --> 00:07:37.360
So 𝑆𝑥𝑦, remember, was the sum of the 𝑥 times 𝑦s minus the sum of the 𝑥s times the sum of the 𝑦s divided by the number of pieces of data.
00:07:37.920 --> 00:07:42.560
Well the sum of the 𝑥𝑦s is 4230.
00:07:43.160 --> 00:07:46.080
Sum of the 𝑥s is 150.
00:07:46.640 --> 00:07:49.400
The sum of the 𝑦s is 120.
00:07:50.120 --> 00:07:51.800
And I’ve got five pieces of data.
00:07:52.560 --> 00:07:56.720
So popping that into my calculator, I get 630.
00:07:57.880 --> 00:08:02.200
And 𝑆𝑥𝑥, I’ve gotta sum the 𝑥 squared column.
00:08:02.600 --> 00:08:09.560
And I’ve gotta sum the 𝑥 column, square that value, divide by the number of pieces of data, and take that away from the first sum.
00:08:10.160 --> 00:08:14.160
Well the total of all the 𝑥-squares added together is 5500.
00:08:14.640 --> 00:08:17.640
And the sum of the 𝑥s is 150.
00:08:18.360 --> 00:08:25.200
So I’ve gotta square 150 and divide by the number of pieces of data, five.
00:08:25.840 --> 00:08:29.720
And when I pop that into my calculator, I get 1000.
00:08:30.480 --> 00:08:38.120
So just making a note of those over on the left while I make some space to do some more calculations on the right.
00:08:38.920 --> 00:08:44.160
The 𝑏-value in the equation of my straight line is equal to 𝑆𝑥𝑦 over 𝑆𝑥𝑥.
00:08:44.800 --> 00:08:51.000
So that’s 630 divided by 1000, which is 0.63.
00:08:51.760 --> 00:08:54.640
Now to calculate the 𝑎-value is a little bit trickier.
00:08:55.040 --> 00:09:06.840
I need to work out the mean 𝑦-coordinate and the mean 𝑥-coordinate and then also take into account the answer I got for 𝑏 in that first part.
00:09:07.440 --> 00:09:13.160
So to work out the mean 𝑦-value, I just have to add up all the 𝑦s and divide by how many there are.
00:09:13.640 --> 00:09:18.840
So that’s 120 divided by five, which is 24.
00:09:19.560 --> 00:09:25.800
And same process again for the 𝑥s, just add up all the 𝑥-values and divide by how many there are.
00:09:26.320 --> 00:09:31.440
So that’s 150 divided by five, which is thirty.
00:09:32.200 --> 00:09:35.840
So the mean 𝑥-value is thirty.
00:09:36.320 --> 00:09:41.160
The mean 𝑦-value is 24.
00:09:41.400 --> 00:09:51.760
So 𝑎 then is the mean 𝑦-value minus 𝑏 times the mean 𝑥-value, which is 24 minus 0.63 times 30.
00:09:52.480 --> 00:09:54.800
And that gives us an answer of 5.1.
00:09:55.640 --> 00:10:02.480
So again just making a note of those values so I can carry on on the right hand side with more working out.
00:10:03.000 --> 00:10:07.680
The equation of our line of best fit is 𝑦 is equal to 𝑎 plus 𝑏𝑥.
00:10:07.680 --> 00:10:17.560
So our least squares regression line is 𝑦 is equal to 5.1 plus 0.63 times the 𝑥-coordinate.
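If you’d rather check the arithmetic in code, here is a minimal Python sketch that works from the totals row of the table. The function name `regression_from_totals` and its parameter names are assumptions made for this sketch, not anything shown in the video.

```python
def regression_from_totals(n, sum_x, sum_y, sum_x2, sum_xy):
    """Return (a, b) for the line y = a + b*x, given the column totals."""
    s_xy = sum_xy - sum_x * sum_y / n   # sum of xy minus (sum x)(sum y)/n
    s_xx = sum_x2 - sum_x ** 2 / n      # sum of x squared minus (sum x)^2/n
    b = s_xy / s_xx                     # gradient of the line
    a = sum_y / n - b * (sum_x / n)     # mean y minus b times mean x
    return a, b

# Totals read off the bottom row of the spring table.
a, b = regression_from_totals(n=5, sum_x=150, sum_y=120, sum_x2=5500, sum_xy=4230)
print(round(a, 2), round(b, 2))  # 5.1 0.63
```

The results are rounded before printing because floating-point arithmetic can leave tiny errors in the last decimal places.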
00:10:18.480 --> 00:10:19.600
Well, that’s great.
00:10:20.040 --> 00:10:27.240
So we’ve now got an equation that enables us to make predictions about the length of the spring given different masses that were hanging from it.
00:10:28.160 --> 00:10:38.880
So, for example, if we were gonna hang a mass of 37 kilograms from the spring, we just put an 𝑥-value of 37 into that equation.
00:10:39.560 --> 00:10:44.840
And we’d make our prediction that 𝑦 is equal to 28.41 centimetres.
00:10:45.600 --> 00:10:50.360
That’s how long we’d expect the spring to be with that mass hanging from it.
00:10:51.160 --> 00:11:05.320
Now because our scatter plot showed that we had very strong positive correlation, we’d expect that equation to make pretty reasonable estimates of 𝑦-values given certain 𝑥-values.
00:11:06.240 --> 00:11:14.000
Well, that is, we’d expect them to be good estimates, if we use 𝑥-values between about 10 and 50.
00:11:14.600 --> 00:11:18.400
In other words, if we use the equation to interpolate the 𝑦-values.
00:11:19.160 --> 00:11:22.080
Now we gathered data, 𝑥-data, in that range.
00:11:22.640 --> 00:11:27.440
We don’t know if that same equation will be true outside that range.
00:11:28.120 --> 00:11:36.280
For example, if we put a mass of 60 or 70 or 80 kilograms on the spring, it might snap altogether.
00:11:36.680 --> 00:11:38.480
So our equation just simply wouldn’t work.
00:11:39.000 --> 00:11:49.440
So using the equation to make predictions of 𝑦-values based on 𝑥-values in the range that we’ve gathered data for is called interpolation.
00:11:50.120 --> 00:11:59.200
But extending beyond that range and making predictions with 𝑥-values less than 10 or greater than 50 is called extrapolation.
00:11:59.880 --> 00:12:04.880
And as we said, extrapolating is generally a bad idea.
00:12:05.360 --> 00:12:11.920
Cause we’re just not gonna be so confident that the rules still apply for those data values.
00:12:12.560 --> 00:12:14.600
And our equation may not hold.
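One way to keep yourself honest about this is to have the prediction report whether it was an extrapolation. The Python sketch below assumes the spring equation 𝑦 = 5.1 + 0.63𝑥 and the data range of 10 to 50 kilograms from this example. The function name `predict_length` and its parameters are hypothetical.

```python
def predict_length(mass_kg, a=5.1, b=0.63, x_min=10, x_max=50):
    """Return (predicted length in cm, True if the mass is outside the data range)."""
    extrapolated = not (x_min <= mass_kg <= x_max)
    return a + b * mass_kg, extrapolated

for mass in (37, 60):
    length, extrapolated = predict_length(mass)
    label = "extrapolation, treat with caution" if extrapolated else "interpolation"
    print(f"{mass} kg -> {length:.2f} cm ({label})")
```

The flag does not make an extrapolated value any more reliable. It only reminds you that the model was never checked against data in that range.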
00:12:15.640 --> 00:12:22.800
Now we could use the equation to make a prediction about the length of the spring without any weight hanging on it at all.
00:12:23.440 --> 00:12:25.000
So we put an 𝑥-value of zero.
00:12:25.480 --> 00:12:32.320
And we’d have the equation 𝑦 equals 5.1 plus 0.63 times zero.
00:12:33.160 --> 00:12:39.560
So the length of the spring with no weights added to it would be 5.1 centimetres.
00:12:40.360 --> 00:12:45.040
Now just quickly, I’m gonna go back up here and rub this out and change it to a plus.
00:12:45.680 --> 00:12:47.440
Obviously, I made a bit of a mistake there.
00:12:48.000 --> 00:12:49.120
So apologies for that.
00:12:49.800 --> 00:12:59.240
So going back to our question here with a mass of zero kilograms, we get a length of spring of 5.1 centimetres.
00:13:00.040 --> 00:13:02.320
So this is telling us the starting conditions for the problem, if you like.
00:13:02.880 --> 00:13:08.920
With no masses added, the spring will be 5.1 centimetres.
00:13:09.720 --> 00:13:13.480
Now I think you can spot the potential problem with this.
00:13:14.200 --> 00:13:24.040
Because we only gathered data with 𝑥-values from 10 to 50 kilograms, we’re extrapolating the equation back to zero here.
00:13:24.480 --> 00:13:26.360
So that might not necessarily be true.
00:13:26.760 --> 00:13:27.720
It might be true.
00:13:28.000 --> 00:13:33.280
But we don’t have 100 percent confidence that the equation will still hold for those 𝑥-values.
00:13:34.040 --> 00:13:38.520
Now we can also interpret the parameters in that regression equation.
00:13:39.080 --> 00:13:53.160
That coefficient of 𝑥, the multiple of 𝑥 there, 0.63, means that every time I add one more kilogram, so I increase 𝑥 by one.
00:13:53.720 --> 00:13:56.520
Then the spring stretches by 0.63 centimetres.
00:13:56.960 --> 00:14:05.400
And as we’ve just seen, that number there, the 5.1 on its own, tells us that when I have an 𝑥-value of zero, 𝑦 is equal to 5.1.
00:14:06.000 --> 00:14:12.520
So when no mass is added to the spring, its length would be 5.1 centimetres.
00:14:13.320 --> 00:14:19.920
Now this method of least squares regression analysis seems like magic.
00:14:20.720 --> 00:14:23.240
Simply process your data.
00:14:23.760 --> 00:14:31.840
And you get an easy-to-use equation to make predictions of one value from the other, brilliant!
00:14:32.640 --> 00:14:41.320
But remember, you also need to consider the strength of the correlation before using your least squares regression line to make predictions.
00:14:41.840 --> 00:14:51.480
If there’s little or no correlation, then the equation is gonna give you very unreliable predictions or estimates.
00:14:52.120 --> 00:14:56.920
You’d also need to consider the amount of data that you used to build the model.
00:14:57.280 --> 00:15:03.440
The more data you have, generally, the more reliable and more realistic that model will be.
00:15:04.360 --> 00:15:07.080
And also remember, don’t extrapolate.
00:15:07.560 --> 00:15:10.120
Interpolation is quite good.
00:15:10.840 --> 00:15:16.360
If the correlation’s quite good, then interpolated values will be quite good predictions.
00:15:17.040 --> 00:15:22.200
Extrapolated values, you really don’t know how reliable they’re going to be.
00:15:23.360 --> 00:15:26.600
Okay, here’s one for you to try.
00:15:27.720 --> 00:15:32.280
Find the least squares regression equation for the following data.
00:15:32.720 --> 00:15:38.440
And use it to estimate the value of 𝑦 when 𝑥 equals nine, then comment on your result.
00:15:39.600 --> 00:15:41.760
So we’ve got some data here for 𝑥 and 𝑦.
00:15:42.320 --> 00:15:44.960
When 𝑥 is one, 𝑦 is twelve.
00:15:45.440 --> 00:15:48.800
When 𝑥 is two, 𝑦 is seven and so on.
00:15:49.520 --> 00:15:52.520
And we’ve given you the formulae down at the bottom there for you to use.
00:15:53.040 --> 00:15:58.920
So press pause and then come back when you’ve answered the question.
00:15:59.520 --> 00:16:06.040
And I’ll go through the answers. Right, first we need to add two columns, the 𝑥 squareds and the 𝑥𝑦s.
00:16:06.920 --> 00:16:08.320
Now we’re gonna fill those in.
00:16:08.840 --> 00:16:10.480
One squared is one.
00:16:10.880 --> 00:16:13.520
Two squared is four and so on.
00:16:14.280 --> 00:16:18.960
And now, the 𝑥𝑦s, one times twelve is twelve.
00:16:19.440 --> 00:16:22.720
Two times seven is fourteen and so on.
00:16:23.480 --> 00:16:26.400
Now we’ll add a row at the bottom for all the totals.
00:16:26.960 --> 00:16:30.920
Now if I add up all the 𝑥-values, I get fifteen.
00:16:31.480 --> 00:16:35.960
Adding up all the 𝑦-values gives me a total of 37.
00:16:36.640 --> 00:16:41.400
Adding up all the 𝑥 squared values gives me a total of 55.
00:16:41.920 --> 00:16:47.760
And adding up all the products of 𝑥 and 𝑦, I get a total of 93.
00:16:48.560 --> 00:16:53.520
And because I’ve got five sets of data, 𝑛 is equal to five.
00:16:54.040 --> 00:17:02.480
So 𝑆𝑥𝑦 is the sum of the 𝑥𝑦s minus the sum of the 𝑥s times the sum of the 𝑦s all over 𝑛.
00:17:03.320 --> 00:17:11.560
So that’s 93 minus 15 times 37 all over five, which is negative 18.
00:17:12.240 --> 00:17:19.920
And the 𝑆𝑥𝑥 value is the sum of the 𝑥 squareds minus the sum of the 𝑥s all squared divided by 𝑛.
00:17:20.680 --> 00:17:23.520
Well, the sum of the 𝑥 squareds is 55.
00:17:24.120 --> 00:17:26.400
The sum of the 𝑥s is 15.
00:17:26.960 --> 00:17:28.160
And 𝑛 is five.
00:17:28.920 --> 00:17:35.080
So that becomes 55 minus 15 squared over five, which is equal to 10.
00:17:35.720 --> 00:17:41.240
So working out the values of the parameters for our equation of our straight line, 𝑦 equals 𝑎 plus 𝑏𝑥.
00:17:41.920 --> 00:17:46.560
The 𝑏-value is 𝑆𝑥𝑦 divided by 𝑆𝑥𝑥.
00:17:47.400 --> 00:17:53.440
Well that was negative 18 divided by 10, which is negative 1.8.
00:17:54.200 --> 00:17:59.120
And the mean 𝑦-value was just the sum of all the 𝑦s divided by how many there are.
00:17:59.520 --> 00:18:05.240
That’s 37 divided by five, which is 7.4.
00:18:05.960 --> 00:18:11.320
And the mean 𝑥-value is the sum of all the 𝑥-values divided by how many there are.
00:18:12.000 --> 00:18:17.080
So that’s 15 divided by five, which is three.
00:18:17.520 --> 00:18:23.440
So the 𝑎-value then is the mean 𝑦 minus 𝑏 times the mean 𝑥.
00:18:24.040 --> 00:18:29.840
Now because the 𝑏-value is negative 1.8, we gotta be quite careful with our negative signs here.
00:18:30.280 --> 00:18:38.320
So that’s 7.4 minus negative 1.8 times three, which is equal to 12.8.
00:18:39.240 --> 00:18:44.280
So the equation of our least squares regression line, 𝑦 equals 𝑎 plus 𝑏𝑥.
00:18:44.920 --> 00:18:48.640
All we need to do then is substitute in our values for 𝑎 and 𝑏.
00:18:49.280 --> 00:18:50.920
So that’s the equation.
00:18:51.440 --> 00:18:56.400
𝑦 equals 12.8 minus 1.8𝑥.
00:18:57.240 --> 00:19:02.040
And now we have to substitute in 𝑥 equals nine to make a prediction of the corresponding 𝑦-value.
00:19:02.720 --> 00:19:11.320
So 𝑦 would be equal to 12.8 minus 1.8 times nine, which would be negative 3.4.
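For reference, the working for this question can be laid out in one place as follows. It uses only the totals and results quoted above, with the bar notation for the means as an assumed shorthand:

```latex
\begin{align*}
S_{xy} &= 93 - \frac{15 \times 37}{5} = 93 - 111 = -18 \\
S_{xx} &= 55 - \frac{15^2}{5} = 55 - 45 = 10 \\
b &= \frac{S_{xy}}{S_{xx}} = \frac{-18}{10} = -1.8 \\
a &= \bar{y} - b\,\bar{x} = 7.4 - (-1.8)(3) = 12.8 \\
y &= 12.8 - 1.8 \times 9 = -3.4 \quad \text{when } x = 9
\end{align*}
```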
00:19:12.200 --> 00:19:17.120
Now commenting on the result, there are a couple of things I wanna say.
00:19:17.520 --> 00:19:20.320
One is we’ve extrapolated.
00:19:20.800 --> 00:19:25.920
Look, the 𝑥-values that we gathered in terms of our data were from one to five.
00:19:26.320 --> 00:19:29.040
Well we’ve used an 𝑥-value of nine.
00:19:29.640 --> 00:19:32.720
So we’ve extrapolated.
00:19:33.400 --> 00:19:37.360
So we don’t necessarily know how reliable that answer’s gonna be.
00:19:38.160 --> 00:19:43.120
And the other thing I would say is we don’t know how good the correlation was.
00:19:43.600 --> 00:19:48.720
We don’t know the Pearson’s correlation coefficient or any other correlation coefficient for that matter.
00:19:49.320 --> 00:19:55.160
So even if we had interpolated our value, we still wouldn’t really know how reliable that answer would be.
00:19:55.800 --> 00:20:02.040
But the main point to make is that it was an extrapolated value.
00:20:02.520 --> 00:20:04.280
So we do need to be cautious about it.
00:20:05.000 --> 00:20:17.680
So in summary, we can work out the equation of our least squares regression line 𝑦 equals 𝑎 plus 𝑏𝑥 by using 𝑏 is equal to 𝑆𝑥𝑦 over 𝑆𝑥𝑥.
00:20:18.480 --> 00:20:23.960
And 𝑎 is equal to the mean 𝑦-value minus 𝑏 times the mean 𝑥-value.
00:20:24.680 --> 00:20:34.480
So 𝑆𝑥𝑦, remember, is the sum of the 𝑥 times 𝑦 answers minus the sum of the 𝑥s times the sum of the 𝑦s over how many pieces of data we’ve got.
00:20:35.240 --> 00:20:45.520
The 𝑆𝑥𝑥 value is the sum of the 𝑥 squared values minus the sum of the 𝑥-values all squared divided by the number of pieces of data you’ve got.
00:20:46.280 --> 00:20:52.280
And you know how to work out the mean 𝑦-value and the mean 𝑥-value.
00:20:52.920 --> 00:20:57.880
You just add them up and divide by how many you’ve got.
00:20:58.640 --> 00:21:01.680
And finally beware of extrapolation.