WEBVTT
00:00:01.580 --> 00:00:08.660
In this video, we talk briefly about what standard deviation is, a measure of the variability of a set of data.
00:00:09.180 --> 00:00:14.440
And then we move on to see how to calculate the standard deviation from a list of data values.
00:00:16.760 --> 00:00:22.460
Here, we’ve got three sets of test results, scores out of ten for three different groups of nine students.
00:00:22.750 --> 00:00:25.670
How can we describe the differences in the results between the groups?
00:00:26.060 --> 00:00:27.040
Which group did best?
00:00:27.070 --> 00:00:28.030
Which did worst?
00:00:28.240 --> 00:00:29.500
What are the differences?
00:00:31.040 --> 00:00:37.120
Well first, we could work out the mean score for each group; add up the scores and divide by how many there are.
00:00:38.400 --> 00:00:49.760
Well, if we say 𝑥 is the value of a score of an individual, this Σ, this funny sign here in front of the 𝑥, means add up all the 𝑥-values: add up all the scores for a particular group.
00:00:50.130 --> 00:00:55.100
And then 𝑛 is simply the number of people in each group, and that’s nine students in each case.
00:00:56.390 --> 00:01:04.600
Now it turns out with this particular set of data, whether I go for group 𝐴 or 𝐵 or 𝐶, if I add up all their scores, I get the same answer, forty-five.
00:01:05.690 --> 00:01:12.530
So in all three cases, my mean score is gonna be forty-five divided by nine, which is five.
00:01:15.280 --> 00:01:18.960
So my mean score in all three groups is five.
00:01:19.300 --> 00:01:21.250
So on average, they all did the same.
00:01:22.760 --> 00:01:25.410
Okay, let’s work out the median score for each group.
00:01:25.740 --> 00:01:31.180
And to do that, we need to organize them in order from smallest to largest in each group and look at the middle score.
00:01:32.120 --> 00:01:37.540
The middle person in each case has got four students to the left and four students to the right.
00:01:37.800 --> 00:01:41.700
And again in all three cases the median score here is five.
00:01:42.730 --> 00:01:46.710
So again we’re saying, on average, all three groups did the same.
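The comparison so far can be checked with a short Python sketch. The score lists below are an assumption reconstructed from the spoken description (group 𝐴: every score from one to nine once; group 𝐵: all fives; group 𝐶: a one, seven fives, and a nine), since the actual table is only shown on screen.

```python
from statistics import mean, median

# Assumed scores, reconstructed from the description in the video.
groups = {
    "A": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "B": [5, 5, 5, 5, 5, 5, 5, 5, 5],
    "C": [1, 5, 5, 5, 5, 5, 5, 5, 9],
}

# For each group: total of the scores, mean (total / n), and median (middle value).
summary = {name: (sum(s), mean(s), median(s)) for name, s in groups.items()}

for name, (total, avg, mid) in summary.items():
    print(f"Group {name}: total={total}, mean={avg}, median={mid}")
```

All three groups come out with a total of 45, a mean of 5, and a median of 5, which is why these averages cannot tell the groups apart.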
00:01:48.330 --> 00:01:52.240
Right, well now we can start to look at the variability of the data.
00:01:53.420 --> 00:01:57.410
Well in group 𝐵, everybody scored exactly five.
00:01:57.720 --> 00:01:59.830
That’s a very consistent performance across the group.
00:01:59.830 --> 00:02:02.930
There was zero variability in their scores.
00:02:03.850 --> 00:02:11.590
But in groups 𝐴 and 𝐶, there was some variation between individuals.
00:02:12.290 --> 00:02:14.980
So which group’s scores vary the most?
00:02:16.060 --> 00:02:22.520
Well, we could calculate the range of scores in each group, and that’s the difference between the highest and the lowest scores.
00:02:22.830 --> 00:02:27.970
So in groups 𝐴 and 𝐶, that would again be the same, nine take away one.
00:02:28.020 --> 00:02:29.590
That’s eight in each case.
00:02:30.300 --> 00:02:37.930
So that very simple measure of variability, the range, says there’s the same amount of variability in groups 𝐴 and 𝐶.
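As a quick illustration, the range calculation might look like this in Python. The score lists are assumed from the description in the video, and `score_range` is a hypothetical helper name.

```python
# Assumed scores for groups A and C, reconstructed from the description.
group_a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
group_c = [1, 5, 5, 5, 5, 5, 5, 5, 9]

def score_range(scores):
    """Range = highest score minus lowest score."""
    return max(scores) - min(scores)

print(score_range(group_a))  # 9 - 1 = 8
print(score_range(group_c))  # 9 - 1 = 8
```

The range only looks at the two extreme scores, so it cannot see how the other seven scores are spread out.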
00:02:39.230 --> 00:02:49.040
But it looks like there’s more variability in group 𝐴, where every single person got a different score, because in group 𝐶 all but two people got the same score of five out of ten.
00:02:49.200 --> 00:02:52.160
How can we capture that difference between the two groups?
00:02:53.820 --> 00:03:02.560
Well, one way would be to calculate how much each individual’s score varied from the mean score for that group, and then work out the mean of those.
00:03:03.770 --> 00:03:08.390
But the problem with that approach is that some people scored more than the mean and some people scored less than the mean.
00:03:08.790 --> 00:03:12.630
So some of the variations are going to be positive and some are gonna be negative.
00:03:12.960 --> 00:03:19.240
And then when you add them all up to calculate the mean variation from the mean, some or all of them could cancel each other out.
00:03:20.170 --> 00:03:23.280
For example, in group 𝐴 these are the scores they got.
00:03:23.890 --> 00:03:29.560
And we said that the mean score was five, so we can work out the difference between each person’s score and the mean.
00:03:30.570 --> 00:03:33.750
For example, the person who only scored one; that’s four below the mean.
00:03:33.980 --> 00:03:36.790
The person who scored five; that’s a difference of zero from the mean.
00:03:37.070 --> 00:03:39.560
And the person who scored nine was four above the mean.
00:03:40.600 --> 00:03:48.170
Now, if I want to work out that mean difference from the mean, I’ve got to add up all those differences and then divide by the number of scores there were, nine.
00:03:49.440 --> 00:03:57.670
So when I try to calculate the mean deviation from the mean, I need to add up all those differences and then divide by how many scores there were; that’s nine.
00:03:58.110 --> 00:04:02.510
But when I do that, the sum of the differences from the mean is equal to zero.
00:04:03.010 --> 00:04:11.080
Negative four add four, negative three add three, negative two add two, negative one add one plus zero is equal to zero.
00:04:12.300 --> 00:04:13.740
So this method falls down.
00:04:13.740 --> 00:04:18.410
It’s saying that the total variability, or the mean variability, for this set of data is zero.
00:04:18.690 --> 00:04:23.380
But in fact, every single person got a different score, so there was lots of variation in the scores.
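The cancelling effect is easy to see in a small Python sketch, assuming group 𝐴 holds the scores one to nine as described.

```python
# Assumed scores for group A: each score from 1 to 9 exactly once.
group_a = [1, 2, 3, 4, 5, 6, 7, 8, 9]

mean_a = sum(group_a) / len(group_a)            # 45 / 9 = 5.0
deviations = [x - mean_a for x in group_a]      # -4.0, -3.0, ..., 3.0, 4.0
mean_deviation = sum(deviations) / len(deviations)

print(deviations)
print(mean_deviation)  # positives and negatives cancel, giving 0.0
```

This is not special to group 𝐴: the signed deviations from the mean add up to zero for any data set, which is why they have to be squared before averaging.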
00:04:24.850 --> 00:04:38.390
So to stop that sort of thing happening and all those variations cancelling each other out, because it doesn’t really matter whether it’s a positive or a negative variation from the mean so long as it’s a variation from the mean, we in fact square all of those differences.
00:04:39.620 --> 00:04:48.600
So we’re now going to add up all of these 𝑥 minus 𝑥 bar squares and then divide by the number of people we’ve got, so 𝑛 is equal to nine.
00:04:49.620 --> 00:04:56.800
And when I add all those squared differences together, sixteen plus nine plus four plus one plus zero plus one plus four plus nine plus sixteen, I get sixty.
00:04:57.710 --> 00:05:03.700
So this measure of variability gives us an answer of sixty divided by nine, which is six and two-thirds.
00:05:04.620 --> 00:05:08.420
Now, we call this number the variance of the data set.
00:05:08.870 --> 00:05:14.540
But it’s the mean of the squares of the deviations from the mean, so the result is in square units.
00:05:15.480 --> 00:05:20.490
So we usually take the square root of that result and call it the standard deviation.
00:05:21.430 --> 00:05:24.870
So for group 𝐴, the variance was six and two-thirds.
00:05:25.730 --> 00:05:31.740
And the standard deviation is the square root of that, which is about two point five eight to two decimal places.
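Here is a minimal Python sketch of the calculation for group 𝐴, again assuming the scores are one to nine. The variable names are illustrative only.

```python
from math import sqrt

# Assumed scores for group A: each score from 1 to 9 exactly once.
group_a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
n = len(group_a)

mean_a = sum(group_a) / n                                  # 5.0
squared_deviations = [(x - mean_a) ** 2 for x in group_a]  # 16, 9, 4, 1, 0, 1, 4, 9, 16
variance_a = sum(squared_deviations) / n                   # 60 / 9, about 6.67
sd_a = sqrt(variance_a)                                    # about 2.58

print(sum(squared_deviations), round(variance_a, 2), round(sd_a, 2))
```

Python's standard library offers `statistics.pvariance` and `statistics.pstdev`, which divide by 𝑛 in the same way and can be used as a cross-check.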
00:05:32.690 --> 00:05:41.590
Our interpretation of this is that in group 𝐴, people’s scores typically vary from the mean score by about two point five eight.
00:05:43.050 --> 00:05:44.290
So let’s just summarise that.
00:05:44.680 --> 00:05:49.560
The variance is a measure of the variability, of the squared variability in fact.
00:05:50.860 --> 00:05:55.290
To calculate it, first we need to find the mean score in that set of data.
00:05:55.570 --> 00:06:03.850
And then for every individual, we work out how different their score was from that mean, so 𝑥 minus 𝑥 bar, and then we square that difference.
00:06:04.180 --> 00:06:09.540
Then we add up all of those squares of differences and divide by how many pieces of data we’ve got.
00:06:10.850 --> 00:06:14.870
And there’s a special symbol for that; it’s this small 𝜎 squared.
00:06:16.600 --> 00:06:23.570
And because that result is in square units, if we take the square root of that number, we get something called the standard deviation.
00:06:23.940 --> 00:06:27.730
And we just use the little 𝜎 notation to represent that.
00:06:28.690 --> 00:06:34.990
And a way to describe the standard deviation is it’s the root mean square deviation from the mean.
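Written out in symbols, the summary above corresponds to the formulas below. This is a sketch in LaTeX notation, with 𝑥 bar as the mean and 𝑛 as the number of data values, matching the spoken description rather than copied from the slide.

```latex
\bar{x} = \frac{\sum x}{n},
\qquad
\sigma^2 = \frac{\sum (x - \bar{x})^2}{n},
\qquad
\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}
```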
00:06:36.840 --> 00:06:44.410
Now, we can rearrange that formula a little bit with a little bit of magic, and it actually turns it into a slightly more user-friendly version of the formula.
00:06:45.730 --> 00:06:49.000
And this rearrangement gives you exactly the same answer at the end of the day.
00:06:49.030 --> 00:06:53.690
But when you’ve got lots of data, it turns out to be easier to do this calculation than the previous one.
00:06:54.220 --> 00:06:59.390
And that first term there, Σ 𝑥 squared over 𝑛 is the mean of the squares of the data.
00:06:59.390 --> 00:07:04.340
So we have to take each individual piece of data and square it, add up all those squares, and then divide by how many bits of data there are.
00:07:05.500 --> 00:07:09.770
And when we’ve got that result, we then subtract the square of the mean of the data.
00:07:09.770 --> 00:07:14.870
So that’s Σ 𝑥, adding up all of the 𝑥 scores, divided by 𝑛, and then taking that result and squaring it.
00:07:17.020 --> 00:07:21.590
Now of course if we take the square root of that answer, we’ve got the standard deviation.
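The video does not spell out the rearrangement, so the derivation below is one standard way to get from the first formula to the second. It relies only on expanding the square and on the fact that the sum of 𝑥 divided by 𝑛 is the mean, 𝑥 bar.

```latex
\begin{align*}
\sigma^2 &= \frac{\sum (x - \bar{x})^2}{n} \\
         &= \frac{\sum \left( x^2 - 2\bar{x}x + \bar{x}^2 \right)}{n} \\
         &= \frac{\sum x^2}{n} - 2\bar{x}\,\frac{\sum x}{n} + \frac{n\bar{x}^2}{n} \\
         &= \frac{\sum x^2}{n} - 2\bar{x}^2 + \bar{x}^2 \\
         &= \frac{\sum x^2}{n} - \left( \frac{\sum x}{n} \right)^2
\end{align*}
```

Taking the square root of the last line gives the standard deviation in its "mean of the squares minus the square of the mean" form.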
00:07:22.080 --> 00:07:23.900
Okay, let’s see that in action then.
00:07:25.120 --> 00:07:28.710
Let’s use it to calculate the standard deviation for the scores in group 𝐶.
00:07:30.130 --> 00:07:34.140
So first we’re gonna list all the scores: one and then all the fives and then the nine.
00:07:35.460 --> 00:07:41.380
Then we’re going to square all of those individual scores, so one and then lots of twenty-fives and then eighty-one.
00:07:43.020 --> 00:07:52.700
Now if we add up all the 𝑥 scores, remember we’ve got forty-five, and if we add up all these 𝑥 squared values, it gives us two hundred and fifty-seven.
00:07:54.620 --> 00:08:04.550
Now remember, our formula for standard deviation is the sum of the 𝑥 squares divided by 𝑛 take away the sum of 𝑥 divided by 𝑛 all squared and then take the square root of that whole answer.
00:08:04.550 --> 00:08:05.850
So let’s plug the numbers in.
00:08:07.360 --> 00:08:14.070
So that’s two hundred and fifty-seven divided by nine minus forty-five divided by nine all squared and then the square root of that whole thing.
00:08:15.740 --> 00:08:21.860
And that simplifies down to the square root of thirty-two over nine, which is one point eight nine to two decimal places.
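The same steps can be followed in Python. This is a sketch that assumes group 𝐶 is a one, seven fives, and a nine, which matches the totals quoted in the video.

```python
from math import sqrt

# Assumed scores for group C: one 1, seven 5s, one 9.
group_c = [1, 5, 5, 5, 5, 5, 5, 5, 9]
n = len(group_c)

sum_x = sum(group_c)                   # 45
sum_x2 = sum(x * x for x in group_c)   # 1 + 7 * 25 + 81 = 257

# Mean of the squares minus the square of the mean.
variance_c = sum_x2 / n - (sum_x / n) ** 2   # 257/9 - 25 = 32/9
sd_c = sqrt(variance_c)

print(sum_x, sum_x2, round(sd_c, 2))   # standard deviation is about 1.89
```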
00:08:22.900 --> 00:08:40.380
So comparing these then for group 𝐴 and group 𝐶, remember that the mean was exactly the same, the range was exactly the same, the median was exactly the same, but the standard deviation for group 𝐴 was two point five eight to two decimal places, while for group 𝐶 it was one point eight nine to two decimal places.
00:08:41.330 --> 00:08:47.110
So that’s nicely captured for us the fact that there’s more variability in the group 𝐴 scores.
00:08:47.310 --> 00:08:53.150
Everybody’s got something different, whereas in group 𝐶 most people scored the same and only a couple of people scored something different.
00:08:55.250 --> 00:08:58.450
Alright then, just before we go, let’s do one final example.
00:08:58.480 --> 00:09:05.410
Calculate the standard deviation of the values twelve, nineteen, twenty-three, twenty-five, thirty-seven, and forty-two.
00:09:05.820 --> 00:09:08.520
Give your answer to two decimal places.
00:09:09.650 --> 00:09:14.010
So first, we’re gonna write out all of our 𝑥 values; that’s just the list of data.
00:09:14.640 --> 00:09:19.210
Then we notice we’ve got six pieces of data, so 𝑛 is equal to six.
00:09:19.850 --> 00:09:23.960
Next, we need to square the values of each individual piece of data.
00:09:25.380 --> 00:09:40.680
Twelve squared is a hundred and forty-four, nineteen squared is three hundred and sixty-one, twenty-three squared is five hundred and twenty-nine, twenty-five squared is six hundred and twenty-five, thirty-seven squared is one thousand three hundred and sixty-nine, and forty-two squared is one thousand seven hundred and sixty-four.
00:09:41.710 --> 00:09:47.310
Now, if we add up all the 𝑥 values, it gives us a total of one hundred and fifty-eight.
00:09:48.140 --> 00:09:54.390
And if we add up all the 𝑥 squared values, it gives us a total of four thousand seven hundred and ninety-two.
00:09:55.990 --> 00:10:04.960
And remember the formula for standard deviation, 𝜎 is equal to the square root of the sum of 𝑥 squared over 𝑛 minus the sum of 𝑥 over 𝑛 all squared.
00:10:06.150 --> 00:10:12.260
Well, the sum of 𝑥 squared was four thousand seven hundred and ninety-two and 𝑛 was six, and that first bit is that.
00:10:13.070 --> 00:10:18.780
The sum of 𝑥s was a hundred and fifty-eight and 𝑛 was six, so we’ve got to square a hundred and fifty-eight over six.
00:10:19.050 --> 00:10:22.990
And then we’ve got to take the square root of the whole thing.
00:10:24.300 --> 00:10:36.410
Well, my calculator tells me that simplifies down to the square root of nine hundred and forty-seven over nine, which is ten point two five seven seven eight eight three seven and so on, but we want to give our answer to two decimal places.
00:10:37.560 --> 00:10:43.120
So the answer is the standard deviation is equal to ten point two six to two decimal places.
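As a check on this example, here is a short Python sketch that computes the standard deviation both ways. The function names `sd_from_deviations` and `sd_from_squares` are made up for illustration.

```python
from math import sqrt, isclose

values = [12, 19, 23, 25, 37, 42]

def sd_from_deviations(data):
    """Square root of the mean of the squared deviations from the mean."""
    n = len(data)
    mean = sum(data) / n
    return sqrt(sum((x - mean) ** 2 for x in data) / n)

def sd_from_squares(data):
    """Square root of (mean of the squares minus the square of the mean)."""
    n = len(data)
    return sqrt(sum(x * x for x in data) / n - (sum(data) / n) ** 2)

print(sum(values), sum(x * x for x in values))   # 158 and 4792
print(round(sd_from_squares(values), 2))         # 10.26
print(isclose(sd_from_deviations(values), sd_from_squares(values)))
```

Both functions agree up to floating-point rounding, which matches the claim that the rearranged formula gives the same answer.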
00:10:45.480 --> 00:10:49.160
So we’ve learned two different formulae then for the standard deviation.
00:10:49.440 --> 00:10:53.710
We use this symbol here, little 𝜎, to represent standard deviation.
00:10:53.950 --> 00:11:01.210
And the first formula is: it’s the square root of the sum of the squares of 𝑥 minus the mean, divided by 𝑛.
00:11:02.420 --> 00:11:11.400
Although sometimes in practice we use this method: we square each of the scores and add those up and divide by how many there are, then we take away the square of the mean, and then we take the square root.
00:11:13.500 --> 00:11:20.880
And an easy way to remember that formula is it’s the mean of the squares minus the square of the mean, all square rooted.
00:11:23.010 --> 00:11:29.890
And finally, remember if you don’t take that square root at the end, you’ll have 𝜎 squared, and we call that the variance.