Let's visualize this idea of variability partitioning.
Suppose the circle represents the total variability in vocabulary scores.
We partition the variability into two parts:
variability that can be attributed to differences in social class and
variability attributed to all other factors.
Variability attributed to social class is called the between group variability
since social class is the grouping variable in our analysis, and
the other portion of the variability is what we're not interested in.
In fact, it's somewhat of a nuisance factor for us,
since if everyone within a certain social class scored the same,
then we would have no variability attributed to other factors.
This portion of the variability is called our within group variability.
Here's a look at the ANOVA output.
The first row is about between group variability, and
the second row is the within group variability.
We often refer to the first row as the group row, and
the second row as the error row. The third row displays the totals.
Next, we're going to go through all of the values in this table:
how they're calculated, and what they mean.
Let's start with the column of sum of squares.
The last value in this column is sum of squares total,
commonly referred to as SST.
This value measures the total variability in the response variable.
In this case, that would be the variability of vocabulary scores.
This value is calculated very similarly to variance
except that it is not scaled by the sample size.
More specifically, this is calculated as the total squared
deviation from the mean of the response variable.
We have 795 observations in our dataset,
and the mean vocabulary score is 6.14.
So to calculate SST, we take each individual score and subtract 6.14 from
it, square the difference, and finally add up all the values.
For example, the first score is 6, so that's (6 - 6.14) squared.
The next one is 9, so that's (9 - 6.14) squared.
The third one is also 6, and so on and so forth, and
we add up all of the values to get to the total sum of squares of 3,106.36.
This value represents the total variability in the response variable.
But what we're really interested in is how this variability is partitioned into
between and within group variabilities.
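To make the SST calculation concrete, here is a minimal Python sketch. The `scores` list is a small hypothetical sample (only its first three values, 6, 9, and 6, echo the example above), so its mean and total will not match the 6.14 and 3,106.36 of the full dataset of 795 observations.

```python
# Hypothetical scores for illustration only; this is not the lecture's dataset.
scores = [6, 9, 6, 4, 7, 10, 5, 8]


def total_sum_of_squares(values):
    """Return SST: the sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((y - mean) ** 2 for y in values)


sst = total_sum_of_squares(scores)

# Dividing SST by n - 1 would turn it into the sample variance,
# which is the scaling step that SST itself leaves out.
sample_variance = sst / (len(scores) - 1)

print(f"SST = {sst:.3f}, sample variance = {sample_variance:.3f}")
```

The same function would work on the full set of 795 scores if they were loaded into a list.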
As an aside, we can see that this is an awfully tedious calculation to do by hand.
And hence for ANOVA, we usually rely on software to do the calculations for us.
So the calculations we're going to present in this video are for
illustrative purposes and for introducing the concepts.
But you'll likely never have to calculate these by hand.
You still need to understand what they mean so
that you can interpret your analysis though.
Next, let's talk about the sum of squares group, SSG.
This value measures the variability between groups and
can be thought of as the variability in the response variable
explained by the explanatory variable in the analysis.
It's calculated as the squared deviation of the group means from the overall mean,
weighted by their sample sizes.
So more specifically, for each group we calculate its mean,
that's y bar j, subtract the grand mean, y bar, from it,
square this value, and multiply it by the sample size for that group.
We do this for each of the groups, and sum them up.
Here's a summary table that's going to help us.
The lower class group has a mean of 5.07. We subtract from that the grand
mean of 6.14, square that value, and multiply it by the sample size for the group, 41.
We do the same thing for all of our groups and arrive at the sum of squares
group of 236.56, which on its own is not a meaningful number, but it's
interesting how it compares to the total sum of squares we calculated earlier.
For example, this value is roughly 7.6% of SST,
meaning that 7.6% of the variability in vocabulary scores
is explained by social class and the remainder is not
explained by the explanatory variable we're considering in this analysis.
This is a low percentage, which I think makes sense, because we would expect
vocabulary scores to be associated more with education or
how much people read.
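The SSG recipe described above can be sketched in Python as follows. The three `toy_groups` are hypothetical data chosen to keep the arithmetic easy to follow; they are not the social class groups from the lecture. The last part reuses only the figures quoted for the lower class group (mean 5.07, grand mean 6.14, 41 observations) to show what a single term of the sum looks like.

```python
def sum_of_squares_group(groups):
    """Return SSG: for each group, the squared deviation of the group mean
    from the grand mean, weighted by the group's sample size, then summed."""
    all_values = [y for group in groups for y in group]
    grand_mean = sum(all_values) / len(all_values)
    ssg = 0.0
    for group in groups:
        group_mean = sum(group) / len(group)
        ssg += len(group) * (group_mean - grand_mean) ** 2
    return ssg


# Hypothetical toy groups (assumption, for illustration only).
toy_groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
toy_ssg = sum_of_squares_group(toy_groups)
print(f"Toy SSG = {toy_ssg:.2f}")

# A single term of the sum, using the lower class figures from the lecture.
lower_class_term = 41 * (5.07 - 6.14) ** 2
print(f"Lower class contribution = {lower_class_term:.2f}")
```

Adding the corresponding terms for the remaining groups would give the full SSG.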
The last value here is sum of squares error, SSE, and
it measures the variability within groups.
In other words, this is the unexplained variability and
it's the variability due to all the other variables.
The simplest way of calculating this is as
the difference between SST and SSG.
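As a rough check on this shortcut, the sketch below computes SSE two ways on hypothetical toy data (an assumption, not the lecture's dataset): once as SST minus SSG, and once directly as the squared deviations of each observation from its own group mean. The helper names are illustrative.

```python
def mean(values):
    return sum(values) / len(values)


def sum_sq_dev(values, center):
    """Sum of squared deviations of values from a given center."""
    return sum((y - center) ** 2 for y in values)


# Hypothetical toy groups (assumption, for illustration only).
toy_groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
all_values = [y for group in toy_groups for y in group]
grand_mean = mean(all_values)

sst = sum_sq_dev(all_values, grand_mean)
ssg = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in toy_groups)

# Shortcut: SSE as the difference between SST and SSG.
sse_by_difference = sst - ssg

# Direct route: variability of each observation around its own group mean.
sse_direct = sum(sum_sq_dev(g, mean(g)) for g in toy_groups)

print(f"SST = {sst:.2f}, SSG = {ssg:.2f}")
print(f"SSE by difference = {sse_by_difference:.2f}, SSE direct = {sse_direct:.2f}")
```

Both routes agree, which is why the subtraction is usually the simplest option.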
Now we need a way to get from the sum of squares measures to the mean square
values.
To do so, we need to scale the sum of squares values by values that incorporate
sample size as well as the number of groups, namely the degrees of freedom.
So next let's focus on that column.
Total degrees of freedom is calculated as sample size minus 1: 795 - 1 = 794.
Group degrees of freedom is calculated as number of groups minus 1: 4 - 1 = 3.
And the error degrees of freedom is simply the difference between these two: 794 - 3 = 791.
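The degrees of freedom arithmetic is short enough to write out directly. In this sketch the sample size of 795 comes from the lecture, and the group count of 4 is inferred from the group degrees of freedom being 3; the variable names are illustrative.

```python
n_observations = 795  # sample size quoted in the lecture
n_groups = 4          # inferred: number of groups minus 1 equals 3

df_total = n_observations - 1    # 794
df_group = n_groups - 1          # 3
df_error = df_total - df_group   # 791

print(df_total, df_group, df_error)
```

Equivalently, the error degrees of freedom equal the sample size minus the number of groups.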
Next stop is the mean squares column, which measures the average variability
between and within groups and is calculated as the sum of squares for
that component divided by its degrees of freedom.
So we can calculate these by doing the divisions, and
next we're going to use these values for calculating our F statistic,
because, as you may remember, our F statistic is the ratio of the average between and
within group variabilities.
In other words, it's MSG divided by MSE.
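Putting the columns together, a minimal sketch of the remaining steps might look like this. The function name and the input values are hypothetical (the numbers are toy figures, not the lecture's output); the point is only the order of operations: sums of squares, then degrees of freedom, then mean squares, then F.

```python
def mean_squares_and_f(sst, ssg, n_observations, n_groups):
    """Return (MSG, MSE, F) from the sums of squares and the design sizes."""
    sse = sst - ssg                              # unexplained variability
    df_group = n_groups - 1
    df_error = (n_observations - 1) - df_group
    msg = ssg / df_group                         # average between group variability
    mse = sse / df_error                         # average within group variability
    return msg, mse, msg / mse


# Hypothetical toy inputs (assumption, for illustration only).
msg, mse, f_stat = mean_squares_and_f(
    sst=30.0, ssg=24.0, n_observations=9, n_groups=3
)
print(f"MSG = {msg:.2f}, MSE = {mse:.2f}, F = {f_stat:.2f}")
```

A large F value indicates that the between group variability is large relative to the within group variability.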