0:31

And for that, we'll use a one-way repeated measures ANOVA.

This is a parametric ANOVA.

And you'll remember we've done a one-way ANOVA before, but now it's a one-way

repeated measures ANOVA, which indicates a within-subjects factor.

0:46

So we'll read in search, scroll, voice as our data table,

with voice as that third level of our technique factor.

Let's view that, as we commonly do.

So we still have only 20 subjects, and we have technique levels search, scroll, and voice.
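A minimal sketch of this step; since the course's actual CSV isn't reproduced here, the data frame below is a synthetic stand-in with the same shape (20 subjects, each measured under all three techniques):

```r
# Synthetic stand-in for the course's data table: 20 subjects, each of
# whom used all three techniques (the real data would be read with
# read.csv on the course's file).
set.seed(123)
sv <- data.frame(
  Subject   = factor(rep(1:20, times = 3)),
  Technique = factor(rep(c("Search", "Scroll", "Voice"), each = 20)),
  Time      = c(rnorm(20, 100, 10), rnorm(20, 110, 10), rnorm(20, 90, 10))
)

head(sv)      # peek at the long-format table
summary(sv)   # factor levels, counts, and basic statistics
```

In RStudio, `View(sv)` opens the same table in a spreadsheet-like viewer.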

1:06

We have order as before, one and two, where voice is always three.

Now, that would be a real challenge if we ran a study this way,

where we brought people in to do voice always as the third technique.

Because we might be introducing some confound there by having it always last.

But perhaps in an exploratory aspect of the experiment, we might tack on

a condition like voice maybe to test a prototype at the end of the study.

1:49

And as we often like to do, we want to see a few more statistics

about each of the levels, in terms of their mean and median.

So we can see here, for example, that

scrolling seems to be the longest, the slowest of the techniques.

Then comes searching, and voice is a little bit faster than searching.

Is it fast enough to be different?

That's the question, and looking at the standard deviations

in the next output helps us judge that a little bit.

And we can look at our histograms as well.

These haven't changed for search and scroll, but for voice, the new one,

we can see a lot of clustering between 80 and 90 there.

And the box plot helps us see their relative position in terms of the time it

takes to find a contact in the contacts manager.
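These descriptives and plots can be sketched in base R as follows (the course tends to use the plyr library for summaries, but `aggregate` gives the same numbers); the data frame is a synthetic stand-in:

```r
set.seed(123)
sv <- data.frame(
  Subject   = factor(rep(1:20, times = 3)),
  Technique = factor(rep(c("Search", "Scroll", "Voice"), each = 20)),
  Time      = c(rnorm(20, 100, 10), rnorm(20, 110, 10), rnorm(20, 90, 10))
)

# Mean, median, and standard deviation for each level of Technique
stats <- aggregate(Time ~ Technique, data = sv,
                   FUN = function(x) c(mean = mean(x),
                                       median = median(x),
                                       sd = sd(x)))
print(stats)

# A histogram per level, then a box plot of all three side by side
par(mfrow = c(1, 3))
for (lvl in levels(sv$Technique))
  hist(sv[sv$Technique == lvl, ]$Time, main = lvl, xlab = "Time (s)")
par(mfrow = c(1, 1))
boxplot(Time ~ Technique, data = sv, ylab = "Time (s)")
```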

3:02

If the overall test is significant, then we can look into pairwise comparisons.

If the overall or omnibus test is not significant,

we're not justified in looking further for pairwise comparisons.

We're going to use the ez library, and

I've got some comments here in the code that help explain how this is working.

So the ez library allows us to build this model m by specifying

the dependent variable, time, the within-subjects variable, technique,

the subject identifier, subject, and also the data table here.

So we have a one-factor, three-level within-subjects variable called technique.

And we built our model.

And then the comment says, we have to check our model for

violations of something called sphericity.
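A sketch of the ezANOVA call described here, assuming the ez package is installed (`install.packages("ez")`) and using synthetic stand-in data:

```r
library(ez)  # install.packages("ez") if needed

set.seed(123)
sv <- data.frame(
  Subject   = factor(rep(1:20, times = 3)),
  Technique = factor(rep(c("Search", "Scroll", "Voice"), each = 20)),
  Time      = c(rnorm(20, 100, 10), rnorm(20, 110, 10), rnorm(20, 90, 10))
)

# One within-subjects factor (Technique) with three levels;
# wid identifies the repeated-measures unit, here each subject.
m <- ezANOVA(dv = Time, within = Technique, wid = Subject, data = sv)

m$`Mauchly's Test for Sphericity`  # p < .05 indicates a violation
m$ANOVA                            # the uncorrected F test, with ges
m$`Sphericity Corrections`         # Greenhouse-Geisser and Huynh-Feldt
```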

3:49

Sphericity is the situation where the variances

of the differences between all the combinations of levels of a within-subjects

factor are equal, or very nearly equal.

It always holds for within-subjects factors that have just two levels,

so in that case we don't have to worry about it.

But with three or more levels, sphericity has to be tested and

examined with Mauchly's test of sphericity.

These are some of the complications and

complexities that within subjects variables introduce.

We'll see later, when we use mixed-effects models, that we can actually

model covariance explicitly, and then we don't have to test for sphericity at all.

So we first check in our model here the Mauchly output.

If it's significant, it indicates a violation,

and we have to use a corrected form of our ANOVA.

4:40

Here, we do have a P value of less than .05.

That star indicates it's significant.

So we have a violation of sphericity and

we'll use a corrected output which I'll show you in a moment.

If there's no violation, we can just use the regular ANOVA.

If there is a violation, we'll use the sphericity output, and

within that, the Greenhouse-Geisser correction.

So first let's look at the ANOVA table, without correction.

We can see an F test; recall it has two degrees-of-freedom values.

The degrees of freedom in the numerator are two, and in the denominator, 38.

Here, is our F statistic.

And the P value is obviously quite a bit less than .05.

And GES is a value that tells us the effect size.

It's called the generalized effect size.

We won't go into that in this class, but

effect size has to do with the strength of the effect.

You don't want to interpret a P value as effect strength and so

the generalized effect size is a way of getting that.

Actually, GES stands for generalized eta squared, and

it compares to eta squared or partial eta squared, which are other effect sizes.

But because "ES" also matches "effect size,"

I find that an easier way to remember what it means.

6:08

Okay, we're actually going to do some calculations here to compute the degrees

of freedom for the corrected results.

So we'll just do those, and

add that to the sphericity table that's output from this ezANOVA function call.

So here's our table, and again we have technique as our effect.

We know there's a sphericity violation, so we're going to use one of the two outputs

here: the Greenhouse-Geisser correction or the Huynh-Feldt correction, the HFe.

We'll use the Greenhouse-Geisser correction.

This is the Greenhouse-Geisser statistic.

And the P value that goes with it, obviously less than .05.

So technique is still statistically significant for the F test.

Because there is a sphericity violation, this corrected value is the one that counts;

if it weren't less than .05, we wouldn't have a significant result.

Now we'll ignore the Huynh-Feldt results, we only need one set.

And then here are the Greenhouse-Geisser degrees of freedom in the numerator and

denominator.

And we can round those to, say, the nearest tenth.

And that's what we computed up above here, so

we have the full data we need to report the result.
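The corrected degrees of freedom are just the uncorrected degrees of freedom multiplied by the correction's epsilon; here is that arithmetic with the df from this design and a hypothetical Greenhouse-Geisser epsilon (in practice, read GGe from the ezANOVA sphericity table):

```r
# Uncorrected df for one within-subjects factor with 3 levels, 20 subjects:
df1 <- 3 - 1               # numerator: levels - 1 = 2
df2 <- (3 - 1) * (20 - 1)  # denominator: (levels - 1)(subjects - 1) = 38

gge <- 0.827  # hypothetical Greenhouse-Geisser epsilon from the output

round(df1 * gge, 1)  # corrected numerator df: 1.7
round(df2 * gge, 1)  # corrected denominator df: 31.4
```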

So it's reported just like an F test result,

but with the adjusted degrees of freedom,

the corrected P value, and

the F value from the original effect table.

Incidentally, the same uncorrected results in R can be given by fitting this model

here, which you should be able to understand now, and then

summarizing over that.

I'll just do that briefly.
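That generic form can be sketched with base R's aov, using an Error term for the within-subjects factor (again with synthetic stand-in data); note it gives only the uncorrected table:

```r
set.seed(123)
sv <- data.frame(
  Subject   = factor(rep(1:20, times = 3)),
  Technique = factor(rep(c("Search", "Scroll", "Voice"), each = 20)),
  Time      = c(rnorm(20, 100, 10), rnorm(20, 110, 10), rnorm(20, 90, 10))
)

# The Error(Subject/Technique) term tells aov that Technique varies
# within each Subject; this reproduces the uncorrected F test only.
m <- aov(Time ~ Technique + Error(Subject/Technique), data = sv)
summary(m)
```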

But that wouldn't give us the sphericity test, Mauchly's test of sphericity,

and so that's why we don't use that generic form here.

7:55

Now, because the overall test was statistically significant,

we can reach in and do post hoc comparisons.

And for that we will use the paired samples T test,

but we need a wide format table for that.

So we'll use dcast, as we've done before, to make a wide-format table

based on technique, and we'll view that.

So we have subject in the left column and then scroll, search,

and voice across the top.

8:22

We verify that, and then in the next three rows,

we store up the individual paired-samples t-tests.

And then we adjust for multiple comparisons and display the results.

And we can see that all three results are statistically significantly different.
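A sketch of this sequence, assuming the reshape2 package for dcast and using a Holm adjustment for the three comparisons (synthetic stand-in data):

```r
library(reshape2)  # install.packages("reshape2") if needed

set.seed(123)
sv <- data.frame(
  Subject   = factor(rep(1:20, times = 3)),
  Technique = factor(rep(c("Search", "Scroll", "Voice"), each = 20)),
  Time      = c(rnorm(20, 100, 10), rnorm(20, 110, 10), rnorm(20, 90, 10))
)

# Long to wide: one row per subject, one column per technique
sv.wide <- dcast(sv, Subject ~ Technique, value.var = "Time")
head(sv.wide)

# The three paired-samples t-tests, then the multiple-comparisons adjustment
t1 <- t.test(sv.wide$Search, sv.wide$Scroll, paired = TRUE)
t2 <- t.test(sv.wide$Search, sv.wide$Voice,  paired = TRUE)
t3 <- t.test(sv.wide$Scroll, sv.wide$Voice,  paired = TRUE)
p.adjust(c(t1$p.value, t2$p.value, t3$p.value), method = "holm")
```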

8:54

Well, let's look at errors now for the three techniques.

As we've said, errors often don't conform to the assumptions of ANOVA.

So we'll do some looking at errors for the three techniques here.

We can see the means and medians there.

And standard deviation there as well in that next output.

And some histograms will give us a sense of the distribution of errors;

the first two, for search and scroll, haven't changed from before.

Here are the voice errors; those certainly don't look normally distributed.

And we can look at the box plots for errors, where we can see that, in fact,

scroll still seems to produce the fewest.

And voice, although it seemed fast, was maybe more error prone.

If we go back a couple of graphs, we can see this was the time things took,

voice was the fastest and we know that was a significant difference but

when we go forward here and see errors, voice seems the most error prone.

What we have in our hands here is a speed accuracy trade off in human performance.

That's very, very common.

When people are faster, they tend to make more mistakes.

That's not universally true when we're comparing techniques.

It may not always hold, but more often than not, it's the case.

So keep that in mind as you measure both speed and errors or accuracy.

10:25

We can ask again as we did before,

are those errors Poisson distributed in this new voice condition?

So we do a fit and, examining that, we see that in fact there

is no significant departure from a Poisson distribution.

That will be interesting to us later when we return to this data and

analyze it using a Poisson distribution directly.
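A sketch of that goodness-of-fit check, assuming the fitdistrplus package and hypothetical Poisson-distributed error counts:

```r
library(fitdistrplus)  # install.packages("fitdistrplus") if needed

set.seed(123)
voice.errors <- rpois(20, lambda = 3)  # hypothetical voice error counts

# Fit a Poisson distribution and test the fit; a chi-square p value
# above .05 means no significant departure from Poisson.
fit <- fitdist(voice.errors, "pois")
gofstat(fit)
```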

But for now we'll do a Friedman test on errors.

And again we have the same syntax as we did for

the Wilcoxon signed-rank test where we have errors by technique and

subject as our blocking factor across rows here.

And so the Friedman test shows a P value that certainly is much lower than .05.

And we might expect that in looking at the graph.

That means the overall test of errors is significant.

So we can reach in and look at the pairwise comparisons using

the Wilcoxon signed-rank test as our pairwise test.

We correct for multiple comparisons and

we see that all of the results are less than .05 even when corrected.

So with confidence, we can say all of the pairwise comparisons,

the two way comparisons here between search and scrolling, scrolling and voice,

and search and voice are all significantly different in terms of errors.
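The Friedman omnibus test and its Wilcoxon follow-ups can be sketched as below, with synthetic error counts standing in for the course data and a Holm adjustment on the three pairwise p values:

```r
set.seed(123)
sv <- data.frame(
  Subject   = factor(rep(1:20, times = 3)),
  Technique = factor(rep(c("Search", "Scroll", "Voice"), each = 20)),
  Errors    = c(rpois(20, 2), rpois(20, 1), rpois(20, 4))
)

# Omnibus test: Errors by Technique, blocking on Subject
ft <- friedman.test(Errors ~ Technique | Subject, data = sv)
print(ft)

# Pairwise Wilcoxon signed-rank tests (justified only if the omnibus
# test was significant), corrected for multiple comparisons
wp <- function(a, b)
  wilcox.test(sv$Errors[sv$Technique == a],
              sv$Errors[sv$Technique == b], paired = TRUE)$p.value
p.adjust(c(wp("Search", "Scroll"), wp("Search", "Voice"),
           wp("Scroll", "Voice")), method = "holm")
```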

11:58

Lastly, we can look at the Likert scale ratings.

Ordinal ratings,

one to seven, also don't generally comply with the assumptions of ANOVA.

Let's explore that data.

Here, we can see means and medians again for how people rated effort.

How hard or effortful was it to use these techniques to find contacts?

And we can see that the standard deviations look similar, so

the spreads around them are probably about the same.

Looking at some histograms, we see the effort on a seven point scale for

search, for scroll and for voice.

They all look like they lean more towards seven; let's do a plot and see.

There we see effort is about the same for scroll and search, but

maybe a little more for voice.

Perhaps, since we know there were more errors,

it was the voice recognition mistakes that took more effort.

Let's do the Friedman test on the overall effort ratings and

here we see an interesting outcome.

The P value is not significant, meaning there's not a detectable difference

in the effort ratings, on a one to seven scale, that people gave for

these three different techniques.

I have a note here for what that means.

Since the omnibus test is not significant, the post hoc comparisons,

the pairwise comparisons are not justified.

If we could do them, we would carry them out like we did for errors just above.

So we know how to do that.

But we're not justified in doing that in this case.

That's an important principle to remember in these analyses.

13:36

So we've just completed our analysis of the performance of subjects looking for

contacts in a smartphone contacts manager using three techniques: searching,

scrolling, and voice.

13:47

So we had one factor, it had more than two levels.

It had three levels as we just said.

It was a within subjects factor.

All subjects did all three of those techniques to find

a set of contacts in a contact manager.

14:02

We used a one-way repeated measures ANOVA for our parametric test,

and we used the Friedman test for the nonparametric test

across all three levels of technique.

We followed up the one-way repeated measures ANOVA with

paired samples T tests for post hoc contrast testing.

And for the Friedman test,

when it was significant we followed it up with the Wilcoxon signed-rank test.

14:37

Now, what happens if we go beyond two or

three levels of a factor

to having multiple factors themselves?

This will bring us to the factorial ANOVA and the aligned-rank transform.

It'll take us towards linear models and

eventually generalized linear models, which will be next.