0:20

This is not intended to replace the written assignment description, but

to give you an overview and some tips to get you going.

So part 1 is Non-Personalized Recommendation, and

the assignment is about computing in a spreadsheet.

You can use pretty much any spreadsheet: Google Sheets, Microsoft Excel,

OpenOffice Calc.

Just about any of them will have all of the functions that you would need

to complete this work.

And you're given a 20 x 20 ratings matrix as a .CSV file to download as a start.

Let's just take a quick look at that matrix.

1:00

And I opened it up in Microsoft Excel, and

what you can see here is that it is 21 rows.

The first one is a header, followed by a set of user numbers.

In case you're interested, these are actually users who took an earlier version

of this course and submitted their ratings of a number of movies.

Across the top, you see a set of individual movies,

so number 260 is Star Wars, Episode IV, A New Hope.

We come in here, number 356 is Forrest Gump, and

these are the movies that they rated.

We go all the way out.

You'll see that again we have a total of 21 columns.

There's the heading column that has the user, and then 20 more.

Inside the spreadsheet, any cell with a number in it is a rating on a five-star scale,

from 1, intended to be low, meaning I didn't like this movie,

to 5, meaning I liked it very much.

We can see going across user number 755,

that user did not very much like the original Star Wars,

but liked Return of the Jedi, was not a big fan of

Forrest Gump but liked Silence of the Lambs.

So these are all real people's ratings.

The other thing you'll notice is that many of the cells are empty.

A blank cell means the person did not rate that, and presumably may not have seen it.

2:40

And you're going to compute a variety of outputs, submitting to Coursera the top

five movies for each of the things we ask you to calculate.

Where top five is measured numerically from your scores.

So if you compute something that turns out to be in a range from one to five,

then go back and either sort those numbers or look for

the top five and list them in order.

You're going to submit both the list of top 5 and the scores that go with them,

and these will be graded for you automatically.

You'll get back your results.
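
If it helps to see the "top five" step written out, here is a minimal sketch in Python. The movie names and scores are made up for illustration, and `top_n` is a hypothetical helper, not something the assignment provides. It sorts the scores from highest to lowest and keeps the first five, along with the scores that go with them.

```python
# Hypothetical scores for illustration only; these are not real assignment values.
scores = {
    "movie_a": 3.27,
    "movie_b": 3.60,
    "movie_c": 3.10,
    "movie_d": 3.95,
    "movie_e": 4.20,
    "movie_f": 3.80,
    "movie_g": 3.05,
}


def top_n(scores, n=5):
    """Return the n (movie, score) pairs with the highest scores, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


top_five = top_n(scores)
for movie, score in top_five:
    print(movie, score)
```

In a spreadsheet you would get the same result by sorting the row of computed scores in descending order and reading off the first five.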

3:18

So what are these specific computations?

We're going to have you look at the mean rating, the average of all of the ratings.

The number of ratings, which could be a measure of the popularity of the movie:

how many of these people saw it, whether they said they liked it or not.

The percentage of ratings that are positive, so

positive is defined as greater than or equal to four for this purpose.

So if something had 12 ratings, and eight of them were four and

five, and four of them were one, two, and three, then that percentage would

be two thirds, or about 0.667, or 67%.
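
As a rough sketch of those three statistics outside a spreadsheet, the Python below computes them for one movie's column. It assumes blank cells are represented as `None` and that "positive" means a rating of 4 or more, as described above. The `summarize` function and the sample column are hypothetical, chosen to match the 12-ratings example.

```python
def summarize(ratings):
    """Mean, count, and fraction of positive ratings for one movie.

    ratings: the movie's column, with None standing in for a blank cell.
    """
    rated = [r for r in ratings if r is not None]  # blanks are skipped
    count = len(rated)
    if count == 0:
        return {"mean": None, "count": 0, "pct_positive": None}
    positive = sum(1 for r in rated if r >= 4)  # positive means 4 or 5
    return {
        "mean": sum(rated) / count,
        "count": count,
        "pct_positive": positive / count,
    }


# Made-up column: 12 ratings (eight are 4 or 5, four are 1 to 3) and 3 blanks.
column = [5, 4, None, 4, 5, 1, 4, None, 2, 5, 3, 4, 5, 3, None]
summary = summarize(column)
print(summary)
```

Note that the percentage is taken over the ratings that exist, not over all 20 users, which is what gives two thirds in the example.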

And then you're going to do some product associations.

This is for people who rated a particular movie.

We'll tell you that movie in the assignment,

and here's a secret about doing online courses.

When we record videos, there's all sorts of things we don't tell you, so

that we can change them.

And so, over time we've changed which movie we've asked people to calculate

to give people somewhat fresh results.

So read the assignment for the details.

You're going to do this two ways.

One is using the simple product association

formula from the lesson on product association.

So the count of people who rated both the selected movie and

each other movie, divided by the count of people who rated the selected movie.

In other words, how often do these co-occur?

And then we're going to use the lift formula, where you take the probability (a

count works just as well here) of the two movies together,

divided by the product of the probabilities of the movies apart.
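
Here is a minimal sketch of the two formulas in Python, assuming each movie is represented by the set of user IDs who rated it. The function names and the sample sets are hypothetical. Check the written assignment for the exact definition it wants you to submit.

```python
def simple_association(raters_x, raters_y):
    """Share of the people who rated x who also rated y: (x and y) / x."""
    return len(raters_x & raters_y) / len(raters_x)


def lift(raters_x, raters_y, n_users):
    """P(x and y) / (P(x) * P(y)), with probabilities taken over n_users."""
    p_xy = len(raters_x & raters_y) / n_users
    p_x = len(raters_x) / n_users
    p_y = len(raters_y) / n_users
    return p_xy / (p_x * p_y)


# Made-up data: 20 users, 10 rated x, 8 rated y, 5 rated both.
raters_x = set(range(1, 11))
raters_y = {1, 2, 3, 4, 5, 11, 12, 13}

assoc_score = simple_association(raters_x, raters_y)
lift_score = lift(raters_x, raters_y, n_users=20)
print(assoc_score, lift_score)
```

If you use raw counts instead of probabilities in the lift formula, every score is scaled by the same constant, so the ordering of the movies does not change, but the scores themselves do.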

5:04

Both of those are going to be done separately,

that'll give you a chance to see the difference.

And we have one last computation, which has a nice built in function for it.

Correlation, this is the Pearson correlation, built into the spreadsheets as

CORREL, and you're going to look at the correlation between a selected

movie and each of the other movies and find the five that correlate most together.
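
For reference, this is roughly what a Pearson correlation between two movie columns computes, sketched in Python. One assumption to flag: this version drops any user who left either movie blank, which is how I understand the spreadsheet function to treat empty cells, but verify that in your own spreadsheet. The `pearson` function is a hypothetical helper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two rating columns (None marks a blank cell).

    Assumption: only users who rated both movies are included.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None  # not enough shared ratings to correlate
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return None  # a constant column has no defined correlation
    return cov / sqrt(var_x * var_y)


# Made-up columns: the fourth user left the first movie blank and is skipped.
r = pearson([1, 2, 3, None, 5], [2, 4, 6, 1, 10])
print(r)
```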

5:30

To take you through a little bit in case you're not experienced with

doing these in a spreadsheet, I want to take you through a few tips and

show them to you in the spreadsheet as we go.

And probably the most important thing is to understand how you do formulas and

calculations.

And all of those in our standard spreadsheets start with an equals sign.

5:50

So if I came, and I said I really want to know

what the average is of the ratings for

this movie, I can say give me the average of.

And the beauty of spreadsheets is that I don't have to actually type the cell numbers.

I can just select this range, close the parenthesis,

and it will show me this is a 3.2667, nice to know.

There's a bunch of other formulas I can do here. If I wanted

to know the correlation between two, I can do this with CORREL and

put two ranges in with a comma in between.

I could even do some interesting things.

So let's say I created the average here too, and

7:32

And I'm going to have you look that up, as you go in your spreadsheet, but

the functionality of it is to say I have a condition.

I only want to count the cases that meet that condition.

Count if it's greater than or equal to four.

Count if the user is something or another.

8:04

So let me come back to the spreadsheet here and say, you know, here's my formula for

the average, and what would happen if I just copy?

I'm using Ctrl+C, but I could also say Copy

as an edit operation, and I copy into here.

You'll notice the number changed.

And in fact, the formula changed.

8:29

Here I was, I had the average from C2 to C21.

Now it's the average of D2 to D21.

Whenever you copy a formula by default, in fact, if I do this, and

I copy this formula over here, this is now going to be this minus this.

How much better do people like Return of the Jedi than Forrest Gump?

That may not be what I intended.

8:58

And when I don't want these things to move, and I should point out,

this works if I step downwards also.

If I copy this formula down here, the average is going to change.

And the reason that the average changes

is because the cell here at the top suddenly got excluded.

Now it's the average from D3 to D22.

10:03

This turns out to be really useful as you're going through and

trying to create formulas.

So, if what I meant was not, I really care about the difference between

these adjacently, let me just put this formula that we have, all the way across.

10:40

But let's say what I really want to know is, if I treat Star Wars

as the greatest movie ever, some people feel that way, but

I want to see how Star Wars compared to every other movie in the world,

then what I really care about is, I'm going to lock this,

that I always care about comparing B.

11:10

But the C I'm willing to have change as I go to whatever cell moves over.

Now, I actually think it's probably a mistake to write my formula that way.

I think the way to think about this is in every column,

I'm going to say how much better was Star Wars than that movie?

And in fact Star Wars is not better than Star Wars at all.

But if I copy this over here,

Star Wars is a quarter of a star better than the Return of the Jedi.

11:46

That must be experimental error,

because I told you Star Wars was the greatest movie ever, right?

But as we go through we can see,

there are some movies that people liked more than Star Wars.

Well, exactly one.

12:05

Maybe when you're done with this assignment you can treat yourself to

a video and go watch it.

But these are the kinds of things we can do with a spreadsheet.

So, the tips: remember, put formulas in cells using an equals sign.

You can copy formulas, and you will find this really useful as you go forward.

Rows and columns lock separately, so

if you put D21 and you copy it down

and to the right, it will become E22.

If you want to lock the 21 even if you move it up and down, put it as D$21.

If you want to lock the D even as you move left and

right, do $D21, and if you want to lock both of them because

this is the thing you're always referring to, put in both dollar signs: $D$21.
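
To make the locking rules concrete, here is a small Python sketch that mimics what happens to a single cell reference when a formula is copied. It is only an illustration of the rule, not how any spreadsheet is actually implemented, and `copy_reference` and its helpers are hypothetical names.

```python
import re


def col_to_num(col):
    """Convert column letters to a number: A -> 1, Z -> 26, AA -> 27."""
    n = 0
    for ch in col:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


def num_to_col(n):
    """Convert a column number back to letters: 1 -> A, 27 -> AA."""
    letters = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def copy_reference(ref, rows_down, cols_right):
    """Shift a reference like D21, D$21, $D21, or $D$21 as a copy would."""
    m = re.fullmatch(r"(\$?)([A-Z]+)(\$?)(\d+)", ref)
    if m is None:
        raise ValueError(f"not a cell reference: {ref}")
    col_lock, col, row_lock, row = m.groups()
    if not col_lock:  # column moves unless it is locked with $
        col = num_to_col(col_to_num(col) + cols_right)
    if not row_lock:  # row moves unless it is locked with $
        row = str(int(row) + rows_down)
    return f"{col_lock}{col}{row_lock}{row}"


# Copy each style of reference one row down and one column to the right.
for ref in ["D21", "D$21", "$D21", "$D$21"]:
    print(ref, "->", copy_reference(ref, rows_down=1, cols_right=1))
```

Copied one cell down and one to the right, D21 becomes E22, D$21 becomes E$21, $D21 becomes $D22, and $D$21 stays as it is.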

Okay, part two: demographics.

This is going to be a small part of the assignment, because, as we've discussed,

many of the techniques that you would need to use you will not have learned yet.

But we do want you to start exploring the idea of demographics, and so

we're going to show you a second spreadsheet.

I'm not going to bring it up now,

because all it has is one additional column that has gender, male or

female, associated with each of the people that are providing ratings.

Obviously, in the real world, you'd have more data, you'd have more unknowns.

And we're going to have you compute, as given in the assignment,

several of the same outputs separately for male and female users, and

the final question in the assignment is asking you to assess whether it would

appear to be valuable, instead of giving overall averages,

to give gender-based stereotyped recommendations.

We're going to do that in a number of different ways.

You're going to look at to what extent the genders have different average ratings.

To what extent they differ systematically in what they think about the movies.
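
As a rough sketch of the per-gender averages, the Python below splits one movie's ratings by the gender column and averages each group. The ratings and the "F"/"M" labels are made up for illustration, blanks are assumed to be `None`, and `mean_by_group` is a hypothetical helper.

```python
def mean_by_group(ratings, groups):
    """Average one movie's ratings separately for each group label.

    ratings: the movie's column, None for a blank cell.
    groups:  a parallel list with each user's group label.
    """
    totals = {}
    for rating, group in zip(ratings, groups):
        if rating is None:
            continue  # blank cell: this user did not rate the movie
        total, count = totals.get(group, (0, 0))
        totals[group] = (total + rating, count + 1)
    return {group: total / count for group, (total, count) in totals.items()}


# Made-up data for one movie; the labels are assumptions about the file.
ratings = [5, 4, None, 3, 2, 4]
genders = ["F", "M", "F", "M", "F", "M"]

means = mean_by_group(ratings, genders)
difference = means["F"] - means["M"]
print(means, difference)
```

Repeating this for every movie and comparing the two averages side by side is one way to see whether the groups differ on individual movies or across the board.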

And we'll do just a little bit of computation that gives you a head start

thinking about the evaluation that's coming up in a later course

as we look at a couple of examples of how we might start to estimate

whether one of these is better than the other.