0:05

So there's a data file that's posted on the course website,

it's an Excel worksheet with data pulled from movies.

If we think about the motion picture industry

it's a pretty big industry as far as how many workers are employed,

how much it adds to the economy, how much it adds to the tax base.

There's also, a lot of interest there from a predictive standpoint because

the better guest we can get as far as which movies are going to be successful.

0:45

Virtual stock markets, that's one way that those

we can try to predict how popular different movies are going to be.

So this is just a summary of the data that's up on the course website.

We got a sample of movies that produced during the 2001 to 2005 years

with a lot of information available about those movies.

Now if we look at the information that's available,

features such as what genre is it?

Which studio produced it?

What's the movie rating?

Is it based on an adaptation of a graphic novel or novel?

Is it based in some other medium?

Some of these are yes/no answers.

Other variables might have multiple options available,

not just the two options.

These are all categorical outcomes.

That's the common thread here.

1:37

We've also got a block of financial measures.

So, how much revenue was brought in?

What was the production budget?

What was the marketing budget?

These are all quantitative variables.

So, the nature of the variable, that's going to

inform the type of analysis that we can apply to it.

So some of the ways that we might start looking at the categorical data and

we'll go through some of these, I'll demonstrate them.

I would encourage you to spend some time working with the Excel file to

make sure that you're comfortable not only generating these different reports but

also understanding the trade-offs associated with them.

Frequency tables are going to report numerical values to us

as our contingency tables or cross tabs.

We might look at pie charts, bar charts,

column charts as ways of visualizing some of this output.

2:31

So if we wanted to put together a frequency table, and I'll jump over to

the Excel file so that we can see what we're working with in a second.

This just gives you a snapshot of the number of movies being produced by each

studio in the particular year, that's in the middle column.

So if we look at all of these 199 movies we can see how they're distributed

across the different studios, notice that they are ranked in descending order so

Universal had most movies out in this year.

Followed by 20th Century Fox, Warner Brothers, and so fourth,

as we go further down this list the other category lumping

all of those smaller studios that had less than 5 films

in that year produce makes up a total of 44 of the 199 films.

And then the column on the side is putting that as a percentage basis.

So this is saying, Universal produced 9.05% of the films.

20th Century Fox produced 8.54%.

Descending percentage until we get to that final row

where we've lumped the others together.

That percentage adds up to 100%.

Now the way that we've reported it here, we're reporting for each studio,

what percentage of films were produced,

you might also want to produce a cumulative column.

So, Universal produce 9% of the films, 20th Century Fox produce 8.5%,

so we might want to say all right, well the two studios that produce

the largest films or the most films combined, how much did they produce?

It would be a running sum.

So I'd start off with the 9.05% for 20th Century Fox and

larger studios we'd add in the 8.54, for Warner Brothers, add in another 8.45.

As we go further down that column, adding in the films produced by

the smaller studios, we get closer and closer to 100%.

And that's what this cumulative distribution would show us.

So on the x axis or horizontal axis,

1 corresponds to the studio that produced the largest films,

37 corresponds to the studio producing the fewest number of films.

And as we include more and more studios as I move from left to right on this graph.

It accounts for a larger share of the films that have been produced.

4:56

Another way of looking at this data,

numbers tend to be a bit sterile we might want to put that into the bar chart and

so we can see how many films were produced by each studio.

So that frequency table, that can be reformatted and put into the bar chart.

And this is just focusing on a subset of those studios,

just those that had at least five films.

Maybe easier in delivering reports

rather than including a massive table to have charts similar to this one.

Another way that we might represent the distribution

of films across the studios would be with a pie chart.

And I have a little bit of a love-hate relationship with pie charts, and

you start to see why in this case.

We've got a lot of studios that make up a very small slice.

Well, think of making a more and more narrow slice of the pie.

Try splitting that other category into the individual studios

fitting these data labels onto this chart is going to become very difficult.

So in this case we've included the name of the studio on the pie chart.

We've included the percentage of the films that those studios are producing.

And we can still see it on this chart.

As you add more and more studios, as you have a categorical variable with more and

more values these pie charts may become less useful because you cannot visualize

all of the possible options.

Ways around that might be to lump some of the options together.

So that's what we've done in this case with

the 22% falling in to that other category.

6:39

So just a couple of words of caution with these charts and we've talked about

categorical variables in the sense that a movie falls under a studio.

A movie falls only under one of the studios in our dataset.

Well, when you're making bar charts,

when you're making pie charts, that's a requirement in the data.

That each observation can only fall into one of those categories and

all of your options are going to have to add up to 100%.

One of the other few things to be careful of is we focus just on studio.

What if I wanted to look at studio by rating?

So let's look at the movies that are PG rated movies from the different studios,

and I want to draw some comparisons between the PG and the PG-13 movies.

Well, a single pie chart is not necessarily the best way to go

about doing that.

I might have to do side by side pie charts or

side by side bar charts to make those comparisons.

But these are just ways of summarizing the categorical data that's available to us

and visualizing that, and very helpful from a recording perspective.

Bar charts, pie charts, the frequency tables.

7:55

It's ways of summarizing a single categorical variable.

But what about when we want to see the relationships that exist among

two categorical variables, or even more general than that,

what if there are three or more categorical variables?

Well, one of the popular tools that we can use to do that is a contingency table or

cross tab table.

So using the data that we have available to us we can put together these tables

say we wanted to look at the studio and the genre of movies that are in there.

Perhaps we want to see how many movies of a particular genre are made by different

studios maybe we want to look at the relationship between studio and ratings.

8:41

So this is one way of looking at this data, this is the raw count of the data.

You'll notice going across the rows we have the studios,

if we look down the columns, this is looking at the movie rating.

So we have G, PG, PG-13, and R, and

these are the counts of how many movies of each rating were made by these studios.

This is that cross tab.

So we're trying to look at the relationship that exists

in terms of studio and ratings.

A couple of different pieces of this table I want to call out for you.

9:31

If we were to add those up,

351 movies were released by these two studios in total.

Of those 351 movies, 23 of those movies were rated G.

Well that's coming from adding up that first column.

So, 23 movies were rated G, and

they only came from two studios, Buena Vista and Warner Brothers.

10:02

this bottom row here, the grand total row that we're looking at,

think of this, if we're looking at that last row, it's the margin of the table.

It's the marginal distribution for us.

So what fraction of movies were rated G?

23 out of 351.

What fraction of movies were rated R?

89 out of the 351.

If we wanted to look at how many of these movies came from a particular studio with

a particular rating, that's when we're going to jump into the individual cells.

So Buena Vista movies rated G were 20 out of the 351.

If we look at Buena Vista overall, I could add up this entire row and

that's going to tell me that studio produced 87 movies out of the 351.

So, this tool good for looking at two different variables.

In this case, we're looking at just those counts.

This is produced entirely in Excel.

It's using the pivot table feature.

Very convenient as far as organizing.

Data and providing quick summaries.

11:17

Same data that we were looking at previously, but in this case,

I've just reformatted that data.

So instead of saying, let's count up the number of movies, in this case,

focus is on the fraction of the total.

So you'll recall from the previous slide there were 351 movies.

Well, now we're looking at 100% of those movies.

So divide each entry by 351, and we can see what the percentages are.

So movies rated G made up just about 6.5% of movies released by these four studios.

Whereas, movies released by Universal made up just shy

of 22% of movies released by these four studios.

I'm going to jump over to Excel to show you how easy it is for

us to produce reports like this using the data that we have available.