This lesson is going to explore statistical issues that might
arise when you begin to explore multi-dimensional data sets.
So far you've seen how to visually analyze two-dimensional data sets,
and you've also learned how to work with them
analytically by using multi-dimensional arrays.
This lesson is going to focus on some issues that
come up when we start looking at multi-dimensional data sets.
In particular, we're going to look at paradoxes of
probability and how to avoid them when you're looking at multi-dimensional data,
and we're going to look at statistical misinterpretation and how to
be careful not to let your statistics lead you astray.
And lastly, we're going to look at a fun website called Spurious Correlations, which
reinforces the idea that correlation does not imply causation.
So first you're going to read this article in
The Conversation about different paradoxes of probability.
These are concepts such as Simpson's paradox, where you make
a measurement on aggregated data, you start interpreting it, and it
shows one trend.
And yet when you separate the data out into groups, the trend looks different
or even reverses, and
trying to understand why is what leads to Simpson's paradox.
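
To make Simpson's paradox concrete, here is a minimal sketch in Python. The counts are hypothetical, invented only for illustration, and the names `counts`, `rate`, and `overall_rate` are not from the article. It compares two treatments within two groups and then in aggregate.

```python
# Hypothetical (successes, trials) counts, invented for illustration only.
counts = {
    "A": {"easy": (18, 20), "hard": (48, 80)},
    "B": {"easy": (64, 80), "hard": (10, 20)},
}

def rate(successes, trials):
    """Success rate for a single group."""
    return successes / trials

def overall_rate(groups):
    """Success rate after pooling every group together."""
    total_successes = sum(s for s, _ in groups.values())
    total_trials = sum(t for _, t in groups.values())
    return total_successes / total_trials

for name, groups in counts.items():
    per_group = {g: round(rate(*st), 2) for g, st in groups.items()}
    print(name, per_group, "overall:", round(overall_rate(groups), 2))
```

With these made-up numbers, treatment A is ahead within each group (0.90 versus 0.80, and 0.60 versus 0.50), yet B is ahead once the groups are pooled (0.74 versus 0.66), because A was mostly applied to the hard cases.
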
There are other ones like the base rate fallacy,
the Will Rogers paradox, et cetera.
These provide interesting insights
into how you can be led astray by statistical analysis,
particularly with multi-dimensional data sets.
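
As one example of how these can mislead, the base rate fallacy can be sketched with Bayes' rule. This is only an illustration: the prevalence and test accuracy below are hypothetical numbers, and the function name `probability_given_positive` is made up for this sketch.

```python
def probability_given_positive(prior, sensitivity, false_positive_rate):
    """Bayes' rule: P(condition | positive test)."""
    true_positives = prior * sensitivity
    false_positives = (1.0 - prior) * false_positive_rate
    return true_positives / (true_positives + false_positives)

# Hypothetical values: 1% of people have the condition, the test catches
# 99% of real cases, and it wrongly flags 5% of healthy people.
posterior = probability_given_positive(
    prior=0.01, sensitivity=0.99, false_positive_rate=0.05
)
print(f"P(condition | positive test) = {posterior:.3f}")
```

Even with a test that sounds very accurate, only about one in six positive results is a real case under these assumed numbers, because the condition is rare to begin with.
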
The next article is also in The Conversation:
the seven deadly sins of statistical misinterpretation.
It sounds really bad but that's mostly to get your attention.
So one thing to be careful about is looking at data and not
realizing that some information, such as the measurement error, may not always be shown.
In the example they give here, you look at these two bar charts,
and the difference looks like it's quite significant.
But if you actually have an understanding of what the error is on each measurement, as
shown in the right panel, you realize that the differences are within the errors,
and thus it's unlikely that there's an important difference between these two data sets.
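
A rough way to check whether a difference is within the errors is sketched below. The measurements are hypothetical, the helper `mean_and_sem` is a made-up name, and the comparison against twice the combined error is just a common rule of thumb, not something taken from the article.

```python
import math
import statistics

# Hypothetical measurements, invented for illustration only.
group_a = [10.1, 12.3, 9.8, 11.5, 10.9]
group_b = [11.0, 12.9, 10.2, 12.4, 11.3]

def mean_and_sem(values):
    """Return the sample mean and the standard error of that mean."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem

mean_a, sem_a = mean_and_sem(group_a)
mean_b, sem_b = mean_and_sem(group_b)

difference = mean_b - mean_a
# Errors on independent measurements combine in quadrature.
combined_error = math.sqrt(sem_a**2 + sem_b**2)

print(f"A = {mean_a:.2f} +/- {sem_a:.2f}")
print(f"B = {mean_b:.2f} +/- {sem_b:.2f}")
print(f"difference = {difference:.2f} +/- {combined_error:.2f}")
print("within the errors:", abs(difference) < 2 * combined_error)
```

Here the bar for B would be taller than the bar for A, but the difference is smaller than its own uncertainty, so on its own it tells us very little.
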
Another thing is that sometimes statistical significance is taken
to imply that something is important, but when you
really look at it in the real world that's not true.
That's often an issue with sample size:
if you have a small sample, variations can
be large, and thus depending on which sample you get,
you might have a different result.
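
The effect of sample size can be seen with a small simulation. This is a sketch under stated assumptions: the data are drawn from a standard normal distribution, and the sample sizes, the number of repeats, and the function name `spread_of_sample_means` are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def spread_of_sample_means(sample_size, repeats=200):
    """Draw many samples of one size and measure how much their means vary."""
    means = []
    for _ in range(repeats):
        sample = [random.gauss(0.0, 1.0) for _ in range(sample_size)]
        means.append(statistics.mean(sample))
    return statistics.stdev(means)

spread_small = spread_of_sample_means(5)
spread_large = spread_of_sample_means(500)

print(f"spread of means, n=5:   {spread_small:.3f}")
print(f"spread of means, n=500: {spread_large:.3f}")
```

Every sample comes from the same distribution, yet the means of the small samples scatter much more widely, which is why two small samples can appear to disagree purely by chance.
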
The rest of this article goes through similar examples and these are important to
see the things that you need to be careful about as you look at data sets.
The last one is a very fun website that I like.
It talks about what are known as
spurious correlations, and the idea here is that often we look
at data sets such as these shown here and
think, wow, these two data sets clearly are correlated,
there must be some relationship between them.
But what this is showing us is the spending by the U.S. government on Science, Space,
and Technology correlated with suicide by hanging, strangulation, and suffocation.
There should be no real relationship between these two data sets.
And so this is what's known as a spurious correlation.
One clue to this would be the different axes,
the different labels and scales on each side.
But there are many others that you can look at here.
The number of people who drowned by falling into
a pool correlates with the number of films Nicolas Cage appeared in, et cetera.
You can go through these and see different ones.
Here's a very high correlation.
There's our R value that we learned in a previous notebook,
that's quite high, quite close to one.
So clearly there's a really strong correlation between
per capita cheese consumption and the number of people who
died by becoming tangled in their bedsheets.
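
To see how easily a high R value can show up, here is a minimal sketch that computes it by hand, assuming the R value here means the Pearson correlation coefficient. The two series are hypothetical yearly values, invented for illustration and not taken from the website. They have nothing to do with each other except that both happen to rise over time.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)

# Hypothetical yearly values for two unrelated quantities that both trend up.
series_a = [10, 12, 13, 15, 18, 19, 22, 24, 25, 28]
series_b = [3.1, 3.3, 3.2, 3.6, 3.9, 4.0, 4.2, 4.5, 4.4, 4.9]

r = pearson_r(series_a, series_b)
print(f"r = {r:.3f}")
```

The shared upward trend alone is enough to push r close to one, so a large R value by itself says nothing about whether one quantity causes the other.
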
Anyway, as I said this is a fun website to look at and to see that,
you know what, correlation does not imply causation.
You can see that there's a relationship between
two data sets, but then you have to really think analytically:
is there really some reason these should be correlated,
is there some cause for this correlation?
In some cases there is, and that provides
you very important and unique insight into the data.
Often that's what we're after, and that's why we're showing
the visualization and the correlation coefficient: to help
convince people that there is a correlation and that
the correlation has an important cause within the data set.
That's the model that's generating that data.
So hopefully you've learned to be careful about interpreting data,
particularly when we start going to multi-dimensional data sets,
where there are a lot of new things that come into play that we have to be careful about.
You should feel free to discuss these on the course forum.
If you find anything else on this spurious correlations website,
maybe you can share that.
And of course if you have any questions,
let us know. Good luck.