Last time, we looked at pie charts as a way to compare categories. This time, we'll look at scatterplots. To build a scatterplot, we take each point in our data set and plot it on two axes. What are scatterplots good for? They're good for looking at the relationship between two or more variables, and they're also good for finding patterns in our data.

This first one shows a mapping between age in years and reaction time in seconds. I don't have all the details, but this seems pretty clearly to be from an experiment. This is hard for me as a gamer. As you can see, the older we get, the slower we get at reacting to things that happen, which tends to be a bad trait in video games. We also see the red line. It is very common for people to plot a number of points in a scatterplot and then try to fit that data to a line. When we fit it to an actual straight line, a linear function, that's called fitting a line. Sometimes we say, that looks exponential, so we fit it to an exponential function, and that's called fitting a curve. It's fairly common for people to look at the data points in a scatterplot mathematically and try to figure out what the best-fit line is for that data.

Here is another scatterplot, and this could certainly be fictional rather than real data. As you can see again, I've just put the citations of where I found these on the web. This is plotting the relationship between study time in minutes and quiz grade. We certainly see a trend here: the more study time, the higher the quiz grade. Pretty clearly an academic generated this to try to exhort their students to study harder, or at least for more time, so they can do better in the class. And again, there is a line fit to those data points, and in this case we can see that the actual equation is provided for that line.

In this next one, even though there's a dotted line there, the scatterplot plots the duration of an eruption at Old Faithful against the waiting time between eruptions.
So as you can see, if you wait about an hour and you get an eruption, that eruption will last only around two and a half minutes. If, however, it's a longer period of time before the eruption, say 90 minutes, then the eruption is liable to last more like four and a half minutes. Now, we can't control Old Faithful, the geyser. We can't control how long the waiting time is, so this is just luck of the draw. But we can see an interesting pattern here. In particular, what's interesting is the cluster on the bottom left and the cluster on the top right. It looks like a waiting time of around 70 minutes doesn't happen very often, because there aren't that many points on a line that goes horizontally from the 70. So for this particular geyser, it's either going to blow in an hour or slightly less, or it's going to take a little longer than that, 75 minutes or more. But there's sort of this 15-minute window when it's not likely to erupt. So that's an interesting pattern that we can observe from the scatterplot.

The final scatterplot we're going to look at is pretty interesting. It shows population growth between April 2000 and July 2004 in various states in the US. Each of these data points is plotted on two different axes that give us different information. The horizontal axis is the raw number change in thousands, so 1,500 thousand is 1.5 million, and so the horizontal axis gives us those raw changes in population. The vertical axis tells us about the percent change. So a smaller state that grows by 500,000 will have a larger percent change than a larger state that grows by 500,000. As we can see, this is actually a pretty cool way to look at data in two different ways in a single scatterplot. So if you look at a state like Idaho, near the left and near the top, we see that it didn't grow much in raw numbers, it looks like about 100,000 that it grew, but it also grew by almost 8%.
So there was a large percent change in Idaho's population, even though there wasn't a large change in raw numbers. If you look at California, all the way over on the right, California grew by about 2 million people in that time period, but that percent change was only about 6%. So in terms of raw numbers it was a significantly larger population growth over that span of time than Idaho's, but it was a lower percentage because California has a larger population than Idaho does. And of course there are some places, like North Dakota and DC, the District of Columbia, which isn't actually a state, that had population decline. But we can observe for different states both the raw growth in numbers and the percentage growth, because the scatterplot plots on these two distinct axes.

We can infer interesting information from the data by looking at where the point for a particular state appears. So if we look over at California or Texas or Florida, we can see that large states are getting larger. And we can look over on the left-hand side at those states that are actually pretty high vertically, especially Nevada, which I didn't see because my webcam was hiding it from me. Nevada has a huge growth in percentage over that time span, even though the raw numbers are lower. So if we look at those that are near the top of the stack vertically, those are the ones that are growing fastest. They've experienced the largest percent change over that time period. So as we can see, this can be a really interesting way to represent data in a scatterplot, one that gives us numerous pieces of information rather than just one.

To recap, in this lecture, we learned how scatterplots can help us understand the relationship between two or more variables and how they can help us find patterns in our data.
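If you're curious how the best-fit line from the first two plots might be computed, here is a minimal sketch in Python using ordinary least squares. I'm assuming least squares here; the slides don't say which method produced their lines. The function name `fit_line` and the study-time numbers are made up for illustration and are not read off the plot.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean; zero means every point shares one x value.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    # How x and y move together around their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data for illustration only (not taken from the slide).
study_minutes = [10, 20, 30, 40, 50, 60]
quiz_grades = [55, 62, 66, 75, 79, 88]

slope, intercept = fit_line(study_minutes, quiz_grades)
print(f"grade = {slope:.2f} * minutes + {intercept:.2f}")
```

A positive slope corresponds to the upward trend we saw in the study-time plot. Fitting a curve, like an exponential, follows the same idea but with a different function in place of the straight line.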