In this module, we want to discuss one of the most common two-dimensional visualization methods: the scatterplot. Basically, we want to think about how we can visualize discrete data values along two axes. The scatterplot is one that most people have seen in some form or another, created in Excel, PowerPoint, or Tableau. Any of these tools can produce one, and essentially what a scatterplot does is let us look at how two variables are related.
For example, we can think about baseball players. For each player, I may have their batting average, their percent on base, and other statistics. If you don't like baseball, you can think of this as soccer or football, or as quiz scores: any dataset where we have some measures and values and we want to compare two of them. We talked about how we can show batting averages in the 1D case, right? I have player one with their batting average as a bar, player two and their batting average, player three and their batting average, and so forth. We can also plot players one, two, and three by percent on base.
Then we may have a question: how does batting average relate to the percentage of times a player gets on base? To answer that, we want some graphical representation that lets people understand these relationships. For just two variables, the most common visualization is a scatterplot, where we can visualize discrete data values along two axes, analyze bivariate relationships, quickly look for outliers, clusters, and distributions, and also identify trend lines.
Imagine I show you a scatterplot that looks like this. Immediately you start thinking to yourself: why is there a group of data here and a group of data here? What is the relationship between the two?
If I add another point way out here, you might ask: why is that single lone point way out there? Is that an outlier? What's going on? Likewise, if I show you a plot with points that go in this manner, you can quickly start seeing the trend and asking why we might have a trend that fits this sort of exponential-looking curve. So, by plotting these values against each other, we can start asking questions. If we plot batting average versus percent on base, we may see some sort of plot like this, maybe even with some outliers, but we see a trend through our data, and we might start saying, well, it seems like batting average tracks percent on base.
It could be that in class, if I plot, say, quiz one versus the final exam score, we find that for some reason quiz one is highly correlated with the final exam score: students who did well on quiz one early in the semester stick it out and do well on the final. These things might help us to identify average students. If you're down here on quiz one next semester, and I know this is correlated, I might come to people and say, "Hey, students who did badly on quiz one really need to improve, because they tend to do badly on the final." So we can start using this information to make decisions, talk about interesting facts in the data, and help people understand trends and what those trends mean.
In creating a scatterplot, we can plot things like the number of runs and the batting average, or any variables I have in my dataset, and each point here is a different player. Again, you have to think about what questions you want to answer with the data, what labels you might want to have, what interactions you might want to have, and what we're trying to see.
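To make "highly correlated" concrete, here is a minimal sketch that computes the Pearson correlation coefficient for two variables, using only the Python standard library. The quiz and final exam scores are invented for illustration, and `pearson_r` is a helper name chosen here, not something from the lecture.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores, invented for illustration only.
quiz_one = [55, 62, 70, 78, 85, 91]
final_exam = [58, 60, 74, 75, 88, 93]

r = pearson_r(quiz_one, final_exam)
print(f"Pearson r = {r:.3f}")
```

A value near 1 means the points fall close to an upward-sloping line, a value near -1 a downward-sloping line, and a value near 0 means no linear trend. It only measures linear association, so a curved trend like the exponential-looking one can score lower than it looks on screen.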
Here we might see some sort of trend. Maybe we could even try to fit a regression line to the plot (we'll talk about how to do regression in different lectures) to see whether there is some way I can model the data to explain the different patterns within it. We can also identify outliers. For example, this person had a really low number of runs but a reasonably high batting average. The same with this person: actually one of the higher batting averages, but some of the lowest numbers of runs.
We could think about adding interaction. Imagine this was my mouse tooltip: I can point to an element on the screen, and it might pull up a label saying this is player number seven, for example, and tell us who that is and give us more information. Again, we can think about how we want to fix the aspect ratio: should I have made this axis much, much longer than the other axis? How do I manage and manipulate those things? All of the different elements we've talked about in past modules, from aspect ratios to nice numbers to graph labels, come into play in creating our scatterplot here.
We don't have to stick with just two variables in a scatterplot. There's a really fun video by Hans Rosling if you go to gapminder.org. Hans Rosling does this nice animated presentation where each bubble is a country, and he's showing income versus life expectancy. As he shows this move over time, you get a story of the world and how income has shifted life expectancy. If we think about a dataset like that, we have a country, an income, and a life expectancy. We can add things like population, another variable like GDP, the things they trade, the percent of trade to other countries: all sorts of variables about different countries. Then we can start thinking about what visual variables we have, and we talked about Bertin's seven visual variables.
We had position on x and y, and this combination alone creates a scatterplot. Each country becomes a circle on the screen, okay? Then we get shape if we want, so we chose the circle as our shape here. Color maps to something, and we can see lots of colors being reused. Notice we don't have a legend here. What Hans Rosling did was match color to certain chunks of the world, so you had North America, South America, Europe (I'm not a very good artist), Africa, Australia, and each chunk of the world might have been a slightly different color. So we get color, we get shape, we get position, and we could even think about adding a texture to these circles if we wanted, though that gets sort of busy. Notice we have different sizes, too: here the size of the circle corresponds to something like population, and color corresponds to location in the world. We're combining these different visual variables to move past the 2D representation, where we had variable X versus variable Y, to multiple variables.
Then he even adds in animation. We can move our data over time, so animation provides us with yet another variable to see trends, patterns, and changes over time. There's nice work on animated scatterplots by John Stasko, so you can take a look, and I encourage you to watch Hans Rosling's Gapminder video just to see how his presentation goes for these multivariate cases.
Now the real question, though, is: if I have all of these variables, which scatterplots should I draw? If I have variables like income, life expectancy, and population, I can create a ton of different scatterplots, one for every pair of variables. I can make a scatterplot for GDP versus the percentage of trade, for GDP versus life expectancy, for GDP versus population. So the more variables I have, the more scatterplots I can make.
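To see how quickly the number of possible scatterplots grows, here is a small sketch that enumerates every unordered pair of variables. The variable names are taken from the example above, and `count_pairs` is a helper name chosen here for illustration. With n variables there are n(n-1)/2 distinct pairs.

```python
from itertools import combinations

def count_pairs(n):
    """Number of distinct unordered pairs among n variables: n(n-1)/2."""
    return n * (n - 1) // 2

# Variables from the country example in the lecture.
variables = ["income", "life expectancy", "population", "GDP", "percent of trade"]

# Each unordered pair is one candidate scatterplot.
pairs = list(combinations(variables, 2))

for x_var, y_var in pairs:
    print(f"{x_var} vs {y_var}")

print(f"{len(variables)} variables give {len(pairs)} scatterplots")
```

Five variables already give ten candidate plots, and twenty variables give 190, which is why some way of choosing among them becomes necessary.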
So, I want to think again about how I help people detect the expected and discover the unexpected. How do I identify anomalies in these scatterplots? How do I identify interesting trends? One way is through what was coined as Scagnostics. Tukey coined this back in the 80's, talking about graph-theoretic measures for detecting structural anomalies in scatterplots. If I draw a scatterplot, what are the different things that I'm seeing? For example, remember when I drew a scatterplot that looked like this: notice it has a clumpiness property. Is there a way I can compute that mathematically? This may be a really interesting view that says, "Hey, these particular elements are highly related in these two variables." So we can use these graph-theoretic measures to help users pick views that show particular structures of interest. The idea is to help us determine which relationships between variables we should pick if I don't have the time or capacity to show every possible pairwise combination. Which is the best pairwise combination to show? Which is the second best? And so forth.
Scagnostics gives us a bunch of different equations that we can calculate. For example, we can figure out the convex hull that encloses all of our points: if I draw some points on the screen, the convex hull is the smallest convex polygon that contains all of them. We can measure things like the area of this polygon, we can have correlation measures that describe how correlated the data is, and other elements like that. There's a whole list of Scagnostics, and people have been working on those sorts of measures for a very long time. Wilkinson proposed nine Scagnostic measures to characterize scatterplots.
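The convex hull and its area can be sketched in a few lines. This is a minimal illustration using Andrew's monotone chain algorithm and the shoelace formula, both chosen here for brevity. It is not the Scagnostics library's implementation, and the sample points are invented.

```python
def cross(o, a, b):
    """Cross product of vectors o->a and o->b; positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Convex hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop points that would make a right turn or a straight line.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]

def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

# Invented sample: four corner points plus three interior points.
points = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 1), (1, 2), (3, 2)]
hull = convex_hull(points)
area = polygon_area(hull)
print(hull, area)
```

The interior points drop out and only the four corners remain on the hull. A hull area like this is a building block: a measure can compare it against other areas or lengths computed from the same points.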
The nine measures are outlying, sparse, striated, skinny, monotonic, skewed, clumpy, convex, and stringy. Each has a different equation associated with it, but what's nice is that Wilkinson developed a library where we can calculate all of these automatically and start using them to try to rank scatterplots. Again, we're trying to think about importance, showing people what's important and interesting in their data.
We've talked about Shneiderman's information mantra (overview first, zoom and filter, then details on demand), and Daniel Keim talked about a visual analytics mantra: analyze first, as opposed to overview first. If you have a large dataset, you put it into some sort of analytical framework, whether that's deep learning, unsupervised learning through clustering, supervised learning through decision trees, things like that, or whether it's creating a bunch of scatterplots and measuring how outlying or how skewed the points are. We can then use these measures to characterize different scatterplots.
What's nice about scatterplots is that we can also create what's called a scatterplot matrix. Even if I have a whole lot of variables, I can organize the scatterplots into a matrix where I plot each variable versus each other variable. For example, I have At Bats here, and every x-axis in this direction is the same, so every x-axis is At Bats across our row. Across our columns, the variable changes. Here we get At Bats versus At Bats, Runs versus Runs, and Batting Average versus Batting Average. In this example here, we've got Batting Average and Runs, and here we've got At Bats and Runs. So we can start looking to see if there are any interesting trends.
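The layout of a scatterplot matrix can be sketched as a grid of (x variable, y variable) pairs. This sketch assumes one common convention, where the row sets the y variable and the column sets the x variable. The figure in the lecture may be arranged differently, and `scatterplot_matrix_cells` is a name made up for this example.

```python
def scatterplot_matrix_cells(names):
    """One cell per (row, column); row picks the y variable, column the x variable."""
    cells = []
    for row, y_var in enumerate(names):
        for col, x_var in enumerate(names):
            cells.append({
                "row": row,
                "col": col,
                "x": x_var,
                "y": y_var,
                # A diagonal cell plots a variable against itself.
                "diagonal": row == col,
            })
    return cells

variables = ["At Bats", "Runs", "Batting Average"]
cells = scatterplot_matrix_cells(variables)

for cell in cells:
    marker = " (diagonal)" if cell["diagonal"] else ""
    print(f"row {cell['row']}, col {cell['col']}: {cell['x']} vs {cell['y']}{marker}")
```

Three variables give a 3 by 3 grid of nine cells, three of which sit on the diagonal. Reordering the `variables` list reorders both rows and columns at once, which changes the layout without changing any of the individual plots.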
Now, the diagonal is always going to be these straight lines. This is because we're plotting the same variable against itself, so it is perfectly correlated. Again, each dot is a baseball player, and we can start looking for trends and patterns. Here we might say, well, there's one outlier in this plot; otherwise it looks like it might have some sort of trend. Here we may not see much of a relationship. Scatterplot matrices let us get a quick overview, but the problem is that I could have rearranged any row and any column in any way I want. I could have swapped these two columns or these two rows and I would get a different order, a different layout, and how I go through these orders and layouts is really important and can take time. That's where we might want to use these Scagnostic measures to think about how we order our scatterplot matrix.
We can even think about adding interaction, which allows viewers to visualize other combinations of variables. If I have a scatterplot in two dimensions, I can always extrude a third dimension, rotate my points, and show what that looks like. Niklas Elmqvist has a nice paper called "Rolling the Dice", and you can take a look at some of the nice interactions he added to these scatterplot matrices, with extensions that let people visualize this in three-dimensional space. Again, we can add color to the points, we can add shape or size, all sorts of things to add more information to these plots.
Now, the other thing we should have realized with scatterplot matrices is that really I'm just showing a whole bunch of the same thing over and over. I have the same type of plot repeated, showing different combinations of my variables. This display is sometimes referred to as small multiples or a trellis display.
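One way to think about small multiples is as a single table split by a grouping key, with the same kind of plot drawn once per group. The sketch below does only the data preparation, under stated assumptions: the records are invented placeholder values, the names `split_into_panels` and `shared_value_range` are made up here, and using one shared value range for all panels is a common choice rather than a requirement.

```python
from collections import defaultdict

def split_into_panels(records):
    """Group (group, year, value) records into one time-ordered series per group."""
    panels = defaultdict(list)
    for group, year, value in records:
        panels[group].append((year, value))
    # Sort groups by name and each series by year for a stable layout.
    return {group: sorted(series) for group, series in sorted(panels.items())}

def shared_value_range(panels):
    """Common (min, max) across all panels so their axes are comparable."""
    values = [value for series in panels.values() for _, value in series]
    return min(values), max(values)

# Placeholder data, invented for illustration only.
records = [
    ("Region A", 2008, 2.0), ("Region B", 2008, 1.0),
    ("Region A", 2009, 5.0), ("Region B", 2009, 0.9),
    ("Region A", 2010, 2.5), ("Region B", 2010, 0.8),
]

panels = split_into_panels(records)
low, high = shared_value_range(panels)

for group, series in panels.items():
    print(group, series, f"axis range {low} to {high}")
```

Each entry in `panels` would become one small plot. Here the panels are ordered alphabetically, but the same dictionary could be reordered by geography or by similarity of trend.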
Essentially, if you have a bunch of different data, you might want to organize it in some way so we can look at an overview of it all at once. For example, what if I want to look at homicide rates in Canada? Canada has many different regions. I've captured homicide rates over time, and maybe I don't want to make just one plot for the homicide rate in Canada, because I may see something like this. Or, even worse, I may see my homicide rate go up. This may not tell the whole story. If I break it down by region, I can see some interesting trends. For example, in Abbotsford-Mission, I see this huge spike and decline in 2009. I can see that Calgary had a downward trend and now it's back on an upswing. I can see Montreal has been low and slowly declining, and the same for Toronto. In another case, I'm seeing much more variability. This allows me to quickly compare between different elements of my data, in this case regions, and look at the changes over time.
Again, I can think about how I want to organize my small multiples. What layout do I want to show the user to help them compare these quickly over time? Do I want to put Abbotsford next to Calgary? Are these located next to each other geographically in Canada? Do I want to organize these based on the most similar trends? For example, should I put Toronto next to Montreal because they have similar sorts of graphs? This gives us yet another mechanism for showing multiple variables on one screen, with information about trends and changes over time, to look for relationships. Again, we can start thinking about how we might pre-analyze the data to identify interesting things, so we know where to start visualizing and can give the user more information.
So, that takes us through this concept of scatterplots, extending those to scatterplot matrices, and even further to small multiples. Thank you.