In the last session, we looked at cricket and at ways of drawing graphs, charts, and plots to understand and visualize cricket data better. Now we're going to do a similar thing for baseball. We're going to plot where the ball lands and look at those plots from different perspectives, depending on what type of hit it was, whether the player was out or not, and what type of player we're looking at. This shows how you can break up the data in different ways and tell different stories.
We're going to start in the same way as we always do, by loading the packages we need and then loading the data. Here we're going to use data from MLB Advanced Media for 2018, which gives us every event in the entire season. It's a pretty big file, and if we scroll down to the bottom, you can see we have 185,771 events in the 2018 season. It's a pretty big and impressive dataset. We have an awful lot of variables in here, and I'll just print those off. A lot of them are things we don't need.
The one thing that we're mainly going to be interested in is the x and y coordinates. These tell us where the ball landed when it was hit. We actually have two pairs of variables here. One pair is x and y. Then we have our.x and our.y. Both pairs refer to the same coordinates, just looked at from a different perspective. You'll see that in a minute. The first thing that I'm going to do is restrict the data to the variables that we're going to be interested in. We're going to use all of these variables in the analysis, so we're creating a new data frame called MLBmap that holds just those, to keep things simple. We're also going to be interested in the type of event that we're looking at. Let's just see what types of events there are.
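As a rough sketch of the loading and restricting step just described: the snippet below uses a tiny invented table in place of the real season file, since the file path is not given here. The column names (`stadium`, `batterName`, `stand`, `event`, `x`, `y`, `our.x`, `our.y`, `temperature`) are assumptions for illustration, not a confirmed description of the MLB Advanced Media file.

```python
import pandas as pd

# Invented stand-in for the season file. With the real data this would be a
# pd.read_csv(...) call on the 2018 event file; all values below are made up.
events = pd.DataFrame({
    "gameId": [1, 1, 2, 2],
    "stadium": ["Park A", "Park A", "Park B", "Park B"],
    "batterName": ["Batter 1", "Batter 2", "Batter 1", "Batter 3"],
    "stand": ["R", "L", "R", "L"],        # batting side (assumed column name)
    "event": ["Single", "Strikeout", "Home Run", "Groundout"],
    "x": [120.0, 0.0, 60.0, 110.0],
    "y": [150.0, 0.0, 40.0, 170.0],
    "our.x": [-5.0, 0.0, -150.0, -30.0],
    "our.y": [120.0, 0.0, 380.0, 70.0],
    "temperature": [70, 70, 82, 82],      # example of a column we will not use
})

# How many events and variables do we have, and what are the variables called?
n_events, n_columns = events.shape
print(n_events, "events,", n_columns, "columns")
print(events.columns.tolist())

# Restrict the data to the variables needed for the hit maps.
keep = ["stadium", "batterName", "stand", "event", "our.x", "our.y"]
MLBmap = events[keep].copy()
print(MLBmap.head())
```

Taking a `.copy()` of the selection is a small design choice: it keeps later edits to `MLBmap` from triggering pandas warnings about modifying a view of the original frame.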
If you want to know all the different values that a variable can take, you can type the variable name and then .unique(), and you can see all of the values. Here you can see the names of all the different possible events in this dataset. You've got home run, walk, hit by pitch, strikeout, groundout, and so on.
First, let's just look at all of the events in our dataset. We're going to map them by plotting our.x against our.y, the coordinates for where the ball ended up in each event. This is a simple scatter diagram. We set the size of the marker here, which is s, and s is equal to 0.001. You can change that value. We set the color of the marker to red, that's c equals 'r', and we set the marker type, specifying that these will be dots. Now let's look at what that scatter diagram looks like. There you can see very clearly the familiar shape of a baseball diamond. At the bottom in the center is where the batter stands, and the deep red parts are where the bases are. You can see this most clearly around first and third base; second base is slightly less clear. Then you can see where the ball went in the outfield, and in particular the three brighter areas in the outfield, which correspond to where the outfielders tend to stand. It's perhaps not surprising that outfielders tend to stand where the ball tends to go.
We've now plotted all the events, so let's look at some subsets of events. Of course, one of the most important events in baseball is a hit. A hit is defined as an event where the batter succeeds in getting onto a base by hitting the ball. It's not just hitting the ball, but hitting the ball successfully enough to get on base. There are different types of hits, depending on how many bases you reach. A single reaches first base, a double second base, a triple third base, and a home run takes you around all the bases. Let's now plot one event only, a single, where the batter gets to first base.
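The steps above (listing the event types, plotting every event, then plotting singles only) might look roughly like this. It is a minimal sketch on invented data: the column names `event`, `our.x` and `our.y` are assumed, and the blue color for singles is an arbitrary choice. Only `s=0.001`, `c='r'` and the dot marker come from the description above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Invented stand-in for the restricted event table; all values are made up.
MLBmap = pd.DataFrame({
    "event": ["Single", "Groundout", "Home Run", "Single", "Double", "Flyout"],
    "our.x": [-40.0, 30.0, -150.0, 60.0, 120.0, 10.0],
    "our.y": [150.0, 60.0, 380.0, 170.0, 260.0, 300.0],
})

# All distinct values the event variable takes, in order of first appearance.
event_types = MLBmap["event"].unique()
print(event_types)

# Scatter diagram of every event: tiny red dots.
# s=0.001 suits a season-sized dataset; raise it to see points on small data.
fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(MLBmap["our.x"], MLBmap["our.y"], s=0.001, c="r", marker=".")
ax.set_title("All events")

# One event type only: singles.
singles = MLBmap[MLBmap["event"] == "Single"]
fig_single, ax_single = plt.subplots(figsize=(5, 5))
ax_single.scatter(singles["our.x"], singles["our.y"], s=0.001, c="b", marker=".")
ax_single.set_title("Singles")
```

The same boolean filter, with a different event label, gives the subset for any other event type.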
Let's see what that looks like. This gives you a picture of where the ball goes when a batter gets to first base. We're going to make some comparisons here. Now let's look at the second possibility, a double, where the batter gets to second base. You can see that the distribution of doubles is quite different from the distribution of singles. The doubles tend to be further out and fall specifically in places where you don't get a single, and vice versa. Now let's look at hits that get the batter to third base, triples. Triples are relatively rare in baseball. What you can see is that triples tend to be hit into spaces right in the outfield, quite a long way from where the fielders happen to be, and they're pretty sparse. Then finally, let's look at where the home runs go. Of course, the home runs are hit out of the park, so not surprisingly they are all located in an arc right at the top of the data.
We can now put these scatter diagrams together in a row alongside each other to show the different locations of the ball for each of these possible hits. This gives you a nice frame for comparison to see where the different hits end up. Then finally, we can combine all of these onto a single plot, given that they've all got different colors. There you can see very clearly how the hits are distributed for all of these possibilities.
Notice also that while all these hits are in these particular locations, there are also a lot of blank areas in here. Those are going to be areas where batters got out. They hit the ball into the field, but it didn't qualify as a proper baseball hit because the batter was caught, or was thrown out at first base, or a runner on base was out. You can see here the distribution of outs. Of course, most of the outs tend to be balls that are hit into the infield, so the batter is thrown out at first base.
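One way to lay out the four hit types side by side and then overlay them on a single chart is sketched below. The data is invented, the column names and event labels are assumed, and the particular colors and marker size are arbitrary choices rather than the ones used in the session.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Assumed event labels, each paired with an arbitrary color.
HIT_COLORS = {"Single": "b", "Double": "g", "Triple": "m", "Home Run": "r"}

# Invented stand-in data; all values are made up.
MLBmap = pd.DataFrame({
    "event": ["Single", "Double", "Triple", "Home Run", "Single", "Groundout"],
    "our.x": [-40.0, 120.0, 180.0, -150.0, 60.0, 30.0],
    "our.y": [150.0, 260.0, 330.0, 380.0, 170.0, 60.0],
})

# Four panels in a row, sharing axes so the locations are directly comparable.
fig, axes = plt.subplots(1, 4, figsize=(16, 4), sharex=True, sharey=True)
for ax, (hit_type, color) in zip(axes, HIT_COLORS.items()):
    subset = MLBmap[MLBmap["event"] == hit_type]
    ax.scatter(subset["our.x"], subset["our.y"], s=5, c=color, marker=".")
    ax.set_title(hit_type)

# The same subsets drawn on one chart, one color per hit type.
fig_all, ax_all = plt.subplots(figsize=(5, 5))
for hit_type, color in HIT_COLORS.items():
    subset = MLBmap[MLBmap["event"] == hit_type]
    ax_all.scatter(subset["our.x"], subset["our.y"], s=5, c=color,
                   marker=".", label=hit_type)
ax_all.legend()
ax_all.set_title("All hits")
```

Sharing the x and y axes across panels matters here: without it each panel would rescale to its own data, and the differences in distance between singles, doubles, triples and home runs would be hidden.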
Now we can compare that to all hits combined (singles, doubles, triples, and home runs) and see how they compare. This is the combination of all hits. Again, you can see that the two maps are somewhat complementary: hits fall in the spaces where you don't get outs, and vice versa. Again, we can put that onto a single figure, first alongside each other and then combined on the same chart. The complementary nature of hits and outs comes through very clearly in this.
Having done that, we can go on to look at some subsets of the data, depending on different characteristics associated with the teams or the players. One thing we can do is look at different stadiums. I'm just going to quickly show you a list of all of the stadiums that are in our data and the number of events at each stadium in that season. It is often said that the size of the park matters. For example, Tropicana Field is the smallest ballpark and Dodger Stadium is the largest. We can contrast the distribution of events at Tropicana Field with the distribution at Dodger Stadium. There they are separately. Now we can combine them in a single chart here. Actually, in many ways the distribution of hits doesn't look that different. The data we're looking at here is probably not fine enough to draw out the distinction in the way these two fields operate. But that doesn't mean you couldn't do some quite interesting comparisons for other fields, which you can easily do with this data.
Now let's do some comparisons among the players. We can first sort the players by the number of at-bats for each player. You can see here, for example, that Justin Turner had the most at-bats in the season. Then fourth in this list is Nick Markakis.
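The subsetting behind these comparisons can be sketched as follows. Everything here is illustrative: the data is invented, the column names are assumed, and the list of out labels in particular is a guess at how outs might be coded. Counting rows per batter is also only a rough proxy for at-bats, since events such as walks are not at-bats.

```python
import pandas as pd

HIT_EVENTS = ["Single", "Double", "Triple", "Home Run"]
OUT_EVENTS = ["Groundout", "Flyout", "Lineout", "Pop Out"]  # assumed labels

# Invented stand-in data; all values are made up.
MLBmap = pd.DataFrame({
    "stadium": ["Park A", "Park A", "Park A", "Park B", "Park B", "Park A"],
    "batterName": ["Batter 1", "Batter 2", "Batter 1",
                   "Batter 1", "Batter 3", "Batter 2"],
    "event": ["Single", "Groundout", "Home Run", "Flyout", "Double", "Strikeout"],
})

# Split balls in play into hits and outs by event label.
hits = MLBmap[MLBmap["event"].isin(HIT_EVENTS)]
outs = MLBmap[MLBmap["event"].isin(OUT_EVENTS)]

# Number of events at each stadium and for each batter, largest first.
stadium_counts = MLBmap["stadium"].value_counts()
batter_counts = MLBmap["batterName"].value_counts()
print(stadium_counts)
print(batter_counts)

# Hits at one stadium, ready to be drawn with the same scatter calls as before.
park_a_hits = hits[hits["stadium"] == "Park A"]
print(len(hits), "hits,", len(outs), "outs,", len(park_a_hits), "hits at Park A")
```

Because `value_counts()` already sorts from most to fewest, the first few rows of `batter_counts` are the busiest batters in the data.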
We're going to compare these two players because one interesting difference between them is that Turner is a right-handed batter and Markakis is a left-handed batter. We can draw a chart to compare where these two players hit the ball. Here is the plot for Turner in terms of where his hits landed, and here is the plot for Markakis. What's interesting is that the maps are mirror images of each other. Perhaps that's not surprising for a lefty and a righty. They're hitting to different parts of the park, reflecting their different stances and the way they hit the ball. If we put them together, we can see that although they're both hitting a lot of balls into the center of the field, Markakis tends to hit more balls into left field, but when the ball goes into right field, it goes further. For Turner the opposite is true. Let's see that again when we put the three plots alongside each other.
One thing that's interesting about that is that both Markakis and Turner tend to be opposite-field hitters. They tend to hit the ball more frequently to the side of the field opposite to the side of the plate on which they're standing. Often, in fact, players are pull hitters: they tend to pull the ball onto the same side of the field as where they're standing. It's said that being an opposite-field hitter is better than being a pull hitter.
That's interesting if we now compare all lefties with all righties, which we can do here. That's the hit map for all the lefties, and that's the hit map for all the righties. Then we can put them all together in three plots. One thing that seems striking in this is that, generally, lefties and righties do tend to be pull hitters. That makes Turner and Markakis stand out. But of course, they're also two of the best players in that season.
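A comparison of left-handed and right-handed batters in three panels (lefties, righties, both together) could be sketched like this. The data is invented, and the `stand` column with values `"L"` and `"R"` is an assumed way of recording batting side. `draw_hit_map` is a hypothetical helper written for this sketch, not part of any library.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

HIT_EVENTS = ["Single", "Double", "Triple", "Home Run"]

# Invented stand-in data; all values are made up.
MLBmap = pd.DataFrame({
    "batterName": ["Righty A", "Lefty B", "Righty A",
                   "Lefty B", "Righty C", "Lefty D"],
    "stand": ["R", "L", "R", "L", "R", "L"],   # batting side (assumed column)
    "event": ["Single", "Double", "Groundout", "Single", "Home Run", "Strikeout"],
    "our.x": [80.0, -90.0, 20.0, -60.0, 140.0, 0.0],
    "our.y": [160.0, 250.0, 50.0, 150.0, 380.0, 0.0],
})


def draw_hit_map(ax, data, color, label):
    """Hypothetical helper: scatter one group's hit locations on the given axes."""
    ax.scatter(data["our.x"], data["our.y"], s=5, c=color, marker=".", label=label)


hits = MLBmap[MLBmap["event"].isin(HIT_EVENTS)]
lefties = hits[hits["stand"] == "L"]
righties = hits[hits["stand"] == "R"]

fig, (ax_l, ax_r, ax_both) = plt.subplots(1, 3, figsize=(15, 5),
                                          sharex=True, sharey=True)
draw_hit_map(ax_l, lefties, "b", "Lefties")
ax_l.set_title("Lefties")
draw_hit_map(ax_r, righties, "r", "Righties")
ax_r.set_title("Righties")
draw_hit_map(ax_both, lefties, "b", "Lefties")
draw_hit_map(ax_both, righties, "r", "Righties")
ax_both.set_title("Combined")
ax_both.legend()
```

Filtering on `batterName` instead of `stand` gives the same three-panel comparison for two individual players.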
Perhaps that helps to explain why they're so good: they don't suffer from the same problems that many batters tend to suffer from on average. We've shown here that we can use these simple scatter plots to generate hit maps, which are quite informative about different aspects of the game. There are other aspects we could look at here. But overall, this shows how we can use plots to better understand what's going on on the field. We've now looked at cricket and baseball. Finally, we're going to look at plotting some data for basketball.