In this lecture, we'll learn about histograms. To build a histogram, we take the range of values covered by the data points in our data set and break that range into a set of bins. We then count how many data points fall into each of those bins and plot those counts as a set of bars, one bar per bin. What are histograms good for? They help us actually see the shape of a distribution. Let's go look at a few.
You should recall that we saw these distributions when we talked about mean and standard deviation, and although we didn't use the word at that time, these are histograms: counts of values that fall into particular bins. Here's another example. This is a histogram of the heights of a set of black cherry trees, and you can see that each of the bins in this particular histogram represents a range of five feet. Typically, the way it works is that the leftmost bin, which goes from 60 to 65, really goes from 60 to 64.9999999, the next one goes from 65 to 69.9999, and so on. So we can actually see the shape of a distribution by looking at the histogram.
By the way, I'm putting the citations at the bottom of each histogram, so you can see the URL from which I retrieved it. This next one is a histogram of tip amounts at a particular restaurant, and you can see that the most common tip is $2, though $2 and $3 are around the same, not quite equal. One of the interesting things we discover with histograms is that the bin size we select can really affect the information the histogram shares. With the same set of data, we can make smaller bins. This is the same set of data, and we suddenly see a pattern that we didn't see in the previous histogram: there are spikes at the whole-dollar amounts and at the $0.50 amounts.
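As a rough sketch of the bin-and-count procedure described at the start of this lecture, here is one way it could be written in Python. The function name `histogram_counts` and the tree heights are made up for illustration; they are not the data behind the black cherry tree histogram. Each bin includes its left edge and excludes its right edge, which is the "60 to 64.9999" idea.

```python
import math
from collections import Counter

def histogram_counts(values, bin_width, start=0):
    """Count values into bins [start + k*w, start + (k+1)*w).

    Each bin includes its left edge and excludes its right edge,
    so with a width of 5 starting at 60, a value of exactly 65
    lands in the 65 bin, not the 60 bin.
    """
    counts = Counter()
    for v in values:
        k = math.floor((v - start) / bin_width)  # index of the bin v falls in
        counts[start + k * bin_width] += 1       # key is the bin's left edge
    return dict(sorted(counts.items()))

# Made-up heights in feet, for illustration only.
heights = [60, 63, 64.9, 65, 69, 70, 72, 74.5, 75, 80]
counts = histogram_counts(heights, bin_width=5, start=60)

# Print one text "bar" per bin.
for left, n in counts.items():
    print(f"{left:>5.1f} to <{left + 5:.1f}: {'#' * n}")
```

Calling the same function with a smaller `bin_width` is the code equivalent of redrawing the histogram with narrower bins.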
Messing around with, or, that's probably not the right way to say it, adjusting the bin sizes to evaluate the data is a good way to tease out interesting patterns that don't necessarily show up with your first gut instinct for bin size.
The last histogram we'll look at today is the percentage of American households earning a particular dollar value, and the interesting thing about this particular histogram is that the ranges of dollar values are not consistent. The yellow bar that goes from 35 to 50 covers $15 thousand, the red bar next to it covers a different width, $25 thousand, as does the blue bar next to that. But then we jump to $50 thousand for the orange bar, and so on. This is an atypical way to do a histogram. Usually we select bin sizes that are consistent so that we can see the full pattern in the data; it sort of obscures some of the pattern if you don't use a consistent bin size. But we can still see the shape of the distribution of US income from this histogram, and based on the bar heights we can certainly see that people making $150,000 or more are rarer than those making less than $15,000, or $15,000 to $25,000, and so on.
The median for this data set is $53,567. The median is the middle value, as compared to the mean, which is the arithmetic average, and for highly skewed distributions the median and the mean are significantly different. We discovered that outliers, CEO salaries for example, can have a very large impact on the mean, whereas taking the 100 CEOs out of those 130 million data points would shift the median minimally, if at all, in terms of the salary amount.
So those are a few example histograms. Let's go to Google Sheets so I can show you how to make a histogram there. Here in Google Sheets, I have a distribution of grades from an anonymous course on an anonymous assignment. As you can see, I happen to have these grades sorted in ascending order.
So they start at zero, and they go up to 100. To insert a histogram, I select the cells that I want included in whatever chart I'm making; this one happens to be a histogram. So I select all the data cells, pick Insert, and down below Function is Chart. Once I select Chart, there are a number of recommendations that Google Sheets provides to me. There are also a variety of chart types that I can select from, so you can add all kinds of different charts using Google Sheets. I'm going to add a histogram. So I've selected the histogram, and I'll just click the Insert button. If I scroll up in my spreadsheet, you can see it added that histogram, and I can grab it and slide it away from the data if I choose to do so. Remember what I said about where the bins start: this rightmost one starts at 100 and goes up to 119.999, so these are actually all scores of 100 here in this histogram.
Now, I did say that we can perhaps get some more insight into the data by changing the bin size. You can adjust the bin size in Google Sheets by right-clicking on the histogram and selecting Advanced edit. All the way down at the bottom of those editing capabilities, we can change the bucket size. I've been using the term "bin", but people also use the term "bucket". So I can change this to 10 instead of 20, which is what it currently is. I'll just update the histogram to use those new bin sizes, and as you can see, making those buckets smaller gives us a little more insight into the distribution. We still see a huge number of hundreds over here on the right, but we also see some gaps. So this is most definitely not a normal distribution, where we'd have the mean in the middle and nice curves on both sides. This is most definitely skewed toward 100, with some gaps down here in the lower scores.
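To make the effect of changing the bucket size from 20 to 10 concrete, here is a small Python sketch. The scores are invented for illustration and are not the grades in the spreadsheet, and the helper `bucket_counts` is a hypothetical name; this is not a description of how Google Sheets computes its buckets internally.

```python
from collections import Counter

def bucket_counts(scores, bucket_size):
    """Count integer scores into buckets keyed by each bucket's lower edge."""
    counts = Counter((s // bucket_size) * bucket_size for s in scores)
    return dict(sorted(counts.items()))

def print_histogram(scores, bucket_size):
    """Print one text bar per bucket, including empty buckets (the gaps)."""
    counts = bucket_counts(scores, bucket_size)
    for left in range(0, max(scores) + 1, bucket_size):
        bar = "#" * counts.get(left, 0)
        print(f"{left:>3} to <{left + bucket_size:<3}: {bar}")

# Invented assignment scores, sorted ascending, for illustration only.
scores = [0, 35, 42, 68, 71, 75, 88, 90, 95, 100, 100, 100, 100, 100]

wide = bucket_counts(scores, 20)    # bucket size 20
narrow = bucket_counts(scores, 10)  # bucket size 10

print_histogram(scores, 20)
print()
print_histogram(scores, 10)
```

With these made-up scores, every bucket of size 20 has at least one score in it, while the buckets of size 10 expose empty ranges in the lower scores. The pile of 100s on the right shows up either way.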
To recap, in this lecture, we saw how histograms can help us see the shape of the distribution of the data in our data set.