In this lecture, we'll demonstrate how to incorporate binary and categorical features into regression problems. I'll also compare the benefits of various feature representation strategies.

So far, we've looked at various regression models that incorporate different kinds of features, including PM2.5 levels, temperature, wind speed, air pressure, height, weight, and so on. What you might notice those features have in common is that they are all essentially numerical quantities. So what would happen if we tried to build regression models that instead incorporated features like the following: height as a function of gender, preferences as a function of geographical region, or product demand as a function of season? It's not so clear how we would represent gender, geographical region, or seasonality using numerical features like the ones we've been using so far.

So, for example, could we model how height varies with gender? This is a picture I took from a previous slide, where we were trying to model the relationship between weight and height. We saw a bunch of observations of users' weights and heights, and we tried to fit a line of best fit between them. Would that work for gender? If you stare at that picture for long enough, you'll see that it doesn't quite make sense. We're unlikely to have a dataset where the x-axis consists of a continuum of different gender values, so coming up with a line of best fit doesn't really apply. Given that, how can we deal with this type of data, which ought to be fairly common, within a linear regression framework?

On a real dataset, collected gender values might look something like the following. If you've ever been asked to specify a gender in a drop-down, you might have seen options like male, female, other, or not specified. But for the moment, to set up the simplest version of the problem, let's imagine we have binary values, where gender is just male or female.

So how would we build a model equation for a problem like this? We'd like to predict height from gender, so we want something along the lines of height = theta 0 + theta 1 times gender. Of course, this equation doesn't really make sense as written: gender is not a number, so how can we plug it into this equation? We'll need some kind of encoding. There are many possible encodings, and we'll see the advantages and disadvantages of each. A very simple encoding might look like the following: if you're male, we map your gender to the number 0; if you're female, we map your gender to the number 1.

Okay, and if we do that, what would our model equation look like? It would say height = theta 0 if you're male, whereas height = theta 0 + theta 1 if you're female. Now we can try to fit that model, solving for the best or most predictive values of theta 0 and theta 1. This looks something like the following. We would have a bunch of observations of different heights and genders scattered along these lines: all of the genders would take one of two values, and we would have some variation in measured heights for each. The values we'd be trying to fit would look like this: the predicted male height is theta 0, whereas the predicted female height is theta 0 + theta 1.
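To make the binary case concrete, here's a minimal sketch in Python of fitting theta 0 and theta 1 with NumPy's least-squares solver. The data values here are made up purely for illustration:

```python
import numpy as np

# Hypothetical toy data: gender encoded as 0 (male) or 1 (female);
# heights (in cm) are made up for illustration.
gender = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
height = np.array([178.0, 182.0, 175.0, 165.0, 162.0, 168.0])

# Design matrix: a column of ones (the offset) and the binary gender feature,
# so the model is height = theta0 + theta1 * gender.
X = np.column_stack([np.ones_like(gender), gender])
theta, *_ = np.linalg.lstsq(X, height, rcond=None)

print(f"theta 0 (predicted male height):   {theta[0]:.1f}")
print(f"theta 1 (female offset vs. males): {theta[1]:.1f}")
```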
Those would be the predictions made by our model. So in this case, theta 0 represents, roughly speaking, the average height for males, whereas theta 1 is an offset or difference: how much taller (or, if it's a negative number, how much shorter) females are than males on average. So really, we're still fitting a line between these two data points. This is still an example of linear regression; we're finding a line of best fit that best captures this data. We've just used the framework to handle non-numerical, or in this case binary, quantities.

Okay, so let's make things a bit more complicated. What would happen if we had more than two values? If we had four possibilities for gender (male, female, other, or not specified), could we apply exactly the same approach to predict height as a function of gender? Again, we'll write height = theta 0 + theta 1 times gender, and this time we'll need to come up with a different encoding of gender. We might try roughly the same idea, where we'd say gender = 0 for male, 1 for female, 2 for other, and 3 for not specified. And what would be the consequences of that? Our model equation would look like the following: height = theta 0 for male, theta 0 + theta 1 for female, theta 0 + 2 theta 1 for other, and theta 0 + 3 theta 1 for not specified. Our line of best fit would then look like this: we would have four possible values for gender, heights scattered around each value, and these four predictions: theta 0, theta 0 + theta 1, theta 0 + 2 theta 1, and theta 0 + 3 theta 1.

All right, so in principle this seems okay, and it is a valid model, but if you look at it closely, it won't be very effective. It's really assuming that the difference between male and female must somehow be equivalent to the difference between female and other, because all of these predictions step down by a constant amount; in other words, they fall on a line. But there's really no reason this should be the case. For instance, this type of model would not allow us to fit data that looked more like this, where we had one height for females, one height for males, a different height for other, and a different height for not specified, with no single line connecting those four heights.

So how would we go about capturing data like that? We need a more sophisticated encoding of our gender values. Imagine something like the following, where we say: height = theta 0 for male, theta 0 + theta 1 for female, theta 0 + a different parameter theta 2 for other, and theta 0 + theta 3 for not specified. Certainly this has made the model more complex: I now have four unknowns (theta 0, theta 1, theta 2, theta 3) rather than just two. But this is still an example of a linear regression model. I can write it out as an inner product between my parameters theta and my features (female, other, and not specified), each of which is going to be a binary feature.

Okay, so to make that a little bit clearer, we can write it out like this, where the prediction is the inner product between theta and some feature vector. The feature vectors are given at the bottom of the slide: the feature vector is [1,0,0] for female, [0,1,0] for other, and [0,0,1] for not specified.
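Here's the same kind of sketch extended to the four-category case, again with made-up data. Notice how males get the all-zeros feature vector, so their prediction comes entirely from the offset:

```python
import numpy as np

# Hypothetical toy data over four gender categories; heights are made up.
labels = ["male", "male", "female", "female", "other", "not specified"]
height = np.array([178.0, 182.0, 165.0, 162.0, 170.0, 172.0])

# One-hot encode everything except "male", which the offset absorbs:
# female -> [1,0,0], other -> [0,1,0], not specified -> [0,0,1], male -> [0,0,0]
encoded = ["female", "other", "not specified"]
def one_hot(label):
    return [1.0 if label == c else 0.0 for c in encoded]

# The leading 1.0 is the offset feature multiplying theta 0.
X = np.array([[1.0] + one_hot(l) for l in labels])
theta, *_ = np.linalg.lstsq(X, height, rcond=None)

print(f"male prediction (theta 0): {theta[0]:.1f}")
for c, t in zip(encoded, theta[1:]):
    print(f"{c}: theta 0 + {t:.1f} = {theta[0] + t:.1f}")
```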
And for males, the feature vector would just be all zeros, because the male prediction is determined by theta 0, the offset feature.

Okay, so this is a concept that's going to show up a lot when we deal with binary or categorical features: the concept of one-hot encoding. Here, we had three different values we wanted to encode (female, other, and not specified), and we encoded them using a three-dimensional feature vector that is all zeros except in a single position; hence the name "one-hot": all zeros except for a single one. Note that we were able to capture four possible categories (male, female, other, and not specified) in this representation, but we actually only needed three dimensions to do so. A separate male feature would be redundant, since the male prediction is already captured by theta 0, the offset parameter.

So this approach is going to show up a lot. We're going to use it to capture a variety of categorical feature types, and we can also use it to capture objects that belong to multiple categories, simply by setting multiple values in the encoding to one rather than just a single value.

Okay, so to summarize: in this lecture, we described how to capture binary and categorical features in linear regression models, and we introduced the important concept of a one-hot encoding. On your own, try to come up with one-hot encodings for different feature categories. For example, try coming up with an encoding to represent the set of categories a business belongs to, or the set of a user's friends on a social network.
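As a starting point for that exercise, here's a minimal sketch of the "multi-hot" variant mentioned above, where an object belonging to several categories gets a one in each corresponding position. The category names here are made up:

```python
# A minimal sketch of a "multi-hot" encoding for objects that belong to
# several categories at once; the category names are illustrative only.
categories = ["restaurant", "bar", "cafe", "bakery"]

def multi_hot(belongs_to):
    # Set a 1 in every position whose category the object belongs to.
    return [1.0 if c in belongs_to else 0.0 for c in categories]

# A business that is both a cafe and a bakery:
print(multi_hot({"cafe", "bakery"}))  # [0.0, 0.0, 1.0, 1.0]
```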