[MUSIC] Hi, from Jupiter, Florida, I'm Garfield Fisher. I'm here in a Jupiter, the home of the Cardinals' and Marlins' spring training. Over the last few days, I've been asked lots of questions about how can we predict ticket sales, how can we predict attendance rates. And that got me thinking that we have an opportunity, with the 2013 Housing Affordability Data System data, to be able to ask a similar question. And the question that I would ask is this, what factors predict or contribute to the housing or the current market value of single family units? I look forward to your results, I look forward to how you explain those results in a way that anyone can understand. Thank you so much, bye-bye. >> The key key variable of interest in this question is the current market value of the housing unit, which is the value variable in your data files. You need to build a model that'll help associate this variable, which is the market value of the housing unit, to various other variables observed in the data. This calls for a regression model to be built, with the VALUE variable as the dependent variable. To make things a little simpler and uniform, we will deal with the 2013 data only. Remember, Garfield also alluded to this. Further, we will only be considering single family housing. This implies that you will have to choose, from the 2013 data file, only those housing units that have the TYPE variable as 1, and STRUCTURETYPE variable also as 1. You need to delete all housing units that do not have a value of 1 for both these variables. Some more cleansing of data that is required is as follows. As has been mentioned in previous lessons, the VALUE variable, which is the current market value of the housing unit, may have some negative values or ridiculously low values. You need to delete all housing units that have a negative number for the VALUE variable, or a number which is less than $1,000. This would automatically also delete all data on housing units which are rental units, that is, units that have a value of quote, unquote two for their own rent variable. So these are the basic data cleaning steps you need to take. Once again, choose only those housing units that have a market value which is $1,000 or greater. That is, the VALUE variable is greater or equal to $1,000. Amongst them, choose only those that are single family units. That is, the STRUCTURETYPE variable = 1. Then again, from this set, choose only those units that have the TYPE variable = 1. After these basic data cleaning jobs are done, you need to focus attention on the key variable of interest, which is the VALUE variable. You may want to calculate various descriptive statistics, such as the mean, median, standard deviation, etc for this variable. You may have done that in the previous exercises, however, you may want to repeat it. Before you use the VALUE variable in a regression model, it is helpful to visualize its empirical distribution. Remember, regression works best when the dependent variable is closer to a normal distribution, or the bell curve. Please plot a histogram of this VALUE variable. Also, plot a histogram of the log transform VALUE variable, that is, a histogram of the natural logarithm of the VALUE variable. Which of these two histograms more closely resembles a bell curve? The answer to this question should guide your choice of transformation for dependent variable. So having fixed your dependent variable, first by cleaning it, and then by considering whether or not to use a natural logarithmic transformation, the next very important step is to select your set of independent variables, or the X variables. This selection depends on you, and I'll leave it to you to select an appropriate set of independent variables. However, I will outline some important tips that may help you in selecting variables. Firstly, remember that Excel cannot run a regression with more than 16 independent or X variables. So the total number of independent variables, including the dummy variables, cannot exceed 16. Secondly, many of the potential independent variables are categorical in nature. If you use these variables, you will need to appropriately code them as dummy variables. Further, you may want to collapse certain categories. For example, if you consider the variable METRO3 in the data, this variable gives the metropolitan status of the area where the housing unit is located. The way this is coded is that this variable takes a value '1', within single quotes, when it is a central city area, and takes a value '2', '3', '4', or '5', all within single quotation marks, when the area is not a central city area. So for this variable, you can collapse the categories '2', '3', '4', and '5' all into a single category. The categorical variable will then end up with just two categories. Thirdly, you may want to try out various transformations, such as the natural log transformation, or some other transformation on some of the X or independent variables. The aim being to get a better-fitting regression model, as indicated by the R-squared. Finally, the most important step is to think on the appropriateness of using a particular independent, or X variable, in this model. You do not have to use all the variables, rather, you need to carefully think what variables to include. Remember, you'll be using the estimated regression model to prepare a brief report on factors associated with the market value of housing units. So think about the variables and the justification to use them in the regression model. The report that you submit, at the minimum, needs to have the following, the set of variables you used in the regression model and a brief justification for their use. You need to provide the estimated regression model output from Excel. If you have taken certain transformations of variables, you need to briefly explain that. And you need to interpret the impact of various variables included in your model on the market value of housing units.