All right. So in this lecture, we'll demonstrate how to set up classification problems in Python. Previously, we introduced the mathematical derivation of various classification schemes and saw some of the advantages and disadvantages of each. In this case, we'll see how to actually implement them on some real data, and in particular we'll introduce the logistic regression model from the sklearn library, which also contains a variety of other linear classification and regression schemes.

Okay. So today we will look at some data from the UCI repository again. It's a different dataset this time: a dataset on Polish companies going bankrupt. It contains, for a bunch of companies, various attributes along with a measurement of whether each company went bankrupt at a certain point in time. So again, this is just a very clean, simple classification dataset to work with before we look at more complex datasets later on. Basically, we have a single binary measurement saying whether, at a certain point in time, a company went bankrupt, given a bunch of simple-to-use real-valued features.

Okay. So the first thing we have to do is read in our dataset. It's mostly in CSV format, but it does contain some weirdness. Basically, it has this header we first need to skip, so I look for where the header ends in the dataset. This is a fairly simple approach, but you have to read the documentation of the dataset to understand these details. Essentially, the header ends and the real data begins after we see this @data tag. Okay. Now we read in the CSV data, and we perform some simple pre-processing as we go. Like we did in previous lectures, first we skip any rows that have missing entries, then we convert all fields to floats for our features, and we convert the label itself to a bool, so it's going to be true or false, describing whether a particular company went bankrupt. Okay. So let's look at some statistics of our dataset.
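The loading step described above might be sketched like this. The exact file name, the missing-value marker ("?"), and the label encoding are assumptions about this particular dataset; check its documentation and adapt as needed.

```python
# Minimal sketch of the parsing described above. The file name and the
# exact column layout are assumptions; adapt them to the file you downloaded.
import csv

def parse_dataset(path):
    with open(path) as f:
        # Skip the header: the real data begins after the "@data" tag
        while not f.readline().strip().lower().startswith('@data'):
            pass
        dataset = []
        for row in csv.reader(f):
            if '?' in row:                           # skip rows with missing entries
                continue
            features = [float(v) for v in row[:-1]]  # convert all fields to floats
            label = row[-1] == '1'                   # convert the label to a bool
            dataset.append((features, label))
    return dataset
```

From the parsed list it is then easy to count the number of samples (`len(dataset)`) and the number of positive samples, i.e. how many companies went bankrupt.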
Here we just see the number of samples after we've discarded missing values, and the number of those samples that are positive; in other words, how many companies did go bankrupt. Okay. So the first thing we do is extract our features X and our labels y, much as we would for a regression problem. Next, we want to introduce some library functions that are actually going to help us solve these classification problems. So far we've set up everything we need: we have our vector of labels y, and we have our matrix of features X. Now we need some library function to help us choose the best value of theta for a given classification scheme. So the sklearn library contains a number of different regression and classification models. In particular, it contains everything we've seen so far: it has regular logistic regression, it has linear regression, which we saw in previous classes, and it also has the support vector machine classifier. In this lecture, we're going to focus on the logistic regression model only.

Okay. So, to fit the model, first we have to import the library itself and create an instance of the model, which is just an instance of a class. So we import linear_model from sklearn, create our model as an instance of LogisticRegression, and run its fit function, which takes the data X and the labels y — that is, our matrix of features and our vector of labels. You might notice that running this function doesn't actually produce any output; rather, it just updates some variables inside that class instance in order to store the model it has learned. Similarly, we can use the model stored in this class object to make predictions. So, given our matrix of features X, or possibly even a matrix of features associated with new observations, we can use the model to predict what labels they should have. That's going to give us a vector whose length equals the number of rows of this matrix.
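The fit-then-predict workflow described above can be sketched as follows; here a small synthetic X and y stand in for the real bankruptcy data, so the specific values are purely illustrative.

```python
# Minimal sketch of the sklearn fit/predict workflow described above,
# using a tiny synthetic dataset in place of the bankruptcy data.
import numpy as np
from sklearn import linear_model

# Toy feature matrix (one row per company) and binary bankruptcy labels
X = np.array([[0.1, 1.2], [0.4, 0.9], [2.5, 0.2], [3.0, 0.1]])
y = np.array([False, False, True, True])

model = linear_model.LogisticRegression()
model.fit(X, y)   # produces no output: the learned theta is stored inside the object

predictions = model.predict(X)   # one prediction per row of X
print(len(predictions))          # same length as the number of rows of X
```

Note that `fit` returns the estimator itself rather than printing anything; the learned parameters live in attributes of the model object.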
Which just says, for every single row of this data, what prediction does the model make? All right. Now we can compute the error, or the accuracy, of our model just by checking whether the predictions made by the model actually match the labels. So here I'm doing this using some fairly terse syntax, but I'm basically just asking: in what cases is my vector of predictions equal to my vector of labels y? That comparison is applied element-wise, so I get a vector of correct predictions that says, for which data points was the prediction made by my model the correct prediction. We can then compute the accuracy: given this vector of correct predictions, the accuracy is the fraction of times the elements of that vector are equal to true, in other words, the fraction of times we made the correct prediction. That is, take the sum and divide by the length to get the accuracy of my classifier. It seems to have been fairly accurate.

All right. So there was one detail I talked about in a previous lecture: training versus testing. I seem to have gotten a fairly high accuracy here by using a simple classifier off the shelf. But note that I'm cheating a little bit: I'm evaluating my classifier on exactly the same data that was used to train it. So, conceivably, our model could have learned to memorize what those features and labels looked like, and it would actually not be very effective if I tried to apply it to new or unseen data. So, if I really want to evaluate the quality of my classifier, what I would like to ask is, "How well will it work on new data that wasn't seen in order to train the model?" This is something we'll look at in more depth in the next course, where we discuss training, testing, and validation of models. Okay.
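The element-wise accuracy computation described above looks roughly like this; the particular predictions and labels here are hypothetical stand-ins.

```python
# Sketch of the accuracy computation described above: compare predictions
# to labels element-wise, then take the fraction that are correct.
import numpy as np

y = np.array([True, False, False, True, False])            # hypothetical labels
predictions = np.array([True, False, True, True, False])   # hypothetical predictions

correct = predictions == y               # element-wise comparison -> boolean vector
accuracy = sum(correct) / len(correct)   # sum of True entries divided by the length
print(accuracy)                          # 0.8
```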
So, in this example, we just showed how to do logistic regression using sklearn, but there are all kinds of other classifiers available that have fairly similar interfaces to what I described, where you just give them a matrix of features X and a vector of labels y. For example, svm.SVC contains the support vector machine classifier. There are also other classifiers I haven't discussed, like decision trees and naive Bayes, and even the simple classifier introduced a few lectures ago called nearest neighbors. They are all implemented inside the sklearn library, and you can see the link at the bottom of this slide for additional comparisons between these classification schemes.

Okay. So, just to summarize: in this lecture, we introduced the sklearn library and showed how it can be used to set up a simple classification problem in Python, in particular using logistic regression. So I would suggest, on your own, looking at some of the other classification datasets from the UCI repository, seeing if you can run some of these same classification schemes, and maybe also comparing the outputs of different types of classifiers on this kind of data.
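To illustrate how similar those interfaces are, here is a sketch that swaps several sklearn classifiers in and out on the same toy data; the tiny X and y are assumptions for demonstration only (and `n_neighbors=1` is set just because the toy dataset is so small).

```python
# Sketch of the shared fit/predict interface across sklearn classifiers.
import numpy as np
from sklearn import linear_model, svm, tree, naive_bayes, neighbors

# Toy data standing in for a real dataset
X = np.array([[0.0, 1.0], [0.2, 0.8], [1.0, 0.1], [0.9, 0.0]])
y = np.array([False, False, True, True])

classifiers = [
    linear_model.LogisticRegression(),
    svm.SVC(),                                      # support vector machine
    tree.DecisionTreeClassifier(),                  # decision tree
    naive_bayes.GaussianNB(),                       # naive Bayes
    neighbors.KNeighborsClassifier(n_neighbors=1),  # nearest neighbors
]

for clf in classifiers:
    clf.fit(X, y)                       # identical interface for every model
    print(type(clf).__name__, clf.predict(X))
```

Because every estimator exposes the same `fit`/`predict` methods, comparing classifiers on a dataset is largely a matter of swapping out one line.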