Hello, and welcome to this lesson on feature selection.
Feature selection is a very subtle,
but important concept where we can improve the performance of machine learning algorithms
by allowing those algorithms to focus on
those features that actually contain the most predictive power.
This seems like a fairly simple idea, and yet it's very important,
because if you have
100 features and only 10 of them are actually necessary to make the prediction,
the other 90 can lead your algorithm astray
and cause challenges that you may have trouble overcoming.
Another issue to keep in mind with feature selection
is its ethical implications:
we need to make sure a machine learning model is not discriminating against or biasing
certain populations based on their features
or the values that exist for those features.
So for instance, you often do not want to use gender in a machine learning algorithm,
unless there's an explicit business-driven reason why you would need to do that.
Another example is race.
You don't want to use it,
because you don't want to cause problems that might lead you into ethical dilemmas.
So you always need to be careful with feature selection
that you're not creating new features
that encode that information in a way that's difficult to find.
But in this lesson we're going to
introduce you to the specific concepts of feature selection
and the specific techniques that you might use, such as
selecting the k best,
selecting percentiles, and recursive feature elimination.
You should be able to apply feature selection by using the scikit-learn library.
Now, this particular lesson focuses on this course notebook and
this notebook is going to go through
many different ways that you can apply feature selection.
We've broken this down by first talking about the data we're
going to use to perform the feature selection.
Then there are statistical tests that you can use to
determine the best features to keep for use in your machine learning algorithm.
Then there are what we call univariate techniques.
These are ways to say,
"Let's test this one feature and see whether or not it's going to improve our results."
There's also a similar idea called recursive feature elimination.
Then there's another one where you're saying, "Look,
I have a particular model and I want to use that to
make predictions on which features are the ones I should keep."
A good example is when you are using random forests of
decision trees: you have feature importances, and you can use those to decide which features to keep.
But we also have techniques coming to us from regularization,
where we penalize overly complex models, and so we
may realize that only certain features are
truly necessary to make an accurate prediction.
And the last thing: we are going to show how these can be applied
within a pipeline in order to build
a more efficient framework that performs the full process, from data cleaning and preprocessing
through feature selection, machine learning, and scoring.
The rest of this notebook is pretty similar to what we've seen.
We'll first talk a little bit about the data I want to show you.
I do this because I want you to see
the actual features and labels, so that we're really thinking about
what the different things are that we have.
We'll also look at the handwritten digit dataset.
We then step into statistical tests.
The primary one here is variance thresholding.
If you remember from talking about decision trees,
we talked about wanting to spread the data out in a way that when we make splits,
we're capturing the signal nicely in our child nodes.
One way of doing that is by measuring the variance
and splitting on the feature that has the greatest variance.
So it's the same idea here: we're looking at what the variance is,
how spread out the data is,
and a feature that is spread out is probably something we want to use.
If a feature's values are all bunched up, it's going to be difficult to split on it,
and it's not going to give us a lot of predictive or descriptive power,
because the samples all have roughly the same value.
So variance thresholding is a very easy way to make these comparisons.
One thing we do throughout this notebook
is look at the baseline measurements we make and then see
how the selected features change
based on the technique we use.
So first we just do a variance threshold, and then we realize: wait a minute,
these features actually had different ranges over which they were measured.
So we actually need to normalize first, and
once we normalize, the results change a little bit:
petal width becomes more important than the others.
That means there's a greater variance in
the normalized values of that particular feature.
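To make that concrete, here is a minimal sketch, not the notebook's exact code, of normalizing the iris features and then applying scikit-learn's VarianceThreshold; the 0.05 threshold is just an illustrative choice.

```python
# Minimal sketch: normalize the iris features, then apply a variance threshold.
# The 0.05 threshold is illustrative, not the notebook's exact value.
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

iris = load_iris()
X, y = iris.data, iris.target

# Normalize first so every feature is measured on the same 0-1 scale.
X_scaled = MinMaxScaler().fit_transform(X)

# Keep only features whose normalized variance exceeds the threshold.
vt = VarianceThreshold(threshold=0.05)
vt.fit(X_scaled)

for name, var, kept in zip(iris.feature_names, vt.variances_, vt.get_support()):
    print(f"{name}: variance={var:.3f}, kept={kept}")
```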
We can then move on: we show the same idea with the digit dataset.
And the nice thing about the digit dataset is you actually can visualize it.
Each image in the data is 64 pixels.
So let me come up here to this one because it's easier to see.
There are 8 by 8, or 64 total, pixels,
and of course the digits typically are drawn in the center;
those are the pixels that have the most information.
So typically these techniques are going to highlight those pixels.
So you can imagine that 8 being drawn here,
and a 7, and a 4, and a 6.
You could see why some of these pixels are getting hit more and more,
because there's a lot of variation in them.
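If you want to reproduce that picture yourself, here's a small sketch, assuming matplotlib is available, that plots the per-pixel variance of the 8 by 8 digit images.

```python
# Sketch: show the per-pixel variance of the 8x8 digit images as a heat map.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)   # X has shape (n_samples, 64)

pixel_variance = X.var(axis=0).reshape(8, 8)

plt.imshow(pixel_variance, cmap="viridis")
plt.colorbar(label="variance")
plt.title("Per-pixel variance of the digit images")
plt.show()
```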
The rest of this notebook just walks through
different techniques again on those same two datasets.
One other thing I did want to mention before I go on is this:
we've now taken the digit dataset and written it out as
one big plot that's a single pixel wide
but 64 pixels long, to show how changing, in this case, the threshold
changes which pixels we actually use for our classification.
When we use a very low variance threshold, only those pixels on the edge,
so pixel 0, 1, 2 and then 8,
9, are dropped.
The middle pixels aren't, so you can see most of the middle pixels are
still selected at this particular variance threshold.
As we increase the threshold, however,
we lose more and more pixels because only those with the greatest variance are selected.
So you can see when we've raised the threshold up to this value,
only these pixels out here are selected.
Such an interesting way to visualize the data.
We'll see this in some of the examples as well.
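As a rough sketch of that threshold sweep, not the notebook's exact plot, you can simply count how many pixels survive at a few illustrative threshold values.

```python
# Sketch: sweep the variance threshold and count how many pixels survive.
# The threshold values are illustrative.
from sklearn.datasets import load_digits
from sklearn.feature_selection import VarianceThreshold

X, _ = load_digits(return_X_y=True)

for threshold in [0.0, 2.0, 5.0, 10.0]:
    vt = VarianceThreshold(threshold=threshold)
    vt.fit(X)
    kept = vt.get_support().sum()
    print(f"threshold={threshold:>4}: {kept} of 64 pixels kept")
```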
The rest of this notebook just walks through different techniques.
We have select k best and select percentile.
Both of them can use different scoring functions that compute the mutual information,
the ANOVA F-value,
the chi-squared statistic, et cetera, and those are what allow you to perform the selection.
When you say select the k best,
you have to have a metric by which you measure the k best or the percentile,
and these scoring functions are those measures.
So this section talks about that
and walks you through applying them in different ways.
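Here's a minimal sketch of those two selectors on the iris data; the choice of k=2, the 50th percentile, and the particular score functions are just for illustration.

```python
# Sketch: SelectKBest with the ANOVA F-value and SelectPercentile with chi-squared.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, SelectPercentile, chi2, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest ANOVA F-value.
kbest = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print("F-scores:", kbest.scores_.round(1), "kept:", kbest.get_support())

# Keep the top 50% of features ranked by the chi-squared statistic.
perc = SelectPercentile(score_func=chi2, percentile=50).fit(X, y)
print("chi2 scores:", perc.scores_.round(1), "kept:", perc.get_support())
```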
We also take our iris dataset, add in random noise, and see
how well this technique recovers the actual features relative to the noise.
Generally it does a pretty good job
of pulling those out.
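A quick sketch of that noise experiment might look like the following; the number of noise columns and the random seed are arbitrary choices here.

```python
# Sketch: append pure-noise columns to iris and check whether SelectKBest
# recovers the real features rather than the noise.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)

# Append 6 random-noise features to the 4 real iris features.
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 6))])

selector = SelectKBest(score_func=f_classif, k=4).fit(X_noisy, y)
print("Selected columns:", np.where(selector.get_support())[0])
# Ideally this prints [0 1 2 3], i.e. the original features, not the noise.
```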
We also look at these examples applied to the digit dataset,
and then we move into other types of techniques.
One of the powerful ones is recursive feature elimination,
where we recursively remove attributes as
we try things out and see which features are the most important.
The digit data is a great example for demonstrating that:
we get another nice little map where we can encode the importance,
so here are the most important pixels according to recursive feature elimination.
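Here's a hedged sketch of recursive feature elimination on the digits, using a logistic regression as the underlying estimator; the notebook may use a different estimator or a different number of features to keep.

```python
# Sketch: recursive feature elimination (RFE) on the digits, then a map of the
# resulting ranking. The estimator and n_features_to_select are illustrative.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)

estimator = LogisticRegression(max_iter=5000)
rfe = RFE(estimator, n_features_to_select=16, step=4).fit(X, y)

# ranking_ is 1 for selected pixels and grows for pixels eliminated earlier,
# so reshaping it to 8x8 gives the importance map described above.
plt.imshow(rfe.ranking_.reshape(8, 8), cmap="viridis")
plt.colorbar(label="RFE ranking (1 = selected)")
plt.title("RFE ranking of digit pixels")
plt.show()
```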
We can also then, as we move through this, look at selecting from a model.
I talked about this:
we can use a classifier, and use
the results from that classifier to tell us which are the most important pixels.
Decision tree based techniques are good for that.
We could also use other techniques as well.
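For instance, here's a minimal sketch of model-based selection with scikit-learn's SelectFromModel, using a random forest's feature importances; an L1-penalized linear model would work the same way, and the "median" threshold is just one reasonable default.

```python
# Sketch: keep pixels whose random-forest importance is above the median.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_digits(return_X_y=True)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
selector = SelectFromModel(forest, threshold="median").fit(X, y)

X_selected = selector.transform(X)
print("Pixels kept:", X_selected.shape[1], "of", X.shape[1])
```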
And then lastly, we talk about employing this within
a pipeline and so I think we've gone through a lot of examples here.
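To give you a feel for that, here's a sketch of wrapping variance filtering, scaling, univariate selection, and a classifier into one pipeline and scoring it with cross-validation; the particular steps and the k value are illustrative, not the notebook's exact choices.

```python
# Sketch: a pipeline that chains cleaning (variance filter), scaling,
# feature selection, and a classifier, scored with cross-validation.
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

pipe = Pipeline([
    ("variance", VarianceThreshold()),              # drop constant pixels
    ("scale", StandardScaler()),                    # put pixels on one scale
    ("select", SelectKBest(score_func=f_classif, k=32)),
    ("model", KNeighborsClassifier()),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")
```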
None of these techniques is overly complex or mathematical;
they're all pretty simple and straightforward to understand.
We're making some sort of measurement of
the data and its spread, and we're going from there,
applying that so we can run
interesting and powerful machine learning algorithms using only a subset of the data:
the most important subset of the data.
So with that, I'm going to go ahead and stop.
If you have questions, let us know in the course forum, and good luck.