So, in this video I'm going to talk about k-means,
which is a clustering algorithm.
And, I have a favorite picture I like to use to explain clustering.
If you look at these two plots,
even though there's no labels,
nothing on the axes,
no explanation at all,
your clever human brain is already doing something with the data.
With the one on the left,
I'm sure it will have grouped the points into maybe ten or so clusters,
some of them overlapping a bit.
The one on the right, you will have grouped into two clusters.
One big, one small.
And this process is basically what the k-means algorithm is going to do.
It goes through your data and tries to find centers,
cluster centers that minimize the error, where
the error is defined as the total squared distance from each point to the center of its cluster.
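To make that error concrete, here is a minimal sketch in plain Python. The function names and the toy points are made up for illustration, and it assumes the usual k-means error: the sum of squared distances from each point to its assigned center.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def clustering_error(points, centers, labels):
    """Total squared distance from each point to its assigned center."""
    return sum(squared_distance(p, centers[k]) for p, k in zip(points, labels))


# Toy data (hypothetical): two tight pairs of points, labelled 0 and 1.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
labels = [0, 0, 1, 1]

good_centers = [(0.0, 1.0), (10.0, 11.0)]  # in the middle of each pair
bad_centers = [(5.0, 5.0), (5.0, 6.0)]     # far from both pairs

print(clustering_error(points, good_centers, labels))  # 4.0
print(clustering_error(points, bad_centers, labels))   # 186.0
```

Centers that sit in the middle of their points give a much lower error, and that is the quantity k-means is trying to push down.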
If you are interested in the mathematics,
there will be some links after this video that you can go and explore.
I also recommend you go read the documentation,
if you want more details about the available parameters.
I'm going to do quite a light introduction to k-means here.
It's pretty much this simple.
It takes your training data,
builds a model and then you use predict to find out which sample is in which group.
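As a rough picture of that train-then-predict flow, here is a small plain-Python sketch of the classic assign-and-update loop. It is not H2O's implementation or API; `fit`, `predict`, and the toy data are hypothetical names chosen for this example.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def fit(points, k, iterations=20, seed=0):
    """Return k cluster centers found by alternating assign and update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        # Assign step: put each point in the group of its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Update step: move each center to the mean of its group.
        for i, group in enumerate(groups):
            if group:  # keep the old center if a cluster ends up empty
                centers[i] = tuple(sum(c) / len(group) for c in zip(*group))
    return centers


def predict(points, centers):
    """Cluster index for each point, given already-fitted centers."""
    return [nearest(p, centers) for p in points]


# Toy training data (hypothetical): one blob near (0, 0), one near (9, 9).
train = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.6), (9.0, 9.0), (9.5, 8.8), (8.9, 9.4)]
centers = fit(train, k=2)
labels = predict([(0.2, 0.1), (9.1, 9.2)], centers)
print(labels)
```

The cluster numbers themselves are arbitrary. What matters is that the two new samples land in different groups.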
Now, the most important parameter is k,
the k of k-means.
This is saying how many clusters you want.
That requires you to know something about your data in advance.
The simplest way is to guess.
Another option is that H2O offers an estimate_k parameter.
Go and look it up in the documentation.
The way that works is it will try k is 1,
k is 2, k is 3,
all the way through to the maximum k that you specify.
It's going to take a bit longer to run,
but if you've really no idea how many clusters might be in your data,
that's probably the best thing to try.
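The idea of trying each k in turn can be sketched in plain Python. This is only an illustration: the stopping rule below (stop when one more cluster removes less than 5% of the k = 1 error) and the farthest-first starting centers are assumptions made for this example, not the rule H2O actually uses, and `pick_k` is a hypothetical name.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centers):
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def fit(points, k, iterations=20):
    # Farthest-first start (an assumption of this sketch): each new center
    # is the data point farthest from the centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(squared_distance(p, c) for c in centers)))
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        for i, group in enumerate(groups):
            if group:  # an empty cluster keeps its old center
                centers[i] = tuple(sum(c) / len(group) for c in zip(*group))
    return centers


def error(points, centers):
    """Total squared distance from each point to its nearest center."""
    return sum(squared_distance(p, centers[nearest(p, centers)]) for p in points)


def pick_k(points, max_k, min_gain=0.05):
    """Try k = 1..max_k and return the first k where one more cluster
    removes less than min_gain of the k = 1 error."""
    errors = [error(points, fit(points, k)) for k in range(1, max_k + 1)]
    for k in range(1, max_k):
        if errors[k - 1] - errors[k] < min_gain * errors[0]:
            return k
    return max_k


# Toy data (hypothetical): three well-separated 2x2 squares of points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (5, 9), (5, 10), (6, 9), (6, 10)]
print(pick_k(points, max_k=6))  # 3
```

Fitting a model for every k up to the maximum is why this takes longer than fitting once with a k you already know.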
What else can we say about k-means?
First, as you saw in the original diagram,
the clusters can be very different sizes.
So if that's the way your data is,
you're going to get one big cluster and one little cluster.
K-means also works very well with round clusters.
In two dimensions, we'd call them circles.
So long, strung-out clusters are going to be harder to detect.
And that's kind of related to another disadvantage,
very high-dimensional data.
Everything is going to end up roughly the same distance from everything else.
If you're unfamiliar with this,
go search for the curse of dimensionality.
It's an important machine learning topic anyway.
So, k-means doesn't work so well when the dimension is very high.
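One way to see the problem is to compare the nearest and farthest distances from a single point as the number of dimensions grows. The sketch below uses uniformly random points, which is an assumption made purely for illustration, and `distance_contrast` is a hypothetical helper name.

```python
import math
import random


def distance_contrast(dimensions, n_points=200, seed=0):
    """Ratio of nearest to farthest distance from one random query point.

    A ratio near 0 means some points are much closer than others;
    a ratio near 1 means every point is about equally far away.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dimensions)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dimensions)]
        distances.append(math.dist(query, point))
    return min(distances) / max(distances)


for d in (2, 10, 100, 1000):
    print(d, round(distance_contrast(d), 3))
```

As the dimension goes up, the ratio climbs toward 1, so "nearest center" carries less and less information, and that is what hurts a distance-based method like k-means.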
K-means does work with categories:
H2O will automatically one-hot encode your categorical data.
You just have to be careful because
one-hot encoding can also give you very high-dimensional data.
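To show why, here is a minimal plain-Python sketch of one-hot encoding. It only illustrates the idea; the details of H2O's own encoding may differ, and `one_hot` and the example columns are hypothetical.

```python
def one_hot(values):
    """Return (levels, rows): one 0/1 column per distinct level, in sorted order."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


colors = ["red", "green", "blue", "green"]
levels, rows = one_hot(colors)
print(levels)   # ['blue', 'green', 'red']
print(rows[0])  # [0, 0, 1]

# A column with many distinct values becomes just as many numeric columns.
zip_codes = [f"{n:05d}" for n in range(1000)]
_, wide_rows = one_hot(zip_codes)
print(len(wide_rows[0]))  # 1000
```

A three-level column is harmless, but a single categorical column with a thousand distinct values adds a thousand dimensions on its own.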