Now, because we have developed in this course

a good habit of converting all algorithms into neural networks,

let's keep that up and see how

K-means can be implemented with a neural network.

As usual, we start with the data represented as

an input layer of our network that passes its values to a hidden layer.

Now, we said that neurons in the hidden layer of a

neural network are described in terms of input weights.

These weights can now be used to encode the cluster centers of the K-means method.

If your data has N dimensions, so that the observation X_i is an N-dimensional vector,

then we can describe the positions of

the K centroids mu_k relative to all components of X_i,

in terms of a matrix W,

of size K by N. Each element w_kn of this matrix gives

a normalized distance, or weight, of cluster k in the n-th component of X_i.

Therefore, the k-th row of this matrix gives the weights

of all components of X_i for a chosen cluster.

This provides an encoding scheme for storing cluster center positions,

as positions in a space of weights.

Now, we use these weights for the hidden layer of the network,

and in addition, choose a linear activation function for these units.

So the activation h_k of the k-th hidden neuron

will be the sum of w_kn times x_n over all components n,

which can also be viewed as a scalar product of the row vector w_k and the data vector x.

This scalar product can be seen as a measure of similarity

between the vectors x and mu_k. Now,
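This activation computation can be sketched in a few lines of NumPy; the dimensions here are made-up placeholders, not values from the lecture.

```python
import numpy as np

# Assumed toy sizes: N input dimensions, K hidden neurons (one per cluster).
N, K = 4, 3
rng = np.random.default_rng(0)

W = rng.normal(size=(K, N))  # weight matrix: row k encodes centroid mu_k
x = rng.normal(size=N)       # one observation X_i

# Linear activations: h_k = sum_n w_kn * x_n, a dot product of row w_k with x.
h = W @ x
```

Each entry of `h` is the scalar product of one centroid's weight row with the data vector.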

we can set in motion an iterative procedure that

will provide the neural implementation of the K-means method.

To this end, we first select the learning rate parameter eta,

to use in our problem.

Then we take some initial guess for the weights w_kn

and repeat the following steps for a fixed number of iterations or epochs.

First, for each new data point,

we compute activations for all hidden neurons.

Then we pick a winning neuron among all hidden neurons.

The winning neuron will be the one with the maximal activation h_k.

This means that our neural network is a network of competing neurons.

They compete for the right to fire,

and there can be only one neuron that fires at the end.

Then we update the weights only for

the winning neuron k by adjusting the weights w_kn for each n,

by the amount equal to the learning rate eta times the difference between X_n and w_kn.

Please note that this is a rule for an online setting,

where we add new data points one by one.

In this case, we do not know the mean of

the data until we have finished scanning the whole dataset.

So, instead at each step,

we adjust cluster centers towards the true mean,

using the learning rate eta.
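The whole iterative procedure described above can be sketched as follows; the function name, defaults, and initialization scheme are illustrative assumptions, not prescriptions from the lecture.

```python
import numpy as np

def online_kmeans(X, K, eta=0.1, epochs=10, seed=0):
    """Competitive-learning sketch of the online K-means rule.

    Names and default values are assumptions for illustration.
    """
    rng = np.random.default_rng(seed)
    # Initial guess for the weights: K randomly chosen data points.
    W = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(epochs):
        for x in X:                   # data points arrive one by one
            h = W @ x                 # activations of all hidden neurons
            k = int(np.argmax(h))     # winning neuron: maximal activation
            W[k] += eta * (x - W[k])  # update only the winner's weights
    return W
```

Note how the update `eta * (x - W[k])` nudges the winning centroid toward each new point instead of recomputing an exact mean over the full dataset.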

So this is a very simple algorithm that can be

implemented using gradient descent or stochastic gradient descent.

As usual in neural networks,

you need a hyperparameter,

the learning rate eta, to implement the online K-means with neural networks.

You can usually set this parameter based on some experimentation

that checks the convergence speed and inspects the resulting clusters.

The algorithm can also be generalized in many ways.

For example, you can pick the winning neuron not deterministically,

but instead probabilistically by picking neurons

from the hidden layer with probabilities that are proportional to their activations.

This can be done using the so-called softmax activation for the neurons,

which gives a probability distribution over the neurons of the hidden layer.

You can then sample from this distribution to pick a winning neuron.
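This probabilistic selection rule can be sketched as below; the helper name is an assumption for illustration.

```python
import numpy as np

def pick_winner_softmax(h, rng):
    """Sample a winning neuron with probability softmax(h)."""
    # Subtract the max before exponentiating, for numerical stability.
    p = np.exp(h - np.max(h))
    p /= p.sum()
    # Sample one neuron index with probability proportional to exp(h_k).
    return int(rng.choice(len(h), p=p))
```

With nearly equal activations any neuron can win, while a dominant activation makes its neuron win almost deterministically.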

We will talk more about softmax and its multiple uses in follow-up courses,

but for now, we do not need to know its precise definition.

Another option would be to update not only the weights of the winning neuron,

but also weights of its neighboring neurons.

For example, you may make a similar update for the weights but with

a smaller magnitude that decays as you move away from the winning neuron.

In this way, you obtain a SOM or

a self-organizing map neural network for

unsupervised learning as suggested by Teuvo Kohonen in 1988.
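The neighborhood update can be sketched as follows; the Gaussian decay over grid distance is one common choice, and the function signature is an assumption for illustration.

```python
import numpy as np

def som_update(W, grid, x, winner, eta=0.1, sigma=1.0):
    """One SOM-style update: the winner's neighbors also move toward x.

    grid[k] holds the grid coordinates of hidden neuron k;
    the Gaussian decay form is an assumed illustrative choice.
    """
    d2 = np.sum((grid - grid[winner]) ** 2, axis=1)  # squared grid distances
    decay = np.exp(-d2 / (2.0 * sigma ** 2))         # 1 at the winner, smaller farther away
    # Every neuron moves toward x, with magnitude shrinking with grid distance.
    W += eta * decay[:, None] * (x - W)
    return W
```

Plain online K-means is recovered in the limit of a vanishing neighborhood, where only the winner gets a non-negligible update.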

Self-organizing maps are very interesting,

as they have some features that are not

usually present in other unsupervised learning algorithms.

In particular, when building a SOM,

the hidden-layer neurons are usually placed on a one-dimensional,

two-dimensional, or three-dimensional grid.

Most often a two-dimensional grid of neurons is chosen for a SOM.

This is done to enable visualization of results on a plane,

which can be done by showing activations of all neurons on the grid.

It turns out that due to this small trick

of adjusting not just the weights of the winning neuron,

but also weights of its nearest neighbors that I mentioned above,

the SOM preserves the topological order of points in

the original high-dimensional space upon

projecting them onto the two-dimensional grid of neurons.

You can read more about neural implementations of

the K-means and the SOM in the book of Marsland,

which also contains references to the original papers and reviews.

Here, I want to move forward and discuss other types of clustering methods.

Let's do it in the next video but not

before you check how well you understood the material of this video.