Now it's time to do some NLP,
Natural Language Processing, and we will start with the famous word2vec example.
So word2vec is a way to compress
your high-dimensional text representations into smaller vectors,
and with those vectors,
you can actually do calculations or further attach downstream neural network layers,
for example, for classification.
And in this case, we will just do both.
So we will start with some imports,
and we set a seed here so that the examples are reproducible.
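Roughly, that first cell could look like the following sketch; I'm assuming TensorFlow/Keras and NumPy here, and the seed value 42 is just a placeholder.

```python
import numpy as np
import tensorflow as tf

# Fix the random seeds so that the examples are reproducible.
np.random.seed(42)
tf.random.set_seed(42)
```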
So let's start with some sentences.
So this is a document which has five sentences,
and four are related to kings, queens, men, and women,
and one is unrelated.
We define vocabulary size 50.
That means this system here as it is will support 50 different words.
In a real-life scenario,
of course, you have millions of words.
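As a sketch, the document and the vocabulary size could be set up like this; the exact wording of the five sentences is my assumption, only the structure (four related, one unrelated) and the size 50 come from the example.

```python
# Toy document: four sentences about kings/queens/men/women, one unrelated.
docs = ['king is a man',
        'queen is a woman',
        'king rules the kingdom',
        'queen rules the kingdom',
        'the weather is nice today']

# Maximum number of distinct words this toy setup will support.
vocab_size = 50
```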
Later, we do some calculations with the words.
Therefore, we already now get the representation of these words,
and I will show you what I mean by that.
So each word is just a sequence of characters and,
obviously, we cannot work with sequences of characters.
Therefore, we will convert each word into an integer number,
and this integer number is unique,
as long as we don't exceed the vocabulary size.
Note that the function is called one-hot,
but in my opinion, that's not correct.
It's not the one-hot encoding.
It's basically just the transformation from
a list of the words into a list of integer values.
As you can see here, I'm correct.
So, for example, king is number 23 and queen is number 26 or man is number 22, and so on.
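In code, this step might look like the following sketch using the Keras one_hot helper; the concrete integers depend on the hashing, so 23, 26, and 22 are just the values seen in this particular run.

```python
from tensorflow.keras.preprocessing.text import one_hot

# Despite its name, one_hot only hashes each word to an integer below vocab_size.
# The fifth, unrelated word ('weather') is a placeholder assumption.
words = ['king', 'queen', 'man', 'woman', 'weather']
word_ids = [one_hot(w, vocab_size)[0] for w in words]
print(word_ids)  # e.g. [23, 26, 22, ...] -- the exact numbers depend on the hash
```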
So now we are actually one-hot encoding this stuff.
So this is now a single dimension with multiple values.
And one-hot encoding returns multiple dimensions,
and only one of those dimensions has a one.
The rest is zero.
That means we are getting sparse vector representations.
So that's actually what I'm talking about.
So you see here for each word,
we get a sparse vector.
And if the particular word is present at that position we have one,
otherwise we have zero.
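A minimal sketch of that true one-hot step, assuming the Keras to_categorical utility:

```python
from tensorflow.keras.utils import to_categorical

# Real one-hot encoding: each integer becomes a sparse vector of length vocab_size
# with a single 1 at the word's position and 0 everywhere else.
one_hot_words = to_categorical(word_ids, num_classes=vocab_size)
print(one_hot_words.shape)  # (5, 50) -- one sparse vector of size 50 per word
```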
So now we have done this for our example words,
but now let's do it for a complete document.
So as you can see here, we have some words again.
So those are the sentences.
Each element in this multidimensional array is
a sentence, and each integer represents a word.
It's just an index which tells us which word sits at that position.
So 23, 42 and 22 are actually "king is man." So king is 23.
We see king again here,
as a word in the second sentence.
So here we are actually padding. That means,
if the sentence is not long enough,
we just fill it with zeros.
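Put together, encoding and padding the whole document could look like this sketch; the maximum length of 5 is an assumption.

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Encode every sentence as a list of word indices, then pad with zeros
# so that all sentences have the same length.
encoded_docs = [one_hot(d, vocab_size) for d in docs]
padded_docs = pad_sequences(encoded_docs, maxlen=5, padding='post')
print(padded_docs)  # one row per sentence, zeros where a sentence is shorter
```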
This function is very,
very important and may be a bit complicated to understand,
but what we are doing is,
we're now creating tuples.
And we are creating tuples of a word and related words.
So in a sentence, for example,
"king is a man," if you look at the word king,
then we are creating tuples,
where "king" is the key and "is" is the value.
And "king" is the key and "man" is the value.
So these pairs capture the context,
that means the neighbors of the word.
So we are creating a list of tuples where,
for each word, we also are assigning the neighbors of the word.
We're doing this, in this case, for two neighbors,
for two preceding neighbors and two succeeding neighbors.
Don't lose too much time with that.
Just look at the result.
So the result has dimension 38 by 2.
That means you have 38 word pairs and the pairs,
as the name implies, have dimension two.
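The pair-creation function itself could be sketched like this; I'm assuming a symmetric window of two neighbors on each side and that padding zeros are skipped, which may differ slightly from the notebook's helper.

```python
window_size = 2
pairs = []
for sentence in padded_docs:
    for i, word in enumerate(sentence):
        if word == 0:                      # skip padding
            continue
        lo = max(0, i - window_size)
        hi = min(len(sentence), i + window_size + 1)
        for j in range(lo, hi):
            if j != i and sentence[j] != 0:
                pairs.append((word, sentence[j]))   # (center word, neighbor)

pairs = np.array(pairs)
print(pairs.shape)  # (number_of_pairs, 2), e.g. (38, 2) in this example
```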
So, again, we have now two dimensions with integer keys.
And in order to feed a neural network, we, again,
have to encode them as one-hot encoded vectors.
We first do it for the first dimension,
which basically becomes the input X of the neural network,
and then we do it for the second dimension,
which is our target.
So that's what we want to predict, and that's the Y, or the output, of the neural network.
So here you see that, again,
we still have 38 rows because,
in the previous example, we had 38 pairs,
but now we also have 50 columns, because
the one-hot encoded representation of each word is a sparse vector of size 50.
Remember, only one of those elements is one.
And that's the position which actually reflects the word.
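In code, this could simply be two more to_categorical calls, as a sketch:

```python
# X: one-hot encoded center words (input), Y: one-hot encoded neighbors (target).
X = to_categorical(pairs[:, 0], num_classes=vocab_size)
Y = to_categorical(pairs[:, 1], num_classes=vocab_size)
print(X.shape, Y.shape)  # (38, 50) each for the 38 pairs in this example
```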
So we start with a dense input layer which has
50 neurons and also expects an input dimensionality of 50;
that perfectly matches our one-hot encoded vectors of size 50.
And we use relu as activation.
So now it gets interesting.
Now we're introducing the bottleneck.
Remember bottlenecks? We are building an autoencoder.
So what we are building here is basically an autoencoder,
and the bottleneck has only two dimensions.
And, again, activation function is relu.
And in an autoencoder, of course,
we have to map again to the same dimensionality.
So because input was of dimension 50,
also the output has to be of dimension 50.
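The network described here could be sketched as follows; the layer sizes and the relu activations come from the description above, while the softmax on the output layer is my assumption.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(50, input_dim=vocab_size, activation='relu'))  # input layer, 50 neurons
model.add(Dense(2, activation='relu'))                         # two-dimensional bottleneck
model.add(Dense(vocab_size, activation='softmax'))             # map back to 50 dimensions
```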
So we're mapping sparse vectors to sparse vectors.
That means we are mapping the words,
for example, a word in the center to words in the neighborhood.
And while we're doing this,
we are training the bottleneck layer,
and that's our low-dimensional representation.
And that's actually what word2vec does.
It creates a low-dimensional representation of our high-dimensional one-hot word vectors,
which then basically is a low-dimensional word embedding.
We compile, set optimizer,
and loss function, and then we call fit.
So that starts training on the one-hot encoded inputs X and the one-hot encoded targets Y.
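As a sketch, that step might look like this; the choice of optimizer, loss, and number of epochs is my assumption.

```python
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X, Y, epochs=1000, verbose=0)  # train: center word -> neighbor word
```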
So for this Mickey Mouse dataset,
this goes really fast,
and we are quite happy with our loss and accuracy.
So now it gets even more interesting.
Now we build a new neural network but we are using
two of the three already trained layers of the previous neural network.
So we start with a sequential model again,
and we add the already trained input layer,
and we add the already trained bottleneck layer.
So if we now compile this neural network,
the input still is 50 dimensional but the output now is only two dimensional.
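A sketch of that reuse, assuming the trained layers can simply be stacked into a fresh Sequential model:

```python
# Reuse the two already trained layers; their weights are kept.
embedding_model = Sequential()
embedding_model.add(model.layers[0])   # trained 50-neuron input layer
embedding_model.add(model.layers[1])   # trained two-dimensional bottleneck
embedding_model.compile(optimizer='adam', loss='mse')  # loss is a placeholder; we only call predict
```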
So we take all five words from the beginning,
and let's see what the two-dimensional representations of those words are looking like.
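In code, that's just a predict call on the one-hot vectors of the five example words from above, as a sketch:

```python
# Map all five example words to their two-dimensional representations.
vectors = embedding_model.predict(one_hot_words)
print(vectors)  # an array of shape (5, 2)
```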
So we see this is an array of two-dimensional vectors,
and in theory, we can now plot those words into 2D space.
But we can also calculate with those vectors.
So we now take the word king, we subtract man,
we add woman, and this should resemble nearly the same vector as the word "queen."
That means if we now subtract "queen" from it, the difference should be relatively small.
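As a sketch, the arithmetic could look like this, using the two-dimensional vectors computed above:

```python
king, queen, man, woman = vectors[0], vectors[1], vectors[2], vectors[3]

# king - man + woman should land close to queen, so this norm should be small.
print(np.linalg.norm(king - man + woman - queen))
```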
Since this is only a toy example with very little data, and training a proper model takes a lot of time,
I'm using an online demo of this system.
So here, we actually take woman,
and here we say king minus man.
So we get somehow the meaning of king without a gender.
And if we add this meaning without a gender to a woman,
we should get "queen."
Let's see what happens. Yes, see. We are pretty close.
So that's actually really cool.
We can now calculate with words in a low-dimensional vector space.