[SOUND] Hey everyone, we're going to discuss a very important technique in neural networks. We are going to speak about the encoder-decoder architecture and about the attention mechanism. We will cover them using the example of neural machine translation, just because they were originally proposed mostly for machine translation. But now they are applied to many, many other tasks. For example, you can think about summarization or simplification of texts, or sequence-to-sequence chatbots, and many others.

Now let us start with the general idea of the architecture. We have some sequence as the input, and we want to get some sequence as the output. For example, these could be two sequences in different languages, right? We have our encoder, and the task of the encoder is to build some hidden representation of the input sentence. So we get this green hidden vector that tries to encode the whole meaning of the input sentence. Sometimes this vector is also called the thought vector, because it encodes the thought of the sentence. The decoder's task is to decode this thought vector, or context vector, into some output representation, for example a sequence of words in the other language.

Now, what types of encoders could we have here? Well, the most obvious type would be recurrent neural networks, but actually this is not the only option. So be aware that we also have convolutional neural networks, which can be very fast and nice, and they can also encode the meaning of the sentence. We could also have some hierarchical structures. For example, recursive neural networks try to use the syntax of the language and build the representation hierarchically from the bottom to the top, and understand the sentence that way.

Okay, now what is the first example of a sequence-to-sequence architecture? This is the model that was proposed in 2014, and it is rather simple. It says: we have some LSTM module, or RNN module, that encodes our input sentence, and then at some point we have an end-of-sentence token. At this point we understand that our state is our thought vector, or context vector, and we need to start decoding from this moment. The decoding is conditional language modeling. You are already familiar with language modeling with neural networks, but now it is conditioned on this context vector, the green vector. Okay, as in any other language model, you usually feed the output of the previous state as the input to the next state, and generate the next words one by one. Now, let us go deeper and stack several layers of our LSTM model. You can do this straightforwardly, just like this.

So let us move forward and speak about a slightly different variant of the same architecture. One problem with the previous architecture is that the green context vector can be forgotten. If you only feed it as the input to the first state of the decoder, then you are likely to forget about it by the time you come to the end of your output sentence. So it would be better to feed it at every moment, and this architecture does exactly that. It says that every state of the decoder should have three kinds of arrows going into it: first, the arrow from the previous state, then the arrow from the context vector, and finally the current input, which is the output of the previous step.

Okay, now let us go into more detail with the formulas. You have a conditional sequence modeling task, because you need to produce the probability of one sequence given another sequence, and you factorize it using the chain rule: the probability of the output sequence is the product over positions j of p(y_j | y_1, ..., y_{j-1}, x).
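To make this concrete, here is a minimal sketch of such an encoder-decoder, written in PyTorch purely for illustration (the lecture does not prescribe a framework, and the class name, dimensions, and layer choices are my own assumptions). It shows an LSTM encoder whose last hidden state serves as the thought vector, and a decoder that receives that vector at every step, concatenated with the embedding of the previous output word:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Sketch of an encoder-decoder where the thought (context) vector
    is fed to the decoder at every time step, not only at the first one."""

    def __init__(self, src_vocab, tgt_vocab, emb_dim=64, hid_dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        # Decoder input = previous target word embedding concatenated with the context vector.
        self.decoder = nn.LSTM(emb_dim + hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)   # followed by a softmax over target words

    def forward(self, src, tgt):
        # Encode the source sentence; the last hidden state is the thought vector v.
        _, (h, c) = self.encoder(self.src_emb(src))          # h: (1, batch, hid_dim)
        v = h[-1]                                             # (batch, hid_dim)

        # Decode: at every step j, feed [embedding(y_{j-1}); v] into the decoder LSTM.
        emb = self.tgt_emb(tgt)                               # (batch, tgt_len, emb_dim)
        v_rep = v.unsqueeze(1).expand(-1, emb.size(1), -1)    # repeat v at every step
        dec_in = torch.cat([emb, v_rep], dim=-1)
        s, _ = self.decoder(dec_in, (h, c))                   # s_j: decoder hidden states
        return self.out(s)                                    # logits; softmax gives p(y_j | y_<j, x)
```

During training, tgt would be the target sentence shifted by one position (teacher forcing); at inference time you would feed the previously generated word back in step by step, exactly as described above.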
Also, importantly, you see that the x variables are not needed anymore, because they have been encoded into the v vector. The v vector is obtained as the last hidden state of the encoder, and the encoder is just a recurrent neural network. The decoder is also a recurrent neural network. However, it has more inputs, right? So you see that now I concatenate the current input y with the v vector, and this means that I will use all of that information, all those three arrows, in my transitions.

Now, how do we get predictions out of this model? Well, the easiest way is just to apply a softmax, right? So when you have your decoder RNN, you have the hidden states of your RNN, and they are called s_j. You can just apply some linear layer, and then a softmax, to get the probability of the current word given everything that we have. Awesome.

Now let us try to see whether those v vectors are somehow meaningful. One way to do this is to say: okay, let's say they are some high-dimensional hidden vectors. Let us do some dimensionality reduction, for example by t-SNE or PCA, and plot them in just two dimensions to see what these vectors look like. You see that the representations of some sentences are close here, and it is nice that the model can capture that active versus passive voice does not actually matter for the meaning of the sentence. For example, you see that the sentences "I gave her a card" and "She was given a card" are very close in this space.

Okay, even though these representations are so nice, this is still a bottleneck, so you should think about how to avoid that. And to avoid that, we will go into attention mechanisms, and this will be the topic of our next video. [SOUND]
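As a small follow-up to the visualization idea above, here is a sketch of how one could project thought vectors into two dimensions with t-SNE from scikit-learn. This is not part of the lecture: the function name, the way the vectors are collected, and the t-SNE settings are illustrative assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_thought_vectors(vectors, sentences):
    """vectors: (n_sentences, hid_dim) array of encoder final states (thought vectors);
    sentences: the corresponding source sentences, used as point labels."""
    # Reduce the high-dimensional thought vectors to 2D.
    # Note: perplexity must be smaller than the number of sentences.
    coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(np.asarray(vectors))
    plt.scatter(coords[:, 0], coords[:, 1])
    for (x, y), text in zip(coords, sentences):
        plt.annotate(text, (x, y), fontsize=8)
    plt.title("Thought vectors projected to 2D with t-SNE")
    plt.show()
```

Sentences with similar meanings, like the active and passive variants mentioned above, should then land close to each other in the plot.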