I'll walk you through the models in roughly chronological order of when they were introduced, and we will also see the advantages and disadvantages of each one. Let's start. This is a quick outline: we start here with the continuous bag-of-words model, then we get to ELMo, then GPT, then BERT, and finally we end up with T5, and beyond that you will hopefully see many, many more models coming soon. This is not a complete history of all the relevant models and research findings, but it is useful to see what problems arise with each model and what problems each method solves.

Let's look at context. Say we have the word right and, ideally, we want to figure out what this word means. We can look at the context before it and at the context after it, and that is how we have been training word embeddings so far. In the continuous bag-of-words model, you have the word right. Previously, you would take a fixed window, say two words before and two after, or three, or four, whatever C is for the window size, take the corresponding words, feed them into a neural network, and predict the central word, in this case right. The issue here is: what if we wanted to look not just at a fixed window but at all the words before and all the words after? How can we do that? For some more context, say we have "they were on the" as the left part of the sentence and then the rest as the right part of the sentence; instead of the fixed window, we would like to include words like "the streets" or "history", for example.

To use all of the context words, researchers explored the following with RNNs. They would run an RNN from the right and one from the left, giving a bidirectional LSTM, which is a version of a recurrent neural network. You feed both directions in and then predict the central word, right. That gives you the word embedding for the word right.

Now, what did OpenAI's GPT do? We had the transformer, the encoder-decoder architecture that you are familiar with, and then GPT came along, which makes use of decoder stacks only. In this picture we only have one decoder, but you can have several decoders. ELMo made use of RNNs, or LSTMs, to predict the central word, while GPT only looks in one direction: it cannot look at the remaining part of the sentence on the right side. So why not use a bidirectional model? In the transformer shown here, each word can peek at itself; if we were to use it for the word right, it can only look at itself and the words before it. The issue is that it cannot peek forward. Remember, you have seen this in causal attention, where you don't look forward, you only look at the previous words. BERT came along and helped us solve this problem.

Here is a recap: the transformer is encoder-decoder, GPT makes use of decoders, and BERT makes use of encoders. Over here you have "The legislators believed that they were on the blank side of history, so they changed the law." We can make use of Bidirectional Encoder Representations from Transformers, and this will help us solve the issue. This is an example of a transformer plus bidirectional context: you feed this into your model, and you get back the missing words, right and of.
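To make that masked-prediction idea concrete, here is a minimal sketch, assuming the Hugging Face transformers library is installed; the checkpoint name bert-base-uncased and the top_k value are illustrative choices, not something taken from the lesson.

```python
# Minimal sketch: BERT uses bidirectional context to fill in a masked word.
# Assumes the Hugging Face transformers library; bert-base-uncased is an example checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

sentence = ("The legislators believed that they were on the [MASK] side "
            "of history, so they changed the law.")

# The model looks at the words on both sides of [MASK] before predicting it.
for prediction in fill_mask(sentence, top_k=3):
    print(prediction["token_str"], round(prediction["score"], 3))
```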
Because of this, you are able to look at the sentence from the beginning and from the end and make use of that context to predict the corresponding words.

Now let's go from words to sentences, meaning that instead of trying to predict just a word, we will try to predict what the next sentence is. Given the sentence "The legislators believed that they were on the right side of history", is this the next sentence, or is this one? You have a sentence A, and you try to predict the next sentence B. In this case, it is obviously "so they changed the law." So the BERT pre-training tasks make use of masked language modeling, the same thing you have seen before, and of next sentence prediction: the model takes two sentences and predicts yes or no, meaning whether sentence 2 follows sentence 1, or sentence B follows sentence A, or not.

Let's look at encoder versus encoder-decoder. You have the transformer, which had the encoder and the decoder stacks, then GPT, which has just the decoder stack, and then BERT, which has just the encoder stack. T5 tested the performance when using the encoder-decoder as in the original transformer model, and the researchers found that the model performed better when it contained both the encoder and the decoder stacks.

Now let's look at the multi-task training strategy. Here you have the review "studying with deeplearning.ai was...", it is fed into the model, and it gives you a five-star rating, hopefully a five-star rating. Then you have a question, and you get the answer. But the question is: how do you make sure the model knows which task it is performing? If you feed in a review, how do you know it is not going to return an answer instead of a rating? Or if you feed in a question, how do you know it is not going to return some numerical output, or some text version of a numerical output? Let's see how to do this. Say you are trying to classify whether a review is five stars, four stars, or three stars: you prepend the string "classify:", for example, and the model classifies it and you get five stars. Say you want to summarize: you prepend the string "summarize:", it goes into the model, the model automatically identifies that you are trying to summarize, and it says "it was all right." If you have a question, you prepend the question string, the model knows that it is a question, and it returns the answer. These might not be the exact tags that you will find in the paper, but they give you the overall sense of the idea; see the sketch at the end of this section.

In summary, we have seen the continuous bag-of-words model, which makes use of a fixed context window. You have seen ELMo, which makes use of a bidirectional LSTM. You have seen GPT, which is just a decoder stack and uses unidirectional context. Then you have seen BERT, which makes use of bidirectional encoder representations from the transformer, along with masked language modeling and next sentence prediction. And you have seen T5, which makes use of the encoder-decoder stack together with masked language modeling and multi-task training.

You have now seen an overview of the models. You have seen how the text-to-text model uses a prefix, so that with the same model you can solve several tasks. In the next video, we will look at the BERT model in more detail, also known as Bidirectional Encoder Representations from Transformers.
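To make the prefix idea from the multi-task discussion concrete, here is a minimal sketch using the Hugging Face transformers library; the checkpoint t5-small, the example input texts, and the exact prefixes shown are illustrative assumptions, not the precise tags from the T5 paper.

```python
# Minimal sketch: one T5 model handles several tasks, selected by a text prefix.
# Assumes the Hugging Face transformers library; t5-small is an example checkpoint.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The prefix tells the single model which task to perform on the text that follows.
# The input sentences below are made-up examples for illustration.
inputs = [
    "summarize: The course covered word embeddings, attention, and transfer "
    "learning, and ended with a project on question answering.",
    "translate English to German: The house is wonderful.",
]

for text in inputs:
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=30)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```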