I'll now teach you about Bidirectional Encoder Representations from Transformers, or in short, just BERT. BERT is a model that makes use of the transformer, but it looks at the inputs from two directions. Let's dive in and see how this works. Today, you're going to learn about the BERT architecture, and then you're going to understand how BERT pre-training works and see what the inputs and the outputs are.

What is BERT? BERT stands for Bidirectional Encoder Representations from Transformers, and it makes use of transfer learning and pre-training. How does this work? It usually starts with some input embeddings, E_1, E_2, all the way up to E_n. Then you go through some transformer blocks, as you can see here; each blue circle is a transformer block. You keep going up, and then you get your outputs T_1, T_2, up to T_n.

Basically, there are two steps in BERT's framework: pre-training and fine-tuning. During pre-training, the model is trained on unlabeled data over different pre-training tasks, as you've already seen before. For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks. For example, in the figure over here, you get the corresponding embeddings, you run them through a few transformer blocks, and then you make the prediction.

We'll discuss some notation over here. First of all, BERT is a multi-layer bidirectional transformer, and it makes use of positional embeddings. The most famous model is BERT base, which has 12 layers (that is, 12 transformer blocks), 12 attention heads, and 110 million parameters. The newer models that are coming out now, like GPT-3 and so forth, have way more parameters and way more blocks and layers.

Let's talk about pre-training. Before feeding the word sequences into the BERT model, we mask 15 percent of the tokens: the training data generator chooses 15 percent of the token positions at random for prediction. If the i-th token is chosen, we replace the i-th token with, one, the [MASK] token 80 percent of the time; two, a random token 10 percent of the time; and three, the unchanged i-th token 10 percent of the time. Then T_i, which you've seen in the previous slide, will be used to predict the original token with cross entropy loss. This is known as the masked language model.

Over here we have, "After school, Lucas does his blank in the library," so maybe "work," maybe "homework," one of these words is what your BERT model is going to try to predict. To do so, usually you just add a dense layer on top of the encoder output T_i and use it to classify. You multiply the output vectors by the embedding matrix to transform them into the vocabulary dimension, and you add a softmax at the end. Here is another sentence, "After school, Lucas does his homework in the library," which becomes, "After school, blank his homework in the blank." You have to predict "Lucas does," and then "library" as well.

In summary, you choose 15 percent of the tokens at random. You mask them 80 percent of the time, replace them with a random token 10 percent of the time, or keep them as is 10 percent of the time. Notice that there can be multiple masked spans in a sentence; you can mask several words in the same sentence.

In BERT, next sentence prediction is also used when pre-training. Given two sentences, a label of true means the two sentences follow one another; otherwise, they're different, they don't lie in the same sequence of the text.
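To make the masking procedure concrete, here is a minimal Python sketch, not the code from this course. It picks 15 percent of the positions at random and applies the 80/10/10 rule described above. The [MASK] id, the vocabulary size, and the use of -100 as an "ignore this position in the loss" label are assumptions for illustration; real implementations work on WordPiece token ids from an actual tokenizer.

```python
import random

# Hypothetical special-token id and vocabulary size for this sketch only.
MASK_ID = 103
VOCAB_SIZE = 30522

def mask_tokens(token_ids, mask_prob=0.15, seed=None):
    """BERT-style masking: pick ~15% of positions, then
    80% [MASK], 10% random token, 10% left unchanged."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(inputs)           # -100 = not predicted / ignored in the loss
    n_to_mask = max(1, round(mask_prob * len(inputs)))
    positions = rng.sample(range(len(inputs)), n_to_mask)
    for pos in positions:
        labels[pos] = inputs[pos]           # the model must recover the original token here
        r = rng.random()
        if r < 0.8:                         # 80%: replace with [MASK]
            inputs[pos] = MASK_ID
        elif r < 0.9:                       # 10%: replace with a random token
            inputs[pos] = rng.randrange(VOCAB_SIZE)
        # remaining 10%: keep the original token unchanged
    return inputs, labels
```

And here is a minimal NumPy sketch of the prediction head described above: a dense layer on top of the encoder outputs, a multiplication by the transposed embedding matrix to get vocabulary-sized scores, and a softmax. The tanh activation and the weight shapes are assumptions for illustration, not the exact head used in the course.

```python
import numpy as np

def mlm_predictions(encoder_outputs, embedding_matrix, W, b):
    """encoder_outputs: (seq_len, hidden); embedding_matrix: (vocab, hidden);
    W: (hidden, hidden); b: (hidden,). Returns (seq_len, vocab) probabilities."""
    hidden = np.tanh(encoder_outputs @ W + b)            # dense layer on top of each T_i
    logits = hidden @ embedding_matrix.T                 # project into the vocabulary dimension
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)         # softmax over the vocabulary
```

The cross entropy loss mentioned above would then be computed only at the positions whose label is not -100, i.e., only at the chosen 15 percent of tokens.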
You have now developed an intuition for this model. You've seen that BERT makes use of next sentence prediction and masked language modeling, which allows the model to have a general sense of the language. In the next video, I'm going to formalize this and show you the loss function for BERT. Please go on to the next video.