Welcome. This week I will talk about long short-term memory cells, which we call LSTMs. To understand why they're important, let me explain the vanishing and exploding gradient problems faced by conventional RNNs. Let's dive in. First, I'll talk about backpropagation through time. Then I'll introduce you to vanishing and exploding gradients, a problem common to RNNs, and I'll show you a few ways to handle them.

Let's begin with a discussion of some of the pros and cons of using a recurrent neural network. For one, the way plain or vanilla RNNs model sequences is by recalling information from the immediate past, allowing you to capture dependencies to a certain degree. They're also relatively lightweight compared to n-gram models, taking up less space and RAM. But there are downsides. The RNN architecture can struggle to capture long-term dependencies, and RNNs are prone to vanishing and exploding gradients, both of which can cause your model training to fail.

Vanishing and exploding gradients are a problem that can arise because an RNN propagates information from the beginning of the sequence to the end. Starting with the first word of the sequence, the hidden values at the far left are computed first. Then the RNN propagates some of that computed information, takes the second word in the sequence, and computes new values. You can see that process illustrated here: the orange area denotes the values computed at the first step, and the green denotes the second word. The second set of values is computed using the earlier values in orange and the new word in green. After that, the RNN takes the third word and the propagated values from the first and second words and computes another set of values from both of those. It continues in a similar way from there. At the final step, the computations contain information from all the words in the sequence, and the RNN is able to predict the next word, which in this example is "goal". Note that in an RNN, the information from the first step doesn't have much influence on the outputs. This is why you can see the orange portion from the first step decreasing with each new step. Correspondingly, the computations made at the first step don't have much influence on the cost function either.

The gradients are calculated using backpropagation through time, which sounds far scarier than it really is. As with simple backpropagation, you just have to apply the chain rule multiple times. Recall that the weights W_h and W_x are the same for each step. Let's focus on the weights W_h, noting that everything I'll present to you also applies to W_x. With the loss being computed at step t of the sequence, the gradient with respect to W_h depends on the computations made at every step. In fact, it is proportional to a sum, over the steps k, of products of partial derivatives of the hidden states: each term contains the chain of factors ∂h_t/∂h_{t-1} · ∂h_{t-1}/∂h_{t-2} · … · ∂h_{k+1}/∂h_k multiplied by the local contribution of step k. This relationship can be found by applying the chain rule and a couple of tricks, but you don't need to worry about the derivation as much as the implications behind this formula. Let's take a closer look at it. The term inside the sum, the product of partial derivatives, is the contribution of hidden state k to the gradient, and the length of the product for each k is proportional to how far step k is from step t, where the loss is computed. As you look at hidden states that are further away from the place where your loss is computed, the partial derivative products become longer and longer.
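To see what those longer and longer products do, here is a tiny Python sketch. The factor values 0.5 and 1.5 are made up for illustration, and each per-step partial derivative ∂h_i/∂h_{i-1} is treated as a single scalar factor rather than a full Jacobian, just to show the trend.

```python
# A toy illustration: each factor stands in for one per-step partial
# derivative dh_i/dh_{i-1}, treated as a single scalar for simplicity.
def contribution(factor, distance):
    """Product of `distance` identical factors, like one term of the BPTT sum."""
    return factor ** distance

for distance in [1, 5, 10, 20]:
    vanishing = contribution(0.5, distance)  # factors below one shrink toward zero
    exploding = contribution(1.5, distance)  # factors above one blow up
    print(f"{distance:2d} steps away: 0.5^{distance} = {vanishing:.6f},  "
          f"1.5^{distance} = {exploding:.1f}")
```

Running this, the column of products built from 0.5 collapses toward zero while the column built from 1.5 grows rapidly, which is exactly the behavior described next.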
For instance, to get the contribution to the gradient of a hidden state that is 10 steps away from step t, you would have to compute a product of 10 terms. Therefore, if the partial derivatives are smaller than one, the contribution of that hidden state to the gradient approaches zero as you move further away from the place where the loss is computed. Conversely, if the partial derivatives are greater than one, the contribution to the gradient goes to infinity. The first case is known as vanishing gradients, and it causes the RNN to ignore the values computed at early steps of a sequence, while the second case is known as exploding gradients, which causes convergence problems during training.

Now that you're terrified of vanishing and exploding gradients, let's discuss some solutions. I won't spend a whole lot of time on this, since this week focuses on the model that was designed to mitigate this problem. You can deal with vanishing gradients by initializing your weights to the identity matrix, which has ones along the main diagonal and zeros everywhere else, and using a ReLU activation. What this essentially does is copy the previous hidden state plus information from the current inputs, and replace any negative values with zero. This has the effect of encouraging your network to stay close to the values in the identity matrix, which acts like a one during matrix multiplication. This method is referred to, unsurprisingly, as an identity RNN. The identity RNN approach only works for vanishing gradients, though, as the derivative of ReLU is equal to one for all values greater than zero. To account for values growing exponentially, you can perform gradient clipping. To clip your gradients, simply choose a value that you would clip the gradients to, say 25. Using this technique, any value greater than 25 will be clipped to 25. This serves to limit the magnitude of the gradient; there's a short code sketch of the identity initialization and gradient clipping at the end of this section. Finally, skip connections provide a direct connection to earlier layers. This effectively skips over the activation functions and adds the value from your initial input x to your output, giving f(x) + x. This way, activations from early layers have more influence over the cost.

Now you understand how RNNs can have a problem with vanishing gradients. Next, I'll show you a solution, the LSTM. Let's go to the next video.
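As promised, here is a minimal numpy sketch of the identity initialization and gradient clipping ideas. The hidden size of 4, the example hidden state, and the example gradient values are made up for illustration; the threshold of 25 matches the value mentioned above, and the input term W_x·x is omitted for brevity.

```python
import numpy as np

# Identity-RNN style initialization (illustrative hidden size of 4):
# W_h starts as the identity matrix, so multiplying by it simply copies
# the previous hidden state, and ReLU then zeroes out any negative values.
hidden_size = 4
W_h = np.eye(hidden_size)

def relu(x):
    return np.maximum(0.0, x)

# One step with identity weights (input contribution omitted for brevity):
# the positive parts of the previous hidden state pass straight through.
h_prev = np.array([1.0, -2.0, 0.5, 3.0])
print(relu(W_h @ h_prev))        # -> [1.  0.  0.5 3. ]

def clip_gradient(grad, threshold=25.0):
    """Element-wise clipping: entries beyond +/- threshold are pulled back."""
    return np.clip(grad, -threshold, threshold)

# Example: a gradient vector with exploding components.
grad = np.array([0.3, -120.0, 47.5])
print(clip_gradient(grad))       # -> [  0.3 -25.   25. ]
```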