
>> So you have seen the equations for how to implement Batch Norm for maybe a single hidden layer. Let's see how it fits into the training of a deep network. So, let's say you have a neural network like this. You've seen me say before that you can view each of the units as computing two things. First, it computes Z, and then it applies the activation function to compute A. And so we can think of each of these circles as representing a two-step computation. And similarly for the next layer, that is Z2,1 and A2,1, and so on.

Â So, if you were not applying Batch Norm,

Â you would have an input X fit into the first hidden layer,

Â and then first compute Z1,

Â and this is governed by the parameters W1 and B1.

Â And then ordinarily, you would fit Z1 into the activation function to compute A1.

Â But what would do in Batch Norm is take this value Z1,

Â and apply Batch Norm,

Â sometimes abbreviated BN to it,

Â and that's going to be governed by parameters,

Â Beta 1 and Gamma 1,

Â and this will give you this new normalize value Z1.

Â And then you fit that to the activation function to get A1,

Â which is G1 applied to Z tilde 1.

Â Now, you've done the computation for the first layer,

Â where this Batch Norms that really occurs in between the computation from Z and A.

Â Next, you take this value A1 and use it to compute Z2,

Â and so this is now governed by W2, B2.

Â And similar to what you did for the first layer,

Â you would take Z2 and apply it through Batch Norm, and we abbreviate it to BN now.

Â This is governed by Batch Norm parameters specific to the next layer.

Â So Beta 2, Gamma 2,

Â and now this gives you Z tilde 2,

Â and you use that to compute A2 by applying the activation function, and so on.

Â So once again, the Batch Norms that happens between computing Z and computing A.

Â And the intuition is that,

Â instead of using the un-normalized value Z,

Â you can use the normalized value Z tilde, that's the first layer.

Â The second layer as well,

Â instead of using the un-normalized value Z2,

Â you can use the mean and variance normalized values Z tilde 2.

Â So the parameters of your network are going to be W1, B1.

Â It turns out we'll get rid of the parameters but we'll see why in the next slide.

Â But for now, imagine the parameters are the usual W1.

Â B1, WL, BL, and we have added to this new network,

Â additional parameters Beta 1,

Â Gamma 1, Beta 2, Gamma 2,

Â and so on, for each layer in which you are applying Batch Norm.

Â For clarity, note that these Betas here,

Â these have nothing to do with the hyperparameter beta that we had for

Â momentum over the computing the various exponentially weighted averages.

Â The authors of the Adam paper use Beta on their paper to denote that hyperparameter,

Â the authors of the Batch Norm paper had used Beta to denote this parameter,

Â but these are two completely different Betas.

Â I decided to stick with Beta in both cases,

Â in case you read the original papers.

Â But the Beta 1,

Â Beta 2, and so on,

Â that Batch Norm tries to learn is a different Beta than

Â the hyperparameter Beta used in momentum and the Adam and RMSprop algorithms.

Â So now that these are the new parameters of your algorithm,

Â you would then use whether optimization you want,

Â such as creating descent in order to implement it.

Â For example, you might compute D Beta L for a given layer,

Â and then update the parameters Beta,

Â gets updated as Beta minus learning rate times

Â D Beta L. And you can also use

Â Adam or RMSprop or momentum in order to update the parameters Beta and Gamma,

Â not just creating descent.

Â And even though in the previous video,

Â I had explained what the Batch Norm operation does,

Â computes mean and variances and subtracts and divides by them.

Â If they are using a Deep Learning Programming Framework,

Â usually you won't have to implement the Batch Norm step on Batch Norm layer yourself.

Â So the probing frameworks,

Â that can be sub one line of code.

Â So for example, in terms of flow framework,

Â you can implement Batch Normalization with this function.

Â We'll talk more about probing frameworks later,

Â but in practice you might not end up needing to implement all these details yourself,

Â knowing how it works so that you can get

Â a better understanding of what your code is doing.

Â But implementing Batch Norm is often one line of code in the deep learning frameworks.

Â Now, so far, we've talked about Batch Norm as if you were training on

Â your entire training site at the time as if you are using Batch gradient descent.

Â In practice, Batch Norm is usually applied with mini-batches of your training set.

Â So the way you actually apply Batch Norm is you take

Â your first mini-batch and compute Z1.

Â Same as we did on the previous slide using the parameters W1,

Â B1 and then you take just this mini-batch and computer mean and variance of the Z1 on

Â just this mini batch and then Batch Norm would

Â subtract by the mean and divide by the standard deviation and then re-scale by Beta 1,

Â Gamma 1, to give you Z1,

Â and all this is on the first mini-batch,

Â then you apply the activation function to get A1,

Â and then you compute Z2 using W2,

Â B2, and so on.

Â So you do all this in order to perform one step of

Â gradient descent on the first mini-batch and then goes to the second mini-batch X2,

Â and you do something similar where you will now compute Z1 on

Â the second mini-batch and then use Batch Norm to compute Z1 tilde.

Â And so here in this Batch Norm step,

Â You would be normalizing Z tilde using just the data in your second mini-batch,

Â so does Batch Norm step here.

Â Let's look at the examples in your second mini-batch,

Â computing the mean and variances of the Z1's on just that mini-batch and

Â re-scaling by Beta and Gamma to get Z tilde, and so on.

Â And you do this with a third mini-batch, and keep training.

Â Now, there's one detail to the parameterization that I want to clean up,

Â which is previously, I said that the parameters was WL, BL,

Â for each layer as well as Beta L, and

Â Gamma L. Now notice that the way Z was computed is as follows,

Â ZL = WL x A of L - 1 + B of L. But what Batch Norm does,

Â is it is going to look at the mini-batch and normalize

Â ZL to first of mean 0 and standard variance,

Â and then a rescale by Beta and Gamma.

Â But what that means is that,

Â whatever is the value of BL is actually going to just get subtracted out,

Â because during that Batch Normalization step,

Â you are going to compute the means of the ZL's and subtract the mean.

Â And so adding any constant to all of the examples in the mini-batch,

Â it doesn't change anything.

Â Because any constant you add will get cancelled out by the mean subtractions step.

Â So, if you're using Batch Norm,

Â you can actually eliminate that parameter,

Â or if you want, think of it as setting it permanently to 0.

Â So then the parameterization becomes ZL is just WL x AL - 1,

Â And then you compute ZL normalized,

Â and we compute Z tilde = Gamma ZL + Beta,

Â you end up using this parameter Beta L in order to decide

Â whats that mean of Z tilde L. Which is why guess post in this layer.

Â So just to recap,

Â because Batch Norm zeroes out the mean of these ZL values in the layer,

Â there's no point having this parameter BL,

Â and so you must get rid of it,

Â and instead is sort of replaced by Beta L,

Â which is a parameter that controls that ends up affecting the shift or the biased terms.

Â Finally, remember that the dimension of ZL,

Â because if you're doing this on one example,

Â it's going to be NL by 1,

Â and so BL, a dimension, NL by one,

Â if NL was the number of hidden units in layer

Â L. And so the dimension of Beta L and Gamma L

Â is also going to be NL by 1 because that's the number of hidden units you have.

Â You have NL hidden units, and so Beta L and Gamma L are used to scale

Â the mean and variance of each of

Â the hidden units to whatever the network wants to set them to.

Â So, let's pull all together and describe how

Â you can implement gradient descent using Batch Norm.

Â Assuming you're using mini-batch gradient descent,

Â it rates for T = 1 to the number of many batches.

Â You would implement forward prop on

Â mini-batch XT and doing forward prop in each hidden layer,

Â use Batch Norm to replace

Â ZL with Z tilde L. And so then it shows that within that mini-batch,

Â the value Z end up with some normalized mean and variance and the values and

Â the version of the normalized mean that and variance is Z tilde L. And then,

Â you use back prop to compute DW,

Â DB, for all the values of L,

Â D Beta, D Gamma.

Â Although, technically, since you have got to get rid of B,

Â this actually now goes away.

Â And then finally, you update the parameters.

Â So, W gets updated as W minus the learning rate times, as usual,

Â Beta gets updated as Beta minus learning rate times DB,

Â and similarly for Gamma.

Â And if you have computed the gradient as follows,

Â you could use gradient descent.

Â That's what I've written down here,

Â but this also works with gradient descent with momentum,

Â or RMSprop, or Adam.

Â Where instead of taking this gradient descent

Â update,nini-batch you could use the updates given

Â by these other algorithms as we discussed in the previous week's videos.

Â Some of these other optimization algorithms as well can be used to update

Â the parameters Beta and Gamma that Batch Norm added to algorithm.

Â So, I hope that gives you a sense of how you could

Â implement Batch Norm from scratch if you wanted to.

Â If you're using one of

Â the Deep Learning Programming frameworks which we will talk more about later,

Â hopefully you can just call someone else's implementation in

Â the Programming framework which will make using Batch Norm much easier.

Â Now, in case Batch Norm still seems a little bit mysterious if you're

Â still not quite sure why it speeds up training so dramatically,

Â let's go to the next video and talk more about

Â why Batch Norm really works and what it is really doing.
