Here's our outline and the important keywords. We're going to talk about optimization methods, especially stochastic gradient descent, which has a lot of hyperparameters, or design parameters, that we can tweak. We'll especially talk about the learning rate, momentum, and decay: what they are and how they affect the performance of the optimization. After we learn about stochastic gradient descent, we're going to talk about more tricks for making the optimization better. We'll talk about how to change the learning rate as the epochs go on, which is called learning rate scheduling, and about a variant of the momentum method called Nesterov momentum. Beyond plain stochastic gradient descent with its many tweaking parameters, there are adaptive gradient descent methods. They come with different tricks and improvements, so some of them converge faster and some of them try to reduce the error, and so on. We'll talk about those. Toward the end of this lecture, we're going to talk about tips for neural network training: how to avoid overfitting, and some neural-network-specific regularization methods, which are dropout and batch normalization. These are the keywords, and we'll visit them one by one. First, let's talk about gradient descent, which we briefly discussed last lecture. The optimization goal is to find the set of weights that minimizes the error, or the loss function, at the output. We talked about the weight update rule, which looks like this. In layer L, for the nth neuron in the previous layer and the mth neuron in the next layer, the connection between the two is called W_nm, which specifies the strength of the connection between these two neurons in the two layers. We initialize this W by selecting random numbers.
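To make the update rule concrete, here is a minimal NumPy sketch of one gradient-descent step, w <- w - alpha * dE/dw. This is my own illustration, not code from the lecture; the function name `sgd_update` and the stand-in gradient are made up for the example.

```python
import numpy as np

def sgd_update(w, grad, lr=0.01):
    """One plain gradient-descent step: w <- w - lr * dE/dw."""
    return w - lr * grad

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))    # random initialization of a layer's weights W_nm
grad = np.ones((3, 4))         # stand-in for the backprop gradient dE/dW
W_new = sgd_update(W, grad, lr=0.1)
```

Every entry of `W_new` moves against its gradient component by `lr` times that component; in a real network the gradient would come from backpropagation rather than a constant array.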
Then, as we show more data, we update the weights by some delta W, which we calculate from the chain rule: the gradient of the loss function with respect to W_nm of the Lth layer, times the step size, or learning rate, Alpha. Here is a cartoony picture of what the error surface looks like. In general, the loss function is parameterized by a lot of parameters, the W_nm in the different layers. Therefore, in high dimensions, it can have a complicated structure like this: there are multiple dips, and there are multiple hills as well. Our goal is to reach the smallest error possible. Since the loss function is a measure of error, smaller is better, and we want to get down here. When we randomly initialize our weights, we probably start somewhere up here, and then through optimization we can find an optimal solution. However, depending on where we start, sometimes we may just find a local minimum; this is called a local minimum, and this is called the global minimum. As you can imagine, in high dimensions and with complex functions, you may have many local minima. Therefore, we need some techniques to navigate them and try to get to a minimum that is as low as possible. That's our goal, and this, again, is the weight update rule. The error surface in a multidimensional space can be very complicated, so you may have some hills like this, or some plateaus. Here's a local minimum, and here's a saddle point. A saddle point is something like this: along one axis it looks like a local minimum, but along the other axis it's on top of a hill. It's a little unstable, and it's not the best place to be. Our goal, of course, is to find the global minimum. It's not shown here, but there are sometimes cliffs, which we also want to avoid, because at a cliff the gradient is so large that the change in weights can be really large; you can overshoot and maybe escape from attractive regions like this one. We need to be careful in choosing the learning rate, and so on.
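The local-versus-global-minimum point can be demonstrated on a toy one-dimensional loss. This is my own sketch, not from the lecture: the function `(w**2 - 1)**2 + 0.3*w` is an arbitrary example chosen to have one shallow (local) dip and one deeper (global) dip, and plain gradient descent ends up in whichever dip the starting point rolls into.

```python
def loss(w):
    # Toy 1-D loss: a global minimum near w = -1.04 and a local minimum near w = +0.96.
    return (w**2 - 1)**2 + 0.3 * w

def grad(w):
    # Analytic derivative of the toy loss.
    return 4 * w * (w**2 - 1) + 0.3

def descend(w, lr=0.01, steps=2000):
    # Plain gradient descent with a fixed learning rate.
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

w_from_right = descend(1.5)    # rolls into the local minimum
w_from_left = descend(-1.5)    # rolls into the global minimum
```

Both runs use the same rule and the same learning rate; only the random-looking choice of starting point decides which basin we end up in, which is exactly the initialization sensitivity described above.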
How many training samples at a time do we need to calculate the error, or the gradient? When you calculate this, the loss function is not only a function of W_ij but also a function of the data x_k. Strictly speaking, the error function needs to see all the data. We could show every data point, from data point number 1 to 10,000 or even 1 million, depending on how many data points you have. But it would be too much to show every instance just to compute the update of one weight value, and then do it all again to calculate another update for the same or a different weight value. That's a lot of wasted computation. Practically, we use minibatches. What minibatches are is this: let's say I have 10,000 data points. Then I may chop them into 1,000 slices. That means in each slice, or each batch, I have 10 data points. Those are the minibatches. Instead of showing 10,000 instances to update my weight matrices once, which is a lot, I'm going to show only 10 instances, randomly shuffled and randomly picked, and then update my weights according to the gradient calculation. This can be a little less accurate than showing all the data points, but that's okay, because depending on how big the minibatch size is, the estimate is already good enough. It's more important to iterate more than to try to be perfect. The other extreme of the spectrum would be a batch size of 1: show only one data point, then update the weights. This works in principle, but usually it's too stochastic, meaning the weight update can vary a lot depending on which data point we showed. Practically, some reasonable minibatch size in between is more efficient. By the way, the original stochastic gradient descent idea used a batch size of 1, which is super stochastic. But nowadays, when we talk about stochastic gradient descent, we assume that we use minibatches.
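The shuffle-and-chop procedure above can be sketched in a few lines. This is my own illustration of the idea, assuming the numbers from the lecture (10,000 data points, batches of 10); the helper name `make_minibatches` is made up.

```python
import numpy as np

def make_minibatches(n_samples, batch_size, seed=0):
    """Randomly shuffle sample indices and chop them into minibatches."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)   # random shuffle of all data indices
    return [idx[i:i + batch_size] for i in range(0, n_samples, batch_size)]

# 10,000 data points chopped into 1,000 slices of 10 points each
batches = make_minibatches(10_000, 10)
```

Each batch of indices is then used to pull 10 examples for one gradient estimate and one weight update, and together the 1,000 batches cover every data point exactly once per epoch.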
Training speed and accuracy, we already talked about: a batch size of 1 gives fast updates on each weight, but they're going to be a little inaccurate, while the full batch is going to be accurate but very slow, so we want to use something in the middle. With that in mind, let's talk about stochastic gradient descent. Stochastic gradient descent, as we mentioned, first initializes the weights. Note that I borrowed this block of the algorithm from the Deep Learning Book, which is a great book that you should read, but it uses a notation different from the one we've been using in these lecture slides. For example, for the parameters it uses theta, while in neural networks we use W more often. They are interchangeable, so please keep that in mind. After the initialization, we iterate over this procedure: we sample a minibatch of some batch size m of examples from the training data, along with the corresponding targets; we estimate the gradient using this formula; and then we update our weights according to this rule. You see this epsilon_k, which is the learning rate at step k. Usually we call the learning rate Alpha, or sometimes LR, but again, these are interchangeable. Pay attention that it has an index k, which is the iteration index. We didn't talk about that so far; we talked about a fixed learning rate, just Alpha, which doesn't depend on the iteration. However, practically, it's better to use a changing learning rate, because we want fast updates initially, but as we get closer to the goal, we want to slow down so that we don't overshoot and go past the local minimum that we are interested in. That's why there is a k here. The explicit engineering of how the learning rate decreases over time is called learning rate scheduling, and we'll talk about that soon. This forms the stochastic gradient descent algorithm.
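Putting the pieces together, here is a minimal end-to-end sketch of the algorithm: random initialization, minibatch sampling, a gradient estimate, and an update with a learning rate epsilon_k that shrinks with the iteration index k. This is my own illustration on a toy linear least-squares problem, not the Deep Learning Book's pseudocode verbatim; the `1 / (1 + decay * k)` schedule is just one simple choice of decay.

```python
import numpy as np

def sgd(X, y, batch_size=10, epochs=50, lr0=0.1, decay=0.01, seed=0):
    """Minibatch SGD for linear least squares, with a decaying learning rate."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])   # random weight initialization
    k = 0                             # global iteration index for the schedule
    for _ in range(epochs):
        order = rng.permutation(len(X))            # reshuffle each epoch
        for start in range(0, len(X), batch_size):
            batch = order[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            # Minibatch gradient of the mean squared error
            g = 2 * Xb.T @ (Xb @ w - yb) / len(batch)
            lr_k = lr0 / (1 + decay * k)           # epsilon_k: decays over iterations
            w -= lr_k * g
            k += 1
    return w

# Toy regression problem with known weights (noiseless, so SGD can recover them)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w
w_hat = sgd(X, y)
```

Because the targets are noiseless, the minibatch gradients all vanish at the true weights, and the decaying steps let the iterates settle there instead of bouncing around the minimum.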