Welcome back. During the last lecture,

I provided a brief introduction to deep learning and the neon framework,

which will be used for all the exercises.

In this lecture, we will learn the fundamentals

of deep learning including activation functions,

various ways to initialize the model, cost functions,

optimization techniques, and the famous backpropagation algorithm.

Let's get started. We will first have here some concepts from the previous lecture.

We learned that deep learning is used across

a variety of domains including video recognition,

object detection, clustering, speech recognition,

playing video games, and fault line detection in seismic data.

Its recent success is attributed to larger datasets,

faster hardware, and newer algorithms.

We learned that deep learning is a subset of

machine learning which is a subset of artificial intelligence.

Machine learning can be described as a program

that learns a function that maps features from

the input data to some desired output and whose performance improves with more data.

Deep learning can be described as a program that learns to extract

features from the data with increased complexity at each layer,

and also learns a function that maps these features to some desired output.

We also discussed the neon framework and its ability to define a wide range of models.

In this lecture, we will learn the fundamentals that are needed to train deep networks.

While the exercises are in neon,

these principles are applicable across all the frameworks.

We'll start by introducing the multi-layer perceptrons or MLP.

The most popular types of deep networks used in

supervised learning are multi-layered perceptrons or MLP,

also known as deep feed forward networks,

convolutional networks or CNNs,

and recurrent networks or RNNs,

and there are many others such as auto encoders,

generative adversarial networks, deconvolutional networks, etc.

On the right, we show an MLP model with two hidden layers.

This MLP can and has been used for the task of digit classification.

The input are the 784 pixel values of 28 by 28 image,

containing one handwritten digit.

That is the input layer has 784 units.

The first hidden layer has 128 units and the second hidden layer has 32 units.

The last layer is the output layer and has 10 units,

one for each digit.

The output of each unit represents

a probability that the input image corresponds to a particular digit.

As a side note, some people refer to these units as neurons and to

these models and networks as neural networks

in an attempt to claim resemblance to actual neurons in the brain.

However, given that I am a machine learning engineer and not a neuroscientist,

I will not attempt to claim a resemblance to actual neurons,

and will continue my nomenclature as units

and deep networks instead of neurons and neural networks.

Also in my nomenclature,

X is the input,

W is a matrix of weights,

A is the activation or output of a hidden unit,

Y hat is the output,

and Y which is not shown here, is the ground truth.

The superscripts represent the layer index and the subscripts represent the unit index.

This small network already requires over 100,000 parameters.

Reducing the number of parameters will be one of

the main motivations for convolutional networks,

which will be discussed in the following lecture.

There is a weight from every unit in

a layer to every other unit in the next layer and a bias.

The bias is not drawn here,

therefore, W zero has 784 times 128 values.

B zero has 128 values.

W one has 128 times 32 values.

B one has 32 values.

W two has 32 times 10 values,

and B two has 10 values for a total of 104,938 weight.

This network can be trained for the task of image classification.

In this example, the inputs are the pixels in an image.

The output is a probability distribution over the 10 digits.

In this figure, I'm only showing three output units,

but imagine there were 10 instead,

Y hat 1 through Y hat 10.

If the image corresponds to the number four,

the vector y is a vector with all entries zero except for

the fifth entry corresponding to the number four where there is a one.

There are six steps required to train a network.

First, the weights of the network are

initialized often drawing samples from some distribution.

Second, as small batch of data is fetched.

Third, the data in the batch is passed through the network.

This is known as forward propagation.

Fourth, the cost C is computed.

The cost is a metric of difference between

the actual output Y hat and the expected output Y.

Fifth, gradients of the cost with respect to the weights and activations are

back propagated in order to know how to adjust the weights to reduce the cost.

Six, the weights are updated.

This completes one iteration of training and the cycle is

repeated starting with step two until the cost no longer decreases.

Note that, I am only showing one input in the batch.

If the batch has, say 32 input,

then both Y hat and Y would be matrices of size 10 by 32.

That is, its input in the batch has its output

with a distribution over 10 values and its own ground truth.

Once the network is trained,

it can be used to infer labels on new data it has not seen before.

This is known an inference.

In this example, an image of the number 4 is the input to

the network and the output is a probability distribution over the 10 values.

This network is 80 percent confident that the input data corresponds to the digit four.

Note that inference is simply a forward propagation through the network.

Let's dive into what happens at each layer and unit.

The MLP network is composed of an Affine layer.

An Affine layer is a layer of

units that linearly combines the weighted output from the previous layer,

adds a bias, and applies an activation function.

Each unit linearly combines the weighted output from the previous layer,

adds a bias and applies an activation function.

For example, the input to this unit is the weighted sum x1,

x2 and x3 plus a bias, that is,

x1 times w1, plus x2 times w2,

plus x3 times w3,

plus the bias b.

This weighted sum is denoted by z.

Then an activation function denoted by g is applied with z,

resulting in the activation or the output of the unit a.

Note that we refer to g as the activation function and to a as the activation.

The same applies for some layer i_plus_one.

The activation a_i_plus_one is the weighted sum

of the activations from layer i plus a bias passed through the activation function g.

Many types of activation functions are commonly used.

One such activation function is the hyperbolic tangent,

which has an appealing interpretation.

Suppose your i unit is responsible for detecting a particular feature.

If your input z is near zero,

then you're uncertain about the presence of that feature.

In that case, your gradient or derivative is high, which encourages training.

However, when your input z is too high or too low,

your gradient is almost zero,

causing the weight not to learn.

The logistic activation function is similar except that its range is from zero to one,

which can be useful for modeling probabilities.

The rectified linear unit activation function or RELU has become quite popular as it

was found to accelerate the train process

compared to the sigmoid or hyperbolic tangent functions.

The RELU can be implemented by simply thresholding a matrix

of inputs at zero and does not require computing exponentials.

While they are very popular, however,

RELU units can be fragile during the training and can die.

For example, a large gradient flowing through a RELU unit could

cause the weights to update in such a way that the unit will never activate.

There are variants of RELU that mitigate this at the expense of additional computations,

such as noisy RELU,

leaky RELU and parametric RELU.

A recently proposed activation function is the scaled exponential linear unit or SELU,

which self-normalizes the activations to a zero mean and unit variance which

reduces the risk of diminishing gradients or exploding gradient for very deep networks.

We will talk more about these risks in a moment.

Finally, the Softmax activation function is common for

classification as its output can be interpreted as a probability distribution.

That is, the output values are all between zero and one,

and sum to one.

Next, let's talk about various methods

to initialize the weights to start the training process.

The initialization of the model weights can

drastically affect the train speed and performance of deep networks,

in particular, for deeper networks.

A simple initialization approach is to sample from a Gaussian distribution,

but there are better ways.

Here's why, exploding and diminishing gradients are two challenges with deep networks.

That is, as we backpropagate the gradients,

they can exponentially increase or decrease,

and for very deep networks,

they can explode or diminish to zero.

One way to mitigate this is to initialize the weight

such that the activations and backpropagations have unit variance.

GlorotUniform and Xavier approximates

this units variance for sigmoid or logistic activations,

and Kaiming for RELU activations using the distribution shown here.

Each of these methods are named after its author,

with Xavier and Glorot coming from the same author, Xavier Glorot.

In these equations, d_in represents the number of input,

also known as fan-in,

and d_out the number of output,

also known as fan-out.

Logistics or sigmoid have linear activations near the origin.

That is, the output a is approximately equal to the input z for values near the origin,

and the z values are usually near zero for the first few training iterations.

The variance of the activations is equal to the variance

of the input times the variance of the weight,

times the number of inputs d_in during the forward propagation,

or times the number of outputs d_out during the backward propagation.

Assuming the inputs have unit variance,

to maintain unit variance in the forward propagation,

the weights should be initialized to have a variance of one over d_in.

To maintain unit variance in the backward propagation,

the weights should be initialized to have a variance of one over d_out.

Glorot makes a compromise between these two as two over d_in plus d_out.

It can be shown,

although that relation is not shown here,

that a uniform distribution over the interval

negative b to b has variance b_squared over three.

We set that equal to the variance of the weight and solve for b,

and find that b equal to square root of six over d_in plus d_out.

This is the GlorotUniform initialization,

the weighted sample from a uniform distribution over the interval negative b to b.

Note that the distribution is the same for units within a layer,

but it may be different for units across layers.

The Kaiming initialization is useful for the RELU activation function.

The variance of the activation is half of what it was for

sigmoid because RELU zeroes out the negative inputs.

Assuming the inputs have unit variance,

to maintain unit variance in the activations,

the weights should be initialized to have a variance of two over d_in.

Kaiming's initialization allow the training of much deeper networks.

For example, prior to this technique,

the authors of the popular VGG paper had to

meticulously initialize the layers of the larger VGG networks in various steps.

And it's okay if you're not yet familiar with the VGG network.

My point is that with Kaiming's initialization technique,

this meticulous way to initialize layers is no longer needed.

In this plot, we observe that Kaiming's initialization allows for better convergence,

meaning a lower cost or error for a 22-layer network and

enables the training of a 30-layer network that would otherwise not train.

Next, we will discuss the metrics used to compute the cost.

The cost is a measure of the difference between the output y_hat and the ground truth y.

Training aims to minimize this cost or difference.

Various cost functions or metrics can be used to train networks,

including the cross-entropy loss,

the misclassification rate, the L2 loss,

also known as the mean squared error,

and the L1 loss,

also known as the mean absolute error.

For regression tasks, the L2 loss is a popular metric.

For classification tasks, the cross-entropy loss is a popular metric.

The cross-entropy loss for one input sample is computed as the sum over

all the outputs of the negative expected output times the log of the actual output.

In this example, the expected output is zero for all but the fifth output which is one.

The fifth value of the actual output is 0.1.

Therefore, the cross-entropy error is negative

one times the log of 0.1, which equals one.

The cross entropy loss is a better metric than

the misclassification error as we'll demonstrate by this example.

Here we see the output of two different models.

The outputs on the top come from

the first model and the outputs on the bottom comes from the second model.

Both of these models have the same misclassification rate.

The first two outputs are correct and the third output is incorrect.

However, upon closer inspection,

the outputs of the second model are better because they have

a higher confidence or probability on the correct class,

and lower confidence on the incorrect class.

The cross-entropy loss take this into account.

The cross-entropy loss for the first model is 1.38,

and for the second model it's only 0.64.

Next, we'll learn about various optimization techniques that are used to reduce the cost.

We will also take a short detour to review gradient descent and its variants.

There are various optimization algorithms that can be used to minimize the loss function,

such as gradient descent,

stochastic gradient descent, AdaGrad, and others.

Because most of these optimizers are variants of gradient descent,

we will take a detour to review gradient descent.

Gradient descent is a popular algorithm used to

find a local minimum of an objective function.

On the top right, you see the objective function computed at some value

w_zero at iteration zero before starting grading descent,

w represents a set of weight of the network.

In this two example,

w has only one dimension,

but in practice, it can have millions or

even billions of dimensions and the same theory applies.

In supervised learning, the cost is a penalty of how far

the network's output is from the true expected output.

The user can choose a cost function such as the squared difference loss commonly used for

regression tasks or the cross entropy loss commonly used for classification tasks.

The objective function is the sum of the cost over all the N training samples.

When training a network with no prior knowledge of the search space,

the first guess w zero is usually sampled from some distribution defined by the user.

For example, a Gaussian distribution with a K mean standard deviation.

Next, we need to know which direction to move towards the minimum.

In this toy example,

we can visualize the object function and know that we have to move towards the left.

But in practice, in a high dimensional space,

we won't be able to visualize the object.

One method to determine which direction to move,

is to compute the derivative at w zero.

If the derivative is positive,

then you move to the left and vice versa.

If the derivative is negative, then you move to the right.

In other words, you move in the direction opposite to the gradient.

As we can see in this figure,

the derivative is positive,

so we will want to move to the left.

Note that the cost functions must have the property of differentiability.

Both the cross entropy laws and the sum of square differences have this property.

Next we update the weights by taking a step in the opposite direction of the gradient.

That is w one equals w zero minus the gradient at w zero. Well, almost.

The size of the step taken is controlled by a learning rate alpha,

which is an important hyper parameter to tune by the user.

Choosing a good learning rate is important for the training efficiency and effectiveness.

If the learning rate is too small it will take too many steps to reach the minimum.

And notice that as you approach the minimum,

the gradient gets closer to zero and therefore the steps get even smaller.

If the learning rate is too large and w one it's too far to the left,

the objective function at w₁ is now worse than at w zero.

That is you are diverging and moving farther away from the minimum.

Here we have selected a good enough learning rate.

The steps taken do not overshoot the local minimum,

but are also not too small.

After one step is taken and the weights are updated,

another step can be taken by repeating the same process.

This process can be repeated to calculate w three and

w four and so on until the local minimum is reached.

However in practice, we use variants of gradient decent,

as gradient decent has some glaring issues for deep learning.

One issue is that gradient descent requires summing

overall the cost of the end trailing samples as shown on the top right.

That is, you need to compute the cost for

all the training samples before you can take one step.

In deep learning N can be millions,

making gradient descent very inefficient.

Aside from the computational expense,

gradient descent converges to our poor local minimum.

There are some theories to explain this

but these are still theories and this is an active area of research.

One theory is that saddle points pose a problem for the algorithm.

You can get stuck in a saddle point and the training stops at a sub optimal value.

When the weight reaches a saddle point, the learning stops.

A second theory, it appears that the optimisation space has many sharp minimums.

Gradient descent does not explore

the optimization space but moves towards the

local minimal directly underneath its starting position,

which is often a sharp minimum.

Sharp minimums do not generalize well because the cost of the validation

or test dataset may be much different than the cost on the training dataset.

In this diagram, we see that the objective function with respect

to the dataset is similar to that of the train dataset but is slightly shifted.

This shifts result in models that converts to

a sharp minimum having a high cost with respect to the test dataset.

Meaning that the model does not generalize well for data outside of the training set.

On the other hand, models that are converts to

a flat minimum have a low cost with respect to a test dataset.

Meaning that the model generalizes well for data outside of the training set.

Note, that this is not the result of overfilling.

Even though the symptoms are similar.

Don't worry if you're not familiar with overfilling.

We will cover overfilling in a later lecture.

In summary, the three issues with gradient descent are as follows.

It is computationally expensive.

Training can stop at a saddle point and it often converges to a sharp minimum,

which does not generalize well.

Note that the last two points are theories and

the exact reasons for the poor performance is an active area of research.

Modified versions of gradient descend are used to mitigate these issues.

One is a stochastic gradient descent or SGD.

Note that the term is SGD was originally reserved as

a special case of mini batch gradient descent with a batch size of one.

But now, including it in this lecture,

the term is SGD means meaning mini batch gradient descent.

Apologies to the purist.

In SGD with computed gradient,

which I call the batch gradient,

which is computed use in the aggregated cost over a small batch of training samples.

This gradient will be different than the gradient computed in gradient descent,

which I call the global gradient,

which is computed using the aggregated cost of the entire dataset.

SGD mitigates the issues previously discussed as follows.

SGD allows us to make

faster progress because we don't have to compute the error for the entire dataset.

SGD mitigates the saddle point issue

because while the global gradient at a saddle point is zero,

the batch gradients will not be and the model can move away from the saddle points.

SGD mitigates the sharp minima because it better explores the solution space.

Empirically, SGD has been shown to

optimize models much better than plain gradient descent.

The objective function used in gradient descent is shown on the top left.

The objective function used in stochastic gradient descent are shown on the top right.

In SGD, the dataset is split into M different equal sized batches.

For each data batch,

the costs are aggregated.

The ingredients are computed,

and the weight are updated.

The size of each batch is a hyper parameter that the user has to choose.

A very large batch starts to suffer from the same issues as gradient descent.

A very small batch size,

may not be able to keep their hardware a high realization,

and limits the number of nodes that can be used for distributed training.

We will dedicate an entire lecture to distributed training and their relation to SGD.

In this figure the red X represents the weight of

the current model at the start of training.

In this toy example,

the model only has two values.

That is, the weight space is only a two dimensional space.

The dotted line represents

the trajectory that the model would have taken with gradient descent.

Note that the arrow shows the vector equal to a negative batch gradient.

This is the direction of the first step.

Over the next slide,

we will take many steps in the direction of

the batch gradients computed for each mini batch.

Note that SGD makes progress towards the minimum,

but then it has a hard thing converging and bounces around the minimum.

During training, SGD computes the cost with respect to

multiple data batches and actually cycles multiple times or the various data batches.

We call an epoch every time all the boxes in the dataset are used.

In this SGD example,

One epoch equals m iterations since the dataset has m set of batches.

Training a model may require anywhere from one epoch to a couple hundred epochs.

Note that in gradient descent, a batch equals entirely that descent.

Therefore in gradient descent,

one epoch equals one iteration.

Another training hyper parameter is momentum.

In stochastic gradient descent,

while each individual gradient may not be in the direction towards the minimum,

on average the overall direction tends to be towards the minimum.

We can take advantage of this by accumulating the overall direction of

the gradient and taking steps in the direction

of the accumulated gradient known as the velocity.

In the bottom right,

the velocity is updated as a way to sum between

the previous value of the velocity and the current gradient.

While the momentum parameter mu is another hyper parameter to potentially adjust,

there is active research to develop

robust techniques to better select this momentum parameter.

For now, most practitioners use mu equals to 0.9 as the de facto momentum.

Momentum can be thought of as similar to momentum in classical physics.

As when we threw up or downward,

it gathers momentum and its velocity keeps increasing.

Here is a graphical total representation of momentum which may help.

The red arrow is the gradient and the green arrow is the velocity.

The bottom of the arrow is the current position in the weight domain,

and the head of the arrow is the next position.

With each parameter update,

we can observe that the gradient modifies the velocity and

the velocity determines the next position in the weight domain.

On the next light,

we observe that the gradient modifies the velocity.

And the velocity determines the next position in the weights domain.

Even when a particular gradient point in the direction opposite to the desired minimum,

progress in the desired direction continues and the velocity only slows down.

Another view of creating descent and SGD is as follows.

Training constitutes feeding in the various data samples into a model for propagating,

calculating the cost, and back-propagating to compute the gradient.

With gradient descent, the weight update,

delta W, is computed by aggregating all the gradients.

For example, on the other hand with stochastic gradient descent,

the data is splitting to many batches.

In this two example, two batches.

The weight update for each batch is computed

by aggregating all the gradients for each sample in the batch.

There are other optimizers that

make slight modifications to stochastic gradient descent to improve it further.

AdaGrad for example allows the learning rate

to be normalized by the size of the gradient.

This helps the model take larger steps than he would

otherwise take when the gradients are small which can occur in a subtle point.

And it helps the model

take smaller steps than it would otherwise take when the gradients are large.

Having these dynamic learning rate reduces

the need for the user to manually search for a good learning rate.

This can take time for the deep learning practitioner.

Note that each waiting model will have its own dynamic learning rate.

AdaGrad normalizes the learning rate using all the gradients for that particular weight,

starting at iteration zero.

RMS propagation is similar,

only that it takes into account the previous few iterations only and are all as AdaGrad.

Both RMS prop and AdaGrad are very popular optimization techniques.

While some research shows that SGD can converge to

a slightly better minimum AdaGrad and

RMS prop reduce the need for the user to manually search for a good learning rate.

Regardless, training and deep network requires an incredible amount to compute.

For example, Alexnet on the image net dataset runs for 90 epochs,

processing 115 million images over

a seven layer deep network with around 60 million parameters.

And that was in 2012.

Now we have deep networks with hundreds of layers.

We have covered the cost and optimizers.

The optimizers require a simple gradients of the weights.

In this section, we will explain how to compute the gradients of the cost with

respect to each weight for

all the weights in the model via the back-propagation algorithm.

In this network, suppose that we want to compute the gradients or

partial derivatives of the cost C with respect to weight four,

five in layer one.

We can compute these using the chain rule.

Not that I will abbreviate partial derivative as just partial.

To compute the partial of C with respect to W451,

we need to back-propagate from C and compute all the partials that are circled in red.

Working backwards, we started by computing the partial C with respect to ZK3.

This equals the partial of C with respect to Y hat K times

the partial of Y hat K with respect to ZK3.

The result is backpropagated to compute partial C with respect to A42.

Note that A42 is used in three inputs in layer three,

and all those inputs affect the cost C. Therefore,

the partial of C with respect to A42 is

a summation over the three units in A3 as shown here.

Similarly, the partials C with respect to A42 is back-propagated and multiplied by

the partial A42 with respect to Z42 to compute the partial C with respect to Z42.

This is back-propagated and multiplied by the partial Z42 with respect to

W451 to compute the partial C with respect to W451.

And we're done computing their partials.

In practice, we don't compute the partials one by one but in groups,

as we'll see in the next slides.

On top the equations that were used in the previous slides are shown.

Below them, the functions whose gradients need to be computed are also shown.

Using these equations and high school calculus,

all the gradients can be computed.

I will leave that as an exercise to you.

You can derive these equations and compute

the partial C which relate to any weight in the network.

Finally, note that in practice,

all these operations can be computed more efficiently as matrix multiplications.

Okay, now that we have gone through the derivation of back-propagation,

let us quickly talk about controlling gradient flow.

As networks get larger and deeper,

controlling the flow of gradients is essential to prevent them from

becoming too small and vanishing or becoming too large and exploding.

There are a few techniques that we can employ to prevent this.

Activation functions as we have seen,

are important in controlling the gradients.

For example, the RELU activation function is zero when the activation is negative,

making degrading zero, no matter what else is going on.

Choosing a weight initializer such as

Kaiming and or learning rules such as AdaGrad is also important.

Finally, layer architecture including the number of layers and layers types,

also play a role in controlling the gradient flow.

Once again, we will be using the neon framework to

design and develop deep learning models in this course.

Neon contains many of the elements discussed in

this lecture including various initializers,

optimizers, activations, layers, and costs.

The general procedure for building a deep network in unit is as follows.

First, the back-end must be generated,

then the data is loaded.

The model is defined and hyper parameters are chosen.

Then the model is trained and simultaneously evaluated to ensure correct training.

We have covered a lot of material in this lecture.

We learned the fundamentals of deep learning including activation functions,

various ways to initialize a model, cost functions,

various optimization techniques, and the famous backpropagation algorithm.

You should be ready to start training simple networks and

understand the most important hyper parameter choices.

In the next lecture,

my colleague Kalnin will introduce you to convolutional networks.