0:09

The algorithm is really quite simple once you have seen the equivalence between a recurrent neural network and a feed-forward neural network that has one layer for each time step. I'll also talk about ways of providing input, and desired outputs, to recurrent neural networks.

0:49

The key to understanding how to train a recurrent network is to see that a recurrent network is really just the same as a feed-forward network, where you've expanded the recurrent network in time. So the recurrent network starts off in some initial state, shown at the bottom there, at time zero. It then uses the weights on its connections to get a new state, shown at time one. It uses the same weights again to get another new state, and it uses the same weights again to get another new state, and so on. So it's really just a layered, feed-forward network, where the weights are constrained to be the same at every layer.

1:39

Now backprop is good at learning when there are weight constraints. We saw this for convolutional nets, and just to remind you, we can actually incorporate any linear constraint quite easily in backprop. We compute the gradients as usual, as if the weights were not constrained, and then we modify the gradients so that we maintain the constraints.

2:04

So if we want W1 to equal W2, we start off with them equal, and then we need to make sure that the change in W1 is equal to the change in W2. We do that by simply taking the derivative of the error with respect to W1 and the derivative with respect to W2, adding or averaging them, and then applying the same quantity to update both W1 and W2.
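To make that concrete, here is a minimal sketch of the tied-weight trick for two scalar weights. The gradient values, learning rate, and starting weight are all made up purely for illustration, not taken from the lecture:

```python
# Sketch of training with the constraint W1 == W2: compute gradients as
# if the weights were free, then average them and apply the SAME update
# to both copies, so the constraint is preserved.
w1, w2 = 0.3, 0.3            # start off with the weights equal
dE_dw1, dE_dw2 = 0.8, 0.2    # hypothetical error derivatives

g = 0.5 * (dE_dw1 + dE_dw2)  # average them (summing also works, up to
                             # a rescaling of the learning rate)
lr = 0.1
w1 -= lr * g                 # same quantity applied to both copies,
w2 -= lr * g                 # so w1 and w2 remain equal

print(w1, w2)                # the two copies are still equal
```

If the weights start out equal and always receive identical updates, they satisfy the constraint forever, which is exactly the argument made above.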

2:28

So if the weights started off satisfying the constraints, they'll continue to satisfy the constraints. The backpropagation-through-time algorithm is just the name for what happens when you think of a recurrent net as a layered, feed-forward net with shared weights, and you train it with backpropagation. So we can think of that algorithm in the time domain. The forward pass builds up a stack of activities at each time slice, and the backward pass peels activities off that stack and computes error derivatives at each time step, going backwards. That's why it's called backpropagation through time. After the backward pass, we can add together the derivatives at all the different time steps for each particular weight, and then change all the copies of that weight by the same amount, which is proportional to the sum or average of all those derivatives. There is an irritating extra issue.
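Before coming to that issue, the procedure just described (unroll forward while building a stack of activities, then peel the stack backwards while summing each time step's contribution to the shared weight's gradient) can be sketched for a tiny recurrent unit. The scalar linear unit h[t] = w * h[t-1] + x[t] and all the numbers here are illustrative assumptions, not from the lecture:

```python
# Minimal backpropagation through time for a scalar, linear recurrent
# unit h[t] = w * h[t-1] + x[t], with a squared-error target on the
# final state only.

def bptt_grad(w, h0, xs, target):
    # Forward pass: build up a stack of activities, one per time slice.
    stack = [h0]
    for x in xs:
        stack.append(w * stack[-1] + x)
    loss = 0.5 * (stack[-1] - target) ** 2
    # Backward pass: peel activities off the stack, computing the error
    # derivative at each time step going backwards, and summing the
    # contribution each copy of the weight makes to the gradient.
    dh = stack[-1] - target      # dE/dh at the final time step
    dw = 0.0
    for t in range(len(xs), 0, -1):
        dw += dh * stack[t - 1]  # this time step's contribution to dE/dw
        dh = dh * w              # propagate the derivative back one step
    return loss, dw

loss, dw = bptt_grad(w=0.5, h0=0.0, xs=[1.0, 1.0], target=2.0)
```

Every copy of the weight then gets the same update, proportional to the summed gradient `dw`, just as with the W1/W2 constraint.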

Â If we don't specify the initial state of the all the units, for example, if some of

Â them are hidden or output units, then we have to start them off in some particular

Â state. We could just fix those initial states to

Â have some default value like 0.5, but that might make the system work not quite as

Â well as it would otherwise work if it had some more sensible initial value.

Â So we can actually learn the initial states.

Â We treat them like parameters rather than activities and we learn them the same way

Â as learned the weights. We start off with an initial random guess

Â for the initial states. That is the initial states of all the

Â units that aren't input units And then at the end of each training sequence we back

Â propagate through time all the way back to the initial states.

Â And that gives us the gradient of the error function with respects to the

Â initial state. We then just, adjust the initial states by

Â following, that gradient. We go downhill in the gradient, and that

Â gives us new initial states that are slightly different.
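That loop can be sketched as follows, again for a toy scalar linear unit h[t] = w * h[t-1] + x[t] chosen purely for illustration. The key point is that backpropagating one extra step past time one yields dE/dh0, which we then treat like any other parameter gradient:

```python
# Learning the initial state: treat h0 as a parameter, backpropagate
# through time all the way back to time zero, and move h0 downhill
# along the resulting gradient. All values here are illustrative.

def grad_wrt_initial_state(w, h0, xs, target):
    hs = [h0]
    for x in xs:                      # forward pass
        hs.append(w * hs[-1] + x)
    dh = hs[-1] - target              # dE/dh at the final time step
    for _ in xs:                      # backprop all the way to time zero
        dh = dh * w
    return dh                         # dE/dh0

h0 = 0.5                              # initial random guess
lr = 1.0                              # a big step is fine for this tiny
for _ in range(100):                  # quadratic toy problem
    g = grad_wrt_initial_state(w=0.5, h0=h0, xs=[1.0, 1.0], target=2.0)
    h0 -= lr * g                      # adjust the initial state downhill
```

After repeated training sequences, `h0` settles near the initial state that best explains the target, rather than an arbitrary default like 0.5.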

4:29

There are many ways in which we can provide input to a recurrent neural net. We could, for example, specify the initial state of all the units. That's the most natural thing to do when we think of a recurrent net as a feed-forward net with constrained weights. We could specify the initial state of just a subset of the units, or we could specify the states at every time step for a subset of the units, and that's probably the most natural way to input sequential data. Similarly, there are many ways we can specify targets for a recurrent network. When we think of it as a feed-forward network with constrained weights, the natural thing to do is to specify the desired final states for all of the units. If we're trying to train it to settle to some attractor, we might want to specify the desired states not just for the final time step but for several time steps. That will cause it to actually settle down there, rather than passing through that state and going off somewhere else. So by specifying several states at the end, we can force it to learn attractors, and it's quite easy, as we backpropagate, to add in the derivatives we get from each time step. So the backpropagation starts at the top, with the derivatives for the final time step, and then as we go back through the layer below the top, we add in the derivatives for that layer, and so on. So it's really very little extra effort to have derivatives at many different layers. Or we could specify the desired activity of a subset of the units, which we might think of as output units, and that's a very natural way to train a recurrent neural network that is meant to be providing a continuous output.
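The "add in derivatives at several time steps" idea fits into the backward pass with almost no extra code: at each step, the derivative arriving from above is simply summed with the derivative injected by that step's own target. Here is a sketch using the same kind of toy scalar linear unit h[t] = w * h[t-1] + x[t]; the unit, the targets, and all the numbers are illustrative assumptions:

```python
# BPTT with desired states at several time steps (e.g. the last few,
# to encourage the net to settle into an attractor). Each targeted
# step contributes 0.5 * (h[t] - target[t])**2 to the error.

def bptt_with_per_step_targets(w, h0, xs, targets):
    hs = [h0]
    for x in xs:                      # forward pass
        hs.append(w * hs[-1] + x)
    dw, dh = 0.0, 0.0
    for t in range(len(xs), 0, -1):   # backward pass, starting at the top
        if t in targets:
            dh += hs[t] - targets[t]  # add in this time step's derivative
        dw += dh * hs[t - 1]          # shared weight's gradient, summed
        dh *= w                       # propagate back one time step
    return dw

# Desired states for the last two time steps, not just the final one.
dw = bptt_with_per_step_targets(w=0.5, h0=0.0, xs=[1.0, 1.0, 1.0],
                                targets={2: 2.0, 3: 2.0})
```

The same function handles the output-units case too: putting targets on a subset of time steps (or, with vector states, a subset of units) just changes which entries inject a derivative during the backward pass.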
