0:29

And the output of this particular implementation of max pooling will be a 2x2 output.

And the way you do that is quite simple: take your 4x4 input and break it into different regions, and I'm going to color the four regions as follows.

And then in the output, which is 2x2, each of the outputs will just be the max from the correspondingly shaded region.

So in the upper left, the max of these four numbers is 9. Upper right, the max of the blue numbers is 2. Lower left, the biggest number is 6, and lower right, the biggest number is 3.

So to compute each of the numbers on the right, we take the max over a 2x2 region.

So this is as if you apply a filter of size 2, because you're taking 2x2 regions, and you take a stride of 2. These are actually the hyperparameters of max pooling.

Because we start with this filter size, a 2x2 region, that gives you the 9. Then you step over two steps to look at this region, which gives you the 2. Then for the next row, you step down two steps to get the 6, and then step to the right by two steps to get the 3.

So because the squares are 2x2, f is equal to 2, and because you stride by 2, s is equal to 2.
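As a sketch of this computation, here is a minimal NumPy implementation (not from the lecture; it assumes no padding, and the 4x4 input values are hypothetical, chosen so the quadrant maxes match the 9, 2, 6, 3 described above):

```python
import numpy as np

def max_pool2d(x, f, s):
    """Max pooling with filter size f and stride s (no padding)."""
    n_h, n_w = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # max over the f x f region whose top-left corner is (i*s, j*s)
            out[i, j] = x[i * s:i * s + f, j * s:j * s + f].max()
    return out

# Hypothetical 4x4 input whose four 2x2 quadrants have maxes 9, 2, 6, 3.
x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]])
print(max_pool2d(x, f=2, s=2))  # [[9. 2.] [6. 3.]]
```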

So here's the intuition behind what max pooling is doing.

If you think of this 4x4 region as some set of features, the activations in some layer of the neural network, then a large number means that it has maybe detected a particular feature.

Right, so the upper left-hand quadrant has this particular feature, maybe a vertical edge, or maybe an eye, or maybe a whisker, if you're trying to detect a cat.

But clearly that feature exists in the upper left-hand quadrant, whereas this feature, maybe the cat eye detector, doesn't really exist in the upper right-hand quadrant.

So what the max operation does is, so long as the feature is detected anywhere in one of these quadrants, it then remains preserved in the output of max pooling.

So what the max operator really says is: if this feature is detected anywhere in this filter, then keep a high number. But if this feature is not detected, so maybe the feature doesn't exist in the upper right-hand quadrant, then the max of all those numbers is still itself quite small.

So maybe that's the intuition behind max pooling.

3:15

But I have to admit, I think the main reason people use max pooling is because it's been found in a lot of experiments to work well.

And the intuition I just described, despite it being often cited, I don't know if anyone fully knows whether that's the real underlying reason that max pooling works well in ConvNets.

One interesting property of max pooling is that it has a set of hyperparameters, but it has no parameters to learn. There's actually nothing for gradient descent to learn: once you fix f and s, it's just a fixed computation, and gradient descent doesn't change anything.

4:19

The formulas we developed in the previous videos for figuring out the output size of a conv layer also work for max pooling. That is, floor((n + 2p - f) / s) + 1. That formula also works to figure out the output size of max pooling.
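As a quick check of that formula, here is a hypothetical helper (not from the lecture; the 5x5 input size is an assumption consistent with the f = 3, s = 1 example that follows, which produces a 3x3 output):

```python
def pool_output_size(n, f, s, p=0):
    """Output size of a pooling (or conv) layer: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

print(pool_output_size(n=4, f=2, s=2))  # 2: the 4x4 -> 2x2 example above
print(pool_output_size(n=5, f=3, s=1))  # 3: a 5x5 input with f=3, s=1 -> 3x3
```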

But in this example, let's compute each of the elements of this 3x3 output.

For the upper left-hand element, we're going to look over that region. Notice this is a 3x3 region, because the filter size is 3; take the max there, and that's going to be 9.

Then we shift it over by one, because we're taking a stride of 1, so the max in the blue box is 9. Shift it over again; the max of the blue box is 5.

And then let's go on to the next row. With a stride of 1, we're just stepping down by one step. So the max in that region is 9, the max in that region is 9, and in that region there are two 5s, so the max is 5.

And then finally, the max in that region is 8, the max in that region is 6, and the max in the last region gives you the final element.

Okay, so with the hyperparameters f = 3, s = 1, this gives the output as shown.

Now, so far I've shown max pooling on a 2D input. If you have a 3D input, then the output will be 3D as well, with the same number of channels.

6:00

And the way you compute max pooling on a 3D input is that you perform the computation we just described on each of the channels independently.

So the first channel, which is shown here on top, is still the same.

And then for the second channel, the one I just drew at the bottom, you do the same computation on that slice of the volume, and that gives you the second slice of the output.
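The channel-wise computation can be sketched as follows (hypothetical NumPy code, assuming a channels-last layout):

```python
import numpy as np

def max_pool3d(x, f, s):
    """Apply 2D max pooling to each channel of x independently."""
    n_h, n_w, n_c = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w, n_c))
    for c in range(n_c):  # each channel is pooled on its own
        for i in range(out_h):
            for j in range(out_w):
                out[i, j, c] = x[i * s:i * s + f, j * s:j * s + f, c].max()
    return out

x = np.random.rand(4, 4, 5)           # a 4x4 input with 5 channels
print(max_pool3d(x, f=2, s=2).shape)  # (2, 2, 5): channel count is unchanged
```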

6:57

The other type of pooling is average pooling, where instead of taking the maxes within each filter, you take the average.

So in this example, the average of the numbers in purple is 3.75, then there's 1.25, and 4, and 2. And so this is average pooling with hyperparameters f = 2, s = 2, although you can choose other hyperparameters as well.
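Average pooling is the same sliding-window computation with the max replaced by a mean. A minimal sketch (hypothetical code and input, not the lecture's example values):

```python
import numpy as np

def avg_pool2d(x, f, s):
    """Average pooling with filter size f and stride s (no padding)."""
    n_h, n_w = x.shape
    out = np.zeros(((n_h - f) // s + 1, (n_w - f) // s + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # mean over the f x f region instead of the max
            out[i, j] = x[i * s:i * s + f, j * s:j * s + f].mean()
    return out

# Hypothetical input; with f = 2, s = 2 each output is the mean of a 2x2 block.
x = np.array([[ 1.,  2.,  3.,  4.],
              [ 5.,  6.,  7.,  8.],
              [ 9., 10., 11., 12.],
              [13., 14., 15., 16.]])
print(avg_pool2d(x, f=2, s=2))  # [[ 3.5  5.5] [11.5 13.5]]
```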

7:22

So these days, max pooling is used much more often than average pooling, with one exception, which is that sometimes, very deep in a neural network, you might use average pooling to collapse your representation from, say, 7x7x1000, averaging over the spatial extent, to get 1x1x1000. We'll see an example of this later.
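That collapse is just a mean over the two spatial dimensions, e.g. (a hypothetical sketch with random activations):

```python
import numpy as np

x = np.random.rand(7, 7, 1000)            # deep-layer activations, channels last
gap = x.mean(axis=(0, 1), keepdims=True)  # average over the whole spatial extent
print(gap.shape)  # (1, 1, 1000)
```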

7:54

So just to summarize, the hyperparameters for pooling are f, the filter size, and s, the stride.

A common choice of hyperparameters is f = 2, s = 2; this is used quite often, and it has the effect of shrinking the height and width of the representation by a factor of about two. I've also seen f = 3, s = 2 used.

And then the other hyperparameter is just a binary choice that says whether you're using max pooling or average pooling.

If you want, you can add an extra hyperparameter for the padding, although this is very, very rarely used.

When you do max pooling, usually you do not use any padding, although there is one exception that we'll see next week.

But for the most part, max pooling does not use any padding, so the most common value of p, by far, is p = 0.

9:41

One thing to note about pooling is that there are no parameters to learn. So when you implement backprop, you find that there are no parameters that backprop will adapt through max pooling.

Instead, there are just these hyperparameters that you set once, maybe by hand or using cross-validation, and beyond that, you're done.

It's just a fixed function that the neural network computes in one of its layers, and there is actually nothing to learn.
