
Strided convolutions are another piece of the basic building block of convolutions as used in convolutional neural networks. Let me show you an example.

Let's say you want to convolve this 7 x 7 image with this 3 x 3 filter, except that instead of doing it the usual way, we're going to do it with a stride of two. What that means is that you take the element-wise product as usual in this upper-left 3 x 3 region, then multiply and add, and that gives you 91. But then, instead of stepping the blue box over by one step, we're going to step it over by two steps, so it hops over two positions like so. Notice how the upper-left corner of the box has jumped over one position. Then you do the usual element-wise product and summing, and that gives you, it turns out, 100. And now we do that again and make the blue box jump over by two steps, so you end up there, and that gives you 83.

And now, when you go to the next row, you again take two steps instead of one, so we move the blue box over there; notice how we're skipping over one of the positions, and this gives you 69. You again step over two steps, which gives you 91, and so on: 127, and then, for the final row, 44, 72, and 74.

So in this example, we convolved a 7 x 7 matrix with a 3 x 3 matrix and got a 3 x 3 output.

So the input and output dimensions turn out to be governed by the following formula: if you have an n x n image and you convolve it with an f x f filter, using padding p and stride s (in this example, s = 2), then you end up with an output that is ((n + 2p - f)/s + 1) x ((n + 2p - f)/s + 1). The division by s is there because you're stepping s steps at a time instead of just one step at a time. So in our example, we have (7 + 2x0 - 3)/2 + 1 = 4/2 + 1 = 3, which is why we wound up with this 3 x 3 output.

Now, just one last detail: what if this fraction is not an integer? In that case, we round it down. This notation denotes the floor of a quantity; the floor of z means taking z and rounding down to the nearest integer. The way this is implemented is that you take this blue-box multiplication only if the blue box is fully contained within the image, or the image plus the padding; if any part of the blue box hangs outside, you just do not do that computation. So the convention is that your 3 x 3 filter must lie entirely within your image, or the image plus padding region, before a corresponding output is generated. Given that convention, the right thing to do to compute the output dimension is to round down in case (n + 2p - f)/s is not an integer.

So, just to summarize the dimensions: if you have an n x n matrix (an n x n image) that you convolve with an f x f matrix (an f x f filter), with padding p and stride s, then the output size will be floor((n + 2p - f)/s + 1) x floor((n + 2p - f)/s + 1). It's nice when we can choose all of these numbers so that this is an integer, although sometimes you don't have to do that, and rounding down is just fine as well. But please feel free to work through a few examples of values of n, f, p, and s for yourself to convince yourself, if you want, that this formula is correct for the output size.
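
As one way to work through such examples, here is a small sketch of the formula (the function name is just illustrative):

```python
import math

def conv_output_size(n, f, p=0, s=1):
    """Output spatial size: floor((n + 2p - f) / s) + 1."""
    return math.floor((n + 2 * p - f) / s) + 1

print(conv_output_size(7, 3, p=0, s=2))   # 3, the example from this video
print(conv_output_size(7, 3, p=0, s=1))   # 5, the same image without a stride
print(conv_output_size(6, 3, p=0, s=2))   # 2, here the fraction rounds down
```
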

Now, before moving on, there is a technical comment I want to make about cross-correlation versus convolution. This won't affect what you have to do to implement convolutional neural networks, but depending on whether you read a math textbook or a signal processing textbook, there is one other possible inconsistency in the notation. If you look at a typical math textbook, the way that convolution is defined, there is actually one other step you would take before doing the element-wise product and summing: to convolve this 6 x 6 matrix with the 3 x 3 filter, you would first take the 3 x 3 filter and flip it on the horizontal as well as the vertical axis.

So this filter, with rows 3, 4, 5, then 1, 0, 2, then -1, 9, 7, gets mirrored: the 3 goes here, the 4 goes there, the 5 goes there, and likewise for the rows with 1, 0, 2 and -1, 9, 7. This is really taking the 3 x 3 filter and mirroring it on both the vertical and horizontal axes. It is this flipped matrix that you would then copy over here. So to compute the output, you would take 2 times 7, plus 3 times 2, plus 7 times 5, and so on: you multiply out the elements of this flipped matrix in order to compute the upper-left-most element of the 4 x 4 output. And then you take those nine numbers and shift them over by one, then over by one again, and so on.
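
To make the flipping step concrete, here is a sketch using the filter values read out above (3, 4, 5; 1, 0, 2; -1, 9, 7). The 6 x 6 image from the slide isn't reproduced in the transcript, so arbitrary values stand in for it.

```python
import numpy as np

# The filter values read out in the video: 3, 4, 5 / 1, 0, 2 / -1, 9, 7
kernel = np.array([[ 3, 4, 5],
                   [ 1, 0, 2],
                   [-1, 9, 7]])

# Mirror on both the horizontal and vertical axes (a 180-degree rotation)
flipped = np.flip(kernel)
print(flipped)
# [[ 7  9 -1]
#  [ 2  0  1]
#  [ 5  4  3]]

def cross_correlate2d(image, k):
    """The deep-learning 'convolution': slide k over the image, no flip."""
    n, f = image.shape[0], k.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * k)
    return out

# Textbook convolution = cross-correlation with the pre-flipped kernel
image = np.arange(36).reshape(6, 6)   # stand-in for the slide's 6 x 6 matrix
textbook_conv = cross_correlate2d(image, flipped)
print(textbook_conv.shape)   # (4, 4): a 6 x 6 input gives a 4 x 4 output
```
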

So the way we've defined the convolution operation in these videos is that we've skipped this mirroring operation. Technically, the operation we've been using for the last few videos is sometimes called cross-correlation instead of convolution, but in the deep learning literature, by convention, we just call this the convolution operation.

So, just to summarize: by convention in machine learning, we usually do not bother with this flipping operation. Technically, this operation is maybe better called cross-correlation, but most of the deep learning literature just calls it the convolution operator, and so I'm going to use that convention in these videos as well. If you read a lot of the machine learning literature, you'll find that most people just call this the convolution operator without bothering to use these flips.

It turns out that in signal processing, and in certain branches of mathematics, doing the flipping in the definition of convolution causes the convolution operator to enjoy associativity: (A * B) * C = A * (B * C). This is nice for some signal processing applications, but for deep neural networks it really doesn't matter, and so omitting this double mirroring operation just simplifies the code and makes the neural networks work just as well. By convention, most of us just call this convolution, even though mathematicians sometimes prefer that we call it cross-correlation. But this should not affect anything you have to implement in the following exercises, and it should not affect your ability to read and understand the deep learning literature.
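
The associativity point can be checked numerically. NumPy's 1-D np.convolve implements the textbook (flipped) convolution, so a small sketch with random signals shows the property holding for true convolution but, in general, failing for cross-correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c = rng.normal(size=5), rng.normal(size=3), rng.normal(size=4)

# True (flipped) convolution is associative: (a * b) * c == a * (b * c)
lhs = np.convolve(np.convolve(a, b), c)
rhs = np.convolve(a, np.convolve(b, c))
assert np.allclose(lhs, rhs)

# Cross-correlation (no flip) is not associative in general
def xcorr(x, k):
    return np.convolve(x, k[::-1])   # correlate = convolve with flipped kernel

lhs_cc = xcorr(xcorr(a, b), c)
rhs_cc = xcorr(a, xcorr(b, c))
print(np.allclose(lhs_cc, rhs_cc))   # typically False for random signals
```
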

So you've now seen how to carry out convolutions, and you've seen how to use padding as well as strides for convolutions. But so far, all we've been using are convolutions over matrices, like a 6 x 6 matrix. In the next video, you'll see how to carry out convolutions over volumes, and this will make what you can do with convolutions suddenly much more powerful. Let's go on to the next video.
