
In the last video, you learned how to use a convolutional implementation of sliding windows. That's more computationally efficient, but it still has the problem of not quite outputting the most accurate bounding boxes. In this video, let's see how you can get your bounding box predictions to be more accurate.

With sliding windows, you take this discrete set of locations and run the classifier through it. In this case, none of the boxes really match up perfectly with the position of the car, so maybe that box is the best match. Also, it looks like the ground truth, the perfect bounding box, isn't even quite square. It's actually a slightly wider rectangle, with a slightly horizontal aspect ratio. So is there a way to get this algorithm to output more accurate bounding boxes?

A good way to get more accurate bounding boxes is with the YOLO algorithm. YOLO stands for You Only Look Once, and it is an algorithm due to Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi.

Here's what you do. Let's say you have an input image that is 100 by 100. You're going to place down a grid on this image. For the purposes of illustration, I'm going to use a 3 by 3 grid, although in an actual implementation you would use a finer one, like maybe a 19 by 19 grid. The basic idea is that you're going to take the image classification and localization algorithm that you saw in the first video of this week and apply it to each of the nine grid cells of this image. To be more concrete, here's how you define the labels you use for training.

For each of the nine grid cells, you specify a label y, where y is this 8-dimensional vector, same as you saw previously. The first output, p_c, is 0 or 1 depending on whether or not there's an object in that grid cell. Then b_x, b_y, b_h, b_w specify the bounding box if there is an object associated with that grid cell. And then you have c_1, c_2, c_3, if you're trying to recognize three classes, not counting the background class. So if you're trying to recognize pedestrians, cars, and motorcycles, plus the background class, then c_1, c_2, c_3 can be the pedestrian, car, and motorcycle classes. In this image we have nine grid cells, so you have a vector like this for each of the grid cells.
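Here is a minimal sketch of that 8-dimensional label for a single grid cell. The function name `make_label`, the `CLASSES` list, and the choice to store entries that don't matter as `None` are assumptions made for illustration, not something the lecture prescribes.

```python
from typing import List, Optional, Tuple

# Assumed class order for c_1, c_2, c_3.
CLASSES = ["pedestrian", "car", "motorcycle"]


def make_label(
    box: Optional[Tuple[float, float, float, float]] = None,
    class_name: Optional[str] = None,
) -> List[Optional[float]]:
    """Build [p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3] for one grid cell.

    With no box, p_c is 0 and the remaining seven entries are None,
    a stand-in for values that are ignored when there is no object.
    """
    if box is None:
        return [0.0] + [None] * 7
    if class_name not in CLASSES:
        raise ValueError(f"unknown class: {class_name!r}")
    b_x, b_y, b_h, b_w = box
    one_hot = [1.0 if name == class_name else 0.0 for name in CLASSES]
    return [1.0, b_x, b_y, b_h, b_w] + one_hot


empty_cell = make_label()
car_cell = make_label((0.4, 0.3, 0.5, 0.9), "car")  # illustrative box values
```

Both calls return a list of length 8, so every grid cell gets a label of the same shape whether or not it contains an object.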

So let's start with the upper left grid cell, this one up here. For that one, there is no object, so the label vector y for the upper left grid cell would have 0 for p_c, and then don't cares for the rest of these. The output label y would be the same for this grid cell, and this grid cell, and all the grid cells with no interesting object in them.

Now, how about this grid cell? To give a bit more detail, this image has two objects. What the YOLO algorithm does is take the midpoint of each of the two objects and assign each object to the grid cell containing its midpoint. So the left car is assigned to this grid cell, and the car on the right, which has this midpoint, is assigned to this grid cell. Even though the central grid cell has some parts of both cars, we'll pretend the central grid cell has no interesting object. So for the central grid cell, the label y also looks like this vector with no object: the first component p_c is 0, and the rest are don't cares.

Whereas for this cell, the one I've circled in green on the left, the target label y would be as follows. There is an object, so p_c is 1, and then you write b_x, b_y, b_h, b_w to specify the position of this bounding box. Then, let's see, if class 1 was a pedestrian, then that is 0; class 2 is a car, so that's 1; class 3 was a motorcycle, so that's 0. Similarly, for the grid cell on the right, because that does have an object in it, it will also have some vector like this as its target label.

So for each of these nine grid cells, you end up with an 8-dimensional output vector. And because you have 3 by 3 grid cells, the total volume of the target output is going to be 3 by 3 by 8. For example, this 1 by 1 by 8 volume in the upper left corresponds to the target output vector for the upper left of the nine grid cells. So for each of the 3 by 3 positions, for each of these nine grid cells, there is a corresponding 8-dimensional target vector y that you want in the output, some of which could be don't cares if there's no object there. That's why the total target output, the output label for this image, is now itself a 3 by 3 by 8 volume.

So now, to train your neural network, the input is 100 by 100 by 3; that's the input image. Then you have a usual ConvNet with conv layers, max pool layers, and so on.

You should choose the conv layers and the max pool layers and so on so that this eventually maps to a 3 by 3 by 8 output volume. So what you do is you have an input x, which is the input image like that, and you have these target labels y, which are 3 by 3 by 8, and you use backpropagation to train the neural network to map from any input x to this type of output volume y. The advantage of this algorithm is that the neural network outputs precise bounding boxes.
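Going back to the training setup, here is a small sketch of what "choose the layers so that this maps to 3 by 3 by 8" can look like in terms of shapes. The layer stack below is entirely hypothetical (the lecture does not give one); it only uses the standard output-size formula for unpadded conv and pooling layers to check that a 100 by 100 by 3 input ends up as a 3 by 3 by 8 volume.

```python
from typing import List, Optional, Tuple


def out_size(n: int, f: int, stride: int = 1, pad: int = 0) -> int:
    """Spatial size after a conv or pooling layer: floor((n + 2*pad - f) / stride) + 1."""
    return (n + 2 * pad - f) // stride + 1


# Hypothetical stack: (kind, filter size, stride, output channels; None for pooling).
LAYERS: List[Tuple[str, int, int, Optional[int]]] = [
    ("conv", 5, 1, 16),
    ("pool", 2, 2, None),
    ("conv", 5, 1, 32),
    ("pool", 2, 2, None),
    ("conv", 5, 1, 64),
    ("pool", 2, 2, None),
    ("pool", 3, 3, None),
    ("conv", 1, 1, 8),  # 1x1 conv producing the 8 label channels per grid cell
]


def trace_shapes(height: int, width: int, channels: int,
                 layers: List[Tuple[str, int, int, Optional[int]]]) -> List[Tuple[int, int, int]]:
    """Return the (height, width, channels) shape after the input and after each layer."""
    shapes = [(height, width, channels)]
    for kind, f, stride, out_channels in layers:
        height = out_size(height, f, stride)
        width = out_size(width, f, stride)
        if kind == "conv":
            channels = out_channels  # pooling keeps the channel count
        shapes.append((height, width, channels))
    return shapes


shapes = trace_shapes(100, 100, 3, LAYERS)
final_shape = shapes[-1]  # (3, 3, 8) for this particular stack
```

Any other stack works just as well, as long as the final shape matches the grid size and the length of the label vector.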

So at test time, what you do is feed in an input image x and run forward prop until you get this output y. Then, for each of the nine outputs, for each of the 3 by 3 positions in the output, you can just read off 1 or 0: is there an object associated with that one of the nine positions? And if there is an object, what object is it, and where is the bounding box for the object in that grid cell? So long as you don't have more than one object in each grid cell, this algorithm should work okay. The problem of having multiple objects in a grid cell is something we'll address later. Here I've used a relatively small 3 by 3 grid. In practice, you might use a much finer grid, maybe 19 by 19, so you end up with a 19 by 19 by 8 output.
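The read-off step at test time can be sketched as follows. This assumes the network output is available as a nested list of shape grid by grid by 8, and that p_c comes out as a score between 0 and 1 that is compared against a threshold of 0.5; the function name, the threshold, and the sample numbers are illustrative assumptions.

```python
from typing import Dict, List, Sequence

CLASSES = ["pedestrian", "car", "motorcycle"]  # assumed order of c_1, c_2, c_3


def read_detections(output: Sequence[Sequence[Sequence[float]]],
                    threshold: float = 0.5) -> List[Dict]:
    """Scan every grid cell and keep the ones whose p_c passes the threshold."""
    detections = []
    for row, cells in enumerate(output):
        for col, y in enumerate(cells):
            p_c = y[0]
            if p_c < threshold:
                continue  # no object assigned to this cell
            scores = y[5:8]
            best = max(range(len(CLASSES)), key=lambda i: scores[i])
            detections.append({
                "cell": (row, col),
                "p_c": p_c,
                "box": tuple(y[1:5]),  # (b_x, b_y, b_h, b_w) as output by the network
                "class": CLASSES[best],
            })
    return detections


# A made-up 3 by 3 by 8 output with objects in the left and right cells of the middle row.
output = [[[0.0] * 8 for _ in range(3)] for _ in range(3)]
output[1][0] = [0.9, 0.5, 0.6, 0.4, 0.8, 0.1, 0.8, 0.1]
output[1][2] = [0.7, 0.4, 0.3, 0.5, 0.9, 0.05, 0.9, 0.05]
detections = read_detections(output)
```

Each cell contributes at most one detection here, which is why the single-object-per-cell assumption matters.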

That finer grid also reduces the chance that there are multiple objects assigned to the same grid cell. Just as a reminder, the way you assign an object to a grid cell is you look at the midpoint of the object and then assign that object to whichever one grid cell contains the midpoint. So each object, even if it spans multiple grid cells, is assigned to only one of the nine grid cells, or one of the 3 by 3, or one of the 19 by 19 grid cells. And with a 19 by 19 grid, the chance of two objects' midpoints appearing in the same grid cell is just a bit smaller.

So notice two things. First, this is a lot like the image classification and localization algorithm that we talked about in the first video of this week, in that it outputs the bounding box coordinates explicitly. This allows the neural network to output bounding boxes of any aspect ratio, as well as much more precise coordinates that aren't just dictated by the stride size of your sliding windows classifier. Second, this is a convolutional implementation. You're not running this algorithm nine times on the 3 by 3 grid, or, if you're using a 19 by 19 grid, 19 squared is 361, so you're not running the same algorithm 361 times. Instead, this is one single convolutional implementation, where you use one ConvNet with a lot of shared computation between all the computations needed for all of your 3 by 3, or all of your 19 by 19, grid cells. So this is a pretty efficient algorithm.
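The midpoint assignment rule recalled above, and the effect of a finer grid on collisions, can be sketched like this. The function name and the two midpoints are made up for illustration; coordinates are assumed to be pixels with the origin at the upper left of the image.

```python
from typing import Tuple


def assign_cell(mid_x: float, mid_y: float,
                image_w: float, image_h: float, grid: int) -> Tuple[int, int]:
    """Return the (row, col) of the grid cell that contains the midpoint."""
    col = min(int(mid_x * grid / image_w), grid - 1)  # clamp a midpoint on the right edge
    row = min(int(mid_y * grid / image_h), grid - 1)  # clamp a midpoint on the bottom edge
    return row, col


# Two nearby objects in a 100 by 100 image (hypothetical midpoints).
first_mid = (40.0, 50.0)
second_mid = (55.0, 60.0)

coarse = [assign_cell(x, y, 100, 100, 3) for x, y in (first_mid, second_mid)]
fine = [assign_cell(x, y, 100, 100, 19) for x, y in (first_mid, second_mid)]
# coarse -> [(1, 1), (1, 1)]: both midpoints fall in the central cell
# fine   -> [(9, 7), (11, 10)]: each object gets its own cell
```

With the 3 by 3 grid the two objects would compete for one label vector, while the 19 by 19 grid separates them.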

In fact, one nice thing about the YOLO algorithm, which accounts for its popularity, is that because this is a convolutional implementation, it actually runs very fast. So this works even for real-time object detection.

Now, before wrapping up, there's one more detail I want to share with you, which is how you encode these bounding boxes b_x, b_y, b_h, b_w. Let's discuss that on the next slide.

So, given these two cars, remember we have the 3 by 3 grid. Let's take the example of the car on the right. In this grid cell there is an object, so the target label y will have p_c equal to 1, then b_x, b_y, b_h, b_w, and then 0, 1, 0. So how do you specify the bounding box? In the YOLO algorithm, relative to this square, I take the convention that the upper left point here is (0, 0) and this lower right point is (1, 1). To specify the position of that midpoint, that orange dot, b_x looks like it's about 0.4, since it's maybe about 0.4 of the way to the right, and b_y looks like maybe 0.3. Then the width of the bounding box is specified as a fraction of the overall width of this grid cell. The width of this red box is maybe 90 percent of that blue line, so b_w is 0.9. And the height of this is maybe one half of the overall height of the grid cell, so in that case b_h would be 0.5.
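Putting the midpoint rule and this cell-relative convention together, here is a sketch of how a full target volume could be built from pixel bounding boxes. The input format (class name plus corner coordinates in pixels), the 90 by 90 image size, and the use of `None` for don't-care entries are assumptions chosen to keep the numbers simple; it also assumes at most one object per grid cell.

```python
from typing import List, Optional, Sequence, Tuple

CLASSES = ["pedestrian", "car", "motorcycle"]  # assumed order of c_1, c_2, c_3

PixelBox = Tuple[str, float, float, float, float]  # (class, x_min, y_min, x_max, y_max)


def encode_targets(objects: Sequence[PixelBox], image_w: float, image_h: float,
                   grid: int) -> List[List[List[Optional[float]]]]:
    """Return a grid x grid x 8 target volume with cell-relative box coordinates."""
    cell_w = image_w / grid
    cell_h = image_h / grid
    # Start with "no object" everywhere: p_c = 0, the rest are don't cares.
    targets = [[[0.0] + [None] * 7 for _ in range(grid)] for _ in range(grid)]
    for class_name, x_min, y_min, x_max, y_max in objects:
        mid_x = (x_min + x_max) / 2
        mid_y = (y_min + y_max) / 2
        col = min(int(mid_x / cell_w), grid - 1)
        row = min(int(mid_y / cell_h), grid - 1)
        b_x = mid_x / cell_w - col        # 0..1 within the cell, left to right
        b_y = mid_y / cell_h - row        # 0..1 within the cell, top to bottom
        b_h = (y_max - y_min) / cell_h    # can exceed 1 for tall objects
        b_w = (x_max - x_min) / cell_w    # can exceed 1 for wide objects
        one_hot = [1.0 if name == class_name else 0.0 for name in CLASSES]
        targets[row][col] = [1.0, b_x, b_y, b_h, b_w] + one_hot
    return targets


# A car whose midpoint sits 0.4 across and 0.3 down the middle-right cell of a
# 90 by 90 image (30-pixel cells), 27 pixels wide and 15 pixels tall.
targets = encode_targets([("car", 58.5, 31.5, 85.5, 46.5)], 90, 90, 3)
car_label = targets[1][2]  # roughly [1, 0.4, 0.3, 0.5, 0.9, 0, 1, 0]
```

The box values come out as floating-point approximations of 0.4, 0.3, 0.5, and 0.9, and every other cell keeps the no-object label.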

So in other words, b_x, b_y, b_h, b_w are specified relative to the grid cell. b_x and b_y have to be between 0 and 1, because pretty much by definition that orange dot is within the bounds of the grid cell it's assigned to. If it wasn't between 0 and 1, it would be outside the square, and then it would have been assigned to a different grid cell. But b_h and b_w could be greater than 1. In particular, if you had a car where the bounding box was like that, then the height and width of the bounding box could be greater than 1.

So there are multiple ways of specifying the bounding boxes, but this would be one convention that's quite reasonable. If you read the YOLO research papers, the YOLO research line of work, there are other parameterizations that work even a little bit better, but I hope this gives you one reasonable convention that should work okay. There are some more complicated parameterizations involving sigmoid functions to make sure that b_x and b_y are between 0 and 1.

Those parameterizations also use an exponential to make sure that the height and width are non-negative, since values like 0.9 or 0.5 have to be greater than or equal to 0. These more advanced parameterizations work a little bit better, but the one you saw here should work okay.

So that's it for the YOLO, or You Only Look Once, algorithm. In the next few videos, I'll show you a few other ideas that will help make this algorithm even better. In the meantime, if you want, you can take a look at the YOLO paper referenced at the bottom of these past couple of slides.

Just one warning, though, if you take a look at these papers: the YOLO paper is one of the harder papers to read. I remember when I was reading this paper for the first time, I had a really hard time figuring out what was going on. I wound up asking a couple of my friends, very good researchers, to help me figure it out, and even they had a hard time understanding some of the details of the paper. So if you look at the paper, it's okay if you have a hard time figuring it out. I wish it was more uncommon, but it's not that uncommon, sadly, for even senior researchers to read research papers and have a hard time figuring out the details, and have to look at the open source code, or contact the authors, or something else to figure out the details of these algorithms. But don't let me stop you from taking a look at the paper yourself if you wish. It's just one of the harder ones.

So with that, you now understand the basics of the YOLO algorithm. Let's go on to some additional pieces that will make this algorithm work even better.
