0:00

Hello and welcome back this week you learn about object detection this is one

Â of the areas of computer vision that's just exploding and it's working so much

Â better than just a couple years ago in order to build up to object detection

Â you first learn about object localization let's start by defining

Â what that means you're already familiar with the image classification task where

Â an algorithm looks at this picture and might be responsible for saying this is

Â a car so that was classification the problem you learn to build a neural

Â network to address later in this video is classification with localization

Â which means not only do you have to label this as say a car but the

Â algorithm also is responsible for putting a bounding box or drawing a red

Â rectangle around the position of the car in the image so that's called the

Â classification with localization problem where the term localization refers to

Â figuring out where in the picture is the car you've detected later this week you

Â then learn about the detection problem where now there might be multiple

Â objects in the picture and you have to detect them all and localize them all

Â and if you're doing this for an autonomous driving application then you

Â might need to detect not just other cause but maybe other pedestrians and

Â motorcycles and maybe even other objects so you see that later this week so in

Â the terminology we'll use this week the classification and the classification of

Â localization problems usually have one object usually one big object in the

Â middle of the image that you're trying to recognize or recognize and localize

Â in contrast in the detection problem there can be multiple objects and in

Â fact maybe even multiple objects of different categories within a single

Â image so the idea is you've learned about for image classification will be

Â useful for classification with localization and then the idea is you

Â learn for localization will then turn out to be useful for detection so let's

Â start by talking about classification with localization you're already

Â familiar with the image classification problem in which you might input a

Â picture into a confident with multiple layers so there's a confident and this

Â results in and this results in a vector features that is fed to maybe a soft max

Â unit that outputs the predicted class so if you are building a self-driving car

Â maybe your object categories are they're following where you might have a

Â pedestrian or a car or a motorcycle or background this means none of the above

Â some of this no pedestrian in your column no motorcycle then you may have

Â an output background so these are your classes they have a soft max with four

Â possible outputs so this is the standard classification pipeline how about if you

Â want to localize the car in the image as well to do that you can change your

Â neural network to have a few more output units that output a bounding box so in

Â particular you can have the neural network output for more numbers and I'm

Â going to call them B X B Y B H and B W and these phone numbers

Â parameterize the bounding box of the detected object so in these videos I'm

Â going to use the notational convention that the upper left of the image I'm

Â going to denote as the coordinate 0 0 and the lower right is 1 1 so specifying

Â the bounding box the red rectangle requires specifying the midpoint so

Â that's the point B X comma B Y as well as the height that

Â would be B H as well as the width B W of this bounding box so now if your

Â training set contains not just the object class label which your neural

Â network is trying to predict up here but it also contains four additional number

Â giving the bounding box then you can use supervised learning to make your

Â algorithm outputs not just a class label but also the four parameters to tell you

Â where is the bounding box of the object you detect it so in this example the

Â idea px might be about 0.5 because this is about half way to the right to the

Â image B Y might be about 0.7 since that's about you know maybe 70% of the

Â way down to the image B H might be about 0.3 because the height of this red

Â square is about 30% of the overall height of the image and BW might be

Â about 0.4 let's say because the width of the red box is about 0.4 of the overall

Â width of the entire image so let's formalize this a bit more in terms of

Â how we define the target label Y for this as a supervised learning task so

Â just as a reminder these are our four classes and the new network now outputs

Â those phone numbers as well as a class label or maybe probabilities of the

Â class labels so let's define the target label Y as follows it's going to be a

Â vector where the first component PC is going to be is there an object so if the

Â object is classes 1 2 or 3 PC will be equal to 1 and if is the background

Â class so if it's none of the objects you're trying to detect then PC will be

Â 0 and PC you can think of that as standing for the probability that

Â there's a object probability that one of the classes you're trying to detect is

Â there so something other than the background class next if there is an

Â object then you wanted to output the X B Y B H & BW the bounding box for the

Â object you detect it and finally if there is an object so if

Â PC is equal to one you wanted to also output c1 c2 and c3 which tells it is it

Â the class 1 class 2 or class 3 so is it a pedestrian a car or a motorcycle and

Â remember in the problem we're addressing we assume that your image has only one

Â object so at most one of these objects appears in the picture in this

Â classification with localization problem so let's go through a couple examples if

Â this is a training set image so that is X then Y will be the first component PC

Â will be equal to 1 because there is an object then b XB y bh + BW will specify

Â the bounding box so your label training set we'll need bounding boxes in the

Â labels and then finally this is a car so it's cost 2 so C 1 will be 0 because

Â it's not a pedestrian C 2 B 1 because it is a car C 3 will be 0 since it's not a

Â motorcycle so among C 1 C 2 C 3 at most one of them should be equal to 1 so

Â that's if there is an object in the image 1 of this no object in the image

Â what do we have a training example where X is equal to that in this case PC would

Â be equal to 0 and the rest of the elements of this will be don't cares so

Â I'm going to write question marks in all of them so this is a don't care because

Â if there is no object in this image then you don't care what bounding box the

Â neural network outputs as well as which of the three objects you want C 2 C 3 it

Â thinks of this so given a set of labeled training examples this is how you it

Â construct X the input image as well as Y the class label both images where there

Â is an object and for images where there is no object and the set of these will

Â then define your training set finally next let's describe the loss function

Â you use to train the neural network so the ground crew of label was Y and in

Â your network outputs some Y hat what should be the lost speed well if you're

Â using squared error then the loss can be y 1 hat minus y 1 squared plus y 2 hats

Â minus y 2 squared plus dot dot plus y 8 hat minus y 8 squared notice that Y here

Â has eight components so that goes from sum of the squares of the difference of

Â the elements and that's the loss if y 1 is equal to 1 so that's the case where

Â there is an object so Y 1 is equal to PC right so PC is equal to 1 that is if

Â there is an object in the image then the loss can be the sum of squares over all

Â the different elements the other case is if Y 1 is equal to 0 so that's if this

Â PC is equal to 0 in that case the loss can be just Y 1 hat minus y 1 squared

Â because in that second case all the rest of the components are don't cares and so

Â all you care about is how accurately is the neural network outputting PC in that

Â case so just a recap if y 1 is equal to 1 that's this case then you can use the

Â squared error to penalise squared deviation from the predictor than the

Â actual outputs for all eight components whereas if Y 1 is equal to 0 then you

Â know the second to the eighth components that don't care so all you care about is

Â how accurately is your neural network estimating y1 which is equal to PC and

Â just as a side comment for those of you that want to know all the details have

Â used the squared error just to simplify the description here

Â in practice you could use you can probably use a log likelihood loss for

Â the c1 c2 c3 to the softmax output one of those elements usually you can use

Â squared error or something like squared error for the bounding box coordinates

Â and then for PC you could use something like the logistic regression loss

Â although even if you use squared error or very work okay so that's how you get

Â a neural network to not just classify an object but also to localize it the idea

Â of having a neural network output a bunch of real numbers to tell you where

Â things are in a picture turns out to be a very powerful idea in the next video I

Â want to share of you some other places where this idea of having a neural

Â network I'll put a set of real numbers almost as a regression task can be very

Â powerfully used elsewhere in computer vision as well so let's go on to the

Â next video

Â