0:00
[SOUND].
For example, take the usual MNIST image classification problem. This time, though, you're not that interested in classifying digits in MNIST; you want to classify digits on house numbers, for example. But of course, the problem with house numbers is that, let's say, you have a labeled dataset for MNIST, but no one has given you labeled house numbers yet; let's just pretend such a dataset doesn't exist.
0:26
Of course, this is a toy problem, but the same thing arises when, for example, you have an image classifier and you want to apply it to images from your social network. So you go to a slightly different set of photo cameras, a different set of brands, maybe just different content. And you want your network to be as good on this changed domain, on this new dataset, as it was on the originally labeled one.
0:53
Now, of course, you could just stop training earlier and somehow validate, but let's see how we can improve over this classical approach with an adversarial method.
Your original task is handled by a classifier or regressor like this one. Let's split this model into two parts: the first part, the left one, tries to extract features, and the second part uses them to predict something. This division is, however, really arbitrary; there's no specific boundary to split this model at, you can pick any one. And the idea here is that the whole model is usually trained via backpropagation. Now, if you want to prevent this model from overfitting to your particular domain, let's try to apply the adversarial idea to those features.
1:41
Here there is this purple network, which is our discriminator; it looks at the intermediate features. It tries to judge how the model sees the world, and it tries to distinguish between the features your model produces as it processes the initial training set of images and the target domain of images. So it basically tries to see whether there is any difference between how your model behaves on training objects and on those out-of-domain, target-domain objects, social network images, for example.
2:33
Well, yeah, exactly: if this discriminator is able to distinguish, simply by looking at the features, what kind of image it is, it means the representation your neural network learned is different for training images and target-domain images. If something differs between training and the data you apply the model to, it's usually a bad sign. In this case, it's a really bad sign: it means your model overfits.
2:53
Aside from its original loss, this L_classifier here, you also add this kind of adversarial component, the adversarial term based on the discriminator's probability of "real". And this basically means that you want to train those features, this left part of your classifier network, so as to make it indistinguishable how it operates on the training data and how it operates on the target domain.
3:18
And again, you train those two models simultaneously; you try to optimize a kind of mixed objective for the classifier. And of course, you can tune it slightly by scaling the adversarial term in the classifier objective by a multiplicative constant, a kind of regularization factor, if you wish. And this way, you can obtain a model that tries to adapt toward a domain for which you don't even need to have any labels. So it doesn't actually need labeled social network images or labeled house numbers. What it needs is labeled MNIST, or labeled ImageNet, plus unlabeled data from your target domain, so basically, this is a very powerful idea here.
And since I'm promoting the idea that deep learning is a kind of language you speak to your machine learning model to describe what you actually want it to learn, this kind of adversarial approach gives you one of the words of power, which is "indistinguishable". So if you want some kind of behavior to be indistinguishable between one case and another, you can train a discriminator and try to optimize against it in an adversarial manner.
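To make that concrete, here is a minimal sketch of one such training step in PyTorch. Everything in it is assumed for illustration: the tiny module definitions, the shapes, and the value of the scaling constant lam; only the source batch carries labels, the target batch is unlabeled.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical modules; any feature extractor / classifier / discriminator would do.
features = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(4), nn.Flatten())      # the "left part"
classifier = nn.Linear(32 * 16, 10)                                  # predicts digit labels
discriminator = nn.Sequential(nn.Linear(32 * 16, 64), nn.ReLU(),
                              nn.Linear(64, 1))                      # source vs. target

opt_model = torch.optim.Adam(list(features.parameters()) + list(classifier.parameters()), lr=1e-3)
opt_disc = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
lam = 0.1  # multiplicative constant on the adversarial term, the "regularization factor"

def train_step(x_src, y_src, x_tgt):
    # 1) Train the discriminator to tell source features (1) from target features (0).
    f_src, f_tgt = features(x_src).detach(), features(x_tgt).detach()
    d_logits = torch.cat([discriminator(f_src), discriminator(f_tgt)])
    d_labels = torch.cat([torch.ones(len(x_src), 1), torch.zeros(len(x_tgt), 1)])
    d_loss = F.binary_cross_entropy_with_logits(d_logits, d_labels)
    opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()

    # 2) Train features + classifier: classify labeled source images correctly,
    #    while making target features look indistinguishable from source features.
    cls_loss = F.cross_entropy(classifier(features(x_src)), y_src)
    adv_loss = F.binary_cross_entropy_with_logits(
        discriminator(features(x_tgt)), torch.ones(len(x_tgt), 1))   # try to fool it
    loss = cls_loss + lam * adv_loss
    opt_model.zero_grad(); loss.backward(); opt_model.step()
```

Note that in step 2 only the feature extractor and classifier are updated (the discriminator has its own optimizer in step 1), so the features are pushed to fool the discriminator, which is exactly the "indistinguishable" behavior described above.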
4:39
You have probably all seen the cool artificial intelligence apps, Prisma and its various competitors; Prisma is probably the overwhelming favorite here.
The idea is that those apps try to morph your image in a way that
follows the artistic style of a particular painting,
or maybe a particular style of art, like impressionism, for example.
And so far, you do this by magically feeding your image into the super mega image box and waiting for a minute.
Let's cover the math, the nuts and bolts of how you actually do that.
Again, you have to make some representation of the model indistinguishable here. This time you won't need a dedicated trainable discriminator, but you want to somehow obtain an image representation that preserves only the style information. So you want to define the style of an image in a way that the representation you get covers only style but not content.
So basically, you have to mimic style, but
you have to preserve the content of an image.
If you have a selfie, you want to still see your face on it,
but the style, the texture, should be like Monet, or something similar.
5:58
This is, again, a non-mathematical problem, or, well, a heuristic one, if you wish. You could try to define this art style by taking a pre-trained network, one of the ImageNet models, for example, and taking some kind of representation from this network that only captures local information. Of course, you won't be able to just take it as is; you'll have to compute something that preserves the kind of low-level texture information but throws away all the higher-order features, like what's actually in the image.
Now, can you think of such a transformation?
6:30
And there is, of course, more than one way you could do that, and it's likely that at least some of you managed to land on something even better than the idea we're going to cover right now.
But what you could do, at least, is take filters from some lower layers of a pre-trained network, some layers that are not too deep, shallow enough that the filters only capture texture and very small image details. And then you can either average over the whole image, like global average pooling, or try to compute the Gram matrix over this kind of two-dimensional activation map.
The intuition here, if you don't know the math of Gram matrices, can be explained the following way: you compute how frequently texture features co-occur, coincide, at the same spatial locations. You compute this for all pairs of features, and you use this matrix as a kind of style descriptor, as a representation of the art style.
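As a rough sketch (assuming a single activation map of shape channels × height × width taken from some shallow convolutional layer), the Gram matrix can be computed like this:

```python
import torch

def gram_matrix(activations):
    """Style descriptor from a conv activation map of shape (channels, height, width)."""
    c, h, w = activations.shape
    flat = activations.reshape(c, h * w)   # each row: one texture feature over all locations
    gram = flat @ flat.t()                 # (c, c): how strongly pairs of features co-occur
    return gram / (c * h * w)              # normalize so different image sizes are comparable
```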
Now you could compute this style descriptor for your reference image, say The Starry Night or some Monet painting. And then you could compute the same descriptor for your selfie, for example. At first, those descriptors are going to be pretty different, because your selfie obviously isn't a painting; it was not painted with a brush. But the idea here is that when those two representations, those two descriptors, are different, you can compute some difference between them, say a squared error.
And this whole procedure is going to be differentiable,
which is a very important part here, because, remember,
we take filters from a differentiable neural network.
And then we compute the Gram matrix, or we just average over the whole feature map, which is simpler but yields less impressive results. We average or compute the Gram matrix, and then we compute the difference between those two Gram matrices. And this is, in fact, just a set of multiplications, additions, and maybe some nonlinearities, if your network includes them.
Now, if you then take your selfie and adjust it to make its style descriptor, this Gram matrix, similar to the one of your reference image, say The Starry Night, your selfie will slowly take on the features of this painting, but not its content.
9:01
Since we're only optimizing the texture so far, the result is going to be quite inferior, because the image may even lose its content as it tries to optimize textures. So let's also add a content term: we want an image that looks like The Starry Night, or any other painting you want, in terms of texture, but also looks like your selfie in terms of content. Now, where do you get the content, and how do you separate it from the style?
9:31
If you want higher-level features, you can just go deeper. You can take maybe a pre-final dense layer, or some of the top convolutional layers, depending on the architecture. And again, the difference you compute there is going to be perfectly differentiable. Then you weigh everything by attaching multiplicative coefficients to each of those differences.
And then you minimize all of this over the pixels of the image: you start with a random image or with your selfie (a random image works slightly better), and then you just morph it by following the gradient direction, or any other optimization method, following the gradient of this combined texture dissimilarity and content dissimilarity. This builds an image which inherits the content from the selfie and the texture from the painting, of course.
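Putting the pieces together, a rough sketch of that optimization loop might look as follows. Here style_layers, content_layer, the weights, and the step count are all hypothetical placeholders for whichever pre-trained network, layers, and settings you pick; gram_matrix is the helper sketched above.

```python
import torch
import torch.nn.functional as F

def stylize(selfie, painting, steps=200, style_weight=1e4, content_weight=1.0, lr=0.05):
    # Fixed targets: style descriptors of the painting, content features of the selfie.
    target_grams = [gram_matrix(a).detach() for a in style_layers(painting)]
    target_content = content_layer(selfie).detach()

    img = selfie.clone().requires_grad_(True)          # could also start from random noise
    opt = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        style_loss = sum(F.mse_loss(gram_matrix(a), g)
                         for a, g in zip(style_layers(img), target_grams))
        content_loss = F.mse_loss(content_layer(img), target_content)
        loss = style_weight * style_loss + content_weight * content_loss
        opt.zero_grad(); loss.backward(); opt.step()   # morph the pixels, not the network
    return img.detach()
```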
Now, here's an example of how this thing actually works.
This photo was morphed to resemble Van Gogh's style of painting. And this is, of course, a slightly modified, somewhat hacky version of the algorithm, so it's not just the activations of one layer for texture and another layer for content. We'll include a more detailed description of which layers you use, which networks you use, and which optimization methods you apply to get faster results; all of this will be in the reading section. We also encourage you to follow the URL here and try this out yourself, of course, if you have not done so already.
See you in the next section.
[MUSIC]