In this video, we will review
the motivation and design choices behind
modern neural network architectures for computer vision,
and present some evaluation results from a historical perspective.
This video assumes you have already passed
the introductory machine learning course in the specialization.
We will however review the material in greater depth.
One of the earliest deep learning architectures for vision was LeNet-5,
a seven-layer CNN
designed for handwritten digit recognition in 1998.
It accepted 32 by 32 monochrome images as input,
and produced a 10-dimensional output vector of scores, one for each of the 10 digit classes.
LeNet-5 was composed of
a four-layer convolutional feature extractor computing a 400-dimensional feature vector,
and a two-layer fully connected artificial neural network serving as a 10-way classifier.
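For reference, here is a minimal sketch of that layout in PyTorch. The specific kernel counts (6 and 16 feature maps, 5 by 5 kernels) and the 120-unit hidden layer are taken from the original 1998 paper rather than stated in this video, so treat them as assumptions of the sketch.

```python
import torch
import torch.nn as nn

# Minimal LeNet-5-style sketch. Kernel counts and the 120-unit hidden layer
# follow the 1998 paper; the video only states the overall layout.
class LeNet5(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Four-layer feature extractor: conv -> pool -> conv -> pool
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 32x32x1 -> 28x28x6
            nn.Tanh(),
            nn.AvgPool2d(2),                  # 28x28x6 -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5),  # 14x14x6 -> 10x10x16
            nn.Tanh(),
            nn.AvgPool2d(2),                  # 10x10x16 -> 5x5x16 = 400 features
        )
        # Two-layer fully connected classifier producing 10 class scores
        self.classifier = nn.Sequential(
            nn.Linear(400, 120),
            nn.Tanh(),
            nn.Linear(120, num_classes),
        )

    def forward(self, x):
        x = self.features(x)       # (N, 16, 5, 5)
        x = torch.flatten(x, 1)    # (N, 400)
        return self.classifier(x)  # (N, 10) class scores

scores = LeNet5()(torch.randn(1, 1, 32, 32))
print(scores.shape)  # torch.Size([1, 10])
```

Note how flattening the 5 by 5 by 16 output of the feature extractor gives exactly the 400-dimensional feature vector mentioned above.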
The LeNet-5 model showed remarkable robustness to various sources of noise,
invariance to a number of standard image transformations such as rotation, translation,
and scale, and was applied by
several banks to recognize handwritten numbers on digitized checks.
In 2012, a team of researchers from the University of Toronto used a large deep
convolutional neural network to classify
the one million images in the ImageNet challenge
into a thousand different classes.
Their final model achieved breakthrough performance
compared to the best available solution of that time,
reducing the top-1 error rate by roughly 10 percent, a significant improvement.
Even when it erred,
it made sensible predictions.
Due to its unexpectedly strong results,
AlexNet received significant attention in the research community.
Internal representations generated by AlexNet were
studied and utilized as a basis for large scale image retrieval.
Although many CNN improvements have been proposed in subsequent years,
the architectural principles underlying the AlexNet
model remain the foundation of CNNs to this day.
The AlexNet architecture can be viewed as a deeper and much larger network that is
nevertheless similar in design to the old LeNet-5;
in general it follows the trend set by the older model.
It consists of eight weight layers:
five convolutional layers used as a feature extractor,
and three fully connected layers used as a classifier.
The first convolutional layer filters the 224 by 224 by
three input image with 96 kernels of size 11 by 11 by three with a stride of four pixels.
The second convolutional layer takes as input
the output of the first convolutional layer and filters it with
256 kernels of size five by five by 48.
Note how the number of filters
increases as we go from the first to the second convolutional layer.
This can be seen as a general principle in convolutional neural networks, used
to convert spatial information into a semantic representation.
Between the first and the second convolutional layers are
max-pooling and normalization operations, which provide
shift invariance and numerically stabilize learning, respectively.
The third, fourth and fifth convolutional layers are connected
to one another without any intervening pooling or normalization layers.
The third convolutional layer has 384 kernels of size three
by three by 256 connected to the outputs of the second convolutional layer.
The fourth convolutional layer has 384 kernels of size three by three by 192,
and the fifth convolutional layer has 256 kernels of size three by three by 192.
The fully connected layers have 4096 neurons each.
All layers in the network are equipped with the rectified linear unit (ReLU) non-linearity.
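For concreteness, here is a sketch of this configuration in PyTorch. The padding values, the grouped convolutions that reproduce the original two-GPU split (which is why the kernel depths above are 48 and 192), and the exact placement of local response normalization follow the 2012 paper; they are assumptions of the sketch rather than details stated in this video.

```python
import torch
import torch.nn as nn

# AlexNet-style sketch: five convolutional layers as a feature extractor,
# three fully connected layers as a classifier.
alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),    # conv1: 96 x 11x11x3
    nn.ReLU(inplace=True),
    nn.LocalResponseNorm(5),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2, groups=2),   # conv2: 256 x 5x5x48
    nn.ReLU(inplace=True),
    nn.LocalResponseNorm(5),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1),             # conv3: 384 x 3x3x256
    nn.ReLU(inplace=True),
    nn.Conv2d(384, 384, kernel_size=3, padding=1, groups=2),   # conv4: 384 x 3x3x192
    nn.ReLU(inplace=True),
    nn.Conv2d(384, 256, kernel_size=3, padding=1, groups=2),   # conv5: 256 x 3x3x192
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),       # fc6
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),               # fc7
    nn.Linear(4096, 1000),                                      # fc8: 1000 class scores
)

print(alexnet(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```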
One of the most immediate and puzzling observations about the features in the AlexNet model
concerns the convolutional kernels learned by the network's first convolutional layer.
The network has learned a variety of frequency- and
orientation-selective kernels, as well as various colored blobs.
Cells with these kinds of response properties have long been known to exist in the primary visual cortex.
Moreover, mathematical functions known as Gabor functions, formulated on the slide,
yield kernels with a visually very similar structure.
Such functions are widely employed in digital signal processing, for instance
for the analysis of fingerprint images.
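The slide's exact parameterization is not reproduced here; for reference, one standard form of the two-dimensional Gabor function is the following (the slide may use a slightly different but equivalent formulation):

```latex
g(x, y;\, \lambda, \theta, \psi, \sigma, \gamma)
  = \exp\!\left(-\frac{x'^{2} + \gamma^{2} y'^{2}}{2\sigma^{2}}\right)
    \cos\!\left(2\pi \frac{x'}{\lambda} + \psi\right),
\qquad
x' = x\cos\theta + y\sin\theta, \quad
y' = -x\sin\theta + y\cos\theta
```

Here lambda is the wavelength, theta the orientation, psi the phase offset, sigma the width of the Gaussian envelope, and gamma the spatial aspect ratio; varying theta and lambda produces exactly the kind of orientation- and frequency-selective patterns seen among the learned kernels.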
Recovering similarly structured kernels
via plain end-to-end optimization was a surprising discovery.
The VGG model, introduced in 2014 by the Visual Geometry Group from Oxford,
addressed another important aspect of ConvNet architecture design: depth,
which ranged from 11 to 19 layers,
compared to eight layers in AlexNet.
To this end, other parameters of the architecture were fixed
while depth was steadily increased by adding more convolutional layers,
which was feasible due to the use of very small convolution filters in all layers.
These were fixed to be three by three,
which is the smallest size able to capture the notion of left-right, up-down, and center.
A stack of two
3 by 3 convolutional layers without spatial pooling
in between has an effective receptive field of five by five.
Three such layers have a seven by seven effective receptive field.
The decision function becomes more discriminative due to the three non-linearities,
but we still have fewer parameters to learn.
This can be seen as imposing
a regularization on the seven by seven filters, forcing them to
have a decomposition through the
three by three filters, with non-linearity injected in between.
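As a quick sanity check on the parameter savings (assuming, purely for illustration, C input and C output channels and ignoring biases):

```latex
\underbrace{3 \cdot \left(3^{2} C^{2}\right)}_{\text{three stacked } 3\times 3 \text{ layers}} = 27C^{2}
\qquad \text{versus} \qquad
\underbrace{7^{2} C^{2}}_{\text{a single } 7\times 7 \text{ layer}} = 49C^{2}
```

So the stacked design uses roughly 45 percent fewer weights over the same seven by seven receptive field, while adding two extra ReLU non-linearities.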
The Inception architecture and
the associated GoogLeNet family of models were first introduced by Google in 2014,
where they performed on par with the VGG network.
It was developed with computational efficiency in mind,
so that it would not end up being
a purely academic curiosity, but could be put to real-world use even on large datasets,
at a reasonable cost.
This is necessary as companies such as
Google have to process billions of images in reasonable time.
As a result, although
the original GoogLeNet is a much deeper model than AlexNet,
with 22 weight layers compared to eight in the AlexNet model,
it has only 6 million parameters, which is 12 times fewer than the AlexNet model.
The Inception architecture revolves around
a fairly sophisticated scheme for organizing
convolutional layers into so-called inception blocks,
which are the basic building blocks of this network, just as
the stacked three by three convolutional layers were for the VGG model.
The idea behind inception blocks is connected both to the
reduction of computational complexity and to the efficient use of local image structure.
The correlation statistics of the last layer are analyzed and
clustered into groups of units with high correlations.
In the layers close to the input,
correlated units would concentrate in local regions.
Thus we would end up with a lot of clusters concentrated in a single region,
which can be covered by a layer of one by one convolutions in the next layer.
One can also expect a smaller number of more spatially
spread-out clusters that can be covered by convolutions over larger patches,
and a decreasing number of
patches over larger and larger regions.
In order to avoid patch alignment issues,
current incarnations of the inception architecture
are restricted to filter sizes one by one,
three by three, and five by five.
In the end, the four different feature maps are concatenated along the depth dimension.
The idea of having one by one convolutions is that such convolutions
can capture interactions between channels at a single pixel of the feature map.
They perform a sort of dimensionality reduction, with an
added ReLU activation, and are
used to remove redundant feature maps from the previous layer.
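Here is a sketch of such a block in PyTorch. The parallel max-pooling branch and the concrete channel counts below are illustrative assumptions taken from the 2014 paper; the video itself only fixes the one by one, three by three, and five by five filter sizes, the one by one reductions with ReLU, and the depth-wise concatenation.

```python
import torch
import torch.nn as nn

# GoogLeNet-style inception block: four parallel branches concatenated on depth.
class InceptionBlock(nn.Module):
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        # Branch 1: plain 1x1 convolution
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        # Branch 2: 1x1 reduction followed by a 3x3 convolution
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_red, c3, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Branch 3: 1x1 reduction followed by a 5x5 convolution
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_red, c5, 5, padding=2), nn.ReLU(inplace=True),
        )
        # Branch 4: 3x3 max pooling followed by a 1x1 projection
        self.b4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # Concatenate the four branch outputs along the channel (depth) axis
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# Channel configuration of the "inception (3a)" block from the paper
block = InceptionBlock(192, 64, 96, 128, 16, 32, 32)
print(block(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```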
Inception architecture has lived through a number of variants or incarnations.
On this slide you can see the second version of the inception architecture,
with a somewhat more advanced inception block
that decomposes the five by five convolution
into computationally cheaper three by three convolutions stacked one after another.
This version not only trains a little faster than the previous one,
but is also more computationally efficient.
It is further known that a three by three Gaussian blur kernel, or
any other Gaussian blur filter, can be decomposed into one-dimensional filters,
by first blurring along one dimension and then along the orthogonal dimension.
This can be employed to further speed up
the Inception architecture by decomposing three by three filters
into a stack of a one by three filter followed by a three by one filter.
This further reduces the computational burden
suffered by deep architectures of this kind.
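A minimal sketch of this spatial factorization, with arbitrary channel counts and input size chosen purely for illustration:

```python
import torch
import torch.nn as nn

# A 3x3 convolution replaced by a 1x3 followed by a 3x1 convolution.
# With C channels in and out this needs 2 * 3 * C^2 weights instead of
# 9 * C^2, i.e. about a third fewer.
C = 64
factorized = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=(1, 3), padding=(0, 1)), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=(3, 1), padding=(1, 0)), nn.ReLU(inplace=True),
)
x = torch.randn(1, C, 32, 32)
print(factorized(x).shape)  # torch.Size([1, 64, 32, 32]), same spatial size as a padded 3x3
```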
This is currently one of the most advanced convolutional blocks ever made.
To summarize, we have shown that by stacking more convolution
and pooling layers, and thus deepening our network,
we can reduce the error on standard benchmarks as well as in practical applications.
This is the case for AlexNet and VGG.
However, one should employ
more sophisticated convolutional blocks such as the inception block,
in order to improve performance for more complicated datasets.
In the next video, we will discuss the principle known as residual learning,
that allows building even deeper convolutional architectures for vision.