Let's start with efficiency.
Why would search be less efficient if we have more features in the vector space?
Well, this essentially depends on the data structure that
you develop, or that you use, to implement your search algorithm.
Usually, search algorithms are implemented with data structures like trees.
If you are searching for a particular data element in your database,
you start from the root of the tree and navigate down,
where at every step the tree guides you towards the data element that you are looking for.
Eventually, after navigating from the root to a leaf,
you locate the data object that you want.
Now, it turns out that data structures such as trees
become more and more inefficient as you add dimensions to your vector space.
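To make this concrete, here is a minimal sketch, assuming NumPy and SciPy are available; the database size, the query count, and the dimension values are illustrative choices, not figures from the lecture.

```python
# A minimal sketch of tree-based nearest-neighbor search, assuming
# NumPy and SciPy are available; all sizes below are illustrative.
import time
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)

for dim in (2, 20, 200):
    data = rng.random((10_000, dim))   # database of 10,000 feature vectors
    tree = KDTree(data)                # build the tree index once
    queries = rng.random((100, dim))   # 100 search vectors

    start = time.perf_counter()
    tree.query(queries, k=1)           # root-to-leaf navigation per query
    elapsed = time.perf_counter() - start
    print(f"{dim:>4} dimensions: {elapsed:.4f} s for 100 queries")
```

As the dimension grows, the tree can prune fewer and fewer branches, so the query times drift towards those of a plain brute-force scan.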
Again, in this lecture,
I'm not going to go over why that is the case.
For now, take it as given; we discuss some of
these concepts, like efficient search, in other units of this course,
as well as in other courses in the core sequence.
Now, there is a second aspect to this, aside from efficiency, which is effectiveness.
It turns out that the more dimensions you have,
the more difficult it also becomes to analyze the data.
And why would that be the case?
Why does having more dimensions make it harder to analyze the data?
Well, the reason for this is very simple.
Say you have a low-dimensional vector space and two objects in this space,
and in this case these two objects are two flowers.
It might be easy to find a separator that separates
these two objects from all the other objects in your database.
You have two objects that are similar to each other,
there are other objects that are different from them, and it may be quite easy to find a separator.
However, as you increase the number of dimensions in your space,
this becomes more and more difficult. As you add more dimensions,
you are adding more flexibility,
so while it becomes easier to draw some line that separates your flowers from the rest of the objects,
it becomes more difficult to find a separation line that generalizes beyond the samples you have.
We call this overfitting.
So, if you have more and more dimensions,
it becomes easier to overfit to the samples that you have.
So, to prevent overfitting,
you need more and more examples.
If I have only two dimensions,
I might be able to prevent overfitting,
say with five example images.
But if I have 5,000 dimensions,
I might need many more images to
properly find the separating boundary between, in this example,
flowers and the rest of the objects.
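To see this in action, here is a minimal sketch, assuming scikit-learn and NumPy are available; the five training samples and the 5,000 dimensions mirror the numbers above, while the data and the test labels are random, so any perfect separator the model finds is overfitting by construction.

```python
# A minimal sketch of how high dimensionality invites overfitting,
# assuming scikit-learn is available; sizes and labels are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

for dim in (2, 5_000):
    X_train = rng.normal(size=(5, dim))    # only five example images
    y_train = np.array([0, 0, 1, 1, 1])    # e.g., flower vs. not-flower
    X_test = rng.normal(size=(200, dim))   # unseen data
    y_test = rng.integers(0, 2, size=200)  # random, unlearnable labels

    model = LogisticRegression().fit(X_train, y_train)
    print(f"{dim:>5} dims: train acc = {model.score(X_train, y_train):.2f}, "
          f"test acc = {model.score(X_test, y_test):.2f}")
```

Typically, the 5,000-dimensional model fits its five training samples perfectly while staying near chance on the held-out data, which is exactly the overfitting just described.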
Again, in this lecture,
we will not spend time on the details of
the efficiency and effectiveness impacts of the dimensionality curse.
However, as I mentioned,
we have other lectures and other videos in the sequence that get into the details of this,
where we provide more detailed examples and also discuss solutions to these problems.
What we will discuss next is
the second aspect of vector spaces, which is the distance and similarity functions.
So far, we have discussed how to take a set of
objects in the database and map them into a vector space.
We know that we need to identify a set of vectors that we will use as basis vectors.
And we know that these basis vectors need to be
complete, and that they need to be non-redundant.
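As a quick aside, here is a minimal sketch of how you might check these two properties, assuming the candidate basis vectors are stacked as the rows of a NumPy matrix; the check_basis helper and the example vectors are hypothetical, introduced only for illustration.

```python
# A minimal, hypothetical check of a candidate basis: a set of vectors
# for a d-dimensional space is complete and non-redundant exactly when
# the matrix stacking them as rows has rank equal to both d and the
# number of vectors.
import numpy as np

def check_basis(vectors: np.ndarray) -> str:
    n_vectors, dim = vectors.shape
    rank = np.linalg.matrix_rank(vectors)
    if rank < dim:
        return "incomplete: some directions of the space are not covered"
    if rank < n_vectors:
        return "redundant: some vectors are combinations of the others"
    return "complete and non-redundant"

print(check_basis(np.array([[1.0, 0.0], [0.0, 1.0]])))              # a valid basis
print(check_basis(np.array([[1.0, 0.0]])))                          # incomplete
print(check_basis(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])))  # redundant
```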
The second question, however,
is how do we use these basis vectors to support search or exploration?
To support search and exploration,
we need to define what we call distance and similarity measures on the vector space.
Now, it turns out that this is again a difficult problem.
It is difficult because the distance or
similarity functions that we need to use are also application dependent.
Having said that, there are
some commonly used distance functions and similarity functions.
And in the rest of this lecture,
we will go over them, and we will see what
the key properties of these similarity functions are,
why we use them, and in which contexts we use them.
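To preview them, here is a minimal sketch of three commonly used measures, written in plain NumPy; the two example vectors are illustrative.

```python
# A minimal sketch of commonly used distance and similarity functions
# on a vector space; the example vectors are illustrative.
import numpy as np

def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    # Straight-line (L2) distance between two feature vectors.
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan(a: np.ndarray, b: np.ndarray) -> float:
    # City-block (L1) distance: sum of absolute per-feature differences.
    return float(np.sum(np.abs(a - b)))

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Angle-based similarity: 1.0 means the same direction, 0.0 orthogonal.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 1.0])
print(euclidean(a, b), manhattan(a, b), cosine_similarity(a, b))
```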