0:01
So we will now talk about algorithmic thinking and how we apply it
to our very important problem in, in biology called sequence alignment.
So I am going to illustrate what's sequence alignment is and we will talk ab,
a little briefly,
about where we are heading with this problem in terms of algorithmic solution.
So the issue of, or the problem of, of sequence alignment relates to
the notion of evolution and evolution of sequences from a common ancestor.
So let's just think about a case where, you know, we have two DNA sequences here
that evolved for some time from a sequence that existed at some point in the past.
Let's assume that we are looking at DNA sequences, and
the sequence in the past was ACTAC.
This was the sequence, for
example, at the ancestor of human and chimp.
Okay?
This sequence is evolving.
You know after human and chimp split from a common ancestor this
sequence started evolving independently along each one of these branches or edges.
Now what happens when the sequence is evolving over time, mutations happen.
The mutations, there are two, at least two types of
mutations at least the ones that we're going to be focusing on.
One type of mutation is, is that, is the type of mutation that changes a letter.
As this sequence is evolving over time, some letters change.
Okay? They go from A to C to D to G or whatever.
This is known as base pair substitution.
A letter changes.
Another type of mutation that happens is called insertion or deletion n ndil.
In short, where some parts of the DNA can get deleted or
parts of DNA can get inserted.
So let me illustrate.
1:44
Let me illustrate with this,
with this example here suppose as this sequence was evolving the first letter
changed to G mutated, and the second letter got lost, okay?
So what, what happens here is that as this changes to G and
this gets lost this sequence here becomes G D A C, okay?
The A changed to G.
The C got lost, it's a type of mutation.
T A C, no changes happened to them, okay?
2:39
Now, as you know, we have access today with,
with sequences I can easily get accesses to sequences from human geno like this.
I can easily get access to DNA material to Chimp.
I don’t have access to this.
Okay? This existed millions of years ago,
we don't have access to it.
So our task in se, in the sequence alignment problem is that, presented with
these two sequence and knowing that they came from some common answer here.
I want to try and reverse engineer this process and know what our limitations or
how these two sequences can have different sequence content, and
different length, how they came from a common ancestor.
Okay?
So again to illustrate now when I have the problem,
the data that we are presented basically have this now.
These are the only things that we can access or observe data, okay?
So now the problem becomes, that I have these two sequences
4:08
Now what do I mean by an alignment?
Okay?
So an alignment of these two sequences basically means that
I want to make them both of the same length.
Okay?
I want to put them on top of each other, that they are of the same length.
Now, how can I make these two sequences of the same length.
They are not, this has four letters, this has five letters.
So we define one operation where you can insert in either of them or
in both of them, dashes.
Okay?
So one way of aligning these two sequences, for example, is that, if we,
let's call this x and this y.
5:12
This is another alignment.
And I can add more alignments because I can also insert dashes in the second one.
I can say.
G dash dash T A C.
A C A dash.
This is another alignment, okay?
And, there are many more alignments, okay?
This version of this alignment problem where I'm making the two sequences in
there entirety of equal length.
This is known as the global alignment problem.
[SOUND] Okay?
So this is, we are globally aligning both sequences.
So I want to make, I have two sequences of the same length or
of varying lengths, I want to make them of the same length by inserting dashes and
then so I get something like this.
Now, there are two questions that we need to answer here algorithmically or
when part of algorithmic thinking.
First of all, what defines an alignment?
Okay, this is an alignment, this is an alignment, this is an alignment.
Is the following an alignment?
6:31
Is this an alignment?
Of course, I added dashes to both of them.
They are of the same length.
This has 2, 4, 6, 2, 4, 6, they have the same length.
Is this an alignment?
And the answer is no, this is not an alignment, okay?
So what is it that we didn't like about this and we like about these?
The fact that there are two dashes here next to
each other is something we don't want, we don't allow in an alignment, okay?
So the first question we need to answer is that what defines an alignment and
what doesn't define an alignment.
And the second question is that once I have set these rules of feasibility, or
criteria of feasibility, what defines an alignment, how do I find it?
Okay, these are all feasible alignments,
based on the set of rules that I will define.
Why should I, should I define, should I give the biologist this alignment, or
this, or this, or which one of the many more alignments that exist?
Okay?
So we have to talk about two parts to the problem.
One if feasibility.
What defines a feasible alignment?
For example this is feasible alignment, this is not feasible.
The second one is optimality.
What defines an optimal alignment or
an alignment that is better under some criterion than another one?
Okay?
So when we talk about, I will talk first about feasibility.
What defines an alignment?
7:47
To define an alignment, global alignment, we have three conditions.
First is that for an, for, for, for something to
be an alignment the two sequences have to be of exactly the same length.
This satisfies it, this satisfies it, this satisfies it.
Of course this satisfied it as well, okay?
So the first condition is that they have to be of the same length, okay?
The second condition is that we don't allow a dash in one to go
with a dash in the second.
This, this and this, they satisfied that condition, but
this one does not satisfy this condition because we put the dash here.
In both of them, okay?
So the, the first one, they are of the same length.
The second one, we don't allow a dash in both of them at the same position.
The third one, if I remove the dashes from the sequences in the alignment,
I should get back the sequences that I started with.
So if you remove the dash from here, you'll get back to GTAC, which is this.
If you remove the dashes from here you go back to this.
The same thing here, the same thing here, you remove these 2 dashes,
you go back to GTAC, remove the dash from here you go back to the original sequence.
Okay?
So, for, for, for two sequences to be in, X point and
Y point to be in alignment with X and Y,.
They have to satisfy three conditions.
The first one is that they are of the same length.
The second one is that we don't have dashes in the same position.
The third one is that if I removed the dashes I get back my sequences.
Okay?
Now that we have defined feasibility, you can see that I've
already listed three possibilities here but I can list more and more of these.
Okay?
So you will see that the problem becomes very hard, is that how do I
go through all of them and how do I present them to a biologist and
say, here's a list of all feasible alignment, make your choice.
It doesn't make sense.
So the second thing is that we have to mathematically define a criterion that
says, this is better than this, for
example, so that when I give a solution to the biologist, I say, under this criteria,
under this mathematical definition, this is most often more than that.
Okay?
To do that, we have to define, what we call, a, a,
a scoring scheme, or a scoring matrix.
What is a scoring scheme or scoring matrix?
Let me illustrate here, we can leave these two alignments here.
10:03
And in the case of D and
A, because we have four letters, notice what happens in an alignment.
Either a D and A letter goes with a D and
A letter or a D and A letter goes with a dash, right?
Dash in the first sequence or dash in the second sequence.
So, to define I will try to give a score to each one of these configuration.
Either I have a nucleotide with nucleotide or a dash with nucleotide, or
nucleotide with a dash, and so on.
The easiest way to define this is by defining what we call a scoring matrix.
10:58
So at every entry in this scoring matrix that I'm going to call M
is going to be the, the score, of matching two letters, okay?
For example, if I said ten here it means that every time I
match A in one sequence within another sequence is going to be of score ten.
If I put, for example, dash with C here, for
example minus 4, it means that if I put a dash in the first sequence,
with a C in the second this is going to have a score of minus four.
11:50
Okay?
So in this case, and suppose that the matrix is symmetric, so,
this is going to be 4, and this going to be 4 and so on.
In this case, now I can define the score of an alignment,
by going through the alignment on position at a time, and for
every position I look up the score of that.
Configuration from the table.
So here I have G goes with an A.
What is the score of G going with an A?
It's G with A.
It's this one.
It's four.
12:20
The second one is a dash with a C.
Dash with a C, it's minus 4.
So this is minus 4.
T with A, it's 4.
A with G is 4.
And C with C is 10.
Okay?
So I look at these, the sum of these scores, 10, 14,
18, 14, 18, I would say that the score of this is 18.
I can repeat the process with this, here.
This is G with A has a score of 4.
Dash with C is minus 4.
Dash with A is minus 4.
D with dash is minus 4.
A with G is 4.
C with C is 10.
So if we have 10 here, we'll say 10, 14 ,10, 6, 2, 6.
Right?
So the score of this is going to be 6, okay?
13:17
Now, if, since we are defining a score here, it will make sense to
say that I will be looking for or seeking the alignment that has the better score.
So in this, from, between these two,
I will be basically saying this alignment has a higher score than this alignment,
and that this scoring scheme that I defined by this matrix therefore I will be
returning, between these two I'll be returning this alignment to the biologist.
Okay?
So now we defined this criteria that tells us what is a feasible alignment.
Under this criteria, we can have many possible feasible alignments.
Second thing, we define a scoring scheme that's given by this matrix.
So that I can score every matrix, what is, what score it has, every, every alignment,
what score it has, then out of all the possible feasible alignments,
I will choose the one that has the highest score.
Okay?
Now of course you will see that it's not possible, it's not feasible in practice to
list every possible feasible alignment, score them, and then return the one
with the highest score, which would be a brute force algorithm.
Instead, we can use an algorithmic technique that we have learned,
which is dynamic programming, that will compute the optimal alignment and
it's score, in, in a very efficient amount time.
Okay?