0:00

We'll now return to the questions that we

asked at the very beginning of this lesson.

Starting from genomes, how can we transform them into synteny blocks?

And the first thing I want you to learn is the notation of genomic dot plot.

Let's arrange genome along x coordinate and along y coordinate.

And let's represent

every k-mer that is shed at position x and y in the

genome as a dot, a position with coordinates x and y.

In this case, I show you genomic dot plot for k-mer size equal to 3.

And it only consists of the diagonal in the genomic dot plot.

Of course, diagonal is always present in the genomic

dot plot because k-mers at the coordinate x and x are always the same.

But here's the genomic dot plot for k=2, and you can see that in

this case the number of dots increased as compared to k=3.

For example these 4 dots correspond to repeated occurrences of the 2-mer CT in

the genome.

1:20

So far, we talked about a genome consisting

of one strand, but in reality of course genome consists of 2 strands,

and therefore we should take care of both identical and

reverse complementary k-mers while representing genomic dot plots,

as shown here. In this case, red points correspond to

identical k-mers, and blue points correspond to

reverse complementary k-mers. Now

this is how a genomic dot plot looks like when we compare genome against itself.

But can we also construct genomic

dot plots when comparing two different genomes?

And here's an example of comparing one genome with another

one that is obtained from the first one by a reversal.

And in this case, you can see that a reversal corresponds

to the -45 degree diagonal formed by blue

points in this genomic dot plot. Now, from

looking at the small string, let's now construct genomic dot

plot for entire E. Coli and Salmonella genomes.

And human eye immediately see synteny blocks

in this genomic dot plot, they simply correspond to various diagonals.

And we only pay attention to sufficiently long diagonal, because small

diagonal, or small points, clusters of points may represent noise in

this case. So we see five synteny blocks, five

diagonal corresponding to a block marked asa a, b, c, d and e,

and if we project these blocks on the

x axis we will get an arrangement of synteny blocks in E. Coli.

And if we project them on the y-axis then we'll get

3:27

an arrangement of synteny blocks in salmonella.

And now we are ready to analyze the 2-break distance between these two genomes.

Therefore, construction of synteny blocks essentially amount

to constructing diagonals in the genomic dot plot.

And here is the finding synteny blocks problem.

The input is the set of points in

genomic dot plots in 2D, and the output is

a set of diagonals in DotPlot representing synteny blocks.

Does it make sense?

4:03

Of course it doesn't make sense.

Because I have not defined what diagonals.

Look at this genomic dot plot and you can see the diagonals

are not perfect because there are many insertions, deletions, mismatches and many

other artifacts, and therefore human eye has the ability

to somehow process this genomic dot plot into

five diagonals, but we really do not know

what our brain is doing while making this transformation.

Can we come up with an algorithm for constructing these diagonals?

Let's try. Let's look at the genomic dot plot between

two different genomes that we constructed before.

And let's represent every dot in the dot plot as a node in the graph.

And we will form edges in this graph by simply connecting closely located points.

For example, points that are located at a

distance less than maxDistance, where maxDistance is a parameter.

What will happen?

We will connect certain points by edges, and you can see from here

that diagonals actually correspond to connected components in this graph.

Moreover, we probably should ignore very small connected components.

For example, a component located in the upper-right (upper-left to the viewer)

corner and low left corners (lower-right to the viewer), but the three remaining

components should be accepted as synteny blocks in our genome.

And therefore, I offer you the following algorithm for solving the problem.

Let's form a graph whose node set is the set of points in DotPlot, and

connect two nodes by an edge if the

corresponding nodes are located at close distance from another.

The connected components in the resulting graph define synteny block.

And afterwards, delete small synteny blocks.

6:06

Of course this is not the only algorithm that tries to model what the human eye is

doing, while clustering points in 2D, and our brain is very good at clustering.

I can give you another algorithm, for example this one that I call Amalgamate.

And it actually different, but it looks like it's doing almost the same thing.

First step, define each point in DotPlot as

a separate block and iteratively amalgamate the resulting block.

How do we amalgamate the blocks?

We amalgamate two blocks and combine them into a single block

if they contain two points that are separated by small distance in

another genome, and as before delete small synteny blocks.

7:02

Instead of answering this question, you should

ask me what problem I actually want you

to solve, because I have never defined an

algorithmic problem that I want you to solve.

You should have refused coming up with an algorithm

before actually seeing the problem I want to address.

Unfortunately, in bioinformatics, it's not as simple.

I have been working on the problem of synteny block generation for 10 years now,

and I failed to formulate this problem as

a rigorous algorithmic problem, and that's why we

are at mercy of our intuition.

And some people may like Synteny algorithm better.

Other people may like Amalgamate algorithm better.

How would we choose?

7:52

The choice should be done based on the practical results.

And if you implement these algorithms and run them, then you

see that synteny algorithm gives

a reasonable and biologically adequate result.

But amalgamate algorithm, also, it looks reasonable, it looks like it

models what human eye may be doing, gives completely wrong results.

After developing the algorithm for constructing

synteny blocks, we are now ready to

find out what are the synteny blocks between human and mouse X chromosome.

And here's a genomic dot plot for the X chromosome consisting of 25,000 dots.

8:32

We form a synteny graph on these 25,000 nodes.

And at this slide, you can see the resulting connected components.

We further process of these connected components and represent them

as a perfect +45 degree or -45 degree diagonals in 2-D.

Where are the 11 synteny

blocks between human and mouse that we have been considering in this lecture?

Well if you project these 11 blocks in 2D to x-axis,

you will get the arrangement of human synteny blocks.

And if you project it to y-axis, then

you will get the arrangement of mouse synteny blocks.

Moreover, since we are not particularly interested

in the scale of these blocks, and even if some blocks appear

small, there are maybe hundreds of genes in each of the blocks.

It's just an issue of scale.

So let us now represent each block by the diagonal of the same size.

And once again, if we project these blocks into 2D to the

x axis, we will get an arrangement of synteny blocks in Phillip, and if

we project it to y-axis, we will get the arrangement of synteny blocks in mouse.

And now, after we constructed synteny blocks for two

genomes, How about constructing synteny blocks for three genomes?

You can think about generalizing how

a synteny block construction approach for more

than two genomes, but what about

constructing synteny blocks for 100 genomes that

are being sequenced now?

This actually turns out to be a very difficult problem.

And nobody in the world today is able to construct synteny blocks for 100 genomes.

Maybe after this course you will suggest a reasonable approach for this.

But an interesting thing is that if we learn how

to construct synteny blocks for many genomes, we will

be able to come up with a very

accurate picture of chromosomal organization in our distant ancestor.

So there is lot of work

ahead in developing genome rearrangement and synteny

block construction algorithms, but this is already beyond the scope of this course.