Typical aligners, such as bwa and bowtie, cannot split reads either.
One strategy for solving this problem is to extract the transcript sequences from the genome
and use the transcriptome as a new reference.
This is a part I cut out of the example, and these all belong to one gene. But the gene has 3 transcripts.
So parts of them overlap, and only the 5’ UTRs are different. The sequences of the 3 transcripts are similar to some extent.
A problem arises from this. When we map a read, we can identify which gene it comes from, but we cannot identify the specific transcript.
Like this: the SRA read appears in all three of these transcripts.
So if we follow the bwa result, we get a misleading answer: we don't know which transcript the read comes from.
But if we trace the transcripts back to the gene level, we get an accurate expression level for this gene.
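To make this concrete, here is a small Python sketch. The transcript names, the gene name and the toy sequences are all made up, and the "mapping" is just exact substring matching, which is not how bwa works. It only shows that a read from the shared region is ambiguous at the transcript level but unique at the gene level.

```python
# Toy transcripts of one hypothetical gene: different 5' ends, shared body.
transcripts = {
    "tx1": "AAAC" + "GGATTCCGTAGC",
    "tx2": "TTTG" + "GGATTCCGTAGC",
    "tx3": "CCCA" + "GGATTCCGTAGC",
}
gene_of = {"tx1": "geneX", "tx2": "geneX", "tx3": "geneX"}

def hits(read):
    """Return the names of all transcripts that contain the read exactly."""
    return sorted(name for name, seq in transcripts.items() if read in seq)

read = "ATTCCGTA"                                   # comes from the shared region
tx_hits = hits(read)                                # ambiguous at transcript level
gene_hits = sorted({gene_of[t] for t in tx_hits})   # unique at gene level

print(tx_hits)    # ['tx1', 'tx2', 'tx3']
print(gene_hits)  # ['geneX']
```

A read that covers one of the distinct 5' ends, such as "AAACGG" here, would hit only one transcript, which is why the differing regions matter so much.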
Then how can we get the expression level at the transcript level?
We do need a new alignment strategy.
After splitting these reads, we can find some junctions and define the boundaries of the different transcripts.
In practice, we choose the whole genome as the reference.
To solve the problem, there are two strategies: exon-first alignment and seed-extension.
The workflows of these two strategies are shown on screen. We mostly focus on the first one,
because it is similar to ordinary mapping to the genome. Moreover, this method is much faster than seed-extension.
The seed-extension algorithm, by contrast, is based on dynamic programming; it is slow and uses a lot of RAM, so it is not a practical algorithm.
Although speed is an advantage of exon-first, a severe problem comes with it: pseudogenes!
For instance, there may be a junction in the real gene (an intron or something else), but the pseudogene has no junction at that location, just a few SNPs.
When we score the alignment, the penalty for a junction is much higher than the penalty for a SNP.
So reads tend to map to the pseudogene. As a consequence, the pseudogene appears highly expressed, and the real gene does not.
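A minimal sketch of this scoring effect in Python. The penalty values are hypothetical and chosen only so that a junction costs more than a couple of mismatches; real aligners use their own scoring schemes.

```python
# Hypothetical penalties: a splice junction costs far more than a mismatch.
MISMATCH_PENALTY = 1
JUNCTION_PENALTY = 5

def alignment_penalty(mismatches, junctions):
    """Total penalty of one candidate alignment (lower is better)."""
    return mismatches * MISMATCH_PENALTY + junctions * JUNCTION_PENALTY

# The same read has two candidate placements on the genome.
candidates = {
    "real_gene": alignment_penalty(mismatches=0, junctions=1),   # spans an intron
    "pseudogene": alignment_penalty(mismatches=2, junctions=0),  # no intron, 2 SNPs
}

best = min(candidates, key=candidates.get)
print(candidates)  # {'real_gene': 5, 'pseudogene': 2}
print(best)        # pseudogene
```

Under these assumed penalties the pseudogene placement wins even though the read really came from the spliced gene.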
How do we deal with it?
A previous study proposed a method to solve this problem.
The principle is very simple: first map the reads to cDNA, like the ref-RNA file shown on the preceding slides.
Thus, the problem is solved.
Mapping to cDNA first and then going back to the genome
gives a more accurate result.
For this, we highly recommend the TopHat software.
This is the result of remapping with TopHat, and it's a BAM file. We can see there is a gap of 659 bp in the alignment.
That is how the sequence exists in the genome, and this is a junction.
So we can use mapping results like this to reconstruct the transcripts.
We can pull out the reads that have gaps,
then consider whether they map to a junction between two exons.
And we can define junctions according to these reads.
After that, we will be able to link the exons into a full transcript.
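As a sketch of how junctions can be read off gapped alignments: in the SAM/BAM format, a spliced read carries an `N` operation in its CIGAR string for the skipped reference region. The function name and the example read below are made up for illustration, and the 659 bp length only echoes the gap mentioned earlier.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def junctions_from_cigar(start, cigar):
    """Return (intron_start, intron_end) pairs, 0-based half-open, one per N op.

    `start` is the 0-based leftmost reference position of the alignment.
    """
    pos = start
    junctions = []
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":              # skipped reference region, i.e. the gap
            junctions.append((pos, pos + length))
        if op in "MDN=X":          # operations that consume the reference
            pos += length
    return junctions

# Hypothetical read: 30 aligned bases, a 659 bp gap, then 20 aligned bases.
print(junctions_from_cigar(1000, "30M659N20M"))  # [(1030, 1689)]
```

Collecting these pairs over all gapped reads gives a set of candidate junctions, which can then be used to decide which exons to link together.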
Besides this strategy there is another one, called de novo assembly, which is not based on mapping.
Of course, this method is mainly applied when we don’t know the transcripts.
Its theoretical basis is graph traversal. For example, we obtain five short fragments from a circular DNA sequence such as a plasmid.
We can split these five fragments
into various dimers such as AA, AT. Then we split the fragments into trimers as well.
All of them are listed there. Based on it, we can take each trimer apart.
If we look for the relationship between dimers and trimers, we get a correspondence:
each trimer gives us two adjacent, overlapping dimers.
We can draw a graph based on this data structure,
which is called a De Bruijn graph.
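Here is a rough Python sketch of that construction: dimers are the nodes, and each trimer is an edge from its first two letters to its last two. The five fragments are made up and are not the ones on the slide.

```python
from collections import defaultdict

def de_bruijn(fragments, k=3):
    """Build a De Bruijn graph: nodes are (k-1)-mers, each k-mer is one edge."""
    graph = defaultdict(set)
    for frag in fragments:
        for i in range(len(frag) - k + 1):
            kmer = frag[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])   # prefix dimer -> suffix dimer
    return {node: sorted(nexts) for node, nexts in graph.items()}

# Made-up short fragments, only for illustration.
fragments = ["ATGG", "TGGC", "GGCG", "GCGT", "CGTG"]
graph = de_bruijn(fragments)
for node in sorted(graph):
    print(node, "->", graph[node])
```

For example, the trimer ATG contributes the edge AT -> TG, and TGG contributes TG -> GG.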
Then we try to work out the complete graph.
First we can use a long read as a reference.
For example, the read here is ATGGCGT, and we can find its start point.
The start point is at AT in the graph.
As the route goes on, we find there are 2 choices at the point of G.
We use the read to pick the route, the same way as before.
In the read, G comes after ATG, so we select the GG node.
And with this method we can get the complete graph.
We can see the graph forms a cycle.
In fact the gene sequence is like that. And this is roughly how reconstruction works.
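The read-guided walk can be sketched like this in Python. The graph below is hypothetical; it has a branch after TG so that the read ATGGCGT has to choose between two routes, as in the example.

```python
def walk_with_read(graph, read, k=3):
    """Follow the path through a dimer-node graph that a long read spells out.

    Raises ValueError if the read uses an edge that is not in the graph.
    """
    node = read[:k - 1]                 # start node, e.g. "AT"
    path = [node]
    for base in read[k - 1:]:
        nxt = node[1:] + base           # the read decides which branch to take
        if nxt not in graph.get(node, []):
            raise ValueError(f"no edge {node} -> {nxt}")
        path.append(nxt)
        node = nxt
    return path

# Hypothetical graph with a branch after TG (TG -> GA or TG -> GG).
graph = {
    "AT": ["TG"], "TG": ["GA", "GG"], "GG": ["GC"], "GC": ["CG"],
    "CG": ["GT"], "GT": ["TA"], "TA": ["AT"], "GA": ["AT"],
}
print(walk_with_read(graph, "ATGGCGT"))
# ['AT', 'TG', 'GG', 'GC', 'CG', 'GT']
```

At TG the graph alone cannot tell us where to go; the next base of the read is G, so the walk takes GG rather than GA.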
By the way, when methods other than de novo reconstruction are available,
especially in research on well-studied species such as human or mouse,
it is better to use software like Cufflinks instead of de novo reconstruction.
If you want to know more about alternative splicing,
such as the 8 types of splicing covered in the course,
we recommend reading the Science paper published in 2008, which is shown on the right. It explains the 8 types of splicing in detail.
If you need to do some analysis with this method, you may read the paper shown on the left.
That paper explains the methods and introduces a model called MISO.
Finally, we will talk about differential expression.
First we need to understand what measure we use to define expression.
For example, Cufflinks uses a measure called FPKM.
The difference between FPKM and RPKM is that FPKM counts fragments rather than reads, so it suits paired-end data.
And why do we need the FPKM?
Because we need to normalize. For example, take transcripts 3 and 4, whose reads are shown here:
4’s read count is obviously higher than 3’s.
In this case, if we simply count the number of reads, we will find 4 is obviously higher than 3.
However, if we normalize by their length, we’ll find there is actually no differential expression between them,
which agrees exactly with our expectation that there is no real difference between them in this figure.
So FPKM reflects the actual differences between transcripts and between genes.
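A small worked example of the normalization in Python. FPKM stands for fragments per kilobase of transcript per million mapped fragments; the counts, lengths and library size below are made-up numbers, chosen so that transcript 4 is twice as long and has twice as many fragments as transcript 3.

```python
def fpkm(fragments, length_bp, total_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_fragments)

# Made-up numbers: transcript 4 is twice as long and has twice the fragments.
total = 1_000_000
t3 = fpkm(fragments=100, length_bp=1000, total_fragments=total)
t4 = fpkm(fragments=200, length_bp=2000, total_fragments=total)
print(t3, t4)  # 100.0 100.0
```

The raw counts differ by a factor of two, but after dividing by length the two values are the same.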
Then we will talk about how to define the boundaries of transcripts.
When we get a read, how do we decide which transcript it belongs to?
Take this yellow paired-end read: first we suppose it belongs to C, which is very long.
If its projected length on C is 500,
we will find the probability that it comes from C is very low according to the normal distribution, whose mean is 150.
But if we suppose this read belongs to B, we will find the probability is actually higher.
And if the yellow one is on transcript A, its projected length on A will be 150, which is right at the peak of the normal distribution.
With this method, we can get the probabilities for A, B and C separately.
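This reasoning can be sketched in Python as follows. The mean of 150 and the projected lengths of 150 on A and 500 on C come from the example; the standard deviation of 50 and the projected length of 250 on B are assumptions made only for illustration. The sketch also treats the three transcripts as equally abundant, which is a simplification and not the full Cufflinks model.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

MEAN, SD = 150, 50   # mean from the example; SD is an assumption

# Fragment length the yellow pair would imply on each transcript.
projected = {"A": 150, "B": 250, "C": 500}   # B's value is made up

likelihood = {t: normal_pdf(x, MEAN, SD) for t, x in projected.items()}
total = sum(likelihood.values())
posterior = {t: l / total for t, l in likelihood.items()}

for t in "ABC":
    print(t, round(posterior[t], 4))
```

With these assumed numbers A gets most of the probability, B gets some, and C gets essentially none.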
Based on this, we can apply it to all the reads in this figure. And then we get a figure like this.
We then know what percentage each transcript contributes to this gene. Finally we get this result.
That’s the differential expression of one gene’s different transcripts.
And if we combine multiple samples together, we can get a picture like figure a.
If we don’t consider the transcripts and just focus on whole genes, we can get a heatmap like figure b on the right side.
All of this can be done with RNA-Seq.
If you are interested in doing this, there is a Nature Protocols paper that describes it very well.
It walks through the whole process, from TopHat to the final analysis.
And that’s the reference. Thank you all!