So once we've mapped our RNA-seq data to a reference genome,
we can use our spliced alignment data to understand two things.
The first is the locations of exons: where you have reads aligning,
probably unspliced, to the genome,
that suggests that there's an exon in that region.
The second is the locations of splice junctions, based on the spliced reads that we see.
While this information is not sufficient to know exactly
what transcripts are present, we can make a reasonable prediction.
And that's what the process of assembly is going to do.
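
To make that idea concrete, here is a minimal Python sketch of how exon evidence and splice junctions can be read off a single alignment. It is not what Cufflinks does internally; it only assumes the standard SAM convention that an `N` operation in the CIGAR string marks a skipped region (an intron) on the reference. The function name and the example coordinates are made up for illustration.

```python
import re

# One CIGAR operation: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_blocks_and_junctions(pos, cigar):
    """Split one alignment into exon-like blocks and splice junctions.

    pos   -- 1-based leftmost reference position (the SAM POS field)
    cigar -- SAM CIGAR string, e.g. "50M200N25M"

    Returns (blocks, junctions), each a list of 1-based inclusive
    (start, end) intervals on the reference.
    """
    blocks, junctions = [], []
    ref = pos            # next reference position to be consumed
    block_start = pos    # start of the block currently being extended
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=XD":
            # Matches, mismatches and deletions consume reference bases
            # and stay inside the current block.
            ref += length
        elif op == "N":
            # A skipped region closes the current block and records
            # the intron it spans as a junction.
            if ref > block_start:
                blocks.append((block_start, ref - 1))
            junctions.append((ref, ref + length - 1))
            ref += length
            block_start = ref
        # I, S, H and P do not consume reference bases, so they are ignored.
    if ref > block_start:
        blocks.append((block_start, ref - 1))
    return blocks, junctions


if __name__ == "__main__":
    # An unspliced read supports a single exon region.
    print(aligned_blocks_and_junctions(100, "75M"))
    # A spliced read supports two exon regions and one junction.
    print(aligned_blocks_and_junctions(100, "50M200N25M"))
```

An assembler combines this kind of evidence across many reads, which is why the result is a prediction of the transcripts rather than a certainty.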
And so the tool we're going to use for this is a tool called Cufflinks.
And we're going to use our gene annotations to
guide the identification of locations of exons and splice junctions.
And we're going to do this for each of the four accepted_hits files from TopHat.
If you went through the last lecture, you should see that you have
the results of running TopHat on four sets of RNA-seq data.
And so we have four data sets labeled accepted_hits.
These will form the input to Cufflinks.
And so if we go to Cufflinks, there
are a number of different tools that are part of the Cufflinks package here.
The one we want at this point is just Cufflinks.
So if you click on Cufflinks, you now have the tool form for Cufflinks.
We want to do a couple of things differently.
The first is, we have three options for reference annotation.
We can choose not to use reference annotation, in which case, all of the gene
information is going to be inferred from the RNA-seq data that we have.
We can choose to use a reference annotation, in which case that information
will come from an existing gene annotation that we're going to provide.
And so the Cufflinks tool will use the gene models from that
existing annotation but align our RNA-seq data to it for
estimating levels of expression and such.
What we want to do here is something in between,
which is use reference annotation as a guide.
So what this is going to do is it's going to use the information from both
the RNA-seq data and reference annotation for developing gene models.
This is useful because, for example,
if you've discovered genes in RNA-seq data that are consistent with
an existing annotation, they will have the gene name associated with them.
So we're going to choose "use reference annotation as a guide" and
select data set 5, chromosome 19-annotations.gtf.
This is a file that came from the data library and
contains the gene annotation information.
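
For anyone working outside the web form, the three reference annotation options map roughly onto the Cufflinks command line as sketched below in Python. This is an illustration under assumptions, not the exact command the form runs: the `-g` (`--GTF-guide`), `-G` (`--GTF`) and `-o` (`--output-dir`) flags are standard Cufflinks options, but the helper function, the mode names, the sample names and all file paths are hypothetical.

```python
def build_cufflinks_command(hits_path, output_dir, annotation_gtf=None, mode="guide"):
    """Build a cufflinks argument list for one alignment file.

    mode "none"      -- no annotation; gene models come from the reads only
    mode "reference" -- use the annotation's gene models as given (-G)
    mode "guide"     -- use the annotation as a guide for assembly (-g)
    """
    if mode not in ("none", "reference", "guide"):
        raise ValueError(f"unknown mode: {mode}")
    if mode != "none" and annotation_gtf is None:
        raise ValueError("this mode needs an annotation GTF file")

    cmd = ["cufflinks", "-o", output_dir]
    if mode == "guide":
        cmd += ["-g", annotation_gtf]
    elif mode == "reference":
        cmd += ["-G", annotation_gtf]
    cmd.append(hits_path)
    return cmd


if __name__ == "__main__":
    # Hypothetical names for the four accepted_hits files from TopHat.
    samples = ["sample1", "sample2", "sample3", "sample4"]
    for name in samples:
        cmd = build_cufflinks_command(
            hits_path=f"{name}/accepted_hits.bam",   # hypothetical path
            output_dir=f"{name}/cufflinks_out",      # hypothetical path
            annotation_gtf="annotations.gtf",        # hypothetical file name
            mode="guide",
        )
        # Only print the command; pass it to subprocess.run(cmd, check=True)
        # if cufflinks is installed and the files exist.
        print(" ".join(cmd))
```

Building the command as a list keeps each of the four runs identical apart from the input and output paths, which mirrors running the same form settings once per accepted_hits data set.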