Learn to use tools from the Bioconductor project to perform analysis of genomic data. This is the fifth course in the Genomic Big Data Specialization from Johns Hopkins University.

Loading...

From the course by Johns Hopkins University

Bioconductor for Genomic Data Science

92 ratings

Johns Hopkins University

92 ratings

Course 6 of 8 in the Specialization Genomic Data Science

Learn to use tools from the Bioconductor project to perform analysis of genomic data. This is the fifth course in the Genomic Big Data Specialization from Johns Hopkins University.

From the lesson

Week Four

In this week, we will cover Getting data in Bioconductor, Rsamtools, oligo, limma, and minfi

- Kasper Daniel Hansen, PhDAssistant Professor, Biostatistics and Genetic Medicine

Bloomberg School of Public Health

In this session we will discuss a popular limma packet and a little bit about some general concepts from statistical inference.

The limma packet says, the limma stands for linear models for micro rays. It's a package that was developed for analysis of gene expression microrays, but it fits a classic model that we want to fit a lot of different types of genomic data. So while a lot of the terminology limma has to do with microrays, it really being used for a lot of different types of data. I'm using it myself for my own research on DNA methylation.

There's other types of, there's a couple of other analysis packages that are very popular. Bioconductor, that deals with count data. Specifically, negative binomial models for count data. These packages have names such as, PE Seek, PE Seek Two, and Estar. Estar is from the same authors as the limma packets. A lot of these models basically treat each feature. We have a set of features, in this case here, let's think of them as genes, and we want to find genes that are differentially expressed.

A thing that's very important in genomics is that we have learned that borrowing information across genes is extremely beneficial. So in most gene expression data sets where many genes, say 20,000 or 30,000 different genes, we have relatively few samples. Sometimes you have six samples, sometimes you have on the order of a hundred samples, but we have far fewer samples than we have genes. We are essentially asking the same question for all the genes.

We can get a better analysis by borrowing information across genes. That's usually done using a statistical class of statistical methods known as empirical base methods, and it typically has to do with modeling the relationship between gene expression and variance of the gene. This all gets a little bit technical, but the common theme here is that you gain a lot of power and performance by using these empirical based techniques. That is something that is a little bit special to genomics.

In all these cases, we have some data and we have something we will call the design of the experiment. The design of the experiment tells us something about what samples were run at what times, which rules do they come from? Are the samples paired? Do we have additional comparative information at the sample wants you to take into consideration such as sex sample or the age of the sample, or the age of the person where you take the sample from?

And these are all encapsulated into something we in statistical terms call a design matrix. We will discuss a design matrix a little bit later in this session when we can see an example. So let's start off by loading in the limma packets and load an example data set called leukemiasEset.

So we see here that leukemia used as an expression set, it has 60 samples. And the interesting variable in this data set is a variable called leukemia type Leukemia types, so this data set here, profiled four common subsets of Leukemia. We have ALL, AML, CLL and CML, and then finally we have the reference group, that's NoL, it stands for Not Leukemia. So these are patients that didn't have Leukemia.

We, in this session we will avoid discussing complicated designs and we will just go through a very simple two group comparisons where we imagine we are profiling examples from two different experimental groups and we want to find genes that are different in express We got to select the LL group and the AOL group.

So now that we subset the data, we can have a look at our variable of interest. This is where the variable that encapsulates everything about the sign we are interested in, in this toy example.

It's been a factor. It has five levels, and these levels represent types of leukemia that are not present in our subset of the data. That's going to confuse some of the model fitting functions. So we're going to clean up the data a little bit, we're going to recreate the leukemia type factor, and the effect of that is we are going to get rid of the three levels for which we don't have any samples.

Look, we have the same variable list but now we only have two levels instead of five levels. The first level in a factor is always important. It's the first level that's being used as a reference level. This case here we have ALL as the reference level. I'm not going to fix that. Normally you would do that. The implication of this is that the differential expression we're going to see is genes That are differentially expressed in patients that don't have leukemia relative to those who have ALL. That's a little confusing. A more natural thing is to talk about genes that are differentially expressed in ALL relative to not having leukemia.

The first thing we're going to do when we fill a liver more or less we set our design matrix, which encapsulates the design of the experiment. Setting up a design matrix is a general statistical task and it takes quite a while to become fully comfortable with it There's a great introduction to this in the limma user's guide and if you need more help which is very likely, you can ask at the support forum or you can consult a friendly local statistician. I'm going to emphasize that setting up a design matrix for the new experiment experiment usually does not require very much insight into genomics. It's a statistical task that doesn't have much to do with genomics, although we use it a lot.

It's a matrix where the number of rows is equal to the number of samples, and we have two columns which describe two parameters in the model. We have an intercept parameter. In this case here, the interpretation, the intercept parameter is that this is the average gene expression for a given gene across all our samples.

The second column, which is really the parameter we're interested in, is the difference in gene expression between NOL and ALL.

If this parameter is equal to zero, it means that the gene is not differently expressed because there's no difference between NOL and ALL.

So setting up model matrices and using this model matrix thing and being sure that you interpret the coefficient correctly is, as I said, a task that's completely beyond the scope of the session here. So we'll continue this. We see this example a lot. This is probably the most common set up a two group comparison and it's important to feel somewhat comfortable with it.

I'm going to note that what we're going to do here, find the different expressed genes between two groups, is we're going to do this with something that is very similar to doing it teaches.

Well let's wait on that for a moment. Now we have the design, and the first think we do in limma is we fit the basic model for the design.

So, in statistics, we separate the model from the hypothesis. The model tells us in this case here, something about that we have 24 samples and they have been, there are two groups of samples.

So we round LM fit which fits a linear model to all the genes separately. So at this stage here, at the output of LM fit, we've done nothing where we borrow information across genes.

That happens in the next step where he does an empirical base estimation of the variability. Now this is again, completely outside the scope to explain what this really does but essentially it shows that the variability of a gene is a mixture of a gene specific variability and a variability where you assume or global variability where the global variability assumes that all genes have the same variability.

Assuming that all genes have the same variability is clearly not right biologically, but assuming, but on the other hand, estimating the variables for each gene separately means we are estimating a variability with only 12 samples in each group. That's not a whole lot of data and what happens in the empirical base step here is that we improve our variance estimation by forming this mixture.

Okay, so let's just do it and now we filled it and we can look at the top different expressed genes. So let's step back a little bit here. Why is there only one list of different express genes? That's because we have a design, where we have two group comparisons. When you only have two groups and you have no covariance there's only one comparison that's worth making. That is asking whether or not the two groups have the same expression level. So when I run this top table thing here, I don't have to explain what coefficient, or what chest I'm interested in. We're going to expand a little bit on this below.

So this here we have to get some columns out. We get a lock fold change, which is a fold change between ALL and NOL. We get a T statistic, we get a P value. We get just an adjusted P value, which is adjusted for multiple testing.

That's usually what we're interested in. We're also interested in the full chase estimate. Now remember that our factor busted off so that reference level is ALL which means that the fold chains we have here is the fold chains from NOL to ALL.

So we can actually compute this by hand just to make sure the we have the right interpretation. So let's just get the first gene, and we're going to get the gene name so we can look it up in our expression list, this is really just like a feature name. And now we're going to compute the expression for each, the mean expression inside each group. I do that with this little C apply command here which says I take my expression level for only one gene, and I do a mean inside each level of the leukemia type thing. So what could I get out of that? This is the expression level in the two groups, ALL and NoL, and my log fold change is NoL minus ALL, because of the reference level in the factor. So again, this is a little bit where it's positive because a gene is really actually, NoL and ALL is and isn't an OL. Now, let's try to do this again. So, what we've done here is really similar to a cheap statistic. It's something known as a moderator gene statistic, because we have estimated the variability of the data in a different way from a normal t statistic, where we have utilized that we have many thousands of genes, and we utilize that information.

Now let's go back and refit this model here, where we very specifically explain what contrast we are interested in.

So we set out a new model matrix we call design two that is very similar to design one, but notice this -1 I have here. Again, we won't go into details but essentially it means that I get a new matrix where

the column names are slightly different. And the true parameters in my model really what I get out of this model matrix is that the first parameter is the expression level in ALL, and the other parameter is the expression level in NoL.

Remember before, the design matrix had two parameters, and it was expression in ALL and then the difference in expression from ALL to NoL.

I'm going to change the name of these, the columns of this thing here, just because this is a little bit much to type.

I have ALL and Nol. And now, I do the same thing as before, I fit it, but now, I have to make a contrast. So a contrast is a hypothesis, and we're going to test, we're going to call this function called makeContrasts, and I hope you can get a little feel for how it's run. So I'm saying I want to have a contrast that's ALL minus NOL. This is the opposite of what we did before, we allow using NOL as the reference level. I say I take my levels from this design matrix, and I get a little this contrast matrix is really, quite simple, and what this contrast test is whether or not these two parameters with this linear combination is equal to zero. So it's going to test whether ALL-Nol is equal to zero.

So now that I have a contrast I have to do something called contrast.fit and then I've got to run my eBayes a module born from racing across genes, and now I can look at my top table. Now my top table here is relative to this specific contrast I have inside my That I used in the contrast upfit.

Coursera provides universal access to the world’s best education, partnering with top universities and organizations to offer courses online.