An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.


From the course by Johns Hopkins University

Statistics for Genomic Data Science

92 ratings


Course 7 of 8 in the Specialization Genomic Data Science


From the lesson

Module 3

This week we will cover modeling non-continuous outcomes (like binary or count data), hypothesis testing, and multiple hypothesis testing.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

The p-value is the most widely used statistic in the entire world, for inference and for everything else. It's so popular that if it were cited every time it was used, it would have at least three million citations, making it the most highly cited paper ever created. So the p-value is a very important statistic, and since it's such an important statistic, there are lots of people who hate the p-value because it's so popular. Part of the reason people hate it is that people consistently misinterpret the p-value.

The p-value is defined as the probability of observing a statistic as or more extreme than the one you calculated, if the null hypothesis is true. A couple of things the p-value is not, and that will make statisticians see red: it is not the probability that the null hypothesis is true, and it is not the probability that the alternative is true. In some sense it is also not necessarily a measure of statistical evidence; that's a philosophical point people worry about, but here you need to interpret it very narrowly, as the probability of observing a statistic as or more extreme than the one you observed in the data if the null hypothesis is true.

Here we're going to use the responders and non-responders example again to illustrate what's going on. For gene one, we calculate a statistic that compares the responders to the non-responders. For example, we might calculate the t-statistic: take the average expression level among the responders, subtract the average expression level among the non-responders, and then standardize that difference by some measure of the variability, in this case the average variability in each of the two groups. In a previous lecture we learned one way you could try to quantify a null hypothesis.
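The t-statistic described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's code; the function name and the choice of per-group standard errors (a Welch-style standardization) are my assumptions.

```python
import numpy as np

def two_sample_t(responders, non_responders):
    """Difference in mean expression between the two groups,
    standardized by the variability in each group (Welch-style).
    Hypothetical helper for illustration."""
    x = np.asarray(responders, dtype=float)
    y = np.asarray(non_responders, dtype=float)
    diff = x.mean() - y.mean()
    # Standard error built from each group's own sample variance
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    return diff / se
```

A large positive or negative value of this statistic suggests the mean expression differs between the groups relative to the noise.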
One way to test the null hypothesis that the distributions are exactly the same among the responders and the non-responders is to permute the sample labels. When you permute the sample labels, you leave the relationships among the genes unchanged, but you break the relationship between each gene and the responder/non-responder label. If I recompute the statistic after each permutation, I get a distribution under the permutations, alongside the original statistic that I calculated. The p-value I can then calculate is the number of permutation statistics that are as large as or larger than the statistic I originally calculated, divided by the total number of permutations. I do this in absolute value, since in general the null hypothesis is that the difference between the two groups is zero, while the alternative could be that the difference is either positive or negative. So I have to look in both directions: I count up the number of permutation statistics that are more extreme in either direction and divide by the total number of permutations. In other words, I average the number of times a permuted statistic was as or more extreme than the statistic I originally calculated, and that gives me the p-value.

This p-value is often used as a measure, but in general it's used as a hypothesis-testing tool: if the p-value is small, you reject the null hypothesis, because the statistic is very extreme compared to the distribution you would have gotten under the null.

This is what p-value distributions look like for genomic experiments that are done well. Typically you see a spike near zero and then a flat distribution as you move out towards one. If you break it down into its parts, the p-values near zero, the really small ones, are the p-values coming from the alternative distribution. Remember, the p-value measures the probability of observing a statistic more extreme under the permutations than the statistic you actually observed. So if you observe a statistic that is very, very extreme, the number of permuted statistics larger than it will be very small, and you'll get a small p-value.
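The permutation procedure above can be sketched as follows. This is a minimal illustration under my own assumptions (a difference-in-means statistic, 1,000 permutations, and a hypothetical function name), not the course's code.

```python
import numpy as np

def permutation_pvalue(x, y, n_perm=1000, seed=0):
    """Two-sided permutation p-value for a difference-in-means statistic.
    Shuffling the pooled values simulates the null hypothesis that the
    two groups share the same distribution."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Absolute value: look in both directions, positive and negative
    observed = abs(x.mean() - y.mean())
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # permute the group labels
        stat = abs(pooled[:len(x)].mean() - pooled[len(x):].mean())
        if stat >= observed:
            count += 1
    # Fraction of permuted statistics as or more extreme than observed
    return count / n_perm
```

If the observed statistic is far out in the tail of the permutation distribution, very few permuted statistics exceed it and the p-value is small.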

These are the sort of p-values you expect to come from the cases that are not from the null distribution. Under the null, you get a flat distribution that extends out to the right-hand side. It turns out that a key property of the p-value is that it is uniformly distributed when the null hypothesis is true: it is equally likely to take any value between zero and one. What does that mean in general? It means that even if you get a small p-value, it might still come from the null distribution, because under the null there is an equal chance of getting any value between zero and one. This is actually a useful property that can be used to estimate things like the false discovery rate, which we'll cover when we talk about multiple testing. The basic idea is that the observed distribution is a mixture of two distributions: the p-values that come from the null hypotheses and the p-values that come from the alternative hypotheses. The null-hypothesis p-values should be uniformly distributed, and the alternative ones should be pushed towards zero, skewed away from one.
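The mixture described above is easy to simulate. In this sketch (my own toy setup, not from the course) most "genes" are null, so their z-statistics are standard normal and their two-sided p-values are uniform; a minority come from an alternative with a shifted mean, so their p-values pile up near zero.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

def pval_from_z(z):
    """Two-sided p-value for a z-statistic under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))

# 900 null genes (no real difference) and 100 alternative genes (shifted mean)
null_p = [pval_from_z(z) for z in rng.normal(0.0, 1.0, 900)]
alt_p = [pval_from_z(z) for z in rng.normal(4.0, 1.0, 100)]

# Null p-values are roughly uniform: about 5% fall below 0.05.
print(sum(p < 0.05 for p in null_p))
# Alternative p-values are skewed towards zero: nearly all fall below 0.05.
print(sum(p < 0.05 for p in alt_p))
```

A histogram of `null_p + alt_p` would show exactly the shape described in the lecture: a spike near zero sitting on top of a flat uniform background.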

Another common misinterpretation: under the alternative, p-values almost always go to zero as the sample size grows. Just because you got a really small p-value doesn't mean the difference is huge. It could just be that your sample size is really large, so the variability of your estimate is small; if there is any difference at all, the p-value will get small as the sample size gets big. The usual cutoff people use for calling p-values significant is 0.05. That applies when you're doing only a single hypothesis test, and even then the number is basically just made up; any other threshold could be used. It's useful to have a standard, but don't treat 0.05 as religious truth about whether your p-value is significant. You should always report p-values in conjunction with estimates and variances on a scale that is scientifically meaningful. P-values can be a useful complement to that, a way to quantify statistical significance, as long as you pay attention to their properties and interpret them correctly.
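The sample-size effect is worth seeing concretely. In this sketch (my own toy example, with a hypothetical helper using a normal approximation to the t-test, which is fine at these sample sizes), the true difference in means is a scientifically tiny 0.05 standard deviations, yet the p-value collapses towards zero as n grows.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def t_pvalue_approx(x, y):
    """Normal approximation to the two-sided p-value for a
    difference-in-means t-statistic (reasonable for large n)."""
    se = math.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    z = abs(x.mean() - y.mean()) / se
    return math.erfc(z / math.sqrt(2))

# A tiny true difference (0.05 SDs) becomes "highly significant"
# once the sample size is large enough.
for n in (100, 10_000, 1_000_000):
    x = rng.normal(0.05, 1.0, n)
    y = rng.normal(0.00, 1.0, n)
    print(n, t_pvalue_approx(x, y))
```

This is why a p-value should be reported alongside the estimated effect size: significance alone says nothing about whether the difference matters.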
