How do you analyze a genome-wide association scan? You have a collection of samples, and for each sample you know the value of the phenotype; for the moment, consider this to be some quantitative trait: body mass index, or height, or some lipid level. You also have a collection of genotypes, measured genome-wide. To start with, you analyze the relation between the genotype at a specific locus and the phenotype, and then you go through the whole genome and repeat this procedure again and again. The models most often used for genome-wide association scanning are linear regression models. In such models, you model the value, or rather the expectation, of the trait of interest as a sum of several effects. Here on this screen, you see a formula defining the value of the phenotype of interest, y, as a sum of three terms. The first term is the intercept, mu; this is a model parameter with a specific value. Then you see the term beta multiplied by x_i, where x_i is the value of a specific covariate. This could be, for example, your genotype coded as 0, 1, or 2 for, say, the AA homozygote, the heterozygote, and the BB homozygote. Finally, you see the term e, which is the contribution from random noise. You can fit this model to your data and obtain the maximum likelihood estimate, the best estimate, of the effect of your covariate; we are mostly interested in the case where the covariate is the genotype. Under this model, when we code our genotypes as 0, 1, or 2, we are investigating an additive model, so beta corresponds to the additive effect of the specific allele. Let's consider a practical example. On this graph, you can see a number of dots, and they fall into three groups corresponding to the three genotypes.
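The additive model described above can be sketched as a simple least-squares fit. This is a minimal illustration with made-up toy numbers (the genotype coding 0/1/2 follows the lecture; the phenotype values are invented for the example), not the software one would use in a real scan:

```python
# Minimal sketch of the additive model y = mu + beta*x + e,
# fitted by ordinary least squares on toy data.
# Genotypes are coded 0/1/2 (copies of the B allele);
# phenotype values are invented for illustration.

def fit_additive(genotypes, phenotypes):
    """Return (mu, beta) from simple linear regression."""
    n = len(genotypes)
    mean_x = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(genotypes, phenotypes))
    sxx = sum((x - mean_x) ** 2 for x in genotypes)
    beta = sxy / sxx              # additive effect of one allele copy
    mu = mean_y - beta * mean_x   # intercept: expected trait for genotype 0
    return mu, beta

# Toy data in which the trait rises by about 0.6 per B allele:
x = [0, 0, 1, 1, 2, 2]
y = [3.0, 3.1, 3.6, 3.7, 4.2, 4.3]
mu, beta = fit_additive(x, y)     # mu = 3.05, beta = 0.6
```

In a real analysis one would of course use an established regression routine rather than this hand-rolled fit, but the estimates it returns have exactly the interpretation discussed next.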
Then we fit a regression line, and after we have done that, we have the estimates of our regression model, namely the intercept, mu, and the coefficient of regression, beta. Both the regression coefficient and the intercept have a very clear interpretation in a linear model when we study quantitative traits. Namely, the intercept corresponds to the expected value of the trait for the homozygote coded as 0; in this case, that expected value is close to 3. And the regression coefficient tells us by how much the trait changes as the genotype acquires copies of the alternative allele, so the heterozygote is coded as 1 and the other type of homozygote as 2. In this case, the additive effect of the allele is roughly 0.6. If we want to consider binary traits, we usually use logistic models. Logistic models utilize a very simple idea: they project the linear predictor, exactly the predictor of the linear regression model we have just considered, onto the interval between 0 and 1. This corresponds to the probability of one of the two binary outcomes, for example, the risk of disease. Now, what I have just described provides you an easy means to analyze directly genotyped data: if we code our genotypic data as the dosage of one of the alleles, 0, 1, or 2, we can easily use these regression models. However, in the previous lecture, I told you that most of the time we are dealing with imputed data, and imputed data are special. We don't actually know the exact genotype of a particular person at a particular locus; we have a probabilistic guess, a probability distribution. Remember the example from the last lecture? For the sample genotyped in the middle, we estimated the probability that this person is a CC homozygote to be 0.75. But still, there is a probability of 25% that this is a heterozygote. How are we going to deal with this?
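The logistic idea mentioned above, projecting the linear predictor onto the interval between 0 and 1, is just the logistic function applied to mu + beta*x. Here is a minimal sketch; the parameter values (mu = -2, beta = 0.5) are hypothetical numbers chosen purely for illustration:

```python
import math

def disease_risk(mu, beta, genotype):
    """Logistic model: map the linear predictor mu + beta*x
    onto a probability between 0 and 1."""
    eta = mu + beta * genotype          # same linear predictor as before
    return 1.0 / (1.0 + math.exp(-eta))  # logistic transform

# Hypothetical parameters: mu = -2 (baseline log-odds),
# beta = 0.5 per copy of the risk allele.
risks = [disease_risk(-2.0, 0.5, g) for g in (0, 1, 2)]
```

Whatever the parameter values, the output always stays between 0 and 1, and each extra risk-allele copy shifts the probability in the same direction, which is what makes this transform suitable for binary traits.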
Well, first of all, we can think of a kind of ideal model, where you would go over the whole distribution of the genotypes and compute the probability of the data given each specific realization of the genotypes. However, this may be a somewhat complex model mathematically, and practically it will take quite some time to estimate. So what else can you do? Well, in principle, you can pick the genotype with the highest posterior probability and pretend it really is that genotype. So actually, you can try to ignore the probability distribution and pretend you have directly genotyped data and know the genotype for sure. This has a lot of attraction; especially, it had a lot of attraction in the earlier days, when imputation had just come onto the scene. By that time, many, many software packages had been developed, and many of them were developed specifically to deal with directly genotyped data. Many of them used a condensed storage model, where you use only a few bits, for example only two bits, to store a genotype. But this model doesn't scale to probabilities: you cannot store real numbers using two bits. If you use the best-guess genotype, ignoring that you have a probability distribution, then you can still use all that software. However, it was soon figured out that if you act in this manner, you are going to get biased estimates of the genetic effect, and you are actually going to lose power. So what can we do? Should we go to these more complicated, likelihood-based models? Not necessarily, because there is an intermediate way. What one can do is regress on the estimated probabilities, or, if one studies additive genetic models, regress onto the allelic dosages. How well does this method perform? How do these three possible sets of methods for dealing with imputed data compare?
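The difference between the best-guess approach and regression on dosages can be made concrete. A minimal sketch, using the CC example from the lecture (P(CC homozygote) = 0.75, P(heterozygote) = 0.25); the function names are mine, for illustration only:

```python
def allelic_dosage(p_hom_ref, p_het, p_hom_alt):
    """Expected count of the alternative allele under the posterior
    genotype distribution; this is the regression covariate."""
    return 0 * p_hom_ref + 1 * p_het + 2 * p_hom_alt

def best_guess(p_hom_ref, p_het, p_hom_alt):
    """Genotype (0, 1 or 2) with the highest posterior probability."""
    probs = [p_hom_ref, p_het, p_hom_alt]
    return probs.index(max(probs))

# The example from the lecture: P(CC) = 0.75, P(het) = 0.25.
dose = allelic_dosage(0.75, 0.25, 0.0)   # 0.25: keeps the uncertainty
guess = best_guess(0.75, 0.25, 0.0)      # 0: the uncertainty is discarded
```

The dosage carries the imputation uncertainty into the regression as a fractional allele count, while the best guess collapses it to an integer; that collapse is exactly where the bias and loss of power discussed above come from.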
On this graph, you can see three lines: red, blue, and green. The green line corresponds to the sophisticated and theoretically superior model, the red line corresponds to the best-guess model, and the blue line corresponds to regression onto the probabilities. You can see from the graphs that, across the range of imputation accuracies and across the range of minor allele frequencies, the blue and green lines go very well together; they are almost equivalent. This indicates that, in the scenario where we study large samples and are interested in smaller effects, regression onto the probabilities is almost as powerful as the sophisticated model. At the same time, you can see that the red line, which corresponds to the best guess, starts deviating toward the bottom as imputation accuracy diminishes and in the region of low minor allele frequencies, indicating the smaller power of this approach under these conditions. Another scenario which is very interesting to consider is when we study relatively small samples, but our expected effects may be rather big. You can think of this scenario not in the context of studying complex polygenic traits, but rather as a study of some genetic determinant of omics, for example transcriptomics or other omics. Here the situation is slightly different. The green line is clearly the winner, so the sophisticated model performs best, as expected. The blue line starts deviating from it downwards relatively soon, so you would need very good imputation accuracy, and you would need to study rather common polymorphisms, for the regression-on-probabilities model to give you equivalent power. And of course, the red line goes down very quickly, so it shouldn't be used in this scenario. So now we have considered the set of methods which can be used for genome-wide association scanning.
And in the context of studying big samples and complex traits, regression onto probabilities may be the method of choice. Imagine now that you have run this model throughout your genome, and you have received many, many estimates of the genetic effects and, something we should be very interested in, the p-values. The smaller your p-value is, the stronger the indication of association in that region. However, an important question is: what is the cutoff p-value at which you claim that the result you reached for a specific association is genome-wide significant? Well, if you apply the regular 5% significance threshold and you study a million markers, you are likely to end up with tens of thousands of "significant" results, and they are not true; it is just a consequence of multiple testing. So we need to account for multiple testing, and your significance threshold should take this into consideration. Nowadays, when you study common variants and the population you study is of European descent, we commonly use a cutoff p-value of 5 multiplied by 10 to the power of -8. It is interesting that in this definition I am talking about common variants and about populations of European descent. The thing is that this threshold depends on the number of tests you do, and tests across the genome are not independent; they depend on the linkage disequilibrium structure. The more linkage disequilibrium you have, the fewer effective tests you are doing. This means that for a population with higher LD, the significance threshold can be less stringent, that is, a larger cutoff p-value, and the other way around for a population with lower LD.
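The genome-wide cutoff quoted above follows directly from a Bonferroni-style correction. A minimal sketch, assuming roughly one million effectively independent tests in the European-descent, common-variant setting:

```python
def bonferroni_threshold(alpha, n_effective_tests):
    """Per-test significance cutoff that keeps the family-wise
    error rate at alpha across n_effective_tests tests."""
    return alpha / n_effective_tests

# 0.05 family-wise error rate over ~1 million effective tests
# reproduces the conventional genome-wide significance cutoff:
cutoff = bonferroni_threshold(0.05, 1_000_000)   # 5e-08
```

A population with more LD has fewer effective tests, so the same calculation yields a larger (less stringent) cutoff, which is exactly the dependence on LD structure described above.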