Hello, I am Neil Clark. I'm a postdoc in the lab of Avi Ma'ayan here at Mount Sinai, and my contribution to this series of lectures is to tell you about some statistical methods for approaching high-dimensional data. In this two-part lecture, I'm going to tell you about a method called Gene Set Enrichment Analysis, which also goes by the acronym GSEA. Gene Set Enrichment Analysis is a method used for extracting biological insight from gene expression data. I'll begin in this first slide by giving you the overall picture of how this works. Microarray experiments measure the expression of genes on the genome scale; typically tens of thousands of genes are measured in a single microarray experiment. One way that microarrays are used is to make a comparison between two biological states, diseased and not diseased for example, and to compare the expression measurements to gain insight into the biology, and to learn something about the disease. Traditional methods look for individual genes whose expression differs between the two classes of samples, but this approach often has two associated problems. Firstly, it may happen that no genes stand out as being differentially expressed, at least not above the noise in the data. Alternatively, many genes may stand out as significantly differentially expressed, but then we find ourselves with a large list of genes and no obvious way to interpret it and derive meaningful biological insight. GSEA attempts to overcome both these problems by looking not for individual genes, but for whole sets of genes that are collectively differentially expressed. This has two advantages. Firstly, it can increase statistical power: small but consistent changes throughout a whole set of genes are liable to stand out above the noise much more clearly than changes in individual genes.
Secondly, if the set of genes is chosen carefully, such that the genes in the set are related biologically, then biological interpretation is built into the approach. In the first part of this two-part lecture, I'm going to go through a few mathematical preliminaries. A knowledge of these is not strictly necessary for you to perform Gene Set Enrichment Analysis, but I think that if you're well acquainted with them, and you have some knowledge of the statistical test that inspired the method, then it will stand you in good stead for understanding how Gene Set Enrichment Analysis works, and also how to interpret the results. So this first part will be spent looking at random walks in one dimension, which are essential for understanding the Kolmogorov-Smirnov test. This is the statistical test that inspired Gene Set Enrichment Analysis, which retains much of the character of this goodness-of-fit test. In the second part, I'll take you through Gene Set Enrichment Analysis by applying it to a small example data set, during which you'll be able to see all the inner workings of the method. Okay, we're going to begin with random walks in one dimension. This may seem a little abstract and unrelated to our ultimate aim of analyzing gene expression data, but I hope you'll stick with me to see how this builds up. Let's begin with a regular one-dimensional lattice, which you can think of as a line with an origin labeled zero, and a series of equally spaced points labeled one, two, three, four and so on going to the right, and minus one, minus two and so on going to the left. To perform a random walk, we start at zero. Then, with discrete time steps, we take a step in a random direction, left or right with equal probability. As we repeat these random steps, we can chart our progress by plotting our position on the lattice against time. As the top figure shows, this results in a jagged line.
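To make this concrete (this code is not part of the lecture, just an illustrative sketch; the step count and seed are arbitrary choices), the random walk just described can be simulated in a few lines of Python:

```python
import random

def random_walk(n_steps, seed=0):
    """Simulate a simple random walk on the one-dimensional integer lattice.

    Starting at the origin, take n_steps steps, each -1 or +1 with equal
    probability, and record the position after every step.
    """
    rng = random.Random(seed)
    position = 0
    path = [position]
    for _ in range(n_steps):
        position += rng.choice([-1, +1])
        path.append(position)
    return path

path = random_walk(1000)
```

Plotting `path` against the step index reproduces the kind of jagged line shown in the top figure.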
In the bottom figure, we have let the walk go on for much longer, and as you can see, the fluctuations take place on a wider and wider range of scales. If we take the limit in which the length of each step tends to zero while the number of steps tends to infinity, the walk becomes what is termed a Wiener process. A modification of this random walk is defined by fixing the end point to be at zero. One way to make such a walk would be to take some number of right steps and an equal number of left steps, randomly permute them, then start at zero and follow the randomly ordered steps. Because there is an equal number of left and right steps, you must end up back at zero. If you take the analogous limit, in which the size of the steps goes to zero as their number goes to infinity, the walk becomes what is called a Brownian bridge. One question you might like to ask about a Brownian bridge is: in the course of the process, what is the maximum distance I am likely to stray from zero? Mathematicians have answered this question by calculating the probability distribution, which is shown here in the equation at the bottom of the slide. I'll now quickly define what I mean by a probability distribution. A random variable is just that: a variable which takes on random values. Which particular values it is likely or unlikely to take is described by its probability distribution function. The probability of observing the random variable to have a value in the range between the values a and b is given by the area under the function between those two values. In mathematical notation, this is the integral shown in the first equation here. An example of a probability distribution function, for a random variable distributed as a Gaussian with a mean of 1.0 and a variance of 0.5, is shown in the upper figure on this slide.
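The permutation construction just described is easy to try out in code (again, a sketch for illustration, not from the lecture slides): shuffle an equal number of right and left steps and follow them, and the walk is guaranteed to return to zero.

```python
import random

def bridge_walk(n_pairs, seed=0):
    """Random walk with fixed end points: shuffle n_pairs right steps and
    n_pairs left steps, then follow them from the origin. Because the
    +1 and -1 steps are balanced, the final position is always zero."""
    rng = random.Random(seed)
    steps = [+1] * n_pairs + [-1] * n_pairs
    rng.shuffle(steps)
    path = [0]
    for s in steps:
        path.append(path[-1] + s)
    return path

path = bridge_walk(500)
max_excursion = max(abs(p) for p in path)  # maximum distance strayed from zero
```

Repeating this many times and recording `max_excursion` gives an empirical version of the distribution the mathematicians computed exactly.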
The fact that the function peaks in the vicinity of one can loosely be understood as meaning that values in the vicinity of one are more likely to be observed than other values. The cumulative distribution function is related to the probability distribution function. The value of the cumulative distribution function at x, say, tells you the probability of observing the random variable to have a value less than or equal to x. This can be written in terms of an integral of the probability distribution function, as shown in the last equation on this slide. The figure on the right shows the cumulative distribution function corresponding to the probability distribution function in the figure above. Notice that at small values of the independent variable the cumulative distribution function tends to zero, and at large values it tends to one. This is just as you would expect: for an extremely small value of x, it is unlikely that we might see an even smaller value, and for an extremely large value of x, you are very likely to see a value that size or smaller. The Kolmogorov-Smirnov test is a statistical test that was the inspiration for Gene Set Enrichment Analysis, and as we will see, it works on a similar principle. The Kolmogorov-Smirnov test is a means of addressing the question of whether some data are consistent with a given cumulative distribution function. To put it more explicitly: consider some random variable which has a given cumulative distribution function. We make several sample observations of this variable and plot the cumulative distribution function of our sample. Because we can only take a finite sample, random fluctuations will mean that the distribution function for our data will most likely differ from the true distribution function by some random scatter.
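As a concrete check of the relationship between the two functions (a sketch using only the Python standard library; the Gaussian parameters are the mean of 1.0 and variance of 0.5 from the example), the Gaussian CDF has a closed form via the error function, and numerically integrating the PDF up to x should recover the same value:

```python
import math

MU, VAR = 1.0, 0.5
SIGMA = math.sqrt(VAR)

def gaussian_pdf(x):
    """Probability distribution function of a Gaussian with mean MU, variance VAR."""
    return math.exp(-(x - MU) ** 2 / (2 * VAR)) / (SIGMA * math.sqrt(2 * math.pi))

def gaussian_cdf(x):
    """Cumulative distribution function: P(X <= x), via the error function."""
    return 0.5 * (1 + math.erf((x - MU) / (SIGMA * math.sqrt(2))))

def cdf_by_integration(x, lower=-10.0, n=20000):
    """Midpoint-rule approximation of the integral of the PDF up to x."""
    h = (x - lower) / n
    return sum(gaussian_pdf(lower + (i + 0.5) * h) for i in range(n)) * h
```

At the mean, both routes give 0.5, and the CDF runs from near zero on the far left to near one on the far right, exactly as described above.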
The Kolmogorov-Smirnov test is a way to answer the question of whether the difference between the two distributions is just random scatter, or whether there is a real difference between them. It is a test of goodness of fit. One property particular to this test is that it is useful when the sample size is small, as there is no binning of the data. Okay, we'll now go through the Kolmogorov-Smirnov test using a simple example. Suppose we take some measurements of a random variable and obtain the numbers shown here. We want to test whether these numbers are consistent with the variable being drawn from a Gaussian distribution with mean 1.0 and variance 0.5. The first thing we do is plot the cumulative distribution functions for the data and for the distribution we want to compare it to. These are shown in the figure on the top here. Notice that the cumulative distribution function for our data, as we move along the horizontal axis left to right, takes a step up every time it reaches a value that is in our data, such that the resulting stepped curve indicates the fraction of our data points that have a value less than or equal to the value on the horizontal axis. So we have a jagged, stepped cumulative distribution function for our data, which would smooth out as we collected more data, and we also have the completely smooth Gaussian cumulative distribution function to which we want to compare it. I'll let you in on a little secret here: I did actually draw the data points from the Gaussian. That's why the two curves seem to be quite similar. But we're going to make an objective, quantitative, statistical test of the similarity between these two with the Kolmogorov-Smirnov test. The basic idea of the Kolmogorov-Smirnov test is that if there is no real difference, other than random scatter, between the two cumulative distribution functions, then the difference between them should just be a random walk.
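The statistic itself can be sketched directly (the sample values below are made up for illustration; they are not the numbers on the slide): the empirical CDF jumps by 1/n at each sorted data point, and the KS statistic D is the largest absolute difference between it and the reference CDF, checked just below and at each jump.

```python
import math

def gaussian_cdf(x, mu=1.0, var=0.5):
    """Reference CDF: Gaussian with mean 1.0 and variance 0.5."""
    return 0.5 * (1 + math.erf((x - mu) / math.sqrt(2 * var)))

def ks_statistic(data, cdf):
    """Maximum absolute difference between the empirical CDF of the
    sample and a reference CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # The empirical CDF is i/n just below x and (i + 1)/n at x.
        d = max(d, abs(cdf(x) - i / n), abs(cdf(x) - (i + 1) / n))
    return d

sample = [0.2, 0.9, 1.1, 1.4, 0.7, 1.8, 0.5, 1.0]  # illustrative values only
D = ks_statistic(sample, gaussian_cdf)
```

D is exactly the maximum distance the "bridge" of differences strays from zero, which is the quantity the test turns into a probability.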
But as the end points of every cumulative distribution function are fixed, at zero on the far left and one on the far right, the end points of the difference between the two will be fixed at zero, so we should expect to observe a Brownian bridge. The lower figure shows the difference between the two cumulative distribution functions. As you can see, it tends to zero on the left and right, and in between there are some fluctuations. This curve looks like the Brownian bridges we looked at earlier. So it's looking promising, but how do we quantify this judgement and make it objective? Remember how we said that mathematicians have worked out the probability distribution for observing any given maximum distance from zero in the course of a Brownian bridge. We can use this to estimate the probability of observing the actual maximum distance, under the null hypothesis that there is no real difference between the distributions. If we observe a maximum distance which is so large as to be unlikely under the null hypothesis, then we will reject the null hypothesis and conclude that the data are inconsistent with the comparison distribution. On the other hand, if the maximum distance is small, and reasonably likely to occur by random scatter under the null hypothesis, then we will accept that the data are consistent with the comparison distribution. Okay, let's consider another example: the data points shown here. We will compare them to the same Gaussian distribution. Again, in the figure on the top, we plot the cumulative distribution functions, and we might already become suspicious that something is wrong; they seem to be a little different this time. In the figure on the bottom, we plot the difference between them, and this time we see the supposed Brownian bridge straying a much larger distance from zero. If we calculate the probability of seeing a maximum distance of this size or larger, we will see that it is very unlikely, if this really were a Brownian bridge.
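The probability calculation referred to here can be sketched with the standard asymptotic Kolmogorov series: under the null hypothesis, the probability that the scaled maximum distance sqrt(n)*D exceeds lambda is approximately 2 times the alternating sum of exp(-2*k^2*lambda^2) over k = 1, 2, 3, ... This is the textbook large-sample formula, not something shown on the lecture slides:

```python
import math

def ks_pvalue(d, n, terms=100):
    """Asymptotic Kolmogorov p-value: the probability, under the null
    hypothesis, of seeing a maximum CDF distance of at least d from a
    sample of size n. Uses the alternating-series tail probability of
    the Brownian bridge's maximum excursion."""
    lam = math.sqrt(n) * d
    if lam == 0:
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2 * s))
```

A small maximum distance gives a p-value near one (accept consistency); a large one gives a tiny p-value (reject the null), exactly the decision rule described above.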
So we then conclude that this is not a Brownian bridge, and that there is a difference between the two distributions which is more than just random scatter. In the next part, I will show you how all this relates to Gene Set Enrichment Analysis. But to conclude this first part, I will give you the big picture of how Gene Set Enrichment Analysis is performed. First, the gene expression data from two different classes are taken, and the genes are ranked according to their differential expression, with the most up-regulated genes on top and the most down-regulated genes on the bottom. We then take a test set of genes which are known to be related by some common biological theme; these may be the genes corresponding to the proteins that are members of a given pathway, for example. With this gene set, we then try to quantify the degree to which its genes tend to sit at extreme positions of the ranked list. In order to compare this quantity to the distribution we might expect by chance, we repeatedly perform random permutations of the data. In the next section, I'll fill this process out in more detail using a simple example.
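As a preview of the second part, the big-picture recipe above can be sketched in code. This is a deliberately simplified, Kolmogorov-Smirnov-style version, not the exact statistic from the GSEA publication: every gene in the set contributes an equal upward step of the running sum, every other gene an equal downward step, and the enrichment score is the maximum deviation from zero. The gene names and set are hypothetical.

```python
import random

def enrichment_score(ranked_genes, gene_set):
    """Walk down the ranked list, stepping up at genes in the set and
    down otherwise; the score is the maximum absolute deviation from 0."""
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    up, down = 1.0 / n_hit, 1.0 / n_miss
    running, best = 0.0, 0.0
    for h in hits:
        running += up if h else -down
        best = max(best, abs(running))
    return best

def permutation_pvalue(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Fraction of random gene orderings whose score is at least as
    extreme as the observed one (a simple gene-permutation null)."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    genes = list(ranked_genes)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(genes)
        if enrichment_score(genes, gene_set) >= observed:
            count += 1
    return count / n_perm

ranked = ["g%d" % i for i in range(20)]   # hypothetical ranked gene list
pathway = {"g0", "g1", "g2", "g3"}        # hypothetical set, clustered at the top
p = permutation_pvalue(ranked, pathway)
```

Because the pathway genes all sit at the very top of this toy ranking, the running sum climbs straight to its maximum and the permutation p-value comes out small, which is the pattern the next part works through in detail.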