In the previous module, we talked about how to make riskier predictions. In this module, we'll focus on how your predictions can actually be falsified, first in theory and then in practice. Let's start with a very simple question: have you ever designed a study where the goal was to show that there was no effect? That is, your prediction was to examine something where there was no difference between, for example, two groups of people, not just as a possibility, but as the main question you were interested in. Now, I've asked this question to quite a lot of scholars around the world, for example in workshops that I teach, and more than 95 percent of people, I would guess, say no. They always ask whether there is some difference. So most scientific research questions are stated in a form where people want to demonstrate that there is some effect or some difference between groups. On the other hand, it is very often the case that being able to demonstrate the absence of an effect would be extremely interesting. For example, you might come up with a new intervention. There is already an existing one, but it requires face-to-face contact with a therapist: people need to come in every week to get treatment. You predict that you can do just as well, not better, not worse, but just as well, with a digital coach that people can simply carry with them on their smartphone. This new intervention would be much cheaper and much easier for people to use, and if it is equally effective, we want to know, because maybe we can use it in the future. Another situation is a theoretically predicted absence of an effect: your theory predicts that two things should not differ, and you want to be able to demonstrate this. So how would you, theoretically speaking, design an experiment so that you could actually show the absence of an effect?
So take a moment to think about this. Maybe think about the hypothesis in your last study, or, if you're planning a new experiment, the hypothesis in the next study you plan to do, and ask yourself: what would actually falsify that hypothesis? As you think about this, maybe I can give you a practical suggestion: this question, what would falsify your hypothesis, is a perfect question to ask if you dozed off during a public lecture. The lecture is over, everybody is waiting for somebody to ask a question, and you didn't really pay attention; you can always ask, "So, what would falsify the hypothesis in this study?" Once you have thought about this for a while, let's take a look at some possible answers that people have given me in the past. Now, you might think: I can falsify my hypothesis whenever I observe a p-value that's larger than the alpha level I chose. However, the absence of evidence is not evidence of absence. To give a simple illustration, let's say I ask people who follow this online course how much they like chocolate, and I ask a couple of people who have not followed this online course the same question, taking just two people from each group. Am I likely to find a statistically significant difference between these two groups of two people? No, of course not, because the sample size is much too small to tell what is going on. This already shows that if the sample size is too small, we might simply lack the power to detect real differences. Assume we are interested in the difference between two groups. On the left we see the expected distribution of differences if the true difference is exactly zero. On the right we see the expected distribution if the true difference is 0.5.
We have observed a difference of 0.35, which is not extreme enough to be statistically significant given this sample size. However, a difference of 0.35 is actually quite likely if there is a real difference of 0.5. So observing a non-significant difference does not mean that there is no true effect in the population. Now take a look at what you did in practice: open your last paper, or a paper you are currently working on, or, if you've never written a paper, another article that was very important to you. Search this work for a p-value larger than 0.05 and look at how that result is interpreted. You'll very often see that a non-significant p-value is in practice interpreted as the absence of an effect, even though this is a logical fallacy; it is one of the most common misinterpretations of p-values. Now you might think: okay, forget a non-significant result. What if we observe a significant result in the opposite direction of what we predicted? Surely that must be a way to falsify our prediction. Yes, I guess in practice most people would consider this a clear falsification. If you predict an effect in one direction and it is statistically significant in the opposite direction, that seems to clearly falsify your prediction. However, if the null hypothesis is true, you shouldn't find significant results very often in any direction, let alone in the opposite direction. If the null hypothesis is true, finding a significant effect in the opposite direction is actually a Type 1 error, a fluke. Most of the time you'll just find a non-significant p-value, so this is also not enough in practice to actually falsify predictions. So let's take a look at what you would actually predict. Is any effect in the predicted direction actually support for the alternative hypothesis? We very often treat it like it is, because we make a directional prediction, but do we really think so?
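To make this concrete, here is a small simulation, my own illustration rather than part of the lecture, of the situation just described: a true difference of d = 0.5 between two groups, analyzed with an independent t-test. The sample sizes (2 and 20 per group) are chosen for illustration; the point is that with small samples, most studies miss a real effect entirely.

```python
import numpy as np
from scipy import stats

# Simulate many two-group studies where the TRUE difference is d = 0.5,
# and count how often an independent t-test comes out significant.
rng = np.random.default_rng(2024)

def fraction_significant(n_per_group, true_d=0.5, n_sims=10_000, alpha=0.05):
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n_per_group)
        treatment = rng.normal(true_d, 1.0, n_per_group)
        if stats.ttest_ind(treatment, control).pvalue < alpha:
            hits += 1
    return hits / n_sims

# Two people per group (the chocolate example): almost never significant,
# even though a real difference of half a standard deviation exists.
print(f"n =  2 per group: {fraction_significant(2):.2f} significant")
# Twenty per group: still misses the real effect most of the time.
print(f"n = 20 per group: {fraction_significant(20):.2f} significant")
```

So a non-significant p-value here mostly reflects low power, not the absence of an effect.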
For example, would an effect size of Cohen's d of 10, which means a difference of 10 standard deviations, actually be support for your theory? This is an effect so immensely huge that almost no effects, maybe no effects at all, in for example experimental psychology would ever be this big. So I would argue that this is an effect much too large to be support for your theory, and if I observed a Cohen's d of 10 in a study that I performed, I would strongly suspect that I had made some coding error or calculation error, because this is just too big. Nevertheless, we see cases in the scientific literature where huge effects are interpreted as support for a theory when this is really very doubtful. Let's take a look at a study that is mentioned in many popular science books and was interpreted as evidence that the decisions judges make are influenced by how long ago they had a coffee or lunch break. Here we see a graphical representation of the proportion of favorable parole decisions that real-life judges make as a function of the number of cases they handle across the day. Early in the day, they start by granting parole to about 65 percent of people, which basically means, "All right, you can go back into society." But then very quickly the proportion of favorable decisions decreases, down to zero, over the course of the day. It goes back up again after a quick break in which the judges have something to eat, relax a little, and replenish their energy. Then, again, the proportion of favorable parole decisions drops to a very low number over the course of the day. They take another break, the proportion goes back up to 65 percent, only to drop again over the course of the day.
So these are very clearly noticeable effects of maybe fatigue or hunger, something that impacts the cognitive resources of these judges, or at least this is the theoretical explanation that the authors provide. If we calculate the effect size here, we see that the effect is a Cohen's d of 2, which is incredibly large. Almost no effects in psychology are this big. So this is a surprisingly huge effect, not just once, but returning three times over the course of the day, that would supposedly be a consequence of some fatigue and maybe hunger. If hunger really had such a huge effect on real-life court cases that we see this drop three times a day, then society would basically become a mess just before lunch break, because our cognitive resources would all be so depleted that we would be making all sorts of mistakes. There is a commentary on this study by Andreas Glöckner, who doubts that the effect is actually due to the theoretical mechanism the authors suggest, and I think this makes a lot of sense. So sometimes an effect is much too large to be plausible: under our theory and our theoretical predictions, we shouldn't be observing effects that are this large. All right, let's go back to the questions we were asking. A Cohen's d of 10 is probably too much. A Cohen's d of 2 is, in most cases, too large an effect to predict. Would an effect size of Cohen's d of 1, one standard deviation, be support for your theory? It might be; there are some effects that are this huge. For example, in psychology we have the effect of social exclusion. If people tell you that they don't particularly like you and would rather interact with other people, you feel socially excluded, which has very negative affective consequences of approximately a Cohen's d of 1. So this is a very noticeable, huge effect. Would an effect of 0.1 be support for your theoretical prediction?
Now we're talking about a really small, almost unnoticeable effect, but it is still an effect, and it could still be support for your theory if your theory predicts very tiny effects. But we can make it more extreme: would a Cohen's d of 0.01 be support for your theory, or is this really too close to zero to matter? Or maybe 0.001? I think now we've reached the point where most people would say, "Well, this is really too small to be something that my theory has predicted. This is basically zero in practice." However, when we make a directional prediction, we say that these types of effects are all part of our alternative hypothesis. So these tiny effects are still officially support for our alternative hypothesis if we have a directional prediction. In practice, it is very difficult to actually study these kinds of effects. If you want 90 percent statistical power for an effect size of 0.001 in an independent t-test at an alpha level of 0.05, you would need 42 million observations in total to be able to reliably study an effect of this size. It is ridiculously rare that you actually have these resources. It could happen, though: maybe you're working for a large international Internet company that collects data from millions of people; then you can study these kinds of effects. But for most people, I think, most theories are not built on effects of these tiny magnitudes. The problem is that you have to set a boundary somewhere, because we can never prove that an effect is exactly equal to zero. There is always some random variation, and we simply don't have any statistical test to say that an effect is exactly zero. Maybe if you could measure the entire population and find an effect that is spot on zero; but even then, there would probably be the tiniest of deviations just due to random variation.
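The 42-million figure can be checked with a quick back-of-the-envelope power calculation. The sketch below is my own illustration, not software mentioned in the lecture, and uses the standard normal approximation to the sample-size formula for a two-sided independent t-test at alpha = 0.05 and 90 percent power.

```python
from scipy import stats

def total_n_two_sample(d, alpha=0.05, power=0.90):
    """Total N for a two-sided independent t-test (normal approximation)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # 1.96 for alpha = .05
    z_beta = stats.norm.ppf(power)            # 1.28 for 90% power
    n_per_group = 2 * (z_alpha + z_beta) ** 2 / d ** 2
    return 2 * n_per_group

# A moderate effect is cheap to study; a d of 0.001 is not.
print(f"d = 0.5:   {total_n_two_sample(0.5):>13,.0f} observations in total")
print(f"d = 0.001: {total_n_two_sample(0.001):>13,.0f} observations in total")
```

For d = 0.001 this comes out at roughly 42 million observations in total, matching the number mentioned above; an exact t-based calculation gives essentially the same answer at sample sizes this large.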
So we don't have a tool to say that the effect is exactly zero, which means you have to say, "Okay, I'm going to make a theoretical prediction with some smallest effect size that I actually care about." Some people have argued that the authors of a theory should state which potential outcomes would lead them to regard the theory as disproven. This is a suggestion by De Groot in his great book Methodology from 1969: when you make a theoretical prediction, you should also state what would falsify it. In the absence of a true process of replication and falsification, the entire discipline risks a slide towards the unfalsifiable. Ferguson and Heene called these "undead theories": theories that just won't die. They are not falsifiable; they keep surviving even though people try to kill them, and that's why they are called undead theories. My favorite example here is a book that a PhD student of mine once showed me: The ABC of Behavior Change, which advertises on its cover the inclusion of 83 different theories of behavior change. It presents this as a reason to buy the book, as if it's a good thing that we have 83 different theories of behavior change. I would say this is maybe about 80 theories too many. Why aren't we falsifying most of these theories and selecting the ones that have the most going for them? One reason is mentioned by Watkins in 1984: a theory is a little bit like someone else's toothbrush. It's fine for that individual's use, but for the rest of us, well, we would just rather not, thank you. We prefer to generate our own theories and then work on those, hopefully making sure that they're not falsified by other people, so that we can continue working on our theories for the rest of our career. Having a theory that's actually falsifiable is quite negative for an individual, because then you have to give up the theory that you created.
Whereas if you can keep promoting your theory, that's much better for your individual career. What should we do about this? Well, Klaus Fiedler writes that only when theories are detached from individuals, and many different researchers, working in different labs and driven by different motives, participate in a pluralistic endeavor, can the full potential and the weaknesses of a theory be expected to unfold. So, in essence, what we need is to collectively get together, think about the theories that we have, how we can actually test them, how we could falsify them, and what their weaknesses are. Then we can have theoretical progress, in the sense that we might not end up with 83 different theories, but maybe with three that are really useful to test. So whenever possible, when you design an experiment or you have a theory and a theoretical prediction, carefully think about, and clearly state, what would falsify your prediction.