[MUSIC] We will conclude this module with an overview of the things you will see next week, because it is a lot of material you will have to learn. Just so that you are a little bit prepared, I will introduce you to the different steps we are considering. The first step is to set up a theoretical framework. This is not something we can really teach you here because, as I already said in some other lessons, it is the environment in and for which you want to construct the composite indicator. That is, even if you are constructing the composite indicator from scratch, you are not constructing something completely out of nothing. If you are talking about a certain environmental indicator, for example, you already have in mind what kind of things should go in. There is some literature behind it, there is some theory behind it, there is a lot of knowledge behind it. And this we cannot teach you, especially because we don't know what you want to construct your composite indicator for. So this is just to make clear that you should not construct your composite indicator in a vacuum. You should be aware that many people have already been working on the topic, you should know the literature, and you should have a concept. This concept, this theoretical framework, will be the basis for all the selections we have to make in the next steps. It is, for example, the basis for selecting the variables you will use. It is also the basis for the statistical method: how to weight each variable and how to aggregate the variables, whether, for example, you want to sum them up, multiply them, or take some transformation of them. All this should be considered by looking back and asking: what was the theoretical framework I am looking at, given that I am trying to construct a composite indicator in order to measure a certain phenomenon? The framework also determines certain subgroups of variables, subgroups of indicators.
You could also think of a completely different dimension: subgroups of countries. The second step is something we really have to talk about a little bit, maybe not in the next module, but when we present all the different composite indicators you will come to learn about in our MOOC. That step is data selection. When you select data, the first and most important thing is to select data along the theoretical framework from the first step. But you also have other things to address: the analytical soundness of the data, its measurability, and also the coverage, that is, whether you find this data for all of the countries you have in mind. It doesn't really make sense to set up a composite indicator that in the end you can only calculate for two countries because for all the others you don't get the data. And, but this is I think obvious, there is the relevance of the data you collect for your composite indicator. If you cannot really observe the things you are interested in, then you should always think of an alternative, and that is called proxy variables. Sometimes you don't need the exact information; you just need a variable that can serve as a proxy for the things you want to measure. So in the end you need to know the quality of the data, its strengths and weaknesses. Maybe the variables are very well measured, maybe they are badly measured; maybe they are easily available, maybe it is very hard to get them; maybe you get them just now but not in the future. In that last case you have a composite indicator for today, but you will never be able to use it again. You might even want to make a table with all the pros and cons of the different variables and then, in the first or second step, select variables along that table. The next thing is then the imputation of missing data.
The problem is that, even once you have decided on a certain set of variables, of individual indicators, you will nonetheless find that for some years, for some regions, for some countries, for some aspects, you don't have data. This should not be the common case, because then you really should reconsider the set of variables you have selected. But it definitely will happen. And then you have to impute some missing data, because it is not always advisable to throw away all the variables where you have some missing values, or to exclude countries just because two or three data points are not available in certain years. There exist many statistical methods for imputing missing data, but these are typically made for different problems. For example, if you say the data are just missing at random, then you can assume that the missing values follow the same distribution as all the others. In those cases we have very nice methods that maximize the probability of the sample you are considering and plug in, where you have blanks, data that maximize exactly this probability. This only works if the data are really missing at random. Very often you will find, and this is something I cannot teach you in general because it depends very much on the context, that data are missing exactly for especially poor countries, or especially rich areas, or whatever; they are not missing at random. If you have such a systematic bias, then you cannot use those methods and have to look for alternatives. But again, this depends on the case. You just have to be aware of it first, and not simply impute something and then go ahead. We will come back to this point a little bit later. The next step is multivariate analysis. What does it mean?
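As a short aside on the imputation point above, here is a minimal sketch of why it matters whether gaps are random or systematic. It is only an illustration under stated assumptions: the income data are simulated, the names `income` and `mean_after_imputation` are made up for this example, and plain mean imputation is used as a simple stand-in for the likelihood-based methods mentioned in the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "true" values for 1000 regions (simulated, not real data).
income = rng.lognormal(mean=9.0, sigma=1.0, size=1000)
true_mean = income.mean()

def mean_after_imputation(values, missing):
    """Fill the missing entries with the observed mean, then average everything."""
    filled = values.copy()
    filled[missing] = values[~missing].mean()
    return filled.mean()

# Case 1: about 30% of the values are missing, chosen purely at random.
random_gaps = rng.random(income.size) < 0.3

# Case 2: the poorest 30% are exactly the ones that do not report.
poor_gaps = income < np.quantile(income, 0.3)

# Relative error of the imputed mean compared with the true mean.
error_random = abs(mean_after_imputation(income, random_gaps) - true_mean) / true_mean
error_poor = abs(mean_after_imputation(income, poor_gaps) - true_mean) / true_mean

print(f"relative error, random gaps:     {error_random:.3f}")
print(f"relative error, systematic gaps: {error_poor:.3f}")
```

In this simulation the imputed mean stays close to the truth when the gaps are random and is clearly too high when the poorest regions are the ones missing, which is the kind of bias to check for before imputing anything.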
Multivariate analysis basically means that you have to think of everything in this world as correlated, as having some covariance, and this correlation can be strong or it can be weak. Maybe you don't think that there is a correlation, but then you look at the data and it turns out that yes, these variables always move in exactly the same direction. This basically means that if you include both in your composite indicator, you include the same information twice. And this you should take into account when you are thinking about weighting. So should I really include this information twice, or should I just take one of these variables and then think about a reasonable weight for it? Otherwise the weights should account for the covariance structure. The covariance structure should also somehow be related to the theoretical framework that we discussed in the first step. So you might start with cluster analysis or principal component analysis (PCA). The reason is that sometimes you have included many different variables, but it turns out that they all carry more or less the same information. And you would like to know whether the phenomenon you want to measure is really as multidimensional as you think. I'll just give you a very simple example. It's not really a composite indicator, but when we have student evaluations there is very often a discussion about whether we should put 3, 5, or 25 questions on the form, and we ask about all the different aspects and dimensions of the quality of teaching. However, very often, when you finally do a principal component analysis, it turns out that there is just one dimension: students like the class, or they don't like it. Whether you ask one question, whether they like it or not, or whether you ask them 20 questions about all the different aspects and dimensions that teaching comprises, in the end it doesn't make a difference.
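The student evaluation example can be sketched in a few lines. This is only an illustration under assumptions: the survey answers are simulated so that one latent "liked the class" factor drives all five questions, and the function name `explained_variance_ratios` is invented for this sketch. The PCA is done by hand on the correlation matrix with NumPy.

```python
import numpy as np

def explained_variance_ratios(data):
    """Share of total variance carried by each principal component.

    `data` has one row per respondent and one column per question.
    Working on the correlation matrix standardizes every column implicitly.
    """
    corr = np.corrcoef(data, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # reorder: largest first
    return eigenvalues / eigenvalues.sum()

# Hypothetical survey: 200 students, 5 questions, all driven mostly by
# one latent factor plus a bit of question-specific noise.
rng = np.random.default_rng(0)
liked_class = rng.normal(size=(200, 1))
answers = liked_class + 0.4 * rng.normal(size=(200, 5))

ratios = explained_variance_ratios(answers)
print(np.round(ratios, 2))
```

If the first ratio is close to one, as it should be for data built this way, the five questions are essentially measuring a single dimension.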
And this kind of PCA or cluster analysis helps you to reduce the dimensionality a little bit before you start to think about weighting. Another aspect is normalization. If you read the manuals before you perform a PCA, you will often find the warning that you cannot just compare all kinds of variables no matter on what scales they are measured. This is maybe obvious, but very often people forget about it: you have to bring the variables onto a comparable scale before you compare them or before you aggregate them. The next and almost final step is the weighting and aggregation. We already talked a little bit about it, because normalization already results in a kind of weighting. And if you take into account the covariance structure between the different variables and say, okay, I have included three variables that more or less tell me the same story, even if they have completely different names, then this already gives me an idea that I have to think over and over about the different weights I want to give to the variables when I include them in the composite indicator. It is a question of importance, and it is a subjective decision: how important is this or that variable for me in my composite indicator? Even if people say, I'm using a statistical method for it, there are many statistical methods for calculating weights, and the choice of statistical method is a subjective choice at the end of the day. The aggregation is even a little bit more complicated, but all these things you will learn next week. It is more complex because here you have to think about, for example, should I add all the numbers or should I multiply the numbers? And what is the difference? Well, the difference is quite essential, because if you just add the numbers, that means that they are exchangeable.
So for example, if you think about the well-being of a person, you could say, well, he doesn't have enough food, but maybe I can just give him more education. If you just sum up the variables education and food, then you could still have an indicator that goes up, up, up, up while he is starving. This is an essential question when aggregating the different variables. Then finally, once you have built the composite indicator, you should do what we call a sensitivity analysis or robustness check. On the way to constructing a composite indicator you made several subjective decisions, even if you only wanted to impute some missing data. At each step where you made such a decision, you thought about different alternatives. What you can do now is simply try those alternatives and see whether you get completely different results. For example, take the composite indicator built with the imputation methods and variable selections you decided on, and an alternative one that a colleague would have voted for, and see whether they give a country more or less the same ranking or a completely different one. This is a sensitivity analysis. It just tells you a little bit whether the composite indicator moves a lot with your decisions or not. And it is not just about the individual decisions and the alternatives you debated; the result also depends on which methods you have chosen. It could also be that, after all the effort you have put in, the composite indicator is in the end driven by just one variable, and this you would like to see. So you also have to simulate some data, some new distributions for the data you included in the composite indicator, and see which variables are the drivers of the composite indicator and whether there are some variables that don't actually matter: I included them, but in the end they don't move the composite indicator.
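The food and education example, together with a very small robustness check, might be sketched as follows. Everything here is an assumption made for illustration: the four units and their scores are invented, the scores are taken to be already normalized to a 0 to 1 scale, the weights are equal, and the function names are not from any particular library. The only decision varied is the aggregation rule.

```python
import numpy as np

def additive(scores, weights):
    """Weighted arithmetic mean: a low score can be fully offset by a high one."""
    return scores @ weights

def geometric(scores, weights):
    """Weighted geometric mean: a score near zero pulls the whole index down."""
    return np.prod(scores ** weights, axis=1)

def ranking(index):
    """Rank of each unit, where 1 is the highest index value."""
    order = np.argsort(-index)
    ranks = np.empty(len(index), dtype=int)
    ranks[order] = np.arange(1, len(index) + 1)
    return ranks

# Made-up normalized scores; columns are (food, education).
units = ["A", "B", "C", "D"]
scores = np.array([
    [0.55, 0.55],  # A: balanced
    [0.60, 0.40],  # B
    [0.15, 1.00],  # C: close to starving, but highly educated
    [0.45, 0.50],  # D
])
weights = np.array([0.5, 0.5])  # equal weights, itself a subjective choice

rank_additive = ranking(additive(scores, weights))
rank_geometric = ranking(geometric(scores, weights))

for name, r_add, r_geo in zip(units, rank_additive, rank_geometric):
    print(f"{name}: additive rank {r_add}, geometric rank {r_geo}")
```

With these invented numbers, unit C comes first under the additive index and last under the geometric one, so the ranking is clearly sensitive to the aggregation choice. The same comparison can be repeated for other decisions, such as the weights or the imputation method.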
And with all these things and tools at hand, you should be able to construct, or at least to understand and get an intuition for, the reasonable construction of composite indicators. [MUSIC]