Hello. Welcome to Section 3: Introduction to Record Linkage. I should say upfront, many thanks to Stefan Bender, who is currently at the Bundesbank in Germany; Julia Lane, who is a Provost Fellow and professor at the New York University Wagner School; Manfred Antoni, a colleague of mine at the Institute for Employment Research in Germany; and Peter Christen, who is with the Australian National University. All of these and many others have contributed to my knowledge of record linkage, and I should not forget to mention my thesis adviser, Rainer Schnell. I will cite and use material that all of them have developed in their own research and their own work. I will indicate when I do so, but if I fail to mention it on a particular slide, you should know that the published contributions of these people are something you should look up, and that's where we collected all this material from.

So let me get started with this topic. After a brief motivation, I will give you a very gentle introduction to what it is that we call record linkage. In the third lecture on this topic, I will talk about key linkage techniques. And then we'll touch on a couple of advanced record linkage techniques. That will mostly be an overview, with no deep technical handling of these topics. And in the last module in this course, we will cover the ethical issues that have to do with linkage consent, but also with privacy and confidentiality in these linkage techniques.

So let's start with the motivation. Without a doubt, there are more and more data collected everywhere. We see this in the public sector and in the private sector. We see this from specific researchers or individuals. And you know that from your own life. So, you know, when you go to a doctor's office, it's no longer handwritten cards; everything is captured electronically. Many of our financial records are kept and processed electronically, leaving traces of data. We all have at least a few loyalty cards and credit cards.
They produce transaction data every day when they're used. Scanners, which are used to process food purchases in stores, produce data. So all of these are kept and create individual files that presumably can be merged together to create a full picture. We've seen this for a little longer with data from tax records or social security records, and I'll touch upon those. But what is really new in this space are all the social media traces we leave: text messages, blogs. Now, it's a little scary to think of a world where all of this would be taken together and used. But, of course, from a researcher's point of view, there is great hope that we can create new insights out of these data, and maybe even at a lower cost.

And actually, there are a couple of fantastic examples out there where this has happened. I want to introduce a few here. On the course website, there's more information on them as well as links to the original sources. So, Julia Lane, I mentioned her earlier. She was one of the main engines in the Longitudinal Employer-Household Dynamics project that is part of the Center for Economic Studies at the U.S. Census Bureau. It's a pretty cool project because here the states in the United States share Unemployment Insurance earnings data and the Quarterly Census of Employment and Wages with the U.S. Census Bureau, which itself adds administrative and survey data to this database. The mission of this endeavor is to be able to create dynamic information on workers, employers, and jobs, and to remove any additional data collection burden. As I mentioned, you can find links and background information on your course page. But just imagine if you were to ask all of these people, with the help of survey questions, about their employment history: what they earn, when they started, when they stopped, you know, whether these were earnings that created Social Security benefits, and the like. That would be a huge burden.
And of course, analyzing the data that's already there seems like a natural thing, but it is less common than one might think. In fact, a lot of countries are struggling right now with how to unlock data that are siloed, in the U.S. for example, in the various federal states or at local agencies. As I just mentioned, many countries, including the U.S., struggle with this question. The U.S. Committee on National Statistics, which is part of the National Academies, just wrote a report, you know, coming out of a panel that I was part of, really trying to figure out how administrative and survey data can be combined. Not on the technical level per se, but the whole notion of, you know, privacy issues and access: who should be able to do this, when can this be done, how can one in general make more use of administrative data, who should be allowed to access those data, and the like. So, this is a big push and a discussion everywhere. The sources for this discussion are also linked on the web page, and I encourage you to read that report. It gives you a good overview of the current state here and what's in discussion.

The hope is that with this endeavor, the quality of existing data sources gets improved, or that the burden of collecting these data from respondents by other means, for example surveys, is reduced. And as I said before, the hope is that new research questions can be answered.

Now, the challenge with these endeavors is that the data come from different sources. They come in a variety of different formats. Their records are of different quality. There is often no unique identifier across all these data, so it's not clear who really belongs together and how we can figure that out. Then there are the privacy regulations, which vary across the different data sources. And there is the issue of size. When the data are very large, it can be quite costly to find the proper matches.

So, with this motivational introduction, I want to point you to two books.
One is from Peter Christen, and this presentation relies heavily on material from that book. It's called "Data Matching," published by Springer. The other is another Springer book, which has been around for a little longer, from Thomas Herzog, Fritz Scheuren, and Bill Winkler: "Data Quality and Record Linkage Techniques." So, between those two, you are really covered and can master the topic well. They're both, of course, published material, and we can't put them up for free on the Coursera website, but I nevertheless want to encourage you to see whether your local library has them, or to purchase them. We have additional information and shorter articles on the web. Hopefully, taken together, this will be enough background material for you.