You may remember that a while ago I said one of our goals is to figure out where the data are, where they come from, and where they're going. So, here we're going to be discussing data exchange, how data get from one place to another. Now, it's one thing to simply send data across, but there's a subtlety. We're at this level of the stack, and we're trying to exchange data at this lower level. Now, why is there a problem? Why can't I simply send data from one place to another? Here we get into the naming of things. You may remember that the name field is metadata, information about data, and if I get the name wrong, I could get the data wrong.

So, here's a simple example, which is not exactly data, but I think it illustrates the problem. I think you all know that anemia is a low blood count. If you have a low blood count, number one, you can feel tired, or number two, you can go into heart failure, or it could be a sign of leukemia, or any number of things. If any of you have anemia, I don't mean to be diagnosing your anemia. If I go to a famous search engine and type A-N-E-M-I-A, my five top hits are going to be what you see on the left-hand side, Mayo Clinic et cetera. If I stay at the same location, on the same computer, with the same browser and the same search engine, and type A-N-A-E-M-I-A, I get a totally different set of hits. You and I know that anemia is anaemia. So, why doesn't the computer know that anemia is anaemia? The point is that it's a stupid computer. You and I both understand that these are two ways of spelling the exact same concept, but the machine does not, unless I tell it so. For fun, you can run the word through a translating program and look at different languages. In most languages, anemia sounds the same but is spelled with different letters; there are a couple of languages where the name is not anemia at all, but something that means lowered blood in the native language.

If we move to real data, like clinical data, we have the same problem with laboratory data. So, let's talk about sodium levels, say a sodium level of 135. Well, there are many ways to communicate sodium. I could use the letters S-O-D-I-U-M. I could use Na for natrium, which you see in the periodic table. I could say Na+, because the sodium in your body is not really elemental sodium, it's ionic sodium. I could use SOD, or even sodium chloride, NaCl. That's a lot of names to stand for one number, let alone the names in non-English languages. So, wouldn't it be nice if there were a language, a vocabulary, that gives us a common name for sodium? That's this thing called LOINC: Logical Observation Identifiers Names and Codes. Now, very often you'll hear us talk about LOINC codes. It sounds redundant, but that's what it is. On the left-hand side you can see a list of LOINC codes for sodium, and you can see that we have at least five ways of saying sodium. I'm not going to go into the subtle differences, but they are all a bit different. Remember we said one piece of metadata is the method of how something is measured, and here's a classic example: did you measure moles per volume or mass per volume? I'm not going to get into the differences, but you also see maximum during the study, corrected for glucose, and post-dialysis. Well, when a researcher comes to the technologist and says, "Please give me all patients with a sodium of 135," what should the technologist do?
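To make that concrete, here is a small Python sketch of what a naive keyword search looks like. The codes and long names below are placeholders made up for illustration, not real LOINC entries; the point is only that a search for the word "sodium" returns several distinct tests.

```python
# Illustrative, made-up codes and names -- not real LOINC entries.
ILLUSTRATIVE_CODES = {
    "LN-0001": "Sodium [Moles/volume] in Serum or Plasma",
    "LN-0002": "Sodium [Mass/volume] in Serum or Plasma",
    "LN-0003": "Sodium [Moles/volume] in Blood --post dialysis",
    "LN-0004": "Sodium [Moles/volume] corrected for glucose",
    "LN-0005": "Sodium [Moles/volume] --maximum during study",
}

def keyword_search(term):
    """Return every code whose long name mentions the search term."""
    term = term.lower()
    return [code for code, name in ILLUSTRATIVE_CODES.items() if term in name.lower()]

# The naive "give me all the sodiums" query returns five different tests.
print(keyword_search("sodium"))
# ['LN-0001', 'LN-0002', 'LN-0003', 'LN-0004', 'LN-0005']
```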
Should they go into the system and look for all LOINC codes that have the word sodium in them, or should they talk with the researcher and say, "Now, which sodium do you really mean?" By the way, it should be the second conversation. In fact, 80 percent of the task of the technologist in a data warehouse or in a research environment is working with the researcher to say, "Now, which one do you really want?" We'll say more about this in a few minutes, but sometimes they want the post-dialysis value, sometimes they don't. Sometimes they're thrilled to know that there's a maximum, and sometimes they'll be sad to find out that it's not always calculated.

So, this notion of a vocabulary for communicating across institutions becomes really important as you go to other domains. Diagnoses are an important domain, because that's how you get paid. As a physician, I'm very invested in making sure that you, as the payer, know what I think the patient has, because I want to be paid for the correct diagnosis. The vocabulary today is called ICD-10 Clinical Modification. There are only 68,000 codes, thank you very much. I'm sure you'll be thrilled to know that ICD-11 is done; when it will go into use I cannot say, because that is also a complicated thing. But wait, there's more, because ICD-10 also has procedure coding. That's another 76,000 codes for procedures. If you're getting paid for procedures that you've done, I think you'd want to know the vocabulary for expressing what you did to the patient, so that you get paid. But there's not just one coding system for procedures; there are at least two. Here you see that HCPCS gives you another level of procedure coding, with about 8,000 possibilities, and HCPCS Level II enables you to talk about things like the supplies you use in a hospital, which is very helpful for communicating with the supply chain people who provide the materials you use in doing those procedures. So, the vocabularies have different purposes and different cardinalities, different numbers of codes in each, and it's quite a morass to deal with.

I want to introduce a concept that may or may not help you make sense of this morass, of what the concepts are about and what's inside these coding systems. This is the distinction between intension and extension. The intension is what's in your head; the extension is what's in the computer. And when I say what the computer knows, I don't mean that in terms of knowledge, I really mean what's inside the computer. So, suppose in your head you're thinking of odd numbers. Great, you know what you mean. It could be that in your computer, the list of odd numbers is 1, 3, 5, 7, 13, and 19. Now, you might say, "Wait a minute, you're missing 9, 11, 15, let alone everything after 19," and the computer responds, closed world assumption: I only know about 1, 3, 5, 7, 13, and 19; those are all the odd numbers as far as I'm concerned. The reason this is so important is that when you're dealing with technology folks, and even when you're dealing with other clinicians and researchers, there's a lot of focus on the data. Everybody wants to know what the data are and what the data say, and if you don't realize that what you're asking for is really an intension, and that there may be a gap between that intension and the data in front of you, you can run into trouble.
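Since the odd-numbers example is easy to play with, here is a minimal Python sketch of that gap under the closed world assumption. The stored set is exactly the one from the lecture; everything else is illustrative.

```python
# The intension: the rule in your head for what counts as an odd number.
def is_odd(n):
    return n % 2 == 1

# The extension: what is actually stored in the computer. Under the closed
# world assumption, these are ALL the odd numbers the system knows about.
STORED_ODD_NUMBERS = {1, 3, 5, 7, 13, 19}

# Your intension says 9 and 11 are odd; the computer's extension disagrees,
# simply because nobody ever told it about them.
for n in (7, 9, 11):
    print(n, "intension:", is_odd(n), "| extension:", n in STORED_ODD_NUMBERS)
# 7 intension: True | extension: True
# 9 intension: True | extension: False
# 11 intension: True | extension: False
```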
So, if I, as a researcher, am concerned about heart failure, there are many codes in the ICD-10 universe that deal with heart failure. Which ones do I mean? If I'm computing clinical quality indicators, for which my hospital gets reimbursed, the code set for that notion of heart failure will be different from the code set for the researcher's notion of heart failure. So, the code set you develop for a purpose depends on the intension for which the code set will be used, and the computer does not know that intension; only the human beings know it.

Just one last piece of cleaning up I feel I need to do: we talked a lot about unstructured data before, and I suggested that you can turn unstructured data into structured data. The algorithms for this are either text processing or natural language processing. Text processing is for finding things that you already know about. In the case of a note from the audiologists where they're talking about the type of battery in a hearing aid, that's a pretty limited set of things I'm going to be looking for; I could even be looking for an ICD-10 code. In that sense, I can extract data out of the note. However, sometimes the notes are more complicated, like "the patient does not have heart failure," or "the patient denies drinking alcohol." Well, now I have to understand the language, and text processing will not catch that negation. I need another way of finding the negation and therefore communicating that the patient does not drink alcohol, or however you want to interpret it. There's a lot to say about natural language processing; it covers a whole range of methods. Remember, 80 percent of the data about a patient is in those notes, so NLP is really, really important, but the algorithms are beyond the scope of this course. I do want to let you know that there are pipelines for how to chop up the text, how to identify and tag the pieces, and then how to process them to get the answer that you want. And it's not one pipeline; as with intension and extension, the pipeline you need depends on the purpose for which you're doing the NLP. So, I just want to leave you with that little bit of information about NLP and text processing.

The final thing to say about transferring all these data back and forth is data quality. Recall our metadata about a piece of data: the metadata have to be exchanged along with the value. So, it's not just the name that we've been focusing on, the parameter name, which is what, let's say, LOINC is about; it's the other components as well. They can be inaccurate, just as the parameter name can be inaccurate, just as the value can be inaccurate. There's a lot of work being done now on the informatics of data quality. I'm not going to go through it all by any means, but I do want to leave you with the notion that there are different dimensions of data quality. There's the notion of conformance: do the data match the format that's expected of them? Does the format of the value match the format expected for that parameter? There's the notion of completeness: do I have all the values of that parameter for that patient? It's nice to know that I have one sodium value of 135, but do I have all the sodiums? When you get to the other metadata, are the data consistent over time? For that I need to know the time, and if your height is 52 inches, 52 inches, 52 inches, 62 inches, and then back to 52 inches, that 62 is probably a mistake compared to your other values over time.
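Here is a minimal Python sketch of that kind of consistency-over-time check. The patient, the dates, and the 4-inch change threshold are all assumptions made up for illustration, not clinical standards.

```python
from datetime import date

# Heights for one patient over time, in inches (the example from above).
heights = [
    (date(2015, 1, 10), 52),
    (date(2016, 1, 12), 52),
    (date(2017, 1, 15), 52),
    (date(2018, 1, 20), 62),   # the suspicious jump
    (date(2019, 1, 18), 52),
]

# Illustrative threshold for an implausible visit-to-visit change in height.
MAX_PLAUSIBLE_CHANGE_IN = 4

def flag_inconsistent(series, threshold=MAX_PLAUSIBLE_CHANGE_IN):
    """Flag lone values that disagree with both of their neighbors over time."""
    flagged = []
    for i in range(1, len(series) - 1):
        _, prev = series[i - 1]
        when, value = series[i]
        _, nxt = series[i + 1]
        # A single spike that jumps away from both neighbors is probably an error.
        if abs(value - prev) > threshold and abs(value - nxt) > threshold:
            flagged.append((when, value))
    return flagged

print(flag_inconsistent(heights))
# [(datetime.date(2018, 1, 20), 62)]
```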
After all, we know that height does not change that dramatically at any point in time. It can shrink dramatically if you get a bilateral amputation, so the mere fact that it changes does not prove a mistake, and it can go up slowly in the case of growth, as we saw with the growth charts. But give me a break. Then there's comparison with other data: if you have hypertension in one source and in other places you do not have hypertension, one of those two is mistaken. Plausibility: we mentioned that a temperature of 44 degrees Celsius, about 111 degrees Fahrenheit, is just not plausible. It's not just out of bounds, it's just nuts, and how to figure out plausibility beyond simple bounds is not easy. We talked about validating; the classic example is that most males are not pregnant. It used to be impossible; now it is possible, given that gender is a bit more fluid than we thought, but it's still atypical, as is a female with a prostate cancer diagnosis. Verification is comparing to outside data, like I suggested, if the diagnosis is in another source that you can check against. So, this just scratches the surface. But maintaining data quality, identifying issues with data quality, and getting to the root causes of data quality problems is a whole other ball of wax. So, in conclusion, data exchange is not just a pipeline from one organization to another. You have to pay attention to the names of things, you need to pay attention to whether you're dealing with intension or extension, and you need to pay attention to data quality.