We start off with the many types of clinical data, and you'll see that we're not talking about just clinical data, but it's a nice word to use. This table is based on a more complicated graphic from Isaac Kohane at Boston Children's Hospital. The URL is on the slide and available online. You're encouraged to look at it. It's a bit scary and barely comprehensible, and so I created this table to make it a little bit easier to absorb. What you see on the left-hand side is that there are many types of data that we care about when we're talking about health. When we talk about finances, the types of data are generally dollars, maybe names of people, names of institutions, names of instruments, but the data is pretty simple. Here, you see, we have much more complicated data. You can see that the columns of the table fall into two chunks: structured and unstructured data, and I'll explain that more as we go along. I'll point out that the middle of the table, the greenish shaded cells, are health data that come from the health system itself. So these are traditional data that are found in an EHR, although there are some new data like SNPs for genomes and such. We have one box for claims data, although there's a lot more in claims than just that, but it gives you a flavor. There are data in claims that you can't get from an electronic health record, like prescriptions filled at another institution. Finally, you see that in the purplish cells there is a lot of data about patients that is outside of the health system. So, it used to be that when we talked about health data, we focused on just those green columns. But as we talk about types of data, we're going to go across the whole table.
So, we start off with this notion of types. What do we mean by structured and unstructured data? Here is a form, actually a paper form, from occupational health.
When somebody visits a doctor for occupational reasons, the agency wants you to get these things filled out, and the first thing it wants is your name. Your name could be anything: it can have middle initials or not; it can have Junior, Senior, the Third. It can have many different spellings; it can be hyphenated in the middle. So, that's free text. It's unstructured. It can be whatever you want it to be. The address field is basically the same: there is some kind of structure, but it's so varied that you might as well call it free text. But then you get fields like age, which is obviously numerical. It's a number. It's really nice. We call number fields atomic. You'll see why in a few minutes. Next, we see the phone field, which is a bit more structured. There aren't as many ways of structuring phone numbers as there are names. We get to the check boxes at the bottom, where different health items are checked yes, no, or left blank. I'm calling those atomic because there are really only three values: yes, no, or blank. Like with numbers, where you can say, "Okay, does the age equal this, or is it greater than this, or is it less than this?", a computer can go in and say, "Is the column for head or spinal injuries yes, no, or blank?" So, to point this out, here's how you might see these data in a database: there's a column called head or spinal injury, the next one over might be seizures, and then the cell is "yes," or "Y," or "1," and "N." Why did I just say it could be "Y" or "yes" or "1"? Because the actual vocabulary, if you will, can be system specific. But whatever system you're in, there are only three choices. The reason we call it atomic is that you can't break it down any further. So, a number: if your age is 35, you can't break that down any further. If I was in a name field, I could look for the first thing, I could look for the last thing.
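To make the point about atomic fields concrete, here is a minimal Python sketch. The column names, the sample rows, and the particular spellings accepted for yes and no are all assumptions invented for this illustration; a real system would have its own vocabulary.

```python
from typing import Optional

# Hypothetical system-specific spellings of the checkbox values.
YES_TOKENS = {"y", "yes", "1"}
NO_TOKENS = {"n", "no", "0"}


def normalize_checkbox(raw: Optional[str]) -> Optional[bool]:
    """Map a system-specific cell to True (yes), False (no), or None (blank)."""
    if raw is None or raw.strip() == "":
        return None
    token = raw.strip().lower()
    if token in YES_TOKENS:
        return True
    if token in NO_TOKENS:
        return False
    raise ValueError(f"unexpected checkbox value: {raw!r}")


# Made-up rows standing in for an occupational health form.
rows = [
    {"age": 35, "head_or_spinal_injury": "Y", "seizures": ""},
    {"age": 52, "head_or_spinal_injury": "no", "seizures": "1"},
    {"age": 41, "head_or_spinal_injury": None, "seizures": "N"},
]

# Atomic values support direct comparisons: equal to, greater than, less than.
over_40 = [r["age"] for r in rows if r["age"] > 40]

# Whatever the spelling, each checkbox collapses to one of three values.
injury_flags = [normalize_checkbox(r["head_or_spinal_injury"]) for r in rows]
```

The design choice worth noticing is that the three-valued result keeps blank distinct from no, since a box left unchecked is not the same as a recorded no.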
So, a name I could break down further; atomic is where you cannot. So, when you look at health data, if you look at the top, we have the medical record number, which is an identity or an ID. Further down, we have the visit type. The medical record number has a format, but it's not necessarily a number. It's not necessarily letters. It can be a combination, it can have hyphens, whatever it is, but there is a structure to it that is the same within an institution. In terms of visits or encounters, it could be the visit type. There is a finite number of possibilities. Ideally, there is a vocabulary for what those visit types are, and therefore there's a finite list; it's atomic structured data. We have lab results, and those could be atomic, like the sodium is 135, or they can be more complicated. For your diagnosis, I could let you write it out as unstructured free text, or I could use a vocabulary like ICD-10, which limits you to only 68,000 choices. In genomics, there are these things called SNPs, which are variations in your genome. If I as a clinician know what your SNPs are, it might help me today figure out what drug to use or not use. In the future, it might help me know what risks you have for certain diseases. At the bottom, we have medications, doses, and routes. Again, I could let you write them out as free text, or I can limit you to a structured vocabulary.
The most important structured data is the ID. Anything in the database, whether it's in your electronic health record or an app or anywhere, should have an ID, because that's how the database knows what it's talking about. There should be one ID for any entity, and there should be one entity for any ID. There can be IDs for patients, for visits, for specimens, for the facility. The entire hospital can have an ID, and so can a concept. So, when I mentioned there are 68,000 codes for ICD-10, in many ways each one of those codes is an ID. It's an ID for a concept. So, Social Security number is not an ID.
First of all, it's illegal to use it as an ID outside the Social Security system. Number two, though, is that very often people will share their Social Security number with somebody else so that they can get services. That too is illegal, but if somebody feels their life is in jeopardy and they may be here illegally, they might be motivated to use a Social Security number for that purpose. Now, the truth of the matter is they can use a medical record number in the same way. In fact, any medical institution has people who are using a medical record that is not theirs. That is dangerous to the patient, because if I come in Monday and there are recordings about me, and then you come in as me the next day, there are now data about you that they think are about me. When I come back later after you, I'm now being treated as if I'm you, and that may be a bad thing. So, IDs can kill, and using IDs improperly can kill.
This graph is from a classic story about this from the mid-2000s, where a children's hospital brought a computerized provider order entry (CPOE) system into the ICU to make ordering better: easier, faster, fewer errors, higher patient safety. So the X-axis is time, the Y-axis is mortality, and 100 percent is the expected mortality. You can see on the left-hand side that mortality is low, below what's expected, so that's great. This ICU is doing a really good job. Mortality is 40 to 60 percent of what's expected. Not great that people are dying, but unfortunately, people who are in ICUs are at risk of death, which is why they're in the ICU. But then the CPOE is implemented. Look what happens to the mortality rates: they go way above 100 percent. Not good. The story here is that the database now demands that the ID exist before orders are accepted for patients, which makes a lot of good sense, right? If you're running a hospital, you don't want to be giving medications to people who don't exist and are not in your database.
The problem is that you're in an ICU and you get helicopters medevacing children in from motor vehicle accidents or other situations. The EMTs are calling ahead from the helicopter saying, "Please get a ventilator ready for this child. This child is not breathing on their own." The database says, "Sorry, you're not in the database, I'm not going to prepare the ventilator." So, the kid lands, they have to waste time getting the child registered, and in that time, the child gets worse. So, it's a great example of how not paying attention to those IDs can actually kill a patient. This problem is called the closed world assumption, where the database assumes that what it knows is the entire world, and if it's not in the database, it doesn't exist. Which is fine for employment systems; it's fine for banks. It is not fine for electronic health records.
So, that's all on the structured data side of things. On the unstructured side of things, we get a lot of information. We already mentioned the address. Then there are the clinic notes, and we tend to say that 80 percent of the data on a patient is in the clinic notes. So even though it's easier to deal with structured data, because you can ask whether it's equal to, less than, or greater than, the clinic notes are where the data are. Then you get to other issues like social history. Again, these are very often written or typed. There is no checkbox for occupation. There is a vocabulary for occupation that the Department of Labor uses, but we don't use it in clinical notes, so we write whatever we think: somebody works at a factory. I don't know what type of job that could be. I'll leave it to the next person to figure it out. Chief complaint: while we try to structure chief complaints, and we try to get a vocabulary of chief complaints, properly we should record what the patient says. Reports from radiology or from laboratories are like clinical notes. Again, there's a lot of text, and the data you want out of it have to be extracted.
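As a rough sketch of what extracting a value out of report text can look like, here is a small Python example that pulls a sodium result out of free text with a regular expression. The report wording and the pattern are assumptions made up for illustration; real reports vary far more, and this is not meant as a robust extraction method.

```python
import re
from typing import Optional

# Hypothetical free-text lab report; the wording is invented for this example.
report = (
    "Basic metabolic panel: sodium 135 mmol/L, potassium 4.1 mmol/L. "
    "No acute findings."
)

# Assumed pattern: the word "sodium", a few non-digit characters, then a number.
SODIUM_PATTERN = re.compile(r"\bsodium\D{0,10}?(\d+(?:\.\d+)?)", re.IGNORECASE)


def extract_sodium(text: str) -> Optional[float]:
    """Return the first sodium value found in the text, or None if absent."""
    match = SODIUM_PATTERN.search(text)
    return float(match.group(1)) if match else None


sodium = extract_sodium(report)
```

Once the number is out of the text, it behaves like any other atomic value: it can be compared, stored in a column, and queried.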
Images are non-text, non-numerical data of a totally different nature, and to our eyes they may be structured, because we see the chest X-ray and we know what to expect in a chest X-ray. But to a computer, it's just a bunch of pixels. Finally, I have listed allergies. You'd think, "Well, allergies should just be like medications," but in fact, first of all, there are food allergies; second of all, there's the type of reaction that you have, what the allergic reaction is that you're said to have, and who said that you had the allergy. So that allergy field can be arbitrarily complicated.
So here's an example of a note. The format is structured. Sure, there's a heading, there are sections to it, but the text itself can be anything that you want. Now, there may be atomic data inside the note. So here, there is a battery, the za312. I suspect there is a finite number of batteries that would go into a hearing aid, so that datum could be structured data, but you have to dig it out of this unstructured note. There are boilerplates. There are templates that are used these days for many notes. So we get the weird situation where the machine uses a very structured way of getting the data into the system, but then the structure is not available to any researcher or anybody else trying to extract the data out of that note. They have to use fancy methods to get it out.
So then I mentioned all the other types of data. Let me go through all of this in detail. I want to point out that much of this data is patient-specific, but there are things like census data. What census tract you live in is important because your health may be affected by your neighborhood's health. So what I know about your census tract and your neighborhood can have an impact on you. It's not data about you, but it still affects you. I have the grocery data there. If I were a physician and I knew your purchasing habits, that could tell me a lot about whether or not you're following your diabetic diet.
Now, you know the grocery store where you punch in your phone number to get a dollar or two off your purchase: they're getting a lot of information about you. I think you're hoping that they don't send it to the doctor, because you have some notion of privacy, which goes back to a concept we had a few lectures ago. Where this is going to go in the future, I can't say. I have an entry there about apps. Clearly, anything I said about the electronic health record applies to apps. The app collects structured, atomic data on you: how long did you run, where did you run; if it's measuring your blood glucose, you have a blood glucose value. If you put in notes, then it's collecting unstructured data. What it does with all this data, that's a separate issue.
On the right-hand side of things, there is unstructured data as well. Police records: pretty unstructured, very relevant. But, going back to this notion of privacy, do you want everybody to know your police record? As we get to social media like Facebook postings, that's a whole new area of data. Does that tell me about you, the individual patient? Does it tell me about what's going on in the world in terms of, let's say, flu outbreaks? If I'm the hospital, I can get sentiment analysis about what people think of me as a hospital. So there's a lot of information in the data. At the bottom, finally, things like herbal medications can be totally unstructured, because who knows what's available out there. So when you think about population health, and I hinted at this with the census tract information, there's a lot of data that's not health data that is relevant to a patient. Some of it comes from the patient themselves. Some of it comes from around the patient. So environmental data is the classic information that's around the patient.
So if I measure the lead level in your house, it's not about you, it's about your house. Or if I measure it in the soil or in the air, clearly you're exposed to that, but that's not your personal exposure. So there's a lot of information here. Some of it is you and some of it is not you. We'll make more of that distinction when we start talking about information, to which we now turn our attention.