Welcome to Population Health, Population Health IT and Data Systems. This is Lecture g. This component, Population Health, discusses the application of informatics and informatics methods in population health management. This unit, Population Health IT and Data Systems, explains the challenges and opportunities of using different data types, data sources, and data systems for population health IT. The objectives for this lecture are to examine how data quality affects population health analytics, and to analyze data access, privacy, and interoperability issues that may hinder population health management. This lecture discusses the data management challenges of population health data systems and analytics. There are certain challenges with data management in population health research and analysis. These challenges include issues with data quality, problems with data linkage and integration, challenges with data access and privacy limitations, and the complexity of population health system architecture and design. Data quality issues stem from the fact that population health data are often extracted from real-world databases, which usually require extensive cleaning and preparation. Data quality can be related to the accuracy of the measures, the completeness of the data, or the timeliness of updates. Data linkage and interoperability issues are also of special concern to population health analysts, as most population-wide data sets are a combination of other data sets and thus require some level of integration. Matching patient records from one data source to another is also particularly challenging, and thus developing a master patient index, MPI, is a critical step in merging the various data sources. Data access and privacy issues are a major challenge, because a population-wide consenting process is impractical. Protected health information, PHI, is usually needed to better merge data sets and improve population health predictive models.
However, various privacy laws, such as the Health Insurance Portability and Accountability Act, HIPAA, limit the sharing of protected health information. Finally, there are multiple system architectures for large population health databases, but choosing among them based on their advantages and disadvantages is often challenging. Data quality is critical for population health analysis. The quality of data can affect both the models that are developed based on large population health data sets and the application of those models to target population health data sets. As depicted in this diagram, the accuracy and completeness of data can be affected by numerous factors throughout the data collection, integration, and analysis phases. For example, the actual weight of a patient might be measured inaccurately, typed mistakenly, processed with error, and even aggregated differently, thus producing different population health models at the end. The first column of the diagram shows that the actual weight of a patient is 200.6 pounds. The measurement tool might have a validity issue that causes it to consistently record the original weight as 198.9. The measurement tool could also have a reliability issue and record the weight differently each time, such as 198.9, 200.6, and 202.2 pounds. Another possibility is that, for some reason, on certain dates the weight is not measured at all though it should have been. The second column of the diagram shows errors that can happen when a user enters the data. For example, there might be a simple typo, entering the weight as 20.06 or 2006 pounds, in which the period is simply misplaced or omitted. Sometimes the user may choose the wrong unit. In this example, the user may choose kilograms instead of pounds. Note that 200.6 is a plausible weight in both kilograms and pounds, so it would not be considered out of range. Another problem is when users make assumptions and modify the number, such as truncating 200.6 to 200 pounds.
Other issues include entering zero for missing values or adding free text to a numerical field. In this example, the actual weight of 200.6 pounds for a single patient is entered as 200 pounds in the electronic health record, EHR, system. The third column shows the effect of data automation, conversions, and mergers on the accuracy of data. These automated database functions may truncate or round numbers, mistakenly convert them to empty or zero, and convert the units to match the receiving end. And of course, sometimes custom natural language processing tools may convert a text field to a wrong numerical value. In this example, the recorded weight of 200 pounds is mistakenly converted to 200 kilograms, as the underlying population health architecture uses kilograms as the unit. The last column of the diagram shows that sometimes analysts may mistakenly convert the value to something else. For example, the analyst may take the mean weight as the actual weight of a patient for a period if the patient has more than one weight recorded in that period. However, numeric errors that have been propagated from the previous stages may affect the data. For example, if one weight is recorded as 200 and another one as zero, which actually meant not recorded in the first place, the aggregated mean weight for that patient will be 100, which is considerably inaccurate. Selecting one of the measures in a time series or removing outliers may also affect the overall measure for a given patient. Now, consider that most population health analysts work with millions of patient records, and cleaning, preparing, and checking the data may take a long time before the actual analysis is conducted. Data quality can be defined from various perspectives. The most important aspects of data quality for population health analysts are accuracy, completeness, and timeliness of the data.
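To make the aggregation example above concrete, here is a minimal sketch in Python. It is not part of the lecture material. The helper name mean_weight and the zero_means_missing flag are hypothetical, and the sketch assumes that a zero or empty weight stands for "not recorded" rather than a real measurement.

```python
from statistics import mean
from typing import Optional, Sequence


def mean_weight(
    readings: Sequence[Optional[float]],
    zero_means_missing: bool = True,
) -> Optional[float]:
    """Average a patient's weight readings for one period.

    Hypothetical helper for illustration only. None always counts as
    missing. When zero_means_missing is True, a reading of 0 is also
    treated as "not recorded" and excluded from the average.
    """
    usable = [
        r for r in readings
        if r is not None and not (zero_means_missing and r == 0)
    ]
    if not usable:
        return None  # no usable measurement in this period
    return mean(usable)


# Naive aggregation: the zero placeholder is averaged as if it were real.
naive = mean_weight([200, 0], zero_means_missing=False)

# Treating zero as missing leaves only the recorded value of 200.
cleaned = mean_weight([200, 0])
```

With the two readings from the example, the naive call gives 100 and the cleaned call gives 200. Whether a zero really means "missing" depends on the data source, so this rule is an assumption that should be checked against each source.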
Accuracy, also known as precision of data, refers to the extent to which the data captured through the health IT system accurately reflect the state of interest. Accuracy is often complex to measure because the true value of a given variable remains inaccessible. Completeness of data is the level of missing data for a data element in the health IT system. Completeness is the most common measure of data quality in population health data systems. Timeliness of data refers to the length of time between the capture of the actual value and its availability in the health IT system. Types of information flow and a series of timestamps are required to measure timeliness. This diagram shows the change in data accuracy due to various reasons. On the left side, the actual table includes the true values of weight for a list of patients. The right side shows three different tables that have stored the same values. However, each of the tables has some level of inaccuracy when compared to the original table. For example, it seems that values in database 1 are somehow rounded or truncated, while values in database 2 are converted to a different unit without adjusting the unit. And values in database 3 have a series of errors, including typos. The main challenge is that the actual database usually cannot be acquired when developing population health databases. Indeed, the only database that a population health analyst will access is one of the databases labeled 1, 2, or 3. Thus, data inaccuracies are often challenging to find and almost impossible to fix. For example, in the case of database 1, there is no way to know what the truncated decimal was. In the case of database 2, unless some extreme values reveal a unit conversion, the analyst may not know what has happened. And in database 3, there is no way to find out what has been omitted or mistyped. This diagram shows the data completeness issues of a data set.
Note that there is no table with actual values, as completeness is measured based on the available data. Database 1 shows that a number of patients do not have a weight recorded at all, and the values are noted as nulls. A population health analyst may decide to drop patients with missing values or try to impute them, depending on the underlying data and the statistical approach taken to analyze the population. Database 2 includes multiple measurements for each patient. For example, patient 2 has five measurements, two of which are missing. In this case, in addition to the dropping versus imputation problem, the analyst needs to find the most appropriate approach to set one value for patient 2's weight. Timeliness of data is important if the population health analytics need to be trained and deployed at certain intervals. This diagram shows that the EHR's billing codes go through a cycle before being added to a claims database. This process may include patient pre-authorization, benefit verification, charge capturing, coding, claims submission, and denial management. The entire process may take anywhere from 1 to 30 days. Population health analysts should be aware of this data latency and consider models that fit with the existing information flows. This table shows the common data quality and accessibility issues of various population health data sources. Note that this is a general approximation for a group of data sources. Population health analysts should examine each individual data source carefully for potential data quality issues. Data linkage and integration challenges often limit the merger of various data sources and the development of larger population health data warehouses. Interoperability is one of the main reasons that various population health data sources cannot be easily integrated.
Interoperability is defined as the ability of a system to exchange electronic health information with, and use electronic health information from, other systems without special effort on the part of the user. As depicted in this diagram, interoperability goes through multiple stages, but essentially, sending, receiving, finding, and eventually using the data are the end results of interoperability. While developing large population health data warehouses, the technical team should study and be aware of potential interoperability issues. Linking and integrating various data sources for population health analysis also requires matching patients across the various databases. Often analysts are required to generate master patient indexes, MPIs, to match patients across various data sources, such as matching EHR records with insurance claims. Developing and utilizing MPIs is a complex process and may introduce error and bias in population health databases, despite the fact that there are many tools to accomplish this process. The following data attributes are often used to create MPIs: first, middle, and last name; date of birth; current and previous address, including street address, city, state, and zip code; current and previous phone number; and gender. Note that most of the data elements needed to create MPIs are considered protected health information and may not be available for the matching process. HIPAA established national standards to protect individuals' medical records and other personal health information. Protected health information, PHI, includes a long list of identifiers, such as names, geographic data, dates, phone and fax numbers, email addresses, and social security numbers. Complying with HIPAA and removing protected health information from population data sources may affect the end results by limiting the use of some key population health data types, such as geographic data and various dates.
These dates include dates of birth, which are essential for predictive modeling. Removing protected health information also makes the matching process complex due to the lack of an MPI, which will eventually limit the creation of larger and more diverse population health data sets. It shifts population health research into quality improvement efforts, which may hinder broader adoption by others, and it increases the development cost of large but anonymized population health data sets for research. Population health data systems can adhere to a variety of system architectures. The two common architectures are the centralized model and the federated model. The centralized architecture accumulates and manages all data in a single, centralized repository. The advantages of the centralized model are simplicity and efficiency, higher consistency of data, and easier patient linkage if the same patient identifiers are used. However, the centralized model has a number of disadvantages. It does not scale up well. It has a single point of control, and the data sources have to trust the custodian of the centralized repository. It requires exceptional leadership to get all stakeholders to share their data. It requires everyone to accept the same identifiers for patients. It needs robust communication infrastructures. This diagram shows the centralized population health architecture. The centralized data repository is at the center of this model, and each of the potential data sources is added to it over time. The federated architecture permits users to access the data only when needed. An MPI is needed to match the patients across the external data sources. The advantages of the federated model are as follows. Data ownership can be managed by defining policies. Individual organizations are able to control their own data. There are benefits from scalability. The model builds on existing infrastructure. There are more opportunities for creativity.
However, the federated model comes with its disadvantages. It requires more coordination. It may be slower than a monolithic database. The patient identifier problem has to be solved. This diagram shows the federated population health architecture. The central data broker does not store any data and queries external data sources as needed. This model requires a strong matching process based on the MPI. This concludes lecture g of Population Health IT and Data Systems. In this lecture, we discussed data management challenges of population health data repositories, including data quality challenges, such as accuracy, completeness, and timeliness of the data; data linkage and integration challenges, and the critical role of interoperability and master patient indexes in alleviating the problem; data access, privacy, and use challenges; and finally, complexities of various population health system architectures, including the centralized and federated models. This concludes Population Health IT and Data Systems. This unit includes the following lectures and corresponding topics. Lecture a, common data types, specifically demographics. Lecture b, common data types, specifically diagnosis, medications, and procedures. Lecture c, common data types, specifically surveys, utilization, and groupers. Lecture d, emerging data types. Lecture e, traditional and nontraditional data sources. Lecture f, factors affecting data sources. And lecture g, challenges of population health data management.