Welcome to Population Health: Population Health IT and Data Systems. This is Lecture c. This component, Population Health, discusses the application of informatics and informatics methods in population health management. This unit, Population Health IT and Data Systems, explains the challenges and opportunities of using different data types, data sources, and data systems for population health IT. The objective for this lecture is to identify various data types and data sources used for population health management, including both traditional and nontraditional data sources. This lecture finalizes the discussion of the common data types used for population health and focuses on surveys, utilization, and groupers, a set of data types commonly used in population health analytics. Common data types used for population health data systems and analytics include the following. Demographic information, such as age, sex, and gender. Diagnostic information, such as the actual diagnosis and the severity of a diagnosis. Medication information, such as prescribed, dispensed, and filled medications. Procedures, such as medical evaluations, anesthetic procedures, surgeries, medical imaging, and other procedures that occur in inpatient and outpatient settings. Data collected by patient surveys, such as the health risk assessment, HRA, and the Patient Health Questionnaire-9, PHQ-9, which is designed to measure the severity of depression. Utilization information, such as cost, hospitalization, admission, readmission, and so on. And finally, a set of derived variables that categorize other variables into meaningful groups. These grouper variables are often generated by a variety of commercial and non-commercial applications. There is a long list of grouper data types, including Adjusted Clinical Groups, ACGs, diagnosis-related groups, DRGs, and others. In this lecture we will focus on the survey, utilization, and grouper data types.
Survey data, if available for an entire population, are often used in population health analytics. This data type includes a variety of variables, depending on the type of survey. Some surveys are more common than others, which makes certain variables more popular in population health analytics. Certain risk factors and self-reported behaviors affect healthcare utilization; for example, smoking or drinking will likely increase future healthcare costs. Survey data types are commonly used as independent variables to predict population health outcomes such as future healthcare costs. A variety of variables can be derived from surveys. There are no standard coding mechanisms for surveys. However, using a standardized questionnaire can help reduce bias and error when using survey information for population health analytics. Standardized questionnaires include the following. A variety of surveys marketed as health risk assessment tools, which span various topics including lifestyle, family history, physiological data, attitudes and behaviors, safety, and mental health. The Patient Health Questionnaire, PHQ-9, which is a common screening tool to identify depression. The Generalized Anxiety Disorder scale, GAD-7, which is a screening tool for anxiety. And the Life Events Checklist, LEC, which is a brief self-report measure designed to screen for potentially traumatic events in a respondent's lifetime. Survey data are collected through self-reported questionnaires. Sometimes survey data are stored in electronic health records, EHRs, as part of population-wide assessments or for other clinical indications. And sometimes insurers collect survey data from their members. However, these data are often not included in typical claims data sets. The quality of survey data varies considerably from one questionnaire to another. Some data quality issues are inherent to the survey nature of the data.
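To make concrete how a standardized questionnaire becomes a variable for analytics, here is a minimal sketch of PHQ-9 scoring. The lecture does not describe the scoring rules; the item range (0 to 3 for each of nine items) and the severity bands below come from the published PHQ-9 scoring guidance, and the function names are hypothetical.

```python
# Severity bands from the published PHQ-9 scoring guidance (not from this lecture).
PHQ9_BANDS = [
    (0, 4, "minimal"),
    (5, 9, "mild"),
    (10, 14, "moderate"),
    (15, 19, "moderately severe"),
    (20, 27, "severe"),
]


def phq9_total(item_scores):
    """Sum the nine PHQ-9 item scores, each coded 0, 1, 2, or 3."""
    if len(item_scores) != 9:
        raise ValueError("PHQ-9 has exactly nine items")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("each item must be scored 0, 1, 2, or 3")
    return sum(item_scores)


def phq9_severity(total):
    """Map a total score (0-27) to its severity band label."""
    for low, high, label in PHQ9_BANDS:
        if low <= total <= high:
            return label
    raise ValueError("total must be between 0 and 27")


# Hypothetical responses for one respondent.
responses = [1, 1, 2, 0, 3, 1, 0, 2, 1]
total = phq9_total(responses)
severity = phq9_severity(total)
```

Either the total score or the severity band could then serve as an independent variable in a population-level model; which one is more useful depends on the analysis.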
Surveys are prone to various biases, such as sampling, selection, response, and social desirability biases. In addition, the validity and reliability of questionnaires are often difficult to measure. Thus, any population-wide survey data need to be used with extra caution. There is no known common data coding standard for surveys. However, some of these questionnaires are well validated and are considered reliable across an entire population. Surveys may include questions on information protected by the Health Insurance Portability and Accountability Act, HIPAA. Survey data need to be scanned for protected health information before being merged into a large population health database for analysis. This image shows a sample survey, the Patient Health Questionnaire, or PHQ-9, which is commonly used to assess the level of depression. Note that surveys generate a wide range of potential data types with various units and ranges, and not all variables may be of use for population health analytics. Health risk assessment, or HRA, surveys cover various topics, including the following. Personal disease history, which may help to find diagnoses of select high-cost diseases. Family disease history, which may reveal diagnoses of cancer, cardiovascular disease, CVD, hypertension, HTN, and other diseases that have a familial risk factor. Health screenings and immunizations, which typically cover immunization for influenza, pneumonia, and other high-impact diseases, especially for the elderly population. Alcohol consumption, and more specifically the ability to limit drinking in stressful situations. Injury prevention behaviors, such as gun safety, wearing seat belts, and other injury-preventive behaviors. Nutrition, which may cover consumption of grains, nuts, and dairy, as well as portions. Physical activity, which is often categorized into low, medium, and high intensity levels. Skin protection, especially for outdoor activities. Stress and well-being, such as the ability to handle stressful situations.
Tobacco use, which includes the medium, such as cigar, pipe, and cigarette, and also the frequency and intensity of the smoking behavior. Weight management, which often involves measuring body mass index, or BMI. Women's health, such as pregnancy status or other women's health risk factors. This bar chart shows the prevalence of various health risk factors in the US population. The data are based on the Behavioral Risk Factor Surveillance System, or BRFSS, survey, which is administered and managed by the Centers for Disease Control and Prevention, CDC. As depicted by the bar chart, certain behavioral risk factors, such as lack of exercise or alcohol consumption, are common in the general population. These risk factors can be used to better predict future healthcare costs. This bar chart shows the potential share of risk associated with a number of behavioral risk factors. For example, a high risk for depression is associated with an increase in annual healthcare cost of almost $3,250. Note that these are modifiable health risk factors that can be addressed to achieve lower healthcare costs in a given population. This bar chart shows the positive and negative effects of various behavioral risk factors on annual healthcare cost. As shown, certain risk factors, such as increased levels of physical activity, are associated with lower annual healthcare costs, while increased levels of stress are associated with higher annual healthcare costs.
The utilization data type is typically used in population health analytics as the outcome measure. There is a variety of variables in this data type, depending on the type of utilization outcome. Utilization can be defined as cost, emergency room admission, hospitalization, readmission, or other significant healthcare utilization events. Past utilization rates and patterns are also sometimes used to predict future cost and utilization.
Utilization data are usually used as the dependent, or outcome, variables in population health analytics. However, prior utilization patterns can also be used to predict future utilization. A variety of variables can be derived from utilization data. There are no specific standard coding terminologies for utilization. However, most utilization data are based on claims data, and there are a number of reference coding systems for coding certain utilization events. Utilization data are often extracted from claims, which include all covered procedures, costs, and other associated utilization events for each patient across all providers. Sometimes utilization data are extracted from other data sources, such as EHRs, especially when claims data are not available. Note that EHR-level utilization data are limited to events that have occurred at a particular provider and often do not contain utilization data from other providers. The quality of utilization data is often acceptable because of various mandates to collect them accurately. However, as discussed earlier, data quality varies across different sources. There are no known data interoperability issues. Certain utilization events, such as admission to a mental health clinic, are protected by various federal and state laws. Therefore, a population health database may not include those utilization and cost data. This diagram depicts the stages that a claim payment goes through before the final net cost is calculated. A claims data source may include utilization and cost for each of the stages or for only one of them. Population health analysts should be aware of the type of cost that is being used for predictive modeling. This table shows a sample list of patients and their partial claims records. The claims records are based on synthetic data. The claims data include the claim date, billed cost, allowed cost, co-pay amount, deducted amount, calculated cost, up to four diagnoses, and one procedure.
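The relationship between these cost columns can be sketched with simple arithmetic. This is only an illustration under stated assumptions: the lecture does not give the exact formula, so the sketch assumes the payer's net cost is the allowed cost minus the co-pay and the deducted amount, and it assumes a co-pay of zero for the second example. The function names are hypothetical; the dollar amounts are the ones the lecture cites for boxes one and two of the sample table.

```python
from decimal import Decimal


def contractual_adjustment(billed, allowed):
    """Difference between what the provider billed and what the plan allows."""
    return Decimal(billed) - Decimal(allowed)


def payer_net_cost(allowed, copay, deducted):
    """Assumed formula: allowed cost minus patient co-pay and deducted amount.

    The result is floored at zero so the payer never owes a negative amount.
    """
    net = Decimal(allowed) - Decimal(copay) - Decimal(deducted)
    return max(net, Decimal("0"))


# Box one: billed $153 against an allowed amount of $140.54.
adjustment = contractual_adjustment("153.00", "140.54")

# Box two: allowed $20 with $19.61 deducted; the zero co-pay is an assumption.
net = payer_net_cost("20.00", "0.00", "19.61")
```

Decimal is used instead of float so that currency amounts such as 0.39 are represented exactly. An analyst would pick one of these columns (billed, allowed, or net) as the modeling target, depending on the analytical goal.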
The stages of claim payment explain the various cost columns. The diagnoses are coded in the International Classification of Diseases, version 9, ICD-9, and the procedure is coded in Current Procedural Terminology, CPT, codes. Note that if a patient has more than one procedure, each additional procedure is listed in an additional row. Box one shows that the billed cost of $153 is higher than the amount the provider was allowed to bill under the contract arranged between the provider and the insurance plan, which is set at $140.54. Box two shows a considerable deduction of $19.61 applied to the total allowed cost of $20, which results in a net disbursement of 39 cents from the payer side. Population health analytic models may predict any of these costs, depending on their analytical goals. This bar chart shows the distribution of cost across the entire population of a large commercial claims database. As shown on the left side of the chart, a large majority of the population incurs less than the mean total cost, which is $4,459 for this particular health plan. However, a long tail exists on the right side of the chart; the patients in this tail are often denoted as high utilizers. The main challenge in population health analytics is to predict this highly skewed long tail of utilizers. This diagram shows the distribution of healthcare spending based on the Medical Expenditure Panel Survey, MEPS, of 2009. The Agency for Healthcare Research and Quality, AHRQ, describes MEPS as a set of large-scale surveys of families and individuals, their medical providers, and employers across the United States. MEPS can be considered the most complete source of data on the cost and use of healthcare and health insurance coverage. Based on this diagram, the top 5% of spenders account for almost 50% of the spending. Box one points to the area under the curve that constitutes only 1% of the total population but includes almost 20% of the total spending.
Box two points to another area under the curve that represents the top 5% of spenders, who account for almost half of the total healthcare spending in the US. And finally, box three points to the fact that almost half of the population accounts for only 3% of the total spending. This skewness of utilization rates is an important challenge in developing population health analytics. This bar chart shows the percent of total healthcare expenses incurred by different percentiles of the population, based on the 2002 MEPS data. As shown, the bottom 50% of the population accounts for only about 3% of the spending, and almost 20% of the population generates 80% of the spending. The top 1% of spenders generated more than 20% of the total cost. This bar chart shows the distribution of age within the top 5% of healthcare utilizers across the entire population. The chart is based on the 2002 MEPS data. As shown, the percentage of spending is much higher in the elderly age groups than in the younger groups. Indeed, the 65 to 79 age range includes almost a third of the top 5% of utilizers. This table shows how prior healthcare cost and utilization can be used to predict future cost. The table lists a number of employer-based health plans, the number of members in each plan, and the baseline cost. The baseline cost is then used to predict future cost, denoted as the cost of year one. As shown across the employer plans, prior cost is an important independent variable in predicting future cost. This model has resulted in less variation compared to the age-and-sex-only model that was discussed in earlier lectures.
Groupers are software applications or lookup tables that provide a systematic and logical method for grouping lower-level variables, such as ICD codes, into larger concepts that can facilitate population health data management and analytics.
These grouped concepts are often designed and generated so as to be highly correlated with healthcare utilization or another outcome of interest. Grouped concepts are often used as independent variables in population health analytics. For example, a high-level diagnostic concept may categorize all diabetes ICD codes into two grouped concepts: diabetes with complications and diabetes without complications. It should be noted that grouped concepts are also sometimes used as dependent variables. For example, unplanned hospitalization, which may group a number of hospitalizations while excluding others, might be used as an outcome variable in a population health study. There is a large variety of grouped concepts and variables generated by different population health stratification applications. There are no specific standard coding terminologies for grouped concepts. However, the underlying grouper software uses common grouping terms and concepts for internal use. Grouper software applications often use insurance claims data to generate the grouped concepts. Because most of these grouped concepts are proprietary, their interoperability with other systems is highly limited. Some of the grouped concepts may include sensitive data that should be managed accordingly. There is a long list of groupers, both commercial and academic. These groupers can reduce the dimensions of population health data by reducing the number of low-level concepts used for predictive modeling. This table lists a number of these so-called "risk groupers." Note that not all groupers use all available data sources for grouping and predictive modeling. For example, the Diagnosis-Related Groups, DRGs, use only diagnosis data and are customized for inpatient settings, whereas the Adjusted Clinical Groups, ACGs, use a variety of data types, including demographics, diagnoses, medications, and even lab results.
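The diabetes example mentioned earlier can be sketched as a tiny lookup-style grouper. This is an illustrative simplification, not the logic of any commercial or academic grouper: it assumes ICD-9-CM category 250 (diabetes mellitus), treats subcategory 250.0 as "without complications" and 250.1 through 250.9 as "with complications," and uses hypothetical function names and patient identifiers.

```python
from collections import defaultdict

WITHOUT = "diabetes without complications"
WITH = "diabetes with complications"


def diabetes_group(icd9_code):
    """Return a grouped concept for an ICD-9-CM diabetes code, else None.

    Simplifying assumption: the first digit after the decimal point decides
    the group (0 = without complications, 1-9 = with complications).
    """
    category, _, subcode = icd9_code.strip().partition(".")
    if category != "250" or not subcode:
        return None
    return WITHOUT if subcode[0] == "0" else WITH


def group_patients(diagnoses):
    """Collapse (patient_id, icd9_code) pairs into grouped concepts per patient."""
    grouped = defaultdict(set)
    for patient_id, code in diagnoses:
        concept = diabetes_group(code)
        if concept is not None:
            grouped[patient_id].add(concept)
    return dict(grouped)


# Synthetic diagnosis rows: four distinct codes collapse into two concepts.
rows = [
    ("p1", "250.00"),
    ("p1", "250.02"),
    ("p2", "250.60"),
    ("p3", "401.9"),  # hypertension, ignored by this diabetes-only sketch
]
concepts = group_patients(rows)
```

The point of the sketch is the dimension reduction: many low-level codes map to a handful of grouped concepts that can then be used as model inputs.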
This diagram shows how a grouper, in this case the Johns Hopkins ACG, uses the underlying population health data to generate higher-level concepts. For example, the Johns Hopkins ACG grouper generates grouped concepts such as ADGs, ACGs, and EDCs from ICD-level diagnostic and demographic information. Note that the number of grouped concepts is considerably lower than the number of ICD codes. The ACG system also generates Rx-MGs from NDC-level medication data. The software then uses these aggregated grouped concepts for predictive modeling and eventually population stratification. This diagram shows how the results of commercial and academic risk stratification applications can be translated into actionable applications. For example, the population with the highest risk for a given outcome, such as utilization, can be assigned to a closer case management application, while the population with a lower risk can be assigned to more modest management methods. The population risk groups are depicted in the pyramid diagram on the left side, while the potential matching management applications are listed on the right side of the diagram.
This concludes Lecture c of Population Health IT and Data Systems. This lecture discussed common data types for population health analytics, focusing on surveys, utilization, and groupers. The survey data type included a discussion of the prevalence of risk factors and the healthcare costs associated with various risk factors. The utilization data type included a discussion of the distribution of cost, the relationship of prior cost and future utilization, and other utilization factors.