A Bayesian latent class approach for EHR-based phenotyping


  • Author: Rebecca Hubbard
  • Date: 12 July 2019
  • Copyright: Image copyright of Patrick Rhodes

Phenotyping, i.e., identification of patients possessing a characteristic of interest, is a fundamental task for research conducted using electronic health records. However, challenges to this task include imperfect sensitivity and specificity of clinical codes and inconsistent availability of more detailed data such as laboratory test results. Despite these challenges, most existing electronic health records-derived phenotypes are rule-based, consisting of a series of Boolean arguments informed by expert knowledge of the disease of interest and its coding. The objective of a paper recently published in Statistics in Medicine is to introduce a Bayesian latent phenotyping approach that accounts for imperfect data elements and missing-not-at-random missingness patterns, and that can be used when no gold-standard data are available.

The paper is available from Statistics in Medicine, and the authors explain their findings in further detail below:


Electronic health records (EHR) have been broadly embraced as a resource for efficiently conducting research on health and healthcare in the real world. One fundamental challenge to using EHR data for research is identifying patients with a target condition or characteristic, referred to as phenotyping. The traditional approach, rule-based phenotyping, relies on expert opinion on biomarkers, diagnosis codes or other pieces of information contained in the medical record that signify the presence of a particular characteristic. However, such approaches invariably result in imperfect phenotypes because gold-standard information on disease diagnosis is rarely available.

A unique challenge to EHR-based phenotyping arises from the complex pattern of missing data that the EHR exhibits. For a disease diagnosis to be captured in the EHR, the patient must have the condition of interest, the patient must seek care for the condition, and the condition must be correctly diagnosed and documented by the medical provider. This generates complex missing data patterns that violate the missing data assumptions of many standard statistical methods. In many cases, data elements such as biomarker values that are strongly predictive of the phenotype of interest are available for only a small number of patients. For example, in the case of type 2 diabetes, fasting glucose and hemoglobin A1c are strongly predictive of the presence of disease, yet only a small proportion of patients will have these test results included in their EHR data. Moreover, the mere presence of information on these biomarkers, in addition to their values, is indicative of disease: whether a value is recorded depends on the patient's underlying disease status, which is a missing-not-at-random pattern. Existing phenotyping approaches do not explicitly address the complex missing data mechanisms encountered in the EHR, and standard statistical prediction modeling approaches fail to account for missing-not-at-random missingness.
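
To make the informative-missingness point concrete, the following is a minimal simulation sketch and not the authors' code. All numbers (10% prevalence, testing rates of 60% and 10%, HbA1c-like means of 7.5 and 5.4) and the function names are hypothetical choices made for illustration. It shows that when testing depends on disease status, simply having a result on file carries information about the phenotype.

```python
import random


def simulate_patients(n, prevalence=0.1, p_test_diseased=0.6,
                      p_test_healthy=0.1, seed=0):
    """Simulate a biomarker that is recorded more often for diseased patients.

    All rates and means here are hypothetical, chosen only for illustration.
    Returns a list of (diseased, value_or_None) tuples.
    """
    rng = random.Random(seed)
    patients = []
    for _ in range(n):
        diseased = rng.random() < prevalence
        # HbA1c-like value: higher on average when disease is present.
        value = rng.gauss(7.5 if diseased else 5.4, 0.5)
        # The chance of being tested depends on the unobserved disease
        # status, so the missingness itself is informative.
        p_test = p_test_diseased if diseased else p_test_healthy
        observed = rng.random() < p_test
        patients.append((diseased, value if observed else None))
    return patients


def prevalence_by_availability(patients):
    """Disease prevalence among patients with and without a recorded result."""
    tested = [d for d, v in patients if v is not None]
    untested = [d for d, v in patients if v is None]
    return sum(tested) / len(tested), sum(untested) / len(untested)


patients = simulate_patients(20000)
with_result, without_result = prevalence_by_availability(patients)
print(f"prevalence among patients with a result:    {with_result:.2f}")
print(f"prevalence among patients without a result: {without_result:.2f}")
```

Under these assumed rates, Bayes' rule gives a prevalence of about 0.40 among tested patients and about 0.05 among untested patients, so a method that ignores whether a biomarker was measured discards useful information.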

In this work, the authors develop and evaluate a Bayesian latent class model for estimating patient phenotypes using EHR data. Because the approach is Bayesian, expert opinion on codes or biomarkers expected to be strongly associated with the phenotype can be incorporated into the phenotyping model as prior information. To address missing-not-at-random missingness, the presence or absence of biomarker data can also be incorporated into the model as a predictor. Unlike traditional approaches, the latent class approach further allows for incorporation of data elements whose predictive performance relative to the true phenotype is unknown. The work uses simulation studies to evaluate the proposed latent class model and demonstrates that it has superior predictive accuracy relative to traditional rule-based approaches.
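
As a rough illustration of the latent class idea, the sketch below computes the probability of the phenotype for one patient from a few binary data elements, assuming they are conditionally independent given the true (unobserved) class. This is a deliberate simplification and not the authors' model: the sensitivities, specificities, and prevalence are fixed at hypothetical values, whereas a full Bayesian analysis would place priors on them and estimate them from the data. The indicator names are also invented for this example.

```python
def posterior_phenotype_prob(indicators, prevalence, sens, spec):
    """P(phenotype present | indicators) under conditional independence.

    indicators: dict mapping name -> 1, 0, or None (None = not applicable,
                e.g. a biomarker result that was never measured).
    sens[name]: P(indicator = 1 | phenotype present)
    spec[name]: P(indicator = 0 | phenotype absent)
    """
    like_pos = prevalence        # running value of P(class = 1, data)
    like_neg = 1.0 - prevalence  # running value of P(class = 0, data)
    for name, x in indicators.items():
        if x is None:
            continue  # no contribution from an element that does not exist
        if x == 1:
            like_pos *= sens[name]
            like_neg *= 1.0 - spec[name]
        else:
            like_pos *= 1.0 - sens[name]
            like_neg *= spec[name]
    return like_pos / (like_pos + like_neg)


# Hypothetical parameter values, for illustration only.
PREVALENCE = 0.10
SENS = {"dx_code": 0.70, "biomarker_measured": 0.60, "biomarker_elevated": 0.85}
SPEC = {"dx_code": 0.95, "biomarker_measured": 0.90, "biomarker_elevated": 0.90}

# Diagnosis code present, biomarker measured and elevated.
p_a = posterior_phenotype_prob(
    {"dx_code": 1, "biomarker_measured": 1, "biomarker_elevated": 1},
    PREVALENCE, SENS, SPEC)
# No code and no biomarker measurement on file.
p_b = posterior_phenotype_prob(
    {"dx_code": 0, "biomarker_measured": 0, "biomarker_elevated": None},
    PREVALENCE, SENS, SPEC)
# No code, biomarker measured, but the value is unavailable to the analyst.
p_c = posterior_phenotype_prob(
    {"dx_code": 0, "biomarker_measured": 1, "biomarker_elevated": None},
    PREVALENCE, SENS, SPEC)
print(p_a, p_b, p_c)
```

Treating `biomarker_measured` as its own data element is how this sketch mirrors the idea of using the presence or absence of biomarker data as a predictor: the third patient receives a higher probability than the second solely because a test was ordered.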

In contrast to rule-based phenotyping, which provides a dichotomous classification, latent class-derived phenotypes provide, for each patient, a posterior probability of possessing the phenotype of interest. Depending on the specific needs of a given research project and the costs associated with false-positive or false-negative misclassifications, the threshold for classifying patients as possessing the phenotype of interest can be selected to maximize sensitivity, specificity, or another criterion. Flexible approaches to phenotyping are needed to address the complexities of EHR data as well as the wide variety of research purposes for which they can be used.
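
The threshold choice described above can be sketched as follows. This is an illustrative example, not a procedure taken from the paper: the function names, the misclassification costs, and the list of posterior probabilities are all hypothetical. The cost-based rule uses the standard decision-theoretic result that a positive call has lower expected cost whenever the posterior probability exceeds cost_fp / (cost_fp + cost_fn). The expected sensitivity and specificity are model-based quantities that assume the posterior probabilities are well calibrated.

```python
def cost_minimizing_threshold(cost_fp, cost_fn):
    """Threshold above which a positive call has lower expected cost."""
    return cost_fp / (cost_fp + cost_fn)


def classify(posteriors, threshold):
    """Dichotomize posterior probabilities at the given threshold."""
    return [p >= threshold for p in posteriors]


def expected_sens_spec(posteriors, threshold):
    """Model-based expected sensitivity and specificity.

    Needs no gold standard, but treats each posterior probability as the
    patient's true chance of having the phenotype (a calibration assumption).
    """
    calls = classify(posteriors, threshold)
    exp_true_pos = sum(p for p, c in zip(posteriors, calls) if c)
    exp_true_neg = sum(1.0 - p for p, c in zip(posteriors, calls) if not c)
    exp_pos = sum(posteriors)
    exp_neg = sum(1.0 - p for p in posteriors)
    return exp_true_pos / exp_pos, exp_true_neg / exp_neg


# Hypothetical posterior probabilities for six patients.
posteriors = [0.02, 0.10, 0.35, 0.60, 0.90, 0.99]

# Assume a missed case is four times as costly as a false alarm.
threshold = cost_minimizing_threshold(cost_fp=1.0, cost_fn=4.0)
calls = classify(posteriors, threshold)
sens_low, spec_low = expected_sens_spec(posteriors, threshold)
sens_half, spec_half = expected_sens_spec(posteriors, 0.5)
print(threshold, calls)
print(sens_low, spec_low, sens_half, spec_half)
```

Lowering the threshold trades expected specificity for expected sensitivity, which suits a study where missing true cases is the costlier error; a study that needs a highly pure cohort would move the threshold in the other direction.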

This website is provided by John Wiley & Sons Limited, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ (Company No: 00641132, VAT No: 376766987)

Published features on StatisticsViews.com are checked for statistical accuracy by a panel from the European Network for Business and Industrial Statistics (ENBIS), to whom Wiley and StatisticsViews.com express their gratitude. The panel members are: Ron Kenett, David Steinberg, Shirley Coleman, Irena Ograjenšek, Fabrizio Ruggeri, Rainer Göb, Philippe Castagliola, Xavier Tort-Martorell, Bart De Ketelaere, Antonio Pievatolo, Martina Vandebroek, Lance Mitchell, Gilbert Saporta, Helmut Waldl and Stelios Psarakis.