Each week, we will be publishing layman’s abstracts of new articles from our prestigious portfolio of journals in statistics. The aim is to highlight the latest research to a broader audience in an accessible format.
Hoogland, J, van Barreveld, M, Debray, TPA, et al. Handling missing predictor values when validating and applying a prediction model to new patients. Statistics in Medicine. 2020; 3591-3607. https://doi.org/10.1002/sim.8682
Missing data present challenges for both the development and the real-world application of clinical prediction models. While these challenges have received considerable attention in the setting of prediction model development, little research exists on how to handle missing data when applying a prediction model in practice.
The main distinguishing feature of handling missing data in this applied setting, as opposed to the model development setting, is that missing data methods have to be applied to a single new individual. This precludes direct use of the well-known missing data methods for groups of cases that are widely used during prediction model development.
This article compares existing and new methods to account for missing data for a new individual in the context of prediction. The studied methods vary in their approach and can be divided into three groups. The so-called submodel methods use different prediction models that are based on the observed data only; these prediction models therefore have fewer predictors than the original model used for an individual with completely observed data. The marginalization methods compute a weighted average prediction, averaged over the possible values of the missing data and weighted by their probability. Lastly, the imputation-based methods use an individual’s observed data to find plausible replacement values for their missing data, which then allow the original model to be applied as usual.
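As a rough illustration of the three groups, the sketch below predicts for one new individual whose second predictor is missing. It is not the authors' implementation: the logistic model, its coefficients, the separate one-predictor submodel, and the bivariate normal summary of the predictors are all hypothetical values chosen for illustration, and the imputation shown is a simple conditional-mean replacement.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical full model: logit(p) = B0 + B1*x1 + B2*x2
B0, B1, B2 = -1.0, 0.8, 0.5
# Hypothetical submodel that uses x1 only (assumed fitted on development data)
S0, S1 = -0.9, 1.0
# Hypothetical development-data summary: (x1, x2) assumed bivariate normal
MU1, MU2 = 0.0, 0.0
SD1, SD2 = 1.0, 1.0
RHO = 0.5

def full_model(x1, x2):
    """Original model, usable only when both predictors are known."""
    return sigmoid(B0 + B1 * x1 + B2 * x2)

def conditional_x2(x1):
    """Mean and SD of x2 given x1 under the assumed bivariate normal."""
    mean = MU2 + RHO * SD2 / SD1 * (x1 - MU1)
    sd = SD2 * math.sqrt(1.0 - RHO ** 2)
    return mean, sd

def predict_submodel(x1):
    """Submodel approach: a smaller model based on observed predictors only."""
    return sigmoid(S0 + S1 * x1)

def predict_marginalization(x1, n_grid=2001, width=6.0):
    """Marginalization: average the full-model prediction over the
    possible values of x2, weighted by their conditional density."""
    mean, sd = conditional_x2(x1)
    weighted_sum = 0.0
    weight_total = 0.0
    for i in range(n_grid):
        x2 = mean - width * sd + 2.0 * width * sd * i / (n_grid - 1)
        w = math.exp(-0.5 * ((x2 - mean) / sd) ** 2)  # unnormalized density
        weighted_sum += w * full_model(x1, x2)
        weight_total += w
    return weighted_sum / weight_total

def predict_imputation(x1):
    """Imputation: replace x2 by a plausible value (here its conditional
    mean) and apply the original model as usual."""
    x2_imputed, _ = conditional_x2(x1)
    return full_model(x1, x2_imputed)

if __name__ == "__main__":
    x1_observed = 1.0  # new individual: x1 observed, x2 missing
    print("submodel:       ", predict_submodel(x1_observed))
    print("marginalization:", predict_marginalization(x1_observed))
    print("imputation:     ", predict_imputation(x1_observed))
```

In this toy setting the imputed value itself (the conditional mean of the missing predictor) can be inspected, which mirrors the point that replacement values make the assumptions about an individual's missing data visible.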
The influence of these missing data methods on prediction model performance was evaluated in a simulation study. Furthermore, each of the methods was applied to data from a large Dutch study on prophylactic implantation of cardioverter defibrillators. Overall, the imputation-based methods were preferable in terms of prediction model performance. They also offer the convenience of using a prediction model just as though the complete data had been observed, and the replacement values give insight into the assumptions made about the missing data for a particular individual.
Lastly, in line with the need for these methods when applying a prediction model in the presence of missing data, we propose that it is equally desirable to perform model validation based on such methods. The argument is that validation studies should resemble practical application as closely as possible, and should thus reflect the handling of missing data as it would be performed in practice.