Layman’s abstract for paper on estimating prediction error for complex samples

Every few days, we will be publishing layman’s abstracts of new articles from our prestigious portfolio of journals in statistics. The aim is to highlight the latest research to a broader audience in an accessible format.

The article featured today is from the Canadian Journal of Statistics, with the full article now available to read here.

Holbrook, A., Lumley, T. and Gillen, D. (2020), Estimating prediction error for complex samples. Can J Statistics, 48: 204-221. doi:10.1002/cjs.11527

Personalized medicine requires predictions based on electronic medical records and large-scale health surveys. Predicting the US presidential election (of, say, 2020) requires a large and variegated patchwork of local and nationwide surveys. But electronic medical records favor subpopulations given to illness, and we all know how well the survey-based predictions of the last election cycle turned out.

Your data is probably biased. You want to make predictions about the broader populations and mechanisms that give rise to your data. How well do your predictions extend to unobserved members of your target population? How does your sampling design (or lack thereof) affect the reliability of your predictions? How you answer these questions influences which tool you grab from your machine learning toolbox, because different prediction algorithms generalize beyond your (biased) observed data in different ways.

We say that a prediction rule, algorithm, or model generalizes well if it has a small prediction error when applied to unobserved subjects. Here, we show how to estimate the prediction error of a prediction rule that has been trained on a biased data set. We illustrate our Horvitz-Thompson-Efron (HTE) prediction error estimator with extensive simulations and apply our methodology to predicting kidney health based on the NHANES national health survey. Finally, we relate the HTE estimator to dAIC, a recent extension of AIC for biased samples.
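
To give a feel for why the sampling design matters, here is a minimal sketch in Python of inverse-probability weighting, the idea behind the Horvitz-Thompson part of the estimator's name. It is not the HTE estimator from the paper: it only reweights the losses observed on the sampled subjects (in the normalized, Hájek-style form) and leaves out any correction for the optimism of evaluating a rule on its own training data. The function name, the toy losses, and the inclusion probabilities are all hypothetical and chosen purely for illustration.

```python
from typing import Sequence


def weighted_prediction_error(
    losses: Sequence[float], inclusion_probs: Sequence[float]
) -> float:
    """Inverse-probability-weighted average of per-subject losses.

    Hypothetical helper: each sampled subject's loss is weighted by
    1 / (probability that subject was included in the sample), and the
    result is normalized by the sum of the weights (Hajek-style).
    """
    if len(losses) != len(inclusion_probs):
        raise ValueError("losses and inclusion_probs must have equal length")
    if not losses:
        raise ValueError("at least one sampled subject is required")
    if any(not (0.0 < p <= 1.0) for p in inclusion_probs):
        raise ValueError("inclusion probabilities must lie in (0, 1]")

    weights = [1.0 / p for p in inclusion_probs]
    weighted_sum = sum(w * loss for w, loss in zip(weights, losses))
    return weighted_sum / sum(weights)


# Toy data (made up): three subjects from an over-sampled group that the
# rule predicts well, and one subject from a rarely sampled group that the
# rule predicts poorly.
losses = [1.0, 1.0, 1.0, 4.0]
inclusion_probs = [0.5, 0.5, 0.5, 0.1]

# Ignoring the design treats every sampled subject as equally representative.
naive_error = sum(losses) / len(losses)

# Weighting lets the rarely sampled subject stand in for the many similar
# population members who were never observed.
weighted_error = weighted_prediction_error(losses, inclusion_probs)

print(f"naive error:    {naive_error:.3f}")
print(f"weighted error: {weighted_error:.3f}")
```

In this toy case the unweighted average understates the error, because the subject the rule handles worst represents a larger share of the population than of the sample.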