Layman’s abstract for paper on a synthetic data method to incorporate external information into a current study

Every few days, we will be publishing layman’s abstracts of new articles from our prestigious portfolio of journals in statistics. The aim is to highlight the latest research to a broader audience in an accessible format.

The article featured today is from the Canadian Journal of Statistics, with the full article now available to read in early view here.

Gu, T., Taylor, J.M.G., Cheng, W. and Mukherjee, B. (2019), Synthetic data method to incorporate external information into a current study. Can J Statistics, 47: 580-603. doi:10.1002/cjs.11513

This article considers the situation where a known regression model is available for predicting an outcome of interest from a set of commonly available predictors. A modest-sized internal dataset is also available, containing individual-level data on the variables in the known model as well as on a new variable. The challenge is to build an improved prediction model that includes the new variable, using both the internal individual-level data and the information carried by the external known model.
The authors propose a synthetic data approach. The known model is used to generate synthetic observations in which the new variable is missing, and these are appended to the internal data to form a combined dataset that carries the external information from the known model. The parameters of the improved model are then estimated by analyzing this combined dataset with methods that can handle missing data (e.g., multiple imputation).
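For readers who like to see the idea in code, the steps above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. It assumes a continuous outcome with a linear known model, a single predictor and a single new variable, a simple regression-based imputation refitted on bootstrap samples of the internal data, and pooling by averaging the point estimates. The function names (`ols`, `synthetic_data_fit`), the simulated data and all numeric settings are hypothetical choices made for this example.

```python
import numpy as np


def ols(design, target):
    """Least-squares coefficients for target ~ design."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def synthetic_data_fit(x_int, b_int, y_int, known_beta, known_sigma,
                       n_synth=2000, n_imputations=20, seed=0):
    """Sketch of a synthetic data fit of Y ~ 1 + X + B (all names hypothetical)."""
    rng = np.random.default_rng(seed)
    n = len(y_int)

    # Step 1: create synthetic records. X is resampled from the internal data,
    # Y is drawn from the known model Y | X, and B is left missing.
    x_syn = rng.choice(x_int, size=n_synth, replace=True)
    y_syn = known_beta[0] + known_beta[1] * x_syn + rng.normal(0.0, known_sigma, n_synth)

    # Step 2: append the synthetic records to the internal data.
    x_all = np.concatenate([x_int, x_syn])
    y_all = np.concatenate([y_int, y_syn])
    design_syn = np.column_stack([np.ones(n_synth), x_syn, y_syn])

    estimates = []
    for _ in range(n_imputations):
        # Step 3: fit an imputation model B ~ 1 + X + Y on a bootstrap sample of
        # the internal data, so that imputations vary between repetitions.
        idx = rng.integers(0, n, n)
        design_boot = np.column_stack([np.ones(n), x_int[idx], y_int[idx]])
        gamma = ols(design_boot, b_int[idx])
        resid_sd = np.std(b_int[idx] - design_boot @ gamma, ddof=design_boot.shape[1])

        # Step 4: impute the missing B values for the synthetic records.
        b_syn = design_syn @ gamma + rng.normal(0.0, resid_sd, n_synth)
        b_all = np.concatenate([b_int, b_syn])

        # Step 5: fit the expanded model on the completed combined dataset.
        design_all = np.column_stack([np.ones(len(y_all)), x_all, b_all])
        estimates.append(ols(design_all, y_all))

    # Pool the point estimates across imputations by averaging.
    return np.mean(estimates, axis=0)


# Simulated example (assumed truth: Y = 1 + 2 X + 1.5 B + noise, B = 0.5 X + noise).
rng = np.random.default_rng(42)
n = 200
x = rng.normal(size=n)
b = 0.5 * x + rng.normal(size=n)
y = 1.0 + 2.0 * x + 1.5 * b + rng.normal(size=n)

# The implied "known" model for Y given X alone has slope 2 + 1.5 * 0.5 = 2.75
# and residual variance 1.5**2 + 1 = 3.25.
estimate = synthetic_data_fit(x, b, y, known_beta=(1.0, 2.75), known_sigma=np.sqrt(3.25))
print(estimate)  # intercept, coefficient of X, coefficient of B
```

In this sketch only the point estimates are pooled. A fuller analysis would also combine the within- and between-imputation variances, and would use an imputation model suited to the type of outcome and new variable at hand.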
A theoretical justification of the method is provided, and its performance is evaluated in simulation studies. The method is then applied to improve models for the risk of prostate cancer. The method’s broad applicability makes it appealing for use across diverse scenarios.