Statistical Plasmode Simulations: Potentials, Challenges and Recommendations – lay abstract

The lay abstract featured today (for the Tutorial in Biostatistics on Statistical plasmode simulations–Potentials, challenges and recommendations by Nicholas Schreck, Alla Slynko, Maral Saadati and Axel Benner) is from Statistics in Medicine, with the full Open Access article now available to read here.

Schreck N, Slynko A, Saadati M, Benner A. Statistical plasmode simulations–Potentials, challenges and recommendations. Statistics in Medicine. 2024; 122. doi: 10.1002/sim.10012


Data availability is a crucial issue that arises in the context of statistical model development and validation, inference derivation, introduction of statistical concepts and many other areas.

In some cases, especially for high-dimensional data where the number of variables is substantially larger than the number of observations, the number of available data sets is often not large enough to properly and reliably perform all required tasks.

To overcome that deficiency, alternative data generation approaches such as the generation of artificial data are required.

The generated data should match as closely as possible the real-life data underlying the research question of interest, in particular with respect to the types of variables, their dependencies and distributions. Also, at least some specific aspects of truth should be available in order to be able to objectively evaluate the methods applied to the generated data. In practice, this might include knowledge about the association between some influential variables and outcome variables.

A variety of simulation approaches have already been introduced, including parametric simulations and so-called plasmode simulations. While there are concerns about the lack of realism of parametrically simulated data, it is often claimed that plasmode simulations come very close to reality.

However, there are no explicit guidelines and no established state of the art for performing plasmode simulations.

The authors first review existing literature and introduce the concept of statistical plasmode simulation. In contrast to biological plasmodes, which are usually created by conducting lab experiments, statistical plasmode simulation is based on the availability of a representative dataset. It utilizes aspects of resampling (when generating the covariate information) as well as parametric modeling (e.g., application of outcome generating models, modeling of exposure etc.) and can be interpreted as a semi-parametric method.

Thus, artificial outcomes are created using a parametric outcome generating model, i.e., a specification which allows the user to generate outcomes while implementing user-defined associations with input variables.
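To make the idea concrete, here is a minimal sketch in Python of the two ingredients described above: covariate rows are resampled with replacement from an observed dataset, and outcomes are then drawn from a parametric outcome generating model. Everything specific in it is an assumption made for illustration and is not taken from the article: the function name plasmode_dataset, the toy stand-in for the "real" covariates, the choice of a logistic model for a binary outcome, and the coefficient values. The article's own example uses R and a breast cancer dataset.

```python
import math
import random


def plasmode_dataset(covariates, beta, intercept, n, rng):
    """Return n (covariate_row, outcome) pairs.

    Covariate rows are resampled with replacement from the observed data,
    so their joint distribution and dependencies are preserved. Outcomes
    are generated from a logistic model with user-chosen coefficients, so
    the true covariate-outcome association is known by construction.
    """
    data = []
    for _ in range(n):
        x = rng.choice(covariates)  # resampling step (non-parametric)
        eta = intercept + sum(b * v for b, v in zip(beta, x))
        p = 1.0 / (1.0 + math.exp(-eta))  # parametric outcome model
        y = 1 if rng.random() < p else 0
        data.append((x, y))
    return data


# Hypothetical stand-in for an observed covariate matrix (200 rows, 3 columns).
rng = random.Random(42)
real_covariates = [
    (rng.gauss(0, 1), rng.randint(0, 1), rng.gauss(50, 10)) for _ in range(200)
]

# Assumed "truth": the first two covariates affect the outcome, the third does not.
true_beta = (0.8, -0.5, 0.0)
sim = plasmode_dataset(real_covariates, true_beta, intercept=-0.2, n=500, rng=rng)
```

Because true_beta is fixed by the user, a method applied to sim can be judged against it, for example by checking whether the method recovers the first two covariates as influential and leaves the third out.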

Further, the authors discuss advantages and challenges of statistical plasmode simulations and provide a step-by-step procedure for plasmode data generation, including key aspects for implementation and reporting.

The generation of statistical plasmodes for high-dimensional data is illustrated using a real-life dataset on breast cancer patients (with R code for reproducing the example).


