Every few days, we will be publishing layman’s abstracts of new articles from our prestigious portfolio of journals in statistics. The aim is to highlight the latest research to a broader audience in an accessible format.
The Open Access article featured today is from Statistics in Medicine, with the full article now available to read here.
Recovery of original individual person data (IPD) inferences from empirical IPD summaries only: Applications to distributed computing under disclosure constraints. Statistics in Medicine. 2020; 39: 1183–1198. https://doi.org/10.1002/sim.8470
If someone wants to jointly analyze data from several different sites, for example patient data from different clinics, the typical approach is to combine the individual data of all patients (often called Individual Person Data, or IPD) into one big data set. Yet this is not always possible due to privacy constraints. We show that it is possible to use non-disclosive summary statistics, such as mean values, calculated from the IPD, to generate synthetic data that can subsequently be used for statistical analyses that would otherwise be difficult to perform. Synthetic data is generated via a resampling method that takes the summary statistics as its only input. In our example, non-trivial statistical inferences, such as parameter estimates of a multivariable mixed-effects logistic regression, obtained from these synthetic data are very similar to results from the original IPD. This is a bit surprising, because one would not typically expect to recover such information from simple summary statistics alone. Thus, privacy is maintained, as sites only have to deliver non-disclosive summaries, while still being able to obtain useful analysis results. The method is readily applicable in infrastructures, such as DataSHIELD, that support multi-center analysis of protected IPD. Results and code are openly available for reproduction. Have a look!
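To make the general idea concrete, here is a minimal Python sketch. It is not the authors' algorithm: it assumes that each site shares only means, standard deviations and a correlation, and it uses a plain bivariate normal draw as a simplified stand-in for the resampling method described in the article. The variables (age and blood pressure), the simulated "original" data and the function names `summarize`, `synthesize` and `slope` are all hypothetical and chosen purely for illustration.

```python
import math
import random
import statistics


def summarize(x, y):
    """Compute summaries a site could share instead of raw records."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sx, sy = statistics.stdev(x), statistics.stdev(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return {"n": n, "mean": (mx, my), "sd": (sx, sy), "corr": cov / (sx * sy)}


def synthesize(summary, rng):
    """Draw synthetic pairs from a bivariate normal matching the summaries.

    Assumption: normality is a simplification made for this sketch only.
    """
    (mx, my), (sx, sy), r = summary["mean"], summary["sd"], summary["corr"]
    xs, ys = [], []
    for _ in range(summary["n"]):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        xs.append(mx + sx * z1)
        # Mix z1 and z2 so that the pair has correlation r.
        ys.append(my + sy * (r * z1 + math.sqrt(1 - r * r) * z2))
    return xs, ys


def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return num / sum((a - mx) ** 2 for a in x)


rng = random.Random(0)

# Hypothetical "original" IPD that would stay at the site.
age = [rng.gauss(50, 10) for _ in range(2000)]
sbp = [100 + 0.5 * a + rng.gauss(0, 8) for a in age]

# Only the summaries leave the site; the analyst works on synthetic data.
summary = summarize(age, sbp)
syn_age, syn_sbp = synthesize(summary, rng)

print("slope from original IPD:  ", round(slope(age, sbp), 3))
print("slope from synthetic data:", round(slope(syn_age, syn_sbp), 3))
```

In this toy setting the two slopes should come out close to each other, which mirrors, in a much simpler model, the article's finding that inferences from synthetic data can resemble those from the original IPD.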