An integrated Bayesian framework for multi-omics prediction and classification – lay abstract

The lay abstract featured today (for An integrated Bayesian framework for multi-omics prediction and classification by Himel Mallick, Anupreet Porwal, Satabdi Saha, Piyali Basak, Vladimir Svetnik and Erina Paul) is from Statistics in Medicine, with the full article now available to read here.

Mallick H, Porwal A, Saha S, Basak P, Svetnik V, Paul E. An integrated Bayesian framework for multi-omics prediction and classification. Statistics in Medicine. 2023; 1-20. doi: 10.1002/sim.9953


Multiview data, which involve concurrent measurements (modalities) collected on the same subjects from multiple sources, are increasingly commonplace in biomedical research. There is enormous potential in integrating concurrent information from distinct vantage points to comprehensively understand complex biological phenomena. Current dominant approaches in multimodal predictive modeling typically generate a single output with little guidance as to its uncertainty. Ignoring the uncertainty in the integration process can lead to undesirable consequences when predicting health outcomes.

Our study, published in Statistics in Medicine, examined the feasibility of an uncertainty-aware Bayesian machine learning algorithm (IntegratedLearner) for integrating vastly different kinds of biological data. We used publicly available multi-omics profiles, both cross-sectional and longitudinal, each consisting of thousands of quality-controlled features (processed to contain only the most important features, filtered for variance) derived from hundreds of samples and multiple biologically diverse omics layers including microbiome, gene expression, proteomics, metabolomics, and immune system, among others. Crucially, these datasets had different signal-to-noise ratios, which were uncorrelated with the number of measurements (features) available, and the number of subjects in these cohorts was small relative to the number of measurements.

In one application concerning pregnant women, we integrated seven high-throughput biological modalities collected during term pregnancy and used IntegratedLearner to evaluate the predictive power of both individual and combined datasets for estimating gestational age from biological signals. We found that the integrated model learned from multiple omics datasets typically performed better than models using only one type of omics data. Our study provided a conceptual and analytical framework to analyze the complex interplay between the various biological modalities that govern preterm birth and other pregnancy-related pathologies. Given the racial disparities in pregnancy outcomes, replicating this analysis using IntegratedLearner in more diverse cohorts is crucial going forward.

In another application involving inflammatory bowel diseases (IBD), several novel microbial species emerged as “top hits”. These broadly manifested as a characteristic increase in facultative anaerobes at the expense of obligate anaerobes, in agreement with the previously observed depletion of butyrate producers in IBD, a pattern not captured by the original study’s univariate approach. IntegratedLearner also identified several biochemical and functional associations, such as specific literature-curated metabolites, including an enrichment of bile acid-associated products.

Unlike published methods that summarize predictions using a single point estimate, IntegratedLearner yields uncertainty estimates (i.e., credible intervals) for the predictions and model parameters, in addition to reporting a small set of interpretable features for follow-up experiments. For example, in the IBD multi-omics study, we observed very wide 95% credible intervals for both the IBD predictions and the feature importance scores. Such wide intervals highlight the importance of quantifying uncertainty in the presence of substantial population heterogeneity, where summarizing a prediction with a single point estimate can be misleading.
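To make the idea of a credible interval concrete, here is a minimal Python sketch. It is not the IntegratedLearner implementation: the posterior draws are simulated, and the function name `credible_interval` and the two hypothetical subjects are assumptions made purely for illustration. It shows how two predictions with nearly the same point estimate can carry very different amounts of uncertainty.

```python
import math
import random
import statistics


def credible_interval(draws, level=0.95):
    """Equal-tailed credible interval taken from empirical quantiles of posterior draws."""
    if not draws:
        raise ValueError("need at least one posterior draw")
    ordered = sorted(draws)
    tail = (1.0 - level) / 2.0
    lo_idx = math.floor(tail * (len(ordered) - 1))
    hi_idx = math.ceil((1.0 - tail) * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]


def clip01(x):
    """Keep a simulated probability inside [0, 1]."""
    return min(max(x, 0.0), 1.0)


rng = random.Random(0)

# Simulated (hypothetical) posterior draws of a predicted disease probability
# for two subjects: same centre, very different spread.
narrow_draws = [clip01(rng.gauss(0.6, 0.02)) for _ in range(4000)]
wide_draws = [clip01(rng.gauss(0.6, 0.20)) for _ in range(4000)]

narrow_mean = statistics.mean(narrow_draws)
wide_mean = statistics.mean(wide_draws)
narrow_ci = credible_interval(narrow_draws)
wide_ci = credible_interval(wide_draws)

for name, mean, ci in [("narrow", narrow_mean, narrow_ci), ("wide", wide_mean, wide_ci)]:
    print(f"{name}: point estimate {mean:.2f}, 95% interval ({ci[0]:.2f}, {ci[1]:.2f})")
```

Reporting only the two point estimates would make these subjects look alike, whereas the intervals show that the second prediction is far less certain.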

These results further suggest that, when building an integrated machine learning model, the trade-off between interpretability, scalability, and uncertainty quantification must be considered alongside high prediction accuracy. Multimodal integration thus holds significant potential to revolutionize healthcare predictive modeling and omics-based biomarker discovery research, which could lead to improved diagnostics and therapeutics, benefiting society at large.

