Likelihood inflating sampling algorithm

News

• Author: Reihaneh Entezari, Radu V. Craiu and Jeffrey S. Rosenthal
• Date: 05 December 2018

Markov Chain Monte Carlo (MCMC) sampling from a posterior distribution corresponding to a massive data set can be computationally prohibitive as producing one sample requires a number of operations that is linear in the data size. In an article published in The Canadian Journal of Statistics, the authors introduce a new communication‐free parallel method, the “Likelihood Inflating Sampling Algorithm (LISA),” that significantly reduces computational costs by randomly splitting the data set into smaller subsets and running MCMC methods “independently” in parallel on each subset using different processors.

The article is available via this link and the authors explain their findings below:

Likelihood inflating sampling algorithm

Reihaneh Entezari, Radu V. Craiu and Jeffrey S. Rosenthal

The Canadian Journal of Statistics, Volume 46, Issue 1, Special Issue on Big Data and the Statistical Sciences, March 2018, pages 147-175

The class of Markov chain Monte Carlo (MCMC) algorithms represents an important tool for statistical computation. Applications requiring the statistical analysis of big data have shown that some of the most widely used MCMC samplers can become prohibitively expensive computationally when the data sample is massive. A promising approach for dealing with statistical analysis of big data relies on the ubiquity of parallel processing capabilities in the modern computational environment. Specifically, the data are divided into smaller sub-samples and each sample is assigned to a different processor for analysis.. Once the sub-sample analyses are completed, the results are combined to produce conclusions for the entire data set. In the MCMC setting, of paramount importance is a) the construction of sub-sample specific distributions that can be sampled independently via MCMC using a separate machine, and b) the strategies for combining the Monte Carlo samples obtained on each processor.

The paper proposes solutions to a) and b) for the important class of Bayesian Additive Regression Trees (BART). The latter is a flexible model for non-linear regression that has been widely used in applications and is considered one of the best modern predictive methods. The BART model considers the sum of decision trees with their parameters simultaneously estimated for predictive accuracy. BART’s predictive power crucially depends on MCMC ability to search through the parameter space. This paper offers a reliable communication-free method to divide the computational effort required by BART among parallel processors.

View all

View all