
The article featured today is from Statistics in Medicine, with the full Open Access article now available to read here.
A new regression model for overdispersed binomial data accounting for outliers and an excess of zeros. Statistics in Medicine. 2021; 40: 3895-3914. https://doi.org/10.1002/sim.9005
This study proposes a new regression model for binomial outcomes (i.e., the number of successes among a fixed number of trials), which are extremely common in biomedical research.
Despite its popularity, binomial regression often fails to model this kind of data accurately because of overdispersion (i.e., the data show a larger variance than the model assumes). Although several factors can cause overdispersion, most are connected with the failure of the i.i.d. assumption on the individual binary responses that make up the binomial count.
Many alternatives can be found in the literature, the beta-binomial regression model being one of the most popular. The additional parameter of this model enables a better fit to overdispersed data and has an attractive interpretation as the intraclass correlation coefficient. Nonetheless, in many real data applications, a single additional parameter cannot handle the entire excess of variability.
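To make the role of the extra parameter concrete, here is a minimal sketch of the standard beta-binomial distribution written in terms of a mean `mu` and an intraclass correlation `rho`. The function names and the example values (`n = 20`, `mu = 0.3`, `rho = 0.1`) are illustrative assumptions, not taken from the paper. It checks numerically that the variance equals the binomial variance inflated by the factor 1 + (n - 1) * rho.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(k, n, mu, rho):
    """Beta-binomial pmf with mean n*mu and intraclass correlation rho (0 < rho < 1)."""
    # Convert (mu, rho) to the usual shape parameters; rho = 1 / (a + b + 1).
    a = mu * (1 - rho) / rho
    b = (1 - mu) * (1 - rho) / rho
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))


def moments(pmf_values):
    """Mean and variance of a distribution on 0..len(pmf_values)-1."""
    mean = sum(k * p for k, p in enumerate(pmf_values))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf_values))
    return mean, var


# Illustrative values only.
n, mu, rho = 20, 0.3, 0.1
pmf = [beta_binomial_pmf(k, n, mu, rho) for k in range(n + 1)]
bb_mean, bb_var = moments(pmf)

binomial_var = n * mu * (1 - mu)                      # 4.2
closed_form_var = binomial_var * (1 + (n - 1) * rho)  # 12.18

print(f"mean = {bb_mean:.4f}, variance = {bb_var:.4f}")
print(f"binomial variance = {binomial_var:.4f}, inflated = {closed_form_var:.4f}")
```

With these values the binomial variance of 4.2 is inflated to 12.18, yet a single `rho` scales the variance in the same way for every observation, which is the limitation noted above.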
This study proposes a new finite mixture distribution with beta-binomial components, namely the flexible beta-binomial, which is characterized by a richer parametrization. This richer parametrization enhances the variance structure so that it can account for multiple causes of overdispersion, while preserving the appealing intraclass correlation interpretation.
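The way a mixture enriches the variance structure can be sketched with the law of total variance. The example below uses a generic two-component beta-binomial mixture with a shared intraclass correlation; this is an assumption made for illustration, and the paper's exact parametrization of the flexible beta-binomial is not reproduced here. All names and numbers are hypothetical.

```python
def bb_mean_var(n, mu, rho):
    """Closed-form mean and variance of a beta-binomial with mean mu and ICC rho."""
    mean = n * mu
    var = n * mu * (1 - mu) * (1 + (n - 1) * rho)
    return mean, var


def mixture_mean_var(n, weight, mu1, mu2, rho):
    """Mean and variance of a two-component beta-binomial mixture.

    Law of total variance: weighted within-component variance plus the
    variance between the component means.
    """
    m1, v1 = bb_mean_var(n, mu1, rho)
    m2, v2 = bb_mean_var(n, mu2, rho)
    mean = weight * m1 + (1 - weight) * m2
    within = weight * v1 + (1 - weight) * v2
    between = weight * (1 - weight) * (m1 - m2) ** 2
    return mean, within + between


# Illustrative values only (generic mixture, not the paper's parametrization).
n, rho = 20, 0.1
mix_mean, mix_var = mixture_mean_var(n, weight=0.7, mu1=0.2, mu2=0.6, rho=rho)

# A single beta-binomial with the same overall mean and the same rho.
_, single_var = bb_mean_var(n, mix_mean / n, rho)

print(f"mixture: mean = {mix_mean:.3f}, variance = {mix_var:.3f}")
print(f"single beta-binomial with same mean: variance = {single_var:.3f}")
```

The between-component term is what a single beta-binomial cannot supply: for the same mean and the same `rho`, the mixture carries extra variance from the separation of its component means.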
The novel regression model, based on the flexible beta-binomial distribution, exploits the wide variety of shapes the distribution can take (including bimodality and various tail behaviors). Thus, it succeeds in accounting for several (possibly concomitant) sources of overdispersion, including latent groups in the population, outliers, and an excess of zeros. This is possible because the new model dedicates one of its mixture components to a particular group of observations (e.g., zero values and/or outliers) automatically and only when necessary, providing useful information about the possible causes of the excess variability.
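One way to picture a component being "dedicated" to a group of observations is through the component membership probabilities given by Bayes' rule. The sketch below again assumes a generic two-component beta-binomial mixture with fixed, hypothetical parameter values, in which the second component is concentrated near zero. It is not the paper's model or its estimation procedure.

```python
import math


def beta_binomial_pmf(k, n, mu, rho):
    """Beta-binomial pmf with mean n*mu and intraclass correlation rho."""
    a = mu * (1 - rho) / rho
    b = (1 - mu) * (1 - rho) / rho
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_num = math.lgamma(k + a) + math.lgamma(n - k + b) - math.lgamma(n + a + b)
    log_den = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp(log_choose + log_num - log_den)


def responsibilities(k, n, weights, mus, rho):
    """Probability that count k came from each mixture component (Bayes' rule)."""
    joint = [w * beta_binomial_pmf(k, n, mu, rho) for w, mu in zip(weights, mus)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical parameters: a main component and one concentrated near zero.
n, rho = 20, 0.05
weights = [0.8, 0.2]
mus = [0.4, 0.02]

resp_zero = responsibilities(0, n, weights, mus, rho)     # an observed zero
resp_typical = responsibilities(8, n, weights, mus, rho)  # a typical count

print("k = 0:", [round(r, 4) for r in resp_zero])
print("k = 8:", [round(r, 4) for r in resp_typical])
```

Under these assumed values, an observed zero is attributed almost entirely to the near-zero component, while a typical count is attributed to the main component. In a fitted model, such membership probabilities indicate which observations drive the excess variability.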
Estimation is carried out with a Bayesian approach based on the Hamiltonian Monte Carlo algorithm. The paper includes an intensive simulation study showing that the new regression model outperforms existing ones. Its better performance is also confirmed by three applications to real datasets extensively studied in the biomedical literature, namely (i) parasitized eggs, (ii) cells with chromosomal abnormalities due to the atomic bombs exploded in Hiroshima and Nagasaki, and (iii) fetal deaths in control mice litters.