Every few days, we will be publishing layman’s abstracts of new articles from our prestigious portfolio of journals in statistics. The aim is to highlight the latest research to a broader audience in an accessible format.
The article featured today is from the Canadian Journal of Statistics and the full article, published in issue 47.3, is available to read online here.
Lohr, S. L., Riddles, M. K. and Brick, J. M. (2019), Goodness‐of‐fit tests for distributions estimated from complex survey data. Can J Statistics, 47: 409-425. doi: 10.1002/cjs.11501
Obesity prevalence is often assessed using population estimates of body mass index (BMI, defined as weight in kilograms divided by the square of height in metres). But comparing the percentage of persons who have BMI greater than 30 kg/m² in successive years does not give the full picture of changes in obesity. To obtain a better idea of changes, it is necessary to look at the empirical distribution function of BMI as a whole. A normal or lognormal distribution function is often used to summarize it.
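To make the distinction concrete, here is a minimal Python sketch of a survey-weighted empirical distribution function. The respondents, their survey weights, and the function names `bmi` and `weighted_ecdf` are all hypothetical and chosen for illustration; nothing here is taken from the article. It shows that the share of persons with BMI above 30 is just one point on the curve, whereas the distribution function describes every threshold at once.

```python
def bmi(weight_kg, height_m):
    """Body mass index in kg/m^2: weight divided by height squared."""
    return weight_kg / height_m ** 2


def weighted_ecdf(values, weights, x):
    """Weighted share of the sample whose value is less than or equal to x."""
    total = sum(weights)
    return sum(w for v, w in zip(values, weights) if v <= x) / total


# Hypothetical respondents: (weight in kg, height in m, survey weight).
sample = [
    (60.0, 1.70, 1200.0),
    (95.0, 1.75, 800.0),
    (72.0, 1.60, 1500.0),
    (110.0, 1.80, 500.0),
]
bmis = [bmi(w, h) for w, h, _ in sample]
weights = [sw for _, _, sw in sample]

# Estimated obesity prevalence is one minus the distribution function at 30.
obesity_share = 1.0 - weighted_ecdf(bmis, weights, 30.0)
print(f"Estimated share with BMI above 30: {obesity_share:.3f}")
```

Evaluating `weighted_ecdf` on a grid of BMI values, rather than at 30 alone, gives the full curve that the article compares with a hypothesized distribution.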
But how can one tell if the proposed distribution function fits the data from a complex survey with stratification, unequal selection probabilities, and clustering? The authors develop goodness-of-fit tests that assess whether data from a complex survey are consistent with a hypothesized probability distribution. Two situations are considered. In the first, the probability distribution is fully specified in the null hypothesis (for example, a normal distribution with mean 25 and variance 50). In the second, it is hypothesized that the data come from a particular distribution family (for example, normal) with parameters to be estimated from the data. The methods are justified theoretically and perform well in simulation studies.
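As a rough illustration of the first situation, the sketch below measures the largest gap between a survey-weighted empirical distribution function and the fully specified normal distribution with mean 25 and variance 50. This is a generic Kolmogorov-Smirnov-type distance, not the authors' test statistic, and the data, weights, and function names are hypothetical. The sketch stops at the distance: classical Kolmogorov-Smirnov critical values assume an independent, identically distributed sample, so they should not be assumed to carry over to a stratified, clustered design, and the article's design-based reference distribution is not reproduced here.

```python
import math


def normal_cdf(x, mean, variance):
    """Distribution function of a normal with the given mean and variance."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * variance)))


def weighted_ks_distance(values, weights, cdf):
    """Largest absolute gap between the weighted ECDF and a hypothesized CDF.

    The weighted ECDF is a step function, so the gap is checked just before
    and just after each jump.
    """
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    cumulative = 0.0
    distance = 0.0
    for value, weight in pairs:
        hypothesized = cdf(value)
        distance = max(distance, abs(hypothesized - cumulative / total))
        cumulative += weight
        distance = max(distance, abs(hypothesized - cumulative / total))
    return distance


# Hypothetical BMI values and survey weights, for illustration only.
bmi_values = [19.5, 22.0, 24.8, 27.3, 31.0, 36.4]
survey_weights = [900.0, 1400.0, 1100.0, 1300.0, 700.0, 400.0]

distance = weighted_ks_distance(
    bmi_values, survey_weights, lambda x: normal_cdf(x, 25.0, 50.0)
)
print(f"Largest gap from the hypothesized normal: {distance:.3f}")
```

In the second situation, the mean and variance would be estimated from the weighted data before the comparison, which changes the behaviour of any such statistic and is part of what the article addresses.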
The tests are then applied to assess whether the distribution of BMI from the US National Health and Nutrition Examination Survey is consistent with various parametric distribution functions. Although the normal, lognormal, and gamma families exhibit statistically significant lack of fit, a mixture of two lognormal distributions fits the data well.
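For readers unfamiliar with mixtures, the following sketch shows what the distribution function of a two-component lognormal mixture looks like. The parameter values are hypothetical placeholders, not the estimates reported in the article, and the function names are chosen for illustration.

```python
import math


def lognormal_cdf(x, mu, sigma):
    """Distribution function of a lognormal; mu and sigma are on the log scale."""
    if x <= 0.0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))


def lognormal_mixture_cdf(x, p, mu1, sigma1, mu2, sigma2):
    """Mixture CDF: weight p on the first component and 1 - p on the second."""
    return p * lognormal_cdf(x, mu1, sigma1) + (1.0 - p) * lognormal_cdf(x, mu2, sigma2)


# Hypothetical parameters, for illustration only.
share_below_30 = lognormal_mixture_cdf(30.0, 0.6, 3.2, 0.15, 3.45, 0.20)
print(f"Share with BMI at or below 30 under this mixture: {share_below_30:.3f}")
```

A mixture of this kind can capture a heavier or differently shaped upper tail than a single lognormal component.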