This paper addresses a problem that is increasingly common in official statistics: granular statistics are needed on a set of variables that are only partially collected across distinct studies, in order to help answer scientific and policy questions. We develop an innovative statistical data integration methodology based on multilevel modeling and illustrate it with an important real example. In this example, two surveys collect data on the components of employee compensation, wages and benefits, which serve as key inputs for producing economic indicators. Both surveys collect data on wages, but only the smaller survey collects data on benefits. Employee compensation estimates are desired for all of the population subgroups represented by at least one of the surveys.
This study proposes a statistical model to integrate the two surveys’ data. The proposed model reconciles the two surveys’ estimates for the common variable and uses the relationship between the two variables of interest to enable estimation of both variables’ means for all of the subgroups represented in at least one of the surveys. The precision of each variable’s estimate in each subgroup is improved by “borrowing strength” from data available for the other variable and from data available in other subgroups.
An alternative approach is considered, wherein fine subgroup estimates are obtained by first using the small survey to estimate benefit-to-wage ratios for large, aggregated groups and then multiplying these ratios by fine subgroup estimates of average wages obtained from the large survey. Both approaches are shown to yield reasonable point estimates for the application, but the statistical model results in more precise estimates and can readily be extended to more than two variables.
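The ratio-based alternative described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the record layout, the subgroup-to-group mapping, and the toy numbers are all hypothetical, survey weights are ignored, and total compensation is taken to be wage plus benefits.

```python
from collections import defaultdict


def ratio_based_estimates(small_survey, large_wage_means, group_of):
    """Hypothetical sketch of the ratio-based alternative.

    small_survey     : iterable of (subgroup, wage, benefits) records
    large_wage_means : dict mapping fine subgroup -> mean wage (large survey)
    group_of         : dict mapping fine subgroup -> aggregated group
    Assumes every aggregated group has at least one small-survey record
    and ignores survey weights for simplicity.
    """
    wage_total = defaultdict(float)
    benefit_total = defaultdict(float)

    # Step 1: pool the small survey within each large, aggregated group.
    for subgroup, wage, benefits in small_survey:
        group = group_of[subgroup]
        wage_total[group] += wage
        benefit_total[group] += benefits

    # Step 2: one benefit-to-wage ratio per aggregated group.
    ratios = {g: benefit_total[g] / wage_total[g] for g in wage_total}

    # Step 3: apply each group's ratio to the fine-subgroup mean wages
    # obtained from the large survey.
    estimates = {}
    for subgroup, mean_wage in large_wage_means.items():
        mean_benefits = ratios[group_of[subgroup]] * mean_wage
        estimates[subgroup] = {
            "wage": mean_wage,
            "benefits": mean_benefits,
            # Assumption: compensation = wage + benefits.
            "compensation": mean_wage + mean_benefits,
        }
    return estimates


# Toy data (made up for illustration only).
small_survey = [("A1", 20.0, 5.0), ("A2", 30.0, 10.0), ("B1", 40.0, 10.0)]
group_of = {"A1": "A", "A2": "A", "A3": "A", "B1": "B"}
large_wage_means = {"A1": 22.0, "A3": 28.0, "B1": 40.0}

estimates = ratio_based_estimates(small_survey, large_wage_means, group_of)
for subgroup, values in sorted(estimates.items()):
    print(subgroup, values)
```

In the toy data, subgroup A3 appears only in the large survey, yet it still receives a benefits estimate through the ratio of its aggregated group. The sketch yields point estimates only and says nothing about their precision, which is where the summary above reports the model-based approach doing better.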
This work bridges the statistical fields of data integration and small area estimation. Although it is motivated by a statistical data integration problem, it builds upon small area estimation approaches developed to yield precise survey estimates for population subgroups with small sample sizes. The proposed model also makes a novel contribution to small area estimation by using unmatched sets of population subgroups in the model specification. This work represents an innovative synthesis of methods from these two fields and has the potential to improve official statistics related to employee compensation.
Layman’s abstract for Canadian Journal of Statistics article on Statistical data integration using multilevel models to predict employee compensation
The article featured today is from the Canadian Journal of Statistics, and the full article is now available to read at the DOI link in the citation below.
Erciulescu, A.L., Opsomer, J.D. and Schneider, B.J. (2022), Statistical data integration using multilevel models to predict employee compensation. Can J Statistics. https://doi.org/10.1002/cjs.11688