How to solve a problem like missing data

Authors: Alex Sutherland and Catherine Saunders

Missing data is a challenge for statisticians, policymakers and analysts, particularly when a robust evidence base is needed. Data are typically missing for three key reasons: improper data collection, mistakes in the data, and non-response, where the data simply do not exist. The Second Longitudinal Study of Young People in England (LSYPE2), research designed to understand the compulsory education, school-to-work transitions, careers and lives of young people in England, suffers from the last of these.

The overall aim of the study is to produce a dataset that can serve as a resource for evidence-based policy development. However, a significant barrier to this aim is that, on top of the more ‘run of the mill’ missingness (the manner in which data is missing from a sample of a population) that bedevils longitudinal studies, LSYPE2 has systematically incomplete data owing to a boycott of Key Stage 2 (KS2) testing in 2010, before the study began. Boycotts of national tests leave gaps in pupils’ attainment records and, in the case of LSYPE2, threaten to undermine a large-scale (and expensive) longitudinal study with substantial policy relevance. In LSYPE2, KS2 data was missing for approximately 30 per cent of the cohort.

There are two key concerns associated with missing data. The first is bias. If the young people surveyed in LSYPE2 who have missing KS2 test results are systematically different from those with KS2 test results, then results based only on those young people with available test results may differ (be biased up or down) from the results that would have been obtained had KS2 tests been available for all cohort members.

The second concern is efficiency. If researchers used only the data for pupils with an available KS2 test score, a large amount of information would have to be dropped from analyses, leading to greater uncertainty about results and larger standard errors.

In light of these challenges, the UK’s Department for Education commissioned RAND Europe to provide a statistically robust, unbiased (both statistically and politically) and consistent approach to addressing the missing KS2 data. The aim was to find a way to include pupils who attended schools that boycotted KS2 tests in 2010 and to mitigate the effect of the boycott on the study. Two initial steps were taken before the analysis.

The first step was ‘thinking through missingness’ in the sample – to paraphrase Rudyard Kipling, asking about the what, why, when, how, where and who. This meant hypothesising about the nature of the missingness: where and how it arose, describing what data might be missing in the intended analysis sample, then considering reasons why the data might be missing. Missing data might arise not just from the KS2 boycott but also for other reasons, such as attrition at later survey waves or a lack of consent to data linkage.

The second step was to describe the differences between samples with and without missing data. This meant looking at any differences in the characteristics of the young people – and schools – affected by the boycott. We found that boycott schools were more likely to serve deprived populations, with higher proportions of pupils who spoke English as an additional language and lower attainment. However, after accounting for the school-level sampling of the cohort, we found very few pupil-level predictors of missingness (which on reflection is unsurprising, as the missingness mechanism operates at the school level: schools, rather than pupils, made the decision to boycott).
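This kind of descriptive comparison can be sketched in a few lines. The data below are entirely invented (illustrative deprivation and boycott rates, not LSYPE2 figures); the point is simply how one might compare the characteristics of pupils with and without an observed KS2 score.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical pupil-level data: a deprivation flag, and a school boycott
# that is more likely for deprived pupils (all rates invented).
n = 1000
deprived = rng.random(n) < 0.3
boycott = rng.random(n) < (0.2 + 0.2 * deprived)

# KS2 scores are unobserved for pupils in boycott schools
ks2 = rng.normal(27, 4, n)
ks2[boycott] = np.nan
missing = np.isnan(ks2)

# Compare the characteristics of pupils with and without observed scores
print(f"deprived share: observed={deprived[~missing].mean():.2f}, "
      f"missing={deprived[missing].mean():.2f}")
```

Under this data-generating process the missing group has a visibly higher deprivation rate than the observed group, mirroring the school-level pattern described above.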

Following these initial steps, we used both multiple imputation (MI) and inverse probability weighting (IPW) to address missing KS2 data related to the boycott. MI replaces missing values with ‘plausible’ substitutes drawn from the distribution of the observed data, generating multiple versions of the imputed dataset so that the random variation between versions reflects uncertainty about the missing values. Each imputed dataset is then analysed, and the estimates are combined using Rubin’s rules, which pool the individual estimates and standard errors from the imputed datasets into a single overall MI estimate that yields valid statistical results. Alternatively, a model describing the predictors of missing data can be estimated from the observed data and used to generate weights in the analysis that correct for missingness – the so-called IPW approach.
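A minimal numerical sketch of the two approaches may help. Everything here is invented for illustration (a single covariate standing in for prior attainment, made-up coefficients, a simple normal imputation model, and the true missingness probability used directly for the weights); it is not the LSYPE2 specification.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated cohort: a fully observed covariate x predicts the outcome y;
# y is missing more often for pupils with high x (missing at random).
n = 2000
x = rng.normal(0, 1, n)
y = 27 + 2 * x + rng.normal(0, 2, n)
p_miss = 1 / (1 + np.exp(-(-1.5 + 0.8 * x)))
miss = rng.random(n) < p_miss
obs = ~miss

# --- Multiple imputation (simplified normal regression imputation) ---
M = 20
slope_int = np.polyfit(x[obs], y[obs], 1)                 # fit on observed data
resid_sd = np.std(y[obs] - np.polyval(slope_int, x[obs]))
est, var = [], []
for _ in range(M):
    y_imp = y.copy()
    # 'plausible' substitutes: predicted mean plus random noise
    y_imp[miss] = np.polyval(slope_int, x[miss]) + rng.normal(0, resid_sd, miss.sum())
    est.append(y_imp.mean())
    var.append(y_imp.var(ddof=1) / n)                     # within-imputation variance
est, var = np.array(est), np.array(var)
mi_mean = est.mean()                                      # Rubin's rules: pooled estimate
mi_se = np.sqrt(var.mean() + (1 + 1 / M) * est.var(ddof=1))  # within + between variance

# --- Inverse probability weighting: weight observed pupils by 1/P(observed) ---
# (in practice P(observed) is estimated, e.g. by logistic regression)
ipw_mean = np.average(y[obs], weights=1 / (1 - p_miss[obs]))

print(f"complete-case {y[obs].mean():.2f}, "
      f"MI {mi_mean:.2f} (SE {mi_se:.3f}), IPW {ipw_mean:.2f}")
```

Because missingness depends on x, the complete-case mean is pulled below the true value of 27, while both MI and IPW recover estimates close to it; the Rubin's-rules standard error combines the average within-imputation variance with the between-imputation variance.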

Comparing the mean KS2 average point score from a complete-case analysis with the MI estimate (27.2 vs 27.1 in data from three survey waves), and the complete-case estimate with the IPW estimate (26.9 vs 27.0 in data from one survey wave only), we found that while all the estimates are very close, with no substantive differences in the point estimates, their standard errors differ: largest for IPW and smallest for MI. This is expected given the additional information used by MI. Further exploration using a limited simulation study confirmed that the increase in standard errors associated with IPW was indeed plausible.
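A toy version of such a simulation study can be run in a few lines. The data-generating process and sample sizes here are invented, not those of the actual study, and single regression imputation stands in for MI's pooled point estimate; the idea is simply to draw repeated samples and compare the empirical spread of the estimators.

```python
import numpy as np

rng = np.random.default_rng(7)

def one_replicate(n=500):
    """One simulated sample: return complete-case, IPW and MI-like estimates."""
    x = rng.normal(0, 1, n)
    y = 27 + 2 * x + rng.normal(0, 2, n)
    p_obs = 1 / (1 + np.exp(-(1.5 - 0.8 * x)))        # P(observed) depends on x
    obs = rng.random(n) < p_obs
    cc = y[obs].mean()                                 # complete-case mean
    ipw = np.average(y[obs], weights=1 / p_obs[obs])   # inverse probability weighting
    slope, intercept = np.polyfit(x[obs], y[obs], 1)   # regression imputation
    y_fill = np.where(obs, y, slope * x + intercept)
    return cc, ipw, y_fill.mean()

results = np.array([one_replicate() for _ in range(1000)])
sds = results.std(axis=0, ddof=1)
print(f"empirical SDs: complete-case={sds[0]:.3f}, IPW={sds[1]:.3f}, MI-like={sds[2]:.3f}")
```

Across replicates, the IPW estimates spread more widely than the imputation-based ones, which is consistent with the pattern of standard errors reported above.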

The intention of this work was to leave other researchers with options beyond complete-case analysis when deciding how to deal with missing data in LSYPE2. However, it is not simply a matter of choosing one approach or the other and hoping for the best. Analysts must be clear about what they want to achieve with their analysis. Comparing complete-case analysis (i.e. an analysis of the smaller sample of only those observations with no missing data) with MI suggests that MI is more efficient, with smaller standard errors, so MI should be preferred if statistical inference is the aim of a given analysis.

However, analysts should consider whether it is appropriate to use the MI datasets for their analyses. In particular, if an analysis involves variables not included in the imputation model, the MI datasets should be used with caution. A second concern is the multilevel nature of the data, which is difficult to incorporate into non-technical MI solutions. There is a difficult balance to strike between solutions that are ‘easy to use’ and robust enough for policy analysts who are not necessarily statisticians, and more methodologically rigorous but highly technical approaches that would be difficult for non-statisticians to decipher.

Our work addressing missing data following the boycott of KS2 tests is just one example of tackling this problem. Ultimately, it is up to analysts to decide on their own approaches and, most importantly, to understand the particular features of their own datasets when they encounter missing data. However, our work and reports provide a number of useful recommendations and tips that statisticians and analysts can draw on in the future.

Alex Sutherland is a research leader at RAND Europe and Catherine Saunders is a statistician working in the Cambridge Centre for Health Services Research. Both were involved in the ‘Missing Data in the Second Longitudinal Study of Young People in England (LSYPE2)’ report for the UK Department for Education.