Earlier this year, Wiley was proud to publish Multiple Imputation and its Application, a practical guide to analysing partially observed data. It is written by Professor James Carpenter, Head of MSD and Professor of Medical Statistics and Programme Leader in Methodology, MRC Clinical Trials Unit, and Professor Mike Kenward, Professor of Biostatistics, both of the London School of Hygiene and Tropical Medicine.
Collecting, analysing and drawing inferences from data is central to research in the medical and social sciences. Unfortunately, it is rarely possible to collect all the intended data. The literature on inference from the resulting incomplete data is now huge, and continues to grow both as methods are developed for large and complex data structures, and as increasing computer power and suitable software enable researchers to apply these methods.
This book focuses on a particular statistical method for analysing and drawing inferences from incomplete data, called Multiple Imputation (MI). MI is attractive because it is both practical and widely applicable. The authors’ aim is to clarify the issues raised by missing data, describing the rationale for MI, the relationship between the various imputation models and associated algorithms, and its application to increasingly complex data structures.
Multiple Imputation and its Application:
• Discusses the issues raised by the analysis of partially observed data, and the assumptions on which analyses rest.
• Presents a practical guide to the issues to consider when analysing incomplete data from both observational studies and randomized trials.
• Provides a detailed discussion of the practical use of MI with real-world examples drawn from medical and social statistics.
• Explores handling non-linear relationships and interactions with multiple imputation, survival analysis, multilevel multiple imputation, sensitivity analysis via multiple imputation, using non-response weights with multiple imputation and doubly robust multiple imputation.
Multiple Imputation and its Application is aimed at quantitative researchers and students in the medical and social sciences. It sets out to clarify the issues raised by the analysis of incomplete data, outline the rationale for MI and describe how to consider and address the issues that arise in its application.
Statistics Views talks to Professor Carpenter about this collaborative work and the importance of multiple imputation in statistics.
1. Congratulations on the publication of Multiple Imputation and its Application and its positive reception. You both work at the London School of Hygiene and Tropical Medicine. Is this where you first met and what made you decide to work together on this book?
It is where we first met, more or less! I joined the School in 1998 having studied for my PhD at Oxford University. Mike was previously at the University of Kent and he came here in 1999. When I started at the School, I began working on clinical trials, including one of the examples we use in the book. Mike had also worked in this area so it was natural that we should collaborate. We had a series of projects to do with clinical trials and observational data, and one of the tools that we explored and developed was multiple imputation. There was a lot of interest in the courses we subsequently ran, and – particularly from 2006 to 2010 – we travelled and presented many courses. These courses formed the backbone of what we put into the book.
2. What were your main objectives during the writing process? What did you set out to achieve in reaching your readers?
What we hoped to do was to set out the ideas that underpin multiple imputation and show how it could be applied in a range of settings. Multiple imputation is a method for analysing data which have a non-trivial proportion of missing values. We wanted to show how multiple imputation can be applied not only in relatively straightforward settings but also in more complicated ones, where relationships between variables are not simply linear, and/or the data have a hierarchical structure. Hierarchical, or multi-level, data arise in many settings, such as when we have repeated observations on patients registered with different general practices. We also wanted to describe how multiple imputation can be used with survey weights, and hence in the survey sample setting.
Our aim was to set out the underlying principles so that readers can understand their rationale and then adapt and apply them to their own analyses with some confidence.
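As a rough illustration of the general idea (not taken from the book or from the authors' software), a multiple imputation analysis fits the same model to each of several imputed datasets and then combines the results. The sketch below shows only that final combining step, using the standard rules usually attributed to Rubin. The function name `pool_rubin` and the example numbers are hypothetical.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Combine one parameter's results from M imputed datasets.

    estimates: the M point estimates, one per imputed dataset.
    variances: the M squared standard errors, one per imputed dataset.
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least two imputations")

    pooled = statistics.fmean(estimates)        # average of the M estimates
    within = statistics.fmean(variances)        # average within-imputation variance
    between = statistics.variance(estimates)    # between-imputation variance (divisor M - 1)
    total = within + (1 + 1 / m) * between      # total variance of the pooled estimate
    return pooled, math.sqrt(total)


# Hypothetical results from M = 3 imputed datasets.
estimate, std_error = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(f"pooled estimate = {estimate:.3f}, standard error = {std_error:.3f}")
```

The between-imputation term is what lets the pooled standard error reflect the extra uncertainty due to the missing values, rather than treating the imputed values as if they had been observed.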
3. The book is divided into three parts: foundations, MI for cross-sectional data, and advanced topics. Were there areas that you found more challenging to write and if so, why?
Different parts of each chapter had their own difficulties! Some parts of the book follow quite closely from papers we have both published so that was slightly easier. The first chapter was one of the hardest to write because we were trying to explain the key ideas in a way that was as accessible as possible while simultaneously trying to lay the foundations for the whole book.
The other chapter that was quite hard for me was the chapter on hierarchical data, because working out how to do imputation in a hierarchical setting involves some tricky algebra.
The chapter that was the most fun to write was the chapter on sensitivity analysis where we look at how conclusions from analyses of partially observed data vary as you change some of the assumptions that you are making about the missing data.
4. Who was the target audience you had in mind when writing this?
In my mind’s eye I was writing for researchers who had been on the courses that we have been teaching. Over the years this has been quite a mix! Some have been quite established researchers; some have a quantitative and mathematical background; some less so.
Experienced researchers want an overview of the concepts and issues you have to think about when analysing a dataset with a nontrivial number of missing observations, and also of the methods and software available. Junior researchers and PhD students are typically the ones in the project team who are actually doing the analyses, so we wanted to give plenty of examples of how analyses with missing data can be done and what the key issues are that you have to think about. We were keen not simply to present a recipe book, but also the rationale, as this is key to adapting the methods to new situations.
The deeper you delve, the more technical the arguments become. The book has reflected the questions that people attending the courses have asked us over the years. Thus we have touched upon a number of different ways of approaching the problems and tried to relate those to the software packages that are out there.
One point where this book is different from others is that it’s not tied to any one particular software package or one particular algorithm for performing multiple imputation. We have our own software, which we have used, but we have also compared and contrasted it with other software and algorithms.
5. Throughout the book, the concepts are illustrated with real data examples. Were these from your own research or teaching, or did you devise them especially for the book?
They are almost exclusively from our own research and teaching. Some of them have arisen as a result of questions people asked us on courses. For example, one study which we talk about quite early on was a trial that I was involved in analyzing initially in the early 2000s – Mike and I have then gone on to use that example to illustrate a whole range of ideas over the last ten years. The example that comes in Chapter 9 involving data from paediatric admissions in Kenyan hospitals came from a collaboration that I am involved in with a colleague here at the LSHTM and with researchers in Nairobi. The cancer registry data came from collaboration with colleagues at the LSHTM who conduct a lot of work on the relative survival of populations. The class size data – on trying to understand if children in Reception and Year One do better if they have smaller class sizes – also came from a collaboration with Harvey Goldstein and colleagues at the Institute of Education; and so on…
In summary, the examples have not been devised especially for the book – rather most have been published elsewhere or are in publications that are currently in progress, so the book is very much rooted in real-life statistical practice.
6. How did the cover process come together?
In 2007 or 2008, one of our colleagues, Harvey Goldstein, was travelling in Hong Kong and snapped this tableau in the New Territories. Harvey thought this picture, with its damaged figurines, was a novel example of missing data. When I saw the photo, I realised that a number of the concepts that Mike and I wanted to talk about, especially in the early part of the book, could be illustrated without any algebra, just by looking at the picture.
For example, the extent of the problem caused by missing observations depends on the question you are asking. So, looking at this tableau, if you are just interested in the number of figurines, then the fact that some are missing parts is neither here nor there; the missing data will not cause you any difficulties. However, if you wish to comment on the headdress of figurines with missing heads, then the fact that they are missing is obviously a problem. One way to tackle that is to assume that the headdresses of the figurines with missing heads are similar to those of figurines which retain their heads and are wearing similar clothes! So you make little subgroups of figurines who are similarly dressed and assume that the headdress for the missing head would likely be similar to those of figurines wearing similar clothes. That is an example of what is known as data ‘missing at random’. What we do in the book is to explain this, and related concepts, graphically and then explore how these work out in statistical analyses, such as regression analyses.
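The subgroup reasoning in the figurine example can be sketched in a few lines of code. This is a deliberately simplified illustration, not the authors' method: it fills each missing headdress with a random draw from observed figurines wearing the same clothes. The function name `impute_within_groups`, the field names and the toy records are all hypothetical. A full multiple imputation would repeat such draws several times, also allow for uncertainty in the imputation model, and then combine the analyses.

```python
import random
from collections import defaultdict


def impute_within_groups(records, group_key, target_key, rng):
    """Fill missing target values with a draw from observed values in the same group.

    This relies on the 'missing at random' assumption: within a group, units with
    a missing value are assumed to resemble units whose value was observed.
    """
    # Collect the observed values ("donors") for each group.
    donors = defaultdict(list)
    for rec in records:
        if rec[target_key] is not None:
            donors[rec[group_key]].append(rec[target_key])

    completed = []
    for rec in records:
        filled = dict(rec)  # copy, so the original records are left untouched
        if filled[target_key] is None:
            pool = donors.get(filled[group_key])
            if pool:  # if a group has no donors, the value stays missing
                filled[target_key] = rng.choice(pool)
        completed.append(filled)
    return completed


# Toy data: None marks a figurine whose head (and so headdress) is missing.
figurines = [
    {"clothes": "robe", "headdress": "crown"},
    {"clothes": "robe", "headdress": None},
    {"clothes": "armour", "headdress": "helmet"},
    {"clothes": "armour", "headdress": "helmet"},
    {"clothes": "armour", "headdress": None},
]

completed = impute_within_groups(figurines, "clothes", "headdress", random.Random(0))
for row in completed:
    print(row)
```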
Many books about missing data have a cover image with dots and question marks. We wanted to be more original, and this photograph captured all the ideas we wanted to talk about. It also catches your eye and draws you in, at least that’s our hope!
7. Do you have any plans on writing together in the future? What will be your next book-length undertaking?
We don’t have any firm plans at the moment. We are working on various papers but it is very likely that something might happen. Mike is in the office immediately next door to me and we enjoy collaborating.
8. How did you begin to pursue a career in statistics and what was it that brought you to recognise statistics as a discipline in the first place?
My father was a medical statistician and he really enjoyed his career. Even in retirement, he is still involved with medical research. I studied maths and, not wishing to work in the financial sector, thought it would be interesting to try an MSc in medical statistics. After completing that, I had the opportunity to study for a PhD and it went on from there. I’ve always been interested in medical applications and I enjoy statistics because it is motivated by tackling real-world problems.
Just as the examples in the book are very varied, so my professional life is varied – meeting many stimulating researchers and tackling a range of different problems, alongside teaching and writing.
9. As a university professor, what do you think the future of teaching statistics will be? What do you think will be the upcoming challenges in engaging students?
One significant development will be more distance learning courses – the quality of experience that you can obtain in distance learning has dramatically improved over the last few years, so you can now fairly easily not only have a book, but also see the lecture delivered and also interact with your fellow students either one-to-one or in a group discussion.
At the LSHTM we have a number of distance learning courses which have proved popular, especially (but not exclusively) for those in low income countries who struggle to get the money together to study in the UK. This is therefore a great way for the School to fulfill its educational mission.
Despite the development in distance learning technology, though, I believe the experience of coming and studying in London for a year on a Masters course is unbeatable. Our MSc in Medical Statistics has run for over 40 years now, and this year we have enrolled more students than ever before.
Another issue that has arisen in the discipline over the past 40 or so years is that it is becoming increasingly specialised – e.g. into medical statistics, genetics, social statistics, financial modelling etc. – and I think this trend is going to continue. It is unfortunate if it leads to the development of more and more specialised Masters courses, because the key statistical concepts are common to all settings, and students benefit from understanding this. Instead, I would prefer to see the development of Masters courses with a core component in the first part of the year and a choice of specialist modules thereafter.
10. Over the years, how has your teaching, consulting, and research motivated and influenced each other? Do you get research ideas from statistics and incorporate your ideas into your teaching?
All the time. Sometimes you get research ideas from your teaching and the questions people ask you. Certainly the research that Mike and I have done has influenced our teaching. For example, we run a short course at the LSHTM which is based entirely on this book. Research that we have described in the book we now teach to MSc students. There is material in the book whose development has been triggered by consultancy as well, so it is all very interlinked.
11. What do you think the most important recent developments in the field have been? What do you think will be the most exciting and productive areas of research in statistics during the next few years?
In terms of medical statistics, the use of routinely collected health data is a major new development that will continue to grow. The challenge is to use such data, collected for clinical reasons, to answer research questions. This is often not straightforward, and missing data is one of many issues that arise. Alongside, and linked to this, is the rapid development of research in genetics.
Another area where there has been a lot of development over the past ten years, which I expect to continue, is in methods for making causal inferences from observational data.
On the computational side, I think Bayesian methods will continue to be an area of considerable activity because of their flexibility for tackling diverse complex analyses.
12. What do you see as the greatest challenges facing the profession of statistics in the coming years?
Statistical analysis is central to understanding and exploiting the vast array of data that is now being assembled in a whole range of settings. It is thus a time of great opportunity for statistics as a discipline. The challenge will be to prevent statistics fragmenting into sub-disciplines.
13. Are there people or events that have been influential in your careers?
My father (also a medical statistician) was certainly influential; Professor Stuart Pocock, who offered me my first post at the LSHTM, continues to be a valued colleague and friend. Indeed it was Stuart who first suggested I look at the issues of missing data in clinical trials. I’m also indebted to my PhD supervisor, John Bithell, whose enthusiasm for medical statistics is infectious.