David Scott’s bestseller Multivariate Density Estimation: Theory, Practice, and Visualization clarifies modern data analysis through nonparametric density estimation, giving readers a complete working knowledge of the theory and methods.
Featuring a thoroughly revised presentation, Multivariate Density Estimation: Theory, Practice, and Visualization, Second Edition maintains an intuitive approach to the underlying methodology and supporting theory of density estimation. Including new material and updated research in each chapter, the Second Edition presents additional clarification of theoretical opportunities, new algorithms, and up-to-date coverage of the unique challenges presented in the field of data analysis.
The new edition focuses on the various density estimation techniques and methods that can be used in the field of big data. Defining optimal nonparametric estimators, the Second Edition demonstrates the density estimation tools to use when dealing with various multivariate structures in univariate, bivariate, trivariate, and quadrivariate data analysis. Continuing to illustrate the major concepts in the context of the classical histogram, Multivariate Density Estimation: Theory, Practice, and Visualization, Second Edition also features:
- Over 150 updated figures to clarify theoretical results and to show analyses of real data sets
- An updated presentation of graphic visualization using computer software such as R
- A clear discussion of selections of important research during the past decade, including mixture estimation, robust parametric modeling algorithms, and clustering
- More than 130 problems to help readers reinforce the main concepts and ideas presented
- Boxed theorems and results allowing easy identification of crucial ideas
- Figures in colour in the digital versions of the book
- A website with all data sets, including a PDF file of all figures in colour
Multivariate Density Estimation: Theory, Practice, and Visualization, Second Edition is an ideal reference for theoretical and applied statisticians, practicing engineers, as well as readers interested in the theoretical aspects of nonparametric estimation and the application of these methods to multivariate data. The Second Edition is also useful as a textbook for introductory courses in kernel statistics, smoothing, advanced computational statistics, and general forms of statistical distributions.
Statistics Views talks to author David Scott about putting together the second edition.
1. Congratulations on the publication of the second edition of Multivariate Density Estimation which clarifies modern data analysis through nonparametric density estimation for a complete working knowledge of the theory and methods. How did the book come about in the first place?
The idea for Multivariate Density Estimation originated with work on visualization at Rice University and on real data analysis at Baylor College of Medicine, where I held my first faculty position beginning in June 1976. My thesis advisors, Jim Thompson and Richard Tapia, gave lectures on the topic of nonparametric density estimation at Johns Hopkins and then published a book with Johns Hopkins Press in 1978. This book included the portion of my thesis dealing with maximum penalized likelihood density estimates, but limited to one dimension due to computational constraints. But at Baylor, I had the good fortune to collaborate with several groups working on risk factors for cardiovascular disease, resulting in publications that moved to 2-5 dimensions for analysis. I recall in particular that the journal Circulation was especially helpful during the refereeing process, since the nonparametric approach was not yet standard in the medical literature. In fact, Circulation asked us to write an editorial about the nonparametric approach, which was highly influential.
Thus the idea of a book that focused on data in dimensions 1-5 evolved naturally. On the theoretical side, the book would describe a series of papers I wrote over ten years covering properties of histograms and frequency polygons, and the new averaged shifted histogram. These estimators were all well-suited to the new massive data sets emerging at that time. On the practical side, the book would cover data ranging from biostatistics to remote sensing that I had analyzed as part of ARO grants PI’d by Jim Thompson. Then I was fortunate to get funding from NSF and ONR that allowed for purchase of expensive graphical workstations. This would provide the wherewithal for the third topic in the book, namely, the visualization of estimated densities in dimensions beyond one and two. Thus an outline of the book was roughly set by 1984.
I spent a sabbatical year at Stanford in 1985-1986, and entertained the idea of a jointly authored book, which at the end of the day did not materialize. Perhaps collaboration by internet was still not really feasible. In any case, during the summer of 1985, I taught a course in density estimation to a large group of graduate students and visiting faculty that provided the first draft. (Many of these individuals wrote influential papers afterward in their careers.) It is hard to believe it was not until 1991 that the final manuscript was submitted to Wiley. Bea Shube and then Kate Roach were the very patient Wiley editors who worked with me through the final process. I believe my book was the first in the series to include color plates.
The final push of writing took place in the summer of 1991 and literally occupied 16-18 hours per day, seven days a week. My family was incredibly supportive during this intensive period. I hope my children were not too traumatized. They seemed to enjoy bringing my meals to my office every day! Looking at their careers and lives today, I think it was all worthwhile.
2. What were the primary objectives that you had in mind when originally writing the book? How did the writing process begin?
My primary objective in writing the book was to increase the use of nonparametric methods and quality graphics within the statistical and general scientific community. When I examined the range of figures appearing in the top statistical journals, I was entirely surprised to see how most displayed either one-dimensional histograms or two-dimensional scatterplots with regression curves. I strongly believed authors should be encouraged to display higher-dimensional figures. Of course, the technology for doing so was not widely available at that time.
3. The first edition was published in 1992. For readers who have not yet been introduced to the book, what can they expect in the latest version?
The second edition of MDE only appeared in 2015, 23 years after the original. In 2004, I proposed a second edition to Steve Quigley. I had intended to complete the project in 2007, but a serious health issue arose that took several years to resolve.
However, working on the edge of technology has a number of risks. For the first edition, LaTeX was far from as useful as it is today. PlainTeX was impossibly difficult to use (unless you were Don Knuth). So I identified a TeX add-on from ArborText that worked quite well on a Sun workstation. The S language was the choice for computing and producing quite impressive graphics. In fact, a commercial add-on called Splus was even more useful. But by 2009, both ArborText and Splus were obviously dated and needed to be replaced by LaTeX and the R language. Quigley was very helpful in getting a LaTeX file for the original book. I spent two summers translating the Splus graphics functions to R. This was successful most of the time, but there were hundreds of figures in the original book.
Some of these could be (and were) improved for the second edition, and dozens of new figures were added as well. One of the little details that took an unbelievable amount of time to fix involved the LaTeX files provided. The equation numbers and such had been typed in explicitly rather than using the powerful LaTeX numbering tools. These all had to be replaced (working backwards to avoid introducing too many errors). Thus getting a “working” version of the original book in modern LaTeX with new figures extended over several years.
The material in the original edition has held up very well as modern data science has evolved quickly. Nonparametric methods are a big part of data science, and the first edition of MDE surveyed these very well, even in 1992. Thus the second edition leaves mostly intact the original material and organization, but adds 22 new subsections of material scattered among the nine chapters. Some of these investigate the theoretical underpinnings of new algorithms, while others present new graphical tools for data analysis. And even a few lingering typos and mistakes that had escaped detection were fixed.
The field of density estimation has grown exponentially, so the second edition of MDE continues to represent a rather personal view of the field and of which techniques can provide the motivated researcher with important results efficiently and effectively.
4. The book maintains an intuitive approach to the underlying methodology and supporting theory of density estimation and this new edition focuses on the various density estimation techniques and methods that can be used in the field of big data. Please could you give us a taster of such a technique?
Successful data analysis revolves around the ability to accurately depict the structure in data, and to recognize unexpected features.
The parametric approach imposes a model on the data analysis, but such models come with a fixed set of available structures. The MDE book includes over a dozen real data sets, each with unusual (often nonlinear) structure. The theoretical properties of various nonparametric density and regression algorithms are described, and then applied to these data.
5. If there is one piece of information or advice that you would want your reader to take away and remember after reading your book, what would that be?
When nonparametric methods were first introduced by Murray Rosenblatt in 1956, there was disappointment when he proved there were no unbiased density estimators. Likewise, when early investigations of the impact of the curse of dimensionality appeared, it seemed that astronomically large samples would be necessary to have even reasonably accurate estimates.
The book dispels these concerns with strong practical examples and new theoretical arguments. However, when the dimension exceeds five or six, these concerns are certainly valid.
6. Who should read the book and why?
I have had the opportunity to teach short courses from the book to a wide range of audiences: from government agencies and laboratories to statistical meetings to industry and finally to computer scientists at the 2002 KDD conference. Examining the citations to MDE in Google Scholar, we find the fields where density estimates are employed cover the entire spectrum of science and engineering disciplines, as well as the social sciences. Since the book begins with a thorough discussion of the lowly but widely used histogram, it is not surprising how frequently that material is cited.
As MDE includes an extensive collection of problems and exercises, the book has found a place not only on the bookshelf of researchers but in the classroom and among doctoral students.
7. Why is this book of particular interest now?
When the first edition of the book was written, the constraints of personal computing were acute. Memory and storage were limited, and computer chip speeds were only a fraction of what is available today. Thus I was driven for practical reasons to replace datasets of millions of points with bin counts of perhaps only one thousand.
The focus on bin-count algorithms simultaneously allowed for the analysis of very large datasets and much faster estimation. Today’s datasets are millions of times larger, and the bin-count approach is even more relevant. The bin-count approach is also perfectly suited to multivariate data and modern visualization algorithms, such as marching cubes.
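To make the bin-count idea concrete, here is a minimal R sketch of estimating a density from bin counts rather than from the raw sample, in the spirit of the averaged shifted histogram. The sample, number of bins, and smoothing parameter below are illustrative assumptions, not values or code taken from the book.

```r
## Minimal sketch (not the book's code): density estimation from bin counts
## rather than raw data, in the spirit of the averaged shifted histogram.
## Sample size, number of bins, and smoothing parameter are illustrative only.
set.seed(1)
x <- rnorm(1e6)                       # stand-in for a massive dataset

nbin   <- 1000                        # reduce a million points to 1000 counts
edges  <- seq(min(x), max(x), length.out = nbin + 1)
delta  <- diff(edges)[1]              # narrow bin width
counts <- tabulate(findInterval(x, edges, rightmost.closed = TRUE), nbins = nbin)

## Smooth the counts with triangle weights (m shifts) and normalize to a density
m <- 5
w <- 1 - abs(seq(-m + 1, m - 1)) / m
w <- w / sum(w)                        # weights sum to 1, so the estimate integrates to 1
smoothed <- stats::filter(counts, w, sides = 2)
smoothed[is.na(smoothed)] <- 0         # bins near the edges lose weight; set to zero here
fhat <- as.numeric(smoothed) / (length(x) * delta)

centers <- edges[-1] - delta / 2
plot(centers, fhat, type = "l", xlab = "x", ylab = "density estimate")
```

Only the vector of counts and the bin edges need to be retained, so the estimate can be re-smoothed or re-plotted without revisiting the original million observations.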
8. Were there areas of the book that you found more challenging to write, and if so, why?
Since a large fraction of the book reports on my own research, being self-critical in search of accuracy was always a challenge. But it was a surprise at how much work it took to accurately report on other important papers in the first edition. Any little detail that was not perfect could require 1-2 weeks to properly analyze. Such detours were not comforting when facing a deadline.
9. What is it about multivariate analysis that fascinates you?
Modern multivariate analysis is a chicken-and-egg thing. It is not unreasonable to assume that virtually every analysis of 50-dimensional data misses key features or oversimplifies the structure. On the other hand, there are many examples of apparent nonlinear features falsely “discovered” in ordinary multivariate normal data, especially with moderate sample sizes. When will we be able to “recognize” new features in a more routine fashion? Important breakthroughs await.
10. What will be your next book-length undertaking?
I have no immediate plans for a new book. I am very fond of the 1979 “Multivariate Analysis” book by Mardia, Kent, and Bibby, published by Academic Press. I had toyed with writing a more modern book at that level with new applications and modern computing code. However, Wiley has signed those authors and a new co-author to a contract to write a second edition, so I will be content to wait a couple of years and see how that fares.
11. You are also the Co-Editor of Wiley Interdisciplinary Reviews: Computational Statistics. What makes the journal different from others in the discipline?
I am starting my 8th year as co-editor of Wiley Interdisciplinary Reviews: Computational Statistics. The original vision has evolved rapidly as the economics of publishing have changed almost as quickly. The journal was originally conceived as a multi-volume encyclopedia (in print) with an accompanying journal providing new material on a regular basis, to be incorporated into revisions and new editions of the (print) encyclopedia. Ed Wegman and Steve Quigley provided the primary impetus for this plan, but the plan for the encyclopedia never materialized.
However, the journal appeared in print in January 2009, and has provided six issues per year since. Wegman’s vision was very broad, and so the topics selected for this review journal have included not only core statistical fare but also survey articles from many fields of science that intersect statistics.
Finally, the journal has evolved into an electronic-only version in recent years. Jim Gentle and Jim Landwehr have joined the journal and serve primarily as acquisition editors, leaving the tasks of article review and journal organization to me.
12. What are you looking for in terms of contributions to the journal? Are there any hot areas that you’d like to see the journal publish in?
The vast majority of articles in WIRES result from personal invitations from the editors. We survey a vast array of journal articles and talks at scientific meetings to identify topics from both established and younger researchers. A thorough refereeing process ensures that the articles satisfy the requirement of giving a comprehensive survey of the field, as well as any new material the author may wish to include. The selection of “hot topics” is a little at variance with a survey journal, but of course, we seek to have articles that reflect the state of the art in modern statistical practice.
13. You are Noah Harding Professor in the Department of Statistics at Rice University. Please could you tell us about your educational background and what inspired you to pursue your career in statistics?
As an undergraduate at Rice University, I enjoyed both Electrical Engineering and Mathematics majors. While a senior, I happened to take a statistics course from Jim Thompson, then a new faculty member. I was the second-best student in the class, so when the other student turned down an offer to study statistics with Jim, I got the offer and accepted. The field of statistics is often characterized as allowing one to meddle in any field of research, and that has certainly been my experience. I found statistics a perfect compromise between my two undergraduate majors.
14. Your research interests include computational statistics, data visualization, and density estimation. What are you working on currently? What are your main objectives and what do you hope to achieve through the results?
I currently work with four doctoral students on such topics. A decade ago, I started with ideas of George Terrell and extended them to parametric estimation (essentially performing parametric estimation using nonparametric criteria). This approach has inherent robustness properties, as noted by David Donoho in a more general context. It has proven highly relevant to modern data sciences, since massive datasets are always contaminated by large numbers of outliers or even clusters of unexpected data. Robust methods are perhaps the only feasible practical answer today.
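As one concrete illustration of fitting a parametric model with a nonparametric-style criterion, here is a minimal R sketch that fits a normal model by minimizing an integrated squared error criterion; the criterion, the contaminated sample, and the starting values are assumptions chosen for illustration, not the specific algorithms of the research described above.

```r
## Minimal sketch (assumed illustration): fit a normal model by minimizing
##   integral(f_theta^2) - (2/n) * sum_i f_theta(x_i),
## an L2-type criterion that downweights observations the model cannot explain.
## Data and starting values are illustrative only.
set.seed(2)
x <- c(rnorm(950, 0, 1), rnorm(50, 8, 1))      # 5% of points are "outliers"

l2_crit <- function(par, x) {
  mu    <- par[1]
  sigma <- exp(par[2])                          # log scale keeps sigma positive
  ## For the normal density, integral(f^2) = 1 / (2 * sigma * sqrt(pi))
  1 / (2 * sigma * sqrt(pi)) - 2 * mean(dnorm(x, mu, sigma))
}

fit <- optim(c(median(x), log(mad(x))), l2_crit, x = x)
c(mu = fit$par[1], sigma = exp(fit$par[2]))     # near (0, 1) despite the outliers
c(mean = mean(x), sd = sd(x))                   # classical estimates are pulled away
```

The contrast with the sample mean and standard deviation, which are dragged toward the contaminating cluster, is one way to see the robustness property mentioned above.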
15. Are there people that have been influential in your career?
I returned to Rice University after three intensive and exciting years at Baylor College of Medicine. At Baylor, I was very fortunate to work for Tony Gorry, who was an outstanding mentor with excellent taste in science and a gifted writer. We were part of a 200-person NIH grant focusing on heart disease, led by the famed heart surgeon Michael DeBakey and lipid researcher Tony Gotto. Tony Gorry later became vice-president of research at Rice and currently teaches artificial intelligence courses in the Rice business school. But both of my thesis advisors are still full-time faculty and have been instrumental in my own development and career at Rice. Jim Thompson is close to retirement, and Richard Tapia continues an active research agenda while serving in several university-wide capacities involving minority affairs. Tapia recently went to the White House to receive the National Medal of Science. I have been very fortunate to count both as gracious and effective mentors throughout my career. I recently turned 65, and enjoy reflecting on the many students I have tried to mentor myself, and perhaps looking forward to a new phase in my life before I turn 70.