“In most places, not only are statisticians not in control of Big Data efforts and data science, but sometimes they are totally excluded or at best, marginalised.” An interview with Stephen Fienberg

Stephen Fienberg is one of the top social statisticians in the world and is currently the Maurice Falk University Professor of Statistics and Social Science in the Department of Statistics, the Machine Learning Department, the Heinz College, and Cylab at Carnegie Mellon University.

Born in Toronto, Canada, Fienberg earned an honours B.Sc. in Mathematics and Statistics from the University of Toronto in 1964, followed by an A.M. in Statistics in 1965 and a Ph.D. in Statistics in 1968, both at Harvard University. He has been on the Carnegie Mellon University faculty since 1980 and served as Dean of the Dietrich College of Humanities and Social Sciences. He served as Academic Vice President of York University in Toronto, 1991-1993. He has authored more than 400 publications, including six books, and can claim more than 80 descendants in his mathematical genealogy.

Fienberg is a recipient of the COPSS Presidents’ Award and Fisher Lecture Award, an elected member of the National Academy of Sciences and an elected fellow of many associations including the American Academy of Arts and Sciences and the Royal Society of Canada. He is founding editor of the Annual Review of Statistics and Its Application, a founder and editor-in-chief of the Journal of Privacy and Confidentiality, and Editor-in-Chief of the Annals of Applied Statistics.

Professor Fienberg talks to StatisticsViews about his career and achievements, what made him a Bayesian, his work on design of experiments, forensic science and privacy protection, and his one-time ambition to be a professional ice hockey player.

1. When and how did you first become aware of statistics as a discipline and what was it that inspired you to pursue a career in statistics?

When I was growing up, nobody was exposed to statistics at school. Mathematics courses were very rigid. At the University of Toronto, I was in a large course with about 200 students called ‘Mathematics, Physics and Chemistry’. It was one of the elite courses of the university. At the end of the first year, there were 120 of us left! At that point, I dropped chemistry because I could never make the experiments work. I could always get A’s in my lab work by working backwards because I knew what the answer was! So then I was concentrating on maths and physics and then I dropped the physics for a related set of reasons!

It was then that I had my first glimpse of probability. Then there was a course during my third year as an undergraduate which was an introduction to statistics by Don Fraser, also a Wiley author! Back then, Fraser had written two books for Wiley and he followed material in one of them for the course. Don was inspiring. After learning so much geometry, I now learnt about induction and inference and how to bring geometric ideas to bear on these. It was different and it really struck a chord within me. There was not much data presented, although we did have a lab where we did mundane calculations on examples with old-style calculators.

In my fourth year, I was introduced to the design of experiments. It was a revelation to me as it was about real-world data in settings that exploited the mathematics. There were not very many textbooks on the topic at that time. The instructor was Dan DeLurie and he didn’t believe in texts. I quickly discovered Fisher’s book on design of experiments and later I read parts of Cox’s book, and then Cochran and Cox. The notion that it was real phenomena that statistics was dealing with convinced me that this was what I wanted to do. I also knew that I didn’t want to be focussed on theorems and proofs for the rest of my life. That set me on a path looking for the right graduate school and once I arrived at Harvard, I discovered what the real world of statistics was about.

2. You are currently Maurice Falk University Professor of Statistics and Social Science in the Department of Statistics, the Machine Learning Department, Heinz College, and Cylab at Carnegie Mellon University. Over the years, how have your teaching and research motivated and influenced each other?

I have always been bothered by textbooks where the focus was on the mathematics of statistics – they tended to always start with an abstracted problem. They were never motivated by real substantive questions, things that other people outside of statistics cared about. I have, over the years, tried to make what I do outside of the classroom motivate the students inside the classroom. So if I am teaching an introductory class, I can’t present a big elaboration of an application with a substantial dataset, but I do try to use real data that they relate to, that also illustrates what I want to say statistically.

I will do things no other instructor is willing to do, and sometimes it fails. In the early 80s, I wanted all my first-year students, by the end of the year, to be able to take a real data set, run multiple regressions, and write a formal report on what they had learnt from the results, not just write down, ‘here are the numbers.’ So I actually did that in a course in the 90s, but no one at CMU has tried to do it since!


One of the things I’ve been able to do is teach a freshman seminar every once in a while. In 1990, I did it as a class in a very ad hoc way, and then again in 2000 and 2010, I taught small freshman seminars on the census. Those were the census years, so I would bring real data into the classroom which we would discuss. One of the nice things about working on those seminars is that, because I personally knew many of the Census Directors, I was able to bring many of them to class as my guests. It was great fun and it really changed how students think about what they do. In 1990, we signed all the students up as census enumerators and they did a shelter and homeless night and had to come back and describe their experiences and share them. That doesn’t sound like it should belong in a stat class, but I can take you around here at JSM and introduce you to people who were in those classes and they’ve become statisticians!

3. You are well known for your work in log-linear modelling for categorical data, the statistical analysis of network data, and methodology for disclosure limitation. What are you focussing on currently and what do you hope to achieve through your research?

Over the years, I have told students and colleagues that nothing really ever goes away. Periodically, I end up looking at a research problem that I had previously looked at many years ago! When I was a graduate student, I worked on contingency tables and used them in my thesis, characterising two-by-two contingency tables, and then tables in higher dimensions. I looked at what characterises all of the two-by-two tables where the row variable is independent of the column variable – what I call the surface of independence. I then worked on many other statistical problems, and I have been known for turning lots of problems back into contingency tables, not because that is how I set about doing things, but it sort of happens periodically. If you look at my publications over a long stretch of time, log-linear models often show up in different guises. For example, in the late 1970s, with a colleague Stanley Wasserman, I took what was a simple model by Paul Holland and Sam Leinhardt, gave it a contingency table representation, and showed how we could use the software to analyse it. That was for social network models, and the world of networks is now really popular. Very few people paid any attention at the time, and a decade ago I returned to network modelling for different reasons. Working with a great bunch of graduate students, postdocs and collaborators, I have carved out a whole set of research domains, some of which are mathematical, some of which are highly data-oriented, but actually only a few of the topics have contingency table representations!

Secondly, I have a lot of work invested in forensic science, which is related to the Wiley edited volume I did in the early 80s with my CMU colleagues Morrie DeGroot and Jay Kadane. Going back to when I was a young faculty member at Chicago, I became involved in research related to statistics and the law, through a seminar that I took part in. It involved both forensic kinds of applications and civil applications. I worked on other areas over the years, including the volume with DeGroot and Kadane when I came to Carnegie Mellon. Statistics has moved on but forensic science has not, except for DNA profiling. Almost all of the other forensic “sciences” are not very scientific, and I’m part of a big new Center for Statistics and Applications in Forensic Evidence, at Carnegie Mellon University, Iowa State University, University of Virginia and University of California Irvine, where we’re trying to put the statistics and the science back into selected aspects of forensic science.

Thirdly, I also work in privacy protection and record linkage – two different research areas that are, in fact, related. You can view record linkage in a positive way as taking existing data sets and merging them so that you create individual records, available for analysis, that don’t correspond to those in any existing separate data set – it’s a statistical process where we “predict” matches and non-matches, but also check what happens with the match uncertainty. If you were an intruder and wanted to break into a database, how would you do it? All too often, people find a window in, and it’s happened to me recently, including my security clearance records at the Office of Personnel Management in Washington! Everybody’s data can be compromised, and with record linkage you can match your data, via seemingly unrelated variables, to data that you already have access to, and that tells you how to protect your data. Record linkage also has a lot to do with evaluating the quality of census taking, so I am working on another big project on that.

4. You have written books on categorical data analysis, US census adjustment, and forensic science, including Statistics and the Law for Wiley, and contributed articles to the Encyclopedia of Statistical Sciences and the journal Statistics in Medicine. Do you continue to get research ideas from statistics and incorporate your ideas into your teaching? Where do you get inspiration for your research projects and books?

It’s the way I do things and I believe that it should change how we teach and prepare materials for introductory courses – they don’t all have to have the same examples. I try to share my knowledge and this coming semester, I am teaching a course on data privacy at graduate level, mainly focusing on the technical statistical aspects, but embedding these in the broader real-world contexts.

5. What do you think the most important recent developments in the field have been? What do you think will be the most exciting and productive areas of research in statistics during the next few years?

Virtually everywhere I go, people talk about Big Data. In a sense, we have been doing Big Data for a long time, so at CMU a subset of us began to meet with people who worked in computer science in the mid-1990s. We began to talk about where our interests coincided and how to take advantage of each other’s research. We started a centre which has turned into the Machine Learning Department, the first of its kind in the world. It may be the only one at the moment, but it will not be the last, and all of this preceded public talk about Big Data. In statistics, I take the simplest model I can work with and try to implement new methodological ideas, whereas a computer scientist will work the other way, often starting with the implementation of a fancy algorithm. We have developed curricula at the interface that try to join the two perspectives, so that our students have the facility to do computation in ways that I can’t. I don’t even pretend to do what my students can do these days. I teach them about judgement, the provenance of databases, randomization and the importance of how data are collected. We bring these things together at Carnegie Mellon. This allows us in statistics to try and take some leadership in machine learning and Big Data contexts. The ASA and other statistical societies came to this a little too late and have been playing catch-up. The IMS was better – if you look at the Annals of Applied Statistics, for which I am currently Editor-in-Chief, we have in some senses a lot more about Big Data in real-world substantive settings than most other statistics journals, and that is not by accident. This in many ways is the future of the field.

6. What do you see as the greatest challenges facing the profession of statistics in the coming years?

In most places, not only are statisticians not in control of Big Data efforts and data science, but sometimes they are totally excluded or at best marginalised. Efforts are led by physicists and computer scientists, everyone but statisticians. It’s like a runaway train in some places. I think statistics not only has to catch up but also has to figure out how to slow the train down enough to bring aboard the values that we have developed in statistics over a century of theory and methodological development. We have insights that the movement desperately needs, and we have to find a way to provide them, to step up and lead. We have some leaders in the field, but it’s really the challenge for the graduate students who are roaming the halls here at JSM. They’re the ones who are going to have to cope with the new challenges for our field.


7. What has been the best book on statistics that you have ever read?

Wow. I would almost immediately say Fisher’s Design of Experiments. There are many who have said they have learned from Fisher’s Statistical Methods for Research Workers, but there was so much missing in it: you had to know so much in order to appreciate what was in it. Design of Experiments was written in a very different way, with not much mathematics. But it is chock full of ideas, and I find that I can go back and re-read a chapter, whether it is simply on the lady tasting tea or what Fisher had to say about the Darwin dataset, how he used those data to get into factorial experiments and complex design and all of these other topics which I clearly never appreciated when I was an undergraduate or even a graduate student. I have learned over time about many of these ideas and I belatedly understood their importance.

A book that is not easy reading but which in many ways made me a Bayesian is Applied Statistical Decision Theory by Raiffa and Schlaifer, originally published by the Harvard Business School. Here I was in 1965 as a graduate student at Harvard, and every Monday afternoon over at the Business School there was a seminar run by Raiffa and Schlaifer where someone would present a technical problem. When I first attended the seminar, I didn’t know what the heck they were talking about, but I knew that it was important, and they had this book, which I finally acquired, and the notation was so elaborate and difficult to follow. But it was brilliant and, over time, I have come to appreciate what they did. The book was essentially an independent investigation implementing in a systematic way ideas developed in the early 1950s, including those in Jimmie Savage’s book on the Foundations of Statistics. This work changed the lives of many statisticians. I read Jimmie’s book only later, but it is a very important book. At any rate, I still go back to Raiffa and Schlaifer to find the answers to some simple but subtle questions. I think that’s why it’s a Wiley classic.

Another book which had an enormous influence on me was one by Jack Good, whom I only met later. In 1965, he published a book called Estimation of Probabilities, and it was basically an introduction to Bayesian hierarchical models before they had been given that name by Dennis Lindley and Adrian Smith. I used that book throughout my thesis research. Jack was really quite amazing, and I got to know him afterwards; I was present at his 90th birthday celebrations, and he died not long after that. In WWII, Jack did all this work with Turing. We still don’t know entirely from where Turing’s Bayesian ideas emerged, or what ideas he was really inventing totally on his own. A lot of what Jack did in the 1950s, when he was finally able to publish in the open literature, was taking the ideas that he had learned from Turing and explaining them in language that statisticians would understand.

8. You have achieved many distinguished accolades over the years, including the COPSS Presidents’ Award and the Wilks Award. What is the achievement that you have been most proud of in your career?

In some sense, I have been really lucky. In many ways, the most important achievement was my election to the National Academy of Sciences, because that gave me an opportunity to do things in a forum with a broader impact across the sciences. This was of course an honour, as there are not many of us in the NAS, and there are many statisticians who deserve to be there and are not. Being elected has also opened up other opportunities for me, such as running the report review process for the Academy, something I have done now for eight years. I think of that work as having a positive effect, not just on science and public policy, but also on statistics and its role more broadly.

9. Are there people or events that have been influential in your career?

Initially, the person who had the greatest influence on me was the teacher who got me into statistics, Don Fraser at the University of Toronto, who just turned 90; we celebrated at the SSC meeting in Halifax. Somebody actually said to me, after I gave the talk at a session in his honour, that I gestured with my hands just the way Don did!

Fred Mosteller was my Ph.D. thesis advisor and, in many ways, was the example that I tried to follow in terms of bridging from statistics to other sciences. He also taught me enormous amounts about writing and explaining statistics as well.

When I got to the University of Chicago, Bill Kruskal and Paul Meier were my mentors until I felt confident to be off on my own.

10. If you had not got involved in the field of statistics, what do you think you would have done? (Is there another field that you could have seen yourself making an impact on?)

I know what I wanted to do, but I wasn’t good enough—I wanted to play ice hockey! I grew up in Toronto, and the sad thing was that in elementary school I wasn’t good enough as a skater. There were guys in my neighbourhood who were really good when they were 10 years old, and they dominated the ice time and played on the formal teams. Some of them went on to play in the National Hockey League! It wasn’t until I was a graduate student that I got some good hockey equipment and learned how to skate in a serious way. I worked out with the Harvard JV team one day, and that truly dashed all my thoughts of hockey as a vocation. But I continued to play hockey recreationally until about three years ago.

Or I would have liked to write mystery novels. Perhaps I still will.


Copyright: Image appears courtesy of Professor Fienberg