“The interface of statistical optimality and computational optimality is becoming increasingly relevant in the larger data science environment”: An interview with Eric Kolaczyk

Eric Kolaczyk is a Professor in Statistics in the Department of Mathematics and Statistics at Boston University. Currently he serves as director of the department’s MS in Statistical Practice (MSSP), as well as more generally the director of the Program in Statistics. He has been on the faculty in the Department of Mathematics and Statistics at Boston University since 1998, and was faculty in the Department of Statistics at the University of Chicago before that. He also has been visiting faculty at Harvard University and l’Universite Paris VII. He teaches an annual short-course at l’Ecole Nationale de la Statistique et de l’Administration Economique (ENSAE) in Paris.

Prof. Kolaczyk’s main research interests currently revolve around the statistical analysis of network-indexed data, and include both the development of basic methodology and inter-disciplinary work with collaborators in bioinformatics, computer science, geography, neuroscience, and sociology. Besides various research articles on these topics, he has also authored three books in this area, including Statistical Analysis of Network Data: Methods and Models (Springer, 2009) and, joint with Gabor Csardi, Statistical Analysis of Network Data in R (Springer, 2014).

He is an elected fellow of the American Statistical Association (ASA), the Institute of Mathematical Statistics (IMS), and the American Association for the Advancement of Science (AAAS), an elected senior member of the Institute of Electrical and Electronics Engineers (IEEE), and an elected member of the International Statistical Institute (ISI).

Alison Oliver talks to Professor Kolaczyk about his career in statistics so far.

1. You obtained your BS in Mathematics from Chicago before moving on to Stanford for your MS and PhD in Statistics. What inspired this move from mathematics to statistics?

It is a common move and I think people have different sorts of paths by which they got there. For myself, I majored in mathematics because people said this was a good foundation for going onwards. We were then required to take a breadth requirement of some sort, and I took it in statistics (an intro class) and then, as a math major, took probability and mathematical statistics, because that was, I think, the path of least resistance for a math major. I became really captured by the idea that you could use mathematics to quantify uncertainty in something as messy as real-life data. That ended up being the right balance for me, rather than simply going into mathematics purely – being able to stay with the foundational training I had but to move in a way that would bring me closer to working with real-life problems and real-life data.

2. Your research interests include statistical analysis of network-indexed data, particularly on both foundational issues and statistical problems arising in practice. What are you working on currently?

Most of my research in my career has been driven by questions in other fields. I am most comfortable straddling statistics and other domain areas. What drew me into networks were problems in computer traffic analysis and computational biology, but recently I’ve been working much more with people in computational neuroscience. In that area, they’re using networks to try and explain how the dynamics in different regions of the brain may be related to each other, but we don’t get a deity giving us what that information is; instead we get to measure it and then we somehow try to infer those relationships. That is messy because the original data is messy, and so that very practical need to create these networks leads to statistics problems where you want to be able to somehow infer the networks, you want to quantify how uncertain you are about those networks, or even more fundamentally, you want to talk about: Is it possible to say what is the “average” of a bunch of networks? Is that even well-defined?

My group is working on a spectrum of problems, from practically implementing these methods with computational neuroscientists to looking at mathematical abstractions where you bring geometry and probability to bear when you try to define answers to questions like: What’s the average of a bunch of networks?

3. Your lecture at JSM 2017 was an Introductory Overview Lecture on Network Data, Modeling, Analysis and Applications with Harry Crane of Rutgers University and George Michailidis of the University of Florida. Please could you tell us more about this topic? What was the one thing that you wished for your audience to take away from your lecture?

Certainly, and particularly for a short lecture like that, I always make sure I have a take-home message, and I always give it first. That way, if they want to go to sleep, they can go to sleep, but they still heard what I had to say.

So I think the take-home message for them today was that although network science as a field is close to 20 years old, the statistical work to be done in that field, particularly for laying foundations, is actually yet to be done. It’s happening right now. That’s somewhat ironic and catches people, particularly in our field, a little bit by surprise, but it’s just somehow the nature of the beast that network science has evolved largely in other quantitative fields. We’re seeing a sea change in our field where there is a critical mass now that’s finally become involved, and problems are being defined in a way that people can begin to get their teeth into them. I think it’s an extremely rare and exciting time that this frontier is opening and it’s, “Now or never. Jump in,” is the message.

4. Your book Topics at the Frontier of Statistics and Network Analysis: (Re)Visiting the Foundations was published last year. Please could you tell us about this new work?

The book is associated with lectures they call the Seminaires en Statistiques, which the Bernoulli Society puts together. My impression is that they were very well regarded and then went quiescent for various reasons, and Ernst Wit in the Bernoulli Society decided to relaunch them with a new publishing model that very smartly (although painfully for the authors) required that they write the book first and then speak, and not the other way around! We did that due to the nature of the lectures as they wanted to have a focus on this frontier, and so I focused particularly on three problem areas: network modeling, sampling of networks, and experimental design of networks. The point of each of these areas is that they are very classic statistical areas (they go back to the roots of our field, well over a century ago) and yet we are now laying the foundation of much of those, still, despite much work in some of the subsets of those areas. Those are truly our growth areas for the field, and the book is meant to help statisticians get into it quickly and relatively painlessly.

5. Currently you co-chair the Data Science Post-secondary Education Roundtable, sponsored by the U.S. National Academies of Sciences. Please could you tell us more about this role?

The roundtable seems to be a new mechanism that the academies have for trying to bring experts together and extract some sort of sense of knowledge from them. This is a group of about 30 representatives in data science from quite a cross-section on many different axes. It’s academic, government, and industry. It also has representation from statistics, from applied mathematics, from computer science, and from data engineering. So it’s a type of cross-section that’s very dear to my heart because of my research interests.

My role on that is to try and guide the committee through three years’ worth of periodic discussions (we meet four times a year) on specific topics in which we can share best practice, pose problems, hypothesize solutions, etc. So far, we have focused on core contributions from the core areas of CS, engineering, math, and stats; we have had another meeting where we considered the interaction of domain expertise in data science; and in a third one we looked at data science education in the workplace. The focus in all of those is supposed to be on the post-secondary level. We’re beginning to move towards models where we’re not only trying to summarize what’s going on during the roundtable, but see if we can interact more largely with societies and some of the other many, many groups that are having these discussions. We’re now one year through, and in the second and third year we’ll see if we can try to continue to branch out and build more of a message to a larger community.

6. You are Director of the MS in Statistical Practice at Boston University as well as Director of the Program in Statistics. What are the rewards in teaching these courses?

I do somehow find myself wearing two director hats and would be more than happy to begin offloading at least one of those! Those are largely administrative roles, and then they do have some implications on my teaching. At BU, we somewhat uniquely have a math and stat department jointly, which allows us to do certain creative things regarding crossing through the spectrum of mathematical sciences without interdepartmental barriers. It’s a unique sort of setup. Despite being a joint department, we actually do offer Bachelors, minor, major, Masters, and PhD degrees in statistics, so we have quite a full plate, teaching at all those different levels.

The MS in Statistical Practice is our newest contribution. It’s a very unique program, we think. We formulate it as kind of a departmental-level response to the growth and interest in “data science” more and more generally, and what we did is we started with a blank slate in which we said the central focus was going to be creating holistically trained statisticians, and we were going to build that around practice – the theory, the methods, the computing they get are all hung around a set of courses we call Statistics Practicum, in which industry partner projects, consulting, and various related-type work is constantly tightly integrated with all of that. I have led that practicum in addition to leading the program, of course, and that has just been a blast. It’s sort of like our innovation lab; it’s like a start-up in feel; and we kept it that way by continually trying to look at what would be best practice, what worked, what didn’t work, and innovating constantly.

7. What are the most popular lessons that the students respond to that you would recommend to others teaching?

Well, over 20 years (I was looking the other day) I have taught probably about two-thirds of the courses in our department. But it’s hard to come up with things that I think others perhaps haven’t already encountered. Certainly, I’d say the importance of practical work. There’s a lot of focus these days on flipping classrooms. But I think one thing that I’ve tried that does catch colleagues a little by surprise when I mention it, is that in this practicum course, we focus towards holistic training not only on the more traditional statistical training, but also very importantly on communication skills.

One of the drills we do for getting them comfortable with speaking is we have them, within two weeks of arriving on campus, give a very short presentation. They’re going to do this all year long, constantly—formally, informally—and somewhat to their surprise, and my colleagues when they hear about this, is we tell them they are absolutely forbidden to talk about statistics. They get a five-minute talk that is supposed to be on the topic of why you should . . . whatever. Why you should see this movie? Why you should read this book? Why you should eat this food or go to this restaurant? They all think I’m crazy. They begin speaking, and we run them very quickly, in a lightning fashion, and gradually myself and my colleagues—because it’s a team-taught course—we will begin to offer small comments, and it will seem like it’s only about the content. Then gradually we’ll start saying, for example, “You know, the font needs to be a little different here. People can’t see.” “You’re only speaking to your slides and not to your audience.” We throw in things like the font, if it were a table, or your audience, if it were non-experts and not just statisticians, and gradually they begin—about halfway through the class—lightbulbs start going on and you can see them going on, that, “Oh, he wasn’t crazy. There was actually a reason to do this.” I think that importantly, having them not talk about statistics, not trying to leverage what they may or may not know about statistics, but just purely putting them on even footing, and, “You tell me about something that you’re passionate about and that you know well and that is totally unconstrained by what your level of knowledge is, and let’s see just what purely your communication skills are,” that’s the mechanism I’ve found so far that comes closest to being able to elicit that and encourage in them as soon as possible in the program an awareness of where they’re at with that and where they want to go.

8. What do you think have been the most important recent developments in the field and will these influence your teaching in future years?

The two that come to mind are first the interface of statistical optimality and computational optimality. I think this is becoming increasingly relevant in the larger data science environment. I think it is one of the areas that statistics as a field is evolving towards, certainly towards, if you want to think of it as a pull in computer science, and I think we’re pulling towards each other. I think we necessarily need to in order to reach a certain equilibrium where the importance of the contributions from statistics and computer science between estimation and inference and algorithms come to a necessary balance.

The other one would probably be the ever-increasing role of so-called “found” data in modern society now, where we’re just seeing increasingly that there are massive amounts of data that simply, because we can collect it, it’s collected, it’s made available, and the idea is you can analyze it. But really, without a larger contextual envelope around that—that that data in turn comes from somewhere and therefore any of the canonical issues of where it came from, bias in sampling and generation, relevance of that data to a larger population are the questions being asked—all of that, no matter how big these data are, are still likely relevant. I know, for example, Emmanuel Candes’ three Wald lectures at JSM 2017 were largely being motivated by that problem, as well. So I’m encouraged that I think our field is beginning to move to address that very important area.

9. Your research has been published in journals and books: is there a particular article or book that you are most proud of?

I guess I would say it’s like children in some sense, right? You’re proud of all your children, but if you have to choose one, I guess I would say it was my first book, and the reason is probably because, at that time and place, in 2009 when it was published, or even 2005 when I started working on the book, I could probably count on two hands the number of card-carrying statisticians that were deeply embedded in networks—really meaning that they had really put a large substantial amount of their research agenda purely in that area and embedded themselves in the community. That was a time where the genesis of the book was my just trying to figure out what all was going on, and I applied my own toolset in doing it, the foundations that I had learned about how a statistician thinks and organizes problems. As I started doing that, I started thinking, “Hm, there might be a book here.” As it was the first to do that, I think it still is the only one written by a statistician for statisticians that really tries to sort of help frame that very active field, and because it seems like it’s had some moderate success in doing that, that’s been very gratifying to me. I think it’s very rare in a scientist’s career to have an opportunity to, in the river of science as it flows by you, to be able to jump in at a certain point of time where something big perhaps is beginning to emerge and to add your voice early on in that process.

10. What is the best book in statistics that you have ever read?

That is a tough one. I have read many good books as a student, many ones that helped me, so I guess what I did is, I thought about twisting your question—this is the politician in me—and I guess the two books that I think had a tremendous impact and therefore also really made an impression on me would be Noel Cressie’s book Spatial Statistics and Trevor Hastie, Jerry Friedman, and Rob Tibshirani’s Elements of Statistical Learning. Both were ones that I had in mind when I wrote my own first book in the sense that Noel’s had really had a similar sort of time and place in spatial statistics, and the one that Trevor, Jerry, and Rob wrote had a similar one in the evolution of statistics and its interaction with machine learning. The one that Noel wrote had already been very well established by the time I was writing mine, and usefully to me, the one that the other three wrote had only just been out for a few years when I was beginning to talk with Springer about writing that. That in various ways was trailblazing for helping my book in that they were the first ones to get substantial colour in a statistics book, which I told the publishers very early on that if you can’t show network visualizations using colour, this is not going to happen, at least in terms of the chapter on visualization.

11. What would you recommend to young people who want to start a career in statistics?

I find the young people – particularly the ones in our new MS program – are very hyper-focused, understandably, on career and getting launched in a short period of time – they come in having heard a great deal about data science, both through academics, through peers, through popular press or whatnot, and I find they often come in wondering not only what role does statistics play, but some of them come in wondering did they make the right choice to choose to go into a statistics degree. My advice to them is to realize that statistics always has been and is going to continue to be central to doing things. On the one hand, they should embrace the foundations, the applications and the technology, and know that in doing so they’re building the core of a powerful foundation for contributing in data science. At the same time, I tell them that statistics is quickly evolving and continuing to evolve with the field, and that they need to be prepared to do that on a variety of timescales. In the short term it may be supplementing their education with more tool-based skills regarding the software. On a longer timescale, they need to understand that what we’re providing them in a year or two must be only the foundation for what might be a 20-, 30-, 40-year career in a field that will evolve in ways that most of us will not be able to predict.

12. Who are the people who have been influential in your career?

My PhD advisor was David Donoho and he’s impacted me in many ways. His ability to have a grasp of pure breadth of fields and understand not only how they do interact, but how they might interact. Yet doing so is statistics-central to the problems he’s thinking about: bringing in mathematics, bringing in signal processing, understanding how information theory relates—those are skills that at some level or another I think I’ve used throughout my career and I got in some way or another from him through example.

Also his ability to speak to people in those fields, to use that understanding that he has in a productive fashion to work across the fields. That is a skill, and it’s not necessarily always an easy skill for everyone. So, seeing someone who is very good at it in my early years was very useful. I would also say he cares very deeply about both science and the people working in it. He’s shown that in many ways, both in ways that are visible and ways that are perhaps less visible to the community. He’s deeply impacted the field in that way.

If I went beyond an individual, I guess on the other extreme I would say my collaborators. A lot of my work has been sort of synergistic in following the opportunities that have arisen when they smelled right and promising to me; that there was something big that might be lurking there and deciding to follow it a bit. That couldn’t have really happened if I didn’t have excellent collaborators around me to draw on. I wouldn’t even be in networks if it wasn’t for the fact that at the time, during the early 2000s, in the place that I was, at BU, there were some excellent people in computer science and computational biology, primarily, who were watching the emergence of the use of networks in those fields and realizing that there was something important going on there that they might leverage for their domain perspectives. At the same time knowing enough about statistics to realize that much of what they’re wrestling with in those problems had a strongly statistical flavour to it, and reaching out to me to try and collaborate. Then of course collaborating well and faithfully and having fun together and all that, really, that’s very key for being able to build a career around things like that.

