“I stepped into the world of data mining. This path has been a thrilling and troubling journey.” Interview with Dr Galit Shmueli

Galit Shmueli is SRITNE Chaired Professor of Data Analytics and Associate Professor of Statistics & Information Systems at the Indian School of Business.

Dr. Shmueli’s research focuses on statistical and data mining methods for modern data structures, with a focus on statistical strategy – issues related to how data analytics are used in scientific research. Her main field of application is information systems (in particular, electronic commerce).

Dr. Shmueli’s research has been published in the statistics, information systems, management, and marketing literature. She has authored over 80 journal articles, books, and book chapters, and is on the editorial boards of several journals. She presents her work nationally and internationally. One of these books is the bestselling Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel® with XLMiner®, co-authored with Nitin R. Patel and Peter C. Bruce and published by Wiley.

Statistics Views talks to Dr Shmueli about the book’s success and her own career in statistics.

1. Congratulations on the continued success of your best-selling book, Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel® with XLMiner®, which incorporated a new focus on data visualization and time series forecasting and continues to supply insightful, detailed guidance on fundamental data mining techniques. How did the writing process originally begin? What was it that initiated the project?

Thank you – my co-authors and I have been delighted by the adoption of our book in so many courses around the world. This book exemplifies the phrase “necessity is the mother of all invention”.

In 2004-5, I was re-designing and teaching a data mining elective MBA course at University of Maryland’s Smith School of Business. The course originally had more of an advanced statistics flavor, and over several semesters I was seasoning it with more data mining content and principles. While re-designing, I was searching for two resources: software and a textbook appropriate for business students. To my surprise, this was a real challenge. Data mining texts were geared mostly to computer science students, while the business texts had very little in the way of hands-on no-fluff learning. From my experience as a researcher and a teacher, I know that hands-on is the only way to ‘get’ data analysis. In terms of software, I needed an MBA-friendly (non-programming) software with a gentle learning curve that would support learning the concepts. This quest led me to roaming software vendor stands at data mining conferences and to attending a conference on teaching statistics in business schools. And this is where I met my to-be-co-authors, Peter Bruce and Nitin Patel. When I complained to them about the difficulty of finding a suitable textbook for business students, Nitin suggested, “why don’t we write the book?” After a moment of shock (I was still an untenured assistant professor at the time), I thought – why not? And there started our decade-long collaboration.

I have continued teaching data mining courses in the last 10 years in different forms and to different audiences, and have designed additional business analytics courses, including forecasting analytics and data visualization, to meet the hunger and demand in business schools. The textbook has been stalking me in these paths. We are now working on the third edition. Based on feedback from instructors and readers, we’re expanding on existing topics such as over-sampling and lift charts as well as adding new topics, including social networks and ensembles.

2. Who should read the book and why?

This is a book “for humans”. I have been subjecting my dear mother, an art curator and “not a numbers person”, to every textbook I’ve written. Besides her professional language advice, she’s been my thermometer for measuring potential readers’ pain. My sensitivity to painful or boring textbooks might be selfish, but apparently our readers have the same mindset. Our intended readers are students and practitioners interested in getting familiar with data mining within a business environment. With “data mining”, “machine learning”, “predictive analytics” and “data science” becoming part of the jargon in business and other domains, many want a basic understanding of what it’s all about. While the computer science books are aimed at the “data crunchers”, our text is aimed at the managerial level – those interacting with or managing data crunchers. Such knowledge will support identifying and evaluating opportunities for data-based implementations and decision making.

The book is focused on explaining the general logic behind different data mining algorithms, the strengths and weaknesses of different approaches, and importantly, the scientific approach to evaluating performance. The text is thin on equations and there are no proofs; instead, there are data examples and hands-on exercises and cases. Readers should have basic familiarity with summary statistics, linear regression and basic inference, as taught in a typical statistics 101 course.

3. Why is this book of such continued interest, do you think?

From the feedback that we’ve received – we try to keep in touch with instructors and readers – I believe the level of our book, its combination of concept learning with hands-on examples and problems, and the user-friendly XLMiner add-on software are the main strengths. Instructors from areas such as information systems, operations research, and marketing who venture into teaching data mining find our book a gentle, but sufficiently rigorous, introduction and reference. We also hear that readers appreciate its less-than-500-pages length. We do, however, receive requests for using the book with other popular software such as R and JMP. The good news is that such editions are now underway.

4. What were your main objectives during the writing process? What did you set out to achieve in reaching your readers?

The main goal was “gentrifying” material that was mostly accessible to computer scientists and statisticians. The technical language, whilst intuitive to some, has kept others in the dark, thinking “this is too difficult for me”. Unfortunately, this is the case with a large chunk of the statistics literature beyond the basic textbook. My personal conviction is that data mining and statistics are not rocket science (in fact, maybe rocket science is not “rocket science” either?). We should therefore be able to explain the concepts sufficiently clearly to those who can benefit from them. That has been the main goal I’ve set when writing all of my textbooks.

5. Were there areas of the book that you found more challenging to write, and if so, why?

Surprisingly, when we wrote the first edition, I found the linear regression chapter the most challenging. Having studied and taught statistics for over a decade, I was more familiar with linear regression than with any other method. Yet, writing the “linear regression as a prediction tool” chapter led me into a universe that at first was puzzling. I started writing this chapter using Nitin Patel’s notes, which included surprising statements such as “the best fitting model isn’t necessarily the one that predicts best”. It was a far cry from the reliance on statistical significance to choose predictors, the emphasis on interpreting coefficients, and the focus on residuals and on R² to evaluate models, as taught in statistics courses. Exploring this deviation between explanation and prediction has in fact led to my most important research on “To Explain or To Predict?”

6. You are now working on a book entitled Information Quality: The Potential of Data and Analytics to Generate Knowledge with Ron Kenett, who won last year’s Greenfield Medal at the Royal Statistical Society, which is due to be published in 2015. Please could you tell us more about this book and what we can expect from it?

The Information Quality concept started from an email exchange with Ron Kenett back in 2008 based on a piece he wrote for Six Sigma Forum magazine, “From Data to Information and Knowledge: Or is it the other way around?”. My sense was that “information quality” was fundamentally different from “data quality”, that the distinction required attention and that “information quality” warranted further investigation and discussion. I’ve encountered situations where teams of my students working on a project with real data would reach a dead-end, not because the data quality was low, but rather because the data did not contain the information needed for addressing their goal. Yet, there seemed to be nothing about this in the statistics literature. Ron and I wrote a paper, On Information Quality, introducing language and a framework for measuring InfoQ. We submitted it to JRSS-A in 2011 and it just came out recently with discussion. Since its birth, we’ve been seeing the need for InfoQ in almost every domain and we’ve been illustrating its use in different application areas. Our upcoming book will have one part introducing the InfoQ language and its place within the statistics literature. The second part will consist of chapters each devoted to InfoQ in a different area (education, official statistics, risk analysis, etc.).

Let me take this opportunity to mention another book series that I’ve been working on as a service to society. Living and working in and outside the US, I now fully appreciate the need for affordable “Practical Analytics” textbooks written for non-statisticians around the world. I have therefore started self-publishing low-cost titles for professional non-statisticians and students who need to learn a topic in statistics – such as forecasting, risk analysis, or sampling inspection – in a hands-on fashion. Three titles are globally available as affordable softcover and extremely affordable Kindle editions. To make this knowledge available to non-English readers, a Chinese translation is available for one title, with Korean and Portuguese translations also underway. As with my other textbooks, a website provides all the datasets and instructors get access to resources such as solutions and slides.

7. You obtained your PhD in Statistics from the Israel Institute of Technology in 2000. Please could you tell us more about your educational background and what was it that brought you to recognise statistics as a discipline in the first place?

I knew I would be a statistician since I was five years old. Just kidding! My foray into the world of statistics was almost accidental. After completing my military service, I planned to study psychology, which required a second major. I consulted with a psychologist on reserve duty at my unit, who suggested “if you’re any good at math, take statistics. Psychologists don’t understand statistics”. After graduating, I realized that I enjoyed studying statistics more than psychology and thus decided to continue my studies in statistics at the Technion.

Once I completed my MSc, I didn’t plan to pursue a PhD in statistics. (Un)fortunately, at the time there were very few jobs in statistics in Israel, and none of them seemed exciting. After some reflection and some push from my advisor, I decided to continue studying for a PhD in statistics. I remember going through an agonizing stage of looking for a dissertation problem. Eventually the problem came from an unexpected direction: the intro to industrial statistics course notes by my advisors Ayala Cohen and Paul Feigin included a brief section on continuous sampling with a few mysterious formulas. In trying to derive those formulas myself I got caught up for a few years in developing and expanding a probabilistic approach by the famous Feller. I enjoyed working in this mysterious world of probability, sensing that there was a solution and I only had to work hard enough to find it. At the time, the most terrifying word you could utter to me was “data”. Luckily, I was weaned off that fear after a couple of years at Carnegie Mellon University’s statistics department where I got the “data treatment”. So, from probability I slowly moved into statistics, and then I stepped into the world of data mining. This path has been a thrilling and troubling journey.

8. Furthermore, what was it that inspired you to take up statistics as a career path?

I admit that my career has taken a life of its own, where I was open to interesting and exciting opportunities for learning and contributing as they came my way. This path has also taken some sharp and unexpected turns, keeping me on my toes and reminding me that there’s a lot more that I don’t know than I really do know. The secret for me has therefore been “step out of your comfort zone”. I shifted from an engineering environment at the Technion to a statistics department in a school of social sciences at CMU, and then made the dramatic move to a business school. These environments differ significantly in terms of colleagues, collaborators, students, research culture and socializing! Another dimension of change has been geographic: from Israel, to Pittsburgh and Washington DC (even those two are different!), to India and Bhutan. While inertia is a strong force, resisting it has so far been rewarding.

9. You are currently SRITNE Chaired Professor of Data Analytics and Associate Professor of Statistics & Information Systems at the Indian School of Business. What are the main challenges of teaching data analytics?

The trend at business schools worldwide to reduce course durations creates quite a challenge for hands-on project-based data analytics courses. At many schools courses are now run in a single mini-semester (7-8 weeks). At ISB, I teach a 20-contact-hour course over five weeks. I am a great believer in projects, which lead students to encounter what they know and what they don’t. So I had to figure out how to make time for sufficient hands-on work as well as feedback rounds with each team. This led me to experiment with a “flipped classroom” model, which has been proving quite successful in the last two years. I recorded over eight hours of video in 15-20 min segments. Students watch these before class, then join an online forum where we discuss a practical and complex article, case, or exercise. In class, we continue this discussion, which is at the level that managers would be likely to have: framing a problem, approaching it with data mining, evaluating it properly, integrating domain with data-driven knowledge. Class time is also used for team presentations and feedback.

While I believe that a longer course would benefit learning, the one advantage of the current “crash-course” mode is that it better mimics projects in industry. Students who do well in this course are likely to be better prepared for implementing or supervising such projects on the job!

Another challenge is the growing diversity of students’ backgrounds and technical strengths. The “big data” and “business analytics” buzz and market demand have been increasing enrolments into my courses. The challenge is not only teaching larger classes (60 students per section) but also more diverse audiences.

10. Your research focuses on statistical and data mining methods for modern data structures, with a focus on statistical strategy – issues related to how data analytics are used in scientific research. The main field of application is information systems (in particular, electronic commerce). What is it about these areas that fascinates you so much?

What would you do if someone locked you up in a Toys R Us store for a month? (with complimentary food and accommodation). At first, you’d probably try out all the toys. But after some time, you might start taking them apart and testing their endurance. That’s somewhat of an analogy of my path. At first I was intrigued learning about different probabilistic, statistical, and data mining methods and approaches and playing with them. I’ve now become fascinated with shaking and breaking and seeing how these tools react to realistic shocks such as new data types or sizes and users of all kinds.

Working with researchers in different disciplines, I’ve encountered perplexing moments of two types: one is what my PhD advisor Ayala Cohen calls “black holes”. These are statistical methods or approaches that are used in some disciplines but rarely taught in a statistics program. I recall one such encounter, when an information systems colleague consulted with me as “the expert in statistics” regarding a model with “mediating and moderating effects”. Terror struck as I realized I had never heard of such terms. It also took me some time to learn how to properly pronounce “endogeneity” and “operationalization”, over and above understanding what they mean beyond the technical statistical jargon (which doesn’t even exist in some cases). The second type of perplexing moment is when you realize that the statistical toolkit doesn’t work for all types of data or scenarios, and then you must be creative in adapting existing methods or developing new ones.

Data on bidding in online auctions, for instance, consist of time series of bids that are unequally-spaced with different numbers of records (bids). The distribution of bid values is somewhere between continuous and discrete, as there are some “favourite” bid amounts. Product ratings on websites such as Amazon on a scale of 1-5 stars manifest as discrete distributions of various shapes, including bimodal distributions when there’s controversy, fraud or different taste. I find it especially intriguing to investigate what happens in such cases when you continue to use the standard methods, which is what researchers and practitioners typically do. I have seen the good-old linear regression model fitted to millions of records, or even fitted to a binary outcome variable. While it is easy and natural for statisticians to dismiss such practices as flawed or ignorant, I’ve discovered that it is more interesting to try and figure out (1) why this practice exists, and (2) what really breaks down. To my surprise, in many such instances the answers turn out to lead to fresh understanding and knowledge. There’s so much we don’t know because we’ve been brainwashed to think that we do know. That’s a phenomenon that I see in many academic disciplines, especially age-old ones such as statistics, economics and even the social sciences. The two tricks are to first realize that we’ve been brainwashed and try to stay sufficiently open-minded and adventurous to test out the “what happens if?” And secondly, accept failure as part of the learning and learn from it. That’s especially difficult when “rejection” cuts incentives for inquisitiveness of untenured academics.
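
Editor's illustration: the “what really breaks down” question for a linear regression fitted to a binary outcome can be explored with a toy example. The sketch below is not taken from Dr Shmueli's research; the data are hypothetical and the helper name fit_line is invented for this illustration. It fits a one-predictor least-squares line to a 0/1 outcome and checks two things: whether the fitted values stay inside the 0-1 range, and whether a 0.5 cutoff still classifies the records correctly.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical toy data: the outcome flips from 0 to 1 halfway along x.
xs = list(range(10))
ys = [0] * 5 + [1] * 5

intercept, slope = fit_line(xs, ys)
fitted = [intercept + slope * x for x in xs]

# Fitted values that cannot be read as probabilities.
out_of_range = [p for p in fitted if p < 0 or p > 1]
# Classification with a 0.5 cutoff on the same fitted values.
correct = sum((p > 0.5) == (y == 1) for p, y in zip(fitted, ys))

print(f"fitted values outside [0, 1]: {len(out_of_range)} of {len(fitted)}")
print(f"correct classifications at cutoff 0.5: {correct} of {len(ys)}")
```

In this particular toy case some fitted values fall outside the 0-1 range, so they break down as probabilities, yet the ranking of records and the 0.5-cutoff classification are unaffected. Whether that pattern holds on real data depends on the data and the goal.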

11. In terms of your current research, what are you working on at the moment and what do you hope to achieve?

In the Toys R Us terminology, I am testing the endurance of some toys, trying to use Barbie dolls as racing cars, and building a few new toys that might have a market.

Statistically speaking, I’m studying the use of regression models commonly used in research and practice, under the scenario of large samples. What “small sample” practices are useful? Which are redundant or even harmful? What new “large sample” practices should one adopt and what do they offer? For instance, consider the common residual analysis procedures. Should we carry all these tests and evaluations out even if we have a million records? How about stepwise regression that uses p-values to discard/accept variables? Do we care about heteroskedasticity? Another area that I’ve been investigating is “small-sample visualizations” used with large samples.
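
Editor's illustration: one reason p-value-based practices deserve a second look with a million records is that the test statistic grows with the sample size even when the effect does not. The sketch below is a minimal illustration under stated assumptions, not Dr Shmueli's method: it uses the textbook identity for the slope t statistic in a one-predictor regression, t = r·sqrt(n − 2)/sqrt(1 − r²), and a normal approximation for the two-sided p-value. The correlation value and the function name slope_t_and_p are hypothetical.

```python
import math

def slope_t_and_p(r, n):
    """Slope t statistic for a one-predictor regression with sample
    correlation r and n records, plus a two-sided p-value from the normal
    approximation to the t distribution (an approximation for small n)."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p

r = 0.01  # hypothetical: the predictor explains 0.01% of the variance
for n in (100, 10_000, 1_000_000):
    t, p = slope_t_and_p(r, n)
    print(f"n={n:>9,}  R^2={r * r:.4f}  t={t:6.2f}  p={p:.3g}")
```

The R² is identical in every row; only the sample size changes. A selection rule based on p-values would discard this predictor at n = 100 and keep it at n = 1,000,000, although its practical contribution is the same in both cases.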

The “dolls as racing cars” refers to research on utilizing predictive machine learning tools for explanatory modelling. While several statistical models designed for explanatory modelling are now used for prediction, the reverse has not been the case. But can a good set of wheels on a doll give a useful car design? Two of my recent papers look at using classification and regression trees for causal-related modelling: (1) as an alternative to propensity score matching for inferring causality from observational data, and (2) as a method for detecting potential confounding variables that cause Simpson’s Paradox.

In the new toy department, there are statistical models and statistical language. I’ve been extending work on a flexible model for count data, called the COM-Poisson model, which we started at Carnegie Mellon University in 2000 with Jay Kadane, Peter Boatwright, Sharad Borle and Tom Minka. While count data are now everywhere, there are far fewer and less versatile models for count data compared with continuous data. I’ve been working with different colleagues around the globe on different extensions, from regression models to mixtures for modelling bimodality, to predictive usage. As for statistical language, Ron Kenett and I have been extending our work on InfoQ (Information Quality) to different areas, from creating a framework for reviewing journal papers to applications in various fields (with different collaborators).
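
Editor's illustration: for readers unfamiliar with the COM-Poisson model mentioned above, the sketch below assumes its standard Conway-Maxwell-Poisson form, in which P(Y = y) is proportional to λ^y / (y!)^ν. The extra parameter ν is what makes the model flexible: ν = 1 gives the ordinary Poisson distribution, ν < 1 gives over-dispersion (variance above the mean) and ν > 1 gives under-dispersion. The function names, the parameter values and the truncation of the normalising sum at 200 terms are choices made for this illustration only.

```python
import math

def com_poisson_pmf(y, lam, nu, terms=200):
    """P(Y = y) proportional to lam**y / (y!)**nu, normalised over 0..terms-1.

    The infinite normalising sum is truncated at `terms`; this is adequate
    only when the weights have decayed to essentially zero by then.
    """
    def log_weight(j):
        return j * math.log(lam) - nu * math.lgamma(j + 1)

    log_w = [log_weight(j) for j in range(terms)]
    top = max(log_w)  # subtract the maximum before exponentiating, for stability
    log_z = top + math.log(sum(math.exp(w - top) for w in log_w))
    return math.exp(log_weight(y) - log_z)

def mean_and_variance(lam, nu, terms=200):
    """Mean and variance computed directly from the truncated distribution."""
    probs = [com_poisson_pmf(y, lam, nu, terms) for y in range(terms)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

for nu in (0.5, 1.0, 2.0):
    mean, var = mean_and_variance(3.0, nu)
    print(f"nu={nu}: mean={mean:.3f}, variance={var:.3f}")
```

With ν = 1 the mean and variance coincide, as they must for a Poisson distribution; moving ν below or above 1 pulls the variance above or below the mean, which a plain Poisson model cannot represent.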

12. In 2004, you co-founded the now annual symposium Statistical Challenges in eCommerce Research. Please could you tell us more about this symposium and what can be expected from this year’s in Tel Aviv this month?

The SCECR conference is now in its 10th year. The conference brings together researchers conducting empirical research in exciting fields, mostly related to information systems. The particular topics change with new technologies: earlier, eCommerce was a hot research topic (thanks to the huge amounts of data made available on sites such as eBay and Amazon). More recently, social networks and text analytics have become prominent due to data available from Twitter, Facebook and other social media websites. Another interesting recent trend is large-sample studies that rely not on observational data but on large-scale online experiments on websites such as LinkedIn or online dating sites. SCECR (pronounced s-KE-cur) is exciting because of the immediacy and relevance of the work presented. It is also an enthusiastic and open-minded community that embraces a variety of approaches and innovation. We are right in the midst of the SCECR 2014 preparations, and based on the large number of high-quality submissions, we expect an interesting and dynamic meeting. Plus, the food in Israel is terrific and the beach in Tel Aviv is spectacular!

13. Are there people or events that have been influential in your career?

There are so many people to whom I am grateful for their mentorship, friendship, support and kindness throughout my career. As an independent and sometimes stubborn person with unusual ideas, I was lucky to have multiple open-minded and kind mentors and colleagues. Let me mention some of the main figures, but there are many more that I am indebted to.

My MSc and PhD advisor at the Technion, Ayala Cohen, was the one who inspired me to be innovative and fearless. She knew just the right amount of supervision to lead me in interesting directions, while allowing me to carve my own path. She is also the one who lit my writing fire. I recall one day finding her in front of her locked office door, reading a chapter of my dissertation which I had just sent to her. When I gently said hello, she jumped: “Oh! I was so engaged in your writing that I stopped in front of my door”. Engaging? Because I’ve always loved roaming the statistics & probability areas of the library, pulling out books at random, I appreciate the difficulty in writing clearly and more importantly, engagingly.

After graduating, I enjoyed two years at Carnegie Mellon’s statistics department as a Visiting Assistant Professor. I was incredibly lucky to work with two icons – Steve Fienberg and Jay Kadane – on two very different research projects. I learned about the differences between methodological and applied research; about collaborating with non-statisticians in computer science, marketing, medicine, bioinformatics, and more. I owe much of my growth in terms of publishing and developing new areas of cross-disciplinary research to these two mentors.

At University of Maryland, I worked closely with Wolfgang Jank, a statistician with a thirst for exciting application areas and a great sense of humor. We quickly discovered that working together, besides being fun, was leading to innovations. Our ability to work closely on so many projects, with high energy levels and excitement has been an incredible experience, which I cherish and appreciate. Another close colleague and friend is Ravi Bapna, who I first met serendipitously at a Washington DC café. Ravi is an information systems researcher at University of Minnesota, and the one who introduced me to the information systems community, one of the best turns in my career. His integration of creativity with solid methodology has been inspiring! We’ve collaborated on several research projects over the years and I hope we’ll have many more. Ravi was also the first to use our Data Mining book in the classroom and luckily he liked it enough to spread the word.

Peter Bruce, co-author of our Data Mining for Business Intelligence textbook and president of Statistics.com, has been a wonderful colleague and friend. He’s been supportive of my fire to re-design online data analytics courses using cutting edge technology, allowing me to launch a few such courses on Statistics.com. Way before the MOOC craze, he enabled me to see the value of (properly designed) online learning.

My students at the Technion, CMU, Statistics.com, UMD and ISB have shaped my career in many ways. I learned how to better communicate with a variety of audiences, and more importantly, they showed me how creative one can get when considering data analytics in the real world.

And last, but most important, has been the influence of my husband, Boaz Shmueli. He’s been the fire lighting my energy, motivation and critical thinking. He’s been my partner in making the Practical Analytics series a reality as well as our educational technology endeavors in Bhutan; he’s kept me on top of technology, has encouraged me to go outside the comfort zone, and has been the biggest fan of my work.

 

Copyright: Photograph appears courtesy of Dr Shmueli