Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining – An interview with co-author Simon Munzert

This month, Wiley are proud to publish Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining by Simon Munzert, Christian Rubba, Peter Meissner and Dominic Nyhuis.

The rapid growth of the World Wide Web has opened many opportunities in collecting, sharing and publishing data of all kinds. This book shows how to collect and post-process this data with the most popular and easy-to-use statistical programming language, R. It provides a hands-on guide to web scraping and text mining for both beginners and experienced users, featuring examples throughout that explain each of the techniques presented. Fundamental concepts of the main architecture of the Web and databases are discussed along with coverage of HTTP, HTML, XML, JSON, JavaScript and SQL.

• Presents a practical guide to web scraping and text mining for both beginners and experienced users of R.
• Explores basic techniques to query web documents and data sets (XPath and regular expressions) as well as technologies to gather information from dynamic HTML (Selenium).
• Demonstrates how to connect to web services/web APIs and collect data in a regular manner.
• Provides a practical perspective on the workflow of data scraping and management, from choosing the right method to optimizing code and maintaining scrapers, and features case studies throughout along with examples for each technique presented.
• Provides a multitude of exercises to guide the reader through each technique.
• Supported by a website featuring R code and answers to questions posed in the text along with inspiring case studies.

Automated Data Collection with R is a must-have for applied social scientists needing to upgrade their data collection strategies. Both students and researchers will learn to efficiently apply a variety of techniques for conducting data analysis with minimal pre-existing knowledge of R.

Statistics Views talks to co-author Simon Munzert about this exciting new collaboration.

1. Congratulations to you and your co-authors on the upcoming publication of your book, Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining, which is a hands-on guide to web scraping and text mining for both beginners and experienced users of R. How did the writing process begin?

Thanks! This is our first comprehensive book project, but I guess that the actual writing process began just as usual – not before much of the coordination and planning work had been settled. The idea of the project itself was born during a seminar I taught at the University of Konstanz which dealt with technologies to tap web data sources. During the preparation of the course, I noticed that information on web data collection techniques with R is spread around the Web, mostly in the form of ad hoc tutorials and outdated data sources. I finally asked Christian, Peter and Dominic – birds of a feather – if they wanted to join and develop a comprehensive guide on the topic. The rest was around one and a half years of conceptualizing the contents, coordinating the work, coding, writing, and proofreading.

2. Who should read the book and why?

If you have identified online data as an appropriate resource for your project, it might be a good decision to automate the data collection procedure, especially if you plan to update your databases regularly, if the collection task is non-trivial in terms of scope and complexity, and if you want others to be able to replicate your data collection process. Ideally, the techniques presented in this book enable you to create powerful collections of existing, but unstructured or unsorted data no one has analyzed before at very reasonable cost.

I would describe our book as introductory concerning basic technologies of the Web, string processing, database and web document querying, and text mining techniques. It is advanced concerning the technologies we offer for scraping data from the Web in various scenarios with R, and yet general enough to attract attention from academics and practitioners who are interested in statistically processing or just visualizing new sources of data. We wrote the book for people with some familiarity with R, but not necessarily for an academic audience alone, as we do not assume any statistical or methodological foreknowledge and provide both academic and non-academic examples and case studies. In our experience, there is an increasing demand for web data on one side and for an easy-to-read book aimed at non-computer scientists on the other.

The willingness, however, to learn yet another programming language classically used to scrape web data is usually low. As R’s popularity is exploding in so many different fields, we think that sticking to this environment is a good choice for a large audience. However, whether R is the right software choice certainly depends on the environment the reader is working in. In general I think there is no reason for you to dogmatically stick to one piece of software if the majority of your peers work with another. We advocate the use of R because it turns out to be one of the most widely used programs in the data science business, with its popularity still growing remarkably fast. What we like especially about R is that it is the entire package from start to finish. As an aspiring statistician, researcher or business analyst, you are likely not a dedicated programmer, but hold a substantive interest in a topic or specific data source that you want to work with. In this case learning another language will not pay off but rather prevent you from working on your research. R is useful in so many stages of the research process, from the collection of data to data management and analysis tasks and even for visualization and publication of results.


3. Importantly, the book also contains a multitude of exercises, and examples are presented to illustrate each technique. R code and solutions to the exercises presented are due to be featured on the book’s supporting website. In this age of Big Data, when the need to upgrade data collection techniques is more vital than ever, was it always your objective to include a practical side to the book?

From the very beginning of the project we planned to make it accessible to autodidacts with little or no knowledge of web technologies. Although R is our primary tool of choice for scraping tasks, web scraping skills require knowledge about other languages and data formats as well. Just as you have to learn grammar and vocabulary again and again to become good at speaking a foreign language, the only way to become a good scraping practitioner is to apply these techniques over and over again. We think that this can best be done by trying out many of the examples in the book oneself. Developing exercises for the book was at times a challenge though, because the data we are working with in real-life scenarios are a moving target. While exercises in math textbooks should be valid forever, tasks on web data collection stand on shakier ground, as websites are continuously modified or even disappear. One of the drawbacks of many online tutorials on web scraping is that they are quickly outdated after they have been written. Therefore we set up a website and developed exercises that are fairly stable and reproducible over a longer period of time.

4. What is it about the area of data collection that fascinates you?

I guess I’m a rather skeptical person. Data help to challenge claims and give guidance for all kinds of decisions. This is old hat, of course – observational or survey data have been collected by governments or other institutions for ages. What is a rather new phenomenon is that with the rapid growth of the World Wide Web virtually everybody has access to any kind of information. Just think of Googling, which is basically no more than a simple form of data collection to answer rather simple questions. If questions become more sophisticated, collecting and analyzing data from other sources can be the key to new insights. Overall I think it is the fact that we are able to give educated answers to sophisticated questions much faster than we could some years ago. However, I want to make no pretense of the fact that for me, along with the fascination for the potential of automated data collection, also comes a pinch of pessimism. I think that the honeymoon with uncritical use of alternative data sources is drawing to an end. We have to be more aware of the origins of data, who generated them and how, and what restrictions are imposed by these data features on the validity of our conclusions. While this aspect of modern data collection is not at the centre of our book, we emphasize how important it is to think about data quality issues in advance of every data collection project.

5. Why is this book of particular interest now?

We have observed that the rapid growth of the World Wide Web over the past two decades tremendously changed the way we share, collect and publish data. Firms, public institutions and private users provide every imaginable type of information and new channels of communication generate vast amounts of data on human behaviour. Nowadays traditional techniques for collecting and analyzing data may no longer suffice to overcome the tangled masses of data. One consequence of the need to make sense of such data has been the inception of “data scientists” who sift through data and are greatly sought after by research and business alike. During the past few years, the market for books on data science/mining has emerged quickly. What is surprisingly often missing in these introductions though is how data for data science applications are actually acquired. In this sense, our book serves as a preparatory step for data analyses but also provides guidance on how to manage available information and keep it up to date.

6. What were your main objectives during the writing process? What did you set out to achieve in reaching your readers?

We all learnt the techniques by ourselves over the years, very unsystematically and often through learning by doing. When developing the structure of the book, our question was what to learn first – the fundamentals, the R workflow, or just learning by doing based on case studies? In the end, we chose an approach which appeared intuitive to us: start with the basics, learn the application tools in R, and finally get inspired by real-life examples. The selection of examples was another important decision to make. We all share a common educational background in the social sciences, so we first thought about targeting this group only. But when talking about web data collection and text processing, there was actually no reason to confine ourselves to one specific area. As a matter of fact, the tools for web data collection procedures which we present in the book are agnostic about the type of data to be collected. Therefore we did not want to bore a large group of potential readers with yet another political science example. The Web is full of incredibly attractive and new information, and we picked just a tiny fraction of it to demonstrate what to do with these data masses.


7. Were there areas of the book that you found more challenging to write, and if so, why?

The first part of the book which covers fundamentals in web and data technologies was by far the most difficult to write. There are tons of excellent books already written on each of the topics we discuss. The whole trick was to pick the basics which are important to know for web scraping and text processing purposes and to cut out the irrelevant parts – without creating any obvious gaps. Writing the other parts was much easier – a bit like telling tales out of Scraping Land and of our day-to-day business.

8. What will be your next book-length undertaking?

According to the “build a house, plant a tree, father a son” phrase, writing the book did not get us anywhere (although at least one of us completed one of the tasks during the project), so there is still much work to do! Seriously, we have devoted ourselves excessively to this project over the last months and let other things slide. We are still all in the middle of our PhD projects and will try to bring them to a good end.

9. Please could you tell us more about your educational background and what was it that brought you to recognize statistics as a discipline in the first place?

I can pretty much speak for all of us, as we all share a very similar background. We are trained political scientists and have employed all sorts of statistical techniques to deal with various research problems. In the social sciences you are constantly being confronted with nebulous concepts and messy data. If there is no way to gather cleaner data, which often has been and continues to be the usual case, statistical methods provide an important tool to account and correct for measurement error. Still, a complement to correcting data flaws post hoc is to look for more appropriate alternative data sources in advance. This is where statistics and data collection strategies come together.

10. Are there people or events that have been influential in your career?

Crawling around in the lowlands of PhD Land, there has not been much of a career yet. But with regard to the fact that I started studying political science without knowing that it could have anything to do with numbers, most certainly important things happened to get me on track. Thinking about it for a while, I guess that I have been very lucky with my academic teachers and mentors. I have been (and continue to be) fascinated by survey data from the very beginning of my studies, and a grumpy statistics teacher who also happened to be a passionate survey researcher (and who shall not be named) certainly added to this. A person to whom I owe a lot with regard to the way I think, academically, is Peter Selb. Years ago he convinced me that there are other interesting areas than just survey methodology (and became a professor for survey research himself shortly afterwards). He also was the person who drew my attention to intuitive model building and the use of alternative data sources. To name another important event, the first encounter with R was anything but love at first sight. I remember trying to work through a badly written online manual during my Bachelor studies “just for fun” and throwing it in the corner a few hours later, disillusioned and frustrated. I very much hope nothing similar will happen to the readers of our book. After all it’s hardcover and quite a missile…

Copyright: Image appears courtesy of Professor Munzert