Text Mining in Practice with R: An interview with author Ted Kwartler

This summer, Wiley was proud to publish Text Mining in Practice with R, a reliable, cost-effective approach to extracting priceless business information from all sources of text.

Excavating actionable business insights from data is a complex undertaking, and that complexity is magnified by an order of magnitude when the focus is on documents and other text information. This book takes a practical, hands-on approach to teaching you a reliable, cost-effective way of mining the vast, untold riches buried within all forms of text using R.

Author Ted Kwartler, a data science instructor at DataCamp.com, clearly describes all of the tools needed to perform text mining and shows you how to use them to identify practical business applications to get your creative text mining efforts started right away. With the help of numerous real-world examples and case studies from industries ranging from healthcare to entertainment to telecommunications, he demonstrates how to execute an array of text mining processes and functions, including sentiment scoring, topic modelling, predictive modelling, extracting clickbait from headlines, and more. You’ll learn how to:

• Identify actionable social media posts to improve customer service
• Use text mining in HR to identify candidate perceptions of an organisation, match job descriptions with resumes, and more
• Extract priceless information from virtually all digital and print sources, including the news media, social media sites, PDFs, and even JPEG and GIF image files
• Make text mining an integral component of marketing in order to identify brand evangelists, impact customer propensity modelling, and much more

Most companies’ data mining efforts focus almost exclusively on numerical and categorical data, while text remains a largely untapped resource. Especially in a global marketplace where being first to identify and respond to customer needs and expectations imparts an unbeatable competitive advantage, text represents a source of immense potential value. Unfortunately, until now there has been no reliable, cost-effective technology for extracting analytical insights from the huge and ever-growing volume of text available online and in other digital sources, as well as from paper documents.

1. Congratulations on the recent publication of your book Text Mining in Practice with R. How did the writing process begin?

I had been giving text mining workshops at data science conferences and was approached by a publisher after one such talk. Whilst I was flattered, I had never thought I could write a book. However, after some lengthy thought about the target audience, prerequisites and a working table of contents, I decided to write three chapters, largely to prove to myself that I was serious about the project. So it began with a lot of prep work for me.

2. What were your main objectives during the writing process?

Throughout the project I wanted to stay true to my target audience. My expected audience weren’t mathematicians or linguists. Instead, I set out to write a book, mostly in plain English, that was loaded with case studies illustrating the practical applications of text mining. I wanted to write a book that made it easier to learn these concepts than it had been for me.

3. Throughout the book, you clearly describe all of the tools needed to perform text mining and show how to use them to identify practical business applications. You employ numerous real-world examples and case studies from industries ranging from healthcare to entertainment to telecommunications and demonstrate how to execute an array of text mining processes and functions, including sentiment scoring, topic modelling, predictive modelling, extracting clickbait from headlines, and more. Was it always your intention to write the book with this hands-on approach?

This was exactly my intention. I am a bit of a different data scientist in that I am not a PhD physicist or similar but instead have a graduate business degree. This has affected my perspective on data science projects, particularly text mining applications. I spent a lot of time finding real text mining use cases with messy data. Examples of perfect algorithms do not occur outside of textbooks, so I thought my exercises should likewise have a real, sometimes messy, outcome.

4. If there is one piece of information or advice that you would want your reader to take away and remember after reading your book, what would that be?

Remember that identifying objectives and planning the analytical project are paramount to success. Planning is certainly not as interesting as building models or reviewing outcomes, but it is foundational nonetheless. In the book, I outline a workflow I often use, but there are others. The point is that upfront organisation is key.

5. Who should read the book and why?

I think my book is accessible to many people. New data scientists would benefit from my relaxed style and inviting language. More seasoned data scientists looking to add to their skill set would find my book appealing because it covers so many aspects of text mining and natural language processing. Even business-minded decision makers charged with leading text mining data science projects could benefit by learning the multitude of ways text mining can be used in a business setting.

6. Why is this book of particular interest now?

Interestingly, natural language processing is becoming more prevalent in our daily lives. Virtual assistants like Amazon Echo perform audio transcription followed by text mining to benefit the user. In addition to the emerging technology spaces that rely on text mining, the “traditional” Internet, with its many text-based channels and voluminous data streams, requires some form of natural language processing.

7. Were there areas of the book that you found more challenging to write, and if so, why?

As a first-time author, I found many aspects challenging. The chapter on text-to-vector R code was particularly tough. The package author used a code structure that was new to me. I had to immerse myself in the math behind “text2vec” and find a compelling use case for the book, all with the intent of distilling these difficult concepts and code into manageable, digestible nuggets. Even now, I could probably review that chapter again and find opportunities for improvement!

8. What is it about this area of statistics that fascinates you?

As my wife says, “It’s the math of talking, and those are your two favourite things.” In all seriousness, I like to understand how language choice affects outcomes. Language is very powerful, diverse and sometimes difficult. Text mining is a qualitative, challenging, mathematical problem.

9. What will be your next book-length undertaking?

I am taking a much-needed break from writing at the moment. However, I am hoping to complete either a sports analytics book or a business-minded book for executives working with data scientists. Sports analytics is a fun and rewarding topic with easy access to data. On the other hand, the market needs a book that helps executives extract the most value from their data science initiatives.

10. What was it that introduced you to statistics as a discipline and what was it that led you to pursue the field as a career?

I was exposed to machine learning at graduate school. That spring I used simple algorithms to pick winners of a US collegiate basketball tournament. I started to understand how these algorithms are tools for solving riddles trapped in tables of data. Later, I came to enjoy text mining the first time I made a word cloud of my Twitter feed. Shortly thereafter, I saw text algorithms in action at Amazon.com while working in the social media customer service department. It was there that this work transitioned from a hobby to a professional passion.