Developing Analytic Talent in an Age of Big Data

Features

  • Author: Statistics Views
  • Date: 12 Aug 2014

In April, Wiley was proud to publish Developing Analytic Talent: Becoming a Data Scientist by Vincent Granville.

Harvard Business Review calls it the sexiest tech job of the 21st century. Data scientists are in demand, and this unique book shows you exactly what employers want and the skill set that separates the quality data scientist from other talented IT professionals. Data science involves extracting, creating, and processing data to turn it into business value. With over 15 years of big data, predictive modeling, and business analytics experience, author Vincent Granville is no stranger to data science. In this one-of-a-kind guide, he provides insight into the essential data science skills, such as statistics and visualization techniques, and covers everything from analytical recipes and data science tricks to common job interview questions, sample resumes, and source code.

The applications are endless and varied: automatically detecting spam and plagiarism, optimizing bid prices in keyword advertising, identifying new molecules to fight cancer, assessing the risk of meteorite impact. Complete with case studies, this book is a must, whether you're looking to become a data scientist or to hire one. The book:

  • Explains the finer points of data science, the required skills, and how to acquire them, including analytical recipes, standard rules, source code, and a dictionary of terms
  • Shows what companies are looking for and how the growing importance of big data has increased the demand for data scientists
  • Features job interview questions, sample resumes, salary surveys, and examples of job ads
  • Includes case studies that explore how data science is used on Wall Street, in botnet detection, for online advertising, and in many other business-critical situations

Developing Analytic Talent: Becoming a Data Scientist is essential reading for those aspiring to this hot career choice and for employers seeking the best candidates. Statistics Views talks to author Dr Vincent Granville, a data scientist with 15 years of big data, predictive modeling, and business analytics experience. He is the co-founder of Data Science Central, which includes a robust editorial platform, social interaction, forum-based technical support, the latest in technology tools and trends, and industry job opportunities.


1. Congratulations on the publication of your book, Developing Analytic Talent: Becoming a Data Scientist, which helps one learn the skills needed for the most in-demand tech job. Data scientists are in demand, and this unique book shows you exactly what employers want and the skill set that separates the quality data scientist from other talented IT professionals. Data science involves extracting, creating, and processing data to turn it into business value. This guide discusses the essential skills, such as statistics and visualization techniques. How did the writing process begin?

It started as a series of blog postings (data science recipes, career advice, new data science training and certifications, conference announcements, salary surveys) dating back to 2007, on www.DataScienceCentral.com and www.AnalyticBridge.com.

2. Who should read the book and why?

Professionals with a background in machine learning, operations research, computer science, non-traditional statistics, data mining, artificial intelligence, and other analytic fields (quants, for instance) are the most likely to benefit from the book and to understand all of its parts. We’ve also received great feedback from hiring managers, analytic-savvy executives and even people from less technical fields such as law, econometrics, computational chemistry or healthcare.

3. The guide is extremely varied, covering everything from analytical recipes and data science tricks to common job interview questions, sample resumes, and source code. Was making its appeal so broad one of your original objectives in writing this guide?

The idea was to introduce core data science - new intellectual property mostly - sometimes at an advanced level (though written in simple English); provide a significant amount of resources about this emerging field; and essentially produce a book that offers a vast amount of truly new, original methodology tested on real Big Data, based on my 20 years of experience in various industries and in companies both small and big. I wished to include as many Big Data success stories as possible that could be replicated by a reader willing to go the extra mile to follow my reasoning. In some ways it is an anti-book: unlike the 200+ statistical books that I have in my bookcase (most of them bought in the last 5 years), it is not - and I mean not at all - about the same traditional techniques that have been published and discussed at length, over and over, for the last 30 years by traditional authors, who typically discuss small data with small R code snippets and an occasional reference to collaborative filtering, Twitter, Python (by the way, a great language worth learning, with the Pandas library for data science) or recommendation engines.

These techniques, which you won’t find in my book (except to show how approximations do better, eliminate over-fitting and are more robust), include logistic regression, support vector machines, k-NN, hierarchical clustering, and naïve Bayes. If anything, the limitations of such techniques are discussed in this book. Finally, this book is just a starting point for my next endeavour: automating data science, statistical analysis, and exploratory data analysis.

This will be done by using very robust algorithms - you can google “data science Jack-knife regression” to get an idea of where I am going now. But preliminary robust “black-box” techniques such as hidden decision trees, model-free (data-driven) confidence intervals, fast combinatorial optimization of multivariate features using a new predictive power metric to measure lift (at the core of any machine learning algorithm), data dictionaries, and clustering / categorization of very large sparse text data (to name a few) are already discussed - in some detail - in this first book.
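
Editor's illustration: to give a rough idea of what a model-free (data-driven) confidence interval can look like, here is a minimal sketch using the standard percentile bootstrap. This is an assumption chosen for illustration, not the method described in the book; the function name `bootstrap_ci` and the `clicks_per_day` figures are hypothetical.

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.mean, n_resamples=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for stat(values).

    No distributional model is assumed: the interval is read directly
    from the spread of the statistic over resampled copies of the data.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Recompute the statistic on n_resamples samples drawn with replacement.
    estimates = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lo_idx = int(alpha * n_resamples)
    hi_idx = min(n_resamples - 1, int((1.0 - alpha) * n_resamples))
    return estimates[lo_idx], estimates[hi_idx]


# Made-up daily click counts, purely for illustration.
clicks_per_day = [12, 15, 9, 22, 17, 14, 30, 11, 16, 13]
low, high = bootstrap_ci(clicks_per_day)
print(f"95% interval for the mean: [{low:.2f}, {high:.2f}]")
```

The interval is obtained by resampling the observed data only, which is the sense in which such an approach is "data-driven" rather than tied to a parametric model.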

4. The applications involved in the book are also varied: optimizing bid prices in keyword advertising, identifying new molecules to fight cancer, assessing the risk of meteorite impact, steganography. Was this to illustrate the breadth to which data science can be applied?

Yes, and also to show success stories - that data science can really work if correctly applied using the right data sets and techniques. But also to show potential data scientists new ways to make money, for instance by selling data or designing an API that makes forecasts.

5. The guide also includes case studies, which explore how data science is used on Wall Street, in botnet detection, for online advertising, and in many other business-critical situations. Where did these case studies originate?

These case studies come from my work experience at companies such as eBay, Microsoft, Wells Fargo, Visa and large ad networks. One of my success stories is uncovering very large botnets causing tens of millions of dollars in click fraud per year, and very recently (after the book was published) unearthing a vast network of criminal activity infesting Amazon AWS. It resulted in a significant proportion of co-located Amazon AWS users (sharing the same volatile IP address) being blacklisted because of their AWS “neighbours” engaging in criminal activity, including spamming (mostly for identity theft) and click fraud.

6. What is it about the area of data science that fascinates you?

It is an area where you can still beat competitors by outsmarting them, using better data, innovative algorithms, and what I usually call (legitimate) growth, business or data hacking.

7. What will be your next book-length undertaking?

A book on automated data science.

8. Please could you tell us more about your educational background and what it was that brought you to recognise statistics as a discipline in the first place?

I earned my Ph.D. in computational statistics in 1995 in Belgium, moved to Cambridge University (UK) and the National Institute of Statistical Sciences (North Carolina, US) to pursue a post-doc with Professor Richard L Smith, then moved to industry. During my Ph.D. years in Belgium, I was very lucky to have a mentor very focused on practical problems, such as digging as few wells as possible to detect the area of an oil field (using sophisticated convex analysis), or predicting extreme floods using ground numerical models stored in a hierarchical database - maybe one of the first examples of a NoSQL database, back in 1990.

9. You are a data scientist yourself with 15 years of Big Data, predictive modeling, and business analytics experience; and also the co-founder of Data Science Central. Please could you tell us more about Data Science Central?

Data Science Central (which also includes Big Data News, Hadoop360, and AnalyticBridge) is the leading community for data science and Big Data practitioners, providing numerous resources: new books, salary surveys, training, certifications, papers, jobs, webinars, events (and even cartoons and polls) and the opportunity to network with peers and boost your career. Altogether, we have more than 300,000 monthly uniques, and we manage the largest LinkedIn group about analytics. We are also a very small team leveraging automation, computational marketing, lean start-up philosophy, and outsourcing to vendors; thus we offer very competitive prices to our clients, as we have far less overhead than our competitors.

10. What is the last piece of data you made for fun?

An attempt at factoring the product of two very large prime numbers, to reverse-engineer the algorithms that make financial transactions secure. Also: using continued fractions instead of linear regression, to make predictive modelling more robust. There are also my data videos, including one based on simulations that generate moving data points to simulate belly dancing!
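
Editor's illustration: the factoring problem mentioned above can be sketched with Fermat's classic difference-of-squares method. This is only an assumption chosen to illustrate the problem, not the approach the author attempted; it is fast when the two primes are close together and hopelessly slow for the very large, well-separated primes used in practice. The function name `fermat_factor` is hypothetical.

```python
import math


def fermat_factor(n):
    """Split a composite n into two factors via Fermat's method.

    Searches for a such that a*a - n is a perfect square b*b,
    which gives n = (a - b) * (a + b).
    """
    if n % 2 == 0:
        return 2, n // 2
    a = math.isqrt(n)
    if a * a < n:
        a += 1  # start at the ceiling of sqrt(n)
    while True:
        b_squared = a * a - n
        b = math.isqrt(b_squared)
        if b * b == b_squared:
            return a - b, a + b
        a += 1


# 10007 and 10009 are both prime; their product is split in one step
# because the two factors are so close together.
print(fermat_factor(10007 * 10009))
```

The number of loop iterations grows with the gap between the two factors, which is one reason cryptographic keys avoid primes that are close to each other.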

11. What is your favourite algorithm?

Hidden decision trees, but I guess that is because I created and prototyped the technique. It is very simple and scalable, it has been used by ad networks to score billions of clicks, IP addresses and keywords, and I developed some patents around preliminary versions.

12. Which statistical techniques are most commonly used at Data Science Central?

It’s more than just techniques; it’s a new way to look at data, such as segmenting users by ISP rather than traditional segments, to optimize delivery and operations. It is also about assessing the quality of the traffic that we deliver - making sure that our clients get genuine, relevant traffic. We don’t want to become another Amazon delivering likes and reviews that many are questioning, thus we are very careful about spam detection, relevancy and traffic quality. Indeed we mention a simple, efficient ad relevancy optimization technique in our book.


This website is provided by John Wiley & Sons Limited, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ (Company No: 00641132, VAT No: 376766987)

Published features on StatisticsViews.com are checked for statistical accuracy by a panel from the European Network for Business and Industrial Statistics (ENBIS), to whom Wiley and StatisticsViews.com express their gratitude. The panel members are: Ron Kenett, David Steinberg, Shirley Coleman, Irena Ograjenšek, Fabrizio Ruggeri, Rainer Göb, Philippe Castagliola, Xavier Tort-Martorell, Bart De Ketelaere, Antonio Pievatolo, Martina Vandebroek, Lance Mitchell, Gilbert Saporta, Helmut Waldl and Stelios Psarakis.