Five Fundamental Concepts of Data Science
- Author: Kirk Borne
- Date: 11 Nov 2013
- Copyright: Image appears courtesy of iStockPhoto. Figure 1 copyright of Kirk Borne
Data Science is like….
How many times have you seen an article start this way? Now, you have just seen another one! But I want to avoid completing that sentence, and consequently I will skip definitions and analogies here, even though they are very important and my colleagues (and students) know that I indulge in analogies profusely when attempting to explain anything. Instead, the focus here will be on fundamental concepts that are important to all students and practitioners of Data Science.
Let me begin with a joke (which is not quite an “analogy”): A rich man wanted to invest in breeding and training a class of racing horses that would win as many races as possible. He decided to sponsor the research of three world-class scientists to fulfill his dream: a physiologist, a biochemist, and a physicist. After suitable time spent in research and development, the physiologist was called in to explain her solution to the problem. She presented a comprehensive daily exercise regimen that would guarantee strong, fast, high-endurance horses, if the horses were trained from their youth according to her plans. The rich man congratulated her, thanked her, and paid her for her solution. Next, the biochemist was called in to explain his solution to the problem. He presented complete diet plans that the horses should follow from infancy into adulthood, including pre-race, race-day, and post-race meals. He guaranteed that these dietary guidelines would produce strong, fast, high-endurance horses. The rich man congratulated him, thanked him, and paid him for his solution. Finally, the physicist was called in to explain her solution to the problem. The rich man was looking forward to her solution, since physicists are born problem-solvers and she would certainly have an awesome solution. She began, “Assume a spherical horse….”
What went wrong here? The third scientist assumed that an overly simplistic model would properly place a horse into the specific category of “fast race horses,” whereas the first two scientists understood that this was an inherently multi-dimensional (multivariate) problem, as could be demonstrated by spending some quality time with a book on factor analysis (1).
Nevertheless, despite different solutions, all three scientists did start off wisely by following the first principle of data science: Begin with the end in mind! This concept is fundamental to science, engineering, design, business, education, healthcare, security, financial planning, sports, and perhaps every domain of human activity. Likewise, whenever we undertake a big data analytics (data science) task or project, we should ask: What is the goal? What are we trying to achieve? How do we know if we are successful? If possible, we should quantify these end-goals with metrics – measurable outcomes, with some estimate of the “success threshold.” Furthermore, knowledge of our end-goal will often be the key stimulus for selecting the right ingredients for the project: recruiting the right team, choosing the correct data sets, selecting which features from the data should be analyzed, and identifying which algorithm(s) should be used. Often, data mining is described pejoratively, and rightly so when the practitioners use it as a “fishing expedition” to see what turns up. While some unsupervised data exploration is essential (to guarantee that we don’t miss “what the data are telling us” and to find all possible patterns, trends, correlations, and clusters in the data set), nevertheless we should be explicit up front if that is what we are aiming to achieve (2). On the other hand, our end-goal (especially in the domains named above) is usually much more explicit: sell more goods, keep our customers happy, discover a drug therapy for some disease, design a robust functional product, discover the characteristics of some new scientific phenomenon, win the America’s Cup (3), or learn how to breed and train winning race horses.
Implicit in the above statements is the second fundamental concept of data science: Know your data! In order to know which data will be best for a project, and which features to select, we clearly must know our data. But I mean something more than that. I mean something that is better called “Data Profiling.” In the process of data profiling, we examine many aspects of the data: min/max values, aggregate values (such as mean, median, sum,…), the list of distinct data values (if we are working with defined discrete data attributes), data histograms and distribution parameters (quartiles, deciles,…), physical units, scale factors, interdependencies (e.g., derived parameters, such as C=B/A, where A, B, and C might all be included in the data set), missing values, NULL values, indices (used to ID the data object, but not a property of the object), and more. If you are working with labeled data (for a classification, predictive analytics, or supervised learning project), then it is imperative to identify which data attribute is the class label or predicted variable. Another aspect of the “know your data” concept is to remember to focus on actionable data (i.e., parsimonious data models are preferred, sometimes referred to as Occam’s Razor, or Einstein’s rule: “Models should be made as simple as possible, but no simpler” – avoid the spherical horse!). By focusing on the data elements and output variables that inform, guide, and provide insights regarding your end-goal, you are consequently kept on a path where distractions and the noise-to-signal ratio are reduced.
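The profiling steps listed above can be sketched in a few lines of code. This is a minimal illustration using only the Python standard library; the column values and the exact set of summary statistics are my own illustrative choices, not from the article:

```python
# A minimal data-profiling sketch using only the Python standard library.
# The column values below are illustrative, not from any real data set.
import statistics

def profile(values):
    """Summarize one data column: extremes, aggregates, quartiles,
    distinct values, and missing (None) entries."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),  # NULL/missing values
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "quartiles": statistics.quantiles(present, n=4),  # Q1, Q2, Q3
        "distinct": len(set(present)),  # distinct-value count
    }

column = [3.2, 4.1, None, 5.0, 4.1, 6.7, 2.9, None, 4.4]
report = profile(column)
print(report["missing"], report["distinct"], report["median"])
```

In practice one would run such a profile over every column of the data set (and add physical units, scale factors, and known interdependencies by hand), building the kind of data familiarity this concept demands before any modeling begins.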
The activities of our three scientists have correctly satisfied the third fundamental concept of data science: Remember that this *is* science! In other words, we must remember that we are experimenting with data selections, data combinations, algorithms, combinations (ensembles) of algorithms, success metrics, accuracy measures, and more. All of these items should, at some point, be tested for their validity and applicability to the problem that you are trying to solve. We may know from past experience that a certain combination of data, features, and algorithms will satisfy our needs, but even that past experience was learned (not guessed). Remember this aphorism: “Good judgment comes from experience, and experience comes from bad judgment.” So, good choices for the experimental components of your data science project are also learned from experience, especially from failed projects (bad experience). Furthermore, science is a process, involving observation, inference, hypothesis generation, experimental design, data collection, hypothesis testing, error estimation, and hypothesis refinement. By following these steps sequentially and cyclically (as much as needed to reduce errors and to optimize accuracy), we are likely to avoid biases and traps that lead us to fallacious conclusions.
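The experimental attitude described above can be made concrete in miniature: rather than choosing a model by habit, we compare candidates by measured error on held-out data. The toy predictors and numbers below are my own illustration, not the author's:

```python
# A toy illustration of treating model choice as an experiment: candidate
# predictors are compared by measured error on held-out data, rather than
# chosen by guesswork or past habit.
import statistics

train = [2.0, 2.1, 1.9, 2.0, 9.0]   # one outlier skews the mean
test  = [2.0, 2.2, 1.8, 2.1]        # held-out data for hypothesis testing

# Two competing "hypotheses" about the best single-value predictor:
candidates = {
    "mean":   statistics.mean(train),
    "median": statistics.median(train),
}

def mse(prediction, data):
    """Mean squared error of a constant prediction on a data sample."""
    return sum((x - prediction) ** 2 for x in data) / len(data)

errors = {name: mse(pred, test) for name, pred in candidates.items()}
best = min(errors, key=errors.get)
print(best)  # on this skewed sample, the median predictor wins
```

The point is the process, not the predictors: each candidate is a hypothesis, the held-out error is the experiment, and the choice is learned from evidence rather than guessed.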
The fourth fundamental concept of data science is this: Data are never perfect, but love your data anyway! This is potentially the most challenging and the most rewarding principle to follow. We often assume that the only good data are perfectly clean and normally distributed data. The fact that the real world rarely delivers us data on such a silver platter should be reason enough to look for the silver lining in our data, not the silver platter. The silver lining, for me, is the fact that the anomalies, long tail, asymmetries, and other “warts” in the data are often telling us something very important about the domain that we are studying and/or about the objects within that domain. For example, outliers are frequently dismissed and clipped from the data, especially in some scientific disciplines that I know and love. This is fine if you can be certain that these points are simply noise or artifacts in the data. However, what if those outliers represent a totally new class of object or a new type of behavior?
Figure 1: This data histogram from a hypothetical big data collection reveals numerous peaks, valleys, and tails in the distribution. Each of those features of the histogram provides potentially valuable insights into the population.
Therefore, I prefer to call outlier detection by a better (more data-loving) name: Surprise Discovery (4). The unexpected features of your data (those that don’t follow the norm) are the things that are novel, interesting, and surprising. Spend some time with those features: the long tail, the QQ-plot (5), the outliers, and other features (such as multi-modal data distributions). Learn from those diverse characteristics in your data (such as the interesting variety of features shown in the data distribution in Figure 1). I would argue that these features are in fact the very essence of big data discovery: (a) the collection of massive data sets now enables us to find very unusual, surprising, unexpected, and even outrageous things in our domain of study (i.e., the unknown unknowns); and (b) the high signal-to-noise data distributions that big data yield are rich in higher-order moments (beyond means, medians, modes, and variances) that reveal interesting variations in the objects that we are investigating. Apply some non-parametric statistical tests to your data and enter a new world of data-driven discovery. If you do not already, you will soon “love your data” because of their diversity. In fact, in the science of recommender engines, high value is placed on diversity in the recommendations (6): simply recommending “obvious” products to consumers is not going to win and keep those customers, whereas offering interesting, unusual, but relevant products is a sure winner, much like that winning race horse.
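One simple, non-parametric way to flag candidate “surprises” is Tukey’s IQR-fence rule, which assumes nothing about the shape of the distribution. This is a hedged sketch with made-up data values; in a real surprise-discovery workflow the flagged points would be investigated, not automatically clipped:

```python
# A sketch of simple "surprise" flagging using Tukey's IQR fences,
# a non-parametric rule (no normality assumed). Data values are made up.
import statistics

def iqr_outliers(values, k=1.5):
    """Return the points lying beyond k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

data = [10, 11, 9, 10, 12, 11, 10, 42]  # 42 is the surprise
print(iqr_outliers(data))
```

Because the fences are built from quartiles rather than the mean and standard deviation, a single extreme point cannot mask itself by inflating the scale estimate, which is exactly the data-loving behavior we want when hunting for the unknown unknowns.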
Finally, the fifth fundamental concept of data science is: Overfitting is a sin against data science! Whereas we criticized the physicist in our joke for underfitting the horse model (with a single simple geometric descriptor), we might also have included a fourth “scientist” in the joke (pick your favorite!) who created a “Rube Goldberg” model of the winning horse (7): one that is significantly over-engineered and over-specified. This is a “sin” against data science in the following sense: because of concept #3 (data science is science), we should be testing, validating, and verifying our models’ accuracy using training data, test data, and “previously unseen” data. This scientific process should guard us both from overfitting (high variance) and from underfitting (high bias) in our model solutions (8). If we ignore the principles of good science, then we are prone to overfitting and bias. In addition, by taking concepts #2 and #4 seriously, we should already be aware of the variance in our data values, and consequently we should recognize early in the process that our final model will be no less acceptable (indeed, more trustworthy) when its uncertainties reflect the true variance in our data.
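The underfitting/overfitting contrast above can be demonstrated in a few lines. In this toy sketch (my own construction, not the author's), a one-parameter constant model underfits, a two-parameter line is about right, and a “memorizer” that stores the training set exactly achieves zero training error yet fails on previously unseen data:

```python
# A toy contrast of underfitting vs. overfitting, judged on unseen data.
def fit_line(xs, ys):
    """Ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)

# Underlying truth is roughly y = 2x, with a little noise added.
train_x, train_y = [1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]
test_x,  test_y  = [5, 6],       [10.1, 11.9]   # "previously unseen" data

constant  = lambda x: sum(train_y) / len(train_y)            # underfits (high bias)
line      = fit_line(train_x, train_y)                       # about right
memorizer = lambda x: dict(zip(train_x, train_y)).get(x, 0)  # overfits: perfect on
                                                             # training, useless on
                                                             # unseen inputs
def mse(model, xs, ys):
    """Mean squared error of a model over a data sample."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

print(mse(memorizer, train_x, train_y))  # zero error on training data
print(mse(line, test_x, test_y))         # small error on unseen data
```

The memorizer is the spherical horse's evil twin: instead of too few parameters, it has one per training point. Only the held-out evaluation, the scientific process of concept #3, exposes the difference.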
In summary, all students and practitioners of data science should avoid spherical horses and follow these five fundamental principles:
1) Begin with the end in mind.
2) Know your data.
3) Remember that this *is* science.
4) Data are never perfect, but love your data anyway.
5) Overfitting is a sin against data science.
References:
2. Shabalin, A. A., et al.: “Finding Large Average Submatrices in High Dimensional Data,” Annals of Applied Statistics, 3(3), 985-1012 (2009).
4. Borne, K., & Vedachalam, A.: “Surprise Detection in Multivariate Astronomical Data,” in Statistical Challenges of Modern Astronomy V (Springer: New York), 275-290 (2013).
5. Kratz, M., & Resnick, S.: “The QQ-Estimator and Heavy Tails,” Communications in Statistics: Stochastic Models, 12(4), 699-724 (1996).
6. Ghanghas, V., Rana, C., & Dhingra, S.: “Diversity in Recommender Systems,” International Journal of Engineering Trends and Technology, 4(6), 2344-2348 (2013).
8. Briscoe, E., & Feldman, J.: "Conceptual Complexity and the Bias/Variance Tradeoff," Cognition, 118(1), 2-16 (2011).