Statistical Truisms in the Age of Big Data
- Author: Kirk Borne
- Date: 19 Jun 2013
- Copyright: Image appears courtesy of Kirk Borne.
Does Big Data make Statistics obsolete? Of course not! But many are now suggesting it (1). We address here some of the causes and consequences of this way of thinking. Specifically, we examine some of the basic tenets of elementary statistics that are easy to forget or cast aside when analyzing large, comprehensive “full-population” datasets.
The core methodology of knowledge discovery and value generation from data is data science, which includes a family of disciplines, one of the most important of which is statistics. Likewise, statistical thinking and reasoning are more important than ever in the information age (2). Nevertheless, the temptation is large and growing among some big data deep-divers to toss out the tried and true foundational tenets of statistical reasoning. One reason for this disregard may be that big data offers a convenient path around statistical rigor, since there is so much low-hanging fruit (easy discoveries) in large data collections that there is apparently no need to pull out the big mathematical machinery of statistics.
One glance at a standard computational statistics textbook may easily reinforce this way of thinking. Another reason for statistical side-stepping may be that the lack of statistics training in most educational programs makes it easy to “put it out of your mind” as you analyze data, particularly if you were not trained to apply statistics naturally in your thinking and analyses – this is the meaning and essence of “statistical thinking” (3). A more worrisome explanation of statistical dodging is the belief that statistics is the discipline of “small data analysis.” After all, if you now have enough data to do 1000-fold cross-validation or 1000-variable models with millions of training samples, then statistics must be irrelevant, right?
Let us consider four foundational statistical truisms (obvious, self-evident truths) that are at risk in the age of big data:
1) Correlation does not imply Causation --- Everyone knows this, but many choose to ignore it. Some even triumphantly state that this fundamental tenet of statistics is no longer an important concept when working with big data, since huge numbers of correlations can now be discovered in massive data collections, and some of these correlations must have a causal relationship, which should be good enough, they say. In the words of one such claim, “big data will wean us off our obsession with causation” (4). I am partially guilty of this way of thinking myself – I advocate the use of unsupervised discovery for many data analytics projects: the search for patterns, trends, correlations, and associations in data without preconceived bias or models of expected behaviors. This search for the “unknown unknowns” is one of the major use cases of big data (5): correlation mining and discovery.
I argue that this way of thinking is acceptable when placed in the full scientific methodological context of data science: inference, hypothesis generation, experimental design, testing, validation, hypothesis modification. In fact, causal relationships are among the most significant discoveries from big data that analytics practitioners seek. Finding causes to observed effects would truly be a gold mine of value for any business, science, government, healthcare, or security group that is analyzing big data. Statisticians are rediscovering the importance of causality also: a recent paper examines causal inference in modeling of dynamic systems from a system (mechanistic) perspective (6). Big data provide a wealth of information on dynamic systems of all sorts, so there is definitely much reason to devote attention to causation as well as correlation. Otherwise, ignoring statistical truism #1 can lead to funny or incorrect conclusions. Statistics books and websites offer many fine examples of these violations.
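One way to see why correlation mining needs the follow-up steps of testing and validation is to search for correlations in data that contain none by construction. The sketch below is only an illustration under stated assumptions: the columns are independent Gaussian noise, and the names (pearson, count_strong_pairs), the sizes, and the 0.6 threshold are hypothetical choices, not values from any study cited here. With thousands of pairs to examine, some pairs are expected to look strongly correlated by chance alone.

```python
import itertools
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def count_strong_pairs(n_vars=100, n_obs=20, threshold=0.6, seed=0):
    """Count pairs of independent noise columns with |r| above threshold.

    All sizes and the threshold are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Every column is independent noise: no pair has any causal link.
    columns = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    strong = sum(
        1
        for a, b in itertools.combinations(columns, 2)
        if abs(pearson(a, b)) > threshold
    )
    total = n_vars * (n_vars - 1) // 2
    return strong, total


strong, total = count_strong_pairs()
print(f"{strong} of {total} pairs of unrelated variables have |r| > 0.6")
```

None of the pairs flagged here can reflect a cause, which is why a discovered correlation is better treated as a hypothesis to test than as a finding.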
2) Sample variance does not go to zero, even with Big Data --- This is another easy-to-forget truth when working with big data. We are familiar with the concept of statistical noise and how noise decreases as sample size increases. But sample variance is not the same thing as noise. The former is a fundamental property of the population, while the latter is a property of the measurement process. The final error in our predictive models is likely to be irreducible beyond a certain threshold: this is the intrinsic sample variance. A simple example is regression: as long as you avoid over-fitting the data, the regression (predictive) model will rarely predict the exactly correct value of the dependent variable. For more complex multivariate models, the bigger the sample, the more accurate will be your estimate of the variance in different parameters (variables) representing the population.
This is a good thing – in fact, it might be “the thing” that you are searching for: one of the fundamental characteristics of the population. In astronomy, we call this “cosmic variance” (7). We can study the entire Universe, with billions (even trillions) of objects in our sample, but there will still be differences in the distributions of various parameters as we look in different directions in the sky. Those cosmic differences will never go to zero. Similarly, in any other domain, as we collect more data on the various members of the population (including their different classes and properties), we can make better and better estimates of the fundamental statistical characteristics of the population, including the variance in each of those properties across the different classes. Statistical truism #2 is a good thing, because it fulfills one of the big promises of big data: obtaining the best-ever statistical estimates of the non-zero parameters describing the data distribution of the population.
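A small simulation can make the distinction concrete. It is a minimal sketch, assuming a hypothetical population that is Gaussian with mean 10 and standard deviation 2 (neither number comes from the text). It compares two quantities as the sample grows: the sample standard deviation, which settles near the population value, and the standard error of the mean, which keeps shrinking.

```python
import random
import statistics


def summarize(n, rng, mu=10.0, sigma=2.0):
    """Draw n values and return (sample std dev, standard error of the mean).

    mu and sigma describe an assumed, purely illustrative population.
    """
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    spread = statistics.stdev(sample)  # estimates the population spread
    sem = spread / n ** 0.5            # uncertainty of the estimated mean
    return spread, sem


rng = random.Random(42)
results = {n: summarize(n, rng) for n in (100, 10_000, 100_000)}
for n, (spread, sem) in results.items():
    print(f"n={n:>7}: sample std dev = {spread:.3f}, std error of mean = {sem:.4f}")
```

More data tightens the estimate of the spread; it does not make the spread itself smaller.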
3) Sample bias does not necessarily go to zero, even with Big Data --- The tendency to ignore this tenet of statistics occurs particularly when we have biased data collection methods or when our models are under-fitted, which is usually a consequence of poor model design and thus independent of the quantity of data in hand. Put simply, bias can result from the application of a model that is inadequately based on the full universe of potentially available evidence. As Albert Einstein said: “models should be as simple as possible, but no simpler.” In the era of big data, it is still feasible to settle for a simple predictive model that ignores the many relevant patterns and intricacies in the data collection.
Another situation in which bias does not evaporate as the data sample gets larger occurs when correlated factors (or treatments) are present in an analysis that incorrectly assumes statistical independence (e.g., in a series of randomized A/B trials, which is a very common practice in big data analytics (8)). In such cases, the bias remains, regardless of sample size (9). Statistical truism #3 warns us that just because we have big data does not mean that we have properly applied those data to our modeling efforts. Sample bias can lead to models with biased results, slanted against the wonderful diversity of the original population.
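The effect of a biased collection method can be sketched with a toy simulation. Everything in it is an assumption made for illustration: a Gaussian population with mean 50 and standard deviation 10, and a collection process that only records values above a cutoff of 45. The function name biased_sample_mean is hypothetical. The point is that the error of the estimated mean stabilizes at a non-zero value instead of shrinking as the number of draws grows.

```python
import random
import statistics


def biased_sample_mean(n, rng, mu=50.0, sigma=10.0, cutoff=45.0):
    """Mean of the draws that survive a collection filter (values > cutoff).

    The population parameters and the cutoff are illustrative assumptions.
    """
    draws = (rng.gauss(mu, sigma) for _ in range(n))
    kept = [x for x in draws if x > cutoff]  # the biased collection step
    return statistics.fmean(kept)


rng = random.Random(7)
true_mean = 50.0
errors = {n: biased_sample_mean(n, rng) - true_mean
          for n in (100, 10_000, 100_000)}
for n, err in errors.items():
    print(f"n={n:>7}: estimated mean minus true mean = {err:+.2f}")
```

A larger sample only makes the estimate more precisely wrong; removing the bias requires changing how the data are collected or modeled.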
4) Absence of Evidence is not the same as Evidence of Absence --- In the era of big data, we easily forget that we haven’t yet measured everything. Even with the prevalence of data everywhere, we still haven’t collected all possible data on a particular subject. Consequently, statistical analyses should be aware of and make allowances for missing data (absence of evidence), in order to avoid biased conclusions. Conversely, “evidence of absence” is a very valuable piece of information, if you can prove it. Scientists have investigated the importance of these concepts in the evaluation of substance abuse education programs (10). They find that even though the distinctions between the two concepts (“evidence of absence” versus “absence of evidence”) are important, some policy decisions and societal responses to important problems should move forward anyway.
This is an atypical case – usually the distinctions between the two concepts are significant influencers in decision-making and in the advancement of an area of research. For example, I once suggested to a major astronomy observatory director that we create a database of things searched for (with his telescopes) but never found – the EAD: Evidence of Absence Database. He liked the idea (as a tool to help minimize the use of his facilities for duplicate false searches in cases where we already have clear evidence of absence), but he didn’t offer to pay for it. Here is one science paper that has dramatically understood this concept (11): The paper’s title: “Can apparent superluminal neutrino speeds be explained as a quantum weak measurement?” The paper’s full abstract: “Probably not.”
A more dramatic and ruinous example of a failure to appreciate this statistical concept is the NASA Shuttle Challenger disaster in 1986, when engineers assumed that the lack of evidence of O-ring failures during cold weather launches was equivalent to evidence that there would be no O-ring failure during a cold-weather launch (12). In this case, the consequences of this faulty statistical reasoning were catastrophic. This is an extreme case, but neglect of statistical truism #4 is still an example of fallacious reasoning in the era of big data that we should avoid.
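A simple calculation shows how little a short record of "no failures" can establish. The sketch below is not a reconstruction of the Challenger analysis; it assumes independent trials made under identical conditions, which says nothing about extrapolating to conditions never tested. Under that assumption, if zero failures are seen in n trials, the largest failure probability p still consistent with the record at a given confidence level satisfies (1 - p)^n = 1 - confidence. The function name is hypothetical.

```python
def upper_bound_failure_rate(n_trials, confidence=0.95):
    """Upper bound on the per-trial failure probability after zero failures.

    Assumes independent, identically distributed trials (an idealization).
    Solves (1 - p) ** n_trials == 1 - confidence for p.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_trials)


for n in (10, 100, 1000):
    bound = upper_bound_failure_rate(n)
    print(f"{n:>5} clean trials: failure rate could still be as high as {bound:.4f}")
```

With only 10 clean trials, a failure rate above 25% cannot be ruled out at the 95% level, so a handful of uneventful observations is weak evidence of absence.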
Where does all of this take us? It leads us to a clear example of correlation with causation – as we venture out into the space age of big data and analytics applications, the use of those applications might correlate with and might cause a lack (or misapplication) of statistical thinking precisely on the home planet of the big data universe: statistics!
(1) See discussion by Larry Wasserman at http://normaldeviate.wordpress.com/2013/04/13/data-science-the-end-of-statistics/
(2) H.G. Wells said, “Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write!” http://www.causeweb.org/cwis/SPT--FullRecord.php?ResourceId=1240
(5) K. Borne, “Data Mining in Astronomical Databases” http://arxiv.org/abs/astro-ph/0010583
(6) O. Aalen, K. Roysland, & J. M. Gran (2012) “Causality, Mediation and Time: A Dynamic Viewpoint”, Journal of the Royal Statistical Society Series A, 175(4), p831.
(7) S. Driver & A. Robotham (2010), “Quantifying Cosmic Variance,” Monthly Notices of the Royal Astronomical Society, 407: p2131.
(9) B. Kahan (2013), “Bias in Randomised Factorial Trials,” Statistics in Medicine. doi: 10.1002/sim.5869
(10) D. Foxcroft (2006), “Alcohol Education: Absence of Evidence or Evidence of Absence,” Addiction, 101: 1057
(12) E. R. Tufte (1997), Visual & Statistical Thinking: Displays of Evidence for Decision Making. Graphics Press.