Outlier Detection Gets a Makeover - Surprise Discovery in Scientific Big Data
- Author: Kirk Borne
- Date: 16 Sep 2014
- Copyright: Image appears courtesy of iStock Photo
Novelty and surprise are two of the more exciting aspects of science – finding something totally new and unexpected can lead to a quick research paper, or it can make your career. As scientists, we all yearn to make a significant discovery. Petascale big data collections potentially offer a multitude of such opportunities. But how do we find that unexpected thing? These discoveries come under various names: interestingness, outlier, novelty, anomaly, surprise, or defect (depending on the application). Outlier? Anomaly? Defect? How did they get onto this list? Well, those features are often the unexpected, interesting, novel, and surprising aspects (patterns, points, trends, and/or associations) in the data collection. Outliers, anomalies, and defects might be insignificant statistical deviants, or else they could represent significant scientific discoveries.
As a demonstration, consider the attached plot of boiling point versus melting point for all 92 of the naturally occurring elements on Earth. The trend line (shown in black) captures the general behavior of the data points, but it does not “fit” the data very well, since several points deviate significantly from it (outliers). However, there is one element (shown in red) whose location in this parameter space is completely different from that of all other known elements – its boiling point is lower than its melting point (as indicated by its position below the diagonal yellow line in the plot). That is very strange behavior indeed – perhaps even impossible, you might say. But it is real. It is arsenic, and you can read about its surprising behavior in the chemistry literature. That is the really interesting outlier in this data set.
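The two kinds of outlier described above can be told apart with a few lines of code. The following is only a minimal sketch, not the analysis behind the original plot: it uses a small hand-picked subset of elements with approximate melting and boiling points in kelvin (assumed values taken from standard tables, where arsenic's "boiling point" is the temperature at which it sublimes at normal pressure), and the function names are hypothetical. It fits an ordinary least-squares trend line, reports each element's deviation from that line, and separately applies the simple rule "boiling point below melting point".

```python
# Approximate (melting point, boiling point) in kelvin for a small,
# hand-picked subset of elements. Illustrative values only; this is
# NOT the full 92-element data set shown in the plot.
ELEMENTS = {
    "N":  (63.2, 77.4),
    "O":  (54.4, 90.2),
    "Hg": (234.3, 629.9),
    "Na": (370.9, 1156.0),
    "Pb": (600.6, 2022.0),
    "Al": (933.5, 2792.0),
    "Cu": (1357.8, 2835.0),
    "Fe": (1811.0, 3134.0),
    "As": (1090.0, 887.0),
}


def fit_line(points):
    """Ordinary least-squares fit of boiling = slope * melting + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(data):
    """Vertical distance of each element from the fitted trend line."""
    slope, intercept = fit_line(list(data.values()))
    return {name: bp - (slope * mp + intercept)
            for name, (mp, bp) in data.items()}


def below_diagonal(data):
    """Elements whose boiling point is lower than their melting point."""
    return [name for name, (mp, bp) in data.items() if bp < mp]


if __name__ == "__main__":
    ranked = sorted(residuals(ELEMENTS).items(),
                    key=lambda item: abs(item[1]), reverse=True)
    for name, r in ranked:
        print(f"{name:>2}  residual = {r:+9.1f} K")
    print("boiling point below melting point:", below_diagonal(ELEMENTS))
```

The residuals rank points by how far they sit from the trend, which is a statement about the fit. The diagonal test does not depend on the fit at all, which is why it isolates the physically surprising case.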
In traditional scientific experiments, when a small number of parameters from an instrument are being monitored and tracked, an outlier in one of those signal streams is probably just noise or a random statistical fluctuation. This makes sense statistically when you are sampling the same process hundreds or thousands of times. However, when you are sampling thousands, or millions, of independent processes (such as stars in the sky, or persons on social media, or retail transaction logs in an online store), then population outliers truly represent the surprising, novel, and interesting events within that massive population.
Therefore, it is fair to say that “outliers” have gotten a bad reputation that should not carry over into the era of big data. To illustrate this, consider the following definition of big data that we are promoting: “big data is everything being quantified and tracked.” By that definition, big data leads to a new statistical reality: whole-population analysis, and the end of demographics. We are measuring (not just once, but continuously) many features of entire populations (stars in the sky, persons on social media, website users, online customers, air traffic passengers, cyber traffic, seismic tremors, health diagnostics, and more). Within our “whole-population analysis” we will likely discover members of the population that behave differently, occupy a different location in parameter space, or present unusual links/associations relative to other members of the population. These unexpected behaviors represent discoveries! Consequently, surprise discovery (outlier detection) becomes the gem, the joy, and the justification of our big data analytics activities.
As an example, we consider massive data-producing science projects. For such projects, various measures of interestingness in large databases and in high-rate data streams are needed for rapid detection and characterization of the most interesting and potentially most important events (i.e., changes, anomalies, novelties). The growth in massive scientific databases therefore offers the potential for major new discoveries. Of course, simply having the potential for scientific discovery is insufficient, unsatisfactory, and frustrating. Scientists actually do want to make real discoveries. Consequently, effective and efficient algorithms that explore these massive datasets are essential. These algorithms will then enable scientists to mine and analyze ever-growing data streams from satellites, sensors, and simulations – to discover the most "interesting" scientific knowledge hidden within large and high-dimensional datasets, including new and interesting correlations, patterns, linkages, relationships, associations, principal components, redundant and surrogate attributes, condensed representations, object classes/subclasses and their classification rules, transient events, outliers, anomalies, novelties, and surprises. Searching for "unknown unknowns" (outliers) through unsupervised exploratory data analysis thus becomes the poster child of data-intensive scientific discovery.
Among the sciences, astronomy provides a prototypical example of the growth of datasets. Astronomers now systematically study the sky with large sky surveys. These surveys make use of uniform calibrations and well-engineered pipelines for the production of a comprehensive set of quality-assured data products. Surveys are used to collect and measure data from all objects that are visible within large regions of the sky, in a systematic, controlled, and repeatable fashion. These statistically robust procedures thereby generate very large unbiased samples of many classes of astronomical objects. A common feature of modern astronomical sky surveys is that they are producing massive catalogs. Data volumes range from hundreds of terabytes (TB) up to 100 (or more) petabytes (PB), both in the image data archive and in the object catalogs. Examples include the existing SDSS (Sloan Digital Sky Survey at sdss.org) and the future LSST (Large Synoptic Survey Telescope at lsst.org), planned for the next decade. Large sky surveys have enormous potential to enable countless astronomical discoveries. Such discoveries will span the full spectrum of statistics: from rare one-in-a-billion (or one-in-a-trillion) type objects, to the complete statistical and astrophysical specification of a class of objects (based upon millions of instances of the class).
With the advent of large, rich sky survey data sets, astronomers have been slicing and dicing the galaxy parameter catalogs to find additional, sometimes subtle, inter-relationships among a large variety of external and internal galaxy parameters. Occasionally, objects are found that do not fit anybody's model or relationship (the "unknown unknowns"). The discovery of Hanny's Voorwerp by the Galaxy Zoo citizen science volunteers is one example. Some rare objects that are expected to exist are found only after deep exploration of multi-wavelength data sets (the "unknown knowns", e.g., Type II quasars and brown dwarfs). And then there are known objects that cannot be explained until additional parameters are measured by other instruments (the "known unknowns", e.g., gamma-ray bursts). The detection of rare, novel, unexpected, anomalous outliers, which lie outside the expectations and predictions of our models, thereby reveals new astrophysical phenomena and processes. Soon, with much larger sky surveys, we may discover even rarer one-in-a-billion objects and object classes.
The LSST is the most impressive astronomical sky survey being planned for the next decade. It will monitor the sky, measuring hundreds of parameters for billions of objects repeatedly for 10 years. Compared to other sky surveys, the LSST survey will deliver temporal coverage for orders of magnitude more objects. The project is expected to produce ~15-30 TB of data per night of observation for 10 years. The final image archive will be ~100-200 PB, and the final LSST astronomical object catalog (= object-attribute database, representing time series for ~50 billion objects) is expected to be ~20-40 PB, comprising over 200 attributes for each one of the ~20 trillion independent source observations.
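The catalog figures quoted above can be checked against one another with simple arithmetic. The sketch below is a back-of-the-envelope estimate only: the object, observation, and attribute counts come from the text, while the storage cost per attribute (8 bytes, as for a double-precision number) and the use of decimal petabytes are assumptions made here for illustration, and the function names are hypothetical.

```python
# Figures quoted in the text (all approximate).
N_OBJECTS = 50_000_000_000           # ~50 billion catalogued objects
N_OBSERVATIONS = 20_000_000_000_000  # ~20 trillion source observations
N_ATTRIBUTES = 200                   # "over 200" attributes per observation

# Assumptions that are NOT from the text.
BYTES_PER_ATTRIBUTE = 8              # e.g. one double-precision value
BYTES_PER_PB = 10**15                # decimal petabyte


def observations_per_object(n_obs=N_OBSERVATIONS, n_obj=N_OBJECTS):
    """Average length of the time series stored for each object."""
    return n_obs / n_obj


def catalog_size_pb(n_obs=N_OBSERVATIONS, n_attr=N_ATTRIBUTES,
                    bytes_per_attr=BYTES_PER_ATTRIBUTE):
    """Raw catalog size in petabytes, ignoring indexes and compression."""
    return n_obs * n_attr * bytes_per_attr / BYTES_PER_PB


if __name__ == "__main__":
    print("observations per object:", observations_per_object())
    print("catalog size (PB), 8-byte attributes:", catalog_size_pb())
    print("catalog size (PB), 4-byte attributes:",
          catalog_size_pb(bytes_per_attr=4))
```

Under these assumptions each object is observed about 400 times, and 8-byte attributes give a catalog of roughly 32 PB, inside the quoted ~20-40 PB range. Storing every attribute in 4 bytes would give about 16 PB, so the quoted range corresponds to an average of roughly 5 to 10 bytes per attribute.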
Many machine learning use cases are anticipated with the LSST database, including:
• Provide rapid descriptive characterizations and probabilistic classifications for millions of events daily.
• Find new multivariate correlations and associations in a high-dimensional space of hundreds of measured parameters.
• Discover voids in high-dimensional parameter spaces, perhaps signaling new processes that preclude the existence of certain parameter values.
• Discover new and improved rules to classify known classes of objects.
• Discover new and exotic classes and subclasses of objects and processes.
• Serendipity – discover rare one-in-a-billion (or trillion) objects through outlier detection (“Surprise Discovery”) algorithms.
• Identify novel, unexpected behavior in time series data.
• Hypothesis testing – verify existing (or generate new) hypotheses with strong statistical confidence, using millions (or billions) of training samples.
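To make the outlier detection ("Surprise Discovery") use case in the list above concrete, here is a minimal sketch of one common unsupervised approach: score each object by the distance to its k-th nearest neighbor in parameter space, so that isolated objects receive the highest scores. This technique is chosen here purely for illustration and is not described in the source as the LSST method. The function names, the value of k, and the synthetic demo points are all assumptions.

```python
import math


def knn_outlier_scores(points, k=3):
    """Score each point by the distance to its k-th nearest neighbor.

    Larger scores mean the point sits in a sparser region of parameter space.
    This brute-force version compares every pair of points, so it is only
    practical for small samples.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q)
                       for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_outliers(points, k=3, n=1):
    """Indices of the n points with the largest k-NN distance scores."""
    scores = knn_outlier_scores(points, k)
    ranked = sorted(range(len(points)),
                    key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Synthetic demo data (hypothetical): a tight 5 x 5 grid of "ordinary"
# objects plus one isolated object far from the rest (index 25).
DEMO_POINTS = [(0.1 * x, 0.1 * y) for x in range(5) for y in range(5)]
DEMO_POINTS.append((5.0, 5.0))

if __name__ == "__main__":
    print("most isolated point index:", top_outliers(DEMO_POINTS, k=3, n=1))
```

In practice the measured parameters would need to be put on comparable scales before distances are computed, and a catalog of billions of objects would call for spatial indexing or approximate neighbor search rather than this all-pairs loop.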
The LSST is just one example of a massive data-producing project. Though it may be an astronomy project, other big data projects (in science and outside of scientific domains) face similar challenges and discovery potential. Similarly, the use cases listed above are specific to astronomy, but the general categories of surprise discovery that they represent have parallels in all disciplines. Since big data are collected everywhere, it is useful for analytics professionals to share their own algorithms, best practices, techniques, and ideas across discipline boundaries for discovering the most novel, interesting, and surprising things in data collections. The discovery of unknown unknowns (those pesky outliers) may yield fame and fortune – and the potential for such success stories is now in the mainstream of big data analytics, not an outlier.
 Downloaded from http://education.jlab.org/itselemental/ele033.html
 A. A. Shabalin et al. (2009), Finding large average submatrices in high dimensional data, Annals of Applied Statistics, 3(3): 985-1012.
 G. I. G. Jozsa et al. (2009), Revealing Hanny's Voorwerp: radio observations of IC 2497, Astronomy & Astrophysics, 500: L33-L36.
 G. T. Richards et al. (2009), Eight-dimensional mid-infrared/optical Bayesian quasar selection, Astronomical Journal, 137: 3884-3899.
 K. Borne (2009), Scientific data mining in astronomy, in Next Generation Data Mining. CRC Press: Taylor & Francis, Boca Raton, FL, pp. 91-114.