# Kurtosis: Four Momentous Uses for the Fourth Moment of Statistical Distributions

## Features

• Author: Kirk Borne
• Date: 04 Apr 2014
• Copyright: Image appears courtesy of iStock Photo. Figure is copyright of Kirk Borne.

We frequently see much mulling over means, medians, and modes of statistical distributions, and lengthy discussions of variance and skew (including the now famous "long tail" [1], [2], [3]). What about fat tails? Is that a taboo subject? Maybe it is! For example, in the widely respected book Numerical Recipes: The Art of Scientific Computing, the authors had the audacity to say "the skewness (or third moment) and the kurtosis (or fourth moment) should be used with caution or, better yet, not at all." [4] Those warnings notwithstanding, kurtosis is making a comeback. Not that it ever went away, but a recent search on Google Scholar found over 3000 articles mentioning kurtosis in the context of statistics within the first three months of this year, and over 12,000 articles in 2013, though only about 4000 such articles were cited in the preceding three years combined. Many of those contributions focus on real-world uses of that particular characteristic of data distributions [5], [6].

So, what is kurtosis? It is a statistical measure of the peakedness of the data distribution, effectively measuring how peaked (positive kurtosis) or flattened (negative kurtosis) the data distribution is compared to the normal distribution. See the attached figure for an illustration of these three types of data distributions. For a statistical distribution f(x): the mean m (first moment) of the distribution is the average value of x over the full range of data values (i.e., the weighted mean of x, weighted by the frequency of occurrence of each value x, which is the distribution function f(x)); the variance s (second moment) is the average value of (x-m)²; the skew (third moment) is the average value of (x-m)³/s^(3/2); and the kurtosis is the fourth moment of the data distribution [the average value of (x-m)⁴/s²] minus 3. Note that s here denotes the variance, not the standard deviation. In the latter case, the “minus 3” is applied in order to set kurtosis=0 (Mesokurtic) for a normal distribution, kurtosis>0 (Leptokurtic) for a peaked distribution, and kurtosis<0 (Platykurtic) for a flattened distribution [7].

Traditionally it is assumed that very large data sets are required in order to estimate kurtosis. This is true if the primary goal of the study is to determine with strong statistical confidence whether kurtosis=0 or not (is the data normally distributed or not?). This constraint is not essential in some data science applications where the kurtosis is used to monitor shifts in the data distribution (changes in the stationarity of the system being studied [8]), which can be detected with moderate-sized data sets.
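The moment definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses NumPy, simple 1/N sample averages rather than bias-corrected estimators, and synthetic samples (normal, Laplace, uniform) chosen here as stand-ins for mesokurtic, leptokurtic, and platykurtic data. The function name `moments` is hypothetical.

```python
import numpy as np

def moments(x):
    """Return (mean, variance, skew, excess kurtosis) using 1/N sample averages."""
    x = np.asarray(x, dtype=float)
    m = x.mean()                          # first moment
    d = x - m
    s = np.mean(d**2)                     # variance (second central moment)
    skew = np.mean(d**3) / s**1.5         # third moment, scaled by s^(3/2)
    kurt = np.mean(d**4) / s**2 - 3.0     # fourth moment, scaled by s^2, minus 3
    return m, s, skew, kurt

# Synthetic samples (an assumption of this sketch, not data from the article).
rng = np.random.default_rng(0)
samples = {
    "normal": rng.normal(size=200_000),          # theoretical excess kurtosis 0
    "laplace": rng.laplace(size=200_000),        # theoretical excess kurtosis +3
    "uniform": rng.uniform(-1, 1, size=200_000)  # theoretical excess kurtosis -1.2
}
for name, x in samples.items():
    m, s, skew, kurt = moments(x)
    print(f"{name:8s} skew={skew:+.2f}  excess kurtosis={kurt:+.2f}")
```

With samples of this size, the printed values should land close to the theoretical ones noted in the comments, though the exact digits depend on the random draw.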

I describe here four practical applications that demonstrate significant uses of the fourth moment of a statistical distribution. Showing some love to kurtosis is consistent with one of the fundamental principles of data science: "Data are never perfect, but love your data anyway" [9]. The value of exploring the features, characteristics, and moments of your data distribution was further highlighted in this article: "Data Profiling – Four Steps to Knowing Your Big Data" [10].

1) Independent Component Analysis: ICA is a variant of PCA in cases where the data distribution contains subcomponents that are statistically independent of each other, though generally not orthogonal. ICA is an example of blind source separation, sometimes called the “cocktail party problem”, in which you try to isolate a specific speech signal out of a superposition of many independent voices. In large data collections, these independent components are unlikely to have the same means, medians, and modes. Consequently, the broad (fat tail) distribution of the data that is identified through high kurtosis is an indicator of the presence of multiple components in a complex signal (as illustrated in the figure, we see multiple components in the data distribution that has negative kurtosis). Estimating the slice through the data (i.e., the “x” dimension for the f(x) calculation) that yields the highest kurtosis will begin to identify those separate sources; subsequent application of SVM will assist in source separation [11].

2) Hidden variable discovery: There are often explanatory variables that are not measured that help to identify different categories of objects or events in data collections. Whether the source is scientific data, or social data, or financial data, or machine data, the ability to recognize the existence of such hidden variables can help to explain unusual correlations or inexplicable inaccuracies in classification models. We encountered an example of this when analyzing galaxy classifications from the Galaxy Zoo citizen science project [12]. For each one of approximately 900,000 galaxies, there were about 200 citizen scientist volunteers who provided a classification label for the galaxy: either spiral galaxy, or elliptical galaxy, or a merging galaxy. We attempted to build a predictive model for these volunteer-provided classifications using the measured features of the galaxies that were recorded in the scientific database. The predictive model worked very well (95% accuracy) for galaxies that had nearly uniform concurrence among the volunteers’ classifications (i.e., the distribution of class labels had a single peak with low kurtosis). However, our predictive model was an abysmal failure (5% accuracy) for galaxies that had a largely split vote (50-50 spiral vs. elliptical), for which the distribution of class labels had high kurtosis. We concluded that there must be some “hidden” feature (not contained in our scientific database of measurements for those galaxies) that the human eye sees that makes it difficult to classify the galaxy unequivocally as one type or the other. Now that we realize that there is a hidden explanatory variable that probably accounts for this, the hunt is on! We are continuing our search for an explanation of what the high kurtosis is signaling to us.

3) Change-point detection in dynamic streaming data: When capturing massive streams of data (from social media or scientific experiments or whatever), it is often beneficial (and efficient) to track a few key parameters that characterize the behavior of the data (i.e., effective descriptors of the system or population that is being monitored and measured). For example, calculating running averages and variances in stock market prices can produce alerts to traders. The more data characteristics that can be easily measured and tracked, the more likely it is that an early warning system will generate meaningful and timely alerts [6]. Kurtosis is one of those characterizations of the data stream that is particularly effective in such applications, precisely because of its use in ICA: it can identify the emergence of new behaviors (new independent components) and thereby detect changes in the stationarity of the system (which otherwise may have relatively invariant mean values of key parameters, thanks to the central limit theorem).

4) Drastically improving the estimated age of the Universe: A remarkable example of kurtosis in action was in the study of classical variable stars in astronomy (Cepheid variables, in particular). Members of this class of pulsating stars follow a tightly correlated period-luminosity relationship: the longer the period of pulsation, the more luminous (brighter) the star. Using easily measured periods of these stars in images of galaxies has enabled astronomers to estimate the distances to those galaxies (which would otherwise be very hard to estimate). Unfortunately, in the mid-20th century, there was a serious discrepancy (of about a factor of two between different studies) in the estimated distances of these variable stars. Consequently, a factor-of-two uncertainty in the distance scale of distant galaxies translated into factor-of-two uncertainties in the size and age estimates of the Universe. This was embarrassing for astronomers. The solution to the problem was the recognition that there was high kurtosis in the distribution of Cepheid variable stars’ data, particularly in their period-luminosity 2-dimensional scatter plot. The high kurtosis was an indisputable indicator of two independent components: in this case, two independent types of Cepheid variable stars. Once we had the Hubble Space Telescope in orbit, with the finest scientific camera ever used in astronomy, astronomers were able to identify uniquely which types of Cepheid variable stars were being seen in any particular galaxy image, and thus the factor of two uncertainty in their distances (and in the size and age of the Universe) was reduced to a few percent uncertainty, with kurtosis contributing to that improvement [13].
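As a rough illustration of application 1 (finding the slice through the data with the most extreme kurtosis), the sketch below mixes two independent synthetic sources, whitens the mixture, and scans projection directions for the largest absolute excess kurtosis. Everything specific here is an assumption made for the example: the Laplace sources, the mixing matrix `A`, the brute-force angle scan (practical ICA algorithms use iterative optimization instead), and the use of the absolute value so that both peaked and flattened departures from normality count. The SVM step mentioned in the text is not covered.

```python
import numpy as np

def excess_kurtosis(y):
    """Fourth standardized central moment minus 3 (0 for a normal distribution)."""
    d = y - y.mean()
    return np.mean(d**4) / np.mean(d**2) ** 2 - 3.0

def whiten(X):
    """Center X (n_samples, n_features) and rescale it to identity covariance."""
    Xc = X - X.mean(axis=0)
    eigval, eigvec = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return Xc @ eigvec / np.sqrt(eigval)

rng = np.random.default_rng(1)
n = 50_000
S = rng.laplace(size=(n, 2))              # two independent heavy-tailed sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])    # hypothetical non-orthogonal mixing matrix
X = S @ A.T                               # observed mixed signals
Z = whiten(X)

# Scan unit directions in the whitened plane and keep the one whose
# projection has the largest |excess kurtosis|.
angles = np.linspace(0.0, np.pi, 360, endpoint=False)
kurt = np.array([excess_kurtosis(Z @ np.array([np.cos(t), np.sin(t)]))
                 for t in angles])
best = angles[np.argmax(np.abs(kurt))]
recovered = Z @ np.array([np.cos(best), np.sin(best)])

# Compare the recovered signal with each true source (possible only because
# the sources are synthetic and therefore known).
corr = [abs(np.corrcoef(recovered, S[:, j])[0, 1]) for j in range(2)]
print(f"best angle = {np.degrees(best):.1f} deg, |corr| with sources = {corr}")
```

In this toy setting the direction of most extreme kurtosis should line up closely with one of the original sources; the second source can then be sought in the orthogonal direction of the whitened data.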
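Application 3 (change-point detection) can be sketched in a similar hedged way. The example below builds a synthetic stream whose mean and variance stay the same while a second component emerges, then tracks excess kurtosis over a trailing window. The window length, step, alert threshold, and the stream itself are all assumptions chosen for illustration, and `first_alert` is a hypothetical helper, not a method taken from the cited references.

```python
import numpy as np

def excess_kurtosis(x):
    """Fourth standardized central moment minus 3 (0 for a normal distribution)."""
    d = np.asarray(x, dtype=float) - np.mean(x)
    return np.mean(d**4) / np.mean(d**2) ** 2 - 3.0

def first_alert(stream, window, step=50, threshold=0.8):
    """Return the sample count at which the trailing window first shows
    |excess kurtosis| > threshold, or None if it never does."""
    for end in range(window, len(stream) + 1, step):
        if abs(excess_kurtosis(stream[end - window:end])) > threshold:
            return end
    return None

rng = np.random.default_rng(2)
change_point, window = 4_000, 1_000

# Before the change: a single Gaussian component with mean 0 and variance 1.
before = rng.normal(0.0, 1.0, change_point)

# After the change: two components centered at -0.9 and +0.9, with spread
# chosen so the overall mean (0) and variance (0.81 + 0.19 = 1) are unchanged.
signs = rng.choice([-1.0, 1.0], size=4_000)
after = 0.9 * signs + rng.normal(0.0, np.sqrt(0.19), 4_000)

stream = np.concatenate([before, after])
alert = first_alert(stream, window)
print(f"mean before/after:     {before.mean():+.2f} / {after.mean():+.2f}")
print(f"variance before/after: {before.var():.2f} / {after.var():.2f}")
print(f"change at sample {change_point}, first kurtosis alert at sample {alert}")
```

Running means and variances would see nothing unusual in this stream, whereas the windowed kurtosis drifts away from zero once the window contains enough post-change samples. The lag between the change and the alert depends on the assumed window and threshold.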

Finally, the most important result of any data mining and statistical analysis activity is what you do with what you have discovered. In science, one may say that the discovery is a sufficient result, but in fact the discovery should provide decision support for further action, such as: publish a research paper, make a time-critical response to the discovery, refine your hypothesis, design a new experiment, etc. More generally, in any data-driven environment, monitoring and responding to changes in the characteristic features of the data stream can lead to new discoveries and new opportunities, especially in autonomous intelligent systems, including “Decision Science-as-a-Service” for business analytics applications using big data [14], or Dynamic Data-Driven Application Systems [15], or a space probe operating in deep space with little (if any) human intervention [16]. Tapping into the power of the fourth moment of the data distribution should not be an outlier activity, but an essential component of data science and knowledge discovery within any data-driven decision-making process.

References

[1] Glanzel, W., "High-end performance or outlier? Evaluating the tail of scientometric distributions," Scientometrics, 97(1), 13-23 (2013).
[2] Brynjolfsson, E., et al., "Goodbye Pareto Principle, Hello Long Tail: The Effect of Search Costs on the Concentration of Product Sales," Management Science, 57(8), 1373-1386 (2011).
[3] Anderson, C., “The Long Tail: Why the Future of Business is Selling Less of More,” Hyperion Press (2006).
[4] Press, W., et al., "Numerical Recipes in C: The Art of Scientific Computing," Cambridge University Press, 2nd Edition, pg. 612 (1992).
[5] Mora, P., et al., “Impact of Heat on the Pressure Skewness and Kurtosis in Supersonic Jets,” AIAA Journal, 52(4), 777-787 (2014).
[6] Hou, W., et al., “Detection of small target using recursive higher order statistics,” Proc. SPIE 9142, International Conference on Frontiers in Optical Imaging Technology and Application, DOI:10.1117/12.2054029 (2014).
[8] Sierra-Fernandez, J., et al., “Adaptive detection and classification system for power quality disturbances,” 2013 International Conference on Power, Energy and Control (ICPEC), DOI: 10.1109/ICPEC.2013.6527713 (2013).
[9] Borne, K. "Five Fundamental Concepts of Data Science," http://www.statisticsviews.com/details/feature/5459931/Five-Fundamental-Concepts-of-Data-Science.html (2013).
[10] Borne, K. "Data Profiling – Four Steps to Knowing Your Big Data," http://insideanalysis.com/2014/02/data-profiling-four-steps-to-knowing-your-big-data/ (2014).
[11] Lu, C.-J., et al., “Recognition of Concurrent Control Chart Patterns by Integrating ICA and SVM,” Applied Mathematics & Information Sciences, 8(2), 681-689 (2014).
[12] http://galaxyzoo.org/
[13] http://www.atnf.csiro.au/outreach/education/senior/astrophysics/variable_cepheids.html
[14] http://www.syntasa.com
[15] http://www.dddas.org/
[16] Borne, K., “Data-Driven Discovery through e-Science Technologies,” in SMC-IT 2006: Second IEEE International Conference on Space Mission Challenges for Information Technology (2006). Downloaded from http://kirkborne.net/papers/Borne2006-SMC-IT-DataDriven-eScience.pdf
