# Visualising Statistics: The importance of seeing not just describing data

## Features

• Author: Andy Kirk
• Date: 11 Nov 2014

From the moment Hans Rosling entertained us with his energetic TED Talk of 2006, breathlessly commentating on the elegant motion of a screen full of bubbles, the interest in and awareness of visualisation began to reach a mainstream audience.

‘The best stats you’ve ever seen’ is the tagline associated with this famous talk, one that has now been viewed over 8 million times.

Aside from Rosling’s entertaining oratory, the success of this data presentation comes from the power of seeing the graphical portrayal of global health and population data, observing the patterns and stories that unfold in front of us. The key word here is ‘seeing’.

If statistics can be said to describe and quantify the characteristics of data, visualisation is what enables us to actually see the data. In harmony, they give us the most thorough understanding of data.

“Visualization may not be as precise as statistics, but it provides a unique view onto data that can make it much easier to discover interesting structures than numerical methods.”

- Robert Kosara, EagerEyes.org

Exploratory Data Analysis

The person widely regarded as the father of visual methods is John W Tukey, the prominent statistician who pioneered Exploratory Data Analysis. He championed techniques for visually exploring data to unearth discoveries that are otherwise indiscernible in the original data form or potentially masked by the aggregating nature of some statistical treatments.

One of Tukey’s most enduring visual devices is the ‘Box Plot’ (or ‘Box and Whiskers Plot’) used to graphically depict the classic five-number summary of minimum value, lower quartile, median value, upper quartile and maximum value. The Box Plot packs a lot of statistical information into a single graphic device and helps us to see the range of values as well as get a sense of the distribution (the degree of dispersion, clustering and skew) of these values.
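The five-number summary behind the Box Plot can be sketched in a few lines of Python. This is a minimal illustration rather than Tukey's own procedure: the function name `five_number_summary` and the sample values are made up, and the quartiles use the standard library's "inclusive" method, which is only one of several conventions.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    ordered = sorted(values)
    # quantiles(n=4) returns the three cut points that split the data into
    # quarters. method="inclusive" is an assumption; other quartile
    # conventions give slightly different results.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

# Made-up sample values, for illustration only.
sample = [2, 4, 4, 5, 7, 9, 10, 12, 15]
summary = five_number_summary(sample)
print(summary)
```

The box of a Box Plot spans the lower to upper quartile, the line inside it marks the median, and the whiskers reach towards the minimum and maximum.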

Of course, depending on the subject matter and the analysis being undertaken we may wish to explore the statistical attributes of our data in different ways, to try and see it from new perspectives. Thankfully, there is a broad repertoire of graphical approaches that can help us to familiarise ourselves with, and discover new insights from, our data.

Here are some of the most relevant and useful ways to help see your statistics:

To Show Distribution

‘Histogram’

The histogram shows the distribution of data, presenting frequency counts across a range of values grouped into intervals. In the example below we see analysis of the total appearances made by footballers during a given season. The height of each bar indicates the number of players who fall into each band of appearances. Whilst in this data set the average value (25.9) and the median value (26) are very close, the shape of the histogram would help to show the potential degree of skewness in your data.
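The counting step behind a histogram can be sketched as follows. The appearance figures here are invented for illustration (they are not the footballer data described above), and `histogram_counts` and the bin width of 10 are assumptions of this sketch.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Count how many values fall into each interval of width bin_width."""
    # Each value is mapped to the lower edge of the interval that contains it.
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Invented appearance totals for 13 players.
appearances = [3, 8, 12, 17, 21, 24, 26, 26, 29, 31, 33, 35, 38]
bin_width = 10
counts = histogram_counts(appearances, bin_width)

# A crude text histogram: one '#' per player in each interval.
for start, count in counts.items():
    print(f"{start:>2}-{start + bin_width - 1:<2} {'#' * count}")
```

Changing the bin width changes the apparent shape of the distribution, so it is worth trying a few values before drawing conclusions about skew.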

‘Back-to-Back Histogram’ or ‘Population Pyramid’

This approach facilitates the comparison of two distributions. In the example below from the Office for National Statistics we see the shape of the population for England and Wales as at 2011, with the length of the bars indicating the population size by age.

http://www.ons.gov.uk/ons/interactive/vp1-story-of-the-census/index.html

To Show Range

‘Floating Bar Chart’

The simplest view of data range is to show the minimum and maximum readings for different categorical variables. When you want to see the spread or tightness of a set of values – and don’t require all the dimensions of a box plot – the floating bar chart can be a useful approach. In the example below we see the range between high and low temperatures by month, in this case for Rome. This particular example focuses on the dimensional changes in wood and includes data on high and low levels of humidity and moisture content.

http://www.woodchanges.com/

‘Dot Plot’

The Dot Plot displays multiple data points along an axis with a mark – such as a filled or semi-transparent circle – to demonstrate the range and spread of values across a set. In the example below we see a range of dots against different athletics events. Each dot is a gold medal won at the Olympics. The position along the x-axis is based on a calculated index of improvement in winning time compared to the slowest ever gold medal winning time in each event. The colours separate male and female events.

‘Barometer Chart’

This chart is another variation on the theme, showing maximum, minimum and average values as well as, in this case, the ability to accentuate a specific instance of the data to view its position across the distribution of values. In this example we see a range of graphics showing responses to a survey about innovation. The left part of the blue box shows the country with the lowest percentage of agreement to the statement, the right part shows the highest percentage and the average response is indicated by the dotted line. In this example one can highlight a particular country (Australia) and observe its position within the range of response values for each survey question.

‘Bar Chart’

A simple Bar Chart can be organised to show the ranking of a series of values. In the example below we have a World Health Organisation (WHO) ranking of all countries based on their 2009 average BMI (an indicator of obesity). By colour-coding specific bars, you can facilitate a view of the relative ranking of your chosen items, in this case highlighting the positions of the US (blue) and UK (red). The dotted line at 25 indicates the threshold value for ‘overweight’.

To Show Frequency

‘Stem and Leaf Plot’

Similar to the histogram, a Stem and Leaf Plot uses a combined pictorial/tabular display to show the frequency of (typically) small data sets. In the example below, we see the method used to depict the frequency of trains each hour during the course of a day in Yokohama, Japan.

http://en.wikipedia.org/wiki/Stem-and-leaf_display
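A Stem and Leaf Plot is simple enough to build by hand, and the grouping can be sketched in Python. This sketch assumes non-negative whole numbers and splits each value into a tens "stem" and a units "leaf"; the function name `stem_and_leaf` and the sample values are hypothetical. In a train timetable like the one described above, the stem would be the hour and the leaves the minutes past the hour.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group values by stem (tens) with their leaves (units) in ascending order."""
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)
    return dict(rows)

# Made-up values, for illustration only.
plot = stem_and_leaf([27, 12, 34, 15, 21, 27])
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

The length of each row doubles as a bar, which is why the display reads like a histogram turned on its side.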

‘Word Cloud’

A Word Cloud is used to show the relative frequency of words in a specified passage of text. As with many of these techniques, a Word Cloud does not produce a precise reading but gives you a ‘feel’ for the shape and content of your data.
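The counts that drive a Word Cloud can be sketched with the standard library. The tokenising rule here (lower-case runs of letters and apostrophes) and the sample passage are assumptions of this sketch; real tools usually also drop very common words such as "the".

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each word appears, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# A made-up passage, for illustration only.
passage = "See the data, describe the data, then see it again."
freq = word_frequencies(passage)
print(freq.most_common(3))
```

A Word Cloud would then scale the size of each word according to its count.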

To Show Correlation and Variance

‘Scatter Plot’

A Scatter Plot displays values for two quantitative variables plotted along a horizontal x-axis and vertical y-axis. Its purpose is to visualise a collection of data points to determine the extent of potential correlation and the amount of or lack of dispersion of values. It also acts as a means of revealing outliers, clusters and gaps. In the graphic below we see a plot for all countries showing life expectancy in years on the y-axis and the percentage of healthy years on the x-axis (i.e. how much of our life we can expect to live in good health). The colours categorise the continents. The plot reveals no significant correlation but does show some interesting outliers (Haiti and China, for example) and clustering patterns for continents (Africa and Europe, for example).
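A single number is often quoted alongside a Scatter Plot to summarise the correlation it shows. One common choice, not named in this article, is Pearson's correlation coefficient; the sketch below computes it by hand for made-up data, with `pearson_r` being a hypothetical helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values: y rises in step with x, then falls in step with x.
rising = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
falling = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(rising, falling)
```

Values near 1 or -1 indicate a strong linear relationship and values near 0 indicate little or none. The coefficient says nothing about outliers, clusters or gaps, which is why the plot itself remains essential.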

‘Parallel Coordinates’

Where Scatter Plots combine two variables on Cartesian coordinates, Parallel Coordinates enable the plotting of a dataset’s multiple dimensions across a series of parallel axes. Each axis is scaled between the minimum and maximum values of its variable, and a record’s vertical position on that axis reflects where its value falls within this range. All the associated data points for a common record are then connected to reveal patterns, clusters and outliers. In the example below we see analysis of the ‘USDA Nutrient Database’ showing the nutritional content per 100g of over 1000 different foods. Each connected line is a different food item with each axis relating to a different nutrient such as sugars, fats and protein. The higher up the axis a line reaches, the more content there is of that particular nutrient.

http://exposedata.com/parallel/
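The way each parallel axis maps a value to a vertical position can be sketched as a simple min-max rescaling. This is an assumption about how such charts are typically built, not a description of the linked example; `scale_to_axis` and the nutrient figures are made up.

```python
def scale_to_axis(column):
    """Rescale values to 0..1, where 0 is the axis bottom and 1 the axis top."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Every record has the same value, so place them all mid-axis.
        return [0.5 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Made-up grams per 100g for three foods on two nutrient axes.
sugars = [0, 5, 10]
protein = [20, 20, 20]
print(scale_to_axis(sugars))
print(scale_to_axis(protein))
```

Because every axis is rescaled independently, the height of a line compares a food with the other foods on that nutrient, not one nutrient with another.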

‘Bump Chart’

This specific deployment of the common line chart reveals the cumulative variation against the median of a set of values. In this case we see the total duration stage-by-stage of riders in the Tour de France for a number of years. The zero-line is the median rider’s time, so the widening variation of lines above and below this axis shows the spread of the field.
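The calculation behind this kind of chart can be sketched as a running total per rider, compared at each stage with the median running total. The rider names, stage times and the helper `cumulative_gap_to_median` are invented for illustration and are not taken from the Tour de France example.

```python
import statistics
from itertools import accumulate

def cumulative_gap_to_median(stage_times):
    """For each rider, return cumulative time minus the median cumulative time per stage."""
    cumulative = {rider: list(accumulate(times)) for rider, times in stage_times.items()}
    n_stages = len(next(iter(cumulative.values())))
    # The median cumulative time after each stage forms the zero-line.
    medians = [
        statistics.median(totals[s] for totals in cumulative.values())
        for s in range(n_stages)
    ]
    return {
        rider: [totals[s] - medians[s] for s in range(n_stages)]
        for rider, totals in cumulative.items()
    }

# Invented stage times in minutes for three riders over three stages.
stage_times = {
    "Rider A": [100, 102, 98],
    "Rider B": [101, 105, 99],
    "Rider C": [103, 110, 104],
}
gaps = cumulative_gap_to_median(stage_times)
print(gaps)
```

Plotting each rider's list of gaps as a line gives the fanning-out pattern described above, with the median rider sitting on zero.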

To Show Statistical Context

‘Bullet Graph’

A Bullet Graph shows a set of values using bar lengths and then presents an array of visual cues to help judge the context of each of these values. In the example below, we see the black bars showing values for indicators such as revenue and profit. The shaded areas in the background may facilitate the contextual judgment of these values with bands for poor (darker), adequate (lighter) and good (lightest) performance thresholds. The vertical marks may indicate a comparison against a measure such as target, forecast or previous time period.

‘Annotated Line Chart’

Like the Bullet Graph, this chart presents a principal reading (temperature) via a line chart and then overlays this pattern over time against contextual bands of average temperatures, record ranges, last year and current average.

‘Fan Chart’

The Fan Chart overlays actual values onto future forecasts and past estimates of the projected trends, with the darkness of shading providing the statistical probabilities associated with each route. The fan chart is commonly used by the Bank of England in their inflation reports to show the volatility (or otherwise) of future projections of GDP and the accuracy of past estimates.
