Tyler Vigen is currently a student at Harvard Law School, working on his Juris Doctor degree. Within the past year or so, he has gained the attention of the statistical community for his website, ‘Spurious Correlations’. It was a project he put together as a fun way to look at correlations in data. He finds empirical research interesting and loves to wonder about how variables work together. The charts on ‘Spurious Correlations’ are not meant to imply causation, nor are they meant to create distrust of research or even of correlative data. Rather, Vigen hopes that this project will foster interest in statistics and numerical research. I think he has achieved his goal, with his correlations appearing in a variety of outlets, from statistical blogs to BBC News Magazine.
Vigen insists that he is not a math or statistics researcher. He talks to Statistics Views about how his love for science and discovery helped to start the site, which has since led to the publication of his first book.
1. Congratulations on the success of ‘Spurious Correlations’. How did it come about?
I kind of learnt about statistics when I was an undergraduate studying criminal justice. I didn’t learn about it very thoroughly but I knew enough to talk about what standard deviation means.
2. What were your main objectives in setting up the site?
The idea came from an image that I once saw that correlated the number of murderers in New York State against the rising and falling slope of a mountain. It stuck with me for a couple of years. I thought at the time that ‘I bet a lot of other things do that’. I knew there were different ways to find out that information but I wanted to find it on my own website. I spent some time just to see what would happen with my own goals. It looks like I went into this project to prove something but really it was me just wanting to play with statistics and seeing what I could come up with by building a correlation engine and merging different graphs together.
3. You mention on your site that a lot of your findings were inspired by a book published 50 years ago called How to Lie with Statistics by Darrell Huff. Whom or what was it that introduced you to this book?
It was actually introduced to me after I had originated the project. I have read other books on statistics but Huff’s book has some very distinct and efficient descriptions of how graphs can lie.
It is interesting as some of the information in that book is very outdated and does not really apply anymore. However, other parts are more relevant now than they ever were then. It presents an interesting way to look at the changing face of statistics in terms of graphs and charts.
4. Please could you tell us more about your educational background and what brought you to recognise statistics as a discipline in the first place?
I worked as an internal investigator and prior to that, I had been working in the military on security. As the subject was already very familiar to me, it made sense to study criminal justice.
There was a specific statistics course but when you study criminal justice, the focus is more on crime rates and statistics, so that you can track these better. It was not a very in-depth course but I still found it interesting in helping me understand how things worked.
5. Many of the stats you use come from the US Census Bureau, National Science Foundation and various government departments. Have you entertained the idea of giving ‘Spurious Correlations’ more of an international flavour?
I have been trying to do that and it is certainly a work in progress. What makes using US data easy is that the department that runs the US Census, for example, has to complete a compendium, which means everything is in the same format. I can then download Excel files with everything I need. The main reason why I have chosen to focus on US statistics is that they are available in bulk and are free to access.
Statistics from other countries are not difficult to obtain but they are not available in bulk to download. The ONS, for example, release all their statistics on an individual topic basis.
The US Census Bureau makes a point of aggregating lots of years together, so even if certain departments or agencies only release statistics every single month or every single year, they still do an excellent job in putting these statistics all together in one set. Therefore I can look at the trends from 1975 to today, instead of just 2005.
One of the problems that I see a lot in international statistics, at least for my purposes (so I am not sure if this is a general issue), is that I have to download one month or year individually and then add them up on my own in separate files. This takes a lot of work. It would be easy for me to write a script for just one file but that is not enough. I have to open 75 files and then provide an analysis. And I am doing all this in my spare time!
6. The old saying is that correlation does not necessarily imply causation but you’ve certainly pointed out that statistics can be fun and entertaining – such as correlations between civil engineering doctorates awarded and the amount of mozzarella cheese consumed. Do you have a personal favourite so far?
Hmm, that’s a good question. There are so many! My favourite is probably ‘Worldwide non-commercial space launches with sociology doctorates awarded’. It is my favourite not because of the content but the title – there is a two-year cycle that they go up and down on and it is very interesting how they end up linked to each other. A lot of graphs that are correlated over a period of ten years are simply rising, falling or staying consistent together. It is great that they are showing strong correlation but often it is indicative of a trend rather than an actual connection.
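The point about shared trends can be illustrated with a small sketch. The two series below are invented for illustration (they are not data from the site): each simply rises year on year with its own unrelated wiggle. Their levels correlate strongly, but their year-to-year changes barely correlate at all, which suggests the link is just a common trend.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def changes(values):
    """Year-to-year differences, which strip out a steady trend."""
    return [later - earlier for earlier, later in zip(values, values[1:])]

# Invented data: 20 "years", each series a rising line plus its own wiggle.
wiggle_a = [1, -1] * 10          # flips every year
wiggle_b = [1, 1, -1, -1] * 5    # flips every two years
series_a = [year + wiggle_a[year] for year in range(20)]
series_b = [2 * year + wiggle_b[year] for year in range(20)]

level_r = pearson(series_a, series_b)
change_r = pearson(changes(series_a), changes(series_b))
print(f"levels:  r = {level_r:.2f}")
print(f"changes: r = {change_r:.2f}")
```

Comparing changes rather than levels is only one of several ways to check for a shared trend; it is used here because it needs no extra libraries.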
7. The purpose of ‘Spurious Correlations’ is not that data is ambiguous but that such statistical data allows us to think carefully about whether there is a connection or if it’s coincidental. You also interestingly point out we now know about the dangers of smoking but that would not have happened if the correlation between cases of lung cancer and smoking had not been discovered. Is there an ultimate goal with this purpose in mind?
I think it has a multi-faceted goal. In part, it is for show – there are ways to manipulate graphs and statistics in a way that they can say things that the maths doesn’t actually say, but appears to say because of how you present it. Another part is that you can have a little bit of fun when you are dealing with numbers too.
8. What kind of feedback have you received so far?
A lot! It has been a lot of fun and many have asked, as you have yourself, about including more international statistics. It has been a very enjoyable process and a lot of people have given me some good pointers. I’ve actually learned a lot about statistics just through people who have contacted me about the site in an effort to better understand what I have been working on.
Learning terms like ‘data dredging’ has been very useful. Data dredging is basically what I do one-on-one – I just did not have a name for it, because I started this process without knowing what it was called until someone kindly pointed it out to me!
9. What have been the high points of working on ‘Spurious Correlations’ so far? Have there been any low points?
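For readers unfamiliar with the term, data dredging can be sketched in a few lines. The example below is a hypothetical illustration, not the site's actual correlation engine: it generates 100 series of pure random noise (the names `series_0`, `series_1` and so on are made up), compares every pair, and reports the strongest correlation it finds. With nearly 5,000 pairs and only ten points each, a strong-looking match turns up by chance alone.

```python
import random
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

rng = random.Random(42)  # fixed seed so the run is repeatable
N_SERIES, N_YEARS = 100, 10

# Every series is independent noise, so no pair has a real connection.
series = {
    f"series_{i}": [rng.gauss(0, 1) for _ in range(N_YEARS)]
    for i in range(N_SERIES)
}

# Dredge: score every pair and keep the one with the largest |r|.
scored = [
    (name_a, name_b, pearson(series[name_a], series[name_b]))
    for name_a, name_b in combinations(series, 2)
]
best_a, best_b, best_r = max(scored, key=lambda item: abs(item[2]))

print(f"pairs compared: {len(scored)}")
print(f"strongest match: {best_a} vs {best_b}, r = {best_r:.2f}")
```

The fewer points each series has and the more series are compared, the more impressive the best chance match will look.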
It has been great fun to appear in lots of different media outlets and to see my graphs being seen all over. Since I released them without a strong copyright, they can be reposted, so it was fun to see where they would show up.
As for low points, I first launched the project last May, right before finals. The struggle for me was trying to do an interview or trying to keep my web server online whilst simultaneously trying to revise!
10. Do you have any future plans for ‘Spurious Correlations’ in terms of development, the data you present, etc.?
I am trying to work on new ways to present the data. I am also the Co-President of the Drama Society here at Harvard Law School and we only do one performance every year, so for me it has been a juggle of time. Once the performance was over earlier this year, I had a lot more time to focus on the project. The biggest thing that has come out of ‘Spurious Correlations’ for me is that I have put together a book with Hachette Books that came out in May this year.
11. The BBC described you as a ‘statistical provocateur’. What would your advice be on how to spot a ‘bad statistic’?
I think there are two different answers to that – one of them pertains to when you are a statistician – you are putting together the data and trying to find ways to present it, correlate it and see how it works with other data. I think the message for a statistician is ‘Don’t have an angle in mind’ because if you do, you need to be wary of it because you are trying to prove something with data. You need to be very careful about deciding how you use that data because once you start choosing data in order to show a specific claim, there is a good chance that you are going to find a way to make your point with data even if you are lying.
Like what I do for example – when there is monthly data available, I aggregate it to yearly data so that I can have 20 points instead of a few hundred. It then lines up better on a graph and makes it look more correlated. That is a lie with a graph. You need to be aware of that and wary of it too.
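The effect of aggregation described above can be shown with a deliberately contrived sketch. The two monthly series below are invented: both drift upwards over 20 years, but each has its own large seasonal swing, chosen here to cancel out exactly within a year. Month by month the correlation is weak; once each year is summed into a single point, the seasonal swings vanish and the 20 yearly points line up almost perfectly.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def yearly_totals(monthly):
    """Sum each consecutive block of 12 monthly values into one yearly value."""
    return [sum(monthly[i:i + 12]) for i in range(0, len(monthly), 12)]

# Invented seasonal patterns; each sums to zero over a year (an assumption
# made so that yearly aggregation removes them completely).
SEASON_A = [6, 4, 2, 0, -2, -4, -6, -4, -2, 0, 2, 4]
SEASON_B = [-1, 1] * 6

months = range(240)  # 20 years of monthly data
series_a = [0.1 * m + 3 * SEASON_A[m % 12] for m in months]
series_b = [0.2 * m + 20 * SEASON_B[m % 12] for m in months]

monthly_r = pearson(series_a, series_b)
yearly_r = pearson(yearly_totals(series_a), yearly_totals(series_b))
print(f"240 monthly points: r = {monthly_r:.2f}")
print(f" 20 yearly points:  r = {yearly_r:.2f}")
```

Real data would rarely cancel this cleanly, but the direction of the effect is the same: averaging away short-term variation leaves mostly the trend, and trends tend to correlate.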
If you are a consumer of data, then I think the advice is different. Try to look more into the data itself and try to work out if the person that has been presenting to you has given you the full picture or not. Has that person only given monthly data instead of yearly data – have they left things out intentionally, why have they chosen not to present the data in this way – ask questions and look to see whether there is any significant correlation to their work or whether it could just be chance.