‘Statisticians will become the heroes of the 21st century’: An interview with Open Data Institute CEO Gavin Starks

The Open Data Institute officially opened on 1st October 2012 with the mission ‘to catalyse the evolution of Open Data culture’. Over the past six months, the Institute has been overwhelmed by queries from all corners of the world, from Virgin Media to the Government of Indonesia, asking for help with their data.

During the recent G8 summit, the Open Data Institute launched its Open Data Certification scheme, created to help people worldwide discover, understand and use open data.

Gavin Starks, CEO of the ODI, who announced the launch towards the end of last month, said: ‘We’re entering an era where open is the new default. Much like the global web of documents has grown over the last 20 years, we are seeing the emergence of a global web of data. The certificates will help to create the right conditions for innovation: making open data easier to find, share and use. We want to give confidence to people to invest their time, energy, and money: to build sustainable services that meet user needs, and improve people’s lives. Given the level of interest we have seen, we anticipate wide global adoption of Open Data Certificates.’ (ODI press release, 17th June 2013).

StatisticsViews.com talks to Mr Starks about the Open Data Institute, the Certification scheme, data as culture, working with Open Healthcare UK and Ben Goldacre, and the role of the statistician.

1. The Open Data Institute has been described as a collaboration between the government, business and academia. Could you please clarify that further?

Firstly, the organisation is partly funded by the government through the Technology Strategy Board, so we have a £10 million commitment over 5 years. We are a non-profit organisation and sit outside of government, and none of our investors have Board seats. Our Board consists of Sir Tim Berners-Lee, Sir Nigel Shadbolt, myself and our non-executive Directors, who come from both the private and public sectors. We work with the government, providing policy consultations and feedback on data, and our Technical Director sits on the Transparency Open Data Standards Board. We have a close working relationship with the public sector. Secondly, we have a membership model so that companies can join the ODI as members; this is part of our remit to provide funding for the organisation but also to get corporates involved in the Open Data landscape. The largest corporation to sign up was Virgin Media and we are working with them on their data. Thirdly, we’ve started a relationship with Southampton University, who helped with the start-up of the ODI, and later this year we will be extending our program with academia.

We are finding that all of these different stakeholders are coming to us and wanting to do things together in the context of Open Data.

2. What kind of feedback are you receiving so far? Does it come from statisticians, members of the public or businesses?

The feedback has been incredible. We’ve had over 2,000 people through our space here in Shoreditch and we’re only six months old. We’ve had over 15 governments asking how they can set up an ODI in their country. It has been almost universally positive. Most people are very interested but are unsure where to start. We help people find where to begin.

We have a statistician working with us. Statisticians are a critical part of the landscape here. In fact, I would like to see the role of the statistician elevated as we move forward, because the ability to process and derive meaning from the information that is coming to market is critical. There is a sort of tension between the terms ‘statistician’ and ‘data scientist’, but I see the question as: how can we derive meaning from the whole data landscape?

There is a need to bring the statistician’s knowledge, whether it is in health, government, transportation, etc., together with computer knowledge – can we develop new tools to help us? I don’t really mind if it’s Big Data or middling data, so to speak (!), as long as we get to the most meaningful insight.

3. At the Big Data Debate held at the British Academy in November 2012, it was concluded that the potential for such data is enormous as it facilitates better decision-making; its huge size allows for the study of subtle interactions not traditionally possible in social research; makes citizens more informed and enhances democracy. However, Professor Harvey Goldstein also mentioned the downsides: Open Data could lead to (possibly deliberate) misleading inferences by the media and others and could also facilitate commercial or official institutions’ control over citizens. I was wondering what you thought of this conclusion?

With all tools, there come both positive and negative uses. Do people misuse data to suit their own ends at the moment? Yes, they do. One of the main tools that can help to combat that is transparency. You can see the provenance behind Open Data. If you have a continuity of supply and structure around that information, then the results become reproducible. It is therefore harder, in theory, to distort the numbers when others can look at the same data sets and their analyses, and confirm or deny their truth.

4. At the debate, the ethics of Open Data was a hotly discussed topic – in particular, the dangerous degree of control that such data could give, especially in relation to personal data. In the medical field, there is a good tradition of protecting patient confidentiality. We hear more and more debate about the kind of personal data that Google and Facebook deal with. The ethical and legal framework behind that data is still incomplete, as we have not had to face these kinds of questions before. How is the ODI approaching these ethical issues?

Open Data is not personal data, and if you take that on board, the question shifts to “What can be derived from this information?” and “What checks and balances need to be put in place so that we do not infringe on people’s privacy?” This is critical. The prescription analytics piece that we did last year with Open Health Care UK and Ben Goldacre (http://www.theodi.org/news/prescription-savings-worth-millions-identified-odi-incubated-company) looked at the Open Data available on NHS prescriptions – the cost and the date. That was all – it did not contain the patient data. The analysis of one class of drug revealed the potential for a £200 million saving a year. This was an exercise that did not reveal private data, but it did reveal a huge saving that could be made in the health service. We are now trying to get this project funded further so that it can be an on-going piece of analysis where you can engage in a behaviour change program and tangibly measure the outputs, so that everyone can see the data and analyses on one website. They can challenge the methodology if that needs to be done and have confidence in the analytics.

One of the reasons why Virgin Media has joined us is to help them make their data open without falling foul of data privacy. Nobody wants to release personal information into the open because it is to nobody’s benefit, and this applies even to the insurance industry. I have been speaking to members of the insurance industry recently and they said that, in healthcare, they don’t actually want to get right down into the individual health records, because the whole insurance industry is based on risk and on creating a balance of risk against the market.

5. It has been generally agreed that there needs to be a large-scale education of providers and users, with, for example, the Royal Statistical Society’s getstats campaign being a useful start. We have the technology and the knowledge, but from past interviews I have conducted, statisticians are concerned as to whether we are really equipped to deal with Big Data in terms of education, with professors saying postgraduate docs in statistics are falling, possibly due to mounting debt. Does the ODI have any plans to help with education in terms of training?

That is part of our remit. We have already launched a number of courses. The first course (Open Data in Practice) ran this spring. We have Open Data courses for journalists and courses on licensing, and we’re planning events with the Royal Statistical Society this year. Training is really critical – we have a two-year program with the World Bank to train the world’s political and national leaders so that they can create their own Open Data strategies. Again, where they are unsure where to begin, we help them reach that starting point, but the new knowledge that they then acquire is not just technical. It’s not just statistics, it’s across the board – what are the legal issues, the licensing issues, the behaviour change issues, and how do we communicate once we have this information? We demonstrated this with the prescriptions analytics piece. Once the results showed that there was a huge saving to be made, we did not go straight to the Daily Mail but to The Economist, who wrote a very thoughtful piece which was then followed up by a piece in the Financial Times.

Managing our communications is very important – when the Daily Mail did eventually cover it, they wrote a positive piece about the use of data. Focusing on problems, not people, is important, as is getting people around the table to say, “Here is a problem we’ve found, how do we help solve it?” This helps everyone and facilitates more informed and quicker decision-making. Consider the scale of the problems we face: massive social changes as society grows, how we sustain a population of 10 billion people, massive environmental changes, international supply chains and economic issues, all of which are, by definition, international issues. Statisticians will become the heroes of the 21st century as they help us work out what data is relevant, how to use it, what insight we can derive from the data, and how we can turn that into a program of intervention that brings everyone along on the same journey.

6. Paul Woobey, Senior Information Risk Owner for the Office for National Statistics, said that the ONS is progressively moving towards dealing with more administrative data (e.g. tax, benefits) and that Open Data can be useful in corroborating data that it already has. Woobey examined Google trends and found that various data sources and channels embody discrete sets of attributors. As Woobey explained later to Statistics Views after the debate, Big Data can be used to underpin the surveys that the ONS carries out and ensure that the data is turned into the right information for the public. Is this the kind of assistance you’re providing now, and does the ODI have any plans to work with the ONS?

We are in conversation with everyone right now, and that ranges from HMRC, World Bank, Bank of England, Department of Education – this is for everyone. One of our advertising lines at the moment is ‘Knowledge for everyone’ and to me this is very similar to the mid-1990s, when everyone was very excited about the existence of the Web and had lots of ideas. It then took us twenty years to work it out materially and we are still working it out. Again, everyone is very excited about Big Data and there is lots of potential, but we are just beginning this journey. We can learn a lot from that period in the 1990s about what we did well back then and how we can bring everyone with us. The scale of impact here is, to me, the same as that of the Web.

7. Dr Farida Vis of the University of Sheffield says that the danger with Open Data is that it risks reducing people to a number, so where do humans come in? In your lecture during Big Data Week, you used examples of data visualization. With the assistance of data visualization, we can become obsessed with wanting the whole picture that Open Data can offer us. What kind of funding will there be? What do we invest in?

This is where the domain experts come in. The data experts don’t have all the answers. The domain expertise in health, for example, lies ultimately with the medical professionals. One of the benefits we offer is that we help bring together healthcare professionals and big data analysts – e.g. if a group of healthcare professionals is presented with 40 million rows of data and is unsure how to get it into a spreadsheet, who can they be teamed with? Engineers are very good at solving problems but they don’t have the domain knowledge. The criteria must be led by what problems we are trying to solve, which data we can find to inform solving them, and who the right team is to work on the problems and deliver the insight.

8. I recently interviewed John Pullinger, the new President of the Royal Statistical Society, and he said that there is a massive opportunity with Big Data. We now have the potential to do things with data we never could before. However, it is still only a potential. For him, the big issue that has been underplayed so far is data provenance. What we have currently is masses of undifferentiated information spread out over the Web. But the skill he says we all need to learn is checking what procedures were put in place to ensure this information’s validity and quality. What steps is the ODI taking to ensure data validity?

This is a really critical question. What we have created is a framework called the Open Data Certificate and this is a completely free and open service. The certificate brings together certain pieces of metadata linked to a dataset, e.g. its URL, its structure, its provenance and the continuity of supply. This was another concern – will the data remain there, will it be updated on an on-going basis? We know that data is released but if it does not get updated, then it becomes irrelevant. Prime Minister David Cameron said, “Open Data is at the heart of my agenda for Government”, and President Barack Obama said, “Openness will strengthen our democracy and promote efficiency and effectiveness in Government.” As I said in my lecture during Big Data Week, if Prime Minister David Cameron has an iPad app which provides him with a dashboard as to the state of the UK, I’d really like to know whose data, whose algorithms, whose visualizations and whose app he is using to do that, and whether there is a dogma in the code. We should all have the same tools ourselves.

I believe Open Data can help and generate value for everyone. If we helped to bring all of this data into the open – not personal or private data, which are not Open Data, but all the environmental data, all the energy information, all the financial data, banking data and so on – this would help solve, or at least try to address, some of the greatest challenges of our time. We face massive issues of population, massive issues in environmental sustainability and massive issues in our economy. We are not going to solve these systems-level issues without joining everything together, and Open Data, to me, is very much about how we combine these issues and how we make them relevant to everybody.