The Good, The Bad and the Ugly of publicly available research data

Features

  • Author: Carlos Alberto Gómez Grajales
  • Date: 11 Apr 2014
  • Copyright: Image appears courtesy of iStock Photo

PLoS stands for Public Library of Science, a nonprofit publishing project that aims to create and manage a community of open access scientific publications. It currently publishes over half a dozen peer-reviewed journals covering topics such as medicine, biology and genetics. Based in Levi's Plaza in San Francisco, the initiative would be a great topic of discussion in itself, whether for its innovative business model for scientific publishing or for its open, easily accessible way of promoting and disseminating science. But I am not here to discuss such notions, attractive as they are. Today, we'll talk a bit about PLoS' new data policies.


Beginning on 1 March 2014, all PLoS journals require every submitted article to be accompanied by a Data Availability Statement. From this date on, authors are obliged to make “all data underlying the findings described in their manuscript fully available without restriction, with rare exception”. It is still unclear how the policy will be implemented or enforced, or how the data will be made available to other researchers and curious readers who wish to play with the numbers, but the policy will eventually become a standard for these journals.

PLoS is not the only medium promoting the publication of research data. The White House Office of Science and Technology Policy (OSTP) has issued comments on expanding public access to federally funded research results by incorporating more detailed information about the studies, including full disclosure and publication of the generated data (1). Still, not everyone is happy about what is happening at PLoS. Some voices claim that such an idea is the next big step in the development of science. Others claim that such policies will only slow down progress and place unnecessary burdens on our already overloaded scientists (2). Some even say the policy cannot be implemented. But what does it really mean for scientific research? Is it a good idea? A bad one? A useless one? As a statistician, I am tempted to list some of the challenges, problems and benefits that I believe these kinds of policies would bring for the journal, its authors and its readers. Yes, too tempted. So I'll do it.

The Good

First, let's talk about the important benefits of openly publishing and distributing research data, of which there are many. A few years ago, John P. A. Ioannidis published an article with the bold title Why Most Published Research Findings Are False (3). It appeared in a journal named PLoS Medicine. You may have heard about PLoS before, possibly in an unbelievably well written article. The author of this paper argued that many researchers, lacking appropriate statistical knowledge or training in analytical methods, tend to fall prey to common mistakes that increase the chance of bias or incorrectly interpreted results: inadequate sample sizes, incorrectly specified hypotheses, erroneous interpretations of p-values (4) and many of those little details that some researchers usually omit (5). Ioannidis concludes his article with a grim sentence: “The majority of modern biomedical research is operating in areas with very low pre- and post-study probability for true findings”.

While this study focused on biomedical research, I think it is safe to assume that there is a lack of statisticians in many other fields, such as psychology, economics or sociology. Sadly, many researchers do not have a deep knowledge of statistical methods or the support of a professional, so it is understandable that some research papers have flaws in their data analysis. It is therefore easy to see why prestigious journals such as PLoS would like to see their authors' data first and, moreover, see it published for anyone to validate. Sometimes innocent mistakes are made, and open data means they can be detected. This would not only reduce the time it takes for mistakes to be located, thereby promoting more corrections and improvements to previous research. What really excites me is that it could also motivate scientists to learn more about statistics. Everyone will be reviewing and validating your data, so you should be extra careful. Suddenly, that two-week seminar on statistical analysis looks more attractive. As a statistician, I truly believe that anything that promotes and invites us to learn and study statistics should be applauded.

Another interesting outcome of PLoS' new policy would be a dramatic reduction in fraudulent research. Even though it can be correctly argued that deliberately fabricated results are extremely rare in any field of scientific research, and even rarer in peer-reviewed journals, it is always safer to implement stronger measures to prevent them. With the increasing pressure on researchers to secure funds in a post-crisis world, the burden of publishing quotas and the ambition to present the next big thing, more scrutiny of published data is commendable.

Now, imagine for a moment what it would mean if every single research paper had its related data published. That would mean that every single study has made available, in a usable format, all the numbers behind its statistics and calculations. You could suddenly replicate every graph and model, every p-value, every test statistic. As a former teacher, I can easily see the pedagogic advantages of freely distributed data. You could use real, relevant research as a class example, one that covers every aspect, from the objectives to the methodology and up to the statistical analysis. You could improve your knowledge of statistics simply by replicating what other researchers did, interpreting and understanding the work of some other scientist, maybe one with a deeper knowledge of mathematical models. Having the data sets, along with an article that describes the whole process that produced the results, would be immensely helpful for any student.
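
To make the idea concrete, here is a minimal sketch of what replication looks like once a study's raw measurements are deposited alongside the paper. The data and the statistic are invented for illustration: two hypothetical treatment and control samples, and the Welch two-sample t statistic a reader could recompute and compare against the value reported in the article.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's two-sample t statistic: the kind of number a reader
    could recompute directly from a study's published data set."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)  # sample variances
    se = math.sqrt(va / na + vb / nb)                # standard error of the difference
    return (mean(sample_a) - mean(sample_b)) / se

# Hypothetical measurements deposited with a published study
treatment = [5.1, 4.9, 5.6, 5.8, 5.2, 5.4]
control = [4.6, 4.8, 4.4, 5.0, 4.7, 4.5]

t = welch_t(treatment, control)
print(f"t = {t:.3f}")
```

Anyone with the data can run this and check whether the printed statistic matches the one in the paper; a mismatch is exactly the kind of innocent mistake open data would surface.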


But we can be even more ambitious. Let's not just reproduce. Think about using the data sets of previous research for more complex analyses: merging data from several studies into a large meta-analysis, extending the timeframe of an experiment by combining compatible longitudinal studies, or analyzing complex relationships that were impossible to measure because you did not have the money to survey people, yet someone else did. By allowing access to hundreds, thousands of different data sets, possibly obtained through highly sophisticated methods, a new window for low-budget research would open. Just think of how many articles are produced by analyzing government data such as the British Household Panel Survey or the American National Health Interview Survey. Now dream of how many new studies, how many new discoveries, could be produced using the data gathered locally by researchers, labs or nonprofits. By opening the door to a world full of data, a new frontier in science would be within our reach.
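
As a sketch of what "merging data from several studies" involves, here is a standard fixed-effect (inverse-variance) meta-analysis, with invented effect sizes and variances standing in for three hypothetical open data sets. Each study's estimate is weighted by the inverse of its variance, so more precise studies count for more in the pooled result.

```python
import math

def pooled_effect(effects, variances):
    """Fixed-effect (inverse-variance) meta-analysis: combine per-study
    effect estimates, weighting each by 1/variance."""
    weights = [1.0 / v for v in variances]
    est = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))  # standard error of the pooled estimate
    return est, se

# Invented effect sizes and variances from three hypothetical studies
effects = [0.42, 0.30, 0.55]
variances = [0.04, 0.09, 0.02]

est, se = pooled_effect(effects, variances)
low, high = est - 1.96 * se, est + 1.96 * se
print(f"pooled effect = {est:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

None of the three imaginary studies is decisive on its own, but pooling them yields a tighter confidence interval than any single one, which is precisely the promise of freely combinable research data.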

The Bad

Prior to the 1970s, American airports had minimal security arrangements, with almost no security checks. New measures were introduced after a few hijackings were reported, but even then the screening process was fairly unobtrusive and fast. Of course, it all changed after 9/11. New policies emerged that required longer, more detailed scrutiny of passengers and their luggage. The Transportation Security Administration, the agency created to enforce the new security measures within U.S. airports, installed more sophisticated checkpoints. As a result, people had to wait. Quick screenings became a thing of the past as passengers now had to endure a more thorough inspection. Waiting during peak times at checkpoints averaged 12 minutes in 2006; by 2008, the average wait had grown to 15 minutes (6). And it was about to get worse. By January 2012, at Chicago's O'Hare International Airport, the average wait in line between 4 and 5 p.m. was 35 minutes, and the longest wait was 137 minutes (7). The moral of the story is simple: more security measures are annoying and take time. PLoS, by asking its authors to present and publish all their related data, is implementing security arrangements that will slow down the reviewing process and make authors' lives harder.

Writing a scientific article is hard and time consuming. Researchers often face ambitious projects that are subject to numerous constraints. Funding research requires scientists to secure partnerships, scholarships and other sources that imply additional responsibilities and excruciating deadlines. A researcher has to think of a topic, then study, read and investigate it; devise a methodology; design an experiment; adapt, execute and validate it; and analyze, interpret and understand the results. If you add to the dozens of tasks a research team must complete the work it takes to manage and standardize their data sets and to publish them, the amount of effort is staggering. Not to mention that there are many different ways and formats to store data, but we will get to that in a moment.

The new policies will not only significantly increase the workload of researchers; they will also change the work of reviewers. With full access to the data, some reviewers may be expected to check some of the calculations. Perhaps the journal will require additional reviewers, with at least one specialized in statistical analysis, whose sole job would be to validate the numbers obtained from the data. This would take time, further complicating the reviewing process and possibly making it harder for articles to be accepted.

So why would you, a researcher, send your piece to a journal that will first make you work more and then slow down your approval with a longer, more complicated review? Why keep sending articles to PLoS if other journals would accept your work without hesitation? It is a well-identified problem that many journals have a dubious reviewing process, one that seems more focused on accepting than on rejecting (8). Prestigious journals, by complicating their reviewing processes, may actually be promoting these less serious publications, making them more attractive to scientists eager to publish their results as quickly as possible.

Some of these concerns are already being discussed, some more loudly than others (9). Many researchers are complaining about the new policies and echoing some of the worries I have just expressed. That is why, if we expect open access research data to become the norm, it is important to recognize and overcome some of the challenges ahead.

The Ugly

You may have already noticed that I am excited about the prospect of fully disclosed and published research data, as I believe many more statisticians will be. Still, I am a realist. This is going to be hard and it is going to be ugly. There are many challenges ahead - challenges that journals, publishers and research groups must overcome if openly accessible data is to become as useful as intended.

The first and most important challenge is the diversity of research itself and of the data it generates. No two research projects are alike, so no two data sets are gathered in the same way. Add the diversity of researchers, labs and analysts, and you realize how complex and varied data can be. That is not even considering that some people like R, some Python, some love other statistical software and some simply decide to program something entirely new. There are many ways to store, organize and encode data, and it will be hard to enforce a single unified format. Hard, but not impossible. There are many initiatives devoted to standardizing the process of data collection and organization, including some that help design strategies for opening access to research data. One such initiative is SDMX, which stands for Statistical Data and Metadata eXchange (10). Several organizations, including the United Nations and the World Bank, are backing this initiative, making it one of the most advanced projects for the standardization of data sets, one that includes tools and online applications for organizing complex data structures. However, SDMX was mostly designed for economic data; researchers may need their own set of tools, formats and applications. It is therefore important that we agree on the methods, the formats and the tools. This can happen in one of two ways. The first is to adopt and expand one of the initiatives already available, not necessarily SDMX but possibly others such as DDI (11) or GSBPM (12), adapting it to the needs and desires of the scientific community.
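
To illustrate the core idea behind standards like SDMX and DDI, without using their actual schemas, here is a toy sketch: a deposited data file travels with machine-readable metadata describing each variable, so another researcher (or a journal's tooling) can validate and interpret the data without guessing. All field names and values here are invented for illustration.

```python
import csv
import io
import json

# Illustrative metadata sidecar; the keys are hypothetical, not any official schema
metadata = {
    "title": "Hypothetical blood-pressure study",
    "variables": [
        {"name": "subject_id", "type": "string", "description": "Anonymised subject code"},
        {"name": "systolic_mmHg", "type": "integer", "unit": "mmHg"},
    ],
}

# Illustrative deposited data file
raw_csv = "subject_id,systolic_mmHg\nS001,121\nS002,134\n"

# A reader (or automated checker) can validate the data against its metadata:
declared = [v["name"] for v in metadata["variables"]]
rows = list(csv.DictReader(io.StringIO(raw_csv)))
assert all(set(row) == set(declared) for row in rows), "columns do not match metadata"

print(json.dumps(metadata["variables"][1]))  # units and types travel with the numbers
```

The real standards are far richer than this (code lists, provenance, survey design), but the principle is the same: the data set documents itself, which is what makes large-scale reuse feasible.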


The second way is that, probably building on such projects, a totally new standard may be born, one custom made for researchers and their needs. This standard would include tools, manuals and sets of applications aimed at helping authors, with web-based systems tailored to their needs and designed to make their lives, and their data management, easier. And maybe in the future, such systems will allow any user, any reader, to find, filter and even partially analyze all gathered research data in an easy and practical way. It is important to build such a system, to make sure it is easy to use (not every user will have a PhD; it must be aimed at everyone) and to make sure that everyone interested has the opportunity to learn it. Notice that such an effort would matter not only to PLoS but to the many journals that rely on the statistical analysis of results, which is to say most of them. PLoS does not have to be alone on this one.

Such a system would probably be hard to learn, yet it would be worth it. The second challenge we must overcome is showing everyone that it is worthwhile. Scientists need to know that, by organizing and curating their data, they are leaving a door open for future researchers. They may work a little more, but in return we will all, collectively, work a bit less. It is extremely important to talk with researchers and explain that publicly available research data is not a burden journals wish to impose out of authoritarianism. They need to know the advantages, the profit we expect to collect from their effort. They need to know that, in a few years, they will have an incredibly large collection of data available at their fingertips. They need to know that sharing data with colleagues and institutions will become extremely easy. They need to know of all the innovative research this change will spawn. But for them to know, someone needs to explain it to them. Right now, many authors argue that asking for data is a solution to a problem that does not exist. I think it is an opportunity to share a deeper knowledge with the world. That gap must be closed for this ideal to become real.

And with this comes the greatest challenge of all, one that organizations, professionals and companies worldwide have been working on for a few decades now: teaching statistics to scientists and researchers. To appreciate the advantages of immensely rich data sets, you need to know what to do with them. To see the value of standardized procedures for reviewing calculations and mathematical methodology, you need to understand what we are talking about. One of the reasons open data is only now taking off is that this is the time when people are starting to learn about the power of statistics. Trends like Big Data, data marketing and business metrics are becoming mandatory in certain circles. Open access data seems like a natural conclusion to that line of thought. That is why every statistician, every professor, every student of the discipline should take on the role of promoting, of teaching, what statistics can achieve. If everyone knew about it, policies like those presented by the Public Library of Science would be seen not as concerns but as huge opportunities.

References

(1) Expanding Public Access to the Results of Federally Funded Research (February, 2013)
http://www.whitehouse.gov/blog/2013/02/22/expanding-public-access-results-federally-funded-research

(2) PLOS' Bold Data Policy – The Scholarly Kitchen (March, 2014) http://scholarlykitchen.sspnet.org/2014/03/04/plos-bold-data-policy/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+ScholarlyKitchen+%28The+Scholarly+Kitchen%29

(3) Ioannidis, John Why Most Published Research Findings Are False – PLOS Medicine (August, 2005) http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.0020124

(4) The Most Difficult Step in Calculating Sample Size Estimates – The Analysis Factor
http://www.theanalysisfactor.com/sample-size-most-difficult-step/

(5) Cuando un doctor no sabe analizar datos – Asesoría Estadística (November 2011)
http://www.asesoriae.mx/2011/11/cuando-un-doctor-no-sabe-analizar-datos/

(6) New Airport Security to Slow Down Holiday Travel – CBS News (November, 2010)
http://www.cbsnews.com/news/new-airport-security-to-slow-down-holiday-travel/

(7) $100 to Fly through the airport – The Wall Street Journal (March, 2012) http://online.wsj.com/news/articles/SB10001424052702303863404577281483630937016

(8) Bad reviews: the perils of modern peer reviews – Significance Magazine Website (December 2013)
http://www.statslife.org.uk/significance/1103-bad-reviews-the-perils-of-modern-peer-reviews

(9) PLoS is letting the inmates run the asylum and this will kill them – Drug Monkey (February, 2014)
http://drugmonkey.wordpress.com/2014/02/25/plos-is-letting-the-inmates-run-the-asylum-and-this-will-kill-them/

(10) Statistical Data and Metadata eXchange http://sdmx.org

(11) Data Documentation Initiative http://www.ddialliance.org

(12) The Generic Statistical Business Process Model http://www1.unece.org/stat/platform/display/metis/The+Generic+Statistical+Business+Process+Model


This website is provided by John Wiley & Sons Limited, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ (Company No: 00641132, VAT No: 376766987)

Published features on StatisticsViews.com are checked for statistical accuracy by a panel from the European Network for Business and Industrial Statistics (ENBIS)   to whom Wiley and StatisticsViews.com express their gratitude. This panel are: Ron Kenett, David Steinberg, Shirley Coleman, Irena Ograjenšek, Fabrizio Ruggeri, Rainer Göb, Philippe Castagliola, Xavier Tort-Martorell, Bart De Ketelaere, Antonio Pievatolo, Martina Vandebroek, Lance Mitchell, Gilbert Saporta, Helmut Waldl and Stelios Psarakis.