Computational Hurdles in “Big Data” Statistics for Health Care Policy

Features

  • Author: Dr Sherri Rose
  • Date: 25 Feb 2014

The influx of "big data" has been widely discussed in numerous areas of science. While health care policy may not be the first field mentioned in these conversations, many critical research questions in health care policy require and make use of these large data sets. The problems we wish to address often involve hundreds of thousands, or even millions, of subjects and hundreds or perhaps thousands of covariates.


Consider, for example, the newly formed Accountable Care Organizations (ACOs) established under the federal Patient Protection and Affordable Care Act. ACOs are an effort to coordinate care for Medicare patients, where the patient-based goals are to improve outcomes and quality of care. With this system of doctors and health care providers working together, the Centers for Medicare and Medicaid Services also project substantial cost savings. It will be natural to compare patients in an ACO to patients in a "fee-for-service" model, where each service provided is billed separately. The typical fee-for-service model does not serve the patient well, as running additional tests or procedures does not necessarily improve care, although it does increase physician payments; physicians are thus incentivized to order a larger number of services.

Assessing the impact of ACOs on cost savings, health outcomes, and quality of care will require thorough analysis of medical records and quality surveys from hundreds of thousands, and eventually millions, of people. The real benefits and drawbacks of ACOs are currently unknown. Novel statistical methods may be needed to handle the uniqueness of the data, including specific types of missing and miscategorized variables. Informative missingness is particularly troublesome, and while a large body of literature addresses this problem, those methods may need extensions or new frameworks to operate in the context of very large data sets, particularly when considering computational needs.
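
As a purely illustrative aside (not part of the original article), one standard starting point for missingness that depends on measured covariates is inverse probability weighting, where complete cases are reweighted by the inverse of their estimated probability of being observed. The sketch below uses simulated data and scikit-learn; every variable name and coefficient is hypothetical, and real claims data would demand far more careful modelling.

```python
# Minimal, hypothetical sketch of inverse probability weighting (IPW) for an
# outcome that is missing at random given measured covariates.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "age": rng.normal(65, 10, n),
    "comorbidities": rng.poisson(2, n),
})

# Simulate an outcome (e.g., a quality score) that is more often missing
# for sicker patients, so a complete-case analysis is biased.
p_observed = 1 / (1 + np.exp(-(1.5 - 0.4 * df["comorbidities"])))
observed = rng.uniform(size=n) < p_observed
df["outcome"] = np.where(
    observed,
    0.1 * df["age"] - 0.5 * df["comorbidities"] + rng.normal(0, 1, n),
    np.nan,
)

# Model the probability of being observed given covariates, then weight
# each complete case by 1 / P(observed | covariates).
miss_model = LogisticRegression(max_iter=1000).fit(
    df[["age", "comorbidities"]], observed
)
pi_hat = miss_model.predict_proba(df[["age", "comorbidities"]])[:, 1]

complete = df["outcome"].notna()
weights = 1.0 / pi_hat[complete.to_numpy()]
naive_mean = df.loc[complete, "outcome"].mean()
ipw_mean = np.average(df.loc[complete, "outcome"], weights=weights)
print(f"complete-case mean: {naive_mean:.2f}  IPW mean: {ipw_mean:.2f}")
```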

The growth of electronic data in health care policy also has the potential to strongly influence the diagnosis of mental disorders and the prevention of adverse mental health outcomes. A key component of the Army Study to Assess Risk and Resilience in Servicemembers (Army STARRS) was the collection of administrative data on over 1.6 million active duty soldiers in an effort to identify predictors of suicide. Suicide is a leading cause of death among soldiers in the Armed Forces, topped only by combat deaths, and it takes the lives of more than 30,000 people in the civilian population each year [1]. The World Mental Health Survey Initiative has collected data on over 150,000 people in nearly 30 countries in an effort to examine an array of mental health outcomes, including suicide [2]. When we are better able to identify individuals at high risk for suicide, targeted interventions can be offered. Building effective estimators for prediction in these types of data sets may require not only flexible methods, but possibly also new computational advances in how the estimators are implemented.
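
As one purely illustrative possibility (not drawn from the article or from the Army STARRS analyses), flexible prediction can be approached with a cross-validated stacking ensemble that combines several learning algorithms and uses held-out data to decide how to weight them. The sketch below uses scikit-learn on simulated, imbalanced data; the sample size, learners, and tuning choices are all hypothetical.

```python
# Minimal, hypothetical sketch of a cross-validated stacking ensemble for
# predicting a rare binary outcome; the data are simulated, not real records.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Simulated data with a rare (1%) positive class and 40 covariates.
X, y = make_classification(n_samples=20_000, n_features=40,
                           weights=[0.99, 0.01], random_state=0)

ensemble = StackingClassifier(
    estimators=[
        ("logit", LogisticRegression(max_iter=1000)),
        ("forest", RandomForestClassifier(n_estimators=100, n_jobs=-1)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # internal cross-validation for combining the candidate learners
    n_jobs=-1,
)

# Outer cross-validation gives an honest estimate of discrimination (AUC),
# but note that the nested loops multiply the computational cost.
auc = cross_val_score(ensemble, X, y, cv=5, scoring="roc_auc", n_jobs=-1)
print("cross-validated AUC:", round(auc.mean(), 3))
```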

There have been many statistical advances in the development of methods for "big data" problems; however, substantial challenges remain. While one may prefer to develop statistical learning methods with fewer statistical assumptions in nonparametric or semiparametric models, actually executing these estimators on very large data sets in the available statistical software programs can be problematic, as the two examples above illustrate. It is not just sound statistical methodology that we need to be concerned about when handling big data, but also the computational load. Proprietary software may be too rigid to elegantly accommodate a new method as a function, while open-source software, despite accepting user-generated functions more easily, may require exorbitant amounts of memory and time to run. Incorporating techniques such as cross-validation into an estimator to avoid over-fitting only adds to the computational burden.
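
As a generic illustration of one way to keep memory use bounded (a sketch under assumptions, not a recommendation from the article), some estimators can be fit out of core, streaming the data in chunks and updating the fit incrementally. The file name, column layout, and chunk size below are hypothetical, and the approach only applies to estimators that support incremental updates.

```python
# Minimal, hypothetical sketch of out-of-core (streaming) model fitting when
# the data set is too large to load into memory at once.
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Logistic regression fit by stochastic gradient descent; on older
# scikit-learn versions the loss is named "log" rather than "log_loss".
model = SGDClassifier(loss="log_loss")
classes = [0, 1]

# Stream the (hypothetical, all-numeric) file in 100,000-row chunks so that
# memory use stays bounded, updating the model with each chunk.
for chunk in pd.read_csv("claims.csv", chunksize=100_000):
    X = chunk.drop(columns=["outcome"])
    y = chunk["outcome"]
    model.partial_fit(X, y, classes=classes)
```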

In 2013, I wrote about "Statisticians' Place in Big Data," and in 2012, "Big Data and the Future." In both of these articles, I discussed the need for statisticians to work in interdisciplinary teams, often with subject-matter experts [3,4]. However, it is also crucial for us to devote energy and time to developing computational solutions appropriate for our methodology, such that they can be used not only in our own labs but also in practice. This includes working with computer scientists and engineers, not exclusively on the statistical methods, but also to inform their new work on database infrastructure that allows us to take our "big data" and analyse it with both statistical rigour and speed. There is interest on both sides in developing solutions to close these gaps, and it’s important we continue nurturing these efforts.

References

[1] Kessler, Ronald C., et al. Design of the Army Study to Assess Risk and Resilience in Servicemembers (Army STARRS). Int. J. Methods Psychiatr. Res. 22(4): 267-275, 2013.
[2] Nock, Matthew K., Guilherme Borges, and Yutaka Ono, eds. Suicide: Global Perspectives from the WHO World Mental Health Surveys. Cambridge University Press, 2012.
[3] Rose, Sherri. Statisticians' place in big data. Amstat News, 428:28, 2013.
[4] Rose, Sherri. Big data and the future. Significance, 9(4):47-48, 2012.
