Columbia Professor Uses Statistics and Data Science to Solve Global Water Resource Problems


  • Author: Lillian Pierson P.E.
  • Date: 21 Jan 2014
  • Copyright: Image appears courtesy of iStock Photo

Dr. Upmanu Lall, Director of the Columbia Water Center, is using complex statistics and a solid understanding of climate, agriculture, commerce, engineering, technology, and politics to solve some of the world’s most urgent water resource problems. His deep understanding of the economic, governmental and social systems affecting water usage and consumption patterns allows him to tackle water resource problems of unparalleled magnitude.

Through statistical modeling of worldwide rainfall data, Dr. Lall has discovered that the poorest countries in the world are those with the greatest variability of rainfall within a year and across a series of years, regardless of whether the nation receives a favourable amount of average annual rainfall (the notable exceptions to this finding occur in major oil-producing countries). Conversely, the nations with the greatest GDP were found to have the least rainfall variability and only modest average annual rainfall. This correlation is due to the effects of rainfall variability on agriculture and the high energy requirements for pumping water from aquifers in nations with highly variable rainfall.
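The variability statistic at the heart of this finding can be made concrete with a toy calculation. The rainfall figures below are invented for illustration, not drawn from Dr. Lall's data; the point is only that two regions with the same average rainfall can differ sharply in inter-annual variability, here measured by the coefficient of variation:

```python
# Illustrative only: the finding above concerns real worldwide data;
# these two synthetic records merely show the statistic involved.
import statistics

# Hypothetical annual rainfall totals (mm) over ten years.
stable_regime = [980, 1010, 995, 1005, 990, 1000, 1015, 985, 1002, 998]
variable_regime = [400, 1600, 300, 1800, 500, 1500, 350, 1700, 450, 1400]

def coeff_of_variation(series):
    """Inter-annual variability: standard deviation relative to the mean."""
    return statistics.stdev(series) / statistics.mean(series)

cv_stable = coeff_of_variation(stable_regime)      # ~0.01
cv_variable = coeff_of_variation(variable_regime)  # well above 0.5
# Both regimes average roughly 1000 mm/year, yet their variability differs
# by an order of magnitude -- the distinction the finding turns on.
```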


Interestingly, the cost of the energy required to pump water from aquifers generally keeps water usage in balance, so that extraction rates do not surpass the rate of aquifer recharge. In energy-subsidized nations like India, however, farmers can afford to continue pumping water from deeper and deeper levels of the aquifer. In these places, extraction rates far exceed aquifer recharge rates, putting communities at extreme risk of losing all access to groundwater supply in the future. In nations with highly variable rainfall, that is a precarious position indeed.

In an exclusive interview for Statistics Views, Dr. Lall discusses how he uses statistics to uncover and solve some of the world’s most pressing water problems.

1) How did you use statistics to study and draw conclusions about rainfall rates (variability and annual average) and the agriculture-energy-consumption usages of water resources?

Let me start with rainfall variability. This is the primary source of risk for social impacts through floods and droughts. In our work, we have used statistics to characterize the nature of variability in multiple ways:

1. Long term or inter-annual variability from historical and paleo-proxy records

Here, we have developed multilevel Hierarchical Bayesian models to reduce the uncertainty in reconstructing long climate records of drought related to rainfall and streamflow from proxies such as tree rings. The key innovation has been using information from multiple sites together to simultaneously model common spatial information and reduce/characterize uncertainty.
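As a rough illustration of the partial pooling that multilevel models provide, the sketch below shrinks noisy short-record site means toward a pooled mean. It is a toy normal-normal calculation with invented drought-index values and assumed variances, not the Hierarchical Bayesian tree-ring model described above:

```python
# Toy illustration of partial pooling, the mechanism by which hierarchical
# models borrow strength across sites. Closed-form normal-normal shrinkage
# with invented data and assumed variances -- not the actual proxy model.
import statistics

# Hypothetical standardized drought-index records at three sites;
# site C has very few observations, so its raw mean is unreliable.
sites = {
    "A": [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, 0.1, -0.3],
    "B": [0.4, 0.2, 0.5, 0.3, 0.6, 0.4, 0.5, 0.2],
    "C": [1.5, 1.2],
}

sigma2 = 0.25  # assumed within-site (observation) variance
tau2 = 0.10    # assumed between-site variance

all_obs = [y for ys in sites.values() for y in ys]
grand_mean = statistics.mean(all_obs)

pooled = {}
for name, ys in sites.items():
    n = len(ys)
    ybar = statistics.mean(ys)
    # Precision-weighted compromise between the site mean and the grand mean.
    w_site = n / sigma2
    w_prior = 1.0 / tau2
    pooled[name] = (w_site * ybar + w_prior * grand_mean) / (w_site + w_prior)

# Site C (two noisy observations, raw mean 1.35) is pulled strongly toward
# the grand mean; the longer records at A and B move comparatively little.
```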

2. Modeling low frequency variability in rainfall and streamflow

Here, we have taken historical or paleo-reconstructed climate variables and used methods such as the wavelet autoregressive moving average (WARM) that we developed to identify the main frequency bands in which long term variability is organized and then to use that decomposition to apply ARMA models to each band to generate synthetic sequences.
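The decompose-then-model idea can be sketched in miniature. The snippet below substitutes a crude moving-average filter for the wavelet decomposition and fits AR(1) rather than full ARMA models to each band; the series, filter window, and AR order are all illustrative assumptions:

```python
# Simplified sketch of the WARM idea: split a series into a low-frequency
# band and a residual band, fit a simple AR(1) to each, and recombine
# band-wise simulations into a synthetic sequence. The real method uses a
# wavelet decomposition and full ARMA fitting.
import math
import random

random.seed(42)

# Synthetic "streamflow anomaly": a slow oscillation plus noise.
n = 400
series = [math.sin(2 * math.pi * t / 50) + random.gauss(0, 0.3)
          for t in range(n)]

def moving_average(x, window):
    """Crude low-pass filter standing in for a wavelet smooth."""
    half = window // 2
    out = []
    for i in range(len(x)):
        chunk = x[max(0, i - half):min(len(x), i + half + 1)]
        out.append(sum(chunk) / len(chunk))
    return out

low_band = moving_average(series, 25)
high_band = [s - l for s, l in zip(series, low_band)]

def ar1_coeff(x):
    """Lag-1 autocorrelation, used as the AR(1) coefficient estimate."""
    m = sum(x) / len(x)
    num = sum((x[i] - m) * (x[i - 1] - m) for i in range(1, len(x)))
    den = sum((v - m) ** 2 for v in x)
    return num / den

phi_low = ar1_coeff(low_band)    # near 1: persistent slow band
phi_high = ar1_coeff(high_band)  # smaller: noisier residual band

def simulate_ar1(phi, sigma, length):
    out = [0.0]
    for _ in range(length - 1):
        out.append(phi * out[-1] + random.gauss(0, sigma))
    return out

# Recombine the per-band simulations into one synthetic sequence.
synthetic = [a + b for a, b in zip(simulate_ar1(phi_low, 0.05, n),
                                   simulate_ar1(phi_high, 0.3, n))]
```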

Together, 1. and 2. provide a model to study long term return periods on drought severity and duration at a regional scale, and thus analyze the potential impacts on agriculture, reservoir operation, groundwater, and energy use.

3. Daily/sub-daily rainfall, weather dynamics, and related extremes

A number of "weather generators" that simulate daily weather exist to provide scenarios for hydrological and agricultural models. Typically, these did not address the spatio-temporal dependence across the weather variables, notably precipitation and temperature, and hence the dynamics of snow vs rain and their impact on flood generation were not well represented. The existing models also made assumptions about the probability distributions of the daily rain (e.g. exponential or gamma) and of the dry/wet spell durations (exponential) that were often not validated by data, leading to underestimation of the probability of long wet/dry spells. Our work in this area has been to:

a) Focus on nonparametric function estimation methods (k-nearest neighbor, kernel density, logspline density, copula) to relate the space-time dependence across variables; and

b) Explore richer dependence structures such as Nonhomogeneous Hidden Markov Models, or nonparametric renewal processes so that the heterogeneous nature of space-time variability in rainfall is better represented.
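A minimal sketch in the nonparametric spirit of (a) and (b) is a k-nearest-neighbor weather bootstrap: given today's state, sample the day that followed one of its k closest historical analogues. The record, distance metric, and k below are invented for illustration, and the draw is uniform over the k neighbours rather than kernel-weighted:

```python
# Toy k-nearest-neighbor daily weather resampler: condition on today's
# (precip, temperature) state, find similar historical days, and sample
# one of their successor days. All data and tuning choices are illustrative.
import math
import random

random.seed(7)

# Hypothetical historical record of daily (precip mm, temp C) pairs.
history = [
    (0.0, 25.0), (0.0, 27.0), (5.0, 22.0), (12.0, 20.0), (0.0, 26.0),
    (3.0, 23.0), (0.0, 28.0), (8.0, 21.0), (0.0, 26.5), (2.0, 24.0),
    (15.0, 19.0), (0.0, 27.5), (1.0, 25.5), (6.0, 22.5), (0.0, 26.0),
]

def knn_successor(state, k=3):
    """Sample the day that followed one of the k nearest historical days."""
    # Candidates exclude the final day, which has no successor.
    dists = [(math.dist(state, history[i]), i)
             for i in range(len(history) - 1)]
    dists.sort()
    _, idx = random.choice(dists[:k])  # uniform over the k neighbours
    return history[idx + 1]

# Simulate a short synthetic sequence by chaining successor draws.
simulated = [history[0]]
for _ in range(30):
    simulated.append(knn_successor(simulated[-1]))
```

Because every simulated day is an actual historical day, cross-variable dependence (wet days being cool days, here) is preserved automatically, which is the appeal of the resampling approach.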

4. In addition, we have developed methods for spatial interpolation and temporal forecasting of rainfall and streamflow that use local polynomial regression as well as copulas. These address the problem of filling in information at sites that do not have long continuous records as well as at ungaged sites. In this context, we have also looked at methods of linear and nonlinear dimension reduction of large space-time data sets, and used Machine Learning (minimum variance embedding) methods to explore the dimension reduction and subsequent time series forecasting of the phenomena.
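To make the local polynomial interpolation idea concrete, here is a one-dimensional locally weighted linear regression estimating rainfall at an ungaged point between gauges. The gauge data, Gaussian kernel, and bandwidth are illustrative assumptions:

```python
# Local polynomial (locally weighted linear) regression in one spatial
# dimension: fit a weighted least-squares line near the target location.
# Toy gauge data; bandwidth and kernel choice are assumptions.
import math

# Hypothetical gauge positions (km along a transect) and mean rainfall (mm).
x_obs = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0]
y_obs = [800.0, 850.0, 940.0, 1020.0, 1080.0, 1150.0]

def local_linear(x0, bandwidth=15.0):
    """Gaussian-weighted line fit around x0; returns the fit at x0."""
    w = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in x_obs]
    sw = sum(w)
    xm = sum(wi * xi for wi, xi in zip(w, x_obs)) / sw
    ym = sum(wi * yi for wi, yi in zip(w, y_obs)) / sw
    sxx = sum(wi * (xi - xm) ** 2 for wi, xi in zip(w, x_obs))
    sxy = sum(wi * (xi - xm) * (yi - ym)
              for wi, xi, yi in zip(w, x_obs, y_obs))
    slope = sxy / sxx
    return ym + slope * (x0 - xm)

# Interpolate at an ungaged site between the 20 km and 30 km gauges.
estimate = local_linear(25.0)
```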

5. Finally, a body of our work has focused on the estimation of the return period of extreme floods in a nonstationary environment. Here, nonstationarity refers to changing probabilities and mechanisms over time, primarily due to changing underlying climate. Initially, our work focused on extending empirical probability density estimation using kernel, local polynomial, and local quantile density estimation to the extrapolation of the tail of the distribution under stationarity assumptions, but granting that the data was likely from a finite mixture of causative processes. Subsequently, we used Hierarchical Bayesian models to explore the dependence of extreme rainfall/floods on large scale climate indicators and mechanisms.
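The first step described here, empirical density estimation of flood peaks, can be sketched with a plain Gaussian kernel density estimate and a numerically integrated tail. The peak record and bandwidth below are invented, and the sketch omits the finite-mixture and climate-covariate machinery the actual work adds:

```python
# Minimal Gaussian kernel density estimate of annual flood peaks, with an
# exceedance probability read off the fitted tail. Illustrative data and
# bandwidth only; real tail extrapolation is far more careful than this.
import math

# Hypothetical annual flood peaks (m^3/s), with two large outliers.
peaks = [120, 135, 150, 160, 142, 155, 170, 300, 128, 138,
         145, 152, 165, 280, 133, 148, 158, 172, 140, 150]

bandwidth = 20.0  # assumed smoothing bandwidth

def kde_pdf(q):
    """Gaussian kernel density estimate at flow q."""
    c = 1.0 / (len(peaks) * bandwidth * math.sqrt(2 * math.pi))
    return c * sum(math.exp(-0.5 * ((q - p) / bandwidth) ** 2)
                   for p in peaks)

def exceedance_prob(q, q_max=1000.0, step=0.5):
    """P(peak > q) by numerically integrating the KDE tail."""
    grid = [q + i * step for i in range(int((q_max - q) / step))]
    return sum(kde_pdf(g) * step for g in grid)

p200 = exceedance_prob(200.0)
return_period = 1.0 / p200  # mean years between peaks above 200 m^3/s
```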


2) Overall, what statistical models and methods have you found particularly useful in your work? And, what sorts of water-related problems are those particular models and methods most useful in solving?

1. Nonparametric function estimation -- broadly multivariate kernel density estimation methods (including nearest neighbor, spline and local polynomial methods)

They provide a flexible yet understandable approach to modeling and are easy to communicate to practitioners and researchers. They have allowed us to address a broad class of problems in prediction and simulation, and hence been building blocks for simulation-optimization models that can be used to improve risk management and water system operation.

2. Hierarchical Bayesian Models

These have been very useful for multivariate analysis of hydroclimatic data, and allow one to reduce and describe uncertainties clearly. In turn these have allowed us to provide reliable extrapolation of information to poorly gaged or ungaged sites, a critical aspect for new water resource development.

3. Frequency domain methods, specifically wavelets and multi-taper spectral analysis, as building blocks for identifying and modeling long range climate variability

This has been a key aspect to reduce surprise for the management of water reservoirs, by exposing not just the severity-duration of drought, but also better informing the likely recurrence and associated physical factors.
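The basic frequency-domain idea, finding the band in which long-range variability is organized, can be illustrated with a raw periodogram, a much cruder tool than the wavelet and multi-taper methods named above:

```python
# Bare-bones periodogram via the discrete Fourier transform: pick out the
# dominant low-frequency cycle in a synthetic "streamflow" series. Real
# applications use wavelet or multi-taper estimates; this shows only the
# idea of locating the band where variability is concentrated.
import cmath
import math
import random

random.seed(1)

n = 256
period = 32  # one cycle every 32 time steps
series = [math.sin(2 * math.pi * t / period) + random.gauss(0, 0.5)
          for t in range(n)]

def periodogram(x):
    """|DFT|^2 / n at the positive Fourier frequencies k/n."""
    m = len(x)
    power = []
    for k in range(1, m // 2):
        coef = sum(x[t] * cmath.exp(-2j * math.pi * k * t / m)
                   for t in range(m))
        power.append(abs(coef) ** 2 / m)
    return power  # power[k-1] corresponds to frequency k/m

power = periodogram(series)
dominant_k = power.index(max(power)) + 1
dominant_period = n / dominant_k  # recovers the 32-step cycle
```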

3) Overall, what types of distributions and methods are most helpful to you in using spatial statistics to understand water problems? And, what sorts of water-related problems are those particular distributions and methods most useful in solving?

For spatial statistics, the traditional geostatistics methods have been applied extensively, largely for interpolation or conditional field simulation for use with hydrologic models. These are very useful and I teach them. The primary limitation here used to be the assumption of second-order stationarity, since, again, the data did not usually support this assumption. The major unsolved (in my view) issue here is how one can reconcile information across scales -- point, to areal, to regional -- both for upscaling and downscaling, recognizing that in both space and time the process is heterogeneous. In this respect, I have appreciated using the nonlinear decomposition methods that come from the machine learning literature, but even these do not really solve this problem.

The relevance to water problems is the following. Hydroclimatic data were historically point data, rich in time series information and sparse in space. As a result, it was difficult to estimate the resource availability or the intensity of flood forcing by extreme rainfall. In the last twenty years, radar and satellite based products have purported to provide areal estimates at different spatial resolutions. In both cases, the areal (block average) estimate is derived using a regression of some sort on point rainfall or soil moisture data. The mapping of point to areal information in this way, across sensors that have inherently different resolution and sensitivity, has been studied, but it is not clear to me that this is a solved problem. A newer version still of this problem is the need/effort to relate climate change projections to future rainfall. The climate models produce output at 0.5 to 2.5 degrees latitude/longitude. It is typically biased with respect to averages constructed from satellite or ground based data for the same areal extent. These biases relative to the corresponding historical period statistics (for example, the 1960 to 2000 mean, standard deviation, and spatial variance) can be quite large -- factors of 3 to 5 -- and the bias varies by location and season. There is a lot of ad hoc and poor work being done to rectify these biases. Until this is addressed, the credibility of these projections will remain low.
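To make the bias-adjustment problem concrete, here is quantile mapping, one simple and widely used recipe (and, per the caution above, one often applied too casually): each model value is replaced by the observed value at the same empirical quantile of the model's own climatology. All numbers below are invented:

```python
# Quantile mapping sketch: map each model value to the observed value at
# the same empirical quantile. Toy numbers with an assumed ~3x wet bias,
# shown only to make the mechanics of bias adjustment concrete.

# Hypothetical seasonal rainfall (mm): model output biased high by ~3x.
observed = sorted([100, 120, 90, 150, 110, 130, 95, 140, 105, 125])
model_hist = sorted([300, 380, 280, 450, 330, 400, 290, 430, 310, 370])

def quantile_map(value):
    """Map a model value to the observed value at the same quantile."""
    # Empirical rank of `value` within the model's own climatology...
    rank = sum(1 for m in model_hist if m <= value)
    rank = min(max(rank, 1), len(observed))
    # ...then read the observed distribution at that rank.
    return observed[rank - 1]

corrected = [quantile_map(v) for v in [320, 410, 290]]
# The corrected values land back in the observed range -- but note that
# this adjusts the marginal distribution only, one location and season at
# a time, which is exactly why such fixes can be ad hoc.
```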

This website is provided by John Wiley & Sons Limited, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ (Company No: 00641132, VAT No: 376766987)

Published features on Statistics Views are checked for statistical accuracy by a panel from the European Network for Business and Industrial Statistics (ENBIS), to whom Wiley express their gratitude. The panel members are: Ron Kenett, David Steinberg, Shirley Coleman, Irena Ograjenšek, Fabrizio Ruggeri, Rainer Göb, Philippe Castagliola, Xavier Tort-Martorell, Bart De Ketelaere, Antonio Pievatolo, Martina Vandebroek, Lance Mitchell, Gilbert Saporta, Helmut Waldl and Stelios Psarakis.