A Common Task Framework (CTF) for objective comparison of spatial prediction methodologies

Authors: Christopher K. Wikle, Noel Cressie, Andrew Zammit-Mangion and Clint Shumack

Scientists, engineers, and policy-makers are often interested in using statistical models to describe and predict (i.e., spatially interpolate) spatial processes given uncertain and often incomplete observations. Principled statistical methods (e.g., kriging) for such “geostatistical” problems have been around for many decades (e.g., see Cressie, 1990, 1993). In recent years, the datasets of interest in these problems have become increasingly large as data-collection and storage technologies have improved. High-volume (“big data”) spatial datasets include those that come from remote sensing platforms, sensor networks, telemetry, and GPS-based sensors. These datasets are a blessing: they have opened up new avenues of scientific exploration in problems such as sea-level rise, climate monitoring, the spread of invasive species, global mapping of greenhouse gases, and county-level maps of employer-household dynamics. However, the “curse of dimensionality” can render many of the traditional spatial methods impossible or impractical to implement (e.g., traditional kriging, which is very computationally intensive). This has led to the development of a wide variety of computationally efficient spatial statistical methods that give predictions and uncertainty estimates in the presence of very large data volumes and high-dimensional prediction domains. Unfortunately, there have been very few publications where these methods are compared in an objective manner (see Bradley et al., 2016, for an exception). In contrast, rapid innovation of new machine-learning methods for prediction and classification has led to the development of a “common task framework” to compare and evaluate these methodologies. In this Statistics Views article, we summarize the framework and then describe a prototype implementation that we have recently developed for comparing spatial-prediction methods.

In his 2015 Tukey Centennial Workshop presentation in Princeton, NJ (USA), “50 Years of Data Science,” David Donoho identified the “secret sauce” of the “predictive culture” (Breiman, 2001) in the statistical-and-machine-learning community as the so-called Common Task Framework (CTF) described in Liberman (2015). We summarize the salient components of the CTF as:

I. Publicly available datasets for training that are well documented and contain the relevant information for each observation (e.g., features, class labels).

II. Groups of “enrolled competitors” that have the common task of inferring prediction rules from the observations in the training data.

III. An objective “scoring referee” who can evaluate the prediction rules that are submitted based on their performance on test datasets that are unavailable to the competitors (i.e., “behind a wall”). The resulting prediction scores are reported by the scoring referee automatically.
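The role of the scoring referee in component III can be illustrated with a minimal Python sketch. It is not taken from any existing CTF implementation: the class name `ScoringReferee`, the use of root mean squared error as the score, and the one-dimensional inputs are all assumptions made for illustration. The point is only that the test responses stay inside the referee and that scoring is automatic.

```python
import math
from typing import Callable, List, Sequence, Tuple


class ScoringReferee:
    """Hypothetical referee: keeps test responses private and scores submitted rules."""

    def __init__(self, test_inputs: Sequence[float], test_responses: Sequence[float]):
        self._inputs = list(test_inputs)
        self._responses = list(test_responses)  # "behind the wall": never returned
        self.scoreboard: List[Tuple[str, float]] = []  # (competitor, score), best first

    def submit(self, competitor: str, rule: Callable[[float], float]) -> float:
        # Apply the submitted prediction rule to the hidden test inputs.
        errors = [rule(x) - y for x, y in zip(self._inputs, self._responses)]
        rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
        # Report the score automatically on a running scoreboard.
        self.scoreboard.append((competitor, rmse))
        self.scoreboard.sort(key=lambda entry: entry[1])
        return rmse
```

A competitor only ever sees the returned score and the scoreboard, never the test responses themselves.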

Spatial-prediction methodologies can certainly be considered in such a framework, but there are several additional issues that should be considered when developing such a CTF. These are based on the fact that there is always measurement uncertainty, and it should be accounted for explicitly in spatial-prediction problems (Cressie, 1993, pp. 127-130); there are different spatial-dependence models depending on the underlying scientific process generating the data; there are different types of missing data; and there are different prediction-domain sizes and different data volumes. This suggests that, in addition to the general CTF components above, a CTF for spatial prediction (CTF-SP) should at least consider the following additional components:

1) Multiple training datasets that exhibit:
    a) Different signal-to-noise ratios (SNRs): at least two, one of which has high SNR and one of which has low SNR.
    b) Different types of spatial dependence: at least two levels of spatial smoothness.
    c) At least one dataset that exhibits stationarity and at least one that exhibits non-stationarity.
    d) Different types of data configurations: one where data locations are uniform random, one where data locations are clustered with large gaps of missing data between clusters, and one where data locations are regularly spaced. (These configurations could equivalently be classified according to the types of “missingness” implied by their data locations.)
    e) Different prediction-domain sizes: at least two, one involving a moderately large (10^5 – 10^6) number of prediction locations, and one involving a very large (10^6 – 10^7) number of prediction locations.
    f) Different data volumes: at least two, one being moderately large and one being “massive” (>10^7).

2) Scoring rules for quality of spatial prediction. They consider not only the true prediction errors relative to the observations but also the coverage associated with the prediction uncertainty. These scoring rules may be presented on a running “scoreboard” that is publicly available, allowing easy comparison of methods.

3) Common and easily interpreted output graphics and diagnostics for comparison.

4) Common computer architecture, allowing objective comparison of computational cost and speed.

Although these additional CTF components may seem quite obvious for those who do spatial prediction, there are several challenges. First, most real-world spatial datasets are publicly available, thereby possibly breaking the “wall” that prevents competitors from having access to validation data held back for testing. In addition, spatial datasets typically have missing observations and, even when they are complete, they still have measurement error. This means that validation data held back are themselves imperfect. Non-stochastic predictors such as inverse-distance weighting or bilinear interpolation can also be added, but they do not typically come with uncertainties, and so coverage cannot be generally assessed for them; clearly, prediction errors can be assessed for all types of predictors. Finally, there is no way to know for sure what the true spatial-dependence type is for a real-world dataset; hence, one cannot expect a single spatial predictor to dominate all others for all datasets.
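To make the remark about non-stochastic predictors concrete, here is a minimal sketch of inverse-distance weighting in Python. The function name `idw_predict`, the tuple layout of the inputs, and the default power of 2 are illustrative assumptions. The function returns predicted values only, so prediction errors can be computed from its output but coverage cannot.

```python
import math
from typing import List, Sequence, Tuple


def idw_predict(
    train: Sequence[Tuple[float, float, float]],
    targets: Sequence[Tuple[float, float]],
    power: float = 2.0,
) -> List[float]:
    """Inverse-distance-weighted predictions.

    train:   (x, y, z) triples of locations and observed values.
    targets: (x, y) prediction locations.
    Returns one predicted value per target and no uncertainty measure.
    """
    predictions = []
    for tx, ty in targets:
        weighted_sum = 0.0
        weight_total = 0.0
        exact = None
        for x, y, z in train:
            d = math.hypot(tx - x, ty - y)
            if d == 0.0:
                exact = z  # target coincides with a datum: reproduce it
                break
            w = d ** (-power)
            weighted_sum += w * z
            weight_total += w
        predictions.append(exact if exact is not None else weighted_sum / weight_total)
    return predictions
```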

The challenges discussed in the previous paragraph result in slight modifications to the usual CTF implemented in a more traditional machine-learning setting of prediction or classification (e.g., an ImageNet or Kaggle competition). One approach is to create datasets that exhibit 1) (a)–(f), through simulation; another approach is to take existing datasets and “corrupt them” by adding additional measurement error and by removing observations. The latter then assumes that the observations are the “true process.” We favor the simulation approach in general, where one can use mechanistic models (e.g., numerical solutions to differential equations and/or agent-based models) to simulate complex processes and then “corrupt” the true-process output to accommodate the points in 1) (a)–(f) above.
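The simulate-then-corrupt idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: a sum of random cosine waves stands in for a mechanistic model, `length_scale` is a hypothetical smoothness control, the SNR is taken to be the ratio of signal variance to noise variance, and the held-back validation values are the uncorrupted truth.

```python
import math
import random


def simulate_truth(n_side, length_scale, rng, n_waves=20):
    """Smooth 'true process' on an n_side x n_side grid over the unit square.

    A larger length_scale gives lower-frequency waves and a smoother surface.
    """
    waves = [
        (rng.gauss(0.0, 1.0 / length_scale),
         rng.gauss(0.0, 1.0 / length_scale),
         rng.uniform(0.0, 2.0 * math.pi))
        for _ in range(n_waves)
    ]
    truth = {}
    for i in range(n_side):
        for j in range(n_side):
            sx, sy = i / n_side, j / n_side
            total = sum(math.cos(a * sx + b * sy + p) for a, b, p in waves)
            truth[(sx, sy)] = total / math.sqrt(n_waves)
    return truth


def corrupt(truth, snr, frac_missing, rng):
    """Add measurement error at a target SNR and hold back a random fraction.

    Returns (train, test): train has noisy observations, test has true values.
    """
    values = list(truth.values())
    mean = sum(values) / len(values)
    signal_var = sum((v - mean) ** 2 for v in values) / len(values)
    noise_sd = math.sqrt(signal_var / snr)

    locations = list(truth)
    rng.shuffle(locations)
    n_test = int(round(frac_missing * len(locations)))
    test_locs, train_locs = locations[:n_test], locations[n_test:]

    train = [(s, truth[s] + rng.gauss(0.0, noise_sd)) for s in train_locs]
    test = [(s, truth[s]) for s in test_locs]
    return train, test
```

Varying `snr`, `length_scale`, `frac_missing`, and the grid size gives crude versions of points (a), (b), (d), and (e); clustered or non-stationary designs would need additional code.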

A CTF for spatial-prediction methodology is made more difficult by the need to prevent data stewards who are also developing spatial methodology from deliberately choosing datasets that favor their own methods. This can be mitigated by ensuring the variety of test scenarios indicated in 1) above, especially with mechanistically simulated data. It could be made even more objective in a simulation environment by allowing the choices associated with 1) to be made randomly by a “steward algorithm.”
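One way such a steward algorithm might look is sketched below in Python. The option labels and the function name `steward_draw` are hypothetical. The sketch assumes only that each design choice in 1) has a small set of levels and that one level is drawn at random for each, with a recorded seed so the draw can be audited.

```python
import random

# Hypothetical levels for the design choices 1)(a)-(f).
SCENARIO_OPTIONS = {
    "snr": ["high", "low"],
    "smoothness": ["smooth", "rough"],
    "stationarity": ["stationary", "non-stationary"],
    "configuration": ["uniform random", "clustered", "regular"],
    "domain_size": ["moderately large", "very large"],
    "data_volume": ["moderately large", "massive"],
}


def steward_draw(seed):
    """Draw one level per design choice; the same seed reproduces the same scenario."""
    rng = random.Random(seed)
    return {name: rng.choice(levels) for name, levels in SCENARIO_OPTIONS.items()}
```

Because no person selects the scenario, a data steward who also develops methods cannot steer the choice toward their own method.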

The authors have recently implemented a prototypical CTF for spatial prediction (CTF-SP) at the Centre for Environmental Informatics, National Institute for Applied Statistics Research Australia (NIASRA), University of Wollongong. In this framework (illustrated in Figure 1), the user visits the CTF-SP website and enters their name, email address, and the name of their prediction function. The user then selects a training dataset and difficulty level (the amount of missing data), and uploads an R script file containing a single function that accepts as input two data frames, a training frame and a test (prediction) frame. This function then returns a list of the predicted mean response and associated prediction standard errors. After submitting a function, the user is taken to a status page that is updated once a minute to show the status of the task and eventually the results or any error messages. An email is sent to the user at the start of the task and again upon its completion. Successful results are viewable on a public leaderboard to compare the prediction quality and computational speed of the different spatial-prediction functions. The website is written in PHP and submitted tasks are queued in an SQLite database before being uploaded via SSH to a dedicated Linux system with R installed for processing. The queuing helps ensure that jobs supplied to the CTF-SP are not executed simultaneously, thus providing more accurate records of the computational time and resources required by the jobs.
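The website itself expects an R function, and its exact column names and return format are not given here. Purely to illustrate the contract described above (a training frame and a test frame go in, predicted means and prediction standard errors come out), here is a Python analogue. The field names `lon`, `lat`, `z`, `mean`, and `se` are assumptions, and the constant-mean predictor is a deliberately naive placeholder rather than a recommended spatial method.

```python
import math
from typing import Dict, List


def predict_constant_mean(train: List[dict], test: List[dict]) -> Dict[str, List[float]]:
    """Placeholder predictor following the two-frames-in, two-vectors-out contract.

    train: rows with hypothetical keys 'lon', 'lat', 'z' (observed response).
    test:  rows with hypothetical keys 'lon', 'lat' (prediction locations).
    """
    z = [row["z"] for row in train]
    n = len(z)
    mean = sum(z) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in z) / (n - 1))
    # Standard error for predicting a new value when observations are treated
    # as independent with a common mean (spatial dependence is ignored).
    se = sd * math.sqrt(1.0 + 1.0 / n)
    return {"mean": [mean] * len(test), "se": [se] * len(test)}
```

Any real submission would replace the body with a spatial predictor while keeping the same inputs and outputs.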

Figure 1. User interface for University of Wollongong prototype CTF for objective assessment of spatial prediction (CTF-SP). The CTF-SP website can be accessed from https://hpc.niasra.uow.edu.au/ctf/.  

Figure 2 shows an example of the leaderboard for the CTF-SP website for two users who have used the Fixed Rank Kriging (FRK) method of spatial prediction (Cressie and Johannesson, 2008). The data used are from the so-called “oco2lite” dataset with 30% of the data held back at random (this is labeled “Difficulty” on the leaderboard). These data are a processed/filtered version of the original NASA OCO-2 data, and they are also freely available at the OCO-2 Data Center. The difference between the User1 and User2 submissions is the number of basis functions used in the FRK procedure (92 for User1 and 364 for User2). The leaderboard shows the total clock time it took for each method along with three summary measures associated with the prediction quality, specifically, the root mean squared prediction error (RMSPE), mean difference between predicted value and validation datum (MPE), and the percentage of times the validation data fall inside the nominal 90% prediction intervals (COV90). Clicking on the “detail” link at the top of each entry repeats this summary and gives a summary of the function output and basic plots showing the data, validation data, predicted values, and the associated prediction standard errors (see Figure 3).
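The three leaderboard summaries can be computed as in the following Python sketch. The function name `score` is hypothetical, and the 90% prediction interval is assumed here to be the symmetric Gaussian interval (predicted mean plus or minus about 1.645 standard errors). How the CTF-SP site constructs its intervals is not stated, so this is an illustration rather than a description of its scoring code.

```python
import math
from statistics import NormalDist

# Half-width multiplier of a central 90% Gaussian interval (about 1.645).
Z90 = NormalDist().inv_cdf(0.95)


def score(pred_mean, pred_se, validation):
    """Return RMSPE, MPE (prediction minus validation), and COV90 in percent."""
    n = len(validation)
    errors = [p - v for p, v in zip(pred_mean, validation)]
    rmspe = math.sqrt(sum(e * e for e in errors) / n)
    mpe = sum(errors) / n
    covered = sum(1 for e, s in zip(errors, pred_se) if abs(e) <= Z90 * s)
    cov90 = 100.0 * covered / n
    return {"RMSPE": rmspe, "MPE": mpe, "COV90": cov90}
```

A well-calibrated predictor should have COV90 close to 90; values far below suggest standard errors that are too small, and values near 100 suggest they are too large.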

Figure 2. Sample leaderboard of the University of Wollongong CTF-SP website showing the results for two users. The column “Time” corresponds to the total clock time that it took to implement the prediction; “RMSPE” corresponds to the root mean squared prediction error; “MPE” is the mean difference between the predictions and the validation data; and “COV90” corresponds to the percentage of times that the validation data fall inside the nominal 90% prediction interval. The CTF-SP leaderboard can be accessed at https://hpc.niasra.uow.edu.au/ctf/board.php.

Of course, data often have both a spatial label and a temporal label. A spatio-temporal dataset may be considered spatial by ignoring the temporal label. Of more relevance to spatio-temporal analyses that seek dynamics rather than simple description, the dataset being analyzed may be partitioned into sub-datasets in which observations occur in given time periods. Then data in a particular time period could be considered a spatial dataset in its own right. An investigation of the original spatio-temporal dataset might then consider a time series of these spatial datasets. The CTF applied to spatio-temporal prediction problems is more challenging, since one must consider scoring rules for prediction in time in addition to prediction in space, for different types of temporal dependencies, and for different types of spatio-temporal interactions (e.g., separable, fully symmetric, etc.). See Cressie and Wikle (2011) for an overview of these issues.
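The partitioning described above, in which a spatio-temporal dataset becomes a time series of spatial datasets, can be sketched as follows in Python. The record layout `(t, lon, lat, z)` and the function name `split_by_period` are assumptions for illustration.

```python
from collections import defaultdict


def split_by_period(records, period_length):
    """Group (t, lon, lat, z) records into spatial datasets, one per time period.

    Returns a dict ordered by period index; each value is a list of (lon, lat, z).
    Periods with no observations are simply absent.
    """
    slices = defaultdict(list)
    for t, lon, lat, z in records:
        slices[int(t // period_length)].append((lon, lat, z))
    return dict(sorted(slices.items()))
```

Each value in the returned dictionary could then be passed to a purely spatial predictor, although doing so ignores the temporal dependence that a full spatio-temporal analysis would exploit.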

Under a CTF, statements of “better,” “faster,” and “more flexible” can be assessed on a leaderboard. However, the main benefit will not be to find the best spatial predictor, because we do not believe one exists. It will be to improve the “science” of spatial prediction, since conjectures about potentially better performances can be assessed quickly. Better overall spatial predictors will rise towards the top of the leaderboard and will be chosen amongst others near the top for their applicability in science, engineering, and policy-making.

Figure 3. Visualization of data and prediction summaries for User2 on the University of Wollongong CTF-SP website. The upper-left panel shows the data, the upper-right panel shows the validation data, the lower-left panel shows the predicted values corresponding to the validation data, and the lower-right panel shows the associated prediction standard errors. In all cases, the data and prediction locations are indicated by circles, with the color of the circle linked to the scale as indicated in the color bar for each plot. These maps can be accessed by clicking on the link under the “Details” column on the leaderboard (see Figure 2 above).

References:

1) Bradley, J. R., Cressie, N., and Shi, T. (2016). A comparison of spatial predictors when datasets could be very large. Statistics Surveys, 10, 100-131.
2) Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3), 199-231.
3) Cressie, N. (1990). The origins of kriging. Mathematical Geology, 22, 239-252.
4) Cressie, N. (1993). Statistics for Spatial Data, rev. edn. John Wiley & Sons, Hoboken, NJ.
5) Cressie, N. and Johannesson, G. (2008). Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society, Series B, 70, 209-226.
6) Cressie, N. and Wikle, C.K. (2011). Statistics for Spatio-Temporal Data. John Wiley & Sons, Hoboken, NJ.
7) Donoho, D. (2015). 50 years of Data Science. In Tukey Centennial Workshop, Princeton, NJ. Available at: http://www.ccs.neu.edu/course/cs7280sp16/CS7280-Spring16_files/50YearsOfDataScience.pdf
8) Liberman, M. (2015). Reproducible Research and the Common Task Method. Simons Foundation Lecture, April 1, 2015: https://www.simonsfoundation.org/lecture/reproducible-research-and-the-common-task-method/

About the authors:

Christopher K. Wikle: (to whom correspondence should be addressed) Department of Statistics, University of Missouri, 146 Middlebush Hall, Columbia, MO 65203, USA wiklec@missouri.edu
Noel Cressie, Andrew Zammit-Mangion and Clint Shumack: Centre for Environmental Informatics, National Institute for Applied Statistics Research Australia (NIASRA), School of Mathematics and Applied Statistics, University of Wollongong, NSW 2522, Australia