Environmental data science is a multi-disciplinary and mature field of research at the interface of statistics, machine learning, information technology, climate and environmental science. The two-part special issue ‘Environmental Data Science’ comprises a set of research articles and opinion pieces led by statisticians who are at the forefront of the field. The editorial reproduced below identifies and discusses common strands of research that appear in the contributions to Part 1, which largely focus on statistical methodology. These include temporal, spatial and spatio-temporal modeling; statistical computing; machine learning and artificial intelligence; and the critical question of decision-making in the presence of uncertainty. This editorial complements that of Part 2, which largely focuses on applications.
Environmental and climate data analyses are conducted to improve our understanding of Earth, the biosphere, the interaction between human activity and climate, and the response of ecosystems to a changing climate. Never has this understanding been more crucial for our ecosystem biodiversity and well-being as we grapple with the reality of living in a warming world. Much of what we know derives from fundamental physical principles and theories. However, methodological and technological advances in the last two decades have seen analyses and forecasts in the environmental sciences harness the increased availability of data, and have led to the multi-faceted, interdisciplinary field that we refer to here as environmental data science (EDS). EDS considers every aspect of a workflow and value-chain involving environmental data, from the moment data are collected and stored (databases), through to the stage at which the data are used to support decision-making.
Data science empowers the field of statistics in applications involving big and complex data, high-performance computing, and artificial intelligence (including machine and deep learning), while itself benefiting from the solid theoretical and practical foundations of statistics, which underpin the reliability and validity that are so crucial for any data analysis. Much has been said about the role of statistics in data science; see, for example, Diggle (2015); Efron and Hastie (2021); Hassani et al. (2021); and Peng and Parker (2022). Both public and private organizations are increasingly engaged in data science capacity-building to strengthen their relevance and value; these include organizations focused on environmental applications, such as national environmental science agencies. Capacity-building has traditionally involved making more use of automation, emerging technologies, and frameworks that facilitate the use of data mining, but today there is also wide acknowledgment of the need for core expertise in the field of statistics/statistical learning and evidence-based decision-making.
A primary step in the EDS workflow involves modeling and inference, and this is where statisticians have contributed substantive advances. Part 1 of the special issue is a recognition of these advances, as well as a contribution to the field through nine research articles that offer novel methodologies in this area. Part 1 also contains four opinion pieces by expert practitioners in the field that offer perspectives and insights on challenges and future research avenues. This editorial focuses on the core research strands of the contributions to Part 1 of the special issue. Part 2 of the special issue comprises an additional eight research articles and four opinion pieces, which we discuss in a separate editorial; see Burr et al. (2023).
STATISTICAL TEMPORAL, SPATIAL AND SPATIO-TEMPORAL MODELING
Temporal, spatial and spatio-temporal models are central to the vast majority of contributions to the issue. Lowther et al. (2023) consider multiple time series data that contain change points, and showcase their methods on data on the Greenland ice sheet; Kleiber et al. (2023) consider the problem of modeling and simulating tropical cyclone precipitation fields using a spatio-temporal model in polar coordinates; Shirota et al. (2023) tackle the problem of fitting spatial models to light detection and ranging (LiDAR) data collected over Alaska; Abdulah et al. (2023) consider the spatial analysis of sea-surface temperature data; Jurek and Katzfuss (2023) the spatio-temporal analysis of total precipitable water; Daw and Wikle (2023) the spatial analysis of satellite temperature data; Ning et al. (2023) the spatial analysis of presence-absence ecological data; and the discussion by Rougier et al. (2023) focuses on the challenges of fitting spatio-temporal models to environmental data. The large number of contributed papers involving these classes of models is not coincidental, as many of the phenomena that are analyzed in EDS are temporal, spatial or spatio-temporal in nature. Indeed, many of the methodological developments in spatio-temporal statistics were inspired by challenges encountered in the analysis of environmental data (e.g., Cressie & Wikle, 2015).
Many of the data sets encountered in EDS are large. Hence, models that are used to describe these data also tend to be large (in terms of the number of parameters and/or latent variables). The big-models-big-data setting leads to the consideration of statistical computing, and this is also reflected in the special issue. Low-rank representations of process models are often used to address the problem of high dimensionality, and this is the approach taken by Kleiber et al. (2023) in the context of circular processes, and by Ning et al. (2023), who use resolution-adaptive basis functions for modeling both the covariates and the spatial process of interest. Basis-function expansions also feature in the discussion of Rougier et al. (2023). Often, the use of basis functions leads to a problem known as ‘over-smoothing’. To address it, Shirota et al. (2023) propose modeling the residual, fine-scale process as a sparsity-inducing nearest-neighbor Gaussian process; they then use parallel computing to do spatial prediction from more than 17 million observations in less than a minute. Sparsity is also central to the work of Jurek and Katzfuss (2023), where it allows for fast approximate inference with spatio-temporal models. Computational issues relating to inference with spatio-temporal models are also considered at length by Rougier et al. (2023), who discuss under what conditions sequential updating within each time step is possible, while taking into account the types of data commonly encountered in EDS. Finally, Abdulah et al. (2023) show how one can do exact inference with very large data sets; their approach leverages high-performance computing architectures and parallel linear algebra libraries, and solves inferential problems that were thought to be practically unsolvable just a few years ago.
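To make the low-rank idea concrete, the following is a minimal sketch under simple assumptions, and not the method of any of the papers cited above. It represents a one-dimensional spatial process by a small number r of Gaussian basis functions and estimates their coefficients by ridge-stabilized least squares, so that the only linear system solved is r × r regardless of the number of observations n. The function names, the test signal, the basis centers and the scale are all hypothetical choices made for illustration.

```python
import math
import random

def gaussian_basis(s, centers, scale):
    """Evaluate the r Gaussian basis functions at location s."""
    return [math.exp(-0.5 * ((s - c) / scale) ** 2) for c in centers]

def solve(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for i in range(col + 1, n):
            f = m[i][col] / m[col][col]
            for j in range(col, n + 1):
                m[i][j] -= f * m[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_low_rank(locs, obs, centers, scale, ridge=1e-6):
    """Estimate the r basis coefficients from n observations."""
    phi = [gaussian_basis(s, centers, scale) for s in locs]
    r = len(centers)
    # Normal equations (Phi' Phi + ridge * I) alpha = Phi' y: an r x r system.
    gram = [[sum(row[j] * row[k] for row in phi) + (ridge if j == k else 0.0)
             for k in range(r)] for j in range(r)]
    rhs = [sum(row[j] * y for row, y in zip(phi, obs)) for j in range(r)]
    return solve(gram, rhs)

def predict(s, alpha, centers, scale):
    """Predict the process at s as a weighted sum of basis functions."""
    return sum(a * p for a, p in zip(alpha, gaussian_basis(s, centers, scale)))

def truth(s):
    # Hypothetical smooth signal standing in for an environmental process.
    return math.sin(s) + 0.5 * math.cos(0.5 * s)

rng = random.Random(0)
n = 2000
locs = [rng.uniform(0.0, 10.0) for _ in range(n)]
obs = [truth(s) + rng.gauss(0.0, 0.1) for s in locs]

centers = [-1.0 + i for i in range(13)]  # r = 13 basis functions, far fewer than n
scale = 1.0
alpha = fit_low_rank(locs, obs, centers, scale)

rmse = math.sqrt(
    sum((predict(s, alpha, centers, scale) - truth(s)) ** 2 for s in locs) / n
)
print(f"r = {len(alpha)}, n = {n}, RMSE against the true signal = {rmse:.4f}")
```

Because the fitted surface is a combination of only r smooth functions, it cannot reproduce variation at scales finer than the basis functions themselves, which is one way to see where over-smoothing comes from.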
MACHINE LEARNING AND ARTIFICIAL INTELLIGENCE
The special issue reflects the increased adoption of techniques commonly associated with the field of machine learning or artificial intelligence (AI) by statisticians working in EDS. Kleiber et al. (2023) use random forests to model basis-function coefficients of a spatio-temporal process, while Daw and Wikle (2023) develop an approach based on an ensemble of deep learning models, known as extreme learning machines, for spatial prediction and uncertainty quantification of the predictions. A common criticism of deep learning models is that they are not interpretable; the rise of explainable AI and how this is applied in the context of EDS is discussed at length by the working group “AI methods in Environmental Science” of The International Environmetrics Society (TIES) in Wikle et al. (2023). Here, several diagnostics such as Shapley values and relevant techniques such as feature shuffling are showcased on a variety of AI architectures, including convolutional neural networks and extreme gradient boosting, when applied to the problem of predicting soil moisture in the US. As discussed by Rodrigues and Carfagna (2023), statistics and machine learning/AI complement each other in the field of data science at large; this is also true for EDS.
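Feature shuffling, mentioned above as a model-agnostic diagnostic, can be sketched in a few lines. The example below is a simplified illustration under stated assumptions and is not the analysis of Wikle et al. (2023): the data are simulated, the "fitted" model is a hypothetical stand-in for any black-box predictor, and importance is measured as the increase in mean squared error after one feature column is randomly permuted.

```python
import random

def mse(model, rows, targets):
    """Mean squared prediction error of a model over a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, n_features, rng):
    """For each feature, the increase in MSE when that feature's column is shuffled."""
    baseline = mse(model, rows, targets)
    scores = []
    for j in range(n_features):
        column = [x[j] for x in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        shuffled = [x[:j] + [v] + x[j + 1:] for x, v in zip(rows, column)]
        scores.append(mse(model, shuffled, targets) - baseline)
    return scores

rng = random.Random(1)

# Simulated data: the target depends strongly on feature 0, weakly on
# feature 1, and not at all on feature 2.
rows = [[rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
targets = [3.0 * x[0] + 0.5 * x[1] + rng.gauss(0, 0.1) for x in rows]

def model(x):
    # Hypothetical fitted predictor; any function of the features would do here.
    return 3.0 * x[0] + 0.5 * x[1]

scores = permutation_importance(model, rows, targets, 3, rng)
for j, score in enumerate(scores):
    print(f"feature {j}: increase in MSE after shuffling = {score:.3f}")
```

The diagnostic only requires the ability to evaluate the model, which is why it applies equally to neural networks and gradient-boosted trees.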
DECISION-MAKING IN THE PRESENCE OF UNCERTAINTY
One of the main goals of EDS is to provide a medium for evidence-based decision-making in the presence of uncertainty. Often, the decisions made in environmental applications are derived from statistical models that are fitted to observational data. For example, Baerenbold et al. (2023), in a contribution from the working group “Functional analysis for correlated time-series” of TIES, use a Dirichlet process model to predict the sources of observed particulate matter and their respective contributions; a decision on which sources are the largest contributors could then be made in a relatively straightforward manner using the model predictions. Cripps and Durrant-Whyte (2023) discuss four sources of uncertainty that ultimately affect decisions (inherent, parametric, model, and knowledge), and give some examples of how these manifest themselves in environmental applications. Cressie (2023) discusses a number of techniques that can be used to aid decision makers. These include the “value of information” when making decisions on data acquisition, model selection when making decisions on competing models, and prediction from the view of minimizing a loss function. Various loss functions are discussed, as well as the notion of loss-function calibration.
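The view of prediction as loss minimization can be illustrated with a short sketch. It is a generic textbook illustration and is not taken from Cressie (2023): given draws from a predictive distribution (here simulated from a hypothetical right-skewed distribution), the reported prediction is the candidate value with the smallest average loss. The three loss functions and the 4:1 penalty ratio are assumptions chosen for illustration.

```python
import random
import statistics

def optimal_prediction(samples, loss):
    """Return the candidate (chosen among the samples) with the smallest average loss."""
    return min(samples, key=lambda a: sum(loss(a, y) for y in samples))

def squared(a, y):
    return (a - y) ** 2

def absolute(a, y):
    return abs(a - y)

def asymmetric(a, y, under=4.0, over=1.0):
    # Hypothetical penalties: under-prediction costs four times more than over-prediction.
    return under * (y - a) if y > a else over * (a - y)

rng = random.Random(2)
# Stand-in for draws from a skewed predictive distribution.
samples = [rng.lognormvariate(0.0, 0.75) for _ in range(401)]

best_squared = optimal_prediction(samples, squared)
best_absolute = optimal_prediction(samples, absolute)
best_asymmetric = optimal_prediction(samples, asymmetric)

print(f"squared-error loss:  {best_squared:.3f} (sample mean {statistics.mean(samples):.3f})")
print(f"absolute-error loss: {best_absolute:.3f} (sample median {statistics.median(samples):.3f})")
print(f"asymmetric loss:     {best_asymmetric:.3f}")
```

Squared-error loss leads to a prediction near the mean, absolute-error loss to the median, and the asymmetric loss to an upper quantile, so the same predictive distribution can support different decisions depending on the costs the decision maker faces.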
The special issue brings together world-leading statisticians who work in EDS, and provides a glimpse of the contributions the field of statistics is making to this important area of research. The special issue also features contributions from a number of junior scholars as lead authors: it is heartening to see an up-and-coming new generation of talented scholars tackling problems in this field. The vast array of topics in the published works is enlightening, and a reflection of how multi-faceted and interdisciplinary the field of EDS is. From statistical modeling, to computing, to uncertainty quantification: statisticians are leading the way in several of the critical sub-disciplines of EDS, and will likely do so for years to come.
- Abdulah et al. (2023). Large-scale environmental data science with ExaGeoStatR. Environmetrics, 34, e2770.
- Baerenbold et al. (2023). A dependent Bayesian Dirichlet process model for source apportionment of particle number size distribution. Environmetrics, 34, e2763.
- Burr et al. (2023). Environmental data science: Part 2. Environmetrics, 34, e2788.
- Cressie (2023). Decisions, decisions, decisions in an uncertain environment. Environmetrics, 34, e2767.
- Cressie & Wikle (2015). Statistics for Spatio-Temporal Data. Hoboken, NJ: Wiley.
- Cripps & Durrant-Whyte (2023). Uncertainty: Nothing is more certain. Environmetrics, 34, e2745.
- Daw & Wikle (2023). REDS: Random ensemble deep spatial prediction. Environmetrics, 34, e2780.
- Diggle (2015). Statistics: a data science for the 21st century. Journal of the Royal Statistical Society A, 178, 793–813.
- Efron & Hastie (2021). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science (student ed.). Cambridge, UK: Cambridge University Press.
- Hassani et al. (2021). The science of statistics versus data science: What is the future? Technological Forecasting & Social Change, 173(21111), 1–11.
- Jurek & Katzfuss (2023). Scalable spatio-temporal smoothing via hierarchical sparse Cholesky decomposition. Environmetrics, 34, e2757.
- Kleiber et al. (2023). Stochastic tropical cyclone precipitation field generation. Environmetrics, 34, e2766.
- Lowther et al. (2023). Detecting changes in mixed-sampling rate data sequences. Environmetrics, 34, e2762.
- Ning et al. (2023). A double fixed rank kriging approach to spatial regression models with covariate measurement error. Environmetrics, 34, e2771.
- Peng & Parker (2022). Perspective on data science. Annual Review of Statistics and Its Application, 9, 1–20.
- Rodrigues & Carfagna (2023). Data science applied to environmental sciences. Environmetrics, 34, e2783.
- Rougier et al. (2023). The scope of the Kalman filter for spatio-temporal applications in environmental science. Environmetrics, 34, e2773.
- Shirota et al. (2023). Conjugate sparse plus low rank models for efficient Bayesian interpolation of large spatial data. Environmetrics, 34, e2748.
- Wikle et al. (2023). An illustration of model agnostic explainability methods applied to environmental data. Environmetrics, 34, e2772.
The cover image is based on the Research Article Large-scale environmental data science with ExaGeoStatR by Sameh Abdulah et al., https://doi.org/10.1002/env.2770. Image Credit: Xavier Pita, KAUST.