Environmental data science is a multi-disciplinary and mature field of research at the interface of statistics, machine learning, information technology, climate and environmental science. The two-part special issue ‘Environmental Data Science’ comprises a set of research articles and opinion pieces led by statisticians who are at the forefront of the field. This editorial identifies and discusses common research themes that appear in the contributions to Part 2, which focuses on applications. These include spatio-temporal modeling; the problem of aggregation and sparse sampling; the importance of community-building and training for the next generation of specialists in environmental data science; and the need to look forward at the challenges that lie ahead for the discipline. This editorial complements that of Part 1, which largely focuses on statistical methodology; see Zammit-Mangion, Newlands, and Burr (2023).
Methodological and technological advances in the last two decades have seen analyses and forecasts in the environmental sciences harness the increased availability of data, and have led to the multi-faceted, interdisciplinary field which we refer to as environmental data science (EDS). EDS considers every aspect of a workflow and value-chain involving environmental data, from the moment data are collected and stored through to the stage at which the data are used to support decision-making.
An EDS workflow often involves developing and applying statistical models and frameworks to answer scientific questions using data, and this is an area where statisticians have contributed substantive advances. Part 2 of the special issue is a recognition of these developments, as well as a further contribution to the field through eight research articles that showcase applications of EDS. Part 2 also contains four opinion pieces by expert practitioners in the field that offer perspectives and insights on various aspects of EDS, including the critical question of training and community-building. This editorial focuses on the core themes discussed in Part 2 of the special issue. Part 1 of the special issue comprises an additional nine articles and four opinion pieces, which we discuss in a separate editorial; see Zammit-Mangion et al. (2023).
Application and development of spatio-temporal models
As noted in our editorial for Part 1 of the special issue (Zammit-Mangion et al. 2023), environmental data analyses are often concerned with processes evolving in space and/or time, and therefore make extensive use of spatial or spatio-temporal models. Most of the contributions to Part 2 of the special issue develop and apply such models: Yan, Cantoni, Field, Treble, and Mills Flemming (2023) consider a spatio-temporal application in fisheries science that involves estimating the maturity of fish stock; Nie, Wang, and Cao (2023) apply functional data analysis to the problem of sub-region estimation for daily bike-share rentals; Laroche, Olteanu, and Rossi (2023) examine irregularly sampled left-censored pesticide concentration data from France, developing new methodology for modeling spatio-temporal heterogeneity; while Mukherjee, Bagozzi, and Chatterjee (2023) use spatio-temporal fields to model climate and social instability interactions, as a framework for studying conflict. Several contributions also consider the problem of spatial/spatio-temporal interpolation or emulation: Granville, Woolford, Dean, Boychuk, and McFayden (2023) tackle the problem of interpolating spatial data for generating a fire index for wildfires in Ontario, Canada, while Cartwright, Zammit-Mangion, and Deutscher (2023) develop a spatio-temporal emulator based on convolutional variational autoencoders. Several contributed opinion pieces also expand on the challenges in this area: Scott (2023) discusses the ‘digital earth’ concept and the challenges of spatially or temporally sparse data; Blair and Henrys (2023) consider the idea of ‘digital twins’ for making sense of complex, heterogeneous spatio-temporal data; and Sain (2023) discusses data science and risk quantification in a complex environment.
Sampling and aggregation
Spatio-temporal analyses often involve dealing with data which are recorded on differing time scales, spatial scales, or both. Jahid et al. (2023) examine this problem in the context of animal tagging and abundance estimation for grizzly bears in Alberta, Canada; Yan et al. (2023) consider aggregation for fisheries stocks in Atlantic Canada; and Laroche et al. (2023) deal with aggregation when examining censored pesticide data. The more methodological side of this problem is considered by Roth et al. (2023) for calibration methods of flood hazard projections, when integrating model outputs with differing resolutions. The opinion pieces of Scott (2023) and Blair and Henrys (2023) also consider this issue in their discussions of the ‘digital earth’ and ‘digital twin’ concepts.
Community-building and training
A common complaint among industry professionals, and one that has persisted for several years, is the lack of trained and qualified staff capable of handling the deluge of data generated by modern instrumentation and observational hardware. The response to this issue is varied, and extends from academic programs, which initially train graduates to work in the field, to communities of practitioners capable of encouraging ‘life-long learning’ and renewed skill development among professionals working in the field. de Silva (2023) examines the intersection between these professionals and the R community in Latin America, emphasizing the need for diverse and local community building. With regard to training, a number of graduate programs have recently been developed and launched, from the Masters in Data Science program at the University of British Columbia to the Masters in Environmental Data Science program at the University of California, Santa Barbara. In addition, the geospatial community, often directly entwined with the ‘spatial’ aspect of EDS, has risen to the challenge; for example, new programs on geospatial data science and on data science for energy and environmental research have been launched at the University of Michigan and the University of Chicago, respectively. As discussed by Scott (2023), there is an opportunity for academics in the field of statistics to ensure that these programs are grounding their graduates in strong, foundational thinking. Graduates who are computationally ready to tackle large data in the environmental realm also need to be statistically prepared to consider the problems of sampling, design, and bias; these are topics that are core to the field of statistics. Governments are also helping to build and strengthen data science communities, both internally within the public service and externally with industry, academia, and citizens. For example, Statistics Canada has launched a Data Science Network for government, industry, academia, and citizens to join, learn, and share knowledge and insights.
More application areas in EDS will continue to emerge as complex models and computing become increasingly accessible, and as it becomes increasingly clear that EDS plays a pivotal role in tackling and mitigating the effects of climate change. Scott (2023) looks forward and backwards, noting that while the evolution in the data landscape is exciting, challenges of data assurance continue to plague our field, and that expertise in design and sampling is needed now more than ever. Reproducibility and responsible workflow are of growing importance (Parashar, Heroux, and Stodden 2022) due to the rapid and constant evolution of technology and tools, and due to the increased awareness of the importance of best research practice. There is also a need for vetted and inclusive curricular materials for multidisciplinary communication and comprehension (Danyluk et al. 2021; Horton et al. 2022). Sain (2023) reminds us that the toolbox for modeling and analysis is growing rapidly, and that there are opportunities and challenges in thoughtfully incorporating methodologies into EDS, including those concerned with the quantification of risk. The focus of Blair and Henrys (2023) on the interrelationships between process and data models, and on the complexity of the ‘arrows’ that join them, is a timely reminder for practitioners in EDS to consider the connections between the models, the data, and the world they are all based on.
The special issue brings together global leaders in the theory and application of EDS, and provides a glimpse of the contributions the field of statistics is making to this important area of research. The special issue also features contributions from a number of junior scholars as lead authors: it is heartening to see an up-and-coming new generation of talented scholars tackling problems in this field. The vast array of topics in the published works is enlightening, and a reflection of how multi-faceted and interdisciplinary the field of EDS is. Part 2 of the special issue clearly shows that there are dedicated scientists working in the fields of environmental chemistry, fisheries science, wildfire science, and climate and environmental science, who can benefit from statisticians and their toolsets, their way of thinking, and the models that they have spent decades developing. Collaboration is a wonderful tool for building up science in a thoughtful, supported way, and it is encouraging to see so much happening in the pages of this special issue.