Authors: János Gyarmati-Szabó, Leonid V. Bogachev and Haibo Chen
Air pollution is a collective name for various harmful substances in the atmosphere, many of which are caused by human activity (e.g., road traffic and point sources such as households, plants and construction sites). Nowadays, policy makers and the public at large seem to be acutely aware of air pollution as an everyday “bad” factor adversely affecting health and quality of life. According to the striking formula often repeated by the media and politicians, air pollution is “linked to 40,000 early deaths a year” in the UK alone. Past disasters such as the notorious Great Smog of London in December 1952 (causing some 4,000 deaths within days) have contributed to a greater understanding and acknowledgement of the threats posed by toxic atmospheric pollution, and have led to progressive changes in the law, policies and technologies, including tougher regulations for industry and tightened emission standards for vehicles. As a result of such measures, a dramatic improvement in air quality has been achieved across many countries, including the UK; however, many local hotspots still lag behind the pollution reduction targets, while new sites of concern also emerge due to ever-growing urbanization and industrialization, especially in the developing economies.
In order to monitor, control and successfully tackle air pollution, it is vital to develop and use adequate statistical methods to estimate and predict future pollution episodes, especially with regard to exceedances of the set standards. Clearly, such methods should be linked to extreme values of the pollutant concentrations, because the average, “typical” levels provide little information about rare occurrences of high spikes, which may have a particularly detrimental effect on vulnerable individuals. However, obtaining precise estimates of the probabilities of occurrence and the levels of extreme concentrations is a formidable task, due to the combination of meteorological conditions (such as wind, humidity and temperature) with complex photochemical interconversion reactions; for example, nitrogen monoxide NO reacts with ozone O3 to produce excess levels of nitrogen dioxide NO2, which in turn regenerates NO and O3 in a feedback loop, but only in the presence of sunlight.
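For orientation, the photochemical cycle alluded to here is the standard NO–NO2–O3 interconversion scheme from atmospheric chemistry (a textbook summary, not taken from the paper itself):

```latex
% Standard NO--NO2--O3 photochemical cycle (textbook scheme, for orientation only)
\begin{align*}
  \mathrm{NO} + \mathrm{O_3} &\;\longrightarrow\; \mathrm{NO_2} + \mathrm{O_2}, \\
  \mathrm{NO_2} + h\nu       &\;\longrightarrow\; \mathrm{NO} + \mathrm{O} \quad \text{(sunlight)}, \\
  \mathrm{O} + \mathrm{O_2}  &\;\longrightarrow\; \mathrm{O_3}.
\end{align*}
```

Only the second reaction requires sunlight, which is why the feedback regenerating NO and O3 operates in daytime.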
One of the most efficient modern statistical tools in extreme value modelling is the peaks-over-threshold (POT) method, which looks at the exceedances above a certain (high) level. This approach is particularly well suited to the environmental context, where air quality standards and objectives are normally expressed in terms of certain critical thresholds. As compared to other methods, its conceptual and methodological advantage is that it makes use of all significantly high values in the data rather than just one highest maximum per data block; thus, POT provides improved accuracy of estimation and inference. The key mathematical result underpinning the POT approach (proved by James Pickands III in 1975) states that the distribution of the heights of exceedances can be approximated by a generalized Pareto distribution (GPD), characterized by the shape parameter (which determines the type of the GPD, including the range of possible values) and the scale parameter, which depends on the chosen threshold. In applications of this method, the threshold is chosen, for example, as the 90% empirical quantile of the observed data set (i.e., with about 10% of the values being exceedances); the GPD parameters can then be estimated from the data and used for predicting future exceedances.
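In symbols, Pickands’ result says that, for a sufficiently high threshold u, the height of an exceedance is approximately GPD-distributed; in standard (generic) notation,

```latex
% Generalized Pareto approximation for the heights of exceedances (standard POT form)
P(X - u > y \mid X > u) \;\approx\; \Bigl(1 + \frac{\xi\, y}{\sigma_u}\Bigr)_{+}^{-1/\xi},
\qquad y > 0,
```

where \xi is the shape parameter and \sigma_u > 0 is the scale parameter at threshold u; for \xi = 0 the right-hand side is understood as the limit \exp(-y/\sigma_u), and for \xi < 0 the exceedance heights are bounded above by -\sigma_u/\xi.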
However, the classic POT theory was developed for stationary data, whereby the parameters of the observed process are assumed to be time-independent. Unfortunately, this assumption is hardly justifiable in the environmental context, where clear signs of nonstationarity can easily be detected in the data. This is of course due to varying patterns in traffic flows over time, and is also explained by the highly nonstationary behaviour of meteorological covariates, such as sunlight and wind speed and direction. The simple but crucial idea proposed by Anthony Davison and Richard Smith in their seminal paper in the Journal of the Royal Statistical Society, Series B (1990) is to model the time dependence of the GPD parameters solely through their dependence on the covariates. Hence, the right choice of covariates potentially affecting the pollutant levels becomes paramount. The Davison–Smith approach has been tried out in the air pollution context by many authors (e.g., by Emma Eastoe and Jonathan Tawn in their 2009 work on the extremes of surface-level ozone concentrations).
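Schematically, the Davison–Smith idea amounts to letting the GPD parameters vary only through a covariate vector x(t) via suitable link functions; a typical specification (shown here for illustration, the exact parameterization used in the paper may differ) is

```latex
% Covariate-dependent GPD parameters via link functions (illustrative specification)
\sigma(t) \;=\; \exp\!\bigl(\boldsymbol{x}(t)^{\top}\boldsymbol{\beta}\bigr),
\qquad
\xi(t) \;=\; \boldsymbol{x}(t)^{\top}\boldsymbol{\gamma},
```

so that all time variation in the extremes is channelled through the covariates, while the regression coefficients \beta and \gamma are themselves time-independent.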
In a paper in Environmetrics, the authors have analysed the exceedance data collected from January 1, 2008 until January 1, 2009 at a roadside laboratory on Kirkstall Road in the city of Leeds (West Yorkshire, UK), which houses a traffic monitoring system and an air quality station, also recording the concurrent meteorological conditions. The study focused on three pollutants: nitrogen monoxide (NO), nitrogen dioxide (NO2) and ozone (O3), and the data were analysed by building two POT models:
• Model I develops the Davison–Smith approach via inclusion of an extensive set of meteorological and traffic covariates.
• The new Model II is based on a novel functional dependence of the GPD parameters designed to ensure consistency over different threshold choices (see the threshold-stability relation sketched below), thus resolving the open problem of threshold stability of the GPD in the nonstationary case.
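For reference, the threshold-stability property in question is most easily stated in the stationary case (the paper’s contribution is to enforce an analogous consistency when the parameters depend on covariates): if the exceedances of a threshold u follow a GPD with shape \xi and scale \sigma_u, then for any higher threshold v > u,

```latex
% Threshold stability of the GPD (stationary case)
X - u \mid X > u \;\sim\; \mathrm{GPD}(\sigma_u,\, \xi)
\;\;\Longrightarrow\;\;
X - v \mid X > v \;\sim\; \mathrm{GPD}\bigl(\sigma_u + \xi\,(v - u),\; \xi\bigr),
```

that is, the shape parameter is unchanged and the scale changes linearly in the threshold.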
Both models were fitted using the regression framework, that is, by introducing weighting coefficients for each of the covariates. The resulting dimension of the covariate vector is quite high, ranging from about 175 for Model I to 230 for Model II. Due to the computational complexity, estimation was done using an efficient Markov chain Monte Carlo (MCMC) algorithm to produce the posterior distribution of all the parameters, incorporating both prior information and the observed data. Note that this methodology follows the Bayesian statistical framework, whereby model parameters are treated as random variables. By sampling from their posterior distribution, one can easily draw predictive inference about extreme values beyond the observed ranges, which may be very helpful in designing, validating and evaluating future air pollution scenarios, for example those resulting from changing patterns in the traffic flows and/or meteorological conditions. Thus, the models can provide air quality managers and decision makers with an effective tool to manage air pollution problems.
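As a rough illustration of this fitting strategy (not the authors’ actual code, covariates or priors), the sketch below fits a covariate-dependent GPD to simulated exceedance heights with a random-walk Metropolis sampler; the single synthetic covariate, the normal priors and the tuning constants are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Simulated stand-in data (the real study uses traffic/meteorological covariates) ---
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one synthetic covariate
beta_true = np.array([0.0, 0.5])                         # log-scale regression coefficients
xi_true = 0.1                                            # shape parameter (kept constant here)
sigma = np.exp(X @ beta_true)
u = rng.uniform(size=n)
y = sigma / xi_true * ((1.0 - u) ** (-xi_true) - 1.0)    # GPD exceedance heights (inverse CDF)

def log_posterior(theta):
    """Log-posterior of (beta0, beta1, xi) under vague N(0, 10^2) priors; GPD likelihood."""
    beta, xi = theta[:-1], theta[-1]
    s = np.exp(X @ beta)
    z = 1.0 + xi * y / s
    if np.any(z <= 0):                  # proposal outside the GPD support
        return -np.inf
    loglik = -np.sum(np.log(s)) - (1.0 / xi + 1.0) * np.sum(np.log(z))
    logprior = -0.5 * np.sum(theta ** 2) / 100.0
    return loglik + logprior

# --- Random-walk Metropolis ---
theta = np.array([0.0, 0.0, 0.2])       # initial (beta0, beta1, xi)
lp = log_posterior(theta)
samples = []
for it in range(20000):
    prop = theta + rng.normal(scale=[0.05, 0.05, 0.03])
    lp_prop = log_posterior(prop)
    if np.log(rng.uniform()) < lp_prop - lp:   # accept/reject step
        theta, lp = prop, lp_prop
    if it >= 5000:                       # discard burn-in
        samples.append(theta.copy())

samples = np.array(samples)
print("posterior means (beta0, beta1, xi):", samples.mean(axis=0))
```

The real models involve hundreds of regression coefficients and, in Model II, the additional threshold-consistency structure, so the authors’ sampler is correspondingly more elaborate; the accept/reject mechanics, however, are the same.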
It has been demonstrated that both models fit the data well and yield encouraging results, indicating a promising potential for accurate and reliable estimation of extreme concentrations. An unexpected feature of Model II (which remains to be understood better) is that it produces noticeably more conservative estimates of the future extreme values, while at the same time the corresponding credible intervals are narrower than those of Model I. This means that Model II provides less exaggerated and yet sharper predictions of future extreme events, which may be quite valuable in view of the very high financial and other costs of environmental actions. This is also confirmed by a comparison of Models I and II using the ratio of their marginal likelihoods (often referred to as the Bayes factor), which indicates that Model I is strongly outperformed by Model II.
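In Bayesian terms, this comparison is based on the ratio of the marginal likelihoods of the data under the two models,

```latex
% Bayes factor for comparing Model II against Model I
\mathrm{BF}_{21} \;=\; \frac{p(\text{data} \mid \text{Model II})}{p(\text{data} \mid \text{Model I})},
```

with large values of BF_{21} interpreted as strong evidence in favour of Model II.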
In practical terms, the most appealing feature of the threshold-stable Model II is that its parameters are estimated only once, for a certain calibration value of the threshold, and the fitted model can then be readily used for any other threshold should it change (e.g., due to different environmental standards, driving patterns, or climate). Thus, the model provides a highly efficient and flexible online tool for rapid predictions of upcoming extreme events, for instance, to give prompt warnings to vulnerable patients with certain conditions (e.g., as part of digital health systems). At the same time, the model’s accuracy can be continually improved by updating the estimated parameters as new data become available.
The issue of choosing a “correct” threshold in nonstationary POT modelling has attracted a lot of attention due to its paramount importance for extreme value inference. Following the classic approach by Davison and Smith, the threshold is fixed (e.g., determined by a certain empirical quantile, say 90%), and nonstationarity of the data is modelled through dependence on time-varying covariates. The alternative popular approach is to choose a time-dependent, data-driven threshold but to keep the parameters of the GPD constant. The latter idea is appealing because the exceedance rate of a fixed threshold may drift as a result of nonstationarity; thus, it is reasonable to monitor the estimation performance and adjust the threshold if and when necessary. Flexibility with the threshold is also attractive in view of possible future changes (e.g., in the environmental standards, driving patterns, or climate). From this point of view, the threshold-stable Model II proposed by the authors has a strong conceptual advantage by bridging the gap between the two alternative approaches to threshold selection.
The well-known difficulty of statistical modelling in the air pollution context is due to strong correlations between different chemicals, which can be attributed to the complex photochemical reactions in the atmosphere (e.g., when the concentrations of NO and NO2 both go up, the concentration of O3 goes down, and vice versa). Although the paper considers only the univariate dynamics of each chemical, the need to account for multiple pollutants is partly addressed through the use of lagged past values of the pollutant under study, so that the impact of other (correlated) pollutants is implicitly taken into account. Furthermore, the models developed in the paper may serve as a stepping stone for a multivariate version of POT modelling, which will be developed in the team’s future work. It would also be important to add a spatial (e.g., regional) dimension to the models, in particular due to the apparent significance of proximity to point sources such as factories or road junctions. Data available from the UK’s Automatic Urban and Rural Network (AURN), combined with urban big data (e.g., via the Urban Big Data Centre, UBDC), would be instrumental both for feeding and for validating the extreme value models developed in this study and, as a consequence, should prove valuable for evidence-based air quality decision-making.
Today, researchers, environmental practitioners and stakeholders bear a shared responsibility for the clean air of tomorrow. The increased uncertainties about the prospects of maintaining and improving the international standards of air quality, fuelled by the imminent Brexit and the recent US withdrawal from the Paris Agreement on greenhouse gas emissions, mean that now is the time to strengthen joint efforts to ensure that efficient, cost-effective and concerted measures stay in place. Environmental science, and statistical modelling of extremes as part thereof, should push for its impact to make a real difference by finding innovative solutions and raising public awareness, while policy makers urgently need to use the best science to inform and guide their actions.
This paper can be read in full via the link below:
János Gyarmati-Szabó, Leonid V. Bogachev and Haibo Chen
Environmetrics, Volume 28, Issue 5, August 2017, e2449