Machine Learning Approaches for Outdoor Air Quality Modelling: A Systematic Review
Applied Sciences, Review
Appl. Sci. 2018, 8, 2570; doi:10.3390/app8122570; www.mdpi.com/journal/applsci
Yves Rybarczyk 1,2 and Rasa Zalakeviciute 1,*
1 Intelligent & Interactive Systems Lab (SI2 Lab), Universidad de Las Américas, 170125 Quito, Ecuador; [email protected]
2 Department of Electrical Engineering, CTS/UNINOVA, Nova University of Lisbon, 2829-516 Monte de Caparica, Portugal
* Correspondence: [email protected]; Tel.: +351-593-23-981-000
Received: 15 November 2018; Accepted: 8 December 2018; Published: 11 December 2018
Abstract: Current studies show that traditional deterministic models tend to struggle to capture the non-linear relationship between the concentration of air pollutants and their sources of emission and dispersion. To tackle this limitation, the most promising approach is to use statistical models based on machine learning techniques. Nevertheless, it is often unclear why a certain algorithm is chosen over another for a given task. This systematic review aims to answer this question by providing the reader with a comprehensive description of the principles underlying these algorithms and how they are applied to enhance prediction accuracy. A rigorous search that conforms to the PRISMA guideline is performed and results in the selection of the 46 most relevant journal papers in the area. Through a factorial analysis method, these studies are synthesized and linked to each other. The main findings of this literature review show that: (i) machine learning is mainly applied on the Eurasian and North American continents and (ii) estimation problems tend to implement Ensemble Learning and Regressions, whereas forecasting makes use of Neural Networks and Support Vector Machines. The next challenges of this approach are to improve the prediction of pollution peaks and of contaminants recently put in the spotlight (e.g., nanoparticles).
Keywords: atmospheric pollution; predictive models; data mining; multiple correspondence analysis
1. Introduction
Worsening air quality is one of the major global causes of premature mortality and is the main environmental risk, responsible for seven million deaths every year [1]. Nearly all urban areas fail to comply with the air quality guidelines of the World Health Organization (WHO) [2,3]. The populations most at risk from the negative effects of air pollution are children, the elderly, and people with respiratory and cardiovascular problems. These health complications can be avoided or reduced by raising awareness of air quality conditions in urban areas, using models to forecast or estimate air quality in regions lacking monitoring data, so that citizens can limit their daily activities during episodes of elevated pollution.
Air pollution modelling is based on a comprehensive understanding of the interactions between emissions, deposition, atmospheric concentrations and characteristics, and meteorology, among others, and is an indispensable tool in regulatory, research, and forensic applications [4]. These models calculate and predict physical processes and transport within the atmosphere [5]. Therefore, they are widely used in estimating and forecasting the levels of atmospheric pollution and assessing its impact on human and environmental health and the economy [6–9]. In addition, air pollution modelling is used in science to help understand the relevant processes between emissions and concentrations, and to understand the interaction of air pollutants with each other and with weather [10] and terrain [11,12] conditions. Modelling is important not only for detecting the causes of air pollution but also for assessing the consequences of past and future mitigation scenarios and determining their effectiveness [4].
There are a few main approaches to air pollution modelling: atmospheric chemistry, dispersion (chemically inert species), and machine learning. Gaussian models of varying complexity (e.g., AERMOD, PLUME) are widely used by authorities, industries, and environmental protection organizations for impact studies and health risk investigations of emissions dispersion from single or multiple point sources (and, in some applications, line and area sources) [13,14]. These models are based on assumptions of continuous emission, steady-state conditions and conservation of mass. Lagrangian models follow the trajectory of an air parcel, the position and properties of which are calculated according to the mean wind data over time (e.g., NAME) [5,14]. Eulerian models, on the other hand, use a gridded system that monitors atmospheric properties (e.g., concentration of chemical tracers, temperature and pressure) at specific points of interest over time (e.g., Unified Model). Chemical Transport Models (CTMs) (e.g., air-quality, air-pollution, emission-based, source-based, source, etc.) are prognostic models that process emission, transport, mixing, and chemical transformation of trace gases and aerosols simultaneously with meteorology [15]. Complex and computationally costly, CTMs can be of global (e.g., online: Fim-Chem, AM3, MPAS, CAM-Chem, GEM-MACH, etc.; and offline: GEOS-Chem, MOZART, TM5) or regional (e.g., online: MM5-Chem, WRF-Chem, BRAMS; and offline: CMAQ, CAMx, CHIMERE) scale [16,17]. These models combine atmospheric science and multi-processor computing techniques, and rely heavily on considerable resources such as real-time meteorological data and an up-to-date, detailed inventory of emission sources [18]. Unfortunately, emission inventory inputs for boundary layers and initial conditions may be lacking in some regions, while geophysical characteristics (terrain and land use) might further complicate the implementation of these models [19].
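As a point of reference for the Gaussian models mentioned above, the steady-state assumptions are often illustrated with the textbook Gaussian plume equation for a continuous point source. This form is not taken from the reviewed paper and is not necessarily the exact formulation implemented in AERMOD or PLUME; the symbols are the conventional ones and are assumptions here: C is the concentration at downwind distance x, crosswind distance y and height z; Q is the constant emission rate; u is the mean wind speed; \sigma_y and \sigma_z are the horizontal and vertical dispersion parameters (which grow with x); and H is the effective release height.

```latex
C(x, y, z) = \frac{Q}{2\pi\, u\, \sigma_y \sigma_z}
  \exp\!\left(-\frac{y^{2}}{2\sigma_y^{2}}\right)
  \left[
    \exp\!\left(-\frac{(z - H)^{2}}{2\sigma_z^{2}}\right)
    + \exp\!\left(-\frac{(z + H)^{2}}{2\sigma_z^{2}}\right)
  \right]
```

The constant Q and u reflect the continuous-emission and steady-state assumptions, and the second exponential in the brackets represents reflection of the plume at the ground, which keeps the emitted mass within the atmosphere.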
To deal with the complex structure of air flows and turbulence in urban areas, Computational Fluid Dynamics (CFD) methods are used [20]. However, recent studies show that traditional deterministic models struggle to capture the non-linear relationship between the concentration of contaminants and their sources of emission and dispersion [21–24], especially when applied in regions of complex terrain [25]. To tackle the limitations of traditional models, the most promising approach is to use statistical models based on machine learning (ML) algorithms. Statistical techniques do not consider physical and chemical processes; instead, they use historical data to predict air quality. Models are trained on existing measurements and are used to estimate or forecast concentrations of air pollutants according to predictive features (e.g., meteorology, land use, time, planetary boundary layer, elevation, human activity, pollutant covariates, etc.). The simplest statistical approaches include Regression [26], Time Series [27] and Autoregressive Integrated Moving Average (ARIMA) [28] models. These analyses describe the relationship between variables based on probability and statistical averages. Well-specified regressions can provide reasonable results. However, the reactions between air pollutants and influential factors are highly non-linear, leading to a very complex system of air pollutant formation mechanisms. Therefore, more advanced statistical learning (or machine learning) algorithms are usually necessary to model the non-linearity of air contamination properly. For instance, Support Vector Machines [29], Artificial Neural Networks [30], and Ensemble Learning [31] have been applied to overcome non-linear limitations and uncertainties and to achieve better prediction accuracy.
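The contrast drawn above between simple regressions and non-linear learners can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic (a hypothetical dilution law in which concentration falls as 80 / wind speed, plus noise), and k-nearest-neighbours regression is used only as a simple stand-in for a non-linear learner; it is not one of the methods named in the text (Support Vector Machines, Artificial Neural Networks, Ensemble Learning).

```python
import math
import random


def make_data(n, rng):
    """Synthetic (wind speed, concentration) pairs; NOT real measurements."""
    data = []
    for _ in range(n):
        u = rng.uniform(0.5, 10.0)          # wind speed in m/s (assumed range)
        c = 80.0 / u + rng.gauss(0.0, 2.0)  # assumed dilution law plus noise
        data.append((u, c))
    return data


def fit_linear(train):
    """Ordinary least squares for c = a + b * u (closed-form solution)."""
    n = len(train)
    mean_u = sum(u for u, _ in train) / n
    mean_c = sum(c for _, c in train) / n
    cov = sum((u - mean_u) * (c - mean_c) for u, c in train)
    var = sum((u - mean_u) ** 2 for u, _ in train)
    b = cov / var
    a = mean_c - b * mean_u
    return lambda u: a + b * u


def fit_knn(train, k=5):
    """k-nearest-neighbours regression: average the k closest training targets."""
    def predict(u):
        nearest = sorted(train, key=lambda p: abs(p[0] - u))[:k]
        return sum(c for _, c in nearest) / k
    return predict


def rmse(model, test):
    """Root-mean-square error of a fitted model on held-out pairs."""
    return math.sqrt(sum((model(u) - c) ** 2 for u, c in test) / len(test))


rng = random.Random(0)  # fixed seed so the run is reproducible
train, test = make_data(400, rng), make_data(100, rng)
rmse_linear = rmse(fit_linear(train), test)
rmse_knn = rmse(fit_knn(train), test)
print(f"linear RMSE: {rmse_linear:.2f}  kNN RMSE: {rmse_knn:.2f}")
```

On data generated this way, the straight-line fit cannot follow the curved dependence on wind speed, so its held-out error should be clearly larger than that of the non-linear learner. With real monitoring data the gap depends on the pollutant, the features, and the amount of data available.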
Although statistical models do not explicitly simulate the environmental processes, they generally exhibit a higher predictive performance than CTMs on fine spatiotemporal scales in the presence of extensive monitoring data [32–34]. Different machine learning approaches have been used in recent years to predict a set of air pollutants using different combinations of predictor parameters. However, with a growing number of studies, it is often unclear why a certain algorithm is chosen over another for a given task. Therefore, in this study we aim to review recent machine learning studies used in atmospheric pollution research. To do so, the remainder of the paper is organized into three sections. First, we explain the method used to select and scan relevant journal articles on the topic of machine learning and air quality, which conforms to the PRISMA guideline. This section also describes the strategy used to analyze the manuscripts and synthesize the main findings. Second, the results are presented and discussed from a general to a detailed and synthetic account. Finally, the last section draws conclusions on the use of machine learning algorithms for predicting air quality and the future challenges of this promising approach.
2. Method
2.1. Search Strategy
Relevant papers were searched for in SCOPUS. The enquiry was limited to this scientific search engine, because it compiles, in a single database, manuscripts which are also indexed in the