Applied Sciences

Review

Machine Learning Approaches for Outdoor Air Quality Modelling: A Systematic Review

Yves Rybarczyk 1,2 and Rasa Zalakeviciute 1,*

1 Intelligent & Interactive Systems Lab (SI2 Lab), Universidad de Las Américas, 170125 Quito, Ecuador; [email protected]
2 Department of Electrical Engineering, CTS/UNINOVA, Nova University of Lisbon, 2829-516 Monte de Caparica, Portugal
* Correspondence: [email protected]; Tel.: +351-593-23-981-000
Received: 15 November 2018; Accepted: 8 December 2018; Published: 11 December 2018
Abstract: Current studies show that traditional deterministic models tend to struggle to capture the non-linear relationship between the concentration of air pollutants and their sources of emission and dispersion. To tackle this limitation, the most promising approach is to use statistical models based on machine learning techniques. Nevertheless, it is puzzling why a certain algorithm is chosen over another for a given task. This systematic review intends to clarify this question by providing the reader with a comprehensive description of the principles underlying these algorithms and of how they are applied to enhance prediction accuracy. A rigorous search conforming to the PRISMA guideline was performed and resulted in the selection of the 46 most relevant journal papers in the area. Through a factorial analysis method, these studies are synthesized and linked to each other. The main findings of this literature review show that: (i) machine learning is mainly applied on the Eurasian and North American continents, and (ii) estimation problems tend to implement Ensemble Learning and Regressions, whereas forecasting problems make use of Neural Networks and Support Vector Machines. The next challenges of this approach are to improve the prediction of pollution peaks and of contaminants recently put in the spotlight (e.g., nanoparticles).
Keywords: atmospheric pollution; predictive models; data mining; multiple correspondence analysis
Appl. Sci. 2018, 8, 2570; doi:10.3390/app8122570 www.mdpi.com/journal/applsci

1. Introduction

Worsening air quality is one of the major global causes of premature mortality and is the main environmental risk, claiming seven million deaths every year [1]. Nearly all urban areas fail to comply with the air quality guidelines of the World Health Organization (WHO) [2,3]. The populations that suffer most from the negative effects of air pollution are children, the elderly, and people with respiratory and cardiovascular problems. These health complications can be avoided or diminished by raising awareness of air quality conditions in urban areas, which could allow citizens to limit their daily activities during elevated pollution episodes, and by using models to forecast or estimate air quality in regions lacking monitoring data. Air pollution modelling is based on a comprehensive understanding of the interactions between emissions, deposition, atmospheric concentrations and characteristics, and meteorology, among others, and is an indispensable tool in regulatory, research, and forensic applications [4]. These models calculate and predict physical processes and transport within the atmosphere [5]. Therefore, they are widely used in estimating and forecasting the levels of atmospheric pollution and in assessing its impact on human and environmental health and on the economy [6–9]. In addition, air pollution modelling is used in science to help understand the relevant processes between emissions and concentrations, and the interaction of air pollutants with each other and with weather [10] and terrain [11,12] conditions. Modelling is important not only in helping to detect the causes of air pollution, but also in assessing the consequences of past and future mitigation scenarios and determining their effectiveness [4].

There are a few main approaches to air pollution modelling: atmospheric chemistry, dispersion (chemically inert species), and machine learning. Gaussian models of different complexity (e.g., AERMOD, PLUME) are widely used by authorities, industries, and environmental protection organizations for impact studies and health risk investigations of emission dispersion from single or multiple point sources (also line and area sources, in some applications) [13,14]. These models are based on assumptions of continuous emission, steady-state conditions, and conservation of mass. Lagrangian models study the trajectory of an air parcel, the position and properties of which are calculated according to the mean wind data over time (e.g., NAME) [5,14]. On the other hand, Eulerian models use a gridded system that monitors atmospheric properties (e.g., concentration of chemical tracers, temperature, and pressure) at specific points of interest over time (e.g., Unified Model). Chemical Transport Models (CTMs) (e.g., air-quality, air-pollution, emission-based, source-based, source, etc.) are prognostic models that process emission, transport, mixing, and chemical transformation of trace gases and aerosols simultaneously with meteorology [15]. Complex and computationally costly CTMs can be of a global (e.g., online: Fim-Chem, AM3, MPAS, CAM-Chem, GEM-MACH, etc.; and offline: GEOS-Chem, MOZART, TM5) or regional (e.g., online: MM5-Chem, WRF-Chem, BRAMS; and offline: CMAQ, CAMx, CHIMERE) scale [16,17].
These models combine atmospheric science and multi-processor computing techniques, relying heavily on considerable resources such as real-time meteorological data and an updated, detailed inventory of emission sources [18]. Unfortunately, emission inventory inputs for boundary layers and initial conditions may be lacking in some regions, while geophysical characteristics (terrain and land use) might further complicate the implementation of these models [19]. To deal with the complex structure of air flows and turbulence in urban areas, Computational Fluid Dynamics (CFD) methods are used [20]. However, recent studies show that the traditional deterministic models struggle to capture the non-linear relationship between the concentration of contaminants and their sources of emission and dispersion [21–24], especially when a model is applied in regions of complex terrain [25].

To tackle the limitations of traditional models, the most promising approach is to use statistical models based on machine learning (ML) algorithms. Statistical techniques do not consider physical and chemical processes; instead, they use historical data to predict air quality. Models are trained on existing measurements and are used to estimate or forecast concentrations of air pollutants according to predictive features (e.g., meteorology, land use, time, planetary boundary layer, elevation, human activity, pollutant covariates, etc.). The simplest statistical approaches include Regression [26], Time Series [27] and Autoregressive Integrated Moving Average (ARIMA) [28] models. These analyses describe the relationship between variables based on probability and statistical averages. Well-specified regressions can provide reasonable results. However, the reactions between air pollutants and influential factors are highly non-linear, leading to a very complex system of air pollutant formation mechanisms.
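This limitation can be illustrated with a minimal sketch. The data, the exponential pollutant-wind relationship, and the k-nearest-neighbour learner below are all invented for illustration (they come from no reviewed study); the point is only that an ordinary least-squares line underfits a non-linear dependence that even a simple non-parametric learner captures:

```python
import math
import random

random.seed(42)

# Hypothetical non-linear relationship: PM2.5 drops sharply with wind
# speed and then levels off (illustrative only, not a physical model).
def true_pm25(wind):
    return 60.0 * math.exp(-0.8 * wind) + 8.0

data = [(w / 10.0, true_pm25(w / 10.0) + random.gauss(0, 1.0))
        for w in range(0, 100)]
n = len(data)

# Ordinary least-squares fit: pm25 ~ a * wind + b
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n
a = (sum((x - mean_x) * (y - mean_y) for x, y in data)
     / sum((x - mean_x) ** 2 for x, _ in data))
b = mean_y - a * mean_x

# Simple k-nearest-neighbour regressor, a basic non-linear learner.
def knn_predict(x0, k=5):
    nearest = sorted(data, key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

def rmse(predict):
    return math.sqrt(sum((predict(x) - y) ** 2 for x, y in data) / n)

rmse_linear = rmse(lambda x: a * x + b)
rmse_knn = rmse(knn_predict)
print(rmse_linear > rmse_knn)  # the linear model fits markedly worse
```

The gap between the two errors grows with the curvature of the underlying relationship, which is the situation the statistical-learning algorithms reviewed below are designed for.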
Therefore, more advanced statistical learning (or machine learning) algorithms are usually necessary to properly model the non-linearity of air contamination. For instance, Support Vector Machines [29], Artificial Neural Networks [30], and Ensemble Learning [31] have been applied to overcome non-linear limitations and uncertainties and to achieve better prediction accuracy. Although statistical models do not explicitly simulate the environmental processes, they generally exhibit a higher predictive performance than CTMs on fine spatiotemporal scales in the presence of extensive monitoring data [32–34].

Different machine learning approaches have been used in recent years to predict a set of air pollutants using different combinations of predictor parameters. However, with a growing number of studies, it is puzzling why a certain algorithm is chosen over another for a given task. Therefore, in this study we aim to review recent machine learning studies used in atmospheric pollution research. To do so, the remainder of the paper is organized into three sections. First, we explain the method used to select and scan relevant journal articles on the topic of machine learning and air quality, which conforms to the PRISMA guideline. This section also describes the strategy used to analyze the manuscripts and synthesize the main findings. Second, the results are presented and discussed, from a general account to a detailed and synthetic one. Finally, the last section draws conclusions on the use of machine learning algorithms for predicting air quality and on the future challenges of this promising approach.

2. Method

2.1. Search Strategy

Relevant papers were researched in SCOPUS. The enquiry was limited to this scientific search engine because it compiles, in a single database, manuscripts that are also indexed in the most significant engineering databases (e.g., IEEE Xplore, ACM, etc.). The first step of the literature review consisted of completing the website document search with a combination of keywords. The formula used was as follows: {‘Machine Learning’} AND {‘Air Quality’ OR ‘Air Pollution’} AND {‘Model’ OR ‘Modelling’}. The exploration was limited to the period 2010–2018. Another limitation was to focus only on journal papers, since they represent the most accomplished work. This first step provided us with 103 documents.

The second step consisted of filtering the studies by reading the title and the abstract. Papers were excluded from our selection if they addressed the following topics: physical sensors (and not computational models); health/epidemiological studies (i.e., predictive models to estimate the impact on health and not to estimate and/or forecast the concentration of pollutants); social studies; biological studies; indoor studies; sporadic calamity (e.g., smog disaster). After applying these rejection criteria, the number of documents was reduced to 50.

In the last step, all 50 papers were fully read. After a consensus between the authors of this systematic review, four papers were rejected. Three manuscripts were excluded because they represented a very similar study previously carried out by the same authors. The other paper was consensually considered out of scope after reviewing the full document. Consequently, a total of 46 manuscripts were included for a further qualitative and quantitative synthesis. Figure 1 represents the flow diagram of the search method for the systematic review. It is based on the PRISMA approach, which provides a guideline to identify, select, assess, and summarize the findings of similar but separate studies. This method promotes a quantitative synthesis of the papers, which is carried out through a factorial analysis.
Figure 1. PRISMA-based flowchart of the systematic selection of the relevant studies.
2.2. Analyzed Parameters

All the papers are analyzed according to 14 aspects. The first parameter concerns the motivation of the study. The second is the type of modelling, which is divided into two categories: estimation and forecasting models. An estimation model uses predictive features (e.g., contaminant covariates, meteorology, etc.) to estimate the concentration of a given pollutant at the same point in time. A forecasting model takes into account historical data to predict the concentration of a pollutant in the future. The third analysis is based on the type of machine learning algorithm used. The main categories are artificial neural networks, support vector machines, ensemble learning, regressions, and hybrid versions of these algorithms. The fourth analysis describes the method applied by the authors. The fifth point focuses on the nature of the predicted parameter. Again, two groups are identified. On the one hand, there are authors interested in predicting specific air contaminants, namely: micro- and nano-size particulate matter (PM10, PM2.5, PM1); nitrogen oxides (NOx = NO + NO2); sulphur oxides (SOx); carbon monoxide (CO); and ozone (O3). On the other hand, some authors work on a prediction of air quality in general through the Air Quality Index (AQI), which may combine the concentrations of several pollutants. The sixth parameter identifies the geographic location of the study. The seventh gives details of the characteristics of the dataset, such as the time span, the number of monitoring stations, and the number of instances. Furthermore, the eighth point provides information on the specificity of the dataset in terms of the predictive attributes used. The main features are related to pollutant covariates, meteorology, land use, time, human activity, and atmospheric phenomena. The ninth and tenth factors address the evaluation method and the performance of the tested algorithms, respectively.
The assessment is mainly based on a comparison of accuracy between models and/or a comparison of the predictions against the actual values. The most popular evaluation criteria are the ratio of correctly classified instances (Accuracy), the Mean Absolute Error (MAE), the Root Mean Square Error (RMSE), and the coefficient of determination (R²). The Accuracy represents the overall performance of a classifier by providing the proportion of the whole test set that is correctly classified, as described in Equation (1).
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)

where TP, TN, FP and FN stand for True Positives, True Negatives, False Positives and False Negatives, respectively. The higher the Accuracy value, the better the model performance. The MAE shows the degree of difference between the predicted values and the actual values. The RMSE is another relative error estimator that focuses on the impact of extreme values based on the MAE. The R² represents the fitting degree of a regression. The MAE, RMSE and R² are calculated according to Equations (2)–(4), respectively.
MAE = (1/n) ∑_{i=1}^{n} |E_i − A_i|    (2)
RMSE = [ (1/n) ∑_{i=1}^{n} (E_i − A_i)² ]^{1/2}    (3)

R² = [ ∑_{i=1}^{n} (A_i − Ā)(E_i − Ē) ]² / [ ∑_{i=1}^{n} (A_i − Ā)² · ∑_{i=1}^{n} (E_i − Ē)² ]    (4)

where n is the number of instances, and A_i and E_i are the actual and estimated values, respectively. Ā and Ē stand for the mean measured and mean estimated value of the contaminant, respectively. A_max and A_min are the maximum and minimum observed pollutant values, respectively. R² is a dimensionless descriptive statistical index ranging from 0 (no correlation) to 1 (perfect correlation). MAE and RMSE values are in # cm−3. The lower the MAE and RMSE, the better the predictive performance of the model.

The eleventh parameter considers the computational cost of the method. Finally, the last three points discuss the outcomes of the proposed approach, its limitations, and its scope of applicability.
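The four evaluation criteria above can be computed directly from Equations (1)–(4). The sketch below is a minimal pure-Python implementation; the concentration values and the confusion-matrix counts are invented for illustration and come from no reviewed study:

```python
import math

def accuracy(tp, tn, fp, fn):
    # Equation (1): proportion of correctly classified instances
    return (tp + tn) / (tp + tn + fp + fn)

def mae(actual, estimated):
    # Equation (2): mean absolute error
    return sum(abs(e - a) for a, e in zip(actual, estimated)) / len(actual)

def rmse(actual, estimated):
    # Equation (3): root mean square error, emphasizing extreme deviations
    return math.sqrt(sum((e - a) ** 2
                         for a, e in zip(actual, estimated)) / len(actual))

def r_squared(actual, estimated):
    # Equation (4): squared correlation between actual and estimated values
    a_mean = sum(actual) / len(actual)
    e_mean = sum(estimated) / len(estimated)
    cov = sum((a - a_mean) * (e - e_mean)
              for a, e in zip(actual, estimated))
    var_a = sum((a - a_mean) ** 2 for a in actual)
    var_e = sum((e - e_mean) ** 2 for e in estimated)
    return cov ** 2 / (var_a * var_e)

# Hypothetical hourly PM2.5 concentrations and model estimates
actual = [12.0, 35.0, 22.0, 48.0, 15.0]
estimated = [14.0, 31.0, 25.0, 44.0, 13.0]
print(accuracy(tp=40, tn=45, fp=10, fn=5))  # 0.85
print(mae(actual, estimated))               # 3.0
```

Note that Accuracy applies to classifiers (e.g., predicting whether an AQI threshold is exceeded), whereas MAE, RMSE and R² apply to regression-style concentration predictions.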
2.3. Synthesis of Results

The results are synthesized according to descriptive statistics and a factorial analysis. First, we describe the most used algorithms over time in order to define the current tendencies. Second, we quantify the types of algorithms applied for the prediction of the principal contaminants. Third, we identify the evolution of the modelling performance for each pollutant over the last decade. Finally, since several parameters are considered for the description of the selected papers, we perform a factorial analysis to summarize the main outcomes of this systematic review.

The most appropriate method to identify the relationships between the qualitative factors that characterize each study is the Multiple Correspondence Analysis (MCA). First, a data table is created from the parameters identified in Section 2.2 (see Figure A1 of Appendix A). Each row (i) corresponds to a paper and each column (j) corresponds to a qualitative variable (e.g., type of modelling, algorithm, etc.). The total numbers of papers and of qualitative variables are denoted I and J, respectively. Next, this data table is fed to the R software in order to proceed with the analysis through the MCA library. The initial table is transformed into an indicator matrix called the Complete Disjunctive Table, in which the rows are still the papers, but the columns are now the categories (e.g., estimation model vs. forecasting model) of the qualitative variables. The number of categories of the j-th variable is denoted k_j (K categories in total). The entry at the intersection of the i-th row and the k-th column, called y_ik, is equal to 1 (or true) if the i-th paper has category k of the j-th variable and 0 (or false) otherwise. All the papers have the same weight, which is equal to 1/I. However, the method highlights papers that are characterized by rare categories by implementing Equation (5).
x_ik = y_ik / p_k    (5)

where p_k represents the proportion of papers in category k. Then, the data are centred by applying Equation (6).
x_ik = y_ik / p_k − 1    (6)
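The construction of the Complete Disjunctive Table and the weighting and centring of Equations (5) and (6) can be sketched as follows. This is a toy example with invented papers and categories, not the actual coding of the 46 reviewed studies:

```python
# Toy qualitative table: each row is a paper, each column a variable.
papers = [
    {"modelling": "estimation",  "algorithm": "ensemble"},
    {"modelling": "forecasting", "algorithm": "neural_network"},
    {"modelling": "forecasting", "algorithm": "neural_network"},
    {"modelling": "forecasting", "algorithm": "svm"},
]
I = len(papers)

# Complete Disjunctive Table: columns become categories, entries y_ik in {0, 1}.
categories = sorted({(var, cat) for p in papers for var, cat in p.items()})
cdt = [[1 if paper[var] == cat else 0 for var, cat in categories]
       for paper in papers]

# p_k: proportion of papers possessing category k.
p = [sum(row[k] for row in cdt) / I for k in range(len(categories))]

# Equations (5) and (6): weight each entry by 1/p_k, then centre, so that
# rare categories push the papers holding them away from the origin.
x = [[row[k] / p[k] - 1 for k in range(len(categories))] for row in cdt]

# Paper 0 holds two rare categories (p_k = 0.25), hence large coordinates.
print(categories)
print(x[0])
```

After centring, every column of x averages to zero across papers, which is why the resulting point cloud has its centre of gravity at the origin of the axes.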
The table of the x_ik is used to build the point cloud of papers and categories. Since the variables are centred, the cloud has its centre of gravity at the origin of the axes. The distance between a paper i and a paper i′ is given by Equation (7).
d²_{i,i′} = ∑_{k=1}^{K} (p_k/J) (x_ik − x_i′k)² = (1/J) ∑_{k=1}^{K} (1/p_k) (y_ik − y_i′k)²    (7)

where x_ik and x_i′k are the coordinates of papers i and i′, respectively. This equation shows that the distance is 0 if two papers have the same set of categories (or profile). If two papers share many categories, the distance will be small. However, if two papers share several categories except a rare one, the distance between them will be relatively large, due to the p_k value. Next, the calculation of the distance of a point from the origin (O) is given by Equation (8).
d²(i, O) = ∑_{k=1}^{K} (p_k/J) (x_ik)² = (1/J) ∑_{k=1}^{K} y_ik/p_k − 1    (8)

The equation shows that the distance gets larger when the categories of a paper are rarer, because the corresponding p_k are small. In other words, the more rare categories a paper possesses, the further it is from the origin of the plot axes. To conclude the process, the point cloud must be projected onto a smaller-dimensional space, which is usually reduced to two dimensions. To do so, the cloud (N_I) is projected onto a sequence of orthogonal axes with maximum inertia, as calculated in Equation (9).
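The two chi-square distances of Equations (7) and (8) can be sketched directly on a toy Complete Disjunctive Table (the table below is invented, with J = 2 qualitative variables, and is not the coding of the reviewed papers):

```python
# Toy Complete Disjunctive Table: 3 papers, 4 categories, J = 2 variables
# (each paper holds exactly one category of each variable).
J = 2
cdt = [
    [1, 0, 1, 0],  # paper 0: the rare category of each variable
    [0, 1, 0, 1],  # paper 1
    [0, 1, 0, 1],  # paper 2: same profile as paper 1
]
I = len(cdt)
K = len(cdt[0])

# p_k: proportion of papers possessing category k.
p = [sum(row[k] for row in cdt) / I for k in range(K)]

def dist2_papers(i, j):
    # Equation (7): squared chi-square distance between two papers; sharing
    # every category gives 0, and differing on a rare category weighs heavily.
    return sum((cdt[i][k] - cdt[j][k]) ** 2 / p[k] for k in range(K)) / J

def dist2_origin(i):
    # Equation (8): squared distance from the cloud's centre of gravity;
    # papers with rare categories (small p_k) lie further from the origin.
    return sum(cdt[i][k] / p[k] for k in range(K)) / J - 1

print(dist2_papers(1, 2))                  # identical profiles -> 0.0
print(dist2_origin(0) > dist2_origin(1))   # rare categories lie further out
```

Paper 0, which holds the rare category of each variable (p_k = 1/3), ends up four times further (in squared distance) from the origin than papers 1 and 2, which is exactly the behaviour the text describes.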