
SPATIAL-TEMPORAL PREDICTION OF AIR POLLUTION IN

BEIJING-TIANJIN-HEBEI REGION

A Thesis

Presented to the

Faculty of

California State Polytechnic University, Pomona

In Partial Fulfillment

Of the Requirements for the Degree

Master of Science

In

Civil Engineering

By

Keita Makino

2019

SIGNATURE PAGE

THESIS: SPATIAL-TEMPORAL PREDICTION OF AIR POLLUTION IN BEIJING-TIANJIN-HEBEI REGION

AUTHOR: Keita Makino

DATE SUBMITTED: Spring 2019

Department of Civil Engineering

Xinkai Wu, Ph.D. Thesis Committee Chair Civil Engineering

Yongping Zhang, Ph.D., P.E. Civil Engineering

Wen Cheng, Ph.D., P.E. Civil Engineering

ABSTRACT

In this thesis, the hourly concentrations of carbon monoxide (CO), nitrogen dioxide (NO2), sulfur dioxide (SO2), ozone (O3), and particulate matter with diameters less than 2.5 μm and 10 μm (PM2.5 and PM10) in Beijing and the Beijing-Tianjin-Hebei (BTH) region of China are comprehensively investigated. The concentration data were collected at eighty sites in the BTH region with national environmental monitoring station equipment. Starting from a statistical analysis of the concentrations of the various pollutants, we provide multiple regression models and a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) regression model to predict the concentrations in Beijing 1, 8, or 24 hours ahead. Some of the models account for the concentrations of various pollutants and the meteorological conditions in the neighboring cities of the BTH region. Experimental results show that, while the LSTM-RNN model achieved the best index of agreement (IA) and root mean squared error (RMSE) in most of the 8-hour predictions, linear models can provide comparable results in some cases.

TABLE OF CONTENTS

SIGNATURE PAGE
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
1. INTRODUCTION
2. LITERATURE REVIEW
2.1. Scope of Research
2.1.1. Statistical analysis between pollutants
2.1.2. Statistical analysis between some pollutants and other factors/issues
2.1.3. Prediction of the concentration of a pollutant
2.2. Area of Research
2.2.1. Microscale – roadside air pollution
2.2.2. Intracity-scale – multiple locations in a city/region
2.2.3. Intercity-scale – multiple cities/regions to country-level
3. DATASET
3.1. Data Description
3.2. Time Series Overview
3.3. Contribution of Wind to Beijing
4. Correlation Analysis
4.1. Correlation of pollutants in Beijing
4.2. Intercity Correlation
4.3. Wind and Pollutants
5. Predictions
5.1. Data Standardization
5.2. Models
5.2.1. Linear Regression
5.2.2. Compound Linear Regression
5.2.3. LSTM-RNN
5.3. Results
5.4. Models with Additional Variables
6. Conclusion
REFERENCES
Appendix

LIST OF TABLES

Table 1. Distance and Orientation between Cities
Table 2. Pollutants
Table 3. Meteorological Factors
Table 4. Structure of Data Table for Each City
Table 5. Terminology for Description in Models and Prediction
Table 6. Variables in linear regression: For CO prediction
Table 7. Variables in linear regression: For NO2 prediction
Table 8. Variables in linear regression: For SO2 prediction
Table 9. Variables in linear regression: For O3 prediction
Table 10. Variables in linear regression: For PM2.5 prediction
Table 11. Variables in linear regression: For PM10 prediction
Table 12. Variables in compound linear regression: For CO prediction
Table 13. Variables in compound linear regression: For NO2 prediction
Table 14. Variables in compound linear regression: For SO2 prediction
Table 15. Variables in compound linear regression: For O3 prediction
Table 16. Variables in compound linear regression: For PM2.5 prediction
Table 17. Variables in compound linear regression: For PM10 prediction
Table 18. Variables in linear regression: For PM10 prediction
Table 19. Performance index of 1-hour-prediction models
Table 20. Performance index of 8-hour-prediction models
Table 21. Performance index of 24-hour-prediction models
Table 22. Performance index of 1-hour-prediction models
Table 23. Performance index of 8-hour-prediction models
Table 24. Performance index of 24-hour-prediction models

LIST OF FIGURES

Figure 1. Cities in Beijing-Tianjin-Hebei (BTH) region [10]
Figure 2. Location of BTH region. Facing to the . [11]
Figure 3. Concentration of CO, NO2, SO2, O3, PM2.5 and PM10 in Beijing
Figure 4. Meteorological condition in Beijing
Figure 5. Concentration of CO, NO2, SO2, O3, PM2.5 and PM10 in Tianjin
Figure 6. Meteorological condition in Tianjin
Figure 7. Wind rose in Beijing
Figure 8. Wind rose in Tianjin
Figure 9. Contribution of wind is evaluated as the vertical component of the wind in a city to Beijing
Figure 10. Contribution of wind from each city to Beijing
Figure 11. Correlation matrix of pollutants in Beijing
Figure 12. Correlation matrix of pollutants and meteorological conditions in Beijing
Figure 13. Correlation matrix of pollutants in Beijing and aggregation of the other cities
Figure 14. Correlation matrix of pollutants in Beijing and
Figure 15. Correlation matrix of pollutants in Beijing and
Figure 16. Correlation of pollutant in Beijing and other cities, at a glance
Figure 17. Correlation of pollutant in Beijing and other cities, sorted by distance
Figure 18. Correlation matrix of pollution in Beijing and the intensity of wind to Beijing
Figure 19. Histogram of CO, NO2, SO2, O3, PM2.5 and PM10 in Beijing
Figure 20. Histogram of AP, HM, TP, and CW in Beijing
Figure 21. Optimal lag for 1-hour ahead CO prediction
Figure 22. Concept of LSTM-RNN
Figure 23. LSTM Unit [40]
Figure 24. Bidirectional LSTM [41]
Figure 25. 1-hour prediction with three models
Figure 26. 8-hour prediction with three models
Figure 27. 24-hour prediction with three models
Figure 28. 1-hour Q-Q plot with three models
Figure 29. 8-hour Q-Q plot with three models
Figure 30. 24-hour Q-Q plot with three models
Figure 31. Autocorrelation function of pollutants in Beijing
Figure 32. Cross-correlation function of pollutants in Beijing and other cities: , Cangzhou, , and Handan
Figure 33. Cross-correlation function of pollutants in Beijing and other cities: , Langfang, , and
Figure 34. Cross-correlation function of pollutants in Beijing and other cities: , Tianjin, , and

1. INTRODUCTION

With the rise in population, the use of automobiles, and the advance of heavy industries, air pollution has been one of the core issues in many urban areas for the last few decades. In particular, China, which now has a population of more than 1.4 billion and one of the world's most developed industrial systems, has recently struggled with serious pollution problems in its megalopolises.

The Beijing-Tianjin-Hebei region (BTH region) is located in north-east China. With a temperate monsoon climate, the BTH region is dry and cold in winter, and rainy and hot in summer. The elevation decreases from the northwest to the southeast. This region, with a total of 13 cities and 185 counties or districts, is the governmental, economic and cultural center of the country, and has supported the development of the country for a long time. As of the end of 2017, the region had a population of 112 million and a gross domestic product (GDP) of 1.11 trillion USD, which respectively correspond to 8.1% and 9.5% of the totals in China [1]. In particular, Beijing has 21.7 million people and provides about 40% of the total GDP of the BTH area [2]. Tianjin, the other large city in the BTH region, facing the East Sea, has about 10% of the regional population and provides 25% of the GDP. At the same time, however, the region has a passenger traffic of 1.26 billion per year [1] and suffers from severe air pollution every day. According to the Ministry of Environmental Protection, , Langfang, Shijiazhuang, , Zhengzhou, Tangshan, Hengshui, Xingtai and Baoding had the worst air quality in 2015, seven of which belonged to the BTH region (Dan Yana et al., 2018).

As a commonplace problem, urban air pollution has never disappeared despite changes in its form [3]. It is repeatedly reported that exposure to chemical pollutants from traffic significantly increases the chance of various health problems. Investigating 24 schools in the southwestern Netherlands, Janssen et al. (2003) [4] indicated that children living near busy traffic have a higher chance of suffering from allergic sensitization, bronchial hyperresponsiveness, or respiratory symptoms. In the Chinese context, et al. (2011) [5] suggested that the mortality rate in Shenyang, China is significantly increased by a higher ambient PM2.5 concentration, especially in the warmer season and for females and elderly people over 75. Such health problems eventually have a negative effect on the national economy. Tian et al. (2018) [6] investigated this economic effect based on predictions of vehicle emissions and vehicle technologies, and concluded that some provinces in China might lose up to 1% of their total GDP in 2030, even if vehicle technologies are of a high standard by then. According to Parry (2013) [7], the air quality in Beijing is now considered a catastrophe and needs more than 20 years to improve, despite the current effort of the country, where more than 4% of the GDP is spent on pollution issues. Lu et al. (2015) [8] conducted a meta-analysis of 59 studies on the health effects of ambient particulate matter with diameters less than 2.5 μm (PM2.5) and less than 10 μm (PM10). Using a meta-regression over the datasets of the previous studies, they concluded that there should be a significant relationship between ambient PM2.5/PM10 and mortality from cardiovascular or respiratory disease from both short-term and long-term perspectives.

Given this, monitoring, analysis and prediction of the concentrations of various pollutants have long been anticipated and practiced by industries, transportation sectors, and private and public researchers. Admittedly, some countries, regions, and cities do not yet have a sufficient monitoring infrastructure or enough studies based on the collected data. Lu et al. (2015) [8] also implied that the number of studies was still insufficient to reveal long-term effects. Regarding African countries, Coker (2018) [9] pointed out that epidemiological studies of ambient air pollution in sub-Saharan Africa are inadequate, and highlighted the urgent need to improve the monitoring network in the region.

Nevertheless, many studies have already brought important findings based on monitored air pollution data in both developed and developing countries. Generally, there are three types of analysis: statistical analysis of air pollution itself; statistical analysis between air pollution and other factors, including traffic or health issues; and prediction of the pollution. With the recent development of national-level monitoring networks, studies based on high-resolution, long-term and wide-area datasets are becoming more common. Besides the commonly used hourly datasets, Zhang et al. (2014) proposed a visualization of roadside air pollution in Beijing by analyzing a dataset of 1-minute resolution.

Given this context, in this thesis we investigate an air-pollution and meteorological dataset collected in 13 Chinese cities in the BTH region: Beijing, Baoding, Cangzhou, Chengde, Handan, Hengshui, Langfang, Qinhuangdao, Shijiazhuang, Tangshan, Tianjin, Xingtai, and Zhangjiakou. The geographic locations of the cities are shown in Figure 1, Figure 2 and Table 1. The satellite cities are located 47 km to 399 km away from Beijing. We then utilize the large amount of data to construct three predictive models that estimate the concentrations of air pollutants in Beijing 1, 8, or 24 hours ahead. Some of the models are designed to capture the propagation of pollutants observed in neighboring cities.

Figure 1. Cities in Beijing-Tianjin-Hebei (BTH) region [10]. (Map legend: provincial boundary, municipal boundary, county boundary, study area.)

The first model we developed is a linear regression model, which has been commonly used in diverse fields. We named the second one the compound linear model; it accounts for the past concentrations of pollutants in neighboring cities of the BTH region to predict the future concentration in Beijing. Lastly, we introduce a long short-term memory recurrent neural network (LSTM-RNN) model that can capture non-linear dependencies.

Figure 2. Location of BTH region [11].

The remainder of this thesis is organized as follows. The subsequent chapters first introduce the literature review, followed by the data description and preliminary analysis.

Next, the results and discussion are presented, and the conclusion is drawn in the final chapter.

Table 1. Distance and Orientation between Cities
Upper-right triangle: orientation (degrees) from the city in the left column to the city in the top row; 0 degrees points from west to east, and the angle increases counterclockwise. Lower-left triangle: distance (km), measured by GlobeFeed.com [12].

City          Beijing  Baoding  Cangzhou  Chengde  Handan  Hengshui  Langfang  Qinhuangdao  Shijiazhuang  Tangshan  Tianjin  Xingtai  Zhangjiakou
Beijing             -      235       282       41     236       255       300            1           232       348      309      242          144
Baoding           140        -       332       47     252       262        35           18           228        20       11      247          103
Cangzhou          182      135         -       72     222       211        96           37           188        52       72      214          122
Chengde           176      314       310        -     238       241       235          321           228       278      252      235          185
Handan            399      263       276      565       -        51        60           40            91        46       50       93           86
Hengshui          249      127       120      408     159         -        66           36           162        44       49      216          101
Langfang           47      128       138      191     374       219         -           10           221         5      315      235          132
Qinhuangdao       272      374       299      179     575       419       306            -           206       196      203      216          167
Shijiazhuang      264      124       205      438     158       107       251          487             -        29       28      270           84
Tangshan          155      248       188      148     462       303       129          126           363         -      222      156          133
Tianjin           108      153        99      211     365       205        62          224           263       429        -      224          132
Xingtai           356      217       247      525      50       127       334          546           108       307      330        -           85
Zhangjiakou       161      216       321      260     462       343       205          410           305       267      267      412            -

2. LITERATURE REVIEW

As mentioned in the previous chapter, predicting the air quality is one of the most important interests among any , as it is highly associated with health issues and consequently leads to an economic loss in the region.

In this study, we first investigate the air-pollution and meteorological data from 13 cities in the BTH region to find relationships or correlations between them. Then we review a few models that predict the concentrations of various pollutants in Beijing. Therefore, in this chapter, we review existing research to better understand how air-pollution monitoring and prediction have developed over time, and which types of models can be applied to the analysis in this field.

2.1. Scope of Research

First of all, we can classify the existing studies regarding the air-pollution analysis into three types.

2.1.1. Statistical analysis between pollutants

Studies in this category usually perform a correlation analysis between various pollutants in a certain city, region or country. Pollutants such as CO, NOx, SOx, O3, and PMx are most commonly investigated. Wang et al. (2014) [13] conducted a comprehensive analysis of the spatial and temporal distribution of CO, NO2, SO2, O3, PM2.5 and PM10 in 31 cities across China. They separated the 31 cities into three regions (north, south-east and west) and derived the correlation between each pair of pollutants for each season and throughout the year, using a dataset of hourly concentrations of those pollutants in these cities. Yangyang et al. (2015) [14] made a similar analysis of the same set of cities, but classified them into 6 regions (northwest, north, northeast, east, middle-south and southwest). Also using an hourly dataset, their correlations between PM2.5/PM10 and CO, NO2, SO2 and O3 agree with the results of the previous work [13]. It is verified that there is a moderate to strong positive correlation between PM2.5/PM10 and CO, NO2, and SO2 throughout the year or in winter/spring in the northern and south-eastern parts of China. However, PM2.5/PM10 and O3 are almost uncorrelated both throughout the year and in each season.

Some works, on the other hand, analyze the correlation between the concentrations of one pollutant at different locations. Liu et al. (2018) [15] revealed that each pair of 17 cities in north-east China has a certain time lag that maximizes the correlation between the PM2.5 concentrations in the two cities. This time lag tends to be significant typically in the spring and winter seasons.
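The idea of a correlation-maximizing time lag can be illustrated with a minimal sketch. It is not the procedure of [15]; the function names `pearson` and `best_lag`, the synthetic series, and the search range are assumptions made only for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def best_lag(series_a, series_b, max_lag):
    """Return (lag, r) maximizing corr(series_a[t], series_b[t + lag])
    over 0 <= lag <= max_lag. Both series must have the same length."""
    best = None
    for lag in range(max_lag + 1):
        x = series_a[: len(series_a) - lag]  # earlier values of city A
        y = series_b[lag:]                   # later values of city B
        r = pearson(x, y)
        if best is None or r > best[1]:
            best = (lag, r)
    return best

# Synthetic example (assumed data): city B repeats city A three hours later.
city_a = [math.sin(0.3 * t) for t in range(200)]
city_b = [0.0] * 3 + city_a[:-3]
lag, r = best_lag(city_a, city_b, max_lag=10)
```

In this synthetic case the search recovers the three-hour shift built into `city_b`; with real hourly concentrations the maximum is usually much less sharp.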

Looking outside China, Gehrig (2003) [16] investigated the seasonal correlation between PM2.5 and PM10 in 7 cities across Switzerland. It is commonly accepted that PM2.5 and PM10 are highly correlated, but the study showed that they become uncorrelated or weakly correlated for some pairs of cities in the country. This is caused by the Alps, a range of high mountains in central Switzerland, which prevents those pollutants from spreading across the terrain. This effect seems especially significant in winter, so the correlation between the PMs in northern and southern Switzerland tends to be less significant in that season, contrary to the Chinese situation.

2.1.2. Statistical analysis between some pollutants and other factors/issues

The second type of analysis mainly focuses on the correlation or relationship between some pollutants and other social factors or activities. Social factors are often identified as health problems or the resulting economic loss. Social activities are usually attributed to the traffic or industry in the region. As mentioned in the introduction, many studies have already pointed out how ambient pollutants can increase health problems [4]–[6], [8].

Another interest of researchers is the cause-effect relationship between traffic or heavy industry and the pollutants they emit. According to Wang et al. (2008) [17], it is difficult to develop an accurate vehicle emission inventory for a large city, and thus many methods have been proposed to estimate the inventory or total emission in a city. California Line Source Model 4 (CALINE4) [18] is a commonly accepted model that estimates the ambient concentration of several pollutants from roadside emissions and meteorological factors. Wang et al. (2008) [17] provided a bottom-up approach where the total inventory is evaluated as the sum of cold start emissions and running emissions, whose rates depend on the vehicle type. Gokhale and Raokhande (2008) [19] conducted a comparative analysis of California Line Source Model 3 (CALINE3) [20], the Modified General Finite Line Source Model for particulates (M-GFLSM), and California Line Source for Queuing & Hot Spot Calculations (CAL3QHC) by comparing their daily predictions of PM10 concentration from traffic volume, wind speed, wind direction, sunshine hours, temperature, and emission factor.

Also, emission from industry is one of the major concerns in China, which produces some of the largest volumes of various products in the world, such as steel [21] or plastics [22]. Meng et al. (2015) [23] provided a comprehensive analysis of PM2.5 emission along Chinese supply chains from consuming power, transportation, agriculture, industry, service and construction products. It revealed that only 8% of the total PM2.5 emission can be attributed to the final demand for the products, and the remaining 92% comes from the production processes that serve the demand.

2.1.3. Prediction of the concentration of a pollutant

Given the arguments in the previous two sections, the prediction and estimation of air pollution from various harmful sources is a gravely important problem for developing countries and their cities. Therefore, many works focusing on predicting the concentration of a certain pollutant have been published so far.

Besides the simple linear regression model, which has been used for a long time, models based on neural networks or machine learning schemes have rapidly become common in this field. Perez and Trier (2001) [24] utilized a primitive three-layer feed-forward neural network to predict the hourly concentrations of NO and NO2 on the next day in Santiago, Chile. Its inputs included the NO concentration on the previous day and meteorological conditions (temperature, humidity, wind speed and wind direction). Cai et al. (2009) [25] presented a model based on a back-propagation neural network that predicts the concentration of a pollutant (CO, NO2, O3 or PM10) from the concentration of the pollutant 1 to 3 hours before, meteorological conditions (air pressure, humidity, temperature, rainfall, wind speed, wind direction and solar radiation), road geometry and traffic. Sekar et al. (2016) [26] utilized a neural network and a decision tree algorithm to predict NOx and O3 concentrations at an urban intersection in Delhi, India. The model contained finely classified vehicle counts (e.g., counts of gasoline, diesel, CNG or LPG vehicles) and the source strength of other pollutants (e.g., CO, NOx, hydrocarbon, PM) in addition to meteorological conditions. CALINE4 [18] is widely accepted as the standard of performance.

As computing power has improved in the last few years, some new approaches requiring more resources have been adopted for prediction. Zhan et al. (2017) [27], for example, applied a geographically weighted gradient boosting machine to predict the spatial distribution of PM2.5 across China, based on meteorological data, elevation data, and population density. Another recent trend in artificial neural networks is the utilization of the temporal distribution. Bui et al. (2017) [28] adopted an LSTM network to predict the concentration of PM2.5 in Seoul, Korea.

2.2. Area of Research

These are examples of existing works on the analysis of air pollution. Note that some studies have features of more than one of these categories. Another classification can be made by the scale of the study; each study generally focuses on one of the following scales.

2.2.1. Microscale – roadside air pollution

This type of research addresses roadside or fixed-point pollution, using a ground monitoring system and a camera-based traffic counter, as represented by the works of Xu et al. (2016) [29] and Li et al. (2017) [30].

2.2.2. Intracity-scale – multiple locations in a city/region

Works of this type often focus on the negative effects of pollution on people in the city or region [4], [5]. Also, some early studies using statistical analysis or neural-network regression rely on a dataset taken in one city, perhaps due to the limitation of computing power [24], [31], [32].

2.2.3. Intercity-scale – multiple cities/regions to country-level

The last group consists of multi-city or country-level analyses. Studies of this type are largely based on datasets published by the government of the region or country. With the development and spread of refined monitoring networks for both meteorological conditions and air pollution, research of this scope has rapidly become common. In particular, as mentioned in the previous section, a decent number of studies have been published for a better understanding of the correlation between regional [13], [14]. Song et al. (2017) [33] made one of the most up-to-date reports on the spatiotemporal distribution of CO, NO2, SO2, O3, PM2.5 and PM10 across 1467 Chinese monitoring sites in 367 cities that participate in the National Air Quality Monitoring Sites (NAQMS).

Certainly, the analysis of air pollution is an important topic for every developed country as well. Air pollution caused by NO2 or PM2.5 in the United States has often been investigated at the country level. Tai et al. (2010) [34] made a correlation analysis between PM2.5 concentration and meteorological factors across the whole United States, with details of PM2.5 components such as organic carbon (OC) and elemental carbon (EC).

The variables and factors used in an analysis vary according to its scale. Most works at the microscale and intracity scale take the local traffic volume as an important factor. Meanwhile, studies at a larger scale tend to focus only on the ambient pollution provided by national-level observation networks, as it is very difficult to capture all the traffic or industrial activity in the region.

The work presented in this thesis has the characteristics of 2.1.3, as we have developed several models that predict the concentrations of CO, NO2, SO2, O3, PM2.5 and PM10, and the characteristics of 2.2.3, as the dataset we investigated contains records from 13 cities. In the following chapters, we review the dataset structure, make a preliminary analysis and then develop models for the prediction.

3. DATASET

In this chapter, we introduce the dataset investigated in the rest of this study, and then make a preliminary analysis, such as a correlation analysis, along with a review of the overall dataset.

3.1. Data Description

Beijing-Tianjin-Hebei, comprising Beijing, Tianjin and 11 prefecture-level cities in Hebei Province, is called the “Capital Economic Circle” in China. As reviewed in the introduction chapter, Beijing is located in the north of the Plain, adjacent to Hebei and Tianjin. In this study, a total of 80 air quality monitoring stations in the BTH region were studied.

Table 2. Pollutants

Factor                                              Unit    Abbreviation
Carbon monoxide                                     mg/m³   CO
Nitrogen dioxide                                    μg/m³   NO2
Sulfur dioxide                                      μg/m³   SO2
Ozone                                               μg/m³   O3
Particulate matter with diameter less than 2.5 μm   μg/m³   PM2.5
Particulate matter with diameter less than 10 μm    μg/m³   PM10

The hourly concentrations of CO, NO2, SO2, O3, PM2.5, and PM10 over the BTH region in 2016 were obtained from the Data Centre of the Ministry of Environmental Protection of the People's Republic of China. The hourly concentrations at all of the 80 stations in 2016 were published on the China Environmental Monitoring Center website (http://www.cnemc.cn/). The data cover the 13 cities in the BTH region throughout 2016. Each concentration value for a city is evaluated as the average of the records collected at all stations in that city.
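The city-level averaging described above can be sketched as follows. This is a minimal illustration rather than the processing code used for this thesis: the record layout, the station identifiers, and the choice to skip stations with a missing reading are all assumptions.

```python
from collections import defaultdict
from statistics import mean

def city_hourly_mean(records):
    """Average station readings into one value per (city, hour).

    records: iterable of (city, station, hour, value) tuples, where value
    is None when the station has no reading for that hour (assumed layout).
    """
    buckets = defaultdict(list)
    for city, _station, hour, value in records:
        if value is not None:  # assumption: missing stations are skipped
            buckets[(city, hour)].append(value)
    return {key: mean(values) for key, values in buckets.items()}

# Hypothetical CO readings (mg/m³) for a single hour.
records = [
    ("Beijing", "S1", "2016-01-01 00:00", 4.0),
    ("Beijing", "S2", "2016-01-01 00:00", 3.0),
    ("Beijing", "S3", "2016-01-01 00:00", None),
    ("Tianjin", "T1", "2016-01-01 00:00", 2.0),
]
city_means = city_hourly_mean(records)
```

An hour in which every station of a city is missing produces no entry at all, which leaves a gap in the city-level series to be handled afterwards.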

The meteorological dataset for all the cities was retrieved from OpenWeatherMap.org. Like the air-pollution data, the records were observed in the 13 cities of the BTH region, and contain hourly air pressure, humidity, temperature, wind speed, and wind direction. For these factors, we use the two-character abbreviations listed in Table 3 for further reference in this thesis.

Combining the two datasets, the data table can be represented as shown in Table 4. Then, in order to guarantee the continuity of wind direction, two wind parameters, the wind-x (WX) and wind-y (WY) components, were generated from the wind speed (WI) and wind direction (DR) [35] by the following equations:

\[ WX = WI \times \sin(DR) \]

\[ WY = WI \times \cos(DR) \]

Subsequently, we filled the missing values by linear interpolation, which is fairly acceptable whenever the ratio of missing values is 10% or less [36]. After the interpolation, we converted the wind components back to wind speed and direction for further use. Note that the data structure represented in Table 4 belongs to one city, and thus there are 13 tables in total.
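The reason for interpolating the wind components instead of the raw direction can be seen in a short sketch. It follows the WX and WY definitions above and assumes DR is given in degrees; the function names, the use of `None` for a missing hour, and the choice to leave leading and trailing gaps unfilled are assumptions for illustration, not the code used for this thesis.

```python
import math

def to_components(wi, dr_deg):
    """Split wind speed WI and direction DR (degrees) into (WX, WY)."""
    rad = math.radians(dr_deg)
    return wi * math.sin(rad), wi * math.cos(rad)

def to_speed_direction(wx, wy):
    """Invert to_components: recover WI and DR wrapped into 0-360 degrees."""
    wi = math.hypot(wx, wy)
    dr = math.degrees(math.atan2(wx, wy)) % 360.0
    return wi, dr

def interpolate_linear(values):
    """Fill interior None gaps linearly; leading/trailing gaps stay None."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            frac = (i - left) / (right - left)
            filled[i] = filled[left] + frac * (filled[right] - filled[left])
    return filled

# Hypothetical (WI, DR) observations; the middle hour is missing.
obs = [(2.0, 350.0), None, (2.0, 10.0)]
comps = [to_components(*o) if o is not None else (None, None) for o in obs]
wx = interpolate_linear([c[0] for c in comps])
wy = interpolate_linear([c[1] for c in comps])
filled = [to_speed_direction(x, y) for x, y in zip(wx, wy)]
```

Interpolating DR directly between 350° and 10° would give 180°, the opposite direction, whereas the component-wise result for the missing hour lies at the 0°/360° wrap-around, between its two neighbors.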

Finally, we use the terminology given in Table 5 for the concentrations of pollutants and meteorological conditions as variables for the rest of this thesis. For instance, $P^{1}_{CO(t+8)}$ refers to the concentration of CO in Beijing at time $t+8$. Also, we use differently typed abbreviations in the table to refer to the variable (i.e., time series) of the corresponding quantity.

Table 3. Meteorological Factors

Factor          Unit                                    Abbreviation
Air Pressure    Hectopascal (hPa)                       AP
Humidity        Relative humidity (%)                   HM
Temperature     Celsius (C)                             TP
Wind Speed      m/s                                     WI
Wind Direction  0° = from West, 90° = from South,       DR
                180° = from East, 270° = from North

Table 4. Structure of Data Table for Each City

Datetime        CO    NO2   SO2   O3    PM2.5  PM10  AP    HM  TP    WI  DR
01/01/16 00:00  3.98  100   31.3  8.33  168    211   1028  79  -2.1  0   180
01/01/16 01:00  3.76  96.0  30.6  7.58  165    199   1030  76  -3.4  0   180
01/01/16 02:00  3.68  92.5  32.6  7.75  160    176   1028  79  -2.2  2   280
…               …     …     …     …     …      …     …     …   …     …   …
12/31/16 23:00  5.37  131   15.4  4.08  413    527   1026  92  -5.3  1   270

Table 5. Terminology for Description in Models and Prediction

Terminology (written in red)  Refers to                             Note
$P_{i}$                       Concentration of pollutant of type i  One of CO, NO2, SO2, O3, PM2.5 and PM10
$X_{j}$                       Meteorological condition of type j    One of AP, HM, TP, WI, DR, WX, WY, CW*
$P_{i(t)}$                    Time                                  In hours
$P^{k}_{i(t)}$                City index                            From 1 to 13
$P^{1}_{i(t)}$                Beijing                               k = 1

*CW is introduced in Section 3.3.

3.2. Time Series Overview

In this section, through various plots, we will review the basic statistics and several

time series of air pollution and meteorological conditions in Beijing and other cities.

Firstly, Figure 3 displays the concentrations of all the pollutants in Beijing. While every concentration varies over time, each time series has a seasonal tendency. For example, CO, NO2 and SO2 are typically high in winter, while O3 is lower at that time. PM2.5 and PM10 have a weaker tendency and seem rather consistent throughout the year. Figure 4 plots the meteorological variables in Beijing. As Beijing is located in the northern part of China, the temperature is often below zero degrees Celsius in winter but reaches 40 degrees Celsius at times in summer.

As the examples in Figure 5 and Figure 6 show, all the cities have the same set of time series. Note that the plots are based on the data table after the linear interpolation process. Though PM2.5 and PM10 get a little higher in winter, overall Tianjin seems to show the same tendency in air pollution and meteorological conditions across the four seasons of the year.

Wind properties cannot be represented by a line plot in the same manner as the previous variables, so we produced a wind rose for each city, as shown in the following figures. A wind rose works as a histogram over both wind speed and wind direction [37]. In both Beijing and Tianjin, wind from north to south is the most frequent over the year. The wind speed is generally moderate, as it is less than 4 m/s in more than 75% of the hours.

Readers may find the plots for the other cities than Beijing and Tianjin through the web appendix listed at the end of this thesis.

Figure 3. Concentration of CO, NO2, SO2, O3, PM2.5 and PM10 in Beijing. (One panel per pollutant; x-axis: time, January 2016 to January 2017; lines: concentration, 24-hour average and 1-week average.)

Figure 4. Meteorological condition in Beijing. (x-axis: time, January 2016 to January 2017; lines: meteorological value, 24-hour average and 1-week average.)

Figure 5. Concentration of CO, NO2, SO2, O3, PM2.5 and PM10 in Tianjin. (One panel per pollutant; x-axis: time, January 2016 to January 2017; lines: concentration, 24-hour average and 1-week average.)

Figure 6. Meteorological condition in Tianjin. (x-axis: time, January 2016 to January 2017; lines: meteorological value, 24-hour average and 1-week average.)

Figure 7. Wind rose in Beijing. (Sixteen compass directions; wind speed in 2 m/s classes from 0 to 14 m/s.)

Figure 8. Wind rose in Tianjin. (Sixteen compass directions; wind speed in 2 m/s classes from 0 to 14 m/s.)

3.3. Contribution of Wind to Beijing

In the previous section, we reviewed the wind roses of Beijing and Tianjin. However, when developing a prediction model, it is necessary to account for the direction of the wind in each city relative to the orientation of that city from Beijing. We therefore modeled how the wind in each city affects the concentration of pollutants in Beijing. To evaluate this effect, we defined a new variable named “contribution of wind” (CW) according to the following equation.

$$X^{(1,k)}_{CW(t)} = \begin{cases} \max\left(X^{k}_{WI(t)} \times \cos\left(O^{(1,k)} - X^{k}_{DR(t)}\right),\ 0\right) + 0.001 & k \neq 1 \\ X^{1}_{WI(t)} & k = 1 \end{cases}$$

where $X^{(1,k)}_{CW(t)}$, $X^{k}_{WI(t)}$ and $X^{k}_{DR(t)}$ are, respectively, the contribution of wind from city $k$ to city 1 (Beijing), the wind intensity at city $k$ and the wind direction at city $k$ at time $t$, and $O^{(1,k)}$ is the geographical orientation from city $k$ to city 1 (Beijing).

Figure 9. Contribution of wind is evaluated as the vertical component of the wind in a city to Beijing.

This new variable represents the vertical component of the wind in a city to Beijing, as shown in Figure 9, and indicates how strongly the wind blows toward Beijing. Since only the wind blowing toward Beijing is of interest, we converted the negative values of $X^{(1,k)}_{CW(t)}$ to zero. Finally, we added 0.001 to make every value non-zero for the log transformation, which is explained in chapter 5.
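The definition of CW for a neighboring city can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name and arguments are hypothetical, both angles are taken in degrees in the same convention (so that a zero difference means the wind blows straight toward Beijing), and negative components are clipped to zero as described in the text.

```python
import math


def contribution_of_wind(wind_speed, wind_dir_deg, orientation_deg, floor=0.001):
    """Component of a city's wind along the orientation toward Beijing.

    Assumption: wind_dir_deg and orientation_deg use the same angular
    convention, so equal angles mean the wind blows straight to Beijing.
    Negative components are clipped to zero, and `floor` keeps the result
    strictly positive for the later log transformation.
    """
    along = wind_speed * math.cos(math.radians(orientation_deg - wind_dir_deg))
    return max(along, 0.0) + floor


toward = contribution_of_wind(4.0, 90.0, 90.0)    # wind aligned with Beijing
away = contribution_of_wind(4.0, 270.0, 90.0)     # wind in the opposite direction
oblique = contribution_of_wind(2.0, 30.0, 90.0)   # 60 degrees off the orientation
```

With these inputs the aligned wind keeps its full speed, the opposite wind is reduced to the 0.001 floor, and the oblique wind contributes half of its speed.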

Figure 10 displays the one-year contribution of wind to Beijing from each city. Some cities, such as Handan or Langfang, have a higher average contribution of wind. Meanwhile, cities to the north-east of Beijing (i.e., Qinhuangdao, Chengde) have a smaller contribution of wind. As Liu et al. (2018) [15] pointed out, the time-averaged wind in north-east China blows consistently from west to east throughout the year. Hence, the contribution of wind in our data seems to indicate correctly that the vertical component of the wind to Beijing from that area is less significant than from other areas.

Figure 10. Contribution of wind from each city to Beijing. (Panels: Baoding, Cangzhou, Chengde, Handan, Hengshui, Langfang, Qinhuangdao, Shijiazhuang, Tangshan, Tianjin, Xingtai and Zhangjiakou; x-axis: time, January 2016 to January 2017; lines: hourly value, 24-hour average and 1-week average.)

4. Correlation Analysis

4.1. Correlation of pollutants in Beijing

In this section, we review how the concentration of each pollutant in each city relates to other properties. First, we investigated the correlation between the concentrations of the pollutants within one city. Figure 11 displays the relationship between each pair of pollutants observed in Beijing. Some pairs of pollutants are highly correlated, and some are not. For example, CO has a high correlation with NO2 (0.808), PM2.5 (0.836), SO2 (0.529) and PM10 (0.733), while it has a moderate negative correlation with O3 (-0.352). NO2 has a positive correlation with CO, SO2 (0.532), PM2.5 (0.766) and PM10 (0.7), but a negative one with O3 (-0.522). SO2 shows a similar tendency to the previous two variables, but it is less pronounced. O3 has the least clear relationship with the other pollutants, showing a weak to moderate negative correlation with all of them. PM2.5 and PM10 inherently share the same property, and they show a very high correlation of 0.9.

Figure 11. Correlation matrix of pollutants in Beijing. Upper: correlation between two pollutants. Diagonal: histogram of each pollutant. Lower: scatterplot of two pollutants. Darker color in the scatterplots indicates more records near the coordinate. Same in the next four figures.
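A pairwise correlation matrix of this kind can be computed as in the following sketch. It assumes that the reported values are Pearson correlation coefficients, which the text does not state explicitly; the function names are hypothetical and the input values are made up for illustration, not taken from the dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Correlation of every pair of columns; `columns` maps name -> values."""
    return {
        (a, b): pearson(columns[a], columns[b])
        for a in columns
        for b in columns
    }


# Made-up values for illustration only (not taken from the dataset).
toy = {
    "CO": [1.0, 2.0, 3.0, 4.0],
    "NO2": [2.0, 4.0, 6.0, 8.0],
    "O3": [8.0, 6.0, 4.0, 2.0],
}
matrix = correlation_matrix(toy)
```

In this toy input CO and NO2 rise together while O3 falls, so the CO and NO2 entry is positive and the entries involving O3 are negative, mirroring the sign pattern described above.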

Figure 12 displays the correlation between the concentration of pollutants in Beijing and the meteorological conditions there. According to the dataset, humidity seems to be the most influential factor for CO, NO2, PM2.5 and PM10. Air pressure is a weaker factor than humidity, showing a weak to moderate correlation with CO and SO2. Temperature is highly correlated with O3, but has an inverse correlation with CO, NO2 and SO2. Overall, the meteorological conditions are related to the concentrations to some degree and are expected to be significant indicators in the prediction process later.

Figure 12. Correlation matrix of pollutants and meteorological conditions in Beijing.

4.2. Intercity Correlation

Next, since our focus is the spatial-temporal behavior of the concentration in the BTH region, it is necessary to investigate the correlation between pollutants in different cities. Figure 13 displays the intercity correlation of pollutants between Beijing and all the other 12 cities in the BTH region. Most of the pairs among CO, NO2, PM2.5 and PM10 (e.g., CO in Beijing and PM2.5 in the other cities) seem to have a positive correlation higher than 0.5. Both SO2 and O3 seem less strongly associated with the other pollutants, and O3 typically has an inverse relationship with each of them.

Figure 13. Correlation matrix of pollutants in Beijing and aggregation of the other cities. Columns (x-axis in each cell): pollutants in Beijing. Rows (with suffix “A”, y-axis in each cell): pollutants in another city. The label at the top-left of each cell indicates the correlation of the pollutants in the two cities. Same in the next two figures.

Examining each pair of Beijing and the other 12 cities, we found that the distance between them might be an influential factor. For example, Figure 14 shows the intercity correlation of pollutants between Beijing and Langfang, which is the closest city to Beijing in the dataset. In the plot, many pairs of variables show a strong negative or positive correlation. Typically, most of the pairs among CO, NO2, SO2, PM2.5 and PM10 in the two cities exhibit a positive correlation of more than 0.5. Meanwhile, O3 seems negatively correlated or almost uncorrelated with the other pollutants.

Figure 14. Correlation matrix of pollutants in Beijing and Langfang.

Figure 15. Correlation matrix of pollutants in Beijing and Handan.

Figure 16. Correlation of pollutant in Beijing and other cities, at a glance. (One row per city; x-axis: correlation vs. Beijing's value, 0 to 1; markers: CO, NO2, O3, PM10, PM2.5 and SO2.)

Figure 17. Correlation of pollutant in Beijing and other cities, sorted by distance. (x-axis: correlation vs. Beijing's value, 0 to 1; markers: CO, NO2, O3, PM10, PM2.5 and SO2.)

On the other hand, pollutants other than O3 become less strongly correlated when the two cities are distant. In Figure 15, which illustrates the correlation of pollutants between Beijing and Handan, the furthest city from Beijing, fewer pairs of variables exhibit a correlation either below -0.5 or above 0.5. Even the correlation between the same pollutant (e.g., CO in Beijing and CO in Handan) remains at a moderate level, except for O3. Based on this data, we can infer that distance is one of the important factors in estimating the concentration of a certain pollutant in Beijing.

Figure 16 summarizes the correlation between the same type of pollutant in Beijing and another city. It shows that O3 overall has a higher average intercity correlation than the others, and that some cities, such as Chengde or Langfang, have a relatively higher correlation for all the pollutants. When the correlations are plotted in order of distance from Beijing (Figure 17), the correlation between the same pollutant in Beijing and another city decreases with distance. Therefore, we can conclude that the distance from Beijing largely affects the correlation between the pollutants in two cities.

4.3. Wind and Pollutants

Another factor that may affect the concentration of pollutants in Beijing is wind. Intense wind in a city is expected to lead to a smaller concentration of a pollutant there. In addition, wind blowing continuously from one city to Beijing may carry pollutants into Beijing if the origin has a source of contamination. Given this argument, Figure 18 displays the correlation between pollution in Beijing and the intensity of wind from each city to Beijing, using the contribution of wind (CW) variable. In the scatterplots, most of the concentrations have a weak to moderate negative correlation with the wind intensity in Beijing, as one could expect. By contrast, there are few significant correlations between the wind from the surrounding cities and the pollution in Beijing. In particular, CO seems uncorrelated with the wind in any city other than Zhangjiakou. Even though NO2, SO2, PM2.5 and PM10 show some positive or negative correlation with the wind from certain cities, including Cangzhou, Handan, Hengshui and Xingtai, all of these correlations are smaller than 0.3 in magnitude, so we cannot conclude that those pollutants in Beijing depend on the wind flowing into the city. O3 again behaves differently in this metric, showing a positive correlation with the wind in Beijing, Baoding, Langfang, Shijiazhuang and Tianjin.

Figure 18. Correlation matrix of pollution in Beijing and the intensity of wind to Beijing. Columns (x-axis in each cell): concentration of pollutants in Beijing. Rows (y-axis in each cell): intensity of wind from the city to Beijing.

5. Predictions

In this chapter, we compose three regression models to predict the concentration of each pollutant 1, 8, or 24 hours ahead. Each model takes as input the current (and, if necessary, past) concentration of the target pollutant in all the cities and the current meteorological conditions in Beijing, and outputs the future concentration of that pollutant.

5.1. Data Standardization

Before presenting the prediction models, we preprocessed the data table for more efficient and accurate results. The time series of each pollutant shown in chapter 3 contain many peaks. Accordingly, as Figure 19 shows, each histogram is highly skewed, which is not appropriate for a linear regression. In addition, because we omitted the negative values of CW, its histogram is also highly skewed. Using the logarithm of each such variable is recommended to make it coherent with the other variables while still preserving the linear model [38]. Hence, we applied the log transformation to all the pollutants and CW in each city.

Subsequently, all the variables in each city were standardized to make their variances and means consistent. The scaling for the variance and the offset for the mean were saved and later used to convert the predictions generated by the models back to the actual scale.
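The two preprocessing steps and the conversion back to the actual scale can be sketched as follows. This is an illustration under assumptions that the text leaves open: the natural logarithm and the population standard deviation are used, the function names are hypothetical, and the input values are made up.

```python
import math
import statistics


def fit_log_scaler(values):
    """Return the mean and standard deviation of the log-transformed values."""
    logs = [math.log(v) for v in values]
    return statistics.fmean(logs), statistics.pstdev(logs)


def to_model_scale(values, mean, std):
    """Log-transform, then standardize with the saved mean and deviation."""
    return [(math.log(v) - mean) / std for v in values]


def to_actual_scale(scaled, mean, std):
    """Undo the standardization and the log transformation."""
    return [math.exp(z * std + mean) for z in scaled]


raw = [1.0, 10.0, 100.0]  # made-up concentrations, strictly positive
mean, std = fit_log_scaler(raw)
scaled = to_model_scale(raw, mean, std)
restored = to_actual_scale(scaled, mean, std)
```

Keeping `mean` and `std` alongside each variable is what allows a model output on the standardized scale to be reported as a concentration again.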

5.2. Models

For prediction, we propose the following three models in this chapter: a linear regression model, a compound linear regression model and an LSTM-RNN model. The details of each model are introduced in the following sections.

Figure 19. Histogram of CO, NO2, SO2, O3, PM2.5 and PM10 in Beijing. (x-axes: CO in mg/m3, the other pollutants in ug/m3.)

Figure 20. Histogram of AP, HM, TP, and CW in Beijing. (x-axes: AP in hPa, HM in %, TP in C, CW in m/s.)

5.2.1. Linear Regression

The linear regression model is the most basic prediction method and has long been used in many fields. It assumes that the objective variable can be expressed as a linear combination of the explanatory variables. In this study, the objective variable is $P^{1}_{i(t+L)}$, and the explanatory variables are $P^{k}_{i(t)}$ and $X^{1}_{j(t)}$.

$$P^{1}_{i(t+L)} = \sum_{k=1}^{13} a_0^{k} P^{k}_{i(t)} + \sum_{j} a_j X^{1}_{j(t)}$$

where $a_0^{k}$ and $a_j$ represent the coefficients of the variables. This model has the simplest structure of all the models, but it cannot utilize past information to estimate the future value (i.e., it depends only on the values at time $t$).
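Fitting such a model amounts to pairing the predictors at hour $t$ with Beijing's value at hour $t+L$ and solving a least-squares problem. The sketch below is one way to do this with the standard library only; it is not the software used in this thesis. The function names are hypothetical, an intercept term is included as an assumption, and the data are synthetic, generated so that the true coefficients are known.

```python
def make_samples(beijing, other_cities, weather, lead):
    """Pair the predictors at hour t with Beijing's value at hour t + lead."""
    rows, targets = [], []
    for t in range(len(beijing) - lead):
        predictors = [beijing[t]]
        predictors += [series[t] for series in other_cities]
        predictors += [series[t] for series in weather]
        rows.append([1.0] + predictors)  # leading 1.0 is the intercept term
        targets.append(beijing[t + lead])
    return rows, targets


def fit_least_squares(rows, targets):
    """Solve the normal equations (A^T A) w = A^T y by Gauss-Jordan elimination."""
    p = len(rows[0])
    aug = []
    for i in range(p):
        lhs = [sum(r[i] * r[j] for r in rows) for j in range(p)]
        rhs = sum(r[i] * y for r, y in zip(rows, targets))
        aug.append(lhs + [rhs])
    for col in range(p):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, p), key=lambda idx: abs(aug[idx][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(p):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][p] / aug[i][i] for i in range(p)]


# Synthetic check: Beijing follows b[t+1] = 1 + 0.5*b[t] + 2*o[t] exactly.
other = [1.0, 0.0, 2.0, 1.0, 3.0, 0.0, 2.0]
beijing = [0.0]
for t in range(6):
    beijing.append(1.0 + 0.5 * beijing[t] + 2.0 * other[t])

rows, targets = make_samples(beijing, [other], [], lead=1)
weights = fit_least_squares(rows, targets)  # intercept, Beijing, other city
```

Because the synthetic series obeys the linear rule exactly, the fitted weights recover the intercept 1, the Beijing coefficient 0.5 and the neighboring-city coefficient 2 up to rounding error.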

Table 6 to Table 11 summarize all the linear models. Note that the coefficients are comparable between different types of variables because the variables were standardized in the previous section. Overall, the 1-hour ahead models contain few significant variables, and the number of significant variables grows as the prediction interval increases. On the one hand, this means that a long-term prediction model needs to take into account the current air pollution and wind conditions in neighboring cities. On the other hand, the future concentration becomes more difficult to estimate with few variables, because the current conditions in other cities affect the value as the pollutant propagates to Beijing over time.

Looking at the models for CO prediction in detail, the variables for Langfang and Zhangjiakou generally have a higher absolute coefficient and significance. This is partly because these two cities are closer to Beijing than the others, as shown in Table 1. The current pollution in Beijing is certainly the most significant and decisive factor for the 1-hour and 8-hour ahead predictions. However, in the 24-hour prediction the highest coefficient belongs to the current pollution in Baoding, which implies that the current concentration of CO in Beijing might not be appropriate for estimating the value on the next day.

The models for NO2 prediction have more significant variables, which indicates that the prediction is not easy for a simple linear model to handle. Apart from Beijing, only the variables (CW and NO2) for Langfang and Zhangjiakou are significant at the 95% level in all the cases. The models for SO2 prediction show a similar tendency: only the variables for Zhangjiakou are significant at the 95% level in all prediction intervals, while those for Langfang are significant only in the 1-hour prediction. O3 prediction seems the most complicated among all the prediction models. Only the variables for Handan are significant in all the cases, but Handan is the most remote city from Beijing (399 km away), so the performance of these models is highly questionable.

As PM2.5 and PM10 have similar characteristics, the models for these predictions share some common features. Interestingly, the variables for Zhangjiakou play the most significant role in these models as well.

Regarding the meteorological conditions other than wind, humidity and temperature overall work well for all the pollutants, and air pressure becomes important for some predictions. In particular, temperature seems highly correlated with the future value of O3, as its coefficient exceeds 0.1 in all the intervals.

Meanwhile, oddly, the coefficient of CW for Zhangjiakou is less than -0.1 in all the models for all pollutants except O3. Considering that the contribution of wind represents the vertical component of the wind from Zhangjiakou to Beijing and that there is no negative contribution of wind, this raises a question about the reliability and performance of the models.

The prediction result of this model will be shown in section 5.3.

Table 6. Variables in linear regression: For CO prediction
Coef. and Sig. respectively represent the coefficient and significance of the corresponding variable. *, **, ***: significant at the 95%, 99%, and 99.9% confidence level. Same in the rest of this chapter.

        Variable      1-hour prediction 8-hour prediction 24-hour prediction
                      Coef.     Sig.    Coef.     Sig.    Coef.     Sig.
        (Intercept)   <0.01             -0.022    ***     -0.026    **
        AP            <0.01             <0.01             0.070     ***
        HM            0.026     ***     -0.062    ***     -0.039    **
        TP            -0.031    ***     -0.11     ***     -0.10     ***
CW      Beijing       <0.01     ***     -0.038    ***     -0.071    ***
CW      Baoding       <0.01             0.038     ***     -0.012
CW      Cangzhou      <0.01     **      0.040     ***     0.047     ***
CW      Chengde       <0.01             <0.01             -0.040    ***
CW      Handan        <0.01             <0.01             -0.032    *
CW      Hengshui      <0.01             0.044     ***     <0.01
CW      Langfang      0.012     ***     0.034     ***     -0.015
CW      Qinhuangdao   <0.01             <0.01             0.040     ***
CW      Shijiazhuang  <0.01             -0.030    ***     0.024     *
CW      Tangshan      <0.01             0.034     ***     -0.028    *
CW      Tianjin       <0.01             0.038     ***     -0.014
CW      Xingtai       <0.01             0.027     *       0.020
CW      Zhangjiakou   <0.01     ***     -0.12     ***     -0.25     ***
CO      Beijing       0.92      ***     0.43      ***     0.062     **
CO      Baoding       <0.01             0.059     ***     0.17      ***
CO      Cangzhou      <0.01     **      -0.022    *       <0.01
CO      Chengde       <0.01     *       0.077     ***     0.061     ***
CO      Handan        <0.01     *       -0.045    ***     0.023
CO      Hengshui      <0.01             -0.024    *       -0.103    ***
CO      Langfang      0.028     ***     0.050     ***     0.091     ***
CO      Qinhuangdao   <0.01             -0.060    ***     0.023
CO      Shijiazhuang  <0.01             0.047     ***     0.076     ***
CO      Tangshan      <0.01             0.022     *       -0.033    **
CO      Tianjin       0.017     ***     <0.01             <0.01
CO      Xingtai       <0.01             <0.01             <0.01
CO      Zhangjiakou   0.018     ***     0.12      ***     0.078

Table 7. Variables in linear regression: For NO2 prediction

        Variable      1-hour prediction 8-hour prediction 24-hour prediction
                      Coef.     Sig.    Coef.     Sig.    Coef.     Sig.
        (Intercept)   <0.01             -0.031    ***     -0.020    *
        AP            -0.029    ***     <0.01             0.033
        HM            <0.01             -0.13     ***     -0.037    **
        TP            -0.069    ***     -0.15     ***     -0.074    ***
CW      Beijing       -0.016    ***     -0.054    ***     -0.044    ***
CW      Baoding       0.013     ***     0.046     ***     <0.01
CW      Cangzhou      <0.01             0.051     ***     0.015
CW      Chengde       <0.01             -0.017            <0.01
CW      Handan        <0.01             <0.01             -0.031    *
CW      Hengshui      <0.01             0.030             -0.018
CW      Langfang      0.014     ***     0.035     ***     -0.041    ***
CW      Qinhuangdao   <0.01             <0.01             0.034     **
CW      Shijiazhuang  <0.01     *       -0.031    **      0.044     ***
CW      Tangshan      <0.01     *       0.010             -0.092    ***
CW      Tianjin       0.013     ***     0.040     ***     -0.074    ***
CW      Xingtai       <0.01             0.051     ***     0.038     *
CW      Zhangjiakou   <0.01     **      -0.090    ***     -0.17     ***
NO2     Beijing       0.91      ***     0.38      ***     0.092     ***
NO2     Baoding       <0.01             0.049     **      0.031
NO2     Cangzhou      <0.01             0.016             <0.01
NO2     Chengde       <0.01             <0.01             0.045     *
NO2     Handan        -0.027    ***     -0.010            0.039     **
NO2     Hengshui      0.018     ***     -0.047    **      0.069     ***
NO2     Langfang      0.013     *       -0.11     ***     0.047     **
NO2     Qinhuangdao   0.015     ***     -0.021            0.084     ***
NO2     Shijiazhuang  -0.020    ***     0.032             <0.01
NO2     Tangshan      -0.016    ***     -0.10     ***     -0.039    *
NO2     Tianjin       <0.01             0.038     *       0.070     ***
NO2     Xingtai       0.020     ***     -0.014            0.017
NO2     Zhangjiakou   0.030     ***     0.27      ***     0.16      ***

Table 8. Variables in linear regression: For SO2 prediction

        Variable      1-hour prediction 8-hour prediction 24-hour prediction
                      Coef.     Sig.    Coef.     Sig.    Coef.     Sig.
        (Intercept)   <0.01             <0.01             -0.018    *
        AP            0.011     **      0.028     *       0.13      ***
        HM            <0.01     **      -0.087    ***     -0.21     ***
        TP            <0.01             -0.18     ***     -0.12     ***
CW      Beijing       <0.01             <0.01             0.015
CW      Baoding       <0.01     **      0.017     *       0.013
CW      Cangzhou      <0.01             0.043     ***     0.039     **
CW      Chengde       <0.01     *       <0.01             -0.013
CW      Handan        <0.01     *       0.023     *       <0.01
CW      Hengshui      <0.01             0.019             <0.01
CW      Langfang      0.012     ***     <0.01             <0.01
CW      Qinhuangdao   <0.01             <0.01             <0.01
CW      Shijiazhuang  <0.01     *       -0.015    *       0.035     ***
CW      Tangshan      <0.01             0.041     ***     -0.020
CW      Tianjin       <0.01             0.019     *       -0.046    ***
CW      Xingtai       <0.01             0.023     *       -0.025
CW      Zhangjiakou   -0.012    ***     -0.14     ***     -0.16     ***
SO2     Beijing       0.91      ***     0.48      ***     0.16      ***
SO2     Baoding       <0.01             0.079     ***     0.037     **
SO2     Cangzhou      0.013     ***     <0.01             <0.01
SO2     Chengde       <0.01             -0.071    ***     0.049     ***
SO2     Handan        <0.01             0.014             0.061     ***
SO2     Hengshui      <0.01             <0.01             0.036     **
SO2     Langfang      0.029     ***     <0.01             -0.011
SO2     Qinhuangdao   <0.01             0.033     **      0.019
SO2     Shijiazhuang  <0.01             0.016             0.053     **
SO2     Tangshan      <0.01     **      <0.01             -0.11     ***
SO2     Tianjin       <0.01             0.060     ***     0.081     ***
SO2     Xingtai       <0.01             0.035     **      0.088     ***
SO2     Zhangjiakou   0.030     ***     0.069     ***     0.094     ***

Table 9. Variables in linear regression: For O3 prediction

        Variable      1-hour prediction 8-hour prediction 24-hour prediction
                      Coef.     Sig.    Coef.     Sig.    Coef.     Sig.
        (Intercept)   <0.01             0.032     ***     <0.01
        AP            0.044     ***     0.086     ***     0.085     ***
        HM            -0.040    ***     0.052     ***     0.053     ***
        TP            0.12      ***     0.43      ***     0.26      ***
CW      Beijing       0.012     ***     0.054     ***     0.023     **
CW      Baoding       <0.01             -0.055    ***     <0.01
CW      Cangzhou      <0.01     *       -0.026            <0.01
CW      Chengde       <0.01             0.037     ***     0.016
CW      Handan        0.019     ***     0.082     ***     0.030     *
CW      Hengshui      0.014     ***     0.027             0.022
CW      Langfang      -0.022    ***     -0.17     ***     <0.01
CW      Qinhuangdao   0.010     ***     -0.031    **      -0.056    ***
CW      Shijiazhuang  <0.01             <0.01             -0.047    ***
CW      Tangshan      -0.012    ***     -0.030    *       0.052     ***
CW      Tianjin       -0.011    ***     -0.041    ***     0.016
CW      Xingtai       -0.020    ***     -0.067    ***     -0.022
CW      Zhangjiakou   <0.01     *       <0.01             0.11      ***
O3      Beijing       0.89      ***     0.33      ***     0.14      ***
O3      Baoding       -0.019    ***     0.077     ***     0.11      ***
O3      Cangzhou      <0.01             <0.01             0.048     **
O3      Chengde       0.033     ***     0.017             0.21      ***
O3      Handan        -0.020    ***     -0.137    ***     0.092     ***
O3      Hengshui      -0.026    ***     -0.097    ***     0.031
O3      Langfang      -0.023    ***     0.050     *       -0.032
O3      Qinhuangdao   0.015     ***     <0.01             -0.047    ***
O3      Shijiazhuang  -0.016    ***     -0.071    ***     -0.019
O3      Tangshan      <0.01             -0.019            0.092     ***
O3      Tianjin       0.030     ***     0.029             0.018
O3      Xingtai       0.016     **      -0.092    ***     <0.01
O3      Zhangjiakou   0.020     ***     0.21      ***     0.091     ***

Table 10. Variables in linear regression: For PM2.5 prediction

        Variable      1-hour prediction 8-hour prediction 24-hour prediction
                      Coef.     Sig.    Coef.     Sig.    Coef.     Sig.
        (Intercept)   <0.01             -0.012            -0.020    *
        AP            <0.01             <0.01             0.033
        HM            <0.01             -0.053    ***     -0.080    ***
        TP            <0.01     *       -0.027    *       -0.025
CW      Beijing       -0.013    ***     -0.045    ***     -0.070    ***
CW      Baoding       0.012     ***     0.048     ***     0.012
CW      Cangzhou      0.011     ***     0.048     ***     0.087     ***
CW      Chengde       <0.01             -0.052    ***     -0.077    ***
CW      Handan        <0.01     *       0.031     **      -0.023
CW      Hengshui      <0.01             0.018             0.035     *
CW      Langfang      0.011     ***     0.035     ***     <0.01
CW      Qinhuangdao   <0.01     **      -0.027    **      <0.01
CW      Shijiazhuang  <0.01             -0.022    **      0.018
CW      Tangshan      <0.01             0.014             -0.061    ***
CW      Tianjin       <0.01     *       0.042     ***     <0.01
CW      Xingtai       <0.01             0.015             <0.01
CW      Zhangjiakou   -0.014    ***     -0.17     ***     -0.31     ***
PM2.5   Beijing       0.96      ***     0.56      ***     0.24      ***
PM2.5   Baoding       <0.01             0.021             0.13      ***
PM2.5   Cangzhou      <0.01             -0.011            0.095     ***
PM2.5   Chengde       <0.01     **      -0.016            -0.043    *
PM2.5   Handan        <0.01             0.028     *       <0.01
PM2.5   Hengshui      <0.01             -0.016            -0.038    *
PM2.5   Langfang      <0.01             -0.046    **      -0.081    ***
PM2.5   Qinhuangdao   <0.01             -0.053    ***     <0.01
PM2.5   Shijiazhuang  <0.01     *       -0.016            0.012
PM2.5   Tangshan      <0.01             0.039     **      -0.028
PM2.5   Tianjin       <0.01             0.093     ***     0.041     *
PM2.5   Xingtai       <0.01             0.028             0.012
PM2.5   Zhangjiakou   0.016     ***     0.070     ***     0.017

Table 11. Variables in linear regression: For PM10 prediction

        Variable      1-hour prediction 8-hour prediction 24-hour prediction
                      Coef.     Sig.    Coef.     Sig.    Coef.     Sig.
        (Intercept)   <0.01             <0.01             -0.013
        AP            <0.01             <0.01             0.026
        HM            <0.01             -0.11     ***     -0.10     ***
        TP            <0.01             <0.01             -0.040    *
CW      Beijing       -0.012    ***     -0.046    ***     -0.075    ***
CW      Baoding       0.017     ***     0.071     ***     0.045     ***
CW      Cangzhou      0.018     ***     0.044     ***     0.067     ***
CW      Chengde       <0.01             -0.054    ***     -0.038    **
CW      Handan        <0.01             0.053     ***     <0.01
CW      Hengshui      <0.01             0.024             0.016
CW      Langfang      0.018     ***     0.034     ***     <0.01
CW      Qinhuangdao   <0.01             <0.01             0.015
CW      Shijiazhuang  <0.01     *       <0.01             0.022
CW      Tangshan      <0.01             -0.015            -0.076    ***
CW      Tianjin       0.011     **      0.024     *       -0.013
CW      Xingtai       -0.011    *       -0.012            0.046     **
CW      Zhangjiakou   <0.01     *       -0.15     ***     -0.23     ***
PM10    Beijing       0.86      ***     0.34      ***     0.16      ***
PM10    Baoding       <0.01             <0.01             0.066     ***
PM10    Cangzhou      <0.01             <0.01             0.075     ***
PM10    Chengde       0.011     *       0.079     ***     <0.01
PM10    Handan        <0.01             0.061     ***     0.072     ***
PM10    Hengshui      <0.01             -0.044    **      -0.030
PM10    Langfang      0.033     ***     0.055     ***     -0.047    *
PM10    Qinhuangdao   <0.01             -0.079    ***     0.016
PM10    Shijiazhuang  <0.01             0.045     **      0.020
PM10    Tangshan      <0.01             0.085     ***     -0.018
PM10    Tianjin       <0.01             <0.01             0.051     **
PM10    Xingtai       <0.01             <0.01             -0.017
PM10    Zhangjiakou   0.047     ***     0.14      ***     0.046     ***

5.2.2. Compound Linear Regression

The problem with the previous linear regression model is that it cannot account for the time lag that a pollutant needs to propagate from one city to Beijing. To resolve this issue, we introduce another model, namely the compound linear regression model. In this model, a modified variable $P^{(1,k)}_{i'(t)}$ substitutes for the previous $P^{k}_{i(t)}$. Using the cross-correlation function (CCF, $\gamma$), it can be written as:

$$P^{(1,k)}_{i'(t)} = P^{k}_{i\left(t - L^{(1,k)}_{i(t)}\right)}$$

$$L^{(1,k)}_{i(t)} = \operatorname*{argmax}_{\tau}\, \gamma\left(P^{1}_{i(t)},\ P^{k}_{i(t)} \times X^{(1,k)}_{CW(t)}\right) = \operatorname*{argmax}_{\tau} \left(\sum_{t} P^{1}_{i(t)} \times P^{k}_{i(t-\tau)} \times X^{(1,k)}_{CW(t-\tau)}\right)$$

This equation finds the optimal lag for the pollutant to propagate from city $k$ to city 1 (Beijing). Because the pollution in city $k$ is multiplied by its contribution of wind to Beijing, the equation also evaluates how strongly the pollutant is carried into Beijing. However, the optimal lag is expected to vary over the seasons, as the prevailing wind direction and speed vary during the year. Hence, we extracted the concentration and the contribution of wind for every 288 hours (= 12 days) to evaluate the optimal lag in that period, and then applied loess smoothing to interpolate the lag. Figure 21 shows the optimal lag for the 1-hour ahead CO prediction. Langfang and Tianjin, which are closer to Beijing, have a consistently smaller lag throughout the year. However, Zhangjiakou, which is the third closest city to Beijing, has the longest lag overall, and some cities that are not so close to Beijing, such as Cangzhou and Handan, have a very small lag in the spring and fall seasons.
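The lag search over 288-hour periods can be sketched as follows. This is an illustration only: the function names are hypothetical, the maximum lag to search (`max_lag`) is an assumption because the text does not give one, the loess smoothing step is omitted, and the input series are synthetic pulses.

```python
def optimal_lag(beijing, source, wind, max_lag):
    """Lag tau (hours) maximizing sum_t beijing[t] * source[t-tau] * wind[t-tau].

    The sum starts at t = max_lag so that every candidate lag is scored
    over the same hours.
    """
    best_lag, best_score = 0, float("-inf")
    for tau in range(max_lag + 1):
        score = sum(
            beijing[t] * source[t - tau] * wind[t - tau]
            for t in range(max_lag, len(beijing))
        )
        if score > best_score:
            best_lag, best_score = tau, score
    return best_lag


def lags_per_window(beijing, source, wind, window=288, max_lag=120):
    """Optimal lag for each consecutive block of `window` hours."""
    lags = []
    for start in range(0, len(beijing) - window + 1, window):
        stop = start + window
        lags.append(optimal_lag(beijing[start:stop], source[start:stop],
                                wind[start:stop], max_lag))
    return lags


# Synthetic pulses: a peak in the source city reaches Beijing 3 hours later
# in the first block and 1 hour later in the second block.
source = [0.0] * 20
beijing = [0.0] * 20
wind = [1.0] * 20
source[2], beijing[5] = 5.0, 5.0
source[15], beijing[16] = 5.0, 5.0
lags = lags_per_window(beijing, source, wind, window=10, max_lag=4)
```

When several lags give the same score, this sketch keeps the smallest one; the per-window lags would then be smoothed over time before being used to shift the predictor series.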

Figure 21. Optimal lag for 1-hour ahead CO prediction. (x-axis: time, January 2016 to January 2017; one line per city: Baoding, Cangzhou, Chengde, Handan, Hengshui, Langfang, Qinhuangdao, Shijiazhuang, Tangshan, Tianjin, Xingtai and Zhangjiakou.)

Based on this result, Table 12 to Table 17 present the coefficients and significance of the variables of each model in the same manner as in the previous section. Thanks to the reduction in the number of variables, the summaries are clearer than the previous ones.

Looking at the detail of each model, as in the linear regression models, there are more significant variables as the prediction interval increases. Interestingly, the most significant variable outside Beijing for CO prediction is again the one for Handan, although it is located furthest from Beijing. The models for NO2 and O3 prediction have many significant variables, but some of them have variables with high coefficients (e.g., the 8-hour prediction of O3). For SO2, PM2.5 and PM10 prediction, as for CO prediction, the number of significant variables clearly differs between the 1-hour and the 8/24-hour predictions, so the model seems to capture the mechanism of propagation of such pollutants successfully.

The prediction result of this model will be shown in section 5.3.

Table 12. Variables in compound linear regression: For CO prediction

        Variable      1-hour prediction 8-hour prediction 24-hour prediction
                      Coef.     Sig.    Coef.     Sig.    Coef.     Sig.
        (Intercept)   <0.01             -0.016    *       -0.029    **
        AP            <0.01             <0.01             0.14      ***
        HM            0.016     ***     -0.057    ***     -0.021
        TP            -0.018    ***     <0.01             -0.026
CW      Beijing       -0.016    ***     -0.070    ***     -0.10     ***
CO      Beijing       0.97      ***     0.67      ***     0.39      ***
CO’     Baoding       <0.01             0.060     ***     <0.01
CO’     Cangzhou      <0.01             <0.01             0.043     ***
CO’     Chengde       <0.01             <0.01             0.017
CO’     Handan        -0.011    ***     -0.059    ***     -0.042    ***
CO’     Hengshui      <0.01             0.026     *       <0.01
CO’     Langfang      <0.01     *       0.084     ***     0.060     ***
CO’     Qinhuangdao   <0.01             <0.01             0.055     ***
CO’     Shijiazhuang  <0.01             <0.01             -0.011
CO’     Tangshan      <0.01             <0.01             -0.024    *
CO’     Tianjin       <0.01             <0.01             <0.01
CO’     Xingtai       <0.01             <0.01             -0.064    ***
CO’     Zhangjiakou   <0.01             <0.01             -0.029    **

Table 13. Variables in compound linear regression: For NO2 prediction

        Variable      1-hour prediction 8-hour prediction 24-hour prediction
                      Coef.     Sig.    Coef.     Sig.    Coef.     Sig.
        (Intercept)   <0.01             -0.016            -0.020    *
        AP            -0.030    ***     -0.022            0.067     ***
        HM            <0.01             -0.13     ***     -0.032    **
        TP            -0.044    ***     0.069     ***     -0.076    ***
CW      Beijing       -0.018    ***     -0.073    ***     -0.081    ***
NO2     Beijing       0.95      ***     0.47      ***     0.40      ***
NO2’    Baoding       <0.01     *       0.020             0.015
NO2’    Cangzhou      <0.01     *       0.10      ***     0.026     *
NO2’    Chengde       <0.01             0.053     ***     -0.049    ***
NO2’    Handan        <0.01     *       0.039     ***     0.026     *
NO2’    Hengshui      <0.01             -0.032    **      -0.011
NO2’    Langfang      -0.011    ***     0.031     **      <0.01
NO2’    Qinhuangdao   <0.01             <0.01             0.073     ***
NO2’    Shijiazhuang  <0.01             0.038     **      <0.01
NO2’    Tangshan      <0.01     *       0.033     **      0.036     ***
NO2’    Tianjin       0.010     **      0.061     ***     <0.01
NO2’    Xingtai       <0.01             <0.01             -0.037    **
NO2’    Zhangjiakou   <0.01     ***     -0.064    ***     -0.026    *

Table 14. Variables in compound linear regression: For SO2 prediction

        Variable      1-hour prediction 8-hour prediction 24-hour prediction
                      Coef.     Sig.    Coef.     Sig.    Coef.     Sig.
        (Intercept)   <0.01             <0.01             -0.020    *
        AP            0.017     ***     0.058     ***     0.14      ***
        HM            <0.01             -0.022    **      -0.22     ***
        TP            <0.01             -0.086    ***     -0.10     ***
CW      Beijing       <0.01     ***     -0.034    ***     -0.021    *
SO2     Beijing       0.97      ***     0.57      ***     0.33      ***
SO2’    Baoding       <0.01             0.017             <0.01
SO2’    Cangzhou      <0.01             0.024     *       <0.01
SO2’    Chengde       <0.01             0.031     ***     -0.025    **
SO2’    Handan        <0.01             -0.038    ***     <0.01
SO2’    Hengshui      <0.01             0.039     ***     0.014
SO2’    Langfang      <0.01             0.096     ***     0.061     ***
SO2’    Qinhuangdao   <0.01             <0.01             0.048     ***
SO2’    Shijiazhuang  <0.01             0.026     **      0.022
SO2’    Tangshan      <0.01             -0.017    *       -0.066    ***
SO2’    Tianjin       <0.01             0.012             0.037     **
SO2’    Xingtai       0.013     ***     0.035     ***     0.041     ***
SO2’    Zhangjiakou   <0.01             <0.01             <0.01

Table 15. Variables in compound linear regression: For O3 prediction

Variable              1-hour prediction   8-hour prediction   24-hour prediction
                      Coef.   Sig.        Coef.   Sig.        Coef.   Sig.
(Intercept)           <0.01               <0.01               <0.01
AP                    0.039   ***         0.18    ***         0.034   *
HM                    -0.041  ***         0.010               -0.021  *
TP                    0.10    ***         0.26    ***         0.21    ***
CW Beijing            0.014   ***         0.069   ***         0.043   ***
O3 Beijing            0.92    ***         0.25    ***         0.43    ***
O3’ Baoding           <0.01               -0.12   ***         0.047   ***
O3’ Cangzhou          0.023   ***         0.14    ***         <0.01
O3’ Chengde           <0.01               0.015               -0.012
O3’ Handan            <0.01               0.12    ***         0.034   ***
O3’ Hengshui          <0.01               0.068   ***         0.042   ***
O3’ Langfang          <0.01   **          -0.079  ***         0.074   ***
O3’ Qinhuangdao       <0.01               0.012               -0.010
O3’ Shijiazhuang      -0.049  ***         0.017               -0.025  *
O3’ Tangshan          <0.01   **          0.12    ***         0.036   ***
O3’ Tianjin           0.014   ***         0.23    ***         -0.048  ***
O3’ Xingtai           <0.01               -0.14   ***         0.070   ***
O3’ Zhangjiakou       <0.01   ***         0.12    ***         0.035   ***

Table 16. Variables in compound linear regression: For PM2.5 prediction

Variable              1-hour prediction   8-hour prediction   24-hour prediction
                      Coef.   Sig.        Coef.   Sig.        Coef.   Sig.
(Intercept)           <0.01               -0.012              -0.029  **
AP                    <0.01               0.020               0.098   ***
HM                    <0.01               -0.048  ***         -0.053  ***
TP                    <0.01               0.040   **          0.063   ***
CW Beijing            -0.015  ***         -0.075  ***         -0.11   ***
PM2.5 Beijing         1.00    ***         0.78    ***         0.47    ***
PM2.5’ Baoding        <0.01               0.031   **          0.037   *
PM2.5’ Cangzhou       <0.01   ***         0.068   ***         0.023
PM2.5’ Chengde        <0.01               -0.010              -0.052  ***
PM2.5’ Handan         <0.01               -0.038  ***         -0.12   ***
PM2.5’ Hengshui       <0.01   **          -0.029  **          0.022
PM2.5’ Langfang       -0.026  ***         -0.046  ***         -0.044  **
PM2.5’ Qinhuangdao    <0.01               <0.01               0.037   **
PM2.5’ Shijiazhuang   <0.01               <0.01               0.034   *
PM2.5’ Tangshan       <0.01               -0.028  ***         <0.01
PM2.5’ Tianjin        <0.01               <0.01               0.029
PM2.5’ Xingtai        <0.01               <0.01               -0.060  ***
PM2.5’ Zhangjiakou    <0.01               0.015   *           0.015

Table 17. Variables in compound linear regression: For PM10 prediction

Variable              1-hour prediction   8-hour prediction   24-hour prediction
                      Coef.   Sig.        Coef.   Sig.        Coef.   Sig.
(Intercept)           <0.01               <0.01               -0.022  *
AP                    -0.011              <0.01               0.030
HM                    <0.01   *           -0.095  ***         -0.093  ***
TP                    <0.01               0.065   ***         <0.01
CW Beijing            -0.016  ***         -0.082  ***         -0.11   ***
PM10 Beijing          0.95    ***         0.64    ***         0.41    ***
PM10’ Baoding         <0.01               0.025   *           <0.01
PM10’ Cangzhou        <0.01               0.032   **          0.022
PM10’ Chengde         <0.01               <0.01               -0.087  ***
PM10’ Handan          <0.01               <0.01               -0.057  ***
PM10’ Hengshui        <0.01               <0.01               <0.01
PM10’ Langfang        <0.01               0.037   **          -0.026
PM10’ Qinhuangdao     -0.011  **          -0.030  **          0.010
PM10’ Shijiazhuang    <0.01               0.038   ***         0.029   *
PM10’ Tangshan        <0.01               0.020               0.018
PM10’ Tianjin         <0.01   *           0.041   ***         0.048   ***
PM10’ Xingtai         <0.01               -0.024  *           -0.020
PM10’ Zhangjiakou     <0.01               -0.019  *           0.032   **

5.2.3. LSTM-RNN

The last model we present in this thesis is the LSTM-RNN model, an extension of the Recurrent Neural Network (RNN). LSTM-RNN was introduced by Hochreiter and Schmidhuber (1997) [39] and is now known for its good performance in predicting variables that have a specific temporal behavior. Translation and text mining, where the relationship between words and phrases has a specific structure depending on each language, are representative fields in which this method is widely used. The concept of LSTM-RNN is illustrated in Figure 22. The LSTM unit stores the previous output as a part of the next input. Therefore, it can memorize the previous information of a time series to derive the next output.

[Figure 22: diagram of an input layer, an LSTM unit (hidden layer), and an output layer]

Figure 22. Concept of LSTM-RNN.

Each LSTM unit has the structure described in Figure 23. It has three gates, called the input gate, the output gate, and the forget gate, which respectively control whether the unit takes the input $x_t$, outputs the value $h_t$ ($= y_t$), and forgets the cell state $c_t$. In this thesis, we adopt a bidirectional LSTM (Figure 24), which contains two LSTM layers running in opposite directions (forward and backward) so that it can process both the past and the future of the currently evaluated time series to optimize the network.
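The gate behavior described above can be sketched for a single scalar LSTM unit. This is only an illustration in Python, not the keras-R network used in this study: a standard LSTM formulation without peephole connections is assumed, and the function names and the weight names (`w_i`, `u_i`, `b_i`, and so on) are hypothetical.

```python
import math


def sigmoid(v):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x_t, h_prev, c_prev, w):
    """Advance one scalar LSTM unit by one time step.

    `w` is a dict of hypothetical weights: for the input gate (i), forget
    gate (f), output gate (o), and candidate state (g) it holds an input
    weight w_*, a recurrent weight u_*, and a bias b_*.
    """
    i_t = sigmoid(w["w_i"] * x_t + w["u_i"] * h_prev + w["b_i"])    # how much input to take
    f_t = sigmoid(w["w_f"] * x_t + w["u_f"] * h_prev + w["b_f"])    # how much cell state to keep
    o_t = sigmoid(w["w_o"] * x_t + w["u_o"] * h_prev + w["b_o"])    # how much to output
    g_t = math.tanh(w["w_g"] * x_t + w["u_g"] * h_prev + w["b_g"])  # candidate cell state
    c_t = f_t * c_prev + i_t * g_t
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t


def run_sequence(xs, w):
    """Feed a time series through the unit, carrying h and c forward."""
    h, c = 0.0, 0.0
    outputs = []
    for x in xs:
        h, c = lstm_step(x, h, c, w)
        outputs.append(h)
    return outputs


def run_bidirectional(xs, w_fwd, w_bwd):
    """Pair the forward-pass output with the backward-pass output per time step."""
    forward = run_sequence(xs, w_fwd)
    backward = run_sequence(xs[::-1], w_bwd)[::-1]
    return list(zip(forward, backward))
```

In a bidirectional layer, the two outputs at each time step are combined (for example, concatenated) before they are passed to the next layer.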

For this study, we used keras-R with TensorFlow GPU 1.13.1 on hardware with an NVIDIA GeForce GTX 1050 Ti. Table 18 gives the details of the network. In addition to the bidirectional LSTM layer, there is one dropout layer in the network that omits 5% of the input to prevent overfitting. We also set an early stopping parameter that monitors the validation loss using the last 24 days of the training dataset and stops training if the loss no longer decreases within a certain number of epochs. The results of this model are shown in the next section.
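The early stopping rule can be sketched as follows. This is a simplified illustration in Python under the assumption that training stops once the validation loss has failed to improve for `patience` consecutive epochs; the function name and the loss values in the usage note are hypothetical, and the actual training relied on the early stopping facility of Keras.

```python
def epochs_until_stop(val_losses, patience):
    """Return (epochs run, best validation loss) under early stopping.

    `val_losses` stands in for the validation loss observed after each
    epoch; in real training it is produced one epoch at a time.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best
    return len(val_losses), best
```

For instance, with the made-up losses `[1.0, 0.8, 0.9, 0.85, 0.7, 0.75, 0.76]`, a patience of 2 stops after the fourth epoch, while a patience of 3 runs through all seven epochs.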

Table 18. Details of the LSTM-RNN network

Layer                 Hyperparameter   Setting Value
(Input)               (Dimension)      29* × S**
Bidirectional LSTM    Units            (Input) × 16
Dropout               Rate             0.05
Dense                 Units            1
(Output)              (Dimension)      1 × 1

Parameter        Description                                                    Setting Value
Batch            Number of records for one update of network state.             32
Epoch            Number of loops for the training.                              100
Early stopping   Stopping training if the validation loss no longer improves.   True
Patience         Epochs where no improvement is allowed.                        P***
Validation       Number of records used for validation.                         576 (24 days)

*: AP, HM and TP in Beijing + CW and concentration in 13 cities
**S: number of intervals in prediction (1, 8, or 24 hours)
***P: 30, 15, 5 respectively for prediction intervals of 1, 8, or 24 hours

Figure 23. LSTM Unit [40].

[Figure 24: diagram showing the inputs, a forward LSTM layer, a backward LSTM layer, an activation layer, and the outputs]

Figure 24. Bidirectional LSTM [41].

5.3. Results

In this section, we present the prediction results of the three models introduced above. All the models were trained with the training dataset, which is based on the first 8208 hours (342 days) of the data. We then evaluated the performance of each model by applying it to the test dataset, which represents the last 576 hours (24 days) of the data. Note that the compound linear model and the LSTM-RNN model can include past information, so even the test dataset may contain some information from the first 8208 hours.
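The chronological split can be sketched as below. This is a minimal Python illustration with placeholder data (hour indices instead of concentrations); the constant and function names are hypothetical, and the pairing helper only shows how a target 1, 8, or 24 hours ahead is lined up with the current value.

```python
TRAIN_HOURS = 8208  # first 342 days x 24 hours
TEST_HOURS = 576    # last 24 days x 24 hours


def split_hourly_series(series):
    """Split a chronologically ordered hourly series into training and test parts."""
    if len(series) < TRAIN_HOURS + TEST_HOURS:
        raise ValueError("series is shorter than the training and test periods")
    return series[:TRAIN_HOURS], series[-TEST_HOURS:]


def make_pairs(series, horizon):
    """Pair each value with the value `horizon` hours later."""
    return [(series[t], series[t + horizon]) for t in range(len(series) - horizon)]


hours = list(range(TRAIN_HOURS + TEST_HOURS))  # placeholder data: hour indices
train, test = split_hourly_series(hours)
pairs_8h = make_pairs(test, 8)
```

Because the split is by time rather than random, no observation from the test period is used for fitting.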

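As evaluation criteria, we selected the Index of Agreement (IA) and the Root Mean Squared Error (RMSE):

$$\mathrm{IA} = 1 - \frac{\sum_{k=1}^{n} \left( \hat{Y}_k - Y_k \right)^2}{\sum_{k=1}^{n} \left( \left| \hat{Y}_k - \bar{Y} \right| + \left| Y_k - \bar{Y} \right| \right)^2}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{k=1}^{n} \left( \hat{Y}_k - Y_k \right)^2}$$

where $n$ is the total number of hours in the test dataset ($n = 576$), $\hat{Y}_k$ is the predicted value of the concentration, $Y_k$ is the actual concentration, and $\bar{Y}$ is the mean of the actual concentration.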
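Both criteria can be computed directly from paired predicted and actual values. The following is a minimal Python sketch of the two definitions (the thesis code itself is written in R); the function names are hypothetical.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual values."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)


def index_of_agreement(predicted, actual):
    """Index of agreement: 1 means perfect agreement, 0 means none."""
    mean_actual = sum(actual) / len(actual)
    numerator = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    denominator = sum(
        (abs(p - mean_actual) + abs(a - mean_actual)) ** 2
        for p, a in zip(predicted, actual)
    )
    if denominator == 0.0:
        raise ValueError("IA is undefined when every value equals the actual mean")
    return 1.0 - numerator / denominator
```

As a small check, a prediction identical to the actual series gives IA = 1 and RMSE = 0, while predicting the actual mean at every hour gives IA = 0.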

Table 19 to Table 21 summarize the performance indices given by the three models. Figure 25 to Figure 27 display the actual and predicted values over the test period, and Figure 28 to Figure 30 show the quantile-quantile (Q-Q) plots of the models for the different prediction intervals.

Table 19. Performance index of 1-hour-prediction models

IA (larger is better)
Model       CO       NO2      SO2      O3       PM2.5    PM10     Best in
Linear      0.992    0.987    0.980    0.984    0.993    0.979    2
C. Linear   0.995    0.990    0.979    0.982    0.994    0.984    4
LSTM-RNN    0.932    0.968    0.947    0.940    0.971    0.848    0

RMSE (smaller is better)
Model       CO       NO2      SO2      O3       PM2.5    PM10     Best in
Linear      0.36     8.72     2.97     4.63     19.34    35.54    2
C. Linear   0.29     8.06     3.05     4.77     16.89    32.06    4
LSTM-RNN    0.93     13.76    4.83     9.12     41.77    83.65    0

Table 20. Performance index of 8-hour-prediction models

IA (larger is better)
Model       CO       NO2      SO2      O3       PM2.5    PM10     Best in
Linear      0.651    0.573    0.727    0.758    0.705    0.685    1
C. Linear   0.776    0.594    0.677    0.362    0.749    0.741    0
LSTM-RNN    0.877    0.803    0.750    0.706    0.827    0.840    5

RMSE (smaller is better)
Model       CO       NO2      SO2      O3       PM2.5    PM10     Best in
Linear      1.75     41.32    8.77     16.38    94.47    107.47   1
C. Linear   1.51     37.40    9.14     19.55    90.25    102.01   0
LSTM-RNN    1.29     33.00    9.51     16.27    79.49    86.37    5

Table 21. Performance index of 24-hour-prediction models

IA (larger is better)
Model       CO       NO2      SO2      O3       PM2.5    PM10     Best in
Linear      0.387    0.561    0.549    0.548    0.394    0.432    2
C. Linear   0.371    0.399    0.302    0.423    0.310    0.337    0
LSTM-RNN    0.245    0.672    0.598    0.362    0.647    0.652    4

RMSE (smaller is better)
Model       CO       NO2      SO2      O3       PM2.5    PM10     Best in
Linear      2.14     39.06    10.54    17.56    124.54   132.45   3
C. Linear   2.19     43.35    11.94    18.53    132.29   143.97   0
LSTM-RNN    2.41     35.42    10.60    20.68    115.46   115.63   3

Looking at the results above, out of 36 evaluation criteria (6 pollutants × 3 prediction intervals × 2 criteria), LSTM-RNN achieved the best performance in about 50% of the cases (17 out of 36). It is decisively better than the other two models in 8-hour and 24-hour predictions, where it is identified as the best model in 17 out of 24 criteria. In particular, this model performs well for PM2.5 and PM10 prediction at those longer intervals, being the best model in all such cases. Also, only the LSTM-RNN model seems to correctly capture the peaks of most of the concentrations (e.g., 8-hour prediction of CO and NO2, and 24-hour prediction of NO2 and PM2.5). Although the linear model achieved the best performance in some criteria, the two linear models did not correctly reproduce the pattern of peaks in the time series. This indicates the capability of the LSTM-RNN model to deal with non-linear relationships between the pollution or meteorological conditions in one city at a previous time and the current condition in Beijing.

The compound linear regression model was expected to perform better owing to its capability of capturing the propagation of pollutants. It did better than the simple linear model in 1-hour and 8-hour predictions, except for SO2 and O3, but this advantage does not exist in 24-hour predictions. This implies that it becomes quite difficult to estimate the concentration of an air pollutant in one city when the prediction interval reaches one day.

[Figure 25: six time-series panels, one per pollutant, plotting concentration against time (12-08 to 12-29) for the actual values and the Linear, C. Linear, and LSTM-RNN models]

Figure 25. 1-hour prediction with three models.

[Figure 26: same layout as Figure 25, for the 8-hour prediction]

Figure 26. 8-hour prediction with three models.

[Figure 27: same layout as Figure 25, for the 24-hour prediction]

Figure 27. 24-hour prediction with three models.

[Figure 28: six Q-Q panels (CO, NO2, SO2, O3, PM2.5, PM10) plotting predicted against actual concentration for the Linear, C. Linear, and LSTM-RNN models]

Figure 28. 1-hour Q-Q plot with three models.

[Figure 29: same layout as Figure 28, for the 8-hour prediction]

Figure 29. 8-hour Q-Q plot with three models.

[Figure 30: same layout as Figure 28, for the 24-hour prediction]

Figure 30. 24-hour Q-Q plot with three models.

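Figure 31 to Figure 34 may account for this. The autocorrelation function (ACF, $\alpha$) and the cross-correlation function (CCF, $\gamma$) represent, respectively, the correlation of a time series with itself and the correlation between two time series at a time lag $\tau$, with the following equations:

$$\alpha_{(\tau)} = \sum_{t} P^{b}_{(t)} \times P^{b}_{(t-\tau)}$$

$$\gamma_{(\tau)} = \sum_{t} P^{b}_{(t)} \times P^{k}_{(t-\tau)}$$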

Looking at the plots in the figures, one can find that the correlation between the concentration of a pollutant in Beijing and the concentration 24 hours later is less than 0.5, except for O3, which displays a clear 24-hour periodicity. This means that the current air pollution in Beijing has limited influence on the pollution one day later. In particular, the 2-day ACF of PM2.5 and PM10 is nearly zero, which means the concentrations of those pollutants are virtually unrelated to those two days later.

The CCF between the concentrations in Beijing and those in other cities implies the same characteristics. For most of the city-pollutant pairs, the CCF becomes less than 0.4 when the time lag exceeds 24 hours, except for O3. From this, we infer that even if we use lagged inputs in the compound linear regression, they are hardly effective for estimating the concentration 24 hours ahead. Meanwhile, some pollutant-city pairs show the peak of the CCF at a positive lag of less than 8 hours, which would lead to the advantage of the compound linear models in 1-hour and 8-hour predictions.
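The lagged correlations discussed above can be sketched as follows. This Python illustration assumes a normalized (Pearson) correlation at each lag, which is an assumption on our part because the equations above show only the product sum; the function names and the toy series in the usage note are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_a * var_b)


def lagged_correlation(p_b, p_k, lag):
    """Correlation between p_b[t] and p_k[t - lag].

    A positive lag pairs the current value of p_b (Beijing) with an earlier
    value of p_k (another city). Passing the same series twice gives the ACF.
    """
    n = len(p_b)
    if len(p_k) != n:
        raise ValueError("both series must have the same length")
    if abs(lag) >= n - 1:
        raise ValueError("lag leaves too few overlapping observations")
    if lag >= 0:
        return pearson(p_b[lag:], p_k[:n - lag])
    return pearson(p_b[:n + lag], p_k[-lag:])
```

If a toy Beijing series is simply the other city's series delayed by two hours, the function returns a correlation of 1 at a lag of 2, which corresponds to a CCF peak at a positive lag.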

[Figure 31: six panels, one per pollutant, plotting the ACF against lag (hours, from -40 to 40)]

Figure 31. Autocorrelation function of pollutants in Beijing.

[Figure 32: six panels, one per pollutant, plotting the CCF against lag (hours, from -40 to 40) for each city]

Figure 32. Cross-correlation function of pollutants in Beijing and other cities: Baoding, Cangzhou, Chengde, and Handan. The lag represents the time shift of the air pollution in the other cities (e.g., the current SO2 concentration in Beijing has the highest correlation with a previous concentration in Cangzhou). The same applies to the next two figures.

[Figure 33: same layout as Figure 32]

Figure 33. Cross-correlation function of pollutants in Beijing and other cities: Hengshui, Langfang, Qinhuangdao, and Shijiazhuang.

[Figure 34: same layout as Figure 32]

Figure 34. Cross-correlation function of pollutants in Beijing and other cities: Tangshan, Tianjin, Xingtai, and Zhangjiakou.

5.4. Models with Additional Variables

In the first attempt, we created three models that use only the pollutant whose future concentration we predict. However, as discussed in the preliminary data analysis, several pairs of pollutants have shown a moderate to high correlation between their concentrations. Given this, we created new models that account for all the pollutants in the 13 cities of the BTH region. We refer to the previous models as standard models, and to the new models with additional variables as extended models.

For example, the linear model can now be written as:

$$P^{b}_{i(t+L)} = \sum_{k=1}^{13} \left( \sum_{i} a_{i} P^{k}_{i(t)} \right) + \sum_{j} a_{j} X^{1}_{j(t)}$$

One issue with this approach is that there are too many variables in one model (13 cities × 6 pollutants + 4 meteorological conditions), which may lead to overfitting to the training dataset.
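The size of the extended linear model can be made concrete with a short sketch. The Python code below only illustrates how the 82 inputs enter a linear prediction; the names are hypothetical, the coefficients are placeholders rather than fitted values, and no intercept is included because the equation above shows none.

```python
N_CITIES = 13
N_POLLUTANTS = 6
N_METEO = 4


def n_features():
    """Number of explanatory variables in the extended linear model."""
    return N_CITIES * N_POLLUTANTS + N_METEO


def predict(concentrations, meteo, coef_pollutant, coef_meteo):
    """Linear prediction of one pollutant in Beijing at time t + L.

    concentrations[k][i]: concentration of pollutant i in city k at time t
    meteo[j]:             meteorological condition j at time t
    coef_pollutant[k][i] and coef_meteo[j]: regression coefficients
    """
    total = 0.0
    for k in range(N_CITIES):
        for i in range(N_POLLUTANTS):
            total += coef_pollutant[k][i] * concentrations[k][i]
    for j in range(N_METEO):
        total += coef_meteo[j] * meteo[j]
    return total
```

With 82 coefficients to estimate per model, the risk of fitting noise in the training period is noticeably higher than in the standard models.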

Table 22 to Table 24 display the overall performance of each model. The simple linear model obtained the most significant improvement in prediction performance, showing a better value in 75% of the cases. It achieved a 6.2% increase in IA and a 5.7% reduction in RMSE. In particular, it achieved a better performance than the corresponding standard model in all the 8-hour predictions. On the other hand, the extended compound linear model and the extended LSTM-RNN model did not show superiority over the standard models in most cases. For the LSTM-RNN model, we can infer that the training epochs were not enough for the model to find relationships among such a large number of variables. Meanwhile, it is not clear why the compound model did not predict as well as the simple model.

Table 22. Performance index of 1-hour-prediction models
Bottom number in each cell indicates the difference from the corresponding standard model. Better performance is colored cyan, and worse performance is colored brown. RMSE with underline is significantly better than the corresponding standard model with 95% confidence level.

IA (larger is better)
Model       CO       NO2      SO2      O3       PM2.5    PM10     Best in
Linear      0.994    0.986    0.981    0.984    0.995    0.984    3
            0.2%     -0.1%    0.1%     0.0%     0.3%     0.5%
C. Linear   0.994    0.987    0.982    0.98     0.995    0.983    3
            -0.1%    -0.2%    0.3%     -0.2%    0.0%     0.0%
LSTM-RNN    0.843    0.946    0.925    0.926    0.949    0.931    0
            -9.5%    -2.3%    -2.4%    -1.5%    -2.3%    9.8%

RMSE (smaller is better)
Model       CO       NO2      SO2      O3       PM2.5    PM10     Best in
Linear      0.31     9.15     2.92     4.64     15.79    31.4     4
            -12.8%   4.9%     -1.7%    0.2%     -18.4%   -11.6%
C. Linear   0.31     8.91     2.89     5.04     16.67    32.06    3
            4.6%     10.6%    -5.2%    5.7%     -1.3%    0.0%
LSTM-RNN    1.29     17.42    5.63     10.8     51.01    61.94    0
            39.1%    36.6%    16.5%    18.4%    22.1%    -26.0%
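The difference shown under each value can be read as a relative change against the standard model. The Python sketch below reproduces two of the IA entries from the rounded table values; the function names are hypothetical, and because the tables were presumably computed from unrounded results, not every entry is reproduced exactly from the rounded figures.

```python
def relative_difference(extended, standard):
    """Percentage change of the extended model against the standard model."""
    return (extended - standard) / standard * 100.0


def is_improvement(criterion, difference):
    """IA improves when it increases; RMSE improves when it decreases."""
    if criterion == "IA":
        return difference > 0.0
    if criterion == "RMSE":
        return difference < 0.0
    raise ValueError("criterion must be 'IA' or 'RMSE'")


# IA of the 1-hour LSTM-RNN models for CO: extended 0.843, standard 0.932
co_change = round(relative_difference(0.843, 0.932), 1)
# IA of the 1-hour LSTM-RNN models for PM10: extended 0.931, standard 0.848
pm10_change = round(relative_difference(0.931, 0.848), 1)
```

Here `co_change` corresponds to the -9.5% entry and `pm10_change` to the 9.8% entry of the IA block.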

Table 23. Performance index of 8-hour-prediction models

IA (larger is better)
Model       CO       NO2      SO2      O3       PM2.5    PM10     Best in
Linear      0.767    0.691    0.781    0.765    0.776    0.757    2
            17.9%    20.7%    7.4%     1.0%     10.1%    10.5%
C. Linear   0.703    0.661    0.681    0.476    0.748    0.686    0
            -9.3%    11.3%    0.7%     31.5%    -0.1%    -7.5%
LSTM-RNN    0.829    0.632    0.801    0.712    0.924    0.84     4
            -5.5%    -21.3%   6.8%     0.8%     11.7%    -0.1%

RMSE (smaller is better)
Model       CO       NO2      SO2      O3       PM2.5    PM10     Best in
Linear      1.51     35.71    8.24     13.96    87.27    99.17    2
            -13.4%   -13.6%   -6.0%    -14.8%   -7.6%    -7.7%
C. Linear   1.66     35.56    9.57     19.26    90.78    109.18   1
            9.8%     -4.9%    4.7%     -1.5%    0.6%     7.0%
LSTM-RNN    1.39     39.85    8.51     14.74    62.61    83.28    3
            7.9%     20.7%    -10.5%   -9.4%    -21.2%   -3.6%

Table 24. Performance index of 24-hour-prediction models

IA (larger is better)
Model       CO       NO2      SO2      O3       PM2.5    PM10     Best in
Linear      0.425    0.56     0.504    0.545    0.472    0.504    3
            9.9%     -0.2%    -8.2%    -0.4%    19.6%    16.6%
C. Linear   0.31     0.389    0.357    0.414    0.397    0.368    0
            -16.6%   -2.5%    18.1%    -2.1%    28.0%    9.2%
LSTM-RNN    0.406    0.636    0.472    0.284    0.656    0.667    3
            65.4%    -5.4%    -21.1%   -21.6%   1.2%     2.3%

RMSE (smaller is better)
Model       CO       NO2      SO2      O3       PM2.5    PM10     Best in
Linear      2.1      40.36    11.07    17.05    121.65   129.93   5
            -1.7%    3.3%     5.0%     -2.9%    -2.3%    -1.9%
C. Linear   2.23     44.71    12.03    18.43    126.12   138.48   0
            1.8%     3.1%     0.8%     -0.5%    -4.7%    -3.8%
LSTM-RNN    2.12     40.77    11.29    20.65    102.35   150.18   1
            -12.2%   15.1%    6.5%     -0.1%    -11.4%   29.9%

Focusing on each pollutant, PM2.5 and PM10 seem to be the pollutants to which this extended model is most applicable: averaged over the three models, they show 7.6% (PM2.5) and 4.6% (PM10) improvement in IA and 4.9% (PM2.5) and 2.0% (PM10) reduction in RMSE. Meanwhile, NO2 (0.0% improvement in IA and 8.4% increase in RMSE) and SO2 (0.2% improvement in IA but 1.1% increase in RMSE) do not seem suitable for this method. It is also difficult to argue for the validity of this approach for predicting the concentration of CO (5.8% improvement in IA but 2.6% increase in RMSE) or O3 (0.8% improvement in IA and 0.5% reduction in RMSE).

To clarify the difference in performance, we additionally applied the Wilcoxon signed-rank test [42]. Given paired observations $(x, y)$, it ranks the differences between the associated observations ($z_i = x_i - y_i$), assuming that the distribution of $z$ is symmetric [43]. The test then poses the null hypothesis and the alternative hypothesis:

$$H_0: \operatorname{median}(z) = 0$$

$$H_1: \operatorname{median}(z) \neq 0$$

If the test rejects $H_0$ at the 95% confidence level, the median of the differences is significantly different from zero (i.e., one group has significantly higher values). We applied this test to the squared errors of each pair of corresponding models (i.e., the squared error of the prediction by a standard model and that by the extended model). The negative values marked as significant in Table 22 to Table 24 indicate that the squared error of the extended model tends to be smaller than that of the corresponding standard model [43].
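The ranking step of the test can be sketched as follows. This Python illustration is a simplified version under stated assumptions: zero differences are dropped, tied absolute differences receive their average rank, and the two-sided p-value uses the normal approximation without tie or continuity corrections, so it may differ slightly from the value returned by a statistical package. The function name is hypothetical.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Return (W+, W-, approximate two-sided p-value) for paired samples."""
    z = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(z)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(z[i]))
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        while end + 1 < n and abs(z[order[end + 1]]) == abs(z[order[pos]]):
            end += 1
        average_rank = (pos + end) / 2.0 + 1.0
        for j in range(pos, end + 1):
            ranks[order[j]] = average_rank
        pos = end + 1

    w_plus = sum(r for r, d in zip(ranks, z) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, z) if d < 0)

    # Normal approximation of the distribution of W+ under the null hypothesis.
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    p_value = math.erfc(abs((w_plus - mean) / sd) / math.sqrt(2.0))
    return w_plus, w_minus, p_value
```

In the setting above, `x` would hold the hourly squared errors of a standard model and `y` those of the corresponding extended model, so a large W+ with a small p-value suggests that the extended model tends to have the smaller errors.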

Looking at the significance noted in the tables, some of the improvements in RMSE we identified (8 out of 28 cases) are not statistically significant. Therefore, we conclude that this approach, which accounts for the concentrations of other pollutants in neighboring cities, may be a better approach only for predicting the concentration of PM2.5 or PM10 at a mid-term interval of around 8 hours. Even for such predictions, considering the percentage difference in performance and the increased computation due to the additional variables, one may prefer a simpler method depending on the needs.

6. Conclusion

This study first presented a comprehensive correlation analysis of the dataset of air pollution and meteorological conditions in the BTH region. The analysis revealed that the pollutants other than O3 are strongly correlated with one another, while O3 is either uncorrelated or inversely correlated with those pollutants. Although the wind in Beijing is one of the influential factors, wind flowing from other cities to Beijing has less significance in determining the concentration of those pollutants. Finally, the intercity correlation analysis of the pollutants in Beijing and each of the other cities revealed that the distance between them strongly affects the correlation of the pollutants. While it is natural that such a relationship weakens as the cities become more distant, it is an important finding that we confirmed it by linear regression.

Then, we developed three regression models to predict the air pollution in Beijing, China, using reliable hourly data of 1) the concentrations of pollutants (CO, NO2, SO2, O3, PM2.5 and PM10), and 2) meteorological conditions (air pressure, humidity, temperature, wind speed and wind direction). These three models take the current concentration, in the surrounding cities of the BTH region, of the pollutant whose future value in Beijing is predicted, together with the current meteorological conditions in Beijing. The first model is the linear regression, which simply predicts the future value based on the linear relationship among the concentrations and meteorological factors. The second model is the compound linear regression, which uses the past concentrations in the neighboring cities of Beijing, with the time lag optimized through CCF analysis. Even though it cannot deal with long-term dependencies of the pollution in different cities, it achieved a better performance than the simple linear model in 1-hour and 8-hour predictions. The last model we introduced in this thesis is the bidirectional LSTM-RNN model, which can capture dependencies on both the past and the future through two LSTM layers running in opposite directions. With this capability, it exhibited the highest IA and the lowest RMSE in 17 out of 24 criteria (about 70%) of the longer-term 8-hour and 24-hour predictions. Meanwhile, this model, as well as the linear models, does not improve its prediction capability when it accounts for the concentrations of other pollutants in neighboring cities of the BTH region, except for PM2.5 and PM10 prediction. Given that the extended models have many more variables, which leads to longer computation time, such an improvement may be disregarded depending on the needs of future research.

REFERENCES

[1] National Bureau of Statistics of China, “China Statistical Yearbook 2018,” 2018.

[2] Beijing Municipal Bureau of Statistics, “Beijing Statistical Yearbook 2018,” 2018.

[3] D. D. Genc, C. Yesilyurt, and G. Tuncel, “Air pollution forecasting in Ankara,

Turkey using air pollution index and its relation to assimilative capacity of the

atmosphere,” Environ. Monit. Assess., vol. 166, no. 1–4, pp. 11–27, 2010.

[4] N. A. H. Janssen et al., “The relationship between air pollution from heavy traffic

and allergic sensitization, bronchial hyperresponsiveness, and respiratory symptoms

in Dutch schoolchildren,” Environ. Health Perspect., vol. 111, no. 12, pp. 1512–

1518, 2003.

[5] Y. Ma et al., “Fine particulate air pollution and daily mortality in Shenyang, China,”

Sci. Total Environ., vol. 409, no. 13, pp. 2473–2477, 2011.

[6] X. Tian et al., “Economic impacts from PM2.5 pollution-related health effects in

China’s road transport sector: A provincial-level analysis,” Environ. Int., vol. 115,

no. March, pp. 220–229, 2018.

[7] J. Parry, “Beijing pollution is becoming a ‘public health catastrophe,’ expert says.,”

2013.

[8] F. Lu et al., “Systematic review and meta-analysis of the adverse health effects of

ambient PM2.5 and PM10 pollution in the Chinese population,” Environ. Res., vol.

136, pp. 196–204, 2015.

[9] E. Coker and S. Kizito, “A narrative review on the human health effects of ambient

air pollution in sub-saharan africa: An urgent need for health effects studies,” Int. J.

Environ. Res. Public Health, vol. 15, no. 3, 2018.

[10] Y. Yang, Y. Liu, Y. Li, and J. Li, “Measure of urban-rural

transformation in Beijing-Tianjin-Hebei region in the new millennium: Population-

land-industry perspective,” Land Use Policy, vol. 79, no. January, pp. 595–608,

2018.

[11] Google, “Google Maps: Beijing-Tianjin-Hebei 13 Cities.” [Online]. Available:

https://drive.google.com/open?id=1ZwA3oLvg5FWaYmtuDgm4oOe_pwaXpzen&usp=sharing.

[12] GlobeFeed.com, “GlobeFeed Distance Calculator.” [Online]. Available:

https://distancecalculator.globefeed.com/. [Accessed: 05-May-2019].

[13] Y. Wang, Q. Ying, J. Hu, and H. Zhang, “Spatial and temporal variations of six

criteria air pollutants in 31 provincial capital cities in China during 2013-2014,”

Environ. Int., vol. 73, pp. 413–422, 2014.

[14] X. Yangyang, Z. Bin, Z. Lin, and L. Rong, “Spatiotemporal variations of PM2.5 and

PM10 concentrations between 31 Chinese cities and their relationships with SO2,

NO2, CO and O3,” Particuology, vol. 20, pp. 141–149, 2015.

[15] J. Liu, W. Li, J. Wu, and Y. Liu, “Visualizing the intercity correlation of PM2.5 time

series in the Beijing-Tianjin-Hebei region using ground-based air quality monitoring

data,” PLoS One, vol. 13, no. 2, pp. 1–14, 2018.

[16] R. Gehrig and B. Buchmann, “Characterising seasonal variations and spatial

distribution of ambient PM10 and PM2.5 concentrations based on long-term Swiss

monitoring data,” Atmos. Environ., vol. 37, no. 19, pp. 2571–2580, 2003.

[17] H. Wang, C. Chen, C. Huang, and L. Fu, “On-road vehicle emission inventory and

its uncertainty analysis for Shanghai, China,” Sci. Total Environ., vol. 398, no. 1–3, pp. 60–67, 2008.

[18] P. Benson, “CALINE4 - A Dispersion Model For Predicting Air Pollutant

Concentrations Near Roadways,” Calif. Dep. Transp., 1989.

[19] S. Gokhale and N. Raokhande, “Performance evaluation of air quality models for

predicting PM10 and PM2.5 concentrations at urban traffic intersection during

winter period,” Sci. Total Environ., vol. 394, no. 1, pp. 9–24, 2008.

[20] P. Benson, “CALINE3-A versatile dispersion model for predicting air pollutant

levels near highways and arterial streets,” 1979.

[21] World Steel Association, “WORLD STEEL IN FIGURES 2018,” 2018.

[22] Consultic Marketing & Industrieberatung GmbH, “World Plastics Materials

Demand 2015 by Types,” 2015.

[23] J. Meng, J. Liu, Y. Xu, and S. Tao, “Tracing Primary PM2.5 emissions via Chinese

supply chains,” Environ. Res. Lett., vol. 10, no. 5, 2015.

[24] P. Perez and A. Trier, “Prediction of NO and NO2 concentrations near a street with

heavy traffic in Santiago, Chile,” Atmos. Environ., vol. 35, no. 10, pp. 1783–1789,

2001.

[25] M. Cai, Y. Yin, and M. Xie, “Prediction of hourly air pollutant concentrations near

urban arterials using artificial neural network approach,” Transp. Res. Part D

Transp. Environ., vol. 14, no. 1, pp. 32–41, 2009.

[26] C. Sekar, C. S. P. Ojha, B. R. Gurjar, and M. K. Goyal, “Modeling and Prediction

of Hourly Ambient Ozone (O3) and Oxides of Nitrogen (NOx) Concentrations

Using Artificial Neural Network and Decision Tree Algorithms for an Urban

Intersection in India,” J. Hazardous, Toxic, Radioact. Waste, vol. 20, no. 4, p. A4015001, 2016.

[27] Y. Zhan et al., “Spatiotemporal prediction of continuous daily PM2.5

concentrations across China using a spatially explicit machine learning algorithm,”

Atmos. Environ., vol. 155, pp. 129–139, 2017.

[28] T.-C. Bui, V.-D. Le, and S.-K. Cha, “A Deep Learning Approach for Forecasting

Air Pollution in South Korea Using LSTM,” 2018.

[29] J. Xu, A. Wang, and M. Hatzopoulou, “Investigating near-road particle number

concentrations along a busy urban corridor with varying built environment

characteristics,” Atmos. Environ., vol. 142, pp. 171–180, 2016.

[30] Xinxu Li, G. Yu, Y. Wand, X. Wu, and Y. Ma, “A Multimodal Detection and

Tracking System Based on Deep-Learning for Traffic Monitoring,” pp. 572–582,

2017.

[31] J. Hooyberghs, C. Mensink, G. Dumont, F. Fierens, and O. Brasseur, “A neural

network forecast for daily average PM10 concentrations in Belgium,” Atmos.

Environ., vol. 39, no. 18, pp. 3279–3289, 2005.

[32] J. P. Shi and R. M. Harrison, “Neural network modelling and prediction of hourly

NOx and NO2 concentrations in urban air in London,” Atmos. Environ., vol. 31, no.

24, pp. 4081–4094, 1997.

[33] C. Song et al., “Air pollution in China: Status and spatiotemporal variations,”

Environ. Pollut., vol. 227, no. March 2018, pp. 334–347, 2017.

[34] A. P. K. Tai, L. J. Mickley, and D. J. Jacob, “Correlations between fine particulate

matter (PM2.5) and meteorological variables in the United States: Implications for

the sensitivity of PM2.5 to climate change,” Atmos. Environ., vol. 44, no. 32, pp. 3976–3984, 2010.

[35] K. Siwek and S. Osowski, “Improving the accuracy of prediction of PM10 pollution

by the wavelet transformation and an ensemble of neural predictors,” Eng. Appl.

Artif. Intell., vol. 25, no. 6, pp. 1246–1258, 2012.

[36] N. M. Noor, M. M. Al Bakri Abdullah, A. S. Yahaya, and N. A. Ramli, “Comparison

of Linear Interpolation Method and Mean Method to Replace the Missing Values in

Environmental Data Set,” Mater. Sci. Forum, vol. 803, pp. 278–281, 2014.

[37] A. Clifton, “Wind rose with ggplot (R)?,” 2013. [Online]. Available:

https://stackoverflow.com/questions/17266780/wind-rose-with-ggplot-r/17266781#17266781. [Accessed: 05-Mar-2019].

[38] K. Benoit, “Linear Regression Models with Logarithmic Transformations,” London

Sch. Econ., pp. 1–8, 2011.

[39] S. Hochreiter and J. Schmidhuber, “LONG SHORT-TERM MEMORY,” Neural

Comput., vol. 9, no. 8, pp. 1735–1780, 1997.

[40] A. Graves, “Generating Sequences With Recurrent Neural Networks,” University of

Toronto, 2013.

[41] Ö. Yildirim, “A novel wavelet sequences based on deep bidirectional LSTM

network model for ECG signal classification,” Comput. Biol. Med., vol. 96, no.

January, pp. 189–202, 2018.

[42] F. WILCOXON, “Individual comparisons of grouped data by ranking methods.,” J.

Econ. Entomol., vol. 39, no. 6, p. 269, 1946.

[43] D. VanLeeuwen and M. Archuleta, “WILCOXON SIGNED-RANK TEST,” 2016.

[Online]. Available: https://www.sneb.org/clientuploads/directory/Recorded_Webinars/Webinar_Handouts/4-13-16.pdf. [Accessed: 05-Nov-2019].

Appendix

The R code used for this study and all the images related to the work are available via https://github.com/keita-makino/Thesis_MS.
