PHYSICAL STUDIES OF AIRBORNE POLLEN AND PARTICULATES UTILIZING

MACHINE LEARNING

by

Xun Liu

APPROVED BY SUPERVISORY COMMITTEE:

David John Lary, Chair

Roderick Heelis

Robert Glosser

Lunjin Chen

Fabiano Rodrigues

Copyright © 2019

Xun Liu

All rights reserved

PHYSICAL STUDIES OF AIRBORNE POLLEN AND PARTICULATES UTILIZING

MACHINE LEARNING

by

XUN LIU, BS, MS

DISSERTATION

Presented to the Faculty of

The University of Texas at Dallas

in Partial Fulfillment

of the Requirements

for the Degree of

DOCTOR OF PHILOSOPHY IN

PHYSICS

THE UNIVERSITY OF TEXAS AT DALLAS

December 2019

ACKNOWLEDGMENTS

I would like to thank all those who supported my research and dissertation. Without their help, I would never have completed this dissertation.

I must express my deepest appreciation to my advisor, Dr. David Lary, for his guidance over the past three years. My heartfelt gratitude also goes to him for his patience and tolerance of the mistakes I made along the way. His warm support and insightful instruction helped me through many hard times during this research.

I would also like to thank the other members of my committee, Dr. Roderick Heelis, Dr. Robert Glosser, Dr. Lunjin Chen, and Dr. Fabiano Rodrigues, for serving on my committee and generously providing me with advice and comments.

I am also grateful to all teammates in Dr. Lary’s group: Daji Wu, Gebreab K. Zewdie and Lakitha Wijeratne, for their collaboration on my research.

October 2019

PHYSICAL STUDIES OF AIRBORNE POLLEN AND PARTICULATES UTILIZING

MACHINE LEARNING

Xun Liu, PhD

The University of Texas at Dallas, 2019

Supervising Professor: David John Lary, Chair

This dissertation presents an approach for estimating the abundance of airborne pollen and particulates using a comprehensive description of the physical environment coupled with machine learning. The physical environment is characterized by eighty-five variables that quantify the physical state of the land surface and soil, and the physical state of the atmosphere. The physical environment of plants naturally affects their rate of maturing and pollen generation. Then, once the pollen is released, conditions such as wind speed affect how the pollen is dispersed. Machine learning is helpful for studying such a complex system. It allows us to ‘learn by example’, since at present we do not have a complete theoretical description, from first principles, of the entire system, from plant growth and development to the plants’ full interaction with their physical environment. Machine learning also allows us to objectively highlight which physical parameters play a central role in determining the atmospheric abundance of the pollen, and hence its impact on human health.

Some key aspects in building a physical model of airborne particulates using machine learning that are explored in this dissertation include:

1. The collection of an appropriate and comprehensive training dataset that machine learning algorithms can use to learn from. This involves characterizing the appropriate temporal and spatial scales involved. Variograms were used to perform this analysis. Machine learning is an automated encapsulation of the scientific method, an automated paradigm for learning by example to build descriptive models that can be tested and iteratively improved.

2. Identifying the physical parameters which are the most appropriate input variables (or features) to build an accurate machine learning model. This is a key step in machine learning called feature engineering. Feature engineering can provide useful physical insights into the key drivers of the system being studied.

3. Providing a framework for updating the machine learning model as new observational data are collected. This was done by implementing a mini-batch training process that allows the machine learning model to be updated in almost real time.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1 INTRODUCTION
1.1 Airborne Particulates
1.1.1 Particulate Matter in Different Sizes
1.1.2 Chemical Composition and Source Apportionment
1.1.3 Role in Global Environmental Change
1.1.4 Particulate Matter and Human Health
1.2 Airborne Ambrosia Pollen
1.2.1 Ragweeds - Source of Airborne Pollen in North America
1.2.2 Airborne Pollen Particles
1.2.3 Ambrosia Pollen and Health
1.2.4 Environment's effect on airborne pollen
1.3 Summary
CHAPTER 2 OBSERVATIONS OF THE TEMPORAL CHANGES IN AMBROSIA (RAGWEED) POLLEN ABUNDANCE
2.1 Previous Work
2.2 Data
2.3 Summary
CHAPTER 3 OBSERVATION OF THE TEMPORAL CHANGES IN PARTICULATE MATTER ABUNDANCE
3.1 Optical Particle Counters
3.2 In-situ Observation System
3.3 Data from MINTS
CHAPTER 4 PHYSICAL INSIGHTS PROVIDED BY VARIOGRAMS
4.1 Stochastic Process and Sampling
4.2 Variogram and Kriging
4.2.1 Variogram Definition Equation
4.2.2 Kriging for Data Interpolation
4.2.3 Practical Use of Variograms
4.3 Variograms of Airborne Particulates
4.4 Summary
CHAPTER 5 BUILDING EMPIRICAL PHYSICAL MODELS OF AIRBORNE PARTICULATES USING MACHINE LEARNING
5.1 Introduction to Machine Learning
5.1.1 Supervised Learning
5.1.2 Unsupervised Learning
5.1.3 Feature Engineering
5.2 LASSO
5.2.1 Algorithm
5.2.2 Result
5.3 Neural Networks
5.3.1 Algorithm
5.3.2 Result
5.4 Ensembles of Decision Trees
5.4.1 Decision Trees
5.4.2 Random Forest
5.4.3 Advantages of a Random Forest
5.4.4 Estimating Ambrosia Pollen Using Random Forests
5.5 Summary of Pollen Estimation Using Machine Learning
5.6 Machine Learning Inter-comparison for PM2.5
CHAPTER 6 SUMMARY
6.1 Conclusion
6.2 Future Direction
APPENDIX ENVIRONMENTAL VARIABLES USED IN POLLEN ESTIMATION
REFERENCES
BIOGRAPHICAL SKETCH
CURRICULUM VITAE

LIST OF FIGURES

1.1 Size comparison for PM particles, GNU free documentation license from EPA public knowledge
1.2 Airborne particulate size distribution chart, GNU free documentation license from Wikipedia
1.3 PM2.5 source and chemical composition apportionment at multiple Chinese sites during 2013 (Huang et al., 2014)
1.4 Chemical composition and source apportionment comparison between PM10 and PM2.5
1.5 Particulates' direct and indirect effects on the global climate system
1.6 Percentages of risk factors on attributable deaths in 2013 (Bank, 2016)
1.7 A schematic showing the Ambrosia life-cycle
2.1 Correlation of the model-predicted pollen concentrations with observed validation data for 2013
2.2 Example seasonal pollen data for 1986, 1987 and 1988
2.3 Averaged 1986-2014 pollen data in the flowering season
3.1 Optical Particle Counters
3.2 Schematic of MINTS sensors
3.3 Particulate time series data in Chattanooga, 08.02.2018
3.4 Particulate (0.75-1.7 µm) time series data from August 10th to 12th
4.1 A spherical variogram fit
4.2 Covariance function as a function of data pair separation
4.3 Significance of variogram nugget and range
4.4 An example temporal variogram for observed PM2.5
4.5 Observed PM2.5 time series, representativeness uncertainty, and variogram range histograms
4.6 Observed PM10 time series, representativeness uncertainty, and variogram range histograms
5.1 An overview of some of the often used types of machine learning (Pilotte)
5.2 Schematics illustrating conceptually how the data is organized for supervised machine learning
5.3 Schematic illustrating conceptually how the data is organized for unsupervised machine learning
5.4 Scatter diagram for the airborne pollen estimates made using the LASSO approach
5.5 Schematic of a single hidden layer, feed-forward Neural Network
5.6 Scatter diagram for the airborne pollen estimates made using a Neural Network
5.7 A decision tree consists of three types of nodes
5.8 A Random Forest is a classifier consisting of a collection of tree-structured classifiers
5.9 Using Random Forest Regression to estimate the airborne abundance of Ambrosia (Ragweed) pollen
5.10 Plot of model estimation versus actual observation
5.11 Rank of variables' relative importance
5.12 Linear Regression performance
5.13 Support Vector Machine performance
5.14 Decision Tree performance
5.15 Gaussian Process Regression performance
5.16 Ensemble method performance

LIST OF TABLES

5.1 Table of correlation coefficients for the various machine learning approaches used in this study, with the best performing approach listed first
5.2 Table of cross-validated root mean square error (RMSE) and correlation coefficient (R) values for multiple machine learning approaches
A.1 Variable names, abbreviations, and units

CHAPTER 1

INTRODUCTION

1.1 Airborne Particulates

One of the first documented occurrences of public attention being given to air pollution was in the 1700s (Brimblecombe and Bowler, 1992). Atmospheric pollution can be gaseous, involve biological material (e.g. airborne mold or pollen), or involve other airborne particulate matter (e.g. soot and smog). Of these, airborne particulate matter has left a particular mark on human history, from the great London smog to recent particulate matter pollution episodes (Zhang et al., 2014). Many of the early epidemiological studies considered sulfur dioxide and its role in producing smog. More recently it has become evident that small airborne particles, referred to as the fine fraction, are particularly detrimental to human health (Wilson et al., 1996).

The size and composition of these particulates determine their physical properties and their effects on human health. Further, the size of the particulates may reflect their composition (Wilson et al., 1996).

1.1.1 Particulate Matter in Different Sizes

Airborne particulate matter (also called atmospheric aerosol) consists of solid and/or liquid particles suspended in air. The size of the airborne particulates is a key factor, and the aerodynamic size is usually used. The aerodynamic size influences the time that the particles stay airborne and hence the distances over which they can be transported by the winds, and, finally, if inhaled, the size governs where they are deposited within the human body. The Environmental Protection Agency (EPA) reviews and updates the National Ambient Air Quality Standards (NAAQS) every five years. In the revised standard of 1987, particulate matter was divided into two main groups (Hester et al., 1998):

• The coarse aerosol size fraction, PM2.5 - PM10, particles in the size range 2.5 to 10 µm.

• The fine aerosol size fraction, particles with diameter up to 2.5 µm.

Figure 1.1: Size comparison for PM particles, GNU free documentation license from EPA public knowledge

Figure 1.1 helps convey the size of PM particles by comparing them to the cross-section of a human hair. A typical human hair has a diameter of around 70 µm, around 30 times larger than the largest PM2.5 particle.

1.1.2 Chemical Composition and Source Apportionment

It is worth noting that the designations PM2.5 and PM10 refer only to size and not to composition. Figure 1.2 shows some examples of the typical components of airborne particulates; the horizontal axis marks the size of the particulates in units of µm. The composition is a reflection of the source and can vary temporally and spatially. Figure 1.3 shows the different PM composition and source apportionment at multiple Chinese cities in 2013. For example, PM2.5 collected during the high pollution events of 5-25 January 2013 at the urban sites of Beijing, Shanghai, Guangzhou and Xian had different sources in different regions. These sources included traffic, coal burning, cooking, biomass burning, dust related, secondary organic and inorganic sources (Huang et al., 2014; Ducret-Stich, 2013). The apportionment of each source differs as well.

An example of the different components in the different size fractions can be seen in Figure 1.4 (Yin et al., 2005; Cheng et al., 2015). In the left panel, we see results for particulates collected in Ireland, where the major chemical component of PM2.5 was organic carbon, while that of the larger PM10 particles was minerals. In the right-hand panel, we see measurements of both groups of aerosols made in Hong Kong, where the major sources are vehicle emissions and unidentified materials.

1.1.3 Role in Global Environmental Change

Aerosols play a key role in global environmental change (Wilson et al., 1996; Stier et al., 2007). In part this is due to the direct role that they play in the atmosphere's radiation budget (Ångström, 1962), and in part to their indirect role in changing the atmosphere's thermal structure and affecting cloud properties and abundance (Hansen et al., 1997). Both eventually impact the global environment and human health.

Figure 1.2: Airborne particulate size distribution chart, GNU free documentation license from Wikipedia.

Figure 1.3: PM2.5 source and chemical composition apportionment at multiple Chinese sites during 2013 (Huang et al., 2014).

The leftmost panel in Figure 1.5 illustrates the direct effect, where aerosols absorb and scatter shortwave radiation from the sun. The right panel depicts the indirect effect, where aerosols modify the cloud albedo and the radiative forcing in the lower atmosphere (sometimes called the Twomey effect). A higher abundance of aerosols means a higher abundance of cloud condensation nuclei (CCN), leading to a larger number of smaller cloud droplets, thus whitening the clouds, increasing cloud reflectivity, and cooling the atmosphere. Changing the number of cloud condensation nuclei also affects the lifetime of clouds and reduces precipitation, as the water content of the atmosphere is split between a larger number of smaller clouds. Lower clouds lead to a cooling of the Earth's surface by blocking incoming sunlight, while higher clouds trap outgoing longwave infrared radiation, leading to a warming of the Earth's surface (Chýlek and Coakley, 1974).

1.1.4 Particulate Matter and Human Health

As early as 1970, an "association" was made between air pollution and death rates; this was described using linear regression in a seminal set of papers by Lave and Seskin. Although doubt was soon cast on this, and at first no concrete evidence was found to support the "association", it was later refined and supporting evidence was provided over the next decade (Wilson et al., 1996).

Figure 1.4: Chemical composition and source apportionment comparison between PM10 and PM2.5. Left panel: approximate composition of PM in Ireland. Right panel: source apportionment of PM near a Hong Kong roadway.

Figure 1.5: Particulates' direct and indirect effects on the global climate system.

Although the details of exactly how particulate matter affects human health are still being determined, there is little doubt of the substantial impact of airborne particulate matter on human health. Figure 1.6 shows the relative impacts of a set of risk factors on attributable deaths for 2013; we note that air pollution ranks fourth in this list. Many research studies, for example (Kreyling et al., 2010; Rückerl et al., 2011; Peng et al., 2008; Atkinson et al., 2010), have shown that airborne particulates impact human health in at least two ways:

1. PM2.5, due to its small size, can penetrate deep into the lungs and carry toxic chemicals across the air-blood barrier.

2. PM10, while not able to cross the air-blood barrier, is still associated with a diverse set of adverse health outcomes, impacting, among other things, cardio-respiratory health and premature mortality.

1.2 Airborne Ambrosia Pollen

The work shown in this section 1.2 has been published as "Liu, X., Wu, D., Zewdie, G. K., Wijerante, L., Timms, C. I., Riley, A., ... & Lary, D. J. (2017). Using machine learning to estimate atmospheric Ambrosia pollen concentrations in Tulsa, OK. Environmental Health Insights, DOI: 10.1177/1178630217699399."

A class of airborne particulates that has a particularly important human impact is airborne pollen. Pollen typically has a size of between 6 and 100 µm. One of the pollens that causes a strong allergenic response is Ragweed (Ambrosia).

Figure 1.6: Percentages of risk factors on attributable deaths in 2013 (Bank, 2016).

1.2.1 Ragweeds - Source of Airborne Pollen in North America

Ragweeds are flowering plants in the genus Ambrosia in the aster family, Asteraceae (Strother, 1753). They are distributed in North America, where the origin and center of diversity of the genus are in the southwestern United States and northwestern Mexico (Florística, 2010). The genus has 22 known allergens, with 6 considered major allergens. In North America, 17 species of Ragweed have been discovered; the most common is Ambrosia artemisiifolia (Jelks, 1986). The lifespan of a single Ragweed plant (Figure 1.7) begins with the release of seeds from the mother plant, followed by a dormant phase in the soil (Karrer, 2016). High variation of seed morphs has been found (Fumanal et al., 2007). Ambrosia seeds are light enough to float on water and can be spread easily along rivers. The seed lifetime is influenced by the frequency of soil disturbance. In arable fields with annual soil tillage, the turnover rate of seeds is higher than in abandoned fields or grassland. The persistence of individual seeds in the soil is typically short. In grassland, most of the seeds stay in the upper soil or on the soil surface. A small fraction of seeds is integrated into deeper soil horizons and has a long-time persistence as part of the soil seed bank. One could expect that smaller seeds, which show higher dormancy, tend to accumulate in the lower soil, whereas heavier seeds have a better chance of staying aboveground (Fumanal et al., 2007). Like other typical summer annual weeds, Ragweed seeds show innate dormancy after the seed set in autumn and need an exposure of about 4 weeks to temperatures around 0 °C for germination. If conditions are not suitable for germination, enforced secondary dormancy can be initiated (Baskin and Baskin, 1980). Ragweed individuals that germinate early in the season (March to April) grow slowly at the beginning, forming a rosette-like stage with 4-6 leaves. With increasing temperatures, vegetative growth is enhanced during June and July by significant stem elongation (Figure 1.7). Figure 1.7 was plotted and contributed by Gebreab K. Zewdie (Zewdie et al., 2019). The amount of pollen and seeds produced per individual depends largely on the habitat features and the population density (Leskovšek et al., 2012).

1.2.2 Airborne Pollen Particles

The pollen is tricolpate (i.e., having three colpi), with a spiny, granular surface (Mohapatra et al., 2004). A single plant can release about 1 billion pollen grains in a season (Thompson and Thompson, 2003). The plants grow to about 1.2 meters in height and generate pollen grains that are about 15-25 µm in size (Weber et al., 2013).

1.2.3 Ambrosia Pollen and Health

Although most larger pollen grains cannot deposit deep in the peripheral airways, it has been demonstrated that Ragweed pollen exists in particle sizes of less than 10 µm that could potentially lead to lower respiratory symptoms. It was recently reported that subpollen particles (fragments) released from Ragweed pollen grains, ranging in size from 0.5-4.5 µm, could induce allergic inflammation in an animal model (Bacsi et al., 2006).

Figure 1.7: A schematic showing the Ambrosia life-cycle over the calendar year: seed dormancy, germination, growing, and flowering with pollen production and seed development.

It has been estimated that symptoms after exposure to Ragweed pollen can begin with concentrations of as few as 5 − 20 pollen grains/m3.

Ragweed tends to grow in fields and in freshly cleared grounds. It is considered an annual disturbance weed that completes its life cycle in 1 year and requires the clearing or disturbance of the soil for future growth (Thompson and Thompson, 2003). The expansion of Ragweed in both the United States and Europe has been attributed to increasing deforestation and economic development (Taramarcaz et al., 2005).

Ragweed is an allergenic species that has expanded on a global scale (Oswalt and Marshall, 2008). As allergy sufferers are exposed to increasing amounts of air pollution in the future, this could lead to increased sensitization and thus symptoms. Sensitization in children can lead to the classic 'allergic march', which includes a progression from atopic dermatitis and allergic rhinitis to asthma (Wahn, 2000). In adults, new allergen sensitization may increase the development of allergic disease, with persistence of symptoms into older adulthood or asymptomatic sensitization that does not develop clinically until later in life (Nelson, 2000).

In the third National Health and Nutrition Examination Survey (NHANES III) (Chen and Schwartz, 2008), 54.3% of the population had a positive test response to one or more allergens. Over the last two decades, rates of asthma have increased in the United States and worldwide, although there is some evidence that asthma rates might have peaked in this period of time (Eaton et al., 2012). One of the most important risk factors for asthma is sensitization to one or more allergens. The National Center for Health Statistics included allergy skin testing in the second and third National Health and Nutrition Examination Surveys (NHANES II and NHANES III), which were conducted from 1976 through 1980 and 1988 through 1994, respectively, to estimate and monitor the prevalence of allergic sensitization in the United States.

Allergic conditions such as asthma and rhinitis can be exacerbated by pollen. According to the World Health Organization (WHO), 9% of US students younger than 18 experienced seasonal hay fever symptoms in 2008; three-quarters of these cases are believed to be caused by Ambrosia pollen. Approximately 50 million Americans have allergic diseases. On average, each day in the USA, 44,000 people have an asthma attack; asthma causes 36,000 children to miss school and 27,000 adults to miss work; 4,700 people visit the emergency room, with 1,200 of these emergency room visits leading to a hospital admission; and, unfortunately, on average, 9 of those admitted with asthma die.

Early warning of imminent high pollen levels could be valuable for people with conditions such as Asthma and Chronic Obstructive Pulmonary Disease (COPD).

1.2.4 Environment’s effect on airborne pollen

As mentioned in section 1.2.1, the growth of pollen-producing plants is influenced by their local environment. As a result, how much airborne pollen is produced is strongly affected by the local environment. In addition, specific environmental factors, such as wind speed, influence the range and distribution of these airborne particles.

In recent years, most Ragweed studies have focused on modeling and mapping the spread of Ragweed plants across counties in Europe (Kasprzyk, 2008). Many of these forecasting models seek to approximate the daily pollen concentration based on a variety of meteorologic parameters and using time-series analysis, non-parametric analysis, stepwise regression, and multiple regression models.

Pollen forecasting models generally emphasize the importance of meteorological variables in predicting pollen concentrations. In a 2008 study, Kasprzyk found that several meteorologic variables were influential in the daily Ragweed pollen concentration. The maximum, mean, and change in temperature and the dew point were positively correlated with pollen concentration, whereas humidity was negatively correlated with pollen concentration.

(Makra and Matyasovszky, 2011) assessed the influence of previous-day meteorologic data and previous-day Ragweed pollen concentration as predictive factors for the daily Ragweed pollen concentration in Hungary. The authors used multiple regression analysis and split the data into 2 groups, producing separate models for days with and without precipitation. The models indicated that the previous day's pollen concentration was significantly predictive of pollen concentration regardless of precipitation. For rainy days, previous-day solar radiation was significant; for non-rainy days, the previous-day mean temperature was significant. The results of their quantile analysis were similar, indicating previous-day pollen concentration to be the most predictive, with previous-day mean temperature and previous-day precipitation levels also significant variables.

The findings of (Stark et al., 1997) are consistent with those obtained in the European models described above. Stark et al. found that temperature, daily precipitation, and wind speed were significant parameters in estimating Ragweed pollen concentration in Michigan. The authors developed an individual model for each of the 4 years in their data set, rather than developing a collective model. Unlike the other models reviewed, this study included an incremental variable representing each day in the season, which was found to be statistically significant in predicting daily pollen concentration.

As a result of the health problems associated with Ragweed pollen, the factors that affect the development and allergenicity of Ragweed pollen have been investigated in many studies. The plants are tolerant of many environmental conditions that may not be conducive to the growth of other plants, such as extremely warm and dry environments (Kasprzyk, 2008). In addition, the complexities of the reproduction of Ragweed species have been studied in conjunction with the phenology and distribution patterns to better understand the behavior of these potent aeroallergen-producing plants (Zink et al., 2012). Chapman et al. (Zink et al., 2012) found that the phenology, or seasonal development, of Ragweed varieties is a significant factor in predicting the geographic spread of Ragweed plants and, consequently, Ragweed pollen.

1.3 Summary

In this chapter, we have examined some of the key roles and impacts of environmental particulates and aerosols, and introduced one in particular, Ambrosia or Ragweed pollen. We have seen that particulate matter is often classified simply by size (e.g. the categories used by the 1987 NAAQS): PM2.5, particulates with a size of up to 2.5 µm, and PM10, particulates with a size of up to 10 µm. When lofted in the air, airborne particulates can play a role in atmospheric radiative transfer and the formation of clouds, and can impact human health. Among these particulates, pollen is noteworthy, being a significant allergen affecting around one-third of the US population. Let us now examine how machine learning can be of use in estimating the abundance of airborne particulates.

CHAPTER 2

OBSERVATIONS OF THE TEMPORAL CHANGES IN AMBROSIA

(RAGWEED) POLLEN ABUNDANCE

The work shown in this chapter 2 has been published as "Liu, X., Wu, D., Zewdie, G. K., Wijerante, L., Timms, C. I., Riley, A., ... & Lary, D. J. (2017). Using machine learning to estimate atmospheric Ambrosia pollen concentrations in Tulsa, OK. Environmental Health Insights, DOI: 10.1177/1178630217699399."

2.1 Previous Work

(Howard and Levetin, 2014) have measured and analyzed the long-term Ambrosia (Ragweed) pollen counts observed at the University of Tulsa for over twenty-seven years and have developed a multi-linear forecasting model to estimate the daily pollen concentration, described in equation 2.1. In their model, they associated the pollen concentration with the long term phenology (Chapman et al., 2014) and a set of meteorological factors that included the minimum temperature $T_{min}$, precipitation $P$, and the mean dew point $DP$:

$$\ln(C) = -0.505 - 0.018 \times T_{min} - 0.108 \times P + 0.013 \times DP + 0.970 \times PH \quad (2.1)$$

where $C$ is the pollen concentration. The phenology ($PH$) is the mean pollen count for that day of the year over all prior years of Ambrosia pollen observations in Tulsa, OK.

Even though the multi-linear model examined by (Howard and Levetin, 2014) is not particularly accurate, it is shown here for the sake of completeness. Figure 2.1 shows a scatter diagram of this multi-linear model, where the x-axis shows the estimated pollen count and the y-axis the actual observed pollen count. For a perfect prediction all the points would lie on a straight line with a slope of one and an intercept of zero. This figure is used as a benchmark for the comparison of results obtained later using a variety of machine learning approaches. In Figure 2.1 the Pearson correlation coefficient is 0.59. The Pearson correlation coefficient is defined in Equation 2.2, where $\mathrm{Cov}(X,Y) = E[(X - \mu_X)(Y - \mu_Y)]$ and $\sigma$ denotes the standard deviation. It takes values in $[-1, 1]$: $-1$ and $1$ indicate perfect negative and positive linear correlation, respectively, and $0$ indicates no linear correlation.

$$\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} \quad (2.2)$$
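As a concrete illustration of this benchmark, the sketch below evaluates the multi-linear estimate of equation 2.1 and the Pearson correlation coefficient of equation 2.2. It is a minimal Python sketch; the function and variable names are illustrative assumptions, not code from the original study.

```python
import numpy as np

def multilinear_pollen_estimate(t_min, precip, dew_point, phenology):
    """ln(C) from the multi-linear model of equation 2.1."""
    return (-0.505 - 0.018 * t_min - 0.108 * precip
            + 0.013 * dew_point + 0.970 * phenology)

def pearson_r(x, y):
    """Pearson correlation coefficient of equation 2.2."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

# Hypothetical daily inputs for one season (equal-length arrays):
# ln_c_est = multilinear_pollen_estimate(t_min, precip, dew_point, phenology)
# r = pearson_r(ln_c_est, np.log(observed_pollen_concentration))
```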

Figure 2.1: Correlation of the model-predicted pollen concentrations with observed validation data for 2013: estimated versus observed pollen concentration (base-10 logarithm) with the 1:1 line shown, R = 0.59. Plotted based on equation 2.1, using data from (Howard and Levetin, 2014; Bosilovich et al., 2006).

This study examined the aerobiology of Ambrosia pollen in Tulsa, OK, and its time evolution in the context of meteorologic factors that affect pollen release. All pollen season statistics were highly variable during the 27-year period. The results of the present study indicate that there has been no significant change in the start date, peak date, and end date of the Ragweed season in Tulsa. When analyzing the temporal pattern of start and end dates, the results are consistent with those found by (Ziska et al., 2011), who reported, in certain cities, a significant increase in the length of the Ragweed pollen season depending on latitude. In Figure 2.2 the pollen count for three different Ragweed seasons is shown. We recall from figure 1.7 that the Ambrosia pollen season typically begins in August and finishes in November, consistent with the data shown here. The phenology term dominates this estimating model. Thus, removing this term from the model and using machine learning might reveal other important environmental factors.

Figure 2.2: Example seasonal pollen data for 1986, 1987 and 1988 (daily pollen counts from early August to early November for each year).

2.2 Data

Two types of data were used in this study. First, observational data of the abundance of airborne Ambrosia pollen (e.g. Figure 2.3), which was previously reported by (Howard and Levetin, 2014). Second, a comprehensive meteorological and land surface context for the pollen observations provided by the NASA MERRA meteorological reanalysis (Bosilovich et al., 2006; Rienecker et al., 2011).

As described in §2.1, the daily airborne pollen concentration was obtained at the University of Tulsa in Tulsa, Oklahoma. From 1986 to 2014, a Burkard Volumetric Spore Trap was deployed on the roof of Oliphant Hall, collecting airborne pollen day and night. Inside the Burkard trap, the pollen is deposited onto a greased strip of Melenex tape that is affixed to a rotating drum. Tapes were collected each week, divided into strips for each day, and then examined at a magnification of 400× for pollen grain identification and counting under a microscope. Once the pollen counts were obtained, they were multiplied by a conversion factor to yield the overall atmospheric pollen concentration (Howard and Levetin, 2014).

Figure 2.3: Averaged 1986-2014 pollen data in the flowering season (mean daily pollen count from early August to early November).

For every day of the 27 years from 1987-2013 for which pollen data were available at Tulsa, OK, the hourly values of 85 environmental parameters were retrieved from the NASA MERRA meteorological analysis that describes the surface and soil state (Rienecker et al., 2011). These 85 variables are listed in Table A.1 of the appendix and comprehensively characterize both the air close to the land surface and the land surface itself. Since the pollen data are only available as daily values, 3 summary statistics were also calculated for each of the 85 environmental parameters: the daily mean, daily minimum, and daily maximum. From everyday experience, weather plays a key role in when, in what concentration, and for how long pollen is released by plants. For example, windy dry weather typically leads to higher levels of pollen that are rapidly dispersed, while rain quickly washes pollen out of the atmosphere. Since a plant's likelihood of releasing pollen on any given day is naturally affected by that plant's recent history, we also time-lagged each of the 85 parameters by a delay that varied from 1 to 30 days. This leads to a total of 85 × 3 × 30 = 7,650 variables that were used in our machine learning studies. Of these 7,650 variables, only a small subset turns out to be important for estimating the daily pollen count.
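As an illustration of how such a lagged feature set could be assembled, the following minimal sketch assumes the daily summary statistics are held in a pandas DataFrame indexed by date; the column-naming convention and helper function are assumptions, not the pipeline actually used in this work.

```python
import pandas as pd

def build_lagged_features(daily_stats: pd.DataFrame, max_lag_days: int = 30) -> pd.DataFrame:
    """Time-lag every daily summary statistic by 1..max_lag_days days.

    daily_stats : DataFrame indexed by date, one column per summary statistic
                  (85 variables x 3 statistics = 255 columns, e.g. 'T2M_mean').
    Returns a DataFrame with 255 x 30 = 7,650 lagged feature columns.
    """
    lagged = {
        f"{col}_lag{lag:02d}": daily_stats[col].shift(lag)
        for col in daily_stats.columns
        for lag in range(1, max_lag_days + 1)
    }
    return pd.DataFrame(lagged, index=daily_stats.index)

# Hypothetical usage: align the lagged features with the daily pollen counts and
# drop the first 30 days, whose lag histories are incomplete.
# X = build_lagged_features(daily_stats)
# data = X.join(daily_pollen_counts).dropna()
```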

2.3 Summary

In this brief chapter, we have introduced one of the datasets used in our research. These data on Ambrosia (Ragweed) pollen will be used in later chapters, together with supervised non-linear non-parametric multivariate machine learning regression, to build empirical models of the time evolution of Ambrosia pollen as well as to provide some useful physical insights. The inputs for these regression models include a comprehensive characterization of the physical context of the Ambrosia plants, both the physical state of the soil and the ambient air. A full list of the physical variables providing the environmental context is given in the appendix.

CHAPTER 3

OBSERVATION OF THE TEMPORAL CHANGES IN PARTICULATE

MATTER ABUNDANCE

The particulate matter data used in this research were collected by an in-situ observation system consisting of standardized commercial sensors deployed in Chattanooga, TN, and in Richardson, TX. Machine learning was used for sensor calibration against a reference standard instrument. The sensors were built and calibrated by fellow student Lakitha Wijeratne.

3.1 Optical Particle Counters

Early measurements of airborne particulates did not employ laser-based approaches (Kulkarni et al., 2011). This changed after the establishment of the Environmental Protection Agency (EPA) in 1970 (Suter, 2008). The EPA designated classes of instruments as federal reference methods (FRM) and federal equivalent methods (FEM) for devices used to examine air pollution (Hall et al., 2014). FRM denotes the standard methods for air pollution measurement devices; FEM devices incorporate new technologies to assess compliance with the NAAQS.

The particulate data used in this study were collected using small Optical Particle Counters (OPCs). The OPC devices were made by AlphaSense, the OPC-N2 and OPC-N3 (http://www.alphasense.com/). Figure 3.1a shows the OPC-N2 and figure 3.1b shows the OPC-N3. These two modern sensors are portable, lightweight (under 105 g), and small (75 mm × 60 mm × 65 mm) (Pope et al., 2018). They are also affordable ($300 - $500) (Pope et al., 2018). The OPC-N2 and OPC-N3 can provide both a particle size distribution and size-integrated quantities such as PM1, PM2.5 and PM10. The OPC data used here were calibrated using machine learning against reference instruments, accounting for the role of changing temperature, pressure and humidity.
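The calibration itself was carried out by a colleague and its details are not reproduced here. Purely to illustrate the idea, the sketch below assumes a co-location dataset of raw OPC readings alongside a reference instrument and fits a random forest regression (one of the supervised approaches used later in this dissertation); the file name and column names are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical co-location dataset: raw OPC PM2.5 plus temperature, pressure and
# humidity as inputs; the reference instrument's PM2.5 as the target.
df = pd.read_csv("opc_colocation.csv")
X = df[["opc_pm25", "temperature", "pressure", "humidity"]]
y = df["reference_pm25"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

calibration = RandomForestRegressor(n_estimators=200, random_state=0)
calibration.fit(X_train, y_train)
print("Held-out R^2:", calibration.score(X_test, y_test))

# The fitted model then maps raw field readings onto reference-equivalent values:
# corrected_pm25 = calibration.predict(field_readings[X.columns])
```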

Figure 3.1: Optical Particle Counters: (a) OPC-N2; (b) OPC-N3.

3.2 In-situ Observation System

These in-situ sensors are part of a much larger observing system making observations on multiple spatial and temporal scales. This system is called the multi-scale integrated sensing and simulation system (MINTS). MINTS aims to assemble a broad collection of data from eight different types of sentinels (Figure 3.2). The particulate matter data used here were collected from two of the street-level sensors, streaming data 24/7. These sensors are part of a collection of sensors we have called 'central nodes'. A central node consists of an array of Internet of Things (IoT) environmental sensors. The sensors installed are: an Ozone Monitor Model 108-L (https://www.twobtech.com/index.html), a UBLOX GPS receiver, a sky-facing USB webcam measuring cloud fraction, a BME280 (measuring atmospheric temperature, humidity, and pressure), a gas sensor, and light sensors. There is also a small computer, an Odroid XU4 (Kernel, 2015), built into each node to operate the whole system and manage all sensors.

The central nodes also provide a LoRa gateway to communicate wirelessly within a 10 km radius with even lower cost, solar powered sensors (also calibrated using machine learning). LoRa allows low power, long-range communication. LoRa devices are connected to each other through a wide area network protocol named LoRaWan. Our LoRa nodes make use of LoRaWan technology to form a sensor mesh network, which records data from a number of nodes. Each LoRa node includes a set of gas sensors for measurement of CO, NO2, C2H6OH, NH3, CH4, C3H8 and C4H10, and sensors measuring temperature, humidity, pressure, and airborne particulate matter.

Figure 3.2: Schematic of MINTS sensors.

3.3 Data from MINTS

Particulate data used in this thesis were collected from the sensors discussed in section 3.1 above, which were deployed in Chattanooga, Tennessee. The sensors recorded the temporal change of particulate matter abundance with a temporal resolution of 10 seconds during 2018. Some examples are shown in figure 3.3. All of these panels plot the change in PM abundance over time in Chattanooga on August 2nd, 2018. Figure 3.3a shows the abundance of PM in the 0.75-1.7 µm size fraction, and figures 3.3b and 3.3c show two other size ranges. Figure 3.4 shows data for August 10th - 12th.

Figure 3.3: Particulate time series data in Chattanooga, 08.02.2018: counts of airborne particles over the day in the size ranges (a) 0.75-1.7 µm, (b) 1.7-2.2 µm, and (c) 2.2-2.7 µm.

Figure 3.4: Particulate (0.75-1.7 µm) time series data from August 10th to 12th, 2018: (a) 08-10-2018, (b) 08-11-2018, (c) 08-12-2018.

CHAPTER 4

PHYSICAL INSIGHTS PROVIDED BY VARIOGRAMS

Variograms were first introduced to characterize the spatial and temporal scales associated with a given dataset. Characterizing the spatial and temporal scales of a data set is useful for several reasons. We would like to make sure that the spatial and temporal resolution of our observations can resolve the key features of the variable(s) being considered. At the same time, needlessly high resolution can have significant data storage impacts and bandwidth implications for data transmission.

4.1 Stochastic Process and Sampling

Machine learning has two key ingredients: first, a set of comprehensive, representative examples to learn from, called the training data; second, an algorithm that can learn the behavior of the system by example from the training data. Even with the best algorithm in the world for a given type of machine learning, the overall performance of the machine learning system will be improved by a comprehensive, representative, unbiased, balanced training dataset. Appropriate sampling of the training data characterizing the system is therefore a key step (Oliver and Webster, 2015).

The goal of the machine learning system we built was to estimate a continuous variable, the distribution of airborne particulates. The abundance of airborne particulates is characterized by continuous spatial and temporal change. The ideal training dataset will be representative of the various parts and regimes of the parameter space being considered. This will allow the machine learning to generalize well (i.e. to accurately estimate the state of the system from data not included in the training data). Sampling theory can help achieve this.

To illustrate the way variograms can help us, it is useful to first consider both random variables and regionalized variables. The dispersion, generation, and transportation of airborne particulates obey the laws of physics. Building an accurate deterministic model to perfectly describe this system is, in itself, a challenging task. The natural variability can instead be described using random numbers (Webster, 2000). In this situation, our variable of observation is treated as a random variable, and its value in one measurement as a draw from a probability distribution. The mapping from the variable's domain (location or time) to all possible values is called a stochastic process.

4.2 Variogram and Kriging

For a stochastic process, the value of the variable of interest, Z, can be written as:

$$Z(\vec{x}) = \mu + \epsilon(\vec{x}) \quad (4.1)$$

In this equation, $\mu$ is the mean value of the distribution, and $\epsilon(\vec{x})$ is a random variable whose distribution has the same shape as that of $Z$ but is centered on zero.

4.2.1 Variogram Definition Equation

The derivation in this section 4.2.1 follows, in simplified form, that given by (Oliver and Webster, 2015).

Observed values at adjacent locations (in space) or consecutive points (in time) are related by the continuity of the physical processes involved. The co-variance, or the variance of the differences between adjacent and/or consecutive locations, can be used to describe the relationship between pairs of points. Studying how this variance changes with the separation in space and/or time of pairs of points allows us to characterize the spatial and temporal scales of our data. The variogram is what is typically used to calculate these spatial and temporal scales:

$$E[Z(\vec{x}) - Z(\vec{x} + \vec{h})] = 0 \quad (4.2)$$

Variograms are an essential part of Kriging data interpolation. In equation 4.2, the vector $\vec{h}$ represents the distance or time interval between pairs of measurements. Starting with equation 4.2 we can derive the variogram equation 4.3:

$$\mathrm{Var}[Z(\vec{x}) - Z(\vec{x} + \vec{h})] = E[(Z(\vec{x}) - Z(\vec{x} + \vec{h}))^2] = 2\gamma(\vec{h}) \quad (4.3)$$

where $\gamma(\vec{h})$ is the semi-variance as a function of the separation vector $\vec{h}$, and is called a variogram.

Further, as $\mu_{\vec{x}} = \mu_{\vec{x}+\vec{h}}$, we arrive at the usual variogram definition in equation 4.4:

$$\gamma(\vec{h}) = \frac{E[(\epsilon(\vec{x}) - \epsilon(\vec{x} + \vec{h}))^2]}{2} \quad (4.4)$$

In practical use, we also need to estimate this expectation from a data sample. Assume we have a collection of samples $Z(\vec{x}_1), Z(\vec{x}_2), Z(\vec{x}_3), \ldots$. The empirical variogram for a separation $\vec{h}$ can then be written as equation 4.5:

$$\gamma(\vec{h}) = \frac{1}{2m(\vec{h})} \sum_{i=1}^{m(\vec{h})} \left[z(\vec{x}_i) - z(\vec{x}_i + \vec{h})\right]^2 \quad (4.5)$$

where $m(\vec{h})$ denotes the number of measurement pairs with spacing $\vec{h}$.
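As a concrete illustration, the following is a minimal sketch of how the empirical variogram of equation 4.5 could be computed for a regularly sampled time series; the function name and the assumption of a fixed sampling cadence with lags that are integer multiples of it are mine, not a description of the code used in this work.

```python
import numpy as np

def empirical_variogram(z, dt, max_lag):
    """Empirical temporal variogram of equation 4.5 for a regularly sampled series.

    z       : 1-D array of observations taken every dt time units
    max_lag : largest lag (same units as dt) at which to evaluate gamma
    Returns arrays of lags and semi-variances gamma(h).
    """
    z = np.asarray(z, dtype=float)
    lags, gamma = [], []
    for k in range(1, int(max_lag / dt) + 1):
        if k >= z.size:
            break
        d = z[k:] - z[:-k]            # all pairs separated by lag k*dt
        m = d.size                    # m(h): number of pairs at this lag
        lags.append(k * dt)
        gamma.append(np.sum(d**2) / (2.0 * m))
    return np.array(lags), np.array(gamma)

# Hypothetical usage for a PM2.5 series sampled every 2 seconds, out to 15 minutes:
# lags, gamma = empirical_variogram(pm25_values, dt=2.0, max_lag=15 * 60)
```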

4.2.2 Kriging for Data Interpolation

Variograms are used extensively in Kriging, and although it is not the focus of our work, it is worth briefly describing the data-interpolation approach called kriging (also referred to as Gaussian Process Models (Rasmussen, 2003)). Among interpolation and regression approaches, kriging came to prominence because of both the good results it yields and the fact that it provides an error estimate. Kriging is a Best Linear Unbiased Predictor (BLUP) (Lark et al., 2006).

Kriging interpolates data for a stochastic process by taking into account the spatial/temporal variation characterized by the variogram. The kriging estimate of the error of interpolation is shown in equation 4.6:

$$\delta^2(B) = \vec{b}^{\,T} \vec{\lambda} - \hat{\gamma}(B, B) \quad (4.6)$$

where $B$ stands for all the data samples in a block, the hat on top of $\gamma$ denotes a (tensor) summation over pairs of points, $\vec{b}$ contains the semi-variances between pairs of points in the block, and $\vec{\lambda}$ is the weight vector of this linear interpolation, in which each interpolated point is a weighted linear summation over all sampled points; $\gamma$ is the variogram (section 4.2.1).
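To make the kriging weights and error estimate concrete, here is a minimal sketch of ordinary point kriging given a variogram model. Equation 4.6 as written refers to a block of samples, so the variance below is its point-support analogue, and the unbiasedness constraint is handled with a Lagrange multiplier as in the standard formulation; the function and variable names, and the one-dimensional coordinate, are assumptions for illustration.

```python
import numpy as np

def ordinary_kriging(x_obs, z_obs, x0, variogram):
    """Ordinary point kriging at coordinate x0 from samples (x_obs, z_obs).

    variogram : callable gamma(h) giving the semi-variance at separation h
                (any fitted bounded model, e.g. the one in Section 4.2.3).
    Returns the kriged estimate and its kriging variance.
    """
    x = np.asarray(x_obs, dtype=float)
    z = np.asarray(z_obs, dtype=float)
    n = x.size

    # (n+1) x (n+1) system: pairwise semi-variances plus the unbiasedness
    # constraint enforced through a Lagrange multiplier mu.
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = variogram(np.abs(np.subtract.outer(x, x)))
    A[n, n] = 0.0
    b = np.ones(n + 1)
    b[:n] = variogram(np.abs(x - x0))

    sol = np.linalg.solve(A, b)
    lam, mu = sol[:n], sol[n]
    estimate = float(lam @ z)
    kriging_variance = float(lam @ b[:n] + mu)  # point-support analogue of equation 4.6
    return estimate, kriging_variance
```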

4.2.3 Practical Use of Variograms

When coding variograms, care is required, as the last term in equation 4.6 requires one to calculate all the variogram pairs in a block. Instead of the above calculation, kriging often uses a spherical function fitted to the variogram shape (e.g. Figure 4.1). The fitting equation can be adapted as required for various applications. The fit function can be written as:

$$\gamma(h) = \begin{cases} c_0 + c\left\{1 - \dfrac{2}{\pi}\cos^{-1}\left(\dfrac{h}{r}\right) + \dfrac{2h}{\pi r}\sqrt{1 - \dfrac{h^2}{r^2}}\right\}, & \text{for } 0 < h \le r \\ c_0 + c, & \text{for } h > r \\ 0, & \text{for } h = 0 \end{cases} \quad (4.7)$$

where $c_0$ is the nugget in figure 4.1, $r$ is the range, $c$ is the sill, and $h$ is the lag along the separation axis being considered (in this example the x-axis). The relationship between the variogram and the co-variance (Figure 4.2) is given in equation 4.8:

$$\gamma(\vec{x}_1, \vec{x}_2) = \mathrm{sill} - C(\vec{x}_1, \vec{x}_2) \quad (4.8)$$

where $\gamma$ is the variogram and $C$ is the co-variance.
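For completeness, here is a minimal sketch of the bounded variogram model of equation 4.7 and of extracting the nugget, sill and range by least-squares fitting. The lag and semi-variance numbers in the usage example are made-up illustrative values, not measurements; the model plateaus at c0 + c.

```python
import numpy as np
from scipy.optimize import curve_fit

def variogram_model(h, c0, c, r):
    """Bounded variogram model of equation 4.7 (nugget c0, sill component c, range r)."""
    h = np.asarray(h, dtype=float)
    g = np.full_like(h, c0 + c)                  # plateau for h > r
    inside = (h > 0) & (h <= r)
    hr = h[inside] / r
    g[inside] = c0 + c * (1.0 - (2.0 / np.pi) * np.arccos(hr)
                          + (2.0 * h[inside] / (np.pi * r)) * np.sqrt(1.0 - hr**2))
    g[h == 0] = 0.0                              # gamma(0) = 0 by definition
    return g

# Made-up illustrative lags (minutes) and empirical semi-variances, e.g. the
# output of empirical_variogram() above:
lags = np.array([0.5, 1, 2, 3, 5, 8, 10, 12, 15])
gamma_hat = np.array([0.010, 0.020, 0.030, 0.040, 0.045, 0.048, 0.050, 0.050, 0.050])

(c0, c, r), _ = curve_fit(variogram_model, lags, gamma_hat,
                          p0=[0.01, gamma_hat.max(), 5.0],
                          bounds=([0.0, 0.0, 1e-3], [np.inf, np.inf, np.inf]))
print(f"nugget c0 = {c0:.3f}, c = {c:.3f}, range r = {r:.2f} minutes")

# The fitted model can then be passed to ordinary_kriging() above, e.g.
# fitted = lambda h: variogram_model(h, c0, c, r)
```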

27 Figure 4.1: A spherical variogram fit.

Multiple collections of data samples are considered when calculating variograms, each at a different spacing, h; these are then used to plot the variogram, fit a variogram function, and characterize the sill, nugget and range values, which are used for the Kriging calculation and final error evaluation (Oliver and Webster, 2015). In this dissertation, we use the sill, nugget, and range to characterize key aspects of our dataset (Figure 4.3). In Figure 4.3, the x axis depicts the observation separation, and the y axis depicts the variogram value of equation 4.4. Fitted range, sill, and nugget values are noted in figure 4.3 as well.

The range characterizes the spatial scale beyond which the data are no longer significantly correlated, so it is a useful way to determine the spatial and/or temporal scales of our data. The nugget (the variogram at zero separation) characterizes the experimental error in our observations.

28 Figure 4.2: Covariance function as a function of data pair separation.

4.3 Variograms of Airborne Particulates

Let us consider the temporal variograms of airborne particulates; an example is shown in Figure 4.4. The lower panel shows the variogram as a function of the temporal separation (lag), between 0 and 15 minutes, of PM2.5 observations. The lag at which the variogram plateaus is called the range; we note that in this example the range is about 1 minute. This means that observations in this example time-series made within less than a minute of each other are partially correlated, while observations separated by a time lag of greater than a minute were not significantly correlated. So, for this case, we should be recording the time-series with a resolution of at least one minute. This can be contrasted with the usual environment agency hourly reporting. The units of the variogram are the same as the units of variance, in this case the variance of PM2.5.

29 Figure 4.3: Significance of variogram nugget and range. The range characterizes the spatial scale beyond which separation the data is no longer correlated, so this is a useful way to determine the spatial and/or temporal scales of our data. The nugget (the variogram at zero separation) characterizes the experimental error in our observations.

Figure 4.4: The lower panel shows an example temporal variogram for observed PM2.5 (fitted range: 0.98 minutes). The units of the variogram are the same as the units of variance, in this case of PM2.5. The upper panel shows how many observations are in each lag time bin.

Let us now consider a longer time-frame. Figure 4.5 (a) (the upper panel) shows an observed PM2.5 time series in µg/cm−3 (red line). Values are recorded every two seconds, and a one-hour moving time window centered on each time point is then considered. For this one-hour time window we calculate the representativeness uncertainty, σrep (the average deviation of the distribution of all measurements made over the one-hour moving window). The green lines either side of the observed PM2.5 indicate this representativeness uncertainty.

Figure 4.5 (b) shows the representativeness uncertainty for PM2.5 over a one-hour moving time window in µg/cm−3 centered on each observation. We note that the representativeness uncertainty is a significant fraction of the observed PM2.5, with representativeness uncertainties sometimes exceeding twice the observed PM2.5 value, i.e. there can be substantial small time-scale variability that should be routinely characterized.
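As a sketch of how this moving-window uncertainty could be computed: assuming the "average deviation of the distribution" can be approximated by the standard deviation of the roughly 1,800 two-second samples in a centred one-hour window (an assumption on my part, not necessarily the exact statistic used), one could write:

```python
import pandas as pd

def representativeness_uncertainty(pm25: pd.Series, samples_per_window: int = 1800) -> pd.DataFrame:
    """Moving-window spread of ~2 s observations (1800 samples ~ one hour).

    sigma_rep is approximated here by the standard deviation of the observations
    inside a centred one-hour window; the fractional uncertainty divides it by
    the observed value itself.
    """
    sigma_rep = pm25.rolling(window=samples_per_window, center=True, min_periods=30).std()
    return pd.DataFrame({
        "pm25": pm25,
        "sigma_rep": sigma_rep,
        "fractional_sigma_rep": sigma_rep / pm25,
    })

# Hypothetical usage, with pm25 a pandas Series indexed by timestamp:
# unc = representativeness_uncertainty(pm25)
```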

The time-scale of this PM2.5 variability is what the range of the variogram seeks to characterize. Figure 4.5 (c) shows a histogram of the ranges for each variogram calculated over the entire time series. A separate variogram is calculated for all the PM2.5 observations taken over the one hour moving time window centered on each observation taken every two- second. The most frequent range (time-scale) for this time-series is for a lag of 9 minutes. So

31 PM with Representativeness Uncertainty 2.5

40

30 2.5 20 PM 10

0 May 2018 Jun 2018 Jul 2018 Aug 2018 Sep 2018 Oct 2018 Nov 2018 Dec 2018 Jan 2019 Feb 2019 (a) Time

PM Representativeness Uncertainty 2.5

10

Rep Unc 5 2.5

PM 0 May 2018 Jun 2018 Jul 2018 Aug 2018 Sep 2018 Oct 2018 Nov 2018 Dec 2018 Jan 2019 Feb 2019 (b) Time

(c) (d)

−3 Figure 4.5: (a) Observed PM2.5 time series in µg/cm (shown in red). Values are recorded every two seconds, then a one-hour moving time window centered on each time point is considered. For this one-hour time window we calculate the representativeness uncertainty, σrep. The green lines either side of the observed PM2.5 indicate this representativeness uncertainty. (b) The representativeness uncertainty over a one hour moving time window in µg/cm−3. Note that the representativeness uncertainty is a significant fraction of the observed PM2.5. (c) A histogram of the range of each variogram. A separate variogram is considered for all observations taken over the one-hour moving time window centered on each observation taken every two-second is calculated. The most frequent range (time-scale) for this time-series is for a lag of 9 minutes. So ideally an observation should be reported every few minutes so that this dominant time-scale of temporal variations can be adequately resolved. (d) A histogram of the fractional representativeness uncertainty, σrep, for the entire time-series.

32 PM with Representativeness Uncertainty 10

100 10

PM 50

0 May 2018 Jun 2018 Jul 2018 Aug 2018 Sep 2018 Oct 2018 Nov 2018 Dec 2018 Jan 2019 Feb 2019 (a) Time

PM Representativeness Uncertainty 10

30

20 Rep Unc 10 10 PM 0 May 2018 Jun 2018 Jul 2018 Aug 2018 Sep 2018 Oct 2018 Nov 2018 Dec 2018 Jan 2019 Feb 2019 (b) Time

(c) (d)

−3 Figure 4.6: (a) Observed PM10 time series in µg/cm (shown in red). Values are recorded every two seconds, then a one hour moving time window centered on each time point is con- sidered. For this one hour time window we calculate the representativeness uncertainty, σrep. The green lines either side of the observed PM10 indicate this representativeness uncertainty. (b) The representativeness uncertainty over a one hour moving time window in µg/cm−3. Note that the representativeness uncertainty is a significant fraction of the observed PM10. (c) A histogram of the range of each variogram. A separate variogram is considered for all observations taken over the one hour moving time window centered on each observation taken every two-second is calculated. The most frequent range (time-scale) for this time-series is for a lag of 9 minutes. So ideally an observation should be reported every few minutes so that this dominant time-scale of temporal variations can be adequately resolved. (d) A histogram of the fractional representativeness uncertainty, σrep, for the entire time-series.

So ideally an observation should be reported every few minutes so that this dominant time-scale of PM2.5 temporal variations can be adequately resolved.

Figure 4.5 (d) shows a histogram of the PM2.5 fractional representativeness uncertainty,

σrep, for the entire time-series. The most frequent fractional representativeness uncertainty lies between 0 and 0.1 (i.e. between 0 and 10% of the observed PM2.5 abundance). However, for a significant fraction of the time, the fractional representativeness uncertainty exceeds 1, i.e. the representativeness uncertainty is large compared to the observed PM2.5 abundance.

Figure 4.6 shows very similar results for PM10, except that the variability is a little higher than for PM2.5.
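The moving-window variogram calculation described above can be sketched as follows. This is a minimal illustration in Python/NumPy, assuming a simple binned empirical semivariogram and taking the standard deviation within the window as one convention for the representativeness uncertainty; the variable names and the synthetic data are placeholders, not the processing code actually used for these figures.

```python
import numpy as np

def empirical_variogram(t, y, lag_bins):
    """Binned empirical semivariogram: gamma(h) = 0.5 * mean[(y_i - y_j)^2] for lag |t_i - t_j| in bin h."""
    i, j = np.triu_indices(len(t), k=1)          # all unique observation pairs
    lags = np.abs(t[i] - t[j])
    sqdiff = (y[i] - y[j]) ** 2
    gamma, counts = [], []
    for lo, hi in zip(lag_bins[:-1], lag_bins[1:]):
        sel = (lags >= lo) & (lags < hi)
        counts.append(int(sel.sum()))
        gamma.append(0.5 * sqdiff[sel].mean() if sel.any() else np.nan)
    return np.array(gamma), np.array(counts)

# Illustrative one-hour window of two-second observations (synthetic PM2.5-like values).
rng = np.random.default_rng(0)
t = np.arange(0.0, 3600.0, 2.0)                      # seconds within the moving window
y = 15.0 + np.cumsum(rng.normal(0.0, 0.2, t.size))   # placeholder time series
gamma, counts = empirical_variogram(t, y, np.arange(0.0, 1860.0, 60.0))  # 1-minute lag bins

# One simple convention for the representativeness uncertainty of this window:
sigma_rep = y.std()
```

The range of a variogram model fitted to gamma then gives the dominant time-scale for that window; repeating this for every window builds up histograms such as those in Figures 4.5 (c) and 4.6 (c).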

4.4 Summary

We have used variograms to characterize the observed timescales in the concentration of airborne particulates seen over many months. A spectrum of timescales was observed, with a dominant timescale of around 9 minutes for both airborne PM2.5 and PM10. This means that to adequately characterize the time variation of airborne particulates, measurements should be made at intervals of at most half this timescale, i.e. around every 5 minutes. However, timescales of less than a minute are also routinely observed. So the typical one-hour interval used by environmental agencies to report the concentrations of airborne particulates is really too long, and measurements at this frequency do not adequately resolve the temporal variability. Based on this variogram analysis, it is recommended that airborne particulate observations be made with a frequency of at least once per minute.

CHAPTER 5

BUILDING EMPIRICAL PHYSICAL MODELS OF AIRBORNE

PARTICULATES USING MACHINE LEARNING

The work shown in this chapter has been published as “Liu, X., Wu, D., Zewdie, G. K., Wijeratne, L., Timms, C. I., Riley, A., ... & Lary, D. J. (2017). Using machine learning to estimate atmospheric Ambrosia pollen concentrations in Tulsa, OK. Environmental Health Insights, DOI: 10.1177/1178630217699399.”

Machine learning allows us to learn by example, and to give our data a voice. It is particularly useful for those applications for which we do not have a complete theory. Machine learning can be thought of as an automated implementation of the scientific method (Domingos, 2015), following the same process of generating, testing, and discarding or refining hypotheses. While a scientist or engineer may spend their entire career coming up with and testing a few hundred hypotheses, a machine-learning system can do the same in a fraction of a second. Machine learning provides an objective set of tools for automating discovery.

Machine learning provides a straightforward framework for holistically bringing together data on many aspects/parts of a system, even if we do not have a theory that completely links each of these parts. Machine learning is closely related to (and often overlaps with) computational statistics and data science, which also use computers to build empirical models. In this dissertation, use is made of machine learning for multi-variate non-linear non-parametric regression (supervised learning) for estimating the abundance of airborne pollen. Unsupervised machine learning is also used for characterizing the different phases of the pollen-producing season.

Figure 5.1: An overview of some of the often used types of machine learning (Pilotte).

5.1 Introduction to Machine Learning

A famous definition by Arthur Samuel declares that ‘machine learning is the field of study that gives computers the ability to learn without being explicitly programmed’ (Samuel, 1959).

In the 1950s Samuel wrote a checkers-playing program. Even though he was not a very good checkers player, Samuel had the program learn from a large number of games: by watching what sorts of board positions tended to lead to wins and what sorts tended to lead to losses, the checkers-playing program learned over time which board positions were good and which were bad. In this manner, the program eventually learned to play checkers better than Arthur Samuel himself!

In machine learning terminology this is referred to as supervised learning, i.e. learning by example. In general, machine learning problems fall into two broad types: Supervised learn- ing and unsupervised learning. Figure 5.1 outlines some of the machine learning approaches that are often used.

5.1.1 Supervised Learning

Mitchell provided a more recent definition of supervised machine learning: ‘a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.’ In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.

Supervised learning usually involves performing a regression or a classification. Figure 5.2 shows two schematics illustrating conceptually how the data is organized for supervised machine learning. The green columns represent the inputs, the yellow/orange column represents the outputs. In a regression problem, we are typically trying to use a set of input variables, say x1 through xn, to predict a continuous real-valued output. In a classification problem, we are typically trying to use a set of input variables, say x1 through xn, to predict a discrete categorical output.
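As a minimal illustration of this layout (the numbers below are made up and purely illustrative), the training table of Figure 5.2 can be held as an input matrix X and an output vector y:

```python
import numpy as np

# Each row is one training example; the columns x1..xn are the inputs (features).
X = np.array([[0.2, 1.3, 7.0],    # example 1: x1, x2, x3
              [0.4, 0.9, 6.1],    # example 2
              [0.1, 1.7, 8.2]])   # example 3

y_regression = np.array([35.0, 28.5, 41.2])  # continuous output -> regression
y_classification = np.array([1, 0, 1])       # categorical output -> classification
```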

5.1.2 Unsupervised Learning

The purpose of the unsupervised classification is to split the data up into a distinct set of subgroups, often called classes or clusters. Unsupervised learning finds these class labels objectively without being provided them. This can also be regarded as dimensionality reduction. An advantage over approaches such as Principal Component Analysis (PCA) is that they typically perform better on non-linear data-sets (Dong and McAvoy, 1996). Figure 5.3 shows the data structure used by the clustering method.
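As a minimal sketch of this kind of unsupervised classification, k-means (one commonly used clustering algorithm; the specific algorithm is not prescribed by the discussion above) can be applied to a feature matrix whose rows are, for example, days of the pollen season. The data below are random placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Rows are days, columns are descriptors (e.g. pollen count and selected
# environmental variables); random numbers stand in for real observations.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))

X_std = StandardScaler().fit_transform(X)   # put features on a common scale
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_std)
# 'labels' assigns each day to one of the discovered subgroups, e.g. phases of the season.
```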

[Figure 5.2, panels (a) regression and (b) classification: y = f(x1, x2, x3, x4, x5, …, xn); the training data consist of input columns x1 … xn and output column(s) y; multivariate, non-linear, non-parametric; n can be very large.]

Figure 5.2: Schematics illustrating conceptually how the data is organized for supervised ma- chine learning. The green columns represent the inputs, the yellow/orange column represents the outputs.

Figure 5.3: Schematic illustrating conceptually how the data is organized for unsupervised machine learning (clustering). The green columns represent the inputs x1 … xn; there are no outputs. The approach is multivariate, non-linear, and non-parametric, and n can be very large. The purpose of the unsupervised classification is to split the data up into a distinct set of subgroups.

5.1.3 Feature Engineering

The input variables for machine learning are often called features. So-called ‘feature engineering’ is the process of trying to determine what are the best input features to use for the best performing machine learning model.

Let us now take a look at a set of different machine learning approaches, and then use them to build empirical models of the concentration of atmospheric pollen.

5.2 LASSO

Least absolute shrinkage and selection operator (LASSO) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces (Hans, 2009).

5.2.1 Algorithm

LASSO builds on the original least-squares regression method and keeps the same basic formula. However, motivated by the fact that coefficient estimates need not be unique when covariates are collinear, it imposes its own constraint for estimator selection and its own step-wise iteration condition. LASSO is also related to ridge regression through its approach to subset selection, so-called soft thresholding (Tibshirani, 2011). The equation of the LASSO method is shown below (Equation 5.1). In this equation, α is the intercept, the βj are the regression coefficients, the xij are the input variables, the yi are the observed output values, and t is a tuning parameter that controls the amount of shrinkage.

$$(\hat{\alpha}, \hat{\beta}) = \operatorname*{argmin}_{\alpha,\,\beta} \left\{ \sum_{i=1}^{N} \Big( y_i - \alpha - \sum_{j} \beta_j x_{ij} \Big)^{2} \right\} \quad \text{satisfying} \quad \sum_{j} |\beta_j| \leq t \qquad (5.1)$$

LASSO holds several computational advantages over previous regression models, such as least squares and ridge regression. The original lasso paper used an off-the-shelf quadratic program solver.

This does not scale well and is not transparent. The LARS algorithm (Efron, 1992) gives an efficient way of solving the lasso and connects the lasso to forward stage-wise regression.

The same algorithm is contained in the homotopy approach of Osborne et al. (Osborne et al., 2000). Coordinate descent algorithms are extremely simple and fast, and exploit the assumed sparsity of the model to great advantage.
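A minimal sketch of fitting a LASSO regression is shown below, using the coordinate-descent implementation in scikit-learn. It solves the penalized form of Equation (5.1), where the penalty weight alpha plays the role of the constraint t; the data, the train/validation split, and the value of alpha are placeholder assumptions rather than the exact configuration used for the results that follow.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder data: rows are days, columns are the environmental input variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(2180, 85))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=2180)   # synthetic pollen-like target

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.05, random_state=0)

scaler = StandardScaler().fit(X_tr)
lasso = Lasso(alpha=0.1, max_iter=10000).fit(scaler.transform(X_tr), y_tr)

n_selected = np.count_nonzero(lasso.coef_)   # variables kept by the L1 penalty
r_train = np.corrcoef(lasso.predict(scaler.transform(X_tr)), y_tr)[0, 1]
r_valid = np.corrcoef(lasso.predict(scaler.transform(X_va)), y_va)[0, 1]
```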

5.2.2 Result

Figure 5.4 shows a scatter diagram for the LASSO pollen estimate. The x-axis shows the observed pollen amount, and the y-axis shows the LASSO estimated pollen amount. The blue circles depict the training dataset, which has a correlation coefficient, RT = 0.53. The red squares depict the independent validation dataset, which has a correlation coefficient,

RV = 0.56.

Figure 5.4: Scatter diagram for the airborne pollen estimates made using the LASSO approach. The x-axis shows the actual pollen and the y-axis the estimated pollen; the 1:1 line, the training data (#2071), and the independent validation data (#109) are shown (RT = 0.53, RV = 0.56).

5.3 Neural Networks

Neural Networks, more properly referred to as ‘artificial’ Neural Networks (ANN), were conceived by the inventor of one of the first neurocomputers, Dr. Robert Hecht-Nielsen. He defines a Neural Network as: ‘a system made up of many simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs’ (Caudill, 1987).

5.3.1 Algorithm

Warren McCulloch and Walter Pitts (McCulloch and Pitts, 1943) created a computational model for Neural Networks based on mathematics and algorithms called threshold logic. This model paved the way for Neural Network research to split into two distinct approaches. One approach focused on biological processes in the brain and the other focused on the application of Neural Networks to artificial intelligence. As can be seen in Figure 5.5, Neural Networks are typically organized in layers. Layers are made up of a number of interconnected ‘nodes’ which contain an ‘activation function’. Patterns are presented to the network via the ‘input layer’, which communicates to one or more

A feedforward neural network with three inputs, two hidden neurons, and one output neuron.

$$\hat{y} = b^{2} + \sum_{i=1}^{2} w_{i}^{2}\, \sigma\!\left( b_{i}^{1} + \sum_{j=1}^{3} w_{i,j}^{1}\, x_{j} \right)$$

Figure 5.5: Schematic of a single hidden layer, feed-forward Neural Network. Each arrow corresponds to a real-valued parameter, or a weight, of the network. The values of these parameters are tuned in the network training. b are the biases, w are the weights, σ is the activation function.

‘hidden layers’ where the actual processing is done via a system of weighted ‘connections’. The hidden layers then link to an ‘output layer’ where the answer is output. What has attracted the most interest in Neural Networks is the possibility of learning. Given a specific task to solve, and a class of functions F, learning means using a set of observations to find f∗ ∈ F which solves the task in some optimal sense. Learning algorithms search through the solution space to find a function that has the smallest possible cost (as measured by a cost function). There are three major learning paradigms, each corresponding to a particular abstract learning task: supervised learning, unsupervised learning, and reinforcement learning. In this dissertation, we will first use supervised learning for regression. In supervised learning, we are given a set of example pairs (the training dataset) (x, y), x ∈ X, y ∈ Y, and the aim is to find a function f : X → Y in the allowed class of functions that matches the examples. In other words, we wish to infer the mapping implied by the data; the cost function

is related to the mismatch between our mapping and the data and it implicitly contains prior knowledge about the problem domain. When one tries to minimize this cost using gradient descent for the class of Neural Networks called multilayer perceptrons (MLP), one obtains the common and well-known back-propagation algorithm for training Neural Networks.

Other than regression, there are two more applications of supervised Neural Network learning: pattern recognition (also known as classification) and sequential data recognition, which could be potential tools in my further research.
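A minimal sketch of a feed-forward neural network regression, in the spirit of Figure 5.5, is shown below using scikit-learn's MLPRegressor. The architecture, solver, and data are illustrative assumptions, not the exact configuration used for the results below.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Placeholder inputs: 85 environmental variables per day, with a synthetic target.
rng = np.random.default_rng(0)
X = rng.normal(size=(2180, 85))
y = np.tanh(X[:, :3].sum(axis=1)) * 100 + rng.normal(0, 5, 2180)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.05, random_state=0)

# One hidden layer of logistic units, trained by minimizing the squared-error cost
# with a gradient-based optimizer (back-propagation supplies the gradients).
net = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(20,), activation="logistic",
                 solver="adam", max_iter=2000, random_state=0),
)
net.fit(X_tr, y_tr)
r_valid = np.corrcoef(net.predict(X_va), y_va)[0, 1]
```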

5.3.2 Result

Figure 5.6 shows the Neural Network scatter diagram. The validation correlation coefficient,

RV = 0.61, is not as good as that of the next approach we will consider, the Random Forest.


Figure 5.6: Scatter diagram for the airborne pollen estimates made using a Neural Network.

5.4 Ensembles of Decision Trees

A Random Forest is an ensemble statistical learning approach, consisting of an ensemble of decision trees (Ho, 1998; Breiman, 2001). It has proved to be a very useful multi-variable, non-linear, non-parametric approach for both regression and supervised classification. Ensemble methods are less prone to over-learning the noise of the data and typically provide better generalization (Kursa, 2014). A random forest also provides a useful ranking of the relative importance of the predictors. Let us first look at the building block of random forests, the individual decision trees.

5.4.1 Decision Trees

A decision tree is a flowchart-like structure in which each internal node represents a test on an attribute (Figure 5.7), each branch represents the outcome of the test, and each leaf node represents a class label (the decision taken after computing all attributes). The paths from root to leaf represent classification rules. In decision analysis a decision tree and the closely related influence diagram are used as a visual and analytical decision support tool, where the expected values (or expected utility) of competing alternatives are calculated.

Decision trees are commonly used in operations research and operations management, and for calculating conditional probabilities, because of the following advantages:

1. They are easy to understand and interpret.

2. They clearly describe complex cases for analysis.

3. They allow the addition of new possible scenarios.

4. They help determine the worst, best and expected values for different scenarios.

5. They can easily be combined with other decision techniques.

Figure 5.7: A decision tree consists of 3 types of nodes: 1. Decision nodes, commonly represented by squares; 2. Chance nodes, represented by circles.

Decision trees also have two disadvantages:

• For data including categorical variables with different numbers of levels, the information gain in decision trees is biased in favor of those attributes with more levels (Deng et al., 2011).

• Calculations can get very complex particularly if many values are uncertain and/or if many outcomes are linked.

5.4.2 Random Forest

Random forests were introduced by (Ho, 1998). Typically an ensemble of weak learners will perform better than a single learner on its own and will be less likely to overfit the noise in a dataset. Breiman's work with Random Forests was influenced by the earlier work of (Amit and Geman, 1997), who introduced the idea of searching over a random subset of the available decisions when splitting a node, in the context of growing a single tree. Finally, the idea of

randomized node optimization, where the decision at each node is selected by a randomized procedure rather than a deterministic optimization, was introduced by (Dietterich, 2000). The introduction of Random Forests as they are used today was due to (Breiman, 2001). A Random Forest can be briefly described as a bootstrap (Johnson, 2001) built on decision trees – a tree bagger. In an important paper on written character recognition, (Amit and Geman, 1997) define a large number of geometric features and search over a random selection of these for the best split at each node.

Figure 5.8: A Random Forest is a classifier consisting of a collection of tree-structured classifiers {h(~x,Θk), k=1,...} where the {Θk} are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input ~x

The common element in all of these procedures is that for the kth tree, a random vector

Θk is generated, independent of the past random vectors Θ1,...,Θk−1 but with the same distribution; and a tree is grown using the training set and Θk, resulting in a classifier h(~x, Θk) where ~x is an input vector. For instance, in bagging the random vector Θ is generated as the counts in N boxes resulting from N darts thrown at random at the boxes, where N is the number of examples in the training set. In random split selection Θ consists of

a number of independent random integers between 1 and K. The nature and dimensionality of Θ depends on its use in tree construction. After a large number of trees is generated, they vote for the most popular class. An ensemble of such trees is called a Random Forest (Figure 5.8).

5.4.3 Advantages of a Random Forest

Random Forests are less prone to overfitting the training dataset (Kursa, 2014). They also provide an approach for ranking the relative importance of the input variables. “Random Forests give results competitive with boosting and adaptive bagging, yet do not progressively change the training set. Their accuracy indicates that they act to reduce bias” (Breiman, 2001). This is referred to as robustness. Similar to Artificial Neural Networks, Random Forests can be used for both supervised regression and classification. A useful feature of Random Forests is their ability to provide information on the relative importance of the input variables. Random Forests can be used to rank the importance of variables in a regression or classification problem in a natural way (Breiman, 2001). The first step in measuring the

variable importance in a data set Dn = {(Xi, Yi), i = 1, ..., n} is to fit a Random Forest to the data. During the fitting process, the out-of-bag error for each data point is recorded and averaged over the forest (errors on an independent test set can be substituted if bagging is not used during training). To measure the importance of the jth feature after training, the values of the jth feature are permuted among the training data and the out-of-bag error is again computed on this perturbed data set. The importance score for the jth feature is computed by averaging the difference in out-of-bag error before and after the permutation over all trees. The score is normalized by the standard deviation of these differences. Features that produce large values for this score are ranked as more important than features which produce small values.

This method of determining variable importance has some drawbacks. For data including categorical variables with different numbers of levels, Random Forests are biased in favor of those attributes with more levels.
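A minimal sketch of this permutation-importance procedure, using scikit-learn's RandomForestRegressor and its permutation_importance helper on a held-out sample, is given below; the data are random placeholders standing in for the 85 environmental variables.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2180, 85))
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=2180)   # placeholder target

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.05, random_state=0)

forest = RandomForestRegressor(n_estimators=50, oob_score=True,
                               random_state=0).fit(X_tr, y_tr)

# Permute one feature at a time on held-out data and record how much the score drops;
# features whose permutation hurts the most are ranked as most important.
result = permutation_importance(forest, X_va, y_va, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
```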

5.4.4 Estimating Ambrosia Pollen Using Random Forests

A Random Forest can facilitate estimation of the pollen count as a multi-variate, non-parametric function of N input variables, i.e.

$$\text{pollen count} = f_{\text{Random Forest}}(x_1, \ldots, x_N) \qquad (5.2)$$

where x1, . . . , xN are the N readily available environmental parameters (listed in the appendix). Two enhancements were then made to a standard Random Forest implementation that both improved the performance and provided an estimated error for each pollen count that is estimated. The enhancement was inspired by the Newton-Raphson iteration. A series of iterations was executed; in each iteration, a Random Forest was used to estimate the pollen count as indicated in equation (5.2). Then, the estimated pollen count was compared with the actual pollen count to calculate an error, that is:

error = observation − estimation (5.3)

Next, an additional Random Forest was used to learn this error. After each iteration, the Random Forest estimate of the pollen count was then corrected using the error estimated by this additional Random Forest, that is, by rearranging Equation (5.3) and replacing the observation by our Random Forest estimate of the pollen count and by replacing the error with the estimated error provided by the second Random Forest:

improved estimation = initial estimation + estimated error (5.4)

This was then repeated for a set of n iterations (we used n=10). After each iteration, the estimated pollen count and estimated pollen count error were added as additional input variables for the next iteration. This considerably improved the reliability of our estimated pollen count, as can be seen by comparing the verification scatter diagrams in Figure 5.9.

Figures 5.9a and 5.9b are verification scatter diagrams, with the x-axis showing the observed amount of pollen and the y-axis showing the estimated amount of pollen, while the error bars show the estimated uncertainty. These estimates do not require the phenology to be specified, yet show a substantial improvement over the prior study shown in Figure 2.1.

Figure 5.9a shows the scatter diagram for the first Newton-Raphson iteration and Figure 5.9b shows the much improved scatter diagram after the tenth iteration. In each case the blue line shows a perfect 1:1 response, the green points depict the training data, the red points depict the independent validation. The random forest has performed very well.

Interestingly, when the pollen estimations were tested using a completely independent data sample not used in the training (the validation data set), the correlation coefficient is as good as that for the training dataset. These scatter diagrams show the remarkable ability of the iterative Random Forest approach to accurately estimate the airborne pollen count.

Figure 5.9c shows the correlation coefficient for the training and independent validation datasets as a function of training iteration. The error does not substantially reduce after four iterations. Figure 5.9d shows a histogram of the residuals between the observed and estimated pollen counts after the tenth iteration.

Figure 5.9e shows the machine learning error as a function of the number of decision tree estimators in the random forest ensemble. No further reduction in the error occurs after the ensemble size reaches about forty trees. In the results presented here, we used an ensemble size of fifty trees.
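For concreteness, the iterative error-correction scheme of Equations (5.2)–(5.4) can be sketched as follows. The ensemble size of fifty trees and the ten iterations follow the description above; the function and variable names are otherwise illustrative assumptions rather than the exact code used for Figure 5.9.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def iterative_forest(X_tr, y_tr, X_va, n_iter=10, n_trees=50, seed=0):
    """Iteratively refine a Random Forest pollen estimate with a residual model."""
    feats_tr, feats_va = X_tr.copy(), X_va.copy()
    for _ in range(n_iter):
        # Equation (5.2): estimate the pollen count from the current feature set.
        f_est = RandomForestRegressor(n_estimators=n_trees, random_state=seed).fit(feats_tr, y_tr)
        est_tr, est_va = f_est.predict(feats_tr), f_est.predict(feats_va)

        # Equation (5.3): error = observation - estimation, learned by a second forest.
        f_err = RandomForestRegressor(n_estimators=n_trees, random_state=seed).fit(feats_tr, y_tr - est_tr)
        err_tr, err_va = f_err.predict(feats_tr), f_err.predict(feats_va)

        # Equation (5.4): improved estimation = initial estimation + estimated error.
        impr_tr, impr_va = est_tr + err_tr, est_va + err_va

        # Append the estimate and estimated error as additional inputs for the next pass.
        feats_tr = np.column_stack([feats_tr, impr_tr, err_tr])
        feats_va = np.column_stack([feats_va, impr_va, err_va])
    return impr_va, err_va
```

In this sketch the residual forest's prediction doubles as the estimated error attached to each pollen estimate, in the spirit of the error bars shown in Figures 5.9a and 5.9b.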

Figure 5.10 shows the machine learning model estimate of Ambrosia (Ragweed) pollen and the actual pollen observation for all days in all 28 seasons.

(a) Scatter diagram for the 1st training iteration. The x-axis shows the actual pollen abundance, the y-axis shows the estimated pollen abundance. The blue line shows a perfect 1:1 response, the green points depict the training data, and the red points depict the independent validation.

(b) Scatter diagram for the 10th iteration. The blue line shows a perfect 1:1 response, the green points depict the training data, and the red points depict the independent validation.

(c) The correlation coefficient for the scatter diagram as a function of training iteration. The blue line is for the training data, the red line is for the independent validation.

(d) A histogram showing the residual error of the machine learning estimate (i.e. pollen observation minus machine learning estimate) for the tenth iteration.

(e) Machine learning error as a function of the number of decision tree estimators in the random forest ensemble.

Figure 5.9: Using Random Forest Regression to estimate the airborne abundance of Ambrosia (Ragweed) pollen.

Figure 5.10: Plot of the model-estimated pollen count versus the actual observed pollen count as a function of time (days).

In the previous work of (Howard and Levetin, 2014), who provided the pollen data for our research, phenology was a significant factor in predicting the geographic spread of Ragweed plants, and consequently, Ragweed pollen (Howard and Levetin, 2014). Phenology is defined here as the averaged daily airborne pollen concentration of all prior years. As mentioned in §2.1, obtaining the prior years' pollen concentration requires extensive laboratory work. Thus, our model estimates the pollen abundance solely from the current and recent environmental context, even if no prior pollen observations were available at that location. Science question: Is it possible for us to predict the pollen abundance using just characteristics of the physical environment? There has been a large amount of prior research conducted on which environmental variables affect the pollen abundance; however, none of these models faithfully describe the details of the day-to-day pollen variability. We will use machine learning to construct an accurate model and provide physical insights into the relative importance of the various environmental variables in estimating the pollen abundance. §5.4.4 presents our machine learning model for pollen estimation, built using a machine learning algorithm combined with a comprehensive environmental and pollen training dataset. The results related to the Ambrosia pollen estimation presented in this chapter have already been published in the peer-reviewed journal Environmental Health Insights (Liu et al., 2017).
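The 'recent environmental context' enters the model as lagged copies of the input variables (for example, a variable's value 15 or 30 days before the target day). A minimal sketch of constructing such lagged features with pandas is shown below; the column name, the dates, and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical daily table with a vegetation-greenness column 'GRN'.
days = pd.date_range("2012-06-01", periods=120, freq="D")
df = pd.DataFrame({"GRN": np.linspace(0.2, 0.6, len(days))}, index=days)

# Lagged copies: the value measured 15 and 30 days before each target day.
for lag in (15, 30):
    df[f"GRN_lag{lag}"] = df["GRN"].shift(lag)

# The first `lag` rows are NaN and are dropped before training the regression model.
df = df.dropna()
```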

Table 5.1: Table of correlation coefficients for the various machine learning approaches used in this study, with the best performing approach listed first. RT is the correlation coefficient for the training dataset. RV is the correlation coefficient for the totally independent validation dataset.

Machine Learning Approach                      Training, RT   Validation, RV
Without Phenology
  Random Forest                                0.98           1
  Neural Network                               0.91           0.61
  LASSO                                        0.53           0.56
With Phenology
  Prior Multi-linear Study                     0.68

We found that a holistic set of data can be combined with machine learning algorithms to effectively build an empirical model estimating the daily pollen concentration. Some of these machine learning models can also provide some useful insights into the relative importance of the inputs. The regression models use all meteorology and soil variables as estimators and output the estimated daily pollen concentration. For each machine learning approach we used, the performance was quantified using a scatter diagram. In the scatter diagram the actual observations were plotted against the machine learning estimates from the current study. A perfect scatter diagram is a straight line with a slope of one and an intercept of zero. In each case, the data were randomly split into two independent samples; one sample was used for training and the second sample for an independent validation, that is, the validation data were not used in the training stage of the algorithms. Table 5.1 shows the correlation coefficients for the various machine learning approaches used in this study. The best performing approach, namely the Random Forest, is

listed first. RT is the correlation coefficient for the training dataset and RV is the correlation coefficient for the totally independent validation dataset. By comparing the correlation coefficients between the independent validation dataset and our machine learning models we see that the Random Forest provides the best regression result.

Figure 5.11: Rank of the variables' relative importance.

The Random Forest algorithm also provided the relative importance of all the input variables. Figure 5.11 shows the importance of the input variables in descending order of importance. In Figure 5.11, abbreviations of variable names are placed next to their bar of importance. Lag30 means the variable is the value measured 30 days before the target value is measured. The full variable names, corresponding to abbreviated variable names such as lGRN, Z0H, and DISPH, are listed in the appendix. It can be seen that the variable importance decreases sharply after the top few variables. Only the top 20 variables are shown in Figure 5.11. These top few variables contribute most of the information that is required for estimating the pollen abundance. The most important input variables are the vegetation greenness, the displacement height, the roughness length for sensible

heat, soil evaporation, and the energy stored in all reservoirs.

The vegetation greenness depends on several factors such as the number and type of plants, how leafy they are, and how healthy they are. These plants obviously include the Ambrosia plants (Burgan and Hartford, 1993). The high rank of the greenness is to be expected: the better the plants grow, the more pollen will be produced, and the airborne pollen concentration will thus be affected indirectly. (Sölter et al., 2012) gives a detailed description of how the temperature influences the growth of Ambrosia plants. The previous study of (Howard and Levetin, 2014) had a temperature term in their regression model shown in equation (2.1). It is noteworthy that the machine learning highlighted a lagged greenness. A healthy Ambrosia plant will need to mature before releasing the pollen, thus explaining the time delay. Soil evaporation will affect the amount of plant transpiration and will affect the plants' health status. For example, a period of drought stress would have a negative impact on plant health. The energy stored in all reservoirs will affect the temperature and be related to the soil moisture. The displacement height is a variable describing the vertical distribution of horizontal mean wind speeds within the lowest portion of the planetary boundary layer. Combining this with the equation of mass continuity, it is easy to understand that the wind speed is an important variable for the dispersion of airborne particulates such as pollen. The roughness length for sensible heat describes the height below which the pressure gradient will be affected by complex boundary conditions. It is important for an equation that describes the mean specific humidity in the dynamic sub-layer (Brutsaert, 1975). Humidity will influence the growth of plants as well as directly affecting the airborne particle density (Kamens et al., 1988).

5.5 Summary of Pollen Estimation Using Machine Learning

So far we have introduced machine learning and examined three different multi-variate regression (supervised learning) algorithms, which are least absolute shrinkage and selection

operator (LASSO), neural network and random forest. We also discussed the details of how we applied these algorithms in our multi-variable regression model, and estimated the pollen concentration over time with the features discussed in Chapter 2 (also expanded from variables listed in the appendix). Once the models were built we then evaluated their performance using scatter diagrams. We found that the random forest ensemble approach performed the best for our estimation of the Ambrosia pollen abundance. Let us now employ more machine learning regression approaches on a similar problem.

Let us see if we can estimate the abundance of airborne PM2.5 using machine learning. This

PM2.5 data was collected at UT Dallas over a period of many months. We will inter-compare a set of many approaches to see which provides the best performance for our fully non-linear non-parametric regression models for estimating the concentrations of airborne particulates.

5.6 Machine Learning Inter-comparison for PM2.5

To compare the performance of different machine learning approaches, multiple approaches were applied to the same data set, and the root mean square error (RMSE) between the estimated and the actual PM2.5 abundances was calculated using cross-validation. Cross-validation guarantees that the RMSE is computed on validation data, so that over-fitted results are avoided. The data set used in this comparison has the same feature variables as our pollen data, but the label data are replaced with PM2.5. The ranking of the various machine learning approaches by performance, based on the RMSE value, is listed in Table 5.2.
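A minimal sketch of such a cross-validated RMSE comparison is given below using scikit-learn; the handful of candidate models only stand in for the much larger set in Table 5.2, and the data are placeholders.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 85))                     # placeholder features
y = np.sin(X[:, 0]) * 10 + rng.normal(0, 1, 500)   # placeholder PM2.5-like target

candidates = {
    "Linear regression": LinearRegression(),
    "Decision tree": DecisionTreeRegressor(random_state=0),
    "Random forest (bagged trees)": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gaussian SVM": SVR(kernel="rbf"),
    "Gaussian process": GaussianProcessRegressor(),
}

# Five-fold cross-validation: the RMSE is always evaluated on held-out folds,
# so over-fitting the training data does not flatter the score.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    print(f"{name:30s} RMSE = {-scores.mean():.3f}")
```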

Table 5.2: Table of cross-validated root mean square error (RMSE) and correlation coefficient (R) values for multiple machine learning approaches. Values are listed in ascending order of RMSE.

Machine Learning Approach                                                 RMSE            R-squared
Optimized Ensemble Method (ensemble of trees;
  optimization: bagged ensembles, LSBoost, 30 iterations)                 0.88495         0.95
Exponential Gaussian Process Regression                                   0.94737         0.94
Rational Quadratic Gaussian Process Regression                            1.0145          0.94
Matern 5/2 Gaussian Process Regression                                    1.1617          0.92
Bagged Trees (random forest)                                              1.1986          0.91
Boosted Trees                                                             1.2024          0.91
Squared Exponential Gaussian Process Regression                           1.2657          0.90
Optimized Tree (optimization: Bayesian optimization)                      1.2679          0.90
Fine Tree (decision tree)                                                 1.3642          0.88
Two-layer Bayesian Regularization Neural Network                          1.515           0.90
Medium Tree (decision tree)                                               1.53            0.85
Coarse Tree (decision tree)                                               1.7406          0.81
Medium Gaussian SVM                                                       1.7607          0.81
Coarse Gaussian SVM                                                       1.7742          0.80
Fine Gaussian SVM                                                         3.4888          0.24
Optimized SVM (optimization: Bayesian optimization)                       4.0003
Linear regression                                                         217.55
Linear SVM                                                                237.74
Robust linear regression                                                  946.32
Quadratic SVM                                                             8.7948 × 10⁵
Cubic SVM                                                                 8.0001 × 10⁹

(a) A time series of the actual PM2.5 abundance (blue) and the Linear Regression estimated PM2.5 abundance (orange) as a function of time.

(b) A scatter diagram of the actual PM2.5 abundance (x-axis) against the Linear Regression estimate of PM2.5 (y-axis). Figure 5.12: Linear Regression Performance

(a) A time series of the actual PM2.5 abundance (blue) and the SVM estimated PM2.5 abundance (orange) as a function of time.

(b) A scatter diagram of the actual PM2.5 abundance (x-axis) against the ma- chine learning estimate of PM2.5 (y-axis) for SVM. Figure 5.13: Support Vector Machine Performance

(a) A time series of the actual PM2.5 abundance (blue) and the decision tree estimated PM2.5 abundance (orange) as a function of time.

(b) A scatter diagram of the actual PM2.5 abundance (x-axis) against the machine learning estimate of PM2.5 (y-axis) for decision tree. Figure 5.14: Decision Tree Performance

(a) A time series of the actual PM2.5 abundance (blue) and the Gaussian Process Regression estimated PM2.5 abundance (orange) as a function of time.

(b) A scatter diagram of the actual PM2.5 abundance (x-axis) against the Gaussian Process Regression estimate of PM2.5 (y-axis). Figure 5.15: Gaussian Process Regression Performance

(a) A time series of the actual PM2.5 abundance (blue) and the ensemble method estimated PM2.5 abundance (orange) as a function of time.

(b) A scatter diagram of the actual PM2.5 abundance (x-axis) against the machine learning estimate of PM2.5 (y-axis) for ensemble method. Figure 5.16: Ensemble Method Performance

It is obvious in the ranking presented in Table 5.2 that most of the conventional linear and polynomial fitting approaches performed poorly when trying to estimate the abundance of airborne PM2.5. It is clear that a multi-variate and non-linear approach is needed.

Figures 5.12a through 5.16b each show both a time series of the actual PM2.5 abundance

(blue) and the machine learning estimated PM2.5 abundance (orange) as a function of time, along with a scatter diagram of the actual PM2.5 abundance (x-axis) against the machine learning estimate of PM2.5 (y-axis) for a set of different approaches, starting with a linear regression and finishing with the best machine learning approach (an ensemble of trees). We can see in Figure 5.12a and Figure 5.12b that the linear regression clearly does not perform well. In the scatter diagram (Figure 5.12b), the lengths of the orange horizontal lines depict the linear regression prediction uncertainty. On average, these uncertainties are much larger than those in the later Figures 5.15b and 5.16b. The performance of the linear kernel support vector machine in Figures 5.13a and 5.13b, and of the decision tree in Figures 5.14a and 5.14b, is comparable to that of the linear regression. It is clear that to adequately predict the PM2.5 abundance a non-linear model is required. The non-linear machine learning approaches perform much better, e.g. the Gaussian Process Regression and the ensemble methods in Figure 5.15a through Figure 5.16b. It can be observed in Figures 5.15a and 5.16a that PM2.5 values above 30 are well predicted (surrounded by orange circles) in the time series. This improvement is reflected in the scatter diagrams in Figures 5.15b and 5.16b. In marked contrast to the conventional linear and polynomial fitting approaches, the machine learning approaches, such as Gaussian process regression, support vector machine (SVM) regression, decision tree, and random forest, perform well. It is found for many applications that the ensemble learner approaches, such as random forests, perform well. The performance of the ensemble of trees is improved further when hyper-parameter optimization is performed.
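As an illustration of the Gaussian Process Regression family in Table 5.2, the sketch below fits Gaussian processes with exponential, Matérn 5/2, rational quadratic, and squared exponential kernels using scikit-learn; the data, the added white-noise kernel, and all hyper-parameters are placeholder assumptions rather than the configuration behind Table 5.2.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, RationalQuadratic, WhiteKernel
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))                       # placeholder features
y = np.sin(X[:, 0]) * 10 + rng.normal(0, 1, 600)     # placeholder PM2.5-like target
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

kernels = {
    "Exponential (Matern nu=0.5)": Matern(nu=0.5) + WhiteKernel(),
    "Matern 5/2": Matern(nu=2.5) + WhiteKernel(),
    "Rational quadratic": RationalQuadratic() + WhiteKernel(),
    "Squared exponential (RBF)": RBF() + WhiteKernel(),
}

for name, kernel in kernels.items():
    gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X_tr, y_tr)
    pred, std = gpr.predict(X_va, return_std=True)   # mean and per-point uncertainty
    rmse = np.sqrt(np.mean((pred - y_va) ** 2))
    print(f"{name:28s} RMSE = {rmse:.3f}")
```

A useful by-product of the Gaussian process is the per-point predictive standard deviation, which provides an uncertainty estimate alongside each prediction.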

CHAPTER 6

SUMMARY

6.1 Conclusion

In this dissertation, research on airborne pollen and particulates was presented. A set of

empirical machine learning non-linear multi-variate non-parametric models were developed to

describe the abundance of airborne particulates. The machine learning provided some useful

physical insights into some of the key parameters that are most important in describing the

abundance of airborne Ambrosia (Ragweed) pollen; these included vegetation greenness,

the displacement height, the roughness length for sensible heat, soil evaporation,

and the energy stored in all reservoirs. The vegetation greenness depends on several

factors such as the number and type of plants, how leafy they are, and how healthy they

are. These plants obviously include the Ambrosia plants (Burgan and Hartford, 1993). The

high rank of the greenness is to be expected, the better the plants grow, the more pollen

will be produced. The airborne pollen concentration will thus be affected indirectly. (Sölter

et al., 2012) gives a detailed description on how the temperature influences the growth of

Ambrosia plants. The previous study of (Howard and Levetin, 2014) had a temperature

term in their regression model shown in equation (2.1). It is noteworthy that the machine

learning highlighted a lagged greenness. A healthy Ambrosia plant will need to mature before releasing the pollen, thus explaining the time delay. Soil evaporation will affect the amount of plant transpiration and will affect the plants’ health status. For example, a period of drought stress would have a negative impact on plant health. The energy stored in all reservoirs will affect the temperature and be related to the soil moisture.

The displacement height is a variable describing the vertical distribution of horizontal mean wind speeds within the lowest portion of the planetary boundary layer. Combining this with the equation of mass continuity, it is easy to understand that the wind speed is an important

variable for the dispersion of airborne particulates such as pollen. The roughness length for sensible heat describes the height below which the pressure gradient will be affected by complex boundary conditions. It is important for an equation that describes the mean specific humidity in the dynamic sub-layer (Brutsaert, 1975). Humidity will influence the growth of plants as well as directly affecting the airborne particle density (Kamens et al., 1988). When making atmospheric measurements it is important to have a spatial and temporal resolution that can adequately capture the spatial and temporal evolution of the variables of interest. In this dissertation, we used variograms to characterize the temporal scales on which atmospheric measurements of atmospheric pollen and particulates need to be made. The data used to calculate the variograms were captured every second over a period of many months. A spectrum of timescales was observed, with a dominant timescale of around 9 minutes for both airborne PM2.5 and PM10. This means that to adequately characterize the time variation of airborne particulates, measurements should be made at intervals of at most half this timescale, i.e. around every 5 minutes. However, timescales of less than a minute are also routinely observed. So the typical one-hour interval used by environmental agencies to report the concentrations of airborne particulates is really too long, and measurements at this frequency do not adequately resolve the temporal variability. Based on this variogram analysis, it is recommended that airborne particulate observations be made with a frequency of at least once per minute. This analysis is being utilized in the deployment of over one hundred sensors across the Dallas Fort Worth Metroplex over the next year.

6.2 Future Direction

Further work examining the mechanism for the observed time lag between the pollen abundance and the environmental state would be of interest. It was observed that many of the key variables for determining the pollen abundance had a lag of around either 30 days or 15 days (Figure 5.11). It would be interesting to consider a wider range of lags. The use of weather RADAR to estimate the atmospheric pollen abundance, and that of airborne particulates in general, would also be of interest.

APPENDIX

ENVIRONMENTAL VARIABLES USED IN POLLEN ESTIMATION

Table A.1: Variable names, abbreviations, and units.

Variable Description Units

EFLUX latent heat flux (positive upward) W · m−2

EVAP Surface evaporation kg · m−2 · s−1

HFLUX Sensible heat flux (positive upward) W · m−2

TAUX Eastward Surface wind stress N · m−2

TAUY Northward Surface wind stress N · m−2

TAUGWX Eastward gravity wave surface stress N · m−2

TAUGWY Northward gravity wave surface stress N · m−2

PBLH Planetary boundary layer height m

DISPH Displacement height m

BSTAR Surface buoyancy scale m · s−1

USTAR Surface velocity scale m · s−1

TSTAR Surface temperature scale K

QSTAR Surface humidity scale kg

RI Surface Richardson number non dimensional

ZOH Roughness length, sensible heat m

ZOM Roughness length, momentum m

HLML Height of center of lowest model layer m

TLML Temperature of lowest model layer K

QLML Specific humidity of lowest model layer kg


ULML Eastward wind of lowest model layer m · s−1

VLML Northward wind of lowest model layer m · s−1

RHOA Surface air density kg · m−3

SPEED 3-dimensional wind speed for surface fluxes m · s−1

CDH Surface exchange coefficient for heat kg · m−2 · s−1

CDQ Surface exchange coefficient for moisture kg · m−2 · s−1

CDM Surface exchange coefficient for momentum kg · m−2 · s−1

CN Surface neutral drag coefficient non dimensional

TSH Effective turbulence skin temperature K

QSH Effective turbulence skin humidity kg

FRSEAICE Fraction of sea-ice Fraction

PRECANV Surface precipitation flux from anvils kg · m−2 · s−1

PRECCON Surface precipitation flux from convection kg · m−2 · s−1

PRECLSC Surface precipitation flux from large-scale kg · m−2 · s−1

PRECSNO Surface snowfall flux kg · m−2 · s−1

PRECTOT Total surface precipitation flux kg · m−2 · s−1

PGENTOT Total generation of precipitation kg · m−2 · s−1

PREVTOT Total re-evaporation of precipitation kg · m−2 · s−1

GRN Vegetation greenness fraction Fraction

LAI Leaf area index m2

GWETROOT Root zone soil wetness fraction

GWETTOP Top soil layer wetness fraction

TPSNOW Top snow layer temperature K

TUNST Surface temperature of unsaturated zone K


TSAT Surface temperature of saturated zone K

TWLT Surface temperature of wilted zone K

PRECSNO Surface snowfall kg · m−2 · s−1

PRECTOT Total surface precipitation kg · m−2 · s−1

SNOMAS Snow mass kg · m−2

SNODP Snow depth m

EVPSOIL Bare soil evaporation W · m−2

EVPTRNS Transpiration W · m−2

EVPINTR Interception loss W · m−2

EVPSBLN Sublimation W · m−2

RUNOFF Overland runoff kg · m−2 · s−1

BASEFLOW Baseflow kg · m−2 · s−1

SMLAND Snowmelt kg · m−2 · s−1

FRUNST Fractional unsaturated area fraction

FRSAT Fractional saturated area fraction

FRSNO Fractional snow-covered area fraction

FRWLT Fractional wilting area fraction

PARDF Surface downward PAR diffuse flux W · m−2

PARDR Surface downward PAR beam flux W · m−2

SHLAND Sensible heat flux from land W · m−2

LHLAND Latent heat flux from land W · m−2

EVLAND Evaporation from land kg · m−2 · s−1

LWLAND Net downward longwave flux over land W · m−2

SWLAND Net downward shortwave flux over land W · m−2


GHLAND Downward heat flux at base of top soil layer W · m−2

TWLAND Total water store in land reservoirs kg · m−2

TELAND Energy store in all land reservoirs J · m−2

WCHANGE Total land water change per unit time kg · m−2 · s−1

ECHANGE Total land energy change per unit time W · m−2

SPLAND Spurious land energy source W · m−2

SPWATR Spurious land water source kg · m−2 · s−1

SPSNOW Spurious snow source kg · m−2 · s−1

PM2.5 Airborne Particulate µg · m−3

Soil Soil type non dimensional

Lithology Lithology non dimensional

Topography Topography m

PopulationDensity Population Density

Type Surface Type non dimensional

AlbedoWSABand1 Surface reflectivity at 470 nm non dimensional

AlbedoWSABand2 Surface reflectivity at 555 nm non dimensional

AlbedoWSABand3 Surface reflectivity at 670 nm non dimensional

AlbedoWSABand4 Surface reflectivity at 858 nm non dimensional

AlbedoWSABand5 Surface reflectivity at 1240 nm non dimensional

AlbedoWSABand6 Surface reflectivity at 1640 nm nondimensional

AlbedoWSABand7 Surface reflectivity at 2130 nm non dimensional

REFERENCES

Amit, Y. and D. Geman (1997). Shape quantization and recognition with randomized trees. Neural computation 9 (7), 1545–1588.

Ångström, A. (1962). Atmospheric turbidity, global illumination and planetary albedo of the earth. Tellus 14 (4), 435–450.

Atkinson, R. W., G. W. Fuller, H. R. Anderson, R. M. Harrison, and B. Armstrong (2010). Urban ambient particle metrics and health: a time-series analysis. Epidemiology, 501–511.

Bacsi, A., B. K. Choudhury, N. Dharajiya, S. Sur, and I. Boldogh (2006). Subpollen parti- cles: carriers of allergenic proteins and oxidases. Journal of allergy and clinical immunol- ogy 118 (4), 844–850.

Bank, W. (2016). The cost of air pollution: strengthening the economic case for action. Washington: World Bank Group.

Baskin, J. M. and C. C. Baskin (1980). Ecophysiology of secondary dormancy in seeds of ambrosia artemisiifolia. Ecology 61 (3), 475–480.

Bosilovich, M., S. Schubert, G. Kim, R. Gelaro, M. Rienecker, M. Suarez, and R. Todling (2006). Nasa’s modern era retrospective-analysis for research and applications (merra). In AGU Spring Meeting Abstracts.

Breiman, L. (2001). Random forests. Machine learning 45 (1), 5–32.

Brimblecombe, P. and C. Bowler (1992). The history of air pollution in york, . Journal of the Air & Waste Management Association 42 (12), 1562–1566.

Brutsaert, W. (1975). The roughness length for water vapor sensible heat, and other scalars. Journal of the Atmospheric Sciences 32 (10), 2028–2031.

Burgan, R. E. and R. A. Hartford (1993). Monitoring vegetation greenness with satellite data.

Caudill, M. (1987). Neural networks primer, part i. AI expert 2 (12), 46–52.

Chapman, D. S., T. Haynes, S. Beal, F. Essl, and J. M. Bullock (2014). Phenology predicts the native and invasive range limits of common ragweed. Global Change Biology 20 (1), 192–202.

Chen, J.-C. and J. Schwartz (2008). Metabolic syndrome and inflammatory responses to long-term particulate air pollutants. Environmental health perspectives 116 (5), 612–617.

Cheng, Y., S. Lee, Z. Gu, K. Ho, Y. Zhang, Y. Huang, J. C. Chow, J. G. Watson, J. Cao, and R. Zhang (2015). Pm2.5 and pm10-2.5 chemical composition and source apportionment near a hong kong roadway. Particuology 18, 96–104.

Chýlek, P. and J. A. Coakley (1974). Aerosols and climate. Science 183 (4120), 75–77.

Deng, H., G. Runger, and E. Tuv (2011). Bias of importance measures for multi-valued attributes and solutions. In International Conference on Artificial Neural Networks, pp. 293–300. Springer.

Dietterich, T. G. (2000). An experimental comparison of three methods for constructing en- sembles of decision trees: Bagging, boosting, and randomization. Machine learning 40 (2), 139–157.

Domingos, P. (2015). The master algorithm: How the quest for the ultimate learning machine will remake our world. Basic Books.

Dong, D. and T. J. McAvoy (1996). Nonlinear principal component analysis based on principal curves and neural networks. Computers & Chemical Engineering 20 (1), 65–78.

Ducret-Stich, Regina E, T. M.-Y. T. D. K. N. H. P. K. P. H. C. (2013, Sep). Pm10 source apportionment in a swiss alpine valley impacted by highway traffic. Environmental Science and Pollution Research 20 (9), 6496–6508.

Eaton, D. K., L. Kann, S. Kinchen, S. Shanklin, K. H. Flint, J. Hawkins, W. A. Harris, R. Lowry, T. McManus, D. Chyen, et al. (2012). Youth risk behavior surveillance-united states, 2011. Morbidity and mortality weekly report. Surveillance summaries (Washington, DC: 2002) 61 (4), 1–162.

Efron, B. (1992). Bootstrap methods: another look at the jackknife. In Breakthroughs in Statistics, pp. 569–593. Springer.

Florística, T. Y. (2010). A new ambrosia (asteraceae) from the baja california peninsula, mexico. Boletín de la Sociedad Botánica de México (86), 65–70.

Fumanal, B., B. Chauvel, and F. Bretagnolle (2007). Estimation of pollen and seed production of common ragweed in france. Annals of Agricultural and Environmental Medicine 14 (2).

Fumanal, B., B. Chauvel, A. Sabatier, and F. Bretagnolle (2007). Variability and cryptic heteromorphism of ambrosia artemisiifolia seeds: what consequences for its invasion in france? Annals of Botany 100 (2), 305–313.

Hall, E. S., S. M. Kaushik, R. W. Vanderpool, R. M. Duvall, M. R. Beaver, R. W. Long, and P. A. Solomon (2014). Integrating sensor monitoring technology into the current air pollution regulatory support paradigm: Practical considerations. Am. J. Environ. Eng 4 (6), 147–154.

Hans, C. (2009). Bayesian lasso regression. Biometrika 96 (4), 835–845.

Hansen, J., M. Sato, and R. Ruedy (1997). Radiative forcing and climate response. Journal of Geophysical Research: Atmospheres 102 (D6), 6831–6864.

Hester, R., R. Harrison, and M. Lippmann (1998). The 1997 us epa standards for particulate matter and ozone.

Ho, T. K. (1998). The random subspace method for constructing decision forests. IEEE transactions on pattern analysis and machine intelligence 20 (8), 832–844.

Howard, L. E. and E. Levetin (2014). Ambrosia pollen in tulsa, oklahoma: aerobiology, trends, and forecasting model development. Annals of Allergy, Asthma & Immunol- ogy 113 (6), 641–646.

Huang, R.-J., Y. Zhang, C. Bozzetti, K.-F. Ho, J.-J. Cao, Y. Han, K. R. Daellenbach, J. G. Slowik, S. M. Platt, F. Canonaco, et al. (2014). High secondary aerosol contribution to particulate pollution during haze events in china. Nature 514 (7521), 218.

Jelks, M. (1986). Allergy plants. World-Wide Publications.

Johnson, R. W. (2001). An introduction to the bootstrap. Teaching Statistics 23 (2), 49–54.

Kamens, R. M., Z. Guo, J. N. Fulcher, and D. A. Bell (1988). Influence of humidity, sunlight, and temperature on the daytime decay of polyaromatic hydrocarbons on atmospheric soot particles. Environmental Science and Technology 22 (1), 103–108.

Karrer, G. (2016). Implications of life history for control and eradication. Julius-K¨uhn- Archiv (455), 58.

Kasprzyk, I. (2008). Non-native ambrosia pollen in the atmosphere of rzesz´ow(se poland); evaluation of the effect of weather conditions on daily concentrations and starting dates of the pollen season. International journal of biometeorology 52 (5), 341.

Kernel, H. (2015). Odroid-xu4.

Kreyling, W. G., S. Hirn, and C. Schleh (2010). Nanoparticles in the lung. Nature biotech- nology 28 (12), 1275.

Kulkarni, P., P. A. Baron, and K. Willeke (2011). Aerosol measurement: principles, tech- niques, and applications. John Wiley & Sons.

Kursa, M. B. (2014). Robustness of random forest-based gene selection methods. BMC bioinformatics 15 (1), 8.

Lark, R., B. Cullis, and S. Welham (2006). On spatial prediction of soil properties in the presence of a spatial trend: the empirical best linear unbiased predictor (e-blup) with reml. European Journal of Soil Science 57 (6), 787–799.

Leskovšek, R., A. Datta, S. Z. Knezevic, and A. Simončič (2012). Common ragweed (ambrosia artemisiifolia) dry matter allocation and partitioning under different nitrogen and density levels. Weed Biology and Management 12 (2), 98–108.

Liu, X., D. Wu, G. K. Zewdie, L. Wijeratne, C. I. Timms, A. Riley, E. Levetin, and D. J. Lary (2017, 03). Using machine learning to estimate atmospheric ambrosia pollen concentrations in tulsa, ok. Environmental Health Insights 11, 1178630217699399.

Makra, L. and I. Matyasovszky (2011). Assessment of the daily ragweed pollen concentration with previous-day meteorological variables using regression and quantile regression analysis for szeged, hungary. Aerobiologia 27 (3), 247–259.

McCulloch, W. S. and W. Pitts (1943). A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics 5 (4), 115–133.

Mohapatra, S. S., R. F. Lockey, and F. Polo (2004). Weed pollen allergens. Allergens and allergen immunotherapy 4.

Nelson, H. S. (2000). The importance of allergens in the development of asthma and the persistence of symptoms. Journal of allergy and clinical immunology 105 (6), S628–S632.

Oliver, M. A. and R. Webster (2015). Basic steps in geostatistics: the variogram and kriging, Volume 106. Springer.

Osborne, M. R., B. Presnell, and B. A. Turlach (2000). On the lasso and its dual. Journal of Computational and Graphical statistics 9 (2), 319–337.

Oswalt, M. L. and G. D. Marshall (2008). Ragweed as an example of worldwide allergen expansion. Allergy, Asthma & Clinical Immunology 4 (3), 130.

Peng, R. D., H. H. Chang, M. L. Bell, A. McDermott, S. L. Zeger, J. M. Samet, and F. Dominici (2008). Coarse particulate matter air pollution and hospital admissions for cardiovascular and respiratory diseases among medicare patients. Jama 299 (18), 2172– 2179.

Pilotte, P. Analytics-driven embedded systems, part 2 - developing analytics and prescriptive controls. http://www.embedded-computing.com/embedded-computing-design/analytics-driven-embedded-systems-part-2-developing-analytics-and-prescriptive-controls. Accessed: 2016-03-14.

Pope, F. D., M. Gatari, D. Ng'ang'a, A. Poynter, and R. Blake (2018). Airborne particulate matter monitoring in kenya using calibrated low-cost sensors. Atmospheric Chemistry and Physics 18 (20), 15403–15418.

Rasmussen, C. E. (2003). Gaussian processes in machine learning. In Summer School on Machine Learning, pp. 63–71. Springer.

Rienecker, M. M., M. J. Suarez, R. Gelaro, R. Todling, J. Bacmeister, E. Liu, M. G. Bosilovich, S. D. Schubert, L. Takacs, G.-K. Kim, et al. (2011). Merra: Nasas modern-era retrospective analysis for research and applications. Journal of climate 24 (14), 3624–3648.

R¨uckerl, R., A. Schneider, S. Breitner, J. Cyrys, and A. Peters (2011). Health effects of par- ticulate air pollution: a review of epidemiological evidence. Inhalation toxicology 23 (10), 555–592.

Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of research and development 3 (3), 210–229.

Sölter, U., U. Starfinger, and A. Verschwele (2012). Halt ambrosia - complex research on the invasive alien plant ragweed (ambrosia artemisiifolia l.) in europe. Julius-Kühn-Archiv (434), 627.

Stark, P. C., L. M. Ryan, J. L. McDonald, and H. A. Burge (1997). Using meteorologic data to predict daily ragweed pollen levels. Aerobiologia 13 (3), 177–184.

Stier, P., J. H. Seinfeld, S. Kinne, and O. Boucher (2007). Aerosol absorption and radiative forcing. Atmospheric Chemistry and Physics 7 (19), 5237–5261.

Strother, J. L. (1753). Flora of North America. Website. http://www.efloras.org/florataxon.aspx?flora_id=1&taxon_id=101325.

Suter, G. W. (2008). Ecological risk assessment in the United States Environmental Protection Agency: A historical overview. Integrated Environmental Assessment and Management 4 (3), 285–289.

Taramarcaz, P., C. Lambelet, B. Clot, C. Keimer, and C. Hauser (2005). Ragweed (Ambrosia) progression and its health risks: will Switzerland resist this invasion? Swiss Medical Weekly 135 (37/38), 538.

Thompson, J. L. and J. E. Thompson (2003). The urban jungle and allergy. Immunology and allergy clinics of North America 23 (3), 371–387.

Tibshirani, R. (2011). Regression shrinkage and selection via the lasso: a retrospective. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73 (3), 273–282.

Wahn, U. (2000). What drives the allergic march? Allergy 55 (7), 591–599.

Weber, R. W., N. Adkinson Jr, B. Bochner, A. Burks, W. Busse, S. Holgate, R. Lemanske Jr, and R. O’Hehir (2013). Aerobiology of outdoor allergens. Middleton’s Allergy: Principles and Practice 69, 430.

Webster, R. (2000). Is soil variation random? Geoderma 97 (3-4), 149–163.

WHO. 7 million premature deaths annually linked to air pollution. http://www.who.int/mediacentre/news/releases/2014/air-pollution/en/. Accessed: 2016-08-29.

Wilson, R., J. D. Spengler, et al. (1996). Particles in our air: concentrations and health effects. Harvard School of Public Health, Cambridge, MA.

Yin, J., A. Allen, R. Harrison, S. Jennings, E. Wright, M. Fitzpatrick, T. Healy, E. Barry, D. Ceburnis, and D. McCusker (2005). Major component composition of urban PM10 and PM2.5 in Ireland. Atmospheric Research 78 (3-4), 149–165.

Zewdie, G. K., X. Liu, D. Wu, D. J. Lary, and E. Levetin (2019). Applying machine learning to forecast daily Ambrosia pollen using environmental and NEXRAD parameters. Environmental Monitoring and Assessment 191 (2), 261.

Zhang, D., J. Liu, and B. Li (2014). Tackling air pollution in China: what do we learn from the great smog of 1950s in London. Sustainability 6 (8), 5322–5338.

Zink, K., H. Vogel, B. Vogel, D. Magyar, and C. Kottmeier (2012). Modeling the dispersion of Ambrosia artemisiifolia L. pollen with the model system COSMO-ART. International Journal of Biometeorology 56 (4), 669–680.

Ziska, L., K. Knowlton, C. Rogers, D. Dalan, N. Tierney, M. A. Elder, W. Filley, J. Shropshire, L. B. Ford, C. Hedberg, et al. (2011). Recent warming by latitude associated with increased length of ragweed pollen season in central North America. Proceedings of the National Academy of Sciences 108 (10), 4248–4251.

BIOGRAPHICAL SKETCH

Xun Liu is currently in his 7th year of PhD study in physics at The University of Texas at Dallas. He was born in Jiangsu, China. After receiving his bachelor’s degree in physics from Nanjing University in 2013, he joined The University of Texas at Dallas as a PhD student. After spending one year in Prof. Jason Slinker’s lab working on bio-nano circuit chips, he joined Dr. David Lary’s research group in June 2015 and began his study of atmospheric particulate matter under Dr. Lary’s supervision.

From 2015 to 2018, Xun completed a project studying the relationships between the airborne abundance of pollen and the environmental state. Xun also used variograms to characterize the temporal scale required to accurately resolve the time evolution of atmospheric particulates. His research on variogram fitting was greatly assisted by Lakitha Wijeratne, who set up the data collection sensors using IoT technologies. Xun collaborated with Gebreab K. Zewdie and Daji Wu and is a co-author on their publications. Xun presented his Phase I research results at the joint meeting of the Texas Sections of the APS, AAPT, and SPS in March 2016. He completed a software engineering internship at Google LLC during the summer of 2018, where he improved his knowledge of machine learning.

CURRICULUM VITAE

Xun Liu October 31, 2019

Contact Information:
Department of Computer Science
The University of Texas at Dallas
800 W. Campbell Rd.
Richardson, TX 75080-3021, U.S.A.
Voice: (972) 883-4724
Fax: (972) 883-2349
Email: [email protected]

Educational History:
B.S., Physics, Nanjing University, 2013
Ph.D., Physics, The University of Texas at Dallas, 2019
  Physical Studies of Airborne Pollen and Particulates Utilizing Machine Learning
  Ph.D. Dissertation
  Physics Department, The University of Texas at Dallas
  Advisor: Dr. David Lary

Publications and Awards:
Liu, X., Wu, D., Zewdie, G. K., Wijeratne, L., Timms, C. I., Riley, A., ... & Lary, D. J. (2017). Using machine learning to estimate atmospheric Ambrosia pollen concentrations in Tulsa, OK. Environmental Health Insights, 11.
Wu, D., Zewdie, G. K., Liu, X., Kneen, M. A., & Lary, D. J. (2017). Insights Into the Morphology of the East Asia PM2.5 Annual Cycle Provided by Machine Learning. Environmental Health Insights, 11.
Zewdie, G. K., Liu, X., Wu, D. Estimating the daily pollen concentration in the atmosphere using machine learning and the NEXRAD weather radar data. Environmental Monitoring and Assessment, accepted.

Employment History:
Software Engineer Intern, Google LLC, June 2018 – September 2018