PHYSICAL STUDIES OF AIRBORNE POLLEN AND PARTICULATES UTILIZING

MACHINE LEARNING

by

Xun Liu

APPROVED BY SUPERVISORY COMMITTEE:

David John Lary, Chair

Roderick Heelis

Robert Glosser

Lunjin Chen

Fabiano Rodrigues

Copyright © 2019

Xun Liu

All rights reserved

PHYSICAL STUDIES OF AIRBORNE POLLEN AND PARTICULATES UTILIZING

MACHINE LEARNING

by

XUN LIU, BS, MS

DISSERTATION

Presented to the Faculty of

The University of Texas at Dallas

in Partial Fulfillment

of the Requirements

for the Degree of

DOCTOR OF PHILOSOPHY IN

PHYSICS

THE UNIVERSITY OF TEXAS AT DALLAS

December 2019

ACKNOWLEDGMENTS

I would like to thank all those who supported my research and dissertation. Without their help, I would never have completed this dissertation.

I must express my deepest appreciation to my advisor, Dr. David Lary, for his guidance over the past three years. My heartfelt gratitude also goes to him for his patience and tolerance of the mistakes I made along the way. His warm support and insightful instruction helped me through many hard times during this research.

I would also like to thank the other members of my committee, Dr. Roderick Heelis, Dr. Robert Glosser, Dr. Lunjin Chen, and Dr. Fabiano Rodrigues, for serving on my committee and generously providing me with advice and comments.

I am also grateful to all teammates in Dr. Lary’s group: Daji Wu, Gebreab K. Zewdie and Lakitha Wijeratne, for their collaboration on my research.

October 2019

PHYSICAL STUDIES OF AIRBORNE POLLEN AND PARTICULATES UTILIZING

MACHINE LEARNING

Xun Liu, PhD

The University of Texas at Dallas, 2019

Supervising Professor: David John Lary, Chair

This dissertation presents an approach for estimating the abundance of airborne pollen and particulates using a comprehensive description of the physical environment coupled with machine learning. The physical environment is characterized by eighty-five variables that quantify the physical state of the land surface and soil, and the physical state of the atmosphere. The physical environment of plants naturally affects their rate of maturing and pollen generation. Then, once the pollen is released, conditions such as wind speed affect how the pollen is dispersed. Machine learning is helpful for studying such a complex system. It allows us to ‘learn by example’, since at present we do not have a complete theoretical description, from first principles, of the entire system, from plant growth and development to the plants’ full interaction with their physical environment. Machine learning also allows us to objectively highlight which physical parameters play a central role in determining the atmospheric abundance of the pollen, and hence its impact on human health.

Some key aspects in building a physical model of airborne particulates using machine learning that are explored in this dissertation include:

1. The collection of an appropriate and comprehensive training dataset that machine learning algorithms can use to learn from. This involves characterizing the appropriate temporal and spatial scales involved. Variograms were used to perform this analysis. Machine learning is an automated encapsulation of the scientific method, an automated paradigm for learning by example to build descriptive models that can be tested and iteratively improved.

2. Identifying the physical parameters which are the most appropriate input variables (or features) to build an accurate machine learning model. This is a key step in machine learning called feature engineering. Feature engineering can provide useful physical insights into the key drivers of the system being studied.

3. Providing a framework for updating the machine learning model as new observational data are collected. This was done by implementing a mini-batch training process that allows the machine learning model to be updated in almost real time.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1 INTRODUCTION
1.1 Airborne Particulates
1.1.1 Particulate Matter in Different Sizes
1.1.2 Chemical Composition and Source Apportionment
1.1.3 Role in Global Environmental Change
1.1.4 Particulate Matter and Human Health
1.2 Airborne Ambrosia Pollen
1.2.1 Ragweeds - Source of Airborne Pollen in North America
1.2.2 Airborne Pollen Particles
1.2.3 Ambrosia Pollen and Health
1.2.4 Environment's effect on airborne pollen
1.3 Summary
CHAPTER 2 OBSERVATIONS OF THE TEMPORAL CHANGES IN AMBROSIA (RAGWEED) POLLEN ABUNDANCE
2.1 Previous Work
2.2 Data
2.3 Summary
CHAPTER 3 OBSERVATION OF THE TEMPORAL CHANGES IN PARTICULATE MATTER ABUNDANCE
3.1 Optical Particle Counters
3.2 In-situ Observation System
3.3 Data from MINTS
CHAPTER 4 PHYSICAL INSIGHTS PROVIDED BY VARIOGRAMS
4.1 Stochastic Process and Sampling
4.2 Variogram and Kriging
4.2.1 Variogram Definition Equation
4.2.2 Kriging for Data Interpolation
4.2.3 Practical Use of Variograms
4.3 Variograms of Airborne Particulates
4.4 Summary
CHAPTER 5 BUILDING EMPIRICAL PHYSICAL MODELS OF AIRBORNE PARTICULATES USING MACHINE LEARNING
5.1 Introduction to Machine Learning
5.1.1 Supervised Learning
5.1.2 Unsupervised Learning
5.1.3 Feature Engineering
5.2 LASSO
5.2.1 Algorithm
5.2.2 Result
5.3 Neural Networks
5.3.1 Algorithm
5.3.2 Result
5.4 Ensembles of Decision Trees
5.4.1 Decision Trees
5.4.2 Random Forest
5.4.3 Advantages of a Random Forest
5.4.4 Estimating Ambrosia Pollen Using Random Forests
5.5 Summary of Pollen Estimation Using Machine Learning
5.6 Machine Learning Inter-comparison for PM2.5
CHAPTER 6 SUMMARY
6.1 Conclusion
6.2 Future Direction
APPENDIX ENVIRONMENTAL VARIABLES USED IN POLLEN ESTIMATION
REFERENCES
BIOGRAPHICAL SKETCH
CURRICULUM VITAE

LIST OF FIGURES

1.1 Size comparison for PM particles, GNU free documentation license from EPA public knowledge
1.2 Airborne particulate size distribution chart, GNU free documentation license from Wikipedia
1.3 PM2.5 source and chemical composition apportionment at multiple Chinese sites during 2013 (Huang et al., 2014)
1.4 Chemical composition and source apportionment comparison between PM10 and PM2.5
1.5 Particulates' direct and indirect effects on the global climate system
1.6 Percentages of risk factors on attributable deaths in 2013 (Bank, 2016)
1.7 A schematic showing the Ambrosia life-cycle
2.1 Correlation of the model-predicted pollen concentrations with observed validation data for 2013
2.2 Example seasonal pollen data for 1986, 1987 and 1988
2.3 Averaged 1986-2014 pollen data in the flowering season
3.1 Optical Particle Counters
3.2 Schematic of MINTS sensors
3.3 Particulate time series data in Chattanooga, 08.02.2018
3.4 Particulate (0.75-1.7 µm) time series data from August 10th to 12th
4.1 A spherical variogram fit
4.2 Covariance function as a function of data pair separation
4.3 Significance of variogram nugget and range
4.4 An example temporal variogram for observed PM2.5
4.5 Observed PM2.5 time series, representativeness uncertainty, and variogram range histograms
4.6 Observed PM10 time series, representativeness uncertainty, and variogram range histograms
5.1 An overview of some of the often used types of machine learning (Pilotte)
5.2 Schematics illustrating conceptually how the data is organized for supervised machine learning
5.3 Schematic illustrating conceptually how the data is organized for unsupervised machine learning
5.4 Scatter diagram for the airborne pollen estimates made using the LASSO approach
5.5 Schematic of a single hidden layer, feed-forward Neural Network
5.6 Scatter diagram for the airborne pollen estimates made using a Neural Network
5.7 A decision tree consists of three types of nodes
5.8 A Random Forest is a classifier consisting of a collection of tree-structured classifiers
5.9 Using Random Forest Regression to estimate the airborne abundance of Ambrosia (Ragweed) pollen
5.10 Plot of model estimation versus actual observation
5.11 Rank of variables' relative importance
5.12 Linear Regression performance
5.13 Support Vector Machine performance
5.14 Decision Tree performance
5.15 Gaussian Process Regression performance
5.16 Ensemble method performance

LIST OF TABLES

5.1 Table of correlation coefficients for the various machine learning approaches used in this study, with the best performing approach listed first
5.2 Table of cross-validated root mean square error (RMSE) and correlation coefficient (R) values for multiple machine learning approaches
A.1 Variable names, abbreviations, and units

CHAPTER 1

INTRODUCTION

1.1 Airborne Particulates

One of the first documented occurrences of public attention being given to air pollution was in the 1700s (Brimblecombe and Bowler, 1992). Atmospheric pollution can be gaseous, involve biological material (e.g. airborne mold or pollen), or involve other airborne particulate matter (e.g. soot and smog). Of these, airborne particulate matter has left a particular mark on human history, from the great London smog to recent particulate matter pollution episodes (Zhang et al., 2014). Many of the early epidemiological studies considered sulfur dioxide and its role in producing smog. More recently it has become evident that small airborne particles, referred to as the fine fraction, are particularly detrimental to human health (Wilson et al., 1996).

The size and composition of these particulates determine their physical properties and their effects on human health. Further, the size of the particulates may reflect their composition (Wilson et al., 1996).

1.1.1 Particulate Matter in Different Sizes

Airborne particulate matter (also called atmospheric aerosol) consists of solid and/or liquid particles suspended in air. The size of the airborne particulates is a key factor, and the aerodynamic size is usually used. The aerodynamic size influences the time that the particles stay airborne and hence the distances over which they can be transported by the winds, and, finally, if inhaled, the size governs where they are deposited within the human body. The Environmental Protection Agency (EPA) reviews and updates the National Ambient Air Quality Standards (NAAQS) every five years. In the revised standard of 1987, particulate matter was divided into two main groups (Hester et al., 1998):

• The coarse aerosol size fraction, PM2.5 - PM10, particles in the size range 2.5 to 10 µm.

• The fine aerosol size fraction, particles with diameter up to 2.5 µm.

Figure 1.1: Size comparison for PM particles, GNU free documentation license from EPA public knowledge

Figure 1.1 helps convey the size of PM particles by comparing them to the cross-section of a human hair. A typical human hair has a diameter of around 70 µm, around 30 times larger than the largest PM2.5 particle.

1.1.2 Chemical Composition and Source Apportionment

It is worth noting that the designations PM2.5 and PM10 refer only to size and not to composition. Figure 1.2 shows some examples of the typical components of airborne particulates; the horizontal axis marks the size of the particulates in units of µm. The composition is a reflection of the source and can vary temporally and spatially. Figure 1.3 shows the different PM composition and source apportionment at multiple Chinese cities in 2013. For example, PM2.5 collected during the high pollution events of 5-25 January 2013 at the urban sites of Beijing, Shanghai, Guangzhou and Xian had different sources in different regions. These sources included traffic, coal burning, cooking, biomass burning, dust related, secondary organic and inorganic sources (Huang et al., 2014; Ducret-Stich, 2013). The apportionment of each source differs as well.

An example of the different components in the different size fractions can be seen in Figure 1.4 (Yin et al., 2005; Cheng et al., 2015). In the left panel, we see results for particulates collected in Ireland, where the major chemical component of PM2.5 was organic carbon, while that of the larger PM10 particles was minerals. In the right-hand panel, we see measurements of both groups of aerosols made in Hong Kong, where the major sources are vehicle emissions and unidentified materials.

1.1.3 Role in Global Environmental Change

Aerosols play a key role in global environmental change (Wilson et al., 1996; Stier et al., 2007). In part this is due to the direct role that they play in the atmosphere's radiation budget (Ångström, 1962), and in part to their indirect role in changing the atmosphere's thermal structure and affecting cloud properties and abundance (Hansen et al., 1997). Both eventually impact the global environment and human health.

Figure 1.2: Airborne particulate size distribution chart, GNU free documentation license from Wikipedia.

Figure 1.3: PM2.5 source and chemical composition apportionment at multiple Chinese sites during 2013 (Huang et al., 2014).

The leftmost panel in Figure 1.5 illustrates the direct effect, where aerosols absorb and scatter shortwave radiation from the sun. The right panel depicts the indirect effect, where aerosols modify the cloud albedo and the radiative forcing in the lower atmosphere (sometimes called the Twomey effect). A higher abundance of aerosols means a higher abundance of cloud condensation nuclei (CCN), leading to a larger number of smaller cloud droplets, thus whitening the clouds, increasing cloud reflectivity, and cooling the atmosphere. Changing the number of cloud condensation nuclei also affects the lifetime of clouds and reduces precipitation, as the water content of the atmosphere is split between a larger number of smaller clouds. Lower clouds lead to a cooling of the Earth's surface by blocking incoming sunlight, while higher clouds trap outgoing longwave infrared radiation, leading to a warming of the Earth's surface (Chýlek and Coakley, 1974).

1.1.4 Particulate Matter and Human Health

As early as 1970, an "association" was made between air pollution and death rates; this was described using linear regression in a seminal set of papers by Lave and Seskin. Although doubt was soon cast on this, and at first no concrete evidence was found to support the "association", it was later refined and supporting evidence was provided over the next decade (Wilson et al., 1996).

Figure 1.4: Chemical composition and source apportionment comparison between PM10 and PM2.5. Left panel: approximate composition of PM in Ireland. Right panel: source apportionment of PM near a Hong Kong roadway.

Figure 1.5: Particulates' direct and indirect effects on the global climate system.

Although the details of exactly how particulate matter affects human health are still being determined, there is little doubt of the substantial impact of airborne particulate matter on human health. Figure 1.6 shows the relative impacts of a set of risk factors on attributable deaths for 2013; we note that air pollution ranks fourth in this list. Many research studies, for example (Kreyling et al., 2010; Rückerl et al., 2011; Peng et al., 2008; Atkinson et al., 2010), have shown that airborne particulates impact human health in at least two ways:

1. PM2.5, due to its small size, can penetrate deep into the lungs and carry toxic chemicals across the air-blood barrier.

2. PM10, while not able to cross the air-blood barrier, is still associated with a diverse set of adverse health outcomes, impacting, among other things, cardio-respiratory health and premature mortality.

1.2 Airborne Ambrosia Pollen

The work shown in this section 1.2 has been published as "Liu, X., Wu, D., Zewdie, G. K., Wijerante, L., Timms, C. I., Riley, A., ... & Lary, D. J. (2017). Using machine learning to estimate atmospheric Ambrosia pollen concentrations in Tulsa, OK. Environmental Health Insights, DOI: 10.1177/1178630217699399."

A class of airborne particulates that has a particularly important human impact is airborne pollen. Pollen typically has a size of between 6 and 100 µm. One of the pollens that causes a strong allergenic response is Ragweed (Ambrosia).

Figure 1.6: Percentages of risk factors on attributable deaths in 2013 (Bank, 2016).

1.2.1 Ragweeds - Source of Airborne Pollen in North America

Ragweeds are flowering plants in the genus Ambrosia in the aster family, Asteraceae (Strother, 1753). They are distributed in North America, where the origin and center of diversity of the genus are in the southwestern United States and northwestern Mexico (Florística, 2010). The genus has 22 known allergens, with 6 considered major allergens. In North America, 17 species of Ragweed have been discovered; the most common is Ambrosia artemisiifolia (Jelks, 1986). The lifespan of a single Ragweed plant (Figure 1.7) begins with the release of seeds from the mother plant, followed by a dormant phase in the soil (Karrer, 2016). High variation of seed morphs has been found (Fumanal et al., 2007). Ambrosia seeds are light enough to float on water and can be spread easily along rivers. The seed lifetime is influenced by the frequency of soil disturbance. In arable fields with annual soil tillage, the turnover rate of seeds is higher than in abandoned fields or grassland. The persistence of individual seeds in the soil is typically short. In grassland, most of the seeds stay in the upper soil or on the soil surface. A small fraction of seeds is integrated into deeper soil horizons and has a long-time persistence as part of the soil seed bank. One could expect that smaller seeds, which show higher dormancy, tend to accumulate in the lower soil, whereas heavier seeds have a better chance of staying aboveground (Fumanal et al., 2007). Like other typical summer annual weeds, Ragweed seeds show innate dormancy after the seed set in autumn and need an exposure of about 4 weeks to temperatures around 0 °C for germination. If conditions are not suitable for germination, enforced secondary dormancy can be initiated (Baskin and Baskin, 1980). Ragweed individuals that germinate early in the season (March to April) grow slowly at the beginning, forming a rosette-like stage with 4-6 leaves. With increasing temperatures, vegetative growth is enhanced during June and July by significant stem elongation (Figure 1.7). Figure 1.7 was plotted and contributed by Gebreab K. Zewdie (Zewdie et al., 2019). The amount of pollen and seeds produced per individual depends largely on the habitat features and the population density (Leskovšek et al., 2012).

1.2.2 Airborne Pollen Particles

The pollen is tricolpate (i.e., having three colpi), with a spiny, granular surface (Mohapatra et al., 2004). A single plant can release about 1 billion pollen grains in a season (Thompson and Thompson, 2003). The plants grow to about 1.2 meters in height and generate pollen grains that are about 15-25 µm in size (Weber et al., 2013).

1.2.3 Ambrosia Pollen and Health

Although most larger pollen grains cannot deposit deep in the peripheral airways, it has been demonstrated that Ragweed pollen exists in particle sizes of less than 10 µm that could potentially lead to lower respiratory symptoms. It was recently reported that subpollen particles (fragments) released from Ragweed pollen grains, ranging in size from 0.5-4.5 µm, could induce allergic inflammation in an animal model (Bacsi et al., 2006).

Figure 1.7: A schematic showing the Ambrosia life-cycle over the calendar year: seed dormancy, germination, growing, and flowering with pollen production and seed development.

It has been estimated that symptoms after exposure to Ragweed pollen can begin with concentrations of as few as 5 − 20 pollen grains/m3.

Ragweed tends to grow in fields and in freshly cleared grounds. It is considered an annual disturbance weed that completes its life cycle in 1 year and requires the clearing or disturbance of the soil for future growth (Thompson and Thompson, 2003). The expansion of Ragweed in both the United States and Europe has been attributed to increasing deforestation and economic development (Taramarcaz et al., 2005).

Ragweed is an allergenic species that has expanded on a global scale (Oswalt and Marshall, 2008). As allergy sufferers are exposed to increasing amounts of air pollution in the future, this could lead to increased sensitization and thus symptoms. Sensitization in children can lead to the classic 'allergic march', which includes a progression from atopic dermatitis and allergic rhinitis to asthma (Wahn, 2000). In adults, new allergen sensitization may increase the development of allergic disease, with persistence of symptoms into older adulthood or asymptomatic sensitization that does not develop clinically until later in life (Nelson, 2000).

In the third National Health and Nutrition Examination Survey (NHANES III) (Chen and Schwartz, 2008), 54.3% of the population had a positive test response to one or more allergens. Over the last two decades, rates of asthma have increased in the United States and worldwide, although there is some evidence that asthma rates might have peaked in this period of time (Eaton et al., 2012). One of the most important risk factors for asthma is sensitization to one or more allergens. The National Center for Health Statistics included allergy skin testing in the second and third National Health and Nutrition Examination Surveys (NHANES II and NHANES III), which were conducted from 1976 through 1980 and 1988 through 1994, respectively, to estimate and monitor the prevalence of allergic sensitization in the United States.

Allergic conditions such as asthma and rhinitis can be exacerbated by pollen. According to the World Health Organization (WHO), 9% of US students younger than 18 experienced seasonal hay fever symptoms in 2008; three-quarters of these cases are believed to be caused by Ambrosia pollen. Approximately 50 million Americans have allergic diseases. On average, each day in the USA, 44,000 people have an asthma attack; asthma causes 36,000 children to miss school and 27,000 adults to miss work; 4,700 people visit the emergency room, with 1,200 of these emergency room visits leading to a hospital admission; and, unfortunately, on average, 9 of those admitted with asthma die.

Early warning of imminent high pollen levels could be valuable for people with conditions such as Asthma and Chronic Obstructive Pulmonary Disease (COPD).

1.2.4 Environment’s effect on airborne pollen

As mentioned in section 1.2.1, the growth of pollen-producing plants is influenced by their local environment. As a result, how much airborne pollen is produced is strongly affected by the local environment. In addition, specific environmental factors, such as wind speed, influence the range and distribution of these airborne particles.

In recent years, most Ragweed studies have focused on modeling and mapping the spread of Ragweed plants across counties in Europe (Kasprzyk, 2008). Many of these forecasting models seek to approximate the daily pollen concentration based on a variety of meteorologic parameters and using time-series analysis, non-parametric analysis, stepwise regression, and multiple regression models.

Pollen forecasting models generally emphasize the importance of meteorological variables in predicting pollen concentrations. In a 2008 study, Kasprzyk found that several meteorologic variables were influential in the daily Ragweed pollen concentration. The maximum, mean, and change in temperature and the dew point were positively correlated with pollen concentration, whereas humidity was negatively correlated with pollen concentration.

(Makra and Matyasovszky, 2011) assessed the influence of previous-day meteorologic data and previous-day Ragweed pollen concentration as predictive factors for the daily Ragweed pollen concentration in Hungary. The authors used multiple regression analysis and split the data into 2 groups, producing separate models for days with and without precipitation. The models indicated that the previous day's pollen concentration was significantly predictive of pollen concentration regardless of precipitation. For rainy days, previous-day solar radiation was significant; for non-rainy days, the previous-day mean temperature was significant. The results of their quantile analysis were similar, indicating previous-day pollen concentration to be the most predictive, with previous-day mean temperature and previous-day precipitation levels also significant variables.

The findings of (Stark et al., 1997) are consistent with those obtained in the European models described above. Stark et al. found that temperature, daily precipitation, and wind speed were significant parameters in estimating Ragweed pollen concentration in Michigan. The authors developed an individual model for each of the 4 years in their data set, rather than developing a collective model. Unlike the other models reviewed, this study included an incremental variable representing each day in the season, which was found to be statistically significant in predicting daily pollen concentration.

As a result of the health problems associated with Ragweed pollen, the factors that affect the development and allergenicity of Ragweed pollen have been investigated in many studies. The plants are tolerant of many environmental conditions that may not be conducive to the growth of other plants, such as extremely warm and dry environments (Kasprzyk, 2008). In addition, the complexities of the reproduction of Ragweed species have been studied in conjunction with the phenology and distribution patterns to better understand the behavior of these potent aeroallergen-producing plants (Zink et al., 2012). Chapman et al. (Zink et al., 2012) found that the phenology, or seasonal development, of Ragweed varieties is a significant factor in predicting the geographic spread of Ragweed plants and, consequently, Ragweed pollen.

1.3 Summary

In this chapter, we have examined some of the key roles and impacts of environmental particulates and aerosols, and introduced one in particular, Ambrosia or Ragweed pollen. We have seen that particulate matter is often classified simply by size (e.g. the categories used by the 1987 NAAQS): PM2.5, particulates with a size of up to 2.5 µm, and PM10, particulates with a size of up to 10 µm. When lofted in the air, airborne particulates can play a role in atmospheric radiative transfer and the formation of clouds, and can impact human health. Among these particulates, pollen is noteworthy, being a significant allergen affecting around one-third of the US population. Let us now examine how machine learning can be of use in estimating the abundance of airborne particulates.

CHAPTER 2

OBSERVATIONS OF THE TEMPORAL CHANGES IN AMBROSIA

(RAGWEED) POLLEN ABUNDANCE

The work shown in this chapter 2 has been published as "Liu, X., Wu, D., Zewdie, G. K., Wijerante, L., Timms, C. I., Riley, A., ... & Lary, D. J. (2017). Using machine learning to estimate atmospheric Ambrosia pollen concentrations in Tulsa, OK. Environmental Health Insights, DOI: 10.1177/1178630217699399."

2.1 Previous Work

(Howard and Levetin, 2014) have measured and analyzed the long-term Ambrosia (Ragweed) pollen counts observed at the University of Tulsa for over twenty-seven years and have developed a multi-linear forecasting model to estimate the daily pollen concentration, described in equation 2.1. In their model, they associated the pollen concentration with the long term phenology (Chapman et al., 2014) and a set of meteorological factors that included the minimum temperature $T_{min}$, precipitation $P$, and the mean dew point $DP$:

$$\ln(C) = -0.505 - 0.018 \times T_{min} - 0.108 \times P + 0.013 \times DP + 0.970 \times PH \quad (2.1)$$

where $C$ is the pollen concentration. The phenology ($PH$) is the mean pollen count for that day of the year over all prior years of Ambrosia pollen observations in Tulsa, OK.

Even though the multi-linear model examined by (Howard and Levetin, 2014) is not particularly accurate, it is shown here for the sake of completeness. Figure 2.1 shows a scatter diagram of this multi-linear model, where the x-axis shows the estimated pollen count and the y-axis the actual observed pollen count. For a perfect prediction all the points would lie on a straight line with a slope of one and an intercept of zero. This figure is used as a benchmark for the comparison of results obtained later using a variety of machine learning approaches. In Figure 2.1 the Pearson correlation coefficient is 0.59. The Pearson correlation coefficient is defined in Equation 2.2, where $\mathrm{Cov}(X,Y) = E[(X - \mu_X)(Y - \mu_Y)]$ and $\sigma$ denotes the standard deviation. It takes values in $[-1, 1]$: $-1$ and $1$ indicate perfect negative and positive linear correlation, respectively, and $0$ indicates no linear correlation.

$$\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} \quad (2.2)$$
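As a concrete illustration of this benchmark, the sketch below evaluates the multi-linear estimate of equation 2.1 and the Pearson correlation coefficient of equation 2.2. It is a minimal Python sketch; the function and variable names are illustrative assumptions, not code from the original study.

```python
import numpy as np

def multilinear_pollen_estimate(t_min, precip, dew_point, phenology):
    """ln(C) from the multi-linear model of equation 2.1."""
    return (-0.505 - 0.018 * t_min - 0.108 * precip
            + 0.013 * dew_point + 0.970 * phenology)

def pearson_r(x, y):
    """Pearson correlation coefficient of equation 2.2."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

# Hypothetical daily inputs for one season (equal-length arrays):
# ln_c_est = multilinear_pollen_estimate(t_min, precip, dew_point, phenology)
# r = pearson_r(ln_c_est, np.log(observed_pollen_concentration))
```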

Figure 2.1: Correlation of the model-predicted pollen concentrations with observed validation data for 2013: estimated versus observed pollen concentration (base-10 logarithm) with the 1:1 line shown, R = 0.59. Plotted based on equation 2.1, using data from (Howard and Levetin, 2014; Bosilovich et al., 2006).

This study examined the aerobiology of Ambrosia pollen in Tulsa, OK, and its time evolution in the context of meteorologic factors that affect pollen release. All pollen season statistics were highly variable during the 27-year period. The results of the present study indicate that there has been no significant change in the start date, peak date, and end date of the Ragweed season in Tulsa. When analyzing the temporal pattern of start and end dates, the results are consistent with those found by (Ziska et al., 2011), who reported, in certain cities, a significant increase in the length of the Ragweed pollen season depending on latitude. In Figure 2.2 the pollen count for three different Ragweed seasons is shown. We recall from figure 1.7 that the Ambrosia pollen season typically begins in August and finishes in November, consistent with the data shown here. The phenology term dominates this estimating model. Thus, removing this term from the model and using machine learning might reveal other important environmental factors.

Figure 2.2: Example seasonal pollen data for 1986, 1987 and 1988 (daily pollen counts from early August to early November for each year).

2.2 Data

Two types of data were used in this study. First, observational data of the abundance of airborne Ambrosia pollen (e.g. Figure 2.3), which was previously reported by (Howard and Levetin, 2014). Second, a comprehensive meteorological and land surface context for the pollen observations provided by the NASA MERRA meteorological reanalysis (Bosilovich et al., 2006; Rienecker et al., 2011).

As described in §2.1, the daily airborne pollen concentration was obtained at the University of Tulsa in Tulsa, Oklahoma. From 1986 to 2014, a Burkard Volumetric Spore Trap was deployed on the roof of Oliphant Hall, collecting airborne pollen day and night. Inside the Burkard trap, the pollen is deposited onto a greased strip of Melenex tape that is affixed to a rotating drum. Tapes were collected each week, divided into strips for each day, and then examined at a magnification of 400× for pollen grain identification and counting under a microscope. Once the pollen counts were obtained, they were multiplied by a conversion factor to yield the overall atmospheric pollen concentration (Howard and Levetin, 2014).

Figure 2.3: Averaged 1986-2014 pollen data in the flowering season (mean daily pollen count from early August to early November).

For every day of the 27 years from 1987-2013 for which pollen data were available at Tulsa, OK, the hourly values of 85 environmental parameters were retrieved from the NASA MERRA meteorological analysis that describes the surface and soil state (Rienecker et al., 2011). These 85 variables are listed in Table A.1 of the appendix and comprehensively characterize both the air close to the land surface and the land surface itself. Since the pollen data are only available as daily values, 3 summary statistics were also calculated for each of the 85 environmental parameters: the daily mean, daily minimum, and daily maximum. From everyday experience, weather plays a key role in when, in what concentration, and for how long pollen is released by plants. For example, windy dry weather typically leads to higher levels of pollen that are rapidly dispersed, while rain quickly washes pollen out of the atmosphere. Since a plant's likelihood of releasing pollen on any given day is naturally affected by that plant's recent history, we also time-lagged each of the 85 parameters by a delay that varied from 1 to 30 days. This leads to a total of 85 × 3 × 30 = 7,650 variables that were used in our machine learning studies. Of these 7,650 variables, only a small subset turns out to be important for estimating the daily pollen count.
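As an illustration of how such a lagged feature set could be assembled, the following minimal sketch assumes the daily summary statistics are held in a pandas DataFrame indexed by date; the column-naming convention and helper function are assumptions, not the pipeline actually used in this work.

```python
import pandas as pd

def build_lagged_features(daily_stats: pd.DataFrame, max_lag_days: int = 30) -> pd.DataFrame:
    """Time-lag every daily summary statistic by 1..max_lag_days days.

    daily_stats : DataFrame indexed by date, one column per summary statistic
                  (85 variables x 3 statistics = 255 columns, e.g. 'T2M_mean').
    Returns a DataFrame with 255 x 30 = 7,650 lagged feature columns.
    """
    lagged = {
        f"{col}_lag{lag:02d}": daily_stats[col].shift(lag)
        for col in daily_stats.columns
        for lag in range(1, max_lag_days + 1)
    }
    return pd.DataFrame(lagged, index=daily_stats.index)

# Hypothetical usage: align the lagged features with the daily pollen counts and
# drop the first 30 days, whose lag histories are incomplete.
# X = build_lagged_features(daily_stats)
# data = X.join(daily_pollen_counts).dropna()
```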

2.3 Summary

In this brief chapter, we have introduced one of the datasets used in our research. These data on Ambrosia (Ragweed) pollen will be used in later chapters, together with supervised non-linear non-parametric multivariate machine learning regression, to build empirical models of the time evolution of Ambrosia pollen as well as to provide some useful physical insights. The inputs for these regression models include a comprehensive characterization of the physical context of the Ambrosia plants, both the physical state of the soil and the ambient air. A full list of the physical variables providing the environmental context is given in the appendix.

CHAPTER 3

OBSERVATION OF THE TEMPORAL CHANGES IN PARTICULATE

MATTER ABUNDANCE

The particulate matter data used in this research were collected by an in-situ observation system consisting of standardized commercial sensors deployed in Chattanooga, TN, and in Richardson, TX. Machine learning was used for sensor calibration against a reference standard instrument. The sensors were built and calibrated by fellow student Lakitha Wijeratne.

3.1 Optical Particle Counters

Early measurements of airborne particulates did not employ laser-based approaches (Kulkarni et al., 2011). This changed after the establishment of the Environmental Protection Agency (EPA) in 1970 (Suter, 2008). The EPA designated classes of instruments as federal reference methods (FRM) and federal equivalent methods (FEM) for devices used to examine air pollution (Hall et al., 2014). FRM denotes the standard methods for air pollution measurement devices; FEM devices incorporate new technologies to assess compliance with the NAAQS.

The particulate data used in this study were collected using small Optical Particle Counters (OPCs). The OPC devices were made by AlphaSense, the OPC-N2 and OPC-N3 (http://www.alphasense.com/). Figure 3.1a shows the OPC-N2 and figure 3.1b shows the OPC-N3. These two modern sensors are portable, lightweight (under 105 g), and small (75 mm × 60 mm × 65 mm) (Pope et al., 2018). They are also affordable ($300 - $500) (Pope et al., 2018). The OPC-N2 and OPC-N3 can provide both a particle size distribution and size-integrated quantities such as PM1, PM2.5 and PM10. The OPC data used here were calibrated using machine learning against reference instruments, accounting for the role of changing temperature, pressure and humidity.
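The calibration itself was carried out by a colleague and its details are not reproduced here. Purely to illustrate the idea, the sketch below assumes a co-location dataset of raw OPC readings alongside a reference instrument and fits a random forest regression (one of the supervised approaches used later in this dissertation); the file name and column names are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical co-location dataset: raw OPC PM2.5 plus temperature, pressure and
# humidity as inputs; the reference instrument's PM2.5 as the target.
df = pd.read_csv("opc_colocation.csv")
X = df[["opc_pm25", "temperature", "pressure", "humidity"]]
y = df["reference_pm25"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

calibration = RandomForestRegressor(n_estimators=200, random_state=0)
calibration.fit(X_train, y_train)
print("Held-out R^2:", calibration.score(X_test, y_test))

# The fitted model then maps raw field readings onto reference-equivalent values:
# corrected_pm25 = calibration.predict(field_readings[X.columns])
```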

Figure 3.1: Optical Particle Counters: (a) OPC-N2; (b) OPC-N3.

3.2 In-situ Observation System

These in-situ sensors are part of a much larger observing system making observations on multiple spatial and temporal scales. This system is called the multi-scale integrated sensing and simulation system (MINTS). MINTS aims to assemble a broad collection of data from eight different types of sentinels (Figure 3.2). The particulate matter data used here were collected from two of the street-level sensors, streaming data 24/7. These sensors are part of a collection of sensors we have called 'central nodes'. A central node consists of an array of Internet of Things (IoT) environmental sensors. The sensors installed are: an Ozone Monitor Model 108-L (https://www.twobtech.com/index.html), a UBLOX GPS receiver, a sky-facing USB webcam measuring cloud fraction, a BME280 (measuring atmospheric temperature, humidity, and pressure), a gas sensor, and light sensors. There is also a small computer, an Odroid XU4 (Kernel, 2015), built into each node to operate the whole system and manage all sensors.

The central nodes also provide a LoRa gateway to communicate wirelessly within a 10 km radius with even lower cost, solar powered sensors (also calibrated using machine learning). LoRa allows low power, long-range communication. LoRa devices are connected to each other through a wide area network protocol named LoRaWan. Our LoRa nodes make use of LoRaWan technology to form a sensor mesh network, which records data from a number of nodes. Each LoRa node includes a set of gas sensors for measurement of CO, NO2, C2H6OH, NH3, CH4, C3H8 and C4H10, and sensors measuring temperature, humidity, pressure, and airborne particulate matter.

Figure 3.2: Schematic of MINTS sensors.

3.3 Data from MINTS

Particulate data used in this thesis were collected from the sensors discussed in section 3.1 above, which were deployed in Chattanooga, Tennessee. The sensors recorded the temporal change of particulate matter abundance with a temporal resolution of 10 seconds during 2018. Some examples are shown in figure 3.3. All of these panels plot the change in PM abundance over time in Chattanooga on August 2nd, 2018. Figure 3.3a shows the abundance of PM in the 0.75-1.7 µm size fraction, and figures 3.3b and 3.3c show two other size ranges. Figure 3.4 shows data for August 10th - 12th.

Figure 3.3: Particulate time series data in Chattanooga, 08.02.2018: counts of airborne particles over the day in the size ranges (a) 0.75-1.7 µm, (b) 1.7-2.2 µm, and (c) 2.2-2.7 µm.

Figure 3.4: Particulate (0.75-1.7 µm) time series data from August 10th to 12th, 2018: (a) 08-10-2018, (b) 08-11-2018, (c) 08-12-2018.

CHAPTER 4

PHYSICAL INSIGHTS PROVIDED BY VARIOGRAMS

Variograms were first introduced to characterize the spatial and temporal scales associated with a given dataset. Characterizing the spatial and temporal scales of a data set is useful for several reasons. We would like to make sure that the spatial and temporal resolution of our observations can resolve the key features of the variable(s) being considered. At the same time, needlessly high resolution can have significant data storage impacts and bandwidth implications for data transmission.

4.1 Stochastic Process and Sampling

Machine learning has two key ingredients: first, a set of comprehensive, representative examples to learn from, called the training data; second, an algorithm that can learn the behavior of the system by example from the training data. Even with the best algorithm in the world for a given type of machine learning, the overall performance of the machine learning system will be improved by a comprehensive, representative, unbiased, balanced training dataset. Appropriate sampling of the training data characterizing the system is therefore a key step (Oliver and Webster, 2015).

The goal of the machine learning system we built was to estimate a continuous variable, the distribution of airborne particulates. The abundance of airborne particulates is characterized by continuous spatial and temporal change. The ideal training dataset will be representative of the various parts and regimes of the parameter space being considered. This will allow the machine learning to generalize well (i.e. to accurately estimate the state of the system from data not included in the training data). Sampling theory can help achieve this.

To illustrate the way variograms can help us, it is useful to first consider both random variables and regionalized variables. The dispersion, generation, and transportation of airborne particulates obey the laws of physics. Building an accurate deterministic model to perfectly describe this system is, in itself, a challenging task. The natural variability can instead be described using random numbers (Webster, 2000). In this situation, our variable of observation is treated as a random variable, and its value in one measurement as a draw from a probability distribution. The mapping from the variable's domain (location or time) to all possible values is called a stochastic process.

4.2 Variogram and Kriging

For a stochastic process, the value of the variable of interest, Z, can be written as:

$$Z(\vec{x}) = \mu + \epsilon(\vec{x}) \quad (4.1)$$

In this equation, $\mu$ is the mean value of the distribution, and $\epsilon(\vec{x})$ is a random variable whose distribution has the same shape as that of $Z$ but is centered on zero.

4.2.1 Variogram Definition Equation

The derivation in this section 4.2.1 follows, in simplified form, that given by (Oliver and Webster, 2015).

Observed values at adjacent locations (in space) or consecutive points (in time) are related by the continuity of the physical processes involved. The co-variance, or the variance of the differences between adjacent and/or consecutive locations, can be used to describe the relationship between pairs of points. Studying how this variance changes with the separation in space and/or time of pairs of points allows us to characterize the spatial and temporal scales of our data. The variogram is what is typically used to calculate these spatial and temporal scales:

$$E[Z(\vec{x}) - Z(\vec{x} + \vec{h})] = 0 \quad (4.2)$$

Variograms are an essential part of Kriging data interpolation. In equation 4.2, the vector $\vec{h}$ represents the distance or time interval between pairs of measurements. Starting with equation 4.2 we can derive the variogram equation 4.3:

$$\mathrm{Var}[Z(\vec{x}) - Z(\vec{x} + \vec{h})] = E[(Z(\vec{x}) - Z(\vec{x} + \vec{h}))^2] = 2\gamma(\vec{h}) \quad (4.3)$$

where $\gamma(\vec{h})$ is the semi-variance as a function of the separation vector $\vec{h}$, and is called a variogram.

Further, as $\mu_{\vec{x}} = \mu_{\vec{x}+\vec{h}}$, we arrive at the usual variogram definition in equation 4.4:

$$\gamma(\vec{h}) = \frac{E[(\epsilon(\vec{x}) - \epsilon(\vec{x} + \vec{h}))^2]}{2} \quad (4.4)$$

In practical use, we also need to estimate this expectation from a data sample. Assume we have a collection of samples $Z(\vec{x}_1), Z(\vec{x}_2), Z(\vec{x}_3), \ldots$. The empirical variogram for a separation $\vec{h}$ can then be written as equation 4.5:

$$\gamma(\vec{h}) = \frac{1}{2m(\vec{h})} \sum_{i=1}^{m(\vec{h})} \left[z(\vec{x}_i) - z(\vec{x}_i + \vec{h})\right]^2 \quad (4.5)$$

where $m(\vec{h})$ denotes the number of measurement pairs with spacing $\vec{h}$.
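As a concrete illustration, the following is a minimal sketch of how the empirical variogram of equation 4.5 could be computed for a regularly sampled time series; the function name and the assumption of a fixed sampling cadence with lags that are integer multiples of it are mine, not a description of the code used in this work.

```python
import numpy as np

def empirical_variogram(z, dt, max_lag):
    """Empirical temporal variogram of equation 4.5 for a regularly sampled series.

    z       : 1-D array of observations taken every dt time units
    max_lag : largest lag (same units as dt) at which to evaluate gamma
    Returns arrays of lags and semi-variances gamma(h).
    """
    z = np.asarray(z, dtype=float)
    lags, gamma = [], []
    for k in range(1, int(max_lag / dt) + 1):
        if k >= z.size:
            break
        d = z[k:] - z[:-k]            # all pairs separated by lag k*dt
        m = d.size                    # m(h): number of pairs at this lag
        lags.append(k * dt)
        gamma.append(np.sum(d**2) / (2.0 * m))
    return np.array(lags), np.array(gamma)

# Hypothetical usage for a PM2.5 series sampled every 2 seconds, out to 15 minutes:
# lags, gamma = empirical_variogram(pm25_values, dt=2.0, max_lag=15 * 60)
```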

4.2.2 Kriging for Data Interpolation

Variograms are used extensively in Kriging, and although it is not the focus of our work, it is worth briefly describing the data-interpolation approach called kriging (also referred to as Gaussian Process Models (Rasmussen, 2003)). Among interpolation and regression approaches, kriging came to prominence because of both the good results it yields and the fact that it provides an error estimate. Kriging is a Best Linear Unbiased Predictor (BLUP) (Lark et al., 2006).

Kriging interpolates data for a stochastic process by taking into account the spatial/temporal variation characterized by the variogram. The kriging estimate of the error of interpolation is shown in equation 4.6:

$$\delta^2(B) = \vec{b}^{\,T} \vec{\lambda} - \hat{\gamma}(B, B) \quad (4.6)$$

where $B$ stands for all the data samples in a block, the hat on top of $\gamma$ denotes a (tensor) summation over pairs of points, $\vec{b}$ contains the semi-variances between pairs of points in the block, and $\vec{\lambda}$ is the weight vector of this linear interpolation, in which each interpolated point is a weighted linear summation over all sampled points; $\gamma$ is the variogram (section 4.2.1).
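To make the kriging weights and error estimate concrete, here is a minimal sketch of ordinary point kriging given a variogram model. Equation 4.6 as written refers to a block of samples, so the variance below is its point-support analogue, and the unbiasedness constraint is handled with a Lagrange multiplier as in the standard formulation; the function and variable names, and the one-dimensional coordinate, are assumptions for illustration.

```python
import numpy as np

def ordinary_kriging(x_obs, z_obs, x0, variogram):
    """Ordinary point kriging at coordinate x0 from samples (x_obs, z_obs).

    variogram : callable gamma(h) giving the semi-variance at separation h
                (any fitted bounded model, e.g. the one in Section 4.2.3).
    Returns the kriged estimate and its kriging variance.
    """
    x = np.asarray(x_obs, dtype=float)
    z = np.asarray(z_obs, dtype=float)
    n = x.size

    # (n+1) x (n+1) system: pairwise semi-variances plus the unbiasedness
    # constraint enforced through a Lagrange multiplier mu.
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = variogram(np.abs(np.subtract.outer(x, x)))
    A[n, n] = 0.0
    b = np.ones(n + 1)
    b[:n] = variogram(np.abs(x - x0))

    sol = np.linalg.solve(A, b)
    lam, mu = sol[:n], sol[n]
    estimate = float(lam @ z)
    kriging_variance = float(lam @ b[:n] + mu)  # point-support analogue of equation 4.6
    return estimate, kriging_variance
```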

4.2.3 Practical Use of Variograms

When coding variograms, care is required, as the last term in equation 4.6 requires one to calculate all the variogram pairs in a block. Instead of the above calculation, kriging often uses a spherical function fitted to the variogram shape (e.g. Figure 4.1). The fitting equation can be adapted as required for various applications. The fit function can be written as:

$$\gamma(h) = \begin{cases} c_0 + c\left\{1 - \dfrac{2}{\pi}\cos^{-1}\left(\dfrac{h}{r}\right) + \dfrac{2h}{\pi r}\sqrt{1 - \dfrac{h^2}{r^2}}\right\}, & \text{for } 0 < h \le r \\ c_0 + c, & \text{for } h > r \\ 0, & \text{for } h = 0 \end{cases} \quad (4.7)$$

where $c_0$ is the nugget in figure 4.1, $r$ is the range, $c$ is the sill, and $h$ is the lag along the separation axis being considered (in this example the x-axis). The relationship between the variogram and the co-variance (Figure 4.2) is given in equation 4.8:

$$\gamma(\vec{x}_1, \vec{x}_2) = \mathrm{sill} - C(\vec{x}_1, \vec{x}_2) \quad (4.8)$$

where $\gamma$ is the variogram and $C$ is the co-variance.
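For completeness, here is a minimal sketch of the bounded variogram model of equation 4.7 and of extracting the nugget, sill and range by least-squares fitting. The lag and semi-variance numbers in the usage example are made-up illustrative values, not measurements; the model plateaus at c0 + c.

```python
import numpy as np
from scipy.optimize import curve_fit

def variogram_model(h, c0, c, r):
    """Bounded variogram model of equation 4.7 (nugget c0, sill component c, range r)."""
    h = np.asarray(h, dtype=float)
    g = np.full_like(h, c0 + c)                  # plateau for h > r
    inside = (h > 0) & (h <= r)
    hr = h[inside] / r
    g[inside] = c0 + c * (1.0 - (2.0 / np.pi) * np.arccos(hr)
                          + (2.0 * h[inside] / (np.pi * r)) * np.sqrt(1.0 - hr**2))
    g[h == 0] = 0.0                              # gamma(0) = 0 by definition
    return g

# Made-up illustrative lags (minutes) and empirical semi-variances, e.g. the
# output of empirical_variogram() above:
lags = np.array([0.5, 1, 2, 3, 5, 8, 10, 12, 15])
gamma_hat = np.array([0.010, 0.020, 0.030, 0.040, 0.045, 0.048, 0.050, 0.050, 0.050])

(c0, c, r), _ = curve_fit(variogram_model, lags, gamma_hat,
                          p0=[0.01, gamma_hat.max(), 5.0],
                          bounds=([0.0, 0.0, 1e-3], [np.inf, np.inf, np.inf]))
print(f"nugget c0 = {c0:.3f}, c = {c:.3f}, range r = {r:.2f} minutes")

# The fitted model can then be passed to ordinary_kriging() above, e.g.
# fitted = lambda h: variogram_model(h, c0, c, r)
```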

27 Figure 4.1: A spherical variogram fit.

Multiple collections of data samples are considered when calculating variograms, each at a different spacing, h; these are then used to plot the variogram, fit a variogram function, and characterize the sill, nugget and range values, which are used for the Kriging calculation and final error evaluation (Oliver and Webster, 2015). In this dissertation, we use the sill, nugget, and range to characterize key aspects of our dataset (Figure 4.3). In Figure 4.3, the x axis depicts the observation separation, and the y axis depicts the variogram value of equation 4.4. Fitted range, sill, and nugget values are noted in figure 4.3 as well.

The range characterizes the spatial scale beyond which the data are no longer significantly correlated, so it is a useful way to determine the spatial and/or temporal scales of our data. The nugget (the variogram at zero separation) characterizes the experimental error in our observations.

28 Figure 4.2: Covariance function as a function of data pair separation.

4.3 Variograms of Airborne Particulates

Let us consider the temporal variograms of airborne particulates; an example is shown in Figure 4.4. The lower panel shows the variogram as a function of the temporal separation (lag), between 0 and 15 minutes, of PM2.5 observations. The lag at which the variogram plateaus is called the range; we note that in this example the range is about 1 minute. This means that observations in this example time-series made within less than a minute of each other are partially correlated, while observations separated by a time lag of greater than a minute were not significantly correlated. So, for this case, we should be recording the time-series with a resolution of at least one minute. This can be contrasted with the usual environment agency hourly reporting. The units of the variogram are the same as the units of variance, in this case the variance of PM2.5.

29 Figure 4.3: Significance of variogram nugget and range. The range characterizes the spatial scale beyond which separation the data is no longer correlated, so this is a useful way to determine the spatial and/or temporal scales of our data. The nugget (the variogram at zero separation) characterizes the experimental error in our observations.

Figure 4.4: The lower panel shows an example temporal variogram for observed PM2.5 (fitted range: 0.98 minutes). The units of the variogram are the same as the units of variance, in this case of PM2.5. The upper panel shows how many observations are in each lag time bin.

Let us now consider a longer time-frame. Figure 4.5 (a) (the upper panel) shows an observed PM2.5 time series in µg/cm−3 (red line). Values are recorded every two seconds, and a one-hour moving time window centered on each time point is then considered. For this one-hour time window we calculate the representativeness uncertainty, σrep (the average deviation of the distribution of all measurements made over the one-hour moving window). The green lines either side of the observed PM2.5 indicate this representativeness uncertainty.

Figure 4.5 (b) shows the representativeness uncertainty for PM2.5 over a one-hour moving time window in µg/cm−3 centered on each observation. We note that the representativeness uncertainty is a significant fraction of the observed PM2.5, with representativeness uncertainties sometimes exceeding twice the observed PM2.5 value, i.e. there can be substantial small time-scale variability that should be routinely characterized.
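As a sketch of how this moving-window uncertainty could be computed: assuming the "average deviation of the distribution" can be approximated by the standard deviation of the roughly 1,800 two-second samples in a centred one-hour window (an assumption on my part, not necessarily the exact statistic used), one could write:

```python
import pandas as pd

def representativeness_uncertainty(pm25: pd.Series, samples_per_window: int = 1800) -> pd.DataFrame:
    """Moving-window spread of ~2 s observations (1800 samples ~ one hour).

    sigma_rep is approximated here by the standard deviation of the observations
    inside a centred one-hour window; the fractional uncertainty divides it by
    the observed value itself.
    """
    sigma_rep = pm25.rolling(window=samples_per_window, center=True, min_periods=30).std()
    return pd.DataFrame({
        "pm25": pm25,
        "sigma_rep": sigma_rep,
        "fractional_sigma_rep": sigma_rep / pm25,
    })

# Hypothetical usage, with pm25 a pandas Series indexed by timestamp:
# unc = representativeness_uncertainty(pm25)
```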

The time-scale of this PM2.5 variability is what the range of the variogram seeks to characterize. Figure 4.5 (c) shows a histogram of the ranges for each variogram calculated over the entire time series. A separate variogram is calculated for all the PM2.5 observations taken over the one hour moving time window centered on each observation taken every two- second. The most frequent range (time-scale) for this time-series is for a lag of 9 minutes. So

31 PM with Representativeness Uncertainty 2.5

40

30 2.5 20 PM 10

0 May 2018 Jun 2018 Jul 2018 Aug 2018 Sep 2018 Oct 2018 Nov 2018 Dec 2018 Jan 2019 Feb 2019 (a) Time

PM Representativeness Uncertainty 2.5

10

Rep Unc 5 2.5

PM 0 May 2018 Jun 2018 Jul 2018 Aug 2018 Sep 2018 Oct 2018 Nov 2018 Dec 2018 Jan 2019 Feb 2019 (b) Time

(c) (d)

−3 Figure 4.5: (a) Observed PM2.5 time series in µg/cm (shown in red). Values are recorded every two seconds, then a one-hour moving time window centered on each time point is considered. For this one-hour time window we calculate the representativeness uncertainty, σrep. The green lines either side of the observed PM2.5 indicate this representativeness uncertainty. (b) The representativeness uncertainty over a one hour moving time window in µg/cm−3. Note that the representativeness uncertainty is a significant fraction of the observed PM2.5. (c) A histogram of the range of each variogram. A separate variogram is considered for all observations taken over the one-hour moving time window centered on each observation taken every two-second is calculated. The most frequent range (time-scale) for this time-series is for a lag of 9 minutes. So ideally an observation should be reported every few minutes so that this dominant time-scale of temporal variations can be adequately resolved. (d) A histogram of the fractional representativeness uncertainty, σrep, for the entire time-series.

32 PM with Representativeness Uncertainty 10

100 10

PM 50

0 May 2018 Jun 2018 Jul 2018 Aug 2018 Sep 2018 Oct 2018 Nov 2018 Dec 2018 Jan 2019 Feb 2019 (a) Time

PM Representativeness Uncertainty 10

30

20 Rep Unc 10 10 PM 0 May 2018 Jun 2018 Jul 2018 Aug 2018 Sep 2018 Oct 2018 Nov 2018 Dec 2018 Jan 2019 Feb 2019 (b) Time

(c) (d)

−3 Figure 4.6: (a) Observed PM10 time series in µg/cm (shown in red). Values are recorded every two seconds, then a one hour moving time window centered on each time point is con- sidered. For this one hour time window we calculate the representativeness uncertainty, σrep. The green lines either side of the observed PM10 indicate this representativeness uncertainty. (b) The representativeness uncertainty over a one hour moving time window in µg/cm−3. Note that the representativeness uncertainty is a significant fraction of the observed PM10. (c) A histogram of the range of each variogram. A separate variogram is considered for all observations taken over the one hour moving time window centered on each observation taken every two-second is calculated. The most frequent range (time-scale) for this time-series is for a lag of 9 minutes. So ideally an observation should be reported every few minutes so that this dominant time-scale of temporal variations can be adequately resolved. (d) A histogram of the fractional representativeness uncertainty, σrep, for the entire time-series.

So ideally an observation should be reported every few minutes so that this dominant time-scale of PM2.5 temporal variations can be adequately resolved.

Figure 4.5 (d) shows a histogram of the PM2.5 fractional representativeness uncertainty,

σrep, for the entire time-series. The most frequent fractional representativeness uncertainty lies between 0 and 0.1 (i.e. between 0 and 10% of the observed PM2.5 abundance). However, for a significant fraction of the time, the fractional representativeness uncertainty exceeds 1, i.e. the representativeness uncertainty is large compared to the observed PM2.5 abundance.

Figure 4.6 shows very similar results for PM10, except that the variability is a little higher than for PM2.5.
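The moving-window variogram calculation described above can be sketched as follows. This is a minimal illustration in Python/NumPy, assuming a simple binned empirical semivariogram and taking the standard deviation within the window as one convention for the representativeness uncertainty; the variable names and the synthetic data are placeholders, not the processing code actually used for these figures.

```python
import numpy as np

def empirical_variogram(t, y, lag_bins):
    """Binned empirical semivariogram: gamma(h) = 0.5 * mean[(y_i - y_j)^2] for lag |t_i - t_j| in bin h."""
    i, j = np.triu_indices(len(t), k=1)          # all unique observation pairs
    lags = np.abs(t[i] - t[j])
    sqdiff = (y[i] - y[j]) ** 2
    gamma, counts = [], []
    for lo, hi in zip(lag_bins[:-1], lag_bins[1:]):
        sel = (lags >= lo) & (lags < hi)
        counts.append(int(sel.sum()))
        gamma.append(0.5 * sqdiff[sel].mean() if sel.any() else np.nan)
    return np.array(gamma), np.array(counts)

# Illustrative one-hour window of two-second observations (synthetic PM2.5-like values).
rng = np.random.default_rng(0)
t = np.arange(0.0, 3600.0, 2.0)                      # seconds within the moving window
y = 15.0 + np.cumsum(rng.normal(0.0, 0.2, t.size))   # placeholder time series
gamma, counts = empirical_variogram(t, y, np.arange(0.0, 1860.0, 60.0))  # 1-minute lag bins

# One simple convention for the representativeness uncertainty of this window:
sigma_rep = y.std()
```

The range of a variogram model fitted to gamma then gives the dominant time-scale for that window; repeating this for every window builds up histograms such as those in Figures 4.5 (c) and 4.6 (c).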

4.4 Summary

We have used variograms to characterize the observed timescales in the concentration of airborne particulates seen over many months. A spectrum of timescales was observed, with a dominant timescale of around 9 minutes for both airborne PM2.5 and PM10. This means that to adequately characterize the time variation of airborne particulates, measurements should be made at intervals of at most half this timescale, i.e. around every 5 minutes. However, timescales of less than a minute are also routinely observed. So the typical one-hour interval used by environmental agencies to report the concentrations of airborne particulates is really too long, and measurements at this frequency do not adequately resolve the temporal variability. Based on this variogram analysis, it is recommended that airborne particulate observations be made with a frequency of at least once per minute.

CHAPTER 5

BUILDING EMPIRICAL PHYSICAL MODELS OF AIRBORNE

PARTICULATES USING MACHINE LEARNING

The work shown in this chapter has been published as “Liu, X., Wu, D., Zewdie, G. K., Wijeratne, L., Timms, C. I., Riley, A., ... & Lary, D. J. (2017). Using machine learning to estimate atmospheric Ambrosia pollen concentrations in Tulsa, OK. Environmental Health Insights, DOI: 10.1177/1178630217699399.”

Machine learning allows us to learn by example, and to give our data a voice. It is particularly useful for those applications for which we do not have a complete theory. Machine learning can be thought of as an automated implementation of the scientific method (Domingos, 2015), following the same process of generating, testing, and discarding or refining hypotheses. While a scientist or engineer may spend their entire career coming up with and testing a few hundred hypotheses, a machine-learning system can do the same in a fraction of a second. Machine learning provides an objective set of tools for automating discovery.

Machine learning provides a straightforward framework for holistically bringing together data on many aspects/parts of a system, even if we do not have a theory that completely links each of these parts. Machine learning is closely related to (and often overlaps with) computational statistics and data science, which also use computers to build empirical models. In this dissertation, use is made of machine learning for multi-variate non-linear non-parametric regression (supervised learning) for estimating the abundance of airborne pollen. Unsupervised machine learning is also used for characterizing the different phases of the pollen-producing season.

Figure 5.1: An overview of some of the often used types of machine learning (Pilotte).

5.1 Introduction to Machine Learning

A famous definition by Arthur Samuel declares that ‘machine learning is the field of study that gives computers the ability to learn without being explicitly programmed’ (Samuel, 1959).

In the 1950s Samuel wrote a checkers-playing program. Even though he was not a very good checkers player, Samuel had the program learn from a large number of games: by watching what sorts of board positions tended to lead to wins and what sorts tended to lead to losses, the checkers-playing program learned over time which board positions were good and which were bad. In this manner, the program eventually learned to play checkers better than Arthur Samuel himself!

In machine learning terminology this is referred to as supervised learning, i.e. learning by example. In general, machine learning problems fall into two broad types: Supervised learn- ing and unsupervised learning. Figure 5.1 outlines some of the machine learning approaches that are often used.

5.1.1 Supervised Learning

Mitchell provided a more recent definition of supervised machine learning: ‘a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.’ In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.

Supervised learning usually involves performing a regression or a classification. Figure 5.2 shows two schematics illustrating conceptually how the data is organized for supervised machine learning. The green columns represent the inputs, the yellow/orange column represents the outputs. In a regression problem, we are typically trying to use a set of input variables, say x1 through xn, to predict a continuous real-valued output. In a classification problem, we are typically trying to use a set of input variables, say x1 through xn, to predict a discrete categorical output.
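As a minimal illustration of this layout (the numbers below are made up and purely illustrative), the training table of Figure 5.2 can be held as an input matrix X and an output vector y:

```python
import numpy as np

# Each row is one training example; the columns x1..xn are the inputs (features).
X = np.array([[0.2, 1.3, 7.0],    # example 1: x1, x2, x3
              [0.4, 0.9, 6.1],    # example 2
              [0.1, 1.7, 8.2]])   # example 3

y_regression = np.array([35.0, 28.5, 41.2])  # continuous output -> regression
y_classification = np.array([1, 0, 1])       # categorical output -> classification
```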

5.1.2 Unsupervised Learning

The purpose of the unsupervised classification is to split the data up into a distinct set of subgroups, often called classes or clusters. Unsupervised learning finds these class labels objectively without being provided them. This can also be regarded as dimensionality reduction. An advantage over approaches such as Principal Component Analysis (PCA) is that they typically perform better on non-linear data-sets (Dong and McAvoy, 1996). Figure 5.3 shows the data structure used by the clustering method.
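As a minimal sketch of this kind of unsupervised classification, k-means (one commonly used clustering algorithm; the specific algorithm is not prescribed by the discussion above) can be applied to a feature matrix whose rows are, for example, days of the pollen season. The data below are random placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Rows are days, columns are descriptors (e.g. pollen count and selected
# environmental variables); random numbers stand in for real observations.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))

X_std = StandardScaler().fit_transform(X)   # put features on a common scale
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_std)
# 'labels' assigns each day to one of the discovered subgroups, e.g. phases of the season.
```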

[Figure 5.2, panels (a) regression and (b) classification: y = f(x1, x2, x3, x4, x5, …, xn); the training data consist of input columns x1 … xn and output column(s) y; multivariate, non-linear, non-parametric; n can be very large.]

Figure 5.2: Schematics illustrating conceptually how the data is organized for supervised ma- chine learning. The green columns represent the inputs, the yellow/orange column represents the outputs.

Figure 5.3: Schematic illustrating conceptually how the data is organized for unsupervised machine learning (clustering). The green columns represent the inputs x1 … xn; there are no outputs. The approach is multivariate, non-linear, and non-parametric, and n can be very large. The purpose of the unsupervised classification is to split the data up into a distinct set of subgroups.

5.1.3 Feature Engineering

The input variables for machine learning are often called features. So-called ‘feature engineering’ is the process of trying to determine what are the best input features to use for the best performing machine learning model.

Let us now take a look at a set of different machine learning approaches, and then use them to build empirical models of the concentration of atmospheric pollen.

5.2 LASSO

Least absolute shrinkage and selection operator (LASSO) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces (Hans, 2009).

5.2.1 Algorithm

LASSO builds on the original least-squares regression method and keeps the same basic formula. However, motivated by the fact that coefficient estimates need not be unique when covariates are collinear, it imposes its own constraint for estimator selection and its own step-wise iteration condition. LASSO is also related to ridge regression through its approach to subset selection, so-called soft thresholding (Tibshirani, 2011). The equation of the LASSO method is shown below (Equation 5.1). In this equation, α is the intercept, the βj are the regression coefficients, the xij are the input variables, the yi are the observed output values, and t is a tuning parameter that controls the amount of shrinkage.

$$(\hat{\alpha}, \hat{\beta}) = \operatorname*{argmin}_{\alpha,\,\beta} \left\{ \sum_{i=1}^{N} \Big( y_i - \alpha - \sum_{j} \beta_j x_{ij} \Big)^{2} \right\} \quad \text{satisfying} \quad \sum_{j} |\beta_j| \leq t \qquad (5.1)$$

LASSO holds several computational advantages over previous regression models, such as least squares and ridge regression. The original lasso paper used an off-the-shelf quadratic program solver.

This does not scale well and is not transparent. The LARS algorithm (Efron, 1992) gives an efficient way of solving the lasso and connects the lasso to forward stage-wise regression.

The same algorithm is contained in the homotopy approach of Osborne et al. (Osborne et al., 2000). Coordinate descent algorithms are extremely simple and fast, and exploit the assumed sparsity of the model to great advantage.
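A minimal sketch of fitting a LASSO regression is shown below, using the coordinate-descent implementation in scikit-learn. It solves the penalized form of Equation (5.1), where the penalty weight alpha plays the role of the constraint t; the data, the train/validation split, and the value of alpha are placeholder assumptions rather than the exact configuration used for the results that follow.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder data: rows are days, columns are the environmental input variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(2180, 85))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=2180)   # synthetic pollen-like target

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.05, random_state=0)

scaler = StandardScaler().fit(X_tr)
lasso = Lasso(alpha=0.1, max_iter=10000).fit(scaler.transform(X_tr), y_tr)

n_selected = np.count_nonzero(lasso.coef_)   # variables kept by the L1 penalty
r_train = np.corrcoef(lasso.predict(scaler.transform(X_tr)), y_tr)[0, 1]
r_valid = np.corrcoef(lasso.predict(scaler.transform(X_va)), y_va)[0, 1]
```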

5.2.2 Result

Figure 5.4 shows a scatter diagram for the LASSO pollen estimate. The x-axis shows the observed pollen amount, and the y-axis shows the LASSO estimated pollen amount. The blue circles depict the training dataset, which has a correlation coefficient, RT = 0.53. The red squares depict the independent validation dataset, which has a correlation coefficient,

RV = 0.56.

Figure 5.4: Scatter diagram for the airborne pollen estimates made using the LASSO approach. The x-axis shows the actual pollen and the y-axis the estimated pollen; the 1:1 line, the training data (#2071), and the independent validation data (#109) are shown (RT = 0.53, RV = 0.56).

5.3 Neural Networks

Neural Networks, more properly referred to as ‘artificial’ Neural Networks (ANN), were conceived by the inventor of one of the first neurocomputers, Dr. Robert Hecht-Nielsen. He defines a Neural Network as: ‘a system made up of many simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs’ (Caudill, 1987).

5.3.1 Algorithm

Warren McCulloch and Walter Pitts (McCulloch and Pitts, 1943) created a computational model for Neural Networks based on mathematics and algorithms called threshold logic. This model paved the way for Neural Network research to split into two distinct approaches. One approach focused on biological processes in the brain and the other focused on the application of Neural Networks to artificial intelligence. As can be seen in Figure 5.5, Neural Networks are typically organized in layers. Layers are made up of a number of interconnected ‘nodes’ which contain an ‘activation function’. Patterns are presented to the network via the ‘input layer’, which communicates to one or more

A feedforward neural network with three inputs, two hidden neurons, and one output neuron.

$$\hat{y} = b^{2} + \sum_{i=1}^{2} w_{i}^{2}\, \sigma\!\left( b_{i}^{1} + \sum_{j=1}^{3} w_{i,j}^{1}\, x_{j} \right)$$

Figure 5.5: Schematic of a single hidden layer, feed-forward Neural Network. Each arrow corresponds to a real-valued parameter, or a weight, of the network. The values of these parameters are tuned in the network training. b are the biases, w are the weights, σ is the activation function.

‘hidden layers’ where the actual processing is done via a system of weighted ‘connections’. The hidden layers then link to an ‘output layer’ where the answer is output. What has attracted the most interest in Neural Networks is the possibility of learning. Given a specific task to solve, and a class of functions F, learning means using a set of observations to find f∗ ∈ F which solves the task in some optimal sense. Learning algorithms search through the solution space to find a function that has the smallest possible cost (as measured by a cost function). There are three major learning paradigms, each corresponding to a particular abstract learning task: supervised learning, unsupervised learning, and reinforcement learning. In this dissertation, we will first use supervised learning for regression. In supervised learning, we are given a set of example pairs (the training dataset) (x, y), x ∈ X, y ∈ Y, and the aim is to find a function f : X → Y in the allowed class of functions that matches the examples. In other words, we wish to infer the mapping implied by the data; the cost function

is related to the mismatch between our mapping and the data and it implicitly contains prior knowledge about the problem domain. When one tries to minimize this cost using gradient descent for the class of Neural Networks called multilayer perceptrons (MLP), one obtains the common and well-known back-propagation algorithm for training Neural Networks.

Other than regression, there are two more applications of supervised Neural Network learning: pattern recognition (also known as classification) and sequential data recognition, which could be potential tools in my further research.
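A minimal sketch of a feed-forward neural network regression, in the spirit of Figure 5.5, is shown below using scikit-learn's MLPRegressor. The architecture, solver, and data are illustrative assumptions, not the exact configuration used for the results below.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Placeholder inputs: 85 environmental variables per day, with a synthetic target.
rng = np.random.default_rng(0)
X = rng.normal(size=(2180, 85))
y = np.tanh(X[:, :3].sum(axis=1)) * 100 + rng.normal(0, 5, 2180)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.05, random_state=0)

# One hidden layer of logistic units, trained by minimizing the squared-error cost
# with a gradient-based optimizer (back-propagation supplies the gradients).
net = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(20,), activation="logistic",
                 solver="adam", max_iter=2000, random_state=0),
)
net.fit(X_tr, y_tr)
r_valid = np.corrcoef(net.predict(X_va), y_va)[0, 1]
```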

5.3.2 Result

Figure 5.6 shows the Neural Network scatter diagram. The validation correlation coefficient,

RV = 0.61, is not as good as that of the next approach we will consider, the Random Forest.


Figure 5.6: Scatter diagram for the airborne pollen estimates made using a Neural Network.

5.4 Ensembles of Decision Trees

A Random Forest is an ensemble statistical learning approach, consisting of an ensemble of decision trees (Ho, 1998; Breiman, 2001). It has proved to be a very useful multi-variable, non-linear, non-parametric approach for both regression and supervised classification. Ensemble methods are less prone to over-learning the noise of the data and typically provide better generalization (Kursa, 2014). A random forest also provides a useful ranking of the relative importance of the predictors. Let us first look at the building block of random forests, the individual decision trees.

5.4.1 Decision Trees

A decision tree is a flowchart-like structure in which each internal node represents a test on an attribute (Figure 5.7), each branch represents the outcome of the test, and each leaf node represents a class label (the decision taken after computing all attributes). The paths from root to leaf represent classification rules. In decision analysis a decision tree and the closely related influence diagram are used as a visual and analytical decision support tool, where the expected values (or expected utility) of competing alternatives are calculated.

Decision trees are commonly used in operations research and operations management, and for calculating conditional probabilities, because of the following advantages:

1. They are easy to understand and interpret.

2. They clearly describe complex cases for analysis.

3. They allow the addition of new possible scenarios.

4. They help determine the worst, best and expected values for different scenarios.

5. They can easily be combined with other decision techniques.

Figure 5.7: A decision tree consists of 3 types of nodes: 1. Decision nodes, commonly represented by squares; 2. Chance nodes, represented by circles.

Decision trees also have two disadvantages:

• For data including categorical variables with different numbers of levels, the information gain in decision trees is biased in favor of those attributes with more levels (Deng et al., 2011).

• Calculations can get very complex particularly if many values are uncertain and/or if many outcomes are linked.

5.4.2 Random Forest

Random forests were introduced by (Ho, 1998). Typically an ensemble of weak learners will perform better than a single learner on its own and will be less likely to overfit the noise in a dataset. Breiman's work with Random Forests was influenced by the earlier work of (Amit and Geman, 1997), who introduced the idea of searching over a random subset of the available decisions when splitting a node, in the context of growing a single tree. Finally, the idea of

randomized node optimization, where the decision at each node is selected by a randomized procedure rather than a deterministic optimization, was introduced by (Dietterich, 2000). The introduction of Random Forests as they are used today was due to (Breiman, 2001). A Random Forest can be briefly described as a bootstrap (Johnson, 2001) built on decision trees – a tree bagger. In an important paper on written character recognition, (Amit and Geman, 1997) define a large number of geometric features and search over a random selection of these for the best split at each node.

Figure 5.8: A Random Forest is a classifier consisting of a collection of tree-structured classifiers {h(~x,Θk), k=1,...} where the {Θk} are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input ~x

The common element in all of these procedures is that for the kth tree, a random vector

Θk is generated, independent of the past random vectors Θ1,...,Θk−1 but with the same distribution; and a tree is grown using the training set and Θk, resulting in a classifier h(~x, Θk) where ~x is an input vector. For instance, in bagging the random vector Θ is generated as the counts in N boxes resulting from N darts thrown at random at the boxes, where N is the number of examples in the training set. In random split selection Θ consists of

a number of independent random integers between 1 and K. The nature and dimensionality of Θ depends on its use in tree construction. After a large number of trees is generated, they vote for the most popular class. An ensemble of such trees is called a Random Forest (Figure 5.8).

5.4.3 Advantages of a Random Forest

Random Forests are less prone to overfitting the training dataset (Kursa, 2014). They also provide an approach for ranking the relative importance of the input variables. “Random Forests give results competitive with boosting and adaptive bagging, yet do not progressively change the training set. Their accuracy indicates that they act to reduce bias” (Breiman, 2001). This is referred to as robustness. Similar to Artificial Neural Networks, Random Forests can be used for both supervised regression and classification. A useful feature of Random Forests is their ability to provide information on the relative importance of the input variables. Random Forests can be used to rank the importance of variables in a regression or classification problem in a natural way (Breiman, 2001). The first step in measuring the

variable importance in a data set Dn = {(Xi, Yi), i = 1, ..., n} is to fit a Random Forest to the data. During the fitting process, the out-of-bag error for each data point is recorded and averaged over the forest (errors on an independent test set can be substituted if bagging is not used during training). To measure the importance of the jth feature after training, the values of the jth feature are permuted among the training data and the out-of-bag error is again computed on this perturbed data set. The importance score for the jth feature is computed by averaging the difference in out-of-bag error before and after the permutation over all trees. The score is normalized by the standard deviation of these differences. Features that produce large values for this score are ranked as more important than features which produce small values.

This method of determining variable importance has some drawbacks. For data including categorical variables with different numbers of levels, Random Forests are biased in favor of those attributes with more levels.
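A minimal sketch of this permutation-importance procedure, using scikit-learn's RandomForestRegressor and its permutation_importance helper on a held-out sample, is given below; the data are random placeholders standing in for the 85 environmental variables.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2180, 85))
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=2180)   # placeholder target

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.05, random_state=0)

forest = RandomForestRegressor(n_estimators=50, oob_score=True,
                               random_state=0).fit(X_tr, y_tr)

# Permute one feature at a time on held-out data and record how much the score drops;
# features whose permutation hurts the most are ranked as most important.
result = permutation_importance(forest, X_va, y_va, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
```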

5.4.4 Estimating Ambrosia Pollen Using Random Forests

A Random Forest can facilitate estimation of the pollen count as a multi-variate, non-parametric function of N input variables, i.e.

$$\text{pollen count} = f_{\text{Random Forest}}(x_1, \ldots, x_N) \qquad (5.2)$$

where x1, . . . , xN are the N readily available environmental parameters (listed in the appendix). Two enhancements were then made to a standard Random Forest implementation that both improved the performance and provided an estimated error for each pollen count that is estimated. The enhancement was inspired by the Newton-Raphson iteration. A series of iterations was executed; in each iteration, a Random Forest was used to estimate the pollen count as indicated in equation (5.2). Then, the estimated pollen count was compared with the actual pollen count to calculate an error, that is:

error = observation − estimation (5.3)

Next, an additional Random Forest was used to learn this error. After each iteration, the Random Forest estimate of the pollen count was then corrected using the error estimated by this additional Random Forest, that is, by rearranging Equation (5.3) and replacing the observation by our Random Forest estimate of the pollen count and by replacing the error with the estimated error provided by the second Random Forest:

improved estimation = initial estimation + estimated error (5.4)

This was then repeated for a set of n iterations (we used n=10). After each iteration, the estimated pollen count and estimated pollen count error were added as additional input variables for the next iteration. This considerably improved the reliability of our estimated pollen count, as can be seen by comparing the verification scatter diagrams in Figure 5.9.

Figures 5.9a and 5.9b are verification scatter diagrams, with the x-axis showing the observed amount of pollen and the y-axis showing the estimated amount of pollen, while the error bars show the estimated uncertainty. These estimates do not require the phenology to be specified, yet show a substantial improvement over the prior study shown in Figure 2.1.

Figure 5.9a shows the scatter diagram for the first Newton-Raphson iteration and Figure 5.9b shows the much improved scatter diagram after the tenth iteration. In each case the blue line shows a perfect 1:1 response, the green points depict the training data, the red points depict the independent validation. The random forest has performed very well.

Interestingly, when the pollen estimations were tested using a completely independent data sample not used in the training (the validation data set), the correlation coefficient is as good as that for the training dataset. These scatter diagrams show the remarkable ability of the iterative Random Forest approach to accurately estimate the airborne pollen count.

Figure 5.9c shows the correlation coefficient for the training and independent validation datasets as a function of training iteration. The error does not substantially reduce after four iterations. Figure 5.9d shows a histogram of the residuals between the observed and estimated pollen counts after the tenth iteration.

Figure 5.9e shows the machine learning error as a function of the number of decision tree estimators in the random forest ensemble. No further reduction in the error occurs after the ensemble size reaches about forty trees. In the results presented here, we used an ensemble size of fifty trees.
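For concreteness, the iterative error-correction scheme of Equations (5.2)–(5.4) can be sketched as follows. The ensemble size of fifty trees and the ten iterations follow the description above; the function and variable names are otherwise illustrative assumptions rather than the exact code used for Figure 5.9.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def iterative_forest(X_tr, y_tr, X_va, n_iter=10, n_trees=50, seed=0):
    """Iteratively refine a Random Forest pollen estimate with a residual model."""
    feats_tr, feats_va = X_tr.copy(), X_va.copy()
    for _ in range(n_iter):
        # Equation (5.2): estimate the pollen count from the current feature set.
        f_est = RandomForestRegressor(n_estimators=n_trees, random_state=seed).fit(feats_tr, y_tr)
        est_tr, est_va = f_est.predict(feats_tr), f_est.predict(feats_va)

        # Equation (5.3): error = observation - estimation, learned by a second forest.
        f_err = RandomForestRegressor(n_estimators=n_trees, random_state=seed).fit(feats_tr, y_tr - est_tr)
        err_tr, err_va = f_err.predict(feats_tr), f_err.predict(feats_va)

        # Equation (5.4): improved estimation = initial estimation + estimated error.
        impr_tr, impr_va = est_tr + err_tr, est_va + err_va

        # Append the estimate and estimated error as additional inputs for the next pass.
        feats_tr = np.column_stack([feats_tr, impr_tr, err_tr])
        feats_va = np.column_stack([feats_va, impr_va, err_va])
    return impr_va, err_va
```

In this sketch the residual forest's prediction doubles as the estimated error attached to each pollen estimate, in the spirit of the error bars shown in Figures 5.9a and 5.9b.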

Figure 5.10 shows the machine learning model estimate of Ambrosia (Ragweed) pollen and the actual pollen observation for all days in all 28 seasons.

(a) Scatter diagram for the 1st training iteration. The x-axis shows the actual pollen abundance, the y-axis shows the estimated pollen abundance. The blue line shows a perfect 1:1 response, the green points depict the training data, and the red points depict the independent validation.

(b) Scatter diagram for the 10th iteration. The blue line shows a perfect 1:1 response, the green points depict the training data, and the red points depict the independent validation.

(c) The correlation coefficient for the scatter diagram as a function of training iteration. The blue line is for the training data, the red line is for the independent validation.

(d) A histogram showing the residual error of the machine learning estimate (i.e. pollen observation minus machine learning estimate) for the tenth iteration.

(e) Machine learning error as a function of the number of decision tree estimators in the random forest ensemble.

Figure 5.9: Using Random Forest Regression to estimate the airborne abundance of Ambrosia (Ragweed) pollen.

Figure 5.10: Plot of the model-estimated pollen count versus the actual observed pollen count as a function of time (days).

In the previous work of (Howard and Levetin, 2014), who provided the pollen data for our research, phenology was a significant factor in predicting the geographic spread of Ragweed plants, and consequently, Ragweed pollen (Howard and Levetin, 2014). Phenology is defined here as the averaged daily airborne pollen concentration of all prior years. As mentioned in §2.1, obtaining the prior years' pollen concentration requires extensive laboratory work. Thus, our model estimates the pollen abundance solely from the current and recent environmental context, even if no prior pollen observations were available at that location. Science question: Is it possible for us to predict the pollen abundance using just characteristics of the physical environment? There has been a large amount of prior research conducted on which environmental variables affect the pollen abundance; however, none of these models faithfully describe the details of the day-to-day pollen variability. We will use machine learning to construct an accurate model and provide physical insights into the relative importance of the various environmental variables in estimating the pollen abundance. §5.4.4 presents our machine learning model for pollen estimation, built using a machine learning algorithm combined with a comprehensive environmental and pollen training dataset. The results related to the Ambrosia pollen estimation presented in this chapter have already been published in the peer-reviewed journal Environmental Health Insights (Liu et al., 2017).
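The 'recent environmental context' enters the model as lagged copies of the input variables (for example, a variable's value 15 or 30 days before the target day). A minimal sketch of constructing such lagged features with pandas is shown below; the column name, the dates, and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical daily table with a vegetation-greenness column 'GRN'.
days = pd.date_range("2012-06-01", periods=120, freq="D")
df = pd.DataFrame({"GRN": np.linspace(0.2, 0.6, len(days))}, index=days)

# Lagged copies: the value measured 15 and 30 days before each target day.
for lag in (15, 30):
    df[f"GRN_lag{lag}"] = df["GRN"].shift(lag)

# The first `lag` rows are NaN and are dropped before training the regression model.
df = df.dropna()
```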

Table 5.1: Table of correlation coefficients for the various machine learning approaches used in this study, with the best performing approach listed first. RT is the correlation coefficient for the training dataset. RV is the correlation coefficient for the totally independent validation dataset.

Machine Learning Approach                      Training, RT   Validation, RV
Without Phenology
  Random Forest                                0.98           1
  Neural Network                               0.91           0.61
  LASSO                                        0.53           0.56
With Phenology
  Prior Multi-linear Study                     0.68

We found that a holistic set of data can be combined with machine learning algorithms to effectively build an empirical model estimating the daily pollen concentration. Some of these machine learning models can also provide some useful insights into the relative importance of the inputs. The regression models use all meteorology and soil variables as estimators and output the estimated daily pollen concentration. For each machine learning approach we used, the performance was quantified using a scatter diagram. In the scatter diagram the actual observations were plotted against the machine learning estimates from the current study. A perfect scatter diagram is a straight line with a slope of one and an intercept of zero. In each case, the data were randomly split into two independent samples; one sample was used for training and the second sample for an independent validation, that is, the validation data were not used in the training stage of the algorithms. Table 5.1 shows the correlation coefficients for the various machine learning approaches used in this study. The best performing approach, namely the Random Forest, is

listed first. RT is the correlation coefficient for the training dataset and RV is the correlation coefficient for the totally independent validation dataset. By comparing the correlation coefficients between the independent validation dataset and our machine learning models we see that the Random Forest provides the best regression result.

Figure 5.11: Rank of the variables' relative importance.

The Random Forest algorithm also provided the relative importance of all the input variables. Figure 5.11 shows the importance of the input variables in descending order of importance. In Figure 5.11, abbreviations of variable names are placed next to their bar of importance. Lag30 means the variable is the value measured 30 days before the target value is measured. The full variable names, corresponding to abbreviated variable names such as lGRN, Z0H, and DISPH, are listed in the appendix. It can be seen that the variable importance decreases sharply after the top few variables. Only the top 20 variables are shown in Figure 5.11. These top few variables contribute most of the information that is required for estimating the pollen abundance. The most important input variables are the vegetation greenness, the displacement height, the roughness length for sensible

heat, soil evaporation, and the energy stored in all reservoirs.

The vegetation greenness depends on several factors such as the number and type of plants, how leafy they are, and how healthy they are. These plants obviously include the Ambrosia plants (Burgan and Hartford, 1993). The high rank of the greenness is to be expected: the better the plants grow, the more pollen will be produced, and the airborne pollen concentration will thus be affected indirectly. (Sölter et al., 2012) gives a detailed description of how the temperature influences the growth of Ambrosia plants. The previous study of (Howard and Levetin, 2014) had a temperature term in their regression model shown in equation (2.1). It is noteworthy that the machine learning highlighted a lagged greenness. A healthy Ambrosia plant will need to mature before releasing the pollen, thus explaining the time delay. Soil evaporation will affect the amount of plant transpiration and will affect the plants' health status. For example, a period of drought stress would have a negative impact on plant health. The energy stored in all reservoirs will affect the temperature and be related to the soil moisture. The displacement height is a variable describing the vertical distribution of horizontal mean wind speeds within the lowest portion of the planetary boundary layer. Combining this with the equation of mass continuity, it is easy to understand that the wind speed is an important variable for the dispersion of airborne particulates such as pollen. The roughness length for sensible heat describes the height below which the pressure gradient will be affected by complex boundary conditions. It is important for an equation that describes the mean specific humidity in the dynamic sub-layer (Brutsaert, 1975). Humidity will influence the growth of plants as well as directly affecting the airborne particle density (Kamens et al., 1988).

5.5 Summary of Pollen Estimation Using Machine Learning

So far we have introduced machine learning and examined three different multi-variate regression (supervised learning) algorithms, which are least absolute shrinkage and selection

operator (LASSO), neural network and random forest. We also discussed the details of how we applied these algorithms in our multi-variable regression model, and estimated the pollen concentration over time with the features discussed in Chapter 2 (also expanded from variables listed in the appendix). Once the models were built we then evaluated their performance using scatter diagrams. We found that the random forest ensemble approach performed the best for our estimation of the Ambrosia pollen abundance. Let us now employ more machine learning regression approaches on a similar problem.

Let us see if we can estimate the abundance of airborne PM2.5 using machine learning. This

PM2.5 data was collected at UT Dallas over a period of many months. We will inter-compare a set of many approaches to see which provides the best performance for our fully non-linear non-parametric regression models for estimating the concentrations of airborne particulates.

5.6 Machine Learning Inter-comparison for PM2.5

To compare the performance of different machine learning approaches, multiple approaches were applied to the same data set, and the root mean square error (RMSE) between the estimated and the actual PM2.5 abundances was calculated using cross-validation. Cross-validation guarantees that the RMSE is computed on validation data, so that over-fitted results are avoided. The data set used in this comparison has the same feature variables as our pollen data, but the label data are replaced with PM2.5. The ranking of the various machine learning approaches by performance, based on the RMSE value, is listed in Table 5.2.
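A minimal sketch of such a cross-validated RMSE comparison is given below using scikit-learn; the handful of candidate models only stand in for the much larger set in Table 5.2, and the data are placeholders.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 85))                     # placeholder features
y = np.sin(X[:, 0]) * 10 + rng.normal(0, 1, 500)   # placeholder PM2.5-like target

candidates = {
    "Linear regression": LinearRegression(),
    "Decision tree": DecisionTreeRegressor(random_state=0),
    "Random forest (bagged trees)": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gaussian SVM": SVR(kernel="rbf"),
    "Gaussian process": GaussianProcessRegressor(),
}

# Five-fold cross-validation: the RMSE is always evaluated on held-out folds,
# so over-fitting the training data does not flatter the score.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    print(f"{name:30s} RMSE = {-scores.mean():.3f}")
```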

Table 5.2: Table of cross-validated root mean square error (RMSE) and correlation coefficient (R) values for multiple machine learning approaches. Values are listed in ascending order of RMSE.

Machine Learning Approach                                                 RMSE            R-squared
Optimized Ensemble Method (ensemble of trees;
  optimization: bagged ensembles, LSBoost, 30 iterations)                 0.88495         0.95
Exponential Gaussian Process Regression                                   0.94737         0.94
Rational Quadratic Gaussian Process Regression                            1.0145          0.94
Matern 5/2 Gaussian Process Regression                                    1.1617          0.92
Bagged Trees (random forest)                                              1.1986          0.91
Boosted Trees                                                             1.2024          0.91
Squared Exponential Gaussian Process Regression                           1.2657          0.90
Optimized Tree (optimization: Bayesian optimization)                      1.2679          0.90
Fine Tree (decision tree)                                                 1.3642          0.88
Two-layer Bayesian Regularization Neural Network                          1.515           0.90
Medium Tree (decision tree)                                               1.53            0.85
Coarse Tree (decision tree)                                               1.7406          0.81
Medium Gaussian SVM                                                       1.7607          0.81
Coarse Gaussian SVM                                                       1.7742          0.80
Fine Gaussian SVM                                                         3.4888          0.24
Optimized SVM (optimization: Bayesian optimization)                       4.0003
Linear regression                                                         217.55
Linear SVM                                                                237.74
Robust linear regression                                                  946.32
Quadratic SVM                                                             8.7948 × 10⁵
Cubic SVM                                                                 8.0001 × 10⁹

(a) A time series of the actual PM2.5 abundance (blue) and the Linear Regression estimated PM2.5 abundance (orange) as a function of time.

(b) A scatter diagram of the actual PM2.5 abundance (x-axis) against the Linear Regression estimate of PM2.5 (y-axis). Figure 5.12: Linear Regression Performance

(a) A time series of the actual PM2.5 abundance (blue) and the SVM estimated PM2.5 abundance (orange) as a function of time.

(b) A scatter diagram of the actual PM2.5 abundance (x-axis) against the ma- chine learning estimate of PM2.5 (y-axis) for SVM. Figure 5.13: Support Vector Machine Performance

(a) A time series of the actual PM2.5 abundance (blue) and the decision tree estimated PM2.5 abundance (orange) as a function of time.

(b) A scatter diagram of the actual PM2.5 abundance (x-axis) against the machine learning estimate of PM2.5 (y-axis) for decision tree. Figure 5.14: Decision Tree Performance

(a) A time series of the actual PM2.5 abundance (blue) and the Gaussian Process Regression estimated PM2.5 abundance (orange) as a function of time.

(b) A scatter diagram of the actual PM2.5 abundance (x-axis) against the Gaussian Process Regression estimate of PM2.5 (y-axis). Figure 5.15: Gaussian Process Regression Performance

(a) A time series of the actual PM2.5 abundance (blue) and the ensemble method estimated PM2.5 abundance (orange) as a function of time.

(b) A scatter diagram of the actual PM2.5 abundance (x-axis) against the machine learning estimate of PM2.5 (y-axis) for ensemble method. Figure 5.16: Ensemble Method Performance

It is obvious in the ranking presented in Table 5.2 that most of the conventional linear and polynomial fitting approaches performed poorly when trying to estimate the abundance of airborne PM2.5. It is clear that a multi-variate and non-linear approach is needed.

Figures 5.12a through 5.16b each show both a time series of the actual PM2.5 abundance

(blue) and the machine learning estimated PM2.5 abundance (orange) as a function of time, along with a scatter diagram of the actual PM2.5 abundance (x-axis) against the machine learning estimate of PM2.5 (y-axis) for a set of different approaches, starting with a linear regression and finishing with the best machine learning approach (an ensemble of trees). We can see in Figure 5.12a and Figure 5.12b that the linear regression clearly does not perform well. In the scatter diagram (Figure 5.12b), the lengths of the orange horizontal lines depict the linear regression prediction uncertainty. On average, these uncertainties are much larger than those in the later Figures 5.15b and 5.16b. The performance of the linear kernel support vector machine in Figures 5.13a and 5.13b, and of the decision tree in Figures 5.14a and 5.14b, is comparable to that of the linear regression. It is clear that to adequately predict the PM2.5 abundance a non-linear model is required. The non-linear machine learning approaches perform much better, e.g. the Gaussian Process Regression and the ensemble methods in Figure 5.15a through Figure 5.16b. It can be observed in Figures 5.15a and 5.16a that PM2.5 values above 30 are well predicted (surrounded by orange circles) in the time series. This improvement is reflected in the scatter diagrams in Figures 5.15b and 5.16b. In marked contrast to the conventional linear and polynomial fitting approaches, the machine learning approaches, such as Gaussian process regression, support vector machine (SVM) regression, decision tree, and random forest, perform well. It is found for many applications that the ensemble learner approaches, such as random forests, perform well. The performance of the ensemble of trees is improved further when hyper-parameter optimization is performed.
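As an illustration of the Gaussian Process Regression family in Table 5.2, the sketch below fits Gaussian processes with exponential, Matérn 5/2, rational quadratic, and squared exponential kernels using scikit-learn; the data, the added white-noise kernel, and all hyper-parameters are placeholder assumptions rather than the configuration behind Table 5.2.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, RationalQuadratic, WhiteKernel
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))                       # placeholder features
y = np.sin(X[:, 0]) * 10 + rng.normal(0, 1, 600)     # placeholder PM2.5-like target
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

kernels = {
    "Exponential (Matern nu=0.5)": Matern(nu=0.5) + WhiteKernel(),
    "Matern 5/2": Matern(nu=2.5) + WhiteKernel(),
    "Rational quadratic": RationalQuadratic() + WhiteKernel(),
    "Squared exponential (RBF)": RBF() + WhiteKernel(),
}

for name, kernel in kernels.items():
    gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X_tr, y_tr)
    pred, std = gpr.predict(X_va, return_std=True)   # mean and per-point uncertainty
    rmse = np.sqrt(np.mean((pred - y_va) ** 2))
    print(f"{name:28s} RMSE = {rmse:.3f}")
```

A useful by-product of the Gaussian process is the per-point predictive standard deviation, which provides an uncertainty estimate alongside each prediction.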

CHAPTER 6

SUMMARY

6.1 Conclusion

In this dissertation, research on airborne pollen and particulates was presented. A set of

empirical machine learning non-linear multi-variate non-parametric models were developed to

describe the abundance of airborne particulates. The machine learning provided some useful

physical insights into some of the key parameters that are most important in describing the

abundance of airborne Ambrosia (Ragweed) pollen; these included vegetation greenness,

the displacement height, the roughness length for sensible heat, soil evaporation,

and the energy stored in all reservoirs. The vegetation greenness depends on several

factors such as the number and type of plants, how leafy they are, and how healthy they

are. These plants obviously include the Ambrosia plants (Burgan and Hartford, 1993). The

high rank of the greenness is to be expected, the better the plants grow, the more pollen

will be produced. The airborne pollen concentration will thus be affected indirectly. (Sölter

et al., 2012) gives a detailed description on how the temperature influences the growth of

Ambrosia plants. The previous study of (Howard and Levetin, 2014) had a temperature

term in their regression model shown in equation (2.1). It is noteworthy that the machine

learning highlighted a lagged greenness. A healthy Ambrosia plant will need to mature before releasing the pollen, thus explaining the time delay. Soil evaporation will affect the amount of plant transpiration and will affect the plants’ health status. For example, a period of drought stress would have a negative impact on plant health. The energy stored in all reservoirs will affect the temperature and be related to the soil moisture.

The displacement height is a variable describing the vertical distribution of horizontal mean wind speeds within the lowest portion of the planetary boundary layer. Combining this with the equation of mass continuity, it is easy to understand that the wind speed is an important

variable for the dispersion of airborne particulates such as pollen. The roughness length for sensible heat describes the height below which the pressure gradient will be affected by complex boundary conditions. It is important for an equation that describes the mean specific humidity in the dynamic sub-layer (Brutsaert, 1975). Humidity will influence the growth of plants as well as directly affecting the airborne particle density (Kamens et al., 1988). When making atmospheric measurements it is important to have a spatial and temporal resolution that can adequately capture the spatial and temporal evolution of the variables of interest. In this dissertation, we used variograms to characterize the temporal scales on which atmospheric measurements of atmospheric pollen and particulates need to be made. The data used to calculate the variograms were captured every second over a period of many months. A spectrum of timescales was observed, with a dominant timescale of around 9 minutes for both airborne PM2.5 and PM10. This means that to adequately characterize the time variation of airborne particulates, measurements should be made at intervals of at most half this timescale, i.e. around every 5 minutes. However, timescales of less than a minute are also routinely observed. So the typical one-hour interval used by environmental agencies to report the concentrations of airborne particulates is really too long, and measurements at this frequency do not adequately resolve the temporal variability. Based on this variogram analysis, it is recommended that airborne particulate observations be made with a frequency of at least once per minute. This analysis is being utilized in the deployment of over one hundred sensors across the Dallas Fort Worth Metroplex over the next year.

6.2 Future Direction

Further work examining the mechanism for the observed time lag between the pollen abundance and the environmental state would be of interest. It was observed that many of the key variables for determining the pollen abundance had a lag of around either 30 days or 15 days (Figure 5.11). It would be interesting to consider a wider range of lags. The use of weather RADAR to estimate the atmospheric pollen abundance, and that of airborne particulates in general, would also be of interest.

APPENDIX

ENVIRONMENTAL VARIABLES USED IN POLLEN ESTIMATION

Table A.1: Variable names, abbreviations, and units.

Variable Description Units

EFLUX latent heat flux (positive upward) W · m−2

EVAP Surface evaporation kg · m−2 · s−1

HFLUX Sensible heat flux (positive upward) W · m−2

TAUX Eastward Surface wind stress N · m−2

TAUY Northward Surface wind stress N · m−2

TAUGWX Eastward gravity wave surface stress N · m−2

TAUGWY Northward gravity wave surface stress N · m−2

PBLH Planetary boundary layer height m

DISPH Displacement height m

BSTAR Surface buoyancy scale m · s−1

USTAR Surface velocity scale m · s−1

TSTAR Surface temperature scale K

QSTAR Surface humidity scale kg

RI Surface Richardson number non dimensional

ZOH Roughness length, sensible heat m

ZOM Roughness length, momentum m

HLML Height of center of lowest model layer m

TLML Temperature of lowest model layer K

QLML Specific humidity of lowest model layer kg


ULML Eastward wind of lowest model layer m · s−1

VLML Northward wind of lowest model layer m · s−1

RHOA Surface air density kg · m−3

SPEED 3-dimensional wind speed for surface fluxes m · s−1

CDH Surface exchange coefficient for heat kg · m−2 · s−1

CDQ Surface exchange coefficient for moisture kg · m−2 · s−1

CDM Surface exchange coefficient for momentum kg · m−2 · s−1

CN Surface neutral drag coefficient non dimensional

TSH Effective turbulence skin temperature K

QSH Effective turbulence skin humidity kg

FRSEAICE Fraction of sea-ice Fraction

PRECANV Surface precipitation flux from anvils kg · m−2 · s−1

PRECCON Surface precipitation flux from convection kg · m−2 · s−1

PRECLSC Surface precipitation flux from large-scale kg · m−2 · s−1

PRECSNO Surface snowfall flux kg · m−2 · s−1

PRECTOT Total surface precipitation flux kg · m−2 · s−1

PGENTOT Total generation of precipitation kg · m−2 · s−1

PREVTOT Total re-evaporation of precipitation kg · m−2 · s−1

GRN Vegetation greenness fraction Fraction

LAI Leaf area index m2

GWETROOT Root zone soil wetness fraction

GWETTOP Top soil layer wetness fraction

TPSNOW Top snow layer temperature K

TUNST Surface temperature of unsaturated zone K


TSAT Surface temperature of saturated zone K

TWLT Surface temperature of wilted zone K

PRECSNO Surface snowfall kg · m−2 · s−1

PRECTOT Total surface precipitation kg · m−2 · s−1

SNOMAS Snow mass kg · m−2

SNODP Snow depth m

EVPSOIL Bare soil evaporation W · m−2

EVPTRNS Transpiration W · m−2

EVPINTR Interception loss W · m−2

EVPSBLN Sublimation W · m−2

RUNOFF Overland runoff kg · m−2 · s−1

BASEFLOW Baseflow kg · m−2 · s−1

SMLAND Snowmelt kg · m−2 · s−1

FRUNST Fractional unsaturated area fraction

FRSAT Fractional saturated area fraction

FRSNO Fractional snow-covered area fraction

FRWLT Fractional wilting area fraction

PARDF Surface downward PAR diffuse flux W · m−2

PARDR Surface downward PAR beam flux W · m−2

SHLAND Sensible heat flux from land W · m−2

LHLAND Latent heat flux from land W · m−2

EVLAND Evaporation from land kg · m−2 · s−1

LWLAND Net downward longwave flux over land W · m−2

SWLAND Net downward shortwave flux over land W · m−2


GHLAND Downward heat flux at base of top soil layer W · m−2

TWLAND Total water store in land reservoirs kg · m−2

TELAND Energy store in all land reservoirs J · m−2

WCHANGE Total land water change per unit time kg · m−2 · s−1

ECHANGE Total land energy change per unit time W · m−2

SPLAND Spurious land energy source W · m−2

SPWATR Spurious land water source kg · m−2 · s−1

SPSNOW Spurious snow source kg · m−2 · s−1

PM2.5 Airborne Particulate µg · m−3

Soil Soil type non dimensional

Lithology Lithology non dimensional

Topography Topography m

PopulationDensity Population Density

Type Surface Type non dimensional

AlbedoWSABand1 Surface reflectivity at 470 nm non dimensional

AlbedoWSABand2 Surface reflectivity at 555 nm non dimensional

AlbedoWSABand3 Surface reflectivity at 670 nm non dimensional

AlbedoWSABand4 Surface reflectivity at 858 nm non dimensional

AlbedoWSABand5 Surface reflectivity at 1240 nm non dimensional

AlbedoWSABand6 Surface reflectivity at 1640 nm nondimensional

AlbedoWSABand7 Surface reflectivity at 2130 nm non dimensional

REFERENCES

Amit, Y. and D. Geman (1997). Shape quantization and recognition with randomized trees. Neural computation 9 (7), 1545–1588.

Ångström, A. (1962). Atmospheric turbidity, global illumination and planetary albedo of the earth. Tellus 14 (4), 435–450.

Atkinson, R. W., G. W. Fuller, H. R. Anderson, R. M. Harrison, and B. Armstrong (2010). Urban ambient particle metrics and health: a time-series analysis. Epidemiology, 501–511.

Bacsi, A., B. K. Choudhury, N. Dharajiya, S. Sur, and I. Boldogh (2006). Subpollen parti- cles: carriers of allergenic proteins and oxidases. Journal of allergy and clinical immunol- ogy 118 (4), 844–850.

Bank, W. (2016). The cost of air pollution: strengthening the economic case for action. Washington: World Bank Group.

Baskin, J. M. and C. C. Baskin (1980). Ecophysiology of secondary dormancy in seeds of ambrosia artemisiifolia. Ecology 61 (3), 475–480.

Bosilovich, M., S. Schubert, G. Kim, R. Gelaro, M. Rienecker, M. Suarez, and R. Todling (2006). Nasa’s modern era retrospective-analysis for research and applications (merra). In AGU Spring Meeting Abstracts.

Breiman, L. (2001). Random forests. Machine learning 45 (1), 5–32.

Brimblecombe, P. and C. Bowler (1992). The history of air pollution in york, . Journal of the Air & Waste Management Association 42 (12), 1562–1566.

Brutsaert, W. (1975). The roughness length for water vapor sensible heat, and other scalars. Journal of the Atmospheric Sciences 32 (10), 2028–2031.

Burgan, R. E. and R. A. Hartford (1993). Monitoring vegetation greenness with satellite data.

Caudill, M. (1987). Neural networks primer, part i. AI expert 2 (12), 46–52.

Chapman, D. S., T. Haynes, S. Beal, F. Essl, and J. M. Bullock (2014). Phenology predicts the native and invasive range limits of common ragweed. Global Change Biology 20 (1), 192–202.

Chen, J.-C. and J. Schwartz (2008). Metabolic syndrome and inflammatory responses to long-term particulate air pollutants. Environmental health perspectives 116 (5), 612–617.

Cheng, Y., S. Lee, Z. Gu, K. Ho, Y. Zhang, Y. Huang, J. C. Chow, J. G. Watson, J. Cao, and R. Zhang (2015). Pm2.5 and pm10-2.5 chemical composition and source apportionment near a hong kong roadway. Particuology 18, 96–104.

Chýlek, P. and J. A. Coakley (1974). Aerosols and climate. Science 183 (4120), 75–77.

Deng, H., G. Runger, and E. Tuv (2011). Bias of importance measures for multi-valued attributes and solutions. In International Conference on Artificial Neural Networks, pp. 293–300. Springer.

Dietterich, T. G. (2000). An experimental comparison of three methods for constructing en- sembles of decision trees: Bagging, boosting, and randomization. Machine learning 40 (2), 139–157.

Domingos, P. (2015). The master algorithm: How the quest for the ultimate learning machine will remake our world. Basic Books.

Dong, D. and T. J. McAvoy (1996). Nonlinear principal component analysis based on principal curves and neural networks. Computers & Chemical Engineering 20 (1), 65–78.

Ducret-Stich, Regina E, T. M.-Y. T. D. K. N. H. P. K. P. H. C. (2013, Sep). Pm10 source apportionment in a swiss alpine valley impacted by highway traffic. Environmental Science and Pollution Research 20 (9), 6496–6508.

Eaton, D. K., L. Kann, S. Kinchen, S. Shanklin, K. H. Flint, J. Hawkins, W. A. Harris, R. Lowry, T. McManus, D. Chyen, et al. (2012). Youth risk behavior surveillance-united states, 2011. Morbidity and mortality weekly report. Surveillance summaries (Washington, DC: 2002) 61 (4), 1–162.

Efron, B. (1992). Bootstrap methods: another look at the jackknife. In Breakthroughs in Statistics, pp. 569–593. Springer.

Florística, T. Y. (2010). A new ambrosia (asteraceae) from the baja california peninsula, mexico. Boletín de la Sociedad Botánica de México (86), 65–70.

Fumanal, B., B. Chauvel, and F. Bretagnolle (2007). Estimation of pollen and seed production of common ragweed in france. Annals of Agricultural and Environmental Medicine 14 (2).

Fumanal, B., B. Chauvel, A. Sabatier, and F. Bretagnolle (2007). Variability and cryptic heteromorphism of ambrosia artemisiifolia seeds: what consequences for its invasion in france? Annals of Botany 100 (2), 305–313.

Hall, E. S., S. M. Kaushik, R. W. Vanderpool, R. M. Duvall, M. R. Beaver, R. W. Long, and P. A. Solomon (2014). Integrating sensor monitoring technology into the current air pollution regulatory support paradigm: Practical considerations. Am. J. Environ. Eng 4 (6), 147–154.

Hans, C. (2009). Bayesian lasso regression. Biometrika 96 (4), 835–845.

Hansen, J., M. Sato, and R. Ruedy (1997). Radiative forcing and climate response. Journal of Geophysical Research: Atmospheres 102 (D6), 6831–6864.

Hester, R., R. Harrison, and M. Lippmann (1998). The 1997 us epa standards for particulate matter and ozone.

Ho, T. K. (1998). The random subspace method for constructing decision forests. IEEE transactions on pattern analysis and machine intelligence 20 (8), 832–844.

Howard, L. E. and E. Levetin (2014). Ambrosia pollen in tulsa, oklahoma: aerobiology, trends, and forecasting model development. Annals of Allergy, Asthma & Immunol- ogy 113 (6), 641–646.

Huang, R.-J., Y. Zhang, C. Bozzetti, K.-F. Ho, J.-J. Cao, Y. Han, K. R. Daellenbach, J. G. Slowik, S. M. Platt, F. Canonaco, et al. (2014). High secondary aerosol contribution to particulate pollution during haze events in china. Nature 514 (7521), 218.

Jelks, M. (1986). Allergy plants. World-Wide Publications.

Johnson, R. W. (2001). An introduction to the bootstrap. Teaching Statistics 23 (2), 49–54.

Kamens, R. M., Z. Guo, J. N. Fulcher, and D. A. Bell (1988). Influence of humidity, sunlight, and temperature on the daytime decay of polyaromatic hydrocarbons on atmospheric soot particles. Environmental Science and Technology 22 (1), 103–108.

Karrer, G. (2016). Implications of life history for control and eradication. Julius-K¨uhn- Archiv (455), 58.

Kasprzyk, I. (2008). Non-native ambrosia pollen in the atmosphere of rzesz´ow(se poland); evaluation of the effect of weather conditions on daily concentrations and starting dates of the pollen season. International journal of biometeorology 52 (5), 341.

Kernel, H. (2015). Odroid-xu4.

Kreyling, W. G., S. Hirn, and C. Schleh (2010). Nanoparticles in the lung. Nature biotech- nology 28 (12), 1275.

Kulkarni, P., P. A. Baron, and K. Willeke (2011). Aerosol measurement: principles, tech- niques, and applications. John Wiley & Sons.

Kursa, M. B. (2014). Robustness of random forest-based gene selection methods. BMC bioinformatics 15 (1), 8.

Lark, R., B. Cullis, and S. Welham (2006). On spatial prediction of soil properties in the presence of a spatial trend: the empirical best linear unbiased predictor (e-blup) with reml. European Journal of Soil Science 57 (6), 787–799.

Leskovšek, R., A. Datta, S. Z. Knezevic, and A. Simončič (2012). Common ragweed (ambrosia artemisiifolia) dry matter allocation and partitioning under different nitrogen and density levels. Weed Biology and Management 12 (2), 98–108.

Liu, X., D. Wu, G. K. Zewdie, L. Wijeratne, C. I. Timms, A. Riley, E. Levetin, and D. J. Lary (2017, 03). Using machine learning to estimate atmospheric ambrosia pollen concentrations in tulsa, ok. Environmental Health Insights 11, 1178630217699399.

Makra, L. and I. Matyasovszky (2011). Assessment of the daily ragweed pollen concentration with previous-day meteorological variables using regression and quantile regression analysis for szeged, hungary. Aerobiologia 27 (3), 247–259.

McCulloch, W. S. and W. Pitts (1943). A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics 5 (4), 115–133.

Mohapatra, S. S., R. F. Lockey, and F. Polo (2004). Weed pollen allergens. Allergens and allergen immunotherapy 4.

Nelson, H. S. (2000). The importance of allergens in the development of asthma and the persistence of symptoms. Journal of allergy and clinical immunology 105 (6), S628–S632.

Oliver, M. A. and R. Webster (2015). Basic steps in geostatistics: the variogram and kriging, Volume 106. Springer.

Osborne, M. R., B. Presnell, and B. A. Turlach (2000). On the lasso and its dual. Journal of Computational and Graphical statistics 9 (2), 319–337.

Oswalt, M. L. and G. D. Marshall (2008). Ragweed as an example of worldwide allergen expansion. Allergy, Asthma & Clinical Immunology 4 (3), 130.

Peng, R. D., H. H. Chang, M. L. Bell, A. McDermott, S. L. Zeger, J. M. Samet, and F. Dominici (2008). Coarse particulate matter air pollution and hospital admissions for cardiovascular and respiratory diseases among medicare patients. Jama 299 (18), 2172– 2179.

Pilotte, P. Analytics-driven embedded systems, part 2 - developing analytics and prescriptive controls. http://www.embedded-computing.com/embedded-computing-design/analytics-driven-embedded-systems-part-2-developing-analytics-and-prescriptive-controls. Accessed: 2016-03-14.

Pope, F. D., M. Gatari, D. Ng'ang'a, A. Poynter, and R. Blake (2018). Airborne particulate matter monitoring in kenya using calibrated low-cost sensors. Atmospheric Chemistry and Physics 18 (20), 15403–15418.

Rasmussen, C. E. (2003). Gaussian processes in machine learning. In Summer School on Machine Learning, pp. 63–71. Springer.

Rienecker, M. M., M. J. Suarez, R. Gelaro, R. Todling, J. Bacmeister, E. Liu, M. G. Bosilovich, S. D. Schubert, L. Takacs, G.-K. Kim, et al. (2011). Merra: Nasas modern-era retrospective analysis for research and applications. Journal of climate 24 (14), 3624–3648.

R¨uckerl, R., A. Schneider, S. Breitner, J. Cyrys, and A. Peters (2011). Health effects of par- ticulate air pollution: a review of epidemiological evidence. Inhalation toxicology 23 (10), 555–592.

Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of research and development 3 (3), 210–229.

Sölter, U., U. Starfinger, and A. Verschwele (2012). Halt ambrosia - complex research on the invasive alien plant ragweed (ambrosia artemisiifolia l.) in europe. Julius-Kühn-Archiv (434), 627.

Stark, P. C., L. M. Ryan, J. L. McDonald, and H. A. Burge (1997). Using meteorologic data to predict daily ragweed pollen levels. Aerobiologia 13 (3), 177–184.

Stier, P., J. H. Seinfeld, S. Kinne, and O. Boucher (2007). Aerosol absorption and radiative forcing. Atmospheric Chemistry and Physics 7 (19), 5237–5261.

Strother, J. L. (1753). Flora of North America. Website. http://www.efloras.org/florataxon.aspx?flora_id=1&taxon_id=101325.

Suter, G. W. (2008). Ecological risk assessment in the United States Environmental Protection Agency: A historical overview. Integrated Environmental Assessment and Management 4 (3), 285–289.

Taramarcaz, P., C. Lambelet, B. Clot, C. Keimer, and C. Hauser (2005). Ragweed (Ambrosia) progression and its health risks: will Switzerland resist this invasion? Swiss Medical Weekly 135 (37/38), 538.

Thompson, J. L. and J. E. Thompson (2003). The urban jungle and allergy. Immunology and allergy clinics of North America 23 (3), 371–387.

Tibshirani, R. (2011). Regression shrinkage and selection via the lasso: a retrospective. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73 (3), 273–282.

Wahn, U. (2000). What drives the allergic march? Allergy 55 (7), 591–599.

Weber, R. W., N. Adkinson Jr, B. Bochner, A. Burks, W. Busse, S. Holgate, R. Lemanske Jr, and R. O’Hehir (2013). Aerobiology of outdoor allergens. Middleton’s Allergy: Principles and Practice 69, 430.

Webster, R. (2000). Is soil variation random? Geoderma 97 (3-4), 149–163.

WHO. 7 million premature deaths annually linked to air pollution. http://www.who.int/mediacentre/news/releases/2014/air-pollution/en/. Accessed: 2016-08-29.

Wilson, R., J. D. Spengler, et al. (1996). Particles in our air: concentrations and health effects. Harvard School of Public Health, Cambridge, MA.

Yin, J., A. Allen, R. Harrison, S. Jennings, E. Wright, M. Fitzpatrick, T. Healy, E. Barry, D. Ceburnis, and D. McCusker (2005). Major component composition of urban PM10 and PM2.5 in Ireland. Atmospheric Research 78 (3-4), 149–165.

Zewdie, G. K., X. Liu, D. Wu, D. J. Lary, and E. Levetin (2019). Applying machine learning to forecast daily Ambrosia pollen using environmental and NEXRAD parameters. Environmental Monitoring and Assessment 191 (2), 261.

Zhang, D., J. Liu, and B. Li (2014). Tackling air pollution in China: what do we learn from the great smog of 1950s in London. Sustainability 6 (8), 5322–5338.

Zink, K., H. Vogel, B. Vogel, D. Magyar, and C. Kottmeier (2012). Modeling the dispersion of Ambrosia artemisiifolia L. pollen with the model system COSMO-ART. International Journal of Biometeorology 56 (4), 669–680.

Ziska, L., K. Knowlton, C. Rogers, D. Dalan, N. Tierney, M. A. Elder, W. Filley, J. Shropshire, L. B. Ford, C. Hedberg, et al. (2011). Recent warming by latitude associated with increased length of ragweed pollen season in central North America. Proceedings of the National Academy of Sciences 108 (10), 4248–4251.

BIOGRAPHICAL SKETCH

Xun Liu is currently in his 7th year of PhD study in physics at The University of Texas at Dallas. He was born in Jiangsu, China. After receiving his bachelor’s degree in physics from Nanjing University in 2013, he joined The University of Texas at Dallas as a PhD student. After spending one year in Prof. Jason Slinker’s lab working on bio-nano circuit chips, he joined Dr. David Lary’s research group in June 2015 and began his study of atmospheric particulate matter under Dr. Lary’s supervision.

From 2015 to 2018, Xun completed a project studying the relationships between the airborne abundance of pollen and the environmental state. Xun also used variograms to characterize the temporal scale required to accurately resolve the time evolution of atmospheric particulates. His research on variogram fitting was greatly assisted by Lakitha Wijeratne, who set up the data collection sensors using IoT technologies. Xun collaborated with Gebreab K. Zewdie and Daji Wu and is a co-author on their publications. Xun presented his Phase I research results at the joint meeting of the Texas Sections of the APS, AAPT, and SPS in March 2016. He completed a software engineering internship at Google LLC during the summer of 2018, where he improved his knowledge of machine learning.

CURRICULUM VITAE

Xun Liu October 31, 2019

Contact Information:
Department of Computer Science
The University of Texas at Dallas
800 W. Campbell Rd.
Richardson, TX 75080-3021, U.S.A.
Voice: (972) 883-4724
Fax: (972) 883-2349
Email: [email protected]

Educational History:
B.S., Physics, Nanjing University, 2013
Ph.D., Physics, The University of Texas at Dallas, 2019
  Physical Studies of Airborne Pollen and Particulates Utilizing Machine Learning
  Ph.D. Dissertation
  Physics Department, The University of Texas at Dallas
  Advisor: Dr. David Lary

Publications and Awards:
Liu, X., Wu, D., Zewdie, G. K., Wijeratne, L., Timms, C. I., Riley, A., ... & Lary, D. J. (2017). Using machine learning to estimate atmospheric Ambrosia pollen concentrations in Tulsa, OK. Environmental Health Insights, 11.
Wu, D., Zewdie, G. K., Liu, X., Kneen, M. A., & Lary, D. J. (2017). Insights Into the Morphology of the East Asia PM2.5 Annual Cycle Provided by Machine Learning. Environmental Health Insights, 11.
Zewdie, G. K., Liu, X., Wu, D. Estimating the daily pollen concentration in the atmosphere using machine learning and the NEXRAD weather radar data. Environmental Monitoring and Assessment, accepted.

Employment History:
Software Engineer Intern, Google LLC, June 2018 – September 2018