TOOLS FOR BIODIVERSITY ANALYSES USING NATURAL HISTORY COLLECTIONS AND REPOSITORIES: DATA MINING, MACHINE LEARNING AND PHYLODIVERSITY

By

CHANDRA EARL

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2020

© 2020 Chandra Earl

ACKNOWLEDGMENTS

I thank my co-chairs and members of my supervisory committee for their mentoring and generous support, my collaborators and colleagues for their input and support, and my parents and siblings for their loving encouragement and interest.

TABLE OF CONTENTS

page

ACKNOWLEDGMENTS ..... 3
LIST OF TABLES ..... 5
LIST OF FIGURES ..... 6
ABSTRACT ..... 8

CHAPTER

1 INTRODUCTION ..... 9

2 GENEDUMPER: A TOOL TO BUILD MEGAPHYLOGENIES FROM GENBANK DATA ..... 12
  Materials and Methods ..... 14
  Implementation ..... 19
  Case Studies ..... 19
  Discussion ..... 23

3 MACHINE LEARNING DISTINGUISHES BETWEEN SPECIES AND DISCOVERS PATTERNS OF BIODIVERSITY IN BUMBLEBEES ..... 33
  Materials and Methods ..... 35
  Results ..... 40
  Discussion ..... 41

4 PHYLOGENETIC ANALYSIS OF NORTH AMERICAN BUTTERFLIES ..... 47
  Materials and Methods ..... 50
  Results ..... 55
  Discussion ..... 61

5 PHYLODIVERSITY OF NORTH AMERICAN BUTTERFLIES ..... 72
  Materials and Methods ..... 76
  Results ..... 84
  Discussion ..... 89

6 CONCLUSIONS ..... 107

LIST OF REFERENCES ..... 108
BIOGRAPHICAL SKETCH ..... 128

LIST OF TABLES

Table page

2-1 The overall results from each of the three GeneDumper runs. ..... 32
3-1 The three datasets utilized showing the ML implementation, build type used, and number of images and classes. ..... 46
4-1 The length of each locus used as the seed sequence for GeneDumper input. ..... 69
4-2 The total number of sequences after quality filtering and duplicate removal and alignment length in total number of base pairs for each locus. ..... 70
4-3 The distribution of the number of species with sequence information and the percentage of species with distributions in Mexico, but not in the United States or Canada. ..... 71
4-4 Ages of all families in this study compared to other studies and their 95% confidence intervals. ..... 71
5-1 Summary of commonly used phylodiversity metrics. ..... 104
5-2 Summary of the top and null linear models for PD, RPD, and PE. ..... 105
5-3 Summary of the top and null binomial regression models for randomizations of PD, RPD, and PE. ..... 106

LIST OF FIGURES

Figure page

2-1 A flowchart depicting the two steps of the GeneDumper pipeline (GeneDump and GeneClean) along with their inputs and outputs. ..... 26
2-2 A flowchart depicting GeneDump and its two major steps: the initial BLAST and species name resolution. ..... 27
2-3 A flowchart depicting the GeneClean decision-making process to clean and validate sequences. ..... 28
2-4 GeneDumper butterfly phylogeny colored by family. ..... 29
2-5 GeneDumper Nitrogen-Fixing clade phylogeny colored by order. ..... 30
2-6 GeneDumper frog and toad phylogeny colored by superfamily. ..... 31
3-1 Examples of how bumblebee images were processed before training. ..... 44
3-2 Architectures of the resulting multi-layer models. ..... 45
3-3 Confusion matrices for the two smaller neural networks. ..... 46
4-1 A plot showing the growth of sequences across the 14 loci of interest for North American species over time. ..... 67
4-2 A plot showing the growth of species across the 14 loci of interest for species found in both Mexico and America and Canada over time. ..... 67
4-3 A time-calibrated phylogeny of North American butterflies with bootstrap support shown for the deeper nodes of the tree. ..... 68
5-1 Maps depicting the observed phylogenetic diversity and relative phylogenetic diversity values for North American butterflies. ..... 99
5-2 Observed phylogenetic endemism values for North American butterflies. ..... 100
5-3 Maps depicting significant phylogenetic diversity patterns and significant relative phylogenetic diversity patterns for butterflies. ..... 100
5-4 CANAPE results showing randomization of phylogenetic endemism. ..... 101
5-5 Significant relative phylogenetic diversity patterns. ..... 101
5-6 Significant phylogenetic endemism patterns. ..... 102
5-7 Phylogenetic beta diversity values for North American butterflies. ..... 102
5-8 Regionalization results and comparisons for North American butterflies. ..... 103
5-9 Subregion classification of the Eastern US bioregions. ..... 104
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

TOOLS FOR BIODIVERSITY ANALYSES USING NATURAL HISTORY COLLECTIONS AND REPOSITORIES: DATA MINING, MACHINE LEARNING AND PHYLODIVERSITY

By

Chandra Earl

August 2020

Cochair: Robert P. Guralnick
Cochair: Akito Y. Kawahara
Major: Genetics and Genomics

Natural history collections house massive amounts of data for potential use in biodiversity studies. With such large amounts of specimen, genetic and image data available, computational tools specific to these data are becoming more commonplace. This dissertation explores some of the biodiversity tools and pipelines that use large amounts of natural history data. Specifically, I investigate, develop and use pipelines for building megaphylogenies, machine learning models for species classification and data mining techniques for spatial phylodiversity analysis. These pipelines provide the means to begin testing hypotheses about biodiversity and its organization at scales and extents that have not been achievable before, and in doing so showcase novel findings that cannot be achieved without such approaches. Large-scale informatics tools such as these place natural history museums at the forefront of biodiversity research and on the cusp of big data science.

CHAPTER 1
INTRODUCTION

Enormous growth in both data resources and analytical tools is facilitating a new understanding of biodiversity patterns and their drivers at multiple spatial, temporal and evolutionary scales. For hundreds of years, biologists have mapped species occurrences using inventories and specimen collecting, ultimately building the world’s natural