Brook Milligan NMSU.Pdf
Total Page:16
File Type:pdf, Size:1020Kb
Identifying Samples and their Sources: Case Studies and Lessons Learned Brook Milligan Conservation Genomics Laboratory Department of Biology New Mexico State University Las Cruces, New Mexico 88003 USA [email protected] Development and Scaling of Innovative Technologies for Wood Identification February 28, 2017 © 2017 Brook Milligan, NMSU Identifying Samples and their Sources February 28, 2017 1 / 29 The questions we face What is its taxonomic identity? 5 Sample / ? ) Where did it come from? Case studies I Taxonomic identification via direct comparison with a database I Taxonomic identification via inference I Geographic origin identification via inference Lessons learned I Direct comparison is of limited usefulness I Inference is essential for taxonomic and geographic origin identification I These lessons apply to all identification methods, not just DNA © 2017 Brook Milligan, NMSU Identifying Samples and their Sources February 28, 2017 2 / 29 Traditional genetics: a cottage industy Oak Heaps of Individually sample / taxon-specific / selected / lab work markers © 2017 Brook Milligan, NMSU Identifying Samples and their Sources February 28, 2017 3 / 29 Traditional genetics: an inefficient cottage industry Oak Heaps of Individually sample / taxon-specific / selected / lab work markers Rosewood Heaps of Individually sample / taxon-specific / selected / lab work markers Maple Heaps of Individually sample / taxon-specific / selected / lab work markers © 2017 Brook Milligan, NMSU Identifying Samples and their Sources February 28, 2017 4 / 29 Traditional genetics: a simplified world view Oak Heaps of Individually sample / taxon-specific / selected / lab work markers Rosewood Heaps of Individually sample / taxon-specific / selected / lab work markers Maple Heaps of Individually sample / taxon-specific / selected / lab work markers © 2017 Brook Milligan, NMSU Identifying Samples and their Sources February 28, 2017 4 / 29 Genomics: industrialization and economies of scale Oak sample $ Common Rosewood bioinformatics sample / / pipeline : Maple sample © 2017 Brook Milligan, NMSU Identifying Samples and their Sources February 28, 2017 5 / 29 Dividing a sequence into k-mers 50- ...gacaccatcgaatggcgcaaaacctttcgc... -30 30- ...ctgtggtagcttaccgcgttttggaaagcg... -50 © 2017 Brook Milligan, NMSU Identifying Samples and their Sources February 28, 2017 6 / 29 Dividing a sequence into k-mers 50- ...gacaccatcgaatggcgcaaaacctttcgc... -30 . ccatcgaatg catcgaatgg atcgaatggc tcgaatggcg cgaatggcgc gaatggcgca aatggcgcaa atggcgcaaa tggcgcaaaa ggcgcaaaac gcgcaaaacc cgcaaaacct . © 2017 Brook Milligan, NMSU Identifying Samples and their Sources February 28, 2017 7 / 29 Distribution from common to rare kmers Kmer distribution 75 50 Count (1000s) 25 0 Most common kmers © 2017 Brook Milligan, NMSU Identifying Samples and their Sources February 28, 2017 8 / 29 1. Identification by directly matching reference samples Unknown sample $ Reference Reference Direct Identified / / / samples database comparison match © 2017 Brook Milligan, NMSU Identifying Samples and their Sources February 28, 2017 9 / 29 Testing genomic identification with a diversity of plants Taxon Genome size Accessions 1 Cycas revoluta 13399 MB 1 2 Lactuca sativa 2592 MB 1 3 Magnolia yunnanensis 880-5844 MB (genus) 1 4 Oryza sativa 489 MB 3 5 Picea abies 19570 MB 2 6 Pinus taeda 21614 MB 3 7 Populus alba 509 MB 1 8 Populus balsamifera 440-528 MB (genus) 1 9 Populus tremula 440 MB 1 10 Populus trichocarpa 484 MB 18 11 Prunus armeniaca 293 MB 1 12 Prunus davidiana 303 MB 1 13 Prunus dulcis 323 MB 1 14 Prunus ferganensis 269-3570 MB (genus) 1 15 Prunus kansuensis 293 MB 1 16 Prunus mume 269-3570 MB (genus) 2 17 Prunus persica 269 MB 2 18 Prunus serotina 489 MB 1 19 Quercus mongolica 489-978 MB (genus) 1 20 Solanum lycopersicum 1002 MB 3 © 2017 Brook Milligan, NMSU Identifying Samples and their Sources February 28, 2017 10 / 29 Match score Populus trichocarpa: SRR1762761 25 Genus 20 Populus Salix 15 z Manihot Ricinus 10 Hevea Other 5 1 2 3 4 5 Reads (millions) © 2017 Brook Milligan, NMSU Identifying Samples and their Sources February 28, 2017 11 / 29 1. Identification by directly matching reference samples Highly reliable Consistently good performance across all comparisons irrespective of taxonomic distance Very discriminatory Closely related taxa can be distinguished Efficient Relatively little data is required Requirements A complete reference database for positive identification © 2017 Brook Milligan, NMSU Identifying Samples and their Sources February 28, 2017 14 / 29 2. Identification of taxa using formal inference Unknown sample # Reference Inference Inferred Inferred / / / samples model relationships identification © 2017 Brook Milligan, NMSU Identifying Samples and their Sources February 28, 2017 15 / 29 Phylogenetic relationships of Malagasy Dalbergia Hassold et al. (2016), Figure 2 © 2017 Brook Milligan, NMSU Identifying Samples and their Sources February 28, 2017 16 / 29 Phylogenetic relationships of Malagasy Dalbergia Sophora tetraptera Dalbergia mollis SG4 Dalbergia mollis SG4 Characters: 185,032 Dalbergia humbertii Informative: 185,018 Dalbergia urschii SG3 Parsimony score: 226,052 Dalbergia purpurascens SG3 Consistency index: 81.8% Dalbergia purpurascens SG3 Dalbergia greveana SG4 Dalbergia greveana SG4 Dalbergia bathiei Dalbergia louisii Dalbergia pseudobaronii SG4 Dalbergia pseudobaronii SG4 Dalbergia orientalis SG4 Dalbergia orientalis SG4 Dalbergia baronii SG4 Dalbergia baronii SG4 Dalbergia madagascariensis subsp. antongilensis SG4 Dalbergia chapelieri SG1 Dalbergia chapelieri SG1 Dalbergia normandii SG2 Dalbergia maritima var. pubescens SG2 Dalbergia maritima var. pubescens SG2 Dalbergia occulta SG2 © 2017 Brook Milligan, NMSU Identifying Samples and their Sources February 28, 2017 17 / 29 2. Identification of taxa using formal inference Phylogenetic relationships of Malagasy Dalbergia: Species-specific groups are recovered Chloroplast species groups 1 and 2 (SG1 and SG2) are identified Geographic regional groupings are identified Inconsistencies between genomic and chloroplast trees in less well-supported regions General taxonomic identification: Genomic data contain phylogenetically informative information Inference models are needed when a complete database does not exist, i.e., always © 2017 Brook Milligan, NMSU Identifying Samples and their Sources February 28, 2017 18 / 29 3. Identification of geographic origin using formal inference Unknown Inferred sample origin : # Inference model ; $ Reference Uncertainty samples measures © 2017 Brook Milligan, NMSU Identifying Samples and their Sources February 28, 2017 19 / 29 Identifying geographic origins: Pinus ponderosa Third in production volume (4.5 million m2); second in value (WWPA 2001) © 2017 Brook Milligan, NMSU Identifying Samples and their Sources February 28, 2017 20 / 29 Identifying geographic origins using models of differentiation: Pinus ponderosa © 2017 Brook Milligan, NMSU Identifying Samples and their Sources February 28, 2017 21 / 29 3. Identification of geographic origin using formal inference Very discriminatory Inferred origin can be reduced to a small region Useful Quantitative measures of uncertainty are available Efficient Relatively little data is required Flexible Multiple types of data, i.e., not just genetic, may be integrated to improve accuracy © 2017 Brook Milligan, NMSU Identifying Samples and their Sources February 28, 2017 22 / 29 Lessons learned: differences among identification strategies Taxonomic identification: Direct matching in a reference database Modeling evolutionary processes Geographic origin: Direct matching in a reference database Modeling spatial differentiation processes Regardless of effort and expense, reference databases will always be incomplete relative to the questions that need answering. Therefore, gaps must be filled in with inference, which also yields the benefit of learning about uncertainty. © 2017 Brook Milligan, NMSU Identifying Samples and their Sources February 28, 2017 23 / 29 The real world is messy! Clearly identifiable categories Continuous gradation To be useful, tools we develop must match the real world, not what we consider to be conceptually convenient. The importance of inferential analysis transcends genomics and applies to all methods of identification. © 2017 Brook Milligan, NMSU Identifying Samples and their Sources February 28, 2017 24 / 29 Validity and reliability must be assured President's Council of Advisors on Science and Technology (2016): There is a need to evaluate specific forensics methods to determine their validity. Of six types of identification methods examined, only one was deemed valid and applied appropriately. Quantitative, inferential analysis is an appropriate methodology. All of the identification problems under consideration for wood would fall into the “difficult to validate" cases identified by PCAST (2016). Therefore, inferential analysis must play a central role, not just for genomics analysis, but for every method. © 2017 Brook Milligan, NMSU Identifying Samples and their Sources February 28, 2017 25 / 29 Identification: a general strategy of integrating information Reference Sample / samples identification < : $ Unknown Inference sample model ; # " Reference Uncertainty / samples measures © 2017 Brook Milligan, NMSU Identifying Samples and their Sources February 28, 2017 26 / 29 Thanks to . Meaghan Parker-Forney and the World Resources Institute David Erickson, DNA4 Technologies Alex Widmer and Simon Crameri, ETH Z¨urich Valerie Hipkins, USFS National Forest Genetics Laboratory © 2017 Brook Milligan, NMSU Identifying Samples and their Sources February 28, 2017 27 / 29.