Species Classification Using DNA Barcoding and Profile Hidden Markov Models
San Jose State University
SJSU ScholarWorks
Master's Projects, Master's Theses and Graduate Research
Spring 5-9-2019

Species Classification using DNA Barcoding and Profile Hidden Markov Models
Sphoorti Poojary, San Jose State University

Follow this and additional works at: https://scholarworks.sjsu.edu/etd_projects
Part of the Artificial Intelligence and Robotics Commons, and the Other Computer Sciences Commons

Recommended Citation
Poojary, Sphoorti, "Species Classification using DNA Barcoding and Profile Hidden Markov Models" (2019). Master's Projects. 668.
DOI: https://doi.org/10.31979/etd.cev7-sh4v
https://scholarworks.sjsu.edu/etd_projects/668

This Master's Project is brought to you for free and open access by the Master's Theses and Graduate Research at SJSU ScholarWorks. It has been accepted for inclusion in Master's Projects by an authorized administrator of SJSU ScholarWorks. For more information, please contact [email protected].

Species Classification using DNA barcoding and Profile Hidden Markov Models
A Project Report presented to The Department of Computer Science, San Jose State University
In Partial Fulfillment of the Requirements for the Degree Master of Science
By Sphoorti Poojary
May 2019

The Designated Project Committee Approves the Project Titled
Species Classification using DNA Barcoding and Profile Hidden Markov Models
By Sphoorti Poojary
APPROVED FOR THE DEPARTMENT OF COMPUTER SCIENCE
SAN JOSE STATE UNIVERSITY
May 2019
Dr. Philip Heller, Department of Computer Science
Dr. Katerina Potika, Department of Computer Science
Dr. Thomas Austin, Department of Computer Science

ABSTRACT
Traditional classification systems for living organisms, such as Linnaean taxonomy, classify species by their morphological features. These traditional systems are being replaced by molecular approaches that use gene sequences.
The COI gene, also known as the "DNA barcode" since it is unique in every species, can be used to uniquely identify organisms and thus classify them. Classification by gene sequence has many advantages, including correct identification of cryptic species (individuals that appear similar but belong to different species) and of species that are extremely small. In this project, I worked on classifying COI sequences of unknown species to a genus, using Profile Hidden Markov Models.
(Taxonomy Ranks: Kingdom → Phylum → Class → Order → Family → Genus → Species)
Index terms – COI gene, Classification, Profile Hidden Markov Models

ACKNOWLEDGMENTS
I want to thank my advisor, Prof. Heller, and committee members Prof. Potika and Prof. Austin for their support and guidance.

TABLE OF CONTENTS
1. Background
1.1. Traditional morphological based classification
1.2. Drawbacks of traditional methods
1.3. Classification using DNA barcoding
1.3.1. COI gene
1.3.2. How DNA barcoding solves problems of the traditional method
1.4. Barcode of Life Data Systems (BOLD)
1.4.1. Introduction
1.4.2. Databases
1.4.3. Taxonomy Browser
1.5. Profile Hidden Markov models (PHMM)
1.5.1. Overview
1.5.2. Pairwise Alignment
1.5.3. Multiple Sequence Alignment (MSA)
1.5.4. PHMM from MSA
1.5.5. Scoring
2. Methods
2.1. Data Gathering
2.2. Data Processing
2.2.1. Convert XML to FASTA
2.2.2. Tree Building
2.2.3. Generate genus-specific fasta
2.3. Multiple Sequence Alignment using Clustal Omega
2.4. Profile Hidden Markov Models (PHMM)
2.4.1. Steps for training and testing PHMMs
3. Results
4. Discussion

TABLE OF FIGURES
Figure 1. Linnaean Classification System
Figure 2. Example of Linnaean Classification System
Figure 3. BOLD Public Data Portal
Figure 4. Example of search in BOLD - Taxonomy Browser
Figure 5. Multiple Sequence Alignment
Figure 6. The simple view of the model
Figure 7. Profile Hidden Markov Models
Figure 8. Pairwise Sequence Alignment
Figure 9. Multiple Sequence Alignment
Figure 10. Viterbi Algorithm
Figure 11. Example of a record in fasta file
Figure 12. Example of Alignment using Clustal
Figure 13. Deletion of one species from tree
Figure 14. Training and test samples
Figure 15. Tree Building result for Phylum Mollusca
Figure 16. Scores for species of genus Alburnoides
Figure 17. Scores for species of genus Cyprinella
Figure 18. Scores for species of genus Notropis
Figure 19. Scores for species of genus Squalius
Figure 20. Scores for species of genus Pseudophoxinus
Figure 21. Scores for species of genus Trimma
Figure 22. Scores for species of genus Bathygobius
Figure 23. Scores for species of genus Eviota
Figure 24. Scores for species of genus Cryptocentrus
Figure 25. Scores for species of genus Echinolittorina
Figure 26. Scores for species of genus Littoraria
Figure 27. Scores for species of genus Littorina
Figure 28. Scores for species of genus Fluminicola