ABSTRACT

DBHGT: THE FUNGAL HORIZONTAL GENE TRANSFER DATABASE

Marcus Nguyen, M.S.
Department of Computer Science
Northern Illinois University, 2017
Dr. Yanbin Yin, Director

Horizontal gene transfer (HGT) is a growing area of research in bioinformatics. It has been studied heavily in bacteria, both because abundant genome data are available and because of its links to pathogenicity and antimicrobial resistance. As more fungal genomes are sequenced, research on HGT in fungi has also begun to grow, and HGT in fungi has likewise been linked to pathogenicity. Many tools exist to predict HGT candidates, but many of these target bacteria. Given the growing research in fungi, tools that have been tested on fungi are in high demand. Precomputed, publicly available results from these algorithms can save researchers both time and money in the long run, so a database of such results is of great use to many. Additionally, a large dataset makes it possible to examine HGT in fungi from a macro perspective.

NORTHERN ILLINOIS UNIVERSITY
DE KALB, ILLINOIS
DECEMBER 2017

DBHGT: THE FUNGAL HORIZONTAL GENE TRANSFER DATABASE

BY
MARCUS NGUYEN
© 2017 Marcus Nguyen

A THESIS SUBMITTED TO THE GRADUATE SCHOOL
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE
MASTER OF SCIENCE

DEPARTMENT OF COMPUTER SCIENCE

Thesis Director: Dr. Yanbin Yin

ACKNOWLEDGEMENTS

First, I thank my advisor and mentor, Dr. Yanbin Yin, for his guidance throughout my master's program. I am also grateful to Dr. Jie Zhou and Dr. Wesley Swingley, my committee members, for their continued support and valuable input; I would not be able to graduate without you. I would also like to extend my appreciation to Dr. Mike Papka, Dr. Wesley Swingley (again), and the NIU Computer Science Department for the use of their servers, Hartley, Extramophile, and Gaea, respectively; these machines helped me make the most of tight time constraints.
I wish to thank my other lab members for their continued support and input as well.

DEDICATION

To the ones who taught me the lessons to get through life. To the ones who couldn't have worked any harder to get me through school. To the ones who supported me every step of the way. In other words, to my parents, Minh and Muoi Nguyen.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter
1 INTRODUCTION
2 BACKGROUND
  2.1 What Is HGT?
  2.2 Why HGT?
  2.3 HGT in Fungi
  2.4 How to Predict HGT
    2.4.1 Composition-Based Methods
      2.4.1.1 Pros
      2.4.1.2 Cons
    2.4.2 Phyletic-Distribution-Based Methods
      2.4.2.1 Pros
      2.4.2.2 Cons
    2.4.3 Tree-Based Methods
      2.4.3.1 Pros
      2.4.3.2 Cons
  2.5 Common Problem with HGT Prediction
  2.6 Goal
3 HGT CANDIDATE PREDICTION TOOLS
  3.1 CompHGT
    3.1.1 How Does CompHGT Work?
  3.2 SimHGT
    3.2.1 How Does SimHGT Work?
  3.3 HGTFinder
    3.3.1 How Does HGTFinder Work?
      3.3.1.1 Parameters and Filters
  3.4 TreeHGT
    3.4.1 How Does TreeHGT Work?
      3.4.1.1 Species Tree Creation
      3.4.1.2 Gene Tree Creation and Reconciliation
      3.4.1.3 A Note on Tree Reconciliation
4 DBHGT
  4.1 Implementation Details
    4.1.1 MySQL Database Implementation
      4.1.1.1 The Entity Relation Diagram
      4.1.1.2 Table Decomposition
      4.1.1.3 Challenges
    4.1.2 Website Implementation
  4.2 Features
5 QUICK ANALYSIS OF HGT
  5.1 Basic Statistics Used
    5.1.1 Hypergeometry
    5.1.2 Enrichment Analysis Using Hypergeometry
    5.1.3 Multiple Testing Problem and False Discovery Rate
  5.2 HGTFinder
    5.2.1 Sequence Properties
    5.2.2 Gene Clustering
  5.3 SimHGT
    5.3.1 General Fragment Correlations
    5.3.2 Functional Analysis
    5.3.3 Genome Taxonomic Ranks
    5.3.4 Genome Lifestyle, Environment, and Habitat
6 CONCLUSION
REFERENCES

LIST OF TABLES

Table
5.1 The length of HGTs to non-HGTs, GC content, and Ka, Ks, and Ka/Ks ratios.
5.2 The differences in the total number of fragments, total length of fragments, and the average length of fragments between fungi-fungi and fungi-bacteria HGT fragments.
5.3 The differences in the total number of fragments, total length of fragments, and the average length of fragments between fungi-fungi and fungi-bacteria HGT fragments when fragments are split up into the gene, intergenic, and unknown subtypes.
5.4 The counts of GO terms that are transferred as detected by SimHGT.
5.5 The counts of IPR terms that are transferred as detected by SimHGT.
5.6 The counts of KEGG terms that are transferred as detected by SimHGT.
5.7 The counts of KOG terms that are transferred as detected by SimHGT.
5.8 The counts of SM types that are transferred as detected by SimHGT.
5.9 A comparison of GO terms that were transferred from bacteria vs. those transferred from fungi.

LIST OF FIGURES

Figure
2.1 The standard tree of life shows vertical gene transfer because every node within the tree has no more than one parent.
2.2 When HGT is added to the tree of life, we begin to see genetic information jump horizontally across the tree (hence the name "horizontal gene transfer").
2.3 Within the past 5 years, we've seen an explosion in the number of annotated fungal genomes.
2.4 This shows an example of six sets of incongruent trees and the result that occurred between them [38].
2.5 Diagram showing three sets that are nearly mutually exclusive.
3.1 Example of getN() function on taxonomies b and q.
3.2 A graph showing the occurrence of intra-domain predicted transfers as the R-threshold varies over [0.1, 1.0].
3.3 In this tree rooted at A, node C is the recipient while node M is the target.
4.1 The ER diagram for the DBHGT MySQL database.
4.2 The tabular representations of the entity tables within the MySQL database.
4.3 The tabular representations of the relation tables within the MySQL database.
4.4 Screenshot of the home page for DBHGT.
4.5 Screenshot of the genomes menu page for DBHGT.
4.6 Screenshot of the genes menu page for DBHGT.
4.7 Screenshot of the downloads page for DBHGT.
4.8 Screenshot of the help page for DBHGT.
4.9 Screenshot of the contact page for DBHGT.
4.10 Screenshot of the search page for DBHGT.
5.1 The X axis shows the value of N used while the Y axis depicts the percentage of HGTs that are in a cluster.