
CONSERVED VIRUS PROTEIN FAMILIES IN BACTERIOPHAGE GENOMES AND IN METAGENOMES OF HUMANS

By Xixu Cai

Submitted to the graduate degree program in Microbiology, Molecular Genetics and Immunology and the Graduate Faculty of the University of Kansas in partial fulfillment of the requirements for the degree of Master of Arts.

________________________________
Chairperson, Arcady Mushegian, Ph.D.

________________________________
Joe Lutkenhaus, Ph.D.

________________________________
Philip Hardwidge, Ph.D.

Date Defended: July 1st, 2011

The Thesis Committee for Xixu Cai certifies that this is the approved version of the following thesis:

CONSERVED VIRUS PROTEIN FAMILIES IN BACTERIOPHAGE GENOMES AND IN METAGENOMES OF HUMANS

________________________________
Chairperson, Arcady Mushegian, Ph.D.

Date approved: July 22, 2011

Abstract

Viruses are likely to be the most abundant genomes in the biosphere, displaying remarkable molecular diversity. Their fast-evolving genomes and lack of universal marker genes make phylogenetic and taxonomic studies more difficult than with other organisms. A detailed determination of gene conservation between virus genomes should facilitate the study of virus evolution and function. Here we used sequence similarity methods to build a phage orthologous groups (POGs) resource. The number of POGs has grown significantly in the past decade, while the percentage of genes in phage genomes that have orthologs in other phages has also been increasing, and the percentage of unknown "ORFans" - phage genes that are not in POGs - is decreasing. Other properties of phage genomes remain stable, in particular the high fraction of genes that are never or only rarely observed in their cellular hosts. This suggests that despite the role of phages in transferring cellular genes, a large fraction of the genes in phage genomes maintain an evolutionary trajectory that is distinct from that of host genes.
Next-generation sequencing technologies provide new opportunities to study viruses, their diversity and evolution, directly from environmental samples. The standards of sensitivity and specificity appropriate for analysis of these relatively short shotgun sequence reads are still evolving. In another part of our work, we used sensitive sequence similarity methods to identify more than 400 virus-related genes in 3,280 libraries derived from patients and environmental samples after low-complexity reads were removed. These identifications serve as a starting point to isolate viruses potentially associated with disease and outbreaks of unknown etiology.

Acknowledgement

I would like to thank everyone who has helped me during my graduate study at Stowers Institute for Medical Research and Kansas University Medical Center. I am especially thankful to my advisor, Dr. Arcady Mushegian, for his guidance, support, advice and encouragement throughout my graduate study. I learned a lot from Dr. Mushegian, which will help me in my future career and life.

I would like to thank all past and current members of the Mushegian lab: Dr. David Kristensen, Samuel Chapman, Dr. Hua Li, Dr. Lavanya Kannan, Olga Tsoy and Andrei Kucharavy. In particular, Dr. David Kristensen gave me a jump start in comparative genomics and programming, and Samuel Chapman helped me with many different things. I also thank Dr. Hua Li, Dr. Lavanya Kannan, Olga Tsoy and Andrei Kucharavy for extensive discussions on my projects.

I am also grateful to my committee members, Drs. Joe Lutkenhaus and Philip Hardwidge, for their advice and suggestions at every committee meeting. I also thank them for providing opportunities to conduct my rotations in their labs and for encouraging me during the difficult times.

I would like to thank all members of the Bioinformatics group at the Stowers Institute: Malcolm Cook, Madelaine Gogol, Ariel Paulson and Amy Ubben.
Malcolm, Madelaine and Ariel helped me a lot with programming and gave me very good advice on my research. Amy helped me get used to the different systems at the Institute and with trips to conferences.

Finally, I would like to thank my parents and my whole family for their support and love throughout my life.

Table of Contents

Abstract
Acknowledgement
Table of Contents
List of Figures
List of Appendices
Introduction
    Protein sequence conservation in genomes and in databases, and use of conserved families for functional and phylogenetic inference
    Amino acid vs. nucleotide sequence analysis
    Gene Identification
    Identification of Homologous Genes on the basis of Sequence Similarity
    Orthology and paralogy: two classes of homology
    Methods for Orthology Identification
        Phylogenetic tree-based approaches
        Heuristic best match methods
        Synteny
    Protein sequence complexity
    PSI-BLAST and HHsearch
    Motifs and domains
    Bioinformatics of Virus Proteins
Project I: Phage Orthologous Groups (POGs)
    Background
    Dataset: genomes and proteins
    Construction of Orthologous Groups
    Domains
    Phage isolates
    Functional annotation
    Paralogy
    Phageness Quotient
Project II: Metagenomics
    Background
    The properties of the unassembled dataset and a pilot study of sequence similarity
    Removal of low-complexity sequences at the amino acid level
    Assembly
    Translation
    Virus Genes
Conclusion
References

List of Figures

Figure 1. Two kinds of homology: orthologous and paralogous genes
Figure 2. SEG analysis of measles nucleoprotein
Figure 3. A portion of the HHsearch result for POG125
Figure 4. The growth of the number of completed dsDNA phage
