The Pennsylvania State University
The Graduate School
College of Engineering

INFERENCE OF ORTHOLOGS, WHILE CONSIDERING GENE CONVERSION, TO EVALUATE WHOLE-GENOME MULTIPLE SEQUENCE ALIGNMENTS

A Dissertation in Computer Science and Engineering
by Chih-Hao Hsu

© 2009 Chih-Hao Hsu

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

December 2009

The dissertation of Chih-Hao Hsu was reviewed and approved* by the following:

Webb Miller
Professor of Biology and Computer Science and Engineering
Dissertation Advisor
Chair of Committee

Raj Acharya
Professor of Computer Science and Engineering
Head of the Department of Computer Science and Engineering

Wang-Chien Lee
Associate Professor of Computer Science and Engineering

Ross Hardison
T. Ming Chu Professor of Biochemistry and Molecular Biology

*Signatures are on file in the Graduate School

ABSTRACT

The problem of computing a multiple-sequence alignment (MSA) is very important for the analysis of biological sequences. An equally critical problem is to evaluate the quality of an alignment. In the preliminary project described here, alignments of the human genome to other vertebrate genomes, produced by Multiz and ROAST, are evaluated using orthologous genes in 13 gene clusters from 6 mammalian species. The orthologs are identified using maximum-likelihood phylogenetic tree reconstruction methods. Analysis of the α- and β-globin gene clusters shows that the inferred ortholog relationships are accurate. The orthologous β-globin genes from more than 14 species are used to evaluate the performance of four MSA programs (MLAGAN, MAVID, TBA and ROAST). The results show that ROAST outperforms the others. Furthermore, differences among gene clusters and among species are studied. This approach not only indicates the quality of a given alignment, but also helps us understand the alignment's drawbacks and gives us some clues about how to build the next generation of multiple alignment programs.
To obtain accurate orthologs, the impact of gene conversion is studied in this thesis. Gene conversion events are often overlooked in analyses of genome evolution. In such an event, an interval of DNA sequence (not necessarily containing a gene) overwrites a highly similar sequence. The event creates relationships among genomic intervals that can confound prediction of orthologs and attempts to transfer functional information between genomes. Here we propose different gene conversion detection methods for different scales of data. Detailed information about conversion events between gene pairs is determined, including their directionality. Furthermore, we analyze 1,112,202 highly conserved pairs of human genomic intervals and detect a conversion event for about 13.5% of them. Properties of the putative gene conversions are analyzed, such as the distributions of the lengths of the converted regions and the spacing between source and target. Finally, we also apply our method to several well-studied gene clusters, including the globin genes.
TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGEMENTS

Chapter 1 Introduction
  1.1 Evolution of genomes
  1.2 Duplication of genome
  1.3 Orthologs and paralogs
  1.4 Inference of orthologs and paralogs
  1.5 Gene conversion

Chapter 2 Evaluation of Whole-Genome Multiple Sequence Alignments
  2.1 Introduction
    2.1.1 Multiple sequence alignments
    2.1.2 Methods for evaluation of multiple sequence alignments
    2.1.3 Motivation
  2.2 Methods
    2.2.1 Gene clusters identification
    2.2.2 Extracting coding sequences
    2.2.3 Phylogenetic tree reconstruction
    2.2.4 Orthology identification
    2.2.5 Evaluation of alignments
  2.3 Results
    2.3.1 Analysis of ortholog assignments for the α- and β-globin gene clusters
    2.3.2 Comparison of different alignment programs
    2.3.3 Comparison of different gene clusters
    2.3.4 Comparison of different species
  2.4 Conclusion

Chapter 3 Gene conversion detection between a pair of genes
  3.1 Introduction
    3.1.1 Motivation
    3.1.2 What is gene conversion
    3.1.3 Impact of gene conversion to the inference of orthology
    3.1.4 Methods for gene conversion detection
    3.1.5 Limitations of these methods
  3.2 Methods
    3.2.1 Site-by-site compatibility method
    3.2.2 Gene conversion inference
    3.2.3 Boundaries of gene conversion
  3.3 Results and limitations
    3.3.1 Beta and delta genes
    3.3.2 Two gamma genes
    3.3.3 Limitations

Chapter 4 Gene conversion detection for whole genome
  4.1 Introduction
  4.2 Methods
    4.2.1 Highly conserved pairs of sequences
    4.2.2 Gene conversion detection between each pair of sequences
    4.2.3 Space-efficient modifications
    4.2.4 Extension to quadruplet testing
    4.2.5 Multiple-comparison correction
    4.2.6 Directionality of gene conversion
  4.3 Results
    4.3.1 Number and distribution of gene conversion events in human
    4.3.2 Correlations with the distance, length, and relative orientation of the paralogs
    4.3.3 Length of converted regions
    4.3.4 The effect of protein-coding DNA