The Pennsylvania State University

The Graduate School

College of Engineering

INFERENCE OF ORTHOLOGS, WHILE CONSIDERING CONVERSION,

TO EVALUATE WHOLE-GENOME MULTIPLE SEQUENCE ALIGNMENTS

A Dissertation in

Computer Science and Engineering

by

Chih-Hao Hsu

© 2009 Chih-Hao Hsu

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

December 2009

The dissertation of Chih-Hao Hsu was reviewed and approved* by the following:

Webb Miller
Professor of Biology and Computer Science and Engineering
Dissertation Advisor
Chair of Committee

Raj Acharya
Professor of Computer Science and Engineering
Head of the Department of Computer Science and Engineering

Wang-Chien Lee
Associate Professor of Computer Science and Engineering

Ross Hardison
T. Ming Chu Professor of Biochemistry and Molecular Biology

*Signatures are on file in the Graduate School


ABSTRACT

The problem of computing a multiple-sequence alignment (MSA) is very important for the analysis of biological sequences. An equally critical problem is to evaluate the quality of an alignment. In the preliminary project described here, alignments produced by Multiz and ROAST of the human genome to other vertebrate genomes are evaluated using orthologous genes in 13 gene clusters from 6 mammalian species, which are identified using maximum-likelihood phylogenetic tree reconstruction methods. Analysis of the α- and β-globin gene clusters shows that the inferred ortholog relationships are accurate. The orthologous β-globin genes from over 14 species are used to evaluate the performance of four MSA programs (MLAGAN, MAVID, TBA and ROAST). The results show that the performance of ROAST is superior to the others. Furthermore, differences among gene clusters and among species are studied. This approach not only indicates the quality of a given alignment, but also helps us understand the alignment’s drawbacks and gives us some clues about how to build the next generation of multiple alignment programs.

To obtain accurate orthologs, the impact of gene conversion is studied in this thesis. Gene conversion events are often overlooked in analyses of genome evolution. In such an event, an interval of DNA sequence (not necessarily containing a gene) overwrites a highly similar sequence. The event creates relationships among genomic intervals that can confound prediction of orthologs and attempts to transfer functional information between genomes. Here we propose different gene conversion detection methods for different scales of data. Detailed information about conversion events between gene pairs is determined, including their directionality. Furthermore, we analyze 1,112,202 highly conserved pairs of human genomic intervals, and detect a conversion event for about 13.5% of them. Properties of the putative gene conversions are analyzed, such as the distributions of the lengths of the converted regions and the spacing between source and target. Finally, we also apply our method to several well-studied gene clusters, including the globin genes.


TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

ACKNOWLEDGEMENTS

Chapter 1 Introduction
1.1 Evolution of genomes
1.2 Duplication of genome
1.3 Orthologs and paralogs
1.4 Inference of orthologs and paralogs
1.5 Gene conversion

Chapter 2 Evaluation of Whole-Genome Multiple Sequence Alignments
2.1 Introduction
2.1.1 Multiple sequence alignments
2.1.2 Methods for evaluation of multiple sequence alignments
2.1.3 Motivation
2.2 Methods
2.2.1 Gene clusters identification
2.2.2 Extracting coding sequences
2.2.3 Phylogenetic tree reconstruction
2.2.4 Orthology identification
2.2.5 Evaluation of alignments
2.3 Results
2.3.1 Analysis of ortholog assignments for the α- and β-globin gene clusters
2.3.2 Comparison of different alignment programs
2.3.3 Comparison of different gene clusters
2.3.4 Comparison of different species
2.4 Conclusion

Chapter 3 Gene conversion detection between a pair of genes
3.1 Introduction
3.1.1 Motivation
3.1.2 What is gene conversion
3.1.3 Impact of gene conversion to the inference of orthology
3.1.4 Methods for gene conversion detection
3.1.5 Limitations of these methods
3.2 Methods
3.2.1 Site-by-site compatibility method
3.2.2 Gene conversion inference
3.2.3 Boundaries of gene conversion
3.3 Results and limitations
3.3.1 Beta and delta genes
3.3.2 Two gamma genes
3.3.3 Limitations

Chapter 4 Gene conversion detection for whole genome
4.1 Introduction
4.2 Methods
4.2.1 Highly conserved pairs of sequences
4.2.2 Gene conversion detection between each pair of sequences
4.2.3 Space-efficient modifications
4.2.4 Extension to quadruplet testing
4.2.5 Multiple-comparison correction
4.2.6 Directionality of gene conversion
4.3 Results
4.3.1 Number and distribution of gene conversion events in human
4.3.2 Correlations with the distance, length, and relative orientation of the paralogs
4.3.3 Length of converted regions
4.3.4 The effect of protein-coding DNA
4.3.5 Correlation with GC-content
4.4 Discussion

Chapter 5 Applying gene conversion detection method to gene clusters
5.1 Introduction
5.2 Results
5.2.1 Beta-globin gene cluster (hg18.chr11: 5,180,996-5,270,995)
5.2.2 CCL gene cluster (hg18.chr17: 31,334,806-31,886,998)
5.2.3 IFN gene cluster (hg18.chr9: 21,048,761-21,471,698)

Chapter 6 Conclusions and future works
6.1 Conclusions
6.2 Future Works

Bibliography


LIST OF FIGURES

Figure 1-1: Duplication caused by transposition.

Figure 1-2: Normal crossing over and unequal crossing over.

Figure 1-3: Construction of reconciled tree.

Figure 1-4: Effect of gene conversion.

Figure 2-1: Pseudoorthologs result from gene loss.

Figure 2-2: Processes for extracting coding sequences for species other than human or mouse.

Figure 2-3: Problem of multiple substitutions in phylogenetic tree reconstruction.

Figure 2-4: Species tree and inferred trees.

Figure 2-5: An example for selecting a gene as the query sequence for database search.

Figure 2-6: Changing the gene tree of the TRIM gene cluster via re-rooting.

Figure 2-7: Orthologous relationships within a phylogenetic tree.

Figure 2-8: Some drawbacks of using a bootstrap value as the estimated reliability of orthologous genes.

Figure 2-9: Evolutionary information for orthologs inference.

Figure 2-10: Definitions of sensitivity and specificity.

Figure 2-11: Phylogenetic trees for 25 species in the β-globin gene cluster.

Figure 2-12: Gene conversion and gene duplication.

Figure 2-13: Phylogenetic trees for 30 species in the α-globin gene cluster.

Figure 2-14: Sensitivity and specificity among different alignment programs.

Figure 2-15: Sensitivity and specificity of different gene clusters.

Figure 2-16: Sensitivity and specificity of different species.

Figure 3-1: Phylogenetic tree of beta globin gene cluster.

Figure 3-2: Gene conversion and gene duplication.

Figure 3-3: Effect of gene conversion.

Figure 3-4: Example shows the issues of Drouin’s method.

Figure 3-5: An example of bottom-up phase.

Figure 3-6: An example of top-down phase.

Figure 3-7: An example shows how to determine gene conversion and its directionality.

Figure 3-8: All gene conversion events between beta and delta genes.

Figure 3-9: All gene conversion events between two gamma genes.

Figure 4-1: Evidence of gene conversion in the human δ-globin gene.

Figure 4-2: Determining the occurrence of gene conversion events in a triplet.

Figure 4-3: A cubic-space algorithm for computing the probabilities xm,n.k.

Figure 4-4: Comparisons between quadruplet testing and triplet testing.

Figure 4-5: Algorithm for determining cutoff position of P-values.

Figure 4-6: Timing of evolutionary events.

Figure 4-7: Evidence that the β-globin gene (HBB) converted the δ-globin gene (HBD).

Figure 4-8: Frequencies of gene conversion events in each human chromosome.

Figure 4-9: Frequency of intra-chromosomal gene conversions as a function of distance between the paralogs.

Figure 4-10: Correlation with length of the paralogous human sequences.

Figure 4-11: Correlation between orientation and separation distance of the human paralogs.

Figure 4-12: Distribution for the length of the converted regions.

Figure 4-13: Correlation between gene conversion and GC content.

Figure 5-1: Gene tree and detected conversion events for the beta-globin gene cluster.

Figure 5-2: Influences of gene conversion between the beta and delta genes.

Figure 5-3: Inferred evolutionary histories for mammalian beta and delta genes.

Figure 5-4: Phylogenetic trees for CCL gene cluster.

Figure 5-5: Evidence of gene conversions between CCL15 and CCL23 genes.

Figure 5-6: Inferred evolutionary histories for the CCL gene cluster.

Figure 5-7: Gene tree and detected conversion events for the Interferon gene cluster.

Figure 5-8: Influences of gene conversion to the phylogeny in the distal group.


LIST OF TABLES

Table 2-1: Ortholog assignment of β-globin gene cluster.

Table 2-2: The coordinates of all genes for 25 species of mammals in the β-globin gene cluster.

Table 2-3: The predicted ortholog assignments of HBA-related genes for 30 mammalian species.

Table 2-4: Coordinates of all genes for 30 mammalian α-globin gene clusters.

Table 2-5: Detailed information for 13 gene clusters.

Table 4-1: Information for duplicated human genomic intervals used in this study.

Table 4-2: Distribution of intra- and inter-chromosomal gene conversions.

Table 4-3: Distribution of gene conversion events classified by orientation.

Table 4-4: Conversion frequency as a function of the presence of protein-coding sequence.

Table 4-5: Number of conversions for different directionality in 1-coding category.


ACKNOWLEDGEMENTS

I would like to express my deepest appreciation to my advisor, Webb Miller, for giving me the opportunity to participate in this interesting field, and for his help and guidance during my graduate study at the Pennsylvania State University. I also want to thank my committee members, Raj Acharya, Wang-Chien Lee, and Ross Hardison, for their time and effort. Furthermore, I thank all my colleagues at the Pennsylvania State University for their assistance. Finally, I would like to thank my wife, Mei-Jen Liao, for her support and encouragement, and my children, Emily and Ethan, for their love.

Chapter 1

Introduction

In this chapter, we introduce the evolution of genomes. Two main mechanisms that shape genomes are described, and the formation and impact of duplicated regions are discussed in detail. Furthermore, the differences between orthologs and paralogs are explained and the inference of orthologs and paralogs is demonstrated. Finally, an important evolutionary force, gene conversion, is introduced to show how it can affect the inference of orthologs.

1.1 Evolution of genomes

Millions of organisms live on Earth. An organism is a living species, which could be a plant, an animal or a virus, and may consist of a single cell (unicellular) or of many cells (multi-cellular). All cells contain genetic material called DNA, which encodes the essential functions of each cell and is inherited through generations. This inheritance between generations results in shared traits between organisms in the same lineage. Nevertheless, small changes in the genetic material of organisms still occur from one generation to the next. This kind of change is called evolution. Even though the change in each generation is very small, the changes accumulated over many generations can produce new features in an organism or even form a new species.

Evolution arises through two major mechanisms: mutations, and large-scale changes to the nucleotide sequence within or between species, such as insertions, deletions, inversions and duplications. Mutations are substitutions in the nucleotide sequence, which may be caused by errors during cell division or by exposure to radiation. Large-scale events have received much attention recently and are believed to play a major role in evolution. Ohno (1967) even suggested that gene duplication is the most important force in evolution since the emergence of the universal common ancestor. Indeed, gene duplication is the main force behind the expansion of genomes, and the genome sizes of different organisms range from a few thousand bases in some viruses (Fiers et al. 1976) to more than one hundred billion bases in the marbled lungfish. Many duplication events are believed to have occurred in these genomes.

1.2 Duplication of genome

There are two main mechanisms by which duplications occur, namely transposition and unequal crossing over. Transposition is the movement of genetic material from one chromosomal location to another. Many transposable elements, called transposons, are segments of DNA sequence that can move from one location to another. As shown in Figure 1-1, after a transposon is replicated, the copy can be inserted at another location and form a duplication. The other mechanism is unequal crossing over. During meiosis, genetic material is exchanged between homologous chromosomes, as shown in Figure 1-2A. This process is called crossing over. However, the homologous chromosomes can be misaligned in some cases (Figure 1-2B). This situation is called unequal crossing over, and a duplication arises in one chromosome.

Duplication plays a very important role in evolutionary genomics, especially gene duplication, that is, the duplication of a DNA sequence that contains a gene. Gene duplication is the major way in which new genes are formed. Typically, after a gene duplication, one copy of the gene retains the original function, while the other copy may acquire a new function.


Figure 1-1: Duplication caused by transposition.

Figure 1-2: (A) Normal crossing over. (B) Duplication arises from unequal crossing over due to the misalignment of homologous chromosomes.

1.3 Orthologs and paralogs

Because of duplication, there are many similar genes or genomic intervals in the DNA sequence. The concept of homologs, which are sequences sharing similar characteristics inherited from a common ancestor, is very important for evolutionary genomics. Homologs can be classified into two different types, orthologs and paralogs. Orthologs are genes or genomic intervals that diverged via a speciation event, while paralogs diverged via a duplication event. In general, since orthologous genes descend from a single gene in the last common ancestor of the species, their function and structure often remain the same or similar. However, since paralogous genes are created by a duplication event within a species, one copy of the duplicated genes can be recruited for a different function.

1.4 Inference of orthologs and paralogs

A typical way to infer orthologs and paralogs is to compare the phylogenetic tree of the genes with the species tree. The concept of a reconciled tree was first introduced by Goodman (1979) and was shown to be an effective method for the inference of orthologs and paralogs (Yuan et al. 1998). A reconciled tree is a new tree that reconciles the phylogenetic tree with the species tree by postulating the existence of gene losses. A minimum reconciled tree is one that minimizes the number of gene losses. Finding the minimum reconciled tree has been shown to be an NP-hard problem (Goodman et al. 1979). Therefore, many heuristic algorithms for constructing a reconciled tree have been proposed (Page 1994; Mirkin et al. 1995). Constructing a reconciled tree consists of two major steps. The first step is to generate a phylogenetic tree, and the second step is to reconcile the phylogenetic tree with the species tree at minimum cost, where the cost is the number of gene losses. Figure 1-3 shows an example of the construction of a reconciled tree. Orthologs and paralogs can be read directly from the reconciled tree. In this case, there are three gene losses; a1 is orthologous to b1, and a2, c2 and d2 are orthologous to each other, while a1 and b1 are paralogous to a2, c2 and d2.


Figure 1-3: Construction of reconciled tree. (A) Phylogenetic tree. (B) Species tree. (C) Reconciled tree.
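The reading of orthologs and paralogs from a gene tree and a species tree can be sketched in a few lines of Python. The sketch below is only an illustration, not the procedure used in this thesis: it assumes a simple least-common-ancestor mapping in which a gene-tree node is treated as a duplication when it maps to the same species-tree clade as one of its children. The four-species tree, the gene names, and the rule that the first letter of a gene name encodes its species are all hypothetical and are not taken from Figure 1-3.

```python
from itertools import combinations

def leaves(tree):
    """Return the leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]

def clades(species_tree):
    """Return the species set below every node of the species tree."""
    result = [frozenset(leaves(species_tree))]
    if not isinstance(species_tree, str):
        for child in species_tree:
            result.extend(clades(child))
    return result

def lca(species_set, all_clades):
    """Smallest species-tree clade that contains species_set."""
    return min((c for c in all_clades if species_set <= c), key=len)

def species_of(gene):
    # Assumption: the first letter of a gene name encodes its species.
    return gene[0].upper()

def classify_pairs(gene_tree, species_tree):
    """Label every pair of genes as 'ortholog' or 'paralog'."""
    all_clades = clades(species_tree)
    labels = {}

    def mapped(node):
        return lca(frozenset(species_of(g) for g in leaves(node)), all_clades)

    def visit(node):
        if isinstance(node, str):
            return
        # A node mapping to the same clade as a child implies a duplication.
        is_duplication = mapped(node) in [mapped(child) for child in node]
        kind = "paralog" if is_duplication else "ortholog"
        for left, right in combinations(node, 2):
            for g in leaves(left):
                for h in leaves(right):
                    labels[frozenset((g, h))] = kind
        for child in node:
            visit(child)

    visit(gene_tree)
    return labels

# Hypothetical input: species tree ((A,B),(C,D)) and a five-gene tree.
species_tree = (("A", "B"), ("C", "D"))
gene_tree = (("a1", "b1"), ("a2", ("c2", "d2")))
labels = classify_pairs(gene_tree, species_tree)
```

With this hypothetical input the root of the gene tree is classified as a duplication, so a1 and b1 come out as paralogous to a2, c2 and d2, while the pairs within each copy are orthologous. The sketch does not count gene losses or search for a minimum reconciled tree.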

1.5 Gene conversion

Using a phylogenetic tree to find orthologs and paralogs is efficient. However, the topology of the phylogenetic tree is not always correct, so the inferred orthologs and paralogs can be inaccurate. Many factors can affect the reliability of the topology of a phylogenetic tree. One of the most important is gene conversion, a nonreciprocal transfer of genetic material. In the process of gene conversion, part of a DNA sequence is transferred from one DNA helix, A, to another DNA helix, B. A remains the same, while the corresponding part of B is replaced by A’s sequence. Gene conversion can result from base mismatch repair during recombination or from the double-strand repair process that follows DNA damage. Gene conversion can affect the inference of orthologs and paralogs. For example, as shown in Figure 1-4, a gene conversion event occurred in part of the sequences between gene D and gene E. The phylogenetic tree for the converted region (Figure 1-4B) can differ from the original tree shown in Figure 1-4A. Therefore, an unreliable phylogenetic tree could be generated and incorrect relationships of orthologs and paralogs might be inferred.


Figure 1-4: Effect of gene conversion. (A) Gene tree for five genes. (B) Phylogenetic tree for the conversion region.
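The nonreciprocal nature of gene conversion can be made concrete with a toy example. The following Python sketch assumes two pre-aligned sequences of equal length; the sequences, the interval, and the function names are made up for illustration and do not come from the data analyzed in this thesis.

```python
def convert(donor, target, start, end):
    """Return target with [start, end) overwritten by the donor's bases.

    The donor is left unchanged, which models the nonreciprocal transfer.
    """
    if len(donor) != len(target):
        raise ValueError("this sketch assumes pre-aligned, equal-length sequences")
    return target[:start] + donor[start:end] + target[end:]

def mismatches(a, b, start=0, end=None):
    """Count differing positions between a and b within [start, end)."""
    end = len(a) if end is None else end
    return sum(x != y for x, y in zip(a[start:end], b[start:end]))

# Hypothetical aligned paralogs (made-up sequences for illustration only).
gene_a = "ACGTACGTACGTACGTACGT"
gene_b = "ACCTACGAACGTTCGTACTT"
converted_b = convert(gene_a, gene_b, 5, 15)

print(mismatches(gene_a, gene_b))               # differences before conversion
print(mismatches(gene_a, converted_b, 5, 15))   # differences inside the converted interval
print(mismatches(gene_a, converted_b))          # differences remaining outside it
```

Inside the converted interval the two sequences become identical, so a tree built from that interval alone would group them together regardless of their true history, which is the effect described above.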

In this thesis, we construct phylogenetic trees to find orthologs and use the inferred orthologs to evaluate the performance of multiple sequence alignments. To study the impact of gene conversion on the inference of orthologs, data at three different scales, namely pairs of genes, gene clusters and the whole genome, are analyzed. The structure of this thesis is as follows. The mechanisms for the inference of orthologs and the evaluation of multiple sequence alignments are presented in Chapter 2. Analyses of gene conversion for pairs of genes, the whole genome and individual gene clusters are presented in Chapter 3, Chapter 4 and Chapter 5, respectively. Finally, conclusions and future work are given in Chapter 6.

Chapter 2

Evaluation of Whole-Genome Multiple Sequence Alignments

In this chapter, a method to evaluate the quality of a multiple sequence alignment is proposed. In this study, orthologous genes are identified using maximum-likelihood phylogenetic tree reconstruction methods. Alignments produced by Multiz and ROAST of the human genome to other vertebrate genomes are evaluated using orthologous genes in 13 gene clusters from 6 mammalian species. Furthermore, two gene clusters, namely the α- and β-globin gene clusters, are analyzed. The orthologous α- and β-globin genes are used to evaluate the performance of four MSA programs (MLAGAN, MAVID, TBA and ROAST). Finally, differences among gene clusters and among species are studied. This approach not only indicates the quality of a given alignment, but also helps us understand the alignment’s drawbacks and gives us some clues about how to build the next generation of multiple alignment programs.

2.1 Introduction

2.1.1 Multiple sequence alignments

A multiple sequence alignment (MSA), i.e., an alignment containing more than two sequences (genomic sequences or regions in this study), can be used to identify intervals that are conserved among those sequences. The goal might be to analyze phylogeny, or to help infer the functional elements and structures of a genome. One of the most widely used tools for aligning multiple sequences is ClustalW (Chenna et al. 2003), which uses a straightforward progressive alignment method to add sequences one by one into the multiple alignment using a pre-calculated guide tree. ClustalW also uses additional strategies, e.g. individual weights for each sequence, gap penalty adjustment, and automatic replacement of amino-acid substitution matrices, to improve the accuracy of the alignment. However, as more and more genomic sequences become available, it is imperative to have a tool that can align large-scale multiple genomic sequences rapidly. Recently, several programs, e.g. MLAGAN (Brudno et al. 2003), MAVID (Bray et al. 2004) and TBA (Blanchette et al. 2004), have been proposed for this purpose. MLAGAN is a multiple-sequence global aligner, which uses a pairwise aligner, LAGAN, to construct the multiple alignments progressively. Moreover, an iterative refinement is done to improve the alignment locally. MAVID automatically constructs a guide tree, with maximum-likelihood ancestral sequences inferred progressively. In addition, more information, i.e. protein-based anchors and a homology map, is utilized. TBA proposes a new concept, the “threaded blockset”, which is a local multiple alignment, and uses it to construct multiple alignments for large-scale genomic sequences.

2.1.2 Methods for evaluation of multiple sequence alignments

In addition to generating multiple-sequence alignments, it is equally important to evaluate the quality of an alignment. HOMSTRAD (Mizuguchi et al. 1998), OxBench (Raghava et al. 2003) and PREFAB (Edgar 2004) use 3-D protein structural superpositions to study the performance of an alignment. However, these methods can be applied only to alignments of protein sequences. In another approach, Golubchik et al. (2007) focus on gaps to determine MSA quality. Two models, overlapped gaps and non-overlapped gaps, are used in that paper to study the performance of MSA programs. However, the paper uses only artificially generated sequences to analyze these two models, and in reality few biological data match them. One example studied in that paper is alternatively spliced sequences. However, there is no clear evolutionary relationship between such sequences, which perhaps makes that example inappropriate for evaluating MSA programs that rely on evolutionary relationships to build alignments. For instance, ClustalW uses a guide tree to align multiple sequences. Since there is no tree-like relationship among alternative splice variants, a guide tree generated from these sequences is not well founded, which may make ClustalW perform poorly. A third evaluation strategy is employed by TBA (Blanchette et al. 2004) and Mulan (Ovcharenko et al. 2005), which use simulations to estimate the accuracy of an alignment program. A hypothetical ancestral sequence is created, and simulated neutral evolutionary processes are applied to generate multiple sequences. The relationships among sequences are recorded and used to score the agreement of a given alignment with the recorded (artificial) evolutionary history. However, it is sometimes very difficult to simulate evolutionary processes correctly; in particular, different genomic sequences have different evolutionary characteristics. A fourth plan is adopted by Margulies et al. (2007), who use some particular classes of sequences, e.g. annotated protein-coding sequences, ancestral repeats (AR), and Alu elements, to assess the coverage and correctness of an alignment. However, these sequences can only be used to estimate the overall quality of an alignment; they cannot determine precisely which positions are correctly aligned according to biological considerations.

2.1.3 Motivation

The concept of homologs, defined as “the same organ in different animals under every variety of form and function” by Richard Owen in 1843, is very important for evolutionary genomics. Homologs can be classified into two different types, orthologs and paralogs. Orthologs are genes or genomic intervals that diverged via a speciation event, while paralogs diverged via a duplication event. In general, since orthologous genes descend from a single gene in the last common ancestor of the species, their function and structure often remain the same or similar. However, since paralogous genes are created by a duplication event within a species, one copy of the duplicated genes can be recruited for a different function. Therefore, in this thesis, the ability to align orthologous genes is used as the criterion for evaluating alignment quality. First, 13 gene clusters among 6 different species are identified, and coding sequences are extracted for each gene cluster. Then, a phylogenetic tree is generated for each cluster and orthologous genes are identified. Finally, these orthologous genes are used to study the quality of an alignment.

2.2 Methods

2.2.1 Gene clusters identification

A gene cluster is a set of genes that are arranged close together on a chromosome and have the same or at least somewhat similar function and structure. In particular, all genes in a gene cluster are homologous. In this project, to infer which genes are orthologous, a phylogenetic tree is reconstructed for each gene cluster. Particularly when only a few species have sequence data for a given gene cluster, we need to be aware of the possible problem of pseudoorthologs (Koonin et al. 2005) resulting from lineage-specific gene loss. For example, if only two species are used to reconstruct a phylogenetic tree and infer orthologous genes, as shown in Figure 2-1, and if genes A1 and B2 are lost, genes A2 and B1 might be erroneously inferred to be orthologs. However, if too many species (e.g., with poorly resolved phylogenetic relationships) are used to reconstruct the phylogenetic tree, the reconstruction is not only time-consuming but also unreliable. Therefore, in our project, gene clusters are identified among 6 different species, i.e. human, chimp, rhesus, mouse, rat and dog. First, functionally related genes in human are identified to define a gene cluster. Then, corresponding regions for the other species are obtained using a program called “liftover” (Kent et al. 2002), which uses “chain” structures (Kent et al. 2003) to convert genome coordinates and genome annotation across species. In total, 13 human clusters are used in this project. More details are given next.

Figure 2-1: Pseudoorthologs result from gene loss.

2.2.2 Extracting coding sequences

In order to find orthologous genes in each gene cluster, the coding sequences of six different species are identified for each gene cluster. The coding sequences for human and mouse can be extracted from the UCSC Genome Browser (Kent et al. 2002). To extract the coding sequences of the other species, we start with the genomic sequences of those species. The pairwise alignment of human and each species is computed by Blastz (Schwartz et al. 2003), and possible coding regions are found using the Laj interactive alignment viewer (Wilson et al. 2001). With these possible coding regions and the coding sequences of human, the corresponding coding regions for the other species can be found using GeneWise2 (Birney et al. 2004), which is a similarity-based tool for aligning an expressed DNA sequence with a genomic sequence. These processes are shown in Figure 2-2.

Figure 2-2: Processes for extracting coding sequences for species other than human or mouse.

2.2.3 Phylogenetic tree reconstruction

To identify orthologous genes for each cluster, we start by generating the phylogenetic tree. Many issues can affect the correctness of phylogenetic tree reconstruction. Mutational saturation is one potential problem. A gene is called mutationally saturated when multiple substitutions have occurred at the same position so frequently that the evolutionary distance between two genes is underestimated. In Figure 2-3, multiple substitutions occurred at the same position in gene 3, so that gene 1 and gene 3 are erroneously considered more closely related. To deal with this problem, a separate phylogenetic tree is built for each gene cluster, which can generally be done with fair reliability, since the genes in each gene cluster are similar. Moreover, protein sequences are used to reconstruct the phylogenetic tree, since protein sequences are more conserved than the corresponding DNA sequences.


Figure 2-3: Problem of multiple substitutions in phylogenetic tree reconstruction.
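The underestimation caused by multiple substitutions can be illustrated numerically. The Python sketch below compares the observed fraction of differing sites (the p-distance) with the distance corrected under the Jukes-Cantor nucleotide model, d = -(3/4) ln(1 - 4p/3). It is an illustration only: the trees in this thesis are built from protein sequences with maximum likelihood rather than from corrected pairwise distances, and the function names are hypothetical.

```python
import math

def p_distance(seq1, seq2):
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)

def jukes_cantor_distance(p):
    """Expected substitutions per site under the Jukes-Cantor model."""
    if not 0 <= p < 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)

# The gap between observed and corrected distance widens as p grows,
# because more substitutions are hidden by later changes at the same site.
for p in (0.05, 0.30, 0.60):
    print(p, round(jukes_cantor_distance(p), 4))
```

As the observed difference approaches 0.75, the corrected distance grows without bound, which reflects how little information a saturated alignment carries about the true amount of change.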

Variation of the substitution rate among sites is also a critical issue for phylogenetic tree reconstruction. Some researchers (Fitch et al. 1967) have observed that the substitution rate among sites is variable in most genes or proteins. Sullivan et al. (1997) use the Jukes-Cantor model (Jukes et al. 1969) to show that phylogenetic tree reconstruction methods, i.e. parsimony, distance-based, and maximum-likelihood methods, can be misled if the among-site substitution rate is incorrectly assumed to be invariable. In general, a statistical distribution can be used to approximate the substitution rate and solve this problem. Therefore, in this project, a Hidden Markov Model (HMM) method (Felsenstein et al. 1996) is used to infer the substitution rate at different amino acid positions. The substitution rate is approximated using a Gamma distribution at variable sites, while some sites are assumed to be invariable. This is called an “I + Γ” model (Gu et al. 1995), which is the most widely used model to reconstruct phylogenetic trees. Two parameters are needed in this model: α, the shape parameter of the gamma distribution, and p_inv, the fraction of invariant sites. To estimate these parameters, a computer program called TREE-PUZZLE (Schmidt et al. 2003) is used. TREE-PUZZLE uses a fast tree search algorithm, called quartet puzzling, to estimate the maximum-likelihood parameters automatically.
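To make the two parameters of the “I + Γ” model concrete, the following Python sketch draws a relative substitution rate for each site: with probability p_inv the site is invariable (rate 0), and otherwise the rate is drawn from a gamma distribution with shape α and mean 1. This is a simulation for illustration only, with hypothetical parameter values; it is not how TREE-PUZZLE estimates α and p_inv.

```python
import random

def sample_site_rates(n_sites, alpha, p_inv, seed=0):
    """Draw one relative substitution rate per site under an I + Gamma model."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_sites):
        if rng.random() < p_inv:
            rates.append(0.0)  # invariable site
        else:
            # Shape alpha with scale 1/alpha gives a mean rate of 1.
            rates.append(rng.gammavariate(alpha, 1.0 / alpha))
    return rates

# Hypothetical values: a small alpha produces strong rate heterogeneity.
rates = sample_site_rates(20000, alpha=0.5, p_inv=0.2)
variable = [r for r in rates if r > 0]
print(1 - len(variable) / len(rates))   # close to p_inv
print(sum(variable) / len(variable))    # close to 1
```

A small α concentrates most variable sites near rate 0 with a few fast sites, whereas a large α makes the variable sites evolve at nearly the same rate.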


Some researchers (Van et al. 1998; Foster et al. 1999) also show that compositional bias can affect the correctness of phylogenetic tree reconstruction. When unrelated genes have similar nucleotide compositions, they may be considered related and grouped together in the phylogenetic tree. Phillips et al. (2004) show that compositional bias can mislead reconstruction via distance-based methods, i.e. UPGMA and Neighbor Joining, but not Maximum Likelihood methods. Therefore, in this project, Maximum Likelihood methods are used to reconstruct a phylogenetic tree for each gene cluster.

Figure 2-4: Species tree and inferred trees. (A) Species tree. (B) Two different inferred trees.

Long branch attraction (LBA) (Felsenstein et al. 1987) could cause inconsistency of

phylogeny reconstruction by grouping long branches together. In our project, genes for 6 different

species are extracted to reconstruct a phylogenetic tree for each gene cluster. The species tree for

these 6 species is shown in Figure 2-4A. However, since the evolutionary rates of rodents, i.e.

mouse and rat, are faster than those of the other species, the inferred tree, shown in Figure 2-4B, could be different from the species tree. To deal with this problem, dog and rodents are considered separately when inferring orthologs. When inferring the orthology of human and rodents, nodes containing dog genes are ignored. Similarly, when inferring the orthology of human and dog, nodes containing mouse and rat genes are skipped. Consequently, even if the phylogenetic trees are inconsistent, the inferred orthologous genes remain correct.

Since the phylogenetic tree generated by the maximum likelihood method is an unrooted tree, in order to infer orthologs from a phylogenetic tree, the tree should be converted into a rooted tree. In general, an outgroup could be used to find the root of an unrooted tree. There are several issues that should be considered. When an outgroup is too distant from all genes in a gene cluster, it could result in an incorrect phylogenetic tree due to mutational saturation of sequences.

However, when an outgroup appears to be too close to some genes in a gene cluster, it might not be a real outgroup. In this project, Blastp is used to find an appropriate outgroup. Blastp uses BLAST (Basic Local Alignment Search Tool) to compare an amino acid sequence with a protein database and find the most similar proteins. A gene cluster may contain many paralogous genes; for example, the β-globin gene cluster contains five paralogous genes (β, δ, γ1, γ2 and ε), and some clusters have hundreds of paralogs. For such a cluster, the common ancestor of all genes in the cluster could be much earlier than the common ancestor of Boreoeutheria. Therefore, genes of three different species, opossum, chicken and zebra fish, are used as outgroups in this project. To find the outgroup genes in these three species, one gene in the gene cluster is used as a query to search the protein database. In this project, this gene is determined using the following heuristic method:

(1) A phylogenetic tree is generated for all genes of human using Maximum Likelihood.

(2) The gene with the longest branch length is selected as the query sequence to search in

the protein database.
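Step (2) of the heuristic can be sketched as follows. This is only an illustration: the function name pick_query_gene is hypothetical, and it assumes that "branch length" means the terminal branch leading to each human gene in the Maximum Likelihood tree, supplied here as a plain dictionary.

```python
def pick_query_gene(terminal_branch_lengths):
    """Return the gene with the longest terminal branch.

    terminal_branch_lengths: dict mapping gene name -> branch length
    (assumed to have been read from the tree of human genes).
    """
    if not terminal_branch_lengths:
        raise ValueError("no genes given")
    return max(terminal_branch_lengths, key=terminal_branch_lengths.get)
```

The returned gene would then be used as the Blastp query; ties are resolved in favor of the first gene encountered.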


Figure 2-5 shows an example of how to select such a gene. Figure 2-5 is the phylogenetic tree of human genes in the Chemokine Ligand (CCL1) gene cluster, generated using the Maximum Likelihood method. CCL1 is selected as the query to search the protein database. There are two reasons why CCL1 is chosen. The first is based on the idea of the mid-point rooting method, in which the root of an unrooted tree is positioned at the mid-point of the longest span across the tree. The second is to reduce the Long Branch Attraction effect. Not all of the three outgroups, i.e. opossum, chicken and zebra fish, necessarily diverged before the common ancestor of all genes in a gene cluster. Some of them could be the roots of a subtree, and grouping them with the gene with the longest branch length could reduce the LBA effect (Swofford et al. 1996).

Figure 2-5: An example for selecting a gene as the query sequence for database search.

Nevertheless, in some gene clusters, the common ancestor of all genes is earlier than the speciation of zebra fish, and it is very difficult to find a true outgroup for these gene clusters. Moreover, even if we can find the true outgroup for these gene clusters, we still have the problem of mutational saturation. Therefore, when the outgroup used to reconstruct the phylogenetic tree is not a real outgroup, a “re-root” process is executed. The concept of “re-root” is similar to the idea behind construction of a reconciled tree (Page et al. 1998; Paola et al. 2003), which reconciles gene trees with a species tree under the assumption that the number of duplication events and gene losses is minimized. However, unlike finding a reconciled tree, which requires checking all branches

to find the optimal root, “re-root” only traces along a tree branch to find the real root. Figure 2-6 shows an example of how “re-root” works. Figure 2-6A is the gene tree of the Tripartite motif (TRIM) gene cluster with the root between gene d1 and the other genes. To find the true root, the evolutionary history should be traced first. In Figure 2-6A, branch E contains all 6 species; it is the common ancestor of all species (Boreoeutheria), so it could be the true root of the gene tree. Even though branch F does not include all species, it comprises primates and dog, so it is also a common ancestor of all species, with gene losses in rodents. Therefore, branches C, D, E and F are ancestors of Boreoeutheria, and any of them could be the true root of this gene tree. We check branch A first and find that it contains only the dog gene. We trace along the tree and use C as the new root, as shown in Figure 2-6B. Branch A is then combined with branch B to form a new ancestor of Boreoeutheria. Now, we check the children of the root and find that both of them are ancestors of Boreoeutheria; therefore, we stop and use branch C as the root. Consequently, even if branch C is not the real root of the gene tree of the TRIM cluster, e.g. the real root could be D, E or F, this does not affect the inference of orthologous genes.


Figure 2-6: Changing the gene tree of the TRIM gene cluster via re-rooting. (A) Gene tree before re-rooting. (B) Gene tree after re-rooting.

2.2.4 Orthology identification

In order to infer orthologous genes for each gene cluster, the PHYLIP package

(Felsenstein et al. 2002) is used to generate the phylogenetic tree. The Maximum Likelihood

method with bootstrapping of 100 replicates is applied. Subsequently, orthologous genes are

determined from the tree. Three types of orthologous relationships, i.e., one-to-one, one-to-many, and many-to-many, can be found in the tree. Figure 2-7A shows the one-to-one relationship. In this situation, A and B are orthologous to each other. Figure 2-7B shows the one-to-many relationship. In this case, A is orthologous to B1 and B2. In the third case, shown in Figure 2-7C, A1 is orthologous to B1 and B2; similarly, A2 is orthologous to B1 and B2.
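The three relationship types can be made concrete with a short sketch. The function names relationship_type and ortholog_pairs are hypothetical; the inputs are assumed to be the genes of two species found in the two subtrees below a speciation node.

```python
def relationship_type(genes_a, genes_b):
    """Classify the orthologous relationship between two groups of genes."""
    if not genes_a or not genes_b:
        raise ValueError("both species need at least one gene")
    if len(genes_a) == 1 and len(genes_b) == 1:
        return "one-to-one"
    if len(genes_a) == 1 or len(genes_b) == 1:
        return "one-to-many"
    return "many-to-many"


def ortholog_pairs(genes_a, genes_b):
    """Every gene in one group is orthologous to every gene in the other."""
    return [(a, b) for a in genes_a for b in genes_b]
```

For the many-to-many case of Figure 2-7C, ortholog_pairs(["A1", "A2"], ["B1", "B2"]) lists all four pairs named in the text.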


Figure 2-7: Orthologous relationships within a phylogenetic tree. (A) One-to-one. (B) One-to-many. (C) Many-to-many.

In general, a bootstrap value is used to evaluate the reliability of a phylogenetic tree.

However, there are some drawbacks to using bootstrap values to estimate the reliability of

putative orthologous genes. Figure 2-8A shows one of these cases. The tree in Figure 2-8A is part

of the gene tree for the human and rhesus β-globin gene clusters. If we use a bootstrap value to

infer the reliability of putative orthologous genes, the reliability of orthology of HBG1 and R3 is

98%. The reliability of the orthology of HBG2 and R3 is also 98%. However, when I checked all

of the trees, i.e. 100 trees, generated by the bootstrap method, I found that 93 trees support the

orthology of HBG2 and R3, whereas only 62 trees support the orthology of HBG1 and R3. This result shows that a bootstrap value can differ substantially from the fraction of trees that actually support a given orthologous pair.

Another case is shown in Figure 2-8B. The tree in Figure 2-8B is the gene tree for human and mouse in the β-globin gene cluster. The branches with large bootstrap values, e.g. 1000, 998 and 988, represent the reliability of orthologs correctly. However, the branches with lower bootstrap values, e.g. 510 and 452, do not show the reliability of orthologs accurately. For example, the bootstrap value suggests that the reliability of the orthology of HBG1 and Hbb-bh is 45.2%. However, a total of 591 trees support the orthology of HBG1 and Hbb-bh, which means that the reliability of the orthology of HBG1 and Hbb-bh should be 59.1%. Thus, if we would like to take orthologous genes with lower reliability into consideration, the bootstrap values should be used with caution.
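The per-pair reliability used in these examples (e.g. 591 supporting trees out of 1000 giving 59.1%) can be computed as sketched below. The function name ortholog_support is hypothetical, and each bootstrap replicate is assumed to have already been reduced to the set of ortholog pairs inferred from its tree.

```python
from collections import Counter


def ortholog_support(replicate_pairs):
    """Fraction of bootstrap trees supporting each ortholog pair.

    replicate_pairs: list with one entry per bootstrap tree; each entry is an
    iterable of (gene_a, gene_b) pairs inferred from that tree.
    """
    n_trees = len(replicate_pairs)
    if n_trees == 0:
        raise ValueError("at least one bootstrap tree is required")
    counts = Counter()
    for pairs in replicate_pairs:
        counts.update(set(pairs))  # count each pair at most once per tree
    return {pair: count / n_trees for pair, count in counts.items()}
```

Unlike a single bootstrap value attached to one branch, this count credits a pair in every replicate tree from which it is inferred, regardless of how the rest of the tree is arranged.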


Figure 2-8: Some drawbacks of using a bootstrap value as the estimated reliability of orthologous genes. (A) Part of the gene tree of the β-globin gene cluster. (B) Gene tree for human and mouse in the β-globin gene cluster.

Therefore, in order to evaluate the reliability of orthologs accurately, in this project, instead of using a bootstrap value to indicate the reliability of orthologs, we check all of the trees, i.e. 100 trees, generated by the bootstrapping method and count how many trees support the orthology of each pair of genes. For this purpose, a program, infer_orthologs, is used to infer orthologs automatically for each gene tree. Before executing the infer_orthologs program, some evolutionary information, as shown in Figure 2-9, should be collected as follows:

(1) If a node contains only genes of human and chimp, it is the common ancestor of

human and chimp.

(2) If a node contains genes of rhesus and at least one gene of human or chimp (but no

genes of other species), it is the common ancestor of primates (among the six species

used here).

(3) If a node contains only genes of mouse and rat, it is the common ancestor of rodents.

(4) If one child of a node is the common ancestor of rodents, and the other child is one

of: (1) the common ancestor of primates, (2) the common ancestor of human and

chimp, (3) rhesus, (4) chimp, or (5) human, then this node is the common ancestor of

Euarchontoglires.

(5) If one child of a node contains genes of dog, and the other child of the node is one of:

(1) the common ancestor of Euarchontoglires, (2) common ancestor of rodents, (3)

common ancestor of primates, (4) common ancestor of chimp and human, (5) mouse,

(6) rat, (7) rhesus, (8) chimp, or (9) human, then this node is the common ancestor of

Boreoeutheria.

(6) If one child of a node is the common ancestor of Boreoeutheria, then this node is the

ancestor of all species (in this set) and is not useful for inference of orthologs.


Figure 2-9: Evolutionary information for ortholog inference.

Using the above evolutionary information, the algorithm of infer_orthologs is as follows:

(1) For each leaf N that corresponds to a human gene, H, do;

(2) Check the type of the parent of N;

(3) If the parent of N is of type 1, report that H is orthologous to all genes of chimp in the

other child of the parent of N.

(4) If the parent of N is of type 2, report that H is orthologous to all genes of rhesus in

the other child of the parent of N.

(5) If the parent of N is of type 4, report that H is orthologous to all genes of mouse and

rat in the other child of the parent of N.

(6) If the parent of N is of type 5, report that H is orthologous to all genes of dog in the

other child of the parent of N.

(7) If N is the root or the parent of N is of type 6, go to step 1 (the next human gene); otherwise, assign the parent of N to N and go to step 2.
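The node types and the traversal above can be sketched in Python as follows. This is an illustrative reading of the rules, not the original infer_orthologs program: the Node class, the kind function and the gene names in any example are hypothetical, trees are assumed to be binary and rooted, rule (1) is read as requiring both human and chimp genes, and rule (6) is checked before rules (4) and (5). The separate handling of dog and rodents described earlier is not included.

```python
PRIMATES = {"human", "chimp", "rhesus"}
# Node type -> species whose genes are reported as orthologs of the human gene.
TARGETS = {1: {"chimp"}, 2: {"rhesus"}, 4: {"mouse", "rat"}, 5: {"dog"}}


class Node:
    def __init__(self, children=(), species=None, gene=None):
        self.children = list(children)
        self.species = species  # set for leaves only
        self.gene = gene        # set for leaves only
        self.parent = None
        for child in self.children:
            child.parent = self

    def leaves(self):
        if not self.children:
            return [self]
        return [leaf for c in self.children for leaf in c.leaves()]

    def species_set(self):
        return {leaf.species for leaf in self.leaves()}


def kind(node):
    """Leaf: its species name. Internal node: type 1-6 per the rules, else None."""
    if not node.children:
        return node.species
    sp = node.species_set()
    if sp == {"human", "chimp"}:
        return 1
    if "rhesus" in sp and sp <= PRIMATES and sp & {"human", "chimp"}:
        return 2
    if sp == {"mouse", "rat"}:
        return 3
    if len(node.children) != 2:
        return None
    a, b = node.children
    ka, kb = kind(a), kind(b)
    if 5 in (ka, kb):
        return 6
    for x, kx, ky in ((a, ka, kb), (b, kb, ka)):
        if kx == 3 and ky in (2, 1, "rhesus", "chimp", "human"):
            return 4
        if "dog" in x.species_set() and ky in (
                4, 3, 2, 1, "mouse", "rat", "rhesus", "chimp", "human"):
            return 5
    return None


def infer_orthologs(root):
    """Map each human gene to the set of genes inferred to be orthologous."""
    result = {}
    for leaf in root.leaves():
        if leaf.species != "human":
            continue
        found, n = set(), leaf
        while n.parent is not None:
            parent = n.parent
            k = kind(parent)
            if k == 6:
                break  # ancestor of all species: stop climbing
            if k in TARGETS:
                for sibling in parent.children:
                    if sibling is not n:
                        found.update(l.gene for l in sibling.leaves()
                                     if l.species in TARGETS[k])
            n = parent
        result[leaf.gene] = found
    return result
```

Running infer_orthologs on every bootstrap tree and counting how often each pair appears gives the per-pair reliability described above.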

2.2.5 Evaluation of alignments

To evaluate the quality of an alignment, sensitivity and specificity are calculated.

Informally, sensitivity measures the coverage of an alignment and specificity measures the


correctness of an alignment. The definitions of sensitivity and specificity are pictured in Figure 2-

10. Sensitivity is the ratio of correctly aligned orthologs to total true orthologs, and specificity

is the proportion of correctly aligned orthologs among all aligned sequences. In general, sensitivity and specificity can be calculated according to the following formulas.

Sensitivity = Correctly aligned orthologs (A ∩ T) / True orthologs (T) * 100 %

Specificity = Correctly aligned orthologs (A ∩ T) / Aligned sequences (A) * 100 %

Figure 2-10: Definitions of sensitivity and specificity.

However, since we obtain a subset of the true orthologs, the quality of an alignment is

evaluated using only the subset of an alignment, say, A′. Therefore, the formulas for sensitivity

and specificity could be modified as follows:

Sensitivity = (A′ ∩ T) / T * 100 %

Specificity = (A′ ∩ T) / A′ * 100 %

Moreover, a reliability value (Ri) is assigned to each ortholog (Ti) for each pair of genes.

Therefore, the formulas for sensitivity and specificity can be modified once more as follows:

\[ \text{Sensitivity} = \frac{\sum_i (A'_i \cap T_i)\, R_i}{\sum_i T_i\, R_i} \times 100\% \]

\[ \text{Specificity} = \frac{\sum_i (A'_i \cap T_i)\, R_i}{\sum_i A'_i\, R_i} \times 100\% \]
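The reliability-weighted measures can be computed as in the following sketch. The function name weighted_scores and the input layout are assumptions: each ortholog pair i is represented by the size of A′i ∩ Ti (correctly aligned), the size of Ti (true), the size of A′i (aligned), and its reliability Ri, with sizes measured in whatever unit the evaluation uses.

```python
def weighted_scores(pairs):
    """Return (sensitivity, specificity) in percent.

    pairs: iterable of (correct, true, aligned, reliability) tuples, one per
    ortholog pair, where correct = |A'_i ∩ T_i|, true = |T_i|, aligned = |A'_i|.
    """
    pairs = list(pairs)
    numerator = sum(c * r for c, _, _, r in pairs)
    true_total = sum(t * r for _, t, _, r in pairs)
    aligned_total = sum(a * r for _, _, a, r in pairs)
    if true_total == 0 or aligned_total == 0:
        raise ValueError("weighted totals must be non-zero")
    sensitivity = 100.0 * numerator / true_total
    specificity = 100.0 * numerator / aligned_total
    return sensitivity, specificity
```

When every reliability Ri equals 1, the two values reduce to the unweighted formulas given earlier in this section.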


2.3 Results

2.3.1 Analysis of ortholog assignments for the α- and β-globin gene clusters

To assess the accuracy of inferred orthology relationships, we manually curated a set of ortholog assignments for the genes of the mammalian α- and β-globin gene clusters. Some of the ortholog assignments are based on published literature (e.g., ENCODE Project Consortium 2004; Hou 2007; Prychitko 2005; Aguileta 2004; Aguileta 2006a; Aguileta 2006b); others are determined by analysis of protein sequences and flanking non-coding regions. The predicted ortholog assignments of HBB-related genes for 25 mammalian species are shown in Table 2-1. The coordinates for these genes are given in Table 2-2.

Table 2-1: Ortholog assignment of β-globin gene cluster.

Human   1=HBB   2=HBD   3=HBBP1   4=HBG1   5=HBG2   6=HBE1
        beta    delta   eta       Agamma   Ggamma   epsilon
Chimp   1   2   3   4, 5   4, 5   6
Gibbon   1   2   3   4, 5   4, 5   6
Colobus   1   2   3   4, 5   4, 5   6
Rhesus   1   2   3   4, 5   4, 5   6
Baboon   1   2   3   4, 5   4, 5   6
Green_monkey   1   2   3   4   4   5
Night_monkey   1   2   3   4   4   5
Squirrel_monkey   1   2   3   4, 5   4, 5   6
Titi   1   2   3   4, 5   4, 5   6
Marmoset   1   2   3   4, 5   4, 5   6
Galago   1   2   3   4   4   5
Mouse   1, 2   3, 4   Del   5   5   Partial6, 7
Rat   1, 2, 3, 4   5   Del   6, 7   6, 7   8, 9
Rabbit   1   2   Del   3   3   4
Guinea_pig   1   Del   2   2   3
St_squirrel   1, 2, 3   Partial4   Del   5   5   6
Lemur   1   2   Del   3   3   4
RfBat   1   2   3   Del   Del   4
Dog   1   2, 3   4   5
Cat   1   2   3   Partial4   Partial4   5
Cow   4, 10   2, 6   1, 5
Armadillo   1   2   Partial3   4   4   5
Elephant   1   2   Del   3   3   4
Opossum   2   2   1   1   1   1

Table 2-2: The coordinates of all genes for 25 species of mammals in the β-globin gene cluster.

Gene  Chromosome/Accession ID  Start  End  Strand
Human 1 Chr11 5203404 5204827 -
Human 2 Chr11 5210770 5212239 -
Human 3 Chr11 5219926 5221346 -
Human 4 Chr11 5226165 5227610 -
Human 5 Chr11 5231083 5232534 -
Human 6 Chr11 5246275 5247696 -
Chimp 1 Chr11 5029505 5032159 -
Chimp 2 Chr11 5039233 5040700 -
Chimp 3 Chr11 5048413 5049823 -
Chimp 4 Chr11 5054636 5055871 -
Chimp 5 Chr11 5060341 5061790 -
Chimp 6 Chr11 5075532 5076957 -
Gibbon 1 NT_166576 332411 333847 -
Gibbon 2 NT_166576 339814 341284 -
Gibbon 3 NT_166576 349975 351354 -
Gibbon 4 NT_166576 356370 357817 -
Gibbon 5 NT_166576 361278 362725 -
Gibbon 6 NT_166576 378650 380074 -
Colobus 1 NT_165846 418227 419667 -
Colobus 2 NT_165846 425558 427018 -
Colobus 3 NT_165846 434743 436186 -
Colobus 4 NT_165846 440978 442423 -
Colobus 5 NT_165846 445868 447313 -
Colobus 6 NT_165846 460244 461670 -
Rhesus 1 Chr14 68130102 68131520 -
Rhesus 2 Chr14 68146918 68148358 -
Rhesus 3 Chr14 68151793 68153236 -
Rhesus 4 Chr14 68158107 68159542 -
Rhesus 5 Chr14 68167353 68168810 -
Rhesus 6 Chr14 68174759 68176191 -
Baboon 1 NT_086568 747537 748969 -
Baboon 2 NT_086568 754887 756351 -
Baboon 3 NT_086568 764098 765533 -
Baboon 4 NT_086568 770262 771704 -
Baboon 5 NT_086568 775139 776579 -
Baboon 6 NT_086568 791112 792544 -
Green_monkey 1 NT_166581 775438 776870 -
Green_monkey 2 NT_166581 782818 784281 -
Green_monkey 3 NT_166581 792005 793438 -
Green_monkey 4 NT_166581 798192 799627 -


Green_monkey 5 NT_166581 816319 817738 -
Night_monkey 1 NT_165722 524903 526312 -
Night_monkey 2 NT_165722 531358 532816 -
Night_monkey 3 NT_165722 539755 541181 -
Night_monkey 4 NT_165722 545620 547096 -
Night_monkey 5 NT_165722 554346 555729 -
Squirrel_monkey 1 NT_166585 417410 418822 -
Squirrel_monkey 2 NT_166585 423782 425234 -
Squirrel_monkey 3 NT_166585 435570 436972 -
Squirrel_monkey 4 NT_166585 439240 440694 -
Squirrel_monkey 5 NT_166585 444442 445890 -
Squirrel_monkey 6 NT_166585 453442 454964 -
Titi 1 NT_165821 539760 541152 -
Titi 2 NT_165821 547003 548466 -
Titi 3 NT_165821 556636 558059 -
Titi 4 NT_165821 562503 563957 -
Titi 5 NT_165821 567426 568879 -
Titi 6 NT_165821 576362 577752 -
Marmoset 1 NT_091705 799695 801093 -
Marmoset 2 NT_091705 805843 807281 -
Marmoset 3 NT_091705 815553 816968 -
Marmoset 4 NT_091705 821361 822797 -
Marmoset 5 NT_091705 826241 827737 -
Marmoset 6 NT_091705 835204 836588 -
Galago 1 NT_086566 1125252 1126673 -
Galago 2 NT_086566 1130189 1131703 -
Galago 3 NT_086566 1146344 1147311 -
Galago 4 NT_086566 1151156 1152668 -
Galago 5 NT_086566 1158842 1160220 -
Mouse 1 NT_165679 1079040 1080253 -
Mouse 2 NT_165679 1093049 1094259 -
Mouse 3 NT_165679 1099010 1100116 -
Mouse 4 NT_165679 1105529 1106811 -
Mouse 5 NT_165679 1108136 1109494 -
Mouse 6 NT_165679 1116404 1116536 -
Mouse 7 NT_165679 1118258 1119536 -
Rat 1 Chr1 161578260 161579457 -
Rat 2 Chr1 161584986 161586183 -
Rat 3 Chr1 161597639 161598074 -
Rat 4 Chr1 161618910 161620145 -
Rat 5 Chr1 161624622 161625607 -
Rat 6 Chr1 161634963 161636347 -
Rat 7 Chr1 161640358 161641736 -
Rat 8 Chr1 161647650 161649278 -
Rat 9 Chr1 161651421 161652701 -
Rabbit 1 NT_165813 622669 623811 -
Rabbit 2 NT_165813 631228 632524 -


Rabbit 3 NT_165813 637867 639251 -
Rabbit 4 NT_165813 647414 648791 -
Guinea_pig 1 NT_166780 502793 503896 -
Guinea_pig 2 NT_166780 514250 514685 -
Guinea_pig 3 NT_166780 518244 519595 -
St_squirrel 1 NT_166768 615189 616319 -
St_squirrel 2 NT_166768 643976 645088 -
St_squirrel 3 NT_166768 666603 667736 -
St_squirrel 4 NT_166768 669893 670726 -
St_squirrel 5 NT_166768 675042 676441 -
St_squirrel 6 NT_166768 679828 681173 -
Lemur 1 NT_166774 223937 225366 -
Lemur 2 NT_166774 228333 229697 -
Lemur 3 NT_166774 233486 234900 -
Lemur 4 NT_166774 239318 240728 -
RfBat 1 NT_113412 551055 552238 -
RfBat 2 NT_113412 558587 560123 -
RfBat 3 NT_113412 564514 565940 -
RfBat 4 NT_113412 573968 575358 -
Dog 1 Chr21 31298352 31299520 -
Dog 2 Chr21 31300785 31302078 -
Dog 3 Chr21 31306898 31308165 -
Dog 4 Chr21 31312506 31313901 -
Dog 5 Chr21 31322648 31324046 -
Cat 1 NT_165815 536737 537872 -
Cat 2 NT_165815 540807 542135 -
Cat 3 NT_165815 544451 545875 -
Cat 4 NT_165815 550400 550511 -
Cat 5 NT_165815 555927 557290 -
Cow 1 Chr15 31522476 31524126 +
Cow 2 Chr15 31531772 31533181 +
Cow 4 Chr15 31547757 31549221 +
Cow 5 Chr15 31563993 31565602 +
Cow 6 Chr15 31573780 31575190 +
Cow 10 Chr15 31619421 31620844 +
Armadillo 1 NT_112068 605550 606730 -
Armadillo 2 NT_112068 609362 610625 -
Armadillo 3 NT_112068 628608 628738 -
Armadillo 4 NT_112068 631305 632698 -
Armadillo 5 NT_112068 649394 650781 -
Elephant 1 NT_165610 1149655 1150955 -
Elephant 2 NT_165610 1158946 1159428 -
Elephant 3 NT_165610 1215074 1215569 -
Elephant 4 NT_165610 1218525 1219897 -
Opossum 1 ChrUn 56617933 56619943 -
Opossum 2 ChrUn 56642825 56644576 -


We used this manually prepared set of ortholog assignments to analyze the robustness of the automatic, tree-based method described above. In theory, the automatic method would start by constructing the phylogenetic tree for all complete genes that we predicted for the 25 species in the β-globin gene cluster, using Maximum Likelihood. However, because a single tree containing all of these genes would be so large that its reliability would be low, we divided the genes into four categories, Primates, Rodentia, Laurasiatheria and Atlantogenata, and constructed a separate phylogenetic tree for each category. The results are shown in Figure 2-11.

In three categories, Primates, Rodentia and Laurasiatheria, almost all orthologous genes fall in the same subtree, so the phylogenetic tree is a reliable basis for identifying orthologs. One exception is the β and δ genes, which are mixed together. The main reason is gene conversion. As shown in Figure 2-12, gene conversion is the process in which the sequence of one gene is replaced by that of another, similar gene, while gene duplication refers to duplication of a region of DNA that contains a gene. (For more details, see Chen 2007.) It is difficult to distinguish gene conversion from gene duplication. One possible solution is to reconstruct the ancestral gene cluster. When two similar genes are found in one species, but its ancestor contains only one copy, this suggests that a gene duplication occurred. Otherwise, gene conversion is likely.


Figure 2-11: Phylogenetic trees for 25 species in β-globin gene cluster. (A) Primates. (B) Rodentia. (C) Laurasiatheria. (D) Atlantogenata.

In the Atlantogenata category shown in Figure 2-11D, γ and ε genes are also mixed

together. One possible reason is multiple substitutions at the same sites. Another possible reason

is “low stemminess”, which means that internal branches are much shorter than terminal branches.


Low stemminess decreases the probability that the correct topology will be found (Smith et al.

1994). In this project, species in the Atlantogenata category are only used as outgroups; they are not used in the evaluation and therefore cannot affect the accuracy of ortholog assignments.

Figure 2-12: Gene conversion and gene duplication. (A) Gene conversion. (B) Gene duplication.

Species whose speciation events are very close in time cannot be distinguished well using a phylogenetic tree. For instance, the β-globin gene tree for Laurasiatheria does not match its species tree very well. Therefore, when choosing species for the evaluation, we prefer species whose tree lacks short internal branches. In this project, 6 species, human, chimp, rhesus, mouse, rat and dog, are chosen to evaluate the performance of MSA programs, though extra care is needed because dogs separated from primates only shortly before rodents did. In addition, three species, opossum, chicken and zebra fish, are chosen as outgroups. The phylogenetic tree generated using genes of these species is therefore generally reliable.

Ortholog assignments for the α-globin gene cluster are also studied in this project. The predicted ortholog assignments of HBA-related genes for 30 mammalian species are shown in Table 2-3. The coordinates for these genes are given in Table 2-4.

Table 2-3: The predicted ortholog assignments of HBA-related genes for 30 mammalian species.

Human 1=HBZ 2=HBZP 3=HBM 4=HBAP1 5=HBA2 6=HBA1 7=HBQ1 Gibbon partial1, partial1, 2 3 4 5, 6 5, 6 7 2 Colobus 1, 2, 1, 2, 4 5 6, 7 6, 7 8


partial3 partial3 Rhesus partial1, partial1, 3 4 5 5 6 partial2 partial2 Baboon 1, 2 1, 2 3 4 5 5 6 Owl_monke 1 1 2 3, 4 3, 4 5 y Squirrel_mo 1, 2 1, 2 3 4 4 5 nkey Vervet 1 1 2 3, 4 3, 4 5 Marmoset 1 1 2 partial3, partial3, 5 4 4 Titi 1 1 2 3, 4, 5, 6 3, 4, 5, 6 7 Lemur 1, 2 1, 2 3 4, 5 4, 5 4, 5 6 Tree_shrew 1, 2 1, 2 3 3 3 4 Galago 1, 2 1, 2 3 4, 5 4, 5 4, 5 6 Mouse 1 1 2, partial4, 2, 2, 3, 6 5 partial4, partial4, 5 5 Rat 1 1 2, 4, 6 2, 4, 6 2, 4, 6 3, 5, 7 Rabbit 1, 2, 3, 7, 1, 2, 3, 7, 4, 4, 5, 8, 13 10, 11, 10, 11, 12, 12, partial15 partial15 Guinea_pig 1 1 2 3 3 3 4 St_squirrel 1 1 2, 3, 4, 5 2, 3, 4, 5 2, 3, 4, 5 6 RfBat 1 1 2 3, 4, 5 3, 4, 5 3, 4, 5 6 Sbbat 1, 2 1, 2 3 4, 5 4, 5 4, 5 6, 7 Cat partial1, partial1, 2 3 4 4 4 2 Cow 1, 2 1, 2 3 4, 5 4, 5 4, 5 6 Horse 1 1 2, 3 2, 3 2, 3 4 Hedgehog 1, 2, 3 1, 2, 3 4 5, 6, 5, 6, 5, 6, 8 partial7 partial7 partial7 Shrew 1, 2 1, 2 3 Partial4, 5, Partial4, Partial4, 6 5, 6 5, 6 Armadillo 1 1 2 3 3 3 Tenrec 1, 3 1, 3 2, 4 2, 4 2, 4 5 Elephant 1, 2 1, 2 3 4, 5 4, 5 4, 5 6 Rock_hyrax 1, 2 1, 2 3 4 4 4 5 Opossum 1, 2 1, 2 3, 4 3, 4 3, 4

Table 2-4: Coordinates of all genes for 30 mammalian α-globin gene clusters.

Gene  Chromosome/Accession ID  Start  End  Strand
Human 1 Chr16 142909 144399 +
Human 2 Chr16 153121 155155 +


Human 3 Chr16 155997 156707 +
Human 4 Chr16 158678 159342 +
Human 5 Chr16 162912 163599 +
Human 6 Chr16 166716 167410 +
Human 7 Chr16 170486 171107 +
Gibbon 1 NT_166744 266961 267324 +
Gibbon 2 NT_166744 275689 277414 +
Gibbon 3 NT_166744 278252 278961 +
Gibbon 4 NT_166744 280906 281549 +
Gibbon 5 NT_166744 285153 285845 +
Gibbon 6 NT_166744 289445 290136 +
Gibbon 7 NT_166744 293543 294164 +
Colobus 1 NT_166714 246123 247635 +
Colobus 2 NT_166714 255994 257570 +
Colobus 3 NT_166714 258391 259110 +
Colobus 4 NT_166714 261055 261717 +
Colobus 5 NT_166714 264064 264107 +
Colobus 6 NT_166714 268410 269085 +
Colobus 7 NT_166714 272393 273015 +
Colobus 8 NT_166714 280848 281470 +
Rhesus 1 Chr20 111747 113241 +
Rhesus 2 Chr20 121757 123507 +
Rhesus 3 Chr20 124346 125052 +
Rhesus 4 Chr20 126991 127653 +
Rhesus 5 Chr20 136577 137221 +
Rhesus 6 Chr20 140609 141231 +
Baboon 1 NT_086381 188123 189635 +
Baboon 2 NT_086381 197994 199570 +
Baboon 3 NT_086381 200391 201110 +
Baboon 4 NT_086381 203055 203717 +
Baboon 5 NT_086381 210410 211085 +
Baboon 6 NT_086381 214393 215015 +
Owl_monkey 1 NT_165721 143253 144586 +
Owl_monkey 2 NT_165721 145425 146133 +
Owl_monkey 3 NT_165721 150202 150867 +
Owl_monkey 4 NT_165721 155114 155779 +
Owl_monkey 5 NT_165721 158516 159143 +
Squirrel_monkey 1 NT_166559 217372 218787 +
Squirrel_monkey 2 NT_166559 228095 229600 +
Squirrel_monkey 3 NT_166559 230423 231131 +
Squirrel_monkey 4 NT_166559 235158 235845 +
Squirrel_monkey 5 NT_166559 238563 239191 +
Vervet 1 NT_166715 187008 188685 +
Vervet 2 NT_166715 189525 190233 +
Vervet 3 NT_166715 194050 194963 +
Vervet 4 NT_166715 200330 201018 +
Vervet 5 NT_166715 204137 204759 +


Marmoset 1 NT_108068 214911 216190 +
Marmoset 2 NT_108068 217027 217728 +
Marmoset 3 NT_108068 221946 222022 +
Marmoset 4 NT_108068 226276 226946 +
Marmoset 5 NT_108068 229644 230242 +
Titi 1 NT_166770 206973 208236 +
Titi 2 NT_166770 208967 209677 +
Titi 3 NT_166770 220243 220906 +
Titi 4 NT_166770 224616 225279 +
Titi 5 NT_166770 228980 229643 +
Titi 6 NT_166770 233352 234015 +
Titi 7 NT_166770 236946 237556 +
Lemur 1 NT_166713 34397 35797 +
Lemur 2 NT_166713 40427 41786 +
Lemur 3 NT_166713 42573 43285 +
Lemur 4 NT_166713 44916 45610 +
Lemur 5 NT_166713 47322 48014 +
Lemur 6 NT_166713 51297 51907 +
Tree_shrew 1 NT_166758 138777 140430 +
Tree_shrew 2 NT_166758 154269 155888 +
Tree_shrew 3 NT_166758 160648 161300 +
Tree_shrew 4 NT_166758 164029 164608 +
Galago 1 NT_086562 394915 396136 +
Galago 2 NT_086562 411332 412952 +
Galago 3 NT_086562 413712 414392 +
Galago 4 NT_086562 415855 416573 +
Galago 5 NT_086562 421719 422437 +
Galago 6 NT_086562 424390 425031 +
Mouse 1 NT_165676 328005 329349 +
Mouse 2 NT_165676 335048 335732 +
Mouse 3 NT_165676 338354 338983 +
Mouse 4 NT_165676 345971 346427 +
Mouse 5 NT_165676 347865 348549 +
Mouse 6 NT_165676 351459 352092 +
Rat 1 Chr10 15564721 15566039 -
Rat 2 Chr10 15571881 15572609 -
Rat 3 Chr10 15574639 15575272 -
Rat 4 Chr10 15585323 15586051 -
Rat 5 Chr10 15588570 15589200 -
Rat 6 Chr10 15597520 15598248 -
Rat 7 Chr10 15600245 15600684 -
Rabbit 1 NT_165211 169451 170303 +
Rabbit 2 NT_165211 174747 175599 +
Rabbit 3 NT_165211 178552 179402 +
Rabbit 4 NT_165211 183031 183619 +
Rabbit 5 NT_165211 186069 186676 +
Rabbit 6 NT_165211 191713 192163 +


Rabbit 7 NT_165211 194683 195531 +
Rabbit 8 NT_165211 199286 199893 +
Rabbit 9 NT_165211 204927 205377 +
Rabbit 10 NT_165211 207901 208753 +
Rabbit 11 NT_165211 212799 213651 +
Rabbit 12 NT_165211 217634 218482 +
Rabbit 13 NT_165211 222302 222909 +
Rabbit 14 NT_165211 227983 228433 +
Rabbit 15 NT_165211 230994 231575 +
Guinea_pig 1 NT_166548 197709 199275 +
Guinea_pig 2 NT_166548 200032 200597 +
Guinea_pig 3 NT_166548 202786 203445 +
Guinea_pig 4 NT_166548 205960 206592 +
St_squirrel 1 NT_166245 159072 160608 +
St_squirrel 2 NT_166245 162186 162807 +
St_squirrel 3 NT_166245 163209 163893 +
St_squirrel 4 NT_166245 166670 167349 +
St_squirrel 5 NT_166245 171557 172244 +
St_squirrel 6 NT_166245 175077 175710 +
RfBat 1 NT_112077 202203 203489 +
RfBat 2 NT_112077 204298 205007 +
RfBat 3 NT_112077 206743 207559 +
RfBat 4 NT_112077 211441 212263 +
RfBat 5 NT_112077 216146 216969 +
RfBat 6 NT_112077 218953 219609 +
Sbbat 1 NT_165870 94475 95547 +
Sbbat 2 NT_165870 100616 101085 +
Sbbat 3 NT_165870 101861 102578 +
Sbbat 4 NT_165870 104741 105532 +
Sbbat 5 NT_165870 105532 110209 +
Sbbat 6 NT_165870 112378 113060 +
Sbbat 7 NT_165870 118507 119192 +
Cat 1 NT_165807 150836 151655 +
Cat 2 NT_165807 157238 158609 +
Cat 3 NT_165807 159413 160096 +
Cat 4 NT_165807 161814 162500 +
Cow 1 Chr25 2612361 2613372 +
Cow 2 Chr25 2626298 2627420 +
Cow 3 Chr25 2628186 2628878 +
Cow 4 Chr25 2630410 2631043 +
Cow 5 Chr25 2633464 2634097 +
Cow 6 Chr25 2636351 2637004 +
Horse 1 NT_166686 104486 105690 +
Horse 2 NT_166686 118513 119366 +
Horse 3 NT_166686 122938 123791 +
Horse 4 NT_166686 126040 126649 +
Hedgehog 1 NT_165884 73048 74132 +


Hedgehog 2 NT_165884 80647 81768 +
Hedgehog 3 NT_165884 87780 88949 +
Hedgehog 4 NT_165884 89474 90075 +
Hedgehog 5 NT_165884 91505 92276 +
Hedgehog 6 NT_165884 95485 96253 +
Hedgehog 7 NT_165884 99425 99918 +
Hedgehog 8 NT_165884 105386 105956 +
Shrew 1 NT_165845 43806 44378 +
Shrew 2 NT_165845 52368 53355 +
Shrew 3 NT_165845 53945 54581 +
Shrew 4 NT_165845 56185 56446 +
Shrew 5 NT_165845 60572 61151 +
Shrew 6 NT_165845 61629 62285 +
Armadillo 1 NT_113346 215339 216238 +
Armadillo 2 NT_113346 224167 224860 +
Armadillo 3 NT_113346 226720 227450 +
Tenrec 1 NT_165962 59557 60585 +
Tenrec 2 NT_165962 61855 62544 +
Tenrec 3 NT_165962 74855 76050 +
Tenrec 4 NT_165962 79630 80319 +
Tenrec 5 NT_165962 82464 83114 +
Elephant 1 NT_113506 115949 117348 +
Elephant 2 NT_113506 155159 156610 +
Elephant 3 NT_113506 157369 158086 +
Elephant 4 NT_113506 159725 160476 +
Elephant 5 NT_113506 164453 165230 +
Elephant 6 NT_113506 167109 167830 +
Rock_hyrax 1 NT_166712 97888 98811 +
Rock_hyrax 2 NT_166712 112236 113563 +
Rock_hyrax 3 NT_166712 114536 114950 +
Rock_hyrax 4 NT_166712 116423 117194 +
Rock_hyrax 5 NT_166712 118987 119696 +
Opossum 1 Chr8 152911455 152913059 +
Opossum 2 Chr8 152934930 152937304 +
Opossum 3 Chr8 152943697 152944695 +
Opossum 4 Chr8 152962837 152963835 +
Opossum 5 Chr8 152976179 152977128 +

The genes of the α-globin clusters are divided into four categories, Primates, Rodentia, Laurasiatheria and Atlantogenata, and a phylogenetic tree is constructed separately for each category. The results are shown in Figure 2-13. The trees suggest that some gene conversions occurred between the α and ζ genes.


Figure 2-13: Phylogenetic trees for 30 species in the α-globin gene cluster. (A) Primates. (B) Rodentia. (C) Laurasiatheria. (D) Atlantogenata.


2.3.2 Comparison of different alignment programs

In this project, the performance of four alignment programs, MLAGAN (Brudno et al. 2003), MAVID (Bray et al. 2004), TBA (Blanchette et al. 2004) and ROAST (Hou et al. 2005), is studied. The β-globin gene cluster (Hardies et al. 1984) from the ENCODE project (ENCODE Project Consortium 2004) is used to evaluate the quality of the alignments that these programs produce. The results are shown in Figure 2-14. Figure 2-14A shows the sensitivity of the different alignment programs. In general, the sensitivity of ROAST is higher than that of MAVID, MLAGAN and TBA. Figure 2-14B shows the specificity of these alignments. The performance of ROAST is superior to that of the others for most species.

Figure 2-14: Sensitivity and specificity among different alignment programs. (A) Sensitivity. (B) Specificity.

2.3.3 Comparison of different gene clusters

While we limited evaluation of MLAGAN and MAVID to the two globin clusters because we had access to those alignments only in ENCODE regions, we can compare aligners developed in our lab on as many gene clusters as we care to analyze. In total, 13 gene clusters are used in this part of the project to evaluate two multiple alignment programs, ROAST and Multiz (Blanchette et al. 2004; Ovcharenko et al. 2005), the latter being the multi-aligner component of TBA (Blanchette et al. 2004). More detailed information on these 13 gene clusters can be found in Table 2-5. The results are shown in Figure 2-15. Among the 13 gene clusters, the sensitivities computed for three, IFN (Interferon genes), SPRR (Small proline rich genes) and UGT2 (UDP glycosyltransferase 2 family), are worse than the others. All three of these gene clusters contain many orthologous genes with many-to-many relationships, in which a gene in one species is orthologous to several genes in another species and, at the same time, several genes in the first species are orthologous to the same gene in the other species. This suggests that existing multiple-alignment programs do not align orthologous genes with many similar copies very well. Better handling of such clusters could be a goal for the next generation of multiple-alignment programs.

Table 2-5: Detailed information for 13 gene clusters.

Gene cluster  Assembly  Chromosome  Start      End
BTN           hg18      Chr6        26473376   26618629
              panTro2   Chr6        26858142   27008080
              rheMac2   Chr4        26354354   26543852
              mm8       Chr13       23464439   23496528
              rn4       Chr17       48820880   48866461
              canFam2   Chr35       27266479   27309890
CCL           hg18      Chr17       29606408   29714365
              panTro2   Chr17       22836198   22943546
              rheMac2   Chr16       29551323   29651974
              mm8       Chr11       81851766   81996013
              rn4       Chr10       70256257   70383350
              canFam2   Chr9        42255741   42331681
CYP2          hg18      Chr19       46041283   46405283
              panTro2   Chr19       46413632   46769682
              rheMac2   Chr19       47176054   47593535
              mm8       Chr7        25511236   26866693
              rn4       Chr1        81008468   82232539
              canFam2   Chr1        115644073  115900686
HB            hg18      Chr11       5203271    5247949
              panTro2   Chr11       5029373    5077211
              rheMac2   Chr14       68129848   68176323
              mm8       Chr7        103686346  103727225
              rn4       Chr1        161618781  161652953
              canFam2   Chr21       31296968   31323039
HLA-D         hg18      Chr6        32515624   33162954
              panTro2   Chr6        33017771   33784708
              rheMac2   Chr4        32062686   32802444
              mm8       Chr17       33683739   33954294
              rn4       Chr20       4636293    4917100
              canFam2   Chr12       5136957    5609734
IFN           hg18      Chr9        21067104   21472312
              panTro2   Chr9        21565913   21997232
              rheMac2   Chr15       55523382   55916613
              mm8       Chr4        87993248   88352020
              rn4       Chr5        107837418  108453194
              canFam2   Chr11       43696170   43910824
MMP           hg18      Chr11       101896449  102331672
              panTro2   Chr11       101166490  101608172
              rheMac2   Chr14       101110165  101567740
              mm8       Chr9        7272528    7699281
              rn4       Chr8        4158870    4533618
              canFam2   Chr5        31862392   32204859
MT            hg18      Chr16       55156461   55275608
              panTro2   Chr16       55996094   56112413
              rheMac2   Chr20       54886386   55043405
              mm8       Chr8        97026307   97076616
              rn4       Chr19       11255276   11302185
              canFam2   Chr2        62483105   62523478
OR            hg18      Chr7        143263258  143647081
              panTro2   Chr7        144359930  144880820
              rheMac2   Chr3        181320547  181693451
              mm8       Chr6        42670518   43215119
              rn4       Chr4        70622054   71049677
              canFam2   Chr16       8633593    8984667
SERPINA       hg18      Chr14       93819404   94183083
              panTro2   Chr14       94532124   94895387
              rheMac2   Chr7        157510532  157862246
              mm8       Chr12       104017723  104821011
              rn4       Chr6        127887682  128473074
              canFam2   Chr8        66343080   66646774
SPRR          hg18      Chr1        151209751  151389231
              panTro2   Chr1        132075525  132259588
              rheMac2   Chr1        131475193  131656954
              mm8       Chr3        92280817   92587163
              rn4       Chr2        185086475  185493716
              canFam2   Chr17       64847006   64918886
TRIM          hg18      Chr11       5573922    5688668
              panTro2   Chr11       5429080    5547846
              rheMac2   Chr14       67623262   67763703
              mm8       Chr7        104092629  104244403
              rn4       Chr1        162059723  162190566
              canFam2   Chr21       31792348   31892437
UGT2          hg18      Chr4        69546932   70548006
              panTro2   Chr4        60980943   61774067
              rheMac2   Chr5        59967909   61087897
              mm8       Chr5        87965616   88561354
              rn4       Chr14       22047560   22786367
              canFam2   Chr13       61814461   62140623

Figure 2-15: Sensitivity and specificity of different gene clusters. (A) Sensitivity. (B) Specificity.

2.3.4 Comparison of different species

We have also examined how alignment quality varies among species in our approach to evaluating MSA programs. In particular, the sensitivity and specificity for five species (chimp, rhesus, mouse, rat and dog) are analyzed. The results are shown in Figure 2-16. Performance for the rodents, i.e. mouse and rat, is worse than for the other species. Although the speciation of dog (relative to human) preceded that of rodents, the substitution rate in rodents is faster than in the dog lineage. Rodent genes are therefore less similar to human genes than dog genes are, so multiple-alignment programs do not align the rodent sequences as well. Thus, another main challenge for the next generation of multiple-alignment programs could be to align sequences from distant species more accurately.

Figure 2-16: Sensitivity and specificity of different species. (A) Sensitivity. (B) Specificity.

2.4 Conclusion

The work described above represents the initial portion of a project that is a component

of a larger effort in the Miller lab to produce an improved suite of tools for evaluating MSAs.

This effort includes (1) building an improved sequence-evolution simulator that consolidates and

extends earlier work in the lab (Blanchette 2004; Ovcharenko 2005; Ma 2007; Zhang 2007), (2)

use and enhancement of tools and ideas developed elsewhere for evaluating whole-genome

alignments (Margulies 2007; Prakash and Tompa 2007; Wang 2007) and (3) development of new

tools to evaluate alignments in gene clusters. In each case, the aim is to produce robust tools that

can easily be run by others.

The third component of this effort, i.e., evaluation tools for gene-cluster MSAs, is our

responsibility. As described above, we have developed a method that uses phylogenetic trees to

identify orthologous genes within gene clusters. A detailed analysis in the α- and β-globin gene

clusters indicates that orthology relationships inferred this way are reasonably accurate. These

orthology assignments were applied to analyze the quality of alignments produced by several alignment programs, i.e., MAVID, MLAGAN, TBA and ROAST. The results show that the performance of ROAST is better than the others in terms of sensitivity and specificity. We also analyzed two aligners, ROAST and Multiz, on 13 gene clusters.

Chapter 3

Gene conversion detection between a pair of genes

In this chapter, we detect gene conversion between a pair of genes. A site-by-site compatibility method is proposed to detect the occurrence and directionality of gene conversion between a pair of genes. This method is applied to two data sets: the beta and delta genes, and the two gamma genes. Detailed gene conversion results for these two data sets are presented in this chapter.

3.1 Introduction

3.1.1 Motivation

Phylogenetic trees are used to infer orthology between genes by comparing the gene tree with the species tree. In some cases, however, the phylogenetic tree is not reliable, so the inferred orthology could be incorrect. For instance, Figure 3-1 shows the phylogenetic tree for the beta globin gene cluster, in which the beta genes and delta genes are mixed together; the phylogenetic tree cannot separate these two genes. The main reason is gene conversion: some delta genes have been converted by beta genes, so their sequences are closer to each other than expected. In this chapter, we deal with the gene conversion problem.


Figure 3-1: Phylogenetic tree of beta globin gene cluster.

3.1.2 What is gene conversion

Gene conversion is the replacement of one gene, or part of it, by the sequence of another similar gene. As shown in Figure 3-2A, there are initially two similar genes; after one of them is replaced by the other, there are two very similar genes. Figure 3-2B shows the process of gene duplication. The difference between gene conversion and gene duplication is that gene conversion can affect only part of a gene: part of one gene can be converted by another gene while the rest remains the same, and the result is still a complete gene. In gene duplication, in contrast, the whole gene must be duplicated; otherwise, the copy cannot form a complete gene.

Figure 3-2: Gene conversion and gene duplication. (A) Gene conversion. (B) Gene duplication.

3.1.3 Impact of gene conversion on the inference of orthology

Gene conversion makes the inference of phylogenetic trees more difficult. For example, Figure 3-3A shows a gene tree for five genes. Suppose there is a gene conversion event in some region between gene D and gene E. The gene tree for the converted region might then look like the tree shown in Figure 3-3B, so the relationships among these five genes will be inconsistent between regions. Current tree reconstruction methods cannot deal with this problem.

Figure 3-3: Effect of gene conversion. (A) Gene tree for five genes. (B) Phylogenetic tree for the conversion region.

3.1.4 Methods for gene conversion detection

Many gene conversion detection methods have been proposed. They can be classified into four types. Similarity methods look for highly conserved regions; however, high conservation is sometimes due to selection pressure rather than conversion. Phylogenetic methods look for regions whose phylogenetic trees differ from those of other regions. Compatibility methods find all parsimoniously informative sites and check whether they are compatible with each other. Substitution methods look for a significant clustering of substitutions or some other particular pattern.

3.1.5 Limitations of these methods

Most current methods, however, share several problems. First, they find gene conversion events only in the extant genes and cannot find events that occurred in the ancestors. Second, some of these methods cannot determine the directionality of gene conversion: they can find a gene conversion event between two genes but cannot tell which gene was converted. Finally, most of these methods do not handle multiple overlapping gene conversion events in the same gene very well.

3.2 Methods

We use a method called the site-by-site compatibility method, which checks all parsimoniously informative sites and finds all possible gene conversion events for all genes and their ancestors. This method was first proposed by Fitch in 1990, but it was applied by hand. Drouin proposed an automatic method in 1999; however, that method still has some problems. The first is its hypothesis of a gene conversion event: a conversion is inferred if the paralogous genes are identical and the orthologous genes are different. In some cases this is not correct. For example, Figure 3-4 shows the gene tree for four genes. According to Drouin's method, there is a gene conversion event between A1 and A2, and also one between B1 and B2. However, the correct inference depends on the status of their parent. If the status of the parent is T, there is only one gene conversion event, between B1 and B2. If the status of the parent is A, there is only one gene conversion event, between A1 and A2. Another problem is the directionality of gene conversion: no method was proposed to determine which gene was converted. Finally, there was also no method to determine the boundaries of gene conversion events.

Figure 3-4: An example showing the issues with Drouin’s method.

3.2.1 Site-by-site compatibility method

To deal with these problems, we first determine the status of all ancestors. We use Fitch’s algorithm to obtain a parsimonious assignment of statuses to the internal nodes of the tree. Fitch’s algorithm has two phases, a bottom-up phase and a top-down phase. The bottom-up phase finds the possible statuses $s_i$ of each internal node i with children j and k using the following formula:

(3-1)

Figure 3-5 shows an example of the bottom-up phase. The possible statuses for all

internal nodes are determined.

Figure 3-5: An example of bottom-up phase.

The top-down phase then determines the final status $s_j$ of each internal node j with parent i using the following formula:

(3-2)

Figure 3-6 shows an example in which the final statuses for all internal nodes are

determined.


Figure 3-6: An example of top-down phase.
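The two phases can be sketched in Python for a single alignment column. This is a minimal sketch of the textbook form of Fitch's algorithm (on the way up, take the intersection of the children's candidate sets if it is non-empty and their union otherwise; on the way down, keep the parent's status whenever it is among the node's candidates). The nested-tuple tree encoding, the function names, the example tree, and the tie-breaking rule (alphabetically smallest candidate) are assumptions made for illustration and are not taken from the method described here.

```python
def bottom_up(tree, leaf_state, candidates):
    """Bottom-up phase: compute the set of possible statuses at every node."""
    if isinstance(tree, str):                      # leaf: observed nucleotide
        candidates[tree] = {leaf_state[tree]}
    else:
        left, right = tree
        a = bottom_up(left, leaf_state, candidates)
        b = bottom_up(right, leaf_state, candidates)
        # Intersection if the children share a status, otherwise union.
        candidates[tree] = (a & b) or (a | b)
    return candidates[tree]


def top_down(tree, candidates, parent_state, final):
    """Top-down phase: choose one final status per node."""
    options = candidates[tree]
    # Keep the parent's status when it is allowed; otherwise pick one
    # candidate (assumed tie-break: the alphabetically smallest).
    state = parent_state if parent_state in options else min(options)
    final[tree] = state
    if not isinstance(tree, str):
        for child in tree:
            top_down(child, candidates, state, final)


def fitch(tree, leaf_state):
    """Run both phases; return (candidate sets, final statuses) keyed by node."""
    candidates, final = {}, {}
    bottom_up(tree, leaf_state, candidates)
    top_down(tree, candidates, None, final)
    return candidates, final


# Hypothetical four-gene tree and one alignment column.
tree = (("A1", "A2"), ("B1", "B2"))
column = {"A1": "T", "A2": "T", "B1": "A", "B2": "T"}
candidates, final = fitch(tree, column)
```

In this example the ancestor of B1 and B2 first receives the candidate set {A, T} and is resolved to T in the top-down phase, because the root is T.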

3.2.2 Gene conversion inference

Based on the statuses of all internal nodes, for each parsimoniously informative site k, the pairs of paralogous genes, $s_j$ with parent $s_i$ and $s_{j'}$ with parent $s_{i'}$, that favor the hypothesis of a gene conversion are determined using the following formula:

(3-3)

Figure 3-7 shows an example of how to detect gene conversion and its directionality. All pairs of paralogous genes are checked. The paralogous genes F1 and F2 are the same as their parents but different from each other, so there is no gene conversion between this pair of genes. For the pair of paralogous genes E1 and E2, E1 is different from its parent but the same as its paralogous gene, E2. There is therefore a gene conversion event between this pair of genes, and E1 was converted by E2. Similarly, there is a gene conversion event between the paralogous genes C1 and C2, but the directionality is different: C2 was converted by C1. Finally, the paralogous genes B1 and B2 are both different from their parents but the same as each other. There is a gene conversion event between this pair of genes, but its directionality cannot be determined.

Figure 3-7: An example showing how to determine gene conversion and its directionality.
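The four cases in this example can be written as a small decision function for one informative site. This is a sketch of the rules as they are stated in the worked example, not a transcription of equation (3-3); the function name, the argument order and the returned labels are assumptions made for illustration.

```python
def classify_site(gene, gene_parent, paralog, paralog_parent):
    """Classify one site for a pair of paralogous genes.

    Each argument is the status (nucleotide) of a gene or of its parent
    node, as assigned by the parsimony reconstruction.
    """
    if gene != paralog:
        # Paralogs differ at this site: no support for conversion (F1/F2 case).
        return "none"
    gene_changed = gene != gene_parent
    paralog_changed = paralog != paralog_parent
    if gene_changed and paralog_changed:
        # Both differ from their parents but agree with each other (B1/B2 case).
        return "undetermined"
    if gene_changed:
        # Only the first gene left its parental status (E1/E2 case).
        return "gene converted by paralog"
    if paralog_changed:
        # Only the paralog left its parental status (C1/C2 case).
        return "paralog converted by gene"
    # Identical paralogs that both match their parents carry no signal.
    return "none"


# Hypothetical statuses mirroring the four cases discussed above.
print(classify_site("A", "A", "T", "T"))   # F1/F2-like pair
print(classify_site("T", "A", "T", "T"))   # E1/E2-like pair
print(classify_site("T", "T", "T", "A"))   # C1/C2-like pair
print(classify_site("G", "A", "G", "T"))   # B1/B2-like pair
```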


3.2.3 Boundaries of gene conversion

Finally, the boundaries of gene conversion events are determined. Let an r-site be a site that indicates a conversion event and an s-site be a site that shows no evidence of gene conversion. For a sequence with s s-sites and r r-sites, Stephan proposed a statistical test that gives the probability that there are at least k r-sites between a randomly chosen pair of consecutive s-sites:

\[
P_m = \sum_{j=k}^{r} \binom{s+r-j-3}{r-j} \Big/ \binom{s+r-2}{r}
\tag{3-4}
\]

Therefore, the probability that at least one of the s - 1 pairs of consecutive s-sites contains at least k r-sites is:

\[
P = 1 - (1 - P_m)^{s-1}
\tag{3-5}
\]

However, this method has a limitation: it can only find regions in which every site shows evidence of gene conversion. In practice, some sites within a converted region might not show evidence of conversion just by chance. We therefore modify the formula to allow some s-sites inside a gene conversion region. For a sequence with s s-sites and r r-sites, the probability that there are at least k r-sites between a randomly chosen pair of s-sites that enclose i other s-sites is as follows:

\[
P_m^{i} = \sum_{j=k}^{r} \binom{j+i}{i} \binom{s+r-j-i-3}{r-j} \Big/ \binom{s+r-2}{r}
\tag{3-6}
\]

Therefore, the probability that at least one of the s - i - 1 pairs of s-sites that enclose i other s-sites has at least k r-sites between them is:

\[
P = 1 - (1 - P_m^{i})^{s-i-1}
\tag{3-7}
\]
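These probabilities are straightforward to evaluate with exact arithmetic. The sketch below assumes that the summation index j appears inside the binomial coefficients (the count of arrangements with exactly j r-sites in the chosen window), and that the first and last sites of the sequence are s-sites, which is what the denominator implies. The function names and the restriction s ≥ i + 3 (which keeps every binomial argument non-negative) are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb


def window_probability(s, r, k, i=0):
    """P(at least k r-sites lie between two s-sites that enclose i other s-sites).

    With i = 0 this is the case of consecutive s-sites.
    """
    if s < i + 3 or not 0 <= k <= r:
        raise ValueError("sketch requires s >= i + 3 and 0 <= k <= r")
    total = comb(s + r - 2, r)
    favourable = sum(
        comb(j + i, i) * comb(s + r - j - i - 3, r - j)
        for j in range(k, r + 1)
    )
    return Fraction(favourable, total)


def any_window_probability(s, r, k, i=0):
    """P(at least one of the s - i - 1 windows contains at least k r-sites)."""
    p = window_probability(s, r, k, i)
    return 1 - (1 - p) ** (s - i - 1)


# Example: 4 s-sites, 2 r-sites, both r-sites in the same gap.
p_single = window_probability(4, 2, 2)
p_any = any_window_probability(4, 2, 2)
```

A useful sanity check is that `window_probability(s, r, 0, i)` equals 1 for any valid input, since every window contains at least zero r-sites.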


3.3 Results and limitations

3.3.1 Beta and delta genes

We use this method to find all gene conversion events between the beta and delta genes. The result is shown in Figure 3-8. An empty square represents no gene conversion event. There are three types of gene conversion events: a blue line indicates that the delta gene was converted, a red line indicates that the beta gene was converted, and a green line means that the directionality cannot be determined. There are gene conversion events in the coding regions of cat, dog and rf_bat. There are also gene conversion events in the common ancestors of human, chimp, rhesus, baboon and marmoset (HCRBM); cat and dog (CaD); cat, dog and rf_bat (CaDRf); and all species (HCRBMCaDRf).


Figure 3-8: All gene conversion events between beta and delta genes. Gene conversion events within 8 species, i.e. human (H), chimp (C), rhesus (R), baboon (B), marmoset (M), cat (Ca), dog (D) and rf_bat (Rf), and in their ancestors are detected.

3.3.2 Two gamma genes


The proposed method is also applied to detect all gene conversion events between two

gamma genes. The result is shown in figure 3-9. We can find that there are gene conversion

events in the coding regions and non-coding regions of all species. Also, there are gene

conversion events that occurred in the ancestors of these species.

Figure 3-9: All gene conversion events between two gamma genes. Gene conversion events within 8 species, i.e. human (H), gibbon (G), colobus monkey (C), rhesus (R), baboon (B), squirrel monkey (S), marmoset (M) and dusky titi (T), and in their ancestors are detected.


3.3.3 Limitations

However, there are still some problems in using this method to find orthologous genes for different gene clusters. The main problem is that the phylogeny must be known in advance, and for some gene clusters it is not known. Furthermore, as shown in Figure 2-7, three types of orthologous relationships can be determined from the gene clusters: one-to-one, one-to-many and many-to-many. Each type of orthologous relationship needs to be taken into consideration.

Chapter 4

Gene conversion detection for whole genome

Gene conversion events are often overlooked in analyses of genome evolution. In such an event, an interval of DNA sequence (not necessarily containing a gene) overwrites a highly similar sequence. The event creates relationships among genomic intervals that can confound prediction of orthologs and attempts to transfer functional information between genomes. Here we analyze 1,112,202 highly conserved pairs of human genomic intervals, and detect a conversion event for about 13.5% of them. Properties of the putative gene conversions are analyzed, such as the distributions of the lengths of the converted regions and the spacing between source and target. We also compare results from the whole-genome predictions with previous analyses for several well-studied gene clusters, including the globin genes.

4.1 Introduction

Several classes of evolutionary operations have sculpted the human genome. Nucleotide substitutions have been studied in great detail for years, and much attention is now focused on large-scale events such as insertions, deletions, inversions, and duplications. Frequently overlooked are gene conversion events (reviewed by Hurles 2004 and Chen et al. 2007), in which one region is copied over the location of a highly similar region; before the operation there are two genomic intervals, say A and B with 95% identity, and afterwards there are two identical copies of A, one in the position formerly occupied by B.

Conversion events need to be accounted for when attempting to understand the human genome based on identification of orthologous regions in other species. To take a hypothetical example, suppose human genes A and B are related by a duplication event that pre-dated the separation of humans and Old World monkeys, so that rhesus macaques also have genes A and B. A conversion event in a human ancestor that overwrote some of A with sequence from B could cause all or part of A’s amino-acid sequence to be more closely related to the rhesus B protein than to the rhesus A protein, even though A’s regulatory regions might remain intact. Successful design and interpretation of experiments in rhesus to understand gene A might well require knowledge of these evolutionary relationships.

Gene conversion events have been studied in a variety of species. Drouin (2002) characterized conversions within 192 yeast gene families; Semple and Wolfe (1999) detected conversion events in 7,397 Caenorhabditis elegans genes; Ezawa et al. (2006) studied 2,641 gene quartets, each consisting of two pairs of orthologous genes in mouse and rat, and found that 488 (18%) appear to have undergone gene conversion; and Xu et al. (2008) detected 377 gene conversion events within 626 multigene families in the rice genome. However, all of these studies detect gene conversion events only between pairs of protein-coding genes, although conversion can occur between any pair of highly similar regions (Chen et al. 2007). Some analyses of gene conversion have also been done in the human genome (Jackson et al. 2005; McGrath et al. 2009; Benovoy and Drouin 2009), but none of the datasets used in these previous studies compares in size to the one analyzed here. In this paper, we cover more than one million paralogous pairs of regions, which requires a more efficient method to deal with such a large dataset.

Evidence of conversion between human genes frequently appears in cases where the conversion involves only part of a duplicated region. For instance, consider the δ-globin and β-globin genes, which lie close together on human chromosome 11 in a gene cluster shaped by conversion events (Papadakis and Patrinos 1999). A human-human alignment reveals similarities extending beyond the genes, created by an ancient duplication event pre-dating the radiation of

placental mammals, roughly 100 million years ago; see Figure 4-1B. To test whether the elevated

percent identity in the protein-coding regions can be explained entirely by purifying selection on

those regions, we can compare the pattern of sequence conservation between the paralogous

human regions with that between human δ-globin and its ortholog in an appropriately diverged

species. Using dog, we see that in most of the interval around the δ-globin gene, the human

sequence is more similar to the dog δ-globin region (blue) than to the human β-globin region

(red) as expected, but this is reversed in a large interval containing exons 1 and 2, and perhaps in

part of exon 3; see Figure 4-1C. One reasonable inference from this observation is that a

conversion event overwrote that interval with the homologous sequence from the β-globin gene,

or vice versa. Indeed, the procedure described in this paper identifies a conversion event covering

an interval that starts somewhat upstream of exon 1 and extends just beyond exon 2, while the

potential conversion of part of exon 3 is not significant in this test (Figure 4-1D). On the other

hand, testing with marmoset sequence instead of dog does find significant evidence (p=0.0006) of

a conversion event involving part of exon 3 (data not shown).

Figure 4-1: Evidence of gene conversion in the human δ-globin gene. (A) Schematic view of the gene. (B) Percent identity plot of an alignment to an interval containing the human β-globin gene; each short horizontal line indicates the percent identity over a subinterval of the alignment. (C) Plot of the alignment to the dog δ-globin gene (blue) compared to the paralogous human alignment (red). (D) An interval of gene conversion detected by the method described in this paper, and a smaller interval where the dog sequence does not detect statistically significant evidence of conversion. See the text for further discussion.

A number of statistical tests have been proposed to detect gene conversions. However,

most of these tests are only efficient for small data sets, e.g. individual gene clusters. Boni et al.

(2007) nicely summarize computational methods available for detecting mosaic structure in

sequences, and propose a new method that is particularly economical in terms of computer

execution time for large data sets. One drawback is that their algorithm requires large amounts of

computer memory. However, we show here that this method can be reformulated so that the

memory requirements are no longer a limiting factor, which allows us to conduct a

comprehensive scan for gene-conversion events in the human genome, starting with 1,112,202

pairs of paralogous human intervals. For each pair of paralogous intervals, say H1 and H2, we

choose a sequence from another species, say C1, that is believed to be orthologous to H1. These

triplets of sequences are examined to find cases where part of H1 is more similar to H2 than to C1,

while another part is more similar to C1. In such cases, the interval of high H1 - H2 similarity is

inferred to have resulted from a conversion event, as illustrated in Figure 4-1.

Our findings include the following observations about predicted human gene

conversions. About 71% of the detected 149,799 conversion events occurred between two

intervals on different chromosomes, but the conversion rate for intra-chromosomal paralogous

pairs is ~1.5 fold compared to that for inter-chromosomal paralogous pairs. For the intra-

chromosomal conversions, distance between the pair of sequences is a key factor affecting

conversion frequencies. Pairs of similar sequences that are too close or too distant have a lower

conversion frequency; the highest conversion frequency is between paralogs separated by 10Kb

to 100Kb. Moreover, we find that: (i) conversion frequencies are proportional to the length of the

paralogs; (ii) 57% of conversion events cover less than 100 bp; (iii) selection plays a critical role

for the occurrence of gene conversion; (iv) conversion frequency varies among chromosomes, and does so in a manner suggesting that the mechanisms generating conversions are not associated with the process of genome replication; (v) the relative orientation of the two sequences has little effect on conversion frequency; (vi) gene conversions have a preference for lower GC-content.

4.2 Methods

4.2.1 Highly conserved pairs of sequences

We aligned each pair of human chromosomes, including self-alignments, using BLASTZ

(Schwartz et al. 2003) with T=2 and default values for the other parameters. Chaining of the human-human alignments was performed using the method of Zhang et al. (1994). For alignments between human intervals and their putative orthologs in other species, we used the pairwise alignment “nets” (Kent et al. 2003) downloaded from the UCSC Genome Browser website (Kent et al. 2002).

4.2.2 Gene conversion detection between each pair of sequences

For each pair of similar human sequences, say H1 and H2, we found an interval, C1, from another species (perhaps chimpanzee) that appears to be orthologous to H1. Our plan was to look for cases where part of H1 is more like H2, while other parts are more like C1. Following Boni et al. (2007), we used the H1 - H2 alignment and the H1 – C1 alignment to identify “informative” positions in H1, such that either H1 and H2 have one nucleotide and C1 has another (score –1), or

H1 and C1 have one nucleotide and H2 has another (score +1). In Figure 4-2, we plot idealized examples of the cumulative sum of these scores along H1, which constitute what is called a

hypergeometric random walk (HGRW; Feller 1957) under the assumption that H1’s relationships to H2 and C1 are invariant across the interval. If the duplication event producing H1 and H2 preceded the speciation that divided H1 and C1, then the plotted quantity will generally increase because there will be more +1 scores (H1 like C1) than –1 scores (H1 like H2); this is illustrated in panels A and D of Figure 4-2. Panels C and F illustrate the opposite case where the duplication followed the speciation. The case of interest to us is when duplication preceded speciation, but where following duplication, a subinterval of H1 was overwritten by the corresponding subinterval of H2; in that subinterval the plot decreases, contrary to the behavior in the rest of the plot (panels B and E). Our task is to identify cases where the +1s and –1s are distributed along H1 so that the cumulative sum contains an interval of maximum descent (the maximum decrease of the score across an interval, in one direction only, as shown in Figure 4-2) that cannot be explained by chance alone.

For a given pair H1 and H2, we needed to find sequence from a species at an appropriate evolutionary distance, i.e., one that split from the human lineage somewhat after the duplication event and before the most recent conversion. Thus, we tried a gamut of available mammalian genome sequences: chimp, orangutan, rhesus, marmoset, dog and opossum. Each of these species can be used to identify gene conversion events in a particular period of evolution along the lineage leading to human. Because the orthologs of H1 and H2 often differ, up to 12 triplets were used to look for gene conversion between a given paralogous pair.

More formally, we detect conversions using the test statistic $x_{m,n,k}$, which is the probability of a maximum descent of k occurring by chance for a triplet with m +1s and n –1s. Boni et al. (2007) give a dynamic programming algorithm for computing $x_{m,n,k}$.

Figure 4-2: Determining the occurrence of gene conversion events in a triplet, where H1 and H2 are paralogous human intervals and C1 is an ortholog of H1 from another species. Panels A to C depict three scenarios relating the three sequences, while D to F are associated plots where the horizontal axis shows positions along H1 and the slope is positive if and only if H1 matches C1 more frequently than it matches H2. A region with a statistically improbable decrease is predicted to identify a conversion event.
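The scoring of informative positions and the search for the maximum descent can be sketched as follows. This is a minimal illustration, assuming the three sequences are already aligned column by column (equal-length strings with `-` for gaps); the function names and the toy sequences are hypothetical, and the significance of the observed descent still has to be obtained from the statistic $x_{m,n,k}$.

```python
def informative_steps(h1, h2, c1):
    """Return (position, score) for every informative column of the triplet.

    Score -1: H1 matches H2 but not C1.  Score +1: H1 matches C1 but not H2.
    Columns with a gap, or where H1 matches both or neither, are skipped.
    """
    steps = []
    for pos, (a, b, c) in enumerate(zip(h1, h2, c1)):
        if "-" in (a, b, c):
            continue
        if a == b != c:
            steps.append((pos, -1))
        elif a == c != b:
            steps.append((pos, +1))
    return steps


def max_descent(scores):
    """Largest drop of the cumulative sum from an earlier peak.

    Returns (descent, (start, end)), where scores[start:end] are the steps
    spanned by the descent.
    """
    height = peak = best = 0
    peak_at = 0
    interval = (0, 0)
    for i, step in enumerate(scores, start=1):
        height += step
        if height > peak:
            peak, peak_at = height, i
        elif peak - height > best:
            best, interval = peak - height, (peak_at, i)
    return best, interval


# Toy triplet (hypothetical sequences).
steps = informative_steps("ACGTA", "ATGA-", "GCGCA")
descent, span = max_descent([1, 1, -1, -1, -1, 1, 1])
```

For a triplet, m and n are the numbers of +1 and –1 scores returned by `informative_steps`, and k is the descent returned by `max_descent`.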

4.2.3 Space-efficient modifications

The original formulation by Boni et al. (2007) requires an amount of computer time and memory that is proportional to $B^4$, where B is an upper bound for m, n, and k. For a triplet with 400 informative sites, this approach would use 6.4 GB of computer memory, allowing the method to work only with relatively short sequences. We modified that method to need only space proportional to $mn + n^2 + SP$ (where S is the number of outgroup species and P is the number of pairs of sequences), as we now describe.

In the notation of Boni et al., the test statistic $x_{m,n,k}$ is defined as $P(\mathrm{md}\,H_{m,n} = k)$ and can be calculated using the equation

\[
x_{m,n,k} = \sum_{j=0}^{k} y_{m,n,k,j},
\tag{4-1}
\]

where:

\[
y_{m,n,k,j} = P(\mathrm{md}\,H_{m,n} = k \,\cap\, \min H_{m,n} = -j)
\tag{4-2}
\]

The probabilities y can be obtained by dynamic programming based on the following recursive relationships.

\[
y_{m,n,k,j} =
\begin{cases}
\left(\dfrac{m}{m+n}\right)\left[\,y_{m-1,n,k,1} + y_{m-1,n,k,0}\,\right] & \text{if } j = 0,\\[6pt]
\left(\dfrac{m}{m+n}\right) y_{m-1,n,k,j+1} + \left(\dfrac{n}{m+n}\right) y_{m,n-1,k,j-1} & \text{if } k > j > 0,\\[6pt]
\left(\dfrac{n}{m+n}\right)\left[\,y_{m,n-1,j-1,j-1} + y_{m,n-1,j,j-1}\,\right] & \text{if } j = k > 0,\\[6pt]
0 & \text{if } j > k \ge 0.
\end{cases}
\tag{4-3}
\]

In order to reduce the usage of memory, we introduce the additional variable $A_{m,n,k}$, defined as:

\[
A_{m,n,k} = y_{m,n,k,k}
= \left(\frac{n}{m+n}\right)\left[\,y_{m,n-1,k-1,k-1} + y_{m,n-1,k,k-1}\,\right]
= \left(\frac{n}{m+n}\right)\left[\,A_{m,n-1,k-1} + y_{m,n-1,k,k-1}\,\right]
\tag{4-4}
\]

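Then,

\[
\begin{aligned}
x_{m,n,k} &= \sum_{j=0}^{k} y_{m,n,k,j} = y_{m,n,k,0} + \sum_{j=1}^{k-1} y_{m,n,k,j} + y_{m,n,k,k} \\
&= \left(\frac{m}{m+n}\right)\left[\,y_{m-1,n,k,1} + y_{m-1,n,k,0}\,\right]
 + \sum_{j=1}^{k-1}\left[\left(\frac{m}{m+n}\right) y_{m-1,n,k,j+1} + \left(\frac{n}{m+n}\right) y_{m,n-1,k,j-1}\right] \\
&\quad + \left(\frac{n}{m+n}\right)\left[\,y_{m,n-1,k-1,k-1} + y_{m,n-1,k,k-1}\,\right] \\
&= \left(\frac{m}{m+n}\right)\sum_{j=0}^{k} y_{m-1,n,k,j}
 + \left(\frac{n}{m+n}\right)\left[\,\sum_{j=0}^{k-1} y_{m,n-1,k,j} + y_{m,n-1,k-1,k-1}\right] \\
&= \left(\frac{m}{m+n}\right) x_{m-1,n,k}
 + \left(\frac{n}{m+n}\right)\left[\,\sum_{j=0}^{k} y_{m,n-1,k,j} - y_{m,n-1,k,k} + y_{m,n-1,k-1,k-1}\right] \\
&= \left(\frac{m}{m+n}\right) x_{m-1,n,k}
 + \left(\frac{n}{m+n}\right)\left[\,x_{m,n-1,k} - A_{m,n-1,k} + A_{m,n-1,k-1}\,\right]
\end{aligned}
\tag{4-5}
\]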

The key observation is that for fixed k, the only component of equation (4-3) that depends on k–1 is the case j = k > 0, and in that case the required value is $A_{m,n-1,k-1}$. (On the other hand, the initialization of $y_{m,n,k,j}$ for m = 0 does depend on k.) Consequently, provided that we record the 3-dimensional array of values $A_{m,n,k}$, we can store the values of y for a fixed k in another 3-dimensional array that we call $y_{m,n,j}$ and overwrite them with the values corresponding to k+1 as the computation proceeds. The resulting algorithm uses only two arrays of size $mn^2$ (x and y can be stored in the same array) as shown in Figure 4-3. It can handle triplets with 2000 informative sites on a mid-sized workstation.

MODIFIED-3SEQ(MAX_M, MAX_N)
    for m ← 0 to MAX_M do
        A[m, 0, 0] ← 1
        for n ← 1 to MAX_N do
            A[m, n, 0] ← 0
    for k ← 1 to MAX_N do
        for m ← 0 to MAX_M do
            for j ← 0 to MAX_N do
                y[m, 0, j] ← 0
        for n ← 0 to MAX_N do
            for j ← 0 to MAX_N do
                if k = n and j = n then
                    y[0, n, j] ← 1
                else y[0, n, j] ← 0
        for m ← 1 to MAX_M do
            for n ← 1 to MAX_N do
                for j ← 0 to k - 1 do
                    if k > n or k < n - m then
                        y[m, n, j] ← 0
                    else if j > k or j > n or j < n - m then
                        y[m, n, j] ← 0
                    else if j = 0 then
                        y[m, n, j] ← m / (m + n) × (y[m - 1, n, 1] + y[m - 1, n, 0])
                    else y[m, n, j] ← m / (m + n) × y[m - 1, n, j + 1] + n / (m + n) × y[m, n - 1, j - 1]
                if k > n or k < n - m then
                    A[m, n, k] ← 0
                else A[m, n, k] ← n / (m + n) × (A[m, n - 1, k - 1] + y[m, n - 1, k - 1])
                y[m, n, k] ← A[m, n, k]
    for n ← 0 to MAX_N do
        for k ← 0 to MAX_N do
            if n = k then
                x[0, n, k] ← 1
            else x[0, n, k] ← 0
    for m ← 1 to MAX_M do
        x[m, 0, 0] ← 1
        for k ← 1 to MAX_N do
            x[m, 0, k] ← 0
        for n ← 1 to MAX_N do
            x[m, n, 0] ← 0
            for k ← 1 to MAX_N do
                x[m, n, k] ← m / (m + n) × x[m - 1, n, k] + n / (m + n) × (x[m, n - 1, k] + A[m, n - 1, k - 1] - A[m, n - 1, k])
    return x

Figure 4-3: A cubic-space algorithm for computing the probabilities $x_{m,n,k}$.
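The recurrence for $x_{m,n,k}$ in terms of $x_{m-1,n,k}$, $x_{m,n-1,k}$ and the A values can be checked independently on small inputs by enumerating every walk. The sketch below is a brute-force verification under the definitions above (maximum descent and minimum of a walk with m up-steps and n down-steps, all orders equally likely); it is not the space-efficient algorithm itself, and the function names are assumptions made for illustration. Its cost grows combinatorially, so it is only suitable for small m and n.

```python
from fractions import Fraction
from functools import lru_cache
from itertools import combinations
from math import comb


@lru_cache(maxsize=None)
def walk_counts(m, n):
    """Count walks with m up-steps and n down-steps by (max descent, -minimum)."""
    counts = {}
    for up_positions in combinations(range(m + n), m):
        ups = set(up_positions)
        height = peak = lowest = descent = 0
        for t in range(m + n):
            height += 1 if t in ups else -1
            peak = max(peak, height)
            lowest = min(lowest, height)
            descent = max(descent, peak - height)
        key = (descent, -lowest)
        counts[key] = counts.get(key, 0) + 1
    return counts


def y(m, n, k, j):
    """P(max descent = k and minimum = -j), by enumeration."""
    return Fraction(walk_counts(m, n).get((k, j), 0), comb(m + n, m))


def x(m, n, k):
    """P(max descent = k), summing y over j = 0..k."""
    return sum((y(m, n, k, j) for j in range(k + 1)), Fraction(0))


def A(m, n, k):
    """A[m, n, k] = y[m, n, k, k]."""
    return y(m, n, k, k)


def recurrence_rhs(m, n, k):
    """Right-hand side of the final line of the derivation for x[m, n, k]."""
    return (Fraction(m, m + n) * x(m - 1, n, k)
            + Fraction(n, m + n) * (x(m, n - 1, k) - A(m, n - 1, k) + A(m, n - 1, k - 1)))


# The enumeration and the recurrence agree on every small case tried here.
agree = all(
    x(m, n, k) == recurrence_rhs(m, n, k)
    for m in range(1, 5) for n in range(1, 5) for k in range(1, n + 1)
)
```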

Furthermore, since the value of x depends only on values in the same iteration of the outer loop, e.g. $x_{m,n-1,k}$, and in the previous iteration, e.g. $x_{m-1,n,k}$ (when using m as the outer loop), an $O(mn + n^2 + SP)$ space method (where S = number of outgroup species and P = number of pairs of sequences) is possible. First, the values of m, n, and k for the triplets of all pairs of sequences are determined and stored in a three-dimensional linked list, which consumes O(SP) space. Then the value of x is calculated and summed to the relevant triplets. Since only those values that are necessary for further calculation are kept, the maximum table size required for the calculation of x is $O(mn + n^2)$.

Although the space requirement is thus reduced, the time complexity is still quartic (exponent 4). Also, the longest interval in our data is 251,067 base pairs. To deal with long alignments, those with length greater than 5000 are divided into several sub-alignments that overlap by 1000 sites. The p-value for each sub-alignment is then calculated, and a multiple-comparison correction method (Holm 1979) is used to determine whether the set of sub-alignments supports an assertion that the whole alignment shows significant signs of a conversion.
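The splitting and correction steps just described can be illustrated with a minimal sketch. It assumes, since the text does not say so, that each sub-alignment is at most 5000 sites long, and it uses the standard Holm step-down procedure; the function names `split_into_windows` and `holm_rejections` are hypothetical and are not taken from our program.

```python
def split_into_windows(length, window=5000, overlap=1000):
    """Return (start, end) site ranges covering an alignment of `length` sites.

    Assumption: sub-alignments are at most `window` sites long and
    consecutive sub-alignments share `overlap` sites.
    """
    if length <= window:
        return [(0, length)]
    step = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, length)
        windows.append((start, end))
        if end == length:
            break
        start += step
    return windows


def holm_rejections(p_values, alpha=0.05):
    """Count the hypotheses rejected by the Holm (1979) step-down procedure."""
    ordered = sorted(p_values)
    m = len(ordered)
    rejected = 0
    for i, p in enumerate(ordered):
        # The i-th smallest p-value (0-based) is compared to alpha / (m - i).
        if p <= alpha / (m - i):
            rejected += 1
        else:
            break
    return rejected


# Hypothetical p-values for the three windows of a 12,000-site alignment.
windows = split_into_windows(12000)
significant = holm_rejections([0.001, 0.04, 0.5]) > 0
```

Under this sketch, the whole alignment would be reported as showing signs of conversion when at least one sub-alignment survives the correction.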

4.2.4 Extension to quadruplet testing

It is not uncommon to have a pair of paralogs in another species, say C1 and C2 in chimpanzee, that are orthologs of H1 and H2 in human, respectively. In a fashion similar to triplet testing for gene conversions, we can perform quadruplet testing on (H1, H2, C1, C2), which sums the hypergeometric random walks of the two triplets (H1, H2, C1) and (H1, H2, C2), as shown in Figure 4-4. Quadruplet testing may have higher specificity and sensitivity than triplet testing for detecting conversions. For example, in Figure 4-4A, a weakly significant (p = 0.032) conversion event was detected between the HBD and HBBP1 paralog pair in one triplet test, which is inconsistent with a previous study (Papadakis and Patrinos 1999) of the beta-globin gene cluster. This could be due to a faster evolutionary rate in HBBP1, which is a pseudogene. However, quadruplet testing did not show any evidence of conversion in this region. This suggests that the effect of one triplet can be neutralized by that of the other triplet when there is no conversion between a paralog pair. On the other hand, when quadruplet testing is applied to a converted region, we can get a more significant result, as shown in Figure 4-4B (p = 2.4e-18). Therefore, whenever orthologs for both H1 and H2 are available in a particular outgroup species, we combine the results of the two triplets to perform quadruplet testing, and use the same formula as for triplet testing, i.e. equation (4-5), to get p-values.

Figure 4-4: Comparisons between quadruplet testing and triplet testing. (A) the HBD and HBBP1 paralog pair; (B) the HBB and HBD paralog pair.

4.2.5 Multiple-comparison correction

When several statistical tests are performed simultaneously, a multiple-comparison

correction should be applied. In our study, six outgroup species are used. Here we use the

Bonferroni correction (Holm 1979); we multiply the smallest p-value for each paralogous human

pair by the number of tests (up to 6), and then compare the adjusted p-value to the p-value

threshold, α.
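As a small illustration of this per-pair adjustment, the sketch below multiplies the smallest p-value by the number of tests actually performed. Representing an outgroup with no usable ortholog as `None`, and capping the adjusted value at 1, are assumptions made only for this example; the function name is hypothetical.

```python
def adjusted_min_p(p_values):
    """Bonferroni-style adjustment of the smallest p-value for one paralogous pair.

    `p_values` holds one entry per outgroup species (up to 6 in our study).
    Assumption: outgroups with no usable ortholog are given as None and do
    not count as tests.
    """
    tests = [p for p in p_values if p is not None]
    if not tests:
        return None  # the pair cannot be tested
    return min(1.0, min(tests) * len(tests))


# Hypothetical pair tested against three of the six outgroups.
adjusted = adjusted_min_p([0.01, None, 0.2, None, 0.5, None])
is_significant = adjusted is not None and adjusted < 0.05
```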

Multiple-comparison correction is also applied to the tests for all pairs of paralogous

sequences. For the 1,112,202 pairs that were analyzed, we used a multiple-comparison correction

method that controls the false discovery rate (FDR), proposed by Benjamini and Hochberg

(1995). The cutoff threshold for p-values can be found by the following algorithm:

CutOff(α, P-values)
    sort P-values
    for i ← 1 to number of P-values do
        if Pi > (i / number of P-values) × α
            return (i / number of P-values) × α

Figure 4-5: Algorithm for determining cutoff position of P-values
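The procedure in Figure 4-5 can be transcribed almost line for line. The sketch below follows the figure as written, returning the threshold at the first rank where the sorted p-value exceeds (i / N) × α. It assumes the sort is ascending, and the value returned when no p-value exceeds its threshold is also an assumption, since the figure does not cover that case. Note that the step-up rule usually associated with Benjamini and Hochberg (1995) instead looks for the largest rank whose p-value is below its threshold, so the two can give different cutoffs.

```python
def cutoff(alpha, p_values):
    """Cutoff threshold for p-values, following Figure 4-5 as written."""
    ordered = sorted(p_values)  # assumption: ascending order
    total = len(ordered)
    for i, p in enumerate(ordered, start=1):
        threshold = (i / total) * alpha
        if p > threshold:
            return threshold
    # Assumption: if every p-value is within its threshold, fall back to alpha.
    return alpha


# Hypothetical p-values, not data from our study.
example_cutoff = cutoff(0.05, [0.04, 0.001, 0.9, 0.03])
```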

In our study, α is 0.05 and the cutoff threshold for p-values is 0.006818. This means that only a test whose p-value after Bonferroni correction is less than 0.006818 is considered significant for gene conversion.

4.2.6 Directionality of gene conversion

We attempt to determine the source and target of a conversion event as follows. As shown in Figure 4-6B, let us suppose that a conversion event happened z years ago, with x > y > z, and consider a converted position. Regardless of the direction of the conversion (from H1 to

H2, or vice versa), in the converted region, H1 and H2 are separated by 2z total years. If H1 converted H2 (i.e. part of H1 overwrote part of H2), then the separation of H1 and C1 is 2y but the separation of H2 and its ortholog, C2, is 2x > 2y. This observation serves as a basis for determining the conversion direction. Figure 4-7 shows an example of determining the source and target of a conversion from HBB to HBD.

Specifically, assume (m1, n1) with maximum descent k1 in the first triplet (H1, H2, and C1), and (m2, n2) with maximum descent k2 in the second triplet (H1, H2, and C2). Note that mi and ni here are not the m and n in equation 4-1; rather, they are the numbers of ups and downs within the common maximum descent region of the two triplets (union). The probabilities of going down in the maximum descent regions of the two triplets are:

$p_1 = n_1 / (m_1 + n_1)$ (4-6)

$p_2 = n_2 / (m_2 + n_2)$ (4-7)

When combining these data, there are a total of (m1 + m2) ups and (n1 + n2) downs, and the probability of going down in the combined data is:

$p = (n_1 + n_2) / (m_1 + m_2 + n_1 + n_2)$ (4-8)

As shown in Figure 4-6B, if H1 converted H2, H1 and C1 are less separated than H2 and C2 in the converted region. Thus, p1 should be smaller than p2. Our objective function (O) therefore measures how significant the difference p1 – p2 is, based on the binomial distribution:

$O = \dfrac{(p_1 - p_2) - E(\hat{p}_1 - \hat{p}_2)}{\sqrt{V(\hat{p}_1 - \hat{p}_2)}}$ (4-9)

where:

$E(\hat{p}_1 - \hat{p}_2) = 0$ (4-10)

$V(\hat{p}_1 - \hat{p}_2) = \left( \dfrac{1}{m_1 + n_1} + \dfrac{1}{m_2 + n_2} \right) \times p \times (1 - p)$ (4-11)
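Equations (4-6) through (4-11) can be evaluated directly from the four counts. The following is a minimal sketch; the function name and the handling of a zero variance are assumptions made for the example, and the counts in the usage line are hypothetical.

```python
import math


def direction_statistic(m1, n1, m2, n2):
    """Objective function O of equation (4-9) for one outgroup species.

    m_i and n_i are the numbers of ups and downs in the common maximum
    descent region for triplet i.
    """
    p1 = n1 / (m1 + n1)                                  # equation (4-6)
    p2 = n2 / (m2 + n2)                                  # equation (4-7)
    p = (n1 + n2) / (m1 + m2 + n1 + n2)                  # equation (4-8)
    variance = (1 / (m1 + n1) + 1 / (m2 + n2)) * p * (1 - p)  # equation (4-11)
    if variance == 0:
        # Assumption: with no variation in the pooled data, report no difference.
        return 0.0
    # Equation (4-10) sets the expected difference to zero.
    return (p1 - p2) / math.sqrt(variance)


# Hypothetical counts: p1 = 0.25 and p2 = 0.75, so O is negative.
o_value = direction_statistic(30, 10, 10, 30)
```

A negative value means p1 is smaller than p2, which under the reasoning above is the pattern expected when H1 converted H2; the magnitude indicates how strongly the data support that direction.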

In this study, 6 outgroup species are used to detect gene conversions. We use the outgroup species that shows the most significant difference of p1 – p2 to determine the directionality of conversion for a given paralogous pair. However, there are several reasons why the direction of a conversion might not be clear, even when using several outgroup species, including conversions in the outgroup species and missing outgroup data. Our approach indicates a direction for 65.4% of the putative conversions.


Figure 4-6: Timing of evolutionary events. The assumed duplication, speciation, and conversion events occurred respectively x, y, and z years ago. See text for further explanation.

Figure 4-7: Evidence that the β-globin gene (HBB) converted the δ-globin gene (HBD). Percent identity plots for (A) HBB and (B) HBD showing alignments to the human paralog in red, and alignments to the putative marmoset ortholog in blue. In the converted region, the human-marmoset alignments have 92% identity for β-globin and 85% identity for δ-globin.

4.3 Results

We downloaded the March 2006 assembly of the human genome from the UCSC Genome Browser (Kent et al. 2002), aligned each pair of chromosomes using Blastz (Schwartz et al. 2003), and collected alignments into longer “chains”, each of which is intended to identify the results of a duplication event in which one or both copies may have been subsequently disturbed by insertion or deletion events (Kent et al. 2003). Table 4-1 contains information about the resulting set of duplicated genomic intervals.

Table 4-1: Information for duplicated human genomic intervals used in this study.

Number of duplicated regions                           1,112,202
Length of the longest duplicated region (bp)             251,067
Average length (bp)                                          876
Intra-chromosomal pairs                                  241,141
Inter-chromosomal pairs                                  871,061
Both regions contain coding sequences                    122,207
Only one region contains annotated coding sequence       225,144
Neither region contains annotated coding sequence        764,851

4.3.1 Number and distribution of gene conversion events in human

Of the 1,112,202 analyzed pairs of human sequences, 149,799 (13.5%) indicated a gene

conversion event (Table 4-2). The occurrence of a gene conversion for 6,737 (0.6%) pairs could

not be tested due to a lack of available orthologous sequence in the other species used in this

study. These results are consistent with results of Ezawa et al. (2006), where about 13% of the

mouse sequence pairs show signs of gene conversion. Among these 149,799 putative gene

conversion events, approximately 71% (106,872) occurred between chromosomes. However, the

fraction of intra-chromosomal pairs indicating a conversion (17.8%) is significantly higher than

for inter-chromosomal conversions (12.3%). (Note that a substantial majority of the pairs are

inter-chromosomal.)

The frequencies of intra-chromosomal conversions are shown for each human chromosome in Figure 4-8, in which they appear to vary substantially. For instance, the conversion frequency in chromosome 5 (25.7%) is more than double that in chromosome 18 (11.1%). We performed a chi-square test (Abramowitz et al. 1965) to determine whether conversion frequency depends on the chromosome. The test rejects the null hypothesis that conversion frequencies are the same across chromosomes (χ2 = 1665.74 with 22 degrees of freedom). Following the reasoning applied by, e.g., Makova et al. (2004), the fact that the conversion frequencies for the sex chromosomes are similar to those of the autosomes suggests that gene conversion events are not associated with cell replication.

Table 4-2: Distribution of intra- and inter-chromosomal gene conversions.

                      Intra-chromosome    Inter-chromosome    Total
Gene conversion       42,927 (17.8%)      106,872 (12.3%)     149,799 (13.5%)
No gene conversion    195,180 (80.9%)     760,486 (87.3%)     955,666 (85.9%)
Unknown^a             3,034 (1.3%)        3,703 (0.4%)        6,737 (0.6%)
Total                 241,141             871,061             1,112,202

^a For some pairs of sequences, no orthologous sequence could be found in the other species, so it is impossible to determine whether a gene conversion occurred.
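The chi-square statistic used in this section can be computed from a table of counts with a few lines of code. The sketch below is illustrative only: the function name is hypothetical, and because the per-chromosome counts are not listed in the text, the usage example instead applies the same statistic to the intra- versus inter-chromosomal counts of Table 4-2, leaving out the pairs that could not be tested. For the per-chromosome test, the table would have one row per chromosome, and 23 rows would give the 22 degrees of freedom reported above.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts.

    `table` is a list of rows; each row lists the counts per category
    (here: converted, not converted).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, degrees_of_freedom


# Counts from Table 4-2 (converted, not converted), excluding "Unknown" pairs.
table_4_2 = [
    [42927, 195180],   # intra-chromosomal pairs
    [106872, 760486],  # inter-chromosomal pairs
]
statistic, dof = chi_square_statistic(table_4_2)
```

The statistic is then compared with the chi-square distribution for the given degrees of freedom, for example using the tables in Abramowitz and Stegun (1965).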


Figure 4-8: Frequencies of gene conversion events in each human chromosome.

4.3.2 Correlations with the distance, length, and relative orientation of the paralogs

To study how the physical distance between paralogous sequences affects intra-

chromosomal conversions, conversion frequencies for different ranges of distance were examined

(Figure 4-9). When the pairs of sequences are very close (< 1Kb), conversion frequencies are low

(9.35%), although they also decrease gradually when the separation exceeds 100Kb. The

distances with the highest conversion frequency (25.74%) lie between 10Kb and 100Kb. This

result differs from some previous studies (Ezawa et al. 2006; Xu et al. 2008), which suggest that

the frequency of gene conversion is inversely proportional to the physical separation of the

paralogs. However, other reports suggest an optimal separation of 850 bp for yeast (Sugawara et

al. 2000) and 3800 bp in mammals (Schildkraut et al. 2005). Schildkraut et al. offered the plausible explanation that homologous recombination (HR), which repairs DNA double-strand breaks (DSB), could occur by two mechanisms, either conservative or non-conservative. Gene conversion is conservative and single-strand annealing (SSA) is non-conservative. When the pairs of sequences are very close (< 10Kb), SSA increasingly outcompetes gene conversion as the distance between the sequences decreases, since SSA requires less extensive end-processing. This could explain why the frequency of gene conversion decreases when the separation is less than 10Kb.

Figure 4-9: Frequency of intra-chromosomal gene conversions as a function of distance between the paralogs.
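The binning behind a plot such as Figure 4-9 can be sketched as follows. The bin edges (1Kb, 10Kb, 100Kb and 1Mb), the input format and the function name are assumptions chosen to match the ranges discussed in the text; the pairs in the usage example are hypothetical.

```python
import bisect


def conversion_frequency_by_distance(pairs, edges=(1_000, 10_000, 100_000, 1_000_000)):
    """Conversion frequency per distance range for intra-chromosomal pairs.

    `pairs` is an iterable of (distance_in_bp, is_converted) tuples.
    Returns one frequency per bin: below edges[0], between consecutive
    edges, and at or above edges[-1]. Empty bins are reported as None.
    """
    totals = [0] * (len(edges) + 1)
    converted = [0] * (len(edges) + 1)
    for distance, is_converted in pairs:
        bin_index = bisect.bisect_right(edges, distance)
        totals[bin_index] += 1
        if is_converted:
            converted[bin_index] += 1
    return [c / t if t else None for c, t in zip(converted, totals)]


# Hypothetical pairs, not data from our study.
frequencies = conversion_frequency_by_distance(
    [(500, False), (800, True), (20_000, True), (50_000, True), (2_000_000, False)]
)
```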

We are also interested in the correlation between gene conversion and the length of the homologous pair of sequences. Our results show that the frequency of gene conversion increases with sequence length (Figure 4-10). When the length is less than 200 bp, the frequency of gene conversion is very low (3.95%). This is consistent with previous studies (Liskay et al. 1987; Waldman and Liskay 1988; Reiter et al. 1998), which suggest that the so-called “minimal efficient processing segment” for gene conversion in mammals exceeds 200 bp. Naturally, the opportunity for gene conversion increases with sequence length.

Figure 4-10: Correlation with length of the paralogous human sequences.

We also studied the correlation with orientation. Pairs of sequences are classified into two types of orientation, i.e. same-direction and reverse-direction, based on the physical map. Only intra-chromosomal pairs of sequences are analyzed. For these pairs, more gene conversion events (24,533) occur in the same direction (Table 4-3), although same-direction and reverse-direction pairs have similar conversion frequencies. To understand this discrepancy between the number and the frequency of intra-chromosomal gene conversions, we analyzed the relationship between orientation and the distance between the sequences of a pair (Figure 4-11). Conversion frequencies for the different ranges of distance are almost the same for the two types of orientation (Figure 4-11A). However, approximately 66% (73,682 / 111,689) of the pairs of sequences separated by less than 1Mb have the same orientation (Figure 4-11B). This could explain why there are more conversion events between same-direction paralogs than between reverse-direction paralogs. Therefore, these analyses indicate that relative orientation has little effect on conversion frequencies.

Table 4-3: Distribution of gene conversion events classified by orientation.

                      Same direction      Reverse direction
Gene conversion       24,533 (17.8%)      18,394 (17.8%)
No gene conversion    111,783 (81.0%)     83,397 (80.8%)
Unknown               1,610 (1.2%)        1,424 (1.4%)
Total                 137,926             103,215

Figure 4-11: Correlation between orientation and separation distance of the human paralogs. (A) Conversion frequencies for the two types of orientation in different ranges of distance. (B) Number of pairs of sequences for the two types of orientation in different ranges of distance.

4.3.3 Length of converted regions

To analyze the conversion lengths, the region with maximal descent (Figure 4-2E) is taken as the converted region. The distribution of these lengths is shown in Figure 4-12, which indicates that the number of conversion events decreases as the length of the converted region increases. Approximately 56.7% (84,983 / 149,799) of conversion events have length less than 100 bp. By comparison, the data of Xu et al. (2008) indicate that 66% of conversion events in the rice genome are shorter than 100 bp.

Figure 4-12: Distribution for the length of the converted regions.

4.3.4 The effect of protein-coding DNA

To investigate whether selective pressure affects the occurrence or fixation of gene conversion events, we classified each pair of sequences into three categories: ‘2-coding’, ‘1-coding’ and ‘non-coding’, meaning respectively that both, only one, or neither of the two paralogs contains coding sequence according to the UCSC KnownGenes annotations. The results (Table 4-4) indicate that conversion frequencies are higher for 2-coding and 1-coding paralogs. One possible explanation is that non-coding pairs diverge the fastest, and more quickly leave the state where gene conversion is possible; conversion is thought to require at least 92% nucleotide identity, with over 95% identity being typical (Chen et al. 2007). Furthermore, to better understand the influence of selection, the 1-coding category is separated into three sub-categories, coding-to-noncoding, noncoding-to-coding and unknown, based on the directionality of conversion (Table 4-5). The results show that there are more conversions from coding sequence to non-coding sequence than from non-coding sequence to coding sequence. This could suggest that selection pressure reduces the probability of occurrence and/or fixation of gene conversion. Therefore, selection appears to play an important role in the occurrence of gene conversion.

Table 4-4: Conversion frequency as a function of the presence of protein-coding sequence.

                      2-coding^a          1-coding^b          Non-coding^c
Gene conversion       17,809 (14.6%)      33,133 (14.7%)      98,857 (12.9%)
No gene conversion    104,101 (85.2%)     190,768 (84.7%)     660,797 (86.4%)
Unknown               297 (0.2%)          1,243 (0.6%)        5,197 (0.7%)
Total                 122,207             225,144             764,851

^a Both paralogs contain coding sequence.
^b Only one paralog contains coding sequence.
^c Neither paralog contains coding sequence.

Table 4-5: Number of conversions for each direction in the 1-coding category.

                      coding-to-noncoding    noncoding-to-coding    unknown^a
Gene conversion       13,541                 8,311                  11,281

^a There are several reasons why the direction of a conversion might not be clear, including conversions in the outgroup species and missing outgroup data.


4.3.5 Correlation with GC-content

Furthermore, the correlation between gene conversion and GC-content is studied (Figure 4-13). Figure 4-13A shows the number of duplication events for different ranges of GC-content. More than 76% (1,004,879 / 1,308,632) of duplication events have GC-content below 50%. Furthermore, higher conversion frequencies occur in those duplication events with lower GC-content (Figure 4-13B). These results indicate that both duplication events and conversion events have a preference for lower GC-content, which is consistent with the hypothesis that DNA with higher GC-content is more stable than DNA with lower GC-content because each GC pair is held together by three hydrogen bonds.

Figure 4-13: Correlation between gene conversion and GC content. (A) Number of pairs of sequences for different ranges of GC content. (B) Conversion frequency for different ranges of GC content.
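For reference, the GC-content of a single sequence can be computed as in the sketch below. Ignoring gaps and ambiguous bases, and the function name, are assumptions made for the example; the text does not specify how the GC-content of a duplicated pair was derived from its two sequences.

```python
def gc_content(sequence):
    """Fraction of G and C among the unambiguous bases of a DNA sequence.

    Assumption: gaps ('-') and ambiguous symbols such as 'N' are ignored.
    Returns None when the sequence has no unambiguous base.
    """
    bases = [base for base in sequence.upper() if base in "ACGT"]
    if not bases:
        return None
    return sum(base in "GC" for base in bases) / len(bases)


# Hypothetical sequence fragment.
fraction = gc_content("acgtNN--")
```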

4.4 Discussion

For much of the half-century since multigene families were discovered, it has been

known that copies of the repeated genes within a species are more similar than would be expected

from their interspecies divergence. The processes generating this sequence homogeneity in

repeated DNA are mechanisms of concerted evolution. Gene conversion is one of those

processes, and while its impact on disease genes is appreciated (Chen et al. 2007), the extent of its

impact on the evolution of the human genome has not been fully investigated in previous studies.

Our work documents about one hundred and fifty thousand conversions (13.5%) between duplicated DNA segments in humans. Similarly large fractions of conversion events among duplicated segments have been reported in whole-genome studies of yeast (Drouin 2002),

Drosophila melanogaster (Osada and Innan 2008) and rodents (Ezawa et al. 2006), even though the total number of observed gene conversions is much higher in our study. The genome-wide identification of DNA segments undergoing concerted evolution via gene conversions will make the application of comparative genomics to functional annotation considerably more accurate.

This resource will allow the conversion process to be factored into functional inference based on sequence similarity to other species; for example it could flag potential false positives for inferred positive or negative selection.

In this thesis, we study gene conversion in the human genome. To characterize gene conversion in human, 1,112,202 highly conserved pairs of human sequences are extracted. To determine the occurrence of gene conversion for this large number of sequence pairs, a memory-intensive but very rapid program is used. To reduce memory usage, we propose a modified method that greatly reduces memory consumption so that gene conversion events can be determined for longer pairs of sequences. Our analyses reveal many interesting characteristics of gene conversion in the human genome. (i) More gene conversion events occurred between chromosomes, although inter-chromosomal conversion frequencies are lower than intra-chromosomal conversion frequencies; (ii) gene conversion frequencies differ among chromosomes and are not normally distributed; (iii) the physical distance between the sequences of a pair plays an important role in the frequency of intra-chromosomal gene conversion; (iv) longer pairs of sequences have more opportunity for gene conversion; (v) the orientation of a pair of sequences does not significantly affect the occurrence of gene conversion; (vi) most gene conversion events are very short; (vii) selection pressure would reduce the chance of gene conversion, although it could also increase the similarity level, which would increase conversion frequencies; (viii) gene conversions have a preference for lower GC-content. All of these results could give us a more detailed understanding of the formation and impact of gene conversion.

Chapter 5

Applying gene conversion detection method to gene clusters

5.1 Introduction

In this chapter, our method for detecting gene conversion across the whole genome is applied to three gene clusters, i.e. the beta-globin, CCL, and interferon gene clusters. For each gene cluster, gene conversion events are detected in all species. For this purpose, all other species are used as outgroups to detect gene conversion in one species. Furthermore, phylogenetic trees are constructed to demonstrate the correctness of our method.

5.2 Results

5.2.1 Beta-globin gene cluster (hg18.chr11:5,180,996-5,270,995)

The beta-globin gene cluster has been shown to have conversion rates that are 3 to 30 times the normal rate (Smith et al. 1998). To study gene conversion events in this gene cluster, 13 species from the ENCODE project (ENCODE Project Consortium 2004) are used and 57 genes are extracted using GeneWise2 (Birney et al. 2004). All conversion events detected in this gene cluster are shown in Figure 5-1. Many conversion events occur between the two gamma genes and between the beta and delta genes.


Figure 5-1: Gene tree and detected conversion events for the beta-globin gene cluster. Red arrows indicate conversions that might affect the topology of the gene tree.

In order to understand how these conversion events affect the evolution of the beta-globin gene cluster, a detailed analysis is applied to the beta and delta genes. According to the gene conversion results, the three exons of the gene have been affected by different gene conversion events. Thus, we divide the multiple sequence alignment of the beta and delta genes into four regions as shown in Figure 5-2A. Furthermore, a phylogenetic tree is constructed for each region (Figure 5-2B). It appears that the evolutionary histories of these four regions are dissimilar. The intron 2 region does not appear to be under the influence of gene conversion, since the beta and delta genes of all species are clearly separated. On the other hand, the different phylogenies of the other three regions could suggest that they are under the influence of different gene conversion events.


Figure 5-2: Influences of gene conversion between the beta and delta genes. (A) Informative sites in the aligned sequences of the beta and delta genes are divided into four regions based on our detected conversion events. (B) A neighbor-joining phylogenetic tree (1000 bootstraps) is constructed for each region.

Figure 5-3 shows the inferred evolutionary histories of the mammalian beta and delta genes, based on the phylogenetic trees of the different regions and our detected gene conversion events. The beta and delta genes are separated by an ancient duplication event pre-dating the radiation of placental mammals. Afterward, the three exons and intron 1 of the delta gene were converted by the corresponding regions of the beta gene after the split of Xenarthra and Boreoeutheria. Following the separation of Laurasiatheria and Euarchontoglires, a conversion event covering an interval that starts somewhat upstream of exon 1 and extends just beyond exon 2 occurred in both lineages. Finally, after the speciation event separating apes and New World monkeys, a region including exon 1 and intron 1 in the delta gene was converted once more by the beta gene.


Figure 5-3: Inferred evolutionary histories for mammalian beta and delta genes. Three speciation events are the separations of Xenarthra and Boreoeutheria (X-B), Laurasiatheria and Euarchontoglires (L-E), ape and New World Monkey (Ape-NWM) respectively. Four different species, i.e. armadillo (A), dog (D), marmoset (M) and human (H), are shown in these trees.

5.2.2 CCL gene cluster (hg18.chr17:31,334,806-31,886,998)

This gene cluster contains several chemokine ligand genes. Some of these genes have been shown to be able to prevent HIV from entering the cell (Modi et al. 2006). In our study, we find that this gene cluster expanded after the separation of humans and Old World monkeys, and several conversion events are detected. To study this gene cluster, 7 species from the National Human Genome Research Institute (NHGRI) are used and 42 genes are extracted using GeneWise2 (Birney et al. 2004). Our results show that several conversion events occurred between CCL15 and CCL23 in different lineages. Also, some conversions are detected between CCL18 and CCL3 and between CCL4 and CCL4L1.

To evaluate the correctness of our results, three phylogenetic trees are constructed for different regions of the CCL15 and CCL23 genes (Figure 5-4). The trees shown in Figure 5-4B indicate that the evolutionary histories of these three regions are different. In the exon 1 region, the conversion event seems to have occurred after the separation of humans and New World monkeys, while in the exon 2 region a later conversion event occurred after the speciation event between humans and Old World monkeys. At the same time, several gene conversion events occurred in the coding regions between two black lemur genes. These inferences are consistent with our results, which are shown in Figure 5-5.


Figure 5-4: Phylogenetic trees for the CCL gene cluster. (A) Positions in the aligned sequences of the CCL gene family at which at least three genes share a mutation from the consensus sequence are shown. The alignment is divided into three regions based on our detected conversion events. (B) A neighbor-joining phylogenetic tree (1000 bootstraps) is constructed for each region.
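The selection of alignment columns shown in panel (A) can be sketched as follows. This is an illustration under stated assumptions, not the script used to draw the figure: the consensus is taken to be the most frequent character in each column, gap characters are not counted as shared mutations, and the function name and example sequences are hypothetical.

```python
from collections import Counter


def shared_variant_columns(alignment, min_shared=3):
    """Indices of columns where at least `min_shared` sequences share the
    same non-consensus base.

    `alignment` is a list of equal-length aligned sequences.
    Assumptions: the consensus is the most frequent character in the column,
    and gaps ('-') do not count as shared mutations.
    """
    selected = []
    for index, column in enumerate(zip(*alignment)):
        counts = Counter(column)
        consensus = counts.most_common(1)[0][0]
        for base, count in counts.items():
            if base != consensus and base != "-" and count >= min_shared:
                selected.append(index)
                break
    return selected


# Hypothetical alignment of seven short sequences.
example_alignment = ["ACGT", "ACGT", "ACGT", "ACGT", "ATGT", "ATGT", "ATGA"]
columns = shared_variant_columns(example_alignment)
```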


Figure 5-5: Evidence of gene conversions between the CCL15 and CCL23 genes. (A) The red line is the self-alignment of gorilla and the blue line shows the pairwise alignment between gorilla and dusky titi. A conversion region (yellow region) is detected around exon 1. (B) A conversion region around exon 2 is detected in gorilla (red line) using ag monkey (blue line) as the outgroup species. (C) Two conversion events are found in the coding regions of black lemur.

Figure 5-6 shows the inferred evolutionary histories of the CCL gene cluster. The cluster contained five genes at the root of the primates and has undergone significant expansion after the separation of humans and Old World monkeys. Several gene conversion events occurred in different lineages.


Figure 5-6: Inferred evolutionary histories for the CCL gene cluster.

5.2.3 IFN gene cluster (hg18.chr9:21,048,761-21,471,698)

The human type-I interferon gene cluster spans about 500k bases on chromosome 9 and has been associated with important effects such as an antiproliferative effect and the modulation of expression of cell surface modules (Diaz 1995). The mammalian type-I interferon gene cluster can be divided into several subfamilies, i.e. IFN-β, IFN-α, IFN-ω and IFN-ε. Among these subfamilies, IFN-α is the largest gene family and can be separated into two groups, i.e. a distal group and a proximal group, based on their location. The distal group, which originated more recently, contains genes that are closer to the telomere. On the other hand, the proximal group, which is closer to the centromere, expanded earlier (Diaz 1995). Previous studies (Woelk et al. 2007) have shown that the interferon gene cluster is subject to the influence of gene conversion. To study gene conversion events in this gene cluster, 7 species from the National Human Genome Research Institute (NHGRI) are used and 76 genes are extracted using GeneWise2 (Birney et al. 2004). All conversion events detected in this gene cluster are shown in Figure 5-7. The results show that there are many conversion events in the IFN-α gene family, especially in the distal group.

Figure 5-7: Gene tree and detected conversion events for the Interferon gene cluster.


Our results suggest that there are many conversions in the 5′ half of the gene and its 5′ flanking region, and also many conversions in the 3′ half of the gene and its 3′ flanking region. To understand in more detail how gene conversion affects the phylogeny of the distal group, the alignment of the genes in the distal group has been divided into two regions as shown in Figure 5-8A. The phylogenetic tree for each region is shown in Figure 5-8B. The phylogenies of these two regions are quite different, which could be the consequence of gene conversion. Our method suggests that there are some conversion events in the 5′ half of the gene and its 5′ flanking region among the Old World monkeys. In addition, many conversion events occurred in the 3′ half of the gene and its 3′ flanking region in both human and Old World monkeys.


Figure 5-8: Influences of gene conversion on the phylogeny of the distal group. (A) Positions in the aligned sequences of the IFNA gene family at which at least three genes share a mutation from the consensus sequence are shown. The alignment is divided into two regions based on our detected conversion events. (B) A neighbor-joining phylogenetic tree (1000 bootstraps) is constructed for each region.

Chapter 6

Conclusions and future works

6.1 Conclusions

In this thesis, we have developed a method that uses phylogenetic trees to identify orthologous genes within gene clusters. A detailed analysis in the α- and β-globin gene clusters indicates that orthology relationships inferred this way are reasonably accurate. These orthology assignments were applied to analyze the quality of alignments produced by several alignment programs, i.e., MAVID, MLAGAN, TBA and ROAST. The results show that the performance of

ROAST is better than the others in terms of sensitivity and specificity. We also analyzed two aligners, ROAST and Multiz on 13 gene clusters.

Furthermore, gene conversion detection methods are applied to the whole genome, to individual gene clusters and to pairs of genes. For pairs of genes, detailed gene conversion information between the beta and delta genes and between the two gamma genes is studied. For the whole genome, 1,112,202 highly conserved pairs of human genomic intervals are analyzed; conversion events are detected for about 13.5% of them. This method is also applied to three individual gene clusters, i.e. the beta-globin gene cluster, the CCL gene cluster and the IFN gene cluster. The results for these gene clusters are consistent with previous studies.


6.2 Future Works

The remaining work includes both improvements to the tree-based evaluation strategy and development of a synergistic strategy based on alignments in non-coding regions. Some of the improvements to the tree-based strategy consist of refining steps in the current pipeline. For instance, the gene-prediction tool, i.e. GeneWise2, used in this project could not always extract genes accurately in the following cases: (1) distant species; (2) genes containing different numbers of exons; (3) genes with small exons and (4) genes containing many (e.g., more than 30) exons. We will investigate ways of improving this situation; a simple idea is to combine results from several gene prediction tools, as described e.g. by Shah 2003. A related goal is to make the evaluation process more automatic; the approach used so far involves some manual guidance and checking of the orthology assignments, but for efficient evaluation of genome-wide alignments, more extensive automation will be necessary. We will also develop tools to utilize and evaluate alignments in non-coding regions. A major goal will be to combine the gene conversion information into the inference of orthologs.

Bibliography

Abramowitz, M., and Stegun, I.A. 1965. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. New York.

Aguileta, G., Bielawski, J. P. and Yang, Z. 2004. Gene conversion and functional divergence in the beta-globin gene family. J. Mol. Evol. 59: 177-189.

Aguileta, G., Bielawski, J. P. and Yang, Z. 2006a. Evolutionary rate variation among vertebrate β globin genes: Implications for dating gene family duplication events. Gene 1: 21-29.

Aguileta, G., Bielawski, J. P. and Yang, Z. 2006b. Proposed standard nomenclature for the alpha- and beta-globin gene families. Genes Genet. Syst. 81: 367-71.

Benjamini, Y., and Hochberg, Y. 1995. Controlling the False Discovery Rate: a practical and powerful approach to multiple testing. J. Royal Stat. Soc. B 57: 289–300.

Benovoy, D. and Drouin, G. 2009. Ectopic gene conversions in the human genome. Genomics 93: 27–32.

Birney, E., Clamp, M., and Durbin, R. 2004. GeneWise and Genomewise. Genome Res. 14: 988-995.

Blanchette, M., Kent, W.J., Riemer, C., Elnitski, L., Smit, A.F., Roskin, K.M., Baertsch, R., Rosenbloom, K., Clawson, H., Green, E.D., et al. 2004. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 14: 708-715.

Boni, M.F., Posada, D., and Feldman, M.W. 2007. An exact nonparametric method for inferring mosaic structure in sequence triplets. Genetics 176: 1035–1047.

Bray, N., and Pachter, L. 2004. MAVID: Constrained Ancestral Alignment of Multiple Sequences. Genome Res. 14: 693-699.

Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, NISC Comparative Sequencing Program, Green ED, Sidow A, and Batzoglou S. 2003. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 13: 721-731.

Chen, J.-M., Cooper, D. N., Chuzhanova, N., Férec, C., and Patrinos, G. P. 2007. Gene conversion: mechanisms, evolution and human disease. Nat. Rev. Genet. 8: 762-775.

Chenna, R., Sugawara, H., Koike, T., Lopez, R., Gibson, T. J., Higgins, D. G. and Thompson, J. D. 2003. Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res. 31: 3497-3500.

Diaz MO. 1995. The human type I interferon gene cluster. Semin. Virol. 6: 143-149.

Drouin, G., Prat, F., Ell, M., and Clarke, G. D. P. 1999. Detecting and characterizing gene conversion between multigene families. Mol. Biol. Evol. 16: 1369-1390.

Drouin, G. 2002. Characterization of the gene conversions between the multigene family members of the yeast genome. J. Mol. Evol. 55: 14–23.

Edgar, R. C. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32: 1792-1797.

ENCODE Project Consortium. 2004. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 306: 636-640.

Ezawa, K., Oota, S., and Saitou, N. 2006. Genome-wide search of gene conversions in duplicated genes of mouse and rat. Mol. Biol. Evol. 23: 927–940.

Feller, W. 1957. An Introduction to Probability Theory and Its Applications, Vol. I. John Wiley & Sons, New York.

Felsenstein, J. 1987. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 27: 401-410.

Felsenstein, J., and Churchill, G. A. 1996. A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13: 93-104.

Felsenstein, J. 2002. Phylip Phylogeny Inference Package Version 3.6. http://evolution.gs.washington.edu/phylip.html.

Fiers, W. et al. 1976. Complete nucleotide-sequence of bacteriophage MS2-RNA - primary and secondary structure of replicase gene. Nature 260: 500-507.

Fitch, W.M. and Margoliash, E. 1967. A method for estimating the number of invariant amino acid coding positions in a gene using cytochrome c as a model case. Biochem. Genet. 1: 65-71.

Fitch, D. H. A., and Goodman, M. 1991. Phylogenetic scanning: a computer-assisted algorithm for mapping gene conversions and other recombinational events. CABIOS 7: 207-215.

Foster, P.G. and Hickey, D.A. 1999. Compositional bias may affect both DNA-based and protein-based phylogenetic reconstructions. J. Mol. Evol. 48: 284-290.

Gilbert, M. T. P., Ho, S., Qi, J., Hsu, C.-H., et al. 2007. Mammoth population genetics using complete mitochondrial genomes. Submitted.

Golubchik, T., Wise, M. J., Easteal, S., and Jermiin, L. S. 2007. Mind the Gaps: Evidence of Bias in Estimates of Multiple Sequence Alignments. Mol. Biol. Evol. 24: 2433-2442.

Goodman, M., Czelusniak, J., Moore, G.W., Romero-Herrera, A.E. and Matsuda, G. 1979. Fitting the gene lineage into its species lineage: a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst. Zool. 28: 132-168.

Gu, X., Fu, Y.-X., and Li, W.-H. 1995. Maximum likelihood estimation of the heterogeneity of substitution rate among nucleotide sites. Mol. Biol. Evol. 12: 546-557.

Hardies SC, Edgell MH, Hutchison CA 3rd. 1984. Evolution of the mammalian β-globin gene cluster. J. Biol. Chem. 259: 3748-3756.

Higgs, D.R., Wainscoat, J.S., Flint, J., Hill, A.V.S., Thein, S.L., Nicholls, R.D., Teal, H., Ayyub, H., Peto, T.E.A. and Falusi, A.G. 1986. Analysis of the human alpha-globin gene cluster reveals a highly informative genetic locus. Nucleic Acids Res. 14: 1903-1911.

Holm, S. 1979. A simple sequential rejective multiple test procedure. Scand. J. Statistics 6: 65–70.

Hou, M. et al. 2005. Aligning Multiple Genomic Sequences That Contain Duplications. Manuscript.

Hou, M. et al. 2007. Algorithm for Processing Genomic Sequences That Contain Duplications. Doctoral Dissertation.

Hurles, M. 2004. Gene duplication: the genomic trade in spare parts. PLoS Biology 2: 900–904.

Jackson, M.S., et al. 2005. Evidence for widespread reticulate evolution within human duplicons. Am. J. Hum. Genet. 77: 824–840.

Jukes, T. H., and Cantor, C. R. 1969. Evolution of protein molecules. In: Mammalian Protein Metabolism, H. N. Munro, ed., pp. 21-132. Academic Press, New York.

Kedzierska, A., and Husmeier, D. 2006. A heuristic Bayesian method for segmenting DNA sequence alignments and detecting evidence for recombination and gene conversion. Stat. Appl. Genet. Mol. Biol. 5: Epub 2006 Oct. 24.

Kent, W.J., Sugnet, C. W., Furey, T. S., Roskin, K.M., Pringle, T. H., Zahler, A. M., and Haussler, D. 2002. The Human Genome Browser at UCSC. Genome Res. 12: 996-1006.

Kent, W.J., Baertsch, R., Hinrichs, A., Miller, W., and Haussler, D. 2003. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc. Natl. Acad. Sci. 100: 11484-11489.

Koonin, E. V. 2005. Orthologs, paralogs, and evolutionary genomics. Annu. Rev. Genet. 39: 309-38.

Liskay, R.M., Letsou, A., and Stachelek, J. 1987. Homology requirement for efficient gene conversion between duplicated chromosomal sequences in mammalian cells. Genetics 115: 161–167.

Ma, J., Zhang, L., Suh, B.B., Raney, B.J., Burhans, R.C., Kent, W.J., Blanchette, M., Haussler, D., and Miller, W. 2006. Reconstructing contiguous regions of an ancestral genome. Genome Res. 16: 760-774.

Ma, J., Ratan, A., Zhang, L., Miller, W., and Haussler, D. 2007. A heuristic algorithm for reconstructing ancestral gene orders with duplications. In Lecture Notes in Computer Science, Vol. 4751, Springer, Berlin, pp. 122-135.

Ma, J., Ratan, A., Raney, B.J., Suh, B.B., Miller, W., and Haussler, D. 2008. The infinite sites model of genome evolution. Proc. Natl. Acad. Sci. 105: 14254–14261.

Mackenzie PI, Walter Bock K, Burchell B, Guillemette C, Ikushiro S, Iyanagi T, Miners JO, Owens IS, Nebert DW. 2005. Nomenclature update for the mammalian UDP glycosyltransferase (UGT) gene superfamily. Pharmacogenet. Genomics 15: 677-85.

Makova, K., Yang, S., and Chiaromonte, F. 2004. Insertions and deletions are male biased too: a whole-genome analysis in rodents. Genome Res. 14: 567–573.

Margulies E, Cooper GM, Asimenos G, Thomas DJ, Dewey CN, Siepel A, Birney E, Keefe D, Schwartz AS, Hou M, Taylor J, Nikolaev S, Montoya-Burgos JI, Lytynoja A, Whelan S, Pardi F, Massingham T, Brown JB, Bickel P, Holmes I, Mullikin JC, Ureta-Vidal A, Paten B, Schuler G, Church D, Rosenbloom KR, Kent WJ, NISC Comparative Sequencing Program, Baylor College of Medicine Human Genome Sequencing Center, Washington University Genome Sequencing Center, Broad Institute, UCSC Genome Browser Team, Antonarakis SE, Batzoglou S, Goldman N, Hardison R, Haussler D, Miller W, Pachter L, Green ED, Sidow A. 2007. Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome. Genome Res. 17: 760-774.

McGrath, C. L., Casola, C., and Hahn, M. W. 2009. Minimal Effect of Ectopic Gene Conversion Among Recent Duplicates in Four Mammalian Genomes. Genetics 182: 615–622.

Mirkin, B., Muchnik, I., and Smith, T. F. 1995. A biologically consistent model for comparing molecular phylogenies. J. Comput. Biol. 2: 493-507.

Mizuguchi, K., Deane, C. M., Blundell, T. L. and Overington, J. P. 1998. HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci. 7: 2469-2471.

Modi, W.S., et al. 2006. Genetic Variation in the CCL18-CCL3-CCL4 Chemokine Gene Cluster Influences HIV Type 1 Transmission and AIDS Disease Progression. Am. J. Hum. Genet. 79: 120–128.

Ohno, S. 1967. Monographs on Endocrinology. Sex Chromosomes and Sex-linked Genes. Vol I. Springer-Verlag, Berlin, Heidelberg, and New York.

Ovcharenko, I., Loots, G.G., Giardine, B.M., Hou, M., Ma, J., Hardison, R.C., Stubbs, L., and Miller, W. 2005. Mulan: multiple-sequence local alignment and visualization for studying function and evolution. Genome Res. 15: 184-194.

Page, R.D. 1998. GeneTree: Comparing gene and species phylogenies using reconciled trees. Bioinformatics 14: 819-820.

Page, R.D.M. 1994. Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas. Syst. Biol. 43: 58-77.

Paola Bonizzoni, Gianluca Della Vedova, Riccardo Dondi. 2003. Reconciling Gene Trees to a Species Tree. CIAC 120-131.

Phillips MJ, Delsuc F, Penny D. 2004. Genome-scale phylogeny and the detection of systematic biases. Mol. Biol. Evol. 21: 1455-1458.

Prakash, A., and Tompa, M. 2007. Measuring the accuracy of genome-size multiple alignments. Genome Biology 8: R124.

Prychitko, T., Johnson, R. M., Wildman, D. E., Gumucio, D., and Goodman, M. 2005. The phylogenetic history of New World monkey β globin reveals a platyrrhine β to δ gene conversion in the atelid ancestry. Mol. Phylo. Evol. 35: 225-234.

Raghava, G. P., Searle, S. M., Audley, P. C., Barber, J. D. and Barton, G. J. 2003. OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics 4: 47.

Reiter, L.T. et al. 1998. Human meiotic recombination products revealed by sequencing a hotspot for homologous strand exchange in multiple HNPP deletion patients. Am. J. Hum. Genet. 62: 1023–1033.

Sawyer, S. 1989. Statistical tests for detecting gene conversion. Mol. Biol. Evol. 6: 526-538.

Schildkraut, E., Miller, C.A., and Nickoloff, J.A. 2005. Gene conversion and deletion frequencies during double-strand break repair in human cells are controlled by the distance between direct repeats. Nucleic Acids Res. 33: 1574–1580.

Schmidt, H.A., Petzold, E., Vingron, M., and von Haeseler, A. 2003. Molecular Phylogenetics: Parallelized Parameter Estimation and Quartet Puzzling. J. Parallel Distrib. Comput. 63: 719-727.

Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R.C., Haussler, D., and Miller, W. 2003. Human-mouse alignments with BLASTZ. Genome Res. 13: 103-107.

Semple, C., and Wolfe, K.H. 1999. Gene duplication and gene conversion in the Caenorhabditis elegans genome. J. Mol. Evol. 48: 555–564.

Shah, S. P., Vicker, G. P., Mackworth, A. K., Rogic, S., and Ouellette, B. F. 2003. GeneComber: combining outputs of gene prediction programs for improved results. Bioinformatics 19: 1296-1297.

Smith, A. B. 1994. Rooting molecular trees: problems and strategies. Biol. J. Linn. Soc. 51: 279-92.

Smith, R. A., Ho, P. J., Clegg, J. B., Kidd, J. R., and Thein, S. L. 1998. Recombination Breakpoints in the Human beta-Globin Gene Cluster. Blood 92: 4415–4421.

Sugawara, N., Ira, G., and Haber, J.E. 2000. DNA length dependence of the single-strand annealing pathway and the role of Saccharomyces cerevisiae RAD59 in double-strand break repair. Mol. Cell. Biol. 20: 5300–5309.

Sullivan, J., and Swofford, D. L. 1997. Are guinea pigs rodents? The importance of adequate models in molecular phylogenetics. J. Mamm. Evol. 4: 77–86.

Swofford, D. L., Olsen, G. P., Waddell, P. J., and Hillis, D. M. 1996. Phylogenetic inference. In: Molecular Systematics, 2nd ed., D. M. Hillis, C. Moritz, and B. K. Mable, eds., pp. 407-514.

Van Den Bussche R A, Baker R J, Huelsenbeck J P, Hillis D M. 1998. Base compositional bias and phylogenetic analyses: A test of the 'flying DNA' hypothesis. Mol. Phylogen. Evol. 10: 408-416.

Waldman, A.S., and Liskay, R.M. 1988. Dependence of intrachromosomal recombination in mammalian cells on uninterrupted homology. Mol. Cell. Biol. 8: 5350–5357.

Wang, A. X., Ruzzo, W., and Tompa, M. 2007. How accurately is ncRNA aligned within whole-genome alignments? BMC Bioinformatics 8: 417.

Wilson, M.D., Riemer, C., Martindale, D., Schnupf, P., Boright, A., Cheung, T., Hardy, D., Schwartz, S., Scherer, S., Tsui, L.-C., Miller, W. and Koop, B.F. 2001. Comparative analysis of the gene dense ACHE/TFR2 region on human chromosome 7q22 with the orthologous region on mouse chromosome 5. Nucleic Acids Res. 29: 1352-1365.

Woelk, C.H., Frost, S.D., Richman, D.D., Higley, P.E. and Kosakovsky Pond, S.L. 2007. Evolution of the interferon alpha gene family in eutherian mammals. Gene 397: 38–50.

Xu, S., Clark, T., Zheng, H., Vang, S., Li, R., Wong, G. K., Wang, J., and Zheng, X. 2008. Gene Conversion in the rice genome. BMC Genomics 9: 93.

Yuan, Y.P., Eulenstein, O., Vingron, M. and Bork, P. 1998. Towards detection of orthologues in sequence database. Bioinformatics 14: 285-289.

Zhang, Y., Song, G., Vinar, T., Green, E., Siepel, A., and Miller, W. 2007. Evolutionary analysis of human gene clusters. Submitted.

Zhang, Z., Raghavachari, B., Hardison, R.C., and Miller, W. 1994. Chaining multiple-alignment blocks. J. Comp. Biol. 1: 217–226.

VITA

Chih-Hao Hsu

Mr. Chih-Hao Hsu received his B.S. degree in Computer Science from National Tsing-Hua University, HsinChu, Taiwan, in 1999, and his M.S. degree in Computer Science from National Tsing-Hua University, HsinChu, Taiwan, in 2001. In fall 2004, he joined the Ph.D. program in the Department of Computer Science and Engineering at the Pennsylvania State University. Since then he has been a research assistant working with Dr. Webb Miller at the Center for Comparative Genomics and Bioinformatics.