Analyzing Comparative Sequence Data to Understand Genome Function and Evolution

by Arjun B. Prasad

B.S., December 1999, University of California, Santa Barbara

A dissertation submitted to the Faculty of the Columbian College of Arts and Sciences of The George Washington University in partial satisfaction of the requirements for the degree of Doctor of Philosophy

January 31, 2010

Dissertation directed by

Eric D. Green
Section Head, Physical Mapping Section
Acting Scientific Director, Division of Intramural Research
Director, National Human Genome Research Institute
National Institutes of Health

The Columbian College of Arts and Sciences of The George Washington University certifies that Arjun Balmiki Prasad has passed the Final Examination for the degree of Doctor of Philosophy as of December 14, 2009. This is the final and approved form of the dissertation.

Analyzing Comparative Sequence Data to Understand Genome Function and Evolution Arjun B. Prasad

Dissertation Research Committee:

Eric D. Green
Director, National Human Genome Research Institute
National Institutes of Health
Director

William J. Pavan
Head, Mouse Embryology Section
Senior Investigator, Genetic Disease Research Branch
National Human Genome Research Institute
National Institutes of Health
Committee Member

Marc W. Allard
Research Microbiologist, Office of Regulatory Science
Food and Drug Administration
Committee Member

To Betzabé Rosas Gasca, the love of my life.

To my parents who, in their different ways, encouraged me to work, think, and understand.

To my sister, Dee, who has always been there for me.

Acknowledgements

This dissertation would not have been possible without the work, efforts, and support of many people. I would first and foremost like to thank my wife, Betzabé, who has been with me every step of the way, and kept me mostly sane through all of them. Thank you.

Of course none of this would have been possible without the help and support of Eric Green. From the time when he first offered me a position in his lab 9 years ago, he has provided extraordinary opportunities and unrivaled support for me to pursue my scientific interests.

My doctoral advisory committee was also very helpful, and my thanks go out to each of them. Marc Allard provided an introduction to phylogenetics. Liliana Florea helped with focus and direction. Sydney Fu was very generous with his time, filling in for Liliana at the last minute. Jim Mullikin contributed many, many good ideas. Bill Pavan also gave generously of his time and provided many good suggestions, advice, and post-it notes when I needed them. Linda Werling has been an amazing support, always ready and quick to respond. She has been invaluable in helping deal with the always confusing GWU administration.

I have been privileged to work in the Green lab with so many outstanding friends, mentors, collaborators, and co-workers. Jim Thomas was the first person I worked with in Eric’s lab, and mentored me as well as developing the parallel mapping strategy that we used to isolate BAC clones for all of the sequencing. Valerie Braden, Jackie Idol, and Shih-Queen Lee-Lin are the heart of the lab, and have done the vast majority of the mapping work for the data analyzed here. I have worked with them for nine years now, and have been very lucky to have benefitted from their hard work, experience, and constant teasing. Matt Portnoy oversaw the mapping process for a time, and provided helpful feedback on ExactPlus while collaborating on the GDF6 work.

Many thanks also go out to Tony Antonellis who has been an inspiration to work with. He came up with the initial idea for ExactPlus and was instrumental in developing the design and interface. As “the best molecular biologist in the lab” Tony taught me a lot about the science and practice of molecular biology, and his example bears more than a little responsibility for my decision to go to graduate school.

Of course I have to acknowledge the great work done by the other NISC Comparative Sequencing Program members at the NIH Intramural Sequencing Center (NISC). They were responsible for generating the extremely high quality sequence used here and were led by Bob Blakesley, Gerry Bouffard, Jenny McDowell, Baishali Maskeri, Morgan Park, Nancy Hanson, and Pam Thomas. Elliott Margulies and Steve Parker generously provided data from ENCODE Project Consortium (2007) and Margulies et al. (2007) in convenient formats for the comparisons described in Chapter 2. Julien Dutheil and Asger Hobolth also generously provided data reported in Hobolth et al. (2007) and Dutheil et al. (2009) for the comparisons reported in Chapter 4.

The members of the “EvoFuGeno” interest group have been great friends and colleagues whose productive conversations have been incredibly helpful and a lot of fun. Aida Andres is a great friend and aid in working out my thoughts, giving very generously of her time, patience, and expertise. Her understanding of population genetics and evolution and our many discussions have been an invaluable aid to my thinking. Megan Dennis is another good friend with whom I have had many productive discussions, across the bay, at her house for home-made pizza, and over a beer. Her almost daily stories of crazy things that happened to her on the way to work have never failed to brighten my day.

Tim Hefferon who joined the bay more recently has, when not working in Costa Rica, been a great source of the witty repartee and pranks that make the lab feel so much like home. An honorary bay member is Derek Gildea who has always been there, whether I needed a ride, lunch, career advice, scientiﬁc discussion, dissertation help, or beer.

These colleagues are representative of the great environment at the National Human Genome Research Institute and in the Genome Technology Branch in particular. The environment is collegial, collaborative, supportive, and challenging all at once. It has been a great privilege to work here.

My experience as a graduate student would not have been the same without the amazing group we had in our class at GWU. Throughout the first two years of coursework and beyond we’ve been remarkably close, studying together and having fun together. Thanks to my classmates: Thamara Abouantoun, Jamie Andrews, Monika Arora, Laura Beaver, Brad Brooks, Jesse Damsker, Menahem Doura, Derek Gildea, Kristi Norris, Yevgeniya Nusinovich, Erica Reeves, Philip Ryan, Courtney Smith, Alan Watson, Heather Whitehurst, and Chad Williamson.

Of course my parents had a huge role to play in developing my love of science, learning and curiosity. Thanks to my father, my Bapu, for ceaselessly embarrassing me in front of my friends by explaining the detailed life habits of the bugs, trees, and flowers we would see. Thanks for all the home experiments you helped me conduct. Thanks for making me a part of your working life, letting me help agitate the trays, process film, assist in your photo shoots and generally showing me how learning can be a constant part of life. Thank you for encouraging me to explore other interests, and for telling me a PhD wasn’t worth it. You may have been right.

Thanks to my mom, Sandra, for really teaching me to love reading, research, and intellectual discussions. Thanks for always providing consistently challenging discussions about almost any topic, and for encouraging me to take the winding path that I’ve taken. I’m also grateful for the frequent reminders that there are very different perspectives of how to view the world, and that I should be ready to engage and try to understand them at any time.

Thanks to my sister, Dee, who has been a great friend, support, and example to me throughout my life.

Abstract

Analyzing Comparative Sequence Data to Understand Genome Function and Evolution

The ever-accelerating production of genome sequence from numerous species is providing new opportunities to examine evolution and evolutionary processes. My thesis work aimed to explore applications of comparative genome sequence datasets. As a first step, we developed a computational method (ExactPlus) that takes advantage of the experiments conducted by natural selection to identify conserved non-coding sequences. This method proved comparable to several other methods of identifying conserved non-coding sequences, and was successfully applied to identify candidates for functional assays of gene-regulatory potential. We next explored the utility of large comparative genome sequence datasets for inferring the phylogenetic relationships among mammals. The large amount of data allowed high-confidence inferences to be made, even for difficult-to-resolve taxa (such as Atlantogenata, Glires, and Theria); however, for this to be successful, we had to carefully control for sources of bias, such as base composition and alignment error. We found a remarkable level of heterogeneity in tree support among regions. To better understand these patterns, we developed a sliding window-based approach (PartFinder) to identify the boundaries of congruent blocks, and validated this method by using it to examine the genetic relationships among human, chimpanzee, and gorilla. In aggregate, this body of work demonstrates the utility and promise of comparative genome sequence datasets when combined with evolutionary and genomic techniques.

Table of Contents

Dedication
Acknowledgements
Abstract
Table of Contents
List of Figures
List of Tables
Abbreviations

1 Using comparative genome sequence data in evolutionary studies
    1.1 Evolutionary relationships between species
        1.1.1 Alignment
        1.1.2 Tree searching
        1.1.3 Gene trees versus species trees
        1.1.4 Convergent evolution, homoplasy, and saturation
        1.1.5 Distance based methods
        1.1.6 Maximum parsimony
        1.1.7 Maximum likelihood
        1.1.8 Nucleotide substitution models
        1.1.9 Bayesian phylogenetic analysis
        1.1.10 Confidence measures
        1.1.11 Systematic biases
    1.2 Conservation and function
        1.2.1 Conserved non-coding sequences
    1.3 Conclusions

2 ExactPlus: A user-friendly method for identifying short highly conserved genomic sequences
    2.1 Introduction
    2.2 Algorithm and implementation
        2.2.1 Constraints for web interface
        2.2.2 Architecture of web-accessible version of ExactPlus
        2.2.3 Discussion of implementation
    2.3 Comparison to other methods for identifying conserved sequences
        2.3.1 Other methods for conservation detection
        2.3.2 Methods
        2.3.3 Results
    2.4 Experimental validation
        2.4.1 GDF6
        2.4.2 Sox10
    2.5 Discussion

3 Refining the phylogeny of mammals with a large comparative genomic sequence dataset
    3.1 Introduction
    3.2 Methods
        3.2.1 Sequence data
        3.2.2 Alignment
        3.2.3 Annotation
        3.2.4 Tree searching
    3.3 Results and discussion
        3.3.1 Overview
        3.3.2 Non-phylogenetic signals
        3.3.3 Individual taxa
        3.3.4 Summary

4 Landscapes of incongruence and the phylogeny of primates
    4.1 Introduction
        4.1.1 Why are different segments of DNA incongruent?
        4.1.2 Measures of incongruence
        4.1.3 Estimating species trees
        4.1.4 Incongruence in data from Prasad et al., 2008
        4.1.5 Spatial arrangement of incongruent regions
        4.1.6 Incongruence in the Homo-Pan-Gorilla Group
    4.2 Materials and methods
        4.2.1 Sequences
        4.2.2 PartFinder
        4.2.3 Tree searching
        4.2.4 Simulations
        4.2.5 Distributions
    4.3 Results
        4.3.1 Homo-Pan-Gorilla Group
    4.4 Discussion

5 Summary and future directions
    5.1 ExactPlus and model based methods
    5.2 Phylogeny of mammals and other approaches
        5.2.1 Future directions
    5.3 Landscapes of incongruence and PartFinder
        5.3.1 Rare genomic character phylogenetics and congruence
        5.3.2 Future directions

Bibliography

Appendix
    Alignments for coding indels described in Chapter 3
    Portnoy et al., 2005. Detection of potential GDF6 regulatory elements by multispecies sequence comparisons and identification of a skeletal joint enhancer
    Antonellis et al., 2006. Deletion of long-range sequences at Sox10 compromises developmental expression in a mouse model of Waardenburg-Shah (WS4) syndrome
    Prasad et al., 2008. Confirming the phylogeny of mammals by use of large comparative sequence data sets

List of Figures

1.1 Illustration of how lineage sorting can cause gene trees and species trees to differ
1.2 Illustration of non-parametric bootstrapping of alignments
1.3 View of genomic sequence conservation in and outside of exons

2.1 Alignment and parameter input screen for ExactPlus web interface
2.2 Architecture of ExactPlus web interface
2.3 State-transition diagram for phastCons phylo-HMM
2.4 ExactPlus overlaps with other methods
2.5 Comparison of the sizes of individual sequences marked as conserved
2.6 Overlap with functional annotations
2.7 WebMCS and ExactPlus for the GDF6 genomic region
2.8 Sox10 genomic context and ExactPlus analysis

3.1 Coding sequence maximum likelihood tree
3.2 Bayesian bootstrap vs. maximum likelihood bootstrap
3.3 RY-coded coding sequence maximum likelihood tree
3.4 RY-coded maximum likelihood tree from codon partitioned coding plus conserved non-coding sequence matrix
3.5 Informative coding indels
3.6 ML trees for each of 10 partitions
3.7 Neighbor joining tree for coding plus conserved non-coding matrix
3.8 Three roots for Placentalia

4.1 The three possible gene trees that could explain incongruence due to lineage sorting in the Homo-Pan-Gorilla Group
4.2 Maximum likelihood bootstrap tree
4.3 Example of PartFinder results for Homo-Pan-Gorilla group
4.4 PartFinder results with simulated data
4.5 Box plots of genome characteristics in 5-kb partitions supporting alternative relationships in the Homo-Pan-Gorilla group
4.6 Box plots of genome characteristics in 2-kb partitions supporting alternative relationships in the Homo-Pan-Gorilla group

List of Tables

1.1 Number of possible phylogenetic trees increases rapidly with numbers of taxa
1.2 Parameters used by the GTR model of sequence evolution

2.1 ExactPlus directories and permissions
2.2 Putatively conserved element (PCE) size distribution statistics for conservation methods
2.3 Relationship of MCSs detected by ExactPlus and WebMCS in GDF6 region

3.1 Multispecies comparative sequence dataset used in Prasad et al., 2008
3.2 Pairwise incongruence length difference tests between natural partitions
3.3 Relative likelihood support for placental root across coding plus conserved non-coding sequence matrix
3.4 ILD test results for partitions based on missing data

4.1 Originating species and accession numbers of sequences
4.2 Proportions of regions supporting each possible arrangement within the Homo-Pan-Gorilla group

Abbreviations

5’UTR Exonic sequence that is transcribed but not translated 5’ of the coding region of a gene.

3’UTR Exonic sequence that is transcribed but not translated 3’ of the coding region of a gene.

AR Ancestral repeat. A repeat sequence inserted prior to the last common ancestor of the placental mammals.

bp Base pairs.

BAC Bacterial artificial chromosome.

CoalHMM Coalescent hidden Markov model, specifically the model used by Dutheil et al. (2009).

CF Cavender-Felsenstein two-state model of evolution (Cavender and Felsenstein, 1987).

CDS Protein coding sequence.

CGI Common gateway interface—an interface protocol for web servers to interact with other programs running on the system.

DHS DNase I hypersensitive site.

GC content Guanine and Cytosine content.

GTR General time reversible model of base substitution, also known as REV (described in Salemi and Vandamme, 2003, Chapter 4).

HMM Hidden Markov model.

ILD test Incongruence length difference test, also known as the partition homogeneity test (see Bull et al., 1993, for details).

indel Insertion and/or deletion; used to describe mutation events that cause changes in the length of DNA or protein sequences.

FAIRE Formaldehyde-assisted isolation of regulatory elements (see Giresi et al., 2007, for details).

kb Thousand (or kilo) base pairs.

Mb Million (or Mega) base pairs.

MCMC Markov chain Monte Carlo; a method that can be used to estimate the posterior probability using Bayesian statistics (see Section 1.1.9 for explanation).

MCS Multispecies conserved sequence.

ML Maximum likelihood.

NISC NIH Intramural Sequencing Center.

PCE Putatively conserved element.

PJE Proximal joint element; an enhancer element required for GDF6 expression in proximal joints (see Mortlock et al., 2003, for details).

RFBR Regulatory factor bound region; region of genomic sequence found to be bound by proteins involved in the regulation of expression.

RxFrag Transcripts detected using RACE followed by hybridization to tiling arrays (see ENCODE Project Consortium, 2007, for details).

RY-coding Coding of nucleotides as either purines (R) or pyrimidines (Y).

SH-test Shimodaira-Hasegawa test of alternative topologies. Described in Shimodaira and Hasegawa (1999) and Section 1.1.10.

SNP Single Nucleotide Polymorphism.

SOX10 Human SRY-box containing protein 10.

Sox10 Mouse SRY-box containing protein 10.

SPR Subtree pruning and regrafting. A heuristic method of branch swapping for trees that is faster and a bit less thorough than TBR (described in Salemi and Vandamme, 2003, Chapter 7).

SqSp.RFBR Sequence-speciﬁc RFBR; i.e., region bound by proteins known to bind to DNA in a sequence-speciﬁc manner.

SUID Set user ID; A UNIX file permissions flag that indicates that the program should be run as the owner of the file rather than the user running the program.

TBR Tree bisection and reconnection. A heuristic method of branch swapping for trees that is more thorough than SPR (see Cunningham, 1997, for more details).

TFBS Transcription factor binding sites; speciﬁc genomic locations where transcription regulatory factors bind.

TxFrag A fragment identiﬁed by microarray hybridization to be expressed as RNA.

Un.TxFrag TxFrag that is not found as part of a known coding exon.

UPGMA Unweighted Pair Group Method with Arithmetic mean; a clustering algorithm sometimes used for creating a tree from phylogenetic distance measures.

UTR Untranslated region.

wt Wild-type.

Chapter 1

Using comparative genome sequence data in evolutionary studies

The “genome era” is characterized by the increasing availability of large genome-wide datasets and the development of techniques to analyze them (Collins et al., 2003). The generation of this sequence data has created many opportunities for new types of studies and analyses. One broad set of such studies is comparative genomics: the use of genome sequence comparisons among species to better understand the function and evolution of genomes and the biology of the corresponding species (Ellegren, 2008; Gregory, 2004). Comparative genomic approaches have been developed to discover functional elements, make inferences about the patterns and strengths of selection, and examine the patterns of evolutionary change (e.g., Antonellis et al., 2006; Ellegren, 2008; Gregory, 2004; Kosiol et al., 2006; Margulies and Birney, 2008; Portnoy et al., 2005; Rannala and Yang, 2008). Two applications of comparative genomics are explored here: using patterns of conservation to identify functional non-coding regions and using these large-scale datasets to infer the relationships between species.

1.1 Evolutionary relationships between species

Understanding the evolutionary relationships between species lies at the base of all comparative genomic techniques, indeed at the base of all techniques where one species is used as a model or example for another. The knowledge of the evolutionary relationships between species is not only interesting for its own sake, but is the background upon which comparisons of the pattern of genomic change can be examined to understand the history and mode of evolutionary change. The discovery of the historic relationships between species is made more difficult and controversial because these relationships cannot be directly observed, but must be inferred from the data available to us today.

The cladistic approach uses shared characters to infer evolutionary relationships, usually summarized in tree form. These trees are fundamental to understanding how evolution has been responsible for the characteristics of living things. Intrinsic characteristics of species such as morphological characters can be used to infer phylogenetic relationships. The use of molecular data as phylogenetic characters has become extremely popular because they provide abundant, easily and unambiguously typed characters for phylogenetic inference.

Molecular phylogenetics in fact began even before the advent of molecular biology with studies measuring immune cross reactivity. Protein sequencing, protein electrophoresis, DNA-DNA hybridization and morphological characteristics had all been used to determine similarities between species before the advent of modern DNA sequencing techniques. The development of high-throughput DNA sequencing techniques has created a rapid increase in the use of DNA and amino acid sequence data for phylogenetic inference and spurred the development of many methods and technologies to take advantage of these resources.

DNA sequences have a number of advantages as characters for use in phylogenetic studies. First and foremost they are quickly, unambiguously, and cheaply determined. In fact most of the procedures used to identify and call DNA bases are completely automated, and thus can be performed in a relatively unbiased manner. With the advent of even higher throughput DNA sequencing projects and large sequence databases, it is increasingly possible to perform many experiments solely with data already in databases (e.g., Hallstrom and Janke, 2008; Nishihara et al., 2007; Hughes and Friedman, 2007). While these characteristics are extremely well suited to some questions, morphological approaches are important because only those methods applied to fossil data can give estimates of time or consider extinct taxa. Although DNA characters have provided abundant sources of information for phylogenetic studies, independent and complementary sources of morphological data should not be ignored.

Phylogenetic methods are based on the principle that genetic changes occur more or less randomly over time. A variety of methods have been developed to use this basic idea (also called the molecular clock) to infer the relationship between species. These methods have had a great deal of success in teasing out many of the relationships between species using the sequences of individual genes or small groups of genes; however, many relationships remain unresolved or controversial. The availability of large comparative genomic sequence datasets provided by high-throughput sequencing techniques provides an unprecedented resource to help answer questions about the relationships between species. Though qualitatively similar to other datasets, comparative genomic sequence datasets present new challenges and opportunities to understand the patterns of evolution.

1.1.1 Alignment

The first step towards inferring phylogenetic relationships is the inference of homology between individual base pairs. This inference of homology, called multiple sequence alignment, is done in a heuristic manner as optimal methods have not been developed. Alignment of bases is made difficult because non-protein coding sequence in particular has high rates of insertion and deletion (indel) mutations, and these can obscure the historical evolutionary relationship. Multiple sequence alignment methods require a phylogenetic tree to aid in the prioritization of gap placement, making any sequence-based phylogenetic study in some ways a circular argument.

A number of methods have been developed to minimize the effect of alignment bias on the resulting tree inferences. The most basic way to reduce the chance of inaccurate homology inferences is to consider only positions where there is enough similarity between sequences and/or few enough indel mutations that we can be fairly confident of the accuracy of the homology assumptions. Many phylogeneticists manually examine the alignments to remove any areas that are ambiguous. This manual editing of alignments becomes prohibitively time consuming when very large alignments are considered. Another method is to permute the tree used to guide the alignment process to identify any downstream effects an incorrect guide tree might have.
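The position-filtering idea described above can be sketched in a few lines. This is a minimal illustration, not the procedure used in this work: the function name `filter_columns`, the `max_gap_fraction` threshold, and the choice of which characters count as gaps or ambiguity codes are all assumptions made for the example.

```python
def filter_columns(alignment, max_gap_fraction=0.2, gap_chars="-?N"):
    """Keep alignment columns whose gap/ambiguity fraction is at or below a threshold.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    max_gap_fraction and gap_chars are illustrative defaults, not published values.
    """
    if not alignment:
        return {}
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")

    kept = []
    for i in range(length):
        column = [alignment[name][i].upper() for name in names]
        gaps = sum(1 for base in column if base in gap_chars)
        if gaps / len(column) <= max_gap_fraction:
            kept.append(i)

    # Rebuild each sequence from the retained column indices only.
    return {name: "".join(alignment[name][i] for i in kept) for name in names}


toy = {"sp1": "ACGT-A", "sp2": "ACGTTA", "sp3": "AC-T-A"}
print(filter_columns(toy))                         # strict default threshold
print(filter_columns(toy, max_gap_fraction=0.34))  # tolerate one gap in three
```

A gap-fraction cutoff is only one possible criterion; similarity-based or manually curated masks would slot into the same structure.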

To resolve this circularity of phylogenetic tree inference and alignment, methods have been developed that simultaneously estimate the tree and alignment (Phillips et al., 2000; Yue et al., 2009; Bradley et al., 2009). These methods are generally slow because of the computational requirements of joint estimation and must rely on heuristics that could result in becoming trapped in local optima. Additionally effects of model misspecification may be amplified.

Taxa   Rooted trees                             Unrooted trees
2      1                                        1
3      3                                        1
4      15                                       3
5      105                                      15
6      945                                      105
7      10,395                                   945
8      135,135                                  10,395
9      2,027,025                                135,135
10     34,459,425                               2,027,025
11     654,729,075                              34,459,425
12     13,749,310,575                           654,729,075
13     316,234,143,225                          13,749,310,575
...
22     2,376,613,641,928,796,906,249,519,104    13,113,070,457,687,988,603,440,625

Table 1.1: Number of possible phylogenetic trees increases rapidly with numbers of taxa. Data from Felsenstein (1978).
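The growth in the number of trees can be reproduced with the standard counting result for strictly bifurcating, labeled trees: (2n - 3)!! rooted trees and (2n - 5)!! unrooted trees for n taxa. The sketch below assumes that result; the function names are chosen here for illustration.

```python
def odd_double_factorial(k):
    """Return 1 * 3 * 5 * ... * k for odd k; returns 1 when k < 3."""
    result = 1
    for factor in range(3, k + 1, 2):
        result *= factor
    return result


def count_rooted_trees(n_taxa):
    """Rooted, bifurcating, labeled trees for n_taxa >= 2: (2n - 3)!!."""
    if n_taxa < 2:
        raise ValueError("need at least two taxa")
    return odd_double_factorial(2 * n_taxa - 3)


def count_unrooted_trees(n_taxa):
    """Unrooted, bifurcating, labeled trees for n_taxa >= 2: (2n - 5)!!."""
    if n_taxa < 2:
        raise ValueError("need at least two taxa")
    return odd_double_factorial(2 * n_taxa - 5)


for n in range(2, 14):
    print(n, f"{count_rooted_trees(n):,}", f"{count_unrooted_trees(n):,}")
```

Because Python integers have arbitrary precision, the same functions can be evaluated exactly for larger numbers of taxa.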

1.1.2 Tree searching

Phylogenetic inference methods must also contend with the very large number of possible arrangements of even a few taxa (Table 1.1). Many phylogenetic methods use a scoring method to determine how well each tree conforms to their assumptions, and try to find the tree with the best value for that score. Since not all trees can be scored, any of a number of heuristic methods may be used to limit the tree search space. Many of these methods take a starting tree, then break and rejoin the branches at some number of other locations. Some common examples are nearest neighbor interchange (NNI), subtree pruning-regrafting (SPR), and tree bisection-reconnection (TBR) (see Whelan, 2008, for a good summary of these methods). Genetic algorithms and Markov chain Monte Carlo methods can also be used to reduce the tree space to be explored.

Figure 1.1: Illustration of how lineage sorting can cause gene trees and species trees to differ. The species tree is represented by the gray bars. The red and blue lines represent the lineages of different genes. The lineage of the red gene follows the species tree where A and B are most closely related, while in the lineage of the blue gene species B and C are most closely related.
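As a rough sketch of the branch-swapping searches discussed in this section, the example below applies nearest neighbor interchange to the smallest interesting case: four subtrees around a single internal branch. The tree encoding (a pair of pairs), the names `nni_neighbors`, `hill_climb`, and `toy_score`, and the scoring function itself are all hypothetical; a real search would score trees by parsimony or likelihood and would swap around every internal branch.

```python
def nni_neighbors(tree):
    """Return the two trees reachable by one NNI around the central branch.

    tree is ((a, b), (c, d)): subtrees a and b on one side of the internal
    branch, c and d on the other.
    """
    (a, b), (c, d) = tree
    return [((a, c), (b, d)), ((a, d), (b, c))]


def hill_climb(start, neighbors, score):
    """Greedy search: move to a better-scoring neighbor until none exists."""
    current, best = start, score(start)
    improved = True
    while improved:
        improved = False
        for candidate in neighbors(current):
            candidate_score = score(candidate)
            if candidate_score > best:
                current, best, improved = candidate, candidate_score, True
                break
    return current, best


def toy_score(tree):
    """Hypothetical score that favors grouping A with C (illustration only)."""
    groups = {frozenset(side) for side in tree}
    return 1.0 if groups == {frozenset("AC"), frozenset("BD")} else 0.0


print(hill_climb((("A", "B"), ("C", "D")), nni_neighbors, toy_score))
```

A greedy search of this kind stops at the first tree with no better neighbor, which is why such heuristics can become trapped in local optima on larger problems.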

Methods such as neighbor joining (NJ) or the Unweighted Pair Group Method with Arithmetic mean (UPGMA) are clustering algorithms that avoid considering many possible trees by using a systematic algorithm to find a tree rather than evaluating many possible trees against an optimality criterion. These algorithmically based methods become increasingly useful when large numbers of species are considered, though they can only be used with genetic distance measures (see also Section 1.1.5).
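To make the clustering idea concrete, here is a minimal UPGMA sketch that returns only the topology as nested tuples. The input format (a dictionary keyed by `frozenset` pairs), the function name `upgma`, and the toy distances are assumptions for illustration; branch lengths are omitted, and ties are broken by input order.

```python
from itertools import combinations


def upgma(names, distances):
    """Cluster taxa by UPGMA and return the topology as nested tuples.

    names: list of taxon labels.
    distances: dict mapping frozenset({x, y}) -> pairwise distance.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of leaves it holds
    dist = dict(distances)
    while len(sizes) > 1:
        # Join the closest pair of current clusters.
        left, right = min(combinations(sizes, 2),
                          key=lambda pair: dist[frozenset(pair)])
        n_left, n_right = sizes.pop(left), sizes.pop(right)
        merged = (left, right)
        # Distance to the new cluster is the size-weighted average of its parts.
        for other in sizes:
            dist[frozenset((merged, other))] = (
                n_left * dist[frozenset((left, other))]
                + n_right * dist[frozenset((right, other))]
            ) / (n_left + n_right)
        sizes[merged] = n_left + n_right
    return next(iter(sizes))


# Toy distances, invented for the example.
toy = {frozenset(pair): d for pair, d in [
    (("A", "B"), 2.0), (("A", "C"), 4.0), (("B", "C"), 4.0),
    (("A", "D"), 6.0), (("B", "D"), 6.0), (("C", "D"), 6.0),
]}
print(upgma(["A", "B", "C", "D"], toy))
```

The loop performs one join per iteration and never revisits a decision, which is what makes these methods fast relative to searches over tree space.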

1.1.3 Gene trees versus species trees

An additional, biological problem is the distinction between what are called gene trees and species trees. That is, the evolutionary history of a species as a whole may be different from the evolutionary history of a particular segment of DNA. Because segments of DNA may be duplicated, homologous segments may have different evolutionary histories. Including such duplicated segments, called paralogs, can drastically alter the inferred relationships between species even when the true histories of the individual segments of DNA are inferred. Thus care must be taken to reduce the chances of comparing paralogs (homologs resulting from duplication events) rather than orthologs (homologs resulting from speciation events).

Differences between species trees and gene trees can also arise when multiple speciation events occur in relatively short evolutionary times. This is because different alleles of the same region may be transmitted in a manner different from the order of speciation events. Figure 1.1 provides a pictorial representation of what this means. This difference between gene trees and species trees, known as lineage sorting, seems to be remarkably common even among eukaryotes (Churakov et al., 2009; Prasad et al., 2008; Hobolth et al., 2007). Horizontal gene transfer between members of different species can also cause similar gene tree vs. species tree discordance, but horizontal transfer of large segments seems quite rare in eukaryotes.
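The dependence of lineage sorting on the time between speciation events can be illustrated with a standard population-genetic result that is not derived in this section: under a neutral coalescent for three species, with a constant diploid ancestral population of effective size N and t generations between the two speciation events, the gene tree matches the species tree with probability 1 - (2/3)exp(-t/2N). The sketch below assumes that model; the function name and the example parameter values are hypothetical.

```python
import math


def gene_tree_probabilities(t_generations, effective_pop_size):
    """Return (P(congruent gene tree), P(each of the two discordant trees)).

    Assumes a neutral three-species coalescent with a constant diploid
    ancestral population; T = t / (2N) is the internal branch length in
    coalescent units.
    """
    T = t_generations / (2.0 * effective_pop_size)
    p_each_discordant = math.exp(-T) / 3.0
    p_congruent = 1.0 - 2.0 * p_each_discordant
    return p_congruent, p_each_discordant


# Hypothetical values: shorter intervals leave more discordant gene trees.
for t in (1_000, 10_000, 100_000):
    congruent, discordant = gene_tree_probabilities(t, effective_pop_size=10_000)
    print(t, round(congruent, 3), round(discordant, 3))
```

As the interval shrinks toward zero the three gene trees become equally likely, and as it grows the discordant trees become vanishingly rare.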

The use of comparative genomic sequence datasets should allow us to tease out the effects and history of any differences between gene trees and species trees. The development of appropriate phylogenetic methods is an active area of research (Hobolth et al., 2007; Liu and Pearl, 2007; Prasad et al., 2008).

1.1.4 Convergent evolution, homoplasy, and saturation

An additional issue with nucleotide-based phylogenetic studies is that there are only four possible states at each position, and as evolutionary time increases the probability that multiple mutations will have occurred since the last common ancestor increases. These multiple mutations obscure the phylogenetic signal and may cause species to share characters (apomorphies) because of convergent evolution or mutation back to the ancestral state (symplesiomorphy) instead of because of shared ancestry (synapomorphy). Characters that support trees other than the true tree are called homoplasious. Because mutation rates at the nucleotide level are relatively high, and the number of possible states is low, neutrally evolving nucleotide sequences quickly reach a state of mutational saturation, which not only makes alignment difficult, but obscures phylogenetic signal. While this is not a problem when examining very recently diverged species, many interesting questions concern speciation events that occurred long enough ago that saturation is an issue. A number of methods are used to control for these issues. One is to use conserved or negatively selected sequences that change more slowly, though functional convergent evolution may be an issue with these sequences. Another is to use very large amounts of data to tease out small amounts of signal. Using large datasets can amplify even weak signals, though weak sources of systematic error or ‘non-phylogenetic signal’ may also be amplified by the inclusion of large amounts of data (Philippe et al., 2005).

1.1.5 Distance based methods

Prior to the development of evolutionary theory, similarities were used to classify species. The development of technologies to sequence proteins and DNA has allowed very fine-grained measures of the similarity between species. Distance-based phylogenetic methods are based on the intuitive idea that more similar species are more closely related. This simple and powerful idea has to deal with some issues in determining distance from nucleotide-based characters. Nucleotides come in only four possible states (A, C, T or U, and G). As the amount of substitution between two sequences increases, the chance of a mutation reverting to an ancestral state or producing a derived state shared with an unrelated taxon increases. This increase in homoplasy with increasing distance (i.e., saturation) does not scale linearly with time.

Models of sequence evolution are used to correct for saturation and for different rates of mutation between bases, but as distances get larger, increasingly large corrections need to be applied. A simple method of correction is the Jukes-Cantor model, which assumes that all substitutions are equally likely and occur in one step, and that the four bases have equal frequencies. Under this model the distance between two sequences is given by

$$d = -\frac{3}{4} \ln\left(1 - \frac{4}{3} p\right)$$

where p is the proportion of nucleotides that differ between the two sequences. Thus as p increases from near zero the correction increases from near zero, and the corrected distance grows without bound as p approaches its asymptote at 3/4. This value provides an upper bound on the amount of divergence that can occur between sequences for them to contain any phylogenetic information. In practice our ability to accurately align sequences fails long before divergence reaches these levels.
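As a worked illustration of the Jukes-Cantor correction, the following Python sketch computes the corrected distance from the observed proportion of differing sites. The function name `jukes_cantor_distance` and the decision to skip columns containing gaps or ambiguous bases are assumptions made for this example only.

```python
import math


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor corrected distance between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: only columns where both sequences carry an unambiguous
    # base are compared; gaps and ambiguity codes are skipped.
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 0.75:
        # At or beyond saturation the logarithm is undefined.
        return math.inf
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)
```

For two sequences that differ at 1 of 10 sites (p = 0.1) the corrected distance is about 0.107, only slightly larger than p, while the correction grows rapidly as p nears 0.75.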

A second issue for distance methods is that in most cases rates of evolution are not identical in all lineages, and this complicates the calculation of distances between groups of species. Clustering algorithms such as UPGMA and neighbor joining can use average distances (Sokal and Michener, 1958; Saitou and Nei, 1987). Alternatively, branch lengths can be estimated using the Fitch and Margoliash (1967) method, which iteratively subtracts average branch lengths from any two child nodes to all other nodes, or maximum likelihood approaches can be used. Distance methods are generally very fast because they can rely on a deterministic clustering algorithm that doesn’t search across tree space; thus they scale very well to large numbers of species. Minimum evolution, another distance-based method, searches across tree space for the tree with the lowest sum of corrected distances, though minimum evolution and neighbor joining often find the same tree (Rzhetsky and Nei, 1992; Graur and Li, 2000; Nei and Kumar, 2000).
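To make the idea of deterministic clustering on average distances concrete, here is a minimal UPGMA sketch in Python. It returns only a nested-tuple topology (no branch lengths), and the function name, input layout, and tie-breaking rule are assumptions of this example rather than features of any particular program.

```python
from itertools import combinations


def upgma(names, matrix):
    """Return a nested-tuple UPGMA topology from a symmetric distance matrix."""
    # Active clusters: id -> (subtree, number of leaves in the subtree).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Pairwise distances keyed by (smaller id, larger id).
    dist = {(i, j): matrix[i][j]
            for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        # Join the closest pair of clusters (ties broken by id order).
        i, j = min(dist, key=lambda pair: (dist[pair], pair))
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        # The distance from the merged cluster to any other cluster is the
        # average over all member taxa, i.e. a size-weighted mean.
        for k in clusters:
            d_ik = dist.pop((min(i, k), max(i, k)))
            d_jk = dist.pop((min(j, k), max(j, k)))
            dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        del dist[(i, j)]
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    (tree, _), = clusters.values()
    return tree
```

Because the procedure never revisits a join, its cost grows polynomially with the number of taxa, which is the source of the speed advantage described above; the price is that unequal rates among lineages can mislead the clustering.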

1.1.6 Maximum parsimony

Perhaps the simplest method to describe for the inference of phylogenetic trees is maximum parsimony. This method arises directly from William of Ockham’s razor, which states that the hypothesis with the fewest assumptions is the best. This simple framework yields an extremely powerful method for evolutionary inference. Though it was originally developed for morphological characters, it can be applied to many data types and has proven invaluable not only for phylogenetic inference based on nucleotide substitutions or morphological characters, but also for disparate types of data such as transposon insertions or measures of gene presence and absence.

In maximum parsimony an optimality criterion (tree length) is calculated as the minimum number of changes (e.g., substitutions) that could explain the pattern of characters that appear on the tree (Edwards and Cavalli-Sforza, 1964; Felsenstein, 1982, 2003). This approach considers only informative characters, those that are not fixed among the taxa being considered and for which at least two taxa share the same state; all other characters are ignored by parsimony methods. Several different parsimony methods have been developed for treating different types of data (see Felsenstein, 1982, for a review); for nucleotide-based characters maximum parsimony considers each nucleotide as an independent character that reversibly changes in one step from one base to another (Graur and Li, 2000). In cases that involve more than 4 taxa it is possible that more than one tree has the same tree length, resulting in several equally parsimonious trees. Parsimony infers that all informative sites supporting the maximum parsimony tree are synapomorphies, while all other informative sites are inferred to be homoplasious. This method has an important advantage in that it allows a computationally efficient enumeration of the character state at any of the interior nodes, though some bases may have an ambiguous state. Parsimony can be weighted, so that some substitutions count more heavily than others, and there are a number of refinements such as irreversible Camin-Sokal parsimony, useful for characters such as retroposon insertions that are not easily removed (Camin and Sokal, 1965). Parsimony-based approaches are valuable for analyzing large datasets because of their relative speed, and are especially useful for analyzing rare genomic characters that may be revealed by comparative genomic sequence datasets.
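The tree length criterion can be sketched with Fitch's set-based counting for unweighted, reversible characters on a fixed binary tree. This is an illustrative Python example, not the implementation used in this work; the nested-tuple tree encoding and the function names are assumptions.

```python
def fitch_site(tree, states):
    """Minimum number of changes for one character on a rooted binary tree.

    tree:   a taxon name (leaf) or a 2-tuple of subtrees
    states: dict mapping taxon name -> observed state (e.g. a base)
    Returns (set of equally good states at this node, number of changes).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    shared = left_set & right_set
    if shared:
        # The children can agree on a state, so no change is needed here.
        return shared, left_cost + right_cost
    # The children disagree: count one change; the node state is ambiguous.
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, alignment):
    """Sum the per-site minimum changes over every alignment column."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
        for i in range(n_sites)
    )
```

The sets returned at interior nodes are the candidate ancestral states, which illustrates both the inexpensive reconstruction of interior nodes and the possibility of ambiguous states mentioned above.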

1.1.7 Maximum likelihood

The maximum likelihood approach is a statistical approach that maximizes a likelihood statistic: the probability of seeing the character state matrix (e.g., alignment) given a phylogenetic tree and an evolutionary model. This can be written as L = Pr(D | H), where L is the likelihood, D is the character state matrix (e.g., alignment), and H is a tree and evolutionary model (Whelan, 2008; Felsenstein, 2003). Thus calculating the likelihood value requires three elements: a model of sequence evolution, a tree, and the sequence alignment. The likelihood value is then used as the optimality criterion in a tree search with the goal of identifying the tree (or trees) with the highest likelihood. The tree has to specify both topology and branch lengths, and calculating the maximum likelihood estimate for branch lengths requires computing, for each site, the probabilities of the observed states given all possible combinations of ancestral states (Page and Holmes, 1998; Felsenstein, 2003). This computational expense has been a limiting factor in the use of maximum likelihood approaches, though the development of large computer clusters, accelerated tree swapping algorithms, and computational optimization methods has increased the range of problems to which it can be applied.

The likelihood of any one given set of data, or of any one individual evolutionary path, is exceedingly small even given a correct model of evolution, so the likelihood value (L) has little meaning as an absolute measure except as a quantity to maximize when testing different trees or evolutionary models (H). In relative terms, however, the likelihood value can be very useful for comparing one tree or hypothesis to another. Statistical tests have been developed based on these comparisons; some are described below in Section 1.1.10.
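The calculation of Pr(D | H) for a fixed tree can be sketched with Felsenstein's pruning recursion, which sums over all ancestral states without enumerating them one combination at a time. The example below assumes the Jukes-Cantor model, fixed branch lengths, and a hypothetical tree encoding (a leaf is a taxon name; an internal node is a pair of (child, branch length) tuples); it is a minimal sketch, not an optimized implementation.

```python
import math

BASES = "ACGT"


def jc_prob(i, j, t):
    """Jukes-Cantor probability that base i is replaced by base j along a
    branch of length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def partial(tree, site):
    """Conditional likelihoods P(data below node | state at node)."""
    if isinstance(tree, str):
        # Leaf: probability 1 for the observed base, 0 otherwise.
        return [1.0 if b == site[tree] else 0.0 for b in BASES]
    result = [1.0] * 4
    for child, t in tree:
        below = partial(child, site)
        for x, parent in enumerate(BASES):
            result[x] *= sum(jc_prob(parent, c, t) * below[y]
                             for y, c in enumerate(BASES))
    return result


def log_likelihood(tree, alignment):
    """Log-likelihood of an alignment, treating sites as independent."""
    n_sites = len(next(iter(alignment.values())))
    total = 0.0
    for i in range(n_sites):
        site = {taxon: seq[i] for taxon, seq in alignment.items()}
        # Equal base frequencies (0.25) at the root under Jukes-Cantor.
        total += math.log(sum(0.25 * p for p in partial(tree, site)))
    return total
```

Even in this toy form the per-site likelihoods are small numbers, so the total is accumulated in log space, and only differences between the log-likelihoods of competing trees are informative.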

1.1.8 Nucleotide substitution models

As in calculating genetic distances, an explicit model of molecular evolution is used for the likelihood calculation. Models may include parameters such as transition/transversion ratios or complete substitution matrices indicating the probabilities of changing from any one base to the others. The values for the model parameters are chosen using the maximum likelihood criterion, namely the values that maximize the likelihood given the tree and the data, again a computationally expensive exercise (Page and Holmes, 1998). The Jukes-Cantor model described above (Section 1.1.5) is one of the simplest of these models for nucleotide-based data.

The Jukes-Cantor model and several other commonly used models are simplifications of the general time reversible (GTR) model, which treats each nucleotide site as independent and allows each type of substitution its own rate along with unequal base frequencies. The GTR model parameters are shown in Table 1.2.

Base frequencies: πA, πC, πG, πT

Substitution matrix:

        A       C       G       T
A       .       πC·a    πG·b    πT·c
C       πA·a    .       πG·d    πT·e
G       πA·b    πC·d    .       πT·f
T       πA·c    πC·e    πG·f    .

Table 1.2: Parameters used by the GTR model of sequence evolution

The base frequencies (πA, πC, πG, and πT) are allowed to be unequal, though notably likelihood approaches treat all species as having the same base frequencies, an assumption that is often incorrect (Hasegawa and Hashimoto, 1993; Galtier and Gouy, 1998). The substitution or rate matrix (Table 1.2) describes the likelihood of substitutions from one base to another. The π terms are the base frequencies and are generally set by examining the data as presented; a through f are parameters associated with the likelihood of substitution between each pair of bases. The model is described as reversible because the same parameter governs a substitution in one direction, say A→T, and its reverse (T→A), so that at equilibrium the flow of substitutions between any two bases is balanced. This greatly simplifies computation, allowing us to treat trees as unrooted.
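The structure of Table 1.2 and the reversibility property can be checked with a short Python sketch. The function name `gtr_rate_matrix`, the dictionary layout, and the two-letter keys for the six pairwise parameters (a = "AC", b = "AG", c = "AT", d = "CG", e = "CT", f = "GT") are assumptions introduced for this illustration.

```python
def gtr_rate_matrix(pi, rates):
    """Build a GTR instantaneous rate matrix as a dict of dicts.

    pi:    equilibrium base frequencies, e.g. {"A": 0.1, "C": 0.2, ...}
    rates: the six pairwise parameters keyed by sorted base pair,
           e.g. {"AC": a, "AG": b, "AT": c, "CG": d, "CT": e, "GT": f}
    """
    bases = "ACGT"
    q = {x: {} for x in bases}
    for x in bases:
        for y in bases:
            if x != y:
                key = "".join(sorted(x + y))
                # Off-diagonal entry: frequency of the target base times the
                # parameter shared by both directions of the substitution.
                q[x][y] = pi[y] * rates[key]
        # Diagonal entry chosen so that every row sums to zero.
        q[x][x] = -sum(q[x].values())
    return q
```

With unequal base frequencies the raw rates for A→T and T→A differ, but the products πA·q(A→T) and πT·q(T→A) are equal, which is the balance condition that makes the model reversible.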

The models discussed above assume that the probability of substitution is equal at each site. However, different areas of DNA may have different probabilities of change (The International Chimpanzee Chromosome 22 Consortium, 2004). The simple case where some sites are absolutely critical and others are free to vary is modeled by using a parameter (I) for the proportion of invariant sites. The more complex case of a distribution of rates is modeled using a Γ distribution with a shape parameter α. As α approaches ∞ all sites have the same substitution rate. Putting all of these together for the most complex of the commonly used models, the GTR + Γ + I model of sequence evolution as generally implemented has 3 parameters for unequal base frequencies, 6 for the reversible substitution matrix, and 2 for between-site rate variation.

When selecting the model to use for phylogenetic analysis the goal is to use a model that is as realistic as possible without overfitting. Overfitting of model parameters can occur when there is not enough information in the dataset to get good estimates of the population values for model parameters. As more parameters are added this becomes an increasing problem. Large comparative genomic datasets contain so much data that model overfitting becomes less of a problem, though the lack of model accuracy (e.g., model misspecification) may become an increasing problem (Nishihara et al., 2007; Philippe et al., 2005; Lartillot et al., 2007). Careful control and testing to reduce the chances of erroneous results due to model violations are an important requirement when considering the relatively weak phylogenetic signals that large comparative genomic datasets allow us to interrogate.

The calculation of likelihood values requires a lot of computation and must be performed for each tree to be compared. Until recently this made the use of maximum likelihood approaches with large comparative genomic sequence datasets prohibitively slow. The recent development of highly optimized and parallelized likelihood calculations and faster branch swapping heuristics has allowed the use of maximum likelihood calculations for increasingly large datasets (Stamatakis and Alexandros, 2006; Zwickl, 2006; Stamatakis et al., 2008).

1.1.9 Bayesian phylogenetic analysis

As with likelihood-based approaches, the Bayesian phylogenetic approach relies on explicit statistical models of molecular evolution. In general the same models that are used with the maximum likelihood approach can be used. The Bayesian approach differs from likelihood in that it directly provides the probability we are most interested in, the probability that the hypothesis is correct. Bayes’ theorem provides the relationship between the likelihood Pr(D | H) and the Bayesian posterior probability

$$\Pr(H \mid D) = \frac{\Pr(H)\,\Pr(D \mid H)}{\sum_{H} \Pr(H)\,\Pr(D \mid H)}$$

where D is the data and H is the hypothesis (e.g., tree and model). The denominator is the sum of the numerator over all possible hypotheses (i.e., Pr(D)). Notably this equation requires two difficult to calculate (or at least difficult to estimate) values, the prior probability Pr(H) and the probability of the data Pr(D). Markov chain Monte Carlo (MCMC) methods allow sampling from the posterior distribution without knowing the denominator.
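The reason MCMC does not need the denominator can be seen in a toy Python example with only three candidate hypotheses, where the posterior can also be computed exactly for comparison. The function names, the uniform proposal, and the made-up log-likelihood values in the usage note are assumptions of this sketch; real phylogenetic samplers propose changes to trees and model parameters rather than choosing among a short list.

```python
import math
import random


def exact_posterior(log_prior, log_lik):
    """Posterior by explicit normalization (feasible only for few hypotheses)."""
    log_joint = {h: log_prior[h] + log_lik[h] for h in log_lik}
    top = max(log_joint.values())  # subtract the maximum for stability
    weights = {h: math.exp(v - top) for h, v in log_joint.items()}
    total = sum(weights.values())  # proportional to Pr(D)
    return {h: w / total for h, w in weights.items()}


def metropolis_posterior(log_prior, log_lik, n_steps, seed=0):
    """Estimate the posterior using only ratios of prior times likelihood."""
    rng = random.Random(seed)
    hypotheses = sorted(log_lik)
    current = hypotheses[0]
    counts = dict.fromkeys(hypotheses, 0)
    for _ in range(n_steps):
        proposal = rng.choice(hypotheses)  # symmetric proposal
        # Pr(D) appears in both posteriors and cancels in the ratio.
        log_ratio = (log_prior[proposal] + log_lik[proposal]
                     - log_prior[current] - log_lik[current])
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current = proposal
        counts[current] += 1
    return {h: c / n_steps for h, c in counts.items()}
```

With equal priors and hypothetical log-likelihoods of -1000.0, -1000.7, and -1001.8, the exact posterior probabilities are roughly 0.60, 0.30, and 0.10, and the sampled frequencies approach these values as the chain lengthens.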

This approach has the extremely attractive property of directly answering the question we usually want to ask: “What is the probability that the hypothesis is correct?” Additionally, the MCMC method allows the rapid simultaneous computation of the best tree and a measure of support for that tree. These properties have made Bayesian phylogenetic approaches very popular. Because the MCMC posterior sample should account for all possible values of each model parameter, model overfitting due to insufficient data should be less of a problem. Support for individual branches is summarized using consensus methods, most frequently by reporting the proportion of trees in the posterior distribution that contain a clade.

There are several controversial issues related to running Bayesian phylogenetic analysis. Perhaps the most troubling is the setting of prior probabilities. The researcher must set some prior probability of each hypothesis being correct in order to proceed with the method, and there does not seem to be a way to set the prior probabilities that is generally agreed to be unbiased.

Though the MCMC method should sample over the posterior distribution, it must be given time to stabilize on this distribution. This requires a ‘burn-in’ time for the MCMC to stabilize. Unfortunately there is no absolute way to tell how long the burn-in period will take because it is dependent on the tree and parameter suggestion methods of the MCMC. These tree suggestion algorithms must balance the need to quickly reach a stable distribution against the ability to sample widely enough over the possible tree and parameter space to avoid becoming stuck in a likelihood island. Generally posterior distributions are assessed for stability by using multiple independent MCMC runs and measuring their convergence and stability relative to each other, and there has been much work done on how to assess stability (Nylander et al., 2008; Ronquist et al., 2006). Additional Bayesian techniques have been developed for the simultaneous estimation of the tree and the alignment, as well as the simultaneous estimation of gene trees and species trees (Redelings and Suchard, 2005; Liu and Pearl, 2007; Liu et al., 2009). The relatively high speed of Bayesian analyses and their ability to handle many parameters have made them very popular for use with large datasets. However, another issue arises when using large amounts of data: posterior probabilities may be unreasonably high, negating one of the big advantages of Bayesian methods, that they rapidly supply both best-fit trees and support values (Prasad et al., 2008; Susko, 2008; Yang, 2007; Kolaczkowski et al., 2006).

1.1.10 Conﬁdence measures

All of the methods described above, with the notable exception of Bayesian phylogenetics, produce a point estimate of the best tree. However, that best tree lacks information as to how much confidence we should have that it is correct and how much evidence there is in support of each of the branches. If there is homoplasy within a dataset (and there almost always is with nucleotide-based characters) then different sites will support different trees. Thus the results of a phylogenetic analysis will depend on the specific set of sites used to estimate the tree. Sampling error can also be due to the choice of species to study, a choice that may be limited by extinction and sample availability.

Non-parametric bootstrap

One way to measure the possible error in a sample is to take multiple samples and measure the distribution of the resulting conclusions. That distribution provides an estimate of how much our conclusions are dependent on the sample we have taken. Bootstrapping in phylogenetics provides a way to simulate taking multiple samples by assuming that our sample is representative of the larger population of data. Non-parametric bootstrapping creates new pseudoreplicate datasets by sampling with replacement from the alignment columns to create a new dataset of the same size as the original (Figure 1.2). By bootstrapping the dataset and reanalyzing each bootstrap replicate phylogenetically it is possible to get a distribution of results that can then provide confidence measures. The results can be shown as proportions of bootstrap replicates supporting each of the trees found. For larger numbers of taxa branch-by-branch confidence values are given by summarizing the proportion of bootstrap replicates that are bifurcated at that point. This summarizing method can obscure results, providing artificially low support for some groups if one or more taxa are difficult to place and float across the tree.

A:
ACTGTGCATG
AGTCTGGATG
AGTGACGACG

B:
TGCATCATTC
TCGATGATTG
TGGATGACCG

Figure 1.2: Illustration of non-parametric bootstrapping of alignments. A is the original observed alignment, and B, a bootstrap pseudoreplicate, is created by randomly selecting columns of the original alignment such that any column may be selected zero or more times. This procedure is repeated many times to generate a distribution of results.

Bootstrapping can also be looked at as a method of weakening the data by re-weighting the characters to see how often each tree is produced from the weakened data. Statistically this relies on the assumption that the sample taken by the observed data is representative of the evolutionary process leading to all data, and that all of the sites are independent and identically distributed. Independence of the sites is probably violated and the distribution of support can vary across the genome (see Section 4 for more details).

The parametric bootstrap is also used in some circumstances for hypothesis testing. Parametric bootstraps generate new datasets by simulation, using parameters estimated from the original alignment.
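The column resampling illustrated in Figure 1.2 can be sketched in a few lines of Python. The function names are hypothetical, and `infer_tree` stands in for whatever tree-building procedure is being assessed; it is assumed to accept an alignment dictionary and return a hashable description of the resulting tree.

```python
import random
from collections import Counter


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same column indices are applied to every taxon so that each
    # resampled column is an intact column of the original alignment.
    return {taxon: "".join(seq[c] for c in cols)
            for taxon, seq in alignment.items()}


def bootstrap_support(alignment, infer_tree, n_replicates=100, seed=0):
    """Proportion of pseudoreplicates that yield each inferred tree."""
    rng = random.Random(seed)
    counts = Counter(infer_tree(bootstrap_alignment(alignment, rng))
                     for _ in range(n_replicates))
    return {tree: n / n_replicates for tree, n in counts.items()}
```

Each pseudoreplicate has the same number of columns as the observed alignment, so any change in the inferred tree reflects only the re-weighting of sites.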

Bootstrap based hypothesis tests

There are a number of tests based on using non-parametric bootstraps to estimate sampling error. One of those tests is the Shimodaira-Hasegawa test (SH-test), an extremely conservative test that estimates the probability that a difference in log-likelihoods between a number of trees or models could arise by chance given that each tree has an equal prior probability of having the same likelihood distribution (Shimodaira and Hasegawa, 1999; Felsenstein, 2003). The test operates by building distributions of likelihood values for each tree for a number of bootstrap replicates. The conservative assumption means that if a number of unlikely trees are included, the threshold for exclusion increases. Because calculating likelihood values for each bootstrap replicate for each tree can become computationally intensive, the RELL method of adding up resampled sitewise log-likelihoods is usually used (Shimodaira and Hasegawa, 2001; Felsenstein, 2003). There are a number of other tests that have been developed along the same lines, using the bootstrap to estimate distributions of likelihood scores, such as the RELL test, the Kishino-Hasegawa test, and the approximately unbiased test (Felsenstein, 2003). This last test, the approximately unbiased test, is based on a procedure of drawing different sizes of bootstraps and appears less conservative than the SH-test.
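The saving offered by RELL is that sitewise log-likelihoods are computed once per tree and then resampled, rather than re-optimizing the likelihood for every pseudoreplicate. A minimal Python sketch of that resampling step follows; the function names and the input layout are assumptions, and the sketch reports only how often each tree scores best, not the full SH-test statistic.

```python
import random


def rell_replicates(site_loglik, n_replicates=1000, seed=0):
    """Bootstrap summed log-likelihoods by resampling sitewise values.

    site_loglik: dict tree name -> list of per-site log-likelihoods, with
                 the same sites in the same order for every tree.
    Returns one dict (tree name -> summed log-likelihood) per replicate.
    """
    rng = random.Random(seed)
    n_sites = len(next(iter(site_loglik.values())))
    replicates = []
    for _ in range(n_replicates):
        # One set of resampled sites is shared by all trees in a replicate.
        cols = [rng.randrange(n_sites) for _ in range(n_sites)]
        replicates.append({tree: sum(values[c] for c in cols)
                           for tree, values in site_loglik.items()})
    return replicates


def rell_support(site_loglik, n_replicates=1000, seed=0):
    """Proportion of replicates in which each tree has the highest score."""
    wins = dict.fromkeys(site_loglik, 0)
    for rep in rell_replicates(site_loglik, n_replicates, seed):
        wins[max(rep, key=rep.get)] += 1
    return {tree: w / n_replicates for tree, w in wins.items()}
```

Using the same resampled columns for every tree within a replicate preserves the site-by-site correlation between competing trees, which is what makes differences in their summed scores meaningful.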

Additional Bayesian posterior probability and parametric bootstrap based tests may be biased because the distributions they develop are based on model specifications that may not accurately reflect the evolutionary processes in real data (Shimodaira, 2002; Goldman et al., 2000). Non-parametric bootstrap based tests may suffer from the assumption that the evolution of each site is independent, and although local correlations could be handled by using block bootstraps, in practice block bootstraps are rarely used (Künsch, 1989).

1.1.11 Systematic biases

Systematic biases, also known as non-phylogenetic signal, are artifacts of our incomplete ability to model molecular evolution and result in inaccurate phylogenetic inference. Commonly used models of nucleotide evolution make many assumptions about the nature of molecular evolution, most of which are violated by real data. Some examples of violated assumptions are that sites are evolving independently, that base composition is constant between species, that variation between rates at different sites follows a gamma distribution, and that substitutions are random processes. The impact of these violations varies depending on the methods and models used, and simulation studies are often carried out to investigate both their impact and mitigation strategies (Philippe et al., 2005; Rosenberg and Kumar, 2003; Wiens, 2005; Goldman et al., 2000; also see Chapter 3 for more discussion).

1.2 Conservation and function

Knowing the relationships between species and having models of sequence evolution allows us to compare the sequences of genomes and make inferences of selective pressure.

Natural selection acts on phenotypes, but the object of selection is genetic, the genome sequence. By comparing the sequences of species we know to be related we can identify the signatures that selection has left on the genome. Selection can provide important clues to the biggest question of the “genomics era”: How does it all work? Now that we have the sequence of the genomes of many species the task of annotating and understanding the function of that DNA has begun. Many high-throughput experimental and bioinformatic techniques have been developed to identify functional sequences, however the vast majority of eukaryotic DNA can not be annotated in this manner. The function of this bulk of poorly understood DNA is difﬁcult to ascertain, and it is likely that the majority of it is non-functional. The ﬁrst step to understanding its function is to identify what portions are functional.

According to the neutral theory of molecular evolution, most evolutionary change is the consequence of genetic drift: the process of mutation and random elimination of alleles due to ‘genetic sampling error’ (Kimura, 1983). Only very few mutations produce beneficial phenotypic variation, so the majority of selection seems to be negative or purifying, that is, it acts to preserve functional sequences. This means that by identifying genomic regions that have fewer genetic changes than we expect under neutrality we can identify sequences that have functional fitness consequences. By developing appropriate models of neutral evolution we can determine the regions that deviate from neutrality and so identify sequence that is acted upon by selection and thus functional in some way (McLean and Bejerano, 2008; Margulies and Birney, 2008).

This method has a few notable caveats. Because the ability of natural selection to act on functional sequences is heavily dependent on population sizes, this approach will be better able to identify functional regions in species that have high effective population sizes. In order to identify regions that depart from the neutral rate of evolution a neutral rate must be estimated. The methods for estimating a neutral rate are varied, and because we have no definitive set of neutrally evolving sequences it is difficult to calibrate the models for a neutral rate. Additionally, neutral substitution rates seem to vary across genomes, so local estimates of the neutral rate must be used (Chimpanzee Sequencing and Analysis Consortium, 2005).
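The comparison against a neutral expectation can be illustrated with a deliberately simple calculation: given a local neutral estimate of the per-site probability of a difference, how surprising is it to see so few differences in a window? The Python sketch below uses a binomial tail probability. It assumes independent sites and a single neutral rate per window, which are simplifications made for illustration and not the models used by published conservation-scoring methods; the function name is hypothetical.

```python
import math


def conservation_pvalue(n_sites, n_diffs, neutral_rate):
    """Probability of observing n_diffs or fewer differences in n_sites
    aligned positions if each site differs independently with probability
    neutral_rate (a binomial lower-tail probability)."""
    return sum(
        math.comb(n_sites, k)
        * neutral_rate ** k
        * (1.0 - neutral_rate) ** (n_sites - k)
        for k in range(n_diffs + 1)
    )
```

For example, under a hypothetical neutral difference rate of 0.3, seeing at most 2 differences in a 50-base window has a probability below 1 in 10,000, so such a window would stand out as a candidate for purifying selection. Because the result depends directly on `neutral_rate`, any error in the local neutral estimate carries straight through to the inference.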

1.2.1 Conserved non-coding sequences

Once the technologies of sequencing began to be developed it was realized that the vast majority of the human genome, and indeed of almost any eukaryotic genome, does not code for proteins. The function, if any, of all of the non-protein coding DNA is not known; however, it is apparent that the activity of many genes is regulated by cis-acting non-coding sequences (Antonellis et al., 2006; McLean and Bejerano, 2008). These sequences may provide binding sites for proteins that regulate transcription or chromatin state, or they may have structural consequences. Known functional regions such as protein coding genes are often strongly negatively selected and the high conservation around them is easily visible. Figure 1.3 shows an example of high conservation around protein coding exons as well as some additional areas. These non-protein coding areas display strong signals of sequence conservation, presumably due to selection, and from that we can infer that they have some important function in the organism. Several experimentally determined functions for non-protein coding DNA have been shown, including structural RNAs such as ribosomal RNAs or tRNAs, regulatory non-coding RNAs such as Xist and miRNAs, and cis-regulatory sequences such as enhancers, repressors, insulators, or promoters that may bind regulatory proteins.

[Figure 1.3 image: tracks labeled “Gene structure” and “Conservation”]

Figure 1.3: Sample view of genomic sequence conservation within exons and outside of coding sequence. Black bars in the top track indicate exons and the height of peaks in blue indicates conservation. Adapted from an image of the UCSC Genome Browser (Kent et al., 2002).

The function of many conserved non-coding sites is difﬁcult to ascertain, but conservation has been an important aid in identifying non-protein coding sequences that have functions such as non-coding RNAs and regulatory regions (Ponjavic et al., 2007; Cooper and Brown, 2008; Bejerano et al., 2004; Hardison, 2000; Ahituv et al., 2005).

This idea, that sequences that show evidence of conservation must be somehow functional, is an extremely powerful way to cut through the noise and isolate regions of the genome that have the biggest functional consequences and selective import. Leveraging the mutational process in the many millions of years that separate different species allows the identification of functionally important sequence elements behind the basic processes of life. It is also important to note that not all functionally consequential genome sequences will have fitness consequences. Considerable genotypic and phenotypic variation may well be outside the realm of strong fitness effects and thus not be identifiable through comparative genomic approaches. However, since there is clear conservation of physiology and anatomy between closely related species, it is likely that much of the functional DNA sequence that provides the blueprint should be conserved. This indeed seems to be the case, as we have identified broad levels of conservation between diverse species for many known functionally important sequences. The most obvious example is protein coding sequences that show broad patterns of conservation across remarkably long evolutionary time-scales.

It is important to note that not all functional sequences will be conserved, especially at the sequence level (ENCODE Project Consortium, 2007). As our models of sequence evolution improve and increasing numbers of species are available, these methods should be increasingly powerful ways of identifying functional sequences.

1.3 Conclusions

The rapid proliferation of genomic sequences from many species is increasingly allowing us to contemplate more questions concerning the evolution of species. The answers to those questions about how species evolve can help us understand the functional consequences of DNA and how the sequence of a genome can lead to the incredible coordination of factors that gives rise to life. These large comparative genomic sequence datasets provide the raw materials for answering new questions that could not be approached before, though approaching these questions requires new methods and new considerations of the biases and methodological issues that may mislead established methods of analysis.

Chapter 2

ExactPlus: A user-friendly method for identifying short highly conserved genomic sequences

2.1 Introduction

The sequencing of the genomes of many species is taking place in part to help us understand how genomes work and what portions are important. One way to ﬁnd the important functional regions of genomes is to look for the parts that are more similar or have evolved more slowly. Evidence of a slower pace of evolution is taken as evidence of function because the most likely reason for the preservation of sequence is that selection must be acting to preserve that sequence. Because selection can only see traits that are the functional expression of genomic sequences, we can infer that sequences that are changing more slowly are functional in some way. Thus the search for understanding genome function often begins with the search for sequence conservation.

Though there are many different functions of DNA that are sequence dependent, the production of protein coding RNA is one of the most sequence constrained. However, a majority of constrained sequence seems to lie outside of protein coding regions, and there are several other sequence-specific functions that could explain this. This conserved DNA is sometimes transcribed into RNA even though it does not code for any protein. These RNAs can be catalytically active and may form active molecules, such as ribosomal and transfer RNAs involved in protein translation or small nucleolar RNAs that help target RNA modifications (Storz and Wassarman, 2004). Other conserved non-coding RNAs may regulate gene expression, including microRNAs, anti-sense RNAs and other long non-coding RNAs (Goodrich and Kugel, 2006; Feng et al., 2006; Storz and Wassarman, 2004). The different varieties of non-coding RNAs are much more abundant than previously thought; however, even they can only explain a small portion of the conserved sequence between species. Most of the conserved DNA sequence is thought to have functions related to gene regulation and to be responsible for many of the differences between species (The C. elegans Sequencing Consortium, 1998; Adams et al., 2000; Andolfatto, 2005; Gregory, 2004). Most of it does not seem to code for functional RNA, but is thought to have sequence-specific cis-regulatory activity.

Sequence-specific DNA-binding proteins bind to many of these cis-regulatory sequences and are important for gene regulation (Zhang et al., 2007; Alberts et al., 2002). Transcription factors are sequence-specific DNA-binding proteins that directly affect transcription, either repressing or enhancing it. Transcription factors are well known to be important in signal transduction and gene regulation. They can act rapidly and efficiently and seem to be active in the regulation of almost all genes. Because they are relatively well understood and important in disease and gene regulation, many methods for identifying transcription factor binding sites (TFBSs) have been developed.

The specific DNA sequences bound by transcription factors may be as short as 6-14 bases in length, often interrupted by a number of non-specific bases (Kel et al., 2003; Wang and Stormo, 2005). Because these binding motifs are so short and often contain non-specific positions, they may occur hundreds or even hundreds of thousands of times throughout a genome. Of the ∼2,000 transcription factors that are estimated to exist in most mammalian species, only a small minority have known sequence binding profiles (International Human Genome Sequencing Consortium, 2001; Waterston et al., 2002; GuhaThakurta, 2006). For transcription factors whose binding motifs are known, functional assays of genomic sites with matching sequences often show no function. This makes sense, as many binding motifs are so abundant that function at every site is unlikely. Computational methods for finding functional TFBSs are greatly needed, and those relying on sequence similarity to known binding motifs alone (such as TRANSFAC entries) are prone to false-positive results (Fogel et al., 2005; GuhaThakurta, 2006; Loots et al., 2000). The position of binding motifs in relation to the start of transcription has not proven to have sufficient predictive power to be very helpful in identifying functional TFBSs either, because functional TFBSs can occur within the gene they regulate or as much as several megabases away.

However, recent advances in DNA sequencing technology have allowed high-throughput screening for the presence of transcription factors at genomic sites using formaldehyde crosslinking of proteins and DNA (Ren et al., 2000; Johnson et al., 2007).

Another approach that has been used quite successfully with yeast is to look for over-representation of motifs between species and among co-regulated genes, attempting to use the additional information of co-regulation to infer that similar binding sites should be seen both across species and between co-regulated genes (Pritsker et al., 2004; Wang and Stormo, 2005). This process can help identify binding sites, but requires large amounts of genome sequence data for many species. Additionally, the much smaller genomes of yeast mean there is less background than in larger mammalian genomes.

Because TFBSs are generally small, and gene expression patterns are often conserved between species, we decided to create a simple method (ExactPlus) to quickly identify the most highly conserved short sequences that may represent TFBSs. Such highly conserved sequences also characterize other types of functional sequences; however, here we have focused our exploration of the utility of ExactPlus on identifying TFBSs. We use comparative genome sequence data to identify small (or large) highly conserved sequences. Simple minimums are defined for the number of species whose bases must match exactly and for the minimum length of sequence over which they must match. Regions matching or exceeding these minimums are marked as putatively conserved elements (PCEs) by the ExactPlus program. Because these sites often cluster together in what may be hotspots of regulatory factor binding, we also allow for the extension of ExactPlus PCEs to larger regions with a smaller number of species exactly matching.

2.2 Algorithm and implementation

ExactPlus scans along an alignment in memory, looking for matches in a minimum number of species over a minimum number of bases. It then attempts to extend each PCE to the right and left if a smaller number of species has matching bases over that region. Gap characters are considered mismatches, and PCEs are terminated at gaps in the reference sequence to provide for easier coordinate accounting. ExactPlus loads the entire alignment into memory, taking advantage of the large amounts of RAM available on modern computer systems.
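The seed-and-extend scan described above can be sketched as follows. This is a minimal illustration rather than the ExactPlus source: the function names are hypothetical, the first row of the alignment is assumed to be the reference, and a column is assumed to "match" in a species when that species carries the same non-gap base as the reference.

```python
def column_support(alignment):
    """For each column, count the sequences (reference included) that share
    the reference base. Columns where the reference has a gap get 0, so a
    reference gap always terminates a PCE."""
    ref = alignment[0]
    support = []
    for i, base in enumerate(ref):
        if base == "-":
            support.append(0)
        else:
            support.append(sum(1 for seq in alignment
                               if seq[i].upper() == base.upper()))
    return support


def find_pces(alignment, min_species, min_len, extend_species=None):
    """Return PCEs as half-open (start, end) alignment-column intervals."""
    support = column_support(alignment)
    n = len(support)
    pces = []
    i = 0
    while i < n:
        if support[i] < min_species:
            i += 1
            continue
        # Measure the run of columns meeting the seed threshold.
        j = i
        while j < n and support[j] >= min_species:
            j += 1
        if j - i >= min_len:
            start, end = i, j
            if extend_species is not None:
                # Extend left and right under the relaxed threshold.
                while start > 0 and support[start - 1] >= extend_species:
                    start -= 1
                while end < n and support[end] >= extend_species:
                    end += 1
            if pces and start <= pces[-1][1]:
                # Extension reached the previous PCE: join the two.
                pces[-1] = (pces[-1][0], max(end, pces[-1][1]))
            else:
                pces.append((start, end))
        i = j
    return pces
```

With a three-species toy alignment, a seed of four columns matching in all three species is found at the left edge, and allowing extension with two matching species stretches it across the whole alignment.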

Figure 2.1: Alignment and parameter input screen for ExactPlus web interface

We implemented the ExactPlus method with both command-line versions and an easy-to-use website (http://research.nhgri.nih.gov/exactplus). Input files are: (1) an alignment in either FASTA (MFA) or MultiPipMaker ‘.acgt’ format (Schwartz et al., 2003); and (2) an optional ‘exons’ file that contains a list of positions to exclude from the results.

Only three other parameters control the search: (1) the minimum number of species that must have identical bases to seed a PCE; (2) the minimum number of positions that must match in a row to seed a PCE; and (3) an optional smaller number of species to extend and/or join PCEs (see Figure 2.1 for the actual interface). Two additional optional parameters are UCSC Genome Browser coordinates, which allow for the automated production of a UCSC Genome Browser custom track, and an option to indicate that the coordinates in the ‘exons’ file are relative to the UCSC Genome Browser start coordinates.

2.2.1 Constraints for web interface

Our implementation of ExactPlus was constrained by the requirements of our local administrators; the architecture of available systems; and considerations of security, scalability, and stability. Because of the web server configuration and requirements imposed by our system administration staff, we were limited to using the CGI interface and cron jobs to initiate processing (Vixie, 2004). SUID scripts were also not permitted. This presented us with several challenges, including how to provide for secure storage of untrustworthy files to pass between programs, how to prevent race conditions between multiple calls of the CGI script or multiple ExactPlus runs, and how to limit loads and allow parallel processing of multiple jobs.

[Diagram components: Internet; web server (CGI scripts); filesystem; compute server (cron job, ExactPlus).]

Figure 2.2: Architecture of ExactPlus web interface

2.2.2 Architecture of web-accessible version of ExactPlus

The framework we created relies on a number of scripts that interact to make a secure, controlled system that can be easily adapted to increase throughput through parallelization.

The web-facing portions of the system are a number of HTML pages and simple Perl CGI scripts that write out uploaded files and parameters from the HTML forms and do some input data validation. A cartoon diagram of the overall architecture is shown in Figure 2.2.

Filesystem storage had to be carefully designed because the web server runs as a non-privileged user, while cron jobs are run as a normal user. Filesystem permissions (listed in Table 2.1) are designed to allow the minimum access permissions for the components of the system. Specifically, all scripts and web pages are owned by a non-privileged user. A special ‘exactplus’ group was created to allow multiple users to access the files written by the web server. The web server user (‘httpd’) and ‘cron’ users are members of the exactplus group and are thereby given permission to read and write to the htdocs/data directory. All data provided by the remote user is written by exp.cgi, and this is the only web-facing script that writes to the filesystem. Two scripts are run by the cron user (dequeue exp and check exp), and must also have read-write permission to data directories.

To explain how the system works, we will describe the steps involved in a typical run of ExactPlus. The user points their web browser at http://research.nhgri.nih.gov/exactplus (Figure 2.1) and fills in the parameters, also selecting the files they would like to process. Once they click ‘submit,’ the data are passed over the internet to the web server, and a Perl CGI script (exp.cgi) strips the input of all special characters and limits the size of file that may be uploaded; it also performs basic checks on the data input by the user. exp.cgi then creates a randomized directory name (htdocs/data/rand), writes the current UNIX time to a flag file named uploading, copies out the alignment and (optional) exons files, creates a parameters file with all run parameters, and finally renames the flag file (uploading) to not started before returning a ‘waiting for processing’ page (Table 2.1, Figure 2.2).
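The submission step can be sketched in a few lines. The original is a Perl CGI script, so this Python version is only an illustration under stated assumptions: `submit_job`, the `max_bytes` limit, the underscore in the `not_started` flag name, and the directory name layout are all hypothetical, while the file names `acgt` and `opts` follow Table 2.1.

```python
import os
import secrets
import string
import time


def submit_job(data_root, alignment_text, params, max_bytes=5_000_000):
    """Store one upload in its own directory and mark it ready to run."""
    if len(alignment_text.encode()) > max_bytes:
        raise ValueError("alignment exceeds the upload size limit")

    # os.mkdir either creates the directory or fails, so two simultaneous
    # submissions can never both claim the same name.
    while True:
        suffix = "".join(secrets.choice(string.ascii_lowercase)
                         for _ in range(5))
        run_dir = os.path.join(data_root, f"{int(time.time())}_{suffix}")
        try:
            os.mkdir(run_dir)
            break
        except FileExistsError:
            continue

    # The flag file records the submission time; its name is the run state.
    flag = os.path.join(run_dir, "uploading")
    with open(flag, "w") as fh:
        fh.write(str(int(time.time())))

    with open(os.path.join(run_dir, "acgt"), "w") as fh:
        fh.write(alignment_text)
    with open(os.path.join(run_dir, "opts"), "w") as fh:
        fh.write("\n".join(f"{key}={value}" for key, value in params.items()))

    # Renaming last means the scheduler never sees a half-written job.
    os.rename(flag, os.path.join(run_dir, "not_started"))
    return run_dir
```

Because the rename is the final step, a directory whose flag is still named uploading is known to be incomplete and is skipped by later stages.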

A cron job running once per minute takes over the actual processing (proc exp); it scans each upload directory for a file named not started. If proc exp finds a not started flag file and there are fewer than two ExactPlus jobs running, proc exp renames not started to working and forks a program (dequeue exp) that runs the command-line version of ExactPlus (exactplus) directly. If exactplus dies or returns with a non-zero exit value, dequeue exp renames working to error; otherwise working is renamed to done and the program exits.
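The hand-off between the scheduler and the job runner can be sketched as below. This is an illustration, not the original Perl: `claim_next` and `run_job` are hypothetical names, the flag names use underscores, and running jobs are counted by their working flags rather than by inspecting the process table as the text describes.

```python
import os
import subprocess


def count_working(data_root):
    """Number of run directories whose flag file is currently 'working'."""
    return sum(os.path.exists(os.path.join(data_root, name, "working"))
               for name in os.listdir(data_root))


def claim_next(data_root, max_jobs=2):
    """Claim one waiting job by renaming its flag; return its directory,
    or None if nothing is waiting or the job limit is reached."""
    if count_working(data_root) >= max_jobs:
        return None
    for name in sorted(os.listdir(data_root)):
        run_dir = os.path.join(data_root, name)
        try:
            # The rename succeeds for exactly one claimant.
            os.rename(os.path.join(run_dir, "not_started"),
                      os.path.join(run_dir, "working"))
        except (FileNotFoundError, NotADirectoryError):
            continue
        return run_dir
    return None


def run_job(run_dir, command):
    """Run the analysis command and record the outcome in the flag name."""
    try:
        returncode = subprocess.run(command, cwd=run_dir).returncode
    except OSError:
        returncode = 1  # the command could not be started at all
    outcome = "done" if returncode == 0 else "error"
    os.rename(os.path.join(run_dir, "working"),
              os.path.join(run_dir, outcome))
    return outcome
```

Keeping the claiming step this small mirrors the design choice discussed later: the scheduler only renames a flag and hands off, so it finishes long before the next cron invocation.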

                                                        Permissions
File / Directory              Owner      Group       User   Group   Other
exactplus                     user       user        rwx    rx      rx
htdocs/                       user       user        rwx    rx      rx
htdocs/exp.cgi                user       user        rwx    rx      rx
htdocs/index.shtml            user       user        rw     r       r
htdocs/help.shtml             user       user        rw     r       r
htdocs/results.cgi            user       user        rwx    rx      rx
htdocs/e submitted.tmpl       user       user        rw     r       r
htdocs/r results.tmpl         user       user        rw     r       r
htdocs/r still working.tmpl   user       user        rw     r       r
htdocs/r error.tmpl           user       user        rw     r       r
htdocs/data/                  user       exactplus   rwx    rsx     rx
htdocs/data/rand/             httpd      exactplus   rwx    rwx     rx
htdocs/data/rand/opts         httpd      exactplus   rw     rw      r
htdocs/data/rand/acgt         httpd      exactplus   rw     rw      r
htdocs/data/rand/exons        httpd      exactplus   rw     rw      r
htdocs/data/rand/align        cronuser   exactplus   rw     rw      r
htdocs/data/rand/custom       cronuser   exactplus   rw     rw      r
htdocs/data/rand/consensus    cronuser   exactplus   rw     rw      r
webexp/                       user       user        rwx    rx      rx
webexp/dequeue exp            user       user        rwx    rx      rx
webexp/proc exp               user       user        rwx    rx      rx
webexp/check exp              user       user        rwx    rx      rx

Table 2.1: Directories and permissions of important files for the ExactPlus web interface. Permissions are UNIX file permissions in standard POSIX code format: r–Read, w–Write, x–Execute, s–set-uid. htdocs/data/rand is a directory created by exp.cgi with a randomized name for each run of web ExactPlus (e.g., 1090112 avbec replacing rand in the table). Both ‘httpd’ and ‘cronuser’ are members of the group ‘exactplus’ to allow them to write to the data directory. ‘user’ is a user whose account is different from, and cannot be written to by, ‘httpd’.

While the cron jobs are running, the user sees a ‘waiting for processing’ page that automatically reloads results.cgi every 30 seconds. results.cgi checks for a flag file named error or done to tell whether the run has finished. An error message is returned to the user if the flag file error is found; otherwise, if a done file is found, the location of the output files is reported to the user. ExactPlus also provides a URL pointing to the custom track file for pasting into the UCSC Genome Browser.

Once per day, a cleanup script (check exp) runs, using the timestamp embedded in the flag file (done) to identify old data. Directories with a flag file named not started or working with a timestamp older than 1 day are renamed to error. Data directories named error are not deleted, so that errors can be tracked down more easily. All output of ExactPlus processing scripts is appended to a log file in the data directory.

2.2.3 Discussion of implementation

The architecture of the web interface is relatively modular, resistant to race conditions, and easily scalable with parallel processing. Coordination between the various parts of the system is managed with a flag file whose name is changed depending on the run state of the job. When writing multi-user client-server software, such as the ExactPlus web interface, it is important to consider possible race conditions because simultaneous execution is possible. External constraints meant we were unable to implement a daemon-based system that could guarantee serial execution. Another alternative is using a database (effectively a daemon) to manage program interaction, but we deemed the additional complexity and dependencies unnecessary. The system we developed using flag files should make a CGI-instigated race condition extremely unlikely under normal conditions, though under conditions of high filesystem latency or high load, race conditions remain possible. Because simultaneous calls to exp.cgi are the most likely source of race conditions, we use a couple of methods to mitigate that risk. exp.cgi checks for the presence of a data directory of the same name and immediately creates it if it does not exist. Additionally, data directory names contain a random component with >10^7 possibilities to minimize the chance of attempts at writing to the same directory at the same time. This has the added benefit of enhancing the privacy of ExactPlus results.

The next possible source of race conditions is the execution of the ExactPlus job via cron calling proc exp. Several methods are used to reduce the chances of this occurring: proc exp itself is kept to a minimum functionally so that it runs and exits quickly; it simply counts the number of dequeue exp processes running simultaneously, renames a flag file, and forks another process to handle running ExactPlus. dequeue exp handles running the exactplus command line script itself and tracking the time it took, as well as renaming the flag file once exactplus has finished. This separation reduces the probability of two proc exp jobs running simultaneously. It is unlikely that proc exp jobs would take longer than one minute to execute, but in case they did because of high system load or filesystem latency, proc exp also checks for the number of running ExactPlus processes and does not start a new process if more than a preset number (currently two) are running simultaneously.

2.3 Comparison to other methods for identifying conserved sequences

2.3.1 Other methods for conservation detection

Because of the importance and difficulty of the problem, many computational approaches to identifying cis-regulatory sequences have been developed based on several different paradigms (GuhaThakurta, 2006; Margulies et al., 2003; Cooper et al., 2005; Siepel et al., 2005). Using sequence conservation across multiple species has been a popular approach that has been successful in finding regulatory regions in many cases. However, it appears that many regulatory sites are not strongly conserved (ENCODE Project Consortium, 2007). Most of the approaches that have been developed thus far attempt to identify those regions that are more conserved than some measure of the background evolutionary rate (Margulies et al., 2003; Cooper et al., 2005; Siepel et al., 2005). These approaches, including ExactPlus, rely on the idea that cis-regulatory sequences are constrained by evolution and thus are less likely to change than the regions around them. While the other methods share the basic premise of ExactPlus, there is an important difference in how they work.

ExactPlus does not make any explicit attempt to model the evolutionary rate. This has two important benefits: (1) the results of ExactPlus analysis are not dependent on the exact boundaries of the region selected, thus changing the boundaries of the region will not change the results of the analysis; (2) ExactPlus results do not depend on the existence or quality of annotation to determine neutrally evolving sequences. The important disadvantage of the ExactPlus approach is that, because it does not rely on an explicit model of evolution, it is very sensitive to the inclusion or exclusion of additional species, and there is no correction for the relatedness of species. Parameters also must be ‘hand-tuned’ depending on the number of species with sequence for a region; however, the simplicity and speed of ExactPlus make the parameters easy to understand and tune.

To establish the relative utility and performance of ExactPlus, we compared its performance with three other methods for identifying conserved sequences: WebMCS, described in Margulies et al. (2003); GERP, described in Cooper et al. (2005); and PhastCons, described in Siepel et al. (2005). Margulies et al. (2007) performed a comparison of these three algorithms analyzing 1% of the human genome sequenced as part of the ENCODE pilot project; we have used the results of that study to compare ExactPlus here (ENCODE Project Consortium, 2007). ExactPlus was created with a different aim in some regards than the other conserved sequence identification programs. Its aim is to pick the ‘low hanging fruit’ of the most highly conserved short stretches of sequence that are good candidates for important functional regions. The other programs were designed with the goal of identifying the approximately 5% of the human genome that is thought to be under broad negative selection (Waterston et al., 2002). Partly because of this, the results of ExactPlus are in some regards not directly comparable to the results of the other programs. A brief discussion of the methods that we compared to ExactPlus follows.

binCons

BinCons is an early approach to genome-scale identification of conserved sequences from multiple sequence alignments (Margulies et al., 2003). It weights matches and mismatches from a reference sequence based on an estimate of the neutral mismatch (substitution) rate derived from nearby protein coding four-fold degenerate positions. This weighting is based on calculating the cumulative binomial probability of detecting the observed match proportion for a ≥25-bp window, given the neutral rate estimate. Required inputs are an alignment, a phylogenetic tree of those species, and an annotation of protein coding sequence for determination of four-fold degenerate sites. As with ExactPlus, binCons only considers matches and mismatches, which discards a significant amount of information provided in a multiple sequence alignment. Additionally, the minimum constrained site size of 25 bp is a limitation when looking to identify TFBSs, which are usually much shorter. The major theoretical advantage of binCons over ExactPlus is that binCons does attempt to handle non-independence of sequences because of phylogenetic history by binomial estimation of the neutral distances, a type of phylogenetic averaging approach.
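The cumulative binomial probability mentioned above can be illustrated for a single pairwise comparison. This is a simplified sketch and not the binCons implementation, which combines information across species; the function name and the example neutral match probability are assumptions chosen for illustration.

```python
from math import comb


def binomial_tail(matches, window, neutral_match_prob):
    """Probability of seeing at least `matches` matching positions in a
    window of `window` aligned bases if each position matched
    independently with the neutral match probability."""
    p = neutral_match_prob
    return sum(comb(window, k) * p ** k * (1 - p) ** (window - k)
               for k in range(matches, window + 1))


# A 25-bp window with 24 matches is far less likely under a neutral match
# probability of 0.7 than a window with 18 matches, so it scores as more
# conserved.
high = binomial_tail(24, 25, 0.7)
typical = binomial_tail(18, 25, 0.7)
```

The smaller the tail probability, the less plausible it is that the window is evolving at the neutral rate.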

GERP

GERP (Cooper et al., 2005) identifies constrained sequences by attempting to model the rate of substitutions in neutrally evolving sequence and looking for positions with significantly fewer substitutions. The constrained regions are measured in numbers of ‘rejected substitutions’ compared to the expected number of observed substitutions (Cooper et al., 2005). The first step to use GERP involves generating a tree with branch lengths estimated using known constrained sequence. An adjustment factor is then calculated using presumed neutral sequence, and a neutral rate parameter is set to be the conserved branch length multiplied by this factor. A cutoff number of ‘rejected substitutions’ is then calibrated based on an estimated ‘false positive rate’ derived from applying that cutoff factor to alignments with bootstrapped columns. For these studies, the cutoff that provides a 5% estimated false-positive rate is used to find constrained sequences. This method improves upon binCons by using an explicit model of evolution that incorporates branch lengths to estimate the neutral rate of substitutions, treating gaps in the alignment as missing data, not requiring a large window size, and not requiring correctly annotated exons to determine the neutral rate. However, GERP requires the additional information of a tree with accurate relative branch lengths and an estimate of the neutral rate to calibrate that tree (alternatively an annotation of constrained sequence can be used to estimate the neutral rate). The requirement for annotation of conserved (and non-conserved) sequences to calibrate the neutral rate as input to an algorithm that attempts to identify conserved sequence seems a bit circular. Cooper et al. (2005) suggest that GERP is relatively robust to variation in the estimation of the neutral rate because the cutoff calibration on permuted alignments helps compensate for any error at that step of the process.

Figure 2.3: The state transition diagram for phastCons phylo-HMM. It consists of a state for conserved regions (c) and a state for non-conserved regions (n). Each state is associated with a phylogenetic model (ψc and ψn) that is identical to the other except for a scaling parameter ρ (0 ≤ ρ ≤ 1), which is applied to the branch lengths of ψc to represent the substitution rate of conserved sequence as a fraction of the substitution rate of non-conserved sequence. µ and ν (0 ≤ µ, ν ≤ 1) define the state transition probabilities. Adapted from Siepel et al. (2005).

phastCons

PhastCons uses a ‘phylogenetic hidden Markov model’ (phylo-HMM) that attempts to model both the process by which substitutions occur, as well as how that process could change from one site to the next (Siepel et al., 2005). The phylo-HMM has one state for conserved regions and another for non-conserved regions (Figure 2.3). Free parameters of the model are estimated using an expectation maximization algorithm. Smoothing parameters have to be set manually, as maximum likelihood estimates did a poor job of estimating them on large gene-poor (e.g., mammalian) genomes. Three parameters [Lmin, the minimum expected length of a sequence of conserved sites; γ, the expected coverage of PCEs; and ω, the expected length of a conserved PCE] need to be set by the user of phastCons. γ and ω are related to the state-transition parameters µ and ν by the equations γ = ν/(ν + µ) and ω = 1/µ. The complexity of the decisions on how to set these parameters makes phastCons relatively difficult to run for bench scientists who might be more interested in the results than the methods used. For this reason, phastCons is usually used with the pre-computed results applied to whole-genome sequences at the UCSC Genome Browser (Kent et al., 2002), limiting its utility for people who have additional sequence data and do not wish to spend the time to fully understand the algorithm. That said, increasing numbers of whole-genome sequences will make pre-computed phastCons tracks increasingly useful, and will allow progressively smaller PCEs to be identified by phastCons. It also has the advantage of using actual base substitution rates rather than mismatches and an explicit model of evolution (GTR/REV) that can utilize the additional information provided by knowing which base has changed. As with GERP, there is no lower bound on the size of PCEs identifiable with the system, though the sizes of PCEs are dependent on the setting of ω.
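The two equations relating γ and ω to µ and ν can be inverted to see what a given choice of tuning parameters implies for the HMM: ω = 1/µ gives µ = 1/ω, and γ = ν/(ν + µ) rearranges to ν = γµ/(1 − γ). The short sketch below applies that algebra; the function name and the example values are illustrative assumptions, not phastCons defaults.

```python
def transition_params(gamma, omega):
    """Convert expected coverage (gamma) and expected element length
    (omega) into the state-transition probabilities (mu, nu)."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    if omega < 1.0:
        raise ValueError("omega must be at least 1 so that mu <= 1")
    mu = 1.0 / omega                     # from omega = 1 / mu
    nu = gamma * mu / (1.0 - gamma)      # from gamma = nu / (nu + mu)
    if nu > 1.0:
        raise ValueError("this gamma/omega pair implies nu > 1")
    return mu, nu


# Example: 25% expected coverage with elements expected to be 10 bp long.
mu, nu = transition_params(0.25, 10)
```

Substituting the returned values back into γ = ν/(ν + µ) recovers the requested coverage, which is a quick check that the two parameterizations agree.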

2.3.2 Methods

Alignments, bootstrapped alignments, and conserved sequence datasets used in Margulies et al. (2007) and ENCODE Project Consortium (2007) were acquired with the help of Steve Parker and Elliott Margulies as well as downloaded using Galaxy (http://galaxy.psu.edu/). For Margulies et al. (2007) and ENCODE Project Consortium (2007), the authors tuned their methods for finding conserved sequences to get a per-target ∼5% ‘false discovery rate’, defined as the amount of sequence marked as conserved when the methods were applied to a 1-Mb bootstrap alignment derived from the ENCODE pilot project alignment for each target region. When applied with any set of reasonable parameters, ExactPlus has a 0% false discovery rate as measured by this method, so this tuning system was not used. Instead, we tuned ExactPlus parameters to mark approximately 1% of the human sequence as conserved on a target-by-target basis.

We reached this approximately 1% target by running the program with a minimum PCE length of 6 bp and a series of minimum numbers of species with the same base. The extend option was omitted for simplicity. The minimum number of species to start a PCE was then selected on a target-by-target basis, and overall totals were calculated by summing the numbers for each target. ExactPlus BED file output and custom Perl scripts were used to compare the results of ExactPlus with the results from other methods. To determine significance, 1000 random datasets with the same element length distribution were created for each target, and the observed overlap was mapped to this null distribution. This type of null bootstrap distribution may cause an overestimate of the significance of overlap between sets due to the clustering of many genomic features (ENCODE Project Consortium, 2007, Supplement S1.13). Overlaps were analyzed using custom Perl and R scripts.
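The randomization test described above can be sketched as follows. The original analysis used custom Perl and R scripts, so this is only an illustration under assumptions: elements are half-open (start, end) intervals on a single target, random elements are placed uniformly and may overlap one another, and all function names are hypothetical.

```python
import random


def to_mask(intervals, length):
    """Boolean mask of the positions covered by a set of intervals."""
    mask = [False] * length
    for start, end in intervals:
        for i in range(start, end):
            mask[i] = True
    return mask


def overlap_bases(intervals, mask):
    """Number of distinct bases in `intervals` that fall inside the mask."""
    covered = set()
    for start, end in intervals:
        covered.update(i for i in range(start, end) if mask[i])
    return len(covered)


def random_placement(lengths, target_length, rng):
    """Place one element of each observed length at a random position."""
    placed = []
    for length in lengths:
        start = rng.randrange(target_length - length + 1)
        placed.append((start, start + length))
    return placed


def empirical_p(elements, annotation, target_length, n_random=1000, seed=0):
    """Observed overlap and the fraction of random datasets that overlap
    the annotation at least as much as the observed elements do."""
    rng = random.Random(seed)
    mask = to_mask(annotation, target_length)
    observed = overlap_bases(elements, mask)
    lengths = [end - start for start, end in elements]
    hits = sum(
        overlap_bases(random_placement(lengths, target_length, rng), mask)
        >= observed
        for _ in range(n_random)
    )
    return observed, (hits + 1) / (n_random + 1)
```

Uniform placement ignores the clustering of genomic features, which is the reason noted above that such a null distribution may overstate significance.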

2.3.3 Results

We compared the results from ExactPlus to those of the three methods outlined in Margulies et al. (2007) and ENCODE Project Consortium (2007) using the data from those papers. Our methods were as similar as possible to those used in the two studies. We did run into some difficulties because of the differences between ExactPlus and the other methods.

In Margulies et al. (2007) and ENCODE Project Consortium (2007) a ‘false discovery rate’ was tuned using bootstrapped datasets to generate an alignment where no sequences should be marked as conserved. Both PhastCons and GERP were tuned with parameters that marked 5% of the bootstrap dataset as conserved (Margulies et al., 2007; ENCODE Project Consortium, 2007). We were unable to use this metric with ExactPlus because the same type of ‘false discovery rate’ for ExactPlus measured using the bootstrapped dataset is essentially zero for any reasonable set of parameters. That is not to say that the true false discovery rate for ExactPlus is zero. The low ‘false discovery rate’ on the bootstrapped dataset is because ExactPlus codes alignment gaps as mismatches. GERP and PhastCons take approaches based on phylogenetic models that treat gaps in the alignments as missing data. The additional information provided by the indels allows ExactPlus to avoid false positive results in the bootstrapped dataset.

Table 2.2: Statistics for the PCE size distribution in base pairs for each of the conservation methods. [Columns: Program, N, Min, Median, Mean, SD, Max. Rows: ExactPlus, binCons, GERP, phastCons.]

Figure 2.4: ExactPlus PCEs overlapping regions marked as conserved with other programs described in Margulies et al. (2007) and ENCODE Project Consortium (2007). Loose, moderate, and strict sets are defined in Margulies et al. (2007). A, The number of bases in human marked by ExactPlus as conserved that overlap other methods are indicated in light gray, and bases that are unique to ExactPlus are indicated in dark gray. Numbers within the bars indicate the percent of total ExactPlus marked bases in that category. B, ExactPlus PCEs that overlap conserved PCEs marked by other methods are indicated in light gray, and ExactPlus PCEs that are not included in other sets are in dark gray. Numbers within the bars indicate the percent of total ExactPlus PCEs in that category. [Bars in each panel: binCons, PhastCons, GERP, loose, moderate, strict.]

Because ExactPlus was designed to look for only the most conserved small regions, we tuned ExactPlus to mark approximately 1% of the sequence of each target as conserved, using a minimum PCE size of 6 bp. With these settings, ExactPlus PCEs covered 1.17% of human sequence in the ENCODE regions. The median PCE length was 8 bp, significantly smaller than the median PCE length of any of the other methods (Table 2.2 and Figure 2.5). In spite of the much smaller PCEs, the great majority of ExactPlus PCEs fall within regions marked as constrained with other methods (Figure 2.4). The distribution of ExactPlus PCEs is very skewed to small sizes, though oftentimes the conserved PCEs cluster closely together. Notably, the PCE size distribution (Figure 2.5A) for ExactPlus shows periodic peaks in the number of PCEs of a given size, with a period of 3 bp that is not seen in the PCE size distributions for any of the other methods; this suggests a relationship to codon positions. The peaks occur at PCE sizes of 3n + 2 (e.g., 8 bp, 11 bp, 14 bp, etc.), which corresponds to PCE lengths bound on one side by a first codon position and the other by a second codon position. This is the largest length of a string of consecutive codon positions containing the minimum number of less conserved third codon positions. Indeed, when coding positions are removed from the analysis, the periodic peaks are no longer evident.
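The 3n + 2 pattern can be checked by brute force. The sketch below is only an arithmetic illustration with a hypothetical function name: positions are numbered from 0 in a single reading frame, so every position with remainder 2 modulo 3 is a third codon position, and the function finds the longest run of consecutive positions that contains exactly n of them.

```python
def longest_run_with_third_positions(n_third, span=60):
    """Longest stretch of consecutive coding positions that contains
    exactly `n_third` third codon positions (searched within `span`)."""
    best = 0
    for start in range(span):
        for end in range(start + 1, span + 1):
            thirds = sum(1 for pos in range(start, end) if pos % 3 == 2)
            if thirds == n_third:
                best = max(best, end - start)
    return best


# Longest runs for one to four third codon positions.
lengths = [longest_run_with_third_positions(n) for n in (1, 2, 3, 4)]
```

Each such run starts at a first codon position and ends at a second codon position, so extending it by one base in either direction would add another third codon position.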

Functional annotations

In spite of the stringent settings used for ExactPlus in this study, more than 69% of coding exons overlapped ExactPlus marked PCEs (Figure 2.6B). ExactPlus was initially designed to help in the identification of functional TFBSs. The ENCODE Consortium used ChIP-chip experiments to attempt to identify sites bound by regulatory factors in a relatively unbiased fashion for several transcription factors and transcription-associated proteins (Zhang et al., 2007). Of the regulatory factor bound regions (RFBRs) identified by the ENCODE Consortium, 26% contained ExactPlus PCEs. The higher resolution of comparative genomics studies is evident from the fact that only 2% of the sequence of the annotated RFBRs is contained in the ExactPlus results, covering 26% of all annotated RFBRs (Figure 2.6; Zhang et al., 2007). Less than 0.1% of the ancient repeats (ARs) that were used as a proxy for neutral sequence by the ENCODE Consortium overlap ExactPlus PCEs (Figure 2.6; ENCODE Project Consortium, 2007). All other functional classes had significant overlap with ExactPlus PCEs, though notably the only functional category defined by the ENCODE Consortium to have a less significant overlap with ExactPlus PCEs is unidentified transcription fragments (Un.TxFrags; Figure 2.6).

Figure 2.5: Size of individual sequences marked as conserved. Multiple regions with no gaps between them were united to form one region. A, Density plot of the sizes of ExactPlus PCEs. Note that periodic reduction in PCE density seen with ExactPlus PCEs disappears when coding sequences (CDS) are removed. B, Box and whiskers plot of PCE sizes for each method of identifying conserved regions. [Panel A axes: Bases versus Percent hits, with series ExactPlus, ExactPlus w/o CDS, GERP, binCons, and phastCons. Panel B axis: Bases, for ExactPlus, binCons, phastCons, and GERP.]

Figure 2.6: Amount of ENCODE annotated sequence overlapping with ExactPlus results. A, Proportion of ENCODE annotated bases overlapping ExactPlus results. B, Proportion of ENCODE annotated regions overlapping ExactPlus marked PCEs. All overlaps are significantly more likely than random sets (P<0.001) except those marked * to indicate overlap significantly less likely than random sets (P<0.001) and ** for an overlap of P=0.037 on the null distribution. [Vertical axis in both panels: Percent of Annotation Overlapping.]

2.4 Experimental validation

The primary aim for our development of ExactPlus was to develop a simple method, readily usable by molecular biologists not trained in computational techniques, to identify high-confidence candidate functional regions from comparative genomic data. Our comparisons to other methods demonstrate that ExactPlus produces results surprisingly similar to other methods, and can identify smaller conserved sites in some circumstances. Additionally, because our datasets are relatively complete, treating gaps as mismatches allows ExactPlus to mark dramatically fewer false positives as measured by bootstrap alignments. To better assess the performance of ExactPlus in experimental systems and to demonstrate its utility, we collaborated with groups who wanted help with identifying candidate regulatory non-coding regions for experimental validation.

Two studies undertaken with our collaborators applied ExactPlus analysis to generate candidate functional non-coding PCEs that were further experimentally characterized. In the first study described here, ExactPlus was used to identify a candidate regulatory PCE for GDF6. GDF6 is involved in axial patterning and ocular development, and was the subject of mouse BAC studies to identify regulatory regions as well as pairwise comparative sequence analysis (Mortlock et al., 2003, 2004; Portnoy et al., 2005; Appendix 5.3.2). In the second study described here, ExactPlus was used to identify a partial loss-of-function mutation in a mouse mutant that had been mapped to Sox10, but no coding variants were found that could explain the loss of function (Antonellis et al., 2006; Appendix 5.3.2).

2.4.1 GDF6

GDF6 is a member of the TGF-beta superfamily that has been implicated in several developmental processes, including skeletal, eye, and neural development, and in several human diseases with varying ocular and vertebral phenotypes (Asai-Coakwell et al., 2009). It is expressed in numerous locations throughout development, in both soft tissues and skeletal joints, and is required for normal formation of limb, ear, and skull joints (Settle et al., 2003; Portnoy et al., 2005). Mouse BAC-transgene studies revealed regulatory sequences distributed over a region of more than 100 kb around the GDF6 gene and showed that those regulatory regions mediate GDF6 transcription in limb joints, digits, retina, genitalia, laryngeal cartilages, skull bones, and other tissues (Mortlock et al., 2003). To better understand GDF6 regulation and to identify additional GDF6 regulatory PCEs, we used ExactPlus and binCons (WebMCS, Margulies et al. 2003) to identify conserved putative regulatory sequences. We collected available sequence from mouse, human, chicken, Fugu, and zebrafish totaling 1.5 Mb, and mapped and sequenced an additional 2.7 Mb of sequence from chimpanzee, baboon, cow, pig, cat, dog, rat, platypus, and zebrafish as part of the NISC Comparative Sequencing Program (Thomas et al., 2003; Portnoy et al., 2005; Mortlock et al., 2004).

Figure 2.7: MCSs in the immediate region spanning the GDF6 transcription unit in a UCSC Genome Browser view (mm4/NCBI build 32, October 2003 chr4:9,641,000-9,850,732) (Kent et al., 2002). WebMCS and selected ExactPlus data tracks are shown below the GDF6 gene structure. The top ten tracks (labeled “EP”) depict the positions of MCSs detected with ExactPlus labeled in the format “EP: #-#-#” indicating the initial seed length, minimum number of species whose sequence needs to match the initial seed region, and minimum number of additional species’ sequences used to extend a match. The two WebMCS tracks depict MCSs corresponding to the top 5% (WebMCS-95) and 2% (WebMCS-98) most conserved sequence. The “Transgenes” track shows the position of the 2.9-kb PJE fragment (Mortlock et al., 2003) and the smaller fragment found with this comparative analysis. Adapted from Portnoy et al. 2005, Figure 1.

Relationship of MCSs detected by ExactPlus and WebMCS-95

MCS detection       No. MCSs   Total MCS   No. MCSs           MCS bases          WebMCS-95        WebMCS-95
method              detected   bases       overlapping with   overlapping with   sensitivity (c)  specificity (d)
                                           WebMCS-95 (a)      WebMCS-95 (b)
WebMCS-95           153        10,488      N/A                N/A                N/A              N/A
ExactPlus 6-6-6     1621       16,002      575                7787               0.487            0.742
ExactPlus 6-9-6     320        5089        275                4660               0.916            0.444
ExactPlus 10-9-6    136        3118        133                3080               0.988            0.294
ExactPlus 25-6-2    42         2973        42                 2890               0.972            0.276
ExactPlus 25-9-6    21         901         21                 901                1.000            0.086
ExactPlus 50-6-9    8          543         8                  543                1.000            0.052
ExactPlus 6-10-2    114        4348        110                4140               0.952            0.395
ExactPlus 10-10-2   57         2844        57                 2823               0.993            0.269
ExactPlus 25-10-2   7          637         7                  637                1.000            0.061
ExactPlus 10-12-2   8          511         8                  511                1.000            0.049

N/A, not applicable.
(a) Number of ExactPlus MCSs that overlap with WebMCS-95 MCSs by at least 1 base.
(b) Number of ExactPlus MCS bases that overlap with WebMCS-95 MCS bases.
(c) Fraction of ExactPlus MCS bases that overlap with WebMCS-95: MCS bases overlapping with WebMCS-95 MCSs/total ExactPlus MCS bases.
(d) Fraction of ExactPlus MCS bases also detected by WebMCS-95: MCS bases overlapping with WebMCS-95/10,488.

Table 2.3: Relationship of MCSs detected by ExactPlus and WebMCS in GDF6 region. Adapted from Portnoy et al. 2005, Table 2.

We aligned the sequences using MultiPipMaker and analyzed them using both WebMCS (binCons; Margulies et al., 2003) and ExactPlus. WebMCS was run with two settings to mark approximately 5% and 2% of the sequence in the region as conserved (abbreviated here as WebMCS-95 and WebMCS-98, respectively). A range of parameters was also used with ExactPlus applied to the 14-species sequence alignment. Figure 2.7 shows the resulting distribution of elements marked as conserved by ExactPlus and WebMCS across the region. WebMCS-95 and WebMCS-98 find 153 and 58 PCEs, respectively, and the length and number of ExactPlus PCEs vary significantly, depending on the parameters used. With the ExactPlus setting of 10 base pair (bp) seeds identical in at least 9 species and extended where 6 or more species match (EP: 10-9-6), 99% of bases in ExactPlus PCEs overlap with WebMCS-95 elements, though 44% of individual EP:10-9-6 elements fall outside WebMCS-95 elements (Figure 2.7, Table 2.3).
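The two overlap fractions reported in Table 2.3 (footnotes c and d) can be reproduced from interval sets with a few lines of code. The sketch below is illustrative only: the function names, the half-open coordinate convention, and the toy intervals are assumptions made for this example and are not part of ExactPlus or WebMCS.

```python
def covered_bases(intervals):
    """Return the set of positions covered by half-open (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end))
    return bases


def overlap_summary(query, reference):
    """Summarize how a query element set (e.g. ExactPlus MCSs) overlaps a
    reference element set (e.g. WebMCS-95 MCSs)."""
    q = covered_bases(query)
    r = covered_bases(reference)
    shared = q & r
    # An element counts as overlapping if it shares at least 1 base (footnote a).
    elements_overlapping = sum(
        1 for start, end in query if any(pos in r for pos in range(start, end))
    )
    return {
        "query_bases": len(q),
        "shared_bases": len(shared),                      # footnote b
        "elements_overlapping": elements_overlapping,     # footnote a
        "fraction_of_query": len(shared) / len(q) if q else 0.0,      # footnote c
        "fraction_of_reference": len(shared) / len(r) if r else 0.0,  # footnote d
    }


# Hypothetical toy coordinates, not taken from the GDF6 region.
toy_exactplus = [(10, 20), (40, 50), (90, 95)]
toy_webmcs = [(5, 25), (45, 70)]
print(overlap_summary(toy_exactplus, toy_webmcs))
```

Applied to the EP: 10-9-6 row of Table 2.3, the same arithmetic gives 3080/3118 ≈ 0.988 and 3080/10,488 ≈ 0.294.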

Several ExactPlus conserved PCEs were found in regions previously shown to drive minigene expression in specific subsets of GDF6-expressing tissues (Mortlock et al., 2003). Two ExactPlus PCEs occur immediately upstream of GDF6 in the genomic interval shown to affect GDF6 expression in neural tube and brain. Three large clusters of ExactPlus PCEs were located in the intron, a region shown to promote expression in the retina and genitalia. An additional large region >60 kb upstream of GDF6, called the proximal joint enhancer (PJE), was previously shown to be required for GDF6 expression in proximal limb joints during embryonic development. Both binCons and ExactPlus detect multiple MCSs within the PJE region, including an approximately 300-bp region. Three Pbx1/Hox-binding motifs are found in the region, including one flanked by two sites highly similar to the consensus Lef1/TCF1-binding motif (Portnoy et al., 2005, Appendix 5.3.2; Mortlock et al., 2004). Lef1/TCF is involved in the Wnt ligand signaling pathway, and Wnt14 has been proposed as a key regulator of early joint development (Portnoy et al., 2005, Appendix 5.3.2). These sites suggest that Hox and Pbx factors may interact with Wnt signaling in this region to regulate joint-specific gene expression.

To test if this region contains a core cis-acting enhancer for GDF6 expression in the proximal limb joint, our collaborators cloned a 440-bp segment into an Hsp68 promoter-LacZ minigene construct (Mortlock et al., 2003; Portnoy et al., 2005) and injected it into mouse embryos. A total of 7 independently generated transgenic mouse embryos were analyzed, and 5 of them showed strong LacZ expression in the elbow, knee, shoulder, and hip joints in a pattern essentially identical to that of the entire 2.9-kb segment that had previously been shown to have enhancer activity. These results show that ExactPlus can be productively used to help identify functional cis-acting regulatory sequences. In this case, ExactPlus was used in concert with known TFBSs and lower-resolution functional studies to identify a strong and specific regulator of GDF6 expression.

2.4.2 Sox10

In this study, we used ExactPlus to identify candidate regions for mutation screening of a mouse mutant. The mutant, called Hry here (also called Sox10Tg(Igf2P3Luc)HarryWard and Sox10Hry), was identified during a transgene experiment (Ward et al., 1997). The mutant phenotype was similar to, but milder than, that of known Sox10 mutants; it was chiefly a reduction in melanocytes, distal enteric neurons, and embryonic Sox10 expression (Antonellis et al., 2006; Southard-Smith et al., 1998; Herbarth et al., 1998). Sox10 is a transcription factor that is mutated in the human neurocristopathy Waardenburg–Shah syndrome type 4 and is involved in gene expression in, and the development of, neural crest cells. Though some targets of Sox10 had been reported, little was known about the transcriptional regulation of the gene itself.

Cytogenetic mapping showed the transgene insertion site on the distal portion of mouse chromosome 15, near the Sox10 gene. Microsatellite linkage mapping narrowed the interval to an 8-cM region that contains Sox10. Southern blot analysis of Sox10 with 7 restriction endonucleases did not reveal any size polymorphisms in the mutant gene. Northern blots from the brains of 8-day-old Hry mice showed no Sox10 expression differences. Whole-mount in situ hybridization experiments from Hry/+ intercross litters revealed reduced or eliminated Sox10 expression at E8.5-E9.5 in 25% of the embryos. These results suggest that the Hry phenotype is due to regulatory differences in the mutant mouse, rather than changes in the sequence of the Sox10 protein.

Figure 2.8: Comparative genomic and mutational analysis of the Sox10 region. A, The region of mouse chromosome 15 with the Sox10 locus and upstream regions as depicted by the UCSC Genome Browser (http://genome.ucsc.edu). ExactPlus results are marked in red at the top for sequences of at least 6 bp identical in at least 7 species, and the clusters of highly conserved sites are marked in green. The gray box indicates the 15.9-kb deletion identified in Hry mice. B, An expanded view of the 15.9-kb deletion. DNA fragments tested for enhancer potential are indicated in black. This figure is adapted from Antonellis et al. 2006, Figure 3.

We generated genomic sequence from the region surrounding the Sox10 locus in rat, cat, pig, cow, and chicken and downloaded mouse and human sequence from the UCSC Genome Browser (Kent et al., 2002). The sequences between the flanking 5' and 3' genes were aligned using MultiPipMaker (Schwartz et al., 2003) and then analyzed with ExactPlus to find regions at least 6 bp in length identical in at least 7 of the 8 species. This analysis identified 197 non-coding segments with an average length of 11.7 bp that cover 2.35% of the genomic sequence. ExactPlus PCEs in non-coding regions formed 8 clusters of highly conserved sites (Figure 2.8A). In addition, WebMCS (Margulies et al., 2003) was used to analyze the same sequences and alignment, and PipMaker (Schwartz et al., 2000) was used to identify regions >70% identical over 100 bp between mouse and human sequences. ExactPlus PCEs covered 15% less sequence than WebMCS and 30% less than PipMaker, but ExactPlus identified a further 18% and 47% of fully conserved sequence not marked by WebMCS or PipMaker, respectively.
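The criterion used here (at least 6 bp identical in at least 7 species) can be illustrated with a simplified window scan over a multiple alignment. This is a minimal sketch under stated assumptions, not the ExactPlus implementation: it assumes the reference sequence is the first row, treats gaps as mismatches, requires the same set of species to match across each window, and omits the separate extension step that ExactPlus applies to seeds. The function name and toy alignment are hypothetical.

```python
def exact_match_windows(alignment, seed_len=6, min_species=7):
    """Return merged half-open column intervals in which at least `min_species`
    rows (counting the reference, row 0) are identical to the reference across
    a gap-free window of `seed_len` alignment columns."""
    reference = alignment[0]
    n_cols = len(reference)
    hits = []
    for start in range(n_cols - seed_len + 1):
        window = reference[start:start + seed_len]
        if "-" in window:
            continue  # gaps in the reference cannot seed a match
        # A row containing a gap in this window cannot equal the gap-free
        # reference window, so gaps are effectively treated as mismatches.
        matching = sum(1 for row in alignment if row[start:start + seed_len] == window)
        if matching >= min_species:
            if hits and start <= hits[-1][1]:
                hits[-1][1] = start + seed_len  # merge with the previous window
            else:
                hits.append([start, start + seed_len])
    return [tuple(hit) for hit in hits]


# Hypothetical three-sequence toy alignment (reference first).
toy_alignment = [
    "ACGTACGTAA",
    "ACGTACCTAA",
    "TCGTAGGT-A",
]
print(exact_match_windows(toy_alignment, seed_len=4, min_species=2))
print(exact_match_windows(toy_alignment, seed_len=4, min_species=3))
```

Raising `min_species` or `seed_len` in this sketch shrinks the marked intervals, which mirrors the way the ExactPlus parameter settings trade the number of elements against stringency.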

To determine whether mutations in any of the clusters of highly conserved sites were responsible for the impaired Sox10 expression in Hry/Hry mice, we screened each of the wt, Hry/+, and Hry/Hry embryos with PCR amplimers designed within each of the 8 clusters of highly conserved sites. Amplimers for 3 of those sites were found to be absent in Hry mice (Figure 2.8). Fine mapping using PCR amplimers revealed a 15.9-kb deletion located 47.3 kb upstream of Sox10 that was not detected in 10 additional mouse strains. Southern blots of wt and Hry/Hry mice with probes in the transgene and immediately outside the deletion detected similar sized fragments, suggesting a link between transgene insertion and the deletion event. Notably, only 1 of the 3 clusters of highly conserved sites was identified by WebMCS or PipMaker (DCHCS-1; Figure 2.8).

Because the Hry deletion seems to reduce Sox10 expression in melanocytes, we hypothesized that the deleted region contains a transcriptional enhancer active in melanocyte cell populations. To test this hypothesis, we screened the deleted region for enhancer potential in immortalized melanocytes using a PCR-based tiling path of fragments from 325 to 1753 bp in size. These fragments were cloned upstream of a minimal promoter and luciferase gene, and then transfected into melan-a cells (Sviderskaya et al., 2002). Five amplimers showed an increase in luciferase expression >3-fold over the empty vector (DIH-1, DIH-8, DIH-10, DIH-15, and DIH-17; Figure 2.8). Three fragments showing high enhancer potential overlapped with the clusters of highly conserved sequences (DIH-1, DIH-2, and DIH-15; Figure 2.8). We also tested just the three clusters of highly conserved sites with 200 bp of flanking sequence (DCHCS-1, DCHCS-2, and DCHCS-3). The overlapping DIH-1 and DCHCS-1 fragments yielded the largest bidirectional increases in luciferase expression (Antonellis et al., 2006, Figure 4). The cluster of highly conserved sites DCHCS-1 contains a highly conserved segment of 95 bp that is 83% identical in all mammals and spans two ExactPlus matches of 51 bp and 21 bp. When we compared the enhancer activity of this core segment to the activity of the original DCHCS-1, luciferase activity increased 6-fold and 2-fold in the forward and reverse orientations, respectively (see Antonellis et al., 2006, Figure 5). Thus, the most conserved regions marked by ExactPlus had the highest enhancer potential. Detailed analysis of the ExactPlus-identified sequences within this core segment using TRANSFAC (Matys et al., 2003) revealed 5 biologically relevant predicted binding sites: 2 for retinoic acid receptor and 3 for Sox family transcription factors.

2.5 Discussion

Comparisons with other methods of identifying conserved non-coding sequences indicate that ExactPlus performs remarkably well at identifying short highly conserved sequences. While the statistical methods used by ExactPlus may not be as sophisticated as other methods, its behavior is easily understood and predictable. Additionally, the speed and ease with which ExactPlus is used allow easy tuning to identify different classes and types of conserved segments, and the easy visualization of ExactPlus results on the UCSC Genome Browser permits simple visual analysis and comparison to other methods of identifying conserved sequences. The results of the experimental analysis of the Sox10 and GDF6 loci demonstrate the utility of ExactPlus in identifying targets for mutation screening and finding candidate functional regions.

Chapter 3

Refining the phylogeny of mammals with a large comparative genomic sequence dataset

3.1 Introduction

As large-scale DNA sequencing becomes increasingly practical and prevalent, new opportunities are created for molecular phylogeneticists to examine ever-larger amounts of genomic sequence from increasing numbers of taxa. These data have the potential to greatly enhance our ability to answer difficult phylogenetic questions; however, the size and inherent imperfections of the datasets present some unique challenges for accurate tree inference. To begin with, the large numbers of characters that serve as input demand a robust computational infrastructure. Further, the fast-evolving nature of most eukaryotic genomes has yielded large amounts of non-protein-coding sequences that are not conserved across species, making it difficult to generate complete and accurate multi-sequence alignments (Margulies et al., 2006).

These and other challenges in dealing with nucleotide sequence-based characters have prompted the use of genomic characters that change less frequently than individual nucleotides, such as inversions, transposon insertions, and coding insertions/deletions (indels) (Shimamura et al., 1997; Murphy et al., 2004; Bashir et al., 2005; Chaisson et al., 2006; Kriegs et al., 2006; Murphy and O’Brien, 2007). However, these genomic characters present their own challenges. First, they are less common, so few may be found to help differentiate short branches (Nishihara et al., 2005). This leads to particular difficulties when assessing support using traditional methods (e.g., bootstrapping). Second, assigning the actual character state (e.g., the presence of the same insertion at a given position or a given rearrangement shared between two species) can be difficult because overlapping rearrangements, changing boundaries, and/or sequence divergence can obscure the historical relationships (Murphy et al., 2004). Finally, although methods for modeling such rare genomic characters have been developed (Waddell et al., 2001; Chaisson et al., 2006), biases leading to the potential for homoplastic evolution are not well understood (Boissinot et al., 2004; Chen et al., 2005). For example, indel events are likely less probable than single-nucleotide changes to occur independently at the same locus in multiple lineages; however, the extent of biases in indel location appears to vary among lineages and types of indel events. Thus, although rare genomic changes can be used as informative characters, there are still reasons why sequence-based characters are helpful as independent sources of phylogenetic information.

Meanwhile, traditional phylogenetic analyses based on nucleotide mutations present a different set of challenges. As the costs of procuring and operating large clusters of commodity computers have decreased, it has become increasingly practical to harness significant amounts of processing power to analyze very large sequence-based datasets. This provides the ability to exploit single-nucleotide substitutions more extensively, yielding more robust phylogenetic inferences. Additionally, there is extensive theory and experience relevant to both modeling the evolution of these characters and using the algorithms to infer phylogenetic trees. However, care must be taken to rule out sources of systematic (or non-stochastic) error, such as long-branch attraction, alignment guide trees, and base-composition biases that can hinder the use of such datasets (Philippe et al., 2005; Kluge and Wolf, 1993; Hillis et al., 2003; Rokas and Carroll, 2005).

Indeed, there remain a number of uncertainties about the mammalian phylogeny that are beginning to be clarified using both substitution-based and rare genomic character-based methods. For example, recent molecular studies have broken the placental (eutherian) mammals into four groups or superorders: Afrotheria (Stanhope et al., 1998), Euarchontoglires (also called Supraprimates by Waddell et al. 2001), Laurasiatheria (Waddell et al., 1999b), and Xenarthra (Cope, 1889). However, the relative arrangement of these groups is uncertain (Waddell et al., 2001, 1999b; Nishihara et al., 2006; Kriegs et al., 2006; Madsen et al., 2001; Murphy et al., 2001a; Scally et al., 2001). Studies investigating the relationships among these four mammalian superorders agree that Euarchontoglires and Laurasiatheria should be grouped together to form Boreoeutheria (Murphy et al., 2001b; Waddell et al., 2001; Springer and de Jong, 2001; Kriegs et al., 2006; Nishihara et al., 2006), although the exact root among the three remaining groups (Boreoeutheria, Afrotheria, and Xenarthra) remains unsettled. Most molecular analyses have weakly supported a basal Afrotherian root (Murphy et al., 2001b; Waddell and Shelley, 2003; Kriegs et al., 2006; Madsen et al., 2003; Waddell et al., 2001; Nishihara et al., 2007; Nikolaev et al., 2007; Scally et al., 2001), while a few have weakly supported the association of Afrotheria and Xenarthra to form Atlantogenata, thereby placing the root between Atlantogenata and Boreoeutheria (Waters et al., 2007; Madsen et al., 2001; Lin et al., 2002; Waddell and Shelley, 2003; Murphy et al., 2007; Delsuc et al., 2002; Hallstrom et al., 2007; Waddell et al., 1999a). On the other hand, the traditional placement of Xenarthra at the root of the placental tree, with Boreoeutheria and Afrotheria together forming Epitheria, has been supported by others (Shoshani and McKenna, 1998).

Several recent studies have further investigated these issues, leading to different conclusions. Kriegs et al. (2006) identified two shared transposon insertions in Afrotheria and Boreoeutheria that could not be found in Xenarthra or in an opossum outgroup, though they did not consider these results statistically significant. Nikolaev et al. (2007) used comparative sequence data generated for 1% of the human genome (as part of the ENCODE project) to examine the root of Placentalia; they reported significant support for a root between Afrotheria and Exafroplacentalia (Boreoeutheria + Xenarthra), though they found it necessary to perform separate analyses on conserved non-coding sequences and amino-acid sequences to exclude both other possible roots. Murphy et al. (2007) searched for informative coding indels within whole-genome sequence data, finding four examples supporting Atlantogenata and none supporting the two alternative roots; they also identified two retroelement insertions with well-conserved flanking sequence that also support Atlantogenata. In addition, Waters et al. (2007) analyzed a phylogeny of L1 sequences and found further support for Atlantogenata. Hallstrom et al. (2007) and Wildman et al. (2007) used coding sequence extracted from whole-genome shotgun sequencing data to find support for an Atlantogenatan root; however, Nishihara et al. (2007) used a similar dataset, and found that the use of more complex models of evolution that partitioned individual genes suggested an Afrotherian root. Another unresolved issue in mammalian phylogenetics relates to the relationships among orders within Laurasiatheria. Though the monophyly of Laurasiatheria is fairly well-established, the relative arrangements within this taxon have been difficult to establish (aside from placing Eulipotyphla at the base of Laurasiatheria) (Waddell et al., 2001; Murphy et al., 2001b; Arnason et al., 2002; Nishihara et al., 2006; Chaisson et al., 2006).

To investigate some of the above issues, we sought to confirm and refine the mammalian phylogeny by investigating the performance of nucleotide-based phylogenetic analyses using very large genomic sequence datasets. Specifically, we mapped, sequenced, assembled, and analyzed a large genomic region (∼1.9 Mb in humans) in 41 mammals, 1 bird, and 2 fishes. Here, we report the experience and results of performing phylogenetic analyses of this large (>69 Mb) comparative sequence dataset.

3.2 Methods

3.2.1 Sequence data

The comparative sequence dataset analyzed here (available at http://www.nisc.nih.gov/data) is an expanded version of that reported by Thomas et al. (2003), with all sequences orthologous to a 1.9-Mb region of human chromosome 7 (build hg18, chr7: 115,597,757-117,475,182) that includes 10 known genes (e.g., CFTR, ST7, and CAV1). All species’ sequences were generated by first isolating bacterial artificial-chromosome (BAC) clones from the orthologous genomic region using overgo-based hybridization methods (Thomas et al., 2002), and then generating high-quality sequence of each selected BAC (Blakesley et al., 2004). For each species, sets of overlapping BAC sequences were compiled into a single ordered and oriented sequence. The assembled BAC sequences are available at http://arjunprasad.net/mbe08 as Supplement S1. All of the analyzed sequences were generated in this fashion as part of the NISC Comparative Sequencing Program (Thomas et al., 2003), except for the human sequence [generated by the International Human Genome Sequencing Consortium, 2001] and the Fugu sequence [generated by the Fugu Genome Consortium (Aparicio et al., 2002)]. Table 3.1 provides a list of the species whose sequences were analyzed.

3.2.2 Alignment

The assembled sequences were aligned using the Threaded Blockset Aligner (TBA), a local-alignment program designed to generate multi-sequence alignments of large sequence datasets (Blanchette et al., 2004b). The final alignment size of all alignable sequence in the dataset was 44 taxa by 6,270,442 characters, including all insertions, deletions, and non-alignable sequence. The initial alignment guide tree was based on the results of Murphy et al. (2001a). Alignment guide trees were also manually created to test alternate hypotheses and to verify that final results were not dependent on the alignment guide tree. The alignment was divided into partitions (i.e., corresponding sub-portions of the genomic region, as described below) using custom Perl scripts (available on request).

3.2.3 Annotation

                                                                                                       Conserved
Clade            Scientific Name                  Common Name                 Total Sequence (a)  Coding (b)  Non-coding (c)
Catarrhini       Homo sapiens                     human                       1,877,426           20,647      102,884
                 Pan troglodytes                  chimpanzee                  1,573,483           17,962      86,513
                 Gorilla gorilla gorilla          lowland gorilla             1,761,981           20,489      93,962
                 Pongo pygmaeus abelii            Sumatran orangutan          1,478,010           18,344      80,548
                 Hylobates gabriellae             red-cheeked gibbon          2,154,624           20,122      97,708
                 Colobus guereza                  black and white colobus     2,023,939           20,575      99,065
                 Cercopithecus aethiops           vervet monkey               1,555,031           18,638      87,051
                 Macaca mulatta                   rhesus macaque              1,678,549           20,569      92,538
                 Papio cynocephalus anubis        olive baboon                1,680,295           20,575      89,897
Platyrrhini      Callithrix jacchus               white-tufted-ear marmoset   1,869,361           19,783      88,306
                 Callicebus moloch                dusky titi                  1,810,674           18,263      84,974
                 Aotus nancymai                   owl monkey                  2,059,585           20,581      96,483
                 Saimiri boliviensis boliviensis  Bolivian squirrel monkey    1,695,311           16,692      67,548
Strepsirrhini    Otolemur garnettii               small-eared galago          1,732,353           20,373      86,512
                 Lemur catta                      ring-tailed lemur           1,399,362           20,545      84,060
                 Microcebus murinus               grey mouse lemur            1,541,029           19,103      86,239
Rodentia         Rattus norvegicus                Norway rat                  1,883,088           20,344      78,983
                 Mus musculus                     mouse                       1,486,509           19,079      73,094
                 Cavia porcellus                  guinea pig                  1,815,594           20,504      85,548
                 Spermophilus tridecemlineatus    13-lined ground squirrel    1,757,846           20,505      89,020
Lagomorpha       Oryctolagus cuniculus            New Zealand white rabbit    1,889,755           20,453      81,226
Cetartiodactyla  Bos taurus                       cow                         2,022,671           20,357      85,135
                 Ovis aries                       sheep                       1,816,302           20,149      76,197
                 Muntiacus muntjak vaginalis      Indian muntjac              1,450,172           15,340      67,216
                 Sus scrofa domestica             domestic pig                1,198,526           17,006      60,133
Perissodactyla   Equus caballus                   horse                       1,423,288           17,580      75,633
Carnivora        Felis catus                      cat                         1,737,938           20,374      81,560
                 Neofelis nebulosa                clouded leopard             1,691,656           16,001      74,568
                 Canis familiaris                 dog                         1,317,853           16,374      69,142
                 Mustela putorius furo            domestic ferret             1,494,791           20,456      75,743
Chiroptera       Carollia perspicillata           Seba's short-tailed bat     1,069,438           14,424      38,369
                 Rhinolophus ferrumequinum        greater horseshoe bat       1,684,815           20,495      85,118
Eulipotyphla     Atelerix albiventris             middle-African hedgehog     1,985,767           20,081      72,111
                 Sorex araneus                    European common shrew       1,734,562           18,845      63,737
Xenarthra        Dasypus novemcinctus             armadillo                   1,454,970           16,850      59,554
Afrotheria       Loxodonta africana               African elephant            2,040,789           20,593      87,812
                 Echinops telfairi                lesser hedgehog tenrec      1,765,269           18,087      74,734
Marsupialia      Didelphis virginiana             North American opossum      1,627,985           15,114      45,484
                 Monodelphis domestica            gray short-tailed opossum   1,174,555           12,480      33,565
                 Macropus eugenii                 tammar wallaby              1,846,640           18,545      61,489
Monotremata      Ornithorhynchus anatinus         duck billed platypus        1,268,713           18,543      49,457
Aves             Gallus gallus                    chicken                     744,025             19,934      32,648
Actinopterygii   Tetraodon nigroviridis           tetraodon                   257,833             16,938      7,760
                 Fugu rubripes                    fugu                        273,621             17,033      7,779
Total:                                                                        69,805,984          825,745     3,217,103

(a) The total amount of assembled sequence (in bases) following removal of low-quality sequence and overlaps between BAC sequences (Thomas et al. 2003).
(b) The number of bases in the coding partition (see text for details).
(c) The number of bases in the conserved non-coding partition (see text for details).

Table 3.1: List of species and amounts of sequence in the multispecies comparative sequence dataset analyzed here.

Coding sequences were identified based on data from the Consensus CDS Project (http://www.ncbi.nlm.nih.gov/CCDS), as provided at the UCSC Genome Browser (hg17 build; www.genome.ucsc.edu). This approach was used for all genes except MET, which was derived from the longest GENCODE annotation (http://genome.imim.es/gencode/) because it was not present in the Consensus CDS annotation. Coding regions within the multi-sequence alignment were manually edited using jalView 2.2.1 (Clamp et al., 2004); areas of uncertain alignment were removed, and gap columns were added where necessary to maintain phase. Codon-position partitions were generated using every third base of the coding alignment.
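Taking every third base of an in-frame coding alignment can be sketched as follows. The original partitioning was done with custom scripts that are not reproduced here; the function name, the dictionary layout, and the toy sequences below are assumptions made for illustration, and the sketch presumes the alignment starts at codon position 1 and that phase has been preserved.

```python
def codon_position_partitions(coding_alignment):
    """Split an in-frame coding alignment into first, second, and third
    codon-position partitions by taking every third column.

    `coding_alignment` maps a taxon name to its aligned coding sequence."""
    partitions = []
    for offset in range(3):
        partitions.append(
            {name: seq[offset::3] for name, seq in coding_alignment.items()}
        )
    return partitions


# Hypothetical two-codon toy alignment.
toy = {"human": "ATGGCC", "mouse": "ATGGCT"}
first, second, third = codon_position_partitions(toy)
print(first, second, third)
```

Because the slicing relies on column position alone, any column inserted or removed out of phase during manual editing would shift every downstream base into the wrong partition, which is why gap columns were added to maintain phase.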

Evolutionarily conserved sequences were identified using annotations represented on the “17-way Most Conserved” track of the UCSC Genome Browser (hg18 build) (Siepel et al., 2005; Kent et al., 2002; Karolchik et al., 2004). These annotations reflect conserved elements that were detected using phastCons (Siepel et al., 2005), which applies a two-state conserved versus non-conserved phylogenetic hidden Markov model to a 17-species multi-sequence alignment. PhastCons also uses the five-parameter general time reversible (GTR) model of sequence evolution with a scaling parameter for conserved sequence. A more complete description of phastCons is provided in Section 2.3.1 and in Siepel et al. (2005). With the parameters used for generating the “17-way Most Conserved” track, 90% of the human coding bases in our analyzed genomic region reside within conserved regions. We extracted all of these coding bases from the annotated conserved regions, leaving a conserved non-coding sequence partition of 104,918 human bases and a total alignment (including gaps) of 132,422 bases. A character state matrix (“coding plus conserved non-coding”) was created by adding all manually edited protein-coding sequence alignments to the above conserved non-coding sequence alignment. We also generated another conserved sequence matrix for comparison purposes using Gblocks 0.91b, which uses a phylogenetically naïve approach to identify sequence conservation (parameters: minimum 23 sequences for a conserved position, minimum 37 sequences for a flanking position, maximum 8 contiguous non-conserved positions, minimum initial block length of 10, minimum block length of 10, half allowed gap positions) (Castresana, 2000). The resulting matrix of Gblocks conserved sequences contained 77,961 bases.
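Removing coding bases from the annotated conserved elements amounts to an interval subtraction. The sketch below shows one way to do this; it is not the script used for the analysis, and the function name, half-open coordinates, and toy intervals are assumptions made for the example.

```python
def subtract_intervals(elements, masks):
    """Remove masked ranges (e.g. coding exons) from conserved elements.

    Both arguments are lists of half-open (start, end) intervals on the same
    coordinate system; the result lists the unmasked remainder."""
    masks = sorted(masks)
    remainder = []
    for start, end in sorted(elements):
        cursor = start
        for mask_start, mask_end in masks:
            if mask_end <= cursor:
                continue  # mask lies entirely before the current position
            if mask_start >= end:
                break  # remaining masks lie after this element
            if mask_start > cursor:
                remainder.append((cursor, mask_start))
            cursor = max(cursor, mask_end)
            if cursor >= end:
                break
        if cursor < end:
            remainder.append((cursor, end))
    return remainder


# Hypothetical coordinates for illustration only.
conserved = [(100, 200), (300, 350)]
coding = [(150, 160), (190, 310)]
noncoding = subtract_intervals(conserved, coding)
print(noncoding, sum(end - start for start, end in noncoding))
```

Summing the lengths of the returned intervals gives the size of the conserved non-coding partition in reference coordinates, analogous to the human base count quoted above.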

3.2.4 Tree searching

The extraction of specific bases and conversion of data files to FASTA, NEXUS, or PHYLIP formats for subsequent analyses were performed using custom Perl scripts. Maximum parsimony tree searching was performed using PAUP* 4.0b10 (Swofford, 2003). All trees were rooted using the fishes (Fugu and Tetraodon) as outgroup taxa, and a constraint for the monophyly of mammals was used because the long branch leading to chicken was sometimes unstable. Maximum parsimony trees were generated using random addition replicates as well as bootstrapped with 1,000 replicates of 10 random-addition Subtree Pruning Regrafting (SPR) runs (Salemi and Vandamme, 2003). Neighbor joining trees were generated with 1,000 bootstrap replicates using maximum likelihood HKY85 distances. Incongruence length difference (ILD) or partition homogeneity tests were performed with 1,000 replicates of 10 tree bisection reconnection (TBR) random-addition tree searches with PAUP* version 4.0b10 (Bull et al., 1993; Cunningham, 1997).
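The bootstrap replicates mentioned above are built by resampling alignment columns with replacement, so that each pseudo-replicate has the same number of taxa and characters as the original matrix. The following sketch shows only that resampling step; the tree searches themselves were run in PAUP*, and the function names, dictionary layout, and seed handling here are assumptions made for illustration.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap pseudo-replicate of an alignment by sampling
    columns with replacement. `alignment` maps taxon name to aligned sequence."""
    names = list(alignment)
    n_cols = len(alignment[names[0]])
    columns = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {
        name: "".join(alignment[name][col] for col in columns) for name in names
    }


def bootstrap_replicates(alignment, n_replicates, seed=0):
    """Generate `n_replicates` pseudo-replicates using a seeded generator."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


# Hypothetical toy alignment.
toy = {"human": "ACGTAC", "mouse": "ACGTTC", "fugu": "TCGAAC"}
for replicate in bootstrap_replicates(toy, 2):
    print(replicate)
```

The same column indices are applied to every taxon within a replicate, which keeps homologous positions together; a tree is then inferred from each replicate and support is read from how often a clade recurs.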

The monophyly of mammals was constrained for all tree searching, except when performing ILD tests or Shimodaira-Hasegawa tests (SH-tests), or where otherwise noted (Shimodaira and Hasegawa, 1999). Bayesian phylogenetic analysis was performed with both partitioned and non-partitioned likelihood models using the MPI version of MrBayes v3.1.2 and GTR + Γ + I models, as suggested by MrModeltest or Modeltest (Altekar et al., 2004; Ronquist and Huelsenbeck, 2003; Huelsenbeck et al., 2001; Nylander, 2004; Posada and Crandall, 1998). For the MrBayes analysis, model parameters were estimated from the data, and default priors were used. Metropolis-coupled Markov chain Monte Carlo (MCMCMC) chains were run for 500,000 or 1,000,000 generations, sampling every 100 generations and using 6 heated chains for each of two independent runs. Stationarity was confirmed by manual inspection for convergence of independent runs, as well as topological and likelihood value stability. Majority rules consensus trees were generated from the final two-thirds of sampled trees using PAUP*. Bootstrapped Bayesian runs were performed using seqboot from PHYLIP to create 100 bootstrap datasets, which were then independently analyzed with MrBayes using the settings described above (500,000 generations, sampling every 100 generations, with the final 6,668 sampled trees used for the consensus) (Felsenstein, 2007). The RY-coded conserved non-coding plus coding sequence matrix was run to 5,000,000 generations; the average likelihood values increased until around 500,000 generations, but a single tree became completely resolved before 200,000 generations.
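The sampling arithmetic behind these settings can be checked with a short sketch. It assumes that each run also records its starting state, so that a 500,000-generation run sampled every 100 generations yields 5,001 trees; under that assumption, pooling two runs and discarding the first third reproduces the 6,668 trees quoted above. The function names are illustrative and are not part of MrBayes or PAUP*.

```python
def trees_sampled(generations, sample_every, runs):
    """Total sampled trees, assuming each run also records its starting state."""
    return (generations // sample_every + 1) * runs


def trees_retained(total_trees, discard_num=1, discard_den=3):
    """Trees kept after discarding the first discard_num/discard_den as burn-in."""
    return total_trees * (discard_den - discard_num) // discard_den


def burnin_generation(generations, sample_every, discard_num=1, discard_den=3):
    """First sampled generation at or after the burn-in boundary (ceiling division)."""
    boundary_samples = -(-generations * discard_num // (discard_den * sample_every))
    return boundary_samples * sample_every


total = trees_sampled(500_000, 100, runs=2)
print(total, trees_retained(total), burnin_generation(500_000, 100))
# prints: 10002 6668 166700
```

Under the same assumption, the one-third boundary of a 500,000-generation run falls on the sample taken at generation 166,700.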

Maximum likelihood tree searches were performed with the GTR + Γ model using RAxML-VI-HPC v2.1.3 (Stamatakis and Alexandros, 2006). A proportion of invariable sites (I) was not used because that parameter is not implemented in RAxML-VI-HPC v2.1.3 (Stamatakis and Alexandros, 2006). The best maximum likelihood trees were obtained by performing more than 20 independent tree searches from both completely random and random-addition parsimony-based starting trees (default for RAxML) using the “-f d” high-performance hill-climbing algorithm. Highest likelihood trees from multiple runs of RAxML were the same as the trees obtained from multiple runs of PHYML using GTR + Γ + I models for several datasets tested (Guindon and Gascuel, 2003). Even though its hill-climbing algorithm was slightly more likely to end at local maxima, we used RAxML because a single run with the conserved sequence partition took roughly one-third the time (∼1.5 hours) and one-tenth the RAM (∼400 MB) compared to PHYML; this allowed for more efficient use of available cluster resources. Bootstrap replicates were performed using the parallelized MPI-enabled version of RAxML-VI-HPC v2.1.3 with default settings. SH-tests were performed with the best ML trees found using 15 random-addition runs of RAxML using constraints for taxa in question. The highest likelihood trees were used for SH-tests with PAUP* version 4.0b10 and 10,000 RELL replicates with the GTR + Γ + I model (Shimodaira and Hasegawa, 1999). All tree searching was performed using Linux clusters. Analyses with PAUP* and RAxML were scripted and split using custom Perl and shell scripts. Trees were visualized with the assistance of TreeGraph (Muller and Muller, 2003).

3.3 Results and discussion

3.3.1 Overview

We generated and compiled a high-quality comparative sequence dataset consisting of genomic sequences from 44 vertebrate species (Table 3.1), all of which are orthologous to a 1.9-Mb region on human chromosome 7 (Thomas et al., 2003). This entire genomic region is syntenic in all mammals, reptiles, and fishes that we have examined to date (including some whose sequence was not analyzed in this study). Together, the consistent long-range synteny, BLAST-based sequence comparisons, and nature of the cross-species BAC isolation and mapping process (see Thomas et al., 2002) confirm the orthologous relationship of the sequences within the analyzed dataset. The amount of assembled, annotated, and quality-trimmed sequence in the dataset varies from 257 kb (Tetraodon) to 2 Mb (elephant), with this variance reflecting both intrinsic differences in the size of the genomic region among species as well as incomplete sequence coverage for some species (Table 3.1).

Sequences were aligned using the Threaded Blockset Aligner (TBA) (Blanchette et al., 2004a). Because TBA produces local multi-sequence alignments, it handles small inversions or other local rearrangements well, and avoids incorporating regions where the alignment uncertainty is high (Blanchette et al., 2004a; Pollard et al., 2006). Protein-coding sequences were excised and manually edited to constrain coding indels to multiples of 3 bases, unless there was significant evidence for other indel sizes. We also removed any portions of the alignment where gap positions could not be easily determined. From this alignment, we made a coding sequence matrix, which contained 20,647 human coding bases and 21,129 total characters; this matrix was then analyzed with maximum likelihood, Bayesian, maximum parsimony, and neighbor joining approaches. Bayesian analysis yielded a highly resolved tree with posterior probabilities of 1.0 for all nodes (Figure 3.1). This tree is fully congruent with the maximum likelihood tree, and both trees are largely in agreement with other recent phylogenetic studies (Waddell and Shelley, 2003; Springer et al., 2003; Waddell et al., 1999a; Madsen et al., 2003; Murphy et al., 2001a; Phillips and Penny, 2003; Murphy et al., 2007; Springer et al., 2004; Lin et al., 2002; Nishihara et al., 2006; Delsuc et al., 2002; Murphy et al., 2001b; Kriegs et al., 2006; Waddell et al., 2001; Hallstrom et al., 2007), though a few nodes differ from those reported in some of these studies (see below and Figure 3.3).
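The codon-frame constraint on coding indels described above (gap runs restricted to multiples of 3 bases) lends itself to a mechanical check before manual editing. The sketch below is one possible way to flag frame-breaking gap runs in a single aligned coding sequence; the function name and example sequence are hypothetical, and the manual curation itself is not reproduced.

```python
import re


def frame_breaking_gaps(aligned_cds):
    """Return (start, length) for each gap run whose length is not a multiple of 3."""
    return [
        (match.start(), len(match.group()))
        for match in re.finditer(r"-+", aligned_cds)
        if len(match.group()) % 3 != 0
    ]


# Hypothetical aligned coding sequence with gap runs of 3, 2, and 6 columns.
example = "ATG---GCC--TTA------TGA"
print(frame_breaking_gaps(example))
# prints: [(9, 2)]
```

Only the 2-column run is reported; the 3- and 6-column runs preserve the reading frame and would be left alone.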

Bayesian bootstrap

The maximum likelihood bootstrap proportions are significantly lower than Bayesian posterior probabilities. To investigate the relationship between the bootstrap proportions and the Bayesian posterior probabilities, we developed a “Bayesian bootstrap score” based on the majority rules consensus tree at stationarity for each bootstrapped dataset. The resulting score was only slightly higher than the maximum likelihood bootstrap proportions (Figure 3.2); these results support the notion that Bayesian posterior probabilities may be misleadingly high in some cases, and are not necessarily comparable to traditional bootstrap measures of support (Waddell et al., 2001, 2002; Yang, 2007; Douady et al., 2003). It is possible that more complex models and/or Bayesian techniques for combining data could lead to posterior probabilities that better reflect the uncertainties in the tree (Liu and Pearl, 2007; Edwards et al., 2007). Some bootstrap replicates resulted in unlikely rearrangements of chicken and platypus (e.g., platypus outside a clade of chicken and the other mammals). We also saw this when the coding plus conserved non-coding sequence matrix was nucleotide-coded and analyzed with maximum parsimony. We believe that these findings are a consequence of the long branches leading to the monotreme and reptile species represented in this study, as well as the low taxon sampling of those groups and the distant fish outgroup. Because these unlikely arrangements affected bootstrap support values, we constrained the monophyly of mammals unless otherwise noted.

Figure 3.1: Coding sequence maximum likelihood tree produced using a GTR + Γ codon partitioned model of sequence evolution. Bootstrap support and Bayesian posterior probabilities for all branches are 100% and 1.0, respectively, unless noted with an arrow and a number. Numbers ending with ‘%’ are maximum likelihood bootstrap proportions, and numbers containing a decimal (e.g., 1.0) are Bayesian posterior probabilities. The branch constraining platypus to the mammals is marked with an asterisk (*) to reflect the fact that it was constrained during analysis and thus has no support figures associated with it.

Figure 3.2: Maximum likelihood bootstrap values (in %, above line) compared to majority rules consensus of sampled Bayesian majority rules trees (in %, below line). All nodes without numbers appear in 100% of bootstrap replicates. Maximum likelihood tree was created from bootstrapped nucleotide-coded coding sequence data and a single partitioned GTR + Γ model of sequence evolution with RAxML-HPC. Bayesian results are for bootstrapped nucleotide-coded coding sequence data analyzed with single-partitioned GTR + Γ + I models. The majority rules consensus tree for each run was taken at stationarity (after 166,700 generations). Bootstrap proportions were then taken from the proportion of those trees containing each group.
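The Bayesian bootstrap score described in this subsection can be sketched as a simple tally: for each group, the fraction of bootstrapped datasets whose majority rules consensus tree contains it. The sketch below assumes each consensus tree has already been reduced to a set of clades (each clade a frozenset of taxon names); that representation, the function name, and the toy data are assumptions for illustration, not the scripts used in this study.

```python
from collections import Counter


def bayesian_bootstrap_score(consensus_clades):
    """Fraction of bootstrap datasets whose consensus tree contains each clade.

    consensus_clades: list with one entry per bootstrapped dataset; each entry
    is a set of clades, and each clade is a frozenset of taxon names.
    """
    counts = Counter(clade for clades in consensus_clades for clade in set(clades))
    n_datasets = len(consensus_clades)
    return {clade: count / n_datasets for clade, count in counts.items()}


# Hypothetical consensus results for four bootstrapped datasets.
human_chimp = frozenset({"human", "chimp"})
human_gorilla = frozenset({"human", "gorilla"})
toy_results = [{human_chimp}, {human_chimp}, {human_gorilla}, {human_chimp}]

scores = bayesian_bootstrap_score(toy_results)
print(scores[human_chimp], scores[human_gorilla])
# prints: 0.75 0.25
```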

Conserved non-coding sequence

Bayesian posterior probabilities are high for all branches except the Homo-Pan-Gorilla Group (human, chimpanzee, and gorilla); however, maximum likelihood support is not sufficient to resolve some branches. In an attempt to rectify this and to take advantage of our notably large sequence dataset, we investigated using conserved non-coding sequence for tree construction. Conserved bases were identified using phastCons (Siepel et al., 2005), which utilizes a phylogenetic hidden Markov model to distinguish conserved versus non-conserved sequence and a GTR model of substitution rates to identify conserved segments (more details are provided in Sections 3.2.3 and 2.3.1). With the parameters used, 90% of the known coding sequence and 5.5% of the presumed non-coding sequence within the region were identified as conserved. We generated a matrix containing 153,552 bases by combining the coding sequence alignment and the conserved non-coding sequence alignment to form a coding plus conserved non-coding sequence matrix; this matrix yielded highly resolved trees with strong branch support (Figure 3.4).

Partition             Coding(a)  pos1(b)  pos2(b)  pos3(b)  pos1&2(b)  Cons non-coding(c)
Coding                                                                 <0.001
Coding pos1                               0.323    <0.001              0.460
Coding pos2                      0.648             0.983               0.324
Coding pos3                      0.569    0.960             0.002      <0.001
Coding pos1&2                                      0.599               0.432
Cons non-coding       0.328      0.361    0.696    0.951    0.163

(a) All protein-coding sequence. (b) Codon position 1, 2, or 3 (as indicated) within coding sequence. (c) Conserved non-coding sequence.

Table 3.2: Incongruence length difference tests (ILD tests) between pairs of partitions. P-values above the diagonal are for NT-coded data; P-values below the diagonal are for RY-coded data.

Although we found significant differences between the trees generated with the coding versus conserved non-coding sequence partitions based on incongruence length difference (ILD) or partition homogeneity tests (P=0.001), this was entirely due to the less-conserved third codon positions (Bull et al., 1993). When third codon positions were excluded (i.e., using partitions consisting of codon positions 1 and 2 vs. conserved non-coding sequences), there was no significant difference between the results obtained with each partition (ILD test P=0.432; see Table 3.2). RY-based coding of the data (discussed below) eliminated the significant differences between partitions, even when the third codon positions were included (ILD test P=0.328). Indeed, we only encountered differences between the trees generated with the two partitions on branches that were weakly supported by the coding-sequence dataset (Figures 3.3 and 3.4). Finally, we performed likelihood and Bayesian analysis with models partitioned by codon position (i.e., with the same model of evolution but allowing parameters of those models to vary between partitions); no significant differences in tree topology were seen, although minor differences in branch support scores were noted. Because phastCons bases its identification of conserved sequences on a phylogenetic tree, we also used Gblocks, a phylogenetically naïve program for finding conserved regions. We found no differences in the maximum likelihood tree generated from conserved regions identified by phastCons and Gblocks, with the associated bootstrap proportions similar in both cases (data not shown).

Figure 3.3: Maximum likelihood tree derived from the analysis of the coding sequence partition using RY-coded bases and a codon position partitioned CF + Γ model. Branch lengths indicate likelihood-inferred substitutions per site with a GTR + Γ model. Maximum likelihood bootstrap proportions are listed above Bayesian posterior probabilities for all branches at less than 100% bootstrap proportion and 1.0 Bayesian posterior probability support. Platypus was constrained to the mammals (its branch is marked with an asterisk to reflect this). The fishes (Tetraodon and Fugu) were used to root, but their branches are not shown. Branch lengths were optimized using maximum likelihood from nucleotide-coded data with a GTR + Γ model.

Figure 3.4: Maximum likelihood tree derived from the analysis of coding plus conserved non-coding sequence matrix using RY-coded bases. A CF + Γ model was used, with 4 partitions: 3 for codon positions and 1 for conserved non-coding sequence. Long branches leading to the platypus and chicken were abbreviated for clarity. Maximum likelihood bootstrap proportions are listed above Bayesian posterior probabilities for all branches at less than 100% bootstrap proportion or 1.0 posterior probability. Platypus was constrained to the mammals (its branch is marked with an asterisk to reflect this). Branch lengths were optimized using maximum likelihood from the nucleotide-coded data with a GTR + Γ model.

Figure 3.5: Informative coding indels. Maximum likelihood tree from RY-coded protein-coding sequence partition (see Figure 3.3 for details). Informative coding indels are marked by crosshatches and labeled with letters a-x. Indels marked with an asterisk appear homoplastic on this tree, and all taxa sharing the implied derived state are marked with the same letter. Indels a&b are from ASZ1; c&d are from CAV1; e–o from CFTR; p–t from CTTNBP2; and u–w from MET. Sequence alignments for each indel are available in Appendix 5.3.2.

Informative coding indels

We also examined coding sequence for high-confidence indels, finding 24 phylogenetically informative indels (Figure 3.5). A large number of the coding indels were shared by closely related species, such as rat and mouse (7 indels supporting), or separated the fishes as our outgroup (5 indels supporting). Notably, we found three indels that are homoplastic on all of our trees, two of which (labeled “l” and “p” in Figure 3.5) likely reflect multiple independent deletion events, as they were detected in marsupials and only 1-3 species in Euarchontoglires. The third indel that appears homoplastic on our trees joins dusky titi to the apes (human, chimpanzee, gorilla, orangutan, and gibbon); this may be the result of lineage sorting or independent events. While multiple deletion events may be relatively rare, the level of homoplasy observed here suggests that caution should be used in interpreting support for taxa based on small numbers of such events.

3.3.2 Non-phylogenetic signals

Large sequence datasets, such as the one analyzed here, offer the potential to resolve weakly supported branches; however, they can also be prone to detecting non-phylogenetic signals that confound the results (Philippe et al., 2005). We examined several potential sources of “systematic error” or “non-phylogenetic signals,” attempting to exclude them or to control for their influence (Philippe et al., 2005). Specifically, we considered base composition bias, incongruence across the genomic region, missing data, influence of the alignment guide tree, and long branch attraction as possible sources of non-phylogenetic signal. We further examined long branch attraction during the analysis of various individual taxa.

Base composition

There are significant differences in base composition among species, ranging from 45 to 58% G+C in the coding sequence partition and 32 to 45% G+C in the conserved non-coding sequence partition. The Chi-square test for homogeneity of base frequencies across taxa was highly significant (P<0.000001 for all partitions examined, except codon second positions, for which P=0.00937), though the validity of this test is questionable since it does not take phylogenetic structure into account. To reduce the effects of non-phylogenetic signals due to base composition differences among species, we coded the nucleotides as purines or pyrimidines (RY-coding) (Phillips et al., 2001, 2004; Philippe et al., 2005). This approach also has the benefit of removing signals deriving from the more common transitions that may be associated with higher rates of saturation due to reversals. Indeed, we found that RY-coding eliminated significant differences in trees supported by the less conserved codon third positions (Table 3.2). Because of the large dataset size, we maintained sufficient signal with RY-coded data to make robust phylogenetic inferences (Figures 3.3 and 3.4).

We further found that RY-coding eliminates almost all base composition differences among species. The coefficient of variation between purines and pyrimidines was 84% lower than the coefficient of variation between G+C and A+T for the coding sequence partition and 87% lower for the conserved non-coding sequence partition. RY-coding also eliminated significant differences in trees supported by codon third positions (ILD P=0.948).

Partition           1     2     3     4     5     6     7     8     9     10    Total  Combined(a)  SH-Test P
Atlantogenata       0     0     0     0     0     n/a   0     0     0     0     0      0            Best
Epitheria           4.5   1.7   2.9   9.1   3.9   n/a   1.6   16.7  7.5   8.0   56.0   69.4         <0.0001
Exafroplacentalia   4.5   1.8   0.5   7.9   3.8   n/a   1.6   16.9  6.6   6.7   50.3   62.9         <0.0001

Table 3.3: Relative likelihood support for placental root across coding plus conserved non-coding sequence matrix. Partition 6 contains a region where there is incomplete armadillo sequence, so no placental root can be inferred. (a) Combined likelihood score for entire coding plus conserved non-coding sequence matrix with 4 partitions for model parameters (3 for codon position and 1 for conserved non-coding).
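RY-coding and its effect on compositional spread can be illustrated with a small sketch. It recodes A and G as R (purine) and C and T as Y (pyrimidine), then compares the coefficient of variation of G+C fractions across taxa with that of purine fractions. How the coefficient of variation was computed in this study is not specified, so the population standard deviation divided by the mean is used here as an assumption; all names and the toy sequences are hypothetical.

```python
import statistics

# A and G are purines (R); C and T are pyrimidines (Y). Gaps and Ns pass through.
RY_TABLE = str.maketrans({"A": "R", "G": "R", "C": "Y", "T": "Y"})


def ry_code(seq):
    """Recode a nucleotide sequence as purines (R) and pyrimidines (Y)."""
    return seq.upper().translate(RY_TABLE)


def fraction(seq, symbols):
    """Fraction of called bases (ignoring gaps/ambiguity) that fall in `symbols`."""
    bases = [b for b in seq.upper() if b in "ACGTRY"]
    return sum(b in symbols for b in bases) / len(bases)


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (an assumed definition)."""
    return statistics.pstdev(values) / statistics.mean(values)


# Hypothetical taxa with very different G+C content but equal purine content.
toy = {"taxonA": "GCGCGCAT", "taxonB": "ATATATGC", "taxonC": "GCATGCAT"}
cv_gc = coefficient_of_variation([fraction(s, "GC") for s in toy.values()])
cv_ry = coefficient_of_variation([fraction(ry_code(s), "R") for s in toy.values()])
```

In this toy case the G+C fractions differ widely among taxa while the purine fractions are identical, which is the kind of compositional flattening that motivates RY-coding.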

Incongruence Across the Genomic Region

Combining phylogenetic data can be problematic (Bull et al., 1993). For example, recent genome-wide studies in yeast and bacteria encountered problems with combining large numbers of protein-coding sequence alignments (Comas et al., 2007; Edwards et al., 2007). To check for heterogeneity of support across the alignment, we split the coding plus conserved non-coding sequence matrix into 10 equal-sized segments and analyzed each with maximum likelihood (Figure 3.6 and Table 3.3). While we found support for the Atlantogenata hypothesis with every segment in which the placental root could be inferred, there was considerable variation in the strength of that support, with some segments providing much larger shares of the overall support. One segment (6; see Table 3.3 and Figure 3.6) did not contain any armadillo sequence (due to a gap in the BAC map), and thus could not provide support for the placental root. There was considerable heterogeneity of support among the segments for some other clades as well. Laurasiatheria was generally not well-resolved in any of our analyses, and orders within Laurasiatheria did not have a consistent relationship among the different segments. The relationship between the marsupials and platypus varied as well, perhaps because of the long branches and the sparse taxon sampling (only one monotreme). Additionally, the relationships among owl monkey, marmoset, and squirrel monkey differed among partitions, though with weak support (discussed further below). A majority of segments supported Marsupionta, though overall support was strongly in favor of Theria (Figures 3.3 and 3.4).

Figure 3.6: Maximum likelihood trees for each of 10 sequential, equal-sized partitions from the coding plus conserved non-coding sequence matrix. Numbers (1-10) reflect the specific partition used. The arrangement of taxa and branches indicated in colors other than black vary among partitions. Nodes annotated with hollow circles have less than 50% bootstrap proportions; those with shaded circles have greater than 50% bootstrap proportions; and those with solid circles have 75% bootstrap proportions or greater. Branches that are the same in all trees are indicated in black, with some collapsed to higher-level taxa for simplicity.

Partitions compared: Coding ACGT, Coding RY, Conserved non-coding ACGT, Conserved non-coding RY
≤ 2 gaps vs. > 2 gaps: 0.591 0.664
≤ 3 gaps vs. > 3 gaps: 0.273 0.698 0.050 0.627

Table 3.4: ILD test P-values comparing positions with more gaps to those with few or no gaps. All ILD tests were performed using PAUP* and 1000 replicates of 10 random-addition TBR-swapped maximum parsimony trees.
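Splitting the matrix into 10 sequential, equal-sized segments amounts to computing column boundaries. The sketch below shows one way to do this for the 153,552-character matrix; it assumes the split was made on alignment columns, which the text does not state explicitly, and the function name is hypothetical.

```python
def split_columns(length, n_segments):
    """Return half-open (start, end) column ranges whose sizes differ by at most 1."""
    bounds = [i * length // n_segments for i in range(n_segments + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


segments = split_columns(153_552, 10)
sizes = [end - start for start, end in segments]
print(segments[0], segments[-1], min(sizes), max(sizes))
# prints: (0, 15355) (138196, 153552) 15355 15356
```

Because 153,552 is not divisible by 10, two of the segments carry one extra column; each range can then be used to slice every sequence in the matrix before running a separate tree search.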

Missing Data

Missing data have been considered a source of systematic error in phylogenetic analyses (Kearney, 2002; Huelsenbeck, 1991), although some simulation studies suggest that missing data should not be a problem for datasets with sufficient characters to provide robust signal (Rosenberg and Kumar, 2003; Philippe et al., 2004; Wiens, 2005; Philippe et al., 2005). The coding sequence alignment had 11% missing data (including indels and sequence gaps). To test whether the missing data were biasing our analyses, we performed ILD tests on the RY-coded coding sequence partition, comparing alignment columns with gaps in 3 or fewer species and those with gaps in more than 3 species; no significant differences were seen (P=0.591, Table 3.4). Similar results were obtained when the same analysis was performed with the RY-coded conserved non-coding sequence partition (P=0.627, Table 3.4).
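The partitioning of alignment columns by gap count used for these ILD tests can be sketched as follows. The function name, the choice of characters treated as missing, and the toy alignment are assumptions for illustration; the ILD test itself (run in PAUP*) is not reproduced.

```python
def split_by_gap_count(alignment, max_gaps, gap_chars="-?N"):
    """Split column indices into (few_gaps, many_gaps) by per-column gap count.

    A column goes to few_gaps when at most max_gaps taxa have a gap or missing
    character there, and to many_gaps otherwise.
    """
    few_gaps, many_gaps = [], []
    for index, column in enumerate(zip(*alignment.values())):
        n_gaps = sum(ch in gap_chars for ch in column)
        (few_gaps if n_gaps <= max_gaps else many_gaps).append(index)
    return few_gaps, many_gaps


# Hypothetical four-taxon alignment; the study used a threshold of 3 species.
toy = {"a": "AC-T", "b": "A--T", "c": "AC-T", "d": "ACGT"}
few, many = split_by_gap_count(toy, max_gaps=1)
print(few, many)
# prints: [0, 1, 3] [2]
```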

Finally, to see if additional missing data would strongly affect the analysis, we randomly deleted 25% of non-gap bases from the coding plus conserved non-coding sequence matrix, and observed no effect on the resulting maximum likelihood tree other than slightly changing the bootstrap support for a few branches (data not shown).
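The random deletion experiment can be sketched as masking a fixed fraction of non-gap characters across the whole matrix. The version below replaces the selected bases with `?`; the masking symbol, the function name, and the toy alignment are assumptions, since the procedure actually used is not described in more detail.

```python
import random


def mask_fraction(alignment, fraction, rng, gap_chars="-?", missing="?"):
    """Replace a random `fraction` of non-gap characters with `missing`."""
    positions = [
        (name, i)
        for name, seq in alignment.items()
        for i, ch in enumerate(seq)
        if ch not in gap_chars
    ]
    chosen = rng.sample(positions, round(fraction * len(positions)))
    masked = {name: list(seq) for name, seq in alignment.items()}
    for name, i in chosen:
        masked[name][i] = missing
    return {name: "".join(chars) for name, chars in masked.items()}


# Hypothetical alignment with 16 non-gap characters, so 25% corresponds to 4.
toy = {"a": "ACGTACGTAC", "b": "AC--ACGT--"}
masked = mask_fraction(toy, 0.25, random.Random(1))
```

Existing gaps are left untouched, so only previously observed bases are converted to missing data.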

Alignment Guide Tree

We also analyzed a matrix consisting of all TBA-aligned sequence (containing 1,798,347 human bases). Because of computational constraints, we analyzed this dataset only by maximum parsimony and maximum likelihood methods. The trees derived from these analyses were almost completely resolved; however, by permuting the alignment guide tree, we were able to change the relative arrangement of those branches showing 100% bootstrap support in this analysis. Using only the conserved and protein-coding portions of the alignment yielded a tree with fewer well-resolved branches; however, branches with >70% bootstrap support were resistant to permutations of the alignment guide tree. Notably, the only branches with bootstrap proportions <70% were the interordinal branches within Laurasiatheria. These results confirm that difficult-to-resolve branches are more susceptible to biases introduced by aligning less-conserved sequences, and that biases due to alignment guide trees can be largely controlled by only considering conserved sequences and strongly supported branches. To control for any possible effect of the alignment guide tree on our phylogenetic analysis, we permuted the alignment guide tree, and re-analyzed the data for all controversial branches to confirm that the alignment guide tree was not biasing the results.

3.3.3 Individual taxa

Euarchontoglires

The primate portion of the tree was strongly supported regardless of the partition or alignment guide tree used with the exception of the Homo-Pan-Gorilla Group, which had insufficient informative changes in the RY-coded coding sequence partition to resolve (Figure 3.3). However, when transitions are included, there is sufficient signal and 100% support for the sister relationship between chimpanzees and humans (Figure 3.1). The major primate clades (including Catarrhini, Platyrrhini, and Strepsirrhini) were all supported at 100% by maximum likelihood, Bayesian, neighbor joining, and maximum parsimony approaches. These results largely agree with other recently reported molecular phylogenies for the primates (Opazo et al., 2006; Ray et al., 2005; Barroso et al., 1997; Goodman et al., 1998; Poux and Douzery, 2004), with the exception of the relationships within Cebidae (marmoset, squirrel monkey, and owl monkey) (Barroso et al., 1997). Molecular systematic studies have disagreed on the arrangement of these three taxa. We find support for the association of squirrel monkey with marmoset. Although we see consistent support for this association regardless of alignment guide tree, RY- or nucleotide-coding, or tree inference algorithm, bootstrap proportions are relatively weak and the support varies across the genomic region under study (Figure 3.6). The monophyly of both Glires and Euarchontoglires was strongly supported by Bayesian, maximum likelihood, and maximum parsimony approaches, as in other recent studies (Douzery and Huchon, 2004; Thomas et al., 2003); note that this is in sharp contrast to the results of Misawa and Janke (2003). Notably, neighbor joining support for Glires and Euarchontoglires was also strong (bootstrap proportion 91%; see Figure 3.7) in contrast to the neighbor joining results of Wildman et al. (2007), who used whole-genome shotgun sequence data; this is potentially explained by the much-greater taxon sampling afforded by our dataset. The placement of guinea pig securely within Rodentia is also strongly supported by our data (SH-test P<0.0001 excluding guinea pig as an outgroup to the other rodents), in agreement with others (Sullivan and Swofford, 1997); this result holds when the alignment guide tree is permuted to place guinea pig outside the rodents.

Laurasiatheria

Within Laurasiatheria, there has been considerable disagreement about the arrangement and composition of the historical order Insectivora, although most recent molecular studies divide up this group among several orders, with tenrec falling in Afrotheria (Stanhope et al., 1998; Lin et al., 2002; Nishihara et al., 2006) and shrew and Erinaceous hedgehogs falling in Eulipotyphla (within Laurasiatheria) (Madsen et al., 2003; Lin et al., 2002; Malia et al., 2002; Murphy et al., 2001a; Madsen et al., 2001). Many studies of mitochondrial DNA have placed the Eulipotyphlans at the root of Placentalia, probably because of the unusually high AT-content of their mitochondrial genomes; however, recent studies with greater taxon sampling and more complex models have also placed them in Laurasiatheria (Waddell et al., 1999a; Kjer and Honeycutt, 2007; Arnason et al., 2002; Lin et al., 2002; Gibson et al., 2005). Our neighbor joining tree has Eulipotyphla at the root of Placentalia, but this may be a consequence of the long branch lengths of the two Eulipotyphlans (hedgehog and shrew) represented in our dataset (Figures 3.4 and 3.7). Using maximum likelihood, Bayesian, and maximum parsimony approaches, our analyses consistently place Eulipotyphla at the root of Laurasiatheria, as do most other recent studies (Nishihara et al., 2006; Waddell and Shelley, 2003; Waddell et al., 2001; Murphy et al., 2001b; Arnason et al., 2002; Nikolaev et al., 2007; Madsen et al., 2003; Scally et al., 2001).

Figure 3.7: Neighbor joining majority rules bootstrap tree from coding plus conserved non-coding RY-coded sequence partition using 1000 bootstrap replicates with maximum likelihood distances from unpartitioned CF + Γ + I models. All branches have 100% bootstrap support unless otherwise noted with a number indicating bootstrap proportion (in percent). The branch constraining the platypus to the mammals is indicated with an asterisk to reflect the fact that this branch was constrained for this analysis. Polytomies have less than 50% bootstrap proportion support.

The placement of Perissodactyla, represented here by the horse, has been another source of controversy. Usually, this group is placed either sister to Cetartiodactyla (Lin et al., 2002; Murphy et al., 2001a) or sister to Carnivora (Murphy et al., 2001b; Arnason et al., 2002; Madsen et al., 2003), in most cases with weak support (Waddell et al., 1999a). Schwartz et al. (2003) found a single transposon insertion supporting the Perissodactyla-Carnivora association; meanwhile, Nishihara et al. (2006) found a single transposon insertion supporting a Perissodactyla-Carnivora association and five insertions supporting a Perissodactyla-Chiroptera-Carnivora association (Pegasoferae; Nishihara et al., 2006), i.e., excluding the traditional Perissodactyla-Cetartiodactyla association of hoofed mammals that we see with our analyses. Of note, Nishihara et al. (2006) also found one transposon insertion that conflicted with the Pegasoferae hypothesis. Even with the large number of characters analyzed here, we found only weak bootstrap support for the placement of Perissodactyla (Figure 3.4), although it tended to associate most closely with the Cetartiodactylans and secondarily with the Carnivores. Across the region, support for the arrangement of orders varied significantly, with only segments 4, 5, and 10 agreeing and none of the segments agreeing with the maximum likelihood tree for the entire matrix (Figure 3.6). Thus, the arrangement of orders within Laurasiatheria appears to be difficult to resolve even with large amounts of sequence data and reasonably large numbers of species represented. We further found by permuting the alignment guide tree for all possible interordinal relationships within

Laurasiatheria that the relationships were highly dependent on the alignment guide tree, though not in a predictable manner. Using the coding plus conserved non-coding sequence matrix, we performed SH-tests with the five most-supported Laurasiatherian trees from the literature; none could be excluded with high confidence, likely due in part to the short branches separating Laurasiatherian orders. Perhaps with increased taxon sampling, this problem would be more tractable. It may be that a strong non-phylogenetic signal or incomplete lineage sorting is obscuring the inter-ordinal relationships within Laurasiatheria. Methods that treat gene trees and species trees simultaneously, such as that described by Liu and Pearl (2007), might be able to better resolve such regions. Alternatively, the use of high-abundance transposons or other rare genomic characters might help to resolve such recent divergences (Osterholz et al., 2009).

Theria

Although monotremes are considered mammals, the phylogenetic placement of Monotremata has long been controversial. The traditional placement of monotremes as an outgroup to the other mammals has been challenged by molecular and morphological studies that placed Monotremata as a sister group to the marsupials in a clade called Marsupionta (Janke et al., 1996, 1997).

Our results, however, agree with recent molecular studies that yielded significant evidence (including coding indels) in support of the monophyly of Theria (placental mammals and marsupials), with the monotremes as the first branch of the mammalian tree (van Rheede et al., 2006; Phillips and Penny, 2003; Killian et al., 2001). We find strong bootstrap support (≥94%, see Figures 3.3 and 3.4) for Theria by neighbor joining, maximum parsimony, Bayesian, and maximum likelihood approaches, and SH-tests with nucleotide-coded data just reach significance (P=0.0251) in excluding Marsupionta with the coding plus conserved non-coding sequence matrix. However, there is significant heterogeneity of support for Theria across the region, and a majority of the segments (6/10) support Marsupionta.

These ﬁndings are consistent with other recent molecular and morphological analyses that supported the monophyly of Theria (Hu et al., 1997; Phillips and Penny, 2003; van Rheede et al., 2006), but illustrate the difﬁculty of determining the relationships between these clades.

Placentalia

Although the nucleotide-coded protein-coding sequence partition failed to resolve the root of Placentalia (ML bootstrap <60%, Figure 3.1), the RY-coded coding sequence partition supports an Atlantogenatan root (Figure 3.3). Adding the conserved non-coding partition provides high statistical support for Atlantogenata, both with nucleotide- and RY-coded data (Figures 3.4 and 3.8). Bootstrap support of 100% is seen with all model-based approaches used (including maximum likelihood, Bayesian, neighbor joining, and minimum evolution). SH-tests using the coding plus conserved non-coding sequence matrix exclude Epitheria and Exafroplacentalia, with the results significant past the limits of PAUP and CONSEL (P<0.0001; Figure 3.8). Because of the limited number of Afrotherian and Atlantogenatan species in this study, some caution is warranted in interpreting these results.

[Figure 3.8 panels: A, Epitheria, excluded by the SH-test (P < 0.0001); B, Exafroplacentalia, excluded by the SH-test (P < 0.0001); C, Atlantogenata, the supported solution.]

Figure 3.8: Three possible roots for Placentalia. SH-test results from the coding plus conserved non-coding sequence matrix for both nucleotide- and RY-coded matrices. A, Hypothesis rooting Placentalia between Xenarthra and Epitheria (Boreoeutheria + Afrotheria). B, Hypothesis rooting Placentalia between Afrotheria and Exafroplacentalia (Boreoeutheria + Xenarthra). C, Hypothesis rooting Placentalia between Boreoeutheria and Atlantogenata (Afrotheria + Xenarthra).

Maximum parsimony analysis of the nucleotide-coded coding sequence partition supports an Epitherian root (e.g., Figure 3.8A), but when codon third position sites are removed or the sequence is RY-coded, bootstrap support is reduced to <50%. Maximum parsimony analysis of the coding plus conserved non-coding sequence partition also weakly supports Epitheria.

To exclude the influence of biases introduced by the alignment guide tree, we re-aligned the sequences using a guide tree based on the highest likelihood tree constrained to each possible root, then repeated the likelihood analysis. In all cases, the maximum likelihood tree derived from the coding plus conserved non-coding sequence matrix was rooted between Atlantogenata and Boreoeutheria with 100% bootstrap support and highly significant SH-test results (P<0.001). Because tenrec has a significantly longer branch length with these data than elephant or armadillo, we tried individually removing tenrec and elephant. With either tenrec or elephant missing, we still saw ≥99% maximum-likelihood bootstrap support for Atlantogenata. We also separated the coding plus conserved non-coding sequence matrix into 10 equally sized partitions, and analyzed each separately (Figure 3.6). Although the likelihood values for Atlantogenata varied significantly among the partitions, all partitions supported an Atlantogenatan root (Table 3.3).

These results agree with some other recent large-scale analyses on mostly independent datasets (e.g., Murphy et al., 2007; Wildman et al., 2007; Hallstrom et al., 2007), but conflict with the findings of Nikolaev et al. (2007), Nishihara et al. (2006), and Kriegs et al. (2006). Kriegs et al. (2006) found two transposon insertions in Afrotheria and Boreoeutheria that were not found in armadillo or sloth. These transposon sequences are quite old, and the flanking sequence is not well-conserved. Thus, the transposon-associated sequences may have mutated out of recognition in the 30 million years from the placental root to the divergence of armadillo and sloth; alternatively, multiple transposon insertions may have occurred in Afrotheria and Boreoeutheria (Murphy et al., 2007; Kriegs et al., 2006; Springer et al., 2003). Homoplasy for transposon insertions due to targeted insertions or lineage sorting on short branches, though presumably rare, has been reported (Nishihara et al., 2006; Yu and Zhang, 2005; van de Lagemaat et al., 2005; Slattery et al., 2004) and could also explain these results.

Nikolaev et al. (2007) analyzed amino-acid and conserved non-coding genomic sequences from 14 species to examine the root of Placentalia. Using maximum likelihood analyses of conserved non-coding sequence from the ENCODE pilot project regions (http://www.genome.gov/10005107), they exclude the Epitheria hypothesis and, separately, use amino-acid sequences derived from the same regions to exclude the Atlantogenata hypothesis. Notably, analyses using their largest dataset (conserved non-coding sequence) failed to differentiate between rooting Placentalia with Atlantogenata or Exafroplacentalia. Additionally, their limited taxon and outgroup sampling argues for caution in interpreting the final results (Delsuc et al., 2002); for example, when we only analyzed data from the 14 species studied by Nikolaev et al. (2007), we still found support for Atlantogenata as the root, although the bootstrap support was weak. The data used in our analysis contain significantly more taxa, both ingroup and outgroup, than other recent large-scale nucleotide-based analyses, and this may affect the results significantly.

3.3.4 Summary

In summary, we used a comparative sequence dataset that contains a remarkably large number of conserved bases to derive a phylogeny that provides additional evidence to resolve some of the controversial branches in the mammalian lineage. We find significant support for an Atlantogenatan root of Placentalia as well as additional evidence for the monophyly of Theria. Our studies highlight the difficulties in resolving some very short mammalian branches (e.g., inter-ordinal relationships within Laurasiatheria), even with large amounts of data. Our work further illustrates the value of large genomic sequence datasets for improving the resolution of phylogenetic trees, in this case, to clarify some of the remaining ambiguities within the mammalian tree. Sequences from an increasing number of mammalian taxa should help to resolve the remaining ambiguities associated with the short branches within and between the placental orders.

Chapter 4

Landscapes of incongruence and the phylogeny of primates

4.1 Introduction

Sets of phylogenetic characters are incongruent if they support different phylogenetic trees.

When large phylogenetic datasets are partitioned, it is common that some of the partitions support different trees, usually at weakly supported branches. One explanation is that it is occurring by chance, but when alternative relationships are strongly supported, the question arises: Is the incongruence due to an error in the phylogenetic inference process or do the different partitions actually have different evolutionary histories?

Our interest in further investigating this subject was piqued by the studies described in

Chapter 3. We found that when our large alignment was split into 10 pieces and individual trees were inferred with each, some branches of the resulting trees disagreed, often with strong support (Section 3.3.2 and Figure 3.6). Thus, if any individual region had been investigated in isolation, the results could have differed.

4.1.1 Why are different segments of DNA incongruent?

Support for different trees from different DNA segments could reflect distinct evolutionary histories due to duplication/paralogy, horizontal gene transfer, or incomplete lineage sorting; alternatively, such findings could reflect artifacts of the phylogenetic tree inference process. It has long been recognized that care must be taken in phylogenetic analyses to prevent the accidental use of paralogs (homologies due to genomic duplications) from affecting the results. In fact, a major reason multiple genes are often considered in phylogenetic analysis is to ameliorate the effects of accidentally including paralogs.

Though an obvious possible cause of incongruence, duplications cannot account for all of the incongruence that we see in our large sequence datasets (Section 3.3.2).

Horizontal gene transfer is the process by which genes can pass from one taxon to another. Some genetic mixing between species occurs in almost all taxa, though it is rare in some (Awadalla, 2003). Organisms can take up genetic material from their environment directly, as when genes are contributed to a host by a pathogen or vice versa. Retroviruses integrate into genomes and can contribute genetic material; these are the source of retrotransposons that make up the majority of DNA in many large genomes (Gregory, 2004).

Non-retrovirus-mediated gene transfer is common in prokaryotes; however, it is less common in most multicellular organisms. Furthermore, retroviruses are limited in their DNA-carrying capacity. This has led to the hypothesis that gene transfer between animal species (such as humans) is relatively rare and involves small amounts of DNA; thus far, this has been borne out by experimental data (e.g., Kurland et al., 2003).

Incomplete lineage sorting occurs when speciation events occur rapidly in large or diverse populations. Lineage sorting occurs when multiple alleles exist in a population and are transmitted in a manner different from that of the species as a whole. In this way, speciation events that occur in rapid succession can preserve intraspecific variation. That is, if multiple variants are present in a population, it is possible for some variants to become fixed in a manner reflecting an evolutionary history different from that of the species (see Section 1.1.3 and Figure 1.1 for a more complete description).
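The dependence of discordance on the length of the internode between two speciation events can be illustrated with the standard three-taxon coalescent result: for a species tree ((A,B),C) with an internal branch of T coalescent units, each of the two discordant gene trees has probability exp(-T)/3. The sketch below assumes a neutral, randomly mating diploid ancestral population of constant size with no gene flow; the function name and the example values are illustrative and are not taken from this study.

```python
import math


def gene_tree_probabilities(t_generations, n_effective):
    """Probabilities of the three rooted gene trees for species tree ((A,B),C).

    t_generations: length of the internal branch in generations.
    n_effective:   diploid effective size of the ancestral population.
    """
    # Convert the internal branch to coalescent units (2N generations).
    t_coalescent = t_generations / (2.0 * n_effective)
    # Lineages that fail to coalesce on the internal branch sort at random.
    p_each_discordant = math.exp(-t_coalescent) / 3.0
    p_concordant = 1.0 - 2.0 * p_each_discordant
    return {
        "((A,B),C)": p_concordant,
        "((A,C),B)": p_each_discordant,
        "((B,C),A)": p_each_discordant,
    }


if __name__ == "__main__":
    # Hypothetical values: a short internode relative to population size.
    for tree, prob in gene_tree_probabilities(50_000, 50_000).items():
        print(tree, round(prob, 3))
```

With an internode of zero length the three gene trees are equally likely, and the discordant fraction decays exponentially as the internode lengthens.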

Phylogenetic methods can also be misled by a number of factors (also see Sections 1.1 and 3.3.2). All phylogenetic inference methods make false assumptions about sequence evolution, though the degree to which a given real dataset violates those assumptions and the effects of that violation may vary significantly. Simulations have been used to identify weaknesses in phylogenetic inference methods; however, because in most cases we cannot know for certain the evolutionary histories of extant species, it is difficult to know with certainty the amount of incongruence that can be explained by inaccuracies of the phylogenetic methods used (Kolaczkowski and Thornton, 2008; Wiuf et al., 2001). Most known model violations tend to bias phylogenetic tree inference in a specific direction, and thus are more likely to reduce incongruence rather than enhance it, with the notable exception of the “star-tree paradox” that may (or may not) plague the application of Bayesian methods with extremely large datasets (Prasad et al., 2008; Kolaczkowski et al., 2006; Susko, 2008).

It does seem that incongruence in many datasets exceeds what can be easily explained by failures of phylogenetic inference algorithms.

4.1.2 Measures of incongruence

Statistical measures have been developed to detect whether differences in tree topology supported by different segments are simply due to stochastic variability or are significant. One of the simplest is the incongruence length difference (ILD) or partition homogeneity test (Bull et al., 1993). This is a resampling-based test usually used with the maximum parsimony method. If two partitions support different trees, it simply asks what proportion of randomly drawn partitions, sampled without replacement and identical in size to the partitions being tested, have a lower total tree length (i.e., a better score). As with other resampling-based tests, it estimates the variability in the data by assuming that the observed data are representative, and it creates a distribution of expected tree lengths given the independence of the two partitions. Although the same approach could be taken with maximum likelihood using likelihood scores rather than tree lengths, this is rarely implemented because of the very large number of tree searches it requires (one per partition per resampling replicate).
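The resampling logic of the ILD test can be sketched as follows. This is a minimal illustration that assumes only four taxa, so that the most parsimonious tree can be found by scoring all three unrooted topologies with Fitch parsimony instead of running a heuristic tree search. The function names, the "as short or shorter" comparison, and the add-one p-value correction are illustrative choices, not a description of any published implementation.

```python
import random

# The three unrooted topologies for taxa indexed 0..3.
QUARTET_TOPOLOGIES = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]


def fitch_join(x, y):
    """Fitch step: (state set, cost) for the parent of two child state sets."""
    common = x & y
    return (common, 0) if common else (x | y, 1)


def site_length(column, topology):
    """Minimum number of changes for one alignment column on one topology."""
    (a, b), (c, d) = topology
    left, cost_left = fitch_join({column[a]}, {column[b]})
    right, cost_right = fitch_join({column[c]}, {column[d]})
    _, cost_root = fitch_join(left, right)
    return cost_left + cost_right + cost_root


def best_tree_length(columns):
    """Length of the shortest tree, found exhaustively over the 3 topologies."""
    return min(
        sum(site_length(col, topo) for col in columns)
        for topo in QUARTET_TOPOLOGIES
    )


def ild_test(part_a, part_b, replicates=999, seed=0):
    """Return (observed summed tree length, resampling p-value)."""
    rng = random.Random(seed)
    observed = best_tree_length(part_a) + best_tree_length(part_b)
    pooled = list(part_a) + list(part_b)
    as_short = 0
    for _ in range(replicates):
        # Redraw two partitions of the original sizes without replacement.
        rng.shuffle(pooled)
        rand_a, rand_b = pooled[:len(part_a)], pooled[len(part_a):]
        if best_tree_length(rand_a) + best_tree_length(rand_b) <= observed:
            as_short += 1
    return observed, (as_short + 1) / (replicates + 1)


if __name__ == "__main__":
    # Toy columns (one character per taxon) favoring two different trees.
    part_a = ["AACC"] * 20
    part_b = ["ACAC"] * 20
    print(ild_test(part_a, part_b, replicates=199))
```

A small p-value indicates that random partitions rarely fit as well as the observed ones, which is the signature of conflicting signal between the partitions.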

4.1.3 Estimating species trees

Rather than assess the likelihood that incongruent regions are really supporting independent trees, some recently developed methods based on Bayesian statistics try to infer a species tree given a number of gene trees directly (Ane et al., 2007; Edwards et al., 2007; Rannala and Yang, 2008). Under the assumption that incongruence is caused by incomplete lineage sorting, these methods use coalescent theory estimates of gene tree evolution to infer posterior probabilities of a species tree given inferred gene trees (Kingman, 1982). These methods are very promising, but computationally expensive. Some model simpliﬁcation and heuristics are making it easier to apply these methods to larger datasets (e.g., Hobolth et al., 2007). If incongruence is a result of variation in model parameters it may also be helpful to use partitioned models where the values of some parameters of the phylogenetic model are inferred independently for different portions of the data (Nishihara et al., 2007).

Coalescent theory-based gene tree/species tree models may also incorporate independently inferred parameters for different partitions or partition classes.

4.1.4 Incongruence in data from Prasad et al., 2008

During our analyses of the large dataset described in Chapter 3, we found a great deal of incongruence across the region, though primarily among closely related species groups (Prasad et al., 2008). In that study, we compared the sequence of a 1.9-Mb region of the human genome to the orthologous sequences of 43 other species. The coding and conserved non-coding portions of the region supported similar trees (Figures 3.3 and 3.4); however, when we split the alignment into partitions based on sequence position, we found a great deal of conflict in the supported trees (Figure 3.6). This incongruence is dramatic within many clades, including the Homo-Pan-Gorilla Group, Platyrrhini, and Laurasiatheria, all clades that, at least initially, presented some difficulty for phylogenetic inference from molecular data (Hallstrom and Janke, 2008; Prasad et al., 2008; Osterholz et al., 2009; Schneider, 2000). To apply techniques such as those described in Section 4.1.3, the boundaries of congruent regions need to be defined.

4.1.5 Spatial arrangement of incongruent regions

The boundaries that have been used in studies of incongruence have traditionally been genes. Phylogeneticists often worked with relatively short sequences (often gene sequences) from loci spread across the genome. When looking for incongruence between these genes or regions, the boundaries are natural; however, with genomic sequence data now becoming increasingly available, the issue of where to place boundaries between partitions has become more important because there are no longer natural boundaries. This has led to the development of approaches for identifying the boundaries of incongruent regions. While several methods have been proposed, none is completely satisfactory.

Sliding ILD test boundaries along an alignment is one highly computationally expensive method that has been proposed, but it presupposes that we know how many partitions there are likely to be (M.W. Allard, personal communication, 2007). Another method, called the phylogenetic profiles method, is based on the agreement of pairwise distances. This method, developed in the late 1990s to better understand horizontal gene transfer among archaea (Weiller, 1998), uses sliding windows, summing up the differences of all pairwise distances for the window upstream and downstream of a given base. The phylogenetic profile method has the attractive property of showing by dips where boundaries occur, assuming boundaries of congruence are reflected in large changes in the pairwise differences between species. For closely related species where incongruence may be due to lineage sorting, this is less likely to be helpful. Additionally, it provides no indication of what the preferred tree may be for any given region, and other properties besides incongruence can cause changes in pairwise distances.
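The idea of comparing pairwise distances upstream and downstream of each position can be sketched as below. This is a simplified stand-in for the published phylogenetic profile statistic, not a reimplementation of it: it uses uncorrected p-distances and sums the absolute changes, so in this sketch a high score (rather than a dip) marks a candidate boundary. The function names and the toy alignment are hypothetical.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two equal-length sequences."""
    if not seq_a:
        return 0.0
    return sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)


def distance_profile(alignment, window):
    """Score each position by the change in pairwise distances around it.

    alignment: dict mapping taxon name to an aligned sequence string.
    window:    number of columns examined on each side of a position.
    Returns a list of (position, score); larger scores mean the upstream
    and downstream windows disagree more about pairwise distances.
    """
    names = sorted(alignment)
    length = len(alignment[names[0]])
    profile = []
    for pos in range(window, length - window + 1):
        score = 0.0
        for a, b in combinations(names, 2):
            upstream = p_distance(alignment[a][pos - window:pos],
                                  alignment[b][pos - window:pos])
            downstream = p_distance(alignment[a][pos:pos + window],
                                    alignment[b][pos:pos + window])
            score += abs(upstream - downstream)
        profile.append((pos, score))
    return profile


if __name__ == "__main__":
    # Toy alignment: taxon A groups with B on the left and with C on the right.
    toy = {"A": "A" * 20, "B": "A" * 10 + "C" * 10, "C": "C" * 10 + "A" * 10}
    print(max(distance_profile(toy, 5), key=lambda item: item[1]))
```

As the surrounding text notes, a profile of this kind flags where distances change but says nothing about which tree is preferred on either side.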

Methods have also been developed using Bayesian hidden Markov models (HMMs) and small numbers of taxa to identify boundaries of congruence (Hobolth et al., 2007; Husmeier and Mantzaris, 2008; Webb et al., 2009). Thus far the computational burden of this approach has made it difficult to apply to more than three taxa, or to large sequences.

Sliding windows combined with a measure of tree support are a relatively simple solution to this issue. Though the resolution of sliding windows may not be as high as for HMM-based Bayesian methods, the great reduction in computational load means that even very complex situations involving very large alignments can be considered. Using maximum likelihood estimates also allows consideration of statistical measures of relative likelihood values between trees for a given window (e.g., Archibald and Roger, 2002). Bayesian tree inference methods can also be applied to sliding windows; however, the significant time that Metropolis-Hastings Monte Carlo simulations require makes this a considerably slower approach (Paraskevis et al., 2005).

4.1.6 Incongruence in the Homo-Pan-Gorilla Group

In the study described in Chapter 3, we noted that the relationship between human, chimpanzee, and gorilla (the Homo-Pan-Gorilla Group) varied across the alignment (Figure 3.6). Though the sister relationship of human and chimpanzee is now well-established and extremely well-supported by our data as a whole, this relationship was controversial for a long time because different genes sometimes supported different trees (Ruvolo, 1997). As more sequence data have become available, it has become clear that there are considerable regions where human and gorilla sequences are more similar than human and chimpanzee sequences. With the recent sequencing of the chimpanzee genome, understanding the relationship among human, chimpanzee, and gorilla at a genomic level is of considerable interest (Ebersberger et al., 2007; Hobolth et al., 2007; Caswell et al., 2008; Chimpanzee Sequencing and Analysis Consortium, 2005; Patterson et al., 2006).

The very short divergence time between human, chimpanzee, and gorilla reduces the chances that multiple substitutions at the same site, major branch length differences, or base composition changes will affect tree inference algorithms. Additionally, the small divergence between these species means that multiple sequence alignment methods are able to correctly and unambiguously align the vast majority of residues. In order to investigate the patterns of incongruence evident in these and other groups within the mammals, we developed a sliding window approach for studying the spatial patterns of tree support.

4.2 Materials and methods

4.2.1 Sequences

We generated sequence data for the 28 mammalian species listed in Table 4.1 as part of the NISC Comparative Sequencing Program (http://www.nisc.nih.gov/) and the ENCODE Project (ENCODE Project Consortium, 2007) except for the sequences from human (International Human Genome Sequencing Consortium, 2001), chimpanzee (Chimpanzee Sequencing and Analysis Consortium, 2005), dog (Toh et al., 2005), macaque (http://www.hgsc.bcm.tmc.edu/), mouse (Waterston et al., 2002), orangutan (http://genome.wustl.edu/genome.cgi?GENOME=Pongo%20abelii), and rat (Rat Genome Sequencing Project Consortium, 2004; Havlak et al., 2004), which were retrieved from the UCSC Genome Browser assemblies (http://genome.ucsc.edu) using the liftOver tool. Generated sequences were assembled and manually curated (see Blakesley et al., 2004 for details). All curated sequence assemblies were deposited in GenBank and are listed in Table 4.1 except for the mouse sequence [which was downloaded from the UCSC Genome Browser using the liftOver tool from the July 2007 assembly (mm9), coordinates chr6:16,976,986-18,647,429].

Sequences were aligned using TBA (Blanchette et al., 2004a) and an alignment guide tree based on the results of Prasad et al. (2008). The parameters B=2 and C=0 were used for the blastz pairwise alignments used by TBA, and all other parameters were left at their default. Alternative alignment guide trees were generated from the maximum likelihood tree constrained to each possible alternative arrangement of taxa.

Common name         Scientific name                   Accession number
human               Homo sapiens                      NT_086357.2
chimpanzee          Pan troglodytes                   DP000016.1
gorilla             Gorilla gorilla gorilla           DP000025.1
orangutan           Pongo abelii                      DP000026.2
gibbon              Nomascus leucogenys leucogenys    DP000194.1
baboon              Papio anubis                      DP000233.1
macaque             Macaca mulatta                    DP000005.1
vervet              Cercopithecus aethiops            DP000029.1
colobus monkey      Colobus guereza                   DP000193.1
Ma's night monkey   Aotus nancymaae                   DP000197.1
squirrel monkey     Saimiri boliviensis boliviensis   DP000180.1
marmoset            Callithrix jacchus                DP000014.2
spider monkey       Ateles geoffroyi                  DP000177.1
dusky titi          Callicebus moloch                 DP000019.1
mouse lemur         Microcebus murinus                DP000022.2
black lemur         Eulemur macaco macaco             DP000024.1
ring-tailed lemur   Lemur catta                       DP000004.1
galago              Otolemur garnettii                DP000013.2
tree shrew          Tupaia belangeri                  NT_166790.1
guinea pig          Cavia porcellus                   DP000184.1
mouse               Mus musculus                      (see text)
rat                 Rattus norvegicus                 DP000027.1
squirrel            Spermophilus tridecemlineatus     NT_166240.2
rabbit              Oryctolagus cuniculus             NT_166240.2
horse               Equus caballus                    NT_166616.1
cat                 Felis catus                       DP000234.1
dog                 Canis familiaris                  DP000236.1
elephant            Loxodonta africana                DP000087.1

Table 4.1: Originating species and GenBank accession numbers of sequences used. See Section 4.2.1 in text for details.

4.2.2 PartFinder

We developed a sliding window approach that compares log-likelihoods for each window between alternative trees or models. The software consists of two Perl scripts. The first, pf_makefiles, creates a series of command files for processing with either RAxML-HPC (Stamatakis and Alexandros, 2006) or PAUP (Swofford, 2003). These command files are generated by splitting an alignment into equal windows of user-definable size that slide by a user-definable amount, and they are written in a format optimized for processing on a Beowulf-style cluster. RAxML-HPC is used for its rapid maximum likelihood tree searching capabilities, while PAUP allows flexibility in testing different evolutionary models. Once processing has completed, a second script (pf_compile) processes the output of the likelihood calculating programs and produces tables suitable for statistical analysis and graphing with R (R Development Core Team, 2008). R is used to generate a spatial graph of log-likelihood differences (e.g., likelihood ratios; see Figure 4.3).
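The windowing and per-window comparison described above can be sketched in a few lines. This is an illustration only and is not the PartFinder Perl code: the function names are hypothetical, dropping a trailing partial window is an assumption, and the minimum log-likelihood difference used to call a window for one tree is an arbitrary placeholder.

```python
def sliding_windows(alignment_length, window_size, step):
    """Yield half-open (start, end) column ranges of equal-sized windows.

    A trailing window shorter than window_size is dropped (an assumption
    made for this sketch).
    """
    start = 0
    while start + window_size <= alignment_length:
        yield (start, start + window_size)
        start += step


def best_supported_tree(log_likelihoods, min_difference=0.0):
    """Return the tree with the highest log-likelihood for one window.

    log_likelihoods: dict mapping a tree label to its lnL for the window.
    Returns None when the best tree does not beat the runner-up by more
    than min_difference (the window is then treated as unresolved).
    """
    ranked = sorted(log_likelihoods.items(), key=lambda kv: kv[1], reverse=True)
    (best_tree, best_lnl), (_, second_lnl) = ranked[0], ranked[1]
    return best_tree if best_lnl - second_lnl > min_difference else None


if __name__ == "__main__":
    # Non-overlapping 5-kb windows over a hypothetical 12-kb alignment.
    print(list(sliding_windows(12000, 5000, 5000)))
    # Hypothetical log-likelihoods for one window under three constraints.
    window_lnl = {"((h,c),g)": -1000.0, "((h,g),c)": -1003.5, "((c,g),h)": -1004.0}
    print(best_supported_tree(window_lnl, min_difference=2.0))
```

Setting the step equal to the window size gives non-overlapping windows; a smaller step gives overlapping windows at the cost of more likelihood calculations.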

4.2.3 Tree searching

Homo-Pan-Gorilla Group analyses were performed with pf_makefiles options -T 5000 -i 5000 -m best. pf_makefiles commands were interpreted and analyzed by RAxML-HPC versions 7.0.4 and 7.2.1 with tree search option -f d in addition to the GTRGAMMA (GTR + Γ) model of sequence evolution (Stamatakis and Alexandros, 2006). Likelihood values for trees evaluated without tree searching were generated using PAUP* 4.0b10 and a GTR + Γ + I model (Swofford, 2003). Maximum likelihood bootstraps were performed using RAxML's -f d tree search and GTRCAT model (an optimized heuristic GTR + Γ model) with at least 500 replicates.

4.2.4 Simulations

Simulations were generated using seq-gen (Rambaut and Grassly, 1997) with parameters inferred from the primate alignment derived above. Alignments of four species' sequences were simulated based on the alignments described above for human, chimpanzee, gorilla, and orangutan. Partitions of 50 kb were simulated with each of the three rooted trees relating human, chimpanzee, and gorilla. A general time-reversible model with gamma-distributed rates (GTR + Γ) was used with nucleotide frequencies of 30.6% A, 19.2% C, 19.2% G, and 31.0% T; this reflected the composition of the human, chimpanzee, gorilla, and orangutan sequences used above. A reversible substitution matrix with relative rates of 0.998222 A to C, 3.711021 A to G, 0.596728 A to T, 1.157295 C to G, 3.561514 C to T, and 1 G to T was derived from the primate dataset described above. The Γ shape parameter (α) was set to 3.86 based on the maximum likelihood tree generated from the total dataset. Branch lengths were fixed to those estimated from the maximum likelihood tree of the empirical data described above for the three trees (((h,c),g),o), (((h,g),c),o), and (((c,g),h),o), where h, c, g, and o are abbreviations for human, chimpanzee, gorilla, and orangutan, respectively.
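To make the simulation model concrete, the sketch below assembles the instantaneous rate matrix implied by the base frequencies and relative rates quoted above, using the usual GTR construction (off-diagonal entry = relative rate × frequency of the target base, rows summing to zero, scaled to one expected substitution per unit branch length). This only illustrates the parameterization; it is not how seq-gen is implemented, and the function and variable names are hypothetical.

```python
BASES = "ACGT"

# Base frequencies and relative rates as quoted in the text.
FREQS = {"A": 0.306, "C": 0.192, "G": 0.192, "T": 0.310}
RATES = {
    ("A", "C"): 0.998222, ("A", "G"): 3.711021, ("A", "T"): 0.596728,
    ("C", "G"): 1.157295, ("C", "T"): 3.561514, ("G", "T"): 1.0,
}


def gtr_rate_matrix(freqs, rates):
    """Return the normalized GTR rate matrix Q as a nested dict q[i][j]."""
    q = {i: {j: 0.0 for j in BASES} for i in BASES}
    for i in BASES:
        for j in BASES:
            if i != j:
                # Relative rates are symmetric: look up (i, j) or (j, i).
                rate = rates.get((i, j), rates.get((j, i)))
                q[i][j] = rate * freqs[j]
        # Diagonal entries make each row sum to zero.
        q[i][i] = -sum(q[i][j] for j in BASES if j != i)
    # Scale so that the expected rate at equilibrium is 1.
    mean_rate = -sum(freqs[i] * q[i][i] for i in BASES)
    for i in BASES:
        for j in BASES:
            q[i][j] /= mean_rate
    return q


if __name__ == "__main__":
    matrix = gtr_rate_matrix(FREQS, RATES)
    for base in BASES:
        print(base, [round(matrix[base][j], 4) for j in BASES])
```

The two largest relative rates (A to G and C to T) are the transitions, which is why RY-coding, which hides transitions, removes much of the signal between closely related species.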

4.2.5 Distributions

Distributions of characteristics for partitions were computed using EpiGRAPH (http://epigraph.mpi-inf.mpg.de/; Bock et al., 2006, 2007) and/or Galaxy (http://g2.bx.psu.edu; Giardine et al., 2005); graphs and statistics were calculated using R (R Development Core Team, 2008). Mann-Whitney U tests were used to determine significant differences in distributions between partitions in Figures 4.5 and 4.6; no corrections were made for multiple hypothesis testing. GC content in Figures 4.5A and 4.6A is calculated from human sequence. All other attributes are calculated from information downloaded from the March 2006 (hg18) UCSC Genome Browser assembly (http://genome.ucsc.edu). Figures 4.5B and 4.6B use average MultiZ alignment scores from the "multiz17way" table of the UCSC Genome Browser. Figures 4.5C and 4.6C show comparisons with the 17-way phastCons score from the "phastCons17way" table. Exon overlaps in Figures 4.5D and 4.6D are with regard to all exon coordinates listed in the Known Genes track. SNPs in Figures 4.5E, 4.5F, 4.6E, and 4.6F are from the snp126 track of the UCSC Genome Browser.

Figure 4.1: The three possible gene trees that could occur among human, chimpanzee, and gorilla leading to incongruence: ((h,c),g), ((h,g),c), and ((c,g),h). The trees are given as abbreviated Newick-formatted strings, where h represents human, c represents chimpanzee, and g represents gorilla (adapted from Hobolth et al., 2007).

4.3 Results

We mapped and sequenced a ~1.9-Mb region in human encompassing the cystic fibrosis transmembrane conductance regulator (CFTR) gene in most of the commonly accepted orders of primates as well as several outgroups that were chosen for relatedness and sequence coverage. This represents an extension of the dataset used in Prasad et al. (2008) and Thomas et al. (2003). We assembled and quality trimmed the sequences, and aligned them using TBA (Blanchette et al., 2004b) and a guide tree based on Prasad et al. (2008).

Conserved sequences were excised as described in Prasad et al. (2008) and analyzed with maximum likelihood. Figure 4.2 shows a maximum likelihood bootstrap majority rules consensus tree made from an alignment of the conserved sequences and rooted using elephant. Notably, the only bootstrap scores less than 100% are in Platyrrhini, with all other clades at 100% bootstrap support. To better understand the relationships between human, chimpanzee, and gorilla, we used PartFinder and RAxML to find the maximum likelihood tree when each of the three rooted relationships between human, chimpanzee, and gorilla is used as a constraint.

4.3.1 Homo-Pan-Gorilla Group

The relationships among human, chimpanzee, and gorilla are of great interest, both because of our membership in that group and the great deal of study that this relationship has un- dergone (Ruvolo, 1997; Salem et al., 2003; Poux and Douzery, 2004; Hobolth et al., 2007).

Incongruence in the Homo-Pan-Gorilla Group has made molecular phylogenetics of these species controversial; one recent paper even rejects the existence of this well established group entirely based on morphological characters, though its findings are disputed (Grehan and Schwartz, 2009).

[Figure 4.2: phylogenetic tree image. Scale bar: 0.01. The two bootstrap values below 100% shown on the tree (49 and 99) fall within Platyrrhini.]

Figure 4.2: Maximum likelihood bootstrap consensus tree of primates used in the studies of Homo-Pan-Gorilla group and Platyrrhini. Bootstrap proportions are 100% except where noted.

Taking advantage of the unique resource of large contiguous blocks of high-quality genomic sequence we performed PartFinder searches with both 5-kb and 2-kb non-overlapping windows. Table 4.2 shows the number of windows supporting each of the possible relationships among human, chimpanzee, and gorilla. Because of our experience with RY-coding in Prasad et al. (2008), we also performed the analysis with nucleotides coded as purines or pyrimidines (R and Y according to IUPAC nucleotide codes).
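The RY-coding step described above can be illustrated with a minimal Python sketch. The function names (`ry_code`, `count_differences`) and the example sequences are hypothetical and are not taken from the PartFinder code; the only assumption carried over from the text is the IUPAC convention that purines (A, G) are written R and pyrimidines (C, T) are written Y.

```python
# Purines (A, G) become R; pyrimidines (C, T, and U) become Y (IUPAC codes).
RY_TABLE = str.maketrans("AGCTUagctu", "RRYYYRRYYY")


def ry_code(sequence: str) -> str:
    """Recode a nucleotide string as purines (R) and pyrimidines (Y).

    Characters outside the table (gaps, N, other ambiguity codes)
    are passed through unchanged.
    """
    return sequence.translate(RY_TABLE)


def count_differences(a: str, b: str) -> int:
    """Count positions at which two equal-length aligned strings differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


# Illustrative (made-up) aligned fragments.
seq_one = "ACGTTA"
seq_two = "ATGTCA"   # differs from seq_one by two transitions (C/T, T/C)
seq_three = "AAGTTA"  # differs from seq_one by one transversion (C/A)

transitions_before = count_differences(seq_one, seq_two)
transitions_after = count_differences(ry_code(seq_one), ry_code(seq_two))
transversions_after = count_differences(ry_code(seq_one), ry_code(seq_three))
```

In this toy case the two transitions disappear after recoding while the transversion remains, which is why RY-coded windows are resolved by very few characters.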

The commonly accepted sister relationship between human and chimpanzee ((h,c),g) is supported by a majority of the windows (Table 4.2). When window sizes were reduced to 2 kb, the proportion of windows supporting ((h,c),g) was only 52%, and the proportion of windows not clearly supporting one tree was greatly increased. Thus it appears that 2 kb is approaching the lower limit for the length of an alignment to have sufficient signal to accurately differentiate between the three hypotheses. Of the 2-kb windows supporting ((h,c),g), 93% overlap with 5-kb windows also supporting ((h,c),g); so while there may be more noise in the 2-kb dataset, areas with strong support for ((h,c),g) are well represented in both. We also note that RY-coding of the data greatly reduces the number of windows not clearly supporting one tree. While this might appear to be a demonstration of the utility of RY-coding, the primary cause appears to be that most windows have only one or a very small number of transversions clearly lending support to one tree or another. These few differences are reflected in the small differences in log likelihoods in Figure 4.3. Note that the difference in GC content between the hominids is very small (<0.05%).

A graphical representation of PartFinder results is shown for a 275-kb region in Figure 4.3. In this region, results for nucleotide-coded 5-kb windows (Figure 4.3A) show clear support for ((h,c),g) over the majority of the region, with one area supporting ((h,g),c) and three small areas supporting ((c,g),h). The small size and relatively small differences in likelihood values over the regions supporting the ((c,g),h) and ((h,g),c) trees are generally characteristic of this analysis as a whole. We would expect regions supporting ((c,g),h) and ((h,g),c) to be smaller, on average, because they come from ancient haplotype blocks, and some of the blocks supporting ((h,c),g) should be more recent (and thus larger) than those remaining from the last common ancestor of human, chimpanzee, and gorilla. Block sizes of 2 kb produced results that appeared quite noisy, and many blocks do not have enough informative substitutions to discriminate among the possible trees.

[Figure 4.3, panels A to D. X-axis: Position (kb), 450 to 700. Y-axis: Difference in −ln(L) from best tree. Series: Homo-Pan, Homo-Gorilla, Pan-Gorilla.]

Figure 4.3: Representative PartFinder results for the Homo-Pan-Gorilla group showing the spatial variation in likelihood support over the same 275-kb region for the three possible groupings of human, chimpanzee, and gorilla. A and B are PartFinder results with 5-kb windows while C and D are results with 2-kb windows. A and C are from nucleotide-coded alignments while B and D are results from RY-coded alignments. Tree searches were performed using RAxML's -f d option; the relationships within the non-primates and within the Homo-Pan-Gorilla group were fixed.

                    5-kb windows            2-kb windows
Tree supported      ACGT (%)    RY (%)      ACGT (%)    RY (%)
(h,(c,g))              9          17           10         20
((h,c),g)             61          63           52         51
((h,g),c)             10          17            9         17
Uncertain             20           4           29         12

Table 4.2: Proportions of regions supporting each possible arrangement within the Homo-Pan-Gorilla group
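The kind of tally reported in Table 4.2 can be sketched as follows. This is an illustration only, not the PartFinder implementation: the function names, the example −ln(L) values, and in particular the `min_difference` threshold used to call a window "Uncertain" are assumptions made for the example, since the criterion actually used is not restated in this section.

```python
from collections import Counter


def classify_window(neg_log_likelihoods, min_difference=2.0):
    """Return the best-supported topology for one window, or 'Uncertain'.

    neg_log_likelihoods maps a topology label to its -ln(L) for the window
    (smaller is better). min_difference is a hypothetical threshold: if the
    runner-up is closer than this to the best tree, the window is not called.
    """
    ranked = sorted(neg_log_likelihoods.items(), key=lambda item: item[1])
    best_label, best_score = ranked[0]
    runner_up_score = ranked[1][1]
    if runner_up_score - best_score < min_difference:
        return "Uncertain"
    return best_label


def tally_support(windows, min_difference=2.0):
    """Percentage of windows assigned to each topology (or 'Uncertain')."""
    counts = Counter(classify_window(w, min_difference) for w in windows)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Made-up -ln(L) values for four windows and three constrained topologies.
example_windows = [
    {"((h,c),g)": 1000.0, "((h,g),c)": 1012.5, "(h,(c,g))": 1011.0},
    {"((h,c),g)": 980.0, "((h,g),c)": 980.4, "(h,(c,g))": 985.0},
    {"((h,c),g)": 1005.0, "((h,g),c)": 999.0, "(h,(c,g))": 1004.0},
    {"((h,c),g)": 950.0, "((h,g),c)": 960.0, "(h,(c,g))": 961.0},
]
example_tally = tally_support(example_windows)
```

Raising `min_difference` moves windows into the "Uncertain" row, which is one way to see why shorter windows, with fewer informative substitutions, yield more uncalled regions.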

Simulations

To determine whether the results of PartFinder for the Homo-Pan-Gorilla Group (e.g., Figure 4.3) were reasonable and likely to identify cases of incomplete lineage sorting, we performed simulations based on the parameters from the alignment of the various possible arrangements of human, chimpanzee, and gorilla. Figure 4.4 shows PartFinder results from one of the simulations. For this simulation, an alignment made up of 150 kb of sequence from four species was simulated based on a GTR + Γ model of sequence evolution with the parameter settings from the human, chimpanzee, gorilla, and orangutan data used here.

[Figure 4.4, panels A and B. X-axis: Position (kb), 0 to 150. Y-axis: Difference in ln(L) from best tree. Series: Homo-Pan, Pan-Gorilla, Homo-Gorilla.]

Figure 4.4: PartFinder results with simulated data from an alignment with multiple trees. A is from PartFinder analysis with 5-kb windows sliding 5 kb, while B is the same dataset, but analyzed with 2-kb windows sliding 2 kb.

One hundred simulations of three 50-kb partitions with the evolutionary histories of three trees, ((h,c),g), ((h,g),c), and ((c,g),h), were run and analyzed with PartFinder using 5-kb and 2-kb windows. We see that the results are broadly similar to those generated from real data in spite of our simulations not modeling any indel mutations. The error rate as measured by incorrect inference of the gene tree used for the simulation was 1% with 5-kb windows and 5% with 2-kb windows. Likelihood ratio peaks were unexpectedly smaller than the likelihood ratio peaks seen in analyses of real data (e.g., Figure 4.3). It is also notable that analysis of the simulated dataset with 2-kb windows (Figure 4.4B) showed considerably more noise than analysis with 5-kb windows. These results provide some confidence that at least under ideal circumstances (i.e., the alignment and evolutionary model are correct), PartFinder is able to identify partitions with different evolutionary histories when divergences are similar to those between human, chimpanzee, and gorilla.
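The error rate quoted above can be computed along the following lines. This is a hedged sketch rather than the script used for the simulations: the function names are hypothetical, the order of the three trees along the simulated alignment is assumed to follow the order in which they are listed, and the handling of windows with no clear call is not modeled.

```python
def true_tree_per_window(partition_trees, partition_kb, window_kb):
    """Generating topology for each non-overlapping window along the alignment.

    Assumes window_kb divides partition_kb, so that no window straddles a
    partition boundary (true for 5-kb and 2-kb windows on 50-kb partitions).
    """
    if partition_kb % window_kb != 0:
        raise ValueError("window size must divide partition size")
    windows_per_partition = partition_kb // window_kb
    return [tree for tree in partition_trees for _ in range(windows_per_partition)]


def error_rate(inferred, truth):
    """Fraction of windows whose inferred topology differs from the simulated one."""
    if len(inferred) != len(truth):
        raise ValueError("inferred and true calls must have the same length")
    wrong = sum(1 for call, expected in zip(inferred, truth) if call != expected)
    return wrong / len(truth)


# Three 50-kb partitions, as in the simulations; the order is an assumption.
trees = ["((h,c),g)", "((h,g),c)", "((c,g),h)"]
truth_5kb = true_tree_per_window(trees, partition_kb=50, window_kb=5)
truth_2kb = true_tree_per_window(trees, partition_kb=50, window_kb=2)

# Made-up inference result with a single miscalled 5-kb window.
inferred_5kb = list(truth_5kb)
inferred_5kb[3] = "((h,g),c)"
example_rate = error_rate(inferred_5kb, truth_5kb)
```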

Correlating genome annotations

One explanation for the variation in support for alternative relationships among human, chimpanzee, and gorilla is that some characteristic of the sequences or alignment is affecting the performance of the maximum likelihood phylogenetic inference method. Phylogenetic inference methods are known to suffer from a number of possible biases because of the simplifying assumptions that they make about the evolutionary process (Felsenstein, 2003; Ezpeleta et al., 2007; Waegele and Mayer, 2007), though the very close relationship among the hominids should minimize most of these issues. To explore the possibility that intrinsic genome features could be affecting the performance of the phylogenetic likelihood calculations, we compared distributions of values of some genome annotations between partitions supporting ((h,c),g) and those supporting ((h,g),c) or ((c,g),h) (Figures 4.5 and 4.6).

GC content did not seem to differ significantly between the ((h,c),g)-supporting partitions and those supporting others, consistent with our findings with RY-coded data (Figures 4.5A and 4.6A).

Alignment error (i.e., incorrect assumptions of orthology) can also mislead phylogenetic inference. The alignment guide tree was set to the accepted hominid tree ((h,c),g), so we might find that regions with alignment errors would be biased towards support for ((h,c),g). Although they are not independent of the alignment process, alignment scores can be used as a measure of alignment uncertainty, and they did not show any relationship with the supported tree (Figures 4.5B and 4.6B). This suggests that the quality of the alignments is not having a strong effect on the inferences, and is to be expected because the low rates of divergence between human, chimpanzee, gorilla, and orangutan mean alignments should be close to true.

[Figure 4.5, box plots, panels A to F. Each panel compares HC and other partitions for ACGT- and RY-coded data. A: GC content. B: Average 17-way MultiZ alignment score. C: Average 17-way phastCons score. D: Exon overlap (bp). E: dbSNP SNPs. F: dbSNP variants of any type (including indels).]

Figure 4.5: Box plot of genome characteristics in 5-kb partitions supporting alternative relationships in the Homo-Pan-Gorilla group. Boxes in red reflect partitions that support the monophyly of human + chimpanzee, while boxes in blue support the monophyly of human + gorilla and/or chimpanzee + gorilla. Asterisks (*) indicate pairs that have statistically significant differences (Mann-Whitney U p<0.05). All data are from the Human Mar. 2006 (hg18) build of the UCSC Genome Browser (see Section 4.2.5 for details). The Y-axis scale in D is truncated to show differences between exon overlaps (77% of 5-kb regions do not overlap exons).

[Figure 4.6, box plots, panels A to F. Each panel compares HC and other partitions for ACGT- and RY-coded data. A: GC content. B: Average 17-way MultiZ alignment score. C: Average 17-way phastCons score. D: Exon overlap (bp). E: dbSNP SNPs. F: dbSNP variants of any type (including indels).]

Figure 4.6: Box plot of genome characteristics in 2-kb partitions supporting alternative relationships in the Homo-Pan-Gorilla group. Boxes in red reflect partitions that support the monophyly of human + chimpanzee, while boxes in blue support the monophyly of human + gorilla and/or chimpanzee + gorilla. Asterisks (*) indicate pairs that have statistically significant differences (Mann-Whitney U p<0.05). All data are from the Human Mar. 2006 (hg18) build of the UCSC Genome Browser (see Section 4.2.5 for details). The Y-axis scale is truncated in D, E, and F for clarity, so some values occur outside the range shown.

Negative (or purifying) selection reduces the coalescent time and reduces regional variation. We might then hypothesize that selected regions would show less signal of alternative relationships among species. Interestingly, the reverse appears to be true for the RY-coded data only (Figures 4.5C and 4.6C). We do not see any relationship in nucleotide-coded data, but with RY-coded data the 5-kb and 2-kb partitions show slightly higher conservation scores for regions supporting ((h,g),c) or ((c,g),h) (5-kb p=0.036, 2-kb p=0.037). The significance of this finding is unclear since very few changes define the trees when the alignment is RY-coded. Replication of these results in other regions would be instructive.

We also examined overlap with exons. While the vast majority of regions do not overlap exons (exons make up ∼2% of the sequence), both nucleotide- and RY-coded partitions supporting ((h,c),g) overlapped more with coding sequences than those supporting other trees (5-kb ACGT p=0.0092, RY p=0.049; 2-kb ACGT p=0.038, RY p=0.056; Figures 4.5D and 4.6D). The reason for this is unclear, and it will be interesting to see if we are able to replicate this result with other regions. SNPs in dbSNP are also more highly represented in 5-kb ((h,c),g)-supporting regions (p=0.046), though this could very well be a consequence of ascertainment bias due to the higher exon content of ((h,c),g)-supporting regions and higher coding SNP representation in dbSNP (Figure 4.5 and Figure 4.6). An unbiased source of SNP content would be preferable for such a comparison.
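The comparisons above rely on the Mann-Whitney U test. The software used to compute the reported p-values is not named in this section, so the following is only a minimal standard-library sketch of the test, using the normal approximation with a tie correction and no continuity correction. P-values from an exact test or a different correction will differ somewhat, especially for small samples. The function name and example values are hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic for x, approximate two-sided p-value).
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = sorted([(value, 0) for value in x] + [(value, 1) for value in y])
    n = n1 + n2

    # Assign midranks to tied values and accumulate the tie-correction term.
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # mean of 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        t = j - i
        tie_term += t ** 3 - t
        i = j

    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0  # all values tied: no evidence of a difference
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_value


# Made-up exon-overlap values (bp) for two sets of partitions.
hc_overlap = [0, 0, 120, 340, 0, 510, 95]
other_overlap = [0, 0, 0, 60, 0, 0]
u_stat, p_approx = mann_whitney_u(hc_overlap, other_overlap)
```

Because most partitions have zero exon overlap, ties are common in data of this kind, which is why the tie correction is included in the sketch.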

Comparison to Dutheil et al. (2009)

Dutheil et al. (2009) recently published a study using a Bayesian coalescent hidden Markov model (CoalHMM) to infer evolutionary history parameters among human, chimpanzee, and gorilla using orangutan as an outgroup. They used a simplified coalescent model, three Markov transition probabilities (for four genealogies), and a Markov chain Monte Carlo algorithm (MCMC) to infer historical population sizes, divergence times, and phylogenetic trees for each base from extant sequence. For each base, the CoalHMM provides a Bayesian posterior probability for each tree relating human, chimpanzee, and gorilla.

While their method is new and at the time of this writing lacked an easily available implementation (which precluded us from examining the behavior of the method in detail), the datasets and results reported in Dutheil et al. (2009) were provided to us by Dr. Julien Dutheil. Sequences were derived from the NISC Comparative Sequencing Program and the resulting alignments were edited as described in Hobolth et al. (2007) to remove all indels.

Using the provided sequence alignments of the CFTR region, 83% of the bases in regions supporting ((h,c),g) by the CoalHMM were also in regions supporting ((h,c),g) determined by PartFinder with 5-kb windows. This number declined to 75% when 2-kb windows were considered, consistent with there being greater stochastic variability in support over 2-kb windows. These results show the overall similarity of PartFinder and CoalHMM analyses, in spite of the lower resolution of PartFinder.
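A base-level concordance of the kind reported above can be computed from two sets of intervals. The sketch below is an assumption about how such a comparison could be set up, not the procedure actually used: intervals are taken to be half-open (start, end) coordinates on a shared alignment, and the function names and example coordinates are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def fraction_covered(query, reference):
    """Fraction of bases in the query intervals that also lie in the reference."""
    q = merge_intervals(query)
    r = merge_intervals(reference)
    total = sum(end - start for start, end in q)
    if total == 0:
        return 0.0
    shared = 0
    i = j = 0
    while i < len(q) and j < len(r):
        low = max(q[i][0], r[j][0])
        high = min(q[i][1], r[j][1])
        if low < high:
            shared += high - low
        # Advance whichever interval ends first.
        if q[i][1] < r[j][1]:
            i += 1
        else:
            j += 1
    return shared / total


# Made-up regions supporting ((h,c),g) under two methods.
per_base_regions = [(0, 6000), (9000, 10000)]       # e.g., per-base calls
window_regions = [(0, 5000), (10000, 15000)]        # e.g., 5-kb window calls
concordance = fraction_covered(per_base_regions, window_regions)
```

Note that the measure is asymmetric: swapping the two arguments asks what fraction of the window-based regions is covered by the per-base calls instead.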

4.4 Discussion

The PartFinder sliding-window phylogenetic analysis method identifies regions that support alternative topologies. It has advantages of speed, parallelization, and flexibility over existing methods and has identified regions of incongruence in the Homo-Pan-Gorilla Group largely similar to those shown by other methods. This group is relatively recently diverged, and incongruence has led to controversy. The resolution of PartFinder in the Homo-Pan-Gorilla Group is limited by the very few differences between the great apes. We believe that groups of species at slightly greater distances may provide higher resolution maps of incongruence, as blocks with evolutionary histories different from the majority of the genome may be smaller than 5 kb.

Additionally, some interesting correlations among regions supporting the species tree and those supporting alternative trees are evident. These correlations are not strong enough to eliminate the role of chance, and we hope to examine the relationships seen in analyses of additional genomic regions. It would also be very interesting to look for any correspondence between areas of high linkage and tree support; unfortunately the data available for these studies is not large enough to consider blocks of linkage disequilibrium.

Comparisons of PartFinder results with those of Dutheil et al. (2009) suggest a broad similarity in findings. Though the CoalHMM approach has the advantage of higher resolution (i.e., no window sizes) and a basis in coalescent theory, it is significantly more complex and computationally intensive. It is also unclear how any violations of the assumptions made by the model used in CoalHMM would affect the final results. The PartFinder method is computationally efficient, easily parallelized, allows independent rates, provides comparisons of small regions, and can be applied when more than three possible trees are being considered. PartFinder also allows the use of a full range of outgroups to aid in estimating evolutionary rates. The relatively low computational demands of PartFinder allow examination of much larger sets of uncertain taxa and larger levels of taxonomic divergence.

PartFinder analyses can be used both for examining phylogenetic relationships and for examining the nature of incongruence. It will be helpful to assess the reasonableness of incomplete lineage sorting as an explanation using population genetics simulations for reasonable assumptions of historical population parameters and speciation times. The PartFinder method will also allow us to study complex relationships using genomic sequence datasets such as those available for Laurasiatheria and the New World monkeys.

Chapter 5

Summary and future directions

During the course of this research, we explored some of the methods that can take advantage of comparative genomic datasets and the information they can provide about genome evolution, applying an evolutionary approach to the analysis of genome sequence data from many mammals. We demonstrated the utility of genomic sequence datasets for answering questions from an evolutionary perspective.

One method of evolutionary analysis we used was to search for highly conserved sequences that should have functional signiﬁcance. ExactPlus was developed with the aim of identifying highly conserved sequences that would be candidates for functional screens.

We also explored the utility of using large comparative genomic sequence datasets to resolve phylogenetic relationships between mammals. We used this approach to help resolve several of the ongoing controversies in mammalian phylogenetics. Finding a surprising degree of incongruence in this dataset, we explored the spatial structure of tree support using a likelihood framework.

5.1 ExactPlus and model based methods

ExactPlus is the method we developed for identifying highly conserved regions based on searching sliding windows for identical matches. We further developed a parallel processing web-accessible implementation of ExactPlus, and compared the results of ExactPlus analysis to those of three other methods for conservation detection: BinCons (Margulies et al., 2003), phastCons (Siepel et al., 2005), and GERP (Cooper et al., 2005).

Though ExactPlus does not rely on an explicit phylogenetic model, the results of ExactPlus analysis are remarkably similar to the results of methods relying on complex phylogenetic models such as phastCons (Siepel et al., 2005) and GERP (Cooper et al., 2005). This is especially true for small, highly conserved regions where ExactPlus is able to take advantage of the additional information provided by indel mutation to identify very short, highly conserved regions. We further applied ExactPlus to two projects whose goal was the identification of conserved cis-regulatory sequences (Portnoy et al., 2005, Appendix 5.3.2; Antonellis et al., 2006, Appendix 5.3.2). In those studies, ExactPlus was used to identify candidates for functional screening, some of which showed tissue-specific enhancer activity similar to that of the adjacent genes.

With increasingly large amounts of genomic sequence data becoming available, methods such as phastCons and GERP may be able to approach the resolution of ExactPlus for very short conserved sequences; however, for studies in taxa where sequencing of many species is not complete or where manual tuning and examination of the results is necessary, ExactPlus may have some benefits. Additionally, we note that the relative success of ExactPlus when compared to the other methods demonstrates the significance of the signal added by information from indel mutations. Indels are considered as equivalent to mismatches by ExactPlus, but considered as missing data by the phylogenetically aware methods. If even this very simple treatment of indel information can be informative, then these results provide additional support to the goal of developing a method for coding indel differences that can be integrated into the phylogenetically aware algorithms. One caveat is that next-generation DNA sequencing technologies often have much higher rates of indel errors, at least compared to the more traditional Sanger sequencing methods used to generate the datasets examined here. Thus, the advantages gained by ExactPlus considering indels may be less salient when applied to data generated with the new sequencing technologies.
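The idea of treating indels as mismatches when searching for identical matches can be illustrated with a small sketch. This is not the ExactPlus algorithm or its parameterization; it only finds runs of alignment columns that are identical and gap-free in every sequence, with a hypothetical `min_len` cutoff and made-up sequences.

```python
def identical_runs(alignment, min_len=6):
    """Find runs of columns identical in all rows, treating gaps as mismatches.

    alignment is a list of equal-length aligned strings ('-' marks a gap).
    Returns half-open (start, end) column ranges of length >= min_len.
    """
    if not alignment:
        return []
    n_columns = len(alignment[0])
    if any(len(row) != n_columns for row in alignment):
        raise ValueError("all aligned rows must have the same length")

    runs = []
    run_start = None
    # Iterate one column past the end so that a trailing run is flushed.
    for col in range(n_columns + 1):
        conserved = False
        if col < n_columns:
            column = {row[col].upper() for row in alignment}
            conserved = len(column) == 1 and "-" not in column
        if conserved:
            if run_start is None:
                run_start = col
        else:
            if run_start is not None and col - run_start >= min_len:
                runs.append((run_start, col))
            run_start = None
    return runs


# Made-up three-species alignment; the gap in row two breaks the run.
example_alignment = [
    "ACGTACGTAA",
    "ACGTAC-TAA",
    "ACGTACGTAA",
]
long_runs = identical_runs(example_alignment, min_len=4)
short_runs = identical_runs(example_alignment, min_len=3)
```

A method that scored the gapped column as missing data would instead see one uninterrupted conserved stretch here, which is the contrast drawn in the text.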

5.2 Phylogeny of mammals and other approaches

To explore the use of comparative genomic sequence data to infer phylogenetic relationships and help resolve ongoing controversies in the relationships among mammals, we examined the sequence of a 2-Mb region of human chromosome 7 containing the CFTR gene and orthologous sequences in 40 other mammals (Prasad et al., 2008; Appendix 5.3.2).

We found that less conserved regions were susceptible to alignment guide tree bias. Weak signals from smaller, conserved, easier to align regions could be swamped by more difficult to align non-conserved regions. Although we only saw this effect in groups with weaker phylogenetic signal, when less conserved sequence is included, support values for even those branches are extremely high (e.g., 100% bootstrap scores) and dependent on the alignment guide tree.

We also found that differences in base composition could affect the phylogenetic inference process, even with the conserved sequences. Guanine and cytosine (GC) content varied by more than 23% among species. The reasons for this variation in base composition among species are not completely clear, but these variations confound phylogenetic inference algorithms. We found that coding the bases in the alignment as purines and pyrimidines alleviated the bias introduced by the variation in base content, as well as reducing branch length variations that can also cause problems for phylogenetic inference. Much phylogenetic information is discarded in this process, but with large amounts of data to work from, it is still possible to get strong support for most branches.

Another possible source of error in phylogenetic inference is missing data. Because of the genomic nature of the dataset, a considerable amount of missing data was included in the matrix examined in these studies, both because of gaps in sequence coverage and because of rampant indel mutation in non-coding sequence. However, we did not see a major effect of missing data on this analysis. This may be a result of having sufﬁcient data to resolve most branches in spite of some missing data.

We were able to generate an almost completely resolved tree with these data and provide strong support for several controversial clades (including Primata, Glires, Euarchontoglires, Theria, and Atlantogenata), thus helping to resolve a number of the remaining controversies in the mammalian phylogeny. The notable exception to this strong resolution was the relationship between orders within Laurasiatheria. We found that even with the large amounts of data considered here, it was difficult to resolve those relationships. Aside from strongly placing Eulipotyphla at the root of Laurasiatheria, the interordinal relationships within Laurasiatheria had surprisingly weak support. Most notably, the placement of Perissodactyla had very little support and tended to fluctuate between joining with Carnivora and Cetartiodactyla.

To explore the spatial variation in support over the region, we also split the coding plus conserved non-coding alignment up into 10 equal-sized pieces and analyzed them independently. Each piece had quantities of data similar to previous large-scale combined gene datasets (e.g., Murphy et al., 2001a; Madsen et al., 2001). The surprising results of this experiment (Figure 3.6) were different trees for different segments of the alignment, with some branches very strongly supported in alternative positions depending on the partition. Most of the root taxa remained the same, but branches closer to the edges of the tree varied significantly. We went on to explore this phenomenon further in Chapter 4.
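Splitting an alignment into equal-sized pieces amounts to choosing column boundaries. The following sketch shows one way to do this; the function name is hypothetical, and the handling of a column count that is not a multiple of the number of pieces (spreading the remainder over the first pieces) is an assumption, since the text does not say how this was done.

```python
def split_columns(n_columns, n_pieces=10):
    """Half-open (start, end) column ranges for n_pieces near-equal pieces.

    Any remainder is distributed one column at a time over the first pieces,
    so piece sizes differ by at most one column.
    """
    if n_pieces <= 0 or n_columns < n_pieces:
        raise ValueError("need at least one column per piece")
    base, extra = divmod(n_columns, n_pieces)
    bounds = []
    start = 0
    for index in range(n_pieces):
        end = start + base + (1 if index < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def slice_alignment(alignment, bounds):
    """Cut every aligned row at the given column boundaries."""
    return [[row[start:end] for row in alignment] for start, end in bounds]


# Made-up two-row alignment split into three pieces.
toy_alignment = ["ACGTACGTAC", "ACGTTCGTAC"]
toy_bounds = split_columns(len(toy_alignment[0]), n_pieces=3)
toy_pieces = slice_alignment(toy_alignment, toy_bounds)
```

Each resulting piece can then be analyzed independently, as was done for the ten segments described above.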

These studies demonstrate that phylogenetic inference with large datasets allows us to approach the more difficult phylogenetic questions with greater levels of confidence. The reduction in the effects of random noise provided by large datasets comes with an increasing need to control for possible sources of bias. This is especially true because the size of these datasets precludes careful manual inspection and editing of all alignments, and increases the chance that non-phylogenetic signals (such as alignment error, rate variation, or base composition biases) can affect the results. More complete (and complex) evolutionary models could compensate for some of the effects of these non-phylogenetic signals because large datasets also reduce possible problems of overparameterization; however, the increased computational load due to both the size of the datasets and increasingly complex evolutionary models may negate some of the benefits of increasingly complex evolutionary models.

Approaches using rare changes in genome sequence, such as indels and transpositions, may provide less ambiguous signals of phylogenetic history. In fact, the datasets we developed and analyzed here have been used by several groups to find rare genomic characters, such as small inversions (Chaisson et al., 2006) and retrotransposons (Osterholz et al., 2009; Bashir et al., 2005). Because the numbers of informative characters of these types are generally small, and their behavior is not always well understood, caution must be taken in interpreting these studies as well.

Additionally, some uncertainty may be a result of actual genetic ambiguity (i.e., the presence of nuclear material with different evolutionary histories). Because large datasets allow us to search for signals on very short branches, instances of incomplete lineage sorting may be increasingly evident, and confound studies of species level relationships. Approaches that take into account differences in gene tree and species tree evolution can help inform us not only about species relationships, but also about ancestral population parameters (e.g., Dutheil et al., 2009).

5.2.1 Future directions

This work, along with other recent studies (e.g., Nikolaev et al., 2007; Osterholz et al., 2009; Churakov et al., 2009), has shown both the promise and pitfalls of phylogenetic inference using large comparative genomic datasets. We have continued to sequence genome segments in additional species, and several questions about the relationships among the mammals can also be addressed using expanded versions of the dataset studied here. The placement of the tree shrew, the relationship between the orders within Platyrrhini, and the interordinal relationships within Laurasiatheria all remain essentially unresolved, and can be approached using these techniques and others. We additionally would like to develop an automated method of identifying high confidence indel mutations with conserved flanking regions to score them as an additional set of characters with what should be lower levels of homoplasy.

5.3 Landscapes of incongruence and PartFinder

In order to better understand the incongruence seen among genomic regions in Chapter 3, we developed a likelihood-based sliding window method (PartFinder) to help identify the boundaries between regions supporting different trees. Relative to studies of the strength and significance of incongruence, the spatial component of incongruence has received remarkably little attention. PartFinder uses likelihood ratio tests over sliding windows to assess relative support for a set of possible trees over a region. Because of the computational efficiency, parallelizability, and flexibility of PartFinder, it can be used in a number of circumstances where other methods cannot be applied.

We used PartFinder to explore the spatial component of incongruence in the Homo-Pan-Gorilla Group. The relationships within the hominids have been the focus of considerable study, and this allowed us to explore a group of closely related species with a thoroughly explored species phylogeny and some evidence of incomplete lineage sorting. Additionally, the boundaries of incongruence within the CFTR region have been the subject of study using a coalescent model-based Bayesian approach that can be used for comparison (Dutheil et al., 2009). Their results were broadly comparable to the results of PartFinder analysis, with a large degree of overlap. We also performed simulations using parameters inferred from real data, and found that PartFinder should be able to determine boundaries of incongruence assuming our parameters and models sufficiently recapitulate the evolutionary process.

Incomplete lineage sorting is not the only explanation for incongruence. Incongruence is also the possible result of post-speciation introgression or a failure of phylogenetic inference methods. We examined the relationship between the regions not supporting the accepted species tree and several genome characteristics that could affect the phylogenetic inference process. We found some interesting, but fairly weakly supported, correlations between support for the accepted species tree ((human, chimpanzee), gorilla) and coding sequence content (Figure 4.5). We also found correlations with higher dbSNP content of regions supporting the species tree; however, dbSNP is more likely to contain SNPs in coding regions. A less biased set of SNPs and replication with other independent loci would provide a better comparison.

5.3.1 Rare genomic character phylogenetics and congruence

Because of the reduced frequency of rare genomic characters, such as transposon insertions and coding indels, such differences have been used to great effect for phylogenetic inference in many groups that have been difficult to resolve (e.g., Churakov et al., 2009; Kaiser et al., 2007; Murphy et al., 2007; Kriegs et al., 2006). Rare genomic characters like these have become popular because they are not thought to suffer from many of the biases and saturation problems that can affect nucleotide- or amino acid-based phylogenetic inferences. Thanks to the development of high-throughput DNA sequencing technologies, it is now possible to identify the few characters that may be informative. If the previously seen incongruence is indicative of true evolutionary history, then it is possible that using small numbers of rare genomic characters could, by way of sampling error, lead to the inference of incorrect species trees.

Recently, Churakov et al. (2009) investigated the relationship between the three major groups of placental mammals using data from whole-genome sequencing projects. They used retroposon insertions, finding 7-9 retroposon insertions to support each of the three possible relationships. Because of the disagreement between the individual insertions, they concluded that the speciation events were nearly simultaneous and that the root of the placental mammals is really a polytomy. By examining the spatial variation in support among numerous genomic sequence characters, it is possible to further investigate the likelihood of this simultaneous speciation hypothesis. It is notable that Nishihara et al. (2007) found support for different trees depending on how they partitioned their models across a whole-genome dataset. They concluded that partitioned models are necessary to correctly infer the tree. An alternative explanation for their results, if the root of the placental mammals is a true polytomy, is that slight variations in substitution rates or base composition lead to differences in results between partitioned and unpartitioned models.
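
One simple way to ask whether character counts are compatible with a polytomy is a goodness-of-fit test against equal support for the three possible trees. The sketch below is only an illustration: the function name is hypothetical, and the counts (9, 8, 7) are made-up values chosen to fall within the 7-9 range mentioned above, not the published tallies. It uses the Pearson chi-square statistic, whose upper tail probability with two degrees of freedom is exactly exp(-x/2).

```python
import math


def polytomy_gof(counts):
    """Pearson chi-square test of equal support for three alternative trees.

    counts: three non-negative integers, the number of characters
            supporting each possible tree (hypothetical input).
    Returns (statistic, p_value). With three categories there are two
    degrees of freedom, so the chi-square tail probability is exp(-x/2).
    """
    if len(counts) != 3:
        raise ValueError("expected counts for exactly three trees")
    total = sum(counts)
    if total == 0:
        raise ValueError("at least one character is required")
    expected = total / 3.0
    stat = sum((c - expected) ** 2 / expected for c in counts)
    p_value = math.exp(-stat / 2.0)
    return stat, p_value


if __name__ == "__main__":
    stat, p = polytomy_gof((9, 8, 7))  # illustrative counts only
    print(f"chi-square = {stat:.3f}, p = {p:.4f}")  # chi-square = 0.250, p = 0.8825
```

Two caveats apply. With so few characters the chi-square approximation is rough, and an exact multinomial test would be preferable. More importantly, failing to reject equal support does not demonstrate a polytomy; it shows only that the sample cannot distinguish one from a short internal branch.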

5.3.2 Future directions

Although the 2-Mb CFTR-containing region we sequenced most deeply and thoroughly is unique, there are also a few other targets that are sequenced at least in human, chimpanzee, gorilla, and orangutan that would be suitable for replication studies. There are at least seven other targets available containing BAC sequences for chimpanzee, gorilla, and orangutan. A few of these are pericentromeric regions, and may not be representative of the genome as a whole, but taken together they could indicate whether the patterns we see in the CFTR region are likely to be supported across the genome. Because coverage is not complete and sequencing is not as deep in these targets, greater care in assembly, alignment, and trimming of the sequences is needed, but they should provide a set of regions that permit replication studies.

We developed PartFinder in part to explore clades with highly controversial relationships. With this in mind, we are planning analyses of the clades Platyrrhini and Laurasiatheria, where we see a good deal of incongruence and which have remained controversial in the literature (Osterholz et al., 2009; Wildman et al., 2009; Steiper and Ruvolo, 2003; Hou et al., 2009; Nishihara et al., 2006). Using PartFinder, we hope to apply new methods of mixed estimation of gene trees and species trees to inferring the relationships among these species.

These studies demonstrate some of the promise that genomic sequencing from many species holds, both to answer targeted questions about specific sequences and larger scale questions about the evolution of species.

Bibliography

Adams, M. D., Celniker, S. E., Holt, R. A., Evans, C. A., Gocayne, J. D., Amanatides, P. G., Scherer, S. E., Li, P. W., Hoskins, R. A., Galle, R. F., George, R. A., Lewis, S. E., Richards, S., Ashburner, M., Henderson, S. N., Sutton, G. G., Wortman, J. R., Yandell, M. D., Zhang, Q., Chen, L. X., Brandon, R. C., Rogers, Y. H., Blazej, R. G., Champe, M., Pfeiffer, B. D., Wan, K. H., Doyle, C., Baxter, E. G., Helt, G., Nelson, C. R., Gabor, G. L., Abril, J. F., Agbayani, A., An, H. J., Andrews-Pfannkoch, C., Baldwin, D., Ballew, R. M., Basu, A., Baxendale, J., Bayraktaroglu, L., Beasley, E. M., Beeson, K. Y., Benos, P. V., Berman, B. P., Bhandari, D., Bolshakov, S., Borkova, D., Botchan, M. R., Bouck, J., Brokstein, P., Brottier, P., Burtis, K. C., Busam, D. A., Butler, H., Cadieu, E., Center, A., Chandra, I., Cherry, J. M., Cawley, S., Dahlke, C., Davenport, L. B., Davies, P., de Pablos, B., Delcher, A., Deng, Z., Mays, A. D., Dew, I., Dietz, S. M., Dodson, K., Doup, L. E., Downes, M., Dugan-Rocha, S., Dunkov, B. C., Dunn, P., Durbin, K. J., Evangelista, C. C., Ferraz, C., Ferriera, S., Fleischmann, W., Fosler, C., Gabrielian, A. E., Garg, N. S., Gelbart, W. M., Glasser, K., Glodek, A., Gong, F., Gorrell, J. H., Gu, Z., Guan, P., Harris, M., Harris, N. L., Harvey, D., Heiman, T. J., Hernandez, J. R., Houck, J., Hostin, D., Houston, K. A., Howland, T. J., Wei, M. H., Ibegwam, C., Jalali, M., Kalush, F., Karpen, G. H., Ke, Z., Kennison, J. A., Ketchum, K. A., Kimmel, B. E., Kodira, C. D., Kraft, C., Kravitz, S., Kulp, D., Lai, Z., Lasko, P., Lei, Y., Levitsky, A. A., Li, J., Li, Z., Liang, Y., Lin, X., Liu, X., Mattei, B., McIntosh, T. C., McLeod, M. P., McPherson, D., Merkulov, G., Milshina, N. V., Mobarry, C., Morris, J., Moshreﬁ, A., Mount, S. M., Moy, M., Murphy, B., Murphy, L., Muzny, D. M., Nelson, D. L., Nelson, D. R., Nelson, K. A., Nixon, K., Nusskern, D. R., Pacleb, J. M., Palazzolo, M., Pittman, G. S., Pan, S., Pollard, J., Puri, V., Reese, M. 
G., Reinert, K., Remington, K., Saunders, R. D., Scheeler, F., Shen, H., Shue, B. C., Sidén-Kiamos, I., Simpson, M., Skupski, M. P., Smith, T., Spier, E., Spradling, A. C., Stapleton, M., Strong, R., Sun, E., Svirskas, R., Tector, C., Turner, R., Venter, E., Wang, A. H., Wang, X., Wang, Z. Y., Wassarman, D. A., Weinstock, G. M., Weissenbach, J., Williams, S. M., Woodage, T., Worley, K. C., Wu, D., Yang, S., Yao, Q. A., Ye, J., Yeh, R. F., Zaveri, J. S., Zhan, M., Zhang, G., Zhao, Q., Zheng, L., Zheng, X. H., Zhong, F. N., Zhong, W., Zhou, X., Zhu, S., Zhu, X., Smith, H. O., Gibbs, R. A., Myers, E. W., Rubin, G. M., and Venter, J. C. (2000). The genome sequence of Drosophila melanogaster. Science, 287(5461):2185–2195.

Ahituv, N., Prabhakar, S., Poulin, F., Rubin, E. M., and Couronne, O. (2005). Mapping cis-regulatory domains in the human genome using multi-species conservation of synteny. Human Molecular Genetics, 14(20):3057–3063.

Alberts, B., Johnson, A., Lewis, J., Raff, M., Roberts, K., and Walter, P. (2002). Molecular Biology of the Cell, Fourth Edition. Garland.

Altekar, G., Dwarkadas, S., Huelsenbeck, J. P., and Ronquist, F. (2004). Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics, 20(3):407–415.

Andolfatto, P. (2005). Adaptive evolution of non-coding DNA in Drosophila. Nature, 437(7062):1149–1152.

Ane, C., Larget, B., Baum, D. A., Smith, S. D., and Rokas, A. (2007). Bayesian estimation of concordance among gene trees. Molecular Biology and Evolution, 24(2):412–426.

Antonellis, A., Bennett, W. R., Prasad, A. B., Lee-Lin, S.-Q., Green, E. D., Paisley, D., Kelsh, R. N., Pavan, W. J., and Ward, A. (2006). Deletion of long-range sequences at Sox10 compromises developmental expression in a mouse model of Waardenburg-Shah (WS4) syndrome. Human Molecular Genetics, 15(2):259–271.

Aparicio, S., Chapman, J., Stupka, E., Putnam, N., Chia, J. M., Dehal, P., Christoffels, A., Rash, S., Hoon, S., Smit, A., Gelpke, M. D., Roach, J., Oh, T., Ho, I. Y., Wong, M., Detter, C., Verhoef, F., Predki, P., Tay, A., Lucas, S., Richardson, P., Smith, S. F., Clark, M. S., Edwards, Y. J., Doggett, N., Zharkikh, A., Tavtigian, S. V., Pruss, D., Barnstead, M., Evans, C., Baden, H., Powell, J., Glusman, G., Rowen, L., Hood, L., Tan, Y. H., Elgar, G., Hawkins, T., Venkatesh, B., Rokhsar, D., and Brenner, S. (2002). Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science, 297(5585):1301–1310.

Archibald, J. M. and Roger, A. J. (2002). Gene conversion and the evolution of euryarchaeal chaperonins: a maximum likelihood-based method for detecting conflicting phylogenetic signals. Journal of Molecular Evolution, 55(2):232–245.

Arnason, U., Adegoke, J. A., Bodin, K., Born, E. W., Esa, Y. B., Gullberg, A., Nilsson, M., Short, R. V., Xu, X., and Janke, A. (2002). Mammalian mitogenomic relationships and the root of the eutherian tree. Proceedings of the National Academy of Sciences, USA, 99(12):8151–8156.

Asai-Coakwell, M., French, C., Ye, M., Garcha, K., Bigot, K., Perera, A., Staehling-Hampton, K., Mema, S., Chanda, B., Mushegian, A., Bamforth, S., Doschak, M., Li, G., Dobbs, M., Giampietro, P., Brooks, B., Vijayalakshmi, P., Sauve, Y., Abitbol, M., Sundaresan, P., Heyningen, V., Pourquie, O., Underhill, T., Waskiewicz, A., and Lehmann, O. (2009). Incomplete penetrance and phenotypic variability characterize Gdf6-attributable oculo-skeletal phenotypes. Human Molecular Genetics, 18(6):1110–1121.

Awadalla, P. (2003). The evolutionary genomics of pathogen recombination. Nature Reviews Genetics, 4(1):50–60.

Barroso, C. M. L., Schneider, H., Schneider, M. P. C., Sampaio, I., Harada, M. L., Czelusniak, J., and Goodman, M. (1997). Update on the phylogenetic systematics of New World monkeys: Further DNA evidence for placing the pygmy marmoset (Cebuella) within the genus Callithrix. International Journal of Primatology, 18(4):651–674.

Bashir, A., Ye, C., Price, A. L., and Bafna, V. (2005). Orthologous repeats and mammalian phylogenetic inference. Genome Research, 15(7):998–1006.

Bejerano, G., Pheasant, M., Makunin, I., Stephen, S., Kent, W. J., Mattick, J. S., and Haussler, D. (2004). Ultraconserved elements in the human genome. Science, 304(5675):1321–1325.

Blakesley, R. W., Hansen, N. F., Mullikin, J. C., Thomas, P. J., McDowell, J. C., Maskeri, B., Young, A. C., Benjamin, B., Brooks, S. Y., Coleman, B. I., Gupta, J., Ho, S. L., Karlins, E. M., Maduro, Q. L., Stantripop, S., Tsurgeon, C., Vogt, J. L., Walker, M. A., Masiello, C. A., Guan, X., Bouffard, G. G., and Green, E. D. (2004). An intermediate grade of finished genomic sequence suitable for comparative analyses. Genome Research, 14(11):2235–2244.

Blanchette, M., Green, E. D., Miller, W., and Haussler, D. (2004a). Reconstructing large regions of an ancestral mammalian genome in silico. Genome Research, 14(12):2412–2423.

Blanchette, M., Kent, W. J., Riemer, C., Elnitski, L., Smit, A. F., Roskin, K. M., Baertsch, R., Rosenbloom, K., Clawson, H., Green, E. D., Haussler, D., and Miller, W. (2004b). Aligning multiple genomic sequences with the threaded blockset aligner. Genome Research, 14(4):708–715.

Bock, C., Paulsen, M., Tierling, S., Mikeska, T., Lengauer, T., and Walter, J. (2006). CpG island methylation in human lymphocytes is highly correlated with DNA sequence, repeats, and predicted DNA structure. PLoS Genetics, 2(3):e26+.

Bock, C., Walter, J., Paulsen, M., and Lengauer, T. (2007). CpG island mapping by epigenome prediction. PLoS Computational Biology, 3(6):e110+.

Boissinot, S., Entezam, A., Young, L., Munson, P. J., and Furano, A. V. (2004). The insertional history of an active family of L1 retrotransposons in humans. Genome Research, 14(7):1221–1231.

Bradley, R. K., Roberts, A., Smoot, M., Juvekar, S., Do, J., Dewey, C., Holmes, I., and Pachter, L. (2009). Fast statistical alignment. PLoS Computational Biology, 5(5):e1000392+.

Bull, J. J., Huelsenbeck, J. P., Cunningham, C. W., Swofford, D. L., and Waddell, P. J. (1993). Partitioning and combining data in phylogenetic analysis. Systematic Biology, 42(3):384–397.

Camin, J. H. and Sokal, R. R. (1965). A method for deducing branching sequences in phylogeny. Evolution, 19(3):311–326.

Castresana, J. (2000). Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Molecular Biology and Evolution, 17(4):540–552.

Caswell, J. L., Mallick, S., Richter, D. J., Neubauer, J., Schirmer, C., Gnerre, S., and Reich, D. (2008). Analysis of chimpanzee history based on genome sequence alignments. PLoS Genetics, 4(4).

Cavender, J. A. and Felsenstein, J. (1987). Invariants of phylogenies in a simple case with discrete states. Journal of Classification, 4(1):57–71.

Chaisson, M. J., Raphael, B. J., and Pevzner, P. A. (2006). Microinversions in mammalian evolution. Proceedings of the National Academy of Sciences, 103(52):19824–19829.

Chen, J. M., Stenson, P. D., Cooper, D. N., and Férec, C. (2005). A systematic analysis of LINE-1 endonuclease-dependent retrotranspositional events causing human genetic disease. Human Genetics, 117(5):411–427.

Chimpanzee Sequencing and Analysis Consortium (2005). Initial sequence of the chimpanzee genome and comparison with the human genome. Nature, 437(7055):69–87.

Churakov, G., Kriegs, J. O., Baertsch, R., Zemann, A., Brosius, J., and Schmitz, J. (2009). Mosaic retroposon insertion patterns in placental mammals. Genome Research.

Clamp, M., Cuff, J., Searle, S. M., and Barton, G. J. (2004). The Jalview Java alignment editor. Bioinformatics, 20(3):426–427.

Collins, F. S., Green, E. D., Guttmacher, A. E., and Guyer, M. S. (2003). A vision for the future of genomics research. Nature, 422(6934):835–847.

Comas, I., Moya, A., and Candelas, G. F. (2007). From phylogenetics to phylogenomics: the evolutionary relationships of insect endosymbiotic gamma-proteobacteria as a test case. Systematic Biology, 56(1):1–16.

Cooper, G. M. and Brown, C. D. (2008). Qualifying the relationship between sequence conservation and molecular function. Genome Research, 18(2):201–205.

Cooper, G. M., Stone, E. A., Asimenos, G., Green, E. D., Batzoglou, S., and Sidow, A. (2005). Distribution and intensity of constraint in mammalian genomic sequence. Genome Research, 15(7):901–913.

Cope, E. D. (1889). The Edentata of North America. American Naturalist, 23:657–664.

Cunningham, C. W. (1997). Can three incongruence tests predict when data should be combined? Molecular Biology and Evolution, 14(7):733–740.

Delsuc, F., Scally, M., Madsen, O., Stanhope, M. J., de Jong, W. W., Catzeflis, F. M., Springer, M. S., and Douzery, E. J. (2002). Molecular phylogeny of living xenarthrans and the impact of character and taxon sampling on the placental tree rooting. Molecular Biology and Evolution, 19(10):1656–1671.

Douady, C. J., Delsuc, F., Boucher, Y., Doolittle, W. F., and Douzery, E. J. (2003). Comparison of Bayesian and maximum likelihood bootstrap measures of phylogenetic reliability. Molecular Biology and Evolution, 20(2):248–254.

Douzery, E. J. and Huchon, D. (2004). Rabbits, if anything, are likely Glires. Molecular Phylogenetics and Evolution, 33(3):922–935.

Dutheil, J. Y., Ganapathy, G., Hobolth, A., Mailund, T., Uyenoyama, M. K., and Schierup, M. H. (2009). Ancestral population genomics: The coalescent hidden Markov model approach. Genetics.

Ebersberger, I., Galgoczy, P., Taudien, S., Taenzer, S., Platzer, M., and von Haeseler, A. (2007). Mapping human genetic ancestry. Molecular Biology and Evolution, 24(10):2266–2276.

Edwards, A. W. F. and Cavalli-Sforza, L. L. (1964). Reconstruction of evolutionary trees, pages 64–76. Systematics Association Publ., No. 6, London.

Edwards, S. V., Liu, L., and Pearl, D. K. (2007). High-resolution species trees without concatenation. Proceedings of the National Academy of Sciences, 104(14):5936–5941.

Ellegren, H. (2008). Comparative genomics and the study of evolution by natural selection. Molecular Ecology, 17(21):4586–4596.

ENCODE Project Consortium (2007). Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447(7146):799–816.

Ezpeleta, R. N., Brinkmann, H., Roure, B., Lartillot, N., Lang, B. F., and Philippe, H. (2007). Detecting and overcoming systematic errors in genome-scale phylogenies. Systematic Biology, 56(3):389–399.

Felsenstein, J. (1978). The number of evolutionary trees. Systematic Zoology, 27(1):27+.

Felsenstein, J. (1982). Numerical methods for inferring evolutionary trees. The Quarterly Review of Biology, 57(4):379–404.

Felsenstein, J. (2003). Inferring Phylogenies. Sinauer Associates.

Felsenstein, J. (2007). PHYLIP (Phylogeny Inference Package). Distributed by the author, Department of Genome Sciences, University of Washington, Seattle.

Feng, J., Bi, C., Clark, B. S., Mady, R., Shah, P., and Kohtz, J. D. (2006). The Evf-2 noncoding RNA is transcribed from the Dlx-5/6 ultraconserved region and functions as a Dlx-2 transcriptional coactivator. Genes & Development, 20(11):1470–1484.

Fitch, W. M. and Margoliash, E. (1967). Construction of phylogenetic trees: A method based on mutation distances as estimated from cytochrome c sequences is of general applicability. Science, 155(3760):279–284.

Fogel, G. B., Weekes, D. G., Varga, G., Dow, E. R., Craven, A. M., Harlow, H. B., Su, E. W., Onyia, J. E., and Su, C. (2005). A statistical analysis of the TRANSFAC database. Biosystems, 81(2):137–154.

Galtier, N. and Gouy, M. (1998). Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Molecular Biology and Evolution, 15(7):871–879.

Giardine, B., Riemer, C., Hardison, R. C., Burhans, R., Elnitski, L., Shah, P., Zhang, Y., Blankenberg, D., Albert, I., Taylor, J., Miller, W., Kent, W. J., and Nekrutenko, A. (2005). Galaxy: a platform for interactive large-scale genome analysis. Genome Research, 15(10):1451–1455.

Gibson, A., Shankar, G. V., Higgs, P. G., and Rattray, M. (2005). A comprehensive analysis of mammalian mitochondrial genome base composition and improved phylogenetic methods. Molecular Biology and Evolution, 22(2):251–264.

Giresi, P. G., Kim, J., McDaniell, R. M., Iyer, V. R., and Lieb, J. D. (2007). FAIRE (formaldehyde-assisted isolation of regulatory elements) isolates active regulatory elements from human chromatin. Genome Research, 17:877–885.

Goldman, N., Anderson, J. P., and Rodrigo, A. G. (2000). Likelihood-based tests of topologies in phylogenetics. Systematic Biology, 49(4):652–670.

Goodman, M., Porter, C. A., Czelusniak, J., Page, S. L., Schneider, H., Shoshani, J., Gunnell, G., and Groves, C. P. (1998). Toward a phylogenetic classification of primates based on DNA evidence complemented by fossil evidence. Molecular Phylogenetics and Evolution, 9(3):585–598.

Goodrich, J. A. and Kugel, J. F. (2006). Non-coding-RNA regulators of RNA polymerase II transcription. Nature Reviews Molecular Cell Biology, 7(8):612–616.

Graur, D. and Li, W.-H. (2000). Fundamentals of Molecular Evolution. Sinauer Associates.

Gregory, R. T. (2004). The Evolution of the Genome. Academic Press.

Grehan, J. R. and Schwartz, J. H. (2009). Evolution of the second orangutan: phylogeny and biogeography of hominid origins. Journal of Biogeography.

GuhaThakurta, D. (2006). Computational identification of transcriptional regulatory elements in DNA sequence. Nucleic Acids Research, 34(12):3585–3598.

Guindon, S. and Gascuel, O. (2003). A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Systematic Biology, 52(5):696–704.

Hallstrom, B. M. and Janke, A. (2008). Resolution among major placental mammal interordinal relationships with genome data imply that speciation influenced their earliest radiations. BMC Evolutionary Biology, 8:162+.

Hallstrom, B. M., Kullberg, M., Nilsson, M. A., and Janke, A. (2007). Phylogenomic data analyses provide evidence that Xenarthra and Afrotheria are sister groups. Molecular Biology and Evolution, 24(9):2059–2068.

Hardison, R. C. (2000). Conserved noncoding sequences are reliable guides to regulatory elements. Trends in Genetics, 16(9):369–372.

Hasegawa, M. and Hashimoto, T. (1993). Ribosomal RNA trees misleading? Nature, 361(6407):23.

Havlak, P., Chen, R., Durbin, K. J., Egan, A., Ren, Y., Song, X.-Z., Weinstock, G. M., and Gibbs, R. A. (2004). The Atlas genome assembly system. Genome Research, 14(4):721–732.

Herbarth, B., Pingault, V., Bondurand, N., Kuhlbrodt, K., Hermans-Borgmeyer, I., Puliti, A., Lemort, N., Goossens, M., and Wegner, M. (1998). Mutation of the Sry-related Sox10 gene in Dominant megacolon, a mouse model for human Hirschsprung disease. Proceedings of the National Academy of Sciences of the United States of America, 95(9):5161–5165.

Hillis, D. M., Pollock, D. D., McGuire, J. A., and Zwickl, D. J. (2003). Is sparse taxon sampling a problem for phylogenetic inference? Systematic Biology, 52(1):124–126.

Hobolth, A., Christensen, O. F., Mailund, T., and Schierup, M. H. (2007). Genomic relationships and speciation times of human, chimpanzee, and gorilla inferred from a coalescent hidden Markov model. PLoS Genetics, 3(2):e7+.

Hou, Z.-C., Romero, R., and Wildman, D. E. (2009). Phylogeny of the Ferungulata (Mammalia: Laurasiatheria) as determined from phylogenomic data. Molecular Phylogenetics and Evolution, 52(3):660–664.

Hu, Y., Wang, Y., Luo, Z., and Li, C. (1997). A new symmetrodont mammal from China and its implications for mammalian evolution. Nature, 390(6656):137–142.

Huelsenbeck, J. P. (1991). When are fossils better than extant taxa in phylogenetic analysis? Systematic Zoology, 40(4):458–469.

Huelsenbeck, J. P., Ronquist, F., Nielsen, R., and Bollback, J. P. (2001). Bayesian inference of phylogeny and its impact on evolutionary biology. Science, 294(5550):2310–2314.

Hughes, A. L. and Friedman, R. (2007). The effect of branch lengths on phylogeny: An empirical study using highly conserved orthologs from mammalian genomes. Molecular Phylogenetics and Evolution, 45(1):81–88.

Husmeier, D. and Mantzaris, A. V. (2008). Addressing the shortcomings of three recent Bayesian methods for detecting interspecific recombination in DNA sequence alignments. Statistical Applications in Genetics and Molecular Biology, 7(1):34.

International Human Genome Sequencing Consortium (2001). Initial sequencing and analysis of the human genome. Nature, 409(6822):860–921.

Janke, A., Gemmell, N. J., Feldmaier-Fuchs, G., von Haeseler, A., and Pääbo, S. (1996). The mitochondrial genome of a monotreme—the platypus (Ornithorhynchus anatinus). Journal of Molecular Evolution, 42(2):153–159.

Janke, A., Xu, X., and Arnason, U. (1997). The complete mitochondrial genome of the wallaroo (Macropus robustus) and the phylogenetic relationship among Monotremata, Marsupialia, and Eutheria. Proceedings of the National Academy of Sciences, 94(4):1276–1281.

Johnson, D. S., Mortazavi, A., Myers, R. M., and Wold, B. (2007). Genome-wide mapping of in vivo protein-DNA interactions. Science, 316(5830):1497–1502.

Kaiser, V. B., van Tuinen, M., and Ellegren, H. (2007). Insertion events of CR1 retrotransposable elements elucidate the phylogenetic branching order in galliform birds. Molecular Biology and Evolution, 24(1):338–347.

Karolchik, D., Hinrichs, A. S., Furey, T. S., Roskin, K. M., Sugnet, C. W., Haussler, D., and Kent, W. J. (2004). The UCSC table browser data retrieval tool. Nucleic Acids Research, 32(Database issue).

Kearney, M. (2002). Fragmentary taxa, missing data, and ambiguity: mistaken assumptions and conclusions. Systematic Biology, 51(2):369–381.

Kel, A. E., Gössling, E., Reuter, I., Cheremushkin, E., Kel-Margoulis, O. V., and Wingender, E. (2003). MATCH: A tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Research, 31(13):3576–3579.

Kent, W. J., Sugnet, C. W., Furey, T. S., Roskin, K. M., Pringle, T. H., Zahler, A. M., and Haussler, D. (2002). The human genome browser at UCSC. Genome Research, 12(6):996–1006.

Killian, K. J., Buckley, T. R., Stewart, N., Munday, B. L., and Jirtle, R. L. (2001). Marsupials and eutherians reunited: genetic evidence for the Theria hypothesis of mammalian evolution. Mammalian Genome, 12(7):513–517.

Kimura, M. (1983). The Neutral Theory of Molecular Evolution. Cambridge University Press.

Kingman, J. (1982). The coalescent. Stochastic Processes and their Applications, 13(3):235–248.

Kjer, K. M. and Honeycutt, R. L. (2007). Site specific rates of mitochondrial genomes and the phylogeny of Eutheria. BMC Evolutionary Biology, 7:8+.

Kluge, A. G. and Wolf, A. J. (1993). Cladistics: What’s in a word? Cladistics, 9(2):183–199.

Kolaczkowski, B. and Thornton, J. W. (2006). Is there a star tree paradox? Molecular Biology and Evolution, 23(10):1819–1823.

Kolaczkowski, B. and Thornton, J. W. (2008). A mixed branch length model of heterotachy improves phylogenetic accuracy. Molecular Biology and Evolution, 25(6):1054–1066.

Kosiol, C., Bofkin, L., and Whelan, S. (2006). Phylogenetics by likelihood: Evolutionary modeling as a tool for understanding the genome. Journal of Biomedical Informatics, 39(1):51–61.

Kriegs, J. O., Churakov, G., Kiefmann, M., Jordan, U., Brosius, J., and Schmitz, J. (2006). Retroposed elements as archives for the evolutionary history of placental mammals. PLoS Biology, 4(4).

Künsch, H. R. (1989). The jackknife and the bootstrap for general stationary observations. The Annals of Statistics, 17(3):1217–1241.

Kurland, C. G., Canback, B., and Berg, O. G. (2003). Horizontal gene transfer: A critical view. Proceedings of the National Academy of Sciences of the United States of America, 100(17):9658–9662.

Lartillot, N., Brinkmann, H., and Philippe, H. (2007). Suppression of long-branch attraction artefacts in the animal phylogeny using a site-heterogeneous model. BMC Evolutionary Biology, 7(Suppl 1).

Lin, Y. H., McLenachan, P. A., Gore, A. R., Phillips, M. J., Ota, R., Hendy, M. D., and Penny, D. (2002). Four new mitochondrial genomes and the increased stability of evolutionary trees of mammals from improved taxon sampling. Molecular Biology and Evolution, 19(12):2060–2070.

Liu, K., Raghavan, S., Nelesen, S., Linder, R. C., and Warnow, T. (2009). Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science, 324(5934):1561–1564.

Liu, L. and Pearl, D. K. (2007). Species trees from gene trees: reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Systematic Biology, 56(3):504–514.

Loots, G. G., Locksley, R. M., Blankespoor, C. M., Wang, Z. E., Miller, W., Rubin, E. M., and Frazer, K. A. (2000). Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science, 288(5463):136–140.

Madsen, A. H., Koepfli, K. P., Wayne, R. K., and Springer, M. S. (2003). A new phylogenetic marker, apolipoprotein B, provides compelling evidence for eutherian relationships. Molecular Phylogenetics and Evolution, 28(2):225–240.

Madsen, O., Scally, M., Douady, C. J., Kao, D. J., DeBry, R. W., Adkins, R., Amrine, H. M., Stanhope, M. J., de Jong, W. W., and Springer, M. S. (2001). Parallel adaptive radiations in two major clades of placental mammals. Nature, 409(6820):610–614.

Malia, M. J., Adkins, R. M., and Allard, M. W. (2002). Molecular support for Afrotheria and the polyphyly of Lipotyphla based on analyses of the growth hormone receptor gene. Molecular Phylogenetics and Evolution, 24(1):91–101.

Margulies, E. H. and Birney, E. (2008). Approaches to comparative sequence analysis: towards a functional view of vertebrate genomes. Nature Reviews Genetics, 9(4):303–313.

Margulies, E. H., Blanchette, M., Haussler, D., and Green, E. D. (2003). Identification and characterization of multi-species conserved sequences. Genome Research, 13(12):2507–2518.

Margulies, E. H., Chen, C. W., and Green, E. D. (2006). Differences between pair-wise and multi-sequence alignment methods affect vertebrate genome comparisons. Trends in Genetics, 22(4):187–193.

Margulies, E. H., Cooper, G. M., Asimenos, G., Thomas, D. J., Dewey, C. N., Siepel, A., Birney, E., Keefe, D., Schwartz, A. S., Hou, M., Taylor, J., Nikolaev, S., Montoya-Burgos, J. I., Loytynoja, A., Whelan, S., Pardi, F., Massingham, T., Brown, J. B., Bickel, P., Holmes, I., Mullikin, J. C., Ureta-Vidal, A., Paten, B., Stone, E. A., Rosenbloom, K. R., Kent, W. J., Bouffard, G. G., Guan, X., Hansen, N. F., Idol, J. R., Maduro, V. V., Maskeri, B., McDowell, J. C., Park, M., Thomas, P. J., Young, A. C., Blakesley, R. W., Muzny, D. M., Sodergren, E., Wheeler, D. A., Worley, K. C., Jiang, H., Weinstock, G. M., Gibbs, R. A., Graves, T., Fulton, R., Mardis, E. R., Wilson, R. K., Clamp, M., Cuff, J., Gnerre, S., Jaffe, D. B., Chang, J. L., Lindblad-Toh, K., Lander, E. S., Hinrichs, A., Trumbower, H., Clawson, H., Zweig, A., Kuhn, R. M., Barber, G., Harte, R., Karolchik, D., Field, M. A., Moore, R. A., Matthewson, C. A., Schein, J. E., Marra, M. A., Antonarakis, S. E., Batzoglou, S., Goldman, N., Hardison, R., Haussler, D., Miller, W., Pachter, L., Green, E. D., and Sidow, A. (2007). Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome. Genome Research, 17(6):760–774.

Matys, V., Fricke, E., Geffers, R., Gössling, E., Haubrock, M., Hehl, R., Hornischer, K., Karas, D., Kel, A. E., Kel-Margoulis, O. V., Kloos, D. U., Land, S., Lewicki-Potapov, B., Michael, H., Münch, R., Reuter, I., Rotert, S., Saxel, H., Scheer, M., Thiele, S., and Wingender, E. (2003). TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Research, 31(1):374–378.

McLean, C. and Bejerano, G. (2008). Dispensability of mammalian DNA. Genome Research, 18(11):1743–1751.

Misawa, K. and Janke, A. (2003). Revisiting the Glires concept—phylogenetic analysis of nuclear sequences. Molecular Phylogenetics and Evolution, 28(2):320–327.

Mortlock, D. P., Guenther, C., and Kingsley, D. M. (2003). A general approach for identifying distant regulatory elements applied to the Gdf6 gene. Genome Research, 13(9):2069–2081.

Mortlock, D. P., Portnoy, M. E., Chandler, R. L., and Green, E. D. (2004). Comparative sequence analysis of the Gdf6 locus reveals a duplicon-mediated chromosomal rearrangement in rodents and rapidly diverging coding and regulatory sequences. Genomics, 84(5):814–823.

Muller, J. and Muller, K. (2003). TREEGRAPH: generating complex postscript trees using an extensible tree description format. Program distributed by the authors, Botanical Institute, University of Bonn.

Murphy, W. J., Eizirik, E., Johnson, W. E., Zhang, Y. P., Ryder, O. A., and O’Brien, S. J. (2001a). Molecular phylogenetics and the origins of placental mammals. Nature, 409(6820):614–618.

Murphy, W. J., Eizirik, E., O’Brien, S. J., Madsen, O., Scally, M., Douady, C. J., Teeling, E., Ryder, O. A., Stanhope, M. J., de Jong, W. W., and Springer, M. S. (2001b). Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science, 294(5550):2348–2351.

Murphy, W. J. and O’Brien, S. J. (2007). Designing and optimizing comparative anchor primers for comparative gene mapping and phylogenetic inference. Nature Protocols, 2(11):3022–3030.

Murphy, W. J., Pevzner, P. A., and O’Brien, S. J. (2004). Mammalian phylogenomics comes of age. Trends in Genetics, 20(12):631–639.

Murphy, W. J., Pringle, T. H., Crider, T. A., Springer, M. S., and Miller, W. (2007). Using genomic data to unravel the root of the placental mammal phylogeny. Genome Research, 17(4):413–421.

Nei, M. and Kumar, S. (2000). Molecular Evolution and Phylogenetics. Oxford University Press.

Nikolaev, S., Montoya-Burgos, J. I., Margulies, E. H., NISC Comparative Sequencing Program, Rougemont, J., Nyffeler, B., and Antonarakis, S. E. (2007). Early history of mammals is elucidated with the ENCODE multiple species sequencing data. PLoS Genetics, 3(1):e2+.

Nishihara, H., Hasegawa, M., and Okada, N. (2006). Pegasoferae, an unexpected mammalian clade revealed by tracking ancient retroposon insertions. Proceedings of the National Academy of Sciences, 103(26):9929–9934.

Nishihara, H., Okada, N., and Hasegawa, M. (2007). Rooting the eutherian tree: the power and pitfalls of phylogenomics. Genome Biology, 8:R199+.

Nishihara, H., Satta, Y., Nikaido, M., Thewissen, J. G., Stanhope, M. J., and Okada, N. (2005). A retroposon analysis of afrotherian phylogeny. Molecular Biology and Evolution, 22(9):1823–1833.

Nylander, J. A., Wilgenbusch, J. C., Warren, D. L., and Swofford, D. L. (2008). AWTY (are we there yet?): a system for graphical exploration of MCMC convergence in Bayesian phylogenetics. Bioinformatics, 24(4):581–583.

Nylander, J. A. A. (2004). MrModeltest. Program distributed by the author, Evolutionary Biology Center, Uppsala University.

Opazo, J. C., Wildman, D. E., Prychitko, T., Johnson, R. M., and Goodman, M. (2006). Phylogenetic relationships and divergence times among New World monkeys (Platyrrhini, primates). Molecular Phylogenetics and Evolution, 40(1):274–280.

Osterholz, M., Walter, L., and Roos, C. (2009). Retropositional events consolidate the branching order among New World monkey genera. Molecular Phylogenetics and Evolution, 50(3):507–513.

Page, R. D. M. and Holmes, E. C. (1998). Molecular Evolution: A Phylogenetic Approach. Blackwell Publishers.

Paraskevis, D., Deforche, K., Lemey, P., Magiorkinis, G., Hatzakis, A., and Vandamme, A. M. (2005). SlidingBayes: exploring recombination using a sliding window approach based on Bayesian phylogenetic inference. Bioinformatics, 21(7):1274–1275.

Patterson, N., Richter, D. J., Gnerre, S., Lander, E. S., and Reich, D. (2006). Genetic evidence for complex speciation of humans and chimpanzees. Nature, 441(7097):1103–1108.

Philippe, H., Delsuc, F., Brinkmann, H., and Lartillot, N. (2005). Phylogenomics. Annual Review of Ecology, Evolution, and Systematics, 36(1):541–562.

Philippe, H., Snell, E. A., Bapteste, E., Lopez, P., Holland, P. W., and Casane, D. (2004). Phylogenomics of eukaryotes: impact of missing data on large alignments. Molecular Biology and Evolution, 21(9):1740–1752.

Phillips, A., Janies, D., and Wheeler, W. (2000). Multiple sequence alignment in phylogenetic analysis. Molecular Phylogenetics and Evolution, 16(3):317–330.

Phillips, M. J., Delsuc, F., and Penny, D. (2004). Genome-scale phylogeny and the detection of systematic biases. Molecular Biology and Evolution, 21(7):1455–1458.

Phillips, M. J., Lin, Y. H., Harrison, G. L., and Penny, D. (2001). Mitochondrial genomes of a bandicoot and a brushtail possum confirm the monophyly of australidelphian marsupials. Proceedings of the Royal Society B: Biological Sciences, 268(1475):1533–1538.

Phillips, M. J. and Penny, D. (2003). The root of the mammalian tree inferred from whole mitochondrial genomes. Molecular Phylogenetics and Evolution, 28(2):171–185.

Pollard, D. A., Moses, A. M., Iyer, V. N., and Eisen, M. B. (2006). Detecting the limits of regulatory element conservation and divergence estimation using pairwise and multiple alignments. BMC Bioinformatics, 7:376.

Ponjavic, J., Ponting, C. P., and Lunter, G. (2007). Functionality or transcriptional noise? evidence for selection within long noncoding RNAs. Genome Research, 17(5):556–565.

Portnoy, M. E., McDermott, K. J., Antonellis, A., Margulies, E. H., Prasad, A. B., Kingsley, D. M., Green, E. D., and Mortlock, D. P. (2005). Detection of potential GDF6 regulatory elements by multispecies sequence comparisons and identification of a skeletal joint enhancer. Genomics, 86(3):295–305.

Posada, D. and Crandall, K. (1998). MODELTEST: testing the model of DNA substitution. Bioinformatics, 14(9):817–818.

Poux, C. and Douzery, E. J. (2004). Primate phylogeny, evolutionary rate variations, and divergence times: a contribution from the nuclear gene IRBP. American Journal of Physical Anthropology, 124(1):1–16.

Prasad, A. B., Allard, M. W., The NISC Comparative Sequencing Program, and Green, E. D. (2008). Conﬁrming the phylogeny of mammals by use of large comparative sequence datasets. Molecular Biology and Evolution, 25(9):1795–1808.

Pritsker, M., Liu, Y.-C., Beer, M. A., and Tavazoie, S. (2004). Whole-genome discovery of transcription factor binding sites by network-level conservation. Genome Research, 14(1):99–108.

R Development Core Team (2008). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

Rambaut, A. and Grassly, N. C. (1997). Seq-Gen: An application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Computer applications in the biosciences : CABIOS, 13(3):235–238.

Rannala, B. and Yang, Z. (2008). Phylogenetic inference using whole genomes. Annual Review of Genomics and Human Genetics, 9(1):217–231.

Rat Genome Sequencing Project Consortium (2004). Genome sequence of the brown Norway rat yields insights into mammalian evolution. Nature, 428(6982):493–521.

Ray, D. A., Xing, J., Hedges, D. J., Hall, M. A., Laborde, M. E., Anders, B. A., White, B. R., Stoilova, N., Fowlkes, J. D., Landry, K. E., Chemnick, L. G., Ryder, O. A., and Batzer, M. A. (2005). Alu insertion loci and platyrrhine primate phylogeny. Molecular Phylogenetics and Evolution, 35(1):117–126.

Redelings, B. and Suchard, M. (2005). Joint Bayesian estimation of alignment and phylogeny. Systematic Biology, 54(3):401–418.

Ren, B., Robert, F., Wyrick, J. J., Aparicio, O., Jennings, E. G., Simon, I., Zeitlinger, J., Schreiber, J., Hannett, N., Kanin, E., Volkert, T. L., Wilson, C. J., Bell, S. P., and Young, R. A. (2000). Genome-wide location and function of DNA binding proteins. Science, 290(5500):2306–2309.

Rokas, A. and Carroll, S. B. (2005). More genes or more taxa? The relative contribution of gene number and taxon number to phylogenetic accuracy. Molecular Biology and Evolution, 22(5):1337–1344.

Ronquist, F. and Huelsenbeck, J. P. (2003). MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 19(12):1572–1574.

Ronquist, F., Larget, B., Huelsenbeck, J. P., Kadane, J. B., Simon, D., and van der Mark, P. (2006). Comment on “Phylogenetic MCMC algorithms are misleading on mixtures of trees”. Science, 312(5772):367a.

Rosenberg, M. S. and Kumar, S. (2003). Taxon sampling, bioinformatics, and phylogenomics. Systematic Biology, 52(1):119–124.

Ruvolo, M. (1997). Molecular phylogeny of the hominoids: Inferences from multiple independent DNA sequence data sets. Molecular Biology and Evolution, 14(3):248–265.

Rzhetsky, A. and Nei, M. (1992). A simple method for estimating and testing minimum-evolution trees. Molecular Biology and Evolution, 9(5):945–967.

Saitou, N. and Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4):406–425.

Salem, A. H., Ray, D. A., Xing, J., Callinan, P. A., Myers, J. S., Hedges, D. J., Garber, R. K., Witherspoon, D. J., Jorde, L. B., and Batzer, M. A. (2003). Alu elements and hominid phylogenetics. Proceedings of the National Academy of Sciences, USA, 100(22):12787–12791.

Salemi, M. and Vandamme, A.-M., editors (2003). The Phylogenetic Handbook: A Practical Approach to DNA and Protein Phylogeny. Cambridge University Press.

Scally, M., Madsen, O., Douady, C. J., de Jong, W. W., Stanhope, M. J., and Springer, M. S. (2001). Molecular evidence for the major clades of placental mammals. Journal of Mammalian Evolution, 8(4):239–277.

Schneider, H. (2000). The current status of the New World monkey phylogeny. Anais Da Academia Brasileira De Ciencias, 72(2):165–172.

Schwartz, S., Elnitski, L., Li, M., Weirauch, M., Riemer, C., Smit, A., NISC Comparative Sequencing Program, Green, E. D., Hardison, R. C., and Miller, W. (2003). MultiPipMaker and supporting tools: alignments and analysis of multiple genomic DNA sequences. Nucleic Acids Research, 31(13):3518–3524.

Schwartz, S., Zhang, Z., Frazer, K. A., Smit, A., Riemer, C., Bouck, J., Gibbs, R., Hardison, R., and Miller, W. (2000). PipMaker—a web server for aligning two genomic DNA sequences. Genome Research, 10(4):577–586.

Settle, S. H., Rountree, R. B., Sinha, A., Thacker, A., Higgins, K., and Kingsley, D. M. (2003). Multiple joint and skeletal patterning defects caused by single and double mutations in the mouse Gdf6 and Gdf5 genes. Developmental Biology, 254(1):116–130.

Shimamura, M., Yasue, H., Ohshima, K., Abe, H., Kato, H., Kishiro, T., Goto, M., Munechika, I., and Okada, N. (1997). Molecular evidence from retroposons that whales form a clade within even-toed ungulates. Nature, 388(6643):666–670.

Shimodaira, H. (2002). An approximately unbiased test of phylogenetic tree selection. Systematic Biology, 51(3):492–508.

Shimodaira, H. and Hasegawa, M. (1999). Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Molecular Biology and Evolution, 16(8):1114–1116.

Shimodaira, H. and Hasegawa, M. (2001). CONSEL: for assessing the conﬁdence of phylogenetic tree selection. Bioinformatics, 17(12):1246–1247.

Shoshani, J. and McKenna, M. C. (1998). Higher taxonomic relationships among extant mammals based on morphology, with selected comparisons of results from molecular data. Molecular Phylogenetics and Evolution, 9(3):572–584.

Siepel, A., Bejerano, G., Pedersen, J. S., Hinrichs, A. S., Hou, M., Rosenbloom, K., Clawson, H., Spieth, J., Hillier, L. W., Richards, S., Weinstock, G. M., Wilson, R. K., Gibbs, R. A., Kent, W. J., Miller, W., and Haussler, D. (2005). Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Research, 15(8):1034–1050.

Slattery, P. J., Wilkerson, P. A. J., Murphy, W. J., and O’Brien, S. J. (2004). Phylogenetic assessment of introns and SINEs within the Y chromosome using the cat family Felidae as a species tree. Molecular Biology and Evolution, 21(12):2299–2309.

Sokal, R. R. and Michener, C. D. (1958). A statistical method for evaluating systematic relationships. University of Kansas Scientific Bulletin, 38:1409–1438.

Southard-Smith, E. M., Kos, L., and Pavan, W. J. (1998). Sox10 mutation disrupts neural crest development in Dom Hirschsprung mouse model. Nature Genetics, 18(1):60–64.

Springer, M. S. and de Jong, W. W. (2001). Phylogenetics: Which mammalian supertree to bark up? Science, 291(5509):1709–1711.

Springer, M. S., Murphy, W. J., Eizirik, E., and O’Brien, S. J. (2003). Placental mammal diversification and the Cretaceous-Tertiary boundary. Proceedings of the National Academy of Sciences, USA, 100(3):1056–1061.

Springer, M. S., Stanhope, M. J., Madsen, O., and de Jong, W. W. (2004). Molecules consolidate the placental mammal tree. Trends in Ecology and Evolution, 19(8):430–438.

Stamatakis, A. (2006). RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics, 22(21):2688–2690.

Stamatakis, A., Hoover, P., and Rougemont, J. (2008). A rapid bootstrap algorithm for the RAxML web servers. Systematic Biology, 57(5):758–771.

Stanhope, M. J., Waddell, V. G., Madsen, O., de Jong, W., Hedges, B. S., Cleven, G. C., Kao, D., and Springer, M. S. (1998). Molecular evidence for multiple origins of Insectivora and for a new order of endemic African insectivore mammals. Proceedings of the National Academy of Sciences, 95(17):9967–9972.

Steiper, M. E. and Ruvolo, M. (2003). New World monkey phylogeny based on X-linked G6PD DNA sequences. Molecular Phylogenetics and Evolution, 27(1):121–130.

Storz and Wassarman, K. M. M. (2004). An abundance of RNA regulators. Annual Review of Biochemistry.

Sullivan, J. and Swofford, D. L. (1997). Are guinea pigs rodents? The importance of adequate models in molecular phylogenetics. Journal of Mammalian Evolution, 4(2):77–86.

Susko, E. (2008). On the distributions of bootstrap support and posterior distributions for a star tree. Systematic Biology, 57(4):602–612.

Sviderskaya, E. V., Hill, S. P., Evans-Whipp, T. J., Chin, L., Orlow, S. J., Easty, D. J., Cheong, S. C., Beach, D., DePinho, R. A., and Bennett, D. C. (2002). p16(Ink4a) in melanocyte senescence and differentiation. Journal of the National Cancer Institute, 94(6):446–454.

Swofford, D. L. (2003). PAUP*: Phylogenetic Analysis Using Parsimony (*and other methods). Sinauer Associates, Sunderland, Massachusetts.

The International Chimpanzee Chromosome 22 Consortium (2004). DNA sequence and comparative analysis of chimpanzee chromosome 22. Nature, 429(6990):382–388.

The C. elegans Sequencing Consortium (1998). Genome sequence of the nematode C. elegans: A platform for investigating biology. Science, 282(5396):2012–2018.

Thomas, J. W., Prasad, A. B., Summers, T. J., Lin, L. S. Q., Maduro, V. V., Idol, J. R., Ryan, J. F., Thomas, P. J., McDowell, J. C., and Green, E. D. (2002). Parallel construction of orthologous sequence-ready clone contig maps in multiple species. Genome Research, 12(8):1277–1285.

Thomas, J. W., Touchman, J. W., Blakesley, R. W., Bouffard, G. G., Sternberg, B. S. M., Margulies, E. H., Blanchette, M., Siepel, A. C., Thomas, P. J., McDowell, J. C., Maskeri, B., Hansen, N. F., Schwartz, M. S., Weber, R. J., Kent, W. J., Karolchik, D., Bruen, T. C., Bevan, R., Cutler, D. J., Schwartz, S., Elnitski, L., Idol, J. R., Prasad, A. B., Lin, L. S. Q., Maduro, V. V., Summers, T. J., Portnoy, M. E., Dietrich, N. L., Akhter, N., Ayele, K., Benjamin, B., Cariaga, K., Brinkley, C. P., Brooks, S. Y., Granite, S., Guan, X., Gupta, J., Haghighi, P., Ho, S. L., Huang, M. C., Karlins, E., Laric, P. L., Legaspi, R., Lim, M. J., Maduro, Q. L., Masiello, C. A., Mastrian, S. D., McCloskey, J. C., Pearson, R., Stantripop, S., Tiongson, E. E., Tran, J. T., Tsurgeon, C., Vogt, J. L., Walker, M. A., Wetherby, K. D., Wiggins, L. S., Young, A. C., Zhang, L. H., Osoegawa, K., Zhu, B., Zhao, B., Shu, C. L., De Jong, P. J., Lawrence, C. E., Smit, A. F., Chakravarti, A., Haussler, D., Green, P., Miller, W., and Green, E. D. (2003). Comparative analyses of multi-species sequences from targeted genomic regions. Nature, 424(6950):788–793.

Toh, K. L., Wade, C. M., Mikkelsen, T. S., Karlsson, E. K., Jaffe, D. B., Kamal, M., Clamp, M., Chang, J. L., Kulbokas, E. J., Zody, M. C., Mauceli, E., Xie, X., Breen, M., Wayne, R. K., Ostrander, E. A., Ponting, C. P., Galibert, F., Smith, D. R., Dejong, P. J., Kirkness, E., Alvarez, P., Biagi, T., Brockman, W., Butler, J., Chin, C. W., Cook, A., Cuff, J., Daly, M. J., Decaprio, D., Gnerre, S., Grabherr, M., Kellis, M., Kleber, M., Bardeleben, C., Goodstadt, L., Heger, A., Hitte, C., Kim, L., Koepﬂi, K. P., Parker, H. G., Pollinger, J. P., Searle, S. M. J., Sutter, N. B., Thomas, R., Webber, C., and Lander, E. S. (2005). Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature, 438(7069):803–819.

van de Lagemaat, L. N., Gagnier, L., Medstrand, P., and Mager, D. L. (2005). Genomic deletions and precise removal of transposable elements mediated by short identical DNA segments in primates. Genome Research, 15(9):1243–1249.

van Rheede, T., Bastiaans, T., Boone, D. N., Hedges, B. S., de Jong, W. W., and Madsen, O. (2006). The platypus is in its place: Nuclear genes and indels conﬁrm the sister group relation of monotremes and therians. Molecular Biology and Evolution, 23(3):587–597.

Vixie, P. (2004). ISC Cron v.4.1. Internet Systems Consortium, Redwood City, California.

Waddell, P. J., Cao, Y., Hauf, J., and Hasegawa, M. (1999a). Using novel phylogenetic methods to evaluate mammalian mtDNA, including amino acid-invariant sites-LogDet plus site stripping, to detect internal conﬂicts in the data, with special reference to the positions of hedgehog, armadillo, and elephant. Systematic Biology, 48(1):31–53.

Waddell, P. J., Kishino, H., and Ota, R. (2001). A phylogenetic foundation for comparative mammalian genomics. Genome Informatics, 12:141–154.

Waddell, P. J., Kishino, H., and Ota, R. (2002). Very fast algorithms for evaluating the stability of ML and Bayesian phylogenetic trees from sequence data. Genome Informatics, 13:82–92.

Waddell, P. J., Okada, N., and Hasegawa, M. (1999b). Towards resolving the interordinal relationships of placental mammals. Systematic Biology, 48(1):1–5.

Waddell, P. J. and Shelley, S. (2003). Evaluating placental inter-ordinal phylogenies with novel sequences including RAG1, gamma-fibrinogen, ND6, and mt-tRNA, plus MCMC-driven nucleotide, amino acid, and codon models. Molecular Phylogenetics and Evolution, 28(2):197–224.

Waegele, J. W. and Mayer, C. (2007). Visualizing differences in phylogenetic information content of alignments and distinction of three classes of long-branch effects. BMC Evolutionary Biology, 7:147.

Wang, T. and Stormo, G. D. (2005). Identifying the conserved network of cis-regulatory sites of a eukaryotic genome. Proceedings of the National Academy of Sciences, 102(48):17400–17405.

Ward, A., Fisher, R., Richardson, L., Pooler, J. A., Squire, S., Bates, P., Shaposhnikov, R., Hayward, N., Thurston, M., and Graham, C. F. (1997). Genomic regions regulating imprinting and insulin-like growth factor-II promoter 3 activity in transgenics: novel enhancer and silencer elements. Genes and Function, 1(1):25–36.

Waters, P. D., Dobigny, G., Waddell, P. J., and Robinson, T. J. (2007). Evolutionary history of LINE-1 in the major clades of placental mammals. PLoS ONE, 2.

Waterston, R. H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J. F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P., Antonarakis, S. E., Attwood, J., Baertsch, R., Bailey, J., Barlow, K., Beck, S., Berry, E., Birren, B., Bloom, T., Bork, P., Botcherby, M., Bray, N., Brent, M. R., Brown, D. G., Brown, S. D., Bult, C., Burton, J., Butler, J., Campbell, R. D., Carninci, P., Cawley, S., Chiaromonte, F., Chinwalla, A. T., Church, D. M., Clamp, M., Clee, C., Collins, F. S., Cook, L. L., Copley, R. R., Coulson, A., Couronne, O., Cuff, J., Curwen, V., Cutts, T., Daly, M., David, R., Davies, J., Delehaunty, K. D., Deri, J., Dermitzakis, E. T., Dewey, C., Dickens, N. J., Diekhans, M., Dodge, S., Dubchak, I., Dunn, D. M., Eddy, S. R., Elnitski, L., Emes, R. D., Eswara, P., Eyras, E., Felsenfeld, A., Fewell, G. A., Flicek, P., Foley, K., Frankel, W. N., Fulton, L. A., Fulton, R. S., Furey, T. S., Gage, D., Gibbs, R. A., Glusman, G., Gnerre, S., Goldman, N., Goodstadt, L., Grafham, D., Graves, T. A., Green, E. D., Gregory, S., Guigó, R., Guyer, M., Hardison, R. C., Haussler, D., Hayashizaki, Y., Hillier, L. W., Hinrichs, A., Hlavina, W., Holzer, T., Hsu, F., Hua, A., Hubbard, T., Hunt, A., Jackson, I., Jaffe, D. B., Johnson, L. S., Jones, M., Jones, T. A., Joy, A., Kamal, M., Karlsson, E. K., Karolchik, D., Kasprzyk, A., Kawai, J., Keibler, E., Kells, C., Kent, W. J., Kirby, A., Kolbe, D. L., Korf, I., Kucherlapati, R. S., Kulbokas, E. J., Kulp, D., Landers, T., Leger, J. P., Leonard, S., Letunic, I., Levine, R., Li, J., Li, M., Lloyd, C., Lucas, S., Ma, B., Maglott, D. R., Mardis, E. R., Matthews, L., Mauceli, E., Mayer, J. H., McCarthy, M., McCombie, W. R., McLaren, S., McLay, K., McPherson, J. D., Meldrim, J., Meredith, B., Mesirov, J. P., Miller, W., Miner, T. L., Mongin, E., Montgomery, K. T., Morgan, M., Mott, R., Mullikin, J. C., Muzny, D. M., Nash, W. E., Nelson, J. O., Nhan, M. N., Nicol, R., Ning, Z., Nusbaum, C., O’Connor, M. J., Okazaki, Y., Oliver, K., Overton-Larty, E., Pachter, L., Parra, G., Pepin, K. H., Peterson, J., Pevzner, P., Plumb, R., Pohl, C. S., Poliakov, A., Ponce, T. C., Ponting, C. P., Potter, S., Quail, M., Reymond, A., Roe, B. A., Roskin, K. M., Rubin, E. M., Rust, A. G., Santos, R., Sapojnikov, V., Schultz, B., Schultz, J., Schwartz, M. S., Schwartz, S., Scott, C., Seaman, S., Searle, S., Sharpe, T., Sheridan, A., Shownkeen, R., Sims, S., Singer, J. B., Slater, G., Smit, A., Smith, D. R., Spencer, B., Stabenau, A., Stange-Thomann, N., Sugnet, C., Suyama, M., Tesler, G., Thompson, J., Torrents, D., Trevaskis, E., Tromp, J., Ucla, C., Ureta-Vidal, A., Vinson, J. P., Von Niederhausern, A. C., Wade, C. M., Wall, M., Weber, R. J., Weiss, R. B., Wendl, M. C., West, A. P., Wetterstrand, K., Wheeler, R., Whelan, S., Wierzbowski, J., Willey, D., Williams, S., Wilson, R. K., Winter, E., Worley, K. C., Wyman, D., Yang, S., Yang, S. P., Zdobnov, E. M., Zody, M. C., and Lander, E. S. (2002). Initial sequencing and comparative analysis of the mouse genome. Nature, 420(6915):520–562.

Webb, A., Hancock, J. M., and Holmes, C. C. (2009). Phylogenetic inference under recombination using Bayesian stochastic topology selection. Bioinformatics, 25(2):197–203.

Weiller, G. F. (1998). Phylogenetic profiles: a graphical method for detecting genetic recombinations in homologous sequences. Molecular Biology and Evolution, 15(3):326–335.

Whelan, S. (2008). Inferring trees. Methods in Molecular Biology (Clifton, N.J.), 452:287–309.

Wiens, J. (2005). Can incomplete taxa rescue phylogenetic analyses from long-branch attraction? Systematic Biology, 54(5):731–742.

Wildman, D. E., Jameson, N. M., Opazo, J. C., and Yi, S. V. (2009). A fully resolved genus level phylogeny of neotropical primates (Platyrrhini). Molecular Phylogenetics and Evolution.

Wildman, D. E., Uddin, M., Opazo, J. C., Liu, G., Lefort, V., Guindon, S., Gascuel, O., Grossman, L. I., Romero, R., and Goodman, M. (2007). Genomics, biogeography, and the diversification of placental mammals. Proceedings of the National Academy of Sciences, 104(36):14395–14400.

Wiuf, C., Christensen, T., and Hein, J. (2001). A simulation study of the reliability of recombination detection methods. Molecular Biology and Evolution, 18(10):1929–1939.

Yang, Z. (2007). Fair-balance paradox, star-tree paradox, and Bayesian phylogenetics. Molecular Biology and Evolution, 24(8):1639–1655.

Yu, L. and Zhang, Y. P. (2005). Evolutionary implications of multiple SINE insertions in an intronic region from diverse mammals. Mammalian Genome, 16(9):651–660.

Yue, F., Shi, J., and Tang, J. (2009). Simultaneous phylogeny reconstruction and multiple sequence alignment. BMC Bioinformatics, 10(Suppl 1):S11.

Zhang, Z. D., Paccanaro, A., Fu, Y., Weissman, S., Weng, Z., Chang, J., Snyder, M., and Gerstein, M. B. (2007). Statistical analysis of the genomic distribution and correlation of regulatory elements in the ENCODE regions. Genome Research, 17(6):787–797.

Zwickl, D. J. (2006). Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion. PhD thesis, The University of Texas at Austin.

Alignments for coding indels described in Chapter 3

The following alignments show the coding indels found in the dataset described in Chapter 3 and in Prasad et al. (2008).
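As a reading aid only, the following minimal Python sketch shows one way the alignment blocks in this appendix could be inspected. It is not part of the original analysis; the function names `gap_runs` and `group_by_gaps` are hypothetical. It locates runs of gap characters in four rows copied from alignment a (ASZ1) and groups the species by the gaps they carry. A gap run whose length is a multiple of three is consistent with an in-frame coding indel.

```python
import re

def gap_runs(seq):
    """Return (start, length) for every run of '-' in an aligned sequence."""
    return [(m.start(), len(m.group())) for m in re.finditer(r"-+", seq)]

def group_by_gaps(rows):
    """Map each distinct gap pattern to the species that share it."""
    groups = {}
    for species, seq in rows:
        groups.setdefault(tuple(gap_runs(seq)), []).append(species)
    return groups

# Four rows taken from alignment a (ASZ1) in this appendix.
rows = [
    ("human", "AATTGAAAAGGC---TATTACCCATTGAA"),
    ("dusky_titi", "AATTTAAAGGGCAAATGTTACCCATTGAA"),
    ("marmoset", "AATTGAAAGGGCAAATGTTACCCATTGAA"),
    ("galago", "AATTAAAAGGTC---CGTTACCTATTGAA"),
]

groups = group_by_gaps(rows)
for pattern, species in groups.items():
    if not pattern:
        label = "no gap"
    elif all(length % 3 == 0 for _, length in pattern):
        label = "in-frame gap"
    else:
        label = "frameshifting gap"
    print(species, pattern, label)
```

In these four rows, the 3-bp gap at offset 12 is shared by human and galago, while the two New World monkeys have three additional bases (AAA) at that position.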

a - ASZ1
------
human            AATTGAAAAGGC---TATTACCCATTGAA
chimp            AATTGAAAGGGC---TGTTACCCATTGAA
gorilla          AGTTGAAAGGGC---TGTTACCCATTGAA
orangutan        AATTGAAAGGGC---TGTTACCCATTGAA
gibbon           AATTGAAAGGGC---TGTTACCCATTGAA
colobus_monkey   AATTGAAAGGGC---TGTTACCCATTGAA
vervet           AATTGAAAGGGC---TGTTACCCATTGAA
baboon           AATTGAAAGGGC---TGTTACCCATTGAG
macaque          AATTGAAAGGGC---TGTTACCCATTGAG
dusky_titi       AATTTAAAGGGCAAATGTTACCCATTGAA
owl_monkey       AATTGAAAGGGCAAACGTTACCCATTGAA
squirrel_monkey  AATTGAAAGGGCAAATGTTACCCATTGAA
marmoset         AATTGAAAGGGCAAATGTTACCCATTGAA
galago           AATTAAAAGGTC---CGTTACCTATTGAA
mouse_lemur      AATTAAAAGTTC---CATTACCTGTTGAA
lemur            AATTAAAAGGGC---CATTACCTGGTGAA
rabbit           AATTAAAAGGAC---CATTACCCATAGAA
st_squirrel      AATTAAAAAGGC---CGTTGCCTATAGAA
guinea_pig       AATTAAAAAGGC---CATTACCCTCTGAA
rat              AATTAAAAAGGT---CTTTACCCGTTGAA
mouse            AATTAAAAAGGT---CTTTACCCGTTGAA
hedgehog         AATTGAGAGGTC---CCTTACCCACTGAA
shrew            AATTAAAAGGCC---CATTACCCACCGAA
rfbat            AATTGAAAGGAC---CGTTACCCACTGAG
cat              AATTAAAAGGGC---CATTACTCCCTGAG
clouded_leopard  AATTAAAAGGGC---CATTACTCCCTGAG
ferret           AATTAAAAGGGC---CATTGCTCCCTGAA
muntjak_indian   AATTAAAAGGAC---CGTTACCTGTTGAA
sheep            AATTAAAAGGAC---CATTACCCGTTGAA
cow              AATTAAAAGGAC---CGTTACCAGTTGAA
horse            AACTGAAAGAGC---CGTTGCCTATTGAA
armadillo        AATTAACAGGGC---TGTTACCCACTGAA
elephant         AATTAACGAGTC---CGTTACCCACTGAA
tenrec           AATTTACAGCGC---CGTTACCTGCTGAA
wallaby          CAACGAAAGGGT---CCTTAGCGACTGAA
opossum          CAATGAAAGGGT---CCTCAGTGGCTGAA
platypus         AGTTGAAAGACT---CATTACTGAGCGAA
chicken          AGCTGGATCAGG---TGTCGCCGGC---A

b - ASZ1
------
human            GCATTA---GTGTAGATTCCAACTTTC
chimp            GCATTA---GTGTAGATTCCACCTTTC
gorilla          GCATTA---GTGTAGATTCCACCTTTC
orangutan        GCATTA---GTGTAGATTCCACCTTTC
gibbon           GCATTA---GTGTAGATTCCACCTTTC
colobus_monkey   GCATTA---GTGTAGATTCCACCTTTC
vervet           GCATTA---GTGTAGATTCCACCTTTC
baboon           GCATTA---GTGTAGATTCCACCTTTC
macaque          GCATTA---GTGTAGATTCCACCTTTC
dusky_titi       GCATTA---GTGTAGACGCTACCTTTC
owl_monkey       GCATTATTAGTGTAGACGATACCTTTC
squirrel_monkey  GCATTATTAGTGTAGACGCTACCTTTC
marmoset         GCATTATTAGTGTAGACGCTACCTTTC
galago           GCATCA---GTGTCGACTCCAGTTTTC
mouse_lemur      GAATTA---GTGTAGATTCCAGCTTTC
lemur            GAATTA---GTGTAGATTCCAGCTTTC
rabbit           GTATTA---GTGTGGATTCCAGTTTTC
st_squirrel      GCATTA---GTGTAGATTCCAGCTTTC
guinea_pig       GCATTA---ACATAGAGTCTAGCTTTC
rat              GCATTA---ATGTAGATTCCAGCTTCC
mouse            GCATTA---ATGTGGACTCCAGCTTCC
hedgehog         GCATTA---GTGTGGATACCAGCTTTC
shrew            GCATAA---GTGTAGAGTCAAGCTTTC
rfbat            GCATTA---GTGTAGACTCCAGCTTTC
cat              GCCTTA---GCGTAGATTCCAGCTTTC
clouded_leopard  GCCTTA---GCGTAGATTCCAGCTTTC
ferret           GAATTA---GTGTAGAGTCCAGCTTTC
muntjak_indian   GCATCA---GTGTCAACACCAGCTTTC
sheep            GCATCA---GTGTAGATACCAGCTTTC
cow              GCATCA---GTGTAGATACCAGCTTTC
horse            GCATTA---GTGTAGATTCCAGCTTTC
armadillo        GCATTA---GTATAGATTCCAGTTTTC
elephant         GCATCA---GTGTAGAGTCCAGCTTTC
tenrec           GCATCA---GTGTGGACTCCAGCTTTC
wallaby          GAATGA---GTGTGGAATCCAGTTTTC
opossum          GAATGA---ATGTGGAATCCAACTTTC
platypus         GCATGC---AAGTCGACTCAAGTTTTC
chicken          GTATAG---ACATAGAAGCTAACATTC
fugu             GTTTGG---ATGTGGAGACGAGGCTGA
tetra            GTTTGG---ATGTGGAGATGAGGCTGG

c - CAPZA2
------
human            TGGCAGACGAGC---TGAGCGAGA
chimp            TGGCAGACGAGC---TGAGCGAGA
gorilla          TGGCAGACGAGC---TGAGCGAGA
orangutan        TGGCAGAGGAGC---TGAGCGAGA
gibbon           TGGCAGACGAGC---TGAGCGAGA
colobus_monkey   TGGCAGACGAGC---TGAGCGAGA
vervet           TGGCAGACGAGC---TGAGCGAGA
baboon           TGGCAGACGAGC---TGAGCGAGA
macaque          TGGCAGACGAGC---TGAGCGAGA
dusky_titi       TGGCAGACGAGC---TGAGCGAGA
owl_monkey       TGGCAGACGAGC---TGAGCGAGA
squirrel_monkey  TGGCAGACGAGC---TGAGCGAGA
marmoset         TGGCAGACGAAC---TGAGTGAGA
galago           TGGCAGAGGAGG---TGAACGAGA
mouse_lemur      TGGCAGACGAGG---TGAGCGAGA
lemur            TGGCAGACGAGG---TGAGCGAGA
rabbit           TGGCCGACGAGG---TGAACGAGA
st_squirrel      TGGCAGACGAGA---TGAACGAGA
guinea_pig       TGGCAGACGAAG---TGAGCGAGA
rat              TGGCAGACGAGG---TGAATGAGA
mouse            TGGCAGACGAGG---TGACTGAGA
hedgehog         TGGCAGAGGAGA---TGAACGAAA
shrew            TGGCAGACGAGA---TGAACGAGA
rfbat            TGGCAGACGAAA---TGAACGAAA
cat              TGGCAGAAGAGA---TTAACGAGA
clouded_leopard  TGGCAGAAGAGA---TTAACGAGA
ferret           TGGCAGAAGAGA---TAAACGAGA
dog              TGGCGGAGGAGA---TGAGCGAGA
pig              TGGCAGAGGAAA---TGAACGAGA
sheep            TGGCAGAGGAGA---TGAACGAGA
cow              TGGCAGAGGAAA---TGAACGAGA
horse            TGGCAGACGAGA---TGAACGAGA
armadillo        TGGCAGAGGAGA---TGAATGAAA
elephant         TGGCAGAGGAGA---TGCACGAGA
tenrec           TGGCCGAAGACA---TGAACGAGA
wallaby          TGACCGACGAAA---TGAACGAGA
opossum          TGGCCGACGAAA---TGAACGAGA
platypus         TGGCAGACGACC---TGCCCGAGA
chicken          TGGCAGACGAGC---TGAGCGAGA
fugu             TGGACAACGACAGCCTGAACGAGA
tetra            TGGACAACGACAGCCTGAACGAGA

d - CAPZA2
------
human            TGAGCGAGA---AGCAAGTGTACG
chimp            TGAGCGAGA---AGCAAGTGTACG
gorilla          TGAGCGAGA---AGCAAGTGTACG
orangutan        TGAGCGAGA---AGCAAGTGTACG
gibbon           TGAGCGAGA---AGCAAGTGTACG
colobus_monkey   TGAGCGAGA---AGCAAGTGTACG
vervet           TGAGCGAGA---AGCAAGTGTACG
baboon           TGAGCGAGA---AGCAAGTGTACG
macaque          TGAGCGAGA---AGCAAGTGTACG
dusky_titi       TGAGCGAGA---AGCAGGTTTACG
owl_monkey       TGAGCGAGA---AGCAAGTTTACG
squirrel_monkey  TGAGCGAGA---AGCAAGTGTACG
marmoset         TGAGTGAGA---AGCAAGTTTACG
galago           TGAACGAGA---AGCAAGTCTACG
mouse_lemur      TGAGCGAGAAGCAGCAAGTCTATG
lemur            TGAGCGAGAAGCAGCAAGTCTACG
rabbit           TGAACGAGA---AGCAAGTGTACG
st_squirrel      TGAACGAGA---AGCAAGTGTACG
guinea_pig       TGAGCGAGA---AACAAGTGTACG
rat              TGAATGAGA---AGCAAGTGTACG
mouse            TGACTGAGA---AGCAAGTGTATG
hedgehog         TGAACGAAA---AACAAATGTACG
shrew            TGAACGAGA---AGCAAGTGTACG
rfbat            TGAACGAAA---AGCAAGTGTACG
cat              TTAACGAGA---AGCAAGTGTACG
clouded_leopard  TTAACGAGA---AGCAAGTGTACG
ferret           TAAACGAGA---AGCAAGTGTACG
dog              TGAGCGAGA---AGCAGGTGTACG
pig              TGAACGAGA---AGCAAGTGTATG
sheep            TGAACGAGA---AGCAAGTGTACG
cow              TGAACGAGA---AGCAAGTGTACG
horse            TGAACGAGA---AGCAAGTGTACG
armadillo        TGAATGAAA---AGCAAGTGTACG
elephant         TGCACGAGA---AGCAAGTGTACG
tenrec           TGAACGAGA---AGCAAGTGTACG
wallaby          TGAACGAGA---AGCAAATGTACG
opossum          TGAACGAGA---AGCAAATGTACG
platypus         TGCCCGAGA---AGCAGCTGTATG
chicken          TGAGCGAGA---AGGCGGTGCACG
fugu             TGAACGAGA---AGTCGATGGAGG
tetra            TGAACGAGA---AGTCGCTGGAGG

e - CFTR
------
human            GATAGAGAGCTGG---CTTCAAAGAAAAATCCT
chimp            GATAGAGAGCTGG---CTTCAAAGAAAAATCCT
gorilla          GATAGAGAGCTGG---CTTCAAAGAAAAATCCT
orangutan        GATAGAGAGCTGG---CTTCAAAGAAAAATCCC
gibbon           GATAGAGAGCTGG---CTTCAAAGAAAAATCCC
colobus_monkey   GATAGAGAGCTGG---CTTCAAAGAAAAATCCC
vervet           GATAGAGAGCTGG---CTTCAAAGAAAAATCCC
baboon           GATAGAGAGCTGG---CTTCAAAGAAAAATCCC
macaque          GATAGAGAGCTGG---CTTCAAAGAAAAATCCC
dusky_titi       GATAGAGAGCTGG---CTTCAAAGAAAAATCCC
owl_monkey       GATAGAGAGCTGG---CTTCAAAGAAAAATCCC
squirrel_monkey  GATAGAGAGCTGG---CTTCAAAGAAAAATCCC
marmoset         GATAGAGAGCTGG---CTTCAAAGAAAAACCCC
galago           GATAGAGAGCTGG---CTTCAAAGAAAAATCCC
mouse_lemur      GATAGAGAGCTGG---CTTCAAAGAAAAATCCC
lemur            GATAGAGAACTGG---CTTCAAAGAAAAATCCC
rabbit           GATAGAGAGCTGG---CTTCAAAGAAAAAACCC
st_squirrel      GATAGAGAACTGG---CTTCAAAGAAAAATCCC
guinea_pig       GATAGAGAGCTGG---CTTCAAAGAAAAATCCA
rat              GACAGGGAACAAG---CTTCCAAAAAGAAGCCC
mouse            GACAGAGAACAAG---CTTCAAAAAAGAATCCC
hedgehog         GACAGAGAGCTGG---CTTCAAAGAAAAACCCC
shrew            GACAGAGAGCTGG---CTTCCAAAAAAAATCCC
rfbat            GACAGAGAGCTGG---CATCCAAGAAAAATCCG
cpbat            GACAGAGAACTGG---CATCAAAGAAAAATCCC
cat              GACAGAGAACTGG---CATCAAAGAAAAACCCC
clouded_leopard  GACAAAAAACTGG---CATCAAAAAAAAACCCC
ferret           GACCGAGAACTGG---CATCAAAGAAAAACCCC
dog              GACAGAGAGCTGG---CATCAAAGAAGAACCCC
pig              GACAGAGAACTGG---CTTCAAAGAAGAATCCC
muntjak_indian   GACAGAGAACTGG---CTTCAAAGAAAAACCCC
sheep            GACAGAGAACTGG---CTTCAAAGAAAAATCCC
cow              GACAGAGAACTGG---CTTCAAAGAAAAATCCC
horse            GACAGAGAGCTGG---TGTCAAAGAAAAATCCC
elephant         GACAGAGAGCTGG---CTTCCAAGAAAAATCCC
tenrec           GACAGAGAGCTGG---CTTCAAAGAAAAATCCC
wallaby          GACAGAGAGCTGG---CATCAAAGAAAAAACCC
monodelphis      GACAGAGAGCTGG---CATCAAAGAAAAAACCC
opossum          GACAGAGAGCTGG---CATCAAAGAAAAAACCC
platypus         GACAGAGAGCTGG---CTTCCAAGAAGAACCCC
chicken          GACAGAGAGCTGGCAACTTCAAAGAAAAAACCC
fugu             GACAGAGAGATCGTCTCAGCCAAGAAGCGACCC
tetra            GACAGGGAGATCGTCTCAGCCAAGAAGCGTCCC

f - CFTR
------
human            CAGTAATTTCTCACTTCTTG
chimp            CAGTAATTTCTCACTTCTTG
gorilla          CAGTAATTTCTCACTTCTTG
orangutan        CAGTAATTTCTCACTTCTTG
gibbon           CAGTAATTTCTCACTTCTTG
colobus_monkey   CAGTAATTTCTCACTTCTTG
vervet           CAGTAATTTCTCACTTCTTG
baboon           CAGTAATTTCTCACTTCTTG
macaque          CAGTAATTTCTCACTTCTTG
dusky_titi       CAGTAATTTCTCACTTCTTG
owl_monkey       CAGTAATTTCTCACTTCTTG
squirrel_monkey  CAGTAATTTCTCACTTCTTG
marmoset         CAGTAATTTCTCACTTCTTG
galago           CAGTAATTTAGCACTTCTTG
mouse_lemur      CAGTAATTTAGCACTTCTTG
lemur            CAGTAATTTAGCACTTCTTG
rabbit           CAGTAATTTCTCACTTCTTG
st_squirrel      CAGTAACTTCTCCCTCCTTG
guinea_pig       CAGTAACTTCTCCCTTCTTG
rat              TAGTCATCTCTGTCTAGTGG
mouse            CAGTCATCTCTGCCTTGTGG
hedgehog         CAGCAACTTTTCACTGCTTG
shrew            CAGCAACTTTTCACTTCTTG
rfbat            CAGTAATTTCGCAATTCTTA
cpbat            CAGTAACTTCTCGCTTCTTG
cat              CAGCAACTTCTCACTTCTTG
ferret           CAGCAATTTCTCACTTCTTG
dog              CAGCAACTTCTCACTTCTTG
pig              CAGTAACTTTTCACTTCTCG
muntjak_indian   TAGTAACTT---ACTTCTCG
sheep            CAGTAACTT---ACTTCTTG
cow              CAGTAACTT---ACTTCTTG
horse            CAGTAACTTCTCACTTCTTG
armadillo        CAGTAACTTCTCACTTCTTG
elephant         CAGTAACTTCTCACTTCTTG
tenrec           CAGTAACTTCGCACTACTCG
wallaby          TAGCAATTTCCAACTTCTTG
monodelphis      TAGCAACTTCTCACTTCTTG
opossum          TAGCAATTTCTCACTTCTTG
platypus         TAGTAACATCTCGCTCCTCG
chicken          CAGCAATTTTCCCCTTCATG
fugu             CACCAACTTGTA------TG
tetra            CACCAACCTGTA------CG

g - CFTR
------
human            GTAATTTCTCACTTCTTGGTACTCCTGTC
chimp            GTAATTTCTCACTTCTTGGTACTCCTGTC
gorilla          GTAATTTCTCACTTCTTGGTACTCCTGTC
orangutan        GTAATTTCTCACTTCTTGGTACTCCTGTC
gibbon           GTAATTTCTCACTTCTTGGTACTCCTGTC
colobus_monkey   GTAATTTCTCACTTCTTGGTACTCCTGTC
vervet           GTAATTTCTCACTTCTTGGTACTCCTGTC
baboon           GTAATTTCTCACTTCTTGGTACTCCTGTC
macaque          GTAATTTCTCACTTCTTGGTACTCCTGTC
dusky_titi       GTAATTTCTCACTTCTTGGTACTCCTGTC
owl_monkey       GTAATTTCTCACTTCTTGGTACTCCTGTC
squirrel_monkey  GTAATTTCTCACTTCTTGGTACTCCTGTC
marmoset         GTAATTTCTCACTTCTTGGTACTCCTGTC
galago           GTAATTTAGCACTTCTTGGTGCTCCTGTC
mouse_lemur      GTAATTTAGCACTTCTTGGTACTCCTGTC
lemur            GTAATTTAGCACTTCTTGGTACTCCTGTC
rabbit           GTAATTTCTCACTTCTTGGTGCTCCTGTC
st_squirrel      GTAACTTCTCCCTCCTTGGTGCCCCTGTA
guinea_pig       GTAACTTCTCCCTTCTTGGTAGCCCTGTC
rat              GTCATCTCTGTCTAGTGGGAAATCCTGTG
mouse            GTCATCTCTGCCTTGTGGGAAATCCTGTG
hedgehog         GCAACTTTTCACTGCTTGGTACTCCTGTC
shrew            GCAACTTTTCACTTCTTGGTACTCCTGTC
rfbat            GTAATTTCGCAATTCTTAGTACTCCTGTG
cpbat            GTAACTTCTCGCTTCTTGGTACTCCTGTC
cat              GCAACTTCTCACTTCTTGGTGCTCCTGTC
ferret           GCAATTTCTCACTTCTTGGTGCTCCTGTC
dog              GCAACTTCTCACTTCTTGGTACTCCTGTC
pig              GTAACTTTTCACTTCTCGGTACTCCTGTC
muntjak_indian   GTAACTT---ACTTCTCGGTGCTCCTGTC
sheep            GTAACTT---ACTTCTTGGTACTCCTGTC
cow              GTAACTT---ACTTCTTGGTACTCCTGTC
horse            GTAACTTCTCACTTCTTGGTACTCCGGTC
armadillo        GTAACTTCTCACTTCTTGGCACTCCTGTC
elephant         GTAACTTCTCACTTCTTGGTGCTCCTGTC
tenrec           GTAACTTCGCACTACTCGGCACTCCTGTG
wallaby          GCAATTTCCAACTTCTTGGTACTCCTGTT
monodelphis      GCAACTTCTCACTTCTTGGTACTCCTGTC
opossum          GCAATTTCTCACTTCTTGGTACTCCTGTC
platypus         GTAACATCTCGCTCCTCGGTCCCCCTGTC
chicken          GCAATTTTCCCCTTCATGCTTCACCTGTC
fugu             CCAACTTGTA------TGTCACACCAGTT
tetra            CCAACCTGTA------CGTCACACCGGTC

h - CFTR
------
human            GTTTCTCATTAGAAGGAGATGCTCCTG
chimp            GTTTCTCATTAGAAGGAGATGCTCCTG
gorilla          GTTTCTCATTAGAAGGAGATGCTCCTG
orangutan        GTTTCTCATTAGAAGGAGATGCTCCTG
gibbon           GTTTCTCATTAGAAGGAGATGCTCCTG
colobus_monkey   GTTTCTCATTAGAAGGAGATGCTCCTA
vervet           GTTTCTCATTAGAAGGAGATGCTCCTG
baboon           GTTTCTCATTAGAAGGAGATGCTCCTG
macaque          GTTTCTCATTAGAAGGAGATGCTCCTG
dusky_titi       GTTTCTCATTAGAAGGAGATGCTCCTG
owl_monkey       GTTTCTCATTGGAAGGAGATGCTCCTG
squirrel_monkey  GTTTCTCATTAGAAGGAGATGCTCCTG
marmoset         GTTTCTCTTTAGAAGGAGATGCTCCTG
galago           GTTTCTCATTAGAAGGAGACGCCGCTG
mouse_lemur      GTTTCTCATTGGAAGGAGATGCTGCTG
lemur            GTTTCTCATTAGAAGGAGATGCTGCTG
rabbit           GTTTCTCATTAGAAGGAGATGCTAGCA
st_squirrel      GTTTCTCATTAGAAGGAGATGCTTCTG
guinea_pig       GTTTCTCCTTAGAAGGAGATCCCTCTG
rat              GGTTCTCAGTGGA---CGATGCCTCTA
mouse            GGTTCTCAGTAGA---CGATTCCTCTG
hedgehog         GTTTCTCATTAGAAGGAGATGGTGCTG
shrew            GTTTCTCACTGGAAGGGGATGCTGCCG
rfbat            GTTTCTCATTTGAAGGAGATGCTGCTG
cpbat            GTTTCTCATTAGAAGGAGATGCTGCCA
cat              GTTTCTCATTGGAAGGAGATGCGGCCG
ferret           GTTTTTCATTAGAAGGAGATGCGCCTG
dog              GTTTCTCCTTAGAAGGAGATGCGGCCG
pig              GTTTCTCATTAGAAGGAGATGCCTCTG
muntjak_indian   GTTTCTCATTAGAAGGAGATACTTCTG
sheep            GTTTCTCATTAGAAGGAGATACTTCTG
cow              GTTTCTCATTAGAAGGAGATACTTCTG
horse            GTTTCTCATTAGAAGGAGATGCTACTG
armadillo        GTTTTTCATTAGAAGGAGATGCTGCTC
elephant         GTTTCTCATTAGAAGGAGATAC---TG
wallaby          GTTTCTCAGTAGAAGGTGATACTGCCG
monodelphis      GTTTCTCAATAGAAGGTGATACTGCTG
opossum          GTTTCTCAATAGAAGGTGATACTGCCG
platypus         GTTTCTCCCTGGACGGCGATGCCGCCG
chicken          GATTTTCCTTTGAAGGAGAAAGCATGG
fugu             GGGTCTCCGTCGATGAAACTGCAGGTT
tetra            GGGTCTCCGTTGATGAAACCGCCAGTT

i - CFTR
------
human            CAGAAACAAAAAAACAATCTTTTAA
chimp            CAGAAACAAAAAAACAATCTTTTAA
gorilla          CAGAAACAAAAAAACAATCTTTTAA
orangutan        CAGAAACAAAAAAACAACCTTTTAA
gibbon           CAGAAACAAAAAAACAATCTTTTAA
colobus_monkey   CCGAAACAAAAAAACAATCTTTTAA
vervet           CCGAAACAAAAAAACAATCTTTTAA
baboon           CCGAAACAAAAAAACAATCTTTTAA
macaque          CCGAAACAAAAAAACAATCTTTTAA
dusky_titi       CGGAAACAAAAAAACAATCTTTTAA
owl_monkey       CGGAAACAAAAAAACAATCTTTTAA
squirrel_monkey  CGGAAACAAAAAAACAATCTTTTAA
marmoset         CAGAAACAAAAAAACAATCTTTTAA
galago           ATGAGACGAAAAAACAATCTTTTAA
mouse_lemur      ATGAAACAAAAAAACAATCCTTTAA
lemur            ATGAACCAAAAAAACAATCCTTTAA
rabbit           ATGATACCAGAAAACAATCTTTTAA
st_squirrel      ATGAAACAAAAAAACAATCTTTTAA
guinea_pig       ATGAAACAAAAAAACAATCTTTTAA
rat              ACAAAGCCAAA---CAGTCATTTAG
mouse            GCAAACCCAAA---CAGTCGTTTAG
hedgehog         ATGAAACAAGAAAGCAGTCTTTTAA
shrew            ATGAAGCCAGAAAGCAATCTTTTAA
rfbat            ATGAACCGAAAAAACAGTCTTTTAA
cpbat            ATGAAACAAAAAAACAGTCTTTTAA
cat              ATGAAACAAAAAAGCAGTCTTTTAA
ferret           ATGAAACAAAAAAGCAATCTTTTAA
dog              ATGAAACAAAAAAGCAATCTTTTAA
pig              ACGAAACAAAAAAACAATCCTTTAA
muntjak_indian   ATGAAACGAAAAAGCCTTCTTTTAA
sheep            ATGAAACAAAAAAGCCTTCTTTTAA
cow              ATGAAACGAAAAAGCCTTCTTTTAA
horse            ACGAAACAAAAAAACAATCTTTTAA
armadillo        ATGAAACAAAAAAACAATCTTTTAA
elephant         ATGAAACAAAAAAGCAGTCTTTTAA
wallaby          ATGAAGGAAAAAAACAATCTTTCAA
monodelphis      ATGAAGGAAAAAAGCAATCTTTTAA
opossum          ATGAAGGAAAAAAGCAATCTTTTAA
platypus         ATGAAACTAAGAAGCAGTCGTTCAA
chicken          ATGAAATGAAGAAGCAGTCTTTTAA
fugu             CTGAGTCCATTCGCCAGTCATTCCG
tetra            CTGAGCCCATTCGCCAGTCATTCCG

j - CFTR
------
human            AAGGAAGAATTCT---ATTCTCAATCCA
chimp            AAGGAAGAATTCT---ATTCTCAATCCA
gorilla          AAGGAAGAATTCT---ATTCTCAATCCA
orangutan        AAGGAAGAATTCT---ATTCTCAATCCA
gibbon           AAGGAAGAATTCT---ATTCTCAATCCA
colobus_monkey   AAGGAAGAATTCT---ATTCTCAATCCA
vervet           AAGGAAGAATTCT---ATTCTCAATCCC
baboon           AAGGAAGAATTCT---ATTCTCAATCCA
153 macaque AAGGAAGAATTCT---ATTCTCAATCCA dusky_titi AAGAAAGAATTCT---ATTCTCAATTCC owl_monkey AAGAAAGAATTCT---ATTCTCAATTCC squirrel_monkey AAGAAAGAATTCT---ATTCTCAATTCC marmoset AAGAAAGAATTCT---ATTCTCAATTCC galago AAGAAAGAATTCT---ATTCTCAATCCA mouse_lemur AAGGAAAAATTCT---ATTCTCAATTCA lemur AAGGAAAAATTCT---ATTCTCAATTCA rabbit ACGGAAGAACTCT---ATTCTCAATCCA st_squirrel AAGGAAGAACTCT---ATTCTCAATCCA guinea_pig AAGGAAAAACTCA---ATTCTCAATTCA rat AAGGAAGAACTCT---ATTCTAAGTTCA mouse AAGGAAGAACTCT---ATTCTAAATTCA hedgehog AAGGAAGAACTCT---ATTCTCAATTCA shrew AAGGAAGAACTCT---GTTCTCAATCCA rfbat AAGGAAGAGCTCT---GTTCTCAATCCA cpbat GAGGAAGAACTCT---ATTCTCAATCCA cat AAGGAAGAACTCT---ATTCTTAATCCA ferret AAGGAAGAACTCT---ATTCTTAATCCA dog AAGGAAGAACTCT---ATTCTTAATCCA pig AAGGAAGAATTCC---ATTCTCAATTCA muntjak_indian AAGGAAGAACTCC---ATTCTCAATTCA sheep AAGGAAGAACTCC---ATTCTCAATTCT cow AAGGAAGAACTCC---ATTCTCAGTTCA horse AAGGAAGAACTCT---ATTCTCAATCCA armadillo AAGAAAGAACTCT---ATTCTCAATCCA elephant AAGGAAGAACTCT---ATTCTCAATTCA wallaby AAGAAAGAATTCT---ATTCTCAACCCA monodelphis AAGAAAGAATTCT---ATTCTTAGTTCA opossum AAGAAAGAATTCT---ATTCTCAATCCA platypus AAGAAAGAACTCC---ATCCTCAACCCG chicken AAGGAAGAACTCCATAATTATTAACCCA fugu ACGCAAGCAGTCCCTCATCCTCAGTCCC tetra ACGCAAACAGTCCCTCATCCTCAGTCCC k - CFTR ------human CTCCCTTACAAATGAATGGCAT---CGAAGAGGATTCTGATG chimp CTCCCTTACAAATGAATGGCAT---CGAAGAGGATTCTGATG gorilla CTCCCTTACAAATGAATGGCAT---CGAAGAGGATTCTGATG orangutan CTCCCTTACAAATGAATGGCAT---CGAAGAGGATTCTGATG gibbon CTCCCTTACAAATGAATGGCAT---CGAAGAGGATTCTGATG colobus_monkey CTCCCTTACAAATGAATGGCAT---CGAAGAGGATTCTGATG vervet CTCCCTTACAAATGAATGGCAT---CGAAGAGGATTCTGATG baboon CTCCCTTACAAATGAATGGCAT---CGAAGAGGATTCTGATG macaque CTCCCTTACAAATGAATGGCAT---CGAAGAGGATTCTGATG dusky_titi CTCCCTTACAAATGAATGGCAT---CGAAGAGGATTCTGATG owl_monkey CTCCCTTGCAAATGAATGGCAT---TGAAGAGGATTCTGATG squirrel_monkey CTCCCTTACAAATGAATGGTAT---CGAAGAGGATTCTGATG marmoset CTCCCTTACAAATGAATGGCAT---CGAAGAGGATTCTGATG galago 
CTCCACTCCCAATGAATGGCATTGATGAGGAGGACTCGGAGG mouse_lemur CTCCATTACAAATGAGTGGTAT---TGAGGAGGATCCAGACG lemur CTCCATTACAAATGAGTGGCAT---TGAAGAGGATCCTGATG rabbit CTCCTTTACAAATGAATGGCAT---TGAAGAAGATTCGGACG st_squirrel CCCCATTACAAGGAAATGGCAT---GGAAGATGAATCTGATG guinea_pig CTCCTTTGCAAATAAGTGGCAT---TGAAGAGGATTCTGATG rat CTCCATTGTCCAT------AGAAGGAGAATCTGATG mouse CTCCATTATGTAT------CGATGGAGAGTCTGATG hedgehog CTACATTGCAAATGAATGGCAT---TGAAGCAAATTCTGAAG shrew CGCCGTTGCAAATGAACGGTAT---TGAAGAAGACTCGGAGG rfbat CTCCATTACAAATGAATGGAAT---TGAAGAAGACTCTGAAG cpbat CTCCATTACAAATGAATGGCAT---CGAAGAAGATTCAGAAG

154 cat CTCCGTTACAAATGAATGGCAT---CGAAGGAGATCCTGATG ferret CTCCATTACAAATGAATGGCAT---TGAAGAAGATTCCGATG dog CCCCGTTACCCATGAATGGCAT---CGAAGGAGAGTCTGAGG pig CTCCCTTACAAATGAATGGCTT---TGAAGAAGATTCTGGCG muntjak_indian CTTCATTACAAATGAATGGAAT---TGAAGAAACTTCTGATG sheep CTTCATTACAAATGAATGGTAT---CGATGGAGCTTCTGATG cow CTTCATTACAAATGAATGGTAT---CGAAGGAGCTGCTGATG horse CTCCGTTACAAATGAATGGCAT---TGAAGAAGATTCTGATG armadillo CTCCATTGCCAATGAATGGCAT---TGATGAGGATTCTGGTG elephant CTCCACTACAAATGAATGGTAT---TGATGAAGATTCTGACG wallaby GTCAACAGCAGATGAATGGCAT---TGAAGAAGATGATGATG monodelphis GTCAACCGCAAATGAATGGCAT---TGAGGAAAATGATGATG opossum GTCAACCACAAATGAATGGCAT---TGAGGAAAATGATGATG platypus CTCCGCTCCAGATGAACTGCAT---CGACGAGACCCTGGACG chicken ATGGGACGCAGGTCAATGGACT---GGAAGATGGCCACATTG fugu ATGCCGCCCAGGCCACGGCGAC---AGAGGATGGGGTCCACG tetra ACGCCACCCAGTCCATGGCACC---AGAGGAGGGGGTTCATG l - CFTR ------human CCGAAAGACAACAGCATCCACACGAAAAG chimp CCGAAAGACAACAGCATCCACACGAAAAG gorilla CCGAAAGACAACAGCATCCACACGAAAAG orangutan CCGAAAGACAACAGCATCCACACGAAAAG gibbon CCGAAAGACAACAGCATCCACACGAAAAG colobus_monkey CAGAAAGACAGCAGCATCCACACGAAAAG vervet CAGAAAGACAGCAGCATCCACACGAAAAG baboon CAGAAAGACAGCAGCATCCACACGAAAAG macaque CAGAAAGACAGCAGCATCCACACGAAAAG dusky_titi CCGGAAGACAACAGCATCCACACGAAAAG owl_monkey CCGAAAGACAACAGCATCCACACGAAAAG squirrel_monkey CCGAAAGACAACAGCATCCACACGAAAAA marmoset CCGAAAGACAACAGCATCCACACGAAAAG galago CCGAAAGACTGCAGCATCTACACGTAAGA mouse_lemur CCGAATAAACGCAGCATCCACACGAAAAA lemur CCGAATAACCACAGCAGCCACACGAAAAA rabbit CAGAAGGACAACCACATCCGCACGAAAAA st_squirrel CCGGAGGACAACAA---CTACACGAAAAA guinea_pig CCGGACGACAACAGCACCCTCAAGAAAAA rat CCAGAGGACCAGAGCTTCTATTCGAAAAA mouse TCAGAGGACCAGAACTTCTATTCGAAAAA hedgehog CCGAAAGGCAGCAGCATCCACACGGAAAA shrew CCGAAGGGCCACAGCAGCTGCTCGAAAGA rfbat CAGAAGCACAGCAGCATCCACAAGAAAAA cpbat CCAAAAGACAGCACCATCTACTAGGAAAA cat CCGAAGGACAACAGCATCCACACGAAAAA ferret TGGAAGGACAGCCTCATCCACACGTAAAA dog CCGAAGGACAACAGCATCCACGCGAAAAA pig CCGAAAGACAGCGACGTCCACACGAAAAA muntjak_indian 
TCGAAAGACAGCGACATCCACACGAAAAA sheep TCGAAAGACAGCGACATCCACACGAAAAA cow TCGAAAGACAGCGACATCCACACGAAAAA horse CCGAAAGACAGCGACATCCACACGAAAAA armadillo CAGGAGGACAACAGCATTCACACGAAAAA elephant CCGAAGGACAGCAGCATCCGCACGAAAAA wallaby CAGAACTGGAAGTGCTGCTGTCAGGAAGA monodelphis TAAAACGGGGAATG---CTGCCAGGAAGA opossum TAAAACGGGGAATG---CTGCCAGGAAGA platypus CAAGAAGGGCAGCGCCTCGGCCCGCAAGA chicken CAAAAAGGGAAGTACATCATTTAGGAAGA fugu AGCGCAGATACAGTCCTCTTTCAGAAAAA tetra AGAACAGATACAGTCATCTTTCAGAAAAA

155 m - CFTR ------human CTTCATTTCCATTTTAACAACAG chimp CTTCATTTCCATTTTAACAACAG gorilla CTTCATTTCCATTTTAACAACAG orangutan CTTCATTTCCATTTTAACAACAG gibbon CTTCATTTCCATTTTAACAACAG colobus_monkey CTTCATTTCCATTTTAACAACAG vervet CTTCATTTCCATTTTAACAACAG baboon CTTCATTTCCATTTTAACAACAG macaque CTTCATTTCCATTTTAACAACAG dusky_titi CTTCATTTCCATTTTAACAACAG owl_monkey CTTCATTTCCATTTTAACAACAG squirrel_monkey CTTCATTTCCATTTTAACAACAG marmoset CTTCATTTCCATTTTAACTACAG galago CTTCATTTCCATTTTAACAACAG mouse_lemur CTTCATTTCCATTTTAACAACAG lemur CTTCATTTCCATTTTAACAACAG rabbit CTTCATTTCCATTTTAACAACAG st_squirrel CTTCATTTCCATTTTAACAACAG guinea_pig CTTCATTTCCATTTTAACAACAG rat CTTCATCTCCATTTTAACAACAG mouse CTTCATCTCCATTTTAACAACAG hedgehog CTTCATTTCCATTTTAACAACAG shrew CTTCATTTCCATTTTAACAACAG rfbat CTTCATTTCTATTTTAACAACAG cpbat CTTCATTTCCATTTTAACAACAG cat CTTCATTTCCATTTTAACAACAG ferret CTTCATTTCCATTTTAACAACAG dog CTTCATCTCCATTTTAACAACAG pig CTTCATTTCCATTTTAACAACAG sheep CTTCATTTCCATTTTAACAACAG cow CTTCATTTCCATTTTAACAACAG horse CTTTATTTCCATCTTAACAACAG armadillo CTTCATTTCCATTTTAACAACAG elephant CTTCGTTTCCATTTTAACAACAG wallaby CTTCATCTCCATTTTAACTACAG monodelphis CTTCATCTCCATTTTAACTACAG opossum GTTCATCTCCATTTTAACTACAG platypus CTTCATCTCCATTTTAACCACAG chicken ATTCATCTCTATCATAACCACAG fugu CTTCATCGCCG---TAGGAACCA tetra CTTCATCGCTG---TAGGAACCA n - CFTR ------human TTGACATGCCAAC---AGAAGGTAA chimp TTGACATGCCAAC---AGAAGGTAA gorilla TTGACATGCCAAC---AGAAGATAA orangutan TTGACATGCCAAC---AGAAGGTAA gibbon TTGACATGCCAAC---AGAAGGTAA colobus_monkey TTGACATGCCAACAGAAGAAGGTAA vervet TTGACATGCCAACAGAAGAAGGTAA baboon TTGACATGCCAACAGAAGAAGGTAA macaque TTGACATGCCAACAGAAGAAGGTAA dusky_titi TTGACATGCCAAC---AGAAGGTAA owl_monkey TTGACATGCCAACAGAAGAAGGTAA squirrel_monkey TTGACATGCCAACAGAAGAAGGTAA marmoset TTGACATGCCAACAGAAGAAGGTAA galago TTGATATGCCGACAGAAGAAGGTAG mouse_lemur TCGACATGCCAACAGAAGAAGGTAA

156 lemur TTGACATGCCAACAGAAGAAGGTAA rabbit TTGACATGCCAACAGAAGAAACTAA st_squirrel TTGACATGCCAACAGAAGACGGTAA guinea_pig TTGACATGC---CAGAAGAAGGTGC rat TTGATATACAAACAGAAGAAAGTAT mouse TTGATATACAAACAGAAGAAAGTAT hedgehog TTGATATGCCAACAGAAGAAATTAA shrew TTGATATGCCAACAGAAGAAAGTAA rfbat TTGACATGCCAGCTGAAGAAAGTAA cpbat TAGACATGCCAACAGAAGAAAGTAA cat TTGACATGCCTACAGAAGAAAGTAA ferret TTGACATGCCAGCAGAAGAAAGTAA dog TTGACATGCCTACAGAAGAAAGTAA pig TTGATATGCCGGCAGAAGGAGATCA sheep TTGATATGCCAACAGAAGATGGTAA cow TTGATATGCCAACAGAAGATGGTAA horse TTGACATGCCAACAGAAGACAGTAA armadillo TTGACATACCAACAGAAGAAAATAA elephant TCGACATGCCAACAGAAGAAAGTAA wallaby TTGATATGCCAAGTGAAGAACCCCT opossum TTGATATGCCCACTGAAGAACCTCT platypus TCGACATGCCAACCGAGGAGCACAG chicken TTGACATGCCAACAGAAGAGATGAA fugu TCGACCTCCCGTCGGAGGAGCTGCT tetra TCGACCTCCCGTCGGAGGAGCTGCT o - CFTR ------human AGAAGGTAA---ACCTACCAAGTCAAC chimp AGAAGGTAA---ACCTACCAAGTCAAC gorilla AGAAGATAA---ACCTACCAAGTCAAC orangutan AGAAGGTAA---ACCTACCAAGTCAAC gibbon AGAAGGTAA---ACCTACCAAGTCAAC colobus_monkey AGAAGGTAA---ACCTACCAGGTCAAC vervet AGAAGGTAA---ACCTACCAGGTCAAC baboon AGAAGGTAA---ACCTACCAGGTCAAC macaque AGAAGGTAA---ACCTACCAGGTCAAC dusky_titi AGAAGGTAA---ACCTACCAAGTCAAC owl_monkey AGAAGGTAA---ACCTACCAAGTCAAC squirrel_monkey AGAAGGTAA---ACCTACCAAGTCAAC marmoset AGAAGGTAA---ACCTACCAAGTCAAC galago AGAAGGTAG---ATCTACCAAGTCCAT mouse_lemur AGAAGGTAA---ATCTACCAGGTCAAT lemur AGAAGGTAA---ATCTACCAAGTCAAT rabbit AGAAACTAA---GTCTACCAAATCCAT st_squirrel AGACGGTAA---ATCCTCGAAGTCCAT guinea_pig AGAAGGTGC---ACCTGTCAAGTCTAT rat AGAAAGTAT---ATGTACCAAGATAAT mouse AGAAAGTAT---GTACACACAGATAAT hedgehog AGAAATTAA---ACCTAGCAAGCCAGT shrew AGAAAGTAA---ACCTACCAAGGCAGT rfbat AGAAAGTAA---ATATACCAAGTCAGT cpbat AGAAAGTAA---ATCTACCAAGTTGAT cat AGAAAGTAAACCACCTAGCAAGCCATT ferret AGAAAGTAAACCACCCACCAAGTCATT dog AGAAAGTAAACCACCTACCAAGTCATT pig AGGAGATCA---ACCTAACAGGTCGTT sheep AGATGGTAA---ACCTAACAATTCATT cow AGATGGTAA---ACCTAACAATTCATT horse AGACAGTAA---ACCTACCAAGTCAGT armadillo 
AGAAAATAA---GCCTACCAAATCAAT elephant AGAAAGTAA---ACCTGCCAAATCTGT

157 wallaby AGAACCCCT---TCCTACCAAACCAGC opossum AGAACCTCT---ACCTGCCAAACCCAC platypus GGAGCACAG------GCCCAC chicken AGAGATGAA------AACCAT fugu GGAGCTGCT------GCCGGG tetra GGAGCTGCT------GTCTGG q - CTTNBP2 ------human TTCACCATGTGCCAACACCTCTTTGCATCCAGGTCTAAACC gorilla TTCACCATGTGCCAACACCTCTTTGCATCCAGGTCTAAACC gibbon TTCACCATGTGCCAATGCCTCTTTGCATCCAGGTCTAAACC colobus_monkey TTCACCATGTGCCAACGCCTCTTTGCATCCAGGTCTAAACC vervet TTCACCATGTGCCAACGCCTCTTTGCATCCAGGTCTAAACC baboon TTCACCATGTGCCAACGCCTCTTTGCATCCAGGTCTAAACC macaque TTCACCATGTGCCAACGCCTCTTTGCATCCAGGTCTAAACC dusky_titi TTCACCATGTGCCAACGCCTCTTTGCATCCAGGTCTAAACC owl_monkey TTCACCATGTGCCAACGCCTCTTTGCATCCAGGTCTAAACC squirrel_monkey TTCACCATGTGCCAACGCCTCTTTGCATCCAGGTCTAAACC marmoset TTCACCATGTGCCAACGCCTCTTTGCATCCAGGTCTAAACC galago CTCCCCGTGTGCCAGCGCGCCCCCGCACCCGGGCCTGAACC mouse_lemur GTCGCCGTGCACACCCGCGCCCGCGCAGCCCGGCCTCAACC lemur GTCGCCCTGCGCAGGGGCGCCCCCGCACCCCGGCCTGAATC rabbit CTCACCGTGTGCCACCACGGCTTTGCATCCAGGCCTCAACC st_squirrel TTCACCATGTGCCAATGCATCTTTGCATCCAGGTCTCAACC guinea_pig TTCACCATGTGCCAATGCATCTTTGCATCCAGGTCTAAACC rat TTCACCGTGTGCCAACAC------TCATCCAGGTCTCAACC mouse TTCACCATGTGCCAACAC------TCATCCAGGCCTCAACC hedgehog TTCACCTTGTGCCAATGTGTCTTTGCATCCAGGTTTAAACC shrew TTCACCGTGTGCCAATGCGTCTCTGCACCCGGGGCTGAACC rfbat TTCACCGAGTGCCAATGCGTCTCTGCATCCAGGTTTAAACC cpbat TTCACCGTGTGCCAACGCATCTCTGCATTCAGGTCTAAACC cat TTCACCGTGTGCCAATGCGTCTTTGCATCCAGGTCTCAATC clouded_leopard TTCACCGTGTGCCAATGCGTCTTTGCATCCAGGTCTCAATC ferret CTCACCGTGTGCCAATGCAGCTTTGCATCCAGGTCTAAACC dog cccgccGGGCACCAGCGCCGCCTTCCCCCCGGCGCCCAACC pig TTCCCCGTGTGCCGGTGCGCCTCTGCATCCAGGTCTGAACC muntjak_indian TTCACCGTGTGCCAACGCGCCTCTGCACCCGGGTCTGAACC sheep TTCACCGTGTGCCAACGCGCCTCTGCATCCGGGTCTGAACC cow TTCACCGAGTGCCAGCGCGCCTCTGCATCCGGGTCTGAACC horse TTCACCATGTGCCAATGCGTCTTTGCATCCAGGTCTAAACC armadillo TTCACCATGTGCCAATGCATCTTTGCATCCAGGTCCAAACC elephant TTCACCATGTGCCAACGCGTCTTTGCATCCAGGTGTAAACC tenrec TTCACCATGTGCCAATGCGGCTTTGCATCCGGCTCTAACCC wallaby 
TTCACCATGTGCCAATGCATCTTTGCATCCAGGTCCAAACC monodelphis TTCACCATGTGCCAATGCATCTTTGCATCCAGGTCCAAACC opossum TTCACCATGTGCCAATGCCTCTTTGCATCCAGGTCCAAACC platypus TTCACCGTGTGCCAACGCGGCTTCGTACCCCGCGCTCAATC chicken TTCCCCATCTGCCAGCACTACTGGCCAGCCAGGCCTAAATC fugu ------AGGGTCAAGTGCCTCATGGCTTACCGAGCTCAGCC tetra ------AGGGGCAAGCGCCCCACGGCCTACCGAGCTCAGCC r - CTTNBP2 ------human TCCACCACACCCCCAACTCAAGGTTATTATAGACAGCAGCAGGGCCTCGAACACAGGGGCCAAA gorilla TCCACCACACCCCCAACTCAAGGTTATTATAGACAGCAGCAGGGCCTCGAACACAGGGGCCAAA gibbon TCCACCACACCCCCAACTCAAGGTTATTATAGACAGCAGCAGGGCCTCAAACACAGCGGCCAAA colobus_monkey TCCACCACACCCCCAACTCAAGGTTATTATAGACAGCAGCAGGGCCTCAAACACAGGAGCCAAA vervet TCCACCACACCCCCAACTCAAGGTTATTATAGACAGCAGCAGGGCCTCAAACACAGGGGCCAAA baboon TCCACCACACCCCCAACTCAAGGTTATTATAGACAGCAGCAGGGCCTCGAACACAGGGGCCAAA macaque TCCACCACACCCCCAACTCAAGGTTATTATAGACAGCAGCAGGGCCTCGAACACAGGGGCCAAA dusky_titi TCCACCACACCCCCAACTCAAGGTTATTATCGATAGCAGCAGGGCCTCGAACACAGGGGCCAAA

158 owl_monkey TCCACCACACCCCCAACTCAAGGTTATTATAGATAGCAGCAGGGCCTCGAACACAGGGGCCAAA squirrel_monkey TCCACCACACCCCCAACTCAAGGTTATTATAGATAGCAGCAGAGCCTCGAACACAGGGGCCAAA marmoset TCCACCACACCCCCAACTCAAGGTTATTATAGATAGCAGCAGGGCCTCGAACACAGGGGCCAAA galago TCCACCACACCCCCAACTCAAGGTTATTATGGATAGCAGCAGGGCCTCCAATGCAGGGGCCAAA mouse_lemur TCCGCCACACCCCCAGCTCAAGGTTATTATGGATAGCAGCAGGGCCTCGAACGCAGGGGCCAAA lemur TCCACCACACCCCCAACTCAAGGTTATTATGGATAGCAGCAGGGCCTCCAACGCAGGGGCCAAA rabbit TCCACCCCACCCCCAACTCAAGATTCCGGTGGACAGCAGCAGGGCCTCCAGTGCGGGGGCCAAA st_squirrel TCCGCCACACCCCCAACTCAAGGTTATTATGGATAGCAGCAGGGCTTCAAATGCAGGAGCCAAA guinea_pig TCCACCTCACCCCCAACTCAAGGTTATTATGGATAGCAGCAGGGCCTCAAATGCAGGGGCCAAA rat TCCACCACACCCTCAACTGA------GGGCCTCCAATGCAGGGGCCAAA mouse TCCGCCACACCCCCAACTGA------GGGCCTCCAATGCAGGGGCCAAA hedgehog TCCCCCACACCCACAACTCAAGGTTATCATGGATAGCAGCAGGGCCGCCAGTGCAGGGGTCAAA shrew TCCCCCACACCCCCAACTCAAGGCTATCATGGAT---AGCAGGGCCTCCAGTGCGGGGGCCAAA rfbat TCCACCCCACCCCCAACTCAAGGTCGTTATGGATAGCAGCAGGGCCCCCAGTGCAGGGGCCAAA cpbat TCCACCCCACCCCCAATTCAAAGTTATTATGGATAGCGGCAGGGCCTCCAGTGCTGGGGCCAAA cat TCCACCACACCCCCAACTCAAGGTTATTATGGATAGCAGCAGGGCCTCCAATGCAGGGGCCAAA clouded_leopard TCCACCACACCCCCAACTCAAGGTTATTATGGATAGCAGCAGGGCCTCCAATGCAGGGGCCAAA ferret TCCACCACACCCCCAACTCAAGGTTATTATGGATAGCAGCAGGGCCTCCAGTGCGGGGGCCAAA dog TCCGCCACACCCCCAGCTCAAGGTCCTCCTGGATAGCAG---GGCCCCCAGCGCAGGGGCCAAA pig TCCACCACACCCCCAGCTCAAGGTTATCATGGATAGCAGCAGGGCCCCCAGTGCAGGGGCCAAG muntjak_indian TCCACCGCACCCCCAACTCAAGGTTATCATGGATAGCAGCAGGGCCTCCAGCACAGGGATCAAA sheep TCCACCGCACCCCCAACTCAAGGTTATCATGGATAGCAGCAGGGCCTCCAGCACAGGGATCAAA cow TCCACCGCACCCCCAACTCAAGGTTATCATGGATAGCAGCAGGGCCTCCAGCACAGGGATCAAA horse TCCACCACACCCCCAGCTCAAGGTTATTATGGATAGCAGCAGGGCCTCCAACGCAGGGGCCAAA armadillo TCCACCACACCCCCAGCTCAAGGCTATTATGGATAGCAGCAGGGCCTCCAATGCAGGGGCCAAA elephant TCCACCGCACCCCCAACTCAAGGTTATTATGGATAGCAGCAGGGCCTCAAATGCAGGGGCCAAA tenrec TCCACCACACCCCCAACTCAAAGTTCTCCTGGATAGCAGCCGGGCCTCGAACGCAGGGCCCAAA wallaby 
ACCACCACACCCTCAAATCAAAGTTATTATGGATAGCAGCAGATCCTCTAATTCAGGGGCCAAA monodelphis ACCACCACACCCTCAAATCAAAGTTATTATGGATAGCAGCAGATCCTCTAATGCAGGGGCCAAA opossum ACCACCACACCCTCAAATCAAAGTTATTATGGATAGCAGCAGATCCTCTAATTCAGGGGCCAAA platypus GCCTCCCCACCCCCAGCTCAAGGTCCTCAAGGACAGCGGCCGCCCAGCCAACGCGGGGGCCAAA chicken TCCACCTCACCCACAGCTAAAGGTTTTAATGGATAGTAGTCGATCCCCAAATACAAGTGCAAAA fugu TCCTCCTCACCCGCCCATTAAGGTGGTGGGGGACAGCAGTCGCACCCCTGCAGCTGGG------tetra CCCTCCTCACCCACCCATTAAGGTGGTGGGGGACAGCAGTCGCACCCCTGCAGCCGGG------s - CTTNBP2 ------human GCAGACAACTGCTAAAAGACAC chimp GCAGACAACTGCTAAAAGACAC gorilla GCAGACAACTGCTAAAAGACAC orangutan GCAGACAACTGCTAAAAGACAC gibbon GCAGACAACTGCTAAAAGACAC colobus_monkey ACAGACAACTGCTAAAAGACAC vervet ACAGACAACTGCTAAAAGACAC baboon ACAGACAACTGCTAAAAGACAC macaque ACAGACAACTGCTAAAAGACAC dusky_titi GCAGACAACTGCCAAAAGACAC owl_monkey GCAGACAACTGCTAAAAGACAC squirrel_monkey GCAGACAACTGCTAAAAGACAC marmoset GCAGACAATTGCTAAAAGACAC galago GCAGACAACTGCTAAGAAACAC lemur GCAGACAACCACTAAAAAACAC rabbit GCAGACGGCGGCTAAGAAACAC st_squirrel GCAGACTGCTGCTAAAAAGCAC guinea_pig GCAGACAGCTGCTAAGAAGCAC rat GCAGACAGCTTCGAAGAAGCAC mouse GCAGACAGCTTCTAAGAAGTAC hedgehog GCAGACAACAGCTAAAAAACAC shrew GCAAATAGCTGCCAAAAAACAC rfbat ACAAACAACTGCGAAGAAAACC cpbat GCAAACAACTGCTAAAAAGCAC cat GCTGGCAACTGCTAAAAAACAC

159 clouded_leopard GCTGGCAACTGCTAAAAAACAC ferret GCTGACAGCGGCCAGAAGCCGC dog GGTGACAGCTGCTAAGAGCCAC pig GCAAACAACT------AGAAAC sheep GCAAACAACT------AAAAAC cow GCAAACAACT------AAAAAC horse GCAAACAATTGCAAAAAAACAC armadillo ACAAACAACTGCTAAAAAACCC elephant ACAAACAAATGCTAAAAAACAC tenrec ACAAACAGCTGCTAAGAAACAC wallaby ACAGACACCTGCTCGAAAGCAT monodelphis ACAGACAACTGCTAGGAAGCAT opossum ACAGACAACTGCTAGGAAGCAT platypus -----CAGCAGTCAGACCGAGT chicken GCAAACAATAGCCAAAAAAAAT fugu ACGACCGTCTCTTAGCAACTGT tetra GCGACCGTCTCTTAGCAACTGC t - CTTNBP2 ------human CTATCACTGACGTTGAATTTGGATCAGAGACTCTCTCTG chimp CTATCACTGACGTTGAATTTGGATCAGAGACTCTCTCTG gorilla CTATCACTGACGTTGAATTTGGATCAGAGACTCTCTCTG orangutan CTATCACTGACGTTGAATTTGGATCAGAGACTCTCTCTG gibbon CTATCACTGACGTTGAATTTGGATCAGAGACTCTCTCTG colobus_monkey CTGTCACTGACGTTGAATTTGGATCAGAGACTCTCTCTG vervet CTGTCACTGACGTTGAATTTGGATCAGAGACTCTCTCTG baboon CTGTCACTGACGTTGAATTTGGATCAGAGACTCTCTCTG macaque CTGTCACTGACGTTGAATTTGGATCAGAGACTCTCTCTG dusky_titi GTATCACTGACATTGAATTTGGATCAGAGACTCTCTCTG owl_monkey GTATCACTGACGTTGAATTTGGATCAGAGACTCTCTCTG squirrel_monkey GTATCACTGACGTTGAATTTGGATCAGAGACTCTCTCTG marmoset GTATCACTGACGTTGAATTTGGATCAGAGACTCTCTCTG galago CTGTCCCTGATGTTGAATTTGGATCAGAGACTCTCTCTG mouse_lemur TTATCATTGATGTTGAATTTGGATCCGAGACTCTCTCTG lemur CTCTCATTGATGCTGAATTTGGATCAGAGCCTCTCTCTG rabbit CTGTCGTTGACGTTGAATTTGGATCAGAGGTTCTCTCTG st_squirrel CTGTCATTAACCTTGACTTTGGATCAGAGACTCTCTCTG guinea_pig CTCTCATTGACTTTGACTTTGGATCCAAGACTTTCTCTG rat CTATCCATGACATTGACTTTGGACCACAGACTCTCTTTG mouse CTATCTGTGACATTGACTTTGGACCACAGACTCTCTTTG hedgehog CTGTCATTGACATTG------GACCAAAGACTCTCTCTG shrew TTATCATTGACATTG------GATCAAAGACTCTCTCTG rfbat CTATCATCGACATTGACCTTGGATCAAAAACTCTATCTG cpbat CTGTCCTTGACATTGAACTTGGATCAAAGACTCTATCTG cat CTATCATTGACATTGAATTTGGATCAGAGACTCTCTCTG clouded_leopard CTATCATTGACATTGAATTTGGATCAGAGACTCTCTCTG ferret CTATCATTGACATTGAATTTGGAACAGAGACTCTCTCTG dog CTGTCATTGACCTTGAATTTGGAACAAAGGCTCTCTCTG pig CTATCATTGACATTGAATTTGGAACAAAGACTCTCTCTG sheep 
CTGTCATTGACATTGAATCTGGATCAAAGACTGTCTCTG cow CTGTCATTGACCTTGAATCTGGATCAAAGACTCTCTCTG horse CTGTCATTGACGTTGAGTTTGGATCAGAGGTTCTCTCTG armadillo CCATCATCGATGTTGAATCTGGACCAAAGACTCTCTCTA elephant CTGTCGTTGACGTTGAATCTGGATCAAAGACTCTCTCTG tenrec CTGTCGTTGGCGTTGAATCTGGACCAAAGGCTCTCTCTG wallaby CAACCACAAGGATTGAATCTGGATCAGCGACTGTCTCTT monodelphis CAACCACAAGGATTGAATCTGGAGCAAAGATTGTCTCTC opossum CAACCCCAAGGATTAAATCTGGAGCAAAGATTGTCTCTT platypus CATCCACAAGTGTTACATCTAGATCAAAGGCTGTCTCTA chicken CAATTGAGAACATTGAATCTAGAGCAAAGGCTGTCCCTT fugu TTGTAATTGTTT------CTGTCTGTG

160 tetra CTAAATTTGCTT------TTATCTGTG p - CTTNBP2 ------human GCTGGTGACAAAGAGAAGAAGCCAGTTTGTACCA gorilla GCTGGTGACAAAGAGAAGAAGCCAGTTTGTACCA gibbon GCTGGTGACAAAGAGAAGAAGCCAGTTTGTACCA colobus_monkey GCTGGTGACAAAGAGAAGAAGCCAGTTTGTACCA vervet GCTGGTGACAAAGAGAAGAAGCCAGTGTGTACCA baboon GCTGGTGACAAAGAGAAGAAGCCAGTGTGTACCA macaque GCTGGTGACAAAGAGAAGAAGCCAGTGTGTACCA dusky_titi GCTGGTGACAAAGAGAAGAAGCCAGTTTGTACCA owl_monkey GCTGGTGACAAAGAGAAGAAGCCAGTTTGTACCA squirrel_monkey GCTGGTGACAAAGAGAAGAAGCCAGTTTGTACCA marmoset GCTGGTGACAAAGA---GAAGCCAGTTTGTACCA galago GCTGGGGACAAAGAGAAGAAGCCGGTTTGTACCA mouse_lemur GCCGGTGACAAAGAGAAGAAGCCAGTTTGTACCA lemur GCGGGTGACAAAGAGAAGAAGCCAGTTTGTACCA rabbit GCCAGTGACAAAGAGAAGAAGCCTGTTTGCACCA st_squirrel GCTGGTGACAAAGAGAAGAAACCAGTTTGTACCA guinea_pig GCTGGTGACAAAGAAAAGAAGCCCGTTTGTACCA rat GCTGGTGACAAAGA---GAAGCCAGTCTGCACCA mouse CCTGGTGACAAAGA---GAAGCCAGTCTGCACCA hedgehog GCTGGTGACAAAGAGAAGAAGCCAGTTTGCACCA shrew GCCAGTGACAAAGAGAAGAAGCCCGTTTGCACCA rfbat GCCGGTGATAAAGAGAAGAAGCCAGTTTGCGCCA cpbat GCTGGCGATCAAGACAAGAAGCCAGTTTGTACCA cat GCTGGCGATAAAGAGAAGAAGCCAGTTTGCACCA clouded_leopard GCTGGCGATAAAGAGAAGAAGCCAGTTTGCACCA ferret GCCGGGGAGAAAGAGAAGAAGCCAGTGTGTACCA muntjak_indian GCCAGCGATAAAGAGAAGAAGCCAGTTTGTACCA sheep GCCAGCGATAAAGAGAAGAAGCCAGTTTGTACCA cow GCCAGCGATAAAGAGAAGAAGCCAGTTTGTACCA horse GCCGGCGATAAAGAGAAGAAGCCAGTTTGTACCA armadillo GCTGGTGAGAAAGAGAAGAAGCCAGTTTGTACCA elephant GCCAGGGAGAAAGAGAAAAAGCCGGTTTGTACCA tenrec GCTGGGGATAAAGAGAAGAAGCCAGTTTGTACCA wallaby GCTGGTGATGCAGA---GAAGCCAGTTTGTACCA monodelphis GCTGGTGATGCAGA---GAAGCCAGTTTGTACCA opossum GCTGGTGATGCAGA---GAAGCCAGTTTGTACCA platypus GCTGGTGAAGCAGAGAAGAAGCCCGTCTGTACCA chicken GCTGGTGAGAAAGAAAGGAAGCCAATCTGCATGA fugu GGAGGCGACAAGGAGCGGCGGGCCATGGGTTCCA tetra GGAGGAGACAAGGAGCGCCGGGCCCTGGGTTCCA u - MET ------human AAAAGAGATCCACAAAGAA---GGAAGTGTTTAATATACT chimp AAAAGAGATCCACAAAGAA---GGAAGTGTTTAATATCCT gorilla AAAAGAGATCCACAAAGAA---GGAAGTGTTTAATATCCT orangutan 
AAAAGAGATCCACAAAGAA---GGAAGTGTTTAATATCCT gibbon AAAAGAGATCCACAAAGAA---GGAAGTGTTTAATATCCT colobus_monkey AAAAGAGATCCACAAAGAA---GGAAGTGTTTAACATCCT vervet AAAAGAGATCCACAAAGAA---GGAAGTGTTTAATATCCT baboon AAAAGAGATCCACAAAGAA---GGAAGTGTTTAATATCCT macaque AAAAGAGATCCACAAAGAA---GGAAGTGTTTAATATCCT dusky_titi AAAAGAGATCCACAAAAAA---GGAAGTGTTTAATATCCT owl_monkey AAAAGAGATCCACGAAAAA---GGAAGTGTTTAATATCCT squirrel_monkey AAAAGAGATCCACAAAAAA---GGAAGTGTTTAATATCCT marmoset AAAAGAGATCCACAAAAAA---GGAAGTGTTTAATATCCT galago GAAAGAGATCCACAAGGAA---GGAAGTATTCAACATCCT mouse_lemur GAAAGAGATCCACAAGAGA---GGAAGTGTTTAATATCCT

161 lemur GAAAGAGATCCACAAGGGA---GGAAGTGTTTAATATCCT rabbit GGAAGAGAGCCACCAGGGA---GGAAGTGTTTAATATCCT st_squirrel GGAAGAGATCCACAAGGGA---AGAGGTATTTAATATCCT guinea_pig GAAAAAGATCCACAGGGCA---GGAAGTGTTTAACATCCT rat GAAAGAGATCCACAAGGGA---AGAAGTGTTTAATATCCT mouse GGAAGAGATCCACAAGGGA---AGAAGTGTTTAATATCCT hedgehog GGAAGAGAGCTGCAAGGGA---AGAAGTATTTAATATACT rfbat GGAGGAGATCCGCAAGGTC---GGAAGTGTTCAATATCCT cat gaaagagatccacaaggga---ggaagtgtttaatattct clouded_leopard gaaagagatccacaaggga---ggaagtgtttaatattct ferret GAAAGAGATCCACAAGGGA---GGAAGTCTTTAATATTCT dog gaaagagatccacaaggga---ggaagtgtttaatattct pig gaaagagatccacaaggga---ggaagtgttcaatattct muntjak_indian gaaagagatccacaaagca---ggaagtgtttaatattct sheep gaaagagatccacaaagca---ggaagtgtttaatattct cow gaaagagatccacaaagca---ggaagtgtttaatattct horse GAAAGAGATCCACAAGTGA---GGAAGTGTTTAATATCCT armadillo GAAAGAGATCCACAAGACA---GGAGGTGTTTAATATTCT elephant GGAAGAGGTCAGCCAGGGA---GGAGGTGTTTAATATCCT tenrec GCAAGCGGGCCCTGCGCAG---CGAGGTGTTCAACGTGCT wallaby GAAAGAGATCAACAAGGGA---AGAAGTATTTAATATTCT platypus GGAGGAGGTCCACCAACAA---GGAAGTGTTCAATGTCCT chicken GGAAGAGGTCCATTAGAAA---GGAGGTATTCAACATTTT fugu GGCGGCGCTCCATGGAAGACGTGAAGGTCTTCAACATCCT tetra GGCGGCGCTCCACGGAAGACGTGAAGGTCTTCAACATCCT v - MET ------human GGCTGCGTATGTCAGCAAGCCTGG------GGCCCAGCTTGCTAGACAAAT chimp GGCTGCGTATGTCAGCAAGCCTGG------GGCCCAGCTTGCTAGACAAAT gorilla GGCTGCGTATGTCAGCAAGCCTGG------GGCCCAGCTTGCTAGACAAAT orangutan GGCTGCGTATGTCAGCAAGCCTGG------GGCCCAACTTGCTAGACAAAT gibbon GGCTGCATATGTCAGCAAGCCTGG------GGCCCAACTTGCTAGACAAAT colobus_monkey GGCTGCATATGTCAGCAAGCCTGG------GGCCCAACTTGCTAGACAAAT vervet GGCTGCATATGTCAGCAAGCCCGG------GGCCCAGCTTGCTAGACAAAT baboon GGCTGCATATGTCAGCAAGCCCGG------GGCCCAGCTTGCTAGACAAAT macaque GGCTGCATATGTCAGCAAGCCCGG------GGCCCAGCTTGCTAGACAAAT dusky_titi GGCTGCGTATGTCAGCAAGCCTGG------GGCCCAGCTTGCGAGGCAAAT owl_monkey GGCTGCATATGTCAGCAAGCCTGG------GGCCCAACTTGCGAGGCAAAT squirrel_monkey GGCTGCATATGTCAGCAAGCCTGG------GGCCCAACTTGCGAGGCAAAT marmoset 
GGCTGCATATGTCAGCAAGCCTGG------GGCCCAACTTGCGAGGCAAAT galago AGCTGCATACGTCAGTAAGCCTGG------GGCCCACCTTGCCAAGCAGAT mouse_lemur GGCTGCATATGTCAGTAAGCCTGG------GGCCCAGCTTGCTAAGCAAAT lemur AGCTGCATATGTCAGTAAGCCTGG------GGCCCAACTTGCTAAGCAAAT rabbit GGCTGCATATGTCAGTAAGCCTGG------GGCCCATCTTGCTAGGCAGAT st_squirrel AGCTGCATATGTCAGTAAGCCTGG------GGCCCATCTTGCCAAGCAAAT guinea_pig GGCTGCATATGTTGGTACAGCCGG------GGCCCATCTCGCTAAGCAAAT rat AGCCGCATATGTCAGTAAACCGGG------GGCCAATCTTGCTAAGCAAAT mouse AGCCGCGTATGTCAGTAAACCAGG------GGCCAATCTTGCTAAGCAAAT hedgehog AGCTGCATATGTCAGTAAGCCTGG------GGCCCATCTCGCTAGGCAGAT rfbat AGCTGCCTATGTCGGCAAGCCTGG------GACCCATCTCGCCAACCAAAT cat agctgcatatgtcagtaagCCTGG------GGCCCACCTTGCTAAGCAAAT clouded_leopard agctgcatatgtcagtaagCCTGG------GGCCCACCTTGCTAAGCAAAT ferret AGCTGCATATGTTGGTAAGCCTGG------GGCCCACCTCGCTAAGCAAAT dog agctgcatatgtcagtaagCCTGG------GGCCCATCTCGCTAAACAAAT pig agctgcatatgtcagtaagcctgg------gactcagCTCGCTAAGCAAAT muntjak_indian agctgcatatgtcagtaagcctgg------ggctcagCTCGCTAGGCAAAT sheep agctgcatatgtcagtaagcctgg------ggctcagCTTGCTAGGCAAAT cow agctgcatatgtcagtaagcctgg------ggctcagCTCGCTAGGCAAAT horse AGCTGCGTATGTCAGTAAGCCTGG------GGCTCATCTTGCTAAGCAAAT armadillo GGCTGCATATGTCAGTAAGCCTGG------GGCCCATCTCGCTAAGCAAAT elephant GGCTGCCTACGTCAGTAAGCCTGG------GGCCTATCTCGCTAAGCAAAT

162 tenrec GGCCGCGTACGTGGGCAAGCCCGG------GGCGCAACTGGCCAAGCAGAT wallaby AGCCGCATATGTGAGTAAGCCTGG------TGCACAACTAGCTAGGCAAAT platypus GGCTGCCTATGTCAGTAAGCCAGG------GGCCCATCTGGCTCGGCAAAT chicken AGCTGCATATGTAAGCAAACCCGG------TGCAGCCCTAGCCCATGAAAT fugu AGCGGCCACCGTCACCAAGGTGGGCAGCGACGTGGAGCTGCAGAGGCAGCT tetra GGCGGCCAACGTCACCAAGGTGGGCAGTGACGTGGAGCTGCAGAGGCAGCT w - MET ------human GGCTGAAAAAGAGAAAGCAAATTAAAG chimp GGCTGAAAAAGAGAAAGCAAATTAAAG gorilla GGCTGAAAAAGAGAAAGCAAATTAAAG orangutan GGCTGAAAAAGAGAAAGCAAATTAAAG gibbon GGCTGAAAAAGAGGAAACAAATTAAAG colobus_monkey GGCTGAAAAAGAGAAAGCAAATTAAAG vervet GGCTAAAAAAGAGAAAGCAAATTAAAG baboon GGCTGAAAAAGAGAAAGCAAATTAAAG macaque GGCTGAAAAAGAGAAAGCAAATTAAAG owl_monkey GGCTGAAAAAGAGAAAGCAAATTAAAG squirrel_monkey GGCTGAAAAAAAGAAAGCAAATTAAAG marmoset GGCTGAAAAAGAAAAAGCAAATTAAAG galago GGCTGAAAAAAAGAAAGCAAAACAAAG mouse_lemur GGATGAAAAAGAGAAAGCAAATTAAAG lemur GGATGAAAAAGAGAAAGCAAATTAAAG rabbit GGCTAAAAAAGAAAAAGCGCATTAAAG st_squirrel GGCTGAAAAAGAGAAAGAAAAAAAAAG guinea_pig GGCTGAAAAAGAGAAAGCAAATTAAAG rat GGCTGAGAAAGAGAAAGCA---TAAAG mouse GGATGAGAAAGAGAAAGCA---TAAAG hedgehog GGTGGAAGAAGAAAAAGCAAATTAAAG shrew GGTGGAAAAAGAGAAAGCAAATTAAAG cpbat GGCTGAAAAGGAGAAAGCAACTTAAAG rfbat GGCTGAAAAGGAGAAAGCAAATTAAAG cat GGCTGAAAAAGAGAAAGCAAATTAAAG clouded_leopard GGCTGAAAAAGAGAAAGCAAATTAAAG ferret GGCTGAAAAGGAGAAAGCAAATTAAAG dog GGCTGAAAAGGAAAAAGCAAATTAAAG pig GGCTGAAAAAGAGAAAGCAAATTAAAG muntjak_indian GGCTGAAAAAAAGGAAGCAAATTAAAG sheep GGCTGAAAAAAAGGAAGCAAATTAAAG cow GGCTGAAAAAAAGGAAGCAAATTAAAG horse GGCTGAAAAGGAGAAAGCAAATTAAAG armadillo GGCTAAAAAAGAAAAAGCAAATTAAAG elephant GGCTGAAAAAGAGAAAGCAAATTAAAG tenrec GGCTGAAAAAGAGAAAGCAAATTAAAG wallaby GGCTGAAAAAGAGGAAGCAGATTAAAG platypus GGATGAAAAAGAAGAAACAAATTAAAG chicken GGAGAAGGAAGAAGAAGCAGATTAAAG fugu TGTGGAAGAAGAACAAGCACATCGATG tetra TGTGGAGGAGAAACAAGAACATCGACG x - MET ------human AAGATAACGCTGATGATGAG------GTGGACACACGA chimp AAGATAACGCTGATGATGAG------GTGGACACACGA gorilla 
AAGATAACGCTGATGATGAG------GTGGACACACGA orangutan AAGATAACGCTGATGATGAG------GTGGACACACGA gibbon AAGATAACGCTGATAATGAG------GTGGACACACGA colobus_monkey AAGATAACGCTGATGACGAG------GTGGACACATGA vervet AAGATAACGCTGATGATGAG------GTGGACACATGA baboon AAGATAACGCTGATGACGAG------GTGGACACATGA macaque AAGATAACGCTGATGACGAG------GTGGACACATGA

163 dusky_titi AAGATAACGCTGATGGCGAG------GTGGACACATGA owl_monkey AAGATAACCCTGATGGCGAG------GTGGACACATGA squirrel_monkey AAGATAACGCTGACGGCGAG------CTGGACACCTGA marmoset AAGATAACACTGATGGCGAG------GTGGACACATGA galago AAGATAACATCGATGGTGAG------GTGGACACATGA mouse_lemur AAGATAACATCAATGGCGAG------GTGGACACATGA lemur AAGATAACATCAATGGCGAG------GGGGACACATGA rabbit AGGATAACGTGGATGGCACA------GTGGACACGTGA st_squirrel AAGATAACGCTGATGGTGTG------GTGGACACGTGA guinea_pig AAGAAAACATCGATGGTGAC------ATGGACACGTGA rat AAGACAACATTGACGGCGAA------GCGAACACATGA mouse AAGACAACATTGATGGCGAG------GGGAACACATGA hedgehog AAGATAACTTTGATAGTGAG------GGCAACACATGA rfbat AAGATCACGTTGATGGTGAG------GGAGACACATGA cat AAGATAACATTGATGGCGAG------GGGGACACATGA clouded_leopard AAGATAACATTGATGGCGAG------GGGGACACATGA ferret AAGATATCATTGATGGCGAG------GGGGACACATGA dog AAGATAACATTGATGGCGAG------GGGGACACATGA pig AAGATAATGTTGATGGCGAG------GGGGACACATGA muntjak_indian AAGATAATGTCAATGGCGAG------GGGGACACATGA sheep AAGAGAATGTCAGTGGCGAGGATGATGATGACACATGA cow AAGATAATGTCAGTGGCGAGGATGATGATGACACATGA horse AAGATAACGTTGATGGCGAG------GTAGATACATGA elephant AAGATAGCGTGGATGATGAG------GTGGACACATGA tenrec ACGACACGGTGGATGGCGAG------GTGGACACATGA wallaby AAGACAATGTTGACAGGGAG------GTGGACACCTGA platypus AGGACGCCCCAGACAGGGTT------GTGGACACATGA chicken AAGACAATACAGACATGGAT------GTAGACACATGA

Portnoy et al., 2005. Detection of potential GDF6 regulatory elements by multispecies sequence comparisons and identification of a skeletal joint enhancer

Genomics 86 (2005) 295–305 www.elsevier.com/locate/ygeno

Detection of potential GDF6 regulatory elements by multispecies sequence comparisons and identification of a skeletal joint enhancer

Matthew E. Portnoy (a), Kelly J. McDermott (b), Anthony Antonellis (a), Elliott H. Margulies (a), Arjun B. Prasad (a), NISC Comparative Sequencing Program (a,c), David M. Kingsley (d), Eric D. Green (a,c), Douglas P. Mortlock (b,*)

(a) Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
(b) Department of Molecular Physiology and Biophysics, Vanderbilt University School of Medicine, Nashville, TN 37232, USA
(c) NIH Intramural Sequencing Center (NISC), National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
(d) Department of Developmental Biology, Stanford University School of Medicine, Stanford, CA 94305, USA

Received 21 March 2005; accepted 24 May 2005 Available online 24 June 2005

Abstract

The identification of noncoding functional elements within vertebrate genomes, such as those that regulate gene expression, is a major challenge. Comparisons of orthologous sequences from multiple species are effective at detecting highly conserved regions and can reveal potential regulatory sequences. The GDF6 gene controls developmental patterning of skeletal joints and is associated with numerous, distant cis-acting regulatory elements. Using sequence data from 14 vertebrate species, we performed novel multispecies comparative analyses to detect highly conserved sequences flanking GDF6. The complementary tools WebMCS and ExactPlus identified a series of multispecies conserved sequences (MCSs). Of particular interest are MCSs within noncoding regions previously shown to contain GDF6 regulatory elements. A previously reported conserved sequence at −64 kb was also detected by both WebMCS and ExactPlus. Analysis of LacZ-reporter transgenic mice revealed that a 440-bp segment from this region contains an enhancer for Gdf6 expression in developing proximal limb joints. Several other MCSs represent candidate GDF6 regulatory elements; many of these are not conserved in fish or frog, but are strongly conserved in mammals. © 2005 Elsevier Inc. All rights reserved.

Keywords: Gdf6; Enhancer elements; Bacterial artificial chromosome; Transgenes; Sequence analysis
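The pair-wise criterion discussed in the introduction (percentage sequence identity across windows of a defined size, with a threshold such as >70% identity across 100 bp) can be illustrated with a minimal sketch. This is not the procedure used by PipMaker or any other tool named in this article; the function names, the treatment of gap columns as mismatches, and the toy sequences are all assumptions made for illustration.

```python
def percent_identity(a, b):
    """Fraction of aligned columns with the same base in both rows.

    Assumption: a and b are rows of one pair-wise alignment (equal length,
    '-' for gaps), and a gap column never counts as a match.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    if not a:
        return 0.0
    matches = sum(
        1 for x, y in zip(a.upper(), b.upper()) if x == y and x != "-"
    )
    return matches / len(a)


def conserved_windows(a, b, window=100, threshold=0.70, step=1):
    """Return (start, identity) for every window whose identity exceeds threshold."""
    hits = []
    for start in range(0, len(a) - window + 1, step):
        ident = percent_identity(a[start:start + window], b[start:start + window])
        if ident > threshold:
            hits.append((start, ident))
    return hits


# Toy aligned pair (hypothetical sequences, not taken from the GDF6 data).
seq_a = "ACGTACGTAC"
seq_b = "ACGTACGTAA"
hits = conserved_windows(seq_a, seq_b, window=5, threshold=0.70)
```

With closely related species nearly every window passes such a test, which is why the choice of window size and threshold becomes somewhat arbitrary.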

* Corresponding author. Fax: +1 615 343 8619. E-mail address: [email protected] (D.P. Mortlock).

The use of comparative sequence analysis for identifying the functional portions of complex genomes has great potential for facilitating the detection of noncoding functional elements. Pair-wise sequence alignments and comparisons can be used for this purpose, for example, by simply assessing the percentage sequence identity across windows of a defined size [1]. Such approaches have proven effective at identifying cis-acting regulatory sequences, including those associated with developmentally regulated vertebrate genes, which tend to have multiple, distantly located regulatory elements [2–4]. However, the use of pair-wise sequence comparisons is limited. There is often too much aligning sequence between pairs of closely related species, forcing the use of somewhat arbitrary thresholds (e.g., >70% identity across 100 bp). Meanwhile, there is often too little aligning sequence between pairs of more distantly related species, especially within noncoding regions. To overcome the limitations of simple pair-wise analyses, approaches for performing multispecies sequence comparisons have been developed [5–7]; these provide a more powerful means of identifying the most highly conserved genomic regions. It is thus of great interest to apply these new approaches for studying genomic regions thought to contain complex, cis-acting regulatory elements.

GDF6 is a member of the BMP (bone morphogenetic protein) gene family, a group of genes that encode secreted signaling factors. Like other BMP genes, GDF6 is expressed in many anatomical locations during embryonic development, including numerous skeletal joints [8,9], and it is required for normal formation of limb, ear, and skull joints [9]. These findings are consistent with the known roles of BMP members in mediating many developmental processes, such as the regional control of bone growth and shape [8,10,11] and soft tissue development [12,13]. As with other developmentally important genes, the localized effects of BMPs seem to be largely controlled by modular arrangements of cis-acting regulatory sequences that control the expression of BMP genes in specific anatomical locations. These regulatory sequences can reside far away from the gene [14,15]; for example, mouse bacterial artificial chromosome (BAC)-transgene studies revealed that regulatory sequences are distributed across a region of more than 100 kb encompassing the GDF6 gene and that these elements mediate GDF6 transcription in limb joints, digits, retina, genitalia, laryngeal cartilages, skull bones, and other tissues [4].

To localize and identify further individual GDF6 cis-acting regulatory elements, we performed extensive multispecies sequence comparisons. Previously, we performed pair-wise comparisons using PipMaker to indicate conserved GDF6 sequences [16]. Here, we describe the use of two multispecies comparative approaches for detecting highly conserved regions in and around GDF6. These methods detect a developmentally regulated GDF6 enhancer that can direct gene expression in proximal limb joints in vivo. Our findings suggest that multispecies conserved sequence (MCS) analysis may be a sensitive approach for detecting other GDF6 regulatory elements.

Results

Multispecies sequences and alignments

The 209-kb mouse BAC clone RPCI23-117O7 contains the entire Gdf6 gene and extensive flanking regions [4]. Using the previously established sequence of this BAC as a reference, we have generated (~2.7 Mb) or obtained (~1.5 Mb) sequences of the orthologous genomic regions from 13 additional vertebrates (Fig. 1a). A previous PipMaker analysis of the mammalian GDF6 sequences has been described [16]. To examine the degree of noncoding conservation in other vertebrates, we obtained sequences from additional species and performed MultiPipMaker analysis on the entire data set. For nine species (chimpanzee, baboon, cow, pig, cat, dog, rat, platypus, and zebrafish), appropriate BACs were isolated and sequenced as previously described [16,17]. For four species (human, chicken, Fugu, and Xenopus), the orthologous sequence was obtained from the whole-genome sequence assemblies available at the UCSC Genome Browser (http://genome.ucsc.edu/) [18].

Pair-wise alignments were generated between the mouse reference sequence and each of the other species' sequences using BLASTZ [19], with the results visualized using MultiPipMaker [20]. Sequence coverage of the region immediately encompassing Gdf6 was nearly complete for most species, with minor exceptions (Fig. 1b). The two GDF6 exons are highly conserved across vertebrates, as are several intronic noncoding regions (Fig. 1b). As reported previously [16], much of the noncoding DNA flanking the GDF6 gene is conserved among mammals. However, a more limited set of flanking regions is also conserved in chicken and Xenopus. The mouse–zebrafish and mouse–Fugu alignments in regions flanking GDF6 appear to be spurious (i.e., related to simple repeat-like sequences; data not shown); however, true mouse–fish conserved orthologous sequences were found within the GDF6 exons and intron.

WebMCS and ExactPlus analyses

Since MultiPipMaker essentially displays separate pairwise comparisons, we then performed two multispecies comparative analyses (WebMCS and ExactPlus) to prioritize conserved regions by assessing multiple alignments

Fig. 1. Multispecies comparative sequence analysis of the genomic region encompassing GDF6. (a) Venn diagram showing the major cladistic relationships of the vertebrate species whose sequences were analyzed. (b) A low-resolution overview of MultiPipMaker analysis of the GDF6-containing sequences from 14 species, with the mouse sequence used as the reference. The horizontal colored bars at the top of the overview plot (red, yellow, green, blue, and magenta) indicate positions of five contiguous genomic regions in mouse that were previously identified by BAC-transgene analysis to contain different subsets of tissue-specific Gdf6 regulatory enhancers [4]. The relative positions and orientations of mouse Gdf6 and two pseudogenes (Uqcrb and Gapdh) are indicated, as is the 2.9-kb interval previously shown to contain a proximal joint enhancer (PJE) [4]. For each species, portions of the mouse reference sequence that align to that species' sequence are indicated by green and red bars (reflecting regions with >50 or >75% identity with the mouse sequence, respectively). Gray bars indicate known gaps in the sequence data that are greater than 1 kb. (c) UCSC Genome Browser-based view depicting the positions of MCSs around the mouse Gdf6 gene (mm4/NCBI build 32, October 2003; shown is the interval chr4:9,641,000–9,850,732, which corresponds to bases 50,001–209,733 of mouse BAC RP23-117O7; GenBank No. AC058786). The top 10 tracks (labeled "EP") depict the positions of MCSs detected with ExactPlus; these are labeled using the format "EP: #-#-#," in which the three numbers indicate the initial seed length (in bases), minimum number of species whose sequence needs to match the initial seed region, and minimum number of additional species' sequences used to extend the initial match in either direction (see Results for details). The two WebMCS tracks depict MCSs corresponding to the top 5% (WebMCS-95) and 2% (WebMCS-98) most conserved sequence. The "Transgenes" track shows the positions of the previously tested 2.9-kb PJE fragment [4] and the 440-bp fragment tested in this study (see Figs. 3 and 4). The "Known Genes" track (top) shows the positions of Gdf6, a Riken clone transcript that probably originates from the Uqcrb pseudogene [4], and a LINE-containing transcript (GenBank No. U156547). The "Spliced ESTs" and "Non-Mouse mRNAs" tracks show data from the UCSC Genome Browser for positions of spliced mouse ESTs and regions having protein-coding homology to nonmouse mRNAs, respectively. The "RepeatMasker" track displays the locations of repetitive elements identified by the RepeatMasker program.
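To illustrate how a percentile cutoff such as WebMCS-95 or WebMCS-98 can be turned into discrete track intervals, the following Python sketch selects the top fraction of bases by a per-base conservation score and merges adjacent selected bases into half-open intervals. It assumes the scores have already been computed; the function name `top_fraction_intervals` and the toy scores are hypothetical, and the sketch does not reproduce the binomial-based scoring that WebMCS itself uses [6].

```python
import math


def top_fraction_intervals(scores, fraction):
    """Return half-open (start, end) intervals covering the top-scoring bases.

    scores   -- one conservation score per reference base (assumed precomputed)
    fraction -- e.g. 0.05 for a "95"-style track, 0.02 for a "98"-style track
    Ties at the cutoff are all kept, so slightly more than `fraction`
    of the bases can be selected.
    """
    n = len(scores)
    if n == 0:
        return []
    k = max(1, math.ceil(fraction * n))          # number of bases to keep
    cutoff = sorted(scores, reverse=True)[k - 1]  # score of the k-th best base

    intervals = []
    start = None
    for i, score in enumerate(scores):
        if score >= cutoff:
            if start is None:
                start = i                 # open a new run of selected bases
        elif start is not None:
            intervals.append((start, i))  # close the current run
            start = None
    if start is not None:
        intervals.append((start, n))
    return intervals


# Toy example with made-up scores for eight bases.
toy_scores = [0.1, 0.9, 0.8, 0.2, 0.1, 0.7, 0.3, 0.1]
print(top_fraction_intervals(toy_scores, 0.25))  # [(1, 3)]
print(top_fraction_intervals(toy_scores, 0.5))   # [(1, 3), (5, 7)]
```

Lowering the fraction yields fewer and shorter intervals, which is the qualitative difference between the two WebMCS tracks described in the caption.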

simultaneously. Both programs are designed to detect MCSs, although the underlying algorithms used by each are quite distinct. Note that we use the term MCS to refer to a conserved region detected by multispecies sequence comparisons, regardless of the specific method used to perform those comparisons. Both WebMCS and ExactPlus use the same MultiPipMaker-generated multisequence alignment as input and generate similar output files that can be uploaded to the UCSC Genome Browser for visualization (Fig. 1c).

WebMCS uses a previously described binomial-based approach [6] to derive a "conservation score" for each base in a reference sequence by analyzing windows across a multispecies sequence alignment. WebMCS can be implemented to detect different amounts of conserved sequence; for example, WebMCS-95 and WebMCS-98 identify the top 5% and 2% most highly conserved bases in the reference sequence, respectively. ExactPlus finds small blocks of bases (or "seeds") of a designated size such that each base in a block is identical across a defined minimum number of


species (Antonellis et al., manuscript in preparation). The seeds can then be extended in either direction based on identity across a separately defined minimum number of species (see Materials and methods for details). The extension step was designed in an attempt to detect ancient, strongly conserved sequences that could represent the core of a larger functional element. For example, in regulatory elements such as enhancers, core transcription factor-binding sites may be highly conserved while flanking sequences may have evolved considerably.

Table 1
MCS detection with WebMCS

Measure                                        WebMCS-95   WebMCS-98
No. MCSs detected                              153         58
Total MCS bases                                10,488      4201
Avg. MCS length (bases)                        69          72
No. of coding MCSs (a)                         8           9
No. of noncoding MCSs (b)                      145         49
Coding bases overlapping MCSs (c)              1111        829
Coding bases missed (d)                        251         533
Noncoding MCS bases (e)                        9377        3372
Sensitivity of detecting coding bases (f)      0.816       0.609
Specificity of detecting coding bases (g)      0.106       0.197

(a) Number of MCSs that overlap protein-coding sequence by at least 1 base. Note that the only protein-coding sequence in the region resides in GDF6 exons 1 and 2.
(b) Number of MCSs that do not overlap protein-coding sequence by at least 1 base.
(c) MCS bases that overlap the 1362 bases of GDF6 coding bases.
(d) GDF6 coding bases not overlapping MCSs.
(e) MCS bases not overlapping GDF6 coding sequence (total MCS bases – coding bases).
(f) Coding bases overlapping MCSs/(coding bases overlapping MCSs + coding bases missed).
(g) Coding bases overlapping MCSs/total MCS bases.

To assess qualitatively how WebMCS and ExactPlus perform on the GDF6 data set, we ran each with different input parameters and quantified the number of MCSs and amount of conserved sequence detected in each case. WebMCS-95 and WebMCS-98 detect 153 (average size of 69 bases) and 58 (average size of 72 bases) MCSs, respectively, within the roughly 209,000-base multisequence alignment (Table 1). Of these, 8 and 9 MCSs, respectively, overlap the two GDF6 protein-coding exons (Fig. 2). Not surprisingly, ExactPlus detects different

Fig. 2. MCSs in the immediate region spanning the GDF6 transcription unit. An expanded 21-kb region from Fig. 1c highlights the interval spanning from 2 kb upstream to 1 kb downstream of the GDF6 gene in the UCSC Genome Browser-based view. WebMCS-98 and selected ExactPlus data tracks are shown below the GDF6 gene structure (see Fig. 1 for details). Arrows indicate MCSs located outside the GDF6 exons (see text for details). Individual MCSs in certain tracks have been labeled with "L_#" to indicate the length of the MCS in bases. In the WebMCS-98 track, the light and dark shading of MCSs is used to represent MCSs that do and do not overlap, respectively, a GDF6 exon by at least 1 base.
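The seed-and-extend idea behind the "EP: #-#-#" tracks can be sketched in Python as follows. This is only a minimal illustration of the description given in the text, not the ExactPlus program itself. Several details are assumptions made for the sketch: the alignment is a list of equal-length strings with the reference in row 0, a species "matches" at a column when its base equals the reference base, the species count includes the reference, and each column is judged independently. The function names `column_support` and `seed_and_extend` are hypothetical.

```python
def column_support(alignment, ref=0):
    """Count, per column, the rows (reference included) matching the reference base."""
    ref_row = alignment[ref]
    support = []
    for j, ref_base in enumerate(ref_row):
        if ref_base in "-N":
            support.append(0)  # gaps or unknown bases in the reference never count
        else:
            support.append(sum(1 for row in alignment if row[j] == ref_base))
    return support


def seed_and_extend(alignment, seed_len, min_seed_species, min_ext_species, ref=0):
    """Return half-open (start, end) blocks found by seeding, then extending."""
    support = column_support(alignment, ref)
    n = len(support)
    blocks = []
    for i in range(n - seed_len + 1):
        # Seed: every column in the window is identical in enough species.
        if not all(s >= min_seed_species for s in support[i:i + seed_len]):
            continue
        start, end = i, i + seed_len
        # Extend base by base in each direction under the extension threshold.
        while start > 0 and support[start - 1] >= min_ext_species:
            start -= 1
        while end < n and support[end] >= min_ext_species:
            end += 1
        # Merge with the previous block when the two overlap or touch.
        if blocks and start <= blocks[-1][1]:
            blocks[-1] = (min(blocks[-1][0], start), max(blocks[-1][1], end))
        else:
            blocks.append((start, end))
    return blocks


# Toy four-row alignment (made-up sequences); row 0 is the reference.
toy_alignment = [
    "ACGTACGTACGT",
    "ACGTACGTTTTT",
    "TCGTACGAACGA",
    "TTGTACTAAGGA",
]
print(seed_and_extend(toy_alignment, 3, 4, 3))  # [(1, 7)]
print(seed_and_extend(toy_alignment, 3, 4, 4))  # [(2, 6)]
```

In the toy example, relaxing the extension threshold from 4 to 3 species lengthens the reported block, which mirrors the role of the third number in the parameter triple.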

numbers of MCSs depending upon the parameters used for initial seed size and for the minimum number of species required for seeds and extended bases. For example, using low-stringency parameters (6-base seeds initiated with 6 species and extended in 6 species, designated as 6-6-6), ExactPlus detects 1621 MCSs (Fig. 1c and Table 2). If the seed length is increased to 10 bases and the number of initial species is increased to 9 (10-9-6), only 136 MCSs are detected (Fig. 1c and Table 2). Interestingly, using the latter parameters, nearly all (98.8%) of the MCS bases detected by ExactPlus overlap those detected by WebMCS-95 (Table 2).

As expected, increasing the ExactPlus initial seed length to 25 or 50 bases greatly reduces the number of identified MCSs; however, greater than 97% of the resulting MCS bases overlap with those detected by WebMCS-95 (Fig. 1c and Table 2). Several distal flanking regions are still detected by ExactPlus using a 50-base seed, including: (1) a region at −63.2 kb within the previously mapped limb joint regulatory segment [4], (2) a region at −1.0 kb, (3) two regions in the GDF6 intron, (4) portions of the GDF6 exon 2 coding region and the 3′ UTR, and (5) two regions at +53.7 and +57.7 kb (Figs. 1c, 2, and 3; note that all coordinates are given with respect to the GDF6 5′ end). With sequences from 9 eutherian mammals included in our data set, the initial requirement of 10 species' sequences to match the seed might discriminate between eutherian-specific conserved sequences and more ancient conserved sequences. For example, a conserved region (approximately 75 bases) roughly 1 kb upstream of the first GDF6 exon is nearly identical in all eutherian mammals but not in any noneutherian species (Figs. 1b, 1c, and 2); this may reflect a eutherian-specific regulatory sequence(s).

Increasing the ExactPlus minimum-species number at initial seed to 12 results in the identification of regions that are conserved in all mammals and at least one of Xenopus, Fugu, or zebrafish. Interestingly, this identifies two MCSs in the GDF6 intron (Figs. 1c and 2, Table 2), indicating ancient conserved sequences. This is consistent with the recent findings that noncoding sequences that are highly conserved between mammals and fish often reside near developmentally regulated genes [21,22].

Conserved sequences in the GDF6 promoter, intron, and 3′ UTR

The above results indicate the presence of several highly conserved sequences close to the GDF6 gene that may be candidates for ancient cis-acting regulatory elements. Indeed, in zebrafish and Xenopus, Gdf6 is transcribed in the dorsal retina and dorsal neural tube [23,24], and similar patterns of mouse Gdf6 expression were implied by BAC-transgene experiments [4]. Therefore we scrutinized the area proximal to GDF6 in more detail. Fig. 2 depicts the positions of MCSs detected by WebMCS and ExactPlus within the region immediately encompassing the GDF6 transcription unit. At −1.0 kb relative to the GDF6 translation start site an MCS was detected by WebMCS-98 and ExactPlus (using two sets of ExactPlus parameters, 10-9-6 and 50-6-9; this MCS is labeled "L_69" in the EP: 50-6-9 parameter track within Fig. 2). WebMCS-98 and ExactPlus (with 25-6-2 parameters) also detect an MCS just upstream of the mRNA 5′ end (the leftmost L_38 in WebMCS-98 and L_74, respectively), suggesting conservation within the GDF6 promoter; this region was not detected by ExactPlus using a shorter initial seed length and a larger number of initial and extension species (EP: 10-9-6). Both of these regions are within a genomic interval shown to have neural tube and brain enhancer activity affecting GDF6 expression [4].

Three notably large ExactPlus-detected MCSs reside within the GDF6 intron (labeled A, B, and C in Fig. 2). These may function to regulate GDF6 expression in dorsal

Table 2
Relationship of MCSs detected by ExactPlus and WebMCS-95

MCS detection        No. MCSs   Total MCS   No. MCSs overlapping   MCS bases overlapping   WebMCS-95         WebMCS-95
method               detected   bases       with WebMCS-95 (a)     with WebMCS-95 (b)      sensitivity (c)   specificity (d)
WebMCS-95            153        10,488      N/A                    N/A                     N/A               N/A
ExactPlus 6-6-6      1621       16,002      575                    7787                    0.487             0.742
ExactPlus 6-9-6      320        5089        275                    4660                    0.916             0.444
ExactPlus 10-9-6     136        3118        133                    3080                    0.988             0.294
ExactPlus 25-6-2     42         2973        42                     2890                    0.972             0.276
ExactPlus 25-9-6     21         901         21                     901                     1.000             0.086
ExactPlus 50-6-9     8          543         8                      543                     1.000             0.052
ExactPlus 6-10-2     114        4348        110                    4140                    0.952             0.395
ExactPlus 10-10-2    57         2844        57                     2823                    0.993             0.269
ExactPlus 25-10-2    7          637         7                      637                     1.000             0.061
ExactPlus 10-12-2    8          511         8                      511                     1.000             0.049

N/A, not applicable.
(a) Number of ExactPlus MCSs that overlap with WebMCS-95 MCSs by at least 1 base.
(b) Number of ExactPlus MCS bases that overlap with WebMCS-95 MCS bases.
(c) Fraction of ExactPlus MCS bases that overlap with WebMCS-95: MCS bases overlapping with WebMCS-95 MCSs/total ExactPlus MCS bases.
(d) Fraction of ExactPlus MCS bases also detected by WebMCS-95: MCS bases overlapping with WebMCS-95/10,488.
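The sensitivity and specificity columns of Tables 1 and 2 follow directly from the footnote formulas. The short Python sketch below recomputes a few of the tabulated values from the base counts given in the tables; the function names are hypothetical and rounding to three decimals is assumed to match the presentation in the tables.

```python
WEBMCS95_TOTAL_BASES = 10488  # total WebMCS-95 MCS bases (Tables 1 and 2)


def coding_metrics(coding_overlap, coding_missed, total_mcs_bases):
    """Table 1 footnotes f and g: sensitivity and specificity for coding bases."""
    sensitivity = coding_overlap / (coding_overlap + coding_missed)
    specificity = coding_overlap / total_mcs_bases
    return round(sensitivity, 3), round(specificity, 3)


def overlap_metrics(overlap_bases, exactplus_total_bases):
    """Table 2 footnotes c and d: overlap of ExactPlus MCS bases with WebMCS-95."""
    sensitivity = overlap_bases / exactplus_total_bases
    specificity = overlap_bases / WEBMCS95_TOTAL_BASES
    return round(sensitivity, 3), round(specificity, 3)


# Table 1 rows
print(coding_metrics(1111, 251, 10488))  # WebMCS-95 -> (0.816, 0.106)
print(coding_metrics(829, 533, 4201))    # WebMCS-98 -> (0.609, 0.197)

# Table 2 rows
print(overlap_metrics(7787, 16002))      # ExactPlus 6-6-6  -> (0.487, 0.742)
print(overlap_metrics(3080, 3118))       # ExactPlus 10-9-6 -> (0.988, 0.294)
```

Note that in both rows of Table 1 the denominator of the sensitivity is 1362, the total number of GDF6 coding bases.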


retina or genitalia [4]. In addition to the MCSs overlapping the coding portion of exon 2, an MCS was also found in the 3′ UTR just upstream of the noncanonical ATTAAA polyadenylation signal. This conserved sequence may play a role in the posttranscriptional regulation of GDF6 mRNA. We were unable to detect evidence for an RNA secondary structure within this MCS (e.g., long hairpins or stem–loops). Interestingly, a highly conserved sequence has been found in the 3′ UTR of another BMP family gene, BMP2, where it may regulate mRNA stability [25].

Conserved sequences regulate GDF6 expression

In the embryonic limb bud, GDF6 is transcribed in a stripe-like pattern that marks the locations of interzones, histologically distinct regions that give rise to skeletal joints [8,9]. Previous transgenic studies indicated that a region far upstream (>60 kb) of GDF6 is required for its expression in proximal limb joints during embryonic development [4]. Furthermore, a 2.9-kb fragment from this region can drive expression of a LacZ-containing minigene cassette in elbow, knee, shoulder, and hip joints [4]. The presumed enhancer in this fragment was called the PJE (proximal joint element). Two potential heterodimeric binding sites for Pbx1/Hox transcription factors [26] have been detected in this region, consistent with a role for this sequence in proximal limb patterning [16,27].

We therefore examined WebMCS and ExactPlus results for this region. Both methods detect multiple MCSs in the 2.9-kb region (Fig. 3). Both programs also identify a highly conserved region of approximately 300 bp within the 2.9-kb region found to contain the PJE (Fig. 3a). This conserved region is present in all mammalian sequences (except pig, for which sequence coverage is lacking in this region), but not Fugu, zebrafish, or Xenopus (Fig. 1b). Thus, this likely reflects a mammal-specific conserved element. There is also a less conserved region several hundred bases downstream that is detected by WebMCS-95 and ExactPlus with low-stringency parameters (6-6-6); however, this region is not detected by WebMCS-98 or ExactPlus with higher stringency parameters (Fig. 3a).

Fig. 3. Highly conserved regions and potential transcription factor binding sites in the PJE region. (a) UCSC Genome Browser-based view highlighting the positions of MCSs in a 1.2-kb interval within the PJE region (see Figs. 1b and 1c for orientation). Individual MCSs in certain tracks have been labeled with "L_#" to indicate the length of the MCS in bases. "2.9Kb_tg" and "440bp_tg" represent the locations of the 2.9-kb transgene [4] and 440-bp transgene (PJE-440; see Results), respectively. Arrows indicate locations of three potential Pbx1/Hox binding motifs (see Results). (b and c) Putative transcription factor binding motifs within specific MCSs (indicated in (a) by asterisks). Coverage across the motif regions is shown for selected MCSs that were detected using different parameters, as indicated.
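Putative binding sites of the kind discussed here are typically located by scanning a sequence for a consensus while tolerating a limited number of mismatches. The Python sketch below illustrates such a scan using the two consensus sequences and mismatch limits reported in Materials and methods, written with IUPAC codes (Y = C/T, W = A/T). It is a simplified stand-in and not the MacVector tool used in this study; the function name `scan_consensus`, the forward-strand-only search, and the toy sequence are assumptions made for illustration.

```python
# Bases allowed by each consensus symbol (only the symbols needed here).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "Y": "CT", "W": "AT"}

PBX1_HOX = "ATGATTTAYGAC"  # ATGATTTA(C/T)GAC
LEF_TCF = "CTTTGWW"        # CTTTG(A/T)(A/T)


def scan_consensus(seq, consensus, max_mismatches):
    """Return (position, matched window, mismatch count) for each forward-strand hit."""
    seq = seq.upper()
    width = len(consensus)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        mismatches = sum(
            1 for base, code in zip(window, consensus) if base not in IUPAC[code]
        )
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits


# Made-up sequence containing one exact copy of each consensus.
toy_seq = "GGATGATTTACGACTTCTTTGAAGG"
print(scan_consensus(toy_seq, LEF_TCF, 1))   # [(16, 'CTTTGAA', 0)]
print(scan_consensus(toy_seq, PBX1_HOX, 0))  # [(2, 'ATGATTTACGAC', 0)]
```

With the mismatch limit for the longer Pbx1/Hox consensus raised to three, as in the reported analysis, a real sequence would be expected to yield additional imperfect candidates that require manual inspection.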


Within the highly conserved approximately 300-bp region reside several stretches of high conservation, as detected by ExactPlus. One block of 56 bp was detected using either 25-9-6 or 50-6-9 parameters; this may reflect a core enhancer element (Fig. 3b). Closer inspection of this block revealed that it overlaps with the second of the two previously reported Pbx1/Hox binding motifs [16]. A third Pbx1/Hox binding motif was also found. Interestingly, this Pbx1/Hox motif is flanked by two CTTT(T/A)A(T/A) motifs, which are similar to the consensus Lef1/TCF1 binding site [28,29]. WebMCS-98 detects one long (L_221) and one short (L_35) MCS in this region. While the former contains the three Pbx1/Hox motifs, the latter contains three imperfect matches to the Lef1/TCF consensus sequence (Fig. 3c). These results are particularly striking given that the Lef1/TCF factors function in the Wnt ligand signaling pathway [30,31] and that Wnt14 has been proposed to be a key regulator of early joint development [32,33]. These data suggest that Hox and Pbx factors may interact with the Wnt signaling pathway to regulate joint-specific gene expression.

To test if the highly conserved L_56 region indicates the core of a cis-acting sequence that enhances GDF6 expression in the proximal limb joint, a 440-bp segment encompassing L_56 was subcloned into an Hsp68 promoter–LacZ minigene construct [4,34]. Transgenic mouse embryos containing this construct (PJE-440) were analyzed at midgestation for LacZ expression, with the results summarized in Fig. 4 and Table 3. Of seven independently generated transgenic embryos, several had staining in the eye, which we have found to be a common expression artifact of the

Fig. 4. A conserved sequence located 64 kb upstream of GDF6 functions as a transcriptional enhancer in proximal limb joints. The PJE-440 construct (containing a highly conserved 440-bp portion of the PJE region cloned into a heat-shock promoter/LacZ reporter vector) was injected into one-cell mouse embryos. Transgenic embryos were harvested at 14.5 dpc, cut in half down the midline, stained with X-gal, and cleared with glycerol for imaging. (a) A representative embryo is shown, with white arrowheads indicating LacZ expression in shoulder, elbow, and knee joints. The hip joint was also stained but not retained in the photographed half of the embryo after dissection (data not shown). LacZ expression in the eye is a cryptic effect of the transgene vector and is likely not associated with the 440-bp sequence (see Results). The faint staining in the brain is due to background (i.e., nontransgenic) β-galactosidase activity. (b) Close-up view of the upper arm, showing shoulder and elbow joint staining at the ends of the humerus. (c) Section through the elbow of a transgenic embryo reveals that staining is specific to cells in the humeroradial and humeroulnar joints. (d) Close-up view of the knee joint showing two separate major domains of staining, plus a faint area of more proximal staining near the future patella. (e) Schematic drawing of the two major staining domains relative to the femur and tibia. (f–i) Near-adjacent serial sections through the knee of a transgenic embryo. hu, humerus; ra, radius; ul, ulna; fe, femur; ti, tibia.


Hsp68–LacZ vector backbone (D.P.M., D.M.K., unpublished observations; Catherine Guenther, personal communication) (Fig. 4a and Table 3). However, five of the embryos showed strong LacZ expression in the nascent elbow, knee, shoulder, and hip joints (Figs. 4a, 4b, and 4d), in a pattern essentially identical to that previously observed with the 2.9-kb segment [4]. This indicates that the 440-bp fragment can function as a modular regulatory element capable of activating a heterologous promoter specifically in proximal limb joints.

Table 3
PJE-440 transgene expression in mouse embryos

Embryo   Proximal limb joints   Ectopic expression
1        +++                    Some phalangeal joints, limb tendons, brain, neural tube, stomach
37       +++                    Retina
38       –                      Widespread ectopic staining
65       +
69       –                      Brain (weak)
72       +++                    Retina
82       ++                     Retina, forebrain, neural tube
88       + (a)                  None

(a) Weak staining in elbow joint only.

Histological sections of PJE-440-containing embryos confirmed that the LacZ expression is restricted to the joints (Figs. 4c and 4f–4i), generally within the articular cartilage and also in intervening cells between the cartilage elements and at a developmental stage prior to cavitation of the joint cavity. Interestingly, in the knee joint, two arc-shaped domains of LacZ expression are apparent. Sectioning revealed that these domains curve around the lateral and medial condyles of the distal femur where it articulates with the tibia (Figs. 4d–4i). We also tested a larger construct that, in addition to the 440-bp segment, contains the less well conserved region about 400 bp downstream (see EP: 6-6-6 and WebMCS-95 blocks, right side of Fig. 3a). This larger construct confers LacZ expression in proximal limb joints in a pattern indistinguishable from that of the PJE-440 construct (data not shown). While the functional relevance (if any) of this less well conserved region is unclear, the 440-bp region seems to contain the GDF6 PJE.

Discussion

Previous human and mouse sequence comparisons suggested the presence of numerous conserved noncoding regions within and flanking the GDF6 gene [4,16]. These regions represent tantalizing candidates for serving as cis-acting regulatory elements that mediate the complex expression of GDF6 [8,9]. Here, we report an extension of those studies that has involved the generation and comparison of the sequence of the genomic region encompassing GDF6 in multiple additional vertebrates. Two analytical methods, WebMCS and ExactPlus, were used to analyze these multispecies sequences, allowing the identification of a large set of MCSs. The majority of these reside within noncoding regions. More detailed functional analysis of one MCS region identified an enhancer (the PJE) that mediates GDF6 expression in limb joints. This has refined the PJE location from 2.9 kb to 440 bp (Figs. 3 and 4), although both WebMCS and ExactPlus suggest the critically conserved core of the PJE is closer to 300 bp (Fig. 3; see below).

Our findings indicate that ExactPlus and WebMCS represent complementary approaches for identifying conserved sequences, with the degree of overlap depending on input parameters. Although virtually all MCSs detected by highly stringent implementations of ExactPlus overlap with WebMCS elements, some important differences are notable when smaller seeds and matching-species parameters are used. For example, ExactPlus is more likely to detect small sequence blocks (e.g., on the scale of transcription factor-binding sites) than WebMCS, which analyzes sequences in 25-base windows [6]. We also suggest that ExactPlus may be a useful alternative to methods that search for consensus motifs for transcription factor binding sites. For example, the potential nonconsensus TCF/Lef binding sites depicted in Fig. 3 may fall into this category. At the same time, regulatory elements often contain multiple transcription factor-binding sites that function redundantly, with the amount of sequence conservation at any one site being variable among species. In these situations, WebMCS should be particularly effective since it assesses conservation with sliding windows and does not require 100% sequence identity. In other words, WebMCS probably tolerates plasticity better than ExactPlus.

The amount and distribution of conserved sequences across the GDF6 genomic region are broadly similar among the nonrodent eutherian mammals (see Fig. 1b). A very different pattern is seen for platypus (a noneutherian mammal), with conservation largely confined to the most stringently defined MCSs. Comparisons with the orthologous chicken sequence reveal similar findings, though fewer regions of conservation are noted. Conservation between the mouse and the nonamniote species (Xenopus tropicalis, Fugu, and zebrafish) is mostly confined to GDF6 coding sequences and a minimal number of noncoding regions near GDF6 and was well detected by both WebMCS and ExactPlus. Across the GDF6 region, the platypus sequence appears to be the most effective for identifying noncoding MCSs, particularly for segments flanking the gene. These results are consistent with a recent larger study investigating the utility of noneutherian mammal (marsupial and monotreme) sequences for identifying conserved genomic regions [35].

MCS analysis indicates the PJE is well conserved among amniote species. The discovery of the PJE also provides new insights into potential roles for GDF6 in the patterning

of embryonic limb joints. GDF6 transcription in the embryonic elbow joint has been previously documented [4,9]. However, because the GDF6 mRNA is transcribed at low levels we have found it difficult to characterize its expression in the other proximal limb joints. In contrast, the more sensitive transgenic LacZ assay has proven valuable for characterizing the function of the GDF6 PJE. Taken together with our comparative sequence analyses, the LacZ expression data strongly suggest a conserved role for GDF6 in proximal joint patterning. Interestingly, in the knee joint PJE-regulated transgene expression was restricted to subregions within the joint cavity. Given the ability of GDF proteins to stimulate chondrogenesis [36–38], the transgenic expression pattern suggests that GDF6 regulates growth of the adjacent femoral condyles.

MCS analysis was useful for stimulating hypotheses for candidate PJE-binding factors. The PJE contains putative binding sites for Pbx1/Hox heterodimers, factors that both pattern proximal limb tissue [27]. Studies in Drosophila suggest that Hox-binding regulatory modules typically require additional inputs from interacting signaling pathways [39]. We hypothesize that the PJE integrates positional information provided by the Pbx/Hox factors with other signaling pathways that specify skeletal joints. Lef1/TCF factors function in the Wnt signaling pathway, raising the intriguing possibility that Hox and Pbx patterning factors interact with Wnt proteins to specify expression in developing joints. Indeed, Wnt14 is thought to direct early synovial joint development [32]. Further testing should determine if Pbx, Hox, and/or TCF/Lef proteins bind the PJE in vitro or in vivo. We also note that the long tracts of sequence conservation in the PJE are not easily explained by the detected Pbx/Hox and Lef1/TCF binding sites, so the binding of additional transcription factors is probably important for PJE function. Further analysis of ExactPlus data may help characterize binding sites for such factors.

In addition to skeletal joints, GDF6 is transcribed in the developing skull, larynx, and digits, and previous BAC-transgenic data suggest that GDF6 is expressed in neural tube, retina, teeth, and other tissues [4]. Interestingly, these structures show varying degrees of morphological diversity across vertebrates. Thus, GDF6 enhancers might be expected to vary considerably with respect to sequence conservation. Studies of other developmentally regulated genes have shown that the more conserved the sequence between highly divergent species (for example, mammals and fish), the more likely it is to be functionally important [21,22]. However, other studies indicate that many sequences that serve to regulate gene expression are not conserved between mammals and fish [40,41]. At least three GDF6 noncoding MCSs are conserved between mouse and frog and/or fish, making them candidates for being ancient GDF6 enhancers. Other GDF6-associated MCSs, such as the PJE, are not found in fish or frog. We speculate that this reflects the evolution of new GDF6 regulatory capabilities correlating with morphological adaptations in amniotes (e.g., limb joint adaptations compatible with terrestrial mobility). Further transgenic experiments to compare the regulatory functions of candidate GDF6 enhancers in mammals, frog, and fish will be useful for investigating this possibility.

Materials and methods

BAC clones and sequences

The Gdf6-containing mouse BAC RPCI23-117O7 sequence (GenBank No. AC058786) served as the reference for comparative analyses and corresponds to UCSC Genome Browser coordinates chr4:9,641,000–9,850,732 (mm4/NCBI build 32, October 2003). Orthologous human and rat genomic sequences were retrieved from the respective genome-wide data sets [16,42,43]. Minimally overlapping GDF6-containing BACs from rat, chimpanzee, baboon, cow, pig, cat, dog, platypus, and zebrafish were identified [17] and sequenced as part of the NISC Comparative Sequencing Program [5] (www.nisc.nih.gov). Accession numbers for all of the above BAC sequences were previously reported [16] except for zebrafish BAC CH211-216G21 (GenBank No. AC139623). Sequences were finished to a grade that is of intermediate quality between phase I (full shotgun) and phase III (contiguous, near perfect); this is called "comparative-grade finished sequence" and is an enhanced version of phase II finished sequence [44]. Additionally, sequence from an orthologous Gdf6-containing X. tropicalis BAC (CH216-129O13, GenBank No. AC147884) was identified and downloaded from the DOE/JGI Web site (http://genome.jgi-psf.org/xenopus). Additional orthologous sequences were obtained from the UCSC Genome Browser as follows: (1) chimpanzee, used to fill in gaps in the chimpanzee BAC sequence (NCBI chimpanzee draft genome sequence build 1, November 2003; chr7:99,408,595–99,511,759); (2) chicken (build galGal2, Chicken Genome Sequencing Consortium, February 2004; chr2:124,800,000–124,950,000; also see www.genome.wustl.edu/projects/chicken); (3) Fugu rubripes (Fugu build fr1, August 2002, chrUn:140,629,874–140,862,748; International Fugu Genome Consortium; also see www.fugu-sg.org).

Comparative analysis and MCS tools

Multispecies sequence comparisons were performed using MultiPipMaker [20], as described [16]. MCSs were detected using ExactPlus (Antonellis et al., manuscript in preparation; http://research.nhgri.nih.gov/projects/exactplus) or WebMCS [6] (http://research.nhgri.nih.gov/MCS). Briefly, ExactPlus finds small blocks of bases (or "seeds") of a designated size such that each base in a block is identical across a defined minimum number of species in the MultiPipMaker alignment. The seeds can then be extended in either

M.E. Portnoy et al. / Genomics 86 (2005) 295–305

direction in a base-by-base fashion, on the condition that each extended base must be identical in a defined minimum number of species (note that the minimum number of species can be different in the initial seed and subsequent extension steps). The extension steps attempt to detect the presence of ancient, strongly conserved sequences that represent the core of a larger functional element. For example, in regulatory elements such as enhancers, the core transcription factor-binding sites can be highly conserved, while the immediate flanking sequences may have evolved considerably. For all analyses, the presumptive mouse Gdf6 transcription start site was assumed to be the 5′ end of a Gdf6 mRNA sequence (GenBank No. AJ537425).

Transcription factor site analysis

MacVector software was used to scan the mouse PJE sequence for previously reported consensus binding sites for Pbx1/Hox heterodimers [26] and Lef/TCF. The Pbx1/Hox binding sequence ATGATTTA(C/T)GAC was previously determined using heterodimers of Pbx1 protein with Hox proteins of Hox groups 9 and 10 [26]. The Lef/TCF binding consensus used was CTTTG(A/T)(A/T), to reflect both the reported TCF1 binding site CTTTGTT [29] and the reported Lef1 binding site CCTTTG(A/T)(A/T) [28]. Consensus binding sites were scanned using the MacVector Nucleic Acid Subsequence analysis tool set to permit mismatches anywhere within the consensus sequence. Up to three mismatches were permitted for the Hox/Pbx1 consensus and up to one mismatch was allowed for the Lef/TCF consensus.

Generation of the PJE-440 transgene construct

A 440-bp segment was amplified by PCR from the Gdf6-containing mouse BAC C-bGeo [4] using primers 5′-GTGAGGCCAAACAGGCCAATCCCTGTATTACAAGGACTCAAATTCT-3′ and 5′-GTGAGGCCTGTTTGGCCTATACCAACACCTATATAATCAAAATTTAAACA-3′. The resulting product was cloned into the SfiI site of pSfi-HspLacZ [4], a derivative of pHsp68-LacZ [15], linearized with SalI, and purified prior to pronuclear injection into mouse embryos [15].

Generation and analysis of transgenic mouse embryos

Transgene constructs were microinjected into FVB or C57BLB6/D2 F1 hybrid mouse embryos using standard pronuclear injection methods by the Vanderbilt University Transgenic Mouse/ES Cell Shared Resource, in accordance with protocols approved by the Vanderbilt University Institutional Animal Care and Use Committee. Transgenic embryos were verified by PCR from yolk sac or tail DNA samples. Whole-mount X-gal staining was performed as described [4]. For histological analysis, X-gal-stained embryos were dehydrated in ethanol/1× PBS, equilibrated with isopropanol/paraffin, and embedded in paraffin overnight. Sections (10–12 μm) were cut using a microtome, transferred to glass slides, dewaxed briefly with xylene, and counterstained with eosin prior to microscopic imaging.

Acknowledgments

We thank numerous people associated with the NISC Comparative Sequencing Program, in particular Robert Blakesley, Gerry Bouffard, Jennifer McDowell, Baishali Maskeri, Nancy Hanson, the many dedicated mapping and sequencing technicians, and other staff. We also thank Laura Selenke and Lissett Ramirez for expert technical assistance and Karen Deal, Maureen Gannon, Anna Means, and Laura Wilding for generously sharing equipment and advice. We also thank Ronald Chandler for helpful discussions. Kelly J. McDermott was supported by NIH Genetics Training Grant 1T32GM62758-03. David M. Kingsley was supported by NIH Grant 5R37AR042236-12. Douglas P. Mortlock was supported by NIH Grant 1R01HD47880-01. Transgenic mice were generated by the Vanderbilt University Transgenic and ES Cell Shared Resource, which is supported by the Vanderbilt Cancer, Diabetes, Kennedy, and Vision Centers.

References

[1] G.G. Loots, et al., Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons, Science 288 (2000) 136–140.
[2] L.A. Lettice, et al., A long-range Shh enhancer regulates expression in the developing limb and fin and is associated with preaxial polydactyly, Hum. Mol. Genet. 12 (2003) 1725–1735.
[3] L.A. Pennacchio, Insights from human/mouse genome comparisons, Mamm. Genome 14 (2003) 429–436.
[4] D.P. Mortlock, C. Guenther, D.M. Kingsley, A general approach for identifying distant regulatory elements applied to the Gdf6 gene, Genome Res. 13 (2003) 2069–2081.
[5] J.W. Thomas, et al., Comparative analyses of multi-species sequences from targeted genomic regions, Nature 424 (2003) 788–793.
[6] E.H. Margulies, M. Blanchette, N.C.S. Program, D. Haussler, E.D. Green, Identification and characterization of multi-species conserved sequences, Genome Res. 13 (2003) 2507–2518.
[7] A. Siepel, D. Haussler, Combining phylogenetic and hidden Markov models in biosequence analysis, J. Comput. Biol. 11 (2004) 413–428.
[8] E.E. Storm, et al., Limb alterations in brachypodism mice due to mutations in a new member of the TGF beta-superfamily, Nature 368 (1994) 639–643.
[9] S.H. Settle Jr., et al., Multiple joint and skeletal patterning defects caused by single and double mutations in the mouse Gdf6 and Gdf5 genes, Dev. Biol. 254 (2003) 116–130.
[10] J.A. King, E.E. Storm, P.C. Marker, R.J. Dileone, D.M. Kingsley, The role of BMPs and GDFs in development of region-specific skeletal structures, Ann. N.Y. Acad. Sci. 785 (1996) 70–79.
[11] D.M. Kingsley, et al., The mouse short ear skeletal morphogenesis locus is associated with defects in a bone morphogenetic member of the TGF beta superfamily, Cell 71 (1992) 399–410.
[12] N. Jena, C. Martin-Seisdedos, P. McCue, C.M. Croce, BMP7 null mutation in mice: developmental defects in skeleton, kidney, and eye, Exp. Cell Res. 230 (1997) 28–37.
[13] S. Settle, et al., The BMP family member Gdf7 is required for seminal vesicle growth, branching morphogenesis, and cytodifferentiation, Dev. Biol. 234 (2001) 138–150.
[14] R.J. DiLeone, L.B. Russell, D.M. Kingsley, An extensive 3′ regulatory region controls expression of Bmp5 in specific anatomical structures of the mouse embryo, Genetics 148 (1998) 401–408.
[15] R.J. DiLeone, G.A. Marcus, M.D. Johnson, D.M. Kingsley, Efficient studies of long-distance Bmp5 gene regulation using bacterial artificial chromosomes, Proc. Natl. Acad. Sci. USA 97 (2000) 1612–1617.
[16] D.P. Mortlock, M.E. Portnoy, R.L. Chandler, N.C.S. Program, E.D. Green, Comparative sequence analysis of the Gdf6 locus reveals a duplicon-mediated chromosomal rearrangement in rodents and rapidly diverging coding and regulatory sequences, Genomics 84 (2004) 814–823.
[17] J.W. Thomas, et al., Parallel construction of orthologous sequence-ready clone contig maps in multiple species, Genome Res. 12 (2002) 1277–1285.
[18] W.J. Kent, et al., The human genome browser at UCSC, Genome Res. 12 (2002) 996–1006.
[19] S. Schwartz, et al., PipMaker—A Web server for aligning two genomic DNA sequences, Genome Res. 10 (2000) 577–586.
[20] S. Schwartz, et al., MultiPipMaker and supporting tools: alignments and analysis of multiple genomic DNA sequences, Nucleic Acids Res. 31 (2003) 3518–3524.
[21] M.A. Nobrega, I. Ovcharenko, V. Afzal, E.M. Rubin, Scanning human gene deserts for long-range enhancers, Science 302 (2003) 413.
[22] A. Woolfe, et al., Highly conserved non-coding sequences are associated with vertebrate development, PLoS Biol. 3 (2004) e7.
[23] M. Rissi, J. Wittbrodt, E. Delot, M. Naegeli, F.M. Rosa, Zebrafish radar: a new member of the TGF-beta superfamily defines dorsal regions of the neural plate and the embryonic retina, Mech. Dev. 49 (1995) 223–234.
[24] C. Chang, A. Hemmati-Brivanlou, Xenopus GDF6, a new antagonist of noggin and a partner of BMPs, Development 126 (Suppl.) (1999) 3347–3357.
[25] K.L. Abrams, J. Xu, C. Nativelle-Serpentini, S. Dabirshahsahebi, M.B. Rogers, An evolutionary and molecular analysis of Bmp2 expression, J. Biol. Chem. 279 (2004) 15916–15928.
[26] W.F. Shen, S. Rozenfeld, H.J. Lawrence, C. Largman, The Abd-B-like Hox homeodomain proteins can be subdivided by the ability to form complexes with Pbx1a on a novel DNA target, J. Biol. Chem. 272 (1997) 8198–8206.
[27] L. Selleri, et al., Requirement for Pbx1 in skeletal patterning and programming chondrocyte proliferation and differentiation, Development 128 (2001) 3543–3557.
[28] K. Giese, A. Amsterdam, R. Grosschedl, DNA-binding properties of the HMG domain of the lymphoid-specific transcriptional regulator LEF-1, Genes Dev. 5 (1991) 2567–2578.
[29] M. van de Wetering, M. Oosterwegel, D. Dooijes, H. Clevers, Identification and cloning of TCF-1, a T lymphocyte-specific transcription factor containing a sequence-specific HMG box, EMBO J. 10 (1991) 123–132.
[30] X. He, A Wnt–Wnt situation, Dev. Cell 4 (2003) 791–797.
[31] R. DasGupta, E. Fuchs, Multiple roles for activated LEF/TCF transcription complexes during hair follicle development and differentiation, Development 126 (1999) 4557–4568.
[32] C. Hartmann, C.J. Tabin, Wnt-14 plays a pivotal role in inducing synovial joint formation in the developing appendicular skeleton, Cell 104 (2001) 341–351.
[33] X. Guo, et al., Wnt/β-catenin signaling is sufficient and necessary for synovial joint formation, Genes Dev. 18 (2004) 2004–2417.
[34] R. Kothary, et al., Inducible expression of an hsp68–lacZ hybrid gene in transgenic mice, Development 105 (1989) 707–714.
[35] E.H. Margulies, et al., Comparative sequencing provides insights about the structure and conservation of marsupial and monotreme genomes, Proc. Natl. Acad. Sci. USA 102 (2005) 3354–3359.
[36] P.H. Francis-West, et al., Mechanisms of GDF-5 action during skeletal development, Development 126 (Suppl.) (1999) 1305–1315.
[37] R. Merino, et al., Expression and function of Gdf-5 during digit skeletogenesis in the embryonic chick leg bud, Dev. Biol. 206 (1999) 33–45.
[38] E.E. Storm, D.M. Kingsley, GDF5 coordinates bone and joint formation during digit development, Dev. Biol. 209 (1999) 11–27.
[39] K.A. Guss, C.E. Nelson, A. Hudson, M.E. Kraus, S.B. Carroll, Control of a genetic regulatory network by a selector gene, Science 292 (2001) 1164–1167.
[40] B. Gottgens, et al., Transcriptional regulation of the stem cell leukemia gene (SCL)—Comparative analysis of five vertebrate SCL loci, Genome Res. 12 (2002) 749–759.
[41] B. Gottgens, et al., Analysis of vertebrate SCL loci identifies conserved enhancers, Nat. Biotechnol. 18 (2000) 181–186 (Erratum appears in Nat. Biotechnol. 18 (2000) 1021).
[42] R.A. Gibbs, et al., Genome sequence of the Brown Norway rat yields insights into mammalian evolution, Nature 428 (2004) 493–521.
[43] E.S. Lander, et al., Initial sequencing and analysis of the human genome, Nature 409 (2001) 860–921.
[44] R.W. Blakesley, et al., An intermediate grade of finished genomic sequence suitable for comparative analyses, Genome Res. 14 (2004) 2235–2244.
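As an illustration of the mismatch-tolerant consensus scan described under "Transcription factor site analysis", the following Python sketch slides each consensus along a sequence and reports every window within the allowed number of mismatches. It is not MacVector's implementation: the function names, the IUPAC encoding of the two consensus strings (Y for C/T, W for A/T), and the choice to scan both strands are assumptions made for this example.

```python
# Hypothetical sketch of a consensus scan that tolerates mismatches.
# IUPAC nucleotide codes mapped to the bases each code allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Complement table that also covers the ambiguity codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

# Consensus sites from the text, written with IUPAC codes (an assumption).
PBX_HOX = "ATGATTTAYGAC"  # ATGATTTA(C/T)GAC
LEF_TCF = "CTTTGWW"       # CTTTG(A/T)(A/T)


def reverse_complement(seq):
    """Return the reverse complement of an IUPAC-coded sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def count_mismatches(consensus, window):
    """Count positions where the window base is not allowed by the consensus."""
    return sum(base not in IUPAC[code] for code, base in zip(consensus, window))


def scan(sequence, consensus, max_mismatches):
    """Return sorted (start, strand, window, mismatches) tuples for each hit.

    Starts are 0-based positions on the forward strand. Scanning the
    reverse strand as well is an assumption of this sketch.
    """
    sequence = sequence.upper()
    hits = []
    for strand, motif in (("+", consensus), ("-", reverse_complement(consensus))):
        for start in range(len(sequence) - len(motif) + 1):
            window = sequence[start:start + len(motif)]
            mismatches = count_mismatches(motif, window)
            if mismatches <= max_mismatches:
                hits.append((start, strand, window, mismatches))
    return sorted(hits)
```

With the thresholds given in the text, the calls would be `scan(seq, PBX_HOX, 3)` and `scan(seq, LEF_TCF, 1)`, where `seq` stands for the PJE sequence (not reproduced here).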

Antonellis et al., 2006. Deletion of long-range sequences at Sox10 compromises developmental expression in a mouse model of Waardenburg–Shah (WS4) syndrome

Human Molecular Genetics, 2006, Vol. 15, No. 2 259–271
doi:10.1093/hmg/ddi442
Advance Access published on December 5, 2005

Deletion of long-range sequences at Sox10 compromises developmental expression in a mouse model of Waardenburg–Shah (WS4) syndrome

Anthony Antonellis1,†, William R. Bennett4,†, Trevelyan R. Menheniott4, Arjun B. Prasad1,5, Shih-Queen Lee-Lin1, NISC Comparative Sequencing Program2, Eric D. Green1,2, Derek Paisley4, Robert N. Kelsh4, William J. Pavan3,* and Andrew Ward4

1Genome Technology Branch, 2NIH Intramural Sequencing Center (NISC) and 3Genetic Disease Research Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA, 4Developmental Biology Programme and Centre for Regenerative Medicine, Department of Biology and Biochemistry, University of Bath, Bath, UK and 5Graduate Genetics Program, The George Washington University, Washington, DC 20052, USA

Received October 12, 2005; Revised and Accepted November 25, 2005

The transcription factor SOX10 is mutated in the human neurocristopathy Waardenburg–Shah syndrome (WS4), which is characterized by enteric aganglionosis and pigmentation defects. SOX10 directly regulates genes expressed in neural crest lineages, including the enteric ganglia and melanocytes. Although some SOX10 target genes have been reported, the mechanisms by which SOX10 expression is regulated remain elusive. Here, we describe a transgene-insertion mutant mouse line (Hry) that displays partial enteric aganglionosis, a loss of melanocytes, and decreased Sox10 expression in homozygous embryos. Mutation analysis of Sox10 coding sequences was negative, suggesting that non-coding regulatory sequences are disrupted. To isolate the Hry molecular defect, Sox10 genomic sequences were collected from multiple species, comparative sequence analysis was performed and software was designed (ExactPlus) to identify identical sequences shared among species. Mutation analysis of conserved sequences revealed a 15.9 kb deletion located 47.3 kb upstream of Sox10 in Hry mice. ExactPlus revealed three clusters of highly conserved sequences within the deletion, one of which shows strong enhancer potential in cultured melanocytes. These studies: (i) present a novel hypomorphic Sox10 mutation that results in a WS4-like phenotype in mice; (ii) demonstrate that a 15.9 kb deletion underlies the observed phenotype and likely removes sequences essential for Sox10 expression; (iii) combine a novel in silico method for comparative sequence analysis with in vitro functional assays to identify candidate regulatory sequences deleted in this strain. These studies will direct further analyses of Sox10 regulation and provide candidate sequences for mutation detection in WS4 patients lacking a SOX10-coding mutation.

INTRODUCTION

The neural crest (NC) is a multi-potent, migratory population of cells that arise from the neural tube boundary during embryonic development. The NC gives rise to a wide variety of structures, including the craniofacial skeleton, neurons and Schwann cells of the peripheral nervous system and melanocytes (1). Delineating the molecular pathways and transcriptional hierarchies that mediate NC cell development is key toward understanding NC-related human diseases (neurocristopathies) as well as normal human development.

*To whom correspondence should be addressed at: Genetic Disease Research Branch, National Human Genome Research Institute, National Institutes of Health, 49 Convent Dr., Building 49, Room 4A82, Bethesda, MD 20892, USA. Tel: +1 3014967584; Fax: +1 3014022170; Email: [email protected]

†The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

Published by Oxford University Press 2005. The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact: [email protected]


The SRY-box containing 10 [SOX10; Online Mendelian Inheritance in Man (OMIM) 602229] transcription factor has a clear role in NC cell development. This is supported by data showing that SOX10 is expressed in multiple NC derivatives and that SOX10 mutations cause neurocristopathies in man and mouse. In situ hybridization analyses of mouse embryos revealed that Sox10 expression is first detectable in regions of the dorsal neural tube at embryonic day 8.5 (E8.5). Thereafter, Sox10 expression is apparent in cranial (E9.5), dorsal root (E9.5), sympathetic (E10.5) and enteric ganglia (E12.5), as well as presumptive melanoblasts (E12.5) (2–4). Furthermore, Sox10 expression has been detected in embryonic and adult mouse Schwann cells (3–5). Similar SOX10 expression patterns have been described in humans (6,7).

Two human diseases have been associated with SOX10 mutations: Waardenburg–Shah syndrome type 4 (WS4; OMIM 277580) and a multi-syndrome disorder (PCWH; OMIM 609136) exhibiting peripheral demyelinating neuropathy, central dysmyelinating leukodystrophy, Waardenburg syndrome and Hirschsprung disease (PCWH). WS4 is characterized by cochlear deafness, enteric aganglionosis and pigmentary defects and is caused by heterozygous SOX10 mutations leading to a loss-of-function (8–10). PCWH is characterized by leukodystrophy consistent with Pelizaeus–Merzbacher disease, peripheral neuropathy consistent with Charcot–Marie–Tooth disease type I and the aforementioned WS4 phenotype. PCWH is also caused by heterozygous SOX10 mutations; however, the molecular pathology is predicted to be due to a dominant-negative effect (8,11). Consistent with the human pathology, mice heterozygous for spontaneous (Sox10Dom) and targeted (Sox10tm1Weg) Sox10 mutations display a specific loss of enteric ganglia and coat color spotting (4,12). Homozygosity for Sox10 mutations is embryonic lethal in mouse and zebrafish (13,14), underscoring the critical role of this transcription factor in vertebrate development.

Consistent with the mutant phenotypes described in human and mouse, SOX10 has been shown to directly regulate the transcription of genes expressed in and essential for the development of affected cell populations. These include RET in developing enteric ganglia, P0 and CX32 in Schwann cells and MITF and DCT in melanocytes (reviewed in 15). However, although some targets of SOX10 have been reported in recent years, the transcriptional regulation of SOX10 itself remains poorly defined; indeed, the SOX10 promoter and cis-acting regulatory elements have yet to be identified.

Here, we describe a novel mouse strain (designated Sox10Tg(Igf2P3Luc)HarryWard or Sox10Hry) characterized by decreased embryonic Sox10 expression and an associated reduction of melanocytes and distal enteric neurons. Further, we identify the molecular defect in these mice, which lies over 40 kb upstream of the transcription start site and corresponds to a 15.9 kb deletion of non-coding sequence containing what appears to be a highly conserved enhancer(s). Our results show that long-range non-coding mutations of Sox10 can give rise to a neurocristopathy in mice and demonstrate the utility of identifying highly conserved, non-coding sequences in order to better define the functional landscape of the mammalian genome.

RESULTS

Identification of the Hry mouse

Characterization of 74 previously generated transgenic mouse lines (16) revealed five lines, which when bred to homozygosity for the transgene, displayed overt mutant phenotypes. Affected mice in one of these lines, designated Tg(Igf2P3Luc)HarryWard (or Hry), were characterized by the near complete absence of skin pigment, with postnatal growth retardation, distended abdomen and failure to thrive (Fig. 1A). These mutant features were linked with transgene homozygosity (discussed subsequently). None of the other lines, including 11 made with the same transgene, demonstrated a similar phenotype, indicating that it is due to the site of insertion within the genome. Hry acts in an incompletely penetrant semidominant fashion with variable expressivity, as transgene hemizygotes are essentially normal except for the presence, in some animals, of a non-pigmented belly spot of variable size (Fig. 1B). In Hry/Hry homozygotes, abdominal swelling is correlated with megacolon, in which passage of feces along the large intestine is impeded (Fig. 1C), which could account for the observed growth retardation and failure to thrive. Mutant Hry/Hry animals were routinely culled within the first two weeks after birth to prevent unnecessary suffering. Intestines from 30 Hry/Hry animals were dissected for examination. A constriction point, of variable position, was observed within the colon (indicated by an arrow in Fig. 1C) in all but two of these cases; in the latter, fecal passage was arrested above the ileo-caecal junction.

As megacolon can result from a depletion of enteric neurons, we performed a histological examination, comparing samples from the distal colon of Hry/Hry and wild-type littermates. Conventional staining with hematoxylin and eosin revealed a dramatic reduction in the thickness of the muscle layers and severe disorganization of the mucosa in Hry/Hry mice (Fig. 1D and E); however, the role of physical damage in this and other sections cannot be fully assessed. Myenteric ganglia were seen in wild-type colon sections as numerous clustered cells with large oval nuclei, located between the muscle layers (Fig. 1D). In contrast, these were almost entirely absent in sections from Hry/Hry mice (Fig. 1E). The depletion of enteric neurons in Hry/Hry mice was confirmed by staining sections with an antibody specific to the neuronal marker neurofilament 200. The plexus of Auerbach was visible as an essentially continuous network of brightly immunofluorescent cells in wild-type samples (Fig. 1F) and reduced to isolated patches of fluorescence in Hry/Hry samples (Fig. 1G). The latter was only informative for neurons within Auerbach’s plexus, as cells of the inner plexus of Meissner are not as numerous, and this plexus does not contain as many neurofilamentous fibers (17). Furthermore, non-specific fluorescent signal in the inner submucosal layers (seen in wild-type sections stained in the absence of the primary anti-neurofilament antibody; Fig. 1H) made it impossible to identify submucosal ganglia unambiguously.

The choroidal melanocytes of the eye represent another derivative of the NC. In histological sections, the choroidal layer is immediately apposed to the neuroectodermally derived retinal-pigmented epithelium (Fig. 1I). Melanized choroidal cells were absent in sections from Hry/Hry mice, whereas the


Figure 1. Phenotypic characterization of Hry/Hry mice. (A) Hry/Hry (left) and wild-type littermate (right) at postnatal day 10. (B) Four Hry/+ transgenic littermates show the variable penetrance of belly spots. (C) Intestinal tracts from wild-type (left) and Hry/Hry littermates at postnatal day 10 (constriction point indicated by arrow). (D and E) Longitudinal sections through the wall of the colon stained with Masson’s trichrome from wild-type (D) and Hry/Hry (E) animals [mucosae (mu); sub-mucosae (sm); circular smooth muscle (cm); myenteric ganglia (mg); longitudinal smooth muscle (lm)]. (F–H) Longitudinal sections through the colon wall of wild-type (F) and Hry/Hry mutant (G and H) stained with an antibody specific to neurofilament 200 (red) and counterstained with DAPI (cyan). Lumenal aspect is at the top of each picture with the location of myenteric plexus indicated by asterisks. Isolated patches of cells in Hry/Hry mice are indicated by an arrow. Control studies in which the primary antibody (specific to neurofilament) was omitted demonstrated that staining at the lumenal border and within the sub-mucosa was non-specific (H). (I and J) Hematoxylin and eosin-stained sections through retinas from wild-type (I) and Hry/Hry (J) mice showing absence of choroidal melanocytes (cm) in mutant samples [retina (ret); retinal pigmented epithelium (rpe); choroid (cho)].

retinal pigmented epithelium and all other cell layers appeared normal in these animals (Fig. 1J).

Linkage of the Hry transgene insertion to the Sox10 locus

To genetically characterize the Hry/Hry mouse, we established linkage between the transgene and the mutant phenotype. Transgene hemizygotes were intercrossed to produce 223 offspring, of which 52 displayed megacolon and predominant hypopigmentation. Thus, 23.3% of the offspring were mutant, as expected, if the more severe phenotype segregates as a recessive trait (~25%). These data reject the hypothesis that the transgene segregates independently of the mutant phenotype (P < 0.0001 using a χ2 test). Next, transgene homozygosity was confirmed in mutant animals by performing fluorescent in situ hybridization of metaphase chromosome spreads, using the transgene as a probe. Three transgenic animals were analyzed that had the megacolon and extensive white spotting, and each had two chromosomes with hybridization signals, indicating that they were Hry/Hry homozygotes, whereas a single hybridization

signal was seen in four transgenic animals with a wild-type appearance (Supplementary Material, Fig. S1), consistent with Hry/+ hemizygosity. These studies revealed that the Hry transgene was integrated within the distal portion of chromosome 15 (Supplementary Material, Fig. S1). This immediately suggested Sox10 as a candidate gene for disruption in Hry mutant mice, as Sox10 is located on chromosome 15 at 79.5 Mb (www.ensembl.org/Mus_musculus/) and characterized Sox10 mouse mutants display deficiencies in skin pigmentation and enteric neurons (2,4).

Consequently, microsatellite mapping was performed to test for linkage between Sox10 and the Igf2P3Luc transgene insertion. The Hry transgenic founder animal was derived by pronuclear injection of an F1 (C57BL/6 × CBA) zygote and the resulting line maintained by mating Hry/+ animals with F1 (C57BL/6 × CBA) animals. Thus, the transgene integrated into a chromosome from one of the parental strains, and this allele should co-segregate with the transgene. We identified three microsatellite markers closely linked to Sox10 and polymorphic between the CBA and C57BL/6 strains (D15MIT1, D15MIT2 and D15MIT71). Transgene hemizygotes were intercrossed, with both parents and offspring genotyped for all three markers. A total of 43 informative meioses were found to show no recombination between the transgene and each marker, with the transgene associated with the C57BL/6 allele in all cases (data not shown). Using GENE-LINK software (18), we determined that the transgene was located within 8 cM of the markers, and therefore Sox10, with a 95% confidence limit.
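The segregation figures reported for the intercross (52 affected among 223 offspring) can be checked against the 25% expected for a recessive trait with a one-degree-of-freedom chi-square goodness-of-fit calculation. This is a minimal Python sketch, not the authors' analysis: the reported P < 0.0001 comes from a different test (independence of transgene and phenotype) whose underlying counts are not given here, and the function names are hypothetical.

```python
import math


def chi_square_gof(observed, expected_fractions):
    """Return the chi-square statistic and the expected counts."""
    total = sum(observed)
    expected = [total * fraction for fraction in expected_fractions]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, expected


def p_value_one_df(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Counts from the text: 52 affected of 223 offspring; 1:3 expected if recessive.
observed = [52, 223 - 52]
statistic, expected = chi_square_gof(observed, [0.25, 0.75])
p_value = p_value_one_df(statistic)
print(f"chi-square = {statistic:.3f}, P = {p_value:.2f}")
```

The statistic is about 0.34, far below the 3.84 threshold for significance at the 5% level with one degree of freedom, so the observed 23.3% is compatible with recessive segregation.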

Analysis of Sox10-coding sequences in Hry mice

To determine whether the transgene insertion disrupted the Sox10-coding region, we performed Southern blot analysis of wild-type, Hry/+ and Hry/Hry mice, using probes derived from the 5′- and 3′-regions of the Sox10 cDNA sequence. We did not detect any size polymorphism with seven different restriction endonucleases, other than those occurring naturally between the CBA and C57BL/6 strains (data not shown). Given the location of the probes and restriction sites relative to Sox10, we concluded that the transgene did not reside in the genomic interval 3 kb upstream of the 5′-most Sox10 exon and at least 1.5 kb downstream of the 3′-most Sox10 exon.

To examine whether Sox10 transcription is disrupted by the transgene, Hry/+ hemizygotes were intercrossed and Sox10 expression analyzed in the offspring. RNA was purified from whole brains taken from pups at 8 days postpartum, when wild-type and mutant animals could be distinguished by phenotype. In RT–PCR (data not shown) and northern blot (Fig. 2A) experiments, bands corresponding to Sox10 mRNA were seen in all samples. With the northern blots, Sox10 was detected as a single band of 3 kb, consistent with that reported previously (2–4). No significant differences in Sox10 expression were observed between wild-type, Hry/+ and Hry/Hry samples (when compared with the Gapdh loading control), indicating that Sox10 expression is not disrupted in the postnatal brain of Hry/+ or Hry/Hry animals.

Figure 2. Sox10 expression in wild-type and Hry/Hry mice. (A) Northern blot analysis of RNA from brains of eight littermates at postnatal day 10 using probes specific to Sox10 (upper) and Gapdh (lower). (B–E) Whole-mount in situ hybridization of embryos at E8.5 (B and C) and E9.5 (D and E) using a Sox10 probe. Note vastly reduced expression in presumed Hry/Hry mice (C and E) versus littermates (B and D).

Sox10 expression was also examined by whole-mount in situ hybridization of embryos at E8.5 (Fig. 2B and C) and E9.5 (Fig. 2D and E). Embryos were recovered from entire Hry/+ intercross litters and processed together. Thus, there was no ability to distinguish between wild-type, Hry/Hry and Hry/+ embryos. However, 75% of embryos (19 out of 24) exhibited normal Sox10 expression (3,4), and these embryos were presumed to represent wild-type and Hry/+ offspring. At E8.5, there was strong Sox10 expression in the otic vesicle and anterior dorsal neural tube, which is

the site of NC origin (Fig. 2B). At E9.5, Sox10 expression was detected in the dorsal neural tube and migrating NC throughout most of the trunk (Fig. 2D). These patterns of Sox10 expression were not seen in the remaining 25% of embryos (five out of 24) and these were presumed to be Hry/Hry mutants. Closer inspection revealed that these embryos display an essentially normal pattern of Sox10 expression at E8.5, however, at severely reduced levels (Fig. 2C), but that Sox10 expression was not detectable at E9.5 (Fig. 2E). We thus renamed this mouse line Sox10Tg(Igf2P3Luc)HarryWard (or Sox10Hry) because of the clear defect in Sox10 expression and the insertion of the transgene near Sox10.

Identification of conserved non-coding sequences near Sox10

To explore whether the Sox10Hry/Hry phenotype arises because of a mutation(s) in non-coding sequences required for Sox10 expression, we performed comparative sequence analyses to identify conserved non-coding sequences near Sox10. Such sequences are candidates for regulating Sox10 expression and thus could be screened for mutations in Sox10Hry. We generated genomic sequence of the region encompassing the Sox10 loci in rat, cat, dog, pig, cow and chicken. The orthologous sequences from mouse and human were obtained from the University of California at Santa Cruz (UCSC) Genome Browser (genome.ucsc.edu). These efforts provided genomic sequences from eight species spanning Sox10 and including the adjacent centromeric (Polr2f) and telomeric (Prkcabp) genes. These flanking genes defined the region for analysis, which encompassed 65 and 3 kb upstream and downstream of Sox10, respectively.

To obtain a highly confident data set of conserved non-coding sequences within and flanking Sox10, we developed software (ExactPlus) that finds identical sequences in a multiple species sequence alignment. ExactPlus input includes: (i) a MultiPipMaker alignment (acgt) file; (ii) the length of identical matches to report; (iii) the number of species that defines a match; (iv) the number of species to include in extending a match (optional); and (v) a file indicating coding sequences to be removed from the analysis (optional). After processing the acgt file with the supplied parameters, ExactPlus reports three outputs: (i) a local alignment at each match; (ii) a strict, IUPAC-coded consensus sequence for each match; and (iii) a UCSC Genome Browser custom track for positioning results.

Genomic sequences were analyzed by MultiPipMaker (19) to provide a line-by-line alignment (acgt) file, which was then submitted to ExactPlus to identify sequences of six bases or more that are identical in seven of the eight species. In effect, this revealed identical sequences in seven mammalian species and did not require chicken. This analysis identified 197 non-coding segments, ranging from six to 104 bases (with an average length of 11.7 bases), which together cover 2.35% of the sequence analyzed. Importantly, these sequences form eight clusters of highly conserved sequences (CHCSs; Fig. 3A, green bars) in the 64.5 kb between the Sox10 transcription start site and the known gene (Prkcabp), with each cluster containing at least four ExactPlus-identified segments in a 200 base window.

To assess ExactPlus, we compared its output with that of WebMCS (20), using the same acgt file, and with that of PipMaker (21), used to identify regions at least 70% identical over 100 bp between human and mouse. By identifying regions identical between multiple mammalian species, ExactPlus reduced the data set by 15% when compared with WebMCS and by 30% when compared with PipMaker. Furthermore, ExactPlus provided an additional 18 and 47% of fully conserved sequence not provided by WebMCS or PipMaker output, respectively (data not shown). Finally, WebMCS and PipMaker output overlapped with six of the eight aforementioned CHCS fragments, further suggesting the importance of these sequences.

Sox10Hry mice harbor a 15.9 kb deletion upstream of Sox10

To determine whether mutations in conserved non-coding sequences account for the impaired Sox10 expression in Sox10Hry, we performed PCR analysis of wild-type, Sox10Hry/+ and Sox10Hry/Hry embryos using amplimers designed within each of the eight CHCSs (Fig. 3A, green bars). The regions corresponding to three amplimers (Fig. 3A, shaded region, and B, green bars) were found to be absent in Sox10Hry mice (Fig. 3C). Fine mapping of the deletion using additional PCR amplimers designed across the interval revealed a 15.9 kb deletion located 47.3 kb upstream of Sox10 (Fig. 3A, shaded region, B), which was not detected in 10 additional mouse strains. Southern blot analysis of DNA isolated from wild-type and Sox10Hry/Hry mice revealed that a probe within the transgene and one just outside the deletion hybridized to fragments of similar size when digested with two restriction enzymes (data not shown). These data suggest a direct link between the transgene insertion and the genomic deletion events. The deleted region contains three CHCSs, hereafter referred to as deleted clusters of highly conserved sequences (DCHCSs; Fig. 3B). DCHCS-1 contains 10 conserved segments that cover 139 bp across a 237 bp interval (58.6%); DCHCS-2 contains four conserved segments that cover 39 bp across a 175 bp interval (22.3%); and DCHCS-3 contains five conserved segments that cover 35 bp across a 155 bp interval (22.6%). Importantly, the only region of the deletion detected by WebMCS and PipMaker was DCHCS-1; no other sequences in the 15.9 kb deletion were detected by these two methods.

Enhancer activities of sequences within the Sox10Hry deletion

The Sox10Hry deletion appears to have an effect on Sox10 expression in melanocytes. Therefore, it is reasonable to hypothesize that the deletion harbors a transcriptional enhancer that is active in this cell population. To address this, we assessed the enhancer potential of the deleted region in immortalized melanocytes (melan-a cells) (22). As the Sox10Hry deletion may contain sequences necessary for Sox10 expression other than those identified by ExactPlus, we tested the enhancer potential of the entire deleted region. Briefly, we generated a PCR-based tiling path of overlapping fragments ranging from 325 to 1753 bp (Fig. 3B, black bars) across the Hry deletion. Each Deleted In Hry (DIH) fragment

182 264 Human Molecular Genetics, 2006, Vol. 15, No. 2

Figure 3. Comparative sequence and mutation analysis of Sox10. (A) The interval of mouse chromosome 15 harboring the Sox10 locus and upstream sequences is depicted by the UCSC Genome Browser (genome.ucsc.edu). ExactPlus-detected sequences (red bars) and CHCS (green bars) are indicated along the top. The 15.9 kb deletion identified in Sox10Hry mice is indicated by the gray box. (B) An expanded view of the 15.9 kb Sox10Hry deletion is shown. DNA fragments tested for enhancer activity are indicated in black. The entire region was examined by developing a PCR-based tiling path of segments DIH, although specific DCHCS fragments were tested independently. (C) PCR analysis of DNA isolated from wild-type, Sox10Hry/+ and Sox10Hry/Hry embryos revealed that DCHCS-1 (C), DCHCS-2 (data not shown) and DCHCS-3 (data not shown) are all deleted in Sox10Hry.
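
The kind of search that produces the ExactPlus-detected sequences in Figure 3A can be illustrated with a short sketch. ExactPlus itself is described as a stand-alone perl script; the Python below is not that script. The function name `find_exact_runs`, its parameters and the toy alignment are hypothetical, and the sketch assumes one simplification: a column qualifies when at least `min_species` rows share a base, without requiring that the same subset of species agree across the whole run.

```python
from collections import Counter


def find_exact_runs(rows, min_species, min_len):
    """Return half-open (start, end) column ranges in which at least
    `min_species` rows carry the same base in every column, for at least
    `min_len` consecutive columns. `rows` holds equal-length aligned
    sequences, one per species (hypothetical input layout)."""
    if not rows:
        return []
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("all alignment rows must have the same length")

    def column_ok(col):
        # Gaps and ambiguous characters never count toward identity.
        bases = [r[col].upper() for r in rows if r[col].upper() in "ACGT"]
        if not bases:
            return False
        return Counter(bases).most_common(1)[0][1] >= min_species

    runs, start = [], None
    for col in range(width):
        if column_ok(col):
            if start is None:
                start = col
        else:
            if start is not None and col - start >= min_len:
                runs.append((start, col))
            start = None
    if start is not None and width - start >= min_len:
        runs.append((start, width))
    return runs


# Toy four-species alignment (illustrative only, not real Sox10 sequence).
alignment = [
    "ACGTTGCAAT",
    "ACGTTGCTAT",
    "ACGTTGCAAT",
    "TCGTTGCAGT",
]
print(find_exact_runs(alignment, min_species=4, min_len=6))
```

Lowering `min_species` mimics the setting used in the text, where identity was required in seven of the eight species rather than in all of them.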

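The CHCS definition used in the text (at least four ExactPlus-identified segments in a 200 base window) can also be sketched in a few lines. This is an assumed reading of that definition, not the authors' procedure: the function name `cluster_segments`, the half-open coordinate convention and the rule for merging overlapping windows are all hypothetical, and segments are assumed not to overlap one another.

```python
def cluster_segments(segments, window=200, min_segments=4):
    """Group (start, end) segments into clusters: any stretch of at most
    `window` bases that fully contains at least `min_segments` segments
    is reported, and overlapping stretches are merged."""
    segs = sorted(segments)
    clusters = []
    for i, (start, _) in enumerate(segs):
        j = i
        # Extend while the next segment still ends inside the window.
        while j + 1 < len(segs) and segs[j + 1][1] - start <= window:
            j += 1
        if j - i + 1 >= min_segments:
            span = (start, max(end for _, end in segs[i:j + 1]))
            if clusters and span[0] <= clusters[-1][1]:
                last = clusters[-1]
                clusters[-1] = (last[0], max(last[1], span[1]))
            else:
                clusters.append(span)
    return clusters


# Illustrative coordinates only.
example = [(10, 16), (30, 41), (60, 67), (150, 160), (900, 910), (1000, 1006)]
print(cluster_segments(example))
```

With the default settings the first four segments form one cluster and the two isolated segments are ignored.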
was PCR amplified and cloned upstream of the minimal pE1b promoter driving luciferase expression. Subsequently, each resulting expression construct was transfected into melan-a cells and analyzed for enhancer activity via a luciferase reporter assay. To prioritize the results, we deemed a fragment to show enhancer activity in vitro when it increased luciferase expression at least 3-fold above the empty vector in both orientations (Fig. 4A, green bar). DIH-1, DIH-8, DIH-10, DIH-15 and DIH-17 displayed enhancer activity in vitro (Fig. 4A, green boxes). In the context of our ExactPlus analysis, two observations make these results particularly interesting. First, three of these fragments overlap DCHCSs (DCHCS-1 and DIH-1, DCHCS-2 and DIH-10 and DCHCS-3 and DIH-15). Secondly, the fragment yielding the largest increase in luciferase expression in both directions was DIH-1, which contains DCHCS-1. DIH-9 displayed the largest increase in luciferase expression (>50-fold) in the forward orientation; however, the reverse orientation showed no activity. Closer examination revealed that the sequence unique to DIH-9 is a simple repeat that is expanded only in mouse (data not shown). Thus, DIH-9 appears to be an orientation-dependent, mouse-specific enhancer, or the DIH-9 data reflect a false-positive or false-negative result in the forward or reverse orientation, respectively.

To assess the enhancer activity of the highly conserved sequences deleted in Sox10Hry mice, we independently measured the ability of each DCHCS, along with 200 bases of flanking sequence, to drive luciferase expression in melan-a cells, as described earlier. This analysis revealed that each DCHCS significantly enhanced luciferase expression in both forward and reverse orientations (Fig. 4B). However, consistent with our tiling-path analysis, DCHCS-1 yielded the largest increase in luciferase expression (12-fold forward and 91-fold reverse), whereas DCHCS-2 only increased expression 3-fold in each orientation and DCHCS-3 increased expression 20-fold in the forward orientation and 2-fold in the reverse orientation.

DCHCS-1 core sequences show strong enhancer potential

DIH-1 and DCHCS-1 have the greatest enhancer activity when transfected into cultured melanocytes. As these fragments overlap and contain the same set of ExactPlus-detected sequences (Fig. 3B), it is possible that ExactPlus identified the specific conserved sequences that enhance Sox10 expression in melanocytes. DCHCS-1 contains a core segment of 95 bp that is 83% identical in all mammalian species studied and spans two ExactPlus matches of 51 bp and 21 bp (Fig. 5A). We compared the enhancer activity of this 95 bp segment with the original DCHCS-1. This revealed that the 95 bp core segment displayed greater enhancer activity than DCHCS-1 in both orientations (Fig. 5B).

The high conservation and in vitro enhancer potential associated with the 95 bp core segment suggest that transcription factor binding sites reside therein. To determine the transcription factor binding site profile of this region, we submitted the three ExactPlus-identified sequences within the 95 bp core segment (Fig. 5A) to TRANSFAC (23) using MATCH and PATCH (see Materials and Methods for details). This analysis revealed five biologically relevant predicted binding sites: two retinoic acid receptor binding sites and three Sox family binding sites, two of which are oriented in a head-to-head fashion and are separated by 5 bp (Fig. 5C). It is worth noting that the majority of the sequence in these predicted binding sites is conserved in chicken (86.4%; Fig. 5C).

DISCUSSION

The transcription factor SOX10 plays an essential role in NC stem cells and in the development of a subset of NC derivatives (4,24). Although specific genes essential for the development of these cells are known to be SOX10 targets, little is known about the transcriptional regulation of SOX10 itself. Here, we describe a novel mouse strain (Sox10Hry) characterized by reduced embryonic Sox10 expression in the NC. This serendipitous mouse mutant has proven to be instrumental in identifying sequences required for Sox10 transcription. Specifically, we used comparative genomic sequence analysis to reveal eight distinct CHCSs upstream of Sox10. Using these as a basis for mutation screening, we identified a 15.9 kb deletion upstream of Sox10 that spans three of the clusters. One of these, DCHCS-1, displays particularly strong enhancer activity in Sox10-expressing cultured melanocytes, resides within the most highly conserved portion of the deletion and contains highly conserved, biologically relevant transcription factor binding sites. Thus, DCHCS-1 should be considered a strong candidate for being the first described Sox10 enhancer.

The description of a novel, non-coding mutation that yields a phenotype similar to WS4 adds to the spectrum of SOX10 mutations that give rise to NC defects. To date, coding mutations have accounted for all human and mouse phenotypes attributed to SOX10 dysfunction (4,9,11). Importantly, the Sox10Hry allele is not as severe as the previously reported mouse mutations Sox10Dom and Sox10LacZ (4,12) because mice homozygous for the deletion can survive to weaning and hemizygous mice are phenotypically normal (with the exception of variable belly spots). This is likely due to the fact that the loss of Sox10 expression is limited to a subset of Sox10-expression domains. Our data also suggest that mutations in non-coding sequences near SOX10 may contribute to WS4, especially in patients with no known coding mutations. This notion is supported by the identification of mutations in both coding regions and distal regulatory elements of other loci implicated in human disease; for example, at the SOX9 locus in patients with campomelic dysplasia (OMIM 114290) (25,26) and at the RET locus in patients with Hirschsprung disease (27,28). It would thus seem reasonable to screen relevant patient populations for mutations in the human sequences identified by ExactPlus. In a similar fashion, these sequences should be considered candidate regions for causative mutations or modifiers of phenotypes affecting tissues in which SOX10 is expressed. Relevant diseases could include, but are not limited to, Hirschsprung disease, demyelinating Charcot–Marie–Tooth disease, non-syndromic deafness and melanoma.

Cataloging cis-acting transcriptional regulatory elements in mammalian genomes is of widespread importance to the study of human development and disease. Although comparative


Figure 4. Enhancer activity of sequences deleted in Sox10Hry. (A) Enhancer activity of the entire Sox10Hry deletion. Each DIH segment indicated in Figure 3B was cloned upstream of a minimal promoter driving luciferase expression. Melan-a cells were transfected with each construct along with an internal-control vector expressing renilla and incubated for 48 h. The ability of each DIH to enhance luciferase expression was determined by calculating the ratio of luciferase expression to renilla expression and then determining the fold increase (y-axis) over the ratio calculated for pE1b with no insert. The green area along the x-axis depicts no increase. DIH segments with enhancer activity in both orientations are highlighted in green. (B) Similar experiment as in (A) for DCHCS-1, DCHCS-2 and DCHCS-3. For both experiments, the results with the forward and reverse orientation of each segment (with respect to the direction of Sox10 transcription) are indicated in blue and red, respectively, and error bars indicate the standard deviation.

sequence analyses have proven to be an invaluable tool for this, such approaches are often complicated by decisions regarding which species to include, which software tools to employ and what conservation thresholds to use (29,30). At one extreme, investigators compare closely related genomic sequences and use relatively low thresholds; for example, analyzing human and rodent sequences and requiring at least 70% identity over 100 bp. Although cis-acting regulatory elements have been identified by such an approach (31), these analyses are prone to numerous false positives (20).
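
To make the pairwise criterion quoted above (at least 70% identity over 100 bp) concrete, the following sketch slides a fixed-width window along two aligned sequences and reports the windows that meet the threshold. It is only an illustration of the threshold itself and is not how PipMaker computes its results; the function name `windows_above_identity` and the toy sequences are hypothetical, and gap columns are assumed to count as mismatches.

```python
def windows_above_identity(seq_a, seq_b, window=100, min_identity=0.70):
    """Return start columns of every `window`-wide stretch of a pairwise
    alignment whose fraction of identical columns is >= `min_identity`."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    # 1 where both rows carry the same base, 0 otherwise (gaps never match).
    matches = [int(a == b and a != "-")
               for a, b in zip(seq_a.upper(), seq_b.upper())]
    hits = []
    running = sum(matches[:window])
    for start in range(len(matches) - window + 1):
        if start > 0:
            # Slide the window one column to the right.
            running += matches[start + window - 1] - matches[start - 1]
        if running / window >= min_identity:
            hits.append(start)
    return hits


# Toy example with a short window so the result is easy to check by hand.
print(windows_above_identity("ACGTACGTAC", "ACGTTCGAAC",
                             window=5, min_identity=0.8))
```

Because every window is scored only on overall identity, short perfectly conserved stretches inside a poorly conserved window go unreported, which is the limitation the surrounding discussion raises.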

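The fold-increase calculation described in the Figure 4 legend (luciferase-to-renilla ratio, normalized to the ratio for the vector with no insert) and the 3-fold, both-orientation criterion used in the Results can be sketched as follows. The function names, the use of the mean empty-vector ratio as the baseline, the sample standard deviation and the readings themselves are assumptions made for illustration; they are not values or procedures taken from the study.

```python
from statistics import mean, stdev


def fold_increase(construct_reads, empty_vector_reads):
    """Each argument is a list of (luciferase, renilla) readings, one pair
    per replicate. Returns (mean fold increase, standard deviation)."""
    baseline = mean(luc / ren for luc, ren in empty_vector_reads)
    folds = [(luc / ren) / baseline for luc, ren in construct_reads]
    return mean(folds), stdev(folds)


def shows_enhancer_activity(forward_fold, reverse_fold, threshold=3.0):
    """Apply the 'at least 3-fold in both orientations' rule."""
    return forward_fold >= threshold and reverse_fold >= threshold


# Made-up readings for illustration only.
empty = [(100, 50), (120, 60)]
forward = [(600, 50), (840, 60), (500, 50)]
avg, sd = fold_increase(forward, empty)
print(avg, sd)
print(shows_enhancer_activity(avg, 2.0))
```

Normalizing to the renilla signal corrects for well-to-well differences in transfection efficiency before constructs are compared with the empty vector.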

Figure 5. Analysis of a 95 bp core segment within DCHCS-1. (A) ExactPlus-identified sequences within DCHCS-1 are indicated. (B) Comparison of the enhancer activities of DCHCS-1 and the 95 bp core segment in cultured melanocytes. Fragment orientations are noted as in Figure 4, and error bars indicate the standard deviation. (C) ExactPlus-identified sequences within the 95 bp core segment were submitted to TRANSFAC using two interfaces (see Materials and Methods). Retinoic acid receptor (red) and Tcf/Lef/Sox family/Sry (green) consensus sequences are indicated in boxes. The ExactPlus-built consensus sequence (Cons) is displayed at the bottom of each alignment. Note that two of the Sox-family sites are oriented in a head-to-head fashion and separated by 5 bp and that sequence identity in chicken allowed 3′ extension of the two hits by two and one bases, respectively.
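
A consensus line of the kind shown as Cons in Figure 5C can be derived column by column with the standard IUPAC nucleotide ambiguity codes. The sketch below is an assumed illustration of that idea, not the ExactPlus implementation: the function name `iupac_consensus` and the toy rows are hypothetical, and any column containing a gap or other non-ACGT character is simply reported as N.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def iupac_consensus(rows):
    """Build a strict consensus: each column is encoded by the IUPAC code
    for exactly the set of bases observed in that column."""
    consensus = []
    for column in zip(*rows):
        bases = frozenset(b.upper() for b in column)
        # Columns with gaps or unexpected symbols fall back to N.
        consensus.append(IUPAC.get(bases, "N"))
    return "".join(consensus)


# Toy rows for illustration only.
print(iupac_consensus(["ACGTA", "ACGTG", "ACATA"]))
```

A strict consensus of this kind keeps every observed base visible, so a downstream binding-site search sees the full variation among species rather than a majority call.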

Furthermore, such methods do not detect short conserved sequences, which may represent protein binding sites. Such assessment is important considering that transcription factor binding sites are more likely to occur in highly conserved sequences (32). At the other extreme, investigators require conservation in non-coding sequences across more distantly related species; for example, mammals and chicken, zebrafish and/or Fugu sequences (33). Although the latter can yield highly conserved segments, these mostly reflect protein-coding regions (20,34) and may not be relevant for the identification of functional non-coding sequences specific to mammalian lineages.


ExactPlus identified identical non-coding sequences near Sox10 in multiple mammalian species. Importantly, these studies reduced the amount of sequence for screening by 98% and provided a more refined set of conserved sequences when compared with two commonly used methods (20,21). A comparison of our computational and functional analyses reveals that ExactPlus identified three of the five fragments in the Sox10Hry deletion, which display in vitro enhancer activity. Although DCHCS-1 remains the most promising candidate for being a Sox10 enhancer, the enhancer activity observed with DCHCS-2, DCHCS-3, DIH-8 and DIH-17 suggests that sequences in these regions may also be important for Sox10 expression, either in melanocytes or in other cells. Interestingly, although DIH-8 and DIH-17 showed in vitro enhancer activity, our comparative sequence analyses did not detect any conserved elements in these regions. Two possibilities might explain this discrepancy: (i) the in vitro results represent false positives; or (ii) these are functional sequences that are not highly conserved. This last point underscores a major limitation of our approach; that is, functional elements will not be identified in the absence of strong evolutionary selection among the species examined. Thus, ExactPlus is unlikely to detect species- or even lineage-specific functional elements or those not highly conserved. Our analysis raises the question of which ExactPlus settings will provide similar, highly confident data sets at other loci, or across entire genomes. Inspection of each CHCS near Sox10 reveals that all eight contain at least one 10-mer that is identical in all seven mammalian species studied, and no identical 10-mer was found outside of these clusters. Thus, one possible use of ExactPlus could involve identifying sequences 10 bases or more and identical in all (or some large subset of) available mammalian species.

Comparison of the luciferase assay results obtained with DIH-1 (Fig. 4A), DCHCS-1 (Fig. 4B) and the 95 bp DCHCS-1 core segment (Fig. 5B) reveals marked differences among the three overlapping sequences. Specifically, the 95 bp core segment showed more enhancer activity than DCHCS-1, which in turn showed more activity than DIH-1. Furthermore, the 95 bp core segment is the shortest of the three and acted most like a classical enhancer (Fig. 5B). If enhancers involved in Sox10 expression exist in these overlapping sequences, it is unclear why they show different in vitro enhancer activities. One possibility is that repressor sequences are absent in the shorter segments and another is that additional flanking sequences decrease enhancer activity in vitro. Indeed, the segments decrease in size in the same order that they increase in enhancer activity (Figs 3B and 5A). Therefore, if the 95 bp core segment represents a true Sox10 enhancer in vivo, this would suggest that stringent comparative analysis of mammalian genomic sequences is a powerful approach for identifying non-coding functional elements by directly indicating the sequences most likely to be relevant for functional studies.

Our combined analyses revealed that a 15.9 kb upstream segment is required for Sox10 expression in multiple NC derivatives within developing mouse embryos. One important consideration is that the Hry phenotype may be caused by the transgene insertion rather than by the associated deletion. Although this possibility cannot be completely ruled out, it is unlikely because the phenotype is characterized by impaired Sox10 expression and is strikingly similar to phenotypes associated with loss-of-function Sox10 mutations. The development of knockout and transgenic mouse lines carrying deletions of this 15.9 kb segment, either whole or in part, should address this issue. Also supporting a loss-of-function mechanism is the finding that deletion of highly conserved, long-range regulatory elements disrupts the transcription of another Sox-family transcription factor involved in NC development, Sox9 (35). Finally, it is important to note that the 15.9 kb deleted region does not contain all sequences needed for Sox10 expression because central nervous system expression appears unaltered. Indeed, a brain-specific enhancer(s) likely resides outside of this region, and the corresponding CHCSs should be tested for enhancer activity in relevant assays.

Although cis-acting regulatory sequences likely exist in the Sox10Hry deletion, and the 95 bp core segment within DCHCS-1 is a candidate enhancer, the issue remains as to what transcription factors regulate Sox10 expression. An advantage of ExactPlus is that it provides highly delimited regions in which to search for putative transcription factor binding sites. One common concern with such searches is that they return multiple binding sites per basepair of input, making it difficult to assess the significance of any one site. This problem is exacerbated with very long query sequences (e.g. multiple kilobases). Thus, by only searching highly conserved sequences, a more confident data set is provided for functional analyses. Indeed, the identification of Tcf/Lef and Sox-family binding site consensus sequences was particularly encouraging because factors in these families have known or suspected roles in NC development (reviewed in 36). Several points make Sox9 a particularly attractive candidate: (i) Sox9 can bind to DNA as a dimer when two Sox sites are positioned in a head-to-head fashion, as in Figure 5C (37); (ii) Sox9 induction has been shown to increase Sox10 expression in chick embryos (38) and (iii) the head-to-head Sox sites are conserved in chicken (Fig. 5C), thus underscoring their likely importance. These observations should aid future studies on the role of Sox9 in Sox10 expression.

The results presented herein indicate that: (i) a non-coding Sox10 mutation causes a WS4-like phenotype in mice; (ii) a 15.9 kb deletion underlies the phenotype and likely removes sequences essential for Sox10 expression and (iii) our novel computational approach is effective for identifying candidate regulatory sequences. Thus, these studies have implications for the relationship of SOX10 and human disease, as well as the use of comparative genomics for identifying non-coding functional elements.

MATERIALS AND METHODS

Mice

Transgenic mice were generated by micro-injection of linear construct DNA (16) into one pronucleus of F1 zygotes resulting from crosses between CBACa females and C57BL/6J males using standard techniques (39). A total of 74 independently derived transgenic lines were made, originally for a study of gene regulation in which putative regulatory elements from the Igf2/H19 locus were used to drive expression of a

luciferase reporter gene (16). Transgenic mice were maintained on a mixed C57BL/6J:CBACa genetic background. The presence of the transgene was reliably detected as described (16).

Histology

Paraffin-embedded tissue sections were stained with hematoxylin and eosin or Masson's trichrome solutions using standard protocols (40). For immunofluorescence, 7 µm sections were blocked by incubation in SSCTM [4× SSC, 0.1% Tween-20, 5% (w/v) dried non-fat milk] for 30 min at 37°C. Following three 2 min washes in SSCT (4× SSC, 0.1% Tween-20), the slides were incubated with either SSCTM (control) or 50 µl of rabbit anti-neurofilament (Sigma) diluted 1:100 in SSCTM overnight at 4°C. Slides were washed and incubated with biotinylated anti-rabbit antibody diluted 1:200 in SSCTM for 30 min at 37°C, washed again and incubated with streptavidin–Texas red (Vector) diluted 1:200 in SSCTM for 30 min at 37°C. Slides were then equilibrated in PBS, counterstained with DAPI, mounted in Vectashield and photographed using a Leica DMRB or Nikon Eclipse E800 microscope.

Linkage of the Hry transgene insertion to the Sox10 locus

Microsatellite mapping was performed to test for linkage between Sox10 and the Igf2P3Luc transgene insertion. The Hry transgenic founder animal was derived by pronuclear injection of an F1 (C57BL/6 × CBA) zygote and the line maintained by mating Hry/+ animals with F1 (C57BL/6 × CBA) animals. Thus, the transgene integrated into a chromosome from one of the parental strains and this allele would co-segregate with the transgene. We identified three microsatellite markers closely associated with Sox10 and polymorphic between the CBA and C57BL/6 strains (D15MIT1, D15MIT2 and D15MIT71). Transgene heterozygotes were intercrossed, with both parents and offspring genotyped for all three markers. A total of 43 informative meioses were found to show no recombination between the transgene and each marker, with the transgene associated with the C57BL/6 allele in all cases (data not shown). Using GENE-LINK software (18), we determined that the transgene was located within 8 cM of the markers, and therefore Sox10, with a 95% confidence limit.

Sox10-expression studies

Northern blot analysis was performed using standard protocols and a 606 bp EcoRI–KpnI fragment from the 5′ UTR of the Sox10 cDNA (4) as the probe. A Gapdh probe was used for control experiments. For in situ analysis, mouse embryos were taken from pregnant females with gestational age estimated by the appearance of a copulation plug designating embryonic day E0.5. Embryos were fixed in 4% paraformaldehyde in PBS overnight at 4°C, washed in PBT [PBS and 0.1% Tween-20 (Sigma)] and dehydrated through a methanol series (10 min each in 25, 50, 75% and absolute methanol in PBT). Whole-mount in situ hybridization was performed using a DIG-labeled RNA probe derived from a Sox10 cDNA (41). Embryos were destained in PBT for 24 h, post-fixed in 4% paraformaldehyde in PBS, washed in PBT, taken through an ascending glycerol series (50, 80 and 100%) and photographed.

Multi-species sequence generation and analysis

Bacterial artificial chromosomes derived from the indicated species and spanning the SOX10 locus were isolated and sequenced as described (42,43): rat (GenBank accession no. AC137528), cat (GenBank accession no. AC137542), dog (GenBank accession no. AC137537), pig (GenBank accession no. AC137657) and cow (GenBank accession no. AC137534). Sequences encompassing SOX10 in human (chr22:36606845–36725000, July 2003 assembly) and mouse (chr15:79203055–79300747, March 2005 assembly) were obtained from the University of California at Santa Cruz (UCSC) Genome Browser (genome.ucsc.edu). All sequences were analyzed by MultiPipMaker (19) to obtain a line-by-line alignment (acgt) file (pipmaker.bx.psu.edu/cgi-bin/multipipmaker).

ExactPlus software development

ExactPlus is a stand-alone perl script. The alignment (acgt) file from MultiPipMaker is read into memory and accessed as a two-dimensional array, where each column is a base position and each row is a different species. The program searches the array for regions where a minimum number of species is identical for a given number of base positions (e.g. all seven species in an alignment identical for at least six consecutive bases). Optionally, ExactPlus can extend any match in either direction by requiring a smaller number of species to be identical at each base once the original criteria are met. Assessment of ExactPlus output was performed by: (i) analyzing the same acgt file using WebMCS (20) with standard options; and (ii) using PipMaker (21) to identify regions at least 70% identical over 100 bp between human and mouse. ExactPlus is available at research.nhgri.nih.gov/exactplus.

Luciferase reporter constructs

Luciferase gene-containing vectors were generated using the pLGF-E1b destination vector (kindly provided by Glenn Maston, University of Massachusetts). The pLGF-E1b vector is pGL3-Basic (Promega, Madison, WI, USA) with two modifications. First, it contains the Gateway cloning and selection cassette for destination vectors (Invitrogen, Carlsbad, CA, USA) cloned into the SmaI site. Secondly, it contains a minimal adenovirus E1b promoter directionally cloned into the BglII and HindIII sites, downstream of the Gateway cassette.

Luciferase expression constructs were generated using Gateway technology (Invitrogen). Briefly, PCR primers containing flanking attB sites were designed to amplify each region of interest. PCR reactions were performed under standard conditions. The resulting products were separated on 1% low-melt agarose gels, purified using the QIAquick Gel Extraction Kit (Qiagen, Valencia, CA, USA) and recombined into the pDONR 221 vector according to the manufacturer's

specifications (Invitrogen). Each recombination reaction was transformed into Escherichia coli, and colonies selected for resistance to 30 µg/ml kanamycin. Each insert was sequenced to ensure the absence of PCR-induced errors. Subsequently, plasmid DNA isolated from each entry clone was recombined with the pLGF-E1b destination vector according to the manufacturer's specifications (Invitrogen). Each reaction was transformed into E. coli and colonies selected for resistance to 100 µg/ml ampicillin. Each construct underwent restriction enzyme digest with BsrGI to confirm the presence of the appropriately sized insert. A negative control vector bearing only the Gateway sequences upstream of the E1b promoter was generated in a similar manner.

Cell culture, transfections and luciferase assays

Melan-a cells (22) were cultured under standard conditions in RPMI medium 1640 (Invitrogen) containing 20% fetal bovine serum, 2 mM L-glutamine and 200 nM tumor promoting agent (Sigma, St Louis, MO, USA). A total of 5 × 10⁴ cells were placed into each well of a 96-well culture plate and transfected with luciferase reporter vectors (mentioned earlier) using lipofectamine 2000 reagent (Invitrogen) according to the manufacturer's instructions. For each reaction, 2.5 µl of lipofectamine 2000 and 25 µl of OptiMEM I minimal growth medium (Invitrogen) were combined and incubated at room temperature for 10 min. Purified luciferase reporter vector (200 ng) or the equivalent volume of water (in the case of DNA-negative controls) and 2 ng of the internal control pCMV-RL renilla expression vector (Promega) were diluted in 25 µl of OptiMEM I and combined with the lipofectamine–OptiMEM I mixture. The 50 µl reactions were incubated at room temperature for 20 min and then added to a single well of the 96-well culture plate containing melan-a cells. After a 4 h incubation at 37°C, the medium was aspirated, the slides were washed with 1× PBS and normal growth medium was added. After a 48 h incubation at 37°C, cells were washed with 1× PBS and lysed at room temperature using 1× Passive Lysis Buffer (Promega). A total of 4 µl of the resulting cell lysate was transferred to a white polystyrene 96-well assay plate (Corning Inc., Corning, NY, USA). Luciferase and renilla activities were determined using the Dual Luciferase Reporter 1000 Assay System (Promega) and a model Centro LB 960 luminometer (Berthold Technologies, Bad Wildbad, Germany). Each experiment was performed 12 times, and the ratio of luciferase to renilla activity and the fold increase in this ratio over that observed for pLGF-E1b with no insert were calculated. The mean (bar height in figures) and standard deviation (error bars in figures) were determined using standard calculations.

Transcription factor binding site prediction

FastA formatted ExactPlus output files containing identical sequences among multiple species were submitted to TRANSFAC (23) version 8.1 using the Matrix Search for Transcription Factor Binding Sites (MATCH) and Pattern Search for Transcription Factor Binding Sites (PATCH) interfaces. PATCH parameters were set to identify TRANSFAC entries: (i) in vertebrate genes and for vertebrate transcription factors; (ii) 6 bp or more; and (iii) with the maximum number of mismatches set at zero. MATCH parameters were set to identify TRANSFAC entries using the 'minimize false negatives' setting.

SUPPLEMENTARY MATERIAL

Supplementary Material is available at HMG Online.

ACKNOWLEDGEMENTS

We thank Elliott Margulies and Matt Portnoy for discussions about ExactPlus, Ping Hu and Valerie Maduro for expert technical assistance, Glenn Maston for pLGF-E1b destination vector, Ruth Arkell for Sox10 in situ probe, Dot Bennett for melan-a cells and Laura Baxter and Debra Silver for critical review of the manuscript. A.A. is supported by a fellowship grant from the Charcot–Marie–Tooth Association. This research was supported, in part, by the BBSRC and Medical Research Council (UK) and the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health (USA).

Conflict of Interest statement. The authors declare that there is no conflict of interest in presenting this manuscript.

REFERENCES

1. Kalcheim, C. and Le Douarin, N.M. (1986) Requirement of a neural tube signal for the differentiation of neural crest cells into dorsal root ganglia. Dev. Biol., 116, 451–466.
2. Herbarth, B., Pingault, V., Bondurand, N., Kuhlbrodt, K., Hermans-Borgmeyer, I., Puliti, A., Lemort, N., Goossens, M. and Wegner, M. (1998) Mutation of the Sry-related Sox10 gene in Dominant megacolon, a mouse model for human Hirschsprung disease. Proc. Natl Acad. Sci. USA, 95, 5161–5165.
3. Kuhlbrodt, K., Herbarth, B., Sock, E., Hermans-Borgmeyer, I. and Wegner, M. (1998) Sox10, a novel transcriptional modulator in glial cells. J. Neurosci., 18, 237–250.
4. Southard-Smith, E.M., Kos, L. and Pavan, W.J. (1998) Sox10 mutation disrupts neural crest development in Dom Hirschsprung mouse model. Nat. Genet., 18, 60–64.
5. Stolt, C.C., Rehberg, S., Ader, M., Lommes, P., Riethmacher, D., Schachner, M., Bartsch, U. and Wegner, M. (2002) Terminal differentiation of myelin-forming oligodendrocytes depends on the transcription factor Sox10. Genes Dev., 16, 165–170.
6. Bondurand, N., Kobetz, A., Pingault, V., Lemort, N., Encha-Razavi, F., Couly, G., Goerich, D.E., Wegner, M., Abitbol, M. and Goossens, M. (1998) Expression of the SOX10 gene during human development. FEBS Lett., 432, 168–172.
7. Pusch, C., Hustert, E., Pfeifer, D., Sudbeck, P., Kist, R., Roe, B., Wang, Z., Balling, R., Blin, N. and Scherer, G. (1998) The SOX10/Sox10 gene from human and mouse: sequence, expression, and transactivation by the encoded HMG domain transcription factor. Hum. Genet., 103, 115–123.
8. Inoue, K., Khajavi, M., Ohyama, T., Hirabayashi, S., Wilson, J., Reggin, J.D., Mancias, P., Butler, I.J., Wilkinson, M.F., Wegner, M. et al. (2004) Molecular mechanism for distinct neurological phenotypes conveyed by allelic truncating mutations. Nat. Genet., 36, 361–369.
9. Pingault, V., Bondurand, N., Kuhlbrodt, K., Goerich, D.E., Prehu, M.O., Puliti, A., Herbarth, B., Hermans-Borgmeyer, I., Legius, E., Matthijs, G. et al. (1998) SOX10 mutations in patients with Waardenburg–Hirschsprung disease. Nat. Genet., 18, 171–173.
10. Shah, K.N., Dalal, S.J., Desai, M.P., Sheth, P.N., Joshi, N.C. and Ambani, L.M. (1981) White forelock, pigmentary disorder of irides, and long segment Hirschsprung disease: possible variant of Waardenburg syndrome. J. Pediatr., 99, 432–435.


11. Inoue, K., Tanabe, Y. and Lupski, J.R. (1999) Myelin deficiencies in both the central and the peripheral nervous systems associated with a SOX10 mutation. Ann. Neurol., 46, 313–318.
12. Britsch, S., Goerich, D.E., Riethmacher, D., Peirano, R.I., Rossner, M., Nave, K.A., Birchmeier, C. and Wegner, M. (2001) The transcription factor Sox10 is a key regulator of peripheral glial development. Genes Dev., 15, 66–78.
13. Dutton, K.A., Pauliny, A., Lopes, S.S., Elworthy, S., Carney, T.J., Rauch, J., Geisler, R., Haffter, P. and Kelsh, R.N. (2001) Zebrafish colourless encodes sox10 and specifies non-ectomesenchymal neural crest fates. Development, 128, 4113–4125.
14. Kelsh, R.N. and Eisen, J.S. (2000) The zebrafish colourless gene regulates development of non-ectomesenchymal neural crest derivatives. Development, 127, 515–525.
15. Mollaaghababa, R. and Pavan, W.J. (2003) The importance of having your SOX on: role of SOX10 in the development of neural crest-derived melanocytes and glia. Oncogene, 22, 3024–3034.
16. Ward, A., Fisher, R., Richardson, L., Pooler, J.A., Squire, S., Bates, P., Shaposhnikov, R., Hayward, N., Thurston, M. and Graham, C.F. (1997) Genomic regions regulating imprinting and insulin-like growth factor-II promoter 3 activity in transgenics: novel enhancer and silencer elements. Genes Funct., 1, 25–36.
17. Burns, A.J. and Douarin, N.M. (1998) The sacral neural crest contributes neurons and glia to the post-umbilical gut: spatiotemporal analysis of the development of the enteric nervous system. Development, 125, 4335–4347.
18. Montagutelli, X. (1990) GENE-LINK: a program in PASCAL for backcross genetic analysis. J. Hered., 81, 490–491.
19. Schwartz, S., Elnitski, L., Li, M., Weirauch, M., Riemer, C., Smit, A., Green, E.D., Hardison, R.C. and Miller, W. (2003) MultiPipMaker and supporting tools: alignments and analysis of multiple genomic DNA sequences. Nucleic Acids Res., 31, 3518–3524.
20. Margulies, E.H., Blanchette, M., Haussler, D. and Green, E.D. (2003) Identification and characterization of multi-species conserved sequences. Genome Res., 13, 2507–2518.
21. Schwartz, S., Zhang, Z., Frazer, K.A., Smit, A., Riemer, C., Bouck, J., Gibbs, R., Hardison, R. and Miller, W. (2000) PipMaker - a web server for aligning two genomic DNA sequences. Genome Res., 10, 577–586.
22. Sviderskaya, E.V., Hill, S.P., Evans-Whipp, T.J., Chin, L., Orlow, S.J., Easty, D.J., Cheong, S.C., Beach, D., DePinho, R.A. and Bennett, D.C. (2002) p16(Ink4a) in melanocyte senescence and differentiation. J. Natl Cancer Inst., 94, 446–454.
23. Matys, V., Fricke, E., Geffers, R., Gossling, E., Haubrock, M., Hehl, R., Hornischer, K., Karas, D., Kel, A.E., Kel-Margoulis, O.V. et al. (2003) TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res., 31, 374–378.
24. Kim, J., Lo, L., Dormand, E. and Anderson, D.J. (2003) SOX10 maintains multipotency and inhibits neuronal differentiation of neural crest stem cells. Neuron, 38, 17–31.
25. Foster, J.W., Dominguez-Steglich, M.A., Guioli, S., Kowk, G., Weller, P.A., Stevanovic, M., Weissenbach, J., Mansour, S., Young, I.D., Goodfellow, P.N. et al. (1994) Campomelic dysplasia and autosomal sex reversal caused by mutations in an SRY-related gene. Nature, 372, 525–530.
26. Wagner, T., Wirth, J., Meyer, J., Zabel, B., Held, M., Zimmer, J., Pasantes, J., Bricarelli, F.D., Keutel, J., Hustert, E. et al. (1994) Autosomal sex reversal and campomelic dysplasia are caused by mutations in and around the SRY-related gene SOX9. Cell, 79, 1111–1120.
27. Emison, E.S., McCallion, A.S., Kashuk, C.S., Bush, R.T., Grice, E., Lin, S., Portnoy, M.E., Cutler, D.J., Green, E.D. and Chakravarti, A. (2005) A common sex-dependent mutation in a RET enhancer underlies Hirschsprung disease risk. Nature, 434, 857–863.
28. Romeo, G., Ronchetto, P., Luo, Y., Barone, V., Seri, M., Ceccherini, I., Pasini, B., Bocciardi, R., Lerone, M., Kaariainen, H. et al. (1994) Point mutations affecting the tyrosine kinase domain of the RET proto-oncogene in Hirschsprung’s disease. Nature, 367, 377–378.
29. Frazer, K.A., Elnitski, L., Church, D.M., Dubchak, I. and Hardison, R.C. (2003) Cross-species sequence comparisons: a review of methods and available resources. Genome Res., 13, 1–12.
30. Nardone, J., Lee, D.U., Ansel, K.M. and Rao, A. (2004) Bioinformatics for the ‘bench biologist’: how to find regulatory regions in genomic DNA. Nat. Immunol., 5, 768–774.
31. Loots, G.G., Locksley, R.M., Blankespoor, C.M., Wang, Z.E., Miller, W., Rubin, E.M. and Frazer, K.A. (2000) Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science, 288, 136–140.
32. Wasserman, W.W., Palumbo, M., Thompson, W., Fickett, J.W. and Lawrence, C.E. (2000) Human–mouse genome comparisons to locate regulatory sites. Nat. Genet., 26, 225–228.
33. Bagheri-Fam, S., Ferraz, C., Demaille, J., Scherer, G. and Pfeifer, D. (2001) Comparative genomics of the SOX9 region in human and Fugu rubripes: conservation of short regulatory sequence elements within large intergenic regions. Genomics, 78, 73–82.
34. Aparicio, S., Chapman, J., Stupka, E., Putnam, N., Chia, J.M., Dehal, P., Christoffels, A., Rash, S., Hoon, S., Smit, A. et al. (2002) Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science, 297, 1301–1310.
35. Wunderle, V.M., Critcher, R., Hastie, N., Goodfellow, P.N. and Schedl, A. (1998) Deletion of long-range regulatory elements upstream of SOX9 causes campomelic dysplasia. Proc. Natl Acad. Sci. USA, 95, 10649–10654.
36. Meulemans, D. and Bronner-Fraser, M. (2004) Gene-regulatory interactions in neural crest evolution and development. Dev. Cell, 7, 291–299.
37. Peirano, R.I. and Wegner, M. (2000) The glial transcription factor Sox10 binds to DNA both as monomer and dimer with different functional consequences. Nucleic Acids Res., 28, 3047–3055.
38. Cheung, M. and Briscoe, J. (2003) Neural crest development is regulated by the transcription factor Sox9. Development, 130, 5681–5693.
39. Hogan, B., Constantini, F. and Lacy, E. (1986) Manipulating the Mouse Embryo: A Laboratory Manual. Cold Spring Harbor Press, Cold Spring Harbor.
40. Bancroft, J.D. and Stevens, A. (1977) Theory and Practice of Histological Techniques. Churchill Livingstone, Edinburgh.
41. Elms, P., Siggers, P., Napper, D., Greenfield, A. and Arkell, R. (2003) Zic2 is required for neural crest formation and hindbrain patterning during mouse development. Dev. Biol., 264, 391–406.
42. Thomas, J.W., Prasad, A.B., Summers, T.J., Lee-Lin, S.Q., Maduro, V.V., Idol, J.R., Ryan, J.F., Thomas, P.J., McDowell, J.C. and Green, E.D. (2002) Parallel construction of orthologous sequence-ready clone contig maps in multiple species. Genome Res., 12, 1277–1285.
43. Thomas, J.W., Touchman, J.W., Blakesley, R.W., Bouffard, G.G., Beckstrom-Sternberg, S.M., Margulies, E.H., Blanchette, M., Siepel, A.C., Thomas, P.J., McDowell, J.C. et al. (2003) Comparative analyses of multi-species sequences from targeted genomic regions. Nature, 424, 788–793.

Prasad et al., 2008. Confirming the phylogeny of mammals by use of large comparative sequence data sets

RESEARCH ARTICLES

Confirming the Phylogeny of Mammals by Use of Large Comparative Sequence Data Sets

Arjun B. Prasad,*‡ Marc W. Allard,§ NISC Comparative Sequencing Program* and Eric D. Green*

*Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD; †NIH Intramural Sequencing Center, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD; ‡Integrated Biosciences Program, George Washington University; and §Department of Biological Sciences, George Washington University

The ongoing generation of prodigious amounts of genomic sequence data from myriad vertebrates is providing unparalleled opportunities for establishing definitive phylogenetic relationships among species. The size and complexities of such comparative sequence data sets not only allow smaller and more difficult branches to be resolved but also present unique challenges, including large computational requirements and the negative consequences of systematic biases. To explore these issues and to clarify the phylogenetic relationships among mammals, we have analyzed a large data set of over 60 megabase pairs (Mb) of high-quality genomic sequence, which we generated from 41 mammals and 3 other vertebrates. All sequences are orthologous to a 1.9-Mb region of the human genome that encompasses the cystic fibrosis transmembrane conductance regulator gene (CFTR). To understand the characteristics and challenges associated with phylogenetic analyses of such a large data set, we partitioned the sequence data in several ways and utilized maximum likelihood, maximum parsimony, and Neighbor-Joining algorithms, implemented in parallel on Linux clusters. These studies yielded well-supported phylogenetic trees, largely confirming other recent molecular phylogenetic analyses. Our results provide support for rooting the placental mammal tree between Atlantogenata (Xenarthra and Afrotheria) and Boreoeutheria (Euarchontoglires and Laurasiatheria), illustrate the difficulty in resolving some branches even with large amounts of data (e.g., in the case of Laurasiatheria), and demonstrate the valuable role that very large comparative sequence data sets can play in refining our understanding of the evolutionary relationships of vertebrates.

Introduction

Advances in large-scale DNA sequencing are creating new opportunities for molecular phylogeneticists to examine ever-larger amounts of genomic sequence from increasing numbers of taxa. These data have the potential to greatly enhance our ability to answer difficult phylogenetic questions; however, the size and inherent imperfections of such data sets present some unique challenges for accurate tree inference. To begin with, the large numbers of characters that serve as input demand a robust computational infrastructure. Further, the fast-evolving nature of most eukaryotic genomes has yielded large amounts of nonprotein-coding sequences that are not conserved across species, making it difficult to generate complete and accurate multisequence alignments (Margulies et al. 2006).

These and other challenges in dealing with nucleotide sequence–based characters have prompted some to make phylogenetic inferences based on genomic characters that change less frequently than individual nucleotides, such as inversions, transposon insertions, and coding insertions/deletions (indels) (Shimamura et al. 1997; Murphy et al. 2004, 2007; Bashir et al. 2005; Chaisson et al. 2006; Kriegs et al. 2006). However, these genomic characters present their own challenges. First, they are less common, so few may be found to help differentiate short branches (Nishihara et al. 2005). This leads to particular difficulties when assessing support using traditional methods (e.g., bootstrapping). Second, assigning the actual character state (e.g., the presence of the same insertion at a given position or a given rearrangement shared between 2 species) can be difficult because overlapping rearrangements, changing boundaries, and/or sequence divergence can obscure the historical relationships (Murphy et al. 2004). Finally, although methods for modeling such rare genomic characters have been developed (Waddell et al. 2001; Chaisson et al. 2006), biases leading to the potential for homoplastic evolution are not well understood (Boissinot et al. 2004; Chen et al. 2005). For example, it is probable that indel events are less likely to occur independently in multiple lineages than single nucleotide changes; however, the extent of biases in indel location appears to vary among lineages and types of indel events. Thus, although rare genomic changes can be used as informative phylogenetic characters, there are still reasons why sequence-based characters are helpful as independent sources of phylogenetic information.

Meanwhile, traditional phylogenetic analyses based on nucleotide mutations present a different set of challenges. As the costs of procuring and operating large clusters of commodity computers have decreased, it has become increasingly practical to harness significant amounts of processing power to analyze very large sequence-based data sets. This provides the ability to exploit single nucleotide mutations more extensively, yielding more robust phylogenetic inferences. Additionally, there is extensive theory and experience relevant to both modeling the evolution of these characters and using the algorithms to infer phylogenetic trees. However, care must be taken to rule out sources of systematic (or nonstochastic) error, such as long-branch attraction, alignment guide trees, and base-composition biases that can hinder the use of such data sets (Kluge and Wolf 1993; Hillis et al. 2003; Philippe et al. 2005; Rokas and Carroll 2005).

Key words: Placentalia, Eutheria, Mammalia, mammalian phylogeny, phylogenomics, Atlantogenata, molecular systematics.
E-mail: [email protected].
Mol. Biol. Evol. 25(9):1795–1808. 2008 doi:10.1093/molbev/msn104
Advance Access publication May 2, 2008

Published by Oxford University Press 2008. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Indeed, there remain a number of uncertainties about the mammalian phylogeny that are beginning to be clarified using both substitution-based and rare genomic character–based methods. For example, although recent molecular studies have broken the placental (eutherian) mammals into 4 groups or superorders, Afrotheria (Stanhope et al. 1998), Euarchontoglires (also called Supraprimates by Waddell et al. 2001), Laurasiatheria (Waddell, Okada, and Hasegawa 1999), and Xenarthra (Cope 1889), the relative arrangement of these groups is uncertain (Waddell, Okada, and Hasegawa 1999; Madsen et al. 2001; Murphy, Eizirik, Johnson, et al. 2001; Waddell et al. 2001; Scally et al. 2002; Kriegs et al. 2006; Nishihara et al. 2006). Studies investigating the relationships among these 4 mammalian superorders agree that Euarchontoglires and Laurasiatheria should be grouped together to form Boreoeutheria (Murphy, Eizirik, O’Brien, et al. 2001; Springer and de Jong 2001; Waddell et al. 2001; Kriegs et al. 2006; Nishihara et al. 2006), although the exact root among the 3 remaining groups (Boreoeutheria, Afrotheria, and Xenarthra) remains unsettled. Most molecular analyses have weakly supported a basal Afrotherian root (Murphy, Eizirik, O’Brien, et al. 2001; Waddell et al. 2001; Scally et al. 2002; Amrine-Madsen et al. 2003; Waddell and Shelley 2003; Kriegs et al. 2006; Nikolaev et al. 2007; Nishihara et al. 2007), whereas a few have weakly supported the association of Afrotheria and Xenarthra to form Atlantogenata, thereby placing the root between Atlantogenata and Boreoeutheria (Waddell, Cao, et al. 1999; Madsen et al. 2001; Delsuc et al. 2002; Lin et al. 2002; Waddell and Shelley 2003; Hallstrom et al. 2007; Murphy et al. 2007; Waters et al. 2007). On the other hand, the traditional placement of Xenarthra at the root of the placental tree, with Boreoeutheria and Afrotheria together forming Epitheria, has been supported by others (Shoshani and McKenna 1998).

Several recent studies have further investigated these issues, leading to different conclusions. Kriegs et al. (2006) identified 2 shared transposon insertions in Afrotheria and Boreoeutheria that could not be found in Xenarthra or in an opossum outgroup, although they did not consider these results statistically significant. Nikolaev et al. (2007) used comparative sequence data generated for 1% of the human genome (as part of the ENCODE project) to examine the root of Placentalia; they reported significant support for a root between Afrotheria and Exafroplacentalia (Boreoeutheria + Xenarthra), though they found it necessary to perform separate analyses on conserved noncoding sequences and amino-acid sequences to exclude both other possible roots. Murphy et al. (2007) searched for informative coding indels within whole-genome sequence data, finding 4 examples supporting Atlantogenata as the root and none supporting the 2 alternative roots; they also identified 2 retroelement insertions with well-conserved flanking sequence that also support Atlantogenata as the root. In addition, Waters et al. (2007) analyzed a phylogeny of L1 sequences and found further support for Atlantogenata as the root. Hallstrom et al. (2007) and Wildman et al. (2007) used coding sequence extracted from whole-genome shotgun sequencing data to find support for an Atlantogenatan root; however, Nishihara et al. (2007) used a similar data set and found that the use of more complex models of evolution that partitioned individual genes suggested an Afrotherian root. Another unresolved issue in mammalian phylogenetics relates to the relationships among orders within Laurasiatheria. Though the monophyly of Laurasiatheria is fairly well established, the relative arrangements within this taxon have been difficult to establish (aside from placing Eulipotyphla at the base of Laurasiatheria) (Murphy, Eizirik, O’Brien, et al. 2001; Waddell et al. 2001; Arnason and Janke 2002; Chaisson et al. 2006; Nishihara et al. 2006).

To investigate some of the above issues, we sought to confirm and refine the mammalian phylogeny by examining the performance of nucleotide-based phylogenetic analyses using very large genomic sequence data sets. Specifically, we mapped, sequenced, assembled, and analyzed a large genomic region (~1.9 Mb in humans) in 41 mammals, 1 bird, and 2 fishes. Here, we report the experience and results of performing phylogenetic analyses of this large (>69 Mb) comparative sequence data set.

Methods

The comparative sequence data set analyzed here (available at http://www.nisc.nih.gov/data) is an expanded version of that reported by Thomas et al. (2003), with all sequences orthologous to a 1.9-Mb region of human chromosome 7 (build hg18, chr7:115,597,757–117,475,182) that includes 10 known genes (e.g., CFTR, ST7, and CAV1). All species’ sequences were generated by first isolating bacterial artificial chromosome (BAC) clones from the orthologous genomic region using overgo-based hybridization methods (Thomas et al. 2002) and then generating high-quality sequence of each selected BAC (Blakesley et al. 2004). For each species, sets of overlapping BAC sequences were compiled into a single ordered and oriented sequence. The assembled BAC sequences are provided as supplementary data online (available at http://www.nisc.nih.gov/data). All the analyzed sequences were generated in this fashion as part of the NIH Intramural Sequencing Center Comparative Sequencing Program (Thomas et al. 2003), except for the human sequence (generated by the International Human Genome Sequencing Consortium [2001]) and the Fugu sequence (generated by the Fugu Genome Consortium [Aparicio et al. 2002]). Table 1 provides a list of the species whose sequences were analyzed.

Table 1
Multispecies Comparative Sequence Data Set

Clade            Scientific Name                  Common Name                Total Sequence(a)  Coding(b)  Conserved Noncoding(c)
Catarrhini       Homo sapiens                     Human                      1,877,426          20,647     102,884
Catarrhini       Pan troglodytes                  Chimpanzee                 1,573,483          17,962     86,513
Catarrhini       Gorilla gorilla gorilla          Lowland gorilla            1,761,981          20,489     93,962
Catarrhini       Pongo pygmaeus abelii            Sumatran orangutan         1,478,010          18,344     80,548
Catarrhini       Hylobates gabriellae             Red-cheeked gibbon         2,154,624          20,122     97,708
Catarrhini       Colobus guereza                  Black and white colobus    2,023,939          20,575     99,065
Catarrhini       Cercopithecus aethiops vervet    Vervet monkey              1,555,031          18,638     87,051
Catarrhini       Macaca mulatta                   Rhesus macaque             1,678,549          20,569     92,538
Catarrhini       Papio cynocephalus anubis        Olive baboon               1,680,295          20,575     89,897
Platyrrhini      Callithrix jacchus               White-tufted-ear marmoset  1,869,361          19,783     88,306
Platyrrhini      Callicebus moloch                Dusky titi                 1,810,674          18,263     84,974
Platyrrhini      Aotus nancymai                   Owl monkey                 2,059,585          20,581     96,483
Platyrrhini      Saimiri boliviensis boliviensis  Bolivian squirrel monkey   1,695,311          16,692     67,548
Strepsirrhini    Otolemur garnettii               Small-eared galago         1,732,353          20,373     86,512
Strepsirrhini    Lemur catta                      Ring-tailed lemur          1,399,362          20,545     84,060
Strepsirrhini    Microcebus murinus               Gray mouse lemur           1,541,029          19,103     86,239
Rodentia         Rattus norvegicus                Norway rat                 1,883,088          20,344     78,983
Rodentia         Mus musculus                     Mouse                      1,486,509          19,079     73,094
Rodentia         Cavia porcellus                  Guinea pig                 1,815,594          20,504     85,548
Rodentia         Spermophilus tridecemlineatus    13-lined ground squirrel   1,757,846          20,505     89,020
Lagomorpha       Oryctolagus cuniculus            New Zealand white rabbit   1,889,755          20,453     81,226
Cetartiodactyla  Bos taurus                       Cow                        2,022,671          20,357     85,135
Cetartiodactyla  Ovis aries                       Sheep                      1,816,302          20,149     76,197
Cetartiodactyla  Muntiacus muntjak vaginalis      Indian muntjac             1,450,172          15,340     67,216
Cetartiodactyla  Sus scrofa domestica             Domestic pig               1,198,526          17,006     60,133
Perissodactyla   Equus caballus                   Horse                      1,423,288          17,580     75,633
Carnivora        Felis catus                      Cat                        1,737,938          20,374     81,560
Carnivora        Neofelis nebulosa                Clouded leopard            1,691,656          16,001     74,568
Carnivora        Canis familiaris                 Dog                        1,317,853          16,374     69,142
Carnivora        Mustela putorius furo            Domestic ferret            1,494,791          20,456     75,743
Chiroptera       Carollia perspicillata           Seba’s short-tailed bat    1,069,438          14,424     38,369
Chiroptera       Rhinolophus ferrumequinum        Greater horseshoe bat      1,684,815          20,495     85,118
Eulipotyphla     Atelerix albiventris             Middle-African hedgehog    1,985,767          20,081     72,111
Eulipotyphla     Sorex araneus                    European common shrew      1,734,562          18,845     63,737
Xenarthra        Dasypus novemcinctus             Armadillo                  1,454,970          16,850     59,554
Afrotheria       Loxodonta africana               African elephant           2,040,789          20,593     87,812
Afrotheria       Echinops telfairi                Lesser hedgehog tenrec     1,765,269          18,087     74,734
Marsupialia      Didelphis virginiana             North American opossum     1,627,985          15,114     45,484
Marsupialia      Monodelphis domestica            Gray short-tailed opossum  1,174,555          12,480     33,565
Marsupialia      Macropus eugenii                 Tammar wallaby             1,846,640          18,545     61,489
Monotremata      Ornithorhynchus anatinus         Duck-billed platypus       1,268,713          18,543     49,457
Aves             Gallus gallus                    Chicken                    744,025            19,934     32,648
Actinopterygii   Tetraodon nigroviridis           Tetraodon                  257,833            16,938     7,760
Actinopterygii   Fugu rubripes                    Fugu                       273,621            17,033     7,779
Total                                                                        69,805,984         825,745    3,217,103

(a) The total amount of assembled sequence (in bases) following removal of low-quality sequence and overlaps between BAC sequences (Thomas et al. 2003).
(b) The number of bases in the coding partition (see text for details).
(c) The number of bases in the conserved noncoding partition (see text for details).

The assembled sequences were aligned using the threaded blockset aligner (TBA), a local alignment program designed to generate multisequence alignments of large data sets (Blanchette et al. 2004). The final alignment size of all alignable sequence in the data set was 44 taxa by 6,270,442 characters. The initial alignment guide tree was based on the results of Murphy, Eizirik, Johnson, et al. (2001); this was then modified to test alternate hypotheses and to verify that the results were not dependent on the alignment guide tree (see Results and Discussion). The alignment was divided into partitions (i.e., corresponding subportions of the genomic region, as described below) using custom perl scripts (available on request).

Coding sequences were identified based on data from the Consensus CDS Project (http://www.ncbi.nlm.nih.gov/

CCDS), as provided on the UCSC Genome Browser (hg17 build; http://www.genome.ucsc.edu). This approach was used for all genes except MET, which was derived from the longest GENCODE annotation (http://genome.imim.es/gencode/) because it was not present in the Consensus CDS annotation. Coding regions within the multisequence alignment were manually edited, and areas of uncertain alignment were removed, with gap columns added where necessary to maintain phase using jalView 2.2.1 (Clamp et al. 2004). Codon position partitions were generated using every third base of the alignment.

Evolutionarily conserved sequences were identified using annotations represented on the “17-way Most Conserved” track of the UCSC Genome Browser (hg18 build) (Kent et al. 2002; Karolchik et al. 2004; Siepel et al. 2005). These annotations reflect conserved elements that were detected using phastCons (Siepel et al. 2005), which applies a 2-state conserved versus nonconserved phylogenetic hidden Markov model to a 17-species multisequence alignment. PhastCons also uses the 5-parameter general time reversible (GTR) model of sequence evolution with a scaling parameter for conserved sequence. With the parameters used for generating the 17-way Most Conserved track, 90% of the human coding bases in our analyzed genomic region reside within conserved regions. For the studies described here, we extracted all coding bases from the annotated conserved regions, leaving a conserved noncoding sequence partition of 104,918 human bases and a total alignment (including gaps) of 132,422 bases. A character state matrix (coding plus conserved noncoding) was created by adding all manually edited protein-coding sequence alignments to the above conserved noncoding sequence alignment.

We also generated another conserved sequence matrix for comparison purposes using Gblocks 0.91b, which uses a phylogenetically naive approach to identify sequence conservation (parameters: minimum 23 sequences for a conserved position, minimum 37 sequences for a flanking position, maximum 8 contiguous nonconserved positions, minimum initial block length of 10, minimum block length of 10, and half allowed gap positions) (Castresana 2000). The resulting matrix of conserved sequences contained 77,961 bases.

The extraction of specific bases and conversion of data files to FASTA, NEXUS, or PHYLIP formats for subsequent analyses were performed using custom perl scripts (available on request). Maximum parsimony tree searching was performed using PAUP* 4.0b10 (Swofford 2003). All trees were rooted using the fishes (Fugu and Tetraodon) as outgroup taxa, and a constraint for the monophyly of mammals was used. Maximum parsimony trees were generated using random addition replicates as well as bootstrapped with 1,000 replicates of 10 random addition subtree pruning regrafting runs. Neighbor-Joining trees were generated with 1,000 bootstrap replicates using maximum likelihood (ML) HKY85 distances. Incongruence length difference (ILD) or partition homogeneity tests were performed with 1,000 replicates of 10 Tree Bisection-Reconnection random addition tree searches with PAUP* (Bull et al. 1993; Cunningham 1997).

The monophyly of mammals was constrained for all tree searching, except when performing ILD tests, Shimodaira–Hasegawa tests (SH tests), or otherwise noted (Shimodaira and Hasegawa 1999). Bayesian phylogenetic analysis was performed with both partitioned likelihood models and single partition models using the MPI version of MrBayes v3.1.2 and GTR + I + Γ models, as suggested by MrModeltest or Modeltest (Posada and Crandall 1998; Nylander 2004). For the MrBayes analysis, model parameters were estimated from the data, and default priors were used. Metropolis-coupled Markov chain Monte Carlo chains were run for 500,000 or 1,000,000 generations, sampling every 100 generations and using 6 heated chains per each of 2 independent runs. Stationarity was confirmed by manual inspection for convergence of independent runs as well as topological and likelihood value stability. Majority rules consensus trees were generated from the final two-thirds of sampled trees using PAUP. Bootstrapped Bayesian runs were performed using seqboot from PHYLIP to create 100 bootstrap data sets, which were then independently analyzed with MrBayes using the settings described above (500,000 generations, sampling every 100 generations, with the final 6,668 sampled trees used for the consensus) (Felsenstein 2007). The RY-coded coding plus conserved noncoding sequence matrix was run to 5,000,000 generations; the average likelihood values increased until around 500,000 generations, but a single tree became completely resolved before 200,000 generations (data not shown).

ML tree searches were performed with the GTR + Γ model using RAxML-VI-HPC v2.1.3 (Stamatakis 2006). A proportion of invariable sites was not used because that parameter (I) is not implemented in RAxML-VI-HPC v2.1.3 (Stamatakis 2006). The best ML trees were obtained by performing greater than 20 independent tree searches from both completely random and random addition parsimony-based starting trees (default for RAxML) using the “-f d” high-performance hill-climbing algorithm. Highest likelihood trees from multiple runs of RAxML were the same as the trees obtained from multiple runs of PHYML using GTR + Γ + I models for several data sets tested (Guindon and Gascuel 2003). Even though its hill-climbing algorithm was slightly more likely to end at local maxima, we used RAxML because a single run with the conserved sequence partition took roughly one-third the time (~1.5 h) and one-tenth the RAM (~400 Mb) compared with PHYML; this allowed for more efficient use of available cluster resources. Bootstrap replicates were performed using the parallelized MPI-enabled version of RAxML-VI-HPC v2.1.3 with default settings. SH tests were performed with the best ML trees found using 15 random addition runs of RAxML using constraints for taxa in question. The best trees were used for SH tests with PAUP* and 10,000 RELL replicates with the GTR + Γ + I model (Shimodaira and Hasegawa 1999). All tree searching was performed using Linux clusters. Analyses with PAUP and RAxML were scripted and split using custom perl and shell scripts. Trees were visualized with the assistance of TreeGraph (Muller and Muller 2004).

Results and Discussion

Overview

We generated and compiled a high-quality comparative sequence data set consisting of sequences from 44 vertebrate species (table 1), all of which are orthologous to a 1.9-Mb region on human chromosome 7 (Thomas et al. 2003). This entire genomic region is syntenic in all mammals, reptiles, and fishes that we have examined to date (including some whose sequence was not analyzed in this study). Together, the consistent long-range synteny, Blast-based sequence comparisons, and nature of the cross-species BAC isolation and mapping process (see Thomas et al. [2002]) confirm the orthologous relationship of the sequences within the analyzed data set. The amount of assembled, annotated, and quality-trimmed sequence in the data set varies from 257 kb from Fugu to 2 Mb from elephant, with this variance reflecting both intrinsic differences in the size of the genomic region among species as well as incomplete sequence coverage for some species (see table 1).

Sequences were aligned using TBA (Blanchette et al. 2004). Because TBA produces local multisequence alignments, it handles small inversions or other local rearrangements well and avoids incorporating regions where the alignment uncertainty is high (Blanchette et al. 2004; Pollard et al. 2006). Protein-coding sequences were excised and manually edited to constrain coding indels to multiples of 3 bases, unless there was significant evidence for other indel sizes. We also removed any portions of the alignment where gap positions could not be easily determined. From this alignment, we made a coding sequence matrix, which contained 20,647 human-coding bases and 21,129 total characters; this matrix was then analyzed with ML, Bayesian, maximum parsimony, and Neighbor-Joining

approaches. Bayesian analysis yielded a highly resolved tree with posterior probabilities of 1.0 for all nodes (supplementary fig. 1, Supplementary Material online). This tree is fully congruent with the ML tree, and both trees are largely in agreement with other recent phylogenetic studies (Waddell, Cao, et al. 1999; Murphy, Eizirik, Johnson, et al. 2001; Murphy, Eizirik, O’Brien, et al. 2001; Waddell et al. 2001; Delsuc et al. 2002; Lin et al. 2002; Amrine-Madsen et al. 2003; Phillips and Penny 2003; Springer et al. 2003, 2004; Waddell and Shelley 2003; Kriegs et al. 2006; Nishihara et al. 2006; Hallstrom et al. 2007; Murphy et al. 2007), though a few nodes are different than those reported in some of these studies (see below and fig. 1).

FIG. 1. ML tree derived from the analysis of the coding sequence partition using RY-coded bases and a codon position partitioned CF + Γ model. Branch lengths indicate likelihood-inferred substitutions per site with a GTR + Γ model. ML bootstrap proportions are listed above Bayesian posterior probabilities for all branches at less than 100% bootstrap proportion and 1.0 Bayesian posterior probability support. Platypus was constrained to the mammals (its branch is marked with an asterisk to reflect this). The fishes (Tetraodon and Fugu) were used to root, but their branches are not shown. Branch lengths were optimized using ML from nucleotide-coded data with a GTR + Γ model.

The ML bootstrap proportions are significantly lower than Bayesian posterior probabilities. To investigate the relationship between the bootstrap proportions and the Bayesian posterior probabilities, we developed a “Bayesian bootstrap score” based on the majority rules consensus tree at stationarity for each bootstrapped data set. The resulting score was only slightly higher than the ML bootstrap proportions (see supplementary fig. 2, Supplementary Material online); these results support the notion that Bayesian posterior probabilities may be misleadingly high in some cases and are not necessarily comparable with traditional bootstrap measures of support (Waddell et al. 2001, 2002; Douady et al. 2003; Yang 2007). It is possible that more complex models and/or Bayesian techniques for combining data could lead to posterior probabilities that better reflect the uncertainties in the tree (Edwards et al. 2007; Liu and Pearl 2007). Some bootstrap replicates resulted in unlikely rearrangements of chicken and platypus (e.g., platypus outside the chicken and other mammals). We also saw this when the coding plus conserved noncoding sequence matrix was nucleotide coded and analyzed with maximum parsimony. We believe that these findings are a consequence of the long branches leading to the monotreme and reptile species represented in this study, as well as the low taxon sampling of those groups and the distant fish outgroup. Because these unlikely arrangements affected bootstrap support values, we constrained the monophyly of mammals unless otherwise noted.

Although Bayesian posterior probabilities are high for all branches except the Homo–Pan–Gorilla group (human, chimpanzee, and gorilla), ML support is not sufficient to resolve some branches. In an attempt to rectify this and to take advantage of our notably large sequence data set, we investigated using conserved noncoding sequence for tree construction. Conserved bases were identified using phastCons (Siepel et al. 2005), which utilizes a phylogenetic hidden Markov model to distinguish conserved versus nonconserved sequence and a GTR model of substitution rates to identify conserved segments (other details are provided in Methods). With the parameters used, 90% of the known coding sequence and 5.5% of the presumed noncoding sequence within the region were identified as conserved. We generated a matrix containing 153,552 bases by combining the coding sequence alignment and the conserved noncoding sequence alignment to form a coding plus conserved noncoding sequence matrix; this matrix yielded highly resolved trees with strong branch support (fig. 2).

Although we found significant differences between the trees generated with the coding versus conserved noncoding sequence partitions based on ILD tests (P = 0.001), this was entirely due to the less-conserved third codon positions (Bull et al. 1993). When third codon positions were excluded (i.e., using partitions consisting of codon positions 1 and 2 vs. conserved noncoding sequences), there was no significant difference between the results obtained with each partition (ILD test P = 0.432; see table 2). RY-based coding of the data (discussed below) eliminated the significant differences between partitions, even when the third codon positions were included (ILD test P = 0.328; see table 2). Indeed, we only encountered differences between the trees generated with the 2 partitions on branches that were weakly supported by the coding sequence data set (figs. 1 and 2). Finally, we performed likelihood and Bayesian analysis with models partitioned by codon posi-

separated the fish as our outgroup (5 indels supporting). Notably, we found 3 indels that are homoplastic on any of our trees, 2 of which (labeled “l” and “p” in supplementary fig. 6, Supplementary Material online) likely reflect multiple independent deletion events as they were detected in marsupials and only 1–3 species in Euarchontoglires. The third indel that appears homoplastic on our trees joins dusky titi to the apes (human, chimpanzee, gorilla, orangutan, and gibbon); this may be the result of lineage sorting or independent events. Although multiple deletion events may be relatively rare, the observed homoplasy suggests that caution should be used in interpreting support for taxa based on small numbers of such events.

Nonphylogenetic Signals

Large sequence data sets, such as the one analyzed here, offer the potential to resolve weakly supported branches; however, they can also be prone to detecting nonphylogenetic signals that confound the results (Philippe et al. 2005). We examined several potential sources of “systematic error” or “nonphylogenetic signals,” attempting to exclude them or to control for their influence (Philippe et al. 2005). Specifically, we considered base-composition bias, incongruence across the genomic region, missing data, influence of the alignment guide tree, and long-branch attraction as possible sources of nonphylogenetic signal. We further examined long-branch attraction during the analysis of various individual taxa.

Base Composition

There are significant differences in base composition among species, ranging from 45% to 58% G + C in the coding sequence partition and 32% to 45% G + C in the conserved noncoding sequence partition. The chi-square test for homogeneity of base frequencies across taxa was highly significant (P < 0.000001 for all partitions examined, except with codon second positions P = 0.00926835), though the validity of this test is questionable because it does not take phylogenetic structure into account. To reduce the effects of nonphylogenetic signals due to base-composition differences among species, we coded the nucleotides as purines or pyrimidines (RY coding) (Phillips et al. 2001, 2004; Philippe et al. 2005).
This approach also tion (i.e., with the same model of evolution but allowing has the benefit of removing signals deriving from the more parameters of those models to vary between partitions); common transitions that may be associated with higher no significant differences in tree topology were seen, al- rates of saturation due to reversals. Indeed, we found that though minor differences in branch support scores were RY-coding eliminated significant differences in trees sup- noted. Because phastCons bases its identification of con- ported by the less-conserved codon third positions (table 2). served sequences on a phylogenetic tree, we also used Because of the large data set size, we maintained sufficient Gblocks, a phylogenetically naive program for finding con- signal with RY-coded data to make robust phylogenetic in- served regions. We found no differences in the ML tree gen- ferences (figs. 1 and 2). We further found that RY-coding erated from conserved regions identified by phastCons and eliminates almost all base-composition differences among Gblocks, with the associated bootstrap proportions similar species. The coefficient of variation between purines and in both cases (data not shown). pyrimidines was 84% lower than the coefficient of varia- We also examined coding sequence for high-confidence tion between G þ CandAþ T for the coding sequence indels, finding 24 phylogenetically informative indels (sup- partition and 87% lower for the conserved noncoding se- plementary fig. 6, Supplementary Material online). A large quence partition. RY-coding also eliminated significant dif- number of the coding indels were shared by closely related ferences in trees supported by codon third positions (ILD species, such as rat and mouse (7 indels supporting), and P 5 0.948). 197 Phylogeny of Mammals 1801
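The RY coding described above can be illustrated with a minimal Python sketch. It recodes each nucleotide as a purine (R) or pyrimidine (Y) and compares how much G + C content and purine content vary among taxa. The toy sequences, the function names, and the definition of the coefficient of variation used here (population standard deviation divided by the mean, taken across taxa) are assumptions made for illustration; they are not the data or necessarily the exact calculation used in this study.

```python
from statistics import mean, pstdev

PURINES = set("AG")
PYRIMIDINES = set("CT")


def ry_code(seq):
    """Recode A/G as R and C/T as Y; gaps and ambiguity codes are left unchanged."""
    coded = []
    for base in seq.upper():
        if base in PURINES:
            coded.append("R")
        elif base in PYRIMIDINES:
            coded.append("Y")
        else:
            coded.append(base)
    return "".join(coded)


def gc_fraction(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    return sum(b in "GC" for b in bases) / len(bases)


def purine_fraction(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are purines (A or G)."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    return sum(b in PURINES for b in bases) / len(bases)


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (assumed definition)."""
    return pstdev(values) / mean(values)


# Hypothetical toy alignment: G + C content differs among taxa,
# but purine content does not.
toy_alignment = {
    "taxon_a": "GCGCGCATGC",
    "taxon_b": "ATATGCATAT",
    "taxon_c": "GCATGCAT-N",
}

gc_cv = coefficient_of_variation([gc_fraction(s) for s in toy_alignment.values()])
ry_cv = coefficient_of_variation([purine_fraction(s) for s in toy_alignment.values()])
```

In this toy case the purine fraction is identical in all three taxa, so `ry_cv` is zero while `gc_cv` is not, which mirrors the direction (though not the magnitude) of the reduction reported above.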

[Figure 2 tree image. Scale bar: 0.1 substitutions per site. Clade labels shown: Catarrhini, Cercopithecidae, Platyrrhini, Strepsirrhini, Primata, Rodentia, Glires, Euarchontoglires, Cetartiodactyla, Carnivora, Chiroptera, Laurasiatheria, Boreoeutheria, Afrotheria, Xenarthra, Atlantogenata, Metatheria, Prototheria.]

FIG. 2.—ML tree derived from the analysis of coding plus conserved noncoding sequence matrix using RY-coded bases. A CF + Γ model was used, with 4 partitions: 3 for codon positions and 1 for conserved noncoding sequence. Long branches leading to platypus and chicken were abbreviated for clarity. Other features are the same as indicated in figure 1.
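The 4-partition scheme named in the caption of figure 2 (3 codon-position partitions plus 1 conserved noncoding partition) can be sketched as a mapping from alignment columns to partitions. This is only an illustration under stated assumptions: it supposes that the coding columns come first in the matrix, are in frame, and start at codon position 1. The function name and column layout are hypothetical and are not taken from the software used in this study.

```python
def codon_position_partitions(n_coding_cols, n_noncoding_cols):
    """Assign 0-based alignment columns to four model partitions.

    Assumes (hypothetically) that coding columns precede noncoding columns
    and that the coding block is in frame, starting at codon position 1.
    """
    if n_coding_cols % 3 != 0:
        raise ValueError("coding block length must be a multiple of 3")
    return {
        "codon_pos1": list(range(0, n_coding_cols, 3)),
        "codon_pos2": list(range(1, n_coding_cols, 3)),
        "codon_pos3": list(range(2, n_coding_cols, 3)),
        "conserved_noncoding": list(
            range(n_coding_cols, n_coding_cols + n_noncoding_cols)
        ),
    }


# Toy example: 2 codons of coding sequence followed by 2 noncoding columns.
example = codon_position_partitions(6, 2)
```

Each partition can then be given its own model parameters while sharing a single tree, which is the sense in which the caption describes a partitioned model.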

Incongruence Across the Genomic Region

Combining phylogenetic data is thought to potentially be problematic (Bull et al. 1993). For example, recent genome-wide studies in yeast and bacteria encountered problems with combining large numbers of protein-coding sequence alignments (Comas et al. 2007; Edwards et al. 2007). To check for heterogeneity of support across the alignment, we split the coding plus conserved noncoding sequence matrix into 10 equal-sized segments and analyzed

Table 2
Pairwise ILD P Values (a)

Partition                  Coding (b)  Coding       Coding       Coding       Coding positions  Conserved
                                       position 1   position 2   position 3   1 and 2 (c)       noncoding (d)
                                       (c)          (c)          (c)
Coding (b)                 -           -            -            -            -                 <0.001
Coding pos1 (c)            -           -            0.323        <0.001       -                 0.460
Coding pos2 (c)            -           0.648        -            0.983        -                 0.324
Coding pos3 (c)            -           0.569        0.960        -            0.002             <0.001
Coding pos1 and pos2       -           -            -            0.599        -                 0.432
Conserved noncoding (d)    0.328       0.361        0.696        0.951        0.163             -

(a) Values above diagonal are for NT-coded data, below are for RY-coded data.
(b) All protein-coding sequences.
(c) Codon position 1, 2, or 3 (as indicated) within coding sequence.
(d) Conserved noncoding sequence.

FIG. 3.—ML trees for each of 10 sequential, equal-sized partitions from the coding plus conserved noncoding sequence matrix. Numbers (1–10) reflect the specific partition used. The arrangement of taxa and branches indicated in colors other than black vary among partitions. Nodes annotated with hollow circles have less than 50% bootstrap proportions, those with shaded circles have 50% to 75% bootstrap proportions, and those with solid circles have 75% bootstrap proportions or greater. Branches that are the same in all trees are indicated in black, with some collapsed to higher level taxa for simplicity.

each with ML (fig. 3 and table 3). Although we found support for the Atlantogenata hypothesis with all 10 segments, there was considerable variation in the strength of that support, with some segments providing much larger shares of the overall support. One segment (6; see table 3 and fig. 3) did not contain any armadillo sequence (due to a gap in the BAC map) and thus could not provide support for the placental root. There was considerable heterogeneity of support among the segments for some other clades as well. Laurasiatheria was generally not well resolved in any of our analyses, and orders within Laurasiatheria did not have a consistent relationship among the different segments. The relationship between the marsupials and platypus varied as well, perhaps because of the long branches and poor taxon sampling of only one monotreme. Additionally, the relationships among owl monkey, marmoset, and squirrel monkey differed among partitions, though with weak support (discussed further below). A majority of segments supported Marsupionta, though overall support was strongly in favor of Theria (figs. 2 and 3).

Missing Data

Missing data have been considered a source of systematic error in phylogenetic analyses (Huelsenbeck 1991; Kearney 2002), although some simulation studies suggest

Table 3
Relative Likelihood Support for Placental Root across Coding Plus Conserved Noncoding Sequence Matrix

Partition          1    2    3    4    5    6 (a)  7    8     9    10   Total  Combined (b)  SH Test P
Atlantogenata      0    0    0    0    0    n/a    0    0     0    0    0      0             Best
Epitheria          4.5  1.7  2.9  9.1  3.9  n/a    1.6  16.7  7.5  8.0  56.0   69.4          <0.0001
Exafroplacentalia  4.5  1.8  0.5  7.9  3.8  n/a    1.6  16.9  6.6  6.7  50.3   62.9          <0.0001

NOTE.—n/a, not applicable.
(a) Partition 6 contains a region where there is incomplete armadillo sequence, so no placental root can be inferred.
(b) Likelihood score for entire coding plus conserved noncoding sequence matrix with 4 partitions for model parameters (3 for codon position and 1 for conserved noncoding).

that missing data should not be a problem for data sets with sufficient characters to provide robust signal (Rosenberg and Kumar 2003; Philippe et al. 2004, 2005; Wiens 2005). The coding sequence alignment had 11% missing data (including indels and sequencing gaps). To test if the missing data were biasing our analyses, we performed ILD tests on the RY-coded coding sequence partition, comparing alignment columns with gaps in 3 or fewer species and those with gaps in more than 3 species; no significant differences were seen (P = 0.591, supplementary table 1, Supplementary Material online). Similar results were obtained when the same analysis was performed with the RY-coded conserved noncoding sequence partition (P = 0.627). Finally, to see if additional missing data would strongly affect the analysis, we randomly deleted 25% of nongap bases from the coding plus conserved noncoding sequence matrix and observed no effect on the resulting ML tree, other than slightly changing the bootstrap support for a few branches (data not shown).

Alignment Guide Tree

We also analyzed a matrix consisting of all TBA-aligned sequence (containing 1,798,347 human bases). Because of computational constraints, we analyzed this data set only by maximum parsimony and ML methods. The trees derived from these analyses were almost completely resolved; however, by permuting the alignment guide tree, we were able to change the relative arrangement of those branches showing 100% bootstrap support in this analysis. Using only the conserved and protein-coding portions of the alignment yielded a tree with fewer well-resolved branches; however, branches with >70% bootstrap support were resistant to permutations of the alignment guide tree. Notably, the only branches with bootstrap proportions <70% were the interordinal branches within Laurasiatheria. These results confirm that difficult-to-resolve branches are more susceptible to biases introduced by aligning less-conserved sequences and that biases due to alignment guide trees can be largely controlled by only considering conserved sequences and strongly supported branches. To control for any possible effect of the alignment guide tree on our phylogenetic analysis, we permuted the alignment guide tree and reanalyzed the data for all controversial branches to confirm that the alignment guide tree was not biasing the results.

Individual Taxa

Euarchontoglires

The primate portion of the tree was strongly supported regardless of the partition or alignment guide tree used with the exception of the Homo–Pan–Gorilla group, which had insufficient informative changes in the RY-coded coding sequence partition to resolve (fig. 1). However, when transitions are included, there is sufficient signal and 100% support for the sister relationship between chimpanzees and humans (supplementary fig. 1, Supplementary Material online). The major primate clades (including Catarrhini, Platyrrhini, and Strepsirrhini) were all supported at 100% by ML, Bayesian, Neighbor-Joining, and maximum parsimony approaches. These results largely agree with other recently reported molecular phylogenies for the primates (Barroso et al. 1997; Goodman et al. 1998; Poux and Douzery 2004; Ray et al. 2005; Opazo et al. 2006), with the exception of the relationships within Cebidae (marmoset, squirrel monkey, and owl monkey) (Barroso et al. 1997). Molecular systematic studies have disagreed on the arrangement of these 3 taxa. We find support for the association of squirrel monkey with marmoset. Although we see consistent support for this association regardless of alignment guide tree, RY- or nucleotide-coding, or tree inference algorithm, bootstrap proportions are relatively weak, and the support varies across the genomic region under study (fig. 3). The monophyly of both Glires and Euarchontoglires was strongly supported by Bayesian, ML, and maximum parsimony approaches, as in other recent studies (Thomas et al. 2003; Douzery and Huchon 2004); note that this is in sharp contrast to the results of Misawa and Janke (2003). Notably, Neighbor-Joining support for Glires and Euarchontoglires was also strong (bootstrap proportion 91%, supplementary fig. 5, Supplementary Material online) in contrast to the Neighbor-Joining results of Wildman et al. (2007), who used whole-genome shotgun sequence data; this is potentially explained by the much greater taxon sampling afforded by our data set. The placement of guinea pig securely within Rodentia is also strongly supported by our data (SH test P < 0.0001 excluding guinea pig as an outgroup to the other rodents), in agreement with others (Sullivan and Swofford 2004); this result holds when the alignment guide tree is permuted to place guinea pig outside the rodents.

Laurasiatheria

Within Laurasiatheria, there has been considerable disagreement about the arrangement and composition of the historical order Insectivora, although most recent molecular studies divide up this group among several orders, with tenrec falling in Afrotheria (Stanhope et al. 1998; Lin et al. 2002; Nishihara et al. 2006) and shrew and Erinaceous hedgehogs falling in Eulipotyphla (within Laurasiatheria) (Madsen et al. 2001; Murphy, Eizirik, Johnson, et al. 2001; Lin et al. 2002; Malia et al. 2002; Amrine-Madsen et al. 2003). Many studies of mitochondrial DNA have placed the Eulipotyphlans at the root of Placentalia, probably because of the unusually high AT content of their mitochondrial genomes; however, recent studies with greater taxon sampling and more complex models have also placed them in Laurasiatheria (Waddell, Cao, et al. 1999; Arnason et al. 2002; Lin et al. 2002; Gibson et al. 2005; Kjer and Honeycutt 2007). Our Neighbor-Joining tree has Eulipotyphla at the root of Placentalia, but this may be a consequence of the long-branch lengths of the 2 Eulipotyphlans (hedgehog and shrew) represented in our data set (fig. 2 and supplementary fig. 5, Supplementary Material online). Using ML, Bayesian, and maximum parsimony approaches, our analyses consistently place Eulipotyphla at the root of Laurasiatheria, as do most other recent studies (fig. 2 and supplementary figs. 3 and 5; Supplementary Material online) (Murphy, Eizirik, O'Brien, et al. 2001; Waddell et al.

2001; Arnason et al. 2002; Scally et al. 2002; Amrine-Madsen et al. 2003; Waddell and Shelley 2003; Nishihara et al. 2006; Nikolaev et al. 2007).

The placement of Perissodactyla, represented here by the horse, has been another source of controversy. Usually, this group is placed either sister to Cetartiodactyla (Murphy, Eizirik, Johnson, et al. 2001; Lin et al. 2002) or sister to Carnivora (Murphy, Eizirik, O'Brien, et al. 2001; Arnason and Janke 2002; Amrine-Madsen et al. 2003), in most cases with weak support (Waddell, Cao, et al. 1999). Schwartz et al. (2003) found a single transposon insertion supporting the Perissodactyla–Carnivora association; meanwhile, Nishihara et al. (2006) found a single transposon insertion supporting a Perissodactyla–Carnivora association and 5 insertions supporting a Perissodactyla–Chiroptera–Carnivora (Pegasoferae, Nishihara et al. 2006) association (i.e., excluding the traditional Perissodactyla–Cetartiodactyla association of hoofed mammals that we see with these analyses). Of note, Nishihara et al. (2006) did also find one transposon insertion that conflicted with the Pegasoferae hypothesis. Even with the large number of characters analyzed here, we only found weak bootstrap support for the placement of Perissodactyla (fig. 2), although it tended to associate closest with the Cetartiodactylans and secondarily with the Carnivores. Across the region, support for the arrangement of orders varied significantly, with only segments 4, 5, and 10 agreeing and none of the segments agreeing with the ML tree for the entire matrix (fig. 3). Thus, the arrangement of orders within Laurasiatheria appears to be difficult to resolve even with large amounts of sequence data and reasonably large numbers of species represented. We further found that the relative arrangement of Laurasiatherian orders was highly sensitive to alignment guide tree artifacts, though not in a predictable way. Using the coding plus conserved noncoding sequence matrix, we performed SH tests with the 5 most supported Laurasiatherian trees from the literature; none could be excluded with high confidence, and this likely is due, in part, to the short branches separating Laurasiatherian orders. Perhaps with increased taxon sampling, this problem will be more tractable. It may be that a strong nonphylogenetic signal or incomplete lineage sorting is obscuring the interordinal relationships within Laurasiatheria. Methods that treat gene trees and species trees simultaneously, such as that described by Liu and Pearl (2007), might also be able to better resolve such regions.

FIG. 4.—Three possible roots for Placentalia. SH test results from the coding plus conserved noncoding sequence matrix for both nucleotide- and RY-coded matrices. (A) Hypothesis rooting Placentalia between Xenarthra and Epitheria (Boreoeutheria + Afrotheria). (B) Hypothesis rooting Placentalia between Afrotheria and Exafroplacentalia (Boreoeutheria + Xenarthra). (C) Hypothesis rooting Placentalia between Boreoeutheria and Atlantogenata (Afrotheria + Xenarthra).
[Figure 4 panel annotations: (A) Epitheria and (B) Exafroplacentalia, SH test excludes, P < 0.0001; (C) Atlantogenata, supported solution.]

Theria

Although monotremes are considered mammals, their phylogenetic placement has long been controversial. The hierarchical placement of monotremes as an outgroup of the other mammals has been challenged by molecular and morphological studies that placed Monotremata as a sister group to the marsupials in a clade called Marsupionta (Janke et al. 1996, 1997). Our results, however, agree with recent molecular studies that yielded significant evidence (including coding indels) in support of the monophyly of Theria (placental mammals and marsupials), with the monotremes as the first branch of the mammalian tree (Killian et al. 2001; Phillips and Penny 2003; van Rheede et al. 2006). We find strong bootstrap support (94%, see figs. 1 and 2) for Theria by Neighbor-Joining, maximum parsimony, Bayesian, and ML approaches, and SH tests with nucleotide-coded data just reach significance (P = 0.0251) in excluding Marsupionta with the coding plus conserved noncoding sequence matrix. However, there is significant heterogeneity of support for Theria across the region, and a majority of the segments (6/10) support Marsupionta (fig. 3). These findings are consistent with other recent molecular and morphological analyses that supported the monophyly of Theria (Hu et al. 1997; Phillips and Penny 2003; van Rheede et al. 2006) but illustrate the difficulty of determining the relationships between these clades.

Placentalia

Although the nucleotide-coded protein-coding sequence partition failed to resolve the root of Placentalia (ML bootstrap <60%, supplementary fig. 1, Supplementary Material online), the RY-coded coding sequence partition supports an Atlantogenatan root (fig. 1). Adding the conserved noncoding partition provides high statistical support for Atlantogenata, both with nucleotide- and RY-coded data (figs. 2 and 4). Bootstrap support of 100% is seen with all model-based approaches used (including ML, Bayesian, Neighbor-Joining, and minimum evolution; supplementary figs. 4 and 5). SH tests using the coding plus conserved noncoding sequence matrix exclude Epitheria and Exafroplacentalia, with the results significant past the limits of PAUP and CONSEL (P < 0.0001) (fig. 4). Because of the limited number of Afrotherian and Atlantogenatan species in this study, some caution is warranted in interpreting these results. Maximum parsimony analysis of the nucleotide-coded coding sequence partition supports an Epitherian root (e.g., fig. 4A), but when codon third position sites are removed or the sequence is RY-coded, bootstrap support is reduced to <50%. Maximum parsimony analysis of the coding plus conserved noncoding sequence partition also weakly supports Epitheria (supplementary fig. 4).

To exclude the influence of biases introduced by the alignment guide tree, we realigned the sequences using a guide tree based on the highest likelihood tree constrained to each possible root, then repeated the likelihood analysis. In all cases, the ML tree derived from the coding plus conserved noncoding sequence matrix was rooted between Atlantogenata and Boreoeutheria with 100% bootstrap support and highly significant SH test results (P < 0.001). Because tenrec has a significantly longer branch length with these data than elephant or armadillo, we tried individually removing tenrec and elephant. With either tenrec or elephant missing, we still saw 99% ML bootstrap support for Atlantogenata. We also separated the coding plus conserved noncoding sequence matrix into 10 equally sized partitions and analyzed each separately (fig. 3). Although the likelihood values for Atlantogenata varied significantly among the partitions, all partitions supported an Atlantogenatan root (table 3).

These results agree with some other recent large-scale analyses on mostly independent data sets (e.g., Hallstrom et al. [2007]; Murphy et al. [2007]; and Wildman et al. [2007]) but conflict with the findings of Nikolaev et al. (2007), Nishihara et al. (2007), and Kriegs et al. (2006). Kriegs et al. (2006) found 2 transposon insertions in Afrotheria and Boreoeutheria that were not found in armadillo or sloth. These transposon sequences are quite old, and the flanking sequence is not well conserved. Thus, the transposon-associated sequences may have mutated out of recognition in the 30 Myr from the placental root to the divergence of armadillo and sloth; alternatively, multiple transposon insertions may have occurred in Afrotheria and Boreoeutheria (Springer et al. 2003; Kriegs et al. 2006; Murphy et al. 2007). Homoplasy for transposon insertions due to targeted insertions or lineage sorting on short branches, though presumably rare, has been reported (Pecon-Slattery et al. 2004; van de Lagemaat et al. 2005; Yu and Zhang 2005; Nishihara et al. 2006) and could also explain these results.

Nikolaev et al. (2007) analyzed amino acid and conserved noncoding genomic sequences from 14 species to examine the root of Placentalia. Using ML analyses of conserved noncoding sequence from the ENCODE pilot project regions (http://www.genome.gov/10005107), they exclude the Epitheria hypothesis and, separately, use amino acid sequences derived from the same regions to exclude the Atlantogenata hypothesis. Notably, analyses using their largest data set (conserved noncoding sequence) failed to differentiate between rooting Placentalia at Atlantogenata or Exafroplacentalia. Additionally, their limited taxon and outgroup sampling argues for caution in interpreting the final results (Delsuc et al. 2002); for example, when we only analyzed data from the 14 species studied by Nikolaev et al. (2007), we still found support for Atlantogenata as the root, although the bootstrap support was weak. The data used in our analysis contain significantly more taxa, both ingroup and outgroup, than other recent large-scale nucleotide-based analyses, and this may affect the results significantly.

Summary

In summary, we used a comparative sequence data set that contains a remarkably large number of conserved bases to derive a phylogeny that provides additional evidence to resolve some of the controversial branches in the mammalian lineage. We find significant support for an Atlantogenatan root of Placentalia, as well as additional evidence for the monophyly of Theria. Our studies highlight the difficulties in resolving some very short mammalian branches (e.g., interordinal relationships within Laurasiatheria), even with large amounts of data. Our work further illustrates the value of large genomic sequence data sets for improving the resolution of phylogenetic trees, in this case, to clarify some of the remaining ambiguities within the mammalian tree. Sequences from an increasing number of mammalian taxa should help to resolve the remaining ambiguities associated with the short branches within and between the placental orders.

Supplementary Material

Supplementary figures 1–6 and table 1 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

Acknowledgments

We thank Adam Siepel and Webb Miller for helpful comments about this manuscript. We also thank members of the NISC Comparative Sequencing Program (particularly B. Blakesley, G. Bouffard, J. McDowell, B. Maskeri, M. Park, N. Hanson, and P. Thomas) for providing leadership in the generation of the comparative sequence data analyzed here. We also thank Associate Editor Scott Edwards and 2 anonymous reviewers for unusually thorough and helpful comments that ultimately resulted in a significantly improved manuscript. This research was supported in part by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health.

Literature Cited

Amrine-Madsen H, Koepfli KP, Wayne RK, Springer MS. 2003. A new phylogenetic marker, apolipoprotein B, provides compelling evidence for eutherian relationships. Mol Phylogenet Evol. 28:225–240.

Aparicio S, Chapman J, Stupka E, et al. (41 co-authors). 2002. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science. 297:1301–1310.

Arnason U, Adegoke JA, Bodin K, Born EW, Esa YB, Gullberg A, Nilsson M, Short RV, Xu X, Janke A. 2002. Mammalian mitogenomic relationships and the root of the eutherian tree. Proc Natl Acad Sci USA. 99:8151–8156.

Arnason U, Janke A. 2002. Mitogenomic analyses of eutherian relationships. Cytogenet Genome Res. 96:20–32.

Barroso CML, Schneider H, Schneider MPC, Sampaio I, Harada ML, Czelusniak J, Goodman M. 1997. Update on the phylogenetic systematics of new world monkeys: further DNA evidence for placing the pygmy marmoset (Cebuella) within the genus Callithrix. Int J Primatol. 18:651–674.

Bashir A, Ye C, Price AL, Bafna V. 2005. Orthologous repeats and mammalian phylogenetic inference. Genome Res. 15:998–1006.

Blakesley RW, Hansen NF, Mullikin JC, et al. (22 co-authors). 2004. An intermediate grade of finished genomic sequence suitable for comparative analyses. Genome Res. 14:2235–2244.

Blanchette M, Kent WJ, Riemer C, et al. (12 co-authors). 2004. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 14:708–715.

Boissinot S, Entezam A, Young L, Munson PJ, Furano AV. 2004. The insertional history of an active family of L1 retrotransposons in humans. Genome Res. 14:1221–1231.

Bull JJ, Huelsenbeck JP, Cunningham CW, Swofford DL, Waddell PJ. 1993. Partitioning and combining data in phylogenetic analysis. Syst Biol. 42:384–397.

Castresana J. 2000. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol. 17:540–552.

Chaisson MJ, Raphael BJ, Pevzner PA. 2006. Microinversions in mammalian evolution. Proc Natl Acad Sci USA. 103:19824–19829.

Chen JM, Stenson PD, Cooper DN, Ferec C. 2005. A systematic analysis of LINE-1 endonuclease-dependent retrotranspositional events causing human genetic disease. Hum Genet. 117:411–427.

Clamp M, Cuff J, Searle SM, Barton GJ. 2004. The Jalview Java alignment editor. Bioinformatics. 20:426–427.

Comas I, Moya A, Gonzalez-Candelas F. 2007. From phylogenetics to phylogenomics: the evolutionary relationships of insect endosymbiotic gamma-proteobacteria as a test case. Syst Biol. 56:1–16.

Cope ED. 1889. The Edentata of North America. Am Nat. 23:657–664.

Cunningham CW. 1997. Can three incongruence tests predict when data should be combined? Mol Biol Evol. 14:733–740.

Delsuc F, Scally M, Madsen O, Stanhope MJ, de Jong WW, Catzeflis FM, Springer MS, Douzery EJ. 2002. Molecular phylogeny of living xenarthrans and the impact of character and taxon sampling on the placental tree rooting. Mol Biol Evol. 19:1656–1671.

Douady CJ, Delsuc F, Boucher Y, Doolittle WF, Douzery EJ. 2003. Comparison of Bayesian and maximum likelihood bootstrap measures of phylogenetic reliability. Mol Biol Evol. 20:248–254.

Douzery EJ, Huchon D. 2004. Rabbits, if anything, are likely Glires. Mol Phylogenet Evol. 33:922–935.

Edwards SV, Liu L, Pearl DK. 2007. High-resolution species trees without concatenation. Proc Natl Acad Sci USA. 104:5936–5941.

Felsenstein J. 2007. PHYLIP (phylogeny inference package). Distributed by the author. Seattle (WA): Department of Genome Sciences, University of Washington.

Gibson A, Gowri-Shankar V, Higgs PG, Rattray M. 2005. A comprehensive analysis of mammalian mitochondrial genome base composition and improved phylogenetic methods. Mol Biol Evol. 22:251–264.

Goodman M, Porter CA, Czelusniak J, Page SL, Schneider H, Shoshani J, Gunnell G, Groves CP. 1998. Toward a phylogenetic classification of primates based on DNA evidence complemented by fossil evidence. Mol Phylogenet Evol. 9:585–598.

Guindon S, Gascuel O. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 52:696–704.

Hallstrom BM, Kullberg M, Nilsson MA, Janke A. 2007. Phylogenomic data analyses provide evidence that Xenarthra and Afrotheria are sister groups. Mol Biol Evol. 24:2059–2068.

Hillis DM, Pollock DD, McGuire JA, Zwickl DJ. 2003. Is sparse taxon sampling a problem for phylogenetic inference? Syst Biol. 52:124–126.

Hu Y, Wang Y, Luo Z, Li C. 1997. A new symmetrodont mammal from China and its implications for mammalian evolution. Nature. 390:137–142.

Huelsenbeck JP. 1991. When are fossils better than extant taxa in phylogenetic analysis? Syst Zool. 40:458–469.

International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature. 409:860–921.

Janke A, Gemmell NJ, Feldmaier-Fuchs G, von Haeseler A, Paabo S. 1996. The mitochondrial genome of a monotreme–the platypus (Ornithorhynchus anatinus). J Mol Evol. 42:153–159.

Janke A, Xu X, Arnason U. 1997. The complete mitochondrial genome of the wallaroo (Macropus robustus) and the phylogenetic relationship among Monotremata, Marsupialia and Eutheria. Proc Natl Acad Sci USA. 94:1276–1281.

Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, Kent WJ. 2004. The UCSC table browser data retrieval tool. Nucleic Acids Res. 32:D493–D496.

Kearney M. 2002. Fragmentary taxa, missing data, and ambiguity: mistaken assumptions and conclusions. Syst Biol. 51:369–381.

Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. 2002. The human genome browser at UCSC. Genome Res. 12:996–1006.

Killian JK, Buckley TR, Stewart N, Munday BL, Jirtle RL. 2001. Marsupials and Eutherians reunited: genetic evidence for the Theria hypothesis of mammalian evolution. Mamm Genome. 12:513–517.

Kjer KM, Honeycutt RL. 2007. Site specific rates of mitochondrial genomes and the phylogeny of eutheria. BMC Evol Biol. 7:8.

Kluge AG, Wolf AJ. 1993. Cladistics: what's in a Name. Cladistics. 9:183–199.

Kriegs JO, Churakov G, Kiefmann M, Jordan U, Brosius J, Schmitz J. 2006. Retroposed elements as archives for the evolutionary history of placental mammals. PLoS Biol. 4:e91.

Lin YH, McLenachan PA, Gore AR, Phillips MJ, Ota R, Hendy MD, Penny D. 2002. Four new mitochondrial genomes and the increased stability of evolutionary trees of mammals from improved taxon sampling. Mol Biol Evol. 19:2060–2070.

Liu L, Pearl DK. 2007. Species trees from gene trees: reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Syst Biol. 56:504–514.

Madsen O, Scally M, Douady CJ, Kao DJ, DeBry RW, Adkins R, Amrine HM, Stanhope MJ, de Jong WW, Springer MS. 2001. Parallel adaptive radiations in two major clades of placental mammals. Nature. 409:610–614.

Malia MJ Jr, Adkins RM, Allard MW. 2002. Molecular support for Afrotheria and the polyphyly of Lipotyphla based on analyses of the growth hormone receptor gene. Mol Phylogenet Evol. 24:91–101.

Margulies EH, Chen CW, Green ED. 2006. Differences between Poux C, Douzery EJ. 2004. Primate phylogeny, evolutionary rate pair-wise and multi-sequence alignment methods affect variations, and divergence times: a contribution from the vertebrate genome comparisons. Trends Genet. 22:187–193. nuclear gene IRBP. Am J Phys Anthropol. 124:1–16. Misawa K, Janke A. 2003. Revisiting the Glires concept—phy- Ray DA, Xing J, Hedges DJ, et al. (13 co-authors). 2005. Alu logenetic analysis of nuclear sequences. Mol Phylogenet insertion loci and platyrrhine primate phylogeny. Mol Evol. 28:320–327. Phylogenet Evol. 35:117–126. Muller J, Muller K. 2004. TREEGRAPH: automated drawing of Rokas A, Carroll SB. 2005. More genes or more taxa? The complex tree figures using an extensible tree description relative contribution of gene number and taxon number to format. Molecular Ecology Notes. 4:786–788. phylogenetic accuracy. Mol Biol Evol. 22:1337–1344. Murphy WJ, Eizirik E, Johnson WE, Zhang YP, Ryder OA, Rosenberg MS, Kumar S. 2003. Taxon sampling, bioinformatics, O’Brien SJ. 2001. Molecular phylogenetics and the origins of and phylogenomics. Syst Biol. 52:119–124. placental mammals. Nature. 409:614–618. Scally M, Madsen O, Douady CJ, de Jong WW, Stanhope MJ, Murphy WJ, Eizirik E, O’Brien SJ, et al. (11 co-authors). 2001. Springer MS. 2002. Molecular evidence for the major clades Resolution of the early placental mammal radiation using of placental mammals. J Mammal Evol. 8:239–277. Bayesian phylogenetics. Science. 294:2348–2351. Schwartz S, Elnitski L, Li M, Weirauch M, Riemer C, Smit A, Murphy WJ, Pevzner PA, O’Brien SJ. 2004. Mammalian Green ED, Hardison RC, Miller W. 2003. MultiPipMaker and phylogenomics comes of age. Trends Genet. 20:631–639. supporting tools: alignments and analysis of multiple genomic Murphy WJ, Pringle TH, Crider TA, Springer MS, Miller W. DNA sequences. Nucleic Acids Res. 31:3518–3524. 2007. 
Using genomic data to unravel the root of the placental Shimamura M, Yasue H, Ohshima K, Abe H, Kato H, Kishiro T, mammal phylogeny. Genome Res. 17:413–421. Goto M, Munechika I, Okada N. 1997. Molecular evidence Nikolaev S, Montoya-Burgos JI, Margulies EH, Rougemont J, from retroposons that whales form a clade within even-toed Nyffeler B, Antonarakis SE. 2007. Early history of mammals ungulates. Nature. 388:666–670. is elucidated with the ENCODE multiple species sequencing Shimodaira H, Hasegawa M. 1999. Multiple comparisons of log- data. PLoS Genet. 3:e2. likelihoods with applications to phylogenetic inference. Mol Nishihara H, Hasegawa M, Okada N. 2006. Pegasoferae, an Biol Evol. 16:1114–1116. unexpected mammalian clade revealed by tracking ancient Shoshani J, McKenna MC. 1998. Higher taxonomic relationships retroposon insertions. Proc Natl Acad Sci USA. 103: among extant mammals based on morphology, with selected 9929–9934. comparisons of results from molecular data. Mol Phylogenet Nishihara H, Okada N, Hasegawa M. 2007. Rooting the Evol. 9:572–584. eutherian tree: the power and pitfalls of phylogenomics. Siepel A, Bejerano G, Pedersen JS, et al. (16 co-authors). 2005. Genome Biol. 8:R199. Evolutionarily conserved elements in vertebrate, insect, Nishihara H, Satta Y, Nikaido M, Thewissen JG, Stanhope MJ, worm, and yeast genomes. Genome Res. 15:1034–1050. Okada N. 2005. A retroposon analysis of Afrotherian Springer MS, de Jong WW. 2001. Phylogenetics. Which phylogeny. Mol Biol Evol. 22:1823–1833. mammalian supertree to bark up? Science. 291:1709–1711. Nylander JAA. 2004. MrModeltest. Program distributed by the Springer MS, Murphy WJ, Eizirik E, O’Brien SJ. 2003. Placental author. Uppsala (Sweden): Evolutionary Biology Centre, mammal diversification and the cretaceous-tertiary boundary. Uppsala University. Proc Natl Acad Sci USA. 100:1056–1061. Opazo JC, Wildman DE, Prychitko T, Johnson RM, Springer MS, Stanhope MJ, Madsen O, de Jong WW. 2004. Goodman M. 2006. 
Phylogenetic relationships and diver- Molecules consolidate the placental mammal tree. Trends gence times among New World monkeys (Platyrrhini, Ecol Evol. 19:430–438. Primates). Mol Phylogenet Evol. 40:274–280. Stamatakis A. 2006. RAxML-VI-HPC: maximum likelihood- Pecon-Slattery J, Pearks Wilkerson AJ, Murphy WJ, O’Brien SJ. based phylogenetic analyses with thousands of taxa and 2004. Phylogenetic assessment of introns and SINEs within mixed models. Bioinformatics. 22:2688–2690. the Y chromosome using the cat family felidae as a species Stanhope MJ, Waddell VG, Madsen O, de Jong W, Hedges SB, tree. Mol Biol Evol. 21:2299–2309. Cleven GC, Kao D, Springer MS. 1998. Molecular evidence Philippe H, Delsuc F, Brinkmann H, Lartillot N. 2005. for multiple origins of Insectivora and for a new order of Phylogenomics. Annu Rev Ecol Evol. 36:541–562. endemic African insectivore mammals. Proc Natl Acad Sci Philippe H, Snell EA, Bapteste E, Lopez P, Holland PW, USA. 95:9967–9972. Casane D. 2004. Phylogenomics of eukaryotes: impact of Sullivan J, Swofford DL. 2004. Are guinea pigs rodents? The missing data on large alignments. Mol Biol Evol. 21: importance of adequate models in molecular phylogenetics. J 1740–1752. Mammal Evol. 4:77–86. Phillips MJ, Delsuc F, Penny D. 2004. Genome-scale phylogeny Swofford DL. 2003. PAUP*: Phylogenetic analyisis using and the detection of systematic biases. Mol Biol Evol. parsimony (*and other methods). Sunderland (MA): Sinauer 21:1455–1458. Associates. Phillips MJ, Lin YH, Harrison GL, Penny D. 2001. Mitochon- Thomas JW, Prasad AB, Summers TJ, Lee-Lin SQ, Maduro VV, drial genomes of a bandicoot and a brushtail possum confirm Idol JR, Ryan JF, Thomas PJ, McDowell JC, Green ED. 2002. the monophyly of australidelphian marsupials. Proc Biol Sci. Parallel construction of orthologous sequence-ready 268:1533–1538. clone contig maps in multiple species. Genome Res. 12: Phillips MJ, Penny D. 2003. The root of the mammalian tree 1277–1285. 
inferred from whole mitochondrial genomes. Mol Phylogenet Thomas JW, Touchman JW, Blakesley RW, et al. (71 co- Evol. 28:171–185. authors). 2003. Comparative analyses of multi-species Pollard DA, Moses AM, Iyer VN, Eisen MB. 2006. Detecting the sequences from targeted genomic regions. Nature. 424: limits of regulatory element conservation and divergence 788–793. estimation using pairwise and multiple alignments. BMC van de Lagemaat LN, Gagnier L, Medstrand P, Mager DL. 2005. Bioinformatics. 7:376. Genomic deletions and precise removal of transposable Posada D, Crandall KA. 1998. MODELTEST: testing the model elements mediated by short identical DNA segments in of DNA substitution. Bioinformatics. 14:817–818. primates. Genome Res. 15:1243–1249. 204 1808 Prasad et al. van Rheede T, Bastiaans T, Boone DN, Hedges SB, de tide, amino acid, and codon models. Mol Phylogenet Evol. Jong WW, Madsen O. 2006. The platypus is in its place: 28:197–224. nuclear genes and indels confirm the sister group relation of Waters PD, Dobigny G, Waddell PJ, Robinson TJ. 2007. monotremes and Therians. Mol Biol Evol. 23:587–597. Evolutionary history of LINE-1 in the major clades of Waddell PJ, Cao Y, Hauf J, Hasegawa M. 1999. Using novel placental mammals. PLoS ONE. 2:e158. phylogenetic methods to evaluate mammalian mtDNA, Wiens J. 2005. Can incomplete taxa rescue phylogenetic analyses including amino acid-invariant sites-LogDet plus site strip- from long-branch attraction? Syst Biol. 54:731–742. ping, to detect internal conflicts in the data, with special Wildman DE, Uddin M, Opazo JC, Liu G, Lefort V, Guindon S, reference to the positions of hedgehog, armadillo, and Gascuel O, Grossman LI, Romero R, Goodman M. 2007. elephant. Syst Biol. 48:31–53. Genomics, biogeography, and the diversification of Waddell PJ, Kishino H, Ota R. 2001. A phylogenetic foundation for placental mammals. Proc Natl Acad Sci USA. 104: comparative mammalian genomics. Genome Inform. 12:141–154. 14395–14400. 
Waddell PJ, Kishino H, Ota R. 2002. Very fast algorithms for Yang Z. 2007. Fair-balance paradox, star-tree paradox, and evaluating the stability of ML and Bayesian phylogenetic Bayesian phylogenetics. Mol Biol Evol. 24:1639–1655. trees from sequence data. Genome Inform. 13:82–92. Yu L, Zhang YP. 2005. Evolutionary implications of multiple Waddell PJ, Okada N, Hasegawa M. 1999. Towards resolving SINE insertions in an intronic region from diverse mammals. the interordinal relationships of placental mammals. Syst Biol. Mamm Genome. 16:651–660. 48:1–5. Waddell PJ, Shelley S. 2003. Evaluating placental inter-ordinal Scott Edwards, Associate Editor phylogenies with novel sequences including RAG1, gamma- fibrinogen, ND6, and mt-tRNA, plus MCMC-driven nucleo- Accepted April 7, 2008
