PRIMATE GENE AND GENOME EVOLUTION DRIVEN BY SEGMENTAL

DUPLICATION ON CHROMOSOME 16.

by

MATTHEW ERIC JOHNSON

Submitted in partial fulfillment of the requirements

for the degree of Doctor of Philosophy

Dissertation Advisor: Dr. Evan E. Eichler

Department of Genetics

CASE WESTERN RESERVE UNIVERSITY

January, 2008


Copyright © 2007 by Matthew Eric Johnson

All rights reserved

CASE WESTERN RESERVE UNIVERSITY

SCHOOL OF GRADUATE STUDIES

We hereby approve the thesis/dissertation of

_Matthew Eric Johnson______candidate for the __Ph.D.______degree *.

(signed)__Peter Harte______(chair of the committee)

__Anne Mathews______

__Helen Salz ______

______Mitch Drumm

__Evan Eichler ______

______

(date) __8/10/07______

*We also certify that written approval has been obtained for any proprietary material contained therein.

To David and Audrey Johnson, without whose love and support none of my achievements would be possible. Also to Paul and Mark Johnson, without whose love and support throughout this process I would have lost my sanity.

TABLE OF CONTENTS

TABLE OF CONTENTS………………………………………………………….....v

LIST OF TABLES…………………………………………………………...... ix

LIST OF FIGURES…………………………………………………………………..x

ACKNOWLEDGEMENTS……………………………………………...... xii

Abstract……………………………………………………………………………..xiv

Chapter 1: Introduction and Objectives...... 1

THE REPEAT ARCHITECTURE OF THE HUMAN GENOME………………….3

Common Repeats: SINEs, LINEs, and ERVs…………………………………….3

SEGMENTAL DUPLICATIONS…………………………………………………....7

What Distinguishes a Segmental Duplication?...... 7

Deciphering Segmental Duplication in the Genome…………………………….10

Low Copy Repeats on Chromosome 16 (LCR16)………………………………12

DUPLICATION AND EVOLUTION OF GENOMES…………………………….12

Whole Genome Duplication Influencing Vertebrate Genomes…………………13

Tandem Duplication: Genome and Gene Evolution…………………………….16

SEGMENTAL DUPLICATION: GENOME AND GENE EVOLUTION………...17

Segmental Duplication and Human Disease…………………………………….18

Segmental Duplication and Gene Evolution…………………………………….20

NATURAL SELECTION OPERATING UPON DUPLICATED ………..23

Methodology of Testing for Neutral, Negative, and Positive Selection………...24

Population Genetic Models Acting upon Duplicated Genes……………………25

PRIMATE ………………………………………………27

RESEARCH OBJECTIVES………………………………………………………..30

Chapter 2: Positive Selection of a Gene Family During the Emergence of Humans and

African Apes……………………...... 31

ABSTRACT………………………………………………………………………...32

INTRODUCTION, RESULTS, AND DISCUSSION…………………………….33

Gene Analysis and Genomic Organization of LCR16a…………………………33

Comparative Analysis…………………………………………………………...38

Duplication Timing……………………………………………………………...38

Positive Selection………………………………………………………………..41

Gene Function…………………………………………………………………...48

Conclusions……………………………………………………………………...49

METHODS………………………………………………………………………….50

Human Genome Analysis………………………………………………………..50

Fluorescence in situ Hybridization……………………………………………....51

Library Hybridization and Sequencing………………………………………….51

Sequence Analysis……………………………………………………………….52

ACKNOWLEDGEMENTS…………………………………………………………54

APPENDIX…………………………………………………………………………55

Supplementary Figures…………………………………………………………..55

Supplementary Tables…………………………………………………………...57

Chapter 3: Recurrent duplication-driven transposition of DNA during hominoid evolution……………………………………………………………………………...59

ABSTRACT………………………………………………………………………...60

INTRODUCTION…………………………………………………………………..61

RESULTS…………………………………………………………………………...62

Human LCR16 Genome Organization…………………………………………..62

Single Copy Architecture of Old World Monkey Loci………………………….65

Duplication of Great Ape LCR16 Blocks into New Locations………………….67

Sequence Structure of New Insertions and Lineage-Specific Duplications……..73

Recurrent and Independent Duplications of LCR16a…………………………....81

Junction Analysis………………………………………………………………...83

DISCUSSION……………………………………………………………………….86

Core Duplicon-Flanking Transposition Model…………………………………..88

METHODS………………………………………………………………………….90

Genomic library hybridization and BAC end-sequencing……………………….90

BAC Sequencing………………………………………………………………....90

Sequence Annotation…………………………………………………………….90

PCR Breakpoint Analysis………………………………………………………..91

Phylogenetic Analysis…………………………………………………………...91

ACKNOWLEDGEMENTS…………………………………………………………92

Chapter 4: Dynamic changes in the structure, expression and selection of the morpheus gene family during primate evolution…………………………………..93

ABSTRACT………………………………………………………………………...94

INTRODUCTION…………………………………………………………………..95

RESULTS…………………………………………………………………………...96

Gene Models and Positive Selection…………………………………………….96

Expression and Gene Structure Analysis……………………………………….102

DISCUSSION……………………………………………………………………...110

METHODS………………………………………………………………………...114

Phylogenetic and Evolutionary Genetics analysis……………………………...114

Tissue Samples…………………………………………………………………115

Expression Analysis……………………………………………………………116

Genomic Clone and Sequence Analysis……………………………………….116

ACKNOWLEDGEMENTS……………………………………………………….116

Chapter 5: Discussion and Future Directions……………………………………118

SUMMARY AND DISCUSSION………………………………………………...119

The Recurrent Duplicative Nature of LCR16a………………………………...119

Stepwise Accumulation of LCR16 Duplications around LCR16a Duplication..120

Coordinated Deletion and Duplication…………………………………………123

A Model of LCR16a-Mediated Duplicative Transposition…………………….124

BIRTH OF A GENE FAMILY (MORPHEUS) IN ………………….126

Darwinian Selection Acting upon Morpheus Exons…………………………...126

Timing of the Positive Selection Event in Relation to the Morpheus Gene

Family………………………………………………………………………………..128

Comparative Expression of the Morpheus Gene Family in Primates…………..129

FUTURE DIRECTIONS…………………………………………………………..132

Ascertaining Function and Phenotype for the Morpheus Gene Family………..133

Human Copy-Number and of LCR16a…………………..134

LCR16a as a Model for the expansion of other Human Segmental

Duplications?…...... 136

Coordinated Deletion of Pre-integration Sites: A common feature of primate

SDs?...... 137

CONCLUSION…………………………………………………………………....138

BIBLIOGRAPHY………………………………………………………………….139

LIST OF TABLES

Table 2-1: Average pairwise distance (K) of intron sequence……………………….39

Table 2-2: Positive selection of Exon 2……………………………………………....44

Table 2-3: Positive selection of Exon 4………………………………………………47

Supplemental Table 2-1: Genetic distance (K = Jukes-Cantor) and size of LCR16a genomic duplications………………………………………………………………….57

Supplemental Table 2-2: Duplication structure of Chromosome 16 Loci…………..57

Supplemental Table 2-3: Clones and Deposited Accessions………………………..58

Table 3-1: LCR16 Segmental Duplications in Human Genome……………………..63

Table 3-2: Sequencing Table of Primate LCR16 Loci……………………………….64

Table 3-3: LCR16 Copy Number among Primates…………………………………..65

Table 3-4: Co-occurrence of Human LCR16 Probes………………………………...66

Table 3-5: Putative Ancestral LCR16 Loci…………………………………………..67

Table 3-6: Primate Segmental Duplication BAC Sequencing Summary…………….68

Table 3-7: Novel Segmental Duplications found in Association with LCR16………76

Table 3-8: Co-occurrence of Orangutan LCR16 Probes……………………………..79

Table 3-9: Repeat Content at Novel Segmental Duplication Junctions……………...85

Table 3-10: Composition of Preintegration Sites based on Human Reference Genome

(hg16)………………………………………………………………………………....86

Table 4-1: Positive selection of Exons 1-7………………………………………….100

Table 4-2: Complete mRNA Exons 1-7 Testing Positive Selection on orthologs…..102

LIST OF FIGURES

Figure 1-1: Models of Alu-mediated Duplication Events…………………………….5

Figure 1-2: Model of a Segmental Duplication in Humans…………………………...9

Figure 1-3: Model of Whole-Genome Duplication…………………………………..14

Figure 1-4: The Mechanism behind Non-allelic Homologous Recombination……...20

Figure 1-5: Darwinian Selection Acting upon a Duplicated Gene…………………..25

Figure 1-6: Primate Phylogenetic ………………………………………………28

Figure 2-1: Sequence Properties of the LCR16a Duplication………………….….....34

Fig. 2-2: Genomic Organization of LCR16a Loci…………………………………....35

Fig. 2-3: Alignment…………………………………………………………..37

Fig. 2-4: Comparative FISH analysis among Primates……………………………….40

Fig. 2-5: Phylogeny of coding and non-coding portions of the LCR16a duplication..42

Fig. 2-6: Neighbour-Joining Evolutionary Tree for 146 bp of coding DNA from exon

4……………………………………………………………………………………….46

Fig. 2-7: RT-PCR Analysis of LCR16a Gene Family………………………………..49

Supplement Fig. 2-1: In situ Localization…………………………………………..55

Supplement Fig. 2-2: Multiple Sequence Alignment………………………………..56

Fig. 3-1: LCR16 Organization in Human and Baboon……………………………....63

Fig. 3-2: Map locations of primate LCR16a………………………………………….69

Fig. 3-3: Sequence alignment between human and non-human primate LCR16 Loci..74

Fig. 3-4: The duplication architecture of primate loci………………………………...77

Fig. 3-5a, b: Sequence alignment of LCR16a loci……………………………………78

Fig. 3-6: Copy-number and sequence divergence flanking LCR16a in a) orangutan and b) human………………………………………………………………………………80

Fig. 3-7: More diverged LCR16a loci on 11, 10 & X…………………82

Fig. 3-8: Breakpoint resolution of a novel chimpanzee insertion…………………….84

Fig. 4-1: Morpheus Gene Structure…………………………………………………..97

Fig. 4-2: Positive Selection………………………………………………………….100

Fig. 4-3: RT-PCR Analysis of LCR16a Gene Family in Primate Tissues…………..103

Fig. 4-4: Cross- Morpheus Gene Structures………………………………....107

Fig. 4-5: Human and Old World monkey ORF and amino-acid alignment…………………………………………………………………………….109

Fig. 4-6: Neighbour-Joining Evolutionary Tree for 99 bp of coding DNA from exon

3……………………………………………………………………………………...112

Fig. 4-7: Model of Morpheus Evolution…………………………………………….114

Fig. 5-1: Step-wise model of duplication accretion around LCR16a……………….122

ACKNOWLEDGEMENTS

There are many people who helped shape my path toward attaining a doctoral degree in genetics and who need to be acknowledged. In many ways, it would not have been possible for me to attain this degree without their support and nurturing of my scientific exploration, and without their help in keeping me sane through all the highs and lows that I have experienced throughout my graduate work. First and foremost, I must thank my parents, David and Audrey Johnson, for encouraging my curiosity about the world around me at a very early age and for helping to foster a true love of scientific discovery. They have also been my rock through the highs and the lows of graduate studies.

Without Evan Eichler, who has been mentor, wise man, friend, professor, and Ph.D. adviser throughout my tenure as a graduate student in his lab, it would have been virtually impossible for me to reach this point. It has truly been a pleasure and a joy to work with such a brilliantly driven scientist. Evan has many good traits that make him a great mentor, but the one that really stands out is his love and enthusiasm for science and his ability to impart that to his students in the lab.

I would like to thank all of the many committee members who have served on my committee over the years: Dr. Mitch Drumm, Dr. Evan Eichler, Dr. Peter Harte, Dr. Terry Hassold, Dr. Bruce Latimer, Dr. Anne Matthews, and Dr. Stuart Schwartz. In their own way, all these great minds of science have helped to mould a young man with a love of science into a critical thinker and problem-solver who has never lost his child-like love and zeal for science, and for that I thank them from the bottom of my heart.

The other students of the Case Western Reserve Department of Genetics have kept me going through the years by just being there to listen to my rants, no matter what they were about. The Eichler lab has been the glue and nails that have held this ship together over the vast ocean on which I have had to travel, particularly Jeff, Amy, Julie, Devin, James, Andy, Anne, Heather, and Zhaoshi, who provided research assistance and mental health assistance through the long voyage of writing this dissertation. I would like to give special thanks to Rocky, because he was always there to talk to about anything and was a true friend.

I would like to thank my two brothers, Mark and Paul, for all their love and support throughout my graduate studies. I also want to thank Dr. Dan Krane for all his support and guidance during and after my undergraduate work in his lab, which first sparked my love of bench work. Finally, I would like to take a moment to thank all of my science teachers through my years of schooling for shaping an individual with the skills and passion for science necessary to obtain this doctorate.

Primate Gene and Genome Evolution Driven by Segmental

Duplication on Chromosome 16.

Abstract

by

MATTHEW ERIC JOHNSON

A low copy repeat on chromosome 16 (LCR16a) was chosen to study the

evolutionary roles of segmental duplications because the duplications were recent in

human evolution and contained a full-length gene. To more fully understand the

evolutionary significance of these events and gain insight into the mechanism underlying

these changes, I have performed a detailed comparative sequencing project targeting

LCR16a copies within large-insert BACs from six species: chimpanzee, gorilla,

orangutan, gibbon, macaque, and baboon. Seventy-one LCR16a copies from primates

and twenty-four LCR16a copies from humans (~20Mb) were analyzed for genomic and

gene evolution, revealing four important properties. First, a novel hominoid gene family

(morpheus) contained within the LCR16a duplication has shown an extraordinary degree

of amino acid divergence between humans and great apes (20%) when compared to the

background nucleotide substitution rate of 2%–3%. In addition, five out of seven

exons have shown evidence of rapid positive selection, as determined by Ka/Ks ratios.

Second, LCR16a is not a lone duplication block: many of the LCR16a duplicons are associated with other LCR16 duplications as part of larger and more complex duplications, which are variable in duplicon type and number, in many different locations in primates. Phylogenetic analysis of LCR16a demonstrates a stepwise patterning of duplication centered on the LCR16a duplicon. Third, large-scale sequencing of great ape

(chimpanzee, gorilla, and orangutan) BACs has revealed new (species-specific)

insertions of LCR16a that do not correspond to LCR16a sites in the human genome assembly. Analysis of the sequence from these new insertion events suggests a model in which flanking repeats (SINEs) act as donors that mediate the transposition of LCR16a, along with the coordinated deletion of large (average 19.5 kb) tracts of

repeat-rich sequences at the acceptor sites. Fourth, RT-PCR expression analysis of human

and primate tissues has shown a pattern suggesting a model in which a

single copy of morpheus is expressed tissue-specifically in the testis of baboons while

being ubiquitously expressed in all great apes. These data provide a model for the birth of

a novel gene family during human evolution.


Chapter 1

Introduction and Objectives

From Darwin’s observations to Mendel’s discovery of the laws of heritable traits, these findings have provided the fundamental drive in biology to

understand how natural selection functions not only on the gross phenotypic level but on

the level of DNA and, more broadly, an organism’s genome. This understanding of

genetics has developed over the last century of scientific research in stages, from

understanding chromosomes as the cellular mechanism for heredity to the molecular level

of heredity in the form of the double-helix DNA molecule. This ultimately led to the

understanding of how genetic information is encoded and translated into

proteins. These fundamental findings have driven researchers ultimately to try to

decipher how entire genomes function, leading to the Human Genome Sequencing Project

(IHGSC 2001). The completion of the human genome sequence in 2001 and the ongoing

comparative genome sequencing of other primates has led to the generation of the

sequence data necessary to test fundamental questions about genome function and

evolution as well as gene evolution at the sequence level, and to tease out the fundamental

properties that govern the evolutionary instructions that build a human/primate, in ways

that were simply not possible in the past.

This dissertation encompasses the search for an understanding of how primate

genomes function and evolve over time by using a model segmental duplication

(LCR16a) to investigate the mechanism of genomic duplication, understand the

evolutionary impact (new insertions, deletions, and inversions) of this duplication on the

highly duplicated chromosome 16 during primate evolution, and assess the impact on primate gene

evolution over the last ~35 million years (my). The main goals of this dissertation are to

study the mechanisms by which segmental duplication occurs in primate genomes and the

effects of these events on genome and gene evolution.

THE REPEAT ARCHITECTURE OF THE HUMAN GENOME

Common Repeats: SINEs, LINEs, and ERVs

Repetitive DNA was first identified by Waring and Britten (1966) and was later verified by a series of renaturation rate experiments, which indicated that the repeat content of an organism’s genome is roughly equivalent to its genetic complexity (Britten and Kohne 1968; Rowold and Herrera 2000). These common repeats became known as junk DNA because most of them had no known beneficial function in the human genome. This junk DNA makes up ~45% of the human genome sequence, whereas coding regions make up a much smaller fraction: ~3% (Ohno 1972; IHGSC 2001). Common repeats can be broken down into three major groups: short interspersed elements (SINEs), long interspersed elements (LINEs), and endogenous retroviruses (ERVs). DNA elements and unclassified repeats round out the total makeup of common repeats in the human genome. Of the ~45% of the human genome consisting of common repeats, these three classes, SINEs (~13%), LINEs (~20%), and ERVs (~8%), together account for ~41% of the genome, or a remarkable ~1,144 Mb of genomic sequence (IHGSC 2001; Batzer and Deininger 2002; Mayer and Meese 2005).
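As a quick check of the arithmetic in this paragraph, the rounded percentages quoted above can be summed directly. The sketch below uses only the approximate values given in the text; the variable names are illustrative and are not taken from any cited source.

```python
# Approximate genome fractions (percent) as quoted in the text (IHGSC 2001).
repeat_pct = {"SINEs": 13, "LINEs": 20, "ERVs": 8}
common_repeat_pct = 45  # all common repeats combined, as quoted in the text

# Combined contribution of the three major repeat classes to the genome.
three_class_pct = sum(repeat_pct.values())

# Fraction of all common-repeat sequence explained by these three classes.
share_of_common = three_class_pct / common_repeat_pct

print(f"SINEs + LINEs + ERVs: ~{three_class_pct}% of the genome")
print(f"Share of all common-repeat sequence: ~{share_of_common:.0%}")
```

With these rounded inputs, the three classes sum to ~41% of the genome, which is roughly nine-tenths of the ~45% attributed to common repeats overall.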

Calling common repeats junk DNA reflects a very gene-centric view, in which ~45% of the genome is discarded for lack of understanding of its functional role. More recently, common repeats have been shown to play very diverse roles in gene and genome evolution. Alus are a perfect example of the functional impact of common repeats. Alus, which are members of the SINE family of repeats, are GC-rich dimeric sequences approximately 300 bp in size (Houck, Rinehart et al. 1979; Rowold and Herrera 2000). With 1.558 × 10^6 Alu elements interspersed at a frequency of one every 5 kb throughout the human genome, they affect the composition, organization, expression, gene function, GC content, and plasticity (duplications and deletions changing the sequence content) of the rest of the genome (Rinehart, Ritch et al. 1981; Batzer, Kilroy et al. 1990; Deininger and Batzer 1999; Rowold and Herrera 2000; Bailey, Yavor et al. 2001; IHGSC 2001). Because Alu repeats are the largest repeat family in the human genome and younger Alu repeats have very low pairwise divergence (<1%), they create many sites of near-perfect identity throughout the human genome at which non-allelic homologous recombination can occur (Batzer and Deininger 2002; Bailey and Eichler 2003). This observation has led to the theory that the expansion of primate Alu repeats (~35–40 mya) sensitized the ancestral human/primate genomes to Alu-mediated non-homologous recombination, creating the ideal environment to initiate the propagation of gene-rich segmental duplications in primate genomes; after the first segmental duplication event, the duplicated sequence would be predisposed to further duplication events (Bailey, Liu et al. 2003) (Fig 1-1).

LINEs make up ~20% of the genome and are categorized into major groups of

LINE1, LINE2, and LINE3, with LINE1 (L1) composing the majority of the repeats

(~17%) in the human genome (IHGSC 2001; Ostertag and Kazazian Jr 2001). Although most of the L1s in the human genome are inactive, some of these elements remain capable of retrotransposition, therefore allowing these elements to still be active in genome

evolution (Ostertag and Kazazian Jr 2001; Khan, Smit et al. 2006).

[Figure 1-1 panel labels: Alu transposition expansion; non-homologous recombination; misalignment; tandem duplication; interspersed duplication; deletion.]

Figure 1-1: Models of Alu-mediated Duplication Events. The initial seeding event was the Alu transposition expansion in primates. These new Alu repeats generated many sites of identical sequence in the primate genome, creating many targets for non-allelic homologous recombination. This can yield tandem duplication, through misalignment of chromosomes, or interspersed duplication, through DNA fragments being inserted at Alu repeats. The black dots and ovals represent chromosomes, and red bars are Alu repeats. Letters represent unique sequence and genes. Black and purple letters distinguish sequence blocks from different DNA pieces or chromosomes. The black crosses represent sites of homologous recombination (Bailey and Eichler 2003).

Intact full-length L1 elements are 6 kb long and are composed of two open reading frames (ORFs) that encode a protein with endonuclease and reverse transcriptase activity (Khan, Smit et al. 2006). L1s have helped shape primate and other mammalian genomes through many diverse mechanisms. Principally, L1s shape the genome by greatly expanding its size, contributing 558.8 Mb of new sequence through their own retrotransposition, and through the unintended consequence of aiding Alu expansion. The Alu usurpation of the L1 endonuclease and reverse transcriptase provides the enzymatic machinery for Alu proliferation in primates (IHGSC 2001; Dewannieux and Heidmann 2005). L1s have also moved flanking sequences around the genome by a process termed transduction. They are also thought to have influenced gene expression by inactivating genes, as in chromosome X inactivation, and by inserting into genes, causing expression changes and coding changes (Bailey, Carrel et al. 2000; Ostertag and Kazazian Jr 2001; Khan, Smit et al. 2006).

ERVs were first discovered in the late 1960s and early 1970s, and three types of repeat elements were ascertained based on similarity to known retroviruses: avian leukosis virus, murine leukemia virus, and murine mammary tumor virus (Weiss 2006). Given their initial discovery, it is startling that ERVs contribute 227 Mb, or ~8%, of the human genome (de Parseval and Heidmann 2005; Mayer and Meese 2005; Yohn, Jiang et al. 2005). In addition, not all primates share all types of ERVs: different sub-classes of ERVs have been fixed within some primate genomes but not others, which suggests bursts of infection and fixation within primate genomes at different time points. Two examples of these primate ERVs are human endogenous retroviruses (HERVs) and Pan troglodytes endogenous retrovirus 1 (PTERV1) (IHGSC 2001; de Parseval and Heidmann 2005; Yohn, Jiang et al. 2005).

There are five major domains that constitute a full-length autonomous ERV: gag (encoding the viral core protein gene), pol (encoding the reverse transcriptase, RNAseH, integrase), env (encoding the virus’s membrane protein gene), and two flanking long terminal repeats (LTRs), with the length of the repeat between 6 and 11 kb (IHGSC 2001; de Parseval and Heidmann 2005). A non-autonomous ERV contains just the gag region and the two LTR elements, with the length of the repeat between 1 and 3 kb (IHGSC 2001; de Parseval and Heidmann 2005). HERVs and PTERV1 have been involved in genome and gene evolution through several main mechanisms. The first is by serving as a substrate for new gene function: viral genes have been co-opted and have acquired new functions in host primate genomes. Examples of this co-option of viral genes as functional genes in the human and primate genomes are the four env genes (X, Y, Z, A), the HERV-W Env protein, which is involved in intercellular fusion, and the Fv-4 MLV env protein, which is involved in infection prevention by receptor interference (Rote, Chakrabarti et al. 2004; de Parseval and Heidmann 2005). The second mechanism is the accumulation of highly identical (~97–99.9%) LTR elements interspersed throughout the human and primate genomes, which can be substrates for genomic rearrangements, including the deletion of sequences located between two homologous ERVs on the same chromosome and tandem duplication (de Parseval and Heidmann 2005; Yohn, Jiang et al. 2005). The third way that ERVs cause genome evolution is through retrotransposition and insertion into a new genomic context, which can affect adjacent genes by disrupting gene function or by altering the regulation of nearby genes (de Parseval and Heidmann 2005; Mayer and Meese 2005; Yohn, Jiang et al. 2005). If Charles Darwin were to visit the present day, he would be shocked and fascinated to see how much of the evolution of primate and mammalian genomes has been caused by common repeats and especially by the horizontal transfer of sequences being fixed into the genome.

SEGMENTAL DUPLICATIONS

What Distinguishes a Segmental Duplication?

Segmental duplications, also referred to as low copy repeats (LCRs), are inherently repetitive in nature, as multiple copies are found in a haploid genome, but they are not considered common repeats because of the standard sequence properties of these LCRs. The sheer size of segmental duplications sets them apart from common repeats, with segmental duplications ranging in size from 1 to >200 kb. Unlike common repeats, segmental duplications are derived from unique sequences (Bailey, Yavor et al. 2001; Bailey, Gu et al. 2002). In terms of sequence content, segmental duplications can contain a variety of common repeats and the normal constituents of genes: exons, introns, promoters, and enhancers. Unlike common repeats, segmental duplications share no defining characteristics (Stallings, Doggett et al. 1992; Eichler, Lu et al. 1996; Lupski 1998; Loftus, Kim et al. 1999; Horvath, Schwartz et al. 2000; Ji, Eichler et al. 2000; Shaikh, Kurahashi et al. 2001; Samonte and Eichler 2002; Scherer, Cheung et al. 2003) (Fig 1-2).

LCRs can be organized in tandem or in an interspersed configuration and be distributed

either between (interchromosomal) or within a chromosome (intrachromosomal). In

humans most intrachromosomal duplications are interspersed, while interchromosomal

duplications are segmental duplications spread across multiple non-homologous

chromosomes in primate genomes (Horvath, Schwartz et al. 2000; Samonte and Eichler

2002). Common repeats range from thousands to millions of copies in the human

genome, such as Alus, with ~1,558,000 copies in the human genome, whereas segmental

duplications rarely exceed ~50 copies (PIR4 and chAB4 being some of the most abundant

segmental duplications in the human genome (Wohr, Fink et al. 1996; IHGSC 2001;

Horvath, Gulden et al. 2003)).
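The categories described above (tandem versus interspersed, intrachromosomal versus interchromosomal) can be illustrated with a minimal sketch. It assumes each duplicated copy is described by a chromosome name and coordinates. The `Locus` type, the `classify_pair` function, and the `max_tandem_gap` threshold are hypothetical names introduced here for illustration. The rule that "tandem" means directly adjacent copies on one chromosome is a simplifying assumption, not a definition taken from the cited studies.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Locus:
    chrom: str   # chromosome name, e.g. "chr16"
    start: int   # start coordinate in bp
    end: int     # end coordinate in bp (exclusive)


def classify_pair(a: Locus, b: Locus, max_tandem_gap: int = 0) -> Tuple[str, str]:
    """Classify two copies of a duplication by chromosomal scope and arrangement.

    max_tandem_gap is a hypothetical threshold (bp): copies on the same
    chromosome separated by no more than this gap are called tandem.
    """
    if a.chrom != b.chrom:
        # Copies on different chromosomes cannot be adjacent,
        # so they are treated as interspersed here.
        return ("interchromosomal", "interspersed")
    first, second = sorted((a, b), key=lambda locus: locus.start)
    gap = second.start - first.end
    arrangement = "tandem" if gap <= max_tandem_gap else "interspersed"
    return ("intrachromosomal", arrangement)


# Two copies far apart on chromosome 16, and one copy on chromosome 18.
copy_a = Locus("chr16", 1_000_000, 1_020_000)
copy_b = Locus("chr16", 5_000_000, 5_020_000)
copy_c = Locus("chr18", 2_000_000, 2_020_000)
print(classify_pair(copy_a, copy_b))
print(classify_pair(copy_a, copy_c))
```

In practice the gap threshold would be chosen by the investigator; the coordinates shown are invented for the example.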


Figure 1-2: Model of a Segmental Duplication in Humans. The expanded red block of sequences represents a hypothetical euchromatic piece of chromosome 16. The blue blocks represent exons of a five-exon gene. The arrow represents the transcriptional start site of the gene, and the cross represents the stop codon position of the gene. The green block represents an L1 element, and the purple blocks represent Alu repeats. The brown block represents a HERV element. The red blocks represent the euchromatic sequence block and illustrate the two types of duplication propagation: either intrachromosomal (yellow human chromosome 16) or interchromosomal (light-green human chromosome 18).

Deciphering Segmental Duplication in the Genome

One of the biggest obstacles to sequencing and assembling any genome, and in particular the human genome, is the presence of segmental duplications (Eichler 1998; Bailey, Yavor et al. 2001; Bailey, Yavor et al. 2002). Segmental duplications of >1 kb compose ~5% of the human genome, whereas the genomes of other model organisms (fly and worm) contain less segmental duplication of >1 kb (~1% and ~4%, respectively). The most significant difference is that large duplications (>10 kb) make up ~2.5% of the human genome, compared with ~0.1% of the fly genome and ~0.7% of the worm genome (IHGSC 2001). These large duplications of high sequence identity have presented unique challenges to the sequencing and assembly of the human genome that were not encountered in other organisms (Eichler 1998; Bailey, Yavor et al. 2001).

There are two divergent techniques that can be used to analyze the human genome or any organism’s genome for segmental duplications: one that computationally analyzes a genome sequence assembly and another that uses an experimental approach to analyze duplications on a genome-wide scale.

To mine the vast amount of human sequence data generated by the human genome sequencing project, specialized computational programs and algorithms had to be designed to search for regions of high percent identity (95%–99%) to other loci in the genome (Eichler 1998). To analyze segmental duplications at the global/genome level, methodologies had to be designed to overcome the noise of common high copy repeats, which impede the characterization of segmental duplications. One such method, whole-genome assembly comparison (WGAC), simplifies the genome by removing all the common repeats from the human genome sequence, creating a compact (repeat-free) genome. Subsequently, high sequence-identity pairwise alignments from the repeat-free genome were used to show that ~5% of the human genome is composed of >1 kb segmental duplications that have 90–100% sequence identity (Bailey, Yavor et al. 2001). This initial approach assumed that the genome and the assembly are without error. Other computational approaches were later developed to detect duplications in incomplete genome assemblies by measuring the depth of coverage of whole-genome shotgun sequence data aligned to the reference sequence. This led to the development of the first accurate duplication maps of the genome (Bailey, Gu et al. 2002). However, because of their duplicated nature, the locations of collapsed duplications required further experimental analyses.
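The two computational strategies described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published implementation. The `Alignment` type and both function names are hypothetical. The length and identity cutoffs follow the ~1 kb and 90% values quoted in the text, with the exact boundary handling assumed. The depth cutoff of three standard deviations above the mean of unique sequence is an assumed parameter.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Sequence


@dataclass(frozen=True)
class Alignment:
    locus_a: str      # hypothetical locus label
    locus_b: str
    aligned_bp: int   # alignment length after common repeats are removed
    identity: float   # fraction of identical aligned bases (0.0 to 1.0)


def candidate_duplications(alignments: Sequence[Alignment],
                           min_bp: int = 1000,
                           min_identity: float = 0.90) -> List[Alignment]:
    """Assembly-comparison style filter: keep long, high-identity alignments."""
    return [a for a in alignments
            if a.aligned_bp >= min_bp and a.identity >= min_identity]


def excess_depth_windows(depths: Sequence[float],
                         unique_depths: Sequence[float],
                         n_sd: float = 3.0) -> List[int]:
    """Depth-of-coverage style filter: return indices of windows whose read
    depth exceeds mean + n_sd * standard deviation of known unique windows."""
    threshold = mean(unique_depths) + n_sd * pstdev(unique_depths)
    return [i for i, d in enumerate(depths) if d > threshold]


# Invented example data for illustration only.
alignments = [
    Alignment("locusA", "locusB", 1500, 0.97),   # passes both cutoffs
    Alignment("locusA", "locusC", 800, 0.99),    # too short
    Alignment("locusB", "locusD", 20000, 0.85),  # too divergent
]
print(candidate_duplications(alignments))
print(excess_depth_windows([10, 11, 25, 9], unique_depths=[10, 11, 9, 10, 10]))
```

The first filter depends on a correct assembly, whereas the second only needs reads aligned to a reference, which is why the depth-based approach can flag duplications that were collapsed during assembly.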

To assess and map segmental duplication content in regions that were incomplete, an experimental sequence-tag approach was developed (Horvath, Schwartz et al. 2000; Horvath, Viggiano et al. 2000). Here, a reference sequence (such as a fully sequenced BAC clone containing duplications) was selected and a sequence tag site (STS) was developed to cross-amplify the multiple copies. The PCR assay was initially used to screen a mono-chromosomal hybrid DNA panel. PCR products from each chromosome were then sequenced, and the sequences were compared to generate a catalog of STSs that could be used to distinguish individual copies and assign specific duplications to specific chromosomes (Horvath, Schwartz et al. 2000; Horvath, Viggiano et al. 2000). These same STSs could be used as probes to hybridize against BAC libraries to identify genomic clones mapping to specific regions. BAC-end sequence data from these large inserts could then be generated and, if the end-sequences extended beyond the duplication, the clone could be sequenced and then integrated into the assembly (Horvath, Schwartz et al. 2000; Horvath, Viggiano et al. 2000).
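The comparison step of the sequence-tag approach can be sketched as follows, assuming the STS sequences amplified from each chromosome have already been aligned to equal length. The function names and the copy labels are hypothetical and the sequences are invented; this is an illustration of the idea of copy-distinguishing positions, not the procedure used in the cited studies.

```python
from typing import Dict, List


def diagnostic_positions(aligned: Dict[str, str]) -> Dict[int, Dict[str, str]]:
    """Return alignment columns at which the duplicated copies differ.

    aligned maps a copy label to its pre-aligned STS sequence.
    """
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be pre-aligned to equal length")
    n = lengths.pop()
    variants: Dict[int, Dict[str, str]] = {}
    for i in range(n):
        column = {name: seq[i] for name, seq in aligned.items()}
        if len(set(column.values())) > 1:
            variants[i] = column
    return variants


def assign_copy(read: str, aligned: Dict[str, str]) -> List[str]:
    """Return the copies that match the read at every diagnostic position."""
    sites = diagnostic_positions(aligned)
    return [name for name, seq in aligned.items()
            if all(read[i] == seq[i] for i in sites)]


# Invented sequences for three hypothetical copies of one STS.
copies = {
    "chr16_copy1": "ACGTACGT",
    "chr16_copy2": "ACGTTCGT",
    "chr18_copy": "ACCTACGT",
}
print(sorted(diagnostic_positions(copies)))
print(assign_copy("ACGTTCGT", copies))
```

A new sequence, for example from a BAC clone, can then be assigned to a copy only when it agrees with that copy at all distinguishing positions; an empty result signals a copy not yet in the catalog.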

Low Copy Repeats on Chromosome 16 (LCR16)

During the physical mapping of chromosome 16, a series of LCR16s were identified by the clustering of yeast artificial chromosomes (YACs), cosmids, and plasmids in large bins of overlapping clones along the chromosome (Stallings, Doggett et al. 1992). A striking feature of chromosome 16 is the abundance of intrachromosomal duplications greater than 20 kb interspersed along the short arm of the chromosome (Loftus, Kim et al. 1999). Nineteen different LCR16s, which contain partially or entirely duplicated genic sequence on chromosome 16, were identified and labeled LCR16a-s (Loftus, Kim et al. 1999). Partial sequence from human BACs suggested an extraordinary degree of sequence identity (>95%). The abundance of large, highly identical interspersed intrachromosomal duplications in euchromatic regions containing genic sequences represented an ideal setting in which to investigate the genic and genomic evolutionary impact of intrachromosomal segmental duplications.

DUPLICATION AND EVOLUTION OF GENOMES

Charles Darwin eloquently described duplication and evolution in species genomes without ever understanding the nature of DNA. Based on his careful observations of species’ evolution through subtle trait changes, his words are still applicable for duplication evolution:

“We have formerly seen that parts many times repeated are eminently liable to vary in number and structure; consequently it is quite probable that natural selection, during the long-continued course of modification, should have seized on a certain number of the primordially similar elements, many times repeated, and have adapted them to the most diverse purposes.” (Darwin 1859)

Whole Genome Duplication Influencing Vertebrate Genomes

Susumu Ohno published several manuscripts around 35 years ago that were seminal to our current understanding of the role of duplications in species evolution. In his model, he argued that polyploidization (whole-genome duplication) was the driving force in genome and gene evolution and that several rounds of whole-genome duplication occurred during the evolution of vertebrates (Ohno 1970; Ohno 1973; Skrabanek and Wolfe 1998). Polyploidization has occurred in fish and amphibians, but has not occurred in mammals, birds, and reptiles since the divergence of these classes of vertebrates because of the well-established chromosomal sex-determining mechanism that emerged (Ohno, Wolf et al. 1968; Ohno 1970). The evolution of the sex chromosomes (X and Y) in mammals would restrict the last whole-genome duplication events to 450–550 mya (Ohno 1970; Eichler 2001). Whole-genome duplication generates a full copy of the entire genome, placing all genes, factors, promoters, and the like under relaxed selective constraint on their ancestral function and leading to the emergence of new gene families and gene expression patterns in vertebrates (Ohno 1970; Ohno 1973; Skrabanek and Wolfe 1998; Eichler 2001). However, this tetraploid state is inherently unstable as an evolutionary intermediate in vertebrates, and the genome is shuffled during the re-establishment of the diploid state, which in turn creates large blocks of duplication in the organism’s genome (Ohno 1970; Ohno 1973; Skrabanek and Wolfe 1998; Eichler 2001) (Fig 1-3).

Figure 1-3: Model of Whole-Genome Duplication. Red and black dumbbells represent chromosomes. An organism’s genome at a diploid (2n) state undergoes duplication of its entire genome, creating a tetraploid (4n) genomic state that is inherently unstable. Thus, the tetraploid genome tries to re-establish the diploid state, but in doing so causes rearrangements between chromosomes, which in turn cause the duplication of large portions of chromosomal regions, therefore deriving the new diploid (New 2n) chromosomal state that retains these duplications.

The Hox gene clusters are the gene family most cited as evidence for two whole-genome duplications acting as an evolutionary force in the emergence of vertebrate lineages, because mammalian genomes were found to contain four distinct Hox gene clusters whereas invertebrate genomes were found to contain a single Hox cluster in most cases (Skrabanek and Wolfe 1998; Bayarsaihan, Enkhmandakh et al. 2003). There are three strong lines of evidence pointing to whole-genome duplication occurring in the vertebrate lineage. 1) The first line of evidence is the one-to-four rule, which refers to a situation in which four members of a gene family in vertebrates are homologous to a single copy of the gene in invertebrates more often than three or five members are. The Hox gene clusters are a classic example of the 1 → 2 → 4 rule because comparisons of the nucleotide sequences of the Hox clusters suggest that they arose from a single common ancestral complex and subsequently underwent two rounds of duplication during vertebrate evolution (Holland 1997; Skrabanek and Wolfe 1998; Bayarsaihan, Enkhmandakh et al. 2003). 2) Ancient and large paralogous regions in vertebrate genomes have been considered a strong line of evidence in favor of whole-genome duplication. It has been demonstrated that many of the gene families mapping close to the Hox gene clusters, which lie in paralogous regions on human chromosomes 2, 7, 12 and 17, are also in the vicinity of the Hox clusters of Drosophila and C. elegans, lending support to the theory that two rounds of whole-genome duplication occurred early in vertebrate evolution (Holland 1997; Skrabanek and Wolfe 1998). 3) Differential gene loss in paralogous regions has been cited as evidence in favor of a whole-genome duplication (WGD) event. Whole-genome duplication initially creates a two-fold increase in the total number of genes. After the duplication event has occurred, it is expected that duplicated genes will be lost at a fairly high rate due to the relaxation of constraint on every gene (Holland 1997). Thus each round of tetraploidization is not expected to double the gene number over the long run but instead to fix only a fraction of the duplicated genes from the tetraploidization event. This gene loss can be demonstrated by comparing the four Hox gene clusters between different vertebrates (Fugu and mouse) (Holland 1997). Both have four Hox gene clusters, as would be expected after two rounds of whole-genome duplication, but each cluster contains different numbers and types of Hox genes when the two species are compared (Holland 1997). This would be expected if there were differential gene loss following whole-genome duplication of the entire region. Hox gene duplicates are particularly interesting in that the ancestral genes still encode the old body-plan while the duplicates have evolved unencumbered by selection, acquiring random mutations that could give rise to new body structures.
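The expectation that each round of tetraploidization fixes only a fraction of the duplicated genes can be made concrete with a short arithmetic sketch in Python. The starting gene number and the retained fraction used here are hypothetical values chosen for illustration, and the assumption that the same fraction is retained in every round is a simplification.

```python
def genes_after_wgd(initial_genes, rounds, retained_fraction):
    """Gene number after successive whole-genome duplications.

    Each round doubles the gene set, but only `retained_fraction` of the new
    duplicates is assumed to be fixed; the rest are lost.
    """
    if not 0.0 <= retained_fraction <= 1.0:
        raise ValueError("retained_fraction must be between 0 and 1")
    genes = float(initial_genes)
    for _ in range(rounds):
        genes += genes * retained_fraction
    return genes

# Hypothetical example: 10,000 ancestral genes and two rounds of duplication.
with_loss = genes_after_wgd(10_000, 2, 0.25)     # 15625.0
without_loss = genes_after_wgd(10_000, 2, 1.0)   # 40000.0
```

Under these assumed values, two rounds of duplication raise the gene number by about half rather than fourfold, which is the qualitative pattern described above.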

Tandem Duplication: Genome and Gene Evolution

Tandem duplication of the Bar locus on the X chromosome in Drosophila melanogaster, which contains a gene that reduces eye size when duplicated, was the first phenotypic association of a segmental duplication (Bridges, Skoog et al. 1936; Muller 1936). Although the nature of DNA was not understood, salivary chromosome analysis revealed that the Bar mutation arose when local chromosomal regions tandemly duplicated through a process later understood as unequal crossing over (Ohno 1970; Eichler 2001). Bridges and Muller hypothesized that the Bar gene region, when tandemly repeated, gave rise to the two different small-eye phenotypes observed in the Drosophila melanogaster mutants. They further hypothesized that the Bar reversion is due to the total loss of the Bar duplicates and that all three phenotypes could be explained by unequal crossing over. Finally, it was hypothesized that repeat sequences may be mediating the unequal crossing over events that were generating the Bar phenotypes (Bridges, Skoog et al. 1936; Muller 1936). Bridges' and Muller's research efforts gave some of the first insights into the structural dynamics of duplications and how such processes could quickly lead to new phenotypes that could frequently revert (Bridges, Skoog et al. 1936; Muller 1936; Eichler 2001). Tandem duplication events have been an integral part of genome and gene evolution, not only in ancient events (immunoglobulin gene family) but also in more recent events (human hemoglobin β-chain and δ-chain) in the Hominoidea superfamily (25 mya) (Ohno 1970; Eichler 2001).

While tandem duplications have had fundamental effects in genome evolution, their regional nature limits their impact (Ohno 1970; Skrabanek and Wolfe 1998). 1) A property first recognized at the Bar locus is that a tandem duplication expansion can be quickly reverted to the ancestral state by unequal crossing over (Bridges, Skoog et al. 1936; Muller 1936). This flux in the expansion and contraction of the tandemly duplicated sequence limits the fixation of specific variants. 2) The regional expansion and contraction of tandem genes limits the chance of duplicated genes being placed near new promoters or enhancers or escaping to new chromatin environments. Such genes have a more limited potential to gain new expression patterns and consequently new functions in different tissues (Ohno 1970).

SEGMENTAL DUPLICATION: GENOME AND GENE EVOLUTION

With the completion of the sequencing and assembly of the human genome, it has become quite evident that segmental duplications constitute at least 5% of the human genome sequence. However, long before this revelation, the importance of segmental duplications in causing human disease was already known (Maniatis, Fritsch et al. 1980; Collins and Weissman 1984; Orkin and Kazazian 1984; Antonarakis, Kazazian et al. 1985; Higgs, Vickers et al. 1989; Bondeson, Dahl et al. 1995; Mazzarella and Schlessinger 1998; Ji, Walkowicz et al. 1999; Ji, Eichler et al. 2000; Eichler 2001; IHGSC 2001; Bailey and Eichler 2003; Babushok, Ostertag et al. 2007). The frequency at which new segmental duplications arise in (0.01 per gene per million years) can vary a hundred-fold between species, which would generate many new gene copies over millions of years for selection to act upon and create new gene families (Lynch, O'Hely et al. 2001; Babushok, Ostertag et al. 2007). LCRs can range in size from a few kb to greater than 200 kb and are dispersed throughout the genome, not only in pericentric regions but also in gene-rich euchromatic regions of chromosomes. The high sequence identity (97%–99%) in great ape and human genomes creates a dynamic interplay of -based mutational events, thus leading to gene and genome evolution in primates (Eichler, Hoffman et al. 1998; Bailey, Yavor et al. 2001; Bailey, Gu et al. 2002).
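To give a sense of scale for the quoted rate of 0.01 duplications per gene per million years, the following Python sketch computes the expected number of newly arising duplicates. The gene count and time span in the example are hypothetical, and the calculation counts duplication events only; it makes no allowance for the subsequent loss of most duplicates.

```python
def expected_new_duplicates(n_genes, million_years, rate_per_gene_per_my=0.01):
    """Expected number of gene duplication events arising over a time span.

    rate_per_gene_per_my defaults to the 0.01 figure quoted in the text;
    n_genes and million_years are supplied by the caller.
    """
    return n_genes * million_years * rate_per_gene_per_my

# Hypothetical example: a genome of 20,000 genes observed over 6 million years.
events = expected_new_duplicates(20_000, 6)
```

Because the rate is said to vary a hundred-fold between species, the same function can be called with a different `rate_per_gene_per_my` to explore that range.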

Segmental Duplication and Human Disease

Ever since the field was created to understand on this planet, biologists have been using changes in phenotypes, i.e. disease states of organisms, to unravel biological functions. One of the byproducts of evolution is the generation of deleterious phenotypes that are rapidly eliminated from the population. These deleterious phenotypes are often categorized and manifest as disease states in humans. The impact of LCRs is no exception: they can be drivers of new gene evolution by duplicating whole genes, and they can be deleterious by causing disease phenotypes.

Although the primary cause of many human genetic diseases is single gene defects, i.e. base-pair mutations, a growing number of such diseases have been classified as genomic disorders, which are caused by structural rearrangements of segments of chromosomes (Lupski 1998; Ji, Eichler et al. 2000). In human genomic disorders, segmental duplications have been shown to be directly involved in a growing number of recurrent chromosomal rearrangements that result in disease phenotypes (Emanuel and Shaikh 2001; Samonte and Eichler 2002). In the 1980s, certain chromosomal regions were first associated with several human genetic disorders such as DiGeorge syndrome, located at 22q11, and Prader–Willi syndrome, located at 15q11-q13. These were later associated with LCRs flanking these regions, which have been implicated in causing chromosomal rearrangements (deletions, insertions, and inversions) mediated by non-allelic homologous recombination between highly similar duplications (97–99%), causing genomic disease (de la Chapelle, Herva et al. 1981; Ledbetter, Riccardi et al. 1981; Eichler 1998) (Fig 1-4). The majority of genomic diseases caused by unequal homologous recombination between duplicons create multi- or deletion of intervening sequence, generating an imbalance of dosage-sensitive genes. This imbalance of gene dosage in humans is manifested as disease phenotypes, from red–green color blindness and α- and β-thalassemia to Neurofibromatosis type 1 and Williams syndrome. Recurrent rearrangements range in size from a few hundred kb to megabases of sequence (Nickerson, Greenberg et al. 1995; Dorschner, Sybert et al. 2000; Ji, Eichler et al. 2000). The short-term effect of unequal homologous recombination between duplicons is genetic disease, but over the longer term (evolutionary time), it may be essential for the emergence of new genes and sequence diversity in primates. This thesis aims to evaluate how duplications evolve and generate new gene families.

Figure 1-4: The Mechanism behind Non-allelic Homologous Recombination. Chromosome Z contains three genes (red, green, and blue bars) flanked by two segmental duplications (purple, violet), which are 99% identical. Arrows above the genes represent the direction of transcription. During meiosis, unequal homologous recombination occurs between the segmental duplications, creating the gametes, with one gamete having a chromosome with a duplication of the three genes, and the other gamete having a deletion of the genes.
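The reciprocal products described in the figure legend can be reproduced with a toy Python model. Here a chromosome is represented as an ordered list of labels, and a single crossover is assumed to occur between the misaligned flanking duplications of two homologs. The labels "SD1" and "SD2" and the function name are hypothetical and serve only this illustration.

```python
def nahr_products(chromosome, proximal_sd, distal_sd):
    """Return (duplication, deletion) products of one misaligned crossover.

    The distal duplication of one homolog is assumed to pair with the proximal
    duplication of the other, with a single crossover inside the paired copies.
    """
    i = chromosome.index(proximal_sd)
    j = chromosome.index(distal_sd)
    if i >= j:
        raise ValueError("proximal_sd must lie before distal_sd")
    # Homolog A up to its distal copy, a hybrid copy, then homolog B after its proximal copy.
    duplication = chromosome[:j] + [f"{distal_sd}/{proximal_sd}"] + chromosome[i + 1:]
    # Reciprocal product: homolog B up to its proximal copy, a hybrid copy, then homolog A after its distal copy.
    deletion = chromosome[:i] + [f"{proximal_sd}/{distal_sd}"] + chromosome[j + 1:]
    return duplication, deletion

# Chromosome Z from the figure: three genes flanked by two segmental duplications.
chromosome_z = ["SD1", "red", "green", "blue", "SD2"]
duplication, deletion = nahr_products(chromosome_z, "SD1", "SD2")
```

In this model the duplication product carries two copies of each of the three genes, while the deletion product retains only a single hybrid duplication and none of the genes.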

Segmental Duplication and Gene Evolution

The main effect of segmental duplication evolution is structural in nature. The process moves blocks of the genome into new genomic contexts where selection now operates in a new environment. Additionally, the first duplication event predisposes the duplication to secondary mutational events (duplications, deletions, and rearrangements) mediated by the homology. As additional rearrangements occur, the probability of still other events occurring increases, thus creating a dynamic process of genomic evolution (Eichler 2001). This dynamic process rearranges and scrambles large regions, sometimes in excess of 200 kb, that contain genic sequences and other functional elements (enhancers, promoters, repressors, etc.). I hypothesize that natural selection can act upon these dynamic regions to generate new genes and gene families (Eichler 2001; Khaitovich, Enard et al. 2006; Babushok, Ostertag et al. 2007).

There are three main roles in gene evolution (exon shuffling, alternative splicing, and domain accretion) in which segmental duplications have been implicated during primate evolution.

Gilbert was the first to explain the effects of exon shuffling on gene evolution and the advantage it would give in creating gene diversity during the genome evolution of organisms (Gilbert 1978; Liu and Grigoriev 2004). Exon shuffling has been estimated to have been involved in up to 20% of eukaryotic exons (Li, Gu et al. 2001; Babushok, Ostertag et al. 2007). Exon shuffling theory is based on the premise that exons represent the functional modules in proteins, and that the driving forces of exon shuffling are the duplication and the rearrangement of exons in newly formed genes, which results in novel genes containing new functional modules (Liu and Grigoriev 2004). Hundreds of genes are thought to have arisen by exon shuffling, from blood coagulation genes to extracellular matrix genes (Babushok, Ostertag et al. 2007). The main evolutionary advantage of exon shuffling is that exons with different functional properties can be duplicated and new combinations can be created, thus creating new gene functions by reassembling different functional elements.

Alternative splicing of genes in the human genome is prevalent, with 47%–74% of genes having at least two splice variants (Modrek and Lee 2002; Johnson, Castle et al. 2003; Modrek and Lee 2003; Babushok, Ostertag et al. 2007). The segmental duplication of genic segments into and beside other genic regions ~8.5–50 kb can precipitate the creation of naturally occurring transcription-induced chimeras (TICs) (Akiva, Toporik et al. 2006; Parra, Reymond et al. 2006; Babushok, Ostertag et al. 2007). TICs are intergenic splicing events that frequently occur between different genes located in tandem (within ~8.5 kb), resulting in a read-through chimeric transcript of two different genes (Akiva, Toporik et al. 2006; Parra, Reymond et al. 2006; Babushok, Ostertag et al. 2007). Intergenic splicing generates newly spliced genes in several ways, two of which are using the promoter and first few exons of one gene to change the transcript of the remaining genes, and allowing exon inclusion or exclusion from both genes to generate functionally new transcripts (Babushok, Ostertag et al. 2007). Segmental duplications constitute ~5% of the human genomic sequence and have a proclivity to move genic sequence around the genome of primates (Bailey, Yavor et al. 2001; Bailey, Yavor et al. 2002; She, Horvath et al. 2004). Together with intergenic splicing, which is estimated to affect at least ~47–77% of the genes located tandemly, this makes segmental duplications an ideal substrate for increasing transcriptional diversity in a genome (Akiva, Toporik et al. 2006; Parra, Reymond et al. 2006; Babushok, Ostertag et al. 2007). In principle, these TICs are evolutionarily advantageous because they are frequent, ubiquitous, and create novel gene products at low cost to the host (Babushok, Ostertag et al. 2007). Occasionally, such transcriptional novelties may serve as targets of natural selection, leading to the “birth” of a new gene.

Domain accretion is marked by the addition of exon modules that demarcate a functional domain to an ancestral protein structure, allowing for the emergence of new proteins with more diverse function. Segmental duplication can play an active role in this process, creating a new fusion with new domains supplied by both genes that were fused together to generate the new transcript (Eichler 2001; Paulding, Ruvolo et al. 2003). An example of a hominoid-specific gene that arose by domain accretion is Tre2, in which domains of TBC1D3 and USP32 became duplicated and the duplicates subsequently became fused, generating the Tre2 gene (Paulding, Ruvolo et al. 2003). With the completion of the human genome sequencing and assembly, it has been recognized that there is a twofold increase in domain accretion in humans when compared with the proteomes of invertebrates, and a similar increase (tenfold) of segmentally duplicated DNA in humans when compared with invertebrates (IHGSC 2001), suggesting that segmental duplication is critical to the formation of new genes through domain accretion (Eichler 2001). New genes created by domain accretion are evolutionarily advantageous because they are frequent and create novel gene products at low cost to the host by not destroying the original genes’ functions in the organism (Eichler 2001; Babushok, Ostertag et al. 2007).

NATURAL SELECTION OPERATING UPON DUPLICATED GENES

Gene duplication is a major source of evolutionary novelty through its generation of new material for selection to operate upon. However, to demonstrate genetic adaptation in any organism, evidence is required that natural selection was the underlying factor in the evolution of a particular trait (Harrison 1988; Rastogi and Liberles 2005; Harris and Meyer 2006). For my thesis, I will be focusing on the effects of evolutionary change at the DNA level. To demonstrate genetic adaptation at the DNA level is to test experimentally whether a gene is under neutral, negative, or positive selection. Neutral selection refers to the event where a novel DNA base-pair change has neither an advantage nor a disadvantage over other background mutations, and consequently the base-pair change neither rises nor falls in frequency due to evolutionary pressure (Nielsen 2005; Harris and Meyer 2006). Negative selection refers to an event where a novel DNA base-pair change has a disadvantage over other base-pair changes, and consequently the base-pair change falls in frequency due to evolutionary pressure acting to eliminate that mutation (Nielsen 2005; Harris and Meyer 2006). Finally, positive selection refers to the event where a novel DNA base-pair change has an advantage over other base-pair changes, and consequently non-synonymous base-pair changes increase in frequency in accordance with evolutionary pressure (Nielsen 2005; Harris and Meyer 2006). Fitness plays a role in negative and positive selection in that these genotype changes confer an advantage or disadvantage to an organism in competition with other organisms harboring an alternative allele. Neutral, negative, and positive selection leave distinct signatures that are detectable at the sequence level through statistical methods.

Methodology of Testing for Neutral, Negative, and Positive Selection

The methodology that I will use in this dissertation will be comparative .

I will generate and utilize data from multiple species or duplicated genes within species for inferring past selection events. The major computational analysis used to determine selection on comparative data is the comparison of the ratio of non-synonymous mutations per non-synonymous site (Ka or dN) against the ratio of synonymous mutations per synonymous site (Ks or dS) (Bamshad and Wooding 2003; Nielsen 2005). Non-synonymous sites are defined as any base-pair in the three base-pairs that constitute a codon that, when changed, alters the amino acid that is encoded by that codon. Synonymous sites are defined as any base-pair in the three base-pairs that make up a codon that, when changed, does not change the amino acid that is encoded by that codon (Bamshad and Wooding 2003) (Fig 1-5). The Ka (or dN) value is divided by the Ks (or dS) value to give the average ratio (Ka/Ks or dN/dS) of non-synonymous substitutions to synonymous substitutions for the gene or exon being tested (Nielsen 2005). If there is no selection on the gene, because it is under neutral selection, then non-synonymous and synonymous substitutions occur at the same rate, and we would expect Ka/Ks = 1 (Nielsen 2005). If positive selection were acting upon the gene, we would observe that non-synonymous substitutions are fixed at a higher rate than synonymous substitutions (Ka/Ks > 1) in the coding sequence of the gene. Finally, if negative selection were acting upon the gene, we would observe that synonymous substitutions are fixed at a higher rate than non-synonymous substitutions (Ka/Ks < 1) in the coding sequence of the gene (Fig 1-5).
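The three outcomes of the Ka/Ks comparison can be written as a small decision rule. The Python sketch below is an illustration only: the function name and the tolerance around Ka/Ks = 1 are assumptions, and a real analysis would rely on a statistical test rather than a fixed tolerance.

```python
def classify_selection(ka, ks, tolerance=0.05):
    """Return (Ka/Ks, label) for a gene or exon.

    ka        -- non-synonymous substitutions per non-synonymous site
    ks        -- synonymous substitutions per synonymous site
    tolerance -- how close to 1 the ratio must be to be called neutral (assumption)
    """
    if ks <= 0:
        raise ValueError("Ks must be positive to form the Ka/Ks ratio")
    ratio = ka / ks
    if abs(ratio - 1.0) <= tolerance:
        return ratio, "neutral"
    return ratio, "positive" if ratio > 1.0 else "negative"

# Hypothetical values for three genes.
print(classify_selection(0.02, 0.01))   # ratio above 1
print(classify_selection(0.005, 0.02))  # ratio below 1
print(classify_selection(0.01, 0.01))   # ratio equal to 1
```

The guard on Ks reflects a practical limitation of the ratio: when no synonymous substitutions are observed, Ka/Ks is undefined.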

Figure 1-5: Darwinian Selection Acting upon a Duplicated Gene. Synonymous (Ks) base-pair replacement is any base-pair change that does not change the amino acid, while non-synonymous (Ka) base-pair replacement is any base-pair change that changes the amino acid. Each green letter represents a synonymous base-pair replacement, while each red letter represents a non-synonymous base-pair replacement. The schematic represents the three (neutral, negative, and positive) Darwinian selection outcomes on a new duplicated gene and the possible fixation of base-pair changes seen if the gene was under neutral, negative, or positive selection.
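The distinction between synonymous and non-synonymous replacements drawn in the legend can be checked directly against the standard genetic code. The following Python sketch builds the standard codon table and classifies a single base-pair change within a codon. The function name is hypothetical, and stop codons are simply treated as a distinct "amino acid" (written `*`), which is a simplifying assumption.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_change(codon, position, new_base):
    """Classify a base-pair change at `position` (0, 1, or 2) of `codon`."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    if CODON_TABLE[codon] == CODON_TABLE[mutated]:
        return "synonymous"
    return "non-synonymous"

# CTT (Leu) -> CTC (Leu) leaves the amino acid unchanged.
third_position = classify_change("CTT", 2, "C")
# CTT (Leu) -> CCT (Pro) changes the amino acid.
second_position = classify_change("CTT", 1, "C")
```

Counting such changes across aligned coding sequences, and normalizing by the number of sites of each kind, is what yields the Ks and Ka values compared in the figure.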

Population Genetic Models Acting upon Duplicated Genes

Three models (neofunctionalization, subfunctionalization, and non-functionalization) have been proposed to explain the retention of duplicated genes in the genomes of organisms. Ohno (1970) first proposed the model of neofunctionalization, which states that duplication creates new gene loci that are free to accumulate random mutations in the coding sequence and can lead to the formation of new genes and gene families. There are two caveats to this model: the gene and its duplicate cannot create a lethal dosage effect, and one of the two genes must keep the original function of the gene in that organism (Ohno 1970; Walsh 2003; Roth, Rastogi et al. 2007). Lynch and Force proposed the model of subfunctionalization, which states that gene duplication is followed by degenerative mutation in both copies, causing complementary loss of subfunctions in the two gene members, with each gene performing one of the ancestral gene functions (Lynch and Force 2000; Lynch, O'Hely et al. 2001; Walsh 2003; Roth, Rastogi et al. 2007). This model takes into account the effects of gene dosage, since both copies change rapidly, creating two genes that have the same function as the ancestral gene, although each of the sister genes is now expected to be expressed in a more restricted set of tissues than the single ancestral gene (Lynch and Conery 2000). Lynch (2001) proposed a variant of this model, duplication-degradation-complementation (DDC), to explain the restriction of expression of the sister genes in the subfunctionalization model (Lynch, O'Hely et al. 2001). DDC states that degenerative mutations in regulatory elements of the duplicated genes can increase the probability of preservation of the genes through the partitioning of ancestral gene expression (Lynch, O'Hely et al. 2001). The non-functionalization model is the byproduct of the other two models for the fixation of non-functional pseudogenes. The model states that the relaxed selection on the gene duplicates allows the accumulation of base-pair changes that lead to the formation of stop codons in the gene transcript, which is the fate of > 95% of duplicated genes, as seen in the olfactory gene duplicates, the guanosine diphosphate (GDP) duplicates, and the major histocompatibility complex (MHC) gene duplicates (Walsh 2003; He and Zhang 2005; Roth, Rastogi et al. 2007).

Examples of classes of genes that have undergone fixation by positive selection include genes involved in immune response to disease, sexual selection (sperm–egg competition), and predator-versus-prey competition (toxins from predatory organisms), all of which have evolved to help an organism adapt to ecological niches (Swanson and Vacquier 1995; Duda and Palumbi 1999; Swanson, Clark et al. 2001; Swanson, Yang et al. 2001; Zhu, Bosmans et al. 2004; Lynch 2007). In studies of these classes of protein, there appear to be specific positions in these protein-encoding genes, rather than the whole genes, that are under positive selection (Roth, Rastogi et al. 2007). It has been proposed that subfunctionalization results in the preservation of duplicated genes in the genome of an organism, ultimately leading through selection to the neofunctionalization of duplicated gene copies. This generates evolutionary novelty through positive selection, leading to a function that was not present in the pre-duplicated gene (Rastogi and Liberles 2005; Shakhnovich and Koonin 2006).

PRIMATE GENOME EVOLUTION

Before discussing primate chromosome evolution it is important to understand the phylogeny of primate species. With respect to human, there are five generally accepted divergence times among the anthropoid primates: New World monkeys (marmoset, woolly monkey, squirrel monkey, etc.) diverged ~35 mya (million years ago), Old World monkeys (macaque and baboon) separated ~25 mya, and the great apes (orangutan, gorilla, and chimpanzee) diverged ~12 mya, ~8 mya, and ~6 mya, respectively (Goodman 1999) (Fig 1-6).

Figure 1-6: Primate Phylogenetic Tree. This tree represents the generally accepted timing of primate evolution. The years that each species diverged off from their common ancestors are shown at the top of the tree in millions of years (Mya), with representative primate species located at the bottom of the tree.

Early comparative analysis of chromosomes from the orangutan, gorilla, chimpanzee, and human suggested that eighteen out of twenty-three chromosomes are virtually identical in the great apes (Yunis and Prakash 1982). Yunis and Prakash demonstrated that human chromosome 2 was a telomeric fusion of two chromosomes (2q and 2p) in the common ancestor of humans, which is not observed in the chimpanzee, gorilla, or orangutan (Yunis and Prakash 1982). The largest differences in chromosomes in the great apes involve inversions and consist mainly of pericentric inversions, as observed on chromosomes 4, 5, 9, 12, 15, and 16 between human and chimpanzee. However, chromosome 7 contains a paracentric inversion that differentiates the chimpanzee chromosome 7 from the gorilla chromosome 7 (Yunis and Prakash 1982).

The primate chromosome naming system is based on the convention of Ed McConkey for naming the chromosomes based on human designations.

From these observations, others would go on to study the evolution of chromosomes in primates by comparing data obtained by chromosome banding, chromosome painting, and gene mapping. Sequences homologous to human chromosome 7 are present in at least two chromosomes in New World monkeys, but in the Old World monkeys and great apes, chromosome 7 sequence maps to a single chromosome (Richard, Lombard et al. 2000). In the great-ape lineage, chromosome 7 went through further rearrangements: it underwent a pericentric inversion after the orangutans diverged from the great apes (Richard, Lombard et al. 2000). After the divergence of gorillas from the rest of the great apes, chromosome 7 underwent a second pericentric inversion, and a third pericentric inversion separates the bonobo from the common chimpanzee and human chromosome 7 (Yunis and Prakash 1982; Richard, Lombard et al. 2000).

More recent comparative analyses of primate chromosomes have shown striking differences of between humans and macaques. One aspect is the emergence of evolutionary new centromeres (ENCs) in different regions of macaque chromosomes compared to humans (Ventura, Antonacci et al. 2007). ENCs can appear in novel chromosomal regions in an organism with the concordant inactivation of the ancestral centromere (Ventura, Antonacci et al. 2007). After the fixation of the new centromere in primates, it starts to acquire the typical complexity of ancestral centromeric regions, gaining a core of DNA flanked by inter- and intrachromosomal pericentromeric duplications (She, Horvath et al. 2004; Ventura, Antonacci et al. 2007). ENCs have dramatically changed the structures of macaque and human chromosomes, with a total of 14 ENCs, 9 occurring specifically in the macaque lineage (Ventura, Antonacci et al. 2007). These chromosomal evolutionary changes had the potential to lead to events by changing the sequence content around the ENC through the build-up of pericentromeric duplications around the new centromeres. Also, the deactivation of the old centromere would allow the expression of new genes that were created by pericentromeric duplication. ENCs have shown that chromosome evolution is a very dynamic process in primate evolution.

RESEARCH OBJECTIVES

This dissertation will focus on a specific set of recent segmental duplications (LCR16a) on chromosome 16 in order to gain insight into the mechanism of duplication in the genome and to gain an understanding of how new genes evolve. The first goal will be to investigate the organization and evolutionary history of LCR16a in primates to determine the timing of the expansion of the segmental duplication (Chapter 2). Evolutionary reconstruction will allow me to determine whether new species-specific insertions occurred in primates other than humans (Chapter 3). The second goal will be to use the signatures of the new insertion sites and the inserted segmental duplication sequences at the flanks in great-ape genomes to provide insight into the mechanism of segmental duplicative transposition. Recent duplications such as LCR16a are particularly advantageous since there has not been enough evolutionary time for sequence signatures at the new insertion sites to decay (Chapter 3). The third goal of this dissertation will be to investigate whether the flanking duplications surrounding LCR16a duplicated in an independent fashion from LCR16a or whether LCR16a was a master element that duplicated into a new genomic sequence and picked up the new flanking sequence to further duplicate the flanking sequence in primate genomes, thus creating species-specific flanking LCR16a duplications (Chapter 3). The final aim will be to examine the gene that is contained within LCR16a to determine what type of Darwinian selection has operated on the associated gene family, morpheus (Chapters 2 and 4). The combined data serve as a model to gain an understanding of genomic rearrangement and gene innovation in the primate genome via segmental duplication (Chapter 4).


Chapter 2

Positive Selection of a Gene Family During the

Emergence of Humans and African Apes

Matthew E. Johnson¹, Luigi Viggiano², Jeffrey A. Bailey¹, Munah Abdul-Rauf³, Graham Goodwin³, Mariano Rocchi² & Evan E. Eichler¹

¹Department of Genetics and Center for Human Genetics, Case Western Reserve University School of Medicine and University Hospitals of Cleveland, Cleveland, Ohio 44106, USA

²DAPEG, Sezione di Genetica, Via Amendola 165/A, 70126 Bari, Italy

³ Section of Molecular , Institute of Research, Haddow Laboratories, 15 Cotswold Road, Sutton, Surrey MS2 5NG, UK

Note: This manuscript has been published: Johnson ME, Viggiano L, Bailey JA, Abdul-Rauf M, Goodwin G, Rocchi M, Eichler EE (2001) Nature 413:514–519. L.V. and M.R. provided primate comparative FISH analysis. J.B. provided computational analysis and data. M.R. and G.G. provided immunolocalization studies data.

ABSTRACT

Gene duplication followed by adaptive evolution is one of the primary forces for the emergence of new gene function (Ohno 1970). We describe the recent proliferation, transposition and selection of a 20 kb duplicated segment throughout 15 Mb of the short arm of human chromosome 16. The dispersal of this segment was accompanied by considerable variation in chromosomal map location and copy number among hominoid species. In humans, we identify a novel gene family (termed morpheus) within the duplicated segment. Comparison of putative protein-encoding exons reveals the most extreme case of positive selection among hominoids reported to date. The major episode of enhanced amino-acid replacement occurred after the separation of human and great ape lineages from the orangutan. Positive selection continued to alter amino-acid composition after the divergence of human and chimpanzee lineages. The rapidity and bias for amino-acid altering nucleotide changes suggest adaptive evolution of the morpheus gene family during the emergence of humans and the African apes. Moreover, the analysis indicates that some genes emerge and evolve very rapidly, generating copies that bear little similarity to their ancestral precursors. Consequently, a small fraction of human genes may not possess discernible orthologues within the genomes of model organisms.

INTRODUCTION, RESULTS, AND DISCUSSION

Gene Analysis and Genomic Organization of LCR16a

During physical mapping and sequencing of the human genome (Stallings, Whitmore et al. 1993; Loftus, Kim et al. 1999; Consortium 2001), a complex series of duplicated genomic segments were identified that mapped to multiple cytogenetic band positions on chromosome 16 (Fig. 2-1a). We reassessed the genomic distribution and the extent of duplication of one of these segmental duplications termed LCR16a (low-copy repeat sequence a from chromosome 16) (Loftus, Kim et al. 1999). Fifteen distinct copies of the duplicated segment were characterized (see Methods and Fig. 2-2). These genomic repeats were specific to human chromosome 16, were ~20 kb in length and shared a remarkably high degree of sequence identity (Fig. 2-1b). Sequence similarity searches of the expressed sequence divisions of GenBank revealed a previously uncharacterized family of genes within the LCR16a segment (Fig. 2-1c and Supplemental Data).


Figure 2-1: Sequence Properties of the LCR16a Duplication. a) The schematic displays the distribution pattern (red bars) of LCR16a duplications relative to a human chromosome 16 ideogram. The analysis is based on the published human genome project assembly (Bailey, Yavor et al. 2001; Consortium 2001) and shows the clustering of duplications on the short arm of chromosome 16. b) The histogram plots the number of substitutions per site (K) among LCR16 duplications as a function of the number of aligned basepairs. Optimal global alignments for all possible pairwise combinations for each duplication were made and the degree of sequence identity for each alignment was computed (n=183 pairwise alignments). LCR16a duplications (black) are compared to all other characterized LCR16 duplications (white) (see Fig. 2-2b for a detailed description of other duplications). c) The gene structure of one member of the gene family (AF132984) is shown (green bars) compared to the 20 kb LCR16a segment from its corresponding genomic locus (AC002045). The analysis indicates 8 exons, two strong polyadenylation signals within the 3’ UTR and a putative promoter region overlapping the first exon. PCR products used as probes in this study and their relative location are indicated above this exemplar gene structure.


Fig. 2-2: Genomic Organization of LCR16a Loci. a) A graphical view (PARASIGHT; Ref. 5) showing the extent of shared sequence (red bars) among LCR16a loci; discontinuities demarcate the location of common repeat sequences (Alus, LINEs, etc.). A reference finished genomic sequence (AF001549/AC003007) has been used for all comparisons. A total of 41 human GenBank accessions were identified that contained the duplicated segment and all possible genomic pairwise comparisons were analyzed. Examination of unique flanking sequence and comparison of sequence signatures derived from monochromosomal 16 source material allowed us to unambiguously identify 15 non-allelic LCR16a copies. In some cases, larger shared segments were observed among paralogues, indicating that the boundaries of the duplication were not always identical. Evidence of tandem duplication was found for only 4 of the 15 copies. The remaining copies were distributed across an estimated 15 Mb of the short arm of chromosome 16 (16p11.2, 16p12.1, 16p13.1) with a single copy mapping to 16q22 (AC009153). Note the minimal shared segment length of ~20 kb (LCR16a = black bar) among the sequences. b) The organization of various duplications (LCR16a-e (Ref. 3) and LCR16o-s (this study)) for six GenBank accessions is shown to scale. The nine additional low-copy repeat sequences ranged in size from 3-50 kb and were identified in the vicinity of the LCR16a duplication (Fig. 2-1b, Supplemental Table 2-2). Mosaic patterns of duplication were observed for most of the loci; only two of the LCR16a-containing loci were not associated with other duplications. The patterns of duplication are based on a comparison against all human genomic sequence (see Supplement Table 2-2 for details regarding each duplication). Arrows indicate expressed mRNA or EST sequences with intron/exon structure that correspond to different LCR16a copies (1= AF132984, 2=D86974, 3=AW47462 and AI073447, 4=AI6745414=AA309541). c) A sequence alignment of the two corresponding genomic segments (AF001549 and AC002045). Per cent identity was based on the number of substitutions observed within a 1500 bp window, sliding every 50 bp. The intron-exon structure of AF132984 is superimposed to show the correspondence between exon positions and decrease in % nucleotide identity.

Transcripts could be identified for 6 of the 15 genomic copies. We found no significant sequence similarity to this gene family in other organisms either at the nucleotide or protein level (E<1e-30), indicating a highly diverged family of human transcripts. Sequence comparison of putative proteins from two full-length human transcripts showed 81% amino-acid sequence identity (Fig. 2-3). In sharp contrast, the corresponding non-coding portions of genomic DNA were 98.1% identical. These data suggested either that the exonic regions were hypermutable or that amino-acid changes had been selected during the evolution of this gene family.


Fig. 2-3: Protein Alignment. An alignment of ~280 amino acids from two full-length ORF-containing cDNA transcripts (a=AF229069, b=AF132984). The first exon is not shown due to in-frame alternative splicing. Only part of the final exon is shown due to a 19 amino-acid repetitive motif. Initially, a full-length cDNA (AF132984) was identified, corresponding to the LCR16a segment within accession AC002045. The cDNA had been independently isolated by a two-hybrid assay in an effort to identify components that interact with the human SWI/SNF transcriptional activator complex. A 347 amino-acid open reading frame was predicted, encoding a putative protein of 37 kDa. Although no functional motifs could be identified, protein structure prediction analysis (Methods) suggested a simple membrane-bound protein model. Analysis of AF132984 cDNA and its corresponding genomic sequence indicated 8 exons, two strong polyadenylation signals within the 3’ UTR and a putative promoter region overlapping the first exon (Fig. 2-1c). BLAST analysis of the human-EST division of GenBank identified a total of 582 ESTs that showed significant sequence identity to AF132984 (90-99% sequence identity). Of these, 284 also possessed intron/exon splicing consistent with the AF132984 gene model. Comparison of these ESTs to the sequenced genomic copies indicated that at least six of the 15 duplicated loci were transcriptionally active. The transcriptional status of the remaining 9 copies is not known, although all copies contain a complete complement of exons based on the AF132984 model. No significant (E<1e-30) BLAST sequence similarity to this gene family in other organisms could be found either at the nucleotide or protein level, indicating a highly diverged family of human transcripts. Sequence comparison of putative proteins from two experimentally determined full-length cDNAs (AF132984 and AF229069) showed 81% amino-acid sequence identity.

Comparative Analysis

The high degree of genomic sequence similarity among the various human copies (~98%) indicated a recent evolutionary divergence. Analysis of primate metaphase chromosomes and interphase nuclei confirmed striking variation in signal intensity, copy number and map location (Fig. 2-4). Among all Old World Monkeys, a single metaphase signal corresponding to 1 or 2 copies (by interphase nuclei) was identified distally on chromosome 16. Sequence analysis of a genomic subclone from baboon (data not shown) confirmed an orthologous map position to human sequence 16p13.1, the likely ancestral segment from which all other copies originated. In contrast to the Old World Monkeys, the genome of the Great Apes showed a major proliferation of the LCR16a duplicon. This is particularly evident within the short arm, which is almost completely “painted” by the LCR16a probe. This effect is most striking in human and the African Great Ape lineages. Using a combination of approaches (interphase nuclei, library hybridization and sequence analysis of genomic clones), we estimated the copy number of the duplication in orangutan, gorilla, human and chimpanzee as 9, 17, 15 and 25-30 copies respectively. Interestingly, in both orangutan and chimpanzee, copies have been transposed to chromosomes other than 16 (Fig. 2-4), clearly indicating lineage-specific duplication events.

Duplication Timing

In order to more precisely estimate evolutionary timing of the duplications, we resequenced 1421 bp of non-coding intronic sequence (Fig. 2-1c) from various human, chimpanzee, gibbon, and baboon genomic subclones and compared the number of nucleotide substitution events both within and between species (Table 2-1).


Table 2-1: Average pairwise distance (K) of intron sequence

Species   HSA     PTR     HKL     PHA
HSA       ---     0.021   0.034   0.074
PTR       0.002   ---     0.038   0.080
HKL       0.004   0.004   ---     0.070
PHA       0.008   0.008   0.007   ---

HSA=Homo sapiens (n=14 sequences), PTR=Pan troglodytes (n=17), HKL=Hylobates (n=4), and PHA=Papio hamadryas (n=1). K=average genetic distance (Kimura 2-parameter model) between groups (above the diagonal). SE=standard error (below the diagonal). n=the number of paralogues analyzed within each species.
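For illustration only, the distance corrections behind the K values in Table 2-1 can be sketched in Python. This is a minimal sketch of the standard Jukes-Cantor and Kimura two-parameter formulas, not the software used for the analysis. The function names and the toy sequences are assumptions introduced here.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def transition_transversion_fractions(seq_a, seq_b):
    """Return (P, Q): fraction of compared sites that differ by a
    transition (P) or by a transversion (Q).

    Sites with a gap or ambiguous base in either sequence are skipped,
    which corresponds to pairwise deletion.
    """
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return transitions / compared, transversions / compared


def jukes_cantor_distance(p):
    """Jukes-Cantor one-parameter distance from the proportion of differing sites p."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def kimura_2p_distance(P, Q):
    """Kimura two-parameter distance from transition (P) and transversion (Q) fractions."""
    return -0.5 * math.log((1.0 - 2.0 * P - Q) * math.sqrt(1.0 - 2.0 * Q))


# Toy alignment (hypothetical): one transition (A/G) and one transversion (C/A).
P, Q = transition_transversion_fractions("ACGTACGTAC", "GCGTACGTAA")
print(P, Q, kimura_2p_distance(P, Q))
```

Both corrections exceed the raw proportion of differences because they account for multiple substitutions at the same site. The two models coincide when transitions make up one third of all differences.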

First, we performed a Tajima’s relative rate test using baboon sequence as an outgroup and orthologous pairs of sequence from chimpanzee, human, and gibbon. True orthologues were determined by end-sequence analysis of the genomic subclones (Methods). This allowed duplicated copies that were orthologous by position to be identified within their respective genomes. Based on the analysis of four tests (p=0.109, 0.239, 0.242, 0.093) which indicated that the intronic sequence was evolving neutrally, we accepted the molecular clock hypothesis.
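The logic of the relative rate test can be sketched as follows. This is a minimal Python sketch of the one-degree-of-freedom form of Tajima's test and is not the implementation used for the analysis above. The function name and the toy sequences are assumptions introduced for illustration.

```python
import math


def tajima_relative_rate(seq1, seq2, outgroup):
    """One-degree-of-freedom relative rate test on three aligned sequences.

    m1 counts sites where only seq1 differs from the outgroup;
    m2 counts sites where only seq2 differs from the outgroup.
    Under a molecular clock m1 and m2 are expected to be equal, and
    (m1 - m2)^2 / (m1 + m2) is approximately chi-square with 1 d.f.
    Returns (m1, m2, chi2, p).
    """
    m1 = m2 = 0
    for a, b, o in zip(seq1.upper(), seq2.upper(), outgroup.upper()):
        if a not in "ACGT" or b not in "ACGT" or o not in "ACGT":
            continue  # skip gaps and ambiguous bases
        if a != o and b == o:
            m1 += 1
        elif b != o and a == o:
            m2 += 1
    if m1 + m2 == 0:
        return m1, m2, 0.0, 1.0
    chi2 = (m1 - m2) ** 2 / (m1 + m2)
    # Upper tail of a chi-square distribution with 1 d.f.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return m1, m2, chi2, p


# Toy example (hypothetical sequences, not data from this study).
out = "ACGTACGTAC"
s1 = "TCGTTCGTTC"   # three lineage-specific changes
s2 = "ACGAACGTAC"   # one lineage-specific change
print(tajima_relative_rate(s1, s2, out))
```

A large p value, as in the four tests reported above, means that equal rates in the two lineages cannot be rejected.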


Fig. 2-4: Comparative FISH analysis among Primates. Metaphase (right) and interphase nuclei (left) from a representative panel of Old World (OWM), New World monkeys (NWM) (MFU=Macaca fascicularis, PCR=Presbytis cristata, PAN=Papio anubis and CMO=Callicebus mollochus) and hominoid species (HSA=Homo sapiens, PTR=Pan troglodytes, PPA=Pan paniscus, GGO=Gorilla gorilla, PPY=Pongo pygmaeus) are shown that have been hybridized with probes (16.1/9 and 16.8/12). The results are depicted in the context of a generally accepted phylogeny of the species (Ref. 6). Roman numerals above metaphase chromosomes are according to standard cytogenetic nomenclature. Note the multiple copies of the repeat located on XVI among the hominoids which effectively appear to paint the short arm of the chromosome. Reciprocal experiments using probes derived from other primate species were used to eliminate the possibility of false negative signal (Methods). The orangutan interphase also shows hybridization of a human chromosome XVI paint (green fluorescence). In this species, copies of the LCR16a duplicon have spread to the pericentromeric region of chromosome XIII.


Using 25 million years (my) as an estimate of the timing of separation between human and the Old World monkeys (Goodman 1999), we calculated the rate of nucleotide substitution for intronic sequence as 1.5 × 10⁻⁹ (S.E. ± 0.14) substitutions per site per year. This estimate is remarkably consistent with previously published neutral rates between human and Old World monkeys (1.2-1.8 × 10⁻⁹) (Li 1997). Based on this rate of nucleotide substitution, we predict that the duplication events identified within the gibbon and orangutan occurred independently from those of human and Great Apes. In the case of humans and chimpanzee, our analysis indicates that duplications occurred both before and after the separation of these two lineages (Fig. 2-5a). The observed quantitative and qualitative differences in the localization of some of the copies (Fig. 2-4) support this conclusion.
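The arithmetic behind this rate follows the formula r = K/2T given in the Methods and can be checked with a short Python sketch. The K values are taken from Table 2-1; the function and variable names are assumptions introduced here. The second calculation is only an illustration of how the rate converts a distance into time and is not a date reported in this chapter.

```python
def substitution_rate(K, T_years):
    """Rate r = K / 2T: substitutions per site per year, where K is the
    distance between two lineages that separated T_years ago."""
    return K / (2.0 * T_years)


def divergence_time(K, rate):
    """Invert r = K / 2T to estimate the time since two sequences separated."""
    return K / (2.0 * rate)


K_human_baboon = 0.074   # Table 2-1, HSA vs PHA
T_split = 25e6           # human / Old World monkey separation (Goodman 1999)

r = substitution_rate(K_human_baboon, T_split)
print(f"r = {r:.2e} substitutions per site per year")   # r = 1.48e-09 ...

# Illustration only: average human-chimpanzee distance from Table 2-1.
K_human_chimp = 0.021
t = divergence_time(K_human_chimp, r)
print(f"t = {t / 1e6:.1f} million years")               # t = 7.1 million years
```

Because the between-species K values in Table 2-1 are averages over paralogues, the second figure describes an average separation of copies and should not be read as the date of any single duplication event.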

Positive Selection

Alignments of the human paralogous segments revealed that regions corresponding to coding exons were conspicuously hypervariable (10% nucleotide divergence when compared to intronic sequences that exhibited ~2% divergence). The increased frequency of substitution suggested rapid genic evolution had occurred along with the genomic dispersal of the LCR16a duplication. Increased substitutions among exons are a hallmark feature of genes undergoing adaptive evolution (Vacquier, Swanson et al. 1997; Nurminsky, Nurminskaya et al. 1998; Duda and Palumbi 1999; Wyckoff, Wang et al. 2000). A common test of positive selection is to compare the number of non-synonymous substitutions per site (Ka) to the number of synonymous substitutions per site (Ks) (Li 1997). Ka/Ks ratios significantly greater than 1.0 are taken as evidence for positive selection. We assessed the Ka/Ks ratio for two of the most rapidly diverging exons (exons 2 and 4; Fig. 2-5b and Fig. 2-6).


Fig. 2-5: Phylogeny of coding and non-coding portions of the LCR16a duplication. Neighbour-joining phylogenetic trees for a) 1421 bp of intronic sequence (introns 2, 3, 4, Fig. 2-1c) and b) 186 bp of exon 2 are compared. Extreme positive selection for exon 2 is indicated on the branch separating humans and African apes from the orangutan lineage (a 35-fold excess of amino-acid changes when compared to the neutral model). Note the significantly shorter branch lengths for flanking non-coding intronic sequences, which are consistent with nucleotide sites evolving at a neutral rate. More than 95% of the informative sites for the phylogenetic tree of exon 2 are the result of amino-acid altering nucleotide changes. Branches showing significant positive selection are indicated by arrows with accompanying ba/bs ratios (estimated amino-acid replacement and synonymous changes per branch per site (Zhang 1998)). Significance was calculated based on the difference (*p<0.05, **p<0.01). Sequences for the various duplicate copies are identified by species acronym (PHA=Papio hamadryas, HKL=Hylobates klossi and see Fig. 2-4) and a number corresponding to clone and/or accession within GenBank (Supplement Table 2-3). Scale bar, Jukes-Cantor corrected distance. The midpoint of all trees was set to 1/2 the distance between gibbon and baboon sequence taxa. Only bootstrap values >50% are shown (n=1000 replicates). A similar topology showing positive selection for exon 4 sequence was obtained (Fig. 2-6).

Genomic subclones were obtained for the various copies of LCR16a from chimpanzee, gorilla, orangutan, gibbon, and several Old World Monkeys and the exonic regions were comparatively sequenced (Methods). The average number of nonsynonymous and synonymous substitutions per site for all between-group and within-group comparisons was calculated independently for each exon (MEGA2, Modified Nei-Gojobori method; Table 2-2 and Table 2-3). A statistical test of the difference of average Ka and Ks values both between species and for multiple copies within species was used as a measure of significance.

Table 2-2: Positive selection of Exon 2

Exon 2     Ka (SE)        Ks (SE)        Ka/Ks   Ka-Ks (SE)       Z value   p
HSA-PTR    0.189(0.035)   0.042(0.021)   4.5     0.147(0.043)     3.42      <0.0001
HSA-GGO    0.176(0.031)   0.066(0.026)   2.67    0.110(0.036)     3.06      <0.01
HSA-PPY    0.350(0.065)   0.097(0.048)   3.61    0.254(0.080)     3.18      <0.0005
HSA-HKL    0.345(0.062)   0.098(0.045)   3.52    0.247(0.077)     3.21      <0.0005
HSA-OW     0.429(0.080)   0.033(0.016)   13.00   0.396(0.081)     4.89      <0.00001
PTR-GGO    0.182(0.035)   0.069(0.028)   2.64    0.114(0.041)     2.78      <0.01
PTR-PPY    0.341(0.064)   0.101(0.048)   3.38    0.240(0.077)     3.11      <0.0005
PTR-HKL    0.334(0.062)   0.102(0.046)   3.27    0.232(0.075)     3.09      <0.01
PTR-OW     0.423(0.078)   0.036(0.016)   11.75   0.386(0.078)     4.95      <0.00001
GGO-PPY    0.361(0.067)   0.112(0.048)   3.22    0.248(0.083)     2.99      <0.01
GGO-HKL    0.350(0.066)   0.113(0.047)   3.10    0.244(0.081)     2.77      <0.01
GGO-OW     0.420(0.078)   0.054(0.024)   7.78    0.366(0.077)     4.75      <0.0001
PPY-HKL    0.025(0.011)   0.038(0.024)   0.66    -0.012(0.020)    -0.55     NS
PPY-OW     0.113(0.030)   0.071(0.046)   1.59    0.042(0.050)     0.84      NS
HKL-OW     0.105(0.029)   0.072(0.044)   1.46    0.033(0.051)     0.65      NS
HSA-HSA    0.190(0.032)   0.046(0.030)   4.75    0.150(0.041)     3.66      <0.001
PTR-PTR    0.181(0.039)   0.046(0.022)   3.93    0.135(0.046)     2.93      <0.01
GGO-GGO    0.152(0.031)   0.067(0.030)   2.27    0.085(0.037)     2.30      <0.01
PPY-PPY    0.033(0.013)   0.037(0.027)   0.89    -0.004(0.023)    -0.17     NS
HKL-HKL    0.021(0.012)   0.044(0.028)   0.48    -0.026(0.024)    -1.08     0.95
OW-OW      0.022(0.012)   0(0)           NA      0.022(0.011)     2.00      <0.05

HSA=Homo sapiens (n=15 paralogous sequences), PTR=Pan troglodytes (n=21), GGO=Gorilla gorilla (n=21), HKL=Hylobates klossi (n=8), and PPY=Pongo pygmaeus (n=9). OW=Cercopithecus aethiops, Papio anubis, Papio hamadryas, and Macaca fascicularis. Ka=Average number of amino-acid substitutions per site between and within group comparisons. Ks=Average number of synonymous substitutions per site between and within group comparisons. Ka-Ks=Difference between Ka and Ks. p=probability that observed difference was due to chance (one-tailed Z test). NS=No significant positive selection detected. n=the number of paralogous sequences considered within each group comparison.
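The Z values in Table 2-2 correspond to the difference Ka-Ks divided by its standard error. The following Python sketch reproduces that step for one row. It assumes a standard normal approximation for the one-tailed probability, so the p value it computes need not match the thresholds reported in the table. The function name is an assumption introduced here.

```python
import math


def z_test_positive_selection(ka, ks, se_diff):
    """One-tailed Z test of Ka > Ks.

    ka, ks  : average nonsynonymous / synonymous substitutions per site
    se_diff : standard error of the difference (Ka - Ks)
    Returns (difference, Z, one-tailed p under a normal approximation).
    """
    d = ka - ks
    z = d / se_diff
    p = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of N(0, 1)
    return d, z, p


# HSA-PTR row of Table 2-2: Ka = 0.189, Ks = 0.042, SE of difference = 0.043.
d, z, p = z_test_positive_selection(0.189, 0.042, 0.043)
print(f"Ka-Ks = {d:.3f}, Z = {z:.2f}")   # Ka-Ks = 0.147, Z = 3.42
print(f"one-tailed p = {p:.2g}")
```

The test is one-tailed because only an excess of amino-acid replacements (Ka > Ks) is evidence for positive selection; a negative difference, as for the orangutan and gibbon paralogues, gives a p value above 0.5.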

In the case of exon 2, highly significant Ka/Ks ratios (p<0.0005) were observed among all comparisons involving either human or chimpanzee. The most extreme positive selection was observed between human and the Old World Monkeys (Ka/Ks = 13.0, p < 10⁻⁵) and between chimpanzee and the Old World Monkeys (Ka/Ks = 11.8, p < 10⁻⁵). This level of amino acid replacement translates into ~43% amino acid divergence between these species and a rate of amino acid replacement of ~1.0 × 10⁻⁸ changes per site per year for this exon. This is far in excess (20-fold) of most typical estimates (Li 1997) of protein divergence between Old World and Great Ape species. Highly significant differences were also observed when comparing paralogues between chimpanzee and human sequences (Ka/Ks = 5.0, p<0.0001) with an average amino acid divergence of 23% among the paralogous exons. To identify more precisely when the major episode of positive selection occurred, we estimated the number of synonymous and nonsynonymous nucleotide substitutions per site for each branch of a phylogenetic tree using the method first proposed by Zhang (Zhang, Rosenberg et al. 1998). A major burst of positive selection appears to have occurred after the separation of the human and chimpanzee lineages from the orangutan (<12 mya, Ka/Ks = 35.0), with subsequent protein diversification events occurring during the emergence of chimpanzee and human species (compare Fig. 2-5). A comparison with gorilla (Table 2-2) confirms that the major effect occurred in a common ancestor to humans and the African apes. In stark contrast, the paralogues within orangutan and gibbon species have not experienced bursts of rapid positive selection (Table 2-2).

Similar to exon 2, analysis of exon 4 sequences showed a significant episode of positive selection after the separation of the chimpanzee/human and orangutan lineages (Ka/Ks = 4.67, p<0.05; Fig. 2-6).


Fig. 2-6: Neighbour-Joining Evolutionary Tree for 146 bp of coding DNA from exon 4. The overall topology of the tree is similar to exon 2, indicating extensive positive selection occurring after the separation of orangutan from the African Ape lineage. Scale bar, Jukes-Cantor corrected distance. The midpoint of all trees was set to 1/2 the distance between gibbon and baboon sequence taxa. Only bootstrap values >50% are shown (n=1000 replicates). Branches showing significant positive selection are indicated by arrows with accompanying ba/bs ratios (estimated amino-acid replacement and synonymous changes per branch per site, (Zhang, Rosenberg et al. 1998)). Significance was calculated based on the difference (*p<0.05, **p<0.01).

Although there is a dramatic increase in the number of putative amino acid replacements (~30% between chimp/human and Old World monkey species), much more modest Ka/Ks ratios (1.81-2.45, Table 2-3) are observed in between- and within-species comparisons. This reduction is primarily due to the accelerated number of synonymous events that have occurred concurrently with non-synonymous changes. These events have occurred precisely within the putative coding regions of the exons and do not extend into flanking intronic sequence. Trivial explanations for the enhanced rates of non-synonymous and synonymous changes were examined, including unusual CpG content, codon bias and the presence of hypermutable repeat sequences within the exonic regions. No evidence in support of these alternatives could be found. It should be noted, however, that alternative splicing has been observed for exon 4 among human cDNAs (e.g. AF229069, D86974 and Fig. 2-3). It is possible that some paralogues in different species have also experienced alternative splicing resulting in new protein products whose open reading frames no longer conform to that predicted by our human reference cDNA (AF132984). The result of such an apparent frameshift would be to increase both synonymous and non-synonymous rates if both splice variants were represented in each species.

Table 2-3: Positive selection of Exon 4

HSA=Homo sapiens (n=15 paralogous sequences), PTR=Pan troglodytes (n=19), HKL=Hylobates klossi (n=7), and PPY=Pongo pygmaeus (n=14). OW=multiple Old World monkey species were grouped (each species was represented by a single copy sequence): Cercopithecus aethiops, Papio anubis, Papio hamadryas, Presbytis cristata and Allenopithecus. Ka=Average number of amino-acid substitutions per site between and within group comparisons. Ks=Average number of synonymous substitutions per site between and within group comparisons. Ka-Ks=Difference between Ka and Ks. p=probability that observed difference was due to chance (one-tailed Z test). NS=No significant positive selection detected.

Furthermore, it should be noted that no distinction in this study has been made between functional and non-functional paralogues since this would require detailed expression and protein analyses. We felt that this treatment was conservative since comparisons and alternative splicing would tend to neutralize both adaptive as well as purifying selection constraints. Consequently, such events would cause Ka/Ks ratios to approximate unity and reduce our power to detect positive selection.

Gene Function

Although the precise function(s) of this gene family remains to be elucidated, it is noteworthy that previous examples of positive selection have included either genes involved in xenobiotic recognition of macromolecules (immunoglobulin genes, venom toxins, lysozymes) (Hughes and Nei 1988; Messier and Stewart 1997; Zhang, Rosenberg et al. 1998; Duda and Palumbi 1999) or genes associated with male reproduction (Vacquier, Swanson et al. 1997; Nurminsky, Nurminskaya et al. 1998; Ting, Tsaur et al. 1998; Wyckoff, Wang et al. 2000). In many of these cases, positive selection has occurred in concert with duplication events. Delineation of the function of this gene family will require detailed experimental analysis. In humans, multiple transcripts (n=284) with open-reading frames have been recovered demonstrating clear transcriptional and splicing potency. RT-PCR analysis confirms a broad distribution of this gene family in most human tissues (Fig. 2-7).


Fig. 2-7: RT-PCR Analysis of LCR16a Gene Family. cDNA was prepared from a panel of human tissue mRNAs. Oligonucleotides (16.16/17) were designed within exons 2 and 4 (AF132984 model) to amplify putative transcripts from 8 of the 15 LCR16a segments. Three families of transcripts can be distinguished by length: C (311 bp) = represented by AF132984 (HSA9) gene model, B (363 bp)=D86974 (HSA6, alternative splicing of 48 bp of exon 4) and A (417 bp)=represented by BF109282, (HSA11, 54 bp insertion in exon 2). Further subcloning and sequencing is required to resolve potential sequence heterogeneity of each set of bands.

Finally, immunolocalization studies performed with GFP-fusion constructs reveal a clear nuclear membrane localization for at least one translated member of this gene family. Colocalization of these products with antibodies raised against membrane-bound nucleoporin (p62) (Davis and Blobel 1986) further indicates that this particular human copy may associate with the nuclear pore complex (Supplement Fig. 2-1).

Conclusions

Our analysis has revealed an extraordinary degree of evolutionary plasticity, at the level of both the genome and the gene. We provide evidence for the evolution of a “new” hominoid gene family by recent duplication and positive selection. Can additional examples within the human proteome be expected? Preliminary analysis of the human genome suggests that as much as 5-7% of all human sequence may have been duplicated within the last 30 million years of evolution. The abundance of segmental duplications may be an important reservoir for the emergence of other new hominoid genes that do not possess definitive orthologues in the genomes of model organisms.

METHODS

Human Genome Analysis

Sequence similarity searches of GenBank (release 121.0) identified a total of 41 human accessions that contained a complete copy of the LCR16a repeat. Since the degree of sequence similarity among these copies approached levels of allelic variation (98-99%), comparison of unique sequences flanking the duplications and partial sequencing of chromosome 16 cosmids (LA16NC02) were used to distinguish various paralogues from allelic overlap. The human library is derived from a single chromosome 16 haplotype, thereby allowing sequence variants to effectively classify duplicated copies (Horvath, Schwartz et al. 2000). A suite of genomic software tools was used to analyze and characterize the duplications, including PARASIGHT (Consortium 2001) to delineate the junction sequences and the extent of overlap for each duplicated segment, ALIGN to perform optimal global pairwise alignments between copies and sim4 to optimally compare cDNA vs genomic DNA (Horvath, Schwartz et al. 2000). Only pairwise sequence alignments greater than 1 kb with a minimum of 90% identity were considered in this analysis (supplemental data). Unique sequence differences within the predicted exons from genomic sequence compared to EST sequences were used to identify transcriptionally competent loci. Protein structure analysis software (http://bmerc-www.bu.edu/psa and http://maple.bioc.columbia.edu/predictprotein) predicted the presence of a single transmembrane domain flanked by alpha-helical secondary structure for the AF132984 ORF (347 aa).

Fluorescence in situ Hybridization

Chromosome metaphase and interphase nuclei were prepared from lymphoblastoid cell lines representative of five hominoid species (H. sapiens, P. troglodytes, P. paniscus, G. gorilla, P. pygmaeus), three Old World monkeys (Papio anubis, Presbytis cristata, Cercopithecus aethiops) and one New World monkey (Callicebus mollochus). In situ hybridizations were performed under standard conditions (Lichter, Bray et al. 1992) with two genomic probes (16.1/9 and 16.8/12; Fig. 2-1c) subcloned from one of the paralogous copies (AC002039). To eliminate the effect of crosshybridization of common repeat sequences, probes were blocked by Cot DNA prior to hybridization. At least 20 independent metaphase and interphase nuclei were examined in the determination of copy number and chromosomal band location. The combined probes spanned 11.5 kb of genomic sequence and included 7/8 exons of the AF132984 cDNA. Reciprocal experiments using probes derived from baboon and gibbon were used to confirm the specificity of hybridization. When necessary, hybridizations were performed in conjunction with human whole-chromosome painting probes to confirm chromosomal assignment (orangutan, Fig. 2-4).

Library Hybridization and Sequencing

Large insert genomic libraries from human (LA16NC02), chimpanzee (RPCI-43: Pan troglodytes), gibbon (DKZ-140: Hylobates klossi), and the olive baboon (RPCI-41: Papio hamadryas) were hybridized with PCR-amplified products (16.1/9 and 16.8/12; for PCR oligonucleotide sequence and PCR conditions). All hybridizations were performed as previously described (Horvath 2000). A total of 156 genomic clones (70 human, 75 chimp, 10 gibbon and 1 baboon) were comparatively sequenced (1753 bp were examined, partitioned into 1421 bp of intronic sequence and 332 bp of sequence from exons 2 and 4). All PCR products (forward and reverse reactions) were directly sequenced using a modified dye-terminator sequencing protocol (Horvath, Schwartz et al. 2000). Non-human sequences were deemed to be paralogous if more than 2 sequence differences were observed within 150 bp of coding sequence. All paralogues were encoded by species name and numbered according to clone and/or accession identifier (Supplement Table 2-3). End-sequences generated from the cloning site (T7 and T3 or T3 and SP6) were used to further position specific paralogous copies with respect to the human genome reference. In all cases, the duplicated sequence is flanked either directly or within ~70 kb by non-duplicated unique sequence. As a result, end-sequence analysis allows a subset of BAC clones from different species to be unambiguously placed based on alignment to unique sequence on either side spanning the duplication. For exons 2 and 4, additional sequence was generated by TA-subcloning of PCR-amplified product from orangutan (Pongo pygmaeus) and direct PCR sequencing of products from various Old World monkeys (Macaca fascicularis, Presbytis cristata and Cercopithecus aethiops).

Sequence Analysis

Estimates of genetic distance (pairwise deletion) were calculated using the Jukes and

Cantor one parameter model (when transition/transversion ratios, i/v ~ 1.0) or Kimura’s

two parameter model (when i/v~2.0) (Jukes 1969). A Tajima’s relative rate test was

performed (Tajima 1993) using orthologous sequence pairs (HSA13 vs. PTR3, PTR8 vs.

HKL1, HKL1 vs. HSA13 and HSA3 vs. PTR17; see Supplemental Table 2-3) from

human, chimpanzee and gibbon intronic sequence and specifying baboon sequence (PHA)

as the outgroup. Four such tests were used to accept the molecular clock hypothesis for

the non-coding sequences under study in this analysis. Estimates of duplication timing

were based on 1421 bp of noncoding sequence and were calculated using the formula

(r=K/2T), the baboon sequence as a reference orthologue and an estimated time of separation from the hominoid lineage of 25 mya (Goodman 1999). For exonic sequence, the average number of synonymous (Ks) and nonsynonymous (Ka) substitutions per site

were estimated using the modified Nei-Gojobori Method (Nei and Gojobori 1986; Zhang,

Rosenberg et al. 1998). To test for positive Darwinian selection, we calculated the

difference (D= Ka-Ks) both within and between groups for all pairwise comparisons of

paralogues. Groups, here, are defined as species. Due to the large number of pairwise

analyses performed, ideally significance levels should be corrected for multiple

comparisons. Since the comparisons are not independent of one another, the usual

Bonferroni method cannot be used. Instead, the average difference for all comparisons

and its associated standard error were computed. Initially, all possible comparisons were

made among the sequenced exons and the difference between amino-acid and

synonymous substitutions was calculated for each pairwise comparison. The differences were

averaged between and within groups (within groups included multiple duplicate copies

within each species). The variance for the average difference was estimated using the

bootstrap method (n=1000 replicates) and a one-tailed Z test (Z=D/sD) was used to determine the

level of significance (Nei and Kumar 2000). Positive selection was defined as a

significant positive difference. Evolutionary trees of multiple aligned sequences

(ClustalW) were generated using Neighbor-joining distance estimates (MEGA2). Only

bootstrap values >50% are indicated in the tree topology. Internal branch estimates of the

number of synonymous (bs) and nonsynonymous (ba) substitutions per site were

determined using the method of Zhang et al. (Zhang, Rosenberg et al. 1998).
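The distance and timing calculations described in this section can be sketched in Python as follows. This is an illustrative sketch rather than the MEGA2 implementation used in the study. It shows the standard Jukes-Cantor and Kimura two-parameter distance formulas, the rate relation r = K/2T and its inversion to date a duplication, and the one-tailed Z statistic Z = D/sD. All function names are hypothetical, and the numbers in the usage lines are placeholders, not data from this work.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance from the proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(P, Q):
    """Kimura two-parameter distance from transition (P) and transversion (Q) proportions."""
    a = 1 - 2 * P - Q
    b = 1 - 2 * Q
    if a <= 0 or b <= 0:
        raise ValueError("P and Q are too large for the K2P correction")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


def substitution_rate(K, T):
    """r = K / (2T): K is the distance between two lineages that split T years ago."""
    return K / (2 * T)


def duplication_time(K_paralogues, r):
    """Invert r = K/2T to estimate when two paralogues diverged: T = K / (2r)."""
    return K_paralogues / (2 * r)


def z_statistic(D, s_D):
    """Z = D / s_D, where D = Ka - Ks and s_D is its bootstrap standard error."""
    return D / s_D


def one_tailed_p(z):
    """Upper-tail probability of a standard normal deviate."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Placeholder inputs for illustration only (not values from this study).
r = substitution_rate(K=0.10, T=25e6)        # outgroup split assumed at 25 mya
t_dup = duplication_time(K_paralogues=0.04, r=r)
p_val = one_tailed_p(z_statistic(D=0.05, s_D=0.02))
```

Calibrating r on the baboon outgroup and then applying it to paralogue distances presumes a roughly constant rate, which is why the relative rate tests described above are run first.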

ACKNOWLEDGMENTS

We thank W. E. Kutz and D. Zivkovic for technical assistance and sequencing analyses.

This work was supported by grants from the National Institutes of Health and the US

Department of Energy to E.E.E., and grants from Progretti di Interesse Nationale (PRIN),

Centro Eccelenza (CE), Ministero per la Ricerca Scientifica e Tecnologica (MURST) and

Telethon to M.R. We are grateful to C. I. Wu, A. Chakravarti, D. Cutler, D. Locke, G.

Matera and H. Willard for comments on this manuscript.

APPENDIX:

Supplementary Figures:

Supplement Fig. 2-1: In situ Localization of AF132984 (corresponding to genomic accession AC002045, HSA9) to the nuclear envelope. COS-7 cells were grown on coverslips and transfected with the GFP-AF132984 construct. The cells were fixed and probed with anti-p62 antibodies (a component of the nuclear pore complex) and rhodamine conjugated goat anti-mouse secondary antibody. Panel A is the fluorescent image of GFP-AF132984 fusion protein showing the punctate nuclear rim staining, panel B shows the antibody staining of anti-p62 and panel C is the superimposed image (the bar represents 10 microns). A higher magnification view of the punctate nuclear rim co-localisation staining is shown in panel D (the bar represents 5 microns).

Supplement Fig. 2-2: Multiple Sequence Alignment of putative proteins for exons 2 and 4. Only differences among 14 copies from various species are shown. For human, only duplicates that have corresponding ESTs are translated. Conserved residues are shaded gray.

Supplementary Tables:

Supplemental Table 2-1: Genetic distance (K = Jukes-Cantor) and size of LCR16a genomic duplications. K (Jukes-Cantor) and S.E.

K values for human copies (lower left) and S.E. (upper right). Copy size based on pairwise alignments. * = GenBank accessions corresponding to full-length cDNAs.

Supplemental Table 2-2: Duplication structure of Chromosome 16 Loci

Duplications a-e as defined in Ref #3. Duplications o-s defined (this study) according to last reported chromosome 16 low-copy repeat. Genetic distance (K = Kimura 2-parameter) and % identity calculated as the average of all pairwise distances among duplicated segments. Chromosomal location based on assignments of sequences within www.ucsc.genome assembly (Jan. 1, 2001).

Supplemental Table 2-3: Clones and Deposited Accessions

* = GenBank accessions corresponding to full-length cDNAs (AF132984 and D86974, respectively)


Chapter 3

Recurrent duplication-driven transposition of DNA

during hominoid evolution.

Matthew E. Johnson1,2, NISC Comparative Sequencing Program3, Ze Cheng1, V. Anne Morrison6, Steven Scherer4, Mario Ventura5, Richard A. Gibbs4, Eric D Green3, Evan E. Eichler1,6.

1 Department of Genome Sciences and the Howard Hughes Medical Institute6, University of Washington, Seattle, Washington, 98195, USA

2Department of Genetics and Center for Human Genetics, Case Western Reserve School of Medicine and University Hospitals of Cleveland, Cleveland, OH, 44106, USA

3 Genome Technology Branch and NIH Intramural Sequencing Center (NISC), National Human Genome Research Institute, Bethesda, Maryland 20892, USA

4Baylor College of Medicine HGSC, One Baylor Plaza, Houston, TX, 77030, USA

5Sezione di Genetica, DAPEG, University of Bari, 70126 Bari, Italy

Note: This manuscript has been published: Matthew E. Johnson, NISC Comparative Sequencing Program, Ze Cheng, V. Anne Morrison, Steven Scherer, Mario Ventura, Richard A. Gibbs, Eric D. Green, and Evan E. Eichler Eukaryotic Transposable Elements and Genome Evolution Special Feature: Recurrent duplication-driven transposition of DNA during hominoid evolution PNAS 103: 17626-17631. M.V. provided primate comparative FISH analysis. NISC, S. S., and E. D. G. provided primate BAC Sequencing. Z. C. provided primate BAC analysis assistance. V. A. M. provided Orangutan Genomic library hybridization and BAC end-sequencing assistance.

ABSTRACT

The underlying mechanism by which the interspersed pattern of human segmental duplications has evolved is unknown. Based on a comparative analysis of primate genomes, we show that a particular segmental duplication (LCR16a) has been the source locus for the formation of the majority of intrachromosomal duplication blocks on human chromosome 16. We provide evidence that this particular segment has been active independently in each great ape and human lineage at different points during evolution.

Euchromatic sequence flanking sites of LCR16a integration are frequently lineage-specific duplications. This process has mobilized duplication blocks (15-200 kb in size) to new genomic locations in each species. Breakpoint analysis of lineage-specific insertions suggests coordinated deletion of repeat-rich DNA at the target site, in some cases deleting genes in that species. Our data support a new model of duplication where the probability that a segment of DNA becomes duplicated is determined by its proximity to core duplicons, such as LCR16a.

INTRODUCTION

Based on the current sequenced genomes, human genomic architecture is unique in the abundance of large segmental duplications that are interspersed at discrete locations in the genome (Bailey, Yavor et al. 2001; Bailey, Gu et al. 2002; Cheung,

Estivill et al. 2003; She, Horvath et al. 2004; Zhang, Lu et al. 2005). While recent duplications are common among other animal genomes, they are typically organized as clusters of tandemly arrayed segments (She, Liu et al. 2006). In humans and other great ape genomes, ~450 duplication hubs have been identified that have been the target of duplications from many different ancestral loci. This has created regions of the genome that are complex mosaics of different genomic segments (She, Liu et al. 2006) where novel genes, fusion genes and gene families have emerged (Courseaux and Nahon

2001; Johnson, Viggiano et al. 2001; Bailey, Gu et al. 2002; Paulding, Ruvolo et al. 2003;

Ciccarelli, von Mering et al. 2005; Vandepoele, Van Roy et al. 2005). Detailed studies of a few of the underlying regions (Courseaux and Nahon 2001; Eichler 2001; Stankiewicz,

Shaw et al. 2004) suggest that duplications have occurred in a stepwise fashion involving subsequent larger segments of duplication as secondary events. The mechanism by which

hundreds of kb of genomic sequence become duplicatively transposed to a new location on a chromosome is unknown.

Human chromosome 16 represents one of the most extreme examples of such recent segmental duplication activity (Stallings, Doggett et al. 1992). More than 10% of the euchromatic portion of human chromosome 16p consists of segmental duplications known as LCR16 (low-copy repeat sequences on chromosome 16) (Stallings, Whitmore et al. 1993; Loftus, Kim et al. 1999). During the initial sequence analysis of this chromosome, Loftus and colleagues identified at least 20 distinct gene-rich LCR16 elements ranging in size from a few kb to >50 kb in length, termed LCR16a-t. The

majority of these were duplicated in an interspersed configuration throughout the chromosome. We subsequently identified a gene family, morpheus, within LCR16a which showed significant signatures of positive selection (Ka/Ks ratios up to 13.0 between humans and Old World Monkey species). Finished chromosome 16 sequence

(Martin, Han et al. 2004) provided the basis for a detailed analysis of these regions. We interrogated the detailed organization of these regions among non-human primate species by sequencing large-insert clones from a diversity panel of primates, to address questions regarding the mechanism of origin, the extent of structural variation among primates and the relationship of these complex structures to the rapidly evolving LCR16a segment.

RESULTS

Human LCR16 Genome Organization

In humans, there are 17 complex blocks of LCR16 duplication (4.2 Mb of sequence) that contain 25 distinct copies of LCR16a with fewer copies of other flanking

LCR16 segmental duplications (Table 3-1, Fig. 3-1 on human chromosome 16). Three blocks map to 16q22, while the remainder are distributed along the short arm of chromosome 16 where they occupy an estimated 11% of the euchromatin. The duplication blocks range in size from 604,376 bp (16p12.1/11.2) to solo copies of the LCR16a element (~19,794 bp in length) (Table 3-2, Figure 3-1).

Table 3-1: LCR16 Segmental Duplications in Human Genome

LCR16 classification based on Loftus et al. All calculations based on whole-genome assembly comparisons for segmental duplications build34 (hg16). Gene content refers to homologous Refgene gene structure embedded within LCR16 as determined by sim4; many correspond to partial duplicates of the ancestral gene.

Fig. 3-1: LCR16 Organization in Human and Baboon. The location, copy number and structure of LCR16 duplications are depicted within the context of an ideogram for human (Left) and Papio hamadryas (PHA) (Right) based on the human genome reference sequence (hg16), BAC-end sequencing and complete clone insert sequence of baboon clones. With the exception of the ancestral loci, duplication blocks are enumerated based on their position (p-q) on human chromosome 16 (Table 3-2).

Table 3-2: Sequencing Table of Primate LCR16 Loci

Non-human primate LCR16 clones were sequenced (GenBank) and assigned to their best location in the human genome (cytogenetic and hg16 coordinates). Locations were determined based on alignment of unique flanking anchor sequences or the longest alignment (when no unique anchor could be identified). Each human locus was assigned a numerical identifier and subsequent orthologous primate BACs were identified with the same numeric identifier (e.g. PTR:25 and HSA:25 are orthologous by placement of unique flanking anchors). If the map location of the LCR16 duplications did not contain duplications (e.g. new insertions) primate loci were identified with a letter designation. Each non-human primate LCR16 accession was analyzed for duplication content by cross-matching against a repeat library database of each human duplicon consensus. In some cases, only a partial sequence of the duplicon was discovered and is indicated by brackets (e.g. (e4kb) = only a 4 kb segment of the ancestral LCR16e duplicon was found). The order of the duplicons is shown as found in the aligned sequence.

Of the eleven other LCR16 elements considered in this analysis (Table 3-1), all map within 109 kb of an LCR16a duplication. After excluding ancestral segments, we find only one exception where a block exists (LCR16uw, Fig. 3-1, Table 3-4) without a full-length (20 kb) LCR16a element. In contrast, two distinct “solo” LCR16a elements have been identified that are not associated with other duplicated segments (Table 3-2). This includes a single rogue segment that has been mapped outside of chromosome 16 to human 18p11.

Table 3-3: LCR16 Copy Number among Primates

Copy number was based on experimental hybridization and BAC-end sequencing results (Methods). *Indicates that copy number estimate was based on the hg16 or rheMac2 sequence assembly.

Single Copy Architecture of Old World Monkey Loci

To investigate the evolutionary history of these complex genomic regions, we systematically recovered large-insert genomic clones corresponding to each human

LCR16 segment from five non-human primate species including chimpanzee, gorilla, orangutan, macaque and baboon. We designed a total of 12 probes, one corresponding to

each of the LCR16 duplications and hybridized each independently to available genomic

BAC libraries (Methods). We identified 782 clones and estimated the copy number and co-occurrence of various LCR16 segments in these different species (Table 3-3 and Table

3-4). The BAC hybridization results revealed that the majority (11/12) of the LCR16 elements are single copy in Old World monkey (OWM) outgroup species (macaque and baboon) (Table 3-5) and copy number increases have occurred in a step-wise fashion based on the inferred phylogenetic relationship of these species (Table 3-3).

Table 3-4: Co-occurrence of Human LCR16 Probes

Since many of the duplicons are duplicated in concert, we assessed the number of clones (brackets) that scored positive by both probes and the estimated copy number. *Based on hg16 and rheMac2 sequence assembly.

We observed a positional bias in the evolutionary order of these events. LCR16

segmental duplications located more distally from LCR16a, in general, are predicted to be

more recent than those that map in closer proximity to human LCR16. Finally, we note

that certain pairs of LCR elements (e.g. LCR16u and w, as well as LCR16i and c)

consistently co-hybridize to the same BACs including the single copy locus within OWM species, suggesting that these different duplicons originated from the same ancestral locus.

Table 3-5: Putative Ancestral LCR16 Loci

The most likely position of the ancestral locus (human hg16 coordinates) was determined by mapping the location of the single copy locus in two Old World monkey outgroup species: Macaca mulatta (MMU) and Papio hamadryas (PHA). Multiple clones from each species were end-sequenced and the end-sequences were aligned against the human genome to identify the likely ancestral position. In six cases, the corresponding baboon or macaque locus was sequenced (indicated by an asterisk). Phylogenetic analysis confirmed that in each case duplications occurred after the separation of Old World monkeys from the human lineage.

Duplication of Great Ape LCR16 Blocks into New Locations

Mapping and sequencing of LCR16 segmental duplications within primate

genomes has been problematic because the duplications are typically embedded in large

duplication blocks that may exceed 100 kb in size. For example, in the chimpanzee

genome, these regions are misassembled, highly fragmented or correspond to gaps (CSAC

2005). Large insert genomic clones, such as BACs, can help circumvent this problem

since BAC-end sequence (BES) may extend beyond the duplication blocks to anchor in

unique sequence (Yohn, Jiang et al. 2005). Such sequence anchors provide information

regarding the corresponding map position. We therefore selected 782 BACs for insert

end-sequencing, generating 526 pairs of end-sequence that were informative for mapping

purposes. Based on comparative mapping of macaque and baboon for each single copy

locus of LCR16, we unambiguously determined the most likely ancestral location of each

segmental duplication. They mapped to nine distinct locations that were consistent

between both outgroup species (Fig. 3-1, Fig. 3-2). With the exception of LCR16t,

LCR16a is not associated with any of these regions in Old World monkey species (Fig. 3-1).

Using a similar strategy, we attempted to assign locations for corresponding loci within ape genomes using BES data. In contrast to Old World monkeys, we identified multiple loci for each probe—the vast majority of which associated with LCR16a based on the hybridization results. We categorized ape loci as mapping to i) an orthologous locus (based on the identification of LCR16 duplications at that position in human), ii) an

ancestral position (based on map positions of single copy loci in baboon and macaque) or iii) a novel location (based on the absence of a corresponding duplication at that position in human) (Table 3-6, Fig. 3-2, Table 3-2).
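The three-way categorization described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study. A locus is represented by the human-coordinate interval of its unique anchor, or `None` when no unique anchor exists. Ancestral positions are checked before orthologous ones, an ordering the text does not specify. The names `Interval`, `overlaps` and `classify_locus`, and the "ambiguous" label, are hypothetical.

```python
from typing import Optional, Sequence, Tuple

# (chromosome, start, end) in human reference coordinates
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals on the same chromosome overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def classify_locus(
    anchor: Optional[Interval],
    human_lcr16_sites: Sequence[Interval],
    ancestral_sites: Sequence[Interval],
) -> str:
    """Assign an ape locus to one of the categories used in the text.

    anchor            -- where the clone's unique flanking sequence maps in human,
                         or None if both ends fall in duplicated sequence
    human_lcr16_sites -- positions that carry LCR16 duplications in human
    ancestral_sites   -- positions of the single copy loci in baboon and macaque
    """
    if anchor is None:
        return "ambiguous"       # no unique anchor, location cannot be assigned
    if any(overlaps(anchor, s) for s in ancestral_sites):
        return "ancestral"       # assumption: ancestral takes precedence
    if any(overlaps(anchor, s) for s in human_lcr16_sites):
        return "orthologous"
    return "novel"               # no corresponding duplication in human
```

In practice the anchor would come from aligning BAC-end sequences to the human assembly, and the overlap test would need some tolerance for insert size.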

Table 3-6: Primate Segmental Duplication BAC Sequencing Summary

62 LCR16 BAC clones from five non-human primate species were sequenced and aligned to the human genome. Loci were classified as new insertion or orthologous based on the presence of unique anchors between human and non-human primate genomes. Ancestral loci correspond to the putative ancestral locus based on map position of single loci in baboon and macaque. For 27 clones the precise map location could not be assigned because the entire insert consisted of segmental duplications. *In the case of orangutan, most mapped to chromosome 13 and therefore were "novel" insertions with respect to human and other apes but the map location could not be further refined by orthologous anchors.

We could assign 35 loci to one of these categories, while approximately 27 were

ambiguous (end-sequences placed in duplicated sequences in humans or other primates

preventing accurate assignment; see Methods). We observed a spatial clustering of new

insertions. Both sequence and BES data, for example, indicate that the distal portion of

chromosome 16 has been the target of novel LCR16 duplications particularly within the

chimpanzee lineage. Similarly, many of the novel orangutan insertions mapped to a 5 Mb

region on human 13q12.1-13q12.3 (Fig. 3-2).


Fig. 3-2: Map locations of primate LCR16a. Map location of LCR16a based on BAC-end sequence analysis. Map locations are shown against a human chromosomal ideogram. Arrows show the direction of transcription with respect to the morpheus gene model for the LCR16a duplication; crosses denote ancestral locus positions, and asterisks identify lineage-specific duplications.

Sequence Structure of New Insertions and Lineage-Specific Duplications

We sequenced 62 non-human primate LCR16-positive BAC clones generating

12.2 Mb of genomic sequence from five non-human primate species (Table 3-6). For each sequence, we identified the best location in the genome based on alignment of unique flanking sequence (Methods). Sequence data generated from both the macaque and baboon unambiguously confirmed synteny and structure of the most likely ancestral position (Fig. 3-1a and b) including previously recognized distinct duplicons mapping to

the same location (i.e. LCR16at and LCR16uw). We identified a minimum of 6 duplication blocks that were present at locations in ape genomes where there was no evidence of corresponding duplications in humans (Fig. 3-3; Table 3-2).


Fig. 3-3: Sequence alignment between human and non-human primate LCR16 Loci. a) Chimpanzee-specific insertion (AC097264) of 81,799 bp between genes, DRF1 and FLJ31795 on chromosome 17q21.31. The new insertion consists of three LCR16 duplicons that are shared between humans and chimpanzee (LCR16e, w & a) in addition to a flanking 16,820 bp lineage-specific duplication (LCR16a’, Table 3-7). 5,962 bp of human sequence corresponding to the preintegration site is deleted in chimpanzee (Table 3-10). The extent of duplication of the underlying sequence based on WSSD analysis is shown for human (light blue), chimpanzee (pink) and orangutan (dark blue) for this and all subsequent images. b) Chimpanzee-specific insertion (AC149436) of a segmental duplication mapping to chromosome 16p13.3. Insertion sequence (33,405 bp) consists of LCR16a and a chimpanzee-specific duplication (termed LCR16b’) of 7,403 bp which is single copy in human. A corresponding deletion of the integration site (16,100 bp) deletes the serine protease EOS gene in chimpanzee. c) A 230 kb sequence in orangutan that is completely duplicated (dark blue bar). Two different segments flank the LCR16aw segmental duplication, including a 109 kb segment corresponding to human chromosome 13q34 (chr13:112701583-112831134) and a 99 kb segment from chromosome 16 (chr16:11526252-11625727). Both segments are unique in chimpanzee and human. d) Orangutan genomic sequence (AC144879) shows the presence of an inserted duplication complex corresponding to human 13q12.11 (chr13:19666603-19839556). A 38,344 bp segment has been deleted corresponding to the site of insertion in human. Several orangutan-specific duplications are noted including a 21 kb flanking duplication that maps to the corresponding region in human. This shows that LCR16a and the ancestral locus (LCR16n’) were associated.

These new insertions which ranged in size from 33.4 kb to an estimated >200 kb were always accompanied by a copy of LCR16a. Moreover, PCR breakpoint analyses (see below) and FISH analyses (data not shown) confirmed that these events occurred specifically within each lineage. We note that the sequenced new insertions consist of both LCR16a as well as LCR16 elements flanking these regions suggesting duplicative transposition, in some cases, of large (>100 kb), complex sequences into new locations.

Table 3-7: Novel Segmental Duplications found in Association with LCR16

28 novel duplications (denoted ') were found in association with LCR16 in other primate species but not humans. The approximate size, gene content and map location within the human genome are shown. Duplication status of each was determined by whole-genome shotgun sequence detection strategy for human (HSA), chimpanzee (PTR) and orangutan (PPY). In most cases, the segmental duplications are specific to each lineage. In some cases, the sequences are duplicated but are not associated with LCR16 in humans. * May represent a human-specific deletion of the segment.

During our analysis of these new insertions, we observed segments that were duplicated specifically within each species (grey and black bars in Figs. 3-3, 3-4). These lineage-specific duplications ranged in size from a few kb to >80 kb in length and were frequently associated with genic regions (Table 3-7). There are two important structural properties regarding these lineage-specific duplications associated with novel insertions.

First, these lineage-specific duplications most frequently map at the periphery of duplicated segments that are shared between great apes (Figs. 3-3 & 3-4).


Fig. 3-4: The duplication architecture of primate loci is shown in the context of a NJ phylogenetic tree for LCR16a (2 kb non-coding sequence). Loci encoded by species (HSA (Homo sapiens), PTR (Pan troglodytes), GGO (Gorilla gorilla), PPY (Pongo pygmaeus) and PHA (Papio hamadryas)) and relative to human orthologous loci (Fig. 3-1). New insertions or ambiguous loci were given a letter designation.

These findings are consistent with hybridization (Table 3-3) and phylogenetic results

(below) which show evolutionarily younger duplicons accreting at the edges. Second, most of these peripheral lineage-specific segmental duplications originate from chromosomal regions where LCR16a activity can be documented as having recently

occurred (Fig. 3-3d and Fig. 3-5a). These associations with ancestral loci suggest that lineage-specific LCR16 segments originate in regions of prior LCR16a integration.

a) AC097333 Chimpanzee: build34 chromosome coordinates (eg. hg16 chr16:11909696-12073084)

b) AC148538 Chimpanzee: build34 chromosome coordinates (eg. hg16 chr16:69936319-70089422)

Fig. 3-5a, b: Sequence alignment of LCR16a loci. Non-human primate genomic sequence and corresponding segment in the human genome are compared (see Fig. 3-3 for more detailed description). In addition to the WSSD tracks, the estimated copy number of each region is indicated based on the corresponding number of reads per 5 kb.

As a more direct test of association with LCR16a, we performed a series of independent hybridization experiments with each of the eight lineage-specific duplications that were found in orangutan but were not identified as duplicated in chimpanzee or human. We estimated copy-number of each duplication and then cross-referenced positive clones by PCR to determine whether they were associated with LCR16a in the orangutan (Table 3-8). 72% (100/139) of clones detected using a lineage-specific probe were also positive for LCR16a. When BES data were used to eliminate the ancestral locus, we found that 94% (17/18) of the duplicated loci were in association with LCR16a.

We found only one exception where an orangutan-specific duplication had occurred without LCR16a. These data indicate that different intrachromosomal euchromatic duplications have emerged at different locations in a different lineage but focused, once again, around the LCR16a core.
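The co-occurrence percentages quoted above follow directly from the clone and locus counts given in the text. The few lines of Python below recompute them as a sanity check; the helper name `percent` is hypothetical.

```python
def percent(part, total):
    """Return part/total as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * part / total


# Counts taken from the text: clones positive for both probes,
# and duplicated loci associated with LCR16a after removing the ancestral locus.
clones_with_lcr16a = percent(100, 139)
loci_with_lcr16a = percent(17, 18)
print(f"{clones_with_lcr16a:.1f}% of clones, {loci_with_lcr16a:.1f}% of loci")
```

The gap between the two figures reflects the removal of clones from the ancestral locus, which carry the lineage-specific probe sequence but not LCR16a.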

Table 3-8: Co-occurrence of Orangutan LCR16 Probes

Copy number was based on experimental hybridization and BAC-end sequencing results (Methods). Numbers in the brackets are number of library positives and numbers outside brackets are estimation of copy numbers of duplicons.

Interestingly, both copy number and sequence divergence decrease in a gradient-like fashion as distance from LCR16a increases (Fig. 3-6). Thus, even though the chromosome, the location and the content of the segmental duplication differ, we observe

a virtually identical complex mosaic pattern of segmental duplications and polarity vis-à-vis LCR16a in different primate species.

Fig. 3-6: Copy-number and sequence divergence flanking LCR16a in a) orangutan and b) human. Orangutan genomic sequence AC145295 was analyzed by WSSD analysis against orangutan WGS and shown to be completely duplicated (blue bar >94%). Approximate copy number (orange) and average degree of sequence identity (above the line) were estimated. Both copy number and divergence decrease in a gradient-like fashion from LCR16a which is the only corresponding segment duplicated in human (pink) and chimpanzee (light blue). A virtually identical analysis using an ~185 kb segment from human chromosome 16 against human WGS shows the segment to be duplicated in chimpanzee and human in a gradient-like fashion. The only duplicated segment in orangutan (blue bar) corresponds to the LCR16a segment.

Recurrent and Independent Duplications of LCR16a

Sequencing of the baboon and macaque genomes confirmed the ancestral location of each LCR16 segment (Fig. 3-1). Using non-coding primate genome sequence, we constructed a neighbour-joining phylogenetic tree for each of the fourteen human LCR16 duplicons. The tree topology and corresponding branch lengths were remarkably consistent with the evolutionary order of events predicted from the initial hybridization results. The LCR16a phylogenetic analysis reveals two distinct clades: one monophyletic origin with respect to human/African ape sequences and a second monophyletic origin for the orangutan loci (Fig. 3-4). This is consistent with molecular clock data, which indicate that LCR16a expansions have occurred independently in each of the two lineages. Interestingly, when the duplication architecture is superimposed over the LCR16a phylogeny, similar block architectures cluster. For example, in the case of human, three distinct groups can be recognized based largely on the presence of flanking LCR16 duplicons (LCR16b, d or k/l). These associations supersede relationships predicted based on orthology, suggesting large-scale genetic exchanges since speciation of humans and great apes (Jackson, Oliver et al. 2005).

The finding of so many independent, recurrent duplications of the LCR16a segment prompted us to investigate whether there might be evidence for additional, more ancient copies of LCR16a that were not originally identified as a result of our threshold for detection (i.e. >90% sequence identity). Five additional loci were discovered, including three nearly full-length copies on chromosome 10q22.3 as well as 2 partial copies on chromosomes Xp11.22 and 11p15.4 (Fig 3-7). Three of these five homologous

LCR16a structures were embedded within complex duplication blocks flanked by chromosome-specific segmental duplications. The extensive substitution (~0.2-0.3 substitutions per site) suggests that these duplications of LCR16a occurred much earlier during primate evolution (>40 million years) (Liu, Zhao et al. 2003). Analysis of the recent rhesus macaque genome assembly confirmed the presence of Xp11.22, 11p15.4 and one of the 10q22.3 loci at syntenic positions to these human copies, confirming duplication of these prior to the divergence of the macaque/human lineages.


Fig. 3-7: More diverged LCR16a loci on chromosomes 11, 10 & X. a) NJ phylogeny with respect to human and baboon copies b) The extent of genomic sequence overlap with LCR16a consensus based on BLAST sequence similarity. NPIP=Nuclear pore interacting protein human mRNA. DogChr.6 represents the extent of blast (>70%) from dog chromosome 6 (July 2004) to this locus. This particular region of dog chromosome 6 is syntenic to human chromosome 16.

Junction Analysis

Two types of novel junctions could be identified based on our comparison of non-human primate and human sequences: (i) those that traversed lineage-specific duplications that had not been observed in humans (termed accretion boundaries) and (ii) those corresponding to the sites of new insertions (i.e. unique-duplication transitions where the LCR16 duplications were not present at that locus in human). The latter, termed insertion boundaries, provided the opportunity to study the architecture of the integration sites prior to duplicative transposition.

We generated precise sequence alignments and examined the repeat content for a total of 12 insertion and 23 accretion boundaries. As a control for the quality of sequence and assembly, subsets of these were tested and validated by junction-PCR amplification and sequencing of the PCR product (Fig. 3-8).


Fig. 3-8: Breakpoint resolution of a novel chimpanzee insertion. The schematic depicts a novel segmental duplication insertion of 82 kb and the corresponding deletion of 6.0 kb at the preintegration site with respect to the human reference sequence. PCR breakpoint analysis shows that repeat sequences were present in common ape ancestors, but that insertion was specific to chimpanzee and bonobo. Variability in PCR products is due to insertion and deletion of Alu repeats which are common in repeat rich regions (Liu, Zhao et al. 2003; Sen, Han et al. 2006). The preintegration locus consists of 92.7% common repeats.

Overall, ~55% (19/35) of the junctions showed the presence of an Alu repeat mapping precisely at the accretion or insertion boundary. Of these, ~95% (18/19) corresponded to younger subfamilies (AluS and AluY) (Table 3-9, Fig. 3-8). This three-fold enrichment confirms previous findings that younger Alu repeat elements are significantly enriched at the breakpoints of segmental duplication (Bailey, Giu et al. 2003; Jurka, Kohany et al. 2004). Because of the lineage-specific nature of the duplications, donor and acceptor relationships could in most cases be readily defined. We noted eleven examples where the transition between donor and acceptor sequences occurred within homologous (although not identical) repeat elements.
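The junction proportions quoted above can be re-derived from the counts given in the text. The short Python sketch below does only that arithmetic; the helper name `percent` and the variable names are illustrative assumptions, and the three-fold enrichment itself is not recomputed because the background Alu frequency is not stated here.

```python
# Counts taken from the text: 12 insertion + 23 accretion boundaries,
# 19 junctions with an Alu at the boundary, 18 of those AluS or AluY.

def percent(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole

junctions_total = 12 + 23   # insertion + accretion boundaries
alu_at_boundary = 19        # junctions with an Alu exactly at the boundary
young_alu = 18              # of those, AluS or AluY subfamilies

alu_pct = percent(alu_at_boundary, junctions_total)
young_pct = percent(young_alu, alu_at_boundary)

print(f"Alu at boundary: {alu_pct:.1f}% of {junctions_total} junctions")
print(f"AluS/AluY among Alu junctions: {young_pct:.1f}%")
```

The two values round to roughly 54% and 95%, in line with the approximate figures quoted in the text.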

Table 3-9: Repeat Content at Novel Segmental Duplication Junctions

Non-human primate sequences were aligned to the human reference genome and novel junctions that did not exist in human were identified. *Junctions were divided into two categories: those that corresponded to a new insertion (NI) and those that juxtaposed different segmental duplications (termed accretion boundaries or AB). Sequences mapping 50 bp proximally and distally to the two breakpoints (J1P, J1D, J2P, J2D) were analyzed for repeat content. PCR amplification and sequencing were performed from genomic DNA. The sequence complexity of the junctions frequently precluded validation by PCR.

Interestingly, for six examples where a new segmental duplication was clearly documented at a new location in a non-human primate species, we observed a corresponding genomic deletion of the preintegration site (Fig. 3-3, 3-8). These deletions ranged from 3.4 to 80.1 kb in length (median length = 5.9 kb) and were remarkably repeat-rich (77.3%) (Table 3-10). The evolutionary age of the corresponding repeat subfamilies and junction PCR indicate that these complex repeat structures represent the ancestral state. In one case, the corresponding segmental duplication was associated with the deletion of an entire serine protease gene in chimpanzee (Fig. 3-8, 3-5b). This gene deletion was previously shown to be specific to the chimpanzee lineage (Puente, Gutierrez-Fernandez et al. 2005) and our results clearly indicate a novel mechanism underlying its excision. Although the number of sites is still limited, these data suggest that coordinated deletion of repeat-rich DNA is a hallmark feature of de novo segmental duplication.


Table 3-10: Composition of Preintegration Sites based on Human Reference Genome (hg16)

*Sequence content of deleted regions in human genome that contain a novel, lineage-specific segmental duplication within a non-human primate species (PTR=Pan troglodytes, GGO=Gorilla gorilla and PPY=Pongo pygmaeus).

DISCUSSION

Our detailed sequence and evolutionary analysis of a subset of primate segmental duplications reveals unexpected properties regarding their origin and expansion. We summarize these properties and the supporting data, and put forward a model for LCR16 segmental duplication and associated structural variation of primate genomes.

• Recurrent Duplications: We show that LCR16a has duplicated independently in each of the great ape lineages to new euchromatic locations (Fig. 3-3). Most of the complex duplication blocks on human chromosome 16 are or have been associated with a full-length copy of LCR16a. Human and orangutan LCR16a map to different locations in the two genomes (Fig. 3-2). More ancient, full-length copies of the LCR16a element have been identified on different chromosomes, once again associated with complex regions of duplication. These data indicate that LCR16a duplications have occurred independently multiple times and that this 20 kb sequence has an inherent proclivity to duplicate to new locations.

• Duplication Polarity: Other LCR16 elements have accumulated in a stepwise fashion focused around LCR16a to form complex duplication blocks (Fig. 3-4). Unlike LCR16a, solitary duplications (i.e. not associated with another LCR) are rarely identified for these; in the one clear case in human, analysis of the structure showed it to be a deletion of LCR16a (Fig. 3-5b). Based on outgroup sequence data (macaque and baboon), most of these LCR16 elements originate from ancestral single copy sequences (Fig. 3-1). We show that younger and less abundant duplications accumulate at the periphery of LCR16a (Fig. 3-6). In the case of orangutan, a completely analogous structure of flanking duplications (independent in origin) has emerged flanking LCR16a (Fig. 3-3 & Fig. 3-4). These data suggest polarity of duplication around LCR16a.

• Ancestral Associations: Our hybridization and sequencing data (Table 3-3, 3-6) indicate that several of the ancestral loci of intrachromosomal segmental duplication on chromosomes 13 and 16 have been associated with LCR16a. In gorilla, for example, we find LCR16a in close proximity to LCRl (although at least in humans, such an association no longer exists). Two other examples (Fig. 3-3d & 3-5a) indicate that ancestral positions of LCR16 in chimpanzee and orangutan map in close proximity with LCR16a and are associated with lineage-specific duplications in these species. We propose these associations with LCR16a have served to prime lineage-specific duplications from these regions.

• Coordinated Deletion: Our detailed analysis of 6 new insertions has shown that in all 6 cases the newer insertions involved the coordinated deletion of sequences. The preintegration sequences are highly enriched for common repeat sequences and may be prone to double-strand breakage events. The coordinated deletion of target site nucleotides has been observed for several atypical L1 integration events (Gilbert, Lutz-Prigge et al. 2002; Gilbert, Lutz et al. 2005) and may implicate single-strand annealing (SSA) and/or synthesis-dependent strand annealing (SDSA) (Sugawara and Haber 1992; Nassif, Penney et al. 1994) as part of the pathway of segmental duplication.

Core Duplicon-Flanking Transposition Model:

We have shown that LCR16 segmental duplications change in copy number, composition and, more remarkably, location among humans and great apes. These regions of the genome may be loosely classified as a form of mobile DNA. Unlike typical common repeats (Moran, DeBerardinis et al. 1999), however, this process has moved and juxtaposed large gene structures, frequently in a lineage-specific manner, into new genomic contexts. The complex set of data presented here argues that LCR16a has played an active role in creating the duplication architecture on human chromosome 16 and orangutan chromosome 13. We propose that other LCR16 duplications have been duplicated passively, essentially as genetic hitchhikers, as part of this process. The association of LCR16a elements with ancestral loci, especially younger duplication events, suggests that a property of the sequence itself has the potential to duplicate sequences to new locations. This has occurred independently and at different times during human-great ape evolution. These events are associated with both deletions and other rearrangement events that have subtly restructured human and great ape chromosomes during evolution. Core duplicons, similar to LCR16a, have recently been identified for other chromosomes with an overabundance of intrachromosomal duplications (Zody, Garber et al. 2006; Zody, Garber et al. 2006). It is possible that this may represent a general property of the human/great ape genome.

There are at least two possible explanations for our observations. First, the LCR16a sequence may have evolved mechanistically as a preferred template for gene conversion events to new locations in the human genome. In this model, LCR16a would serve as a source for the directional repair of double-strand breaks in the genome, similar to yeast mating-type switching (Haber 1998). The Alu-repeat richness of the LCR16a cassette would provide the homology to promote single-strand annealing and/or SDSA. This model might explain the coordinated deletion of preintegration sites, the enrichment of Alu repeats at breakpoints and the finding that sequences flanking LCR16a become duplicated. If the LCR16a sequence carries an inherent enhancer of gene conversion, however, it is unclear how the process could be so processive (hundreds of kb) or why it, as opposed to other Alu-rich repeat regions of the genome, is the preferred source.

An alternative explanation for the apparent strong association of new duplications with LCR16a is that it is an indirect consequence of intense selection. We have shown previously that the gene family encoded by LCR16a shows among the strongest signatures of positive selection among human and African ape genes (Johnson, Viggiano et al. 2001). It is possible that the complex pattern of duplication is simply a consequence of the pressure to produce more divergent copies of LCR16a at distinct locations. We do not favor this model completely because positive selection of the morpheus gene family occurred only among humans and African apes (Ka/Ks = 10-13 for exon 2 when compared to the OWM outgroup). Our data indicate that complex duplicated blocks have emerged completely independently in the orangutan lineage, where there is no strong evidence of positive selection (Ka/Ks = ~1.0). Moreover, we have identified more ancient copies of LCR16a on chromosomes 10, 11 and X. This suggests that this piece of genetic material was inherently unstable and duplicating prior to positive selection. We therefore favor a duplication-driven model of DNA transposition. This new dynamic model for genomic duplication helps to explain the non-random spatial-temporal distribution of segmental duplications in human and great apes.

METHODS

Genomic library hybridization and BAC end-sequencing

Large insert genomic BAC libraries (6-fold coverage) from chimpanzee (RPCI-43), gorilla (CHORI-255), orangutan (CHORI-253), olive baboon (RPCI-41) and rhesus macaque (CHORI-250) were probed by hybridization for each individual LCR16 duplication (Table 3-6), as described (Yohn, Jiang et al. 2005). A total of 782 LCR16-positive BACs were selected and the inserts end-sequenced. Repeat-masked BAC end sequence (BES) was rescored for quality and mapped against the human genome (megablast 12PATCh –d BES –D 3 –p 93 –F m –UT –s 150 –R T).

BAC Sequencing

BACs were subjected to shotgun sequencing at the NIH Intramural Sequencing Center (Thomas, Touchman et al. 2003) and the Baylor College of Medicine Human Genome Sequencing Center (Scherer, Muzny et al. 2006) to HTGS phase 1 sequence quality (at least 6-fold sequence redundancy). A subset of clones (n=25), corresponding to potential new insertions, was selected for ordered and oriented sequence assembly.

Sequence Annotation

Non-human primate BAC sequence was compared to human genome sequence using Miropeats (Parsons 1995), two_way_mirror (Bailey, unpublished) and ALIGN (Myers and Miller 1988), using parameters optimized for global alignment of primate sequences (Liu, Zhao et al. 2003). The best map location was defined as one where human and non-human primate sequences align within non-duplicated flanking unique sequence. If the entire BAC was duplicated, the most significant correspondence by BLAST sequence homology was used and the location was classified as “ambiguous”. We examined the extent of recent duplication (>94%) for each clone using the whole genome shotgun sequence detection strategy for human (Bailey, Yavor et al. 2002), chimpanzee (Cheng, Ventura et al. 2005), and orangutan (EEE, unpublished). FISH was used to assess duplication/unique status in gorilla (Ventura, data not shown). For simplicity, human chromosome designations are used for non-human map descriptions (McConkey 2004).

PCR Breakpoint Analysis

A subset of breakpoints associated with lineage specific insertions were validated by designing PCR assays across the breakpoint junctions (Table 3-9) and amplification of genomic DNA from a panel of primate lymphoblast-derived . The dense repeat content of many of the breakpoints precluded design of assays across all insertion breakpoints.

Phylogenetic Analysis

We extracted overlapping sequence corresponding to each of the human segmental duplications from non-human primate sequence, generated multiple sequence alignments using ClustalW (Higgins, Thompson et al. 1996) and constructed the corresponding neighbor-joining phylogenetic trees (MEGA). We considered only non-coding sequences by processing the multiple sequence alignments for corresponding cDNA using MAM software. We used Kimura’s two-parameter method (Kumar and Hedges 1998) for all estimates of genetic distance.
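As a rough illustration of the distance measure named above, the following Python sketch computes Kimura's two-parameter distance between two aligned sequences with the standard formula d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitional and transversional differences. The function name and the treatment of gaps (any site with a non-ACGT character in either sequence is skipped) are assumptions made for illustration; they are not taken from MEGA or from the analysis described here.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        # Assumption: skip gaps and ambiguous bases (pairwise deletion).
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    if 1 - 2 * q <= 0 or 1 - 2 * p - q <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))

# Toy usage with made-up sequences: one transition in ten sites.
d_ts = k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")
print(f"K2P distance: {d_ts:.4f}")
```

A pairwise matrix of such distances is the kind of input a neighbor-joining method takes, which is why the distance model is stated alongside the tree-building step.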

ACKNOWLEDGEMENTS

We are grateful to Baishali Maskeri, Robert Blakesley, Andy Sharp and Eray Tuzun for technical assistance. We thank the large-scale sequencing centers (Baylor College of Medicine and the Washington University Genome Sequencing Center) for access to genome assembly data (macaque) and trace sequence data from the chimpanzee and orangutan prior to publication. This work was supported, in part, by an NIH grant GM58815 to EEE and by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health. EEE is an investigator of the Howard Hughes Medical Institute.


Chapter 4

Dynamic changes in the structure, expression and

selection of the morpheus gene family during primate

evolution.

Matthew E. Johnson1,2, Sean D. McGrath1, Geoffrey Findlay1, NISC Comparative Sequencing Program3, Zhaoshi Jiang1, Jeff Rogers4, Mario Ventura5, Eric D Green3, Evan E. Eichler1,6.

1 Department of Genome Sciences and the Howard Hughes Medical Institute6, University of Washington, Seattle, Washington, 98195, USA

2Department of Genetics and Center for Human Genetics, Case Western Reserve School of Medicine and University Hospitals of Cleveland, Cleveland, OH, 44106, USA

3Genome Technology Branch and NIH Intramural Sequencing Center (NISC), National Human Genome Research Institute, Bethesda, Maryland 20892, USA

4 Department of Genetics, Southwest Foundation for Biomedical Research, P.O. Box 760549, San Antonio, TX 78227, USA

5Sezione di Genetica, DAPEG, University of Bari, 70126 Bari, Italy

Note: This work is being prepared for publication

ABSTRACT

Gene duplication followed by adaptive selection has been one of the forces driving the emergence of new gene function. Here, we recapitulate a detailed evolutionary history of the morpheus gene family, encoded in a 20-kilobase (kb) segmental duplication, which shows a dynamic pattern of change (proliferation, expression, and selection) during primate genome evolution. Expression analysis shows predominant testis expression in Old World and New World monkeys, with a ubiquitous expression profile emerging in all great ape species. The timing of duplication of the gene family coincides with the emergence of ubiquitous expression in all great ape lineages. Comparative primate cDNA sequencing shows that the exon-intron structure has changed dramatically before and after ape evolution, with 3 exons being lost and 1 exon being added prior to the emergence of the apes. Interestingly, phylogenetic analyses suggest that the major episode of positive selection occurred immediately before and after the emergence of the African great ape lineages. We propose a model whereby recurrent segmental duplications promoted alterations of the expression profile and gene structure early during primate evolution; adaptive changes occurred subsequently, leading to the birth of a new gene family in the human and ape species.

INTRODUCTION

One of the most unlikely, yet most important, outcomes of genomic duplication is the evolution of new function, termed neofunctionalization (Lynch, O'Hely et al. 2001; Lynch and Katju 2004). Gene duplication followed by diversifying selection has been postulated to be one of the primary forces responsible for achieving the proteome diversity and morphological complexity of vertebrates (Muller 1936; Ohno, Wolf et al. 1968; Ohno 1970; Hughes 1994; Lynch and Conery 2000; Eichler 2001; Lynch and Katju 2004; Taylor and Raes 2004). Although there are several different models for the birth-death process of gene duplicates (Hughes 1994; Lynch and Conery 2000; Lynch and Force 2000), it has generally been argued that duplicating a copy of a gene or genomic segment effectively allows that copy of the gene to evolve unencumbered by the original selective constraint of its progenitor. Under such circumstances, a gene may emerge with a slightly new or improved function relative to its precursor. Such duplicated genes are often not essential for viability but may represent ecological adaptations to specific environmental niches or be important as “speciation genes”.

Among human segmental duplications, a growing list of dozens of primate-specific genes and gene families has been identified (Johnson, Viggiano et al. 2001; Paulding, Ruvolo et al. 2003; Birtle, Goodstadt et al. 2005; Ciccarelli, von Mering et al. 2005; Vandepoele, Van Roy et al. 2005). The functions of these duplicated genes, especially those in interspersed SDs, are largely unknown. Many genes that lie in interspersed SDs show limited or no evidence of homology to genes in other mammals and model organisms (Johnson, Viggiano et al. 2001; Paulding, Ruvolo et al. 2003; Ciccarelli, von Mering et al. 2005; Vandepoele, Van Roy et al. 2005). In humans, many of these rapidly evolving genes map to the most duplicated segment of the interspersed duplication blocks (Bailey and Eichler 2006). Not surprisingly, gene annotation in these highly duplicated regions is particularly challenging due to the absence of clear orthologues in out-group species and the difficulties in sequencing and assembly of the underlying genomic regions (Eichler, Clark et al. 2004).

We have focused on the detailed characterization of one particular subset of segmental duplications, termed LCR16 (low-copy repeats on chromosome 16) (Loftus, Kim et al. 1999). One of these segmental duplications, LCR16a, is remarkable as it contains one of the most rapidly evolving gene families in the human genome. Analysis of two exons from the corresponding eight-exon gene family (morpheus) demonstrates rates of amino-acid replacement 20-50 fold faster than typical genes under purifying selection (e.g. Ka/Ks = ~10 for exon 2 when compared to Old World monkey species). Positive selection emerged specifically within the human/African great ape lineage (Johnson, Viggiano et al. 2001). The function of the gene family is still not known, although in situ studies and two-hybrid analyses suggest that some members interact with the nuclear pore complex and may be involved in chromatin remodeling (Johnson, Viggiano et al. 2001). In this study, we further explore the genomic and evolutionary dynamics of this subset of recently evolved segmental duplications and investigate changes in selection and expression patterns of this gene family during the course of primate evolution.

RESULTS

Gene Models and Positive Selection

The aim of this study was to investigate the selection and expression of the morpheus gene family in conjunction with its dynamic evolutionary history. The highly variable and complex duplication architecture of the LCR16a loci has hindered sequence assembly by WGSA, limited gene annotation for specific copies and complicated the establishment of orthologous relationships between human/great-ape loci. To help resolve these problems, we generated contiguous sequence from large-insert clones in order to map orthologous positions based on the placement of unique sequence anchors between species. To date, we have sequenced 67 different loci (5.0 Mbp) in apes and Old-World monkey species (Johnson, Cheng et al. 2006), including, more recently, 5 additional loci from a representative gibbon genome (Hylobates klossi). Using the eight-exon human NPIP (nuclear pore interacting protein) cDNA as a model (Fig. 4-1a), we defined the putative exon-intron structure of the morpheus gene family for each of the 67 loci. We constructed a neighbor-joining phylogenetic tree from intronic sequence and found that intronic sequences evolved neutrally (multiple Tajima’s relative rate tests, data not shown) (Fig. 4-1b). We note that in both Old-World monkey species a single orthologous locus was identified, while two largely independent expansions have occurred in the Asian and African ape/human lineages.


Fig. 4-1: a) Morpheus Gene Structure. Intron-exon structure of one member of the morpheus gene family (NPIP, Genbank) encoded by LCR16a (black arrow = methionine start and cross = stop codon). Dashed black lines approximate the placement of RT-PCR oligonucleotides used in the expression analysis. b) Genomic structure of LCR16a in non-human primates. BACs containing LCR16a were sequenced and assembled for 64 loci. The duplication architecture (colored and grey blocks) for loci is shown in the context of a neighbor-joining (NJ) phylogenetic tree for LCR16a (2-kb noncoding sequence). Loci are coded by species [HSA (Homo sapiens), PTR (Pan troglodytes), GGO (Gorilla gorilla), PPY (Pongo pygmaeus), and PHA (Papio hamadryas)] and relative to human orthologous loci (Fig. 3-4 and Table 3-2; Johnson, Cheng et al. 2006). New insertions or loci that could not be unambiguously mapped with respect to the human locus were given a letter designation. Red asterisks represent loci with 99-100% match to cDNA or ESTs. These represent copies for which there is evidence of transcription. The ancestral locus of LCR16a in the tree is indicated by a species code colored purple.

We initially provided evidence of positive selection for two of the seven coding exons of the morpheus gene family for chimpanzee and human (Johnson, Viggiano et al. 2001) by comparing the rates of synonymous and amino-acid replacement changes. In this study, we examine positive selection for each of the individual exons (significant departures from neutrality based on the average Ka/Ks comparison within and between species). We excluded exon 8 from analysis due to the presence of a highly variable repeat sequence which complicated optimal global alignments. At the exonic level, five out of seven exons show signatures of positive selection among great-ape species (Fig. 4-2, Table 4-1). Although exon 2 shows the most significant positive selection among all African great apes tested, we note species-specific differences. Exon 3 shows positive selection only within the human and gibbon lineages (Table 4-1). In contrast, exons 4 and 7 depart significantly from neutrality when ape and Old World monkey species are compared. Internal branch estimates suggest that exon 7 may have been under selection earliest during evolution (after the Old World monkey and great ape split ~25 million years ago). Exon 5 is the most conserved, with evidence of limited amino-acid replacement, although statistically this is indistinguishable from neutrality.
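The per-exon test for departure from neutrality can be sketched as a one-tailed Z test on the difference Ka - Ks. The Python sketch below assumes that estimates of Ka and Ks and their variances are already available (for example from a bootstrap); the function name, the way the variances are obtained and the numbers in the usage example are hypothetical and are not results from this study.

```python
import math

def z_test_positive_selection(ka: float, ks: float,
                              var_ka: float, var_ks: float):
    """One-tailed Z test of H0: Ka <= Ks against H1: Ka > Ks.

    Returns (z, p), where p is the upper-tail probability of a standard
    normal deviate at least as large as z.
    """
    se = math.sqrt(var_ka + var_ks)   # standard error of (Ka - Ks)
    if se == 0:
        raise ValueError("standard error is zero")
    z = (ka - ks) / se
    p = 0.5 * math.erfc(z / math.sqrt(2))   # P(Z >= z) for Z ~ N(0, 1)
    return z, p

# Hypothetical values for illustration only.
z, p = z_test_positive_selection(ka=0.10, ks=0.02,
                                 var_ka=0.0004, var_ks=0.0001)
print(f"Ka/Ks = {0.10 / 0.02:.1f}, Z = {z:.2f}, one-tailed p = {p:.5f}")
```

Testing the difference Ka - Ks rather than the ratio avoids instability when Ks is close to zero, while the ratio Ka/Ks remains the quantity reported for interpretation.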


Fig. 4-2: Positive Selection. For each exon (exons 1-7 based on the human NPIP model), we compared the number of non-synonymous substitutions per site (Ka) to the number of synonymous substitutions per site (Ks). Within and between species comparisons were made (HSA=human, PTR=chimp, GGO=gorilla, PPY=orangutan, OW=Old World monkey baboon and macaque). Among OW species a single copy of morpheus exists and thus all comparisons are orthologous. Ka/Ks ratios significantly greater than 1.0 are taken as evidence of positive selection (Z test transformation). Positive selection is largely restricted to humans and African great ape copies of morpheus.

Table 4-1: Positive selection of Exons 1-7

HSA=Homo sapiens (n=23 paralogous sequences); PTR=Pan troglodytes (n=22); GGO=Gorilla gorilla (n=19); HKL=Hylobates klossi (n=8); PPY=Pongo pygmaeus (n=15); OW=Papio anubis and Macaca fascicularis; p=probability that observed difference was due to chance (one-tailed Z test); Ka-Ks=difference between Ka and Ks; S=significant positive selection detected; n=the number of paralogous sequences considered within each group comparison.


We tested for positive selection at the level of the gene by comparing gene models between human and African great-ape copies that have been mapped to orthologous positions based on unique sequence anchors. We compared 6 human/chimpanzee and 4 human/gorilla gene models (Table 4-2). The most significant positive selection is observed in human and chimpanzee comparisons (4/6 comparisons, range Ka/Ks=2.7-4.4, p<0.01), whereas only one of the gorilla-human comparisons departed significantly from neutrality (Ka/Ks=2.72, p<0.05). In contrast, corresponding genes from orangutan, gibbon and Old World monkey showed very little evidence of selection. We note that no distinction has been made in this study between functional and non-functional duplicate copies. We felt that this treatment was conservative, as pseudogene comparisons and alternative splicing would tend to neutralize both adaptive and purifying selection constraints. Consequently, such events would cause Ka/Ks ratios to approximate one and should reduce our power to detect true positive selection.

Table 4-2: Complete mRNA Exons 1-7 Testing Positive Selection on Orthologs

The accession numbers represent human (HSA), chimpanzee (PTR), gorilla (GGO) or gibbon (NLE) orthologous BACs; Ka=average number of amino-acid substitutions per site between and within group comparisons; Ks=average number of synonymous substitutions per site between and within group comparisons; p=probability that observed difference was due to chance (one-tailed Z test); Ka-Ks=difference between Ka and Ks; NS=no significant positive selection detected. Human (HSA), chimpanzee (PTR) and gorilla (GGO) copies are tested within groups of orthologous copies.

Expression and Gene Structure Analysis

We next assessed the pattern of expression for various members of the morpheus gene family by RT-PCR and sequence analyses of corresponding cDNA. Analysis of human EST data suggests that eight of the 24 copies are transcribed (Fig. 4-1b) in a variety of different tissues. We designed both degenerate and specific RT-PCR assays to distinguish three of the eight expressed copies in human (Fig. 4-3). Each human copy showed a ubiquitous pattern of expression in all tissues tested (n=12) (Fig. 4-3a). Screening of a more diverse panel of human tissues confirmed a broad pattern of expression across 30 different tissues (primgen II; data not shown).


Fig. 4-3: RT-PCR Analysis of LCR16a Gene Family in Primate Tissues. a) cDNA was prepared from a panel of human tissue mRNAs (Clontech). Oligonucleotides (16.16/17) were designed within exons 2 and 4 (AF132984 model) to amplify putative transcripts from 8 of the 15 LCR16a segments. Three families of transcripts can be distinguished by length: 1. (311 bp) = represented by AF132984 (HSA9) gene model; 2. (363 bp) = D86974 (HSA6, alternative splicing of 48 bp of exon 4); and 3. (417 bp) = represented by BF109282 (HSA11, 54 bp insertion in exon 2). b) cDNA was prepared from a panel of chimpanzee tissue mRNAs. Oligonucleotides (PTR1/2) were designed within exons 2 and 4 to amplify putative transcripts from 12 of the 21 LCR16a segments. Two families of transcripts can be distinguished by length: 1. (311 bp) = represented by AF132984 (HSA9) gene model; and 2. (363 bp) = D86974 (HSA6, alternative splicing of 48 bp of exon 4). c) cDNA was prepared from a panel of orangutan tissue mRNAs. Oligonucleotides (PPY1/2) were designed within exons 2 and 4 to amplify putative transcripts from 12 of the 21 LCR16a segments. One family of transcripts can be distinguished by length: 2. (363 bp) = D86974 (HSA6, alternative splicing of 48 bp of exon 4). d) cDNA was prepared from a panel of macaque tissue mRNAs. Oligonucleotides (OWM1 and N8) were designed within exons 2 and 8 to amplify putative transcripts from 2 of the 2 LCR16a segments. e) cDNA was prepared from a panel of baboon tissue mRNAs. Oligonucleotides (OWM1 and N8) were designed within exons 2 and 8 to amplify putative transcripts from 1 of the 8 LCR16a segments. f) cDNA was prepared from a panel of marmoset tissue mRNAs. Oligonucleotides (marmoset16.1 and 16.2) were designed within marmoset exons. For all of the RT-PCRs, control primers designed to UBE1, which is ubiquitously expressed in all tissues, were used to test the quality of the cDNA (methods).

Based on great-ape genomic sequence (see above), we similarly designed RT-PCR assays to assess the expression profile of the gene family across tissues obtained from necropsied ape specimens (methods). A pattern of ubiquitous expression was noted for chimpanzee (Fig. 4-3b), orangutan (Fig. 4-3c) and gorilla (data not shown) copies, although the number of tissues analyzed was limited in the latter two cases. In stark contrast, expression analysis of the single copies in both Old World monkey representatives (macaque and baboon) showed predominant expression in the testis (Fig. 4-3d,e). Weaker bands were reproducibly noted in both macaque kidney and liver (Fig. 4-3d). Overall, the data suggest that duplication and ubiquitous patterns of expression are correlated.

During our sequence analyses of over 30 different cDNA products, we observed considerable alternative splicing (Table 4-1; Fig. 4-4). Sequencing of cDNA from both baboon and macaque revealed two significant differences with respect to the canonical human gene model: the complete absence of exon 5 and the presence of three additional exons (labeled 7.1, 7.2 and 7.3 in Fig. 4-4). These alternative splice variants were never observed in human cDNA or ESTs. Among Old World monkey (OWM) cDNA, splicing of these three exons occurs in frame with the human NPIP gene model and adds an additional 76 codons to the carboxyl portion of the protein. RT-PCR with marmoset testis tissue also shows the presence of these exons (7.1, 7.2 and 7.3).

Fig. 4-4: Cross-species Morpheus Gene Structures. Comparisons of gene structures between human, baboon and dog. Note the low degree of amino-acid similarity between human and baboon despite intronic sequence identity of 94%. 98% of the protein divergence is due to positive selection of the exons. Baboon and dog copies show a Ka/Ks=~1.0 not significantly different from a neutral model. Arrow symbols indicate the position of initiation codon.

As predicted, alignment of the nearly full-length human and Old World monkey ORF showed extensive amino-acid replacement (Fig. 4-5a and b). In the case of baboon, multiple alternative splice products were sub-cloned and sequenced. Only the predominant band (Fig. 4-3e) maintained a long ORF (n=500 bp) consistent with the human NPIP model. Interestingly, sequence similarity searches with the OWM gene model identified a homologous predicted gene in the dog genome (XM_846143, LOC608970) with EST support (mixed cDNA source library). The dog transcript includes the three OWM exons that are not found in human versions of the morpheus gene family. The dog transcript maps to dog chromosome 6 (Dog 2.0 assembly, 27829166-27818670), a region syntenic to human chromosome 16 and consistent with the single-copy map locations observed in baboon and macaque. These data suggest that the corresponding genomic segment represents the likely ancestral locus of morpheus, although the gene structure and amino-acid composition have changed radically during human/great ape evolution.


Fig. 4-5: Human and Old World monkey ORF base pair and amino-acid alignment. a) Human and baboon DNA multiple alignments (ClustalW) were generated from ancestral full-length morpheus genic sequence. Asterisks mark positions where the aligned base pairs are identical; aligned positions without an asterisk differ. Dashes represent gaps in the sequence alignment. b) Human and baboon amino-acid multiple alignments (ClustalW) were generated from ancestral full-length morpheus genic sequence. Asterisks mark positions where the aligned amino acids are identical; aligned positions without an asterisk differ. Dots represent similarly charged amino acids.

DISCUSSION

Our comparative sequence and expression analysis reveals several important properties with respect to the evolution of the morpheus gene family. First, comparative sequencing of cDNA suggests radical changes in the gene structure over short periods of evolutionary time. In the human/great ape lineage, there has been extensive exon loss

(exon 7.1, 7.2 and 7.3) and exaptation (exon 5) when compared to the most likely ancestral gene model (dog and baboon). Many evolutionary studies in model organisms indicate that radical changes in gene structure and gene expression are hallmark features of evolutionarily young duplicates undergoing intense selection (Ting, Tsaur et al. 1998;

Katju and Lynch 2003; Long, Betran et al. 2003; Zhang, Dean et al. 2004; Katju and Lynch 2006).

It is interesting that, despite these dynamic changes, the apparent ORF has not changed. Second, our analysis predicts that the ancestral locus originated from a region syntenic to human chromosome 16. A single copy is identified at this position in both baboon and macaque, and flanking unique markers around this locus correspond to the single copy detected in the dog genome. OW monkey species as well as marmosets (data not shown) show a striking testis-predominant pattern of expression, while all great-apes reveal a ubiquitous pattern of expression. These data argue that the ancestral expression profile was largely restricted to testis and that a broader expression profile may have occurred in concert with the increase in copy-number of LCR16a among the great-apes.

Third, most of the positive selection (exons 2, 3 and 4) occurred during the emergence and divergence of the human and African great-ape lineages (Fig. 4-6, (Johnson, Viggiano et al. 2001)). Despite the presence of multiple copies and a broad expression profile in orangutan, limited positive selection is observed over these exons among the Asian apes.

This argues that the most significant burst of positive selection occurred after duplication and a broader expression profile evolved.

Two corollaries of this work are instructive for gene discovery and annotation

particularly of rapidly evolving genes. Simply overlaying gene structures based on the

human gene model is not sufficient for primate gene annotation, especially for genes

mapping to segmental duplications. Sequence characterization of the OWM cDNA

provided us the necessary handle to identify orthologous genes in more distantly related

mammalian species—i.e. searches with the positively selected genes in humans had failed

to recover homologues outside of primates.


Fig. 4-6: Neighbour-Joining Evolutionary Tree for 99 bp of coding DNA from exon 3. The overall topology of the tree is similar to exon 2, indicating extensive positive selection occurring after the separation of orangutan from the African Ape lineage. Scale bar, Jukes-Cantor corrected distance. The midpoint of all trees was set to 1/2 the distance between gibbon and baboon sequence taxa. Only bootstrap values >50% are shown (n=1000 replicates). Branches showing significant positive selection are indicated by arrows with accompanying ba/bs ratios (estimated amino-acid replacement and synonymous changes per branch per site, (Zhang 1998)). Significance was calculated based on the difference (*p<0.05, **p<0.01).


These observations advocate for the need for primate cDNA re-sequencing. It is also noteworthy that the duplicated gene showing positive selection demonstrated a relatively radical change in expression profiles. Such changes for gene duplicates are not unexpected but typically duplicated genes show a more restricted pattern of expression due to subfunctionalization or presumed lack of selection. For example, several genes as well as fusion transcripts show patterns of expression largely restricted to testis (eg.

TRE2, TRIM48, TIPIP (Mudge and Jackson 2005; She, Liu et al. 2006)). Morpheus may be one of the few examples where this usual pattern has reversed, i.e. a broader profile of expression than the ancestral pattern. We suggest that such radical changes of expression may be a hallmark feature of genes undergoing strong adaptive evolution.

In summary, the data suggest a complex interplay between the mechanism of segmental duplication, expression alterations and selection (Fig. 4-7). We propose that the gene family emerged, in part, from a genetically unstable element (Johnson, Cheng et al.

2006) that contained a neutrally evolving gene whose expression was restricted largely or exclusively to the testis. Through the process of duplication, shuffling, and rearrangement of other segmental duplications, paralogs acquired the regulatory machinery to be more broadly expressed. This broader pattern of expression was accompanied by exon exaptation, fusion and loss. Fortuitous mutations occurred within the human/great-ape lineage that led to adaptive evolution and further promulgation of this gene family.

Consequently, a new gene and gene family emerged seemingly from non-functional segments of the human genome.


Fig. 4-7: Model of Morpheus Evolution. Changes in copy number, expression and selection of the morpheus gene family based on comparative analyses of primates. *Ka/Ks based on analyses of the entire gene model.

METHODS

Phylogenetic and Evolutionary Genetics analysis

For the exonic sequence data, the average numbers of synonymous

(Ks) and non-synonymous (Ka) substitutions per site were estimated using the modified

Nei-Gojobori method (Nei and Gojobori 1986; Zhang, Rosenberg et al. 1998). To test for positive Darwinian selection, we calculated the difference (D = Ka-Ks) for pairwise comparisons of orthologs between full-length human genes and full-length chimpanzee and gorilla genes. The difference was also calculated both within and between groups for all pairwise comparisons of paralogues. Groups are defined as different primate species. Given the large number of pairwise analyses performed on the exon sequences, significance levels should be corrected for multiple comparisons.

The usual Bonferroni method cannot be used, since the comparisons of gene sequences are not independent of one another. Instead, the average differences and their standard errors were calculated for all comparisons. All possible comparisons were made among the exons, and the difference between non-synonymous (amino-acid replacement) and synonymous substitutions was

calculated for each pairwise comparison. The differences were averaged between and within groups. A one-tailed Z test (Z=D/sD) was used to determine the level of significance, with the variance of the average difference estimated using the bootstrap method (n=1000 replicates); a significant positive difference was used to define positive selection (Johnson, Viggiano et al. 2001). Phylogenetic trees of multiple aligned gene sequences (ClustalW) were generated using Neighbor-joining distance estimates

(MEGA2). Bootstrap values >50% are indicated in the tree topology. The method designed by Zhang et al was used to determine internal branch length estimates of the number of synonymous (bs) and non-synonymous (ba) substitutions per site (Zhang,

Rosenberg et al. 1998).
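
The selection test described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MEGA2 implementation used for the analyses. It assumes the numbers of synonymous and non-synonymous sites and differences have already been counted (the site-counting step of the modified Nei-Gojobori method is omitted), applies the Jukes-Cantor correction to each proportion, and then computes a one-tailed Z test from a set of bootstrap estimates of D. The function names and all example numbers are hypothetical.

```python
import math
import statistics


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for a proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("proportion must be in [0, 0.75) for the correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def ka_ks_difference(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Return (Ka, Ks, D = Ka - Ks) from pre-counted sites and differences."""
    ka = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ks = jukes_cantor(syn_diffs / syn_sites)
    return ka, ks, ka - ks


def one_tailed_z_test(d_observed, d_bootstrap):
    """Z = D / sD, with sD taken as the standard deviation of bootstrap D values.

    Returns (Z, p), where p is the upper-tail probability of a standard normal.
    """
    s_d = statistics.stdev(d_bootstrap)
    z = d_observed / s_d
    p = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p


if __name__ == "__main__":
    # Hypothetical counts for one pairwise comparison (not data from this study).
    ka, ks, d = ka_ks_difference(30, 200.0, 3, 70.0)
    # Hypothetical bootstrap estimates of D; a real analysis would use 1000 replicates.
    boot = [0.08, 0.12, 0.15, 0.10, 0.14, 0.11, 0.13, 0.09]
    z, p = one_tailed_z_test(d, boot)
    print(f"Ka={ka:.3f} Ks={ks:.3f} D={d:.3f} Z={z:.2f} p={p:.4f}")
```

Testing D = Ka-Ks against zero, rather than testing the ratio Ka/Ks directly, avoids dividing by a very small Ks, which matters for short exons with few synonymous changes.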

Tissue Samples

Chimpanzee (Pan troglodytes, SFBR-4X0130/ Claire sample ID) tissue material

(pancreas, brain stem, cerebellum, medulla oblongata, thalamus, spleen, heart, small and

large intestine) was obtained <8 hours post-mortem from a female specimen from the

Southwest Foundation for Biomedical Research (courtesy of Jerilyn Pecotte). Chimpanzee

testis material was obtained from a necropsied male specimen (SFBR-4X0507).

Orangutan (Pongo pygmaeus) tissue samples were obtained post-mortem (Dan Anderson,

Yerkes Primate Center) from two different male orangutan specimens (YN98-329/Gelar

for spleen, liver, brain; and YN98-389 (Ayer) for liver and heart). Both macaque

(Macaca fascicularis) and baboon (Papio hamadryas SFBR-13056) tissues were obtained

from euthanized male specimens obtained from SFBR breeding colonies.

Expression Analysis

Human RT-PCR assays were performed using polyA mRNA (Clontech) or prepared cDNA (PrimeExpress™ II Human Normal Tissue cDNA Panel, Primgen). Total non-human primate RNA was extracted from tissue panels of the primates and mice by

using the RNeasy® Mini or Midi Kit from Qiagen, following the manufacturer's protocol. cDNA was synthesized

using PowerScript™ Reverse Transcriptase from Clontech/Takara Bio, following the manufacturer's protocol.

RT-PCR was performed using a standard protocol with primers found in Sup. table 4-1

(not shown) (Johnson, Viggiano et al. 2001). The PCR products were run on a 1%

agarose gel with 0.5 µg of 100 bp ladder.

Genomic Clone and Sequence Analysis

Large insert libraries (6-fold coverage) from gibbon (HKL) were

hybridized with PCR-amplified products (16.1/9 and 16.8/12; for PCR oligonucleotide

sequence and PCR conditions). Inserts were end-sequenced, repeat-masked and mapped

against the human genome (MEGABLAST 12PATCh –d BES –D 3 –p 93 –F m –U T –s

150 –R T). Clones were subjected to shotgun sequencing at the National Institutes of

Health Intramural Sequencing Center (Thomas, Touchman et al. 2003) to at least 6-fold

sequence redundancy. A subset of clones (n =5) corresponding to potential new insertions were selected for ordered and oriented sequence assembly.

ACKNOWLEDGEMENTS

We are grateful to Jeff Rogers, Jerilyn Pecotte and Dan Anderson for providing access to scarce great-ape tissues. We thank the Washington University Genome

Sequencing Center for access to marmoset genome trace sequence data corresponding to the LCR16a (20 kb) prior to publication. This work was supported in part by National

Institutes of Health (NIH) Grant GM58815 (to E.E.E.) and by the Intramural Research

Program of the National Human Genome Research Institute, NIH. E.E.E. is an

Investigator of the Howard Hughes Medical Institute.


Chapter 5

Discussion and Future Directions

SUMMARY AND DISCUSSION

This thesis has provided a very detailed study of a set of euchromatic segmental duplications on chromosome 16 (LCR16). My study has provided further insight into the impact of segmental duplications on primate genome evolution, particularly with respect to large-scale changes in chromosome structure. Second, using the morpheus gene family

contained within the LCR16a duplication as a model, I have gained insight into the

plasticity of the process of new gene formation through duplication. The following

section will concisely summarize the major findings reported in this dissertation.

The Recurrent Duplicative Nature of LCR16a

The first goal of this project was to determine the nature of the duplicative

transposition of LCR16a and whether its distribution in different primate lineages

suggested a recurrent process. In Chapter 2, I provided the first evidence that LCR16a

has duplicated recurrently by utilizing comparative FISH analysis on primate metaphase

and interphase nuclei (Fig. 2-4). The metaphase analysis suggested that LCR16a in

chimpanzees had duplicated in a lineage-specific manner to chromosomes 7 and 17, and

in orangutan to chromosome 13 (Fig 2-4). From these FISH results, it was also

determined that LCR16a underwent a rapid duplication expansion in primates after the

split of Old World monkeys from great apes (i.e. a single copy in baboons to ~30 copies

in chimpanzees). Additionally, there were different copy numbers of the duplication for

each great-ape species, suggesting lineage-specific duplication of LCR16a in humans,

chimpanzees, gorillas, and orangutans. These data supported the theory that LCR16a

duplication is recurrent and recent. After determining the likely ancestral location of

LCR16a in baboons, I tested this hypothesis in another Old World monkey (macaque) and

in a New World monkey (marmoset).

Additional data (library hybridization, end-sequence mapping and sequencing of large-insert clones) (see “Methods”, Chapter 3) confirmed that LCR16a had independently duplicated in each of the great-ape lineages (Fig 3-1). This analysis provided a locus-by-locus comparison to identify new insertions in each primate lineage.

New euchromatic regions were shown to have duplicated to different chromosomes in chimpanzee and orangutan (Fig 3-2c and e) as well as new locations within the same homologous chromosome in the three species (Chapter 3). In contrast to the single copy of LCR16a on chromosome 16 in baboon and other Old World monkey species, I discovered lineage-specific duplications of LCR16a in the macaque (2 copies) and the marmoset (~15 copies). A search for more ancient copies of LCR16a uncovered either full-length or partial copies to other chromosomes (10, 11, and X) suggesting more

ancient duplicative transpositions had occurred earlier in primate evolution (~33 mya to

~80 mya) (Chapter 3, 4). The chromosomal regions of both ancient LCR16a duplication

insertion and lineage-specific insertions were found to be associated with complex regions

of duplication (Fig 3-7). Based on these analyses, I concluded that LCR16a represents a

20 kb segment of DNA that has been highly unstable and prone to independent

duplication to new euchromatic genomic locations at various points during primate

evolution.

Stepwise Accumulation of LCR16 Duplications around LCR16a Duplication

One of the complex duplications located on 16p12 contained four distinct LCR16

duplications (LCR16v, u, w, and t) flanking the LCR16a duplication. The original theory

of LCR16 duplication with which we were working was that each of the duplications

flanking LCR16a duplicated independently during primate evolution into a common

acceptor region prone to accumulate duplications. These formed large blocks of

duplication containing different blocks of LCR16 duplication, as seen in the pericentromeric regions of chromosome 2 (Horvath, Schwartz et al. 2000; Horvath,

Viggiano et al. 2000), and then these larger blocks of LCR16 duplication, containing

LCR16a, were further duplicated to other regions on chromosome 16 (Table 3-3).

As we investigated the duplication patterns of the LCR16 in other primates (see

Chapter 3 for methods), it became apparent that these duplications were not following the

“simple” two-step model of segmental duplication propagation proposed for pericentromeric regions (Jackson, Rocchi et al. 1999; Horvath, Bailey et al. 2001).

Evolutionary reconstruction and phylogenetic analyses indicated that LCR16a was associated with all other flanking duplicons in humans and other primates, i.e. none of the other duplicons were common to all the duplication blocks. In addition, LCR16a was found to be the only LCR16 duplication seen duplicated as a single duplicon in great-ape genomes; the remaining eleven human duplications were always duplicated at the flanks of the LCR16a duplication (Table 3-3). Based on sequence data of the ancestral location of the LCR16 duplication in outgroup species (macaque and baboon), I determined that all of the LCR16 elements originated from a single-copy ancestral locus (Fig. 3-2). From the comparative FISH and phylogenetic analyses of LCR16 duplications from Old World monkeys through to the great apes, I showed that younger LCR16 duplications accumulate at the periphery of large blocks of LCR16 duplications with evolutionarily more ancient events occurring within the interior, mapping closer to the LCR16a element

(Fig. 5-1).


Fig. 5-1: Step-wise model of duplication accretion around LCR16a. The stripe colored boxes (red LCR16a, blue LCR16w, green LCR16u, and violet LCR16v) represent ancestral blocks of LCR16 prior to duplication. The black striped boxes represent unique sequence flanking the segmental duplications. Solid colored boxes (red LCR16a, blue LCR16w, green LCR16u, and purple LCR16v) represent LCR16 duplications that have been duplicated in association with LCR16a. This schematic shows LCR16a duplicating to new chromosomal regions, adjacent to unique sequence which subsequently becomes duplicated upon successive rounds of LCR16a duplication.

From all this data collected on LCR16 duplication, I produced a stepwise model of

LCR16a duplication, which states that LCR16a duplicatively transposes to new

chromosomal environments and catalyses the duplication of flanking sequence. With each

progressive duplication into a new genomic sequence, other segments become duplicated

with LCR16a, leading to the creation of many other duplicated pieces of DNA. This

would create a “sandwich” effect of older duplications near the core and younger

duplications at the periphery for each new insertion (Fig 5-1).
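
The “sandwich” pattern predicted by this model can be illustrated with a toy simulation. The sketch below is purely schematic and rests on simplifying assumptions made only for illustration: each round of duplicative transposition copies the entire current block and captures one new flanking segment on each side, and segments are labeled only by the round in which they were first duplicated. The function name and segment labels are hypothetical and do not correspond to named LCR16 duplicons.

```python
def accrete(rounds):
    """Simulate stepwise accretion of flanking sequence around a core duplicon.

    Returns the duplication block as a list of (label, round_acquired) tuples,
    ordered left to right along the chromosome.
    """
    block = [("LCR16a", 0)]  # the core duplicon, present before any accretion
    for r in range(1, rounds + 1):
        # The block lands next to unique sequence; on the following round that
        # flanking sequence is copied together with the block.
        block = [(f"left{r}", r)] + block + [(f"right{r}", r)]
    return block


if __name__ == "__main__":
    for label, acquired in accrete(3):
        print(f"{label:8s} acquired in round {acquired}")
```

Reading the result from either end toward the center, the round numbers decrease, so the most recently acquired segments lie at the periphery and the oldest lie next to the core.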

The orangutan genome proved to be the perfect test case for this model because most of the LCR16a duplications occurred lineage-specifically on chromosome 13. Based on our model, we would expect to see LCR16a duplication picking up and duplicating pieces of chromosome 13 sequence as new segmental duplication blocks. This is exactly what I found from my analysis of LCR16a duplications on orangutan chromosome 13: the creation of seven new orangutan-specific chromosome 13 segmental duplications associated with LCR16a that are not seen duplicated in humans or any other primate genome (Table 3-8). This model of the stepwise accumulation of duplications around a single segmental duplication, driving the duplication of the other segmental duplications flanking it, is an under-appreciated model for segmental duplication formation in the primate genome. In addition, this model of a single duplication passively picking up large flanking pieces of DNA has the potential to create new fusion genes or to allow relaxed selection to operate on the gene copies to create new genes through neofunctionalization.

Therefore, this step-wise accumulation of DNA sequences could in principle lead to the rapid formation of many new primate gene families.

Coordinated Deletion and Duplication

Lineage-specific new insertions of LCR16a in chimpanzee, gorilla, and orangutan were invaluable in gaining insight into the mechanism of duplication, because the sequence footprint of the duplication event had not been significantly eroded by passage of large amounts of evolutionary time. There were six new insertions of LCR16a in the aforementioned great apes for which I performed detailed breakpoint analysis. Since I investigated events that occurred in each of the great ape lineages after separation from the human lineage, I refer to the human organization at these sites as “pre-integration” sites and the structure after duplicative transposition as post-integration (Tables 3-9 and

3-10). The breakpoint regions were scrutinized at the base-pair level in the six new insertion sites and twelve accretion boundaries (LCR16a juxtaposed by other segmental duplications, see Chapter 3). The analysis revealed that, out of sixty-four junctions, thirty (47%) had Alu repeats associated with the breakpoints, which is ~3.4 times higher than the genome average of 14% Alu repeats (Table 3-9). The pre-integration sequence of the six new insertion sites was highly enriched for common repeats (77.3% average). In particular, Alu repeats (37% average) were enriched 2.6 times when compared to the genome average (Table 3-10).
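
The fold-enrichment values quoted above are simple ratios against the genome average, and the arithmetic can be checked with a short sketch. The only inputs are figures stated in the text (30 of 64 junctions, 37% Alu content, and a 14% genome average); the function and constant names are hypothetical.

```python
GENOME_ALU_FRACTION = 0.14  # genome average for Alu content, as quoted in the text


def fold_enrichment(observed_fraction, background_fraction=GENOME_ALU_FRACTION):
    """Ratio of an observed fraction to the background (genome-average) fraction."""
    return observed_fraction / background_fraction


if __name__ == "__main__":
    junction_fraction = 30 / 64  # junctions with Alu repeats at the breakpoint
    print(f"junctions with Alu: {junction_fraction:.1%}")
    print(f"fold over genome average: {fold_enrichment(junction_fraction):.2f}")
    print(f"pre-integration Alu fold: {fold_enrichment(0.37):.2f}")
```

The junction ratio comes out slightly lower when computed from the unrounded fraction (30/64) than from the rounded 47%, which is why the text gives it as an approximate value.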

Analysis of the post-integration sequence (based on the comparison of human and non-human large insert sequence) yielded a very unexpected observation. New LCR16a segmental duplications do not insert simply into the genome (i.e. by simply splitting the pre-integration site apart according to the size of the inserted material). Rather, I observed a coordinated deletion of the pre-integration site, averaging 19.5 kb in length (Fig

3-3 and 3-8). For each of the six new segmental duplication insertions, a corresponding deletion of the ancestral region was demonstrated.

A Model of LCR16a-Mediated Duplicative Transposition

My detailed comparative sequence analysis of LCR16a segmental duplication and

flanking duplicons showed remarkable variability in changes in the copy number, in the

composition of duplication blocks, and in the lineage-specific locations among humans

and great apes. In addition, I showed that LCR16a is enriched for Alu repeat sequences and the pre-integration sites of LCR16a duplication are enriched for Alu repeat sequences.

From these observations, we propose a duplication-driven model of DNA transposition of

LCR16a. This transposition is mediated by non-allelic homologous or microhomology recombination between highly identical Alu repeats contained within LCR16a and the

pre-integration sites, followed by the deletion of sequences constituting the repeat-enriched pre-integration site. The deletion of the pre-integration site is probably a consequence of

the segmental duplication process.

There are at least two possible explanations for our observations of LCR16a

duplication in primates. 1) LCR16a acts as a preferred template for the directional repair

of double-stranded breaks in the genome, perhaps in a fashion similar to yeast mating-type

switching (Haber 1998; Johnson, Cheng et al. 2006) or breakpoint-induced repair. The

enrichment of Alus at the periphery of LCR16a and in the pre-integration site would serve

as a source of sequence homology to promote the repair of double strand breaks through

single-strand annealing. This model might explain the coordinate deletion of the pre-

integration site, enrichment of Alu sequences at the breakpoints of both the duplications

and pre-integration site, and the complex architecture of flanking segmental duplication

around LCR16a. However, this model does not explain why Alu-rich LCR16a is the preferred template utilized for directional repair of double strand breaks over other Alu repeat rich regions in the genome. Under this model, it is also unclear why the process would continue to propagate ever larger blocks of LCR16 sequence (> 200kb) encompassing flanking sequence around LCR16a throughout the genome.

2) An alternative model to explain the affinity of flanking sequence to duplicate along with LCR16a is that the duplication of the flanking sequence is an indirect consequence of intense selection acting upon the LCR16a duplications. I have shown previously (Chapter 2) that the gene family encoded in LCR16a has experienced some of

the strongest signatures of positive selection reported for humans and great ape gene

families. This strong pressure of positive selection on LCR16a gene family could

generate the pattern of complex duplications associated with LCR16a as a consequence of

the selective pressure to produce more and more divergent copies of LCR16a at new

locations. I do not favor this model because my data show independent and recurrent fixation of novel LCR16a duplications in the absence of positive selection. In the orangutan genome, for example, LCR16a has duplicated independently and formed complex duplication blocks in the absence of strong positive selection (Ka/Ks = ~1.0).

My preliminary data suggest a similar set of independent events within the marmoset genome. Finally, I have found evidence of more ancient LCR16a duplications on human chromosomes X, 10, and 11 which suggest that this piece of DNA is inherently unstable and was duplicating well before strong signals of positive selection emerged within human/great-ape lineages.

BIRTH OF A GENE FAMILY (MORPHEUS) IN PRIMATES

Darwinian Selection Acting upon Morpheus Exons

The investigation into the type of selection acting upon a gene contained within

LCR16a segmental duplication arose because of an initial observation made while performing a pair-wise alignment analysis of two human copies of LCR16a. This analysis revealed deep troughs of sequence divergence corresponding perfectly to the sequence

position in the duplication of the NPIP exons, which demonstrated that the exons had

many more base-pair changes between the two human LCR16a sequences than did the

intronic sequence (Fig 2-2). This decrease of sequence identity suggested that this gene

family had been undergoing positive selection.

To test the model of selection acting upon the morpheus gene family, I amplified

by PCR, sub-cloned, and sequenced exons 2 and 4 from multiple copies of the gene in

eleven primate species (Homo sapiens, Pan troglodytes, Gorilla gorilla, Pongo pygmaeus,

Hylobates klossii, Cercopithecus aethiops, Papio anubis, Papio hamadryas, Presbytis cristata, and Allenopithecus nigroviridis). I calculated the average Ka (non-synonymous

base-pair changes) and Ks (synonymous base pair changes) for these exons within each primate lineage as well as between the primate species. Exon 2 Ka/Ks quotients were highly significant (P < 0.005) in all human and chimpanzee sequence comparisons. When

Old World monkeys and humans were compared, I observed some of the highest positive selection ever reported (Ka/Ks = 13.0, P < 0.00001) (Table 2-2). This equates to a level of

amino-acid replacement that translates to ~43% amino-acid divergence in the exon 2

coding region between these primates (Table 2-2). Exon 4 also showed a significant

signature of positive selection between human and chimpanzee paralogous copies (Ka/Ks

= 1.77, P < 0.05), but when Old World monkey comparisons were made, no significant positive selection was found (Table 2-3). The observation of lower positive selection over exon 4 is due, not to decreased non-synonymous base-pair changes, but to a greater number of synonymous base-pair changes occurring with the non-synonymous base-pair changes.

From here, I systematically tested positive selection on the remaining exons that comprise the human morpheus gene family. Five out of the seven exons showed significant positive selection, either within human, chimpanzee, gorilla, and orangutan or between Old World monkeys (Fig. 4-2 and Table 4-1). Exon 7 showed the highest positive selection between human and Old World monkey (Ka/Ks = 27.0, P < 0.05) (Fig.

4-2 and Table 4-1). This is due to its small size (36 bp), and the paucity of synonymous

base-pair changes. Exons 5 and 6 did not show significance for positive, negative, or

neutral selection, but most comparisons showed an overall Ka/Ks < 1, demonstrating that

these exons show a trend toward being under negative selection (Fig. 4-2 and Table 4-1).

This result may suggest that exons 5 and 6 have a fixed function that can no longer

tolerate amino-acid changes. I should caution, however, that the values for exon 5 and 6

are not significantly different from a neutral expectation. This may reflect insufficient

evolutionary time for signals of selection to be detected. The positive Darwinian selection acting upon exons 1, 2, 3, 4, and 7 of the morpheus gene family showed two unexpected patterns. First, exons 1–4 most recently experienced positive selection (i.e. since the divergence of the human and chimpanzee lineage) (Table 4-1). Second, most of the positively selected events (as reflected by an excess of amino-acid replacement changes) on exons 1–4 occurred after the orangutan split from the other great apes (Fig. 2-5, 2-6, and 4-6).

As a final test of positive selection acting on the morpheus gene family, I

compared full-length orthologous copies of chimpanzee and gorilla genes against human

gene copies to determine whether there were signatures of positive selection across the

entire open reading frame. The orthologous copies of chimpanzee (average Ka/Ks = 3.5, P

< 0.01) and gorilla (Ka/Ks = 2.7, P < 0.01) showed significant positive selection (Table 4-2).

These data show that the morpheus gene underwent such extreme positive selection

over exons 1, 2, 3, 4, and 7 during the evolution of great apes/humans after the split from

orangutan that the signatures of positive selection can still be seen over the entire gene

model.

Timing of the Positive Selection Event in Relation to the Morpheus Gene Family

An important consideration of Darwinian selection is the timing of when the

selection starts acting upon the new gene or gene family. An understanding of the timing

of selection can give insight into how the gene family has arisen in conjunction with other

events including the evolution of the species. To gain insight into when the morpheus

gene family underwent its burst of positive selection, a phylogenetic approach was used to

investigate the timing of selection on each branch of primate evolution. Phylogenetic

analysis of the seven exons showed significance for four of the exons on the tree’s

branches, thus revealing the timing of the positive selection events. An algorithm designed by Zhang and colleagues was used to estimate selection on internal branches

(Ba/Bs) of the gene family’s phylogenetic tree. Exons 2 (Ba/Bs = 35, P < 0.01), 3 (Ba/Bs =

8.3, P < 0.01), and 4 (Ba/Bs = 4.7, P < 0.05) were determined to have undergone a burst of

positive selection after the split of orangutan from the rest of the great apes (~ <12 mya).

Exon 7 (Ba/Bs = 101.6, P < 0.05) was determined to have undergone positive selection after the split of Old World monkeys from the great apes (~25 mya) (Zhang, Rosenberg et al. 1998) (Fig. 2-5, 2-6, and 4-6). These data show the timing of events for selection on the morpheus gene family. The first event involved positive selection acting on exon 7 after the split of Old World monkeys from the great apes, and coincides with the expansion of the duplicated copies in great apes (Chapter 3). I hypothesize that this is when the morpheus gene family gained function outside the testis. The second event was positive selection occurring on exons 2, 3, and 4 after the split of orangutan from the other great apes (<12 mya). I suggest that this was the point in time when the morpheus gene family further diversified its gene-copy function in the common ancestor of gorillas, chimpanzees, and humans, perhaps in response to a changing environment. The last event was a strong signature of positive selection for exon 2 only in the human and chimpanzee branches, suggesting this exon has experienced positive selection recently during human evolution.

Comparative Expression of the Morpheus Gene Family in Primates

Another key evolutionary feature in the birth of a new gene family (Taylor and

Raes 2004) is changes in expression profile during evolution as inferred by comparisons between closely related species. I investigated the expression profile of morpheus in seven

primate species including representatives of New World monkeys (marmoset), Old World

monkeys (macaque and baboon), and great apes (human, chimpanzee, and orangutan).

Based on Lynch and Conery’s model of subfunctionalization and previous experimental analyses on numerous duplicated genes, I expected to observe a more restricted pattern of expression for certain duplicated copies when compared to species with a single copy

(Lynch and Conery 2000; Mudge and Jackson 2005). In many published reports, duplicate genes are frequently expressed only in the (especially testis) tissue, perhaps as a consequence of chromatin deregulation (She, Horvath et al. 2004).

Unexpectedly, my RT-PCR results on mRNA derived from various non-human primate tissue panels showed the opposite trend. Species with a single copy of the morpheus gene (e.g. baboon) expressed the gene transcript predominantly in the testis (Fig 4-3e). Each great ape species showed ubiquitous expression of the morpheus gene in all tissues for each specific copy of morpheus that was examined (Fig 4-3a, b, and c). An additional Old World monkey (macaque) and a New World monkey (marmoset) were examined for expression of morpheus. These two primate species are problematic because they have duplicated copies of the morpheus gene, which is not the case in the baboon. Nevertheless, they confirmed the results of the expression study of the baboon tissue, showing high levels of expression in the testis but also weaker expression in the kidney and liver. The weak morpheus expression seen outside the testis in the macaque and marmoset may be due to the placement of duplications near new transcription factor binding sites or enhancers, allowing expression of specific copies outside the testis in these species. I conclude, however, based on these analyses of outgroup species that the likely ancestral pattern of expression was testis-specific and that the ubiquitous pattern of expression evolved specifically in the human/great-ape lineages.

Based on all the experimental data presented in this dissertation on the morpheus gene family and the role of LCR16a in genomic rearrangements, I propose a new model of genome and gene family evolution in primates. This model is based on two properties of the LCR16a sequence. First, LCR16a has been a catalyst for duplications in primate genomes, generating submicroscopic rearrangement events (insertions, deletions, and duplications) and complex duplication blocks in the wake of its duplicative transpositions. Second, the LCR16a segment encodes a multi-exon transcript. I theorize that early during primate evolution (~35 mya), the ancestral single-copy LCR16a encoded a potential ORF with transcription limited to the testis. This newly formed transcript originally had no function in the testis, so the transcript sequence was not under selection (negative or positive). The transcript, at this point, evolved neutrally, continuing to accrue random mutations. After the split of Old World monkeys (<25 mya), LCR16a underwent a period of rapid expansion (1 to ~5 copies in the common ancestor of humans and the apes). This occurred concurrently with the emergence of ubiquitous expression of the transcript in orangutans (~12 mya). I propose that these duplications into new genomic contexts (e.g. different segmental duplications at the periphery) provided LCR16a with new regulatory sequences, including enhancers, that changed the expression pattern from testis-specific to expression in all tissues. This exposure to a wider range of transcriptional niches created more opportunities for selection to act upon the sequence encoding these genes. After the split of orangutans (<12 mya), some of the gene transcripts encoded in LCR16a sequence accrued fortuitous mutations that resulted in the ORFs becoming selectively advantageous in the great ape lineages and experiencing strong positive selection.

FUTURE DIRECTIONS

This dissertation provides a starting point for many future avenues of analysis involving segmental duplications, primate-specific genes, and other facets of LCR16a evolution and function in primates. The most obvious future direction is to determine whether the morpheus gene family expresses a protein and to determine its level of expression in various tissues and cells. The major requirement for this research is the design of a specific antibody that recognizes different members of the morpheus gene family, which could be used for western blot analysis, immuno-precipitation along with mass spectrometry-based sequencing of the precipitated proteins, and antibody staining in cells and tissues. Real time RT-PCR of specific copies could be used in conjunction with this work to determine whether changes in protein expression level correspond to changes in morpheus gene transcription. Although such work is inherently descriptive in nature, validation of the protein and its pattern and level of expression may shed light on its potential function (e.g. if it were highly expressed in immune-response cells or the cerebellum, as some early transcription data suggest). In addition to determining whether the morpheus gene family expresses a protein, there are four future directions that these initial analyses of LCR16a could take: (1) population studies of LCR16a; (2) determining the function and phenotype associated with the morpheus gene family; (3) finding other primate-specific segmental duplications with similar characteristics to LCR16a; and (4) determining whether deletion of the pre-integration site is a common feature of interspersed segmental duplication. I will discuss each in turn.

Ascertaining Function and Phenotype for the Morpheus Gene Family

Assuming that the morpheus gene family has already been shown to express a protein, there are several ways to test the functional properties of this protein family. I will focus on two types of possible experiments: antibody assays and RNA interference by siRNA. Immuno-staining of human tissues with morpheus antibody would provide fundamental information on the location of this gene family's products in different tissues and, potentially, their subcellular localization. The multiple copies and the high sequence identity of various copies of the gene family, however, limit the number of unique epitopes, and thus information on specific copies could not be reliably obtained in this way (Ramos-Vara 2005). To better understand the dynamic localization of the protein family at the level of single proteins, I would design mouse models of each expressed human gene using BAC transgenics. In effect this would generate a model organism that produces a single copy of the gene, so that cross-reactivity of the antibody is no longer an issue: only one copy would be expressed in each transgenic mouse. These transgenic mice expressing individual copies of human morpheus genes would be used to determine whether different copies localize to different areas of cells and tissues or whether they show an identical pattern of localization (Heintz 2001; Ramos-Vara 2005). Subsequently, these antibodies could be used in co-immuno-precipitation analyses to identify other proteins that interact with the morpheus proteins. There is good reason to suspect that the strong signature of positive selection on the morpheus gene family means that it is evolving in response to another agent, perhaps another protein (Piccini, Ballarati et al. 2001). Immunoprecipitated proteins would be further analyzed by mass spectrometry sequencing methods to identify the proteins associating with morpheus proteins. Knowing the function of proteins interacting with the morpheus protein family would help to understand what type of protein it is and whether it functions in a wider protein network (Piccini, Ballarati et al. 2001).

RNA interference (RNAi) could be used in human cell lines to silence morpheus protein expression (Davidson and Boudreau 2007). RNAi functions according to the mechanism of gene silencing through small noncoding RNAs that are complementary to the target mRNA, creating duplex RNAs and leading to the RNA silencing pathway

(Davidson and Boudreau 2007). The most complex part of this analysis will be designing an siRNA that will down-regulate most if not all of the morpheus copies in human cell

lines. Once the siRNAs are designed, they can be used in human cell lines to perturb the

protein expression in order to determine any phenotypic changes in the cells. Phenotypic

changes in cells that can be measured from the siRNA experiments are changes in cell

, rates of apoptosis, cellular structure, how the cells metabolize different molecules,

cell to cell interaction, growth rates, etc (Davidson and Boudreau 2007). This might yield insight into the cellular function of the protein.
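As an illustration of the design problem described above, the following is a minimal sketch of how candidate siRNA target windows shared by most or all copies might be enumerated. The function name, the toy sequences, and the window length are hypothetical and chosen only for illustration; a real design would also have to consider factors such as GC content and off-target matches, which this sketch ignores.

```python
from collections import Counter


def shared_kmers(copies, k=21, min_fraction=1.0):
    """Return the k-mers found in at least `min_fraction` of the copies.

    `copies` maps a copy name to its transcript sequence. Both the names
    and the sequences used below are hypothetical placeholders.
    """
    counts = Counter()
    for seq in copies.values():
        seq = seq.upper()
        # Count each k-mer once per copy, however often it occurs there.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    needed = min_fraction * len(copies)
    return sorted(kmer for kmer, n in counts.items() if n >= needed)


# Toy example: three short made-up sequences differing at single positions.
toy_copies = {
    "copy_A": "ATGGCGTACGTTAGC",
    "copy_B": "ATGGCGTACGTTGGC",
    "copy_C": "TTGGCGTACGTTAGC",
}

# Windows present in every copy are candidates for silencing all copies.
candidates = shared_kmers(toy_copies, k=8)
print(candidates)
```

Lowering `min_fraction` relaxes the requirement from "all copies" to "most copies", which corresponds to the trade-off noted in the text between coverage of the gene family and the availability of conserved target sites.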

Human Copy-Number and Structural Variation of LCR16a

My research has clearly shown that LCR16a has been duplicated and deleted in a lineage-specific manner in chimpanzee, gorilla, and orangutan. What is not understood is whether LCR16a is still undergoing these duplication and deletion events in the human population. Addressing this question will require a two-pronged approach. The first prong involves using a sensitive assay to detect copy-number variation, such as real-time quantitative PCR (e.g. TaqMan), to rapidly assess the number of LCR16a duplications within a large and diverse set of individual human DNA samples (Arya, Shergill et al. 2005). There are two caveats that could decrease the power of this analysis. The most critical is that our primers are designed in a conserved sequence to assay all twenty-four copies of the LCR16a duplication (Arya, Shergill et al. 2005). This is where the power of real-time PCR lies: in discerning a gain or loss of one LCR16a copy without being overwhelmed by the signal of the other twenty-four copies of LCR16a already in the genome (Arya, Shergill et al. 2005). Even so, if we cannot discern single-copy changes, we would still be able to detect gains and losses arising from multiple insertion and deletion events, and this would provide a preliminary survey of the degree of variability occurring in human populations. The second caveat is that we would not be able to detect a partial duplication of LCR16a that does not contain the targeted PCR assay or does not amplify because of unknown sequence differences. For example, there may be older copies of LCR16a with paralogous sequence differences that are not represented within the human genome assembly. To overcome this problem, I would design multiple primer pairs spaced a few kb apart to avoid missing a partial duplication or deletion.
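To make the first caveat concrete, the sketch below estimates how small the expected real-time PCR signal is when one copy is gained or lost against a background of many copies. It is a back-of-the-envelope calculation under stated assumptions, not a description of an actual assay: template amount is assumed to scale linearly with copy number, amplification is assumed to be perfectly efficient (doubling each cycle) unless specified otherwise, and the function name is hypothetical. The baseline of twenty-four copies is taken from the text.

```python
import math


def expected_delta_ct(ref_copies, test_copies, efficiency=1.0):
    """Cycles by which the test sample reaches threshold earlier than the reference.

    Assumes product grows by a factor of (1 + efficiency) per cycle and that
    starting template is proportional to copy number (both are assumptions).
    A negative value means the test sample reaches threshold later.
    """
    return math.log(test_copies / ref_copies) / math.log(1.0 + efficiency)


# Expected threshold-cycle shift for gains of 1, 2, 6, and 24 copies
# relative to a baseline of 24 copies.
for gained in (1, 2, 6, 24):
    shift = expected_delta_ct(24, 24 + gained)
    print(f"+{gained} copies: Ct shift = {shift:.3f} cycles")
```

Under these assumptions, a single-copy gain on a 24-copy background shifts the threshold cycle by only about 0.06 cycles, whereas a doubling shifts it by one full cycle. This is why single-copy changes may be hard to discern while larger gains and losses should remain detectable.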

Another potentially complementary approach would be to design a high density oligonucleotide microarray specifically for the known LCR16a copies to assess variation among individuals relative to a single reference individual. Preliminary data from the Eichler laboratory show that this approach has promise for detecting variation within segmental duplications including LCR16a (Sharp, Locke et al. 2005; Locke, Sharp et al. 2006; Sharp, Hansen et al. 2006; Sharp, Selzer et al. 2007). In principle, a targeted microarray of oligonucleotides designed to specific regions on chromosome 16 could further define the duplication or deletion event and more exactly map its breakpoints. This would have two benefits: first, validating the real-time PCR results, and second, determining what flanking sequence was duplicated or deleted along with the LCR16a duplications (Sharp, Locke et al. 2005; Locke, Sharp et al. 2006; Sharp, Hansen et al. 2006; Sharp, Selzer et al. 2007). If successful, one could conceivably use these techniques to scan human disease populations that have no known genetic cause to see whether LCR16a containing the morpheus gene has undergone duplication or deletion in these genomes and, thus, possibly gain insight into particular phenotypes associated with this primate-specific gene family (Locke, Sharp et al. 2006; Sharp, Hansen et al. 2006; Sharp, Selzer et al. 2007; Wong, deLeeuw et al. 2007).

LCR16a as a Model for the Expansion of Other Human Segmental Duplications?

From the insight gained by studying the LCR16a segmental duplication, we could hypothesize that this is not an isolated form of genomic evolution in primates but perhaps a newly evolved process that allows organisms with increased gestation times to adapt quickly to an ever-changing environment. Using analytical techniques developed in this dissertation along with bioinformatic techniques developed in this lab by Zhaoshi Jiang, core segmental duplications acting similarly to LCR16a can be teased out of the human and great-ape genomes (Bailey, Yavor et al. 2001; Bailey, Gu et al. 2002; Bailey, Yavor et al. 2002; Bailey and Eichler 2006). First, we would need to define all the duplications in the human genome, using Old World monkeys as the out-group to identify ancestral loci for the segmental duplications, in order to develop a non-redundant sequence library of duplications and identify duplicons with attributes similar to LCR16a, defined here as cores (Zhaoshi Jiang submitted). Cores would appear in the human genome as segmental duplications that are duplicated and most frequently associated with flanking duplications. These cores would also exist as solitary elements, without any flanking sequences (Jiang, submitted). Then, we would use a BAC library hybridization approach (Chapter 3) on different primate genomes to determine whether cores on other chromosomes behave similarly to LCR16a, i.e. whether they have duplicated independently and whether they are associated with the formation of complicated intrachromosomal duplication blocks where lineage-specific duplications map at the periphery. Next, we would analyze the core duplications for genes and, if they contain genes as LCR16a does, analyze those genes for evidence of positive selection. If positive selection were found, then I would examine the expression properties, as described in this dissertation. Preliminary data from our laboratory and others suggest that this may be a general property of intrachromosomal duplications in our genome (Bailey, Church et al. 2004; Zody, Garber et al. 2006) (Zhaoshi Jiang in press).

Coordinated Deletion of Pre-integration Sites: A Common Feature of Primate SDs?

A common feature of primate segmental duplications is that they are distributed in an interspersed fashion, either interchromosomally or intrachromosomally, unlike those in other organisms, where duplications are most frequently organized in tandem. The mechanism behind the interspersion of duplications in primate genomes is still not understood. In this thesis, we have described an interesting property: lineage-specific duplications show evidence of coordinated deletion of repeat-rich pre-integration sites. To show that this is a common mechanism underlying the interspersed nature of segmental duplications in primates, new insertions by other segmental duplications, perhaps core elements, will have to be analyzed systematically to determine a) whether lineage-specific insertions occur in regions enriched for common repeats and b) whether these pre-integration sites are deleted during the new insertion of the core duplications.
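One simple way to approach criterion (a) would be to measure what fraction of each pre-integration window is covered by annotated common repeats and compare it with a background value. The sketch below shows only the coverage calculation. The function name, the half-open coordinate convention, and the example intervals are assumptions made for illustration and are not data from this dissertation.

```python
def repeat_fraction(window, repeats):
    """Fraction of a window covered by repeat annotations.

    `window` is a (start, end) pair and `repeats` is a list of (start, end)
    intervals, all half-open and on the same sequence. Overlapping repeat
    annotations are merged so that no base is counted twice.
    """
    start, end = window
    # Keep only repeats overlapping the window, clipped to its boundaries.
    clipped = sorted(
        (max(s, start), min(e, end))
        for s, e in repeats
        if e > start and s < end
    )
    covered = 0
    cur_start, cur_end = None, None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            # A new, non-overlapping block begins: bank the previous one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            # Overlapping or adjacent: extend the current block.
            cur_end = max(cur_end, e)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / (end - start)


# Hypothetical 1 kb pre-integration window with three annotated repeats,
# two of which overlap and one of which extends past the window.
example = repeat_fraction((0, 1000), [(-50, 200), (150, 400), (900, 1200)])
print(example)
```

A window would then be called repeat-rich if its fraction clearly exceeded the background for the surrounding region; the choice of background and threshold is left open here.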

CONCLUSION

The work included in this study represents a detailed analysis of a set of human segmental duplications containing a primate-specific gene family and its impact on genomic architecture throughout primate evolution. It has provided new insights into how a single segmental duplication can act as a core duplication, perhaps driving the duplication of flanking genomic sequences. This has created new segmental duplications focused around a core duplication, producing lineage-specific duplication events that continue to alter primate genomes in substantive ways. The new insertion events are also associated with the deletion of pre-integration sites and genes contained within them, leading to a very dynamic process in the evolution of primate genomes. More importantly, this analysis has also given insight into the evolution of a novel gene and gene family during primate evolution. Its path has been complicated, involving changes in gene structure, changes in expression patterns, and dramatic signatures of positive selection. I was very lucky and blessed to have been present at a time in genome research when I had the tools, reagents and sequence data at my disposal to study this duplication and gene family at such a detailed level. The study of this single duplication, LCR16a, has yielded more than enough data and avenues of research for several doctoral projects. This is only one step toward understanding the complex and dynamic nature of human and primate genome evolution, but I believe that the LCR16a model serves as a blueprint for future analyses that will be key to unravelling how duplications have shaped the evolution of our species.


BIBLIOGRAPHY

Akiva, P., A. Toporik, et al. (2006). "Transcription-mediated gene fusion in the human genome." Genome Res 16(1): 30-6. Antonarakis, S. E., H. H. Kazazian, Jr., et al. (1985). "DNA polymorphism and molecular pathology of the human globin gene clusters." Hum Genet 69(1): 1-14. Arya, M., I. S. Shergill, et al. (2005). "Basic principles of real-time quantitative PCR." Expert Rev Mol Diagn 5(2): 209-19. Babushok, D. V., E. M. Ostertag, et al. (2007). "Current topics in genome evolution: molecular mechanisms of new gene formation." Cell Mol Life Sci 64(5): 542-54. Bailey, J. A., L. Carrel, et al. (2000). "Molecular evidence for a relationship between LINE-1 elements and X chromosome inactivation: the Lyon repeat hypothesis." Proc Natl Acad Sci U S A 97(12): 6634-9. Bailey, J. A., D. M. Church, et al. (2004). "Analysis of segmental duplications and genome assembly in the mouse." Genome Res 14(5): 789-801. Bailey, J. A. and E. E. Eichler (2003). Genome-wide detection of segmental duplication within mammalian organisms. Proceedings of the 68th Cold Spring Harbor Symposium: Genome of Homo sapiens. J. Ebert. New York, Cold Spring Harbor Press. Bailey, J. A. and E. E. Eichler (2006). "Primate segmental duplications: crucibles of evolution, diversity and disease." Nat Rev Genet 7(7): 552-64. Bailey, J. A., L. Giu, et al. (2003). "An Alu transposition model for the origin and expansion of human segmental duplications." Am J Hum Genet 73: 823-34. Bailey, J. A., Z. Gu, et al. (2002). "Recent segmental duplications in the human genome." Science 297(5583): 1003-7. Bailey, J. A., A. M. Yavor, et al. (2001). "Segmental duplications: organization and impact within the current human genome project assembly." Genome Res 11(6): 1005-17. Bailey, J. A., A. M. Yavor, et al. (2002). "Human-specific duplication and mosaic transcripts: the recent paralogous structure of ." Am J Hum Genet 70(1): 83-100. Bamshad, M. and S. P. Wooding (2003). "Signatures of natural selection in the human genome." 
Nat Rev Genet 4(2): 99-111. Batzer, M. A. and P. L. Deininger (2002). "Alu repeats and human genomic diversity." Nat Rev Genet 3(5): 370-9. Batzer, M. A., G. E. Kilroy, et al. (1990). "Structure and variability of recently inserted Alu family members." Nucleic Acids Res 18(23): 6793-8. Bayarsaihan, D., B. Enkhmandakh, et al. (2003). "Homez, a homeobox leucine zipper gene specific to the vertebrate lineage." Proc Natl Acad Sci U S A 100(18): 10358-63. Birtle, Z., L. Goodstadt, et al. (2005). "Duplication and positive selection among hominin- specific PRAME genes." BMC Genomics 6: 120. Bondeson, M. L., N. Dahl, et al. (1995). "Inversion of the IDS gene resulting from recombination with IDS-related sequences is a common cause of the Hunter syndrome." Hum Mol Genet 4(4): 615-21.

Bridges, C. B., E. N. Skoog, et al. (1936). "Genetical and Cytological Studies of a Deficiency (Notopleural) in the Second Chromosome of Drosophila Melanogaster." Genetics 21(6): 788-95. Britten, R. J. and D. E. Kohne (1968). "Repeated sequences in DNA. Hundreds of thousands of copies of DNA sequences have been incorporated into the genomes of higher organisms." Science 161(841): 529-40. Cheng, Z., M. Ventura, et al. (2005). "A genome-wide comparison of recent chimpanzee and human segmental duplications." Nature 437(7055): 88-93. Cheung, J., X. Estivill, et al. (2003). "Genome-wide detection of segmental duplications and potential assembly errors in the human genome sequence." Genome Biol 4(4): R25. Ciccarelli, F. D., C. von Mering, et al. (2005). "Complex genomic rearrangements lead to novel primate gene function." Genome Res 15(3): 343-51. Collins, F. S. and S. M. Weissman (1984). "The of human hemoglobin." Prog Res Mol Biol 31: 315-462. Consortium, I. S. (2001). "Initial sequencing and analysis of the human genome." Nature 409: 860-920. Courseaux, A. and J. L. Nahon (2001). "Birth of two chimeric genes in the Hominidae lineage." Science 291(5507): 1293-7. CSAC (2005). "Initial sequence of the chimpanzee genome and comparison with the human genome." Nature 437: 69-87. Darwin, C. (1859). On the origin of species by means of natural selection : or, The preservation of favoured races in the struggle for life. London, John Murray. Davidson, B. L. and R. L. Boudreau (2007). "RNA interference: a tool for querying nervous system function and an emerging therapy." Neuron 53(6): 781-8. Davis, L. I. and G. Blobel (1986). "Identification and characterization of a nuclear pore complex protein." Cell 45(5): 699-709. de la Chapelle, A., R. Herva, et al. (1981). "A deletion in chromosome 22 can cause DiGeorge syndrome." Hum Genet 57(3): 253-6. de Parseval, N. and T. Heidmann (2005). "Human endogenous retroviruses: from infectious elements to human genes." 
Cytogenet Genome Res 110(1-4): 318-32. Deininger, P. L. and M. A. Batzer (1999). "Alu repeats and human disease." Mol Genet Metab 67(3): 183-93. Dewannieux, M. and T. Heidmann (2005). "LINEs, SINEs and processed : parasitic strategies for genome modeling." Cytogenet Genome Res 110(1-4): 35- 48. Dorschner, M. O., V. P. Sybert, et al. (2000). "NF1 microdeletion breakpoints are clustered at flanking repetitive sequences." Hum Mol Genet 9(1): 35-46. Duda, T. F. and S. R. Palumbi (1999). "Molecular genetics of ecological diversification: duplication and rapid evolution of toxin genes of the venomous gastropod Conus." Proc Natl Acad Sci U S A 96(12): 6820-3. Eichler, E. E. (1998). "Masquerading repeats: Paralogous pitfalls of the Human Genome." Genome Res. 8: 758-762. Eichler, E. E. (2001). "Recent duplication, domain accretion and the dynamic mutation of the human genome." Trends Genet 17(11): 661-9. Eichler, E. E., R. A. Clark, et al. (2004). "An assessment of the sequence gaps: unfinished business in a finished human genome." Nat Rev Genet 5(5): 345-54. Eichler, E. E., S. M. Hoffman, et al. (1998). "Complex beta-satellite repeat structures and the expansion of the zinc finger gene cluster in 19p12." Genome Res. 8: 791-808.

Eichler, E. E., F. Lu, et al. (1996). "Duplication of a gene-rich cluster between 16p11.1 and Xq28: a novel pericentromeric-directed mechanism for paralogous genome evolution." Hum Molec Genet 5: 899-912. Emanuel, B. S. and T. H. Shaikh (2001). "Segmental duplications: an 'expanding' role in genomic instability and disease." Nat Rev Genet 2(10): 791-800. Gilbert, N., S. Lutz, et al. (2005). "Multiple fates of L1 retrotransposition intermediates in cultured human cells." Mol Cell Biol 25(17): 7780-95. Gilbert, N., S. Lutz-Prigge, et al. (2002). "Genomic deletions created upon LINE-1 retrotransposition." Cell 110(3): 315-25. Gilbert, W. (1978). "Why genes in pieces?" Nature 271(5645): 501. Goodman, M. (1999). "The genomic record of Humankind's evolutionary roots." Am J Hum Genet 64(1): 31-9. Haber, J. E. (1998). "Mating-type gene switching in Saccharomyces cerevisiae." Annu Rev Genet 32: 561-99. Harris, E. E. and D. Meyer (2006). "The molecular signature of selection underlying human adaptations." Am J Phys Anthropol Suppl 43: 89-130. Harrison, G. A. (1988). Human biology : an introduction to human evolution, variation, growth, and adaptability. Oxford ; New York, Oxford University Press. He, X. and J. Zhang (2005). "Rapid subfunctionalization accompanied by prolonged and substantial neofunctionalization in duplicate gene evolution." Genetics 169(2): 1157-64. Heintz, N. (2001). "BAC to the future: the use of bac transgenic mice for neuroscience research." Nat Rev Neurosci 2(12): 861-70. Higgins, D. G., J. D. Thompson, et al. (1996). "Using CLUSTAL for multiple sequence alignments." Methods Enzymol 266: 383-402. Higgs, D. R., M. A. Vickers, et al. (1989). "A review of the molecular genetics of the human alpha-globin gene cluster." Blood 73(5): 1081-104. Holland, P. W. (1997). "Vertebrate evolution: something fishy about Hox genes." Curr Biol 7(9): R570-2. Horvath, J., S. Schwartz, et al. (2000). 
"The mosaic structure of a 2p11 pericentromeric segment: A strategy for characterizing complex regions of the human genome." Genome Res 10: 839-52. Horvath, J., L. Viggiano, et al. (2000). "Molecular structure and evolution of an alpha/non-alpha satellite junction at 16p11." Hum Molec Genet 9: 113-123. Horvath, J. E., J. A. Bailey, et al. (2001). "Lessons from the human genome: transitions between euchromatin and heterochromatin." Hum Mol Genet 10(20): 2215-23. Horvath, J. E., C. L. Gulden, et al. (2003). "Using a Pericentromeric to Recapitulate the Phylogeny and Expansion of Human Centromeric Segmental Duplications." Mol Biol Evol. Houck, C. M., F. P. Rinehart, et al. (1979). "A ubiquitous family of repeated DNA sequences in the human genome." J Mol Biol 132(3): 289-306. Hughes, A. L. (1994). "The evolution of functionally novel proteins after gene duplication." Proc Biol Sci 256(1346): 119-24. Hughes, A. L. and M. Nei (1988). "Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection." Nature 335(6186): 167-70. IHGSC (2001). "Initial sequencing and analysis of the human genome." Nature 409: 860- 921. Jackson, M., M. Rocchi, et al. (1999). "Characterisation of the heterochromatin/euchromatin boundary at 10q11 and identification of novel

transcripts by repeat induced instability." Am J Hum Genet 65 (Supplement): A56. Jackson, M. S., K. Oliver, et al. (2005). "Evidence for widespread reticulate evolution within human duplicons." Am J Hum Genet 77(5): 824-40. Ji, Y., E. E. Eichler, et al. (2000). "Structure of chromosomal duplicons and their role in mediating human genomic disorders." Genome Res 10(5): 597-610. Ji, Y., M. J. Walkowicz, et al. (1999). "The ancestral gene for transcribed, low-copy repeats in the Prader-Willi/Angelman region encodes a large protein implicated in protein trafficking, which is deficient in mice with neuromuscular and spermiogenic abnormalities." Hum Mol Genet 8(3): 533-542. Johnson, J. M., J. Castle, et al. (2003). "Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays." Science 302(5653): 2141-4. Johnson, M. E., Z. Cheng, et al. (2006). "Recurrent duplication-driven transposition of DNA during hominoid evolution." Proc Natl Acad Sci U S A 103(47): 17626-31. Johnson, M. E., L. Viggiano, et al. (2001). "Positive selection of a gene family during the emergence of humans and African apes." Nature 413(6855): 514-9. Jurka, J., O. Kohany, et al. (2004). "Duplication, coclustering, and selection of human Alu ." Proc Natl Acad Sci U S A 101(5): 1268-72. Katju, V. and M. Lynch (2003). "The structure and early evolution of recently arisen gene duplicates in the genome." Genetics 165(4): 1793-803. Katju, V. and M. Lynch (2006). "On the formation of novel genes by duplication in the Caenorhabditis elegans genome." Mol Biol Evol 23(5): 1056-67. Khaitovich, P., W. Enard, et al. (2006). "Evolution of primate gene expression." Nat Rev Genet 7(9): 693-702. Khan, H., A. Smit, et al. (2006). " and tempo of amplification of human LINE-1 retrotransposons since the origin of primates." Genome Res 16(1): 78-87. Kumar, S. and S. B. Hedges (1998). "A molecular timescale for vertebrate evolution." Nature 392(6679): 917-20. Ledbetter, D. H., V. M. 
Riccardi, et al. (1981). "Deletions of chromosome 15 as a cause of the Prader-Willi syndrome." N Engl J Med 304(6): 325-9. Li, W. (1997). Molecular Evolution. Sunderland, MA, Sinauer Associates. Li, W. H., Z. Gu, et al. (2001). "Evolutionary analyses of the human genome." Nature 409(6822): 847-9. Lichter, P., P. Bray, et al. (1992). "Clustering of C2-H2 zinc finger motif sequences within telomeric and fragile sites of human chromosomes." Genomics 13: 999- 1007. Liu, G., S. Zhao, et al. (2003). "Analysis of primate genomic variation reveals a repeat- driven expansion of the human genome." Genome Res 13(3): 358-68. Liu, M. and A. Grigoriev (2004). "Protein domains correlate strongly with exons in multiple eukaryotic genomes--evidence of exon shuffling?" Trends Genet 20(9): 399-403. Locke, D. P., A. J. Sharp, et al. (2006). "Linkage disequilibrium and heritability of copy- number polymorphisms within duplicated regions of the human genome." Am J Hum Genet 79(2): 275-90. Loftus, B., U. Kim, et al. (1999). "Genome duplications and other features in 12 Mbp of DNA sequence from human chromosome 16p and 16q." Genomics 60: 295-308. Long, M., E. Betran, et al. (2003). "The origin of new genes: glimpses from the young and old." Nat Rev Genet 4(11): 865-75.

Lupski, J. R. (1998). "Genomic disorders: structural features of the genome can lead to DNA rearrangements and human disease traits." Trends Genet 14(10): 417-22. Lynch, M. and J. S. Conery (2000). "The evolutionary fate and consequences of duplicate genes." Science 290(5494): 1151-5. Lynch, M. and A. Force (2000). "The probability of duplicate gene preservation by subfunctionalization." Genetics 154(1): 459-73. Lynch, M. and V. Katju (2004). "The altered evolutionary trajectories of gene duplicates." Trends Genet 20(11): 544-9. Lynch, M., M. O'Hely, et al. (2001). "The probability of preservation of a newly arisen gene duplicate." Genetics 159(4): 1789-804. Lynch, V. J. (2007). "Inventing an arsenal: adaptive evolution and neofunctionalization of phospholipase A2 genes." BMC Evol Biol 7: 2. Maniatis, T., E. F. Fritsch, et al. (1980). "The molecular genetics of human hemoglobins." Annu Rev Genet 14: 145-78. Martin, J., C. Han, et al. (2004). "The sequence and analysis of duplication-rich human chromosome 16." Nature 432(7020): 988-94. Mayer, J. and E. Meese (2005). "Human endogenous retroviruses in the primate lineage and their influence on host genomes." Cytogenet Genome Res 110(1-4): 448-56. Mazzarella, R. and D. Schlessinger (1998). "Pathological consequences of sequence duplications in the human genome." Genome Res 8(10): 1007-21. McConkey, E. H. (2004). "Orthologous numbering of great ape and human chromosomes is essential for ." Cytogenet Genome Res 105(1): 157-8. Messier, W. and C. B. Stewart (1997). "Episodic adaptive evolution of primate lysozymes." Nature 385(6612): 151-4. Modrek, B. and C. Lee (2002). "A genomic view of alternative splicing." Nat Genet 30(1): 13-9. Modrek, B. and C. J. Lee (2003). "Alternative splicing in the human, mouse and rat genomes is associated with an increased frequency of exon creation and/or loss." Nat Genet 34(2): 177-80. Moran, J. V., R. J. DeBerardinis, et al. (1999). "Exon shuffling by L1 retrotransposition." 
Science 283(5407): 1530-4. Mudge, J. M. and M. S. Jackson (2005). "Evolutionary implications of pericentromeric gene expression in humans." Cytogenet Genome Res 108(1-3): 47-57. Muller, H. J. (1936). "Bar duplication." Science 83: 528-530. Myers, E. W. and W. Miller (1988). "Optimal alignments in linear space." Comput Appl Biosci 4(1): 11-7. Nassif, N., J. Penney, et al. (1994). "Efficient copying of nonhomologous sequences from ectopic sites via P-element-induced gap repair." Mol Cell Biol 14(3): 1613-25. Nei, M. and T. Gojobori (1986). "Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions." Mol Biol Evol 3(5): 418-26. Nei, M. and S. Kumar (2000). Molecular Evolution and Phylogenetics. New York, Oxford University Press. Nickerson, E., F. Greenberg, et al. (1995). "Deletions of the elastin gene at 7q11.23 occur in approximately 90% of patients with Williams syndrome." Am J Hum Genet 56(5): 1156-61. Nielsen, R. (2005). "Molecular signatures of natural selection." Annu Rev Genet 39: 197- 218. Nurminsky, D. I., M. V. Nurminskaya, et al. (1998). " of a newly evolved sperm-specific gene in Drosophila." Nature 396(6711): 572-5.

Ohno, S. (1970). Evolution by Gene Duplication. Berlin/Heidelberg/New York, Springer Verlag. Ohno, S. (1972). "An argument for the genetic simplicity of man and other mammals." J. Hum. Evol. 1: 651-662. Ohno, S. (1973). "Ancient linkage groups and frozen accidents." Nature 244(5414): 259-62. Ohno, S., U. Wolf, et al. (1968). "Evolution from fish to mammals by gene duplication." Hereditas 59: 169-187. Orkin, S. and H. Kazazian (1984). "The mutation and polymorphism of the human beta-globin gene and its surrounding DNA." Ann. Rev. Genet. 18: 131-171. Ostertag, E. M. and H. H. Kazazian Jr (2001). "BIOLOGY OF MAMMALIAN L1 RETROTRANSPOSONS doi:10.1146/annurev.genet.35.102401.091032." Annual Review of Genetics 35(1): 501-538. Parra, G., A. Reymond, et al. (2006). "Tandem chimerism as a means to increase protein complexity in the human genome." Genome Res 16(1): 37-44. Parsons, J. (1995). "Miropeats: graphical DNA sequence comparisons." Comput Appl Biosci 11: 615-619. Paulding, C. A., M. Ruvolo, et al. (2003). "The Tre2 (USP6) is a hominoid-specific gene." Proc Natl Acad Sci U S A 100(5): 2507-11. Piccini, I., L. Ballarati, et al. (2001). "The structure of duplications on human acrocentric chromosome short arms derived by the analysis of 15p." Hum Genet 108(6): 467-77. Puente, X. S., A. Gutierrez-Fernandez, et al. (2005). "Comparative genomic analysis of human and chimpanzee proteases." Genomics 86(6): 638-47. Ramos-Vara, J. A. (2005). "Technical aspects of immunohistochemistry." Vet Pathol 42(4): 405-26. Rastogi, S. and D. A. Liberles (2005). "Subfunctionalization of duplicated genes as a transition state to neofunctionalization." BMC Evol Biol 5(1): 28. Richard, F., M. Lombard, et al. (2000). "Phylogenetic origin of human chromosomes 7, 16, and 19 and their homologs in placental mammals." Genome Res 10(5): 644-51. Rinehart, F. P., T. G. Ritch, et al. (1981). 
"Renaturation rate studies of a single family of interspersed repeated sequences in human deoxyribonucleic acid." Biochemistry 20(11): 3003-10. Rote, N. S., S. Chakrabarti, et al. (2004). "The role of human endogenous retroviruses in trophoblast differentiation and placental development." Placenta 25(8-9): 673-83. Roth, C., S. Rastogi, et al. (2007). "Evolution after gene duplication: models, mechanisms, sequences, systems, and organisms." J Exp Zoolog B Mol Dev Evol 308(1): 58-73. Rowold, D. J. and R. J. Herrera (2000). "Alu elements and the human genome." Genetica 108(1): 57-72. Samonte, R. V. and E. E. Eichler (2002). "Segmental duplications and the evolution of the primate genome." Nat Rev Genet 3(1): 65-72. Scherer, S. E., D. M. Muzny, et al. (2006). "The finished DNA sequence of human chromosome 12." Nature 440(7082): 346-51. Scherer, S. W., J. Cheung, et al. (2003). "Human chromosome 7: DNA sequence and biology." Science 300(5620): 767-72. Sen, S. K., K. Han, et al. (2006). "Human genomic deletions mediated by recombination between Alu elements." Am J Hum Genet 79(1): 41-53.

Shaikh, T. H., H. Kurahashi, et al. (2001). "Evolutionarily conserved low copy repeats (LCRs) in 22q11 mediate deletions, duplications, translocations, and genomic instability: an update and literature review." Genet Med 3(1): 6-13.
Shakhnovich, B. E. and E. V. Koonin (2006). "Origins and impact of constraints in evolution of gene families." Genome Res 16(12): 1529-36.
Sharp, A. J., S. Hansen, et al. (2006). "Discovery of previously unidentified genomic disorders from the duplication architecture of the human genome." Nat Genet 38(9): 1038-42.
Sharp, A. J., D. P. Locke, et al. (2005). "Segmental duplications and copy-number variation in the human genome." Am J Hum Genet 77(1): 78-88.
Sharp, A. J., R. R. Selzer, et al. (2007). "Characterization of a recurrent 15q24 microdeletion syndrome." Hum Mol Genet 16(5): 567-72.
She, X., J. E. Horvath, et al. (2004). "The structure and evolution of centromeric transition regions within the human genome." Nature 430(7002): 857-64.
She, X., G. Liu, et al. (2006). "A preliminary comparative analysis of primate segmental duplications shows elevated substitution rates and a great-ape expansion of intrachromosomal duplications." Genome Res 16(5): 576-83.
Skrabanek, L. and K. H. Wolfe (1998). "Eukaryote genome duplication - where's the evidence?" Curr Opin Genet Dev 8(6): 694-700.
Stallings, R., N. Doggett, et al. (1992). "Evaluation of a cosmid contig physical map of human chromosome 16." Genomics 13: 1031-1039.
Stallings, R., N. Doggett, et al. (1992). "Chromosome 16-specific repetitive DNA sequences that map to chromosomal regions known to undergo breakage/rearrangement in leukemia cells." Genomics 7: 332-338.
Stallings, R., S. Whitmore, et al. (1993). "Refined physical mapping of chromosome 16-specific low-abundance repetitive DNA sequences." Cytogenet. Cell Genet. 63: 97-101.
Stankiewicz, P., C. J. Shaw, et al. (2004). "Serial segmental duplications during primate evolution result in complex human genome architecture." Genome Res 14(11): 2209-20.
Sugawara, N. and J. E. Haber (1992). "Characterization of double-strand break-induced recombination: homology requirements and single-stranded DNA formation." Mol Cell Biol 12(2): 563-75.
Swanson, W. J., A. G. Clark, et al. (2001). "Evolutionary EST analysis identifies rapidly evolving male reproductive proteins in Drosophila." Proc Natl Acad Sci U S A 98(13): 7375-9.
Swanson, W. J. and V. D. Vacquier (1995). "Extraordinary divergence and positive Darwinian selection in a fusagenic protein coating the acrosomal process of abalone spermatozoa." Proc Natl Acad Sci U S A 92(11): 4957-61.
Swanson, W. J., Z. Yang, et al. (2001). "Positive Darwinian selection drives the evolution of several female reproductive proteins in mammals." Proc Natl Acad Sci U S A 98(5): 2509-14.
Tajima, F. (1993). "Simple methods for testing the molecular evolutionary clock hypothesis." Genetics 135(2): 599-607.
Taylor, J. S. and J. Raes (2004). "Duplication and divergence: the evolution of new genes and old ideas." Annu Rev Genet 38: 615-43.
Thomas, J. W., J. W. Touchman, et al. (2003). "Comparative analyses of multi-species sequences from targeted genomic regions." Nature 424(6950): 788-93.
Ting, C. T., S. C. Tsaur, et al. (1998). "A rapidly evolving homeobox at the site of a hybrid sterility gene." Science 282(5393): 1501-4.

Vacquier, V. D., W. J. Swanson, et al. (1997). "Positive Darwinian selection on two homologous fertilization proteins: what is the selective pressure driving their divergence?" J Mol Evol 44(Suppl 1): S15-22.
Vandepoele, K., N. Van Roy, et al. (2005). "A novel gene family NBPF: intricate structure generated by gene duplications during primate evolution." Mol Biol Evol 22(11): 2265-74.
Ventura, M., F. Antonacci, et al. (2007). "Evolutionary formation of new centromeres in macaque." Science 316(5822): 243-6.
Walsh, B. (2003). "Population-genetic models of the fates of duplicate genes." Genetica 118(2-3): 279-94.
Weiss, R. A. (2006). "The discovery of endogenous retroviruses." Retrovirology 3: 67.
Wohr, G., T. Fink, et al. (1996). "A palindromic structure in the pericentromeric region of various human chromosomes." Genome Res 6: 267-279.
Wong, K. K., R. J. deLeeuw, et al. (2007). "A comprehensive analysis of common copy-number variations in the human genome." Am J Hum Genet 80(1): 91-104.
Wyckoff, G. J., W. Wang, et al. (2000). "Rapid evolution of male reproductive genes in the descent of man." Nature 403(6767): 304-9.
Yohn, C. T., Z. Jiang, et al. (2005). "Lineage-Specific Expansions of Retroviral Insertions within the Genomes of African Great Apes but Not Humans and Orangutans." PLoS Biol 3(4): 1-11.
Yunis, J. J. and O. Prakash (1982). "The origin of man: a chromosomal pictorial legacy." Science 215(4539): 1525-30.
Zhang, J., A. M. Dean, et al. (2004). "Evolving protein functional diversity in new genes of Drosophila." Proc Natl Acad Sci U S A 101(46): 16246-50.
Zhang, J., H. F. Rosenberg, et al. (1998). "Positive Darwinian selection after gene duplication in primate ribonuclease genes." Proc Natl Acad Sci U S A 95(7): 3708-13.
Zhang, L., H. H. Lu, et al. (2005). "Patterns of segmental duplication in the human genome." Mol Biol Evol 22(1): 135-41.
Zhu, S., F. Bosmans, et al. (2004). "Adaptive evolution of scorpion sodium channel toxins." J Mol Evol 58(2): 145-53.
Zody, M. C., M. Garber, et al. (2006). "DNA sequence of human chromosome 17 and analysis of rearrangement in the human lineage." Nature 440(7087): 1045-9.
Zody, M. C., M. Garber, et al. (2006). "Analysis of the DNA sequence and duplication history of human chromosome 15." Nature 440(7084): 671-5.
