
SEGMENTAL DUPLICATIONS PROMOTE GENOMIC INSTABILITY IN

HUMAN 15q11-q13

by

DEVIN PAUL LOCKE

Submitted in partial fulfillment of the requirements

for the degree of Doctor of Philosophy

Dissertation Advisor: Dr. Evan E. Eichler

Department of Genetics

CASE WESTERN RESERVE UNIVERSITY

August, 2004

Copyright © 2004 by Devin Paul Locke

All rights reserved

CASE WESTERN RESERVE UNIVERSITY

SCHOOL OF GRADUATE STUDIES

We hereby approve the dissertation of

______

candidate for the Ph.D. degree *.

(signed)______(chair of the committee)

______

______

______

______

______

(date) ______

*We also certify that written approval has been obtained for any proprietary material contained therein.

To Patricia Anne Clark and Paul Arthur Locke, otherwise known as Mom and Dad, for their unwavering support.

TABLE OF CONTENTS

List of Figures
List of Tables
Abstract
Acknowledgements

Chapter 1: Introduction and Objectives
  The Landscape of the Human Genome
  Introduction
  Repetitive Sequences of the Mammalian Genome
  Structural DNA
  Functionless Junk: LINEs and SINEs and HERVs
  Segmental Duplications
  Defining Segmental Duplication
  The Initial Description of a Segmental Duplication
  Identifying Segmental Duplications
  How Segmental Duplications Arise
  Pericentromeric Regions
  Structural Variation Among Pericentromeric Regions
  The Two-Step Model of Pericentromeric Duplication
  Paralogous Sequence Variants
  Subtelomeric Regions
  Evolution by Duplication
  Local Duplication
  Whole-Genome Duplication
  Tandem Duplications
  Chromosome Evolution
  Primate Genome Variation
  Karyotype Variation
  Nucleotide Variation
  Array Comparative Genomic Hybridization
  Chromosome 15q11-q13 Exemplifies Genomic Plasticity
  Genomic Disorders Within 15q11-q13: Prader-Willi and Angelman Syndromes
  Imprinting Implicated in PWS/AS
  Duplications and 15q11-q13 Genomic Rearrangements
  Timing of 15q11-q13 Structural Evolution
  The Diversity of Additional 15q11-q13 Rearrangements
  Phenotypically Silent 15q11 Variation
  Research Objectives

Chapter 2: Large-scale Variation Among Human and Great Ape Genomes Determined by Array Comparative Genomic Hybridization

  Abstract
  Introduction
  Results
  Discussion
  Materials and Methods
  Human BAC Arrays
  Primate DNA Samples
  Array Comparative Genomic Hybridization
  Fluorescence In Situ Hybridization
  BAC Analysis
  Acknowledgements

Chapter 3: Refinement of a Chimpanzee Pericentric Inversion Breakpoint to a Segmental Duplication Cluster
  Abstract
  Introduction
  Results
  Discussion
  Materials and Methods
  FISH Probe Selection
  Comparative FISH
  Duplication Analysis
  Southern Analysis
  Acknowledgements

Chapter 4: of the Human Pericentromeric Region
  Abstract
  Introduction
  Results and Discussion
  In Silico Analysis of the 15q11 Pericentromeric Region
  FISH Analysis of Human 15q11
  Comparative FISH Analysis of Non-human Primate
  Phylogenetic Analysis
  Validation of the 15q11 Assembly
  STS Analysis
  BAC End Sequence Analysis
  Duplicon Delineation
  Phylogenetic Analysis
  Comparative FISH

Chapter 5: BAC Microarray Analysis of 15q11-q13 Rearrangements and the Impact of Segmental Duplications

  Abstract
  Introduction
  Results
  Clone Selection
  Hybridization Profiles of Normal Individuals
  Hybridization Profiles of 15q11-q13 Sequence Losses
  Hybridization Profiles of 15q11-q13 Sequence Gains
  Correlation of Dosage Imbalance and Log2 Ratio
  Duplication Sensitivity of Arrayed Clones
  Discussion
  Materials and Methods
  Clinical Samples
  Clone Characterization
  Array CGH
  Acknowledgements

Chapter 6: Extensive Homogenization, Interchromosomal Duplication and Lineage-specific Evolution of the 15q11-q13 Prader-Willi/Angelman Syndrome Breakpoint Regions
  Abstract
  Introduction
  Results
  15q11-q13 Sequence Assemblies
  Scaffold Comparison to the Genome Assembly
  Duplication Content of the PWS/AS Breakpoints
  Relatedness and Orientation of the PWS/AS Breakpoints
  Sequence Similarity of the Human HERC2 Duplicons
  Structure of the HERC2 Duplicon
  Copy Number Estimates of the HERC2 Locus in Humans and Primates
  Comparative Southern Blot Analysis
  Comparative FISH Analysis
  Primate Comparative Sequence Analysis
  Structural Divergence in Primate HERC2 duplicons
  Phylogenetic Analysis
  Sliding Window Analysis of HERC2 “Core” Duplicon Divergence Patterns
  Baboon BAC End Sequence Analysis
  Discussion
  Materials and Methods
  Defining the HERC2 Locus
  Scaffold Assembly
  STS Design and Hybridization
  Paralogous Sequence Variant Analysis
  BAC End Sequencing
  Comparative FISH Analysis
  Lemur BAC Subclone Sequencing
  Phylogenetic Analysis
  Comparative Sequence Analysis
  Fosmid End Placement Validation

Chapter 7: Discussion and Future Directions
  Summary and Discussion
  Whole-Genome Array Comparative Genomic Hybridization (CGH)
  Validating Sites of Inter-Species Genome Variation
  Patterns of Primate Genome Variation
  Pericentric Inversion of 15q11-q13 in Chimpanzee
  Evolution of the 15q Pericentromeric Region
  Determining Duplicon Ancestry By Mouse-Human Comparison
  Chromosome 15q11-q13 Rearrangements and Array CGH
  Organization and Evolution of the 15q11-q13 PWS/AS Breakpoints
  Tracking the HERC2 Duplicon
  Timing the Expansion of the HERC2 Duplicon
  The Disparity Between HERC2 Homology and Initial Duplication Timing
  Future Directions
  Implications for Gap Closure in the PWS/AS Breakpoints
  Comparative Sequencing
  Comparative Array CGH Analysis
  Diagnosis of Genomic Disease and Disease Discovery
  Conclusion

Bibliography

LIST OF FIGURES

Figure 1-1. The structure of a hypothetical segmental duplication
Figure 1-2. Non-allelic homologous recombination (NAHR)
Figure 1-3. The two-step model of pericentromeric duplication
Figure 1-4. Paralogous sequence variants
Figure 1-5. Array comparative genomic hybridization (CGH)
Figure 1-6. Map of human 15q11-q13 and mouse 7C

Figure 2-1. Example of array CGH data
Figure 2-2. Sites of great ape/human variation detected by array CGH
Figure 2-3. Comparative FISH validation of array CGH-detected duplications
Figure 2-4. Validation of array CGH-detected deletions

Figure 3-1. Map of human 15q11-q13
Figure 3-2. Two-color FISH analysis of the pericentric inversion of human 15q11-q13 in chimpanzee
Figure 3-3. FISH analysis of breakpoint spanning clones
Figure 3-4. Segmental duplications in the inversion breakpoint interval
Figure 3-5. Southern analysis of the CHRNA7 duplication

Figure 4-1. Organization of human 15q11
Figure 4-2. Sequence similarity search results for the ~1.2 Mb pericentromeric contig
Figure 4-3. Human and non-human primate comparative FISH
Figure 4-4. Duplicon delineation
Figure 4-5. Phylogenetic analyses

Figure 5-1. Map of 15q11-q13 array clones and the extent of common 15q11-q13 rearrangements
Figure 5-2. RP11-219B16 FISH results contrast with AC068962
Figure 5-3. Array CGH profiles of normal DNA samples
Figure 5-4. Profile of 15q11-q13 losses
Figure 5-5. Profile of 15q11-q13 gains
Figure 5-6. Correlation of copy number and fluorescence intensity ratios
Figure 5-7. Duplication sensitivity of arrayed clones
Figure 5-8. Sequence similarity, duplication content and duplication sensitivity

Figure 6-1. Schematic of the 15q11-q13 region and PWS/AS breakpoints
Figure 6-2. Interchromosomal duplication within BP1, BP2 and BP3
Figure 6-3. Duplication content of the 15q11-q13 scaffolds and unanchored HERC2-related sequences
Figure 6-4. The palindromic structure of BP3
Figure 6-5. Homology between PWS/AS breakpoints BP1, BP2 and BP3
Figure 6-6. Sequence similarity among HERC2 duplicons
Figure 6-7. Histogram of HERC2 duplicon length
Figure 6-8. Structure of the HERC2 duplicon
Figure 6-9. Primate Southern blot analysis
Figure 6-10. Comparative FISH analysis of the HERC2 duplicon
Figure 6-11. Analysis of primate BAC sequences containing the HERC2 duplicon
Figure 6-12. Divergence between the human, chimpanzee and baboon HERC2 duplicons
Figure 6-13. Phylogenetic analysis of human and non-human primate HERC2 duplicons
Figure 6-14. The pattern of sequence divergence between human HERC2 duplicons
Figure 6-15. Baboon BAC end placements against the human genome
Figure 6-16. Model of HERC2 evolution from lemur to baboon
Figure 6-17. Fosmid validation of 15q11-q13 scaffolds

Figure 7-1. Model of PWS/AS breakpoint evolution and HERC2 homogenization

LIST OF TABLES

Table 2-1. Summary of variant sites detected by array CGH
Table 2-2. Summary of experimentally verified sites of genomic rearrangement

Table 4-1. FISH localization of human chromosome 15 pericentromeric BAC clones
Table 4-2. Estimated divergence time of 15q11 pericentromeric duplications

Table 5-1. Properties of the arrayed BAC clones
Table 5-2. Genomic DNA samples assayed by array CGH

Table 6-1. BAC library hybridization STS probes
Table 6-2. HERC2 copy number estimate by BAC library hybridization
Table 6-3. BP1, BP2 and BP3 WGAC alignment statistics
Table 6-4. Distribution of interchromosomal duplications

Segmental Duplications Promote Genomic Plasticity in Human Chromosome 15q11-q13

Abstract

by

Devin Paul Locke

The human genome is comprised of a wide spectrum of repetitive sequences. Segmental duplications are a class of repetitive sequence that has been independently associated with human genomic disease and evolutionary rearrangements. I have undertaken a study to test the hypothesis that evolutionary rearrangements and rearrangements within the human genome are linked by genomic instability at sites of segmental duplication. This study has focused on the 15q11-q13 region of the human genome due to the presence of several large clusters of segmental duplications that have been associated with the common breakpoints of Prader-Willi and Angelman syndromes. Using both computational and experimental techniques, I have constructed sequence assemblies within these complex regions that provide a substrate for comparative primate studies. Extensive evolutionary variation was observed at sites of segmental duplication, both in the pericentromeric region of 15q11 and within the 15q11-q13 common deletion breakpoints. The scope of variation detected included both large-scale chromosome restructuring events and local re-patterning within clusters of segmental duplications. Sequence analysis of the 15q11-q13 common deletion breakpoint clusters revealed a complex evolutionary history associated with extensive segmental duplication. In addition, I describe a mechanism in which recurrent homogenization of a particular component of the 15q11-q13 breakpoints, the HERC2 duplicon, suggests that dynamic restructuring of these regions occurred recently in multiple independent primate lineages. The abundant plasticity observed in the 15q11-q13 region in primates indicates that genomic instability is a general property of segmental duplications that is not bound by the barrier of speciation. Also, through the use of the array comparative genomic hybridization technique, I have demonstrated the efficacy of the method for detecting genomic imbalance across a wide range of 15q11-q13 genomic rearrangements. Lastly, the application of the array method to primate comparative genomics provided an unprecedented level of resolution for detecting dosage differences among great ape species. In agreement with the other analyses presented here, segmental duplications were found to be highly enriched at sites of dosage imbalance, demonstrating that the genomic plasticity observed within 15q11-q13 is a genome-wide phenomenon. Together, these studies provide a comprehensive and detailed analysis of the role of segmental duplications in genomic rearrangements both within the human population and between humans and our closest living relatives.


ACKNOWLEDGEMENTS

It may take a village to raise a child, according to Hillary Clinton, but it has taken a legion of smart, and in some cases crazy, people to have made this dissertation possible. This dissertation really began with my first science mentor, Dr. Bruce Seal, with whom I worked while attending the University of Georgia, prior to arriving at Case Western Reserve University. Dr. Bruce set such a positive example for what the science life could be like, I couldn’t help but continue on in my studies, and I thank him for that, as well as for his continued friendship. I would also like to thank Dr. Robert Nicholls, with whom, as my first mentor at Case, I learned a lot both professionally and personally. My time in the Nicholls lab was tumultuous, but in hindsight the experience was positive, which I’m very glad to be able to say. Additionally, I would like to thank Dr. Evan Eichler for first accepting me into his lab, and then exposing me to the execution of science with uncompromising and exacting standards. I wholeheartedly believe the rigorous training given in Evan’s lab would allow one to compete in a scientific environment anywhere in the world. It’s also been a great pleasure to see science performed to such standards, infused with Evan’s bountiful creativity. Not to mention, he has shown a capacity to put up with me, which is a feat not to be underestimated. I am also indebted to those with whom I have been fortunate enough to work in the many collaborations that have made this work possible. In terms of emotional support, it would’ve been impossible to reach this point without the friendship of my fellow lab members, such as Jeff Bailey, Julie Horvath, Matt Johnson, Sean McGrath, Karen Hayden, Ge Lui, Cassandra Gulden, Lisa Pertz and Marla Eichler. A special thanks goes to Lisa Pertz, who was not just a friend in the lab, but who also worked with me closely on mapping and assembling the 15q11-q13 region and grew to share a quirky passion for seeing the region completed.
Lastly, but certainly not least, I would like to thank my family, Mom and Dad, for only wanting to see me happy in whatever manner I saw fit. The freedom of unconditional support is quite a luxury.


Chapter 1

Introduction and Objectives


THE LANDSCAPE OF THE HUMAN GENOME

Introduction

With the sequencing of the human genome, we now have the ability to assess, in a quantitative way, global properties of the DNA sequence that comprises the instructions for building a human being (IHGSC 2001; Venter et al. 2001). The categorization of sequence within the genome can be viewed from multiple perspectives at varying levels of depth. From a general perspective, the human genome sequence can be divided into two basic categories: unique and repetitive sequences. This over-simplification illustrates an important point: nearly half the entire human genome is comprised of repetitive sequences (IHGSC 2001). Given the abundance of repetitive sequences, it is self-evident that to understand the evolution of the human genome, as well as work toward understanding the molecular basis of human genetic disease, analysis of the repetitive fraction of the genome is required.

Early studies of DNA re-association kinetics first shed light on the repetitive nature of the genome of “higher organisms” (Britten and Kohne 1968; Schmid and Deininger 1975). It is a testament to the accuracy of these experimentally determined estimates that analysis of the human genome sequence with repeat-identifying algorithms such as REPEATMASKER generally agrees with these early studies a quarter century later (Smit and Green 1999; IHGSC 2001). From an evolutionary perspective, it was demonstrated that the presence of the repetitive fraction was a general property of the mammalian genome, and not a human-specific phenomenon (Jelinek et al. 1980). In addition, early comparative analyses tested the re-association kinetics of chimpanzee-human heteroduplex genomic DNA molecules, and demonstrated a remarkable similarity between these species (Deininger and Schmid 1976).

REPETITIVE SEQUENCES OF THE MAMMALIAN GENOME

Structural DNA

The detailed characterization of the human genome at the sequence level provided by the Human Genome Project, and private sequencing efforts, allows for a much more detailed dissection of the sequence classes that comprise the human genome (IHGSC 2001; Venter et al. 2001). Aside from the over-simplified repetitive and unique DNA categories, there are other sequence classes such as structural DNA. Structural DNA does not code for proteins, or contain enhancers or promoters, but provides a scaffold for physical manipulation of chromosomes, including mitotic and meiotic spindle attachment at the centromeres and maintenance of chromosome length at the telomeres. These structural DNA sequences are indeed repetitive, as tandem iterations of a 171 base pair sequence, called an alpha satellite monomer, are concatenated to form higher-order repeats that are themselves repeated in tandem to form chromosome-specific arrays in the centromeres of human chromosomes (Willard 1998). Human telomeres, in contrast, contain tandem repeats of a short 6 base pair motif (TTAGGG) (Morin 1989). These sequences, although essential for the viability of numerous eukaryotic organisms, do not comprise the majority of the repetitive fraction of the human genome.


Functionless Junk: LINEs and SINEs and HERVs

The vast majority of the repetitive fraction of the human genome is made up of repeats of undefined function, or “junk” DNA (Ohno 1972). This genomic junk can be divided into sub-categories primarily by the mechanism of replication. Retroelements populate the DNA genome through an RNA intermediate, and are found in two major classes, the long interspersed nuclear elements (LINEs) and short interspersed nuclear elements (SINEs). These two element families alone constitute approximately 34% of human genome sequence, 21% in LINEs and 13% in SINEs (IHGSC 2001). As their names imply, these elements are not restricted to specific functional regions of the genome, as are the structural repeats, but are distributed throughout the genome. The distribution of SINEs and LINEs, however, demonstrates a slight bias for integration sites, as LINEs are more commonly found in AT-rich regions, and SINEs in GC-rich regions (IHGSC 2001). The relationship between LINEs and SINEs is interesting in itself, as the LINE element has been shown to provide the retrotransposition apparatus not only for subsequent LINE retrotransposition, but also for SINE dispersal, in an ironic parasitism of a genomic parasite (Smit et al. 1995; Kazazian 2000; Dewannieux et al. 2003).

The process of LINE and SINE retroposition is an active one, in terms of both human genetic disease and primate genome evolution. It has been estimated that new SINE events, specifically Alu element retropositions, occur in 1:200 live births (Deininger and Batzer 1999). Additionally, Alu insertion events have been associated with human genetic diseases such as Neurofibromatosis-1, Huntington’s disease, and breast cancer, among others (Wallace et al. 1991; Goldberg et al. 1993; Miki et al. 1996). Recent comparative analysis of sequence from the genomes of the New World monkey and Prosimian lineages along with the human genome sequence has indicated that retrotransposition is a major force in increasing genome size, and therefore complexity, during primate evolution (Liu et al. 2003). Thus, these retroelements are actively shaping the primate genomic landscape.

Human endogenous retroviruses are another class of retroelement that populate the human genome to a significant degree, comprising approximately 8% of the human genome sequence (IHGSC 2001). As the name implies, these retroviral sequences are propagated via an RNA intermediate, and their replication can be autonomous or non-autonomous, similar to LINE elements. In contrast to LINEs and SINEs, however, endogenous retroviruses leave their mark in the form of long terminal repeats, or LTRs, which can act as strong promoter elements and can alter the expression of genes adjacent to an endogenous retrovirus integration site (Sverdlov 2000). In this manner, a wave of new endogenous retrovirus integrations could alter the expression of a substantial number of genes, potentially influencing speciation (Sverdlov 2000).

SEGMENTAL DUPLICATIONS

Defining Segmental Duplication

Although segmental duplications are indeed repetitive sequences, in that they are multi-copy sequences within a haploid genome, they must be considered distinct from other types of repeats. This is because segmental duplications are derived from “normal” unique genomic DNA, which may contain a complex landscape of high-copy repeats (Figure 1-1). In terms of size, segmental duplications have no known biologically-defined lower bound; however, bioinformatic analyses generally use a lower threshold of 1 kb in length (Bailey et al. 2001; Bailey et al. 2002b). The upper size limit is also not subject to currently understood biological properties, as the exact mechanism of segmental transposition is not well understood; however, an upper limit of 400 kb is likely to include the vast majority of duplications. Segmental duplications may appear in a tandem configuration, or interspersed, and they may be contained within a single chromosome (i.e. intrachromosomal) or spread to non-homologous chromosomes (i.e. interchromosomal). Also, compared to a typical retroelement, segmental duplications are found in significantly fewer copies. For example, over 800,000 distinct partial and complete LINE sequences were found during the initial analysis of the human genome, whereas prevalent segmental duplications, such as the PIR4 pericentromeric interspersed repeat or the chAB4 duplicon, are present in approximately 50 copies per haploid genome (Wohr et al. 1996; IHGSC 2001; Horvath et al. 2003).


Figure 1-1. The structure of a hypothetical segmental duplication. The Your Favorite Gene (YFG) locus contains genic sequences such as introns and exons, in addition to high-copy repeats. Exons are depicted as numbered red boxes, SINEs as blue boxes and LINEs as green boxes. A segmental duplication derived from the YFG locus, called YFG’1, contains a contiguous stretch of sequence, in this case, comprising exons 2-5 of YFG. The YFG’1 segment can be transposed within a chromosome (intrachromosomal) or to a non-homologous chromosome (interchromosomal). Note the orientation of the YFG’1 duplication is independent of the donor locus.
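The categories used above (tandem versus interspersed, intrachromosomal versus interchromosomal, and the 1 kb lower threshold) can be illustrated with a minimal Python sketch. The `Copy` record, the `classify_pair` function and the `tandem_gap_bp` parameter are hypothetical names introduced only for illustration; they are not taken from the cited analyses, and treating copies as tandem only when directly adjacent is an assumption of this sketch.

```python
from dataclasses import dataclass

# Lower length threshold generally used in bioinformatic analyses (1 kb).
MIN_LENGTH_BP = 1_000


@dataclass(frozen=True)
class Copy:
    """One copy of a duplicated segment (hypothetical record for illustration)."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive

    @property
    def length(self) -> int:
        return self.end - self.start


def classify_pair(a: Copy, b: Copy, tandem_gap_bp: int = 0) -> str:
    """Label a pair of duplicated copies by their chromosomal arrangement.

    tandem_gap_bp is an assumed cutoff: copies on the same chromosome that
    are separated by no more than this many bases are called tandem.
    """
    if min(a.length, b.length) < MIN_LENGTH_BP:
        return "below threshold"
    if a.chrom != b.chrom:
        return "interchromosomal"
    first, second = sorted((a, b), key=lambda c: c.start)
    gap = second.start - first.end
    if gap <= tandem_gap_bp:
        return "intrachromosomal, tandem"
    return "intrachromosomal, interspersed"
```

For example, two 5 kb copies lying end to end on chromosome 15 would be labeled tandem, whereas a copy on chromosome 15 paired with a copy on chromosome 16 would be labeled interchromosomal.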


The Initial Description of a Segmental Duplication

From the historical perspective, the first segmental duplication described was identified with a minisatellite variant (MS29) that detected two loci: one on chromosome 6, called DNF21S1, and a related site on chromosome 16, called DNF21S2 (Wong et al. 1990). The DNF21S2 locus was not detected in non-human primates, and was shown to be polymorphic in the human population (Wong et al. 1990). Additionally, the findings of this study established a DNA-based mechanism of sequence replication. Soon after, a duplication derived from the von Willebrand factor gene was described, and called a “partial unprocessed pseudogene” (Mancuso et al. 1991). Since these initial reports, several duplicated genic segments have been mapped throughout the human genome (Buiting et al. 1992; Kelley et al. 1992; Tomlinson et al. 1994; Eichler et al. 1996; Eichler et al. 1997; Regnier et al. 1997; Zimonjic et al. 1997; Monfouilloux et al. 1998; Trask et al. 1998a; Trask et al. 1998b; Amos-Landgraf et al. 1999). Efforts were also made to study segmental duplication from the regional perspective, as the sequencing of the human genome permitted the assembly of large stretches of contiguous sequence (Loftus et al. 1999). One particularly interesting product of this effort has been the characterization of an individual duplicon, known as Morpheus, from chromosome 16 that contains an uncommon complete genic segment (Johnson et al. 2001). Evolutionary analysis of this duplicon at the cytogenetic level demonstrated surprisingly extensive lineage-specific variation among primates; for example, a two-fold increase in copy number in chimpanzee (~30 copies) compared to human (~15 copies) (Johnson et al. 2001).


Identifying Segmental Duplications

The Human Genome Project presented significant challenges to those who wished to mine the ever-expanding pool of sequence data to uncover genome-wide properties. Segmental duplications introduced unanticipated difficulties in genome assembly, due to the use of homology-based algorithms (Collins et al. 1998; Eichler 1998). Therefore, in order to analyze the presence of segmental duplications on a genome-wide scale, specialized computational methods and algorithms had to be developed (Bailey et al. 2001; Bailey et al. 2002a). Using a methodology of simplifying the genome by eliminating high-copy repeats such as the retroelements discussed above, and subsequently generating pairwise alignments from the remaining sequence (called the whole-genome assembly comparison or WGAC), approximately 4% of the entire genome was found to be involved in segmental duplications with 90-98% sequence similarity, equal to or greater than 1 kb in length (Bailey et al. 2001). Additional analyses using a more complete assembly revealed 5.2% segmental duplication content within the human genome (Bailey et al. 2002a). In a 3 Gigabase genome, this represents approximately 150 Megabases of duplicated material, which is equivalent to a moderately sized human chromosome, such as chromosome 7 (158 Megabases, including gaps, build34). These genome-wide computational analyses firmly establish segmental duplications as a highly prevalent sequence property of the human genome.
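The final filtering and tallying step of such an analysis can be sketched as follows. This is a minimal illustration only, not the WGAC implementation: it assumes that repeat masking and pairwise alignment have already been performed, and the `Alignment` record and function names are hypothetical. Alignments are kept if they are at least 1 kb long and at least 90% identical, and overlapping intervals are merged so that each duplicated base is counted once.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Alignment(NamedTuple):
    """One side of a pairwise alignment (hypothetical record for illustration)."""
    chrom: str
    start: int        # 0-based, inclusive
    end: int          # 0-based, exclusive
    identity: float   # fraction of identical aligned bases, 0 to 1


def passes_filter(aln: Alignment, min_len: int = 1_000,
                  min_identity: float = 0.90) -> bool:
    """Apply the length and similarity thresholds described in the text."""
    return (aln.end - aln.start) >= min_len and aln.identity >= min_identity


def duplicated_bases(alignments: Iterable[Alignment]) -> int:
    """Count bases covered by at least one passing alignment."""
    by_chrom: Dict[str, List[Tuple[int, int]]] = {}
    for aln in alignments:
        if passes_filter(aln):
            by_chrom.setdefault(aln.chrom, []).append((aln.start, aln.end))

    total = 0
    for intervals in by_chrom.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlapping or adjacent interval: extend the current block.
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total


def duplicated_fraction(alignments: Iterable[Alignment],
                        genome_size_bp: int) -> float:
    """Fraction of the genome covered by passing duplications."""
    return duplicated_bases(alignments) / genome_size_bp
```

As a check on the figures quoted above, 5.2% of a 3 Gigabase genome is 0.052 × 3,000 Megabases, or about 156 Megabases, in line with the approximate value given in the text.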

One important aspect of the genome-wide analyses has been the observation that pericentromeric regions and subtelomeric regions are highly enriched (10-fold) for segmental duplications (Bailey et al. 2001). Segmental duplications are not limited to these regions, however, as several large clusters of segmental duplications have been found at the breakpoints of multiple deletion syndromes, or genomic disorders, on multiple chromosomes, such as Prader-Willi and Angelman syndromes in 15q11-q13, Smith-Magenis syndrome in 17p11, Cat-Eye syndrome in 22q11, Velocardiofacial-DiGeorge syndrome in 22q11 and others (Mazzarella and Schlessinger 1997; Ji et al. 2000a; Stankiewicz and Lupski 2002). Segmental duplications are often referred to as low copy repeats, or LCRs, when implicated in intrachromosomal events. The presence of segmental duplications at the proximal and distal breakpoints of multiple deletion syndromes has led to the model by which non-allelic homologous recombination (NAHR) between paralogous sequences mediates the deletion or duplication of large stretches of genomic material, leading ultimately to developmental disorders (Figure 1-2) (Stankiewicz and Lupski 2002).


Figure 1-2. Non-allelic homologous recombination (NAHR). To illustrate NAHR consider the YFG locus of chromosome 1 flanked by two copies of a duplication, DUP’1 and DUP’2. The YFG locus is depicted by the green box, and the DUP duplicons by the blue boxes. During normal recombination the homologous DUP’1, YFG and DUP’2 sequences align between Homolog A and Homolog B of chromosome 1. Aberrant recombination between DUP’1 and DUP’2, will lead to a rearrangement that can produce either a deletion of the YFG locus (above) or a duplication of YFG (below). Dosage imbalance of the YFG region may have pathological consequences.
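The outcome of NAHR described in Figure 1-2 can be mimicked with a toy Python model in which each homolog is a list of labeled segments. This is a simplified sketch under stated assumptions: the function name `unequal_crossover` and the flanking labels `cen` and `tel` are hypothetical, the duplicons are written `DUP1` and `DUP2` for convenience, and a crossover is modeled as a single exchange within the chosen duplicons.

```python
from typing import List, Tuple


def unequal_crossover(homolog_a: List[str], homolog_b: List[str],
                      site_a: str, site_b: str) -> Tuple[List[str], List[str]]:
    """Exchange the two homologs at site_a (on A) and site_b (on B).

    Returns the two reciprocal recombinant products. The recombined
    duplicon is labeled "site/site" to show which copies were joined.
    """
    i = homolog_a.index(site_a)
    j = homolog_b.index(site_b)
    # Left part of A joined to right part of B.
    product_1 = homolog_a[:i] + [f"{site_a}/{site_b}"] + homolog_b[j + 1:]
    # Reciprocal product: left part of B joined to right part of A.
    product_2 = homolog_b[:j] + [f"{site_b}/{site_a}"] + homolog_a[i + 1:]
    return product_1, product_2


# Both homologs carry the YFG locus flanked by the two duplicons.
homolog = ["cen", "DUP1", "YFG", "DUP2", "tel"]

# Allelic recombination: DUP1 pairs with DUP1, and YFG dosage is unchanged.
normal_1, normal_2 = unequal_crossover(homolog, homolog, "DUP1", "DUP1")

# Non-allelic recombination: DUP1 on homolog A pairs with DUP2 on homolog B.
deletion, duplication = unequal_crossover(homolog, homolog, "DUP1", "DUP2")
```

In this sketch the non-allelic exchange yields one product that lacks YFG and a reciprocal product that carries two copies, corresponding to the deletion and duplication outcomes in the figure.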


How Segmental Duplications Arise

The mechanism that gives rise to segmental duplications is largely unknown, although a recent study has demonstrated the presence of SINE sequences, specifically Alu elements, at the boundaries of 27% of all segmental duplications (Bailey et al. 2003). This finding implies that an Alu-Alu based recombination mechanism may mediate the initial transposition of duplicated sequence, and that segmental duplication began accelerating in the primate lineage upon expansion of the AluS subfamily of Alu repeats, which occurred approximately 35 million years ago (Bailey et al. 2003). This is consistent with the analysis of pericentromeric segmental duplications discussed in further detail below, which suggests a primate-specific burst of segmental duplication. Additionally, segmental duplications show a propensity to contain genic segments, which allows the intriguing possibility of evolving new genes from the novel juxtaposition of duplicated genic segments (Bailey et al. 2002b; Hillier et al. 2003). In fact, one such hominoid-specific chimeric gene, TRE2, has been described (Paulding et al. 2003). The association of Alu elements, which show a bias for GC-rich and gene-rich regions of the genome, and segmental duplications, which also show an enrichment for genic content, adds promise to the Alu-Alu mechanism as one mode of segmental duplication propagation.

PERICENTROMERIC REGIONS

Structural Variation Among Pericentromeric Regions

The pericentromeric region of human chromosomes, for the purposes of this work, is defined as the area between higher-order alpha-satellite repeats, which comprise the functional centromere, and the unique genic regions which comprise the majority of the chromosomal arms. This definition does not imply, however, that all pericentromeric regions are identical in structure, as analysis of multiple pericentromeric regions has revealed extensive structural variation between them. For example, the pericentromeric region of chromosome 2p11 has been characterized as a mosaic of predominantly interchromosomal duplicated segments, a subset of which are shared among the pericentromeric regions of multiple chromosomes (Horvath et al. 2000a). In contrast, characterization of the 19p12 pericentromeric region revealed a divergent structure in which intrachromosomal duplications of Krüppel-associated box zinc finger genes were found interspersed with clusters of chromosome-specific beta-satellite sequence (Eichler et al. 1998). Furthermore, analysis of the X chromosome pericentromeric regions demonstrated an abundance of degenerate monomeric alpha-satellite sequence without the presence of substantial interchromosomal duplication similar to 2p11, or intrachromosomal duplication similar to 19p12 (Schueler et al. 2001). Therefore, there exists substantial variation in the structure of the pericentromeric regions of human chromosomes.

In addition to substantial variation in structure, an enrichment of segmental

duplications in the pericentromeric region of human chromosomes has been noted from

global analyses of segmental duplications from the human genome sequence (Bailey et

al. 2001; Bailey et al. 2002a). In fact, the pericentromeric region of more than half of all

human chromosomes is comprised of blocks of segmental duplications (Bailey et al.

2001; Horvath et al. 2001). Complete ascertainment of the sequence of the pericentromeric regions from all human chromosomes has not yet been achieved,

however, as these regions have been considered one of the “serious and unanticipated challenges” of the Human Genome Project (Collins et al. 1998). As a reflection of this fact, only a subset of pericentromeric regions have been characterized in extensive detail at the sequence level and from the evolutionary perspective (Dunham et al. 1999; Jackson et al. 1999; Ruault et al. 1999; Guy et al. 2000; Horvath et al. 2000a; Horvath et al.

2000b; Footz et al. 2001; Bailey et al. 2002b; Guy et al. 2003; Horvath et al. 2003).

Insight into the evolutionary origin of the pericentromeric regions which consist of interchromosomal duplication mosaics has been gained from the elucidation of the duplication status and evolutionary history of individual duplicons (Buiting et al. 1992;

Kelley et al. 1992; Tomlinson et al. 1994; Eichler et al. 1996; Eichler et al. 1997; Regnier et al. 1997; Zimonjic et al. 1997).

Establishment of the pericentromeric region as an acceptor of duplicated segments occurred with the localization of several duplicated immunoglobulin light chain

VK segments, or orphons, to the pericentromeric region of several human chromosomes

(Borden et al. 1990). Further characterization of additional duplicons, such as the immunoglobulin heavy chain Vh and D segments, and the evolutionarily unstable chAB4 segment, supported the growing trend that pericentromeric regions have readily accepted duplicated material in the past (Tomlinson et al. 1994; Wohr et al. 1996). In addition, the initial detection of the pericentromeric duplication of the creatine transporter gene and a partial B-cell receptor-associated protein 31 gene (known as CDM at the time of initial publication), which duplicated from Xq28 to 16p11.1, revealed an expansion within the

African great ape lineages occurring 7-10 million years ago (Mya), after the divergence of the Asian and African great apes (Eichler et al. 1996). Taken together with the chAB4

analysis, which indicated there was human lineage-specific expansion of that segment, the evolutionary plasticity of the pericentromeric region was becoming apparent.

The Two-Step Model of Pericentromeric Duplication

Localization of a recent (5-10 Mya) duplication of the adrenoleukodystrophy locus to chromosomes 2p11, 10p11, 16p11 and 22q11 not only catalogued additional segmental duplications present in the pericentromeric region of human chromosomes, but also highlighted the sharing of duplications between multiple pericentromeric regions (Eichler et al. 1997). This led to the two-step model of pericentromeric duplication in which an initial duplication event to a pericentromeric region (a “seeding” event) is followed by non-homologous interchromosomal movement (“swapping”) of that duplicated segment to the pericentromeric region of other chromosomes (Figure 1-3) (Horvath et al. 2000a).

This model is supported by the evolutionary history of elements found within the pericentromeric region called pericentromeric interspersed repeats, or PIRs (Horvath et al. 2003).

Figure 1-3. The two-step model of pericentromeric duplication. First, a segmental duplication is transposed to a pericentromeric region. In this case, the YFG locus (depicted as in Figure 1-1) serves as the duplication donor, producing YFG’, which is initially duplicated to the pericentromeric region of chromosome 2. Subsequently, the YFG’ duplication and a neighboring duplication, called DUP1 (blue box), are “swapped” among the pericentromeric regions of chromosomes 3 and 4. The chromosome 3 event demonstrates the duplication cassette of YFG’ and DUP1 may be juxtaposed adjacent to additional duplications to form more complex composites (yellow box). The exchange of the YFG’/DUP1 cassette to the pericentromeric region of chromosome 4 may be followed by additional swapping events, i.e. the cassette can be spread through subsequent events and is not repetitively derived from chromosome 2.

The detailed analysis of one of these elements, PIR4, which is extensively duplicated

(~40 copies) among multiple human chromosomes, indicates a timeline in which initial

duplication of segments to the pericentromeric region occurred prior to the divergence of the great ape species, with a subsequent burst of duplication activity and pericentromeric swapping after the divergence of the Asian and African great apes (Horvath et al. 2003).

Although we still await full sequencing of the pericentromeric regions of all human chromosomes, the data gathered to date indicate the pericentromeric region is an evolutionarily dynamic place. Comparisons of the human and great ape genomes in these regions point to a level of divergence which is in sharp contrast to the level of

divergence noted between orthologous unique sequences. In other words, a model of

constant nucleotide evolution does not apply to regions such as the pericentromeric

region which have shown the proclivity to undergo substantial rearrangement in short

spans of evolutionary time. Upon further sequencing and analysis of the pericentromeric regions of the human genome, in addition to the comparative sequencing of primates, it will be fascinating to explore in further detail the precise order of molecular events which gave rise to the extremely complex structure we observe in the human pericentromeric region.

Paralogous Sequence Variants

Due to the highly duplicated nature of pericentromeric regions, and the high sequence similarity of pericentromeric duplications, specialized strategies have been developed to assemble sequence contigs, and assign individual duplicons to specific chromosomes (Horvath et al. 2000a). These strategies take advantage of technologies

and resources such as genomic BAC libraries, monochromosomal-source genomic libraries (typically cosmid libraries), monochromosomal hybrid panels, and high-throughput sequencing. By using a seed sequence that is duplicated to the pericentromeric region of a human chromosome, such as the adrenoleukodystrophy gene mentioned above or the zinc finger gene duplications within

19p12, it is possible to obtain a set of genomic clones related to the seed sequence by

BAC library hybridization. This initial population of clones represents sequences from all sites related to the seed sequence within the genome of interest, assuming there is no cloning bias against any particular copy of the duplication. The next step is to utilize paralogous sequence variants (PSVs) to tag each duplicated copy at the sequence level

(Figure 1-4). The paralogous sequence variant is similar to a single nucleotide polymorphism (SNP), except PSVs distinguish individual copies of a duplicon within a single haplotype. In this manner, PSVs are specific base pair changes that can tag individual copies of a duplicated sequence, and those base pair changes, once identified, are diagnostic and can be used as anchors for building large contigs, and assigning duplicons to chromosomes. Once a PSV has been identified, the subsequent PCR amplification and sequencing of the PSV STS, or pSTS, from a monochromosomal hybrid panel will allow for the assignment of specific PSV signatures to individual chromosomes. Similarly, hybridization of an STS to cosmid libraries, which were derived from monochromosomal sources, and subsequent pSTS characterization of the positive cosmids, further anchors specific copies of a duplicon to a chromosome. Given the cosmid libraries and monochromosomal hybrid panel sources are haploid, the PSVs that are detected and verified are by definition non-allelic. In other words, allelic

variation in paralogous sequences is avoided through the application of monochromosomal source material for PSV validation. It should also be noted that when analyzing a genome such as the human genome, for which there is extensive sequence available, multi-sequence alignments can be used to identify diagnostic nucleotide variants that tag a particular copy of a duplicon, and pSTSs can then be designed to verify the computational results.

Figure 1-4. Paralogous sequence variants. In the top panel, the sequence from the YFG locus is aligned to three potential interchromosomal duplications of YFG, called YFG’1, YFG’2 and YFG’3. Within this sequence, there are diagnostic base pair changes that distinguish each copy of the duplication (in bold blue). Sequencing this region of the YFG duplication from monochromosomal material can correlate the PSV signature with a chromosome (bottom panel). A SNP in YFG’2 is indicated by bold red and represents a variant sequence within the human population, and not a PSV.
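The computational step described above, in which a multi-sequence alignment is scanned for diagnostic nucleotide variants that tag a single copy of a duplicon, can be sketched as follows. This is a minimal illustration and not the procedure used in the cited studies: the function name find_psvs, the toy sequences, and the rule that a base must occur in exactly one copy are all assumptions made for the example. It also assumes each input sequence derives from haploid (monochromosomal) source material, since the code cannot by itself distinguish a PSV from an allelic SNP.

```python
from collections import Counter


def find_psvs(aligned):
    """Return (position, copy_name, base) for candidate PSVs.

    `aligned` maps a copy name to its aligned sequence; all sequences
    must have equal length, with '-' marking alignment gaps.
    A position is reported when a base occurs in exactly one copy,
    so that the base tags that copy (hypothetical rule for illustration).
    """
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    length = lengths.pop()

    psvs = []
    for pos in range(length):
        column = {name: seq[pos].upper() for name, seq in aligned.items()}
        # Skip columns containing gaps or ambiguous bases.
        if any(base not in "ACGT" for base in column.values()):
            continue
        counts = Counter(column.values())
        for name, base in column.items():
            if counts[base] == 1 and len(counts) > 1:
                psvs.append((pos, name, base))
    return psvs


# Toy alignment modeled loosely on Figure 1-4 (sequences are invented).
copies = {
    "YFG":   "ACGTACGTAC",
    "YFG'1": "ACGAACGTAC",
    "YFG'2": "ACGTACCTAC",
    "YFG'3": "ACGTACGTAT",
}
print(find_psvs(copies))
# → [(3, "YFG'1", 'A'), (6, "YFG'2", 'C'), (9, "YFG'3", 'T')]
```

Each reported position is only a candidate; as described in the text, confirmation would still require pSTS amplification and sequencing from a monochromosomal hybrid panel or monochromosomal cosmid library.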

SUBTELOMERIC REGIONS

The ends of eukaryotic chromosomes are capped with a repetitive sequence motif

known as the telomere, which prevents degradation of the chromosome arms at the

termini. In humans, the ribonucleoprotein telomerase, which is a primary regulator of

telomere length, acts to maintain tandem repeats of a (TTAGGG)n motif (Morin 1989).

In the human genome, the regions adjacent to the telomeric repeats are known as subtelomeric regions, and have been shown to contain 10-300 kb of interchromosomal segmental duplications that are shared among the subtelomeric regions of several chromosomes (Riethman et al. 2001; Mefford and Trask 2002). Another similar feature of pericentromeric regions and subtelomeric regions is that sequence from the adjacent structural DNA is often found interspersed within the mosaic of segmental duplications.

Specifically, satellite sequences are found enriched in regions of pericentromeric segmental duplication and telomeric repeats are found enriched in regions of subtelomeric segmental duplication (Bailey et al. 2001; Bailey et al. 2003). There are two possible explanations for this, as the interchromosomal exchange of material may lead to the incorporation of adjacent structural DNA from other chromosomes, or intrachromosomal events such as small inversions may interdigitate satellite sequence between the blocks of segmental duplications. Evidence supporting the interchromosomal exchange model is provided by the extensive sharing of duplicated segments among multiple subtelomeric regions (Monfouilloux et al. 1998; Trask et al.

1998a; Trask et al. 1998b; van Geel et al. 2002). Additionally, it has been shown that these subtelomeric regions vary widely in their duplication content between species, and within the human population; thus the process which duplicates and distributes sequences

between the subtelomeric regions of human chromosomes is active and ongoing

(Hoglund et al. 1995; Monfouilloux et al. 1998; Mefford and Trask 2002).

As with pericentromeric regions, sequencing these regions also takes specialized

techniques, as the lack of distal restriction sites for BAC cloning results in decreased

coverage of telomeric regions in BAC libraries, which was reflected in early versions of

the human genome assembly (Bailey et al. 2001). The technical challenge of cloning and

sequencing the telomeric and subtelomeric regions has been met through the use of

“Half-YACs” which are stable in yeast, yet maintain an intact human telomere (Riethman

et al. 1989; Riethman et al. 2001). This advance, along with further studies into the rapid

evolution of these sequences through comparative sequencing in primates and targeted

sequencing of additional human subtelomeric regions, will allow for a detailed

investigation of the mechanisms which drive the diversity of segmental duplications in

subtelomeric regions.

From an evolutionary perspective, the study of subtelomeric regions has revealed

two important trends. First, as with pericentromeric duplications, expressed genic

segments are frequently found in subtelomeric regions, creating the possibility of novel

gene formation by the juxtaposition of genic segments in advantageous combinations

(Brand-Arpon et al. 1999; Wong et al. 1999; van Geel et al. 2000; Riethman et al. 2001).

Also, the analysis of interchromosomal duplications, specifically the extensive distribution of olfactory receptors, in subtelomeric regions has led to the discovery of gene conversion as a potential contributing factor to the interchromosomal similarity of these regions (Rouquier et al. 1998; Trask et al. 1998a; Trask et al. 1998b; Mondello et

al. 2000; Glusman et al. 2001; Mefford et al. 2001; Newman and Trask 2003). Together,

these results demonstrate subtelomeric regions are extremely dynamic, both from the

evolutionary perspective and from the perspective of population genetics, due primarily

to the extensive variation in the segmental duplications that comprise the subtelomeric

sequence.

EVOLUTION BY DUPLICATION

Local Gene Duplication

Duplication has long been recognized as a mechanism by which genes with new

functions may be created (Haldane 1932; Muller 1936). The duplication of a gene is

thought to permit relaxed selection upon a copy of that gene such that new mutations are

allowed to accumulate, potentially leading to the development of a new function of that

gene, and an overall increase in gene diversity (Sidow 1996). The development of new

genes by duplication and subsequent is thought to be rare, due to the greater

probability of a gene acquiring inactivating mutations (Nei 1969; Nei and Roychoudhury

1973). Thus, duplicate genes are theoretically more likely to form pseudogenes, and the

stochastic generation of beneficial gene functions would remain rare; however, the

analysis of gene families in mouse suggests otherwise, that gene duplicates stand nearly

an even chance of being retained (Walsh 1995; Nadeau and Sankoff 1997). In an effort

to explain this quandary, an alternative path has been proposed for a newly duplicated

gene in which slightly deleterious mutations are maintained, but the mutations affect

different parts of the duplicate copies, such that the original function of the gene can still be effectively performed, but now by both copies working in concert (Force et al. 1999).

This model has helped to explain how such a large number of duplicate genes have been

retained in the face of . The fate of a duplicate gene is not assured,

however, as a statistical analysis of multiple duplicated genes from a variety of organisms

demonstrates that although duplicate genes arise at a substantial rate, and do experience a period of relaxed selection, the vast majority of duplicate genes are inactivated within a few million years (Lynch and Conery 2000).

Whole-Genome Duplication

From the level of the gene, to the level of the genome, whole genome duplication has also been proposed to be a major mechanism of organismal evolution (Ohno et al.

1968; Ohno 1970). In the tetraploidization theory, two rounds of genome duplication have taken place in the distant evolutionary past; one prior to the divergence of the jawless fishes, and a second event prior to the divergence of the vertebrate tetrapod

(Ohno 1970). Polyploidization and the subsequent return to the diploid state allow for rapid change in gene complement over a short span of evolutionary time and these bursts of activity are thought to coincide with increases in organismal complexity over time. In theory, the majority of the duplicated material is lost in the return to the diploid state, yet evidence may remain in the genome of modern organisms in the form of gene families.

This does not necessitate that all paralogy be the result of genome duplication, however, as tandem events are also plausible sources for duplicate material (Ohno 1970). The evidence for tetraploidization comes from analysis of unlinked duplicated loci. To explain the presence of related gene loci that share significant homology, yet are unlinked, even located on separate chromosomes, the tetraploidization model allows for a period of extensive chromosomal restructuring as the tetraploid genome resolves into the

diploid state. As a caveat of this model, the tetraploidization events would have necessarily occurred prior to the development of the sex chromosomes, which hamper the ability of a tetraploid genome to resolve into the diploid state (Ohno 1970).

Early studies of tetraploidization focused on fish species, which anecdotally have also shown evidence of a recent tetraploidization event. For example, analysis of the lactate dehydrogenase gene identified two copies in smelt (diploid) and four copies in rainbow trout (tetraploid) (Ohno 1970). Similar analyses were performed with the Ish-1 homeobox gene family, which encompasses 3 related genes: Ish-1a, Ish-1b and Ish-1c. A single representative of each of Ish-1a and Ish-1b is present in zebrafish (diploid), whereas salmon (tetraploid) have two copies of Ish-1a and Ish-1b, and the once-present duplicate of Ish-1c in salmon has theoretically been deleted (Gong et al. 1995).

These results also demonstrate that it appears in certain cases gene retention post-genome duplication has been relatively high, and early studies of the salmonids and catostomids

(salmon and sucker fish, tetraploid and diploid respectively) show 50% gene retention after 50 million years of subsequent evolution (Bailey et al. 1978).

Evidence for early tetraploidization from the human genome comes from studies of the aldolase genes, as there are 4 copies present on 4 independent chromosomes: 9, 10,

16 and 17, one of which (the chromosome 10 copy) has degraded into a pseudogene

(Tolan et al. 1987). Similarly, the expansion of the Hox gene clusters throughout vertebrate evolution has lent credence to the tetraploidization theory. The cephalochordate amphioxus has a single Hox cluster, while lamprey have three, and mice have four clusters of Hox genes (Gaunt 1991; Pendleton et al. 1993). More recent analyses of the

Hox genes in humans, and multi-gene families in humans, Drosophila and C. elegans,

however, have indicated that the landscape of genomic variation present between the family members is not consistent with an ancient duplication model (Friedman and

Hughes 2001; Hughes et al. 2001). So while the evidence is strong for a tetraploidization event in the recent evolutionary history of the salmonid fishes, there is still strong debate over the distant evolutionary history of the vertebrate lineage. In fact, recent analyses of paralogous gene families have implicated both potential tetraploidization events in early vertebrate evolution, with subsequent bursts in large-scale segmental duplication; therefore, the theory of early vertebrate evolution is in fact still evolving rapidly itself (Gu et al. 2002). The exploration of the theory put forth by Susumu Ohno, whether accepted or not, has substantially raised the consciousness of the scientific community concerning the importance of duplications.

Tandem Duplications

Turning to more recent evolutionary events, the mammalian genome has been shown to duplicate sequence by multiple pathways, one of which is tandem duplication. Analysis of tandem duplication of zinc finger genes in human 19p12, which was mentioned earlier with respect to pericentromeric regions, demonstrates how gene diversity can be increased through local duplication (Bellefroid et al. 1995). Specifically, the head-to-tail orientation of zinc finger genes likely evolved from repeated unequal crossing over of 19p12 sequences, which were found to be locally clustered from great apes to New World monkeys, yet not in prosimians and rodents (Bellefroid et al. 1995).

There is not a primate bias in the tendency to duplicate via a tandem mechanism, however, as a comparison of mouse and human olfactory receptor genes revealed the

rodent genome was more prone to local tandem duplication of olfactory receptors than the human genome, which demonstrated a wider interchromosomal distribution

(Young et al. 2002). In support of this, recent analysis of duplications in the Norway rat genome sequence also showed a prevalence for tandem duplication within the rodent genome (Tuzun et al. 2004). With regard to the differences between human and rodent proclivity to tandem duplication, a global investigation of gene families in the human and mouse genomes indicated intrachromosomal relationships between gene family members were more common in humans, and that the mouse genome contains more gene families with members distributed interchromosomally (Friedman and Hughes 2004). The authors' assertion from these results, that segmental duplication occurs post-tandem duplication to separate duplicate copies, and that segmental rearrangement occurs more frequently in the mouse lineage, is in direct contrast to previous results of global segmental duplication analysis of the human, mouse and rat genomes, which indicate a substantially lower prevalence of segmental duplications in mouse compared to human, with rat falling in between (Bailey et al. 2001; Bailey et al. 2002a; Cheung et al. 2003; Bailey et al. 2004; Tuzun et al. 2004). Thus, the tandem duplication of material is a common mechanism to generate diversity in mammalian species, commonly in a lineage-specific manner. However, these lineage-specific tandem expansions complicate delineating syntenic relationships in chromosomal evolution (Dehal et al. 2001).

Chromosome Evolution

Chromosomal evolution is another aspect of mammalian evolution that has been impacted by duplication. Given the arguably embattled assertion above, that multiple

rounds of tetraploidization have given rise to the interspersed duplicate gene families

observed in the modern mammalian genome, one might expect the genomes of diverse

mammals to be quite similar. From the perspective of genome size, this appears to be the

case, as several mammalian genomes cluster in size around 3 Gigabases in length (Graur

and Li 2000). In this case, size doesn’t matter, however, because linkage analysis in

mouse and man has shown that there has been a substantial degree of shuffling of large

segments of DNA, or syntenic blocks. Early estimates place the number of syntenic

blocks between mouse and man at nearly 180, each extending 8 centimorgans on average

(Nadeau and Taylor 1984). An important aspect of this model of chromosome

reshuffling is that the breakpoints of these rearrangements were proposed to be randomly

distributed throughout the genome (Nadeau and Taylor 1984). Nearly two decades later,

with the sequence of the human and mouse genomes in hand, in silico comparisons

revealed over 280 syntenic blocks (greater than 1 Megabase in length) between man and

mouse (Waterston et al. 2002; Pevzner and Tesler 2003a). An important aspect of this

work is the consideration of breakpoints as reusable, and not the result of random

breakage (Pevzner and Tesler 2003b). Additionally, a massive increase in the number of

syntenic blocks occurs if one includes micro-rearrangements, such as small deletions, or

inversions, that were not previously detectable using linkage analysis, although false positives may occur due to errors in genome assembly (Mural et al. 2002; Pevzner and

Tesler 2003a). The relationship between syntenic blocks and duplications becomes evident from recent findings that segmental duplications are found to be enriched (25-

50%) at the sites of syntenic breakpoints (Armengol et al. 2003; Bailey et al. 2004). The study of syntenic breaks and potential hotspots of evolutionary change will be aided by

future comparisons with the recently published rat genome sequence, which will allow

for the distinction of shared or derived syntenic breaks in pairwise genome comparisons

(Bailey et al. 2004; Gibbs et al. 2004).

PRIMATE GENOME VARIATION

Karyotype Variation

Some of the most interesting data is yet to come, however, as the sequencing of

the chimpanzee genome is underway. It has long been recognized that the genomes of

the great apes are remarkably similar at the karyotypic level (Yunis et al. 1980; Yunis and

Prakash 1982). Aside from the telomeric fusion which gave rise to human chromosome

2, the majority of great ape chromosomes are differentiated by a multitude of pericentric inversions, one translocation, and telomeric heterochromatic expansions (Yunis et al.

1980; Yunis and Prakash 1982; Toder et al. 1998). If one looks back further into primate evolution, the repositioning of the centromere also becomes another source of primate chromosomal variation (Ventura et al. 2001). To date, only a few of these events have been investigated at the molecular level (Nickerson and Nelson 1998; Stankiewicz et al.

2001; Kehrer-Sawatzki et al. 2002). One important finding from these studies was the presence of an evolutionary breakpoint at a site of segmental duplication (Stankiewicz et al. 2001). Combined with the mouse synteny data discussed above, these results indicate sites of segmental duplication have also been prone to chromosomal rearrangement in multiple mammalian lineages, although the consequences of such rearrangements have not been fully explored.

Nucleotide Variation

In contrast to the chromosomal rearrangements between primates, explored primarily through cytogenetic techniques, nucleotide variation studies have focused on the extreme similarity between humans and our closest living relatives. Comparisons by

DNA hybridization (heteroduplex melting), in addition to immunological and biochemical methods (gel electrophoresis), led to the conclusion that chimpanzees and humans are roughly 98.5% identical at the nucleotide level, which was surprising considering the overt morphological differences between the species (King and Wilson

1975; Sibley and Ahlquist 1984; Goodman et al. 1990; Bailey et al. 1991). Several recent comparative sequencing studies have also investigated chimpanzee-human nucleotide divergence, analyzing large samples of DNA sequence. Alignment of 2.3 Megabases of unique gene-containing sequence, including high-copy repeats but excluding coding regions, from 53 independent regions of the genome demonstrated an average of 98.78% sequence identity (Chen and Li 2001). An analysis of 1.9 Megabases of chimpanzee- human alignment, distributed among greater than 8,000 independent sites, revealed the same level of average divergence between human and chimpanzee (Ebersberger et al.

2002). The pattern of nucleotide divergence in this analysis was consistent with regional variation in nucleotide divergence, a phenomenon also observed in human-mouse comparisons, suggesting regional variation in nucleotide divergence was a conserved property (Ebersberger et al. 2002; Lercher et al. 2002). An interesting anecdote from the analysis of over 8,000 independent sites is that 7% of the chimpanzee sequences generated could not be aligned to the human genome, which will be an issue facing the current chimpanzee genome sequencing project (Ebersberger et al. 2002). A recent

comparison of 10.6 Megabases of DNA sequence from human, chimpanzee, baboon and

lemur revealed an average of 98.86% identity between human and chimpanzee, 94.19%

between human and baboon, and 77.00% between human and lemur (Liu et al. 2003).

More importantly, concerning global evolutionary patterns of variation, was the increase

in genome size due to an increase in Alu retroposition activity in a lineage-specific

manner (Liu et al. 2003). This expansion of the human genome, estimated to be 15-20%

over 50 million years, could affect the expression of numerous genes due to retroelement

insertion, as described above (Liu et al. 2003).

Nucleotide divergence is a method of estimating the evolutionary distance

between two sequences, which is extrapolated to the divergence time since a common

ancestor between the species being compared. This is expressed using the formula: rate of mutation = genetic distance / (2 × divergence time). Alternatively, the divergence time can be estimated from fossil evidence or extensive multiple-gene comparisons (Kumar and Hedges 1998; Goodman 1999). In this scenario the nucleotide divergence is used to estimate a mutation rate, based upon a neutral model of nucleotide evolution (Kimura

1968). Algorithms for calculating the divergence, or K value, have evolved to account for variation in the transition-transversion mutation rates (Kimura 1980). For duplicated loci, nucleotide divergence has been used to estimate the timing of the duplication event

(Eichler et al. 1997; Regnier et al. 1997; Zimonjic et al. 1997; Jackson et al. 1999;

Horvath et al. 2000b; Crosier et al. 2002; Guy et al. 2003; Horvath et al. 2003). One complication to estimating divergence time, or duplication time, has been the issue of the molecular clock (or the estimated constant rate of mutation), which appears to show variation across mammalian species. The mutation rate for rodents, for example, was

shown to be accelerated 4-10 times compared to the mutation rate in primates (Li and

Tanimura 1987). In addition, the recent sequencing of the mouse genome allowed a revisitation of this question using a larger sequence sample that resulted in an estimate of a 3-fold acceleration in mutation rate in rodents compared to primates (Waterston et al.

2002). Even among primates, the molecular clock was found to vary significantly (Yi et al. 2002). In fact, given the results of the analysis of over 8,000 independent sites cited above, in which nucleotide divergence was found to vary according to genomic position, and the findings of lineage-specific molecular clocks, it is likely that patterns of nucleotide variation will remain an important aspect for future study.
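The relationship between nucleotide divergence, mutation rate, and divergence time discussed above can be illustrated with a short sketch. It assumes the standard Kimura two-parameter correction for the divergence K, and the relation rate = K / (2 × time), where the factor of two accounts for mutations accumulating along both lineages since the common ancestor. The function names and the numeric values in the usage lines are hypothetical and chosen only for illustration; they are not taken from the studies cited in the text.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned sequences.

    Positions where either sequence has a gap or ambiguous base are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable positions")
    p = transitions / compared
    q = transversions / compared
    # K = -1/2 * ln[(1 - 2P - Q) * sqrt(1 - 2Q)]
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def mutation_rate(k, divergence_time_years):
    """Substitutions per site per year, given divergence K and time T."""
    return k / (2 * divergence_time_years)


def divergence_time(k, rate_per_site_per_year):
    """Years since divergence (or duplication), given K and a clock rate."""
    return k / (2 * rate_per_site_per_year)


# Hypothetical numbers: K = 0.012 and a calibration of 6 million years.
rate = mutation_rate(0.012, 6e6)
print(rate)
print(divergence_time(0.012, rate))
```

The same two functions cover both uses described in the text: calibrating a rate from a known divergence time, or dating a duplication from a known rate. Because the rate appears to differ between lineages and genomic regions, any single value supplied here is an assumption that limits the precision of the resulting estimate.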

ARRAY COMPARATIVE GENOMIC HYBRIDIZATION

The rate of nucleotide variation between species is a question that has been studied extensively over the previous few decades. The ability to evaluate large-scale variation between genomes has only recently become feasible due to advances in experimental methods. For example, a new technique that has been applied to the study of comparative genomics is microarray hybridization. In particular, the use of a genomic microarray representing 27 Megabases of chromosome 21 has shown extensive small (0.2-8 kb) insertion/deletion events in genic regions, which is a level of resolution not achievable outside of comparative sequencing (Frazer et al. 2003). The use of expression microarrays has also been applied to primate-human comparisons, which has added a new dimension in comparative studies (Enard et al. 2002; Karaman et al. 2003). Although these studies are not conclusive in themselves, combined with

structural analyses of the primate genomes, correlations between expression differences and genomic rearrangements can be tested.

One significant benefit of the Human Genome Project, with implications for the future of human medicine as well as basic research, has been the mapping of the thousands of bacterial artificial chromosome (BAC) clones which comprise the backbone of the human genome assembly. The intersection of the development of extensive human

BAC maps with recent advances in cytogenetic techniques has allowed for a powerful synergism, resulting in high-throughput methods of assessing dosage gains and losses in genomic DNA samples. Specifically, BAC array comparative genomic hybridization allows for the determination of genomic copy number, at typically hundreds or thousands of BAC loci in a single experiment.

Array comparative genomic hybridization technology can be traced to the advent of fluorescence in situ hybridization or FISH (Pinkel et al. 1986). Originally this technique was used to identify human chromosomes in human-hamster hybrid cell lines by biotin-labeling total human genomic DNA, hybridizing the labeled DNA to hybrid metaphase chromosomes, and subsequent visualization by application of fluorescein-labeled avidin and biotinylated anti-avidin antibody (Pinkel et al. 1986). The development of human chromosome paints allowed for the identification of chromosome dosage imbalance in human clinical samples (Pinkel et al. 1988). Simultaneously, the resolution for detection of fluorescently labeled DNA was refined from an entire chromosome to the level of a Megabase by using small repetitive chromosome-specific probes (Trask et al. 1989).


The advent of higher-resolution mapping of DNA probes ushered in a revolution in cytogenetics in which the structural aspects of chromosomes, as well as specific gene loci, could be examined with relative ease (Trask 1991). In combination with small-insert genomic clones, FISH mapping proved to be a robust method for ordering loci in the genome of a single individual, which complemented other mapping techniques such as genetic maps and pulsed-field gel electrophoresis mapping (Trask 1991). It is of note that over a decade later, the FISH mapping technique still provides important data for identifying the extent of disease-causing rearrangements and diagnosing the presence of genomic disorders, and remains an essential tool for comparative genomics.

Concurrent with the development of FISH mapping of specific loci at high resolution, including interphase FISH mapping at the resolution of 100 kb, techniques aimed at detecting dosage imbalances between two genomic DNA samples were being developed, namely comparative genomic hybridization or CGH (Kallioniemi et al. 1992; van den Engh et al. 1992). CGH involves the direct labeling of a normal reference genomic DNA sample with a fluorophore, and labeling an experimental sample with a different, yet distinguishable, fluorophore. Subsequently both labeled genomic DNA probes are hybridized to a normal metaphase chromosome spread. The fluorescence intensity ratio of any chromosomal region of the hybridized metaphase spread is a function of the dosage of that region in the reference and test genomic DNA samples. In this manner, large-scale rearrangements involving entire chromosome arms, or subregions of chromosomes, are easily visualized as distortions of the normal 1:1 fluorescence intensity ratio. Although CGH is powerful in its ability to analyze the entire human genome in a single hybridization experiment, its resolution is limited to that of the metaphase chromosome preparation, and does not equate to the high resolution achieved by FISH mapping of specific loci using cloned genomic DNA probes.

Array-based CGH using BAC clones is a recently developed technique that involves aspects of both the CGH approach and the high resolution FISH approach (Pinkel et al. 1998). Briefly, BAC DNA is spotted on a glass slide; the total number of BACs that can be spotted is dependent upon the density achieved by the array-producing equipment, and not upon the overall technique. Theoretically, given sufficient area, many thousands of loci can be spotted on glass for subsequent hybridization. Although future developments may involve smaller genomic clones, BACs have proven to produce the most consistently robust signals, and are the current genomic clone of choice for array experiments. Hybridized to the array are probes derived from two genomic DNA samples. Similar to the CGH protocol described above, a normal reference genomic DNA sample is labeled with one fluorophore, and an experimental genomic DNA sample is labeled with an alternate distinct fluorophore. Both probe mixtures are then hybridized to the array, and after washing, the fluorescence intensity at each spot on the array is quantified by laser scanning, or computer analysis of CCD captured array images (Figure 1-5).


Figure 1-5. Array comparative genomic hybridization (CGH). A normal reference genomic DNA sample is labeled with one fluorophore (depicted in green), while an experimental genomic DNA sample is labeled with a distinct fluorophore (depicted in red). Both probe mixtures are hybridized to an array of BAC clones fixed on a glass substrate. After washing, the slide is imaged and the fluorescence intensity level is obtained for each spot in both the normal and experimental channels. The ratio of the experimental to the normal fluorescence intensity is indicative of copy number.


The fluorescence intensity values for each spot are compared, producing a ratio of the reference normal sample intensity and test sample intensity, which is a direct measure of the copy number relationship between the two samples. If the ratio of normal to test intensity favors the normal sample, that indicates a deletion is present at that BAC locus in the test sample. Similarly, if the ratio of normal to test intensity favors the test sample, that indicates the presence of a copy number gain in the test sample, relative to the normal. The ratio of fluorescence intensity between normal and test samples has been shown to respond in a linear fashion to substantial increases in copy number (Pinkel et al. 1998). Thus, the array CGH technique allows for the detection of dosage imbalances across thousands of BAC loci simultaneously. In a sense, the array CGH technique is analogous to performing a large number of high resolution FISH experiments without the need for qualitative interpretation of FISH signals on metaphase chromosomes. Also, the parallel nature of array CGH represents a high level of throughput unachievable using conventional FISH techniques.
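The ratio interpretation described above can be illustrated with a minimal Python sketch. It assumes background-corrected intensities for the test and reference channels and expresses each spot as a log2 ratio of test to reference, so that gains are positive and losses are negative. The function names, the example intensities, and the 0.5 cutoff are hypothetical choices made for illustration; they are not taken from the array CGH protocol itself.

```python
import math


def log2_ratio(test_intensity, reference_intensity):
    """Log2 of the test/reference fluorescence ratio for one array spot."""
    return math.log2(test_intensity / reference_intensity)


def call_spot(test_intensity, reference_intensity, threshold=0.5):
    """Classify one spot as 'gain', 'loss', or 'no change'.

    threshold is a hypothetical cutoff on the absolute log2 ratio,
    chosen here only for illustration.
    """
    ratio = log2_ratio(test_intensity, reference_intensity)
    if ratio >= threshold:
        return "gain"
    if ratio <= -threshold:
        return "loss"
    return "no change"


# Hypothetical background-corrected intensities: (test, reference).
spots = {
    "BAC_A": (1000.0, 1000.0),  # equal signal in both channels
    "BAC_B": (500.0, 1000.0),   # test signal halved
    "BAC_C": (1500.0, 1000.0),  # test signal increased by half
}
calls = {name: call_spot(test, ref) for name, (test, ref) in spots.items()}
```

Under these assumptions, a spot with half the test intensity of the reference gives a log2 ratio of -1 and would be called a loss, whereas a spot with equal intensities gives a log2 ratio of 0.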

Although the technique does allow for the rapid detection of dosage imbalance at numerous loci simultaneously, there are some limitations to the method. For example, genomic rearrangements that do not affect dosage, such as inversions, are not detectable by this technique. Additionally, the array CGH method does not directly identify the breakpoint of a genomic rearrangement; instead, an interval is identified in which the breakpoint lies. The size of the interval depends entirely upon the density of the genomic array. Once the breakpoint is narrowed to an interval, further experiments are required to determine its precise location. In addition, interpretation of array CGH data is dependent upon accurate mapping information concerning the arrayed BAC clones.
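The interval-based nature of breakpoint mapping can be sketched in Python as follows, assuming each arrayed clone has already been assigned a map position and a copy-number call. The clone names, coordinates, and spacing are hypothetical. As a conservative assumption, each reported interval spans from the start of the last clone with one call to the end of the first clone with a different call, since a breakpoint could fall within either clone.

```python
def breakpoint_intervals(clones):
    """Return intervals in which a dosage breakpoint must lie.

    clones: list of (name, start_bp, end_bp, call) tuples sorted by position,
            where call is a copy-number state such as 'loss' or 'no change'.
    Each result is (left_clone, right_clone, interval_start, interval_end).
    """
    intervals = []
    for left, right in zip(clones, clones[1:]):
        if left[3] != right[3]:
            # Conservative bounds: the breakpoint may lie inside either clone.
            intervals.append((left[0], right[0], left[1], right[2]))
    return intervals


# Hypothetical clones of about 150 kb spaced roughly 1.4 Mb apart.
clones = [
    ("BAC_1", 0, 150_000, "no change"),
    ("BAC_2", 1_400_000, 1_550_000, "no change"),
    ("BAC_3", 2_800_000, 2_950_000, "loss"),
    ("BAC_4", 4_200_000, 4_350_000, "loss"),
    ("BAC_5", 5_600_000, 5_750_000, "no change"),
]
intervals = breakpoint_intervals(clones)
```

In this hypothetical example the deletion is bounded by two intervals of about 1.5 Mb each; a denser array would shrink those intervals but would still not identify the breakpoint sequence itself.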


Since the establishment of the array CGH method, numerous studies have demonstrated the use of the technique. In the field of cancer genetics, profiling tumors with array CGH has facilitated the mapping of many oncogenes from a wide variety of tissue types (Albertson et al. 2000; Kashiwagi and Uchida 2000; Collins et al. 2001; Daigo et al. 2001; Hodgson et al. 2001). In addition, profiling of human clinical samples for genomic deletions has been accomplished (Bruder et al. 2001; Rauen et al. 2002; Yu et al. 2003). Interestingly, cDNA arrays have also been adapted to the CGH protocol, and have been used to effectively assess copy number variation in genomic DNA samples (Pollack et al. 1999; Lin et al. 2002; Pollack et al. 2002). These early studies with array CGH have led to the adaptation of the technique for tumor profiling, mapping oncogenes, and other tasks such as somatic cell hybrid mapping and mapping unbalanced translocations (Gunn et al. 2003). In a recent development, as the resolution of array CGH is dependent only on the number of features spotted on a glass slide, it is now possible to produce a set of arrays with coverage of the entire genome (Ishkanian et al. 2004). Thus, the array CGH technique has a wide range of applications in a variety of settings, potentially including regions of the genome which show a propensity for rearrangement due to the presence of segmental duplication clusters.

CHROMOSOME 15q11-q13 EXEMPLIFIES GENOMIC PLASTICITY

Although many regions of the human genome have demonstrated a capacity for rearrangement, the mapping of multiple segmental duplication clusters flanking the Prader-Willi-Labhart syndrome (commonly shortened to Prader-Willi syndrome or PWS) and Angelman syndrome (AS) critical region in 15q11-q13 established the involvement of such sequences in the rearrangements underlying these genomic disorders. In this manner, the elucidation of the mechanism leading to PWS/AS rearrangements is intimately associated with the characterization of the breakpoint duplication clusters themselves.

Genomic Disorders Within 15q11-q13: Prader-Willi and Angelman Syndromes

Chromosome 15q11-q13 has a long-standing clinical history, beginning with the characterization of PWS, which was established as a clinical entity in 1956, although the initial description of what would be later known as PWS may have actually been made in the 19th century by Down (Down 1887; Prader et al. 1956). The clinical features of PWS include hypotonia, hypogonadism, hypopigmentation in certain cases, mild to moderate mental retardation, and a failure-to-thrive phenotype at birth with a poor feeding response that, if left unmanaged, later in life transitions into a compulsive eating disorder and morbid obesity (Cassidy et al. 2000; Nicholls and Knepper 2001). The association of PWS and 15q was made through the analysis of PWS unbalanced translocations (Hawkey and Smithies 1976; Emberger et al. 1977; Fraccaro et al. 1977).

Not long after the establishment of PWS, the term “happy puppet” syndrome (also known as “marionette joyeuse”) was used to characterize a syndrome presenting severe ataxia, epilepsy, hypotonia, severe mental retardation, an absence of speech, and paroxysms of laughter (Angelman 1965; Bower and Jeavons 1967; Halal and Chagnon 1976). A bout of rational sensitivity by Dr. Harry Angelman altered the name of the disorder to AS (Angelman 1965). The initial connection between 15q11-q13 and AS was made through the description of two independent cases of 15q proximal deletion (Magenis et al. 1987). However, at this time the reciprocal relationship between PWS and AS was not well understood.

Imprinting Implicated in PWS/AS

In the late 1980s it was shown that similar deletions of 15q11-q13 were involved in both PWS and AS, through the use of cytogenetics and cloned RFLP probes (Donlon 1988; Knoll et al. 1989). By this time the concept of uniparental disomy (UPD) had been put forth as a potential contributing factor to human genetic disease (Engel 1980). In a landmark study, analysis of the subset of PWS patients that do not present a deletion of 15q11-q13 demonstrated that maternal UPD of chromosome 15 could lead to PWS; therefore, genomic imprinting of the 15q11-q13 region was essential for human development (Nicholls et al. 1989). Subsequently, chromosome 15 UPD was also found in AS patients (Malcolm et al. 1991). The mapping of microdeletions in non-UPD, non-deletion AS cases led to the isolation of mutations in the maternally expressed E6-AP-ubiquitin protein ligase (UBE3A) gene (Kishino et al. 1997; Matsuura et al. 1997). Thus, by this time several mechanisms had been established to give rise to AS including deletion, translocation, UPD and gene-inactivating mutations.

Early studies of the imprinting process identified methylation as a DNA marker that could distinguish maternal and paternal inherited alleles (Driscoll et al. 1992; Clayton-Smith et al. 1993). With this technique it was determined that the small nuclear ribonucleoprotein N gene (SNRPN) in 15q12 was not expressed in PWS material, but was expressed in AS and normal samples, focusing investigation on this gene as a candidate for the PWS gene (Glenn et al. 1993). Inactivating mutations within the SNRPN coding sequence were never isolated from PWS patients, however, leading to the general conclusion that PWS was a multigenic syndrome caused by the loss of paternally expressed genes in 15q11-q13. In support of this, a translocation patient presenting with PWS and an intact paternal SNRPN allele was described (Schulze et al. 1996). Analysis of mutations with errors in the imprinting process identified an imprinting center, or IC, that affords the ability for the 15q11-q13 region to switch the parental imprint during the transition through the germ line (Nicholls 1994; Nicholls and Knepper 2001).

Several genes adjacent to SNRPN were subsequently shown to be paternally expressed, such as makorin 3 (MKRN3), necdin (NDN), magel 2 (MGL2) and imprinted in Prader-Willi (IPW) (Glenn et al. 1993; Wevrick et al. 1994; MacDonald and Wevrick 1997; Boccaccio et al. 1999; Chai et al. 2001). This strengthened the model that the imprinting process has a regional scope, and that genes such as NDN and MKRN3, which retroposed into the PWS/AS region during mammalian evolution, acquired their imprinted status due to the site of integration (Chai et al. 2001).

Duplications and 15q11-q13 Genomic Rearrangements

Concerning the mechanism of de novo 15q11-q13 deletion, early evidence for the role of duplications came from the identification of a multi-copy probe, obtained via microdissection, with homology to 15q11-q13 and 16p11 (Buiting et al. 1992). This probe, called MN7, identified several YAC clones from two distant sites on 15q, and early estimates suggested 4-5 copies of this probe were in the human genome (Buiting et al. 1992). Simultaneously, YAC mapping efforts were being undertaken to develop a contiguous set of YAC clones across the entire PWS/AS critical region, which were used to characterize patient samples (Kuwano et al. 1992; Mutirangura et al. 1993). At this time, it was noticed that several patients presented with breakpoints in a single YAC, which implied that a site of preferential breakage may be involved in the de novo deletion process (Kuwano et al. 1992). It is interesting to note that at this time other sequences were being mapped to 15q11 and 16p11, including immunoglobulin Vh and D segment duplications, highlighting the recent exchange of material between these chromosomes (Tomlinson et al. 1994). Mapping of neurofibromatosis-1 (NF1) duplications to 15q11.2 further demonstrated the proclivity of proximal 15q to accept duplicated segments (Kehrer-Sawatzki et al. 1997).

Microsatellite markers were used to further refine the genomic map of 15q11-q13 and identified two proximal breakpoints of PWS/AS deletions (Christian et al. 1995). Having generally defined the three major breakpoint intervals of PWS/AS interstitial deletions (BP1, BP2 and BP3), further efforts were aimed at characterizing the duplicated sequences present within the breakpoints themselves (Amos-Landgraf et al. 1999; Christian et al. 1999). Specifically, the source of the MN7 probe sequences identified previously had been mapped to the hect-domain and rld2 (HERC2) gene in 15q13 (Buiting et al. 1998; Ji et al. 2000b). Multiple segmental duplications have since been associated with the breakpoints of 15q11-q13 rearrangements, including several duplicons derived from the Golgin-Linked-to-PML (GLP) duplicon, also known as LCR15, in addition to other less prevalent segments such as a duplication of the chromosome 16 PARN locus and the 15q13 amyloid precursor protein-binding protein (APBA2) locus (Buiting et al. 1992; Amos-Landgraf et al. 1999; Buiting et al. 1999; Christian et al. 1999; Gilles et al. 2000; Ji et al. 2000b; Pujana et al. 2001; Pujana et al. 2002; Sutcliffe et al. 2003).

Timing of 15q11-q13 Structural Evolution

Although few studies have traced the course of 15q11-q13 structural evolution, some observations can be made from comparisons of the mouse and human genomes. First, the HERC2 locus is not duplicated in mouse and there are no clusters of duplications within the region syntenic to 15q11-q13 (Gabriel et al. 1999; Ji et al. 1999). Thus, the origin of the HERC2 duplications and the breakpoint clusters of segmental duplications are likely to be primate-specific. Also, it has been shown that the genes which reside within the BP1-BP2 region of the human genome in 15q11.2, including non-imprinted in Prader-Willi/Angelman syndrome 1 (NIPA1), non-imprinted in Prader-Willi/Angelman syndrome 2 (NIPA2), cytoplasmic FMR1 interacting protein 1 (CYFIP1) and gamma-tubulin complex component 5 (GCP5), are adjacent to the HERC2 locus (15q13) in the mouse genome (Figure 1-6) (Chai et al. 2003). Thus a model has developed in which duplication of the HERC2 locus in a primate ancestor preceded the transposition of the four-gene cassette to the BP1-BP2 region (Chai et al. 2003). The type of genomic event which brought about this rearrangement is unknown. An inversion would be plausible; however, a subsequent inversion of the BP1-BP2 region would be required to return the BP1-BP2 genic cassette into phase with the orientation of the mouse genome.


Figure 1-6. Map of human 15q11-q13 and mouse 7C. The human 15q11-q13 region contains the three common deletion breakpoints of PWS/AS, termed BP1, BP2 and BP3 which contain duplications of the HERC2 locus in 15q13. Genes are indicated as the circles and ovals, color coded with respect to their imprint status. Black represents non-imprinted genes, white indicates paternally expressed genes and grey represents maternally expressed genes. A cassette of four non-imprinted genes, NIPA1, NIPA2, CYFIP1 and GCP5, is contiguous with the mouse HERC2 locus, also non-imprinted, yet this cassette is located between the BP1 and BP2 duplication clusters in the human genome.

The Diversity of Additional 15q11-q13 Rearrangements

Several duplication and deletion syndromes are characterized by the presence of large blocks of segmental duplications, which provide the substrates for NAHR, discussed above, to produce genomic rearrangements (Ji et al. 2000a; Stankiewicz et al. 2001). In contrast to the loss of 15q11-q13 in PWS/AS, pseudo-dicentric (15) syndrome (formerly known as inv dup (15) syndrome) and interstitial duplication of 15q11-q13 are disorders that involve the gain of material in this region (Barber et al. 1998). In fact, further evidence for the role of the complex 15q11-q13 genomic structure in facilitating proximal 15q rearrangements came from the study of pseudo-dicentric (15) marker chromosome breakpoints, a substantial portion of which appear to correlate with the PWS/AS common deletion breakpoints, although a subset of inv dup (15) rearrangements involve a more distal breakpoint (Huang et al. 1997; Wandstrat et al. 1998; Wandstrat and Schwartz 2000). Recently, partial hexasomy of 15q11-q13 has been reported in one patient with multiple proximal 15-derived marker chromosomes and multiple clinical features (Nietzel et al. 2003). Generally, phenotypic effects due to dosage imbalance of 15q11-q13 are predicated upon the involvement of the imprinted domain in 15q12, termed the PWS/AS critical region; thus, the parental origin of the rearranged chromosome has a significant impact on phenotype (Browne et al. 1997). A notable exception includes expansion of the 15q pericentromeric region, proximal of the PWS/AS critical region, which has been associated with a phenotypic effect, primarily growth retardation, indicating that not all phenotypically-relevant genomic rearrangements of 15q11-q13 involve the PWS/AS critical region (Mignon et al. 1997). It is the assertion of the author, however, that few regions of the human genome have demonstrated as wide a spectrum of potential rearrangements, as 15q11-q13 has been implicated in deletions, interstitial duplications, interstitial triplications, translocations, and numerous size classes of supernumerary marker chromosomes.

Phenotypically Silent 15q11 Variation

In contrast to the syndromes described above, not all cytologically detectable variations of 15q11-q13 appear to have clinical significance. The inheritance of multiple copies of the PWS/AS critical region from the paternal lineage, for example, has been found in several phenotypically normal individuals (Browne et al. 1997). In addition, the pericentromeric region of 15q has also been shown through cytological as well as molecular analyses to vary in size due to the presence of several segmental duplications including NF1, immunoglobulin variable heavy chain D segment and gamma-aminobutyric acid receptor 5 (GABRA5) duplications (Tomlinson et al. 1994; Kehrer-Sawatzki et al. 1997; Regnier et al. 1997; Barber et al. 1998; Ritchie et al. 1998; Fantes et al. 2002). In one study, variation in the copy number of the NF1 duplication was associated with growth retardation and a suggestive mild PWS-like phenotype in a single patient; however the majority of cases examined revealed no phenotypic effect of euchromatic expansion of the 15q11 pericentromeric region (Browne et al. 1997; Barber et al. 1998).

RESEARCH OBJECTIVES

This dissertation aims to explore the hypothesis that segmental duplications are commonly present at sites of genomic rearrangement, both between species and within a species. Toward that goal, this work involved the application of a high-throughput technology for genomic profiling. Additionally, this dissertation aimed to combine bioinformatic approaches with experimental methods to achieve a parsimonious conclusion concerning the role of segmental duplications in genomic rearrangements.

The first goal of this work was to establish the high-throughput array CGH technique in comparative studies of primate evolution (Chapter 2). This work also sought to assess the association of evolutionary rearrangements detected via array CGH with segmental duplications, and to analyze patterns of evolutionary rearrangement with respect to chromosomal position. The second goal of this work was to map the breakpoint of a chromosome restructuring event in the chimpanzee lineage, specifically a pericentric inversion of proximal chromosome XV (Chapter 3). In this study I sought to identify the sequence at the breakpoint of the rearrangement and to characterize the region in which the rearrangement occurred. The third goal was to evaluate the pattern and prevalence of interchromosomal segmental duplications within the pericentromeric region of human chromosome 15q11 in primates (Chapter 4). Focusing further on chromosome 15, the fourth aspect of this work aimed to apply the array CGH methodology to the 15q11-q13 region, which is interrupted by several clusters of segmental duplications associated with the breakpoints of human genomic disorders (Chapter 5). In this analysis I attempted to address the influence of segmental duplications within the array CGH format, and to develop guidelines for future experiments with the technology. Additionally, I sought to demonstrate that the technique is sensitive to a wide variety of dosage imbalances, and could delineate the breakpoint of dosage gains and losses with high resolution. Lastly, I pursued the development of large sequence contigs within the 15q11-q13 PWS/AS breakpoint segmental duplication clusters, which had proven refractory to sequence assembly by the Human Genome Project (Chapter 6). Additionally, through comparative analyses I investigated the evolution of a major component of the PWS/AS breakpoint clusters, the HERC2 duplicon. The elucidation of the evolutionary history of this important sequence family will shed light on the mechanisms which gave rise to the complex 15q11-q13 structure and maintained that structure during human evolution.

Chapter 2

Large-scale variation among human and great ape genomes determined by array comparative genomic hybridization

Devin P. Locke1, Rick Segraves2, Lucia Carbone3, Nicoletta Archidiacono3, Donna G. Albertson2, Daniel Pinkel2 and Evan E. Eichler1

1Department of Genetics, Center for Computational Genomics and Center for Human Genetics, Case Western Reserve University School of Medicine and University Hospitals of Cleveland, Cleveland, OH, 44106

2Comprehensive Cancer Center, UCSF, San Francisco, CA 94143

3Dipartimento di Anatomia Patologica e di Genetica, Sezione di Genetica, University of Bari, Bari 70126, Italy

Note: This manuscript has been published: Locke, D.P., Segraves, R., Carbone, L., Archidiacono, N., Albertson, D.G., Pinkel, D., and E.E. Eichler. 2003. Large-scale variation among human and great ape genomes determined by array comparative genomic hybridization. Genome Res 13:347-357. R.S. and D.G.A. and D.P. provided array comparative genomic hybridization data. L.C. and N.A. provided comparative fluorescence in situ hybridization data.


ABSTRACT

Large-scale genomic rearrangements are a major force of evolutionary change and the ascertainment of such events between the human and great ape genomes is fundamental to a complete understanding of the genetic history and evolution of our species (Nadeau and Sankoff 1997; O'Brien and Stanyon 1999). Here, we present the results of an evolutionary analysis utilizing array comparative genomic hybridization (array CGH), measuring copy number gains and losses among these species. Using an array of 2460 human BACs (12% of the genome), we identified a total of 63 sites of putative DNA copy number variation between humans and the great apes (chimpanzee, bonobo, gorilla and orangutan). Detailed molecular characterization of a subset of these sites confirmed rearrangements ranging from 40 kb to at least 175 kb in size. Surprisingly, the majority of variant sites differentiating great ape and human genomes were found within interstitial euchromatin. These data suggest that such large-scale events are not restricted solely to subtelomeric or pericentromeric regions but also occur within genic regions. In addition, 5/9 of the verified variant sites localized to areas of intrachromosomal segmental duplication within the human genome. Based on the frequency of duplication in humans, this represents a 14-fold positional bias. In contrast to previous cytogenetic and comparative mapping studies, these results indicate extensive local re-patterning of hominoid chromosomes in euchromatic regions through a duplication-driven mechanism of genome evolution (Yunis et al. 1980; Yunis and Prakash 1982; Stanyon et al. 1995; Muller et al. 1999; Muller and Wienberg 2001).


INTRODUCTION

The evolution of human and non-human primate genomes has been studied at two levels: karyotypic differences and single-base-pair nucleotide differences (Yunis et al. 1980; Yunis and Prakash 1982; Kaessmann et al. 2001). To date, no ascertainment has been made of the frequency, extent or distribution of large-scale sequence gain or loss events, defined here as copy number-altering events that involve greater than 10 kb of sequence yet are undetectable at the cytogenetic level. Events of this size have the potential to significantly impact the gene complement and genome structure of closely related species; however, genome-wide experimental comparative analyses at this scale have been impossible due to the lack of efficient methods of comparing genomes with accuracy and a high level of resolution. With the advent of the human genome reference sequence, however, new approaches have emerged that facilitate the simultaneous detection and high-resolution mapping of large-scale DNA variation across the entire genome (Snijders et al. 2001). One such method, array comparative genomic hybridization (array CGH), has demonstrated the ability to reliably detect DNA copy number changes between genomic DNA samples with the resolution of a single BAC clone (Pinkel et al. 1998; Albertson et al. 2000; Snijders et al. 2001).

Our array CGH procedure involves the hybridization of differentially labeled specimen and reference genomic DNA to an array of large-insert genomic clones. The hybridization intensity ratio at each array locus is proportional to the copy number ratio between genomic DNA samples, which is used as a measure of putative regional gains and losses. The microarray used in this study consisted of 2460 large-insert human BAC clones that had been previously mapped by STS content and validated by FISH (Snijders et al. 2001). The entire clone set encompassed approximately 12% (370 Mb) of the entire human genome, providing a resolution of 1 BAC every 1.4 Mb of DNA. To date, the array CGH technique has been used primarily to assess within-species DNA copy number variation associated with tumor progression or recurrent structural rearrangements of the human genome (Pinkel et al. 1998; Albertson et al. 2000; Bruder et al. 2001; Hodgson et al. 2001). In this study, we applied array CGH to detect fixed inter-specific copy-number differences between the genomes of humans and great apes.

Large-scale sequence duplication and deletion events have the potential to significantly impact the structure and evolution of the primate genome. At the simplest level, sequence gain and loss events can alter the gene complement of an organism, which may result in phenotypic variation susceptible to selection pressures. Gene loss has been proposed as an important force in driving the evolution of eukaryotic genomes in response to selective pressures such as altered growth conditions in yeast, or infectious disease in the human population (Olson 1999). This “less is more” hypothesis can only be addressed with respect to primate genomes through the application of high-throughput methods of sequence variation detection that allow for direct and efficient correlation with genome sequence information (or cDNA expression analysis). In addition, segmental duplications characterized from analysis of the human genome have proven to be considerably variable during the course of primate evolution, with alterations seen in copy number and location in the genomes of the great apes, particularly within the pericentromeric regions of the genome (Guy et al. 2000; Horvath et al. 2000b). Very little is known in regard to the role of segmental duplications in restructuring euchromatin, although they have been hypothesized to provide the molecular substrate for chromosomal rearrangement among human and great ape chromosomes (Stankiewicz et al. 2001; Samonte and Eichler 2002).

Using normal human genomic DNA as the reference, we compared relative hybridization intensities among four great ape species: chimpanzee (Pan troglodytes), bonobo (Pan paniscus), gorilla (Gorilla gorilla) and orangutan (Pongo pygmaeus) in pairwise comparisons with human. Multiple individuals of each species were examined in independent experiments, and array loci were scored as potential variants only if a consistent increase or decrease in hybridization intensity ratio was observed across all trials and all individuals of each species (see Methods). This approach ensured the detection of fixed differences between the species, but eliminated potential large-scale polymorphisms within species from further analysis. The application of such conservative criteria, however, was essential to identify sites of inter-specific copy number differences as opposed to spurious artifacts. We examined the distribution and the sequence context of these sites to provide some insight into the significance of such variation as a potential force underlying chromosomal change among humans and non-human primates. This study therefore presents the first genome-wide comparison of great ape species, with the level of resolution afforded by array CGH.
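The conservative scoring rule described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the scoring procedure used in this study: the 0.5 cutoff on the log2 ratio, the function name, the individual identifiers, and the locus names are all hypothetical, and the actual criteria are those given in the Methods.

```python
def fixed_variants(log2_by_individual, threshold=0.5):
    """Return {locus: 'gain' | 'loss'} for loci consistent in every hybridization.

    log2_by_individual: dict mapping an individual or trial identifier to a
        dict of locus -> log2(test/reference) ratio.
    threshold: hypothetical cutoff on the log2 ratio (an assumption).
    """
    runs = list(log2_by_individual.values())
    if not runs:
        return {}
    # Only loci measured in every hybridization can be scored.
    shared = set(runs[0]).intersection(*(set(run) for run in runs[1:]))
    calls = {}
    for locus in sorted(shared):
        values = [run[locus] for run in runs]
        if all(value >= threshold for value in values):
            calls[locus] = "gain"
        elif all(value <= -threshold for value in values):
            calls[locus] = "loss"
    return calls


# Hypothetical log2 ratios for three individuals of one species.
chimp_runs = {
    "PTR_1": {"locus_A": -0.62, "locus_B": 0.10, "locus_C": 0.71},
    "PTR_2": {"locus_A": -0.55, "locus_B": -0.58, "locus_C": 0.66},
    "PTR_3": {"locus_A": -0.70, "locus_B": 0.05, "locus_C": 0.30},
}
fixed = fixed_variants(chimp_runs)
```

In this hypothetical data set only locus_A is scored, as a loss. locus_B is altered in a single individual and locus_C falls short of the cutoff in one individual, so both are excluded as potential within-species polymorphisms or artifacts.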

RESULTS

Genomic DNA samples from multiple unrelated individuals of each primate species were analyzed using array CGH, and only sites displaying copy number variation in all individuals were scored as positive (Figure 2-1).


Figure 2-1. Example of array CGH data. Graphs depicting data for 2460 array loci are presented for a single individual of each species (PPA = Pan paniscus, PTR = Pan troglodytes, GGO = Gorilla gorilla, PPY = Pongo pygmaeus). Putative variant sites are circled in red. Vertical lines separate the loci from each chromosome, with the p-arm telomere oriented toward the left of each interval and the q-arm telomere toward the right. The Y+ interval contains loci from the Y chromosome in addition to potentially duplicated clones, determined by FISH characterization, and variants in this interval were disregarded for this analysis. Not all variant sites detected in a particular species are represented by the individuals shown. Hybridization of human male reference DNA with female primate DNA or vice versa will yield a constitutive gain or loss for the X chromosome, as seen for the PTR, PPA and GGO hybridizations. The GGO hybridization depicted here presents data from version 2.0 of the human BAC array, whereas the PTR, PPA and PPY hybridizations depicted are from version 1.14 of the human BAC array.


Chrom  Clone ID  Event  Average Log2 Ratio (PPA, PTR, GGO, PPY)
1  RP11-188A4  duplication  0.53
2  CTB-172I13  deletion  -1.15
2  RP11-77g15  deletion  -0.61
3  CTB-228K22  deletion  -0.57  -0.68
3  RP11-198g24  deletion  -0.62
4  RP11-53F02  deletion  -0.50
4  RP11-266E24  deletion  -0.57
4  RP11-176I04  deletion  -0.55
4  RP11-85B10  deletion  -0.56
4  RP11-42a17  deletion  -0.57  -0.73  -0.57
4  RP11-47p10  deletion  -0.58
4  RP11-82l08  deletion  -0.58
4  RP11-218i24  deletion  -0.61
4  CTD-2224I19  duplication  0.78
4  RP11-101N17  duplication  0.64
5  RP11-88L18  duplication  -1.13  0.65
6  RP11-52c20  deletion  -0.53
6  RP11-43b19  deletion  -1.32
7  RP1-164D18  duplication  1.20  1.66
7  RP11-64I2  duplication  0.71
7  GS1-216H24  duplication  0.56
7  GS1-35M23  deletion  -0.91
7  GS1-183H7  deletion  -0.54
7  GS1-74E4  deletion  -0.84
7  RP11-188A12  deletion  -0.56
7  RP11-217O10  duplication  0.55
8  RP11-82K08  deletion  -0.69
8  RP11-246G24  deletion  -0.51
8  RP11-121F07  deletion  -0.68
8  RP11-113B07  deletion  -0.68  -0.82  -0.76  -1.04
8  RP11-140K14  deletion  -0.93
8  RP11-122N11  deletion  -0.64
8  RP11-236O01  deletion  -0.65
8  RP11-218N24  duplication  0.67  0.51
9  RP11-65B23  duplication  0.88
9  RP11-32d04  deletion  -0.51  -0.52
10  RP11-71N21  duplication  0.54
11  RP11-209M09  duplication  0.81
11  RP11-230G20  deletion  -0.64
11  RP11-150D18  deletion  -0.53
11  RP11-56J22  duplication  0.69
11  RP11-171I08  deletion  -0.96
11  RP11-233M01  deletion  -1.07
11  RP11-75H24  deletion  -0.61  -0.64
13  CTD-2202J2  duplication  0.66
14  RP11-152G22  deletion  -0.50
14  RP11-84d12  deletion  -0.56  -0.50
15  RP11-194H07  duplication  1.23
16  RP11-49M06  duplication  1.20
16  GS-127C11  duplication  1.95  0.77
16  RP11-109D4  deletion  -0.64
16  RP11-283M20  deletion  -0.66
16  RP11-204E18  deletion  -0.66
16  PAC 191P24  duplication  1.38  0.50
17  RPC-34H11  duplication  0.63
17  RP5-1137e3  deletion  -1.11
17  RP11-50F16  duplication  1.01
17  CTB-262I19  duplication  1.20
18  RP11-81J13  duplication  0.75
22  RP1-15J16  deletion  -0.54
22  CTD-2093I24  duplication  0.52
X  RP11-839D20  duplication  0.74
X  RP11-180F16  duplication  1.99

Table 2-1. Summary of variant sites detected by array CGH. Variant sites were selected according to the average Log2 ratio of all reporting individuals. Species average Log2 ratios < -0.5 were considered putative deletions and Log2 ratios > 0.5 were considered putative duplications. PPA = Pan paniscus, PTR = Pan troglodytes, GGO = Gorilla gorilla, PPY = Pongo pygmaeus. A total of 63 sites of dosage variation were identified. The experimentally verified sites are indicated by bold type.
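The selection rule stated in the caption of Table 2-1 can be sketched in a few lines of Python. This is an illustrative sketch only, not the software used in the study: the function name `classify_site` and the example values are hypothetical, and the only details taken from the text are the two thresholds and the use of the species average over all reporting individuals.

```python
from statistics import mean

# Thresholds quoted in the Table 2-1 caption.
DELETION_THRESHOLD = -0.5
DUPLICATION_THRESHOLD = 0.5


def classify_site(log2_ratios):
    """Classify one array locus for one species.

    `log2_ratios` holds the Log2 ratio reported by each individual of
    the species at this locus (hypothetical input layout).
    """
    species_average = mean(log2_ratios)
    if species_average < DELETION_THRESHOLD:
        return "deletion"
    if species_average > DUPLICATION_THRESHOLD:
        return "duplication"
    return "no call"


# Hypothetical per-individual values, not data from the study.
print(classify_site([-0.9, -1.0, -1.1]))
print(classify_site([0.1, -0.2, 0.05]))
```

A strict inequality is used because that is how the caption states the rule; values printed as -0.50 in the table presumably reflect rounding.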


Four great ape species were examined in this study: common chimpanzee (n=2), pygmy chimpanzee (n=4), gorilla (n=5) and orangutan (n=4). In addition, several hybridizations with individual primate genomic DNA samples were repeated and the results compared to previous hybridizations to test the consistency of the hybridization results. Based on these criteria, 63 sites of variant intensity ratios (38 reductions and 25 increases) were consistently identified in primate-human comparisons and mapped to the human reference genome sequence (Figure 2-1, Table 2-1). Both lineage-specific and shared ratio differences were observed among the great apes. Lineage-specific differences, however, predominated (Figure 2-2b). As expected, the quantity of variant sites detected in each great ape species was in proportion to the estimated divergence times of each species (Goodman 1999), as orangutan showed the greatest number of ratio differences and the chimpanzee species demonstrated the fewest ratio differences.


Figure 2-2. Sites of great ape/human variation detected by array CGH. (A) A histogram shows the chromosomal distribution of variant sites detected by array CGH. PPA = Pan paniscus, PTR = Pan troglodytes, GGO = Gorilla gorilla, PPY = Pongo pygmaeus. The number of variant sites per 100 Mb is shown for each chromosome. Approximately 12% of the human genome (2460 BAC and PAC clones) was screened by this assay, identifying a total of 63 loci for all species comparisons. (B) A Venn diagram demonstrates that the majority of sites detected by array CGH are specific to each great ape lineage. The number of variant sites per species is shown above the species abbreviation (PTR and PPA are grouped because they share a majority of variant sites), and shared sites are indicated in the respective overlapping regions. The Venn diagram describes 64 rearrangements, yet 63 sites of variation were detected in our analysis. The difference arises because one rearrangement site (RP11-88L18) demonstrated a duplication in orangutan and a deletion in gorilla, which are considered separate events. A complete description of each site including Log2 ratio intensity statistics is available (Table 2-1).


We chose a subset of the 63 putative variant sites for detailed experimental

validation and verification of the array CGH approach. Seven sites with increased

primate signal intensity ratios, therefore potential duplications with respect to the human

genome, were assessed by interphase and metaphase FISH. In all cases, a genomic

duplication was detected among the great apes when probing with the human single-copy

arrayed BAC. Among these variants we observed both intra- and inter-chromosomal

duplications, duplications within homologous chromosomes and between non-

homologous chromosomes respectively, which were readily resolved at the metaphase

and interphase levels (Figure 2-3). Similarly, seven sites with reduced primate signal

intensity ratios, potential inter-specific deletions, were examined by FISH. We

encountered one instance where there was a complete absence of signal, suggesting that a deletion of at least the size of the arrayed BAC (175 kb by PFGE, data not shown) had occurred (Table 2-2 and Figure 2-4a). The extent of this deletion was subsequently confirmed by STS-content mapping and Southern analysis (data not shown). Reduced

FISH signal intensity was observed for several of the remaining six sites tested, yet

deletions could not be convincingly demonstrated by comparative FISH. Because the Log2 ratios for these unverified sites were decreased to a lesser extent than that of the complete deletion demonstrated in orangutan by RP11-171I8, we believe these sites represent partial deletions of the arrayed BAC sequence. Thus, we sought an alternative approach to validate these potential partial deletions.

Table 2-2. Summary of experimentally verified sites of genomic rearrangement. Experimentally verified gains and losses among great ape species, with respect to the human genome, were mapped and characterized by linked STS, FISH location, duplication type and approximate size, and genic content. PPA = Pan paniscus, PTR = Pan troglodytes, GGO = Gorilla gorilla, PPY = Pongo pygmaeus. Average Log2 ratios for each variant site are shown for all species examined; species average Log2 ratios exceeding the +/- 0.5 threshold are indicated in bold type. * Defined as within 100 kb proximal or distal of the BAC-linked STS according to the UCSC Genome Browser.

[Table body not recoverable from the extracted text. Columns: Clone ID, Event, Chrom, FISH, STS, Duplication type, Gene content*, average Log2 ratios by species. Sites listed: deletions RP11-75H24 and RP11-171I08; duplications RP11-218N24, RP11-56J22, RP11-194H07, RP11-49M06, RP11-88L18, RP11-64I02 and RP11-209M09.]


Figure 2-3. Comparative FISH validation of array CGH-detected duplications. Interphase and metaphase images are presented for three array CGH duplications verified by FISH; all signals are shown in red. HSA = Homo sapiens, PPA = Pan paniscus, GGO = Gorilla gorilla, PPY = Pongo pygmaeus. Array CGH results for an individual of the respective primate species are shown below each FISH comparison, with the variant BAC used as a probe in each respective FISH experiment circled in red. The x-axis represents the map position in megabases along the chromosome and the y-axis represents the Log2 ratio. The p-telomere is oriented to the left in each plot, and the q-telomere to the right. The centromere is indicated as a vertical line. An intrachromosomal duplication of RP11-218N24 is confirmed in bonobo when compared with human (A). Similarly, a gorilla intrachromosomal duplication is illustrated by interphase FISH using probe RP11-64I2, in contrast to the single signal on human 7p14 (B). Extensive interchromosomal duplication of BAC RP11-88L18 is shown for orangutan acrocentric chromosomes when compared with the unique 5p15 human locus (C). See Table 2-2 for more details.


We developed a BAC end sequence-based strategy to verify the potential partial deletions that were not confirmed by comparative FISH. This strategy compared the insert size of primate BACs linked to a variant site with the estimated BAC insert size according to the human genome reference sequence (Methods; Figure 2-4b). A disparity between the experimentally determined primate BAC insert size and the insert size estimated from sequence similarity searches against the human genome sequence indicated that a partial deletion had occurred. For a putative deletion in chimpanzee 11q12, this approach identified a ~38 kb disparity between chimpanzee BAC insert size and the human equivalent insert size (Figure 2-4b). Examination of the underlying human sequence revealed the presence of a ~40 kb tandem intra-chromosomal duplication event that contained a partial duplication of the BC005998 gene. Subsequent Southern analysis confirmed the deletion of the proximal BC005998 duplicon in the chimpanzee genome

(Figure 2-4b). Thus we had successfully identified a deletion in the chimpanzee genome using array CGH that was below the level of cytogenetic detection and involved the partial deletion of a BAC clone on the array. We believe the tandem duplication was present prior to the divergence of the great ape lineages since the gorilla and orangutan did not show reduced intensity ratios at this locus. Two other predicted sites were investigated for possible deletions by this method; however no size discrepancies were detected by this BAC-end mapping approach.


Figure 2-4. Validation of array CGH-detected deletions. (A) Validation of a deletion by comparative FISH. FISH images are shown for orangutan variant array BAC RP11-171I8. HSA = Homo sapiens, PPY = Pongo pygmaeus. To control for variability in FISH conditions, human and orangutan cells were mixed on a slide prior to fixation. Hybridization with BAC RP11-171I8 (red) and an unrelated control probe, RP11-233C13 (green), is shown for human (HSA) and orangutan (PPY). A complete absence of signal was observed for RP11-171I8, but not for RP11-233C13, against the orangutan metaphase chromosomes. A genomic deletion at least the size of the BAC (175 kb by PFGE; see text) was confirmed by STS content mapping and Southern analysis (data not shown). (B) Deletion validation by primate BAC analysis and Southern blot. Human BAC clone RP11-75H24 showed a reduced relative fluorescent signal intensity ratio of chimpanzee compared with human (Table 2-2). HSA = Homo sapiens, PTR = Pan troglodytes. Four chimpanzee BAC clones from libraries RPCI-43 and CHORI-251 orthologous to this site were obtained, end-sequenced, and restriction mapped. A ~38-kb discrepancy in size was determined by comparing BAC PFGE size and BAC end sequence placement of the chimpanzee loci against both public and private assembly versions of the human genome. Analysis of the human sequence identified ~40-kb tandem duplications (black boxes labeled Proximal and Distal), both containing exons 1-3 of gene BC005998. Using a probe spanning exon 1 of BC005998, Southern analysis of HaeII- and SpeI-digested human (RP11-75H24) and chimpanzee (RP43-35C3) BAC DNA shows the loss of one duplication within the chimpanzee genome (HaeII and SpeI restriction sites are indicated by H and S, respectively).

Combined, the experimental validation results suggest that array CGH offers excellent sensitivity in detecting genomic duplications, as all seven array CGH-detected duplications tested were verified by comparative FISH. Deletions, however, were not detected with similar efficiency, as only two of the seven sites tested have been experimentally verified.

Unlike intensity ratio increases, the intensity ratio decreases reported by array CGH may be due to a number of other factors, such as extensive sequence divergence or dramatic restructuring of loci as a result of repeat-content variation. Therefore, sites of reduced intensity ratios, without significant size differentials between species by our BAC end sequence-based approach, remain important targets for further comparative sequence analysis. As large-insert genomic libraries become available from a number of the great ape species (Eichler and DeJong 2002), high-quality BAC sequencing of such regions will provide the most efficient method for assessing the molecular nature of these differences.

DISCUSSION

Previous studies have suggested that the majority of fixed large-scale variation between the genomes of humans and great apes localizes to gene-poor heterochromatic regions (Verma and Luke 1991; Archidiacono et al. 1995; Trask et al. 1998b). In light of these predictions, we examined the genomic landscape around the nine experimentally verified duplications (n=7) and deletions (n=2). Gene content within the vicinity of a bona fide genomic rearrangement site was assessed by considering the proximal and distal 100 kb flanking the array BAC-linked STS, or by placing the BAC end sequences for the variant clone against the human genome assembly by sequence similarity searches

(Methods). Overall, we found that 6 of 9 experimentally verified sites of rearrangement lie within or in close proximity to actively transcribed euchromatic regions (Table 2-2).

These results suggest that gene-rich regions are susceptible to copy number changes between humans and great apes, and that array CGH can effectively detect such rearrangements. Such regional differences in copy-number could, in theory, underlie gene expression differences that have been predicted and observed between humans and great apes (King and Wilson 1975; Enard et al. 2002).

The distribution of sites along human chromosomes, determined by viewing the BAC-linked STS on the human assembly, shows no bias toward subtelomeric or pericentromeric regions. A total of 9 of the 63 variant sites occurred within 2 Mb of a subtelomeric or pericentromeric region, and the remaining 86% of sites (54/63) mapped to interstitial euchromatic regions of the human genome. This result demonstrates that the large genomic insertion and deletion events occurred in euchromatic, and potentially gene-rich, regions and were not limited to the heterochromatic expansions observed in previous karyotypic analyses (Yunis et al. 1980; Yunis and Prakash 1982).
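The 2 Mb proximity test described above can be illustrated with a short Python sketch. It is a hedged illustration rather than the procedure actually used: the function name, the coordinate layout, and all coordinates in the example are hypothetical, and only the 2 Mb window and the 54/63 tally come from the text.

```python
WINDOW_BP = 2_000_000  # 2 Mb window quoted in the text


def near_telomere_or_centromere(position, chrom_length,
                                centromere_start, centromere_end,
                                window=WINDOW_BP):
    """Return True if a site lies within `window` bp of either chromosome
    end or of the centromere interval (hypothetical coordinate layout)."""
    near_p_telomere = position <= window
    near_q_telomere = position >= chrom_length - window
    near_centromere = (centromere_start - window
                       <= position
                       <= centromere_end + window)
    return near_p_telomere or near_q_telomere or near_centromere


# Hypothetical 100 Mb chromosome with a centromere at 40-43 Mb.
print(near_telomere_or_centromere(20_000_000, 100_000_000,
                                  40_000_000, 43_000_000))

# Share of sites reported as interstitial in the text: 54 of 63.
print(round(100 * 54 / 63))  # 86
```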

During our analysis of potential large-scale variation, we noticed that certain chromosomes showed a disproportionately large number of variant sites (Figure 2-2a).

Interestingly, many of the same chromosomes enriched for rearrangements

(chromosomes 4, 7, 8, 16, 17 and 22) have also been shown to be enriched for segmental duplications (Bailey et al. 2002a). Indeed, a positive correlation (r² = 0.50) is observed when segmental duplication content and the number of variant sites per chromosome, adjusted for array clone coverage per chromosome, are compared. Several studies have shown that highly homologous sequences within genomes (also known as segmental duplications) may predispose to homologous unequal recombination, leading to large-scale deletions and duplications (Lupski 1998; Mazzarella and Schlessinger 1998; Ji et al. 2000a; Emanuel and Shaikh 2001). More recently non-allelic homologous recombination between such sequences has been postulated to underlie evolutionary chromosome rearrangements (Tunnacliffe et al. 1993; Nickerson et al. 1999; Valero et al. 2000; Dehal et al. 2001; Stankiewicz et al. 2001).
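For readers who wish to reproduce a per-chromosome comparison of this kind, the coefficient of determination can be computed as in the sketch below. The text does not specify how variant counts were adjusted for clone coverage, so dividing by the number of arrayed clones per chromosome is an assumption, and every number in the example is invented for illustration.

```python
def r_squared(xs, ys):
    """Square of the Pearson correlation coefficient of two sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


def sites_per_clone(variant_sites, clones_on_array):
    """Assumed coverage adjustment: variant sites per arrayed clone."""
    return variant_sites / clones_on_array


# Hypothetical chromosomes: (duplication content %, variant sites, clones).
chromosomes = [(2.0, 1, 150), (4.5, 4, 120), (7.0, 6, 90), (3.0, 3, 140)]
dup_content = [c[0] for c in chromosomes]
adjusted = [sites_per_clone(c[1], c[2]) for c in chromosomes]
print(r_squared(dup_content, adjusted))
```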

To more specifically test this association, we examined the sequence context for the nine verified structural rearrangements detected in this study. Segmental duplications comprise an estimated 5% of the total human genome sequence, yet we found 5/9 (56%) of our validated rearrangements were within close proximity to segmental duplications, a


11-fold bias (Bailey et al. 2002a). Considering the majority of duplications associated

with the variant sites were intrachromosomal duplications (2.8% of total genome

sequence), this bias increases to 14-fold (Bailey et al. 2002a). The results indicate a

highly significant (G=19.63, p<0.0001) non-random association of large-scale structural

variation and segmental duplication. Although the sample size of verified duplications and deletions is small, we believe this trend points to the important role segmental duplications may play in evolutionary rearrangements. Furthermore, the sites chosen for the array used in our great ape comparisons were selected for their seemingly single-copy nature, as determined by extensive testing on human material. This inherent bias against sites of duplication may thus provide a fairly conservative estimate of the role segmental duplications play in interstitial euchromatic evolutionary rearrangements. A microarray targeted to regions of recent segmental duplication should provide a more accurate view of the dynamics of these regions and their relative importance in contributing to great ape and human evolution.
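The 11-fold figure quoted in this section follows from dividing the observed fraction of validated rearrangements near segmental duplications (5 of 9) by the fraction of the genome occupied by segmental duplications (an estimated 5%). A minimal sketch of that arithmetic, with a hypothetical function name:

```python
def fold_enrichment(hits, total, genome_fraction):
    """Observed fraction of sites divided by the fraction expected
    if sites were distributed in proportion to sequence content."""
    return (hits / total) / genome_fraction


# 5 of 9 validated sites versus 5% of the genome, as given in the text.
print(round(fold_enrichment(5, 9, 0.05), 1))  # 11.1
```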

In summary, our results show that array CGH technology is a powerful approach for interrogating large-scale differences among the genomes of closely-related species.

It can successfully identify large-scale deletions and duplications too small to be detected by standard karyotype analysis, yet too large to be readily resolved by whole genome shotgun sequencing approaches. Our analysis of the orangutan genome using human

BAC microarrays suggests that cross-species comparisons by this method tolerate sequence divergence of at least 3%. Array CGH is therefore a valuable first step to target regions for specialized study, and is amenable to many cross-species comparisons where entire genomes are unlikely to be sequenced. Second, our analysis of

primate genomes indicates that large-scale events are not uncommon within genic

regions. While large-scale differences among great apes have been documented within

heterochromatic regions such as pericentromeric DNA, this is among the first

demonstrations of large-scale differences within euchromatic DNA. We have

characterized deletions and duplications ranging in size from 40-175 kb and have

provided a first approximation of the frequency of such events at a genome-wide level.

Rate estimation of such events will require more uniform representation of the genome as

well as resolution of regions enriched for segmental duplications. Perhaps the most

striking finding is that the events are non-randomly distributed. Regions within or in the vicinity of segmental duplications show a proclivity to delete and duplicate. While such genomic variation has commonly been reported for de novo rearrangements associated with disease, these data implicate a homology-driven mechanism for the generation of fixed differences that distinguish closely related species.

The biological significance of these events and their relationship to gene expression variation among primates await further characterization.

MATERIALS AND METHODS

Human BAC Arrays

Arrays were prepared as described by Snijders et al. (2001). Briefly, ligation-

mediated PCR was employed to prepare DNA representations of ~2460 human BAC and

PAC clones. The DNA was suspended in 20% DMSO and spotted in triplicate onto

chromium coated microscope slides using a custom built printer employing capillary

printing pins. The entire array of ~7500 spots filled a 12 mm x 12 mm square. Each

clone on the array contained at least one STS mapped to the human reference assembly sequence, allowing the underlying genomic sequence to be assessed. All clones on the array were cytogenetically mapped by FISH, confirming 93.4% as single copy. In addition, extensive analyses of cell lines containing known single copy aberrations were performed to allow recognition of clones that contained significant amounts of sequence that mapped to multiple sites in the human genome. Data from such clones were excluded from the analysis presented here. On average, the array provides a resolution of 1 genomic clone for every 1.4 Mb of human genomic sequence. It should be noted that by performing this analysis with a human BAC array, deletions that have occurred specifically within the human lineage with respect to other primate genomes cannot be readily detected. Deletions in the human lineage with respect to the chimpanzee may become apparent as the chimpanzee genome project progresses, and sites detected via this method will provide landmarks for where the chimpanzee and human sequence maps may vary significantly. For the other primate species for which no dedicated genome project is anticipated, it may be necessary to develop reciprocal primate BAC arrays to detect large-scale deletions that have occurred within the human genome.

Primate DNA Samples

Four great ape species were examined in this study: common chimpanzee (n=2), pygmy chimpanzee (n=4), gorilla (n=5) and orangutan (n=4). For each species, all individuals were unrelated with the exception of two gorilla samples which were related as half-sisters. Chimpanzee samples (PTR BC449, BC450 and PPA OR833, KB8763,

BB501, BB502) were derived from lymphoblastoid and primary fibroblast cell lines. Of


the four orangutan DNAs (O100, O101, GM04272, Segundo), three were isolated from

EBV-transformed lymphoblastoid cell lines and were compared to results from an

unrelated blood sample. The gorilla cohort (Kwan, KB6278, 9247, 324, 465) consisted

of 4 DNA samples extracted from peripheral blood lymphocytes and one primary

fibroblast cell line. The use of DNA isolated from both transformed cell line material and peripheral blood served as a valuable control to assess the quality of the cell line material.

For microarray analysis, only sites where consistent increases and decreases in Log2

ratios were observed in all individuals were considered. While this requirement

eliminated potential false positive signals due to small-scale rearrangements within cell

lines, it also removed potential structural polymorphic variation from further analysis.
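The all-individuals requirement can be expressed as a simple filter. The sketch below is an assumption-laden illustration: the text does not state a per-individual cutoff, so the sign of each Log2 ratio is used by default, and the optional `min_shift` parameter is a hypothetical addition.

```python
def consistent_direction(log2_ratios, min_shift=0.0):
    """Return 'gain' if every individual shows an increase, 'loss' if
    every individual shows a decrease, and None otherwise.

    `min_shift` is a hypothetical per-individual cutoff; the default of
    0.0 checks only the sign of each Log2 ratio.
    """
    if all(r > min_shift for r in log2_ratios):
        return "gain"
    if all(r < -min_shift for r in log2_ratios):
        return "loss"
    return None


# Hypothetical values: the third individual disagrees, so the site is
# dropped as a possible polymorphism or cell line artifact.
print(consistent_direction([0.7, 0.6, -0.1, 0.8]))
```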

Array Comparative Genomic Hybridization

Genomic DNA samples were prepared from blood using the PureGene Genomic

DNA Isolation Kit (Gentra Systems). Great ape and human DNAs were directly labelled

with Cy3 and Cy5 fluorochromes, respectively, by random primer labelling. Arrays were

simultaneously hybridized with a primate and a human genomic DNA probe for at least

48 hours, using unlabeled human Cot-1 DNA to block repetitive sequences. After post-

hybridization washing, the arrays were imaged with a custom-built CCD camera system,

and quantitative measurements of the fluorescence intensity ratios were obtained using

the software package UCSF SPOT (Pinkel et al. 1998; Jain et al. 2002). Ratios for the

triplicate spots were averaged. For each hybridization, the primate-to-human

fluorescence intensity ratio (Log2 ratio) at every array locus was assessed for variation.

Arrays fabricated with two different sets of print stocks and printed in several different

batches were used to eliminate false positives due to inconsistencies in array production.

If a particular site appeared to be variant in a primate species, the Log2 ratios for all hybridizations of DNAs from individuals of that species were averaged. Sites with average Log2 ratios > 0.5 or < -0.5 were selected as putative variants.

The ratio variation among clones at constant copy number (most of the clones on the array) was significantly higher for great ape/human comparisons than for human/human comparisons. This ratio variation complicated recognition of inter-species copy number differences that might be affecting only a portion of a BAC, and would therefore show a ratio change of reduced magnitude compared to the expected value for single or multi-copy changes. Fixed copy number differences between species should result in 0 copies in the genome for a deletion event, or 4 copies for a duplication event.

Thus, if the change affected the entire BAC, one should see Log2 ratios of minus infinity or 1 for deletions and duplications, respectively. However, several factors modify this expectation, including: 1) changes that affect only part of the BAC, which are predicted to produce less dramatic ratio differences; 2) whether or not the arrayed BAC contains duplicated sequences; and 3) incomplete suppression of the repetitive sequences in the great ape genome by human Cot-1 DNA.
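The expected values quoted above, and the weaker signal from a change affecting only part of a BAC, can be illustrated with a simple model in which hybridization signal scales linearly with copy number across the clone. That linearity is an assumption made for illustration, and the function name and the 25% example are hypothetical.

```python
import math


def expected_log2_ratio(primate_copies, human_copies=2,
                        affected_fraction=1.0):
    """Expected Log2(primate/human) ratio for one arrayed BAC.

    `affected_fraction` is the portion of the BAC sequence whose copy
    number differs in the primate; the rest is assumed to be present at
    the human copy number. Signal is assumed linear in copy number.
    """
    primate_signal = ((1.0 - affected_fraction) * human_copies
                      + affected_fraction * primate_copies)
    if primate_signal == 0:
        return float("-inf")
    return math.log2(primate_signal / human_copies)


print(expected_log2_ratio(4))   # fixed duplication of the whole BAC: 1.0
print(expected_log2_ratio(0))   # fixed deletion of the whole BAC: -inf
# Hypothetical deletion removing a quarter of the BAC sequence.
print(expected_log2_ratio(0, affected_fraction=0.25))
```

Under this model a deletion covering a quarter of the clone gives a Log2 ratio of about -0.42, which would fall short of the -0.5 selection threshold; this is consistent with the point that partial changes produce less dramatic ratio differences.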

Fluorescence In Situ Hybridization

Lymphoblastoid cell lines derived from humans (Homo sapiens) and 4 great ape species (bonobo: Pan paniscus, chimpanzee: Pan troglodytes, gorilla: Gorilla gorilla and orangutan: Pongo pygmaeus) were used to prepare metaphase and interphase nuclei. In situ hybridizations with BAC probes corresponding to the arrayed clones, and control probes where appropriate, were conducted using standard techniques (Lichter et al.


1990). To prevent cross hybridization due to the presence of repetitive sequence within

BAC probes, Cot-1 DNA was used to block potential hybridization of high-copy repeat sequences. A minimum of 20 interphase and metaphase nuclei were examined in each hybridization experiment for the assessment of genomic duplications and deletions. For confirmation of the RP11-171I8 deletion in orangutan, human and orangutan cells were mixed on a slide prior to fixation to ensure hybridization conditions were equivalent for cells of both species.

BAC Analysis

Large-insert genomic chimpanzee BAC clones corresponding to sites of putative deletion were isolated from the RPCI-43 and CHORI-251 BAC libraries. Amplicons generated by PCR from end-sequences of the variant human BAC were used as hybridization probes against the chimpanzee large-insert clone libraries as previously described (Horvath et al. 2000a). The resulting chimpanzee BAC clones were end-sequenced (T7 and SP6) and the end sequences were aligned to the human genome assembly using sequence similarity searches (BLAST). The human “equivalent” insert size of the chimpanzee BACs was then calculated. The insert size of chimpanzee BAC clones was determined by pulsed-field gel electrophoresis. A disparity between the insert size of the chimpanzee BAC clone and the human “equivalent” insert size indicated that a deletion had occurred in the chimpanzee genome. Southern blot analysis was used to confirm the presence of the deletion in the chimpanzee BACs with respect to the human array

BAC.
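The size comparison described in this section can be sketched as follows. This is an illustration under stated assumptions, not the analysis code used in the study: both end sequences are assumed to align to the same human chromosome, each placement is represented as a hypothetical (start, end) coordinate pair, and the coordinates and PFGE size in the example are invented.

```python
def human_equivalent_insert_size(t7_hit, sp6_hit):
    """Span on the human assembly bounded by the two BAC end placements.

    Each hit is a (start, end) pair of human coordinates on the same
    chromosome (hypothetical representation of a BLAST placement).
    """
    coordinates = [*t7_hit, *sp6_hit]
    return max(coordinates) - min(coordinates) + 1


def size_disparity(pfge_size_bp, t7_hit, sp6_hit):
    """Human-equivalent size minus the measured primate insert size.

    A clearly positive value suggests sequence present in human but
    absent from the primate clone, i.e. a candidate deletion.
    """
    return human_equivalent_insert_size(t7_hit, sp6_hit) - pfge_size_bp


# Invented example: ends placed ~188 kb apart in human, insert measured
# at 150 kb by PFGE.
t7 = (1_000_000, 1_000_600)
sp6 = (1_187_400, 1_188_000)
print(size_disparity(150_000, t7, sp6))
```

In practice the PFGE measurement carries its own error, so a disparity would need to exceed that uncertainty before being treated as evidence of a deletion.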


ACKNOWLEDGEMENTS

We would like to thank Oliver Ryder, Lisa Faust and Erin Adams for providing primate material for this study. This work was supported, in part, by NIH grants

GM58815 and HD043569 and U.S. Department of Energy grant ER62862 to EEE, NCI to DP, and the financial support of Telethon, CEGBA (Centro di Eccellenza Geni in campo Biosanitario e Agroalimentare), and MIUR (Ministero Italiano della Istruzione e della Ricerca) to NA. The financial support of the W. M. Keck Foundation, Vysis Inc., and a grant from the Charles B. Wang Foundation to the Center for Computational

Genomics, Case Western Reserve University are also gratefully acknowledged.

Chapter 3

Refinement of a chimpanzee pericentric inversion

breakpoint to a segmental duplication cluster

Devin P. Locke1, Nicoletta Archidiacono2, Doriana Misceo2, Maria F. Cardone2, Stephanie Dechamps3, Bruce Roe3, Mariano Rocchi2 and Evan E. Eichler1

1Department of Genetics, Center for Computational Genomics and Center for Human Genetics, Case Western Reserve University School of Medicine and University Hospitals of Cleveland, 10900 Euclid Avenue, Cleveland, OH 44106

2Dipartimento di Anatomia Patologica e di Genetica, Sezione di Genetica, University of Bari, Bari 70126, Italy

3Department of Chemistry and Biochemistry, University of Oklahoma, Norman, OK 73019

Note: This manuscript has been published: Locke, D.P., Archidiacono, N., Misceo, D., Cardone, M.F., Dechamps, S., Roe, B., Rocchi, M. and E.E. Eichler. 2003. Refinement of a chimpanzee pericentric inversion breakpoint to a segmental duplication cluster. Genome Biol. 4:R50. N.A., D.M., M.F.C. and M.R. provided comparative fluorescence in situ hybridization data. S.D. and B.R. provided comparative sequencing data.


ABSTRACT

Pericentric inversions are the most common euchromatic chromosomal differences

among humans and the great apes. The human and chimpanzee karyotypes differ by nine

such events, in addition to several constitutive heterochromatic increases and one

chromosomal fusion event. Reproductive isolation and subsequent speciation are thought

to be the potential result of pericentric inversions, as reproductive boundaries form due to

hybrid sterility. Here we employed a comparative FISH approach, using probes selected

from a combination of physical mapping, genomic sequence, and segmental duplication

analyses to narrow the breakpoint interval of a pericentric inversion in chimpanzee

involving the orthologous human 15q11-q13 region. We have refined the inversion

breakpoint of this chimpanzee-specific rearrangement to a 600 kb interval of the human

genome consisting of entirely duplicated material. Detailed analysis of the underlying

sequence indicated this region is comprised of multiple segmental duplications, including

a previously characterized duplication of the alpha 7 neuronal nicotinic acetylcholine receptor subunit gene (CHRNA7) in 15q13.3 and several Golgin-Linked-to-PML or

LCR15 duplications. We conclude, based upon experimental data excluding the

CHRNA7 duplicon as the site of inversion, and sequence analysis of regional

duplications, the most likely rearrangement site is within a GLP/LCR15 duplicon. This

study further exemplifies the genomic plasticity due to the presence of segmental

duplications and highlights their importance for a complete understanding of genome

evolution.


INTRODUCTION

The karyotypes of humans and the African and Asian great apes are remarkably well

conserved, with relatively few large-scale chromosomal changes among these species

despite the considerable phenotypic and biological differences between hominoids (King

and Wilson 1975; Yunis et al. 1980; Yunis and Prakash 1982). This conservation is

acutely relevant in comparisons between the human (Homo sapiens, abbreviated as HSA)

and common chimpanzee (Pan troglodytes, abbreviated as PTR) genomes, for in order to

achieve a complete ascertainment of the evolution of our own lineage it is necessary to understand what differentiates us at the genomic level from that of our closest relatives.

Furthermore, insight into the mechanism(s) underlying primate chromosomal evolution can be obtained from the molecular characterization of species-specific rearrangement breakpoints. Recent analyses of synteny disruptions between man and mouse, although valuable, do not provide the level of detail afforded by the comparison of closely related species such as chimpanzee and human (Dehal et al. 2001; Waterston et al. 2002).

Several mechanisms of genetic change are thought to lead to speciation, including gene evolution via coding sequence mutation, gene expression variation by regulatory element mutation, and chromosomal rearrangement, which has the potential to create reproductive barriers and induce genetic isolation within an existing population (White 1968). Although a thorough determination of all genetic differences between humans and our closest non-human primate relatives will not be possible until the genomic sequences of great ape species have been determined, many of the chromosomal rearrangements between great apes have been previously characterized at the cytogenetic level.

Primarily examined through the use of G-banding cytogenetic techniques, the human and common chimpanzee karyotypes differ by only 10 euchromatic rearrangements: a telomere fusion between PTR chromosomes 12 and 13, resulting in

HSA chromosome 2, and 9 pericentric inversions (HSA 1, 4, 5, 9, 12, 15, 16, 17, 18)

(Yunis et al. 1980; Yunis and Prakash 1982). The predominance of pericentric inversions between chimps and humans highlights their potential importance in the divergence of human from non-human primate species, and provides an opportunity to investigate the mechanism facilitating these rearrangements. Recent studies have characterized several evolutionary breakpoints in common chimpanzee and other great ape species including pericentric inversions and a chromosome translocation (Nickerson and Nelson 1998;

Stankiewicz et al. 2001; Kehrer-Sawatzki et al. 2002). We present here a refinement of the breakpoint associated with a previously identified pericentric inversion of the human

15q11-q13 orthologous region in common chimpanzee (XVp) to a region containing multiple segmental duplications (Yunis et al. 1980; Yunis and Prakash 1982; Luke and

Verma 1995).

RESULTS

A panel of BAC and PAC clones spanning ~10 Mb of human genome sequence was initially used to characterize the pericentric inversion of human 15q11-q13 in chimpanzee. Probes were selected that flank known sites of genomic rearrangement based upon previous YAC and BAC/PAC mapping studies, in concert with the human sequence map available from the UCSC Genome Browser (http://www.genome.ucsc.edu)

(Amos-Landgraf et al. 1999; Christian et al. 1999; Ji et al. 2000b; Kent et al. 2002).

Probes flanking the Prader-Willi/Angelman syndrome breakpoints, in addition to a characterized inv dup(15) rearrangement, were selected to test the possibility that the pericentric inversion evolutionary breakpoint corresponded to a site of known human genomic instability (Figure 3-1). Initially, a total of 18 BAC/PAC probes were comparatively mapped using FISH (data not shown).

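[Figure 3-1 image not reproduced; the following labels were recovered from it. Clones, each with the accession printed beside it, ordered from telomere (TEL) to centromere (CEN): 88O16 (AC018870), 758N13 (AC074201), 456J20 (AC026951), 30N16 (AC021413), 348B17 (AC009562), 16E12 (AC012236), 126J9 (AC009873), 11J16 (AC021316), 736I24 (AC091057), 382B18 (AC019322), 40J8 (AC010799), 422F16 (AC110601), 605N15 (AC026150), 932O9 (AC120045), 360J18 (AC069382), 37J13 (AC061965), 686I6 (AC024474), pDJ778A2 (AC004583), 322N14 (AC017046), 131I21 (AC009696), 289D12 (AC016446). Other labels: cytogenetic bands 15q13.3, 15q13.2, 15q13.1, 15q12 and 15q11.2; ID1; CHRNA7 and CHRNA7’; BP3, BP2 and BP1; PTR Pericentric Inversion Breakpoint; legend entries for PTR XVp FISH signal, PTR XVq FISH signal, and PTR XVp & XVq FISH signals.]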

Figure 3-1. Map of human 15q11-q13. Probes used to localize the chimpanzee (PTR = Pan troglodytes) pericentric inversion are shown color-coded according to whether they hybridized to PTR XVp (yellow) or PTR XVq (blue). The breakpoint spanning clones are indicated in green. The three major PWS/AS common deletion breakpoints are indicated in red as BP1, BP2 and BP3. The previously characterized inv dup(15) breakpoint is indicated in orange as ID1. The duplicate CHRNA7 sequence is designated CHRNA7’. Accession numbers for all clones are indicated; all clones were obtained from the RPCI-11 BAC library. The map is not drawn to scale.

During this analysis, chimpanzee, pygmy chimpanzee, gorilla and orangutan

chromosomes were assayed for the species-specificity of the pericentric inversion (Figure

3-2). Previous cytogenetic analyses of human, chimpanzee, gorilla and orangutan G-

banded metaphase chromosomes demonstrated that the pericentric inversion was a

chimpanzee-specific event, with humans maintaining the ancestral state of this region

when compared to gorilla and orangutan (Yunis et al. 1980; Yunis and Prakash 1982).

Two-color FISH experiments utilizing distal probe RP11-88O16 and the PWS/AS common deletion breakpoint-marking probe pDJ-778A2 clearly demonstrate that the chimpanzee lineage is the only great ape lineage to harbor this rearrangement (Figure 3-2).

Figure 3-2. Two-color FISH analysis of the pericentric inversion of human 15q11-q13 in chimpanzee. The analysis used probe RP11-88O16 (red), located in distal human 15q13.3, and probe pDJ-778A2 (green), which hybridizes to the HERC2-related PWS/AS breakpoint clusters in 15q11.2 and 15q13.1 (see Figure 3-1). HSA = Homo sapiens, PTR = Pan troglodytes, GGO = Gorilla gorilla, PPY = Pongo pygmaeus. Only the chimpanzee metaphase clearly demonstrates the inversion of probe pDJ-778A2 to XVp. All other hominoids show overlapping signals on XVq. Centromeres are indicated by the arrows.

Based on the appearance of dual FISH signals on PTR XVp and XVq, the general location of the pericentric inversion breakpoint was identified with probe RP11-40J8

(Figures 3-1 & 3-3). Using the November 2002 human genome assembly, probes adjacent to and overlapping RP11-40J8 were used in further FISH experiments to characterize and more precisely define the breakpoint interval (Figure 3-3). These clones (RP11-932O9, RP11-605N15, RP11-422F16, RP11-382B18 and RP11-736I24) were chosen because each is represented as finished or at least working draft sequence within the human genome assembly. Overall, near-complete sequence coverage in the form of working draft and finished BAC clones is available for this region of 15q13, although there are several discontiguous placements of working draft sequences, termed “warping”, within the assembly around RP11-40J8 and the overlapping clone RP11-382B18 (e.g. Figure 3-4, RP11-422F16 and RP11-605N15). Warping is a fragmented arrangement in which the working draft sequence contigs of an accession span a distance exceeding the total sequence length within that accession. We believe the warping observed at this site is likely due to the presence of segmental duplications, discussed below. In this second round of FISH experiments all but one of these clones, RP11-736I24, displayed the dual signal on PTR XVp and XVq (Figure 3-3).
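The warping criterion described above can be expressed as a simple check. The following is a minimal sketch, assuming each working draft contig of an accession is given as a hypothetical (start, end) pair of assembly coordinates; the function name, input format and tolerance parameter are illustrative and are not part of the original analysis.

```python
def is_warped(placements, tolerance=0):
    """Return True if the contigs of one accession look "warped".

    placements: list of (start, end) assembly coordinates, one tuple per
    working draft contig of a single accession (hypothetical input format).
    tolerance: extra span, in bp, allowed before warping is called.
    """
    if not placements:
        return False
    total_length = sum(end - start for start, end in placements)
    span = max(end for _, end in placements) - min(start for start, _ in placements)
    # Warping: the contigs are spread over more assembly sequence than
    # the accession itself contains.
    return span > total_length + tolerance
```

A non-zero tolerance may be useful in practice, since small gaps between draft contigs would otherwise be flagged as well.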

Figure 3-3. FISH analysis of breakpoint spanning clones. Shown are FISH hybridization results for probes spanning the chimpanzee inversion breakpoint interval. HSA = Homo sapiens, PTR = Pan troglodytes. The DAPI image for each chromosome is shown above the hybridization image. Probes RP11-360J18 and RP11-736I24 flank the breakpoint interval on the PTR XVp and XVq side, respectively, and map uniquely. All clones tested which map between RP11-360J18 and RP11-736I24 demonstrated dual FISH signals on PTR XVp and XVq. The clones are ordered from left to right according to their starting position within the November 2002 human genome assembly (see Figure 3-4).

We then defined the inversion breakpoint interval as the region between the most distal probe on HSA 15q to display a PTR XVp signal, RP11-360J18, and the most proximal probe on HSA 15q to display a PTR XVq signal, RP11-736I24 (Figure 3-3).

This interval encompasses approximately 1 Mb of genomic sequence, including all clones which produced dual FISH signals in PTR hybridizations. Interestingly, all of the clones which produced the dual FISH signals in PTR were confined to a 600 kb segment of this interval. These results imply that either the pericentric inversion involved the duplication of a large chromosomal segment, or that the inversion involved a sequence common to all clones in the 600 kb interval.
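The interval definition used here can be sketched in code. This is an illustrative sketch only: the signal encoding ("XVp", "XVq", "dual") and the function name are assumptions, and the probe order is taken from the listing in this chapter rather than from assembly coordinates.

```python
def breakpoint_interval(probes):
    """Locate the inversion breakpoint interval from FISH results.

    probes: list of (name, signal) tuples ordered from proximal to distal
    on HSA 15q, where signal is "XVp", "XVq" or "dual" (assumed encoding).
    Returns (flank_p, flank_q, dual_probes).
    """
    xvp_only = [i for i, (_, signal) in enumerate(probes) if signal == "XVp"]
    if not xvp_only:
        raise ValueError("no probe with an XVp-only signal")
    left = max(xvp_only)  # most distal probe showing only a PTR XVp signal
    xvq_only = [i for i, (_, signal) in enumerate(probes)
                if signal == "XVq" and i > left]
    if not xvq_only:
        raise ValueError("no XVq-only probe distal to the XVp flank")
    right = min(xvq_only)  # most proximal probe showing only a PTR XVq signal
    dual = [name for name, signal in probes[left + 1:right] if signal == "dual"]
    return probes[left][0], probes[right][0], dual


# Probes reported in the text; their relative order is assumed from the
# listing in this chapter.
FISH_RESULTS = [
    ("RP11-360J18", "XVp"),
    ("RP11-932O9", "dual"),
    ("RP11-605N15", "dual"),
    ("RP11-422F16", "dual"),
    ("RP11-40J8", "dual"),
    ("RP11-382B18", "dual"),
    ("RP11-736I24", "XVq"),
]
```

Calling `breakpoint_interval(FISH_RESULTS)` returns the two flanking probes together with the dual-signal clones that lie between them.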

We extracted the 1 Mb interval from the November 2002 human genome

assembly and performed a sequence similarity search against the entire human genome

(build 31 (Figure 3-4) and the NT/HTGS databases (data not shown)) using

MEGABLAST. The output was parsed and displayed in the graphical program

PARASIGHT (Jeffrey Bailey, in preparation), which allows for the visual identification

of duplicated sequences (Figure 3-4). Interestingly, the segmental duplications identified

in this analysis were confined to the 600 kb covered by the clones which produced the

dual FISH signals in PTR. We identified two known duplications from this analysis,

which is in agreement with previous segmental duplication analyses available from the

UCSC Genome Browser (Bailey et al. 2001; Kent et al. 2002). The first is derived from a duplication of the CHRNA7 gene, characterized in detail by Riley et al. (2002). This sequence was duplicated from the intact CHRNA7 locus in 15q13.3 and is highly homologous (>99% similar) to the ancestral locus. The CHRNA7 duplication involves at least 125 kb of material (indicated in Figure 3-4); however, because the assembly of this region is currently fragmented, that number could increase with future versions of the human genome assembly. The second duplicon recognized within the 600 kb sub-interval was originally described as the Golgin-Linked-to-PML (promyelocytic leukemia) duplicon, or GLP duplicon (Gilles et al. 2000), and subsequently described as LCR15 (Gratacos et al. 2001; Pujana et al. 2001; Pujana et al. 2002). The ~30 kb GLP/LCR15 duplicon is present in approximately 8 copies throughout the 600 kb duplication-rich sub-interval (Figure 3-4).

Figure 3-4. Segmental duplications in the inversion breakpoint interval. The human genomic sequence encompassing the most distal PTR XVp probe (RP11-360J18) and the most proximal PTR XVq probe (RP11-736I24) was extracted from the genome assembly (Nov. 2002) and sequence similarity searches were performed against the entire human genome (see Methods). The output was displayed using the program PARASIGHT (Jeffrey Bailey, unpublished), which shows pair-wise alignments as colored horizontal boxes below the black line that represents the 1 Mb interval. The color of each horizontal box reflects the sequence similarity of the alignment: 100-99% is indicated in red, 99-97% in orange, 97-95% in yellow, 95-93% in green, 93-91% in blue and 91-89% in purple. * = additional putative duplicons; these may be artifacts of assembly and require further verification.
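The color scheme in the caption amounts to binning each alignment by percent identity. A minimal sketch of such a mapping follows; it is not PARASIGHT code, and because the caption's bins share endpoints, assigning a boundary value such as 99% to the higher bin is an assumption made here.

```python
# (lower bound in percent identity, color), from highest to lowest bin.
IDENTITY_BINS = [
    (99.0, "red"),
    (97.0, "orange"),
    (95.0, "yellow"),
    (93.0, "green"),
    (91.0, "blue"),
    (89.0, "purple"),
]


def identity_color(percent_identity):
    """Map an alignment's percent identity to the caption's color code.

    Returns None for alignments below the lowest bin (89%).
    """
    for lower_bound, color in IDENTITY_BINS:
        if percent_identity >= lower_bound:
            return color
    return None
```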

Although it had previously been postulated that the CHRNA7 duplication was human-specific based on sequence identity (Riley et al. 2002), the location of this duplicon within the 600 kb interval associated with the pericentric inversion in chimpanzee necessitated determining the copy number of CHRNA7 within the chimpanzee genome. First, chimpanzee BAC library hybridizations (8.7x coverage from RPCI-43 and CHORI-251 Segment 1 BAC library filters) were performed with a probe derived from CHRNA7-related sequence, yielding a total of 4 positives, which was consistent with a single-copy locus (data not shown). Second, comparative FISH analysis of RP11-30N16, a clone which contains a substantial portion of the ancestral 15q13.3 CHRNA7 locus, produced a single FISH signal on PTR XVq only (summarized in Figure 3-1). Finally, Southern blot analysis was performed using a probe designed to a PstI restriction fragment length polymorphism which distinguishes the donor and duplicate copies of human CHRNA7 (Figure 3-5). Upon hybridization of this probe to a panel of human, pygmy chimpanzee, common chimpanzee and gorilla genomic DNAs, two bands were observed in the human sample and a single band was noted in all great ape samples, providing further evidence that the CHRNA7 duplication is specific to humans and was not present at the time of the pericentric inversion event in the chimpanzee genome.
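One way to see why 4 positive clones is consistent with a single-copy locus is to compare the observed count with what each copy number would predict. The sketch below assumes the number of positive clones is Poisson-distributed with a mean equal to library coverage multiplied by copy number; this model and the function names are illustrative assumptions, not the authors' stated method.

```python
import math


def poisson_pmf(k, mean):
    """Probability of observing exactly k events for a Poisson mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def copy_number_likelihoods(positives, coverage, max_copies=3):
    """Likelihood of the observed number of positive clones for each
    candidate copy number, assuming mean = coverage * copy number."""
    return {
        copies: poisson_pmf(positives, coverage * copies)
        for copies in range(1, max_copies + 1)
    }


# Values from the text: 4 positive clones from libraries at 8.7x coverage.
likelihoods = copy_number_likelihoods(positives=4, coverage=8.7)
best_copy_number = max(likelihoods, key=likelihoods.get)
```

Under this model a single-copy locus predicts about 8.7 positives and a two-copy locus about 17.4, so the observed count falls below both expectations but lies far closer to the single-copy one.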

Figure 3-5. Southern analysis of the CHRNA7 duplication. Probe CHRNA7-PstI was used to demonstrate a 104 bp RFLP distinguishing the intact (548 bp band) and duplicated (444 bp band) CHRNA7 loci in the human genome (Lane 1). HSA = Homo sapiens, PTR = Pan troglodytes, PPA = Pan paniscus, GGO = Gorilla gorilla. A single band, corresponding to the intact CHRNA7 locus, is seen in PTR, PPA and GGO (Lanes 2 and 3, 4 and 5, 6 and 7, respectively), indicating that the duplication of CHRNA7 was a recent human-specific event and that CHRNA7 is therefore single copy in the chimpanzee genome.

DISCUSSION

Using a panel of FISH probes spanning several known sites of rearrangement

within 15q11-q13 we were able to narrow the region containing the PTR pericentric inversion breakpoint to a complex cluster of segmental duplications approximately 600 kb in length, according to the most recent assembly of the human genome. Comparative analysis using material from other great apes including gorilla and orangutan indicated the inversion event was restricted to the chimpanzee lineage. It should be noted that the

inversion, although primarily characterized in the common chimpanzee (Pan troglodytes), was also observed in the pygmy chimpanzee (Pan paniscus), thus dating the inversion event more precisely to an interval of 2-5 Mya (data not shown). Although human chromosome 15q11-q13 is noted for its exceptional instability associated with human disease and chromosomal rearrangement, it is noteworthy that the orthologous site of rearrangement in the human genome did not correspond to one of the three major previously described common disease rearrangement breakpoints (Christian et al. 1999;

Ji et al. 2000a; Ji et al. 2000b; Pujana et al. 2002). Similarly, the sites marked by probe RP11-88O16, distal of the previously characterized inv dup(15) breakpoint, and by probes RP11-456J20 and RP11-758N13, which flank an additional inv dup(15) rearrangement breakpoint (S. Schwartz, unpublished data), also appeared to play no role in this evolutionary breakpoint between man and chimpanzee (Wandstrat et al. 1998; Wandstrat and Schwartz 2000).

Our FISH results using probe RP11-40J8, which demonstrated dual signals on

PTR XVp and XVq, indicated we had reached the site of the pericentric inversion, or alternatively a sequence within RP11-40J8 had been duplicated to both arms of PTR XV.

Consequently, we performed comparative FISH in human and chimpanzee using multiple

BAC clones that flanked RP11-40J8 in the November 2002 build of the human genome

(Figure 3-3 & 3-4). Clone RP11-360J18 was the most distal clone, based upon its position in the human assembly, to show a FISH signal on a single PTR chromosome arm

(XVp). Conversely, probe RP11-736I24 hybridized to HSA and PTR XVq, indicating the pericentric inversion breakpoint likely lies proximal of this position in the human assembly. All probes tested between RP11-360J18 and RP11-736I24 produced signals on PTR XVp and PTR XVq near the centromere. In addition, several probes which

yielded dual signals flanking the PTR centromere also produced a signal in the region

orthologous to HSA 15q22-24 (Figure 3-3).

We next investigated the nature of the sequence underlying the pericentric

inversion breakpoint by extracting the 1 Mb interval encompassing the sequence between

the most distal PTR XVp probe, RP11-360J18, and the most proximal PTR XVq probe,

RP11-736I24, from the November 2002 human genome assembly. This interval includes

~600 kb of sequence spanning FISH probes RP11-932O9, RP11-605N15, RP11-422F16,

RP11-40J8 and RP11-382B18, all of which produced dual FISH signals flanking the

centromere in PTR XVq and XVp (Figure 3-3 & 3-4). Sequence similarity searches of

the 1 Mb interval against both the November 2002 assembly and the NT and HTGS

databases were in agreement with previous global analyses of the human genome, which

indicated multiple segmental duplications were present in the interval (Figure 3-4)

(Bailey et al. 2001; Bailey et al. 2002a).

We were able to discern two major duplicons from this analysis: one segment derived from the CHRNA7 locus in 15q13.3, previously described by Riley et al. (2002), and several copies of a duplicon originally described as Golgin-linked-to-PML (GLP) (Gilles et al. 2000) and also described as LCR15 (Gratacos et al. 2001; Pujana et al. 2001; Pujana et al. 2002). A third, previously undetected, potential segmental duplication was identified in the proximal third of the 600 kb region (Figure 3-4, indicated by the asterisk); however, considering the assembly of the region, which involves significant warping or artifactual fragmentation of accessions AC110601 (RP11-422F16) and AC026150 (RP11-605N15), and the high sequence identity of the duplications, further validation is warranted.

Riley et al. (2002) theorized that the duplication of the CHRNA7 locus is a human-specific event because of the high sequence identity of the duplicated segment to the donor segment in 15q13.3 and because the CHRNA7 duplication appeared polymorphic in the human population. This hypothesis deserved further consideration in this study for two reasons. First, the site of duplication of the CHRNA7 locus in the human genome may correspond to the pericentric inversion breakpoint. Second, the high sequence homology noted between the donor and duplicated CHRNA7 loci does not by itself establish the human-specificity of the CHRNA7 duplication, as gene conversion may also maintain high sequence homology between duplicated segments (Eichler et al. 2001; Stankiewicz and Lupski 2002). Thus, we tested the chimpanzee genome for the presence of the CHRNA7 duplication by genomic Southern analysis, which revealed that this duplication is most likely human-specific and therefore did not contribute to the pericentric inversion in chimpanzee. Our BAC library hybridization results also support this assertion. In addition, FISH results with probe RP11-30N16, which contains the intact donor CHRNA7 locus, did not produce dual FISH signals on PTR XVp and XVq, further demonstrating this sequence is not involved in the evolutionary rearrangement.

Finally, the association of the dual FISH signal with clones that did not contain the

CHRNA7 duplication (RP11-422F16) indicates the split signal is caused by a duplicated sequence within the 600 kb interval other than the CHRNA7-related sequence. Barring other duplication events that may have been lost through subsequent deletion, we estimate that the breakpoint region in the ancestral genome was significantly smaller

(~400 kb or potentially smaller).

Interestingly, the extent of the segmental duplications in the 1 Mb region examined was restricted to the clones which showed the dual FISH signal in PTR XVp and XVq, and each clone which produced dual FISH signals contained a GLP/LCR15 duplicon by sequence analysis (Figure 3-4). With our results indicating the CHRNA7 sequence is single-copy in the chimpanzee genome and thus unlikely to be involved in the inversion, the GLP/LCR15 duplicon is the best candidate sequence for being present at the site of the pericentric inversion. Unfortunately, the highly duplicated nature of the region prevents precise narrowing of the breakpoint to a specific sequence; however, we believe large-scale sequence analysis of this region between human and chimpanzee will be able to reveal the precise nature of the breakpoint sequence.

Pericentric inversions are the most common rearrangement differentiating humans and the great ape species at the karyotypic level (Yunis et al. 1980; Yunis and Prakash 1982).

In addition, it has been proposed that multiple inversions within a single chromosome could potentially lead to speciation (King 1993). The mechanism underlying their origin, however, remains poorly understood. It has been shown in multiple studies that regions of segmental duplication can be associated with evolutionary rearrangements (Nickerson et al. 1999; Stankiewicz et al. 2001). The pericentric inversion of chimpanzee XVp, having been localized to a site of multiple segmental duplications in the human genome, likely involves such sequences. Pujana et al. (2002) found that the GLP/LCR15 duplicon is present at multiple regions of human genomic instability. In addition, Gilles et al. (2000) showed the GLP/LCR15 duplicon to be polymorphic in the human population, indicating that these sequences have considerable intra-specific plasticity.

The GLP/LCR15 duplication family has expanded over the last 20 My of primate evolution, prior to the divergence of the great ape lineages, resulting in a multitude of copies (estimated at 27) present at multiple sites along chromosome 15, providing the conditions required for non-allelic homologous recombination (NAHR) driven rearrangements (Gilles et al. 2000; Pujana et al. 2001; Pujana et al. 2002; Stankiewicz and Lupski 2002). We have begun an investigation of the organization of this region within the chimpanzee lineage. Both genomic library characterization and partial sequence characterization (AC119799) indicate the presence of both LCR15 and HERC2 duplicons in this region. No evidence of CHRNA7 was found either by sequence similarity searches or through STS-content hybridization mapping (data not shown).

Large-scale sequence analysis of this region between human and chimpanzee will be necessary to refine the exact position of the pericentric inversion. For NAHR to have played a role, however, one must presume the presence of a GLP/LCR15-related sequence on the ancestral 15p. Alternatively, this site, which contains a recent human-specific segmental duplication of the CHRNA7 locus, may be a preferential site of breakage, which allowed the pericentric inversion to occur. The association of two independent chromosomal rearrangement events at one site, a pericentric inversion in the chimpanzee genome and the insertion of a novel segmental duplication within the human lineage, may be a consequence of the overall genomic instability of this region.

MATERIALS AND METHODS

FISH Probe Selection

We selected probes to refine the chimpanzee XVq11-q13 pericentric inversion breakpoint based upon YAC maps of human 15q11-q13, in addition to the genome assemblies produced by the Human Genome Project (Christian et al. 1998; IHGSC 2001).

We chose BACs targeted for complete sequencing (Figure 3-1) so that the underlying sequence of each probe would be available for further analysis pending comparative fluorescence in situ hybridization (FISH) results. Particular attention was paid to the position of each probe with respect to the three large low-copy repeat clusters found within 15q11-q13, as these clusters contain both inter- and intrachromosomal segmental duplications and correspond to the common deletion breakpoints of Prader-Willi and

Angelman syndrome deletions (Amos-Landgraf et al. 1999; Ji et al. 2000b; Pujana et al.

2002). The duplication track of the UCSC Genome Browser, August 2001 assembly, displays information obtained from a global analysis of segmental duplications in the human genome, and allows for the selection of informative unique clones (Bailey et al.

2001; Kent et al. 2002). In this manner, we selected probes flanking the human 15q11-q13 rearrangement breakpoints associated with common PWS/AS deletions. To mark the regions of low-copy repeats in human 15q11-q13, we utilized a probe derived from the donor locus of a major component of the PWS/AS breakpoint duplication clusters: PAC clone pDJ-778A2, which contains a significant portion of the HERC2 gene in 15q13 (Ji et al. 1999). In addition, we selected probes corresponding to sites distal of the PWS/AS domain flanking a large inverted-duplicated 15 (inv dup(15)) supernumerary marker chromosome (Wandstrat et al. 1998). Specifically, probe RPCI-11-88O16 (abbreviated RP11-88O16) was selected for FISH analysis because it lies distal to the inversion-duplication event breakpoint and is linked to STS marker D15S1010 (Wandstrat et al. 1998; Wandstrat and Schwartz 2000). Upon reaching the PTR inversion breakpoint interval with probe RP11-40J8, flanking BACs were chosen from the November 2002 human genome assembly to establish the extent and boundaries of the interval. These additional hybridizations were performed on human and chimpanzee material only.

Comparative FISH

Previous reports indicated the inversion of human 15q11-q13 in the chimpanzee genome occurred solely in the chimpanzee lineage, while humans share the ancestral state with the gorilla (Gorilla gorilla, abbreviated as GGO) and orangutan (Pongo pygmaeus, abbreviated as PPY). To confirm these observations, metaphase chromosome preparations were generated from lymphoblast-derived cell lines for human, common chimpanzee, pygmy chimpanzee, gorilla and orangutan. BAC and PAC DNAs, isolated from bacterial cultures (Nucleobond, Clontech), were labeled by nick-translation with biotin and digoxigenin. FISH experiments were performed under standard conditions (Lichter et al. 1990). For each FISH experiment at least 20 independent metaphase chromosome preparations were examined.

Duplication Analysis

The ~1 Mb (1,008,705 bp) interval encompassing the entire pericentric inversion breakpoint interval and the adjacent unique FISH probes (those present on only one chromosome arm in PTR) was extracted from the November 2002 human genome assembly and compared to the entire genome using MEGABLAST. Prior to the sequence similarity searches, common repeats were identified using REPEATMASKER and masked as lower-case letters (Smit and Green 1999). For efficient MEGABLAST analysis, the human genome sequence was fragmented into 400 kb pieces; each chromosome was fragmented separately and its fragments were labeled consecutively from p-telomere to q-telomere. Thus, the first 400 kb segment of human chromosome 15, at the p-telomere, is labeled chr15_001, and so on. MEGABLAST parameters were set to take advantage of the lower-case masking, which allows alignment extension (but not alignment seeding) through lower-case masked sequences. The MEGABLAST output was parsed into tab-delimited text and displayed using the graphical alignment display program PARASIGHT (Jeffrey Bailey, in preparation). The results were then compared to previous duplication analyses available through the UCSC Genome Browser (http://www.genome.ucsc.edu) (Bailey et al. 2001; Kent et al. 2002).
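The fragmentation and labeling scheme described above can be sketched as follows. This is an illustrative sketch rather than the script used in the study: the function name is hypothetical, and it assumes the input sequence begins at the p-telomere and that lower-case (masked) bases should be preserved unchanged.

```python
def fragment_chromosome(name, sequence, size=400_000):
    """Split one chromosome into consecutive fixed-size pieces.

    Pieces are numbered from the start of the sequence, assumed here to be
    the p-telomere end, so the first piece of chr15 is named chr15_001.
    Case is left untouched so lower-case repeat masking survives.
    Returns a list of (label, start_offset, piece_sequence) tuples.
    """
    pieces = []
    for index, start in enumerate(range(0, len(sequence), size), start=1):
        label = f"{name}_{index:03d}"
        pieces.append((label, start, sequence[start:start + size]))
    return pieces
```

Keeping the start offset with each piece allows alignment coordinates reported against a fragment to be converted back to chromosome coordinates by simple addition.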

Southern Analysis

To determine the evolutionary age of the CHRNA7 duplication with respect to the human/chimpanzee divergence, we performed a genomic Southern blot using DNA extracted with the PureGene DNA isolation kit (Gentra Systems). DNA was obtained from cell lines and/or blood from a human (Sample ID 02-0056), two unrelated common chimpanzees (Sample IDs: NA03450B, NA03448A), two unrelated pygmy chimpanzees

(Pan paniscus, PPA; Sample IDs: BB501, LB502), and two unrelated gorillas (Sample

IDs: 9521, 9247). Probe CHRNA7-PstI was designed by identifying a ~100 bp restriction

fragment length polymorphism (RFLP) between the CHRNA7-related sequence within

accessions AC010799 (RP11-40J8) and AC111169 (RP11-20D7), using the web-based

restriction mapping program WEBCUTTER 2.0

(http://www.firstmarket.com/cutter/cut2.html, authored by Maxwell Heiman). Once a

polymorphism distinguishing the two copies of CHRNA7 in Homo sapiens had been

identified, probe CHRNA7-PstI was PCR amplified from BAC RP11-40J8 DNA using

primers CHRNA7.probe2.1 (Forward, 5’-TGAAACCTTGGGTGAGTTGG-3’) and

CHRNA7.probe2.2 (Reverse, 5’-CAGATGAGACTGGGAAAGGC-3’) at an annealing temperature of 55 degrees Celsius for 35 cycles. For Southern analysis, all DNAs were digested with PstI, blotted, and the resulting membranes were hybridized and washed according to standard protocols.
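The restriction mapping step can be illustrated with a small in-silico digest. The sketch below does not reproduce WEBCUTTER; it assumes a linear sequence and uses the PstI recognition sequence CTGCAG, which is cut after the A (CTGCA^G). The function name is hypothetical, and any sequence passed to it in an example is a placeholder, not CHRNA7 sequence.

```python
PSTI_SITE = "CTGCAG"
PSTI_CUT_OFFSET = 5  # PstI cuts after the A: CTGCA^G


def digest(sequence, site=PSTI_SITE, cut_offset=PSTI_CUT_OFFSET):
    """Return fragment lengths from digesting a linear sequence.

    The PstI site is palindromic, so searching one strand finds all sites.
    """
    sequence = sequence.upper()
    cuts = []
    position = sequence.find(site)
    while position != -1:
        cuts.append(position + cut_offset)
        position = sequence.find(site, position + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [end - start for start, end in zip(bounds, bounds[1:])]
```

Comparing the fragment lists produced for two paralogous sequences would show whether a site gain, site loss, or an insertion or deletion between sites yields fragments of different length, which is the kind of difference an RFLP probe is designed to detect.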

Filter hybridizations of the RPCI-43 and CHORI-251 Segment 1 (www.chori.org/bacpac) libraries were performed according to previously established methods (BACPAC_CHORI_link; Horvath et al. 2000a). The PCR-amplified probe 40J8.Site2, amplified using primers 40J8.Site2.1 (Forward, 5’-GTACTTTACAAGCAGGCGGC-3’) and 40J8.Site2.2 (Reverse, 5’-CAGATGAGACTGGGAAAGGC-3’) at an annealing temperature of 55 degrees Celsius for 35 cycles, was designed to the CHRNA7-related portion of BAC RP11-40J8 (accession AC010799).

ACKNOWLEDGEMENTS

We thank Juliann E. Horvath for critical reading of the manuscript and Sean D.

McGrath for technical assistance. This work was supported by grants NIH GM58815,

NIH HG002385 and DOE ER62862 to EEE, and by the financial support of Telethon, CEGBA (Centro di Eccellenza Geni in campo Biosanitario e Agroalimentare), and MIUR (Ministero Italiano della Istruzione e della Ricerca) to NA.

Chapter 4

Molecular evolution of the human chromosome 15

pericentromeric region

Devin P. Locke1, Zhaoshi Jiang1, Lisa M. Pertz1, Doriana Misceo2, Nicoletta Archidiacono2 and Evan E. Eichler1

1Department of Genetics, Center for Computational Genomics and Center for Human Genetics, Case Western Reserve University School of Medicine and University Hospitals of Cleveland, 10900 Euclid Avenue, Cleveland, OH 44106

2Dipartimento di Anatomia Patologica e di Genetica, Sezione di Genetica, University of Bari, Bari 70126, Italy

Note: This manuscript is in press for publication by Cytogenetics and Genome Research: Locke, D.P., Jiang, Z., Pertz, L.M., Misceo, D., Archidiacono, N. and E.E. Eichler. Molecular evolution of the human chromosome 15 pericentromeric region. Cytogenet Genome Res. Z.J. provided computational programming and phylogenetic analysis assistance. L.M.P. provided assistance in contig construction. D.M. and N.A. provided comparative fluorescence in situ hybridization data.

ABSTRACT

We present a detailed molecular evolutionary analysis of 1.2 Mb from the

pericentromeric region of human 15q11. Sequence analysis indicates the region has been

subject to extensive interchromosomal and intrachromosomal duplications during primate

evolution. Comparative FISH analyses among non-human primates show remarkable

quantitative and qualitative differences in the organization and duplication history of this region, including lineage-specific deletions and duplication expansions. Phylogenetic and comparative analyses reveal that the region is composed of at least 24 distinct segmental duplications or duplicons that have populated the pericentromeric regions of the human genome over the last 40 million years of human evolution. The value of combining both cytogenetic and experimental data in understanding the complex forces which have shaped these regions is discussed.

INTRODUCTION

Segmental duplications are duplicated blocks of genomic DNA, often containing high-copy repetitive elements as well as intron-exon structure (IHGSC 2001). Recent studies of the extent of segmental duplication in the human genome estimate that approximately 5% of the entire genome consists of duplications (Bailey et al. 2001; Bailey et al. 2002a). Pericentromeric regions, in particular, have been shown to be enriched in such sequences (Bailey et al. 2001). In fact, the pericentromeric region of more than half of all human chromosomes is composed of blocks of segmental duplications extending between the centromeric satellite sequence and unique sequence (Bailey et al. 2001; IHGSC 2001).

A two-step model has been proposed for the evolution of the complex structure of duplications within these regions, involving an initial seeding of material into a pericentromeric region and subsequent swapping of that duplicated material, or larger composite blocks of duplications, between chromosomes (Horvath et al. 2000b). However, only five such pericentromeric regions have been thoroughly characterized both at the structural level within the human genome and from an evolutionary perspective (Guy et al. 2000; Horvath et al. 2000a; Horvath et al. 2000b; Footz et al. 2001; Brun et al. 2003). Discerning the progression and pattern of duplication events that have generated the mosaic of segmental duplications in the pericentromeric regions of many human chromosomes requires a comparison to the genomes of closely related primate species. Several studies have investigated the evolutionary history of individual duplicated segments (Arnold et al. 1995; Eichler et al. 1996; Eichler et al. 1997; Regnier et al. 1997; Zimonjic et al. 1997; Orti et al. 1998; Horvath et al. 2000b; Golfier et al. 2003; Horvath et al. 2003). Few studies to date, however, have investigated the evolutionary history of a large pericentromeric region encompassing several duplicated segments.

Several segmental duplications have been mapped to the pericentromeric region of chromosome 15, including immunoglobulin heavy chain V and D segment, gamma-aminobutyric acid receptor subunit alpha 5 (GABRA5), neurofibromatosis 1 (NF1), B-cell CLL/lymphoma 8 (BCL8) and KIAA0187 derived duplications (Tomlinson et al. 1994; Kehrer-Sawatzki et al. 1997; Regnier et al. 1997; Barber et al. 1998; Ritchie et al. 1998; Crosier et al. 2002; Dyomin et al. 2002; Fantes et al. 2002). Several of these duplications have been characterized as a polymorphic “cassette” of approximately 1 Mb in size which varies in copy number in the human population (Barber et al. 1998; Ritchie et al. 1998; Fantes et al. 2002). The underlying nature of this polymorphic region is not well understood, however, because a complete sequence map of 15q11 has not yet been resolved by the Human Genome Project. The assembly and analysis of highly duplicated pericentromeric regions has required the development of specialized strategies that use stringent standards of nucleotide identity to determine paralogous versus homologous overlap (Horvath et al. 2000a). In addition, the development of bioinformatic tools and methods has been essential to resolving overlaps.

In this study we applied both primate cytogenetic and phylogenetic approaches to explore the evolutionary history of the pericentromeric region of human chromosome 15. We first sought to construct a contig of BAC sequences from the pericentromeric region of 15q11 using a strict threshold of sequence identity as evidence of allelic overlap. Subsequent sequence analysis of the contig identified a tiling path of clones for comparative FISH in human, common chimpanzee, gorilla, orangutan, macaque and baboon. Comparison of the FISH signals obtained in human hybridizations with sequence analysis of the human genome assembly has facilitated the identification of regions potentially absent from the human genome sequence. In addition, we were able to identify a multitude of individual duplicons within the contig and perform phylogenetic comparisons to all paralogous sequences within the human genome assembly. The cytogenetic and phylogenetic evidence suggests that a substantial portion of the mosaic structure observed within 15q11 emerged from a burst of primate segmental duplication which occurred shortly after the divergence of the African and Asian great ape lineages.

RESULTS AND DISCUSSION

In Silico Analysis of the 15q11 Pericentromeric Region

The presence of highly duplicated sequences has been problematic for the sequencing and assembly of the human genome due to either an under-representation of paralogous sequences within genome databases or due to mis-assembly of duplicated sequence (Bailey et al. 2001; Bailey et al. 2002a). Consequently, such regions require additional scrutiny. To avoid such potential pitfalls, we independently assembled the pericentromeric region of 15q11 (Figure 4-1).


Figure 4-1. Organization of human 15q11. In panel A, the tiling path of BAC clones is shown, terminating proximally in monomeric alpha satellite sequence and extending distally from left to right. Tick marks along the black line are placed at 50 kb intervals. Clones shown in red have been completely sequenced or are in working draft status. The single clone shown in grey (RP11-1115P6) is depicted according to STS hybridization and BAC end sequence placement. Above the black line, the complex mosaic of duplicons is depicted with color coding according to the chromosome of the ancestral segment prior to duplication. The identity of pseudogenes associated with these duplicons, if known, is indicated. In panel B, the interchromosomal duplication pattern within 15q11 is depicted. The diagram shows the complexity of interchromosomal duplications (chromosome connecting lines), which represent alignments >20 kb in length and >90% identity (based on analysis of build34). Centromeres are indicated as purple boxes. Most of the duplications occur between pericentromeric regions or ancestral pericentromeric regions within the human genome.

The pericentromeric sequence assembly in 15q11 is composed of two sub-contigs: the most proximal contig consists of eight overlapping BACs and spans 865 kb, while the distal sequence contig consists of two BACs which span 265 kb (Figure 4-1). Our analysis of 15q11 generally conforms to the finished human genome sequence assembly (build34) with one important exception. STS hybridization experiments were performed to identify a clonal link, RP11-1115P6, which bridges the existing sequence gap between these two contigs (see Methods). Given the average insert size of the BAC library as 196 kb (RP11-Segment 5; www.chori.org/bacpac), and the extent of the BAC-end sequence overlap of the traversing clone, RP11-1115P6, we estimate the sequence gap to be approximately 60 kb. Combined, the 15q11 pericentromeric region is approximately 1.2 megabases in length, representing one of the longest contiguous clone assemblies within human pericentromeric DNA.
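The gap estimate above is simple arithmetic on the bridging clone. The following is a minimal sketch, assuming the gap equals the average insert size minus the portions of RP11-1115P6 anchored in each flanking contig. The two anchored lengths are hypothetical placeholders, since the text reports only the 196 kb insert size and the resulting ~60 kb gap; they were chosen so that the result matches the reported estimate.

```python
# Sketch of a gap estimate for a BAC that bridges two sequence contigs.
AVERAGE_INSERT_KB = 196   # reported average insert size of the BAC library
PROXIMAL_CONTIG_KB = 865  # reported span of the proximal sub-contig
DISTAL_CONTIG_KB = 265    # reported span of the distal sub-contig


def estimate_gap_kb(insert_kb, anchored_proximal_kb, anchored_distal_kb):
    """Return the insert length not accounted for by either flanking contig."""
    gap = insert_kb - anchored_proximal_kb - anchored_distal_kb
    if gap < 0:
        raise ValueError("anchored lengths exceed the insert size")
    return gap


# Hypothetical anchored lengths (NOT taken from the source).
gap_kb = estimate_gap_kb(AVERAGE_INSERT_KB, 80, 56)

# Total span of the region: both sub-contigs plus the estimated gap.
total_kb = PROXIMAL_CONTIG_KB + gap_kb + DISTAL_CONTIG_KB
print(gap_kb, total_kb)
```

With a 60 kb gap, the two sub-contigs sum to 1,190 kb, in line with the stated length of approximately 1.2 megabases.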

Based on this finished high-quality sequence, we analyzed the global segmental duplication content of 15q11 using previously described methods and graphical viewing software (see Methods). A total of 228 pairwise alignments with significant sequence similarity (>1 kb in length, >90% identity) to this portion of 15q11 were detected. The average similarity of these alignments was 94.7%, with an average length of 7.5 kb and a range of 1.0 kb to 77.3 kb. Figure 4-1b gives a simplified depiction of the interchromosomal duplication content (>90% sequence identity; >20 kb) for this 1.2 Mb portion of chromosome 15q11. A more detailed analysis of the underlying alignments is also presented (Figure 4-2).
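The thresholds described above amount to a filter over pairwise alignments followed by summary statistics. The sketch below illustrates that logic only; it is not the pipeline used in this study. The `Alignment` record, the function names and the toy records are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Alignment:
    target: str          # paralogous region, e.g. a chromosome name
    length_bp: int       # aligned length in base pairs
    identity_pct: float  # percent nucleotide identity


def filter_duplications(alignments, min_length_bp=1_000, min_identity_pct=90.0):
    """Keep alignments longer than min_length_bp and above min_identity_pct."""
    return [a for a in alignments
            if a.length_bp > min_length_bp and a.identity_pct > min_identity_pct]


def summarize(alignments):
    """Count, mean identity, mean length and length range of the alignments."""
    lengths = [a.length_bp for a in alignments]
    return {
        "count": len(alignments),
        "mean_identity_pct": mean(a.identity_pct for a in alignments),
        "mean_length_kb": mean(lengths) / 1_000,
        "range_kb": (min(lengths) / 1_000, max(lengths) / 1_000),
    }


# Hypothetical records for illustration only (NOT data from this study).
toy = [
    Alignment("chr2", 20_500, 91.9),
    Alignment("chr10", 39_000, 96.4),
    Alignment("chr1", 800, 97.0),    # rejected: too short
    Alignment("chr7", 5_000, 88.0),  # rejected: too divergent
]
kept = filter_duplications(toy)
# The stricter >20 kb view corresponds to the cutoff used for Figure 4-1b.
large_only = filter_duplications(toy, min_length_bp=20_000)
print(summarize(kept))
```

Raising `min_length_bp` to 20,000 reproduces the cutoff used for the simplified interchromosomal view, while the default 1 kb cutoff corresponds to the full alignment set.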

Similar to previous analyses of human pericentromeric regions (Jackson et al. 1999; Guy et al. 2000; Horvath et al. 2000a; Horvath et al. 2000b; Footz et al. 2001), in silico analysis confirms that the pericentromeric region of 15q11 is composed of a complex mosaic of small and large duplications. Most regions share homology to 3 or more distinct regions of the genome. Also, most interchromosomal sites with similarity to the 15q11 pericentromeric region occur within other pericentromeric regions of the human genome. Notable exceptions to this trend include sequence homology to the subtelomeric region of chromosome 14, a large block of homology to the Prader-Willi/Angelman breakpoint associated HERC2 and LCR15 duplications, and the euchromatic site within chromosome 2q21 which corresponds to an ancestral pericentromeric region. Specifically, comparison of the 15q11 contig with the ancestral pericentromeric region of chromosome 2q21 resulted in 10 alignments of 8.8 kb average length and 91.9% average identity, with alignments ranging from 2.4 kb to 20.5 kb.


Figure 4-2. Sequence similarity search results for the ~1.2 Mb pericentromeric contig. The 60 kb gap estimated between clones RP11-336L20 and RP11-32B5 is included for illustrative purposes. Above the black line, which represents the contig sequence, the output from REPEATMASKER is depicted using color-coded boxes. The output has been filtered to show only LTRs, simple and low-complexity repeats, LINEs, SINEs, and centromeric and telomeric satellite sequences; all other elements detected by REPEATMASKER have been removed for illustrative purposes. Note the block of centromeric satellite sequence at the proximal end (left) of the contig. The duplicons identified in Figure 4-1 have been added for reference purposes. Below the black line, the alignments produced by a sequence similarity search (using MEGABLAST) against the human genome assembly (build34) are shown, color coded according to the percent identity of the alignment. Each horizontal box therefore depicts an alignment. For global analysis of segmental duplications, the genome is fractioned into 400 kb pieces, and the identity of each 400 kb piece is indicated to the left of the alignments produced from that segment of the assembly; alignments to the respective chromosomal random bins are indicated by an “r”. Note the overall complexity of the paralogous relationships depicted by this analysis and the variation in the number of paralogous sites, which is indicative of wide variation in copy number of these duplicated segments. For simplicity, only alignments >1.5 kb are shown.

We also identified a block of 28 kb of alpha satellite within the most proximal portion of this sequence contig (RP11-79C23). The alpha satellite sequence shows neither evidence of higher-order repeats (unpublished DOTTER analysis) nor significant sequence similarity (>90%) to previously characterized higher-order repeat sequences for chromosome 15. Based on previously published alpha satellite/non-alpha satellite transition regions (Horvath et al. 2000a; Horvath et al. 2000b; Schueler et al. 2001), it is likely that this monomeric block of alpha satellite demarcates the boundary between higher-order and monomeric alpha satellite repeat sequences typical for such alpha satellite/non-alpha satellite transitions.

Genic content within duplicons was identified through comparison with the UCSC Genome Browser (www.genome.ucsc.edu), resulting in the following partial gene structures: immunoglobulin heavy chain gamma 3 (IGHG3), immunoglobulin lambda variable region (IgV Lambda), KIAA0187, rhophilin like protein (RHPN2), checkpoint homolog 2 (CHK2), hect domain and rcc1 domain protein 2 (HERC2), golgin-like protein (GLP), FLJ35866, neurobeachin (NBEA) (also known as B-cell CLL/lymphoma 8 (BCL8)), p21 activated kinase 2 (PAK2), myotubularin related protein 3 (MTMR3), gamma-aminobutyric acid receptor subunit alpha 5 (GABRA5) and neurofibromatosis 1 (NF1) genic segments. Not all duplications analyzed in this study, however, contained genic content; one example is the pericentromeric interspersed repeat, PIR4.

FISH Analysis of Human 15q11

FISH has served as a powerful tool to assess the quality of sequence assembly within human pericentromeric regions (Cheung et al. 2001; Horvath et al. 2001; Bailey et al. 2002b). Since most duplicated segments are large (>20 kb) and share considerable sequence identity (>95%), the consistent presence or absence of multi-site FISH signals has been used to suggest potential errors or gaps within the genome assembly. We therefore selected a tiling path of 11 clones, confirmed their identity by BAC end sequence analysis and assessed their multi-chromosomal distribution by FISH on human metaphase chromosomes (Table 4-1). From the cytogenetic perspective, these 11 clones produced a total of 71 distinct metaphase signals, for an average of ~6 chromosomal signals per probe. Interestingly, 12 of these signals appeared to have no underlying support from the most recent human genome assembly, for a discordance rate of 16.9% (Table 4-1). Although this rate appears relatively high, it should be pointed out that four of the discordant signals map to chromosome 1 and correspond to 4 contiguous BAC clones spanning ~400 kb of sequence. Chromosome 1 has the highest density of sequence gaps, with slightly less than a quarter of all remaining gaps mapping to this chromosome. This suggests that these four clones likely identify a single sequence gap (~400 kb in size) within the pericentromeric region of this chromosome. Similarly, the absence of in silico signals for 14q11 (RP11-32B5 and RP11-257E15) likely corresponds to a pericentromeric gap on this chromosome. If we consider these as single large gaps within the finished human genome, the discordance rate drops to 8.5%. In other words, 8.5% of human pericentromeric regions with sequence similarity to 15q11 are still not faithfully represented within the “finished” human genome and therefore likely represent gaps that remain to be sequenced and assembled.
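The rates quoted above can be checked with a few lines of arithmetic. The raw figure follows directly from the reported counts. The adjusted figure is reproduced here under an assumption that the text does not state explicitly: that the four chromosome 1 signals and two 14q11 signals (one per clone named) are set aside, leaving six unsupported signals out of the same 71. The function name is hypothetical.

```python
def rate_pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


TOTAL_SIGNALS = 71  # reported: distinct metaphase signals from 11 clones
CLONES = 11         # reported
UNSUPPORTED = 12    # reported: signals lacking assembly support

signals_per_probe = TOTAL_SIGNALS / CLONES
raw_discordance = rate_pct(UNSUPPORTED, TOTAL_SIGNALS)

# Assumption (not stated in the text): the chromosome 1 and 14q11 signals
# are attributed to two known gaps and removed from the discordant set.
CHR1_SIGNALS = 4    # reported
CHR14_SIGNALS = 2   # assumed: one per clone (RP11-32B5, RP11-257E15)
remaining = UNSUPPORTED - CHR1_SIGNALS - CHR14_SIGNALS
adjusted_discordance = rate_pct(remaining, TOTAL_SIGNALS)

print(round(signals_per_probe, 1), raw_discordance, adjusted_discordance)
```

Under that reading the adjusted rate matches the 8.5% quoted in the text; other ways of collapsing the gaps would give a different value.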

Table 4-1. FISH localization of human chromosome 15 pericentromeric BAC clones. [Table body not legible in this copy. For each of the 11 RP11 clones, the table lists the chromosomal locations of FISH signals in HSA, PTR, GGO, PPY, MMU and PHA, together with the sites supported by the human genome assembly (build34).]

From the in silico perspective of the genome assembly, the correlation is also good, but not absolute. In total, 80.3% of the sites considered as potentially duplicated by computational analyses were confirmed by FISH signals on human metaphase chromosomes (Table 4-1). Interestingly, a consistent lack of signal for chromosome 10 duplications was observed for 6 of the clones used in this study, for which there are two potential explanations. First, the sequence analysis may be detecting sequences that are duplicated below the threshold of FISH to produce clearly visible signals. Focusing on clone RP11-173D3, however, the sequence relationship between this clone and chromosome 10 extends approximately 39 kb with an average identity of 96.4%, which is typically sufficient to produce a FISH signal. Alternatively, the chromosome 10 region is highly polymorphic in the human population and the material used to assess the distribution of duplicated sequences by FISH lacks this sequence (i.e. a homozygous deletion). The presence of chromosome 10 signals (RP11-674M19, RP11-492D6 and RP11-509A17; shown in Figure 4-3, discussed below) within the chimpanzee genome indicates that this is likely a limitation of the analysis and that the prediction of a chromosome 10 sequence relationship by in silico analysis is indeed correct. These results highlight some of the general limitations to investigating highly duplicated regions of the genome using probes consisting of potentially highly duplicated pericentromeric sequences.

Comparative FISH Analysis of Non-human Primate Chromosomes

In order to provide some insight into the evolutionary dynamics of this region during human evolution, we compared the distribution of metaphase FISH signals between human and non-human primates (chimpanzee, gorilla, orangutan, macaque and baboon) for the same underlying human BAC clones (see above). A wide variety of multi-site patterns was observed among non-human primate chromosomes, demonstrating the extremely complex evolutionary history of the human 15q11 pericentromeric region. In general, both qualitative and quantitative differences in the distribution of FISH signals were noted. Also, as more distantly related primate species were examined, a general reduction in the number of signals was observed. While some of these differences may represent loss of signal due to sequence divergence, this observation is consistent with the phylogenetic analyses (see below), which clearly indicate an expansion during great ape evolution. Overall, only 5 of the 11 BAC clones used in this study yielded results for all primate species examined (Table 4-1). Generally, two sets of signals can be distinguished based on a compilation of the comparative FISH results: sites shared between closely related species, and putative lineage-specific duplication events. It should be noted that the presence of a FISH signal in one species, and not in its closest relatives in the panel, could also be the result of lineage-specific loss in the related species as opposed to a lineage-specific duplication. For example, hybridization with BAC RP11-509A17 produced a FISH signal in the region orthologous to human chromosome 15 in all species examined (Figure 4-3). Hybridization of this clone to PHA, MMU and PPY produced a single signal. In contrast, multiple signals were observed among all African ape lineages examined: GGO exhibited signals on chromosomes XIII, XV and XVI, while in PTR signals were observed on chromosomes IIp, X, XIV, XV, XVI and XXII. Thus, the lack of a chromosome XIII signal in PTR is indicative of a lineage-specific duplication in GGO, or alternatively the sequence was lost after the divergence of the gorilla and chimpanzee lineages. The hybridization of RP11-509A17 to human metaphase chromosomes, however, produced signals on chromosomes 1, 2, 9, 15, 16 and 22. We believe these results suggest extensive variability in the pericentromeric regions of primate chromosomes.
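The distinction between shared sites and putative lineage-specific sites can be expressed as set operations on the signal locations reported above for RP11-509A17. This is a minimal sketch: the dictionary layout and function names are hypothetical, and only the gorilla and chimpanzee signal sets quoted in the text are used.

```python
# FISH signal locations for RP11-509A17, as reported in the text.
SIGNALS = {
    "PTR": {"IIp", "X", "XIV", "XV", "XVI", "XXII"},
    "GGO": {"XIII", "XV", "XVI"},
}


def exclusive_to(species, other, signals=SIGNALS):
    """Sites seen in `species` but not in `other`.

    Such sites are candidates for lineage-specific duplication in `species`
    or, equally, for lineage-specific loss in `other`.
    """
    return signals[species] - signals[other]


def shared(first, second, signals=SIGNALS):
    """Sites seen in both species."""
    return signals[first] & signals[second]


gorilla_only = exclusive_to("GGO", "PTR")
chimp_only = exclusive_to("PTR", "GGO")
common = shared("GGO", "PTR")
print(sorted(gorilla_only), sorted(chimp_only), sorted(common))
```

The set difference alone cannot tell gain in one lineage from loss in the other, which is the caveat raised in the paragraph above.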


Figure 4-3. Human and non-human primate comparative FISH. An example of the comparative FISH results is shown for 15q11 pericentromeric human BAC clone RP11-509A17. HSA = Homo sapiens, PTR = Pan troglodytes, GGO = Gorilla gorilla, PPY = Pongo pygmaeus, MMU = Macaca mulatta, PHA = Papio hamadryas. Note the progressive expansion of the number of interchromosomal FISH signals in GGO, PTR and HSA. Interchromosomal sites of hybridization exclusive to one species, such as the chromosome XIII signal observed in GGO, are indicative of lineage-specific rearrangements. Signals labeled with an ‘m’ indicate non-chromosome 15 marker probes used for hybridization controls.

Phylogenetic Analysis

Large-scale genomic sequence analyses of primate DNA have shown that human and lemur non-coding DNA diverge by ~15% (Liu et al. 2003; Thomas et al. 2003). Assuming that the majority of the DNA within duplicated regions evolves in a neutral fashion, it is likely that the duplications we analyzed emerged specifically within the primate lineage. Experimental determination of the ancestral origin of the duplicated segments is a tedious process requiring comparative and phylogenetic analysis of each underlying duplicon (Crosier et al. 2002; Guy et al. 2003; Horvath et al. 2003). As a first step in understanding the evolutionary history, however, it is important to delineate the origin of the initial duplication events. Our analysis of such regions over the last seven years has revealed some important trends (Eichler et al. 1996; Eichler et al. 1997; Horvath et al. 2000a; Horvath et al. 2000b; Bailey et al. 2002b; Horvath et al. 2003). First, most of the ancestral loci originate from euchromatic regions and are not associated with pericentromeric DNA. Second, while the ancestral duplications may be contained within larger duplicated pairwise alignments (termed blocks), ancestral loci are usually more divergent and demarcate a smaller segment termed a minimal evolutionary shared segment (Bailey et al. 2002b).

We then applied a mouse-human sequence alignment methodology (see Methods) to predict the ancestral donor locus for 24 duplicons >5 kb within the contig. We were able to identify 19/24 putative donor loci employing this approach, a result consistent with a pilot study of 12 experimentally determined ancestral regions within 2p11, which showed generally good correspondence (9/12 regions were correctly identified). An example of this analysis is shown for BAC clone RP11-509A17 (Figure 4-4), for which the comparative FISH analyses are also shown (Figure 4-3). The top panel of Figure 4-4 illustrates the result of the ancestral locus prediction using the mouse-human alignment strategy. The prediction appears to be quite effective for low-copy duplications, such as the chromosome 2 and 17 duplicons, which by analysis of the human genome assembly, shown in the bottom panel of Figure 4-4, appear to be duplicated only once. Even for intrachromosomal duplications such as the chromosome 15 duplicon derived from the HERC2 locus in 15q13, the mouse-human alignment data correctly pinpoint the ancestral segment. Determining the ancestry of moderately duplicated segments, such as the chromosome 22 duplication noted in Figure 4-5, is also approachable with this technique.


Figure 4-4. Duplicon delineation. The diagram depicts the identification of the ancestral duplicated segment (duplicon) using two different methods for 15q11 accession AC026495 (BAC RP11-509A17). The results of the Mouse Net determination of the ancestral segment are depicted above the line. Putative ancestral loci are color coded according to the optimal placement within the mouse-human synteny map (Methods). Below the black line, the alignments produced by a sequence similarity search against the most recent human genome assembly (build34) are shown. Each alignment is represented by a horizontal box, corresponding to the coordinates of the alignment within the assembly. Global segmental duplication analyses are conducted by fractioning the genome into 400 kb segments, labeled consecutively from p-telomere to q-telomere. Thus the alignments are labeled on the left according to the 400 kb segment of the genome, and the alignments to the respective chromosomal random bins are labeled with “random”. For the purposes of illustration, only alignments >4 kb are depicted. The alignments with chromosomes other than those determined to be ancestral to a segment within AC026495 are shown in grey. Dotted lines connect the intervals in which ancestry was determined by Mouse Net analysis with the alignments produced by global sequence similarity searches. An asterisk indicates the alignments which agree with the ancestry prediction.


Using a molecular clock calibrated from human-chimp sequence alignments (see Methods), we then estimated the timing of the initial duplication event. Overall, the majority of duplication events appear older than would be suggested simply by the comparative FISH data. We attribute this to the limited effectiveness of FISH in detecting substantially divergent paralogs among species as well as the complex organization of the underlying duplicons contained within each BAC probe. The most proximal half of the 15q11 region consists of relatively younger duplication events which emerged, with one exception, during the separation of the great ape species, 8-15 million years ago (up to and including the 2p24.3 duplicon; Table 4-2). In contrast, most of the duplications located distally appear to be significantly older, with all but two occurring >15 million years ago. These data generally support previous suggestions of a gradient model with respect to the centromere (Guy et al. 2000; Horvath et al. 2001): younger evolutionary duplications occur near the centromere while more ancient ones accrue distally. It should be emphasized, however, that these data do not necessarily indicate precisely when the duplicons emerged on chromosome 15 but rather when the initial duplication occurred from an ancestral euchromatic region to a pericentromeric region. Several studies have shown that pericentromeric duplications occur in a step-wise fashion, with subsequent duplications of larger blocks spreading duplications among pericentromeric regions of the primate genome.
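The dating step can be sketched with the standard pairwise molecular-clock relation R = K/2T, rearranged to T = K/(2R), where K is the divergence between the ancestral and duplicated segments (substitutions per site) and R is the substitution rate per site per year. The calibration actually used in this study is described in the Methods. The rate and divergence values below are placeholders chosen for illustration, and the function names are hypothetical.

```python
def divergence_time_mya(k, rate_per_site_per_year):
    """Time since duplication in millions of years, from T = K / (2R)."""
    if rate_per_site_per_year <= 0:
        raise ValueError("rate must be positive")
    return k / (2 * rate_per_site_per_year) / 1e6


def standard_error_mya(k_std_err, rate_per_site_per_year):
    """Propagate the standard error of K to T, treating R as exact."""
    if rate_per_site_per_year <= 0:
        raise ValueError("rate must be positive")
    return k_std_err / (2 * rate_per_site_per_year) / 1e6


# Placeholder inputs (NOT the study's calibration or measured distances).
ASSUMED_RATE = 1.5e-9  # substitutions per site per year, illustrative only
t_mya = divergence_time_mya(0.06, ASSUMED_RATE)
se_mya = standard_error_mya(0.006, ASSUMED_RATE)
print(f"T = {t_mya:.1f} Mya (SE {se_mya:.1f})")
```

The factor of two reflects that substitutions accumulate independently on both copies after the duplication, so the two sequences together have had 2T years in which to diverge.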

Table 4-2. Estimated divergence time of 15q11 pericentromeric duplications. [Table body not legible in this copy. Columns: coordinates of the ancestral locus (build34), cytogenetic band, genic content, mouse-human synteny support, alignment length, K with standard error, and T (Mya) with standard error.]

The molecular evolutionary history of the underlying duplicons, especially the sequence relationships between the ancestral duplicon donor sequence and the chromosome 15 duplicon, becomes more apparent when phylogenetic trees are independently considered for each duplicon (Figure 4-5). As depicted for the chromosome 10p11.2 duplicon in Figure 4-5a, the topology of the phylogram indicates there has been a dispersal of this segment in recent evolutionary time to multiple human chromosomes. The data indicate that either 15q11 or the ancestral pericentromeric region of 2q21.2 was the target of duplicative transposition of the segment from 10p11.2 approximately 20 million years ago. Subsequent duplications of a larger segment (Figure 4-2) were responsible for the distribution of this segment within 21q11, 18p11 and 9q13. Similarly, the topology of the 8q11.2 duplicon suggests an equally complex model in which there were more ancient duplications (~30 million years ago) of the chromosome 8 segment to chromosomes 2 and 22, and a subsequent swapping of this segment to chromosomes 15, 18 and 21 (Figure 4-5b). Interestingly, the two segments depicted in Figure 4-5, which are approximately a megabase apart within the human 15q11 contig, show a similar interchromosomal distribution to chromosomes 2, 18 and 21. In addition to giving insight into the individual history of each duplicon, the results of the phylogenetic analysis provide further support for the identification of ancestral loci. Until large-scale comparative sequence for these regions is obtained from non-human primate species, such complex movements will remain untested hypotheses. Definitive phylogenies which can parsimoniously track the evolutionary history of the basic elements of the duplication mosaic will require directed comparative studies within these regions. Our analyses, however, clearly indicate that such regions represent a rich resource for understanding the natural pattern of primate genetic variation.


Figure 4-5. Phylogenetic analyses. Two examples of neighbor-joining phylogenetic trees of duplicated sequences within the 15q11 pericentromeric contig are presented. The band position of each duplicated segment is indicated at the branch termini. Bootstrap values are placed as near as possible to branch points. a) Analysis of the chromosome 10 duplicon adjacent to the monomeric alpha satellite sequence of the 15q11 pericentromeric contig indicates a progressive swapping of this segment to additional interchromosomal sites. The approximate time of the duplication to chromosome 15 is estimated at 21.7 Mya (Table 4-2). b) The phylogenetic analysis of the single chromosome 8 duplicon from the 15q11 contig indicates a longer evolutionary history of this segment swapping among multiple sites. The approximate time of the duplication to chromosome 15 is estimated at 32.5 Mya. For both phylograms, the putative ancestral locus is indicated in bold, and the sequence derived from the chromosome 15 contig is indicated in bold italics. Bootstrap values >90 are indicated.

MATERIAL AND METHODS

Validation of the 15q11 Assembly

Due to the duplicated nature and overall complexity of the region, we used three independent methods to verify the assembly of the 15q11 pericentromeric region. First, we used seed sequences that had been mapped to chromosome 15 by sequence comparison with monochromosomal hybrid sources, as described previously (Horvath et al. 2000a; Horvath et al. 2000b). Clones RP11-1360M22 (Horvath et al. 2003) and RP11-509A17 (unpublished data) were validated as chromosome 15 clones in this manner. Second, we reassembled the region using stringent standards for distinguishing allelic sequence overlap from paralogous overlap. Reiterative sequence similarity searches against the non-redundant nucleotide (NT) and high throughput genome sequence (HTGS) databases were performed. Overlaps were required to have a dove-tail configuration, >99.9% sequence identity across the alignment, and a length of at least 10 kb. During this process, seed sequences were filtered for high-copy repeats using the lower-case REPEATMASKER option (Smit and Green 1999), which allowed for extension through these regions using MEGABLAST (Zhang 2000). Third, we selected a tiling path of BAC clones for FISH analysis (described below) to validate the mapping of these clones to chromosome 15.
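The overlap criteria used in the second validation step can be expressed as a simple filter. The following Python sketch is illustrative only: it assumes that alignment coordinates have already been parsed from a similarity search, and the `Alignment` record, its field names and the `end_slack` tolerance are hypothetical and were not part of the original analysis. Only the identity and length thresholds are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """One pairwise alignment between two clone sequences (hypothetical record)."""
    q_start: int     # 0-based start of the aligned block on the query
    q_end: int       # exclusive end of the aligned block on the query
    q_len: int       # total length of the query sequence
    s_start: int     # 0-based start of the aligned block on the subject
    s_end: int       # exclusive end of the aligned block on the subject
    s_len: int       # total length of the subject sequence
    identity: float  # percent identity across the aligned block


MIN_IDENTITY = 99.9  # percent; overlaps must exceed this value (from the text)
MIN_LENGTH = 10_000  # bp; minimum overlap length (from the text)


def is_dovetail(a: Alignment, end_slack: int = 50) -> bool:
    """True if the alignment runs off the end of one sequence and onto the
    start of the other, within an assumed tolerance of end_slack bases."""
    q_suffix = a.q_len - a.q_end <= end_slack
    q_prefix = a.q_start <= end_slack
    s_suffix = a.s_len - a.s_end <= end_slack
    s_prefix = a.s_start <= end_slack
    return (q_suffix and s_prefix) or (q_prefix and s_suffix)


def is_allelic_overlap(a: Alignment, end_slack: int = 50) -> bool:
    """Apply the three criteria: dove-tail layout, >99.9% identity, >=10 kb."""
    overlap_len = min(a.q_end - a.q_start, a.s_end - a.s_start)
    return (
        a.identity > MIN_IDENTITY
        and overlap_len >= MIN_LENGTH
        and is_dovetail(a, end_slack)
    )
```

An alignment that is long and nearly identical but internal to both clones is rejected by `is_dovetail`, which is the property that separates a true clone overlap from a paralogous match in this sketch.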

STS Analysis

At the distal end of the 860 kb contig, an STS was developed and PCR amplified using a pair of oligonucleotides designed from GenBank accession AC023310, namely AC023310.3 (5’-GAAATTTATGGTCAATCTCCCC-3’) and AC023310.4 (5’-TATTGCCCAATAGGATGTCG-3’). The PCR product was subsequently radiolabeled and hybridized to RP11 BAC library filters as described (Eichler et al. 1997). The inserts of the resulting positive BACs were end sequenced as described below, and the end sequences were used as queries in similarity searches against the 860 kb contig, as well as the NT and HTGS sequence databases. From this analysis, a single clone, RP11-1115P6, was identified which linked the 860 kb proximal contig to a 265 kb contig of clones RP11-32B5 and RP11-275E15.

BAC End Sequence Analysis

To verify the identity of all clones used in the FISH analyses, all clones were subjected to end sequencing analysis. BAC DNA was extracted by column purification (Nucleobond, Clontech) from 250 mL LB/chloramphenicol bacterial cultures grown overnight and re-suspended in 100 µL H2O. After determining the DNA concentration by spectrophotometry, 2 µg of BAC DNA was subjected to automated dideoxy-terminator cycle sequencing using the ABI Big Dye terminator sequencing chemistry (Applied Biosystems) with vector primers T7 and SP6. Sequencing reaction products were purified using G-50 Sephadex purification columns and analyzed on an ABI 3100 DNA Analyzer (Applied Biosystems). BLAST sequence similarity searches against GenBank confirmed the identity of the clones. The clones which comprise the verified contig include RP11-79C23 (AC138701), RP11-1360M22 (AC127381), RP11-173D3 (AC087386), RP11-674M19 (AC142539), RP11-492D6 (AC126603), RP11-509A17 (AC026495), RP11-382A4 (AC138748), RP11-336L20 (AC023310), RP11-1115P6 (submitted for sequencing based on this analysis in collaboration with the Whitehead Institute Center for Genomic Research), RP11-32B5 (AC068446) and RP11-275E15 (AC060814).

Duplicon Delineation

Underlying duplicons (ancestral segmental duplications) were delineated using two independent methods. The first strategy determined the minimal evolutionary shared segment (MESS) from a global analysis of all pairwise alignments as described previously (Bailey et al. 2002b). Briefly, sequence similarity searches using 15q11 sequence as a query were made against both the human genome assembly (build34) and the NT and HTGS databases. All pairwise alignments were evaluated for percent identity, length, and chromosomal location. The alignments were manually curated using the graphical display program PARASIGHT (Jeffrey Bailey, unpublished), which facilitates the determination of duplicon boundaries and distribution throughout the genome. Ancestral duplication fragments correspond to those which form the shortest, most divergent pair and/or those in which an ancestral gene structure can be identified, that is, where a complete intron-exon complement can be deduced. Not all duplicons may be delineated using this method. The second method entails the comparison of the human and mouse sequence to identify the “best match” within the mouse genome sequence. The mouse-human alignment data were produced by BLASTZ comparison of the mouse and human genomes (Schwartz et al. 2003), and the linkage of BLASTZ mouse-human alignments into what are termed chains and nets (Kent et al. 2003). As the duplications found in 15q11 represent primate-specific events, the comparison of human duplicated segments with the mouse genome greatly simplifies the analysis. One caveat, however, is that regions duplicated in the mouse genome are uninformative in such a comparison, although the level of segmental duplication in the mouse genome is lower than that of the human genome (Cheung et al. 2003; Bailey et al. 2004). Once the mouse loci syntenic to the 15q11 pericentromeric sequence were identified, the mouse sequences were searched against the human genome assembly to identify putative ancestral human loci. Thus, we utilized two methods of defining the position and length of duplicons that originated from multiple interchromosomal locations within the contiguous set of 15q11 pericentromeric clones.

Phylogenetic Analysis

Non-coding sequences were extracted from each of the 24 duplicons and subjected to phylogenetic analysis (Table 4-2). A threshold of >5 kb was chosen because smaller duplications were unlikely to be detected by comparative FISH. Multi-sequence alignments, generally >1 kb where possible, were generated using CLUSTALW version 1.82 (Thompson et al. 1994), and pairwise distance calculations and phylogram construction were performed using the MEGA software package version 2.1 (Kumar et al. 1993). The neighbor-joining method was used to generate phylograms, and pairwise distance calculations were corrected for multiple substitutions by using the Kimura 2-parameter model of nucleotide substitution (Kimura 1980).
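For readers unfamiliar with the Kimura 2-parameter correction, the following Python sketch shows how such a distance can be computed for one pair of aligned sequences. It is a minimal illustration and not the MEGA implementation used in this study; the function name and the choice to skip any column containing a gap or ambiguous base are assumptions. The formula is d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of sites showing transitions and transversions, respectively.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura 2-parameter distance between two aligned sequences.

    Columns containing a gap or a non-ACGT character are skipped
    (an assumption of this sketch, similar to pairwise deletion).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")

    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in BASES or b not in BASES:
            continue
        compared += 1
        if a == b:
            continue
        # purine<->purine or pyrimidine<->pyrimidine changes are transitions
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1

    if compared == 0:
        raise ValueError("no comparable sites")

    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

For closely related sequences the corrected distance is only slightly larger than the raw proportion of differing sites; the correction grows as divergence increases and multiple substitutions at the same site become more likely.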

To estimate the divergence time of the chromosome 15 duplicons and their associated ancestral loci, the formula r = k/2T was employed, where r is the rate of nucleotide changes per bp per year, k is the distance calculated between the ancestral and chromosome 15 sequence, and T is the time of divergence of the molecules. The rate of nucleotide change in the 15q11 pericentromeric contig was determined by alignment of two duplicons for which corresponding chimpanzee sequence was available. Specifically, alignments of RP11-79C23 with AC122174 (15.1 kb), and RP11-509A17 with AC124220 (8.9 kb) were used to calculate independent rates based on a divergence time of 6 million years between the chimpanzee and human lineages (Goodman 1999). The independent rates were averaged, resulting in a rate of 1.9 × 10⁻⁹ nucleotide changes per bp per year. This estimate is in close agreement with previously published mutation rates for duplicated sequences (Horvath et al. 2003; Liu et al. 2003).
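The two uses of the relation r = k/2T, first calibrating the rate from a known divergence time and then solving for T, can be sketched in Python as follows. The function names and the distance values in the usage note are hypothetical, chosen only to illustrate the arithmetic; they are not values taken from Table 4-2.

```python
def substitution_rate(k: float, divergence_years: float) -> float:
    """Rate r (changes per bp per year) from a distance k and a known
    divergence time T, using r = k / (2T)."""
    if divergence_years <= 0:
        raise ValueError("divergence time must be positive")
    return k / (2.0 * divergence_years)


def divergence_time(k: float, rate: float) -> float:
    """Divergence time T (years) from a distance k and a rate r,
    using the same relation rearranged as T = k / (2r)."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return k / (2.0 * rate)


# Hypothetical human-chimpanzee distance, chosen so that the calibrated rate
# matches the 1.9e-9 figure quoted in the text (6 million year divergence).
rate = substitution_rate(k=0.0228, divergence_years=6e6)

# Hypothetical distance between a chromosome 15 duplicon and its ancestral locus.
t_years = divergence_time(k=0.0825, rate=rate)
print(f"rate = {rate:.2e} per bp per year, T = {t_years / 1e6:.1f} million years")
```

The factor of 2 reflects that substitutions accumulate independently along both lineages after the duplication, so the observed distance k corresponds to 2T years of evolution.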

Comparative FISH

Metaphase chromosome spreads were prepared from lymphoblastoid cell lines derived from humans (Homo sapiens (HSA)) and 5 non-human primate species, including three great ape species (Pan troglodytes (PTR), Gorilla gorilla (GGO), Pongo pygmaeus (PPY)) and two Old World monkey species (Macaca mulatta (MMU) and Papio hamadryas (PHA)). Hybridizations were performed using standard conditions with BAC DNA probes labeled with either biotin-16-dUTP or digoxigenin-11-dUTP as previously described (Horvath et al. 2000b). At least 20 metaphases were examined for each hybridization. In situ hybridization experiments were repeated and only consistent signals were recorded in order to minimize potential extraneous signals from these multi-site clones. Chromosome identity was determined by DAPI staining and reported according to the guidelines of the International System for Human Cytogenetic Nomenclature (ISCN 1985).

Chapter 5

BAC microarray analysis of 15q11-q13 rearrangements

and the impact of segmental duplications

Devin P. Locke1, Rick Segraves2, Robert D. Nicholls3, Stuart Schwartz1, Daniel Pinkel2, Donna G. Albertson2 and Evan E. Eichler1

1Department of Genetics, Center for Computational Genomics and Center for Human Genetics, Case Western Reserve University School of Medicine and University Hospitals of Cleveland, Cleveland, Ohio 44106

2Comprehensive Cancer Center, University of California San Francisco, San Francisco, California 94143

3Center for Neurobiology and Behavior, Department of Psychiatry and Department of Genetics, University of Pennsylvania, Philadelphia, Pennsylvania 19104

Note: This manuscript has been published: Locke, D.P., Segraves, R., Nicholls, R.D., Schwartz, S., Pinkel, D., Albertson, D.G. and E.E. Eichler. 2003. BAC microarray analysis of 15q11-q13 rearrangements and the impact of segmental duplications. J Med Genet. 41:175-182. R.S., D.G.A. and D.P. provided array comparative genomic hybridization data. R.D.N. provided array design assistance. S.S. provided patient samples.

ABSTRACT

Chromosome 15q11-q13 is one of the most variable regions of the human genome. Numerous clinical rearrangements have been documented involving a dosage imbalance of the region. Multiple clusters of segmental duplications are found in the pericentromeric region of 15q and, more importantly, at the breakpoints of proximal 15q rearrangements. Using sequence maps and previous global analyses of segmental duplications in the human genome, we developed a targeted microarray to detect a wide range of dosage imbalances in clinical samples. Clones were also chosen to assess the effect of paralogous sequences in the array format. In 19 patients analyzed, the array data correlated with microsatellite and FISH characterization. The data showed a linear response with respect to dosage ranging from 1 to 6 copies of the region. Paralogous sequences in arrayed clones appear to respond to the total genomic copy number, and results with such clones may appear aberrant unless the sequence context of the arrayed sequence is well understood. The array CGH method offers exquisite resolution and sensitivity for detecting large-scale dosage imbalances. Our results indicate that the duplication composition of BAC substrates may significantly impact the sensitivity for detection of dosage variation. These results have important implications for effective microarray design, as well as for the detection of segmental aneusomy within the human population.

INTRODUCTION

Human chromosome 15q11-q13 is one of the most unstable regions of the human genome. This assertion is supported by the wide spectrum of clinically recognized rearrangements that involve proximal 15q, including deletions, duplications, triplications, inversions and translocations. In addition, multiple types of supernumerary marker chromosome, both dicentric and monocentric, are derived from the region and are found in multiple size classes. Genotype-phenotype correlations of clinically recognized 15q11-q13 rearrangements demonstrate that both the gain and loss of material from the region frequently result in disease. Prader-Willi and Angelman syndromes (PWS/AS) are classic examples of the phenotypic effect of regional loss, while pseudo-dicentric(15) syndrome (formerly inv dup(15) syndrome) and interstitial duplication of 15q11-q13 are clinically recognized disorders due to the gain of material (Barber et al. 1998; Cassidy et al. 2000; Nicholls and Knepper 2001). Partial hexasomy of 15q11-q13 has also been reported in one patient with multiple proximal chromosome 15 derived supernumerary marker chromosomes, establishing a wide range of potential dosage imbalances for the 15q11-q13 region (Nietzel et al. 2003).

Dosage imbalance of the region has been linked to the presence of large blocks of segmental duplications, which comprise approximately 5% of the total human genome sequence (Amos-Landgraf et al. 1999; Bailey et al. 2001; Bailey et al. 2002b). The presence of segmental duplications in regions prone to dosage imbalance is common (Ji et al. 2000a; Stankiewicz and Lupski 2002). An important consideration for employing genome-wide methods of detecting dosage imbalance such as array CGH, which has shown the capability to assess dosage at thousands of genomic loci simultaneously (Snijders et al. 2001), is therefore how best to approach variable regions with high duplication content.

Informative microarray design is dependent upon a variety of bioinformatic resources made available through the Human Genome Project. In particular, the detection and analysis of segmental duplications, especially in regions such as 15q11-q13 which are known to contain large blocks of such sequences, is critical in designing the most informative microarray (Bailey et al. 2001; Bailey et al. 2002a). Furthermore, the sensitivity of highly duplicated clonal substrates within a microarray experiment has not been systematically explored. Such considerations are important for two reasons. First, genomic areas flanked by duplications show a greater proclivity to rearrange through non-allelic homologous recombination. Therefore, these regions represent ideal targets for the discovery of new segmental aneusomy syndromes. Second, as genomic microarrays move toward complete tiling of the human genome sequence (~30,000 BACs) one will have to understand the sequence content of each BAC in order to properly interpret the array result. As segmental duplications vary in size and degree of sequence identity, we sought to explore the effects of segmental duplications within the array format and the utility of array comparative genomic hybridization (array CGH) technology in the detection of large-scale structural rearrangements. Specifically, we utilized well-characterized BACs of varying duplication content in addition to unique BACs as array elements in development of a specialized 15q11-q13 microarray. In this study, we assessed the effectiveness of our approach to microarray design by focusing on a small unstable region of the genome. By using breakpoint-flanking BAC clones as well as testing clones with varying degrees of segmental duplication, we demonstrate the utility and limitation of this approach to detect a wide range of genomic dosage imbalances among normal individuals and clinically characterized patient material.

RESULTS

Clone Selection

Clones were manually selected from the human genome assembly (August 2001 assembly) in accordance with previous BAC/YAC mapping efforts (Figure 5-1) (Christian et al. 1999; IHGSC 2001). The 18 clones used in this study span approximately 10 Mb with a resolution of 1 BAC per 550 kb. The vast majority of clones (17/18) were in the sequencing queue at the time of selection. The one exception, RP11-219B16, was represented as low-pass sequence (phase 0) in the HTGS database; however, subsequent FISH and sequence analysis (Figure 5-2) indicated that the sequence in accession AC068962 is not representative of the single-colony isolate of RP11-219B16 used in this study. The clones can be divided into groups based on the position of the known common rearrangement breakpoints (Figure 5-1; Table 5-1) (Robinson et al. 1998; Amos-Landgraf et al. 1999; Christian et al. 1999). In terms of duplication content, the clones vary in the amount of duplicated material as a percent of the total sequence of each accession, the sequence similarity of the duplications, and the interchromosomal or intrachromosomal nature of the duplicated material (Table 5-1). The correlation of array CGH fluorescence intensity ratios with microsatellite genotyping and cytogenetic results was performed during the analysis phase of this study, and the results are discussed below according to genotype.


Figure 5-1. Map of 15q11-q13 array clones and the extent of common 15q11-q13 rearrangements. Array clones are indicated by horizontal colored boxes below the line, labeled with the RPCI-11 clone ID and accession number. Blue boxes indicate interchromosomal duplication, red boxes intrachromosomal duplication and green boxes unique sequence. The common PWS/AS breakpoints, which contain intrachromosomal HERC2 and GLP/LCR15-related sequences, are indicated as the large red boxes labeled BP1, BP2 and BP3. The pseudo-dicentric(15) breakpoint, BP4, is indicated in purple. The map is not drawn to scale. Depicted above the BAC map are the intervals gained and lost in the most common 15q11-q13 rearrangements.

Figure 5-2. RP11-219B16 FISH results contrast with AC068962. By similarity search against build34 the sequence within AC068962 aligns most parsimoniously to chromosome 5 (data not shown). The BAC isolated from the RP11 library, however, when labeled as a probe in FISH experiments hybridizes to chromosomes 15q11, 2p12 and 2p23. Thus, there is a disparity between the clone used in this analysis and the accession sequence, and the accession sequence was therefore ignored.

Table 5-1. Properties of the array BAC clones (table contents not recoverable).


Hybridization Profiles of Normal Individuals

We assessed potential structural variation among a set of six normal individuals using the specialized 15q11-q13 array (Table 5-2; N1-N6). For the majority of arrayed BAC clones, limited variation in fluorescence intensity ratios was observed; however, there were two notable exceptions. The two most proximal BAC clones, RP11-219B16 and RP11-509A17, demonstrated log2 ratios inconsistent with the theoretical thresholds expected for either equimolar representation or complete gain or loss between the reference and test genomic DNA samples. For example, for RP11-219B16, normal DNA sample N1 yielded a log2 ratio of 0.44, which is short of the theoretical level for the haploid duplication of a BAC (0.58) (Methods); however, the increased ratio is indicative of either a partial duplication of material in this BAC, or an increase in material with substantial sequence similarity (Figure 5-3). Similarly, sample N2 demonstrated log2 ratios of -0.55 for RP11-219B16 and -0.59 for RP11-509A17, which are less negative than the ratio expected for the haploid deletion of an entire BAC (-1.0) (Figure 5-3).
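The theoretical log2 ratios referred to here follow directly from the ratio of test to reference copy number. As a minimal Python sketch, assuming a diploid reference (2 copies) and a perfectly linear hybridization response, with a hypothetical function name:

```python
import math


def expected_log2_ratio(test_copies: int, reference_copies: int = 2) -> float:
    """Theoretical log2(test/reference) ratio for a given copy number,
    assuming fluorescence scales linearly with copy number."""
    if test_copies <= 0 or reference_copies <= 0:
        raise ValueError("copy numbers must be positive")
    return math.log2(test_copies / reference_copies)


# Copy numbers spanning the range of dosage described for 15q11-q13
for copies in (1, 2, 3, 4, 6):
    print(copies, round(expected_log2_ratio(copies), 2))
```

A single-copy gain (3 copies against 2) gives log2(1.5), approximately 0.58, and a single-copy loss (1 copy against 2) gives log2(0.5) = -1.0. Observed values that fall between 0 and these thresholds, such as 0.44 or -0.55, are therefore consistent with only part of the hybridizing sequence changing in dosage.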

Figure 5-3. Array CGH profiles of normal DNA samples. The log2 relative fluorescence intensity ratios for all normal individuals hybridized to the 15q11-q13 array have been plotted linearly with respect to the proximal-distal position of the 18 BAC clones along chromosome 15 (Table 5-1). BAC clones RP11-219B16 (array element 1) and RP11-509A17 (array element 2), which are located proximally of BP1, vary between normal individuals, and appear to contain interchromosomal duplications by sequence analysis, FISH and paralogous STS characterization (Table 5-1; Figure 5-2). The source of this variation may be altered copy number of the related interchromosomal sites. The remaining 16 clones, in contrast, clustered around a log2 ratio of zero, indicating equal fluorescence intensity resulted from hybridization of the normal genomic DNA and the reference DNA.

Hybridization Profiles of 15q11-q13 Sequence Losses

In our analysis of deletion rearrangements we utilized a spectrum of patient samples, including 7 PWS and AS Class I deletion patients, 3 PWS and AS Class II deletion patients and 1 PWS unbalanced translocation patient (Table 5-2; Figure 5-4). In each case, the extent of haploinsufficiency as measured by FISH and microsatellite genotyping was consistent with the extent of the deletion determined by array CGH. Patient samples P1-P7 demonstrated a log2 ratio decrease for all clones mapping between BP1 and BP3 (Figure 5-4a), and similarly samples P9 and P10 presented reduced ratios for clones between BP2 and BP3 (Figure 5-4b), consistent with Class I and Class II rearrangements, respectively. In particular, clones RP11-26F2 and RP11-289D12, located in the D15S542 region between PWS/AS common deletion breakpoints BP1 and BP2, were useful for distinguishing Class I from Class II PWS/AS deletions. In general, BACs which map between BP2 and BP3 (clones 5-13 in Figure 5-4) were deleted with an average log2 ratio of -0.76 (STDDEV=0.16, n=10 hybridizations; 89/90 BACs reporting) in all PWS/AS deletions examined, regardless of class. In comparison, the log2 ratio for the normal genomic DNA samples in the identical interval averaged -0.06 (STDDEV=0.09, n=6 hybridizations; 54/54 BACs reporting). The non-overlapping intervals of the respective standard deviations indicate a statistically significant difference in the log2 ratios for probes in the BP2-BP3 region.

Table 5-2. Genomic DNA samples assayed by array CGH

                                          Dosage per 15q11-q13 Interval(1)
ID   Classification                       CEN-BP1  BP1-BP2  BP2-BP3  BP3-BP4  BP4-TEL
N1   Normal                               2        2        2        2        2
N2   Normal                               2        2        2        2        2
N3   Normal                               2        2        2        2        2
N4   Normal                               2        2        2        2        2
N5   Normal                               2        2        2        2        2
N6   Normal                               2        2        2        2        2
P1   PWS Class I                          2        1        1        2        2
P2   PWS Class I                          2        1        1        2        2
P3   AS Class I                           2        1        1        2        2
P4   PWS Class I                          2        1        1        2        2
P5   AS Class I                           2        1        1        2        2
P6   AS Class I                           1*       1        1        2        2
P7   PWS Class I                          2        1        1        2        2
P8   AS Class II                          2        2        1        2        2
P9   PWS Class II                         2        2        1        2        2
P10  AS Class II                          2        2        1        2        2
P11  PWS Unbalanced Translocation         1        1        1        2        2
P12  Supernumerary Del(15)                3        3        3        2        2
P13  Interstitial 15q11-q13 Triplication  2        6        6        2        2
P14  Pseudo-Dicentric(15) Small           4        4        2        2        2
P15  Pseudo-Dicentric(15) Medium          4        4        4        2        2
P16  Pseudo-Dicentric(15) Medium          4        4        4        2        2
P17  Pseudo-Dicentric(15) Large           4        4        4        4        2
P18  Pseudo-Tricentric(15)                6        6        6        2        2
P19  Pseudo-Tricentric(15)                6        6        6        2        2

1 = Intervals are demarcated by the common PWS/AS breakpoints; see Figure 5-1.
* = Profile suggests a deletion; however, microsatellite analysis did not include markers in this region.

The dosage (number of copies) of the indicated interval is shown for the 25 individuals studied based on cytogenetic and microsatellite characterization.

Figure 5-4. Profile of 15q11-q13 losses. Samples demonstrating a loss of material in 15q11-q13 have been divided into sub-classes according to genotype. Panel A depicts the profile of seven PWS Class I patients, which demonstrate a loss of material between BP1 and BP3. Note clone RP11-37J13 (array element 14) for patient sample P1 reported incomplete data and was excluded. Panel B depicts two PWS Class II patients, which show loss of material between BP2 and BP3. Panel C depicts a PWS patient with an unbalanced translocation breakpoint between array clones 9 and 10. Panel D depicts an uncommon PWS deletion in which the BP1-BP2 interval is intact, similar to a Class II deletion patient, yet the deletion extends distally to a breakpoint between BP3 and BP4.

For the 15q11.2 translocation patient sample (P11), a reduction in log2 ratios was observed for array elements in the proximal BP2-BP3 region (average log2 ratio=-0.94, STDDEV=0.12, n=4/4 BACs reporting). A sharp transition was observed between the four proximal array clones and the three distal clones in the BP2-BP3 interval (average log2 ratio=-0.04, STDDEV=0.03, 3/3 BACs reporting) (Figure 5-4c). Thus, the breakpoint of the unbalanced translocation occurred between RP11-131I21 and RP11-10K20, consistent with previous cytogenetic and molecular characterization of this patient sample (Conroy et al. 1997).

Hybridization Profiles of 15q11-q13 Sequence Gains

We analyzed 8 patient samples containing gains of 15q11-q13 including 4 pseudo-dicentric(15) supernumerary marker chromosome patients (one small, two medium and one large), 1 interstitial triplication patient, 2 pseudo-tricentric(15) supernumerary marker chromosome patients and one monocentric supernumerary del(15) marker chromosome patient (Table 5-2). Dosage of the 15q11-q13 region in these samples ranged from 3 to 6 copies in select intervals. The array profiles of these samples consistently demonstrated increases in fluorescence intensity ratios for intervals of varying length that correlated with the length of the rearrangement determined by microsatellite analysis and FISH (Table 5-2; Figure 5-5). The analysis of the supernumerary del(15) marker chromosome patient sample revealed somewhat higher background signal, which may have been due to the quality of the input patient DNA (Figure 5-5d).

Figure 5-5. Profile of 15q11-q13 gains. Samples demonstrating a gain of material in 15q11-q13 have been divided into sub-classes according to genotype. Panel A depicts the dosage profile of a genomic sample containing a small pseudo-dicentric(15) marker chromosome with a distal breakpoint at BP2. The apparent lack of differential fluorescent intensity for BACs, RP11-219B16 (array element 1) and RP11-509A17 (array element 2) in this and other pseudo-dicentric(15) samples may be due to the presence of additional paralogous sequences in the genome, limiting the response of these clones. Panel B depicts the profile of two medium pseudo-dicentric(15) chromosomes, with a distal break at BP3. Panel C depicts the profile of a patient sample with a large pseudo- dicentric(15) marker chromosome with a distal breakpoint at BP4. Panel D shows the profile of a supernumerary del(15) marker chromosome sample, which includes increased dosage of the PWS/AS critical region. Panel E depicts three rare triplication profiles including one interstitial triplication and two pseudo-tricentric(15) marker chromosome samples. All three triplication patients appear to have a distal breakpoint at BP3. Note clone RP11-37J13 (array element 14) in sample P18 was excluded due to incomplete data from this hybridization.

Correlation of Dosage Imbalance and Log2 Ratio

As shown previously, the relationship between fluorescence intensity ratio and copy number is linear (Pinkel et al. 1998). We performed a similar regression analysis to assess the behavior of log2 ratios in clinical samples with 15q11-q13 rearrangements in relation to previous studies. Using samples with 1, 2, 3, 4 and 6 copies of the BP2-BP3 interval (samples P10, N1, P12, P16 and P17), the raw fluorescence intensity ratios were averaged across the interval and plotted (Figure 5-6). The correlation coefficient (R²) of 0.995 demonstrates an excellent fit to the linear model.
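A regression of this kind can be sketched with an ordinary least-squares fit. The Python example below is illustrative only: the function name is hypothetical, and the ratios in the usage note are idealized values (copy number divided by 2), not the measured ratios plotted in Figure 5-6.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r_squared), where r_squared is the squared
    Pearson correlation between xs and ys.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("both series must vary")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Idealized example: raw test/reference ratio for a diploid reference.
copies = [1, 2, 3, 4, 6]
ideal_ratios = [c / 2 for c in copies]
slope, intercept, r_squared = linear_fit(copies, ideal_ratios)
print(slope, intercept, r_squared)
```

With ideal data the slope is 0.5 per copy and R² is 1; measured ratios are typically compressed toward 1, so the observed slope would be expected to be somewhat shallower while the fit remains linear.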

Figure 5-6. Correlation of copy number and fluorescence intensity ratios. The raw fluorescence ratios across the BP2-BP3 interval were averaged for samples known to contain 1, 2, 3, 4 and 6 copies of the region (samples P10, N2, P12, P16, P17). The data fit a linear regression model with a correlation coefficient of 0.995, confirming a linear relationship between copy number and fluorescence intensity, in addition to providing a measure of array performance across independent experiments.

Duplication Sensitivity of Arrayed Clones

It was noted during qualitative analysis of the array CGH profiles that certain

BAC clones consistently showed unexpected deviations in fluorescence intensity ratio.

These deviations were typically inconsistent with neighboring clones and not contiguous

with the genomic rearrangement in the patient sample. In addition, aside from the

variation noted above for the most proximal BAC clones in normal individuals,

deviations were only observed in patient samples with dosage imbalances. For example,

the profile of the small pseudo-dicentric(15) sample (P14) demonstrated a increased

dosage of material in the BP1-BP2 interval, yet RP11-483E23, and to a lesser extent

RP11-540B6, which map distal to the BP1-BP2 region, showed an increase in

fluorescence intensity ratio (Figure 5-7). This effect was also observed for the PWS

unbalanced translocation (P11) profile, in which BACs RP11-483E23 and RP11-540B6

demonstrated marked decreases in fluorescence intensity ratio despite their position distal

of the deleted interval. Both of these clones harbor segmental duplications (HERC2)

(Figure 5-8; Table 5-2) that are also present in more proximal sequences such as

RP11-13O24.


Figure 5-7. Duplication sensitivity of arrayed clones. RP11-483E23 (array element 13), and to a lesser extent RP11-540B6 (array element 16), appear to be influenced by remote genomic rearrangements, demonstrated by the deviations in log2 ratio observed for these distal clones in samples with proximal 15q11-q13 rearrangements. The response to the gain or loss of non-contiguous material, likely due to the presence of segmental duplications in the distal array BAC clones, we term duplication sensitivity.


Figure 5-8. Sequence similarity, duplication content and duplication sensitivity. The program MIROPEATS was used to illustrate the relationship between clones RP11-26F2, RP11-13O24, RP11-483E23 and RP11-540B6. All four of these clones contain HERC2 and/or GLP/LCR15-related sequences; however, the total fraction of duplicated sequence in each clone, and the degree of similarity of that duplicated sequence to other sites in the genome, varies among the clones. For example, RP11-13O24 and RP11-26F2 (Panel A) share 12 kb of sequence (92.5%), RP11-13O24 and RP11-540B6 (Panel B) share 27 kb of sequence (92.4%), and RP11-13O24 and RP11-483E23 (Panel C) share 139 kb of sequence (98.4%). Therefore the duplication sensitivity seen with RP11-483E23, and to a lesser extent RP11-540B6, is likely a result of the segmental duplications present in these clones.

DISCUSSION

Through the use of a BAC microarray designed specifically for the highly variable 15q11-q13 region, we have shown that array CGH is effective at discerning the extent of dosage imbalance in a wide spectrum of clinical samples. As part of this analysis, we selected BACs that contained segmental duplications. Since these duplicated BACs varied in content, degree of sequence identity and distribution of segmental duplications, we were able to assess the effect of paralogous sequences on microarray detection sensitivity for the first time. Two effects were noted. First, it appears that duplicated templates may mimic effects (i.e. exaggerate fluorescence intensity ratios) consistent with partial gains or losses but that involve rearrangement events that have occurred elsewhere in the genome. These secondary regions contain duplicated sequence which, when deleted or duplicated, concomitantly alters intensity levels at all duplicated loci. This was particularly evident for large blocks of segmental duplication with the highest degree of sequence identity (>98%) such as the HERC2 duplication. Second, fluorescence intensity levels for such sites were generally suppressed when compared to theoretical expectations for a discrete gain or loss of a copy. The most notable example of this was observed among the highly-duplicated 15q11 pericentromeric clones whose fluorescence


intensity ratios were inconsistent with pseudo-dicentric(15) rearrangement as predicted

by FISH. Such effects were not observed among unique clones that did not contain

segmental duplications. We conclude that the duplication content of BAC templates is an

important consideration in the construction of BAC microarrays. Unlike common repeats

such as Alus and LINEs, duplicated regions cannot be effectively blocked by Cot-1

DNA. Consequently, significant departures from the expected 1:2 and 3:2 ratios for haploid deletions and duplications may occur. Data interpretation may be particularly compromised when the duplications are large (~50-100% of the BAC) and highly identical (>98%).
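The departure from the ideal 1:2 and 3:2 ratios can be illustrated with a simple additive dosage model. This is a sketch under our own simplifying assumptions, not a model fitted in this study: signal from a BAC is taken to be proportional to the genomic copy number of its unique fraction plus the total copy number of all paralogs of its duplicated fraction, with perfect cross-hybridization between paralogs. The function and parameter names are hypothetical.

```python
def expected_ratio(dup_fraction, paralog_loci, delta_unique, delta_dup, ploidy=2):
    """Expected test/reference intensity ratio for one arrayed BAC.

    dup_fraction : fraction of the BAC made of duplicated sequence (0..1)
    paralog_loci : number of haploid loci carrying that duplication
    delta_unique : copies gained (+) or lost (-) at the BAC's unique sequence
    delta_dup    : copies of the duplicated sequence gained or lost anywhere
    ASSUMPTION: additive signal and perfect cross-hybridization of paralogs.
    """
    unique_fraction = 1.0 - dup_fraction
    reference = unique_fraction * ploidy + dup_fraction * ploidy * paralog_loci
    test = (unique_fraction * (ploidy + delta_unique)
            + dup_fraction * (ploidy * paralog_loci + delta_dup))
    return test / reference


# Unique BAC, haploid deletion and haploid duplication: the ideal 1:2 and 3:2.
unique_deletion = expected_ratio(0.0, 1, -1, 0)
unique_duplication = expected_ratio(0.0, 1, +1, 0)

# Fully duplicated BAC with two paralogous loci, one copy lost: suppressed.
duplicated_deletion = expected_ratio(1.0, 2, 0, -1)

# Half-duplicated BAC whose own locus is intact but a remote paralog is lost.
remote_deletion = expected_ratio(0.5, 2, 0, -1)
```

Under these assumptions a fully duplicated clone reports 0.75 rather than 0.5 for a single-copy loss, and a clone outside the rearranged interval still deviates from 1.0 when a remote paralog is deleted, which is the qualitative behavior termed duplication sensitivity.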

Our study also revealed some interesting aspects of 15q11-q13 genomic instability. Among normal individuals, variation in relative fluorescence signal intensity was noted near the pericentromeric region consistent with previously reported large-scale structural polymorphism. Such variation among normal individuals should be taken as a cautionary note in a clinical setting. It emphasizes the need to consider multiple BACs over the critical region before a final “karyotype” diagnosis is reached. Among patient material, most rearrangements occur, as expected, at classically defined PWS/AS breakpoints, BP1, BP2 and BP3. In this study, distal breakpoint BP3 was the most common breakpoint terminus of all the 15q11 dosage imbalances (15/19). Two samples, however, involved rearrangements that extended distally to BP3. One patient with a large pseudo-dicentric(15) chromosome showed a breakpoint localization at BP4. This is in agreement with previous reports of larger 15q11-q13 supernumerary marker chromosomes (Robinson et al. 1998; Christian et al. 1999). Surprisingly, analysis of one patient sample, a Class II AS deletion patient, predicted an atypical breakpoint between


BP3 and BP4. This event is likely a rare occurrence, since no deletions beyond BP3 have been previously documented. Interestingly, the distal breakpoint in this sample corresponds to a region recently characterized as the pericentric inversion breakpoint of chromosome 15 in chimpanzee (Locke et al. 2003a). This site was recently shown to harbor extensive segmental duplication, including copies of the LCR15 duplicon, which have been associated with other PWS/AS breakpoints (Pujana et al. 2002; Locke et al. 2003a). Rearrangements involving this region should be considered in further testing of PWS/AS patients.

Currently, array CGH is one of several techniques being developed to assess genomic dosage imbalance. Other competing technologies, such as multiplex amplifiable probe hybridization (MAPH) and multiplex ligation-dependent probe amplification (MLPA) (Armour et al. 2000; Schouten et al. 2002), involve the design of specific DNA probes ranging from 80-600 nucleotides in length. Compared to BAC array CGH, these methods utilize much smaller target sequences which could, in theory, significantly increase the precision in targeting unique regions of the genome where recurrent rearrangements are likely. A set of probes, for example, has already been developed to detect rearrangements near human subtelomeric regions (Hollox et al. 2002). Complete genome coverage at the level afforded by array CGH may prove difficult to achieve. The expense and the number of required probes are currently rate-limiting. In addition, the discrimination of segmental duplications, which may be numerous (~40 copies) and highly identical (99.9%), will require methodological advances irrespective of the technology.


For the diagnosis of clinical 15q11-q13 rearrangements we have shown that a single assay measuring dosage across a complex region of the genome may be performed accurately and robustly using array CGH technology. Correct interpretation, however, requires sufficient knowledge of the underlying sequence including the behavior of duplicated sequence in the array format. Since duplicated regions show a greater proclivity to rearrange, a consideration of this fact will facilitate the design of future arrays as well as the interpretation of array data. For future experiments, one may wish to avoid such regions, or evaluate array data while employing global segmental duplication analyses. This combined approach will provide clinically relevant information for the majority of the human genome with minimal error.

MATERIALS AND METHODS

Clinical Samples

All array CGH hybridizations were conducted blind to the genotype of the sample. The DNA samples utilized in this study were obtained from patient-derived cell lines and characterized using a combination of cytogenetic methods and microsatellite analysis. Reference DNA for array hybridizations was obtained from a healthy anonymous blood donor. Normal samples used in these studies were obtained from unaffected individuals from University Hospitals of Cleveland, Center for Human Genetics under appropriate informed consent protocols. For the PWS/AS samples, FISH analysis using BAC RP11-289D12 was used to differentiate Class I and Class II deletions. The panel of microsatellite markers used to characterize the patient samples includes STS D15S541, D15S542, D15S1035, D15S543, D15S1002, D15S1048, D15S1019 and D15S165. Note that not all samples were analyzed cytogenetically and characterized with all STS markers; typically, a combination of the techniques was performed. The genomic DNA samples utilized in this study were obtained from the resources of the CWRU Center for Human Genetics.

Clone Characterization

DNA for all arrayed clones was isolated (Nucleobond, Clontech) and subjected to BAC end sequencing using standard protocols with vector primers. The end sequences were then used in sequence similarity searches against the set of accessions chosen for the array in order to verify identity. To experimentally confirm the localization of each BAC selected for the array, DNA from each clone was labeled fluorescently and used as a probe in FISH assays on human metaphase chromosomes in accordance with standard protocols (Horvath et al. 2000b). Twenty metaphase preparations were examined for each hybridization experiment. The duplication content of the arrayed BAC clones was determined by sequence similarity searches against the NT/HTGS GenBank nucleotide databases, published reports of known duplicons, and the segmental duplication database (SDD; http://humanparalogy.gene.cwru.edu) (Bailey et al. 2001). To graphically illustrate the relationship between clones RP11-483E23, RP11-13O24, RP11-540B6 and RP11-26F2, the program MIROPEATS was used (Parsons 1995).

Array CGH

Microarrays were prepared as previously described (Snijders et al. 2001). Briefly, ligation-mediated PCR of MseI-digested BAC DNA is used to create a DNA representation of each BAC clone. These DNA solutions are spotted in triplicate on chromium-coated slides. For normalization purposes, 200 unique BAC clones mapping to other regions of the genome were included. Patient genomic DNA extraction and array hybridization were performed as previously described (Snijders et al. 2001; Locke et al. 2003b). Arrays were imaged using a custom CCD camera system and analyzed using the UCSF SPOT software package (Pinkel et al. 1998; Jain et al. 2002).

Relative DNA dosage was determined by calculating the fluorescence intensity ratio produced by hybridization of differentially labeled experimental and control genomic DNAs to the array. The ratio of experimental to control fluorescence is linearly proportional to the relative dosage between the two samples over a wide dynamic range (Pinkel et al. 1998). Ratio data are normalized so that the median ratio is set to 1.0. Thus the ratio response is linear, but the slope of the curve is less than ideal. We also report data as the logarithm to the base 2 of the ratio, which conveniently allows display of data over a wide dynamic range and facilitates calculations. On this scale, the normalization results in the median log2 ratio being 0. A sequence that is present at half the dosage in the sample relative to the reference sample would ideally have a log2 ratio of -1. A sequence that has a factor of 1.5 increase in dosage (a haploid duplication, for example) would ideally have a log2 ratio of 0.58. The linear and logarithmic data formats are completely equivalent.
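The normalization and log2 conversion described above can be sketched as follows. This is a minimal illustration rather than the SPOT implementation; the function names and the example ratio list are hypothetical.

```python
import math
from statistics import median


def normalize_ratios(ratios):
    """Scale raw test/reference ratios so that their median equals 1.0."""
    med = median(ratios)
    return [r / med for r in ratios]


def log2_ratio(ratio):
    """Convert a linear intensity ratio to the log2 scale."""
    return math.log2(ratio)


# Ideal values quoted in the text.
half_dosage = log2_ratio(0.5)          # haploid deletion
haploid_duplication = log2_ratio(1.5)  # 3 copies against 2

# Hypothetical raw ratios, used only to show the median normalization.
normalized = normalize_ratios([0.9, 1.8, 0.45, 0.9, 0.9])
```

After normalization the median ratio is 1.0, so the median log2 ratio is 0, and the two ideal cases evaluate to -1 and approximately 0.58.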

ACKNOWLEDGEMENTS

The authors would like to thank Julie E. Horvath for helpful comments on the manuscript. This work was supported in part by NIH grants HD043569 to E.E.E. and ES10631 to R.D.N. and E.E.E., in addition to NCI grants CA83040 to D.P. and CA84118 to D.A.

Chapter 6

Extensive homogenization, interchromosomal duplication and lineage-specific evolution of the 15q11-q13 Prader-Willi/Angelman syndrome breakpoint regions

Devin P. Locke1, Lisa M. Pertz1, Amy M. Yavor1, Doriana Misceo2, Jessica Lehoczky3, Jean L. Chang3, Robert D. Nicholls4, Stephanie Dechamps5, Bruce Roe5, Ken Dewar3, Mariano Rocchi2, Shaying Zhao6, Chad Nusbaum3 and Evan E. Eichler1

1Department of Genetics, Center for Computational Genomics and Center for Human Genetics, Case Western Reserve University School of Medicine and University Hospitals of Cleveland, Cleveland, OH, 44106

2Dipartimento di Anatomia Patologica e di Genetica, Sezione di Genetica, University of Bari, Bari 70126, Italy

3Whitehead Institute Center for Genome Research, Cambridge, MA 02141

4Center for Neurobiology and Behavior, Department of Psychiatry and Department of Genetics, University of Pennsylvania, Philadelphia, Pennsylvania 19104

5Department of Chemistry and Biochemistry, University of Oklahoma, Norman, OK 73019

6The Institute for Genomic Research, Rockville, MD 20850

Note: This manuscript is being prepared for submission: Locke, D.P., Pertz, L.M., Yavor, A.M., McGrath, S.D., Misceo, D., Lehoczky, J., Chang, J.L., Nicholls, R.D., Dewar, K., Rocchi, M., Zhao, S., Nusbaum, C. and E.E. Eichler. Extensive homogenization, interchromosomal duplication and lineage-specific evolution of the 15q11-q13 Prader-Willi/Angelman syndrome breakpoint regions. L.M.P. and A.M.Y. provided assistance with contig construction and phylogenetic analysis. D.M. and M.R. provided comparative fluorescence in situ hybridization data. R.D.N. provided contig development assistance. J.L., J.L.C., K.D. and C.N. provided human and primate comparative BAC sequencing data. S.Z. provided BAC end sequencing data.


ABSTRACT

Clusters of segmental duplications are commonly found at the breakpoints of genomic

disorders such as Prader-Willi and Angelman syndromes (PWS/AS) in 15q11-q13. We

have applied specialized techniques for dissecting regions with extensive paralogy and

developed sequence contigs within the three major 15q11-q13 breakpoints, which

coincide with gaps in the genome assembly. We present here an analysis of the structure,

organization and evolutionary history of these regions, termed BP1, BP2 and BP3.

Similarity searches of the human genome sequence revealed a surprising level of

interchromosomal duplication within all three breakpoints, and extensive sharing of both

interchromosomal and intrachromosomal duplications, including LCR15 and HERC2 duplicons. The mosaic structure of the breakpoint regions appears to be a trait established within the last 30 million years of primate evolution, as evolutionary analysis of the HERC2 duplicon by fluorescence in situ hybridization and comparative sequencing demonstrates a pattern of intrachromosomal duplication in the baboon genome consistent with lineage-specific homogenization of distant duplications in multiple primate lineages. Additionally, genomic and phylogenetic analyses suggest that a recent homogenization of HERC2 sequences has occurred in the human lineage. Furthermore, comparative sequencing in baboon and chimpanzee revealed non-orthologous relationships to the human genome at sites of HERC2 duplication, demonstrating substantial plasticity at sites of segmental duplication. Overall, we found that the pattern of duplication observed in humans and primates is indicative of a complex evolutionary history in which large segments of DNA have been homogenized recently in multiple independent lineages.


INTRODUCTION

The highly duplicated regions of 15q11-q13 have proven difficult to sequence and assemble, as demonstrated by the presence of seven clone or contig gaps in the most proximal ten megabases of the ‘finished’ human genome assembly of chromosome 15 (build34). Regions of prevalent segmental duplication have long been recognized as problematic for sequence homology-based assembly, despite their biological importance (Collins et al. 1998; Eichler 1998). In the case of 15q11-q13, this is evident from the wide variety of clinically recognized rearrangements which share common rearrangement breakpoints that coincide with clusters of segmental duplications (Chapter 5) (Christian et al. 1995; Robinson et al. 1998; Amos-Landgraf et al. 1999; Christian et al. 1999; Pujana et al. 2002).

Several human genomic disorders have been characterized by the presence of large clusters of segmental duplications at common rearrangement breakpoints, including Prader-Willi and Angelman syndromes in 15q11-q13 (Ji et al. 2000a; Stankiewicz and Lupski 2002). The 15q11-q13 region contains three major clusters of segmental duplications, called BP1, BP2 and BP3 (Figure 6-1A). BP1 and BP2 are the proximal common rearrangement breakpoints in the majority of PWS/AS deletions. Deletions of 15q11-q13 are found in two size classes depending on the proximal breakpoint. Class I deletions, which occur in ~40% of PWS/AS patients, involve BP1, and Class II deletions, which occur in ~60% of PWS/AS patients, involve BP2 (Amos-Landgraf et al. 1999).


Figure 6-1. Schematic of the 15q11-q13 region and PWS/AS breakpoints. A) Gene loci are depicted as circles and ovals color coded according to imprint status. Black indicates a non-imprinted gene, white indicates paternal expression and grey indicates maternal expression. The centromere is indicated by the box labeled CEN and the pericentromeric region is marked by the box labeled PERI. The PWS/AS breakpoint clusters are indicated by the boxes labeled BP1, BP2 and BP3. PWS/AS deletions occur in two size classes depending upon the proximal breakpoint involved. The location of the sequence scaffolds is indicated underneath the gene map. The map is not drawn to scale. B) The HERC2 locus, depicted as a black line with tick marks every 10 kb, contains 93 exons, indicated as the black boxes above the line, and is transcribed in the telomere-centromere orientation. Note, the HERC2 locus abuts BP3. Three STS probes used for BAC library hybridization are indicated. C) Each scaffold is depicted as a horizontal line; the BACs which comprise each scaffold are indicated underneath the major black line and are individually labeled with accession numbers. Gaps are indicated by the tall vertical grey boxes. STS probes are shown as short vertical grey boxes, labeled with arrows.


In addition to PWS/AS, the 15q11-q13 region has also been shown to undergo

pericentromeric expansions, interstitial duplications and interstitial triplications (Browne et al.

1997; Ritchie et al. 1998; Amos-Landgraf et al. 1999; Ungaro et al. 2001; Fantes et al.

2002). It has been estimated that half of all supernumerary marker chromosomes are derived

from proximal 15q, and these marker chromosomes occur in varying lengths that

correlate with breaks at segmental duplication clusters (Chapter 5)(Blennow et al. 1995).

It has been shown that non-allelic homologous recombination (NAHR) between paralogous sequences, also known as low-copy repeats or LCRs, can lead to the deletion or duplication of the intervening material between duplications (Stankiewicz and Lupski 2002). Additionally, an inversion of the PWS/AS region between the breakpoint clusters of segmental duplications has been shown to promote subsequent deletions of the region in the gametes (Gimelli et al. 2003). Elucidating the sequence structure of the 15q11-q13 breakpoint regions, therefore, is essential to determining the mechanism of PWS/AS rearrangement, as well as the reciprocal 15q11-q13 duplication events and supernumerary marker chromosome formations.

Segmental duplications have also been implicated in several types of evolutionary rearrangement, such as large-scale dosage gains and losses, translocations, pericentric inversions in primates and syntenic breaks in rodent-human genome comparisons (Chapters 2 and 3) (Stankiewicz et al. 2001; Armengol et al. 2003; Bailey et al. 2004). The analysis of several individual duplicons, as well as regions biased for the accumulation of segmental duplications such as pericentromeric and subtelomeric regions, has also demonstrated considerable variation between species (Trask et al. 1998a; Horvath et al. 2000b; Young et al. 2002; Horvath et al. 2003). Fluorescence in situ


hybridization analysis of material from the 15q11-q13 breakpoints has indicated a long

evolutionary history of duplicated structure, as the initial duplication of material in

15q11-q13 predates the divergence of the Old World monkey species from a primate

common ancestor (Christian et al. 1999). It has been impossible, however, to perform more detailed

evolutionary analyses of the PWS/AS breakpoint segmental duplication clusters due to the lack of extended sequence assemblies into the breakpoint regions.

In this study we present the sequence structure of the highly duplicated PWS/AS breakpoints using techniques originally developed to investigate highly paralogous pericentromeric regions (Horvath et al. 2000a). We have analyzed these sequence contigs for segmental duplication content and note a surprising amount of interchromosomal duplication in addition to previously identified breakpoint components such as LCR15 and HERC2 duplicons (Buiting et al. 1992; Buiting et al. 1998; Amos-Landgraf et al. 1999; Christian et al. 1999; Pujana et al. 2001; Pujana et al. 2002). Comparative analysis at the sequence and cytogenetic level indicated the initial expansion of the HERC2 locus occurred after the divergence of prosimians, but prior to the divergence of the Old World monkeys, and potentially prior to the divergence of the marmoset lineage ~45 million years ago (Mya). Analysis of multiple HERC2 loci from comparative BAC sequencing in lemur, baboon and chimpanzee demonstrates lineage-specific evolution of the HERC2 duplicon and suggests homogenization of HERC2 sequences has occurred in multiple independent primate lineages, including recently in the human lineage. The presence of HERC2 duplicons in the chimpanzee and baboon genomes in non-orthologous relationships compared to the human genome indicates frequent repositioning of unique genic material in 15q11-q13 in direct relation to segmental duplications.

RESULTS

15q11-q13 Sequence Assemblies

Efforts to build sequence contigs within the PWS/AS breakpoint clusters began with the selection of the HERC2 duplicon as the seed sequence for contig construction (see Methods). Previous studies with probes derived from the HERC2 locus, including the MN7 probe, had mapped the HERC2 duplicon to all three major PWS/AS breakpoints, and the duplicon was therefore an excellent candidate sequence for the initiation of contig development. Multi-sequence alignments produced by similarity searches with the HERC2 genomic sequence (AC004583) against all available BAC sequence databases (NT and HTGS) allowed for the identification of paralogous sequence variants, or base pair changes specific to one copy of a duplicated sequence. Primers were then designed to flank these diagnostic base pair changes, resulting in paralogous STS H2, which is located between exons 1 and 2 of HERC2 (Figure 6-1B; Table 6-1). For comparative mapping of HERC2-related sequences in primates, STS H15 (targeted to the approximate mid-point of the genomic locus) and H16 (targeted to the 3’ UTR of HERC2) were designed in unique regions of the HERC2 locus as determined by sequence similarity searches.
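The idea of a paralogous sequence variant, a base change specific to one copy of a duplicated sequence, can be sketched as a scan over the columns of a multi-sequence alignment. The sketch below is illustrative only: the function name `find_psvs`, the copy labels and the toy sequences are hypothetical and are not taken from the HERC2 alignments.

```python
from collections import Counter


def find_psvs(alignment):
    """Return (column, copy_name, base) for columns where exactly one
    aligned copy differs and all other copies agree.

    alignment: dict mapping copy name -> aligned sequence of equal length.
    Columns containing gaps ('-') are skipped.
    """
    names = list(alignment)
    if len(names) < 3:
        raise ValueError("need at least three aligned copies")
    length = len(alignment[names[0]])
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("aligned sequences must have equal length")

    psvs = []
    for col in range(length):
        column = {name: alignment[name][col] for name in names}
        counts = Counter(column.values())
        if len(counts) != 2 or "-" in counts:
            continue  # invariant, multi-allelic or gapped column
        minor_base, minor_count = counts.most_common()[-1]
        if minor_count != 1:
            continue  # variant is shared by more than one copy
        copy = next(n for n, b in column.items() if b == minor_base)
        psvs.append((col, copy, minor_base))
    return psvs


# Toy alignment (hypothetical sequences): copy_C differs at column 3.
toy = {
    "copy_A": "ACGTACGT",
    "copy_B": "ACGTACGT",
    "copy_C": "ACGAACGT",
}
toy_psvs = find_psvs(toy)
```

Columns reported this way are candidates for primer design, since amplification and sequencing across them can distinguish one copy from the others. Allelic variation is not distinguished from paralogous variation by this scan.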


Table 6-1. BAC library hybridization STS probes. [Table body not recoverable from the source text. Columns: STS ID; Forward Primer (5’ -> 3’); Reverse Primer (5’ -> 3’); Copy No. Status (multi-copy or single-copy). Footnotes indicate that copy number status refers to the copy number within the human genome and that all PCR reactions were performed with an annealing temperature of 55°C.]


Hybridization of the H2 paralogous STS against the RP11 human BAC library

(18.7-fold coverage) identified 145 H2 positive BAC clones, representing approximately

8 copies of the H2 sequence in the human genome based on library coverage (Table 6-2).

Table 6-2. HERC2 copy number estimate by BAC library hybridization. [Table body not recoverable from the source text. Columns appear to include: Library; Species; Coverage (x); Positives; Copy Number; Divergence (Mya). Species abbreviations: HSA = Homo sapiens, PTR = Pan troglodytes, GGO = Gorilla gorilla, PPY = Pongo pygmaeus, PHA = Papio hamadryas, MMU = Macaca mulatta, CJA = Callithrix jacchus, LCA = Lemur catta.]


Additionally, sequencing of H2 PCR products from the H2-positive BAC clones was

performed and the sequences were analyzed with the program CONSED (Gordon et al.

1998). Using CONSED, 8 major sequence signatures were identified, with 3 subtypes

differentiated from the major 8 types by a single base pair difference (data not shown).

Thus, we detected an upper bound of 11 copies of the H2 sequence in the genome,

although this figure does not account for potential allelic variation within the H2

sequence as opposed to paralogous variation. Hybridization of the H2 STS to the LLNL-

15 chromosome 15 cosmid library, which was derived from a monochromosomal source,

and subsequent sequencing of H2 STS PCR products resulted in 6 distinct sequence

signatures, with an additional 2 single nucleotide variants, for a theoretical upper bound

of 8 copies of the H2 sequence within the cosmid library. Given the previous reports of

HERC2-related sequence on chromosome 16, which would be present within the RP11

pool of H2-positive BAC clones, but not the pool of cosmid clones, these data suggest

there are approximately 8-10 copies of H2-related sequence within the human genome, although it should be noted that the results for the H2 STS do not necessarily ascertain all copies of the HERC2 duplicon (see below).
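The two copy number estimates in this paragraph reduce to simple arithmetic, sketched below with the values given in the text. The function names are hypothetical, and the sketch assumes that library clones sample the genome uniformly at the stated fold coverage.

```python
def copies_from_library(positive_clones, fold_coverage):
    """Estimate genomic copy number as positive clones per genome equivalent.

    ASSUMPTION: clones sample the genome uniformly at the stated coverage.
    """
    return positive_clones / fold_coverage


def upper_bound_from_signatures(major_signatures, single_base_subtypes):
    """Upper bound on copies if every distinct signature is a separate copy.

    Allelic variants would inflate this bound, as noted in the text.
    """
    return major_signatures + single_base_subtypes


# RP11 BAC library: 145 H2-positive clones at 18.7-fold coverage.
rp11_estimate = copies_from_library(145, 18.7)

# CONSED signatures from RP11 BACs and from the chromosome 15 cosmid library.
rp11_upper_bound = upper_bound_from_signatures(8, 3)
cosmid_upper_bound = upper_bound_from_signatures(6, 2)
```

The coverage-based value is about 7.8, which rounds to the approximately 8 copies reported, and the signature counts give the upper bounds of 11 and 8.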

The HERC2-related BAC clones that were identified by similarity searches against the NT and HTGS databases were subsequently used as queries in additional searches to develop extended sets of contiguous BAC clones (see Methods)(Figure 6-1C).

In anticipation of extensive paralogy within and adjacent to the HERC2 duplicons, the thresholds for BAC overlap were stringent. First, only RP11 BAC clones were evaluated where possible, as we sought to limit the effect of heteromorphism within diploid BAC libraries, yet maximize the potential for finding overlaps. Second, the sequence


similarity threshold for overlap was considered 99.9% identity or greater. Third, overlaps

were to extend greater than 10 kilobases (kb) in length and join end to end. Using these

strict criteria, three major sequence scaffolds, or groups of linked contigs of BAC

sequences, containing HERC2-related sequence were constructed. The scaffolds mapped to the pericentromeric region of 15q11 (described in detail in Chapter 4), the region spanning distal BP1 and across BP2, and lastly extending from the HERC2 locus in 15q13 proximally into BP3 (Figure 6-1A).
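The three overlap criteria can be expressed as a single check, sketched below. The class and function names are hypothetical, the example overlaps are invented for illustration, and the library criterion is applied strictly here even though the text applies it only where possible.

```python
from dataclasses import dataclass


@dataclass
class Overlap:
    """A candidate overlap between two BAC sequences (hypothetical record)."""
    libraries: tuple         # source library of each of the two clones
    percent_identity: float  # sequence identity across the overlap
    length_kb: float         # length of the overlap in kilobases
    end_to_end: bool         # True if the clones join end to end


def meets_criteria(overlap, library="RP11", min_identity=99.9, min_length_kb=10.0):
    """Apply the thresholds stated in the text: same library, identity of
    99.9% or greater, overlap longer than 10 kb, and an end-to-end join."""
    return (
        all(lib == library for lib in overlap.libraries)
        and overlap.percent_identity >= min_identity
        and overlap.length_kb > min_length_kb
        and overlap.end_to_end
    )


# Hypothetical examples, not taken from the scaffolds.
accepted = meets_criteria(Overlap(("RP11", "RP11"), 99.95, 42.0, True))
too_short = meets_criteria(Overlap(("RP11", "RP11"), 100.0, 5.0, True))
too_divergent = meets_criteria(Overlap(("RP11", "RP11"), 99.2, 60.0, True))
```

Requiring near-perfect identity is what separates a true overlap from a paralogous match, since paralogs in these regions can exceed 98% identity over long stretches.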

Several additional paralogous STS were designed at the termini of the 15q contigs for subsequent BAC library hybridization (Figure 6-1C; Table 6-1). The resulting BAC clones were end sequenced, and those BAC end sequences searched against the NT and HTGS databases to identify clones that extended into the adjacent sequence gaps. Additionally, the paralogous STS was PCR amplified from the BACs and sequenced, and that sequence matched to the contig as additional validation of the overlap. In this manner, several BACs have been identified that link together contigs built using the reiterative sequence similarity search method (Figure 6-1C).

To validate the sequence scaffolds, a fosmid end sequence placement strategy was employed (Figure 6-17) (Bailey and Tuzun, unpublished). Briefly, the three scaffold sequences were used as queries against a database of 2.3 million human fosmid end sequences generated by the Broad Institute, formerly the Whitehead Institute Center for Genome Research, under the auspices of the Human Genome Project. Due to the limited size variation of fosmid clones, which are typically 38-42 kb in length, discordances within the scaffolds are identified by fosmid paired ends that exceed the physical properties of a fosmid clone. Since the majority of the scaffolds are comprised of finished BAC sequences, the fosmid paired-end method was in agreement with our BAC assembly. The regions which showed disparity were typically either gaps in the scaffolds or within regions in which working draft status BAC sequences were used in the scaffolds (Figure 6-17).
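The logic of the fosmid paired-end test can be sketched as a classification of each end pair by the span between its two placements. The 38-42 kb window comes from the text; the function name, the coordinate convention and the orientation check are our assumptions and do not describe the unpublished method itself.

```python
def classify_fosmid_pair(pos_a, strand_a, pos_b, strand_b,
                         min_bp=38_000, max_bp=42_000):
    """Classify a pair of fosmid end placements on one scaffold.

    pos_*    : scaffold coordinate (bp) of each end sequence
    strand_* : '+' or '-' placement strand of each end sequence
    ASSUMPTION: a concordant pair has its ends facing each other
    ('+' on the left, '-' on the right) and spans 38-42 kb.
    """
    left, right = sorted([(pos_a, strand_a), (pos_b, strand_b)])
    if not (left[1] == "+" and right[1] == "-"):
        return "orientation"
    span = right[0] - left[0]
    if span < min_bp:
        return "too_short"
    if span > max_bp:
        return "too_long"
    return "concordant"


# Hypothetical placements on a scaffold.
ok_pair = classify_fosmid_pair(100_000, "+", 140_000, "-")
stretched_pair = classify_fosmid_pair(100_000, "+", 190_000, "-")
```

Clusters of pairs classified as too long, too short or misoriented would point to a misjoin, a collapsed duplication or a gap at that position in the scaffold.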

The BP1_Prox scaffold is anchored in monomeric alpha satellite sequence and extends approximately 1.4 Mb in length, with two sequence gaps closed by clones RP11-1115P6 and RP11-992L12 (Figure 6-1C) (Chapter 4). The BP1_to_BP2 scaffold is approximately 1.3 Mb in length, with two clone gaps closed by clones RP11-1276B22 and RP11-899L8 (Figure 6-1C). Note, this scaffold likely terminates within BP1 and potentially does not encompass the entire BP1 region. The BP3_Prox scaffold is 900 kb in length, extending distally from the 15q13.1 HERC2 locus into the BP3 duplication cluster, with one sequence gap spanned by clone RP11-1174I20 (Figure 6-1C).

Three exceptions to the stringent criteria above should be noted for the construction of these scaffolds. First, the most proximal clone in the BP1_to_BP2 scaffold, AC135058, is from the CalTech D library; however, this clone was included due to extensive homology with AC100757 (>50 kb, >99.9% identity). Second, in the BP3_Prox scaffold the homology between AC135348 and AC055876 extends only ~5 kb at 100% identity. Third, AC135348 is from the RP13 human female BAC library. We suggest these last two exceptions are negligible due to the identification of an overlapping RP11 BAC clone (RP11-1174I20).


Scaffold Comparison to the Genome Assembly

The 15q11-q13 sequence scaffolds contain a total of 21 BAC overlaps that were compared to build34 of the genome assembly to assess the agreement between assemblies. Overall, our assembly agreed well with that of build34, as 19/21 overlaps between the two assemblies were concordant; however, there were notable exceptions. First, within the BP1_Prox contig, the overlap of AC138748 with AC023310 is discordant with build34. Within build34, AC138748 is placed overlapping AC131280. AC131280 contains HERC2-related material and, by our stringent criteria, overlaps no BAC sequences within the NT and HTGS databases. AC138748 does, however, overlap AC023310 within the guidelines of our criteria, although this clone has been expunged from build34. AC131280 was included in the duplication analysis described below, due to the presence of HERC2-related sequence. It should be noted that although the overlaps between AC060814/AC068446 and AC135068/AC134781 were in agreement with build34, the genome assembly also indicates there are several other overlapping BAC clones at these sites. Examples such as AC136532, AC012414, AC136774, AC020679, AC037471 and AC134980, among others, do not meet the stringent criteria for overlap with the clones in the BP1_Prox assembly. We suggest this is a result of the compression of duplicated regions that occurs during genome assembly due to the high degree of sequence similarity between large blocks of paralogous sequence. The BP1_to_BP2 scaffold clone overlaps differ from the genome assembly in two areas. First, the most proximal clone, AC135058, is placed more proximally within 15q11 in the genome assembly and is discontiguous with AC100757 despite the extensive homology detected in this analysis. Second, clone AC016033, which we suggest is located in the interior of

172

the BP2 region and overlaps AC100755, has been expunged from build34. Lastly, the

BP3_Prox scaffold clone overlaps do not differ from the genome assembly, however we

suggest finishing of clone AC139682 will reveal an extension of this clone distal of

AC091304, which is likely hidden by the extensive paralogy within the BP3 region. This

was evident from analysis of the AC139683.1.2 paralogous STS, which identified two

similar yet distinct sequence signatures within BAC AC139682 (data not shown). One of

these sequence signatures was identical to that of AC091304, confirming the overlap

between these two clones. The other sequence signature from AC139682 therefore

identifies a specific copy of the duplicon also present in AC091304 and indicates

AC139682 extends distally.

Although the overlaps between accessions used within the 15q11-q13 scaffolds are in general agreement with the assembly, nearly every sequence gap region within the scaffolds is discordant with respect to build34. In other words, although the overlaps within the contigs agree, the order and orientation of the contigs within the scaffolds do not. For example, the central region of the BP1_to_BP2 contig, comprised of AC011767, AC090764, AC135069 and AC141254, which contains the entire unique region between BP1 and BP2, is in an inverted orientation with respect to build34. This orientation is based on the BAC end sequence placement results for BAC RP11-899L8 and analysis of the AC141254.1/.2 paralogous STS (Figure 6-1C). Future sequencing of the clones which we have identified through paralogous STS analysis and BAC end sequence placement will resolve this issue.


Duplication Content of the PWS/AS Breakpoints

The BP1, BP2 and BP3 regions encompassed by the BP1_to_BP2 and BP3_Prox scaffolds were subjected to whole-genome alignment comparison (WGAC) analysis, as previously described (Bailey et al. 2001). Briefly, high copy repeats are removed from a query sequence, in a process called fuguization, which reduces the complexity of the sequence for similarity searches. The fuguized sequences are then searched against the genome assembly by BLAST, the repeats are re-inserted post-BLAST and the ends of the alignments are trimmed with a heuristic algorithm. Performing this procedure on the BP1, BP2 and BP3 sequences allowed for a quantitative evaluation of the intrachromosomal and interchromosomal duplication within the PWS/AS breakpoint regions. Note only alignments >1 kb in length were considered for both interchromosomal and intrachromosomal analyses.
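The repeat removal and re-insertion steps of fuguization can be illustrated with a minimal Python sketch. This is not the WGAC software itself: it assumes repeats have already been marked as lowercase bases in the query sequence, and the function names `fuguize` and `to_original_coord` are hypothetical. The sketch shows only how stripped coordinates can be mapped back to the original sequence once a similarity search has been run; the BLAST search and the heuristic end trimming are omitted.

```python
from bisect import bisect_right

def fuguize(seq):
    """Strip lowercase (repeat-marked) bases from seq.

    Returns the stripped sequence and a list of blocks
    (fuguized_start, original_start, length), one per retained run,
    so that coordinates can later be mapped back.
    """
    kept, blocks = [], []
    fug_pos = 0
    i, n = 0, len(seq)
    while i < n:
        if seq[i].isupper():
            j = i
            while j < n and seq[j].isupper():
                j += 1
            blocks.append((fug_pos, i, j - i))
            kept.append(seq[i:j])
            fug_pos += j - i
            i = j
        else:
            i += 1  # skip a repeat base
    return "".join(kept), blocks

def to_original_coord(blocks, fug_pos):
    """Map a 0-based position in the fuguized sequence back to the original."""
    starts = [b[0] for b in blocks]
    k = bisect_right(starts, fug_pos) - 1
    fug_start, orig_start, _ = blocks[k]
    return orig_start + (fug_pos - fug_start)

# Toy example: lowercase runs stand in for high copy repeats.
query = "ACGTacgtacgtGGCCttAA"
stripped, blocks = fuguize(query)
print(stripped)                      # unique sequence only
print(to_original_coord(blocks, 5))  # position of that base in the query
```

Mapping alignment endpoints back through the block list is what allows the repeats to be "re-inserted" after the search, so that reported alignment coordinates refer to the unmodified sequence.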

The WGAC analysis of BP1, BP2 and BP3 produced alignments to 14 chromosomes other than chromosome 15, including chromosomes 1, 2, 3, 5, 7, 13, 14, 16, 17, 19, 20, 22, X and Y (Figure 6-2).

Figure 6-2. Interchromosomal duplication within BP1, BP2 and BP3. Whole-genome alignment comparison (WGAC) analysis of the PWS/AS breakpoint sequences demonstrates extensive interchromosomal duplication. Chromosomes are depicted as horizontal lines, and the centromeres are represented as purple boxes. Pairwise alignments between the breakpoint sequences and interchromosomal sites are shown as colored lines, coded according to the breakpoint involved. BP1 alignments are shown in blue, BP2 alignments in green, and BP3 alignments in red. Note the scale of the outer chromosomes is 1,000-fold greater than that of the inner breakpoint sequences.

Interchromosomal alignments to the breakpoint regions totaled 1.86 Mb, in 441 alignments, with an average alignment length of 4.2 kb, a range of 1 kb to 26 kb, and an average 94.0% identity (Table 6-3).

Table 6-3. BP1, BP2 and BP3 WGAC alignment statistics.

Interchromosomal Alignments
Breakpoint    No. of Align.    Avg. Identity    Aligned bp    Avg. Length (bp)
BP1           78               94.16%           373234        4785
BP2           153              93.92%           597110        3903
BP3           210              94.11%           894876        4261
Total         441              94.05%           1865220       4230

Intrachromosomal Alignments
Breakpoint    No. of Align.    Avg. Identity    Aligned bp    Avg. Length (bp)
BP1           100              95.03%           1033710       10337
BP2           252              95.82%           1857917       7373
BP3           313              95.64%           2323415       7423
Total         665              95.62%           5215042       7842
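The totals and average lengths in Table 6-3 follow directly from the alignment counts and aligned base pairs. A minimal Python sketch of that arithmetic is given here; the counts and base-pair values are copied from the table, while the variable and function names are hypothetical and chosen only for illustration.

```python
# (number of alignments, aligned bp) per breakpoint, copied from Table 6-3
INTER = {"BP1": (78, 373234), "BP2": (153, 597110), "BP3": (210, 894876)}
INTRA = {"BP1": (100, 1033710), "BP2": (252, 1857917), "BP3": (313, 2323415)}

def average_length(count, aligned_bp):
    """Average alignment length in bp, rounded to the nearest base."""
    return round(aligned_bp / count)

def totals(table):
    """Return (total alignments, total aligned bp, overall average length)."""
    n = sum(count for count, _ in table.values())
    bp = sum(aligned for _, aligned in table.values())
    return n, bp, average_length(n, bp)

for name, table in (("interchromosomal", INTER), ("intrachromosomal", INTRA)):
    n, bp, avg = totals(table)
    print(f"{name}: {n} alignments, {bp / 1e6:.2f} Mb, average {avg} bp")
```

Averaging over the pooled alignments, rather than averaging the three per-breakpoint means, is what reproduces the Total row.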

As a reflection of the sequence shared among all three PWS/AS breakpoints, BP1, BP2 and BP3 all produced alignments to chromosomes 2, 3, 5, 7, 13, 14, 16, 20 and X (Table 6-4).

Table 6-4. Distribution of interchromosomal duplications.

Chromosome BP1 BP2 BP3 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y

BP2 and BP3 also shared alignments to chromosomes 1, 17, 19 and 22, and BP3 alone had alignment to the Y chromosome (Table 6-4). Note the BP1 sequence included in WGAC analysis was likely not the entire BP1 region; thus, continued efforts to map proximally into BP1 may yield regions of interchromosomal homology similar to those found in BP2 and BP3. In contrast to the extreme sharing of interchromosomal duplications, a subset of duplications appeared to be breakpoint specific. For example, a 26 kb duplication from proximal chromosome 13 aligns only to BP1, while the duplication of a more distal site on chromosome 13 is shared among BP1, BP2 and BP3 (Figure 6-2).

With respect to intrachromosomal alignments, we excluded alignments above 99.9% sequence identity to avoid biasing the results due to the alignment of the scaffold sequence with the identical sequence within build34. Using this criterion, a total of 665 intrachromosomal alignments were detected (Table 6-3). The sum of alignment lengths was 5.2 Mb, with an average alignment length of 7.8 kb and average 95.6% identity.
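The pattern of interchromosomal sharing described in this section can be summarized with simple set operations. The Python sketch below encodes only what is stated in the text; it assumes, as the text implies but does not state outright, that BP1 aligns to no chromosomes beyond the nine shared by all three breakpoints. The variable names are hypothetical.

```python
# Chromosomes with interchromosomal alignments, as stated in the text
shared_by_all = {"2", "3", "5", "7", "13", "14", "16", "20", "X"}
bp2_and_bp3_only = {"1", "17", "19", "22"}
bp3_only = {"Y"}

# Assumption: BP1 has no alignments beyond the chromosomes shared by all three.
bp1 = set(shared_by_all)
bp2 = shared_by_all | bp2_and_bp3_only
bp3 = bp2 | bp3_only

print(len(bp1 | bp2 | bp3))      # chromosomes hit by any breakpoint
print(sorted(bp1 & bp2 & bp3))   # chromosomes hit by every breakpoint
print(sorted(bp3 - bp2))         # chromosomes specific to BP3
```

Under these assumptions the union covers the 14 chromosomes listed for the WGAC analysis.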

Examination of the alignments using the graphical program PARASIGHT (J. Bailey, unpublished) allowed for determination of ancestry based upon the minimal evolutionary shared segment or MESS (Bailey et al. 2002b). In practice, the MESS was qualitatively determined by inspection of sequence similarity searches of the scaffolds and HERC2 duplicon-containing BAC sequences against the NT and HTGS databases, as well as build34 (Figure 6-3).


Figure 6-3. Duplication content of the 15q11-q13 scaffolds and unanchored HERC2-related sequences. A) The scaffold sequences are indicated as the horizontal lines, with tick marks every 100 kb. The tiling path of BACs which comprise the scaffold is indicated by grey boxes underneath each scaffold. Duplicons determined by minimal evolutionary shared segment (MESS) analysis are depicted above the black line and are color coded. The horizontal boundaries of each duplicon block are directly related to the coordinates of pairwise alignment. Breakpoint regions are indicated above each block of duplications. The major intrachromosomal components of the PWS/AS breakpoint clusters, HERC2 and LCR15 duplicons, are indicated. B) Duplicon analysis of orphaned HERC2 sequences. One unanchored contig, AC025138_AC136687, and four unanchored clones containing HERC2 duplicons are indicated as the horizontal black lines, with tick marks every 20 kb. Duplicons are indicated above the black line and color coded as in panel A. The HERC2 and LCR15 duplicons are indicated.

The ancestral position is typically the most divergent uninterrupted alignment, although local fragmentation of the most divergent alignment was allowed in order to establish the largest possible region of ancestry. Using this methodology, the intimate association of HERC2-related and LCR15-related sequences within the PWS/AS breakpoint clusters became evident (Figure 6-3). As a reflection of this, the large (>20 kb) HERC2 duplicons found within BP1, BP2 and BP3 are flanked on both sides by LCR15 duplicons (Figure 6-3A). Additionally, there appear to be smaller degenerate adjacent LCR15 and HERC2 duplicons at the distal edge of both BP1 and BP2 (Figure 6-3A). There is also a pair of smaller HERC2/LCR15 duplicons at the distal terminus of the BP3_Prox contig; however, since this contig does not include distal flanking unique sequence, it is impossible to determine at this time if this represents the distal edge of BP3 (Figure 6-3A). MESS analysis of an unanchored contig (AC025138-AC136687) and four unanchored clones identified through similarity searches with the HERC2 locus seed sequence also indicated a close relationship between HERC2 and LCR15 duplicons (Figure 6-3B).

Several segments were shared among the scaffolds and unanchored clones, such as the 13q31.3 duplication, which appeared to be associated with the ABCB10 duplicon in multiple locations (BP3_Prox and AC136693). Also, a segment from 20q13.12 which contained a DEXI pseudogene appeared in clones AC138649 and AC136776. The segment from 3p22.1 was shared among the AC025138 contig and multiple unanchored clones (AC131280 and AC136776). In contrast, duplications such as the PARN duplication from 16p13 to the distal edge of BP2 were breakpoint specific and not shared among any of the other breakpoints or unanchored clones (Figure 6-3A). This particular duplicated segment appears solely at this location and has not been swapped to other HERC2/LCR15 duplication clusters. The ancestral human duplication donor site was not evident for all duplicated segments, however, as multiple segments appeared to be duplicated within chromosome 15, yet no low-homology alignments indicated an origin for the duplication (Figure 6-3A, dark grey boxes). These sequences are present only within the segmental duplication clusters of 15q11-q13.

Relatedness and Orientation of the PWS/AS Breakpoints

MIROPEATS was utilized to investigate the relatedness of the BP1, BP2 and BP3 sequences, and also to examine the sequence relationships within BP3, which showed evidence of an inverted structure in a previous study (Christian et al. 1999). The BP3_Prox analysis was performed using sequence masked for high copy repeats by REPEATMASKER and a MIROPEATS threshold of 4000 (Parsons 1995; Smit and Green 1999). The inverted nature of the BP3 region within the BP3_Prox scaffold was confirmed by MIROPEATS, as displayed using PARASIGHT (Figure 6-4) (Jeffrey Bailey, unpublished).

Figure 6-4. The palindromic structure of BP3. A 360 kb interval of the BP3 scaffold is depicted as a horizontal black line, with tick marks every 20 kb. Duplicons are indicated as the color coded boxes beneath the horizontal line. Sequence similarity relationships detected by MIROPEATS (Parsons 1995) are indicated by the blue blocks above the horizontal line connected by the blue lines. A threshold of 5000 was used for MIROPEATS analysis. Note the region around 560 kb is fragmented because the underlying sequence (AC139682) is working draft and is not ordered and oriented. The palindromic repeats are indicated by the gray arrows.

The axis of symmetry of the inverted structure falls within a 13q31 duplication, which is present in association with LCR15 and HERC2 duplicons in multiple locations (Figure 6-4). By overlaying the MIROPEATS analysis with the duplicons determined by the MESS methodology, it appears a 100 kb block of duplications including HERC2, FLJ36131, FLJ00287, LCR15, ABCB10 and 13q31 duplicons has been involved in the inverted duplication (Figure 6-4). Note the fragmentation of the LCR15, ABCB10 and 13q31 duplicons at the distal edge of the central 13q31 duplicon is likely due to the underlying working draft sequence (AC139682) and is therefore an artifact.

MIROPEATS comparisons of the BP1_to_BP2 and BP3_Prox scaffolds revealed extensive homology between the PWS/AS breakpoint regions (Figure 6-5).


Figure 6-5. Homology between PWS/AS breakpoints BP1, BP2 and BP3. The horizontal line represents a fusion of the BP1_to_BP2 and BP3_Prox scaffolds, with the fusion point indicated by the vertical arrow. In other words, the entire BP2-BP3 unique region has been removed. The tick marks are placed every 200 kb. Sequence similarity relationships surpassing a MIROPEATS (Parsons 1995) threshold of 4000 are indicated by the blue boxes and connecting blue lines. The orientation of the breakpoints, relative to each other, is indicated by the horizontal arrow underneath each breakpoint.

The BP1 and BP2 duplication clusters appeared to be in a direct orientation with respect to each other. The relationship between BP1 and BP2 with the BP3 breakpoint is complicated by the internal inverted repeat structure discussed above. The intact HERC2 locus, which is directly adjacent to BP3, is oriented distal to proximal in 15q13; therefore, the inverted duplication is oriented proximal to distal. The BP1 and BP2 regions appear to be in a direct orientation with respect to this distal copy of the BP3 inverted repeat. Additionally, there appeared to be inverted repeat structures within the BP2 region, although not to the extent found within BP3.

Sequence Similarity of the Human HERC2 Duplicons

The distribution of HERC2-related loci was determined by sequence similarity searches of the scaffolds and clones with the HERC2 locus sequence. A total of 15 HERC2 duplicons were identified, designated H’1-H’15, that totaled 579 kb of pairwise alignment of an average 96.4% identity to the HERC2 locus (Figure 6-6).

Figure 6-6. Sequence similarity among HERC2 duplicons. The 15q11-q13 scaffolds, the unanchored contig and unanchored clones are indicated by horizontal lines. The HERC2 duplicons are depicted as blue boxes, designated by H’ followed by a number. The 15q13 HERC2 locus is shown in the center, and pairwise alignments with the locus are indicated by the colored lines radiating outward. The length and average percent identity of each HERC2 duplicon are indicated at each site. The scaffolds are shown in the proximal to distal orientation. Note the pattern of small HERC2 duplicons flanking a larger duplicon, seen in BP1, BP2 and BP3.

The average length of a HERC2 duplicon was 36.3 kb; however, this number is misleading considering the distribution of HERC2 duplicon lengths (Figure 6-7).

Figure 6-7. Histogram of HERC2 duplicon length. The lengths of the 12 HERC2 duplicons not interrupted by the termination of a BAC, contig or scaffold sequence are plotted. The vertical axis represents the length in kb of each duplicon; the horizontal axis includes the full length duplicons arranged from smallest to largest. The duplicons in general fall into three size categories (<10 kb, ~25 kb and >80 kb); however, the small and large duplicons are predominant.

Note, of the total 15 HERC2 duplicons, 3 were interrupted by the termination of a scaffold or clone, and were therefore excluded from the average duplicon length calculation and graph. The HERC2 duplicons can be separated into three distinct size classes: those 10 kb and under (H’3, H’4, H’6, H’7, H’9, H’13, H’15), those between 10 kb and 80 kb (H’5), and those >80 kb (H’1, H’2, H’8, H’10). The pericentromeric contig, BP1 and BP3 all appear to contain a large (>80 kb) duplicon, and in fact, the BP1 and BP3 duplicons are 99 kb in length and are 99.6% and 99.7% identical, respectively, to the HERC2 locus. Furthermore, a pattern emerges when examining the distribution of HERC2-related sequences such that within the breakpoint clusters each large HERC2 duplication is flanked by smaller duplicons (Figure 6-6). This proves to be the case for the distal region of BP1, both the proximal and distal flanks of BP2 and in BP3, where H’8 is adjacent to two HERC2 duplicons under 5 kb. An exception to this trend is the pericentromeric copy of HERC2, H’1, which is 84 kb in length and 96.6% similar to the HERC2 locus, yet exhibits no flanking HERC2-related loci.

Structure of the HERC2 Duplicon

Sequence analysis of the HERC2 duplicon revealed that the duplicon is not derived from one contiguous sequence, but rather a patchwork of various segments from the HERC2 locus (Figure 6-8).


Figure 6-8. Structure of the HERC2 duplicon. The HERC2 locus is shown as the horizontal black line. Exons are indicated as the black boxes above the line. Note the orientation has been flipped with respect to the chromosomal orientation. In other words, the direction of transcription is shown as 5’ to 3’, left to right. Tick marks are positioned every 20 kb. The pairwise alignments between the HERC2 locus and the HERC2 duplicons are shown below the line, and each duplicon is represented by a row of colored boxes. The blue boxes represent pairwise alignments between the duplicon and the locus sequence within chromosome 15. Pairwise alignments to chromosome 16, obtained from sequence similarity searches of build34, are shown in gray. The white box represents the coverage of PAC RP5-778A2 (AC004583) which was used in comparative FISH analysis (Figure 6-10).

There was a bias, however, toward the 5’ half of the HERC2 locus in duplicon content. The majority (14/16) of HERC2 duplicons are comprised of sequence from the 5’ end of the locus sequence (Figure 6-8). The exceptions are the H’5 duplicon from the BP2 region and the H’16 duplicon, which is a remnant of an older duplication event (94% sequence identity, 28 kb in length) distal of BP3. The large duplicons, H’1, H’2, H’8, H’10 and H’14 (all >80 kb), share a strikingly similar structure, which suggests a common origin for all these large duplicons and not independent derivations.

Additionally, a trend is noted in which the smaller duplicons are derived from the 5’ half of the HERC2 locus. Phylogenetic analysis supports this relationship (see below). Two explanations are possible. One is that the smaller duplicons are remnants of larger duplicons that have undergone deletion and sequence divergence such that the only detectable remaining homology is to the 5’ sequence. This implies that the sequence similarity across the HERC2 duplicon is not uniform (discussed below). Alternatively, the 5’ half of the HERC2 locus is more prone to spawning segmental duplications via an unknown mechanism.


Copy Number Estimates of the HERC2 Locus in Humans and Primates

One method to estimate the copy number of the HERC2 duplicon is BAC library hybridization (see Methods). We hybridized the H2 STS to four segments of the RP11 BAC library and identified 145 clones, over 18.7X genome coverage, yielding an estimate of approximately 8 copies (Table 6-2). This was in agreement with previous estimates of total HERC2 duplication in the human genome, but short of the 15 duplicons detected by sequence-based analysis. The difference arises because the H2 STS sequence is not present in every copy of the HERC2 duplicon. Additional hybridizations against primate BAC libraries from chimpanzee, gorilla, orangutan, baboon, macaque and lemur revealed a pattern of duplication consistent with the initial duplication of the HERC2 locus having occurred prior to the divergence of the Old World monkey species from a primate common ancestor. Additionally, HERC2 copy number has remained relatively constant through the great apes, although a decreased copy number in chimpanzee was observed (6 copies).
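The human copy number estimate can be reproduced with a short calculation. The Python sketch below assumes the estimate is obtained by dividing the number of positive clones by the fold coverage of the library segments screened, so that each genomic copy is expected to be sampled about once per genome equivalent; the function name is hypothetical, and the actual procedure is the one given in the Methods.

```python
def estimate_copy_number(positive_clones, genome_coverage):
    """Estimated copies per haploid genome from a BAC library screen.

    Assumes each genomic copy is represented, on average, once per
    genome equivalent of library coverage.
    """
    return positive_clones / genome_coverage

# Values reported for the H2 STS screen of the RP11 library
human_estimate = estimate_copy_number(145, 18.7)
print(f"{human_estimate:.2f} copies, rounded to {round(human_estimate)}")
```

Because the probe detects only those duplicons that contain the H2 STS sequence, an estimate of this kind is a lower bound on the total number of HERC2 duplicons.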

Comparative Southern Blot Analysis

A panel of primate genomic DNAs, consisting of chimpanzee, gorilla, orangutan, baboon, macaque, marmoset and lemur material, was digested with PstI and subsequently blotted and probed with a radiolabeled H2 STS probe (Figure 6-9).


Figure 6-9. Primate Southern blot analysis. A panel of human and non-human primate genomic DNAs was digested with PstI and hybridized with probe H2. Human = Homo sapiens, Chimpanzee = Pan troglodytes, Gorilla = Gorilla gorilla, Orangutan = Pongo pygmaeus, Baboon = Papio hamadryas, Macaque = Macaca mulatta, Marmoset = Callithrix jacchus, Lemur = Lemur catta. The top panel shows a short exposure of the hybridized membrane, the lower panel a longer exposure to visualize the band in the lemur lane. A molecular weight ladder is indicated to the right hand side of each blot. The band corresponding to the human HERC2 locus was predicted to migrate at approximately 2.5 kb, indicated by the arrow.

Two exposures of the identical blot are shown to enhance the signal from lemur, which appears to have a single copy of HERC2, likely corresponding to the HERC2 gene itself.

From this analysis the HERC2 duplicon appears to have duplicated since the divergence of the marmoset and the Old World monkeys (macaque and baboon) from a primate common ancestor. This time frame, approximately 45-35 million years ago, is older than previous estimates (Christian et al. 1999). Focusing on the great apes, several bands are shared between humans, chimpanzees and gorillas, although a subset of bands appears to be lineage specific. This could imply lineage-specific duplication of additional HERC2 duplicons in great apes, or rearrangements involving ancestral HERC2 duplicons that cause a band shift or sequence variation at a restriction site.

From BAC library hybridization data, the orangutan genome was estimated to contain 12 copies of the H2 STS sequence. In contrast, the Southern blot indicates 6 bands of widely varying intensity. Thus, it appears the orangutan genome contains several copies of the HERC2 duplicon in 6 distinct size classes, which differs from other great apes. The 4.0 kb band, for example, is distinctly more intense than the others, perhaps harboring several of the 12 estimated copies. Similarly, the bands vary in intensity within the human, chimpanzee and gorilla lanes, suggesting multiple copies of HERC2 may be co-migrating. The baboon lane shows a clustering of bands at approximately 3.0 kb, which suggests the presence of several highly similar copies, but there is also additional variation in band size within the baboon lane. Overall, the data clearly show that an expansion of the HERC2 duplicon occurred 45-35 million years ago and that a high copy number has been maintained, and they provide evidence for rearrangements in multiple independent primate lineages.

Comparative FISH Analysis

To explore the expansion of the HERC2 duplicon through primate evolution we have utilized comparative FISH. PAC RP5-778A2 (AC004583), which contains the 5’ half of the HERC2 genomic locus (Ji et al. 2000b), was used as a probe in hybridizations on metaphase chromosome preparations from human (Homo sapiens, abbreviated HSA), chimpanzee (Pan troglodytes, abbreviated PTR), gorilla (Gorilla gorilla, abbreviated GGO), orangutan (Pongo pygmaeus, abbreviated PPY), baboon (Papio hamadryas, abbreviated PHA), African green monkey (Cercopithecus aethiops, abbreviated CAE) and marmoset (Callithrix jacchus, abbreviated CJA) cell lines (Figure 6-10A).

Figure 6-10. Comparative FISH analysis of the HERC2 duplicon. A) Probe RP5-778A2 (AC004583; shown in red) was hybridized to a panel of human and non-human primate metaphase and interphase preparations. HSA = Homo sapiens, PTR = Pan troglodytes, GGO = Gorilla gorilla, PPY = Pongo pygmaeus, CAE = Cercopithecus aethiops, PHA = Papio hamadryas, CJA = Callithrix jacchus, EMA = Eulemur macaco, LCA = Lemur catta. Isolated chromosomes from each species have been extracted for comparative purposes. The signals in the baboon interphase hybridization have not been pseudo-colored for clarity. B) Lemur BAC clone LBNL2-68B10 was identified as HERC2 positive by BAC library hybridization and subsequently used as a FISH probe (shown in red) on a preparation of lemur metaphase chromosomes. Both the ring-tailed lemur (LCA) and black lemur (EMA) were examined. HERC2 appears single copy in these species by FISH hybridization. C) Baboon BAC clone RP41-197E14 (AC116043) was used as a probe (shown in red) in metaphase and interphase hybridizations of baboon material. This clone, which contains LCR15 and HERC2 duplicons by sequence analysis (see Figure 6-11), demonstrates the clustering of multiple intrachromosomal duplications in the baboon lineage, similar to the human organization.

Extensive sequence divergence between human and lemur prevented robust FISH signals using human BAC probes; therefore, we performed the FISH hybridization using BAC LBNL-2 68B10, which was positive by Southern hybridization with the H2 STS (Figure 6-10B). Also, to demonstrate the clustered multi-copy nature of the HERC2 and LCR15 sequences in PHA, FISH hybridizations were performed using PHA BAC RP41-197E14 (AC116043) (Figure 6-10C). Interphase images for hybridization of both RP5-778A2 and RP41-197E14 were also obtained (Figure 6-10A, 6-10C).

The comparative FISH analyses indicate the HERC2 locus was single copy in the genome of lemur and potentially marmoset, which diverged from other primate lineages approximately 45-50 million years ago (Goodman 1999). The HERC2 locus also appeared single copy in CJA, which diverged approximately 40 Mya. Multiple FISH signals were observed in both Old World monkey species tested (PHA and CAE), with an estimated divergence time of 23-25 Mya, which was consistent with previous studies (Christian et al. 1999). The Asian and African great ape species tested also showed multiple FISH signals; however, the HERC2 signal on HSA chromosome 16 was only observed in HSA, PTR and GGO, and not PPY. This suggests the initial duplication to chromosome 16 occurred after the Asian and African great apes diverged.

Primate Comparative Sequence Analysis

We selected four baboon BAC clones for sequencing as representative of HERC2 sequence structures. The clones were identified through BAC library hybridizations with the H2 and H15 STSs and were sequenced to working draft status (RP41-70C10: AC116041, RP41-127C1: AC116042, RP41-197E14: AC116043, RP41-199N8: AC119423). The BAC sequences were then analyzed for duplication content. Sequence similarity searches with accession AC119423 indicated this clone corresponded to the 3’ half of the baboon HERC2 locus, which is not duplicated extensively in the human genome and was excluded from further duplication analysis. The remaining three baboon accessions, however, contained HERC2 duplications (Figure 6-11).


Figure 6-11. Analysis of primate BAC sequences containing the HERC2 duplicon. BAC clones RP41-70C10 (AC116041), RP41-127C1 (AC116042), RP41-197E14 (AC116043) and CHORI250-9K16 (AC119799) were identified by primate BAC library hybridization with the H2 STS and subsequently sequenced. A horizontal line represents the sequence of each accession and tick marks are placed every 20 kb. The duplicons are indicated as boxes above the line, color coded as in Figure 6-3. A) The presence of the CYFIP1, NIPA2 and NIPA1 genes in AC116041 indicates this sequence is orthologous by position to H’3 in distal BP1. B) The single HERC2 duplicon within AC116042 is discontiguous due to the working draft status of AC116042. Duplication analysis indicates, however, that the relationship between LCR15 and HERC2 duplications, which is observed in humans, developed prior to the divergence of the Old World monkeys from a common ancestor of great apes. C) Two large (>40 kb) HERC2 duplicons and two LCR15 duplicons are found in tandem in AC116043, a configuration not observed in the human genome. D) Chimpanzee accession AC119799 also demonstrates an association of LCR15 and HERC2 duplicons. The juxtaposition of 15q13.2 sequence adjacent to an LCR15 duplicon and a large (>50 kb) HERC2 duplicon is not orthologous to the organization of the human genome.

In addition, chimpanzee BAC CHORI250-9K16 (AC119799), known to contain HERC2-related material through BAC end sequence analysis (data not shown), was sequenced and analyzed for duplications.

Baboon accession AC116041 mapped to the distal BP1 region, due to the presence of CYFIP1, NIPA2 and NIPA1 genic sequences (Figure 6-11A). The organization of these genes within this accession was orthologous to the human genome. Additionally, the HERC2 duplicon within AC116041 is orthologous by position to human duplicon H’3 (Figure 6-6). Accession AC116042 contained both HERC2 and LCR15 duplicons, demonstrating the relationship between these duplicons originated prior to the divergence of the Old World monkeys from a primate common ancestor (Figure 6-11B). Interestingly, AC116042 also contained sequence homologous to a region of a recent human-specific duplication of the CHRNA7 locus (described in Chapter 3). The human genome, however, does not contain a HERC2 duplication at this site; therefore, this sequence represents a non-orthologous transition from duplication to putatively unique sequence within the baboon genome. Accession AC116043 contained a tandem duplication of an LCR15/HERC2 duplicon cassette (Figure 6-11C). Although the juxtaposition of LCR15 and HERC2 duplicons is frequently observed in the human genome (Figures 6-3 and 6-4), the tandem arrangement of multiple large HERC2 duplicons in baboon contrasted with the pattern of HERC2 duplication observed in the human genome. Since not all of the 15q11-q13 breakpoint regions have been completely sequenced, the possibility cannot be excluded that such a structure exists in the human genome, but no human BAC sequence within the NT and HTGS databases was identified with a similar structure to that of AC116043. As seen with AC116042, the LCR15/HERC2 duplicon relationship is present at multiple sites within the baboon genome, much like that of the human genome. Additionally, the presence of an HPRT duplicon in AC116043 indicates this region is amenable to serving as a duplicative transposition acceptor, and that the interchromosomal duplication process which characterizes the human HERC2 segmental duplication clusters may be a long-standing property of these regions.

Chimpanzee accession AC119799 also contains a non-orthologous juxtaposition of HERC2/LCR15 sequence with respect to 15q13.2 (Figure 6-11D). Interestingly, this region of 15q13.2, in the human genome, is approximately 250 kb from the human-specific duplication associated with CHRNA7. Thus, both AC116042 and AC119799 contain non-orthologous juxtapositions of HERC2 duplicons in comparison to the human genome. Additionally, both events involve sequence distal of the PWS/AS breakpoint clusters, and both events are within ~500 kb of a site which has undergone human-specific duplication of the CHRNA7 locus. The coincidence of three lineage-specific events in one region (the CHRNA7 human-specific duplication, the juxtaposition of HERC2/LCR15 and 15q13.2 in baboon accession AC116042, and the juxtaposition of HERC2/LCR15 sequence with a distinctly different 15q13.2 sequence in chimpanzee accession AC119799) indicates this region has undergone substantial restructuring in multiple primate lineages.

Structural Divergence in Primate HERC2 duplicons

The sequencing of several primate BAC clones with HERC2-related sequence allows for the direct comparison of HERC2 duplicons from human, baboon and chimpanzee (Figure 6-12).


Figure 6-12. Divergence between the human, chimpanzee and baboon HERC2 duplicons. Shown are the results of three independent comparisons of a HERC2 duplicon from human, chimpanzee and baboon with the human HERC2 locus. MIROPEATS (Parsons 1995) was used for the pairwise comparisons to the HERC2 locus, using a threshold of 50 for illustrative purposes, and the results were displayed using PARASIGHT (Jeffrey Bailey, unpublished). The central horizontal line represents the HERC2 locus, oriented 5’ to 3’, left to right, as in Figure 6-8. The exons of the HERC2 cDNA contained within each segment of sequence similarity are indicated as the white boxes above the human pairwise comparison, and below the primate pairwise comparisons, respectively. Note the chimpanzee sequence is interrupted at the 5’ end by the terminus of the BAC sequence, and thus this represents only a partial chimpanzee HERC2 duplicon.

MIROPEATS was used to compare the human H’2 duplicon, a full-length baboon HERC2 duplicon from AC116043 and a partial chimpanzee HERC2 duplicon from AC119799, using a MIROPEATS threshold of 50. Although the baboon accession is working draft, the HERC2 duplicon (AC116043a) lies within a single contig of AC116043. The chimpanzee HERC2 duplicon is likely truncated at the 5’ end by the end of the BAC clone; however, enough material was present to allow a comparison.

As described above, the human HERC2 duplicon is a composite patchwork of segments from the HERC2 locus, and not a single contiguous stretch of sequence. This is illustrated by tracking the exons contained within the H’2 duplicon (Figure 6-12). The 5’ most portion of the duplicon contains exons 17-20, but in the inverted orientation with respect to the adjacent segment which contains exons 1-5. This segment is followed by solitary exon 10, then a block spanning exons 24-52. Proximal to this, exons 65-63 are present, but in the inverted orientation with respect to the rest of the exons, save for 20-17. The numerous deletions and inversions required to produce such a structure, at a minimum 5 deletions, 2 inversions and a sequence transposition event (moving the exon 17-20 segment adjacent to the exon 1-5 segment), indicate that the HERC2 duplicon has undergone rearrangement after the initial duplication event.

More striking, however, was that the baboon and chimpanzee duplicons were derived from different regions of the HERC2 locus (Figure 6-12). For example, the baboon duplicon, aside from exon 3, consisted of a contiguous block encompassing exons 2-20. As confirmation, sequence similarity searches of exons 13, 14 and 15 of the HERC2 cDNA sequence (NM_004667) against build34 of the human genome revealed no evidence for duplication. Although the difference was not as drastic as that seen in the baboon, the partial chimpanzee duplicon also showed some variation with respect to the human duplicons. Specifically, exons 6-9 were found in the chimpanzee duplicon, and yet these exons are only present in the smaller H’5 and H’12 duplicons, and in the chromosome 16 duplicons. The large duplicons, H’1, H’2, H’8, H’10 and H’14, do not contain exons 6-9.

Phylogenetic Analysis

To investigate the relationship between HERC2 paralogs in humans and primates, phylogenetic analysis was performed. For parity with the library hybridization data, sequences corresponding to the H2 paralogous STS (approximately 560 bp) were obtained for all HERC2 duplicons possible, in addition to the H2 positive baboon BAC clones which had been sequenced (AC116041, AC116042 and AC116043). In addition, a transposon mediated sequencing protocol was employed to sequence an H2 positive subclone of lemur BAC LBNL2 68B10, which had been identified as H2 positive by library hybridization (see Methods). Since HERC2 appears to be single copy in lemur by FISH, Southern blot and BAC library hybridization (Table 6-2, Figure 6-9 and Figure 6-10), this sequence serves as an outgroup for phylogenetic comparisons. In total, 21 H2 signatures were obtained from the human 15q11-q13 scaffolds, the baboon BAC clones, and the lemur BAC subclone.

The phylogram, generated using the Neighbor-Joining method and tested with 1000 bootstrap replicates, was rooted to the lemur H2 sequence (Figure 6-13).

Figure 6-13. Phylogenetic analysis of human and non-human primate HERC2 duplicons. A) The radial phylogram, constructed using the Neighbor-Joining method and tested by 1,000 bootstrap replicates, consists of 21 taxa derived from lemur, baboon and human H2 paralogous STS sequence signatures. The orthologous HERC2 loci (i.e. the intact duplicon donor gene) are indicated in red. The tree was rooted to the single-copy lemur H2 sequence. Orthologous HERC2 duplicon signatures are indicated in green. The asterisk indicates the presence of an Alu insertion in the H2 paralogous STS sequence from that site. Bootstrap values >95 are indicated at the branch points. Phylogenetic groups are indicated by the labeled arcs. B) A schematic of the position of the HERC2 duplicons in which the “core” duplicons are indicated in red, and the “edge” duplicons are indicated in blue. Duplicons absent from the phylogram are shown in gray.

Several observations were made with respect to the distribution of H2 sequences in the phylogram. Generally, the phylogram has five main groups, which separate not only by species, but also by the type of duplicon contained within them. The H2 sequence was obtained from both the large (>80 kb) HERC2 duplicons, which appear in the central regions, or “cores”, of the PWS/AS breakpoint clusters, as well as the smaller (<10 kb) degenerate HERC2 duplicons which appear at the “edges” of the breakpoints. In the phylogenetic comparison, the “core” and “edge” duplicons separated into two distinct clades, and the HERC2 locus sequence (from the intact gene in 15q13) grouped with the large “core” duplicons. The chromosome 16 copies of the HERC2 duplicon appear to be more closely related to the “edge” sequences. This relationship is supported by the observation of an Alu insertion in the H2 STS sequence from a subset of H2 signatures (indicated by the * in Figure 6-13). Since an Alu insertion is typically considered an irreversible character state, this indicated that the chromosome 16 copy of the HERC2 duplicon likely arose from the progenitor sequence of the chromosome 15 “edge” duplicons. The Alu insertion was also observed in the AC116041 H2 STS signature, which is orthologous to H’3. This indicates that the duplication of HERC2 which generated the “edge” signature in H’3 occurred prior to the divergence of the baboon and human lineages and is older than the “core” copies of the duplicon. Note that this Alu insertion was not observed in either the human or baboon “core” clades. Thus, these duplicons likely represent multiple waves of HERC2 duplication in these lineages (see Discussion).

To estimate the timing of the multiple duplication events in human and baboon lineages, a mutation rate was estimated from the lemur-human H2 pairwise distance, resulting in a rate of 1.96E-9. Using this rate, the large “core” human duplicons, with an average k of 0.0094 +/- 0.0027, appear to have diverged 2.4 Mya, within an interval of +/- 688 thousand years. This estimate is more recent than the human-chimpanzee divergence and suggests a human-specific homogenization event has occurred. Similarly, the baboon “core” clade within-group mean k of 0.0339 +/- 0.0063 suggests a divergence time of 8.6 Mya, within an interval of +/- 1.6 My, after the divergence of the baboon and great ape lineages. The orthologous baboon and human “edge” sequences (AC116041 and H’3), with a k of 0.0789 +/- 0.0127, suggest a divergence time of 20.1 Mya, within an interval of +/- 3.2 My, which is consistent with a pattern of initial HERC2 duplication prior to the divergence of the baboon and great ape lineages. The chromosome 16 copies of the HERC2 duplicon are estimated to have arisen 15.5 Mya, within an interval of +/- 2.9 My, which agrees with the comparative FISH analysis that demonstrates the duplication to chromosome 16 occurred after the divergence of the Asian and African great apes.
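
The conversion from pairwise distance to divergence time can be sketched as follows. This is a minimal illustration, assuming the relation T = k / (2r), where the factor of two reflects substitutions accumulating along both lineages; that relation is an assumption here (it reproduces the reported values), the rate is taken to be per site per year, and the function name is illustrative only.

```python
# Rate from the lemur-human H2 pairwise distance (assumed per site per year).
RATE = 1.96e-9

def divergence_time_mya(k, rate=RATE):
    """Convert a pairwise distance k to a divergence time in millions of years.

    Assumes T = k / (2 * rate), i.e. both lineages accumulate substitutions.
    """
    return k / (2.0 * rate) / 1e6

# Mean distances (k) and their errors as reported in the text.
estimates = {
    "human core": (0.0094, 0.0027),
    "baboon core": (0.0339, 0.0063),
    "baboon/human edge (AC116041 vs H'3)": (0.0789, 0.0127),
}

for name, (k, err) in estimates.items():
    t = divergence_time_mya(k)
    t_err = divergence_time_mya(err)
    print(f"{name}: {t:.1f} Mya +/- {t_err:.2f} My")
```

The same conversion applied to the error on k gives the quoted intervals (for example, roughly 0.69 My for the human “core” clade).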

Sliding Window Analysis of HERC2 “Core” Duplicon Divergence Patterns

The large (>80 kb) “core” human HERC2 duplicons demonstrated an extremely close phylogenetic relationship (Figure 6-13). The pattern of nucleotide variation within the pairwise alignments of the individual duplicons with the donor HERC2 locus may be indicative of the mechanism which gave rise to the human “core” clade. To investigate the landscape of nucleotide variation, pairwise global alignments were performed between several members of the large “core” clade (H’1, H’2, H’8, H’10 and H’14) and the HERC2 locus. The alignments were constructed within a 41.5 kb region common to several of the large duplicons, spanning the region from 3.4 kb to 44.9 kb of the HERC2 locus sequence, which contains exons 1-4 (see Figure 6-8). This region corresponds to the 5’ most portion of the duplicon, which is also present in the smaller “edge” duplicons.

The global alignments generated by the program Align (Myers and Miller 1988) were subjected to a sliding window analysis using the program Align_Slider (Jeffrey Bailey, unpublished). The program calculated the percent identity and percent GC-content within a 2 kb window, and the window slid 5 bp per iteration (Figure 6-14).
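
The sliding-window calculation described above can be sketched in a few lines. This is not the Align_Slider program, which is unpublished; it is a minimal illustration that assumes the two rows of a global alignment are supplied as equal-length strings with '-' for gaps, that identity is counted over all alignment columns in the window, and that GC-content is computed on the non-gap bases of the reference row. Those conventions, and the function name, are assumptions.

```python
def sliding_window(aligned_ref, aligned_dup, window=2000, step=5):
    """Yield (start, percent_identity, percent_gc) for each alignment window.

    aligned_ref and aligned_dup are the two rows of one pairwise global
    alignment (equal length, '-' marks a gap). A gap column never counts
    as a match. GC-content is measured on the reference row only.
    """
    if len(aligned_ref) != len(aligned_dup):
        raise ValueError("alignment rows must have equal length")
    ref, dup = aligned_ref.upper(), aligned_dup.upper()
    for start in range(0, len(ref) - window + 1, step):
        r = ref[start:start + window]
        d = dup[start:start + window]
        matches = sum(1 for a, b in zip(r, d) if a == b and a != "-")
        bases = [a for a in r if a != "-"]
        gc = sum(1 for a in bases if a in "GC")
        pct_gc = 100.0 * gc / len(bases) if bases else 0.0
        yield start, 100.0 * matches / window, pct_gc

# Toy example with a 4-column window sliding 2 columns per step.
for row in sliding_window("ACGTACGTAC", "ACGTTCGTAC", window=4, step=2):
    print(row)
```

With the defaults (a 2 kb window sliding 5 bp per iteration), the generator yields one point per step, which can then be plotted against position along the alignment.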

Figure 6-14. The pattern of sequence divergence between human HERC2 duplicons. Pairwise alignments between HERC2 duplicons and the HERC2 locus were performed and the alignments evaluated in a sliding-window analysis that measured the percent identity and percent GC-content in a 2 kb window, sliding 5 bp in each iteration. The schematic across the top indicates the 41 kb alignment is derived from the 5’ half of the HERC2 locus, and contains exons 1-4 of the HERC2 cDNA. The alignments, generated using ALIGN (Myers and Miller 1988), were analyzed using Align_Slider (Jeffrey Bailey, unpublished). The upper panel presents the percent identity profiles of 5 independent alignments; note the extremely similar pattern of sequence divergence between the majority of the duplicons. Similarly, the percent GC-content plot shown below demonstrates little variation in the pattern of GC-content.

The pattern of variation seen between independent pairwise alignments of HERC2 duplicons with the HERC2 locus was strikingly similar between duplicons H’1, H’2, H’8 and H’10, despite the fact that some of these duplicons are separated by megabases of intervening sequence. Comparison of the REPEATMASKER (Smit and Green 1999) profiles of the duplicon and locus sequences indicates the dip in sequence identity in the region approximately 11 kb into the alignment is due to variation in the length of a CT-repeat (data not shown). Interestingly, the H’14 duplicon, which is currently not anchored within any of the 15q11-q13 scaffolds, presented a varied profile for the initial 12 kb of the global alignment, then fell in phase with the remainder of the HERC2 duplicons distal to this region. The percent GC-content plots were extremely similar across the entire alignment. The H’2 duplicon appeared slightly out of phase with the other duplicons, as in the percent identity plot, because small (<50 bp) insertion/deletion events shift the global alignment slightly (data not shown).

Baboon BAC End Sequence Analysis

To investigate the organization of the baboon HERC2/LCR15 duplication clusters we designed STSs within unique sequences flanking the distal edge of BP1 (BP1D) as well as flanking both proximal and distal edges of BP2 (BP2P, BP2D) (Figure 6-15).

Figure 6-15. Baboon BAC end placements against the human genome. The BP1-BP2 region of the human genome is depicted as the horizontal line, with BP1 and BP2 represented as the labeled boxes. Gene loci are indicated by the black circles. The positions of STS probes BP1D, BP2P and BP2D are indicated by the arrows. The paired-end sequence placements from the STS-positive BAC clones are indicated by the interconnected squares beneath the line. Black squares represent a sequence match (>250 bp, >90% sequence identity) to unique human sequence. Open squares represent BAC end sequences >150 bp (filtered for high-copy repeats by REPEATMASKER (Arian Smit)) that had no match in the NT and HTGS databases. Red squares indicate a sequence match to the HERC2 locus. The gray square denotes a sequence match to duplicated sequence within the BP1 region with unknown ancestry (see Figure 6-3). The green square represents a human chromosome 9p13.2 sequence match that is a putative baboon-specific duplication. The paired end sequence matches are presented as a schematic, as the sizes of the BAC inserts were not experimentally determined.

The STSs were used as probes in hybridizations of the baboon BAC library and the subsequent positive clones were end sequenced. In addition, all H2 STS positive baboon BAC clones were end sequenced. All baboon BAC end sequences were then placed against the 15q11-q13 scaffolds and also analyzed by sequence similarity searches against the NT and HTGS databases. Only BAC end sequence pairs with informative sequence matches at both ends were reported.

Baboon BAC end sequence placements support the presence of a duplication-rich region adjacent to the group of unique genes found in the orthologous human BP1-BP2 interval (Figure 6-15). Several paired baboon BAC ends were anchored in the distal BP1 region, and within baboon accession AC116041, which is orthologous to this region. The unanchored end sequences of these BAC end pairs matched a variety of sites, including HERC2 sequences that were not present in AC116041. There was also a match to a region of human chromosome 9p13.2 that is not duplicated within the human genome. Thus, the 9p13.2 sequence match represents a putative baboon-specific duplication, or alternatively, an interchromosomal rearrangement breakpoint. The 9p13.2 sequence was found in several H2 STS positive baboon BAC ends, however, which indicates the former possibility is more likely (Table 6-5).

Hybridizations of STSs flanking BP2 resulted in three positive baboon BACs with end sequence pairs that were successfully placed against the human genome (Figure 6-15). Interestingly, the three BACs which anchored in the unique sequence flanking BP2 had complementary BAC ends that had no sequence match within the human genome. To qualify as a sequence with no match to the human genome, the BAC end sequence had to retain >150 bp of ‘unique’ sequence as defined by REPEATMASKER (Smit and Green 1999). In other words, the BAC end sequence must be free of high-copy repeats for a contiguous stretch of sequence >150 bp, and have no match within the NT and HTGS sequence databases. To that end, BACs RP41-470P11 and RP41-243O16 both had unanchored end sequences with >350 bp of high-copy repeat-free contiguous sequence with no match to the human genome. The lack of any sequence hit, much less a sequence match to a duplicon associated with BP2 in the human genome, indicates the BP2 duplication cluster is not present in the baboon genome.
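
The “no match” criterion can be expressed as a small filter. The sketch below is illustrative only: it assumes the BAC end sequence has been soft-masked (repeats in lower case, unique sequence in upper case), that hard-masked characters such as N or X also interrupt a run, and that the result of the NT/HTGS search is supplied as a boolean. The function names are hypothetical.

```python
import re

def longest_unmasked_run(masked_seq):
    """Length of the longest contiguous stretch of unmasked bases.

    Assumes soft-masked input: lower-case letters are repeats, upper-case
    A/C/G/T are unique sequence. Any other character breaks the run.
    """
    runs = re.findall(r"[ACGT]+", masked_seq)
    return max((len(run) for run in runs), default=0)

def qualifies_as_no_match(masked_seq, has_database_hit, min_unique=150):
    """True if the end has >min_unique bp of repeat-free sequence and no hit."""
    return longest_unmasked_run(masked_seq) > min_unique and not has_database_hit

# Toy example: 200 bp of unique sequence flanked by a masked repeat.
end_seq = "acgt" * 50 + "ACGT" * 50
print(longest_unmasked_run(end_seq), qualifies_as_no_match(end_seq, False))
```

Requiring a contiguous repeat-free stretch, rather than a total count of unmasked bases, guards against calling an end "unmatched" when it is in fact too fragmented by repeats to be searched meaningfully.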

The characterization of baboon BAC end sequences from BACs positive for the H2 STS resulted in several categories of sequence physically linked to the baboon HERC2 duplicon (Table 6-5). A substantial proportion of H2 positive BACs (10/36) had both BAC ends placed within duplications associated with the human PWS/AS breakpoints, such as HERC2, LCR15, FLJ00287, HPRT and APBA2 sequences. This suggests the association of these duplicated segments into clusters occurred prior to the Old World monkey divergence from a primate common ancestor. Also, the preponderance of this class of baboon BAC indicates there are putative large tandem clusters of these duplicons in the baboon genome. A large number of baboon BACs (13/36) linked the human PWS/AS breakpoint associated duplicons, such as those listed above, with the CHRNA7-associated sequence. This is in agreement with the sequence analysis of accession AC116042. Several other classes of paired-end sequence placement were observed, several of which involve the putative baboon-specific duplication of the human 9p13.2 segment (Table 6-5). Interestingly, one baboon BAC end sequence pair linked the 3’ half of the baboon HERC2 locus (AC119423) to an LCR15 duplicon, which is orthologous to the human BP3 organization. Thus, there is evidence from baboon BAC end sequencing to support the existence of multiple clusters of duplicons that roughly correspond to BP1 and BP3.

DISCUSSION

We have utilized a combination of strategies to build sequence maps within the highly duplicated, and difficult to assemble, PWS/AS breakpoint clusters. One aspect of this strategy relies on stringent thresholds for considering BAC sequence overlap from existing database BAC sequences. The other aspect relies on the identification of paralogous sequence variants within highly duplicated regions, and the use of those variants as a signature that distinguishes one copy of a duplication from all the others present within the genome. Using these strategies in concert allows one to traverse extended regions of duplicated material and work towards complete sequence assemblies of biologically important regions such as the PWS/AS common deletion breakpoints.

One caveat of this high stringency approach to BAC contig assembly is that by using a standard of 99.9% or greater sequence identity over 10 kb, one may in fact be assembling a single haplotype, and the line separating allelic variation and paralogous variation is obscured. For example, when assembling multiple independent contigs within a duplicated region, achieving a fusion of the subcontigs into a larger scaffold may become problematic when the contigs originated from different alleles. Allelic variation in duplication content, structure or organization may induce an inconsistency in the assembly.

Our analysis of the 15q11-q13 sequence scaffolds showed extensive segmental duplication within the PWS/AS breakpoint clusters. This in itself is not surprising; however, the level of interchromosomal duplication detected in this analysis was unprecedented in relation to previous investigations of these regions. The interchromosomal duplications could also be divided into two classes: those that were found among multiple PWS/AS breakpoints, and those specific to a single breakpoint. This is indicative of two properties of the PWS/AS breakpoints. First, the duplicated breakpoint clusters are able to accept interchromosomal duplications in a manner reminiscent of pericentromeric regions. Second, a subset of interchromosomal duplications are likely shared among all three breakpoint clusters due to a post-duplication homogenization mechanism.

From the perspective of inter-breakpoint sequence relationships, the PWS/AS breakpoints appear to be organized in large direct repeats. The structure of BP3, however, consists of an inverted repeat of the HERC2 locus with respect to the adjacent distal large HERC2 duplicon (H’8). It is one of the unique aspects of the structure of 15q11-q13 that the BP3 sequence is directly adjacent to the donor sequence of a major component of the PWS/AS breakpoints. The inverted structure of BP3, and the direct repeat of BP1 and BP2 with respect to the distal region of BP3, may allow for homogenization of the duplication clusters, without causing the deletion of the intact HERC2 donor locus. Evidence for this theory, however, will only be found by capturing such a homogenization event, perhaps through monitoring meiotic products. To extend this structural observation to the HERC2 duplicon in the pericentromeric region (H’1), the orientation of this duplicon is identical to the HERC2 locus. In other words, the H’1 duplicon is in an inverted orientation with respect to BP1, BP2 and distal BP3. The lack of any clinically recognized rearrangements at the pericentromeric site of HERC2 duplication may be a reflection of this. Perhaps for a deletion to take place, the HERC2 and surrounding duplicons must be oriented as direct repeats.

We propose a model for the dynamic restructuring of 15q11-q13 segmental duplication clusters that results in “core” and “edge” duplicons. The “core” duplicons, which are larger and have a higher sequence identity with respect to the donor HERC2 locus than “edge” copies, result from lineage-specific homogenization events. The “edge” duplicons represent older copies of the duplicon, in a sense previous versions, that are outside of the homogenization events and therefore present a genetic distance between orthologs that is consistent with neutral mutation. In support of this, the majority of the “edge” duplicons are derived from the 5’ end of the duplicon, which is the region which shows the highest percent similarity by sliding window analysis. For example, multiple 2 kb windows within the 5’ end of the H’8 and H’14 duplicons aligned with the HERC2 locus at 100% identity. Thus, the “edge” duplicons may be the remnants of older HERC2 duplicons that have been subsequently deleted and diverged such that the only remaining detectable alignment is to the 5’ end of the duplicon, which shows such extraordinary sequence identity. The shape of the percent identity sliding window plot may also indicate that there is a hot spot or propensity for homogenization to occur at the 5’ end of the duplicon.

The argument for a homogenization-based mechanism for the generation of the highly-similar HERC2 duplicons in multiple locations throughout 15q11-q13 is supported by the structural similarity of human HERC2 duplicons. In addition, the structural differences between human and non-human primate HERC2 duplicons suggest a homogenization event has occurred recently in the human lineage. The duplication of HERC2 to chromosome 16 serves as an excellent source for insight into this question. If a mechanism homogenizes HERC2 duplicons within 15q11-q13, then a duplication of the HERC2 duplicon to chromosome 16 is analogous to taking a snapshot of how the ancestral African great ape HERC2 duplicon was structured, assuming relatively few subsequent rearrangements have taken place on chromosome 16. Our analysis has definitively shown the chromosome 16 HERC2 duplicons are structured differently than the chromosome 15 loci; however, the comparison of the exon content of the chimpanzee duplicon to the human duplicon is most informative.

The chimpanzee HERC2 duplicon found in AC119799 contained exons 6-10 of the HERC2 cDNA. The large human HERC2 “core” duplicons all contained exon 10, but not exons 6-9. If the ancestral African great ape HERC2 duplicon contained exons 6-10, as indicated by the chimpanzee structure, five independent deletions would have to have occurred in the human lineage to eliminate exons 6-9 from the five large “core” HERC2 duplicons described in this analysis. A more parsimonious model is that a progenitor HERC2 duplicon in the human lineage lost exons 6-9 through a small deletion event, and that progenitor was then spread among the 15q11-q13 breakpoint clusters through a homogenization event. In support of this, the chromosome 16 HERC2 duplicons contain exons 6-10, which is consistent with exons 6-10 being present in the ancestral African great ape HERC2 “core” duplicon.

The potential for rapid homogenization of HERC2 “core” duplicons also obscures the precise site from which the chromosome 16 duplicons arose. In fact, no currently described chromosome 15 duplicons match the structure of the chromosome 16 sequence. In terms of HERC2 locus content, the H’5 duplicon from BP2 and H’12 in the unanchored clone AC136693 match the chromosome 16 duplicon the best. In contrast, the phylogenetic analysis of HERC2 duplicons suggested the chromosome 16 copies of HERC2 are most closely related to the “edge” sequences found in distal BP1 (H’3) and within BP3 (H’9). The Alu insertion common to the chromosome 16 duplicons, H’3 and H’9 also supports this assertion, as the H’5 duplicon does not contain the Alu insertion in the H2 STS sequence. Thus, the chromosome 16 sequence may act as a copy of the African great ape ancestral HERC2 sequence frozen in a time capsule. The chromosome 16 sequence is most closely related to the “edge” HERC2 duplicons, yet these duplicons have been reduced to <10 kb in size. This may be an after-effect of a previous generation “core” duplicon that has been marginalized and degraded as a result of multiple rounds of subsequent homogenization events.

The phylogenetic groupings also indicate there are separate classes of HERC2 duplicon in the baboon genome. For example, the Alu insertion in H’3 is also seen in the orthologous sequence of AC116041. This demonstrates the H’3 copy of the HERC2 duplicon pre-dates the divergence of Old World monkey species, yet the Alu insertion is not observed in the baboon “core” clade of duplicons. Thus, the duplicons which comprise the baboon core clade arose separately from the H’3 ortholog, identical to the state observed in the human genome. Several similarities in the HERC2 and LCR15 duplicons between humans and baboons indicate the relationship between these sequences has existed for >25 million years. Undoubtedly these sequences are found in clusters in both species, as demonstrated by both comparative sequencing and baboon BAC end sequence analysis. The comparative FISH data also suggested two major clusters of HERC2- and LCR15-related sequences in the baboon genome. This assertion is also supported by the absence of any duplications in the region orthologous to BP2 in baboon.

Through this analysis we have come to a model of HERC2 duplication that begins farther back in evolutionary time than previously anticipated (Figure 6-16).

Figure 6-16. Model of HERC2 evolution from lemur to baboon. The horizontal line represents the orthologous human 15q11-q13 sequence in the ancestral mammalian genome. Genic loci are depicted as the open circles, the HERC2 locus is represented by the red square. Duplications of additional loci are color coded according to the key provided. In the bottom panel, accessions AC116041, AC116042, AC116043 and AC119423 are indicated below the line in accordance with their sequence content (Figure 6-11).

Our data suggest the HERC2 gene is single copy in the prosimians. By 45-50 Mya, two copies of the HERC2 locus are present in the ancestral primate genome. The two copies are organized such that they are within 1 Mb of each other, and thus produce a single signal by comparative FISH, yet distant enough to appear as two independent loci by BAC library hybridization (see CJA, Table 6-2). By the time of divergence of the Old World monkey species, the copy number of the HERC2 duplicon has increased, and the establishment of “core” and “edge” duplicons has occurred. Additionally, by this time the clusters of HERC2 and LCR15 duplicons have acquired the ability to serve as integration sites for interchromosomal segmental duplications. After the divergence of the African and Asian great ape species 8-10 Mya, a progenitor HERC2 duplicon is duplicated to chromosome 16. Subsequent homogenization events which occurred recently (~2.5 Mya) in the human lineage have erased the evidence of this structure on chromosome 15. Also, lineage-specific rearrangements, such as the region associated with the CHRNA7 duplication in the human genome, have re-patterned the 15q11-q14 region in multiple primate lineages. The level of genomic complexity in the human and orthologous primate 15q11-q13 regions suggests an extraordinarily dynamic evolutionary history at these sites of clustered segmental duplications.

MATERIALS AND METHODS

Defining the HERC2 Locus

The HERC2 locus was defined as an interval containing all 93 exons of the HERC2 cDNA, in addition to 9.5 kb 5’ of exon 1 and 10 kb 3’ of exon 93. The 9.5 kb upstream interval was used, as opposed to a 10 kb interval, due to the discovery of a duplication of the HERC2 locus containing exons 17-20 immediately 10 kb 5’ of exon 1 of HERC2 (see Figure 6-12). This unusual arrangement required trimming of the reference locus sequence by 500 bp to compensate for the nearby duplication. Although this duplication of HERC2 5’ of exon 1 fits the definition of a duplicon by itself, comparison with other HERC2 duplicons in the human genome (see Figure 6-8) indicated this small 5’ segment had been duplicated concomitantly with the HERC2 locus itself; therefore, for simplification this 5’ duplication of exons 17-20 was considered part of the entire HERC2 duplicon, yet excluded from the HERC2 locus sequence to avoid multiple alignments within the reference HERC2 sequence. Using these criteria, the entire HERC2 locus reference sequence, including flanking regions, spanned 230,564 bp with an overall GC content of 44.37%.

Scaffold Assembly

Using the HERC2 locus as a seed sequence in sequence similarity searches against the non-redundant nucleotide (NT) and high-throughput genome sequence (HTGS) databases, BACs that contained HERC2 related sequences were identified. These related clones were then searched independently against the available databases (NT/HTGS) in order to identify overlapping clones. Due to the extremely high sequence similarity between HERC2 duplications, it was necessary to use stringent standards for considering the overlap between BAC sequences. Three major factors contribute to the determination of paralogous overlap or authentic clone overlap: sequence identity, length of overlap, and overlap configuration. To generate the alignments for scaffold building we utilized MEGABLAST, set to extend alignments through lower-case masked sequence, produced by REPEATMASKER (Smit and Green 1999; Zhang 2000). In general, an overlap was considered legitimate if the degree of sequence identity exceeded 99.9%, the overlap extended greater than 10 kb, and the overlap occurred between the ends of the respective clones. One exception to this rule was the use of accession AC135348 in the BP3_Prox contig, which contains approximately 5 kb overlap with AC055876. This sequence, however, was spanned by a clone identified by STS hybridization and subsequent paralogous sequence variant analysis. Thus, we are confident the clone is properly placed and use the sequence within AC135348 to characterize the duplications in this region. In general, only clones from the RP11 library were considered for this analysis; however, two exceptions are noted. One, AC135058 (CTD-2298I13), was included because it represents the most proximal extension into BP1 of any BAC linked to unique sequence, and because of its extensive overlap at high sequence similarity with AC100757 (58874 bp, 99.9%). The other exception, AC135348 (RP13-822L18), was the only clone with >99.9% overlap to the proximal edge of AC055876. We propose this clone represents a glimpse of the interior of BP3 which will be replaced by sequencing of RP11-1174I20 (discussed below).
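
The three overlap criteria (sequence identity, length of overlap, and overlap configuration) can be sketched as a simple test. This is not the procedure used to build the scaffolds, only an illustration. The record layout, the function names and the `slack` parameter are hypothetical, the coordinates in the example are invented, and the identity threshold is applied as "99.9% or greater" to agree with the AC135058/AC100757 overlap quoted above.

```python
from dataclasses import dataclass

@dataclass
class Overlap:
    identity: float  # percent identity of the alignment
    length: int      # aligned length in bp
    a_start: int     # alignment coordinates on clone A (1-based, inclusive)
    a_end: int
    a_len: int       # total length of clone A
    b_start: int     # alignment coordinates on clone B
    b_end: int
    b_len: int       # total length of clone B

def touches_end(start, end, total, slack=0):
    """True if the aligned interval reaches either end of the clone."""
    return start <= 1 + slack or end >= total - slack

def is_authentic_overlap(ov, min_identity=99.9, min_length=10_000, slack=0):
    """Apply the identity, length and end-to-end configuration criteria."""
    return (
        ov.identity >= min_identity
        and ov.length > min_length
        and touches_end(ov.a_start, ov.a_end, ov.a_len, slack)
        and touches_end(ov.b_start, ov.b_end, ov.b_len, slack)
    )

# Hypothetical coordinates for an end-to-end overlap of 58,874 bp at 99.9%.
candidate = Overlap(99.9, 58874, 1, 58874, 150000, 91127, 150000, 150000)
print(is_authentic_overlap(candidate))
```

An alignment that is long and nearly identical but internal to both clones fails the configuration test, which is the situation expected for a paralogous rather than an authentic overlap.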

STS Design and Hybridization

STSs were designed at the edges of several contigs for hybridization of the RP11 BAC library in order to identify potential gap spanning clones, which are depicted in Figure 6-1. In the BP1_Prox scaffold, RP11-1115P6 was identified in a previous study (Chapter 4). RP11-992L12 was identified by hybridization with PCR-amplified STS AC068446.3.4 (Table 6-1). In the BP1_to_BP2 scaffold, RP11-1276B22 was identified by hybridization with probe AC011767.7.8 and RP11-899L8 was identified by hybridization with probe AC141254.1/.2 (Table 6-1). In the BP3_Prox scaffold, RP11-1174I20 was identified by library hybridization with STS AC139682.1/.2 (Table 6-1). All STSs were PCR amplified using standard conditions from human genomic DNA and hybridizations were performed as previously described (Eichler et al. 1997).

Paralogous Sequence Variant Analysis

To identify overlapping clones in highly duplicated regions, STSs are designed such that the primers lie in regions of great depth of identical sequence, yet flank sites containing paralogous sequence variants (PSVs), or base-pair changes specific to the copy of the duplication that is being used to extend the contig. In this manner, BAC library hybridization will obtain clones with STS related sequence, and the diagnostic nucleotide changes encompassed by the STS, or paralogous sequence variants, will identify the specific clones which contain the exact copy of that duplication.

All the positive clones from a BAC library hybridization are subject to PCR amplification with the respective STS primers and subsequent sequencing of the PCR product according to standard protocols. The PCR sequences are then grouped using

CONSED, and multi-sequence alignments with the reference STS sequence performed using CLUSTALW and BLAST (Altschul et al. 1990; Thompson et al. 1994).
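The PSV-based selection of clones described above can be illustrated with a minimal Python sketch. It assumes the STS amplicon sequences have already been aligned to a common coordinate system (as a CLUSTALW alignment would provide). The function names and the toy sequences are hypothetical and are not part of the original analysis.

```python
def diagnostic_psvs(target, paralogs):
    """Return {column: base} for aligned columns where the target copy
    carries a base that none of the other paralogous copies share."""
    psvs = {}
    for i, base in enumerate(target):
        if base == "-":  # skip alignment gaps in the target copy
            continue
        if all(p[i] != base for p in paralogs):
            psvs[i] = base
    return psvs


def clones_matching_target(clone_reads, psvs):
    """Return the names of clones whose amplicon sequence carries every
    diagnostic base of the target copy. If psvs is empty, every clone
    trivially matches, so an empty PSV set is uninformative."""
    return [
        name
        for name, read in clone_reads.items()
        if all(read[i] == base for i, base in psvs.items())
    ]


# Hypothetical aligned sequences: the duplication copy being extended and
# two other paralogous copies.
target_copy = "ACGTACGTAC"
other_copies = ["ACGTTCGTAC", "ACGTTCGTGC"]
psvs = diagnostic_psvs(target_copy, other_copies)

# Hypothetical amplicon sequences from hybridization-positive clones.
clone_reads = {
    "clone_1": "ACGTACGTAC",
    "clone_2": "ACGTTCGTAC",
    "clone_3": "ACGTTCGTGC",
}
selected = clones_matching_target(clone_reads, psvs)
```

In this toy case only column 4 is diagnostic, because the difference at column 8 is shared between the target and one other copy, and only clone_1 is retained.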

BAC End Sequencing

Clones identified by BAC library hybridization are subjected to BAC end sequencing as described (Chapter 4). The BAC end sequences are then searched against the contigs such that the clones with the theoretical maximum extension into unmapped regions are selected. Alternatively, the BAC end sequences may link two contigs, or a contig with a sequence clone by sequence similarity searches of the BAC end sequences

against the NT and HTGS databases. Thus, the selection of gap spanning clones relies upon STS hybridization, STS sequencing of the resulting positive clones, multi-sequence alignment of the STS sequences, BAC end sequencing of the positive clones and placement of the BAC end sequences.

Comparative FISH Analysis

Metaphase chromosome preparations were made from lymphoblastoid cell lines derived from the following species: Homo sapiens (HSA), Pan troglodytes (PTR),

Gorilla gorilla (GGO), Pongo pygmaeus (PPY), Papio hamadryas (PHA), Cercopithecus aethiops (CEA), Callithrix jacchus (CJA), Lemur catta (LCA) and Eulemur macaco

(EMA). Hybridizations were performed using standard protocols with BAC DNA probes labeled with either biotin-16-dUTP or digoxigenin-11-dUTP as previously described

(Horvath et al. 2000b). A minimum of 20 metaphases were examined for each experiment. The chromosome identity was determined by DAPI staining and reported according to the ISCN guidelines for nomenclature (ISCN 1985).

Lemur BAC Subclone Sequencing

To obtain the H2 STS sequence from lemur, the H2 STS was first hybridized to the LBNL2 lemur BAC library. A subclone library from H2-positive lemur BAC

LBNL2-68B10 was then generated by EcoRI digestion and ligation into the plasmid pGEM-3Zf(+) (Promega). A set of 96 subclones were arrayed on an agar plate, a colony lift was performed, and the resulting filter was hybridized with a radiolabeled H2 STS probe to identify the appropriate subclones. Once identified, H2-positive LBNL2-68B10

subclone 2B10 was subjected to transposon-mediated sequencing using the GPS-1 kit

according to the manufacturer's protocol (NEB). Sequencing 96 transposon insertions in

both directions resulted in a 14.9 kb CONSED contig that contained both vector and

insert sequences (Gordon et al. 1998). The insert sequence was 11.6 kb in length and was

used in subsequent sequence alignments.

Phylogenetic Analysis

HERC2-related sequences were identified within the scaffolds, human clones and

primate clones by sequence similarity searches using the HERC2 locus seed sequence as the query. The related sequences were extracted and multiple sequence alignments performed using CLUSTALW (Thompson et al. 1994). All subsequent distance calculations and phylograms were generated using the MEGA 2.1 software package (Kumar et al. 1993). Phylograms were constructed using the neighbor-joining method and the Kimura 2-parameter model of nucleotide substitution was used to correct for multiple substitutions (Kimura 1980). The standard error was calculated based upon 1000 replicates of the bootstrap method.
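For reference, the Kimura 2-parameter correction can be sketched in a few lines of Python. This is an illustration of the published formula, d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitional and transversional differences. It is not the MEGA implementation, and the function name and the handling of gaps (aligned columns containing anything other than A, C, G or T are skipped) are assumptions made for this example.

```python
import math

# Purine-purine and pyrimidine-pyrimidine changes are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned sequences."""
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        sites += 1
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites in the alignment")
    p = transitions / sites
    q = transversions / sites
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# One transition in ten aligned sites: the corrected distance is slightly
# larger than the uncorrected proportion of differences (0.1).
example_distance = k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")
```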

Comparative Sequence Analysis

In order to obtain HERC2 related sequence from Old World monkeys for phylogenetic and genomic comparisons, the H2 STS was hybridized against the RPCI-41

PHA BAC library. The resulting positive clones were subjected to BAC end sequencing

(described above) as well as PCR using the H2 STS primers and the PCR products were then sequenced. Hybridizations were also performed to other primate BAC libraries,

including CHORI-250 (MMU), CHORI-251 (PTR), CHORI-255 (GGO), CHORI-253

(PPY) and the LBNL-2 LCA library (available from www.chori.org/bacpac). Initial hybridization of the LBNL-2 BAC library was performed with three STSs (H2, H15 and

H16; see Figure 6-1) that encompassed the entire HERC2 locus. Two of the resulting positive clones were sequenced to working draft depth (AC131599, AC126425).

Fosmid End Placement Validation

To assess the BAC overlaps within the scaffolds, the scaffold sequences were subjected to similarity searches against 2.3 million human fosmid end sequences, generated at WIGR as part of the Human Genome Project. The average insert size of a fosmid is 40.2 kb +/- 1.5 kb, yielding a range of roughly 38-42 kb, as physical limitations in the cloning process substantially decrease the efficiency of cloning significantly smaller or larger inserts. Additionally, the fosmids were sequenced at both ends, and the mate-pair information is known. Thus, by placing the fosmid end sequences against the scaffolds, and assessing the span of paired ends, one may estimate the fit of the assembly.

The orientation of fosmid end pairs is also considered. Significant deviations from the

38-42 kb span of a fosmid, or physically impossible orientations of fosmid end pairs, identify potential regions of discordance in the assembly. Alternatively, due to the fact the RP11 BAC library and fosmid library were derived from distinct individuals, aberrant results may indicate sites of potential polymorphic variation in the population.
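The paired-end test described above can be sketched as follows. This is a simplified illustration, not the placement pipeline used in this work: each fosmid end is reduced to a single scaffold coordinate and a strand, "proper orientation" is assumed to mean that the leftmost end lies on the + strand and the rightmost end on the - strand, and the category names are hypothetical.

```python
MIN_SPAN_BP = 38_000  # lower bound of the expected fosmid span
MAX_SPAN_BP = 42_000  # upper bound of the expected fosmid span


def classify_fosmid_pair(pos1, strand1, pos2, strand2):
    """Classify one fosmid end pair placed on a scaffold."""
    # Order the two ends by scaffold coordinate.
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pos1, strand1), (pos2, strand2)]
    )
    span = right_pos - left_pos
    size_ok = MIN_SPAN_BP <= span <= MAX_SPAN_BP
    # Assumed convention: the two reads must point toward each other.
    orientation_ok = left_strand == "+" and right_strand == "-"
    if size_ok and orientation_ok:
        return "concordant"
    if orientation_ok:
        return "size_discordant"
    if size_ok:
        return "orientation_discordant"
    return "orientation_and_size_discordant"


# Hypothetical placements (scaffold coordinates in bp).
calls = [
    classify_fosmid_pair(100_000, "+", 140_000, "-"),
    classify_fosmid_pair(100_000, "+", 190_000, "-"),
    classify_fosmid_pair(190_000, "+", 100_000, "+"),
]
```

Runs of discordant pairs along a scaffold would then mark candidate regions of misassembly or, as noted above, of polymorphism between the individuals from whom the two libraries were derived.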

To increase the fidelity of fosmid end placement within duplicated regions, two rounds of alignment are performed. An initial alignment by BLAST is used to find regions of potential placement, and then a global alignment is performed. This alignment

process also takes into account the sequence quality information provided by Phred scoring of the fosmid end sequence reads (Ewing and Green 1998; Ewing et al. 1998).

Thus, the “best placement” of fosmid end sequences is determined. The three 15q11-q13 scaffolds were searched against the 2.3 million fosmid end sequence database and the results plotted using the graphics program PARASIGHT (Jeffrey Bailey, unpublished)

(Figure 6-17).

Figure 6-17. Fosmid validation of 15q11-q13 scaffolds. The 15q11-q13 scaffolds are indicated as horizontal black lines with tick marks descending from them, marking 100 kb intervals. The tiling path of clones which comprise the scaffold is indicated beneath the line. Working draft status clones are shown in black. Gaps within the scaffolds are indicated as the purple boxes. Regions of fosmid paired-end agreement are shown by the green line underneath the black line. There are instances in which the fosmid end placement algorithm cannot discern the “best placement” of an end sequence. This occurs in highly duplicated areas, and so concordant fosmid paired-end placements in which one end is a “best placement” and the other end is potentially duplicated are called “ties”. These “ties” are represented by the blue line. Discordant fosmid paired-ends are indicated above the black lines. Fosmid pairs which are in the proper orientation, yet have a discordant size, are shown in red. Fosmid pairs which are in the improper orientation and have a discordant size are shown in orange. Note that in the BP1_to_BP2 and BP3_Prox scaffolds, the regions of discordance generally match the gaps and working draft sequences, which was expected.

The majority of all three scaffolds was spanned by fosmid paired ends that fit the 38-42 kb criterion at the level of “best placement.” In general, the areas in which the fosmid validation failed are coincident with the working draft accessions within the contigs.

This was not unexpected, as the working draft sequence, if not ordered and oriented, will foil the fosmid paired end method. Additionally, the sequence gaps which have been closed by library hybridization, BAC end sequencing and PSV analysis described above, will also create false negatives in the fosmid paired end method. Subsequent sequencing of the gap-spanning clones and reintegration of the finished sequence into the scaffolds is likely to reduce the fosmid paired end disparity significantly.

Chapter 7

Discussion and Future Directions

SUMMARY AND DISCUSSION

The major findings of this diverse set of studies center on the hypothesis that

segmental duplications promote genomic instability both between closely-related species

and within a species. This includes rearrangement events involving the same duplicated

segment in multiple independent species. A primary strength of this work has been the

myriad of methods which were used to investigate the phenomenon of duplication-mediated

genomic plasticity, and that all these diverse techniques, both computational and

experimental, distinguish segmental duplications as sites of prolific rearrangement.

Additionally, these methods have allowed a detailed examination of complex sequence structures in the human and primate genomes, as well as an assessment of genome-wide variation between our genome and those of the great apes.

The most interesting aspect of the genome-wide detection of dosage variation between humans and non-human primates was the observed 14-fold enrichment for intrachromosomal segmental duplications at sites of dosage imbalance (Chapter 2). In addition to localized dosage variation between primates, a cluster of intrachromosomal segmental duplications was found at the site of the pericentric inversion of 15q11-q13 in chimpanzee (Chapter 3). This finding demonstrates that such sequences may mediate large chromosome restructuring events in addition to small-scale deletions and duplications.

The pericentromeric region of 15q11 was also shown to be evolutionarily dynamic, with substantial re-patterning in closely-related primates (Chapter 4). Additionally, an accumulation of segmental duplications after the divergence of the African and Asian great apes was noted. Through the application of array comparative genomic hybridization in a targeted study, dosage variation was detected in a wide range of

clinical samples with rearrangements of the 15q11-q13 region (Chapter 5). Samples

containing multiple size classes of deletion and duplication, both interstitial and pseudo-

dicentric, commonly shared breakpoint intervals coincident with segmental duplication

clusters in 15q11-q13. Lastly, a detailed dissection of the common rearrangement

breakpoint segmental duplication clusters, and specific analysis of the HERC2 duplicon, described a complex structure in the human genome with a dynamic evolutionary origin

(Chapter 6). Phylogenetic and comparative sequence analysis suggest that this complex structure potentially arose through multiple rounds of homogenization of the HERC2 duplicon between the PWS/AS common deletion breakpoint regions. Also, lineage-specific rearrangements were detected in both the chimpanzee and baboon genomes at sites of HERC2 duplication, demonstrating substantial instability involving the HERC2 duplicon in humans and non-human primates. Overall, these complementary studies illustrate that the primate genome is subject to genomic instability at sites of segmental duplication. These sequences promote rearrangements within the human genome, resulting in genomic disorders, and are also involved in rearrangements that differentiate the genomes of human and non-human primates.

Whole-Genome Array Comparative Genomic Hybridization (CGH)

Aside from previous work in DNA re-association kinetics, few techniques have proven amenable to comparative whole-genome analysis. For example, molecular genetic techniques such as FISH and Southern hybridization have been extensively employed to track the evolutionary history of specific loci. On a cytogenetic level, whole-chromosome FISH paints added an additional dimension to comparative studies.

BAC array technology, however, allows one to investigate dosage variation between species at thousands of loci simultaneously. The initial application of the highly parallel array CGH technology to evolutionary analysis, described in Chapter 2 of this work, demonstrated the utility of this method and uncovered an interesting association of inter-species variation and segmental duplication.

Insight into the role of segmental duplications in evolutionary rearrangements came from a “reverse” approach in which a human genome BAC array, with a density of approximately 1 BAC clone per 1.4 Mb of genomic DNA, was hybridized with genomic

DNA from a panel of African and Asian great apes (Chapter 2). The term “reverse” implies that the clones chosen for this array were not selected with regard to the presence of segmental duplications in any way, as the initial purpose for the array was to profile cancerous tumor samples for genomic variations (Pinkel et al. 1998). The array consisted of ~2400 BACs and represented approximately 12% of the entire human genome. Upon hybridization of the array with a panel of primate DNA samples, including pygmy chimpanzee (or bonobo), common chimpanzee, gorilla (the three species of African great ape) and orangutan (an Asian great ape), a total of 63 sites of potential duplication or deletion with respect to the human genome were detected between one or more species

(Table 2-1). Given that this study sampled only 12% of the human genome, extrapolation suggests that a complete great ape array CGH comparison would detect approximately 523 sites of rearrangement 40+ kb in length, the distribution of which would favor the more divergent species.

In addition, as this genome-wide analysis was the first study of its kind using array CGH in primate comparative genomics, particular attention was paid to the

resolution of detectable events. It was expected a priori that dosage differences between

species encompassing an entire BAC probe on the array would be easily observed. This

was due to the nature of the technique, which assesses the fluorescence intensity

produced at each arrayed spot in both the reference (human) and test (primate) channels.

The theoretical fluorescence intensity ratios for haploid duplications and deletions are

therefore calculable. A haploid duplication, for example, would be represented as the

log2 of 3/2 or 0.58, and a haploid deletion would be represented as the log2 of 1/2 or -1

(Figure 2-1). Although these ratios were good general guidelines for selecting clones for further validation as true variant sites, the theoretical thresholds were not strictly adhered to. The reasoning behind this was two-fold. First, events which involved the partial deletion or duplication of the material within a BAC clone would only partially influence the fluorescence intensity ratio between human and primate genomic DNA samples.

Secondly, the theoretical thresholds for haploid duplication or deletion of a BAC are calculated without consideration of the efficiency of the experiment as a whole, or the signal-to-noise ratio. The signal-to-noise ratio is likely to be affected by the comparison of the hybridization of human DNA to a human BAC spot with the hybridization of a great ape species (i.e. divergent) DNA to a human BAC spot. Thus, I believe the experimental noise generated during the actual hybridizations served to attenuate the fluorescence intensity ratios, and therefore I used less-stringent criteria for selecting sites for future validation. For example, the 63 sites reported in Chapter 2 fit the criteria of having a log2 ratio >0.5 for a duplication or a log2 ratio <-0.5 for a deletion.
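The theoretical ratios and the relaxed selection criteria described above can be restated as a short Python sketch. The function names are hypothetical. The diploid reference copy number and the threshold of 0.5 are taken from the text.

```python
import math


def expected_log2_ratio(test_copies, reference_copies=2):
    """Theoretical log2 ratio of test to reference signal for a BAC that
    is entirely covered by the copy-number change."""
    return math.log2(test_copies / reference_copies)


def call_site(log2_ratio, threshold=0.5):
    """Apply the relaxed criteria: log2 ratio > 0.5 is scored as a
    putative duplication, < -0.5 as a putative deletion."""
    if log2_ratio > threshold:
        return "duplication"
    if log2_ratio < -threshold:
        return "deletion"
    return "no call"


haploid_duplication = expected_log2_ratio(3)  # log2(3/2), about 0.58
haploid_deletion = expected_log2_ratio(1)     # log2(1/2) = -1
```

A partial deletion or a noisy hybridization would yield a ratio between 0 and these theoretical values, which is why a site with an observed log2 ratio of, say, -0.7 would still be selected for validation under the relaxed threshold.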

Validating Sites of Inter-Species Genome Variation

When performing whole-genome comparative analyses, the methods used to

validate variant sites became critical, as the number of variants may prove impractical to

assess with efficiency. Assessing putative dosage gains detected by array CGH proved to

be effectively addressed by comparative FISH hybridizations, using human material as

the single-copy control, and primate material of the respective variant species to localize the putative duplication (Figure 2-3). An important caveat is that tandem

duplications are not likely to be resolved in this manner, as the resolution of FISH

hybridizations on metaphase chromosomes will not be sufficient to distinguish adjacent

duplicated sites. As noted in Chapter 2, however, all of the putative duplications tested by comparative FISH in this analysis proved to be valid.

Assessing putative deletions, however, proved to be more of a challenge.

Initially, comparative FISH was attempted to verify a subset of deletions; however, due to

the suspicion that a partial deletion could be detected by array CGH, yet also produce a

FISH signal in comparative FISH experiments (data not shown), this technique was

abandoned for all putative partial BAC deletions. For BACs that appeared to be

completely deleted from the respective primate genome due to a log2 ratio <-1.0,

comparative FISH successfully validated the rearrangement (Figure 2-4). For partial

deletions, a strategy was developed based upon BAC end sequence placement of primate

BACs on the human genome assembly. Briefly, within the region of the human genome

spanned by the arrayed BAC clone, a sequence-tagged-site (STS) probe was designed for

hybridization to the appropriate primate BAC genomic library (Figure 2-4). The

resulting positive BAC clones were end sequenced, and sized by pulsed-field gel

electrophoresis (PFGE). Then, by comparing the size of the BAC clone determined by

PFGE, with the size of the BAC clone determined by BAC end sequence placement on the human assembly, the size of the disparity approximates the size of the deletion. For example, if a deletion of 50 kb has occurred in the chimpanzee genome with respect to the human genome, BACs isolated from the chimpanzee genome will place 50 kb further apart than the size of the chimpanzee BAC as assessed by PFGE. In other words, the human equivalent sequence of the chimpanzee BAC will exceed the physical size of the chimpanzee BAC by the size of the deletion event.
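The arithmetic behind this estimate is simple enough to state as code. In the sketch below the function name, the coordinates and the PFGE size are hypothetical and chosen only to reproduce the 50 kb example given above.

```python
def estimate_deletion_kb(end1_human_bp, end2_human_bp, pfge_size_kb):
    """Estimate the size of a deletion in the primate genome as the
    excess of the human-assembly span between the two placed BAC end
    sequences over the physical (PFGE) size of the primate BAC."""
    human_span_kb = abs(end2_human_bp - end1_human_bp) / 1000
    return human_span_kb - pfge_size_kb


# A hypothetical chimpanzee BAC sized at 160 kb by PFGE whose end
# sequences place 210 kb apart on the human assembly.
deletion_kb = estimate_deletion_kb(1_000_000, 1_210_000, 160)
```

A value near zero would indicate no detectable size difference at that BAC, while a negative value would suggest additional material in the primate clone relative to the human assembly.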

As discussed in Chapter 2, this method is not a high-throughput process, as several BAC clones are identified in primate BAC library hybridizations, and given the uneven overlap of BAC clones in a library derived from partial restriction enzyme digestion, multiple STS probes may be required to span the deletion. In this sense, the technique is analogous to small-scale physical mapping of each deletion. One side benefit, however, is that after the procedure has verified a deletion, primate BAC clones are immediately available for comparative sequencing to examine, in fine detail, the exact nature of the deletion event that was detected by array CGH. Using this BAC approach, a deletion as small as 40 kb was validated (Figure 2-4), indicating that events between 40 kb and the size of an entire BAC clone (~150-200 kb) can be readily detected and validated.

In performing this study, the issue of sequence divergence and array CGH sensitivity was also addressed. Hybridization of DNA samples from closely related species, such as humans and chimpanzees, produced consistent results; however, the signal-to-noise ratio appeared to deteriorate with sequence divergence, such that human-orangutan hybridizations represented the upper limit to successful array CGH

comparisons (data not shown). Attempts at hybridization with DNA from more divergent

primates, such as baboon (Papio hamadryas) and macaque (Macaca mulatta), resulted in too much noise to identify variant sites with any confidence (data not shown). This indicates that evolutionary comparative genomic hybridizations are valid within approximately 3.5% average genomic sequence divergence, as the average sequence similarity between the orangutan and human genomes is approximately 0.965 (Yi et al.

2004).

Using the number of putative and validated sites of genomic variation among the human and great ape genomes (63), it is possible to estimate the rate of new duplication/deletion events in primates. First, lineage-specific events must be separated from ancestral events (i.e. those present in multiple species), taking into account the divergence time of the species with a common ancestor with humans. For example, a total of 13 variant sites were noted in the bonobo genomic profile, and 12 variant sites were observed in chimpanzee; however, among those variant sites, 8 were shared between both bonobo and chimpanzee genomes. Given the relatively recent divergence

(2 My) of the bonobo and chimpanzee lineages, this indicates that 5 bonobo variant sites and 4 chimpanzee variant sites have been produced in the post-divergence period.

Simply averaging the rates for the independent sites in these two lineages, and adjusting for sampling 12% of the human genome, results in an overall rate of 17-21 deletion/duplication events >40 kb in length per million years of primate evolution. The rates for independent events within the gorilla and orangutan lineages (17 events/My, 20 events/My, respectively) are also in agreement with this estimate. Unfortunately, without

a primate array to test against the human genome, the relative rate of events in one lineage compared to another cannot be accurately determined.

Given the lower bound of 40 kb for detectable events, and an artificial upper bound of an average BAC insert size (~175 kb), the rate of duplication and deletion events implies that between 800 kb and 3.5 Mb of material is potentially turned over per million years of primate evolution. This turnover has the potential to substantially alter the genomic landscape and affect the expression of neighboring genes, or perhaps alter the gene complement of a species.
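The rate and turnover estimates above follow from a few lines of arithmetic, restated here as a Python sketch. The function names are hypothetical. The site counts, the 2 My divergence time, the 12% sampling fraction and the 40 kb and 175 kb bounds are taken from the text, while the round figure of 20 events per million years used for the turnover calculation is an assumption that appears to correspond to the 800 kb to 3.5 Mb range quoted.

```python
def lineage_specific_sites(total_sites, shared_sites):
    """Sites observed in one lineage but not shared with its sister."""
    return total_sites - shared_sites


def events_per_my(lineage_sites, divergence_my, sampled_fraction):
    """Genome-wide events per million years, scaling the observed count
    by the fraction of the genome represented on the array."""
    return lineage_sites / sampled_fraction / divergence_my


def turnover_kb_per_my(rate_per_my, event_size_kb):
    """Sequence turned over per million years for a given event size."""
    return rate_per_my * event_size_kb


bonobo_sites = lineage_specific_sites(13, 8)       # 5
chimpanzee_sites = lineage_specific_sites(12, 8)   # 4

bonobo_rate = events_per_my(bonobo_sites, 2, 0.12)          # about 21
chimpanzee_rate = events_per_my(chimpanzee_sites, 2, 0.12)  # about 17

# Assumed nominal rate of 20 events/My for the turnover bounds.
turnover_low_kb = turnover_kb_per_my(20, 40)     # 800 kb
turnover_high_kb = turnover_kb_per_my(20, 175)   # 3500 kb = 3.5 Mb
```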

Patterns of Primate Genome Variation

Two important questions remained about these events, however, including whether there is a chromosomal position bias for these events (i.e. pericentromeric regions or subtelomeric regions) and if there are any sequence properties that are associated with sites of dosage variation between primates. To address the first question, one must utilize the STS information used to map each arrayed BAC clone. By searching for the position of each STS in the human genome assembly, it is possible to place each clone with respect to the pericentromeric and subtelomeric regions. Interestingly, the vast majority of sites of primate genomic variation, 54 of the 63 total or 86%, mapped outside the pericentromeric and subtelomeric regions. Thus, the events detected by array

CGH occurred in potentially gene rich regions and were not sequestered to regions of the genome which may be subject to reduced selective pressures. As for sequence properties which may correlate with the variant sites themselves, two trends were noted. First, chromosomes which harbored a large number of variant sites were correlated (r2=0.50)

with chromosomes that were enriched in segmental duplication content, determined by

whole-genome analyses (Bailey et al. 2002a). Secondly, and perhaps most important to

the overall theme of the studies contained within this thesis, is that 5 of the 9

experimentally validated sites of dosage imbalance in primate-human genome

comparisons were associated with regions of segmental duplication (Table 2-2).

Considering recent whole-genome analyses estimate the total segmental duplication

content of the entire human genome to be approximately 5%, the occurrence of segmental

duplications with 55% of validated dosage variant sites is an 11-fold enrichment (Bailey

et al. 2002a). When looking specifically at intrachromosomal duplications, which are

estimated to comprise 2.8% of the human genome and are the predominant type of

segmental duplication found at sites of primate variation, this enrichment increases to 14-fold. Thus, the sites of dosage variation between primate genomes, on the order of 40 kb to ~175 kb+, are frequently associated with intrachromosomal duplications, hinting that these sequences permit rapid change over short spans of evolutionary time.
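The 11-fold figure can be checked with a short Python sketch of a fold-enrichment calculation. The function name is hypothetical, and the calculation simply divides the observed fraction of validated sites associated with segmental duplication by the genome-wide fraction of duplicated sequence. Only the overall comparison (5 of 9 sites against a 5% background) is reproduced, as the number of validated sites carrying intrachromosomal duplications is given in Table 2-2 rather than in this passage.

```python
def fold_enrichment(associated_sites, total_sites, background_fraction):
    """Observed fraction of sites carrying a feature divided by the
    fraction of the genome occupied by that feature."""
    observed_fraction = associated_sites / total_sites
    return observed_fraction / background_fraction


# 5 of 9 validated sites against ~5% genome-wide segmental duplication.
overall_enrichment = fold_enrichment(5, 9, 0.05)  # about 11-fold
```

The same function would apply to intrachromosomal duplications by supplying the corresponding site count together with the 2.8% background.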

Pericentric Inversion of 15q11-q13 in Chimpanzee

Duplications and deletions, which result in a dosage change, are not the only type of evolutionary event, however. Translocations, inversions, and chromosome fusions are all types of evolutionary events which may greatly alter chromosome structure, yet not significantly affect genome content. The events are important to explore, however, due to the potential role of chromosomal evolution in driving speciation, and for the possibility that gene expression patterns may be altered at the breakpoints of such rearrangements. Early work in comparative genome analysis was performed at the

cytogenetic level using G-banded chromosomes, and identified numerous rearrangements

between humans and our great ape cousins (Yunis et al. 1980; Yunis and Prakash 1982).

The most common of these rearrangements were pericentric inversions, 9 of which separate the karyotypes of humans and chimpanzees. Among these 9 inversion events, one occurred in the chimpanzee genome orthologous to the human 15q11-q13 region, the subject of multiple analyses within this work (Chapters 3, 4, 5 and 6). A systematic approach was thus employed to narrow the breakpoint interval of the pericentric inversion identified over 20 years ago.

By using previously constructed YAC/BAC maps of the 15q11-q13 region, in addition to the early assemblies of the human genome, I selected a panel of BAC clones for comparative FISH analysis in the African and Asian great apes (Figure 3-1). Probes were selected in the unique regions between the three major PWS/AS breakpoints BP1,

BP2 and BP3, in addition to sites further distal of BP3 in 15q13 that encompassed the

BP4 region. Comparative FISH hybridizations revealed that the entire PWS/AS region is present on XVp in chimpanzee, with unknown ramifications concerning the imprinting of

15q11-q13 loci (Figure 3-2).

The breakpoint interval was narrowed to a 600 kb region in which all BAC clones tested produced dual FISH signals on both sides of the chimpanzee XV centromere

(Figure 3-3). The region encompassed by these BAC clones was part of a region of the human genome that was misassembled, due to discontiguous placement of BAC sequence in order to produce a consensus across the region, termed “warping”. The misassembly within the human genome was likely caused by the presence of several segmental duplications; in fact, the entire 600 kb region was comprised of paralogous

sequence (Figure 3-4). Out of the 600 kb region, two segmental duplications were

potentially involved with the inversion. This assertion is based upon the combined FISH

and sequence analysis results obtained with clone RP11-40J8, which contained two

duplications and produced dual signals flanking the chimpanzee XV centromere. BAC

RP11-40J8 contained a copy of the LCR15 duplicon, previously mapped to the 15q11-

q13 PWS/AS breakpoints as well as a large deletion associated with phobic disorders

(Gratacos et al. 2001; Pujana et al. 2001; Pujana et al. 2002). RP11-40J8 also contained

a duplication of the CHRNA7 locus which had been previously hypothesized to be a

human-specific duplication (Riley et al. 2002). The LCR15 and CHRNA7 duplications

comprised the entire sequence within RP11-40J8; thus, one of these sequences was likely

involved with the inversion rearrangement. Southern hybridization eliminated the CHRNA7 sequence as a candidate, confirming the suspicion of previous analyses, and left the

LCR15 sequence as the only remaining candidate (Figure 3-5).

Two models potentially explain the mechanism of pericentric inversion of 15q11-q13. First, the theoretical presence of an LCR15-related sequence on chimpanzee XVp may have facilitated the inversion via the interaction of paralogous sequences. This would imply, however, that a duplication of LCR15 was present in chimpanzee XVp prior to the inversion event. None of the BAC clones that contained LCR15 produced a

FISH signal on human 15p, which does not eliminate the paralogy-mediated event model, but requires the additional deletion of an LCR15 duplication from human 15p to be plausible. Alternatively, the pericentric inversion in chimpanzee occurred at a site susceptible to breakage. Two observations support this model. First, the presence of

LCR15 material in chimpanzee XVp is not required. Also, this model is in agreement

with the fact that in two independent lineages, two diverse rearrangement events have taken place at the same location in recent evolutionary time. In the human lineage, the recent duplication of the CHRNA7 locus occurred, and in the orthologous site in the chimpanzee genome, there was the pericentric inversion. As the chimpanzee genome represents a derived state, it will be important to see what changes have occurred at the sequence level; however, the sequence of the ancestral state should be pursued by sequencing the locus in additional primates.

As with the array CGH study, this work began with an unbiased perspective with respect to segmental duplications. There was no initial assumption that segmental duplications played a role in the pericentric inversion. In fact, clones devoid of segmental duplications were chosen initially in order to produce the most precise signals for evaluating the XVp or XVq location of a BAC probe. The association of a cluster of segmental duplications with a site of evolutionary rearrangement was only made once the region had been narrowed to the site of genome misassembly. This unbiased approach strengthens the association of segmental duplications with genome variation, and the findings of this study emphasize the malleability afforded by segmental duplications distinct from rearrangements within the pericentromeric and subtelomeric regions.

Evolution of the 15q Pericentromeric Region

As discussed in Chapter 1, several studies have demonstrated variation in the pericentromeric regions of primate chromosomes due to the propagation of segmental duplication in the great ape lineages. Unfortunately, few pericentromeric regions have contiguous sequence assemblies anchored in alpha satellite DNA. This prevents the

complete characterization of the evolution of human pericentromeric regions, which have an extremely complex history. Thus, some of the most rapidly evolving regions of the human and non-human primate genomes are overlooked due to the complications of sequence assembly in highly duplicated regions. The pericentromeric region of chromosome 15 provided such an example, as numerous segmental duplications had been mapped to the region, yet no studies of the evolutionary history of the region as a whole had been attempted. From the perspective of segmental duplications, the dissection of this region was critical to the development of methods used for further in-depth analysis of 15q11-q13 segmental duplication clusters (Chapter 6). Also, the development of a large BAC contig anchored in monomeric alpha satellite DNA allowed for an assessment of the completeness of the human genome through human metaphase FISH analysis.

Using stringent standards for BAC sequence overlap, and seed sequence related to the HERC2 duplicon analyzed in detail in Chapter 6, one of the largest pericentromeric

BAC assemblies was developed (Figure 4-1A). The proximal terminus of the contig lay in monomeric chromosome 15 alpha satellite sequence, indicating that the site of a putative alpha satellite/non-alpha satellite junction had been reached. As seen with the pericentromeric regions of human chromosomes 2, 10, 21 and 22, the pericentromeric region of chromosome 15q appeared to be a mosaic of interchromosomal and intrachromosomal segmental duplications (Jackson et al. 1999; Guy et al. 2000; Horvath et al. 2000a;

Horvath et al. 2000b; Footz et al. 2001; Brun et al. 2003; Guy et al. 2003) (Figure 4-1B).

In agreement with the two-step model of pericentromeric region evolution, the majority of segmental duplications identified within the 15q pericentromeric region shared homology with other pericentromeric regions, a reflection of the subsequent swapping

248

process which occurs after an initial seeding event to a pericentromeric region (Horvath

et al. 2000a; Horvath et al. 2000b; Horvath et al. 2001).

The development of a sequence contig within the pericentromeric region of human chromosome 15q allowed for two subsequent analyses. First, an assessment of the completeness of the human genome sequence in pericentromeric regions was possible through FISH mapping of the chromosome 15 pericentromeric clones on human metaphase chromosomes (Table 4-1). The 11 clones, which spanned approximately 1.2 Mb of pericentromeric sequence, produced a total of 71 FISH signals, 12 of which had no underlying support from the human genome sequence assembly, resulting in a discordance rate of 16.9%. Although this number is high, it should be noted that several contiguous 15q clones reported discordant signals on chromosome 1, which has the highest density of sequence gaps remaining in the genome. A similar phenomenon was noted for a chromosome 14q11 pericentromeric signal involving two contiguous BAC clones. Thus, using an independent BAC assembly of a pericentromeric region, it appears possible to identify genome sequence gaps within the pericentromeric regions of other chromosomes, which could be an effective tool for evaluating the sequence assembly quality in such regions.
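The discordance rate quoted above is simply the fraction of FISH signals that lack support in the genome assembly. The following minimal sketch reproduces that arithmetic from the two counts given in the text; the function name and its argument names are illustrative choices and do not come from the original analysis.

```python
def discordance_rate(total_signals: int, unsupported_signals: int) -> float:
    """Return the fraction of FISH signals lacking support in the assembly."""
    if total_signals <= 0:
        raise ValueError("total_signals must be positive")
    if not 0 <= unsupported_signals <= total_signals:
        raise ValueError("unsupported_signals must lie between 0 and total_signals")
    return unsupported_signals / total_signals


# Counts reported in the text: 71 signals in total, 12 without assembly support.
rate = discordance_rate(total_signals=71, unsupported_signals=12)
print(f"Discordance rate: {rate:.1%}")
```

The same function could be applied per chromosome to see which sites contribute most of the unsupported signals, provided the per-site counts in Table 4-1 are tabulated first.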

The reverse analysis, correlating in silico predicted FISH signals with actual experimental results, indicated that over 80% of the sites predicted by segmental duplication analyses were confirmed by FISH. Two factors may account for the lack of complete concordance. First, sequence similarity relationships are detectable below the threshold of efficient FISH hybridization, so sequence comparisons are more effective at identifying small, divergent sequences. Second, polymorphism of segmental duplication content within human pericentromeric regions could account for the presence of high-similarity sequence alignments to sites that yield no corresponding FISH signal.

Secondly, a comparative analysis using the set of 11 pericentromeric BAC clones as FISH probes on great ape metaphase chromosomes revealed extensive variation in the content and pattern of hybridization signals between species (Table 4-1). A trend emerged in which species more distantly related to humans produced fewer FISH signals, indicating that the accumulation of pericentromeric interchromosomal duplications is a relatively recent phenomenon in primate evolution. The process also appears to be ongoing, as lineage-specific duplication and/or deletion events were observed among great ape species. Thus, the restructuring of the pericentromeric regions of primate chromosomes is likely an active and potentially rapid process.

Determining Duplicon Ancestry By Mouse-Human Comparison

The availability of the mouse genome sequence has provided a new method for determining the ancestral position of a duplicated sequence within the human genome. Briefly, due to the lower level of segmental duplications observed in mouse, discussed in Chapter 1, sequence searches with a copy of a human-specific duplication may theoretically detect homology to the region of the mouse genome syntenic to the ancestral site of the human duplication. The 15q11 pericentromeric BAC contig was used as a pilot sequence for determining whether the method of human-mouse-human sequence comparison, which is performed through automated database searches, is in agreement with the method of determining the minimal evolutionary shared segment, or MESS, which is more subjective and not automated (Zhaoshi Jiang, unpublished) (Figure 4-4). Additionally, the determination of the ancestral segment by computational methods aided phylogenetic analysis, as agreement between the two methods greatly supported the initial result.

The mouse-human comparison data used for ancestry determination were derived from BLASTZ alignment of mouse and human genome sequence (Schwartz et al. 2003). The BLASTZ algorithm was optimized for creating long contiguous alignments between divergent sequences that would not have been possible using the BLAST algorithm alone (Altschul et al. 1990; Schwartz et al. 2003). The BLASTZ mouse-human alignments, once constructed and annotated in a database, are a fixed resource. This enhances the efficiency of the analysis, as only the alignment of the human duplicon with the mouse genome is required to determine ancestry; the subsequent mouse-human comparison was completed previously. It should be noted that although the level of segmental duplication is lower in the mouse genome than in the human genome, the coincidence of duplicated loci in both species will foil attempts at determining ancestry by this method. In such cases, phylogenetic analysis and MESS determination serve as alternative, compensatory methods. In this study, the application of the human-mouse comparative approach to the 15q11 pericentromeric contig aided the delineation of duplicons within the extensive mosaic of adjacent segmental duplications, and accelerated phylogenetic comparisons.
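The logic of the human-mouse-human comparison can be sketched in a few lines. The example below is an illustration under stated assumptions, not the pipeline used in this work: the `Interval` type, the function names, the representation of the precomputed mouse-human alignments as a list of paired intervals, and all coordinates are hypothetical. It projects the mouse hit of a duplicon back onto the human genome and reports the one human copy that overlaps the projection, returning no answer when the result is ambiguous (for example, when the locus is also duplicated in mouse).

```python
from typing import List, NamedTuple, Optional, Tuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def infer_ancestral_copy(
    human_copies: List[Interval],
    mouse_hit: Optional[Interval],
    mouse_to_human: List[Tuple[Interval, Interval]],
) -> Optional[Interval]:
    """Pick the human copy lying in the region syntenic to the mouse hit.

    mouse_to_human stands in for the fixed, precomputed mouse-human
    alignments: each pair is (mouse block, aligned human block).
    Returns None when there is no hit or more than one candidate.
    """
    if mouse_hit is None:
        return None
    projected = [h for m, h in mouse_to_human if overlaps(m, mouse_hit)]
    candidates = [c for c in human_copies
                  if any(overlaps(c, p) for p in projected)]
    return candidates[0] if len(candidates) == 1 else None


# Purely illustrative coordinates, not real loci.
human_copies = [Interval("chr15", 18_000_000, 18_050_000),
                Interval("chr2", 90_000_000, 90_050_000)]
synteny = [(Interval("chr6", 5_000_000, 5_200_000),
            Interval("chr2", 89_950_000, 90_150_000))]
mouse_hit = Interval("chr6", 5_050_000, 5_100_000)

ancestral = infer_ancestral_copy(human_copies, mouse_hit, synteny)
print(ancestral)
```

Returning no answer for ambiguous cases mirrors the caveat in the text: when both species carry duplicated loci, the projection alone cannot identify the ancestral site and another method is needed.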

The phylogenetic analyses permitted the approximation of the timing of the initial duplication event for several of the 15q11 duplicons (Table 4-2). Using a mutation rate estimate based upon chimpanzee-human sequence comparisons, which was in agreement with previous analyses of the 2p11 pericentromeric region, the majority of the 15q11 duplicons analyzed appeared to be older than the FISH data suggested (Horvath et al. 2003). This disparity was attributed to the limited effectiveness of FISH to detect substantially divergent paralogs. An interesting trend was observed, however, in which the proximity of a duplicon to the monomeric alpha satellite appeared to correlate with the evolutionary age of the duplicon. In other words, the younger duplications appeared to be closer to the centromere than the older duplications, which is in general agreement with a gradient model of duplication age with respect to the centromere (Horvath et al. 2001; Bailey et al. 2002b; Guy et al. 2003).
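Dating a duplication from sequence divergence rests on a simple relationship: two paralogs accumulate substitutions independently after the duplication, so the time since the event is approximately T = K / (2r), where K is the number of substitutions per site between the copies and r is the substitution rate per site per year. The sketch below illustrates this under assumptions that are not taken from the text: the Jukes-Cantor correction is used as a stand-in for whatever distance model was applied, and both the rate of 1.0e-9 and the 2% divergence are placeholder values rather than the estimates used in this work.

```python
import math


def jukes_cantor_distance(p: float) -> float:
    """Correct an observed proportion of differing sites for multiple hits."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


def duplication_age_years(p: float, rate_per_site_per_year: float) -> float:
    """Estimate time since duplication as K / (2r) for a pair of paralogs."""
    if rate_per_site_per_year <= 0:
        raise ValueError("rate_per_site_per_year must be positive")
    return jukes_cantor_distance(p) / (2.0 * rate_per_site_per_year)


# Placeholder inputs (assumptions): 2% observed divergence, rate of 1.0e-9.
ASSUMED_RATE = 1.0e-9
age = duplication_age_years(p=0.02, rate_per_site_per_year=ASSUMED_RATE)
print(f"Estimated age: {age / 1e6:.1f} million years")
```

Because the estimate scales inversely with the assumed rate, any uncertainty in the rate carries over directly to the inferred age.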

Overall, the comparative analysis of pericentromeric segmental duplications demonstrates the extreme fluidity of such regions. By combining computational, phylogenetic and experimental approaches, the dynamic nature of the 15q11 pericentromeric region becomes evident. In fact, the relatively young age (8-10 million years) of several duplicons within the 15q pericentromeric region indicates that the duplication process is active and may be rapidly re-patterning great ape chromosomes.

Chromosome 15q11-q13 Rearrangements and Array CGH

Thus far, this work has primarily investigated evolutionary rearrangements, which are fixed differences between species. Segmental duplications have also been shown to play a role in intra-species variation. This variation can arise in two forms: phenotypically silent genomic polymorphism, and clinically recognized genomic disorders, such as Prader-Willi and Angelman syndromes (PWS/AS) in 15q11-q13. Thus, segmental duplications within the 15q11-q13 region are interesting from both the evolutionary and clinical perspectives.


Three major clusters of segmental duplications have been associated with PWS/AS common deletions, termed BP1, BP2 and BP3 (Figure 5-1). Non-allelic homologous recombination (NAHR) is the proposed mechanism by which rearrangements between these large clusters induce deletions and duplications of large expanses of material in 15q11-q13 (Stankiewicz and Lupski 2002). Mapping studies, and efforts to close gaps within the 15q11-q13 genome sequence presented here (Chapter 6), permitted the development of a specialized BAC microarray with the ability to assess dosage variation at sites flanking these clusters of segmental duplications. Additionally, I was able to address the influence of segmental duplications in the array format by selecting sequenced BAC clones for the 15q11-q13 microarray that contained known duplicons. This study reflects the synergism between the array CGH methodology, segmental duplication analysis, and the mapping of highly duplicated regions of clinical importance.

To initiate this study, I selected a small set (20) of BAC clones spanning the 15q11-q13 region that were arrayed and hybridized with genomic DNA from patients with a wide variety of 15q11-q13 rearrangements (Figure 5-1). The BACs were targeted to sites flanking regions of segmental duplications, such as BP1, BP2 and BP3, but also included distal sites. Dosage variation within the patient sample population included 1, 2, 3, 4 and 6 copies of the BP2-to-BP3 region of 15q11-q13, as well as samples with extensive variation in the length of material involved in the rearrangement (Table 5-2). The patient samples had been extensively characterized by FISH and microsatellite analysis in the lab of Dr. Stuart Schwartz; thus, these samples represented a known quantity with which to test the efficacy of array CGH, and the effect of segmental duplications on BAC clones in the array format.

The primary conclusion of this study was that array CGH was an effective approach for the detection of dosage imbalances across a widely varied patient population. The relationship between fluorescence intensity ratio and dosage was linear, in line with previous array CGH studies, suggesting the technique has robustness and reproducibility (Figure 5-6). Analysis of the performance of clones containing segmental duplications led to two major conclusions. First, segmental duplications in the array format appear to be sensitive to dosage changes in related sequences. This “duplication sensitivity” applied regardless of the type of dosage imbalance, as both gains and losses at dis-contiguous sites resulted in aberrant fluorescence intensity ratios (Figure 5-7). The second major conclusion concerning segmental duplications was that clones containing segmental duplications may not respond with full sensitivity to dosage imbalances. As noted with the BACs derived from the 15q11 pericentromeric region, rearrangements that putatively involve these sequences did not demonstrate a significant dosage imbalance by array CGH. Thus, it appeared the fluorescence intensity ratio had been attenuated by the presence of multiple genomic copies of sequences related to the arrayed BACs.
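Both effects described above follow from a simple copy-counting argument. The sketch below is a toy model rather than the analysis performed in this study: it assumes that every genomic copy related to an arrayed clone hybridizes with equal efficiency, so that the expected test/reference ratio is the total number of related copies in the test genome divided by the total in the reference genome. The function name and the copy numbers are illustrative.

```python
import math
from typing import Sequence


def expected_ratio(test_copies: Sequence[int],
                   reference_copies: Sequence[int]) -> float:
    """Expected test/reference intensity ratio for one arrayed clone.

    Each sequence lists the diploid copy number of every locus assumed to
    hybridize to the clone: the clone's own locus first, then any paralogs.
    """
    total_reference = sum(reference_copies)
    if total_reference <= 0:
        raise ValueError("reference genome must carry at least one copy")
    return sum(test_copies) / total_reference


# Unique clone, heterozygous deletion: full response.
unique_deletion = expected_ratio([1], [2])
# Same deletion, but three unchanged paralogous loci dilute the signal.
attenuated_deletion = expected_ratio([1, 2, 2, 2], [2, 2, 2, 2])
# Clone's own locus is normal, but a paralog elsewhere has gained a copy.
paralog_gain = expected_ratio([2, 3], [2, 2])

for label, ratio in [("unique deletion", unique_deletion),
                     ("attenuated deletion", attenuated_deletion),
                     ("paralog gain", paralog_gain)]:
    print(f"{label}: ratio={ratio:.3f}, log2={math.log2(ratio):+.3f}")
```

In this model the attenuated deletion moves the ratio only from 1.0 to 0.875 instead of 0.5, and a gain at a paralogous site shifts the ratio of a clone whose own locus is unchanged, which corresponds to the two behaviors reported for duplicated clones.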

The wide variety of clinical samples analyzed in this study demonstrated some interesting properties of 15q11-q13 rearrangements in general. The majority of rearrangements analyzed had a distal breakpoint at BP3, which is consistent with previous analyses of PWS/AS deletions (Christian et al. 1995; Amos-Landgraf et al. 1999; Christian et al. 1999). There were two samples with more distal rearrangements, however, including a large supernumerary marker chromosome with a breakpoint in the more distal BP4 region, and an AS deletion with a breakpoint between BP3 and BP4 (Figure 5-4, Figure 5-5). The interesting aspect of the latter sample is that the distal breakpoint occurred in the region associated with the pericentric inversion of PTR XV described in Chapter 3. Perhaps this cluster of segmental duplications, which is the site of a recent duplication event in the human lineage and a large chromosome restructuring event in the chimpanzee lineage, should be considered BP3.5, or perhaps BP4 should be renamed BP5 and this site labeled BP4. Regardless of the nomenclature used, however, this site represents a potential distal PWS/AS breakpoint, which was easily identified through the application of the array CGH technique. Further characterization of PWS/AS samples should consider a distal break in this region, and subsequent genotype-phenotype correlations will be enhanced by the use of targeted microarrays, such as I have employed to analyze 15q11-q13 rearrangements.

Organization and Evolution of the 15q11-q13 PWS/AS Breakpoints

Although the Human Genome Project has been deemed complete, there remain gaps in regions of significant biological importance. The segmental duplication clusters that comprise PWS/AS breakpoints BP1, BP2 and BP3 are prime examples of how difficult it is to assemble such regions, as numerous gaps still exist at these locations in the “finished” genome assembly. Using the methodology for building contigs in highly duplicated regions which I employed in the study of the 15q11 pericentromeric region, I have also constructed contigs of BAC sequences in the PWS/AS breakpoints (Figure 6-1). Since the initiation of my work, it has been an over-arching goal to close gaps in the human genome, and in collaboration with the Whitehead Institute for Genome Research, this effort has resulted in the sequencing of numerous BAC clones for the purpose of gap closure (Chapter 6). In short, the method of assembly involves two major aspects: extension of contigs into duplicated areas from unique anchor regions, and separation of paralogous loci by the use of paralogous sequence variants (PSVs). These two approaches combined permit traversing of highly duplicated regions and achieving long sequence assemblies in these problematic areas.

The BAC sequence assemblies that correspond to BP1, BP2 and BP3 were subjected to segmental duplication detection methods (whole-genome alignment comparison, or WGAC) identical to those used in previous whole-genome segmental duplication analyses (Bailey et al. 2001; Bailey et al. 2002a) (Figure 6-2). Multiple segmental duplications, such as the HERC2 and LCR15 duplicons, have previously been mapped to the PWS/AS breakpoint clusters, and these intrachromosomal duplications were detected by the WGAC and MESS methodologies (Figure 6-3). The level of interchromosomal duplication, which comprised 39.9% of the total pairwise alignments >1 kb produced by the WGAC procedure, was surprising for multiple reasons. First, the extensive interchromosomal duplication within the PWS/AS breakpoints is indicative of these sequences behaving as duplicative transposition acceptor sites. In a sense, these sequences are behaving as pericentromeric regions in their ability to accept segmental duplications from interchromosomal sites. Second, in certain cases, the PWS/AS breakpoint locus was the only site of duplication; for example, the PARN duplicon at distal BP2 appears to be the result of a single duplication event to the BP2 region. The implication is that the PWS/AS breakpoint clusters are not duplicates of each other, but are instead distinct mosaics of duplications, a subset of which are common to all three breakpoints (Figure 6-2). This may indicate that the mechanism which has maintained the BP1, BP2 and BP3 structures acts in the central portion of the breakpoints, but not necessarily the edge sequences, as breakpoint-specific duplications such as the PARN duplication are located at the periphery of the breakpoint clusters.

Tracking the HERC2 Duplicon

The sequence similarity between BP1, BP2 and BP3 is extensive, and the most frequent constituents of the breakpoint clusters are the LCR15 and HERC2 duplicons (Figure 6-5). For example, 15 loci related to the HERC2 gene were identified by sequence similarity searches of the 15q11-q13 scaffold sequences (Figure 6-6). In addition, it was the HERC2 locus sequence that served as the initial query sequence for building the 15q11-q13 scaffolds themselves. What was unexpected at the initiation of this study, however, was that the HERC2 duplicon varied so widely in size. For example, the duplicons which inhabit the central regions of the breakpoints are >80 kb in size (Figure 6-7). These large duplicons are frequently flanked by smaller more degenerate HERC2 duplicons >5 kb in length. From a strictly qualitative perspective, it appears as if the central, or core, duplicons have been maintained at a full length, and an unknown mechanism has permitted the HERC2 duplicons at the edge of the breakpoints to acquire mutations neutrally, leaving only detectable homology to the 5’ region of the locus. The phenomenon is not HERC2-specific, as LCR15 duplicons, which were found adjacent to HERC2 duplicons in both humans and baboons, are also found adjacent to the degenerate peripheral HERC2 duplicons, yet they are also smaller in size and more divergent than the LCR15 duplicons in the core regions of the breakpoint clusters.

Timing the Expansion of the HERC2 Duplicon

BAC library hybridization experiments along with Southern hybridizations of a panel of primate genomic DNAs suggest that the initial duplication of the HERC2 locus occurred after the divergence of the prosimians from a common ancestor of the primate lineages (Table 6-2, Figure 6-9). This observation was supported by comparative FISH analysis of an extensive variety of primates (Figure 6-10). Multiple metaphase and interphase FISH signals were detected in the Old World monkey species, suggesting the initial duplication of HERC2 occurred prior to the divergence of these species from a primate common ancestor. It is interesting to note that the prevalence of the HERC2 duplicon does not appear to have changed substantially since the initial expansion prior to the divergence of the Old World monkeys. Certain duplications, such as the morpheus duplicon on chromosome 16, have shown a wide variation in copy number over short spans of evolutionary time (Johnson et al. 2001). As discussed extensively in Chapter 1, several duplicated segments have also been shown to have duplicated and expanded rapidly within the pericentromeric region of primate chromosomes. The vast majority of HERC2 duplicons (14/15), however, are not pericentromeric. In this sense, it appears that the overall number of duplicons has stayed relatively constant over 25 million years of primate evolution.

Specifically concerning the baboon lineage, it is noteworthy that the HERC2 duplications appeared by interphase FISH experiments to be in clusters, suggesting that there may be a somewhat similar organization to HERC2 duplications in humans and baboons. Further support for this assertion came from the sequencing of several baboon BAC clones which were identified by hybridization with STS probes derived from the human HERC2 gene (Figure 6-11). The analysis of these baboon clones, in addition to a chimpanzee clone sequenced at the University of Oklahoma by Dr. Bruce Roe, resulted in several striking observations. First, HERC2 duplications were present in the baboon and chimpanzee genomes at positions non-orthologous to the location of HERC2 duplicons in the human genome. The implication of this finding is that there has been significant re-patterning of the baboon and chimpanzee genomes specifically in close proximity to the HERC2 duplicon, further supporting the association of segmental duplications with sites of evolutionary genomic rearrangement. Secondly, from the comparative sequencing of baboon clones it was observed that the LCR15 and HERC2 duplicons were found in a tandem orientation. This suggests that the initial association of HERC2 and LCR15 sequences occurred prior to the divergence of the great apes, and that the initial burst of segmental duplication leading to the structure observed in the human genome predates the divergence of Old World monkey species from a primate common ancestor.

The Disparity Between HERC2 Homology and Initial Duplication Timing

As demonstrated by the phylogenetic analysis, there is a disparity between the genetic distance of the large 15q11-q13 HERC2 duplicons and the age of the initial duplication of the duplicon (Figure 6-13). This disparity is explained by the periodic homogenization of HERC2 sequences over evolutionary time. In this manner, the number of large duplicons remains relatively constant, while the sequence similarity between duplicons is maintained. Incidental mis-alignment of sequences during a homogenization event could produce multiple copies of the HERC2 duplicon, only a subset of which would be maintained in subsequent homogenizations. Thus, the new copy of the duplicon would be subject to mutation, including deletions and point mutations, which could in effect erode the duplicon to the small edge class of HERC2 duplicon observed at the periphery of the PWS/AS breakpoint clusters (Figure 7-1).

Figure 7-1. Model of PWS/AS breakpoint evolution and HERC2 homogenization. Initial duplication of the HERC2 locus results in a rearranged progenitor duplicon. The progenitor duplicon is then duplicated among the PWS/AS breakpoint clusters. The unequal transposition of material to the core of the breakpoints pushes the older copies to the edge. After repeated cycles, core and edge HERC2 duplicons diverge, producing the structure observed in the human genome.


Although there is no direct evidence for homogenization of dis-contiguous HERC2 loci which facilitates the degradation of peripheral copies, the sequence similarity between dis-contiguous HERC2 duplicons does point to a process of homogenization. The alternative explanation, that the HERC2 duplicon has duplicated extremely recently in the human lineage to the similar copy number seen within the Old World monkey lineage with a similar association of flanking duplicons, appears unlikely.

FUTURE DIRECTIONS

The extremely diverse experimental approaches taken within this work lead to a wide variety of potential directions for future research. The most immediate of these possibilities is the continued refinement of the human genome assembly within proximal 15q. The trials and tribulations I have encountered in attempting to close gaps within the 15q11-q13 region testify to the difficulty of this task, and the requirement for handling these regions separately from the rest of the genome. Simply put, closing gaps within the highly duplicated regions of the genome must be a priority, and not a secondary goal, of a genome project for the assembly of duplicated regions to reveal patterns of genome evolution in these dynamic regions.

Implications for Gap Closure in the PWS/AS Breakpoints

The continued sequencing and assembly of duplicated regions, such as the PWS/AS breakpoints, will have a significant benefit. Aside from the general desire to have a “complete” genome sequence, the sequence of the PWS/AS breakpoints must be obtained in order to investigate the precise genetic events that lead to PWS/AS rearrangements. In other words, in order to clone the breakpoints from 15q11-q13 rearrangement patient samples, a complete reference sequence must be in place. Additionally, to investigate further the mechanism of HERC2 homogenization from the comparative perspective, a complete human sequence is required. In fact, the complete reference sequence of the 15q11-q13 breakpoint segmental duplication clusters may impact a variety of studies including evolutionary comparisons and variation within the human population.

From the perspective of intra-specific genome variation, studies of variation in duplication content in the highly variable PWS/AS breakpoint regions (which may point to haplotypes more prone to producing PWS/AS deletions and 15q11-q13 rearrangements in general) will remain impossible without a full sequence map. Interestingly, some of these questions can be directly assessed by sequencing BACs within the RP-11 BAC library itself. Since the RP-11 library was derived from a diploid source, two alleles of chromosome 15 are contained within the library, and thus, two copies of BP1, BP2 and BP3. One observation made during the course of contig construction in 15q11-q13 was that by using the 99.9% similarity threshold, there was the potential for building allele-specific contigs. This was evident from the alignment of finished BAC sequences which was inconsistent with the standard “dovetail” configuration of overlapping BAC clones. If two finished BAC sequences match at 99.9% similarity or greater, yet do not fit a dovetail configuration, it is suggestive of either recent (likely human-specific) duplication or a heteromorphic site with variation between the two alleles contained within the RP-11 BAC library. Thus, by using such a high threshold for assembly, one may assemble both haplotypes within the diploid BAC library. Therefore, if one were to “over-sequence” the highly duplicated regions of 15q11-q13, one could build allele-specific sequences and isolate heteromorphisms down to the base pair level. Whether these putative heteromorphisms represented true variants or artifacts of propagating human genomic DNA in a cloning vector could be resolved by testing variants by PCR assays in a small sample of diploid individuals.
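The decision rule described above, a 99.9% similarity threshold combined with a check for dovetail configuration, can be expressed as a small classifier. The sketch below is an illustration under simplifying assumptions and is not the software used for contig construction: the `Alignment` record, the category labels and the `slack` tolerance of 500 bp are all hypothetical, and only same-strand alignments are considered.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """One pairwise alignment between finished BAC sequences A and B."""
    a_start: int   # 0-based start of the aligned block in A
    a_end: int     # exclusive end of the aligned block in A
    a_len: int     # total length of A
    b_start: int
    b_end: int
    b_len: int
    identity: float  # fraction of identical aligned bases


def is_dovetail(aln: Alignment, slack: int = 500) -> bool:
    """True if the block runs off the clone ends as a true overlap would."""
    a_at_start = aln.a_start <= slack
    a_at_end = aln.a_end >= aln.a_len - slack
    b_at_start = aln.b_start <= slack
    b_at_end = aln.b_end >= aln.b_len - slack
    suffix_prefix = a_at_end and b_at_start   # end of A overlaps start of B
    prefix_suffix = a_at_start and b_at_end   # start of A overlaps end of B
    contained = (a_at_start and a_at_end) or (b_at_start and b_at_end)
    return suffix_prefix or prefix_suffix or contained


def classify(aln: Alignment, min_identity: float = 0.999,
             slack: int = 500) -> str:
    """Apply the similarity threshold, then the dovetail test."""
    if aln.identity < min_identity:
        return "below_threshold"
    if is_dovetail(aln, slack):
        return "consistent_overlap"
    return "duplication_or_allelic_variant"


# Illustrative alignments with made-up coordinates.
overlap = Alignment(120_000, 180_000, 180_000, 0, 60_000, 170_000, 0.9995)
internal = Alignment(50_000, 90_000, 180_000, 70_000, 110_000, 170_000, 0.9993)
paralog = Alignment(120_000, 180_000, 180_000, 0, 60_000, 170_000, 0.985)

for name, aln in [("overlap", overlap), ("internal", internal),
                  ("paralog", paralog)]:
    print(name, classify(aln))
```

Pairs placed in the third category are the candidates for follow-up, since the classifier alone cannot distinguish a recent duplication from allelic variation between the two haplotypes in the library.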

Comparative Sequencing

The future of comparative sequencing is changing as this dissertation is being written. The sequencing of the chimpanzee genome is underway, and studies presented here, such as the analysis of the chimpanzee pericentric inversion presented in Chapter 3, will proceed from the opposite direction: bioinformatic analyses will identify regions to explore experimentally. This does not mean that the “wet bench” scientist will be obsolete, however. What this alteration in approach will mean is that the experimental science will be streamlined and extremely efficient, as base pair-level assays will be easily designed. The improvement in efficiency, however, will be dependent on the depth of sequence coverage in non-human primates.

Large-scale comparative sequencing of evolutionarily dynamic regions of the genome is another facet of the implications of this thesis work, and a potential area for future exploration. Specifically, tracking the expansion of a duplication such as the HERC2 duplicon, which likely plays a role in human genomic disorders, through primate evolution would entail the sequencing of overlapping BAC clones from a multitude of primate BAC libraries. Although a HERC2-primate sequencing project would result in relatively few megabases of finished sequence, the result from such a study could be quite interesting. First, the analysis of the first HERC2 duplicon could give important clues as to the mechanism of formation. Secondly, despite the in-depth analysis presented in Chapter 6, the question still remains as to why this highly malleable structure, as we see in the human genome and potentially other non-human primate genomes, is maintained in the population. What selective advantage is there to the species such that the penalty for rearrangement is outweighed? To address this question, comparative sequencing in a variety of species from prosimians to great apes would be required.

A caveat to the new age of primate comparative genomics, ushered in by the sequencing of the chimpanzee genome, is that the chimpanzee genome, and potentially additional primate genomes, will not be sequenced to the same depth and quality as the human genome. Additionally, as the predominant method used to sequence the chimpanzee genome is the whole-genome shotgun (WGS) method, there will be numerous gaps within the chimpanzee genome sequence, and the assembly of the highly duplicated areas is likely to be suspect. In addition, attempts to assemble chimpanzee sequence on a human framework may bias the resulting chimpanzee genome assembly. In regions of questionable human genome assembly, such as the BP1, BP2 and BP3 regions of 15q11-q13, the errors will compound upon each other, and obtaining an accurate sequence of the orthologous chimpanzee regions will be unlikely at best. Thus, the tried-and-true method of BAC sequencing to finished quality remains the best alternative for comparing regions of high paralogy in orthologous relationships.


Comparative Array CGH Analysis

It should also be noted that although the sequencing of non-human primate genomes to significant depth will undoubtedly aid studies of genome evolution, the application of array CGH technology to comparative genomics will have continued uses. The ability to assess dosage variation at a multitude of sites in numerous individuals will allow for large-scale statistical analyses to address questions of population variation in natural populations, and to compare the level of variation within our species with that of other species. Although the scope of array CGH experiments currently remains a subset of the entire human genome, future developments in the density of genomic arrays will increase the amount of assayed material per experiment, increasing the power of such analyses. In addition, the array CGH approach allows for investigation of primate species that have not been targeted for genome sequencing. For uncovering the genomic diversity between closely related sub-species of orangutan (i.e. the Sumatran and Bornean orangutans), a comparison of array profiles may identify regions that differentiate these sub-species at an extremely low cost in comparison to genome sequencing projects.

Diagnosis of Genomic Disease and Disease Discovery

By my estimation, the most important future direction of this thesis work will be in the area of clinical genomics, or the ascertainment of dosage imbalances in clinical samples on a genome-wide scale. The initial pilot study presented here, for the 15q11-q13 region, took advantage of a well-characterized set of patient samples, in order to test the efficacy of the array CGH technique to detect dosage imbalance in clinical samples. From these initial results, it appears that the technique is ready for expansion to encompass the entire genome.

The coming-of-age of array CGH technology has wide-reaching implications, some of which are potentially quite daunting. From the purely scientific perspective, whole-genome profiling for dosage gains and losses will allow for the investigation of variation within the human population. This is an interesting subject in itself, for comparative genomics among humans on such a scale has never before been possible.

Also, by targeting regions of segmental duplication for array CGH profiling, based upon genome-wide duplication analyses such as those pioneered by Dr. Jeffrey Bailey, one has theoretically maximized the potential for identifying variant sites. Colloquially put, if segmental duplications are where the action is, array profiling such sites is the surest way to identify frequent polymorphisms in the population that result in dosage imbalances.

On an evolutionary level, array profiling of non-human primates in regions of potential variation could also shed additional light on the role of segmental duplications in generating genomic diversity between species. Additionally, new sites for targeted comparative sequencing would be easily identified by such a method. As a proof of principle, the primate array CGH profiling study presented here represents a source for numerous potential sites of primate variation for investigation by comparative sequencing. The results of array profiling primates with a genome-wide array focused on regions of segmental duplication, however, could potentially enrich the results significantly due to the synergy between segmental duplications and genomic rearrangements established by this work.


Lastly, the most important implication of this work is the potential of genomic profiling as a diagnostic tool for humans with suspected clinical disease. Based on the study presented here (Chapter 5), profiling of PWS/AS and pseudo-dicentric 15 patients is possible using the current technology. Expanding genomic array coverage to include many more sites of potential rearrangement, as well as known deletion/duplication syndrome regions, would provide the clinician with a tool for identifying diseases. Additionally, array profiling could identify genomic dosage gains and losses in patients with undetermined genomic abnormalities. Patients with idiopathic mental retardation or autism, for example, represent two populations for which exact genetic causes are commonly unknown. By investigating the potential for genomic imbalances in such patient populations, one can address multiple questions. First, are there regions of genomic dosage imbalance that appear in multiple individuals with a particular phenotype, and if so, are those rearrangements potentially causative? Second, identifying specific regions that associate with certain disease states provides information for further gene mapping studies to identify candidate genes. In this sense, genomic profiling may point the way to specific genes involved in idiopathic mental retardation and autism.

In addition, while a detected dosage imbalance may itself cause the observed phenotype, mutations in the genes identified by genomic profiling may help explain the same phenotype in patients with intact dosage for those genes.
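The first question posed above, whether the same region of dosage imbalance recurs in multiple individuals with a given phenotype, can be illustrated with a minimal sketch. This is not the analysis pipeline used in this work. The patient and clone names, the log2 ratio values, and the calling threshold of 0.3 are all hypothetical and serve only to show the tallying logic.

```python
from collections import Counter

# Hypothetical threshold on the log2(test/reference) ratio; not taken from this work.
LOG2_THRESHOLD = 0.3


def call_dosage(log2_ratio, threshold=LOG2_THRESHOLD):
    """Classify one clone's log2 ratio as 'gain', 'loss', or 'normal'."""
    if log2_ratio > threshold:
        return "gain"
    if log2_ratio < -threshold:
        return "loss"
    return "normal"


def recurrent_imbalances(profiles, min_patients=2, threshold=LOG2_THRESHOLD):
    """Count imbalances that recur across patients.

    profiles maps a patient ID to a dict of clone name -> log2 ratio.
    Returns {(clone, call): patient count} for every gain or loss
    observed in at least min_patients profiles.
    """
    counts = Counter()
    for ratios in profiles.values():
        for clone, value in ratios.items():
            call = call_dosage(value, threshold)
            if call != "normal":
                counts[(clone, call)] += 1
    return {key: n for key, n in counts.items() if n >= min_patients}


# Made-up illustrative values, not data from this study.
example_profiles = {
    "patient_1": {"clone_A": -0.9, "clone_B": 0.05, "clone_C": 0.6},
    "patient_2": {"clone_A": -0.8, "clone_B": -0.02, "clone_C": 0.1},
    "patient_3": {"clone_A": 0.02, "clone_B": 0.01, "clone_C": 0.7},
}

print(recurrent_imbalances(example_profiles))
```

A clone that is flagged in several unrelated patients with the same phenotype would then be a candidate region for follow-up mapping. A fixed threshold is the simplest possible caller, and a real analysis would need to account for array noise and for clones that contain segmental duplications.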

There is another aspect to the genomic profiling equation, however, with an ethical component. Genomic profiling is one step on the way towards personalized medicine, whereby a vast amount of biological information may stem from the ascertainment of one’s genomic make-up. This information is likely to result in a series of probabilities. In other words, profiling Person X may reveal a landscape consisting of genotypes AA, Bb and cc. Theoretically, this combination could be associated with an increased risk of lung cancer should Person X smoke, or perhaps a decreased risk of type II diabetes given a sedentary lifestyle. The security of that information remains an unaddressed question with extremely important implications. Although such a question is beyond the scope of a single empirical scientist, the progress of genomic technology and its application to human medicine have reached a point where addressing this issue will become important for everyone in the near future.

CONCLUSION

The studies presented in this work encompass a multitude of approaches to investigating the role of segmental duplications in genomic rearrangements. From the evolutionary perspective, the introduction of the array CGH method to comparative studies opens the field to novel approaches for comparing species, as well as for assessing polymorphism within a species. In addition, the comparison of the human genome to that of non-human primates emphasized the association of segmental duplications with rearrangements that result in dosage gains and losses. The potential of such rearrangements to affect the gene complement of a species, or the expression patterns of its genes, warrants further study in this area. This work also characterized the association of segmental duplications with a chromosomal restructuring event that is visible at the cytogenetic level. The presence of two lineage-specific rearrangements in humans and chimpanzees at this site suggests a model by which the region is susceptible to breakage and subsequent rearrangement. Analysis of the mosaic of pericentromeric segmental duplications in 15q11 provided further evidence for a re-patterning of the pericentromeric regions of primate chromosomes in the great apes, and also served as a pilot study for the application of mouse-human genome comparisons to determine the ancestry of a duplicated segment. This work also allowed for the assessment of genome coverage in the pericentromeric regions of the human genome through a direct comparison of computational and experimental methods. Through the continued analysis of the 15q11-q13 region, the application of the array CGH methodology to assessing dosage imbalance in patient samples harboring 15q11-q13 rearrangements was explored. Again, the mapping of clusters of segmental duplications present at the breakpoints of 15q11-q13 rearrangements was critical to developing an effective BAC microarray. Additionally, this work investigated the effect of including clones that contain segmental duplications in the array format. As approximately 5% of the human genome is composed of segmental duplications, the implication for whole-genome arrays with 30,000+ BAC clones becomes evident: approximately 1,500 BAC clones on such an array could potentially produce inaccurate results due to duplication sensitivity. Lastly, this work furthered the sequence assembly of the 15q11-q13 region, specifically in areas with high concentrations of segmental duplications such as the PWS/AS common deletion breakpoints. Surprisingly, these regions are a complex mosaic of interchromosomal and intrachromosomal segmental duplications. This type of structure is strikingly similar to that of the pericentromeric regions of a subset of human chromosomes. In addition, evolutionary analysis suggests that the PWS/AS breakpoint sequences have undergone homogenization, perhaps repeatedly, in multiple primate lineages. The mechanism of homogenization and the mechanism that gives rise to PWS/AS rearrangements are likely to be intimately related. Thus, these studies provide a strong framework for subsequent analyses of the extremely variable 15q11-q13 region. Additionally, this study provides the framework for the experimental use of array CGH profiling of primates for comparative studies, in addition to the application of array CGH technology to profiling human genomic DNA samples for disease ascertainment. As this is likely the most far-reaching aspect of this work, it will be interesting to see the future development and application of this technology for basic as well as applied science in a surge towards personalized medicine.
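The estimate of approximately 1,500 duplication-sensitive BAC clones given in this conclusion follows from simple proportional arithmetic. The sketch below reproduces it, assuming clones are spread uniformly across the genome so that the fraction of clones affected equals the duplicated fraction of the genome. That uniformity is a simplifying assumption made here for illustration, and the function name is hypothetical.

```python
def expected_duplication_clones(total_clones, duplicated_fraction):
    """Expected number of clones falling in segmental duplications.

    Assumes (as a simplification) that clones sample the genome uniformly,
    so the affected share of clones equals the duplicated share of the genome.
    """
    return round(total_clones * duplicated_fraction)


# 30,000 BAC clones and a 5% duplicated fraction, the figures quoted in the text.
print(expected_duplication_clones(30_000, 0.05))
```

Because a clone that only partially overlaps a duplication may also behave abnormally, the true number of affected clones on a given array could differ from this first-order figure.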

BIBLIOGRAPHY

Albertson, D.G., B. Ylstra, R. Segraves, C. Collins, S.H. Dairkee, D. Kowbel, W.L. Kuo, J.W. Gray, and D. Pinkel. 2000. Quantitative mapping of amplicon structure by array CGH identifies CYP24 as a candidate oncogene. Nat Genet 25: 144-146. Altschul, S.F., W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. 1990. Basic local alignment search tool. J. Molec. Biol. 215: 403-410. Amos-Landgraf, J.M., Y. Ji, W. Gottlieb, T. Depinet, A.E. Wandstrat, S.B. Cassidy, D.J. Driscoll, P.K. Rogan, S. Schwartz, and R.D. Nicholls. 1999. Chromosome breakage in the Prader-Willi and Angelman syndromes involves recombination between large, transcribed repeats at proximal and distal breakpoints. Am J Hum Genet 65: 370-386. Angelman, H. 1965. 'Puppet children': a report of three cases. Dev. Med. Child Neurol. 7: 681-688. Archidiacono, N., R. Antonacci, R. Marzella, P. Finelli, A. Lonoce, and M. Rocchi. 1995. Comparative mapping of human alphoid sequences in great apes using fluorescence in situ hybridization. Genomics 25: 477-484. Armengol, L., M.A. Pujana, J. Cheung, S.W. Scherer, and X. Estivill. 2003. Enrichment of segmental duplications in regions of breaks of synteny between the human and mouse genomes suggest their involvement in evolutionary rearrangements. Hum Mol Genet 12: 2201-2208. Armour, J.A., C. Sismani, P.C. Patsalis, and G. Cross. 2000. Measurement of locus copy number by hybridisation with amplifiable probes. Nucleic Acids Res 28: 605-609. Arnold, N., J. Wienberg, K. Emert, and H. Zachau. 1995. Comparative mapping of DNA probes derived from the Vk immunoglobulin gene regions on human and great ape chromosomes by fluorescence in situ hybridization. Genomics 26: 147-156. BACPAC_CHORI_link. Bailey, G., R. Poulter, and P. Stockwell. 1978. Gene duplication in tetraploid fish: Model for gene silencing at unlinked duplicated loci. Proc. Natl. Acad. Sci. USA 75: 5575-5579. Bailey, J.A., R. Baertsch, W.J. Kent, D. Haussler, and E.E. Eichler. 2004. 
Hotspots of mammalian chromosomal evolution. Genome Biol 5: R23. Bailey, J.A., Z. Gu, R.A. Clark, K. Reinert, R.V. Samonte, S. Schwartz, M.D. Adams, E.W. Myers, P.W. Li, and E.E. Eichler. 2002a. Recent segmental duplications in the human genome. Science 297: 1003-1007. Bailey, J.A., G. Liu, and E.E. Eichler. 2003. An Alu transposition model for the origin and expansion of human segmental duplications. Am J Hum Genet 73: 823-834. Bailey, J.A., A.M. Yavor, H.F. Massa, B.J. Trask, and E.E. Eichler. 2001. Segmental duplications: organization and impact within the current human genome project assembly. Genome Res 11: 1005-1017. Bailey, J.A., A.M. Yavor, L. Viggiano, D. Misceo, J.E. Horvath, N. Archidiacono, S. Schwartz, M. Rocchi, and E.E. Eichler. 2002b. Human-specific duplication and mosaic transcripts: the recent paralogous structure of chromosome 22. Am J Hum Genet 70: 83-100.

271

Bailey, W.J., D.H. Fitch, D.A. Tagle, J. Czelusniak, J.L. Slightom, and M. Goodman. 1991. Molecular evolution of the psi eta-globin gene locus: gibbon phylogeny and the hominoid slowdown. Mol Biol Evol 8: 155-184. Barber, J.C., I.E. Cross, F. Douglas, J.C. Nicholson, K.J. Moore, and C.E. Browne. 1998. Neurofibromatosis pseudogene amplification underlies euchromatic cytogenetic duplications and triplications of proximal 15q. Hum Genet 103: 600-607. Bellefroid, E., J. Marine, A. Matera, C. Bourguignon, T. Desai, K. Healy, P. Bray-Ward, J. Martial, J. Ihle, and D. Ward. 1995. Emergence of the ZNF91 Kruppel- associated box-containing zinc finger gene family in the last common ancestor of anthropoidea. Proc. Natl. Acad. Sci. USA 92: 10757-10761. Blennow, E., K.B. Nielsen, H. Telenius, N.P. Carter, U. Kristoffersson, E. Holmberg, C. Gillberg, and M. Nordenskjold. 1995. Fifty probands with extra structurally abnormal chromosomes characterized by fluorescence in situ hybridization. Am J Med Genet 55: 85-94. Boccaccio, I., H. Glatt-Deeley, F. Watrin, N. Roeckel, M. Lalande, and F. Muscatelli. 1999. The human MAGEL2 gene and its mouse homologue are paternally expressed and mapped to the Prader-Willi region. Hum Mol Genet 8: 2497-2505. Borden, P., R. Jaenichen, and H. Zachau. 1990. Structural features of transposed human Vk genes and implications for the mechanism of their transpositions. Nucleic Acids Res. 18: 2101-2107. Bower, B.D. and P.M. Jeavons. 1967. The "happy puppet" syndrome. Arch Dis Child 42: 298-302. Brand-Arpon, V., S. Rouquier, H. Massa, P.J. de Jong, C. Ferraz, P.A. Ioannou, J.G. Demaille, B.J. Trask, and D. Giorgi. 1999. A genomic region encompassing a cluster of olfactory receptor genes and a myosin light chain kinase (MYLK) gene is duplicated on human chromosome regions 3q13-q21 and 3p13. Genomics 56: 98-110. Britten, R.J. and D.E. Kohne. 1968. Repeated sequences in DNA. 
Hundreds of thousands of copies of DNA sequences have been incorporated into the genomes of higher organisms. Science 161: 529-540. Browne, C., N. Dennis, E. Maher, F. Long, J. Nicholson, J. Sillibourne, and C. Barber. 1997. Inherited interstitial duplication of proximal 15q: genotype-phenotype correlations. Am. J. Hum. Genet. 61: 1342-1352. Bruder, C.E., C. Hirvela, I. Tapia-Paez, I. Fransson, R. Segraves, G. Hamilton, X.X. Zhang, D.G. Evans, A.J. Wallace, M.E. Baser et al. 2001. High resolution deletion analysis of constitutional DNA from neurofibromatosis type 2 (NF2) patients using microarray-CGH. Hum Mol Genet 10: 271-282. Brun, M.E., M. Ruault, M. Ventura, G. Roizes, and A. De Sario. 2003. Juxtacentromeric region of human chromosome 21: a boundary between centromeric heterochromatin and euchromatic chromosome arms. Gene 312: 41-50. Buiting, K., V. Greger, B.H. Brownstein, R.M. Mohr, I. Voiculescu, A. Winterpacht, B. Zabel, and B. Horsthemke. 1992. A putative gene family in 15q11-13 and 16p11.2: possible implications for Prader-Willi and Angelman Syndromes. Proc Natl Acad Sci USA 89: 5457-5461. Buiting, K., S. Gross, Y. Ji, G. Senger, R. Nicholls, and B. Horsthemke. 1998. Expressed copies of the MN7(D15F37) gene family map close to the common deletion

272

breakpoints in the Prader-Willi/Angelman syndromes. Cytogenet Cell Genet 81: 247-253. Buiting, K., C. Korner, B. Ulrich, E. Wahle, and B. Horsthemke. 1999. The human gene for the poly(A)-specific ribonuclease (PARN) maps to 16p13 and has a truncated copy in the Prader-Willi/Angelman syndrome region on 15q11-->q13. Cytogenet Cell Genet 87: 125-131. Cassidy, S.B., E. Dykens, and C.A. Williams. 2000. Prader-Willi and Angelman syndromes: sister imprinted disorders. Am J Med Genet 97: 136-146. Chai, J.H., D.P. Locke, J.M. Greally, J.H. Knoll, T. Ohta, J. Dunai, A. Yavor, E.E. Eichler, and R.D. Nicholls. 2003. Identification of four highly conserved genes between breakpoint hotspots BP1 and BP2 of the Prader-Willi/Angelman syndromes deletion region that have undergone evolutionary transposition mediated by flanking duplicons. Am J Hum Genet 73: 898-925. Chai, J.H., D.P. Locke, T. Ohta, J.M. Greally, and R.D. Nicholls. 2001. Retrotransposed genes such as Frat3 in the mouse Chromosome 7C Prader-Willi syndrome region acquire the imprinted status of their insertion site. Mamm Genome 12: 813-821. Chen, F.C. and W.H. Li. 2001. Genomic divergences between humans and other hominoids and the effective population size of the common ancestor of humans and chimpanzees. Am J Hum Genet 68: 444-456. Cheung, J., M.D. Wilson, J. Zhang, R. Khaja, J.R. MacDonald, H.H. Heng, B.F. Koop, and S.W. Scherer. 2003. Recent segmental and gene duplications in the mouse genome. Genome Biol 4: R47. Cheung, V.G., N. Nowak, W. Jang, I.R. Kirsch, S. Zhao, X.N. Chen, T.S. Furey, U.J. Kim, W.L. Kuo, M. Olivier et al. 2001. Integration of cytogenetic landmarks into the draft sequence of the human genome. Nature 409: 953-958. Christian, S.L., N.K. Bhatt, S.A. Martin, J.S. Sutcliffe, T. Kubota, B. Huang, A. Mutirangura, A.C. Chinault, A.L. Beaudet, and D.H. Ledbetter. 1998. Integrated YAC contig map of the Prader-Willi/Angelman region on chromosome 15q11- q13 with average STS spacing of 35 kb. 
Genome Res 8: 146-157. Christian, S.L., J.A. Fantes, S.K. Mewborn, B. Huang, and D.H. Ledbetter. 1999. Large genomic duplicons map to sites of instability in the Prader- Willi/Angelman syndrome chromosome region (15q11-q13). Hum Mol Genet 8: 1025-1037. Christian, S.L., W.P. Robinson, B. Huang, A. Mutirangura, M.R. Line, M. Nakao, U. Surti, A. Chakravarti, and D.H. Ledbetter. 1995. Molecular characterization of two proximal deletion breakpoint regions in both Prader-Willi and Angelman syndrome patients. Am J Hum Genet 57: 40-48. Clayton-Smith, J., D.J. Driscoll, M.F. Waters, T. Webb, T. Andrews, S. Malcolm, M.E. Pembrey, and R.D. Nicholls. 1993. Difference in methylation patterns within the D15S9 region of chromosome 15q11-13 in first cousins with Angelman syndrome and Prader-Willi syndrome. Am J Med Genet 47: 683-686. Collins, C., S. Volik, D. Kowbel, D. Ginzinger, B. Ylstra, T. Cloutier, T. Hawkins, P. Predki, C. Martin, M. Wernick et al. 2001. Comprehensive genome sequence analysis of a breast cancer amplicon. Genome Res 11: 1034-1042. Collins, F.S., A. Patrinos, E. Jordan, A. Chakravarti, R. Gesteland, and L. Walters. 1998. New goals for the U.S. Human genome project: 1998-2003. Science 282: 682- 689.

273

Conroy, J.M., T.A. Grebe, L.A. Becker, K. Tsuchiya, R.D. Nicholls, K. Buiting, B. Horsthemke, S.B. Cassidy, and S. Schwartz. 1997. Balanced translocation 46,XY,t(2;15)(q37.2;q11.2) associated with atypical Prader-Willi syndrome. Am J Hum Genet 61: 388-394. Crosier, M., L. Viggiano, J. Guy, D. Misceo, R. Stones, W. Wei, T. Hearn, M. Ventura, N. Archidiacono, M. Rocchi et al. 2002. Human paralogs of KIAA0187 were created through independent pericentromeric-directed and chromosome-specific duplication mechanisms. Genome Res 12: 67-80. Daigo, Y., S.F. Chin, K.L. Gorringe, L.G. Bobrow, B.A. Ponder, P.D. Pharoah, and C. Caldas. 2001. Degenerate oligonucleotide primed-polymerase chain reaction- based array comparative genomic hybridization for extensive amplicon profiling of breast cancers : a new approach for the molecular analysis of paraffin- embedded cancer tissue. Am J Pathol 158: 1623-1631. Dehal, P., P. Predki, A.S. Olsen, A. Kobayashi, P. Folta, S. Lucas, M. Land, A. Terry, C.L. Ecale Zhou, S. Rash et al. 2001. Human chromosome 19 and related regions in mouse: conservative and lineage-specific evolution. Science 293: 104-111. Deininger, P.L. and M.A. Batzer. 1999. Alu repeats and human disease. Mol Genet Metab 67: 183-193. Deininger, P.L. and C.W. Schmid. 1976. Thermal stability of human DNA and chimpanzee DNA heteroduplexes. Science 194: 846-848. Dewannieux, M., C. Esnault, and T. Heidmann. 2003. LINE-mediated retrotransposition of marked Alu sequences. Nat Genet 35: 41-48. Donlon, T.A. 1988. Similar molecular deletions on chromosome 15q11.2 are encountered in both the Prader-Willi and Angelman syndromes. Hum Genet 80: 322-328. Down, J.L. 1887. Mental Affections of Childhood and Youth. Churchill, London. Driscoll, D.J., M.F. Waters, C.A. Williams, R.T. Zori, C.C. Glenn, K.M. Avidano, and R.D. Nicholls. 1992. A DNA methylation imprint, determined by the sex of the parent, distinguishes the Angelman and Prader-Willi syndromes. Genomics 13: 917-924. Dunham, I., N. 
Shimizu, B.A. Roe, S. Chissoe, A.R. Hunt, J.E. Collins, R. Bruskiewich, D.M. Beare, M. Clamp, L.J. Smink et al. 1999. The DNA sequence of human chromosome 22. Nature 402: 489-495. Dyomin, V.G., S.R. Chaganti, K. Dyomina, N. Palanisamy, V.V. Murty, R. Dalla-Favera, and R.S. Chaganti. 2002. BCL8 is a novel, evolutionarily conserved human gene family encoding with presumptive protein kinase A anchoring function. Genomics 80: 158-165. Ebersberger, I., D. Metzler, C. Schwarz, and S. Paabo. 2002. Genomewide comparison of DNA sequences between humans and chimpanzees. Am J Hum Genet 70: 1490- 1497. Eichler, E.E. 1998. Masquerading repeats: Paralogous pitfalls of the Human Genome. Genome Res. 8: 758-762. Eichler, E.E., M.L. Budarf, M. Rocchi, L.L. Deaven, N.A. Doggett, A. Baldini, D.L. Nelson, and H.W. Mohrenweiser. 1997. Interchromosomal duplications of the adrenoleukodystrophy locus: a phenomenon of pericentromeric plasticity. Hum Molec Genet 6: 991-1002.

274

Eichler, E.E. and P.J. DeJong. 2002. Biomedical applications and studies of molecular evolution: a proposal for a primate genomic library resource. Genome Res 12: 673-678. Eichler, E.E., S.M. Hoffman, A.A. Adamson, L.A. Gordon, P. McCready, J.E. Lamerdin, and H.W. Mohrenweiser. 1998. Complex beta-satellite repeat structures and the expansion of the zinc finger gene cluster in 19p12. Genome Res. 8: 791-808. Eichler, E.E., M.E. Johnson, C. Alkan, E. Tuzun, C. Sahinalp, D. Misceo, N. Archidiacono, and M. Rocchi. 2001. Divergent origins and concerted expansion of two segmental duplications on chromosome 16. J Hered 92: 462-468. Eichler, E.E., F. Lu, Y. Shen, R. Antonacci, V. Jurecic, N.A. Doggett, R.K. Moyzis, A. Baldini, R.A. Gibbs, and D.L. Nelson. 1996. Duplication of a gene-rich cluster between 16p11.1 and Xq28: a novel pericentromeric-directed mechanism for paralogous genome evolution. Hum Molec Genet 5: 899-912. Emanuel, B.S. and T.H. Shaikh. 2001. Segmental Duplications: An 'Expanding' Role in Genomic Instability and Disease. Nat Rev Genet 2: 791-800. Emberger, J.M., M. Rodiere, J. Astruc, and D. Brunel. 1977. [The Prader-Willi syndrome and 15-15 translocation]. Ann Genet 20: 297-300. Enard, W., P. Khaitovich, J. Klose, S. Zollner, F. Heissig, P. Giavalisco, K. Nieselt- Struwe, E. Muchmore, A. Varki, R. Ravid et al. 2002. Intra- and interspecific variation in primate gene expression patterns. Science 296: 340-343. Engel, E. 1980. A new genetic concept: uniparental disomy and its potential effect, isodisomy. Am J Med Genet 6: 137-143. Ewing, B. and P. Green. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 8: 186-194. Ewing, B., L. Hillier, M.C. Wendl, and P. Green. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 8: 175-185. Fantes, J.A., S.K. Mewborn, C.M. Lese, J. Hedrick, R.L. Brown, V. Dyomin, R.S. Chaganti, S.L. Christian, and D.H. Ledbetter. 2002. 
Organisation of the pericentromeric region of chromosome 15: at least four partial gene copies are amplified in patients with a proximal duplication of 15q. J Med Genet 39: 170- 177. Footz, T.K., P. Brinkman-Mills, G.S. Banting, S.A. Maier, M.A. Riazi, L. Bridgland, S. Hu, B. Birren, S. Minoshima, N. Shimizu et al. 2001. Analysis of the Cat Eye Syndrome Critical Region in Humans and the Region of Conserved Synteny in Mice: A Search for Candidate Genes at or near the Human Chromosome 22 Pericentromere. Genome Res 11: 1053-1070. Force, A., M. Lynch, F.B. Pickett, A. Amores, Y.L. Yan, and J. Postlethwait. 1999. Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151: 1531-1545. Fraccaro, M., O. Zuffardi, E.M. Buhler, and L.P. Jurik. 1977. 15/15 translocation in Prader-Willi syndrome. J Med Genet 14: 275-276. Frazer, K.A., X. Chen, D.A. Hinds, P.V. Pant, N. Patil, and D.R. Cox. 2003. Genomic DNA insertions and deletions occur frequently between humans and nonhuman primates. Genome Res 13: 341-346. Friedman, R. and A.L. Hughes. 2001. Pattern and timing of gene duplication in animal genomes. Genome Res 11: 1842-1847.

275

Friedman, R. and A.L. Hughes. 2004. Two Patterns of Genome Organization in Mammals: The Chromosomal Distribution of Duplicate Genes in Human and Mouse. Mol Biol Evol. Gabriel, J.M., M. Merchant, T. Ohta, Y. Ji, R.G. Caldwell, M.J. Ramsey, J.D. Tucker, R. Longnecker, and R.D. Nicholls. 1999. A transgene insertion creating a heritable chromosome deletion mouse model of Prader-Willi and angelman syndromes. Proc Natl Acad Sci U S A 96: 9258-9263. Gaunt, S.J. 1991. Expression patterns of mouse Hox genes: clues to an understanding of developmental and evolutionary strategies. Bioessays 13: 505-513. Gibbs, R.A. G.M. Weinstock M.L. Metzker D.M. Muzny E.J. Sodergren S. Scherer G. Scott D. Steffen K.C. Worley P.E. Burch et al. 2004. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428: 493- 521. Gilles, F., A. Goy, Y. Remache, K. Manova, and A.D. Zelenetz. 2000. Cloning and characterization of a Golgin-related gene from the large-scale polymorphism linked to the PML gene. Genomics 70: 364-374. Gimelli, G., M.A. Pujana, M.G. Patricelli, S. Russo, D. Giardino, L. Larizza, J. Cheung, L. Armengol, A. Schinzel, X. Estivill et al. 2003. Genomic inversions of human chromosome 15q11-q13 in mothers of Angelman syndrome patients with class II (BP2/3) deletions. Hum Mol Genet 12: 849-858. Glenn, C.C., R.D. Nicholls, W.P. Robinson, S. Saitoh, N. Niikawa, A. Schinzel, B. Horsthemke, and D.J. Driscoll. 1993. Modification of 15q11-q13 DNA methylation imprints in unique Angelman and Prader-Willi patients. Hum Mol Genet 2: 1377-1382. Glusman, G., I. Yanai, I. Rubin, and D. Lancet. 2001. The complete human olfactory subgenome. Genome Res 11: 685-702. Goldberg, Y.P., J.M. Rommens, S.E. Andrew, G.B. Hutchinson, B. Lin, J. Theilmann, R. Graham, M.L. Glaves, E. Starr, H. McDonald et al. 1993. Identification of an Alu retrotransposition event in close proximity to a strong candidate gene for Huntington's disease. Nature 362: 370-373. Golfier, G., F. 
Chibon, A. Aurias, X.N. Chen, J. Korenberg, J. Rossier, and M.C. Potier. 2003. The 200-kb segmental duplication on human chromosome 21 originates from a pericentromeric dissemination involving human chromosomes 2, 18 and 13. Gene 312: 51-59. Gong, Z., C.C. Hui, and C.L. Hew. 1995. Presence of isl-1-related LIM domain homeobox genes in teleost and their similar patterns of expression in brain and spinal cord. J Biol Chem 270: 3335-3345. Goodman, M. 1999. The genomic record of Humankind's evolutionary roots. Am J Hum Genet 64: 31-39. Goodman, M., D.A. Tagle, D.H. Fitch, W. Bailey, J. Czelusniak, B.F. Koop, P. Benson, and J.L. Slightom. 1990. Primate evolution at the DNA level and a classification of hominoids. J Mol Evol 30: 260-266. Gordon, D., C. Abajian, and P. Green. 1998. Consed: a graphical tool for sequence finishing. Genome Res 8: 195-202. Gratacos, M., M. Nadal, R. Martin-Santos, M.A. Pujana, J. Gago, B. Peral, L. Armengol, I. Ponsa, R. Miro, A. Bulbena et al. 2001. A polymorphic genomic duplication on

276

human chromosome 15 is a susceptibility factor for panic and phobic disorders. Cell 106: 367-379. Graur, D. and W. Li. 2000. Fundamentals of Molecular Evolution. Sinauer Associates, Sunderland, MA. Gu, X., Y. Wang, and J. Gu. 2002. Age distribution of human gene families shows significant roles of both large- and small-scale duplications in vertebrate evolution. Nat Genet 31: 205-209. Gunn, S.R., M. Mohammed, X.T. Reveles, D.H. Viskochil, J.C. Palumbos, T.L. Johnson- Pais, D.E. Hale, J.L. Lancaster, L.J. Hardies, O. Boespflug-Tanguy et al. 2003. Molecular characterization of a patient with central nervous system dysmyelination and cryptic unbalanced translocation between chromosomes 4q and 18q. Am J Med Genet 120A: 127-135. Guy, J., T. Hearn, M. Crosier, J. Mudge, L. Viggiano, D. Koczan, H.J. Thiesen, J.A. Bailey, J.E. Horvath, E.E. Eichler et al. 2003. Genomic sequence and transcriptional profile of the boundary between pericentromeric satellites and genes on human chromosome arm 10p. Genome Res 13: 159-172. Guy, J., C. Spalluto, A. McMurray, T. Hearn, M. Crosier, L. Viggiano, V. Miolla, N. Archidiacono, M. Rocchi, C. Scott et al. 2000. Genomic sequence and transcriptional profile of the boundary between pericentromeric satellites and genes on human chromosome arm 10q. Hum Mol Genet 9: 2029-2042. Halal, F. and J. Chagnon. 1976. ["Happy puppet" syndrome]. Union Med Can 105: 1077- 1083. Haldane, J.B.S. 1932. The Causes of Evolution. Longmans, Green and Co. Limited, New York. Hawkey, C.J. and A. Smithies. 1976. The Prader-Willi syndrome with a 15/15 translocation. Case report and review of the literature. J Med Genet 13: 152-157. Hillier, L.W. R.S. Fulton L.A. Fulton T.A. Graves K.H. Pepin C. Wagner-McPherson D. Layman J. Maas S. Jaeger R. Walker et al. 2003. The DNA sequence of human chromosome 7. Nature 424: 157-164. Hodgson, G., J.H. Hager, S. Volik, S. Hariono, M. Wernick, D. Moore, D.G. Albertson, D. Pinkel, C. Collins, D. Hanahan et al. 2001. 
Genome scanning with array CGH delineates regional alterations in mouse islet carcinomas. Nat Genet 29: 459-464. Hoglund, M., F. Mitelman, and N. Mandahl. 1995. A human 12p-derived cosmid hybridizing to subsets of human and chimpanzee telomeres. Cytogenet Cell Genet 70: 88-91. Hollox, E.J., T. Atia, G. Cross, T. Parkin, and J.A. Armour. 2002. High throughput screening of human subtelomeric DNA for copy number changes using multiplex amplifiable probe hybridisation (MAPH). J Med Genet 39: 790-795. Horvath, J., S. Schwartz, and E. Eichler. 2000a. The mosaic structure of a 2p11 pericentromeric segment: A strategy for characterizing complex regions of the human genome. Genome Res 10: 839-852. Horvath, J., L. Viggiano, B. Loftus, M. Adams, M. Rocchi, and E. Eichler. 2000b. Molecular structure and evolution of an alpha/non-alpha satellite junction at 16p11. Hum Molec Genet 9: 113-123.

277

Horvath, J.E., J.A. Bailey, D.P. Locke, and E.E. Eichler. 2001. Lessons from the human genome: transitions between euchromatin and heterochromatin. Hum Mol Genet 10: 2215-2223. Horvath, J.E., C.L. Gulden, J.A. Bailey, C. Yohn, J.D. McPherson, A. Prescott, B.A. Roe, P.J. De Jong, M. Ventura, D. Misceo et al. 2003. Using a pericentromeric interspersed repeat to recapitulate the phylogeny and expansion of human centromeric segmental duplications. Mol Biol Evol 20: 1463-1479. Huang, B., J. Crolla, S. Christian, M. Wolf-Ledbetter, M. Macha, P. Papenhausen, and D. Ledbetter. 1997. Refined molecular characterization of the breakpoints in small inv dup (15) chromosomes. Hum. Genet. 99: 11-17. Hughes, A.L., J. da Silva, and R. Friedman. 2001. Ancient genome duplications did not structure the human hox-bearing chromosomes. Genome Res 11: 771-780. IHGSC. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860- 921. ISCN. 1985. Report of the standing committee on human cytogenetic nomenclature. Birth Defects 21: 1-117. Ishkanian, A.S., C.A. Malloff, S.K. Watson, R.J. DeLeeuw, B. Chi, B.P. Coe, A. Snijders, D.G. Albertson, D. Pinkel, M.A. Marra et al. 2004. A tiling resolution DNA microarray with complete coverage of the human genome. Nat Genet 36: 299-303. Jackson, M.S., M. Rocchi, G. Thompson, T. Hearn, M. Crosier, J. Guy, D. Kirk, L. Mulligan, A. Ricco, S. Piccininni et al. 1999. Sequences flanking the centromere of human chromosome 10 are a complex patchwork of arm-specific sequences, stable duplications, and unstable sequences with homologies to telomeric and other centromeric locations. Hum Mol Genet 8: 205-215. Jain, A.N., T.A. Tokuyasu, A.M. Snijders, R. Segraves, D.G. Albertson, and D. Pinkel. 2002. Fully automatic quantification of microarray image data. Genome Res 12: 325-332. Jelinek, W.R., T.P. Toomey, L. Leinwand, C.H. Duncan, P.A. Biro, P.V. Choudary, S.M. Weissman, C.M. Rubin, C.M. Houck, P.L. Deininger et al. 1980. 
Ubiquitous, interspersed repeated sequences in mammalian genomes. Proc Natl Acad Sci U S A 77: 1398-1402. Ji, Y., E.E. Eichler, S. Schwartz, and R.D. Nicholls. 2000a. Structure of chromosomal duplicons and their role in mediating human genomic disorders. Genome Res 10: 597-610. Ji, Y., N.A. Rebert, J.M. Joslin, M.J. Higgins, R.A. Schultz, and R.D. Nicholls. 2000b. Structure of the highly conserved HERC2 gene and of multiple partially duplicated paralogs in human. Genome Res 10: 319-329. Ji, Y., M.J. Walkowicz, K. Buiting, D.K. Johnson, R.E. Tarvin, E.M. Rinchik, B. Horsthemke, L. Stubbs, and R.D. Nicholls. 1999. The ancestral gene for transcribed, low-copy repeats in the Prader- Willi/Angelman region encodes a large protein implicated in protein trafficking, which is deficient in mice with neuromuscular and spermiogenic abnormalities. Hum Mol Genet 8: 533-542. Johnson, M.E., L. Viggiano, J.A. Bailey, M. Abdul-Rauf, G. Goodwin, M. Rocchi, and E.E. Eichler. 2001. Positive selection of a gene family during the emergence of humans and African apes. Nature 413: 514-519.

278

Kaessmann, H., V. Wiebe, G. Weiss, and S. Paabo. 2001. Great ape DNA sequences reveal a reduced diversity and an expansion in humans. Nat Genet 27: 155-156. Kallioniemi, A., O.P. Kallioniemi, D. Sudar, D. Rutovitz, J.W. Gray, F. Waldman, and D. Pinkel. 1992. Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science 258: 818-821. Karaman, M.W., M.L. Houck, L.G. Chemnick, S. Nagpal, D. Chawannakul, D. Sudano, B.L. Pike, V.V. Ho, O.A. Ryder, and J.G. Hacia. 2003. Comparative analysis of gene-expression patterns in human and African great ape cultured fibroblasts. Genome Res 13: 1619-1630. Kashiwagi, H. and K. Uchida. 2000. Genome-wide profiling of gene amplification and deletion in cancer. Hum Cell 13: 135-141. Kazazian, H.H., Jr. 2000. Genetics. L1 retrotransposons shape the mammalian genome. Science 289: 1152-1153. Kehrer-Sawatzki, H., B. Schreiner, S. Tanzer, M. Platzer, S. Muller, and H. Hameister. 2002. Molecular characterization of the pericentric inversion that causes differences between chimpanzee chromosome 19 and human chromosome 17. Am J Hum Genet 71: 375-388. Kehrer-Sawatzki, H., T. Schwickardt, G. Assum, G. Rocchi, and W. Krone. 1997. A third neurofibromatosis type 1 (NF1) pseudogene at chromosome 15q11.2. Hum Genet 100: 595-600. Kelley, M.J., M. Pech, H.N. Seuanez, J.S. Rubin, S.J. O'Brien, and S.A. Aaronson. 1992. Emergence of the keratinocyte growth factor multigene family during the great ape radiation. Proc Natl Acad Sci U S A 89: 9287-9291. Kent, W.J., R. Baertsch, A. Hinrichs, W. Miller, and D. Haussler. 2003. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A 100: 11484-11489. Kent, W.J., C.W. Sugnet, T.S. Furey, K.M. Roskin, T.H. Pringle, A.M. Zahler, and D. Haussler. 2002. The human genome browser at UCSC. Genome Res 12: 996- 1006. Kimura, M. 1968. Evolutionary rate at the molecular level. Nature 217: 624-626. Kimura, M. 1980. 
A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 16: 111-120. King, M. 1993. Species Evolution: the Role of Chromosome Change. Caimbridge University Press, New York. King, M.C. and A.C. Wilson. 1975. Evolution at two levels in humans and chimpanzees. Science 188: 107-116. Kishino, T., M. Lalande, and J. Wagstaff. 1997. UBE3A/E6-AP mutations cause Angelman syndrome. Nat Genet 15: 70-73. Knoll, J., R. Nicholls, R. Magenis, J. Grham, M. Lalande, and S. Latt. 1989. Angelman and Prader-Willi syndromes share a common chromosome 15 deletion but differ in parental origin of the deletion. Am. J. Med. Genet. 32: 285-290. Kumar, S. and S.B. Hedges. 1998. A molecular timescale for vertebrate evolution. Nature 392: 917-920. Kumar, S., K. Tamura, and M. Nei. 1993. MEGA: Molecular Evolutionary Genetic Analysis, version 1.0. Pennsylvania State University, University Park.


Kuwano, A., A. Mutirangura, B. Dittrich, K. Buiting, B. Horsthemke, S. Saitoh, N. Niikawa, S.A. Ledbetter, F. Greenberg, A.C. Chinault et al. 1992. Molecular dissection of the Prader-Willi/Angelman syndrome region (15q11-13) by YAC cloning and FISH analysis [published erratum appears in Hum Mol Genet 1992 Dec;1(9):784]. Hum Mol Genet 1: 417-425. Lercher, M.J., N.G. Smith, A. Eyre-Walker, and L.D. Hurst. 2002. The evolution of isochores: evidence from SNP frequency distributions. Genetics 162: 1805-1810. Li, W.H. and M. Tanimura. 1987. The molecular clock runs more slowly in man than in apes and monkeys. Nature 326: 93-96. Lichter, P., C.J. Tang, K. Call, G. Hermanson, G.A. Evans, D. Housman, and D.C. Ward. 1990. High-resolution mapping of human chromosome 11 by in situ hybridization with cosmid clones. Science 247: 64-69. Lin, J.Y., J.R. Pollack, F.L. Chou, C.A. Rees, A.T. Christian, J.S. Bedford, P.O. Brown, and M.H. Ginsberg. 2002. Physical mapping of genes in somatic cell radiation hybrids by comparative genomic hybridization to cDNA microarrays. Genome Biol 3: RESEARCH0026. Liu, G., S. Zhao, J.A. Bailey, S.C. Sahinalp, C. Alkan, E. Tuzun, E.D. Green, and E.E. Eichler. 2003. Analysis of primate genomic variation reveals a repeat-driven expansion of the human genome. Genome Res 13: 358-368. Locke, D.P., N. Archidiacono, D. Misceo, M.F. Cardone, S. Deschamps, B. Roe, M. Rocchi, and E.E. Eichler. 2003a. Refinement of a chimpanzee pericentric inversion breakpoint to a segmental duplication cluster. Genome Biol 4: R50. Locke, D.P., R. Segraves, L. Carbone, N. Archidiacono, D.G. Albertson, D. Pinkel, and E.E. Eichler. 2003b. Large-scale variation among human and great ape genomes determined by array comparative genomic hybridization. Genome Res 13: 347- 357. Loftus, B., U. Kim, V. Sneddon, F. Kalush, R. Brandon, J. Fuhrmann, T. Mason, M. Crosby, M. Barnstead, L. Cronin et al. 1999. 
Genome duplications and other features in 12 Mbp of DNA sequence from human chromosome 16p and 16q. Genomics 60: 295-308. Luke, S. and R.S. Verma. 1995. The genomic sequence for Prader-Willi/Angelman syndromes' loci of human is apparently conserved in the great apes. J Mol Evol 41: 250-252. Lupski, J.R. 1998. Genomic disorders: structural features of the genome can lead to DNA rearrangements and human disease traits. Trends Genet 14: 417-422. Lynch, M. and J.S. Conery. 2000. The evolutionary fate and consequences of duplicate genes. Science 290: 1151-1155. MacDonald, H.R. and R. Wevrick. 1997. The necdin gene is deleted in Prader-Willi syndrome and is imprinted in human and mouse. Hum Mol Genet 6: 1873-1878. Magenis, R.E., M.G. Brown, D.A. Lacy, S. Budden, and S. LaFranchi. 1987. Is Angelman syndrome an alternate result of del(15)(q11q13)? Am J Med Genet 28: 829-838. Malcolm, S., J. Clayton-Smith, M. Nichols, S. Robb, T. Webb, J.A. Armour, A.J. Jeffreys, and M.E. Pembrey. 1991. Uniparental paternal disomy in Angelman's syndrome. Lancet 337: 694-697.


Mancuso, D.J., E.A. Tuley, L.A. Westfield, T.L. Lester-Mancuso, M.M. Le Beau, J.M. Sorace, and J.E. Sadler. 1991. Human von Willebrand factor gene and pseudogene: structural analysis and differentiation by polymerase chain reaction. Biochemistry 30: 253-269. Matsuura, T., J.S. Sutcliffe, P. Fang, R.J. Galjaard, Y.H. Jiang, C.S. Benton, J.M. Rommens, and A.L. Beaudet. 1997. De novo truncating mutations in E6-AP ubiquitin-protein ligase gene (UBE3A) in Angelman syndrome. Nat Genet 15: 74-77. Mazzarella, R. and D. Schlessinger. 1997. Duplication and distribution of repetitive elements and non-unique regions in the human genome. Gene 205: 29-38. Mazzarella, R. and D. Schlessinger. 1998. Pathological consequences of sequence duplications in the human genome. Genome Res 8: 1007-1021. Mefford, H.C., E. Linardopoulou, D. Coil, G. van den Engh, and B.J. Trask. 2001. Comparative sequencing of a multicopy subtelomeric region containing olfactory receptor genes reveals multiple interactions between non-homologous chromosomes. Hum Mol Genet 10: 2363-2372. Mefford, H.C. and B.J. Trask. 2002. The complex structure and dynamic evolution of human subtelomeres. Nat Rev Genet 3: 91-102. Mignon, C., F. Parente, C. Stavropoulou, P. Collignon, A. Moncla, C. Turc-Carel, and M.G. Mattei. 1997. Inherited DNA amplification of the proximal 15q region: cytogenetic and molecular studies. J Med Genet 34: 217-222. Miki, Y., T. Katagiri, F. Kasumi, T. Yoshimoto, and Y. Nakamura. 1996. Mutation analysis in the BRCA2 gene in primary breast cancers. Nat Genet 13: 245-247. Mondello, C., L. Pirzio, C.M. Azzalin, and E. Giulotto. 2000. Instability of interstitial telomeric sequences in the human genome. Genomics 68: 111-117. Monfouilloux, S., H. Avet-Loiseau, V. Amarger, I. Balazs, C. Pourcel, and G. Vergnaud. 1998. Recent human-specific spreading of a subtelomeric domain [In Process Citation]. Genomics 51: 165-176. Morin, G.B. 1989. 
The human telomere terminal transferase enzyme is a ribonucleoprotein that synthesizes TTAGGG repeats. Cell 59: 521-529. Muller, H.J. 1936. Bar duplication. Science 83: 528-530. Muller, S., R. Stanyon, P.C. O'Brien, M.A. Ferguson-Smith, R. Plesker, and J. Wienberg. 1999. Defining the ancestral karyotype of all primates by multidirectional chromosome painting between tree shrews, lemurs and humans. Chromosoma 108: 393-400. Muller, S. and J. Wienberg. 2001. "Bar-coding" primate chromosomes: molecular cytogenetic screening for the ancestral hominoid karyotype. Hum Genet 109: 85-94. Mural, R.J. M.D. Adams E.W. Myers H.O. Smith G.L. Miklos R. Wides A. Halpern P.W. Li G.G. Sutton J. Nadeau et al. 2002. A comparison of whole-genome shotgun-derived mouse chromosome 16 and the human genome. Science 296: 1661-1671. Mutirangura, A., A. Jayakumar, J. Sutcliffe, M. Nakao, M. McKinney, K. Buiting, B. Horsthemke, A. Beaudet, A. Chinault, and D. Ledbetter. 1993. A complete YAC contig of the Prader-Willi/Angelman chromosome region (15q11-q13) and refined localization of the SNRPN gene. Genomics 18: 546-552.


Myers, E.W. and W. Miller. 1988. Optimal alignments in linear space. Comput Appl Biosci 4: 11-17. Nadeau, J. and D. Sankoff. 1997. Landmarks in the Rosetta Stone of mammalian comparative maps. Nature Genet. 15: 6-7. Nadeau, J.H. and B.A. Taylor. 1984. Lengths of chromosomal segments conserved since divergence of man and mouse. Proc Natl Acad Sci U S A 81: 814-818. Nei, M. 1969. Gene duplication and nucleotide substitution in evolution. Nature 221: 40- 42. Nei, M. and A.K. Roychoudhury. 1973. Probability of fixation and mean fixation time of an overdominant mutation. Genetics 74: 371-380. Newman, T. and B.J. Trask. 2003. Complex evolution of 7E olfactory receptor genes in segmental duplications. Genome Res 13: 781-793. Nicholls, R.D. 1994. New insights reveal complex mechanisms involved in genomic imprinting. Am J Hum Genet 54: 733-740. Nicholls, R.D. and J.L. Knepper. 2001. Genome organization, function, and imprinting in Prader-Willi and Angelman syndromes. Annu Rev Genomics Hum Genet 2: 153- 175. Nicholls, R.D., J.H. Knoll, M.G. Butler, S. Karam, and M. Lalande. 1989. Genetic imprinting suggested by maternal heterodisomy in nondeletion Prader-Willi syndrome. Nature 342: 281-285. Nickerson, E., R.A. Gibbs, and D.L. Nelson. 1999. Sequence analysis of the breakpoints of a pericentric inversion distinguishing the human and chimpanzee chromosomes 12. Am J Hum Genet 65 (Supplement): A56. Nickerson, E. and D.L. Nelson. 1998. Molecular definition of pericentric inversion breakpoints occurring during the evolution of humans and chimpanzees. Genomics 50: 368-372. Nietzel, A., B. Albrecht, H. Starke, A. Heller, G. Gillessen-Kaesbach, U. Claussen, and T. Liehr. 2003. Partial hexasomy 15pter-->15q13 including SNRPN and D15S10: first molecular cytogenetically proven case report. J Med Genet 40: e28. O'Brien, S.J. and R. Stanyon. 1999. Phylogenomics. Ancestral primate viewed. Nature 402: 365-366. Ohno, S. 1970. Evolution by Gene Duplication. 
Springer Verlag, Berlin/Heidelberg/New York. Ohno, S. 1972. An argument for the genetic simplicity of man and other mammals. J. Hum. Evol. 1: 651-662. Ohno, S., U. Wolf, and N. Atkin. 1968. Evolution from fish to mammals by gene duplication. Hereditas 59: 169-187. Olson, M.V. 1999. When less is more: gene loss as an engine of evolutionary change. Am J Hum Genet 64: 18-23. Orti, R., M.C. Potier, C. Maunoury, M. Prieur, N. Creau, and J.M. Delabar. 1998. Conservation of pericentromeric duplications of a 200-kb part of the human 21q22.1 region in primates. Cytogenet Cell Genet 83: 262-265. Parsons, J. 1995. Miropeats: graphical DNA sequence comparisons. Comput Appl Biosci 11: 615-619. Paulding, C.A., M. Ruvolo, and D.A. Haber. 2003. The Tre2 (USP6) oncogene is a hominoid-specific gene. Proc Natl Acad Sci U S A 100: 2507-2511.


Pendleton, J.W., B.K. Nagai, M.T. Murtha, and F.H. Ruddle. 1993. Expansion of the Hox gene family and the evolution of chordates. Proc Natl Acad Sci U S A 90: 6300-6304. Pevzner, P. and G. Tesler. 2003a. Genome rearrangements in mammalian evolution: lessons from human and mouse genomes. Genome Res 13: 37-45. Pevzner, P. and G. Tesler. 2003b. Human and mouse genomic sequences reveal extensive breakpoint reuse in mammalian evolution. Proc Natl Acad Sci U S A 100: 7672-7677. Pinkel, D., J.W. Gray, B. Trask, G. van den Engh, J. Fuscoe, and H. van Dekken. 1986. Cytogenetic analysis by in situ hybridization with fluorescently labeled nucleic acid probes. Cold Spring Harb Symp Quant Biol 51 Pt 1: 151-157. Pinkel, D., J. Landegent, C. Collins, J. Fuscoe, R. Segraves, J. Lucas, and J. Gray. 1988. Fluorescence in situ hybridization with human chromosome-specific libraries: detection of trisomy 21 and translocations of chromosome 4. Proc Natl Acad Sci U S A 85: 9138-9142. Pinkel, D., R. Segraves, D. Sudar, S. Clark, I. Poole, D. Kowbel, C. Collins, W.L. Kuo, C. Chen, Y. Zhai et al. 1998. High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat Genet 20: 207-211. Pollack, J.R., C.M. Perou, A.A. Alizadeh, M.B. Eisen, A. Pergamenschikov, C.F. Williams, S.S. Jeffrey, D. Botstein, and P.O. Brown. 1999. Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nat Genet 23: 41-46. Pollack, J.R., T. Sorlie, C.M. Perou, C.A. Rees, S.S. Jeffrey, P.E. Lonning, R. Tibshirani, D. Botstein, A.L. Borresen-Dale, and P.O. Brown. 2002. Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proc Natl Acad Sci U S A 99: 12963-12968. Prader, A., A. Labhart, and H. Willi. 1956. Ein Syndrom von Adipositas, Kleinwuchs, Kryptorchismus und Oligophrenie nach myatonieartigem Zustand im Neugeborenenalter. Schweiz. Med. Wschr. 86: 1260-1261. 
Pujana, M.A., M. Nadal, M. Gratacos, B. Peral, K. Csiszar, R. Gonzalez-Sarmiento, L. Sumoy, and X. Estivill. 2001. Additional complexity on human chromosome 15q: identification of a set of newly recognized duplicons (LCR15) on 15q11-q13, 15q24, and 15q26. Genome Res 11: 98-111. Pujana, M.A., M. Nadal, M. Guitart, L. Armengol, M. Gratacos, and X. Estivill. 2002. Human chromosome 15q11-q14 regions of rearrangements contain clusters of LCR15 duplicons. Eur J Hum Genet 10: 26-35. Rauen, K.A., D.G. Albertson, D. Pinkel, and P.D. Cotter. 2002. Additional patient with del(12)(q21.2q22): further evidence for a candidate region for cardio-facio-cutaneous syndrome? Am J Med Genet 110: 51-56. Regnier, V., M. Meddeb, G. Lecointre, F. Richard, A. Duverger, V.C. Nguyen, B. Dutrillaux, A. Bernheim, and G. Danglot. 1997. Emergence and scattering of multiple neurofibromatosis (NF1)-related sequences during hominoid evolution suggest a process of pericentromeric interchromosomal transposition. Hum Molec Genet 6: 9-16.


Riethman, H.C., R.K. Moyzis, J. Meyne, D.T. Burke, and M.V. Olson. 1989. Cloning human telomeric DNA fragments into Saccharomyces cerevisiae using a yeast- artificial-chromosome vector. Proc Natl Acad Sci U S A 86: 6240-6244. Riethman, H.C., Z. Xiang, S. Paul, E. Morse, X.L. Hu, J. Flint, H.C. Chi, D.L. Grady, and R.K. Moyzis. 2001. Integration of telomere sequences with the draft human genome sequence. Nature 409: 948-951. Riley, B., M. Williamson, D. Collier, H. Wilkie, and A. Makoff. 2002. A 3-Mb map of a large Segmental duplication overlapping the alpha7-nicotinic acetylcholine receptor gene (CHRNA7) at human 15q13-q14. Genomics 79: 197-209. Ritchie, R.J., M.G. Mattei, and M. Lalande. 1998. A large polymorphic repeat in the pericentromeric region of human chromosome 15q contains three partial gene duplications. Hum Mol Genet 7: 1253-1260. Robinson, W.P., F. Dutly, R.D. Nicholls, F. Bernasconi, M. Penaherrera, R.C. Michaelis, D. Abeliovich, and A.A. Schinzel. 1998. The mechanisms involved in formation of deletions and duplications of 15q11-q13. J Med Genet 35: 130-136. Rouquier, S., S. Taviaux, B.J. Trask, V. Brand-Arpon, G. van den Engh, J. Demaille, and D. Giorgi. 1998. Distribution of olfactory receptor genes in the human genome. Nat Genet 18: 243-250. Ruault, M., V. Trichet, S. Gimenez, S. Boyle, K. Gardiner, M. Rolland, G. Roizes, and A. De Sario. 1999. Juxta-centromeric region of human chromosome 21 is enriched for pseudogenes and gene fragments. Gene 239: 55-64. Samonte, R.V. and E.E. Eichler. 2002. Segmental duplications and the evolution of the primate genome. Nat Rev Genet 3: 65-72. Schmid, C.W. and P.L. Deininger. 1975. Sequence organization of the human genome. Cell 6: 345-358. Schouten, J.P., C.J. McElgunn, R. Waaijer, D. Zwijnenburg, F. Diepvens, and G. Pals. 2002. Relative quantification of 40 nucleic acid sequences by multiplex ligation- dependent probe amplification. Nucleic Acids Res 30: e57. Schueler, M.G., A.W. Higgins, M.K. Rudd, K. 
Gustashaw, and H.F. Willard. 2001. Genomic and genetic definition of a functional human centromere. Science 294: 109-115. Schulze, A., C. Hansen, N.E. Skakkebaek, K. Brondum-Nielsen, D.H. Ledbetter, and N. Tommerup. 1996. Exclusion of SNRPN as a major determinant of Prader-Willi syndrome by a translocation breakpoint. Nat Genet 12: 452-454. Schwartz, S., W.J. Kent, A. Smit, Z. Zhang, R. Baertsch, R.C. Hardison, D. Haussler, and W. Miller. 2003. Human-mouse alignments with BLASTZ. Genome Res 13: 103-107. Sibley, C.G. and J.E. Ahlquist. 1984. The phylogeny of the hominoid primates, as indicated by DNA-DNA hybridization. J Mol Evol 20: 2-15. Sidow, A. 1996. Gen(om)e duplications in the evolution of early vertebrates. Curr Opin Genet Dev 6: 715-722. Smit, A. and P. Green. 1999. RepeatMasker. Smit, A., G. Toth, A. Riggs, and J. Jurka. 1995. Ancestral, mammalian-wide subfamilies of LINE-1 repetitive sequences. J. Mol. Biol. 246: 401-417.


Snijders, A.M., N. Nowak, R. Segraves, S. Blackwood, N. Brown, J. Conroy, G. Hamilton, A.K. Hindle, B. Huey, K. Kimura et al. 2001. Assembly of microarrays for genome-wide measurement of DNA copy number. Nat Genet 29: 263-264. Stankiewicz, P. and J.R. Lupski. 2002. Molecular-evolutionary mechanisms for genomic disorders. Curr Opin Genet Dev 12: 312-319. Stankiewicz, P., S.S. Park, K. Inoue, and J.R. Lupski. 2001. The evolutionary chromosome translocation 4;19 in Gorilla gorilla is associated with microduplication of the chromosome fragment syntenic to sequences surrounding the human proximal CMT1A-REP. Genome Res 11: 1205-1210. Stanyon, R., N. Arnold, U. Koehler, F. Bigoni, and J. Wienberg. 1995. Chromosomal painting shows that "marked chromosomes" in lesser apes and Old World monkeys are not homologous and evolved by convergence. Cytogenet Cell Genet 68: 74-78. Sutcliffe, J.S., M.K. Han, T. Amin, R.A. Kesterson, and E.L. Nurmi. 2003. Partial duplication of the APBA2 gene in chromosome 15q13 corresponds to duplicon structures. BMC Genomics 4: 15. Sverdlov, E.D. 2000. Retroviruses and primate evolution. Bioessays 22: 161-171. Thomas, J.W., J.W. Touchman, R.W. Blakesley, G.G. Bouffard, S.M. Beckstrom- Sternberg, E.H. Margulies, M. Blanchette, A.C. Siepel, P.J. Thomas, J.C. McDowell et al. 2003. Comparative analyses of multi-species sequences from targeted genomic regions. Nature 424: 788-793. Thompson, J.D., D.G. Higgins, and T.J. Gibson. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22: 4673-4680. Toder, R., Y. Xia, and E. Bausch. 1998. Interspecies comparative genome hybridization and interspecies representational difference analysis reveal gross DNA differences between humans and great apes. Chromosome Res 6: 487-494. Tolan, D.R., J. Niclas, B.D. Bruce, and R.V. Lebo. 1987. 
Evolutionary implications of the human aldolase-A, -B, -C, and -pseudogene chromosome locations. Am J Hum Genet 41: 907-924. Tomlinson, I.M., G.P. Cook, N.P. Carter, R. Elaswarapu, S. Smith, G. Walter, L. Buluwela, T.H. Rabbitts, and G. Winter. 1994. Human immunoglobulin VH and D segments on chromosomes 15q11.2 and 16p11.2. Hum Molec Genet 3: 853-860. Trask, B., C. Friedman, A. Martin-Gallardo, L. Rowen, C. Akinbami, J. Blankenship, C. Collins, D. Giorgi, S. Iadonato, F. Johnson et al. 1998a. Members of the olfactory receptor gene family are contained in large blocks of DNA duplicated polymorphically near the ends of human chromosomes. Hum. Molec. Genet. 7: 13-26. Trask, B., D. Pinkel, and G. van den Engh. 1989. The proximity of DNA sequences in interphase cell nuclei is correlated to genomic distance and permits ordering of cosmids spanning 250 kilobase pairs. Genomics 5: 710-717. Trask, B.J. 1991. Fluorescence in situ hybridization: applications in cytogenetics and gene mapping. Trends Genet 7: 149-154. Trask, B.J., H. Massa, V. Brand-Arpon, K. Chan, C. Friedman, O.T. Nguyen, E.E. Eichler, G. van den Engh, S. Rouquier, H. Shizuya et al. 1998b. Large multi-chromosomal duplications encompass many members of the olfactory receptor gene family in the human genome. Hum Molec Genet 7: 2007-2020. Tunnacliffe, A., L. Liu, J.K. Moore, M.A. Leversha, M.S. Jackson, L. Papi, M.A. Ferguson-Smith, H.J. Thiesen, and B.A.J. Ponder. 1993. Duplicated KOX zinc finger gene clusters flank the centromere of human chromosome 10: evidence for a pericentric inversion during primate evolution. Nucleic Acids Res. 21: 1409-1417. Tuzun, E., J.A. Bailey, and E.E. Eichler. 2004. Recent segmental duplications in the working draft assembly of the Brown Norway rat. Genome Res 14: 493-506. Ungaro, P., S.L. Christian, J.A. Fantes, A. Mutirangura, S. Black, J. Reynolds, S. Malcolm, W.B. Dobyns, and D.H. Ledbetter. 2001. Molecular characterisation of four cases of intrachromosomal triplication of chromosome 15q11-q14. J Med Genet 38: 26-34. Valero, M.C., O. de Luis, J. Cruces, and L.A. Perez Jurado. 2000. Fine-scale comparative mapping of the human 7q11.23 region and the orthologous region on mouse chromosome 5G: the low-copy repeats that flank the Williams-Beuren syndrome deletion arose at breakpoint sites of an evolutionary inversion(s). Genomics 69: 1-13. van den Engh, G., R. Sachs, and B.J. Trask. 1992. Estimating genomic distance from DNA sequence location in cell nuclei by a random walk model. Science 257: 1410-1412. van Geel, M., E.E. Eichler, A.F. Beck, Z. Shan, T. Haaf, S.M. van Der Maarel, R.R. Frants, and P.J. de Jong. 2002. A Cascade of Complex Subtelomeric Duplications during the Evolution of the Hominoid and Old World Monkey Genomes. Am J Hum Genet 70: 269-278. van Geel, M., J.C. van Deutekom, A. van Staalduinen, R.J. Lemmers, M.C. Dickson, M.H. Hofker, G.W. Padberg, J.E. Hewitt, P.J. de Jong, and R.R. Frants. 2000. Identification of a novel beta-tubulin subfamily with one member (TUBB4Q) located near the telomere of chromosome region 4q35. Cytogenet Cell Genet 88: 316-321. Venter, J.C. M.D. Adams E.W. Myers P.W. Li R.J. Mural G.G. 
Sutton H.O. Smith M. Yandell C.A. Evans R.A. Holt et al. 2001. The sequence of the human genome. Science 291: 1304-1351. Ventura, M., N. Archidiacono, and M. Rocchi. 2001. Centromere emergence in evolution. Genome Res 11: 595-599. Verma, R.S. and S. Luke. 1991. Heteromorphisms of pericentromeric heterochromatin of chromosome 19. Genet Anal Tech Appl 8: 179-180. Wallace, M.R., L.B. Andersen, A.M. Saulino, P.E. Gregory, T.W. Glover, and F.S. Collins. 1991. A de novo Alu insertion results in neurofibromatosis type 1. Nature 353: 864-866. Walsh, J.B. 1995. How often do duplicated genes evolve new functions? Genetics 139: 421-428. Wandstrat, A., C. Lena, J, L. Jenkins, and S. Schwartz. 1998. Molecular cytogenetic evidence for a common breakpoint in large (Class III) inverted duplications of chromosome 15. Am. J. Hum. Genet. 62: 925-936.


Wandstrat, A.E. and S. Schwartz. 2000. Isolation and molecular analysis of inv dup(15) and construction of a physical map of a common breakpoint in order to elucidate their mechanism of formation. Chromosoma 109: 498-505. Waterston, R.H. K. Lindblad-Toh E. Birney J. Rogers J.F. Abril P. Agarwal R. Agarwala R. Ainscough M. Alexandersson P. An et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520-562. Wevrick, R., J.A. Kerns, and U. Francke. 1994. Identification of a novel paternally expressed gene in the Prader-Willi syndrome region. Hum Mol Genet 3: 1877- 1882. White, M.J. 1968. Models of speciation. New concepts suggest that the classical sympatric and allopatric models are not the only alternatives. Science 159: 1065- 1070. Willard, H.F. 1998. Centromeres: the missing link in the development of human artificial chromosomes. Curr Opin Genet Dev 8: 219-225. Wohr, G., T. Fink, and G. Assum. 1996. A palindromic structure in the pericentromeric region of various human chromosomes. Genome Res 6: 267-279. Wong, A.C., D. Shkolny, A. Dorman, D. Willingham, B.A. Roe, and H.E. McDermid. 1999. Two novel human RAB genes with near identical sequence each map to a telomere-associated region: the subtelomeric region of 22q13.3 and the ancestral telomere band 2q13. Genomics 59: 326-334. Wong, Z., N. Royle, and A. Jeffreys. 1990. A novel human DNA polymorphism resulting from transfer of DNA from chromosome 6 to chromosome 16. Genomics 7: 222- 234. Yi, S., D.L. Ellsworth, and W.H. Li. 2002. Slow molecular clocks in Old World monkeys, apes, and humans. Mol Biol Evol 19: 2191-2198. Yi, S., T.J. Summers, N.M. Pearson, and W.H. Li. 2004. Recombination has little effect on the rate of sequence divergence in pseudoautosomal boundary 1 among humans and great apes. Genome Res 14: 37-43. Young, J.M., C. Friedman, E.M. Williams, J.A. Ross, L. Tonnes-Priddy, and B.J. Trask. 2002. 
Different evolutionary processes shaped the mouse and human olfactory receptor gene families. Hum Mol Genet 11: 535-546. Yu, W., B.C. Ballif, C.D. Kashork, H.A. Heilstedt, L.A. Howard, W.W. Cai, L.D. White, W. Liu, A.L. Beaudet, B.A. Bejjani et al. 2003. Development of a comparative genomic hybridization microarray and demonstration of its utility with 25 well- characterized 1p36 deletions. Hum Mol Genet 12: 2145-2152. Yunis, J.J. and O. Prakash. 1982. The origin of man: a chromosomal pictorial legacy. Science 215: 1525-1530. Yunis, J.J., J.R. Sawyer, and K. Dunham. 1980. The striking resemblance of high- resolution G-banded chromosomes of man and chimpanzee. Science 208: 1145- 1148. Zhang, Z., Schwartz, S., Wagner, L., Miller, W. 2000. A greedy algorithm for aligning DNA sequences. Journal of Computational Biology 7: 203-214. Zimonjic, D., M. Kelley, J. Rubin, S. Aaronson, and N. Popescu. 1997. Fluorescence in situ hybridization analysis of keratinocyte growth factor gene amplification and dispersion in evolution of great apes and humans. Proc Natl Acad Sci USA 94: 11461-11465.