Genomic DNA Copy Number Variations and Cancer: Studies of Li-Fraumeni Syndrome and its Variants

by

Adam Shlien

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy

Department of Medical Biophysics

University of Toronto

© Copyright by Adam Shlien 2010

Genomic DNA Copy Number Variations and Cancer: Studies of Li-Fraumeni Syndrome and its Variants

Adam Shlien

Doctor of Philosophy

Department of Medical Biophysics University of Toronto

2010

Abstract

Copy number variations (CNVs) are a major source of inter-individual genetic difference, accounting for a greater proportion of the genome than other forms of variation. Recently, the identification of benign and pathogenic CNVs has improved due to arrays with increased coverage. Nevertheless, most CNVs have not been studied with great precision and questions persist regarding their exact breakpoint, content, frequency and functional impact. This is especially true in cancer, in which a role for CNVs as risk factors is under-explored.

Li-Fraumeni syndrome (LFS) is a dominantly inherited disorder with an increased risk of early-onset breast cancer, sarcomas, brain tumors and other neoplasms in individuals harboring germline TP53 mutations. Known genetic determinants of LFS do not fully explain its clinical phenotype. In this thesis, we describe the association between CNVs and LFS. First, by examining DNA from a healthy population and an LFS cohort using oligonucleotide arrays, we show that the number of CNVs per genome is well conserved in the healthy population, but remarkably enriched in these cancer-prone individuals. We found a significant increase in CNVs among carriers of germline TP53 mutations with a familial cancer history. Second, we find that specific CNVs at 17p13.1 are associated with LFS or developmental delay, depending on the exact breakpoint with respect to TP53. Using a purpose-built array with 93.75% accuracy, we fine-mapped these microdeletions and find that they arise by Alu-mediated non-allelic homologous recombination, and contain common genes, whose under-expression distinguishes the two phenotypes. Third, we explore somatic CNVs in choroid plexus carcinoma tumor genomes. We show that this tumor is over-represented in LFS, and the number of somatic CNVs is associated with TP53 mutations and disease progression. These studies represent the first genomic analyses of LFS, and suggest a more generalized association between CNVs and cancer.

List of Publications

The following publications were produced during the term of my PhD studies:

• Adam Shlien, Berivan Baskin, Maria Isabel W. Achatz, Dimitrios J. Stavropoulos, Kim E. Nichols, Louanne Hudgins, Chantal F. Morel, Margaret P. Adam, Nataliya Zhukova, Lianne Rotin, Ana Novokmet, Harriet Druker, Mary Shago, Peter N. Ray, Pierre Hainaut, David Malkin. A common molecular mechanism underlies two phenotypically distinct 17p13.1 microdeletion syndromes. American Journal of Human Genetics. [Under Review]

• Tabori U*, Shlien A*, Baskin B, Levitt S, Ray P, Alon N, Hawkins C, Bouffet E, Pienkowska M, Lafay-Cousin L, Gozali A, Zhukova N, Shane L, Gonzalez I, Finlay J, Malkin D. TP53 alterations determine clinical subgroups and survival of patients with choroid plexus tumors. J Clin Oncol. 2010 Apr 20;28(12):1995-2001. Epub 2010 Mar 22.

*co-first authors

• Shlien A, Malkin D. Copy number variations and cancer susceptibility. Curr Opin Oncol. 2010 Jan;22(1):55-63.

• Pasic I, Shlien A, Durbin AD, Stavropoulos DJ, Baskin B, Ray PN, Novokmet A, Malkin D. Appendix 1 - Recurrent Focal Copy-Number Changes and Loss of Heterozygosity Implicate Two Noncoding RNAs and One Tumor Suppressor Gene at Chromosome 3q13.31 in Osteosarcoma. Cancer Res. 2010 Jan 1;70(1):160-71.

• Jinchuan Xing, W. Scott Watkins, Adam Shlien*, Erin Walker*, Chad D. Huff, David J. Witherspoon, Yuhua Zhang, Tatum S. Simonson, Robert B. Weiss, Joshua D. Schiffman, David Malkin, Scott R. Woodward, and Lynn B. Jorde. Toward a more Uniform Sampling of Human Genetic Diversity: A Survey of Worldwide Populations by High-density Genotyping. Genomics [In Press]

* Equal contributions

• Shlien A, Tabori U, Marshall CR, Pienkowska M, Feuk L, Novokmet A, Nanda S, Druker H, Scherer SW, Malkin D. Excessive genomic DNA copy number variation in the Li-Fraumeni cancer predisposition syndrome. Proc Natl Acad Sci U S A. 2008 Aug 12;105(32):11264-9. Epub 2008 Aug 6.

• Shlien A, Malkin D. Copy number variations and cancer. Genome Med. 2009 Jun 16;1(6):62.

• Robert K. Nam, William Zhang, Katherine Siminovitch, Adam Shlien, Michael W. Kattan, Arun Seth, Laurence H. Klotz, John Trachtenberg, Yan Lu, Jinyi Zhang, Changhong Yu, Ants Toi, D. Andrew Loblaw, Vasundara Venkateswaran, Aleksandra Staminirovic, Linda Sugar, David Malkin, and Steven A. Narod. New Variants at 10q26 and 15q21 Are Associated With Aggressive Prostate Cancer in a Genome-Wide Association Study from a Prostate Biopsy Screening Cohort. [Submitted]

• Villani A, Tabori U, Schiffman J, Shlien A, Druker H, Novokmet A, Finlay J, Malkin D. Impact of biochemical and imaging surveillance on cancer detection and mortality among germline TP53 mutation carriers with Li-Fraumeni syndrome. [Under Review]

Acknowledgments

It is with genuine pleasure that I offer my thanks to the numerous individuals who have made this thesis possible.

I owe my gratitude to my supervisor, Dr. David Malkin, whose kind and thoughtful support, mentorship, and guidance have directly contributed to my success as a graduate student. His support provided me with comprehensive PhD training, and enabled me to learn much more than what is contained within these pages. Furthermore, I would like to thank the members of my PhD committee, Dr. Stephen Meyn, Dr. David Hogg and Dr. Jeremy Squire, for their assistance and insight. Thank you for always asking the tough questions and for encouraging my scientific curiosity. I am also grateful to Dr. Uri Tabori for his valuable advice and guidance. Our discussions regarding this project have been of great value to me. Thank you to Dr. Berivan Baskin and to all the members of the Malkin lab for your support and friendship.

I owe my loving thanks to my wife, Kerri, who edited all of my manuscripts, and provided abundant love and encouragement throughout this adventure. My warm thanks to Ivan and Ruth for their ceaseless support and for nurturing a caring environment for Kerri and me. I thank my father, Eddy, my mother, Penny, and my sister, Andrea, for their unwavering belief in me, and for encouraging my continuous education, without which this thesis would not have been possible.

Table of Contents

List of Publications

Acknowledgments

Table of Contents

List of Tables

List of Figures

List of Appendices

Chapter 1

1 Introduction

1.1 TP53

1.1.1 Transcriptional Control of TP53

1.1.2 Post Translation Activation of TP53

1.1.3 MDM2 and ARF

1.1.4 TP53-activating Signals

1.1.5 Genotoxic Stress and TP53 Activation

1.1.6 Ionizing Radiation and ATM

1.1.7 Ultraviolet Radiation and ATR

1.1.8 Non-genotoxic Stress and TP53 Activation

1.1.9 TP53-mediated Response to Cellular Stress

1.1.10 TP53: Transcription Factor

1.1.11 TP53-mediated Cell Cycle Arrest

1.1.12 TP53-mediated Apoptosis

1.2 Li-Fraumeni Syndrome

1.2.1 Genetic Etiology of Li-Fraumeni Syndrome

1.2.2 TP53 Mutation Spectrum, Type and Frequency

1.2.3 Genetic Modifiers

1.2.4 Cancer Penetrance

1.2.5 Genetic

1.2.6 Germline TP53 Mutations in Pediatric Adrenocortical Carcinomas of Brazil

1.2.7 LFS-associated Tumorigenesis

1.2.8 Choroid Plexus Carcinoma: an LFS Component Tumor

1.3 Copy Number Variations: Dynamic Genomes

1.3.1 CNVs and disease: Mutable Genomes

1.3.2 Pathogenic CNVs Often Contain Multiple Genes

1.3.3 The Effect of a Pathogenic CNV is Not Limited to the Gene(s) it Contains

1.3.4 Pathogenic CNVs Can Have Reciprocal Deletions/Duplications

1.3.5 CNVs and Cancer Predisposition: First Hits to the Tumor Genome

1.3.6 Common Cancer CNVs

1.3.7 Rare Cancer CNVs

1.3.8 Mutational Mechanisms Leading to CNVs

1.3.9 CNVs and Tumor Genomes

1.3.10 Genome-Scale Analyses Have Found Many Formerly Invisible Copy Number Alterations (CNA)

1.3.11 CNAs can be Integrated With Other Global Analyses to Define the Key Pathways of a Tumor

1.4 Rationale

Chapter 2

2 Excessive genomic DNA copy number variation in the Li-Fraumeni cancer predisposition syndrome

2.1 Abstract

2.2 Introduction

2.3 Results

2.4 Discussion

2.5 Materials and Methods

2.5.1 Subject recruitment

2.5.2 DNA microarray analysis

2.5.3 Quantitative PCR validation

2.5.4 Statistical analyses

2.5.5 Computational assessment of cancer-related genes

2.5.6 TP53 mutation screening

2.6 Supplementary Discussion

2.6.1 Recurrent copy number variations at cancer associated genes

2.7 Supplementary Figures and Tables

2.8 Supplementary Methods

2.8.1 Characterization of Copy Number Variation

Chapter 3

3 A common molecular mechanism underlies two phenotypically distinct 17p13.1 microdeletion syndromes

3.1 Abstract

3.2 Introduction

3.3 Material and Methods

3.3.1 Sample Recruitment

3.3.2 CGH microarray design and hybridization

3.3.3 Gene expression arrays and analysis

3.3.4 Breakpoint simulation

3.3.5 Quantitative PCR

3.3.6 Fluorescence in situ hybridization

3.3.7 Parent-of-origin analysis

3.3.8 Breakpoint mapping

3.4 Results

3.4.1 Rare CNVs at TP53 are associated with cancer predisposition or developmental delay

3.4.2 Different 17p13.1 breakpoints are related to two distinct phenotypes

3.4.3 17p13.1 genomic deletions can be inherited or arise de novo

3.4.4 Design of a custom ultra high-resolution tiling array

3.4.5 Alu short interspersed nuclear repeats are associated with breakpoints

3.4.6 Most 17p13.1 CNVs arise by Alu-mediated non-allelic homologous recombination

3.4.7 A common region implicates new genes in developmental delay

3.5 Discussion

3.6 Supplementary tables and figures

Chapter 4

4 TP53 alterations determine clinical subgroups and survival of patients with choroid plexus tumors

4.1 Abstract

4.2 Introduction

4.3 Patients and Methods

4.3.1 Samples and Clinical Data

4.3.2 Sequencing of Genomic Tumor and Constitutional DNA

4.3.3 DNA microarray analysis

4.3.4 Immunohistochemistry

4.3.5 Statistical Analysis

4.4 Results

4.4.1 TP53 Mutations in Choroid Plexus Tumors

4.4.2 Germline TP53 Status Correlates with LFS Criteria

4.4.3 Specific Genotypes Correlate with CPC Subtypes

4.4.4 Somatic Total Structural Variation Differentiates Tumor Subtypes

4.4.5 TP53 Dysfunction Predicts Outcome in CPC

4.5 Discussion

4.6 Supplementary tables and figures

Chapter 5

5 Summary and future directions

5.1 Summary

5.2 Future directions

5.2.1 CNVs and LFS

5.2.2 Choroid plexus carcinoma

Chapter 6

6 Appendix 1 - Recurrent Focal Copy-Number Changes and Loss of Heterozygosity Implicate Two Noncoding RNAs and One Tumor Suppressor Gene at Chromosome 3q13.31 in Osteosarcoma

6.1 Abstract

6.2 Introduction

6.3 Materials and methods

6.4 Results

6.5 Discussion

Chapter 7

7 Appendix 2 - Toward a more Uniform Sampling of Human Genetic Diversity: A Survey of Worldwide Populations by High-density Genotyping

7.1 Abstract

7.2 Introduction

7.3 Materials and Methods

7.4 Results

7.5 Discussion

7.6 Conclusion

Chapter 8

8 Appendix 3 - New Variants at 10q26 and 15q21 Are Associated With Aggressive Prostate Cancer in a Genome-Wide Association Study from a Prostate Biopsy Screening Cohort

8.1 Abstract

8.2 Introduction

8.3 Results

8.4 Discussion

8.5 Methods

References

Copyright Acknowledgements

List of Tables

Chapter 1

Table 1 - Rare cancer CNVs at known cancer-predisposing genes

Chapter 2

Supplementary Table 1A - LFS Families

Supplementary Table 1B - Unrelated TP53 mutation carriers

Chapter 3

Table 1 - Phenotypic features of four patients with 17p13.1 CNVs and developmental delay

Table 2 - Deleted and disrupted genes in 17p13.1 deletion patients

Chapter 4

Table 1 - Somatic and germline TP53 mutation frequencies in study population.

Table 2 - Frequency of germline TP53 mutations in choroid plexus tumor patients.

Supplementary Table 1 - TP53 germline mutations in choroid plexus tumor patients

Supplementary Table 2 - Genotype-phenotype correlation of TP53 codon 72 and MDM2 SNP309 polymorphisms.

Supplementary Table 3 - Correlation between TP53 immunostain and TP53 mutational status.

List of Figures

Chapter 1

Figure 1 - TP53 is an integrator of multiple genotoxic and non-genotoxic stressors that elicit a cellular response by either inducing apoptosis or cell cycle arrest

Figure 2 - Sequence logo showing two TP53 binding sites

Figure 3 - Codon distribution of TP53 somatic mutations. Highlighted are the six most frequent sites of TP53 mutation

Figure 4 - Li-Fraumeni syndrome component tumors

Figure 5 - Distribution of common cancer CNVs in the human genome

Figure 6 - Cancer CNV breakpoint mapping

Chapter 2

Figure 1 - Increased CNV frequency in Li-Fraumeni syndrome

Figure 2 - Inherited deletions and duplications in 4 LFS families.

Figure 3 - Progression of germline chromosomal alterations in paired tumor DNA

Figure 4 - Proposed model for the progression of copy number variable DNA regions in the Li-Fraumeni cancer predisposition syndrome

Supplementary Figure 1 - CNV frequency and total structural variation are conserved in different ethnic groups

Supplementary Figure 2 - Chromosomal positions and SNP coverage of 5 cancer-related genes overlapping CNVs

Supplementary Figure 3 - Validation of a 6.1 Mb deletion in a Li-Fraumeni syndrome family

Supplementary Figure 4 - DNA Copy number variations in 893 individuals

Chapter 3

Figure 1 - Discovery of a 17p13.1 CNV leading to two distinct phenotypes

Figure 2 - Breakpoint maps, sequence resolution, and inferred mechanism of 17p13.1 CNVs

Figure 3 - Gene expression differences distinguish cancer-affected from developmental delay patients with 17p13.1 deletions

Supplementary Figure 1 - Copy number of ATP1B2 and WRAP53, TP53’s neighboring genes

Supplementary Figure 2 - Pedigrees

Supplementary Figure 3 - Design of ultra high-resolution array

Supplementary Figure 4 - A complex event near 17p13.1 deletion breakpoint

Chapter 4

Figure 1 - Total somatic variation changes in choroid plexus tumors

Figure 2 - Overall survival for choroid plexus carcinoma

Figure 3 - Management of patients with newly diagnosed choroid plexus carcinoma

Supplementary Figure 1 - Study population overview

Supplementary Figure 2 - Structural instability in CPCs

Supplementary Figure 3 - Numerical instability in TP53 mutated CPCs

List of Appendices

Appendix 1 - Recurrent Focal Copy-Number Changes and Loss of Heterozygosity Implicate Two Noncoding RNAs and One Tumor Suppressor Gene at Chromosome 3q13.31 in Osteosarcoma

Appendix 2 - Toward a more Uniform Sampling of Human Genetic Diversity: A Survey of Worldwide Populations by High-density Genotyping

Appendix 3 - New Variants at 10q26 and 15q21 Are Associated With Aggressive Prostate Cancer in a Genome-Wide Association Study from a Prostate Biopsy Screening Cohort

Chapter 1

1 Introduction

1.1 TP53

Defects of TP53, the most common genetic alteration in Li-Fraumeni syndrome, are also the most commonly acquired genetic alteration in sporadic human cancer¹. The TP53 protein was first discovered because of its association with the SV40 large T antigen. The precipitated 53-54 kilodalton protein was considered to be an oncogene since it could transform recipient cells². However, this was subsequently found to be a misinterpretation, as injection of wild-type TP53 could, in fact, suppress cellular growth. Only the mutated form of TP53 possesses the growth-promoting properties.

The human TP53 protein is 393 amino acids long and has four important functional domains¹: the transcriptional activation domain (amino acids 1-42), where it interacts with the cell’s transcriptional machinery and its own negative regulators (including E1B-55Kd and MDM2); the DNA binding domain (amino acids 102-292), where it binds to consensus DNA sequences in the phosphate backbone in the major groove and the minor groove of the DNA helix; the tetramerization domain, which is required for the protein’s oligomerization; and the C-terminal domain (last 26 residues), which regulates TP53’s ability to bind to specific DNA sequences at the core domain.
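The domain layout described above can be captured in a small lookup table. The following is a minimal Python sketch, not part of the original analyses: the names TP53_LENGTH, DOMAINS and domains_at are hypothetical, the residue ranges are only those quoted in the text, and the tetramerization domain is left out because no residue range is given for it here.

```python
# Hypothetical helper for illustration only; ranges are the ones quoted in the text.
TP53_LENGTH = 393  # length of the human TP53 protein in amino acids

# Domain name -> (first residue, last residue), 1-based and inclusive.
# The tetramerization domain is omitted: the text gives no range for it.
DOMAINS = {
    "transcriptional activation": (1, 42),
    "DNA binding": (102, 292),
    # "last 26 residues" of a 393-residue protein = residues 368-393
    "C-terminal regulatory": (TP53_LENGTH - 26 + 1, TP53_LENGTH),
}


def domains_at(residue: int) -> list[str]:
    """Return the names of the listed domains that contain the given residue."""
    if not 1 <= residue <= TP53_LENGTH:
        raise ValueError(f"residue must be between 1 and {TP53_LENGTH}")
    return [
        name
        for name, (start, end) in DOMAINS.items()
        if start <= residue <= end
    ]


if __name__ == "__main__":
    for position in (15, 175, 60, 380):
        print(position, domains_at(position))
```

For example, domains_at(175) reports the DNA binding domain, while a residue in one of the stretches not listed in the text (such as 60) returns an empty list.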

TP53 is subject to multiple levels of regulation. Chief among these are methods for post-translational modification of its protein product.

1.1.1 Transcriptional Control of TP53

TP53 is transcribed from the negative strand of chromosome 17p13.1. Its first promoter, P53P1, is located 250 bp upstream of the first exon (which is non-coding) and its second promoter, P53P2, is within intron 1³. P53P3, the gene’s third promoter, is located within intron 4.

TP53’s transcriptional control has not been extensively studied; however, certain aspects of this level of control have long been known. For example, TP53 mRNA levels rise when serum-starved cells in culture are stimulated with serum. As initially shown by Reich et al.⁴, 3T3 cells deprived of and then stimulated with serum demonstrated increased TP53 protein levels, as measured by western blot. Increased protein stability was ruled out as a possible explanation, as pulse-chase experiments showed no increase in TP53’s half-life (~20 min), while TP53 mRNA levels increased 6-7 h after serum stimulation. The increase in TP53 prior to DNA synthesis is thought to be a form of “watchful waiting”, ready to halt the cell cycle should any DNA damage be detected. In 1992, Rotter and colleagues demonstrated that TP53’s promoter contained a bHLH recognition sequence (CACGTG) between +70 and +75 of the transcription start site, a known recognition sequence of c-MYC/MAX. This work established that c-MYC transactivates TP53’s promoter in several cell lines. Notably, c-MYC’s DNA binding requires heterodimerization with MAX, without which regulation of TP53 does not occur.

The regulation of TP53 by its genomic neighbor, WRAP53 (also known as WDR79), is an intriguing recent finding⁵. WRAP53 and TP53 are bidirectionally transcribed genes; that is, WRAP53 is located immediately upstream of TP53 but on the opposite strand. WRAP53 has numerous alternative transcripts that use one of three possible first exons. One of these first exons (1α) overlaps with up to 227 base pairs of TP53’s first exon, depending on which TP53 transcription start site is used. siRNA knockdown of WRAP53 led to reduced levels of TP53 protein and RNA in U2OS, HCT116 and non-tumor cell lines, as well as suppression of TP53 induction following DNA damage. This regulation is non-reciprocal and is accomplished by WRAP53’s mRNA, not its protein.

1.1.2 Post Translation Activation of TP53

TP53 lies at the heart of a complex system that monitors the health of the genome and the presence of aberrant growth signals. Disturbances of the system lead to increased levels of TP53 protein, as will be discussed below.

1.1.3 MDM2 and ARF

The MDM2 gene is the primary arbiter of TP53’s protein levels. The most compelling evidence for the interdependence of these two genes came from mouse studies showing that while deficiency of MDM2 alone is embryonic lethal, co-deficiency of both MDM2 and TP53 is not⁶,⁷. While these mice are viable, they do develop a spectrum of tumors similar to those of TP53-null mice.

Initially discovered in mouse double minute chromosomes, MDM2 controls TP53 levels in a number of important ways. MDM2 is a RING E3 ubiquitin ligase, which is capable of catalyzing both monoubiquitination and polyubiquitination of TP53⁸. MDM2 itself has a short half-life, largely due to self-ubiquitination. At low levels, MDM2 catalyzes the monoubiquitination of TP53, which is then shuttled out of the nucleus into the cytoplasm. At high levels, MDM2 catalyzes the polyubiquitination of TP53, which is then degraded by nuclear proteasomes. TP53 can also be ubiquitinated in an MDM2-independent manner by either Pirh2 or COP1.

The C terminus of TP53 contains its nuclear localization signals⁹ and determines its susceptibility to MDM2-mediated degradation. Deletion of TP53’s terminal 30 amino acids ablates this degradation¹⁰. Of these, six amino acids are lysine residues. The introduction of single or multiple K-to-R mutations in this region leads to higher TP53 transcriptional output, and changing all six lysines inhibits MDM2-mediated ubiquitination and degradation¹¹.

The crystal structure of MDM2 bound to TP53 has been resolved¹². MDM2’s NH2-terminal domain binds to TP53’s NH2-terminal transactivation domain. The interface between the two molecules involves multiple van der Waals contacts and a deep pocket, into which three of TP53’s amino acids, Phe19, Trp23 and Leu26, insert. The importance of these residues has also been shown by site-directed mutagenesis¹³. Once the two proteins are bound through their N-terminal domains, the MDM2 C-terminal RING domain recruits the E2 ubiquitin-conjugating enzyme, which transfers ubiquitins to TP53’s lysine residues.

MDM2 must be blocked to activate TP53. This is accomplished by the ARF protein (p14ARF), whose connection with TP53 and MDM2 was hypothesized and then definitively established through a number of key observations. First, melanomas from INK4A/ARF+/- transgenic mice do not have TP53 mutations, nor do many human melanomas, suggesting that mutations of ARF obviate the need for TP53 dysfunction¹⁴. Second, ARF requires TP53 to inhibit cellular transformation¹⁵. Third, a series of coimmunoprecipitation experiments was run to determine the potential interaction between TP53, MDM2 and ARF. 293T cells were co-transfected with all three genes and then immunoprecipitated with an anti-TP53 antibody, which also pulled down ARF, suggestive of an interaction between the three¹⁵. However, without the inclusion of MDM2, the TP53-ARF interaction was lost. Finally, it was shown that ARF physically interacts with MDM2 and blocks its degradation of TP53. By deletion analysis it was found that removal of MDM2 residues 222-437 abolished its binding to ARF and that, in fact, most of ARF was bound to amino acids 210-244 (C terminal). Therefore, TP53 and ARF bind to different regions of MDM2¹⁶, while not necessarily binding to each other. This binding inhibits MDM2’s nuclear export ability and MDM2’s ubiquitination of TP53. ARF accomplishes this inhibition by sequestering MDM2 to the nucleolus, a region of the nucleus usually reserved for ribosomal subunit production, preventing MDM2 from ubiquitinating TP53 and leading to a quick increase in TP53’s levels and activity.

1.1.4 TP53-activating Signals

As described, TP53 normally exists in a latent form, and is activated by removal of inhibition rather than by up-regulation of transcription. A number of cellular stressors quickly lead to TP53 activation; these can be broadly classified as genotoxic and non-genotoxic. Activation is largely accomplished by post-translational modification at select residues of TP53. In fact, all serines and threonines in the first 89 residues of TP53 can be phosphorylated or dephosphorylated following stress¹⁷. The enormous number of TP53 post-translational modifications is beyond the scope of this discussion. However, it should be noted that there exists some controversy as to the degree of their importance. It has been shown, for example, that TP53 proteins with mutations at all known N-terminal and C-terminal phosphorylation sites (some of which will be described below) still retain transcriptional activation activity¹⁸.

1.1.5 Genotoxic Stress and TP53 Activation

Cells contain multiple processes for the detection and repair of DNA damage, all of which impinge upon the TP53 pathway¹⁹, as will be described below (Figure 1).

1.1.6 Ionizing Radiation and ATM

Ionizing radiation (IR) is an important source of environmental damage, arising from cosmic radiation, naturally occurring sources (e.g. radon) and intensive cancer therapy. IR generates hydroxyl radicals, which cause DNA double-strand breaks and induce a TP53 response. The ATM (ataxia telangiectasia-mutated) gene is key to this response.

IR induces G1 cell cycle arrest in order to prevent replication of damaged DNA, and it has long been known that this arrest is related to TP53 levels. In 1992, Kastan et al. showed that cells defective for ATM lacked IR-induced arrest following 2 Gy of IR²⁰. These cells displayed less TP53 accumulation than normal cells. This suggested a shared pathway between ATM and TP53. It was subsequently found that IR induces phosphorylation of TP53 Ser15 and that this impairs MDM2’s ability to block TP53 transactivation²¹. A recombinant ATM protein containing three domain mutations was compared to wild type ATM for phosphorylation ability²². Wild type ATM was found capable of phosphorylating TP53 at Ser15 (but not Ser6 or Ser9), whereas the mutant recombinant ATM was not. ATM also activates Chk2, which phosphorylates TP53 at Ser20²³, a process which contributes to TP53 stability by compromising its MDM2 interaction²⁴,²⁵. Among many other sites, ATM also phosphorylates MDM2 Ser395, which presumably reduces its ability to drive TP53 degradation²⁶.

1.1.7 Ultraviolet Radiation and ATR

Ultraviolet radiation (UV) is another important source of genotoxic damage that induces TP53. At 100-295 nm, UV-C is the shortest-wavelength band of UV radiation and is the most studied, as it most directly damages DNA. The most frequent lesions induced by UV-B or UV-C radiation are cis-syn cyclobutane pyrimidine dimers and pyrimidine (6-4) pyrimidone photoproducts²⁷. The mechanisms of TP53 activation and accumulation following UV exposure are similar to those following IR exposure, but there are a number of important differences. These can be partially explained by the overlapping yet distinct functions of ATM and ATR, two master controllers of the cellular DNA damage response²⁸.

The ATM-Rad3-related (ATR) gene is structurally and functionally related to ATM. Both are protein kinases from the PI3K family and both exert control over signaling pathways through phosphorylation. Both kinases are involved in the G1/S and G2/M cell cycle checkpoints (that is, both proteins can arrest cells at either checkpoint).

TP53 is activated, accumulated, and phosphorylated at Ser15 following UV-induced DNA damage. Surprisingly, this was shown to still be true in ataxia telangiectasia cells lacking ATM. Another finding, which fueled suspicions that a second protein capable of phosphorylating Ser15 existed, was that IR induces a similar response in ATM-deficient cells, even if this is delayed compared to normal cells. Tibbetts et al. hypothesized that ATR played this role²⁹. They showed that overexpression of a catalytically inactive ATR (ATRKI) protein led to decreased TP53 Ser15 levels two to four hours following γ-radiation (20 Gy), but not one hour after γ-radiation²⁹. In comparison, following UV exposure (310 nm) the overexpression of ATRKI led to time-independent TP53 Ser15 activation. From this the authors proposed a model of sequential activation of ATM and ATR after IR induction. ATM is set in motion first, within minutes of a DNA double-strand break, while ATR is activated later and, in so doing, prolongs the activation of TP53.

While both ATM and ATR are involved in TP53’s IR response (albeit at different time points), only ATR is involved in the cellular response to UV²². Consistent with this, cells from ataxia telangiectasia patients are not hypersensitive to UV irradiation, and both activation of TP53 and phosphorylation at Ser15 occur normally in them.

It has also been noted that IR- and UV-induced TP53 responses have different kinetics. Lu and Lane first noted that UV-irradiated cells took approximately two hours to mount a TP53 response, while 2.5 Gy or 5 Gy of X-rays engendered a TP53 response within one hour³⁰. However, while IR-induced TP53 levels dropped off after three hours, UV-induced levels continued to rise. Thus, the induction of TP53 by IR is quicker and peaks earlier compared to UV stimulation.

1.1.8 Non-genotoxic Stress and TP53 Activation

Additionally, several non-genotoxic stressors elicit a TP53 response. These include hypoxia, glucose starvation, ribonucleotide triphosphate reduction and oncogene activation, among others. For example, in 1993 Lowe found that over-expression of the adenovirus 5 E1A protein induces TP53 and promotes apoptosis³¹. This is accomplished by the inactivation of the retinoblastoma protein³².

1.1.9 TP53-mediated Response to Cellular Stress

The main role of TP53 is to act as a transcription factor for a host of downstream targets. When activated as described above, TP53 carries out this role. The various aspects of the TP53-mediated response will be elaborated upon below.

Figure 1 | TP53 is an integrator of multiple genotoxic and non-genotoxic stressors that elicit a cellular response by either inducing apoptosis or cell cycle arrest (image from 33).

1.1.10 TP53: Transcription Factor

To explore TP53’s transcription-activating properties, El-Deiry and colleagues used an unbiased approach that identified DNA bound to TP53 and led to the definition of a consensus binding sequence³⁴. The strategy involved shearing genomic DNA, ligating it to linkers, and then incubating it with wild type TP53 protein. The bound DNA was precipitated and amplified by PCR, and then the entire process was repeated. The selected DNA was cloned and tested for its ability to bind TP53. In this way, the authors found 18 independent genomic DNA fragments bound to TP53. Strikingly, each of the clones contained two copies of the 10 bp motif 5’-Pu-Pu-Pu-C-A/t-T/a-G-Py-Py-Py-3’ (Figure 2) separated by 0-13 bp³⁴ (Pu = purine residues and Py = pyrimidines). Synthetic oligonucleotides composed of a single monomer of the motif did not bind, nor did mutant TP53 proteins (with mutations at codons 143, 175, 248 or 273).
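As an illustration of the consensus described above, a sequence can be scanned for two copies of the 10 bp half-site separated by a spacer of 0-13 bp. The sketch below is a hypothetical Python helper: find_tp53_sites, HALF_SITE, FULL_SITE and the example sequence are assumptions made for illustration, not the method used by El-Deiry or in this thesis. It requires exact matches to the consensus, whereas real binding sites tolerate mismatches. Because the half-site pattern is its own reverse complement, scanning one strand is sufficient.

```python
import re

# 10 bp half-site from the text: Pu-Pu-Pu-C-(A/T)-(T/A)-G-Py-Py-Py
# Pu = purine (A or G), Py = pyrimidine (C or T)
HALF_SITE = "[AG]{3}C[AT][AT]G[CT]{3}"

# Two half-sites separated by 0-13 bp. The lookahead reports candidate
# sites even when they overlap one another; the lazy spacer selects the
# nearest downstream half-site for each start position.
FULL_SITE = re.compile(
    rf"(?=({HALF_SITE})([ACGT]{{0,13}}?)({HALF_SITE}))"
)


def find_tp53_sites(sequence: str) -> list[tuple[int, int]]:
    """Return (0-based start, spacer length) for each exact consensus match."""
    seq = sequence.upper()
    return [(m.start(), len(m.group(2))) for m in FULL_SITE.finditer(seq)]


if __name__ == "__main__":
    # Made-up sequence: two adjacent half-sites flanked by two bases on each side.
    example = "TTAGACATGCCTAGACATGCCTGG"
    print(find_tp53_sites(example))
```

Run on the made-up example, the helper reports a single site starting at position 2 with a spacer of 0 bp. A sequence containing only one half-site returns an empty list, consistent with the observation that a single monomer is not bound.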

Figure 2 | Sequence logo showing two TP53 binding sites (modified from 35).

After the sequencing of the human genome, a new technique called ChIP on chip was developed, combining chromatin immunoprecipitation with high-density oligonucleotide arrays. This technique was used to identify all binding sites for three transcription factors, cMyc, Sp1 and TP53, on human chromosomes 21 and 22³⁶. In the HCT116 cell line, 48 TP53 transcription factor binding sites were found. The authors extrapolated from these two chromosomes to estimate that TP53 has 1600 transcription factor binding site regions across the human genome. In another post-genomic evaluation of TP53 transcription factor binding sites, Wei et al. used a variant technique in which the precipitated DNA fragments were cloned and sequenced directly³⁵. A complete genome scan was conducted to find all of TP53’s direct targets. A certain amount of discrepancy exists on chromosomes 21 and 22 between the two studies, as Wei et al. reported only 542 high quality loci of TP53 binding. Having identified all of TP53’s transcription factor binding sites, the authors also provided a list of TP53’s target genes. The spectrum and function of these TP53 target genes provide insight into TP53’s normal role when activated.

1.1.11 TP53-mediated Cell Cycle Arrest

While normal cells arrest at G1 or G2, those lacking TP53 do not arrest at G1. GADD45 is one of the downstream targets of activated TP53. It was first isolated as a gene whose expression was induced following DNA damage and growth arrest (so named because it is growth arrest and DNA damage-inducible) and it contains a TP53-binding site in its promoter20. Notably, it is not induced in TP53-mutated cells or in cells from ataxia telangiectasia patients.

Another highly important gene whose expression is tightly controlled by TP53 is P21 (also called WAF1 or CDKN1A). When cells suffer DNA damage during G1, TP53 induces the transcription of P21, which is transported to the nucleus and then blocks the cell cycle37. P21 is an inhibitor of two cyclin-dependent kinases (CDKs): CDK2 and CDC2. P21 blocks the activity of any cyclin-CDK complex containing these kinases and, once the DNA damage is repaired, TP53 levels drop and P21’s cell cycle block is removed. The induction of P21 is highly correlated with TP53’s mutational status: a panel of TP53 wild type cell lines stimulated by various DNA-damaging agents all expressed P21, whereas TP53 mutated cell lines had low or undetectable P2137. In this way, TP53 prevents cells with DNA damage from replicating their DNA in the S phase. If a cell’s DNA has been damaged while already in S phase, TP53-induced P21 will associate with the proliferating cell nuclear antigen (PCNA), a subunit of the DNA polymerase δ, and halt the advance of a replication fork.

1.1.12 TP53-mediated Apoptosis

Programmed cell death, or apoptosis, is vital to the proper development, homeostasis, and immune system maintenance of multi-cellular organisms. This suicide program is therefore heavily regulated in all stages of life. An excess of apoptosis in the developing vertebrate nervous system (for example in mice lacking the gene Bcl-2 [discussed below]38) leads to embryonic death. Too much cell growth and a decrease in apoptosis can lead to cancer. In fact, the ability to evade apoptosis is a major hallmark of cancer39. As such, anti-apoptotic proteins act as oncogenes when expressed, and pro-apoptotic proteins act as tumor suppressors that can be inactivated by downregulation or mutation. TP53 is one such tumor suppressor and its role in apoptosis will be reviewed below.

Apoptosis is an ordered display of several distinct morphologies. First, the plasma membrane begins to form protrusions known as blebs. The nuclear structure then collapses and the chromatin condenses (pyknosis). Finally, the nuclei fragment and the chromosomal DNA is broken down, and within about an hour the fragmented cells are engulfed by other cells through phagocytosis.

The drivers of apoptosis are the caspases, cysteine proteases that cut after an Asp residue40. Caspases exist as inactive zymogens and must themselves be cleaved through separation of the large and small subunits to become active. There are multiple members of the caspase family that participate in apoptosis, and the activation of an initiator caspase leads to a cascade of activated executioner caspases, which cleave a broad array of targets.

One method by which tumors evade apoptosis is through disruption of the TP53 pathway. The pathway can be altered by TP53 point mutation (usually in the DNA binding domain), by deletion or methylation of ARF, by amplification of MDM2 or by mislocalization of TP53 to the cytoplasm. In non-cancer cells, TP53 induces apoptosis by activating genes in the extrinsic or intrinsic apoptotic pathways.

Activated TP53 induces the expression of the Fas receptor41, which is a component of the extrinsic, or death receptor, apoptotic pathway. Fas is a death receptor that, once expressed at the cell surface, sensitizes cells to the effect of Fas ligand (FasL), a death ligand. After being bound by their ligand, the death receptors bind the Fas-associated death domain protein (FADD) in the cytoplasm. Acting as a “death complex”, FADD recruits two initiator caspases: caspase-8 and caspase-10. At this point the initiator caspases activate the executioner caspases (3, 6 and 7), which are also common to the second apoptotic pathway, termed the intrinsic or mitochondrial pathway.

As the name would imply, mitochondria play a vital role in mitochondria-mediated apoptosis. Specifically, cytochrome c, which resides between the mitochondria’s inner and outer membranes, is released into the cytosol when apoptosis is initiated. Cytochrome c then binds APAF1 to form a seven-spoked molecule called the apoptosome, which activates the zymogen procaspase 9 and converts it into caspase 9. Caspase 9, in turn, activates the same executioner caspases as described above. The cleavage activities of the executioner caspases are responsible for the characteristic morphological changes of apoptosis. The release of cytochrome c by the mitochondria is controlled by a variety of proteins belonging to the Bcl-2 family42. All members of the Bcl-2 family have in common at least one of four distinct protein domains and can associate with organelles, including the mitochondria. The Bcl-2 gene itself and four related genes (Bcl-XL, A1, Bcl-w and Mcl-1) are anti-apoptotic while the Bax family (Bax, Bak and Bok) and the BH3-only family (Bim, Bik, Bad, Puma, Bid, Noxa, Hrk and Bmf) are pro-apoptotic. The balance of pro- versus anti-apoptotic Bcl-2 family proteins determines whether the mitochondria depolarize and release their cytochrome c or retain it. As mentioned, TP53 plays an important role in mitochondria-mediated apoptosis. Once stabilized, TP53 is a transcriptional activator of Bax43, NOXA44 and PUMA45, all pro-apoptotic proteins that drive the release of cytochrome c from the mitochondria.

1.2 Li-Fraumeni Syndrome

Alterations of TP53 are the most frequently observed change in human cancer. Germline TP53 mutations have also been observed in Li-Fraumeni syndrome, a unique cancer predisposition syndrome, which will be described. Further insight into the role of TP53 has come from study of its mutant form in human cancer.

In 1969, Li and Fraumeni described a constellation of tumors suggestive of a previously unrecognized autosomal dominant familial syndrome46. In an epidemiological survey of 650 children with rhabdomyosarcoma, four families with the presumptive syndrome were identified. Other tumors, including early onset soft tissue sarcomas and osteosarcomas, breast cancer, brain tumors and leukemia, were also identified in first-degree family members47. A follow-up study found that over a 12-year period, 10 of 31 surviving members developed 16 additional cancers, a far greater number than expected by chance. In 1988, Li and Fraumeni reported 24 affected families who met the following criteria: 3 close relatives with a documented cancer, including one individual, designated the proband, with a sarcoma before 45 years of age; a first-degree relative with cancer in this age interval; and a close relative (first or second degree in the same lineage) with cancer in this age interval or a sarcoma at any age48. This has remained the clinical definition of classic Li-Fraumeni syndrome (LFS).
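The three-part classic definition above can be expressed as a simple rule check. This is an illustrative sketch only and not a clinical tool: the `AffectedRelative` record, its field names and the `meets_classic_lfs` function are hypothetical, each record is assumed to be a distinct family member with a documented cancer, and the requirement that relatives belong to the same lineage is not modeled.

```python
from dataclasses import dataclass


@dataclass
class AffectedRelative:
    relation: str          # "proband", "first_degree" or "second_degree"
    is_sarcoma: bool       # True if the documented cancer is a sarcoma
    age_at_diagnosis: int  # age in years


def meets_classic_lfs(family, age_cutoff=45):
    """Check the classic criteria: a proband with sarcoma before 45, a
    first-degree relative with cancer before 45, and a further first- or
    second-degree relative with cancer before 45 or a sarcoma at any age."""
    proband_ok = any(
        p.relation == "proband" and p.is_sarcoma and p.age_at_diagnosis < age_cutoff
        for p in family
    )
    if not proband_ok:
        return False

    relatives = [p for p in family if p.relation in ("first_degree", "second_degree")]
    for i, first in enumerate(relatives):
        if first.relation != "first_degree" or first.age_at_diagnosis >= age_cutoff:
            continue
        # A different relative must satisfy the third criterion.
        for j, other in enumerate(relatives):
            if j != i and (other.age_at_diagnosis < age_cutoff or other.is_sarcoma):
                return True
    return False


# Hypothetical family: proband with sarcoma at 12, parent with cancer at 38,
# grandparent with sarcoma at 60.
family = [
    AffectedRelative("proband", True, 12),
    AffectedRelative("first_degree", False, 38),
    AffectedRelative("second_degree", True, 60),
]
print(meets_classic_lfs(family))
```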

1.2.1 Genetic Etiology of Li-Fraumeni Syndrome

Although a common etiological agent in these families was thought to exist, several factors prevented the LFS gene from being identified. Because of the syndrome’s rarity and high mortality, available tissue samples were limited. Furthermore, classical genetic linkage analysis was difficult and no constitutional karyotypic abnormalities were found. In 1990, Malkin et al. used a candidate gene approach to identify constitutional TP53 mutations in LFS families49. All five of the families studied in this initial report had constitutional mutations in the TP53 gene.

Confirming these initial findings, there are now numerous independent reports of patients harboring germline TP53 mutations (UMD TP53 database)1, 3-39,50-69,70-107.

1.2.2 TP53 Mutation Spectrum, Type and Frequency

TP53 was the most sequenced cancer-related gene in the genome before high throughput sequencing became routine108. The frequency of TP53 mutations in specific tumor types, as well as the spectrum of reported mutations, have provided insights into the carcinogenic processes, the specific tumor susceptibilities of those who harbor germline TP53 mutations, and the functional characteristics of the TP53 protein.

The majority of germline TP53 mutations are located within TP53’s DNA-binding domain (Figure 3). This domain has been mapped to the central portion of the protein between residues 102-292, and is highly resistant to proteolytic digestion, suggesting that it is an independently-folded unit109. Crystals of this unit bound to a 21 bp DNA duplex were grown and the structure was resolved by crystallography, and shown to consist of two antiparallel β sheets, an α helix that binds the major groove of the DNA, and other structural units110. Of note is that the specific residues within the DNA binding domain that are most frequently mutated (“hotspots”) are those residues located at or near the TP53-DNA interface. These hotspot mutations are sometimes categorized as “contact” (codons 248 and 273) or “structural” (codons 175, 245, 249 and 282) mutants, depending on whether they normally directly bind DNA or contribute to the proper folding of the DNA binding domain.

These six hotspot mutations - at codons 175, 245, 248, 249, 273 and 282 - are not only the most common sites for somatic TP53 mutations, but are also common sites of germline alteration (the germline R337H mutation is an exception and will be discussed below). Thus, the dysfunctions they confer result not only in rare LFS-associated tumors (also discussed below) but also frequently drive sporadic cancers. Cancers with high frequencies of somatic TP53 mutations include ovarian, esophageal, colorectal, head and neck, laryngeal, and lung cancers, whereas leukemia, sarcoma, testicular cancer, malignant melanoma, and cervical cancer infrequently acquire TP53 mutations108. Even within well-defined histopathological tumor types, specific patterns of TP53 mutations are usually not found (e.g. codons mutated only in that tumor type). However, some tumor types do have distinct methods for inactivating TP53: sarcomas are more likely to have amplifications of the MDM2 oncogene, and aflatoxin-induced hepatocellular carcinoma is associated with TP53 codon 249 mutations111.


Figure 3 | Codon distribution of TP53 somatic mutations. Highlighted are the six most frequent sites of TP53 mutation (image from the IARC database version R14112).


The most common type of TP53 mutation, whether somatic or germline, is missense (>70%), which results in the expression of a full length but dysfunctional protein that accumulates in the nucleus. The predominance of missense mutations has been considered a unique feature of this gene when compared to other tumor suppressors or known cancer predisposition genes. For example, Hussain and Harris compared TP53’s dominant type of mutation to that of RB1, APC, ATM, WT1, BRCA1, BRCA2, NF1, NF2, p16 and VHL, ten well-studied tumor suppressors. Most of these are inactivated by nonsense mutations, deletions or insertions, leading to an absent or truncated protein113. However, recent high-throughput cancer genome studies have shown a high proportion of somatic missense mutations in the cancer genome. A sequencing study of 518 protein kinase genes in 210 diverse human cancers found that the majority of changes were missense114. A similar result was found in colorectal and breast cancers115. These recent studies do not distinguish tumor suppressors from oncogenes and this may explain some of the differences with previous work.

In any case, the issue of whether or not missense mutations are common to some or all cancer genes does not take away from the fact that they are widespread in TP53. Two hypotheses explain why it is more advantageous for tumors to retain a defective TP53 protein, rather than remove it entirely: mutant TP53 has a wild type TP53-independent gain of function, or mutant TP53 interferes with wild type TP53 in a dominant negative manner. These concepts are not necessarily mutually exclusive.

The gain of function concept contends that, upon acquiring a mutation, TP53 develops novel functionality that is distinct from the wild type protein and that these functions actively contribute to neoplastic growth116. An example of the oncogenic function of mutant TP53 is its ability to inactivate p63 and p73, two TP53 family members, and inhibit their ability to induce cell cycle arrest, leading to an increased number of foci in a focus-forming assay117. The overexpression of an R175H mutant in TP53-deficient cells results in tumor formation in mice, whereas cells without the over-expressed mutant do not form tumors in nude mice118.

Many TP53 mutant proteins exert a dominant negative activity over the wild type protein. This is due to their ability to oligomerize and it has been shown that wild type and mutant TP53 can be coimmunoprecipitated119. The oligomerized proteins form part of a TP53 tetramer which is unable to bind and induce downstream targets. The dominant negative effects of two hotspot mutations were tested in mouse embryonic stem (ES) cells120. A variety of ES cells were exposed to γ irradiation and, while TP53-/- cells did not induce bax, cyclinG or MDM2, cells containing a single point mutant showed reduced expression of these targets, even when compared to TP53+/- cells.

TP53 mutational status can provide insight into specific carcinogen exposures121. Ultraviolet light is related to transitions at dipyrimidine sites (CC to TT) in squamous cell carcinoma, whereas aflatoxin exposure leads to G:C to T:A transversions at codon 249 in hepatocellular carcinoma122. Similarly, tobacco smoke carcinogens lead to G:C to T:A transversions in lung cancer123.
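The distinction between transitions and transversions used above follows directly from base chemistry: a transition exchanges a purine for a purine or a pyrimidine for a pyrimidine, while a transversion exchanges a purine for a pyrimidine or vice versa. A minimal sketch follows; the function name `classify_substitution` is hypothetical and only single-base substitutions are handled.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Classify a single-base substitution as a transition or transversion."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    # Same chemical class on both sides means transition.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# C to T (as in the CC to TT changes linked to ultraviolet light)
print(classify_substitution("C", "T"))
# G to T (one strand of a G:C to T:A change)
print(classify_substitution("G", "T"))
```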

In work that dramatically boosted our understanding of TP53 missense mutations, Kato et al. used high-resolution missense mutation analysis to evaluate the effect of all possible amino acid changes in TP53 on transactivation124. Substitution of every codon of TP53 except the first (2 to 393) was performed by site directed mutagenesis. Nonsense and silent mutations were excluded and this resulted in 2,314 TP53 mutants, which were then converted to cDNA and used in a yeast-based functional assay to evaluate transactivation of p21/WAF1, MDM2, BAX, 14-3-3σ, AIP1, GADD45, Noxa and P53R2. Remarkably, approximately one third of missense mutations were inactive and the vast majority of these were in the DNA binding or tetramerization domains, while missense mutations in other areas were as active as wild type TP53 or had some enduring activity. It was further noted that 27.5% of the mutants remained active for some but not all promoters (fairly similar results were found in human Saos-2 cells), which might provide clues as to the transactivational preferences of specific TP53 mutations.

1.2.3 Genetic Modifiers

Genetic modifiers are secondary changes at loci that are distinct from the primary disease gene but alter its phenotype. They are found in a number of genetic conditions. In inherited cancer susceptibility, genetic modifiers can explain differences in clinical phenotype between people harboring the same germline mutation. This can include differences in the age of onset, site of the primary tumor, or in the severity of the disease. In neurofibromatosis type 1 it has been shown that the phenotypic correlations between monozygotic twins were higher than between distant relatives with the same NF1 mutations125. A genetic modifier in RAD51, in the form of a G to C single nucleotide polymorphism, has been found in carriers of BRCA2 mutations, in whom it raises the cancer risk at younger ages126-128.

The clinical presentation of LFS is highly heterogeneous, even within the same family, and it has been suggested that genetic modifiers account for some of these differences. One of the first variants shown to modify TP53 germline mutations was a secondary change in TP53’s fourth exon that leads to either a proline or arginine at codon 72 (P53Codon72). While extremely common in the human population, this single nucleotide polymorphism does have functional importance. Pro72 was the first of the variants to be cloned. In 1986, it was observed that the Arg72 variant migrates faster on SDS-polyacrylamide gels129. Subsequently it was found that Arg72 induces apoptosis better than Pro72 in stably transfected human Saos-2 cells130. While the two variants could both transactivate similar genes, the difference in apoptosis was explained by the greater mitochondrial localization of Arg72.

In 2004, Bond et al. identified a naturally occurring common variant in the promoter of MDM2, TP53’s key regulator (MDM2SNP309)131. The single nucleotide polymorphism is located at the 309th basepair in intron 1 and involves a T to G transversion which extends an Sp1 transcription factor binding site. This leads to higher affinity for Sp1 and consequent overexpression of MDM2 in tumor cell lines homozygous for MDM2SNP309 (G/G), as seen by real time quantitative PCR and Western blot analysis. These cells showed reduced death rates following treatment with a DNA damaging agent. The authors went on to show that LFS fibroblasts homozygous for MDM2SNP309 had increased basal levels of MDM2 and an altered DNA damage response, as measured by G2 arrest following etoposide treatment. The phenotypic consequences of this genetic modifier in LFS are a reduced age of cancer onset and the occurrence of multiple primary tumors.

Knowing that P53Codon72 had functional significance (as discussed above) and that it modified the risk of sporadic cancer (albeit not in a straightforward manner), Bougeard et al. set out to see what effect this variant had in LFS families132. Additionally, the authors reasoned that the effects of MDM2SNP309 and P53Codon72 might be additive, as MDM2 was known to have higher affinity for Arg72 than Pro72. Sixty-one TP53 mutation carriers from 41 LFS families were genotyped for codon 72 and MDM2SNP309. Variants were phased relative to the TP53 mutation if heterozygous (Arg/Pro). Intriguingly, it was found that carriers of Arg72 were affected with cancer significantly earlier than Pro72 carriers (21.8 vs. 34.4 years, respectively), but the variant’s phase had no effect. Notably, those TP53 mutation carriers harboring both genetic modifiers had the lowest age of onset (16.9 years), while the ages of onset were highest or at intermediate levels in individuals with no variants or only one variant, respectively.

A 16 basepair duplication in TP53 intron 3 (PIN3) was recently identified as another potential genetic modifier of LFS133. DNA variants at PIN3, MDM2SNP309, P53Codon72 and PIN2 (single nucleotide change in intron 2) were genotyped in a series of 32 Brazilian TP53 mutation carriers. PIN2 was found to be in perfect linkage disequilibrium with P53Codon72 in a large set of controls. In this cohort of TP53 mutation carriers, the non-duplicated PIN3 was associated with a 19-year reduction of age of cancer onset, which was suggested to be non-cumulative with MDM2SNP309.

1.2.4 Cancer Penetrance

In a hospital-based analysis, the lifetime cancer risk of TP53 germline mutation carriers was estimated to be 73% in males and nearly 100% in females, with the high risk of breast cancer accounting for the difference. The specific risk for males is 19%, 27% and 54% before the age of 15, at 16-45 years and at >45 years, respectively. The risk for females is 12%, 82% and 100% before the age of 15, at 16-45 years and at >45 years, respectively76.

A study published by Hwang et al.94 described the cancer risk in kindreds ascertained on the basis of childhood soft tissue sarcomas. Cancer risk was determined for TP53 mutation carriers and non-carriers who had been followed for greater than 20 years. Among the carriers, 12%, 35%, 52% and 80% developed cancer by ages 20, 30, 40 and 50 years, respectively. The most common cancers were breast cancer and soft tissue sarcomas (both known LFS-component tumors). The 3,201 non-carriers had a cumulative risk of 0.7%, 1.0%, 2.2% and 5.1% for the same ages, which is almost identical to that of the general population. While the number of carriers was similar in males and females, the cancer risks were not. The observed cancer risk was significantly higher in female carriers than in males and, in contrast to the previous study presented, was not due to the incidence of breast cancer. At every age analyzed, females had a significantly higher incidence of cancer (p<0.001). The specific cumulative risks for female carriers were found to be 18%, 49%, 77% and 93% by ages 20, 30, 40 and 50 years, compared with cumulative risks of 10%, 21%, 33% and 68% in male carriers at the same ages. Even after excluding sex-specific cancers (breast, ovarian and prostate cancer) a higher female cancer risk was observed, including a higher risk for brain and lung cancer.

1.2.5 Genetic Anticipation

The notion of anticipation has its roots in Morel’s theory of degeneration (Traité des Dégénérescences, 1857)134. The concept was later renamed “anticipation” and has come to mean a progressive deterioration of illness from parent to child, involving either increased incidence, earlier age of onset, or an increased severity of the disease in successive generations. Studies involving anticipation appeared in the beginning of the 20th century; however, the idea’s popularity diminished considerably when it was critiqued by Penrose in 1948 as a statistical artifact, attributable only to one or more forms of ascertainment bias. A resurgence of the study of anticipation emerged when, even after systematic ascertainment, a prospective study showed that anticipation was inherent in the transmission of myotonic dystrophy135. A molecular basis for this observation followed soon thereafter: the discovery of an association between expanding trinucleotide repeats and more severe disease progression136. The molecular mechanism for disease anticipation in the autosomal dominant dyskeratosis congenita (AD-DC) bone marrow failure syndrome has been discovered to be progressive telomere shortening47. In AD-DC families the age-adjusted telomere length of parents and children with TERC mutations was found to be significantly lower than that of healthy individuals. Telomere length was also considerably shorter in the second generation of affected families compared with normal families.

Genetic anticipation has long been suspected in LFS. On the basis of primary observations of LFS families, it has been hypothesized that the inheritance of a germline TP53 mutation is associated with earlier onset tumors in succeeding generations. Data supporting this concept have come from two independent statistical analyses, each designed to eliminate possible biases. In the first work, the mean age of onset was studied in 162 families or isolated patients with germline TP53 mutations as reported in a database137. The mean age of onset was studied in one- to five-generation families and was always found to be lower in the offspring generation than in the parental generation for any two successive generations, and the differences in age of onset between the generations were always highly statistically significant. To compare intergenerational differences in age of onset, four different sampling schemes were used that diminished any potential sampling biases. For each sampling scheme, the fraction of intergenerational pairs showing anticipation was greater than 50%. In the second study, time-to-event methods were used to show that ten extended LFS families exhibited a generational effect, in which cancer incidence increased in successive generations only amongst carriers of TP53 germline mutations138.
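The summary statistic mentioned above, the fraction of intergenerational pairs showing anticipation, can be illustrated with a short sketch. This is not a reconstruction of the sampling schemes used in the cited study; the function name and the ages in the example are hypothetical, and a pair is assumed to show anticipation simply when the child’s age of onset is lower than the parent’s.

```python
def fraction_showing_anticipation(pairs):
    """Fraction of (parent_age, child_age) onset pairs in which the child
    was diagnosed at a younger age than the parent."""
    if not pairs:
        raise ValueError("at least one parent-child pair is required")
    earlier = sum(1 for parent_age, child_age in pairs if child_age < parent_age)
    return earlier / len(pairs)


# Hypothetical ages of onset (parent, child), in years.
example_pairs = [(45, 30), (38, 40), (50, 22), (33, 12)]
print(fraction_showing_anticipation(example_pairs))  # 0.75
```

A value above 0.5 means that more pairs show earlier onset in the child than the reverse, which is the direction reported in the text. On its own it does not correct for ascertainment or follow-up bias.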

Similar to AD-DC, the molecular mechanism for anticipation in LFS may also be telomere erosion. In both children and adults, telomere length is shorter in TP53 germline mutation carriers affected with cancer than in non-affected carriers and controls, as measured by the terminal restriction fragment length method. Within families, affected children have shorter telomere length than their unaffected siblings and wild type parent139. Compared with healthy individuals of the same age, germline TP53 carriers have shorter telomere length, as measured by quantitative PCR. Further, the difference is greater in children than in adults140.

1.2.6 Germline TP53 Mutations in Pediatric Adrenocortical Carcinomas of Brazil

Although infrequent, pediatric adrenocortical carcinomas in Brazil are consistently associated with inherited mutations of TP53 at codon 337. These tumors are therefore some of the rare examples of a particular TP53 mutation linked to a specific tumor type. Ribeiro and coworkers reported that a striking 35 of 36 patients with this tumor harbored the identical TP53 germline point mutation in exon 10 at codon 337 (CGC->CAC), encoding an arginine to histidine substitution (R337H)88. To show that this was not simply due to the patients having a common ancestor, four intergenic and intragenic TP53 markers were genotyped. However, a later study showed strong co-segregation between two highly informative markers and the R337H mutation, suggesting that it does indeed come from a common ancestor141. In adrenocortical tumors, loss of heterozygosity was observed in 5 of 6 tumor samples examined88.
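The relationship between the codon change (CGC->CAC) and the protein-level name R337H can be checked against the standard genetic code. The sketch below is illustrative: the helper name `describe_substitution` is hypothetical, and the codon table is built from the standard code in the conventional T, C, A, G ordering.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def describe_substitution(ref_codon, alt_codon, codon_number):
    """Return the protein-level name and type of a single-codon change."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        kind = "silent"
    elif alt_aa == "*":
        kind = "nonsense"
    else:
        kind = "missense"
    return f"{ref_aa}{codon_number}{alt_aa}", kind


print(describe_substitution("CGC", "CAC", 337))  # ('R337H', 'missense')
```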

In a follow-up study of 30 kindreds from southern Brazil with at least one case of adrenocortical tumors, 695 individuals in the carrier parental line and 232 individuals in the other parental line were tested for this mutation using a PCR-based assay142. The R337H mutation was found in 34.5% (240/695) of the carrier line and not found in members of the other parental line. Thirty-one cancers presented in the carrier parental line and seven kindreds met the definition of Li-Fraumeni like syndrome (LFL)143, while none met the classical definition of LFS48. The estimated penetrance of adrenocortical carcinomas among carriers in this cohort was 8.5% (95% CI, 7.3% to 9.6%) in kindreds with one proband and 12.5% (95% CI, 10.2% to 14.6%) amongst those with multiple probands. The authors noted that determining if extra-adrenal tumors are more frequent in carriers (as compared to the Brazilian population frequency) should be the subject of future research.

Achatz et al. employed a different design while studying this mutation in southern Brazil144. Index cases were selected on the basis of family histories corresponding to the definitions of LFS or LFL instead of on the basis of an adrenocortical tumor. Thirteen of 45 index cases were found to harbor a germline mutation of TP53 and, strikingly, 6 of these were found to be R337H. The frequency and spectrum of tumors in these families were comparable to those reported elsewhere in LFS families, although adrenocortical tumors were twice as frequent in this population.

1.2.7 LFS-associated Tumorigenesis

Most dominant cancer syndromes predispose individuals to a narrow range of cancer types. Examples include autosomal dominant susceptibilities to breast, ovarian and colorectal cancer and to melanoma. LFS, in contrast, leads to a broader range of cancer types (Figure 4). However, even if the LFS clinical presentation is heterogeneous, five sites have been shown to be most prone to germline TP53-induced tumorigenesis: breast, soft tissue, brain, bone and adrenal gland. Tumors at these sites represent 30.6% (breast), 17.8% (soft tissue), 14% (brain), 13.4% (bone) and 6.5% (adrenal gland) of tumors in carriers or obligate carriers of TP53 mutations145. In the first decade of life adrenal gland, soft tissue and brain tumors predominate, while bone sarcomas occur during the second decade, and breast cancers predominate after 21 years of age.

It should be noted that these statistics do not paint a complete picture of LFS-associated cancers; there are specific tumor types that are rarely found, but if diagnosed are frequently associated with the syndrome. Adrenal cortical carcinoma is an example; while it represents only 6.5% of all LFS tumors, between 50%146 and 82%74 of all adrenal gland carcinoma patients are carriers of a germline TP53 mutation, even in the absence of a family history of cancer. Therefore, certain rare tumors on their own can be extraordinarily strong indications of LFS.

Knowing that TP53 is central to sporadic cancer, one must ask why these five cancer types are most common in LFS above all others? The question of the syndrome’s tissue specificity is unanswered. Actually, this is an open question in nearly every dominant cancer syndrome, even those more common and with greater tissue specificity than LFS. For example the penetrance of BRCA1 mutations is 85% for breast cancer and 65% for ovarian cancer147. The reason for this incredible tissue specificity is not known although hypotheses postulate that this may be due to redundant DNA repair systems in other tissues, the high proliferative rate of breast and ovarian cells, a unique signaling environment involving the estrogen receptor, or perhaps a uniquely protective environment that enables proliferation of BRCA1-deficient cells148.

The Brazilian R337H TP53 mutation is a unique example of a specific mutation that leads to a specific cancer within LFS. Though this mutation is now thought to also predispose to other cancers, the R337H-adrenocortical-carcinoma connection is a rare example wherein the tissue specificity of LFS has been explained. While most TP53 mutations are located within the DNA binding domain, the R337H mutation is within the tetramerization domain (residues 310 to 360). This domain is a dimer of dimers with a four-helix bundle flanked by antiparallel β-strands. The amino acid Arg 337 forms a salt bridge with Asp 352 within the dimer subunits149. Several methodologies have been used to compare the structure of the wild type tetramerization domain with the R337H domain150. Both forms of the domain elute as tetramers in gel filtration columns and have similar secondary structures. Although structurally very similar, the TP53 R337H domain unfolds at lower temperatures and is more dependent on pH for stabilization. The authors suggest that the tissue specific effects of this mutation may be associated with an increase in the pH of adrenal cells, whereas in cells of other tissues, with a pH of 7.0, it may function properly.

What is clear is that LFS-associated tumor types are not necessarily those with a propensity to develop somatic TP53 mutations. There does not seem to be a relationship between the most common LFS tumors and sporadic cancers that frequently acquire TP53 mutations. For example, 43.2% of colorectal cancers have somatic TP53 mutations108, but only 2.8% of TP53 germline mutation carriers developed early onset colorectal cancer151.

LFS cell lines display characteristics of transformed cells, an observation that was hypothesized to explain the intense cancer predisposition amongst these patients. In 1990, Bischoff et al. remarked that normal fibroblasts from LFS patients continued to grow after numerous passages, even after control cell lines had ceased growing152. Morphologically different cells spontaneously developed and became a dominant feature of the dish within 8 to 10 population doublings. Karyotyping of these same cells showed that their genomes were often hypodiploid or contained structural abnormalities (e.g., double minutes). In addition to spontaneous immortalization and genomic instability, LFS fibroblasts also displayed anchorage independent growth. Spontaneous immortalization has also been observed in breast epithelial cells from LFS patients153.

Knudson’s two hit hypothesis would predict that both copies of TP53 need to be inactivated for an LFS-associated tumor to develop154. However, biallelic loss, or loss of heterozygosity, is only seen in 44% of human LFS tumors54 and in approximately 50% of Trp53+/- mouse tumors155. This shows that merely losing or inactivating one TP53 allele can be sufficient, especially if the single mutation is dominant negative or has oncogenic features (as described above).

Figure 4 | Li-Fraumeni syndrome component tumors.

1.2.8 Choroid Plexus Carcinoma: an LFS Component Tumor

Choroid plexus (CP) neoplasms are an important LFS component tumor. These are rare tumors but it has been suggested that they are highly indicative of a TP53 germline mutation and in this respect are similar to adrenocortical carcinomas, as discussed above. The occurrence of CP tumors has been documented in case reports of LFS families and in individuals without a family history of cancer65,156-159. The original description of LFS reported the cancer spectrum and ages of 24 kindreds48. After identifying a proband in an LFS family affected with a CP tumor, Garber et al. re-examined the tumor incidence in these 24 LFS kindreds158. Of the 13 brain tumors of known histology, two CP tumors were found. More recently, after TP53 was identified as the cause of LFS, Krutilkova et al. described five families in which a CP tumor was found159. These five individuals were ascertained in different ways and harbored germline TP53 mutations in exons 5, 7 and 8. Of these, only one family met the clinical definition of LFS as outlined by Drs. Li and Fraumeni. Gonzalez et al. reported another eight CP tumors with TP53 mutations160, and Ruijs et al. reported another patient161. Based on these and other reports, the Chompret criteria for TP53 mutation screening, described by the French LFS group, have been updated to also include a patient with a CP tumor, irrespective of age or family history162.

The choroid plexus is a specialized epithelial layer in the ventricles of the brain. It is a villous structure and protrudes into the cerebrospinal fluid of the lateral, third and fourth ventricles. While there are macroscopic differences between the four parts of the choroid plexus, they are microscopically very similar163. Choroid plexuses are composed of villi of cuboidal or columnar epithelial cells, and a vascularized stromal core. The choroid plexus projects into the four ventricles but is continuous with the ependymal cells that line each ventricle. The stromal core contains blood vessels, collagen fibers, elongated fibroblasts as well as globular macrophages and dendritic cells. The primary function of the choroid plexus is the production and secretion of cerebrospinal fluid (CSF). CSF has a number of important functions in the central nervous system, including providing mechanical support to the brain, amongst others164. The concentration and components of CSF are tightly regulated by the choroid plexus, which forms a barrier between the blood (contained within its stroma) and the CSF. The blood-CSF barrier is accomplished by tight junctions which restrict the flow of molecules and ions into the CSF. Through the expression of various ion transport proteins, the choroid plexus epithelium maintains a polarity that helps drive the movement of water along an osmotic gradient.

Tumors arising from the choroid plexus epithelium are classified as choroid plexus papilloma (CPP, World Health Organization [WHO] grade 1), atypical choroid plexus papilloma (aCPP, WHO grade 2) and choroid plexus carcinoma (CPC, WHO grade 3). CPP are benign tumors that can be treated through surgery alone. They resemble the choroid plexus epithelium from which they arose except with higher cellular density and cellular atypia. Like normal choroid plexus, CPPs can secrete CSF. On the other hand, CPC is a more aggressive tumor, which has lost the differentiated papillary structure, and has very high cellular density and necrosis.

The aCPPs are an intermediate entity and a novel diagnostic subtype created due to the realization that choroid plexus tumors could not always be easily classified as CPP or CPC. The aCPP is distinguished from the CPP by increased mitotic activity165. In recognition of this, Jeibmann et al. examined 164 choroid plexus tumors for atypia and measured the effect of various clinical and histological features on the tumor’s recurrence166. Of the CPPs, 37% displayed at least one atypical histological feature. Mitotic activity, above other features, was most associated with tumor recurrence. The frequency of aCPPs that stain positively for TP53 was also intermediate to that of CPPs and CPCs (2/21, 0/31 and 8/17, respectively)167. The proportion of aCPPs that recur is higher than that of CPPs (29% vs 6%, respectively), and this recurrence tended to occur within a shorter time period168.

Choroid plexus tumors are rare, with an incidence of 0.3 cases per million. There are nearly four times as many CPPs diagnosed as CPCs. Choroid plexus tumors are, however, more frequent in the pediatric population, comprising between 1% and 4% of brain tumors in children and 13% of brain tumors occurring during the first year of life. In fact, 70% of choroid plexus tumors are diagnosed within the first two years of life. The most common location for CPP, aCPP and CPC is the lateral ventricle (71%, 83% and 88%, respectively)167.

Much of the biological basis of choroid plexus tumor initiation, recurrence and metastasis is unknown. Research efforts have been hampered by the lack of high quality frozen tumor specimens and adequate model systems (e.g. tumor-derived cell lines or mouse models). However, two important aspects of choroid plexus biology are clear, as these have been reported by multiple groups: TP53 is involved (as evidenced by the excess of CPC in LFS families) and these tumors reliably express specific diagnostic markers (as has been shown by immunohistochemical studies of banked slides). Other work on the biological basis of choroid plexus tumorigenesis is summarized below.

Hasselblatt et al. used a microarray approach to look for novel diagnostic markers of choroid plexus tumors169. Eight postmortem normal choroid plexus samples were microdissected to separate choroid plexus and ependymal cells. RNA was then hybridized to Affymetrix GeneChip arrays. One CPP sample was run on a different array manufactured by the same company. In total, 46 genes were found to be overexpressed in normal choroid plexus epithelial cells and present in the CPP, and 11 genes were subsequently validated by immunohistochemistry to be expressed in normal human choroid plexus epithelium but not in the ependyma. This gene set was then evaluated on a tissue array consisting of 35 normal choroid plexus, 18 CPPs and 100 other primary or metastatic brain tumors. Notably, Kir7.1 and stanniocalcin-1 were positive in nearly all normal samples, the majority of CPCs and almost none of the other brain tumors. Transthyretin, a common choroid plexus marker, did stain most choroid plexus samples, but was also found among other epithelial or glial tumors. Doring et al. had previously reported that Kir7.1 was selectively expressed in the choroid plexus and suggested that this channel may contribute to the transepithelial transport of potassium and its clearance from the CSF170. In a subsequent analysis of 7 CPPs, TWIST-1 was shown to also be over-expressed and to promote proliferation and invasion171.

Chromosomal analyses have rarely been performed in choroid plexus tumors172. In one experiment, the chromosomal imbalances of 49 formalin-fixed and paraffin embedded choroid plexus tumors (34 CPP, 15 CPC) were evaluated by array comparative genomic hybridization173. Every chromosome was found altered and nearly every sample had some form of imbalance (32/34 CPP and 15/15 CPC). Non-random chromosomal changes include gains of 5q, 5p, 9q, and 8q, and losses of 10q and 10p in CPPs, and gains of 1, 4q, 4p, 8q, 14q, and 21q, and losses of 5q, 5p, and 18q in CPCs.

1.3 Copy Number Variations: Dynamic Genomes

Portions of this section are published and are reproduced with permission from Genome Medicine and Current Opinion in Oncology.

Shlien A, Malkin D. Copy number variations and cancer. Genome Med. 2009 Jun 16;1(6):62.

Shlien A, Malkin D. Copy number variations and cancer susceptibility. Curr Opin Oncol. 2010 Jan;22(1):55-63.

The human genome contains scores of DNA changes that form the basis of inter-individual differences and inherited traits. As described above, the genomes of LFS patients have been assessed for a select number of single nucleotide polymorphisms (SNPs), in addition to rare TP53 mutations. SNPs are but one form of common genomic variation. Other types of variation include microsatellites, minisatellites and copy number variations (CNVs). While CNVs are more recently described (as will be discussed below), other forms of genetic variation have been extensively studied, and have provided insights into population structure and revealed new common disease variants.

Our genomes are not the stable structures we once thought they were. Recent genome-wide studies have shed light on copy number variations (CNVs), an unexpectedly frequent, dynamic and complex form of genetic diversity, and have quickly upended the idea of a single diploid “reference genome”. While the characterization of the extent and location of these regions in genomes from healthy individuals is far from complete, many groups, including ours, are actively trying to determine the clinical relevance and impact of CNVs in patient populations.

CNVs are structurally variant regions, in which copy number differences have been observed between two or more genomes 174. Defined as being larger than one kilobase (kb) in size, CNVs can involve gains or losses of genomic DNA that are either microscopic or submicroscopic. Until recently, only a few copy number variable loci had been identified, such as duplications at the alpha 7-nicotinic receptor gene (Chrna7) at 15q13-15 175 and variation at the major histocompatibility complex 176. In 2004, significant advances in DNA array technology enabled the discovery of many CNVs, revealing a novel and pervasive form of inter-individual genomic variation 177,178. These pioneering genome-scale efforts used different platforms to find 76 CNVs in 20 individuals 178 and 255 CNVs in 55 individuals 177, a number of which were common to both studies, suggesting possible hotspot regions of CNVs in the human genome. Even this was soon found to be an under-representation of the number of CNVs; follow-up studies have since ascertained many thousands of CNV regions in hundreds of healthy individuals.

The recent increase in scientific interest in CNVs, combined with improvements in microarray construction (higher density at lower cost) and the development of new informatics techniques, has led to the ascertainment of approximately 21,000 CNVs, or around 6,500 unique CNV loci, in the 5 short years since this form of genetic variation was first revealed (March 2009 update of the Database of Genomic Variants177). Furthermore, next-generation sequencing technologies will soon be used to sequence thousands of genomes along with their CNVs.

In 2010, Conrad and Pinto et al. published an updated CNV map of the human genome179. CNV discovery was performed using a set of 20 NimbleGen arrays, on which 42 million probes were tiled (~2.1 million per array), recapitulating the genome at a resolution of one probe every 50 bp. Forty female samples were used to find CNVs with a minor allele frequency of 5% in Northern and Western European individuals and the Yoruban from Ibadan, Nigeria (95% power). A total of 11,700 CNVs greater than 443 bp were identified, with an average of 1,117 and 1,488 CNVs found per person in the European and African samples, respectively. Remarkably, 49% were called in a single individual. Using this discovery set, the group created a CNV genotyping array comprised of 105,000 oligonucleotide probes, which was also used by the Wellcome Trust Case Control Consortium (described below). In total, 450 samples were assessed using this array and 5,238 CNVs were genotyped (77% deletions, 16% duplications and 7% multi-allelic).

1.3.1 CNVs and disease: Mutable Genomes

The copy number variation map for the human genome is being continuously refined and has already pinpointed the location, copy number, gene-content, frequency and approximate breakpoints of numerous CNVs in the healthy population. These structural variants can alter transcription of genes by altering dosage or by disrupting proximal or distant regulatory regions, as has been shown globally in the healthy human 180, mouse 181 and rat genomes 182. It is, however, the specific disease-associated CNV loci that have been particularly scrutinized, and therefore provide the most detailed examples of how CNVs can alter cellular function.

1.3.2 Pathogenic CNVs Often Contain Multiple Genes

Genomic rearrangements give rise to a variety of diseases classified as “genomic disorders” 183. Because they involve large CNVs, it is common for genomic disorders to include many deleted or duplicated genes, unlike traditional mutations that affect a single coding region change of one gene. These genes can be either fully encompassed or partially overlapped by the pathogenic CNV. Deletions of 22q11.2 are associated with DiGeorge/velocardiofacial syndrome and include the catechol-O-methyltransferase and T box transcription factor 1 genes. Similarly, the autosomal dominant Prader-Willi syndrome (15q11-q13 deletion) involves many genes and the Williams-Beuren syndrome (WBS; 7q11.23 deletion) involves 28 genes. As microarray resolution increases, genomic disorders will certainly be found that are caused by small CNVs involving only a single gene, or even a portion of one gene.

1.3.3 The Effect of a Pathogenic CNV is Not Limited to the Gene(s) it Contains

Usually, the genes contained in the pathogenic CNV are candidates for association with the clinical phenotype under study. However, research on genomic disorders has shown that some genes within a CNV may not be necessary, or may not be sufficient, to cause the observed disease. For example, a recurrent 3.7 Mb microdeletion at chromosomal locus 17p11.2 is responsible for 70% of cases of Smith-Magenis syndrome (SMS), a neurobehavioral disorder involving sleep disturbance, craniofacial and skeletal anomalies, and distinctive behavioral traits. While varying-sized deletions are observed, the identification of a common ‘critical region’ (1.5 Mb) in SMS patients led to the conclusion that retinoic acid induced 1 (RAI1) alone is responsible for most SMS features. Indeed, RAI1 point mutations have been seen in patients without deletions, thus confirming that this gene (of the 13 in the critical region) is necessary to cause SMS. Patients with additional genes deleted have a variable and more severe phenotype. On the other hand, in WBS, not only the deleted genes but also genes far outside the deleted region, have reduced expression and are thought to contribute to the phenotype 184. Such long-range influence of CNVs on distant gene expression is proposed to be caused by positional effects 185.
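The idea of a common 'critical region' can be illustrated with a short sketch: given deletions of varying size in several patients, the critical region is the segment that all of them share. The coordinates below are hypothetical and chosen only for illustration; they are not the published SMS breakpoints, and the function name is an assumption.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # (start, end) in base pairs, with start < end


def critical_region(deletions: List[Interval]) -> Optional[Interval]:
    """Return the segment shared by every deletion, or None if none exists."""
    if not deletions:
        return None
    # The shared segment starts at the right-most start
    # and ends at the left-most end.
    start = max(s for s, _ in deletions)
    end = min(e for _, e in deletions)
    return (start, end) if start < end else None


# Hypothetical deletions on one chromosome (illustrative values only).
patient_deletions = [
    (16_500_000, 20_200_000),
    (17_000_000, 19_000_000),
    (17_400_000, 20_500_000),
]

region = critical_region(patient_deletions)
if region is not None:
    print(f"critical region: {region[0]}-{region[1]} "
          f"({(region[1] - region[0]) / 1e6:.1f} Mb)")
```

Genes that lie inside the returned interval are the natural candidates for the shared phenotype, which is the reasoning that singled out RAI1 in SMS.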

1.3.4 Pathogenic CNVs Can Have Reciprocal Deletions/Duplications

Recombination between highly homologous sequences (non-allelic homologous recombination [NAHR]) can generate deletions, duplications, inversions and translocations. The sequence architecture that enables one copy number change can also allow for its reciprocal at the same locus, although usually causing varying phenotypes and occurring at different frequencies in the population and rates during 186. The molecular mechanisms that lead to CNVs, such as NAHR, are described in greater detail below.
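A toy model may help show why the same sequence architecture yields reciprocal products. The sketch below assumes two identical direct repeats flanking a unique segment and a single crossover between the misaligned upstream and downstream repeat copies; the sequences and the function name are hypothetical, and real NAHR involves far longer repeats with imperfect homology.

```python
def nahr_products(allele: str, repeat: str, offset: int = 0):
    """Simulate unequal crossover between two direct copies of `repeat`.

    `offset` is the crossover position within the repeat (0..len(repeat)).
    Returns (deletion_product, duplication_product).
    """
    first = allele.find(repeat)
    second = allele.find(repeat, first + len(repeat))
    if first < 0 or second < 0:
        raise ValueError("allele must contain two copies of the repeat")
    if not 0 <= offset <= len(repeat):
        raise ValueError("offset must fall within the repeat")
    # Chromatid A up to the upstream repeat joined to chromatid B
    # from the downstream repeat: the intervening segment is lost.
    deletion = allele[:first + offset] + allele[second + offset:]
    # The reciprocal product carries the intervening segment twice.
    duplication = allele[:second + offset] + allele[first + offset:]
    return deletion, duplication


# Hypothetical toy sequences.
left, repeat, unique, right = "AAAA", "GATTACA", "TTTTTT", "CCCC"
parental = left + repeat + unique + repeat + right

deletion, duplication = nahr_products(parental, repeat, offset=3)
print(deletion)     # left + repeat + right
print(duplication)  # left + repeat + unique + repeat + unique + repeat + right
```

In this model the deletion and the duplication always arise together from one exchange, so the two products account for exactly the material of the two parental chromatids.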

1.3.5 CNVs and Cancer Predisposition: First Hits to the Tumor Genome

The goal of cancer genetics is to discover all variants that predispose to neoplasms. To this end, single nucleotide polymorphisms (SNPs) have been the most widely studied form of genetic variation and, by using massive whole-genome studies (genome-wide association [GWA]), many common SNPs have been associated with cancer and other complex traits. However, the results of these efforts have not explained much of the heritability of disease 187. This is perhaps because GWA studies have largely ignored the inter-individual genetic variation provided by CNVs, which affect more than 10% of the human genome. CNVs, especially smaller variants, have been essentially hidden from view until recently. Thus only a handful of studies associate CNVs with cancer. Once they are more fully catalogued, CNVs can be expected to explain a larger portion of the genetic basis of cancer.

1.3.6 Common Cancer CNVs

As with SNPs, CNVs that are found frequently in the healthy population (common CNVs) surely play a role in cancer etiology. In the only study published to date that begins to test the hypothesis that common CNVs are associated with malignancy, we created a map of CNVs whose locus coincides with that of bona fide cancer-related genes (cancer CNVs) 188. In an initial analysis189, we examined 770 healthy genomes using the Affymetrix 500K array set, which has an average inter-probe distance of 5.8 kb. As CNVs are generally thought to be depleted in gene regions 190, it was surprising to find 49 cancer genes that were directly encompassed or overlapped by a CNV in more than one person in a large reference population (Figure 5). Among the top 10 genes, cancer CNVs could be found in four or more persons. In this analysis only CNVs directly overlapping a cancer gene were selected (i.e. either both breakpoints were inside the genomic interval containing the gene, both were outside the interval, flanking the gene, or one breakpoint was inside while the other was outside). However, this is likely an underappreciation of the actual number of common cancer CNVs. First, many smaller variants are missed at the resolution of this array: the mean size of CNVs found using the Affymetrix 500K array is 206 kb 190, while CNVs found using the newer Affymetrix 6.0 platform, which has a median inter-marker distance of less than 700 bases, are 5–15 times smaller 191. Second, as discussed above, there are unquestionably additional, more distal, CNVs that have a long-range effect on cancer gene transcriptional levels. Validating the initial observation, many of these genes are also found in the Database of Genomic Variants (DGV), a curated list of CNVs compiled from numerous publications 177. Analysis of the DGV shows that nearly 40% of cancer-related genes 192 are interrupted by a CNV. This trend continues: even amongst the 10 most current CNV publications in the DGV (those published after February 2008), one notes many important tumor suppressor genes and oncogenes with diverse functions, including apoptosis, control of cell cycle checkpoints and DNA repair, as well as numerous translocation and fusion gene partners. One example is Rad51L1, a member of the RAD51 family that is essential for DNA repair by homologous recombination and has been shown by GWA study to contain a SNP that is strongly associated with breast cancer 193.
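The selection rule described above (both breakpoints inside the gene interval, both outside it, or one inside and one outside) can be written as a small classifier. This is a minimal sketch under the assumption of closed, same-chromosome coordinates; the function name, labels and example coordinates are hypothetical and do not represent the pipeline actually used in the study.

```python
def classify_cnv_gene_overlap(cnv_start: int, cnv_end: int,
                              gene_start: int, gene_end: int) -> str:
    """Classify how a CNV relates to a gene interval on the same chromosome."""
    # No shared bases: the CNV lies entirely to one side of the gene.
    if cnv_end < gene_start or cnv_start > gene_end:
        return "no overlap"
    start_inside = gene_start <= cnv_start <= gene_end
    end_inside = gene_start <= cnv_end <= gene_end
    if start_inside and end_inside:
        return "both breakpoints inside gene"
    if not start_inside and not end_inside:
        return "both breakpoints outside gene (gene encompassed)"
    return "one breakpoint inside gene (partial overlap)"


# Hypothetical gene spanning positions 1,000-2,000.
gene = (1_000, 2_000)
for cnv in [(1_200, 1_500), (500, 2_500), (500, 1_500), (100, 900)]:
    print(cnv, "->", classify_cnv_gene_overlap(*cnv, *gene))
```

Only the first three categories would count as a cancer CNV under the rule in the text; distal CNVs with long-range effects fall into "no overlap" and are therefore missed by this definition.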

Figure 5 | Distribution of common cancer CNVs in the human genome. The chromosomes containing common cancer CNVs in the human genome are shown with centromeric regions in red and Giemsa banding patterns in white, grey or black. Loci are highlighted in green if they were found to contain a cancer-related gene that is overlapped or encompassed by a CNV.

The challenge will be to determine which of these genes are dosage-sensitive and which tissues containing these common cancer CNVs will be susceptible to malignant transformation and growth. One approach is to characterize specific cancer CNVs in great detail, both in terms of population frequency and breakpoint sequence 194. For example, in a pilot candidate-gene association study, we found a cancer CNV at the gene MLLT4 (a Ras target that regulates cell-cell adhesion) that appears to be associated with Li-Fraumeni syndrome (LFS), a cancer predisposition disorder in which affected individuals harbor a germline heterozygous mutation of the TP53 tumor suppressor gene 189. The frequency of this CNV is significantly increased in LFS (p=0.006, Fisher’s exact test): three of the 19 LFS probands (15.8%; observed/expected: 3/0.4 = 7.5) harbored the CNV duplication, while only 12 of 710 healthy individuals from the reference population (1.69%; observed/expected: 12/14.6 = 0.82) harbored the CNV. A nice illustration of a focal CNV with phenotypic effect can be found in the mitochondrial tumor suppressor gene (Mtus1); Frank et al. found that a small deletion in Mtus1 is associated with a decreased risk for familial and high-risk breast cancer 195. Using long-range PCR, we independently fine-mapped this common cancer CNV and genotyped it in a panel of healthy controls. While only 1.1 kb in size, the deletion removes an entire exon in Mtus1. Direct sequencing reveals a 41 bp (aaataagaaccaagtccaaatacatctttggaatgaaagag) stretch of homology flanking the exon, which suggests the formation of this deletion by NAHR (Figure 6). These examples demonstrate hypothesis-driven approaches and are restricted to genes for which there is an a priori association with cancer. Ultimately, it will be important to be able to discover and test every CNV in a genome for cancer susceptibility. However, while this hypothesis-free approach is becoming technically tractable and more economical, such studies do have unique analytical challenges 196,197.
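The MLLT4 comparison above can be rechecked from the counts given in the text (3 of 19 LFS probands versus 12 of 710 controls). The sketch below is a plain hypergeometric implementation of a two-sided Fisher's exact test together with the expected counts under no association; it is an illustration written for this passage, not the software used in the original analysis, and the function name is an assumption.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x carriers in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


# Counts from the text: rows are LFS probands and controls,
# columns are CNV carriers and non-carriers.
lfs_cnv, lfs_total = 3, 19
ctl_cnv, ctl_total = 12, 710

carriers = lfs_cnv + ctl_cnv
everyone = lfs_total + ctl_total
expected_lfs = lfs_total * carriers / everyone   # about 0.4
expected_ctl = ctl_total * carriers / everyone   # about 14.6

p_value = fisher_exact_two_sided(lfs_cnv, lfs_total - lfs_cnv,
                                 ctl_cnv, ctl_total - ctl_cnv)
print(f"expected LFS carriers: {expected_lfs:.2f}")
print(f"expected control carriers: {expected_ctl:.2f}")
print(f"Fisher's exact p = {p_value:.4f}")
```

The expected counts are the row totals multiplied by the overall carrier frequency (15 of 729), which is where the 0.4 and 14.6 in the observed/expected ratios come from; the p-value should come out close to the reported 0.006.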

Figure 6 | Cancer CNV breakpoint mapping. A 1.1 kb deletion in the mitochondrial tumor suppressor gene, MTUS1, was mapped to basepair (bp) resolution. The affected portion of the gene is shown, including an exon (blue square) that is deleted in the presence of the CNV. Two 41 bp repeats were found at the breakpoints (red squares). Genomic interval shown: chr8:17,624,500-17,626,000 (scale bar, 1 kb).

Using such an approach, the Wellcome Trust Case Control Consortium recently performed a genome-wide association study (GWAS) of CNVs in 16,000 cases and 3,000 controls in eight diseases: bipolar disorder, breast cancer, coronary artery disease, Crohn’s disease, hypertension, rheumatoid arthritis, type 1 diabetes and type 2 diabetes198. Many of these diseases had been studied by a SNP-based GWAS by the same group199 and, in fact, 80% of the DNA samples overlapped between the two studies. Therefore the relative contribution of these two forms of genetic variation to eight diseases could be compared. For this study, which is the largest CNV GWAS published to date, the authors designed a CNV genotyping array. The Agilent CGH platform was selected for this purpose and probes were designed to cover common CNVs. The vast majority of the CNVs (83.62%) were initially ascertained by the Genome Structural Variation Consortium (as described above). While some candidate genes were also targeted (for discovery of rare CNVs), the study’s array platform, analysis and results overwhelmingly describe the potential association of common CNVs with these diseases. Sixteen pre-processing pipelines were carried out and two algorithms were used to determine the number of diploid copy number classes at each CNV and to assign individuals to these classes. Of the 10,894 common CNVs typed by the array’s ~100,000 probes, 60% were excluded from the analysis as they could not be assigned to different copy number classes, half of which are likely not polymorphic. Finally, only 3,432 CNVs were tested for association using a previously published method200, representing 42-50% of common (>5% minor allele frequency) CNVs greater than 0.5 kb in the European population. At least two types of artifacts were found that lead to spurious signals: dispersed duplications and differences in DNA source (saliva vs. cell-line). Three CNV loci were found: HLA for Crohn’s disease, rheumatoid arthritis and type 1 diabetes; IRGM for Crohn’s disease (validating a known association201); TSPAN8 for type 2 diabetes; and a CNV near an inversion on . All of these passed replication while 13 other loci did not. Notably, no CNVs were found to be associated with breast cancer.

One important study demonstrated the value of identifying CNVs using large cancer cohorts. In work that is a prelude to future CNV GWA studies, Diskin and colleagues describe the association between a common CNV and neuroblastoma, a pediatric cancer of the sympathetic nervous system 202. The CNV, a hemizygous deletion at chromosome 1q21.1, was found to be significant beyond the stringent genome-wide threshold level (P = 1.0 × 10⁻⁷) and was validated in two replication sets (363 Caucasian cases vs. 1,139 controls and 232 Caucasian cases vs. 2,218 controls). The neuroblastoma-related CNV was first examined using the Illumina HumanHap550 BeadChip, a 550,000 SNP platform, and was found to span a 1.6 Mb region containing a number of neuroblastoma breakpoint family (NBPF) genes 203. The NBPF family is so named because one of its members, NBPF1, is disrupted by a constitutional translocation in a neuroblastoma patient (t(1;17)(p36;q12-21)) 204. When a higher resolution platform, the Illumina HumanHap610, was applied to examine the same area on chromosome 1, the minimally deleted region was in fact reduced to a smaller 121 kb segment that did not appear to contain any genes. This is consistent with reports showing that earlier platforms can overestimate the size of CNVs 191,194. Interestingly, when this smaller region was further scrutinized it was found to harbor a spliced EST from a melanoma library that is highly homologous to at least three known NBPF genes, and thus constitutes a novel member of this gene family.

The NBPF gene family is interesting in that its members are recently evolved and show both intragenic and intergenic homology. The region is rife with low-copy repeats and segmental duplications and it is therefore no surprise that 1q21.1 includes many other common CNVs, as does the locus containing NBPF1 (1p36). Interestingly, both common and rare CNVs at 1q21.1 can predispose to numerous diseases, including schizophrenia 205,206, a range of pediatric impairments 207 and even Tetralogy of Fallot 208. It would seem that the architectural complexity of this region has rendered it fragile, prone to illegitimate recombination and therefore the linchpin in a number of disease processes. As research progresses, it will be intriguing to see if these phenotypes are mutually exclusive (due to, for example, strictly non-overlapping breakpoints), or if a patient’s seemingly unrelated diseases (for example, neuroblastoma in childhood and schizophrenia later in life) can be attributed to a single large CNV.

1.3.7 Rare Cancer CNVs

Common cancer SNPs, and by analogy common cancer CNVs, each confer only a minor increase in disease risk, while collectively they may cause a substantially elevated risk. In contrast, the mutations associated with hereditary cancer syndromes are frequently highly penetrant on their own and are usually inherited in an autosomal dominant fashion. Unlike low-penetrant alleles, rare high-penetrant mutations will almost always co-segregate with the disease in families.

There are over 200 cancer syndromes and, although most arise infrequently, they account for at least 5–10% of all cancer cases 209. These are caused by base-pair-sized germline mutations in many central tumor suppressor genes and (fewer) oncogenes including TP53, APC, BRCA1, BRCA2, PTEN, RB1 and HRAS.

The role of large structural mutations in cancer syndromes has been less appreciated, probably because genomic deletions or duplications are not readily detected by PCR-based sequencing. New multiplexing methods, especially multiplex ligation-dependent probe amplification (MLPA 210), allow for targeted copy number assessment of single gene/exon changes. This has led to a recent upsurge in discoveries of patients and families with rare pathogenic CNVs that strongly predispose to cancer. Of the 70 germline cancer genes in the Cancer Gene Census 211, 28 have been reported to be mutated by genomic deletion or duplication (the genes and citations are shown in Table 1). We hypothesize that many of the remaining gene mutations will be found to have a genomic equivalent and, perhaps more importantly, that predisposing CNVs will be found in other regions not usually associated with hereditary cancer. A recent report by Jackson et al. describing five patients with rhabdoid predisposition syndrome and deletions at SMARCB1 (22q11.2) highlights the benefits of a global approach to CNV detection: using SNP arrays to gain a broad perspective on the SMARCB1 deletion and surrounding chromosomal landscape, it was found that two patients’ deletions in fact extended past SMARCB1, impinging on neighboring genes and explaining their clinical phenotype 212.

The presence of rare cancer CNVs leads to many questions: Do they differ from basepair changes at the same locus? What is their penetrance? What are the mutational processes that give rise to them? Do they have reciprocal deletions/duplications? Do they have long-range effects on gene expression?


Table 1 - Rare cancer CNVs at known cancer-predisposing genes

Gene | Cancer syndrome | References on genomic deletions or duplications
APC | Adenomatous polyposis coli; Turcot syndrome | Hodgson et al (1993)213, Su et al (2000)214, Aretz et al (2005)215 and Charames et al (2008)216
BMPR1A | Juvenile polyposis | Delnatte et al (2006)217
BRCA1 | Hereditary breast/ovarian cancer | Petrij-Bosch et al (1997)218 and Montagna et al (2003)219
BRCA2 | Hereditary breast/ovarian cancer | Casilli et al (2006)220
CDKN2A - p14ARF | Familial malignant melanoma | Lesueur et al (2008)221
CDKN2A - p16(INK4a) | Familial malignant melanoma | Lesueur et al (2008)221
CHEK2 | Familial breast cancer | Cybulski et al (2006 and 2007)222,223
FANCA | Fanconi anaemia A | Levran et al (2005)224
MADH4 | Juvenile polyposis | Van Hattem et al (2008)225
MEN1 | Multiple Endocrine Neoplasia Type 1 | Kishi et al (1998)226
MLH1 | Hereditary non-polyposis colorectal cancer, Turcot syndrome | Nystrom-Lahti et al (1995)227, Chan et al (2001)228
MSH2 | Hereditary non-polyposis colorectal cancer | Stella et al (2007)229
MSH6 | Hereditary non-polyposis colorectal cancer | Plaschke et al (2003)230
NF1 | Neurofibromatosis type 1 | Riva et al (2000)231, Bausch et al (2006)232
NF2 | Neurofibromatosis type 2 | Tsilchorozidou et al (2004)233
PRKAR1A | | Horvath et al (2008)234
PTCH | Nevoid Basal Cell Carcinoma Syndrome | Shimkets et al (1996)235
RB1 | Familial retinoblastoma | Bremner et al (1997)236
SDHB | Familial paraganglioma | Cascon et al (2006)237
SDHC | Familial paraganglioma | Baysal et al (2004)238
SDHD | Familial paraganglioma | McWhinney et al (2004)239
SMARCB1 | Rhabdoid predisposition syndrome | Swensen et al (2009)240
STK11 | Peutz-Jeghers syndrome | Le Meur et al (2004)241
TP53 | Li-Fraumeni syndrome | Bougeard et al (2003 and 2008)242,243
TSC1 | Tuberous sclerosis 1 | Kozlowski et al (2007)244
TSC2 | Tuberous sclerosis 2 | Kozlowski et al (2007)244
VHL | von Hippel-Lindau syndrome | Richards et al (1993)245
WT1 | Denys-Drash syndrome, Frasier syndrome, Familial Wilms tumor | Huff et al (1991)246

1.3.8 Mutational Mechanisms Leading to CNVs

The generation of structural chromosomal changes, such as CNVs, is intimately tied to the processes of DNA repair, whether initiated by timed or untimed DNA breaks. Specifically, in the attempted repair of DNA breaks, separate chromosomal segments are fused by rearrangements involving deletions, duplications or inversions247. The processes of repair leading to changes in copy number can be divided into those that are homology-mediated and those that are not homology-mediated.

Homologous recombination (HR) is the process of genetic exchange of material between homologous chromosomes that is used to repair DNA double strand breaks. The breaks can be induced by exogenous factors or be self-inflicted. During meiosis, the process by which sexual organisms maintain genome size, the Spo11 protein initiates HR by creating an intentional DNA double strand break248. Meiotic HR generates diversity in the species by enabling the exchange of DNA between maternal and paternal chromosomes. Mitotic HR is functionally similar, but is initiated by the DNA double strand breaks caused during normal metabolism or by external cellular insult. The ends at the breaks are resected by nucleases, exposing a 3’ single stranded overhang (tail). A suitable homologous donor sequence is then located. Meiotic cells have a preference for interhomologue recombination rather than inter-sister recombination, while in mitotic cells the opposite is true. Regardless of the source of the homologous donor DNA, the next step of HR involves strand exchange, in which single stranded DNA invades homologous duplex DNA, thus displacing a strand and creating a loop (D-loop). Strand exchange is catalyzed by RAD51, which is used in both meiotic and mitotic HR, or DMC1, which is a meiosis-specific DNA recombinase. DNA synthesis then begins from the invading strand using the donor DNA as a template and the second end of the double strand break is “captured”, resulting in a double Holliday junction. The Holliday junctions, which are intermediate crossover points, can be resolved such that the two DNA molecules either have exchanged material (crossover) or have not (non-crossover).

HR is commonly a very robust mechanism and will generally repair breaks accurately. However, repetitive elements present in the genome can lead to ectopic HR in which recombination occurs between homologous segments that are at different chromosomal locations. Also known as non-allelic homologous recombination (NAHR), in genomic disorders this type of “slip-up” is usually associated with low-copy repeats (or segmental duplications).

Segmental duplications are highly identical DNA segments greater than 1 Kb in size that play an important role in generating evolutionarily novel genes and gene families. The majority of human segmental duplications map to ~400 regions, in which there are often numerous duplications juxtaposed249. Segmental duplication-mediated NAHR has led to a number of genomic disorders, including the 24 Kb duplications at 17p12 which lead to Charcot-Marie-Tooth disease and the LCR22 duplications near the DiGeorge syndrome region. In an elegant experimental approach, Turner et al. used a novel qPCR assay to directly measure the rates of deletions and duplications in the sperm and blood genomes of nine men across four NAHR hotspots186. The segmental duplications mediating these variously sized deletions were highly homologous (94-98.7%) and in all cases the rate of deletion was higher than that of the reciprocal duplication (>2-fold). Notably, the frequency of de novo CNV generation, as calculated by this and other analyses, is several orders of magnitude higher than that of point mutations247.

Other homology-mediated DNA repair processes include synthesis-dependent strand annealing and break-induced replication. Synthesis-dependent strand annealing is similar to the double-strand break model of HR discussed above except that, instead of the second DNA end being “captured” to form a double Holliday junction, the synthesized strand of DNA is displaced from the D loop and fuses with the other end of the double strand break. The break-induced replication repair mechanism operates on one-ended double-strand breaks, which occur during the collapse of DNA replication forks250. When the replicative helicase arrives at a nicked template strand, one arm of the fork breaks off, is resected to reveal a one-sided 3’ overhang, and invades a homologous strand to form a D loop, after which a new replication fork is formed and both the leading and lagging strands are synthesized. The sister chromatids are separated and the cycle of invasion, extension and separation is repeated until the replication fork is completely resolved. Like HR, break-induced replication can lead to CNV formation if the broken end uses a non-allelic region to restart the replication fork247.

Non-homology mediated pathways also repair DNA double strand breaks and can lead to CNVs and other structural alterations. Most prominent of these are non-homologous end joining (NHEJ) and various replication-based processes. NHEJ is used more frequently than HR in situations when a homolog is not present to act as a donor (e.g. cells not in S or G2)251. NHEJ is also thought to be less active during meiosis. NHEJ acts to convert a diverse array of DNA double strand breaks to a joined product. The process involves the action of a nuclease, DNA polymerase and ligase. Briefly, in vertebrates the Ku protein binds with DNA and forms a complex to which the nuclease, polymerase and ligase can dock. This docking happens with a fair amount of flexibility and can lead to a number of different junction products, with each side of the junction possibly having different nucleotides resected or added. When a small amount of homology at the DNA ends is present, the NHEJ process can use it to join the ends; however, this is not a requirement. This alternative NHEJ pathway is also called microhomology-mediated end joining and may depend on a different set of proteins. V(D)J recombination and class switch recombination are two processes in which somatic cells create intentional double strand breaks that are subsequently repaired by NHEJ. NHEJ has been shown to be involved in the genesis of a number of genomic disorders. For example, in the breakpoint characterization of 39 deletions at the gene, Toffolatti et al. found three cases in which a few nucleotides were inserted, five cases which had a short homology and three cases with short duplications252. These breakpoint features are consistent with error-prone NHEJ.

Short stretches of microhomology at CNV junctions have also been suggested to arise by DNA replication-based mechanisms. Support for a replicative mechanism that is independent of RecA (Rad51 in eukaryotes) originally came from studies in E. coli253. Replication “slippage”, or template switching, is one such model. During replication slippage, the presence of two repeat units can cause the lagging strand to move to another location within the Okazaki fragment (i.e. within the same replication fork). This mispriming can lead to a deletion or duplication of a short length of DNA. Another mechanism, termed FoSTeS (fork stalling and template switching), was proposed by Lee et al. to explain the genomic changes associated with the dysmyelinating disorder Pelizaeus-Merzbacher disease254. CNVs generated by FoSTeS are more likely to be complex, have microhomologies at their junctions, and be in close proximity to LCRs. Under the proposed model, when a replication fork stalls, the 3’ end of the lagging DNA strand can switch to a nearby active replication fork, with which it shares some microhomology, and continue copying DNA. The strand could disengage and switch multiple times before returning to the original template.

1.3.9 CNVs and Tumor Genomes

This discussion has thus far focused on CNVs and cancer predisposition, but similar high-resolution approaches have also propelled recent studies on acquired (somatic) copy number alterations (CNAs) in tumor DNA.

1.3.10 Genome-Scale Analyses Have Found Many Formerly Invisible Copy Number Alterations (CNA)

In an analysis of 371 lung adenocarcinoma samples using a 250,000 probe array, Weir et al.255 identified seven recurrent homozygous deletions and 24 recurrent amplifications. Of note, the most significant amplification, at chromosome region 14q13.3 and containing the novel oncogene NKX2-1, had been found in previous studies; yet, because of insufficient resolution and sample size, the target gene it contained had not been identified. Using a yet denser array, Mullighan and colleagues256 profiled the DNA copy number changes of 242 pediatric acute lymphoblastic leukemias (ALL), including 192 B-progenitor leukemias (B-ALL) and 50 T-lineage leukemias (T-ALL). Global differences between the subtypes’ genomes and recurrent abnormalities at specific loci were identified. An average of six CNAs were found per leukemia genome, but significant differences in the number of CNAs were found within the B-ALL group and between the B-ALL and T-ALL subtypes. Intriguingly, in 30% of B-ALL patients, the authors detected deletions of PAX5, a transcription factor that is expressed during early stages of B-cell development. Using CNA analysis to pinpoint critical genes can also help to plan subsequent sequencing efforts. For example, after deletions at PAX5 had been identified, an additional 14 patients were found to have point mutations in the same gene.

1.3.11 CNAs can be Integrated With Other Global Analyses to Define the Key Pathways of a Tumor

In glioblastoma, CNA information, mRNA expression levels, methylation changes and nucleotide mutational analysis have been generated257. Integrative analysis has shown that >70% of tumors carry alterations in the RB, P53 and RTK pathways. While cancer is driven primarily by alterations of the genome, this study and others have shown that CNA profiles can be combined with other high-throughput data to create insights that are ‘greater than the sum of their parts’.

1.4 Rationale

Li-Fraumeni syndrome is a devastating autosomal dominant, inherited, cancer predisposition syndrome associated with germline TP53 mutations. Genetic counseling of LFS families is complicated by the vast clinical and genetic heterogeneity of the syndrome. The genetic underpinnings for this heterogeneity are nearly unknown. Moreover, the molecular changes that give rise to specific LFS-associated tumors, such as choroid plexus carcinoma, are unidentified.

DNA copy number variations are an important component of genetic variation, affecting a greater fraction of the genome than single nucleotide polymorphisms. The advent of high-resolution microarrays has made it possible to identify CNVs with greater precision. The role of CNVs as risk factors for cancer has been under-appreciated. We proposed that this form of genetic variation is particularly intriguing to study in cancer, a disease that often features genomic instability and structural dynamism. Mutant TP53 is often the harbinger of these structural changes to the tumor genome.

As described in the above introduction, there exists a rich literature on TP53, detailed studies of Li-Fraumeni syndrome families, and recent ascertainment of CNVs. Given this foundational work, we proposed to use detailed genomic analyses to study the role of CNVs in Li-Fraumeni syndrome as a model for a more generalized association between CNVs and cancer susceptibility.

Chapter 2

2 Excessive genomic DNA copy number variation in the Li-Fraumeni cancer predisposition syndrome

This chapter has been published and is reproduced with permission from the Proceedings of the National Academy of Sciences of the United States of America.

Shlien A, Tabori U, Marshall CR, Pienkowska M, Feuk L, Novokmet A, Nanda S, Druker H, Scherer SW, Malkin D. Excessive genomic DNA copy number variation in the Li-Fraumeni cancer predisposition syndrome. Proc Natl Acad Sci U S A. 2008 Aug 12;105(32):11264-9. Epub 2008 Aug 6.


2.1 Abstract

DNA copy number variations (CNVs) are a significant and ubiquitous source of inherited human genetic variation. However, the importance of CNVs to cancer susceptibility and tumor progression has not yet been explored. Li-Fraumeni syndrome (LFS) is an autosomal dominantly inherited disorder characterized by a strikingly increased risk of early-onset breast cancer, sarcomas, brain tumors and other neoplasms in individuals harboring germline TP53 mutations. Known genetic determinants of LFS do not fully explain the variable clinical phenotype in affected family members. As part of a wider study of CNVs and cancer, we conducted a genome-wide profile of germline CNVs in LFS families. Here, by examining DNA from a large healthy population and an LFS cohort using high-density oligonucleotide arrays, we show that the number of CNVs per genome is well-conserved in the healthy population, but remarkably enriched in these cancer-prone individuals. We found a highly significant increase in CNVs among carriers of germline TP53 mutations with a familial cancer history. Further, we identified a remarkable number of genomic regions in which known cancer-related genes coincide with CNVs, in both LFS families and healthy individuals. Germline CNVs may provide a foundation which enables the more dramatic chromosomal changes characteristic of TP53-related tumors to be established. Our results suggest that screening families predisposed to cancer for CNVs may identify individuals with an abnormally high number of these events.

2.2 Introduction

LFS is a clinically and genetically heterogeneous familial cancer syndrome associated with a diverse spectrum of germline TP53 mutations49,258. In contrast to other familial cancer syndromes, LFS-affected families display a wide array of tumors, including sarcomas of the bone and soft tissue, carcinomas of the breast and adrenal cortex, brain tumors and acute leukemias, among others. The spectrum of reported germline TP53 mutations is equally diverse and this has complicated efforts to derive a clear genotype-phenotype model for the syndrome. Indeed, even in the same LFS family, affected individuals sharing an identical germline TP53 mutation develop tumors of varying severity, at different anatomical sites and at different ages258. This heterogeneity is thought to be due in part to additional germline genetic variations present within and among LFS families. With this in mind, we undertook a genome-wide characterization study of the constitutional genetic variation of LFS family members.

A CNV is a segment of DNA 1 kb or larger that is present in variable copy number in the genomes of humans, primates and potentially many other species190,259. Despite efficient repair machinery, CNVs still occur 100 to 10,000 times more frequently than point mutations in the human genome260. While the precise mechanisms that give rise to most human CNVs are not known, nonallelic homologous recombination (NAHR) and nonhomologous end joining (NHEJ) are thought to be involved183. A first-generation map of CNVs in the human genome was recently completed, revealing 1,447 variable regions in 270 individuals from the HapMap collection190. Knowledge of the frequency of CNVs per population is necessary for the characterization of rare disease-associated regions, while knowledge of the baseline number of CNVs per person will aid in identifying individuals with particularly unstable genomes.

The importance of acquired chromosomal changes in tumorigenesis has been established; for example, amplification of the MYCN oncogene and deletions of chromosome 1p are major prognostic indicators in neuroblastoma261. Higher resolution analyses have recently provided clues into the etiology of lung adenocarcinoma and acute lymphoblastic leukaemia255,256, and exciting new data from genome-wide association studies have implicated specific single nucleotide polymorphisms in susceptibility to many diseases, including prostate and breast cancer262-266. However, the role of constitutional CNVs in cancer predisposition has not yet been explored. We set out to study the frequency of CNVs per person in apparently healthy individuals and in the LFS cancer-prone population. To our knowledge, this is the first reported genome-wide study of CNVs and genetic susceptibility to cancer.

2.3 Results

A cohort of individuals including 500 of European descent and the multiethnic 270 person HapMap collection has been previously assembled and used for studies of copy number variation190,267. We used this cohort to establish whether a baseline CNV frequency existed in a healthy population. In our independent analysis, we identified 3,884 CNVs in genomic DNA from these 770 reportedly healthy individuals using Affymetrix GeneChip 250K Nsp microarrays. The European cohort was analyzed on blood-derived DNA and the HapMap cohort on lymphoblastoid cell line derived DNA. Samples were grouped by microarray facility and normalized against members of their group to reduce batch effects. CNVs were then determined using dChip268. To minimize false positives, we only counted CNVs on autosomal chromosomes comprised of 2 or more underlying single nucleotide polymorphism (SNP) probes.
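The counting rule described above (autosomal CNVs supported by 2 or more SNP probes, tallied per genome) can be illustrated with a minimal sketch. It does not reproduce the dChip workflow; the `CnvCall` record, its field names and the sample identifiers are hypothetical and were chosen only for illustration.

```python
from collections import Counter
from typing import NamedTuple


class CnvCall(NamedTuple):
    """One CNV call. All field names are hypothetical, for illustration only."""
    sample: str    # individual in whom the CNV was called
    chrom: str     # chromosome label, e.g. "1" .. "22", "X", "Y"
    start: int     # start coordinate (bp)
    end: int       # end coordinate (bp)
    n_probes: int  # number of SNP probes underlying the call


AUTOSOMES = {str(i) for i in range(1, 23)}


def filter_calls(calls, min_probes=2):
    """Keep only autosomal calls supported by at least `min_probes` probes."""
    return [c for c in calls if c.chrom in AUTOSOMES and c.n_probes >= min_probes]


def cnvs_per_genome(calls, samples):
    """Count retained CNVs per individual; individuals with no calls get 0."""
    counts = Counter(c.sample for c in calls)
    return {s: counts.get(s, 0) for s in samples}


# Hypothetical input: three raw calls in two individuals.
raw_calls = [
    CnvCall("S1", "10", 46_000_000, 46_400_000, 5),
    CnvCall("S1", "X", 1_000_000, 1_050_000, 4),    # dropped: not autosomal
    CnvCall("S2", "6", 95_000_000, 95_020_000, 1),  # dropped: single probe
]
per_genome = cnvs_per_genome(filter_calls(raw_calls), ["S1", "S2"])
```

Counting individuals with zero retained calls explicitly matters here, because omitting them would inflate the median number of CNVs per genome.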

Many CNVs were found in single individuals while others, such as the CNV at chromosome 10q11.22 identified in 63 persons, were found in numerous individuals, demonstrating the variability of the CNV population frequency. In contrast, the frequency of CNVs per genome appears to be highly conserved: the median number of CNVs detected per person was 3, with 75% of the population having 4 or fewer CNVs (Fig. 1A). Moreover, CNV frequency appeared to be independent of ethnicity, as a separate analysis of the Yorubans, Chinese, Japanese and individuals of European descent revealed a similar result (Supplementary Fig. 1). Despite conserved CNV frequencies, the varying size of these deletions and duplications could still result in individuals with different amounts of copy number-variable DNA. To investigate this possibility we created a simple metric, termed total structural variation, defined as the CNV frequency multiplied by the individual’s average CNV size (in bp). The median total structural variation showed a similar degree of conservation and was calculated to be 395 kb, with 75% of the healthy population having 1.1 Mb or less copy variable DNA (Supplementary Fig. 1). Therefore, this is the first analysis establishing a baseline CNV frequency in the general population.
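As a small worked illustration of the total structural variation metric defined above (CNV frequency multiplied by the individual's average CNV size), the sketch below computes it for one individual. The CNV sizes are hypothetical; note that the product of count and mean is simply the summed length of that individual's CNVs.

```python
from statistics import mean


def total_structural_variation(cnv_sizes_bp):
    """CNV count multiplied by mean CNV size (bp) for one individual.

    Algebraically this equals the sum of the CNV sizes; it is written as
    count * mean here only to mirror the definition in the text.
    """
    if not cnv_sizes_bp:
        return 0
    return len(cnv_sizes_bp) * mean(cnv_sizes_bp)


# Hypothetical individual with two CNVs of 100 kb and 300 kb.
tsv_bp = total_structural_variation([100_000, 300_000])
```

Under this definition an individual with few CNVs can still score highly if one of them is very large, which is the situation the metric was designed to expose.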

Having established the distribution and frequency of CNVs in a large reference population, we studied deviations from the global norm in 11 well-characterized cancer predisposed LFS families. Inherited TP53 mutations were observed in 9 families and de novo TP53 mutations in the other two (Supplementary Table 1A). Forty-five family members were evaluated. Eight additional unrelated TP53 mutation carriers were included for whom DNA samples were unavailable from other family members (Supplementary Table 1B). Of these 53 individuals, 33 were TP53 mutation carriers and 20 harbored wild type TP53. In addition, 70 unrelated healthy controls were evaluated for CNVs. Both Affymetrix GeneChip 250K Nsp and Sty microarrays were utilized for all analyses, and validation was performed using two additional CNV detecting algorithms and quantitative PCR (qPCR) (Supplementary Methods).

Similar to the large reference population, our controls displayed a median of 2 CNVs per genome, with 75% of the population having 4 or fewer CNVs (mean = 2.93). Additionally, we saw no significant difference in CNV frequency between controls and the TP53 wild type group (median = 2, 75th percentile = 3, mean = 3.4). In contrast, the TP53 mutation carriers displayed a significant increase in CNVs (p=0.01): this cancer-prone group displayed a mean of 12.19 CNVs per genome with 75% having 10 or fewer CNVs (median = 3, Fig. 1B). Of the 33 carriers, 17 exhibited more alterations than the baseline (>3). Remarkably, every LFS family with an inherited TP53 mutation, except one, contained individuals with CNV counts above the global norm of 3. The majority of specific CNVs in LFS family trios were acquired and not found in either parent (on average twice as common as inherited CNVs) and, among families with a history of cancer, offspring were significantly more likely to have an increase in CNVs when compared to their mutation carrier parent (p=0.015 by Fisher’s exact test, Observed/Expected ratios: 2.0 for carriers and 0.0 for their wild type siblings).
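For readers unfamiliar with Fisher’s exact test as used in the comparison above, the following is a minimal standard-library sketch of the two-sided test on a 2x2 table. The underlying counts are not given in this section, so the example table is hypothetical and is not expected to reproduce the reported p-value.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins whose probability does not exceed that of the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Probability that the top-left cell equals k, given fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    total = sum(p for p in (prob(k) for k in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical table: rows = carrier / wild type offspring,
# columns = CNV count increased / not increased relative to the parent.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

An exact test is a natural choice for family-based comparisons such as this one, where cell counts are too small for a chi-squared approximation to be reliable.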

Eight of the 11 families studied had histories of cancer. The only families that did not have high CNV frequencies were those that did not have a family history of cancer (3/11 families). Of these, two had a single affected proband with a de novo TP53 mutation (Tyr163Cys and His193Pro). The other family had a single affected child who harbored an extremely rare paternally inherited TP53 mutation (Phe134Tyr).


Figure 1A


Figure 1B


Figure 1C


Figure 1 | Increased CNV frequency in Li-Fraumeni syndrome. a, To establish a baseline CNV frequency (CNVs per genome), genomic DNA from a large healthy population (n=770) was assessed for CNVs using the Affymetrix Nsp SNP microarray. The distribution of CNV frequencies in the normal population is shown. Most individuals have few CNVs (median = 3). 75% of the healthy population have 4 or fewer CNVs. b, A significant increase in CNVs was observed in TP53 mutation carriers as compared to controls (p = 0.01). The TP53 wild type group displayed no significant increase in CNV frequency (p = 0.994). As shown, the mean CNV frequencies are 2.93, 3.40 and 12.19 CNVs per genome in the controls (n=70), TP53 wild type (n = 20) and TP53 mutation carriers (n = 31) groups, respectively. Error bars represent S.E.M. c, Bar graph of CNV frequency in controls (n=70), TP53 wild type individuals (n=20), TP53 mutation carriers unaffected by cancer (n=8) and TP53 mutation carriers affected by cancer (n=23). Both the unaffected and affected groups had significantly increased CNV frequencies as compared to controls (p= 0.009 and p=0.046, respectively). There is also an increase in CNVs in the affected group as compared to the unaffected TP53 mutation carriers, although it does not reach statistical significance because of the loss of power caused by subdividing the group into small cohorts. Error bars represent S.E.M.


Many of the TP53 mutation carriers also had higher total structural variation scores than TP53 wild-type individuals, as one would expect given their numerous CNVs. Less anticipated were individuals found to have few CNVs but high total structural variation scores, as a consequence of exceptionally large deletions or duplications. The most dramatic example found was a paternally inherited 6.1 Mb deletion encompassing 13% of chromosome 21 (21q21.1-q21.2) in an LFS family (shown below the pedigree of family 1 in Fig. 2 as a contiguous faintly-colored vertical bar). The deletion was confirmed by qPCR of DNA derived from blood or normal paraffin-embedded tissue in the absence of available blood (p<0.01 in all cases; Supplementary Fig. 3). Further, we examined the SNP genotypes in the same region and identified a 6 Mb stretch of homozygosity, which is as expected since the individual has only one allele at this locus. Despite the presence of a germline TP53 mutation (codon R273S), the hallmarks of the syndrome (strong family history, multiple early onset tumors) are conspicuously absent in the first generation. However, the second generation of the family prominently displays these hallmarks. The full presentation of the syndrome is therefore associated with an increase in copy number variable DNA, although in this instance from an apparently healthy individual (the father I-1). While it is possible that the accelerated clinical phenotype in the affected mutant TP53 carrier children (II-1 and II-2) and the presence of the 6.1 Mb deletion inherited from the TP53 wild type father may be coincidental, an alternative explanation is that the phenotype may have resulted from an additional genetic modifier effect conferred by the presence of this exceptionally large deletion. The confluence of these two genetic events, high total structural variation and a germline TP53 mutation, thus correlates with the increase in cancer incidence observed in the family.

Increased CNV frequency was found by comparing individuals at elevated risk for cancer to those at normal risk (TP53 mutation carriers versus TP53 wild type). We found no increase in cancer in those individuals that are TP53 wild type. Although nearly all mutant TP53 carriers will develop cancer in their lifetime269, we sought to determine whether CNV frequency may also explain the clinical variability within the TP53 mutant (at-risk) group. We examined the CNV frequency of TP53 mutation carriers affected by cancer separately from the unaffected carriers. The unaffected and affected groups each had significantly increased CNV frequencies as compared to controls (p = 0.009 and p = 0.046, respectively). Of particular interest is the even greater number of CNVs present in those affected by cancer, when compared to those who have not as yet developed cancer. Although not meeting the threshold for statistical significance because of the loss of power caused by splitting this group into small cohorts, this trend suggests a dose-response relationship between CNV frequency and severity of the LFS phenotype (Fig. 1C). Whether exposure to chemotherapy influences accumulation of germline structural alterations is not known. However, the fact that blood was drawn prior to starting therapy in almost all of the patients in this study, and the observation of increased germline CNVs even in those mutant TP53 carriers who are not yet affected with cancer, suggest that therapy does not contribute to accumulation of germline DNA structural variations (Fig. 1C).


Figure 2 | Inherited deletions and duplications in 4 LFS families. Three examples of CNVs found in LFS families are shown. The upper portion of figure 2 shows pedigrees for four LFS families and the lower portion shows the chromosomal size and relative microarray hybridization intensity of each CNV and the family member in whom that CNV was identified. It was not possible to evaluate all members in every pedigree. However, in each of the 4 families, an affected member, usually designated as the proband (arrow), harbored both the displayed CNV and a TP53 mutation. In these pedigrees open circles and squares indicate healthy females and males, respectively. Black circles and squares indicate females and males affected with cancer, respectively. Dotted circles or squares indicate TP53 mutation carriers who have not yet developed cancer. Oblique lines indicate the person is deceased. Arrows point to the proband in each family. The lower portion of the figure highlights chromosomal regions of interest undergoing copy number alteration in these four families. In copy number analysis, faint colouring indicates deletions while duplications are coloured more intensely. Each vertical column represents a single individual’s copy number for the region. Individuals of interest from the pedigree are numbered. a, Paternally inherited 6.1 Mb deletion on chromosome 21q21.1-q21.2 (13% of the chromosome), the largest deletion seen in 893 genomes assessed in this study. Of the three children, two inherited both the deletion and TP53 mutation (II-1 and II-2). Each developed two neoplasms, first diagnosed at ages 6 and 7. The remaining child (II-3) harbored neither the deletion nor the mutation and is unaffected. The mother (I-2), carrying only the TP53 mutation, developed a single tumor (fibrous histiocytoma) at age 27. This exceptional deletion is inherited from an as-yet unaffected individual and is associated with a worsening of the clinical phenotype between generations I and II: the second generation displays the hallmarks of the syndrome (multiple cancers in multiple offspring), while the first generation does not. From left to right: the copy number deletion (red); the SNP genotype calls (red, blue and yellow squares) and loss of heterozygosity (LOH) analysis (blue and yellow) in the same region. There is a concomitant region of homozygous genotype calls for individuals I-1 and II-1. For the same individuals, SNP genotypes are coloured red or blue if homozygous, yellow if heterozygous or white if not called. An extended region of LOH is shown in blue at the right. b, An inherited 240 Kb duplication at 6q27, overlapping the leukemia gene MLLT4, in four individuals from two LFS families: in b-i the proband transmitted the duplication of MLLT4 to her son (V-1, as-of-yet unaffected carrier) and in b-ii the proband inherited the same duplication of MLLT4 from his mother (II-2, affected and carrier), although this inheritance is presumptive since it could not be ascertained directly in the mother but was found in her sister (II-4, unaffected and non-carrier). The frequency of this CNV is significantly enriched in LFS probands (p=0.006; Fisher’s exact test; Supplementary Discussion). c, A 574 Kb duplication, overlapping the cancer-related gene ADAM12, inherited through 3 generations of an LFS family (Supplementary Discussion). The proband’s brother (II-2, affected) harbored the paternally inherited duplication at ADAM12 (I-1, as-of-yet unaffected carrier) which he transmitted to his daughter (III-1, unaffected and non-carrier).


We next examined the effect of germline CNVs on the development of somatic chromosomal alterations in paired tumor tissue (Fig. 3A). In a separate analysis, DNA was extracted from four frozen tumor samples, taken from individuals whose constitutional CNVs were known, and hybridized on the same platform. Choroid plexus tumors were selected since they frequently occur in the context of LFS. As expected, the tumor DNA contained many structural changes; however, we focused only on those changes that were previously found in normal tissue from the same person. Three of 4 tumors had loci where germline hemizygous deletions progressed into homozygous deletions in the tumor or where germline duplications became larger in the tumor. Fifteen of 21 loci overlapping germline CNVs became substantially larger (>50%) in paired tumors and in all cases the new somatic alteration was of the same orientation as the germline CNV (i.e. a deletion became a larger deletion in the tumor and an amplification, a yet larger amplification). Because the presence of gross tumor chromosome changes could artificially inflate the observed number of such events, we only selected regions undergoing discrete changes localized to the underlying CNV. This phenomenon was also validated by comparing SNP genotype homozygosity between blood and tumor at these loci (Supplementary Methods). One such CNV, a loss at 22q11.23, underwent an additional somatic deletion while the rest of the chromosome remained disomic (Fig. 3B). Paired blood-tumor analysis also revealed a deletion in the tumor sample, indicating that the deletion is located at the same locus and is deleted beyond that observed in the patient’s blood. qPCR confirmed a one copy loss in the germline as compared to a diploid reference, and at the same locus, a one copy loss in tumor DNA as compared to the germline (Fig. 3C). It therefore appears that germline CNVs can act as a basis for more dramatic tumor-specific changes.
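The criterion used above for a germline CNV that "progressed" in the paired tumor (an overlapping somatic alteration of the same orientation that is more than 50% larger) can be expressed as a short sketch. This is one interpretation of the text, not the published analysis; the `Segment` type, the length-based growth test and the coordinates are assumptions made for illustration.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """A copy number segment. Field names are hypothetical."""
    chrom: str
    start: int  # bp
    end: int    # bp
    kind: str   # "loss" or "gain"


def overlaps(a, b):
    """True if two segments share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def has_progressed(germline, tumor, min_growth=0.5):
    """True if the tumor segment overlaps the germline CNV, has the same
    orientation, and is more than `min_growth` (here 50%) longer.
    Measuring growth by segment length is an assumption of this sketch.
    """
    if not overlaps(germline, tumor) or germline.kind != tumor.kind:
        return False
    germline_len = germline.end - germline.start
    tumor_len = tumor.end - tumor.start
    return tumor_len > germline_len * (1 + min_growth)


# Hypothetical coordinates: a 70 kb germline loss and a 120 kb tumor loss.
germline_cnv = Segment("22", 22_600_000, 22_670_000, "loss")
tumor_cna = Segment("22", 22_580_000, 22_700_000, "loss")
progressed = has_progressed(germline_cnv, tumor_cna)
```

A length-based test cannot capture a hemizygous deletion becoming homozygous at unchanged boundaries, as at 22q11.23; that case would require comparing copy number state rather than segment size.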


Figure 3A and 3B


Figure 3C

Figure 3 | Progression of germline chromosomal alterations in paired tumor DNA.

Copy number alterations in blood DNA and in paired tumor DNA for two individuals (A and B) are shown. Germline CNVs are plotted across all autosomal chromosomes. Immediately below is the copy number of the tumor, which was biopsied or resected from the same person. a, A CNV region (deletion) at 6q16.1 is highlighted on chromosome 6 and enlarged at the right. An arrow points to the patient of interest and their deletion is indicated by a fainter colour. The neighboring column represents the patient’s sister who also has this CNV. Although the patient’s tumor genome displays a high level of instability, a deletion identical to the germline CNV was found in both the tumor biopsy and resection as shown below. b, In this person, a 70 kb hemizygous deletion in blood DNA at 22q11.23 is highlighted and displayed enlarged at the right. The same CNV was found to be further deleted in the patient’s paired tumor. That is, the remaining allele was lost, as indicated by the yet fainter colour, which represents a reduction in the array’s signal intensity in the same region. c, The copy number of genomic DNA at 22q11.23 was confirmed by qPCR to be both specific to the underlying CNV and complete. From left: a diploid control; patient blood DNA with a hemizygous deletion; tumor DNA from the same patient showing a further deletion. The differences in mean copy number between reference, blood and tumor DNA are highly significant (p<0.01). Error bars represent +/- 2 S.E.M. QPCR shows that the relative copy number in tumor DNA is 0.1, meaning that greater than 80% of the tumor specimen is homozygous for the deletion (zero copies). We can approximate that only the remaining 20% of cells have the germline hemizygous deletion (one copy).
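The 80%/20% estimate in the legend can be reproduced with a short calculation. The sketch below is illustrative only and rests on two assumptions that the legend does not state explicitly: that the qPCR value is scaled so a diploid control reads 1.0 (a hemizygous deletion therefore reads 0.5), and that the specimen is a mixture of only two cell populations, one with one copy and one with zero copies. The function name is hypothetical.

```python
def homozygous_fraction(relative_copy_number, diploid_value=1.0):
    """Estimate the fraction of cells with zero copies in a two-population mix.

    Assumption: the specimen contains only hemizygous cells (one copy) and
    homozygously deleted cells (zero copies), and the qPCR value is scaled
    so that a diploid control reads `diploid_value`.
    """
    hemizygous_value = diploid_value / 2.0  # one of two copies remains
    if not 0.0 <= relative_copy_number <= hemizygous_value:
        raise ValueError("value is outside the range of a 0-copy/1-copy mixture")
    # Only hemizygous cells contribute signal, so the measured value is
    # hemizygous_fraction * hemizygous_value.
    hemizygous_fraction = relative_copy_number / hemizygous_value
    return 1.0 - hemizygous_fraction


fraction = homozygous_fraction(0.1)
print(f"homozygous: {fraction:.0%}, hemizygous: {1 - fraction:.0%}")
```

Under these assumptions a relative copy number of 0.1 corresponds to about 80% homozygously deleted cells, in line with the figures quoted in the legend. Contaminating normal (diploid) cells would change the estimate and are not modelled here.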


DNA rearrangements, such as CNVs, can predispose to or cause disease when they encompass, overlap or disrupt dosage-sensitive genes183,174. CNVs can also unmask recessive mutations at dosage-insensitive loci174. We sought to determine which cancer-related genes fall within copy number variable regions. In both the large reference population and the LFS cohort, we observed copy number variability in cancer-related genes. We observed inherited duplications at MLLT4 and ADAM12 in LFS families (Fig. 2B-C and Supplementary Discussion). MLLT4 is a target of Ras and is fused with MLL in the common leukemia translocation t(6;11)(q27;q23), and the frequency of this CNV is significantly enriched in LFS probands (p=0.006; Fisher’s exact test; Supplementary Discussion). ADAM12 is a disintegrin-metalloproteinase whose dysregulation has been reported in brain, breast, liver, stomach and colon cancer. The contribution of these CNVs to tumor predisposition, initiation or progression will require further investigation (Supplementary Discussion).

2.4 Discussion

LFS is an ideal model for the discovery of genetic modifiers of cancer, and research on these rare families has had a disproportionately large impact on our understanding of cancer biology in general131,139,258,270. The primary reason for this is that defects of TP53, the most frequent genetic alteration in LFS, are the most commonly acquired genetic alteration in sporadic human cancer. Non-transformed fibroblasts and lymphocytes from TP53 mutation carriers display aberrant growth characteristics when passaged in culture, spontaneously acquire properties of a transformed and immortalized ‘tumor cell’, and ultimately display massive chromosomal aneuploidy152. While defective TP53 function is known to cause increased copy number variation and instability in tumors271-273, our observations in primary non-cultured lymphocytes of TP53 mutation carriers suggest a new model of carcinogenesis wherein the existence of excessive submicroscopic copy number alterations represents an early germline event that may inform the progressive changes required for neoplastic transformation (Figure 4). These subtle changes are likely the earliest manifestations of instability conferred by the constitutional TP53 mutation, which then progress in complexity into events that can be seen by conventional cytogenetic techniques. TP53, as guardian of the genome, actively suppresses cell cycle advance and DNA replication following dsDNA damage, and is involved in the very processes known to give rise to CNVs, including suppression of homologous recombination274. While the ubiquity and non-random distribution of CNVs in humans highlights genomic regions that are intrinsically unstable, their increased abundance in LFS TP53 mutation carriers can be explained by germline TP53 haploinsufficiency. Genomic instability is a feature of all cancers39 but it may be preferentially directed towards CNV regions that are hotspots for recombination. As we have observed, this can be accomplished in the tumor genome by expansion of the linear extent of the same allele or by loss of the opposite allele (Fig. 3B and 3C). Our observation that CNVs can act as the genetic foundation on which larger somatic chromosomal deletions and duplications develop in tumors suggests that CNVs are fertile ground for subsequently acquired changes in cancer cells. Thus, the sequence architecture of genomic regions which give rise to CNVs may facilitate both acquired constitutional and tumor-specific genetic changes.

In this study, mutation carriers found not to have high CNV frequencies were also those who did not have family histories of cancer, either because their cancers arose as a consequence of de novo TP53 mutations or because they carried a low-penetrance mutation. It therefore appears that CNV frequency, or another high-resolution measure of instability, may help to define the nature and severity of the germline TP53 mutations found in LFS families. It will be important in the future to determine the exact patterns of Mendelian inheritance of CNVs in a larger cohort of complete LFS families. It appears that LFS offspring have a greater CNV frequency because they not only inherit CNVs from their parents but also acquire de novo CNVs. Therefore, the total number of CNVs can be greater than in either parent. Because mutation carriers affected with cancer had more CNVs than unaffected carriers (while both groups harbored more CNVs than individuals carrying wild type TP53), it is tempting to speculate that CNV frequency might also help to categorize TP53 mutation carriers into “risk groups” and provide a more rational basis for screening and genetic counselling.

With respect to the role of CNVs in sporadic cancer, while similar genome-wide analyses are now routinely performed on tumor samples255,256, the frequent lack of matched constitutional DNA means that the germline contribution to the detected somatic alteration cannot be known. Our observation of a surprising number of genomic regions where cancer-related genes coincide with CNVs, in both LFS families and healthy individuals, suggests that germline CNVs can provide the foundation for somatic chromosomal changes. It also highlights the need for matched analyses in cancer studies and for the establishment of a baseline for structural variation in healthy human genomes.

Our data demonstrate that the CNV frequency is remarkably similar among healthy individuals, but significantly increased in individuals with germline TP53 mutations. In addition, LFS family members can harbor exceptionally large deletions or duplications, as identified by their total structural variation scores. This constitutional structural dynamism may act as the genetic foundation on which larger somatic chromosomal deletions and duplications build, leading to the development of cancer. These findings also establish a novel method for identifying individuals with constitutional chromosomal instability and inherent susceptibility to cancer.


Figure 4

Figure 4 | Proposed model for the progression of copy number variable DNA regions in the Li-Fraumeni cancer predisposition syndrome. A model of copy number variable DNA regions in patients with sporadic (top row) or inherited cancer (bottom row) is shown. a, The total number of CNVs in the genomes of healthy individuals is similar. Non-cancer predisposed individuals have intact DNA repair mechanisms that maintain the number of CNVs close to this baseline (Fig. 1A). Despite efficient repair machinery, CNVs still occur 100 to 10,000 times more frequently than point mutations in the human genome260. This is largely facilitated by the genomic sequence architecture. The precise mechanisms that give rise to most human CNVs are not known; however, nonallelic homologous recombination (NAHR) and nonhomologous end joining (NHEJ) are thought to be involved183. Both NHEJ and NAHR are processes by which double strand (ds) DNA breaks are repaired. The ubiquity and non-random distribution of CNVs in humans highlights genomic regions that are intrinsically unstable. b, CNVs are more abundant in Li-Fraumeni cancer predisposed TP53 mutation carriers because of germline TP53 haploinsufficiency. TP53, as the “guardian of the genome”274, suppresses cell cycle advance and DNA replication following dsDNA damage. Further, TP53 is involved in the very processes known to give rise to CNVs, including suppressing the level of homologous recombination. While defective TP53 is known to cause increased copy number variation and instability in tumors271-273, our data suggest a new model wherein these alterations arise much earlier in cancer-prone individuals. We have observed this increase of CNVs in primary LFS lymphocyte DNA but this effect may be more dramatic in other cells undergoing rapid remodelling or replicative stress, or in the normal tissue of patients with other cancer predisposition disorders. c, CNVs become fertile ground for changes in cancer. Genomic instability may be preferentially directed toward CNV regions that are hotspots for recombination, as suggested by our observation (Fig. 3) that CNVs can act as the genetic foundation on which larger somatic chromosomal deletions and duplications develop in tumors (shown here as arrows from CNVs in blood to those in tumor DNA). Tumor changes may be secondary to an underlying non-tumor CNV or arise de novo at the same locus. It is likely that the sequence architecture of genomic regions which gives rise to CNVs also facilitates large somatic alterations. In this model CNVs are seen as crucial regions in both sporadic and inherited tumors. Further, the early age of onset of inherited tumors might be explained by the patient’s increased CNV frequency. CNVs should therefore be viewed as important contributors to the inborn and acquired genetic changes that give rise to cancer. CNVs are shown as (one copy loss), (two copy loss) or (one copy gain). Inherited CNVs are represented in black, acquired CNVs in red and tumor-specific CNVs in blue.


2.5 Materials and Methods

2.5.1 Subject recruitment

After obtaining written informed consent, DNA was extracted from peripheral blood leukocytes of 53 individuals from families with a germline TP53 mutation and from 70 unrelated controls. The 53 individuals included 20 TP53 wild type individuals and 33 TP53 mutation carriers. Of these, one individual had been diagnosed as a TP53 mosaic and was grouped with the TP53 mutation carriers in the CNV analysis. In addition, genomic DNA from 5 frozen choroid plexus tumors was extracted. DNA was quantified using a NanoDrop Spectrophotometer (NanoDrop, Wilmington, DE) and quality assessed by agarose gel electrophoresis. This study was approved by the Research Ethics Board at the Hospital for Sick Children in Toronto. Subject recruitment for the 500 individuals of European descent and the 270 individuals from the HapMap collection is described elsewhere275,276.

2.5.2 DNA microarray analysis

Genomic DNA was genotyped with Affymetrix GeneChip Human Mapping 250K arrays (Affymetrix, Santa Clara, CA; Supplementary Fig. 4)277. Samples were restriction enzyme digested, amplified, purified, labeled, fragmented and hybridized as per the manufacturer’s protocol. For the reference samples (n=770), DNA copy number analysis was performed with dChip268 using Affymetrix Nsp CEL files. The LFS case-control cohort (n=123) was assessed with dChip, CNAG278 and GEMCA279 using Affymetrix Nsp and Sty CEL files. The average call rate in the LFS hybridizations, which is an indicator of the overall performance of the assay, was 97.5% (SD=1.6). Two blood DNA samples with more than 150 CNVs were excluded from the TP53 mutation carrier group to avoid calling a high number of false positives. The corresponding paired tumor for one of these samples was therefore also excluded. The characterization of copy number variation is described in more detail in the Supplementary Methods.

2.5.3 Quantitative PCR validation

Quantitative PCR of genomic DNA copy number was performed by relative quantification on a Roche LightCycler 480 (Roche Applied Science, Indianapolis, IN) instrument using the Roche SYBR green kit. Primers were designed using Primer3 and the human genome reference assembly (UCSC version hg17, based on NCBI build 35). All samples were run in triplicate. Copy number alterations were assessed by relative quantification methods which compensate for differences in target and reference amplification efficiencies.
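As an illustration of efficiency-compensated relative quantification, the sketch below uses the widely used efficiency-corrected ratio (often attributed to Pfaffl). This is an assumption made for illustration: the text does not specify which formula or software setting was applied, and the function name, Ct values and efficiencies are all hypothetical.

```python
from statistics import mean


def relative_copy_ratio(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control,
                        eff_target=2.0, eff_ref=2.0):
    """Efficiency-corrected ratio of target to reference locus.

    Each ct_* argument is a list of replicate Ct values (e.g. a triplicate).
    eff_* is the per-cycle amplification factor (2.0 means perfect doubling).
    A ratio of 1.0 means the sample matches the control; multiplying by 2
    gives an estimated copy number when the control is diploid.
    """
    delta_target = mean(ct_target_control) - mean(ct_target_sample)
    delta_ref = mean(ct_ref_control) - mean(ct_ref_sample)
    return (eff_target ** delta_target) / (eff_ref ** delta_ref)


# Hypothetical triplicates: the target locus crosses threshold about one
# cycle later in the test sample than in the diploid control, while the
# reference locus is unchanged.
ratio = relative_copy_ratio(
    ct_target_sample=[26.0, 26.1, 25.9],
    ct_ref_sample=[24.0, 24.0, 24.0],
    ct_target_control=[25.0, 25.0, 25.0],
    ct_ref_control=[24.0, 24.0, 24.0],
)
print(f"ratio = {ratio:.2f}, estimated copies = {2 * ratio:.1f}")
```

With perfect efficiencies, a one-cycle delay at the target halves the ratio, which is the signature expected of a one copy loss against a diploid control.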

2.5.4 Statistical analyses

Data were analyzed using SPSS versions 14.0 and 15.0 (SPSS Inc, Chicago, IL). CNV frequencies were natural logarithm transformed and compared by two-tailed independent-samples t-tests after assessing for normality using stem and leaf plots and histograms. A p-value of <0.05 was considered significant. Levene’s test for equality of variances was used to determine when to assume equal variances. To compare the frequency of the cancer-related CNV overlapping MLLT4 (Supplementary Discussion), Fisher’s exact test was used. Unrelated probands in the LFS cohort (n=19) were evaluated for the CNV and contrasted to unrelated individuals in the reference population (n=710; all children from the CEPH and Yoruban trios were excluded to ensure independent observations).

2.5.5 Computational assessment of cancer-related genes

Cancer-related genes were selected from the CancerGenes database188. Genes with zero sources were excluded, yielding a final list of ~400 known cancer-related genes. Genomic coordinates of CNVs and genes were based on the NCBI build 35 reference human genome sequence (Supplementary Fig. 2). Custom software (available upon request) was used to determine CNVs encompassing or overlapping genes in more than one individual.
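The overlap step can be pictured with a short interval-intersection routine. This is a minimal sketch and not the custom software referred to above; the function name, tuple layout, gene names and coordinates are hypothetical, and coordinates are treated as closed intervals on a single reference build.

```python
from collections import defaultdict


def genes_hit_by_cnvs(cnvs, genes, min_individuals=2):
    """Report genes encompassed or overlapped by CNVs in several individuals.

    cnvs:  iterable of (individual, chromosome, start, end)
    genes: iterable of (gene_name, chromosome, start, end)
    Returns {gene_name: sorted list of individuals} for genes hit in at
    least `min_individuals` distinct individuals.
    """
    carriers = defaultdict(set)
    for individual, c_chrom, c_start, c_end in cnvs:
        for name, g_chrom, g_start, g_end in genes:
            # Two closed intervals overlap when each starts before the other ends.
            if c_chrom == g_chrom and c_start <= g_end and g_start <= c_end:
                carriers[name].add(individual)
    return {name: sorted(ids) for name, ids in carriers.items()
            if len(ids) >= min_individuals}


# Hypothetical toy data, not real genomic coordinates.
toy_cnvs = [
    ("P1", "chr6", 100, 500),
    ("P2", "chr6", 450, 900),
    ("P3", "chr10", 100, 200),
]
toy_genes = [
    ("GENE_A", "chr6", 400, 480),
    ("GENE_B", "chr10", 150, 160),
    ("GENE_C", "chr6", 2000, 3000),
]
print(genes_hit_by_cnvs(toy_cnvs, toy_genes))
```

The nested loop is quadratic, which is acceptable for a few hundred genes; an interval tree or sorted sweep would be the usual choice for larger gene lists.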

2.5.6 TP53 mutation screening

TP53 mutations were detected by direct sequencing of exons 2 to 11 and intron-exon boundaries of PCR products from blood-derived DNA using an ABI automated sequencer. Primer sequences have been published elsewhere139.

2.6 Supplementary Discussion

2.6.1 Recurrent copy number variations at cancer associated genes

Cancer is an incremental process involving alterations of multiple tumor suppressor genes and oncogenes. Common genetic variants, such as single nucleotide polymorphisms (SNPs), which modify or accelerate this process can contribute to early-onset tumors or familial aggregations of cancer. Acquired chromosomal changes are frequently found in tumor genomes, causing gene deletions, amplifications or balanced cytogenetic abnormalities, and their importance in somatic tumorigenesis is well established. As with SNPs, constitutional deletions and duplications, such as CNVs, are recognized as important components of genetic variation. However, the potential role of CNVs as genetic risk factors for cancer predisposition has not yet been explored.

In a reference population, which included 500 persons of European descent and the multiethnic 270 person HapMap collection, we identified cancer-related genes encompassed or directly overlapped by a CNV (49 genes found in more than 1 individual and 98 genes in singular individuals). The current catalogue of genes implicated in cancer was obtained from the CancerGenes database and the CNV regions were determined from the oligonucleotide SNP array hybridizations 188. The most frequent copy number variable cancer genes observed are: MLLT4 (Myeloid/lymphoid or mixed-lineage leukemia [trithorax homolog, Drosophila] translocated to, 4); FHIT (Fragile histidine triad gene); TFG (TRK-fused gene); FANCF (Fanconi anemia, complementation group F) and MSH6 (mutS homolog 6 [E. coli]). These copy number variable genes have been implicated in acute and chronic leukaemias, lymphomas and numerous solid tumors of mesenchymal or epithelial tissue.

We noted the presence of apparently healthy individuals with CNVs at MSH6. Germline point mutations and gross genomic rearrangements at MSH6, MSH2, MLH1 and PMS2 are associated with Lynch Syndrome (or HNPCC), the most common form of inherited colorectal cancer 280,281. The FHIT gene was also determined to be the site of CNVs in this analysis. FHIT spans 1.5 Mb of DNA, encompasses the FRA3B fragile site and its protein is partially or entirely lost in most human cancers 282. The contribution of these CNVs to tumor predisposition, initiation or progression will require further investigation.

The LFS cohort also showed copy number variability in cancer-related genes. Of the nine families with inherited TP53 mutations assessed for CNVs, 2 families had near identical duplications on chromosome 6 (locus 6q27), overlapping the MLLT4 gene (also named AF6, Fig. 2B). MLLT4 is a target of Ras and is fused with MLL in the common leukemia translocation t(6;11)(q27;q23) 283. The MLLT4 duplication was validated by qPCR in all individuals and in DNA from independent blood-redraws when available. The duplication was structurally similar to the CNV in the healthy reference population (n=770). The average size of the CNV is 260 kb (range: 220 kb to 350 kb) in LFS and 250 kb (range: 240 kb to 372 kb) in the reference population. However, the frequency of the CNV is significantly increased in LFS (p=0.006, Fisher’s exact test): three of the 19 LFS probands (15.8%; observed/expected: 3/0.4 = 7.5) harbored the duplication, while only 12 of 710 healthy individuals from the reference population (1.69%; observed/expected: 12/14.6 = 0.82) harbored the CNV.
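The counts reported above are enough to recompute both the expected values and the test. The sketch below is an independent, standard-library implementation of a two-sided Fisher's exact test (summing all tables no more probable than the observed one); it is not the SPSS procedure used in the study, and the function names are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k carriers falling in row 1.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table at least as extreme (no more probable) than the observed one.
    return sum(prob(k) for k in range(lowest, highest + 1)
               if prob(k) <= p_observed * (1 + 1e-9))


def expected_count(row_total, column_total, grand_total):
    """Expected cell count under independence."""
    return row_total * column_total / grand_total


# Counts from the text: 3 of 19 LFS probands and 12 of 710 reference
# individuals carried the MLLT4 duplication.
lfs_with, lfs_total = 3, 19
ref_with, ref_total = 12, 710
carriers = lfs_with + ref_with
everyone = lfs_total + ref_total

p_value = fisher_exact_two_sided(lfs_with, lfs_total - lfs_with,
                                 ref_with, ref_total - ref_with)
exp_lfs = expected_count(lfs_total, carriers, everyone)
exp_ref = expected_count(ref_total, carriers, everyone)
print(f"p = {p_value:.4f}")
print(f"LFS observed/expected = {lfs_with / exp_lfs:.2f}")
print(f"reference observed/expected = {ref_with / exp_ref:.2f}")
```

The expected counts come out near 0.4 and 14.6, matching the values quoted in the text, and the p-value should land close to the reported 0.006; small differences from a statistics package can arise from how two-sided p-values are defined.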

Another LFS family displayed two separate duplications on chromosome 10, which were inherited through three generations of family members (Fig. 2C). One of these duplications, at locus 10q26.2, intersects with the disintegrin-metalloproteinase ADAM12. While it is not included in the CancerGenes database, the dysregulation of ADAM12 has been reported in brain, breast, liver, stomach and colon cancer 284.

While genome-wide copy number analyses are routinely performed on tumor samples, the frequent lack of matched constitutional DNA means that the underlying germline contribution to the detected somatic alteration cannot be known. Knowledge of ‘underlying’ CNVs is important if these disrupt critical genes (as noted above) or if these occur in loci where they may contribute to the chromosomal alterations of the resultant tumor, especially as we have shown that germline CNVs can provide the foundation for larger somatic alterations. The germline deletion in chromosome 22q11.23 discussed in the main text provides an interesting example in this regard. The region, which is hemizygously deleted in the blood and becomes homozygously deleted in the tumor, encompasses at least two known genes. Just as importantly, it has been previously reported that loss of material from 22q is the most common change in choroid plexus carcinoma, which is the tumor that this patient ultimately developed (Rickert CH et al. 2002).

Our findings that germline CNVs can provide the foundation for somatic chromosomal changes and overlap certain cancer-related genes highlight the need for matched analyses in cancer studies. The establishment of a baseline for structural variation in healthy human genomes will allow for future association studies in cancer.

2.7 Supplementary Figures and Tables

Supplementary Figure 1 | CNV frequency and total structural variation are conserved in different ethnic groups. Boxplots of CNV frequency (blue) and total structural variation (red) for 4 ethnic groups are shown. A dashed line indicates the baseline value for each metric. The central boxes span the quartiles and the outer lines (whiskers) are the largest/smallest values that are not outliers. Outliers are indicated by open circles, while stars represent extreme values.


Supplementary Figure 2 | Chromosomal positions and SNP coverage of 5 cancer-related genes overlapping CNVs. Shown is the SNP coverage of Affymetrix GeneChip arrays for the 5 most frequent copy number variable cancer-related genes in the apparently healthy population (n=770). These are: MLLT4, FHIT, TFG, FANCF and MSH6. The genes are plotted using the University of California Santa Cruz genome browser (www.genome.ucsc.edu). Each gene’s position is indicated on the chromosome idiogram in red. Shown below each idiogram, in an enlarged view, are chromosomal coordinates of the gene, its intron-exon structure and the oligonucleotide array SNP probe coverage for the region (black vertical lines).

Supplementary Figure 3 | Validation of a 6.1 Mb deletion in a Li-Fraumeni syndrome family. Shown is validation of the copy number of genomic DNA on chromosome 21q21.1-q21.2, the site of a 6.1 Mb germline heterozygous deletion in an LFS family. Validation was performed by qPCR. From left: a diploid control; blood DNA with a heterozygous deletion from the proband’s father (I-1, TP53 wild type, as yet unaffected); the proband’s blood DNA showing the same deletion (II-1, TP53 mutation carrier, affected with two primary tumors) and the proband’s sister with the same deletion (II-2, TP53 mutation carrier, affected with two primary tumors). The differences in mean copy number between control DNA and the DNA of the proband’s father (I-1), the proband (II-1) and the proband’s sister (II-2) are all highly significant (p<0.01). Error bars represent +/- 2 S.E.M.

Supplementary Figure 4 | DNA Copy number variations in 893 individuals.

Shown are genome-wide plots of copy number variations of all individuals assessed, including the reference population (n=770), LFS families (n=53) and unrelated controls (n=70). The chromosome number is on the y-axis and the DNA samples assessed are on the x-axis. Chromosomal regions are contiguously colored on the basis of their copy number: regions of faint coloring indicate deletions while duplications are colored more intensely. A region of chromosome 21 is magnified (21q21.1-q21.2), showing a 6.1 Mb germline deletion present in an LFS family described in the text and shown in more detail in Figure 2A. This structural alteration, which is paternally inherited, was the largest observed in any individual assessed.


Supplementary Table 1A | LFS Families

Family  TP53 mutation         n  WT  Mutation carriers
1       Arg175His             3  1   2
2       Arg273Ser             4  2   2
3       12138 insC; pro72fs   3  1   2
4       Pro152Leu             3  1   2
5       Arg175His             5  3   2
6       Arg158His             4  1   3
7       IVS03-11 C>G          6  2   4
8       His193Pro             4  3   1
9       Phe134Tyr             3  1   2
10      Arg248Gln             6  3   3
11      Tyr163Cys             4  3   1

Supplementary Table 1B | Unrelated TP53 mutation carriers

Unrelated TP53 mutation carrier  TP53 mutation
1                                Arg248Gln
2                                IVS05-1 G>C
3                                c.652insG;Glu221Stop
4                                Arg273His
5                                Arg175His
6                                14494-1450 del8/ins AGGTG; Cys275Stop
7                                Arg273Cys
8                                Arg273His

2.8 Supplementary Methods

2.8.1 Characterization of Copy Number Variation

NspI and StyI microarray scans were analysed for copy number variation (CNV) using the following software: DNA Chip Analyzer (dChip) (Li C et al. 2001; Lin M. et al. 2004), Copy Number Analysis for GeneChip (CNAG) (Nannya Y. et al. 2005) and Genotyping Microarray based CNV Analysis (GEMCA) (Komura D. et al. 2006).

Analysis with dChip was performed as previously described (Zhao et al. 2005) in batches of ~100 probands. Briefly, array scans were normalized at the probe intensity level with an invariant set normalization method. After normalization, a signal value was calculated for each SNP using a model-based (PM/MM) method. In this approach, image artefacts are identified and eliminated by an outlier detection algorithm. For both sets of arrays, the resulting signal values were averaged across all samples for each SNP to obtain the mean signal of a diploid genome. From the raw copy numbers, the inferred copy number at each SNP was estimated using a Hidden Markov Model (HMM).
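The two final steps, raw copy number followed by HMM smoothing, can be sketched in a few lines. This is a toy illustration and not dChip's implementation: it assumes that the raw copy number is twice the ratio of a sample's signal to the cross-sample mean, and it uses a simple Gaussian-emission HMM decoded with the Viterbi algorithm. The state set, standard deviation and transition probability are hypothetical values chosen for the example.

```python
import math


def raw_copy_numbers(signals, reference_means):
    """Scale each SNP signal by the cross-sample mean, taken to represent 2 copies."""
    return [2.0 * s / m for s, m in zip(signals, reference_means)]


def viterbi_copy_states(raw, states=(0, 1, 2, 3, 4), sd=0.25, stay=0.96):
    """Most likely integer copy-number state per SNP under a simple HMM."""
    switch = (1.0 - stay) / (len(states) - 1)

    def emit(x, state):
        # Gaussian log-density around the state's copy number (constant dropped).
        return -((x - state) ** 2) / (2.0 * sd ** 2)

    def trans(i, j):
        return math.log(stay if i == j else switch)

    n = len(states)
    score = [emit(raw[0], s) for s in states]  # uniform prior over states
    back = []
    for x in raw[1:]:
        new_score, pointers = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + trans(i, j))
            new_score.append(score[best_i] + trans(best_i, j) + emit(x, states[j]))
            pointers.append(best_i)
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the highest-scoring final state.
    idx = max(range(n), key=lambda i: score[i])
    path = [idx]
    for pointers in reversed(back):
        idx = pointers[idx]
        path.append(idx)
    path.reverse()
    return [states[i] for i in path]


# Hypothetical signals for 8 consecutive SNPs; the reference mean is 100 at each.
signals = [100, 105, 95, 50, 55, 45, 100, 100]
raw = raw_copy_numbers(signals, [100] * len(signals))
inferred = viterbi_copy_states(raw)
print(inferred)
```

The high self-transition probability is what makes the model favour contiguous segments: three consecutive SNPs near one copy are called as a deletion, whereas an isolated noisy SNP would tend to be absorbed into the surrounding diploid state.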

To validate the CNV frequency determined by dChip we used CNAG and GEMCA, two independent CNV detecting algorithms. For analyses with CNAG version 2.0, we set the reference pool to include all samples and performed an automatic batch pairwise analysis using sex-matched controls. Test samples were compared to all samples within the reference pool and matched based on signal intensity standard deviations. The scan intensities for each ‘test’ sample were compared to the average intensities of the reference samples (typically the average of 5-12 samples) and used to calculate raw copy number changes. Underlying copy number changes were then inferred using a HMM built into CNAG. GEMCA analysis was performed essentially as described, except we used two designated DNA samples (NA10851 and NA15510) as references for pairwise comparison to all proband experiments. We further filtered these results by only including those CNVs that were common to both pairwise experiments.

As described in the main text, CNAG and GEMCA also revealed an increase in CNV frequency in TP53 mutation carriers as compared to TP53 wild types. Further, the relative ratio of CNVs between these two groups was similar in the three independent analyses: 1.5, 1.33 and 1.6 (using dChip, GEMCA and CNAG, respectively). We designed an additional, more stringent analysis which only counted CNVs found by two or more algorithms and, similarly, found an increase in CNV frequency in TP53 mutated as compared to TP53 wild type individuals (relative ratio: 1.5).

In addition to the quantitative PCR (qPCR) validation of CNVs described in the main text, 10 other CNVs in the Li-Fraumeni Syndrome cohort were qPCR validated. These CNVs were chosen at random using a random number generator (Haahr M, random.org site: http://www.random.org, 1998-2007) and their primer sequences are provided.

The further copy number reduction or expansion of germline CNVs in paired tumor DNA was validated either by qPCR or SNP genotype analysis. In one instance, for the 22q11.23 germline deletion that was found to be homozygously deleted in the patient’s brain tumor DNA (Fig. 3B), two qPCR probes were used. Both probes showed a copy number reduction in blood DNA (copy=1) and further copy reduction in tumor DNA (copy=0). This is consistent with a somatic loss of the opposite allele at this locus. In another instance, a 221 kb CNV deletion at chromosome 11p11.2, enlarged to 662 kb in the tumor, was validated using the SNP genotype information. In blood and tumor DNA, at the site of the 221 kb germline deletion all SNPs (22/22) were homozygous and of identical genotype. This is consistent with a constitutional deletion that persists in the tumor on the same allele. However, whereas the site of the ‘expansion’ contained both heterozygous and homozygous genotypes in blood, the tumor DNA was completely reduced to homozygosity. This is consistent with a somatic deletion adjacent to, and on the same allele as, the germline CNV.
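The genotype-based reasoning above can be expressed as a small classifier over paired blood and tumor SNP calls. The sketch is a hypothetical illustration of that logic rather than the analysis pipeline used here; the genotype coding ('AA', 'AB', 'BB'), the function names and the labels are assumptions, and a run of homozygous SNPs is only consistent with a deletion, not proof of one.

```python
def heterozygous_fraction(genotypes):
    """Fraction of called SNP genotypes ('AA', 'AB', 'BB') that are heterozygous."""
    called = [g for g in genotypes if g in ("AA", "AB", "BB")]
    if not called:
        raise ValueError("no called genotypes in region")
    return sum(g == "AB" for g in called) / len(called)


def classify_region(blood, tumor):
    """Interpret paired blood/tumor genotypes across one region.

    'germline'     : homozygous and identical in both tissues, as expected
                     for a constitutional deletion retained in the tumor.
    'somatic'      : heterozygous calls in blood are lost in the tumor, as
                     expected for a deletion acquired in the tumor.
    'inconclusive' : anything else.
    """
    blood_het = heterozygous_fraction(blood)
    tumor_het = heterozygous_fraction(tumor)
    if blood_het == 0.0 and tumor_het == 0.0 and list(blood) == list(tumor):
        return "germline"
    if blood_het > 0.0 and tumor_het == 0.0:
        return "somatic"
    return "inconclusive"


# Hypothetical genotype calls: 22 homozygous SNPs shared by blood and tumor,
# then a flanking region that is heterozygous in blood only.
deletion_blood = ["AA"] * 10 + ["BB"] * 12
deletion_tumor = ["AA"] * 10 + ["BB"] * 12
flank_blood = ["AA", "AB", "BB", "AB"]
flank_tumor = ["AA", "AA", "BB", "BB"]

print(classify_region(deletion_blood, deletion_tumor))
print(classify_region(flank_blood, flank_tumor))
```

Applied to the 11p11.2 example, the first pattern corresponds to the 221 kb germline deletion and the second to the flanking region gained in the 662 kb tumor deletion.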

Chapter 3

3 A common molecular mechanism underlies two phenotypically distinct 17p13.1 microdeletion syndromes

This chapter has been submitted for publication to the American Journal of Human Genetics.

Adam Shlien, Berivan Baskin, Maria Isabel W. Achatz, Dimitrios J. Stavropoulos, Kim E. Nichols, Louanne Hudgins, Chantal F. Morel, Margaret P. Adam, Nataliya Zhukova, Lianne Rotin, Ana Novokmet, Harriet Druker, Mary Shago, Peter N. Ray, Pierre Hainaut, David Malkin. A common molecular mechanism underlies two phenotypically distinct 17p13.1 microdeletion syndromes.


3.1 Abstract

DNA copy number variations (CNVs) underlie many neuropsychiatric conditions, but have been less studied in cancer. We report the association of 17p13.1 CNVs with childhood-onset developmental delay (DD) and cancer. Through a screen of over 4,000 patients with diverse diagnoses, we identified eight probands harboring microdeletions at TP53 (17p13.1). We used a purpose-built high-resolution array with 93.75% breakpoint accuracy to fine-map these microdeletions. Four patients were found to have a common phenotype, including developmental delay and hand/foot abnormalities, constituting a novel syndrome. Notably, these patients were not affected with cancer. Moreover, none of the TP53 deletion patients affected with cancer (n=4) had neurocognitive impairments. DD patients have larger deletions, which encompass but do not disrupt TP53, whereas cancer-affected patients harbor CNVs with at least one breakpoint within TP53. Most 17p13.1 deletions arise by Alu-mediated non-allelic homologous recombination. Furthermore, we identify a critical genomic region associated with DD containing six under-expressed genes. We conclude that, while they overlap, 17p13.1 CNVs are associated with distinct phenotypes depending on the position of the breakpoint with respect to TP53. Further, detailed characterization of breakpoints revealed a common formation signature. Future studies should consider whether other loci in the genome also give rise to phenotypically distinct disorders by means of a common mechanism.

3.2 Introduction

As the range of diseases associated with copy number variations (CNV) expands, it has become apparent that specific CNV loci can be associated with a spectrum of unrelated conditions. For example, CNVs at 1q21.1 predispose to schizophrenia205,206, Tetralogy of Fallot208, cancer (including neuroblastoma202), and a range of pediatric conditions207. It is unclear whether individuals harboring structural changes at these and other hotspots are predisposed to one, or many diseases. High-resolution copy number platforms, such as tiling oligonucleotide arrays, offer increased accuracy285 and can therefore be applied to disentangle overlapping CNV-based diseases. Further, characterization of CNVs at the single basepair level can unearth common sequence elements, which represent signatures of the various DNA repair processes that led to their formation.

Most pediatric cancers arise sporadically; however, at least 5-10% harbor an underlying germline defect209, with an emerging link between CNVs and cancer susceptibility189,202,286. We have previously shown that germline TP53 missense mutations predispose to the autosomal dominant cancer susceptibility condition known as Li-Fraumeni syndrome (LFS)49, in which an excess of CNVs across the genome is observed189. Here we investigate whether 17p13.1 CNVs, which include TP53, are sufficient to cause LFS.

To improve our understanding of 17p13.1 CNVs, we constructed an oligonucleotide comparative genomic hybridization (CGH) array to interrogate this genomic region at ultra-high resolution; overlapping probes covering all exons of every gene in the region were designed to achieve 93.75% breakpoint accuracy. Using this platform, we set out to determine whether patients with 17p13.1 CNVs contain shared breakpoint sequences, critically deleted genes, or common clinical features.

3.3 Material and Methods

3.3.1 Sample Recruitment.

Samples were collected from The Hospital for Sick Children, Stanford University Hospital, The Children’s Hospital of Philadelphia, The University Health Network, Toronto, Hospital do Câncer A.C. Camargo and from Emory University. Research was approved by each centre’s institutional review board.

3.3.2 CGH microarray design and hybridization.

Array comparative genomic hybridization was performed using a customized 4x44K microarray platform (Agilent Technologies, Santa Clara, CA), with genomic DNA extracted from peripheral blood using standard methods. A total of 40,577 oligonucleotide probes were placed on the short arm of chromosome 17, in which an 8 Mb target region around TP53 was covered at ultra high-resolution. Of the 38,061 probes in the target region, 15,762 (41%) were designed on exons. Exonic probes were overlapping and tiled across all exons, of all alternative splice variants, for every gene. A minimum of one probe per 350 bp was placed in intronic and non-genic regions. An additional set of probes extended our coverage to the telomere and centromere of chromosome 17p, but at reduced density (Supplementary Figure 3). With this array design we aimed to capture all copy number changes anywhere within the 8 Mb target region around TP53, from small single exon-sized alterations (45-350 bp) to large macroscopic events, and to quickly obtain breakpoint information for copy number changes, especially those within protein-coding regions. The laboratory performing array experiments was blinded to all previous results (sequencing, MLPA, qPCR, etc.). Patient and male reference DNA were labeled with Cy3-dCTP and Cy5-dCTP (PerkinElmer, Waltham, MA), respectively, using the BioPrime genomic labeling module (Invitrogen, Carlsbad, CA), and hybridized to the array platform as recommended in the manufacturer’s protocol (Agilent Technologies). The arrays were washed and scanned using the Agilent G2505B microarray scanner. Data analysis was performed using DNA Analytics version 4.0 (Agilent Technologies).
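As a rough check on the backbone density described above, the following sketch estimates how many evenly spaced probes a fixed spacing implies for a region of a given length. The function name and the assumption of perfectly even spacing are illustrative only; the real design also had to account for probe quality and for the exon tiling, neither of which is modeled here.

```python
def backbone_probe_count(region_bp: int, spacing_bp: int) -> int:
    """Number of probes needed to place one probe every `spacing_bp` bases."""
    if region_bp <= 0 or spacing_bp <= 0:
        raise ValueError("region and spacing must be positive")
    # Ceiling division using integers only, to avoid floating point rounding.
    return -(-region_bp // spacing_bp)

# Values taken from the text: an 8 Mb target region, one probe per 350 bp.
estimate = backbone_probe_count(8_000_000, 350)
print(f"Estimated backbone probes: {estimate:,}")
```

This simple estimate is in the same range as the 22,389 intronic and non-genic probes reported in Supplementary Figure 3; the reported figure is presumably somewhat lower because exonic sequence is covered by the tiled probe set instead.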

3.3.3 Gene expression arrays and analysis.

RNA was extracted from blood using standard methods, assessed by Bioanalyzer (Agilent Technologies, Santa Clara, CA) and hybridized to Affymetrix Exon 1.0 microarrays (Affymetrix, Santa Clara, CA). High quality RNA was available for one individual with a large 17p13.1 deletion (encompassing TP53), two individuals with small 17p13.1 deletions (disrupting TP53) and two individuals harboring germline TP53 missense mutations. RNA from three non-carrier siblings was used as a control. Gene expression analysis was performed using Partek Genomics Suite (Partek, St. Louis, MO).

3.3.4 Breakpoint simulation.

Custom software was developed to simulate CNV deletions across all autosomes of the human genome. For each size range (from 10 Kb to 2 Mb), 10,000 simulated CNVs were assessed for intersection with Alu elements at both breakpoints.
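The custom software itself is not reproduced here. The sketch below shows one way such a simulation could be structured, under stated assumptions: Alu annotations are given as non-overlapping (start, end, strand) intervals on a single chromosome, deletions are placed uniformly at random, and a hit is counted when both breakpoints fall inside Alus on the same strand. All function names are hypothetical, and the toy annotation is synthetic rather than drawn from the human genome.

```python
import bisect
import random

def build_index(alus):
    """Sort (start, end, strand) Alu intervals and return them with their start coordinates."""
    ordered = sorted(alus)
    return ordered, [a[0] for a in ordered]

def alu_strand_at(pos, ordered, starts):
    """Strand of the Alu covering `pos`, or None. Assumes Alus do not overlap."""
    i = bisect.bisect_right(starts, pos) - 1
    if i >= 0 and ordered[i][0] <= pos < ordered[i][1]:
        return ordered[i][2]
    return None

def fraction_with_direct_alu_pair(alus, chrom_len, cnv_size, n_sims, rng):
    """Fraction of random deletions whose two breakpoints fall in same-strand Alus."""
    ordered, starts = build_index(alus)
    hits = 0
    for _ in range(n_sims):
        left = rng.randrange(0, chrom_len - cnv_size)
        right = left + cnv_size
        s_left = alu_strand_at(left, ordered, starts)
        s_right = alu_strand_at(right, ordered, starts)
        if s_left is not None and s_left == s_right:
            hits += 1
    return hits / n_sims

# Toy annotation (NOT real data): one 300 bp Alu per 3 kb window on a 50 Mb chromosome.
rng = random.Random(0)
chrom_len = 50_000_000
toy_alus = []
for window in range(0, chrom_len - 3_000, 3_000):
    start = window + rng.randrange(0, 2_700)
    toy_alus.append((start, start + 300, rng.choice("+-")))

for size in (10_000, 100_000, 1_000_000, 2_000_000):
    frac = fraction_with_direct_alu_pair(toy_alus, chrom_len, size, 10_000, rng)
    print(f"{size:>9} bp: {frac:.2%} of simulated CNVs have directly oriented Alus at both ends")
```

A genome-wide version would repeat this per autosome with a real repeat annotation; the sorted index with binary search keeps each breakpoint lookup cheap even with a very large number of Alu intervals.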

3.3.5 Quantitative PCR.

To obtain better size information on these deletions, we developed a high-throughput quantitative assay in a 384-well plate format, using 176 qPCR probes to determine the copy number across a large region of chromosome 17 (1.3 Mb). qPCR assays were performed on a Roche LightCycler by relative quantification. qPCR plates were set up using a custom script and an automated liquid handling system. Primers were designed using Primer3 and the human genome reference assembly (UCSC Genome Browser, version hg18). Deletion sizes were found to be larger than those reported by array CGH or MLPA and, on average, size estimates were improved by 31% using this assay. The DD-associated deletions were nearly 100 times larger than those involved in early-onset cancer.
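The text states only that relative quantification was used. One common form of relative quantification is the comparative Ct calculation, sketched below as an illustration rather than as the exact procedure followed here. It assumes roughly 100% amplification efficiency, a reference amplicon known to be present in two copies, and a diploid control sample; the function name and the Ct values are hypothetical.

```python
def copy_number_from_ct(ct_target_patient, ct_ref_patient,
                        ct_target_control, ct_ref_control,
                        control_copies=2):
    """Estimate target copy number with the comparative Ct (delta-delta Ct) method.

    Assumes a doubling of product per cycle (about 100% efficiency).
    """
    delta_patient = ct_target_patient - ct_ref_patient
    delta_control = ct_target_control - ct_ref_control
    ddct = delta_patient - delta_control
    return control_copies * 2 ** (-ddct)

# Invented Ct values: the patient's target amplifies one cycle later than expected,
# which corresponds to half the template, i.e. a hemizygous deletion.
print(copy_number_from_ct(26.0, 24.0, 25.0, 24.0))
```

Applied to each of the 176 amplicons in turn, a calculation of this kind gives a copy number profile along the region from which deletion boundaries can be read.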

3.3.6 Fluorescence in situ hybridization.

Fluorescence in situ hybridization (FISH) was performed using standard protocols.

3.3.7 Parent-of-origin analysis.

SNP genotyping was performed using Affymetrix GeneChip 250 Sty arrays or by direct sequencing. Microsatellite marker genotyping was performed by The Center for Applied Genomics, Sick Kids Hospital.

3.3.8 Breakpoint mapping.

Following custom array processing and analysis, breakpoints were mapped by first designing primers flanking the predicted breakpoints and then amplifying junction-specific fragments using long-range PCR (Roche Expand Long Template PCR System). Junction fragments were then sequenced. Putative breakpoints were analyzed by BLAST, BLAT and manual inspection.

3.4 Results

3.4.1 Rare CNVs at TP53 are associated with cancer predisposition or developmental delay

Our six diagnostic labs screened 4,524 patients with diverse clinical phenotypes for DNA dosage changes using array CGH or multiplex ligation-dependent probe amplification (MLPA; Supplementary Table 1). Eight probands were identified with a microdeletion at TP53 (17p13.1), a tumor suppressor gene that predisposes to early-onset cancer when mutated in LFS49. We performed interphase and metaphase fluorescence in situ hybridization (FISH) using TP53 and 17ptel probes (Figure 1A). The interstitial deletion could be seen in all cells and was therefore not due to mosaicism.

Individuals with microdeletions at TP53 had cancer (n=4) or a non-cancer phenotype (n=4) comprising a spectrum of congenital anomalies (Table 1 and Supplementary Figure 2) that included pervasive developmental delay/mental retardation, speech difficulties, hypotonia, hand/foot abnormalities and facial dysmorphisms.

3.4.2 Different 17p13.1 breakpoints are related to two distinct phenotypes

Several congenital syndromes are known to also occur in association with cancer predisposition. Such dual phenotypes are frequently caused by gene dosage mutations, either through numerical chromosomal abnormalities or specific structural changes (e.g. trisomy 21287).

In contrast, LFS patients do not show increased rates of neurocognitive disability or any phenotype besides cancer. Consistent with this, none of the TP53 deletion patients affected with cancer had DD or congenital anomalies. Similarly, none of the patients with DD exhibited any neoplastic growth that might suggest an underlying susceptibility nor did they have family histories of cancer consistent with LFS. Therefore, while they share genomic alterations at TP53, their distinct clinical presentations suggest that these patients fall into two non-overlapping groups.

We next determined the genetic basis of this dichotomy. Our initial patient discovery was performed on two platforms, with complementary depth and breadth. Array CGH is low-resolution at TP53 but provides more information on the extent of 17p13.1 CNVs beyond TP53, while MLPA provides high resolution across TP53’s 11 exons but little information on the surrounding regions. To determine whether CNVs defined by MLPA extend beyond TP53, we used quantitative PCR (qPCR) to measure the copy number of the genes immediately flanking TP53 (Supplementary Figure 1). Both ATP1B2 (telomeric) and WRAP53 (centromeric) were diploid in all cancer patients (mean copy number = 2.03 and 2.11, respectively). However, all patients with DD were hemizygously deleted for both flanking genes (mean copy number = 0.87 [ATP1B2] and 1.13 [WRAP53]), a significant reduction as compared to the cancer patients (p = 2.90 x 10^-4 [ATP1B2] and 2.42 x 10^-8 [WRAP53]). We also carried out MLPA experiments on all array CGH-ascertained samples, and found that in every DD case all 11 exons of TP53 were contiguously deleted. In contrast, no cancer case harbored a CNV that included all 11 exons. These results demonstrate that our cohort of DD and cancer patients have overlapping but genotypically distinct CNVs at TP53: while DD-associated CNVs include all exons of TP53 as well as flanking genes, cancer-associated CNVs lie within TP53, changing the copy number of some, but not all, of its exons (Figure 1B).
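The logic of this comparison can be written out as a small sketch that turns the flanking-gene copy numbers into a call about the extent of a deletion. Rounding each estimate to the nearest integer copy number is an assumption made for this illustration, and the function name is hypothetical; the input values are the group means reported above.

```python
def classify_by_flanking_genes(atp1b2_cn: float, wrap53_cn: float) -> str:
    """Classify a TP53 deletion by the copy number of its two flanking genes."""
    flanks = (round(atp1b2_cn), round(wrap53_cn))
    if flanks == (2, 2):
        return "confined to TP53"
    if flanks == (1, 1):
        return "extends beyond TP53 on both sides"
    return "indeterminate"

# Mean copy numbers reported in the text for each patient group.
print("cancer:", classify_by_flanking_genes(2.03, 2.11))
print("DD:    ", classify_by_flanking_genes(0.87, 1.13))
```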


Figure 1A


Figure 1B

Figure 1 Discovery of a 17p13.1 CNV leading to two distinct phenotypes. (a) FISH experiments using TP53 (red) and 17ptel (green) probes. The fluorescent signals in this representative family trio confirm a de novo hemizygous TP53 deletion in the child’s metaphase and interphase nuclei. Two hundred nuclei were scored and no evidence of mosaicism for the CNV was observed. TP53 microdeletions were not observed by conventional Giemsa banded karyotyping. (b) Results of MLPA, qPCR, and clinical array revealed two forms of the 17p13.1 CNV: amongst developmental delay patients the CNV includes and extends past TP53 in the telomeric and centromeric directions (n=4; top); amongst the cancer-affected patients, the 17p13.1 CNV deletes some, but not all, of TP53’s exons.


Table 1 Phenotypic features of four patients with 17p13.1 CNVs and developmental delay

Patient ID 3026 2723 3148 3354

Sex F F F M

Inheritance -- de novo de novo de novo

Parental origin Paternal Paternal Maternal --

Cognitive GDD GDD GDD GDD

Non verbal Speech apraxia Limited speech development

Severe MR

Growth (percentile) Height (<3rd), Weight (<3rd), HC (50th); Height (25th), Weight (70th), HC (25th); Height (10-25th), Weight (75-90th), HC (97th); Height (50-75th), Weight (50-75th), HC (10-25th)

Facial features Prominent nasal Upturned nose Wide nasal bridge Broad upturned nose bridge with small nares

High arched palate High arched High arched palate palate

Thin lips Thin puckered lips

High forehead Prominent forehead Low set ears bilateral Upswept ear lobules ear lobe pits with earlobe pits

Downslanting palpebral fissures Downslanting palpebral Short neck with fissures webbing Short neck no webbing

Arched eyebrows that extend laterally Unusual arched eyebrows

Epicanthal folds

Broad flat Epicanthal folds Small recessed chin

Small recessed chin Downturned corners of mouth Downturned corners of

mouth Other: Low hairline, Left-sided mild , Malar hypoplasia, Other: Mild upslanting Bitemporal narrowing Other: Telecanthus, palpebral fissures, Short Other: Depressed nasal tip, Bifid columella with prominent Brachycephaly uvula, Posterior hair whorl ala nasi

MSK features Ligamentous laxity Ligamentous laxity

Bilateral elbow Contractures elbow and Contractures knees and Dimpling at ankles, elbows and knees

Sacral crease

Extra flexion Asymmetric crease creases (calves, with deep sacral arms) dimple

Other: Mild pectus Other: Bilat vagus, Deformity Other: 13 pairs of ribs, deformity of ankles mild spine curvature, partial sacralization of lower lumbar spine

Cardiovascular VSD; PDA (self-resolved); Normal; No echocardiogram

Ocular Strabismus Strabismus Bilateral alternating Lateral vision difficulty exotropia, myopia and difficulty tracking (11 months of age)

Other: Legally blind, right eye hamartoma (CHRPE-like lesion), iris hypoplasia, astigmatism, decreased lacrimation

Bone marrow Hemolytic anemia of infancy (self-resolved), pure red cell aplasia (onset age 15 years); N/A; N/A; N/A

Neurological Hypotonia Hypotonia Hypotonia Hypotonia

Brisk DTRs Brisk DTRs

Ankle clonus Ankle clonus

Hydrocephalus External hydrocephalus

Other: Other: broad based Other: brain MRI: normal Other: DTRs difficult to gait. Brain MRI: Choreoathetoid elicit arrested movements. Brain MRI: thinned CC, Delayed myelination, tethered cord. Upgoing plantar responses,

Audiology Normal Normal Decreased hearing Normal

Psychiatric PDD PDD

Bipolar disorder N/A N/A

Behavioral Self-injurious, Intermittent hand- N/A N/A aggressive wringing, hand clapping

Hands Thumbs proximally Left thumb Normal placed proximally placed Short hands, deep palmar creases bilat, Short hands (3- first finger 25th percentile), clinodactyly bilat, 5th broad thumbs finger IP joint contracture bilat, Left transverse palmar crease, right "hockey stick" shaped crease

Feet Short feet Small feet (<5th Normal Shortened feet with PC) broad first toes

Big toe large and Big toe abnormally broad bilaterally, long and narrow pollicization of bilaterally big toes

Other: Deep plantar creases bilat, flat feet, 4th toe shortened bilat

GU Normal Neurogenic Normal Shawl scrotum bladder, ovarian cysts, resolved renal cysts

Skin Normal dermoid cyst Normal sacral mongolian spot above left eye, compound melanocytic nevus, epitheliod cell type of scalp

Nipples Normal Bilateral inverted NA Bilateral inverted supernumerary nipples nipple

Other Failure to thrive with Feeding Feeding difficulties NA feeding difficulties difficulties

Sleep disturbances Sleep disturbances

GERD

Other: chronic GERD Other: constipation, Hypothyroidism and Hypogammaglobin Other: Benign paroxysmal iron overload emia torticollis secondary to blood transfusions q3-4 weeks

Bolded text indicates features shared by more than one patient. HC: head circumference; CC: corpus callosum; DTR: deep tendon reflexes; GDD: global developmental delay; GERD: gastro-esophageal reflux disease; IP: interphalangeal; MR: mental retardation; PDD: pervasive developmental disorder.


3.4.3 17p13.1 genomic deletions can be inherited or arise de novo

A review of the eight probands’ pedigrees showed that the families of DD patients did not have neurocognitive impairment and that the pedigrees of the four cancer patients were consistent with LFS (Supplementary Figure 2). All DD patients with available parental samples (n=3) had a de novo deletion, as shown by CGH, MLPA and FISH analysis (200 nuclei tested, with no evidence of low-level mosaicism in parents). Among the cancer patients with deletions, familial samples were available in two cases. Of these, one family’s samples were sufficiently informative to establish inheritance of the deletion. No apparent parent-of-origin bias was observed for deletions in either group of our cohort.

3.4.4 Design of a custom ultra high-resolution tiling array

Obtaining sequence-level resolution is the most definitive method of validating rearrangements288, as it leads to precise definitions of the CNVs’ breakpoints and gene content, provides clues as to the mechanism underlying their formation289,290 and reveals their potential architectural complexity194.

We designed an ultra high-resolution array covering 8 Mb of chromosome 17 to approach sequence-level resolution in these eight cases, and to serve as a novel clinical diagnostic platform for identifying rearrangements in future patients. The array comprises ~45,000 oligonucleotide probes spanning 4 Mb upstream and 4 Mb downstream of the TP53 locus (7,512,444 to 7,531,588). All exons within this region are tiled, representing the entire coding sequence of 182 genes and all possible alternative transcripts (2,130 exons; Supplementary Figures 3A and 3B). The precise array design and probe placement are described in the Methods and Supplementary Figure 3.

We tested our novel array on patients whose breakpoints we had already successfully sequenced. These experiments yielded highly precise size and breakpoint information. For example, a patient with DD was found, by our chromosome 17p13.1 array, to have a contiguous genomic deletion of 923,492 bp, a difference in size of only 2,183 bp (0.2%) from that established by sequencing. The 5’ and 3’ breakpoints of the deletion were 2,341 bp and 153 bp away from the true breakpoints, respectively.
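The percentages in this comparison follow from simple arithmetic, sketched below. Using the array-derived size as the denominator is an assumption (the text does not say which size was used, and the result is nearly identical either way), and the function names are hypothetical.

```python
def percent_size_difference(size_bp: int, difference_bp: int) -> float:
    """Size difference expressed as a percentage of the deletion size."""
    return 100.0 * difference_bp / size_bp

def accuracy_from_mean_difference(mean_percent_difference: float) -> float:
    """Accuracy defined as 100% minus the mean percentage size difference."""
    return 100.0 - mean_percent_difference

# Values from the text: a 923,492 bp deletion called by array, 2,183 bp off from sequencing.
pct = percent_size_difference(923_492, 2_183)
print(f"Size difference: {pct:.2f}% of the deletion")

# The same definition links a 6.25% mean difference to the 93.75% accuracy quoted for this array.
print(f"Accuracy: {accuracy_from_mean_difference(6.25):.2f}%")
```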

3.4.5 Alu short interspersed nuclear repeats are associated with breakpoints

Using this array, we determined the size and breakpoints of the remaining samples. Using long-range PCR, we amplified junction fragments spanning putative breakpoints. Then, in cases where high-molecular weight DNA was available, we sequenced junction fragments and determined the breakpoint and size of the deletions. The average difference between actual CNV sizes and the arrays’ predicted sizes was 6.25% (i.e. 93.75% accuracy).

By array, one patient was revealed to harbor another deletion in chromosome region 17p13.1. The secondary deletion, which is also heterozygous, is 24 Kb in length and is located downstream of the primary deletion’s distal breakpoint (Supplementary Figure 4). In another instance, an identical deletion was found in the proband and their sibling, indicating the inheritance of the same pathogenic CNV (Figure 2, patient 3332). While asymptomatic, the sibling is now undergoing routine biochemical and radiographic surveillance for cancer.

We looked for repeat elements coinciding with CNV breakpoints. Of the 12 sequenced breakpoints, ten directly intersect with an Alu short interspersed nuclear repeat element (1 from the oldest AluJ family, 7 from the intermediate AluS family and 2 from the young AluY family).

3.4.6 Most 17p13.1 CNVs arise by Alu-mediated non-allelic homologous recombination

Analysis of sequenced CNVs revealed the mechanisms by which they arose. Four of six deletions involved non-allelic homologous recombination (NAHR) between Alu elements present at both the proximal and distal ends (Figure 2, patients 3026, 3148, 3354 and 3332). The Alu elements flanking these deletions were in the same orientation and shared a moderate degree of homology (81-84% similarity by BLAST). The remaining two patients did not exhibit extensive homologies spanning their breakpoints. Of these, one breakpoint showed a 4 bp microinsertion (CAAG), an ‘information scar’ that is a hallmark of nonhomologous DNA end joining291 (NHEJ; Figure 2, patient 2723).

To evaluate the significance of the observed number of Alus at 17p13.1 CNV breakpoints, we performed 10,000 permutation experiments using randomly distributed CNVs of different sizes (10 Kb to 2 Mb). In these simulations, fewer than 1% of breakpoints coincided with an Alu pair in the same orientation. In contrast, the majority of 17p13.1 CNVs coincide with directly oriented Alus (67%; Figure 2B).
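To give a sense of how unlikely the observed enrichment is under the simulated background, the sketch below computes a binomial tail probability. This is an illustrative calculation and not the test used in the analysis above: it assumes the six deletions are independent and takes 1% as the per-deletion chance of directly oriented Alus at both breakpoints, which the simulations suggest is an upper bound.

```python
from math import comb

def binomial_tail(n: int, k: int, p: float) -> float:
    """Probability of at least k successes in n independent trials with success rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Four of six sequenced deletions had directly oriented Alus at both breakpoints.
p_value = binomial_tail(n=6, k=4, p=0.01)
print(f"P(at least 4 of 6 by chance) = {p_value:.2e}")
```

Under these assumptions the probability is on the order of 10^-7, so the excess of Alu pairs at 17p13.1 breakpoints is very unlikely to reflect random placement.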

Figure 2A


Figure 2 Breakpoint maps, sequence resolution, and inferred mechanism of 17p13.1 CNVs. (a) We developed an ultra high-resolution CGH array (see Supplementary Figure 3) to obtain breakpoint-level information on 17p13.1 CNVs. Shown are the array results for all four DD and two cancer-affected patients. Log2 ratios from the array are plotted, with each dot representing one probe and deletions indicated in green. The proximal and distal breakpoints were determined for all samples, revealing that all DD patients shared a critical region including TP53 and 23 other genes (red). The precise breakpoint positions, sizes, and the nucleotide sequence of the disrupted regions are shown. The presence of two Alu elements (orange arrows and orange-colored nucleotides) at the junctions is consistent with the formation of the CNV by Alu-Alu mediated NAHR (patients 3026, 3148, 3354 and 3332). The percentage of homology between directly oriented Alus is indicated for NAHR CNVs. In one instance an NHEJ signature could be seen at the breakpoint sequence: four additional basepairs were incorporated at the junction (patient 2723). Amongst the cancer patients, the proximal and distal breakpoints were always either intronic in TP53 or intergenic, never disrupting genes other than TP53 or leading to gene fusions. Using high-quality DNA from one patient’s frozen tumor, a second deletion was observed on the opposite allele, conforming to the classical two-hit hypothesis of tumorigenesis292 (patient 2760). This custom array was used to test for the presence of the CNV in two asymptomatic siblings of an index case affected with cancer (patient 3332). One sibling (shown) was found to harbor the identical deletion. (b) Shown is the proportion of CNVs whose breakpoints overlap with an Alu retrotransposon. We performed 10,000 simulation experiments in which randomly distributed CNVs were assessed for Alu overlap. Experiments were done on simulated CNVs sized 10 Kb, 100 Kb, 1 Mb and 2 Mb. Across all size ranges, few simulated CNVs were found to have directly oriented Alus at both breakpoints (~1%), or an Alu at only one breakpoint (<10%). In contrast, all 17p13.1 CNVs intersected with at least one Alu element: most were found to have directly oriented Alus at both breakpoints (67%), and the remainder had an intersecting Alu at only one breakpoint (33%).


3.4.7 A common region implicates new genes in developmental delay

In our study cohort all CNVs associated with the occurrence of childhood cancer were limited to the TP53 locus, deleting between one and ten of its 11 exons. Such deletions are predicted to cause protein truncation, thus interfering with the gene’s tumor suppressive activity. Indeed, in a paired tumor specimen we observed an additional copy number alteration of the same size, thus inactivating the wild type allele (Figure 2, patient 2760).

In contrast, by fine-mapping we found a common deleted region in the four patients with DD (Table 2) that includes 24 genes (critical region shown in Figure 2). There are a number of candidate genes for the observed phenotypes. The four patients with DD harbored between 27 and 86 fully deleted genes. Additionally, fine-mapping revealed that two DD patients carried partial deletions of genes, disrupting some but not all of their exons (Table 2).


Table 2 Deleted and disrupted genes in 17p13.1 deletion patients. The four patients with DD harbored between 24 and 86 fully deleted genes and partial deletion of two genes. The minimally deleted region includes 24 genes.

Developmental delay

Patient                          Chr   Start       End         Deleted genes   Disrupted genes
1                                17    7,300,398   8,273,016   55              CHRNB1
2                                17    7,140,464   8,061,771   58              --
3                                17    7,429,371   7,972,019   28              MPDU1
4                                17    5,500,027   7,937,620   86              --
Critical region (patients 1-4)   17    7,429,371   7,937,620   24              MPDU1

Cancer

Patient                          Chr   Start       End         Deleted genes   Disrupted genes
1                                17    7,511,866   7,516,100   1               TP53
2                                17    7,505,270   7,525,566   1               TP53
3                                17    7,512,445   7,519,262   1               TP53
4                                17    7,520,037   7,520,315   1               TP53
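The critical region in Table 2 is the interval shared by all four DD deletions, which can be recomputed from the tabulated coordinates. The sketch below does this and also checks each deletion against the TP53 locus coordinates given in section 3.4.4 (7,512,444 to 7,531,588). The helper names are hypothetical, and intervals are treated as simple coordinate pairs on chromosome 17.

```python
TP53 = (7_512_444, 7_531_588)  # locus coordinates quoted in section 3.4.4

dd_deletions = {
    1: (7_300_398, 8_273_016),
    2: (7_140_464, 8_061_771),
    3: (7_429_371, 7_972_019),
    4: (5_500_027, 7_937_620),
}
cancer_deletions = {
    1: (7_511_866, 7_516_100),
    2: (7_505_270, 7_525_566),
    3: (7_512_445, 7_519_262),
    4: (7_520_037, 7_520_315),
}

def common_region(intervals):
    """Intersection of all intervals, or None if they share no bases."""
    start = max(s for s, _ in intervals)
    end = min(e for _, e in intervals)
    return (start, end) if start < end else None

def contains(outer, inner):
    """True if `inner` lies entirely within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def overlaps(a, b):
    """True if the two intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]

critical = common_region(dd_deletions.values())
print("Critical region:", critical, "length:", critical[1] - critical[0], "bp")

for pid, region in dd_deletions.items():
    print(f"DD patient {pid}: removes all of TP53 = {contains(region, TP53)}")
for pid, region in cancer_deletions.items():
    partial = overlaps(region, TP53) and not contains(region, TP53)
    print(f"Cancer patient {pid}: removes part of TP53 only = {partial}")
```

The recomputed interval matches the critical region row of the table, and the containment checks reproduce the distinction drawn in the text: every DD deletion spans the whole TP53 locus, whereas every cancer deletion removes only part of it.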


We evaluated the effect of this 17p13.1 CNV on mRNA levels using expression arrays (see Methods). Amongst the genes in the minimally deleted region, TP53 was significantly under-expressed in the patient affected with cancer but not in the patient with DD (p = 6.82 x 10^-3; fold change = -1.85797; Figures 3A and 3B).

The expression of six other genes (of the 24 candidates in the region) was significantly changed in DD but not in the cancer-affected patient (Figure 3B; p<0.01; fold change <-1.5 or >1.5). Of these DD-specific genes, which are both hemizygously deleted and under-expressed in all patients, the trafficking protein particle complex 1 gene (TRAPPC1) is particularly intriguing. TRAPPC1, involved in vesicular transport from the endoplasmic reticulum to the Golgi apparatus as part of the TRAPP complex, is the most significantly changed gene at 17p13.1 and, of note, is also the most significantly changed gene genome-wide (Figure 3C; p = 2.90 x 10^-5; fold change = -2.34146).
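The significance criteria used here (p<0.01 and a fold change below -1.5 or above 1.5) can be expressed as a simple filter. The sketch below applies it to the p values and fold changes reported for the DD patient in the Figure 3 legend and ranks the genes by p value. The function name is hypothetical, and the single non-significant example at the end uses invented numbers purely to show the filter rejecting a gene.

```python
# (p value, fold change) for the DD patient, as reported in the Figure 3 legend.
reported = {
    "TRAPPC1": (2.90e-5, -2.34146),
    "FXR2": (3.47e-3, -1.62447),
    "LSMD1": (4.88e-3, -2.22006),
    "KDM6B": (6.98e-3, -6.3709),
    "CYB5D1": (8.80e-3, -1.60553),
    "MPDU1": (9.78e-3, -1.76825),
}

def is_significant(p_value: float, fold_change: float,
                   p_max: float = 0.01, fc_min: float = 1.5) -> bool:
    """Apply the thresholds from the text: p < 0.01 and |fold change| > 1.5."""
    return p_value < p_max and abs(fold_change) > fc_min

ranked = sorted(
    (gene for gene, (p, fc) in reported.items() if is_significant(p, fc)),
    key=lambda gene: reported[gene][0],
)
for gene in ranked:
    p, fc = reported[gene]
    direction = "under" if fc < 0 else "over"
    print(f"{gene}: p = {p:.2e}, fold change = {fc} ({direction}-expressed)")

# Invented values: a large fold change that does not reach p < 0.01 is rejected.
print(is_significant(0.02, -3.0))
```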

Three additional DD-specific genes are noteworthy: MPDU1, mutations of which result in congenital disorder of glycosylation type If involving severe mental and psychomotor retardation293; FXR2, a homologue of the fragile X mental retardation gene, FMR1, which itself may play a role in that disease294; and EFNB3, known to be important in the development of normal locomotor behavior295. To determine if the deletion unmasks a recessive mutation, we sequenced these four genes but did not find additional mutations.

Having found reduced TP53 expression in the cancer-affected individual, we examined whether other TP53 signaling pathway members were altered. We first measured the mRNA levels of a proband from an LFS family carrying an established deleterious basepair mutation (Arg273Cys) and conducted pathway analysis. Using the well-annotated Ingenuity Pathway Analysis Core Pathways, we noted a subtle but significant difference in genes of the TP53 pathway in individuals with either an established mutation or an internal deletion of TP53, but not in those with complete deletions and DD (p = 4.74 x 10^-2 and 3.01 x 10^-2, respectively). This shows that TP53 is aberrantly expressed in individuals affected with cancer but not in those affected with DD. Further, we find that dysregulated genes common to the missense and internally-deleted mutation carriers are associated with known molecular mechanisms of cancer (p = 2.59 x 10^-3). Together these data highlight gene expression differences between individuals having large or small CNVs at 17p13.1.

Figure 3A

Figure 3B

Figure 3C

Figure 3. Gene expression differences distinguish cancer-affected from developmental delay patients with 17p13.1 deletions. We used Affymetrix Exon arrays to look for gene expression differences in available blood-derived RNA. We first evaluated which of the 24 genes in our critical region (commonly deleted in patients with DD) are significantly under- or over-expressed. (a) Gene expression values (x axis), expressed as a fold change relative to controls, are shown for all 24 genes (circles) in two individuals harboring a small 17p13.1 CNV. The p value of each gene’s expression change is indicated (y axis, in reverse order). Grey gridlines delineate regions of the plot containing significantly under-expressed genes (top left; p<0.01 and <-1.5 fold change) or significantly over-expressed genes (top right; p<0.01 and >1.5 fold change). As shown in red, amongst patients with a small 17p13.1 CNV only TP53’s expression is significantly changed (p = 6.82 x 10^-3; fold change = -1.85797). (b) Notably, a similar analysis of RNA from a DD patient did not show TP53 under-expression, despite the gene being fully deleted in a large 17p13.1 CNV. There are, however, six significantly changed genes (all under-expressed). As shown in red these are: TRAPPC1 (p = 2.90 x 10^-5, fold change = -2.34146), FXR2 (p = 3.47 x 10^-3, fold change = -1.62447), LSMD1 (p = 4.88 x 10^-3, fold change = -2.22006), KDM6B (p = 6.98 x 10^-3, fold change = -6.3709), CYB5D1 (p = 8.80 x 10^-3, fold change = -1.60553) and MPDU1 (p = 9.78 x 10^-3, fold change = -1.76825). (c) A similar analysis and plot are shown for a patient with a large 17p13.1 CNV and DD, but here the expression and significance of all genes are shown. TRAPPC1 (red) was found to be the most significantly under-expressed gene in the transcriptome.


3.5 Discussion

LFS is a highly penetrant cancer susceptibility syndrome that disproportionately affects the young. Children with germline TP53 mutations have a 20% risk of developing cancer by 15 years of age and, over a lifetime, a 73% to 100% risk76. However, the four DD patients in this report, ranging in age from 3.17 to 33.42 years, are not affected with cancer despite harboring complete deletions of TP53. Other case reports highlight an additional six patients with 17p13.1 deletions296-299, none of whom is affected with cancer. While these reports support our contention that DD-associated deletions involve reduced cancer risk, it is premature to discount the possibility that these patients have a high risk of developing cancers due to somatic TP53 mutation, which may become manifest only at later ages.

The molecular basis for this apparent absence or reduction of cancer risk remains to be elucidated. Studies in mouse models of LFS as well as the somatic mutation spectra of TP53 in human cancers provide evidence that tumorigenesis is accelerated when TP53 is altered by point mutations or short insertions/deletions, rather than completely lost. In contrast, a number of nonsense mutations that predict total absence of TP53 protein expression are strongly associated with cancer. It should be noted that in the cancer-prone patients described here the deletions do not include exon 1 and the long intron 1. It is possible that sequences in the latter region contribute to the regulation of TP53 suppressor function. In particular, the proximal region of intron 1 contains sequences encoding a natural antisense transcript of TP53, WRAP53, which regulates endogenous TP53 mRNA by targeting the 5' untranslated region of TP53 mRNA5. The exact role of this sequence in predisposition to cancer deserves further study. Notwithstanding this caveat, we show here that mRNA expression levels of TP53 and TP53-dependent genes are altered in patients with partial, but not complete, deletions, consistent with mutant TP53-initiated tumorigenesis in the former group but not the latter. In contrast, the neurocognitive delay phenotype is characterized by the dysregulation of a different set of genes at 17p13.1, including MPDU1, FXR2, EFNB3 and, in particular, TRAPPC1, which is also the most significantly under-expressed gene in the transcriptome.

We designed a tiling array to determine accurate breakpoints of CNVs at 17p13.1, a locus that we also show to be responsible for a novel congenital syndrome. By achieving basepair resolution we gained insight into the genomic basis of this dysmorphology syndrome, including the precise determination of deletion length and gene content, the definition of a critical region, and the recognition of a shared mechanism of CNV formation in multiple probands. Alu retrotransposons are nearly ubiquitous at 17p13.1 breakpoints, which is highly suggestive of Alu-mediated non-allelic homologous recombination300. Alus are the most abundant mobile elements in the human genome and have been implicated in a number of diseases such as neurofibromatosis and breast cancer301,302. Somatically acquired rearrangements are common in cancer, and it has been shown that regions with high levels of Alus are more susceptible to recombination in tumors300,303. Disruptions of TP53, by somatic mutation or loss of heterozygosity, are a virtual prerequisite for transformation of incipient cancer cells. While the breakpoint resolution achieved in our study has yet to be examined in many cancer samples, at least one report has demonstrated that Alus can indeed mediate somatic rearrangements at TP53304. Future studies in cancer115 will determine if Alu-mediated recombination at 17p13.1 is as widespread in tumors as we show it to be in the germline.

This report adds 17p13.1 deletions, which result in two seemingly distinct phenotypes, to the list of disease loci associated with Alus. As more CNV-associated disorders are discovered, it will be intriguing to consider whether other loci in the genome also give rise to phenotypically distinct disorders by means of a common mechanism.

Our work re-confirms that TP53 mutations alone lead to LFS. Supporting this notion, all TP53 deletions reported to date are, to our knowledge, small (<50 Kb)242,243,305 and, except for one particularly complex Alu-mediated 45 Kb rearrangement242, involve only partial deletion of the gene. While the cancer-specific susceptibility of LFS is well recognized, we show that 17p13.1 deletions are associated with a novel contiguous deletion syndrome involving a recognizable phenotype with developmental delay, hypotonia and hand/foot abnormalities.

Furthermore, we demonstrate that a high-resolution array platform improves detection of previously unrecognized microdeletions, suggesting it could provide a valuable tool in the molecular diagnosis of TP53 wild type LFS and patients with cognitive delay phenotypes.


3.6 Supplementary tables and figures

Supplementary Table 1 Initial sample ascertainments

No.  Hospital                                Patients   Patients with            Reason for screen                            Method
                                             screened   17p13.1 microdeletion
1    The Hospital for Sick Children          230        2                        Suspected LFS, TP53 sequencing wild type     MLPA
2    Stanford University School of Medicine  400        1                        Many                                         Array CGH (44K array)
3    Emory University School of Medicine     3,374      1                        Many                                         Array CGH (EmArray Cyto6000)
4    Children’s Hospital of Philadelphia     487        1                        Neurological or developmental issues         Illumina Hap550 BeadChip
5    Hospital A.C. Camargo                   13         2                        Suspected LFS, TP53 sequencing wild type     MLPA
6    University Health Network, Toronto      20         1                        Presence of multiple issues, including       Array CGH (Gene DX)
                                                                                 dysmorphologies, congenital delay or
                                                                                 learning disabilities
     Total:                                  4,524      8

Supplementary Figure 1 Copy number of ATP1B2 and WRAP53, TP53’s neighboring genes. To determine the extent of 17p13.1 CNVs, probes were designed in TP53’s neighboring genes, ATP1B2 and WRAP53, and assessed by qPCR. Probes were located in the exon closest to TP53 (red and green asterisks). While all patients harbored deletions at TP53, only the developmental delay patients’ deletions included ATP1B2 (telomeric) and WRAP53 (centromeric). Both ATP1B2 and WRAP53 were diploid in all cancer patients (mean copy number = 2.03 and 2.11, respectively). However, all patients with DD were hemizygously deleted for both flanking genes (mean copy number = 0.87 [ATP1B2] and 1.13 [WRAP53]), a significant reduction as compared to the cancer patients (p = 2.90 x 10^-4 [ATP1B2] and 2.42 x 10^-8 [WRAP53]).


Supplementary Figure 2 - Pedigrees

Patient 2723 - DD

Patient 3026 - DD

Patient 3148 - DD

Patient 3354 - DD

Patient 3332 - Cancer

Patient Y47 - Cancer

Patient Y20 - Cancer

Patient 2760 - Cancer

Supplementary Figure 3 Design of ultra high-resolution array. (a) 40,577 oligonucleotide probes were placed on the short arm of chromosome 17, in which an 8 Mb target region around TP53 (red) was covered at ultra high-resolution. Of the 38,061 probes in the target region, 15,762 (41%) were designed on exons. Exonic probes were overlapping and tiled across all exons, of all alternative splice variants, for every gene (183) in the target region. The remaining probes in the target region (22,389) were placed in intronic or non-genic regions at a resolution of one probe per 350 nucleotides. An additional 2,426 probes were placed in the border regions (blue) from the target region to the telomere and centromere. The border regions were covered at low resolution with 1 probe per 1 Kb, in 50 Kb regions. The following 183 genes are covered at tiling resolution by this novel platform: Telomere-…-TAX1BP3-TMEM93-P2RX5-ITGAE-GSG2-C17orf85-CAMKK1-P2RX1-ATP2A3-ZZEF1-CYB5D2-ANKFY1-UBE2G1-SPNS3-SPNS2-MYBBP1A-GGT6-SMTNL2-ALOX15-PELP1-ARRB2-MED11-CXCL16-ZMYND15-TM4SF5-VMO1-GLTPD2-PSMB6-PLD2-MINK1-CHRNE-LOC100130311-GP1BA-SLC25A11-RNF167-PFN1-ENO3-SPAG7-CAMTA2-INCA1-KIF1C-GPR172B-ZFP3-ZNF232-USP6-ZNF594-C17orf87-RABEP1-NUP88-RPAIN-C1QBP-DHX33-DERL2-MIS12-NLRP1-WSCD1-AIPL1-FAM64A-PITPNM3-KIAA0753-TXNDC17-MED31-C17orf100-SLC13A5-XAF1-FBXO39-TEKT1-ALOX12P2-ALOX12-RNASEK-C17orf49-BCL6B-SLC16A13-SLC16A11-CLEC10A-ASGR2-ASGR1-DLG4-ACADVL-DVL2-PHF23-GABARAP-DULLARD-C17orf81-CLDN7-SLC2A4-YBX2-EIF5A-GPS2-NEURL4-ACAP1-KCTD11-TMEM95-TNK1-PLSCR3-C17orf61-NLGN2-SPEM1-C17orf74-TMEM102-FGF11-CHRNB1-ZBTB4-AMAC1L3-POLR2A-TNFSF12-TNFSF12/TNFSF13-TNFSF13-SENP3-EIF4A1-SNORA48-SNORD10-SNORA67-CD68-MPDU1-SOX15-FXR2-SHBG-SAT2-SHBG-ATP1B2-TP53-WRAP53-EFNB3-DNAH2-RPL29P2-KDM6B-TMEM88-LSMD1-CYB5D1-CHD3-SCARNA21-LOC284023-KCNAB3-TRAPPC1-CNTROB-GUCY2D-ALOX15B-ALOX12B-ALOXE3-HES7-PER1-VAMP2-TMEM107-C17orf59-AURKB-C17orf44-C17orf68-PFAS-SLC25A35-RANGRF-ARHGEF15-ODF4-LOC100128288-KRBA2-RPL26-RNF222-NDEL1-MYH10-CCDC42-SPDYE4-MFSD6L-PIK3R6-PIK3R5-NTN1-STX8-WDR16-USP43-DHRS7C-GLP2R-RCVRN-GAS7-MYH13-MYH8-MYH4-MYH1-MYH2-MYH3-SCO1-C17orf48-TMEM220-PIRT-FLJ45455-DNAH9-…-Centromere. (b) Shown are the positions of probes (black squares and rectangles) across the region of 17p13.1 containing TP53. All exons (solid blue boxes), introns (dashed) and alternative transcripts of TP53 are covered. Our array’s coverage is contrasted with that of the Affymetrix 6.0 and Illumina 1M Duo microarrays. All genes within the target region have the same coverage as TP53, which is shown here to demonstrate the resolution of the platform in genic regions.


Supplementary Figure 3A

Supplementary Figure 3B


Supplementary Figure 4 A complex event near 17p13.1 deletion breakpoint

By ultra high-resolution array, an additional deletion was observed in one patient. This secondary deletion, which is proximal to the primary deletion and is also hemizygous, is not a polymorphic CNV: it was not seen in other hybridizations using this platform, and it is absent from the Database of Genomic Variants177 and from the ultra high-resolution data released by the Genome Structural Variation Consortium285. The secondary deletion contains two genes: EFNB3 and DNAH2.


Chapter 4
4 TP53 alterations determine clinical subgroups and survival of patients with choroid plexus tumors

This chapter has been published and is reproduced with permission from the Journal of Clinical Oncology

Tabori U*, Shlien A*, Baskin B, Levitt S, Ray P, Alon N, Hawkins C, Bouffet E, Pienkowska M, Lafay-Cousin L, Gozali A, Zhukova N, Shane L, Gonzalez I, Finlay J, Malkin D. TP53 alterations determine clinical subgroups and survival of patients with choroid plexus tumors. J Clin Oncol. 2010 Apr 20;28(12):1995-2001. Epub 2010 Mar 22.

* co-first authors


4.1 Abstract:

Background: Choroid plexus carcinomas are pediatric tumors with poor survival and a strong, but poorly understood, association with Li-Fraumeni syndrome (LFS). Currently, with lack of biological predictors, most children are treated with aggressive chemo-radiation protocols.

Methods: We established a multi-institutional tissue and clinical database, enabling the analysis of specific alterations of the TP53 tumor suppressor and its modifiers in choroid plexus tumors (CPT). We conducted high-resolution copy-number analysis to correlate these genetic parameters with family history and outcome.

Results: We studied 64 CPT patients. All individuals with germline TP53 mutations fulfilled LFS criteria, while all patients not meeting these criteria harbored wild-type TP53 (p<0.0001). TP53 mutations were found in 50% of CPC. Additionally, two sequence variants known to confer TP53 dysfunction, TP53 codon72 and MDM2 SNP309, co-existed in the majority of TP53 wild-type CPCs (92%) and not in TP53 mutated CPC (p=0.04), suggesting a complementary mechanism of TP53 dysfunction in the absence of a TP53 mutation. High-resolution SNP array analysis revealed extremely high total structural variation (TSV) in TP53 mutated CPC tumor genomes compared to TP53 wild-type tumors and choroid plexus papillomas (p=0.006 and 0.004, respectively). Moreover, high TSV was associated with significant risk of progression (p=0.0005). Five-year survival for TP53 immuno-positive and -negative CPC were 0% and 82+/-9% respectively (p=0.0006). Furthermore, 14/16 patients with TP53 WT CPC are alive without having received radiation therapy.

Conclusions: CPC patients with low tumor TSV and absence of TP53 dysfunction have a favorable prognosis and can be successfully treated without radiation therapy.

4.2 Introduction

Li-Fraumeni syndrome (LFS) is the prototype cancer predisposition syndrome306. LFS individuals harbor germline mutations in the TP53 tumor suppressor gene resulting in a remarkably heterogeneous phenotype of early onset cancers49. While the lifetime cancer risk for TP53 mutation carriers approaches 75% in males and 93% in females, almost 40% of individuals will develop cancer before 21 years of age. These childhood cancers include bone and soft-tissue sarcomas, adrenocortical carcinomas and brain tumors145. Although much is known about the role of TP53 in cancer and the overall risk of LFS patients to develop specific tumor types, information is lacking regarding the significance of germline and somatic TP53 mutations in the risk stratification and management of patients with LFS-related tumors.

Choroid plexus tumors (CPT) are particularly relevant to understanding the biological importance of mutant TP53. CPT are intraventricular neoplasms of epithelial origin affecting primarily young children307. CPTs are subclassified as choroid plexus carcinoma (CPC, WHO grade III), choroid plexus papilloma (CPP, WHO grade I) and recently described atypical CPP (WHO grade II)165. CPT subtypes show a wide spectrum of clinical outcomes. While long term survival for CPP is extremely high after surgical resection alone, CPC exhibit an unpredictable course with less than 50% survival in most reports308. Analysis of data submitted to a recently established international registry confirmed the important roles for resection309, postoperative chemotherapy310 and radiation therapy311 in improving survival for these patients. These studies, however, raise new dilemmas since most of these children are younger than 3 years of age and the long term detrimental effects of irradiation upon growth and the developing brain cannot be overemphasized. This highlights the need for better biological risk stratification for these young children.

Because CPT is relatively uncommon (1-4% of pediatric brain tumors308), existing biological knowledge relies on a few case reports and small series. In order to discover pathways which control tumor progression and initiation, high quality tissue for molecular analysis is required. Since CPC is commonly found in LFS families and mutations in TP53 are associated with resistance to chemotherapy and irradiation312,313, we set out to interrogate TP53 in CPC. The findings we report here, generated from a large multi-institutional cohort of CPTs, provide a framework to determine both the risk of developing CPC as well as to identify novel prognostic markers for this disease. Furthermore, we demonstrate how somatic TP53 status can be used to inform management decisions and how germline TP53 status can provide genetic risk information for other family members. Our approach may also be of value in determining biological and clinical relevance of TP53 for other rare neoplasms in which small numbers preclude the use of traditional randomized prospective trials.

4.3 Patients and Methods

4.3.1 Samples and Clinical Data

After institutional research ethics board approval was obtained, clinical data, slides, blood and tumor samples were collected for 64 patients from three sources, including two large pediatric neuro-oncology centers - The Hospital for Sick Children (SickKids) in Toronto and the Children’s Hospital of Los Angeles (CHLA) - and the Collaborative Human Tissue Network (CHTN) in Columbus, OH (Table 1). Of these 64 patients, eight had only blood samples available and two had only slides for immunostaining (Supplementary Figure 1). Fifty-four tumor tissue samples (36 CPC and 18 CPP) were collected. Blood was available from 18 CPC and 6 CPP patients. Fresh-frozen tissue was available from 25/36 CPC and 18/18 CPP. Where frozen tissue was not available, we extracted DNA from formalin-fixed, paraffin-embedded (FFPE) tumors. As a tumor-derived DNA control, used for single nucleotide polymorphism (SNP) frequency analysis, we collected 50 TP53 wild-type (WT) medulloblastoma samples from SickKids, since this early-onset brain tumor type is known not to be TP53 driven. DNA extracted from peripheral blood lymphocytes of cases and TP53 WT healthy controls (n=50) was collected from SickKids and CHLA.

For outcome analysis, we collected all relevant demographic, clinical and treatment data from patients and families managed at SickKids and CHLA. Clinical data were not available from CHTN. Pathological specimens were reviewed independently at both neuro-oncology centres.

4.3.2 Sequencing of Genomic Tumor and Constitutional DNA

Exons 2-11 of the TP53 gene, as well as up to 50 bases into the flanking introns, were sequenced from DNA extracted from blood and frozen tissues as previously described189. Sequencing of TP53 from FFPE tissue was performed using a modified set of primers314. For analysis of the TP53 codon 72 Arg/Pro and the MDM2 309T>G SNPs, direct sequencing was performed and confirmed by restriction fragment length polymorphism (RFLP) analysis using BstUI and MspA1I restriction enzymes, respectively.

4.3.3 DNA microarray analysis

Genomic DNA samples derived from fresh-frozen tumor samples (n=36) and, when available, blood (n=19) were genotyped on Affymetrix Genome-Wide Human SNP 6.0 arrays. DNA was quantified using a NanoDrop spectrophotometer (NanoDrop, DE) and its quality was assessed by agarose gel electrophoresis. High molecular weight DNA was digested using StyI and NspI enzymes, ligated, PCR amplified, purified, quantitated, fragmented, labeled and hybridized as per the protocol315.

DNA copy number analysis was performed on normalized probe intensities. Regions of deletion or duplication were determined using a genomic segmentation algorithm (Partek, MO) such that neighboring regions are significantly different from each other and contain 10 or more probes.

Copy number alterations (CNAs) are structurally variable regions in genomic DNA, defined as being larger than 1 kilobase in size, in which copy number differences exist between two or more tumor genomes. We used total structural variation (TSV), defined as the product of the number of CNAs and their average size, as a measure of the extent of genomic instability in each tumor sample.
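The TSV definition above can be illustrated with a minimal sketch. The `Segment` structure, the function name and the example coordinates are hypothetical and are not taken from the analysis pipeline used in this study; the only rules encoded are the ones stated in the text (CNAs larger than 1 kilobase, TSV as the number of CNAs multiplied by their average size).

```python
from dataclasses import dataclass

MIN_CNA_SIZE_BP = 1_000  # CNAs are defined in the text as larger than 1 kilobase


@dataclass(frozen=True)
class Segment:
    """One altered segment from a segmentation run (hypothetical structure)."""
    chrom: str
    start: int          # start coordinate, inclusive
    end: int            # end coordinate, exclusive
    copy_number: float  # estimated copy number of the segment

    @property
    def size(self) -> int:
        return self.end - self.start


def total_structural_variation(segments):
    """Return TSV = (number of CNAs) x (average CNA size), in base pairs."""
    cnas = [s for s in segments if s.size > MIN_CNA_SIZE_BP]
    if not cnas:
        return 0.0
    mean_size = sum(s.size for s in cnas) / len(cnas)
    return len(cnas) * mean_size


# Invented example segments, not patient data.
example = [
    Segment("chr1", 1_000_000, 1_250_000, 1.0),     # 250 kb loss
    Segment("chr17", 7_400_000, 7_900_000, 1.0),    # 500 kb loss
    Segment("chr12", 67_000_000, 67_000_500, 4.0),  # 500 bp, below the 1 kb cutoff
]
print(total_structural_variation(example))
```

Because the number of CNAs multiplied by their mean size equals the sum of their sizes, TSV under this definition is the total number of base pairs covered by CNAs in a sample.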

4.3.4 Immunohistochemistry

FFPE sections were subjected to immunohistochemical analysis and the results were scored independently by pathologists (CH, IG) in the two institutions who were blinded to the sequencing results. Slides were stained for BAF47 (INI-1) to exclude atypical teratoid/rhabdoid tumor (ATRT). Immunohistochemical staining for TP53 (nuclear) was reviewed and graded for both strength (0-none, 1-weak, 2-strong) and distribution (<25%, 25–50%, >50% of tumor cells), as described316. The reviewers were blinded to clinical data at the time of grading. Only tumors showing strong staining (score 2) in >50% of tumor cells were considered positive.

4.3.5 Statistical Analysis

Overall and progression-free survival were estimated using the Kaplan-Meier method and significance testing (α = 0.05) performed on the basis of the log-rank test. Correlation between parameters was assessed using the Pearson χ2 and Fisher exact test when applicable. Data were analyzed using SPSS 15.0 (SPSS, Chicago). TSV frequencies were compared by two-tailed, independent-samples t tests. Levene’s test for equality of variances was used to determine equal variances.
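For readers unfamiliar with the Kaplan-Meier method, the product-limit estimate can be sketched in a few lines. This is an illustration only: the analyses in this chapter were run in SPSS, the function name is hypothetical, the follow-up times below are invented, and the log-rank comparison between groups is not shown.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times  -- follow-up time for each patient
    events -- True if the event (e.g. death) was observed, False if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at the same time t
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if deaths:
            # Product-limit step: multiply by the fraction surviving time t
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set after t
        at_risk -= leaving
    return curve


# Invented follow-up data (years), not patient data.
follow_up = [1, 2, 2, 3, 5]
died = [True, True, False, True, False]
for t, s in kaplan_meier(follow_up, died):
    print(f"t={t}: S(t)={s:.2f}")
```

Censored patients contribute to the number at risk up to their last follow-up but never reduce the survival estimate themselves, which is why the method suits cohorts with variable follow-up such as this one.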

4.4 Results

Overall, we studied 64 CPT patients including 22 CPPs and 42 CPCs (Supplementary Figure 1). Complete clinical, treatment and outcome data were available for 26 patients.

4.4.1 TP53 Mutations in Choroid Plexus Tumors

Somatic TP53 mutations were detected in 18/36 (50%) CPCs, with identical frequencies found in tumors from all three institutions (Table 1). All mutations were in exons within the DNA binding domain110 (residues 102-292). Fifteen of eighteen (83%) mutations were homozygous and all displayed homozygous genotypes at a nearby SNP (codon 72, rs1042522), suggesting loss of heterozygosity. In contrast, only one of 18 (5%) CPPs harbored a somatic TP53 mutation (p=0.001); this was a heterozygous mutation. All patients with somatic heterozygous TP53 mutations are alive.


Table 1: Somatic and germline TP53 mutation frequencies in study population:

SickKids CHLA CHTN Total

CPC tumor 18 10 8 36

WT 9 5 4 18

Mutated 9 5 4 18

CPC germline 15 3 -- 18

WT 9 1 -- 10

Mutated 6 2 -- 8

CPP tumor 2 10 6 18

WT 2 9 6 17

Mutated 0 1 0 1

CPP germline 6 -- -- 6

WT 6 0 -- 6

Mutated 0 0 -- 0

Germline controls WT 50 50

Medulloblastoma controls WT 50 50

SickKids - The Hospital for Sick Children, Toronto; CHLA - Children’s Hospital of Los Angeles; CHTN - Collaborative Human Tissue Network, Columbus, OH; WT - TP53 wild-type; CPC - choroid plexus carcinoma; CPP - choroid plexus papilloma.


4.4.2 Germline TP53 Status Correlates with LFS Criteria

Germline TP53 analysis was performed on 24 patients (18 CPC and 6 CPP). TP53 mutations were found in 8/18 CPC patients (Table 2 and Supplementary Table 1). All patients with germline TP53 mutations fulfilled the criteria for LFS or Li-Fraumeni-like syndrome (LFS-L), either by family cancer history or by the presence of multiple LFS tumors in the affected individual, while all CPC patients not meeting either set of criteria harbored a WT TP53 genotype (p<0.0001). None of the CPP patients harbored a germline TP53 mutation. Since a germline TP53 mutation is highly unlikely in the setting of a somatic WT TP53 genotype, combining tumor and germline analysis revealed that 0 of 21 CPP patients harbored a germline TP53 mutation.

Table 2: Frequency of germline TP53 mutations in choroid plexus tumor patients.

                      Germline TP53 mutation   Germline WT TP53   P-value
LFS associated CPC    8                        0                  ---
Sporadic CPC          0                        10                 P<0.0001*
CPP                   0                        6                  P=0.0003**

* Sporadic CPC as compared to LFS associated CPC. **CPP as compared to LFS associated CPC.
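The P-values in Table 2 can be checked from the counts alone. The sketch below assumes the Fisher exact test (one of the two tests named in the Statistical Analysis section) was the one applied to these 2x2 comparisons; the function name is hypothetical and the implementation is a plain hypergeometric sum rather than the SPSS routine used in the study.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_obs * (1 + 1e-9)
    )


# Rows: LFS associated CPC vs comparison group; columns: mutated vs WT germline TP53
p_sporadic = fisher_exact_two_sided(8, 0, 0, 10)  # approximately 2.3e-05
p_cpp = fisher_exact_two_sided(8, 0, 0, 6)        # approximately 3.3e-04
print(f"{p_sporadic:.2e} {p_cpp:.2e}")
```

Under this assumption the two results agree with the reported P<0.0001 and P=0.0003.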

4.4.3 Specific Genotypes Correlate with CPC Subtypes

In order to delineate the role of genetic modifiers of TP53 in CPT tumor initiation, especially in tumors not harboring TP53 mutations, we analyzed the frequency of the TP53 codon 72 variant, which encodes either proline (TP53-P72) or arginine (TP53-R72), and the MDM2 SNP309 polymorphism in these tumors (Supplementary Table 2). The combination of TP53-R72 and MDM2 SNP309 is associated with reduced TP53 activity and therefore may allow for malignant transformation317. The TP53-R72 allele was significantly more frequent in TP53 WT tumors than in the healthy population (p=0.001, Supplementary Table 2). Furthermore, only 1/18 (6%) TP53 WT CPC was homozygous for the TP53-P72 allele while 8/18 (45%) TP53 mutated CPC harbored this allele (p=0.01). We compared the frequency of the combined TP53-R72 and MDM2 SNP309 genotype in the CPC cohort with two control groups: 50 healthy individuals and 50 TP53 WT brain tumors (medulloblastoma318,319). The combination of TP53-R72 and MDM2 SNP309 was observed in the majority (92%) of TP53 WT CPCs but not in healthy controls, in the medulloblastoma patients, or in CPCs harboring a TP53 mutation (p=0.04, Supplementary Table 2), highlighting the role of quantitative TP53 dysfunction in tumor initiation of CPC lacking TP53 mutations.

Somatic MDM2 amplifications, a common event in cancer320, were observed in 75% of TP53 mutated CPCs but in only 25% of TP53 WT CPCs (p=0.04). Furthermore, focal MDM2 amplifications were observed only in TP53 mutated CPCs (3/12) and in none of the TP53 WT CPCs (0/8) or CPPs (0/16).

4.4.4 Somatic Total Structural Variation Differentiates Tumor Subtypes

Germline DNA TSV is an important contributor to genetic diversity and we have previously reported an excess of TSVs in carriers of TP53 germline mutations189. Here, we tested whether somatic TSV scores could predict tumor type. CPCs harbored significantly more somatic TSV than CPPs (p=0.004, Figure 1A). Furthermore, TP53 mutant tumors harbored significantly higher TSVs than TP53 WT CPCs (p=0.006, Figure 1B), indicating higher genomic instability in these tumors. Further detailed analysis of alterations such as deletions and amplifications and global vs. focal aneuploidy revealed higher focal alterations in CPC compared to CPP. TP53 mutated CPC exhibited higher global aneuploidy accompanied by significantly more deletions and amplifications than was observed in TP53 WT CPC (Supplementary Figure 3).

Strikingly, individual analysis of TSV in tumors where clinical data were available revealed that TSV in tissue from primary CPCs which eventually recurred was 3-4 times higher than in CPCs which did not recur (p=0.0005, Figure 1C and 1D).

Figure 1A: bar chart of mean total structural variation (y-axis, 0 to 1,000,000,000) for CPP and CPC; P=0.004. Error bars: +/- 1 SE.

Figure 1B: bar chart of mean total structural variation (y-axis, 0 to 1,200,000,000) for P53 WT CPC and P53 mutated CPC; P=0.006. Error bars: +/- 1 SE.

Figure 1C: bar chart of mean total structural variation (y-axis, 0 to 1,200,000,000) for non-recurrent and recurrent CPC; P=0.0005. Error bars: +/- 1 SE.

Figure 1D

Figure 1: Total somatic variation changes in choroid plexus tumors. A. TSV stratified by tumor type revealed higher TSV in CPC vs. CPP (p=0.004). B. CPC stratification by TP53 mutation revealed higher TSV in TP53 mutated tumors (p=0.006). C. Stratification of CPC by outcome revealed significantly higher TSV in tumors of patients who had tumor recurrence (p=0.0005). D. Genome-wide copy number levels differ substantially between non-recurrent CPCs (top) and recurrent CPCs (bottom). Copy number heat maps are colored from dark blue (zero copies) to dark red (four copies). A representative genome-wide copy number plot (zero to five copies) is also shown below the heatmap.


4.4.5 TP53 Dysfunction Predicts Outcome in CPC

In order to analyze the role of TP53 in survival of CPC patients, we collected clinical and biological data at SickKids and CHLA. Initial pathology review included immunostaining for TP53 and INI-1. INI-1 staining was performed to exclude ATRT, which is occasionally misdiagnosed as CPC. Of 28 tumors, two had negative INI-1 staining and upon histopathologic review their diagnoses were changed to ATRT. Of the 26 CPC, TP53 staining was positive in 10 (38%). There was a high correlation between TP53 immunopositivity and presence of somatic TP53 mutations (p=0.0005, Supplementary Table 3).

Five-year overall survival was 82+/-9% and 0% for TP53 immuno-negative and immuno-positive tumors, respectively (p=0.0006, Figure 2A). These findings were similar in each centre when examined independently. TP53 mutation analysis was performed on 20 tumors. Five-year overall survival was 100% and 22+/-17% for TP53 WT and mutant tumors, respectively (p=0.0008, Figure 2B).

Detailed analysis of presentation and treatment protocols revealed that metastatic status was not different between TP53 immuno-positive and immuno-negative tumors (3/10 and 4/16 tumors, respectively). In our cohort, gross total resection was achieved in 13 patients and did not correlate with TP53 status or survival. Five patients received radiation therapy (2 at initial treatment and 3 more at recurrence). All of these patients had TP53 immuno-positive tumors and succumbed to their disease. All 16 TP53 immuno-negative tumors were treated with regimens that excluded radiation therapy. Of this group, 14/16 (87%) are still alive with a mean follow up time of 10.2 years (2.4-20 yrs). Moreover, the two non-survivors died at, or shortly after, surgery and did not receive additional therapy. Of the 14 survivors, four had metastatic disease at diagnosis and four developed tumor recurrence and received further surgery and chemotherapy, still without irradiation.


Figure 2A


Figure 2B

Figure 2: Overall survival for choroid plexus carcinoma. A. Tumors stratified by TP53 immunostain. B. Tumors stratified by TP53 mutation. WT-wild-type; Pos-positive; Neg-negative.


4.5 Discussion

The comprehensive study of rare cancer susceptibility syndromes has led to discovery of scores of genes associated with predisposition. In turn, all of these genes have been shown to play important biological roles in somatic malignant cellular transformation. Exploration of genotype:phenotype associations has led to genetic screening programs to identify at-risk individuals321-324, but the specific role of germline alterations in cancer susceptibility genes in determining clinical outcome and response to treatment is not well understood. This is in part due to the lack of sufficient numbers of patients to answer these complex but important questions.

Here, using comprehensive clinical databases and tissue repositories together with high-resolution whole-genome array platforms, we shed new light on the role of the TP53 tumor suppressor. Specifically, we show that TP53 status determines the phenotype of a common pediatric epithelial LFS component tumor, providing a powerful indicator of clinical outcome and response to therapy.

First, our data suggest that although CPC is a frequent tumor in LFS, most children with sporadic CPC do not harbor germline TP53 mutations. In the absence of LFS or LFS-L clinical criteria such as family history or multiple tumors in the index case, the likelihood of harboring a germline TP53 mutation is very low (Table 2). This is in contrast to another pediatric tumor, adrenocortical carcinoma, in which more than 85% of children affected by this disease will have a germline TP53 mutation regardless of other LFS criteria142,146,325. Since testing for TP53 is not available in all centres, this finding may have important implications in the management of these patients and family members. Accurate family history and genetic counseling should be considered for all children with CPC.

Second, we show that TP53 alterations play a significant role in choroid plexus tumorigenesis. Somatic TP53 mutations were observed in 50% of CPC and in only 5% of CPP. Moreover, 94% of the TP53 WT CPC carried the Arg/Arg TP53 codon 72 SNP genotype. This finding highlights the role of modifiers in the TP53-associated phenotype as this genotype is associated with deficient cell cycle arrest326. Taken together, our observations imply that either qualitative (TP53 mutations) or quantitative (specific SNPs resulting in reduced TP53 levels and MDM2 amplification) alterations of the TP53 pathway are required for CPC formation.

We have previously shown that germline TP53 mutations are associated with higher genomic instability as manifested by excess germline DNA TSVs in LFS TP53 carriers189. Here we extend these findings to patients with a specific tumor type - CPC. TP53 mutated CPC exhibited higher TSV than TP53 WT tumors or CPP (Figure 1). Furthermore, CPCs, and in particular CPCs with TP53 mutations, harbored an excess of focal amplifications and deletions as well as chromosome gains and losses (Supplementary Figure 2 and 3). The striking correlation between higher TSV in tumors and their ability to recur (Figure 1C and 1D) may have broad implications on our understanding of the role of genome integrity and response to cancer therapy. Moreover, this observation can be used to develop a prognostic tool to tailor therapy for these young children. We have reported here the aberrations in global genomic TSV and are separately pursuing studies on the correlation of specific loci with choroid plexus carcinogenesis.

As with other forms of genomic instability, this karyotypic aneuploidy is most pronounced in tissues carrying TP53 mutations. These findings suggest that genomic instability, common in TP53 dysfunctional tumors, can be traced to the germline of these patients. Since TP53 is altered in most cancers, it is tempting to speculate that genomic instability, as measured by aberrant DNA copy number, can be used as a screening tool for cancer risk in the general population.

The third key finding of the study is the role of TP53 dysfunction in determining response to therapy and outcome of these patients. TP53 immunopositivity, a measure of TP53 dysfunction, is an excellent screening tool for the presence of TP53 mutations (Supplementary Table 3) and predicts survival in these patients. Furthermore, all patients whose tumors were immuno-negative and who survived the postoperative period are now long-term survivors without the need for radiation therapy. This finding was confirmed by TP53 mutation analysis of available tumors, which revealed 100% survival for patients with TP53 WT tumors. This has immediate clinical implications since radiation therapy is widely recommended for all CPC patients regardless of age. Avoiding radiation therapy in patients with favorable biology CPC will have a profoundly positive impact on the developing brain, endocrine function, growth potential and the risk for secondary malignancy.

Based on the findings presented here, we suggest a new approach to the management of children and family members with CPC (Figure 3). At surgery, comprehensive family and clinical histories should be taken. If this information suggests the possibility of LFS, blood should be sent for TP53 sequencing and MLPA210 analysis. If the tumor is TP53 immuno-positive, mutation analysis of the tumor is recommended. Negative staining and/or WT TP53 in the tumor would indicate that the patient should be offered an irradiation-sparing protocol. If the tumor is TP53 mutated, blood for germline TP53 sequencing should be sent. Since survival is poor for patients with TP53 dysfunctional tumors, novel therapies should be explored for these patients.

In summary, this collaborative effort using precision genetic and genomic analyses uncovered important clinical and biological findings that influence the management of a rare tumor and a striking cancer predisposition syndrome. This approach, specifically the predictive role of germline and somatic genomic instability in tumor subgroups and response to treatment, can be expanded to other common cancers.


Figure 3

Figure 3: Management of patients with newly diagnosed choroid plexus carcinoma. Bold-lined boxes indicate decision-critical data that should be gathered. Li-Fraumeni syndrome should be suspected in the case of a compatible family history, multiple tumors, or a tumor positive for a TP53 mutation. Germline analysis of TP53 is warranted through genetic counseling. If positive, other family members should undergo genetic counseling and be screened321. The patient should be screened for other LFS tumors routinely.


4.6 Supplementary tables and figures

Supplementary Table 1: TP53 germline mutations in choroid plexus tumor patients

No.  Sample ID  Pathology  TP53 mutation      Exon  Other primary tumor / Family history of cancer
1    CP2B1      CPC        R158H (c.473G>A)   5     X
2    CP5B2      CPC        H193P (c.578A>C)   6     X
3    CP6B1      CPC        R175H (c.524G>A)   5     X
4    CP10B1     CPC        R248Q (c.743G>A)   7     X
5    CP37B1     CPC        R213Q (c.638G>A)   6     X X
6    CP40B1     CPC        R248Q (c.743G>A)   7     X
7    CP41B1     CPC        S241Y (c.722C>A)   7     X
8    CP46B1     CPC        S241Y (c.722C>A)   7     X


Supplementary Table 2: Genotype-phenotype correlation of TP53 codon 72 and MDM2 SNP309 polymorphisms.

Group                Codon72 allele    p-value   Combined combination*   p-value
                     Arg      Pro                (normal/Total)
Healthy Control      42       48       ---       19/50                   ---
CPC Mutated          17       19       1         6/13                    1
CPC WT               28       8        0.001     1/12                    0.04
CPP WT               20       10       0.08      3/5                     1
Medulloblastoma WT   58       42       0.2       12/24                   1

* Normal MDM2 SNP309 allele with at least one Proline codon 72 allele.


Supplementary Table 3: Correlation between TP53 immunostain and TP53 mutational status.

                            TP53 mutation   Negative TP53 stain   Positive TP53 stain
CPC (n=24)                  WT              13                    2
                            Mutated         1                     8
Other brain tumors (n=62)   WT              39                    8
                            Mutated         1*                    14

* One of 2 TP53 mutated tumors which stained negatively was found to harbor a splice site mutation resulting in a truncated protein327.

Supplementary Figure 1: Study population overview. Each subgroup is shown along with the number of patients. The bottom row indicates the numbers of tumor and blood samples available for analysis. CPP - choroid plexus papilloma; CPC - choroid plexus carcinoma.

Supplementary Figure 2: Structural instability in CPCs. A significant difference in the total amount of focal copy number altered regions (TSV) was observed in CPCs as compared to CPPs (p=0.03).

Supplementary Figure 3: Numerical instability in TP53 mutated CPCs. A significant difference in the total number of aneuploid autosomal chromosomes was observed in TP53 mutated CPCs as compared to TP53 wild-type CPCs (p=0.039).


Chapter 5
5 Summary and future directions

5.1 Summary

This thesis describes detailed genomic analyses of Li-Fraumeni syndrome (LFS), and suggests a more generalized association between copy number variations (CNV) and cancer susceptibility.

In Chapter 2, we show that the frequency of germline CNVs is largely invariable among healthy individuals in the human population. Using this baseline dataset to compare against, we discover for the first time a disproportionately high-level of CNVs in the genomes of LFS families. In particular, there is a significant increase in constitutional CNVs in TP53 mutation carriers, who are known to have an increased risk of multiple early-onset neoplasms, but not individuals harboring wild type TP53 (p=0.01 and p=0.994, respectively). The excess of submicroscopic deletions and duplications reported is likely the earliest manifestation of instability conferred by the constitutional TP53 mutation in LFS. Therefore, this was the first report of an over-abundance of germline structural alterations in LFS and, to our knowledge, the first genome-wide study of CNVs and genetic susceptibility to cancer.

Our observations are described in the setting of germline TP53 mutations; however, two findings presented here suggest that the importance of CNVs in cancer susceptibility is more general. First, by studying paired tumor DNA from these patients, we demonstrate that CNVs can act as the genetic foundation on which larger somatic chromosomal deletions and duplications develop. For example, we show a germline deletion that undergoes a further somatic deletion, while the rest of the chromosome maintains diploidy. In our reading of the literature, this is the first description of the impact of an inherited and genetically variable structural element on a somatic chromosomal alteration, suggesting that CNVs are fertile ground for subsequently acquired changes in cancer cells. Second, as detailed in the supplementary data, a large number of CNVs coincide with known cancer-related genes. This is consistent with CNVs as a major source of human genetic variation, similar to SNPs, and also highlights a rich source of new variants to be included in future association studies. To this end, we show that the frequency of one specific cancer-related CNV is significantly increased in affected probands as compared to the general population (p=0.006).

In chapter 3 we describe a novel contiguous deletion syndrome with characteristic physical and neurocognitive features, shared amongst four index cases. The syndrome is due to variously sized deletions within chromosome 17p13.1 that, surprisingly, include the tumor suppressor gene TP53. This was unexpected as TP53 is associated with LFS, not developmental delay. To explore a potential developmental delay-LFS connection, we screened cancer cases for copy number changes at 17p13.1 and identified four additional patients.

We used an ultra high-resolution tiling microarray of our own design to fine map rearrangements in these two unique patient populations. This has led to a number of intriguing findings: (1) while they overlap, the size and exact breakpoints of 17p13.1 deletions determine which syndrome is developed; (2) with our heightened resolution we observed a common formation signature for deletions at this region (Alu-mediated non-allelic homologous recombination); and (3) we show that a set of genes located within a critical stretch of 17p13.1 are under-expressed in the developmentally delayed patients but not in cancer-affected probands, who instead possess expression profiles consistent with TP53 signaling dysfunction and cancer predisposition.

The application of array CGH to ever larger and broader patient populations has led to the discovery of new genomic disorders underlying similar phenotypes. However, it is being increasingly recognized that the plasticity of the human genome can also give rise to hotspots, containing chromosomal rearrangements that instead lead to numerous different diseases. We show that 17p13.1 is such a region, but that the disease states involved can be disentangled by high-resolution analysis.

In chapter 4 we describe a multi-institutional study that explored the role of TP53 mutations in the initiation, progression and therapeutic response in patients with choroid plexus tumors (CPT). Using the largest known collection of high-quality CPTs, which combines both pathological subtypes (CPC and CPP), we show that all individuals with germline TP53 mutations fulfilled the criteria for LFS, while all patients not meeting these criteria harbored wild-type TP53 (p<0.0001). This observation suggests that parents of children with CPC should receive genetic counseling, and that all children with this diagnosis should be considered for TP53 gene testing. Moreover, 50% of CPCs harbor somatic TP53 mutations, and two sequence variants known to confer TP53 dysfunction, TP53 codon72 (Pro>Arg) and MDM2 SNP309 (T>G), co-exist in the majority (92%) of TP53 wild-type CPCs and not in TP53 mutated CPC (p=0.04).

High-resolution copy number analysis using the Affymetrix 6.0 platform revealed extremely high total structural variation (TSV) in TP53 mutated CPC tumor genomes compared to TP53 wild-type tumors and CPPs (p=0.006 and 0.004, respectively). High TSV was associated with significant risk of progression (p=0.0005). Five-year survival for TP53 immuno-positive and -negative CPC was 0% and 82±9%, respectively (p=0.0006). Furthermore, 14/16 patients with TP53 WT CPC are alive without having received radiation therapy.

These findings strongly suggest that young children with CPC whose tumors have intact TP53 function have a very favorable outcome and, most importantly, can be successfully treated without the use of radiation therapy.

5.2 Future directions

5.2.1 CNVs and LFS

Having shown the overall importance of CNVs in LFS, a more precise characterization and validation of these structural alterations should be carried out. This includes a deeper analysis of those CNVs found in families initially studied and a broadening of the cohort to also include families with a less pronounced LFS phenotype.

A detailed CNV analysis of LFS families, preferably including trios, should be performed with Affymetrix 6.0 arrays using methods described in this thesis. CNVs should be classified as gains or losses, and the parent of origin of each event (as well as its inheritance) should be determined. Next, a custom CGH tiling array should be developed to determine approximate breakpoints of each CNV. PCR junction fragments would then be sequenced to determine the exact nucleotide position of all CNVs in select LFS families. Sequence analysis should be carried out to determine whether specific molecular mechanisms are associated with CNV formation in LFS. These data can be contrasted to ongoing work and data by the structural variation consortium in ethnically-matched controls.

To our knowledge, the work described in this thesis was the first to (1) establish a baseline CNV frequency in healthy individuals and (2) to show that individuals with a known predisposition to cancer (an inherited TP53 mutation) deviate dramatically from this baseline.

Measuring CNV frequency is a novel method for identifying individuals with germline chromosomal instability. Therefore, to explore the association between germline instability (in the form of excessive CNVs) and early onset cancer in a more general way, the total structural variation (or some such measure of CNV burden) should be assessed in other cancer-prone populations. These could include individuals suspected of LFS or families with dominantly inherited mutations, besides TP53.

We have shown that 17p13.1 microdeletions cause a novel contiguous deletion syndrome with a recognizable phenotype. Collaborations should be established with other groups to find new probands with this novel syndrome. The identification of new patients can be performed by detailed patient characterization (i.e. by looking for common features) but should also include a detailed genotypic evaluation using our custom tiling array CGH, which has been shown to be a highly accurate platform for breakpoint characterization. It will be intriguing to see whether in a new cohort all 17p13.1 deletions involve Alu-Alu mediated recombination. Further, fibroblast cell lines should be established from these patients to measure the function of TP53 in patients with large 17p13.1 deletions. For example, TP53’s transcriptional output could be assessed using quantitative PCR to calculate the mRNA levels of p21, MDM2, GADD45, Cyclin G1, Bax and IGF-BP3. mRNA levels could be assessed before and after cells are irradiated. Differences between small and large TP53 deletions could be evaluated in this way and by measuring TP53 protein accumulation by western blot following exposure to ionizing radiation (IR).

The study of cancer and CNVs is in its infancy but is maturing quickly. In considering the effect of this form of genetic variation on cancer predisposition, cancer gene expression and tumor genome profiling, there is much to learn from past studies on genomic disorders. Denser microarrays, next-generation sequencing and integrative informatics analyses are around the corner and promise to uncover new CNVs and somatic copy number changes.

There are therefore many exciting questions to be addressed: what role do CNVs play in cancer predisposition and how can we use this newly discovered form of genetic variation to identify those most at risk? Which cancer-related genes are affected by CNVs and, of these changes, which are both necessary and sufficient to cause neoplastic growth? Can incipient cancer cells use these constitutional deletions and duplications to induce or accelerate tumorigenesis and tumor proliferation? As these questions are resolved, the potential value of cancer CNVs as novel biomarkers of cancer susceptibility and initiation, and cancer progression and metastases will become apparent.

5.2.2 Choroid plexus carcinoma

Having shown that TP53 mutations are associated with higher rates of somatic structural variation in choroid plexus tumor genomes, especially in malignant choroid plexus carcinoma (CPC), future work should determine exactly which genes are impacted. This work is urgently needed as patients with this tumor have very low survival. Yet the molecular basis of CPC initiation, progression and metastasis is practically unknown. As we have now assembled the largest collection of high-quality CPC specimens, the next steps are to create a genome-wide integrative profile of human CPC. First, using the existing Affymetrix 6.0 data presented in this thesis, a map of frequently amplified or deleted chromosomal regions in CPC should be created. This should be contrasted for all choroid plexus pathological subtypes, including the benign choroid plexus papilloma and the intermediary entity known as atypical choroid plexus papilloma. Next, a transcriptomic analysis will define which copy number alterations drive expression changes of specific genes. Pathway analyses should be carried out using Ingenuity Pathway Assist to determine whether a conserved pathway set defines this tumor type. Subsequently these genomic and transcriptomic data should be integrated with detailed clinical variables to see which events are associated with risk of relapse and overall survival.

Studies of Li-Fraumeni syndrome, a rare cancer predisposition condition, have yielded a disproportionately large number of insights into the genetics and biology of cancer. This is primarily because TP53, the only known LFS gene, is also the most frequently mutated gene in human cancer. Based on the studies presented herein, it is tempting to speculate that measures of overall CNV frequency, precise characterization of specific CNVs and associated breakpoints, and detailed genomic assessment of LFS-associated tumors may help to categorize TP53 mutation carriers into risk groups and provide a more rational basis for cancer screening, genetic counseling and therapy.


Chapter 6

6 Appendix 1 - Recurrent Focal Copy-Number Changes and Loss of Heterozygosity Implicate Two Noncoding RNAs and One Tumor Suppressor Gene at Chromosome 3q13.31 in Osteosarcoma

This chapter has been published and is reproduced with permission from Cancer Research:

Pasic I, Shlien A, Durbin AD, Stavropoulos DJ, Baskin B, Ray PN, Novokmet A, Malkin D. Recurrent Focal Copy-Number Changes and Loss of Heterozygosity Implicate Two Noncoding RNAs and One Tumor Suppressor Gene at Chromosome 3q13.31 in Osteosarcoma. Cancer Res. 2010 Jan 1;70(1):160-71.


6.1 Abstract

Osteosarcomas are malignant bone tumors that are rich in DNA copy-number alterations (CNAs). Using microarrays, fluorescence in-situ hybridization and quantitative polymerase chain reaction we characterized a focal region of chr3q13.31 (osteo3q13.31) harboring CNAs in 80% of osteosarcomas. As such, osteo3q13.31 is the most altered region in osteosarcoma and contests the view that CNAs in osteosarcoma are non-recurrent. Most osteo3q13.31 CNAs are deletions (67%), with 75% of these mono-allelic and frequently accompanied by loss-of-heterozygosity (LOH) in flanking DNA. Notably, these CNAs often involve the non-coding RNAs (ncRNAs) LOC285194 and BC040587, and in some cases, a tumor suppressor gene that encodes the limbic system-associated membrane protein (LSAMP). Ubiquitous changes occur in these genes in osteosarcoma, usually involving a loss of expression. Underscoring their functional significance, expression of these genes is correlated with the presence of osteo3q13.31 CNAs. Focal osteo3q13.31 CNAs and LOH are also common in cell lines from other cancers, identifying osteo3q13.31 as a generalized candidate region for tumor suppressor genes. The osteo3q13.31 genes may function as a unit, given significant correlation in their expression despite the great genetic distances between them. In support of this notion, depleting either LSAMP or LOC285194 promoted proliferation of normal osteoblasts by regulation of apoptotic and cell cycle transcripts and also upregulated VEGF receptor-1. Moreover, genetic deletions of LOC285194 or BC040587 were also associated with poor survival of osteosarcoma patients. Our findings identify osteo3q13.31 as a novel region of cooperatively acting tumor suppressor genes.

6.2 Introduction

Osteosarcoma is the most common bone malignancy in children and adolescents 328, with a 65% five-year survival 329. While germline mutations in p53, Rb1, and RecQL4 have been implicated in genetic susceptibility to osteosarcoma 49,330-336, less is known about the etiology of sporadic osteosarcoma. Cytogenetically, osteosarcomas are highly complex 337-345. Gains and losses affect all chromosomes in osteosarcoma and are largely non-recurrent, although some examples of recurrent CNAs exist, including gains at 6p12, 8q24, 17p12-p11, and losses at 13q14 and 17p13.1 337-341,343-345. Chromosome 6p12 contains the RUNX2 gene, which promotes terminal osteoblast differentiation 346. Elevated RUNX2 levels have been reported in osteosarcoma cells 347, providing an example of how integrative genome and functional analyses may help identify chromosomal regions important in tumorigenesis. Identification of these regions has been facilitated by the advent of high-resolution microarray technologies. This is highlighted in the recent identification of ALK as a familial neuroblastoma predisposition gene 348 and implication of CDKN2A in childhood acute lymphoblastic leukemia 349.

We characterize a region of chr3q13.31, designated osteo3q13.31, which harbors frequent focal CNAs and LOH in primary osteosarcoma samples and cell lines from other malignancies. Most osteo3q13.31 CNAs involve the ncRNAs LOC285194 and BC040587, and, sometimes, the LSAMP TS. A change in expression of osteo3q13.31 genes is ubiquitous in osteosarcoma, and, in the case of LOC285194 and BC040587, reflects the presence of osteo3q13.31 CNAs. Distantly spaced osteo3q13.31 genes may function as a unit, as they display significant correlation in expression. In support of this notion, depletion of either LSAMP or LOC285194 promotes proliferation of normal osteoblasts through regulation of apoptotic, cell cycle, and tumor-promoting transcripts. Furthermore, the presence of LOC285194 or BC040587 deletions in tumor DNA is associated with poor patient survival. Recently, two studies 350,351 have implicated LSAMP as an osteosarcoma TS, although neither addressed the functional role of LSAMP in repressing tumorigenesis, or the potential TS role of the osteo3q13.31 ncRNAs. Our findings implicate osteo3q13.31 as a novel universal TS region containing several cooperatively acting genes.

6.3 Materials and methods

Subject recruitment: Tumor biopsy tissue from 20 children and adolescents with osteosarcoma, with paired blood from 19 patients, was obtained at the Hospital for Sick Children, Toronto. Tumor and paired blood DNA was obtained from 28 adolescents and adults under 31 with osteosarcoma from the Interdisciplinary Health Research Team in Musculoskeletal Neoplasia Tumor Bank, Mount Sinai Hospital. Tumor biopsy tissue was obtained from one adolescent with osteosarcoma from the Cooperative Human Tissue Network. Samples were labeled OS01 through OS49. This study was approved by the Research Ethics Board at the Hospital for Sick Children.

Cells and reagents: U2OS, SAOS-2, HOS, MNNG and KHOS cells were purchased from the American Type Culture Collection, osteoblasts from PromoCell, C-terminal myc/DDK-tagged LSAMP expression clone from OriGene Technologies, and small interfering RNAs (siRNAs) from Ambion. Cell lines tested mycoplasma negative. siRNA oligonucleotide sequences are in Table S1. Cell proliferation was measured by a TACS MTT Cell Proliferation and Viability Assay (R&D Systems). MTT assays, colony-forming assays and westerns were performed as in 352. Flow cytometry was performed according to standard protocols.

Sample processing: Samples were processed as in 189. RNA was extracted using TRIzol (Invitrogen). Contaminating DNA was removed with the GeneHunter MessageClean kit. RNA was cleaned using the QIAGEN RNeasy Mini Kit.

DNA microarray analysis: CN analysis was performed on OS03-OS05, OS07-OS16, OS18, and OS20-OS32 blood and tumor DNA, hybridized to Affymetrix genome-wide 6.0 microarrays, by Partek Genomic Suite. Baseline was generated on paired blood DNA or 270 International HapMap project individuals 276. Amplifications and deletions were detected by genomic segmentation using a minimum of ten consecutive probes with CN≥2.5 or CN≤1.5, respectively. LOH was detected using a Hidden Markov Model 353.
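The calling rule stated above (at least ten consecutive probes with CN≥2.5 or CN≤1.5) can be illustrated with a minimal sketch. This is not the segmentation algorithm implemented in the commercial software used in this study; it is a simplified run-length filter that applies only the two thresholds and the probe-count minimum given in the text. The function name `call_cn_segments` and the example probe values are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# (first probe index, last probe index, "gain" or "loss")
Segment = Tuple[int, int, str]


def call_cn_segments(
    copy_number: Sequence[float],
    min_probes: int = 10,      # minimum run length, as stated in the text
    gain_cutoff: float = 2.5,  # CN >= 2.5 counts toward an amplification
    loss_cutoff: float = 1.5,  # CN <= 1.5 counts toward a deletion
) -> List[Segment]:
    """Report runs of consecutive probes that pass a gain or loss threshold."""
    segments: List[Segment] = []
    run_kind: Optional[str] = None
    run_start = 0
    # A trailing None acts as a sentinel so that the final run is flushed.
    values: List[Optional[float]] = list(copy_number) + [None]
    for i, value in enumerate(values):
        if value is None:
            kind = None
        elif value >= gain_cutoff:
            kind = "gain"
        elif value <= loss_cutoff:
            kind = "loss"
        else:
            kind = None
        if kind != run_kind:
            # The previous run ended at probe i - 1; keep it if long enough.
            if run_kind is not None and i - run_start >= min_probes:
                segments.append((run_start, i - 1, run_kind))
            run_kind, run_start = kind, i
    return segments


# Hypothetical per-probe CN estimates: a 12-probe loss, a 9-probe gain
# (too short to be called) and a 10-probe gain.
probes = [2.0] * 5 + [1.2] * 12 + [2.0] * 3 + [2.8] * 9 + [2.0] * 2 + [3.0] * 10
print(call_cn_segments(probes))
```

In this toy input only the 12-probe loss and the 10-probe gain are reported; the 9-probe gain falls below the ten-probe minimum and is discarded.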

Real-time quantitative PCR (qPCR): qPCR of OS01-OS17, OS20-OS32, and HOS DNA was performed on Roche LightCycler® by relative quantification 189. Pooled blood DNA from 80 controls (Roche) was used as a reference. Primers were designed using Primer3 and qRTDesigner 1.2 and the human reference assembly (UCSC version hg18, based on National Center for Biotechnology Information build 36). Primer sequences are in Table S2.

Reverse-transcriptase qPCR (RT-qPCR): RT-qPCR was performed on OS01-OS08, OS10-OS17, OS20-OS28, OS30-OS31, OS33-OS48, HOS, K-HOS, MNNG, SAOS-2, U2OS, and osteoblast RNA. Expression of LSAMP, LOC285194, and BC040587 was measured relative to that of TATA-binding protein, by the comparative C(T) method 354. A 1.5-fold expression change cut-off was used. Primer sequences are in Table S3.
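As an illustration of the comparative C(T) calculation described above, the following sketch computes a fold change relative to a calibrator sample and applies the 1.5-fold cut-off. Several points are assumptions rather than details from the text: amplification efficiency is taken as 100% (a doubling per cycle), the cut-off is applied symmetrically (at least 1.5-fold up, or at most 1/1.5-fold down), and the function names and C(T) values are hypothetical.

```python
def fold_change(
    ct_target_sample: float,
    ct_reference_sample: float,
    ct_target_calibrator: float,
    ct_reference_calibrator: float,
) -> float:
    """Comparative C(T) fold change, 2 ** -(ddCt), assuming 100% efficiency."""
    # Normalize the target gene to the reference gene within each sample.
    delta_sample = ct_target_sample - ct_reference_sample
    delta_calibrator = ct_target_calibrator - ct_reference_calibrator
    # Compare the tumor sample with the calibrator (e.g. osteoblasts).
    delta_delta = delta_sample - delta_calibrator
    return 2.0 ** (-delta_delta)


def classify_expression(fold: float, cutoff: float = 1.5) -> str:
    """Apply a symmetric fold-change cut-off (symmetry is an assumption)."""
    if fold >= cutoff:
        return "increased"
    if fold <= 1.0 / cutoff:
        return "decreased"
    return "unchanged"


# Hypothetical C(T) values: the target crosses threshold three cycles later
# in the tumor than in the calibrator once both are normalized.
fc = fold_change(28.0, 24.0, 25.0, 24.0)
print(fc, classify_expression(fc))
```

Here the target gene would be LSAMP, LOC285194 or BC040587, the reference gene TATA-binding protein, and the calibrator the primary osteoblast RNA against which the tumor samples were normalized.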

Fluorescence in-situ hybridization (FISH): FISH was performed using standard protocols. A commercially available RP11-956M14 bacterial artificial chromosome (BAC) clone, covering the region between 117,799,751 and 117,974,624bp, tagged with a green fluorescein isothiocyanate tag, was used as a 3q13.31 probe. A chromosome 3 α-satellite (D3Z1) probe (MP Biomedicals), tagged with a red cyanine fluorescent tag, was used as a chromosome 3 centromeric probe. FISH probes were synthesized at The Centre for Applied Genomics (TCAG).

Statistical analyses: Analyses were performed using PASW Statistics 17.0. Kaplan-Meier survival analysis was performed using Partek. Error bars represent standard deviations unless noted otherwise.

Antibodies: Anti-myc and anti-proliferating cell nuclear antigen (PCNA) antibodies were from Santa Cruz Biotechnology. Anti-vinculin antibody was from Millipore. Cleaved PARP-1 (cPARP-1) antibody was from Cell Signaling Technology.

LSAMP mutation screening: Primer sequences are in Table S4.

6.4 Results

Osteo3q13.31: site of highly common focal CNAs and LOH in osteosarcoma: To identify novel CNAs in osteosarcoma, we analyzed DNA from primary tumor biopsies of 27 osteosarcoma patients by genome-wide oligonucleotide 6.0 (GW6.0) microarrays. In 26 of these, we used qPCR to validate CNA calls, in addition to DNA from primary tumor biopsies of four osteosarcoma patients, which were analyzed by qPCR only. CNAs identified by microarrays are shown in Figure S1. The most common deletion was within chr3q13.31 (Figure 1A, arrowhead). CNAs were observed in 70.4% (19/27) of samples analyzed by microarrays: 51.9% (14/27) exhibited CN loss at chr3q13.31 (Figure 1B, samples with blue tracks only), 3.7% (1/27) CN gain (Figure 1B, samples with red tracks only), while 14.8% (4/27) displayed complex changes involving regions of CN loss and gain (Figure 1B, samples with blue and red tracks). Validating our approach, we detected previously reported amplifications at chr8q24.21 in 70.4% (19/27) and chr6p21.2-6p12.3 in 66.7% (18/27) of samples (Figure 1A, arrowheads).

We next determined if CNAs at chr3q13.31 represented de novo events in tumor DNA or CN changes of germline CNVs by comparing CN of chr3q13.31 on microarrays between blood- and tumor-derived DNA of each patient. We identified a 0.7Mb region (Figure 1C) with significantly (χ², p<0.01) lower CN in tumor (median CN=1.5) than blood DNA (median CN=2.1). In fact, while CN of this locus in blood DNA clustered tightly around 2, CN of the same locus in tumor DNA varied significantly. Therefore, chr3q13.31 CNAs in osteosarcoma represent de novo events in tumor DNA.

To determine the minimal region of overlap (MRO) of chr3q13.31 CNAs, we used Significance testing for aberrant CN (STAC) 355, a multiple testing-corrected permutation approach for identifying regions harboring CNAs more often than by chance. We identified a 0.5Mb region between 117.8Mb and 118.3Mb of chr3q13.31 (Figure 1D) that showed higher frequency of CNAs than the remainder of chr3q (p<0.05) in tumor DNA. Importantly, this region corresponded closely to the 0.7Mb stretch of chr3q13.31 that had significantly lower CN in tumor than in blood DNA of patients (Figure 1C). A 10kb MRO between 118,035,000bp and 118,045,000bp (Figure 1B and Figure 1D) was seen in 63% (17/27) of patients. Therefore, STAC confirms the focal and non-random nature of chr3q13.31 events.

To validate chr3q13.31 CNAs in osteosarcoma we performed qPCR. Five pairs of probes were used, corresponding to the 0.7Mb stretch of chr3q13.31 affected by CNA, with one pair centering over the MRO (Figure 1B, asterisks). As a positive control, we used HOS cells with a chr3q13.31 deletion 340. HOS cells showed bi-allelic loss at all locations tested (Table 1). CNAs were observed within chr3q13.31 in 80% (24/30) of osteosarcoma samples: 53.3% (16/30) exhibited chr3q13.31 CN loss (Table 1, samples with blue tracks only), 13.3% (4/30) CN gain (Table 1, samples with red tracks only), while 13.3% (4/30) displayed complex changes including regions of CN loss and gain (Table 1, samples with blue and red tracks). qPCR confirmed our microarray findings in the majority of cases, with few exceptions (deletion in OS26, amplifications in OS08, OS11, and OS23; Figure 1B and Table 1), possibly reflecting use of a different CN baseline in qPCR (blood DNA from 80 controls) and DNA microarray (patients’ own blood DNA). We named the region of chr3q13.31 affected by CNAs in osteosarcoma osteo3q13.31 to emphasize their focal nature, which contrasts with the majority of osteosarcoma CNAs previously identified.

We next evaluated whether deletions at osteo3q13.31 were mono- or bi-allelic. Of 20 osteosarcoma patients with osteo3q13.31 deletions (Table 1), 75.0% (15/20) were mono-allelic (Table 1 and Figure 2A), 15.0% (3/20) bi-allelic (Table 1 and Figure 2B), and 10.0% (2/20) contained regions of mono- and bi-allelic CN loss (Table 1 and Figure 2C). DNA from OS14 contained a region of CN gain neighboring a region of bi-allelic deletion (Table 1 and Figure 2D). Allele-specific CN analysis revealed (Figure S2) that the observed CN gain was caused by unequal amplification of one allele, a situation functionally resembling loss of one allele through LOH. Therefore, mono- and bi-allelic deletions at osteo3q13.31 are both common and could, along with unequal amplifications, lead to loss of genetic information.

CNAs arise through mechanisms that can lead to LOH in flanking DNA 356-358. We therefore examined if osteo3q13.31 was affected by LOH. Of 27 osteosarcoma samples analyzed by microarrays, 33.3% (9/27) displayed LOH at osteo3q13.31 (Figure S3). LOH was common in tumors with bi-allelic osteo3q13.31 deletions where the deletion was flanked by LOH extending several megabases into chr3q (Figure 2B-D). LOH at osteo3q13.31 was also confirmed by DNA sequencing of an osteo3q13.31 gene, LSAMP, in 25% (3/12) of informative samples (Table S5). Therefore, in addition to CN loss, LOH also leads to loss of genetic information at osteo3q13.31 in osteosarcoma.

To confirm microarray and qPCR findings, we performed FISH using the BAC probe RP11-956M14, covering the region near the MRO at osteo3q13.31. As a control, we used early passage osteoblasts. These were compared to three tumor biopsies: one without an osteo3q13.31 deletion (OS02), one with a mono-allelic deletion (OS15), and one with a bi-allelic deletion (OS16). Metaphase and interphase FISH analyses confirmed osteoblasts were diploid at osteo3q13.31 (Figure 3A and Table S6). Similarly, the majority of OS02 cells (66.7%) showed two osteo3q13.31 signals by interphase FISH (Figure 3B and Table S6). Interphase FISH of OS15 tumor revealed approximately equal proportions of cells with two osteo3q13.31 signals and cells with one signal (Figure 3C and Table S6). This agrees with the estimated CN of 1.4 for osteo3q13.31 in this sample by qPCR (Table S6) and reflects the presence of normal cells in the specimen. Interphase FISH showed that the majority of cells (74.3%) in OS16 tumor lacked both osteo3q13.31 signals (Figure 3D and Table S6). Therefore, FISH results are consistent with the focal nature of deletions at osteo3q13.31 in osteosarcoma.

The osteo3q13.31 genes LOC285194, BC040587 and LSAMP commonly show loss of expression in primary osteosarcoma samples and cell lines. Having identified osteo3q13.31 as the site of common, non-random, and focal CNAs and LOH in osteosarcoma, we examined if loss of osteo3q13.31 genetic information had functional significance. CNAs and LOH at osteo3q13.31 overlap or map near one TS gene, LSAMP, and the ncRNAs LOC285194 and BC040587. We therefore examined the effect of osteo3q13.31 CNAs on LSAMP, LOC285194, and BC040587 expression by RT-qPCR. Expression was analyzed in 43 primary osteosarcoma tumor biopsies and five osteosarcoma cell lines, normalized to primary osteoblasts. All three genes are expressed in osteoblasts (Figure 4A), suggesting a possible role in bone biology. Of 48 samples, 64.6% (31/48) showed decreased LSAMP expression (calculated as mean expression of exons 1-2 and 3-4) compared to osteoblasts (Figure 4A, top, bold samples). In 18.8% (9/48), LSAMP expression was increased (Figure 4A, top, italicized samples). In 16.7% (8/48), there was no difference in LSAMP expression compared to osteoblasts (Figure 4A, top). LOC285194 and BC040587 revealed more striking differences. In 81.2% (39/48) of samples, LOC285194 expression was decreased compared to osteoblasts (Figure 4A, middle, bold samples). The majority of these (62.5%, 30/48) had a large decrease, with residual LOC285194 RNA levels less than 10% of that in osteoblasts. Like LSAMP, LOC285194 expression was increased in some samples: 8.3% (4/48) showed higher expression than osteoblasts (Figure 4A, middle, italicized samples). In 10.4% (5/48), there was no difference in LOC285194 expression compared to osteoblasts (Figure 4A, middle). BC040587 expression was decreased in 77.1% (37/48) of samples (Figure 4A, bottom, bold samples), increased in 10.4% (5/48) (Figure 4A, bottom, italicized samples), and unchanged in 12.5% (6/48) (Figure 4A, bottom). In summary, LSAMP expression was affected in 83.3% (40/48), LOC285194 expression in 89.6% (43/48), and BC040587 expression in 87.5% (42/48) of samples. Intriguingly, a change in expression of at least one osteo3q13.31 gene was evident in 100% (48/48) of samples, suggesting dysregulation of osteo3q13.31 genes is ubiquitous in osteosarcoma.

To examine if changes in LSAMP and LOC285194 expression were caused by alternative splicing, we compared expression of LSAMP messenger RNA (mRNA) at the junction of exons 3-4 to that at the junction of exons 1-2, and expression of LOC285194 mRNA at exon 1 to that at exon 4. Linear regression analysis revealed (Spearman’s ρ=0.965, p=3.7×10⁻⁸⁵) that expression of LSAMP exons 3-4 correlated well with expression of exons 1-2 (Figure S4A). Similarly, expression of LOC285194 exon 4 correlated well with exon 1 (Spearman’s ρ=0.886, p=8.2×10⁻⁴², Figure S4B), indicating that changes in LSAMP and LOC285194 expression in osteosarcoma were not caused by alternative splicing. Instead, changes in expression of LOC285194, BC040587, and, less often, LSAMP could be explained by CNAs at osteo3q13.31. For example, OS22 DNA contains a deletion eliminating LOC285194, but not LSAMP or BC040587 (Figure 1B), leading to complete loss of LOC285194, but not LSAMP or BC040587 expression (Figure 4A). OS14 DNA contains a region of CN gain involving LSAMP, plus 100kb upstream of LSAMP, followed by a region of bi-allelic CN loss eliminating both LOC285194 and BC040587 (Figures 1B and 2D). This leads to a nearly 3-fold increase in LSAMP expression and loss of LOC285194 and BC040587 expression (Figure 4A). OS16 DNA contains a large bi-allelic deletion eliminating all three osteo3q13.31 genes (Figure 1B) and leads to loss of expression of all three (Figure 4A).

However, in some tumors we detect changes in LSAMP, LOC285194, and BC040587 expression in the absence of osteo3q13.31 CNAs. For example, osteo3q13.31 CN-neutral samples OS03 and OS28 (Figure 1B and Table 1) show a large reduction in LSAMP, LOC285194, and BC040587 expression (Figure 4A). In contrast, OS15, which contains a mono-allelic deletion at osteo3q13.31 (Table S6 and Figure 3C), shows a large increase in expression of all three osteo3q13.31 genes (Figure 4A). We considered the possibility that in tumors without osteo3q13.31 CNAs the apparent change in expression levels of osteo3q13.31 genes resulted from point mutations or small deletions. To test this possibility, we sequenced all 7 exons of LSAMP in blood and tumor DNA of 29 patients. We did not detect LSAMP mutations in any samples (Table S5).

To assess the extent to which expression of osteo3q13.31 genes could be explained by CNAs, we calculated the correlation between expression of LSAMP, LOC285194, and BC040587 and CN of osteo3q13.31 at five different positions measured by qPCR (Figure 1B). The expression of LSAMP did not show significant correlation with CN at any osteo3q13.31 site (not shown). In contrast, LOC285194 expression correlated with osteo3q13.31 CN at positions 117,927kb (Spearman’s ρ=0.493, p=0.009) and 118,040kb (Spearman’s ρ=0.578, p=0.002, Figure 4B, top). Likewise, BC040587 expression correlated with osteo3q13.31 CN at 117,927kb (Spearman’s ρ=0.536, p=0.004) and 118,040kb (Spearman’s ρ=0.592, p=0.001, Figure 4B, bottom). Therefore, osteo3q13.31 CNAs account for changes in LOC285194 and BC040587 expression, and, less often, LSAMP.
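The rank correlations reported above relate, for each sample, a qPCR-derived CN estimate to a relative expression value. A minimal sketch of how Spearman’s ρ can be computed for such paired measurements is given below; it ranks each variable (averaging tied ranks) and then takes the Pearson correlation of the ranks. It does not compute p-values, the function names are hypothetical, and the example values are invented for illustration and are not data from this study.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Invented example: CN at one qPCR position and relative expression
# of one gene in six samples.
copy_number = [0.1, 1.0, 1.4, 2.0, 2.1, 2.9]
expression = [0.02, 0.30, 0.25, 0.90, 1.10, 2.70]
print(round(spearman_rho(copy_number, expression), 3))
```

Because the statistic depends only on ranks, it is insensitive to the strongly skewed expression ratios that arise when a gene is almost completely lost in some tumors.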

LSAMP, LOC285194, and BC040587 map far apart at osteo3q13.31 (Figure 1B), spanning 468kb. As all three are under-expressed in osteosarcoma, we investigated if expression patterns of LSAMP, LOC285194, and BC040587 were correlated. Intriguingly, we found evidence for correlation between expression of all genes: LSAMP and LOC285194 (Spearman’s ρ=0.431, p=5.01×10⁻⁸, Figure 4C, top), LSAMP and BC040587 (Spearman’s ρ=0.460, p=8.16×10⁻⁹, Figure 4C, middle), and LOC285194 and BC040587 (Spearman’s ρ=0.653, p=2.25×10⁻⁴, Figure 4C, bottom). Taken together, these data suggest that expression of all osteo3q13.31 genes follows a similar pattern in osteosarcoma and, in the case of LOC285194 and BC040587, this relates to the presence of CNAs at osteo3q13.31.

Focal osteo3q13.31 deletions and LOH are common in cell lines from various cancers. As loss of expression of osteo3q13.31 genes is common in osteosarcoma, we hypothesized that they might be implicated in tumorigenesis in other cell types. We therefore searched public data on cell lines from the Catalogue of Somatic Mutations in Cancer (COSMIC) panel for focal osteo3q13.31 CN loss and LOH 192. Interestingly, focal osteo3q13.31 bi-allelic deletions (Figure S5A) and LOH (Figure S5B) are common in cell lines from various cancers, including those of lung, autonomic and central nervous system, blood, endometrium, soft tissue, skin, gastrointestinal tract, urinary tract, and breast (Table S7). Osteo3q13.31 may thus be important in tumorigenesis in various malignancies, including sarcomas and carcinomas.

Depletion of LSAMP or LOC285194 Promotes Osteoblast Proliferation Through Regulation of Apoptotic and Cell Cycle Transcripts, and VEGF/VEGFR1. As osteo3q13.31 CNAs sometimes involve LSAMP, which belongs to the family of genes implicated as TS in clear cell renal cell carcinoma (CCRCC), glioma, and ovarian carcinoma 359-361, we examined if LSAMP also acted as a TS in osteosarcoma. We did not detect any effect of LSAMP over-expression on proliferation and survival (Figure S6A and S6B), cell-cycle progression (Figure S6C and S6D), or levels of endogenous apoptosis (Figure S6D) of HOS and U2OS osteosarcoma cell lines. We next investigated the effect of depletion of LSAMP, LOC285194 and BC040587 on normal osteoblasts. Depletion of LSAMP and LOC285194 by siRNA successfully reduced mRNA levels by 90% and 50%, respectively (Figure 5A). In contrast, siRNA-mediated silencing of BC040587 was not effective (not shown), consistent with reports of reduced susceptibility of ncRNAs to siRNA 362.

To examine the effect of LSAMP and LOC285194 depletion, we conducted MTT proliferation assays, modified Boyden chamber migration assays and cell cycle analysis. Depletion of LSAMP and LOC285194 mRNA had no effect on osteoblast migration (not shown). However, depletion of LSAMP (p=0.007) and LOC285194 (p=4.8×10⁻⁴) both promoted cell growth in MTT assays (Figure 5B). Cell cycle analysis demonstrated a mild increase in the G1 population of LOC285194-depleted (Figure S7A), and an increase in the S-phase population of LSAMP-depleted osteoblasts (Figure S7B). This implicates these two genes as functional growth suppressors in osteoblasts, in vitro.

Since both LSAMP and LOC285194 depletion promoted proliferation (Figure 5B), with differing effects on the cell cycle (Figure S7), we examined changes induced by depletion of these genes on transcription of a panel of proliferation-associated genes. Cyclin D1, VEGF, and VEGFR1 were all upregulated by LOC285194 depletion (Figure 5C). Consistent with the results of cell cycle analysis (Figure S7), depletion of LSAMP and LOC285194 had opposing effects on cyclin A2 and cyclin B1 expression, with these transcripts being induced by LSAMP siRNA, and suppressed by LOC285194 siRNA, indicating that these genes may regulate proliferation by differing mechanisms. Finally, LOC285194 depletion suppressed expression of pro-apoptotic BCL2 and BimEL (Figure 5C). Taken together, these data implicate LSAMP and LOC285194 as growth suppressors through regulation of apoptotic and cell cycle transcripts, and VEGF/VEGFR1.

Presence of LOC285194 or BC040587 deletions in tumor DNA is associated with poor survival of osteosarcoma patients. Chromosomal aberrations involving LSAMP are associated with poor outcome in osteosarcoma 350,351. To examine if deletions involving the ncRNAs LOC285194 or BC040587 also affect patient survival, we used Kaplan-Meier survival analysis to compare survival of patients with LOC285194 or BC040587 deletions in tumor DNA (based on microarray data) to those without deletions. Intriguingly, the presence of either LOC285194 deletions (log-rank p=0.008, Figure 5D, left) or BC040587 deletions (log-rank p=0.01, Figure 5D, right) in tumor DNA was associated with a dramatic decrease in survival. Therefore, LOC285194 and BC040587 ncRNAs, in addition to LSAMP, function as TS genes at osteo3q13.31.
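The Kaplan-Meier comparison described above can be illustrated with a minimal product-limit estimator. This is only a sketch: the function name `kaplan_meier` and the follow-up data are hypothetical, the thesis does not state which software produced Figure 5D, and the log-rank test itself is not implemented here.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times  -- follow-up time for each patient
    events -- 1 if the patient died at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            # Product-limit step: multiply by the fraction surviving t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical follow-up times in months (not data from this study).
deleted = kaplan_meier([6, 12, 12, 20, 30], [1, 1, 0, 1, 0])
for time, prob in deleted:
    print(f"t={time}: S(t)={prob:.2f}")
```

Computing one curve per group (with and without a deletion) gives the two step functions that a log-rank test would then compare.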

6.5 Discussion

Osteosarcomas are cytogenetically complex malignancies 337-345. While the prevailing view was that most osteosarcoma CNAs were non-recurrent, we used microarrays, qPCR, and FISH to characterize highly recurrent CNAs within a region of chr3q13.31 that we term osteo3q13.31. While osteosarcomas are characterized by marked aneuploidy, osteo3q13.31 CNAs are focal and include events clustered in a region less than 0.7Mb in size. CNAs and LOH at chr3 in osteosarcoma had been reported previously 338,340,344,345,363; however, these studies have not provided insight into the roles of the involved genes in osteosarcomagenesis.

The observed CNAs represent de novo events in tumor DNA, rather than CN changes of germline CNVs. The presence of focal osteo3q13.31 CNAs in as many as 80% of osteosarcoma samples strongly suggests a functional role for this region.

In samples with osteo3q13.31 deletions, both mono- and bi-allelic deletions are common and involve the genomic sequence of LOC285194 and BC040587 ncRNAs, and, sometimes, the LSAMP TS. Using multiple probes, we show by RT-qPCR that expression of one or more osteo3q13.31 genes is nearly ubiquitously dysregulated in 48 primary osteosarcoma biopsies and cell lines. However, while changes in expression of LOC285194 and BC040587 are frequently accounted for by osteo3q13.31 CNAs, this is less often true for LSAMP. Changes in LSAMP expression are not caused by point mutations either, as sequencing of LSAMP in patient blood and tumor DNA only revealed a common silent polymorphism in exon 4. In addition to mutations in the coding sequence, changes involving distant elements can affect gene expression. In the case of osteo3q13.31 genes, changes at one such element, through CNAs, mutations, or epigenetic mechanisms, could lead to a change in expression levels. In support of this, we found a tight correlation between expression of LSAMP, LOC285194 and BC040587, which span 468kb of DNA. One such element may reside near the osteo3q13.31 MRO, as presence of genomic deletions at this site correlates well with loss of LOC285194 and BC040587 expression. LSAMP, by virtue of its more distant location, displays an expression pattern showing little correlation with the CN of the MRO. However, as in the case of LOC285194 and BC040587, its expression may be affected by events involving cis-regulatory elements outside its coding sequence such as promoter methylation 350,359. Therefore, multiple mechanisms must exist to explain the behavior of osteo3q13.31 genes in osteosarcoma, CNAs being one that led us to study this region.

As loss of expression of osteo3q13.31 genes is frequent in osteosarcoma, we hypothesized they may be functionally important in tumorigenesis. In support of this idea, we found that cell lines from other malignancies contain regions of focal CN loss involving osteo3q13.31 genes. While some of these CN changes may represent germline CNVs, most do not (Table S6) as they are far more common than osteo3q13.31 CNVs, estimated to be present in less than 0.1% of healthy controls 364. LSAMP and a related gene have been implicated as TS in CCRCC 359, ovarian carcinoma 360, and glioma 361. During the preparation of this manuscript, two groups 350,351 reported deletions at osteo3q13.31 in osteosarcoma, implicating LSAMP as an osteosarcoma TS. However, these studies did not provide functional evidence or recognize the role of ncRNAs at osteo3q13.31. Supporting the notion that osteo3q13.31 genes act together, we report that depletion of either LSAMP or LOC285194 leads to increased proliferation of normal osteoblasts. Interestingly, while depletion of LSAMP and depletion of LOC285194 both promote growth and proliferation, the two genes may act through different mechanisms, since LSAMP depletion induces S-phase cyclin A2, and LOC285194 depletion suppresses the pro-apoptotic genes BCL2 and BimEL, and induces the VEGF/VEGFR1 axis, previously implicated in osteosarcomagenesis 365. Furthermore, while depletion of LSAMP led to increased proliferation of normal osteoblasts, LSAMP over-expression had no effect on HOS or U2OS cells. These contrasting effects may indicate that loss of LSAMP and other osteo3q13.31 genes is an early event in osteosarcoma, and restoration of LSAMP expression alone may not be sufficient to suppress proliferation of fully transformed cell lines carrying a significant burden of genomic instability.

While chromosomal aberrations involving LSAMP have previously been associated with poor prognosis 350,351, our work demonstrates that presence of LOC285194 or BC040587 deletions in tumor DNA is also associated with dramatically reduced survival of osteosarcoma patients. In fact, additional osteo3q13.31 TS properties may lie in interactions between osteo3q13.31 genes which, though far apart, show remarkable correlation in their expression.

Dysregulation of LSAMP, LOC285194, and BC040587 expression may be a step-wise process in osteosarcoma, representing a novel example of “multiple hits” in tumorigenesis where expression of neighboring genes is altered sequentially. Mono- or bi-allelic loss, or up-regulation of a particular allele of one or a combination of osteo3q13.31 genes may have a profound effect on osteosarcoma tumor biology. A challenge remains to determine the functional role for BC040587 ncRNAs (individually or in a network with other osteo3q13.31 genes), as well as the potential value for osteo3q13.31 in diagnosis, prognosis and evaluation of response to anti-cancer treatment. Given the high frequency, focal nature, non-random distribution and universal presence of CNAs overlapping three osteo3q13.31 genes, of which at least two now appear to suppress cellular proliferation and survival, these data support a role for the osteo3q13.31 TS unit as a driver in osteosarcomagenesis, perhaps with a similar role in other cancers.

Figure 1. De novo osteo3q13.31 CNAs involving regions of mono- and bi-allelic CN loss and gain are common in osteosarcoma. (A) The three most common CNAs in osteosarcoma (arrowheads): chr8q24.21 and chr6p21.2-6p12.3 amplifications, and chr3q13.31 deletions. Amplifications: red; deletions: blue. Y-axes: number of patients. (B) osteo3q13.31 deletions (blue) and amplifications (red). Top: chr3 and chr3q13.31 (red box). Middle: qPCR probes (asterisks). Black rectangle: MRO. Bottom: UCSC genes. (C) Box-plot diagram of osteo3q13.31 CN in blood (blue box) and tumor (red box) DNA of osteosarcoma patients. Y-axis: Mean CN between positions 117,652,189bp and 118,369,561bp of chr3q13.31 in each patient’s blood (blue circles) and tumor (red circles) DNA. (D) Grey bars: 0.5Mb region of chr3q13.31 (osteo3q13.31) where CNAs occur more often than by chance; bars: 1-p value of STAC for 10kb genomic segments. Blue line: osteo3q13.31 CNA MRO.

Figure 2. osteo3q13.31 CNAs in osteosarcoma tumor DNA involve regions of mono- and bi-allelic CN loss, high-level CN gain, and LOH. Microarray analysis of tumor DNA from a patient with a mono-allelic deletion without LOH (A), bi-allelic deletion with LOH (B), mono- and bi-allelic deletions with LOH (C), and complex changes involving regions of high-level CN gain and bi-allelic CN loss (D). Each panel: CN of patient’s blood (blue dots) and tumor (red dots) DNA. Dots: smoothed CN for 15 (A) or 10 (B-D) markers. Gray tracks: LOH. Boxes: LSAMP, LOC285194, and BC040587 exons; connecting lines: introns.

Figure 3. FISH confirms focal mono- and bi-allelic osteo3q13.31 deletions in osteosarcoma. (A) Metaphase (center) and interphase (right) FISH of primary osteoblasts using chr3 centromere (red) and osteo3q13.31 (green) probes. Interphase FISH is also shown of cells from the osteo3q13.31-diploid tumor of OS02 (B), the osteo3q13.31-haploid tumor of OS15 (C), and the tumor of OS16, which has a bi-allelic deletion of osteo3q13.31 (D).

Figure 4. Expression of osteo3q13.31 genes is reduced in primary osteosarcoma tumor samples and cell lines and often reflects the presence of osteo3q13.31 CNAs. Expression levels for 43 osteosarcoma tumor samples and five osteosarcoma cell lines were normalized to primary osteoblasts. Positive control: testis cDNA. Font: bold (decreased expression); italics (increased expression); regular (expression levels similar to osteoblasts). Error bars: standard errors of the mean. (A) RT-qPCR analysis of LSAMP expression (top) at exon 3-4 junction (open bars) and exon 1-2 junction (grey bars); LOC285194 expression (middle) at exon 1 (open bars) and exon 4 (grey bars); and BC040587 exon 2 expression (bottom). (B) Linear regression analysis for correlation between LOC285194 (top) or BC040587 (bottom) expression and CN of osteo3q13.31 CNA MRO at 118,040kb. (C) Linear regression analysis for correlation between expression of LOC285194 and LSAMP (top), BC040587 and LSAMP (middle), and LOC285194 and BC040587 (bottom).

Figure 5. In vitro and clinical evidence implicates LSAMP, LOC285194, and BC040587 as osteo3q13.31 TS genes. (A) Levels of LSAMP and LOC285194 mRNA were assessed in osteoblasts transfected with control, LSAMP or LOC285194 siRNA. (B) Proliferation of osteoblasts in response to treatment with control, LSAMP, or LOC285194 siRNA was measured by MTT assays. Bars: averages of three independent experiments. (C) Effect of LSAMP or LOC285194 siRNA on transcription of a panel of proliferation-associated genes in osteoblasts was measured by RT-qPCR. Bars: averages of three independent experiments. Error bars: standard errors of the mean. (D) Survival of patients with LOC285194 deletions (left) or BC040587 deletions (right) in tumor DNA was compared to patients without LOC285194 or BC040587 deletions.


Table 1. qPCR analysis of osteo3q13.31 CNAs in OS tumor DNA at five positions. Light blue: mono-allelic deletions; dark blue: bi-allelic deletions; red: CN gain. HOS cells are included as a control. CN values are rounded to the nearest integer.

Sample   Position   Position   Position   Position   Position
         117.668Mb  117.737Mb  117.927Mb  118.040Mb  118.355Mb
Control  2          2          2          2          2
HOS      0          0          0          0          0
OS01     3          3          3          2          3
OS02     2          2          2          2          2
OS03     2          2          2          2          2
OS04     2          2          1          1          1
OS05     2          2          0          0          3
OS06     2          1          0          0          0
OS07     2          2          2          1          2
OS08     3          3          1          1          1
OS09     2          1          1          1          1
OS10     2          1          1          1          2
OS11     3          3          2          2          3
OS12     1          1          1          1          2
OS13     2          2          1          1          2
OS14     6          6          0          0          0
OS15     2          2          1          1          2
OS16     0          0          0          0          1
OS17     2          3          1          1          3
OS20     2          2          2          2          1
OS21     2          2          2          2          2
OS22     2          2          1          1          2
OS23     3          3          2          2          3
OS24     2          2          2          2          2
OS25     2          2          2          1          2
OS26     2          2          2          2          2
OS27     2          2          1          1          2
OS28     2          2          2          2          2
OS29     1          1          1          1          1
OS30     1          1          1          1          1
OS31     2          2          0          0          2
OS32     6          6          8          2          2
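As a reading aid for Table 1, the colour coding in the caption (mono-allelic deletion, bi-allelic deletion, CN gain) can be expressed as a small lookup on the rounded CN values. The sketch below assumes a diploid baseline of 2; the function name `classify_cn` is hypothetical, and the three rows are copied from the table.

```python
def classify_cn(cn, normal=2):
    """Label a rounded integer copy number relative to a diploid baseline."""
    if cn == 0:
        return "bi-allelic deletion"
    if cn < normal:
        return "mono-allelic deletion"
    if cn == normal:
        return "no change"
    return "gain"


# qPCR positions (Mb on chr3) and three example rows copied from Table 1.
positions_mb = [117.668, 117.737, 117.927, 118.040, 118.355]
table_rows = {
    "OS04": [2, 2, 1, 1, 1],
    "OS05": [2, 2, 0, 0, 3],
    "OS32": [6, 6, 8, 2, 2],
}

for sample, copy_numbers in table_rows.items():
    labels = [classify_cn(cn) for cn in copy_numbers]
    print(sample, dict(zip(positions_mb, labels)))
```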


Chapter 7

7 Appendix 2 - Toward a more Uniform Sampling of Human Genetic Diversity: A Survey of Worldwide Populations by High-density Genotyping

This chapter has been accepted for publication and is reproduced with permission from Genomics.

Contribution: CNV discovery and analysis.

Jinchuan Xing, W. Scott Watkins, Adam Shlien*, Erin Walker*, Chad D. Huff, David J. Witherspoon, Yuhua Zhang, Tatum S. Simonson, Robert B. Weiss, Joshua D. Schiffman, David Malkin, Scott R. Woodward, and Lynn B. Jorde. Toward a more Uniform Sampling of Human Genetic Diversity: A Survey of Worldwide Populations by High-density Genotyping. Genomics (accepted).

* Equal contributions


7.1 Abstract

High-throughput genotyping data are useful for making inferences about human evolutionary history. However, the populations sampled to date are unevenly distributed, and some areas (e.g., South and Central Asia) have rarely been sampled in large-scale studies. To assess human genetic variation more evenly, we sampled 296 individuals from 13 worldwide populations that are not covered by previous studies. By combining these samples with a data set from our laboratory and the HapMap II samples, we assembled a final dataset of ~250,000 SNPs in 850 individuals from 40 populations. With more uniform sampling, the estimate of global genetic differentiation (FST) substantially decreases from ~16% with the HapMap II samples to ~11%. A panel of copy number variations typed in the same populations shows patterns of diversity similar to the SNP data, with highest diversity in African populations. This unique sample collection also permits new inferences about human evolutionary history. The comparison of haplotype variation among populations supports a single out-of-Africa migration event and suggests that the founding population of Eurasia may have been relatively large but isolated from Africans for a period of time. We also found a substantial affinity between populations from central Asia (Kyrgyzstani and Mongolian Buryat) and America, suggesting a central Asian contribution to New World founder populations.

7.2 Introduction

Every major demographic event in a population’s history (e.g., population bottlenecks, expansions, and migrations) leaves an imprint on the population’s collective assemblage of DNA sequences. Consequently, studies of DNA variation have illuminated many aspects of human population history. Because the genetic variation responsible for disease is a subset of genetic variation in general, these studies are also providing a foundation for important biomedical studies 366,367. Large-scale genotyping efforts using high-density SNP microarrays have generated an unprecedented amount of human population genetic data. In addition to their application in whole-genome association studies, these data have been used to address issues such as the evolutionary history of human populations 368-375, estimation of individual ancestry 368,376-380, and patterns of natural selection in populations 381-385.

In contrast to the rapid pace of technological development, progress in collecting human DNA samples has been slow and uneven. All existing human genetic diversity datasets, including the HapMap collection, the Coriell collection, and the Human Genome Diversity Project (HGDP-CEPH), are only partial representations of worldwide human diversity. For example, the HGDP database, one of the most widely used resources, lacks coverage in India. Other major regions, such as Eastern Europe and central/north Asia, are also under-represented in databases of human genetic variation.

To help achieve a more uniform sampling of world-wide human genetic diversity, we genotyped a sample of 296 individuals from 13 populations using Affymetrix 6.0 microarrays (~900,000 SNPs and 946,000 copy number variation (CNV) probes). We included populations from West Africa (Dogon and Bambaran), Central Europe (Slovenian), West Asia (Iraqi), Central Asia (Kyrgyzstani and Buryat), South/Southeast Asia (Pakistani, Nepalese, and Thai), Polynesia (Tongan and Samoan), and America (Bolivian and Totonac). By adding these populations from previously under-represented regions to existing datasets, we sought to achieve two goals: first, a more comprehensive understanding of the distribution of human genetic diversity; second, a more detailed inference of human demographic history, such as the mode and tempo of the out-of-Africa diaspora, the peopling of South Asia, and the peopling of America.

7.3 Materials and Methods

DNA samples

DNA samples from 13 worldwide populations were collected by the Sorenson Molecular Genealogy Foundation (SMGF) and genotyped (Figure 1, Table 1). Informed consent was obtained from all study subjects at the sampling location, and the Western Institutional Review Board approved all procedures. The sampling locations of these populations are: Bambaran: southwest Mali; Dogon: central Mali in the state of Mopti; Slovenian: several locations in Slovenia; Iraqi: Kurds born in Akra, northern Iraq (collected in Baghdad); Pakistani: Arain agriculturalists from the Punjab region; Nepalese: collected from Kathmandu, Nepal (samples consist of 16 Brahman, 2 Magar, 2 Chhetri, 2 Newar, 1 Madhesi, and 2 Nepalese with unknown ethnicity); Kyrgyzstani: collected from Bishkek, the capital of Kyrgyzstan, having origins in several states in northeast Kyrgyzstan; Thai: 19 samples from the Moken ethnic group, and ten from Phuket, Thailand; Buryat: Buryat ethnic group from northeastern Mongolia; Samoan: ethnic Samoans sampled in Samoa; Tongan: ethnic Tongans sampled in Tonga; Totonac: agriculturalists living near Vera Cruz, Mexico; Bolivian: high-altitude Native American Aymara speakers living near La Paz. Most of these DNA samples were collected from saliva, with the exception of 22 Tongans and Samoans from whom blood samples were obtained.

SNP Genotyping, Genotype calling and quality control

High-throughput microarray genotyping of approximately 906,000 SNPs was performed using the Affymetrix Genome-Wide Human SNP Array 6.0 (Affymetrix, Santa Clara, CA, USA). Previous comparisons using this array indicate that DNA derived from saliva samples yields SNP genotypes of quality comparable to DNA derived from blood samples 386. The recommended protocol described by Affymetrix was followed to construct DNA libraries. Samples were then injected into microarray cartridges and hybridized in a GeneChip® Hybridization Oven 640, followed by washing and staining in a GeneChip® Fluidics Station 450. Mapping array images were obtained using the GeneChip® Scanner 3000 7G (Affymetrix).

Genotypes of 302 microarrays that passed the initial QC were called with the Birdseed algorithm (version 2) in the Affymetrix Power Tools package (http://www.affymetrix.com/support/developer/powertools/index.affx) with default parameters. Because our samples contain no females, CEL files from 15 unrelated CEU female samples were included in the calling process following the manufacturer’s recommendation. After genotype calling, we calculated pairwise allele-sharing genetic distances between each pair of individuals. Five comparisons showed unusually small genetic distances, indicating close relatedness between these pairs of individuals. Therefore, one individual was excluded from each pair in order to retain a set of unrelated individuals. One additional individual was removed because of ambiguous population information. The remaining 296 samples from 13 populations compose our dataset for analyses.
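The relatedness screen described above can be sketched as follows. The text does not give the exact distance formula or the cutoff used, so this example assumes a common definition of allele-sharing distance (mean absolute difference in minor-allele count, divided by two) and a purely illustrative threshold; all names and genotypes are hypothetical.

```python
from itertools import combinations


def allele_sharing_distance(g1, g2):
    """Distance in [0, 1] between two genotype vectors.

    Genotypes are coded as 0, 1 or 2 copies of the minor allele;
    None marks a missing call and that locus is skipped.
    """
    diffs = [abs(a - b) for a, b in zip(g1, g2)
             if a is not None and b is not None]
    if not diffs:
        raise ValueError("no loci genotyped in both individuals")
    return sum(diffs) / (2 * len(diffs))


def close_pairs(genotypes, threshold):
    """Return pairs of sample names whose distance falls below threshold."""
    return [(a, b) for a, b in combinations(sorted(genotypes), 2)
            if allele_sharing_distance(genotypes[a], genotypes[b]) < threshold]


# Toy genotypes (hypothetical); the 0.1 threshold is illustrative only.
samples = {
    "ind1": [0, 1, 2, 2, 1, 0],
    "ind2": [0, 1, 2, 2, 1, 1],   # nearly identical to ind1
    "ind3": [2, 0, 0, 1, None, 2],
}
print(close_pairs(samples, threshold=0.1))
```

One member of each flagged pair would then be dropped to leave a set of unrelated individuals.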

SNP selection

Several criteria were applied to select SNPs for the analyses. First, we excluded all SNPs on the X, Y, and mitochondrial chromosomes, as well as SNPs whose chromosomal locations were unknown (38,456 SNPs). Then, SNPs with more than 10% missing data were removed (5,742 SNPs). We next divided all individuals into four major groups (Africa, Asia, Europe, and India) and tested each SNP for deviations from Hardy-Weinberg Equilibrium (HWE) for populations within each group using the hweStrata algorithm 387. The continent-level HWE p-values were combined using Stouffer’s z-average method 388, and 213 SNPs that deviated from HWE at p < 5.5x10^-8 (Bonferroni correction: 0.05/900,000) were excluded from subsequent analyses. To combine our dataset with HapMap II samples, Affymetrix SNP Array 6.0 genotypes of the 210 unrelated HapMap samples were obtained from the HapMap project website (http://hapmap.ncbi.nlm.nih.gov), and the same SNP selection criteria were applied to HapMap samples. The filtered HapMap dataset was combined with the dataset generated in this study and a dataset from an earlier study 372 using the Affymetrix NspI 250K microarrays, resulting in a final dataset containing 246,554 autosomal loci genotyped in 850 individuals from 40 populations.
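To make the HWE filter concrete, the sketch below combines per-group p-values with an unweighted Stouffer z-score and compares the result with the Bonferroni threshold quoted above. Equal weighting of the four groups is an assumption (the text does not say whether weights were used), and the input p-values are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def stouffer_combine(p_values):
    """Combine one-sided p-values, each strictly between 0 and 1.

    Every p-value is converted to a z-score, the z-scores are summed
    and scaled by sqrt(k), and the result is converted back to a p-value.
    """
    z_scores = [-STD_NORMAL.inv_cdf(p) for p in p_values]
    z_combined = sum(z_scores) / sqrt(len(z_scores))
    return STD_NORMAL.cdf(-z_combined)


# Threshold from the text: 0.05 / 900,000, i.e. about 5.6e-8.
BONFERRONI_THRESHOLD = 0.05 / 900_000

# Hypothetical HWE p-values for one SNP in the four continental groups.
group_p = [1e-3, 2e-4, 5e-3, 1e-2]
combined = stouffer_combine(group_p)
print(combined, "exclude" if combined < BONFERRONI_THRESHOLD else "keep")
```

Note that a SNP can be moderately out of equilibrium in every group and still fall short of the genome-wide threshold after combination.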

CNV genotype calling

The microarray data for the 296 DNA samples were analyzed for CNVs using two complementary algorithms: a genomic segmentation algorithm (Partek, MO) and Birdsuite 389. The use of two complementary CNV detection algorithms increases the robustness of CNV detection 267. To minimize batch variability, an internal baseline was generated from all 296 samples and used in the segmentation CNV detection. A minimum of ten consecutive probes was required to detect a copy number change. CNVs were removed if the probe density was < 1 probe/5,000 bp, in order to remove potentially spurious CNV calls that cover centromeric regions. The Canary and Birdseye algorithms were used in Birdsuite version 1.5.3. We restricted our analysis to autosomal CNV calls that had a LOD score greater than or equal to 10, and that were greater than 1 kb in length. To obtain a conservative set of CNV regions, we removed any CNVs not found by both algorithms, leaving a stringent set of copy number regions for each individual. Genotypes of all samples in the final dataset (including both SNP and CNV genotypes) are available as a supplemental file on our website (http://jorde-lab.genetics.utah.edu/) under Published Data. The pre-filtering raw dataset is available upon request.
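The filtering thresholds listed above can be summarized in code. This is a sketch of the stated criteria only; the `CnvCall` record and function names are hypothetical and do not correspond to the actual output formats of Partek or Birdsuite.

```python
from dataclasses import dataclass

AUTOSOMES = {str(i) for i in range(1, 23)}


@dataclass
class CnvCall:
    chrom: str
    start: int       # bp
    end: int         # bp
    n_probes: int
    lod: float = 0.0

    @property
    def length(self):
        return self.end - self.start


def passes_segmentation_filter(call):
    """At least ten probes and a density of at least 1 probe per 5,000 bp."""
    if call.length <= 0:
        return False
    return call.n_probes >= 10 and call.n_probes / call.length >= 1 / 5000


def passes_birdsuite_filter(call):
    """Autosomal, LOD score >= 10 and longer than 1 kb."""
    return call.chrom in AUTOSOMES and call.lod >= 10 and call.length > 1000


# Hypothetical calls: one dense segment, one sparse (centromere-like) segment.
dense = CnvCall("3", 117_900_000, 117_950_000, n_probes=40, lod=25.0)
sparse = CnvCall("3", 90_000_000, 93_000_000, n_probes=12, lod=25.0)
print(passes_segmentation_filter(dense), passes_segmentation_filter(sparse))
```

A call would be retained only if it passed the filter for its own algorithm and was also found by the other algorithm.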

Data analysis

Haplotype Diversity

To standardize the population sample sizes, we combined several closely related populations into population groups and excluded remaining populations that had fewer than 20 individuals (see Table 1 for details). The combined population groups are: Nilotic (Alur and Hema), Bantu (Nguni, Pedi, and Sotho/Tswana), Daghestani (Stalskoe and Urkarah), Mala/Madiga (AP Madiga and AP Mala), and Tongan/Samoan (Tongan and Samoan). Then, we randomly chose 20 individuals from each population group to equalize the sample sizes. The genome was divided into consecutive 100 kb windows, and the number of SNP loci in the dataset was determined for each window. Windows with fewer than 10 loci in the final dataset were excluded. For windows containing more than ten SNPs, we calculated the haplotype heterozygosity 390 in each population using the MATLAB Population Genetics & Evolution Toolbox 391.
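For readers unfamiliar with the statistic, haplotype heterozygosity within one window can be sketched as below. The analysis itself used the MATLAB toolbox cited above; this Python version assumes the usual unbiased estimator, H = n/(n-1) * (1 - sum of squared haplotype frequencies), which may differ in detail from the toolbox implementation. The haplotype strings are hypothetical.

```python
from collections import Counter


def haplotype_heterozygosity(haplotypes):
    """Unbiased haplotype heterozygosity for one genomic window.

    haplotypes -- one string (or tuple) of alleles per sampled chromosome
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    counts = Counter(haplotypes)
    sum_sq_freq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_sq_freq)


# Hypothetical phased haplotypes over a 10-SNP window (8 chromosomes).
window = ["AACGTTACGA"] * 4 + ["AACGTTACGG"] * 3 + ["TACGTTACGA"]
print(round(haplotype_heterozygosity(window), 3))
```

With 20 individuals per population group, each window would contribute 40 phased haplotypes to this calculation.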

Population tree

Distances between populations were calculated from allele frequency data as Nei’s genetic distance implemented in the PHYLIP software package 392. The dataset contains 232,114 SNPs with known ancestral state for 40 world populations. Dendrograms were constructed using the neighbor-joining method. All ancestral allele states were obtained from the orthologous base in chimpanzee, or orangutan plus macaque if chimpanzee was unknown, as obtained from the UCSC database (hg19, snp130). Each dendrogram was rooted by this chimpanzee-orangutan-macaque outgroup. One thousand bootstrap runs were performed for each dataset to generate the consensus tree and obtain the confidence value for each branch.

FST estimates and Principal Component Analysis (PCA)

FST estimates between populations were calculated by the method described by Weir and Cockerham 393. To obtain the confidence interval of FST values in each continental group, 60, 60, and 90 individuals were randomly sampled 1000 times (with replacement) from Africa, Europe, and Asia (to match the sample sizes of the HapMap II populations), respectively. Pairwise allele-sharing genetic distance calculation and PCA were performed using MATLAB (ver. r2008a).
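The resampling scheme used for the FST confidence intervals can be sketched generically. The example below draws fixed-size samples with replacement from each continental group and reports a percentile interval for a caller-supplied statistic. The Weir and Cockerham estimator is not implemented here; the function name, the percentile method and the toy statistic are assumptions made for illustration.

```python
import random


def bootstrap_ci(groups, sizes, statistic, n_rep=1000, alpha=0.05, seed=0):
    """Percentile confidence interval from stratified resampling.

    groups    -- dict mapping group name to a list of individuals
    sizes     -- dict mapping group name to the number drawn per replicate
    statistic -- function taking the resampled dict and returning a float
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n_rep):
        # Draw each group independently, with replacement.
        sample = {name: rng.choices(members, k=sizes[name])
                  for name, members in groups.items()}
        values.append(statistic(sample))
    values.sort()
    lower = values[int(n_rep * alpha / 2)]
    upper = values[int(n_rep * (1 - alpha / 2)) - 1]
    return lower, upper


# Toy data: each "individual" is a single allele count (hypothetical).
toy_groups = {
    "Africa": [0, 1, 2, 1, 0, 2, 1],
    "Europe": [0, 0, 1, 0, 1],
    "Asia": [0, 0, 0, 1, 0, 0],
}
toy_sizes = {"Africa": 60, "Europe": 60, "Asia": 90}


def frequency_range(sample):
    """Spread of mean allele frequency across groups (stand-in for FST)."""
    freqs = [sum(v) / (2 * len(v)) for v in sample.values()]
    return max(freqs) - min(freqs)


print(bootstrap_ci(toy_groups, toy_sizes, frequency_range))
```

Replacing `frequency_range` with an FST estimator would reproduce the structure of the procedure described in the text.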

ADMIXTURE analysis

A model-based algorithm implemented in ADMIXTURE 394 was used to determine the genetic ancestries of each individual in a given number of populations without using information about population designation. To eliminate the effect of SNPs that are in LD, we first filtered out SNPs that had r^2 > 0.2 within 100 Kb using PLINK 395, as recommended by the authors of ADMIXTURE. The pruned data set contains 86,273 SNPs.

CNV analysis

CNV data were analyzed using internally developed software (available upon request) and SPSS 15.0 (SPSS, IL). We required a minimum of 75% reciprocal overlap between pairs of CNVs to consider that two individuals shared the same CNV region. A pairwise comparison of shared CNV regions allowed us to identify those CNVs that were private to individuals, private to specific populations, and those CNVs that were shared across multiple populations. To adjust for outlier effects, individuals above the 95th percentile for CNV number were removed from the analysis. A principal components analysis performed on all individuals indicated that the DNA samples from different DNA sources (i.e., blood versus saliva) may have different CNV calling results (Supp. Figure S1). Because DNA only from the 22 Tongan and Samoan samples was derived from blood, we excluded these subjects from the CNV analysis.
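The 75% reciprocal-overlap rule can be written down in a few lines. The internally developed software is not available here, so this is an independent sketch: the function names are hypothetical, and coordinates are assumed to be half-open intervals in base pairs.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Smaller of the two fractions of each interval covered by the overlap."""
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def same_cnv_region(cnv_a, cnv_b, threshold=0.75):
    """cnv_a, cnv_b -- (chromosome, start, end) tuples."""
    if cnv_a[0] != cnv_b[0]:
        return False
    return reciprocal_overlap(cnv_a[1], cnv_a[2], cnv_b[1], cnv_b[2]) >= threshold


# Hypothetical CNV calls from two individuals.
print(same_cnv_region(("3", 1_000, 11_000), ("3", 2_000, 11_500)))  # shared
print(same_cnv_region(("3", 1_000, 11_000), ("3", 6_000, 16_000)))  # not shared
```

Requiring the overlap to cover at least 75% of both calls prevents a small CNV nested inside a much larger one from being counted as the same region.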

7.4 Results

Population samples

We sampled 296 individuals from 13 world-wide populations, including populations from West Africa, Central Europe, West Asia, Central Asia, South Asia, Southeast Asia, Polynesia, and America (Figure 1, Table 1 populations in bold). All samples were genotyped using the Affymetrix 6.0 array and we will refer to this individual set as the “Affy6.0” set in the following analysis. We then combined these samples with 344 individuals from 23 populations in our previous study 372 (Figure 1, populations in black), in which the Affymetrix 250K NspI array was used (“Affy250K” set), and 210 individuals from four HapMap populations (YRI, CEU, CHB and JPT, “HapMap” set). The final dataset contains 246,554 autosomal loci genotyped in 850 individuals from 40 populations (See methods for details of SNP selection and merging criteria). To determine the effect of using only this subset of the SNPs, we compared pairwise FST between each pair of populations in the HapMap and the Affy6.0 sample set using the 246,554 SNP set and the whole SNP set (~866,000 SNPs). The FST values between all population pairs are virtually identical for the two SNP sets (overall correlation coefficient r = 0.99998, p << 10^-50), suggesting that the 250K SNP set is sufficient for examining inter-population relationships.

Decrease in population differentiation with more uniform sampling

To assess the effect of more even sampling on the degree of population differentiation, we compared the FST values between three major continental groups (Africa, Europe and Asia) from three individual sets: HapMap, HapMap+Affy6.0, and HapMap+Affy6.0+Affy250K. To match the sample sizes of the HapMap set, we randomly sampled 60, 60, and 90 individuals from Africa, Europe and Asia, respectively, in each individual set. Our results show that the overall FST value decreases substantially with the inclusion of geographically intermediate populations, dropping from 15.9% for HapMap, to 11.2% for HapMap+Affy6.0+Affy250K with non-overlapping confidence intervals (Table 2). Adding the American and Polynesian individuals into the HapMap+Affy6.0+Affy250K set increased FST slightly (to 11.3%) because of substantial founder effects and genetic drift in these populations. Nevertheless, the FST value in all individuals is still significantly lower than the FST value of the HapMap individual set. These statistically significant FST differences illustrate the important effects of population sampling. A decrease in population differentiation with more even sampling is also demonstrated by an increase in the proportion of SNPs whose minor alleles are shared in all three continental groups. This value increases from 74.9% for HapMap to 88.2% for HapMap+Affy6.0+Affy250K (Table 2).

For individuals that were genotyped for more than 866,000 autosomal SNPs using the Affymetrix 6.0 array (HapMap and Affy6.0), we also determined the FST values and proportion of polymorphic SNPs using all genotyped autosomal SNPs. In both individual sets the FST values using all SNPs are comparable to FST values using the ~250,000 SNPs (Table 2). The difference between the two individual sets remains significant. The percentage of shared polymorphic SNPs decreases slightly in both datasets (Table 2), reflecting a relatively higher proportion of low-frequency SNPs in the Affymetrix 6.0 array.

Haplotype diversity

To compare haplotype diversity across populations, we normalized the sample size across population groups by randomly choosing 20 individuals and excluding populations with fewer than 20 samples (see methods for details). The average haplotype heterozygosity is significantly higher in African populations than non-African populations (Table 1, Wilcoxon rank test p = 1.2x10^-4), and haplotype diversity decreases as geographic distance to east Africa increases (Figure 2A, r = -0.76, p = 4.3x10^-6). Despite the overall significant correlation, there appears to be little correlation within Africa between haplotype diversity and distance to east Africa (r = -0.13, p = 0.78). Indeed, when African populations were excluded from the analysis, a stronger correlation was obtained (r = -0.94, p = 4.3x10^-10, Figure 2A upper panel).

We also compared the SNP and haplotype heterozygosity values in each population (Figure 2B). These two quantities are generally highly correlated, although there are several exceptions: First, SNP heterozygosity is higher than haplotype heterozygosity in European and Central Asian populations. This may reflect a SNP ascertainment bias, since many of these polymorphisms were historically selected to maximize heterozygosity in European populations. Second, the Pygmy sample shows a low SNP heterozygosity despite relatively high haplotype heterozygosity. This unusual pattern could be caused by stronger effects of SNP ascertainment bias in this population than in others. Indeed, a recent study of Khoisan individuals (another hunter-gatherer group from Africa) showed a similar pattern: despite high SNP heterozygosity (~60%) in whole-genome sequence data, a Khoisan individual showed low heterozygosity on the SNP microarray genotypes (~22%) 396. Alternatively, this difference could also reflect unique attributes of population history.

Genetic structure among populations

To examine inter-population relationships, we first constructed a neighbor-joining tree based on genetic distances (Figure 3A). Populations from major geographic regions are clustered, and most branches have very high (>99%) bootstrap support (Supp. Figure S2). New World populations (Totonac and Bolivian) are placed between Nepalese and Kyrgyzstanis, indicating higher affinity of these American samples to central Asians than to eastern Asians. A second neighbor-joining tree was constructed by adding 40 HGDP populations (46,260 common SNPs), producing similar patterns of population clustering (Supp. Figure S3).

We then performed a Principal Component Analysis (PCA) based on the pairwise allele- sharing distances among all pairs of individuals (Figure 3B). The majority of the genetic variation is found between African and non-African populations, as the first principal component

(PC1) accounts for 78.7% of total variance. PC2 reflects genetic variation in Eurasia, and populations from Central and West Asia occupy the space between East Asia and Europe to form a relatively continuous distribution. The two Polynesian populations (Tongan and Samoan) show a close relationship to Southeast Asian populations (Figure 3B). PC3 distinguishes New World populations (Bolivian and Totonac) from other populations (Supp. Figure S4A).
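The PCA step described above can be illustrated with a minimal sketch that computes pairwise allele-sharing distances from a genotype matrix and extracts principal coordinates by classical multidimensional scaling. The 0/1/2 genotype coding, the distance definition (mean absolute genotype difference divided by 2), the toy data, and the function names are assumptions made for illustration; they are not taken from the pipeline used in this study.

```python
import numpy as np

def allele_sharing_distance(genotypes):
    """Pairwise allele-sharing distance between individuals.

    genotypes: (n_individuals, n_snps) array of 0/1/2 allele counts
    (assumed coding). Distance = mean over SNPs of |g_i - g_j| / 2.
    """
    g = np.asarray(genotypes, dtype=float)
    diff = np.abs(g[:, None, :] - g[None, :, :])  # shape (n, n, n_snps)
    return diff.mean(axis=2) / 2.0

def principal_coordinates(dist, n_components=2):
    """Classical MDS on a distance matrix (double-centering + eigendecomposition)."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]            # re-sort to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    positive = np.clip(eigvals, 0.0, None)       # drop negative eigenvalues
    coords = eigvecs[:, :n_components] * np.sqrt(positive[:n_components])
    explained = positive[:n_components] / positive.sum()
    return coords, explained

# Hypothetical toy data: two pairs of genetically similar individuals.
toy = np.array([[0, 0, 0, 0],
                [0, 0, 0, 1],
                [2, 2, 2, 2],
                [2, 2, 2, 1]])
dist = allele_sharing_distance(toy)
coords, explained = principal_coordinates(dist)
```

In this toy example the first coordinate separates the two pairs of individuals, which is the same kind of signal PC1 carries between African and non-African samples in the real data.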

At the sub-continental level, we focus first on Eurasia, where most of our samples have been selected (Figure 4A). Overall, PC1 and PC2 mainly reflect the geographic distribution of the populations, with the majority of genetic variation accounted for by their locations. PC1

(accounting for 62.7% of the variance) reflects an east-west gradient, while PC2 (3.3% of the variance) reflects a north-south gradient. Slovenians and Iraqi Kurds show close relationships to

European populations. A closer examination (Supp. Figure S4B) shows that Kurds and eastern

European Daghestani populations (Urkarah and Stalskoe) are clearly separated from western

European populations. On the other hand, Slovenians show very little differentiation from western European populations (Supp. Figure S4B).

Some of our populations form less defined clusters than do the HapMap populations. The

Nepalese samples, in particular, are highly diverse, with some individuals showing a closer relationship to East Asian populations, while others are closer to South Asian populations. An examination of the ethnicity of the Nepalese individuals reveals that individuals from the ethnic groups derived from the caste system, including Madhesi, Brahman, and Chhetri, show a closer relationship to South Asian populations (especially Indian Brahmins). Individuals from the two indigenous Nepal ethnic groups (Newar and Magar) are closer to Central/East Asian populations

(Figure 4B). Kyrgyzstanis were also widely dispersed along the first PC, although to a lesser extent than the Nepalese samples. This dispersion is expected because Kyrgyzstan is on the trade route between Europe and Asia, where there has long been a high level of migration.

Distinctive patterns can also be observed at the sub-continental level in non-Eurasian populations. Within Africa, the first two PCs separate Mbuti Pygmy and !Kung from other

African populations (Supp. Figure S4C). The remaining African populations appear to follow a north-south gradient, and the Dogon and Bambara from Mali show high similarity to the

HapMap YRI from Nigeria (Supp. Figure S4C). Within America, the two populations showed contrasting patterns: Totonacs from Mexico form a tight cluster, while about half of the Bolivian samples are separated from the Bolivian cluster, which appears to reflect European admixture

(Supp. Figure S4D).

Individual group membership

We used the program ADMIXTURE 394 to assess the ancestry of each individual, assuming 3 to 12 inferred populations (K) (Supplemental Table S1). The results from K=4 and K=12 are illustrated in Figure 3C. When K=4, four groups corresponding to Africa, America, Europe, and

Asia are identified. Unlike individuals from Africa and America, who form two relatively distinct groups, individuals from Eurasia show a mixture of Asian and European ancestry components.

When K=12, a number of sub-continental patterns appear. In Africa, Mbuti Pygmy,

!Kung, and Dogon are separated into distinct groups. Despite being sampled from neighboring regions in Mali, Bambaran and Dogon individuals show quite different ancestry. Most Dogon individuals appear to be composed of a single western African component, while Bambaran individuals contain more than 30% of a component prevalent in eastern Africa. Polynesian and

American populations were separated into two distinct components. In agreement with the PCA result, some Bolivian individuals contain more than 20% European ancestry, suggesting admixture in these samples.

Within Eurasia, the patterns are more complex. To examine the relationships among

Eurasian populations in detail, we performed ADMIXTURE analysis on the Eurasian individuals only and calculated the average ancestry components in each population. Major regional groups and geographic clines are best visualized with seven ancestral components (K=7, Figure 5). In

Europe, a northern/western European component is predominant in HapMap CEU, the Utah

Northern European, and the Slovenian samples. One Caucasus/Middle East component is predominant in Daghestani and Iraqi samples and appears to decrease clinally to the east through

Pakistan and Nepal and to the west through southern and northern Europe. In southern India, this component is a major genetic signal in two independently sampled Brahmin groups (>20%) but is nearly absent in lower castes and Irula (a tribal group, < 1.5%). Notably, the central Asian populations of Nepal and Kyrgyzstan have the most genetic admixture. This result is consistent with our PCA results showing a high level of genetic variation within these two populations.
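The per-population averaging of ancestry components can be sketched as follows. This is a minimal illustration that assumes the per-individual ancestry fractions are already available as a matrix with one row per individual and one column per component; the function name, labels, and toy values are hypothetical.

```python
import numpy as np

def mean_ancestry_by_population(q_matrix, populations):
    """Average ancestry fractions per population.

    q_matrix: (n_individuals, K) array of ancestry fractions (rows sum to 1).
    populations: sequence of population labels, one per individual.
    Returns a dict mapping population label -> length-K mean vector.
    """
    q = np.asarray(q_matrix, dtype=float)
    labels = np.asarray(populations)
    return {pop: q[labels == pop].mean(axis=0) for pop in sorted(set(populations))}

# Hypothetical toy input with K = 2 components.
q_toy = [[0.9, 0.1],
         [0.7, 0.3],
         [0.2, 0.8]]
pops_toy = ["A", "A", "B"]
means = mean_ancestry_by_population(q_toy, pops_toy)
```

Each resulting vector corresponds to one pie chart of the kind drawn on the map in Figure 5.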

Another interesting observation is that Buryats and Kyrgyzstanis share about 5% ancestry with Native American populations (averages of 4.4% in Kyrgyzstanis and 5.8% in Buryats), while

East Asian individuals have very little of the Native American ancestry component (average

<1%).

Copy Number Variation (CNV) profile

As a complement to our SNP analysis, we also used the same array platform to determine each individual’s CNV profile. To investigate the overall inter-population differences due to

CNVs, we determined the number of CNVs per person and the average CNV frequencies in each population (Figure 6A). The African populations (Bambaran and Dogon) have the highest number of CNVs among all populations (median of 44 and 42 CNVs per genome, respectively).

Outside of Africa, the median number of CNVs ranges from 38 in Kyrgyzstanis to 30 in Totonacs

(Figure 6A). These data are comparable with previous work, which found a higher number of

CNVs in African populations 191,397-399, suggesting a loss of low-frequency CNV alleles due to population bottlenecks during the out-of-Africa migration and the peopling of the Americas.

Next, we identified CNVs that are specific to each population and then counted the number of individuals within each population sharing the same population-specific CNV (Figure

6B). Within the Dogon, Bambaran, Pakistani, and Totonac populations, we found a high proportion of population-specific CNVs that were common to multiple members, with more than

12% observed in two or more individuals. The remaining populations had few population-specific CNVs in common among their members. More than 90% of detected population-specific

CNVs in these populations are only present in one individual. Because most population-specific

CNVs are relatively rare within each population, and there are only a small number of total CNV loci, samples from different populations do not form distinct clusters in a PCA (Supp. Figure

S5).
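The counting of population-specific CNVs and of how many of them recur in two or more individuals can be sketched as below. The sketch assumes that CNV calls have already been merged into shared identifiers across individuals; that merging step, the data layout, and all names are hypothetical and not taken from the study.

```python
from collections import Counter

def population_specific_recurrence(cnv_calls):
    """Summarize population-specific CNVs.

    cnv_calls: dict mapping population -> list of per-individual sets of
    CNV identifiers (assumed to be comparable across individuals).
    Returns dict population -> (number of population-specific CNVs,
    number of those seen in two or more individuals).
    """
    # All CNVs observed in each population.
    seen = {pop: set().union(*inds) for pop, inds in cnv_calls.items()}
    summary = {}
    for pop, inds in cnv_calls.items():
        others = set().union(*(s for p, s in seen.items() if p != pop))
        specific = seen[pop] - others
        carriers = Counter(cnv for ind in inds for cnv in ind if cnv in specific)
        recurrent = sum(1 for count in carriers.values() if count >= 2)
        summary[pop] = (len(specific), recurrent)
    return summary

# Hypothetical toy calls: CNV "c" is shared between populations, so it is
# not population-specific; "a" is specific to P1 and seen in two individuals.
toy_calls = {
    "P1": [{"a", "b"}, {"a", "c"}],
    "P2": [{"c", "d"}, {"e"}],
}
recurrence = population_specific_recurrence(toy_calls)
```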

We also investigated CNVs that are common between pairs of populations (Figure 6C). A comparison of the African populations (Dogon and Bambaran) revealed that 23% of CNVs were present in both populations, while both groups had little in common with any other population.

There is also a relatively high proportion of CNVs in common between the Slovenian and Iraqi populations. Likewise, the Pakistanis, Kyrgyzstanis, Nepalese and Buryats all have a high percentage of CNVs in common (14-19%, Figure 6C). This pattern is consistent with the population affinities shown by PCA and ADMIXTURE analysis of the SNP data (Figure 4A and

5). Finally, the Totonac and Bolivian populations have the highest proportion of CNVs in common, with 27.1% of CNVs identified in both populations. This high proportion of CNV sharing and the relatively low number of CNVs identified in these populations may be due to the low genetic diversity in their common founding population.
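The pairwise sharing percentages can be illustrated with a short sketch that expresses the number of CNVs present in both populations as a percentage of the CNVs in each individual population, following the description in the Figure 6C legend. The set-based representation and the identifiers are assumptions for illustration only.

```python
def pairwise_cnv_sharing(pop_cnvs):
    """Percentage of each population's CNVs that are also found in another.

    pop_cnvs: dict mapping population -> set of CNV identifiers.
    Returns dict (a, b) -> percent of a's CNVs that are present in b.
    """
    sharing = {}
    for a, cnvs_a in pop_cnvs.items():
        for b, cnvs_b in pop_cnvs.items():
            if a == b or not cnvs_a:
                continue  # skip self-comparisons and empty populations
            sharing[(a, b)] = 100.0 * len(cnvs_a & cnvs_b) / len(cnvs_a)
    return sharing

# Hypothetical toy input: two CNVs (3 and 4) are shared.
toy_sets = {"X": {1, 2, 3, 4}, "Y": {3, 4, 5}}
sharing = pairwise_cnv_sharing(toy_sets)
```

Because the denominator is the CNV count of the first population in each pair, the resulting matrix is not symmetric, which is worth keeping in mind when reading a heatmap built this way.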

7.5 Discussion

Patterns of human genetic variation are influenced by mating patterns, and the latter are in turn influenced by geographic and cultural factors (e.g., mountain ranges, language, religious practices). Consequently, it is not surprising that human genetic variation, while correlated with geographic location, is not perfectly clinal 400-402. However, between-population differences can be seriously exaggerated if human populations are sparsely sampled.

Consistent with previous studies 400,402,403, our analyses demonstrate that differentiation among human populations decreases substantially and genetic diversity is distributed in a more clinal pattern when more geographically intermediate populations are sampled. The reduction of

FST values with further geographic sampling illustrates the limitations of a global FST estimate to capture the pattern of human genetic diversity. With more comprehensive population samples, our data have also led to several new observations about human demographic history and genetic relationships among human populations.

The out of Africa (OoA) bottleneck and the peopling of Eurasia

As observed in previous studies 369,370,372, we find that SNP and haplotype variation is highest in African populations, and that heterozygosity in non-African populations declines with geographic distance from Africa. This decline in heterozygosity has been interpreted as evidence for a worldwide serial founder effect originating in East Africa 369,404. While serial founder effects may explain much of the pattern of worldwide variation, we note two interesting deviations from the prediction of a linear decline in heterozygosity. First, as demonstrated in

Figure 2A, there appears to be little relationship between heterozygosity within Africa and distance from the hypothesized point of East African origin (r = -0.13, p = 0.78). Second, there is a drastic decrease in diversity for all Eurasian populations immediately outside of Africa. These observations are best explained by a single bottleneck out of Africa rather than by a series of founding emigrations from Africa (Figure 2A).

The OoA hypothesis, proposing a single OoA bottleneck followed by an expansion into

Eurasia approximately 50,000 years ago, has gained extensive support from the archaeological record 405,406 and genetic studies 369,370,372. Nevertheless, many of the historical details of this diaspora remain unclear. A common interpretation is that the OoA bottleneck was the result of a migration of a small founding population into Eurasia. Given the difference in haplotype heterozygosity between African and non-African populations and the relationship between heterozygosity and effective population size, we can estimate the effective population size of such a founding population 407. Within Africa, the average 100-kb haplotype heterozygosity in our data is 0.91. Immediately outside of Africa in Europe, the Middle East, and Central Asia, the average haplotype heterozygosity is 0.82 (Figure 2). A reduction of heterozygosity from 0.91 to

0.82 in a one-generation bottleneck would require an effective population size of only 5.5 individuals. While a one-generation bottleneck is an oversimplification, these estimates indicate that an OoA bottleneck resulting from the migration of a small founding population would require an extremely small population size. However, given that the archaeological record indicates a rapid expansion of modern humans into Europe and Asia in just a few thousand years

405,406, it seems unlikely that Eurasia could be populated so quickly by such a small founding population.
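The order of magnitude of this bottleneck estimate can be checked with the standard single-generation drift expectation, H_after = H_before x (1 - 1/(2 Ne)). This textbook relation is an assumption here and is not necessarily the estimator used in reference 407. With the rounded heterozygosities quoted above it gives an effective size of about 5, close to the reported 5.5; the small difference may reflect unrounded inputs or a different formula.

```python
def bottleneck_effective_size(h_before, h_after):
    """Effective size implied by a one-generation drop in heterozygosity.

    Assumes H_after = H_before * (1 - 1 / (2 * Ne)), the standard
    expectation for one generation of drift in a diploid population.
    """
    if not 0 < h_after < h_before:
        raise ValueError("expected 0 < h_after < h_before")
    loss_fraction = 1.0 - h_after / h_before
    return 1.0 / (2.0 * loss_fraction)

# Rounded haplotype heterozygosities from the text.
ne = bottleneck_effective_size(0.91, 0.82)
print(round(ne, 2))  # 5.06
```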

A more likely explanation for the OoA bottleneck is that Eurasia was populated by a larger population that had been relatively isolated from other modern human populations for tens of thousands of years prior to the expansion. The first fossil evidence for modern humans outside of Africa is in the Middle East at Skhul and Qafzeh between 80,000-100,000 years ago, which is at least 20,000 years prior to the Eurasian diaspora 408. If a population of modern humans remained in the Middle East until the expansion into Eurasia, there would have been sufficient time for genetic drift to reduce heterozygosity dramatically before the Eurasia expansion. This

“Middle East isolation” hypothesis provides a robust explanation for the relative homogeneity of

European and Asian populations relative to African populations (see Figures 3A-B) and is supported by a recent maximum likelihood estimate of 140,000 years ago for the time of

Eurasian-West African population separation 409. Interestingly, a recent study of the Neandertal genome suggests that non-African individuals, but not Africans, contain similar amounts of admixture (1-4%) with Neandertals 410. The authors suggest that the admixture must have happened between Neandertals and an ancestral non-African population before the Eurasian expansion. Given the fossil, archaeological, and genetic evidence, the Middle East isolation hypothesis warrants rigorous evaluation as whole-genome sequence data become available.

Dispersion of a Caucasus/Middle East genetic component

In the ADMIXTURE analysis of Eurasia, we observed a clinal distribution of a

Caucasus/Middle East genetic component (red component, Figure 5) in several South Asian populations. Evidence from mitochondrial DNA, Y-chromosome, and autosomal loci suggests that the genetic composition of India has been influenced by west Eurasians 372,373,411,412. We find that this ancestry component is most prevalent in West Asians (Iraqi Kurd) and Caucasus populations (Daghestani). The component extends eastward into Central Asia (Pakistan, Nepal, and Kyrgyzstan) and into South India, where it is more prevalent in higher castes than in lower castes. This ancestry component also extends into Europe and is more prevalent in southern

Europeans than in northern Europeans. Our results suggest that the northern Indian genetic component proposed by Reich et al 373 could represent the dispersion of a genetic ancestry component originating near the Caucasus/Middle East region.

Nepalese diversity

Containing more than 100 ethnic groups, Nepal is a geographically small but diverse country 413. Earlier genetic studies of Nepalese populations have suggested a northern Asian origin with subsequent gene flow from South Asia (e.g., Hindu caste-derived groups) 414-416. Our results are in general agreement with this view and suggest that the most prevalent ancestry component in the Nepalese is the primary ancestry component found in Indians and Pakistanis.

The Nepalese, however, are highly heterogeneous and also have substantial ancestry components from Central Asia, East Asia, and Southeast Asia. Moreover, individual Nepalese from different ethnic groups have substantially different genetic composition. Hindu upper-caste Nepalese

Brahman and Chhetri individuals cluster in PCA and show affinity to Indian Brahmin samples

(Figure 4B). In contrast, samples from the linguistically distinct Magar and Newar groups show affinity to populations from Central and East Asia. These results suggest that substantial population structure may exist between the major population groups of Nepal. Although our limited sample size prevents a detailed analysis of the genetic diversity among Nepalese ethnic groups, our observations suggest high levels of genetic diversity in South and Central Asian populations and underscore the need for additional genetic studies of this region.

Native American founding populations

The Americas, first peopled during the late Pleistocene, were the last continents to be colonized by modern humans. Despite general agreement that modern humans crossed a land bridge in the current Bering Strait region to populate the Americas (reviewed in 417-419), the exact timing, routes of colonization, and origin of the ancestral population(s) remain unclear 420-424.

Earlier studies suggest that an ancestral American population may have lived in western

Siberia, rather than eastern Siberia/Northern Asia 425,426. Congruent with this view, the two

Native American populations (Totonac and Bolivian) in our samples show closer relationships to

Central Asian populations (Kyrgyzstanis and Buryats from Mongolia) than East Asian populations (e.g., Chinese and Japanese). This result is most apparent in the ADMIXTURE plot

(Figure 3C; K=12), where Kyrgyzstani and Buryat individuals share about 5% of the American ancestry component. In contrast, East Asian individuals share very little (< 1%) genetic ancestry with the American populations.

CNV population profiles

In previous studies, we have shown highly consistent patterns of population genetic structure when using different types of polymorphisms, such as restriction site polymorphisms, short tandem repeat polymorphisms, and Alu and L1 insertion polymorphisms 400,427-429.

Similarly, despite a very different mutational mechanism, CNVs also reveal overall patterns of genetic structure that are highly similar to those of other types of polymorphisms: First, we find that populations from Africa harbor the greatest number of CNVs, and that the average number of CNVs decreases with increasing distance from Africa. Second, we find that the degree of

CNV sharing between groups reflects their population relationships. Notably, the Totonac and

Bolivian populations share a high number of CNVs. The Pakistani, Kyrgyzstani, Nepalese, and

Buryat populations also exhibit a high number of shared CNVs. Previous studies have also shown general agreement in genetic structure patterns revealed by SNP and CNV data 370.

7.6 Conclusion

In this study, by sampling populations from previously under-sampled regions, we sought to assess the effect of more even sampling on human genetic diversity and to investigate the evolutionary history of these populations. We found support for a relationship between the initial founding populations of America and Central/North Asian populations. We demonstrated high genetic diversity in Central Asian and South Asian populations, especially in Nepal. We also found that Iraqi Kurds have a closer relationship to European populations than Asian populations. These results increase our understanding of human population relationships and evolutionary history. In addition, our data provide a resource for understanding patterns of linkage disequilibrium, natural selection and the differential distributions of SNP and CNV alleles among populations, all of which have important implications in genome-wide association studies and the identification of loci with functional, biomedical significance.

Figure Legends

Figure 1. Population samples analyzed in this study. The number of individuals sampled in each population is shown at the bottom of the figure. Populations genotyped in this study are colored in red and populations obtained from the HapMap project and Xing et al. 372 are shown in black. Populations are labeled with filled (Affymetrix 6.0 array) or empty circles (Affymetrix

NspI 250K array) on the map based on the genotyping platforms.

Figure 2. SNP diversity. A) SNP haplotype diversity as a function of geographic distance from East Africa. The correlation improved substantially when African populations were excluded (r = -0.96, upper right panel). B) SNP haplotype diversity versus SNP heterozygosity.

Figure 3. Population relationships between the 40 populations. A) Neighbor-joining tree.

Populations are color-coded based on their continental origins. The hypothetical ancestral population is shown. Bootstrap support values for most branches are larger than 95% (the bootstrap consensus tree is shown in Supp. Figure S1). B) Principal components analysis. First two principal components (PCs) are shown. Each individual is represented by one dot and the color label corresponding to their regional origin. The percentage of variance explained by each

PC is shown on the axis. C) Individual grouping inferred by ADMIXTURE. Results from K =

4 and K = 12 are shown. Each individual’s genome is represented by a vertical bar composed of colored sections, where each section represents the proportion of an individual’s ancestry derived from one of the K ancestral populations. Individuals are arrayed horizontally and grouped by population as indicated.

Figure 4. Principal components analysis of population structure. First two PCs are shown.

The percentage of variance explained by each PC is shown on the axis. A) Eurasia. Each individual is represented by one dot and the color label corresponding to their population. B)

Nepalese and surrounding Asian populations. Nepalese individuals are represented by squares and the color label corresponding to their ethnic groups. Two Nepalese who have no ethnic group information were excluded from this plot. Other Asian individuals are represented by dots and color labels corresponding to their regional origins to improve the resolution. The regional groups include: India Brahmin (A.P. Brahmin and T.N. Brahmin); India Lower Caste

(A.P. Madiga, A.P. Mala, T.N. Dalit); East Asia (CHB, Chinese, Japanese, JPT); Southeast Asia

(Cambodian, Iban, Thai, Vietnamese).

Figure 5. ADMIXTURE analysis of Eurasian individuals with K = 7. Each individual’s genome is represented by a vertical bar composed of colored sections (bottom of the figure), where each section represents the proportion of an individual’s ancestry derived from one of the seven ancestral populations. Individuals are arrayed horizontally and grouped by population as indicated. In the map, the average ancestral components of each population are illustrated as pie charts.

Figure 6. CNV profile among populations. A) The median number of CNVs in each population. Red bars represent the median number of CNVs. The central boxes span the quartiles and the whiskers extend to the most extreme data points not considered outliers. Outliers are indicated by red dots. B) Population-specific CNVs found in multiple individuals. The percentage of population-specific CNVs present in multiple individuals within each population is shown. The number of CNVs present in 2 (blue), 3 (red), 4 (green), or ≥5 (purple) individuals is represented as a percentage of the total number of CNVs within that population. C) CNV sharing between population pairs. A heatmap showing the CNV overlap between each population pair is shown. The number of CNVs present in both populations was calculated as a percentage of the total number of CNVs in each individual population. The scale below the figure represents the range of percentage values, ranging from 0% (light blue) to 27% (bright red).

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Figure 6

Chapter 8

8 Appendix 3 - New Variants at 10q26 and 15q21 Are Associated With Aggressive Prostate Cancer in a Genome-Wide Association Study from a Prostate Biopsy Screening Cohort

This chapter has been submitted for publication.

Contribution: CNV discovery.

Robert K. Nam, William Zhang, Katherine Siminovitch, Adam Shlien, Michael W. Kattan, Arun Seth, Laurence H. Klotz, John Trachtenberg, Yan Lu, Jinyi Zhang, Changhong Yu, Ants Toi, D. Andrew Loblaw, Vasundara Venkateswaran, Aleksandra Staminirovic, Linda Sugar, David Malkin, and Steven A. Narod


8.1 Abstract

We conducted a genome-wide association study among patients with aggressive forms of prostate cancer and biopsy-proven normal controls ascertained from a prostate cancer screening program. We found significant associations between aggressive prostate cancer and five single nucleotide polymorphisms (SNPs) in the 10q26 (rs11199874, rs10749408, and rs10788165, p-values for association 1.3 x 10^-10 to 3.2 x 10^-11) and 15q21 (rs4775302 and rs1994198, p-values for association 3.1 x 10^-8 to 8.2 x 10^-9) regions. Results of a replication study done in 3439 patients undergoing a prostate biopsy revealed certain combinations of these SNPs to be significantly associated not only with prostate cancer but also with aggressive forms of prostate cancer using an established classification criterion for prostate cancer progression (odds ratios for intermediate to high-risk disease 1.8 to 3.0, p-value 0.003 to 0.001). These SNP combinations were also important clinical predictors for prostate cancer detection based on nomogram analysis that assesses prostate cancer risk.

8.2 Introduction

Several genome-wide association studies (GWAS) have identified a number of genomic variants which are associated with an increased risk of prostate cancer, particularly from the

8q24 region [Eeles, 1998 #712; Gudmundsson, 2007 #1087; Yeager, 2007 #1088]. Although these associations are statistically significant, it remains unclear to what extent these high-risk genotypes are associated with aggressive forms of prostate cancer [Penney, 2009 #1189;

Wiklund, 2009 #1196].

A primary goal of evaluating biomarkers for the early detection of prostate cancer is to distinguish patients who will eventually develop metastases from those with more indolent forms of cancer. Recently, a large, multi-centred study, including subjects from the Physicians’ Health Study, failed to show any association between these and other SNPs identified in past GWAS and aggressive or lethal forms of prostate cancer [Penney, 2009 #1189]. Also, a recent study examining the most significant SNPs found by past GWAS found no association with prostate cancer outcomes, including measures of aggressiveness and cancer-specific mortality [Wiklund, 2009 #1196]. Thus, although new SNP associations have been numerous, none have been clinically useful, since these SNPs cannot identify patients with aggressive forms of prostate cancer.

In a typical GWAS, the cases and controls are not derived from the same patient sample; however, this is not the case when screened subjects are studied. In a screening study, controls may be selected from men who screen negative for the cancer. Recent data from the Prostate

Cancer Prevention Trial [Thompson, 2003 #887] has established that men who are judged to be at low risk for prostate cancer (i.e. normal patients), but who undergo a prostate biopsy, have a prevalence rate of 25% for prostate cancer. One-quarter of these are high-grade, aggressive cancers [Thompson, 2004 #906]. Thus, the misclassification of cases as controls may diminish the potential of discovering SNPs that help to identify men with aggressive, high grade prostate cancer. By using men with negative biopsies as controls, the potential for misclassification is minimised.

To identify new SNP variants for aggressive prostate cancer, we conducted a GWAS among men who had a prostate biopsy, using a two-stage approach. In the first stage, 316 cases and 229 controls were genotyped using the Affymetrix 500K SNP array (443,816 SNPs). Cases were patients with aggressive forms of prostate cancer using the established D’Amico classification criteria [D'Amico, 1998 #1182], and controls were biopsy proven normal patients.

In the second stage, we genotyped positive SNPs found from stage 1 among 3439 patients who underwent prostate biopsy for prostate cancer screening. We investigated their clinical significance by examining their association with D’Amico criteria outcomes and by nomogram analysis in predicting prostate cancer risk.

8.3 Results

After adjusting for population stratification, 20 SNPs were selected from the first stage for further study (Table 1). These 20 SNPs were selected based on: 1) a Bonferroni corrected p-value of less than 0.01, and the SNP was from a region which has previously been shown to harbor a locus associated with prostate cancer (n = 5); 2) two or more SNPs were in linkage disequilibrium and the Bonferroni corrected p-value for each was less than 0.01 (n = 9); or 3) the Bonferroni corrected p-value was 10^-5 or less (n = 6). The 20 SNPs were tested in an independent sample of 3439 men of various ethnicities. All 3439 men underwent a prostate biopsy; 1663 (48.4%) were diagnosed with prostate cancer, and 1776 (51.6%) did not have any evidence of cancer from biopsy (Table 2). Among the 1663 men with cancer, 413 (24.8%) had a low risk cancer (based on the D’Amico classification of prostate cancer aggressiveness), 944

(56.8%) had an intermediate risk cancer, and 306 (18.4%) had a high risk prostate cancer.
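The Bonferroni thresholds used for SNP selection can be illustrated with a short sketch. It assumes that the correction was applied across all 443,816 SNPs on the array, which the text does not state explicitly; the raw p-value in the example is hypothetical.

```python
N_SNPS = 443_816  # number of SNPs genotyped in stage 1, from the text

def bonferroni(p_value, n_tests=N_SNPS):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_value * n_tests)

# Hypothetical raw p-value: 1e-8 corrects to about 0.0044, which would
# pass the 0.01 threshold but not the stricter 10^-5 threshold.
corrected = bonferroni(1e-8)
passes_first_threshold = corrected < 0.01
passes_strict_threshold = corrected <= 1e-5
```

Under this assumption, a corrected p-value of 10^-5 or less corresponds to a raw p-value of roughly 2 x 10^-11 or smaller.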

In the replication set, nine of the 20 SNPs were significantly associated with prostate cancer. The strongest associations were found for five SNPs: three SNPs at region 10q26 (p-value 6 x 10^-7 to 3 x 10^-10) and two SNPs at 15q21 (p-value 7 x 10^-6 to 1 x 10^-7) (Table 2). SNPs from other regions, at 10q23 (rs7089868, rs11596082 and rs2351337) and at 8q13 (rs2053140) were also significantly associated with prostate cancer, but the p-values ranged from 0.05 to

0.02.

The associations between the five SNPs and prostate cancer were only present among white subjects. No associations were found for men of Asian or African ancestries, although the sample sizes for these ethnic groups were smaller (Table 3). No other significant association was found with the other 15 SNPs. Of the three SNPs from region 10q26, two SNPs (rs11199874 and rs10788165) were in strong linkage disequilibrium (r^2=0.97 in controls). The third SNP, rs10749408, was not in LD with either of the other two. The two SNPs from region 15q21, rs4775302 and rs1994198, were in strong linkage disequilibrium (r^2=0.99 in controls).

We asked whether these SNPs were associated with prostate cancer, after adjusting for established risk factors and other variables (age, family history of prostate cancer, ethnicity, urinary symptoms, prostate specific antigen (PSA) level and digital rectal examination). The adjusted odds ratios, based on the risk genotype classes for the five SNPs, ranged from 1.26 to

1.42 (Table 4). To determine whether these variants were also associated with aggressive prostate cancer, we used the D’Amico criteria to divide the cases into low, intermediate and high-risk categories for prostate cancer progression. These criteria combine Gleason Score grade, clinical stage and PSA level at diagnosis into risk group categories and are a well established method of predicting prostate cancer mortality [D'Amico, 1998 #1182].

We chose one SNP to represent each of the two haplotype blocks (the one with the highest odds ratio for prostate cancer) which, together with the independent SNP rs10749408, resulted in three SNPs (rs11199874, rs10749408, and rs4775302). We combined the risk alleles from the three SNPs and compared the frequencies of patients with low, intermediate, and high risk prostate cancer and with no cancer, by the number of risk alleles (Table 5). Within each risk category, the odds ratios for cancer increased by the number of variant alleles (Table 6). In particular, for patients who were diagnosed with high risk prostate cancer, those with three variant alleles had a 3-fold increase in risk (odds ratio being 3.0,

95% CI:1.5-5.8) for having high risk prostate cancer, compared to patients with no variant allele.

The odds ratio for high risk cancer for patients with two variant alleles was 1.8 (95% CI:0.9-3.4) and for one variant allele was 1.2 (95% CI:0.6-2.4), with a significant increasing trend by the number of variant alleles (p<0.0001) (Table 6). There was a significant positive trend in the odds ratios for having low, intermediate and high risk prostate cancer for patients with the three variant risk alleles (p=0.0002). No significant trends were observed for patients with 1 or 2 variant alleles across disease risk categories. Overall, there was a global positive trend association by risk group and number of variant risk alleles (p=0.048).
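For readers unfamiliar with how an odds ratio and its 95% confidence interval are obtained, the following minimal sketch computes an unadjusted odds ratio with a Wald interval from a 2x2 table. The counts are hypothetical and chosen only to give an odds ratio of 3.0; the estimates reported above may come from adjusted regression models, which this sketch does not reproduce.

```python
import math

def odds_ratio_wald(exposed_cases, exposed_controls,
                    unexposed_cases, unexposed_controls, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval (z=1.96 -> 95%)."""
    a, b, c, d = exposed_cases, exposed_controls, unexposed_cases, unexposed_controls
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    ci_low = math.exp(math.log(odds_ratio) - z * se_log_or)
    ci_high = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, ci_low, ci_high

# Hypothetical counts: carriers of three variant alleles vs. non-carriers,
# among high-risk cases and biopsy-negative controls.
or_value, ci_low, ci_high = odds_ratio_wald(30, 100, 10, 100)
```

An interval whose lower bound exceeds 1, as for the three-allele group in the text, indicates an association at the 5% level; intervals spanning 1, as for the one- and two-allele groups, do not.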

To determine whether or not these SNPs could be used clinically in diagnosing prostate cancer, we examined the effect of SNPs within a multivariate logistic regression model in predicting prostate cancer using nomograms, and also assessed the clinical validity of these SNPs using sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) analysis, as proposed by Kraft et al [Kraft, 2009 #1190]. To construct the nomogram, all twenty SNPs were tested within the multivariate model. Based on the concordance indices of bias-corrected probabilities for any and aggressive prostate cancer, the performance of each of the twenty SNPs and that of the three variant risk alleles were similar. Among white subjects, the nomogram with the three risk alleles was a significant predictor (p=0.0002) of any cancer and high grade (Gleason Score 7 or more) cancer (Figure 1). When examining the importance of each predictor from AUC analysis, the three SNP model was the fourth (out of six) most important predictor for cancer (incremental drop of AUC being 0.0038, Figure 1). It was a more important predictor than family history of prostate cancer and urinary symptoms.

When considering the clinical performance of the three SNP model alone, the sensitivity was

94.4% for any cancer for patients with one or more variant alleles and the specificity was 78.2% for any cancer for patients with three variant alleles (Table 7). The positive and negative predictive values improved with the number of variant alleles, but the absolute levels were low, compared to the baseline prevalence. Also, the ROC curve for the 3 SNP model was better at diagnosing aggressive cancer than for any cancer (Figure 2). 233
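The four measures of clinical validity can be sketched in a few lines of Python. The example below treats "number of variant alleles at or above a cut-off" as a positive test and uses the pooled counts of Table 5 (all ethnic groups), whereas Table 7 is restricted to white subjects, so the output only approximates the reported values. The function name `diagnostic_metrics` is an assumption introduced for the example.

```python
def diagnostic_metrics(cases, controls, cutoff):
    """Sensitivity, specificity, PPV and NPV when 'alleles >= cutoff' is a positive test."""
    true_pos = sum(n for k, n in cases.items() if k >= cutoff)
    false_neg = sum(cases.values()) - true_pos
    false_pos = sum(n for k, n in controls.items() if k >= cutoff)
    true_neg = sum(controls.values()) - false_pos
    return {
        "sensitivity": true_pos / (true_pos + false_neg),
        "specificity": true_neg / (true_neg + false_pos),
        "ppv": true_pos / (true_pos + false_pos),
        "npv": true_neg / (true_neg + false_neg),
    }

# Table 5 counts by number of variant alleles (all ethnic groups).
# Cancer counts are summed over the low, intermediate and high risk columns.
cancer = {0: 25 + 55 + 18, 1: 138 + 293 + 78, 2: 148 + 330 + 100, 3: 101 + 267 + 110}
no_cancer = {0: 154, 1: 669, 2: 584, 3: 369}

for cutoff in (1, 2, 3):
    metrics = diagnostic_metrics(cancer, no_cancer, cutoff)
    print(cutoff, {name: round(100 * value, 1) for name, value in metrics.items()})
```

At a cut-off of one or more variant alleles this gives a sensitivity of about 94% and a specificity of about 9%, and at three variant alleles about 29% and 79%, which follows the same pattern as Table 7.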

Common copy number variations (CNVs) are in LD with nearby SNPs, and are therefore captured by our SNP study, whereas rare CNVs are less likely to be tagged [285]. Having established the association of common variants at 10q26 and 15q21, we sought to identify rare CNVs at these regions. We used the same array since it contains both SNP and CNV probes (500,568 polymorphic and 420,000 non-polymorphic probes). Using established methods [267, 389], we ascertained all CNVs and then identified those that directly overlapped or encompassed the associated regions (chr10: 122,957,516 to 123,034,204 and chr15: 44,427,100 to 44,440,459).

While no CNVs were detected at 15q21, we identified a heterozygous deletion on chromosome 10 in the same genomic interval as the associated SNPs (Figure 3). The deletion removes 28,360 nucleotides and impinges upon the haplotype block containing rs10749408, but not the block containing rs11199874 and rs10788165. We used quantitative PCR (qPCR) assays to validate the CNV. The deletion was found in one cancer case, but is absent from all controls, from the Database of Genomic Variants [177] and from the ultra high-resolution data released from the Genome Structural Variation Consortium [285]. Combined with our SNP GWAS data, these results suggest that rare 10q26 CNVs may represent an additional mechanism for prostate cancer susceptibility.
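The step of identifying CNVs that overlap or encompass the associated regions amounts to an interval-overlap test. The following Python sketch shows one way to express it, using the two region coordinates given above. The CNV calls in `example_cnvs` are hypothetical (only the 28,360-nucleotide length of the first one is taken from the text; its position is invented), and the names `Region` and `cnvs_in_associated_regions` are assumptions for illustration.

```python
from typing import NamedTuple

class Region(NamedTuple):
    chrom: str
    start: int  # inclusive
    end: int    # inclusive

# Associated regions as stated in the text.
ASSOCIATED_REGIONS = [
    Region("chr10", 122_957_516, 123_034_204),
    Region("chr15", 44_427_100, 44_440_459),
]

def overlaps(a: Region, b: Region) -> bool:
    """True when the two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end

def cnvs_in_associated_regions(cnvs):
    """Keep CNVs that overlap (or fully encompass) any associated region."""
    return [cnv for cnv in cnvs if any(overlaps(cnv, region) for region in ASSOCIATED_REGIONS)]

# Hypothetical CNV calls, for illustration only.
example_cnvs = [
    Region("chr10", 122_990_000, 123_018_359),  # 28,360 nt long; position invented
    Region("chr15", 45_000_000, 45_050_000),    # outside the 15q21 region
    Region("chr7", 122_960_000, 123_000_000),   # similar coordinates, different chromosome
]

hits = cnvs_in_associated_regions(example_cnvs)
print(hits)
```

Because the test is symmetric, a CNV that lies inside a region, spans its boundary, or contains it entirely is retained in the same way.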

8.4 Discussion

It is well established that men diagnosed with intermediate or high-risk prostate cancer based on the D’Amico criteria have a high chance for progression to metastasis. It is of primary importance to identify these men in a screening program. Our study is the first to demonstrate association of 10q26 and 15q21 region SNPs with prostate cancer, and to show that a combination of these risk alleles (three SNP model) is associated with aggressive prostate cancer.

From nomogram analysis, the clinical performance of our SNP variants was not superior to PSA or age, but was as important as other established risk factors, such as family history of prostate cancer.

The stage 1 analysis of this study did not reveal any associations between prostate cancer and previously identified risk loci on 8q24 and 17q. This discrepancy likely reflects our specific analysis of patients with aggressive prostate cancer, as our prior analysis of the patient cohort included in stage 2 did reveal associations between 8q24 and 17q SNPs and prostate cancer, but not aggressive cancer [Nam, 2009 #1175]. Similarly, two recent large, multi-centred studies revealed no association of prostate cancer aggressiveness and mortality with these or other putative prostate cancer SNPs found by past large GWAS [Penney, 2009 #1189; Kader, 2009 #1191].

The odds ratios for the five SNPs identified by our Stage 1 analysis were larger than those identified in Stage 2. This result could reflect a “Winner’s Curse” phenomenon [Kraft, 2008 #1192; Lohmueller, 2003 #1193], wherein the estimated effect of a marker allele from the initial GWAS may be exaggerated relative to the estimated effect in a confirmatory study. Such a possibility could relate to our selection of “hypernormal” controls in Stage 1 – men who had normal PSA levels or had multiple biopsies with no evidence of cancer.

No previous studies have reported prostate cancer association with the chromosomal 10q26 and 15q21 regions or with genes in these regions. The three disease-associated SNPs on chromosome 10q26 map within 70 kb of one another and the two SNPs in LD (rs11199874 and rs10788165) are separated by 12 kb. These three SNPs span a 590 kb region encompassing two genes, WDR11 and FGFR2, that have been linked to glioblastomas [Chernova, 2001 #1183] and breast cancer [Liang, 2008 #1184]. The two SNPs on 15q21 are 13 kb apart and map 880 kb from the closest gene, GATM, which encodes a mitochondrial enzyme.

In addition to common variants at these two novel loci, we also found a rare CNV that removes nearly 30,000 nucleotides at 10q26. We note with interest that this CNV directly coincides with one of the associated SNPs (rs10749408), is absent from all healthy controls, and may therefore represent an alternative mechanism for prostate cancer predisposition.

Thus, from a GWAS based on cases with aggressive prostate cancer and controls with no evidence of cancer from biopsy, we have found new associations of SNPs at 10q26 and 15q21. Certain combinations of these SNPs are associated with aggressive forms of prostate cancer and have clinical importance similar to that of other risk factors for prostate cancer. Additional SNPs or genes in these new regions should be examined before these and potentially other SNPs from these regions can be used clinically to predict prostate cancer.

8.5 Methods

Study Subjects

Patients were drawn from a sample of 4573 men who underwent a prostate biopsy from a prostate cancer screening program between 1999 and 2008 within an urban, North American-based population. Patients were included in the study if they had an abnormal PSA value (>4.0 ng/mL) or digital rectal examination (DRE). Patients were also included if they had a normal PSA level (<4.0 ng/mL) or DRE, but were willing to undergo a prostate biopsy for the purpose of prostate cancer screening (n=408). Patients eligible for this study were unselected and were accrued consecutively. No patient had a past history of prostate cancer. All patients underwent one or more transrectal ultrasonography (TRUS)-guided needle core biopsies. Patients were excluded if they were not capable of giving consent to participate in a research study (n = 46), or if they could not provide sufficient baseline information (n = 53). Blood samples were obtained prior to prostate biopsy.

Six to 15 ultrasound-guided needle core biopsies were performed (median = 8), using an 18-gauge spring loaded biopsy device. Samples were obtained using a systematic pattern and additional targeted samples were obtained from suspicious areas. The primary endpoint was the histologic presence of adenocarcinoma of the prostate in the biopsy specimen. All grading was based on the Gleason scoring system [Gleason, 1974 #259]. A urological voiding history (AUA Symptom Score [Barry, 1992 #607]), DRE results, serum PSA level, family history of prostate cancer information, and ethnic background were obtained by research personnel through questionnaire administration and medical record review. All data were stored within a centralized database. This study was approved by the research ethics board (Sunnybrook Health Sciences Centre).

For stage 1, subjects were derived from the first 1000 subjects who underwent a prostate biopsy. We identified 545 cases and controls from the 1000 eligible patients for SNP array analysis. Cases were patients with a Gleason Score of 7 and a PSA >10 ng/mL or a Gleason Score 8 to 10 cancer. Controls were patients with no cancer found at biopsy with a PSA <10 ng/mL. For stage 2, the remaining 3573 patients were included in the replication study.

Cases were white patients with screen-detected prostate cancer who had a high risk of progression based on the D’Amico criteria, an established method of risk stratification for prostate cancer progression; i.e., either a Gleason Score of 7 and a PSA level of >10 ng/mL or a Gleason Score of 8 to 10. Controls were white patients who had no evidence of cancer, based on one or more systematic multi-core (median number = 8 cores per biopsy session) prostate biopsies – 105 controls had a normal PSA level and 124 controls had a PSA level between 4.0 ng/mL and 10.0 ng/mL.

SNP Array Analysis and Genotyping

In stage 1, using the 500K Affymetrix SNP array, we utilized only samples with calls for at least 97% of SNPs at a confidence score of ≥0.25. We selected SNPs that met Hardy-Weinberg equilibrium (HWE), had a minor allele frequency of >0.05, and had a call rate of >95%. To deal with population stratification, we used principal component analysis to adjust for variances within ethnic subgroups with EIGENSTRAT (Golden HelixTree, V6.4.1) [Price, 2006 #1185].
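The SNP quality filters described above can be sketched as follows. The example computes a one-degree-of-freedom chi-square test of Hardy-Weinberg equilibrium and the minor allele frequency from genotype counts, using the rs11199874 control genotypes from Table 2 as input. The HWE significance threshold (`hwe_alpha`) and the call rate passed to `passes_snp_qc` are assumptions, since the text does not state the HWE cut-off used, and the function names are introduced only for this illustration.

```python
import math

def hwe_chi_square(n_hom_major, n_het, n_hom_minor):
    """Chi-square test (1 df) of observed genotype counts against HWE expectations."""
    n = n_hom_major + n_het + n_hom_minor
    p = (2 * n_hom_major + n_het) / (2 * n)  # major allele frequency
    q = 1 - p
    observed = (n_hom_major, n_het, n_hom_minor)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

def minor_allele_frequency(n_hom_major, n_het, n_hom_minor):
    n = n_hom_major + n_het + n_hom_minor
    return (2 * n_hom_minor + n_het) / (2 * n)

def passes_snp_qc(maf, call_rate, hwe_p, hwe_alpha=0.001):
    """MAF > 0.05 and call rate > 95% as in the text; hwe_alpha is an assumed threshold."""
    return maf > 0.05 and call_rate > 0.95 and hwe_p >= hwe_alpha

# rs11199874 genotype counts in men without cancer (Table 2): GG, AG, AA.
chi2, hwe_p = hwe_chi_square(869, 740, 167)
maf = minor_allele_frequency(869, 740, 167)
print(f"chi2={chi2:.3f}, p={hwe_p:.3f}, MAF={maf:.3f}")
print(passes_snp_qc(maf, call_rate=0.99, hwe_p=hwe_p))  # call rate is hypothetical
```

For these control counts the G allele frequency is about 0.70, matching the control frequency listed for rs11199874 in Table 4, and the genotype distribution shows no evidence of departure from HWE.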

Genotyping was conducted using mass-spectrometry-based genotyping analysis and matrix-assisted laser desorption ionization – time of flight (MassArray System, Sequenom Inc., San Diego, California, USA) following the manufacturer’s instructions. A standard protocol for multiplex homogeneous mass extend assay developed by Sequenom Inc. was utilized and modified according to designed primers. For quality control, we assigned negative controls for each test plate (Microseal TM 384 V2.0).

Data Analysis for Replication Study

For stage 2, we excluded 40 samples that failed on two or more of the assays used. The call rates were at least 95% for each SNP. Genotype distributions for all tested SNPs were in HWE. Cases were defined as patients with prostate cancer and controls were men with no evidence of cancer. Allele frequencies for each SNP were calculated for cases and controls and the distributions were compared. Allelic odds ratios were calculated based on a multiplicative model. For genotypes, a series of tests assuming an additive, dominant or recessive genetic model were performed for each of the SNPs with unconditional logistic regression. The model with the highest likelihood was considered to be the best fitting model for each SNP [Zheng, 2008 #1131]. We tested the cumulative effects of selected SNPs for each model by counting the number of genotypes associated with prostate cancer; the odds ratios for prostate cancer for patients with one or more variant genotypes were estimated. In the multivariate analysis, we adjusted for age, ethnic group, family history of prostate cancer, the presence of lower urinary tract voiding symptoms, the total PSA level, and the digital rectal examination. Unconditional logistic regression analysis was used to examine how each of these factors, alone and in combination, would predict the presence of prostate cancer and aggressive forms of prostate cancer, defined by the D’Amico classification [D'Amico, 1998 #1182].
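The dominant and recessive genetic models differ only in which genotypes are grouped as "associated". The sketch below collapses the genotype counts of Table 2 accordingly and computes the crude odds ratio for each grouping. It is a simplified 2x2 calculation, not the unconditional logistic regression used in the study, and the function name `collapsed_odds_ratio` is an assumption for this example.

```python
def collapsed_odds_ratio(cases, controls, associated_genotypes):
    """Crude odds ratio after collapsing genotypes into associated vs reference groups."""
    case_assoc = sum(cases[g] for g in associated_genotypes)
    case_ref = sum(cases.values()) - case_assoc
    ctrl_assoc = sum(controls[g] for g in associated_genotypes)
    ctrl_ref = sum(controls.values()) - ctrl_assoc
    return (case_assoc * ctrl_ref) / (case_ref * ctrl_assoc)

# Genotype counts from Table 2 (cancer vs no cancer).
rs4775302_cases = {"AA": 496, "AG": 828, "GG": 339}
rs4775302_controls = {"AA": 467, "AG": 833, "GG": 476}
rs10749408_cases = {"TT": 813, "CT": 697, "CC": 153}
rs10749408_controls = {"TT": 775, "CT": 803, "CC": 198}

# Dominant model for the A allele at 15q21: A/A and A/G versus G/G.
or_dominant = collapsed_odds_ratio(rs4775302_cases, rs4775302_controls, {"AA", "AG"})
# Recessive model for the T allele at 10q26: T/T versus C/T and C/C.
or_recessive = collapsed_odds_ratio(rs10749408_cases, rs10749408_controls, {"TT"})

print(f"rs4775302 dominant: {or_dominant:.2f}")
print(f"rs10749408 recessive: {or_recessive:.2f}")
```

The two crude estimates, about 1.43 and 1.24, are close to the crude odds ratios of 1.41 and 1.26 listed in Table 4; the small differences may reflect rounding or the exact samples included in each analysis.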

Nomogram Construction

To develop a clinical instrument that incorporates the SNP findings, we considered all SNPs in a nomogram multivariate model, restricted to white subjects only. This nomogram was designed to predict both prostate cancer and aggressive cancer, defined as intermediate to high-grade prostate cancer (Gleason Score 7 or more). Ordinal logistic regression was used to model the probability of having low or high grade cancer. Three outcome levels were defined: 1) no cancer; 2) low-grade cancer (Gleason Score 6 or less); and 3) intermediate to high-grade cancer (Gleason Score 7 or more). Continuous variables were modeled with restricted cubic splines to avoid linearity assumptions. The logistic regression model was the basis for constructing a nomogram. All patients were used to develop and validate the nomogram. Bootstrapping was used to correct for optimism in the evaluation of discrimination and calibration. All analyses were performed using S-Plus 2000 Professional software (Statistical Sciences, Seattle, WA) with the Design and Hmisc libraries added [Harrell, 2001 #1020].

CNV discovery and qPCR validation

Affymetrix 5.0 CEL files were analyzed for CNVs using two complementary algorithms: a genomic segmentation algorithm (Partek, MO) and Birdsuite version 1.5.32. Only CNVs found in both analyses were considered. In the segmentation analysis, a minimum of 10 consecutive probes was required to detect a copy number change. Birdsuite calls were restricted to those with a LOD score greater than 10 and a size larger than 1 kb. qPCR validation of the 10q26 CNV deletion was performed on a Roche LightCycler by relative quantification. Primers (available upon request) were designed using Primer3 and the human genome reference assembly (UCSC genome browser, version hg18).
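The filtering logic described above (per-algorithm thresholds followed by retention of CNVs found by both methods) can be sketched as below. This is not the Partek or Birdsuite interface: the `CnvCall` record, the function names and the example calls are all hypothetical, and "found in both analyses" is assumed here to mean any overlap between calls in the same sample, which the text does not specify.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CnvCall:
    sample: str
    chrom: str
    start: int
    end: int
    n_probes: int = 0   # used for segmentation calls
    lod: float = 0.0    # used for Birdsuite calls

    @property
    def size(self) -> int:
        return self.end - self.start + 1

def same_locus(a: CnvCall, b: CnvCall) -> bool:
    """Assumed matching rule: same sample, same chromosome, any overlap."""
    return (a.sample == b.sample and a.chrom == b.chrom
            and a.start <= b.end and b.start <= a.end)

def consensus_calls(segmentation_calls, birdsuite_calls):
    """Apply the per-method thresholds from the text, then keep calls seen by both."""
    seg = [c for c in segmentation_calls if c.n_probes >= 10]
    bird = [c for c in birdsuite_calls if c.lod > 10 and c.size > 1000]
    return [s for s in seg if any(same_locus(s, b) for b in bird)]

# Hypothetical calls for illustration only.
segmentation = [
    CnvCall("case_01", "chr10", 122_990_000, 123_018_359, n_probes=14),
    CnvCall("case_02", "chr3", 5_000_000, 5_020_000, n_probes=6),    # too few probes
    CnvCall("ctrl_07", "chr8", 9_000_000, 9_040_000, n_probes=25),   # no Birdsuite match
]
birdsuite = [
    CnvCall("case_01", "chr10", 122_991_500, 123_017_000, lod=18.2),
    CnvCall("case_02", "chr3", 5_000_000, 5_020_000, lod=30.0),
    CnvCall("ctrl_07", "chr8", 9_000_000, 9_000_800, lod=12.0),      # smaller than 1 kb
]

kept = consensus_calls(segmentation, birdsuite)
print(kept)
```

Requiring agreement between two algorithms trades sensitivity for specificity, which suits a search for rare events that will subsequently be validated by qPCR.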


Table 1: List of 20 SNPs found by GWAS using the 500K Affymetrix SNP chip to be associated with aggressive prostate cancer.

SNP          P-value / Bonferroni-       Chromosomal    Minor Allele   Odds Ratio        Associated Gene
             Corrected P-value           Region         Frequency      (95% C.I.)        (if applicable)

1st Criteria – Bonf. p-value <0.01 with region close to known prostate cancer gene (n=5)

rs12699509   1.4x10^-8 / 0.006           7p21.2         0.29           2.7 (1.9 – 3.8)   ETV1
rs3114316    9.2x10^-9 / 0.004           7q11.23        0.12           3.4 (2.2 – 5.2)   KIAA1505
rs3852402    2.9x10^-8 / 0.01            9q23.32        0.38           2.2 (1.7 – 2.9)   FANCC
rs2226016    3.0x10^-8 / 0.01            11p14.1        0.34           2.3 (1.7 – 3.2)   FSHB
rs4281668    9.9x10^-9 / 0.004           15q25.3        0.42           2.9 (2.0 – 4.1)   AKAP13

2nd Criteria – 2 or more SNPs in LD and Bonf. p<0.01 (n=9)

rs2053140    5.7x10^-9 / 0.002           8q13.2         0.47           2.6 (1.8 – 3.7)   DEPDC2
rs4131931    2.8x10^-8 / 0.01            (8 kb apart)

rs7089868    1.8x10^-8 / 0.007           10q23.1        0.47           2.8 (2.0 – 4.1)   KIAA1128
rs11596082   1.7x10^-8 / 0.007           (7 kb apart)
rs2351337    1.8x10^-8 / 0.007           (40 kb apart)

rs11199874   7.2x10^-11 / 3.0x10^-5      10q26.12       0.29           2.9 (2.1 – 4.1)   No description
rs10788165   1.3x10^-10 / 5.5x10^-5      (12 kb apart)

rs1994198    8.2x10^-9 / 0.003           15q21.1        0.43           2.6 (1.8 – 3.7)   No description
rs4775302    3.1x10^-8 / 0.01            (13 kb apart)

3rd Criteria – Bonf. corrected p-value <10^-5 (n=6)

rs17018760   4.0x10^-12 / 1.7x10^-6      2p12           0.11           4.4 (2.8 – 6.9)   CTNNA2
rs410259     2.2x10^-11 / 1.0x10^-5      2p22.1         0.12           4.0 (2.6 – 6.2)   SLC8A
rs10749408   3.2x10^-11 / 1.0x10^-5      10q26.12       0.34           2.5 (1.9 – 3.3)   No description
rs2429763    2.0x10^-11 / 1.0x10^-5      11p14.3        0.18           2.9 (2.1 – 4.0)   No description
rs595018     2.0x10^-11 / 1.0x10^-5      11q12.2        0.19           2.9 (2.1 – 4.0)   CCDC86
rs2347306    3.0x10^-11 / 1.0x10^-5      12q24.32       0.05           5.5 (3.2 – 9.7)   No description


Table 2: Comparison between cases and controls of risk factors, tumour markers for prostate cancer and the genotype frequencies of selected SNPs.

Factor                     Cancer              No Cancer           P-value*

Total (n = 3439)           N = 1663 (48.4%)    N = 1776 (51.6%)

Age Group (years)
  ≤ 50                     38 (2.3%)           96 (5.4%)           1.3x10^-21
  51 – 60                  397 (23.9%)         554 (31.2%)
  61 – 70                  718 (43.2%)         776 (43.7%)
  > 70                     510 (30.7%)         350 (19.7%)

Family History of PC
  Absent                   1376 (82.7%)        1545 (87.0%)        0.0005
  Present                  287 (17.3%)         231 (13.0%)

Ethnicity
  Asian                    55 (3.3%)           134 (7.6%)          7.4x10^-17
  Caucasian                1382 (83.1%)        1428 (80.4%)
  Black                    184 (11.1%)         127 (7.1%)
  Other                    42 (2.5%)           87 (4.9%)

LUTS
  ≤7                       926 (55.7%)         864 (48.7%)         5.5x10^-6
  >7                       737 (44.3%)         912 (51.3%)

DRE
  No Nodule                1206 (72.5%)        1500 (84.5%)        7.3x10^-18
  Nodule                   457 (27.5%)         276 (15.5%)

PSA (ng/mL)
  ≤ 4.0                    135 (8.1%)          353 (19.9%)         2.1x10^-34
  4.1 – 10.0               980 (58.9%)         1011 (56.9%)
  10.1 – 20.0              387 (23.3%)         345 (19.4%)
  > 20.0                   161 (9.7%)          67 (3.8%)

SNP Genotyping Frequency

rs11199874, 10q26
  GG                       962 (57.9%)         869 (48.9%)         2.6x10^-10
  AG                       597 (35.9%)         740 (41.7%)
  AA                       104 (6.3%)          167 (9.4%)

rs10749408, 10q26
  TT                       813 (48.9%)         775 (43.6%)         6.9x10^-6
  CT                       697 (41.9%)         803 (45.2%)
  CC                       153 (9.2%)          198 (11.2%)

rs10788165, 10q26
  TT                       740 (44.5%)         673 (37.9%)         1.2x10^-7
  GT                       734 (44.1%)         842 (47.4%)
  GG                       189 (11.4%)         261 (14.7%)

rs4775302, 15q21
  AA                       496 (29.8%)         467 (26.3%)         4.1x10^-8
  AG                       828 (49.8%)         833 (46.9%)
  GG                       339 (20.4%)         476 (26.8%)

rs1994198, 15q21
  TT                       492 (29.6%)         498 (28.0%)         5.8x10^-7
  CT                       802 (48.2%)         783 (44.1%)
  CC                       369 (22.2%)         495 (27.9%)

* P-value calculation based on Fisher’s Exact Test


Table 3: Association of the 5 SNP variants to prostate cancer by ethnic group.

Each cell gives the number of patients with the row percentage in parentheses.

SNP Variant         Caucasian                              | Black                                  | Asian                                  | Other
                    No Cancer    Cancer       p-value      | No Cancer    Cancer       p-value      | No Cancer    Cancer       p-value      | No Cancer    Cancer       p-value

rs11199874, 10q26
  GG                675 (46.8)   766 (53.2)   2.3x10^-8    | 86 (39.8)    130 (60.2)   0.17         | 62 (67.4)    30 (32.6)    0.52         | 46 (56.1)    36 (43.9)    0.004
  AG                607 (53.8)   521 (46.2)                | 41 (45.1)    50 (54.9)                 | 57 (73.1)    21 (26.9)                 | 35 (87.5)    5 (12.5)
  AA                146 (60.6)   95 (39.4)                 | 0 (0)        4 (100)                   | 15 (11.2)    4 (21.1)                  | 6 (85.7)     1 (14.3)

rs10749408, 10q26
  TT                599 (47.8)   653 (52.2)   1.5x10^-5    | 75 (38.1)    120 (61.9)   0.45         | 66 (74.2)    23 (25.8)    0.57         | 36 (67.9)    17 (32.1)    0.26
  CT                663 (52.5)   599 (47.5)                | 47 (45.6)    56 (54.4)                 | 50 (66.7)    25 (33.3)                 | 43 (71.7)    17 (28.3)
  CC                166 (56.1)   130 (43.9)                | 6 (4.7)      8 (57.1)                  | 18 (72.0)    7 (28.0)                  | 8 (50.0)     8 (50.0)

rs10788165, 10q26
  TT                540 (47.0)   608 (53.0)   2.6x10^-7    | 42 (34.7)    79 (65.3)    0.13         | 51 (68.0)    24 (32.0)    0.35         | 40 (58.0)    29 (42.0)    0.02
  GT                666 (51.8)   620 (48.2)                | 69 (46.6)    79 (53.4)                 | 68 (75.6)    22 (24.4)                 | 39 (75.0)    13 (25.0)
  GG                222 (59.0)   154 (41.0)                | 16 (38.1)    26 (61.9)                 | 15 (62.5)    9 (37.5)                  | 8 (100)      0 (0)

rs4775302, 15q21
  AA                413 (47.9)   450 (52.1)   1.8x10^-6    | 25 (41.0)    36 (59.0)    0.98         | 8 (61.5)     5 (38.5)     0.59         | 21 (80.8)    5 (19.2)     0.05
  AG                702 (50.0)   701 (50.0)                | 56 (40.3)    83 (59.7)                 | 46 (68.7)    21 (31.3)                 | 29 (55.8)    23 (44.2)
  GG                313 (57.5)   231 (42.5)                | 46 (41.4)    65 (58.6)                 | 80 (73.4)    29 (26.6)                 | 37 (72.6)    14 (27.4)

rs1994198, 15q21
  TT                449 (49.5)   458 (50.5)   1.1x10^-5    | 11 (73.3)    4 (26.7)     0.31         | 19 (43.2)    25 (56.8)    0.58         | 19 (79.2)    5 (20.8)     0.08
  CT                663 (49.3)   682 (50.7)                | 45 (64.3)    25 (45.5)                 | 41 (36.9)    70 (63.1)                 | 34 (57.6)    25 (42.4)
  CC                316 (56.6)   242 (43.4)                | 78 (75.0)    26 (25.0)                 | 67 (43.0)    89 (57.0)                 | 34 (73.9)    12 (26.1)


Table 4: Crude and adjusted odds ratio based on variant genotypes. Analysis for 10q26 was based on a recessive model and for 15q21 on a dominant model. Multivariate model variables included age at biopsy, ethnicity, family history of prostate cancer, the presence of urinary voiding symptoms, PSA level and DRE status.

SNP            Alternative   Assoc.   Frequency          Genotype                 Odds Ratio         P-value     Adjusted Odds      P-value
Information    Alleles       Allele   Cases   Controls   Reference   Associated   (95% CI)                       Ratio (95% CI)

rs11199874     A/G           G        0.75    0.70       A/A, A/G    G/G          1.42 (1.2 – 1.6)   3.1x10^-8   1.27 (1.1 – 1.5)   0.001
10q26

rs10749408     C/T           T        0.70    0.66       C/C, C/T    T/T          1.26 (1.1 – 1.4)   6.0x10^-5   1.18 (1.0 – 1.4)   0.02
10q26

rs10788165     G/T           T        0.67    0.62       G/G, G/T    T/T          1.34 (1.2 – 1.5)   2.6x10^-6   1.29 (1.1 – 1.5)   0.0005
10q26

rs4775302      A/G           A        0.59    0.50       G/G         A/G, A/A     1.41 (1.2 – 1.6)   2.4x10^-6   1.31 (1.1 – 1.6)   0.001
15q21

rs1994198      C/T           T        0.55    0.52       C/C         C/T, T/T     1.34 (1.1 – 1.6)   3.4x10^-5   1.33 (1.1 – 1.6)   0.0009
15q21


Table 5: Distribution of patients by aggressiveness of prostate cancer using low, intermediate and high risk categories for prostate cancer (D’Amico Classification). Variant alleles based on a combination of a 3 SNP model (rs11199874, rs10749408, and rs4775302).

                                                        Number of Patients with CANCER* (n=1663)

Number of Variant          Number of patients           LOW RISK**            INTERMEDIATE RISK**    HIGH RISK**
Alleles (Frequency         with NO CANCER               (n=413)               (n=944)                (n=306)
Distribution)              (n=1776)                     Gleason Score 6,      Gleason Score 7,       Gleason Score 8-10,
                                                        PSA <10, AND          PSA 10-20, OR          PSA >20, OR
                                                        Stage T1c             Stage T2a/b            Stage T2c

0 (7.3%)                   154 (8.7%)                   25 (6.1%)             55 (5.8%)              18 (5.9%)
1 (34.3%)                  669 (37.7%)                  138 (33.4%)           293 (31.0%)            78 (25.5%)
2 (33.8%)                  584 (32.9%)                  148 (36.1%)           330 (34.9%)            100 (32.7%)
3 (24.6%)                  369 (20.8%)                  101 (24.5%)           267 (28.3%)            110 (36.0%)


* Comparison between patients with and without cancer by the number of variant alleles (2x4 table): chi-square=44.6, p<0.0001.

** Comparison between patients without cancer and cancer risk groups by the number of variant alleles (4x4 table): chi-square=58.1, p<0.0001. Comparison between risk groups (cases only, 3x4 table): chi-square=12.6, p=0.05.
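The two chi-square statistics quoted in these footnotes can be checked directly from the counts in Table 5. The sketch below implements the ordinary Pearson chi-square statistic for a contingency table in plain Python; the function name `pearson_chi_square` is an assumption for the example, and the cancer row of the 2x4 table is obtained by summing the three risk-group columns.

```python
def pearson_chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Counts by number of variant alleles (0, 1, 2, 3) from Table 5.
no_cancer = [154, 669, 584, 369]
low_risk = [25, 138, 148, 101]
intermediate_risk = [55, 293, 330, 267]
high_risk = [18, 78, 100, 110]

# 2x4 table: any cancer versus no cancer.
any_cancer = [l + m + h for l, m, h in zip(low_risk, intermediate_risk, high_risk)]
chi2_2x4 = pearson_chi_square([no_cancer, any_cancer])

# 4x4 table: no cancer versus each risk group.
chi2_4x4 = pearson_chi_square([no_cancer, low_risk, intermediate_risk, high_risk])

print(f"2x4: {chi2_2x4:.1f}, 4x4: {chi2_4x4:.1f}")
```

Both statistics come out close to the values reported in the footnotes (44.6 and 58.1), with 3 and 9 degrees of freedom respectively.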


Table 6: Odds ratio for having low, intermediate or high risk prostate cancer compared to no cancer for patients with 1, 2 or 3 variant risk alleles (restricted to white subjects only).

Odds Ratio for Cancer by Risk Groups

Each cell gives the odds ratio, (95% CI) and (p-value).

Number of          Low Risk Disease*        Intermediate Risk        High Risk Disease*       Test for trend of Odds Ratios
Variant Alleles                             Disease*                                          Across Disease Risk Categories

0                  1.0                      1.0                      1.0
1                  1.3 (0.8-2.2) (0.35)     1.2 (0.8-1.7) (0.46)     1.2 (0.6-2.4) (0.58)     No trend
2                  1.5 (0.9-2.5) (0.13)     1.4 (0.9-2.0) (0.10)     1.8 (0.9-3.4) (0.10)     No trend
3                  1.7 (1.0-2.9) (0.05)     1.8 (1.2-2.6) (0.003)    3.0 (1.5-5.8) (0.001)    Estimated slope of OR=1.24 (95% CI:1.2-1.3, p=0.0002)

* Test for trend based on using number of variant risk alleles as a categorical variable within the logistic model for patients in Low Risk Disease and no cancer group, p=0.02; for patients in Intermediate Risk Disease and no cancer group, p<0.0001; for patients in High Risk Disease and no cancer group, p<0.0001. Global test for trend based on odds ratios, p=0.048.

Table 7: Comparisons of sensitivity, specificity, positive predictive value, and negative predictive value based on the combination of the risk alleles from the three SNP model among white subjects only.

Predictor Variable           Sensitivity (%)                 Specificity (%)                 Positive Predictive Value (%)     Negative Predictive Value (%)
                             Any Cancer   GS 7    GS 8-10    Any Cancer   GS 7    GS 8-10    Any Cancer   GS 7      GS 8-10    Any Cancer   GS 7      GS 8-10
                                                                                                          (26.5%)   (5.6%)                  (73.5%)   (94.4%)

1 Variant Allele Cut-off     94.4         94.5    95.6       7.9          7.2     6.9        49.8         26.9      5.8        59.5         78.4      96.3
2 Variant Alleles Cut-off    62.0         62.0    70.4       46.6         43.9    43.1       52.9         28.5      6.9        55.9         76.2      96.1
3 Variant Alleles Cut-off    29.5         30.7    37.1       78.2         76.3    75.1       56.8         31.9      8.2        53.4         75.3      95.2


Figure 3. Rare CNV microdeletion. 10q26, which is associated with aggressive prostate cancer by SNP GWAS, is also the site of a rare CNV deletion. The copy number of the entire region, as measured by the Affymetrix 5.0 array, is shown (top). The highlighted CNV (blue) spans 28 kb and deletes a portion of the haplotype block containing rs10749408, but not the haplotype block with rs11199874 and rs10788165 (bottom).


References

1. Levine, A.J. p53, the cellular gatekeeper for growth and division. Cell 88, 323-31 (1997). 2. Lane, D.P. & Crawford, L.V. T antigen is bound to a host protein in SV40-transformed cells. Nature 278, 261-3 (1979). 3. Reisman, D., Greenberg, M. & Rotter, V. Human p53 oncogene contains one promoter upstream of exon 1 and a second, stronger promoter within intron 1. Proc Natl Acad Sci U S A 85, 5146-50 (1988). 4. Reich, N.C. & Levine, A.J. Growth regulation of a cellular tumour antigen, p53, in nontransformed cells. Nature 308, 199-201 (1984). 5. Mahmoudi, S. et al. Wrap53, a natural p53 antisense transcript required for p53 induction upon DNA damage. Mol Cell 33, 462-71 (2009). 6. Montes de Oca Luna, R., Wagner, D.S. & Lozano, G. Rescue of early embryonic lethality in mdm2-deficient mice by deletion of p53. Nature 378, 203-6 (1995). 7. Jones, S.N., Roe, A.E., Donehower, L.A. & Bradley, A. Rescue of embryonic lethality in Mdm2-deficient mice by absence of p53. Nature 378, 206-8 (1995). 8. Brooks, C.L. & Gu, W. Dynamics in the p53-Mdm2 ubiquitination pathway. Cell Cycle 3, 895-9 (2004). 9. Shaulsky, G., Goldfinger, N., Ben-Ze'ev, A. & Rotter, V. Nuclear accumulation of p53 protein is mediated by several nuclear localization signals and plays a role in tumorigenesis. Mol Cell Biol 10, 6565-77 (1990). 10. Kubbutat, M.H., Ludwig, R.L., Ashcroft, M. & Vousden, K.H. Regulation of Mdm2- directed degradation by the C terminus of p53. Mol Cell Biol 18, 5690-8 (1998). 11. Rodriguez, M.S., Desterro, J.M., Lain, S., Lane, D.P. & Hay, R.T. Multiple C-terminal lysine residues target p53 for ubiquitin-proteasome-mediated degradation. Mol Cell Biol 20, 8458-67 (2000). 12. Kussie, P.H. et al. Structure of the MDM2 oncoprotein bound to the p53 tumor suppressor transactivation domain. Science 274, 948-53 (1996). 13. Chen, J., Marechal, V. & Levine, A.J. Mapping of the p53 and mdm-2 interaction domains. Mol Cell Biol 13, 4107-14 (1993). 14. Chin, L. et al. 
Cooperative effects of INK4a and ras in melanoma susceptibility in vivo. Genes Dev 11, 2822-34 (1997). 15. Pomerantz, J. et al. The Ink4a tumor suppressor gene product, p19Arf, interacts with MDM2 and neutralizes MDM2's inhibition of p53. Cell 92, 713-23 (1998). 16. Zhang, Y. & Xiong, Y. Control of p53 ubiquitination and nuclear export by MDM2 and ARF. Cell Growth Differ 12, 175-86 (2001). 17. Appella, E. & Anderson, C.W. Post-translational modifications and activation of p53 by genotoxic stresses. Eur J Biochem 268, 2764-72 (2001).

18. Ashcroft, M., Kubbutat, M.H. & Vousden, K.H. Regulation of p53 function and stability by phosphorylation. Mol Cell Biol 19, 1751-8 (1999). 19. Levine, A.J., Hu, W. & Feng, Z. The P53 pathway: what questions remain to be explored? Cell Death Differ 13, 1027-36 (2006). 20. Kastan, M.B. et al. A mammalian cell cycle checkpoint pathway utilizing p53 and GADD45 is defective in ataxia-telangiectasia. Cell 71, 587-97 (1992). 21. Shieh, S.Y., Ikeda, M., Taya, Y. & Prives, C. DNA damage-induced phosphorylation of p53 alleviates inhibition by MDM2. Cell 91, 325-34 (1997). 22. Canman, C.E. et al. Activation of the ATM kinase by ionizing radiation and phosphorylation of p53. Science 281, 1677-9 (1998). 23. Shieh, S.Y., Ahn, J., Tamai, K., Taya, Y. & Prives, C. The human homologs of checkpoint kinases Chk1 and Cds1 (Chk2) phosphorylate p53 at multiple DNA damage- inducible sites. Genes Dev 14, 289-300 (2000). 24. Chehab, N.H., Malikzay, A., Stavridi, E.S. & Halazonetis, T.D. Phosphorylation of Ser- 20 mediates stabilization of human p53 in response to DNA damage. Proc Natl Acad Sci U S A 96, 13777-82 (1999). 25. Unger, T. et al. Critical role for Ser20 of human p53 in the negative regulation of p53 by Mdm2. EMBO J 18, 1805-14 (1999). 26. Maya, R. et al. ATM-dependent phosphorylation of Mdm2 on serine 395: role in p53 activation by DNA damage. Genes Dev 15, 1067-77 (2001). 27. Pfeifer, G.P., You, Y.H. & Besaratinia, A. Mutations induced by ultraviolet light. Mutat Res 571, 19-31 (2005). 28. Shiloh, Y. ATM and ATR: networking cellular responses to DNA damage. Curr Opin Genet Dev 11, 71-7 (2001). 29. Tibbetts, R.S. et al. A role for ATR in the DNA damage-induced phosphorylation of p53. Genes Dev 13, 152-7 (1999). 30. Lu, X. & Lane, D.P. Differential induction of transcriptionally active p53 following UV or ionizing radiation: defects in chromosome instability syndromes? Cell 75, 765-78 (1993). 31. Lowe, S.W. & Ruley, H.E. 
Stabilization of the p53 tumor suppressor is induced by adenovirus 5 E1A and accompanies apoptosis. Genes Dev 7, 535-45 (1993). 32. Lowe, S.W. Activation of p53 by oncogenes. Endocr Relat Cancer 6, 45-8 (1999). 33. Pluquet, O. & Hainaut, P. Genotoxic and non-genotoxic pathways of p53 induction. Cancer Lett 174, 1-15 (2001). 34. el-Deiry, W.S., Kern, S.E., Pietenpol, J.A., Kinzler, K.W. & Vogelstein, B. Definition of a consensus binding site for p53. Nat Genet 1, 45-9 (1992). 35. Wei, C.L. et al. A global map of p53 transcription-factor binding sites in the human genome. Cell 124, 207-19 (2006).

36. Cawley, S. et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 116, 499-509 (2004). 37. el-Deiry, W.S. et al. WAF1/CIP1 is induced in p53-mediated G1 arrest and apoptosis. Cancer Res 54, 1169-74 (1994). 38. Middleton, G., Cox, S.W., Korsmeyer, S. & Davies, A.M. Differences in bcl-2- and bax- independent function in regulating apoptosis in sensory neuron populations. Eur J Neurosci 12, 819-27 (2000). 39. Hanahan, D. & Weinberg, R.A. The hallmarks of cancer. Cell 100, 57-70 (2000). 40. Shi, Y. Mechanisms of caspase activation and inhibition during apoptosis. Mol Cell 9, 459-70 (2002). 41. Bennett, M. et al. Cell surface trafficking of Fas: a rapid mechanism of p53-mediated apoptosis. Science 282, 290-3 (1998). 42. Adams, J.M. Ways of dying: multiple pathways to apoptosis. Genes Dev 17, 2481-95 (2003). 43. Miyashita, T. & Reed, J.C. Tumor suppressor p53 is a direct transcriptional activator of the human bax gene. Cell 80, 293-9 (1995). 44. Oda, E. et al. Noxa, a BH3-only member of the Bcl-2 family and candidate mediator of p53-induced apoptosis. Science 288, 1053-8 (2000). 45. Nakano, K. & Vousden, K.H. PUMA, a novel proapoptotic gene, is induced by p53. Mol Cell 7, 683-94 (2001). 46. Li, F.P. & Fraumeni, J.F., Jr. Soft-tissue sarcomas, breast cancer, and other neoplasms. A familial syndrome? Ann Intern Med 71, 747-52 (1969). 47. Vulliamy, T. et al. Disease anticipation is associated with progressive telomere shortening in families with dyskeratosis congenita due to mutations in TERC. Nat Genet 36, 447-9 (2004). 48. Li, F.P. et al. A cancer family syndrome in twenty-four kindreds. Cancer Res 48, 5358-62 (1988). 49. Malkin, D. et al. Germ line p53 mutations in a familial syndrome of breast cancer, sarcomas, and other neoplasms. Science 250, 1233-8 (1990). 50. Lomax, M.E. et al. 
Two functional assays employed to detect an unusual mutation in the oligomerisation domain of p53 in a Li-Fraumeni like family. Oncogene 14, 1869-74 (1997). 51. Ponten, F. et al. Molecular pathology in basal cell cancer with p53 as a genetic marker. Oncogene 15, 1059-67 (1997). 52. Saeki, Y. et al. Germline p53 mutation at codon 133 in a cancer-prone family. J Mol Med 75, 50-6 (1997). 53. Varley, J.M. et al. Germ-line mutations of TP53 in Li-Fraumeni families: an extended study of 39 families. Cancer Res 57, 3245-52 (1997).

54. Varley, J.M. et al. A detailed study of loss of heterozygosity on chromosome 17 in tumours from Li-Fraumeni patients carrying a mutation to the TP53 gene. Oncogene 14, 865-71 (1997). 55. Williams, K.J., Boyle, J.M., Birch, J.M., Norton, J.D. & Scott, D. Cell cycle arrest defect in Li-Fraumeni Syndrome: a mechanism of cancer predisposition? Oncogene 14, 277-82 (1997). 56. Ayan, I. et al. Germline mutations of the p53 gene in children with malignant solid tumors. J Exp Clin Cancer Res 17, 497-502 (1998). 57. Bot, F.J., Sleddens, H.F. & Dinjens, W.N. Molecular assessment of clonality leads to the identification of a new germ line TP53 mutation associated with malignant cystosarcoma phyllodes and soft tissue sarcoma. Diagn Mol Pathol 7, 295-301 (1998). 58. Luca, J.W., Strong, L.C. & Hansen, M.F. A germline missense mutation R337C in exon 10 of the human p53 gene. Hum Mutat Suppl 1, S58-61 (1998). 59. Murakawa, Y. et al. Astrocytoma and B-cell lymphoma development in a man with a p53 germline mutation. Jpn J Clin Oncol 28, 631-7 (1998). 60. Orellana, C. et al. A novel TP53 germ-line mutation identified in a girl with a primitive neuroectodermal tumor and her father. Cancer Genet Cytogenet 105, 103-8 (1998). 61. Pivnick, E.K. et al. Simultaneous adrenocortical carcinoma and ganglioneuroblastoma in a child with Turner syndrome and germline p53 mutation. J Med Genet 35, 328-32 (1998). 62. Reifenberger, J. et al. Primitive neuroectodermal tumors of the cerebral hemispheres in two siblings with TP53 germline mutation. J Neuropathol Exp Neurol 57, 179-87 (1998). 63. Rines, R.D. et al. Comprehensive mutational scanning of the p53 coding region by two- dimensional gene scanning. Carcinogenesis 19, 979-84 (1998). 64. Sedlacek, Z. et al. Two Li-Fraumeni syndrome families with novel germline p53 mutations: loss of the wild-type p53 allele in only 50% of tumours. Br J Cancer 77, 1034-9 (1998). 65. Vital, A. et al. 
Astrocytomas and choroid plexus tumors in two families with identical p53 germline mutations. J Neuropathol Exp Neurol 57, 1061-9 (1998). 66. Auer, H. et al. Variations of p53 in cultured fibroblasts of patients with lung cancer who have a presumed genetic predisposition. Am J Clin Oncol 22, 278-82 (1999). 67. Gallo, O. et al. Multiple primary tumors of the upper aerodigestive tract: is there a role for constitutional mutations in the p53 gene? Int J Cancer 82, 180-6 (1999). 68. Guran, S., Tunca, Y. & Imirzalioglu, N. Hereditary TP53 codon 292 and somatic P16INK4A codon 94 mutations in a Li-Fraumeni syndrome family. Cancer Genet Cytogenet 113, 145-51 (1999). 69. Hung, J. et al. TP53 mutation and haplotype analysis of two large African American families. Hum Mutat 14, 216-21 (1999). 70. Huusko, P. et al. Germ-line TP53 mutations in Finnish cancer families exhibiting features of the Li-Fraumeni syndrome and negative for BRCA1 and BRCA2. Cancer Genet Cytogenet 112, 9-14 (1999).

71. Ishii, N. et al. Cells with TP53 mutations in low grade astrocytic tumors evolve clonally to malignancy and are an unfavorable prognostic factor. Oncogene 18, 5870-8 (1999). 72. Quesnel, S. et al. p53 compound heterozygosity in a severely affected child with Li-Fraumeni syndrome. Oncogene 18, 3970-8 (1999). 73. Sugano, K. et al. Germline p53 mutation in a case of Li-Fraumeni syndrome presenting gastric cancer. Jpn J Clin Oncol 29, 513-6 (1999). 74. Varley, J.M. et al. Are there low-penetrance TP53 Alleles? evidence from childhood adrenocortical tumors. Am J Hum Genet 65, 995-1006 (1999). 75. Zhou, X.P. et al. Germline mutations of p53 but not p16/CDKN2 or PTEN/MMAC1 tumor suppressor genes predispose to gliomas. The ANOCEF Group. Association des NeuroOncologues d'Expression Francaise. Ann Neurol 46, 913-6 (1999). 76. Chompret, A. et al. P53 germline mutations in childhood cancers and cancer risk for carrier individuals. Br J Cancer 82, 1932-7 (2000). 77. Grzybowska, E. et al. High frequency of recurrent mutations in BRCA1 and BRCA2 genes in Polish families with breast and ovarian cancer. Hum Mutat 16, 482-90 (2000). 78. Nutting, C. et al. A patient with 17 primary tumours and a germ line mutation in TP53: tumour induction by adjuvant therapy? Clin Oncol (R Coll Radiol) 12, 300-4 (2000). 79. Tachibana, I. et al. Investigation of germline PTEN, p53, p16(INK4A)/p14(ARF), and CDK4 alterations in familial glioma. Am J Med Genet 92, 136-41 (2000). 80. Zajac, V. et al. A double germline mutations in the APC and p53 genes. Neoplasma 47, 335-41 (2000). 81. Bougeard, G. et al. Detection of 11 germline inactivating TP53 mutations and absence of TP63 and HCHK2 mutations in 17 French families with Li-Fraumeni or Li-Fraumeni-like syndrome. J Med Genet 38, 253-7 (2001). 82. Joachim, T. et al. Comparative analysis of the NF2, TP53, PTEN, KRAS, NRAS and HRAS genes in sporadic and radiation-induced human meningiomas. Int J Cancer 94, 218-21 (2001). 83. Kimura, K. et al. 
Germline p53 mutation in a patient with multiple primary cancers. Jpn J Clin Oncol 31, 349-51 (2001). 84. Latronico, A.C. et al. An inherited mutation outside the highly conserved DNA-binding domain of the p53 tumor suppressor protein in children and adults with sporadic adrenocortical tumors. J Clin Endocrinol Metab 86, 4970-3 (2001). 85. Limacher, J.M., Frebourg, T., Natarajan-Ame, S. & Bergerat, J.P. Two metachronous tumors in the radiotherapy fields of a patient with Li-Fraumeni syndrome. Int J Cancer 96, 238-42 (2001). 86. Malkin, D. et al. Tissue-specific expression of SV40 in tumors associated with the Li-Fraumeni syndrome. Oncogene 20, 4441-9 (2001). 87. Rapakko, K. et al. Germline TP53 alterations in Finnish breast cancer families are rare and occur at conserved mutation-prone sites. Br J Cancer 84, 116-9 (2001).

88. Ribeiro, R.C. et al. An inherited p53 mutation that contributes in a tissue-specific manner to pediatric adrenal cortical carcinoma. Proc Natl Acad Sci U S A 98, 9330-5 (2001). 89. Vahteristo, P. et al. p53, CHK2, and CHK1 genes in Finnish families with Li-Fraumeni syndrome: further evidence of CHK2 in inherited cancer predisposition. Cancer Res 61, 5718-22 (2001). 90. Patrikidou, A., Bennett, J., Abou-Sleiman, P., Delhanty, J.D. & Harris, M. A novel, de novo germline TP53 mutation in a rare presentation of the Li-Fraumeni syndrome in the maxilla. Oral Oncol 38, 383-90 (2002). 91. Potzsch, C., Voigtlander, T. & Lubbert, M. p53 Germline mutation in a patient with Li-Fraumeni Syndrome and three metachronous malignancies. J Cancer Res Clin Oncol 128, 456-60 (2002). 92. Rutherford, J. et al. Investigations on a clinically and functionally unusual and novel germline p53 mutation. Br J Cancer 86, 1592-6 (2002). 93. Schaefer, K.L. et al. Analysis of TP53 germline mutations in pediatric tumor patients using DNA microarray-based sequencing technology. Med Pediatr Oncol 38, 247-53 (2002). 94. Hwang, S.J., Lozano, G., Amos, C.I. & Strong, L.C. Germline p53 mutations in a cohort with childhood sarcoma: sex differences in cancer risk. Am J Hum Genet 72, 975-83 (2003). 95. Lynch, H.T. et al. Familial sarcoma: challenging pedigrees. Cancer 98, 1947-57 (2003). 96. Martin, A.M. et al. Germline TP53 mutations in breast cancer families with multiple primary cancers: is TP53 a modifier of BRCA1? J Med Genet 40, e34 (2003). 97. Miyaki, M. et al. A novel case with germline p53 gene mutation having concurrent multiple primary colon tumours. Gut 52, 304-6 (2003). 98. Pepper, C. et al. Leukemic and non-leukemic lymphocytes from patients with Li Fraumeni syndrome demonstrate loss of p53 function, Bcl-2 family dysregulation and intrinsic resistance to conventional chemotherapeutic drugs but not flavopiridol. Cell Cycle 2, 53-8 (2003). 99. 
Trkova, M., Foretova, L., Kodet, R., Hedvicakova, P. & Sedlacek, Z. A Li-Fraumeni syndrome family with retained heterozygosity for a germline TP53 mutation in two tumors. Cancer Genet Cytogenet 145, 60-4 (2003). 100. Avigad, S. et al. Prenatal diagnosis in Li-Fraumeni syndrome. J Pediatr Hematol Oncol 26, 541-5 (2004). 101. Bendig, I., Mohr, N., Kramer, F. & Weber, B.H. Identification of novel TP53 mutations in familial and sporadic cancer cases of German and Swiss origin. Cancer Genet Cytogenet 154, 22-6 (2004). 102. Khayat, C.M. & Johnston, D.L. Rhabdomyosarcoma, osteosarcoma, and adrenocortical carcinoma in a child with a germline p53 mutation. Pediatr Blood Cancer 43, 683-6 (2004).

103. Kim, I.J. et al. A TP53-truncating germline mutation (E287X) in a family with characteristics of both hereditary diffuse gastric cancer and Li-Fraumeni syndrome. J Hum Genet 49, 591-5 (2004). 104. Nogales, F.F. et al. Multifocal intrafollicular granulosa cell tumor of the ovary associated with an unusual germline p53 mutation. Mod Pathol 17, 868-73 (2004). 105. Oliveira, C. et al. E-Cadherin (CDH1) and p53 rather than SMAD4 and Caspase-10 germline mutations contribute to genetic predisposition in Portuguese gastric cancer patients. Eur J Cancer 40, 1897-903 (2004). 106. Dickens, D.S., Dothage, J.A., Heideman, R.L., Ballard, E.T. & Jubinsky, P.T. Successful treatment of an unresectable choroid plexus carcinoma in a patient with Li-Fraumeni syndrome. J Pediatr Hematol Oncol 27, 46-9 (2005). 107. Rieske, P. et al. Atypical molecular background of glioblastoma and meningioma developed in a patient with Li-Fraumeni syndrome. J Neurooncol 71, 27-30 (2005). 108. Olivier, M., Hollstein, M. & Hainaut, P. TP53 mutations in human cancers: origins, consequences, and clinical use. Cold Spring Harb Perspect Biol 2, a001008. 109. Pavletich, N.P., Chambers, K.A. & Pabo, C.O. The DNA-binding domain of p53 contains the four conserved regions and the major mutation hot spots. Genes Dev 7, 2556-64 (1993). 110. Cho, Y., Gorina, S., Jeffrey, P.D. & Pavletich, N.P. Crystal structure of a p53 tumor suppressor-DNA complex: understanding tumorigenic mutations. Science 265, 346-55 (1994). 111. Montesano, R., Hainaut, P. & Wild, C.P. Hepatocellular carcinoma: from gene to public health. J Natl Cancer Inst 89, 1844-51 (1997). 112. Petitjean, A. et al. Impact of mutant p53 functional properties on TP53 mutation patterns and tumor phenotype: lessons from recent developments in the IARC TP53 database. Hum Mutat 28, 622-9 (2007). 113. Hussain, S.P. & Harris, C.C. Molecular epidemiology of human cancer: contribution of mutation spectra studies of tumor suppressor genes. 
Cancer Res 58, 4023-37 (1998). 114. Greenman, C. et al. Patterns of somatic mutation in human cancer genomes. Nature 446, 153-8 (2007). 115. Sjoblom, T. et al. The consensus coding sequences of human breast and colorectal cancers. Science 314, 268-74 (2006). 116. Oren, M. & Rotter, V. Mutant p53 gain-of-function in cancer. Cold Spring Harb Perspect Biol 2, a001107. 117. Lang, G.A. et al. Gain of function of a p53 hot spot mutation in a mouse model of Li-Fraumeni syndrome. Cell 119, 861-72 (2004). 118. Dittmer, D. et al. Gain of function mutations in p53. Nat Genet 4, 42-6 (1993). 119. Milner, J., Medcalf, E.A. & Cook, A.C. Tumor suppressor p53: analysis of wild-type and mutant p53 complexes. Mol Cell Biol 11, 12-9 (1991).

120. de Vries, A. et al. Targeted point mutations of p53 lead to dominant-negative inhibition of wild-type p53 function. Proc Natl Acad Sci U S A 99, 2948-53 (2002). 121. Harris, C.C. p53 tumor suppressor gene: at the crossroads of molecular carcinogenesis, molecular epidemiology, and cancer risk assessment. Environ Health Perspect 104 Suppl 3, 435-9 (1996). 122. Bressac, B., Kew, M., Wands, J. & Ozturk, M. Selective G to T mutations of p53 gene in hepatocellular carcinoma from southern Africa. Nature 350, 429-31 (1991). 123. Takeshima, Y. et al. p53 mutations in lung cancers from non-smoking atomic-bomb survivors. Lancet 342, 1520-1 (1993). 124. Kato, S. et al. Understanding the function-structure and function-mutation relationships of p53 tumor suppressor protein by high-resolution missense mutation analysis. Proc Natl Acad Sci U S A 100, 8424-9 (2003). 125. Easton, D.F., Ponder, M.A., Huson, S.M. & Ponder, B.A. An analysis of variation in expression of neurofibromatosis (NF) type 1 (NF1): evidence for modifying genes. Am J Hum Genet 53, 305-13 (1993). 126. Levy-Lahad, E. et al. A single nucleotide polymorphism in the RAD51 gene modifies cancer risk in BRCA2 but not BRCA1 carriers. Proc Natl Acad Sci U S A 98, 3232-6 (2001). 127. Kadouri, L. et al. A single-nucleotide polymorphism in the RAD51 gene modifies breast cancer risk in BRCA2 carriers, but not in BRCA1 carriers or noncarriers. Br J Cancer 90, 2002-5 (2004). 128. Wang, W.W. et al. A single nucleotide polymorphism in the 5' untranslated region of RAD51 and risk of cancer among BRCA1/2 mutation carriers. Cancer Epidemiol Biomarkers Prev 10, 955-60 (2001). 129. Harris, N. et al. Molecular basis for heterogeneity of the human p53 protein. Mol Cell Biol 6, 4650-6 (1986). 130. Dumont, P., Leu, J.I., Della Pietra, A.C., 3rd, George, D.L. & Murphy, M. The codon 72 polymorphic variants of p53 have markedly different apoptotic potential. Nat Genet 33, 357-65 (2003). 131. Bond, G.L. et al. 
A single nucleotide polymorphism in the MDM2 promoter attenuates the p53 tumor suppressor pathway and accelerates tumor formation in humans. Cell 119, 591-602 (2004). 132. Bougeard, G. et al. Impact of the MDM2 SNP309 and p53 Arg72Pro polymorphism on age of tumour onset in Li-Fraumeni syndrome. J Med Genet 43, 531-3 (2006). 133. Marcel, V. et al. TP53 PIN3 and MDM2 SNP309 polymorphisms as genetic modifiers in the Li-Fraumeni syndrome: impact on age at first diagnosis. J Med Genet 46, 766-72 (2009). 134. McInnis, M.G. Anticipation: an old idea in new genes. Am J Hum Genet 59, 973-9 (1996). 135. Howeler, C.J., Busch, H.F., Geraedts, J.P., Niermeijer, M.F. & Staal, A. Anticipation in myotonic dystrophy: fact or fiction? Brain 112 ( Pt 3), 779-97 (1989).

136. Brook, J.D. et al. Molecular basis of myotonic dystrophy: expansion of a trinucleotide (CTG) repeat at the 3' end of a transcript encoding a protein kinase family member. Cell 69, 385 (1992). 137. Trkova, M., Hladikova, M., Kasal, P., Goetz, P. & Sedlacek, Z. Is there anticipation in the age at onset of cancer in families with Li-Fraumeni syndrome? J Hum Genet 47, 381-6 (2002). 138. Brown, B.W., Costello, T.J., Hwang, S.J. & Strong, L.C. Generation or birth cohort effect on cancer risk in Li-Fraumeni syndrome. Hum Genet 118, 489-98 (2005). 139. Tabori, U., Nanda, S., Druker, H., Lees, J. & Malkin, D. Younger age of cancer initiation is associated with shorter telomere length in Li-Fraumeni syndrome. Cancer Res 67, 1415-8 (2007). 140. Trkova, M., Prochazkova, K., Krutilkova, V., Sumerauer, D. & Sedlacek, Z. Telomere length in peripheral blood cells of germline TP53 mutation carriers is shorter than that of normal individuals of corresponding age. Cancer 110, 694-702 (2007). 141. Pinto, E.M. et al. Founder effect for the highly prevalent R337H mutation of tumor suppressor p53 in Brazilian patients with adrenocortical tumors. Arq Bras Endocrinol Metabol 48, 647-50 (2004). 142. Figueiredo, B.C. et al. Penetrance of adrenocortical tumours associated with the germline TP53 R337H mutation. J Med Genet 43, 91-6 (2006). 143. Birch, J.M. et al. Prevalence and diversity of constitutional mutations in the p53 gene among 21 Li-Fraumeni families. Cancer Res 54, 1298-304 (1994). 144. Achatz, M.I. et al. The TP53 mutation, R337H, is associated with Li-Fraumeni and Li-Fraumeni-like syndromes in Brazilian families. Cancer Lett 245, 96-102 (2007). 145. Olivier, M. et al. Li-Fraumeni and related syndromes: correlation between tumor type, family structure, and TP53 genotype. Cancer Res 63, 6643-50 (2003). 146. Wagner, J. et al. High frequency of germline p53 mutations in childhood adrenocortical cancer. J Natl Cancer Inst 86, 1707-10 (1994). 147. Easton, D.F., Ford, D. 
& Bishop, D.T. Breast and ovarian cancer incidence in BRCA1-mutation carriers. Breast Cancer Linkage Consortium. Am J Hum Genet 56, 265-71 (1995). 148. Elledge, S.J. & Amon, A. The BRCA1 suppressor hypothesis: an explanation for the tissue-specific tumor development in BRCA1 patients. Cancer Cell 1, 129-32 (2002). 149. Jeffrey, P.D., Gorina, S. & Pavletich, N.P. Crystal structure of the tetramerization domain of the p53 tumor suppressor at 1.7 angstroms. Science 267, 1498-502 (1995). 150. DiGiammarino, E.L. et al. A novel mechanism of tumorigenesis involving pH-dependent destabilization of a mutant p53 tetramer. Nat Struct Biol 9, 12-6 (2002). 151. Wong, P. et al. Prevalence of early onset colorectal cancer in 397 patients with classic Li-Fraumeni syndrome. Gastroenterology 130, 73-9 (2006). 152. Bischoff, F.Z. et al. Spontaneous abnormalities in normal fibroblasts from patients with Li-Fraumeni cancer syndrome: aneuploidy and immortalization. Cancer Res 50, 7979-84 (1990).

153. Shay, J.W., Tomlinson, G., Piatyszek, M.A. & Gollahon, L.S. Spontaneous in vitro immortalization of breast epithelial cells from a patient with Li-Fraumeni syndrome. Mol Cell Biol 15, 425-32 (1995). 154. Knudson, A.G., Jr., Hethcote, H.W. & Brown, B.W. Mutation and childhood cancer: a probabilistic model for the incidence of retinoblastoma. Proc Natl Acad Sci U S A 72, 5116-20 (1975). 155. Venkatachalam, S. et al. Retention of wild-type p53 in tumors from p53 heterozygous mice: reduction of p53 dosage can promote cancer formation. EMBO J 17, 4657-67 (1998). 156. Yuasa, H., Tokito, S. & Tokunaga, M. Primary carcinoma of the choroid plexus in Li-Fraumeni syndrome: case report. Neurosurgery 32, 131-3; discussion 133-4 (1993). 157. Wang, L. & Cornford, M.E. Coincident choroid plexus carcinoma and adrenocortical carcinoma with elevated p53 expression: a case report of an 18-month-old boy with no family history of cancer. Arch Pathol Lab Med 126, 70-2 (2002). 158. Garber, J.E. et al. Choroid plexus tumors in the breast cancer-sarcoma syndrome. Cancer 66, 2658-60 (1990). 159. Krutilkova, V. et al. Identification of five new families strengthens the link between childhood choroid plexus carcinoma and germline TP53 mutations. Eur J Cancer 41, 1597-603 (2005). 160. Gonzalez, K.D. et al. Beyond Li Fraumeni Syndrome: clinical characteristics of families with p53 germline mutations. J Clin Oncol 27, 1250-6 (2009). 161. Ruijs, M.W. et al. TP53 germline mutation testing in 180 families suspected of Li-Fraumeni syndrome: mutation detection rate and relative frequency of cancers in different familial phenotypes. J Med Genet 47, 421-8. 162. Tinat, J. et al. 2009 version of the Chompret criteria for Li Fraumeni syndrome. J Clin Oncol 27, e108-9; author reply e110 (2009). 163. Strazielle, N. & Ghersi-Egea, J.F. Choroid plexus in the central nervous system: biology and physiopathology. J Neuropathol Exp Neurol 59, 561-74 (2000). 164. Brown, P.D., Davies, S.L., Speake, T. 
& Millar, I.D. Molecular mechanisms of cerebrospinal fluid production. Neuroscience 129, 957-70 (2004). 165. Louis, D.N. et al. The 2007 WHO classification of tumours of the central nervous system. Acta Neuropathol 114, 97-109 (2007). 166. Jeibmann, A. et al. Prognostic implications of atypical histologic features in choroid plexus papilloma. J Neuropathol Exp Neurol 65, 1069-73 (2006). 167. Wrede, B. et al. Atypical choroid plexus papilloma: clinical experience in the CPT-SIOP-2000 study. J Neurooncol 95, 383-92 (2009). 168. Jeibmann, A. et al. Malignant progression in choroid plexus papillomas. J Neurosurg 107, 199-202 (2007). 169. Hasselblatt, M. et al. Identification of novel diagnostic markers for choroid plexus tumors: a microarray-based approach. Am J Surg Pathol 30, 66-74 (2006).

170. Doring, F. et al. The epithelial inward rectifier channel Kir7.1 displays unusual K+ permeation properties. J Neurosci 18, 8625-36 (1998). 171. Hasselblatt, M. et al. TWIST-1 is overexpressed in neoplastic choroid plexus epithelial cells and promotes proliferation and invasion. Cancer Res 69, 2219-23 (2009). 172. Mertens, F. et al. Recurrent chromosomal imbalances in choroid plexus tumors. Cancer Genet Cytogenet 80, 83-4 (1995). 173. Rickert, C.H., Wiestler, O.D. & Paulus, W. Chromosomal imbalances in choroid plexus tumors. Am J Pathol 160, 1105-13 (2002). 174. Feuk, L., Carson, A.R. & Scherer, S.W. Structural variation in the human genome. Nat Rev Genet 7, 85-97 (2006). 175. Gault, J. et al. Comparison of polymorphisms in the alpha7 nicotinic receptor gene and its partial duplication in schizophrenic and control subjects. Am J Med Genet B Neuropsychiatr Genet 123B, 39-49 (2003). 176. Traherne, J.A. Human MHC architecture and evolution: implications for disease association studies. Int J Immunogenet 35, 179-92 (2008). 177. Iafrate, A.J. et al. Detection of large-scale variation in the human genome. Nat Genet 36, 949-51 (2004). 178. Sebat, J. et al. Large-scale copy number polymorphism in the human genome. Science 305, 525-8 (2004). 179. Conrad, D.F. et al. Origins and functional impact of copy number variation in the human genome. Nature 464, 704-12. 180. Stranger, B.E. et al. Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 315, 848-53 (2007). 181. Henrichsen, C.N. et al. Segmental copy number variation shapes tissue transcriptomes. Nat Genet 41, 424-9 (2009). 182. Guryev, V. et al. Distribution and functional impact of DNA copy number variation in the rat. Nat Genet 40, 538-45 (2008). 183. Inoue, K. & Lupski, J.R. Molecular mechanisms for genomic disorders. Annu Rev Genomics Hum Genet 3, 199-242 (2002). 184. Merla, G. et al. 
Submicroscopic deletion in patients with Williams-Beuren syndrome influences expression levels of the nonhemizygous flanking genes. Am J Hum Genet 79, 332-41 (2006). 185. Kleinjan, D.A. & van Heyningen, V. Long-range control of gene expression: emerging mechanisms and disruption in disease. Am J Hum Genet 76, 8-32 (2005). 186. Turner, D.J. et al. Germline rates of de novo meiotic deletions and duplications causing several genomic disorders. Nat Genet 40, 90-5 (2008). 187. Maher, B. Personal genomes: The case of the missing heritability. Nature 456, 18-21 (2008). 188. Higgins, M.E., Claremont, M., Major, J.E., Sander, C. & Lash, A.E. CancerGenes: a gene selection resource for cancer genome projects. Nucleic Acids Res 35, D721-6 (2007).

189. Shlien, A. et al. Excessive genomic DNA copy number variation in the Li-Fraumeni cancer predisposition syndrome. Proc Natl Acad Sci U S A 105, 11264-9 (2008). 190. Redon, R. et al. Global variation in copy number in the human genome. Nature 444, 444-54 (2006). 191. McCarroll, S.A. et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet 40, 1166-74 (2008). 192. Forbes, S.A. et al. The Catalogue of Somatic Mutations in Cancer (COSMIC). Curr Protoc Hum Genet Chapter 10, Unit 10 11 (2008). 193. Thomas, G. et al. A multistage genome-wide association study in breast cancer identifies two new risk alleles at 1p11.2 and 14q24.1 (RAD51L1). Nat Genet 41, 579-84 (2009). 194. Perry, G.H. et al. The fine-scale and complex architecture of human copy-number variation. Am J Hum Genet 82, 685-95 (2008). 195. Frank, B. et al. Copy number variant in the candidate tumor suppressor gene MTUS1 and familial breast cancer risk. Carcinogenesis 28, 1442-5 (2007). 196. McCarroll, S.A. & Altshuler, D.M. Copy-number variation and association studies of human disease. Nat Genet 39, S37-42 (2007). 197. Ionita-Laza, I., Rogers, A.J., Lange, C., Raby, B.A. & Lee, C. Genetic association analysis of copy-number variation (CNV) in human disease pathogenesis. Genomics 93, 22-6 (2009). 198. Craddock, N. et al. Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature 464, 713-20. 199. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661-78 (2007). 200. Barnes, C. et al. A robust statistical method for case-control association testing with copy number variation. Nat Genet 40, 1245-52 (2008). 201. McCarroll, S.A. et al. Deletion polymorphism upstream of IRGM associated with altered IRGM expression and Crohn's disease. Nat Genet (2008). 202. Diskin, S.J. et al. Copy number variation at 1q21.1 associated with neuroblastoma. 
Nature 459, 987-91 (2009). 203. Vandepoele, K., Van Roy, N., Staes, K., Speleman, F. & van Roy, F. A novel gene family NBPF: intricate structure generated by gene duplications during primate evolution. Mol Biol Evol 22, 2265-74 (2005). 204. Laureys, G., Speleman, F., Opdenakker, G., Benoit, Y. & Leroy, J. Constitutional translocation t(1;17)(p36;q12-21) in a patient with neuroblastoma. Genes Chromosomes Cancer 2, 252-4 (1990). 205. Stefansson, H. et al. Large recurrent microdeletions associated with schizophrenia. Nature 455, 232-6 (2008). 206. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 455, 237-41 (2008).

207. Mefford, H.C. et al. Recurrent rearrangements of chromosome 1q21.1 and variable pediatric phenotypes. N Engl J Med 359, 1685-99 (2008). 208. Greenway, S.C. et al. De novo copy number variants identify new genes and loci in isolated sporadic tetralogy of Fallot. Nat Genet 41, 931-5 (2009). 209. Nagy, R., Sweet, K. & Eng, C. Highly penetrant hereditary cancer syndromes. Oncogene 23, 6445-70 (2004). 210. Schouten, J.P. et al. Relative quantification of 40 nucleic acid sequences by multiplex ligation-dependent probe amplification. Nucleic Acids Res 30, e57 (2002). 211. Futreal, P.A. et al. A census of human cancer genes. Nat Rev Cancer 4, 177-83 (2004). 212. Jackson, E.M. et al. High-density single nucleotide polymorphism array analysis in patients with germline deletions of 22q11.2 and malignant rhabdoid tumor. Hum Genet 122, 117-27 (2007). 213. Hodgson, S.V. et al. Two cases of 5q deletions in patients with familial adenomatous polyposis: possible link with Caroli's disease. J Med Genet 30, 369-75 (1993). 214. Su, L.K. et al. Genomic rearrangements of the APC tumor-suppressor gene in familial adenomatous polyposis. Hum Genet 106, 101-7 (2000). 215. Aretz, S. et al. Large submicroscopic genomic APC deletions are a common cause of typical familial adenomatous polyposis. J Med Genet 42, 185-92 (2005). 216. Charames, G.S. et al. A large novel deletion in the APC promoter region causes gene silencing and leads to classical familial adenomatous polyposis in a Manitoba Mennonite kindred. Hum Genet 124, 535-41 (2008). 217. Delnatte, C. et al. Contiguous gene deletion within chromosome arm 10q is associated with juvenile polyposis of infancy, reflecting cooperation between the BMPR1A and PTEN tumor-suppressor genes. Am J Hum Genet 78, 1066-74 (2006). 218. Petrij-Bosch, A. et al. BRCA1 genomic deletions are major founder mutations in Dutch breast cancer patients. Nat Genet 17, 341-5 (1997). 219. Montagna, M. et al. 
Genomic rearrangements account for more than one-third of the BRCA1 mutations in northern Italian breast/ovarian cancer families. Hum Mol Genet 12, 1055-61 (2003). 220. Casilli, F. et al. The contribution of germline rearrangements to the spectrum of BRCA2 mutations. J Med Genet 43, e49 (2006). 221. Lesueur, F. et al. The contribution of large genomic deletions at the CDKN2A locus to the burden of familial melanoma. Br J Cancer 99, 364-70 (2008). 222. Cybulski, C. et al. A large germline deletion in the Chek2 kinase gene is associated with an increased risk of prostate cancer. J Med Genet 43, 863-6 (2006). 223. Cybulski, C. et al. A deletion in CHEK2 of 5,395 bp predisposes to breast cancer in Poland. Breast Cancer Res Treat 102, 119-22 (2007). 224. Levran, O. et al. Spectrum of sequence variations in the FANCA gene: an International Fanconi Anemia Registry (IFAR) study. Hum Mutat 25, 142-9 (2005).

225. van Hattem, W.A. et al. Large genomic deletions of SMAD4, BMPR1A and PTEN in juvenile polyposis. Gut 57, 623-7 (2008). 226. Kishi, M. et al. A large germline deletion of the MEN1 gene in a family with multiple endocrine neoplasia type 1. Jpn J Cancer Res 89, 1-5 (1998). 227. Nystrom-Lahti, M. et al. Founding mutations and Alu-mediated recombination in hereditary colon cancer. Nat Med 1, 1203-6 (1995). 228. Chan, T.L. et al. A novel germline 1.8-kb deletion of hMLH1 mimicking alternative splicing: a founder mutation in the Chinese population. Oncogene 20, 2976-81 (2001). 229. Stella, A. et al. Germline novel MSH2 deletions and a founder MSH2 deletion associated with anticipation effects in HNPCC. Clin Genet 71, 130-9 (2007). 230. Plaschke, J., Ruschoff, J. & Schackert, H.K. Genomic rearrangements of hMSH6 contribute to the genetic predisposition in suspected hereditary non-polyposis colorectal cancer syndrome. J Med Genet 40, 597-600 (2003). 231. Riva, P. et al. NF1 microdeletion syndrome: refined FISH characterization of sporadic and familial deletions with locus-specific probes. Am J Hum Genet 66, 100-9 (2000). 232. Bausch, B., Borozdin, W. & Neumann, H.P. Clinical and genetic characteristics of patients with neurofibromatosis type 1 and pheochromocytoma. N Engl J Med 354, 2729-31 (2006). 233. Tsilchorozidou, T. et al. Constitutional rearrangements of chromosome 22 as a cause of neurofibromatosis 2. J Med Genet 41, 529-34 (2004). 234. Horvath, A. et al. Large deletions of the PRKAR1A gene in Carney complex. Clin Cancer Res 14, 388-95 (2008). 235. Shimkets, R. et al. Molecular analysis of chromosome 9q deletions in two Gorlin syndrome patients. Am J Hum Genet 59, 417-22 (1996). 236. Bremner, R. et al. Deletion of RB exons 24 and 25 causes low-penetrance retinoblastoma. Am J Hum Genet 61, 556-70 (1997). 237. Cascon, A. et al. Gross SDHB deletions in patients with paraganglioma detected by multiplex PCR: a possible hot spot? 
Genes Chromosomes Cancer 45, 213-9 (2006). 238. Baysal, B.E. et al. An Alu-mediated partial SDHC deletion causes familial and sporadic paraganglioma. J Med Genet 41, 703-9 (2004). 239. McWhinney, S.R. et al. Large germline deletions of mitochondrial complex II subunits SDHB and SDHD in hereditary paraganglioma. J Clin Endocrinol Metab 89, 5694-9 (2004). 240. Swensen, J.J. et al. Familial occurrence of schwannomas and malignant rhabdoid tumour associated with a duplication in SMARCB1. J Med Genet 46, 68-72 (2009). 241. Le Meur, N. et al. Complete germline deletion of the STK11 gene in a family with Peutz-Jeghers syndrome. Eur J Hum Genet 12, 415-8 (2004). 242. Bougeard, G. et al. Screening for TP53 rearrangements in families with the Li-Fraumeni syndrome reveals a complete deletion of the TP53 gene. Oncogene 22, 840-6 (2003).

243. Bougeard, G. et al. Molecular basis of the Li-Fraumeni syndrome: an update from the French LFS families. J Med Genet 45, 535-8 (2008). 244. Kozlowski, P. et al. Identification of 54 large deletions/duplications in TSC1 and TSC2 using MLPA, and genotype-phenotype correlations. Hum Genet 121, 389-400 (2007). 245. Richards, F.M. et al. Mapping the Von Hippel-Lindau disease tumour suppressor gene: identification of germline deletions by pulsed field gel electrophoresis. Hum Mol Genet 2, 879-82 (1993). 246. Huff, V. et al. Evidence for WT1 as a Wilms tumor (WT) gene: intragenic germinal deletion in bilateral WT. Am J Hum Genet 48, 997-1003 (1991). 247. Hastings, P.J., Lupski, J.R., Rosenberg, S.M. & Ira, G. Mechanisms of change in gene copy number. Nat Rev Genet 10, 551-64 (2009). 248. Neale, M.J. & Keeney, S. Clarifying the mechanics of DNA strand exchange in meiotic recombination. Nature 442, 153-8 (2006). 249. Marques-Bonet, T., Girirajan, S. & Eichler, E.E. The origins and impact of primate segmental duplications. Trends Genet 25, 443-54 (2009). 250. Llorente, B., Smith, C.E. & Symington, L.S. Break-induced replication: what is it and what is it for? Cell Cycle 7, 859-64 (2008). 251. Lieber, M.R. The mechanism of double-strand DNA break repair by the nonhomologous DNA end-joining pathway. Annu Rev Biochem 79, 181-211. 252. Toffolatti, L. et al. Investigating the mechanism of chromosomal deletion: characterization of 39 deletion breakpoints in introns 47 and 48 of the human dystrophin gene. Genomics 80, 523-30 (2002). 253. Bzymek, M. & Lovett, S.T. Instability of repetitive DNA sequences: the role of replication in multiple mechanisms. Proc Natl Acad Sci U S A 98, 8319-25 (2001). 254. Lee, J.A., Carvalho, C.M. & Lupski, J.R. A DNA replication mechanism for generating nonrecurrent rearrangements associated with genomic disorders. Cell 131, 1235-47 (2007). 255. Weir, B.A. et al. Characterizing the cancer genome in lung adenocarcinoma. Nature 450, 893-8 (2007). 
256. Mullighan, C.G. et al. Genome-wide analysis of genetic alterations in acute lymphoblastic leukaemia. Nature 446, 758-64 (2007). 257. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455, 1061-8 (2008). 258. Malkin, D. Predictive genetic testing for childhood cancer: taking the road less traveled by. J Pediatr Hematol Oncol 26, 546-8 (2004). 259. Perry, G.H. et al. Hotspots for copy number variation in chimpanzees and humans. Proc Natl Acad Sci U S A 103, 8006-11 (2006). 260. Lupski, J.R. Genomic rearrangements and sporadic disease. Nat Genet 39, S43-7 (2007). 261. Vasudevan, S.A., Nuchtern, J.G. & Shohet, J.M. Gene profiling of high risk neuroblastoma. World J Surg 29, 317-24 (2005).

262. Camp, N.J. et al. Compelling evidence for a prostate cancer gene at 22q12.3 by the International Consortium for Prostate Cancer Genetics. Hum Mol Genet 16, 1271-8 (2007). 263. Easton, D.F. et al. Genome-wide association study identifies novel breast cancer susceptibility loci. Nature 447, 1087-93 (2007). 264. Gudmundsson, J. et al. Genome-wide association study identifies a second prostate cancer susceptibility variant at 8q24. Nat Genet 39, 631-7 (2007). 265. Hunter, D.J. et al. A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nat Genet 39, 870-4 (2007). 266. Stacey, S.N. et al. Common variants on chromosomes 2q35 and 16q12 confer susceptibility to estrogen receptor-positive breast cancer. Nat Genet 39, 865-9 (2007). 267. Pinto, D., Marshall, C., Feuk, L. & Scherer, S.W. Copy-number variation in control population cohorts. Hum Mol Genet 16 Spec No. 2, R168-73 (2007). 268. Lin, M. et al. dChipSNP: significance curve and clustering of SNP-array-based loss-of-heterozygosity data. Bioinformatics 20, 1233-40 (2004). 269. Nichols, K.E., Malkin, D., Garber, J.E., Fraumeni, J.F., Jr. & Li, F.P. Germ-line p53 mutations predispose to a wide spectrum of early-onset cancers. Cancer Epidemiol Biomarkers Prev 10, 83-7 (2001). 270. Bond, G.L., Hu, W. & Levine, A. A single nucleotide polymorphism in the MDM2 gene: from a molecular and cellular explanation to clinical effect. Cancer Res 65, 5481-4 (2005). 271. Eyfjord, J.E. et al. TP53 abnormalities and genetic instability in breast cancer. Acta Oncol 34, 663-7 (1995). 272. Georgiades, I.B., Curtis, L.J., Morris, R.M., Bird, C.C. & Wyllie, A.H. Heterogeneity studies identify a subset of sporadic colorectal cancers without evidence for chromosomal or microsatellite instability. Oncogene 18, 7933-40 (1999). 273. Primdahl, H. et al. Allelic imbalances in human bladder cancer: genome-wide detection with high-density single-nucleotide polymorphism arrays. 
J Natl Cancer Inst 94, 216-23 (2002). 274. Lane, D.P. Cancer. p53, guardian of the genome. Nature 358, 15-6 (1992). 275. Krawczak, M. et al. PopGen: population-based recruitment of patients and controls for the analysis of complex genotype-phenotype relationships. Community Genet 9, 55-61 (2006). 276. A haplotype map of the human genome. Nature 437, 1299-320 (2005). 277. Matsuzaki, H. et al. Genotyping over 100,000 SNPs on a pair of oligonucleotide arrays. Nat Methods 1, 109-11 (2004). 278. Nannya, Y. et al. A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Res 65, 6071-9 (2005). 273

279. Komura, D. et al. Genome-wide detection of human copy number variations using high-density DNA oligonucleotide arrays. Genome Res 16, 1575-84 (2006).
280. Gruber, S.B. New developments in Lynch syndrome (hereditary nonpolyposis colorectal cancer) and mismatch repair gene testing. Gastroenterology 130, 577-87 (2006).
281. van der Klift, H. et al. Molecular characterization of the spectrum of genomic deletions in the mismatch repair genes MSH2, MLH1, MSH6, and PMS2 responsible for hereditary nonpolyposis colorectal cancer (HNPCC). Genes Chromosomes Cancer 44, 123-38 (2005).
282. Huebner, K. & Croce, C.M. FRA3B and other common fragile sites: the weakest links. Nat Rev Cancer 1, 214-21 (2001).
283. Prasad, R. et al. Cloning of the ALL-1 fusion partner, the AF-6 gene, involved in acute myeloid leukemias with the t(6;11) chromosome translocation. Cancer Res 53, 5624-8 (1993).
284. Mochizuki, S. & Okada, Y. ADAMs in cancer cell proliferation and progression. Cancer Sci 98, 621-8 (2007).
285. Conrad, D.F. et al. Origins and functional impact of copy number variation in the human genome. Nature (2009).
286. Shlien, A. & Malkin, D. Copy number variations and cancer. Genome Med 1, 62 (2009).
287. Malinge, S., Izraeli, S. & Crispino, J.D. Insights into the manifestations, outcomes, and mechanisms of leukemogenesis in Down syndrome. Blood 113, 2619-28 (2009).
288. Scherer, S.W. et al. Challenges and standards in integrating surveys of structural variation. Nat Genet 39, S7-15 (2007).
289. Kidd, J.M. et al. Mapping and sequencing of structural variation from eight human genomes. Nature 453, 56-64 (2008).
290. Conrad, D.F. et al. Mutation spectrum revealed by breakpoint sequencing of human germline CNVs. Nat Genet 42, 385-91 (2010).
291. Lieber, M.R., Ma, Y., Pannicke, U. & Schwarz, K. Mechanism and regulation of human non-homologous DNA end-joining. Nat Rev Mol Cell Biol 4, 712-20 (2003).
292. Knudson, A.G., Jr. Mutation and cancer: statistical study of retinoblastoma. Proc Natl Acad Sci U S A 68, 820-3 (1971).
293. Kranz, C. et al. A mutation in the human MPDU1 gene causes congenital disorder of glycosylation type If (CDG-If). J Clin Invest 108, 1613-9 (2001).
294. Darnell, J.C., Fraser, C.E., Mostovetsky, O. & Darnell, R.B. Discrimination of common and unique RNA-binding activities among Fragile X mental retardation protein paralogs. Hum Mol Genet 18, 3164-77 (2009).
295. Kullander, K. et al. Role of EphA4 and EphrinB3 in local neuronal circuits that control walking. Science 299, 1889-92 (2003).
296. Schluth-Bolard, C. et al. 17p13.1 microdeletion involving the TP53 gene in a boy presenting with mental retardation but no tumor. Am J Med Genet A 152A, 1278-82 (2010).

297. Adam, M.P. et al. Clinical utility of array comparative genomic hybridization: uncovering tumor susceptibility in individuals with developmental delay. J Pediatr 154, 143-6 (2009).
298. Krepischi-Santos, A.C. et al. Constitutional haploinsufficiency of tumor suppressor genes in mentally retarded patients with microdeletions in 17p13.1. Cytogenet Genome Res 125, 1-7 (2009).
299. Schwarzbraun, T. et al. Predictive diagnosis of the cancer prone Li-Fraumeni syndrome by accident: new challenges through whole genome array testing. J Med Genet 46, 341-4 (2009).
300. Sen, S.K. et al. Human genomic deletions mediated by recombination between Alu elements. Am J Hum Genet 79, 41-53 (2006).
301. Batzer, M.A. & Deininger, P.L. Alu repeats and human genomic diversity. Nat Rev Genet 3, 370-9 (2002).
302. Deininger, P.L. & Batzer, M.A. Alu repeats and human disease. Mol Genet Metab 67, 183-93 (1999).
303. Smith, T.M. et al. Complete genomic sequence and analysis of 117 kb of human DNA containing the gene BRCA1. Genome Res 6, 1029-49 (1996).
304. Slebos, R.J., Resnick, M.A. & Taylor, J.A. Inactivation of the p53 tumor suppressor gene via a novel Alu rearrangement. Cancer Res 58, 5333-6 (1998).
305. Plummer, S.J. et al. A germline 2.35 kb deletion of p53 genomic DNA creating a specific loss of the oligomerization domain inherited in a Li-Fraumeni syndrome family. Oncogene 9, 3273-80 (1994).
306. Garber, J.E. et al. Follow-up study of twenty-four families with Li-Fraumeni syndrome. Cancer Res 51, 6094-7 (1991).
307. Wolff, J.E., Sajedi, M., Brant, R., Coppes, M.J. & Egeler, R.M. Choroid plexus tumours. Br J Cancer 87, 1086-91 (2002).
308. Berger, C. et al. Choroid plexus carcinomas in childhood: clinical features and prognostic factors. Neurosurgery 42, 470-5 (1998).
309. Fitzpatrick, L.K., Aronson, L.J. & Cohen, K.J. Is there a requirement for adjuvant therapy for choroid plexus carcinoma that has been completely resected? J Neurooncol 57, 123-6 (2002).
310. Wrede, B., Liu, P. & Wolff, J.E. Chemotherapy improves the survival of patients with choroid plexus carcinoma: a meta-analysis of individual cases with choroid plexus tumors. J Neurooncol 85, 345-51 (2007).
311. Wolff, J.E., Sajedi, M., Coppes, M.J., Anderson, R.A. & Egeler, R.M. Radiation therapy and survival in choroid plexus carcinoma. Lancet 353, 2126 (1999).
312. Mirzayans, R., Severin, D. & Murray, D. Relationship between DNA double-strand break rejoining and cell survival after exposure to ionizing radiation in human fibroblast strains with differing ATM/p53 status: implications for evaluation of clinical radiosensitivity. Int J Radiat Oncol Biol Phys 66, 1498-505 (2006).

313. Lavra, L. et al. Gal-3 is stimulated by gain-of-function p53 mutations and modulates chemoresistance in anaplastic thyroid carcinomas. J Pathol 218, 66-75 (2009).
314. Cooper, M., Li, S.Q., Bhardwaj, T., Rohan, T. & Kandel, R.A. Evaluation of oligonucleotide arrays for sequencing of the p53 gene in DNA from formalin-fixed, paraffin-embedded breast cancer specimens. Clin Chem 50, 500-8 (2004).
315. Shaulsky, G., Ben-Ze'ev, A. & Rotter, V. Subcellular distribution of the p53 protein during the cell cycle of Balb/c 3T3 cells. Oncogene 5, 1707-11 (1990).
316. Ray, A. et al. A clinicobiological model predicting survival in medulloblastoma. Clin Cancer Res 10, 7613-20 (2004).
317. Whibley, C., Pharoah, P.D. & Hollstein, M. p53 polymorphisms: cancer implications. Nat Rev Cancer 9, 95-107 (2009).
318. Adesina, A.M., Nalbantoglu, J. & Cavenee, W.K. p53 gene mutation and mdm2 gene amplification are uncommon in medulloblastoma. Cancer Res 54, 5649-51 (1994).
319. Saylors, R.L., 3rd et al. Infrequent p53 gene mutations in medulloblastomas. Cancer Res 51, 4721-3 (1991).
320. Momand, J., Jung, D., Wilczynski, S. & Niland, J. The MDM2 gene amplification database. Nucleic Acids Res 26, 3453-9 (1998).
321. Shlien A, T.U., Baskin B, Rotin L, Marshall C, Feuk L, Hudgins L, Nichols K, Scherer S, Ray P, Malkin D. DNA Copy Number Variation and Cancer susceptibility: The Li-Fraumeni Syndrome paradigm. Chromosome Research 17, S11-12 (2009).
322. Narod, S.A. & Foulkes, W.D. BRCA1 and BRCA2: 1994 and beyond. Nat Rev Cancer 4, 665-76 (2004).
323. Barrow, E. et al. Cumulative lifetime incidence of extracolonic cancers in Lynch syndrome: a report of 121 families with proven mutations. Clin Genet 75, 141-9 (2009).
324. Vasen, H.F. et al. Guidelines for the clinical management of Lynch syndrome (hereditary non-polyposis cancer). J Med Genet 44, 353-62 (2007).
325. Chompret, A. et al. Sensitivity and predictive value of criteria for p53 germline mutation screening. J Med Genet 38, 43-7 (2001).
326. Tabori, U. & Malkin, D. Risk stratification in cancer predisposition syndromes: lessons learned from novel molecular developments in Li-Fraumeni syndrome. Cancer Res 68, 2053-7 (2008).
327. Varley, J.M. et al. Characterization of germline TP53 splicing mutations and their genetic and functional analysis. Oncogene 20, 2647-54 (2001).
328. Pizzo, P.A. & Poplack, D.G. Principles and practice of pediatric oncology, xiv, 1780 p. (Lippincott Williams & Wilkins, Philadelphia, 2006).
329. Akiyama, T., Dass, C.R. & Choong, P.F. Novel therapeutic strategy for osteosarcoma targeting osteoclast differentiation, bone-resorbing activity, and apoptosis pathway. Mol Cancer Ther 7, 3461-9 (2008).
330. Friend, S.H. et al. A human DNA segment with properties of the gene that predisposes to retinoblastoma and osteosarcoma. Nature 323, 643-6 (1986).

331. Friend, S.H. et al. Deletions of a DNA sequence in retinoblastomas and mesenchymal tumors: organization of the sequence and its encoded protein. Proc Natl Acad Sci U S A 84, 9059-63 (1987).
332. Hansen, M.F. et al. Osteosarcoma and retinoblastoma: a shared chromosomal mechanism revealing recessive predisposition. Proc Natl Acad Sci U S A 82, 6216-20 (1985).
333. McIntyre, J.F. et al. Germline mutations of the p53 tumor suppressor gene in children with osteosarcoma. J Clin Oncol 12, 925-30 (1994).
334. Murphree, A.L. & Benedict, W.F. Retinoblastoma: clues to human oncogenesis. Science 223, 1028-33 (1984).
335. Toguchida, J. et al. Mutation Spectrum of the p53 Gene in Bone and Soft Tissue Sarcomas. Cancer Res 52, 6194-6199 (1992).
336. Wang, L.L. et al. Association between osteosarcoma and deleterious mutations in the RECQL4 gene in Rothmund-Thomson syndrome. J Natl Cancer Inst 95, 669-74 (2003).
337. Forus, A. et al. Comparative genomic hybridization analysis of human sarcomas: I. Occurrence of genomic imbalances and identification of a novel major amplicon at 1q21-q22 in soft tissue sarcomas. Genes Chromosomes Cancer 14, 8-14 (1995).
338. Ozaki, T. et al. Genetic imbalances revealed by comparative genomic hybridization in osteosarcomas. Int J Cancer 102, 355-65 (2002).
339. Selvarajah, S. et al. Genomic signatures of chromosomal instability and osteosarcoma progression detected by high resolution array CGH and interphase FISH. Cytogenet Genome Res 122, 5-15 (2008).
340. Selvarajah, S. et al. Identification of cryptic microaberrations in osteosarcoma by high-definition oligonucleotide array comparative genomic hybridization. Cancer Genet Cytogenet 179, 52-61 (2007).
341. Squire, J.A. et al. High-resolution mapping of amplifications and deletions in pediatric osteosarcoma by use of CGH analysis of cDNA microarrays. Genes Chromosomes Cancer 38, 215-25 (2003).
342. Stock, C., Kager, L., Fink, F.M., Gadner, H. & Ambros, P.F. Chromosomal regions involved in the pathogenesis of osteosarcomas. Genes Chromosomes Cancer 28, 329-36 (2000).
343. Tarkkanen, M. et al. DNA sequence copy number increase at 8q: a potential new prognostic marker in high-grade osteosarcoma. Int J Cancer 84, 114-21 (1999).
344. Tarkkanen, M. et al. Gains and losses of DNA sequences in osteosarcomas by comparative genomic hybridization. Cancer Res 55, 1334-8 (1995).
345. Zielenska, M. et al. Comparative genomic hybridization analysis identifies gains of 1p35 approximately p36 and chromosome 19 in osteosarcoma. Cancer Genet Cytogenet 130, 14-21 (2001).
346. Thomas, D.M. et al. Terminal osteoblast differentiation, mediated by runx2 and p27KIP1, is disrupted in osteosarcoma. J Cell Biol 167, 925-34 (2004).
347. Nathan, S.S. et al. Elevated expression of Runx2 as a key parameter in the etiology of osteosarcoma. Mol Biol Rep (2008).

348. Mosse, Y.P. et al. Identification of ALK as a major familial neuroblastoma predisposition gene. Nature (2008).
349. Sulong, S. et al. A comprehensive analysis of the CDKN2A gene in childhood acute lymphoblastic leukemia reveals genomic deletion, copy number neutral loss of heterozygosity, and association with specific cytogenetic subgroups. Blood 113, 100-7 (2009).
350. Kresse, S.H. et al. LSAMP, a novel candidate tumor suppressor gene in human osteosarcomas, identified by array comparative genomic hybridization. Genes Chromosomes Cancer 48, 679-693 (2009).
351. Yen, C.C. et al. Identification of chromosomal aberrations associated with disease progression and a novel 3q13.31 deletion involving LSAMP gene in osteosarcoma. Int J Oncol 35, 775-88 (2009).
352. Durbin, A.D. et al. JNK1 determines the oncogenic or tumor-suppressive activity of the integrin-linked kinase in human rhabdomyosarcoma. J Clin Invest 119, 1558-70 (2009).
353. Eddy, S.R. What is a hidden Markov model? Nat Biotechnol 22, 1315-6 (2004).
354. Schmittgen, T.D. & Livak, K.J. Analyzing real-time PCR data by the comparative C(T) method. Nat Protoc 3, 1101-8 (2008).
355. Diskin, S.J. et al. STAC: A method for testing the significance of DNA copy number aberrations across multiple array-CGH experiments. Genome Res 16, 1149-58 (2006).
356. Bailey, J.A. & Eichler, E.E. Primate segmental duplications: crucibles of evolution, diversity and disease. Nat Rev Genet 7, 552-64 (2006).
357. Chanda, B. et al. A novel mechanistic spectrum underlies glaucoma associated chromosome 6p25 copy number variation. Hum Mol Genet (2008).
358. Lupski, J.R. & Stankiewicz, P. Genomic disorders: molecular mechanisms for rearrangements and conveyed phenotypes. PLoS Genet 1, e49 (2005).
359. Chen, J. et al. The t(1;3) breakpoint-spanning genes LSAMP and NORE1 are involved in clear cell renal cell carcinomas. Cancer Cell 4, 405-13 (2003).
360. Ntougkos, E. et al. The IgLON family in epithelial ovarian cancer: expression profiles and clinicopathologic correlates. Clin Cancer Res 11, 5764-8 (2005).
361. Reed, J.E. et al. Expression of cellular adhesion molecule 'OPCML' is down-regulated in gliomas and other brain tumours. Neuropathol Appl Neurobiol 33, 77-85 (2007).
362. Ploner, A., Ploner, C., Lukasser, M., Niederegger, H. & Huttenhofer, A. Methodological obstacles in knocking down small noncoding RNAs. RNA 15, 1797-804 (2009).
363. Yamaguchi, T. et al. Allelotype analysis in osteosarcomas: frequent allele loss on 3q, 13q, 17p, and 18q. Cancer Res 52, 2419-23 (1992).
364. Shaikh, T.H. et al. High-resolution mapping and analysis of copy number variations in the human genome: a data resource for clinical and research applications. Genome Res 19, 1682-90 (2009).

365. Tsuchida, R. et al. Cisplatin treatment increases survival and expansion of a highly tumorigenic side-population fraction by upregulating VEGF/Flt1 autocrine signaling. Oncogene 27, 3923-34 (2008).
366. Altshuler, D. et al. A haplotype map of the human genome. Nature 437, 1299-320 (2005).
367. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661-78 (2007).
368. Tian, C. et al. Analysis and application of European genetic substructure using 300 K SNP information. PLoS Genet 4, e4 (2008).
369. Li, J.Z. et al. Worldwide human relationships inferred from genome-wide patterns of variation. Science 319, 1100-4 (2008).
370. Jakobsson, M. et al. Genotype, haplotype and copy-number variation in worldwide human populations. Nature 451, 998-1003 (2008).
371. Novembre, J. et al. Genes mirror geography within Europe. Nature 456, 98-101 (2008).
372. Xing, J. et al. Fine-scaled human genetic structure revealed by SNP microarrays. Genome Res 19, 815-25 (2009).
373. Reich, D., Thangaraj, K., Patterson, N., Price, A.L. & Singh, L. Reconstructing Indian population history. Nature 461, 489-94 (2009).
374. Abdulla, M.A. et al. Mapping human genetic diversity in Asia. Science 326, 1541-5 (2009).
375. Tishkoff, S.A. et al. The genetic structure and history of Africans and African Americans. Science 324, 1035-44 (2009).
376. Seldin, M.F. et al. European population substructure: clustering of northern and southern populations. PLoS Genet 2, e143 (2006).
377. Bauchet, M. et al. Measuring European population stratification with microarray genotype data. Am J Hum Genet 80, 948-56 (2007).
378. Price, A.L. et al. Discerning the ancestry of European Americans in genetic association studies. PLoS Genet 4, e236 (2008).
379. Guthery, S.L., Salisbury, B.A., Pungliya, M.S., Stephens, J.C. & Bamshad, M. The structure of common genetic variation in United States populations. Am J Hum Genet 81, 1221-31 (2007).
380. Silva-Zolezzi, I. et al. Analysis of genomic diversity in Mexican Mestizo populations to develop genomic medicine in Mexico. Proc Natl Acad Sci U S A 106, 8611-6 (2009).
381. Kelley, J.L., Madeoy, J., Calhoun, J.C., Swanson, W. & Akey, J.M. Genomic signatures of positive selection in humans and the limits of outlier approaches. Genome Res 16, 980-9 (2006).
382. Barreiro, L.B., Laval, G., Quach, H., Patin, E. & Quintana-Murci, L. Natural selection has driven population differentiation in modern humans. Nat Genet 40, 340-5 (2008).
383. Sabeti, P.C. et al. Positive natural selection in the human lineage. Science 312, 1614-20 (2006).

384. Lamason, R.L. et al. SLC24A5, a putative cation exchanger, affects pigmentation in zebrafish and humans. Science 310, 1782-6 (2005).
385. Simonson, T.S. et al. Genetic Evidence for High-Altitude Adaptation in Tibet. Science (2010).
386. Affymetrix. Use of Saliva gDNA for SNP Genotyping. Technical Note (http://media.affymetrix.com/support/technical/technotes/saliva_gDNA_genotyping.pdf) (2008).
387. Schaid, D.J., Batzler, A.J., Jenkins, G.D. & Hildebrandt, M.A. Exact tests of Hardy-Weinberg equilibrium and homogeneity of disequilibrium across strata. Am J Hum Genet 79, 1071-80 (2006).
388. Stouffer, S.A., Suchman, E.A., DeVinney, L.C., Star, S.A. & Williams, R.M.J. The American Soldier: Adjustment during Army Life, (Princeton University Press, Princeton, 1949).
389. Korn, J.M. et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 40, 1253-60 (2008).
390. Nei, M. & Tajima, F. DNA polymorphism detectable by restriction endonucleases. Genetics 97, 145-63 (1981).
391. Cai, J.J. PGEToolbox: A Matlab toolbox for population genetics and evolution. J Hered 99, 438-40 (2008).
392. Felsenstein, J. PHYLIP (Phylogeny Inference Package) version 3.6. (Distributed by the author. Department of Genome Sciences, University of Washington, Seattle., 2004).
393. Weir, B.S. & Cockerham, C.C. Estimating F-statistics for the analysis of population structure. Evolution 38, 1358-1370 (1984).
394. Alexander, D.H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res 19, 1655-64 (2009).
395. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81, 559-75 (2007).
396. Schuster, S.C. et al. Complete Khoisan and Bantu genomes from southern Africa. Nature 463, 943-7 (2010).
397. Wang, K. et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 17, 1665-74 (2007).
398. Conrad, D.F., Andrews, T.D., Carter, N.P., Hurles, M.E. & Pritchard, J.K. A high-resolution survey of deletion polymorphism in the human genome. Nat Genet 38, 75-81 (2006).
399. Hinds, D.A., Kloek, A.P., Jen, M., Chen, X. & Frazer, K.A. Common deletions and SNPs are in linkage disequilibrium in the human genome. Nat Genet 38, 82-5 (2006).
400. Witherspoon, D.J. et al. Human population genetic structure and diversity inferred from polymorphic L1(LINE-1) and Alu insertions. Hum Hered 62, 30-46 (2006).

401. Shriver, M.D. et al. Large-scale SNP analysis reveals clustered and continuous patterns of human genetic variation. Hum Genomics 2, 81-89 (2005).
402. Rosenberg, N.A. et al. Clines, clusters, and the effect of study design on the inference of human population structure. PLoS Genet 1, e70 (2005).
403. Handley, L.J., Manica, A., Goudet, J. & Balloux, F. Going the distance: human population genetics in a clinal world. Trends Genet 23, 432-9 (2007).
404. Ramachandran, S. et al. Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa. Proc Natl Acad Sci U S A 102, 15942-7 (2005).
405. Klein, R.G. Darwin and the recent African origin of modern humans. Proc Natl Acad Sci U S A 106, 16007-9 (2009).
406. Hoffecker, J.F. Out of Africa: modern human origins special feature: the spread of modern humans in Europe. Proc Natl Acad Sci U S A 106, 16040-5 (2009).
407. Wright, S. Evolution in Mendelian Populations. Genetics 16, 97-159 (1931).
408. Stringer, C.B., Grun, R., Schwarcz, H.P. & Goldberg, P. ESR dates for the hominid burial site of Es Skhul in Israel. Nature 338, 756-8 (1989).
409. Gutenkunst, R.N., Hernandez, R.D., Williamson, S.H. & Bustamante, C.D. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet 5, e1000695 (2009).
410. Green, R.E. et al. A draft sequence of the Neandertal genome. Science 328, 710-22 (2010).
411. Bamshad, M. et al. Genetic evidence on the origins of Indian caste populations. Genome Res 11, 994-1004 (2001).
412. Watkins, W.S. et al. Genetic variation in South Indian castes: evidence from Y-chromosome, mitochondrial, and autosomal polymorphisms. BMC Genet 9, 86 (2008).
413. Government of Nepal. Statistical Year Book of Nepal (ed. Statistics, C.B.o.) (Kathmandu, 2007).
414. Fornarino, S. et al. Mitochondrial and Y-chromosome diversity of the Tharus (Nepal): a reservoir of genetic variation. BMC Evol Biol 9, 154 (2009).
415. Gayden, T. et al. Genetic insights into the origins of Tibeto-Burman populations in the Himalayas. J Hum Genet 54, 216-23 (2009).
416. Gayden, T. et al. The Himalayas as a directional barrier to gene flow. Am J Hum Genet 80, 884-94 (2007).
417. Goebel, T., Waters, M.R. & O'Rourke, D.H. The late Pleistocene dispersal of modern humans in the Americas. Science 319, 1497-502 (2008).
418. O'Rourke, D.H. & Raff, J.A. The Human Genetic History of the Americas: The Final Frontier. Curr Biol 20, R202-R207 (2010).
419. Mulligan, C.J., Hunley, K., Cole, S. & Long, J.C. Population genetics, history, and health patterns in native americans. Annu Rev Genomics Hum Genet 5, 295-315 (2004).

420. Wang, S. et al. Genetic variation and population structure in native Americans. PLoS Genet 3, e185 (2007).
421. Mulligan, C.J., Kitchen, A. & Miyamoto, M.M. Updated three-stage model for the peopling of the Americas. PLoS One 3, e3199 (2008).
422. Fagundes, N.J. et al. Mitochondrial population genomics supports a single pre-Clovis origin with a coastal route for the peopling of the Americas. Am J Hum Genet 82, 583-92 (2008).
423. Perego, U.A. et al. Distinctive Paleo-Indian migration routes from Beringia marked by two rare mtDNA haplogroups. Curr Biol 19, 1-8 (2009).
424. Ray, N. et al. A statistical evaluation of models for the initial settlement of the american continent emphasizes the importance of gene flow with Asia. Mol Biol Evol 27, 337-45 (2010).
425. Kolman, C.J., Sambuughin, N. & Bermingham, E. Mitochondrial DNA analysis of Mongolian populations and implications for the origin of New World founders. Genetics 142, 1321-34 (1996).
426. Merriwether, D.A., Rothhammer, F. & Ferrell, R.E. Distribution of the four founding lineage haplotypes in Native Americans suggests a single wave of migration for the New World. Am J Phys Anthropol 98, 411-30 (1995).
427. Watkins, W.S. et al. Genetic variation among world populations: inferences from 100 Alu insertion polymorphisms. Genome Res 13, 1607-18 (2003).
428. Jorde, L.B. et al. Microsatellite diversity and the demographic history of modern humans. Proc Natl Acad Sci U S A 94, 3100-3103 (1997).
429. Jorde, L.B. et al. The distribution of human genetic diversity: a comparison of mitochondrial, autosomal, and Y chromosome data. Am J Hum Genet 66, 979-988 (2000).


Copyright Acknowledgements