
ANALYSIS OF GENOMIC VARIANTS FOR INVESTIGATING THE GENETIC ETIOLOGY OF DISEASE

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF BIOMEDICAL INFORMATICS AND THE COMMITTEE ON GRADUATE STUDIES OF IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Daniel Edmund Newburger March 2015

© 2015 by Daniel Edmund Newburger. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/kh271wr8164

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Serafim Batzoglou, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Jonathan Pritchard

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Arend Sidow

Approved for the Stanford University Committee on Graduate Studies. Patricia J. Gumport, Vice Provost for Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

Abstract

The study of genomic variation within human populations is critical for elucidating the genetic factors that contribute to disease. Identifying and characterizing the genetic architecture of disease advances clinical care by facilitating the development of novel diagnostic tools, the identification of new therapeutic targets, and the practice of personalized treatment for genetic syndromes. The massive volume of genetic data generated by modern genotyping technologies, combined with the informatics challenges of filtering and interpreting these noisy measurements, represents a significant obstacle to genomic research. These technical issues necessitate the development of computationally efficient methodologies that leverage raw genotype data for the comparative genomic analysis of complex phenotypes across human subpopulations. In this dissertation, I describe my contributions towards the biomedical study of genetic syndromes using high-throughput genotyping technologies. First, I discuss methods for studying the genome evolution of pre-malignant lesions during progression to breast cancer. Second, I describe algorithms for performing highly accurate variant validation in genomic studies using next generation sequencing. Finally, I present methods for identifying novel disease susceptibility loci in complex diseases using identity by descent mapping in large case-control cohorts.

Acknowledgements

I would like to thank the truly extraordinary mentors, collaborators, and friends who have supported me through both the good times and the terrifying doldrums of graduate school. I simply cannot thank you enough for your patience, wisdom, and friendship. Foremost, I would like to thank my thesis advisor, Serafim. I remain in awe of your ability to deconvolute the most tangled analytical problems into solvable components. You found elegant paths through so many technical obstacles in my research and have always been a wellspring of novel ideas. I am even more grateful to you for your unwavering encouragement and patience. You gave me the freedom to explore far afield, and I feel privileged to be part of your group. I am deeply grateful to Arend Sidow for his guidance, mentorship, and leadership. Arend, you brought vision and scientific rigor to every meeting, and you always managed to make time in your schedule to share your expertise. You taught me how to examine complex problems down to the finest detail, and your forthright advice and criticisms have been invaluable. You are one of the few people who will say what you really think, and yet you are always optimistic and generous in your feedback. I am also indebted to the other members of my reading and orals committees: Rob West, Jonathan Pritchard, and Gavin Sherlock. Rob, your boundless knowledge of cancer genetics and histomorphology drove our cancer projects forward, and I am grateful for all of the time you spent tutoring me in the field. Jonathan, although we didn’t meet until the end of my graduate career, your advice has been penetrating and insightful. Gavin, thank you so much for chairing my defense committee and for your keen questions and suggestions.

My thesis would not have been possible without several other mentors. I would like to thank Hanlee Ji and Sivan Bercovici for their incredible generosity. Hanlee, you coached me through the first years of my PhD with wisdom, precision, and humor. Your relentless pursuit of scientific innovation and your mastery of genetics, oncology, and biotechnology continue to inspire me. Sivan, thanks are entirely inadequate to express my gratitude for your patience, your guidance, and the surfeit of brilliant ideas you contributed to our joint projects. Our meetings have been some of the funniest and most productive moments in my graduate work.

I would also like to thank Atul Butte, who first introduced me to as an undergrad, and whose encouragement and counsel propelled me through the first few years of graduate school. I am deeply indebted to my academic advisor, . Russ, your clairvoyant advice during our biannual meetings proved pivotal over and over again, and I can’t thank you enough for ensuring that my meandering thesis evolved into a BMI dissertation. Likewise, I am deeply indebted to my colleague Alex Morgan, who has been exceptionally generous as a mentor. Whether proofreading my fellowship applications in first year or talking me through tough decisions in sixth year, you have always provided singularly thoughtful advice and gone far out of your way to render assistance. Without your help, I would still be floundering in my studies.

It has been a joy to be a member of the BMI program. Mary Jeanne, thank you so much for steering me through the tortuous process of navigating graduate school. We in BMI are incredibly lucky to have you at the helm of the BMI program, keeping us from running aground on rocky shores. I would like to thank all the other amazing people who have kept BMI afloat: Nancy Lennartsson, Steve Bagley, John DiMario, Betty Cheng, Larry Fagan, Carol Maxwell, and of course Darlene Vian.
I would also like to thank my staunch compatriots in BMI, especially fellow classmates Linda Liu, Nick Tatonetti, and Rob Bruggner.

The Batzoglou lab has fostered some of the most amazing folks at Stanford, and I feel incredibly fortunate to call them friends and colleagues. I would especially like to thank Sarah Aerni, Marc Schaub, Tom Do, Sam Gross, Jesse Rodriguez, Sofia Kyriazopoulou-Panagiotopoulou, Anshul Kundaje, Lin Huang, Alex Bishara, and Yuling Liu. Marc and Sarah, your friendship and advice meant so much to me as I struggled to orient myself in the lab, and I deeply appreciate your generosity as mentors. Jesse, working with you and learning from you has been a blast. Alex, I still have not watched Clerks.

I have been privileged to work with incredible collaborators from outside the lab, as well. I would like to thank Georges Natsoulis, John Bell, Sue Grimes, Patrick Flaherty, Sarah Garcia, Ziming Weng, Noah Spies, Alayne Brunner, Robert Sweeney, and Marina Sirota. Patrick, you inspired me with your commitment to scientific excellence and taught me how to evaluate my projects and research goals. John, I greatly enjoyed kvetching and swapping books during our much-needed coffee breaks.

I would like to give special thanks to a few friends without whom I would never have completed my graduate studies. I am incredibly grateful to Dorna Kashef-Haghighi, whose brilliance and hard work made our joint projects in the Batzoglou lab possible, and whose friendship made it fun. Working alongside you was the highlight of graduate school. Tiffany Chen and Tim Lee, I am exceptionally fortunate to be friends with you. You have been my most trusted confidants in matters ranging from research priorities to hunting for good eats in Cupertino. Tiffany, your insight and wisdom regarding matters of both research and career have been invaluable. Tim, your humor, consideration, and scientific advice have kept me sane during times of stress and failure. I hope you get another twelve-win arena run soon.

My family has been an inexhaustible source of love and support. Mom, you have always set the highest bar for hard work and dedication to research. I would never have made it through graduate school without your encouragement and, when necessary, admonishments. Dad, you first got me interested in science, and your sage and pragmatic advice has always helped me tackle questions of research and career. Maggie, I can always look to you for both encouragement and commiseration.

Finally, completing graduate school would have been inconceivable without the love and support of my wife, Melody. Mel, whether proofreading my papers at midnight, fixing my slides, celebrating victories, or providing consolation, you were always there for me, making life better than better; you make life great.

Contents

Abstract

Acknowledgements

1 Introduction

2 Background
  2.1 The genome and disease
    2.1.1 Genomic variation
    2.1.2 Technologies for genomic studies
    2.1.3 Cancer sequencing

3 Genome Evolution in Breast Cancer
  3.1 Abstract
  3.2 Introduction
  3.3 Results
    3.3.1 Whole-genome sequencing of early neoplasias and related carcinomas from archival material
    3.3.2 Somatic SNVs fall into a limited and highly structured set of classes
    3.3.3 Allele frequencies of somatic SNVs support common ancestral relationships
    3.3.4 Mutated neoplasias are evolutionarily related to carcinomas
    3.3.5 Point-mutational mechanisms are evolutionarily stable and reproducible among cases
    3.3.6 Aneuploidies are the dominant evolutionary feature of progression
  3.4 Discussion
  3.5 Methods
    3.5.1 Identification and processing of neoplasias
    3.5.2 Library construction and sequencing
    3.5.3 Read mapping and BAM file processing
    3.5.4 Multisample SNV Calling
    3.5.5 Determination of somatic SNV class patterns and of robust sharing classes
    3.5.6 PCR-based validation of SNVs and accuracy assessment of whole-genome calls
    3.5.7 Aneuploidy and tumor purity
    3.5.8 SNV mutation spectra
    3.5.9 Tree inference
    3.5.10 Ordering SNVs vs. Chromosome 1q ploidy gain in ancestral branches
  3.6 Acknowledgements

4 Pipeline technologies for validating genomic variants
  4.1 Abstract
  4.2 Introduction
  4.3 Materials and Methods
    4.3.1 Capture oligonucleotide sequence generation
    4.3.2 Quality control annotation for capture oligonucleotides
    4.3.3 Database construction
  4.4 Results
    4.4.1 Coverage of the human genome
    4.4.2 Capture oligonucleotide human genome mapping
    4.4.3 Interface for the Human OligoGenome
  4.5 Discussion
  4.6 Acknowledgements

5 IBD Mapping in Large Disease Cohorts
  5.1 Abstract
  5.2 Introduction
  5.3 Results
    5.3.1 IBD detection pipeline development
    5.3.2 Integration of IBD mapping tools
    5.3.3 IBD mapping of a multiple sclerosis cohort
  5.4 Discussion

6 Future Directions

List of Tables

3.1 Variant call statistics
4.1 Whole genome summary statistics for capture oligonucleotide coverage
4.2 Coding region summary statistics for capture oligonucleotide coverage

List of Figures

2.1 DNA sequencing pipeline
3.1 Lineage tree and alternate allele frequencies
3.2 Mutation spectra and rates of somatic SNVs
3.3 Lesser allele fraction plot of Patient 6
3.4 Aneuploidy summary for all patient samples
3.5 Genome evolutions of all patients
4.1 Schema for target specific capture via selective circularization
4.2 In silico genome coverage by restriction enzyme
4.3 Overview of the OligoGenome website
5.1 GWAS and IBD mapping association test methods
5.2 Identity by descent between distant relatives
5.3 IBD detection pipeline
5.4 IBD relationships for MS cohort at rs498422
5.5 IBD detection in the HapMap3 CEU cohort

Chapter 1

Introduction

The human genome is the chemical template that orchestrates the growth and development of a human organism. Nearly every cell in the human body inherits a copy of this DNA code, which serves as the fundamental instruction set for the molecular, cellular, and tissue-level biochemical activities that define life. In the 21st century, the study of human genetics has finally come into its own. In June 2000, President Clinton announced the completion of the first draft of the human genome [65], and Dr. Francis Collins, director of the International Human Genome Project, proclaimed, "What more powerful form of study of mankind could there be than to read our own instruction book?" Without the expert knowledge to decode and annotate the human genome, however, this draft represents nothing more than a three billion dollar puzzle [81]. Translating the human genome into meaningful statements about human development and health therefore represents one of the foremost scientific challenges of the 21st century.

The human reference genome has facilitated the genome-wide study of genetic variation among human populations, which represents a critical area of biomedical research. Although the genomes of any two people are more than 99.9% identical, the remaining differences and their complex interplay with environmental factors define an individual’s characteristics [110]. Specifically, genetic differences among patients or between healthy and diseased individuals elucidate the etiology of disease and offer novel insights into clinical care. A better understanding of the biomolecular foundations of genetic syndromes promotes clinical advances by identifying new biomarkers and therapeutic targets. At the level of personalized medicine, genomic studies promote the development of diagnostic tests and treatment protocols that integrate information about an individual’s unique genomic code to achieve better patient outcomes [70]. In the past decade, genomic tools for the study of Mendelian and complex disease have therefore led to clinically translatable findings that have improved patient care across a myriad of medical conditions [70].

This dissertation explores the study of human genetic variation at both the individual and population level. Chapter 2 introduces background concepts and terminology for the projects discussed in subsequent chapters. Chapter 3 presents the study of breast cancer genome evolution during progression from premalignant lesions to invasive cancer. Chapter 4 introduces technologies and resources for high-depth DNA sequencing and validation of genetic variation within a given subject’s genome. Chapter 5 presents the design and construction of a software pipeline for discovering disease susceptibility loci using population-level association testing methods. Finally, Chapter 6 provides concluding remarks and explores some of the exciting research avenues that will be unlocked by emerging sequencing technologies.

Chapter 2

Background

2.1 The genome and disease

The term genomics describes the study of genetics through the investigation of the structure and function of DNA. Unlike traditional genetics, genomics examines DNA at the level of whole organisms rather than at the level of single genes or functional domains. By leveraging technologies that identify DNA base pair identity across the span of a given genome and by using bioinformatics methods to analyze the resulting sequence data, the field of genomics allows researchers to directly study the nucleotide code that directs cellular function. By identifying and evaluating differences among human genomes, genomics provides a methodology for understanding the functional impact of DNA sequence on human health and disease.

In this context, the completion of a first draft of the human reference genome in 2000 represented a landmark in the study of human disease [65]. The human reference genome is a consensus sequence for the three billion base pairs of the human genome (in other words, an enormous, linear string of As, Gs, Ts, and Cs that represent the nucleotides found in the majority of human chromosomes and their order within these DNA molecules). Over the past fifteen years, use of the reference genome has enabled a host of biomedical applications that directly inform the study of human health, both in terms of helping us better understand the genetic underpinnings of disease and in terms of clinical applications, such as novel diagnostic tools and personalized patient care [20, 34, 70]. Applications include the study of cancer, as further discussed below, as well as other diseases with genetic heritability.

In the sections below, I provide the reader with the biological and technological terminology needed to read the research chapters of my thesis. I will first define the classes of genomic variation upon which I built methods to perform disease inference. Then, I will provide an overview of several key technologies that facilitate genomics investigation. I will then briefly discuss the study of cancer in the context of genomics.

The present boom in genomic technologies and computational methods for genomic analysis makes this a very exciting time for research. As we learn to better identify and characterize the genetic architecture of disease, we can advance clinical care by facilitating the development of novel diagnostic tools, the identification of new therapeutic targets, and the practice of personalized treatment for genetic syndromes [34, 70]. In the near future, the genome sequencer may take its place beside ultrasound and MRI machines as an important and flexible medical tool for routine patient care.

2.1.1 Genomic variation

The term genomic variation refers to differences in genome sequence between two or more individuals or groups of individuals. The manner in which genomic variation is quantified therefore depends on experimental context. For example, the term could refer to genomic differences between two individuals within a given population (e.g. two Eastern Europeans), or it could refer to genomic differences between two more distantly related groups (e.g. Tibetan wolves and timber wolves). Effectively reporting and aggregating human genomic variation in genomic studies therefore requires the establishment of a standard against which to compare individuals.

In the field of DNA sequencing and genotyping, the establishment of a reference genome for a given organism provides a common template against which to define genetic differences. The Genome Reference Consortium (GRC) therefore created the human reference genome to represent a consensus DNA sequence for the human genome, as mentioned in the introductory remarks [20, 21]. Once identified, the genomic differences between an individual and the reference can be annotated, compressed, and efficiently stored for future applications [18, 19, 26]. It is important to remember that in the context of human genomics, a mutation does not necessarily represent a functional change to DNA. Instead, it represents a departure from the established consensus sequence, which, by virtue of being a linear string, cannot encompass the full range of healthy human genotypes. Furthermore, a DNA mutation may represent the most common allele in a given population.

Mutations can be separated into two broad categories based upon genomic size: small-scale mutations, such as single nucleotide variants (SNVs) and short insertion or deletion events (microindels), and large-scale mutations, such as translocations, inversions, large insertion or deletion events (indels), and copy number variants [27, 72]. Small-scale mutations can cause genetic disorders by disrupting gene regulation or altering protein-coding sequence, whereas large-scale mutations are commonly responsible for changes to genomic structure that have multiple phenotypic effects [6, 35]. The sections below will focus on single nucleotide variants and copy number variants. Because these mutation types are disease-informative and can be accurately identified, they served as the focus of my thesis work for both assay development and disease investigation.
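The idea of describing an individual genome as a set of differences from a reference string can be illustrated with a toy comparison. The sketch below is only an illustration under strong assumptions: both sequences are short, already aligned, and of equal length, so only single-base mismatches are reported. The function name and the example sequences are hypothetical and do not come from the thesis.

```python
def snv_differences(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for every mismatch.

    Toy assumption: the two sequences are pre-aligned and equally long, so
    insertions, deletions, and structural changes are not considered.
    """
    if len(reference) != len(sample):
        raise ValueError("this toy comparison assumes equal-length, pre-aligned sequences")
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != sample_base
    ]


# Hypothetical ten-base reference and sample sequences.
reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"
print(snv_differences(reference, sample))
```

Storing only the mismatching positions, rather than the full sample sequence, is the intuition behind the compact difference-based storage mentioned above.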

Single Nucleotide Variants (SNVs)

A single nucleotide variant is defined as a single base in a given individual’s genome that differs from the corresponding base in the reference genome. A given human genome has upwards of three million SNVs, making them the most frequently occurring type of genomic variant [2, 86]. For an SNV, the term alternate allele¹ refers to the identity of the base that differs from the reference. The alternate allele frequency (AAF) is defined as the frequency of that base across a given population. For example, if a specific individual has a T base at a position where the human reference genome has an A base, T represents the alternate allele for the SNV at that position. If five out of fifty people have one copy of the alternate allele, it has an alternate allele frequency of 0.05 (the human genome is diploid, so each individual carries one allele on each chromosome copy, and five alternate alleles are observed among one hundred chromosomes).

In population genetics, the term single nucleotide polymorphism (SNP) describes a single nucleotide variant that occurs above a specified frequency (often 1%) in a given population [2, 105]. The terms SNV and SNP are used interchangeably in some contexts. Conversely, rare variants are designated as mutations that occur below a specified frequency in the population.

¹ The term minor allele can often be used interchangeably with the term alternate allele. However, the human reference genome is not guaranteed to have the minor allele for a given population or set of populations, so the two terms are not synonymous. Additionally, alternate allele frequency can specifically refer to read counts in DNA sequencing experiments.
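The worked example in this section (five heterozygous carriers among fifty diploid individuals) can be checked with a short calculation. This is a minimal sketch, assuming each individual is summarized by the number of alternate allele copies they carry (0, 1, or 2); the function name and the 1% SNP cutoff variable are illustrative choices, with the cutoff taken from the "often 1%" convention mentioned in the text.

```python
def alternate_allele_frequency(alt_copies_per_person: list[int], ploidy: int = 2) -> float:
    """Fraction of all chromosome copies in the cohort that carry the alternate allele."""
    total_chromosomes = ploidy * len(alt_copies_per_person)
    if total_chromosomes == 0:
        raise ValueError("cohort is empty")
    return sum(alt_copies_per_person) / total_chromosomes


# Five people carry one alternate copy each; the other forty-five carry none.
cohort = [1] * 5 + [0] * 45
aaf = alternate_allele_frequency(cohort)
print(aaf)

# Illustrative SNP cutoff (assumption: the commonly used 1% threshold).
SNP_FREQUENCY_CUTOFF = 0.01
print("SNP" if aaf >= SNP_FREQUENCY_CUTOFF else "rare variant")
```

Dividing by the number of chromosomes rather than the number of people is what yields 5/100 instead of 5/50.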

Copy Number Variants (CNVs)

The term copy number variant (CNV) describes the biological phenomenon in which a contiguous genomic region is either duplicated or lost in an individual’s genome. Although these events occur less frequently than SNVs, they may encompass large areas of up to many megabases in length [35]. CNVs therefore have the potential to impact more total genomic space, per nucleotide, than any other type of genomic event [6]. Because these events have a substantial effect on genomic structure and transcript abundance, they can cause significant phenotypic change. The term aneuploidy refers to copy number variants that encompass the entirety of a chromosome. Common aneuploid conditions include trisomy 21, in which three copies of chromosome 21 cause the genetic disorder Down syndrome, and conditions such as Klinefelter syndrome and Turner syndrome, in which an individual inherits an abnormal number of sex chromosome copies.

2.1.2 Technologies for genomic studies

Accurately identifying the SNVs and CNVs present in a given individual’s genome requires precision instrumentation and subsequent computational analysis. Several classes of technologies exist for performing genome-wide mutation detection. This section first introduces DNA sequencing technologies, which provide genomic information at base-pair resolution. These methods have several advantages in terms of precision and flexibility, but they require substantial investment cost and demand computationally intensive post-processing steps. This section then introduces microarray-based genotyping technologies, also known as SNP arrays, which perform genotyping and copy number detection with high accuracy and at lower cost. SNP arrays suffer from significant limitations, however. Chip-based technologies can only yield genomic information at preselected locations and can only detect certain classes of mutations. These tradeoffs make chip-based methods highly desirable for certain applications, such as disease association studies, but they cannot support many other study types, such as novel variant discovery or genome assembly. For these reasons, DNA sequencing technologies have begun to eclipse the older chip-based genotyping methods as sequencing prices continue to drop.

DNA sequencing

The field of DNA sequencing has grown exponentially over the past decade and encompasses a wide range of different methodologies for both sample preparation and nucleotide base calling [12, 55, 73, 91]. The goal of DNA sequencing is to determine the nucleotide identity and base order of genetic material from a given source, and next generation sequencers can yield gigabases or even terabases of sequence per run.

In a typical sequencing protocol for a human genome, DNA is first extracted from a blood or tissue sample from an individual. High throughput sequencing technologies cannot process DNA molecules longer than several hundred base pairs in length, so sample preparation protocols then use methods such as sonication or enzymatic shearing to cut DNA molecules into small fragments. Following protocol-specific DNA modification, these fragments are run through a DNA sequencer to determine their sequence of nucleotides [73]. The informatic representations of these sequences are referred to as reads, which generally include a two-bit base call and an estimate of the likelihood of error for each base.

Obtaining mutation calls from a sequencer’s reads requires several computationally challenging steps. Figure 2.1 represents a standard pipeline for variant calling that matches sequenced fragments to their place of origin in the genome, cleans computational and biochemical artifacts, and then applies statistical methods to infer the presence of mutations [28, 71]. The first process, called mapping or read alignment, uses highly efficient string matching algorithms to identify the most likely genomic location of origin for each sequence along the human reference genome. Because a given human genome is roughly 99.9% identical to the human reference genome, inexact string matching can rapidly and accurately place the vast majority of reads onto this genomic map [58, 61]. Additional computational steps, such as duplicate removal, local realignment, and base recalibration, correct for errors and bias introduced during the preceding sample preparation, DNA sequencing, and alignment steps [71]. The final procedure, variant calling, predicts the location and identity of mutations in the genome of interest [28]. A host of different statistical tools exist for inferring SNVs, CNVs, and other mutations based on discrepancies between the expected reference genome sequence and the aligned reads. To illustrate the problem, if at position Z the reference genome reads A but the sequencing data show 50 As and 50 Cs, an SNV caller will infer that the sequenced individual has a heterozygous SNV at that locus.

DNA sequencing has an enormous number of applications, ranging from human population genetics to metagenomics, in which diverse microbial populations are sequenced. Chapter 4 of my thesis discusses the use of next generation sequencing to validate mutation calls in genomic studies by developing a highly specific capture technology that extracts and amplifies the DNA present at a small number of chosen loci [80]. Chapter 3 uses whole genome sequencing to investigate the evolution of cancer genomes during disease progression [79]. With the recent achievement of $1,000 human genome sequencing, the application space for sequencing continues to grow. Regular cancer screening by sequencing and clinical evaluation of the human microbiome in gastroenterology represent just a few of the potential clinical applications that may soon be established as standards of care.
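The 50 As and 50 Cs illustration from the variant calling discussion can be turned into a deliberately simple genotype rule. The sketch below is not how production SNV callers work (they use probabilistic models that account for base and mapping quality); it assumes only a string of base calls observed at one locus and a hypothetical fixed fraction threshold, `min_fraction`, below which a base is treated as sequencing noise.

```python
from collections import Counter


def naive_genotype(ref_base: str, observed_bases: str, min_fraction: float = 0.2) -> str:
    """Classify one locus from the bases seen in reads aligned over it.

    Assumption: any base seen in at least `min_fraction` of reads is a real
    allele, and anything rarer is noise. The threshold is illustrative only.
    """
    counts = Counter(observed_bases)
    depth = sum(counts.values())
    if depth == 0:
        return "no-call"
    supported = sorted(base for base, n in counts.items() if n / depth >= min_fraction)
    if supported == [ref_base]:
        return "homozygous reference"
    if len(supported) == 1:
        return f"homozygous alternate ({supported[0]})"
    if len(supported) == 2 and ref_base in supported:
        alt_base = next(base for base in supported if base != ref_base)
        return f"heterozygous SNV ({ref_base}/{alt_base})"
    return "ambiguous"


# The example from the text: reference base A, reads showing 50 As and 50 Cs.
print(naive_genotype("A", "A" * 50 + "C" * 50))
```

A fixed threshold like this breaks down at low read depth or when errors are systematic, which is one reason the cleanup steps described above precede variant calling.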

SNP chips

DNA microarrays came to prominence in the early 2000s as high-throughput and cost-effective technologies for interrogating many genomic loci in parallel [48, 105]. With modern, high-density microarrays, researchers can design SNP chips that perform genotyping at more than a million loci. SNP chip technologies come with many caveats, however. They usually perform only SNP and CNV detection and have limited resolution when describing CNV boundaries. Furthermore, the SNP targets and their base identities must be designated at the time of chip manufacture based upon reference knowledge, making exploratory analysis of novel organisms or de novo mutation detection impossible [48]. Additionally, the information cannot be used for applications such as assembling genomes or for detecting novel splice isoforms.

Figure 2.1: This flow chart represents a standard DNA sequencing pipeline for reference-based variant calling in a previously characterized organism.

SNP chips excel at tasks that require a large volume of sparse, high-accuracy genotype data. Disease association studies represent one common application, in which statistical techniques attribute observed phenotypic effects to genomic loci by leveraging genetic information from large cohorts of genotyped individuals [15, 47]. Chapter 5 describes one such study methodology and the tools required for successfully uncovering risk loci for a complex disease. Complex diseases, in which multiple genetic and environmental factors contribute to phenotype, are currently difficult to study using sequencing technologies due to cost [119].

2.1.3 Cancer sequencing

Cancer describes a highly heterogeneous class of diseases characterized by malignant cell growth. Unlike healthy cells, neoplastic cells exhibit uncontrolled growth that remains untempered by the intracellular and extracellular mechanisms that regulate normal cell division behaviors [45, 46]. The effects of unchecked proliferation prove damaging to neighboring cells and tissues, and cancerous cells may ultimately gain the ability to metastasize, whereby they can invade parts of the body distant from the cancer’s origin. Cancer is unquestionably one of the greatest public health concerns of the current century, causing more than 14% of deaths worldwide in 2014 [101].

Cancer genome sequencing merits special discussion due to the extraordinary clinical opportunities it promises and its unique challenges [31, 74, 93]. Cancer is fundamentally a disease of the genome, reprogramming cellular machinery using the very template by which that machinery is controlled [45]. DNA sequencing therefore provides a way to identify the exact genomic changes that drive tumor progression. In the clinic, sequencing offers a powerful method for detecting, diagnosing, and characterizing , as well as the ability to personalize treatment via identification of druggable targets or classification of tumors into prognostic subtypes. Finally, cancer sequencing research promises to improve our knowledge of cancer-specific biomarkers and therapeutic targets [93].

The sections below introduce terminology and concepts that help explain the challenges of cancer genome sequencing. These topics are central to the research in Chapter 3.

Somatic mutation calling

A given human genome differs from the reference genome at millions of SNV positions. A cancer genome, however, rarely differs from the healthy genome of its host by more than a few thousand SNV positions. A small number of these mutations, known as driver mutations, can be sufficient to start a healthy cell down the path towards malignancy [45, 93]. Cancer-specific CNVs, often called copy number aberrations (CNAs), also play a large role in many cancer types. CNAs become prevalent in certain cancer genomes as DNA repair and apoptotic mechanisms degrade, resulting in aneuploidy, large copy gains, and loss of heterozygosity, in which a section of one chromosome copy is completely lost [31, 43]. Identifying these cancer-specific mutations, known as somatic mutations in the context of cancer genomics, can serve as the basis for characterizing tumor identity and as a tool for studying cancer genetics.

In order to distinguish somatic mutations from germline mutations, most studies sequence both tumor and germline (also called normal) samples [74]. The simplest method for somatic SNV detection involves independent SNV calling in the tumor sample and the normal sample, followed by subtraction of normal variants from the tumor variant list. Methods that consider tumor and normal sequence simultaneously achieve significant accuracy gains by leveraging the joint information in probabilistic models. Chapter 3 discusses methods that leverage additional sequence data from multiple, related cancer lesions to produce highly accurate somatic SNV calls. Similarly, sophisticated methods for CNA detection jointly assess factors such as relative sequence depth to call germline and somatic CNVs simultaneously. Raphael et al. [93] and Ding et al. [31] provide excellent reviews of somatic variant calling.
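The subtraction approach described above can be illustrated with a minimal Python sketch. It assumes that variant calls have already been reduced to (chromosome, position, reference, alternate) records; the `Variant` type, the function name, and the example coordinates are hypothetical and are not taken from any particular caller.

```python
from typing import NamedTuple, Set


class Variant(NamedTuple):
    """A single-nucleotide variant keyed by position and alleles (illustrative type)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def subtract_normal(tumor_calls: Set[Variant], normal_calls: Set[Variant]) -> Set[Variant]:
    """Return variants called in the tumor sample but not in the matched normal."""
    return tumor_calls - normal_calls


# Hypothetical call sets: one germline variant seen in both samples,
# one variant seen only in the tumor.
tumor = {Variant("chr1", 1000, "A", "G"), Variant("chr2", 2500, "C", "T")}
normal = {Variant("chr1", 1000, "A", "G")}

somatic_candidates = subtract_normal(tumor, normal)
print(sorted(somatic_candidates))
```

Because each sample is called independently, a germline variant that is missed in the normal sample would survive the subtraction as a false somatic candidate, which is one reason joint models tend to perform better.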

Cancer heterogeneity

Cancer heterogeneity further increases the complexity of somatic variant detection. A given tumor contains a diverse collection of co-evolving cells. Although descended from the same progenitor cell, the genomes of cancer cells in clonal subpopulations can diverge rapidly over the course of many divisions [79, 83, 93]. Somatic mutations therefore appear at different allele frequencies depending on their time of origin during cancer progression and the fitness of the associated clonal subgroups. Whereas germline variation detection methods can assume that a mutation is present in at least one chromosome copy of a given cell, a somatic mutation might only be present in a small fraction of the sampled tumor cells. This condition leads to false negative mutation calls due to a dearth of evidence supporting the mutation’s presence. Methods that model cancer heterogeneity often produce false positive calls as well, since a less stringent caller can more easily mistake base calling errors for somatic mutations. Resected tumor samples usually consist of a significant fraction of healthy germline cells, which further dilutes mutational signatures. The presence of germline cells in cancer samples is often referred to as normal contamination.

Cancer heterogeneity represents one of the primary hurdles for achieving accurate cancer prognosis and effective treatment. Clonal subpopulations may exhibit different behaviors in terms of their potential for growth and resistance to chemotherapeutic treatment. A small but highly aggressive group of cells may cause tumor growth to accelerate suddenly. Similarly, if a small tumor subpopulation survives a chemotherapeutic regimen targeted at the otherwise susceptible cellular majority, this population can cause a recurrence as it expands in the absence of other neoplastic competitors [40]. Using sequencing to identify druggable targets present in all tumor cells and to select successful treatment regimens promises to be a vital clinical application in the coming decade.

Chapter 3

Genome Evolution in Breast Cancer

3.1 Abstract

Cancer evolution involves cycles of genomic damage, epigenetic deregulation, and increased cellular proliferation that eventually culminate in the carcinoma phenotype. Early neoplasias, which are often found concurrently with carcinomas and are histologically distinguishable from normal breast tissue, are less advanced in phenotype than carcinomas and are thought to represent precursor stages. To elucidate their role in cancer evolution we performed comparative whole genome sequencing of early neoplasias, matched normal tissue, and carcinomas from six patients, for a total of 31 samples. By using somatic mutations as lineage markers we built trees that relate the tissue samples within each patient. On the basis of these lineage trees we inferred the order, timing, and rates of genomic events. In four out of six cases, an early neoplasia and the carcinoma share a mutated common ancestor with recurring aneuploidies, and in all six cases evolution accelerated in the carcinoma lineage. Transition spectra of somatic mutations are stable and consistent across cases, suggesting that accumulation of somatic mutations is a result of increased ancestral cell division rather than specific mutational mechanisms. In contrast to highly advanced tumors that are the focus of much of current cancer genome sequencing, neither the early neoplasia genomes nor the carcinomas are enriched with potentially functional somatic point mutations. Aneuploidies that occur in common ancestors of neoplastic and tumor cells are the earliest events that affect a large number of genes, and may predispose breast tissue to eventual development of invasive carcinoma.

3.2 Introduction

The cells of a multicellular organism are related to one another by a bifurcating lineage tree whose root is the zygote. DNA replication, chromosome segregation, and cell division during development from the zygote to the adult introduce point mutations and other DNA changes into the genome, which persist in the descendants of the cells in which they occurred. Germ line point mutations occur at a rate of approximately one per diploid genome per cell division [56], but the rate of somatic changes is less well understood, and is likely to vary by tissue type. Large-scale genomic changes such as aneuploidies are generally thought to be extremely rare in normal tissue.

Cancers, in contrast to normal tissue, accumulate much larger numbers of genomic changes, as illustrated by genome sequencing of late-stage tumors [4, 10, 17, 60, 82, 83, 87, 102, 103]. Solid tumors are highly mutated by several mechanisms such as point mutations, copy-number variations, and chromothripsis [9, 23, 41, 59, 63, 66, 75, 100]; relapses or metastases exhibit further mutational evolution [29, 30, 69, 78, 108, 112, 116, 117]. The state of an individual advanced cancer genome sheds little light on the order of genomic changes, however, except in analyses of subclone evolution [82, 97]. In an advanced tumor, the earliest driver changes that had predisposed ancestral cells to eventual carcinoma development are confounded with later changes. As a consequence, our understanding of tumor evolution is still in its infancy.

The historically proven approach to understanding evolution is comparative analysis of extant species, whose power was greatly increased by whole genome sequencing in recent years. Analogous to species comparisons, which are based on evolutionary (bifurcating) lineage trees, comparisons of somatic genomes from a single individual could in principle shed light on somatic evolution, but in normal tissue the number of mutations is low.
However, given the large number of genomic changes during tumor evolution, it may be possible to dissect the evolutionary history of a cancer by comparing its genome to clinically recognized precursor lesions. In this context, breast cancers provide a proof-of-principle opportunity, due to their frequent association with early neoplastic lesions that are readily identified by morphology [1, 11, 64, 99] and whose genomes may provide windows into the earliest stages of tumor evolution.

Using whole genome sequencing of histologically characterized archival (formalin-fixed, paraffin-embedded) samples, we determine lineage relationships of early neoplasias with carcinomas, quantify mutational load and mutation spectra during progression from normal tissue to neoplasia to carcinoma, and find the earliest detectable mutations and aneuploidies in cell lineages ancestral to the lesions. A subset of these early events may have provided the initial oncogenic potential and helped trigger the first clonal expansion. Our analyses reveal variation among the six cases in the specific evolution of neoplasia and tumor, as would be expected for an evolutionary process dominated by stochasticity. The mechanistic commonalities among the cases, however, bear significant implications for our conceptualization of tumor origins and progression.

3.3 Results

3.3.1 Whole-genome sequencing of early neoplasias and related carcinomas from archival material

Our workflow began with the screening of histopathological sections of archival estrogen receptor positive invasive ductal carcinoma (IDC) resection specimens for presence of concurrent early neoplasias, which are microscopic in size (typically 1-3 mm). We selected cases in which early neoplasia with or without atypia ("EN" or "ENA"; a spectrum of usual ductal hyperplasia, columnar cell lesions, and flat epithelial atypia), and in some cases ductal carcinoma in situ (DCIS) were present in addition to the IDC. Areas of high neoplasia or carcinoma content were cored and histologically re-evaluated for lesion purity. Six cases were chosen in which each sample met criteria for purity and had enough DNA for whole genome sequencing. Each case had at least one early neoplasia sample from the same side in which the carcinoma was found, and five also had a contralateral early neoplasia. Each had at least one control sample (lymph, normal breast tissue or both), and three cases also had a DCIS in addition to the IDC, yielding a total of 31 samples that belong to 7 classes of normal and neoplastic tissue (Figure 3.1a).

                   P1         P2         P3         P4         P5         P6
Total              2,973,005  2,771,413  2,912,758  2,915,727  2,650,714  2,937,816
Homozygous         1,168,671  1,078,021  1,149,006  1,160,421  1,017,760  1,146,679
Ts/Tv ratio        2.13       2.09       2.09       2.09       2.15       2.10
In dbSNP           2,910,863  2,717,531  2,856,582  2,857,498  2,596,421  2,864,359
Percent            97.91      98.06      98.07      98.00      97.95      97.50
Novel              62,142     53,882     56,176     58,229     54,293     73,457
Homozygous         2,514      1,734      1,715      1,681      1,295      2,372
Candidate somatic  1,465      1,546      2,567      2,775      1,924      3,416
After filtering    1,279      1,479      2,104      2,582      1,728      3,211

Table 3.1: Variant call statistics

We optimized DNA isolation from archival samples to obtain sufficient quantities of preparative material, and honed the generation of robust libraries. For each sample, a single library was built and sequenced with paired-end reads (2x101 bp) on the Illumina HiSeq platform. Library complexity was sufficient to support deep whole genome sequencing, with the vast majority of sequence data coming from independent DNA fragments as opposed to PCR duplicates. The samples from the first patient were sequenced to higher coverage (average of 84.6x) to calibrate the tradeoff between cost and sensitivity in variant calling. Coverage of each sample by confidently mapped reads ranged from 46.7x to 105.7x, with a median of 53.4x.

3.3.2 Somatic SNVs fall into a limited and highly structured set of classes

Detection of somatic single nucleotide variants (SNVs), such as those occurring during cancer evolution, requires a methodology with high specificity because inherited (germline) variants are orders of magnitude more numerous and even a low rate of miscalling inherited variants as somatic results in low accuracy. Our high sequence coverage and purity of samples allowed us to pursue highly sensitive and specific somatic SNV identification. Because we sequenced several samples from each patient, we identified the total set of SNVs in each patient with a multisample strategy using GATK [28, 71]. For each patient, we called variants using reads from all samples simultaneously, and then assigned genotypes to each sample. The vast majority of SNVs were present in all samples, as expected from germline variants. Standard quality control metrics confirmed the high quality of our variant calls. The total number of high-confidence germline variants ranged from 2,650,714 (Patient 5) to 2,973,005 (Patient 1). Between 97.91% and 98.06% of these were present in dbSNP. On average, 59,697 SNVs per patient were present in all samples but not in dbSNP, and therefore represent novel SNPs of low population-allele frequency (Table 3.1).

Figure 3.1: Lineage tree and alternate allele frequencies. a, The samples in this study by type (rows) and patient (columns). b, Model of neoplastic progression on the basis of organismal tissue and cell lineage. For simplicity, only one possible scenario of the progression from normal to neoplasia to carcinoma is shown. Mutations that arise in ancestors are propagated through subsequent divisions to all descendants. Depending on the ancestors in which they arise, they will be found in one or more samples of the patient, with varying prevalence. For example, mutations that arise in the B branches will be found in all cells of the neoplasia and of the carcinoma; by contrast, mutations that arise on the C branch will be present only in a subset of the neoplasia cells and mark the neoplastic subpopulation from which the carcinoma arose. Mutations that arise on the F branch mark a clonal expansion within the neoplasia, after the last common ancestor with the carcinoma. Note that if there are no mutations found that define branches B and C, it is not possible to infer a specific relationship of the carcinoma with the neoplasia. NS, not sampled. In the expanded box are alternate allele frequency comparisons relevant to neoplasias and carcinomas. The two starred comparisons require independent estimates of the proportion of normal cells in each sample, as they compare AAFs across different samples. All other comparisons are either within samples, or the AAF is zero, thus requiring no independent estimate of the proportion of normal cells in the sample. c-f, Alternate allele frequencies as a function of the class and sample, for each patient with phylogenetically informative SNV-sharing classes. The number of SNVs in each class and the branch in the lineage tree of panel b are listed below each plot. For Patient 1, the only phylogenetically informative class was where the IDC shared SNVs with ENA. For the other patients, the AAFs of informative classes are grouped together and the mutation pattern for each class is represented by a series of zeros and ones directly above the sample labels (a one indicates that the SNVs were present in the corresponding sample and a zero indicates that they were not). EN, early neoplasia; EN_cl, early neoplasia contralateral; ENA, early neoplasia with atypia. The subscript on the lineage-tree branch for Patient 6 denotes whether the neoplasia in the lineage tree is this patient’s EN or ENA, and whether the carcinoma is DCIS or IDC.

Between 1465 (Patient 1) and 3416 (Patient 6) SNVs were candidate somatic variants, as they were not detected in at least one sample of that patient (Table 3.1). If the samples are related by a tree, then only some sharing classes are possible and the total number of observed classes is much lower than the number of possible classes. For example, in Patient 1, from whom we sequenced six samples, there are 2^6 - 1 = 63 possible classes to which an SNV can belong. In this patient, 1766 SNVs were absent from at least one sample, and excluding those present in lymph we retain 1465 candidate somatic SNVs. Only six of the classes, containing 1279 out of the initial 1465 candidate SNVs (87%), survived filtering. Those SNVs removed during filtering were either germline SNVs where one allele was poorly covered, or somatic SNVs whose class membership we could not confidently establish. PCR-based targeted validation of 388 SNVs in Patients 2 and 6 revealed a call accuracy of 100% and 92%, respectively.

Across the six cases, we retained 82%-96% (median = 91%) of SNVs and 19%-43% (median = 27%) of classes, revealing substantial structure in the data. The final number of confident somatic SNVs ranges from 1279 in Patient 1 to 3211 in Patient 6, for a total of 12,392 in all six patients.
Of these, 8950 (72%) are private to only one sample in only one patient, and the number of such private SNVs increases as a function of the severity of the cancer phenotype: the IDCs harbor the most private mutations (average of 601 per sample, N=7, range 46-1809), the DCISs have an average of 470 SNVs per sample (N=3, range 70-978), early lesions have 229 per sample (N=14, range 123-387), and normal tissue samples have the fewest (N=2, range 39-89). On average, the IDCs accumulated 2.6-fold more private mutations than the early neoplasias, and almost 10-fold more than normal breast tissue. This may be due to a larger number of cell divisions or an increased mutation rate in the ancestral cell lineage of the IDC.
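The notion of a sharing class can be made concrete with a short Python sketch. It assumes each SNV has already been reduced to the set of samples in which it was detected; the sample labels, their ordering, and the function names are hypothetical, and the sketch does not reproduce the filtering used in this study.

```python
from collections import Counter
from typing import Iterable, Sequence, Set


def count_possible_classes(n_samples: int) -> int:
    """Number of non-empty presence/absence patterns over n samples."""
    return 2 ** n_samples - 1


def sharing_class(present_in: Set[str], sample_order: Sequence[str]) -> str:
    """Encode the samples carrying an SNV as a string of ones and zeros."""
    return "".join("1" if s in present_in else "0" for s in sample_order)


def tally_classes(snvs: Iterable[Set[str]], sample_order: Sequence[str]) -> Counter:
    """Count how many SNVs fall into each observed sharing class."""
    return Counter(sharing_class(p, sample_order) for p in snvs)


# Hypothetical six-sample patient; labels and ordering are illustrative only.
samples = ["lymph", "normal", "EN_cl", "EN", "ENA", "IDC"]
snvs = [{"ENA", "IDC"}, {"ENA", "IDC"}, {"IDC"}, {"ENA"}]

class_counts = tally_classes(snvs, samples)
print(count_possible_classes(len(samples)), dict(class_counts))
```

With six samples the sketch reports 63 possible patterns, matching the count given above, while the toy data occupy only three of them.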

3.3.3 Allele frequencies of somatic SNVs support common ancestral relationships

Somatic SNVs that are not private to individual samples define phylogenetically informative classes. A total of 3442 SNVs define such classes, ranging from 0 SNVs in Patient 4 to 1054 SNVs in Patient 3, with a per-case average of 574 and a per-class (N=7) average of 492. To illustrate the logic of phylogenetic inference using informative classes, we consider a hypothetical lineage tree that relates non-breast somatic, normal breast, neoplastic, and carcinoma cell lineages (Figure 3.1b). Mutations that occurred in ancestral cells are present in specific subsets of samples, with the lineage tree constraining the set of possible classes.

As demonstrated in recent studies of subclone evolution in IDC [82, 83, 97], alternate allele frequency (AAF) is a powerful metric for understanding tumor evolution. The "alternate allele" is the allele that does not match the reference base, and which in the vast majority of cases is the somatic mutation. Its frequency is estimated as the sequence coverage of the alternate base divided by the combined coverage of the alternate and reference bases. Depending on the ancestral lineage in which a collection of mutations arose, their AAF distributions in each sample vary. For example, if a variant arose in a common ancestor of a subset of lesional cells in the sample, its AAF is lower than that of an earlier mutation that is present in all lesional cells of the sample (Figure 3.1b).

For each SNV class of each patient, we obtained estimates of AAF distributions with highly consistent class patterns (Figure 3.1c-f). For example, in Patient 1 the AAFs of the SNVs that are present in ENA and IDC and absent everywhere else are higher than the AAFs of the ENA-only or the IDC-only classes. The same patterns hold for Patients 2 and 6. The patterns in Patient 5 are complicated by the presence of two IDCs and by low numbers of SNVs in relevant classes.
Note that the mean AAFs are always less than 50% due to unavoidable contamination of the lesional tissue with normal cells that derive from lineages that branched off before the lesional ancestors accumulated their somatic mutations.
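The AAF estimate, and the reason contamination keeps its mean below 50%, can be sketched in a few lines of Python. The second function rests on simplifying assumptions that are not stated in the text: the region is diploid in every cell, and the mutation is heterozygous in every lesional cell and absent from normal cells. Function names and the example read counts are hypothetical.

```python
def alternate_allele_frequency(alt_reads: int, ref_reads: int) -> float:
    """AAF = alternate coverage / (alternate coverage + reference coverage)."""
    total = alt_reads + ref_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return alt_reads / total


def expected_clonal_aaf(lesional_fraction: float) -> float:
    """Expected AAF of a heterozygous mutation carried by all lesional cells.

    Assumes a diploid region: each lesional cell contributes one mutant copy
    out of two, and normal cells contribute none.
    """
    if not 0.0 <= lesional_fraction <= 1.0:
        raise ValueError("lesional_fraction must lie in [0, 1]")
    return 0.5 * lesional_fraction


# Hypothetical site with 12 alternate and 28 reference reads.
observed = alternate_allele_frequency(alt_reads=12, ref_reads=28)
# Under the assumptions above, a sample with 60% lesional cells gives the same value.
expected = expected_clonal_aaf(0.6)
print(observed, expected)
```

Under these assumptions the expected AAF reaches 50% only when the sample contains no normal cells, and mutations confined to a subset of lesional cells fall lower still.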

3.3.4 Mutated neoplasias are evolutionarily related to carcinomas

Each case represents an independent evolution; therefore, common patterns across the cases may be of general significance. We first asked to what extent the early neoplasias and the carcinomas share mutations that are not present in other samples, pointing to shared ancestral cell lineages. In four cases (Patients 1, 2, 5, and 6; Figure 3.1c-f), the phylogenetically informative SNV classes indicate that a neoplasia shares a common ancestor with the carcinoma. In each of these cases, a neoplasia and the carcinoma share a significant number of SNVs. For example, in Patient 1, 775 SNVs are shared between ENA and IDC and in Patient 2, 681 SNVs are shared among the EN, DCIS, and IDC, with additional SNVs shared between the EN and IDC. There are no well-supported classes (in terms of number of SNVs and their AAFs) that are in conflict with each other, and none in which normal tissue or contralateral EN share SNVs with the carcinomas. The aforementioned PCR-based targeted validation showed 94% and 98% accuracy in assigning SNVs to the correct phylogenetic class.

In three of these four cases (Patients 1, 2, and 6) the number of SNVs in common between a neoplasia and carcinoma suggests the existence of a common ancestor that had already accumulated many somatic SNVs. Strikingly, in two cases (Patients 1 and 2) the number of mutations in the ancestor is greater than the number of mutations that subsequently occurred in the ancestral lineage private to the carcinoma.

In three cases (Patients 2, 3, and 6) DCIS was concurrent with IDC, and in one case (Patient 5) two independent IDC lesions were present. These four cases gave us the opportunity to ask whether the carcinoma phenotype arose once or multiple times independently. In Patient 3, the DCIS and IDC share a mutated common ancestor, suggesting that the carcinoma phenotype arose in the ancestral lineage, and that the IDC subsequently acquired the invasive phenotype.
In Patients 2 and 6, there is no well-supported class of SNVs that unites the two carcinomas to the exclusion of a neoplasia. Instead, in both patients, the DCIS and the IDC each share separate classes of SNVs with a neoplasia, suggesting independent origins of the carcinoma phenotype from neoplastic ancestors.

These results suggest that some early neoplasias harbor a predisposition to spawning a carcinoma that later acquires an invasive phenotype (Patients 1, 2, 6). The chance of acquiring a carcinoma phenotype, given the predisposition provided by the neoplasia, is sufficiently high to allow for concurrent and independent development of carcinomas (DCIS and IDC in Patients 2 and 6).
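What it means for two sharing classes to be in conflict can be illustrated with a simplified Python sketch. It treats every sample as a single leaf of a rooted tree, in which case two classes fit on the same tree only if their sample sets are disjoint or nested. This is an assumption made for illustration and not the procedure used in this chapter: real samples are mixtures of cell lineages, so the analysis above also weighs SNV counts and AAFs. The example classes are hypothetical apart from the nested pair, which mirrors the Patient 2 pattern described above.

```python
from itertools import combinations
from typing import FrozenSet, List, Sequence, Tuple

SharingClass = FrozenSet[str]


def compatible(a: SharingClass, b: SharingClass) -> bool:
    """Two classes fit on one rooted tree if they are disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a


def conflicting_pairs(classes: Sequence[SharingClass]) -> List[Tuple[SharingClass, SharingClass]]:
    """Return every pair of classes that cannot share a single tree."""
    return [(a, b) for a, b in combinations(classes, 2) if not compatible(a, b)]


# Nested classes, as in the EN/DCIS/IDC and EN/IDC pattern described for Patient 2.
nested = [frozenset({"EN", "DCIS", "IDC"}), frozenset({"EN", "IDC"})]

# Hypothetical conflicting pair: the IDC cannot be sister to both EN and EN_cl.
clashing = [frozenset({"EN", "IDC"}), frozenset({"EN_cl", "IDC"})]

print(conflicting_pairs(nested), conflicting_pairs(clashing))
```

Under the single-leaf assumption, an empty result for every pair of well-supported classes is the condition for drawing one lineage tree per patient.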

3.3.5 Point-mutational mechanisms are evolutionarily stable and reproducible among cases

SNVs result from mutations that occurred in ancestral cells, and if a specific molecular mechanism were primarily responsible for the mutations, the distribution of the SNVs among the various types of change (the "mutation spectrum") would carry that mechanism’s signature [88]. To investigate the cause of the ancestral accumulation of mutations, we analyzed the mutational spectrum as a function of the samples in which SNVs were found. The mutational spectrum in our cases is remarkably consistent from patient to patient (Figure 3.2a) and is also stable across SNVs in different types of samples and in different patterns (Figure 3.2b). Transitions outnumber transversions about 1.5-fold in a pattern that is typical for replication errors and not indicative of any specific type of DNA damage or failed repair mechanism. C-to-T changes (or G-to-A, which are the same due to base pairing) are most numerous. Converted to substitution rates, this bias is even more pronounced because there are only roughly 2 C’s for every 3 T’s in the human genome. The consistency across patients implies a common mechanism, and the consistency among the three SNV groups (SNVs in early lesions only, in carcinoma only, and shared between early lesions and carcinoma) implies that the common mechanism acts throughout neoplastic and tumor evolution.

To further shed light on the mutational mechanism we turned to analysis of dinucleotide substitution patterns. Because dinucleotide frequencies vary by an order of magnitude in the human genome, with AA/TT being most common and CG least common, we converted mutation counts to rates. Truly random substitutions would have the same rates for each of the sixty possible mutations (ten dinucleotides with six possible changes each, not counting changes in both bases because they are exceedingly rare). A dinucleotide-unaware process would recapitulate the mononucleotide rates, with the average transition having a rate about four-fold higher than the average transversion. By contrast, we detect an approximately eight-fold higher rate of C-to-T transitions in the CpG context. This higher mutation rate is due to methylation of the C in a CpG dinucleotide, which upon deamination becomes a TpG. If the repair machinery catches this event it is reversed, but if the replication fork passes first it leads to a C-to-T transition in one of the daughter strands. The relative rate of C-to-T transitions in CpGs versus C-to-T transitions in the other dinucleotide contexts, and versus all other changes, provides an internal calibration as to whether DNA damage processes or defective repair mechanisms have disproportionately affected the genome. In our patients, the rate increase of C-to-T transitions in the CpG context, and the dinucleotide mutation spectrum in general, are similar to those of germline evolution [51, 104], and are consistent across patients as well as among classes of SNVs (private to neoplasias, private to IDCs, and shared among neoplasias and carcinomas; Figure 3.2c-e). This implies that the sources of the somatic SNVs are mutations that accumulated during many rounds of DNA replication (many ancestral cell divisions), and that cancer- or neoplasia-specific point mutational mechanisms, if present at all, did not substantially affect the mutation spectrum.

Figure 3.2: Mutation spectra and rates of somatic SNVs. (A) Mononucleotide substitution frequencies by patient. (B) Mononucleotide substitution frequencies by SNV class. (C) Dinucleotide substitution rates of SNVs private to early neoplasias. (D) Dinucleotide substitution rates of SNVs private to carcinomas. (E) Dinucleotide substitution rates of SNVs shared among neoplasias and carcinomas. For C-E, SNVs are pooled across patients. The mutated dinucleotide is indicated in the inner circle, and the substitution occurring within it is color coded. Rate is defined as mutations per dinucleotide of that class.

Taken together, these lines of evidence support a model of mutation accumulation that is gradual and largely a function of the number of cell divisions, as opposed to recurring DNA damage events or mutational storms. The somatic SNVs are randomly distributed in each patient with no enrichment of exonic or nonsynonymous changes, regardless of the phylogenetic class to which they belong.
We also detect very little clustering of mutations that might be indicative of localized mutagenic events [83]. Across all cases, 159 out of the 12,392 high-confidence somatic SNVs fall into coding regions, with two-thirds (106) being nonsynonymous, which is what is expected by chance. This holds true for any biological subdivision of the data (e.g., neoplasias vs. IDC). The affected genes exhibit no enrichment for pathways by GO analysis [3, 49]. One point mutation, H1047R in PIK3CA, which has been previously implicated in cancer [32, 95] and early neoplasias [107], was recurrent in our cases (Patients 1, 3, 4, and 5, in various samples) at varying allele frequencies. Common cancer loci such as TP53 and BRCA1 were not mutated.
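The counting behind the sixty dinucleotide mutation categories, and the conversion from counts to rates, can be sketched in Python as follows. The sketch collapses each dinucleotide with its reverse complement to obtain ten contexts and then lists the six single-base changes of each, following the counting used in the text. Function names and the example numbers are hypothetical; the occurrence count of each context in the genome is assumed to be supplied by the caller.

```python
from itertools import product
from typing import List

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def canonical_dinucleotides() -> List[str]:
    """Collapse the 16 dinucleotides with their reverse complements."""
    return sorted({min(a + b, reverse_complement(a + b)) for a, b in product("ACGT", repeat=2)})


def single_base_changes(dinuc: str) -> List[str]:
    """All dinucleotides reachable by changing exactly one of the two bases."""
    return [
        dinuc[:i] + alt + dinuc[i + 1:]
        for i, base in enumerate(dinuc)
        for alt in "ACGT"
        if alt != base
    ]


def substitution_rate(mutation_count: int, context_occurrences: int) -> float:
    """Mutations per occurrence of the dinucleotide context in the genome."""
    return mutation_count / context_occurrences


contexts = canonical_dinucleotides()
n_categories = sum(len(single_base_changes(d)) for d in contexts)
print(contexts, n_categories)

# Hypothetical numbers: 50 mutations observed in a context that occurs 1,000,000 times.
example_rate = substitution_rate(50, 1_000_000)
```

The C-to-T transition in the CpG context discussed above appears here as the change from "CG" to "TG". Dividing by the occurrence count of each context is what keeps common dinucleotides such as AA/TT from dominating the comparison.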

3.3.6 Aneuploidies are the dominant evolutionary feature of progression

The paucity of candidate driver mutations and overall random distribution of point mutations in our cases suggests that other genomic events may be contributing to the initial neoplastic phenotype and its progression to carcinoma. We therefore devised a multistep strategy to identify chromosome arm-scale losses and gains in each patient, utilizing those germline variants for which the patients were heterozygous. Each patient was heterozygous for between 1.56 and 1.74 million SNPs, ensuring substantial statistical power to detect subchromosomal-sized aneuploidies and copy number variations.

We quantified, in each somatic sample separately, the fraction of reads that support the allele with fewer reads (the lesser allele fraction, or LAF). We then ordered the SNVs according to their position in the genome, and identified transition points where the LAF abruptly changes. In one case (Patient 5) the 20 large-scale copy number variations, which are confined to this patient’s two IDC samples, are suggestive of chromothripsis [23, 63, 66, 75, 100]. In the other five patients, we identified a total of 46 large-scale copy number variations, 43 of which involve whole chromosomes or whole chromosome arms.

Aneuploidy is present in none of the normal breast or contralateral neoplastic samples, in some of the ipsilateral neoplasias, and in all of the carcinomas. Four of the seven IDCs exhibit evidence for the presence of a subclone population in which additional chromosomes have undergone aneuploidy events.

Figure 3.3: Lesser allele fraction plot of Patient 6. SNVs are arranged by their order in the genome, and LAF is plotted for each sample in windows of 1000 SNVs with 500 SNV overlap. Aneuploidies are visible as precipitous drops in the LAF, which are often shared between samples. Chromosome boundaries are indicated by short vertical lines. All samples are plotted and give highly consistent LAFs for chromosomes that are euploid.

Figure 3.4: Aneuploidy summary. (A) LAF distributions for each chromosome across all patients and samples. In each sample-by-patient panel, the LAF distributions of all chromosomes are superimposed. In the absence of aneuploidy, the plot lines of all chromosomes are well-aligned, as is evident in the control plots and some EN plots. Control panels often contain plots from two samples (indicated) and so there are sometimes 46 lines superimposed, revealing the robustness of the LAF metric across samples and chromosomes. A chromosome’s plot line is gray when it does not deviate from the typical distribution. The line is colored when the chromosome’s LAF is skewed. Distinct colors are assigned to represent aneuploid regions that recur in different samples and patients. Colors are labeled in the panel in which they first appear. For Patient 6 please see Figure 3.3. (B) FISH of chromosome 1 in ENA of Patient 6. (C) Distribution of aneuploidies by patient, excluding those in IDC subclones. Each square denotes a unit gain (orange) or loss (blue). In Patients 2, 3, and 6, two phases of aneuploidies occurred, with those of the second phase not surrounded by a border. (Total) The total number of chromosomes lost (-) or gained (+) across all patients; (1st) the number during the first detected phase. Only recurrent events are listed. In Patient 5 (which exhibits hallmarks of chromothripsis), different pieces of chromosomes 1p and 19 underwent simultaneous losses or gains.

In Patients 1, 2, and 6, aneuploidy events are shared among early neoplasias and carcinomas. All aneuploidies that are present in the neoplasias are also present in the carcinomas. Plotting the LAFs of all samples from a patient powerfully illustrates both the chromosome scale of these events and the sharing of the same aneuploidies among certain samples. In Patient 6, for example, the aneuploidies involving chromosomes 1q, 6q, 8p, 17 and 22 are shared among both carcinomas and the EN (Figure 3.3). The plot also reveals the aneuploidies of many other chromosomes that are present in a subclone population that makes up about 30% of the IDC sample.
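The LAF metric and its windowed summary can be sketched in Python as follows. The window of 1000 SNVs with a 500-SNV overlap follows the Figure 3.3 legend; averaging per-SNV LAFs within each window, and flagging a transition wherever adjacent windows differ by more than a fixed threshold, are assumptions made for illustration and are not the segmentation procedure used in this study. Function names and the toy read counts are hypothetical.

```python
from typing import List, Sequence, Tuple


def lesser_allele_fraction(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads supporting whichever allele has fewer reads."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return min(ref_reads, alt_reads) / total


def windowed_laf(counts: Sequence[Tuple[int, int]], window: int = 1000, step: int = 500) -> List[float]:
    """Mean LAF in overlapping windows over position-sorted heterozygous SNPs."""
    lafs = [lesser_allele_fraction(ref, alt) for ref, alt in counts]
    return [
        sum(lafs[start:start + window]) / window
        for start in range(0, len(lafs) - window + 1, step)
    ]


def transition_points(means: Sequence[float], min_jump: float) -> List[int]:
    """Indices of windows whose mean LAF differs from the previous window by more than min_jump."""
    return [i for i in range(1, len(means)) if abs(means[i] - means[i - 1]) > min_jump]


# Toy data: four balanced SNPs followed by four SNPs with allelic imbalance.
toy_counts = [(10, 10)] * 4 + [(15, 5)] * 4
toy_means = windowed_laf(toy_counts, window=4, step=2)
print(toy_means, transition_points(toy_means, min_jump=0.1))
```

In the toy data the mean LAF falls from 0.5 in the balanced stretch to 0.25 in the imbalanced stretch, which is the kind of drop that marks an aneuploid segment in Figure 3.3.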
Examination of the corresponding plots of all patients reveals the extraordinary prevalence of aneuploidies in these cases. Graphing the distribution of LAFs for each LAF-derived section of the genome separately (usually a whole chromosome or arm) further supports the robustness of LAF as a metric to identify aneuploidies (Figure 3.4a). However, a reduction of LAF can be a result of ploidy gains as well as losses. Therefore, we calculated the actual ploidy changes in a two-step process: first we estimated the contribution of normal cells to the sample using chromosome losses, and then we calculated the additional number of chromosome copies for those chromosomes that exhibited increased ploidy. We validated a subset of these calls using FISH (Figure 3.4b) and found all LAF-based CHAPTER 3. GENOME EVOLUTION IN BREAST CANCER 29

calls that we tested to be correct. The distribution of aneuploidies across chromosomes among the 6 patients is highly nonrandom (Figure 3.4c). Gain of chromosome 1q is by far the most common event, with a total of 13 extra copies accumulated in these patients, not considering the IDC subclones. All cases exhibit 1q gain, and it is the only event that is shared by all three early neoplasias in which we could detect aneuploidy. In three cases (Patients 2, 3, and 6), the IDC underwent gains of 1q in addition to previous ones, increasing 1q ploidy to 6, 4, and 4, respectively. This suggests that the selective advantage conferred by 1q gain increases with further gains of 1q during tumor evolution. Like the shared SNVs, the shared aneuploidies support specific lineage relation- ships among the samples of each patient. We therefore built lineage trees using the somatic SNVs as phylogenetic markers and then asked whether the shared aneuploi- dies are consistent with these trees (Figure 3.5). All aneuploidies are unambiguously and parsimoniously assigned to specific branches in the SNV-based lineage trees. The order of aneuploidies during the evolution of each case is also unambiguous, and highly suggestive of a small number of aneuploidies being first drivers of the neoplastic phenotype. In all cases, gain of 1q was among the events that occurred first, including in the three cases in which genomic crises occurred in a common ancestor of neoplasias and carcinomas (Patients 1, 2, and 6). Loss of 16q occurred four times and loss of 17 three times as part of the first set of aneuploidies. Gain of 16p occurred three times. The remaining aneuploidies occurred once or twice in all trees and none were recurrent in the earliest stages of evolution. In order to time the occurrence of aneuploidies relative to SNVs, we identified the branch in the lineage tree of each patient where the first ploidy gains of chromo- some 1q occurred and considered SNVs that occurred on this branch. 
AAF spectra of SNVs that occurred before the ploidy gains and located on the chromatid that was duplicated should be enriched for higher AAF in the progeny samples. In each of the six patients, statistical tests rejected the null hypothesis that there are no such SNVs (Fisher’s exact test, p-values ranging from 0.5x10-2 to 0.8x10-36). This pattern is reproducible between different samples of the same case, and the SNVs that exhibit CHAPTER 3. GENOME EVOLUTION IN BREAST CANCER 30

Figure 3.5: Genome evolutions of all patients (P1?P6 ). Vertical black lines are ances- tral lineages whose lengths are proportional to the number of SNVs that occurred in each (except Patient 4, which is 50% shorter for fit). Cones represent tissue samples; cone width represents approximate amount of tissue; cone height is constrained at the top by the position of the last common ancestral cell of the sample, which is de- termined by the ancestral branch lengths, and on the bottom by the time of surgery, which is the same for all samples. The ratio of cone width to height is an approxima- tion of the rate of cell division in each sample since the last common ancestral cell. Chromosome ploidy changes are indicated with the chromosome number; stand-alone numbers in italics indicate the number of chromosomes affected by subclone evolution (or putative chromothripsis in Patient 5). Thick branches are the earliest branches for which we are able to infer genomic events. Circles at the end of thick branches are ancestors with the colors denoting their inferred neoplasia- like, DCIS-like, or IDC-like phenotypes. CHAPTER 3. GENOME EVOLUTION IN BREAST CANCER 31

high AAF largely overlap. The same pattern holds for the ploidy gain in Chromo- some 16p, but due to fewer SNVs the statistical signal is less strong. Overall, the AAF distributions of 1q SNVs are consistent with some mutations occurring before the ploidy gain, and some mutations occurring after the ploidy gain. This suggests gradual accumulation of point mutations as a function of the number of cell divisions, as opposed to mutational bursts. Because the aneuploidies and SNVs independently support the lineage tree topolo- gies, the genotypes and phenotypes of the common ancestors can be confidently in- ferred in each case. The aforementioned mutated common ancestors of neoplasias and carcinomas in Patients 1, 2, and 6, bore extensive aneuploidy, as did the mutated com- mon ancestor of the DCIS and IDC in Patient 3. In all four cases, therefore, genomic crises occurred in an ancestral cell, or in consecutive daughter cells of the ancestral cell lineage. The phenotypes of these ancestors likely included nuclear atypia and increased rate of cell division, but no invasive capabilities. Their genomes were pre- disposed to further genomic change, and as a result the subsequent lineages leading to IDC accumulated numerous additional SNVs and aneuploidies.

3.4 Discussion

Evolutionary studies of cancer have so far focused on the inference of clonal evolution within the cancer (e.g., Nik-Zainal et al. 2012a), or analyses of the relationship of metastases with the primary tumor (e.g., Navin et al. 2011) [78, 82]. We here addressed a different perspective, namely that of the early origins of the cancer phenotype. These three approaches can be thought of as mimicking progression, at least as far as solid tumors are concerned: Studies of metastatic evolution are about the terminal stages of the cancer; studies of within-cancer subclone diversity are about the Darwinian process of faster versus slower growing cell populations and the evolution of the primary tumor mass; and studies of early neoplasias and their relationships to the diagnostic tumors are about early origins of cancer.

Our understanding of these early origins will be greatly enhanced by molecular evolutionary analyses similar to those that have advanced our understanding of organismal evolution. Cells within concurrent lesions are analogous to extant organisms: they are related to one another by bifurcating lineage trees and have accumulated genomic changes over the course of evolution. In our study of multiple lesions in six cases of ductal breast carcinoma, we found that the genomes of ancestors of some early neoplasias and carcinomas were already aneuploid and harbored a modest number of point mutations. By comparing mutational spectra of somatic SNVs across patients and samples we inferred that somatic SNVs accumulated gradually as a result of a large number of ancestral cell divisions and not during saltatory mutational crises.

In two cases, the carcinoma phenotype originated twice independently from an ancestral neoplastic phenotype, suggesting a substantial predisposition of the ancestor to generate cancerous progeny. All of the neoplasias with aneuploidies shared common cellular ancestors with the carcinomas; in all these cases the neoplasia and carcinoma shared these aneuploidies as well as somatic SNVs. By contrast, none of the neoplasias that were devoid of aneuploidies (all contralateral ENs and five ipsilateral ENs) were closely related to a carcinoma.

Among the aneuploidies, gain of chromosome 1q was most dramatically recurrent, which is consistent with its prevalence among late-stage breast cancers (cf. Fig. 4 in Curtis et al. 2012) [24]. 1q harbors more than a thousand genes, and while the increased dosage alone is not sufficient for a carcinoma phenotype (some of our neoplastic samples carry the increased 1q ploidy), it is likely to be predisposing to further genomic change. Initially, such change may be catalyzed primarily by an increased rate of cell division, as the mutation spectrum of the early neoplasias is indistinguishable from that of the IDCs in every patient examined. Additional aneuploidies accumulate, however, and at some point a combination of dosage imbalances and mutational load, and perhaps epigenetic or stromal changes as well, results in an invasive carcinoma phenotype. We anticipate that the evolution of a diverse set of breast and other cancers will soon be studied similarly and with complementary approaches [38, 78, 82, 97, 98].

Current practice in clinical diagnosis of cancer facilitates studies on archival material because of the low cost and superior quality of histopathological examination of formalin-fixed, paraffin-embedded samples. We show that high-quality, large-scale genome sequence can be obtained from archival material, and show by validation that the data from such material can be highly robust. Evolutionary inference based on many samples of such material opens a new dimension for analysis of cancer origins and progression. In the future, phylogenetic analysis of carcinomas and concurrent lesions will suggest drugs that attack both carcinoma and early lesions by targeting genomic changes common to all lesions, removing not only the carcinoma but also the reservoir of related cells from which a carcinoma might recur.

3.5 Methods

3.5.1 Identification and processing of neoplasias

All patients except one had opted for mastectomies, and all of the available breast tissue had been formalin-fixed, which allowed for the discovery of multiple sites of neoplastic lesions in each case by examination of large sets of tissue sections. Neoplastic lesions were classified according to a standard set of criteria that included nuclear morphology, cell shape and tissue organization. Once a lesion was identified and characterized, we estimated the extent of the neoplastic tissue by taking cores and performing further sectioning and histology. We then dissected the material to minimize the proportion of normal breast tissue in the final sample. Our goal was to achieve 50% or more neoplastic or tumor content, but we could not rigorously quantify this number until after sequencing had been performed.

3.5.2 Library construction and sequencing

DNA extraction from each dissected sample was performed using procedures optimized for archival material. FFPE cores were cut into 20 µm slices. Paraffin was dissolved in xylene and removed (4 repeats of 5 minutes incubation with rotation in 1 ml of xylene and microcentrifugation for 3 minutes), followed by washing with ethanol (4 repeats of 5 minutes incubation with rotation in 1 ml ethanol and microcentrifugation for 3 minutes). Tissue was then lysed with Proteinase K and crosslinks were reversed by overnight incubation at 56 °C. After brief digestion with RNaseA (Qiagen), DNA was purified with a column-based method (Qiagen QIAamp DNA Mini Kit).

For each sample one Illumina library was built with an average insert size of between 300 and 400 bases depending on the quality of the DNA. Between 0.5 and 1 µg of genomic DNA (depending on the availability of the material) was sheared to 400 bp with Covaris S2, end-repaired, ligated to Illumina adapters, size selected, and amplified with 8 cycles of PCR to generate the final library. Standard Illumina 2x101 paired-end sequencing on the HiSeq2000 platform was performed such that the final sequence coverage of confidently aligned reads was nearly 100x for each sample in the first patient, and 50x for the samples of Patients 2-6. Analysis of the mapped reads confirmed high library quality (very low duplicate read-pair fraction, almost normally distributed fragment size, and highly uniform genome coverage) that was indistinguishable from that of comparable libraries constructed from fresh DNA.

3.5.3 Read mapping and BAM file processing

Raw Illumina reads were uploaded to DNAnexus (https://dnanexus.com/) and aligned to the human reference genome (UCSC build hg19) using the DNAnexus read mapper, a hash-based probabilistic aligner that incorporates paired read information. We used standard quality-control metrics, such as percent confidently mapped reads and insert size distribution, to discard problematic Illumina lanes prior to subsequent analysis. Successfully aligned reads from high-quality lanes were labeled using read group tags and then merged into sample-level BAM files. Lane-level read group tags improve the performance of downstream BAM processing and variant calling with the Genome Analysis Toolkit (GATK) [28, 71].

We followed GATK’s best practices guidelines (v3) to perform sample-level BAM processing using the Picard java utilities (http://picard.sourceforge.net/) and GATK tools [71]. This protocol has three steps that are executed in the following order: duplicate read marking, local realignment, and base quality score recalibration. We used the Picard MarkDuplicates utility to mark duplicate reads based upon the read position and orientation of read pairs. Marked duplicates were ignored in subsequent processing and variant calling steps. GATK local realignment was performed with standard parameters and the recommended known indel sets (Mills et al. and 1000 Genomes indels from the GATK v1.2 bundle) [76]. GATK base quality score recalibration was performed with the standard set of covariates. The realigned, recalibrated BAM files produced by these processing steps were used for multisample SNV calling and for all alignment-related statistics such as allele counts.

3.5.4 Multisample SNV Calling

Multisample SNV calling was performed on processed, sample-level BAM files with the GATK Unified Genotyper [28]. Multisample runs were grouped by patient such that BAM files from different patients were run separately. Notable parameters for the Unified Genotyper include standard call confidence of 50.0 (-stand_call_conf 50.0) and minimum base quality score of 20 (-mbq 20). To reduce SNV false discovery rate, raw variant calls were filtered using GATK variant quality score recalibration tools (VQSR) with the recommended training sets. The following annotations were used for training: FS (strand bias), MQ (mapping quality), DP (depth), HaplotypeScore, MQRankSum, and ReadPosRankSum. Replacing the recommended QD annotation (call quality divided by depth) with DP greatly improves sensitivity for low frequency somatic variants.

We used pass-filter SNVs to create a set of high confidence germline calls and a set of high confidence somatic calls for each patient. For a given patient, we defined germline SNVs as calls meeting the following multisample criteria: 1) depth twenty or greater in every sample, where depth is defined as the sum of alternate and reference base counts, and 2) non-reference GATK genotype (GT) in every sample. These high-confidence germline calls were used for aneuploidy analyses (below). Somatic SNVs were defined using a similar set of criteria: 1) depth twenty or greater in every sample, 2) fewer than two reads supporting the alternate allele in at least one sample, and 3) absence in dbSNP 132. We excluded SNVs in dbSNP 132 in order to reduce the number of false negative germline calls in our somatic SNV call set.

Three out of four Patient 2 genomic libraries were contaminated with mouse DNA, with approximately 15% of DCIS reads aligning to the mouse genome. Approximately 1% of reads from Normal and 0.65% of reads from EN aligned to mouse; these fractions were significantly above background levels for unaffected libraries. To remove contamination-related mapping artifacts from our SNV data, we added additional filtering steps to the SNV calling protocol for Patient 2. Prior to variant calling with the Unified Genotyper, we eliminated all reads lacking confidently mapped mates. After variant calling and VQSR, we removed all novel, pass-filter SNVs positioned in areas of the genome with significant homology to the mouse genome. Homology was assessed by mapping tiled 75-mer reference sequences, surrounding each position of interest, to the mouse genome (mm9). This second step dramatically reduced spurious calls in DCIS while eliminating only 1% of the germline dbSNP positions used as controls.
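The germline and somatic criteria defined in this section can be illustrated with a minimal Python sketch. It is not the code used in the study: the `SampleCall` record, the `classify_snv` function, and the choice to test the germline rule before the somatic rule are assumptions made for illustration. Only the thresholds (depth of twenty, fewer than two alternate reads, dbSNP exclusion) come from the text.

```python
from dataclasses import dataclass

MIN_DEPTH = 20        # from the text: depth = reference + alternate base counts
MIN_ALT_PRESENT = 2   # from the text: "fewer than two reads" means allele absent


@dataclass
class SampleCall:
    """Per-sample evidence at one pass-filter SNV (hypothetical record)."""
    ref_count: int
    alt_count: int
    non_reference_gt: bool  # True if the GATK genotype (GT) is non-reference

    @property
    def depth(self) -> int:
        return self.ref_count + self.alt_count


def classify_snv(calls, in_dbsnp):
    """Return 'germline', 'somatic', or None for one multisample SNV call."""
    # Both definitions require depth >= 20 in every sample.
    if any(c.depth < MIN_DEPTH for c in calls):
        return None
    # Germline: non-reference genotype in every sample.
    if all(c.non_reference_gt for c in calls):
        return "germline"
    # Somatic: at least one sample lacks the allele, and the site is not in dbSNP.
    if not in_dbsnp and any(c.alt_count < MIN_ALT_PRESENT for c in calls):
        return "somatic"
    return None
```

For example, a site with alternate reads in the carcinoma but none in the control would be labeled somatic only if it is also absent from dbSNP.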

3.5.5 Determination of somatic SNV class patterns and of robust sharing classes

Multisample somatic SNV calls were further analyzed to determine patterns of SNV-sharing across samples within the same patient. Although GATK provides sample genotype calls based on genotype likelihood calculations, these calls lack sensitivity when applied to cancer samples with substantial normal contamination or subclonal tumor populations. To further enhance sensitivity of SNV detection beyond GATK multi-sample calls, we applied a simple but sensitive metric to determine each sample’s mutation status. At each somatic SNV position predicted by GATK in at least one sample, we considered any sample with two or more reads supporting the alternate allele to harbor the mutation (i.e. mutation present). Samples with fewer than two reads supporting the alternate allele were labeled as reference (i.e. mutation absent). Our rationale was that, given that a specific SNV is detected in some samples, reads supporting this SNV in other samples have a substantial prior probability of being true rather than sequencing errors. We call this criterion evidence of presence of an SNV in a given sample. These patterns of mutation presence and absence define mutation classes for lineage construction and other somatic SNV analyses. We note that a small but important number of SNVs were reallocated by this method from candidate somatic SNVs with inconsistent patterns of sharing among samples to germline events, and that very few single-sample ("private") SNVs were reallocated to sharing classes, underscoring the high sequence and alignment quality of our datasets.

A case with n samples has 2^n possible class patterns. For example, for a case with 5 samples, the patterns are 00000 to 11111. No case has the 00000 class because an SNV has to be present in at least one sample, and the 11111 class is that of germline variants. Classes that are private to one sample are 10000, 01000, 00100, 00010, and 00001. Candidate classes that are possibly phylogenetically informative are defined by SNVs that are present in two or more, but not all, samples. To identify the subset of robust phylogenetically informative classes, we applied the following steps: 1) Eliminate classes with the SNV present in the lymph sample (applicable to Patients 1, 4, 5, and 6). These classes consisted of lymph-only SNVs (presumably somatic mutations in the lymph sample) and germline SNVs where one or very few samples were missing the alternate allele presumably due to sampling variance. 2) Retain the classes that, when ranked in decreasing order of the number of SNVs present within them, together contain 95% of all candidate somatic SNVs. This eliminated all spurious classes that were not supported by an overall substantial number of SNVs, most of which were missing from just one sample presumably due to sampling variance. 3) Eliminate classes with a large fraction of SNVs whose mutation-absent samples exhibit one alternate-allele supporting read, suggestive of systematic false negative calls. This also constituted a small number of classes with SNVs whose alternate alleles were missing from just one sample presumably due to sampling variance.
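The evidence-of-presence rule and the first two filtering steps can be sketched in Python as follows. This is an illustrative reading of the text, not the study's implementation: the function names, the string encoding of classes, and the tie-breaking order within the 95% ranking are assumptions. Step 3 (classes dominated by single-read absences) is omitted because the text does not give its threshold.

```python
from collections import Counter


def class_pattern(alt_counts, min_alt_reads=2):
    """Encode presence (1) or absence (0) per sample using the two-read rule."""
    return "".join("1" if c >= min_alt_reads else "0" for c in alt_counts)


def candidate_classes(patterns):
    """Count SNVs per class, keeping classes present in 2+ but not all samples."""
    counts = Counter(patterns)
    return Counter({p: k for p, k in counts.items()
                    if 2 <= p.count("1") < len(p)})


def drop_lymph_classes(class_counts, lymph_index):
    """Step 1: eliminate classes in which the lymph sample carries the SNV."""
    return Counter({p: k for p, k in class_counts.items()
                    if p[lymph_index] == "0"})


def top_classes(class_counts, coverage=0.95):
    """Step 2: keep the largest classes that together reach `coverage` of SNVs."""
    total = sum(class_counts.values())
    kept, running = [], 0
    for pattern, k in class_counts.most_common():
        if running >= coverage * total:
            break
        kept.append(pattern)
        running += k
    return kept
```

With five samples and alternate-read counts of 0, 5, 2, 1, and 0, `class_pattern` returns `01100`: the sample with a single supporting read is treated as mutation absent.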

3.5.6 PCR-based validation of SNVs and accuracy assessment of whole-genome calls

Validation Design

We designed primers to target a random subset of SNVs within each sample-specific and phylogenetic class for validation, using target-specific PCR amplification followed by sequencing. We focused on Patients 2 and 6 because their lesions have the greatest phylogenetic complexity (Figure 3.5) and therefore constitute the most stringent test of the main results of our study. 192 and 196 primer sets were designed for Patients 2 and 6, respectively, such that each SNV to be validated was within approximately 40 bases of the sequence start site. Primer design was optimized for multiplexing. Primers contained Illumina linker sequences to facilitate sequencing. The initial target-specific multiplex PCR was performed with slow annealing. A second PCR using Illumina-compatible primers added barcodes and yielded preparative amounts of material. All barcoded samples from a single patient were combined into a single lane of HiSeq2000 for sequencing.

For Patient 2, 192 of 192 targets successfully generated enough reads to support validation, with a mean coverage (number of reads per target per sample) of almost 190,000. For Patient 6, 195 of 196 targets were successful, with a mean coverage of just over 43,000. Amplification and sequencing were performed with all targets (each pool containing all patient-specific PCR primer pairs, 192 or 196) on all samples (which were amplified separately) of each patient. This design supported two levels of validation for both patients, which we denote A and B. Two more types of validation, C and D, were possible in Patient 6.

Validation A

Validation A is the simplest of the four approaches. It asks whether the validation PCR/sequencing supports the initial SNV call at all, i.e., whether the alternate allele is detectable well above background in at least one sample.

A for Patient 2 is 192/192 = 100%.

A for Patient 6 is 180/195 = 92%. 12 of the false positives of Patient 6 are SNVs that had initially been called as private to the ENA sample. Excluding the ENA-only calls, the validation rate improves to 172/175 = 98%. In this context, we note that SNVs that are present in the ENA and also in another sample have a much better validation rate than those present in the ENA alone, due to the additional signal provided by the other samples.

We conclude that our initial SNV calls had a high degree of specificity. SNVs present in more than one sample, which comprise the classes that are most important for our study, have an almost perfect validation rate.

Validation B

Validation B addresses sample-specificity and whether the assignment of an SNV to a specific class, especially to a phylogenetically informative class, was correct. The most stringent metric is to ask what fraction of SNVs are validated to be present in precisely the same set of samples as the initial assignment based on the whole genome sequence, and to count any SNV with a misassignment as incorrect. It uses those SNVs that were validated to be present (validation A).

B for Patient 2 is 180/192 = 94%. 11 of the 12 miscalls involve an SNV that was initially called as IDC-only, but is in fact an SNV shared between the EN and the IDC. Recall that this class had low alternate allele frequencies in the EN, so these were simply missed in the genome-wide data due to their very low frequency.

B for Patient 6 is 176/180 = 98%.

In summary, our class assignments are highly accurate and the small amount of error does not affect the study’s results or conclusions in any way.

Validation C

In Patient 6, we were able to go back to the archival tissue and recover additional (separate) IDC material as well as a sample of normal tissue. In what might be called a "biological" validation, we can therefore ask what fraction of SNVs present in the original IDC are also present in the two new IDC samples.

Class IDC-only: 18 of the 19 IDC-only SNVs tested also appear in both new IDC samples. The one SNV that is not present in the new IDC samples has the lowest alternate allele frequency in the original IDC sample, indicating that it marks a subclone not present outside of this sample.

Phylogenetically informative classes that include IDC: 50 out of 50 SNVs were present in the new IDC samples.

Thus, this validation shows that mutations we find in a single IDC isolate are fully supported by their presence in independent IDC isolates, and that our false-positive rate for this class is effectively zero.

Validation D

The addition of a sample of normal tissue from the ipsilateral breast in close proximity to the other lesions allowed us to ask whether any SNVs we targeted would give a false-positive signal in the validation. The seven SNVs we tested that were shared among all ipsilateral samples were also positive in this normal sample, as expected for SNVs that arose early in breast development; none of the remaining SNVs that were private to one sample or comprised the phylogenetically informative classes (N=188) had signal above background in the normal sample, again underscoring superior specificity of our somatic SNV calls.

Validation of the "Evidence-of-Presence" criterion

The validation data also allowed us to examine whether the reassignment of SNVs according to our evidence-of-presence criterion improved accuracy over GATK multi-sample calling. As we describe in the manuscript, we first perform the standard GATK multi-sample SNV calling to identify the set of somatic SNVs in a patient. GATK results include class membership, i.e., in which sample the alternate allele of the SNV is present. But we adjust this class membership using our "evidence-of-presence" criterion, which asks whether there is evidence for the alternate allele of an SNV in a sample where GATK did not call it.

The logic is as follows: Assume that an SNV is called by GATK in sample A of a given patient, but not in sample B. Assume that in sample B, there are two (or more) reads supporting that SNV. (This situation is common with GATK.) Due to its presence in sample A, the SNV has a high prior probability of being a true somatic SNV in sample B, rather than resulting from coincidence of sequencing errors in the two or more reads supporting it. Recall that typically fewer than 1000 somatic SNVs are called per sample; this is several orders of magnitude fewer positions than the entire genome, and therefore it is possible to use a more sensitive criterion for detection of SNVs in these positions than for de novo discovery in the entire genome, without increasing the false positive rate substantially.

The validation data show that application of the evidence-of-presence criterion indeed improves call accuracy over the GATK class assignments: In Patient 2, 17 SNVs within our validation set had been reassigned according to evidence-of-presence. In 14 out of these 17 cases, the reassignment detected the mutations in samples that were validated, thus improving over GATK calls; in 3 cases the reassignment created a false positive, i.e., detecting an SNV in a sample which was not supported by our validation. Similarly, in Patient 6, of the 14 SNVs within our validation set that were reassigned according to evidence-of-presence, 11 were correctly reassigned; in 3 cases evidence-of-presence called an SNV in one additional false positive sample. In summary, we concluded that evidence-of-presence significantly improved class assignments over GATK.

Conclusion

The results from the four approaches to validation reveal the robustness of the genome-wide data, particularly of the phylogenetically informative classes, which form a cornerstone of our study. The results from the assessment of the evidence-of-presence criterion versus original GATK calls underscore the power of multisample calling and the technical robustness of our

3.5.7 Aneuploidy and tumor purity

To identify aneuploidies we selected a subset of the germline SNVs identified by GATK. These "sgSNVs" were defined, separately for each patient, as a patient’s multisample germline SNVs that had dbSNP132 entries, were heterozygous, and had minor allele frequencies in the control sample of at least 0.25. We define the "lesser allele" as the one supported by fewer reads than the other allele (which is the "prevalent allele"). Three metrics were calculated for each SNV: the lesser allele coverage, the prevalent allele coverage, and the lesser allele fraction (LAF). The LAF was used to identify aneuploidies, whose "sign" (loss or gain) was then set by the two coverage metrics.

In all patients except 5, the vast majority of chromosomal copy-number transitions coincided with the centromere, or the whole chromosome was involved (Supplemental Figs. 12-17). Fine mapping of the transition points was therefore not usually necessary. In the handful of cases where a transition point did not coincide with a centromere, we found the window of the plot (Supplemental Figs. 12-17) at which the event either started or ended (window i). As described in the caption of Figure 3.3, each window spans 1000 SNVs, with an overlap of 500 SNVs between adjacent windows. We then plotted the frequency of the heterozygous variants in the three relevant windows (i-1, i, i+1, totaling 2000 variants) in that sample. The variant at which the frequency shifted was easily detected by eye, and it was not necessary to deploy segmentation methods. The resolution of this analysis is low (determined by what can be seen by eye on the plots) and we did not attempt to identify events that involved regions smaller than about a third of a chromosome arm. We also note that we did not attempt to identify structural rearrangements that do not result in copy number changes, such as inversions.

The identified loss of heterozygosity (LOH) chromosomes were then used to estimate the fraction of the sample that is due to normal cells (lymphocytes, myocytes, etc.), as follows: All cancer cells contribute zero copies of an allele that was lost due to LOH, and the normal cells contribute one copy of the LOH allele times the contamination fraction n. Note that in all our patients, the control samples were free of LOH chromosomes (Figure 3.4a). The LOH allele is almost always the one with fewer reads. Therefore the LAF l should on average be equal to the lost-chromosome fraction that is contributed by the normal contamination. Some arithmetic shows that n = l / (1 - l). Once n was estimated from l, the exact ploidy p for those chromosomes that had gains was calculated according to the formula p = (1 - 2nl) / (l(1 - n)).

Sequence-based estimates of n roughly matched estimates of n by histology. The histology-based estimates are necessarily an approximation because they are based on limited sampling, by sectioning, of the tissue core mass from which DNA is obtained.
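The two formulas can be checked with a short Python sketch. The reasoning behind them, as we read it, is this: on an LOH chromosome, normal cells contribute one copy of each allele and cancer cells contribute only the retained allele, so l = n / (1 + n), which rearranges to n = l / (1 - l). For a gained chromosome with one copy of the lesser allele per cell, l = 1 / (2n + p(1 - n)), which rearranges to the ploidy formula. The function names are hypothetical, and the use of the mean as the per-window summary is an assumption, since the text only states the window size and overlap.

```python
def lesser_allele_fraction(count_a, count_b):
    """LAF at one heterozygous SNV: reads for the lesser allele over all reads."""
    return min(count_a, count_b) / (count_a + count_b)


def windowed_laf(lafs, size=1000, step=500):
    """Summarize LAF in windows of `size` SNVs that overlap by `size - step`.

    Assumption: each window is summarized by its mean.
    """
    return [sum(lafs[i:i + size]) / size
            for i in range(0, len(lafs) - size + 1, step)]


def normal_fraction(l_loh):
    """n = l / (1 - l), with l the LAF of an LOH chromosome."""
    return l_loh / (1.0 - l_loh)


def ploidy_after_gain(l_gain, n):
    """p = (1 - 2nl) / (l(1 - n)), with l the LAF of a gained chromosome."""
    return (1.0 - 2.0 * n * l_gain) / (l_gain * (1.0 - n))


# Worked example: 25% normal cells gives an LOH LAF of 0.25 / 1.25 = 0.2,
# and a tumor ploidy of 3 gives a LAF of 1 / (0.5 + 2.25) on the gained arm.
n = normal_fraction(0.2)
p = ploidy_after_gain(1.0 / 2.75, n)
```

In this worked example `n` comes out at about 0.25 and `p` at about 3, recovering the values used to construct the inputs.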

3.5.8 SNV mutation spectra

Mutation spectra for patient samples were aggregated in two ways: (1) combined across patients to form three "superclasses" of SNVs based on lesion class (private in early neoplasias, private in carcinomas, and shared between neoplasias and carcinomas); (2) combined within each patient, ignoring lesion class, to form six groups. Complementary mutations were pooled, reducing the number of possible mononucleotide mutations from 12 to 6, and the number of single-base substitution classes in dinucleotides from 16 x 6 = 96 to 10 x 6 = 60. Mononucleotide mutation spectra were simply estimated from the frequency of the mutation type (cf. Figures 3.2a and 3.2b, where the bars of each color add up to 1). For dinucleotides, we calculated rates by dividing the number of events of each of the 60 changes by the genome-wide count of the dinucleotide that was mutated.
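Pooling of complementary mutations for the mononucleotide spectrum can be sketched in Python as below. Reporting each pooled class with a pyrimidine (C or T) reference base is a common convention and an assumption here; the text states only that complementary mutations were pooled. The function names are hypothetical.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def pooled_type(ref, alt):
    """Map a substitution to its pooled class, e.g. G>A is counted as C>T."""
    if ref in ("A", "G"):  # purine reference: read it off the opposite strand
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def mononucleotide_spectrum(snvs):
    """Fraction of SNVs in each of the 6 pooled classes (fractions sum to 1)."""
    counts = Counter(pooled_type(ref, alt) for ref, alt in snvs)
    total = sum(counts.values())
    return {mutation: k / total for mutation, k in counts.items()}
```

Applied to all 12 possible substitutions, `pooled_type` yields exactly 6 distinct classes, matching the reduction described above.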

3.5.9 Tree inference

Tree topology was defined by the phylogenetically informative SNV classes (Supplemental Table 1). The data are unambiguous and we therefore used parsimony to establish which samples shared common ancestors in which configuration. Once the SNV-based trees were built, aneuploidy events could be mapped onto them, and again the data were unambiguous. Even successive gains of ploidy of the same chromosome, most prominently among them 1q (e.g., Figure 3.5f), could be ordered without conflicts.

3.5.10 Ordering SNVs vs. Chromosome 1q ploidy gain in ancestral branches

We devised a statistical test to ask whether some SNVs occurred before copy gain in aneuploidy regions. For each patient, we identified the branch in the lineage tree responsible for the first copy number changes in Chromosome 1q, which consistently represents the earliest aneuploidy event in our patients. We then analyzed the AAF spectra of SNVs occurring in that branch. The test below is based on the idea that SNVs that occur on a 1q chromatid prior to gain of a copy of that chromatid should have higher AAF than SNVs occurring on a 1q chromosome after copy gain. We used SNVs on all diploid chromosomes on the same branch as our control set. Sequence coverage is scaled with respect to the aneuploidy and controls for contamination of the sample by normal cells (lymphocytes etc.): scaled coverage = coverage × ((p × (1 - n)) / 2 + n), where p is the estimated ploidy and n is the estimated normal contamination. In order to find outliers indicative of events prior to copy gain, we calculated a z-score. SNVs with AAFs with z-score > 3 were labeled as "high" and SNVs falling below threshold were labeled as "low". For each patient, we used Fisher’s exact test to compare the distribution of SNV labels in the control chromosomes vs. 1q. In each of the patients, we reject the null hypothesis that the 1q distribution is equal to or less extreme than the control distribution (Supplemental Table 3).
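Two pieces of this test lend themselves to a short illustration: the coverage scaling and the comparison of "high" and "low" label counts. The Python sketch below uses only the standard library (math.comb requires Python 3.8 or later). The function names are hypothetical, the one-sided form of Fisher's exact test is an assumption based on the stated null hypothesis, and the z-score calculation is left out because its details are not given above.

```python
from math import comb


def scaled_coverage(coverage: float, ploidy: float, normal_fraction: float) -> float:
    """Scale coverage for ploidy p and normal contamination n.

    Implements coverage * ((p * (1 - n)) / 2 + n).
    """
    return coverage * ((ploidy * (1.0 - normal_fraction)) / 2.0 + normal_fraction)


def fisher_exact_greater(high_1q: int, low_1q: int, high_ctrl: int, low_ctrl: int) -> float:
    """One-sided Fisher's exact test on a 2x2 table of label counts.

    Returns the hypergeometric probability of seeing at least `high_1q`
    "high" SNVs on 1q given the table margins.
    """
    n_1q = high_1q + low_1q
    n_high = high_1q + high_ctrl
    total = n_1q + high_ctrl + low_ctrl
    upper = min(n_1q, n_high)
    tail = sum(
        comb(n_high, k) * comb(total - n_high, n_1q - k)
        for k in range(high_1q, upper + 1)
    )
    return tail / comb(total, n_1q)


# Toy counts, not values from Supplemental Table 3.
p_value = fisher_exact_greater(high_1q=3, low_1q=1, high_ctrl=1, low_ctrl=3)
```

For the toy table the tail probability is 17/70; real analyses would normally rely on a statistics library rather than this hand-rolled sum.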

3.6 Acknowledgements

This chapter is a reproduction, in part, of published work in Genome Research [79]. I would like to thank my co-first authors Dorna Kashef and Ziming Weng for their brilliant contributions and dedication to this project, and I would like to thank corresponding authors Robert West, Serafim Batzoglou, and Arend Sidow for the combination of leadership, insight, and expertise that they always brought to our frequent and indispensable discussions. I would also like to thank co-authors Raheleh Salari, Robert Sweeney, Alayne Brunner, Shirley Zhu, Xiangqian Guo, Sushama Varma, and Megan Troxell, whose combined knowledge of statistics, oncology, genetics, and genome sequencing was vital to the success of this work.

Chapter 4

Pipeline technologies for validating genomic variants

4.1 Abstract

Recent exponential growth in the throughput of next-generation DNA sequencing platforms has dramatically spurred the use of accessible and scalable targeted resequencing approaches. This includes candidate region diagnostic resequencing and novel variant validation from whole genome or exome sequencing analysis. We have previously demonstrated that selective genomic circularization is a robust in-solution approach for capturing and resequencing thousands of target human genome loci such as exons and regulatory sequences. To facilitate the design and production of customized capture assays for any given region in the human genome, we developed the Human OligoGenome Resource (http://oligogenome.stanford.edu/). This online database contains over 21 million capture oligonucleotide sequences. It enables one to create customized and highly multiplexed resequencing assays of target regions across the human genome and is not restricted to coding regions. In total, this resource provides 92.1% in silico coverage of the human genome. The online server allows researchers to download a complete repository of oligonucleotide probes and design customized capture assays to target multiple regions throughout the human

45 CHAPTER 4. VALIDATING GENOMIC VARIANTS 46

genome. The website has query tools for selecting and evaluating capture oligonucleotides from specified genomic regions.

4.2 Introduction

Next-generation DNA sequencing (NGS) technologies have driven a dramatic increase in intermediate-scale, targeted resequencing applications. This is a generally useful approach for discovering polymorphisms and mutations of interest among candidate regions and validating novel variants and mutations from complete genomes and exomes [67, 109]. NGS-based targeted resequencing also has immediate application as a clinical diagnostic for identifying pathogenic mutations in medical conditions such as inherited diseases and cancer. Therefore, it has become increasingly important to develop accessible, cost-effective and flexible methods that can be used to design customized capture assays targeting any region throughout the entire human genome beyond coding sequences. Currently there is very little available with regard to conducting targeted resequencing of non-coding human genome regions. We present a genome-wide solution towards targeted resequencing of loci from the human genome. Relying on a capture technology we developed, our genome-wide design covers the human genome with in-solution capture probes. As a result, it both provides exome coverage and facilitates the analysis of non-coding regions such as promoters and regulatory sequences. These non-coding regions are of increasing interest with regard to disease-related polymorphisms and mutations.

As recently described by Natsoulis et al. [77], this in-solution capture approach enables targeted resequencing of large sets of genomic loci targets up to 1Mb and potentially higher. Using highly multiplexed pools of single-stranded 80-mer capture oligonucleotides to circularize target genomic regions en masse (Figure 4.1), this capture assay enables the amplification of target-specific regions with a universal set of PCR primers common to all targets. A capture oligonucleotide contains two single-stranded capture arm sequences that mediate circularization by hybridizing specifically to the complementary flanking sequences of a genomic target. A fixed sequence general motif links the capture arm oligonucleotides to form a complete capture oligonucleotide with 80 bp length. Each circularization reaction also incorporates a 40-bp universal vector oligonucleotide that complements the capture oligonucleotide’s general motif. This vector provides the universal primer sequences necessary for downstream amplification. Previously, we designed a set of capture oligonucleotides spanning the human exome (http://oligoexome.stanford.edu) and demonstrated that customized capture assays could be easily developed using this resource [77].

Figure 4.1: Schema for target-specific capture and amplification by selective genomic circularization. This schema for the Natsoulis et al. [77] capture protocol describes the major steps for conducting capture and amplification of a target region. The light blue squiggles at the top of the figure indicate restriction enzyme recognition sites that are cut by the addition of a single restriction enzyme. ROI stands for region of interest (i.e. target region), green bars indicate capture arms, green circles indicate capture arm hybridization sites and red bars indicate universal primer sequence. The protocol described by this figure is performed separately for each restriction enzyme.

In brief, the full protocol includes the following key steps (Figure 4.1): (i) genomic DNA is subject to restriction enzyme digestion by a single enzyme; (ii) the addition of capture oligonucleotide pools that are specific to a given restriction enzyme and the vector sequence circularizes genomic targets; (iii) 5′ exonuclease cleaves unbound 5′ sequence (the 5′ flap); (iv) circularization is completed by ligation; and (v) a uracil glycosylase reaction linearizes circularized molecules to produce capture regions flanked by universal primer sequences. These molecules can then be uniformly amplified by PCR and prepared for sequencing.

As has been described, this assay successfully targets up to 1Mb of human sequence and can accommodate the highly multiplexed capture of thousands of loci [77]. Additionally, the technology achieves both high-sensitivity and high-specificity human genomic capture across target regions up to 800bp in length. On-target regions of >10-fold coverage make up >85% of the original targets. Off-target capture, as we recently demonstrated, was <5%. Based on a published cost assessment [77], the overall assay is significantly less costly than common capture methods such as multiplex PCR and in-solution capture. The procedure uses <100ng of DNA per individual sample, which makes it ideal for clinical applications with limited sample material, and the capture assay can be completed in several days. Finally, this capture assay can be adapted for multiple sequencing platforms. The most recent application as described by Natsoulis et al. [77] uses next-generation Illumina technology for downstream sequencing, but it may be adapted for use with other next-generation sequencers, as we have previously demonstrated with Roche’s 454 sequencer [25].

This selective capture protocol introduces several molecular constraints that must be considered in identifying capture arm sequences (Figure 4.1). To complete the ligation in Step 4, the termini of a captured DNA sequence must lie flush to the 40-bp vector oligonucleotide. The 3′ capture arm of a successful capture oligonucleotide must therefore hybridize precisely at the 3′ terminus of a restriction fragment containing the genomic region of interest. The 5′ exonuclease in Step 3 enables flexible placement of the 5′ capture arm by removing the 5′ flap produced by genomic DNA that extends beyond the capture arm. These molecular mechanisms complicate capture arm design and render the procedure intractable by standard primer design software. Designing capture arms that cover any given human genome target represents a major challenge to disseminating this technology to interested users.

To facilitate designing customized targeted resequencing assays for any human genome region, we have created the Stanford Human OligoGenome Resource, a database of oligonucleotide capture sequences that span the human genome. Using our previous experience in designing and implementing assays, we improved the design method to avoid issues which decrease capture efficiency [77]. This unique resource has extensive utility given that it provides coverage for capture targets beyond the 3% of the human genome that is coding (i.e. the exome). The OligoGenome website (http://oligogenome.stanford.edu/) provides an interface for browsing, filtering and downloading capture oligonucleotide sequences based upon user queried genomic regions and annotation-based constraints. The capture oligonucleotide designs and the search tools expedite the experimental design of customized capture assays and provide researchers with the ability to query both inside and outside of the coding regions of the human genome. Given its low cost and limited infrastructure requirements [77], this resource greatly improves the accessibility of highly multiplexed genomic target capture and resequencing for researchers.

4.3 Materials and Methods

4.3.1 Capture oligonucleotide sequence generation

We created the Capture Oligonucleotide Annotation and Creation in Human (COACH) Ruby suite to generate capture oligonucleotides for the human genome in silico. The suite has two primary modules: a Capture Oligonucleotide Generator (COG) that finds putative capture arm sites and a Refactoring Engine for INvalid Selection (REINS) that removes sites which fail to pass all specified constraints. As input, the program takes a 2-bit genome file, a set of restriction enzymes and one or more bed-formatted annotation files. The suite processes the restriction enzymes independently and outputs a set of capture oligonucleotides that maximizes genome coverage for each enzyme.

To generate the capture probes for the Stanford Human OligoGenome Resource, we used the UCSC hg19 genome build for chromosomes 1-22, X, and Y [37]. The coordinates for these regions exactly match the coordinates of NCBI genome build 37. We chose the four restriction enzymes MseI, BfaI, Sau3AI and CviQI to define our in silico cut sites based upon empirical results from Natsoulis et al. [77]. Finally, we used UCSC dbSNP131 annotations to define common variants [96] and a 24-mer mapability track from ENCODE to provide an application-specific repeat mask [57, 92]. For a given 24-mer in the human genome, the mapability track indicates how many other 24-mers in the genome have a sequence that differs by two or fewer bases. Determination of exon coverage relied on the Consensus Coding Sequence (CCDS) Project [89] for hg19.

COG uses a greedy algorithm that guarantees selection of capture arms that maximize genomic coverage given the constraints in REINS. COG significantly improves upon the TargetedOligoDesign program described in Natsoulis et al. [77], which evaluated a fixed set of oligonucleotide capture arms for each target region. For each chromosome, COG defines a set of genomic target regions such that each region is bounded by adjacent restriction sites. Within each target region, COG finds the pair of plus strand capture arms that would achieve greatest coverage of the region and submits them to REINS for validation. It continues to generate capture arm sites in decreasing order of coverage until REINS either validates a pair of sites or until no further sites are available. The same procedure is repeated for minus strand capture arms. It also tests for a combination of minimally overlapping plus and minus strand capture arms. COG compares the three capture sequence sets returned by this process and outputs the set that achieves the greatest coverage of the queried region. In the case of a tie for coverage, it selects the set that covers the fewest bases redundantly. If no valid set of capture arms is available, COG does not produce any output for that target region.

In order to ensure highly sensitive and specific capture, REINS applied the following stringent constraints to the capture oligonucleotide sequences generated for the Stanford Human OligoGenome Resource. These rules correspond to the best practices empirically determined by Natsoulis et al. [77]:

(i) Capture arms are 20 bp in length;

(ii) The sequences in a pair of capture arms must have the same polarity with respect to the reference genome;

(iii) 3′ capture arms must be flush to a restriction site; and

(iv) The maximum size of a DNA molecule targeted by a capture oligonucleotide is 800bp and the minimum size is 100bp.
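Rules (i)-(iv), together with the idea of trying candidate arm pairs in decreasing order of coverage, can be sketched in Python as follows. This is not the COACH implementation (which is a Ruby suite): the record type, its field names and the function names are all hypothetical, and the genome-specific checks (dbSNP variants, mapability) are deliberately omitted.

```python
from dataclasses import dataclass
from typing import List, Optional

ARM_LENGTH = 20                     # rule (i)
MIN_TARGET, MAX_TARGET = 100, 800   # rule (iv), in bp


@dataclass(frozen=True)
class CandidatePair:
    """Hypothetical record for one candidate pair of capture arms."""
    five_prime_arm: str
    three_prime_arm: str
    five_prime_strand: str    # "+" or "-" relative to the reference
    three_prime_strand: str
    three_prime_gap: int      # bases between the 3' arm and the restriction site
    target_length: int        # size of the targeted DNA molecule, in bp
    coverage: int             # bases of the target region covered


def violations(pair: CandidatePair) -> List[str]:
    """Return the rules (i)-(iv) that a candidate pair breaks."""
    broken = []
    if len(pair.five_prime_arm) != ARM_LENGTH or len(pair.three_prime_arm) != ARM_LENGTH:
        broken.append("i: arm length")
    if pair.five_prime_strand != pair.three_prime_strand:
        broken.append("ii: polarity")
    if pair.three_prime_gap != 0:
        broken.append("iii: 3' arm not flush")
    if not MIN_TARGET <= pair.target_length <= MAX_TARGET:
        broken.append("iv: target size")
    return broken


def first_valid(candidates: List[CandidatePair]) -> Optional[CandidatePair]:
    """Try candidates in decreasing order of coverage; None if all fail."""
    for pair in sorted(candidates, key=lambda c: c.coverage, reverse=True):
        if not violations(pair):
            return pair
    return None
```

Returning None when every candidate fails mirrors the behavior described earlier, where no output is produced for a target region without a valid set of capture arms.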

Also, REINS applies rules based on genome-specific annotations to improve capture performance in human genomic target sequences. REINS rejects capture arm sequences that would hybridize to regions containing known variants from dbSNP131. Additionally, because certain common variants disrupt restriction sites of interest or introduce new restriction sites, it ensures that capture arms mediate circularization both in the presence and in the absence of these variable cut sites. REINS uses the 24-mer mapability track described above to detect capture arms with non-specific hybridization, which leads to inefficient reactions or non-specific, off-target capture. To prioritize highly specific capture arms, we ran COACH three times, using different mapability constraints based on the 24-mer mapability track to create three tiers of oligos: (i) Tier 1: oligos must fall within uniquely mapable regions; (ii) Tier 2: oligos must fall within regions mapping to fewer than 10 other regions; and (iii) Tier 3: no mapability restriction. We used capture arms from Run 2 to fill in gaps in coverage left after Run 1, and similarly filled remaining gaps with oligos from Run 3. The combination of these genome-specific rules and parameters constitutes a stringent constraint engine that enforces capture oligonucleotide quality.
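The tiered gap filling can be illustrated with a toy Python sketch. It assumes, for illustration only, that an oligo from a later run "fills a gap" whenever its capture interval contains at least one base not yet covered by earlier runs; the exact criterion used by COACH is not stated here. Intervals are half-open (start, end) pairs, and both function names are hypothetical.

```python
def assign_tier(other_matches: int) -> int:
    """Strictest tier whose mapability constraint is met.

    `other_matches` is the number of other regions a site maps to.
    """
    if other_matches == 0:
        return 1        # uniquely mapable
    if other_matches < 10:
        return 2        # fewer than 10 other regions
    return 3            # no mapability restriction


def fill_gaps(runs):
    """Combine capture intervals from Run 1, Run 2 and Run 3 in order.

    Run 1 intervals are always kept; later intervals are kept only if they
    add at least one uncovered base (an assumed criterion).
    """
    covered = set()
    chosen = []
    for tier, intervals in enumerate(runs, start=1):
        for start, end in intervals:
            positions = set(range(start, end))
            if tier == 1 or not positions <= covered:
                chosen.append((tier, start, end))
                covered |= positions
    return chosen


# Toy coordinates only.
chosen = fill_gaps([[(0, 100)], [(50, 150), (10, 90)], [(140, 300), (0, 50)]])
```

Tracking covered positions in a set keeps the sketch short; a genome-scale version would need an interval structure instead.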

4.3.2 Quality control annotation for capture oligonucleotides

We generated annotations for each capture oligonucleotide produced by COACH to serve as a proxy for capture efficiency and capture specificity. Among them are parameters which we previously had demonstrated are important for mediating on-target and efficient capture. The following annotations apply to each capture arm for any given oligonucleotide, and all repeat annotations are specific to the human genome: (i) number of exact sequence matches present in the human genome; (ii) number of matches differing by one base; (iii) number of matches differing by two bases; and (iv) GC content. Parameters 1-3 influence the on-target capture efficiency. As was previously demonstrated, one can reduce off-target capture by avoiding repetitive regions of the genome in either one or both of the capture arm sequences. We used bowtie [58] to determine in silico the number of off-target regions per oligonucleotide capture arm sequences. We considered an off-target capture to occur if the capture arms aligned between 100 and 1000bp from each other with zero mismatches and had the correct relative orientation. We defined these positions as paralogs (P) of the intended capture site.
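To make these annotations concrete, the Python sketch below computes GC content, counts matches with zero, one or two mismatches, and counts paralogous arm pairings on a toy sequence. It is a stand-in rather than the method used above: a naive forward-strand scan replaces the bowtie alignment, the orientation check is assumed to have been applied before the hit positions are passed in, and all function names are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a capture arm."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def mismatch_counts(arm: str, genome: str):
    """Count forward-strand windows that differ from `arm` by 0, 1 or 2 bases."""
    counts = [0, 0, 0]
    k = len(arm)
    for i in range(len(genome) - k + 1):
        distance = sum(a != b for a, b in zip(arm, genome[i:i + k]))
        if distance <= 2:
            counts[distance] += 1
    return tuple(counts)


def count_paralogs(five_hits, three_hits, intended, min_gap=100, max_gap=1000):
    """Count exact-hit pairings 100-1000 bp apart, excluding the intended site.

    `five_hits` and `three_hits` are exact-match positions of the two arms,
    assumed to be pre-filtered to the correct relative orientation.
    """
    paralogs = 0
    for f in five_hits:
        for t in three_hits:
            if (f, t) != intended and min_gap <= abs(t - f) <= max_gap:
                paralogs += 1
    return paralogs


# Toy sequence and positions only.
u0, u1, u2 = mismatch_counts("ACGT", "ACGTACGAACTT")
```

The exact-match count here includes the intended site itself, which is why a uniquely placed arm would have a count of one rather than zero.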

4.3.3 Database construction

The Stanford Human OligoGenome Resource (http://oligogenome.stanford.edu) runs on a 2 x 2.27GHz Quad Core Intel Xeon E5520 server, with 24GB memory and Ubuntu 9.10 operating system. The web application is implemented in Ruby on Rails 2.3.8, running under Passenger 2.2.15. The underlying database is MySQL 5.0.42 community edition, which is hosted on a separate database server. Query and data download is via any current web browser. Recommended browsers and versions are: Internet Explorer 7.0+; Firefox 3.0+; Safari 5.0+; or Chrome (any version).

4.4 Results

4.4.1 Coverage of the human genome

The Stanford Human OligoGenome Resource achieves 92.1% in silico coverage of the entire human genome using the four restriction enzymes MseI, BfaI, Sau3AI and CviQI. In total, 21.8 million probes capture 2.85 billion nucleotide positions at least once. Of these probes, 20.2 million that cover 88.4% of the genome are predicted to have a unique capture site due to the absence of paralogous regions (Table 4.1). Nearly 720 000 probes cover the CCDS-coding regions (99.65% coverage) for the 22 April 2011 release of CCDS (Table 4.2). Approximately 700 000 of these capture oligonucleotides have only one predicted target site, providing 97.12% coverage of the CCDS-annotated coding regions at high specificity. At least 77.2% of the genome is covered by capture oligonucleotides from two or more different restriction enzymes (91.5% of CCDS regions), which allows for experimental redundancy. As 50% of the human genome is highly repetitive, these total coverage numbers indicate that the capture design successfully bridges short repetitive sequences such as Alu elements by placing capture arms in uniquely mapping regions on either side of these regions. Average capture lengths of a given genomic target region are also listed in Table 4.1.

Statistics for whole genome capture                       BfaI       CviQI        MseI      Sau3AI       Total

Tier 1 only
Total number of oligos                               4 049 706   2 999 049   4 825 988   3 246 400  15 121 143
Average capture length (bases)                             401         483         269         430         381
Total bases covered (megabases)                           1614        1441        1295        1388        2311
Percent of genome covered                                52.14       46.54       41.83       44.83       74.64
Percent of oligos with U0>1                               4.71        5.13        5.08        5.06        4.99
Percent of oligos with paralogs>0                         0.07        0.07        0.06        0.07        0.07
Percent of genome covered with paralogs removed          52.10       46.50       41.80       44.80       74.60

Tiers 1, 2 and 3 combined
Total number of oligos                               5 787 809   4 362 946   6 757 372   4 938 767  21 846 894
Average capture length (bases)                             410         496         280         426         391
Total bases covered (megabases)                           2160        1978        1760        1938        2852
Percent of genome covered                                69.79       63.89       56.85       62.61       92.14
Percent of oligos with U0>1                               23.9       24.60       23.23       28.60       24.92
Percent of oligos with paralogs>0                         6.96        6.48        7.25        8.90        7.39
Percent of genome covered with paralogs removed          64.91       59.41       52.32       57.67       88.43

Table 4.1: Summary statistics for all capture oligonucleotides designed to target human genome Build 37/hg19. Tier 1 oligonucleotides are the subset of targeting molecules generated with the strictest repeat masking parameters based upon k-mer mapability. Tiers 1, 2 and 3 represent all oligonucleotides in the database. This table illustrates that the looser mapability masking parameters used in Tiers 2 and 3 allowed for increased coverage but with a higher probability of having off-target binding and amplification.

Statistics for CCDS capture for all tiers                 BfaI       CviQI        MseI      Sau3AI       Total

Total number of oligos covering CCDS target area       182 483     178 338     158 445     200 019     719 285
Average capture length (bases)                             521         550         419         489         497
Total bases covered (megabases)                         25.286      23.270      22.162       24.04       31.70
Percent of CCDS covered                                  79.49       73.15       69.67       75.58       99.65
Percent of oligos with paralogs>0                         2.89        2.85        3.00        3.03        2.94
Percent of CCDS covered with paralogs removed            76.96       70.85       67.36       73.02       97.12

Table 4.2: Summary statistics describing the in silico percent capture of CCDS regions by the combined set of oligonucleotide probes. Exonic regions prove possible to capture with high sensitivity and specificity due to their high k-mer complexity.
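Figures such as the fraction of bases covered by two or more restriction enzymes can be derived from a per-base count of distinct enzymes. The Python sketch below shows one way to compute such a depth histogram on a toy region; the function name and the (enzyme, start, end) tuple layout with half-open coordinates are assumptions for illustration, not the format used by the resource.

```python
from collections import Counter


def enzyme_depth_histogram(region_length: int, captures):
    """Count bases by the number of distinct enzymes whose oligos capture them.

    `captures` is an iterable of (enzyme, start, end) with half-open coordinates.
    Returns a Counter mapping depth -> number of bases at that depth.
    """
    enzymes_at = [set() for _ in range(region_length)]
    for enzyme, start, end in captures:
        for pos in range(max(start, 0), min(end, region_length)):
            enzymes_at[pos].add(enzyme)
    return Counter(len(enzymes) for enzymes in enzymes_at)


# Toy 10-base region; two overlapping MseI captures still count as depth 1.
hist = enzyme_depth_histogram(10, [("MseI", 0, 6), ("BfaI", 4, 8), ("MseI", 2, 5)])
fraction_multi = sum(n for depth, n in hist.items() if depth >= 2) / 10
```

Depth zero corresponds to positions for which no capture oligonucleotide exists, and depths of two or more correspond to the experimentally redundant positions.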

Figure 4.2: In silico coverage by the set of capture oligonucleotides from the Human OligoGenome Resource. Coverage is across (a) the whole genome and (b) the regions defined by CCDS in each successive tier of 24-mer repeat masking. Tier 1 oligonucleotides are the subset of targeting molecules generated with the strictest repeat masking parameters based upon k-mer mapability. Tiers 1, 2 and 3 represent all oligonucleotides in the database. The restriction enzyme count on the x-axis is the number of restriction enzymes for which the OligoGenome database contains an oligonucleotide that can capture a given base. Zero depth indicates the set of positions for which no capture oligonucleotides exist. As expected, fewer repeat mask restrictions lead to a greater number of positions covered by multiple restriction enzymes’ oligonucleotides.

4.4.2 Capture oligonucleotide human genome mapping

As described in the ’Materials and Methods’ section, we established three tiers of mapability to assess off-target capture. Tier 1 oligonucleotides are the subset of targeting molecules generated with the strictest repeat masking parameters based upon k-mer mapability (Table 4.1). Tiers 2 and 3 have fewer constraints on their presence in the genome and are more susceptible to off-target capture. Combined Tiers 1, 2 and 3 represent all oligonucleotides in the database. Figure 4.2 additionally illustrates the advantage of the multitiered approach to repeat masking the oligonucleotide capture sites. Tier 1 provides highly specific capture oligonucleotides with reduced coverage, while the addition of subsequent tiers with reduced repeat masking achieves higher coverage at the cost of less efficient reactions through off-target capture.

4.4.3 Interface for the Human OligoGenome

The Human OligoGenome Resource website presents an intuitive interface for selecting and downloading capture oligonucleotides for customized assays to mediate targeted resequencing. Users can download all probe sets by selecting gzipped flat files organized by chromosome (Figure 4.3a). Users can also select all capture oligonucleotides from specified genomic regions using the Query Capture Seqs tool, which either takes chromosome, start position and end position as input or allows the user to upload a bed file of capture regions as input (Figure 4.3b). Before submitting the query, the user may also choose to filter results by the repeat annotations discussed below and by tier number.

Each row of output from this tool presents information for a single capture oligonucleotide. The first set of fields contains information about the oligonucleotide sequence and genomic target, including chromosome (Chromosome), 1-based capture region start position (Capture Start), and 1-based capture region end position (Capture End). The Length column calculates total capture region length, and the Polarity column identifies the strand with which the capture arms hybridize relative to the reference sequence. The 5prime Capture Arm and the 3prime Capture Arm columns contain the 20bp sequences for the 5′ capture arm and the 3′ capture arm, respectively (Figure 4.3c). The website also generates a table describing the in silico coverage of returned capture oligonucleotides across the queried regions, both per region and in total across all regions. The output also includes the annotations generated by COACH. These include GC content, the number of exact sequence matches present in the human genome (U0), the number of matches differing by one base (U1), the number of matches differing by two bases (U2) and the number of in silico off target capture regions (Paralogs). Additionally, each Oligo Name field provides a hyperlink to a page that displays restriction enzyme identity (Enzyme) and full capture oligonucleotide sequence (Capture Oligo) for the specified oligo (Figure 4.3d). The user can download the oligonucleotide entries returned by the Query Capture Seqs tool by clicking on the Export Oligos button at the top of the page, which produces a tab-delimited text file containing all 10 fields described above, as well as the genome build and download date. The user may also choose to export the data to UCSC as a custom track [54]. All data on the Human OligoGenome Resource website are freely accessible.

Figure 4.3: A brief overview of the OligoGenome website and its query tools. You may (a) download all capture oligonucleotides directly or (b) search for capture oligos that target a specific interval entered on the page or a set of intervals uploaded in bed format. (c) After the submission of queried regions, you may view the returned capture oligonucleotides on the website, download the table in bed format, or export the results to the UCSC Genome Browser to view as a track. (d) Additionally, clicking an oligo name will bring you to a page with additional information, including the full 80-bp capture oligonucleotide.

To design capture assays, one selects the regions-of-interest and then downloads the overlapping capture oligonucleotide sequences. We recommend using Tier 1 capture oligonucleotides and then individually selecting lower tier oligonucleotides to fill specific gaps when needed. Also, choosing oligonucleotides with a GC content <75% will improve general capture efficiency. After oligonucleotides are synthesized, they should be pooled in equimolar ratio to each other based on their affiliated restriction enzyme.
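The selection recommendations above (start from Tier 1, keep GC content below 75%, pool by restriction enzyme) can be sketched as a simple filter in Python. The record keys used here ("name", "enzyme", "tier", "gc") and the representation of GC content as a fraction are hypothetical; they do not describe the actual column names or units of the exported file.

```python
from collections import defaultdict


def select_for_assay(oligos, max_gc=0.75):
    """Group Tier 1 oligos with GC content below `max_gc` by restriction enzyme.

    `oligos` is an iterable of dicts with hypothetical keys
    "name", "enzyme", "tier" and "gc" (GC content as a fraction).
    """
    pools = defaultdict(list)
    for oligo in oligos:
        if oligo["tier"] == 1 and oligo["gc"] < max_gc:
            pools[oligo["enzyme"]].append(oligo["name"])
    return dict(pools)


# Toy records only.
pools = select_for_assay([
    {"name": "o1", "enzyme": "MseI", "tier": 1, "gc": 0.50},
    {"name": "o2", "enzyme": "MseI", "tier": 2, "gc": 0.40},
    {"name": "o3", "enzyme": "BfaI", "tier": 1, "gc": 0.80},
    {"name": "o4", "enzyme": "BfaI", "tier": 1, "gc": 0.60},
])
```

Lower-tier oligonucleotides would then be added by hand for any remaining gaps, as recommended above, rather than by relaxing the filter globally.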

4.5 Discussion

To facilitate targeted resequencing of the human genome, we have developed and released the Human OligoGenome Resource. It covers >92% of the human genome with capture oligonucleotides that can be used in robust in-solution capture assays using the selective genomic circularization method [77]. This high level of in silico coverage is partly attributable to our design’s capability to bridge repetitive sequences in the human genome. In particular, the Human OligoGenome Resource provides for the first time a general resource to capture and resequence non-coding regions such as promoters and regulatory sequences, which are of increasing interest with regard to disease-related polymorphisms and mutations. It uses a simple web interface to provide access to capture oligonucleotide sequences for the entire human genome. These sequences facilitate rapid experiment design for using the capture technology as described in Natsoulis et al. [77]. The capture oligonucleotides can be ordered and synthesized from any commercial vendor or core oligonucleotide synthesis facility and combined to form highly multiplexed reagent pools, and downstream sequencing can be conducted using any NGS platform. These probes also serve as a useful resource for other selective circularization technologies. The recently published paper by Johansson et al. [52] presents a comparable capture method for which the OligoGenome capture oligonucleotides can be easily adapted. The Human OligoGenome Resource site will facilitate previously untenable studies in genetic and clinical resequencing and expedite variant discovery and validation.

4.6 Acknowledgements

This chapter is a reproduction, in part, of published work in Nucleic Acids Research [80]. I would like to thank my co-authors Georges Natsoulis, Sue Grimes, John Bell, Ron Davis, Hanlee Ji, and Serafim Batzoglou for their contributions to this project. I would like to give special thanks to Georges and Hanlee for conceiving and directing the course of this work.

Chapter 5

IBD Mapping in Large Disease Cohorts

5.1 Abstract

Identity by descent (IBD) mapping provides an effective methodology for identifying novel disease susceptibility loci in large patient cohorts. Although recent studies suggest that IBD mapping can achieve greater power than traditional association testing methods for the analysis of complex diseases, limitations in the speed and accuracy of tools for performing the critical IBD detection step have proven prohibitive for cohorts containing tens of thousands of individuals. Resolving these constraints has broad implications for the discovery of novel susceptibility loci using the myriad of large disease cohorts collected by genome wide association studies (GWAS) over the past decade. Building upon the rapid and highly accurate IBD detection tools recently developed in the Batzoglou lab, I have developed a robust and scalable software framework for IBD detection that integrates IBD mapping and IBD visualization tools. This software provides an exciting opportunity to discover novel disease susceptibility loci in previously intractable disease datasets, and we hope that this work will allow researchers to successfully revisit previously examined GWA studies.

59 CHAPTER 5. IBD MAPPING IN LARGE DISEASE COHORTS 60

5.2 Introduction

The term complex disease encompasses the diverse set of health conditions arising from interactions among multiple genetic and environmental factors [22, 68]. The vast majority of common, modern diseases have complex etiology [22, 119]. These conditions include heart disease, which alone accounts for over 600,000 deaths and $300 billion in healthcare costs in the U.S annually [16], as well as prevalent and dev- astating diseases such as type 2 Diabetes and multiple sclerosis [22, 33, 62, 114, 119]. Understanding the genetic underpinnings of complex disease has profound implica- tions for clinical care. Further elucidating the genetic architecture of these conditions not only facilitates drug development and biomarker discovery, but it also enables su- perior diagnostic screening methods and personalized therapeutic optimization [70]. Studying genomics in the context of this disease category therefore has tremendous importance with regards to reducing the global burden of disease. Despite dramatic advances in high-throughput genotyping technology, a combina- tion of etiological and computational considerations continues to make disease suscep- tibility loci discovery difficult in the 21st century. Environmental and lifestyle factors frequently confound genetic analysis, but few studies have the resources to collect and integrate environmental factors into their models [84, 85]. The genetic landscape of complex disease proves similarly challenging. Complex disease often results from the combined, non-linear effects of tens or even hundreds of small genetic contributions [5, 7, 36]. Such diseases, almost by definition, are characterized by genetic hetero- geneity, with affected individuals displaying diversity in their respective genetic risk factors [42]. These conditions stymie traditional linkage approaches, which excel at identifying high penetrance mutations with large effect size but lack sensitivity for common variants of small effect size [5]. 
Modern methods that leverage next generation sequencing (NGS) data to perform functional analysis of low-frequency variants overcome many of the limitations of linkage studies, but the prohibitive expense of sequencing large cohorts (>10,000 individuals) limits sample sizes and therefore the statistical power of these studies [119].

For more than a decade, genome wide association studies (GWAS) have provided a valuable methodology for exploring the genetics of complex disease. Using cost-effective genotyping technologies, such as microarrays, GWA studies allow researchers to aggregate data from tens of thousands of patients in order to achieve sufficient statistical power to detect common variants of small effect size. This approach has proven remarkably fruitful. The NHGRI GWAS catalog now includes genotype-phenotype associations for over 14,000 genomic loci across more than 4,000 diseases and traits [113], and over 100 disease susceptibility loci have been discovered for Crohn’s disease alone [53, 111]. Notably, GWAS discoveries for Crohn’s disease and other diseases such as rheumatoid arthritis have led directly to clinical advances [111]. However, despite these achievements, known risk loci typically explain less than fifty percent of the estimated heritability for a given complex disease [119]. Even accounting for the potential inflation of heritability estimates [68, 118], our current knowledge of complex disease genetics remains limited. These missing genetic factors may hold the key to developing novel diagnostics, designing new therapies, and obtaining biological insights that transform our understanding of disease etiology.

Recent research suggests that a significant portion of this missing heritability may reside in low-frequency variants of moderate effect size [39, 119]. Such variants evade detection by commonly used disease association testing methods. Although typical GWAS methodologies provide powerful tools for finding common variants with small effect size, such techniques remain unable to identify low-frequency variants associated with disease (Figure 5.1a) [119]. Conversely, next generation sequencing analyses represent a valuable approach for identifying rare variants of moderate or large effect size, often aggregating such mutations at the gene or pathway level to increase statistical power.
However, NGS methods likewise perform poorly for the discovery of low-frequency variants of moderate effect size [68]. Modern methodologies therefore lack an effective means of investigating disease-causing mutations that lie in this frequency spectrum.

Figure 5.1: GWAS and IBD mapping association test methods. A) A hypothetical disease susceptibility locus for a given population contains multiple low-frequency, moderate effect size mutations (labels A-F) as well as a T/C SNP. Population expansion over successive generations is shown by the triangular tree structure, where horizontal width indicates population size: the top of the tree represents a small ancestral population and the bottom of the tree represents the present population. Black lines indicate the distributions of the T and C SNP alleles across the whole population, and the orange lines labeled with letters indicate subpopulations containing recently arising risk alleles. Red and blue dots represent case and control individuals, respectively, sampled from the population. Traditional GWAS significance testing fails to identify this risk locus. B) Pairwise IBD detection followed by IBD mapping and permutation testing as described by Purcell et al. results in a strong signal at this risk locus.

Identity by descent (IBD) mapping provides a promising strategy for detecting low-frequency variants of intermediate effect size [14, 106]. A genomic region is defined as identical by descent among two or more individuals if those individuals inherited at least one copy of that region from a common ancestor, without recombination (Figure 5.2). IBD mapping tests for enrichment of IBD sharing among cases relative to controls as a means of identifying genomic regions that are strongly associated with disease (Figure 5.1b) [8, 14]. The intuition behind this method is as follows: a given set of alleles with no impact on disease should be represented equally among cases and controls. On the other hand, a significant enrichment of one or more low-frequency alleles among cases suggests that these alleles harbor functional changes responsible for increased disease risk. Unlike traditional GWAS methods, IBD mapping can aggregate the contributions of multiple, low-frequency alleles, and these alleles need not be coupled with the distribution of a single, common polymorphism.
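The enrichment intuition can be illustrated with a small Python 3 sketch. It assumes pairwise IBD segments are available as `(id_a, id_b, start, end)` tuples; that layout and the function name `sharing_rates_at` are hypothetical choices made for illustration and are not the output format of any particular IBD detection tool.

```python
def sharing_rates_at(position, segments, cases, controls):
    """Fraction of case-case and control-control pairs that are IBD at `position`.

    `segments` is an iterable of (id_a, id_b, start, end) tuples (hypothetical layout).
    `cases` and `controls` are sets of sample IDs, each with at least two members.
    """
    case_case = control_control = 0
    seen = set()
    for a, b, start, end in segments:
        if not (start <= position <= end):
            continue  # segment does not cover the locus
        pair = frozenset((a, b))
        if len(pair) < 2 or pair in seen:
            continue  # skip self-pairs and pairs already counted
        seen.add(pair)
        if a in cases and b in cases:
            case_case += 1
        elif a in controls and b in controls:
            control_control += 1
    n_case_pairs = len(cases) * (len(cases) - 1) // 2
    n_control_pairs = len(controls) * (len(controls) - 1) // 2
    return case_case / n_case_pairs, control_control / n_control_pairs


# Toy data: two of the three possible case-case pairs share a segment at 400.
cases = {"c1", "c2", "c3"}
controls = {"u1", "u2", "u3"}
segments = [
    ("c1", "c2", 100, 500),
    ("c2", "c3", 300, 900),
    ("u1", "u2", 50, 200),
    ("c1", "u1", 0, 1000),  # mixed pair, ignored by both rates
]
print(sharing_rates_at(400, segments, cases, controls))
```

A locus with no effect on disease would be expected to show similar rates in the two groups, whereas a markedly higher case-case rate is the kind of signal IBD mapping looks for.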

IBD mapping has evident utility for bridging the methodological gap that has prevented identification of low-frequency disease-associated variants, but several constraints have hindered its widespread adoption. IBD mapping requires demographic scenarios in which multiple, low-frequency causal mutations reside within close genetic proximity [14]. Many complex diseases lack this mutational signature. Additionally, IBD mapping has the greatest statistical power when investigating causal variants arising within the past 10-100 generations [14, 62].

Figure 5.2: Identity by descent between distant relatives. In this figure, distant relatives share a common maternal and paternal ancestor. These relatives are considered IBD at the genomic region in which they share an ancestral haplotype, represented by the orange block from the paternal ancestor.

More importantly, though, computational considerations have obstructed IBD mapping analyses. IBD mapping for complex disease necessitates both sensitive and highly specific IBD detection tools in order to achieve sufficient downstream statistical power [14]. Additionally, the speed of modern IBD detection tools becomes a limiting factor when they are applied to large datasets. The detection of low-frequency risk alleles necessitates large cohort sizes (upwards of 10,000 case individuals) [119], and IBD detection runtime scales quadratically with the number of queried individuals [94]. Limitations in both the accuracy and the speed of IBD detection have therefore prevented the application of this methodology to the study of complex disease.

To address these computational shortcomings, the Batzoglou lab has recently developed a set of tools (SpeeDB and Parente) for rapid identity by descent detection in distantly related individuals [50, 94]. These methods are notable for their speed and accuracy, which significantly exceed those of the previous state of the art.
These tools offer a novel opportunity for constructing a software pipeline that applies IBD mapping techniques to large case-control cohort datasets traditionally reserved for GWAS. The time for such an endeavor is opportune. A decade of GWAS data has yielded over 140 disease or trait studies containing greater than 10,000 individuals [113]. By integrating SpeeDB and Parente IBD data with modern statistical frameworks for IBD mapping, our pipeline promises to uncover novel disease susceptibility loci that have eluded detection for the past decade.
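To make the quadratic scaling of pairwise IBD detection concrete, the short Python 3 sketch below counts the unordered pairs that a naive all-against-all comparison would have to examine. The cohort sizes are illustrative values only, and the helper name `pair_count` is introduced here for the example.

```python
def pair_count(n):
    """Number of unordered pairs among n individuals (n choose 2)."""
    return n * (n - 1) // 2


for n in (1_000, 10_000, 30_000):
    print(f"{n:>6} individuals -> {pair_count(n):>12,} pairs")
```

Tripling a cohort from 10,000 to 30,000 individuals increases the number of candidate pairs roughly ninefold, which is why a fast filtering step before full IBD inference matters at this scale.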

5.3 Results

To facilitate rapid and accurate IBD mapping in large case-control cohorts, we have developed a robust and scalable IBD mapping pipeline that integrates the SpeeDB and Parente2 IBD detection tools. Our pipeline additionally integrates existing IBD mapping methods and tools for downstream analysis. Finally, we have applied the pipeline to a novel multiple sclerosis dataset that was previously prohibitively large for IBD mapping experiments.

5.3.1 IBD detection pipeline development

We have built a robust and scalable IBD detection pipeline using SpeeDB and Parente2, two state-of-the-art tools developed in the Batzoglou lab. The resulting framework not only serves as the foundation for the subsequent IBD mapping steps, but it also addresses a computational bottleneck currently faced by the scientific community [14]. First, by dramatically reducing IBD detection runtime, the pipeline facilitates IBD applications that currently remain computationally infeasible with large cohorts. Such applications include detecting hidden relatedness and investigating demographic history. Second, the increase in accuracy offered by Parente2 allows for previously unattainable sensitivity in downstream IBD mapping and ancestry inference algorithms.

Although researchers can run SpeeDB and Parente2 without pipeline infrastructure, software that integrates SpeeDB and Parente2 greatly increases the utility of these tools. We have provided the following three features, which prove essential for adoption, efficient usage, and error-free deployment of these tools: 1) automated data pre-processing, 2) cluster deployment options for parallel computing, and 3) a user-friendly software interface. Automation of data pre-processing, IBD inference, and data post-processing steps makes Parente2 and SpeeDB accessible to labs without computational personnel and promotes proper tool usage. Using a MapReduce-like strategy for filtering and distributing genomic data across cluster nodes, the pipeline optimizes CPU usage to reduce wallclock runtime and bypasses memory overhead problems encountered when processing non-partitioned data. Finally, both SpeeDB and Parente2 require deep understanding of program parameters in order to run successfully. Our pipeline provides a user-friendly interface to abstract away these parameters and allow easy, parallelized deployment across compute clusters.

The pipeline runs SpeeDB, Parente2, and data pre- and post-processing steps via a framework implemented in Python 2.7.3 and bash shell scripts. The software also facilitates deployment on compute clusters, including those with job engines such as the popular TORQUE and Oracle Grid Engine systems. Pre-processing steps include data format conversion, data cleaning (e.g., adding genetic positions if missing), and genotype phasing for the Parente2 training procedure using Beagle or HAPI-UR [13, 115]. The pipeline then performs filtering via SpeeDB and IBD detection via Parente2’s inference step. Essential parameters include IBD block size, SpeeDB filter stringency, and input and output formats. Because the system is highly modular, the pipeline also serves as an easily modifiable platform into which users can integrate custom tools. For instance, a user can install alternate phasing software or add support for additional file formats.

Figure 5.3 illustrates the pipeline’s modular architecture for the IBD detection steps. The pipeline achieves full parallelization of data pre-processing, HAPI-UR or Beagle phasing, and Parente2 and SpeeDB execution. Accompanying wrapper scripts provide support for running the pipeline on the TORQUE job engine, which governs job instances on the Stanford Scail cluster.

Figure 5.3: IBD detection pipeline prototype. All steps can be run in parallel, and the expanded section of the inference workflow illustrates the MapReduce-style framework for SpeeDB and Parente2.
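One way to picture the MapReduce-like distribution of pairwise work is the Python 3 sketch below. It is only an illustration under stated assumptions: the block-pair partitioning scheme and the names `make_blocks`, `block_jobs`, `pairs_for_job`, and `run_all` are hypothetical and are not taken from the pipeline's code, and the jobs run serially here rather than on a cluster.

```python
from itertools import combinations, combinations_with_replacement


def make_blocks(sample_ids, block_size):
    """Partition step: split the sample list into consecutive blocks."""
    return [sample_ids[i:i + block_size]
            for i in range(0, len(sample_ids), block_size)]


def block_jobs(n_blocks):
    """Map step: one job per unordered pair of blocks, including (i, i)."""
    return list(combinations_with_replacement(range(n_blocks), 2))


def pairs_for_job(blocks, job):
    """Sample pairs that a single job is responsible for comparing."""
    i, j = job
    if i == j:
        return list(combinations(blocks[i], 2))  # pairs within one block
    return [(a, b) for a in blocks[i] for b in blocks[j]]  # pairs across blocks


def run_all(sample_ids, block_size):
    """Reduce step: concatenate per-job results (serially, for illustration)."""
    blocks = make_blocks(sample_ids, block_size)
    results = []
    for job in block_jobs(len(blocks)):
        results.extend(pairs_for_job(blocks, job))
    return results


samples = ["s%d" % i for i in range(10)]
all_pairs = run_all(samples, block_size=4)
print(len(all_pairs))  # 45, i.e. 10 * 9 / 2: every pair is covered exactly once
```

Each job only needs the genotype data for two blocks, which is one way a partitioned design can keep per-node memory bounded while still covering every pair of individuals.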

5.3.2 Integration of IBD mapping tools

The IBD detection pipeline provides a flexible platform for performing downstream analysis using IBD mapping and visualization tools. We have implemented complementary Python scripts for computing the widely used pairwise IBD mapping statistic first proposed in Purcell et al. [90] and benchmarked by Browning and Browning [14]. These methods measure enrichment in IBD sharing among case-case pairs versus among control-control and case-control pairs at a given genomic locus using the following score:

$$
\frac{S_p - \frac{\sum_{p'} S_{p'}}{L}}{N_{AA}} \;-\; \frac{T_p - \frac{\sum_{p'} T_{p'}}{L}}{N_{!AA}} \tag{5.1}
$$

For locus $p \in p'$, where $L = |p'|$, $S_p$ represents IBD counts among case-case pairs, $T_p$ represents IBD counts among non case-case pairs (i.e., case-control and control-control), and $N_{AA}$ and $N_{!AA}$ represent the total number of case-case pairs and non case-case pairs, respectively. A high score for equation 5.1 is indicative of disease association, as recent founder effects for the disease of interest will result in an increase in IBD among case-case pairs at the causal locus. P-values are generated by permutation of disease status labels or IBD likelihood scores.

Han et al. have recently developed and implemented a faster approach to pairwise IBD mapping [44]. The authors leverage importance sampling to achieve drastic improvements in permutation testing speed. Their approach provides an estimate for the Purcell et al. pairwise permutation testing statistic that can achieve very small p-values with minimal computational overhead. This speedup proves important in the context of the proposed IBD pipeline, as IBD mapping otherwise becomes a computational bottleneck. Our pipeline produces output compatible with GraphIBD, the software package created by Han et al., so that users can run this sophisticated IBD mapping method.

The modular nature of the IBD pipeline also promotes the integration of custom modules for developing and testing novel IBD mapping statistics. We anticipate that future users can incorporate options for applying nonparametric tests such as Kolmogorov-Smirnov and Mann-Whitney-Wilcoxon to the distribution of case-case pair and control-control pair IBD scores.

The pipeline also includes a suite of visualization tools for inspecting clusters of IBD individuals and assessing IBD mapping results. Figure 5.4 illustrates locus-specific IBD relationships among case and control individuals with a particular minor allele.
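The pairwise statistic in equation 5.1 and a label-permutation p-value can be sketched in Python 3 as follows. This is a minimal illustration, not the pipeline's implementation and not the importance sampling approach of GraphIBD: the input layout (one list of IBD pairs per locus) and the function names are assumptions made for the example, and the resulting p-values are per locus, with no correction for testing many loci.

```python
import random


def count_pairs(ibd_pairs_by_locus, cases):
    """Per-locus IBD counts: S among case-case pairs, T among all other pairs."""
    S, T = [], []
    for pairs in ibd_pairs_by_locus:
        s = sum(1 for a, b in pairs if a in cases and b in cases)
        S.append(s)
        T.append(len(pairs) - s)
    return S, T


def pairwise_scores(S, T, n_case_pairs, n_other_pairs):
    """Equation 5.1: mean-centered counts, each scaled by its number of pairs."""
    L = len(S)
    mean_S = sum(S) / L
    mean_T = sum(T) / L
    return [(s - mean_S) / n_case_pairs - (t - mean_T) / n_other_pairs
            for s, t in zip(S, T)]


def permutation_pvalues(ibd_pairs_by_locus, samples, cases, n_perm=1000, seed=0):
    """Permute case labels and count how often each locus scores as high.

    `samples` must be a list of all sample IDs; `cases` is a set of case IDs.
    """
    rng = random.Random(seed)
    n, k = len(samples), len(cases)
    n_case_pairs = k * (k - 1) // 2
    n_other_pairs = n * (n - 1) // 2 - n_case_pairs
    S, T = count_pairs(ibd_pairs_by_locus, cases)
    observed = pairwise_scores(S, T, n_case_pairs, n_other_pairs)
    exceed = [0] * len(observed)
    for _ in range(n_perm):
        shuffled_cases = set(rng.sample(samples, k))  # same number of cases
        S_perm, T_perm = count_pairs(ibd_pairs_by_locus, shuffled_cases)
        permuted = pairwise_scores(S_perm, T_perm, n_case_pairs, n_other_pairs)
        for i, (obs, perm) in enumerate(zip(observed, permuted)):
            if perm >= obs:
                exceed[i] += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return observed, [(e + 1) / (n_perm + 1) for e in exceed]


samples = ["c1", "c2", "c3", "u1", "u2", "u3"]
cases = {"c1", "c2", "c3"}
ibd_pairs_by_locus = [
    [("c1", "c2"), ("c1", "c3"), ("c2", "c3")],  # locus 0: case-case sharing only
    [("u1", "u2")],                               # locus 1: control-control sharing
    [("c1", "u1")],                               # locus 2: case-control sharing
]
scores, pvals = permutation_pvalues(ibd_pairs_by_locus, samples, cases, n_perm=200)
print(scores)
print(pvals)
```

In this toy input, locus 0 receives the highest score because all of its IBD sharing falls among case-case pairs. With real data the number of permutations bounds the smallest attainable p-value, which is the cost that faster schemes such as importance sampling aim to avoid.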

5.3.3 IBD mapping of a multiple sclerosis cohort

We have demonstrated that our IBD mapping pipeline can efficiently analyze datasets of up to 30,000 individuals by fully processing a large multiple sclerosis cohort in collaboration with the International Multiple Sclerosis Genetics Consortium. Processing this volume of data with high sensitivity for short IBD regions was previously infeasible using traditional IBD mapping pipelines. Using HapMap 3 genotype data, we have also demonstrated the efficacy of our pipeline for identifying IBD regions with high sensitivity and specificity. Figure 5.5 presents the results of running our pipeline on HapMap 3 CEU individuals. Parente2’s embedded likelihood ratio test (ELRT) cleanly separates regions of true IBD in parent-child pairs (red dots) from non-IBD regions in pairs of unrelated individuals (blue dots).

Figure 5.4: IBD relationships at SNP rs498422 among individuals in a multiple sclerosis cohort who carry the minor allele for this SNP. Colored circles represent individuals: red indicates cases and blue indicates controls. Lines represent IBD relationships between two individuals, where red lines indicate high confidence case-case IBD, blue lines indicate high confidence control-control IBD, purple lines indicate high confidence case-control IBD, and gray lines indicate low confidence IBD.

Figure 5.5: IBD detection in the HapMap 3 CEU cohort. Red dots represent per-window Parente2 scores for parent-child pairs, whereas blue dots represent scores for pairs of unrelated individuals. The embedded likelihood ratio test distinguishes between related and unrelated pairs.

5.4 Discussion

IBD mapping facilitates identification of disease susceptibility loci for complex diseases characterized by multiple, low-frequency variants of moderate effect size. IBD mapping offers an opportunity to reanalyze large, public disease cohorts that exhibit a large degree of missing heritability in order to recover disease susceptibility loci that evade detection by traditional GWAS methods. However, limitations in both the speed and the accuracy of IBD detection methods have previously rendered this task infeasible. The IBD mapping pipeline described above leverages rapid and accurate tools for IBD detection to enable IBD mapping studies on these large cohorts. The abundance of publicly available genotyping data for disease cohorts makes this project very timely.

We will distribute all pipeline software on GitHub for free academic use. The software includes 1) a Python framework for running IBD detection with Parente2 and SpeeDB, 2) modules for running IBD mapping analyses, including GraphIBD and implementations of the Purcell et al. pairwise mapping method, and 3) IBD mapping visualization tools, written in Python. We hope that our work will enable other researchers to incorporate fast and accurate IBD analysis into their own workflows.

Chapter 6

Future Directions

It has been a privilege to work alongside experts in the field to investigate genome evolution during cancer progression, to explore the genetics of complex disease at the population level, and to develop informatics methods for novel DNA capture platforms. This work encompasses a diverse set of fields within genomics, all of which are poised to undergo rapid change as DNA sequencing technologies advance. Decreasing prices and improvements in multiple facets of sequencing methodology promise to unlock new applications and reduce the computational overhead of previously difficult tasks.

Several recently developed technologies show promise for producing long sequence reads for use in genomics studies. Synthetic long read technologies use specialized DNA partitioning and labeling techniques to reconstruct DNA fragments approaching 10kb in size from standard short read sequencing output. Examples of this methodology include products from Moleculo and 10X Technologies, which have come to market very recently. By leveraging existing short read sequencers, such as those produced by Illumina, they can take advantage of the increasing accuracy and high throughput of well-established technologies. Several single-molecule long read sequencing technologies are also coming to fruition. Pacific Biosciences currently offers reads with lengths in the tens of thousands of bases, while upcoming technologies, such as nanopore sequencing pursued by companies including Oxford Nanopore and Genia, promise inexpensive reads of even longer length. If these third generation technologies overcome their throughput and accuracy limitations, they will revolutionize the sequencing field.

Both synthetic long read and single-molecule long read technologies will greatly increase the sensitivity and specificity of the genomic assays used in cancer genomics research. Read aligners can map long reads to previously unreachable genomic regions characterized by repeat or degenerate sequence, and can improve alignment quality in currently accessible areas. These advances herald more accurate and comprehensive SNV calling as well as dramatic improvements in our ability to detect more complex mutation types, such as insertions, deletions, and structural variants. Previously underutilized mutations and mutation types can therefore be applied to clinical interpretation and evolutionary research into cancer genomics. Additionally, long reads provide sufficient information for highly accurate phasing and genome assembly methods, which can further reveal functional genomic features that remain undetectable by other sequencing approaches. Recent advances in single cell sequencing techniques will also facilitate cancer research by allowing investigators to obtain exact genetic information about subclonal populations instead of relying on computational inference methods with limited resolution.

The increases in throughput and decreases in cost for sequencing technologies have already begun to revolutionize the study of complex disease. Researchers can reexamine genotyped cohorts using full base-pair resolution data, not only to observe common polymorphisms in a given population but also to detect the rare mutations that may explain missing heritability. Sequencing will facilitate the use of mutation types such as indels and structural variants in disease association studies.
Additionally, these data will allow for more precise ancestry inference to correct for population stratification and for use in mixed-model association mapping. For these reasons, sequencing will likely render chip-based genotyping data obsolete over the next several decades.

The field of DNA capture methods has proven equally dynamic. Robust and high-throughput multiplex PCR methods provide a cost-efficient and effective solution for performing small batch mutation validation. However, given the decrease in sequencing prices, I anticipate that the use of highly specific capture methods will become less frequent. In-solution hybrid capture methods can now provide reasonably priced variant validation for genomic regions at the megabase scale. The increasing confidence in next generation sequencing quality may also soon render orthogonal variant discovery methods unnecessary for many applications, with NGS output replacing Sanger sequencing as a sufficient gold standard.

Finally, long read technology is enabling a host of new applications. Long reads will facilitate highly accurate phasing and genome assembly methods, especially in burgeoning fields such as metagenomics. RNA sequencing studies will continue to gain momentum, harnessing long reads to resolve isoform identity and abundance and employing single cell sequencing for unprecedented resolution. The near future also promises portable DNA sequencers for use outside the lab, both for forensic and medical purposes. I greatly look forward to the next decade of rapid advances in both basic and translational research.

References

[1] Tarek M A Abdel-Fatah, Desmond G Powe, Zsolt Hodi, Andrew H S Lee, Jorge S Reis-Filho, and Ian O Ellis. High frequency of coexistence of columnar cell lesions, lobular neoplasia, and low grade ductal carcinoma in situ with invasive tubular carcinoma and invasive lobular carcinoma. Am. J. Surg. Pathol., 31(3):417–26, March 2007.

[2] Goncalo R Abecasis, Adam Auton, Lisa D Brooks, Mark A DePristo, Richard M Durbin, Robert E Handsaker, Hyun Min Kang, Gabor T Marth, and Gil A McVean. An integrated map of genetic variation from 1,092 human genomes. Nature, 491(7422):56–65, November 2012.

[3] M Ashburner, C A Ball, J A Blake, D Botstein, H Butler, J M Cherry, A P Davis, K Dolinski, S S Dwight, J T Eppig, M A Harris, D P Hill, L Issel-Tarver, A Kasarskis, S Lewis, J C Matese, J E Richardson, M Ringwald, G M Rubin, and G Sherlock. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet., 25(1):25–9, May 2000.

[4] Shantanu Banerji, Kristian Cibulskis, Claudia Rangel-Escareno, Kristin K Brown, Scott L Carter, Abbie M Frederick, Michael S Lawrence, Andrey Y Sivachenko, Carrie Sougnez, Lihua Zou, Maria L Cortes, Juan C Fernandez-Lopez, Shouyong Peng, Kristin G Ardlie, Daniel Auclair, Veronica Bautista-Piña, Fujiko Duke, Joshua Francis, Joonil Jung, Antonio Maffuz-Aziz, Robert C Onofrio, Melissa Parkin, Nam H Pho, Valeria Quintanar-Jurado, Alex H Ramos, Rosa Rebollar-Vega, Sergio Rodriguez-Cuevas, Sandra L Romero-Cordoba, Steven E Schumacher, Nicolas Stransky, Kristin M Thompson, Laura Uribe-Figueroa, Jose Baselga, Rameen Beroukhim, Kornelia Polyak, Dennis C Sgroi, Andrea L Richardson, Gerardo Jimenez-Sanchez, Eric S Lander, Stacey B Gabriel, Levi A Garraway, Todd R Golub, Jorge Melendez-Zajgla, Alex Toker, Gad Getz, Alfredo Hidalgo-Miranda, and Matthew Meyerson. Sequence analysis of mutations and translocations across breast cancer subtypes. Nature, 486(7403):405–9, June 2012.

[5] JC Barrett and MJ Daly. Complex Disease Genes and Their Discovery. Mol. Genet. Inflamm. Bowel Dis., pages 87–97, 2013.

[6] JS Beckmann, Xavier Estivill, and SE Antonarakis. Copy number variants and genetic traits: closer to the resolution of phenotypic to genotypic variability. Nat. Rev. Genet., 8(August):639–646, 2007.

[7] Ashley H Beecham, Nikolaos A Patsopoulos, Dionysia K Xifara, Mary F Davis, Anu Kemppinen, Chris Cotsapas, Tejas S Shah, Chris Spencer, David Booth, An Goris, Annette Oturai, Janna Saarela, Bertrand Fontaine, Bernhard Hemmer, Claes Martin, Frauke Zipp, Sandra D’Alfonso, Filippo Martinelli-Boneschi, Bruce Taylor, Hanne F Harbo, Ingrid Kockum, Jan Hillert, Tomas Olsson, Maria Ban, Jorge R Oksenberg, Rogier Hintzen, Lisa F Barcellos, Cristina Agliardi, Lars Alfredsson, Mehdi Alizadeh, Carl Anderson, Robert Andrews, Helle Bach Søndergaard, Amie Baker, Gavin Band, Sergio E Baranzini, Nadia Barizzone, Jeffrey Barrett, Céline Bellenguez, Laura Bergamaschi, Luisa Bernardinelli, Achim Berthele, Viola Biberacher, Thomas M C Binder, Hannah Blackburn, Izaura L Bomfim, Paola Brambilla, Simon Broadley, Bruno Brochet, Lou Brundin, Dorothea Buck, Helmut Butzkueven, Stacy J Caillier, William Camu, Wassila Carpentier, Paola Cavalla, Elisabeth G Celius, Irène Coman, Giancarlo Comi, Lucia Corrado, Leentje Cosemans, Isabelle Cournu-Rebeix, Bruce A C Cree, Daniele Cusi, Vincent Damotte, Gilles Defer, Silvia R Delgado, Panos Deloukas, Alessia di Sapio, Alexander T Dilthey, Peter Donnelly, Bénédicte Dubois, Martin Duddy, Sarah Edkins, Irina Elovaara, Federica Esposito, Nikos Evangelou, Barnaby Fiddes, Judith Field, Andre Franke, Colin Freeman, Irene Y Frohlich, Daniela Galimberti, Christian Gieger, Pierre-Antoine Gourraud, Christiane Graetz, Andrew Graham, Verena Grummel, Clara Guaschino, Athena Hadjixenofontos, Hakon Hakonarson, Christopher Halfpenny, Gillian Hall, Per Hall, Anders Hamsten, James Harley, Timothy Harrower, Clive Hawkins, Garrett Hellenthal, Charles Hillier, Jeremy Hobart, Muni Hoshi, Sarah E Hunt, Maja Jagodic, Ilijas Jelčić, Angela Jochim, Brian Kendall, Allan Kermode, Trevor Kilpatrick, Keijo Koivisto, Ioanna Konidari, Thomas Korn, Helena Kronsbein, Cordelia Langford, Malin Larsson, Mark Lathrop, Christine Lebrun-Frenay, Jeannette Lechner-Scott, Michelle H Lee, Maurizio A Leone, Virpi Leppä, Giuseppe Liberatore, Benedicte A Lie, Christina M Lill, Magdalena Lindén, Jenny Link, Felix Luessi, Jan Lycke, Fabio Macciardi, Satu Männistö, Clara P Manrique, Roland Martin, Vittorio Martinelli, Deborah Mason, Gordon Mazibrada, Cristin McCabe, Inger-Lise Mero, Julia Mescheriakova, Loukas Moutsianas, Kjell-Morten Myhr, Guy Nagels, Richard Nicholas, Petra Nilsson, Fredrik Piehl, Matti Pirinen, Siân E Price, Hong Quach, Mauri Reunanen, Wim Robberecht, Neil P Robertson, Mariaemma Rodegher, David Rog, Marco Salvetti, Nathalie C Schnetz-Boutaud, Finn Sellebjerg, Rebecca C Selter, Catherine Schaefer, Sandip Shaunak, Ling Shen, Simon Shields, Volker Siffrin, Mark Slee, Per Soelberg Sorensen, Melissa Sorosina, Mireia Sospedra, Anne Spurkland, Amy Strange, Emilie Sundqvist, Vincent Thijs, John Thorpe, Anna Ticca, Pentti Tienari, Cornelia van Duijn, Elizabeth M Visser, Steve Vucic, Helga Westerlind, James S Wiley, Alastair Wilkins, James F Wilson, Juliane Winkelmann, John Zajicek, Eva Zindler, Jonathan L Haines, Margaret A Pericak-Vance, Adrian J Ivinson, Graeme Stewart, David Hafler, Stephen L Hauser, Alastair Compston, Gil McVean, Philip De Jager, Stephen J Sawcer, and Jacob L McCauley. Analysis of immune-related loci identifies 48 new susceptibility variants for multiple sclerosis. Nat. Genet., 45(11):1353–60, November 2013.

[8] Sivan Bercovici, Christopher Meek, Ydo Wexler, and Dan Geiger. Estimating genome-wide IBD sharing from SNP data via an efficient hidden Markov model of LD with application to gene mapping. Bioinformatics, 26(12):i175–i182, 2010.

[9] Rameen Beroukhim, Craig H Mermel, Dale Porter, Guo Wei, Soumya Raychaudhuri, Jerry Donovan, Jordi Barretina, Jesse S Boehm, Jennifer Dobson, Mitsuyoshi Urashima, Kevin T Mc Henry, Reid M Pinchback, Azra H Ligon, Yoon-Jae Cho, Leila Haery, Heidi Greulich, Michael Reich, Wendy Winckler, Michael S Lawrence, Barbara A Weir, Kumiko E Tanaka, Derek Y Chiang, Adam J Bass, Alice Loo, Carter Hoffman, John Prensner, Ted Liefeld, Qing Gao, Derek Yecies, Sabina Signoretti, Elizabeth Maher, Frederic J Kaye, Hidefumi Sasaki, Joel E Tepper, Jonathan A Fletcher, Josep Tabernero, José Baselga, Ming-Sound Tsao, Francesca Demichelis, Mark A Rubin, Pasi A Janne, Mark J Daly, Carmelo Nucera, Ross L Levine, Benjamin L Ebert, Stacey Gabriel, Anil K Rustgi, Cristina R Antonescu, Marc Ladanyi, Anthony Letai, Levi A Garraway, Massimo Loda, David G Beer, Lawrence D True, Aikou Okamoto, Scott L Pomeroy, Samuel Singer, Todd R Golub, Eric S Lander, Gad Getz, William R Sellers, and Matthew Meyerson. The landscape of somatic copy-number alteration across human cancers. Nature, 463(7283):899–905, February 2010.

[10] Graham R Bignell, Chris D Greenman, Helen Davies, Adam P Butler, Sarah Edkins, Jenny M Andrews, Gemma Buck, Lina Chen, David Beare, Calli Latimer, Sara Widaa, Jonathon Hinton, Ciara Fahey, Beiyuan Fu, Sajani Swamy, Gillian L Dalgliesh, Bin T Teh, Panos Deloukas, Fengtang Yang, Peter J Campbell, P Andrew Futreal, and Michael R Stratton. Signatures of mutation and selection in the cancer genome. Nature, 463(7283):893–8, February 2010.

[11] Alessandro Bombonati and Dennis C Sgroi. The molecular pathology of breast cancer progression. J. Pathol., 223(2):307–17, January 2011.

[12] Daniel Branton, David W Deamer, Andre Marziali, Hagan Bayley, Steven A Benner, Thomas Butler, Massimiliano Di Ventra, Slaven Garaj, Andrew Hibbs, Xiaohua Huang, Stevan B Jovanovich, Predrag S Krstic, Stuart Lindsay, Xinsheng Sean Ling, Carlos H Mastrangelo, Amit Meller, John S Oliver, Yuriy V Pershin, J Michael Ramsey, Robert Riehn, Gautam V Soni, Vincent Tabard-Cossa, Meni Wanunu, Matthew Wiggin, and Jeffery A Schloss. The potential and challenges of nanopore sequencing. Nat. Biotechnol., 26(10):1146–53, 2008.

[13] Sharon R Browning and Brian L Browning. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet., 81(5):1084–1097, 2007.

[14] Sharon R Browning and Brian L Browning. Identity by descent between distant relatives: detection and applications. Annu. Rev. Genet., 46:617–33, January 2012.

[15] William S Bush and Jason H Moore. Chapter 11: Genome-wide association studies. PLoS Comput. Biol., 8(12):e1002822, January 2012.

[16] CDC. CDC - DHDSP - Heart Disease FAQs.

[17] Michael A Chapman, Michael S Lawrence, Jonathan J Keats, Kristian Cibulskis, Carrie Sougnez, Anna C Schinzel, Christina L Harview, Jean-Philippe Brunet, Gregory J Ahmann, Mazhar Adli, Kenneth C Anderson, Kristin G Ardlie, Daniel Auclair, Angela Baker, P Leif Bergsagel, Bradley E Bernstein, Yotam Drier, Rafael Fonseca, Stacey B Gabriel, Craig C Hofmeister, Sundar Jagannath, Andrzej J Jakubowiak, Amrita Krishnan, Joan Levy, Ted Liefeld, Sagar Lonial, Scott Mahan, Bunmi Mfuko, Stefano Monti, Louise M Perkins, Robb Onofrio, Trevor J Pugh, S Vincent Rajkumar, Alex H Ramos, David S Siegel, Andrey Sivachenko, A Keith Stewart, Suzanne Trudel, Ravi Vij, Douglas Voet, Wendy Winckler, Todd Zimmerman, John Carpten, Jeff Trent, William C Hahn, Levi A Garraway, Matthew Meyerson, Eric S Lander, Gad Getz, and Todd R Golub. Initial genome sequencing and analysis of multiple myeloma. Nature, 471(7339):467–72, March 2011.

[18] Wei-Hsin Chen, Yu-Wen Lu, Feipei Lai, Yin-Hsiu Chien, and Wuh-Liang Hwu. Integrating Human Genome Database into Electronic Health Record with Sequence Alignment and Compression Mechanism. J. Med. Syst., May 2011.

[19] Scott Christley, Yiming Lu, Chen Li, and Xiaohui Xie. Human genomes as email attachments. Bioinformatics, 25(2):274–5, January 2009.

[20] Deanna M Church, Valerie A Schneider, Karyn Meltz Steinberg, Michael C Schatz, Aaron R Quinlan, Chen-Shan Chin, Paul A Kitts, Bronwen Aken, Gabor T Marth, Michael M Hoffman, Javier Herrero, M Lisandra Zepeda Mendoza, Richard Durbin, and Paul Flicek. Extending reference assembly models. Genome Biol., 16(1):13, January 2015.

[21] International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature, pages 931–945, 2004.

[22] Johanna Craig. Complex Diseases: Research and Applications, 2008.

[23] Karen Crasta, Neil J Ganem, Regina Dagher, Alexandra B Lantermann, Elena V Ivanova, Yunfeng Pan, Luigi Nezi, Alexei Protopopov, Dipanjan Chowdhury, and David Pellman. DNA breaks and chromosome pulverization from errors in mitosis. Nature, 482(7383):53–8, February 2012.

[24] Christina Curtis, Sohrab P Shah, Suet-Feung Chin, Gulisa Turashvili, Oscar M Rueda, Mark J Dunning, Doug Speed, Andy G Lynch, Shamith Samarajiwa, Yinyin Yuan, Stefan Gräf, Gavin Ha, Gholamreza Haffari, Ali Bashashati, Roslin Russell, Steven McKinney, Anita Langerød, Andrew Green, Elena Provenzano, Gordon Wishart, Sarah Pinder, Peter Watson, Florian Markowetz, Leigh Murphy, Ian Ellis, Arnie Purushotham, Anne-Lise Børresen-Dale, James D Brenton, Simon Tavaré, Carlos Caldas, and Samuel Aparicio. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature, 486(7403):346–52, June 2012.

[25] Fredrik Dahl, Johan Stenberg, Simon Fredriksson, Katrina Welch, Michael Zhang, Mats Nilsson, David Bicknell, Walter F Bodmer, Ronald W Davis, and Hanlee Ji. Multigene amplification and massively parallel sequencing for cancer mutation discovery. Proc. Natl. Acad. Sci. U. S. A., 104(22):9387–92, May 2007.

[26] Kenny Daily, Paul Rigor, Scott Christley, Xiaohui Xie, and . Data structures and compression algorithms for high-throughput sequencing technologies. BMC Bioinformatics, 11(1):514, January 2010.

[27] Adrian V Dalca and Michael Brudno. Genome variation discovery with high-throughput sequencing data. Brief. Bioinform., 11(1):3–14, January 2010.

[28] Mark A DePristo, Eric Banks, Ryan Poplin, Kiran V Garimella, Jared R Maguire, Christopher Hartl, Anthony A Philippakis, Guillermo del Angel, Manuel A Rivas, Matt Hanna, Aaron McKenna, Tim J Fennell, Andrew M Kernytsky, Andrey Y Sivachenko, Kristian Cibulskis, Stacey B Gabriel, David Altshuler, and Mark J Daly. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet., 43(5):491–8, May 2011.

[29] Li Ding, Matthew J Ellis, Shunqiang Li, David E Larson, Ken Chen, John W Wallis, Christopher C Harris, Michael D McLellan, Robert S Fulton, Lucinda L Fulton, Rachel M Abbott, Jeremy Hoog, David J Dooling, Daniel C Koboldt, Heather Schmidt, Joelle Kalicki, Qunyuan Zhang, Lei Chen, Ling Lin, Michael C Wendl, Joshua F McMichael, Vincent J Magrini, Lisa Cook, Sean D McGrath, Tammi L Vickery, Elizabeth Appelbaum, Katherine Deschryver, Sherri Davies, Therese Guintoli, Li Lin, Robert Crowder, Yu Tao, Jacqueline E Snider, Scott M Smith, Adam F Dukes, Gabriel E Sanderson, Craig S Pohl, Kim D Delehaunty, Catrina C Fronick, Kimberley A Pape, Jerry S Reed, Jody S Robinson, Jennifer S Hodges, William Schierding, Nathan D Dees, Dong Shen, Devin P Locke, Madeline E Wiechert, James M Eldred, Josh B Peck, Benjamin J Oberkfell, Justin T Lolofie, Feiyu Du, Amy E Hawkins, Michelle D O’Laughlin, Kelly E Bernard, Mark Cunningham, Glendoria Elliott, Mark D Mason, Dominic M Thompson, Jennifer L Ivanovich, Paul J Goodfellow, Charles M Perou, George M Weinstock, Rebecca Aft, Mark Watson, Timothy J Ley, Richard K Wilson, and Elaine R Mardis. Genome remodelling in a basal-like breast cancer metastasis and xenograft. Nature, 464(7291):999–1005, April 2010.

[30] Li Ding, Timothy J Ley, David E Larson, Christopher A Miller, Daniel C Koboldt, John S Welch, Julie K Ritchey, Margaret A Young, Tamara Lamprecht, Michael D McLellan, Joshua F McMichael, John W Wallis, Charles Lu, Dong Shen, Christopher C Harris, David J Dooling, Robert S Fulton, Lucinda L Fulton, Ken Chen, Heather Schmidt, Joelle Kalicki-Veizer, Vincent J Magrini, Lisa Cook, Sean D McGrath, Tammi L Vickery, Michael C Wendl, Sharon Heath, Mark A Watson, Daniel C Link, Michael H Tomasson, William D Shannon, Jacqueline E Payton, Shashikant Kulkarni, Peter Westervelt, Matthew J Walter, Timothy A Graubert, Elaine R Mardis, Richard K Wilson, and John F DiPersio. Clonal evolution in relapsed acute myeloid leukaemia revealed by whole-genome sequencing. Nature, 481(7382):506–10, January 2012.

[31] Li Ding, Michael C. Wendl, Joshua F. McMichael, and Benjamin J. Raphael. Expanding the computational toolbox for mining cancer genomes. Nat. Rev. Genet., 15(8):556–570, July 2014.

[32] Matthew J Ellis, Li Ding, Dong Shen, Jingqin Luo, Vera J Suman, John W Wallis, Brian A Van Tine, Jeremy Hoog, Reece J Goiffon, Theodore C Goldstein, Sam Ng, Li Lin, Robert Crowder, Jacqueline Snider, Karla Ballman, Jason Weber, Ken Chen, Daniel C Koboldt, Cyriac Kandoth, William S Schierding, Joshua F McMichael, Christopher A Miller, Charles Lu, Christopher C Harris, Michael D McLellan, Michael C Wendl, Katherine DeSchryver, D Craig Allred, Laura Esserman, Gary Unzeitig, Julie Margenthaler, G V Babiera, P Kelly Marcom, J M Guenther, Marilyn Leitch, Kelly Hunt, John Olson, Yu Tao, Christopher A Maher, Lucinda L Fulton, Robert S Fulton, Michelle Harrison, Ben Oberkfell, Feiyu Du, Ryan Demeter, Tammi L Vickery, Adnan Elhammali, Helen Piwnica-Worms, Sandra McDonald, Mark Watson, David J Dooling, David Ota, Li-Wei Chang, Ron Bose, Timothy J Ley, David Piwnica-Worms, Joshua M Stuart, Richard K Wilson, and Elaine R Mardis. Whole-genome analysis informs breast cancer response to aromatase inhibition. Nature, 486(7403):353–60, June 2012.

[33] Evangelos Evangelou and John P A Ioannidis. Meta-analysis methods for genome-wide association studies and beyond. Nat. Rev. Genet., 14(6):379–89, June 2013.

[34] Guy Haskin Fernald, Emidio Capriotti, Roxana Daneshjou, Konrad J Karczewski, and Russ B Altman. Bioinformatics challenges for personalized medicine. Bioinformatics, 27(13):1741–8, July 2011.

[35] Lars Feuk, Andrew R Carson, and Stephen W Scherer. Structural variation in the human genome. Nat. Rev. Genet., 7(2):85–97, February 2006.

[36] Andre Franke, Dermot P B McGovern, Jeffrey C Barrett, Kai Wang, Graham L Radford-Smith, Tariq Ahmad, Charlie W Lees, Tobias Balschun, James Lee, Rebecca Roberts, Carl A Anderson, Joshua C Bis, Suzanne Bumpstead, David Ellinghaus, Eleonora M Festen, Michel Georges, Todd Green, Talin Haritunians, Luke Jostins, Anna Latiano, Christopher G Mathew, Grant W Montgomery, Natalie J Prescott, Soumya Raychaudhuri, Jerome I Rotter, Philip Schumm, Yashoda Sharma, Lisa A Simms, Kent D Taylor, David Whiteman, Cisca Wijmenga, Robert N Baldassano, Murray Barclay, Theodore M Bayless, Stephan Brand, Carsten Büning, Albert Cohen, Jean-Frederick Colombel, Mario Cottone, Laura Stronati, Ted Denson, Martine De Vos, Renata D’Inca, Marla Dubinsky, Cathryn Edwards, Tim Florin, Denis Franchimont, Richard Gearry, Jürgen Glas, Andre Van Gossum, Stephen L Guthery, Jonas Halfvarson, Hein W Verspaget, Jean-Pierre Hugot, Amir Karban, Debby Laukens, Ian Lawrance, Marc Lemann, Arie Levine, Cecile Libioulle, Edouard Louis, Craig Mowat, William Newman, Julián Panés, Anne Phillips, Deborah D Proctor, Miguel Regueiro, Richard Russell, Paul Rutgeerts, Jeremy Sanderson, Miquel Sans, Frank Seibold, A Hillary Steinhart, Pieter C F Stokkers, Leif Torkvist, Gerd Kullak-Ublick, David Wilson, Thomas Walters, Stephan R Targan, Steven R Brant, John D Rioux, Mauro D’Amato, Rinse K Weersma, Subra Kugathasan, Anne M Griffiths, John C Mansfield, Severine Vermeire, Richard H Duerr, Mark S Silverberg, Jack Satsangi, Stefan Schreiber, Judy H Cho, Vito Annese, Hakon Hakonarson, Mark J Daly, and Miles Parkes. Genome-wide meta-analysis increases to 71 the number of confirmed Crohn’s disease susceptibility loci. Nat. Genet., 42(12):1118–25, December 2010.

[37] Pauline A Fujita, Brooke Rhead, Ann S Zweig, Angie S Hinrichs, Donna Karolchik, Melissa S Cline, Mary Goldman, Galt P Barber, Hiram Clawson, Antonio Coelho, Mark Diekhans, Timothy R Dreszer, Belinda M Giardine, Rachel A Harte, Jennifer Hillman-Jackson, Fan Hsu, Vanessa Kirkup, Robert M Kuhn, Katrina Learned, Chin H Li, Laurence R Meyer, Andy Pohl, Brian J Raney, Kate R Rosenbloom, Kayla E Smith, , and W James Kent. The UCSC Genome Browser database: update 2011. Nucleic Acids Res., 39(Database issue):D876–82, January 2011.

[38] Marco Gerlinger, Andrew J Rowan, Stuart Horswell, James Larkin, David Endesfelder, Eva Gronroos, Pierre Martinez, Nicholas Matthews, Aengus Stewart, Patrick Tarpey, Ignacio Varela, Benjamin Phillimore, Sharmin Begum, Neil Q McDonald, Adam Butler, David Jones, Keiran Raine, Calli Latimer, Claudio R Santos, Mahrokh Nohadani, Aron C Eklund, Bradley Spencer-Dene, Graham Clark, Lisa Pickering, Gordon Stamp, Martin Gore, Zoltan Szallasi, Julian Downward, P Andrew Futreal, and Charles Swanton. Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N. Engl. J. Med., 366(10):883–92, March 2012.

[39] Greg Gibson. Rare and common variants: twenty arguments. Nat. Rev. Genet., 13(2):135–145, January 2012.

[40] Mel Greaves and Carlo C Maley. Clonal evolution in cancer. Nature, 481(7381):306–13, January 2012.

[41] Christopher Greenman, Philip Stephens, Raffaella Smith, Gillian L Dalgliesh, Christopher Hunter, Graham Bignell, Helen Davies, Jon Teague, Adam Butler, Claire Stevens, Sarah Edkins, Sarah O’Meara, Imre Vastrik, Esther E Schmidt, Tim Avis, Syd Barthorpe, Gurpreet Bhamra, Gemma Buck, Bhudipa Choudhury, Jody Clements, Jennifer Cole, Ed Dicks, Simon Forbes, Kris Gray, Kelly Halliday, Rachel Harrison, Katy Hills, Jon Hinton, Andy Jenkinson, David Jones, Andy Menzies, Tatiana Mironenko, Janet Perry, Keiran Raine, Dave Richardson, Rebecca Shepherd, Alexandra Small, Calli Tofts, Jennifer Varian, Tony Webb, Sofie West, Sara Widaa, Andy Yates, Daniel P Cahill, David N Louis, Peter Goldstraw, Andrew G Nicholson, Francis Brasseur, Leendert Looijenga, Barbara L Weber, Yoke-Eng Chiew, Anna DeFazio, Mel F Greaves, Anthony R Green, Peter Campbell, , Douglas F Easton, Georgia Chenevix-Trench, Min-Han Tan, Sok Kean Khoo, Bin Tean Teh, Siu Tsan Yuen, Suet Yi Leung, Richard Wooster, P Andrew Futreal, and Michael R Stratton. Patterns of somatic mutation in human cancer genomes. Nature, 446(7132):153–8, March 2007.

[42] Alexander Gusev, Gaurav Bhatia, Noah Zaitlen, Bjarni J Vilhjalmsson, Dorothée Diogo, Eli A Stahl, Peter K Gregersen, Jane Worthington, Lars Klareskog, Soumya Raychaudhuri, Robert M Plenge, Bogdan Pasaniuc, and Alkes L Price. Quantifying missing heritability at known GWAS loci. PLoS Genet., 9(12):e1003993, December 2013.

[43] Arief Gusnanto, Henry M Wood, Yudi Pawitan, Pamela Rabbitts, and Stefano Berri. Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next-generation sequence data. Bioinformatics, 28(1):40–7, January 2012.

[44] Buhm Han, Eun Yong Kang, Soumya Raychaudhuri, Paul I W de Bakker, and . Fast pairwise IBD association testing in genome-wide association studies. Bioinformatics, 30(2):206–13, January 2014.

[45] Douglas Hanahan and RA Weinberg. The hallmarks of cancer. Cell, 100:57–70, 2000.

[46] Douglas Hanahan and Robert A Weinberg. Hallmarks of cancer: the next generation. Cell, 144(5):646–74, March 2011.

[47] Joel N Hirschhorn and Mark J Daly. Genome-wide association studies for common diseases and complex traits. Nat. Rev. Genet., 6(2):95–108, February 2005.

[48] Jörg D Hoheisel. Microarray technology: beyond transcript profiling and genotype analysis. Nat. Rev. Genet., 7(3):200–10, March 2006.

[49] Da Wei Huang, Brad T Sherman, and Richard A Lempicki. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res., 37(1):1–13, January 2009.

[50] Lin Huang, Sivan Bercovici, Jesse M Rodriguez, and Serafim Batzoglou. An Effective Filter for IBD Detection in Large Data Sets. PLoS One, 9(3):e92713, January 2014.

[51] Dick G Hwang and Phil Green. Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. Proc. Natl. Acad. Sci. U. S. A., 101(39):13994–4001, September 2004.

[52] H Johansson, M Isaksson, E Falk Sörqvist, F Roos, J Stenberg, T Sjöblom, J Botling, P Micke, K Edlund, S Fredriksson, H Göransson Kultima, Olle Ericsson, and Mats Nilsson. Targeted resequencing of candidate genes using selector probes. Nucleic Acids Res., 39(2), November 2010.

[53] Luke Jostins, Stephan Ripke, Rinse K Weersma, Richard H Duerr, Dermot P McGovern, Ken Y Hui, James C Lee, L Philip Schumm, Yashoda Sharma, Carl A Anderson, Jonah Essers, Mitja Mitrovic, Kaida Ning, Isabelle Cleynen, Emilie Theatre, Sarah L Spain, Soumya Raychaudhuri, Philippe Goyette, Zhi Wei, Clara Abraham, Jean-Paul Achkar, Tariq Ahmad, Leila Amininejad, Ashwin N Ananthakrishnan, Vibeke Andersen, Jane M Andrews, Leonard Baidoo, Tobias Balschun, Peter A Bampton, Alain Bitton, Gabrielle Boucher, Stephan Brand, Carsten Büning, Ariella Cohain, Sven Cichon, Mauro D’Amato, Dirk De Jong, Kathy L Devaney, Marla Dubinsky, Cathryn Edwards, David Ellinghaus, Lynnette R Ferguson, Denis Franchimont, Karin Fransen, Richard Gearry, Michel Georges, Christian Gieger, Jürgen Glas, Talin Haritunians, Ailsa Hart, Chris Hawkey, Matija Hedl, Xinli Hu, Tom H Karlsen, Limas Kupcinskas, Subra Kugathasan, Anna Latiano, Debby Laukens, Ian C Lawrance, Charlie W Lees, Edouard Louis, Gillian Mahy, John Mansfield, Angharad R Morgan, Craig Mowat, William Newman, Orazio Palmieri, Cyriel Y Ponsioen, Uros Potocnik, Natalie J Prescott, Miguel Regueiro, Jerome I Rotter, Richard K Russell, Jeremy D Sanderson, Miquel Sans, Jack Satsangi, Stefan Schreiber, Lisa A Simms, Jurgita Sventoraityte, Stephan R Targan, Kent D Taylor, Mark Tremelling, Hein W Verspaget, Martine De Vos, Cisca Wijmenga, David C Wilson, Juliane Winkelmann, Ramnik J Xavier, Sebastian Zeissig, Bin Zhang, Clarence K Zhang, Hongyu Zhao, Mark S Silverberg, Vito Annese, Hakon Hakonarson, Steven R Brant, Graham Radford-Smith, Christopher G Mathew, John D Rioux, Eric E Schadt, Mark J Daly, Andre Franke, Miles Parkes, Severine Vermeire, Jeffrey C Barrett, and Judy H Cho. Host-microbe interactions have shaped the genetic architecture of inflammatory bowel disease. Nature, 491(7422):119–24, November 2012.

[54] W James Kent, Charles W Sugnet, Terrence S Furey, Krishna M Roskin, Tom H Pringle, Alan M Zahler, and David Haussler. The human genome browser at UCSC. Genome Res., 12(6):996–1006, June 2002.

[55] Daniel C Koboldt. Challenges of sequencing human genomes. Brief. Bioinform., 11(5):484–498, June 2010.

[56] Augustine Kong, Michael L Frigge, Gisli Masson, Soren Besenbacher, Patrick Sulem, Gisli Magnusson, Sigurjon A Gudjonsson, Asgeir Sigurdsson, Aslaug Jonasdottir, Adalbjorg Jonasdottir, Wendy S W Wong, Gunnar Sigurdsson, G Bragi Walters, Stacy Steinberg, Hannes Helgason, Gudmar Thorleifsson, Daniel F Gudbjartsson, Agnar Helgason, Olafur Th Magnusson, Unnur Thorsteinsdottir, and Kari Stefansson. Rate of de novo mutations and the importance of father’s age to disease risk. Nature, 488(7412):471–5, August 2012.

[57] Eunice L Kwak, Yung-Jue Bang, D Ross Camidge, Alice T Shaw, Benjamin Solomon, Robert G Maki, Sai-Hong I Ou, Bruce J Dezube, Pasi A Jänne, Daniel B Costa, Marileila Varella-Garcia, Woo-Ho Kim, Thomas J Lynch, Panos Fidias, Hannah Stubbs, Jeffrey A Engelman, Lecia V Sequist, WeiWei Tan, Leena Gandhi, Mari Mino-Kenudson, Greg C Wei, S Martin Shreeve, Mark J Ratain, Jeffrey Settleman, James G Christensen, Daniel A Haber, Keith Wilner, Ravi Salgia, Geoffrey I Shapiro, Jeffrey W Clark, and A John Iafrate. Anaplastic lymphoma kinase inhibition in non-small-cell lung cancer. N. Engl. J. Med., 363(18):1693–703, October 2010.

[58] Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol., 10(3):R25, January 2009.

[59] Rebecca J Leary, Jimmy C Lin, Jordan Cummins, Simina Boca, Laura D Wood, D Williams Parsons, Siân Jones, Tobias Sjöblom, Ben-Ho Park, Ramon Parsons, Joseph Willis, Dawn Dawson, James K V Willson, Tatiana Nikolskaya, Yuri Nikolsky, Levy Kopelovich, Nick Papadopoulos, Len A Pennacchio, Tian-Li Wang, Sanford D Markowitz, Giovanni Parmigiani, Kenneth W Kinzler, Bert Vogelstein, and Victor E Velculescu. Integrated analysis of homozygous deletions, focal amplifications, and sequence alterations in breast and colorectal cancers. Proc. Natl. Acad. Sci. U. S. A., 105(42):16224–9, October 2008.

[60] Timothy J Ley, Elaine R Mardis, Li Ding, Bob Fulton, Michael D McLellan, Ken Chen, David Dooling, Brian H Dunford-Shore, Sean McGrath, Matthew Hickenbotham, Lisa Cook, Rachel Abbott, David E Larson, Dan C Koboldt, Craig Pohl, Scott Smith, Amy Hawkins, Scott Abbott, Devin Locke, Ladeana W Hillier, Tracie Miner, Lucinda Fulton, Vincent Magrini, Todd Wylie, Jarret Glasscock, Joshua Conyers, Nathan Sander, Xiaoqi Shi, John R Osborne, Patrick Minx, David Gordon, Asif Chinwalla, Yu Zhao, Rhonda E Ries, Jacqueline E Payton, Peter Westervelt, Michael H Tomasson, Mark Watson, Jack Baty, Jennifer Ivanovich, Sharon Heath, William D Shannon, Rakesh Nagarajan, Matthew J Walter, Daniel C Link, Timothy A Graubert, John F DiPersio, and Richard K Wilson. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature, 456(7218):66–72, November 2008.

[61] Heng Li and Richard Durbin. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26(5):589–95, March 2010.

[62] Rui Lin, Jac Charlesworth, Jim Stankovich, Victoria M Perreau, Matthew A Brown, and Bruce V Taylor. Identity-by-descent mapping to detect rare variants conferring susceptibility to multiple sclerosis. PLoS One, 8(3):e56379, January 2013.

[63] Pengfei Liu, Ayelet Erez, Sandesh C Sreenath Nagamani, Shweta U Dhar, Katarzyna E Kołodziejska, Avinash V Dharmadhikari, M Lance Cooper, Joanna Wiszniewska, Feng Zhang, Marjorie A Withers, Carlos A Bacino, Luis Daniel Campos-Acevedo, Mauricio R Delgado, Debra Freedenberg, Adolfo Garnica, Theresa A Grebe, Dolores Hernández-Almaguer, LaDonna Immken, Seema R Lalani, Scott D McLean, Hope Northrup, Fernando Scaglia, Lane Strathearn, Pamela Trapane, Sung-Hae L Kang, Ankita Patel, Sau Wai Cheung, P J Hastings, Paweł Stankiewicz, James R Lupski, and Weimin Bi. Chromosome catastrophes involve replication mechanisms generating complex genomic rearrangements. Cell, 146(6):889–903, September 2011.

[64] Maria A Lopez-Garcia, Felipe C Geyer, Magali Lacroix-Triki, Caterina Marchió, and Jorge S Reis-Filho. Breast cancer precursors revisited: molecular features and progression pathways. Histopathology, 57(2):171–92, August 2010.

[65] C Macilwain. World leaders heap praise on human genome landmark. Nature, 405(6790):983–4, June 2000.

[66] Christopher A Maher and Richard K Wilson. Chromothripsis and human disease: piecing together the shattering process. Cell, 148(1-2):29–32, January 2012.

[67] Lira Mamanova, Alison J Coffey, Carol E Scott, Iwanka Kozarewa, Emily H Turner, Akash Kumar, Eleanor Howard, Jay Shendure, and Daniel J Turner. Target-enrichment strategies for next-generation sequencing. Nat. Methods, 7(2):111–8, February 2010.

[68] Teri A Manolio, Francis S Collins, Nancy J Cox, David B Goldstein, Lucia A Hindorff, David J Hunter, Mark I McCarthy, Erin M Ramos, Lon R Cardon, Aravinda Chakravarti, Judy H Cho, Alan E Guttmacher, Augustine Kong, Leonid Kruglyak, Elaine Mardis, Charles N Rotimi, Montgomery Slatkin, David Valle, Alice S Whittemore, Michael Boehnke, Andrew G Clark, Evan E Eichler, Greg Gibson, Jonathan L Haines, Trudy F C Mackay, Steven A McCarroll, and Peter M Visscher. Finding the missing heritability of complex diseases. Nature, 461(7265):747–53, October 2009.

[69] Elaine R Mardis. Genome sequencing and cancer. Curr. Opin. Genet. Dev., 22(3):245–50, June 2012.

[70] Mark I McCarthy, Gonçalo R Abecasis, Lon R Cardon, David B Goldstein, Julian Little, John P A Ioannidis, and Joel N Hirschhorn. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat. Rev. Genet., 9(5):356–69, May 2008.

[71] A. H. McKenna, M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky, K. Garimella, D. Altshuler, S. Gabriel, M. Daly, and M. DePristo. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res., July 2010.

[72] Paul Medvedev, Monica Stanciu, and Michael Brudno. Computational methods for discovering structural variation with next-generation sequencing. Nat. Methods, 6(11):S13–S20, 2009.

[73] Michael L Metzker. Sequencing technologies - the next generation. Nat. Rev. Genet., 11(1):31–46, 2010.

[74] Matthew Meyerson, Stacey Gabriel, and Gad Getz. Advances in understanding cancer genomes through second-generation sequencing. Nat. Rev. Genet., 11(10):685–696, October 2010.

[75] Matthew Meyerson and David Pellman. Cancer genomes evolve by pulverizing single chromosomes. Cell, 144(1):9–10, January 2011.

[76] Ryan E Mills, Christopher T Luttig, Christine E Larkins, Adam Beauchamp, Circe Tsui, W Stephen Pittard, and Scott E Devine. An initial map of insertion and deletion (INDEL) variation in the human genome. Genome Res., 16(9):1182–90, 2006.

[77] Georges Natsoulis, John M Bell, Hua Xu, Jason D Buenrostro, Heather Ordonez, Susan Grimes, Daniel Newburger, Michael Jensen, Jacob M Zahn, Nancy Zhang, and Hanlee P Ji. A flexible approach for highly multiplexed candidate gene targeted resequencing. PLoS One, 6(6):e21088, January 2011.

[78] Nicholas Navin, Jude Kendall, Jennifer Troge, Peter Andrews, Linda Rodgers, Jeanne McIndoo, Kerry Cook, Asya Stepansky, Dan Levy, Diane Esposito, Lakshmi Muthuswamy, Alex Krasnitz, W. Richard McCombie, James Hicks, and Michael Wigler. Tumour evolution inferred by single-cell sequencing. Nature, pages 1–6, March 2011.

[79] Daniel E Newburger, Dorna Kashef-Haghighi, Ziming Weng, Raheleh Salari, Robert T Sweeney, Alayne L Brunner, Shirley X Zhu, Xiangqian Guo, Sushama Varma, Megan L Troxell, Robert B West, Serafim Batzoglou, and Arend Sidow. Genome evolution during progression to breast cancer. Genome Res., 23(7):1097–1108, July 2013.

[80] Daniel E Newburger, Georges Natsoulis, Sue Grimes, John M Bell, Ronald W Davis, Serafim Batzoglou, and Hanlee P Ji. The Human OligoGenome Resource: a database of oligonucleotide capture probes for resequencing target regions across the human genome. Nucleic Acids Res., 40(Database issue):D1137–43, January 2012.

[81] NHGRI. Human Genome Project Completion: Frequently Asked Questions.

[82] Serena Nik-Zainal, Ludmil B Alexandrov, David C Wedge, Peter Van Loo, Christopher D Greenman, Keiran Raine, David Jones, Jonathan Hinton, John Marshall, Lucy A Stebbings, Andrew Menzies, Sancha Martin, Kenric Leung, Lina Chen, Catherine Leroy, Manasa Ramakrishna, Richard Rance, King Wai Lau, Laura J Mudie, Ignacio Varela, David J McBride, Graham R Bignell, Susanna L Cooke, Adam Shlien, John Gamble, Ian Whitmore, Mark Maddison, Patrick S Tarpey, Helen R Davies, Elli Papaemmanuil, Philip J Stephens, Stuart McLaren, Adam P Butler, Jon W Teague, Göran Jönsson, Judy E Garber, Daniel Silver, Penelope Miron, Aquila Fatima, Sandrine Boyault, Anita Langerød, Andrew Tutt, John W M Martens, Samuel A J R Aparicio, Åke Borg, Anne Vincent Salomon, Gilles Thomas, Anne-Lise Børresen-Dale, Andrea L Richardson, Michael S Neuberger, P Andrew Futreal, Peter J Campbell, and Michael R Stratton. Mutational processes molding the genomes of 21 breast cancers. Cell, 149(5):979–93, May 2012.

[83] Serena Nik-Zainal, Peter Van Loo, David C Wedge, Ludmil B Alexandrov, Christopher D Greenman, King Wai Lau, Keiran Raine, David Jones, John Marshall, Manasa Ramakrishna, Adam Shlien, Susanna L Cooke, Jonathan Hinton, Andrew Menzies, Lucy A Stebbings, Catherine Leroy, Mingming Jia, Richard Rance, Laura J Mudie, Stephen J Gamble, Philip J Stephens, Stuart McLaren, Patrick S Tarpey, Elli Papaemmanuil, Helen R Davies, Ignacio Varela, David J McBride, Graham R Bignell, Kenric Leung, Adam P Butler, Jon W Teague, Sancha Martin, Goran Jönsson, Odette Mariani, Sandrine Boyault, Penelope Miron, Aquila Fatima, Anita Langerød, Samuel A J R Aparicio, Andrew Tutt, Anieta M Sieuwerts, Åke Borg, Gilles Thomas, Anne Vincent Salomon, Andrea L Richardson, Anne-Lise Børresen-Dale, P Andrew Futreal, Michael R Stratton, and Peter J Campbell. The life history of 21 breast cancers. Cell, 149(5):994–1007, May 2012.

[84] Chirag J Patel, Jayanta Bhattacharya, and Atul J Butte. An Environment-Wide Association Study (EWAS) on type 2 diabetes mellitus. PLoS One, 5(5):e10746, January 2010.

[85] Chirag J Patel, David H Rehkopf, John T Leppert, Walter M Bortz, Mark R Cullen, Glenn M Chertow, and John P A Ioannidis. Systematic evaluation of environmental and behavioural factors associated with all-cause mortality in the United States National Health and Nutrition Examination Survey. Int. J. Epidemiol., 42(6):1795–810, December 2013.

[86] Kimberly Pelak, Kevin V. Shianna, Dongliang Ge, Jessica M. Maia, Mingfu Zhu, Jason P. Smith, Elizabeth T. Cirulli, Jacques Fellay, Samuel P. Dickson, Curtis E. Gumbs, Erin L. Heinzen, Anna C. Need, Elizabeth K. Ruzzo, Abanish Singh, C. Ryan Campbell, Linda K. Hong, Katharina A. Lornsen, Alexander M. McKenzie, Nara L. M. Sobreira, Julie E. Hoover-Fong, Joshua D. Milner, Ruth Ottman, Barton F. Haynes, James J. Goedert, and David B. Goldstein. The Characterization of Twenty Sequenced Human Genomes. PLoS Genet., 6(9):e1001111, September 2010.

[87] Erin D Pleasance, R Keira Cheetham, Philip J Stephens, David J McBride, Sean J Humphray, Chris D Greenman, Ignacio Varela, Meng-Lay Lin, Gonzalo R Ordóñez, Graham R Bignell, Kai Ye, Julie Alipaz, Markus J Bauer, David Beare, Adam Butler, Richard J Carter, Lina Chen, Anthony J Cox, Sarah Edkins, Paula I Kokko-Gonzales, Niall A Gormley, Russell J Grocock, Christian D Haudenschild, Matthew M Hims, Terena James, Mingming Jia, Zoya Kingsbury, Catherine Leroy, John Marshall, Andrew Menzies, Laura J Mudie, Zemin Ning, Tom Royce, Ole B Schulz-Trieglaff, Anastassia Spiridou, Lucy A Stebbings, Lukasz Szajkowski, Jon Teague, David Williamson, Lynda Chin, Mark T Ross, Peter J Campbell, David R Bentley, P Andrew Futreal, and Michael R Stratton. A comprehensive catalogue of somatic mutations from a human cancer genome. Nature, 463(7278):191–6, January 2010.

[88] Erin D Pleasance, Philip J Stephens, Sarah O’Meara, David J McBride, Alison Meynert, David Jones, Meng-Lay Lin, David Beare, King Wai Lau, Chris Greenman, Ignacio Varela, Serena Nik-Zainal, Helen R Davies, Gonzalo R Ordoñez, Laura J Mudie, Calli Latimer, Sarah Edkins, Lucy Stebbings, Lina Chen, Mingming Jia, Catherine Leroy, John Marshall, Andrew Menzies, Adam Butler, Jon W Teague, Jonathon Mangion, Yongming A Sun, Stephen F McLaughlin, Heather E Peckham, Eric F Tsung, Gina L Costa, Clarence C Lee, John D Minna, Adi Gazdar, Ewan Birney, Michael D Rhodes, Kevin J McKernan, Michael R Stratton, P Andrew Futreal, and Peter J Campbell. A small-cell lung cancer genome with complex signatures of tobacco exposure. Nature, 463(7278):184–90, January 2010.

[89] Kim D Pruitt, Jennifer Harrow, Rachel A Harte, Craig Wallin, Mark Diekhans, Donna R Maglott, Steve Searle, Catherine M Farrell, Jane E Loveland, Barbara J Ruef, Elizabeth Hart, Marie-Marthe Suner, Melissa J Landrum, Bronwen Aken, Sarah Ayling, Robert Baertsch, Julio Fernandez-Banet, Joshua L Cherry, Val Curwen, Michael Dicuccio, Manolis Kellis, Jennifer Lee, Michael F Lin, Michael Schuster, Andrew Shkeda, Clara Amid, Garth Brown, Oksana Dukhanina, Adam Frankish, Jennifer Hart, Bonnie L Maidak, Jonathan Mudge, Michael R Murphy, Terence Murphy, Jeena Rajan, Bhanu Rajput, Lillian D Riddick, Catherine Snow, Charles Steward, David Webb, Janet A Weber, Laurens Wilming, Wenyu Wu, Ewan Birney, David Haussler, Tim Hubbard, James Ostell, Richard Durbin, and David Lipman. The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes. Genome Res., 19(7):1316–23, July 2009.

[90] Shaun Purcell, Benjamin Neale, Kathe Todd-Brown, Lori Thomas, Manuel A R Ferreira, David Bender, Julian Maller, Pamela Sklar, Paul I W de Bakker, Mark J Daly, and Pak C Sham. PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. Am. J. Hum. Genet., 81(3):559–575, 2007.

[91] Dmitry Pushkarev, Norma F Neff, and Stephen R Quake. Single-molecule sequencing of an individual human genome. Nat. Biotechnol., 27(9):847–52, September 2009.

[92] Brian J Raney, Melissa S Cline, Kate R Rosenbloom, Timothy R Dreszer, Katrina Learned, Galt P Barber, Laurence R Meyer, Cricket A Sloan, Venkat S Malladi, Krishna M Roskin, Bernard B Suh, Angie S Hinrichs, Hiram Clawson, Ann S Zweig, Vanessa Kirkup, Pauline A Fujita, Brooke Rhead, Kayla E Smith, Andy Pohl, Robert M Kuhn, Donna Karolchik, David Haussler, and W James Kent. ENCODE whole-genome data in the UCSC genome browser (2011 update). Nucleic Acids Res., 39(Database issue):D871–5, January 2011.

[93] BJ Raphael, JR Dobson, L Oesper, and F Vandin. Identifying driver mutations in sequenced cancer genomes: computational approaches to enable precision medicine. Genome Med, 6(1):5, January 2014.

[94] JM Rodriguez, Serafim Batzoglou, and Sivan Bercovici. An accurate method for inferring relatedness in large datasets of unphased genotypes via an embedded likelihood-ratio test. Res. Comput. . . . , pages 212–229, 2013.

[95] Yardena Samuels, Zhenghe Wang, Alberto Bardelli, Natalie Silliman, Janine Ptak, Steve Szabo, Hai Yan, Adi Gazdar, Steven M Powell, Gregory J Riggins, James K V Willson, Sanford Markowitz, Kenneth W Kinzler, Bert Vogelstein, and Victor E Velculescu. High frequency of mutations of the PIK3CA gene in human cancers. Science, 304(5670):554, April 2004.

[96] Lecia V Sequist, Scott Gettinger, Neil N Senzer, Renato G Martins, Pasi A Jänne, Rogerio Lilenbaum, Jhanelle E Gray, A John Iafrate, Ryohei Katayama, Nafeeza Hafeez, Jennifer Sweeney, John R Walker, Christian Fritz, Robert W Ross, David Grayzel, Jeffrey A Engelman, Darrell R Borger, Guillermo Paez, and Ronald Natale. Activity of IPI-504, a novel heat-shock protein 90 inhibitor, in patients with molecularly defined non-small-cell lung cancer. J. Clin. Oncol., 28(33):4953–60, November 2010.

[97] Sohrab P Shah, Andrew Roth, Rodrigo Goya, Arusha Oloumi, Gavin Ha, Yongjun Zhao, Gulisa Turashvili, Jiarui Ding, Kane Tse, Gholamreza Haffari, Ali Bashashati, Leah M Prentice, Jaswinder Khattra, Angela Burleigh, Damian Yap, Virginie Bernard, Andrew McPherson, Karey Shumansky, Anamaria Crisan, Ryan Giuliany, Alireza Heravi-Moussavi, Jamie Rosner, Daniel Lai, Inanc Birol, Richard Varhol, Angela Tam, Noreen Dhalla, Thomas Zeng, Kevin Ma, Simon K Chan, Malachi Griffith, Annie Moradian, S-W Grace Cheng, Gregg B Morin, Peter Watson, Karen Gelmon, Stephen Chia, Suet-Feung Chin, Christina Curtis, Oscar M Rueda, Paul D Pharoah, Sambasivarao Damaraju, John Mackey, Kelly Hoon, Timothy Harkins, Vasisht Tadigotla, Mahvash Sigaroudinia, Philippe Gascard, Thea Tlsty, Joseph F Costello, Irmtraud M Meyer, Connie J Eaves, Wyeth W Wasserman, Steven Jones, David Huntsman, Martin Hirst, Carlos Caldas, Marco A Marra, and Samuel Aparicio. The clonal and mutational evolution spectrum of primary triple-negative breast cancers. Nature, 486(7403):395–9, June 2012.

[98] SP Shah, RD Morin, Jaswinder Khattra, and Leah Prentice. Mutational evolution in a lobular breast tumour profiled at single nucleotide resolution. Nature, 461(7265):809–13, October 2009.

[99] Peter T Simpson, Theo Gale, Jorge S Reis-Filho, Chris Jones, Suzanne Parry, John P Sloane, Andrew Hanby, Sarah E Pinder, Andrew H S Lee, Steve Humphreys, Ian O Ellis, and Sunil R Lakhani. Columnar cell lesions of the breast: the missing link in breast cancer progression? A morphological and molecular analysis. Am. J. Surg. Pathol., 29(6):734–46, June 2005.

[100] Philip J Stephens, Chris D Greenman, Beiyuan Fu, Fengtang Yang, Graham R Bignell, Laura J Mudie, Erin D Pleasance, King Wai Lau, David Beare, Lucy A Stebbings, Stuart McLaren, Meng-Lay Lin, David J McBride, Ignacio Varela, Serena Nik-Zainal, Catherine Leroy, Mingming Jia, Andrew Menzies, Adam P Butler, Jon W Teague, Michael A Quail, John Burton, Harold Swerdlow, Nigel P Carter, Laura A Morsberger, Christine Iacobuzio-Donahue, George A Follows, Anthony R Green, Adrienne M Flanagan, Michael R Stratton, P Andrew Futreal, and Peter J Campbell. Massive genomic rearrangement acquired in a single catastrophic event during cancer development. Cell, 144(1):27–40, January 2011.

[101] B W Stewart and C P Wild. World Cancer Report 2014. Technical report, 2014.

[102] Michael R Stratton. Exploring the genomes of cancer cells: progress and promise. Science, 331(6024):1553–8, March 2011.

[103] Michael R Stratton, Peter J Campbell, and P Andrew Futreal. The cancer genome. Nature, 458(7239):719–24, April 2009.

[104] J Sved and A Bird. The expected equilibrium of the CpG dinucleotide in vertebrate genomes under a mutation model. Proc. Natl. Acad. Sci. U. S. A., 87(12):4692–6, June 1990.

[105] A C Syvänen. Accessing genetic variation: genotyping single nucleotide polymorphisms. Nat. Rev. Genet., 2(12):930–42, December 2001.

[106] Elizabeth A Thompson. Identity by descent: variation in meiosis, across genomes, and in populations. Genetics, 194(2):301–26, June 2013.

[107] Megan L Troxell, Alayne L Brunner, Tanaya Neff, Andrea Warrick, Carol Beadling, Kelli Montgomery, Shirley Zhu, Christopher L Corless, and Robert B West. Phosphatidylinositol-3-kinase pathway mutations are common in breast columnar cell lesions. Mod. Pathol., 25(7):930–7, July 2012.

[108] Samra Turajlic, Simon J Furney, Maryou B Lambros, Costas Mitsopoulos, Iwanka Kozarewa, Felipe C Geyer, Alan Mackay, Jarle Hakas, Marketa Zvelebil, Christopher J Lord, Alan Ashworth, Meirion Thomas, Gordon Stamp, James Larkin, Jorge S Reis-Filho, and Richard Marais. Whole genome sequencing of matched primary and metastatic acral melanomas. Genome Res., 22(2):196–207, February 2012.

[109] Emily H Turner, Sarah B Ng, Deborah A Nickerson, and Jay Shendure. Methods for genomic partitioning. Annu. Rev. Genomics Hum. Genet., 10:263–84, 2009.

[110] J C Venter, M D Adams, E W Myers, P W Li, R J Mural, G G Sutton, H O Smith, M Yandell, C A Evans, R A Holt, J D Gocayne, P Amanatides, R M Ballew, D H Huson, J R Wortman, Q Zhang, C D Kodira, X H Zheng, L Chen, M Skupski, G Subramanian, P D Thomas, J Zhang, G L Gabor Miklos, C Nelson, S Broder, A G Clark, J Nadeau, V A McKusick, N Zinder, A J Levine, R J Roberts, M Simon, C Slayman, M Hunkapiller, R Bolanos, A Delcher, I Dew, D Fasulo, M Flanigan, L Florea, A Halpern, S Hannenhalli, S Kravitz, S Levy, C Mobarry, K Reinert, K Remington, J Abu-Threideh, E Beasley, K Biddick, V Bonazzi, R Brandon, M Cargill, I Chandramouliswaran, R Charlab, K Chaturvedi, Z Deng, V Di Francesco, P Dunn, K Eilbeck, C Evangelista, A E Gabrielian, W Gan, W Ge, F Gong, Z Gu, P Guan, T J Heiman, M E Higgins, R R Ji, Z Ke, K A Ketchum, Z Lai, Y Lei, Z Li, J Li, Y Liang, X Lin, F Lu, G V Merkulov, N Milshina, H M Moore, A K Naik, V A Narayan, B Neelam, D Nusskern, D B Rusch, S Salzberg, W Shao, B Shue, J Sun, Z Wang, A Wang, X Wang, J Wang, M Wei, R Wides, C Xiao, C Yan, A Yao, J Ye, M Zhan, W Zhang, H Zhang, Q Zhao, L Zheng, F Zhong, W Zhong, S Zhu, S Zhao, D Gilbert, S Baumhueter, G Spier, C Carter, A Cravchik, T Woodage, F Ali, H An, A Awe, D Baldwin, H Baden, M Barnstead, I Barrow, K Beeson, D Busam, A Carver, A Center, M L Cheng, L Curry, S Danaher, L Davenport, R Desilets, S Dietz, K Dodson, L Doup, S Ferriera, N Garg, A Gluecksmann, B Hart, J Haynes, C Haynes, C Heiner, S Hladun, D Hostin, J Houck, T Howland, C Ibegwam, J Johnson, F Kalush, L Kline, S Koduru, A Love, F Mann, D May, S McCawley, T McIntosh, I McMullen, M Moy, L Moy, B Murphy, K Nelson, C Pfannkoch, E Pratts, V Puri, H Qureshi, M Reardon, R Rodriguez, Y H Rogers, D Romblad, B Ruhfel, R Scott, C Sitter, M Smallwood, E Stewart, R Strong, E Suh, R Thomas, N N Tint, S Tse, C Vech, G Wang, J Wetter, S Williams, M Williams, S Windsor, E Winn-Deen, K Wolfe, J Zaveri, K Zaveri, J F Abril, R Guigó, M J Campbell, K V Sjolander, B Karlak, A Kejariwal, H Mi, B Lazareva, T Hatton, A Narechania, K Diemer, A Muruganujan, N Guo, S Sato, V Bafna, S Istrail, R Lippert, R Schwartz, B Walenz, S Yooseph, D Allen, A Basu, J Baxendale, L Blick, M Caminha, J Carnes-Stine, P Caulk, Y H Chiang, M Coyne, C Dahlke, A Mays, M Dombroski, M Donnelly, D Ely, S Esparham, C Fosler, H Gire, S Glanowski, K Glasser, A Glodek, M Gorokhov, K Graham, B Gropman, M Harris, J Heil, S Henderson, J Hoover, D Jennings, C Jordan, J Jordan, J Kasha, L Kagan, C Kraft, A Levitsky, M Lewis, X Liu, J Lopez, D Ma, W Majoros, J McDaniel, S Murphy, M Newman, T Nguyen, N Nguyen, M Nodell, S Pan, J Peck, M Peterson, W Rowe, R Sanders, J Scott, M Simpson, T Smith, A Sprague, T Stockwell, R Turner, E Venter, M Wang, M Wen, D Wu, M Wu, A Xia, A Zandieh, and X Zhu. The sequence of the human genome. Science, 291(5507):1304–51, February 2001.

[111] Peter M Visscher, Matthew A Brown, Mark I McCarthy, and Jian Yang. Five years of GWAS discovery. Am. J. Hum. Genet., 90(1):7–24, January 2012.

[112] Matthew J Walter, Dong Shen, Li Ding, Jin Shao, Daniel C Koboldt, Ken Chen, David E Larson, Michael D McLellan, David Dooling, Rachel Abbott, Robert Fulton, Vincent Magrini, Heather Schmidt, Joelle Kalicki-Veizer, Michelle O’Laughlin, Xian Fan, Marcus Grillot, Sarah Witowski, Sharon Heath, John L Frater, William Eades, Michael Tomasson, Peter Westervelt, John F DiPersio, Daniel C Link, Elaine R Mardis, Timothy J Ley, Richard K Wilson, and Timothy A Graubert. Clonal architecture of secondary acute myeloid leukemia. N. Engl. J. Med., 366(12):1090–8, March 2012.

[113] Danielle Welter, Jacqueline MacArthur, Joannella Morales, Tony Burdett, Peggy Hall, Heather Junkins, Alan Klemm, Paul Flicek, Teri Manolio, Lucia Hindorff, and Helen Parkinson. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res., 42(Database issue):D1001–6, January 2014.

[114] Helga Westerlind and Kerstin Imrell. Identity-by-descent mapping in a Scandinavian multiple sclerosis cohort. Eur. J. . . . , (January):1–5, August 2014.

[115] Amy L Williams, Nick Patterson, Joseph Glessner, Hakon Hakonarson, and David Reich. Phasing of many thousands of genotyped samples. Am. J. Hum. Genet., 91(2):238–51, August 2012.

[116] Xiaochong Wu, Paul A Northcott, Adrian Dubuc, Adam J Dupuy, David J H Shih, Hendrik Witt, Sidney Croul, Eric Bouffet, Daniel W Fults, Charles G Eberhart, Livia Garzia, Timothy Van Meter, David Zagzag, Nada Jabado, Jeremy Schwartzentruber, Jacek Majewski, Todd E Scheetz, Stefan M Pfister, Andrey Korshunov, Xiao-Nan Li, Stephen W Scherer, Yoon-Jae Cho, Keiko Akagi, Tobey J MacDonald, Jan Koster, Martin G McCabe, Aaron L Sarver, V Peter Collins, William A Weiss, David A Largaespada, Lara S Collier, and Michael D Taylor. Clonal selection drives genetic divergence of metastatic medulloblastoma. Nature, 482(7386):529–33, February 2012.

[117] Shinichi Yachida, Siân Jones, Ivana Bozic, Tibor Antal, Rebecca Leary, Baojin Fu, Mihoko Kamiyama, Ralph H Hruban, James R Eshleman, Martin A Nowak, Victor E Velculescu, Kenneth W Kinzler, Bert Vogelstein, and Christine A Iacobuzio-Donahue. Distant metastasis occurs late during the genetic evolution of pancreatic cancer. Nature, 467(7319):1114–7, October 2010.

[118] Or Zuk, Eliana Hechter, Shamil R Sunyaev, and Eric S Lander. The mystery of missing heritability: Genetic interactions create phantom heritability. Proc. Natl. Acad. Sci. U. S. A., 109(4):1193–8, January 2012.

[119] Or Zuk, Stephen F Schaffner, Kaitlin Samocha, Ron Do, Eliana Hechter, Sekar Kathiresan, Mark J Daly, Benjamin M Neale, Shamil R Sunyaev, and Eric S Lander. Searching for missing heritability: designing rare variant association studies. Proc. Natl. Acad. Sci. U. S. A., 111(4):E455–64, January 2014.