UC San Diego Electronic Theses and Dissertations

Title: Computational methods for analyzing and detecting genomic structural variation: applications to cancer

Permalink: https://escholarship.org/uc/item/9x56z8qw

Author: Bashir, Ali

Publication Date: 2009

Supplemental Material: https://escholarship.org/uc/item/9x56z8qw#supplemental

Peer reviewed | Thesis/dissertation

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Computational Methods for Analyzing and Detecting Genomic Structural Variation: Applications to Cancer

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy

in

Bioinformatics

by

Ali Bashir

Committee in charge:

Professor Vineet Bafna, Chair
Professor Trey Ideker
Professor Pavel Pevzner
Professor Benjamin Raphael
Professor Bing Ren
Professor Nicholas Schork

2009

Copyright
Ali Bashir, 2009
All rights reserved.

The dissertation of Ali Bashir is approved, and it is acceptable in quality and form for publication on microfilm and electronically:

Chair

University of California, San Diego

2009

DEDICATION

To my girlfriend, for her constant encouragement; my parents, for their constant supply of food, clean laundry, and car advice; and my niece and nephews for being a constant source of distraction.

TABLE OF CONTENTS

Signature Page
Dedication
Table of Contents
List of Figures
List of Tables
Acknowledgements
Vita
Abstract of the Dissertation

Chapter 1  Introduction
  1.1 Why “large” events?
  1.2 How do you observe structural variants?
    1.2.1 Primer Approximation Multiplex PCR (PAMP)
    1.2.2 High-throughput Sequencing
  1.3 In which contexts should we examine structural variants?
    1.3.1 Cancer applications
    1.3.2 Other applications

Chapter 2  Optimization of primer design for the detection of genomic lesions in cancer
  2.1 Introduction
  2.2 Optimizing primer design
    2.2.1 PAMP design
    2.2.2 Extensions
  2.3 Complexity of PAMP design
  2.4 Algorithms for PAMP design
  2.5 Results
    2.5.1 Experimental validation
    2.5.2 Computational modeling
    2.5.3 Convergence and running time
  2.6 Discussion
  2.7 Acknowledgements

Chapter 3  Two-Sided PAMP and Alternating Multiplexing
  3.1 Introduction
  3.2 Methods: A multiplexed approach to PAMP design
    3.2.1 Amplification ≠ Detection
    3.2.2 Simulated Annealing for Optimization
  3.3 Results
    3.3.1 Simulations
    3.3.2 Left vs. Right Breakpoint Detection
    3.3.3 Experimental Confirmation of CDKN2A
    3.3.4 Running time
  3.4 Discussion
  3.5 Acknowledgements

Chapter 4  Evaluation of paired-end sequencing strategies: applications to gene fusion
  4.1 Introduction
  4.2 Results
    4.2.1 Computing probability of a fusion gene
    4.2.2 Fusion Predictions in Breast Cancer
    4.2.3 Detection and Localization of Genome Rearrangements
    4.2.4 Comparison of Sequencing Strategies
    4.2.5 Lengths of Fusion Genes
    4.2.6 Effects of Errors
  4.3 Discussion
    4.3.1 Defining the Genomic Features of Interest
    4.3.2 Choice of Sequencing Parameters
    4.3.3 Organization of Cancer Genomes
    4.3.4 Extensions and Applications
  4.4 Methods
    4.4.1 Mapping and clustering of end sequences
    4.4.2 Validating fusion predictions by sequencing
    4.4.3 Computing fusion probability
    4.4.4 Algorithms for efficient probability computation
    4.4.5 Expected number of fusion points
    4.4.6 Localization of Rearrangement Fusion Points
  4.5 Acknowledgements

Chapter 5  On design of deep sequencing experiments
  5.1 Introduction
  5.2 Results
  5.3 Discussion
  5.4 Methods
    5.4.1 Breakpoint Resolution
    5.4.2 Simulation
    5.4.3 Mixing clone-lengths
    5.4.4 Proof of Optimality of Two Clone Design
    5.4.5 Simulation for mix of clones
    5.4.6 Transcript Sequencing
    5.4.7 Haplotype assembly
  5.5 Acknowledgements

Chapter 6  Reconstructing Genomic Architectures
  6.1 Introduction
  6.2 Methods
    6.2.1 Obtaining an architecture graph
    6.2.2 Retrieving Optimal Eulerian Paths
  6.3 Discussion

Chapter 7  Evidence for Large Inversion Polymorphisms in the Human Genome from HapMap data
  7.1 Introduction
  7.2 Results
    7.2.1 Overview of Method
    7.2.2 Power to detect Inversion Polymorphisms
    7.2.3 Scanning the HapMap data for inversion polymorphisms
    7.2.4 Sequence Analysis of Inversion Breakpoints
    7.2.5 Assessing the false positive rate
  7.3 Discussion
  7.4 Methods
    7.4.1 Haplotype Data
    7.4.2 Defining multi-SNP markers
    7.4.3 Computing LD
    7.4.4 The Inversion Statistic
    7.4.5 Identifying potential inversions
    7.4.6 Simulating Inversions
    7.4.7 Sequence Analysis
    7.4.8 Coalescent Simulations
  7.5 Acknowledgements
  7.6 Supplemental material attached electronically

Chapter 8  Orthologous repeats and phylogenetic inference
  8.1 Introduction
  8.2 Approach
  8.3 Results
    8.3.1 Species with finished sequence
    8.3.2 A larger set of species
    8.3.3 Assessment of Incompatible Repeats
  8.4 Discussion
  8.5 Methods
  8.6 Acknowledgements
  8.7 Supplemental material attached electronically

Chapter 9  Conclusions
  9.1 Open Problems
    9.1.1 Genomic diagnostics for disease
    9.1.2 Predicting Fusion Events and Architectures
    9.1.3 Population genetics and phylogenetics
  9.2 A modest proposal

Appendix A  Supplemental: Optimization of primer design for the detection of genomic lesions in cancer
  A.1 Complexity of PAMP design
  A.2 Methods and Parameters
    A.2.1 Computational
    A.2.2 Experimental
  A.3 Supplemental Figures

Appendix B  Supplemental: Two-Sided PAMP and Alternating Multiplexing
  B.1 Experimental Methods
  B.2 Proofs

Appendix C  Supplemental: Evaluation of paired-end sequencing strategies: applications to gene fusion
  C.1 Simulations
    C.1.1 Calculation of Pζ and E(|Θζ|)
    C.1.2 Calculation of Fusion Probabilities for Artificial Fusion Genes
    C.1.3 Sensitivity and Selectivity under Random Rearrangements
  C.2 Supplemental Figures

Appendix D  Supplemental: On design of deep sequencing experiments

References

LIST OF FIGURES

Figure 1.1: Simple forms of structural variation
Figure 1.2: Diagram of Philadelphia Chromosome
Figure 1.3: Schematic of PAMP design

Figure 2.1: Schematic of PAMP design
Figure 2.2: Sketch of Simulated Annealing Methodology
Figure 2.3: Integer Linear Program for PAMP design
Figure 2.4: PAMP CDKN2A Experimental Results
Figure 2.5: PAMP performance with repeat located primers
Figure 2.6: Comparison of Missing Coverage in CDKN2A
Figure 2.7: Assessment of sequence coverage from computational primer optimization

Figure 3.1: Schematic of PAMP detection failure
Figure 3.2: Schematic of uncovered breakpoint computation
Figure 3.3: Performance of PAMP-2D
Figure 3.4: Left and right detectability
Figure 3.5: Detecting rearrangements in cell-lines with complex rearrangements

Figure 4.1: Schematic of Breakpoint Calculation
Figure 4.2: Prediction of a Fusion between the NTNG1 and BCAS1 genes
Figure 4.3: Fusion gene pair size distributions
Figure 4.4: Schematic of a Breakpoint Region
Figure 4.5: Probability of localizing a fusion point to an interval of a given length
Figure 4.6: Distribution of gene sizes for different sets of genes
Figure 4.7: Simulation: Probability of detecting fusion genes
Figure 4.8: Sensitivity and Specificity of Fusion Gene Predictions
Figure 4.9: Probability of observing at least one chimeric cluster vs. the percent of chimeric clones

Figure 5.1: Applications of paired-end mapping
Figure 5.2: Detection-resolution trade-off
Figure 5.3: Combination of clone lengths for breakpoint detection
Figure 5.4: Distribution of normalized expression from two transcript sequencing experiments
Figure 5.5: Haplotype lengths as a function of sequence coverage

Figure 6.1: Example of integrating CGH and ESP data into a flow problem

Figure 7.1: Unusual Linkage Disequilibrium observed in SNP data when the inverted haplotype (w.r.t. the reference sequence) has very high frequency
Figure 7.2: Power of method to detect inversion polymorphisms in HapMap ‘analysis panels’
Figure 7.3: Genomic overview of a 1.4 Mb region at 16p12 predicted to have an inversion in both the CEU and YRI ‘analysis panels’
Figure 7.4: Overview of a ≈1.2 Mb long inversion on chromosome 10 predicted in the CHB+JPT ‘analysis panel’
Figure 7.5: A predicted YRI inversion polymorphism on chromosome 6 overlaps with the TCBA1 gene
Figure 7.6: Splice isoforms of the ICAp69 gene that are approximately consistent with a predicted YRI inversion breakpoint on chromosome 7
Figure 7.7: Length distribution of predicted inversions in the YRI ‘analysis panel’
Figure 7.8: The p-value distribution for predicted inversions having p-value ≤ 0.02

Figure 8.1: Schematic of building repeat based phylogenies
Figure 8.2: Sketch of phylogeny reconstruction from the orthologous-repeats table
Figure 8.3: Multiple alignment for an incompatible repeat in the orthologous-repeats table of 9 species with finished sequence
Figure 8.4: Distribution of the difference statistic among columns with high and low degree of incompatibility
Figure 8.5: Phylogenetic tree of a large set of 28 species

Figure A.1: Number of simulated-annealing iterations as a function of region size and primer-density

Figure C.1: Distribution of MCF7 Clone Lengths
Figure C.2: Length of a Breakpoint Region (BPR) for varying amounts of clonal coverage
Figure C.3: Clone Length vs. Pζ vs. |Θζ| for varying N
Figure C.4: The effect of clone length and number of paired reads on Pζ and |Θζ|
Figure C.5: Pζ and |Θζ| for different clone lengths
Figure C.6: The number of paired-reads (and resulting E(|Θζ|)) needed to obtain a Pζ of 0.99 for clone lengths varying from 1 to 150kb
Figure C.7: Average fusion probability vs. number of mapped reads
Figure C.8: Effect of chimeric clones

Figure D.1: Fraction of detected transcripts from kidney RNA-seq at different sequencing depths
Figure D.2: Estimating the p.d.f. of normalized gene expression values
Figure D.3: Plot of log log f(ν) vs. log(ν) reveals a linear relationship
Figure D.4: Plot of the N50 haplotype length as a function of the read length

LIST OF TABLES

Table 4.1: Ranked List of Fusion Gene Predictions in Breast Cancer Cell Lines and Primary Tumors
Table 4.2: Breakpoint Detection and Localization for Different Sequencing Strategies

Table 7.1: List of Predicted Inversions for which there is some form of evidence supporting the inverted orientation

Table 8.1: An orthologous-repeats table containing a sampling of repeats
Table 8.2: Shared-repeat graph and subgraphs of 9 species with finished sequence
Table 8.3: Incompatible columns in the Orthologous-Repeats Table

ACKNOWLEDGEMENTS

There are a number of people who have made my thesis work possible. My advisor, Vineet Bafna, has been an extremely patient teacher and his excitement for science has kept me motivated. I would like to acknowledge my entire committee - Ben Raphael, Pavel Pevzner, Nik Schork, Bing Ren and Trey Ideker - for their flexibility, willingness to collaborate, and for sharing their advice and time with me. I have been lucky to collaborate with a number of excellent researchers at the Moores Cancer Center (YT Liu, Qing Lu, Dennis Carson), UCSF Comprehensive Cancer Center (Colin Collins, Stas Volik), ABI (Francisco De La Vega), Illumina (Semyon Kruglyak), Scripps (Kelly Frazer, Olivier Harismendy), and others. I have also worked with a number of talented undergraduates (Ankur Jain, Samad Lotia, Tony Veit). Members of the Bioinformatics Program, and specifically the Bafna and Pevzner labs, have been instrumental over my graduate career. Many former and current members have served as collaborators (Chun (Jimmie) Ye, Vikas Bansal, Alkes Price, Banu Dost). Others have gone the extra mile in helping me work through difficult problems (Ryan Kelley, Mark Chaisson, Neil Jones) or giving me advice in preparing my thesis (Ari Frank). Some have dropped whatever they were doing whenever I wanted to get a snack or needed a break (Natalie Castellana). And, many of these and others have offered critical insights and discussion, scientific and otherwise (Sourav Bandyopadhyay, Gary Hon, Stephen Tanner, Noah Zaitlen). Lastly, I would like to acknowledge Orapim Tulyathan for her last-minute prowess with Adobe Illustrator.

Chapter 2 (with Appendix A) was published in Bioinformatics, Vol 23, pp 2807-2815, 2007: A. Bashir, Y-T. Liu, B. Raphael, D. Carson, and V. Bafna, “Optimization of primer design for the detection of variable genomic lesions in cancer”. The dissertation author was the primary investigator and author of this paper.

Chapter 3 (with Appendix B) is accepted at the 13th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2009): A. Bashir, Q. Lu, B. Raphael, D. Carson, Y-T. Liu, and V. Bafna, “Optimizing PCR assays for DNA based cancer diagnostics”. The dissertation author was the primary investigator and author of this paper.

Chapter 4 (with Appendix C) was published in PLoS Computational Biology, Vol 4(4), 2008: A. Bashir, S. Volik, C. Collins, V. Bafna, and B. Raphael, “Evaluation of paired-end sequencing strategies for detection of genome rearrangements in cancer”. The dissertation author was the primary investigator and author of this paper.

Chapter 5 (with Appendix D) is currently in submission: A. Bashir, V. Bansal, and V. Bafna, “On design of deep sequencing experiments”. The dissertation author was a primary co-investigator and co-author of this work.

Chapter 7 was published in Genome Research, Vol 17, pp 219-230, 2007: V. Bansal, A. Bashir, and V. Bafna, “Evidence for large inversion polymorphisms in the human genome from HapMap data”. The dissertation author was the secondary author of this paper.

Chapter 8 was published in Genome Research, Vol 15, pp 998-1006: A. Bashir, C. Ye, A.L. Price, and V. Bafna, “Orthologous repeats and mammalian phylogenetic inference”. The dissertation author was the primary investigator and author of this paper.

All other chapters are the original work of the dissertation author.

VITA

2003  Bachelor of Science in Bioengineering, University of California, Berkeley

2009  Doctor of Philosophy in Bioinformatics, University of California, San Diego

PUBLICATIONS

“Orthologous Repeats and Mammalian Phylogenetic Inference”. Ali Bashir, Chun Ye, Alkes Price and Vineet Bafna. Genome Research, 2005

“Optimization of primer design for the detection of variable genomic lesions in cancer”. Ali Bashir, Yu-Tsueng Liu, Benjamin Raphael, Dennis Carson, and Vineet Bafna. Bioinformatics, 2007

“Optimizing PCR assays for DNA based cancer diagnostics”. Ali Bashir, Qing Lu, Benjamin Raphael, Dennis Carson, Yu-Tsueng Liu, and Vineet Bafna. RECOMB, to appear, 2009

“Evaluation of paired-end sequencing strategies for detection of genome rearrangements in cancer”. Ali Bashir, Stanislav Volik, Colin Collins, Vineet Bafna, and Benjamin Raphael. PLoS Computational Biology, 2008

“Evidence for large inversion polymorphisms in the human genome from HapMap data”. Vikas Bansal, Ali Bashir and Vineet Bafna. Genome Research, 2007

FIELDS OF STUDY

Major Field: Bioinformatics
Professor Vineet Bafna

ABSTRACT OF THE DISSERTATION

Computational Methods for Analyzing and Detecting Genomic Structural Variation: Applications to Cancer

by

Ali Bashir Doctor of Philosophy in Bioinformatics

University of California, San Diego, 2009

Professor Vineet Bafna, Chair

Understanding genetic variation has emerged as a key research problem of the post-genomic era. Until recently, the study of large genomic events, or structural variants, was marginal in comparison to smaller events, such as single nucleotide variants/polymorphisms. Technological advancements in sequencing, array design, and primer based assays have made the detection of structural variants more cost-effective, reopening the possibility of high-throughput, systematic analysis. Here, we propose algorithms for detecting, analyzing, and utilizing these events.

Cancer is a largely genomic disease driven by somatic mutation and often characterized by large-scale genome rearrangements. We develop optimization schemes for PCR based diagnostics for detecting genomic lesions in cancer patients. The optimization allows robust detection of highly variable genomic lesions, even in a high background of normal DNA. We propose a subtle change to experimental design that significantly improves the assay without impacting experimental complexity. In a separate study, we present an efficient approach for de novo detection of gene fusion events given paired-end sequencing data. Even at low genomic coverage, ∼0.6X, with large insert (clone) sizes, >100kb, our method reliably predicts gene fusions. Paired-reads are further applied in reconstructing cancer genome architectures; we focus on local optimizations at complexly amplified or rearranged breakpoints.

Large-scale genomic events also play important roles within normal populations and across species. We develop a novel approach that exploits unusual linkage disequilibrium patterns to detect inversion polymorphisms from limited SNP data. For phylogenetic inference, we track the insertion of transposable repeat elements across 28 mammalian species. Our algorithm returns phylogenies highly consistent with other studies and, in some cases, helps resolve points of debate.

Lastly, we present a framework for the design of high-throughput sequencing studies directed at transcriptome sequencing, haplotype assembly, and the detection of structural variants. An explicit trade-off is shown between detection and localization of breakpoints for different insert sizes when using paired-reads. We prove that a mix of exactly two insert sizes provides the optimal probability of resolving a breakpoint to a given resolution. In transcriptome sequencing, we show that it is possible to accurately approximate a sample's underlying gene expression distribution with only 100K reads via a novel correction method.

Chapter 1

Introduction

Understanding the impact of genetic variation is one of the key research problems of the post-genomic era. There have been a number of seminal works in this area; most have focused on small genomic events, such as single nucleotide polymorphisms (SNPs) and microsatellites. These studies have fueled interest in cataloging the diversity of SNPs in various genomes, allowing these events to be used for disease association. This has enabled projects on an enormous scale, including the international HapMap project which identified the variation at over a million SNPs in four populations [3].

Until recently, larger genomic events (often termed structural variants) were, comparatively, marginalized. Structural variants include genomic inversions, deletions, insertions, and translocations (Figure 1.1). Such events were considered to be relatively rare, especially outside of disease contexts [42]. Moreover, it was previously not cost effective to systematically characterize large events in a population. Recent technological developments in sequencing, array design, and primer based assays have made the detection of these larger scale events more feasible. This has reopened the possibility of performing high-throughput, systematic analyses of structural variants.


Figure 1.1: Simple forms of structural variation. For each example, the top corresponds to a “normal” sequence relative to the reference genome, and the bottom corresponds to a rearranged genome. All lines in the same plane correspond to a contiguous stretch of DNA, with arrows indicating positive (right arrow) or negative (left arrow) strand orientation relative to a reference genome. Different colored lines indicate sequence from different chromosomes, and dotted lines correspond to “breakpoints”. (a) inversion, (b) deletion, (c) translocation.

1.1 Why “large” events?

This work will detail a number of methods for optimal detection and evaluation of structural variants, and other large-scale genomic events, for a range of applications. However, as mentioned earlier, a variety of highly developed computational and experimental techniques are already in place for the examination of small genomic events. Additionally, new technologies are emerging and improving that enable even closer examination of small scale events. One may, rightly, ask, “Why bother looking at large events?” Structural variants have a number of unique properties when compared to smaller, more frequent, events. Specifically, large events

. . . can have more dramatic phenotypic effects. Let us consider the effects of a SNP on protein structure or gene regulation. A single SNP may disrupt an important amino acid, making the structure less stable or less likely to bind its target. Similarly, a precisely placed SNP could weaken or destroy the effect of a critical regulatory motif.

Compare this effect to that of a similarly placed deletion. A deletion can certainly disrupt an exon or motif, and could potentially eliminate an entire protein domain or all regulatory signals upstream of a gene. A translocation could create completely novel proteins (for example, gene fusions) or entirely rewire the regulatory pattern of a gene. Figure 1.2 shows the “Philadelphia chromosome”, a translocation between chromosomes 9 and 22 that leads to the BCR-ABL fusion gene, which is associated with chronic myelogenous leukemia [37]. Though the frequency of SNPs may be much higher, the individual impact of structural variants is far more profound. This has important consequences not only in understanding diseases but also in the development of diagnostics and therapeutics.

Figure 1.2: Diagram of Philadelphia Chromosome (BCR-ABL fusion). A reciprocal translocation between chromosomes 9 and 22 leads to the BCR-ABL fusion gene.

. . . are characteristic of cancer genomes. Cancer is now the leading cause of death in many countries [31]. It is inherently a genomic disease, driven by mutations. High-level spectral karyotyping (SKY) shows that very large rearrangements (or even entire chromosomal duplications) are characteristic of many cancers [133, 177]. A number of these events have been shown to create causal fusion genes in multiple tumor types [164, 152, 75], of which the BCR-ABL fusion in Figure 1.2 is an example. This dissertation will examine a number of topics related to developing disease based diagnostics for cancer and better characterizing and understanding specific types of cancer.

. . . allow for more robust detection and characterization of recurrent events. SNPs are so frequent that the probability of observing the same event at the same position, purely by random chance, is always a confounding factor. Larger structural variants circumvent this issue in two ways. First, they are significantly less frequent (by several orders of magnitude). Second, they impact a pair of genomic positions, causing the space of possible events to grow quadratically (as opposed to linearly) with the size of the respective genome. This second point has an added benefit: it allows for distinctions to be made between events. We will use the term “breakpoint” to describe any pair of nonadjacent genomic coordinates, (x, y), from a reference genome that become adjacent in a rearranged genome. In Figure 1.1, breakpoint coordinates correspond to the positions immediately upstream, x, and downstream, y, of the dotted lines in the rearranged genome. A rearrangement that is observed in multiple individuals at exactly the same breakpoint, (x, y), is more likely to be inherent to the population (i.e., the rearrangement occurred in some founder individual at an earlier point in time and has been passed down to their descendants). An event which has shifted breakpoints, (x, y) vs. (x + α, y + β), is more likely the result of an independently recurrent event, as may occur in a cancer's oncogenic progression. This distinction is helpful when attempting to characterize somatic vs. germline events.
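The distinction between shared and independently recurrent breakpoints can be made concrete with a small sketch. This is purely illustrative, not the dissertation's method; the sample names and coordinates are invented. Breakpoints observed at identical (x, y) coordinates group together as candidate inherited events, while shifted coordinates remain singletons:

```python
# Illustrative sketch: group observed breakpoints by exact (x, y) match.
# Sample identifiers and coordinates below are hypothetical.
from collections import defaultdict

def group_by_breakpoint(observations):
    """observations: iterable of (sample_id, (x, y)) pairs."""
    groups = defaultdict(set)
    for sample, bp in observations:
        groups[bp].add(sample)
    return dict(groups)

obs = [
    ("sample_A", (1200, 5400)),
    ("sample_B", (1200, 5400)),  # identical (x, y): candidate inherited event
    ("sample_C", (1217, 5391)),  # shifted (x+α, y+β): candidate recurrent event
]
shared = {bp: s for bp, s in group_by_breakpoint(obs).items() if len(s) > 1}
```

In practice, mapping uncertainty means "identical" would be relaxed to a small tolerance window rather than exact equality.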

. . . are not as well characterized and understood. The mechanisms by which large-scale rearrangements occur are just beginning to be understood. Models such as nonhomologous end-joining (NHEJ), retrotransposon insertion, and nonallelic homologous recombination (NAHR) are being advanced to explain many events, but a large number remain unclassified [64]. Moreover, there is a dearth of computational tools for drawing complete conclusions from these events. The availability of robust computational analysis tools will be a catalyzing force in the development of new experimental techniques and in understanding how genomes evolve.

These concerns underlie the topics studied in this dissertation. Together, they form a compelling argument: There is a clear need to apply an equally rigorous framework to the evaluation of large-scale genomic events.

1.2 How do you observe structural variants?

Tool development is tightly connected to the underlying experimental techniques. As mentioned, there has been a proliferation of new technologies enabling the detection of structural variants. The dissertation will primarily focus on optimization and analysis related to novel primer design, high-throughput sequencing, and array technologies.

1.2.1 Primer Approximation Multiplex PCR (PAMP)

Primer Approximation Multiplex PCR (PAMP) is a technique for identifying genomic lesions with highly variable boundaries [82]. It is based on a simple principle: breakpoints induced by genomic lesions can be detected by PCR primers flanking the event (provided the primers are within the acceptable amplifiable distance of the PCR protocol). Conventional approaches using a single primer pair are sufficient when the precise boundaries of lesions are known [63]. However, if the lesion space is highly variable, it is impossible to select a single pair of primers that will reliably assay for all possible events. In the clinic, this leads to a negative result, providing false information to medical practitioners.

PAMP significantly increases the number of detectable breakpoints. Multiple primers are selected around a lesion point, (x, y), of interest. Forward primers are placed upstream of the lesion boundary x and reverse primers downstream of the boundary y. Any pair of forward and reverse primers can create a viable PCR product, provided the lesion places them within an acceptable range of PCR amplification (Figure 1.3). The results of PAMP are paired with genome tiling arrays to quickly indicate which genomic regions are amplified.

Figure 1.3: Schematic of PAMP design. Forward and reverse primers approximately cover the left and right breakpoints of the fusing genomic regions. The primers are spread out so that each deletion results in a unique primer pair being amplified. The amplified product is detected by hybridization to probes on an array.

Optimization schemes related to this are discussed in Chapters 2 and 3.
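The core spacing constraint can be illustrated with a small sketch. This is not the design algorithm of Chapter 2 (which must also handle primer compatibility and multiplexing); it only shows the coverage idea: given hypothetical candidate forward-primer positions, greedily thin them so consecutive chosen primers are never more than an assumed amplifiable distance d apart, ensuring every left breakpoint lies within d of some primer.

```python
# Minimal sketch of the PAMP spacing idea. Positions and d are hypothetical,
# in base pairs. Greedy thinning: always jump to the farthest candidate still
# within d of the last chosen primer.
def greedy_cover(candidates, d):
    """Select a sparse subset so consecutive chosen positions are <= d apart."""
    candidates = sorted(candidates)
    chosen = [candidates[0]]
    i = 0
    while i < len(candidates) - 1:
        j = i
        # advance to the farthest candidate within d of the last chosen primer
        while j + 1 < len(candidates) and candidates[j + 1] - chosen[-1] <= d:
            j += 1
        if j == i:
            raise ValueError("adjacent candidates are more than d apart")
        chosen.append(candidates[j])
        i = j
    return chosen
```

For example, with candidates at 0, 300, 500, 900, 1400, and 1500 bp and d = 600, the sketch keeps four of the six primers while leaving no gap larger than d.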

1.2.2 High-throughput Sequencing

End-Sequence Profiling (ESP) and Paired-end Mapping (PEM)

End-sequence profiling (ESP) and paired-end mapping (PEM) are experimental techniques for discovering rearrangement events through end-sequencing of clones (or inserts) of DNA [170, 127, 169, 166]. The sequenced ends are mapped back to a reference genome. Clones spanning breakpoints will map discordantly, joining DNA segments from different chromosomes or from intrachromosomal regions that are too far apart or too close together (given the distribution of insert lengths). These discordant mappings form invalid pairs (x, y), where x and y correspond to the mapping positions of end-sequences from the left and right, respectively. We show in Chapters 4 and 5 that clustering of related invalid pairs allows for precise localization of the underlying breakpoint, ζ, enabling accurate prediction of gene fusion events.
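The valid/invalid distinction can be sketched as follows. This is a simplification (orientation checks are omitted, and the clone-length distribution is reduced to fixed min/max bounds); the chromosome names and sizes are illustrative only:

```python
# Illustrative sketch: flag an end-sequence pair as "invalid" (discordant) when
# its mapped span is inconsistent with the clone-length distribution,
# approximated here by fixed min/max insert bounds. Orientation is ignored.
def classify_pair(chrom_x, x, chrom_y, y, min_insert, max_insert):
    if chrom_x != chrom_y:
        return "invalid"  # interchromosomal: candidate translocation
    span = abs(y - x)
    if not (min_insert <= span <= max_insert):
        return "invalid"  # too close or too far: candidate deletion/insertion
    return "valid"
```

A real pipeline would also require consistent read orientations and would cluster nearby invalid pairs before calling a breakpoint.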

Transcriptomic sequencing or RNA-seq

Until recently, approximating expression through sequencing was impractical due to cost considerations. Achieving the necessary sampling depth with conventional sequencing methods simply required more sequence reads than was reasonably feasible. Additionally, cDNA libraries typically suffered from a 3' bias, as the reverse transcriptase would initialize from the end of the mRNA. High-throughput sequencing technologies along with new preparation techniques have, as with genome re-sequencing, made expression analysis a much more pragmatic option.

In the updated techniques, full length mRNAs are fragmented and cDNA is created from the resulting sheared products using non-specific PCR primers [87]. These cDNA can then be mapped back to the reference genome in order to identify the expressed genes. Fragmentation helps to ensure a reduced 3' bias, while highly parallel short-reads allow for a robust sampling of the underlying gene expression distribution. Chapter 5 discusses how to correlate this sampling to relative gene expression and how to determine the depth of sequencing required to accurately sample genes at a specific relative gene expression.
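As a toy illustration of the sampling idea (not the correction method developed in Chapter 5), relative expression can be approximated by length-normalizing per-gene read counts and rescaling so the estimates sum to one. The gene names and counts below are invented:

```python
# Toy sketch: longer genes accumulate more fragments at equal expression, so
# counts are divided by gene length before rescaling. All values are invented.
def relative_expression(read_counts, gene_lengths):
    density = {g: read_counts[g] / gene_lengths[g] for g in read_counts}
    total = sum(density.values())
    return {g: d / total for g, d in density.items()}

expr = relative_expression({"geneA": 200, "geneB": 50},
                           {"geneA": 2000, "geneB": 1000})
```

Here geneA yields 0.1 reads per base and geneB 0.05, so their estimated relative expression is 2/3 and 1/3 respectively.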

1.3 In which contexts should we examine structural variants?

We will show that large-scale events can be informative in both disease (cancer) and non-disease settings. Even within those two areas there is an enormous variety of potential applications. This dissertation is structured so as to evaluate structural variants in problems of increasing scope.

1.3.1 Cancer applications

Assaying for known events within a single individual. In Chapters 2 and 3 we develop optimization schemes for assaying genomic lesions in cancer patients. Specifically, our optimization allows for robust detection of highly variable genomic lesions. We will show that these events can be captured even in a high background of normal DNA.

Discovering novel events in individuals and multiple samples. In Chapters 4 and 5 multiple breast cancer cell-lines and primary tumors are mined for genome rearrangements. Methods are developed to rigorously predict gene fusions using end-sequence profiling. A statistical framework is put forward to predict the probability of detecting and resolving breakpoints.

Reconstructing complete genomic architectures. Chapter 6 addresses the issue of reconstructing cancer genome architectures given a combination of paired-end sequencing and array CGH. Specifically, we focus on performing local optimization of complexly amplified and rearranged regions of the genome.

1.3.2 Other applications

Discovering rearrangements inherent to a population. Chapter 7 proposes a new method for detecting large inversion polymorphisms using linkage disequilibrium.

Utilizing events as markers for phylogenetic reconstruction. Lastly, Chapter 8 suggests a novel approach to phylogenetic reconstruction by tracking transposable element insertion events.

For each of these topics, we present methods that yield novel insights, enable better study design, or predict targets for further analysis. These subjects are just a small portion of the many questions related to, and enabled by, the study of large structural changes. However, the techniques developed will, hopefully, form a stepping stone for further research in this very active field.

Chapter 2

Optimization of primer design for the detection of genomic lesions in cancer

2.1 Introduction

Many tumors are characterized by large-scale DNA damage. These changes include point mutations and small insertion/deletion events, but also large structural changes such as deletions, translocations, and inversions of entire chromosomal segments. Notable examples include the TMPRSS2 fusion with ETS transcription factors [164], the SMAD4/DPC4 locus, which exhibits a homozygous deletion in pancreatic cancer [51], and the CDKN2A locus, which frequently has regions mutated or deleted in many cancers [24, 132, 109]. The CDKN2A region is interesting in that it encodes two proteins, INK4a and ARF, that actively participate in major tumor suppressor networks [48]. Many such variations in tumor genomes remain undiscovered, and their characterization will be an important part of cancer genome projects.

Established experimental techniques for detecting structural changes include array-CGH [120], FISH [115], and End-sequence Profiling [169]. However, array-CGH will detect only copy number changes – not structural rearrangements like inversions or translocations – and generally performs poorly if the sample is a highly heterogeneous

mix of wild-type and tumor cells [120]. This presents a significant challenge when screening cells in early onset cancer patients, where the majority of cells are actually normal, and only a small fraction contain the genomic lesion of interest. Techniques like FISH are labor intensive, making them impractical for high-throughput analysis, and moreover often have poor resolution (> 1 Mb). Finally, genome sequencing techniques like End-sequence Profiling are costly for whole-genome analysis, and it is not clear how to restrict them to specific regions of the genome.

PCR provides one possible solution to this problem. The exponential growth of the reaction product allows for the amplification of weak signals. Consider two regions that are brought together by a genomic rearrangement (deletion, inversion, etc.) in a tumor. Appropriately designed primer pairs within 1 kb of the fusing breakpoints will amplify only in the presence of the mutated DNA, and can amplify even with a small population of cells. Such PCR-based screening has been useful in isolating deletion mutants in C. elegans [63]. However, in many real tumors, further complications exist, as these structural variants often do not have exact boundaries. In the CDKN2A region, deletion boundaries often vary over several hundred kilobases and even megabases [130]. This type of variation is even observed in deletions/translocations which result in fusion proteins; the TMPRSS2 and ETS family fusion in prostate cancer not only lacks specificity in the genes it hits (ERG and ETV1/4), but also as to which exons are joined together [164]. Therefore, in order to appropriately monitor an individual's cancer progression, a test is needed that is capable of screening for, and returning accurate boundaries of, highly variable breakpoints.

To achieve this goal, Liu and Carson have recently devised a novel multiplex primer technique, Primer Approximation Multiplex PCR (PAMP), that allows for the assaying of many possible lesion boundaries simultaneously [82]. A mock illustration of this experimental model can be seen in Figure 2.1. PAMP utilizes multiple primers whose PCR products cover a region in which breakpoints may occur. Every primer upstream of one breakpoint is in the same orientation, opposite to the primers downstream of the second breakpoint. A primer-pair can form a PCR product only if a genomic lesion places the pair in close proximity. If the primer-pairs are spatially distinct, then any lesion will cause the amplification of exactly one primer-pair. The resulting PCR products are easily assayed on a tiling array, identifying the breakpoints of the lesion. The result is a technique which can identify genomic lesions even in a high background of normal DNA, and offers precise mappings of a genomic breakpoint (resolution of less than 1 kb).

Figure 2.1: Schematic of PAMP design. Forward and reverse primers approximately cover the left and right breakpoints of the fusing genomic regions. The primers are spread out so that each deletion results in a unique primer pair being amplified. The amplified product is detected by hybridization to probes on an array; the dark spots on the array correspond to amplification of primers most proximal to the breakpoint. In practice, these primer-pairs out-compete all others and provide the only visible signal.

For the PAMP technique to succeed, primers must be selected which adequately cover the entire region, such that every possible pair of deletion boundaries is represented by a corresponding pair of primers that will be amplified by PCR. Additionally, the primers must be chosen from a unique region of the genome, and not allowed to dimerize with each other. Finally, a selected primer must satisfy physico-chemical characteristics that allow it to prime the polymerase reaction. This last problem is well-studied. A number of programs, such as Primer3, select for optimal primers given a nucleotide sequence [134]. Primer-dimer (PD) formation is a common issue in multiplexing PCR reactions, and is affected by amplicon length, sequence, and priming efficiency [40]. Additionally, a host of algorithms and applications are available for predicting primer-dimer interactions given a set of multiplexing primers [167, 76].
It is not hard to see that PD formation (due to cross-hybridization) is quite prevalent, using standard dimerization criteria [167]. This poses a significant challenge when the design calls for large numbers (500–1000) of primers, with 500²–1000² possible dimerizations. Some recent work addresses this problem. The general problem of optimizing primer set size under cross-hybridization constraints has previously been shown to be NP-complete by reduction from the Multiple Choice Matching problem [107]. Thus, a number of heuristics have been developed for specific applications, such as minimizing primer set size when given a set of target objects (such as ORFs) [35, 36]. Recently, a number of papers have attempted to optimize multiplex reactions with respect to SNP genotyping. Specifically, these approaches attempt to partition primer sets into multiple multiplexing tubes and examine the trade-offs associated with various experimental design factors [124, 81]. Additionally, recent studies using bioinformatics approaches have been able to achieve multiplexing of greater than 1000 SNPs, far exceeding previous multiplexing thresholds [173].

The PAMP technique is fundamentally different from these approaches. The design uniquely results in the amplification of a single pair out of a large set of primers (and therefore primer pairs) due to the genomic lesion. Additionally, unlike clique- or coloring-based approaches for primer set partitioning [124, 81], we must simultaneously create a non-dimerizing set of primers while optimizing coverage of all possible breakpoints in a region. This sequence coverage criterion adds additional complexity to the optimization. Additionally, the goal is to maximize one's ability to detect a structural variant in a specific locus, no matter how variable its boundaries are within a patient population.
In this paper, we formulate the appropriate optimization problem (Section 2.2) and show that the problem is NP-hard (Section 2.3), even in a restricted form. In Section 2.4 we describe a number of heuristics that either terminate quickly or guarantee optimality (but not both). The algorithms are applied to a test region around the known CDKN2A deletion, and show excellent results even in a 1:49 mixture of mutated:wild-type cells (Section 2.5). These preliminary results also help set the design parameters for larger problems. We can achieve near-optimal designs for regions as large as 500 kb, and describe additional improvements for larger regions¹. Our results indicate that PAMP is a feasible technique for assaying lesions, up to a given size, in a heterogeneous mixture of cancer and wild-type cells.

2.2 Optimizing primer design

To model the problem accurately, we must establish the constraints for an appropriate set of primers. Define a primer design as a set of forward primers Fn, ..., F2, F1 with genomic locations lFn < ... < lF1, and a set of reverse primers R1, R2, ..., Rn with genomic locations lR1 < lR2 < ... < lRn (Figure 2.1). Let d equal the maximum distance between any pair of forward primers, or any pair of reverse primers. We say that the primer design covers a genomic location z provided that there exists a pair of primers Fi and Rj such that if z is deleted from the genome then the distance between Fi and Rj is at most 2d. For the protocol to be successful, the distance 2d should be no greater than what can be readily amplified between a primer-pair (2 kb is used as a cut-off). We define the coverage of the primer design to be the fraction of the genome between lFn and lRn that is covered. Our goal is to find a primer design that maximizes coverage subject to some constraints, as described below.

First, we do not allow primer dimerization: any two primers in a single multiplex reaction should not cross-hybridize. We present an experiment below in which the presence of a single pair of dimerizing primers is sufficient to negate the amplification. This imposes a fairly stringent requirement for our specific protocol, as dimerization is fairly common, and every forward primer will be pooled with every reverse primer. With 250 forward and reverse primers, this leads to at least 250² pairs². Second, each primer must amplify a unique region. We enforce this by ensuring that the primer itself is unique, and that the 13 bp from its 3' end occur no more often than expected by chance. Historically, primers have been selected only in repeat-masked genomes. However, we show that good coverage can be ensured only by allowing unique primers located within transposable elements. Third, primers in the same direction must be non-overlapping, and at least distance r apart, where r is the length of the desired probe. Finally, the primer selection must be physico-chemically appropriate, as described by melting temperature, GC content, and other parameters. These are lumped together as they are adequately addressed by primer selection programs such as Primer3 [134]. In our design, we start with some pre-processing to select unique and physico-chemically appropriate candidate primers. Next, all dimerizing pairs are identified. The problem of designing primers that obey the coverage and cross-hybridization constraints is formulated as a combinatorial optimization problem on the set of candidate primers.

¹Additionally, though not as high-throughput, FISH could detect events for larger regions.
²If all primers are pooled in a single reaction, this would lead to 500² pairs.

2.2.1 PAMP design

Construct a primer-coverage graph G, over a sequence of length L, as follows: each candidate primer defines a vertex, u, with its genomic location denoted by lu³. Add additional vertices lb = 0, le = L to define the start and end of the sequence. The vertices are paired up with primer-dimerization edges E (lb, le do not contain any primer-dimerization edges). Thus, (u, v) ∈ E if and only if primers u and v cross-hybridize. Each pair of nodes is also associated with a corresponding proximity cost. Consider a solution in which two forward primers u, v are adjacent. Recall that if |lu − lv| ≤ d, then any deletion with breakpoints between lu and lv should lead to an amplification. Otherwise, there are at most |lu − lv| − d positions where a deletion would not be marked by a PCR amplification. Based on this, each pair (u, v) is associated with a coverage-cost C(u, v) = max{0, |lu − lv| − d}. The primer design is a chain P = p1, p2, ... of forward primers followed by reverse primers, ordered so that lpj < lpj+1 for all j (Figure 2.2, top). Define the cost of the design as

    C(P) = Σ_{(pi,pj) ∈ E} wp + Σ_j wc · C(pj, pj+1)        (2.1)

where wc, wp are appropriately chosen weighting functions. To solve the PAMP design problem, we need to compute a chain of minimum cost. Note that many pairs of primers will cross-hybridize, and removing all such pairs could lead to very sparse primer data-sets. This is modeled by adjusting wp: keeping it high may lead to a very sparse solution, while keeping it too low leads to many conflicts being allowed. While our algorithm works for a general wp, preliminary results show that a single dimerizing pair can cause the entire multiplexed PCR reaction to fail (Section 2.5.1), so we describe results with wp = ∞.

³For exposition purposes, we ignore the length of the primer.
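Equation (2.1) can be evaluated directly for a candidate chain. The sketch below (with a hypothetical data layout: a position map and a set of dimerizing pairs) makes the wp = ∞ behavior explicit, since a single conflicting pair in the chain drives the cost to infinity.

```python
def chain_cost(chain, pos, edges, d=1000, w_c=1.0, w_p=float("inf")):
    """Cost of a primer chain per Eq. (2.1): a dimerization penalty w_p
    for every conflicting pair in the chain, plus w_c times the
    coverage-cost max(0, |l_u - l_v| - d) of each adjacent pair.
    chain: primer ids ordered by position; pos: id -> location;
    edges: set of frozensets, one per dimerizing pair."""
    cost = 0.0
    for a in range(len(chain)):            # dimerization penalties
        for b in range(a + 1, len(chain)):
            if frozenset((chain[a], chain[b])) in edges:
                cost += w_p
    for a in range(len(chain) - 1):        # coverage costs of adjacent pairs
        gap = abs(pos[chain[a + 1]] - pos[chain[a]])
        cost += w_c * max(0, gap - d)
    return cost
```

Here 'b' and 'e' stand in for the boundary vertices lb and le, so gaps against the region ends are charged like any other adjacency.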

2.2.2 Extensions

The model proposed here does not capture two natural extensions. The PAMP protocol does not require all forward and all reverse primers to be in a single multiplex reaction. Rather, the forward and reverse primers can be partitioned into sets, with each forward set reacted with each reverse set. This implies that dimerizations between two forward (or two reverse) primers are allowed when they are in different sets. We model this by not adding (u, v) (which can possibly dimerize) to E if u and v are both forward (or both reverse) and occur in different sets. If we partition the forward and reverse primers into N sets each, the protocol will have N² distinct multiplex reactions. As smaller N implies smaller cost, N can be used to optimize a cost-coverage tradeoff.

A second useful parameter is the total number of primers. Define primer-density ρ as the average number of primers every d base-pairs. Clearly, ρ must be ≥ 1 for full coverage. We show that a modest increase in ρ can provide a significant increase in achievable coverage. The primer-density is controlled by augmenting the cost function to be

    C(P) = Σ_{(pi,pj) ∈ E} wp + Σ_j wc · C(pj, pj+1) + wρ · ρ        (2.2)

Here, we select wρ = ∞ if ρ exceeds the desired density; otherwise, wρ = 0. Another point to note is that Figure 2.1 describes a scenario in which the left and right breakpoints are known to lie in distinct genomic segments. However, this is not critical. We can extend the protocol to a case where the left and right breakpoints lie in overlapping regions. Primer design considerations for more complex rearrangements (such as deletions with overlapping boundaries and translocations) are natural extensions, but are omitted for exposition purposes.

2.3 Complexity of PAMP design

We show that the PAMP design problem is NP-hard, even in a restricted form: we consider the case wp = ∞, so that no cross-hybridization edge is allowed in the solution. We consider an additional restriction on the problem by assuming that the right breakpoint is known exactly, and that we have a single reverse primer on that side which does not conflict with any of the candidate forward primers. The decision version of the restricted problem is as follows.

One-sided PAMP design (OPAMP): Given a genomic region G of length L with a single reverse primer, a collection F of candidate forward primers (with only the forward primers dimerizing), and an integer value D ≤ L: does there exist a non-dimerizing collection F′ ⊆ F of forward primers such that the total uncovered region is less than D, with no adjacent primers⁴?

Note that a polynomial time solution for the general problem implies that OPAMP is poly-time solvable. Hence, it is sufficient to prove that OPAMP is NP-hard. Supplemental A.1 includes a detailed proof via reduction from Max2-SAT [46]. The problem has recently been shown to be hard to even approximate [26].

⁴"Adjacent" primers have a spacing of less than r base pairs, where r > 0 as defined in Section 2.
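For intuition (and for validating heuristics on tiny inputs), OPAMP can be solved exactly by exhaustive search. This brute-force sketch uses illustrative positions and parameters, treats the reverse primer as fixed at position L, and is exponential in the number of candidates, so it is only usable on toy instances.

```python
from itertools import combinations

def opamp_best(cands, conflicts, L, d=1000, r=20):
    """Exhaustively search subsets of candidate forward primer positions
    for the non-dimerizing, non-adjacent set minimizing uncovered
    sequence on [0, L] (single reverse primer fixed at L)."""
    best = (L, ())
    ids = sorted(cands)
    for k in range(1, len(ids) + 1):
        for sub in combinations(ids, k):
            # reject subsets containing a dimerizing pair
            if any(frozenset(p) in conflicts for p in combinations(sub, 2)):
                continue
            # reject subsets with adjacent primers (< r apart)
            if any(b - a < r for a, b in zip(sub, sub[1:])):
                continue
            pts = (0,) + sub + (L,)
            uncovered = sum(max(0, b - a - d) for a, b in zip(pts, pts[1:]))
            best = min(best, (uncovered, sub))
    return best
```

The decision version then asks whether the returned uncovered total is below the bound D.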

2.4 Algorithms for PAMP design

Prior to optimization of the candidate set, we need to perform two preliminary computations.

Conflict Edge computation

First, we must compute all possible primer pairs that dimerize (the set of conflict edges E). Dimerization due to cross-hybridization is not perfectly understood, but previous studies have indicated that cross-hybridization could occur if an ungapped alignment exists with matches exceeding mismatches by at least 7, specifically in the 3' region [167]. We use this as our conflict criterion. For a genomic region of 500 kbp, there are often tens of thousands of candidate primers, each pair of which must be checked for dimerization. To efficiently compute the conflict edge graph, we employ a simple filtering technique. If the mismatches occur randomly, it can be shown that, with high probability, there is a sub-alignment with 3 consecutive matches in dimerizing primers. Therefore, we construct a hash table of all 3-mers. Only primers that hash to the same location are aligned to compute E. Additional layers of complexity regarding conflict edges are possible. Lipson proposed an extensive strategy for computing dimerization and mispriming potential, the probability of a primer pair falsely amplifying a different region of the genome [81]. In future work we intend to compare such approaches experimentally to improve the robustness of our algorithm.
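A minimal version of the filtering step might look as follows. One plausible reading of the 3-mer hash is assumed here (not spelled out in the text): since hybridization pairs a primer with the reverse complement of another, we index every 3-mer of every primer and look up the 3-mers of each primer's reverse complement; only the surviving pairs would then be passed to the full ungapped-alignment test (matches − mismatches ≥ 7).

```python
from collections import defaultdict

def revcomp(s):
    """Reverse complement of a DNA string."""
    return s.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def candidate_conflicts(primers, seed=3):
    """Filter step only: pair up primers sharing a seed-length match
    between one primer and the reverse complement of another, i.e.
    pairs that *might* form an ungapped duplex."""
    index = defaultdict(set)
    for name, seq in primers.items():
        for i in range(len(seq) - seed + 1):
            index[seq[i:i + seed]].add(name)
    pairs = set()
    for name, seq in primers.items():
        rc = revcomp(seq)
        for i in range(len(rc) - seed + 1):
            for other in index[rc[i:i + seed]]:
                if other != name:
                    pairs.add(frozenset((name, other)))
    return pairs
```

The point of the seed is purely to prune: instead of aligning all O(n²) pairs, only pairs that co-hash are aligned.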

Repeat filtering

Second, we must ensure uniqueness of the primers by filtering for repeats. Typical algorithms avoid placing primers within repeats in order to reduce the possibility of primers annealing non-uniquely [6]. However, given the limited range of PCR (∼2 kb in our experiments), a significant loss of coverage would result from disallowing primers within repetitive elements. For example, in the CDKN2A region the optimal coverage theoretically possible was ∼75% of the 500 kb flanking sequence if one disallows primers within repeats. In order to obtain better coverage, we needed to be aggressive in our selection of primers, and selected some primers within repetitive regions. We used parameters from Wang et al. to help derive our filtering criteria [173]. For each repeat in our region of interest, we created a table of every 20-mer, indicating its raw occurrence in the genome, as well as the occurrence of the 13 bp sub-string from its 3'-end. A primer was selected if, in addition to satisfying standard primer criteria, it did not have its 3'-end occur more often than was expected by chance. Additionally, the resulting sequence set is checked more rigorously for uniqueness in the region, as described in Supplemental A.2. Though the majority of repeat sequence occurs far more often than expected by chance, one can sometimes find small regions that are permissible as primers.

The set of 'unique' but possibly dimerizing primers forms the initial list from which a candidate set of low cost is to be selected. Given the NP-completeness result, we focus on heuristic versions of the problem. We describe algorithms for optimal PAMP design based on greedy selection and simulated annealing (guaranteed running time, but not optimality), and also integer linear programming (guaranteed optimality, but not running time).
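The 3'-end filter can be sketched with a k-mer count table. The toy genome string, k value, and threshold below are illustrative; the actual criterion compares occurrence counts against chance expectation rather than a fixed cutoff.

```python
from collections import Counter

def build_kmer_table(genome, k=13):
    """Count every k-mer occurrence in the (toy) genome string."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))

def three_prime_unique(primer, table, k=13, max_hits=1):
    """Accept a primer only if the k bp at its 3' end occur no more
    than max_hits times genome-wide (a stand-in for 'no more often
    than expected by chance')."""
    return table.get(primer[-k:], 0) <= max_hits
```

Because the 3' end drives extension by the polymerase, a repeated 3' end is what makes a primer misprime even when the full sequence is unique.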

Greedy heuristic for PAMP design

In this simple heuristic, we attempt to greedily extend a primer set of low cost.

Note that the typical value of wp is very high (or wp = ∞), which limits the total number of primers selected. Define Pu as the chain whose penultimate primer is u (the primer at le being the last), with cost C(Pu). Eu corresponds to the set of primers which have dimerizing edges with u.

    C(Pu) = min_{v : lv < lu} C(Pv + {u})
    v∗ = argmin_{v : lv < lu} C(Pv + {u})
    Pu = Pv∗ + {u}

The final solution is given by the Pu with minimum cost. It is not hard to see that this will result in unevenly distributed primers, with better primer density in regions that were examined first. In practice, the greedy heuristic is outperformed by simulated annealing (especially in large regions) and is used, along with randomized selection, only to provide initial solutions.
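The recurrence can be sketched as a single left-to-right pass with wp = ∞, where a primer may not extend a chain containing any primer it dimerizes with; whole chains (not just costs) must be tracked for the conflict check, which is part of why this is a heuristic rather than an exact dynamic program. The names, data layout, and parameters below are illustrative.

```python
def min_cost_chain(pos, edges, L, d=1000, w_c=1.0):
    """Left-to-right pass over candidate primers (wp = inf).  For each
    primer u, keep the cheapest chain found that ends at u; close every
    chain against the right boundary le = L and return the best."""
    ids = sorted(pos, key=pos.get)
    best = {}  # u -> (cost, chain ending at u)
    for u in ids:
        # option 1: start a fresh chain from lb = 0
        cand = [(w_c * max(0, pos[u] - d), [u])]
        # option 2: extend a previous chain, skipping dimerizing conflicts
        for v, (cv, chain_v) in best.items():
            if any(frozenset((u, w)) in edges for w in chain_v):
                continue
            gap = w_c * max(0, pos[u] - pos[v] - d)
            cand.append((cv + gap, chain_v + [u]))
        best[u] = min(cand, key=lambda t: t[0])
    closed = [(c + w_c * max(0, L - pos[u] - d), ch) for u, (c, ch) in best.items()]
    return min(closed, key=lambda t: t[0])
```

On the toy input in the test, the conflict between the two left-most primers forces the pass to pay a coverage penalty near the left boundary, exactly the uneven-density effect noted above.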

Simulated annealing for PAMP design

The simulated annealing is done over the space of all putative solutions. We start with a candidate-set P of cost C(P), and in each step, we move to a neighboring solution P′ [71]. We will consider two solution spaces, each with its own neighborhood. In the first, we consider the case wp = ∞. Each candidate-set P induces an independent set on the primer-coverage graph, i.e. no pair of dimerizing primers is allowed. A candidate-set P′ is in the neighborhood of P (P′ ∈ NP) if there exists a primer u such that P′ = P + {u} − {v : (u, v) ∈ E}.

In the second case (wp < ∞), every subset P subject to certain size constraints is in the solution space. P′ ∈ NP if |P − P′| ≤ 1. In other words, P′ can be obtained from P by adding or deleting a single primer. While the two approaches are similar, they do have different convergence properties. In each step s, we move from the current candidate-set P to a neighboring set P′. Denote the cost of this transition by ∆s = C(P′) − C(P). From our tests, moving from one independent set to another allowed for faster convergence; therefore we use this methodology for our comparative studies.

In the simulated annealing procedure, we start with an initial solution (random, or greedy). In each transition, step s is sampled (among all possible steps) with probability proportional to e^(−∆s/T), where the temperature T is an adjustable parameter. T is decreased according to an annealing schedule. Steps that cause a large decrease in the cost are the most probable, but unfavorable steps are also possible (at higher T), allowing for an escape from local minima. An example illustrating this point is seen in Figure 2.2. The speed and quality of the final solution depend upon a number of factors, including the quality of the initial solution, choice of neighborhood, and the setting of an appropriate temperature T. We experimented with a number of strategies for optimizing the speed and quality of the solution. In practice, the various annealing schedules showed little performance difference, though best results were obtained for proportional and linear schedules. For consistency, a linear schedule is used throughout the results section. Additionally, a random starting solution was given for each test, in order to compare the solutions to the naive greedy without bias.
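The acceptance rule and linear schedule can be sketched generically. This is a simplified loop, not the dissertation's implementation: it draws one uniform neighbor per step and accepts it with probability min(1, e^(−∆/T)), rather than sampling among all possible moves proportionally to e^(−∆s/T).

```python
import math
import random

def anneal(cost, initial, neighbors, t0=1000.0, steps=5000, seed=0):
    """Generic simulated annealing: propose a random neighbor, accept
    improvements always and uphill moves with probability e^(-delta/T),
    cooling T linearly; track the best solution ever visited."""
    rng = random.Random(seed)
    cur, cur_cost = initial, cost(initial)
    best, best_cost = cur, cur_cost
    for s in range(steps):
        t = t0 * (1 - s / steps) + 1e-9  # linear annealing schedule
        nxt = rng.choice(neighbors(cur))
        delta = cost(nxt) - cur_cost
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            cur, cur_cost = nxt, cur_cost + delta
            if cur_cost < best_cost:
                best, best_cost = cur, cur_cost
    return best, best_cost
```

For PAMP, `cost` would be Eq. (2.1) and `neighbors` the independent-set moves P + {u} − {v : (u, v) ∈ E} described above; the toy objective in the test is only there to exercise the loop.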

ILP and lower bounds

We can model our problem as a binary integer linear program (ILP). Typical ILP solvers guarantee optimality, but not running time, and may not converge for large sizes.

The ILP is depicted in Figure 2.3. For every primer in F, we define a binary variable xi.

The variable xi = 1 iff the primer starting at location li is chosen. Clearly, for each pair of dimerizing primers i, j, and for each pair of primers such that |li − lj| ≤ r, we have the constraint

xi + xj ≤ 1

We set variable qij = 1 if xi = 1, xj = 1, and xk = 0 for all j < k < i. In other words, primer j is the primer selected immediately prior to primer i. As qij contributes to the cost of the solution, we only need to set lower bounds on it. The constraints

    Σ_{j : lj < li} qij ≥ xi
    qij ≤ xj


Figure 2.2: Sketch of simulated annealing methodology. Ovals represent possible primers, darkly shaded ovals represent primers in the current chain, dotted lines between two ovals represent dimerization edges, and shaded rectangles represent uncovered sequence (sequence greater than d base pairs upstream from the nearest primer). Our goal is to minimize uncovered primer space. In this case, the third iteration provides a solution with perfect coverage, which could not be reached in a single move from the initial solution. Note that iterations 1 and 2 transition through chains with higher cost than the initial solution.

ensure that qij = 1 only if xi = 1 and xj = 1 for some j < i. The 'uncovered region' penalty di is constrained by

    di ≥ Σ_{j : lj < li} max{0, (li − lj − d)} · qij

Note that when li − lj − d < 0, the corresponding term is replaced with 0. Clearly, this penalty is minimized by setting qij = 1 for the primer j selected immediately prior to primer i. In that case, di is exactly the number of uncovered bases, which we seek to minimize. We find empirically (see Results) that the ILP as described is intractable even for moderate regions. Therefore, we use the ILP formulation mainly to test the performance of the simulated annealing solutions on smaller regions with a sparse number of primers. For lower bounds, we also considered the linear programming formulation, achieved by replacing the 0, 1 constraints with

0 ≤ xi ≤ 1 , 0 ≤ qij ≤ 1

Unfortunately, the integrality gap between the ILP and the relaxed LP is quite large in practice and the bounds are not useful (data not shown). We focus our studies on ILP solutions which can be obtained for smaller regions with a sparse number of primers. Empirical results for the simulated annealing compared to the ILP solutions can be seen in Figure 2.7c. In future work, we will explore the use of various cut inequalities to improve the performance of the ILP.

2.5 Results

2.5.1 Experimental validation

We applied our algorithm to design a set of 600 primers (300 pairs) covering a 500 kb region (ρ = 1.2) surrounding the CDKN2A locus. As described in the original PAMP paper, typical multiplexing reactions consist of adjacent subsets of primers from the larger, overall primer set [82]. Therefore, as a proof of concept, we used a 12 primer-pair test subset from these globally optimized primers to assay a known ∼15 kb deletion

    min Σ_i di
    s.t.
        xi + xj ≤ 1                                  for all dimerizing primers i, j
        xi + xj ≤ 1                                  for all i, j with |li − lj| ≤ r
        Σ_i xi ≤ ρ · L/d
        Σ_{j : lj < li} qij ≥ xi                     ∀i
        qij ≤ xj                                     ∀i, j
        di ≥ Σ_{j : lj < li} max{0, (li − lj − d)} · qij    ∀i
        qij, xi ∈ {0, 1}                             ∀i, j

Figure 2.3: Integer Linear Program for PAMP design

Primer Set (Size)            | Mut:Wild-type | Left Probe Signal? | Right Probe Signal? | Other Probe Signal?
Initial (20)                 | 0:1           | No                 | No                  | No
Initial (20)                 | 1:49          | Yes                | Yes                 | No
Initial (20)                 | 1:9           | Yes                | Yes                 | No
Initial (20)                 | 1:0           | Yes                | Yes                 | No
Initial + Repeats (25)       | 1:49          | Yes                | Yes                 | No
Initial + Repeats (25)       | 1:9           | Yes                | Yes                 | No
Initial + Repeats (25)       | 1:0           | Yes                | Yes                 | No
Initial + Primer Dimers (28) | 1:9           | No                 | No                  | No
Initial + Primer Dimers (28) | 1:0           | No                 | No                  | No

Figure 2.4: PAMP CDKN2A experimental results. Signal shows amplification around known breakpoints using a multiplexed set of 20 primers. The signals are obtained in a heterogeneous mix of mutant and wild-type DNA. The signal is retained when including distant primers within repeat regions, but completely disappears when a single dimerizing primer pair is used in the reaction. Note that none of the other probes shows a signal.

event at the CDKN2A locus in the Detroit 562 cell line [109]. A continuous subset of 10 forward primers and 10 reverse primers (representing 20 kb of sequence coverage) were chosen from the initial set of 500, in the regions closest to the previously published breakpoints. PAMP experiments were conducted with a varying mix of wild-type (non-deleted) and mutant samples. As shown in Figure 2.4, the array shows amplification of only the two probes associated with the characteristic breakpoint positions of the Detroit CDKN2A deletion. The signal is present even when the mutant:wild-type ratio is only 1:49. Moreover, we tested the effect of dimerizing primers with respect to suppression of a true signal. It was observed that a single pair of dimerizing primers was sufficient to destroy the signal completely, demonstrating the impact of adding dimerizing primers (Figure 2.4). At first glance, this seems like a much simpler (and scale-reduced) version of the optimizations we have been discussing. However, we note that the computation was done over the entire region and the 20 primers were manually picked from the 600-primer set designed around the region of deletion, so the computational complexity is unchanged. Also, the experimental complexity is close to the desired experimental complexity. Recall that the experimental protocol calls for N forward and N reverse sets, for a total of N² multiplex reactions. Our computational design was based on the most complex case (N = 1).
On the experimental side, by choosing 12 forward-reverse primers (N = 300/12 = 25), we would need a total of 625 multiplex reactions, of which exactly one would give the desired positive result. Here, we validated by only performing the single positive experiment. In other experiments, we have scaled this up to N ≈ 10, sufficient for clinical settings, and are moving towards N = 1 (data not shown). We also performed a series of experiments to test whether primers within (or proximal to) known repeats would cause problems. The set of 20 primers was extended by adding primers flanking highly conserved repeats (such as AluSx transposable elements). Figure 2.4 provides a negative control by showing that repeat-located primers do not destroy the CDKN2A deletion signal.

Primer Set (Size)               | Mut:Wild-type | Left Probe Signal? | Right Probe Signal? | Repeat Probe Signal? (probes for primer pairs in wild-type)
Initial + Flanking Repeats (30) | 0:1           | No                 | No                  | Yes (4)
Initial + Flanking Repeats (30) | 1:49          | Yes                | Yes                 | Yes (4)
Initial + Flanking Repeats (30) | 1:9           | Yes                | Yes                 | Yes (4)
Initial + Flanking Repeats (30) | 1:0           | Yes                | Yes                 | No

Figure 2.5: PAMP performance with repeat located primers. Primers were chosen in a repeat located within the deleted CDKN2A region in the Detroit 562 cell-line. In mixed samples, a signal is obtained from the boundaries of the deleted region, as well as the repeat located primers from the wild-type sample. The repeat signal disappears in the absence of the wild-type sample.

An ideal positive control would be an experiment in which repeat-located primers are the ones being amplified. Unfortunately, such a design is not possible for the CDKN2A deletion in the Detroit 562 cell line, as the deletion breakpoints are not proximal to repeats that yield good primers. However, the 14 kb deleted region itself contains multiple repeat elements which can be equally informative. Therefore, we conducted a series of experiments including primer pairs from repeat elements within the deleted region. In every mixed sample (wild-type + mutant), PCR products located in repeats and PCR products resulting from the deletion were detected (Figure 2.5). The signals of PCR products located in repeats disappeared when only the mutant sample was used, confirming that only the deleted region was being amplified by the primers located in repeats. All PAMP experiments were performed using the protocol described in Liu et al. [82]. Our results show the power of the PAMP protocol in amplifying the deletion signal even in a mixed population. The negative control with dimerizing primers reveals the importance of a good design providing high coverage with non-dimerizing primers. The remainder of this manuscript describes the impact of various parameters on the performance of the simulated-annealing heuristic, with comparison to the naive greedy and ILP solutions.

2.5.2 Computational modeling

Recall that if the candidate primer set places two adjacent primers at distance d0, with d0 > d, then the total coverage is reduced by (d0 − d) bp. We use this bound on theoretically obtainable coverage to compare performance. This bound is not tight for large regions, because as the size of the region increases, it becomes harder to find primer sub-sets with non-dimerizing pairs. Even with the weak upper bound, early computational results are promising. On the 500kb CDKN2A region, we obtained non-dimerizing primer sets with greater than 96% coverage (with primer density ρ = 1.2), improving upon the greedy solution (< 92% coverage). Also, note that restricting primer selection to repeat-masked regions would have resulted in greatly reduced coverage (∼ 60%). Figure 2.6 shows the amount of coverage missed in each of these approaches. The lower coverage (across all optimizations) in the CDKN2A region when compared to other 500kb regions is primarily due to its high repeat content (> 60% over 500kb); however, this exemplifies the need for primers in repeat-masked genomic regions. A similar study was done for a smaller region corresponding to the TMPRSS2:ERG [164] fusion, achieving > 97% coverage (data not shown). We consider the factors that would impact the quality of the final solution. The size of the region is an important consideration, as discussed earlier. Also, different genomic regions have fairly different compositional characteristics, which will influence performance. Finally, the performance is also influenced by algorithm-specific parameters such as the primer-density ρ. To examine these issues, we selected a number of regions at random from the genome, with size varying from 100kb to 5Mb, with ≥ 5 replicates for each size. Figures 2.7a and 2.7b show the performance as a function of size and primer-density ρ.
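The adjacent-gap bound described above is straightforward to state in code. The following Python sketch (function and variable names are our own illustration, not part of the published software) sums (d0 − d) over every adjacent primer pair spaced d0 > d apart:

```python
def missed_coverage_bound(primer_positions, d):
    """Lower bound on missed coverage: each adjacent pair of primers
    spaced d0 > d apart leaves (d0 - d) bp that no amplicon can span.
    Names are illustrative, not from the dissertation's software."""
    pos = sorted(primer_positions)
    return sum(max(0, (b - a) - d) for a, b in zip(pos, pos[1:]))

# Two primers 2500 bp apart with d = 2000 leave a 500 bp gap.
missed_coverage_bound([0, 2500, 4000], 2000)  # -> 500
```

A perfectly spaced design (every gap ≤ d) yields a bound of zero, which is the "theoretical optimum" the text compares against.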

Figure 2.6: Comparison of missing coverage in CDKN2A. Custom tracks corresponding to missing coverage were added to the UCSC genome browser at the CDKN2A locus [69]. The first track indicates simulated optimized solutions with ρ = 1.2, when restricted to non-repeat regions of the genome (206,250bp missed coverage). The second track indicates a greedy solution when the search space allows for primers in repeat-masked regions (92,848bp missed coverage). The third track is for the general simulated annealing solution with ρ = 1.2 (17,846bp missed coverage). Less highlighting indicates better coverage.

Additional parameters for primer selection and optimization can be found in Supplemental A.2. As can be seen, for small regions (≤ 500kb) and higher primer-density, the designed primers are very close to the theoretical optimum. The coverage diverges from the theoretical optimum over large regions, primarily due to extensive primer dimerization, which not only restricts the overall number of primers, but also greatly limits choices in primer-sparse regions of the genome. The primer-density provides a cost-coverage trade-off. For mid-sized regions, a small increase in primer-density greatly improves coverage (Figure 2.7a). Specifically, a significant improvement is observed between ρ = 1 and ρ = 1.2, suggesting that ρ = 1.2 provides a good cost-coverage tradeoff. This is made more apparent by comparison to the unrestricted greedy approach, which outperforms ρ = 1 in certain cases. Additionally, the simulated annealing solution consistently improves upon the greedy heuristic (even when no restriction is placed on the greedy heuristic for ρ) (Figure 2.7a,b).

Figure 2.7: (a) Coverage versus region size. Each datapoint is the mean over 5 randomly chosen genomic regions of a fixed size. For large regions, the coverage is improved by allowing dimerization to be possible between 2 forward (or 2 reverse) primers (ρ = 1.1, partitioned), if the two primers are never together in a single multiplex experiment (i.e. they belong to different multiplexing "sets"). (b) Improvement in coverage with increasing primer-density ρ. There is a distinct improvement as ρ goes from 1 to 1.2, after which the improvement is less pronounced. (c) The best ILP and simulated annealing solutions are virtually indistinguishable at equivalent ρ. Each datapoint represents a single iteration.

For larger regions (> 1Mb) there is a significant reduction in coverage. To improve coverage further, we exploit the fact that the PAMP protocol employs multiple 'tubes' of multiplexing in practice. The forward and reverse sets are themselves partitioned into N sets each (N ≤ 10). Each multiplex reaction consists of a forward and a reverse partition. Thus, each forward primer is only multiplexed with the forward primers in its own partition, and can dimerize with all other forward primers. This relaxation of the dimerization constraints allows us to obtain improved coverage (see Figure 2.7a, ρ = 1.1, partitioned). An improvement similar to the unpartitioned sets was observed as ρ was increased from 1.1 to 2 (data not shown). In future work, we will include the optimization of the number of rounds as an explicit part of the primer design.

As mentioned, extensive primer-dimerization makes it difficult to obtain high coverage, and our results may well be close to the true optima, but the weak bounds on optimal coverage make it difficult to test this directly. However, in small, sparse regions⁵ informative ILP bounds could be obtained. It should be noted that even when the simulated annealing diverged significantly from the theoretical lower bound, it almost perfectly approximated the observed ILP solution (Figure 2.7c).
Additionally, the revised lower bound placed by the ILP (since true convergence could not always be observed) greatly reduced the gap between the theoretical and observed coverage.

2.5.3 Convergence and running time

The performance of the simulated-annealing algorithm depends upon a number of factors, including the annealing schedule, choice of neighborhood, and so on. We restrict discussion to the length of the schedule. We experimented with a linearly decreasing temperature T, using a number of runs. In each run, the annealing was set to be 10× slower than the previous run. We stop when little (less than 1%) or no improvement is recorded over the previous run. The number of iterations is plotted as a function of region size and primer-density in Supplemental A.3. Interestingly, the number of iterations peaks around 1Mb, decreasing again. Once again, primer-dimers in large regions severely restrict the search space, leading to fast convergence, but low coverage. The number of iterations also increases with increasing ρ due to the added flexibility in selection.

⁵ 1/10th as dense as the normal sequence, corresponding to 88, 179, and 262 primers for the 50, 100, and 150kb regions, respectively.
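The run-lengthening schedule described above (each run 10× slower than the last, stopping below 1% improvement) can be sketched as follows. Here `anneal` is a hypothetical black box that runs one annealing pass and returns the best coverage found; all names are illustrative:

```python
def run_schedule(anneal, t0, base_iters, tol=0.01, max_rounds=6):
    """Repeat annealing runs, each 10x longer than the previous one,
    until the coverage improvement over the prior run drops below `tol`.
    `anneal(iters, t0)` is a hypothetical black box returning the best
    coverage achieved in a single run."""
    best, iters = 0.0, base_iters
    for _ in range(max_rounds):
        cov = anneal(iters, t0)
        if cov - best < tol:            # little or no improvement: stop
            return max(best, cov)
        best, iters = cov, iters * 10   # next run is 10x slower
    return best
```

With a stub `anneal` whose result saturates once the run is long enough, the loop stops as soon as a longer run no longer helps.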

2.6 Discussion

We have shown a method to design appropriate sets of primers for PAMP that cover a region without dimerizing, map uniquely in the genome, and possess the requisite physico-chemical characteristics for the PCR reaction. Using this design in multiplex PCR allows us to detect most deletions within a given region. However, the real advantage of this method is that protocols can be designed for any structural variation that brings two disparate genomic regions together. Therefore, deletions, inversions, translocations, and transpositions can all be assayed with PAMP and appropriate primer designs. A critical part of our study is the formulation of the problem as a combinatorial optimization problem with the goal of improving coverage, while satisfying a collection of constraints. This is not simply an academic question. A diagnostic test will fail in patients for whom the deletion boundaries lie within uncovered regions. The greater the uncovered region, the higher the failure probability. In this respect, bringing the coverage up from < 90% to over 97% is very significant. We also provide an ILP formulation, which guarantees true optimality, but did not converge in our formulation. We are exploring a number of alternative approaches, using different cut inequalities to speed up ILP convergence. Even if we guarantee optimality, the ubiquitous dimerization will keep coverage low for large regions. For these regions, we have alternative formulations with more complex multiplexing scenarios to improve coverage. Future work will address the optimization for those problems. Other technologies have been developed for assaying structural changes in tumor genomes, such as BAC End Sequence Profiling [169] and array CGH [120]. However, the ability of array CGH to detect alterations is impeded by the presence of normal cells or other genomic heterogeneity in the tumor sample.
End Sequence Profiling (ESP) can potentially overcome genomic heterogeneity with deep sequencing, but at great expense. Moreover, it is not clear how to restrict ESP to specific regions of the genome. In contrast, the selective amplification of the structurally modified region allows PAMP to detect even weak signals in a heterogeneous population. PAMP could become the technique of choice for probing of specific variants in cancer patients, although the de novo discovery of such variants will still rely on array CGH, ESP, or other techniques.

2.7 Acknowledgements

Chapter 2 (with Appendix A) was published in Bioinformatics, Vol. 23, pp. 2807–2815, 2007: A. Bashir, Y-T. Liu, B. Raphael, D. Carson, and V. Bafna, "Optimization of primer design for the detection of variable genomic lesions in cancer". The dissertation author was the primary investigator and author of this paper.

Chapter 3

Two-Sided PAMP and Alternating Multiplexing

3.1 Introduction

The PAMP optimization suggested in the previous chapter had several limitations. Before addressing these issues, let us briefly restate the PAMP optimization problem. Recall, as shown in Figure 3.1, that the genomic region of interest is tiled by forward (denoted by p) and reverse (p̄) primers. All of the primers are incorporated into a multiplex tube (or tubes), along with the query DNA. The primers are spaced so that a deletion at any boundary will bring a primer pair close enough to be amplified. In Figure 3.1, the deletion at (x, y) brings p_1 and p̄_3 close together. The amplified product is hybridized to a set of probes, denoted by locations b_1, b_2, . . ., and detected on an array. Successful hybridization confirms the deletion event, and also resolves the deletion boundary to within 1kb. In spite of its apparent simplicity, successful deployment of PAMP requires solving a challenging combinatorial optimization problem relating to the design of primers. Intuitively, the goal is to maximize the detection of any possible deletion while minimizing the number of multiplex reactions such that no primer-primer interactions

occur within a reaction. The optimization is critical, as each missed deletion is a potentially undetected tumor in an individual. Likewise, we cannot overstate the need for computation, as the design can involve the selection of thousands of primers from tens of thousands of candidate primers, and even a single primer-primer interaction within a multiplex reaction nullifies all signals [13].

Primer design for PAMP: Formally, a primer-design is described by a set of forward and reverse primers

P = (P, P̄) = {(p_n, . . . , p_2, p_1), (p̄_1, p̄_2, . . . , p̄_ν)}

The genomic locations of the primers are denoted by

l_n < . . . < l_1 < l̄_1 < . . . < l̄_ν

Consider a design P = (P, P̄). Define a breakpoint as a pair of genomic coordinates (x, y) that come together because of a mutation. Breakpoint (x, y) is amplifiable by a primer pair (p_i, p̄_j) if (x − l_i) + (l̄_j − y) ≤ d, where d (≈ 2000) is a known parameter constrained by limitations of PCR. For each breakpoint (x, y), denote its coverage-cost C^P(x, y) = 1 if it is not amplifiable by a pair in P, and C^P(x, y) = 0 otherwise. This allows us to define the two-sided cost of tiling a genomic region by primers as Σ_x Σ_y C^P(x, y). The critical constraint in selecting primers is dimerization: no pair of primers in a single multiplex reaction may hybridize to each other. One way to solve the dimerization problem is simply to put each pair of primers in its own multiplex tube. With 500 forward and 500 reverse primers in a typical design, this is not tenable. We consider the following:

The PAMP design problem:

Input: A positive integer m, and forward and reverse candidate regions, with a collection of corresponding forward and reverse primers.

Output: A primer design over m multiplex tubes, so that no primer pair dimerizes in a multiplex reaction, and the coverage cost Σ_x Σ_y C^P(x, y) is minimized.
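For intuition, the two-sided coverage cost can be evaluated by brute force on a toy instance. The Python sketch below enumerates every breakpoint and counts those not amplifiable under the definition above; all names are hypothetical, and a real design would never enumerate breakpoints one base at a time:

```python
def coverage_cost(forward, reverse, d, region_x, region_y):
    """Brute-force two-sided coverage cost: count breakpoints (x, y) that
    no pair (p_i, p_j) amplifies, i.e. no pair with l_i <= x, l_j >= y,
    and (x - l_i) + (l_j - y) <= d. Illustrative only."""
    cost = 0
    for x in range(*region_x):
        for y in range(*region_y):
            amplifiable = any(
                li <= x and lj >= y and (x - li) + (lj - y) <= d
                for li in forward for lj in reverse)
            cost += 0 if amplifiable else 1
    return cost
```

With a single forward primer at 0 and a single reverse primer at 10, a large d covers every breakpoint (cost 0), while shrinking d leaves the far corners of the breakpoint space uncovered.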

We will add additional constraints to this formulation in the next section. We previously addressed this problem using a heuristic approach with a simplified coverage function which performed naive multiplexing (see [13]), hereafter termed PAMP-1D. PAMP-1D optimizes a one-sided variant of cost by penalizing each pair of adjacent primers separated by more than half of the amplifiable distance. The advantage is computational expediency, as the change in coverage due to primer addition/deletion is easily computed. Even though PAMP-1D was able to handle a larger number (tens of thousands) of input primers and converge in a reasonable amount of time (minutes to hours), its coverage function underestimated true breakpoint coverage. Especially for non-symmetric regions, designs are greatly improved by allowing longer primer gaps on the larger side, and vice-versa. The two-sided problem is very difficult to optimize; previous approaches were impractical for most data-sets, handling only < 100 input primers [32]. In our application, even simple input sets would entail thousands of potential primers. Finally, initial experimental data showed fundamental flaws in these approaches (see Results). First, the initial formulation makes the reasonable, but incorrect, assumption that an amplified PCR product can be readily detected by a corresponding probe. Second, it does not carefully account for other interactions, such as 'mispriming' (described later), which may also lead to false negatives.

In this manuscript, we seek to redress these shortcomings with PAMP-2D. In Section 3.2.1, we describe the flaw in the PAMP-1D design and solve it using a novel alternating multiplexing scheme. We follow this by describing a generic framework for optimal alternating multiplexed primer design. The actual optimization is performed using simulated annealing. In Section 3.2.2, we describe a data-structure to speed up the simulated annealing iteration. Simulations on multiple genomic regions demonstrate the efficacy and efficiency of the proposed approach (Section 3.3). To test the overall methodology, we assayed events in 3 cell-lines: CEM, A549, and MOLT4. We show that in each case, we can detect the lesion of interest, even in regions with multiple rearrangements.

3.2 Methods: A multiplexed approach to PAMP design

3.2.1 Amplification ≠ Detection

We begin by noting that amplification of a deleted genomic region is not synonymous with detection of the region. Figure 3.1 shows an example of this. Each primer is associated with a proximal downstream probe on the array, in order to detect the amplified product. Note that the probe and the primer locations cannot match, because residual primers in the solution will hybridize to the array, giving a false signal. This leads to an 'orphan' region between the primer and the probe. As an example, the point x in Figure 3.1 lies between primer p_1 and its probe b_1. When the region (x, y) is deleted, p_1 and p̄_3 come together, and are amplified. However, the amplified region does not contain sequence complementary to any probe on the left-hand side. Consequently, there is no signal corresponding to the left breakpoint. The solution to this detection problem

Figure 3.1: Schematic of PAMP detection failure. The breakpoint (x, y) results in an amplified product, but is not detected on the left side.

lies in recognizing that had the product (p_2, p̄_3) been amplified, the left breakpoint would be detected by hybridization to probe b_2. However, even when (p_2, p̄_3) is

close enough to be amplified, it is out-competed by (p_1, p̄_3) and will not amplify. One possible solution is to add p_1 and p_2 to different multiplex tubes. Multiple multiplex reactions are a practical necessity in most cases, as it is challenging to amplify products with a large number of primers in a single tube [40]. Indeed, the experimental design for PAMP, as first proposed by Liu and Carson [82], consists of selecting groups of adjacent primers upstream and downstream of a putative breakpoint, in which each set is one half the desired multiplex size. Nevertheless, increased multiplexing increases the complexity and cost of the experiment, and must be controlled. We address these issues through a novel alternating multiplexing strategy.

Alternating multiplexing for primer design: Consider a primer design P = (P, P̄). Each forward primer p_i is associated with a downstream probe located at b_i (l_i < b_i). We abuse notation slightly by using b_i to denote both the probe and its location. Also, b_i ≤ b_(i−1) for all i, with equality indicating that p_i and p_(i−1) share the same probe. In this notation, probes b_i and b_j are adjacent (on the genome) if there exists i ≥ k > j such that

b_i = b_(i−1) = . . . = b_k < b_(k−1) = . . . = b_j

As an example, probes b_3 and b_1 are adjacent in Figure 3.1, as are b_2 and b_1, but not b_3 and b_2. Analogous definitions apply for reverse primers, whose probes are located at b̄_1 ≤ b̄_2 ≤ . . . ≤ b̄_ν, where b̄_i < l̄_i for all i.
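Because probe positions are monotone in the primer index, the chain condition above reduces to checking that the probe positions between the two indices take exactly two distinct values. A minimal Python illustration (the helper name and mapping are ours, not the dissertation's):

```python
def probes_adjacent(b, i, j):
    """b maps primer index -> probe position, non-increasing in the index
    (b[i] <= b[i-1]). Probes i and j (i > j) are adjacent iff positions
    b[j..i] take exactly two distinct values, which is equivalent to the
    chain b_i = ... = b_k < b_(k-1) = ... = b_j under monotonicity."""
    assert i > j
    vals = {b[k] for k in range(j, i + 1)}
    return len(vals) == 2

# Figure 3.1's example: b3 = b2 < b1.
b = {1: 100, 2: 50, 3: 50}
```

Here probes 3 and 1 are adjacent, as are 2 and 1, but not 3 and 2, matching the text.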

Definition 1: Breakpoint (x, y) is left-detectable (respectively, right-detectable) by a primer-design P if it is amplifiable by some primer pair (p_i, p̄_j), and l_i < b_i < x (respectively, l̄_j > b̄_j > y).

The set of left-detectable and right-detectable breakpoints is denoted by

S_X(P) = {(x, y) | (x, y) is left-detectable by P},

S_Y(P) = {(x, y) | (x, y) is right-detectable by P}

As a breakpoint might not be detected even when it is detectable, we define

Definition 2: Breakpoint (x, y) is left-detected (respectively, right-detected) by P if it is left-detectable (respectively, right-detectable), and the following are satisfied: (a) no pair of primers p, p′ ∈ P dimerize (non-dimerization); and (b) (x, y) is not amplified by any p_i, p̄_j ∈ P with l_i < x < b_i, or l̄_j > y > b̄_j (non-detection). When P is used in a single reaction, we denote the set of left-detected (respectively, right-detected) breakpoints as S*_X(P) ⊆ S_X(P) (respectively, S*_Y(P) ⊆ S_Y(P)). By partitioning P into multiple multiplexing tubes it may be possible to obtain better coverage. Simply, one could run each forward and reverse primer in its own multiplex reaction, so that

∪_{i,j} S*_X(p_i, p̄_j) = S_X(P)

The idea behind alternating multiplexing is simple: primers p_i, p_j (correspondingly, p̄_i, p̄_j) can be added to the same multiplex set only if their downstream probes b_i and b_j (correspondingly, b̄_i, b̄_j) are not adjacent. Formally, first order all the unique probes by their genomic position (independently for the left and right sides), and then number them in increasing order. Partition P into P^0, P^1, where P^0 contains all primers whose probe is "even" numbered, and P^1 contains all primers whose probe is "odd" numbered. Similarly, define P̄^0, P̄^1. The multiplexing reactions are given by choosing each of the 4 forward × reverse partitions. In Figure 3.1, this scheme would place p_2, p_3 in different multiplex sets. Before we show that alternating multiplexing optimizes detection, we define a technical term. A design of forward (likewise, reverse) primers is non-trivial if there exists at least one pair of primers p_i, p_j with 0 < l_i − l_j < d, and b_i ≠ b_j. The definition is introduced for technical reasons only, as any useful design must be non-trivial.
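The even/odd partitioning step can be sketched directly from its definition. Primers sharing a probe receive the same parity, so probes that are adjacent (consecutive distinct positions) always land in different sets. All names below are illustrative:

```python
def alternating_partition(primers, probe_of):
    """Split primers into two sets by the parity of their probe's rank
    among the unique probe positions. `primers`: primer ids; `probe_of`:
    primer id -> probe position. Illustrative sketch."""
    positions = sorted(set(probe_of[p] for p in primers))
    rank = {pos: i for i, pos in enumerate(positions)}
    even = [p for p in primers if rank[probe_of[p]] % 2 == 0]
    odd = [p for p in primers if rank[probe_of[p]] % 2 == 1]
    return even, odd
```

For the Figure 3.1 configuration (b3 = b2 < b1), p2 and p3 share the even set while p1 goes to the odd set, so the competing products end up in different reactions.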

Theorem 1: Let P be a design with no dimerizing pairs. Then alternating multiplexing allows us to detect all detectable breakpoints. In other words,

∪_{a,b∈{0,1}} S*_X(P^a × P̄^b) = S_X(P), and ∪_{a,b∈{0,1}} S*_Y(P^a × P̄^b) = S_Y(P)

Further, if P, P̄ are non-trivial, then the multiplexing is the minimal necessary to achieve detectability.

Proof: See Appendix. 

Note that Theorem 1 works only for non-dimerizing sets. In earlier work, we have shown that the achievable coverage diminishes with larger sets of primers because of dimerization. To connect dimerization, detection, and multiplexing, we start by defining a primer-dimer-adjacency graph G, whose nodes are defined by the set of primers. A primer-pair is edge-connected if either of the following is true: a) the primers dimerize, or b) they are adjacent and in the same orientation. Theorem 2 is based on the property that a coloring of this graph partitions primers into non-dimerizing, alternatingly multiplexed sets.

Theorem 2: Let each forward (reverse) primer in P be edge-connected with at most ∆ (∆̄) other forward (reverse) primers, and suppose there is no dimerization between any forward-reverse pair. Then, there exists a design with no more than ∆ · ∆̄ multiplex reactions for which all breakpoints in S_X(P) (respectively, S_Y(P)) are left-detected (respectively, right-detected).

Proof: Based on Brooks' theorem [21]. See Appendix.

Theorems 1 and 2 guide an overall iterative strategy that provides the best coverage, given a bound on the number of multiplex reactions. Assume for now that we have a procedure OptCoverage(G, ∆, ∆̄) that takes G and numbers ∆, ∆̄ as input, and returns an optimized design (P, P̄). Specifically, it returns the sub-graph induced by (P, P̄) in which a) each forward (respectively, reverse) primer is edge-connected to ≤ ∆ (≤ ∆̄) forward (reverse) primers; and b) no forward-reverse primers are edge-connected. We can use Theorems 1 and 2 to obtain the same coverage using ≤ ∆ · ∆̄ multiplexing. The following section will describe the OptCoverage procedure. The overall algorithm is motivated by the fact that Theorem 2 only provides a weak upper bound of ∆ · ∆̄ on the required multiplexing. If the actual number of multiplex reactions is smaller than the number available, we adjust and iterate.

Procedure PrimerDesign(m, G, L, L̄)  (* L, L̄ represent the lengths of the two regions *)

1. Let ∆ = ⌈√(mL/L̄)⌉, ∆̄ = ⌈√(mL̄/L)⌉  (* initial estimate *)
2. (P, P̄) = OptCoverage(G, ∆, ∆̄)  (* return the sub-graph induced by (P, P̄) with ≤ ∆ (∆̄) edges per forward (reverse) primer, and optimal coverage *)
3. Compute ∆̂, ∆̂̄ = MaxAdjacency(P, P̄)
4. Use the Welsh and Powell algorithm [175] to color P (and P̄). Return m̂ (and m̂̄) colors.
5. If |m − m̂ · m̂̄| is large, adjust ∆, ∆̄; go to Step 2.
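Step 4 relies on the Welsh-Powell greedy coloring, which visits vertices in order of decreasing degree and gives each the smallest color absent from its already-colored neighbors. A compact sketch (not the dissertation's implementation; graph representation is our own):

```python
def welsh_powell(adj):
    """Welsh-Powell greedy coloring. `adj`: dict vertex -> set of
    neighbors. With maximum degree D this uses at most D + 1 colors;
    Brooks' theorem tightens the bound to D for most graphs."""
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    color = {}
    for v in order:
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:            # smallest color not used by neighbors
            c += 1
        color[v] = c
    return color
```

A triangle requires 3 colors, while a 3-vertex path needs only 2; in the PAMP setting each color class becomes one non-dimerizing multiplex set.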

Figure 3.3c shows that a large gap often exists between ∆ · ∆̄ and the true multiplexing level, especially as ∆ · ∆̄ gets large. In the following section, we describe the use of simulated annealing to optimize the design of primers.

3.2.2 Simulated Annealing for Optimization

The computational goal is to choose a design P that minimizes the coverage-cost C^P. The optimal design is chosen from candidate solutions P in which each forward primer (reverse primer) is edge-connected with at most ∆ (∆̄) other primers. We use an established simulated annealing approach to perform the optimization [71]. For any candidate solution P, define its neighborhood N_P as the set of candidate solutions that are obtained by adding or removing a primer from P. In other words,

P′ ∈ N_P iff |P − P′| ≤ 1. Let δ = C^{P′} − C^P denote the change in coverage cost in moving from P to P′. Following the simulated annealing paradigm, we move to P′ if δ < 0. If δ ≥ 0, we choose to move to P′ with probability exp(−δ/T), in order to escape local minima. While the basic paradigm has been explored in our earlier work, and elsewhere, we extend it here by addressing two key issues: 1) incorporation of probe distances/interactions into the optimization; and 2) rapid calculation of the 2D coverage-cost.
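A single move of this scheme can be written in a few lines of Python. The neighborhood and cost functions below are placeholders, not the dissertation's actual implementation:

```python
import math
import random

def anneal_step(P, T, neighbors, cost, rng=random):
    """One simulated-annealing move: propose a neighbor (add/remove one
    primer), always accept an improvement (delta < 0), and accept a cost
    increase delta >= 0 with probability exp(-delta / T)."""
    P_new = rng.choice(neighbors(P))
    delta = cost(P_new) - cost(P)
    if delta < 0 or rng.random() < math.exp(-delta / T):
        return P_new
    return P
```

Driving this step with a linearly decreasing T reproduces the schedule used in the previous chapter: improving moves are always taken, and uphill moves become rare as T falls.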

Incorporating Probes: We address the first problem by noting the direct relationship between primers and probes. Specifically, we do not need to separately iterate over primers and probes; anytime a primer is added we attempt to add its proximal downstream probe. This probe is added unless it is already present in the set (in which case no additional probe is added) or it causes a mispriming signal (in which case the next most proximal probe is examined). As the first condition is trivial, we focus our attention on the second. There have previously been rigorous attempts at identifying such mispriming pairs in a computational framework [81]. The mispriming problem in PAMP is somewhat unique; it is only problematic when it leads to a "false positive" signal. These signals occur when a primer pair anneals to another region of the genome and the amplified sequence hybridizes to a probe on the array. This allows us to create a novel formulation for the probe/mispriming problem. Define a collection of probe-misprime-triads T_m on P as follows: (p, p̄, b) ∈ T_m if and only if primers p and p̄ misprime to genomic sequence s and probe b anneals to s. Checking this could be costly if many such mispriming triads exist. In practice, enforcing this criterion has no measurable effect on time complexity, as most probes and primer-pairs are unique and |T_m| is small. This effect is, in fact, minimized prior to simulated annealing optimization by preferentially selecting probes in the input set which do not create probe-misprime-triads.

Updating Coverage: For exposition, we focus on coverage (detection will naturally follow). Recall that a breakpoint (x, y) is covered if there exists some pair of primers (p_i, p̄_j) such that (x − l_i) + (l̄_j − y) ≤ d. All such breakpoints (x, y) can be considered as points in a two-dimensional space. Consider a step in the simulated annealing when primer p_i is being added, and we need to compute the change in cost (see Figure 3.2a). We only need to consider breakpoints (x, y) where l_i < x ≤ l_(i+1). To check if (x, y) was previously covered, we only need to examine the most proximal upstream primer. This suggests a naive algorithm that scales with the length of the opposing region, L, and the amplifiable range of a PCR


product, d, yielding a time complexity of O(Ld). To make the computation more efficient, we partition the space into forward intervals D_i = (l_i, l_(i+1)], and reverse intervals, using adjacent pairs of forward and reverse primers. In Figure 3.2a these intervals correspond to regions on the x and y axes, respectively. In adding primer p_i, coverage is changed only in the forward interval D_i. The algorithm proceeds by examining rectangles R_ij = D_i × D̄_j, one for each reverse interval D̄_j. Denote the set of uncovered breakpoints in an arbitrary rectangle R as U (ignoring subscripts i and j), as in Figure 3.2b. Let T denote the total space covered by the corresponding primer pair. Observe that

Figure 3.2: Schematic of uncovered breakpoint computation. a) Diagram of a small primer set in which there are initially 4 forward and 4 reverse primers. An additional forward primer (lightly shaded) is added at position i, which reduces the uncovered space. The difference in uncovered space between p_(i−1) and p_i is seen as the difference in total shaded area compared to darkly shaded area. b) Uncovered space for the added primer at a specific reverse pair location, given by the dotted lines in a). Uncovered breakpoints are the coordinates contained in the rectangle but not in the triangle. The total number of such breakpoints is given by |U| = |R − R ∩ T| = |R| − (|T| − |T1| − |T2|).

U = R − (R ∩ T) = R − (T − T1 − T2), where T1, T2 represent the portions of T not in R (note that T1 and T2 can be empty).

Let d_i = min(|D_i|, d) and d̄_j = min(|D̄_j|, d). Then,

|T1| = (d − d_i)²/2,  |T2| = (d − d̄_j)²/2

This leads to a simple equation for calculating the amount of uncovered space |U|, as

|U| = |D_i| |D̄_j| − ( d²/2 − (d − d_i)²/2 − (d − d̄_j)²/2 )
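The closed-form update is easy to check numerically. The Python sketch below (argument names are ours) computes |U| for one rectangle, returning 0 when d_i + d̄_j < d, since the rectangle is then entirely covered and the triangle formula does not apply:

```python
def uncovered_area(Di, Dj_bar, d):
    """Closed-form uncovered area of one rectangle, following
    |U| = |Di||Dj| - (d^2/2 - (d - di)^2/2 - (d - dj)^2/2),
    with di = min(|Di|, d), dj = min(|Dj|, d). Continuous
    approximation; illustrative names."""
    di, dj = min(Di, d), min(Dj_bar, d)
    if di + dj < d:                 # rectangle entirely covered
        return 0.0
    covered = 0.5 * d**2 - 0.5 * (d - di)**2 - 0.5 * (d - dj)**2
    return Di * Dj_bar - covered
```

For a square with |D_i| = |D̄_j| = d, exactly half the rectangle is covered (the triangle below the anti-diagonal), so |U| equals half the rectangle's area, matching a direct integration.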

This update reduces the time complexity by several orders of magnitude to O(n), where n is the total number of opposing primers. Even so, the update remains expensive to compute, and must be improved further. In adding forward primer p_i, the set of values d̄_j does not change. If for some j, d_i + d̄_j < d, then D_i × D̄_j is entirely covered. To exploit this, we store all d̄_j in a MAX-HEAP, and all d_i in a second MAX-HEAP. When considering a forward primer, we scan the MAX-HEAP of d̄_j values using a BFS to get all d̄_j for which d̄_j ≥ d − d_i, for a total of O(k) steps. If we add the forward primer, we need to make O(1) updates to the MAX-HEAP. The total time is O(k) + O(lg n) per iteration. In most cases, there is either very little uncovered space or the uncovered space is relegated to a few specific regions (as in Figure 3.4), implying k << n. When optimizing for coverage and detection, we need to maintain two additional heaps with primer-probe distances, with a slightly more complex algorithm. However, the update time remains the same.
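The threshold scan over a max-heap can be illustrated with Python's `heapq` (a min-heap, so values are stored negated). By the heap property, a subtree whose root falls below the threshold can be skipped entirely, so only O(k) nodes are visited for k reported values. This is a sketch of the idea, not the dissertation's code:

```python
import heapq

def scan_at_least(heap, threshold):
    """Collect all values >= threshold from a max-heap stored as negated
    values in a `heapq` list, by BFS from the root, descending only into
    subtrees whose root still meets the threshold."""
    out, stack = [], [0] if heap else []
    while stack:
        i = stack.pop()
        if i < len(heap) and -heap[i] >= threshold:
            out.append(-heap[i])
            stack.extend((2 * i + 1, 2 * i + 2))  # visit both children
    return out

# Build a max-heap of hypothetical d_j values by pushing negatives:
h = []
for v in [5, 17, 9, 3, 12]:
    heapq.heappush(h, -v)
sorted(scan_at_least(h, 9))  # -> [9, 12, 17]
```

Raising the threshold above the root's value prunes the whole tree immediately, which is exactly the common case when most rectangles are already covered.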

3.3 Results

We simulated data-sets from two genomic regions that have been implicated in cancers. A homozygous deletion in the CDKN2A region (9p21) is an important genetic marker for multiple cancers. The lesion has been observed in multiple cell-lines and primary tumor samples, including glioblastomas, leukemias, and lung, pancreatic, and breast cancers [132]. However, the boundaries of the deletion are known to vary over a large region. Recently, the deletion was assayed and confirmed in 25/54 (46%) of adolescent ALL patients [138]. These results are based on array-CGH, which would not detect small deletions, or be useful for early diagnosis when the tumor cells are rare compared to wild-type ones. The observed deletions varied in size from 25kb all the way to the loss of an entire arm (52Mb), with a variety of intermediate-sized deletions. The overlap among all specimens was 12.5kb. Thus the CDKN2A region is a prototypical case for PAMP. The second example comes from recently mapped deletions in the TMPRSS2 region (21q22.3), which fuse the 5' UTR of the TMPRSS2 gene with ETS transcription factors (ERG, ETV1, or ETV4), resulting in over-expression of these genes and progression of prostate cancer [164, 174]. The two regions are also good test cases in that the CDKN2A fusing regions are symmetric, while the TMPRSS2 region is non-symmetric, and not as amenable to a one-sided cost function.

3.3.1 Simulations

The detectability achieved varies considerably according to the genomic regions in question. Therefore, we benchmarked the overall performance of PAMP-2D across a spectrum of sequence sizes (10, 50, 100, 200kb) for both the forward and reverse primer regions. For each pair of sizes, 10 corresponding pairs of genomic regions were randomly selected, making 160 unique input sets. PAMP-1D and PAMP-2D were run on each of these sets. Figure 3.3a shows that PAMP-2D is superior to PAMP-1D over all input sample sizes. Much of the improvement is in detection due to the use of alternating multiplexing. However, the performance remains superior to PAMP-1D even when alternating multiplexing is incorporated in PAMP-1D, particularly in non-symmetric regions (Figure 3.3b, raw results available in the online supplement). The performance of all methods degrades for large regions (≥ 500kb) due to increased dimerizations. To improve detection for these large regions, increased multiplexing is important. Figure 3.3c shows the improvement observed in the transition to an increasing number of multiplexing sets (represented by ∆ · ∆̄) for a non-symmetric 500kb × 50kb region. Saturation occurs prior to reaching complete coverage, partially because in some regions it is simply not feasible to add primers and probes. In this region, we also performed a specific optimization for 50 multiplex sets using the aforementioned PrimerDesign procedure. A symmetric strategy (∆ = 7, ∆̄ = 7) provides only 94% coverage (Figure 3.3c). The non-symmetric initial solution (∆ = 22, ∆̄ = 2) provided 97% coverage with a "true" multiplexing level m = 38. Iterating with adjusted values (∆ = 11, ∆̄ = 6), we achieved 98% coverage with m = 50. A significantly more complex multiplexing strategy could help further improve coverage and will be explored in future research.

3.3.2 Left vs. Right Breakpoint Detection

Figure 3.4 shows the results of our design on the TMPRSS2-ERG region (240kb × 20kb), with the obvious conclusion that less than 5% of the breakpoints remain unde- tected on either side (overall coverage 98%). Interestingly, the figure also differentiates between coverage (amplification), which is symmetric, and detection, which is not. To explain, note that more breakpoints are not detected on the left (ERG) side compared to the right (TMPRSS2). This is largely due to the presence of several large, highly con- served, LINE elements, in ERG introns (corresponding to vertical bands of uncovered breakpoints in Figure 3.4a). While it was possible to design primers in these regions, it was nearly impossible to design unique probes. The primers allowed breakpoints to be amplified and right-detected (by TMPRSS2 probes), but not left-detected. In some repetitive regions, it was difficult to even design unique primers. When the sequence is not amplified, neither the left, nor the right end-point is detected, observable as shaded regions at the same breakpoints in Figure 3.4a and 3.4b. Figure 3.4c,d shows the total length of uncovered sequenced on each axis (1 dimensional) contained within highly- conserved “Repeat Blocks”. We see that as a fraction of the total sequence length, these regions are quite small. In the case of ERG, only a small fraction of total “Repeat Block” space is uncovered. A similar coverage was obtained for CDKN2A region (data not shown). 46

[Figure 3.3 appears here: (a) surface plot of breakpoint detection for PAMP-1D vs. PAMP-2D; (b) comparison with the alternating strategy; (c) breakpoint detection with increasing Δ for a 550kb region (500k + 50k), annotated with the true multiplexing level (via Welsh-Powell).]

Figure 3.3: Performance of PAMP-2D. (a) The surface plot shows that a significant benefit in detectable coverage is seen when comparing PAMP-1D (blue) to PAMP-2D (red). (b) Applying the alternating strategy to PAMP-1D significantly improves its coverage. However, PAMP-2D consistently obtains better coverage (especially in non-symmetric regions). (c) As allowed multiplexing in the final primer set increases, the resulting coverage increases. Red values represent the 'true' number of multiplex reactions at each data point (as predicted via the Welsh-Powell algorithm).

[Figure 3.4 appears here: (a) left-detected and (b) right-detected breakpoint scatter plots for the TMPRSS2-ERG fusion, plotted by genomic position on ERG vs. TMPRSS2; (c) and (d) Venn diagrams: TMPRSS2 interval (20 kb), Repeat Blocks (1.5 kb), Uncovered (0.65 kb); ERG interval (240 kb), Repeat Blocks (39.5 kb), Uncovered (3.4 kb).]

Figure 3.4: Left and right detectability. The spots represent all undetected breakpoints (x, y) from the joining of TMPRSS2 and ERG. a) Left-detected breakpoints resulting from fusion of the TMPRSS2-ERG region. b) Right-detected breakpoints. In both cases, the fraction of undetectable breakpoints is less than 5% of all possible breakpoints. c) and d) Venn diagrams (to scale) showing the overlap of missed regions with highly-conserved repeats. The outermost circle represents the entire length of the corresponding axis. The "Repeat Block" shaded area corresponds to the sum of lengths for all repeat regions that are over 75% conserved, continuous (or with < 100 bp between conserved repeats), and over 500 bp long. "Uncovered" corresponds to the vertical (c) and horizontal (d) stripes of uncovered space in figures a) and b) respectively. In both diagrams "Uncovered" is completely contained within "Repeat Blocks".

3.3.3 Experimental Confirmation of CDKN2A

We had previously reported a design for the CDKN2A region, optimized using the single-sided scoring function. The design spanned 600kb, incorporating 600 primers [13]. Our PAMP-1D design successfully verified the deletion breakpoint in the Detroit 562 cell-line, but could not detect the lesion in 3 other cell-lines (Molt-4, CEM, and A549). Here, we report positive results on the 3 cell lines using a novel design that includes alternating multiplexing (see Appendix for modifications to experimental design). Note that two of the three (CEM and MOLT4) resulted in experimental failure using PAMP-1D designs. In each case, the tests confirmed breakpoint boundaries previously described in the literature [82, 72, 138]. See Figure 3.5 for an overview of the results. Confirmatory sequencing validated the breakpoint boundaries (Figure 3.5a-c). Full sequencing results of the amplified primer products, along with results for additional cell-lines, can be found in the online supplement. Note that in the absence of alternating multiplexing, the forward primer at 21,817,940 precludes left-detection of the CEM breakpoint (Figure 3.5a). Interestingly, the A549 cell-line has a discontinuous 290kb CDKN2A deletion within which there is an internal 325 bp inversion. The array design successfully captured both left and right breakpoints as well as an internal event (Figure 3.5d), indicating that the technique can be successful in detecting breakpoint boundaries even in complex regions with multiple rearrangements.

3.3.4 Running time

The update operation described in Section 3.2.2 is critical to the success of PAMP-2D. A naive computation of coverage for a 200 × 10kb region requires > 0.05 CPU-min per iteration¹. In contrast, our optimized computation runs in < 1.5 × 10⁻⁷ CPU-min per iteration (including both coverage computation and updates). Both tests were run on a 3.4 GHz Pentium 4 with 1 GB RAM. Even so, the designs involve a very complex optimization. The simulations required a total of ∼7000 CPU hours (ranging from as little as 2 minutes on average for 10 × 10kb regions to 26 hours for 500 × 500kb regions).

¹This represents the largest region for which it was possible to complete even a short test run.

[Figure 3.5 appears here: sequencing traces with breakpoint coordinates for the CEM, MOLT4, and A549 cell-lines, and the A549 array results.]

Figure 3.5: Detecting rearrangements in cell-lines with complex rearrangements. Sequencing results confirming the breakpoint locations in a) CEM, b) MOLT4, and c) A549 cell-lines. The presence of multiple forward primers in CEM requires the use of alternating multiplexing. d) Array results for the A549 cell line. Note that the array not only captures the left and right breakpoints, but also an inserted inversion. The remainder of the spots correspond to non-specific background signals (corresponding to repeat locations) present across runs.

3.4 Discussion

Our results provide solid evidence of the feasibility of this approach for early diagnosis of cancer. The alternating primer scheme ensures that all breakpoints that can possibly be detected are detected. This scheme, in the non-dimerizing case, represents the minimal number of multiplexing reactions possible to achieve this optimal breakpoint coverage. We provide the number of multiplexing reactions as a parameter to be chosen by the experimentalist. This allows a trade-off between coverage and experimental cost/complexity. Other important trade-offs can factor into the decision making process. If one simply seeks to determine the presence of a rearrangement, then detection on either side is acceptable. In some cases, it is important to have positional information for both the left and right breakpoint coordinates. For example, the amplifying primer pair could be used individually in follow-up tests for the individual (thereby saving cost and making a more reliable assay). Also, the predicted breakpoint can be validated via sequencing or being run on a gel. In both cases, simply amplifying the event is insufficient.

A key point of debate is the choice of relatively older technologies (PCR and hybridization), given the rapid development of new parallel sequencing technologies. To explain our choice, note that there are two facets to our strategy: a) PCR allows for the amplification of weak signals from the cancer sequence; and b) oligonucleotide arrays allow for cost-effective and reliable detection. On the face of it, high-throughput sequencing approaches appear to be a good alternative, as such approaches are cost-effective per base. However, without amplification one would be primarily sequencing background DNA, not the cancerous signal. An enormous depth of coverage (and therefore cost) would be necessary to ensure detection of a weak cancerous signal.
Additionally, once a mutation is detected in the individual, resequencing is a costly follow-up, while PAMP returns a custom pair of primers specific to that lesion event. Second, hybridization yields an unambiguous detection of the PCR amplification. Sequencing could be used in lieu of hybridization to detect PCR-amplified mutants, but this is more challenging than it appears. There is always the possibility of amplifying background DNA (returning to the mispriming problem) or sequencing non-amplified DNA (especially if no true lesion exists). These would not hybridize to the probe, but would confound sequence-based analyses and the reconstruction of the breakpoint. Such problems are magnified by artifacts inherent to multiplexing, which could lead to several non-specific amplifications in addition to the targeted breakpoint. Moreover, there is a fixed cost (several thousand dollars) for running a single sample, which makes for an expensive early diagnostic, or even regular follow-up exam to see cancer progression or remission in a single individual, whereas custom arrays are fairly cost-effective.

A significant remaining challenge is that our coverage drops off for larger regions (≥ 500kb). The primary reason for this is an inherent requirement in our design that each forward primer must be multiplexed with every reverse primer, and therefore cannot dimerize with it. With increased sizes, each forward primer is constrained to not dimerize with many reverse primers, which severely reduces the number of primers, and coverage. One way around this is to use a flexible multiplexing scheme. Subsets of forward primers can be permitted to dimerize with subsets of reverse primers as long as they are never in the same multiplex reaction.
While this works in principle, optimizing such designs would require a substantial increase in the total number of primers (as multiple primers spanning the same genomic region would be necessary), the number of multiplexing sets, and the overall experimental complexity. As these approaches move to a more industrial or automated setting, it will become increasingly important to solve these more complex optimization problems.

3.5 Acknowledgements

Chapter 3 (with Appendix B), is accepted at the 13th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2009), A. Bashir, Q. Lu, B. Raphael, D. Carson, Y-T. Liu, and V. Bafna, "Optimizing PCR assays for DNA based cancer diagnostics". The dissertation author was the primary investigator and author of this paper.

Chapter 4

Evaluation of paired-end sequencing strategies: applications to gene fusion

4.1 Introduction

Cancer is a disease driven by selection for somatic mutations. These mutations range from single nucleotide changes to large-scale chromosomal aberrations such as deletions, duplications, inversions and translocations. While many such mutations have been cataloged in cancer cells via cytogenetics, gene resequencing, and array-based techniques (i.e. comparative genomic hybridization), there is now great interest in using genome sequencing to provide a comprehensive understanding of mutations in cancer genomes. The Cancer Genome Atlas (http://cancergenome.nih.gov/index.asp) is one such sequencing initiative that, in its pilot phase, focuses sequencing efforts on point mutations in coding regions. This approach largely ignores copy-neutral genome rearrangements, including translocations and inversions. Such rearrangements can create novel fusion genes, as observed in leukemias, lymphomas, and sarcomas [97, 88, 75]. The canonical example of a fusion gene is BCR-ABL, which results from a characteristic translocation (termed the "Philadelphia chromosome") in many patients with chronic myelogenous leukemia (CML) [75]. The advent of Gleevec, a drug targeted to the BCR-


ABL fusion gene, has proven successful in treatment of CML patients [37], invigorating the search for other fusion genes that might provide tumor-specific biomarkers or drug targets. Until recently, it was generally believed that recurrent translocations and their resulting fusion genes occurred only in hematological disorders and sarcomas, with few suggesting that such recurrent events were prevalent across all tumor types, including solid tumors [95, 96]. This view has been challenged by the discovery of a fusion between the TMPRSS2 gene and several members of the ERG protein family in prostate cancer [164] and the EML4-ALK fusion in lung cancer [153]. These studies raise the question of what other recurrent rearrangements remain to be discovered.

One strategy for genome-wide, high-resolution identification of fusion genes and other large-scale rearrangements is paired-end sequencing of clones, or other fragments of genomic DNA, from tumor samples. The resulting end-sequence pairs, or paired reads, are mapped back to the reference human genome sequence. If the mapped locations of the ends of a clone are "invalid" (i.e. have abnormal distance or orientation), then a genomic rearrangement is suggested (see Figure 4.1 and Methods). This strategy was initially described in the End Sequence Profiling approach [170] and later used to assess genetic structural variation [170, 166]. An innovative approach utilizing SAGE-like sequencing of concatenated short paired-end tags successfully identified fusion transcripts in cDNA libraries [135]. Present and forthcoming next-generation DNA sequencers hold promise for extremely high-throughput sequencing of paired-end reads. For example, the Illumina Genome Analyzer will soon be able to produce millions of paired reads of approximately 30 bp from fragments of length 500 to 1000 bp [15], while the SOLiD system from Applied Biosystems promises 25 bp reads from each end of size-selected DNA fragments of many sizes [59].
Similar strategies coupling the generation of paired-end tags with 454 sequencing have also been described [106, 64]. Whole-genome paired-end sequencing approaches allow for a genome-wide survey of all potential fusion genes and other rearrangements in a tumor. This approach holds several advantages over transcript or protein profiling in cancer studies. First, discovery of fusion genes using mRNA expression [164], cDNA sequencing, or mass spectrometry [39] depends on the fusion genes being transcribed under the specific cellular conditions present in the sample at the time of the assay. These conditions might be different than those experienced by the cells during tumor development. Second, measurement of fusions at the DNA sequence level focuses on gene fusions due to genomic rearrangements and thus is less impeded by splicing artifacts or trans-splicing [92]. Finally, genome sequencing can identify more subtle regulatory fusions that result when the promoter of one gene is fused to the coding region of another gene, as in the case of the c-Myc oncogene fusion with the immunoglobulin gene promoter in Burkitt's lymphoma [30].

In this paper, we address a number of theoretical and practical considerations for assessing cancer genome organization using paired-end sequencing approaches. We are largely concerned with detecting a rearrangement breakpoint, where a pair of non-adjacent coordinates in the reference genome is adjacent (i.e. fused) in the cancer genome. In particular, we extend this idea of a breakpoint to examine the ability to detect fusion genes. Specifically, if a clone with end sequences mapping to distant locations identifies a rearrangement in the cancer genome, does this rearrangement lead to formation of a fusion gene? Obviously, sequencing the clone will answer this question, but this requires additional effort/cost and may be problematic; e.g.
most next-generation sequencing technologies do not "archive" the genome in a clone library for later analysis (for the sake of simplicity we will use the term "clone" to refer to any contiguous fragment that is sequenced from both ends). We derive a formula for the probability of fusion between a pair of genomic regions (e.g. genes) given the set of all mapped clones and the empirical distribution of clone lengths. These probabilities are useful for prioritizing follow-up experiments to validate fusion genes. In a test experiment on the MCF7 breast cancer cell-line, 3,201 pairs of genes were found near clones with aberrantly mapping end-sequences. However, our analysis revealed only 18 pairs of genes with a high probability (> 0.5) of fusion, of which six were tested and five experimentally confirmed (Table 4.2.4).

The advent of high-throughput sequencing strategies raises important experimental design questions in using these technologies to understand cancer genome organization. Obviously, sequencing more clones improves the probability of detecting fusion genes and breakpoints. However, even with the latest sequencing technologies, it would be neither practical nor cost-effective to shotgun sequence and assemble the genomes of thousands of tumor samples. Thus, it is important to maximize the probability of detecting fusion genes with the least amount of sequencing. This probability depends on multiple factors, including the number and length of end-sequenced clones, the length of genes that are fused, and possible errors in breakpoint localization. Here, we derive (theoretically and empirically) several formulae that elucidate the trade-offs in experimental design of both current and next-generation sequencing technologies. Our probability calculations and simulations demonstrate that even with current paired-end technology we can obtain an extremely high probability of breakpoint detection with a very low number of reads. For example, more than 90% of all breakpoints can be detected with paired-end sequencing of less than 100,000 clones (Table 4.1). Additionally, next-generation sequencers can potentially detect rearrangements with a greater than 99% probability and localize the breakpoints of these rearrangements to intervals of less than 300 bp in a single run of the machine (Table 4.1).

4.2 Results

4.2.1 Computing probability of a fusion gene

Given a set of clones from a cancer genome, we want to compute the probability that these clones identify a fusion gene in the cancer genome, i.e. a fusion of two different genes from the reference genome. We consider the cancer genome as a rearranged version of the reference human genome and assume that there exists a mapping between coordinates of the two genomes. The reference genome is described by a single interval of length G; i.e. we concatenate multiple chromosomes into a single coordinate system.

[Figure 4.1 appears here: (a) schematic of a clone from the tumor genome spanning a fused gene pair, with observed length L = (a − xC) + (b − yC) on the reference genome; (b) the trapezoid of consistent breakpoints, bounded by clone sizes 0, LMin, and LMax, in (Gene U, Gene V) coordinate space.]

Figure 4.1: Schematic of Breakpoint Calculation. (a) The endpoints of a clone C from the tumor genome map to locations xC and yC (joined by an arc) on the reference genome that are inconsistent with C being a contiguous piece of the reference genome. This configuration indicates the presence of a breakpoint (a, b) that fuses at ζ in the tumor genome. (b) The coordinates (a, b) of the breakpoint are unknown but lie within the trapezoid described by equation (4.1). The observed length of the clone is given by LC = (a − xC) + (b − yC). The rectangle U × V describes the breakpoints that lead to a fusion between genes U and V.

We define a breakpoint (a, b) as a pair of non-adjacent coordinates a and b in the reference genome that are adjacent in the tumor genome. Correspondingly, we define the fusion point¹ as the coordinate ζ in the tumor genome such that the point a maps to ζ and the point b maps to ζ + 1. Consider a clone C containing ζ. If the breakpoints a and b are far apart (e.g. on different chromosomes), then the endpoints of C will map to two locations xC and yC on the reference genome that are inconsistent with C being a contiguous fragment of the reference genome (Figure 4.1a). In this case, we say that

(xC, yC) is an invalid pair [127]. Observing an invalid pair (xC, yC) does not identify the breakpoint (a, b) exactly. However, if we know that the length of the clone C lies within the range [Lmin, Lmax], and we assume that: (i) only a single breakpoint is contained in a clone; and (ii) a > xC and b > yC (without loss of generality: see Methods); then breakpoints (a, b) that are consistent with (xC, yC) must satisfy

Lmin ≤ (a − xC ) + (b − yC ) ≤ Lmax. (4.1)

If we plot an invalid pair (xC, yC) as a point in the two-dimensional space G × G, then the breakpoints (a, b) satisfying the above equation define a trapezoid² (Figure 4.1). If multiple clones contain the same fusion point ζ, then the corresponding breakpoint (a, b) lies within the intersection I of the trapezoids corresponding to the clones. Conversely, we will assume that if the trapezoids defined by several invalid pairs intersect, then they share a common breakpoint. We call a set of clones whose trapezoids have non-empty intersection a cluster. Figure 4.2 displays a cluster of six clones from the MCF7 cell line. As the number of clones that are end-sequenced increases, more clones will contain the same fusion point and more clusters will be formed. Thus, the area of I will decrease, and correspondingly the uncertainty in the location of the fusion point decreases. Now, each gene defines an interval U = [s, t] where s is the 5′ transcription start site and t is the 3′ transcription termination site. Consider two genes with intervals U and

¹In the genome rearrangement literature, a fusion point is also called a breakpoint [116].
²A triangle when Lmin = 0.

V. The two genes are fused if there exists a breakpoint (u, v) that lies in the rectangle U × V. This breakpoint is detected if (u, v) lies in I. An approximate probability for a fusion gene is the fraction of I that lies within the rectangle U × V. We obtain a better estimate of the probability of fusion by considering the empirical distribution of clone lengths (see Methods). The exact probability of the gene fusion is given by the probability mass that lies within the intersection of I and the rectangle U × V defined by the pair of genes. An efficient algorithm for computing these probabilities is given in Methods.
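The approximate estimate above (the fraction of the trapezoid that falls inside the gene rectangle) can be sketched with a small Monte Carlo computation. The sketch below considers a single invalid pair, assumes clone lengths uniform on [Lmin, Lmax] rather than the empirical distribution used in Methods, and uses invented coordinates; the function name is ours, not from the text.

```python
import random

def fusion_probability(x_c, y_c, l_min, l_max, gene_u, gene_v,
                       n=200_000, seed=0):
    """Estimate the fraction of the trapezoid of breakpoints (a, b)
    consistent with the invalid pair (x_c, y_c) under equation (4.1),
    l_min <= (a - x_c) + (b - y_c) <= l_max,
    that falls inside the gene rectangle gene_u x gene_v."""
    rng = random.Random(seed)
    in_trapezoid = in_rectangle = 0
    for _ in range(n):
        # Sample candidate breakpoints from the trapezoid's bounding box.
        a = x_c + rng.uniform(0, l_max)
        b = y_c + rng.uniform(0, l_max)
        if l_min <= (a - x_c) + (b - y_c) <= l_max:
            in_trapezoid += 1
            if gene_u[0] <= a <= gene_u[1] and gene_v[0] <= b <= gene_v[1]:
                in_rectangle += 1
    return in_rectangle / in_trapezoid

# Invented example: clone length in [100kb, 160kb], with ends mapping
# 50kb upstream of a 120kb gene U and an 80kb gene V.
p = fusion_probability(x_c=0, y_c=0, l_min=100_000, l_max=160_000,
                       gene_u=(50_000, 170_000), gene_v=(50_000, 130_000))
print(p)
```

With several clones, the same sampling idea applies with acceptance restricted to the intersection I of all trapezoids; the Methods section computes these probabilities exactly rather than by sampling.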

4.2.2 Fusion Predictions in Breast Cancer

We made predictions of fusion genes for the MCF7, BT474, and SKBR3 breast cancer cell lines as well as two primary tumors using data from end sequence profiling of these samples [169, 129]. Approximately 71Mb of end-sequence was derived from these 5 samples, ∼29Mb (corresponding to 0.47 clonal coverage) coming from the MCF7 cell line. Across all samples, a total of 1,141 invalid pairs were obtained. These formed 919 clusters, 95 of which contained more than one clone. We applied our method of computing fusion gene probability to each of these samples, using the distribution of clone lengths in each library for these calculations. Supp. Fig. C.1 shows the empirical distribution of clone lengths in each library. Table 4.1 shows the results of our predictions for fully sequenced BACs across multiple breast cancer cell lines and primary tumors, sorted according to fusion probability. We have successfully validated a number of these highest ranked predictions by sequencing the entire clone and identifying the exact location of the breakpoint and point of gene fusion (see Methods). Sequencing also showed that certain clones contain multiple rearrangement breakpoints, with more than two contiguous segments of the reference genome present in a single clone (Table 4.1). In these cases, we ensure that the breakpoint associated to each gene in the fusion disrupts the corresponding gene. Such multiple rearranged regions have been observed to still form fusion transcripts, as in the case of BCAS4/BCAS3 [135, 11]. Figure 4.2 illustrates

Figure 4.2: Prediction of a Fusion between the NTNG1 and BCAS1 genes. The rectangle indicates the possible locations of a breakpoint on chromosomes 1 and 20 that would result in a fusion between NTNG1 and BCAS1. Each trapezoid indicates possible locations for a breakpoint consistent with an invalid pair. Assuming that all clones contain the same breakpoint, this breakpoint must lie in the intersection of the trapezoids (shaded region). Approximately 69% of this shaded region intersects (darkly shaded region) the fusion gene rectangle, giving a probability of fusion of approximately 0.69. The empirical distribution of clone sizes reveals that not all clone sizes are equally likely (e.g. extremely long or short clones are rare). Using this additional information, our improved estimate for the probability of fusion is > 0.99.

[Figure 4.3 appears here: (a) scatter plot of probability of fusion vs. product of gene lengths (log scale); (b) counts in ChimerDB vs. product of gene lengths (log scale).]

Figure 4.3: (a) Probability of fusion vs. product of gene lengths involved in the fusion event indicates higher fusion probabilities for larger gene pairs. Larger circles indicate gene pairs experimentally validated by further sequencing. Darkly shaded circles indicate fusions confirmed by clone sequencing; yellow circles indicate predicted fusions with negative sequencing results. Smaller points indicate untested predictions, with blue circles indicating cluster sizes of 2 or more clones and red circles indicating singletons. (b) The number of fusion genes in chimerDB [70] plotted as a function of the product of gene lengths in the fusion.

Table 4.1: Ranked List of Fusion Gene Predictions in Breast Cancer Cell Lines and Primary Tumors. The gene order shown indicates "start" and "end" positions with respect to transcription. Elements labeled as "Not Tested" are, as yet, not sequenced. Though additional clones have been sequenced, they did not overlap any putative fusion region. It should be noted that those clones were also negative for gene fusion. VAPB/ZNFN1A3 has low probability; however, there are many pairs of genes with low probability of fusion in this region. The probability that any one of these gene pairs fuses is > 0.30. All clones in a cluster are non-redundant (the same clones do not reappear multiple times in a cluster). † indicates that a single clone contained more than two chromosomal segments, i.e. the clone is not a simple fusion of two genomic loci.

Start Gene | End Gene | Fusion Probability | Cluster Size | Sequencing Supporting Fusion | Cell Line / Primary Tumor
ASTN2 | PTPRG | 1 | 2 | Yes† | MCF7
BCAS4 | BCAS3 | 1 | 20 | Yes† | MCF7
KCND3 | PPM1E | 0.99 | 12 | Yes | MCF7
NTNG1 | BCAS1 | 0.99 | 6 | Yes | MCF7
BCAS3 | ATXN7 | 0.83 | 8 | Yes† | MCF7
ZFP64 | PHACTR3 | 0.6322 | 2 | No | BT474
CT012 HUMAN | UBE2G2 | 0.0880 | 1 | No | Breast
VAPB | ZNFN1A3 | 0.0842* | 3 | Yes | BT474
BMP7 | EYA2 | 0.0324 | 4 | No† | MCF7
KCNH7 | TDGF1 | 0.0215 | 1 | No | Breast
SULF2 | TBX4 | 0.00656 | 2 | No | MCF7
NACAL | NCOA3 | 0.0057 | 2 | No | MCF7
MRPL45 | TBC1D3C | 0.0005 | 1 | No | BT474
U1 | NP 060028.2 | 0.0005 | 1 | No | Breast
RBBP9 | ITGB2 | 0.0005 | 1 | No | Breast
Y | SYNPR | < 0.0001 | 4 | No | MCF7
PRR11 | TMEM49 | < 0.0001 | 9 | No | MCF7
BMP7 | Q96TB | < 0.0001 | 3 | No | MCF7

the computation of fusion probability for one high-scoring prediction (NTNG1/BCAS1). The strong correspondence between fusion probability prediction and subsequent sequencing validation of the breakpoint in Table 4.1 illustrates the predictive power of our method. Table 4.1 also indicates the power of the technique in predicting clones that do not have fusion genes. Only one clone with fusion probability below 50% contained a fusion gene (VAPB/ZNFN1A3). The data suggests a strong correlation between gene rectangle size (the product of gene lengths) and probability of gene fusion. Larger fusion genes tend to have higher fusion probabilities and greater likelihood of being validated (Figure 4.3). A similar trend is observed in chimerDB, a database of fusion genes in cancer derived from mRNA, EST, literature and database searches [70].

4.2.3 Detection and Localization of Genome Rearrangements

We now consider the problem of how much sequencing is required to detect a genome rearrangement and to localize the breakpoint of a rearrangement. Consider an idealized model in which N clones, each of fixed length L, are picked uniformly at random from a tumor genome of length G and end-sequenced.³ These end sequences are mapped to the reference genome. The fraction f of clones uniquely mapped varies significantly among different sequencing technologies. In an ESP study, with paired-end Sanger sequencing of BACs, 90% of reads were mappable with 58% uniquely mapped [129]. A recent study that used 454 sequencing to identify structural variants in the human genome reported 63% mapping of sequences with recognizable linker sequences, and 41% of all reads mapped [64]. Note that the 454 reads are significantly longer (average 109 bp) compared to other next-generation sequencing technologies (average 20-30 bp) [59, 15, 64], and thus even lower mapping efficiencies are expected for these shorter reads. A fusion point, ζ, on the tumor genome is detected if a uniquely mapped clone contains it (Figure 4.4). Using the Clarke-Carbon formula [27, 78] (see Methods), the

³For these calculations, the tumor genome size, G, will correspond to the diploid genome size, ∼6 × 10⁹ bp for human.

probability Pζ of detection is given by

Pζ ≈ 1 − e^(−c)    (4.2)

where c = NL/G is the clone coverage. If only a single clone contains a fusion point, then the fusion point is localized to within L bp. If multiple clones contain a fusion point, then the fusion point is localized more precisely. We define the breakpoint region,

Θζ, as the interval determined by the intersection of all clones that contain ζ. Thus, |Θζ| defines the localization of ζ, or the uncertainty in mapping ζ. Since localizing a fusion point to within L requires only a single clone containing ζ, we find (see Methods) that

Pr(|Θζ| = L) ≈ L · e^(−c) · (1 − e^(−N/G)).    (4.3)

Furthermore, we find that for s < L,

Pr(|Θζ| = s) ≈ s · e^(−Ns/G) · (1 − e^(−N/G))²    (4.4)

These equations allow us to estimate the expected length of Θζ, conditioned on ζ being covered⁴ (see Methods for full derivation and closed form solution, equation 4.24), as

E(|Θζ| | ζ is covered) ≈ [(1 − e^(−N/G)) / (1 − e^(−c))] · ( L² · e^(−c) + Σ_{s=1}^{L−1} s² · e^(−Ns/G) · (1 − e^(−N/G)) )    (4.5)

We evaluated the error in this approximation by simulation (see Supplemental Methods for descriptions of all simulations). Supp. C.2 shows that approximation (4.5) very closely models the average observed |Θζ|. The relative error between the average observed length of the breakpoint region and approximation (4.5) was 0.02. We also assess the effect of different clone sizes, L, and number of clones, N, on the expected length of the breakpoint region, E(|Θζ|), around a specific fusion point, ζ. In Figure 4.5 we see the obvious correlation that an increase in the number of reads,



Figure 4.4: Schematic of a Breakpoint Region. A fusion point ζ on the tumor genome contained in multiple clones. The leftmost and rightmost clones determine the breakpoint region Θζ in which the fusion point can occur.

N, decreases the uncertainty in localization (|Θζ|). Interestingly, note that 40 kb clones are most advantageous when localization to |Θζ| = 40 kb is desired. A similar effect is observed for 150 kb and 2 kb clones. Thus, there is a direct correlation between the clone length and the ability to localize a fusion point to an interval of a given size, implying that the choice of clone length impacts the ability to detect fusions of a specific size.
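The localization approximation above can be sanity-checked numerically. The following sketch (not part of the dissertation; toy parameters and function names are my own) evaluates approximation (4.5) and compares it against a direct Monte Carlo simulation of uniformly placed clones:

```python
import math
import random

def expected_theta(L, N, G):
    """E(|Theta_zeta| | zeta covered), per approximation (4.5)."""
    c = N * L / G                       # clone coverage
    p = 1 - math.exp(-N / G)            # prob. a clone starts at a given base
    total = L * L * math.exp(-c)        # s = L term: a single clone covers zeta
    for s in range(1, L):
        total += s * s * math.exp(-N * s / G) * p
    return p * total / (1 - math.exp(-c))

def simulated_theta(L, N, G, zeta=5000, trials=4000, seed=1):
    """Monte Carlo average of |Theta_zeta| over trials where zeta is covered."""
    rng = random.Random(seed)
    acc = hits = 0
    for _ in range(trials):
        starts = [x for x in (rng.randrange(G) for _ in range(N))
                  if x <= zeta < x + L]
        if starts:
            # Theta is the intersection of all covering clones:
            # the interval [max(starts), min(starts) + L]
            acc += min(starts) + L - max(starts)
            hits += 1
    return acc / hits

# Toy scale (100 bp clones at 1X clone coverage on a 10 kb "genome")
print(expected_theta(100, 100, 10_000), simulated_theta(100, 100, 10_000))
```

With these toy parameters the analytic and simulated values agree to within a few percent, mirroring the relative error of 0.02 reported in the text.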

4.2.4 Comparison of Sequencing Strategies

Formulas (4.2) and (4.5) provide a framework for examining a variety of sequencing parameters (L, N, c). Table 4.2 and Supp. C.3 and C.4 demonstrate the effect of using different clone lengths and varying numbers of paired reads on the ability to detect and localize a fusion point. Table 4.2 also indicates the effect of such parameters on the ability to detect and localize clusters of invalid pairs, as defined in Section 4.4.6 and Equation 4.26. One can see that a distinct trade-off exists between detection, in which larger clones hold a distinct advantage, and localization, in which smaller clones are advantageous. While longer clones (e.g. BACs of 150 kb) are more pragmatic for sequencing projects using a smaller number of paired reads, the advent of low-cost, highly parallel sequencing of small clones could soon yield extremely high probability


Figure 4.5: Probability of localizing a fusion point to an interval of a given length. A fusion point is localized to length s if the corresponding breakpoint region has length s or less. Note that when s exceeds the clone length L, only a single clone contains the fusion point; thus, the probability of localization is the fixed probability of a clone containing the fusion point at a given clonal coverage. Note that a fixed clone length is assumed here; using a distribution of clone lengths would create a less abrupt transition.

Table 4.2: Breakpoint Detection and Localization for Different Sequencing Strategies. The probability Pζ of detecting a fusion point and the expected length E(|Θζ|) of a breakpoint region under various clone sizes (L) and numbers of end-sequenced clones (N). Pζ∗ and E(|Θζ∗|) correspond to the probability for, and expected size of, a breakpoint region in the case of two clones spanning ζ. The small clone sizes (1kb, 2kb) and large numbers of reads represent what one might achieve in a single run with new technologies (assuming perfect mapping of end sequences). The last row for the 150kb clones represents our current status on the MCF7 cell line.

Clone Size (L)   Paired Reads (N)   Clone Coverage (c)   E(|Θζ|)   Pζ      E(|Θζ∗|)   Pζ∗
1kb              40 × 10^6          13.3X                295       > .99   289        .99
1kb              1 × 10^6           .33X                 972       .15     658        .012
2kb              20 × 10^6          13.3X                593       > .99   581        .99
2kb              1 × 10^6           .66X                 1889      .28     1296       .044
10kb             5 × 10^6           16.7X                2393      > .99   2378       > .99
10kb             1 × 10^6           3.3X                 7342      .81     5657       .50
40kb             2 × 10^6           26.7X                5998      > .99   5997       > .99
40kb             .1 × 10^6          1.33X                35587     .49     25124      .14
150kb            .5 × 10^6          25X                  23997     > .99   76807      > .99
150kb            .1 × 10^6          5X                   93169     .92     72022      .80
150kb            .012 × 10^6        .6X                  142510    .26     97457      .037

of detection (high Pζ ) and extremely high resolution of fusion points (small |Θζ |).
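Equation (4.2) can be checked directly against the table. A minimal sketch (not from the dissertation; it assumes the diploid genome size G ≈ 6 × 10^9 bp stated earlier in the chapter, which reproduces the Pζ column):

```python
import math

G = 6e9  # diploid human genome size in bp, per the chapter's assumption

def p_detect(L, N, G=G):
    """Probability that a fusion point is covered by at least one of N
    clones of length L (Equation 4.2): 1 - exp(-c), with c = N*L/G."""
    return 1 - math.exp(-N * L / G)

# Reproduce the P_zeta column for three rows of Table 4.2
for L, N in [(10_000, 1e6), (40_000, 0.1e6), (150_000, 0.012e6)]:
    print(f"L = {L:>7} bp, N = {N:10.0f}: P = {p_detect(L, N):.2f}")
# prints P = 0.81, 0.49, 0.26, matching the table
```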

4.2.5 Lengths of Fusion Genes

Since our simulations revealed that the choice of sequencing parameters affects the ability to localize breakpoint regions to intervals of different lengths (Figure 4.5), we further explored what lengths might be advantageous for identification of fusion genes. There is considerable variation in the sizes of human genes (Figure 4.6). When considering all known transcripts [68], the median gene size is approximately 20 kb and the mean is approximately 40 kb. However, examination of chimerDB shows a clear bias towards larger genes, with a median gene size of 40 kb and a mean gene size of 90 kb. It is tempting to speculate on the reasons for this bias. One possibility is ascertainment bias, as larger fusion genes would be easier to identify via cytogenetic techniques, which to date have been used to identify most fusion genes. Additionally, random breakage of the genome would favor fusions involving larger genes, as the probability of a breakpoint disrupting a large gene would be greater than for a small gene. We examined the length distribution of random fusion genes by simulation: we selected random breakpoints in the genome, and if a breakpoint formed a fusion gene, we recorded the length of the resulting fusion gene (Figure 4.6). It is interesting to note that these random fusion events resulted in much larger genes than observed in the normal genome or chimerDB (median and mean gene sizes of 155 kb and 284 kb, respectively). Though further investigations are needed, one possible explanation is that known fusion genes have a biased size distribution because they are selected for functional reasons. We also examined the size distributions of transcription factor genes and kinase genes, both of which are members of multiple fusion genes (Figure 4.6). Interestingly, the size distribution of kinases is closer to the chimerDB distribution, while the size distribution of transcription factors is closer to that of all known genes.
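The random-breakage argument above is an instance of length-biased sampling: a uniformly placed breakpoint hits a gene with probability proportional to its length, so the genes it hits are larger on average than the genome-wide mean. A toy sketch (hypothetical gene-length mixture, not the real annotation):

```python
import random

def random_fusion_lengths(gene_lengths, n_events, seed=0):
    """Sample genes hit by uniformly random breakpoints: each gene is hit
    with probability proportional to its length (length-biased sampling)."""
    rng = random.Random(seed)
    return rng.choices(gene_lengths, weights=gene_lengths, k=n_events)

# Hypothetical gene-length mixture in kb (illustrative only)
genes = [5] * 500 + [20] * 300 + [100] * 100 + [300] * 20
mean_all = sum(genes) / len(genes)
hit = random_fusion_lengths(genes, 5000)
mean_hit = sum(hit) / len(hit)
# The length-biased mean E[X^2]/E[X] far exceeds the plain mean E[X],
# mirroring the shift toward large genes seen for random fusions in Figure 4.6
print(round(mean_all, 1), round(mean_hit, 1))
```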
Figure 4.6: Distribution of gene sizes for different sets of genes. All genes: the "known genes" track in the UCSC Genome Browser [68]. Kinases: selected from the KinBase database [86]. Transcription factors: selected from the AmiGO database according to the GO term "transcription factor activity" [8]. ChimerDB: fusion genes in cancer extracted from the chimerDB database [70]. Random fusion genes: a set of 2000 genes involved in 1000 random fusion events, formed by inducing random breakpoints and selecting those events that formed a fusion gene. Note that the gene sizes are on a log scale, and the number of genes from each set used to derive each distribution is shown in the legend.

The variation in gene sizes for different classes of genes (Figure 4.6) suggests that one should consider a wide range of gene sizes when assessing our ability to detect fusion genes. Figure 4.7 shows the number of clones necessary to achieve at least a .5 fusion probability for a random gene pair of specified size, across a variety of different clone sizes. Note that the breakpoint could exist at any position within either gene. Smaller clone sizes clearly hold a distinct advantage in the probability of detecting fusion genes when compared at equal clonal coverage, while large clones tend to perform better when comparing across an equivalent number of reads (Figure 4.7a). This is not surprising, as a significantly higher number of paired reads is required to achieve the same coverage with smaller clones. In particular, 75 times more paired reads from 2 kb clones are needed to obtain the same clonal coverage as with 150 kb clones.

There is also a relationship between the size of a fusion gene and the probability of detecting the fusion (Figure 4.7b). Since larger clones create larger trapezoids (Figure 4.1), the use of larger clones increases the probability that the trapezoid defined by the clone intersects the rectangle defined by the two genes, thus producing a higher probability of detection of a breakpoint. However, this effect is counteracted by the fact that larger clones also yield larger breakpoint regions, leading to lower fusion probabilities, since only a small fraction of a larger trapezoid typically overlaps the gene rectangle.

The optimal clone length for fusion gene identification is directly related to the length of fusion genes. Thus, the length of the fusion genes that one wants to detect with high probability is an important parameter in choosing a sequencing strategy. For example, if a fusion gene is 40 kb in length, the average fusion probability is significantly greater when using the same number of 40 kb clones compared to 2 kb or 10 kb clones, because of the greater genomic coverage provided by the larger clones. However, in this scenario 40 kb clones also perform nearly as well as 150 kb clones (Figure 4.7b), because the 40 kb clones have better breakpoint localization (Figure 4.5). If the fusion gene size is increased to 150 kb, then 150 kb clones are superior, since the poorer breakpoint localization has limited effect on prediction of a large fusion gene. One additional consideration is that larger clones (e.g. 150 kb) consistently show lower variance in fusion probabilities (Figure C.7), due to their higher probability of detecting a fusion. This makes larger clones more reliable when performing studies across multiple tumor samples, especially when the number of paired reads available per sample is limited.

4.2.6 Effects of Errors

There are numerous sources of error in paired-end sequencing strategies for rearrangement identification, including experimental artifacts, genome assembly errors, and mis-mapping of end sequences. These errors can lead to incorrect predictions of fusion genes, or false positives. A major source of experimental artifacts in current sequencing approaches is chimeric clones, which are produced when two non-contiguous regions of DNA are joined together during the cloning procedure. Approximately 1–2% of clones in modern BAC libraries are chimeric [169], and rates for other vectors are roughly similar [64]. The type and rate of experimental artifacts for new genome amplification and sequencing strategies is still an open question.


Figure 4.7: (a) The number of paired reads necessary to detect fusion genes with fusion probability greater than .5 as a function of gene size, for different clone lengths. The vertical lines indicate the median (20kb) and mean (40kb) sizes for all known genes, as well as the median (40kb) and mean (90kb) sizes for chimerDB genes. (b) The number of paired reads necessary to detect fusion genes with fusion probability greater than .5 as a function of clone size, for different fusion gene sizes (log scale on both axes). Each point in these plots is the average over 100 different fusion genes and 100 different simulations of clones from the genome; thus, each datapoint represents the average value of 10^4 simulations. In each simulation, a pair of genes was chosen such that the area of the resulting gene rectangle (U × V) was equal to the square of the indicated fusion gene size. A breakpoint was selected for the gene pair uniformly in the rectangle U × V.

In order to assess the rate of false positive predictions of fusion genes in the presence of errors, we simulated 100 random genome rearrangements with 1% of the paired-end sequences arising from chimeric clones. For several clone lengths, we recorded the number of fusion genes correctly identified (true positives) and the number of incorrect fusion gene predictions (false positives) as the minimum fusion gene probability required for identification was increased (Figure 4.8). For small numbers of paired reads, the largest clones (150 kb) yield the largest number of true positives (Figures 4.8a and 4.8b), while with a large number of paired reads, smaller clones (40 kb) are better (Figure 4.8c). Extremely large numbers of paired reads are required before very small clones (2 kb) become effective (Figure 4.8d). On the other hand, these small clones show almost no false positives at reasonable probability thresholds, and show little (if any) increase in true positives if the probability threshold is reduced (Figure 4.8d).

Finally, we examined the effect of chimeric clones on our ability to identify breakpoints from invalid clusters. Obviously, when only a single isolated invalid pair exists, we cannot determine whether it arose via a chimeric event or through a true rearrangement. However, a cluster of invalid pairs is highly unlikely to arise from chimeric clones [127]. Figure 4.9 shows that in most cases, no clusters formed from chimeric clones are observed: even under high coverage (10X) and a very high percentage of chimeric clones (5% of all paired reads), no chimeric clusters were observed 80% of the time. This result demonstrates that clusters of two or more invalid pairs are very likely to indicate true rearrangement events. When comparing a fixed number of chimeric clones over clones of varying lengths, the probability of observing a chimeric cluster is much lower for smaller clones (Figure C.8).

4.3 Discussion

We provided a computational framework to evaluate paired-end sequencing strategies for detection of genome rearrangements in cancer. Our probability calculations and simulations show that current paired-end technology can obtain an extremely high probability of breakpoint detection with a very low number of reads. For example, more than 90% of all breakpoints can be detected with paired-end sequencing of fewer than 100,000 clones (Table 4.2). Additionally, next-generation sequencers can potentially detect rearrangements with greater than 99% probability and localize the breakpoints of these rearrangements to intervals of less than 300 bp in a single run of the machine (Table 4.2). If only a fraction (e.g. 50%) of the reads map uniquely, similar detection levels are achievable by simply doubling the amount of sequencing.

Figure 4.8: Sensitivity and Specificity of Fusion Gene Predictions. (a) Number of false positive (FP) and true positive (TP) fusion gene predictions for a simulated genome with 100 translocations and 10,000 paired reads. Each curve represents the average of 50 simulations with clones of a fixed size (2kb, 40kb, or 150kb). The minimum fusion probability threshold for indicating that a fusion gene was predicted was decreased from > .95 (leftmost point) to > 0 (rightmost point) in increments of .05, and the number of true and false predictions was determined. In all panels, 19 true fusion genes were present in the rearranged genome; these 19 events were not selected for, but rather were a result of the random rearrangement of the genome. (b) 100,000 paired reads. (c) 1,000,000 paired reads. (d) 10,000,000 paired reads.

Figure 4.9: Probability of observing at least one chimeric cluster vs. the percent of chimeric clones. These probabilities were computed using Equation 4.27, with clone size L = 150kb, and confirmed by simulation. Other clone sizes yield virtually identical probabilities at the same clonal coverage.

We derived formulae that provide estimates of the probability of detecting rearrangement breakpoints and localizing them precisely. For a genome of length G with N mapped paired reads from clones of length L, the detection probability is a function of the clonal coverage (c = NL/G). Thus, increasing L means that fewer clones are needed to maintain the same probability of detecting a fusion. On the other hand, breakpoint localization depends independently on "clone" length L, number of mapped reads N, and genome size G. Traditionally, clone length L was dictated by efficiency considerations with available cloning vectors (e.g. plasmids ≈ 2 kb, fosmids ≈ 40 kb, and BACs ≈ 150 kb). However, new sequencing technologies permit paired-end sequencing from a larger range of "clone" lengths.

The natural question for the practitioner is: what sequencing strategy maximizes information about rearrangements in the cancer genome for minimum cost? Three considerations preclude a definitive answer to this question. First, the goal of "maximizing information about rearrangements" in cancer genomes requires further specification. Second, the parameters of a sequencing strategy cannot be set arbitrarily, but are restricted by the chosen technology. Third, the complexity of cancer genomes at the sequence level, including the number and type of rearrangements and the sequence characteristics of rearrangement breakpoints, is currently unknown. We discuss each of these issues below and then conclude by describing further extensions of our methodology.

4.3.1 Defining the Genomic Features of Interest

When studying genome rearrangements by paired-end sequencing approaches, there are two interrelated goals that affect the choice of sequencing strategy. First, one might be interested in detecting as many rearrangement breakpoints as possible with the minimum amount of sequencing. In this case, the goal is to maximize the clonal coverage c with the fewest paired reads. It follows immediately from the breakpoint detection probability (Equation 4.2) that for a fixed number of paired reads, larger clones give higher probabilities of detection than smaller clones. On the other hand, one might be interested in precise localization of breakpoint regions. In this case, smaller clones provide better localization when the breakpoint is detected (Figure 4.8, Supplemental Figure C.5).

Better localization of breakpoints is desirable if one wants to determine with certainty that a gene is fused or disrupted by a genome rearrangement. Our results showed the correlation between clone length and the probability of localizing breakpoints to an interval of a specific length. Figure 4.5 shows that with a fixed number of paired reads, the optimal choice of clone length depends on the desired interval of localization. Figure 4.7b shows that these results readily translate to the probability of detecting fusion genes of a given size. If paired-end sequences could be obtained for any clone length, then the choice of optimal clone length depends on the length of the fusion genes that the researcher wants to have the greatest ability to localize. This in turn might depend on a prior belief about the model of rearrangement in cancer. For example, if one wants to be able to localize fusion genes whose length is approximately the length of an average human gene (40 kb), then the optimal clone length is 40 kb.
However, under the hypothesis that the breaks in the genome that lead to fusion genes are distributed uniformly on the genome, larger fusion genes would be expected, and thus larger clones would be optimal. Better localization is also desirable when one wants to validate a breakpoint via PCR, perhaps to determine if the breakpoint is recurrent across multiple samples. In this case, the breakpoint must be localized to an interval length that can be amplified via PCR, typically less than a few kilobases, and thus smaller clones are appropriate. On the other hand, in many cases rearrangement breakpoints are known to vary across kilobases in different patients [82]. Thus, approaches like Primer Approximation Multiplex PCR (PAMP) [82] that assay for variable genomic lesions in a patient population are useful, and the need for precise localization of breakpoints is reduced. Nevertheless, the success of PAMP relies on establishing reasonable boundaries of a rearrangement, so that appropriate primers tiling the region can be designed [17]. Our methodology provides these boundaries, and the combination of paired-end sequencing and PAMP is a powerful tool for identifying therapeutic targets and designing clinical diagnostics.

4.3.2 Choice of Sequencing Parameters

There are several next-generation sequencing technologies now on the market, and others will soon be commercially available. Information about the capabilities of many of these machines, particularly with regard to paired-end sequencing, is presently limited. In addition, the field is developing rapidly, and any claims about read lengths, sequencing error rates, etc. are undergoing continual revision. While our analysis focused on several key parameters, including the number of paired reads, clone length, and percent of chimeric clones, in reality only some of these parameters are adjustable, while others (e.g. error rate) are fixed by the chosen sequencing technology.

One issue not considered in our model that is closely tied to the sequencing technology is the mapping of reads to the reference genome. Different sequencing technologies produce reads of varying length and quality, which can have a dramatic effect on the ability to map paired reads. On one extreme, conventional paired-end sequencing of cloned genomic fragments, employed by current ESP studies [170, 169], yields high-quality reads exceeding 500 bp. This enables the majority of reads outside of repeats and segmental duplications to be uniquely and accurately mapped to the reference genome. For example, with paired-end sequences 500 bp long, 11492 out of 19831 clones (58%) in the MCF7 study mapped uniquely [129], while with paired-end sequences 100 bp long, 41% of paired reads mapped uniquely [64]. Newer sequencing technologies such as Illumina and ABI have even shorter reads (20 to 30 bp) and higher error rates [15, 59], and the ditag approach sequences only 18–20 base pairs from each end of the genomic fragment [135]. These shorter reads will be much more difficult to map, particularly when analyzing rearrangements.
Moreover, unlike resequencing studies, where one can increase mapping efficiency using the additional information that the end sequences are close together on the reference genome, detection of rearrangements specifically requires the accurate mapping of end sequences from distant locations on the genome. It would be informative to model the effect of different read accuracies and lengths on the ability to accurately resolve breakpoints.

4.3.3 Organization of Cancer Genomes

Our simulations made certain simplifying assumptions about the character of cancer genomes. Most notably, we assumed that the size of the cancer genome (equal to the parameter G above) is known. Since many cancer genomes, particularly solid tumors, have extensive aneuploidy, the actual size of a given genome might differ greatly from that of normal cells [120]. At the present time, it is difficult to calibrate the genome length parameter in our simulations, and pilot sequencing studies will be needed to assess the extent of amplification in these samples. Paired-end sequencing will naturally sample more from amplified regions. Although we did not explicitly simulate amplifications, it is clear that the probability of detecting amplified translocations is directly proportional to their relative amplification in the genome. Namely, as the number of copies, a, of a fusion point ζ increases, the probability of detection Pζ increases, approximately following 1 − e^(−ca), assuming that the genome size is constant under the amplification. Since highly amplified regions can have complex organization due to duplication mechanisms [169, 126], many of the genome rearrangements detected in low-coverage studies will likely be in these highly amplified and rearranged regions. Identification of non-amplified rearrangements might require extremely high coverage.

An additional consideration is whether cancer rearrangement breakpoints are biased to certain regions of the genome. For example, if rearrangement breakpoints are in highly repetitive regions, it might be difficult to map sequences that are too close to the breakpoints, and thus larger clones are appropriate. On the other hand, if there are multiple rearrangements clustered in a small genomic interval, as observed in the multiple breakpoints found in some sequenced BACs and also in other recent sequencing studies [129, 113], larger clones would miss some of these rearrangements.
Finally, genomic heterogeneity, particularly in primary tumor samples, reduces the effective coverage and thus the probability of detecting rearrangement breakpoints. Even a genomic lesion that is an important checkpoint in cancer progression might be difficult to detect in an admixed sample containing normal cells and cells from earlier developmental stages of the tumor. It is nearly impossible to determine how all of these factors will affect cancer sequencing strategies without further studies. Such pilot studies promise to provide a significant amount of new information about the extent of ploidy changes and heterogeneity.
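The amplification effect discussed above, Pζ ≈ 1 − e^(−ca), is easy to evaluate numerically. A minimal sketch (illustrative values only; the function name is my own):

```python
import math

def p_detect_amplified(c, a):
    """Detection probability for a fusion point present in a copies:
    P ≈ 1 - exp(-c*a), assuming genome size unchanged by amplification."""
    return 1 - math.exp(-c * a)

# At low clonal coverage (c = 0.1X), amplification dominates detection
for a in (1, 5, 20):
    print(a, round(p_detect_amplified(0.1, a), 2))
# prints 0.1, 0.39, 0.86
```

Even at 0.1X clonal coverage, a 20-fold amplified fusion point is detected with high probability, consistent with the observation that low-coverage studies preferentially recover amplified rearrangements.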

4.3.4 Extensions and Applications

Our formula for the probability of a fusion gene is readily extended to fusions of other genomic features. For example, we can compute the probability of regulatory fusions that result from the fusion of the promoter of one gene to the coding region of another gene. Other genomic assays, such as array comparative genomic hybridization (CGH), can be used in combination with paired-end sequencing. Array CGH identifies breakpoints involved in deletions and amplifications at average resolutions of less than 10 kb [113, 12]. If this information overlaps paired-end sequencing data (as is the case with an amplified translocation like BCAS4/BCAS3), it might be possible to improve the resolution of the breakpoint interval defined by a paired-end sequencing approach. As next-generation technologies mature and the cost of sequencing declines, paired-end sequencing of cancer genomes will inevitably provide highly reliable and precise detection of fusion genes. Application of these technologies will permit the systematic analysis of all classes of genomic events that lead to cancer and will shed new light on the genetic and genomic basis of cancer.

4.4 Methods

4.4.1 Mapping and clustering of end sequences

We assume that each clone C is end-sequenced and the ends are mapped uniquely to the reference human genome sequence. Thus, each clone C corresponds to a pair (xC, yC) of locations in the human genome where the end sequences map. In addition, an end sequence may map to either DNA strand, and so each mapped end has a sign (+ or −) indicating the mapped strand. We call such a signed pair an end sequence pair (ES pair). If we know that the length LC of the clone C lies within the range [Lmin, Lmax], then for most clones the distance between elements of the corresponding ES pair will lie in this range and the ends will have opposite, convergent orientations: i.e. an ES pair of the form (+x, −(x + LC)). Following [127], we call such ES pairs valid pairs, because these indicate no rearrangement in the tumor genome. We can use the distribution of the distance |y| − |x| between the ends of valid pairs to define an empirical distribution for the clone lengths (cf. Supp. C.1).

If a pair (xC, yC) has ends with non-convergent orientation, or whose distance |y| − |x| is greater than Lmax or smaller than Lmin, we say that (xC, yC) is an invalid pair. The set of breakpoints (a, b) that are consistent with the invalid pair (xC, yC) is determined by the inequalities [128]

Lmin ≤ sign(xC)(a − xC) + sign(yC)(b − yC) ≤ Lmax.    (4.6)

Throughout the paper, we assume (without loss of generality) that sign(xC) = sign(yC) = +, so that a ≥ xC and b ≥ yC.
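The valid/invalid classification above can be sketched in code (a hypothetical helper, not the dissertation's implementation; it assumes both ends map to the same chromosome):

```python
def classify_es_pair(x, y, sign_x, sign_y, L_min, L_max):
    """Classify a mapped end-sequence pair as 'valid' or 'invalid'.
    A valid pair has convergent orientation (leftmost end on '+',
    rightmost on '-') and a mapped distance within [L_min, L_max]."""
    lo, hi = sorted((x, y))
    span = hi - lo
    convergent = ((sign_x, sign_y) == ('+', '-') if x <= y
                  else (sign_y, sign_x) == ('+', '-'))
    return 'valid' if convergent and L_min <= span <= L_max else 'invalid'

# A BAC-sized pair within range, and one spanning far too much sequence
print(classify_es_pair(1000, 148_000, '+', '-', 100_000, 200_000))    # valid
print(classify_es_pair(1000, 5_000_000, '+', '-', 100_000, 200_000))  # invalid
```

In a full pipeline, the distances of the valid pairs would then be collected to estimate the empirical clone-length distribution mentioned above.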

4.4.2 Validating fusion predictions by sequencing

Clones containing predicted fusion genes were draft sequenced (1X coverage) by subcloning into 3kb plasmids, as described in [129]. Assembly of these sequences and alignment to the reference human genome either identified the precise fusion point, or identified a plasmid containing the fusion point, thus localizing the breakpoint to 3kb.

4.4.3 Computing fusion probability

Define C(a,b) as the event that a clone C from the tumor, with corresponding invalid pair (xC, yC), overlaps a breakpoint (a, b) of a reference genome. Assume w.l.o.g. that the invalid pair (xC, yC) is oriented such that a ≥ xC and b ≥ yC. Given the breakpoint (a, b), the length LC of the clone is fixed to be

lC(a, b) = (a − xC) + (b − yC) = (a + b) − (xC + yC).    (4.7)

Therefore, the event C(a,b) implies the event that LC = lC(a, b), allowing us to express the probability of occurrence of breakpoint (a, b) in terms of the distribution on the lengths of clones. Denote by NC[s] the number of discrete breakpoints (a, b) such that a ≥ xC, b ≥ yC, and a + b = s. Then

Pr(C(a,b)) = Pr(C(a,b) ∩ (LC = lC(a, b)))    (4.8)
           = Pr(C(a,b) | LC = lC(a, b)) · Pr(LC = lC(a, b))    (4.9)
           = (1 / NC[a + b]) · Pr(LC = lC(a, b)),    (4.10)

where the last equality follows from Equation 4.7 and the assumption that all breakpoints are equally likely. Consider a pair of genes spanning genomic intervals U, V. If exactly one of the genes is on the "+" strand and the other is on the "−" strand, then the probability of the genes fusing, given C, is

Pr(∪(a,b)∈U×V C(a,b)) = Σ(a,b)∈U×V Pr(LC = lC(a, b)) / NC[a + b].    (4.11)

Otherwise, if the genes are both on the same strand, then an in-frame fusion transcript is impossible (the analysis of the orientation of genes is analogous in the cases of invalid pairs with other signs) and the probability is 0. In the simple case, assume that the clone lengths are uniformly distributed over the range [Lmin, Lmax], so that

Pr(LC = lC(a, b)) = 1/(Lmax − Lmin) if Lmin ≤ lC(a, b) ≤ Lmax, and 0 otherwise.

In this case, Equation 4.11 refers to the fraction of the trapezoid (Equation 4.1) that intersects with U × V. A more realistic distribution of the clone lengths is empirically estimated (Supp. C.1) and used to compute Pr(LC = lC(a, b)). Next, we extend the equations to include the case when a set {C(1), C(2), ...} of multiple clones overlaps the breakpoint (a, b). Define C to be the event that all clones overlap the same breakpoint. Then

C = ∪(a,b) C(a,b),  where  C(a,b) = ∩j C(j)(a,b)

is the event that all clones C(j) overlap the breakpoint (a, b). Thus, the probability of (a, b) being the breakpoint, given that all clones overlap it, is given by

Pr(C(a,b) | C) = Pr(C(a,b) ∩ C) / Pr(C)    (4.12)
              = Pr(C(a,b)) / Pr(C)    (4.13)
              = Πj Pr(C(j)(a,b)) / Σ(a,b) Πj Pr(C(j)(a,b)).    (4.14)

Here, Equation 4.13 follows from the fact that C(a,b) implies C, and Equation 4.14 follows from the independence of clones. This allows us to compute the probability of genes U, V fusing as

Pr(∪(a,b)∈U×V C(a,b) | C) = Σ(a,b)∈U×V Πj Pr(C(j)(a,b)) / Σ(a,b) Πj Pr(C(j)(a,b)).    (4.15)

4.4.4 Algorithms for efficient probability computation

The naive approach to computing Pr(∪(a,b)∈U×V C(a,b) | C) in Equation 4.15 involves the computation of Pr(C(j)(a,b)) over all (a, b) and all clones C(j), which can be expensive. We make the computation more efficient by exploiting redundancies in the computation. First, note that we only need to do the computation over the range of (a, b) values defined by the intersection of all the trapezoids. Note from Equation 4.10 that

lC(j)(a, b) = lC(j)(a′, b′)  ⇒  Pr(C(j)(a,b)) = Pr(C(j)(a′,b′)).

Furthermore, since lC(a, b) = (a + b) − (xC + yC), all points (a, b) with equal values of lC(a, b) lie on a line with slope −1 (an antidiagonal). This provides a methodology for rapidly computing the probability of fusion.

Define diagonal s as the line a + b = s. Let D_s denote the set of breakpoints that lie on diagonal s and are overlapped by all clones. Then

\[
D_s = \{(a,b) : a \ge x_{C^{(j)}} \text{ and } b \ge y_{C^{(j)}}\ \forall j,\ (a+b) = s\}
\]

Then, D = ∪_s D_s is the set of breakpoints overlapped by all clones. Also, define the diagonal probability as the product of the probabilities of the implied clone lengths:

\[
P_s = \prod_j \frac{\Pr\bigl(|C^{(j)}| = s - x_{C^{(j)}} - y_{C^{(j)}}\bigr)}{N_{C^{(j)}}[s]}
\]

Then, we have

\[
\Pr\bigl(\cup_{(a,b)\in U\times V}\, C_{(a,b)} \mid C\bigr)
= \frac{\sum_{(a,b)\in U\times V \cap D} \prod_j \Pr(C^{(j)}_{(a,b)})}{\sum_{(a,b)\in D} \prod_j \Pr(C^{(j)}_{(a,b)})}
= \frac{\sum_s \sum_{(a,b)\in U\times V \cap D_s} \prod_j \Pr(C^{(j)}_{(a,b)})}{\sum_s \sum_{(a,b)\in D_s} \prod_j \Pr(C^{(j)}_{(a,b)})}
= \frac{\sum_s |D_s \cap U\times V| \cdot P_s}{\sum_s |D_s| \cdot P_s}
\]

The algorithm itself is straightforward: compute |D_s|, |D_s ∩ U × V|, and P_s for all values of s. It is efficient, as there are relatively few diagonals with P_s > 0 and |D_s| > 0.
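The diagonal computation can be sketched in a few lines. This is an illustrative reimplementation, not the dissertation's code; `clone_len_pmf` is a hypothetical discrete clone-length p.m.f., with the normalizer N_C[s] assumed to be folded into it:

```python
def fusion_probability(clones, clone_len_pmf, max_len, gene_u, gene_v):
    """Pr(breakpoint lies in U x V | all clones overlap it), via antidiagonals.

    clones        -- list of (x, y) mapped end coordinates, one pair per clone
    clone_len_pmf -- dict: clone length -> probability (already normalized)
    max_len       -- largest clone length with nonzero probability
    gene_u/gene_v -- inclusive (start, end) coordinate intervals for U and V
    """
    # All clones must overlap the breakpoint, so a >= max x and b >= max y.
    a_lo = max(x for x, _ in clones)
    b_lo = max(y for _, y in clones)
    s_min = a_lo + b_lo
    num = den = 0.0
    for s in range(s_min, s_min + max_len + 1):       # antidiagonal a + b = s
        # P_s: product over clones of Pr(|C^(j)| = s - x - y)
        p_s = 1.0
        for x, y in clones:
            p_s *= clone_len_pmf.get(s - x - y, 0.0)
        if p_s == 0.0:
            continue
        d_s = s - a_lo - b_lo + 1                     # |D_s| on this diagonal
        # |D_s intersect U x V|: a in gene_u and b = s - a in gene_v
        a1 = max(a_lo, gene_u[0], s - gene_v[1])
        a2 = min(gene_u[1], s - gene_v[0], s - b_lo)
        hits = max(0, a2 - a1 + 1)
        num += hits * p_s
        den += d_s * p_s
    return num / den if den else 0.0
```

For a single clone with both mapped ends at coordinate 0 and a uniform length p.m.f., every breakpoint on a feasible diagonal is equally likely, so if U × V covers the whole feasible region the probability is 1.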

4.4.5 Expected number of fusion points

We seek to compute the probability of detecting a fusion point and the expected number of fusion points that can be detected for a specific amount of end sequencing. Assume that N clones, each of length L, are end sequenced from a tumor genome of size G. We assume that the left endpoint of each clone is selected uniformly at random from the tumor genome. Then a fusion point ζ is detected if a clone contains it. Thus, the probability Pζ of detection is given by [27, 78]

\[
P_\zeta = 1 - \left(1 - \frac{L}{G}\right)^N \approx 1 - e^{-NL/G} = 1 - e^{-c}, \qquad (4.16)
\]

where c = NL/G is the clonal coverage. Suppose there are M fusion points in the tumor genome, and define the random variables X_1, ..., X_M by X_i = 1 if the i-th fusion point is covered and X_i = 0 otherwise. Then

\[
E(X_i) = 1 - e^{-c}.
\]

The expected number of fusion points detected is given by

\[
E(X) = \sum_{i=1}^{M} E(X_i) = M\bigl(1 - e^{-c}\bigr).
\]
Using the Poisson approximation with λ = M(1 − e^{−c}),
\[
\Pr[m \text{ fusion points detected}] \approx \frac{e^{-\lambda}\lambda^m}{m!}.
\]
Given m observed fusion points, the maximum-likelihood estimator M̂ of the total number of fusion points is
\[
\hat{M} = \frac{m}{1 - e^{-c}}. \qquad (4.17)
\]
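As a quick numerical illustration of Equations 4.16 and 4.17 (the parameter values below are invented for the example):

```python
import math

def detection_probability(n_clones, clone_len, genome_len):
    """Pr(a fixed fusion point is spanned by at least one clone), Eq. 4.16."""
    c = n_clones * clone_len / genome_len          # clonal coverage
    return 1.0 - math.exp(-c)

def mle_total_fusion_points(m_observed, n_clones, clone_len, genome_len):
    """Maximum-likelihood estimate M-hat = m / (1 - e^{-c}), Eq. 4.17."""
    return m_observed / detection_probability(n_clones, clone_len, genome_len)
```

At c = 0.1 (e.g. 100,000 clones of 3kbp on a 3Gbp genome), observing m = 10 fusion points implies roughly 105 fusion points in total.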

4.4.6 Localization of Rearrangement Fusion Points

When multiple clones contain a fusion point ζ, the localization of ζ (the length of the interval within which the clones indicate ζ must lie) is improved. Define Θ_ζ as the intersection of all clones that cover ζ (Figure 4.4). We wish to compute the probability distribution of the length of Θ_ζ. Following Lander-Waterman [78], we assume that the left endpoints of clones follow a Poisson process with intensity c = NL/G on the interval

G. Θζ is determined by the left endpoint of the right-most clone that contains ζ, and the right endpoint of the left-most clone that contains ζ. Define Aj, (0 ≤ j ≤ L − 1) as the event in which the right-most clone has its left endpoint j nucleotides to the left of

ζ. Correspondingly, let Bj, 1 ≤ j ≤ L be the event that the left-most clone has its right endpoint j nucleotides to the right of ζ. The event Aj occurs when there is a clone with left endpoint at ζ − j and no clones with left endpoints in the interval j nucleotides to the right of ζ − j, and similarly for Bj. Therefore,

\[
\Pr(A_j) = \Pr(B_j) = e^{-jN/G}\bigl(1 - e^{-N/G}\bigr) \qquad (4.18)
\]

The events Aj (likewise, all Bj) are mutually exclusive. Note that this allows us to redefine Pζ as

\[
P_\zeta = \Pr\bigl(\cup_{j=0}^{L-1} A_j\bigr) = \bigl(1 - e^{-N/G}\bigr)\sum_{j=0}^{L-1} e^{-jN/G} = 1 - e^{-NL/G} = 1 - e^{-c} \qquad (4.19)
\]

Note that if s < L, then A_{s−j} and B_j are independent for all j. To compute the probability distribution on |Θ_ζ|, we have two cases. For s < L,

\[
\Pr(|\Theta_\zeta| = s) = \Pr\bigl(\cup_{j=0}^{s}(A_{s-j} \cap B_j)\bigr) = \sum_{j=0}^{s} \Pr(A_{s-j} \cap B_j) = \sum_{j=0}^{s} \Pr(A_{s-j})\Pr(B_j)
= s\,e^{-sN/G}\bigl(1 - e^{-N/G}\bigr)^2 \qquad (4.20)
\]

The event |Θ_ζ| = L requires all clones covering ζ to have the same left endpoint. Therefore
\[
\Pr(|\Theta_\zeta| = L) = L\,e^{-c}\bigl(1 - e^{-N/G}\bigr) \qquad (4.21)
\]

We can compute the expected length of Θζ , conditioned on ζ being covered by a clone; otherwise Θζ is undefined. Since the event |Θζ | ≤ L occurs only when ζ is covered, we have

\[
\Pr(|\Theta_\zeta| = s \mid \zeta \text{ is covered}) = \frac{\Pr(|\Theta_\zeta| = s)}{\Pr(\zeta \text{ is covered})} = \frac{\Pr(|\Theta_\zeta| = s)}{1 - e^{-c}} \qquad (4.22)
\]
and combining (4.20), (4.21), and (4.22) obtains

\[
E(|\Theta_\zeta| \mid \zeta \text{ is covered}) = \frac{1 - e^{-N/G}}{1 - e^{-c}} \left( L^2 e^{-c} + \bigl(1 - e^{-N/G}\bigr) \sum_{s=0}^{L-1} s^2 e^{-Ns/G} \right) \qquad (4.23)
\]
We note that the sum in the above formula has a closed-form solution:

\[
E(|\Theta_\zeta| \mid \zeta \text{ is covered}) = \frac{1 - e^{-N/G}}{1 - e^{-c}} \left[ L^2 e^{-c} + \frac{e^{-N/G}\bigl(1 + e^{-N/G}\bigr) - e^{-c}\Bigl(L\bigl(1 + e^{-N/G}\bigr) - (L-1)\,e^{-N/G}\bigl(1 + e^{-N/G}\bigr) + L(L-1)\bigl(1 - e^{-N/G}\bigr)^2\Bigr)}{\bigl(1 - e^{-N/G}\bigr)^2} \right] \qquad (4.24)
\]
Because of the presence of chimeric reads, it might be useful to consider only clusters of invalid pairs, where 2 or more paired reads span a breakpoint. In this case

\[
P_\zeta \approx 1 - e^{-c} - \frac{NL}{G}\,e^{-(N-1)L/G} \qquad (4.25)
\]
and
\[
E(|\Theta_\zeta| \mid \zeta \text{ is covered by multiple clones}) = \frac{1 - e^{-N/G}}{1 - e^{-c} - \frac{NL}{G}e^{-(N-1)L/G}} \times \sum_{s=0}^{L-1} s^2 e^{-Ns/G}\bigl(1 - e^{-N/G}\bigr) \qquad (4.26)
\]
It is also informative to obtain the probability that two or more chimeric paired reads form a cluster. Let N be the total number of paired reads as before, and q be the probability that a paired read is chimeric. If we assume that the distribution of clone lengths has mean L and is variable over a range of size L′, then

\[
P(\text{at least one pair of chimeric clones overlap}) \approx 1 - \left(1 - \frac{LL'}{G^2}\right)^{\frac{Nq(Nq-1)}{2}} \approx 1 - e^{-\frac{Nq(Nq-1)LL'}{2G^2}}. \qquad (4.27)
\]
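Equation 4.27 is easy to evaluate numerically; a minimal sketch (the parameter values in the note below are hypothetical):

```python
import math

def chimeric_cluster_probability(n_reads, q, mean_len, len_range, genome_len):
    """Pr(at least one pair of chimeric clones overlap), Eq. 4.27.

    n_reads    -- total number of paired reads N
    q          -- probability that a paired read is chimeric
    mean_len   -- mean clone length L
    len_range  -- size L' of the range over which clone lengths vary
    genome_len -- genome size G
    """
    pairs = n_reads * q * (n_reads * q - 1) / 2.0        # candidate chimera pairs
    per_pair = mean_len * len_range / genome_len ** 2    # chance one pair co-clusters
    return 1.0 - math.exp(-pairs * per_pair)
```

With N = 10^6 reads, q = 10^-3, L = 3kbp, L′ = 300bp, and G = 3Gbp, the probability of a spurious chimeric cluster is on the order of 10^-8, so requiring clusters is an effective filter.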

4.5 Acknowledgements

Chapter 4 (with Appendix C) was published in PLoS Computational Biology, Vol 4(4), 2008, A. Bashir, S. Volik, C. Collins, V. Bafna, and B. Raphael, “Evaluation of paired-end sequencing strategies for detection of genome rearrangements in cancer.” The dissertation author was the primary investigator and author of this paper.

Chapter 5

On design of deep sequencing experiments

5.1 Introduction

Massively parallel sequencing technologies have made it possible to cost-effectively sequence individual genomes in a small laboratory setting. This commoditization of sequencing has already changed the landscape of human genetic variation, specifically for structural variations [64], but also for rare variants [16], transcript sampling [87, 85], long-range haplotype resolution [79], and others. Nevertheless, the technological advances also pose significant new challenges. The individual laboratory might not be equipped to produce correct and cost-effective designs for the new experiments. By ‘design’, we refer to questions such as “How much sequencing needs to be done in order to reliably detect all structural variations in the sample to a resolution of 400 bp?” Confounding this further is the proliferation of a large number of sequencing technologies, including Illumina [16], ABI SOLiD [90], Roche 454 [176], Pacific BioSciences [38], and others [19, 119]. These technologies offer the end-user a bewildering array of design-parameters, including cost, read-length, mapping fidelity, clone/insert length, accuracy, and others. It is difficult to make a reasoned choice of technology and


Figure 5.1: Applications of paired-end mapping. (a) Structural variant (inversion) detection using genomic PEM. The inversion brings disparate points (a, b) together to a fusion point ζ in the query. (b) RNA-seq to detect transcript fusion. Sampled transcripts are mapped back to the reference genome to detect disrupted, spliced, or fused transcripts. (c) Haplotype assembly via paired-end mapping. Fragments overlapping multiple alleles elucidate the phase, and can be assembled to get haplotypes.

design-parameters in conducting an experiment. Likewise, the technology developers are faced with difficult choices on which parameters to improve in future development. In this paper, we address and resolve some of the common design questions relating to structural variation, haplotyping, and transcript profiling. Specifically, we focus on Paired-End Sequence Mapping (PEM). In PEM, the two ends of a sampled clone are sequenced and mapped to a reference genome. The mapping provides insight into the structure of the sampled genome relative to the reference, and helps detect structural variation (s.v.), referring to events that rearranged the query genome relative to a reference. See Fig. 5.1(a). If the two ends of the sampled clone map aberrantly to the reference in distance and/or orientation, they form an “invalid pair” and suggest an s.v. event [170]. Note that any s.v. implies at least one breakpoint, described as a pair of disparate genomic coordinates (a, b) from the reference genome that are adjacent at some coordinate of the query genome. Detecting this breakpoint is a necessary prelude to detecting the variation. Resolving the breakpoint refers to reducing the uncertainty in determining a, b. Good resolution is critical to elucidating the phenotypic impact of the variation. In an earlier work, we described the use of tightly resolved breakpoints in detecting gene fusion events in cancer [14].
Intuition suggests that the longer the clone insert-length, the easier it is to detect the breakpoint. At the same time, longer clone-lengths increase the uncertainty in resolving the breakpoint. Below, we make the trade-off between detection and resolution explicit. We also derive a formula that computes the probability of resolving a breakpoint to within ‘s’ base-pairs, given a fixed number of shotgun reads from a specific paired-end sequencing technology. For example, we can show that only 1.5× mapped sequence coverage of the human genome using Illumina (Solexa) can help resolve almost 90% of the breakpoints to within 200bp using a mix of clones. All other parameters being equal, we show that the best resolution of a structural variation comes from using exactly two possible clone-lengths: one that is as close as possible to the desired resolution, and one that is as long as technologically possible.

Transcript sequencing is a direct approach for measuring transcript abundance, as well as uncovering variations involving transcripts, such as fused transcripts and trans-splicing events [87]. Fig. 5.1(b) describes the detection of a fused transcript via transcript sequencing. However, the variation in abundance levels of different transcripts makes it harder to design experiments that sample the transcriptome effectively. Here, we derive empirical and analytic results for the amount of sequencing needed to obtain a specific coverage of the transcriptome, based on a small amount of sample sequence. Our estimates depend on a novel extrapolation for the low-abundance genes, which are not represented in the sample. Haplotyping, or the separation of the maternal and paternal chromosomes, represents yet another application of PEM (see Fig. 5.1(c)). PEM is used to map the sequenced fragments to the reference genome. A paired-end sequence that contains two or more variant sites establishes the phase between those sites, and the information can be assembled to reconstruct long haplotypes. A key aspect of sequence-based haplotyping is that the haplotype length depends primarily on the lengths of the sequenced reads and the depth of coverage, and not on Linkage Disequilibrium patterns in the population. We recently established that haplotyping can be done effectively by assembling variants together using long Sanger reads [10, 79]. Our design results in accurate haplotypes with an N50 length of 350kbp. On the face of it, it seems unlikely that next generation sequencing can be useful for haplotyping, given the significantly shorter read lengths. We show empirically that PEM, using a mix of clone-lengths, is effective in resolving the haplotypes to an N50 length of several hundred kilobases.
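The phase-linking step can be sketched with a small union-find pass. This is a simplification not taken from the chapter: each fragment is treated as one covered interval that phases every heterozygous site inside it, and error-free variant calls are assumed.

```python
import bisect

def haplotype_blocks(variant_positions, fragments):
    """Group heterozygous sites into phased blocks: two sites are linked when
    some fragment covers both (directly or transitively).

    variant_positions -- sorted coordinates of heterozygous sites
    fragments         -- list of (start, end) covered intervals
    """
    parent = list(range(len(variant_positions)))

    def find(i):                                   # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for start, end in fragments:
        lo = bisect.bisect_left(variant_positions, start)
        hi = bisect.bisect_right(variant_positions, end)
        for i in range(lo + 1, hi):                # link consecutive covered sites
            parent[find(i)] = find(i - 1)

    blocks = {}
    for i, pos in enumerate(variant_positions):
        blocks.setdefault(find(i), []).append(pos)
    return list(blocks.values())
```

Two fragments covering sites {10, 100} and {200, 500} respectively yield two phased blocks; a third fragment spanning the gap merges them into one, which is why longer (or more variable) clone-lengths extend haplotypes.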

5.2 Results

Detection-Resolution tradeoff: Recall that a breakpoint (a, b) on the reference genome refers to a single ‘fusion’ point ζ on the query genome. If the two ends of a clone span ζ, the breakpoint is detected. Assume that N clones of length L are sampled at random and end-sequenced. For a genome of length G, the clonal coverage c = NL/G describes the expected number of clones spanning ζ. Intuition suggests that the longer the clone-length L, the easier it is to detect ζ. Indeed, the probability of detecting an arbitrary breakpoint is given by the Clarke-Carbon formula [27]:

\[
P(\zeta) = 1 - e^{-c} \qquad (5.1)
\]

However, the greater clone length also creates a greater uncertainty in the location of ζ. Define resolution-ambiguity as the size of the region Θ (denoted by |Θ|) in which ζ is constrained to lie. Order the clones spanning ζ by their right end-point. Let A be the distance of the right end-point of the leftmost clone to the right of ζ. Then,

\[
\Pr(A > s \wedge \zeta \text{ is covered}) = (1 - s/G)^N \cdot \left(1 - \left(1 - \frac{L-s}{G}\right)^{N}\right) \approx e^{-sc/L} - e^{-c}
\]

We show (see METHODS) that

\[
E(A \mid \zeta \text{ is covered}) = \frac{\sum_s \Pr(A > s \wedge \zeta \text{ is covered})}{\Pr(\zeta \text{ is covered})} = \frac{L}{c} - \frac{L}{e^c - 1}
\]
Using symmetry arguments,

\[
E(|\Theta| \mid \zeta \text{ is covered}) = 2\,E(A \mid \zeta \text{ is covered}) = \frac{2L}{c} - \frac{2L}{e^c - 1} = \frac{2G}{N} - \frac{2L}{e^{NL/G} - 1} \qquad (5.2)
\]

Given a fixed number of sequenced reads N, Eq. 5.2 describes the decrease in resolution with increasing L, although the effect decreases for large N. Figure 5.2 illustrates the trade-off. Nevertheless, current sequencing capability allows us to detect and resolve a large fraction of breakpoints. For example, with an Illumina run with 2kbp inserts and 25 × 10^6 mappable reads, one could detect nearly 100% of breakpoints with an average resolution-ambiguity of less than 500bp.
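Equations 5.1 and 5.2 are simple to evaluate; a small helper (the parameter values in the note below mirror the Illumina example in the text):

```python
import math

def breakpoint_detection_prob(n_reads, clone_len, genome_len):
    """Pr(an arbitrary fusion point is spanned by a clone), Eq. 5.1."""
    c = n_reads * clone_len / genome_len           # clonal coverage
    return 1.0 - math.exp(-c)

def expected_resolution_ambiguity(n_reads, clone_len, genome_len):
    """E(|Theta| | zeta covered) = 2L/c - 2L/(e^c - 1), Eq. 5.2."""
    c = n_reads * clone_len / genome_len
    return 2 * clone_len / c - 2 * clone_len / math.expm1(c)
```

For 25 × 10^6 mapped 2kbp inserts on a 3Gbp genome, detection is essentially certain and the expected resolution-ambiguity is roughly 2L/c ≈ 240bp, consistent with the sub-500bp figure above.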

Mixing clone lengths: Many of the next generation sequencing technologies offer a variety of clone (insert) lengths. For example, the ABI SOLiD technology claims a


Figure 5.2: Detection-resolution trade-off. (a) The detection probability P_ζ increases with increased sequencing (N), as well as clone-length L. (b) The expected resolution-ambiguity increases with increasing clone-length L. In both figures, the ‘x’ and ‘•’ correspond to the expected and observed values, respectively, for specific values of N and L.

variety of clone lengths ranging from 600bp to about 10000bp [60]. Given the trade-off between detection and resolution, we next asked if using a mix of clone lengths could help with detection and resolution. To address this, we first derived bounds on the probability of resolving a breakpoint to a desired level of resolution using a mix of two clone lengths. Suppose we generate N1, N2 reads, respectively, from clone libraries of lengths L1, L2. Then, for an arbitrary s (see METHODS),

\[
\Pr(|\Theta_\zeta| \le s) = \begin{cases}
1 - e^{-s(N/G)}\left(1 + s\frac{N}{G}\right) & \text{if } s < L_1, \\
1 - e^{-c_1}\,e^{-s(N_2/G)}\left(1 + s\frac{N_2}{G}\right) & \text{if } L_1 \le s < L_2, \\
1 - e^{-c} & \text{if } s \ge L_2, \text{ where } c = (N_1L_1 + N_2L_2)/G.
\end{cases} \qquad (5.3)
\]

Note that the resolution-ambiguity |Θ| ≤ L1, or |Θ| ≤ L2, can be obtained using single clone libraries, but the likelihood of resolving between L1 and L2 is optimized by using an appropriate mix of the two libraries. Analogous equations can be derived when two or more overlapping clones are required to detect a breakpoint. Figure 5.3 illustrates this principle using Illumina-generated sequences for an African male [16], assuming

[Figure 5.3 plot: “Probability of resolving a breakpoint to precision s or better” (N = 21M; N1 = 12.5M, N2 = 8.5M).]

Figure 5.3: Combination of clone lengths for breakpoint detection. The solid lines indicate the theoretically predicted probabilities as described in METHODS. The dotted line indicates an empirical mix, using 4 lanes each from 200 and 2000 base pair insert runs. N corresponds to the combined number of reads which were uniquely mapped at both ends.

one had chosen to split a single run between 2 insert lengths. For the figure, we use clones of length 200bp and 2000bp. A set of ‘true’ breakpoints was chosen from ‘large’ s.v.’s. As discussed in the next paragraph, small events are hard to detect using large clone-lengths. For a fixed amount of sequencing, we confirm the theoretically predicted boost in the probability of resolving a breakpoint to within a resolution-ambiguity of 200bp. The probability is doubled from 0.15 to over 0.29 using a mix of clone libraries. Similar results are obtained for other sequencing studies, such as an ABI SOLiD sequencing with 600 and 2700 length libraries (data not shown). In a further extension of the analysis, we show that to maximize the likelihood of resolving breakpoints to s bp, we need only two libraries: one with clone-length s, and the other as large as possible (see METHODS). A caveat of this analysis is that we take the clone-length as fixed, and do not account for variation in lengths. While this can be easily corrected, it is not an issue for large events, such as those obtained by non-allelic homologous recombination [64].

Even with this caveat, empirical data very closely approximates the theoretical curve (Figure 5.3, dotted lines). Though the theoretical model performs better, the magnitude of the ‘boost’ at 200bp is maintained. The concordance between theoretical and experimental results shows the limited effect of clone-length variation. Other mechanisms, such as non-homologous end-joining (NHEJ), often give rise to small insertions and deletions [64]. These are valuable as genetic markers. If the event size is smaller than the variance in clone-length, the event will not be detected by a breakpoint-spanning clone. Small clone-length libraries, or longer reads (such as those available in Roche 454), would then be the best design choice for detection.
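Equation 5.3 for a two-library mix can be sketched as a piecewise helper (the read counts in the test are hypothetical; L1 < L2 is assumed):

```python
import math

def prob_resolve_within(s, n1, l1, n2, l2, genome_len):
    """Pr(|Theta_zeta| <= s) for a mix of two clone libraries (Eq. 5.3)."""
    g = genome_len
    n = n1 + n2
    c1, c2 = n1 * l1 / g, n2 * l2 / g
    if s < l1:                                     # both libraries constrain the flanks
        return 1.0 - math.exp(-s * n / g) * (1 + s * n / g)
    if s < l2:                                     # an L1 clone, or two close L2 clones
        return 1.0 - math.exp(-c1) * math.exp(-s * n2 / g) * (1 + s * n2 / g)
    return 1.0 - math.exp(-(c1 + c2))              # any spanning clone suffices
```

The probability is non-decreasing in s, with a jump at s = L1 contributed by the short library, which is the source of the ‘boost’ discussed above.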

Transcript sequencing: As transcripts have variable expression, the amount of sequencing needed to detect a transcript is uncertain. Let x_t denote the true-expression of transcript t, defined as the number of copies of t in the sample. While no expression technology directly measures x_t, sequence-based transcript sampling (such as the

RNA-seq procedure) is the first to provide an unbiased estimate of xt. Let transcript t have length lt. In the RNA-seq procedure, the transcripts are randomly sheared, creating

∝ l_t x_t fragments of each transcript, which are then sampled and end-sequenced. Let a_t ∝ x_t l_t denote the number of sequences sampled from t [87]. The normalized-expression for t,

\[
\nu_t = \frac{x_t l_t}{\sum_u x_u l_u} = \frac{a_t}{\sum_u a_u},
\]
describes the probability of sampling and sequencing t. Our key idea is based on sampling. While deep sequencing is required to estimate ν_t for each transcript t, a more modest level of sequencing allows us to estimate the distribution of ν values among all transcripts. Formally, define a p.d.f. f(ν) for a randomly sampled transcript to have normalized-expression ν. Consider a transcript sequencing experiment with N reads. If we could estimate ν_t, then

\[
\Pr[t \text{ is sampled} \mid \nu_t = \nu] = 1 - (1 - \nu)^N \simeq 1 - e^{-\nu N}
\]

Instead, we propose to use the estimate of f to make predictions about sampling transcripts:
\[
\Pr[t \text{ is sampled}] = \int_\nu \Pr[t \text{ is sampled} \mid \nu_t = \nu]\, f(\nu)\,d\nu = \int_\nu (1 - e^{-\nu N}) f(\nu)\,d\nu \qquad (5.4)
\]
We tested the predictive accuracy of Eq. 5.4 using data from Marioni et al. [87]. An empirical p.d.f. was derived from the total sequence used in each of the two studies (kidney and liver, ∼35 × 10^6 reads each). Figure 5.4(a) shows that the empirical distribution of normalized-expression values is similar between the two studies. We next asked if f could be estimated using a smaller sequence sample. If so, a small amount of sequencing to estimate f can be used to compute the depth of sequencing required to adequately sample all of the transcripts. We generated smaller sequence-subsets (100K and 1M) by sampling from the complete set, and used those to compute the probability of transcript detection. Figure 5.4(b) plots a detection-curve, described as the probability of detecting a transcript from the liver sample as a function of its normalized abundance. While the predictions made with the smaller samples (blue, red solid lines) track the true detection-curve (black line), there is a significant bias. The bias appears because the low-abundance reads are not sampled. Using a novel regression-based strategy (METHODS) to correct for the bias, the corrected curves (blue, red dotted lines) track the true estimates closely, even using a sparse set of 100K reads. From this data, we predict that with 1 million reads we can sample 90–95% of the transcripts in this sample. We suspect that f might be well-conserved across samples (as in Figure 5.4(a)), and may not have to be re-estimated entirely independently for each sample. We confirmed that the p.d.f. accurately predicts the probability of detection by bootstrapping (Supplemental Figure D.1).
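Equation 5.4, discretized over an empirical histogram of ν, is a one-liner (the two-bin distribution in the note below is a toy example, not the Marioni et al. data):

```python
import math

def prob_transcript_sampled(nu_values, weights, n_reads):
    """Pr[a randomly chosen transcript is sampled at least once], Eq. 5.4,
    with the p.d.f. f(nu) replaced by a discrete empirical estimate.

    nu_values -- normalized-expression bin centers
    weights   -- fraction of transcripts per bin (sums to 1)
    n_reads   -- number of sequenced reads N
    """
    return sum(w * (1.0 - math.exp(-nu * n_reads))
               for nu, w in zip(nu_values, weights))
```

If half the transcripts have ν = 10^-5 and half ν = 10^-7, then N = 10^6 reads sample about 55% of transcripts overall: the abundant half is almost surely seen, the rare half only about 10% of the time.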

Haplotype assembly: Recently, the first individual human sequence, ‘HuRef’, was generated using Sanger sequencing technology [79]. The longer read-lengths make it possible to assemble long haplotypes, even with modest levels of sequencing. We asked if deep sequencing with short reads could be used for haplotyping. The HuRef


Figure 5.4: Distribution of normalized expression from two transcript sequencing experiments. (a) Histogram of ν in two separate samples. For clarity, the bins are log-distributed and the y-axis represents the fraction of total reads. This enables the computation of (b), the probability of detecting an arbitrary transcript. The solid lines correspond to predictions made from the empirical (or simulated empirical) distribution. The dotted lines correspond to corrected values from regression (see METHODS). Note the high fit that is obtained after correction, with only 100,000 reads.

sequence data was used to simulate paired-end reads of different lengths (see METHODS), from a diploid chromosome with 154000 heterozygous SNPs (1 every 1.6kbp). For the analysis, we simulated perfect reads with no errors. Any read that maps to two or more variants confirms the phase between them, and this information can be assembled to obtain larger haplotypes. To estimate haplotyping performance using next generation sequencing, we used the N50 measure, defined as the physical length such that 50% of the variants are contained in haplotype segments of the given length or greater. The measure is considered better than the median/average haplotype length, since those can be considerably biased by a large number of small haplotype segments. Using Sanger sequencing, the N50 haplotype length for the HuRef diploid genome was ∼350kbp at a sequence coverage of 7.5×, illustrating the power of this approach [79]. In the absence of paired-end sequence information, one would require very long reads (∼10kbp) to achieve N50 lengths comparable to HuRef data (Supplementary Figure D.4). While long reads could be available in the near future [38], current technologies offer much shorter read lengths. With current technology, even the use of paired-end sequencing to link variants results in only modest improvements in the N50 haplotype length.
Using a mix of clones with lengths normally distributed around mean lengths 3kbp, 10kbp and 36kbp (σ = 0.1µ), and 50× sequence coverage, the N50 length for 50bp reads is small, i.e. 45kbp (Figure 5.5(a)). The N50 length can be improved dramatically by either increasing the read length (600kbp for 200bp reads) or by increasing the clone-length variation. Using 50bp reads and 50× sequence coverage, with a mix of clone-lengths ranging from 3kbp to 27kbp, we can get an N50 haplotype length of 200kbp (similar to the Venter diploid genome) (Figure 5.5).
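For reference, a generic length-weighted N50 helper (the chapter's measure weights by variants contained rather than by raw segment length, but the computation has the same shape):

```python
def n50(segment_lengths):
    """Smallest X such that segments of length >= X hold half the total length."""
    total = sum(segment_lengths)
    running = 0
    for length in sorted(segment_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0
```

A single dominant segment sets the N50 by itself, whereas many equal small segments leave it small, which is why N50 resists the bias that afflicts the mean and median.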


Figure 5.5: (a) Haplotype lengths as a function of sequence coverage for three different read lengths. Each mate pair is a pair of reads of the corresponding length. For short reads (50 base pairs), even at 50× coverage, the N50 haplotype length is small (≈45kbp). (b) The N50 haplotype length grows dramatically for 50bp reads, using a mix of clone-lengths ranging from 3kbp to 27kbp.

5.3 Discussion

We present a number of analytic and empirical results on the design of sequencing experiments for uncovering genetic variation. Our study provides a systematic explanation for empirical observations relating to the amount of sequencing, and the choice of technologies. The theoretical analysis is not without caveats, which are discussed below. Nevertheless, the concordance with empirical data illustrates the applicability of our methods. Some of the results, while not counter-intuitive, provide additional insight. For example, we show that the best design for detecting s.v. to within ‘s’ bp demands the choice of exactly two clone-lengths, one close to s, and the other as large as possible. We explicate the trade-offs between detection and resolution, and can compute the probability of s.v. detection as well as the expected resolution-ambiguity for any choice of technology and parameters. An important simplification in our analysis is to treat clone-length as constant. However, choosing a distribution on the clone-length does not influence the expected resolution-ambiguity, only its variance. The variance is important for measuring smaller structural variations. Therefore, experiments that aim to detect small structural variations are constrained to using technologies in which the clone-length variation is significantly smaller than the size of the s.v. itself. The available technologies are constantly reducing the variance in clone-lengths through better library preparation strategies, which might allow the use of larger clone-lengths in the future. However, our results also point to the usefulness of large clone-length variation in haplotype assembly. Therefore, experiments aimed at detecting small variations (small clone-lengths), larger translocations (larger clone-lengths), and haplotype sampling (large variation in clone-lengths) all demand different designs.
Haplotype assembly is a very specialized, but increasingly popular, application of the new sequencing methods. While our results have been obtained for currently available technology parameters, it is clear that up-and-coming technologies for individual molecule isolation, whole genome amplification, and single molecule sequencing could make chromosome-scale haplotyping feasible. Once the new technology parameters are available, it is likely that the optimal designs will change to include a mix of short and long clones, in order to scaffold alleles together. Other issues confound design as well, but can be modeled. Different technologies have different error rates. This is corrected by introducing a mapping-rate parameter f, defined as the fraction of reads that are mapped unambiguously to the reference. Replacing the number of reads N by fN helps correct somewhat for sequencing errors. Chimerism in clones can be controlled by demanding the use of multiple overlapping clones. We have extended most analyses to requiring two or more clones (see METHODS). Longer clone-lengths consume more sample for an equivalent amount of sequencing. Therefore, if the sample is limited (as in tumors), the best design should also seek to optimize a ‘sample-cost’ versus detection trade-off. We do not address some important applications of next generation sequencing technologies: the detection of rare (and common) sequence variants in re-sequencing studies. Given the relatively high error rates of these technologies, reliable and accurate detection of sequence variants (SNPs) is a challenging problem. Our preliminary analysis indicates that the design depends greatly on the technologies used, and on computational methods for detecting sequencing artifacts. It is therefore difficult to draw general design principles that would be applicable to all technologies. The design of sequencing for ‘dark-region’ identification (i.e.

5.4 Methods

5.4.1 Breakpoint Resolution

The clone coverage is given by c = NL/G where N is the number of clones. A breakpoint (a, b) in the reference genome corresponds to a fusion point ζ in the query genome where the coordinates a, b come together. Let ζ be covered by at least one clone, and let A be the distance of the right end point of the leftmost clone from ζ.

\[
\Pr(A > s \wedge \zeta \text{ is covered}) = (1 - s/G)^N \cdot \left(1 - \left(1 - \frac{L-s}{G}\right)^{N}\right) \approx e^{-sc/L} - e^{-c}
\]
\[
E(A \mid \zeta \text{ is covered}) = \frac{\sum_s \Pr(A > s \wedge \zeta \text{ is covered})}{\Pr(\zeta \text{ is covered})} = \frac{\int_0^L \bigl(e^{-sc/L} - e^{-c}\bigr)\,ds}{1 - e^{-c}} = \frac{(L/c)(1 - e^{-c}) - Le^{-c}}{1 - e^{-c}} = \frac{L}{c} - \frac{L}{e^c - 1}
\]

Using symmetry arguments,
\[
E(|\Theta| \mid \zeta \text{ is covered}) = 2\,E(A \mid \zeta \text{ is covered}) = \frac{2L}{c} - \frac{2L}{e^c - 1}
\]
Requiring coverage by multiple (≥ 2) clones,

\[
\Pr(A > s \wedge \zeta \text{ covered by} \ge 2 \text{ clones}) = e^{-cs/L}\bigl(1 - \Pr(\text{no clones}) - \Pr(\text{exactly one clone})\bigr) = e^{-cs/L}\left(1 - e^{-(L-s)c/L} - \frac{c(L-s)}{L}\,e^{-(L-s)c/L}\right)
\]
\[
E(A \mid \zeta \text{ covered by} \ge 2 \text{ clones}) = \frac{\int_0^L \bigl(e^{-sc/L} - e^{-c}(c+1) + (cs/L)e^{-c}\bigr)\,ds}{1 - e^{-c} - ce^{-c}} = \frac{(L/c)(1 - e^{-c}) - (c+1)Le^{-c} + (cL/2)e^{-c}}{1 - e^{-c} - ce^{-c}} = \frac{L}{c} - \frac{cL}{2(e^c - (c+1))}
\]
\[
E(|\Theta| \mid \zeta \text{ covered by} \ge 2 \text{ clones}) = 2\,E(A \mid \zeta \text{ covered by} \ge 2 \text{ clones}) = \frac{2L}{c} - \frac{cL}{e^c - c - 1}
\]
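A toy-scale Monte Carlo check of these closed forms (not from the dissertation; clone left endpoints are drawn uniformly rather than from a Poisson process, a good approximation when L is much smaller than G):

```python
import math
import random

def simulate_resolution(n_clones, clone_len, genome_len, trials=1000, seed=0):
    """Monte Carlo estimate of E(|Theta| | fusion point covered)."""
    rng = random.Random(seed)
    zeta = genome_len / 2.0
    total, covered = 0.0, 0
    for _ in range(trials):
        left = right = None
        for _ in range(n_clones):
            x = rng.uniform(0.0, genome_len - clone_len)
            if x <= zeta <= x + clone_len:         # this clone spans the fusion point
                left = x if left is None else max(left, x)
                right = x + clone_len if right is None else min(right, x + clone_len)
        if left is not None:
            covered += 1
            total += right - left                  # |Theta| = intersection length
    return total / covered if covered else float("nan")
```

At c = 5 (e.g. 500 clones of 1kbp on a 100kbp genome), the simulated mean lands within a few percent of 2L/c − 2L/(e^c − 1) ≈ 386bp.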

5.4.2 Simulation

A set of “true” breakpoints was chosen by mapping Illumina reads for individual NA18507 (obtained from the NCBI short read trace archive) to build 36.1 of the human genome, using the ELAND alignment tool, with each end mapped separately to detect s.v.’s. A large number of libraries were mapped to obtain a candidate set of “true breakpoints” (>100× insert coverage). Next, a 200bp and a 2kbp library were selected at random. The fraction of “true” breakpoints detected and resolved by these libraries is shown in Figure 5.3. To simplify analysis, only events greater than 2kbp were considered. |Θ_ζ| was obtained by taking the mean (Θ_ζ) given the overlap in paired-reads. These numbers were compared against theoretical predictions, obtained using Eqs. 5.1 and 5.2, respectively [78, 14]. The number of mapped reads, N, is given by the number of paired-reads for which both ends mapped uniquely to the genome.

5.4.3 Mixing clone-lengths

Consider the case where we have two different clone lengths $L_1$ and $L_2$, with $L_2 > L_1$ w.l.o.g. Denote the coverages of the two clone libraries as $c_1$ and $c_2$, and let $c = c_1 + c_2 L_1/L_2$.

\begin{align*}
\Pr(A > s \wedge s \le L_1 \wedge \zeta \text{ is covered}) &\approx e^{-sc/L_1} - e^{-c} \\
\Pr(A > s \wedge s > L_1 \wedge \zeta \text{ is covered}) &= e^{-c_1}\left(e^{-sc_2/L_2} - e^{-c_2}\right) \\
E(A \mid \zeta \text{ is covered}) &= \frac{\sum_{s \le L_1} \Pr(A > s \wedge s \le L_1 \wedge \zeta \text{ is covered})}{\Pr(\zeta \text{ is covered})} + \frac{\sum_{s > L_1} \Pr(A > s \wedge s > L_1 \wedge \zeta \text{ is covered})}{\Pr(\zeta \text{ is covered})}
\end{align*}

\begin{align*}
E(A \mid \zeta \text{ is covered}) &= \frac{\dfrac{L_1}{c}\left(1 - e^{-c}\right) - L_1 e^{-c} + \dfrac{L_2}{c_2}\left(e^{-c} - e^{-(c_1+c_2)}\right) - e^{-(c_1+c_2)}\left(L_2 - L_1\right)}{1 - e^{-(c_1+c_2)}} \\
E(|\Theta| \mid \zeta \text{ is covered}) &= 2 \cdot E(A \mid \zeta \text{ is covered})
\end{align*}

Next, we compute the probability of resolving a breakpoint to within $s$ bp. We have three cases: (i) $s < L_1$; (ii) $L_1 \le s < L_2$; and (iii) $s \ge L_2$. For $s < L_1$, we extend the analysis of [14], where we showed that $\Pr(|\Theta_\zeta| = s) = s\, e^{-sN/G}\left(1 - e^{-N/G}\right)^2$. Denoting

N = N1 + N2,

\begin{align*}
\Pr(|\Theta_\zeta| \le s \mid s < L_1) &= \left(1 - e^{-N/G}\right)^2 \sum_{x=0}^{s} x\, e^{-xN/G} \\
&\approx \left(1 - e^{-N/G}\right)^2 \int_0^s x\, e^{-xN/G}\, dx \\
&= \left(1 - e^{-N/G}\right)^2 \left(\left(\frac{G}{N}\right)^2 - \frac{G}{N}\, e^{-sN/G}\left(s + \frac{G}{N}\right)\right) \\
&= 1 - e^{-sN/G} - \frac{sN}{G}\, e^{-sN/G}
\end{align*}

Note that the result is independent of the clone lengths (or, in fact, of whether a mix of clones is being used at all). However, for the case $L_1 \le s < L_2$, we have to consider either the event of an $L_1$ clone spanning the breakpoint (probability $1 - e^{-c_1}$), or the event of two $L_2$ clones spanning ζ with no $L_1$ clone spanning ζ. Therefore,

\[ \Pr(|\Theta_\zeta| \le s \mid L_1 \le s \le L_2) = \left(1 - e^{-c_1}\right) + e^{-c_1}\left(1 - e^{-sN/G} - \frac{sN}{G}\, e^{-sN/G}\right) \]

The case when s > L2 can be modeled by a single library with c = (N1L1 + N2L2)/G.

\[ \Pr(|\Theta_\zeta| \le s \mid s \ge L_2) = 1 - e^{-c} \]

The equations can be modified to require that at least 2 clones overlap a breakpoint. Case (i) is unchanged, as it already requires 2 clones, and likewise for the second term in case (ii). We constrain the first term of case (ii) to require 2 or more clones.

\begin{align*}
\Pr(|\Theta_\zeta| \le s \mid L_1 \le s \le L_2) &= \left(1 - e^{-c_1} - c_1 e^{-c_1} e^{-c_2}\right) + e^{-c_1}\left(1 - e^{-sN/G} - \frac{sN}{G}\, e^{-sN/G}\right) \\
&= \left(1 - e^{-c_1} - c_1 e^{-c}\right) + e^{-c_1}\left(1 - e^{-sN/G} - \frac{sN}{G}\, e^{-sN/G}\right)
\end{align*}

For case (iii), we can extend the generic cluster coverage case.

\[ \Pr(|\Theta_\zeta| \le s \mid s \ge L_2) = 1 - e^{-c} - c\, e^{-c} \]
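The three-case expression (in its single-clone-coverage version) can be collected into one function. This is a sketch with our own parameter names, not code from the original study:

```python
import math

def prob_resolved(s, L1, L2, N1, N2, G):
    # Piecewise Pr(|Theta_zeta| <= s) for a mix of two clone lengths
    # L1 < L2, following the three cases derived above.
    N = N1 + N2
    c1 = N1 * L1 / G
    base = 1 - math.exp(-s * N / G) - (s * N / G) * math.exp(-s * N / G)
    if s < L1:
        # case (i): independent of the clone lengths used
        return base
    if s < L2:
        # case (ii): an L1 clone spans zeta, or we fall back on L2 clones
        return (1 - math.exp(-c1)) + math.exp(-c1) * base
    # case (iii): behaves like a single pooled library
    c = (N1 * L1 + N2 * L2) / G
    return 1 - math.exp(-c)

ps = [prob_resolved(s, 200, 2000, 5000, 5000, 1_000_000)
      for s in (50, 150, 500, 1500, 2500)]
```

As expected, the resolution probability grows quickly with s while s is below the short clone length and saturates near 1 once s exceeds the long clone length.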

5.4.4 Proof of Optimality of Two Clone Design

We show that it is sufficient to consider exactly two clone lengths for resolving a breakpoint to within $s$ bp. We first show that, for a given $s$ and $N$ and a collection of clone lengths, $\Pr(|\Theta_\zeta| \le s)$ is maximized using a mixture of at most 2 clone lengths. Assume to the contrary that an optimal mix requires $\ge 3$ distinct clone lengths.

This implies that there is some clone length $L'$ with $L' \ne s$ and $L' \ne L_M$, where $L_M$ is the maximum available clone length. In other words, either (a) $L' < s$, or (b) $s < L' < L_M$. We consider each case in turn.

$L' < s$: From the earlier discussion, the contribution of the clones of length $L'$ to $\Pr(|\Theta| \le s)$ is proportional to their coverage. Replacing clones of length $L'$ with clones of length $s$ increases coverage without changing $N$, contradicting optimality.

$s < L' < L_M$: Once again, for clones larger than the desired resolution ambiguity $s$, the contribution to $\Pr(|\Theta| \le s)$ depends only on coverage. Replacing them with clones of length $L_M$ improves the resolution probability, a contradiction.

An immediate corollary is that the optimal design consists of a mix of two clone lengths, $s$ and $L_M$. The mix of the two libraries (the ratio $N_1/N_2$, with $N_1 + N_2 = N$ fixed) only needs to be optimized for case (ii).

\[ \Pr(|\Theta_\zeta| \le s \mid L_1 \le s \le L_2) = \left(1 - e^{-c_1} - c_1 e^{-c}\right) + e^{-c_1} \left(1 - e^{-N_2/G}\right)^2 \left(\left(\frac{G}{N_2}\right)^2 - \frac{G}{N_2}\, e^{-sN_2/G}\left(s + \frac{G}{N_2}\right)\right) \]

We compute the optimal mix empirically by iterating over $N_1 \in [0, N]$.
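That empirical search can be sketched as follows. We use the simpler approximate form of the second term, $1 - e^{-sN_2/G} - (sN_2/G)e^{-sN_2/G}$; the variable names and example parameters are ours:

```python
import math

def prob_case_ii(s, N1, N2, L1, L2, G):
    # Case (ii) objective for the optimal design (L1 = s, L2 = L_M):
    # either >= 2 short clones span the breakpoint, or the long-clone
    # library alone resolves it (approximate second term).
    c1 = N1 * L1 / G
    c = (N1 * L1 + N2 * L2) / G
    if N2 == 0:
        t2 = 0.0
    else:
        t2 = 1 - math.exp(-s * N2 / G) - (s * N2 / G) * math.exp(-s * N2 / G)
    return (1 - math.exp(-c1) - c1 * math.exp(-c)) + math.exp(-c1) * t2

def best_mix(s, N, L1, L2, G, step=1000):
    # iterate N1 over [0, N] and keep the split maximizing the objective
    best_n1, best_p = 0, -1.0
    for N1 in range(0, N + 1, step):
        p = prob_case_ii(s, N1, N - N1, L1, L2, G)
        if p > best_p:
            best_n1, best_p = N1, p
    return best_n1, best_p

n1, p = best_mix(s=500, N=100_000, L1=500, L2=10_000, G=100_000_000)
```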

5.4.5 Simulation for mix of clones

The set of breakpoints, and the method for computing the mean size of Θζ, followed those of Section 5.4.2. A single 2kbp library and a single 200bp library were analyzed, using 4 lanes from each corresponding flow cell. Clusters of invalid pairs were generated by combining the two reduced libraries.

5.4.6 Transcript Sequencing

Mapped RNA-seq data, generated by Marioni et al. [87], was obtained from the Gilad Lab website. The genomic mappings were converted to a list of overlapping exons in RefSeq. For each transcript, a count of the number of reads sampling it was generated¹. This enabled the estimation of νt, which was calculated as described earlier.

To obtain smaller data sets, reads were randomly sampled and νt was re-calculated. Each sample is used to estimate the p.d.f. of normalized abundance values, shown in Supplemental Figure D.2. Each sample of r reads is accurate for highly expressed genes (normalized expression > 1/r); below 1/r, the chance of sampling a gene is low, so the p.d.f. cannot be estimated accurately. We observed empirically that the p.d.f. follows the distribution:

\begin{align*}
f(\nu) &= c\, e^{\beta \nu^{-\alpha}} \\
\log f(\nu) &= \beta \nu^{-\alpha} + \log c \\
\log\log f(\nu) &= \log\beta - \alpha \log\nu
\end{align*}

Supplemental Figure D.3 shows that the plot of $\log\log f(\nu)$ vs. $\log\nu$ is a straight line for all samples. A regression analysis on this line yields the slope $\alpha$ and the intercept $\log\beta$. To obtain $c$, we use

\[ \int_{x=0}^{1} c\, e^{\beta x^{-\alpha}}\, dx = 1 \]

¹In the case of multiple exons overlapping the same read, all corresponding transcripts were counted.

The values $c$, $\alpha$, $\beta$ are used to obtain a p.d.f. A critical step is to identify the aforementioned “reliable point”. This point can be identified independently for each distribution by determining the point of inflection of the graph of $\log\log f(\nu)$ vs. $\log\nu$; the points immediately downstream of the inflection are used for regression. The corrected p.d.f. uses the empirically generated p.d.f. after this reliable point, and the theoretical p.d.f. before it. It is important to note that the empirical p.d.f. derived using all reads shows a drop-off in abundance for very low abundance genes (Figure 5.4(a)), where the regression would over-predict. However, this tail is very likely an artifact of incomplete sampling, and a regression on the full data would provide a better estimate.
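The regression step described above can be sketched with a stdlib-only least-squares fit on the log-log transformed values. The function name and the synthetic check are ours, and we assume the $\log c$ term is negligible (as the straight-line relation above implicitly does):

```python
import math

def fit_power_law(nus, fs):
    # fit log(log f(nu)) = log(beta) - alpha * log(nu) by least squares
    xs = [math.log(v) for v in nus]
    ys = [math.log(math.log(f)) for f in fs]   # requires f(nu) > 1
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return -slope, math.exp(intercept)         # (alpha, beta)

# synthetic check: data generated exactly from f(nu) = exp(beta * nu**-alpha)
nus = [10 ** (-k / 4.0) for k in range(1, 12)]
fs = [math.exp(2.0 * v ** -0.8) for v in nus]
alpha, beta = fit_power_law(nus, fs)           # recovers alpha=0.8, beta=2.0
```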

5.4.7 Haplotype assembly

We chose the HuRef sequence data to simulate haplotype assembly. For chromosome 1, ∼120,000 SNPs were identified as heterozygous (at least 2 reads for each allele, and 20% of reads supporting the minor allele) during shotgun sequencing at a coverage of 7.5×. At this coverage, 21.6% of true heterozygous SNPs will be miscalled as homozygous due to lack of coverage for both alleles. In order to simulate a complete set of heterozygous SNPs, we randomly relabeled 21.6% of the SNPs called homozygous as heterozygous. This gives us a set of 154,000 heterozygous SNPs, approximately one every 1.6kbp. Other forms of variation, such as short insertions/deletions, were also found to be heterozygous and can potentially be used, but were ignored in this simulation. For preliminary analysis, we simulated shotgun reads with no errors. The quality of a haplotype assembly is measured using the length of the haplotype segments (sets of SNPs that can be linked, or phased, together). Length is measured using the N50 haplotype length, defined as the physical length such that 50% of the variants are contained in haplotype segments of that length or greater. The greater the N50 length, the better the haplotype assembly. Note that the N50 length is a better measure than the median or average haplotype length, since those can be considerably biased by a large number of small connected components. The N50 haplotype length for the Venter diploid genome was about 350 kilobases. For single-read simulation (no paired ends), we simulated reads ranging from 1000 to 10000 base pairs. Note that no current technology can generate such long reads; however, some upcoming methods (see, e.g., Pacific Biosciences) promise to do so. Next, to simulate paired-end sequences, we considered mean clone lengths of 3kbp, 10kbp and 36kbp in an empirically determined proportion of 60%, 20% and 20%, respectively. The sampled clone lengths were picked from a normal distribution with σ = 0.1µ.
For the final simulation, the insert size of the paired-end reads was uni- formly sampled from a mix of 10 different clone-lengths: {3kbp, 6kbp, .., 24kbp, 27kbp}.
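The N50 haplotype length used above can be computed as follows. This is a minimal sketch of the definition; for simplicity it weights by physical segment length rather than variant count:

```python
def n50_length(segment_lengths):
    # N50: the length X such that segments of length >= X together
    # account for at least half of the total phased length
    total = sum(segment_lengths)
    acc = 0
    for length in sorted(segment_lengths, reverse=True):
        acc += length
        if 2 * acc >= total:
            return length
    return 0

print(n50_length([100, 200, 300, 400, 500]))  # -> 400
```

Here the two longest segments (500 + 400 = 900) already exceed half of the total length (1500/2 = 750), so the N50 is 400.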

5.5 Acknowledgements

Chapter 5, with Appendix D, is currently in submission: A. Bashir, V. Bansal, and V. Bafna, “On design of deep sequencing experiments”. The dissertation author was a primary co-investigator and co-author of this work.

Chapter 6

Reconstructing Genomic Architectures

6.1 Introduction

Identifying specific structural variants within a cancer genome underlies a much larger computational problem: is it possible to reconstruct the entire architecture of a cancer genome? The importance of understanding the sequence and structure of a cancer genome is akin to that underlying the reconstruction of the reference human genome. However, it would be extremely costly to undertake a full de novo assembly of multiple cancer cells. If we instead focus on higher-level questions of architecture (specifically, which intervals of DNA are connected), then it may be possible to utilize comparatively less expensive techniques that incorporate information from a known reference genome sequence. In this chapter, we focus on local optimizations related to correctly matching nonadjacent intervals of DNA that have been joined as a result of genome rearrangements. This is particularly crucial in regions that have undergone complex amplification and/or multiple rearrangements. These optimizations will, hopefully, provide insights into cancer genome architectures and the mechanisms by which cancers evolve.


6.2 Methods

6.2.1 Obtaining an architecture graph

Reconstructing cancer genome architectures with low-coverage sequencing data is a multi-step process. One method is to combine end-sequence profiling (ESP) and array comparative genomic hybridization (aCGH) data [1]. This method requires significantly less sequence coverage than conventional fragment assembly, leading to a more cost-effective solution. The CGH and ESP data provide the following information about genomic sequences:

1. Boundaries for rearrangement, duplication, and deletion events (CGH and ESP)

2. Links between segments in the tumor genome. (ESP only)

3. Copy numbers for each interval of the genome (CGH and ESP; CGH being less expensive when considering an entire genome)

In order to accomplish the reconstruction, we will need to integrate these data sources. First, we must create a compact representation of all genomic segments that are adjacent in the cancer genome. Second, we must correctly traverse this compact representation to obtain the most accurate reconstruction of the tumor genome.

Integrating CGH with ESP - a Min Cost, Max Flow Approach

In order to create a compact representation of linked genomic intervals and return the architecture of the underlying genome, the technique borrows heavily from the fragment assembly literature. We integrate ESP and aCGH data via a MIN-COST, MAX-FLOW approach [117]. Once such a graph is constructed, the tumor genome corresponds to an Eulerian path through it [118, 117].

Some modifications, however, must be made to previous formulations. Though transposable repeats and simple repeats are mostly ignored in our reconstruction, an analog of the “Copy Number Problem” remains in the high number of nearly identical duplicated regions within a cancer genome. Obtaining the coverage necessary to clarify the true copy number would be highly costly via conventional BAC ESP; therefore this information is provided by aCGH. Figure 6.1 illustrates Aerni et al.'s technique for integrating these two data sources [1]. When converting to a flow, the copy numbers of segments from aCGH become the edge values between nodes given by end-sequences. This problem is complicated by low coverage and possible incompleteness or inaccuracies in the data. Array CGH fluorescent ratios can be biased by a number of factors; discordant ESP reads can result from either chimeric clones or experimental variance in BAC sizes. We are therefore confronted with both erroneous data points and “missing links” between segments. To account for this, a single source/sink node is created in the network flow, with edges to each node in the graph; any excess flow can be sent to this node. Note that a naive max flow would permit all weight to be sent down these edges, giving no added insight into the architecture. We must therefore penalize paths that send an excess amount of flow into the source/sink node, so a cost is assigned to any flow going from a node in the graph to the source/sink. The problem now becomes a MIN-COST, MAX-FLOW: we seek the architecture that minimizes the cost of exiting to a source or sink (thus optimizing flow between valid nodes). When traversing the graph, each path (from source, through the graph, and back to the sink) can be considered a chromosome, an episome, or some other contiguous unit of DNA.

6.2.2 Retrieving Optimal Eulerian Paths

Assuming that the flow returns valid links between genomic segments, it should be possible to return a putative genomic architecture.

Figure 6.1: Example of integrating CGH and ESP data into a flow problem. (Top) In the first step, a segmented genome is generated, with copy numbers given by aCGH; ESP is used to create edges joining the segments. (Middle) In the next step, lower and upper bounds are placed on the CGH estimates. (Bottom) Lastly, this information is reorganized into a directed graph. [125]

First, each edge of the flow needs to be converted into multiple edges corresponding, in number, to the weight of the flow across that edge. Thus, an edge with weight 2 would yield 2 new edges across the same two vertices. An Eulerian path now exists for the graph. A traversal of any such Eulerian path provides a genomic architecture that is consistent with the adjacencies and the copy numbers provided by ESP and aCGH, respectively. However, many valid Eulerian paths would exist in a genome that has undergone sufficient duplication events, such as most cancer genomes; in such cases, there is not sufficient information to reconstruct the true architecture. Given the relative rarity of rearrangements, the most parsimonious such path (i.e., the one which requires the minimal number of genomic events from the reference genome) should be a reasonable approximation to the true architecture. Such a path could be constructed by enumerating all Eulerian paths and computing their rearrangement distance from the original genome [52]. However, computing all possible Eulerian paths would be prohibitively slow¹. Additionally, Moret et al. have shown that exactly calculating the minimal number of events needed from a reference genome is also computationally hard. Therefore, we propose the design of a more efficient, heuristic strategy for parsimonious path reconstruction.
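The expansion of flow weights into parallel edges, followed by an Eulerian traversal, can be sketched as follows (Hierholzer's algorithm). The toy graph and names are ours, and we assume a valid flow so that an Eulerian path from the source exists:

```python
from collections import defaultdict

def eulerian_path(flow_edges, start):
    # expand each weighted edge (u, v, w) into w parallel edges,
    # then traverse an Eulerian path with Hierholzer's algorithm
    adj = defaultdict(list)
    for u, v, w in flow_edges:
        for _ in range(w):
            adj[u].append(v)
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if adj[u]:
            stack.append(adj[u].pop())   # follow and consume one edge
        else:
            path.append(stack.pop())     # dead end: emit vertex
    return path[::-1]

# toy architecture: segment A->B has copy number 2 (one duplication)
edges = [("src", "A", 1), ("A", "B", 2), ("B", "A", 1), ("B", "snk", 1)]
print(eulerian_path(edges, "src"))       # ['src', 'A', 'B', 'A', 'B', 'snk']
```

The returned walk uses every expanded edge exactly once, so segment multiplicities in the output match the aCGH copy numbers encoded in the edge weights.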

Local Optimization - Min-Edge Flow Problem

Some degree of local optimization on this Eulerian graph could benefit the global reconstruction. If multiple segments, s, could be joined together as units, then each occurrence of those segments together would require only a single duplication event, rather than s duplication events, to explain it. This principle motivates our approach. For each vertex, v, in the graph, we have a set of incoming edges (corresponding to the set of input vertices, P) and a set of outgoing edges (corresponding to the set of output vertices, Q). Our goal is to group together 3 vertices: one vertex from the set of input edges, pi; the vertex itself, v; and one vertex from the set of output edges,

¹In the worst case, O(V!).

qj. We will refer to this grouping as (pi, qj). Intuitively, reducing the number of unique groupings at each vertex should result in a net reduction in segments for the graph as a whole, which would imply fewer events².

Therefore, we can formulate the simpler MIN-EDGE FLOW local problem: at each vertex v, minimize the size of the set O of unique pairings {(pi, qj), ...} of input vertices P with output vertices Q. The net weight of the edges out of pi must equal the flow of pi into v (for all i); similarly, the net weight of the edges into qj must equal the flow of v into qj (for all j).

Upper Bound on Minimum Set Size: |O| ≤ |P| + |Q| − 1. We propose that there is an inherent upper bound on the minimum set size, |O| ≤ |P| + |Q| − 1, that will prove informative when we seek to bound our solution. For clarity, the following nomenclature will be used.

N = the total flow across a node
P = the set of input vertices
|P| = the number of unique vertices into the node (the size of P)
Q = the set of output vertices
|Q| = the number of unique vertices out of the node (the size of Q)
O = the set of matchings
|O| = the number of unique pairings (the size of O)

We can put a rough upper bound of |P| + |Q| on the minimum number of unique pairings that can result from routing the flow between |P| input edges, of net flow N, and |Q| output edges, of net flow N. Proof: This upper bound can be obtained in polynomial time, using a simple algorithm:

²Each pi and each qj would need to occur exactly as many times as the respective weights of their edges to v.

Algorithm 1 NAIVEPAIRING(P, Q, O)
1: x = SMALLESTVERTEX(P)
2: y = SMALLESTVERTEX(Q)
3: O.ADDPAIR(x, y)
4: m = MIN(x.flow, y.flow)
5: x.flow = x.flow − m
6: y.flow = y.flow − m
7: if x.flow = 0 then
8: REMOVEVERTEX(x, P)
9: end if
10: if y.flow = 0 then
11: REMOVEVERTEX(y, Q)
12: end if
13: if P = ∅ then
14: return O
15: else
16: NAIVEPAIRING(P, Q, O)
17: end if

Each call is guaranteed to increase |O| by one and decrease |P| + |Q| by at least one, which implies that |O| is at most |P| + |Q|. Since in the last iteration both x and y are exhausted simultaneously, |P| + |Q| decreases by two there, and we obtain the tighter bound |O| ≤ |P| + |Q| − 1.
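An iterative Python rendering of NAIVEPAIRING, with flows represented as dicts (the function name and the small example are ours):

```python
def naive_pairing(p_flows, q_flows):
    # Greedy pairing: repeatedly match the smallest-flow input vertex
    # with the smallest-flow output vertex, routing as much flow as
    # possible along that pair. Assumes total in-flow == total out-flow.
    p, q = dict(p_flows), dict(q_flows)
    pairs = []
    while p:
        x = min(p, key=p.get)
        y = min(q, key=q.get)
        m = min(p[x], q[y])          # taken once, then applied to both sides
        pairs.append((x, y, m))
        p[x] -= m
        q[y] -= m
        if p[x] == 0:
            del p[x]
        if q[y] == 0:
            del q[y]
    return pairs

pairs = naive_pairing({"p1": 3, "p2": 2}, {"q1": 4, "q2": 1})
# here |O| = 3, meeting the bound |P| + |Q| - 1
```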



Optimal matching through pairing edges of equivalent flows.

Theorem 3: Given a graph G with |P| input edges and |Q| output edges in which the flow is equal for some input vertex pi and some output vertex qj, an optimal matching exists which contains an edge between pi and qj whose weight equals their total flow.

Proof: Without changing the problem, one can partition the left side of the matching into pi and Pi′ = P − {pi}, and the right side, similarly, into qj and Qj′ = Q − {qj}. Assume an optimal matching exists which does not contain such a full-flow edge between pi and qj. This implies that there are some number of edges, n1, from pi to some subset of n1 vertices of Qj′, and some number of edges, n2, from some subset of n2 vertices of Pi′ to qj (with equal total flows on the two sides, Fn1 = Fn2). Calling all remaining edges in the graph nr, the number of edges in the optimal matching is n1 + n2 + nr.

Now, let us add in the full-flow edge from pi to qj; this necessitates the removal of the n1 edges from pi to Qj′ and the n2 edges from Pi′ to qj. As stated earlier, these edges touch n1 vertices of Qj′ and n2 vertices of Pi′. From Section 6.2.2, we know that we can route the remaining flow between these vertices with ≤ n1 + n2 − 1 edges.

We are left with at most 1 + (n1 + n2 − 1) + nr = n1 + n2 + nr edges. Therefore, joining pi to qj yields an output set equivalent in size to an optimal pairing. □

Tighter Upper Bound on Minimum Set Size: |O| ≤ |P| + |Q| − C. It follows from this that we can reduce the size of the output set by finding paired subsets P′ and Q′ of P and Q whose summed flows are equal. Let C be the number of such pairings (in the trivial case where P′ = P and Q′ = Q, C = 1); our upper bound now becomes |O| ≤ |P| + |Q| − C. If a subset cannot be further split into equal-flow pairings, can we show that |O′| = |P′| + |Q′| − 1?

Let us first reformulate the graph. If we create an edge P′i − Q′j between a vertex P′i of P′ and a vertex Q′j of Q′ for every corresponding element O′k in the resulting output set, then minimizing the size of O′ is equivalent to minimizing the number of edges. Furthermore, all valid solutions must form a single connected component, since our subsets cannot be split any further: if multiple connected components existed, each component would represent an equal-flow pairing, breaking the condition. Since a minimally connected component (a tree) has V − 1 edges (in this case V = |P| + |Q|), it is clear that |P| + |Q| − 1 is a lower bound on the minimum number of edges. Maximizing the number of equal-flow pairings (components), C, gives us our minimal output set, |Omin| = |P| + |Q| − Cmax.

Reduction to NP-Completeness. We claim that the local MIN-EDGE FLOW problem is NP-complete. Proof: Given the previous section, our question can be reformulated: what is the maximal number of disjoint subset pairings, C, as described above, given P and Q?

The SUBSET-SUM problem is as follows: given S = {s1, s2, ..., sn} and some integer B, does there exist a subset S′ whose elements sum to B? The reduction begins with this generic instance of SUBSET-SUM. Let M be the sum of all elements in S. We construct a graph G which admits a flow with 2 components (or, equivalently, n edges) if and only if S contains a subset summing to B.

Let G be a bipartite graph with n input vertices, P = {s1, s2, ..., sn}, and 2 output vertices, Q = {B, M − B}. We claim that a subset summing to B exists if and only if there is a flow with exactly n edges, or alternatively with 2 components. □

Though this problem is NP-complete, in practical applications n is typically quite small, and an MILP formulation (in which all edges with equal flows are paired together, as described earlier) has proved sufficient to solve the partitions encountered in typical data quickly and exactly.
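The reduction can be exercised directly on small instances: deciding whether the constructed flow instance admits two components is exactly deciding SUBSET-SUM. A brute-force sketch (exponential, which is acceptable only because n is tiny; function name is ours):

```python
from itertools import combinations

def has_two_component_flow(S, B):
    # Inputs P = S, outputs Q = {B, M - B}. A flow with 2 components
    # (equivalently, exactly n edges) exists iff some proper subset of
    # S sums to B, i.e. iff SUBSET-SUM answers 'yes'.
    M = sum(S)
    if B == 0 or B == M:
        return True                  # degenerate split
    return any(sum(sub) == B
               for r in range(1, len(S))
               for sub in combinations(S, r))

print(has_two_component_flow([3, 5, 2, 7], 9))   # True  (2 + 7 = 9)
print(has_two_component_flow([3, 5, 2, 7], 4))   # False
```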

6.3 Discussion

Here we described a first step toward resolving paths in complicated cancer genome architectures. It is important to note that even within the local problem addressed above, many uncertainties remain. For example, it is possible (and highly likely around vertices with low edge weights) that many pairings are equally optimal under the described metric. How does one select between such solutions? A local heuristic proves insufficient to distinguish such cases, forcing the consideration of all possible valid groupings. Similarly, how does one optimally chain together the resulting pairings induced across nodes? Both issues can be partially addressed if the underlying architecture graph has edges of varying length. In the context of ESP or PEM, this corresponds to a mix of clone (or insert) libraries drawn from a range of length distributions. This enables more refined scaffolds from which to extend optimal pairings. Moreover, a variety of clone lengths would help resolve ambiguous pairings across a given node that may not be ambiguous across other (related) nodes. High-throughput sequencing could be used to supplement (and perhaps replace) array CGH copy number predictions. Though array CGH is still more cost-effective, high-throughput sequencing potentially allows more complete sampling of the underlying genome. Copy number estimation could be done at a more precise level, especially in regions not covered by array probes. This would open up the possibility of localized de novo assembly of non-reference sequences near breakpoints that may otherwise complicate architecture assembly. Rigorous structuring of sequencing studies will be of paramount importance in reconstructing complexly rearranged cancer architectures.

Chapter 7

Evidence for Large Inversion Polymorphisms in the Human Genome from HapMap data

7.1 Introduction

Large-scale structural changes such as deletions, duplications, inversions and translocations of genomic segments are known to be associated with susceptibility to disease [77, 111, 84]. However, until recently, it was generally believed that genetic differences between two individuals were largely mediated by small-scale changes such as Single Nucleotide Polymorphisms (SNPs) and meiotic recombination, and that structural variants represent a relatively small fraction of genetic diversity. Indeed, the International HapMap project [162] was set up to catalog human genetic diversity based on SNPs alone. The first phase of this project has recently been completed, producing a haplotype map of about one million common SNPs [3]. In this context, it is somewhat surprising that evidence is now mounting for the presence of large-scale variants in the human genome, collectively referred to as ‘structural variants’. High-throughput experimental techniques based on comparative genomic hybridization have enabled the discovery of hundreds of copy number polymorphisms in human individuals [144, 58, 145]. The HapMap genotype data has also been used to discover insertion/deletion polymorphisms [28, 56, 89]. In sharp contrast, knowledge about the location and genome-wide extent of inversion polymorphisms has not accumulated at the same pace, primarily due to the lack of a high-throughput technique for detecting inversions. Methods for assaying an inversion in a population are also limited [165]. Very few examples are known of large inversion polymorphisms whose frequency has been characterized in human populations. A notable one is the recently discovered 900kb-long common inversion polymorphism on 17q21.31 [154]. The inverted orientation had a frequency of 21% in Europeans but was rare in individuals of African (6%) and Asian (1%) origin. The inverted haplotype was dated to be about 3 Myr old but shows little evidence of recombination, leading to a distinct haplotype pattern and extended LD across the region in the CEU population. Interestingly, genotype-phenotype analysis in an Icelandic population showed that women carrying the inverted haplotype had more children than those who did not, providing direct evidence that the inverted arrangement is under some form of selection. A recent study [166] mapped fosmid paired-end sequence data from a fosmid DNA library of a North American female (not represented in the reference human genome assembly) to the reference human assembly. Fosmids that showed discrepancy by size were indicative of deletions/insertions between the two genomes, while fosmids whose ends mapped to the same strand of the reference genome (discrepancy by orientation) pointed to potential inversions. This strategy revealed 56 putative inversion breakpoints, in addition to 139 insertions and 102 deletions.
Although the method is effective in determining inversions, it requires extensive re-sequencing in a population of individuals to fully determine the extent and frequency of these polymorphisms. An indirect approach that has been adopted for finding inversion polymorphisms is to test human-chimp inversions for polymorphism in humans using FISH and PCR analysis [43, 157]. Of the 23 regions tested [43], 3 were found to be polymorphic in humans, the largest being an 800kb inversion on chromosome 7.

In this paper, we use the recently generated high-density SNP genotype data from the International HapMap project [3] to detect large inversion polymorphisms in the human genome. Unlike deletions, which cause miscalled genotypes and can lead to Mendelian inconsistencies [89, 28], inversions do not produce any aberrant SNP genotypes. Our method is based on the detection of an unusual Linkage Disequilibrium pattern that is indicative of inversions for which the inverted orientation (w.r.t. the reference human genome sequence) is present in a majority of chromosomes in a population. The method can also detect large orientation errors in the human sequence assembly. Using simulations, we show that our method has statistical power to detect such inversions. We have applied our method to data from the first phase of the International HapMap project to generate a list of 176 candidate inversions in the three HapMap ‘analysis panels’ (CEU, YRI and CHB+JPT). Although it is difficult to estimate how many of these represent true inversions, a crude estimate of the false positive rate using coalescent simulations indicates that about half of the 78 predictions in the YRI ‘analysis panel’ represent true inversions. The false positive rate could be higher (about 80%) for the inversions in the CHB+JPT ‘analysis panel’, according to a conservative assessment. Even with the high false positive rates, our method is a cost-effective approach to discovering inversion polymorphisms. We have looked for supporting evidence for our predicted inversions in the form of discordant fosmid pairs, assembly discrepancies, and the presence of a pair of inverted repeats near inversion breakpoints. This has resulted in a smaller list of 15 inversions, two of which represent previously known inversions.

7.2 Results

We utilized the genome-wide SNP data from Phase I of the International HapMap project, consisting of genotypes of 269 DNA samples at approximately 1 million SNPs. These samples comprise 90 CEPH individuals (30 parent-child trios) from Utah, USA (abbreviated CEU), 90 Yoruban individuals (30 trios) from Ibadan, Nigeria (YRI), 44 unrelated individuals from Tokyo, Japan (JPT) and 45 Han Chinese individuals from

Beijing, China (CHB). We combined the individuals from the JPT and CHB populations to obtain a larger set of 89 individuals (referred to as the CHB+JPT ‘analysis panel’). For the CEU and YRI ‘analysis panels’, our data consisted of 120 chromosomes each (from the 60 parent individuals). We used the phased haplotype data (downloaded from the HapMap website), which was computationally phased using the program Phase 2 [3, 155]. All our analysis was done separately on the three ‘analysis panels’ (CEU, YRI, CHB+JPT).

7.2.1 Overview of Method

Genetic maps are constructed by genotyping a large number of genetic markers in a pedigree and determining the physical order of the markers through estimates of the recombination fraction between them. On the other hand, the human genome assembly represents the genomic sequence of a few individual(s), and a second, possibly different, ordering of the markers (SNPs, for example) can be determined by mapping the sequence flanking the markers to this reference. Recently, a high-resolution genetic map was constructed using pedigree data from an Icelandic population [74]. Comparison of the genetic map to the reference sequence revealed several regions where the ordering of the genetic markers was in opposite orientation to that suggested by the reference sequence. Given the incomplete nature of the draft human sequence at that time, the sequence was modified in the regions where the genetic map strongly indicated a different marker order. The possibility that some of these discrepancies are a result of an inversion polymorphism in the particular region cannot be discounted. For example, if the human sequence represents the minor allele in a particular region of the human genome which has two orientations, one would expect the ordering of the markers (inside the inverted segment) in the genetic map to be consistent with that of the major allele, and hence opposite to that of the sequence. In fact, this is true for a 4.5-megabase-long inversion on chromosome 8, where the reference human sequence represents the minor allele (frequency 20-30% in human populations) and the genetic
Consider two bi-allelic loci L1, L2 with alleles A/a and B/b respectively. If the loci are sufficiently distant, there is almost no correlation between the markers, and the joint probability of seeing alleles A, B is simply P1,2(A, B) = P1(A)P2(B). However, if the loci are physically close, then

|P1,2(A, B) − P1(A)P2(B)| ≫ 0

LD is typically measured by a normalization of |P1,2(A, B) − P1(A)P2(B)| (see [171] for a review of LD measures, including D′ and r2). In human population data, significant LD is observed at close distances and little or no LD is observed at long distances. This correlation of LD with distance is very noisy, largely due to the fine-scale heterogeneity in recombination rates in the human genome [91, 29, 100]. Although it may not be possible to determine a physical ordering of SNPs using LD alone, it is possible to use LD to distinguish SNPs that are physically close from SNPs that are physically distant. Our method utilizes high density SNP genotype data to find regions of the human genome where the ordering of the SNPs suggested by Linkage Disequilibrium patterns is opposite to that of the physical sequence. Consider a genomic region that is inverted (w.r.t. the reference sequence) in a majority of the chromosomes in a population, and assume that we have genotyped markers on either side of the two breakpoints. For a graphical illustration, see Figure 7.1.

In such a scenario, we would expect to see unusually high long-range LD (LD13 and LD24), higher than would be expected between markers that are physically distant. Further, one would also observe low LD (LD12 and LD34) between pairs of markers that are physically close according to the reference sequence. The strength of this effect will be proportional to the frequency of the inverted allele. Our statistic is designed to search for pairs of breakpoints showing this kind of signal.
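As a concrete illustration of the quantity behind this signal, the sketch below computes the normalized LD measure D′ for a pair of bi-allelic SNPs from phased haplotypes. This is a minimal rendering of the standard definition; the function name and input layout are ours, not code from this dissertation.

```python
def d_prime(haps, i, j):
    """Normalized LD (D') between bi-allelic sites i and j of phased
    haplotypes.  `haps` is a list of equal-length haplotype strings."""
    n = len(haps)
    A = sorted({h[i] for h in haps})[0]          # pick one allele per site
    B = sorted({h[j] for h in haps})[0]
    pA = sum(h[i] == A for h in haps) / n
    pB = sum(h[j] == B for h in haps) / n
    pAB = sum(h[i] == A and h[j] == B for h in haps) / n
    D = pAB - pA * pB                            # raw disequilibrium
    if D >= 0:                                   # maximum |D| given allele frequencies
        dmax = min(pA * (1 - pB), (1 - pA) * pB)
    else:
        dmax = min(pA * pB, (1 - pA) * (1 - pB))
    return abs(D) / dmax if dmax > 0 else 0.0
```

For haplotypes in complete association (e.g. ['AB', 'AB', 'ab', 'ab']) this returns 1.0; for independent sites it returns 0.0.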

[Figure 7.1 appears here: a schematic of the reference sequence, a genotyped population carrying the inversion, the mapped fragments, and the resulting polymorphism data for markers 1-4, with example values D′ = 1 and D′ = 0.17.]
Figure 7.1: Unusual Linkage Disequilibrium observed in SNP data when the inverted haplotype (w.r.t. the reference sequence) has very high frequency. SNPs are ‘mapped’ to the reference sequence using the flanking sequence (denoted by shaded boxes). Therefore, close SNPs in high LD are mapped to distant regions 1 and 3 (the red boxes). Consequently, the two regions show unusually high LD for that distance.

Most measures of LD are defined for a pair of bi-allelic sites and have high variance. We are interested in assessing the strength of association between blocks of SNPs across the inversion breakpoints. Therefore, we use the multi-allelic version of the LD measure D′ [80, 53], treating a block of SNPs as a multi-allelic marker. Blocks contain a fixed number of SNPs that are located within a certain distance on the chromosome (see Methods). To obtain an empirical probability distribution (φd) of the LD between two blocks at a fixed distance d, we compute the D′ measure between all pairs of blocks that are distance d apart. For every pair of putative inversion breakpoints, we use the four LD values (LD12, LD13, LD24 and LD34) and the LD probability distributions to compute a pair of log likelihood ratios, one for each breakpoint. These log likelihood ratios represent the ratio of the probability of the region being inverted versus being non-inverted in the population. Using a permutation method, we compute a p-value which represents the probability of the log likelihood ratios achieving a high value by chance. We use this p-value as our statistic for evidence of inversion (see Methods for details).
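The block-level measure just described can be sketched as follows: each block's haplotype is treated as one allele of a multi-allelic marker, and a frequency-weighted D′ is accumulated over all allele pairs. This follows a common form of the multi-allelic extension of D′; it is an illustrative sketch, not the dissertation's implementation.

```python
from collections import Counter

def block_d_prime(block_a, block_b):
    """Multi-allelic D' between two SNP blocks.  block_a and block_b are
    equal-length lists giving, per chromosome, the haplotype of each block
    (e.g. 'ACG'); each distinct block haplotype acts as one allele."""
    n = len(block_a)
    pa, pb = Counter(block_a), Counter(block_b)
    pab = Counter(zip(block_a, block_b))
    total = 0.0
    for hap_a, ca in pa.items():
        for hap_b, cb in pb.items():
            pA, pB = ca / n, cb / n
            D = pab[(hap_a, hap_b)] / n - pA * pB
            if D >= 0:
                dmax = min(pA * (1 - pB), (1 - pA) * pB)
            else:
                dmax = min(pA * pB, (1 - pA) * (1 - pB))
            if dmax > 0:
                total += pA * pB * abs(D) / dmax  # weight each allele pair by pA*pB
    return total
```

Perfectly coupled blocks give a value of 1.0, while independent blocks give 0.0.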

7.2.2 Power to detect Inversion Polymorphisms

Our statistic is suited to detect long inversions (long enough that little or no long-range LD is expected) for which the inverted orientation (w.r.t. the reference sequence) is the major allele. Many factors influence the power of our statistic, including background recombination rates, the length of the inversion, and the frequency of the inverted haplotype. We used simulations to assess how these factors affect the power of our statistic. Currently, only a few instances of inversion polymorphisms are known, and existing work on simulating population data incorporating the effect of inversion polymorphisms is of a theoretical nature based on Drosophila inversion polymorphisms [101]. Therefore, we adopted a simple strategy to simulate inversions of varying frequency using haplotype data from the HapMap project (see Methods). As our simulations were over real data with high variation in recombination rates, we effectively average over the effect of recombination rate variation. Figure 7.2(a) describes the power of the statistic to detect inversions as a function of the frequency of the inverted allele (f), keeping the length fixed at 500 kb, for the three HapMap ‘analysis panels’. Power is measured as the fraction of simulated inversions in which the inverted region was detected with a p-value less than a fixed cutoff (0.02). Figure 7.2(b) describes the power for different lengths of the inverted region. The results indicate that the power of the method is low for small inversions (0.45 for inversions of length 100kb) and increases with increasing length, saturating around 500kb. Although the simulations cannot completely capture the effect of an inversion on LD patterns, they suggest that our method has good statistical power to detect long inversions segregating at high frequency in a population. They also indicate that the power is highest in the YRI ‘analysis panel’ (see Figure 7.2(a)). We show later, through independent assessment of the false-positive rate of our predicted inversions on the HapMap data, that the error rate is lowest for the YRI ‘analysis panel’.


Figure 7.2: (a) Power of our method to detect inversion polymorphisms in the three HapMap ‘analysis panels’. Inversions of varying frequency (100% to 50%) and fixed length (500 kb) were simulated using the HapMap data for the three ‘analysis panels’ separately. The y-axis represents the fraction of simulated inversions for which there was at least one pair of predicted breakpoints with p-value ≤ 0.02 matching the breakpoints of the simulated inversion. (b) Power to detect inversions of four different lengths in the YRI ‘analysis panel’.
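One simple way to plant such an inversion into real phased data, along the lines of the simulation strategy sketched above, is to reverse the marker order of the chosen segment on a random fraction f of the chromosomes. This is our illustrative rendering; the dissertation's exact procedure is described in its Methods.

```python
import random

def plant_inversion(haplotypes, start, end, f, seed=0):
    """Return a copy of `haplotypes` (allele strings, one per chromosome) in
    which the SNP order within [start, end) has been reversed on a random
    fraction f of chromosomes, mimicking an inverted allele of frequency f."""
    rng = random.Random(seed)
    out = []
    for hap in haplotypes:
        if rng.random() < f:
            hap = hap[:start] + hap[start:end][::-1] + hap[end:]  # inverted allele
        out.append(hap)
    return out
```

With f = 1 every chromosome carries the inverted segment; with f = 0 the data are returned unchanged.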

7.2.3 Scanning the HapMap data for inversion polymorphisms

We searched the phased haplotype data from each of the three HapMap ‘analysis panels’ individually using our statistic to determine sites of inversion. To reduce the number of false positives, we considered predicted inversions with length in the range 200kb-4Mb. After clustering and filtering the initial list of predicted inversions for each ‘analysis panel’ separately (see Methods), we had a total of 176 putative inversions in the three HapMap ‘analysis panels’ with a p-value of 0.02 or less (see Figure 7.8 for the p-value distribution of these predictions). Of these, 26 were detected in the CEU ‘analysis panel’, 78 in the YRI ‘analysis panel’ and 72 in the CHB+JPT ‘analysis panel’. Most of the predicted inversions were unique to one of the ‘analysis panels’, but three regions were predicted in two ‘analysis panels’ each. The predicted list includes two sites of known inversion polymorphisms: an 800kb inversion polymorphism on 7p22.1 and a 1.1 megabase long inversion on chromosome 16p12.2. The 800 kb inversion at 7p22 was identified previously [43] using interphase FISH, with 2/20 CEPH individuals found to be heterozygous for the inversion. Our method gave a signal for this region in the YRI ‘analysis panel’ matching the known breakpoints (p-value of 0.012). The breakpoints of this inversion were previously identified to a resolution of about 200kb [43]; for one of them, our method narrows the location down to a region of length 45kb. The chromosome 16 inversion was identified through the analysis of discordant fosmid pairs [166]. Interestingly, we detected this inversion in both the CEU (p-value 0.008) and YRI ‘analysis panels’ (p-value 0.018) with an identical pair of breakpoints (see Table 7.1). Analysis of the sequence around the breakpoints revealed the presence of a pair of long, highly homologous inverted repeats (see Figure 7.3). The current list of inversion polymorphisms in the human genome is small, with only about 15 inversions larger than 200kb known to be polymorphic in normal humans (from the Genome Variation Database at http://projects.tcag.ca/variation/). We looked for additional evidence that would support some of our predicted inversions. As noted earlier, sequence from different individuals (in the form of fosmid end pair sequences) can be mapped to the reference sequence to identify inverted regions [166]. Another source of evidence comes from comparing the two human sequence assemblies produced by the International Human Genome Sequencing Consortium [61] and Celera Genomics [168].
Regions that are inverted in orientation between the two assemblies represent either sites of assembly error in one of the two assemblies or polymorphic inversions, since the assemblies were generated using different sets of individuals. The Celera whole genome shotgun assembly [62] was aligned to the reference sequence assembly (Build 34) to discover such regions (B. Walenz, pers. comm.). If the orientation of the Celera assembly supports a predicted inversion, then it is highly likely that the inverted

Table 7.1: List of predicted inversions for which there is some form of evidence supporting the inverted orientation. All genomic coordinates are based on Build 34 of the human genome assembly. Human-Chimp inversions are regions that are inverted between the human and chimpanzee genomes [105, 43]. Inverted repeats imply the presence of a pair of low-copy, highly homologous sequences, one near each breakpoint.

chr  left bp (kb)     right bp (kb)    sample   p-value  evidence
16   21,279...587     22,356...643     CEU      0.008    inverted repeats; Inv: 21,544...22,654; Human-Chimp: 21,504...22,723
16   21,279...557     22,300...682     YRI      0.018    Inv: 21,559...22,645
7    5,610...783      6,632...677      YRI      0.012    known inversion polymorphism; inverted repeats (left bp: 5,608...776; right bp: 6,495...735); Human-Chimp: 5,766...6,565
10   6,233...240      7,432...517      CHB+JPT  0.006    2 discordant fosmids
10   46,378...457     48,046...735     YRI      0.014    inverted repeats: 46,512...47,057
10   45,506...6,453   48,029...821     CHB+JPT  0.018
13   112,272...388    112,556...677    CEU      0.0001   inverted between Build 34 & Celera (112,373...558); both bps span gaps
13   112,266...379    112,554...665    CHB+JPT  0.005
2    1,527...654      4,565...681      YRI      0.005    inverted between Build 34 & Celera (1,627...3,044; 1,527...3,040); both bps span gaps
1    143,737...778    146,942...7,113  YRI      0.005    inverted repeats: 143,185...143,723, 143,862...145,914; Human-Chimp: 142,424...146,586
2    132,383...388    132,629...654    YRI      0.005    inverted repeats: 131,015...132,518; Human-Chimp: 130,908...132,285
5    177,155...499    180,364...571    YRI      0.015    inverted repeats: 175,531...177,204, 177,301...532
7    148,971...9,270  152,105...161    YRI      0.007    inverted repeats: 149,120...153,113
9    87,120...291     87,772...886     CHB+JPT  0.0167   inverted repeats: 87,810...878
11   48,607...841     50,765...51,208  YRI      0.007    inverted repeats: 49,337...793, 49,831...49,871
12   124,704...711    126,031...044    CEU      0.0092   inverted repeats
19   19,826...871     20,331...356     YRI      0.016    inverted repeats
19   49,122...128     49,564...605     YRI      0.016    inverted repeats

[Figure 7.3 appears here: schematic of the 16p12 region (21.2-22.6 Mb) showing a pair of inverted repeats; CEU inversion breakpoints 21.279...587 and 22.356...643 (inversion: 21.544...22.654); YRI inversion breakpoints 21.279...557 and 22.300...682 (inversion: 21.559...22.645); Human-Chimpanzee inversion: 21.504...22.723; disease genes NM_144672, NM_001014444, NM_001888, NM_170664.]
Figure 7.3: Genomic overview of a 1.4 Mb region at 16p12 predicted to have an inversion in both the CEU and YRI ‘analysis panels’. The left predicted breakpoint (the dotted line) overlaps with a ≈ 80kb long segment that is highly homologous to a segment (in the inverted orientation) near the other breakpoint. The region contains several disease-related genes (from the OMIM database).

orientation is present in the population. One of our predictions was supported by two fosmid pair sequences discordant by orientation [166]. This ≈ 1.2Mb inversion on chromosome 10 (p15.1-p14) was predicted in the CHB+JPT ‘analysis panel’ with a p-value of 0.005. The left end of the fosmid pair mapped in the reference assembly about 40kb before the predicted left breakpoint, while the right end mapped just before the right breakpoint (see Figure 7.4). Since the insert size of fosmids ranges between 32 and 48 kb, the two discordant fosmids are consistent with the predicted breakpoints. There were no gaps in the genome assembly near the breakpoints, and there were fosmids and BACs consistent with the reference assembly (UCSC Human Genome Browser: http://genome.ucsc.edu). This suggests that the inversion represents a previously unknown inversion polymorphism. There were two regions for which we obtained evidence for the inverted orientation from the Celera assembly. One of these is a ∼ 200 kb long region on chromosome 13 which was predicted to be inverted in both the CEU and CHB+JPT ‘analysis panels’. The region is also present in the inverted orientation in the Celera assembly, and both breakpoints span large gaps (100kb) in the sequence assembly. Another large predicted inversion on chromosome 2p25 overlaps with a 1.4Mb region that is inverted between the two recent human genome assemblies (Build 34 and 35). The orientation of the Celera assembly of the human genome is concordant with the Build 35 assembly for this 1.4Mb region. There are gaps at each breakpoint which are not spanned by fosmids, indicating that it is difficult to determine the correct orientation. This region was tested for polymorphism in an ‘analysis panel’ of 10 CEPH individuals [43] but was not found to be polymorphic. A 2Mb long inversion on chromosome 10q11 was predicted in both the YRI and CHB+JPT ‘analysis panels’. Further, both breakpoints for this region span gaps in the human sequence assembly, suggesting that this could represent an assembly orientation error. Two segments in this region are inverted between the Celera sequence assembly and the public assembly. Analysis of the genomic sequence around the breakpoints revealed the presence of several hundred kb long inverted repeats of very high sequence similarity.

[Figure 7.4 appears here: the 6.1-7.5 Mb region of chromosome 10, with the two overlapping CHB+JPT predicted inversions (6.233...240 Mb to 7.432...517 Mb, and 6.274...280 Mb to 7.152...172 Mb) and the discordant fosmid pairs.]

Figure 7.4: Overview of a ≈ 1.2 Mb long inversion on chromosome 10 predicted in the CHB+JPT ‘analysis panel’. Also shown are two discordant fosmid pairs; for each, one end maps before the predicted left breakpoint and the other end maps to a region before the right breakpoint. Both ends of each fosmid also map to the same (+) strand, which indicates that the segment in the middle (blue) is inverted. There is another overlapping inversion predicted in this region in the CHB+JPT ‘analysis panel’. The region has several genes proximal to the left breakpoint, one of which is known to be over-expressed in tumor cells [137].
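The geometric check applied to these fosmids can be made explicit. For a same-strand (‘discordant by orientation’) fosmid pair with one end at x before the left breakpoint and the other at y inside the region, undoing a putative inversion with breakpoints (left_bp, right_bp) maps y to left_bp + (right_bp − y), so the implied insert size must land in the 32-48 kb fosmid range. This formulation and the function below are our illustration, not the dissertation's code.

```python
def fosmid_supports_inversion(x, y, left_bp, right_bp,
                              min_insert=32_000, max_insert=48_000):
    """True if a same-strand fosmid end pair (ends at x and y, with x before
    the left breakpoint and y inside the putative inverted segment) is
    consistent with an inversion with breakpoints left_bp and right_bp."""
    if not (x < left_bp <= y <= right_bp):
        return False
    # position of y before the inversion took place
    implied_insert = (left_bp + (right_bp - y)) - x
    return min_insert <= implied_insert <= max_insert
```

An end 40 kb before the left breakpoint paired with an end 5 kb before the right breakpoint implies a 45 kb insert, inside the fosmid range — matching the configuration in Figure 7.4.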
Many of our predicted inversions overlap with regions that are inverted between the human and chimpanzee genomes [105, 43] (see Table 7.1 for a list). One of these is the 800kb inversion on chromosome 7 that was tested for polymorphism in humans because it was found to be inverted between the human and chimpanzee sequences [43].

7.2.4 Sequence Analysis of Inversion Breakpoints

Segmental duplications have been shown to be highly overrepresented near sites of structural variation in the human genome [58, 166]. Mechanisms have been proposed whereby a pair of low-copy inverted repeats may mediate inversion events in the genome [47, 84, 146]. Pairs of inverted repeats have also been detected near the inversion breakpoints of several known inversion polymorphisms [156, 43]. We checked for the presence of pairs of low-copy homologous repeats near the breakpoints of our predicted inversions. We found that 18 of our predicted inversions had pairs of highly homologous repetitive sequences near the breakpoints. There were 11 distinct regions for which there were inverted repeats near the breakpoints (listed in Table 7.1; some of these regions correspond to two predicted inversions). The significance of finding inverted repeats near the inversion breakpoints was estimated using a simple empirical method (see Methods); the p-value was estimated to be 0.006. A complete list of all pairs of low-copy repeats near breakpoints is given in Supplementary File S2. Many examples of apparently benign chromosomal deletions, in many cases deleting entire genes, have recently been reported in the HapMap ‘analysis panels’ [28, 89]. Less is known about inversions that affect genes by truncating the coding sequence in normal human individuals. Recurrent inversions disrupting the factor VIII gene on the X chromosome are known to be a common cause of severe hemophilia A [77, 34, 9]. We analyzed the sequence around inversion breakpoints to see if they overlap with known genes in the human genome. The resolution of our predicted inversion breakpoints varies from a few kilobases in some cases to several hundred kilobases in others, making it difficult to say with certainty whether an inversion actually affects some gene. Assuming that purifying selection acts on inversions disrupting genes, one would expect an under-representation of inversion breakpoints disrupting genes. We found that 66 of our predicted inversion breakpoints are completely covered by one or more genes (for 6 inversions, both breakpoints are spanned by genes); for a list of these genes, see Supplementary File S3. This is significantly less than what one would expect by chance (p-value of 0.02). Many of the genes that intersect with breakpoints are previously known to be disrupted in diseases. The T-cell lymphoma breakpoint-associated target 1 (TCBA1) gene spans a genomic region of over 900kb on chromosome 6 and has multiple splice isoforms as well as alternative start sites. As the name suggests, the gene is structurally disrupted in T-cell lymphoma cell lines [158] and developmental disorders [178]. A sketch of the previously mapped breakpoints and our predicted inversion breakpoints with respect to the known isoforms of the gene is shown in Figure 7.5.

[Figure 7.5 appears here: the TCBA1 locus, with the predicted YRI inversion breakpoints at 124,336...341kb and 124,628...634kb, isoforms gb:AB070452 and gb:BC035062, and the breakpoints mapped in the ATN-1 and HT-1 cell lines.]

Figure 7.5: A predicted YRI inversion polymorphism on chromosome 6 overlaps with the TCBA1 gene. The dashed lines describe the locations of the predicted breakpoints. The previously mapped breakpoints of the gene in T-cell lymphoma/leukemia cell lines are shown by the blue lines.

We also detect a number of disrupted genes with alternative splice forms, with some of the splice isoforms consistent with the inversion breakpoint. An interesting example is the islet cell antigen (ICAp69) gene, which is a target self-antigen in type 1 diabetes. The gene is known to have multiple isoforms [45]. As shown in Figure 7.6, a predicted inversion breakpoint on chromosome 7 removes the 3’ end of the gene (gb:BC008640), approximately consistent with the expression of alternative splice forms (gb:BC005922, U38260). These and many other examples hint at the important role of structural variation in mediating gene diversity.

[Figure 7.6 appears here: the ICAp69 locus, with the predicted YRI inversion breakpoint at 7.944..985 and isoforms gb:BC008640, gb:BC005922, and gb:U38260.]

Figure 7.6: Splice isoforms of the ICAp69 gene that are approximately consistent with a predicted YRI inversion breakpoint on chromosome 7. The region of the left inversion breakpoint is denoted by a dashed line. The exons are not drawn to scale.

7.2.5 Assessing the false positive rate

Several of our predicted inversions represent known inversion polymorphisms, and many others are supported by independent forms of evidence such as fosmid end sequences discordant by orientation, regions inverted between different human assemblies, etc. Given the incomplete nature of our knowledge of inversion polymorphisms in the human genome, this suggests that many of our other top predictions could represent real inversions. Although LD generally decays with increasing distance between markers, it is now well known that there is significant variation in recombination rates across the human genome [91, 100]. This variation in recombination rates could potentially produce false positives under our statistic. Therefore, it is useful to estimate how many of our predicted inversions are correct. Estimating the false positive rate reliably is difficult, given the state of our knowledge. We used coalescent simulations to estimate the frequency of predicted inversions on haplotype data with ‘no inversions’. To incorporate heterogeneity in recombination rates in the simulated data, we used a recently developed coalescent simulation program [139] which can generate population data incorporating variation in recombination rates and a wide range of demographic histories for different populations (see Methods). The program is calibrated to produce haplotype data with considerable variation in LD, such as that seen in real population data. The same thresholds and parameters were used for scanning the simulated datasets with our statistic as for the HapMap data. We analyzed the number of predicted inversions in the simulated data separately for each ‘analysis panel’. Given the small number of predicted inversions in the HapMap data and the many caveats in matching the simulation parameters to the real data, it is difficult to estimate the false positive rate based on a direct comparison.
The number of pairs of breakpoints for which the statistic is computed is huge (≈ 40 million in the YRI ‘analysis panel’) while the number of predicted inversions is small (78 with a p-value of 0.02 or smaller), so one cannot simply compare the ratio of the number of breakpoints examined to the number of predicted inversions between the HapMap and the simulated ‘analysis panels’. Therefore, we use an indirect estimate. For a p-value cut-off π, let γ(π) denote the ratio of the number of predicted regions with a p-value at most π in the HapMap ‘analysis panel’ to the corresponding number in the simulated data. If a lower p-value implies a greater chance of a prediction being real, one would expect γ(π) to increase as π decreases. Note that if the number of true predictions (which is unknown) is small, or if the p-values of the real predictions are not concentrated in the tail of the distribution, it would be difficult to observe an increase in γ(π). For the YRI ‘analysis panel’, γ(π) ranges from 1.73-1.75 for π in the range 0.1-0.06, but increases to γ(0.02) = 2.85 and γ(0.01) = 4.86. For a p-value of 0.02, this represents a 1.7-fold enrichment in the number of predictions in the HapMap data versus the simulated data. Under the assumption that the increase in the number of predictions in the tail of the p-value distribution is a result of true predictions, the false positive rate at a cut-off of 0.02 can be estimated to be ∼ 58%. For the CEU ‘analysis panel’, we did not observe a gradual increase in γ(π), and the number of predictions with p-value smaller than 0.02 is only 26, making it difficult to get a meaningful estimate of the false positive rate via this method. For the CHB+JPT ‘analysis panel’, this method suggests a higher false positive rate of 80% at a cutoff of 0.02.
This could reflect low power of our method to detect true inversion polymorphisms in the CHB+JPT ‘analysis panel’ due to less accurate long range haplotype phasing as compared to the CEU and YRI ‘analysis panels’. Our analysis suggests that the false positive rate is smallest in the YRI ‘analysis panel’ and that about half of the YRI predicted inversions could be real. This is also supported by the fact that the two previously known inversions (that we detect across the 3 HapMap ‘analysis panels’) are detected in the YRI ‘analysis panel’, and about 10 predicted inversions in the YRI ‘analysis panel’ are supported by the presence of inverted repeats. We also looked at the length distribution of the inversions predicted by our statistic in each of the three HapMap ‘analysis panels’ independently. For this we considered inversions with length in the range 200kb-10Mb. For the YRI ‘analysis panel’, the number of predicted inversions drops after 4Mb and remains essentially constant after that (see Figure 7.7). The number of predicted inversions with length in the range 1-4 Mb is 30, while the number in the range 4-8 Mb is only 10. In contrast, for the CHB+JPT ‘analysis panel’, the numbers are 62 (in the range 1-4 Mb) and 51 (in the range 4-8 Mb). These results indicate a 3-fold clustering of predicted regions in the smaller length range for the YRI ‘analysis panel’. If most of the predictions were false, one would not expect to see any clustering. The higher clustering in the YRI ‘analysis panel’ versus the CHB+JPT ‘analysis panel’ is consistent with the results of the coalescent simulations, which also predict a smaller false positive rate for the YRI ‘analysis panel’. While the above estimates of the false positive rate are crude, they nevertheless indicate that many of our predictions, especially those in the YRI ‘analysis panel’, are likely to be real.
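The indirect estimate above can be written as a short calculation: the baseline value of γ at loose cutoffs measures how many predictions the no-inversion simulations alone would account for, so the false positive rate at a stringent cutoff is roughly γ_baseline / γ(π). This is our paraphrase of the reasoning above; the numbers used below are the reported YRI values.

```python
def estimated_fp_rate(gamma_at_cutoff, gamma_baseline):
    """Fraction of predictions at a stringent cutoff attributable to the
    no-inversion baseline; the excess of gamma over its baseline is
    credited to true inversions."""
    return min(1.0, gamma_baseline / gamma_at_cutoff)

# YRI panel: gamma(0.02) = 2.85, baseline gamma near 1.73-1.75
fp_yri = estimated_fp_rate(2.85, 1.74)
```

This gives roughly 0.6, consistent with the ∼58% quoted above (the small difference reflects rounding of the reported γ values).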

[Figure 7.7 appears here: a histogram of the number of predicted inversions per length bin, for bins from 200-500 kb up to 9,000-10,000 kb.]

Figure 7.7: Length distribution of predicted inversions in the YRI ‘analysis panel’. For this plot, we consider inversions with length in the range 200kb to 10Mb.

7.3 Discussion

We have presented a statistical method with power to detect large inversion polymorphisms using population data. Our method can also detect large regions where the reference assembly has an erroneous orientation. Applying our method to the HapMap data, we have identified 176 putative inversions in the three HapMap ‘analysis panels’. The false positive rate for the predicted inversions in the YRI sample indicates that ≈ 30 of the 78 YRI predictions could represent real inversions. We have looked for independent evidence for our predicted inversions in the form of discordancies between the NCBI and Celera assemblies, discordant fosmid pairs, and the presence of inverted repeats near inversion breakpoints. We have identified a novel 1.2 Mb long inversion on chromosome 10 that is supported by two discordant fosmid pairs and has not been reported before. For two of our predicted inversions, both breakpoints span gaps in the human reference assembly and the inverted orientation is represented in the Celera genome assembly, indicating orientation errors in the reference assembly. For about 10 regions, the inversion breakpoints are flanked by a pair of highly homologous inverted repeats. A recently proposed method called ‘haplotype fusion’ can assay single haplotypes for the presence of an inversion even when the breakpoints lie within long inverted repeats [165]. The set of predicted inversions flanked by inverted repeats represent ideal candidates for validation using this technique. Our method is designed to detect long inversions for which the inverted allele (w.r.t. the reference sequence orientation) has high frequency in a population. Therefore, it is unlikely to detect inversion polymorphisms for which the inverted allele is the minor variant. However, the allele frequencies of structural polymorphisms can vary significantly across populations.
For 5 of the 10 deletion polymorphisms that were genotyped in the HapMap ‘analysis panels’, the minor allele in one ‘analysis panel’ was the major allele in another ‘analysis panel’ [89]. The availability of data from multiple populations increases the chance of detecting an inversion using our method in the population where the inverted allele is the major variant. Furthermore, in many cases the reference sequence assembly is likely to represent the minor variant in the population. For an 18-kb inversion polymorphism at 7q11 [43], the minor allele (frequency of 30%) was represented in the reference assembly while the major allele matches the orientation of the chimpanzee sequence. Although the method seems robust to variation in recombination rates, it is possible that this heterogeneity and other events can produce a signal under our statistic. One such scenario is where the two breakpoints represent gene conversion hotspots while there is no recombination across the entire region: gene conversion events would reduce short range LD while the absence of recombination would maintain long range associations. From a computational perspective, our method represents a novel strategy for using population data to detect large rearrangements. It is becoming increasingly cost-effective to generate genome-wide SNP genotype data, and our method can be applied to any such data. Other strategies have been suggested for computationally mining SNP data for potential inversions. Inversion polymorphisms have been extensively investigated in Drosophila, and it has been observed that the presence of inversion polymorphisms leads to strong and extended Linkage Disequilibrium across the inverted region, since recombination in inversion heterozygotes is suppressed [5, 104, 102]. This reduces the overall recombination rate in the region and also tends to produce two divergent haplotype clades [103, 5].
The best known example of this effect in the human genome is the 900kb polymorphic inversion on chromosome 17 [154]. However, it remains to be seen whether this pattern holds for all (or most) human inversion polymorphisms. In fact, our analysis of the haplotype patterns of the few known inversion polymorphisms does not indicate that all inversion polymorphisms lead to such distinctive haplotype patterns (unpublished data). Our results also indicate that many large inversion polymorphisms remain to be discovered in the human genome, and it may require extensive re-sequencing in multiple populations to find all such inversions. The presence of a large number of inversion polymorphisms could have major implications for the evolution of the human genome. Inversions are known to directly suppress recombination in inversion heterozygotes. The lowering of recombination in inversion heterozygotes may also create effects similar to population sub-structure even without geographic isolation of the individuals. Characterization of inversion variants in human populations will be required to determine to what extent large inversions affect the recombination landscape of the human genome. Inversions could also represent an alternative mechanism for creating diversity in gene regulation and splice isoforms. Such variation may also influence phenotypes and associations with diseases.

7.4 Methods

7.4.1 Haplotype Data

We utilized genotype data from Phase I of the International HapMap project, consisting of 269 individuals genotyped on about 1 million SNPs. These individuals consist of 30 trios from the Utah region (CEU), 30 trios from Ibadan, Nigeria (YRI), 44 unrelated individuals from Tokyo, Japan (JPT) and 45 Han Chinese individuals from the Beijing area (CHB). Since the JPT and CHB populations are genetically similar, we pooled the data from these two populations to obtain a larger ‘analysis panel’ of 89 individuals. For the CEU and YRI ‘analysis panels’, we used the 60 unrelated parents from the respective populations. We analyzed each of the three ‘analysis panels’: CEU, YRI and CHB+JPT separately. We used the phased haplotype data for these ‘analysis panels’ (HapMap data release #16, available at www.hapmap.org/downloads/phasing/2005-03_phaseI/full/). Since the SNPs in this data were ordered based on the NCBI Build 34 (hg16) assembly of the human genome, all our results are with respect to the NCBI Build 34 assembly. We used the phased data since it is difficult to detect long range LD without phasing information. The phasing is highly accurate for the CEU and YRI ‘analysis panels’ due to the presence of trio information. For the JPT and CHB populations, in the absence of trios, the haplotype phasing is less accurate (a switch error every 0.34 Mb [3]). This can destroy long range LD, thereby potentially reducing the power of our method to detect inversions in the CHB+JPT ‘analysis panel’.

7.4.2 Defining multi-SNP markers

For each ‘analysis panel’, all SNPs with a minor allele frequency smaller than 0.1 in the ‘analysis panel’ were discarded, since they are less informative about LD patterns. After this filtering, we selected a multi-marker SNP block for every remaining SNP as follows. For each SNP S, we considered all SNPs in the genomic region L(S) ... L(S) + W, where L(S) is the genomic location of SNP S and W is the window size. If this window had fewer than k SNPs, it was discarded. For any k SNPs, an individual sequence is described by a haplotype of length k, induced by the allelic values of the k SNPs. Denote the set of haplotypes as A1, A2, ..., with frequencies p1, p2, ... respectively. For each window, we chose the subset of k SNPs that maximizes the entropy (−Σi pi log pi) of the induced haplotypes over all subsets of k SNPs. The subset of SNPs with maximum entropy best captures the haplotype diversity of the window and is potentially most effective for measuring LD with other multi-allelic SNP markers. These k SNPs defined a multi-SNP marker with a left and right physical boundary defined by the physical location of the first and kth SNP. The average SNP density of the HapMap ‘analysis panels’ is about one SNP (with MAF >= 0.1) per 5-6 kb (across different chromosomes). The parameters k and W were chosen to be 3 and 18kb respectively, based on this SNP density. The results are not greatly affected by increasing or decreasing W by a few kb. Simulations indicate that the power to detect inversions is smaller for k = 4 as compared to k = 3.
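As a concrete sketch of this selection step (hypothetical code, not the thesis implementation): given the haplotypes over all SNPs in a window, enumerate every k-subset of SNP positions and keep the one whose induced haplotype distribution has maximum entropy.

```python
from collections import Counter
from itertools import combinations
from math import log

def max_entropy_subset(haplotypes, k):
    """Return the k SNP positions whose induced haplotypes maximize entropy.

    haplotypes: list of equal-length allele strings, one per chromosome.
    """
    n_snps = len(haplotypes[0])
    best = (-1.0, None)
    for subset in combinations(range(n_snps), k):
        induced = [''.join(h[i] for i in subset) for h in haplotypes]
        total = len(induced)
        # entropy of the induced haplotype frequencies: -sum p_i log p_i
        ent = -sum((c / total) * log(c / total)
                   for c in Counter(induced).values())
        if ent > best[0]:
            best = (ent, subset)
    return best[1], best[0]
```

For k = 3 and windows of roughly 18 kb, this exhaustive enumeration is cheap; larger k would call for pruning.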

7.4.3 Computing LD

Linkage Disequilibrium between two multi-SNP markers was computed using the multi-allelic version of D′ [80]. Let A and B denote two blocks with haplotypes A1, A2, ... and B1, B2, ... respectively. Let pi (qj) denote the frequency of haplotype Ai (Bj). Define Dij = hij − piqj, where hij is the frequency of the haplotype AiBj. The extent of LD between each pair of haplotypes is defined as

    D′ij = Dij / Dmax

where

    Dmax = min{piqj, (1 − pi)(1 − qj)}   if Dij < 0
    Dmax = min{pi(1 − qj), (1 − pi)qj}   otherwise

The overall measure of LD between A and B is

    D′AB = Σi Σj piqj |D′ij|

We computed the LD measure between all pairs of multi-SNP markers on a chromosome (defined above) within a certain maximum distance. Using these LD values for the 22 autosomes, we obtained probability distribution curves of LD at a fixed distance d, denoted as φd. The X chromosome was excluded since it has a reduced recombination rate as compared to the autosomes.
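A direct transcription of these formulas (a sketch with our own variable names, not the thesis code), taking two blocks as per-chromosome haplotype labels:

```python
from collections import Counter

def multiallelic_dprime(hap_a, hap_b):
    """Multi-allelic D': D'_AB = sum_ij p_i q_j |D_ij / Dmax|."""
    n = len(hap_a)
    p = {a: c / n for a, c in Counter(hap_a).items()}   # block A frequencies
    q = {b: c / n for b, c in Counter(hap_b).items()}   # block B frequencies
    joint = Counter(zip(hap_a, hap_b))
    d_ab = 0.0
    for a, pa in p.items():
        for b, qb in q.items():
            d = joint.get((a, b), 0) / n - pa * qb      # D_ij = h_ij - p_i q_j
            if d < 0:
                dmax = min(pa * qb, (1 - pa) * (1 - qb))
            else:
                dmax = min(pa * (1 - qb), (1 - pa) * qb)
            if dmax > 0:
                d_ab += pa * qb * abs(d / dmax)
    return d_ab
```

Perfectly coupled blocks give D′AB = 1; independent blocks give 0.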

7.4.4 The Inversion Statistic

Consider a pair of breakpoints, where B1 and B2 denote two blocks on either side of the left inversion breakpoint and B3 and B4 are the blocks of SNPs spanning the other inversion breakpoint (see Figure 7.1). We compute a pair of log likelihood ratios, one for each inversion breakpoint, which represent the log of the ratio of the probability of the region being inverted in the population vs. being non-inverted. Let LDij denote the LD between blocks i and j, and dij denote the corresponding distance. The log likelihood ratio for the left breakpoint is defined as

    LLRl = log [ φd13(LD12) · φd12(LD13) / (φd12(LD12) · φd13(LD13)) ]   (7.1)

Similarly, the log likelihood ratio for the right inversion breakpoint is defined as

    LLRr = log [ φd24(LD34) · φd34(LD24) / (φd34(LD34) · φd24(LD24)) ]   (7.2)

If the pair of breakpoints represents inversion breakpoints (with the inverted allele having high frequency), we would expect the long range LD (LD13 and LD24) to be stronger than the short range LD (LD12 and LD34), and both log likelihood ratios to be positive. However, most measures of LD, including D′, show some dependence upon allele frequencies. Therefore, even in the absence of an inversion, the log likelihood ratios could be positive (due to the long range LD being larger in magnitude than the short range LD just by chance). Therefore, we estimate the significance of the two log-likelihood ratios using a permutation test. For a pair of breakpoints denoted by (l1, l2) and (r1, r2), we permute the haplotypes inside the inverted region (from block l2 to r1). The two log-likelihood ratios are computed for this permutation, and the p-value is defined as the fraction of permutations for which at least one of the two log-likelihood ratios is greater than its corresponding original value. We use 10,000 permutations to compute each p-value.
Using simulations, we found the p-value to have much better specificity and almost equal sensitivity at detecting inversions as compared to the log-likelihood ratios. Therefore, we use the p-value for a pair of breakpoints as our statistic for the presence of an inversion. The p-value for the log-likelihood ratios cannot be interpreted as a typical p-value; it estimates the chance that at least one of the two log-likelihood ratios would achieve the corresponding computed value even if there was no LD between the blocks.
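The permutation test can be sketched generically as follows; `llr_fn` is a hypothetical stand-in for the Equations 7.1-7.2 computation on a given ordering of haplotypes (this is not the thesis code):

```python
import random

def permutation_pvalue(haplotypes, llr_fn, n_perm=10000, seed=0):
    """Fraction of permutations in which at least one permuted LLR exceeds
    the observed one.  `haplotypes` holds the per-chromosome data for the
    candidate inverted region; `llr_fn(haplotypes) -> (llr_left, llr_right)`.
    """
    rng = random.Random(seed)
    obs_l, obs_r = llr_fn(haplotypes)
    hits = 0
    for _ in range(n_perm):
        permuted = haplotypes[:]
        rng.shuffle(permuted)                 # permute haplotypes in the region
        perm_l, perm_r = llr_fn(permuted)
        if perm_l > obs_l or perm_r > obs_r:  # "greater than", per the text
            hits += 1
    return hits / n_perm
```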

7.4.5 Identifying potential inversions

For every chromosome, we considered the region between every pair of adjacent SNPs as a potential breakpoint. If a pair of adjacent SNPs showed high correlation using the r2 measure (a cutoff of r = 0.6 was used), the region in between is highly unlikely to be a breakpoint and was excluded. For every breakpoint, we chose a multi-SNP marker to the left of the breakpoint and another one to the right of the breakpoint (these were chosen to be the physically closest multi-SNP markers to the breakpoint from the set of multi-SNP markers defined previously). Each breakpoint is reported as a pair of genomic coordinates corresponding to the right physical boundary of the multi-SNP marker closest to the left of the breakpoint and the left physical boundary of the multi-SNP marker closest to the right of the breakpoint. For every pair of breakpoints within a certain maximum distance, we computed the two log-likelihood ratios and the corresponding p-value. All pairs of breakpoints with low p-value are considered as potential candidates for inverted regions. A predicted inversion is reported as a 4-tuple

(l1, l2, r1, r2) corresponding to a pair of left (l1, l2) and right (r1, r2) breakpoints. For the analysis of the HapMap data, we ignored pairs of breakpoints within 200kb of each other, since considerable LD is observed at short distances in the HapMap data and power simulations also indicate that our method has low power to detect inversions of small length. Our results for the estimation of the false positive rate indicated that there was some enrichment for true positives in the predicted inversions with p-value smaller than 0.02. Therefore we chose a p-value cutoff of 0.02 for generating the predicted inversions. We also limit the size of the largest predicted inversion that we consider to 4 megabases. The largest known polymorphic inversion in the human genome is about 4.5Mb in length (Genome Variation Database at http://projects.tcag.ca/variation/).

Also, the distribution of the length of the predicted inversions suggests that predicted inversions larger than 4Mb represent false positives rather than true inversions. All pairs of predicted inversion breakpoints with length in the range 200kb-4Mb and with a p-value of 0.02 or smaller were enumerated for each chromosome in the three HapMap ‘analysis panels’. For each ‘analysis panel’ and chromosome, we clustered the predicted inversions based on the physical location of the breakpoints. For two predicted inversions (l1, l2, r1, r2) and (p1, p2, q1, q2), if the segments (l1, l2) and (p1, p2) overlapped, and similarly if (r1, r2) and (q1, q2) overlapped, these two predicted inversions were grouped together. After clustering, we had 215 predicted inversions in the three ‘analysis panels’. For every cluster we report the pair of inversion breakpoints with the smallest p-value. In order to further reduce potential false positives, we removed predicted inversions for which there was strong LD between the block to the left of the left breakpoint (block 1 in Figure 7.1) and the block to the right of the right breakpoint (p-value of the multi-allelic LD smaller than 0.02). The final list of 176 predicted inversions is presented in Supplementary File S1 (for the p-value distribution of these predicted inversions, see Figure 7.8).
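The clustering rule can be sketched as single-linkage grouping against each cluster's best member (hypothetical code, assuming each prediction carries its p-value as a fifth field):

```python
def _overlaps(a, b):
    """True if 1-D intervals a = (start, end) and b = (start, end) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def cluster_inversions(predictions):
    """Group predictions (l1, l2, r1, r2, pval) whose left segments overlap
    and whose right segments overlap; return the lowest-p member per cluster.
    """
    clusters = []
    for inv in sorted(predictions, key=lambda x: x[4]):  # best p-value first
        for cluster in clusters:
            rep = cluster[0]                             # lowest-p member
            if _overlaps(inv[0:2], rep[0:2]) and _overlaps(inv[2:4], rep[2:4]):
                cluster.append(inv)
                break
        else:
            clusters.append([inv])
    return [cluster[0] for cluster in clusters]
```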

Figure 7.8: The p-value distribution for predicted inversions having p-value ≤ 0.02.

7.4.6 Simulating Inversions

It is straightforward to simulate an inversion with frequency f = 1; however, to the best of our knowledge, there is no existing program that can simulate human population data accommodating inversion polymorphisms. The effect of decreasing the frequency of the inverted haplotype (on our statistic) is essentially to decrease the strength of long range LD and increase short range LD. Hence, we adopted a simple simulation strategy which mimics this effect of the inversion frequency on our statistic directly. For a given chromosome, we chose at random two SNPs S and E that define the region with the inversion polymorphism. Let 1, 2, ..., s denote the SNPs in this chosen region. To simulate an inversion with frequency f = 1, we just flip the values of the alleles at SNPs i and s + 1 − i, for all 1 ≤ i ≤ s/2, for all haplotypes. In order to simulate an inversion of frequency f (0 < f < 1), we randomly select a subset of haplotypes of size f × n, where n is the total number of haplotypes. For every haplotype in this set, we simply flip the values of the alleles at SNPs i and s + 1 − i, for all 1 ≤ i ≤ s/2. Notice that this may have the effect of combining the alleles at two different SNPs. We used the phased haplotype data from the International HapMap project to simulate inversions. In order to simulate an inversion of a given length, we choose one breakpoint randomly and the second breakpoint using the length of the inversion. After planting the inversions, we scan the chromosome for regions with low p-value for the log-likelihood ratios. A simulated inversion is considered to be detected if a predicted inversion (l1, l2, r1, r2) has the property that the interval (l1, l2) overlaps the left endpoint of the inversion and (r1, r2) overlaps the right endpoint. Power is defined as the fraction of simulated inversions which are detected. Each point in the power plots is based on simulating about 500 inversions.
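Reading the mirrored allele flips as reversing the allele order within the region, the planting step might look like the following sketch (hypothetical names, not the thesis code):

```python
import random

def plant_inversion(haplotypes, start, end, freq, seed=0):
    """Reverse the alleles in [start, end) on a random fraction `freq` of
    haplotypes (lists of allele values), modifying them in place.
    Returns the indices of the carrier haplotypes.
    """
    rng = random.Random(seed)
    n_carriers = int(freq * len(haplotypes))
    carriers = rng.sample(range(len(haplotypes)), n_carriers)
    for i in carriers:
        haplotypes[i][start:end] = haplotypes[i][start:end][::-1]
    return sorted(carriers)
```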

7.4.7 Sequence Analysis

We used the repeat-masked June 2003 (NCBI Build 34) human genome sequence from the UCSC (University of California, Santa Cruz) Human Genome Browser website for analyzing the inversion breakpoints. For each predicted inversion, the genomic sequence in the window [l1 − 200000 ... l2 + 200000] was blasted against the sequence in the window [r1 − 200000 ... r2 + 200000] to find pairs of homologous sequences. Only hits with an e-value less than 1e−25 and length at least 100bp were considered. We also removed pairs of homologous sequences that were less than 100kb apart. The statistical significance of the number of inversion breakpoints flanked by a pair of inverted repeats was estimated empirically as follows. We simulated 1000 random lists of inversions and computed the number of inversions with a pair of inverted repeats. Each random list of inversions was generated by shifting each predicted inversion (on the HapMap ‘analysis panels’) to a random location on the same chromosome on which it was detected. The p-value was estimated to be 0.006 using this method. Additionally, we observed that the length of the inverted repeats for many of the predicted inversions was generally much longer than those for the random lists. Analysis of genes near inversion breakpoints was performed using the UCSC KnownGenes II list from the UCSC Genome Browser. A gene was defined to cover an inversion breakpoint if the transcriptional start position of the gene was before the left boundary of the breakpoint and the transcriptional end location after the right boundary of the breakpoint. In order to assess the statistical significance of the number of inversion breakpoints covered by one or more genes, we used an empirical method similar to the one used above for inverted repeats. We simulated 1000 random lists of inversions and computed the number of genes covering breakpoints for each list.
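The empirical significance estimate can be sketched generically; `has_inverted_repeats` is a hypothetical oracle standing in for the BLAST-based flanking-repeat check (this is not the thesis code):

```python
import random

def empirical_pvalue(inversions, has_inverted_repeats, chrom_len,
                     n_trials=1000, seed=0):
    """Shift each inversion to a random position on its chromosome and
    recount how often the shifted lists match or beat the observed count.
    inversions: list of (start, end) pairs on one chromosome.
    """
    rng = random.Random(seed)
    observed = sum(1 for s, e in inversions if has_inverted_repeats(s, e))
    extreme = 0
    for _ in range(n_trials):
        count = 0
        for s, e in inversions:
            length = e - s
            ns = rng.randrange(chrom_len - length)  # random shift, same length
            if has_inverted_repeats(ns, ns + length):
                count += 1
        if count >= observed:
            extreme += 1
    return extreme / n_trials
```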

7.4.8 Coalescent Simulations

We simulated population data using the cosi program [139], which implements a coalescent model similar to the MS program [57] but allows for complex demographic histories and variable recombination rates. We used the bestfit model, which has been calibrated using genome-wide human population data for different populations. The bestfit model uses the large-scale variation in recombination rates obtained from the deCODE genetic map along with fine-scale variation in recombination rates. We used the default parameters of this model, which are listed in Table 1 of the paper describing the method [139]. The program generates data for four populations, each with its own demographic scenario. We used the data for three of the populations: West African, European and East Asian. These three populations were considered as proxies for the YRI, CEU and CHB+JPT ‘analysis panels’ respectively from the International HapMap project. We matched each HapMap ‘analysis panel’ to the corresponding simulated ‘analysis panel’ in the number of chromosomes. We did not model SNP ascertainment bias (present in the HapMap ‘analysis panels’) for the simulated data, since it is unlikely to affect our results as we discard SNPs with low minor allele frequency (less than 0.1). We generated 100 datasets of length 20Mb (it is computationally infeasible to generate chromosome-length regions using the cosi program) for each of the three ‘analysis panels’. We simulated data with a fixed number of SNPs and then thinned the SNPs so that the average SNP density (for SNPs with minor allele frequency >= 0.1) matched that of the HapMap data.

7.5 Acknowledgements

Chapter 7 was published in Genome Research, Vol 17, pp 219-230, 2007: V. Bansal, A. Bashir, and V. Bafna, “Evidence for large inversion polymorphisms in the human genome from HapMap data”. The dissertation author was the secondary author of this paper.

7.6 Supplemental material attached electronically

Supplemental Tables for Chapter 7 are available as a single zipped attachment online.

Chapter 8

Orthologous repeats and phylogenetic inference

8.1 Introduction

Repetitive elements, particularly SINEs (short interspersed elements) and LINEs (long interspersed elements), provide excellent markers for phylogenetic analysis: their mode of evolution is predominantly homoplasy-free, since they do not typically insert in the same locus of two unrelated lineages, and unidirectional, since they are not precisely excised from a locus with the flanking sequences preserved [147]. Indeed, the use of SINEs and LINEs to elucidate phylogeny has a rich history. SINEs and LINEs have been used to show that hippopotamuses are the closest living relative of whales [149, 108], to determine phylogenetic relationships among cichlid fish [159, 160, 161], and to elucidate the phylogeny of eight Primate species, providing the strongest evidence yet that chimps are the closest living relative of humans [136]. In each one of these studies, the presence or absence of a repetitive element at a specific locus in a given species was determined experimentally by PCR analysis, using flanking sequences as primers. It has been suggested that such experimental studies would not make a widespread contribution to phylogenetic inference in the short term, because the time, money, and effort

needed to collect data on relatively few characters would be prohibitive [55]. We agree that the biological methods described above are highly resource intensive. However, the set of species with partial sequence data available is rapidly expanding. Therefore, we propose instead to determine the presence or absence of a repetitive element at specific loci in each given species, and to infer the resulting phylogeny, purely by computational means. Previous work has already hinted at the potential of this approach: for example, Thomas et al. [163] identified 4 repetitive elements which support a Primate-Rodent clade, and Schwartz et al. [142] identified a repetitive element which supports a horse-Carnivore clade. Our work extends the computational analysis of repetitive elements for elucidating phylogeny to a much larger scale. We apply our method to obtain results on the phylogeny of 28 (mostly eutherian) mammals, using sequence data from the Comparative Vertebrate Sequencing Project [141]. We note that the phylogeny of eutherian mammals has been subject to considerable debate, as there are many instances where previous studies reach conflicting conclusions [4]. More recent studies [73, 131] have resolved many of the differences between mitochondrial and nuclear data. However, some open questions remain. Our results shed light on these questions, and are otherwise consistent with previous results. Given the predominantly homoplasy-free, unidirectional nature of SINE/LINE insertions, and the robustness of results obtained with limited sequence, we are optimistic that, with an increased amount of sequence data available in the future, our method will be a valuable alternative to traditional phylogenetic approaches (see also [33]).

8.2 Approach

Consider a syntenic genomic region in a set of n species. Figure 8.1(a) describes this schematically for n = 7 species. The synteny is determined by flanking orthologous regions, such as single copy genes, in all 7 species. Further, let n1 (n1 = 3 in Figure 8.1) of these n genes contain a repeat element R such that removing this repeat element results in a largely gap-free local multiple alignment of 6 of the 7 species. The multiply aligned region is depicted by the lightly shaded areas in Figure 8.1(a). The most parsimonious phylogeny explaining this scenario will have the 3 species in a clade, with R inserted in a common ancestor (Figure 8.1(b)). Any other scenario would imply either that R was inserted at exactly the same location multiple times in different species, or that the insertion of R in a species was followed by a deletion event that removed only the region containing R, and nothing else. Both of these are rare events, and therefore less plausible. The absence of a strong alignment (perhaps due to a deletion event) in G implies that neither presence nor absence of R can be verified. Thus, repeat R does not impose any phylogenetic constraint on G.


Figure 8.1: (a) A schematic diagram of syntenic regions in 7 species, with a repeat insertion in A, B, and C. The lightly shaded areas correspond to regions that align well, indicating that the repeat is present in A, B, C and absent in D, E, F. Neither presence nor absence can be verified for G. (b) A likely phylogeny consistent with a parsimonious explanation of the data. Species A, B, and C belong in a clade that can be separated from D, E, F, and the repeat was inserted in a common ancestor of the 3 species. There is no constraint on where G might occur. Also note that there are no constraints on where D, E, F occur, other than that they do not fall in a clade defined by the earliest ancestor of A, B, C.

As transposable repeat elements are very common, particularly in mammals, a collection of phylogenetic constraints such as the one in Figure 8.1(b) could be used to automatically construct a complete phylogeny. Through a multiple alignment procedure (to be described in detail in the Methods section), we obtain a collection of orthologous regions containing a subset of species in which a repeat was inserted in exactly the same location, and a disjoint subset in which the repeat was not inserted. This information is computed as an orthologous-repeats table, O, with rows corresponding to species and columns to repeats. The entries are given by

    O[i, c] = 1  if species i clearly contains repeat c
    O[i, c] = 0  if species i clearly does not contain repeat c
    O[i, c] = ?  otherwise

In practice, constructing accurate multiple alignments of diverged species is a challenging and highly researched problem [20, 22, 142]. In order to average out possible errors in orthology computation, we use MultiPipMaker [142] to compute multiple master-slave alignments, with each species in turn as the master. This leads to multiple columns for each truly orthologous repeat, but only one column (or very few columns) for an incorrectly computed orthologous region. These columns are then filtered to retain only the ones with high sequence similarity in repeats and flanking regions. For each column c, and triple (i, j, k) where O[i, c] = O[j, c] = 1 and O[k, c] = 0, the final phylogeny must be consistent with ((i, j), k), with the common ancestor of i and j separated from species k. Therefore, we have the following question: given a collection of phylogenetic constraints of the form ((i, j), k), does there exist a phylogeny that is consistent with all of these constraints? This problem is well studied. Aho et al. [2] and Pe’er et al. [114] show that the tree, if it exists, can be constructed efficiently. Henzinger et al. [54] devise a more efficient algorithm for this problem, and Kannan et al. [67] consider many extensions. These algorithms only work if the data is error free, so we cannot use them directly. Instead, we use a small modification of Aho et al.’s algorithm to handle errors. The algorithm is described below, with Figure 8.2 illustrating an example with n = 5 species and 3 repeats.

In practice, constructing accurate multiple alignments of diverged species is a challeng- ing and highly researched problem [20, 22, 142]. In order to average out possible errors in orthology computation, we use MultiPipMaker [142] to compute multiple master- slave alignments, with each species in turn as the master. This leads to multiple columns for each truly orthologous Repeat, but only one column (or very few columns) for an incorrectly computed orthologous region. These columns are then filtered to retain only the ones with high sequence similarity in Repeats and flanking regions. For each column c, and triple (i, j, k), where O[i, c] = O[j, c] = 1, and O[k, c] = 0, the final phylogeny must be consistent with ((i, j), k), with the common ancestor of i and j separated from species k. Therefore, we have the following question: Given a collection of phyloge- netic constraints of the form ((i, j), k), does there exist a phylogeny that is consistent with all of these constraints? This problem is well studied. Aho et al., [2], and Pe’er et al., [114] show that the tree, if it exists, can be constructed efficiently. Henzinger et al. [54] devise a more efficient algorithm for this problem, and Kannan et al., 1998 [67] consider many extensions. These algorithms only work if the data is error free, so we cannot use them directly. Instead, we use a small modification of Aho et al.’s algo- rithm to handle errors. The algorithm is described below, with Figure 8.2 illustrating an example with n = 5 species, and 3 repeats. 148


Figure 8.2: Sketch of phylogeny reconstruction from the orthologous-repeats table. (a) An orthologous-repeats table with 5 species and 3 repeats. (b) The resulting shared-repeat graph. We also illustrate the graph in matrix form. Note that the connected components of the graph correspond to clades in the final phylogeny. (c) One of the two clades has 2 species and is therefore resolved. The other has 3 species, and needs to be resolved further. (d) The orthologous-repeats subtable of species A, B, C. Only repeat R2 contains two 1’s and one 0. (e) The resulting shared-repeat subgraph resolves species A, B, C. (f) The final phylogeny.

1. Construct a weighted, undirected shared-repeat graph G = (V, E, w), with each species corresponding to a vertex in G. For repeat r, let N1(r) be the subset of species which contain this repeat. For all repeats r, and all (i, j) ∈ N1(r), increment the weight w(i, j). Figure 8.2(b) illustrates the corresponding shared-repeat graph G.

2. Recurse to construct a sub-tree for each unresolved connected component of G. While recursing on a component containing the subset Nc, we only consider columns which contain at least two 1’s and one 0 when restricted to rows in Nc. In the example, we only need to recurse on {A, B, C}. When restricted to those rows, only R2 contains two 1’s and one 0 (Figure 8.2(c-e)).

3. Construct the tree by connecting the root to the sub-trees from each connected component (Figure 8.2(f)).
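The recursion above might be sketched as follows (a simplified, hypothetical implementation without the error-handling refinements discussed later; `table` maps each repeat column to a species -> 0/1/'?' dict):

```python
from collections import defaultdict
from itertools import combinations

def build_tree(species, table):
    """Recursively resolve a phylogeny from an orthologous-repeats table,
    in the style of the modified Aho et al. recursion described above.
    Returns nested tuples of species names.
    """
    if len(species) <= 2:
        return tuple(sorted(species))
    # Shared-repeat graph restricted to `species`, using only columns with
    # at least two 1's and one 0 on these rows.
    weight = defaultdict(int)
    for col in table.values():
        ones = [s for s in species if col.get(s) == 1]
        zeros = [s for s in species if col.get(s) == 0]
        if len(ones) >= 2 and len(zeros) >= 1:
            for i, j in combinations(ones, 2):
                weight[frozenset((i, j))] += 1
    # Connected components via union-find with path halving.
    parent = {s: s for s in species}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for edge in weight:
        a, b = tuple(edge)
        parent[find(a)] = find(b)
    comps = defaultdict(list)
    for s in species:
        comps[find(s)].append(s)
    if len(comps) == 1:
        return tuple(sorted(species))  # unresolved: no or conflicting signal
    return tuple(sorted((build_tree(c, table) for c in comps.values()), key=str))
```

On the Figure 8.2 style example this recovers the ((A, B), C) and (D, E) clades.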

Table 8.1: An orthologous-repeats table containing a sampling of repeats. Each column corresponds to a specific repeat. The symbol 1 corresponds to the presence, and 0 to the absence, of that repeat. ’?’ indicates missing data, when neither presence nor absence of the repeat could be confirmed.

             repeat occurrence
    human    1 1 1 ? 0 ? ?
    chimp    1 1 1 0 0 ? ?
    baboon   1 1 0 0 0 ? ?
    mouse    1 0 0 1 0 ? ?
    rat      ? 0 ? 1 0 ? ?
    cat      ? 0 0 0 1 1 0
    dog      0 0 0 0 ? 1 0
    cow      0 0 0 0 1 0 1
    pig      0 0 0 ? 1 ? 1

As described, the algorithm does not handle the case in which the shared-repeat graph yields a single connected component. This could happen if some repetitive elements lead to contradictory phylogenetic scenarios. Previous biological studies which used repetitive elements to elucidate phylogeny typically included a small number of contradictory loci. For example, in their analysis of Alu elements to determine Primate phylogeny, Salem et al. [136] identified 7 loci with an Alu element clearly present in human and chimp genomes and clearly absent from gorilla, and 1 locus with an Alu element clearly present in human and gorilla and clearly absent from chimp; they concluded that the contradictory locus was due to incomplete lineage sorting: the Alu element at that locus was polymorphic at the time of divergence of gorilla from human and chimp, remained polymorphic at the time of divergence of chimp from human, and eventually became fixed in human and gorilla lineages but not in chimp. Incomplete lineage sorting and the incompatible loci it creates can complicate any phylogenetic analysis, but generally should not pose a problem in phylogenetic analyses using repetitive elements, as long as a sufficiently large number of independent loci are examined [147].
We have found evidence of insertion homoplasy in our own data set: Figure 8.3 illustrates that a strong alignment appears to exist for a SINE repeat in cat and rat, while the absence of this repeat is strongly supported in baboon, dog, cow, pig, and mouse, implying a phylogeny which is almost certainly incorrect. This repeat that is shared by cat and rat in an orthologous location is not an error, but accurately reflects the actual sequence data. Incomplete lineage sorting does not seem to be a plausible explanation for this example, as polymorphism of the presence or absence of the repeat would need to persist from the time of divergence of Rodents and Laurasiatheria (cat, dog, cow, pig) through the time of divergence of cat and dog, which seems unlikely. We speculate instead that this may be a rare instance of insertion homoplasy.

Figure 8.3: Multiple alignment for an incompatible repeat in the orthologous-repeats table of 9 species with finished sequence. Repeats, annotated by RepeatMasker [151], are indicated in lower case.

The (rare) presence of repeats which are incompatible with the correct phylogeny leads to two questions. First, how can we determine the correct phylogeny in the presence of conflicting evidence? Second, given a set of orthologous repeats which are incompatible with the correct phylogeny, how can we determine if these are instances of insertion homoplasy, incomplete lineage sorting, or erroneous alignment? In this paper, we focus primarily on the first question. Thus, we take the conservative approach of discarding repeats which are incompatible with the correct phylogeny. However, the second question is one of independent interest. For example, insertion homoplasy has important ramifications for repeat sub-family analysis, and evidence of incomplete lineage sorting may shed light on speciation hypotheses [136, 110]. In the Results section, we describe putative instances of each of these causes of incompatible repeats.

We now describe our approach to discarding repeats which are incompatible with the correct phylogeny. One possibility is to look at target-site duplications, the regions on either side of a repeat element which were duplicated at the time of repeat insertion. Previous studies have used matching target-site duplications to confirm that orthologous repeats correspond to a single insertion event in a common ancestor [163]. However, target-site duplications can be difficult to identify if they are short and/or highly diverged; thus, using target-site duplications to automatically discern instances of insertion homoplasy in a large-scale analysis is a considerable (and perhaps insurmountable) challenge. Therefore, we instead employ the following three approaches. First, in the case of insertion homoplasy, the orthologous repeats differed at the time of insertion and hence show greater divergence. Thus, we can use the statistic

%SIMILARITY IN FLANKING REGION − %SIMILARITY IN REPEAT REGION.

Large positive values of this statistic suggest possible insertion homoplasy (see Figure 8.4, and Results on the performance of this statistic). This statistic could also be high if the flanking regions were functionally important. However, that is a rare event, and discarding repeats with high values of the statistic is conservative. For the second approach, recall that each orthologous repeat describes a sub-tree that should be compatible with the overall phylogeny. Repeats that are incompatible with the true phylogeny are likely to be incompatible with sub-trees from many other repeats; this incompatibility can be tested without reconstruction of the phylogeny (see Methods: incompatibility removal). We show in our Results that all incompatibilities in our data are explained by a small number of repeats. Finally, the presence of such repeats leads to a single connected component in the shared-repeat graph, with the incompatible repeats being among the lowest-weight edges. We iteratively remove minimum-weight edges until the shared-repeat graph is no longer connected. In practice, we have found that the minimum weight is quite small, and the resulting phylogenies are robust (see Results). Our method includes the following steps:

1. Identify repeats in all of the sequences.

2. Use a genome multiple alignment tool to compute a multiple alignment of all sequences. The specific tool used, MultiPipMaker, builds a multiple alignment from n − 1 master-slave alignments of a single sequence against all others.

3. Construct an n × m orthologous-repeats table O, where the m columns arise from orthologous repeats, using a single sequence as the Master.

4. Repeat with each sequence as the Master sequence to construct a complete orthologous-repeats table.

5. Remove repeats (columns in the table) that are incompatible.

6. Construct a complete phylogeny from the orthologous-repeats table.

7. Compute Bootstrap values of the phylogeny to determine robust branches.

These steps are described in detail in the Methods section.
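As a minimal illustration of the data structure produced by steps 3-6 above, the orthologous-repeats table can be encoded as rows (species) of characters over {'1', '0', '?'} for present, absent, and uncertain. The species names and values below are illustrative only, not data from the study:

```python
# Toy orthologous-repeats table: rows are species, columns are repeats.
# '1' = repeat present, '0' = absent, '?' = uncertain (hypothetical values).
table = {
    "human": "110",
    "chimp": "110",
    "mouse": "011",
    "rat":   "0?1",
}

def column(table, j):
    """Return {species: state} for repeat column j."""
    return {sp: states[j] for sp, states in table.items()}

# Species sharing a '1' in a column are candidates for a clade that
# inherited that repeat insertion from a common ancestor.
assert column(table, 0) == {"human": "1", "chimp": "1",
                            "mouse": "0", "rat": "0"}
```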

[Histogram omitted: x-axis, %Similarity(flanking regions) − %Similarity(repeats); y-axis, %Repeats; separate distributions plotted for high-degree and low-degree columns.]

Figure 8.4: Distribution of the difference statistic among columns with high and low degree of incompatibility. The statistic measures the difference in sequence similarity between flanking and repeat regions. Repeats which show incompatibility to many other repeats may often be due to insertion homoplasy. These repeats show larger values of the difference statistic.

8.3 Results

8.3.1 Species with finished sequence

We first applied our method to 9 species with finished sequence data presently available, using sequence data from the 1.5Mb 7q31 region (see Methods). We constructed an orthologous-repeats table containing 1101 columns after removal of incompatible repeats (see online Supplemental Data). The resulting shared-repeat graph is displayed in Table 8.2(a). After omitting edges of weight 1, this shared-repeat graph splits into two connected components: a Primate-Rodent clade (human,chimp,baboon,mouse,rat) and a Laurasiatheria clade (cat,dog,cow,pig). Reapplication of the method to these clades produces the shared-repeat subgraphs displayed in Table 8.2(b). The Primate-Rodent subgraph is indicative of a Primate clade (human,chimp,baboon) and a Rodent clade (mouse,rat); the Laurasiatheria subgraph is indicative of a Carnivore clade

(cat,dog) and an Artiodactyl clade (cow,pig). Finally, reapplication of the method to the Primate clade produces the shared-repeat subgraph displayed in Table 8.2(c), which is indicative of a human-chimp clade. Combining all of these results, we obtain a phylogenetic tree of 9 species (see Figure in online Supplemental data S1). The tree is completely consistent with the larger tree of 28 mammalian species (Figure 8.5).

8.3.2 A larger set of species

We subsequently applied our method to a larger set of 28 (mostly eutherian) mammals with partial sequence data available, again using sequence data from the 1.5Mb 7q31 region (see Methods). We constructed an orthologous-repeats table containing 4775 columns after removal of incompatible repeats (see online Supplemental Data), and constructed a shared-repeat graph (see online Supplemental Data S2). The resulting phylogenetic tree is displayed in Figure 8.5. Each node is labeled by a bootstrap support value for that clade, obtained from an analysis of 1,000 bootstrap replicates. The consensus bootstrap tree was reconstructed using Consense, part of the Phylip package [41]. Results for parts of the tree where previous studies reached conflicting conclusions are discussed in detail below (see Discussion). Otherwise, our tree is entirely consistent with previous studies. In particular, our phylogeny of the 13 Primate species in our data set agrees exactly with the widely accepted phylogeny of Primates [123], and nearly all Primate phylogeny branches are supported by high bootstrap values. For example, we have identified hundreds of repeats which correctly separate (baboon,macaque,vervet,chimp,human,gorilla,orangutan) and (dusky titi,marmoset,squirrel monkey) from (galago,lemur,mouse lemur), and fewer than 10 repeats which support alternate resolutions of this trichotomy. Each one of these incompatible repeats is consistent with insertion homoplasy;1 the incompatible repeats are removed from the orthologous-repeats table during the incompatibility removal step. These numbers, and the resulting 100% bootstrap support for the correct resolution of this trichotomy, illustrate the robustness of our approach in dealing with instances of insertion homoplasy.

1 In each case, there exist two clades whose union contains all species with the repeat clearly present and no species with the repeat clearly absent, supporting the hypothesis of two distinct repeat insertion events in the ancestor of each clade.

Table 8.2: Shared-repeat graph and subgraphs of 9 species with finished sequence. (a) The shared-repeat graph on all 9 species is indicative of Primate-Rodent and Laurasiatheria clades. (b) Shared-repeat subgraphs for the Primate-Rodent and Laurasiatheria clades are indicative of Primate, Rodent, Carnivore and Artiodactyl clades. (c) The shared-repeat subgraph for the Primate clade is indicative of a human-chimp clade.

(a)
          chimp  baboon  mouse  rat  cat  dog  cow  pig
human       933     668      3    0    0    0    0    1
chimp               623      3    0    0    0    0    1
baboon                       3    0    0    0    0    1
mouse                            43    0    0    0    0
rat                                    0    0    0    0
cat                                        31    8   15
dog                                              6   11
cow                                                  18

(b)
          chimp  baboon  mouse  rat
human       235     122      0    0
chimp               112      0    0
baboon                       0    0
mouse                            28

          dog  cow  pig
cat        17    0    0
dog              0    0
cow                   4

(c)
          chimp  baboon
human        55       0
chimp                 0

8.3.3 Assessment of Incompatible Repeats

As discussed earlier, a few of the repeats are instances of insertion homoplasy, which can complicate phylogenetic analyses. If there is no instance of insertion homoplasy, then each pair of columns (i.e., repeats) in the orthologous-repeats table should be compatible, in that none of the implied phylogenies contradict each other. In the Methods section, we describe the simple 3-gamete condition that can be used to check incompatibility. Such incompatibilities are common in molecular sequence data, but should be rare for repeat insertion data. We define an incompatibility graph on the columns of the orthologous-repeats table. Each column is a node in the graph. Two columns are connected by an edge if they are not compatible. The columns that contain an instance of insertion homoplasy lead to phylogenies that are incompatible with many others and, therefore, correspond to high-degree nodes. Note also that if the repeats were inserted independently, their divergence from the flanking regions should be higher than for repeats that were inserted in a common ancestor of the sequences. For each of the columns in the table, we computed the difference in % similarity between the flanking regions and the repeat regions. To determine if this can be used as a statistic to detect independently inserted repeats, we looked at the distribution of this number for the 500 highest and the 500 lowest degree nodes in the incompatibility graph (see Figure 8.4). While the two distributions overlap, they have distinct means of 8.6% for the high-degree and 3.2% for the low-degree nodes. A t-test to determine if the means were equal gave a P-value of 1.1e−32. Based on this, we remove all columns for which the difference is 7.5% or higher.

This column removal procedure still retains some instances of insertion homoplasy, but these show up as high-degree nodes in the incompatibility graph. We constructed incompatibility graphs for the 9 organism data set as well as the complete 28 organism data set. For the 9 species, there were a total of 1101 columns, of which 717 nodes were connected by 821 edges. However, all edges are incident to only 4 nodes, and removing these nodes would remove all incompatibilities from the graph. The 28 organism data set has similar characteristics: there were a total of 4833 columns, with 28859 edges involving 3716 columns. However, removal of the 58 highest degree columns eliminates all incompatible edges. In our method, we iteratively remove the highest degree node until no incompatible edge remains.

In order to validate our results, we manually examined each of the 58 incompatible repeats. Of these, 38 are consistent with insertion homoplasy according to the above criteria, i.e., there exist two clades whose union contains all species with the repeat clearly present and no species with the repeat clearly absent. Of the 38 putative instances of insertion homoplasy, 23 correspond to Alu repeats in primates; we further analyzed the subfamily history of these repeats with respect to known Alu subfamily classification [121]. In nearly every case, for the two clades described above, subfamily membership was concordant within clades but discordant between clades (see online Supplemental data S3), strongly supporting the insertion homoplasy hypothesis of two distinct repeat insertion events in the ancestor of each clade. For additional discussion and alignments, see online Supplemental data S3.

Figure 8.5: Phylogenetic tree of a large set of 28 species. Bootstrap support values are based on 1,000 bootstrap replicates. Tree image created using Treeview [112].
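The screening statistic and the conservative 7.5% cutoff described above can be sketched in a few lines of Python. The percent identities are hypothetical inputs (in practice they come from the pairwise alignments), and the function names are ours:

```python
# Sketch of the homoplasy-screening statistic. Inputs are percent
# identities (0-100) computed from alignments of the flanking regions
# and of the repeat region itself (illustrative values only).

def difference_statistic(flank_similarity: float, repeat_similarity: float) -> float:
    """%similarity in flanking region minus %similarity in repeat region.

    Large positive values suggest the two repeat copies may come from
    independent insertions (insertion homoplasy) rather than a single
    insertion in a common ancestor.
    """
    return flank_similarity - repeat_similarity

def flag_homoplasy(flank_similarity: float, repeat_similarity: float,
                   threshold: float = 7.5) -> bool:
    """Conservatively discard repeats whose statistic meets the 7.5% cutoff."""
    return difference_statistic(flank_similarity, repeat_similarity) >= threshold

assert flag_homoplasy(95.0, 85.0)       # 10% difference: discarded
assert not flag_homoplasy(90.0, 88.0)   # 2% difference: retained
```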

8.4 Discussion

The phylogeny of eutherian mammals has been subject to considerable debate, as there are many instances where previous studies reach conflicting conclusions [4]. In particular, various placements of Rodents, horse, rabbit and hedgehog have been reported, as we discuss below. More recent studies [73, 131] have resolved many of the differences between mitochondrial and nuclear data, but leave open questions regarding the placement of armadillo and muntjak, which our results address. We first discuss the placement of Rodents, i.e., resolution of the trichotomy between Rodents, Primates and Laurasiatheria (Carnivores, Artiodactyls, etc.). Some studies report a Primate-Rodent clade [98, 4] while others report the divergence of Rodents from a Primate-Laurasiatheria clade [7, 93]. In our analysis, we identified 2 repeats separating Primates and Rodents from Laurasiatheria. Our results agree with [163], who identified 3 repetitive elements which support a Primate-Rodent clade. However, our automated approach failed to discover 3 of the 4 repeats mentioned by Thomas et al. These three repeats (all MLT10A0 repeats) failed because they either: a.) did not align to the flanking region on one side of the repeat, b.) showed significantly weaker alignment within the repeat region than the flanking regions, or c.) were slightly below our flanking region threshold. We note that our 9 organism run was sensitive enough to select one of the aforementioned MLT10A0 repeats for support of the Primate-Rodent clade. Another interesting example is the placement of horse in the phylogenetic tree. Early studies of horse, Carnivores and Artiodactyls reported a horse-Artiodactyl clade [49], while more recent studies report a horse-Carnivore clade [98, 7]. In our analysis, we identified 1 repeat separating horse and Carnivores from Artiodactyls. It is notable that our program discovers the same L1MA9 repeat that Schwartz et al. [142] used to establish the horse-Carnivore clade.
The alignment of this repeat (with flanking sequence) can be seen in Supplemental document S5. The placement of rabbit in the phylogenetic tree has been the subject of considerable debate. The resolution of the trichotomy between rabbit, Primates and Laurasiatheria has been variously reported as (Laurasiatheria,(rabbit,Primate)) [98], or (Primate,(rabbit,Laurasiatheria)) [7], or (rabbit,(Primate,Laurasiatheria)) [93]. We identified 4 repeats separating rabbit and Primates from Laurasiatheria, strongly supporting ((rabbit,Primate),Laurasiatheria). We further note that the Murphy et al. studies confirm the Glires hypothesis of a rabbit-Rodent clade, while the Arnason et al. and Misawa and Janke studies reject the Glires hypothesis. Although we neither confirm nor reject the Glires hypothesis, due to our unresolved (rabbit,Rodent,Primate) trichotomy, our rabbit results are inconsistent with the two studies rejecting the Glires hypothesis. Our placement of hedgehog inside the Laurasiatheria clade and armadillo outside
the clade containing Primates, Rodents and Laurasiatheria is consistent with Murphy et al. [98], but inconsistent with Arnason et al. [7], which places armadillo inside the Laurasiatheria clade and hedgehog outside the clade containing Primates, Rodents and Laurasiatheria. We identified 2 repeats separating hedgehog and Laurasiatheria from Primates and Rodents. Recent studies [73, 131] agree with our placement of Rodents, horse, rabbit and hedgehog, but still leave some open questions. For example, armadillo has been variously placed outside the clade containing Primates, Rodents and Laurasiatheria [98, 73], inside Laurasiatheria [7], or inside the clade containing Primates and Rodents [131]. Our results agree with Murphy et al. and Kitazoe et al.: we identified 1 repeat separating the Primate/Rodent and Laurasiatheria clades from armadillo. Our placement of Marsupials (wallaby,monodelphis,opossum) outside the clade containing Primates, Rodents and Laurasiatheria is broadly consistent with previous studies [98, 7]. We note that due to the inadequate representation of Marsupial repeat families in Repbase [65, 66], proper placement of Marsupials in our phylogenetic tree would not have been possible without our use of the RepeatScout algorithm [122] to identify additional repeat families (see Methods). Finally, we comment on the (cow,sheep,muntjak) trichotomy: Reyes et al. [131] report a ((cow,muntjak),sheep) resolution, but we report ((cow,sheep),muntjak), supported by 23 repeats separating cow and sheep from muntjak. Overall, we consider our generation of a phylogenetic tree of 28 mammalian species using orthologous repeats in 1.5Mb of sequence to be an encouraging result; although other methods based on protein coding sequence use far less data, our method can be applied to arbitrary DNA sequence, as produced by large comparative sequencing efforts. It is notable that all of our results are consistent with the Murphy et al. [98] study, despite having been obtained via entirely different means. Our bootstrap values are slightly lower than those of other studies, in which nearly all bootstrap support values exceed 95% [98, 7, 93]. However, these conflicting studies, each supported by high bootstrap values, cannot all be correct. Indeed, recent articles point to exaggerated bootstrap support values for the Bayesian methods used in some of these studies [94]. We note that our own procedure creates multiple columns for each orthologous repeat, which may lead to higher bootstrap values. Because our multiple alignment method creates multiple columns for each truly orthologous repeat but only one column (or very few columns) for an incorrectly computed orthologous repeat, we feel that this is justified.
We anticipate that with additional data and improved repeat-finding tools, we will obtain higher bootstrap values and resolve the unresolved trichotomies in our tree. In addition, we hope to further reduce the incidence of phylogenetically incompatible repeats, many of which may be due to insertion homoplasy; exploring the possible use of target-site duplications towards this goal is an important direction of our ongoing research. A caveat of our approach is that it results in a large amount of missing data; little work has been done to assess the statistical impact of missing data in phylogenetic inference, and varying opinions have been expressed in the literature on the issue of missing data in phylogenetic studies using orthologous repeats [55, 148]. One direction that we will consider is to adapt methods from quartet puzzling, a term that refers to approaches for obtaining reliable ML estimates of trees by combining information from unrooted quartets [140]. In our data set, each column corresponds to a partial tree on a subset of species, which should be amenable to quartet puzzling. Repetitive elements provide excellent markers for phylogenetic analysis, because their mode of evolution is predominantly unidirectional and homoplasy-free. Our approach allows us to isolate and investigate the evidence from each repeat, and is robust enough to deal with thousands of repeats. We are optimistic that, going forward, our method will be a valuable alternative to traditional phylogenetic approaches.

8.5 Methods

Data

Sequences were collected from the NIH Intramural Sequencing Center (NISC) Comparative Vertebrate Sequencing project [141]. The set of sequences used were from target reference 7q31, ENCODE name ENm001, a region approximately 1.5 Mb in size. The sequences themselves ranged from 1.2 Mb (pig) to 2.3 Mb (marmoset). To obtain preliminary data for organisms with unpublished 7q31 sequence, the entire 7q31 data set was scanned. GenBank files for accession numbers from that data set were retrieved; from these files the corresponding sequences were extracted. Contigs were joined to one another via overlap information embedded within each GenBank file. Note that the concatenated sequences are not complete, and the alignment introduces gaps.

Repeat Identification

For the 9 organism data set, repeat-annotated sequences were obtained from the supplemental data of Thomas et al. [163]. For the 28 organism data set, repeat elements were identified by running RepeatMasker [151] using a repeat library derived from the set of mammalian repeat families in Repbase [65, 66], plus additional repeat families identified by RepeatScout [122]. RepeatMasker was run at the default setting for speed/sensitivity.

Multiple Alignments

Multiple alignments were generated via MultiPipMaker [142]. MultiPipMaker is a tool for aligning multiple, long (Mb-size) genomic DNA sequences quickly and with good sensitivity. The program takes as input a single reference sequence and multiple secondary sequences; additionally, one of the following options must be selected: show all matches, chaining, or single coverage. Alignments are first computed by pairwise blastz alignments, and subsequent refinements, between the reference organism and each secondary sequence. MultiPipMaker then looks at sub-alignments within the global multiple alignment to see if modifications can be made to improve the overall score of the alignment. Since our sequences were variable in length, and since the alignments generated by MultiPipMaker are most relevant as alignments to the reference sequence, it was necessary to rerun MultiPipMaker with each organism as the reference sequence. This generates multiple columns for a single orthologous repeat, but has the advantage of averaging over the data: repeats erroneously marked as orthologous with a single master sequence are unlikely to show up with other master sequences, and will have a low weight in the shared-repeat graph. Thus, for our n organisms we generated n multiple alignments (the ordering of the secondary sequences was irrelevant). Moreover, the chaining option was selected to avoid duplicate matches caused by the "show all matches" option, i.e., a single region in the reference sequence aligning to two regions in a secondary sequence. This option was selected over single coverage because: 1) the secondary sequences were assumed to be contiguous, 2) the comparisons were made with a single strand of the secondary sequence, and 3) the order of conserved regions was assumed identical in the two sequences [143].

Identifying Orthologous Insertions

For each MultiPipMaker alignment, our algorithm iterated through the reference organism's RepeatMasker-generated repetitive element list, ignoring all non-transposable-element-based repeats (such as LTRs and simple repetitive repeats). For each considered repeat, the corresponding orthologous region in each secondary organism, as well as 50-nucleotide upstream and downstream flanking regions, were retrieved. For a repeat to be considered present in a secondary organism's sequence, it must strongly align in the repeat region and within both flanking regions. See Supplemental document S4 for an assessment of flanking region alignments. For a repeat to be considered absent from a secondary organism's sequence, it must strongly align within both flanking regions, while gapping out the repeat region. Such an alignment may not always be possible; a deletion in the region, for example, might make it impossible to determine if the repeat was deleted after insertion, or if it was never inserted. If neither set of requirements is satisfactorily met, the presence of the repeat is considered uncertain for that secondary organism's sequence. In the case of partial repeats, if the base organism repeat is a full-length repeat and it aligns to a partial repeat in a secondary organism (or vice versa), the repeat is considered uncertain for the secondary organism. However, if the base organism has a partial repeat and the same partial repeat region is seen within a secondary organism, it is considered to be present in the secondary organism. Using this methodology, an orthologous-repeats table is generated. Each row of the table represents an organism, and each column represents a given repeat. The presence of a repeat is indicated with a '1', the absence with a '0', and uncertainty with a '?'.
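The decision rules above can be summarized in a short sketch. The boolean inputs stand in for the actual alignment-quality tests (whose thresholds live in Supplemental document S4), and the partial-repeat special cases are omitted for brevity:

```python
# Hedged sketch of the presence/absence calls described above; the
# "strong alignment" tests are placeholders for the real thresholds.

def classify_repeat(flank5_aligns: bool, flank3_aligns: bool,
                    repeat_aligns: bool, repeat_gapped_out: bool) -> str:
    """Return '1' (present), '0' (absent), or '?' (uncertain) for one
    repeat in one secondary organism."""
    if not (flank5_aligns and flank3_aligns):
        return "?"          # cannot anchor the orthologous locus
    if repeat_aligns:
        return "1"          # repeat and both flanks align strongly
    if repeat_gapped_out:
        return "0"          # flanks align across a gapped-out repeat
    return "?"              # e.g. a deletion obscures the history

# One table column is built by classifying every secondary organism.
assert classify_repeat(True, True, True, False) == "1"
assert classify_repeat(True, True, False, True) == "0"
assert classify_repeat(True, False, False, False) == "?"
```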

Incompatibility Removal

Two repeats (columns in the orthologous-repeats table) are incompatible if they lead to conflicting phylogenies. Such incompatibility can be tested directly by using the 3-gamete violation rule. An incompatibility occurs for two columns (i, j) in the orthologous-repeats table if and only if there exist 3 species A, B, C that contain (0, 1), (1, 0) and (1, 1) in columns i and j, as shown in Table 8.3(a) (see, for example, [50]):

We construct an incompatibility graph. Each column is a node in the graph, and a pair of columns (i, j) forms an edge if (i, j) are incompatible. We must compute a minimum vertex cover of this graph [46], i.e., we must remove a minimum number of columns such that no incompatibilities remain. The problem is computationally hard in general, but our results show that a greedy heuristic works well. Also, the graphs we obtain are almost bipartite (contain no cycles of odd length), for which the problem is tractable. We iteratively remove the column with the highest degree (number of incompatible edges), recompute the degree of each column, and repeat until no incompatibility remains. This revised orthologous-repeats table is then fed into our tree-building algorithm. We note that there are rare cases in which there is no explicitly incompatible pair but the ambiguities '?' still lead to incompatibilities, as illustrated in Table 8.3(b). In this example, resolving the ambiguity of C at repeat i as a 0 leads to an incompatibility between i and j. On the other hand, resolving it as a 1 leads to an incompatibility between i and k. Such rare cases of indirect incompatibility lead to the shared-repeat graph having a single connected component. We deal with these cases in phylogeny reconstruction (see below).

Table 8.3: Incompatible columns in the orthologous-repeats table. (a) Columns i, j are incompatible because they violate the 3-gamete condition. (b) i, j, k are incompatible together, as any resolution of the ambiguity for species C in column i leads to an incompatibility. As columns i and k are supported by h and l respectively, column j corresponds to a weak edge and is removed during phylogeny reconstruction, resolving the incompatibility.

(a)
      i  j
  A   0  1
  B   1  0
  C   1  1

(b)
      h  i  j  k  l
  A   1  1  1  0  0
  B   1  1  0  0  0
  C   ?  ?  1  1  1
  D   0  0  ?  1  1
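A minimal sketch of the 3-gamete test and the greedy highest-degree removal can be written as follows. Columns are strings over {'0','1','?'} indexed by species, with '?' entries skipped by the pairwise test; the function names are ours:

```python
from itertools import combinations

def incompatible(col_i: str, col_j: str) -> bool:
    """3-gamete condition: columns i, j conflict iff some three species
    exhibit the state pairs (0,1), (1,0) and (1,1)."""
    gametes = {(a, b) for a, b in zip(col_i, col_j) if "?" not in (a, b)}
    return {("0", "1"), ("1", "0"), ("1", "1")} <= gametes

def remove_incompatible(columns):
    """Greedy vertex cover: repeatedly drop the column with the most
    conflicts until no incompatible pair remains."""
    cols = dict(enumerate(columns))
    while True:
        degree = {i: 0 for i in cols}
        for i, j in combinations(cols, 2):
            if incompatible(cols[i], cols[j]):
                degree[i] += 1
                degree[j] += 1
        worst = max(degree, key=degree.get)
        if degree[worst] == 0:
            return [cols[i] for i in sorted(cols)]
        del cols[worst]

# Columns i and j over species A, B, C, exactly as in Table (a):
assert incompatible("011", "101")
```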

Shared-Repeat Graph Generation and Phylogeny Reconstruction

The following procedure is an implementation of the algorithm presented by Aho et al., with modifications for dealing with incompatibilities [2].

1. A subset of the orthologous-repeats table is created, in which only "relevant" rows (organisms) are considered (initially all rows, since all organisms are being considered). Within this subset of rows, only those columns in which at least two rows have a 1 and one row has a 0 are considered.

2. Utilizing this subset of the original repeat occurrence table, a graph is created by iterating through the columns. If two rows both have a 1 in a given column, an edge of weight 1 is created between the two corresponding organisms. If an edge already exists between those two organisms, its weight is incremented by 1.

3. Multiple connected components are sought within the graph. If the graph contains a single connected component, weak edges must be eliminated. This is accomplished by removing edges, beginning with those of weight 1 and incrementally removing edges of greater weight, until multiple connected components arise.

4. Steps 1-3 are recursively applied to each connected component containing more than two organisms. The "relevant" rows in each run are the organisms within the connected component.
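The recursive procedure above can be sketched as follows (after Aho et al.). Columns are given as dictionaries mapping species to '0'/'1'/'?'; the species names in the test data and all function names are illustrative, not taken from the study:

```python
from itertools import combinations

def build_graph(species, columns):
    """Steps 1-2: weight(u, v) = number of informative columns (>= two 1s
    and one 0 among the relevant rows) in which both u and v carry the repeat."""
    weight = {}
    for col in columns:
        ones = [s for s in species if col.get(s) == "1"]
        zeros = [s for s in species if col.get(s) == "0"]
        if len(ones) >= 2 and len(zeros) >= 1:
            for u, v in combinations(ones, 2):
                e = frozenset((u, v))
                weight[e] = weight.get(e, 0) + 1
    return weight

def components(species, weight, cutoff):
    """Connected components after dropping edges of weight <= cutoff
    (simple union-find without path compression)."""
    parent = {s: s for s in species}
    def find(s):
        while parent[s] != s:
            s = parent[s]
        return s
    for e, w in weight.items():
        if w > cutoff:
            u, v = e
            parent[find(u)] = find(v)
    groups = {}
    for s in species:
        groups.setdefault(find(s), []).append(s)
    return list(groups.values())

def reconstruct(species, columns):
    """Steps 3-4: peel off weakest edges until the graph splits, then
    recurse into each connected component."""
    if len(species) <= 2:
        return sorted(species)
    weight = build_graph(species, columns)
    cutoff = 0
    while len(parts := components(species, weight, cutoff)) == 1:
        cutoff += 1
    return [reconstruct(p, columns) for p in parts]
```

On a toy table in which human/chimp share two repeats and mouse/rat share two others, `reconstruct` returns the two expected clades.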

Consider the example illustrated in Table 8.3(b). The phylogenetic inference of column i is supported by column h, and column k is supported by column l. Thus, in the shared-repeat graph, edges (A, B) and (C, D) have weight 2, while the edge (A, C) has weight 1. Removing the minimum weight edge is akin to removing column j, which has the least support. Finally, we perform a non-parametric bootstrap of our data. 1,000 pseudoreplicates were generated by randomly sampling the orthologous-repeats table (generated after removal of incompatible repeats) to create new orthologous-repeats tables of the same size as the original. From this set of 1,000 trees, we were able to obtain a consensus tree with bootstrap values using the Consense program [41].
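The pseudoreplicate step amounts to sampling columns with replacement, as the sketch below shows; tree building on each replicate and the Consense consensus step are external and omitted:

```python
import random

def pseudoreplicates(columns, n_replicates=1000, seed=0):
    """Non-parametric bootstrap: each pseudoreplicate samples m columns
    with replacement from the m-column orthologous-repeats table."""
    rng = random.Random(seed)   # fixed seed only for reproducibility here
    m = len(columns)
    return [[columns[rng.randrange(m)] for _ in range(m)]
            for _ in range(n_replicates)]

# Toy column labels; each replicate has the same width as the original.
reps = pseudoreplicates(["c1", "c2", "c3"], n_replicates=5)
assert len(reps) == 5 and all(len(r) == 3 for r in reps)
```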

8.6 Acknowledgements

Chapter 8 was published in Genome Research, Vol 15, pp 998-1006: A. Bashir, C. Ye, A.L. Price, and V. Bafna, "Orthologous repeats and mammalian phylogenetic inference". The dissertation author was the primary investigator and author of this paper.

8.7 Supplemental material attached electronically

Supplemental files for Chapter 8 are available as a single zipped attachment online.

Chapter 9

Conclusions

We began by asking whether large-scale genomic events merited their own independent study. Each of the presented topics highlights the unique applicability of these events. We have shown that high-coverage, primer-based diagnostics can be designed that robustly detect genomic lesions even in a high background of normal DNA. Additionally, paired-end sequencing has been used to accurately predict gene fusions, even when using very large insert lengths at low genomic coverage. Working backwards, we were then able to show direct connections between experimental design and the ability to detect breakpoint events and expressed transcripts, forming a foundation for how researchers should organize large-scale studies. Lastly, we presented novel techniques for detecting polymorphisms using SNP arrays and for building phylogenies using the simplest of large genomic markers, repeat elements. Along the way, we were able to use these methods to gain a number of unique biological insights. For example, our end-sequence profiling study in cancer discovered a large number of gene fusions at the genomic level. Other studies have similarly shown a large number of fusion genes using transcript sequencing [135]. Interestingly, with the exception of the BCAS4/BCAS3 fusion, there was no overlap between the two fusion sets on the same cell line (MCF7). Though this could be due to poor sequencing depth in both studies, it may also indicate a high rate of read-through or trans-splicing

events [85]. BCAS4/BCAS3 was also unique in that it was a multiply rearranged and amplified region, suggesting that its mutation may be under selection. This, along with the comparatively lower number of fusion events we observed in primary tumors, may point to a model in which a few significant events push a cell towards oncogenesis, with a large number of lesions occurring after the initiation of oncogenesis. On the other end of the spectrum, the analysis of orthologous repeats helped shed new light on unresolved areas of the eutherian phylogeny. Notably, the phylogenetic placement of rabbits has been a subject of considerable debate. Our result unambiguously places primates, rodents, and rabbit together, with Laurasiatheria as an outgroup. Its inability to resolve the rabbit/rodent split is indicative of the short time span at that evolutionary juncture and the possibility of "incomplete lineage sorting" during the rapid divergence of several species. Reflecting on these studies as a whole, there is a strong interdependence that should be exploited. For example, the PAMP optimization requires highly characterized boundaries for genomic lesions; our fusion prediction method generates precise boundaries for any genomic event. High-throughput sequencing of multiple cancer cell lines (as well as primary tumors) could predict boundaries at which recurrent breakpoints occur. These boundaries could constrain a particular PAMP design, ensuring that it captures a high fraction of the desired lesion, while minimizing cost. Similarly, one of the purported goals in sequencing multiple cell lines is the identification of recurrent events. The PAMP assay is an effective tool to quickly assay for a specific event over a large number of individuals. One could design PAMP assays to rapidly screen a large number of predicted fusions (or other lesion events) and determine which are the most promising targets.
Since the PAMP assay provides tight boundaries for each lesion, it offers insight into the underlying mechanism of an event, including whether the event has a somatic origin. This interconnection extends to polymorphism studies (such as the inversion analysis). High-throughput sequencing can provide a list of candidate events that can confirm, or be used to train, predictions based on SNP arrays; PAMP assays validate (and refine boundaries for) predicted events.

Nearly all studies benefit from the design considerations discussed in Chapter 5. For studies of gene fusion, this would direct the optimal mix of clone libraries and depth of sequencing to catch recurrent events over multiple samples. These types of statistics will become of particular importance in determining complex genome architectures. Having a complete picture of the interconnections between genomic segments is essential for any reasonably accurate reconstruction. Moreover, sequencing depth can be used to confirm and correct aCGH copy number predictions. Certainly, for genome sequencing studies, often the starting point for phylogenetic inference, depth of sequencing and event detection are paramount concerns. Taken together, these techniques complement each other, allowing for deeper insight into the underlying questions.

9.1 Open Problems

Though we offer methods for many open issues, these solutions should hardly be considered final. Rather, they represent an initial pass at questions that will need to be further explored. The discussed topics still only scratch the surface of potential applications and contexts in which to examine these events.

9.1.1 Genomic diagnostics for disease

This work offered several schemes for the detection of genomic lesions using primer-based assays. However, even within the PAMP framework, there is significant leeway for the development of more intricate optimization schemes. The most direct of these would be to decouple the dependence between forward and reverse primer sets. In the current designs, significant restraints are placed on the types of multiplex reactions that can occur. Specifically, we enforce that any forward primer must not dimerize with any reverse primer, as each forward primer must at some point be multiplexed with all reverse primers (and vice versa). This constraint has an experimental underpinning: we seek to minimize cost by reducing the number of primers used and multiplexing reactions run. It is, in fact, this simplification which leads to the elegant two-dimensional symmetry which we exploit in Chapter 3 for computational expediency. However, one could achieve better coverage if permitted to use forward (reverse) primers with only a subset of reverse (forward) primers. Of course, this would lead to more primers overall. The same region may have multiple primers collocated, and this would, in turn, create more convoluted multiplexing schemes. However, as this approach moves out of the laboratory and into an automated industrial setting, the necessity of maintaining low experimental complexity is alleviated. This unconstrained approach could permit good coverage of very highly variable breakpoint regions, megabases or tens of megabases in size. Nanotechnology will, perhaps, enable even more significant opportunities to push the limits of experimental complexity. The PAMP protocol could be shifted to a microdroplet-based approach, in which each primer pair interacts within its own independent reaction. In some formulations, this could eliminate multiplexing altogether.
However, new questions (such as minimizing the number of droplets, decoding signals from droplets, ensuring that the genomic lesion is present in each reaction, etc.) will inevitably arise. No matter how the technology evolves, these optimization schemes will play a crucial role in making the diagnostics feasible and facilitating analysis of the resulting signals.

9.1.2 Predicting Fusion Events and Architectures

A large part of this dissertation relates to work associated with, or enabled by, high-throughput sequencing. The approach taken was two-fold: 1) develop methods for data analysis; 2) develop a framework for the design of sequencing studies. With regard to the first point, we focused on applications for predicting breakpoint positions and gene fusion events, showing that one can predict (with high confidence) gene fusion events even with very low levels of sequencing [14]. The formulation also suggests an elegant model by which to classify and cluster large-scale genomic events (such as insertions, deletions, inversions, and translocations) and has already been extended to capture these events in an efficient and robust fashion [150].

Gene fusions, obviously, represent one small piece of the complex set of changes induced by rearrangements. The same sort of rigorous approach can, and should, be applied to other types of events. Specifically, it should be fairly straightforward to represent certain regulatory modules and predict whether or not a regulatory shift is induced by genomic events. Even within gene fusion analysis there are additional levels of complexity that can be examined. Determining the probability that a fusion event creates an "in frame" fusion protein, or joins specific protein domains, will be informative in associating the fusion event with a specific phenotypic effect. With increasing interest in transcriptomics, transcript-based fusion analysis potentially provides a much more direct confirmation that a genomic rearrangement actually leads to an expressed gene. High-throughput mass spectrometry approaches are also now beginning to examine and predict fusion products [39]. Correlating the fraction of rearrangements which create transcripts to the fraction of fusion transcripts that become expressed fusion proteins will be imperative in determining which rearrangements are optimal therapeutic targets.
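The "in frame" criterion mentioned above can be stated concretely. The following is a minimal illustration, not code from the dissertation; the coordinates are hypothetical and are measured in coding bases from each partner's start codon:

```python
def in_frame(retained_5p_cds, skipped_3p_cds):
    """A fusion transcript preserves the 3' partner's reading frame iff the
    retained 5' CDS length and the truncated 3' CDS prefix are congruent
    modulo 3 (illustrative coordinates; both measured from start codons)."""
    return (retained_5p_cds - skipped_3p_cds) % 3 == 0
```

For example, retaining 300 coding bases of the 5' partner and joining at coding offset 150 of the 3' partner preserves frame, while retaining 301 bases does not.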

9.1.3 Population genetics and phylogenetics

A number of incremental algorithmic and data improvements will make the study of large-scale polymorphisms and rearrangement-based phylogenetics increasingly relevant. As an example, the inversion detection technique has the best success when the inverted allele (with respect to the reference sequence orientation) has high frequency in a population; as such, it misses many low-frequency events. This could be somewhat improved by implementing an optimal partitioning of haplotypes across each putatively inverted region. Though this is itself a computationally challenging problem, it has the potential to improve both sensitivity and selectivity.

Both studies were performed on limited data. For detecting inversions in the population we used SNP data from HapMap. Higher-density SNP arrays and larger population sample sizes will make this approach increasingly robust. This type of data is already available in the latest version of HapMap [44]. Ongoing and future initiatives, such as the 1000 Genomes Project, could provide sequencing confirmation for many of the predicted events. This will enable a much larger training set of "true" inversion signals, allowing for the possibility of using machine learning approaches along with the current statistics. Similarly, our phylogenies were generated using only ∼1.5 Mb of syntenic DNA. New technologies will enable orders of magnitude more sequence data to be generated at a fraction of the cost. As complete re-sequencing of genomes becomes commonplace, it will dramatically change the nature of such studies.

Phylogenetic inference, even with the maturation of high-throughput sequencing, still faces inherent complications associated with repeat-dependent studies. Repeat elements often lead to misassembly, a problem that is compounded when using short reads. When utilizing orthologous repeat elements, misassembly has a profound impact on the resulting phylogenies.
New sequencing protocols, allowing for distant mate pairing of short reads, may help scaffold across repeat regions, though these will still require deep sequencing [25]. In this context, new long-read technologies, such as those being developed by Pacific Biosciences [38], may play an important role in helping confidently place such markers.

More significantly, by combining these markers with other genomic events predicted by high-throughput sequencing, a truly robust method for phylogenetic inference emerges. We have outlined approaches for determining rearrangement breakpoints (for an individual or cell sample) relative to a genomic reference. In phylogenetics, the same analysis is often performed between two different organisms. Rearrangements have been used for quite some time in establishing phylogenetic distance. Methods are now utilizing a variety of genomic signals (integrating both rearrangements and retrotransposon insertions) in order to resolve distant phylogenetic relationships and particularly difficult nodes in phylogenies [99]. Taken together, these techniques have the potential to become the standard for mammalian phylogenetic inference.

Both of these methods also have the potential to be turned on their heads to investigate their underlying signals. One can use the method for inversion detection to understand the impact of large inversions on local linkage disequilibrium patterns. It will be particularly informative to see the effect on such patterns when the inversion has a strong connection to phenotype. Similarly, we can use our repeat-derived phylogenies to investigate the progression and activity of different repeat families. It has been proposed that transposable elements have a strong regulatory impact [18]; these phylogenies can help clarify whether specific repeats led to the emergence of new evolutionary traits.

9.2 A modest proposal

We are clearly in the midst of a surge in data related to genomics and transcriptomics. There is a risk of being overwhelmed by the sheer volume of data being created. As such, there may be a temptation to undertake significant studies without having a complete grasp of the results that should be expected. It will be crucial for informatics development to continually be one step ahead of the technological curve. This, of course, applies to data analysis and method development. Current methods will need to be optimized to handle an increasing amount of data, and new methods will need to be developed to handle novel sources of data.

However, an even more essential role will need to be played in shaping the nature of research being performed. With the commoditization of new technologies, bioinformatics will take on a seminal role in dictating study design and defining the set of achievable hypotheses. This extends beyond designing studies based on new technologies to determining which new technologies should be developed. There will soon be a proliferation of options in sequencing (short reads, paired reads, long reads, very long reads, etc.). Informaticians should move to the forefront: identifying how technologies should be used (and which should succeed) and determining which scientific problems will be answered.

Appendix A

Supplemental: Optimization of primer design for the detection of genomic lesions in cancer

A.1 Complexity of PAMP design

One-sided PAMP design (OPAMP): Given a genomic region G of length L with a single reverse primer, a collection F of candidate forward primers (with only the forward primers dimerizing), and an integer value D ≤ L, does there exist a non-dimerizing collection F′ ⊆ F of forward primers such that the total uncovered region is less than D, with no adjacent primers¹? Note that a polynomial-time solution for the general problem implies that OPAMP is polynomial-time solvable. Hence, it is sufficient to prove that OPAMP is NP-hard. We show a reduction from Max2-SAT [46]. Consider an instance φ = (X, C) of Max2-SAT, with n variables X = {x1, . . . , xn} and m clauses C = {C1, . . . , Cm}, each with at most two literals (x or x̄ for some x ∈ X). We will transform φ into an instance Gφ of OPAMP as follows:

¹"Adjacent" primers have a spacing of less than r base pairs, where r > 0 as defined in Section 2.


We design a primer for each occurrence of a variable. Each primer is defined by a concatenation st, where s is a string that identifies the variable, and t is a string unique to the primer. Suppose literal xj occurs in m1 clauses, and its complement x̄j appears in m2 clauses. We construct m1 primers of the form sjt1, sjt2, . . . , sjtm1, and m2 primers of the form s̄jt1, s̄jt2, . . . , s̄jtm2, where s̄j is the reverse complement of the string sj. Note that as there are many good primers, we make the implicit assumption that it is possible to design such primers. By construction, we have the following properties:

1. All primers from the first set dimerize with all primers from the second set, implying that only one set will be represented in the final set.

2. Each primer is distinct from the other primers in its set; therefore, all nonadjacent primers in a set can be selected.

To construct the genomic string G, the clauses are considered in an arbitrary order. For each clause Cq = x + y, we concatenate the primers corresponding to the literals x, y. Each such pair of forward primers is located on the genomic string exactly d apart, where d is the maximum length of a string that can be amplified by PCR. The intervening space is constructed to be a run of d A's so that no primer is selected from that space. Finally, at the right end, we have a single reverse primer that does not conflict with any primer in the collection. We have the following theorem.
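The construction above can be sketched symbolically. This is an illustrative encoding, not the actual DNA-level design: primers are represented as (literal, index) pairs standing for the strings sjta, and dimerization is declared between every primer of a variable and every primer of its complement, mirroring the sj / s̄j reverse-complement design:

```python
# Symbolic sketch of the Max2-SAT -> OPAMP reduction. A primer ("x1", 0)
# stands for s_1 t_0; a negated literal uses "~x1". No real sequences are
# built; conflicts model the s_j / s-bar_j dimerization by construction.

def reduce_max2sat(clauses):
    """clauses: list of (lit, lit) pairs; literals are 'x1' or '~x1'.
    Returns (primers, conflicts, layout), where layout maps each primer to
    the index of the clause whose d-spaced slot it occupies."""
    counts = {}
    primers, layout = [], {}
    for q, clause in enumerate(clauses):
        for lit in clause:
            a = counts.get(lit, 0)
            counts[lit] = a + 1
            p = (lit, a)            # s_j t_a: t_a is unique within the literal's set
            primers.append(p)
            layout[p] = q           # placed exactly d apart inside clause q's region
    neg = lambda l: l[1:] if l.startswith('~') else '~' + l
    conflicts = {(p, r) for p in primers for r in primers
                 if neg(p[0]) == r[0]}   # every s_j primer dimerizes with every s-bar_j primer
    return primers, conflicts, layout
```

Selecting a non-dimerizing subset of these primers then corresponds to a consistent truth assignment, one primer per satisfied clause.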

Theorem 4: An instance φ = (X, C) of Max2-SAT can be transformed into an instance Gφ of OPAMP with the following property: there exists a boolean assignment to X that satisfies exactly k ≤ m clauses if and only if there exists a subset of non-dimerizing primers F covering all but L − kd base pairs.

Proof: Checking whether F is a set of non-dimerizing primers covering kd base pairs can be accomplished in polynomial time by checking, for each pair of primers sita, sjtb ∈ F, whether they dimerize. In the worst case such a test is quadratic in the size of F (which is at most m, the total number of clauses).

We show that if φ has an assignment satisfying k clauses, then Gφ has a subset of primers F covering all but L − kd base pairs. Each satisfied clause Cq in φ contains at least one literal xi that is assigned 1, and each such literal corresponds to a primer sita. Selecting one such "true" literal for each of the k clauses yields a set F (where |F| = k) covering kd base pairs. We claim that F is a set of non-dimerizing primers. For any two primers si, sj ∈ F, both xi, xj are mapped to one by the satisfying assignment, and thus xi ≠ x̄j. By construction, sita, sjtb will not be a dimerizing primer pair in Gφ.

Conversely, suppose that Gφ has a set of non-dimerizing primers F covering all but L − kd base pairs. We can assign 1 to each xi where sita ∈ F. First, we show that the assignment is consistent. If xi = 1 and x̄i = 1, then sita, s̄ita ∈ F. This implies F contains dimerizing primers, a contradiction. Next, as F contains only non-adjacent primers, each primer helps satisfy a distinct clause in φ. Therefore, at least k clauses have been satisfied.

A.2 Methods and Parameters

A.2.1 Computational

Primer3 parameters. Primer3 [134] was run in its stand-alone form (version 1.1.1 beta) with the maximum "Number to Return" set to 50000. The allowed GC content was 35%–65%. No "N"s were permitted in candidate primer sequences ("Max Ns Accepted" = 0). Primers in the forward direction were selected using the "left only" parameter; similarly, primers in the reverse direction were selected using the "right only" parameter.

Filtering. The first filtering step eliminated those primers which occurred more often than expected by chance in the human genome. Given an N-mer of size 13 and a genome of size 3 × 10^9 bp, we would expect ∼45 occurrences of such an N-mer by random chance (3 × 10^9 / 4^13 ≈ 45). Therefore, all primers whose 3' end occurred more than 45 times were eliminated.
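The threshold above follows from a simple expected-count calculation; a minimal sketch with the text's numbers (function names are illustrative):

```python
# Expected occurrences of a k-mer in a genome of size G under a uniform
# base model: G / 4**k. For k = 13 and G = 3e9 this is ~45, the filtering
# cutoff used in the text.

def expected_occurrences(genome_size, k):
    return genome_size / 4**k

def passes_uniqueness_filter(end_count, genome_size=3e9, k=13):
    """Keep a primer only if its 3' 13-mer occurs no more often than
    expected by chance (rounded to the nearest integer)."""
    return end_count <= round(expected_occurrences(genome_size, k))
```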

Next, primers were checked for complete uniqueness to the region. Using parameters derived from [173], we checked that there was no greater than 75% identity between the primer and any other sequence in the region of interest, and that no exact matches of length ≥ 15 existed. The degree of filtering varied greatly, but was generally correlated with the repeat content of a region (data not shown).

Conflict Edge (Primer-Dimer) Computation. To efficiently compute the conflict edge graph, we employ a simple filtering technique. We construct a hash table of all 3-mers; only primers that hash to the same location are aligned to compute E. Primers were aligned using an ungapped alignment, where a "match" is given a score of 1 and a "mismatch" a penalty of −1. Primer pairs which received a score ≥ 7 were given a dimerization edge.
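One plausible reading of this scoring scheme, as a sketch. The +1/−1 scoring and the ≥ 7 threshold follow the text; aligning one primer against the reverse complement of the other is an assumption, and the 3-mer hash prefilter is omitted for brevity:

```python
# Sketch of the primer-dimer check: score the best ungapped alignment of
# primer p against the reverse complement of primer q (+1 match, -1
# mismatch); a best score >= 7 would create a conflict edge in E.

COMP = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    return s.translate(COMP)[::-1]

def dimer_score(p, q):
    """Best ungapped alignment score between p and revcomp(q) over all
    offsets (never below 0, since an empty overlap scores 0)."""
    t = revcomp(q)
    best = 0
    for shift in range(-len(t) + 1, len(p)):
        score = 0
        for i, c in enumerate(t):
            j = shift + i
            if 0 <= j < len(p):
                score += 1 if p[j] == c else -1
        best = max(best, score)
    return best

def dimerize(p, q, threshold=7):
    return dimer_score(p, q) >= threshold
```

A perfectly complementary pair scores its full length, while a pair with no complementarity scores 0 and gets no edge.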

Simulated Annealing. The initial set for each simulated annealing run was generated by drawing a random independent set of size ρ = 1 prior to the first iteration. The temperature was set very high initially (T = 1000) and decremented linearly as a function of the number of iterations until 0 was reached. For simulations (and experimental results) each iteration moved to a neighboring independent set (as described in the paper). To calculate the energy of a set, we determine the amount of missing coverage in the current solution and a neighboring solution; this determines the probability of a move [71]. The optimal solution observed over all iterations was returned as the selected primer set.
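The annealing loop described here can be sketched as follows. This is a generic skeleton, not the dissertation's implementation: `neighbor` (proposing an adjacent independent set) and `uncovered` (the missing-coverage energy) are problem-specific stand-ins:

```python
import math, random

def anneal(initial_set, neighbor, uncovered, iterations, t0=1000.0):
    """Minimize missing coverage by simulated annealing with a linearly
    decreasing temperature, keeping the best solution seen overall."""
    cur, best = initial_set, initial_set
    for it in range(iterations):
        T = t0 * (1 - it / iterations)   # linear cooling toward 0
        cand = neighbor(cur)
        dE = uncovered(cand) - uncovered(cur)
        # Metropolis rule: always accept improvements; accept worse moves
        # with probability exp(-dE / T), which shrinks as T cools.
        if dE <= 0 or (T > 0 and random.random() < math.exp(-dE / T)):
            cur = cand
        if uncovered(cur) < uncovered(best):
            best = cur
    return best
```

On a toy energy landscape (integers with energy |x − 7|, neighbors x ± 1), the loop converges to the minimum at 7.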

Integer Linear Program. To solve the ILP formulation we used CPLEX 10.1.1 for Linux (x64). ILP simulations were formulated as described in Section 4, using the CPLEX LP format. In order to reduce run time, the ILP solver was given the simulated annealing solution as a starting set; it is for this reason that we never observe an ILP solution worse than the corresponding simulated annealing solution. The solutions shown in Figure 7c correspond to the optimal solution observed and the best lower bound available to CPLEX at the time the run was cut off. Each run was permitted to run for > 24 hours.

A.2.2 Experimental

The experimental procedures are essentially the same as previously described [82], except for the design of primers and the PCR protocols as described here. A series of primers toward INK4A exons 1-2 along the CDKN2A locus were synthesized by Integrated DNA Technologies (Coralville, IA). Groups of computationally designed forward and reverse primers (50 nM each in the final reaction) were used to generate amplicons from 0.1 µg of genomic DNA template in a total of 10 µl of solution mixed with 10 µl of Taq 2× Master Mix (New England Biolabs, Ipswich, MA). Each primer contains two elements: the sequence of primer B (GTT TCC CAG TCA CGA TC) at the 5' end for the subsequent step of PCR labeling as described previously [172], and the genomic sequence of a specific target at the 3' end. The reaction was assembled at 4°C in a PCR workstation and transferred to a thermocycler with the block preheated to 94°C. The cycling conditions were a 3-minute denaturation step at 94°C followed by 5 cycles at 92°C for 30 sec, 65°C for 30 sec, and 68°C for 2 minutes; 5 cycles at 92°C for 30 sec, 60°C for 30 sec, and 68°C for 2 minutes; and 25 cycles at 92°C for 30 sec, 55°C for 30 sec, and 68°C for 2.5 minutes, with a final extension step at 68°C for 5 minutes. One µl of unpurified product was subsequently used as a template for another 20 cycles of amplification to label the amplicons via a previously described "Round C" PCR protocol (94°C for 30 sec, 40°C for 30 sec, 50°C for 30 sec, and 72°C for 1 minute) with primer B and a 4:1 mixture of aminoallyl-dUTP (Ambion, Austin, TX) and dTTP for probe labeling [172].
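For reference, the touchdown cycling program above can be written as structured data; the helper function is a hypothetical addition (not part of the protocol) that just totals the programmed hold times, excluding ramp times:

```python
# The touchdown PCR program from the text: (repeat count, list of
# (step, temperature in C, hold in seconds)) per stage.

PROGRAM = [
    (1,  [("denature", 94, 180)]),
    (5,  [("denature", 92, 30), ("anneal", 65, 30), ("extend", 68, 120)]),
    (5,  [("denature", 92, 30), ("anneal", 60, 30), ("extend", 68, 120)]),
    (25, [("denature", 92, 30), ("anneal", 55, 30), ("extend", 68, 150)]),
    (1,  [("final extension", 68, 300)]),
]

def total_hold_seconds(program):
    """Sum of all programmed hold times (ramp times not modeled)."""
    return sum(n * sum(sec for _, _, sec in steps) for n, steps in program)
```

The programmed holds total 7530 s, roughly two hours before ramping is accounted for.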

A.3 Supplemental Figures

Figure A.1: Number of simulated-annealing iterations as a function of region size and primer density. The iterations decrease after peaking at 1 Mb, indicating the impact of primer dimerization on convergence. Similarly, as set size expands, most notably from 1.1 to 2, the solutions converge more rapidly. On a 3.2 GHz Intel Xeon CPU with 4 GB RAM, we compute ∼27.5 × 10^6 iterations per minute.

Appendix B

Supplemental: Two-Sided PAMP and Alternating Multiplexing

B.1 Experimental Methods

The microarray procedure followed that of Liu and Carson [82] with the following modification. Briefly, the 5' ends of each designed primer have the sequence of primer B (GTTTCCCAGTCACGATC) for the subsequent step of PCR labeling with a single primer B, as described previously [172, 83].

B.2 Proofs

Theorem 1 makes two assertions, which we will prove independently.

1. Let P be a design with no dimerizing pairs. Using alternating multiplexing, we detect all detectable breakpoints:

∪_{a,b∈{0,1}} S*_X(P^a × P̄^b) = S_X(P),  ∪_{a,b∈{0,1}} S*_Y(P^a × P̄^b) = S_Y(P)

2. Further, if P, P̄ are non-trivial, then alternating multiplexing yields the minimum number (4) of multiplex reactions necessary to achieve detectability.


Proof: 1 (by contradiction). Consider (x, y) ∈ S_X(P) that is not left-detected by the strategy. All other cases are symmetric. By definition of detectability, there exists a proximal ('good') amplifiable primer pair p_i, p_j, with left probe b_i, such that l_i < b_i < x. The breakpoint is not left-detected only if there exists a ('bad') primer p_k with b_i < l_k < x < b_k, which amplifies (x, y) but does not left-detect. Among all good and bad left primers for x, choose the pair (p_i, p_k) with the most proximal probes on either side of x. By definition, p_i, p_k are adjacent, and cannot be in the same multiplex reaction, a contradiction.

2: In a non-trivial forward design P, there exists a pair of primers p_i, p_k with 0 < l_k − l_i < d, and b_i ≠ b_k. As primers and probes do not overlap in sequence, we can find a point x with l_k < x < b_k. Consider an arbitrary reverse primer p̄_j and choose y so that b̄_j < y < l̄_j. Consider the breakpoint (x, y). When (p_i, p_k) are in the same multiplex tube, p_i, p̄_j gets preferentially amplified, but not left-detected. For left-detection, (p_i, p_k) must be in separate multiplex tubes. An analogous argument can be used for a non-trivial design P̄. As the set of forward and reverse primers is partitioned into at least two tubes each, a total of 4 reactions is needed to cover every pair.

Proof: (Theorem 2): Brooks’ theorem states that a graph can be colored using ∆ colors provided ∆ ≥ 3, and it is not a complete graph. Consider the primer-dimer- adjacency graph G. By definition, no forward (reverse) vertex has degree greater than ∆ ∆ ( ). We have dummy nodes at the beginning and end, which are adjacent to some primers (and edge-connected), but do not dimerize with any primer. Therefore, the graph cannot be either an odd-cycle, or a complete graph, and Brook’s theorem applies. In practice, the graphs are very sparse and we apply the Welsh-Powell algorithm to obtain good colorings in practice.  Appendix C

Supplemental: Evaluation of paired-end sequencing strategies: applications to gene fusion

C.1 Simulations

C.1.1 Calculation of Pζ and E(|Θζ|)

In order to generate the observed distribution of |Θζ|, a random position ζ was chosen in an interval of size G, which represents the tumor genome. Reads were simulated by selecting random numbers in the range [0, G − L), where L corresponds to the size of the clone (150 kb unless otherwise noted). In each iteration, N = cG/L clones are generated. If a random number is observed in the interval [ζ − L, ζ), then we compute the observed length of Θζ by ordering the set of clones overlapping ζ and selecting the rightmost clone start to define the start of the interval and the leftmost clone end to define the end of the interval. The average over all iterations for which a clone is observed to span the breakpoint is computed for each c.
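A minimal Monte Carlo version of this procedure (function and parameter names are illustrative, not the simulation code used for the figures):

```python
import random

# Drop N = cG/L uniform clone starts; for clones spanning a fixed breakpoint
# zeta, the localized region Theta_zeta runs from the rightmost spanning
# start to the leftmost spanning end (the intersection of spanning clones).

def simulate_theta(G, L, c, zeta, iterations, rng=None):
    rng = rng or random.Random(0)
    lengths = []
    n = int(c * G / L)
    for _ in range(iterations):
        starts = [rng.uniform(0, G - L) for _ in range(n)]
        spanning = [s for s in starts if zeta - L < s <= zeta]
        if spanning:
            lengths.append(min(s + L for s in spanning) - max(spanning))
    p_detect = len(lengths) / iterations          # fraction with a spanning clone
    e_theta = sum(lengths) / len(lengths) if lengths else float('nan')
    return p_detect, e_theta
```

At high clonal coverage the breakpoint is almost always spanned, and the localized region shrinks well below the clone length L.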


C.1.2 Calculation of Fusion Probabilities for Artificial Fusion Genes

In order to simulate fusion events, in each graph a fusion gene was created by randomly selecting gene lengths from the distribution of genes in the genome (derived from the "known genes" table at the UCSC Genome Browser) and randomly fusing them to create the desired fusion gene length. Random paired reads were then generated on these rearranged genomes (of length normally distributed around mean L). For each artificial fused gene, 100 such simulations were performed at each clone size; 100 artificial fusion genes were created for each read value and clone size (corresponding to 10000 simulations at each datapoint). All invalid reads were grouped into clusters. This set of invalid pairs, along with the complete list of genomic positions for all known genes as well as the two artificial genes, was fed as input to our algorithm. Note that if no paired read spanning the fusion gene was observed at a given datapoint, then 0 was returned as the fusion probability for that iteration.

C.1.3 Sensitivity and Selectivity under Random Rearrangements

As in the previous section, we began with a diploid reference genome. 100 random rearrangement events (of average size 1 Mb) were performed on the reference genome. Subsequently, paired reads were generated on the rearranged genome, with 1% of paired reads being chimeric. This rearranged genome was analyzed explicitly, by identification of each breakpoint (a, b), to determine all "true" fusion genes. For each simulated paired-read set, the complete set of fusion probabilities for all invalid clusters (including singletons) was generated. The predicted fusion genes were partitioned into "True Positives" and "False Positives" in each set. Counts corresponding to the number of "true" and "false" fusion genes above specific fusion probabilities (> 0, > 0.05, > 0.1, ..., > 0.95) were determined for each simulation. The average of such counts over 50 simulations (for a given clone size and number of paired reads) is plotted in Figure 8.
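The threshold sweep can be sketched as follows (a hypothetical helper, not the analysis code used for the figure):

```python
# For each probability cutoff, count predicted fusions above the cutoff and
# split them into true and false positives against the known "true" set.

def threshold_counts(predictions, truth, cutoffs):
    """predictions: dict gene -> fusion probability; truth: set of true
    fusion genes. Returns {cutoff: (true_positives, false_positives)}."""
    out = {}
    for t in cutoffs:
        called = {g for g, p in predictions.items() if p > t}
        out[t] = (len(called & truth), len(called - truth))
    return out
```

Averaging such counts over simulations at each cutoff yields the sensitivity/selectivity curves plotted in Figure 8.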

C.2 Supplemental Figures


Figure C.1: Distribution of MCF7 Clone Lengths. The mean for this distribution is 122 kb, and the standard deviation is 24 kb. Fusion probabilities in Table 1 are computed using this distribution and the putative fusion regions for each gene pair (see Methods).


Figure C.2: Length of a Breakpoint Region (BPR) for varying amounts of clonal coverage. The blue curve shows the expected length (Equation 4.5), while the red curve is the average observed length over 50 simulations.


Figure C.3: Clone length vs. Pζ vs. |Θζ| for varying N. A clear tradeoff can be observed. Larger clone lengths yield higher Pζ (detection probability), compared to smaller clone lengths, which have the advantage of better localization (smaller |Θζ|). Different lines originating from 0 refer to different numbers of reads. As the number of reads grows, the trade-off converges to high detection and better localization. (a) shows values in a mesh graph, while (b) shows raw values.


Figure C.4: The effect of clone length and number of paired reads on Pζ and |Θζ|. (a) Pζ increases as the number of paired reads N or clone length L increases, but is constant as a function of N/L. (b) |Θζ| decreases as the number of paired reads increases or the clone length decreases. Note that all axes are log-scaled (with the exception of Pζ in (a)).


Figure C.5: Pζ and |Θζ| for different clone lengths. (a) The probability of detecting a fusion point, Pζ, for different clone lengths and varying numbers of mapped paired reads. (b) The expected length of a breakpoint region, |Θζ|, around a fusion point (assuming that the fusion point is contained in a clone).


Figure C.6: The number of paired reads (and resulting E(|Θζ|)) needed to obtain a Pζ of 0.99 for clone lengths varying from 1 to 150 kb. The x-axis indicates clone length, L, the y-axis indicates reads, N, and the alternate y-axis shows |Θζ|. The vertical line indicates the intersection point between the two curves at ∼16 kb.


Figure C.7: Average fusion probability vs. number of mapped reads. The average fusion probability with mean and standard deviations as a function of N, the number of mapped paired reads. The x-axis represents the number of clones sequenced, N. The simulated fusion genes were 200 kb.


Figure C.8: Effect of chimeric clones. Probability of observing at least one chimeric cluster vs. the percent of chimeric clones under an equal number of paired reads indicates lower chimerism for smaller clones. (a) 1kb clones, (b) 10kb clones, (c) 40kb clones, and (d) 150kb clones.

Appendix D

Supplemental: On design of deep sequencing experiments


Figure D.1: Fraction of detected transcripts from kidney RNA-seq at different sequencing depths. The blue line corresponds to the fraction of 100 transcripts discovered in 50 bootstrapping tests (with error bars indicating variance). The observed mean corresponds nearly perfectly with the expected. The expected distribution from a different tissue (liver) also tracks, but underpredicts, the observed mean.



Figure D.2: Estimating the p.d.f. of normalized gene expression values. Note that all samples agree except at low levels of detection, where there are insufficient reads. Thus, the 10k sample estimates the p.d.f. accurately above a normalized expression value of 10^−3.


Figure D.3: Plot of log log f(ν) vs. log(ν) reveals a linear relationship. To reduce noise, log-binned values were used. Regression analysis on the points subsequent to the inflection helps correct for the sampling bias.


Figure D.4: Plot of the N50 haplotype length as a function of the read length. The haplotype length is plotted for two different levels of sequence coverage: 20x and 50x. From the plot, it is apparent that increasing the coverage doesn't greatly affect the haplotype length. The N50 haplotype length grows non-linearly with increasing read length. In the absence of mate pairs, it would require long reads (10 kbp or more) to achieve N50 haplotype lengths comparable to the Venter diploid genome.

References

[1] Aerni, S., Lipson, D., Volik, S., Collins, C., Barrett, M., Yakhini, Z., and Raphael, B., 2009: Combined analysis of copy number changes and structural rearrangements in cancer genomes. In preparation.

[2] Aho, A., Sagiv, S., Szymanski, T., and Ullman, J., 1981: Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM Journal on Computing, 10(3), 405–421.

[3] Altshuler, D., Brooks, L., Chakravarti, A., Collins, F., Daly, M., Donnelly, P., and Consortium, I. H., 2005: A haplotype map of the human genome. Nature, 437(7063), 1299–1320.

[4] Amrine-Madsen, H., Koepfli, K., Wayne, R., and Springer, M., 2003: A new phylogenetic marker, apolipoprotein B, provides compelling evidence for eutherian relationships. Molecular Phylogenetics and Evolution, 28, 225–240.

[5] Andolfatto, P., Depaulis, F., and Navarro, A., 2001: Inversion polymorphisms and nucleotide variability in Drosophila. Genet Res, 77(1), 1–8.

[6] Andreson, R., Reppo, E., Kaplinski, L., and Remm, M., 2006: GENOMEMASKER package for designing unique genomic PCR primers. BMC Bioinformatics, 7, 172.

[7] Arnason, U., Adegoke, J., Bodin, K., Born, E., Esa, Y., Gullberg, A., Nilsson, M., Short, R., Xu, X., and Janke, A., 2002: Mammalian mitogenomic relationships and the root of the eutherian tree. Proceedings of the National Academy of Sciences, 99, 8151–8156.

[8] Ashburner, M., Ball, C., Blake, J., Botstein, D., Butler, H., Cherry, J., Davis, A., Dolinski, K., Dwight, S., Eppig, J., et al., 2000: Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25(1), 25–29. doi:10.1038/75556.

[9] Bagnall, R. D., Waseem, N., Green, P. M., and Giannelli, F., 2002: Recurrent inversion breaking intron 1 of the factor VIII gene is a frequent cause of severe hemophilia A. Blood, 99(1), 168–174.


[10] Bansal, V., Halpern, A. L., Axelrod, N., and Bafna, V., 2008: An mcmc algorithm for haplotype assembly from whole-genome sequence data. Genome Res, 18(8), 1336–1346. doi:10.1101/gr.077065.108.

[11] Barlund, M., Monni, O., Weaver, J., Kauraniemi, P., Sauter, G., Heiskanen, M., Kallioniemi, O., and Kallioniemi, A., 2002: Cloning of BCAS3 (17q23) and BCAS4 (20q13) genes that undergo amplification, overexpression, and fusion in breast cancer. Genes Chromosomes Cancer, 35(4), 311–317. doi:10.1002/gcc.10121.

[12] Barrett, M., Scheffer, A., Ben-Dor, A., Sampas, N., Lipson, D., Kincaid, R., Tsang, P., Curry, B., Baird, K., Meltzer, P., et al., 2004: Comparative genomic hybridization using oligonucleotide microarrays and total genomic DNA. Proc Natl Acad Sci U S A, 101(51), 17765–17770. doi:10.1073/pnas.0407979101.

[13] Bashir, A., Liu, Y., Raphael, B., Carson, D., and Bafna, V., 2007: Optimization of primer design for the detection of variable genomic lesions in cancer. Bioinformatics, 23(21), 2807.

[14] Bashir, A., Volik, S., Collins, C., Bafna, V., and Raphael, B., 2008: Evaluation of Paired-End Sequencing Strategies for Detection of Genome Rearrangements in Cancer. PLoS Computational Biology, 4(4).

[15] Bentley, D., 2006: Whole-genome re-sequencing. Curr. Opin. Genet. Dev, 16, 545–552.

[16] Bentley, D. R., Balasubramanian, S., Swerdlow, H. P., Smith, G. P., Milton, J., Brown, C. G., Hall, K. P., Evers, D. J., Barnes, C. L., Bignell, H. R., et al., 2008: Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456(7218), 53–59. doi:10.1038/nature07517.

[17] Bignell, G. R., Santarius, T., Pole, J. C. M., Butler, A. P., Perry, J., Pleasance, E., Greenman, C., Menzies, A., Taylor, S., Edkins, S., Campbell, P., Quail, M., Plumb, B., Matthews, L., McLay, K., Edwards, P. A. W., Rogers, J., Wooster, R., Futreal, P. A., and Stratton, M. R., 2007: Architectures of somatic genomic rearrangement in human cancer amplicons at sequence-level resolution. Genome Res, 17(9), 1296–1303.

[18] Bourque, G., Leong, B., Vega, V., Chen, X., Lee, Y., Srinivasan, K., Chew, J., Ruan, Y., Wei, C., Ng, H., et al., 2008: Evolution of the mammalian transcription factor binding repertoire via transposable elements. Genome Research, 18(11), 1752.

[19] Branton, D., Deamer, D. W., Marziali, A., Bayley, H., Benner, S. A., Butler, T., Di Ventra, M., Garaj, S., Hibbs, A., Huang, X., Jovanovich, S. B., Krstic, P. S., Lindsay, S., Ling, X. S., Mastrangelo, C. H., Meller, A., Oliver, J. S., Pershin, Y. V., Ramsey, J. M., Riehn, R., Soni, G. V., Tabard-Cossa, V., Wanunu, M., Wiggin, M., and Schloss, J. A., 2008: The potential and challenges of nanopore sequencing. Nat Biotechnol, 26(10), 1146–1153. doi:10.1038/nbt.1495.

[20] Bray, N., and Pachter, L., 2003: MAVID multiple alignment server. Nucleic Acids Res, 31(13), 3525–3526.

[21] Brooks, R., 1941: On colouring the nodes of a network. Proc. Cambridge Phil. Soc., 37, 194–197.

[22] Brudno, M., Do, C. B., Cooper, G. M., Kim, M. F., Davydov, E., Green, E. D., Sidow, A., and Batzoglou, S., 2003: LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res, 13(4), 721–731. Evaluation Studies.

[23] Cantrell, M. A., et al., 2001: An ancient retrovirus-like element contains hot spots for SINE insertion. Genetics, 158, 769–777.

[24] Castellano, M., Pollock, P. M., Walters, M. K., Sparrow, L. E., Down, L. M., Gabrielli, B. G., Parsons, P. G., and Hayward, N. K., 1997: CDKN2A/p16 is inactivated in most melanoma cell lines. Cancer Res, 57(21), 4868–4875.

[25] Chaisson, M., and Pevzner, P., 2008: Short read fragment assembly of bacterial genomes. Genome Research, 18(2), 324.

[26] Chuzhoy, J., 2007: Hardness of PAMP. Personal Communication.

[27] Clarke, L., and Carbon, J., 1976: A colony bank containing synthetic ColE1 hybrid plasmids representative of the entire E. coli genome. Cell, 9(1), 91–99.

[28] Conrad, D., Andrews, T., Carter, N., Hurles, M., and Pritchard, J., 2006: A high- resolution survey of deletion polymorphism in the human genome. Nat Genet, 38(1), 75–81. doi:10.1038/ng1697.

[29] Crawford, D., Bhangale, T., Li, N., Hellenthal, G., Rieder, M., Nickerson, D., and Stephens, M., 2004: Evidence for substantial fine-scale variation in recombination rates across the human genome. Nature Genetics, 36, 700–706.

[30] Croce, C., Erikson, J., Haluska, F., Finger, L., Showe, L. C., and Tsujimoto, Y., 1986: Molecular genetics of human B- and T-cell neoplasia. In Cold Spring Harb Symp Quant Biol., volume 51, 891–8.

[31] Danaei, G., Vander Hoorn, S., Lopez, A., Murray, C., and Ezzati, M., 2005: Causes of cancer in the world: comparative risk assessment of nine behavioural and environmental risk factors. The Lancet, 366(9499), 1784–1793.

[32] Dasgupta, B., Jun, J., and Mandoiu, I., 2008: Primer Selection Methods for Detection of Genomic Inversions and Deletions via PAMP. In Proceedings of the 6th Asia-Pacific Bioinformatics Conference. Imperial College Press.

[33] Delsuc, F., Brinkmann, H., and Philippe, H., 2005: Phylogenomics and the reconstruction of the tree of life. Nature Reviews Genetics, 6, 361–75.

[34] Deutz-Terlouw, P. P., Losekoot, M., Olmer, R., Pieneman, W. C., de Vries-v d Weerd, S., Briët, E., and Bakker, E., 1995: Inversions in the factor VIII gene: improvement of carrier detection and prenatal diagnosis in Dutch haemophilia A families. J Med Genet, 32(4), 296–300.

[35] Doi, K., and Imai, H., 1997: Greedy Algorithms for Finding a Small Set of Primers Satisfying Cover and Length Resolution Conditions in PCR Experiments. Genome Inform Ser Workshop Genome Inform, 8, 43–52.

[36] Doi, K., and Imai, H., 1999: A Greedy Algorithm for Minimizing the Number of Primers in Multiple PCR Experiments. Genome Inform Ser Workshop Genome Inform, 10, 73–82.

[37] Druker, B., 2002: STI571 (Gleevec) as a paradigm for cancer therapy. Trends Mol Med, 8(4 Suppl), S14–S18.

[38] Eid, J., Fehr, A., Gray, J., Luong, K., Lyle, J., Otto, G., Peluso, P., Rank, D., Baybayan, P., Bettman, B., Bibillo, A., Bjornson, K., Chaudhuri, B., Christians, F., Cicero, R., Clark, S., Dalal, R., Dewinter, A., Dixon, J., Foquet, M., Gaertner, A., Hardenbol, P., Heiner, C., Hester, K., Holden, D., Kearns, G., Kong, X., Kuse, R., Lacroix, Y., Lin, S., Lundquist, P., Ma, C., Marks, P., Maxham, M., Murphy, D., Park, I., Pham, T., Phillips, M., Roy, J., Sebra, R., Shen, G., Sorenson, J., Tomaney, A., Travers, K., Trulson, M., Vieceli, J., Wegener, J., Wu, D., Yang, A., Zaccarin, D., Zhao, P., Zhong, F., Korlach, J., and Turner, S., 2009: Real-time DNA sequencing from single polymerase molecules. Science, 323(5910), 133–138. doi:10.1126/science.1162986.

[39] Elenitoba-Johnson, K., Crockett, D., Schumacher, J., Jenson, S., Coffin, C., Rockwood, A., and Lim, M., 2006: Proteomic identification of oncogenic chromosomal translocation partners encoding chimeric anaplastic lymphoma kinase fusion proteins. Proc Natl Acad Sci U S A, 103(19), 7402–7407. doi:10.1073/pnas.0506514103.

[40] Fan, J., Chee, M., Gunderson, K., et al., 2006: Highly parallel genomic assays. Nature Reviews Genetics, 7(8), 632.

[41] Felsenstein, J., 2004: PHYLIP phylogeny inference package version 3.61. http://evolution.genetics.washington.edu/phylip.html.

[42] Feuk, L., Carson, A., and Scherer, S., 2006: Structural variation in the human genome. Nat Rev Genet, 7(2), 85–97.

[43] Feuk, L., Macdonald, J., Tang, T., Carson, A., Li, M., Rao, G., Khaja, R., and Scherer, S., 2005: Discovery of human inversion polymorphisms by comparative analysis of human and chimpanzee DNA sequence assemblies. PLoS Genet, 1(4). doi:10.1371/journal.pgen.0010056.

[44] Frazer, K., Ballinger, D., Cox, D., Hinds, D., Stuve, L., Gibbs, R., Belmont, J., Boudreau, A., Hardenbol, P., Leal, S., et al., 2007: A second generation human haplotype map of over 3.1 million SNPs. Nature, 449(7164), 851.

[45] Gaedigk, R., Karges, W., Hui, M. F., Scherer, S. W., and Dosch, H. M., 1996: Genomic organization and transcript analysis of ICAp69, a target antigen in diabetic autoimmunity. Genomics, 38(3), 382–391.

[46] Garey, M. R., and Johnson, D. S., 1979: Computers and Intractability: A Guide to the Theory of NP-completeness. W.H. Freeman and Company.

[47] Giglio, S., Broman, K., Matsumoto, N., Calvari, V., Gimelli, G., Neumann, T., Ohashi, H., Voullaire, L., Larizza, D., Giorda, R., Weber, J., Ledbetter, D., and Zuffardi, O., 2001: Olfactory receptor-gene clusters, genomic-inversion polymorphisms, and common chromosome rearrangements. Am J Hum Genet, 68(4), 874–883.

[48] Gil, J., and Peters, G., 2006: Regulation of the INK4b-ARF-INK4a tumour suppressor locus: all for one or one for all. Nat Rev Mol Cell Biol, 7(9), 667–677. doi:10.1038/nrm1987.

[49] Graur, D., Gouy, M., and Duret, L., 1997: Evolutionary Affinities of the Order Perissodactyla and the Phylogenetic Status of the Superordinal Taxa Ungulata and Altungulata. Molecular Phylogenetics and Evolution, 7, 195–200.

[50] Gusfield, D., 1997: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press.

[51] Hahn, S. A., Schutte, M., Hoque, A. T., Moskaluk, C. A., da Costa, L. T., Rozenblum, E., Weinstein, C. L., Fischer, A., Yeo, C. J., Hruban, R. H., and Kern, S. E., 1996: DPC4, a candidate tumor suppressor gene at human chromosome 18q21.1. Science, 271(5247), 350–353.

[52] Hannenhalli, S., and Pevzner, P. A., 1995: Transforming men into mice (polynomial algorithm for genomic distance problem). In 36th Annual Symposium on Foundations of Computer Science (FOCS’95), 581–592. IEEE Computer Society Press, Los Alamitos. ISBN 0-8186-7183-1.

[53] Hedrick, P. W., 1987: Gametic disequilibrium measures: proceed with caution. Genetics, 117(2), 331–341.

[54] Henzinger, M., King, V., and Warnow, T.: A fast algorithm for constructing rooted trees from constraints. Unpublished manuscript.

[55] Hillis, D., 1999: SINEs of the perfect character. Proceedings of the National Academy of Sciences, 96, 9979–81.

[56] Hinds, D., Kloek, A., Jen, M., Chen, X., and Frazer, K., 2006: Common deletions and SNPs are in linkage disequilibrium in the human genome. Nat Genet, 38(1), 82–85. doi:10.1038/ng1695.

[57] Hudson, R. R., 1990: Gene genealogies and the coalescent process. In Oxford surveys in evolutionary biology, editors D. Futuyma, and J. Antonovics, volume 7, 1–44. Oxford University Press.

[58] Iafrate, A. J., Feuk, L., Rivera, M. N., Listewnik, M. L., Donahoe, P. K., Qi, Y., Scherer, S. W., and Lee, C., 2004: Detection of large-scale variation in the human genome. Nat Genet, 36(9), 949–951. doi:10.1038/ng1416.

[59] Applied Biosystems Inc., 2007: Overview of the SOLiD system. URL: http://marketing.appliedbiosystems.com/images/Product/Solid Knowledge/ SOLiD Chemistry Presentation 1019.pdf.

[60] Applied Biosystems Inc., 2009: Overview of the SOLiD system. URL: http://www3.appliedbiosystems.com/AB Home/applicationstechnologies/SOLiDSystemSequencing/overviewofsolidsystem/index.htm.

[61] International Human Genome Sequencing Consortium, 2001: Initial sequencing and analysis of the human genome. Nature, 409, 860–921.

[62] Istrail, S., Sutton, G. G., Florea, L., Halpern, A. L., Mobarry, C. M., Lippert, R., Walenz, B., Shatkay, H., Dew, I., Miller, J. R., Flanigan, M. J., Edwards, N. J., Bolanos, R., Fasulo, D., Halldorsson, B. V., Hannenhalli, S., Turner, R., Yooseph, S., Lu, F., Nusskern, D. R., Shue, B. C., Zheng, X. H., Zhong, F., Delcher, A. L., Huson, D. H., Kravitz, S. A., Mouchard, L., Reinert, K., Remington, K. A., Clark, A. G., Waterman, M. S., Eichler, E. E., Adams, M. D., Hunkapiller, M. W., Myers, E. W., and Venter, J. C., 2004: Whole-genome shotgun assembly and comparison of human genome assemblies. Proc Natl Acad Sci USA, 101(7), 1916–1921. doi:10.1073/pnas.0307971100.

[63] Jansen, G., Hazendonk, E., Thijssen, K., and Plasterk, R., 1997: Reverse genetics by chemical mutagenesis in Caenorhabditis elegans. Nat Genet, 17, 119–21.

[64] Korbel, J. O., Urban, A. E., Affourtit, J. P., Godwin, B., Grubert, F., Simons, J. F., Kim, P. M., Palejev, D., Carriero, N. J., Du, L., Taillon, B. E., Chen, Z., Tanzer, A., Saunders, A. C., Chi, J., Yang, F., Carter, N. P., Hurles, M. E., Weissman, S. M., Harkins, T. T., Gerstein, M. B., Egholm, M., and Snyder, M., 2007: Paired-End Mapping Reveals Extensive Structural Variation in the Human Genome. Science.

[65] Jurka, J., 1998: Repeats in genomic DNA: mining and meaning. Curr. Opin. Struct. Biol., 8, 333–337.

[66] Jurka, J., 2000: Repbase update: a database and an electronic journal of repetitive elements. Trends Genet, 16(9), 418–420.

[67] Kannan, S., Warnow, T., and Yooseph, S., 1998: Computing the local consensus of trees. SIAM Journal of Computing, 27(6), 1695–1724.

[68] Karolchik, D., Baertsch, R., Diekhans, M., Furey, T. S., Hinrichs, A., Lu, Y. T., Roskin, K. M., Schwartz, M., Sugnet, C. W., Thomas, D. J., Weber, R. J., Haussler, D., and Kent, W. J., 2003: The UCSC Genome Browser Database. Nucleic Acids Res, 31(1), 51–54.

[69] Kent, W. J., Sugnet, C. W., Furey, T. S., Roskin, K. M., Pringle, T. H., Zahler, A. M., and Haussler, D., 2002: The human genome browser at UCSC. Genome Res, 12(6), 996–1006.

[70] Kim, N., Kim, P., Nam, S., Shin, S., and Lee, S., 2006: ChimerDB–a knowledgebase for fusion sequences. Nucleic Acids Res, 34(Database issue), D21–D24. doi:10.1093/nar/gkj019.

[71] Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P., 1983: Optimization by Simulated Annealing. Science, 220(4598), 671–680.

[72] Kitagawa, Y., Inoue, K., Sasaki, S., Hayashi, Y., Matsuo, Y., Lieber, M., Mizoguchi, H., Yokota, J., and Kohno, T., 2002: Prevalent Involvement of Illegitimate V(D)J Recombination in Chromosome 9p21 Deletions in Lymphoid Leukemia. Journal of Biological Chemistry, 277(48), 46289–46297.

[73] Kitazoe, Y., Kishino, H., Okabayashi, T., Watabe, T., Nakajima, N., Okuhara, Y., and Kurihara, Y., 2004: Multidimensional Vector Space Representation for Convergent Evolution and Molecular Phylogeny. MBE. Advanced Access.

[74] Kong, A., Gudbjartsson, D., Sainz, J., Jonsdottir, G., Gudjonsson, S., Richardsson, B., Sigurdardottir, S., Barnard, J., Hallbeck, B., Masson, G., Shlien, A., Palsson, S., Frigge, M., Thorgeirsson, T., Gulcher, J., and Stefansson, K., 2002: A high-resolution recombination map of the human genome. Nat Genet, 31(3), 241–247. doi:10.1038/ng917.

[75] Kurzrock, R., and Talpaz, M., 1991: The molecular pathology of chronic myelogenous leukaemia. British journal of haematology, 79, 34–7.

[76] Kämpke, T., Kieninger, M., and Mecklenburg, M., 2001: Efficient primer design algorithms. Bioinformatics, 17(3), 214–225.

[77] Lakich, D., Kazazian, H. H., Antonarakis, S. E., and Gitschier, J., 1993: Inversions disrupting the factor VIII gene are a common cause of severe haemophilia A. Nat Genet, 5(3), 236–241. doi:10.1038/ng1193-236.

[78] Lander, E. S., and Waterman, M. S., 1988: Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics, 2(3), 231–239.

[79] Levy, S., Sutton, G., Ng, P. C., Feuk, L., Halpern, A. L., Walenz, B. P., Axelrod, N., Huang, J., Kirkness, E. F., Denisov, G., Lin, Y., MacDonald, J. R., Pang, A. W., Shago, M., Stockwell, T. B., Tsiamouri, A., Bafna, V., Bansal, V., Kravitz, S. A., Busam, D. A., Beeson, K. Y., McIntosh, T. C., Remington, K. A., Abril, J. F., Gill, J., Borman, J., Rogers, Y. H., Frazier, M. E., Scherer, S. W., Strausberg, R. L., and Venter, J. C., 2007: The diploid genome sequence of an individual human. PLoS Biol, 5(10). doi:10.1371/journal.pbio.0050254.

[80] Lewontin, R., 1964: The interaction of selection and linkage. II. Optimum models. Genetics, 50, 757–782.

[81] Lipson, D., 2002: Optimization Problems in Design of Oligonucleotides for Hybridization based Methods. Master’s thesis, Technion - Israel Institute of Technology.

[82] Liu, Y., and Carson, D., 2007: A novel approach for determining cancer genomic breakpoints in the presence of normal DNA. PLoS ONE, 2(4), e380.

[83] Lu, Q., Nunez, E., Lin, C., Christensen, K., Downs, T., Carson, D., Wang-Rodriguez, J., and Liu, Y., 2008: A sensitive array-based assay for identifying multiple TMPRSS2:ERG fusion gene variants. Nucleic Acids Research.

[84] Lupski, J. R., 1998: Genomic disorders: structural features of the genome can lead to DNA rearrangements and human disease traits. Trends Genet, 14(10), 417–422.

[85] Maher, C. A., Kumar-Sinha, C., Cao, X., Kalyana-Sundaram, S., Han, B., Jing, X., Sam, L., Barrette, T., Palanisamy, N., and Chinnaiyan, A. M., 2009: Transcriptome sequencing to detect gene fusions in cancer. Nature.

[86] Manning, G., Whyte, D. B., Martinez, R., Hunter, T., and Sudarsanam, S., 2002: The protein kinase complement of the human genome. Science, 298(5600), 1912–1934. doi:10.1126/science.1075762.

[87] Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M., and Gilad, Y., 2008: RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res., 18, 1509–1517.

[88] May, W., Gishizky, M., Lessnick, S., Lunsford, L., Lewis, B., Delattre, O., Zucman, J., Thomas, G., and Denny, C., 1993: Ewing Sarcoma 11;22 Translocation Produces a Chimeric Transcription Factor that Requires the DNA-Binding Domain Encoded by FLI1 for Transformation. Proc Natl Acad Sci U S A, 90(12), 5752–5756.

[89] McCarroll, S., Hadnott, T., Perry, G., Sabeti, P., Zody, M., Barrett, J., Dallaire, S., Gabriel, S., Lee, C., Daly, M., Altshuler, D., and Consortium, T. I. H., 2006: Common deletion polymorphisms in the human genome. Nat Genet, 38(1), 86–92.

[90] McKernan, K., Blanchard, A., Kotler, L., and Costa, G., 2006: Reagents, methods, and libraries for bead-based sequencing. U.S. Patent 084132.

[91] McVean, G., Myers, S., Hunt, S., Deloukas, P., Bentley, D., and Donnelly, P., 2004: The fine-scale structure of recombination rate variation in the human genome. Science, 304, 581–584.

[92] Meyer, C., Burmeister, T., Strehl, S., Schneider, B., Hubert, D., Zach, O., Haas, O., Klingebiel, T., Dingermann, T., and Marschalek, R., 2007: Spliced MLL fusions: a novel mechanism to generate functional chimeric MLL-MLLT1 transcripts in t(11;19)(q23;p13.3) leukemia. Leukemia, 21(3), 588–590. doi:10.1038/sj.leu.2404542.

[93] Misawa, K., and Janke, A., 2003: Revisiting the Glires concept–phylogenetic analysis of nuclear sequences. Molecular Phylogenetics and Evolution, 28, 320–327.

[94] Misawa, K., and Nei, M., 2003: Reanalysis of Murphy et al.’s data gives various mammalian phylogenies and suggests overcredibility of Bayesian trees. Journal of Molecular Evolution, 57 Suppl 1, S290–6.

[95] Mitelman, F., Johansson, B., and Mertens, F., 2004: Fusion genes and rearranged genes as a linear function of chromosome aberrations in cancer. Nature Genetics, 36, 331–334.

[96] Mitelman, F., Johansson, B., and Mertens, F., 2007: The impact of translocations and gene fusions on cancer causation. Nat Rev Cancer, 7, 233–45.

[97] Morris, S., Kirstein, M., Valentine, M., Dittmer, K., Shapiro, D., Saltman, D., and Look, A., 1994: Fusion of a kinase gene, ALK, to a nucleolar protein gene, NPM, in non-Hodgkin’s lymphoma. Science, 263(5151), 1281–1284.

[98] Murphy, W., Eizirik, E., O’Brien, S., Madsen, O., Scally, M., Douady, C., Teeling, E., Ryder, O., Stanhope, M., de Jong, W., and Springer, M., 2001: Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science, 294, 2348–51.

[99] Murphy, W., Pringle, T., Crider, T., Springer, M., and Miller, W., 2007: Using genomic data to unravel the root of the placental mammal phylogeny. Genome Research, 17(4), 413.

[100] Myers, S., Bottolo, L., Freeman, C., McVean, G., and Donnelly, P., 2005: A fine-scale map of recombination rates and hotspots across the human genome. Science, 310(5746), 321–324. doi:10.1126/science.1117196.

[101] Navarro, A., Barbadilla, A., and Ruiz, A., 2000: Effect of inversion polymorphism on the neutral nucleotide variability of linked chromosomal regions in Drosophila. Genetics, 155(2), 685–698.

[102] Navarro, A., Barbadilla, A., and Ruiz, A., 2000: Effect of inversion polymorphism on the neutral nucleotide variability of linked chromosomal regions in Drosophila. Genetics, 155(2), 685–698.

[103] Navarro, A., Betrán, E., Barbadilla, A., and Ruiz, A., 1997: Recombination and gene flux caused by gene conversion and crossing over in inversion heterokaryotypes. Genetics, 146(2), 695–709.

[104] Navarro, A., and Gazave, E., 2005: Inversions with classical style and trendy lines. Nat Genet, 37(2), 115–116. doi:10.1038/ng0205-115.

[105] Newman, T. L., Tuzun, E., Morrison, V. A., Hayden, K. E., Ventura, M., McGrath, S. D., Rocchi, M., and Eichler, E. E., 2005: A genome-wide survey of structural variation between human and chimpanzee. Genome Res, 15(10), 1344–1356. doi:10.1101/gr.4338005.

[106] Ng, P., Tan, J., Ooi, H., Lee, Y., Chiu, K., Fullwood, M., Srinivasan, K., Perbost, C., Du, L., Sung, W., et al., 2006: Multiplex sequencing of paired-end ditags (MS-PET): a strategy for the ultra-high-throughput analysis of transcriptomes and genomes. Nucleic Acids Research, 34(12), e84.

[107] Nicodème, P., and Steyaert, J. M., 1997: Selecting optimal oligonucleotide primers for multiplex PCR. Proc Int Conf Intell Syst Mol Biol, 5, 210–213.

[108] Nikaido, M., Rooney, A., and Okada, N., 1999: Phylogenetic relationships among cetartiodactyls based on insertions of short and long interspersed elements: Hippopotamuses are the closest extant relatives of whales. Proceedings of the National Academy of Sciences, 96, 10261–66.

[109] Nobori, T., Miura, K., Wu, D. J., Lois, A., Takabayashi, K., and Carson, D. A., 1994: Deletions of the cyclin-dependent kinase-4 inhibitor gene in multiple human cancers. Nature, 368(6473), 753–756. doi:10.1038/368753a0.
[110] Osada, N., and Wu, C., 2005: Inferring the Mode of Speciation From Genomic Data: A Study of the Great Apes. Genetics, 259–264.

[111] Osborne, L., Li, M., Pober, B., Chitayat, D., Bodurtha, J., Mandel, A., Costa, T., Grebe, T., Cox, S., Tsui, L., and Scherer, S., 2001: A 1.5 million-base pair inversion polymorphism in families with Williams-Beuren syndrome. Nat Genet, 29(3), 321–325. doi:10.1038/ng753.

[112] Page, R. D. M., 1996: TREEVIEW: An application to display phylogenetic trees on personal computers. Computer Applications in the Biosciences, 12, 357–358.

[113] Paris, P., Sridharan, S., Scheffer, A., Tsalenko, A., Bruhn, L., and Collins, C., 2007: High resolution oligonucleotide CGH using DNA from archived prostate tissue. Prostate, 67(13), 1447–1455. doi:10.1002/pros.20632.

[114] Pe’er, I., Pupko, T., Shamir, R., and Sharan, R., 2004: Incomplete Directed Perfect Phylogeny. SIAM Journal of Computing, 33(3), 590–607.

[115] Perry, A., Nobori, T., Ru, N., Anderl, K., Borell, T. J., Mohapatra, G., Feuerstein, B. G., Jenkins, R. B., and Carson, D. A., 1997: Detection of p16 gene deletions in gliomas: a comparison of fluorescence in situ hybridization (FISH) versus quantitative PCR. J Neuropathol Exp Neurol, 56(9), 999–1008.

[116] Pevzner, P., 2000: Computational Molecular Biology: An Algorithmic Approach. MIT press.

[117] Pevzner, P. A., and Tang, H., 2001: Fragment assembly with double-barreled data. Bioinformatics, 17 Suppl 1, S225–S233.

[118] Pevzner, P. A., Tang, H., and Waterman, M. S., 2001: An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci U S A, 98(17), 9748–9753. doi:10.1073/pnas.171285098.

[119] Pihlak, A., Baurén, G., Hersoug, E., Lönnerberg, P., Metsis, A., and Linnarsson, S., 2008: Rapid genome sequencing with short universal tiling probes. Nat Biotechnol, 26(6), 676–684. doi:10.1038/nbt1405.

[120] Pinkel, D., and Albertson, D., 2005: Array comparative genomic hybridization and its applications in cancer. Nat Genet, 37 Suppl, S11–S17. doi:10.1038/ng1569.

[121] Price, A., Eskin, E., and Pevzner, P., 2004: Whole genome analysis of Alu repeat elements reveals complex evolutionary history. Genome Research, 2245–2252.

[122] Price, A. L., Jones, N. C., and Pevzner, P. A., 2005: De novo identification of repeat families in large genomes. Bioinformatics, 21 Suppl 1, i351–i358. doi:10.1093/bioinformatics/bti1018.

[123] Purvis, A., 1995: A composite estimate of primate phylogeny. Philos Trans R Soc Lond B Biol Sci, 348(1326), 405–421.

[124] Rachlin, J., Ding, C., Cantor, C., and Kasif, S., 2005: Computational tradeoffs in multiplex PCR assay design for SNP genotyping. BMC Genomics, 6, 102. doi:10.1186/1471-2164-6-102.

[125] Raphael, B., 2006: Figure integrating arrayCGH and ESP. Personal Communi- cation. 206

[126] Raphael, B., and Pevzner, P., 2004: Reconstructing tumor amplisomes. Bioinformatics, 20 Suppl 1, I265–I273. doi:10.1093/bioinformatics/bth931.

[127] Raphael, B., Volik, S., Collins, C., and Pevzner, P., 2003: Reconstructing tumor genome architectures. Bioinformatics, 19 Suppl 2, II162–II171.

[128] Raphael, B., Volik, S., and Collins, C., In Press: Analysis of genomic alterations in cancer. In Genome Sequencing Technology and Algorithms, editors H. Tang, S. Kim, and E. Mardis. Artech House.

[129] Raphael, B., Volik, S., Yu, P., Wu, C., Huang, G., Waldman, F., Costello, J., Pienta, K., Mills, G., Bajsarowicz, K., Kobayashi, Y., Sridharan, S., Paris, P., Tao, Q., Aerni, S., Brown, R., Bashir, A., Gray, J., Cheng, J.-F., de Jong, P., Nefedov, M., Padilla-Nash, H., and Collins, C., 2008: A sequence-based survey of the complex structural organization of tumor genomes. Genome Biology, 9(3).

[130] Raschke, S., Balz, V., Efferth, T., Schulz, W. A., and Florl, A. R., 2005: Homozy- gous deletions of CDKN2A caused by alternative mechanisms in various human cancer cell lines. Genes Chromosomes Cancer, 42(1), 58–67.

[131] Reyes, A., Gissi, C., Catzeflis, F., Nevo, E., Pesole, G., and Saccone, C., 2004: Congruent Mammalian Trees from Mitochondrial and Nuclear Genes Using Bayesian Methods. MBE, 21.

[132] Rocco, J., and Sidransky, D., 2001: p16 (MTS-1/CDKN2/INK4a) in Cancer Progression. Experimental Cell Research, 264(1), 42–55.

[133] Roschke, A., Stover, K., Tonon, G., Schäffer, A., and Kirsch, I., 2002: Stable Karyotypes in Epithelial Cancer Cell Lines Despite High Rates of Ongoing Structural and Numerical Chromosomal Instability. Neoplasia (New York, NY), 4(1), 19.

[134] Rozen, S., and Skaletsky, H., 2000: Primer3 on the WWW for general users and for biologist programmers. Methods Mol Biol, 132, 365–386.

[135] Ruan, Y., Ooi, H., Choo, S., Chiu, K., Zhao, X., Srinivasan, K., Yao, F., Choo, C., Liu, J., Ariyaratne, P., et al., 2007: Fusion transcripts and transcribed retrotransposed loci discovered through comprehensive transcriptome analysis using Paired-End diTags (PETs). Genome Res, 17(6), 828–838. doi:10.1101/gr.6018607.

[136] Salem, A., Ray, D., Xing, J., Callinan, P., Myers, J., Hedges, D., Garber, R., Witherspoon, D., Jorde, L., and Batzer, M., 2003: Alu elements and hominid phylogenetics. Proceedings of the National Academy of Sciences, 100, 12787–91.

[137] Sampath, J., Long, P. R., Shepard, R. L., Xia, X., Devanarayan, V., Sandusky, G. E., Perry, W. L., Dantzig, A. H., Williamson, M., Rolfe, M., and Moore, R. E., 2003: Human spf45, a splicing factor, has limited expression in normal tissues, is overexpressed in many tumors, and can confer a multidrug-resistant phenotype to cells. Am J Pathol, 163(5), 1781–1790.

[138] Sasaki, S., Kitagawa, Y., Sekido, Y., Minna, J., Kuwano, H., Yokota, J., and Kohno, T., 2003: Molecular processes of chromosome 9p21 deletions in human cancers. Oncogene, 22, 3792–3798.

[139] Schaffner, S. F., Foo, C., Gabriel, S., Reich, D., Daly, M. J., and Altshuler, D., 2005: Calibrating a coalescent simulation of human genome sequence variation. Genome Res, 15(11), 1576–1583. doi:10.1101/gr.3709305.

[140] Schmidt, H. A., Strimmer, K., Vingron, M., and von Haeseler, A., 2002: TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics, 18, 502–504.

[141] Schwartz, S., Elnitski, L., Li, M., Weirauch, M., Riemer, C., and Smit, A.: NISC Comparative Sequencing Program. http://www.nisc.nih.gov/.

[142] Schwartz, S., Elnitski, L., Li, M., Weirauch, M., Riemer, C., Smit, A., Program, N. C. S., Green, E., Hardison, R., and Miller, W., 2003: MultiPipMaker and supporting tools: alignments and analysis of multiple genomic DNA sequences. Nucleic Acids Research, 31(13), 3518–3524.

[143] Schwartz, S., Zhang, Z., Frazer, K., Smit, A., Riemer, C., Bouck, J., Gibbs, R., Hardison, R., and Miller, W., 2003: PipMaker: A Web Server for Aligning Two Genomic DNA Sequences. Genome Research, 10(4), 577–86.

[144] Sebat, J., Lakshmi, B., Troge, J., Alexander, J., Young, J., Lundin, P., Månér, S., Massa, H., Walker, M., Chi, M., Navin, N., Lucito, R., Healy, J., Hicks, J., Ye, K., Reiner, A., Gilliam, T. C., Trask, B., Patterson, N., Zetterberg, A., and Wigler, M., 2004: Large-scale copy number polymorphism in the human genome. Science, 305(5683), 525–528. doi:10.1126/science.1098918.

[145] Sharp, A. J., Locke, D. P., McGrath, S. D., Cheng, Z., Bailey, J. A., Vallente, R. U., Pertz, L. M., Clark, R. A., Schwartz, S., Segraves, R., Oseroff, V. V., Albertson, D. G., Pinkel, D., and Eichler, E. E., 2005: Segmental duplications and copy-number variation in the human genome. Am J Hum Genet, 77(1), 78–88. doi:10.1086/431652.

[146] Shaw, C., and Lupski, J., 2004: Implications of human genome architecture for rearrangement-based disorders: the genomic basis of disease. Hum Mol Genet, 13 Spec No 1, 57–64. doi:10.1093/hmg/ddh073.

[147] Shedlock, A., and Okada, N., 2000: SINE insertions: powerful tools for molecular systematics. BioEssays, 22, 148–160.

[148] Shedlock, A. M., Milinkovitch, M. C., and Okada, N., 2000: SINE evolution, missing data, and the origin of whales. Systematic Biology, 49(4), 808–817.

[149] Shimamura, M., Yasue, H., Ohshima, K., Abe, H., Kato, H., Kishiro, T., Goto, M., Munechika, I., and Okada, N., 1997: Molecular evidence from retroposons that whales form a clade within even-toed ungulates. Nature, 388, 666–670.

[150] Sindi, S., Helman, E., Bashir, A., and Raphael, B. J., 2009: A geometric approach for classification and comparison of structural variants. In preparation.

[151] Smit, A., and Green, P., 2004: RepeatMasker. http://www.repeatmasker.org/.

[152] Soda, M., Choi, Y., Enomoto, M., Takada, S., Yamashita, Y., Ishikawa, S., Fujiwara, S., Watanabe, H., Kurashina, K., Hatanaka, H., et al., 2007: Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer. Nature. doi:10.1038/nature05945.

[153] Soda, M., Choi, Y. L., Enomoto, M., Takada, S., Yamashita, Y., Ishikawa, S., Fujiwara, S.-I., Watanabe, H., Kurashina, K., Hatanaka, H., Bando, M., Ohno, S., Ishikawa, Y., Aburatani, H., Niki, T., Sohara, Y., Sugiyama, Y., and Mano, H., 2007: Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer. Nature. doi:10.1038/nature05945.

[154] Stefansson, H., Helgason, A., Thorleifsson, G., Steinthorsdottir, V., Masson, G., Barnard, J., Baker, A., Jonasdottir, A., Ingason, A., Gudnadottir, V., Desnica, N., Hicks, A., Gylfason, A., Gudbjartsson, D., Jonsdottir, G., Sainz, J., Agnarsson, K., Birgisdottir, B., Ghosh, S., Olafsdottir, A., Cazier, J., Kristjansson, K., Frigge, M., Thorgeirsson, T., Gulcher, J., Kong, A., and Stefansson, K., 2005: A common inversion under selection in Europeans. Nat Genet, 37(2), 129–137. doi:10.1038/ng1508.
[155] Stephens, M., and Scheet, P., 2005: Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation. Am J Hum Genet, 76(3), 449–462. doi:10.1086/428594.

[156] Sugawara, H., Harada, N., Ida, T., Ishida, T., Ledbetter, D., Yoshiura, K., Ohta, T., Kishino, T., Niikawa, N., and Matsumoto, N., 2003: Complex low-copy repeats associated with a common polymorphic inversion at human chromosome 8p23. Genomics, 82(2), 238–244.

[157] Szamalek, J. M., Cooper, D. N., Schempp, W., Minich, P., Kohn, M., Hoegel, J., Goidts, V., Hameister, H., and Kehrer-Sawatzki, H., 2006: Polymorphic microinversions contribute to the genomic variability of humans and chimpanzees. Hum Genet, 119(1-2), 103–112. doi:10.1007/s00439-005-0117-6.

[158] Tagawa, H., Miura, I., Suzuki, R., Suzuki, H., Hosokawa, Y., and Seto, M., 2002: Molecular cytogenetic analysis of the breakpoint region at 6q21-22 in T-cell lymphoma/leukemia cell lines. Genes Chromosomes Cancer, 34(2), 175–185.

[159] Takahashi, K., Nishida, M., Yuma, M., and Okada, N., 2001: Retroposition of the AFC family of SINEs before and during the adaptive radiation of cichlid fishes in Lake Malawi and related inferences about phylogeny. Journal of Molecular Evolution, 53, 496–507.

[160] Takahashi, K., Terai, Y., Nishida, M., and Okada, N., 2001: Phylogenetic relationships and ancient incomplete lineage sorting among cichlid fishes in Lake Tanganyika as revealed by analysis of the insertion of retroposons. Molecular Biology and Evolution, 18, 2057–2066.

[161] Terai, Y., Takahashi, K., Nishida, M., Sato, T., and Okada, N., 2003: Using SINEs to probe ancient explosive speciation: "hidden" radiation of African cichlids? Molecular Biology and Evolution, 20, 924–930.

[162] The International HapMap Consortium, 2003: The International HapMap Project. Nature, 426(6968), 789–796. doi:10.1038/nature02168.

[163] Thomas, J., et al., 2003: Comparative analyses of multi-species sequences from targeted genomic regions. Nature, 424, 788–793.

[164] Tomlins, S., Rhodes, D., Perner, S., Dhanasekaran, S., Mehra, R., Sun, X., Varambally, S., Cao, X., Tchinda, J., Kuefer, R., et al., 2005: Recurrent Fusion of TMPRSS2 and ETS Transcription Factor Genes in Prostate Cancer. Science, 310(5748), 644–648. doi:10.1126/science.1117679.

[165] Turner, D. J., Shendure, J., Porreca, G., Church, G., Green, P., Tyler-Smith, C., and Hurles, M. E., 2006: Assaying chromosomal inversions by single-molecule haplotyping. Nat Methods, 3(6), 439–445. doi:10.1038/nmeth881.

[166] Tuzun, E., Sharp, A., Bailey, J., Kaul, R., Morrison, V., Pertz, L., Haugen, E., Hayden, H., Albertson, D., Pinkel, D., Olson, M., and Eichler, E., 2005: Fine-scale structural variation of the human genome. Nat Genet, 37(7), 727–732. doi:10.1038/ng1562.

[167] Vallone, P. M., and Butler, J. M., 2004: AutoDimer: a screening tool for primer-dimer and hairpin structures. Biotechniques, 37(2), 226–231.

[168] Venter, J. C., et al., 2001: The sequence of the human genome. Science, 291, 1304–1351.

[169] Volik, S., Raphael, B., Huang, G., Stratton, M., Bignel, G., Murnane, J., Brebner, J., Bajsarowicz, K., Paris, P., Tao, Q., et al., 2006: Decoding the fine-scale structure of a breast cancer genome and transcriptome. Genome Res, 16(3), 394–404. doi:10.1101/gr.4247306.

[170] Volik, S., Zhao, S., Chin, K., Brebner, J., Herndon, D., Tao, Q., Kowbel, D., Huang, G., Lapuk, A., Kuo, W., et al., 2003: End-sequence profiling: Sequence-based analysis of aberrant genomes. Proc Natl Acad Sci U S A, 100(13), 7696–7701. doi:10.1073/pnas.1232418100.

[171] Wall, J., and Pritchard, J., 2003: Haplotype blocks and linkage disequilibrium in the human genome. Nat Rev Genet, 4(8), 587–597. doi:10.1038/nrg1123.

[172] Wang, D., Urisman, A., Liu, Y., Springer, M., Ksiazek, T., Erdman, D., Mardis, E., Hickenbotham, M., Magrini, V., Eldred, J., et al., 2003: Viral discovery and sequence recovery using DNA microarrays. PLoS Biol, 1(2), E2.

[173] Wang, H.-Y., Luo, M., Tereshchenko, I. V., Frikker, D. M., Cui, X., Li, J. Y., Hu, G., Chu, Y., Azaro, M. A., Lin, Y., Shen, L., Yang, Q., Kambouris, M. E., Gao, R., Shih, W., and Li, H., 2005: A genotyping system capable of simultaneously analyzing > 1000 single nucleotide polymorphisms in a haploid genome. Genome Res, 15(2), 276–283. doi:10.1101/gr.2885205.

[174] Wang, J., Cai, Y., Ren, C., and Ittmann, M., 2006: Expression of variant TM- PRSS2/ERG fusion messenger RNAs is associated with aggressive prostate can- cer. Cancer Res., 66, 8347–8351.

[175] Welsh, D., and Powell, M., 1967: An upper bound for the chromatic number of a graph and its application to timetabling problems. The Computer Journal, 10(1), 85–86.

[176] Wheeler, D. A., Srinivasan, M., Egholm, M., Shen, Y., Chen, L., McGuire, A., He, W., Chen, Y. J., Makhijani, V., Roth, G. T., Gomes, X., Tartaro, K., Niazi, F., Turcotte, C. L., Irzyk, G. P., Lupski, J. R., Chinault, C., Song, X. Z., Liu, Y., Yuan, Y., Nazareth, L., Qin, X., Muzny, D. M., Margulies, M., Weinstock, G. M., Gibbs, R. A., and Rothberg, J. M., 2008: The complete genome of an individual by massively parallel DNA sequencing. Nature, 452(7189), 872–876. doi:10.1038/nature06884.

[177] Winnard, P., Glackin, C., and Raman, V., 2006: Stable integration of an empty vector in MCF-7 cells greatly alters the karyotype. Cancer Genetics and Cytogenetics, 164(2), 174–176.

[178] Yue, Y., Stout, K., Grossmann, B., Zechner, U., Brinckmann, A., White, C., Pilz, D. T., and Haaf, T., 2006: Disruption of TCBA1 associated with a de novo t(1;6)(q32.2;q22.3) presenting in a child with developmental delay and recurrent infections. J Med Genet, 43(2), 143–147.