Genome Wide Variation Analysis Using High Throughput Data Yu Qian

PhD Dissertation

Bioinformatics Research Centre Aarhus University Denmark

Genome Wide Variation Analysis Using High Throughput Data

A Dissertation Presented to the Faculty of Science and Technology of Aarhus University in Partial Fulfillment of the Requirements for the PhD Degree

by Yu Qian January 26, 2015

Abstract

This PhD dissertation presents my work in analyzing genome wide variation data, following a chronological order. Whole-exome sequencing allows efficient targeting of the -coding region of the genome at low cost. We analyze such a dataset of 12 central chimpanzees and characterize their polymorphism patterns. By contrasting the genome wide variation data of , we report extensive adaptive evo- lution targeting the X of chimpanzees as well as much stronger purifying selection than observed in . Genome wide association study (GWAS) is a technique using large amount of genomic data to identify patterns of genomic variations that vary system- atically between individuals with different disease states. Single nucleotide polymorphism (SNP) is one of the most common genomic variants, of which the standard single marker test is often underpowered. Here we present two projects where alternative methods are applied to SNP data with attempt to increase GWAS power. For the first project, we incorporate prior biologi- cal knowledge of protein-protein interaction network into GWAS by network propagation. For the second project, we study multiple SNPs at a region. Identical by descent (IBD) means two identical allele at a locus are inherited from a common ancestor. We suggest and implement an efficient algorithm of finding IBD tracts among multiple individuals, these IBD tracts are then thrown into linear mixed models for association test. The reduced cost of sequencing has made it possible to sequence the whole genome of thousands of individuals, opening up the possibility to explore a more diverse set of variants than SNPs, such as structural variations. One important class of structural variations in human genome is Alu element. We develop a software, PAIR2, to detect Alu polymorphism using sequencing data, which is to our knowledge the first tool to identify Alu polymorphism for multiple individuals simultaneously, pinpoint the precise breakpoints and call genotypes with high accuracy. PAIR2 makes it possible to study the evolution and disease associations of Alu elements at population scale.

i

Resumé

Denne PhD afhandling præsenterer mit arbejde inden for analyse af genomiske variationer efter en kronologisk rækkefølge. Whole-exome sequencing gør det muligt at målrette protein-coding om- rådet af genomer hurtigt og billigt. Vi analyserer data fra 12 centrale chim- panser og karakterisere deres mønster af polymorfier. Sammenligner med genomiske variationer i mennesker, observer vi omfattende adaptiv evolution på X-kromosomet af chimpanser samt meget stærkere rensende selektion i forhold i mennesker. Genome wide association study (GWAS) er en teknik som bruger store mængder af genomiske data til at identificere mønstre i genomiske variationer som varier systematisk mellem individer med forskellige sygdomstilstande. Enkeltnukleotidpolymorfier (SNP) er en af de mest almindelige genomiske variationer, og standard tests for SNP har ofte lave styrke. Her præsenterer vi to projekter med alternativ metoder i anvendelse af SNP data med det formål at øge statistisk styrke. I det første projekt, bruger vi forudgående biologiske viden om protein-protein interaktionsnetværk i GWAS igennem netværksspredning. I det andet projekt, undersøgelser vi SNP i forbindelse med IBD, som betyder at to ens alleler nedarvet fra en fælles forfader. Vi udvikler hurtige algoritmer til at finde IBD alleler mellem mange personer og bruger dem i lineære blandede modeller for associationstest. Den reducerede omkostninger af sekventering gør det muligt at sekven- terere tusindvis af genomer, og derfor er kan vi undersøge en bredere sæt af variationer desuden SNP, såsom strukturel variation. Alu element er en vigtig type af strukturel variationer i vores genom. Vi udvikler en software, PAIR2, til at finde Alu polymorfier på sekvenserings data. PAIR2 er, til vores vi- den, det første værktøj til at samtidigt identificere Alu polymorfier på mange personer, at lokalisere præcise breakpoints og at bestemme genotyper med stor nøjagtighed. PAIR2 gør det muligt at undersøge evolution og sygdom- stilstande korrelationer af Alu elementer på befolkningen skala.

iii

Acknowledgments

I am grateful to be a PhD student in Bioinformatics Research Centre (BiRC), where I had the chance to study science and work with outstanding scientists. I am most thankful to my supervisor, Mikkel Heide Schierup, for his con- tinuous support throughout my PhD, for being a good example of scientist himself, for guiding me through research work, for helping me develop inde- pendence, and for always being supportive and encouraging. I would also like to thank my co-supervisor, Thomas Mailund, for showing me the truth of being a researcher, and for being smart and honest while criticizing my work. Many thanks to all the wonderful people at BiRC, for their friendship, for creating a great working environment, and for keeping my life interest- ing. Thanks to Ellen Noer for being helpful with administrative issues and for bringing a smile to the department each day. Thanks to colleagues whom I have worked with for their inspirations and great advice, these include Carsten Wiuf, Palle Villesen Fredsted, Thomas Bataillon and Søren Besen- bacher. Thanks to colleagues whom I might not have directly worked with for their enlightened scientific discussions, these include Qianyun Guo, Michael Knudsen, Jakob Grove, Rune Friborg, Martin Simonsen, Jesper Nielsen and Iwona Siuda. Special thanks to professor Sharon Browning, Brian Browning and their research group at the University of Washington, for accepting me for my study abroad, for advising me and working with me, and for making my stay at Seattle a positive and unforgettable experience. My sincere thanks also go to Daníel Guðbjartsson, Bjarni Halldórsson and Birte Kehr at deCODE genetics, for having stimulating discussions and pro- ductive collaborations with me, and for making my stay at Iceland a wonderful experience (except the weather). Finally, I would like to thank parents and grandparents for encouraging me to follow my dreams; my USTC friends for sharing my moments at different time zones; and my boyfriend, Casper Svenning Jensen, for being there.

Yu Qian, Aarhus, January 26, 2015.

v vi

List of abbreviations (in alphabetical order)

DFE - distribution of fitness effects GWAS - genome-wide association study HMM - hidden Markov model IBD - identical by descent IBS - identical by state LD - linkage disequilibrium ML - maximum likelihood MAF - minor allele frequency NGS - next-generation sequencing PPI - protein–protein interactions PR - precision recall RR - recall rate SFS - site frequency spectrum SNP - single nucleotide polymorphism Contents

Resumé iii

Acknowledgments v

Contents vii

I Overview 1

1 Introduction 3

2 Adaptive evolution in central chimpanzees 7 2.1 Background ...... 7 2.2 Related work ...... 8 2.3 Research contributions ...... 8

3 Methods in genome-wide association studies 13 3.1 Background ...... 13 3.2 Improving power of GWAS using network information . . . . . 16 3.2.1 Introduction ...... 16 3.2.2 Related work ...... 16 3.2.3 Research contributions ...... 18 3.2.4 Results and discussion ...... 21 3.3 Identity by descent and disease mapping ...... 24 3.3.1 Introduction ...... 24 3.3.2 Related work ...... 27 3.3.3 Research contributions ...... 28 3.3.4 Results ...... 32 3.3.5 Discussion and future work ...... 33 3.4 Summary ...... 35

4 Alu polymorphism discovery from next-generation sequenc- ing data 37 4.1 Background ...... 37

vii viii CONTENTS

4.2 Methods ...... 40 4.2.1 Alu Deletion ...... 40 4.2.1.1 Informative reads ...... 41 4.2.1.2 Determining genotype ...... 42 4.2.2 Alu Insertion ...... 43 4.2.2.1 Informative reads ...... 44 4.2.2.2 Insertion regions for single individual . . . . . 44 4.2.2.3 Insertion regions for multiple individuals . . . 46 4.2.2.4 Identifying precise breakpoints ...... 46 4.2.2.5 Inserting Alu at breakpoints ...... 49 4.2.3 Simulations ...... 49 4.3 Results and discussion ...... 51 4.3.1 Simulated data ...... 51 4.3.2 CEU trio of 1000 Genome Project ...... 53 4.3.3 Timing ...... 54 4.4 Summary and future work ...... 55 4.4.1 Algorithm improvement ...... 55 4.4.1.1 Genotype calling of multiple individuals . . . . 55 4.4.1.2 Identify insertion regions ...... 56 4.4.1.3 Benchmarking ...... 56 4.4.2 Applications ...... 56

II Appendix 59

A Extensive X-linked adaptive evolution in central chimpanzees 61 A.1 Summary ...... 61

B Identifying disease associated by network propagation 79 B.1 Summary ...... 79

C Efficient clustering of identity-by-descent between multiple individuals 97 C.1 Summary ...... 97

Bibliography 107 Part I

Overview

1

Chapter 1

Introduction

Bioinformatics is an interdisciplinary field of science, where methods and soft- ware tools are developed to process and analyse biological data. The general purpose of doing a PhD in bioinformatics is to increase our understanding of biology, particularly human biology. We want this, not only because biology is interesting itself, but also because we want to find cures for human diseases. During my PhD studies, I have studied across a wide spectrum of topics in the field of Bioinformatics. I will present some of the research projects I have been involved in this dissertation. As you might have known, research is not always guaranteed to be successful. For a handful of projects that I failed, I will not include them in this dissertation. The genetic information of an organism is encoded in its genome, and is reproduced and passed on to its offsprings through the process of self- replication. During this process, mutation occurs and becomes the ultimate source of genetic variation, providing the genetic material for natural selection and variety of species alive today. The size of a genome is often very large, for example, a haploid human genome consists of three billion DNA base pairs. Thanks to recent advances in next-generation sequencing (NGS) technologies, we are able to study such a big genome at the nucleotide level. In a common NGS pipeline, a whole genome, or a target region of the genome, is randomly digested into small fragments for sequencing. The sequenced fragments are then either aligned to a reference genome or de novo assembled. Finally, the genomic variations of each sequenced target can be identified through various variants-calling methods. Advances in low-cost NGS have lead to the availability of large collections of genomic data, making the analysis of such data increasingly challenging. Motivated by the potential information hidden behind the huge amount of genomic data, I started my PhD at August 2009, with focus on methods de- velopment and application in analyzing genome wide variation data. My first project was a data-driven project, together with a group of people, we looked

3 4 CHAPTER 1. INTRODUCTION at the whole-exome sequencing data of 12 central chimpanzees. Comparing the chimpanzee data to human data, we learned something about adaptive evolution in central chimpanzees. For this project, I implemented the pipeline of the main analysis. The result was published and is included as Chapter 2 and Appendix-A in this dissertation. Soon after that, I tried a few more projects on different topics and gradually learned that my interest lies on disease mapping, or ultimately, to cure people. I started to explore the field of genome-wide association study (GWAS). The first project that finished with some results was a study that combines net- work information with the GWAS data. Using external resources of biological knowledge such as network, we are able to boost the power of GWAS. This work is included in Chapter 3 and Appendix-B. In late 2012, I visited the Dept. of Biostatistics, University of Washington for 6 months. Working with the Browning group, I gained a lot of knowledge about genetics and statistics. Due to time limitation, we were only able to implement one project, which was on the subject of identify-by-descent (IBD) relation among multiple individuals. The work is included in Chapter 3 and Appendix-C. In late 2013, I took a 6-month leave from my PhD study at Aarhus Uni- versity and worked as a researcher at deCODE genetics in Iceland, a company specialized at analyzing and understanding human genome, particularly the genome of Icelandic populations. There I worked mainly on three projects, (i) implementing and testing coordinate descent algorithm for efficient calculation of Bayesian mixed model statistics. (ii) integrative annotation of noncoding variants using support vector machine. (iii) identification of polymorphic Alu elements based on sequencing data. Project i was completed and project ii was discarded while I was at deCODE. Project iii was continued after I came back to Aarhus University. Chapter 4 of this dissertation gives a detailed description of this project.

Peer-reviewed publications

Christina Hvilsom∗; Yu Qian∗; Thomas Bataillon∗; Yingrui Li∗; Thomas Mailund; Bettina Sallé; Frands Carlsen; Ruiqiang Li; Hancheng Zheng; Tao Jiang; Hui Jiang; Xin Jin; Kasper Munch; Asger Hobolth; Hans R Siegis- mund; Jun Wang; Mikkel H Schierup. Extensive X-linked adaptive evolution in central chimpanzees. Proceedings of the National Academy of Sciences 109, no. 6 (2012): 2054-2059. [Appendix-A] *The authors contributed equally.

Yu Qian, Søren Besenbacher, Thomas Mailund, and Mikkel Heide Schierup. Identifying disease associated genes by network propagation. BMC systems biology 8, no. S-1 (2014): S6. [Appendix-B] 5

Yu Qian, Brian L. Browning, and Sharon R. Browning. Efficient clustering of identity-by-descent between multiple individuals. Bioinformatics (2013): btt734. [Appendix-C]

Manuscripts under review Thomas Bataillon; Jinjie Duan; Christina Hvilsom; Xin Jin; Yingrui Li; Laurits Skov; Kasper Munch; Asger Hobolth; Jun Wang; Thomas Mailund; Hans R Siegismund; Mikkel H Schierup. Inference of purifying and positive selection in three subspecies of chimpanzees (Pan troglodytes) from exome sequencing. Genome Biology and Evolution

Manuscripts under preparation

Yu Qian, and Bjarni V Halldórsson. PAIR2: Alu polymorphism discovery from next-generation sequencing data

Conference presentations

Identifying disease associated genes by network propagation. Oral presen- tation at APBC 2014 (Asia Pacific Bioinformatics Conference) in Shanghai, China, January 2014

Software

EMI: accompanying source code for paper [1], a C++ implementation of fast and accurate searching of multiple IBD clusters. The extended work of improving pairwise IBD detection is also included. Available at https://github.com/aimeida/EMI_dev.

PAIR2: accompanying source code for a manuscript under preparation, a C++ implementation of detecting polymorphic Alu elements in high-throughput sequencing data. Available at https://github.com/aimeida/AluDetection.

Chapter 2

Adaptive evolution in central chimpanzees

This chapter corresponds to a publication in Appendix-A.

2.1 Background

Since the advent of Sanger sequencing in 1977, knowledge of DNA sequences has become indispensable for basic biological research as well as genetic stud- ies. Meanwhile, various sequencing technologies have been developed to re- place the traditional methods with better efficiency and less expensive cost. Whole-exome capture and sequencing efficiently targets the protein coding part of the genome, covering 1% - 2% of the genome in all. It generates ac- curate individual genotyping of synonymous and nonsynonymous diversity at high coverage and allows the studying of whole genome at greater resolution with lower cost. This technology has been widely used in many fields such as population genetics [2]. Mutation happens during evolution and can be broadly divided into three categories based on its effects. First, mutations that are harmful to the fitness of their host and thus reduce survival or fertility. Second, mutations that are neutral and have little or no effect on fitness. Finally, mutations that are advantageous and increase fitness by allowing their host to better adapt to an environment. The relative frequencies of these types of mutation — the distribution of fitness effects (DFE), is central to many questions in evolu- tionary biology. However, we still know very little about the DFE and their dominance in diploid organisms [3]. Another fundamental entity in evolutionary biology is α, which quantifies the proportion of mutations that have been fixed by positive selection. It has been suggested to combine the genome-wide surveys of polymorphism within species and between species to estimate α. Empirical studies, particularly for the Drosophila genus show that mutations fixed by positive selection can

7 8CHAPTER 2. ADAPTIVE EVOLUTION IN CENTRAL CHIMPANZEES make up a sizable fraction (> 50%) of the divergence between species in coding regions, but it remains unclear of whether these results are general for [4]. The X chromosome, present in a single copy in males, has a curious role in population genetics. The faster-X theory states that DNA sequences on X often evolve faster than autosomal sequences. It might be a result of the adaptive fixation of recessive beneficial mutations in X-linked genes, higher mutation rates on the X chromosome versus the autosomes (dosage compensation), or a smaller effective population size (Ne) of X chro- mosome [5]. Surveying genome-wide coding variation within and among species gives power to study the genetics of adaptation. For example, by comparing the amount of variation within a species (polymorphism) to the divergence be- tween species (substitutions), we can use McDonald–Kreitman test to deter- mine the proportion of substitutions that resulted from positive selection and measure the amount of adaptive evolution [6]. In addition, contrasting the difference of α and DFE between autosome and X chromosome offers the possibility of revealing important aspects of the natural selection.

2.2 Related work

In a population, autosomes and the X chromosome experience the same de- mographic history but selection and genetic drift can be different. Theory predicts that recessive beneficial mutations should result in faster X evolution. However, empirical evidence for increased levels of adaptation in X-linked re- gions remains elusive [5], except for an unique instance of a newly formed neo- X chromosome in Drosophila miranda and a recent analysis of polymorphism and divergence between Drosophila melanogaster and Drosophila simulans for about 100 genes in X-linked regions [4]. In human studies, X-linked variation is lower than autosomal variation close to genes suggesting that selection affects X chromosomes more than autosomes, but Europeans have a genome-wide relative decrease in X-linked variation when comparing with Africans, suggesting that genetic drift has specifically affected X chromosome in Europeans [7–9]. Likewise, it has been the subject of much discussion why there is a reduced variation of the X chro- mosome relative to autosomes in the human–chimpanzee ancestral species [10].

2.3 Research contributions

To study genome variation within and among species, and characterize poly- morphism patterns in chimpanzees since their divergence with humans, we used capture and sequenced 12 chimpanzees from Central Africa (Pan troglodytes troglodytes). At an average sequencing depth of 35×, we obtained 2.3. RESEARCH CONTRIBUTIONS 9 a dataset of ∼ 62, 000 coding SNPs. For analysis, we mapped reads against the human reference, which contains less error and thus is better suited for in- terspecific comparison, but our conclusions remain unaltered when excluding ambiguous SNPs (see Appendix-A, Table S4). The synonymous nucleotide diversity in central chimpanzees is high on autosomes, but it reduces to half of that level on the X chromosome, with a diversity X/A ratio (πs(X)/πs(A)) of 0.46 - 0.51. This is different from the observations on human, where the X to autosome diversity ratio range from 0.57 (Europeans) to 0.7 (Africans) near genes and between 0.75 (Europeans) and 0.87 (Africans) far away from genes [8, 9, 11]. Excluding the possible effects of male biased mutation and demographic history, the explanation of reduced variation on the X chromosome can be a combination of more efficient selection against deleterious variants and selection for advantageous recessive mutations. To obtain the unfolded site frequency spectrum (SFS) for polymorphism on chimpanzees, we used the human reference genome to orient SNPs. To further contrast the polymorphism patterns to human population, we com- pared our data to a dataset of 200 human exomes of European origin [12] by calculating the expected SFS for human autosomes and X chromosome for a sample corresponding to the size of our dataset. We observe a larger frac- tion of polymorphisms in humans of non-synonymous (50%) relative to that of central chimpanzees (45%) (Figure 4.1(A) and (B)). We then used the Sumatran orangutan (P ongoabelii) genome [13] to infer which of the human–chimpanzee fixed differences occurred on the chimpanzee branch and combined these with our chimpanzee polymorphism data. This allows measurement of the amount of adaptive evolution on the branch leading to chimpanzees, using the approach of ref [14]. All autosomes have an excess of nonsynonymous polymorphisms segregating compared with nonsynonymous fixed differences, leading to a neutrality index (NI) > 1 and no evidence for adaptive evolution. In contrast, the X chromosome shows evidence for adap- tive evolution (NI = 0.76) and a proportion of differences fixed by adaptive evolution of α = 29% (95% confidence interval (CI) [0.1–0.36]). To further qualify the differences in the intensity of purifying and posi- tive selection in chimpanzees, and to examine whether our inference above could be biased by changes in the recent demographic history, we used the approach in ref [15] to estimate jointly the DFE and α. This model assumes an expansion model that also yielded the best fit to our data. The strength of selection against mutations is measured by its effective selection coefficient (|Nes|), which incorporates the actual fitness effect of the mutation (s) and effective size (Ne). A slightly higher fraction of sites with strongly deleterious mutations (|Nes| > 100) is found on the X chromosome relative to autosomes (Figure 4.2). This finding suggests marginally more efficient purifying selec- tion on the X chromosome, considering also that its effective size is expected to be smaller than for the autosomes. The fraction of nonsynonymous muta- 10CHAPTER 2. ADAPTIVE EVOLUTION IN CENTRAL CHIMPANZEES

Figure 2.1: Comparison of site frequency spectrum in central chim- panzees versus humans. (A) Brown and red: Counts of the number of synonymous and non-synonymous SNPs on the autosomes as a function of the frequency of the derived variant established using humans as outgroup. Blue and light blue: Corresponding expected counts for a sample of 24 exomes of the 200 human exomes. (B) Corresponding counts on the X chromosome. (C) Autosomal site frequency spectrum of synonymous and nonsynonymous SNPs derived from Fig.A and compared with the expectations from a constant size population without selection (green). (D) X chromosome site frequency spectrum for humans and chimpanzees. This Figure is presented in the pub- lication of Appendix-A, Figure 1. tions fixed by positive selection is estimated at α = 38% (95% CI [0.22, 0.51]) on the X chromosome, and α ∼ 0 (95% CI [−0.09, 0.07]) on the autosomes. Our study demonstrates that a substantial amount of adaptive evolution is targeting the X chromosome and that selection against deleterious mutations is more efficient on the X than on the autosomes in central chimpanzees. Given the striking difference in adaptive evolution detected using diver- gence, we used our diversity data to query whether selective sweeps have oc- curred preferentially on the X chromosome. We searched for the occurrence of extreme selective sweeps causing reduction of polymorphism over megabase- wide regions by scanning the genome using windows of 10 kb of exon data, in which we contrasted the number of polymorphic sites with the number of 2.3. RESEARCH CONTRIBUTIONS 11

Figure 2.2: Efficacy of purifying selection against deleterious mu- tations in central chimpanzees. The strength of purifying selection is measured by the product Nes, where Ne is the effective population size and s the selection coefficient against a heterozygous deleterious mutation. Mu- tations are divided into four categories: quasineutral mutations (deleterious mutation with 0 < |Nes| ≤ 1), mildly deleterious mutations (1 < |Nes| ≤ 10), deleterious mutations (10 < |Nes| ≤ 100), and strongly deleterious mutations (|Nes| > 100). The proportions are estimated separately in autosomes and chromosome X. Error bars denote one SE around the estimates of each pro- portion. This Figure is presented in the publication of Appendix-A, Figure 2. synonymous differences on the chimpanzee branch. The most striking exam- ple is found on chromosome 3, a 6-Mb region with low polymorphism. This region contains a cluster of immunity-related genes under positive selection in humans as well as the CCR5 gene involved in HIV resistance. We ap- ply the same analysis on Chromosome X, where we observe reduced diversity throughout the chromosome, but not any clear instances of recent sweeps.

Chapter 3

Methods in genome-wide association studies

This chapter corresponds to publication in Appendix-B and Appendix-C.

3.1 Background

The goal of association studies is to identify patterns of variants at genomic loci that vary systematically between individuals with different disease states. A genome-wide association study (GWAS), also known as whole genome as- sociation study, is an examination of genetic variants and their associations to a a trait. A GWAS proceeds by identifying a number of individuals carrying a dis- ease or a trait and comparing these individuals to those that do not or are not known to carry the disease or trait. Both sets of individuals are then genotyped for a large number of genetic variants which are then tested for as- sociation to the phenotype. In the following, I use the term disease and trait interchangeably to represent the phenotype to be studied. Single nucleotide polymorphisms (SNPs) are by far the most commonly considered genetic vari- ants. A typical GWAS involves genotyping of 100,000 to 3 million common SNPs among thousands of individuals. However, sequencing data has opened up the possibility to explore a more diverse set of genetic variants than SNPs, such as structural variations and copy-number variants [16]. Linkage disequilibrium (LD), defined as the non-random association of alleles at different loci, is derived from the presence of two or more loci on a chromosome with limited recombination between them. LD implies that tightly linked variants strongly correlated, thus it is possible to detect associ- ations even if the causal variant is not directly typed in the study. Figure 3.1 gives such an illustration. The two indicated direct associations cannot be ob- served, but if the typed marker is closely correlated to the causal variant (thus

13 14CHAPTER 3. METHODS IN GENOME-WIDE ASSOCIATION STUDIES in strong LD), we might be able to detect the indirect association between the typed marker and disease.

Figure 3.1: Tag SNP and LD. A typed variant, usually a tag SNP, is in LD with the true causal variant.

Studies have shown that the human genome has a haplotype block struc- ture, such that it can be divided into discrete blocks of limited haplotype diversity and variants within each block are in LD with each other [17, 18]. Haplotype blocks is a surprising discovery of great practical importance for the disease mapping, and it suggests that testing one SNP within each block for significant association with a disease might be sufficient to indicate as- sociation with every SNP in that block, thus reducing the number of SNPs that need to be tested [19]. While it is prohibitively expensive to sequence all the SNPs in the human genome, typing limited number of SNPs in each haplotype block is a common solution to reduce the total cost, as it is still able to capture most of the genetic variation in the genome. GWAS has rapidly become a standard method to study the genetic basis of complex traits. with the number of published GWAS reports growing each year, see Figure 3.2. In the past decade, many common and rare variants that are associated to complex traits have been successfully identified. Some examples include heart disease, diabetes, auto-immune diseases, and psychiatric disorders [20, 21]. Yet at the same time, many studies have also pointed out various problems and failures in the experimental design of GWAS: detection of association of alleles with modest effect sizes is underpowered, common risk variants fail to explain the majority of genetic heritability [21], findings of GWAS show 3.1. BACKGROUND 15

Published GWA Reports, 2005 – 6/2012 1350 1400

1200

1000

800 of Publications of 600

400 Number

200 Total Total

0 2005 2006 2007 2008 2009 2010 2011 2012

Calendar Quarter Through 6/30/12 postings

Figure 3.2: Number of published GWA reports. Figure downloaded from http : //www.genome.gov . correlation but not causation to the trait being studied, to name a few. In this chapter, I will focus on some GWAS projects that I have been involved, each addressing a specific topic. 16CHAPTER 3. METHODS IN GENOME-WIDE ASSOCIATION STUDIES

3.2 Improving power of GWAS using network information

3.2.1 Introduction The genotyped data in GWAS is analyzed predominantly through single SNP tests. The data is first processed by removing markers with low quality or are not in Hardy-Weinberg equilibrium (HWE). Individuals with a high propor- tion of missing genotypes are also removed. After the pre-processing, disease association test is performed for each SNP individually, commonly used ap- proaches include chi-square tests, Fisher’s exact test and Cochran-Armitage trend test. One fundamental problem of such tests is that the genome is so large and SNPs identified as associated could well arise by chance, therefore a multiple testing adjustment is often applied to distinguish causal from spurious signals. If a researcher is lucky, a SNP located in the genome is found to be significant after the multiple testing adjustment. However, this is not often the case, true associations might not be in LD with typed SNPs or no SNPs surpass the multiple testing threshold (e.g., p < 5 × 10−8 ). The odds of no significant finding may increase as the number of SNPs on standard genotyping chips increase due to a higher penalty for multiple comparisons [22]. In order to alleviate such issues, various approaches have been developed to increase the statistical power, such as aggregating multiple markers from the same gene or the same haplotype block region, incorporating prior biological knowledge from other sources into the GWAS analyses, and so on. Methods and software packages that consider multiple markers have be- come available, yet they have been slow to become widely adopted and their efficacy in real data analysis is often questioned [23–25]. Meanwhile, using prior biological knowledge such as gene annotation and pathway has shown to be powerful in identifying new associations in addition to those identified using single SNPs [26]. For example, given gene annotation, the most significant SNPs within and near a gene are collapsed to represent a gene-level association statistics. Then each gene is tested either individually or collectively with other genes. Our current understanding of human genome and the existence of abundant databases containing known gene pathways and protein-protein interactions has encouraged researchers to perform association tests through interaction, pathway or network analysis.

3.2.2 Related work Interaction analysis is first considered when a genetic factor functions pri- marily through a mechanism that involves other genes, the effect might be missed if the gene is only examined on its marginal effect without allowing for its potential interactions with these other genes. The motivation of applying 3.2. IMPROVING POWER OF GWAS USING NETWORK INFORMATION 17 interaction analysis in GWAS is to enhance its power and discover more signif- icant associations, since a variant with small marginal effect is not necessarily clinically insignificant but may turn out to have a strong effect together with another variant [27, 28]. Interactions are however much more difficult to detect and model than marginal effects due to two main challenges. First, given n genes in total, the n total number of tests in a k-gene interaction model is k . This number grows by n-fold when k increases and quickly becomes computationally intractable. Even with speed improvement by preliminary pruning, heuristics or parallel computing, researchers are still only able to discover two-way or three-way interactions for GWAS. Second, since a huge number of possible combinations are tested, a large proportion of significant associations are expected to be false positives. Thus, reducing the number of false positives while retaining power is big challenge [28–32].. A more tractable alternative strategy of gene-level disease mapping is path- way analysis, where multiple variants in interacting or related genes in the same pathway are considered jointly. This alternative is very appealing for two reasons. First, it reduces the complexity of analysis by grouping thou- sands of genes to hundreds of pathways. Second, it is believed that disease susceptibility is likely to depend on the cumulative effect of multiple genes in functional pathways, and thus the resulting associated pathways might have more explanatory power than a simple list of different genes. The procedure of pathway-based approach is analogous to gene set en- richment analysis [33], it ranks all genes by their statistical significance and tests whether a group of related genes have modest yet consistent deviation from what is expected by chance. Some technical difficulties such as repre- senting each gene by multiple SNPs, adjustment of different sizes of pathways and assessing the statistical significance of pathway-level statistics have been addressed by successive generations of methods [34]. Despite the factor that studies have successfully demonstrated the feasi- bility of pathway analysis, where some biological and biochemical mechanism that underpin the disease is elucidated [35, 36], a few limitations remain to be addressed. For example, the general input of these approaches is a set of genes, which are treated as exchangeable without taking into account their functional relationships. As a result, information from the pathway topology and interactions among genes is usually ignored, even though such information can be utilized to increase the power of detecting real associations. Another limitation is due to incomplete and inaccurate annotations of database, and it is yet not clear how robust such methods are if the pathway annotations or pathway sizes change. Last but not least, the additional information of genes in overlapping pathways is not fully utilized in the general settings, thus genes (e.g., housekeeping genes) that appear more frequent in multiple pathways might lead to bias in the analyses. When the topology and appearance frequency starts to play a role, it 18CHAPTER 3. METHODS IN GENOME-WIDE ASSOCIATION STUDIES is natural to think of utilizing an existing protein-protein interaction (PPI) network as an alternative strategy. Similar to pathway-based approaches, network analysis integrates our current understanding of human genome in disease mapping. It is often believed that biological units such as genes be- have interactively by groups. Consider the disease development as a crucial biological mechanism, the failure of a small portion of the important factors might lead to dysfunction of the whole biological mechanism. For example, studies have supported the hypothesis that multiple rare mutations, each with a low marginal effect, may be a major player in genetic determination of sus- ceptibility for some complex diseases such as schizophrenia [37–39]. There have been a few previous studies combining network data with GWAS data. A representative strategy is to apply a dense module search algo- rithm to search for significantly enriched subnetworks (modules) with strong association signals [40]. A true candidate module will be consistently signifi- cant across permutations or different GWAS datasets. However, the limitation is that the module searching procedure is sensitive to the background network. That is, our current knowledge of PPI network is not complete, yet the result- ing dense modules from such methods rely heavily on the interactions defined in PPI. Moreover, in order to identify modules that are significantly associ- ated to disease, an intensive topologically matched randomization process is needed. Network guilt by association (GBA) is an alternative strategy of identify- ing disease genes using network information. Based on the observation that similar phenotypes arise from functionally related genes, GBA has been im- plemented in different frameworks such as Iterative Ranking and Gaussian Smoothing [41]. Given a query disease and a human PPI network, in which each protein is mapped to one gene, the known causal genes of diseases that are phenotypically similar to the query disease are given a prior score in the PPI network, then these prior scores are propagated and smoothed over network such that each protein gets an association score. Genes with high associa- tion scores are considered as candidate disease genes. Despite the fact that GBA is a proven approach for identifying novel disease genes, its application in GWASs remain to be explored [42].

3.2.3 Research contributions In this study, we propose a framework to address the issue of exploiting net- work data in a principled way. We implement GBA as an application in GWAS and search for genetic determinants of complex traits underlying the rationale of complex diseases operates through multiple genes. Its application in a study of Crohn’s disease shows the its potential to increase the amount of useful information that otherwise can be missed by standard analysis. Given a PPI network, we consider it as an undirected graph G = (V,E,W ), where nodes V are a set of , edges E are links between proteins and 3.2. IMPROVING POWER OF GWAS USING NETWORK INFORMATION 19 W are the weight of the edges. A link exists if there is an interaction between its end nodes. For a node v ∈ V , denote all of its neighbors as a set of N(v), and the number of its neighbors as degree(v). Let Y : V → R≥0 be a function of prior evidence, i.e., a node v has a high score if we a priori (from GWAS) believe that it is associated to the disease. Let F : V → R≥0 be a function of posterior evidence, i.e., the evidence of association after network propagation. Given a GWAS dataset, we first assign a prior score Y for each gene. There are many ways to calculate a combined p value of each gene from all its SNPs, and minimum p value method is chosen in our study because it gives higher internal consistency of gene ranks for random subset of the data. More specifically, the p value of a gene is defined as the minimum single marker test p value of the SNPs located within the gene or 10Kb immediately upstream −1 or downstream. The prior score is yi = Φ (1 − pi/2) with Φ the cumulative distribution function of normal distribution. Under the null hypothesis of no + association, yi v N (0, 1). Then we propagate the evidence of associations measured as prior scores through the network. Similar to Google’s PageRank algorithm, the impor- tance (score) of each node is dependent on its neighbors. A node that is linked to many nodes receives a high score itself. The posterior score F of a node is computed as

X 0 F (v) = α[ F (u)wv,u] + (1 − α)Y (v) u∈N(v) the parameter α ∈ (0, 1) weights the relative importance of information re- 0 p ceived from neighbors, and wv,u = wv,u/ d(v) × d(u) denotes the weight of edges, with d(v) the degree of node v. The above formula can be expressed in linear form F = αW 0F + (1 − α)Y , which is equivalent to

F = (I − αW 0)−1(1 − α)Y (3.1)

It can be proved that W 0 is similar to a stochastic matrix, which has eigenval- ues in [−1, 1] (according to the PerronFrobenius theorem). Since α ∈ (0, 1), the eigenvalues of (I − αW 0)−1 exists. Though the above linear equation can be solved analytically, it is difficult to compute the inverse of large matrix of |V | × |V | dimension. Instead, we choose an iterative propagation method to solve the system. At iteration t, we compute

F t = αW 0F t−1 + (1 − α)Y (3.2)

We study two extreme scenarios, T = 1 and T = ∞. When T = 1, each gene only receives information from its neighbors, and equation (3.2) is only calculated once. When T = ∞, the evidence of association is smoothed over the network as equation (3.2) eventually reaches equilibrium and F scores con- verge according to equation (3.1). Unlike T = 1, the posterior score calculated when T = ∞ is also dependent on the indirect neighbors of each gene. 20CHAPTER 3. METHODS IN GENOME-WIDE ASSOCIATION STUDIES

After propagation, we finally select candidate genes based on posterior scores F . The posterior score is the combined information of GWAS (prior score) and neighbor genes (propagation), thus genes of high posterior score might be of interest. However, a gene can have a high posterior score under two conditions: (1) it has a higher degree in the network and receives more information from its neighbors than other lower degree genes, (2) most of its neighbors are candidate associations and thus itself is GBA. In order to iden- tify genes in condition (2) while eliminating spurious signals from condition (1), we apply an alternative ranking of candidate genes based on permuta- tions. At each permutation, propagation starts with randomized prior scores and the resulting posterior score of each gene is recorded and later used to access the significance of its original posterior score. The pseudocode is shown in the following. The predefined parameters are significant threshold (e.g., Sig. Thresh.=0.01) and number of permutations (e.g., K = 10000).

// GWAS prior prior scores based on p value of Cochran-Armitage trend test in GWAS data; propagate prior scores by equation (2); normalize posterior scores for all genes such that the sum is 1; i record posterior score for genei as S0; // permutation for k in 1 to K; do permute case and control labels, calculate new prior scores from p value; propagate prior scores by equation (2); normalize posterior scores for all genes such that the sum is 1; i record posterior score for genei as Sk; done // find candidate GBA genes

for genei in network; do K P I(Si >Si ) k=1 k 0 p value of posterior score for genei is K genei is candidate gene if the p value is smaller than Sig. Thresh. done

To further study if the identified GBA genes collectively contribute to the disease, we build penalized logistic regression models using all the SNPs within 10Kb boundary of candidate genes as covariates. The performance is evaluated in 10-fold cross-validations (CV), that is, a prediction model based on 90% of the cases and controls is tested on the remaining 10% data, and this step is repeated 10 times with different 90% and 10% subset of the cohorts. 3.2. IMPROVING POWER OF GWAS USING NETWORK INFORMATION 21 3.2.4 Results and discussion We performed GBA on an anonymous dataset of the Wellcome Trust Case Control Consortium (WTCCC) study of Crohn’s disease. The original cohort includes 2005 Caucasian UK patients of Crohn’s disease and 3004 controls genotyped on the Affymetrix 500K mapping array [43]. After standard qual- ity control, we are left with 1748 cases and 2938 controls. The PPI network is built based on the STRING database version 9.0 with edges representing interactions of strong confidence [44]. The final network includes 229599 in- teractions involving 15010 proteins. Each SNP is mapped to a gene if it was located within the gene or 10Kb immediately upstream or downstream, and GBA propagation is performed as previously described. Candidate genes are ranked by the p value of their pos- terior score obtained from permutations. When we performed the analysis, we found that a similar network analysis on the same dataset was published [45]. To make a fair comparison, we generate a list of 150 top ranked candidate genes ( T = 1 ) and compared our results. The list of genes are shown in Appendix-B, Table S2. We observed that genes with significant p values from GWAS are also candidate genes in our study, which is not surprising since the prior score from GWAS contributes to the posterior score. Our study also identified 5 genes that are missed by standard methods, but are reported as candidate disease genes for Crohn’s disease in other independent GWASs (Figure 3.3). To evaluate the prediction models based on the candidate genes, we cal- culate the average Area Under the Receiver-Operating-Characteristic Curves (AUC) for 10 trials. The results are compared to the regression models built in a pathway study of the same dataset [36]. Our study consistently achieves a higher AUC both including and excluding significant SNPs of p < 5 × 10−7, as shown in Table 3.2.4 . We also build prediction models with the same number of genes that are ranked on top based on their posterior scores instead of permutation signifi- cance, as expected, we observe a drop of the AUC value. The reason might be that the genes called in our study are GBA genes and might collectively contribute to disease development, while genes of highest posterior scores can have no association to the disease but has a high posterior score simply due to network topology. Combining GWAS data with function databases is very appealing as it provides more explanatory power than a simple list of different genes, While pathway methods have shown success in many applications, their limitations should be acknowledged. For example, genes involved in multiple pathways might introduce bias in different pathways, different definitions of the same pathway in different knowledge bases can affect performance assessment in terms of power and true positive/negative rate. Our study has shown the potential power of GBA framework of boosting GWAS signals, even though 22CHAPTER 3. METHODS IN GENOME-WIDE ASSOCIATION STUDIES

GWAS Lee et al 3 3 4 17 3 2

5

our study

Figure 3.3: Number of reported genes. Overlap of validated genes among top 100 genes for each method. Validated genes denote genes that are reported as candidate in other independent GWAS studies. Method GWAS means ranking genes by minimum p value. This Figure is presented in the publication of Appendix-B, Figure 1.

it can not fully address the above mentioned issues. Evolution occurs in the context of a complex network of interconnected genes and pathways. However, we are still in the early stages of understand- ing how the network of epistatic and main effects synthesizes with biological networks and pathways [46]. When we first started this project, the motiva- tion was to find a group of genes that collectively contribute to disease through biological network. These genes might have small individual effect sizes, but their combined effect may be much larger in terms of a network paradigm. One potential extension of our work is to detect interactions and epistasis using the GBA candidate genes, it will be computationally tractable because we have a much smaller set of genes to test. However, one limitation of our study is that we exclusively consider variants near genes, ignoring the fact that variants at noncoding regions may also contain signals of associations. A more general framework that not only considers a diverse set of variants but 3.2. IMPROVING POWER OF GWAS USING NETWORK INFORMATION 23 method Genea SNP1b SNP2c AUC1d AUC2e T=1 Sig. Thresh.f 130 1176 1145 0.705(0.033) 0.687(0.032) PSg 130 2331 2310 0.645(0.024) 0.627(0.026) T=∞ Sig. Thresh. 184 1314 1281 0.730(0.017) 0.715(0.013) PS 184 3279 3256 0.645(0.024) 0.626(0.026) aGene is number of genes identified as GBA associations, bSNP1 is number of SNPs mapped to these genes. cSNP2 is bSNP1 excluding SNPs that reach GWAS significance of 5 × 10−7. dAUC1, mean AUC of prediction models built with bSNP1, numbers in parentheses are standard deviation in cross validations. eAUC2, similar to AUC1, prediction models built with cSNP2. f Sig. Thresh. = 0.01 g PS means genes with highest posterior scores are selected. This table is presented in Appendix-B Table 1. also integrates our increasing understanding of human genome is needed. Such a framework is recently proposed [47]. The researchers suggest the C score approach to estimate the relative pathogenicity of human genetic variants. Integrating information from diverse annotations (allelic diversity, pathogenicity, disease severity, experimentally measured regulatory effects, complex trait associations, etc.), a C score is computed for each single-nucleotide variant of human genome in order to prioritize functional, deleterious and pathogenic variants. Continuing at this direction, the future GWAS might provide better interpretation of variants in clinical settings and improve dis- covery power for genetic studies. 24CHAPTER 3. METHODS IN GENOME-WIDE ASSOCIATION STUDIES

3.3 Identity by descent and disease mapping

3.3.1 Introduction

Two individuals are identical by status (IBS) at a locus if they share an identi- cal allele, and they are identical by descent (IBD) at this locus if the identical allele is inherited from a common ancestor. At each generation, IBD segments are broken up by recombinations during the process of meiosis. In general, IBD segments from a common ancestor n generations back (hence involving 2n meiosis) have expected length 1/(2n) Morgans(M) [48]. For example, two individuals sharing a common ancestry 25 generations back have IBD seg- ments of average length of 2 cM, similar to the resolution of the widely used SNP array data [49]. The problem of accurate IBD detection has become increasingly challeng- ing with the availability of large amount of high-throughput data of human genome. The time complexity and false discovery rate scale quadratically with regards to the cohort size. Extensive previous work has focused on develop- ing methods for pairwise IBD segments detection. PLINK [50] is one of the earliest methods, which models the inheritance status along the genome us- ing a hidden Markov model (HMM) and assumes that all markers are in LD. GERMLINE [51] relies on string matching and considers long stretches of IBS as IBD segments, the hash-mapping strategy ensures efficiency (scaling lin- early with the number of individuals analyzed). fastIBD [52], or its improved version, Refined IBD [53], is based on a HMM that incorporates the modeling of LD, it identifies pairs of individuals sharing the same state by posterior sampling in each sliding window and extends the shared haplotype segments across windows when the probability of IBD is high. Parente2 [49] is a recent method that is based on an embedded log-likelihood ratio and accounts for LD by explicitly modeling haplotype frequencies. The detected IBD segments based on genotyping data serve as the founda- tion for a variety of downstream applications, ranging from population-based linkage mapping to haplotype phasing and genotype imputation. For exam- ple, IBD analysis is a key step for analyzing relatedness within and across a population; it can be applied to study the fine-scale demographic history of modern population, such as Ashkenazi Jewish and Kenyan Maasai indi- viduals [54]; it is used to estimate the narrow-sense heritability of metabolic traits[55]; it is also shown to be useful in removing the confounding effects for hidden relatedness in cohorts of individuals [56]. Although it is common in GWASs to use dense panels of SNPs and make direct identification of causal variants, there are limitations in certain study designs such as family-based studies, in which causal alleles are relatively rare and have high penetrance, thus identification of associated genomic region is a prerequisite for identifying causal variants. The concept of IBD in population genetics is consistent with the underlying assumption of GWAS that genetic 3.3. IDENTITY BY DESCENT AND DISEASE MAPPING 25 variants contribute to disease development in complex diseases [57]. With a technique called IBD mapping, IBD segments are applied directly to identify regions of the genome associated with a phenotype by narrowing the search to regions where most cases are IBD. Simulation studies have shown that IBD mapping has higher statistical power to detect disease susceptibility genes carrying multiple rare causal variants [58]. While the general focus of IBD relation has been on the relation between pairwise haplotypes, fewer studies have explored IBD relations among mul- tiple haplotypes and their applications, even though IBD is an equivalence relation on haplotypes that partitions a set of haplotypes into IBD clusters. For example, if haplotypes A and B are IBD and haplotypes A and C are IBD at a locus, then haplotypes B and C are also IBD by definition, even though their shared segment may be too short to be detected in pairwise manner. Simultaneous analysis to detect multiple-haplotype IBD clusters of all markers and all individuals is often computationally intractable, especially for large samples of thousands of haplotypes. Only a few softwares exist for detecting multiple-IBD clusters, they are MCMC IBD finder [59], the DASH method [60] and IBD-Groupon [61]. MCMC IBD finder is too slow for prac- tical application. The last two methods are both based on the fundamental concept of transitive relation of pairwise segments. By the time of our study, DASH is the only method that is able to handle datasets at population scale. The genetic concept of multiple-IBD clusters underlies their application in improving pairwise IBD detection. For example, the current resolution of IBD detection using SNP array data is 1 ∼ 2 cM, corresponding to a common ancestor within the past 25˘50 generations, thus many short segments can result as false negatives in pairwise IBD detection. The inferred multiple-IBD clusters can be used to identify such false negative pairwise IBD segments. At a locus, consider haplotypes as nodes in an undirected graph and their IBD relation as edges, an edge connecting node A and B exists if haplotype A and haplotype B is identical by descent. This transitive relation of pairwise IBD segments implies that a multiple-IBD cluster of haplotypes that are all identical by descent is a fully connected graph, We say that there is an incon- sistency of the transitive relation if we only observe pairwise IBD segments of (A, B) and (A, C), but not (B,C). This inconsistency can be resolved by grouping haplotypes A, B and C into a multiple-IBD cluster, such that each member of this cluster is IBD to all the other members. Similarly, identifying false negative pairwise IBD is equivalent to identifying missing links in highly connected graphs. Improved accuracy and resolution of pairwise IBD segments can benefit the downstream analysis, such as inference of the geography of recent genetic ancestry in a population [62]. Another application of multiple-IBD clusters lies in GWAS. Same as pairwise IBD segments, multiple-IBD clusters target at long shared segments that correspond to a relatively short time to the common ancestor. Variants, derived from this common ancestor, being recent, are rare. 26CHAPTER 3. METHODS IN GENOME-WIDE ASSOCIATION STUDIES

Therefore the inferred clusters can be considered as rare genetic markers in GWAS. GWASs are initially designed to focus on the higher end of the frequency- effect size spectrum and have identified many common variants associated with diseases, yet they have explained relatively little of the heritability of complex diseases. For example, researchers found that incorporating genetic information did not improve doctors ability to predict disease risk for breast cancer, Type 2 diabetes and rheumatoid arthritis [63]. There have been a lot of ongoing discussions about possible reasons for missing heritability [64–66]. It is suggested that genetic interactions, if present, could account for substantial missing heritability [67]. Because GWASs focus on the identification of common variants, it is also plausible that rare variants, of a lower minor allele frequency (MAF) (0.1% ∼ 5%), might contribution to the missing heritability, although much less is known at present [68, 69]. Fig- ure 3.4 gives an illustration.

Figure 3.4: Feasibility of identifying genetic variants by MAF and strength of genetic effect (odds ratio). Most emphasis and interest lies in identifying associations with characteristics shown within diagonal dotted lines. Figure adapted from [68]

The classical single-marker association analysis lacks power to capture as- sociations of variants that are not sufficient frequent, unless their effect sizes are very large, as in monogenic conditions. Alternatively, burden tests have been suggested to assess the cumulative effects of multiple variants in a ge- nomic region so as to increase power. Such methods often summarize the rare variants within a region by a single value, which is then tested for association 3.3. IDENTITY BY DESCENT AND DISEASE MAPPING 27 with the trait of interest [70]. To circumvent the limitations of burden tests of implicitly assuming all rare variants influence the phenotype in the same direction, a score-based variance-component test, the sequence kernel associ- ation test (SKAT) is proposed and has proven successful in association tests of rare variants [71].

3.3.2 Related work Despite the computational challenges, there have been several previous at- tempts to detect multiple-IBD clusters. One of earliest proposed methods to detect IBD segments shared by multiple individuals requires the availability of pedigree information or previous information about IBD sharing patterns [72]. MCMC IBD finder [59] is a more general method that detect shared IBD seg- ments directly from unphased genotype data. Using a HMM model with states of possible IBD segments among the given set of individuals, and the Markov Chain Monte Carlo (MCMC) approach to infer the parameters, MCMC IBD finder has shown higher power to detect short IBD segments compared with pairwise methods. However, it is only applicable to small datasets due to in- tensive computation cost and the limit of pre-specified number of IBD clusters. Moreover, it does not take LD into account and thus is not directly applicable to most of the genotype data available today. With the improved accuracy and efficiency of pairwise IBD detection methods such as GERMLINE [51] and Refined IBD [53], methods have been developed to utilize the identified pairwise IBD segments in order to infer the multiple-IBD structure. Example methods are DASH [60] and IBD- Groupon [61], both of which are built on top of pairwise IBD segments. As LD has been taken into consideration during the pairwise detection stage, the multiple-IBD identification can simply skip the modeling of LD along the genome. The DASH method [60] cuts the haplotypes into windows, within each window, a graph is built based on pairwise IBD relation of haplotypes. it then runs an iterative minimum cut algorithm to identify densely connected haplotypes. The resulting subgraphs satisfying the density threshold is consid- ered as IBD clusters within this sliding window. Such procedure is performed at each sliding window along the genome, while adjusting the parameters of minimum cut algorithm by IBD clusters derived from the previous adjacent window. Similar to DASH, IBD-Groupon [61] also uses external tools such as fastIBD [52] to identify pairwise IBD segments in the prerequisite step. However, it ad- justs the threshold parameters such that fastIBD outputs more IBD segments, even though some of them can be false-positive. It then constructs a HMM to identify the most likely multiple-IBD clusters (which they called group-wide IBD tracts) so as to eliminate false-positive IBD segments. IBD segments are considered as false-positives and are removed if they do not belong to a 28CHAPTER 3. METHODS IN GENOME-WIDE ASSOCIATION STUDIES multiple-IBD cluster, or they are too short and thus have a low likelihood. The depth-first search algorithm and shortest path algorithm are applied to determine the possible number of hidden states, which corresponds to the number of multiple-IBD clusters. Based on a real pedigree in HapMap, the author of IBD-Groupon simulated a dataset of 90 related individuals, consisting 6159 SNPs on chromosome 22, and compared its performance to different methods. The MCMC IBD finder is too slow even for a subset of the data with nine individuals. IBD-Groupon has a similar execution time compared to DASH but slightly higher accuracy. However, the relative performance of DASH and IBD-Groupon has not been evaluated using population data, larger samples sizes or multiple parameter settings. The potential application of inferred multiple-IBD clusters in GWAS is to be explored. The identified IBD segments often correspond to a relatively recent common ancestor and thus are rare, therefore association mapping re- lying on IBD sharing falls into the category of rare variants association. One of the earliest technique embracing IBD sharing in GWAS is IBD mapping in case-control studies [58]. It tests whether the case group has more genomic IBD sharing around a putative causal variant than the control group. Sim- ulation has shown that it might have higher power than association analysis based on SNP data when multiple rare causal variants are clustered within a gene. The authors of the DASH method have also suggested an approach using multiple-IBD clusters, where each inferred multiple-IBD cluster is encoded as a genetic marker and the standard single marker test is performed on these markers. This method is shown to be powerful in an isolated population with abundant IBD sharing and can be applied to case-control studies as well as quantitative traits. However, such a test might still encounter the problem that independent testing of rare variants are often underpowered, and associations of small sized multiple-IBD clusters (thus rare) can be ignored. Their power simulations also show that the optimal power for IBD clusters below frequency 2% is about 0.5, thus methods that can detect weak signals from small multiple-IBD clusters are needed.

3.3.3 Research contributions In this work, we present an Efficient Multiple-IBD (EMI) algorithm to search for multiple-IBD clusters, that is an order of magnitude faster than exist- ing methods. The coalescence simulations show that EMI has comparable performance with respect to the quality of clusters it uncovers. We further investigate the potential application of multiple haplotype IBD clusters in as- sociation studies in the framework of linear mixed models with random effects. In general, EMI builds upon pairwise IBD segments to infer IBD clus- ters. It scans the genome along sliding windows and constructs connected 3.3. IDENTITY BY DESCENT AND DISEASE MAPPING 29 graphs at each window, with nodes representing haplotypes and edges/links representing pairwise IBD relation spanning this region. A connected graph, representing a multiple-IBD cluster, is expanded from seed haplotypes by re- peatedly adding qualified neighbors as long as the density of the graph meets a certain requirement. Although DASH, IBD-Groupon and EMI are all graph-based clustering methods, there is important difference in how clusters are identified. DASH builds clusters from the largest connected component and iteratively divides it into smaller clusters by a minimum cut algorithm, On the contrary, EMI takes an agglomerative approach and builds each cluster from initial seed haplotypes and recursively adds qualifying haplotypes to expand the cluster. EMI also adapts the data structure of priority queues so as to ensure computational efficiency. This step differs from IBD-Groupon, which uses HMM to find the most likely maximal cliques within each chunk of genome to resolve inconsistencies. The procedure IBD-Groupon applies only involves pruning incorrect links, even though adding links can be a better alternative that not only resolves inconsistencies, but also retrieves missing links (false negatives in pairwise IBD detection). The missing links of the graph should not be ignored for multiple-IBD clusters at fine scale, as the state-of-the-art pairwise IBD detec- tion methods only achieve high power (e.g. 0.8) for IBD segments longer than 2 cM in SNP data. Here we give a brief description of the algorithm underlying EMI, the more detailed description can be found in Appendix-C. Given N haploid copies of genome, the construction of multiple-IBD clusters is fundamentally dependent on the presence of pairwise IBD segments. Consider a genomic region that is divided into K consecutive windows of length H, and the minimum density of a cluster denmin. At kth window wink, we are given a set of pairwise IBD segments, based on which a weighted undirected graph Gk = (V,Ek,W k) is k built. The set of nodes V represent N haploid individuals. An edge ea,b ∈ E exists if haplotype a and haplotype b have IBD sharing spanning the current window wink, and its weight is wa,b, representing the confidence score for IBD sharing. The pairwise IBD relation is transitive, thus multiple-IBD clusters are essentially dense subgraphs in Gk that are nearly complete. Given a weighted network, EMI outputs a set of disjoint dense subgraphs following these steps.

1. Seed Selection Two un-clustered nodes with highest weighted degree are chosen as seed nodes, where the weighted degree of a node is defined as the sum of weight of its connecting edges.

2. Cluster Expansion Expand the current cluster C by recursively adding a node with highest 30CHAPTER 3. METHODS IN GENOME-WIDE ASSOCIATION STUDIES

support to C as long as the density of C is above denmin, then output cluster C. The support of a node is defined as the sum of the weight of all edges connecting this node to any node in cluster C.

3. Repeating The above procedure of seed selection and cluster expansion is repeated for the remaining un-clustered nodes until all nodes are clustered or there is no edge left among the un-clustered nodes.

This implementation uses two priority queues. The first priority queue is used to pick the seed node with the highest weighted degree. Once a node has been used in a cluster, it is removed from the queue and the weighted degrees of all its neighbors are decreased accordingly. A second priority queue is used for expanding clusters. Each node adjacent to one of the nodes in the current cluster C is included in the queue and is prioritized based on its support to C. The priority queue is implemented using the data structure of Fibonacci heap to support fast operations of extracting nodes with highest support, insert new node and increase node weights, that correspond to delete- min, insert and decrease-key operations with amortized time complexity of O(log V ), O(1) and O(1) [73]. Because there are at most O(V ) delete-min operations and at most O(E) insert and decrease-key operations, the overall time cost is O(V log V + E). This clustering process is performed at each window independently while tracking information from adjacent windows to maintain efficiency. To be more specific, at window wink, we adopt clusters from the previous window wink−1 and adjust the topology according to the change of edges. For example, new nodes can be added to an existing cluster, an old cluster whose density is below the threshold is dissolved, two highly connected clusters can be merged. Figure 3.5 shows a running example of the algorithm. Neither the heuristics employed by EMI nor the greedy clustering algo- rithm employed by DASH can guarantee the accuracy of resulting clusters. In order to investigate the accuracy in a systematic way, we simulate population data with thousands of samples based on a coalescence model and compare the corresponding performance metrics. The coalescent model with recombination describes genealogies of underlying chromosomes from unrelated individuals and thus allow the investigation under realistic scenarios. We first simulate sequence data under the coalescent model with parameters such as effective population size, mutation rate and recombination rate adjusted to fit the de- mographic history of an isolated population. We then generate simulated SNP array data by thinning the sequence data and selecting variants with MAF within a certain range. The multiple-IBD clustering is performed after the pairwise IBD detection on simulated SNP array data. The power and accuracy of the resulting multiple-IBD clusters is measured by how well they retrieve the true underlying genealogy. More specifically, 3.3. IDENTITY BY DESCENT AND DISEASE MAPPING 31

Figure 3.5: A running example of EMI in two adjacent sliding win- dows. The density cutoff is 0.8, numbers show the edge weights. (A) H3 has the highest weighted degree of 3.79 and is selected as first seed node. H1 is connected to H3 with highest edge weight and thus are selected as the second seed node. The cluster starts with (H3, H1), H2 is added as it has the largest support for the existing cluster, which is 1.85 (0.9 + 0.95). The expansion continues until the density drops below 0.8, output clusters to (A0). (B) In the next adjacent window, one edge is removed from the current cluster (H3, H1, H2, H5) and the density drops below the cutoff. H5 is removed from the cluster and the un-clustered nodes form another cluster, shown in (B0). This figure is presented in Appendix-C Figure 1.

through simulation we keep track of the coalescent tree, which traces the ancestry of sample chromosomes back in time until the most common ancestor. The topology of coalescent tree only changes when recombination happens. Within one sliding window, a true multiple-IBD cluster is any sub-tree of the coalescent tree, that contains all descendants of the root of this sub-tree and spans this window without topology changes. We use two common metrics in hierarchy clustering, the Jaccard measure (Jaccard) and the precision-recall (PR) measure, to compare the inferred clusters and true IBD clusters and access the accuracy of inferred clusters. We use recall rate (RR), defined as the number of haplotypes assigned to the right clusters, to access the power of clustering. 32CHAPTER 3. METHODS IN GENOME-WIDE ASSOCIATION STUDIES

Showing that EMI is able to infer the underlying multiple-IBD clusters with high accuracy, we bring the inferred clusters to the application of GWAS. Each multiple-IBD cluster is encoded as a genetic marker corresponding to its spanning region, i.e., every haplotype in the multiple-IBD cluster is the variant allele at that locus. Since most of the IBD clusters are small, the encoded markers are thus rare and may have small effects on the trait; the standard single variant test will be underpowered. An alternative strategy is to model these rare variants as random effects in linear mixed-effects models. As a locus, consider the linear model

0 0 yi = α0 + α Xi + β Gi +  yi denotes the phenotype for the ith subject, Xi is a vector of covariates associated with fixed effects such as sex, age and eigenvectors for correcting population stratification. Gi = (Gi1,Gi2, ..., Gip) represents the random effects with Gij ∈ {0, 1, 2} the number of haplotypes in cluster j for ith subject. β is a vector of covariates associated with random effects of Gi.  is an error term with mean 0 and variance of σ2. Testing the association of combined effects of p clusters with the trait is equivalent to testing the null hypothesis H0 : V ar(β) = 0. This can be tested with a variance-component score test, which is known to be the locally most powerful test [74].

3.3.4 Results We first evaluate the performance of different clustering methods under differ- ent parameter settings. We did not run IBD-Groupon v1.1 on our simulation data, because IBD-Groupon is not designed for large scale population data and it ran out of memory (> 12G) when we analyzed a simulated dataset of 4000 haplotypes with 3000 SNPs. It is not known if IBD-Groupon is able to handle such large datasets (private contact with the author). Therefore we only compare EMI and DASH for the simulation data and evaluate the qual- ity of clusters in terms of RR and PR (or Jaccard measure). RR measures the proportion of haplotypes that are assigned to a correct cluster, and it is analogous to the definition of power in a hypothesis test. PR or the Jaccard measure represents the accuracy of clusters, which is an analog to 1 minus the type 1 error rate. For association tests, a higher power is often preferred while keeping the type 1 error rate under control. Similarly, in our application, we prefer a higher RR while keeping the PR or Jaccard measure above a certain level. Using grid search, we decide an optimal parameter setting of density threshold denmin = 0.5 and window length H = 200 Kb(see Appendix-C, Table 1), where the highest RR is achieved. We generate a genomic region of 100 Mb for 4000 haplotypes as described above, and run both EMI and DASH on a single processor of a 2.4 GHz computer. 100 windows, 200 kb in length, are randomly selected to evaluate the accuracy of multiple-IBD clusters. The 3.3. IDENTITY BY DESCENT AND DISEASE MAPPING 33 running time and accuracy of multiple-IBD clusters are shown in Table 2 of Appendix-C. Compared with DASH, EMI has slightly lower accuracy accord- ing to the Jaccard measure and PR, but slightly higher power in terms of RR, meaning more haplotypes are assigned to the correct cluster. For the com- puting time, EMI is an order of magnitude faster than DASH, and the time scales better than DASH as the density of clusters increases. We then applied EMI in a Northern Finland Birth Cohort (NFBC) dataset, including 5402 individuals from genetically isolated north Finland regions. Within each window, all the clusters spanning this entire window are consid- ered as random effects in the linear model for association test of low-density lipoprotein (LDL) cholesterol. The variance-component test of random effects in mixed models is conducted using the R package SKAT [71], and a p-value is calculated against the null hypothesis of zero variance of random effects at each window. To correct for multiple testing, we performed 1000 permuta- tions of the trait values while keeping the covariates unchanged. We obtain the experiment-wide distribution of minimal p-values over all loci from the permutation replicates, and calculate the empirical p value as the proportion of minimal p-values from permutations that are smaller than the p-value on the original data. Regions with empirical p-value smaller than 0.05 (corre- sponding to a SKAT p-value of 1.13 × 10−5 in this analysis) are candidate associations. Two regions near the APO cluster and the PCSK9 gene show significant associations to LDL cholesterol (shown at Appendix-C Table 3). In a previously published GWAS on the same dataset, only the association of APO clusters has been reported, but not the PCSK9 gene [75], even though various other independent studies (including a large meta-analysis) have sug- gested that PCSK9 gene has a strong effect on LDL cholesterol [76].

3.3.5 Discussion and future work While most existing methods for IBD detection only consider pairwise IBD segments, we try to infer IBD segments shared by multiple haplotypes and evaluate the results using coalescent trees. Using heuristics and efficient data structures, EMI runs very fast in application of genome-wide data. Accurate inference of IBD segments plays an important role in various ge- nomic studies, ranging from mapping disease genes to exploring recent popula- tion histories. The majority of recent work in the field of IBD has been focused on improving the accuracy and the speed in order to handle increasingly large cohorts of individuals, while targeting shorter genomic segments that originate from a more ancient common ancestor. One obvious extension of our work is to infer short pairwise IBD segments from the multiple-IBD clusters, i.e., an accurate multiple-IBD cluster is fully connected and each edge represents a true pairwise IBD segment. Even though the multiple-IBD clusters we build have erroneous nodes and edges, we can still control the quality of predictions using probability of edges inferred from the clusters. Some of implementation 34CHAPTER 3. METHODS IN GENOME-WIDE ASSOCIATION STUDIES work has been done and is available at https://github.com/aimeida/EMI_dev. However, I did not continue working on this project for two reasons. First, I went abroad for some other projects and thus had limited time working on it. Second, multiple studies have been published the year after our study, where softwares for efficient and powerful inference of short pairwise IBD seg- ments have become available, and we can not immediately see that our idea outperforms others. These recently published studies make inference of IBD segment in differ- ent ways. Parente2 [49] is based on an embedded log-likelihood ratio and employs a sliding-window approach for detecting IBD across the genomes of two individuals. It aggregates informative overlapping windows of non- consecutive, randomly selected markers, and accounts for linkage disequilib- rium by explicitly modeling haplotype frequencies. SpeeDB [77] takes input of a target individual and a database of individuals that potentially share IBD segments with the target. It applies an efficient filter to exclude chro- mosomal segments from the database that are highly unlikely to be IBD with the corresponding segments from the target individual. With the exponential increase in GWAS studies, SpeeDB provides a scalable infrastructure and en- sures efficiency by allowing the continuous addition of samples in a growing database of samples for shared ancestry. PIGS [78] is hybrid approach that is very similar to our potential future work. It leverages the multiple-IBD struc- tures to simultaneously compute the probability of IBD clusters conditional on all pairwise IBD segments. The simulation study shows increased power of identifying short pairwise IBD segments. As for the application of IBD in GWASs, our study has shown one pos- sibility in the framework of linear mixed model. However, researchers have not agreed on an optimal way of applying IBD in GWASs and further studies on this topic are needed. For example, a recent GWAS of a Scandinavian multiple sclerosis (MS) cohort concludes that IBD mapping is not sufficiently powered to identify MS risk loci even in ethnically relatively homogeneous populations, possibly due to the fact that rare variants are not adequately present [79]. Yet another study shows improved power in GWAS with a large number of family-based controls by incorporating local IBD in the Quasi Like- lihood Score statistic [80]. 3.4. SUMMARY 35

3.4 Summary

Related to GWAS, I have worked on developing methods that can be more powerful than single marker test under certain conditions, such as network analysis and IBD mapping. Besides that, I have also been involved in a few data-driven projects, such as finding different methylation patterns in cancer samples (work with Carsten Wiuf), analysis of RNA-seq data in exosome re- gions (work with Torben H. Jensen group), annotation of noncoding variants using motif position weight matrix (work with Daníel Guðbjartsson at de- CODE), coordinate descent algorithm in calculating mixed model association statistics (work with Daníel Guðbjartsson at deCODE) and etc. However, some projects did not come out with promising results and some involve sen- sitive data that I am not allowed to publish (work with deCODE), therefore they are not covered in this thesis. In summary, GWASs have rapidly become a standard method for disease gene discovery during the past decade. A lot of efforts have been made in this field, with the hope that GWAS would provide an effective and unbiased approach to reveal the risk alleles associated with phenotypic variation. The power of GWAS rests on several fundamental assumptions: (i) additive ge- netic variation is abundant, (ii) causal polymorphisms have sizeable effects and (iii) these polymorphisms have moderate frequencies or can be effectively tagged by polymorphisms that do. Recently each of these of assumptions has been questioned and it leads us to the Post-GWAS era [81], where there is a wide interest in combining population genetic models and molecular bio- logical knowledge, one recent example is a study that estimates the relative pathogenicity of genetic variants using diverse biological annotations. Another growing interest in Post-GWAS era is to move towards sequencing samples, making it possible to look into a wider spectrum of genetic variants, such as structural variants. I will describe one type of such variants in the next chapter.

Chapter 4

Alu polymorphism discovery from next-generation sequencing data

This project started during my stay in deCODE genetics, Iceland, March 2014. I continued working on it until this month. I implemented the algorithms, which are freely available at my github account, https://github.com/aimeida/AluDetection. The manuscript is under preparation. My co-authors of this project, Birte Kehr and Bjarni V. Halldórsson helped with providing code for simulation data and running the algorithms on Icelandic population data.

4.1 Background

The advent of next-generation sequencing technologies has made it possible to study the structural variations of human genome, as a basis for studying human evolution and disease. One important class of structural variations are Alu elements, which are short interspersed nuclear elements (SINEs) that occur frequently in the human genome as well genomes of other , Alu elements have inserted in genomes throughout the evolution, and have propagated to more than one million locations over the last 65 mil- lion years (mya). They are the largest family of mobile elements in the human genome, comprising more than 10% of the genome. The majority of Alu am- plification occurred early in primate evolution, and the estimated current rate of Alu retroposition is approximately one new Alu insertion per generation, which is at least 100-fold slower than the peak of amplification that occurred 30–50 mya ago [82, 83]. An Alu element is an approximately 300 bp long sequence derived from the 7SL RNA gene, separated by a short A-rich region. Often the 30 end has an A-rich region that plays a critical role in its amplification mechanism [84].

37 CHAPTER 4. ALU POLYMORPHISM DISCOVERY FROM 38 NEXT-GENERATION SEQUENCING DATA Although Alu elements do not encode genes, they may have functional im- portance. They may assist the creation of structural variants [85], and it has been generally recognized that Alu elements affect genome evolution, cellular processes, DNA methylation and transcriptional regulation [86–89]. The im- portance of Alu elements is further highlighted by the potential association with genetic instability, one of the principal causative factors in many disor- ders [88, 90, 91]. For more information about Alus, readers are referred to recent reviews [92, 93]. The Alu elements in the genome can be divided into two broad categories, i.e., fixed Alus and polymorphic Alus. The fixed Alus are generally evolution- arily older and thus present in the entire population, their locations are largely known from a reference genome. The polymorphic Alus have been inserted re- cently and are only present in a subset of the population, the locations of these novel inserted Alus are not known a priori and are of greater interest for us. Various methods have been described for the detection of structural variations, but only few have focused on the detection of Alu polymorphisms. The purpose of this chapter is to introduce a new method for detecting Alu polymorphisms. The detection of polymorphic Alu elements can be divided into two parts, Alu Deletion and Alu Insertion. If one Alu element exists in the reference genome, but is missing in the individual(s) being sequenced (donor genome), we call it an Alu Deletion problem, meaning that one Alu element is deleted with respect to the reference, even though this Alu element has most likely been inserted during evolution. Similarly, Alu Insertion is defined as one Alu element is inserted with respect to the reference, meaning it does not exist in the reference genome, but in the donor genome. See Figure 4.1 for example of Alu Deletion and Figure 4.2 for example of Alu Insertion. While a large number of methods have been developed to determine SNPs and copy number variants from sequencing data, comparatively fewer methods have been developed for determining Alu polymorphisms, with notable excep- tions of Alu-Detect [94], VariationHunter [87, 95] and RetroSeq [96], some of which have recently been successfully applied to real data. The existing methods focus mainly on detection of Alu Insertion, and in principle follow a three-step analysis, 1. For each genome, identify informative fragments (reads or read pairs) that indicates occurrence of Alu insertions. 2. Construct clusters of fragments along the genome, such that each cluster might include a potential insertion. 3. For each genome, calculate the possibility of Alu insertion at each cluster if there are informative reads. In this chapter we present a new tool to detect Alu polymorphism on a population-scale. It is called PAIR2 and differs from other tools mainly in the 4.1. BACKGROUND 39 last two steps. We define a breakpoint as the position at the genome where one Alu is inserted with respect to reference. At step (2), PAIR2 is able to construct clusters jointly from multiple individuals and pinpoint the precise breakpoints, instead of only analyzing each individual independently with ap- proximate breakpoints. Pooling information across the samples is potentially more powerful to detect polymorphic Alus with lower frequency, and high res- olution of breakpoints is useful in variants classification and annotation. At step (3), PAIR2 applies a probabilistic framework for calling genotypes and is able to differentiate between homozygous and heterozygous calls, while other tools either miss support for homozygous calls or make calls simply based on a strict filtering of the counts of supporting fragments. PAIR2 inherits the idea of utilizing insertion length of paired-end reads at polymorphic loci from our previous work, PAIR [97], however, the main part of the work flow, such as genotype calling and breakpoints identification is different from PAIR. As PAIR does not support joint analysis of multiple individuals for Alu insertions, we will skip the comparison of PAIR and PAIR2 in this study. The other contributions of PAIR2 include, (i) it is almost parameter-free, with most of the parameters are automatically inferred from the input data. (ii) it is a stand-alone package implemented using the SeqAn C++ library [98] and can be installed without requirement of other external tools. CHAPTER 4. ALU POLYMORPHISM DISCOVERY FROM 40 NEXT-GENERATION SEQUENCING DATA 4.2 Methods

The input of PAIR2 is a reference genome (e.g. NCBI human genome) and a binary alignment (BAM) file of paired-end sequencing reads of a donor individual (or a set of individuals). The genome sequence of the reference is known and highly similar, but not identical, to the donor genome. We start by giving some definitions that will be used in what follows; A read pair r has a left read rL and right read rR, and they are mates to each other, denoted as mate(rL) = rR and mate(rR) = rL. We use rN to denote either a left or a right read if the relative position in the pair is not important. If rN is uniquely mapped to the reference genome, we use begin(rN ) and end(rN ) to represent its start and end position in the mapped reference genome. Paired-end sequencing implies that a proper mapped read pair has its left and right read mapped to opposite strands and to a proximal location. We define the insertion length of a read pair r, measured with respect to the refer- ence genome, as Y (r) = end(rR) − begin(rL). We approximate the empirical distribution of Y from all proper mapped read pairs, as well as its mean E[Y ] and standard deviation σ(Y ). In practice, we stratify Y by different sequenc- ing libraries, because different libraries may have different expected insertion length. In BAM file, the information of different libraries is recorded in the read group (RG) tag.

4.2.1 Alu Deletion Given sequencing data of a single individual in a BAM format and a set of Alu regions in the reference genome, we want to identify the existence of Alu Deletion at each Alu region. A set of Alu regions in a reference genome can be determined using various sources, such as RepeatMasker (http://www.repeatmasker.org). At each Alu region in the reference genome, we call a haplotype H1 if the Alu is deleted with respect to the reference genome, and H0 otherwise. Since each diploid individual has 2 haplotypes, the goal is to compute the probability of three different genotypes: homozygote nonAlu (G0), heterozygote Alu (G1) and homozygote Alu(G2). We consider one region at a time. Figure 4.1 shows an example of Alu region, bounded by a left breakpoint AL and a right breakpoint AR. The bounds of the Alu region are also defined as breakpoints, because the align- ment of a read, which comes from a H1 haplotype, to the reference genome will be stopped at the breakpoints. We restrict our attention to flanking regions [F L, AL] and [AR, F R] on both sides of the Alu, where we choose |FR − AR| = |AL − FL| = 1.5E[Y ], where E[Y ] is the average insertion length of read pairs, as defined previously. We then construct a set R, where each element r ∈ R is a read pair with at least one read overlapping the re- 4.2. METHODS 41 gion [F L, F R], that is, either begin(rN ) ∈ [F L, F R] or end(rN ) ∈ [F L, F R] holds. We define informative reads as reads that contain information of the underlying polymorphic pattern at this region, it is thus trivial to conclude that any informative read is also a member of set R. Alu Deletion is done in two steps, we first search for all the informative reads at set R, and then call genotypes based on these informative reads.

FL ALALU AR FR A) reference

r1 , KC r2 , U r3 , U r4 , KC r5 , U r6 , *

FL ALALU AR FR B) reference

r1 , KS r2 , * r3 , KS r4 , KS r5 , U

Figure 4.1: Example of an Alu Deletion Arrows show read directions, blue part of the reads can be mapped to the reference outside of the Alu, and red part can be mapped to the Alu. (A) shows example reads from a haplotype of H1,(B) shows example reads from a haplotype of H0. A heterozygote diploid can have reads shown in both A and B

4.2.1.1 Informative reads In the context of Alu Deletion, we consider only proper mapped read pairs while searching for informative reads. There are primarily two signals indi- cating reads coming from H1. The first is that a read is split, containing one part from each side of the Alu. The second is that each read in a read pair is mapped to different sides of the Alu and thus has an increased insertion length with respect to reference genome, that is, the insertion length follows distribution of Y + lAlu instead of Y , with lAlu the length of the Alu. To distinguish such signals from the ones that support H0, we classify a read pair r into different types, type(r) ∈ {KC, KS, U, ∗}.

• KC. (Known-Clipped). If a read comes from H1 and happens to cover both AL and AR while mapped to the reference, we will see that this read is soft-clipped at one breakpoint in the BAM file, though the clipped part can be remapped to the other side of Alu, starting from the other breakpoint. r1 and r4 in Figure 4.1(A) are such examples. We classify r as KC if rN can be aligned to the reference on both sides of the Alu with the alignment length longer than a threshold. KC provides evidence for H1. CHAPTER 4. ALU POLYMORPHISM DISCOVERY FROM 42 NEXT-GENERATION SEQUENCING DATA • KS. (Known-Spanning). r is of KC type if rN is mapped to the ref- erence on both sides of one breakpoint. Figure 4.1(B) gives 3 examples, r1, r3 and r4. Similar to KC read pairs, we also require rN mapped to the reference on both sides of the Alu breakpoint with the alignment length longer than a threshold. KS show evidence of H0.

• U. (Unknown). A read pair r does not fall into category of KC or KS, but covers both of the two breakpoints, meaning both begin(rL) < AL and end(rR) > AR holds. r5 in Figure 4.1(A) and r5 in Figure 4.1(B) are such examples, the insertion length indicates its origin of H0 or H1.

• ∗. A read pair r that does not belong to the above mentioned types provides no useful information about the existence of Alu polymorphism, and thus is not used in our algorithm. Examples are r6 in Figure 4.1(A) and r2 in Figure 4.1(B).

The above mentioned read-to-reference alignment is implemented using Smith Waterman algorithm [99], allowing mismatches and small gaps, and we require each successful alignment a minimal length of 20 bp, with the align- ment score satisfying a certain threshold. This strict requirement is similar to other programs [94, 96]. In summary, KS and KC reads are breakpoint-overlapping reads and thus informative, whereas U reads are informative by their insertion length. All these types of reads will be used to determine the genotype calling.

4.2.1.2 Determining genotype Using breakpoint overlapping reads Reads overlapping breakpoints give strong evidence for Alu polymorphisms, for example, KC read pairs are most likely from H1 haplotypes and KS read pairs are most likely from H0 haplo- types. However, due to misalignment or sequencing error, this classification may be incorrect, we set a parameter PE, in our experiments chosen as 0.001, representing the probability that this assumption is incorrect. We can then compute the probability of observing a read pair of KC or KS type given the haplotype as:

P (r = KS|H0) = P (r = KC|H1) = 1 − PE

P (r = KC|H0) = P (r = KS|H1) = PE

Using length distribution Read pairs overlapping an Alu coming from a H0 haplotype have an insertion length Y (r), while reads coming from a H1 haplotype will be mapped lAlu further apart and have an insertion length 4.2. METHODS 43

distribution of Y (r) + lAlu. If we let `(r) be the insertion length of a read r then we can derive the probability of observing a read pair U as:

P (r = U)|H0 ∼ Y (r)

P (r = U)|H1 ∼ Y (r) + lAlu Due to strict requirements in the classification (e.g., at least 20 bp mapped to reference on both ends), false negative KC or KS read pairs will be ignored by many other tools, even though they do provide weak evidence polymor- phism. However, these read pairs will be used in PAIR2 if they are of type U, because their insertion lengths are informative.

The likelihood At a given Alu location, assume each read pair in the set R is independent, we consider only read pairs satisfying type(r) ∈ {KC, KS, U}. The likelihood of the observations given true genotype Gg, g ∈ {0, 1, 2} thus follows: Y P (R|Gg) = P (r|Gg) r∈R Y = {P (r|H0)P (H0|Gg) + P (r|H1)P (H1|Gg)} r∈R Y = {P (r|H0)P (H0|Gg) + P (r|H1)(1 − P (H0|Gg))} (4.1) r∈R where P (r|H) is given in above, and we have

P (H0|G0) = P (H1|G2) = 1 Assuming uniform sequencing coverage, and a read length of krk, e.g., 100 bp, we get the estimate

P (H0|G1) = (lAlu + 2 ∗ krk)/(lAlu + 4 ∗ krk) Plugging these into equation (1) , the corresponding maximum-likelihood estimator gives the genotype call.

4.2.2 Alu Insertion Detecting Alu insertions is a more difficult problem than Alu deletions, as potential insertion positions (breakpoints) are not known a priori. When con- sidering multiple individuals simultaneously, the precise breakpoints shared by all the individuals having that Alu insertion should be known in order to make accurate genotype calls. For Alu Insertion problem, we first select reads that indicate occurrence of Alu insertions, and then infer breakpoints at regions with potential insertions. Finally, we insert a consensus Alu element in silico at each breakpoint and apply the Alu Deletion algorithm for genotype calling. The details of the implementation are given in the following. CHAPTER 4. ALU POLYMORPHISM DISCOVERY FROM 44 NEXT-GENERATION SEQUENCING DATA

A) FL AL(R) FR reference

r1 r2 r3 r4

B) FL AL(R) FR reference

? r1 ? r2 ? r3 ? r4

Figure 4.2: Example of an Alu Insertion An Alu is inserted to the reference genome at breakpoints AL(R), arrows show read directions, blue part of the reads can be mapped to the reference and red part are clipped or mapped to somewhere else in the reference. (A) shows example reads from a haplotype of nonAluI,(B) shows example reads from a haplotype of AluI. A heterozygote diploid can have reads shown in both A and B

4.2.2.1 Informative reads A read pair r is called a discordant pair if it has discordant insertion length. This happens when either one of the two reads are mapped at the same chro- mosome with abnormal insertion length ( e.g., |Y (r)−E[Y ]| > 3σ(Y ) ) or two reads are mapped to different chromosomes. A read rN is called a Alu mate if r is a discordant pair and mate(rN ) is mapped to an Alu element. There are mainly two signals indicating the presence of an Alu insertion. The first is a discordant read pair with one read mapped to a known Alu region, we call it IAlu. The known Alu regions can be obtained from public database or tools, such as RepeatMasker. Examples of IAlu pairs are r1, r3 in Figure 4.2(B). IAlu read pairs are useful to infer approximate regions of insertions, as been done by many tools [94, 96, 100]. The second signal is a split read where only the part at one side of the breakpoint is mapped to the reference genome, denoted as IClip. Examples are r2 and r4 in Figure 4.2(B). Similar to KC read pairs in Alu Deletion problem, IClip reads are also often labeled as soft-clipped reads in the BAM input. Though IClip reads are useful to pinpoint the precise breakpoints of Alu insertions, relatively fewer tools utilize such information [94].

4.2.2.2 Insertion regions for single individual We start by reading the BAM file and store all IAlu read pairs. For any Alu mate in a IAlu read pair, we label it as la read if the mapping and orientation to the reference implies that it is to the left of an inserted Alu, and ra read if it 4.2. METHODS 45 is to the right of the Alu. Both la and ra reads give partial information about the location of Alu insertions. We say that an Alu position, α, where a novel Alu is inserted, covers a la read, rN , if α ∈ [end(rN ), begin(rN )+E[Y ]+3σ(Y )] holds, and it covers a ra read if α ∈ [end(rN ) − E[Y ] − 3σ(Y ), begin(rN )]. For each la or ra read, we want to either cover it with an Alu position or declare it as an error read. We define a constant k to be the relative cost between the two. For each individual, we find a set of la and ra reads from all IAlu read pairs, and we want to find a minimal set of Alu positions while maximizing the number of la and ra reads covered by these positions. This problem is formulated as:

Input A set L of la reads and a set R of ra reads Output A set A of Alu positions, a set E of errors Objective min(|E| + k|A|) Constraints Each la ∈ L and ra ∈ R is either in E or covered by α ∈ A.

For example, we set k = 3, meaning that if three la or ra reads are found that can be covered by a single Alu insertion event, we prefer to insert an Alu rather than to assign error labels to these reads. The most general version of this problem reduces to a set covering problem, which is hard to even approxi- mate, as mentioned in [97, 100]. However, as the reads are linearly arranged on the chromosome, the problem reduces to set covering on interval graphs which can be solved in polynomial time using e.g. dynamic programming [101]. However, sequencing data can be erroneous in practice, especially in repet- itive regions, thus we are only interested in calling Alu Insertion with high confidence. We introduce an optimal solution to search for regions likely to contain an Alu insertion. Define support(p) as the number of la and ra reads covered by position p, we make a single pass through the genome and find all positions satisfying support(p) > n, e.g., n = 2. This step can be done in O(m log m + g) time, where m is the total number of la and ra reads and g is the size of genome. First we sort all la and ra reads by position, which takes time O(m log m), then we move along the genome and search for positions passing support(p) > n while updating two queues of la and ra reads respectively on the fly, this step has linear time complexity of O(g). Consider an Alu position α and its surrounding region [α − l, α + l] (e.g., l = 1000 bp ). Assuming that no other Alu exists in this region and the mapped reads are evenly distributed, error-free, it is easy to see that the sum of la and ra reads monotonically increase from α − l to α and decrease from α to α + l, and the maximum number of la and ra reads is in the vicinity of α. CHAPTER 4. ALU POLYMORPHISM DISCOVERY FROM 46 NEXT-GENERATION SEQUENCING DATA Based on this observation, we then create a set of non-overlapping regions, denoted as {(bi, ei), i = 1, 2,... }, such that each region might contain one Alu insertion and the sum of la and ra reads covered by each region is maximized in its vicinity. We also add constraints of the maximum length of each region to assure that at maximum one Alu insertion occurs in (bi, ei). It is important to point out that at this step the position of breakpoint is only approximate, and its resolution is at the scale of the region length, e.g., several hundred base pairs. We will use IClip reads to refine the precise breakpoints in (bi, ei) in later steps.

4.2.2.3 Insertion regions for multiple individuals To study multiple individuals simultaneously, we need to first determine the breakpoints of Alu insertions that are shared by a subset of individuals, and then call genotypes for each of them. We start by searching small regions where one (and at maximum one) potential novel Alu is inserted at this region. We do this by pooling the sets of non-overlapping regions from each individual, sort them by position, and then merge all overlapped and nearby regions. In the end, we generate a new set of non-overlapping regions, {(bi, ei), i = 1, 2,... }, such that each region is likely to contain one Alu insertion, indicated by at least one individual. In practice, we define an upper bound for the length of each region (bi, ei), such that at maximum one Alu insertion can occur in each region.

4.2.2.4 Identifying precise breakpoints We want to identify the precise breakpoints shared by all polymorphism car- riers, such that we do not falsely label one Alu insertion as two different inser- tions at nearby different locations, which can affect the downstream analysis.

Difficulties caused by Target Site Duplication The difficulties in deter- mining the precise breakpoints come from the fact that an Alu is often inserted to the genome through a process called Target-Primed Reverse , that involves two distinct single-strand breaks, thus the final DNA sequence contains two identical copies of Target Site Duplication (TSD), a sequence of 4 - 25bp, just before and just after the newly inserted Alu element [92]. However, it is not always the case that we observe two copies of TSD at the inserted Alu element. The reference genome is also likely to contain 0 or 1 copy of TSD. We define the left breakpoint as AL if it is indicated by a IClip read whose left part is mapped to reference and right part is soft-clipped, and define the right breakpoint as AR if indicated by a IClip read whose right part is mapped to the reference. Due to TSD, AL is not always equal to AR, as illustrated in Figure 4.3. In Figure 4.3(A), one TSD exists in the reference 4.2. METHODS 47 genome, and the another TSD is attached to the end of Alu to be inserted. r1 indicates breakpoint AL and r2 indicates AR, as two TSDs are identical, we have AL > AR, even though r1 and r2 indicate the same Alu insertion event. In Figure 4.3(B), an Alu is inserted into the reference with two TSDs attached at the beginning and the end of the Alu sequence, thus we have AL equals to AR. We define a breakpoint pair as (AL, AR), and use it interchangeably with the term of breakpoint in the following text. It is not always the case that we observe both AL and AR in the Alu insertion region, we might observe only one of them based on sequencing data.

A) AR AL reference

? r1 ? r2

donor

r1 r2

B) AL(R) reference

? r1 ? r2

donor

r1 r2

Figure 4.3: Example of TSD in Alu Insertion region The top shows the reference and the bottom shows the donor. Red nucleotides represent Alu element, grey shaded blocks represent TSD. AL and AR are breakpoints indicated by split reads mapped to the left and right side of the insertion, the distance between them |AR − AL| always ranges from 0 to 50 bp, In (A), one TSD is present in the reference, and another copy of TSD is attached to the end of the inserted Alu. Two TSDs are identical, so r1 and r2 indicates different breakpoints of AL and AR respectively. In (B), two TSDs are attached to the beginning and end of the inserted Alu, so r1 and r2 indicates the same breakpoints. CHAPTER 4. ALU POLYMORPHISM DISCOVERY FROM 48 NEXT-GENERATION SEQUENCING DATA

Breakpoints at one region Given any region (bi, ei) obtained from the previous step, we declare that this region contains one Alu insertion if an unique breakpoint pair can be determined, otherwise, this region is discarded for further analysis. Ideally, all polymorphism carriers having IClip reads in this region will point to one single breakpoint. However, this is often not the case mainly due to two reasons. (i) Some split reads are merely artifacts of sequence and/or mapping errors, and indicates false breakpoints. (ii) The existence of TSD means AL and AR can differ up to 50 bp (Length of a TSD is almost never longer than 50 bp [102] ). The first factor requires distinguishing true breakpoints from the false ones, and it can be partially solved by using only good IClip reads. For example, one IClip read that has good sequencing quality score and can be aligned to the reference with good alignment score indicates a most likely true breakpoint. The second factor is however largely ignored by most studies, and the fact that TSDs vary in both length and composition further complicates this picture. Therefore we introduce a voting system (to be explained in next section) to determine the true breakpoint in one region, based on multiple breakpoints suggested by different IClip reads. The workflow at region (bi, ei) looks like this. (1) IClip reads from all individuals at this region are extracted and remapped to the reference using a split-mapping algorithm, allowing part of the read mapped to the reference, the other part mapped to an Alu consen- sus sequence. The position where the mapping jumps from reference to Alu consensus sequence indicates the breakpoint. (2) Each individual puts all its IClip reads to the voting system, resulting in 0, 1 or 2 valid breakpoints, where 0 means that this individual is not likely to be a polymorphism carrier. (3) The valid breakpoints from all the individuals are collected as input to a new voting system, resulting in 0, 1 or 2 valid breakpoints, where 0 means no unique breakpoint can be determined at this region and thus the Alu insertion does not exist. This voting system gives accurate inference of (AL, AR) at region (bi, ei). We restrict each individual to have at maximum 2 votes, therefore the voting is fair, i.e., no individuals with extreme high coverage can dominate the inference of breakpoints.

The voting system Among the multiple breakpoints suggested by differ- ent IClip reads, we want to find out the true, and therefore (hopefully) the most common breakpoint pair (AL, AR) through a voting system. We use parameter f ( set as f = 0.7 ) to guarantee that the final vote is common, and parameter x (set as x = 50), to match the factor that AL and AR can differ due to TSD but abs(AL − AR) < 50 always holds. The voting is formulated as the following problem. Given a list of integers N = {n1, n2, . . . , nm}. Ns is a subset of N, we say Ns is a valid subset iff 4.2. METHODS 49

abs(ns − nt) <= x holds ∀ns, nt ∈ Ns. N has at least one largest valid subset Nms, whose size kNmsk is the maximum size among any possible valid subset of N. We say list N has a valid vote iff kNmsk/kNk >= f, and a valid vote is the most (or the second most) common element in Nms.

4.2.2.5 Inserting Alu at breakpoints

If the breakpoints can be identified in region (bi, ei), an Alu element will be inserted at the breakpoint and then Alu Deletion algorithm is used to call genotypes. The inserted Alu can be either a consensus Alu element from public database, or a consensus sequence built with IClip and IAlu reads at each loci. To build a consensus Alu sequence using reads overlapping the breakpoints, we closely follow the SeqCons method [103] and largely re-uses the SeqCons code that is provided with the SeqAn C++ library [98]. Specifically, we start by computing all-against-all overlap alignments of the read sequences of one Alu insertion at a time using a standard dynamic programming algorithm with free end-gaps. We filter the alignments by applying a length threshold and re- quire a minimal percent identity. Next, we build an alignment graph [104] from the pairwise alignments, apply a triplet extension for promoting consistency of the alignments [105–107], and finally align the sequences progressively [108] along a UPGMA [109] guide tree. For more details of these steps, we refer to the original publication of SeqCons [103]. Finally, we realign the reads with the Anson-Myers algorithm [110] to obtain an accurate majority vote consensus sequence. However, building the consensus sequence can be time consuming when sample size is large. Alternatively, we insert a consensus Alu element of AluY family into the reference genome before applying the Alu Deletion algorithm. This gives equally good results. We want to remind readers that after one Alu sequence is inserted to the reference, all the informative reads are translated to their corresponding informative reads in the Alu Deletion algorithm. For example, one IClip read in Alu Insertion is equivalent to one KS read in Alu Deletion. Therefore the method of genotype calling also follows the one described in Alu Deletion algorithm.

4.2.3 Simulations In order to assess the accuracy of our approach, we simulated two sets SimDel and SimIns of sequencing data based on human chromosome 21 (build 37) from 100 diploid genomes. In SimDel, we modeled recurring Alu deletions by selecting 100 known Alu elements on the reference chromosome 21, assigning a frequency to each of them, and deleting the Alu elements at these frequencies from 200 copies of chromosome 21, which resulted in 200 haplotypes. In CHAPTER 4. ALU POLYMORPHISM DISCOVERY FROM 50 NEXT-GENERATION SEQUENCING DATA SimIns, we modeled recurring Alu insertions by first deleting 100 known Alu elements from the reference chromosome 21, assigning a frequency to each of them, and inserting the Alu elements at these frequencies back into 200 copies of the modified reference, which again resulted in 200 haplotypes. With SimIns, we used the modified reference that has all 100 Alu elements deleted in all further steps of the analysis. In both sets, the frequencies of the Alu element deletions or insertions are uniformly distributed between 0 and 1, and the Alu locations were chosen randomly on the chromosome, with the constraint that no other Alu is found within 600 bp of the inserted Alu. From the Mason reads simulation package, version 2.0 [111], we used the MasonVariator to add SNPs and small indels to each haplotype and the sim- ulator to generate 5 Million paired-end Illumina reads for each haplotype (see Fig. 4.4 for details). We merged haplotypes in pairs to obtain 100 diploid data sets for SimDel and 100 diploid data sets for SimIns.

Figure 4.4: The commands used to simulate sequencing reads for 200 haplotypes of our SimIns data set The input file chr21_del.fa contains human chromosome 21 with 100 Alu elements deleted, and the input file hap.vcf contains insertion records for a subset of these regions (see text for details). was chosen between 0 and 199.

mason_variator --seed --in-reference chr21_del.fa --out-vcf hap.vcf \ --snp-rate 0.0001 --small-indel-rate 0.000001 --max-small-indel-size 6 \ --sv-indel-rate 0 --sv-inversion-rate 0 --sv-translocation-rate 0 \ --sv-duplication-rate 0

mason_simulator --seed --read-name-prefix "hap:|sim:" \ --num-fragments 2904970 --fragment-mean-size 500 --illumina-read-length 101 \ --illumina-prob-insert 0.0001 --illumina-prob-deletion 0.0001 \ --input-reference chr21_del.fa --input-vcf hap.vcf \ --out reads.1.fastq --out-right reads.2.fastq --out-alignment truth.bam 4.3. RESULTS AND DISCUSSION 51

4.3 Results and discussion

In this section, we present results of running PAIR2 on both simulated data and a human trio from the 1000 Genomes Project, along with comparison of results from other studies. Among the state-of-the-art tools we want to compare with, we did not choose VariationHunter [95], because it requires DIVET format as the input of mapped reads, while we only have BAM files generated from BWA [112] at hand. Generating DIVET files requires another alignment tool, which can lead to an unfair comparison due to different alignments. We did not run Alu- Detect [94] either, because it is not able to differentiate between homozygous and heterozygous calls. We chose RetroSeq, a tool specialized for transposable element discovery for the analysis.

4.3.1 Simulated data The details of simulations are described in the Method section. For both SimDel and SimIns data, 110 polymorphic Alus are simulated, and each is inserted to a population of 200 haplotypes randomly depending on their fre- quency. These 200 haplotypes are then randomly paired to construct 100 diploid individuals. Both SimDel and SimIns are simulated at both (∼10×) and (∼25×) average sequencing coverage, For any simulated polymorphic Alu, each individual can have 0, 1 and 2 copies of them, corresponding to genotype of G = 0, G = 1 and G = 2, with G = 1 heterozygote and G = 2 the homozygote Alu. These polymorphic Alus are set as the known ground truth. The resulting data has an average of 36 heterozygotes, 34 homozygotes per individual for SimDel, and 37 heterozy- gotes and 33 homozygotes for SimIns, see Table 4.1 for the actual Alu counts simulated. We then run SimDel independently for each individual, and run SimIns jointly for multiple individuals.

G = 1 G = 2 data sum min max sum min max SimDel 3653 25 44 3424 25 43 SimIns 3753 26 47 3366 26 44 Table 4.1: Simulated Alu counts of 100 individuals The sum column is total counts of simulated Alu, the min and max column are the minimum and maximum number of Alu elements seen in one simulated individual

We compare PAIR2 and RetroSeq using use SimIns data, but not SimDel data, because the general interest is to identity novel Alu insertions, and the functionality of the Alu Deletion included in PAIR2 is not available in the tools we considered. CHAPTER 4. ALU POLYMORPHISM DISCOVERY FROM 52 NEXT-GENERATION SEQUENCING DATA The predicted genotype calls are compared to the truth. To measure the accuracy of distinguishing heterozygous and homozygous calls, we stratify prediction counts to different groups of Ctp. Given t ∈ {0, 1, 2} the true genotype and p ∈ {0, 1, 2} the predicted genotype, Ctp represents the number of predicted calls of p, when the true underlying genotype is t. For example, C01 and C02 are both false positive counts. If we define the number of true positive calls with tolerance of genotype calling error, that is, TPN = C11 + C22 + C12 + C21, we can derive the calculation of Sensitivity as

Sensitivity = TPN/(TPN + C10 + C20) and false discovery rate (FDR) as

FDR = (C01 + C02)/(TPN + C01 + C02) The summary of predicted Alu calls is shown in Table 4.2, For the Alu Deletion run on SimDel data, the sensitivity increases from 85.8% at 10× cov- erage to 98.1% at 25× coverage, which is not surprising since more reads often provide more information of Alu polymorphism. For the Alu Insertion run on SimIns data, high coverage data of 25× also results in better performance than 10× coverage data. PAIR2 performs consistently better than RetroSeq, as measured by sensitivity and FDR, with a much higher accuracy of genotype calling.

TP FN FP GE coverage data tool Sensitivity FDR C11 C22 C10 C20 C01 C02 C12 C21 SimDel PAIR2 2668 3350 985 19 0 0 0 55 85.8% 0 ∼ 10× SimIns PAIR2 3152 3017 601 341 0 0 0 8 86.8% 0 SimIns RetroSeq 530 2505 1347 490 999 1119 1876 371 74.2% 28.6% SimDel PAIR2 3521 3342 132 0 12 0 0 82 98.1% 0.2% ∼ 25× SimIns PAIR2 3269 3041 484 322 0 0 0 3 88.7% 0 SimIns RetroSeq 2302 261 1191 520 1913 79 260 2585 76.0% 26.9%

Table 4.2: Summary of predicted Alu counts Ctp represents the number of polymorphic Alu predicted as of genotype p while the true underlying geno- type is t. The counts of Ctp are further grouped into 4 types, named as TP (True Positive), FN (False Negative), FP (False Positive) and GE (Genotype- calling Error). The definition of Sensitivity and False Discovery Rate (FDR) is given in main text.

The overall statistics based on simulated data shows that PAIR2 is very conservative to make Alu calls, and the FDR is close to 0. It is because PAIR2 applies strict quality control in selecting the informative reads, and it only makes calls if the precise breakpoints can be determined (for Alu Insertion). Simply adjusting the controls (internal parameters in PAIR2) to allow more calls leads to an increased sensitivity (data not shown), however, we decide not to do so, because the real data is not as clean as simulated data, and we do not want to have too many false positive calls. 4.3. RESULTS AND DISCUSSION 53

4.3.2 CEU trio of 1000 Genome Project Several studies evaluate their method on the public data from the 1000 Genome Project [94, 96], for a fair comparison, we also run PAIR2 on a CEU trio data(father NA12891, mother NA12892 and the female offspring NA12878) with default parameters. This high depth (> 75×) Illumina HiSeq data is produced at the Broad Institute, and can be downloaded from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20120117_ceu_trio_b37_decoy/. This trio was previously part of a 1000 Genomes pilot project follow-up study [113], however, the sequencing data available at that time was of lower sequencing coverage ( 9 − 16×) for each genome. 186 loci were randomly selected for PCR validation, and the precise breakpoints were also given. Ret- roSeq uses these validated PCR calls as standards and calculate the average sensitivity of 98% for each sample, doing the same calculation on PAIR2 re- sults gives a similar average sensitivity, see Table 4.3. However, the true number of novel Alu insertions is unknown, calculation based on a small set of validated loci will lead to an inflated sensitivity.

PCR (total) PAIR2 RetroSeq Sample Total PCR distance(bp) Total PCR distance(bp) NA12878 165 1441 162 2.1 1038 162 16.2 NA12891 142 1432 138 2.1 1046 139 17.6 NA12892 152 1405 150 2.2 1078 148 16.5 Table 4.3: Predicted and validated Alu Insertion calls for the CEU trio. The PCR(total) column are the total number of PCR validated Alu insertion calls by [113] for each sample, the PCR columns are the number of validated calls that are also predicted by PAIR2 or RetroSeq. The distance(bp) column shows the average distance in between the predicted break- point and the true breakpoints reported by PCR.

In fact, PAIR2 reports more Alu insertions than RetroSeq, that is, Ret- roSeq reports on average 1054 novel Alu insertions per sample, while PAIR2 finds on average 1426 per sample, about 35% more than RetroSeq. This re- sult is consistent with the simulated data, where PAIR2 is shown to be more powerful than RetroSeq. We estimate the FDR by counting the non-Mendelian calls, that is, calls in the child that do not follow the expected inheritance patterns according to the parents, we also consider the calls private to the child as false positives, since the rate of Alu insertions is estimated to be about 1 per generation [82]. We find in total 13 non-Mendelian calls, equivalent to FDR of 0.7%. We then compared our genotype calls with the PCR validated calls, see Table 4.4, PAIR2 has an average genotype calling accuracy of 99.1%, compared to average of 72.6% from RetroSeq. We further study if PAIR2 can pinpoint CHAPTER 4. ALU POLYMORPHISM DISCOVERY FROM 54 NEXT-GENERATION SEQUENCING DATA the exact breakpoints, and the mean distance of our predicted breakpoints is about 2 bp away from the true breakpoints, see Table 4.3.

PAIR2 RetroSeq

Sample C11 C22 C21 C11 C22 C21 NA12878 124 38 0 124 1 37 NA12891 95 41 2 95 0 44 NA12892 107 41 2 106 0 42 Table 4.4: Genotype calls of PCR validated Alu Insertion calls for the CEU trio. Ctp represents the number of polymorphic Alu predicted as of genotype p while the true underlying genotype is t, the true genotype is determined by PCR validation [113].

4.3.3 Timing We ran our computations on desktop machine of a single 2.67 GHz Intel i5 processor, on average each sample from the CEU trio family took about 3 hours for Alu Insertion. 4.4. SUMMARY AND FUTURE WORK 55

4.4 Summary and future work

In this project, we discuss the difficulties in finding novel Alus and their break- points. We present and implement an algorithm, PAIR2, that uses read pairs and split reads to jointly detect novel Alu elements of multiple individuals, identify the precise breakpoints of insertions, and make genotype calls based on maximum-likelihood estimator. We benchmark the performance on both simulated data and real data, and compare it to the-state-of-the-art tools. In our simulated dataset of 100bp paired end Illumina sequencing reads, we show that PAIR2 is able to call Alus inserted with respect to the reference, as well the ones deleted with high accuracy and power. Analysis on a human trio from the 1000 Genomes Project shows that PAIR2 is able to produce highly accurate genotype calls.

4.4.1 Algorithm improvement 4.4.1.1 Genotype calling of multiple individuals Though the genotype calling we used for benchmarking is based on maximum likelihood estimator and gives fairly good results, it is still worthwhile to try to incorporate the prior genotype frequency in a Bayesian framework. Borrowing notations from the Methods section, we calculate the posterior probability of each genotype as

P (D|Gg)P (Gg) P (Gg|D) = P (D|Gg)P (Gg)/P (D) = P k=00,01,11 P (D|Gk)P (Gk) where P (Gk) denotes the prior probability. Further more, we can use information from multiple individuals to filter out polymorphisms that are likely to be false positives. This is implemented using a likelihood ratio test. At one Alu region, where there exists at least one individual with Alu deleted with respect to the reference genome, we define Ri the set of reads belonging to individual i, i ∈ [1, 2, . . . , m], and p the frequency haplotype H1 in the population. Assume Hardy-Weinberg equilibrium, the joint likelihood of the observed data is

m Y 2 2 ((1 − p) P (Ri|G00) + 2p(1 − p)P (Ri|G01) + p P (Ri|G11)) i=1 thus the test statistic for the likelihood ratio test is

m m X 2 2 X 2 log((1−p) P (Ri|G00)+2p(1−p)P (Ri|G01)+p P (Ri|G11))−2 log P (Ri|G00) i=1 i=1 which follows a chi-squared distribution with one degree of freedom under the null. CHAPTER 4. ALU POLYMORPHISM DISCOVERY FROM 56 NEXT-GENERATION SEQUENCING DATA 4.4.1.2 Identify insertion regions At the time we started this project, the sequencing cost was so high that it was uncommon to sequence multiple individuals at high coverage. Therefore everything was designed for low to median coverage data (e.g., 3× ∼ 30×). However, at the time I write this dissertation, the sequencing cost has reduced a lot and high-coverage sequencing for multiple individuals is affordable for many large-scale projects. In the current version of PAIR2, we first identify some regions where Alu insertion might happen using IAlu reads from all individuals, and then we filter out false positive regions using IClip reads, i.e., a region is considered as false positive if the inserted Alu can not be located to 1 or 2 breakpoints. However, the first step of identifying approximate regions using IAlu reads can be time consuming, especially when there are many individuals, sequenced at high-coverage, to analyze. In paired-end sequencing, the library size (insertion length) is often about [300, 800] bp, and one Alu element of similar length, ranging from 300 to 600 bp. Therefore we often observe that an IAlu read pair also has an IClip read, with the IClip read indicating the precise breakpoint. We denote such read pairs as IAluClip. When there are enough reads (multiple individuals or high coverage) covering the region of Alu insertion, we can always find IAluClip read pairs. In order to improve the efficiency in identifying insertion regions, we can use IAluClip (instead of IAlu) read pairs at the first step. It will not only avoid the difficult set-covering problem introduced by IAlu reads, but also filter out potential false-positive regions and reduce our workload at the second step.

4.4.1.3 Benchmarking Reference mapping and De novo assembly [114] are two commonly used se- quence assembly in whole genome sequencing, and PAIR2 is developed in the context of reference mapping, However, finding Alu polymorphisms will be a completely different problem in case of De novo assembly data. As more genomes are assembled from scratch [115], we can simply align the genome with consensus Alu sequences in order to detect Alu polymorphisms. Com- paring the findings to the ones from PAIR2 will be an alternative benchmark to evaluate the performance.

4.4.2 Applications The ubiquity, sheer number, diversity, and continued activity of Alu elements indicate that they are key shapers of genomic evolution. A method for quickly and inexpensively—yet comprehensively—identifying Alu polymorphisms is useful to understand their evolution impacts and contributions, it opens the 4.4. SUMMARY AND FUTURE WORK 57 door to many applications such as estimating and comparing transposition rates under varying circumstances, and the comparative study of mobile ele- ment evolution across species [82, 116]. GWASs have been able to successfully identify variants associated to dis- ease (e.g. [117–119]), and SNPs are by far the most commonly considered genetic differences. However, sequencing data opens up the possibility to ex- plore a more diverse set of variants than SNPs, such as Alu polymorphisms. The computations presented in this chapter indicate that the Alu polymor- phism rate is comparable to the polymorphism rate of copy number variations, therefore it is likely that Alu polymorphisms is an useful resource for GWAS.

Part II

Appendix

59

Appendix A

Extensive X-linked adaptive evolution in central chimpanzees

A.1 Summary

This paper was produced in collaboration with many researchers from BiRC (Aarhus University), Copenhagen Zoo, University of Copenhagen and Beijing Genomics Institute. I have contributed equally for this project. Genome-wide surveys of polymorphism within species and between species gives the power to study the genetics of adaptation, in particular the propor- tion of amino acid substitutions fixed by positive selection. In addition, the difference between autosomes and the X chromosome holds information on the dominance of beneficial (adaptive) and deleterious mutations. In this study, we analysed the complete exomes of 12 chimpanzees. We confirm previous reports that central chimpanzees harbor two to three times more synonymous polymorphism than human populations and that the population has under- gone expansion. We suggest that both deleterious and beneficial mutations are at least partly recessive and thus removed more efficiently on the X chro- mosome. Moreover, we demonstrate that a substantial amount of adaptive evolution is targeting the X chromosome and that selection against deleteri- ous mutations is more efficient on the X than on the autosomes in central chimpanzees.

61 Extensive X-linked adaptive evolution in central chimpanzees

Christina Hvilsoma,b,1, Yu Qianc,1, Thomas Bataillonc,1, Yingrui Lid,1, Thomas Mailundc, Bettina Sallée, Frands Carlsena, Ruiqiang Lid, Hancheng Zhengd, Tao Jiangd, Hui Jiangd, Xin Jind, Kasper Munchc, Asger Hobolthc, Hans R. Siegismundb, Jun Wangb,d,2, and Mikkel Heide Schierupc,2 aScience and Conservation, Copenhagen Zoo, 2000 Frederiksberg, Denmark; bBioinformatics, Department of Biology, University of Copenhagen, 2200 Copenhagen, Denmark; cBioinformatics Research Center, Aarhus University, 8000C Aarhus, Denmark; dBeijing Genomics Institute, Shenzhen 518083, Guangdong, China; and eCentre International de Recherches Médicales de Franceville, B769 Franceville, Gabon

Edited by David Reich, Harvard Medical School, Boston, MA, and accepted by the Editorial Board December 22, 2011 (received for review May 4, 2011) Surveying genome-wide coding variation within and among species drift can be different. First, hemizygosity of the X chromosome in gives unprecedented power to study the genetics of adaptation, in males makes natural selection on recessive adaptive mutations particular the proportion of amino acid substitutions fixed by posi- more efficient, and these mutations can therefore drive higher rates tive selection. Additionally, contrasting the autosomes and the X of adaptive evolution on X-linked relative to autosomal regions (1, chromosome holds information on the dominance of beneficial 8). Second, depending on the reproductive variance in the two (adaptive) and deleterious mutations. Here we capture and sequence sexes, we expect X chromosome regions to have between 50 and the complete exomes of 12 chimpanzees and present the largest set 100% of the effective size of autosomes, undergo more genetic of protein-coding polymorphism to date. We report extensive adap- drift, and thus be more prone to accumulate more slightly delete- fi tive evolution speci cally targeting the X chromosome of chimpan- rious mutations. Empirical evidence for increased levels of adap- zees with as much as 30% of all amino acid replacements being tation in X-linked regions remains elusive (7), except for a unique adaptive. Adaptive evolution is barely detectable on the autosomes instance of a newly formed neo-X chromosome in Drosophila mi- except for a few striking cases of recent selective sweeps associated randa (9) and a recent analysis of polymorphism and divergence with immunity gene clusters. We also find much stronger purifying between Drosophila melanogaster and Drosophila simulans for selection than observed in humans, and in contrast to humans, we about 100 genes in X-linked regions (2). In human studies, X- find that purifying selection is stronger on the X chromosome than on the autosomes in chimpanzees. We therefore conclude that most linked variation is lower than autosomal variation close to genes adaptive mutations are recessive. We also document dramatically suggesting that selection affects X chromosomes more than auto- reduced synonymous diversity in the chimpanzee X chromosome somes, but Europeans have a genome-wide relative decrease in X- relative to autosomes and stronger purifying selection than for the linked variation when comparing with Africans, suggesting that fi human X chromosome. If similar processes were operating in the genetic drift has speci cally affected X chromosome in Europeans human–chimpanzee ancestor as in central chimpanzees today, our (3, 10–12). Likewise, it has been the subject of much discussion why results therefore provide an explanation for the much-discussed re- there is a reduced variation of the X chromosome relative to duction in the human–chimpanzee divergence at the X chromosome. autosomes in the human–chimpanzee ancestral species (13–16). Exome capturing efficiently targets the protein coding part of Pan troglodytes troglodytes | SNP | site frequency spectrum | the genome at high coverage and thus allows accurate individual distribution of fitness effects | faster X genotyping of synonymous and nonsynonymous diversity. Exome sequencing studies in humans have revealed an abundance of low- uantifying the relative importance of purifying, neutral, and frequency nonsynonymous variants and very limited evidence for Qpositive selection in shaping divergence between species adaptive evolution (17). remains a challenge in evolutionary biology. Beneficial mutations Here we use exon capture and sequencing to extensively char- are central for understanding evolution by natural selection. acterize patterns of polymorphism segregating in chimpanzees However, the rarity of beneficial mutations has frustrated attempts from Central Africa. (Pan troglodytes troglodytes). Central chim- to characterize their most basic genetic properties in higher panzees are genetically two to three times more variable than organisms (1). One way forward is to combine genome-wide sur- human populations, allowing more coding variation to be included veys of polymorphism within species and between species di- in analysis, and previous genetic analysis suggests that chimpanzee vergence to estimate the fraction, α, of mutations that have been demographic history is less complex than that of humans (18). We fixed by positive selection. Empirical studies, particularly in the use these data of ∼62,000 coding SNPs to estimate the DFE and Drosophila genus show that mutations fixed by positive selection amount of adaptive evolution in chimpanzees since their di- can make up a sizable fraction (>50%) of the divergence between vergence with humans. species in gene coding regions (2) but whether these results are general for mammals, for example, remains unclear. Furthermore we still know very little about the distribution of fitness effects Author contributions: C.H., T.B., H.R.S., J.W., and M.H.S. designed research; C.H., Y.L., R.L., (DFE) of these mutations and even less about their dominance in H.Z., T.J., H.J., X.J., J.W., and M.H.S. performed research; C.H., B.S., and F.C. contributed new reagents/analytic tools; Y.Q., T.B., T.M., K.M., A.H., and M.H.S. analyzed data; and diploid organisms. Theory predicts and recent empirical studies C.H., T.B., and M.H.S. wrote the paper. emphasize that the demographic history as well as variation in The authors declare no conflict of interest. mutation and recombination rates can blur footprints of molecular This article is a PNAS Direct Submission. D.R. is a guest editor invited by the Editorial adaptation by Darwinian selection (3, 4). This in turn can com- Board. α plicate the inference of DFE and (5, 6). Freely available online through the PNAS open access option. In that context, contrasting the DFE and rates of adaptation in 1C.H., Y.Q., T.B., and Y.L. contributed equally to this work. autosomes versus sex chromosomes is an elegant strategy to infer 2To whom correspondence may be addressed. E-mail: [email protected] or wangj@ the genetic properties of the mutations underlying adaptation (7). genomics.org.cn. In a panmictic population, autosomes and the X chromosome ex- This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. perience the same demographic history but selection and genetic 1073/pnas.1106877109/-/DCSupplemental.

2054–2059 | PNAS | February 7, 2012 | vol. 109 | no. 6 www.pnas.org/cgi/doi/10.1073/pnas.1106877109 Results fewer rare variants, suggesting different demographics, but the Patterns of Polymorphism in Central Chimpanzees. We captured and nonsynonymous SFS has a stronger shift toward rare variants in sequenced the 12 central chimpanzee exomes to an average depth humans (Fig. 1C). This is even more striking for the X chromo- of 35× (Tables S1–S7). For analysis, we included only from somes (Fig. 1D). Assuming selective neutrality of synonymous fi nonduplicated areas with a unique mapping coverage against the polymorphisms, an explanation for these differences is more ef - human genome of at least 20× for each individual, leaving 49% of cient selection against nonsynonymous mutations in chimpanzees, all exons (Table S8) and a total of 15.7 million exonic base pairs particularly on the X chromosome. To understand further the where genotypes for all SNPs could be called in all individuals (19). processes underlying differences in relative frequencies of rare nonsynonymous polymorphisms between human and chimpanzee, Fixed differences with the human reference genome were called in fi the same regions (Table 1). We also mapped reads against the we classi ed the functional consequences of nonsynonymous chimpanzee reference genome and found that 96% of SNPs (97% mutations into benign, possibly damaging, and probably damaging of X-linked SNPs) were also called this way. The discrepancies and contrasted the proportion of singletons among these catego- (Table S4) can be attributed to the recent duplication history and ries (Fig. S4). In both chimpanzee and human autosomes the proportion of singletons increases in categories predicted to be the fragmented assembly of the chimpanzee reference genome − more damaging (Fisher’s exact test, P < 10 8 for both species). (20, 21). The results reported below are based on mapping using For the X chromosome, there is no enrichment for chimpanzees the human reference, which contains less error and thus is better (Fisher’s exact test, P = 0.73) in contrast to humans (Fisher’s suited for interspecific comparison, but our conclusions remain exact test, P = 0.002) (Fig. S4). unaltered when excluding ambiguous SNPs (Table S4). The synonymous nucleotide diversity in central chimpanzees – Positive Selection Targets the Chimpanzee X Chromosome. We then is high (Table 1 and Tables S6 S8) but markedly reduced on the used the Sumatran orangutan (Pongo abelii) genome (22) to infer X chromosome where we observe less than half the level of di- which of the human–chimpanzee fixed differences occurred on the versity found in autosomes (Table 1 and Fig. S1). In humans, X to chimpanzee branch (Table S8) and combined these with our autosome diversity ratios range from 0.57 (Europeans) to 0.7 chimpanzee polymorphism data. This allows measurement of the (Africans) near genes and between 0.75 (Europeans) and 0.87 amount of adaptive evolution on the branch leading to chimpan- – (Africans) far away from genes (3, 10 12). We used the human zees, using the approach of ref. 6. All autosomes have an excess of reference genome to orient SNPs (Methods) and obtain the un- nonsynonymous polymorphisms segregating compared with non- folded site frequency spectrum (SFS) at both synonymous and synonymous fixed differences, leading to a neutrality index (NI) >1 nonsynonymous sites (Fig. 1). The synonymous SFS is shifted to- and no evidence for adaptive evolution (Table S8). In contrast, the ward lower frequencies relative to what is expected under a con- X chromosome shows evidence for adaptive evolution (NI = 0.76) stant population model but closely fits the neutral expectation and a proportion of amino acid differences fixed by adaptive from a population recently experiencing a four- to fivefold ex- evolution of α = 29% [95% confidence interval (CI) 0.1–0.36]. pansion (Methods and Fig. S2). This is also reflected in Watterson’s Last, we examined the 37 most abundant gene ontology categories estimator of the scaled mutation rate being higher than nucleotide in our data (58% and 38% of the exome data on the autosome and diversity (Table 1 and Fig. S1). The nonsynonymous SFS is slightly the X, respectively). We found a significant heterogeneity in α shifted toward rare variants relative to the synonymous SFS on the among categories on autosomes (Λ = 184, P < 1e-6) and on the X autosomes (Fig. 1C) but not on the X chromosome (Fig. 1D). chromosome (Λ =54,P < 0.003), suggesting that the differences above are genuine (6). Comparison with Human Coding Diversity. We then compared our To further qualify the differences in the intensity of purifying data to a dataset of 200 human exomes of European origin (17) by and positive selection in chimpanzees, and to examine whether calculating the expected SFS for human autosomes and X chro- our inference above could be biased by changes in the recent mosome for a sample corresponding to the size of our dataset demographic history, we used the SFS and divergence data to (Methods). A larger fraction of polymorphisms in humans is non- estimate jointly the DFE and α, using the approach of ref. 5. This synonymous (50%) relative to that of central chimpanzees (45%) model assumes an expansion model that also yielded the best fit (Fig. 1 A and B and Fig. S3). The human synonymous SFS has to our data (Fig S2). Note that when estimating DFE from SFS

Table 1. Diversity and divergence for 12 central chimpanzees on autosomes and the X chromosome Autosome X chromosome X/A

Number of synonymous SNPs 32,942 808 Synonymous divergence with humans 32,548 1,223 Number of synonymous sites called 3,287,414 172,476 Number of nonsynonymous SNPs 26,462 617 Nonsynonymous divergence with humans 20,632 1,054 Number of nonsynonymous sites called 11,380,785 600,624 Watterson´s θ (nonsynonymous) 0.00062 (4e-06) 0.00030 (0.00001) 0.480 (0.019) Watterson´s θ (synonymous) 0.00268 (0.00001) 0.00136 (0.00005) 0.508 (0.017)

πN 0.00046 (4e-06) 0.00019 (0.00001) 0.421 (0.024) EVOLUTION

πS 0.00204 (0.00002) 0.00093 (0.00004) 0.453 (0.022) πN/πS 0.22417 (0.00250) 0.20847 (0.0152) 0.930 (0.067)

dN 0.00072 (0.00001) 0.00085 (0.00004) 1.177 (0.054)

dS 0.00415 (0.00004) 0.00293 (0.00013) 0.706 (0.032) dN/dS 0.17317 (0.00242) 0.28887 (0.0181) 1.668 (0.107)

SEM for diversity and divergence estimates were computed assuming a binomial distribution (independence of sites) and are shown

in parentheses. SEM for ratios were approximated using the delta method (32). πN, nonsynonymous diversity; πS, synonymous diversity; dN, nonsynonymous divergence on the chimpanzee branch since split with humans; dS, synonymous divergence on the chimpanzee branch since split with humans.

Hvilsom et al. PNAS | February 7, 2012 | vol. 109 | no. 6 | 2055 AC

B D

Fig. 1. Comparison of site frequency spectra in central chimpanzees versus humans. (A) Brown and red: Counts of the number of synonymous and non- synonymous SNPs on the autosomes as a function of the frequency of the derived variant established using humans as outgroup. Blue and light blue: Corresponding expected counts for a sample of 24 exomes of the 200 human exomes (17) (Methods). (B) Corresponding counts on the X chromosome. (C) Autosomal site frequency spectrum of synonymous and nonsynonymous SNPs derived from Fig. 1A and compared with the expectations from a constant size population without selection (green). (D) X chromosome site frequency spectrum for humans and chimpanzees. data, the strength of selection against mutations is measured by its region contains a cluster of immunity-related genes under positive effective selection coefficient (|Nes|), which incorporates the ac- selection in humans (24) (Fig. 3B) as well as the CCR5 gene in- tual fitness effect of the mutation (s) and effective size (Ne). A volved in HIV resistance (25). The second and third most prom- slightly higher fraction of sites with strongly deleterious mutations inent sweeps are found on chromosome 11 and chromosome 16 | |> ( Nes 100) is found on the X chromosome relative to autosomes (Fig. S5). These sweeps are also associated with immunity gene (Fig. 2). This finding suggests marginally more efficient purifying clusters reported to be under positive selection in human diversity selection on the X chromosome, considering also that its effective genome scans (24). Chromosome X does not exhibit any clear size is expected to be smaller than for the autosomes. Using this instances of recent sweeps but its diversity is generally reduced fi approach, the fraction of nonsynonymous mutations xed by pos- throughout the chromosome (Fig. 3C). Given that our dataset α itive selection on the X chromosome is estimated at = 38% (95% covers 65% of human X-linked exons, and assuming an α of 1/3, we α ∼ − CI 0.22, 0.51), whereas it is estimated at 0 (95% CI 0.09, 0.07) can estimate that around 1054/(3*0.65) = 540 fixations have oc- for the autosomes. Interestingly, in human populations, the frac- curred by positive selection on the X chromosome on the chim- tion of nonsynonymous mutations that are strongly deleterious panzee branch during the last 4 million years. That amounts to (estimated |N s|>100) is estimated to be in the range of 30–50% e roughly one adaptive fixation every 500 generations. Of these, we (5, 23), whereas we find estimates above 70% (Fig. 2). expect only adaptive fixations during the last ∼0.5 million years to Extreme Selective Sweeps Associated with Immunity Genes. Given have affected present day levels of diversity. This amounts to only the striking difference in adaptive evolution detected using di- 540/8 = 67 out of 1,054/0.65 = 1,600 expected nonsynonymous vergence, we used our diversity data to query whether selective substitutions and this may explain why we do not observe reduced sweeps have occurred preferentially on the X chromosome. We synonymous diversity in 10kb windows with more nonsynonymous searched for the occurrence of extreme selective sweeps causing substitutions (Fig. S6), a pattern previously reported in Drosophila reduction of polymorphism over megabase-wide regions by scan- as evidence for recurrent selective sweeps (26). It is possible that ning the genome using windows of 10 kb of exon data, in which we many of the X-linked adaptive substitutions in the chimpanzee contrasted the number of polymorphic sites with the number of happened on standing variation as recently reported for humans synonymous differences on the chimpanzee branch. The most (27). However, in that case the fixation rate should not depend on striking example is found on chromosome 3 where a 6-Mb region the dominance of new mutations and we would not expect the spans 12 consecutive windows of low polymorphism (Fig. 3A). This striking difference between X and autosomes that we report here

2056 | www.pnas.org/cgi/doi/10.1073/pnas.1106877109 Hvilsom et al. Autosomes Chromosome X A 0.4 0.6 0.8 Frequency 0.2 0.0 Quasi neutral Mildy deleterious Deleterious Very deleterious Relative proportion of deleterious mutations

Fig. 2. Efficacy of purifying selection against deleterious mutations in central chimpanzees. The strength of purifying selection is measured by the product Nes, where Ne is the effective population size and s the selection coefficient against a heterozygous deleterious mutation. Mutations are di- vided into four categories: quasineutral mutations (deleterious mutation with 0 <|Nes| ≤ 1), mildly deleterious mutations (1 <|Nes| ≤ 10), deleterious mutations (10 <|Nes| ≤ 100), and strongly deleterious mutations (|Nes|>100). The proportions are estimated separately in autosomes and chromosome X. Error bars denote one SE around the estimates of each proportion. B (28). Fixation of new recessive mutations therefore remains the likely explanation for our observations.

Selection Strongly Reduces Coding X Polymorphisms in Chimpanzees. If the reproductive variance in males and females is equal, we expect a relative ratio of synonymous diversity on the X chromo- some and autosomes [πs(X)/πs(A)] of 0.75. The chimpanzee mat- ing system makes it likely that the reproductive variance is higher among males than females, increasing this ratio (in humans it is estimated at 0.81) (7). However, we observe a ratio of only 0.46– 0.51 in chimpanzees (Table 1). A lower mutation rate on the X chromosome caused by male biased mutation (29) can explain only part of this reduction, e.g., even with a male-to-female mutation rate of 4, the diversity ratio is reduced only from 0.75 to 0.6 (8). C Finally, we note that the ratio we observed is close to the one reported in a recent study (30), surveying a sample of six western chimpanzees for nucleotide variation on the X chromosome and chromosome 21 (π(A) = 0.081%, π(X) = 0.034%, π(X)/π(A) = 0.42). Demographic effects have the potential to temporarily alter the ratio of synonymous diversity between X and autosomes (31). We investigated whether any realistic demographic scenario has the potential to produce the observed ratio of diversity and found that only a recent dramatic reduction in population size, or cor- responding bottleneck ending recently, has the potential to pro- duce a ratio of synonymous diversity between X and autosomes near the observed. Such scenarios, however, are incompatible with the enrichment of rare synonymous diversity that we ob- serve. Here demographics alone (Fig. S2) are unlikely to explain this pattern, and rather we argue that the reduced variation on the X chromosome is driven by a combination of more efficient se- lection against detrimental variants and selection for advantageous recessive mutations. Recent studies in humans have also reported EVOLUTION a reduced X/A ratio near genes and interpreted this as Hill– Robertson effects (10, 12, 27). Purifying selection is, if anything, Fig. 3. Scanning for selective sweeps reveals a 6-Mb wide sweep on chro- marginally stronger on the X chromosome than on autosomes in mosome 3 and generally reduced polymorphism on chromosome X. (A) Syn- chimpanzees (Fig. 2), and a sizable fraction (10–50%) of X-linked onymous diversity (measured as Watterson’s θ) and divergence (on the nonsynonymous changes in the chimpanzee lineage has been fixed chimpanzee branch) in windows of 10 kb of accumulated exon base pairs where SNPs were called. A 6-Mb region with suppressed diversity is marked by by positive selection after divergence from humans. Thus, the vertical lines. (B) Zooming in on this region reveals a cluster of genes involved previously noted (7, 32) large X-linked dN/dS ratio for the chim- in immunity and associated with positive selection in humans plus a gene panzee lineage, confirmed in this larger study, appears largely (CCR5) involved in HIV resistance in humans. (C) Diversity and divergence in 10- driven by positive selection rather than relaxed purifying selection. kb windows on the X chromosome.

Hvilsom et al. PNAS | February 7, 2012 | vol. 109 | no. 6 | 2057 This represents a clear example of faster X evolution (1, 7) driven Capture and Sequencing. Each of the 12 qualified genomic DNA samples were by adaptive evolution and is our main finding. randomly fragmented by Covaris and DNA fragments with a peak at 150–200 bp were selected for ligation, with adapters ligated to both ends. The adapter-li- Discussion gated templates were purified using Agencourt AMPure SPRI beads and fi fragments with insert size about 250 bp were excised. Extracted DNA was Using a much larger dataset, we con rm previous reports that amplified by ligation-mediated PCR (LM-PCR), purified, and hybridized to the central chimpanzees harbor two to three times more synonymous SureSelect Biotinylated RNA Library (BAITS) for enrichment. Hybridized frag- polymorphism than human populations and that the population ments were bound to the streptavidin-coated beads, whereas nonhybridized has undergone expansion (18, 33, 34). We report here that puri- fragments were washed out after 24 h. Captured LM-PCR products were sub- fying selection, as measured by the DFE (Fig. 2), is comparatively jected to Agilent 2100 Bioanalyzer to estimate the magnitude of enrichment. stronger than in humans (5, 23). Consistent with this finding, Each captured library was then loaded on the Hiseq2000 platform, and high- PolyPhen predicts that a smaller fraction of nonsynonymous throughput sequencing was performed for each captured library in- dependently to ensure that each sample met the desired average fold cover- mutations that segregate at low frequencies will have harmful age. Raw image files were processed using Illumina base-calling software 1.7 effects (Fig. S4). for base calling with default parameters and the sequences of each individual Our study suggests that both deleterious and beneficial muta- were generated as 90-bp paired end reads. tions are at least partly recessive. Assuming that new mutations have the same underlying effects on fitness on the X chromosome Mapping and Quality Filtering. SOAPaligner (Soap 2.20) was used to align the and the autosomes, a larger number of slightly deleterious muta- clean reads to the human reference genome (National Center for Biotechnology tions are expected to segregate on the X due to increased genetic Information build 36.3) as well as the chimpanzee reference genome (PanTro2) drift. If we use synonymous diversity as a proxy for the amount of allowing a maximum of two mismatches per 90-bp fragment. Full SOAP options were: -a -b -D -o -2 -r 1 -t -n 4 –v2. genetic drift on the X chromosome and autosomes we would ex- | |< On the basis of SOAP alignment results, the software SOAPsnp was used to pect slightly deleterious mutations ( Nes 10) to be far more assemble the consensus sequence and call genotypes in target regions. The abundant on the X chromosome. This is clearly not the case (Fig. following options were set: -i -d -o -r 0.00005 -e 0.0001 -M -t -u -L -s -2 –T(consult 2) and suggests that a sizable fraction of these mutations are http://soap.genomics.org.cn/ for details). partially recessive and thus removed more efficiently on the X For SNP calling we chose to include only exons with an average coverage chromosome (35). of >20 for all 12 individuals. For chromosome X we required females to have > Theory predicts that recessive beneficial mutations should result mean coverage of 20. Using this strict criterion we maintain 49% of all exons in faster X evolution. Increased fixation of slightly deleterious where individual genotypes can be called in all individuals. In these regions we included SNPs if they had a quality score of >20. We called genotypes if mutations on the X chromosome can also contribute to non- coverage of the alternative allele (i.e., not in the reference chimpanzee ge- synonymous divergence and drive a faster X evolution even with nome or in the human genome) was >4. partial dominance (7). However, our data clearly rule out this latter We then excluded 1,886 SNPs from duplicated regions, yielding our final possibility and leave the partial recessive beneficial mutations as Dataset S1. a parsimonious explanation. The only alternative could be an ex- treme bias in the gene repertoire of the X relative to autosomes SNP Orientation. SNPs in chimpanzee were oriented using the reference human with the X chromosome harboring genes with biological functions genome sequence. Adding the Sumatran orangutan genome sequence as an extra outgroup gave conflicting results for 5,712 SNPs, equivalent to 9.4% of all that are most prone to adaptive evolution. However, when com- ’ fi SNPs, but the SFS based on SNP s concordant was very similar to the one paring rates of evolution within biological function (as de ned by reported. The orangutan genome sequence was also used to place fixed dif- the 37 most abundant Gene Ontology (GO) categories) we find ferences between human and chimpanzee on the chimpanzee and human that X-linked genes have higher rates of evolution than their au- branch, respectively. tosomal counterparts (paired Wilcoxon rank test, P value < 0.007, Fig. S7). Interestingly, the GO categories that exhibit the highest Comparison with 200 Human Exomes. We obtained the human SFS from Li et al. rates of evolution include “regulation of transcription, DNA-de- (17) who sequenced 200 human exomes with a mean coverage depth of 14.1. In pendent,”“negative regulation of transcription from RNA poly- their cleaned data, there are 11,273 synonymous and 12,586 nonsynonymous merase II ,” and “multicellular organismal development.” autosomal SNPs. We used the reported frequencies to obtain a SFS for 24 human chromosomal synonymous and nonsynonymous autosomal SNPs and This squares nicely with previous reports arguing that many of the for 21 human chromosomal synonymous and nonsynonymous X-linked SNPs. extant differences between modern humans and chimpanzees in- We obtained the SFS for the smaller sample size by calculating, for each SNP, volve changes in gene regulations and neoteny (36). the binomial probability distribution for calling a SNP at a certain frequency. If Recently, there has been much debate on the causes of the the reported autosomal frequency is p then the probability for not calling reduced divergence of the X chromosome between human and aSNPisp24 +(1− p)24 and the probability for the SNP being at frequency x is x 24-x chimpanzee relative to autosomes (13–16). The reduced diver- b(24,x) p (1 − p) ,wherex =1,...,23. gence can be accounted for by an effective size of the X chromo- fi fi some of about 50% of that of the autosomes in the common Fitting of SFS. We tted ve alternative demographic models to the syn- onymous SFS data using DaDi (37). These include a constant population size ancestor of human and chimpanzees (13). Our study demonstrates model, as well as bottleneck, expansion, and growth models. We used that a substantial amount of adaptive evolution is targeting the X Akaike’s information criterion (AIC) to perform model selection. chromosome and that selection against deleterious mutations is more efficient on the X than on the autosomes in central chim- Estimation of Positive and Purifying Selection. To infer the strength of purifying panzees. In the light of these results, a similar process of X-linked selection from patterns of polymorphism and divergence, we binned the data adaptation and stronger efficiency of purifying selection in the in a series of adjacent genomic windows. Each window comprised 10 kb of exon common ancestor of human and chimpanzee appears to be an material. Starting on a new window, we added contiguous exons one at a time, attractive hypothesis that may account for reduced divergence on counted the number of nucleotides called in each, and added it to the window count. When the window count exceeded 10,000 we switched to a new win- the X chromosome. dow. Because we did not split exons into two windows, all windows contained slightly more than 10,000 sites, and because the exon density varied along the Methods genome, the genomic region each window spanned varied as well. Sample Acquisition. Blood samples were collected from 12 wild-born unrelated Within each window, we recorded the number of synonymous and non- chimpanzees from Gabon, Equatorial Guinea, and zoos in Europe (Table S1). All synonymous positions. The orangutan sequence was used to obtain the necessary permits from the Convention on International Trade in Endangered number of sites contributing to divergence from human, and specifically on the Species were obtained. Blood-derived DNA was used to minimize somatic and chimpanzee branch. For each polymorphic position in our sample of central cell-line–derived false positives. chimpanzees, we recorded the number of chromosomes that carried each

2058 | www.pnas.org/cgi/doi/10.1073/pnas.1106877109 Hvilsom et al. alternative allele (derived versus ancestral; see SNP orientation for further synonymous mutations, and the parameters of a simple expansion model (5). details) and used that information to build the unfolded SFS for each window. We estimated the fraction of mutations expected to fall within four classes of

Counts were then summed over the window to obtain genome-wide or intensity of selection, effectively neutral (|Nes| range 0–1): weakly deleterious chromosome-wide SFSs. (|Nes| range 1–10), strongly deleterious (|Nes| range 10–100), and very strongly We relied on two complementary approaches to infer the strength of pu- deleterious (|Nes|>100). Only effectively neutral and weakly deleterious are rifying selection and quantify the role of positive selection in driving divergence expected to contribute to nonsynonymous divergence as other classes of in coding regions of the chimpanzee genome. First, using the approach deleterious mutations have vanishingly small probabilities to go to fixation implemented by Welch (6), we inferred jointly the fraction 1 − f of mutation and thus contribute to nonsynonymous divergence. Confidence intervals and under strong purifying selection and the proportion α of nonsynonymous SEs around estimates of both α and the proportion of mutations within each nucleotide divergence driven by positive selection from the counts of poly- class were obtained by bootstrap. Bootstrap datasets for the X-linked data morphism and divergence. To make a more robust estimation, we first used were obtained by resampling X-linked 10 kb windows with replacement, a series of models of increasing complexity ranging from a simplistic pure fi neutrality (f =1,α = 0) model to models incorporating a variable intensity of whereas bootstrap datasets were obtained by strati ed resampling windows positive or purifying selection (f and α were allowed to vary according to each across autosomes. chromosome or gene category). Estimation of parameters for each gene or category of gene was made using a likelihood framework as implemented in Scanning for Selective Sweeps. To scan for signals of selective sweeps, we used the MKtest software (6). To compare the various models, we used AIC as the same 10 kb window approach and calculated the fraction of observed suggested by Welch (6). When models differed by less than 5 AIC units, we used synonymous changes because the human/chimpanzee ancestor compared a model averaging procedure to obtain robust estimates of α and f parameters with the total number of possible synonymous changes and the Watterson’s θ as the weighted average on the basis of differences in AIC of the individual restricted to synonymous sites. estimates obtained under each model. Likelihood profiles were used to obtain For each chromosome, we plotted these two summary statistics for each fi α an approximate 95% con dence interval for . Heterogeneity among k GO window and queried regions where the synonymous polymorphism is un- α classes for was tested using a likelihood ratio test (LRT) comparing a model usually small, whereas the synonymous divergence is not similarly reduced. The α specifying a different and f value for each class versus a reduced model still rationale for this strategy was to search for regions exhibiting a reduced level fitting k parameters for f but a single α value. The models are nested and, of variation that was not trivially driven by a reduced level of mutation rate. under the hypothesis of no heterogeneity in α, the LRT statistic (Λ) should be χ2 distributed (k − 1 df). ACKNOWLEDGMENTS. We thank Brian Charlesworth, Aida Andrés, Kay Second, for a more fine-grained picture of the intensity of purifying selec- Prüfer, Freddy B. Christiansen, Nick Patterson, and two anonymous reviewers tion, we used data on divergence together with the SFS of synonymous and for comments on the manuscript; Qing Zhang for technical assistance; and nonsynonymous sites to infer jointly the underlying distribution of scaled se- Claire Neesham for language editing. This work was supported by the Copen- fi α lection coef cient (Nes) against new deleterious mutations, the proportion hagen Zoo and grants from the Danish Natural Science Research Council (to of the nonsynonymous divergence driven by positive selection on non- M.H.S. and H.R.S.).

1. Orr HA (2010) The population genetics of beneficial mutations. Philos Trans R Soc 19. Li R, et al. (2009) SOAP2: An improved ultrafast tool for short read alignment. Bio- Lond B Biol Sci 365:1195–1201. informatics 25:1966–1967. 2. Andolfatto P, Wong KM, Bachtrog D (2011) Effective population size and the efficacy 20. Chimpanzee Sequencing and Analysis Consortium (2005) Initial sequence of the of selection on the X chromosomes of two closely related Drosophila species. Genome chimpanzee genome and comparison with the human genome. Nature 437:69–87. Biol Evol 3:114–128. 21. Cheng Z, et al. (2005) A genome-wide comparison of recent chimpanzee and human 3. Hammer MF, Mendez FL, Cox MP, Woerner AE, Wall JD (2008) Sex-biased evolu- segmental duplications. Nature 437:88–93. tionary forces shape genomic patterns of human diversity. PLoS Genet 4:e1000202. 22. Locke DP, et al. (2011) Comparative and demographic analysis of orang-utan ge- 4. Keinan A, Reich D (2010) Can a sex-biased human demography account for the re- nomes. Nature 469:529–533. duced effective population size of chromosome X in non-Africans? Mol Biol Evol 27: 23. Boyko AR, et al. (2008) Assessing the evolutionary impact of amino acid mutations in 2312–2321. the human genome. PLoS Genet 4:e1000083. 5. Eyre-Walker A, Keightley PD (2009) Estimating the rate of adaptive molecular evo- 24. Barreiro LB, Quintana-Murci L (2010) From evolutionary genetics to human immu- lution in the presence of slightly deleterious mutations and population size change. nology: How selection shapes host defence genes. Nat Rev Genet 11:17–30. Mol Biol Evol 26:2097–2108. 25. Samson M, et al. (1996) Resistance to HIV-1 infection in caucasian individuals bearing 6. Welch JJ (2006) Estimating the genomewide rate of adaptive protein evolution in mutant alleles of the CCR-5 chemokine receptor gene. Nature 382:722–725. Drosophila. Genetics 173:821–837. 26. Macpherson JM, Sella G, Davis JC, Petrov DA (2007) Genomewide spatial correspon- 7. Mank JE, Vicoso B, Berlin S, Charlesworth B (2010) Effective population size and the dence between nonsynonymous divergence and neutral polymorphism reveals ex- Faster-X effect: Empirical results and their interpretation. Evolution 64:663–674. tensive adaptation in Drosophila. Genetics 177:2083–2099. 8. Vicoso B, Charlesworth B (2006) Evolution on the X chromosome: Unusual patterns 27. Hernandez RD, et al.; 1000 Genomes Project (2011) Classic selective sweeps were rare and processes. Nat Rev Genet 7:645–653. in recent human evolution. Science 331:920–924. 9. Bachtrog D, Jensen JD, Zhang Z (2009) Accelerated adaptive evolution on a newly 28. Orr HA, Betancourt AJ (2001) Haldane’s sieve and adaptation from the standing formed X chromosome. PLoS Biol 7:e82. genetic variation. Genetics 157:875–884. 10. Hammer MF, et al. (2010) The ratio of human X chromosome to autosome diversity is 29. Makova KD, Li WH (2002) Strong male-driven evolution of DNA sequences in humans positively correlated with genetic distance from genes. Nat Genet 42:830–831. and apes. Nature 416:624–626. 11. Keinan A, Mullikin JC, Patterson N, Reich D (2009) Accelerated genetic drift on 30. Perry GH, Marioni JC, Melsted P, Gilad Y (2010) Genomic-scale capture and se- chromosome X during the human dispersal out of Africa. Nat Genet 41:66–70. quencing of endogenous DNA from feces. Mol Ecol 19:5332–5344. 12. Gottipati S, Arbiza L, Siepel A, Clark AG, Keinan A (2011) Analyses of X-linked and 31. Pool JE, Nielsen R (2007) Population size changes reshape genomic patterns of di- autosomal genetic variation in population-scale whole genome sequencing. Nat versity. Evolution 61:3001–3006. Genet 43:741–743. 32. Lu J, Wu CI (2005) Weak selection revealed by the whole-genome comparison of the X 13. Hobolth A, Christensen OF, Mailund T, Schierup MH (2007) Genomic relationships and chromosome and autosomes of human and chimpanzee. Proc Natl Acad Sci USA 102: speciation times of human, chimpanzee, and gorilla inferred from a coalescent hid- 4063–4067. den Markov model. PLoS Genet 3:e7. 33. Hey J (2010) The divergence of chimpanzee species and subspecies as revealed in 14. Patterson N, Richter DJ, Gnerre S, Lander ES, Reich D (2006) Genetic evidence for multipopulation isolation-with-migration analyses. Mol Biol Evol 27:921–933. complex speciation of humans and chimpanzees. Nature 441:1103–1108. 34. Won YJ, Hey J (2005) Divergence population genetics of chimpanzees. Mol Biol Evol 15. Presgraves DC, Yi SV (2009) Doubts about complex speciation between humans and 22:297–307. chimpanzees. Trends Ecol Evol 24:533–540. 35. Charlesworth D, Willis JH (2009) The genetics of inbreeding depression. Nat Rev EVOLUTION 16. Wakeley J (2008) Complex speciation of humans and chimpanzees. Nature 452:E3–E4, Genet 10:783–796. discussion E4. 36. Somel M, et al. (2009) Transcriptional neoteny in the human . Proc Natl Acad Sci 17. Li Y, et al. (2010) Resequencing of 200 human exomes identifies an excess of low- USA 106:5743–5748. frequency non-synonymous coding variants. Nat Genet 42:969–972. 37. Gutenkunst RN, Hernandez RD, Williamson SH, Bustamante CD (2009) Inferring the 18. Wegmann D, Excoffier L (2010) Bayesian inference of the demographic history of joint demographic history of multiple populations from multidimensional SNP fre- chimpanzees. Mol Biol Evol 27:1425–1435. quency data. PLoS Genet 5:e1000695.

Hvilsom et al. PNAS | February 7, 2012 | vol. 109 | no. 6 | 2059 Supporting Information

Hvilsom et al. 10.1073/pnas.1106877109

’ θ Fig. S1. Synonymous and nonsynonymous diversity and divergence per chromosome. Watterson s for synonymousP sites was calculatedP as the number of 23 1 17 1 synonymous SNPs divided by the number of synonymous sites for each chromosome, and then divided by i¼1 i for autosomes and i¼1 i for the X chro- mosome. Synonymous π was calculated as the average synonymous heterozygosity for each chromosome. Synonymous divergence was calculated as the number of fixed synonymous substitutions on the chimp branch divided by the number of synonymous sites. For diversity and divergence the same set of exons was used. Watterson’s θ is shown in general higher than π, consistent with a growing population scenario. Watterson θ and π both increase with chromosome number, and both decrease dramatically on the X chromosome. Synonymous divergence shows a similar pattern, but for nonsynonymous divergence, the X chromosome is not reduced compared with the other chromosomes.

Hvilsom et al. www.pnas.org/cgi/content/short/1106877109 1of10 Fig. S2. Expansion model fit to the synonymous site frequency spectrum. The autosomal SFS data for synonymous mutations were fitted by different de- mographic model using DaDi. The best fitting model is shown, which is an expansion model of an increase in population size by a factor of ∼4 about 150 generations ago.

Fig. S3. More nonsynonymous rare SNPs in humans. The ratio of synonymous to nonsynonymous SNPs (y axis) for the three lowest frequency classes in the SFS (variant observed in chromosomes 1, 2, and 3) as well as for all higher frequencies pooled. There is an enrichment of rare nonsynonymous mutations for the human autosomes and in particular for the human X chromosome. For chimpanzee, a smaller enrichment is visible for the autosomes but cannot be seen for the X chromosome.

Fig. S4. The proportion of singletons within different functional classes of SNPs for human and chimpanzees divided into autosomes and X chromosome. The classification of SNPs is into synonymous and nonsynonymous SNPs. Within the nonsynonymous SNP, Polyphen 2.0 was used to classify each SNP as being benign, possibly damaging, or probably damaging. SEs are based on the binomial variance.

Hvilsom et al. www.pnas.org/cgi/content/short/1106877109 2of10 Fig. S5. Further examples of putative sweeps associated with immunity. Second to the chromosome 3 example (Fig. 3), chromosome 11 and chromosome 16 showed the strongest signals of reduced polymorphism marked in this figure. On chromosome 11, the marked region include the genes OTUB1, chr11:63510461–63522463, CLCF1 chr11:66888215–66897782, and PTPRCAP chr11:66959557–66961729. The region on chromosome 16 includes TRADD chr16:65745604–65751306, HSF4 chr16:65756217–65761347, EDC4 chr16:66464500–66475907, PSMB10 chr16:66525908–66528254, NFATC3 chr16:66676876– 66818338, and NFAT5 chr16:68156498–68296054.

Hvilsom et al. www.pnas.org/cgi/content/short/1106877109 3of10 Fig. S6. No apparent effect of recurrent selective sweep on synonymous diversity in X-linked regions. Data on chromosome X are divided in 72 windows (blue open circle) comprising each 10 kb of exonic material. Nonsynonymous divergence is measured through the number of nonsynonymous substitutions affecting specifically the chimpanzee branch in each window. A corrected measure of synonymous nucleotide diversity is obtained after accounting for the level of synonymous divergence among windows (a regression model with synonymous divergence explains about 11% of the total variation in synonymous nucle- otide diversity). The orange dotted line denotes the regression of corrected (residual) synonymous diversity against nonsynonymous divergence and is not significantly different from zero (P value >0.4).

Fig. S7. Rates of evolution (dN/dS) are higher on the X chromosome than on autosomes. Rates of evolution (dN/dS) in autosomes versus the X chromsome are contrasted for the most abundant Gene Ontology categories of biological processes (n = 37). Taken together these 37 GO categories comprise 58 and 60%, respectively, of all autosomal exons and X-linked exons in Dataset S1. For each GO category, data from all pooled exons were used to estimate dN/dS using the Nei–Gojobori method. Each circle denotes the dN/dS estimates for a gene ontology category. The size of each circle is proportional to the abundance of each category (on autosomes). The diagonal (orange line) denotes the expectation under null hypothesis that within each GO category the dN/dS in autosomal versus X linked genes are identical. Of 37 GO categories considered, 27 exhibit higher dN/dS in X-linked genes (paired Wilcoxon rank test, P value <0.007). Accordingly, the predicted dN/dS, based on the relative abundance of GOs on the X for these 37 categories, is 0.155 (CI 95%: 0.147–0.163), whereas the observed dN/dS on the X is 0.211 (95% CI: 0.179–0.242).

Hvilsom et al. www.pnas.org/cgi/content/short/1106877109 4of10 Table S1. Origin of samples ID Sex Origin

Aboume M Gabon Amelie F Gabon, Haut-Ogooué Ayrton M Gabon, Moanda Bakoumba M Gabon Benefice F Gabon (CNRS, Makokou) Chiquita F Gabon Cindy F Wild-caught Africa* Lalala F Gabon Makokou F Gabon, Ogooué-Ivindo Masuku F Gabon, Haut-Ogooué Noemie F Equatorial Guinea Susi F Wild-caught Africa*

The 12 Pan troglodytes troglodytes samples are from wild-born unrelated individuals. F, female; M, male. *Tested by mtDNA and microsatellites to be of Central African origin.

Table S2. Sequencing statistics for first six samples Exon capture ABOUME AMELIE AYRTON BAKOUMBA BENEFICE CHIQUITA

Initial bases on target 37,806,033 37,806,033 37,806,033 37,806,033 37,806,033 37,806,033 Initial bases near target* 57,200,015 57,200,015 57,200,015 57,200,015 57,200,015 57,200,015 Initial bases on or near target 95,006,048 95,006,048 95,006,048 95,006,048 95,006,048 95,006,048 Total uniquely mapped reads 20,566,908 20,897,094 20,863,152 21,477,662 23,371,004 22,756,168 Total effective yield, Mb 1851.02 1880.74 1877.68 1932.99 2103.39 2048.06 Average read length, bp 90.00 90.00 90.00 90.00 90.00 90.00 Effective sequence on target, Mb 1329.39 1277.51 1332.36 1329.94 1434.26 1352.83 Effective sequence near target, Mb 219.01 290.06 241.94 280.91 336.41 316.13 Effective sequence on or near target, Mb 1548.40 1567.57 1574.30 1610.85 1770.68 1668.96 Fraction of effective bases on target, % 71.8 67.9 71.0 68.8 68.2 66.1 Fraction of effective bases on or near target, % 83.7 83.3 83.8 83.3 84.2 81.5 Average sequencing depth on target 35.16 33.79 35.24 35.18 37.94 35.78 Average sequencing depth near target 3.83 5.07 4.23 4.91 5.88 5.53 Mismatch rate in target region, % 0.62 0.62 0.60 0.64 0.60 0.60 Mismatch rate in all effective sequence, % 0.74 0.75 0.73 0.76 0.72 0.73 Base covered on target 36,691,325 36,665,142 36,628,348 36,759,294 36,696,323 36,712,506 Coverage of target region, % 97.1 97.0 96.9 97.2 97.1 97.1 Base covered near target 29582614 35146064 32302856 33562545 40692616 39546192 Coverage of flanking region, % 51.7 61.4 56.5 58.7 71.1 69.1 Fraction of target covered with at least 20×, % 59.0 58.5 60.4 58.4 65.0 63.2 Fraction of target covered with at least 10×, % 78.6 78.7 79.2 78.4 81.8 81.2 Fraction of target covered with at least 4×, % 91.0 91.2 91.0 91.2 91.9 91.8 Fraction of flanking region covered with at least 20×, % 4.7 6.6 5.3 6.4 7.7 7.1 Fraction of flanking region covered with at least 10×, % 11.6 16.0 13.2 15.1 18.9 17.8 Fraction of flanking region covered with at least 4×, % 24.9 32.8 28.0 30.9 38.9 37.2

*Flanking region within 200 bp of target regions.

Hvilsom et al. www.pnas.org/cgi/content/short/1106877109 5of10 Table S3. Sequencing statistics for last six samples Exon capture LALALA MAKOKOU MASU.K.U NOEMIE SUSI_11043 CINDY_11525

Initial bases on target 37,806,033 37,806,033 37,806,033 37,806,033 37,806,033 37,806,033 Initial bases near target* 57,200,015 57,200,015 57,200,015 57,200,015 57,200,015 57,200,015 Initial bases on or near target 95,006,048 95,006,048 95,006,048 95,006,048 95,006,048 95,006,048 Total uniquely mapped reads 17,275,141 22,450,448 20,576,915 20,036,672 21,436,483 26,448,088 Total effective yield, Mb 1554.76 2020.54 1851.92 1803.30 1929.28 2380.33 Average read length, bp 90.00 90.00 90.00 90.00 90.00 90.00 Effective sequence on target, Mb 1070.54 1389.72 1233.10 1237.95 1312.52 1655.74 Effective sequence near target, Mb 227.31 297.94 298.78 275.42 296.20 357.00 Effective sequence on or near target, Mb 1297.84 1687.66 1531.88 1513.37 1608.72 2012.73 Fraction of effective bases on target, % 68.9 68.8 66.6 68.6 68.0 69.6 Fraction of effective bases on or near target, % 83.5 83.5 82.7 83.9 83.4 84.6 Average sequencing depth on target 28.32 36.76 32.62 32.74 34.72 43.80 Average sequencing depth near target 3.97 5.21 5.22 4.82 5.18 6.24 Mismatch rate in target region, % 0.60 0.59 0.62 0.59 0.60 0.70 Mismatch rate in all effective sequence, % 0.72 0.72 0.75 0.72 0.73 0.81 Base covered on target 36,497,258 36,682,710 36,701,512 36,604,851 36,687,496 36,460,039 Coverage of target region, % 96.5 97.0 97.1 96.8 97.0 96.4 Base covered near target 34,592,251 37,139,854 36,594,614 38,247,253 38,809,387 32,979,857 Coverage of flanking region, % 60.5 64.9 64.0 66.9 67.8 57.7 Fraction of target covered with at least 20×, % 53.5 63.5 57.6 60.0 61.9 67.7 Fraction of target covered with at least 10×, % 75.7 81.1 78.4 79.2 80.6 81.8 Fraction of target covered with at least 4×, % 89.7 91.7 91.2 91.0 91.6 91.0 Fraction of flanking region covered with at least 20×, % 4.2 6.8 6.7 5.7 6.4 9.6 Fraction of flanking region covered with at least 10×, % 12.4 16.6 16.6 15.4 16.5 19.8 Fraction of flanking region covered with at least 4×, % 29.1 34.4 34.6 34.1 35.4 34.7

*Flanking region within 200 bp of target regions.

Table S4. Differences between using the hg18 and pantro2 for mapping of reads and identification of SNPs and the potential consequences of using either for the calculation

of πN/πS Autosome X chromosome

Exon not present in PanTro2 assembly* 1374 87 Synonymous (S) SNPs not polymorphic against panTro2† 1029 23 † Nonsynonymous (NS) SNPs not polymorphic against panTro2 1439 21 ‡ NS/S SNPs including not polymorphic against panTro2 SNPs 0.803 0.764 NS/S SNPs excluding not polymorphic against panTro2 SNPs‡ 0.784 0.759

*Number of SNPs where the corresponding exon is absent from the panTro2 assembly. † Number of SNPs called against hg18, which show no polymorphism when mapped against pantro2. ‡ Consequences on the nonsynonymous to synonymous ratio of SNPs when keeping or leaving out these SNPs.

Hvilsom et al. www.pnas.org/cgi/content/short/1106877109 6of10 Table S5. Exons and base pairs on the chip and included in the analysis Chr bp on chip bp included in analysis Exons on chip Exons in analysis

Chr1 3,505,063 1,766,139 17,924 9,183 Chr2 2,504,138 1,102,350 11,567 5,820 Chr3 1,957,127 949,629 9,678 4,982 Chr4 1,372,542 662,912 6,300 3,181 Chr5 1,624,295 772,957 7,366 3,734 Chr6 1,750,838 803,917 8,486 3,904 Chr7 1,697,055 686,310 7,447 3,649 Chr8 1,218,967 475,993 5,218 2,451 Chr9 1,473,160 664,479 6,901 3,295 Chr10 1,431,287 693,089 7,229 3,706 Chr11 2,063,538 911,974 8,972 4,392 Chr12 1,776,009 836,062 9,132 4,721 Chr13 631,682 316,146 3,094 1,471 Chr14 1,090,664 490,301 5,032 2,458 Chr15 1,268,872 537,462 5,782 2,864 Chr16 1,510,643 578,315 7,153 3,287 Chr17 2,029,048 857,305 9,838 4,790 Chr18 544,231 267,124 2,472 1,320 Chr19 2,290,530 791,001 9,113 3,344 Chr20 870,891 364,430 4,380 2,094 Chr21 376,560 157,734 1,783 798 Chr22 793,609 263,734 3,566 1,520 ChrX 1,326,500 775,235 6,292 3,615 Autosome sum 35,107,249 15,724,598 164,852 8,0621

Hvilsom et al. www.pnas.org/cgi/content/short/1106877109 7of10 vlo tal. et Hvilsom www.pnas.org/cgi/content/short/1106877109

Table S6. Number of nonsynonymous SNPs per individual Nonsynonymous ABOUME AYRTON BAKOUMBA AMELIE BENEFICE CHIQUITA LALALA MAKOKOU MASU.K.U NOEMIE CINDY_11525 SUSI_11043 length

Chr1 486 668 732 656 677 655 617 691 697 668 603 623 1,355,702.92 Chr2 381 358 353 401 376 367 334 395 403 359 345 379 839,811.61 Chr3 296 270 308 256 288 266 251 300 260 257 288 284 728,525.9 Chr4 200 193 179 124 191 194 169 206 168 202 162 217 501,899.31 Chr5 233 216 213 215 205 241 209 240 247 250 178 224 591,837.88 Chr6 240 223 244 194 242 242 234 224 245 213 186 207 620,628.89 Chr7 239 252 238 224 244 267 226 255 252 234 274 224 520,704.03 Chr8 145 125 142 142 147 151 121 131 142 120 132 130 368,341.04 Chr9 253 246 254 256 277 262 254 249 226 258 255 248 500,575.51 Chr10 230 218 244 230 248 244 232 229 229 206 235 210 520,981.22 Chr11 471 374 393 494 443 447 448 490 472 443 455 442 698,178.96 Chr12 252 265 220 201 198 257 245 250 262 253 263 250 637,426.84 Chr13 89 87 88 83 96 81 82 107 116 87 93 77 237,950.36 Chr14 139 138 154 134 143 127 143 138 132 142 137 128 379,736.6 Chr15 196 227 172 199 223 192 190 186 214 180 172 204 412,794.94 Chr16 208 206 209 216 204 225 181 215 241 215 197 199 437,246.59 Chr17 335 315 369 326 328 337 300 310 359 317 324 309 647,024.96 Chr18 75 60 67 96 77 87 79 86 86 93 90 99 207,649.42 Chr19 391 436 425 407 461 395 407 414 416 402 381 428 574,978.99 Chr20 115 119 121 111 129 112 102 95 111 108 108 123 281,241.27 Chr21 76 62 61 72 63 65 65 60 41 60 78 51 121,469.96 Chr22 101 98 88 112 110 114 87 117 114 94 98 97 196,078.15 ChrX 0 0 0 83 86 86 76 72 93 73 85 95 394,956.52 Autosomal 5,151 5,156 5,274 5,149 5,370 5,328 4,976 5,388 5,433 5,161 5,054 5,153 11,380,785.35 average 8of10 vlo tal. et Hvilsom www.pnas.org/cgi/content/short/1106877109

Table S7. Number of synonymous SNPs per individual Synonymous ABOUME AYRTON BAKOUMBA AMELIE BENEFICE CHIQUITA LALALA MAKOKOU MASU.K.U NOEMIE CINDY_11525 SUSI_11043 length

Chr1 660 759 784 774 771 806 701 784 780 746 665 755 395,421.08 Chr2 508 438 439 501 506 497 489 502 512 455 502 446 238,286.39 Chr3 399 359 389 373 406 386 374 382 359 391 383 380 209,358.1 Chr4 269 292 278 209 284 287 265 288 249 269 269 278 139,815.69 Chr5 333 320 338 340 287 334 306 343 346 342 263 332 170,423.12 Chr6 332 316 349 328 353 373 354 361 362 306 298 337 177,131.11 Chr7 314 317 306 290 319 328 300 321 345 296 309 293 150,446.97 Chr8 178 218 211 200 206 197 200 206 184 202 192 214 104,437.96 Chr9 311 321 329 322 309 324 311 303 299 333 287 297 144,877.49 Chr10 269 295 321 313 296 306 284 327 284 297 268 290 148,537.78 Chr11 452 382 438 458 412 396 437 453 439 414 419 402 204,590.04 Chr12 393 390 327 322 300 360 336 365 393 366 365 378 184,564.16 Chr13 138 126 162 151 135 151 149 158 157 122 153 132 67,398.64 Chr14 215 189 201 224 209 232 206 200 202 209 166 199 108,993.4 Chr15 242 243 253 278 272 245 227 260 283 243 206 248 119,387.06 Chr16 273 263 278 281 251 279 260 294 283 274 232 258 130,710.41 Chr17 439 417 437 418 457 456 417 382 448 387 437 435 191,001.04 Chr18 132 120 127 115 142 128 145 129 125 138 150 125 59,074.58 Chr19 467 506 456 445 458 444 433 460 481 457 388 463 166,396.01 Chr20 176 166 204 162 164 193 153 180 190 157 176 165 82,775.73 Chr21 90 91 86 65 81 84 93 72 48 83 83 70 35,043.04 Chr22 128 134 117 144 130 146 122 135 152 145 140 165 58,744.85 ChrX 0 0 0 89 118 114 78 105 104 106 121 107 113,291.48 Autosomal 6,718 6,662 6,830 6,713 6,748 6,952 6,562 6,905 6,921 6,632 6,351 6,662 3,287,414.65 average 9of10 Table S8. Number of synonymous (πS) and nonsynonymous (πN) SNPs in central chimps and the number of synonymous (dS) and nonsynonymous (dN) changes on the chimpanzee branch, estimated from human–chimp divergence using orangutan as an outgroup for assigning events to branches

Chromosome dN dS dN/dS πN πS πN/πS NI Nonsyn sites Syn sites

1 899 1,455 0.618 3,108 3,620 0.859 1.390 1,355,702.92 395,421.08 2 587 962 0.610 1,865 2,402 0.776 1.272 839,811.61 238,286.39 3 521 881 0.591 1,476 1,881 0.785 1.327 728,525.9 209,358.1 4 402 633 0.635 1,090 1,442 0.756 1.190 501,899.31 139,815.69 5 389 724 0.537 1,230 1,613 0.763 1.419 591,837.88 170,423.12 6 435 675 0.644 1,207 1,685 0.716 1.112 620,628.89 177,131.11 7 357 626 0.570 1,145 1,465 0.782 1.370 520,704.03 150,446.97 8 227 470 0.483 795 1,061 0.749 1.551 368,341.04 104,437.96 9 340 601 0.566 1,251 1,500 0.834 1.474 500,575.51 144,877.49 10 371 588 0.631 1,190 1,480 0.804 1.274 520,981.22 148,537.78 11 643 852 0.755 2,227 2,067 1.077 1.428 698,178.96 204,590.04 12 401 779 0.515 1,356 1,757 0.772 1.499 637,426.84 184,564.16 13 122 284 0.430 519 740 0.701 1.633 237,950.36 67,398.64 14 240 464 0.517 793 1,084 0.732 1.414 379,736.6 108,993.4 15 250 471 0.531 866 1,149 0.754 1.420 412,794.94 119,387.06 16 294 562 0.523 967 1,279 0.756 1.445 437,246.59 130,710.41 17 448 748 0.599 1548 1,952 0.793 1.324 647,024.96 191,001.04 18 152 283 0.537 462 651 0.710 1.321 207,649.42 59,074.58 19 659 811 0.813 2,000 2,103 0.951 1.170 574,978.99 166,396.01 20 182 358 0.508 584 931 0.627 1.234 281,241.27 82,775.73 21 109 160 0.681 330 411 0.803 1.179 121,469.96 35,043.04 22 150 254 0.591 453 669 0.677 1.147 196,078.15 58,744.85 X 508 505 1.006 617 808 0.764 0.759 600,624.31 172,475.69

The total number of syn and nonsyn sites are also shown.

Other Supporting Information Files

Dataset S1 (RTF)

Hvilsom et al. www.pnas.org/cgi/content/short/1106877109 10 of 10

Appendix B

Identifying disease associated genes by network propagation

B.1 Summary

This paper was in collaboration with researchers from BiRC, with me the main author. GWAS has become a common tool to discover the genetic basis of complex diseases. Despite the success in various studies, the standard single-marker analysis might suffer from low statistical power. One possible alternative to this is to treat the biological knowledge as prior information when searching large numbers of SNPs for association with phenotype. This study attempts to integrate the PPI network so as to boost the power of GWAS with an example application on a GWAS of Crohn’s disease.

79 Qian et al. BMC Systems Biology 2014, 8(Suppl 1):S6 http://www.biomedcentral.com/1752-0509/8/S1/S6

PROCEEDINGS Open Access Identifying disease associated genes by network propagation Yu Qian*, Søren Besenbacher, Thomas Mailund, Mikkel Heide Schierup From The Twelfth Asia Pacific Bioinformatics Conference (APBC 2014) Shanghai, China. 17-19 January 2014

Abstract Background: Genome-wide association studies have identified many individual genes associated with complex traits. However, pathway and network information have not been fully exploited in searches for genetic determinants, and including this information may increase our understanding of the underlying biology of common diseases. Results: In this study, we propose a framework to address this problem in a principled way, with the underlying hypothesis that complex disease operates through multiple connected genes. Associations inferred from GWAS are translated into prior scores for vertices in a protein-protein interaction network, and these scores are propagated through the network. Permutation is used to select genes that are guilty-by-association and thus consistently obtain high scores after network propagation. We apply the approach to data of Crohn’s disease and call candidate genes that have been reported by other independent GWAS, but not in the analysed data set. A prediction model based on these candidate genes show good predictive power as measured by Area Under the Receiver Operating Curve (AUC) in 10 fold cross-validations. Conclusions: Our network propagation method applied to a genome-wide association study increases association findings over other approaches.

Background and multi-gene effects should be taken into considera- In recent years, genome-wide association studies tion while mapping from genotypes to phenotypes. (GWAS) have become a common tool to discover the Consider a crucial biological mechanism, where failure genetic basis of complex diseases and have led to many of a small portion of the important genes can lead to scientific discoveries [1]. Many single nucleotide poly- dys-function of the whole biological mechanism. This is morphisms (SNPs) have been identified in a variety of very likely to happen in case of complex disease, such as diseases. The single marker analysis tests genetic associa- Crohn’s Disease, and therefore multi-locus analysis show tion of individual SNPs and identifies only the most signif- increase of power when analyzing such data. icant SNPs below a stringent significance level, for Pathway-based or gene set enrichment analysis has − example, p<5×10 8, which is necessary to control the become a potentially powerful approach in search of dis- false positive rate on a genome-wide level. However, ease associated genes (for a recent review of pathway the identified SNPs only represent a small fraction of the analysis, see [3]). One of the most popular methods is genetic variants to contribute to complex diseases, due to GSEA [4]. Using a modified Kolmogorov-Smirnov test, it small individual effect sizes. Markers that are truly but compares the p value distribution of genes in a pathway weakly associated with disease often fail to be detected [2]. with the rest of the genes. GSEA has successfully identi- It is well understood that the stability of biological fied the IL-12/IL-23 pathway that is significantly asso- systems is governed by many biomolecular interactions ciated to Crohn’s disease [5]. However, there are some disadvantages to the common pathway based analysis. * Correspondence: [email protected] First, most of the studies choose the most significant Bioinformatics Research Center, Aarhus University, 8000C Aarhus, Denmark © 2014 Qian et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http:// creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Qian et al. BMC Systems Biology 2014, 8(Suppl 1):S6 Page 2 of 7 http://www.biomedcentral.com/1752-0509/8/S1/S6

SNP from each gene as a representative, and therefore score representing stronger evidence for association. If a systematic but small changes in a gene-set will be missed gene has many neighbours that are associated with the dis- if individual genes do not have any SNPs with strong ease, it is very likely that itself is also associated. A gene marginal association. Second, how robust the methods with high posterior score can be called as candidate genes, are with regard to factors such as pathway annotations even though it has a low prior score and fails to be called and pathway size is not clear. Third, many methods treat in standard GWAS due to stringent p value cut-off, chip all genes equally despite that some genes (e.g., house- coverage or sampling bias. keeping genes) appear in many pathways [6]. The addi- A recent study applied similar ideas to prioritize candi- tional information of genes in overlapping pathways date disease genes and demonstrated a boost in the power should not be ignored, as shown in a study where weight- to detect associated genes in GWAS [14]. Using a naive ing genes based on their appearance in the gene sets can Bayes framework on datasets of Crohn’s disease and Type improve gene set ranking and boost sensitivity of the ana- 1 diabetes, the posterior score of each gene was obtained lysis [7]. by adding its own log odds from GWAS (as prior) and a Similar to pathway-based analysis, where biologically soft GBA score from neighbors. The study showed that relevant connections from public databases are utilised in some genes with high posterior scores are actually vali- GWAS, network-based analysis has also become a popular dated as true associations in later studies, although they tool for the study of complex disease. In context of mole- do not have highest prior scores (e.g., lowest p values in cular interaction networks, it has been found that about GWAS studies) and would possibly be ignored in two- one third of known disorders with multiple genes show stage studies. However, there are some open questions in physical clustering of genes with the same phenotype and this study. First, the posterior score of a gene depends not these clusters are likely to represent disorder-specific func- only on its neighbors association of disease, but also on tional modules [8]. A concept of disease modules was how many neighbors it has, i.e, network topology. It is emerged as more studies show that proteins that are appealing to mark a gene with high posterior score as involved in the same disease show a high propensity to associated, neglecting the fact that a high posterior score interact with each other [9,10]. If a few disease compo- is merely due to a high degree (receiving information from nents are identified, other disease-related components are more neighbors). Second, there is no statistic control (e.g., likely to be found in their network-based vicinity. For false discovery rate) for the findings. If there is no signal in example, various module search methods have been devel- the network, i.e, a completely random prior for all the oped in search of disease associated modules [11,12]. genes, this method still outputsthegeneswithhighest However, a module consists of an arbitrary number of posterior scores. genes, it often requires intensive simulations for multiple- testing corrections. Methods Network Guilt by Association (GBA) is an approach In this section, we first describe the data we used and for identifying disease genes based on the observation the network propagation framework, then we build a that similar phenotypes arise from functionally related prediction model based on the GBA genes, and evaluate genes. Algorithms related to Google’s PageRank, such as the performance in cross-validation (CV). Iterative Ranking and Gaussian smoothing, are applied in prioritizing candidate disease genes using network Data Set information [13]. A typical workflow looks like this: Prior information from GWAS. We analyzed the raw given a query disease, known causal genes of diseases anonymous genotype data of the Wellcome Trust Case that are phenotypically similar to the query disease are Control Consortium (WTCCC) study. The original cohort given a prior score in human PPI network, then the includes 2005 Caucasian UK patients of Crohn’s disease prior scores are propagated and smoothed over network and 3004 controls genotyped on the Affymetrix 500K map- such that each protein gets an association score. Genes ping array. The details are described in [15]. Genotypes with high association scores are considered as candidate with posterior probability (or CHIAMO score) lower than associations. 0.98 are considered as missing data. Markers are removed In this study, we analyzed a GWAS in a GBA frame- if the percentage of missing data was larger than 5% or if work. The network is overlaid with GWAS information, they are not in Hardy-Weinberg equilibrium (p >0.0001 that is, each gene is assigned a prior score based on the for control group). We further remove some individuals gene level p value from GWAS, which represents our with missing allele larger than 3% or of non-European prior knowledge of its association to the disease. After ancestry or with duplicated samples, as suggested by [15], propagation, the prior score has been smoothed over the and are left with 1748 cases and 2938 controls. Finally we whole network and each gene gets a new association map a SNP to a gene if it was located within the gene or score, denoted as the posterior score, with higher posterior 10kb immediately upstream or downstream. Qian et al. BMC Systems Biology 2014, 8(Suppl 1):S6 Page 3 of 7 http://www.biomedcentral.com/1752-0509/8/S1/S6

 − Interaction network. The PPI network is built based F = (I − αW ) 1(1 − α)Y (1) on the STRING database version 9.0 [16]. Only interac- tions with a score larger than 700 are included, and it It can be proved that W’ is similar to a stochastic results in 229599 interactions involving 15010 proteins. matrix, which has eigenvalues in [−1, 1] (according to We use proteins and genes interchangeably in the follow- the PerronFrobenius theorem). Since a∈ (0, 1), the −1 ing, because SNPs were first mapped to genes then eigenvalues of (I − aW’) exists. Though the linear mapped to corresponding proteins. The GWAS dataset equation can be solved analytically, it is difficult to com- only covers part of the genes in the network, for example, pute the inverse of a large matrix with |V|×|V| dimen- 11363 genes for Crohn’s disease. We discard all the ver- sion, and we choose a iterative propagation method to tices that are not covered by GWAS, keeping only edges solve the system. At iteration t, we compute between covered vertices. Isolated nodes are also removed,  − Ft = αW Ft 1 + (1 − α)Y (2) in the end, we are left with a large connected network N . Tuning parameters. There are two tuning parameters Propagation of evidence in the model, aand T. T denotes the number of itera- Consider a PPI network as an undirected graph tions of propagation. We study two extreme scenarios, G =(V, E, w), where nodes V are a set of proteins, T = 1 and T = ∞. When T = 1, each node only receives edges E are links between proteins if interaction exists, information from its direct neighbors. Posterior score F w is the weight of an edge. For a node v ∈ V,denote is calculated from equation (2). When T = ∞,equation its total number of neighbors by degree(v)andits (2) reaches equilibrium after many iterations and the → direct neighbors in G by N (v). Let Y : V R≥0 repre- information is smoothed over the network, therefore sent a function of prior evidence, i.e., assign high score each node also gains information from its indirect to a node v if we a priori (from GWAS) believe that it neighbors through iterations. F scores will converge as → is associated to the disease. F : V R≥0 denotes a showninequation(1).Inpractice,equilibriumisoften function of posterior evidence, i.e., F(v)representsthe achieved within 20 iterations. posterior evidence of association after propagating the a, also known as the damping parameter in the litera- information of its neighbor nodes. The main three ture, denotes how much information a node receives from steps of network propagating are, (1) obtaining prior neighbors. Higher aindicates less weight on its own prior information from GWAS, (2) calculating and normalis- information. Previous applications of similar algorithms of ing posterior scores by network propagation (with ranking SNPs or genes recommend a∈ [0.5, 0, 95] [13,18]. choice of tuning parameters), (3) selecting genes with We explored a∈ (0.2, 0.4, 0.6, 0.8, 0.9) in the experiments. highest posterior score as candidates. A good choice of ashould give better internal consistency Gene-level prior scores.Thep value of a gene is of gene rank by posterior score, which can be measured by defined as the minimum single marker test p value of Kendall Tau rank correlation [19]. its SNPs, as widely used in pathway analysis. The prior We randomly choose half of the case and control −1 score of gene i defined as yi = F (1 − pi/2) and F is samples and rank genes based on posterior scores. Low the Cumulative Distribution Function of normal distri- consistency is obtained for a ∈ (0.2, 0.4) and it agrees bution. Therefore, under the null hypothesis of no asso- with a previous study that a should be more than 0.5 + ciation, yi ∼ (0, 1).Accordingto[17],minimum [18]. Highest consistency is obtained when a∈ (0.8, 0.9), p values performsN best in most scenarios, we also tried thus we choose the mean a= 0.85 in the main analysis, Fisher’s combined probability test, however, it gives unless otherwise specified. lower internal consistency of gene ranks for random subset of the data. Identifying associated genes Calculating posterior scores by propagation.The As shown in Equation (2), a gene can have a high posterior posterior score F is computed as score under two conditions: (1) it has a high degree in the  network and receives more information from its neighbors  F(v)=α[ F(u)w v,u]+(1− α)Y(v) than the other low degree genes, (2) most of its neighbors u∈N(v) are associated with the disease and itself is GBA. Therefore ranking a gene based on posterior score has some issues where the parameter a∈ (0, 1) weights the relative when the first condition is dominant, we may include too importance of information received from neighbors, and many false positives in the candidate gene set and have a  × wv,u = wv,u/ d(v) d(u) denotes the weight of edges, potential power loss for genes with lower degrees in the with d(v) the degree of node v. The above formula can network. be expressed in linear form F = aW’F +(1− a)Y, which Here we suggest a framework, that uses permutations is equivalent to to identify GBA genes and eliminates potential false Qian et al. BMC Systems Biology 2014, 8(Suppl 1):S6 Page 4 of 7 http://www.biomedcentral.com/1752-0509/8/S1/S6

positives. The pseudocode is shown in the following. Results and discussion Input parameters include: significant threshold and Problems with association by rank number of permutations, K. Network propagation for prioritizing associated genes example input: Sig. Thresh.=0.01, K=10000; has been applied in several studies when there is func- // GWAS prior tional similarity between a given gene and the known prior scores based on p values of Cochran-Armitage disease gene. The selected few known disease associated trend test in GWAS data; genes give prior information for the network, after pro- propagate prior scores by equation (2); pagation, each gene gets a posterior score, which repre- normalize posterior scores for all genes so that the sents its association to the disease. sum is 1; Implementation of similar ideas in a GWAS showed i record posterior score for genei as S0; boosting in identification of disease-associated genes // permutation [14]. However, in such an application, genes of high for k in 1 to K;do degree often have high posterior scores due to propaga- permute case and control labels, calculate prior scores tion. If we simply take the genes with highest posterior from p values; score as candidates, we may end up including too many propagate prior scores by equation (2); false positives in the candidate gene list. In Additional normalize posterior scores for all genes so that the file 1: Table S1, where N0 refers to network with the sum is 1; GWAS prior and Nk≠0 to networks with randomized i record posterior score for genei as Sk; prior, one can see that the top ranked genes in N0 are done also often ranked top in Nk≠0. Although the detailed // find candidate GBA genes implementation of our study and the one by [14] is dif- for genei in network; do ferent, it reveals the necessity of utilizing such network the p value of the posterior score of genei is methods in a more cautious way. Moreover, most of the K i > i k=1 I(Sk S0) top ranked genes in N0 have high network degree, with K average degree of 195.4, while the PPI used in our stu- genei is candidate association if the p value of its pos- dies has an average degree of 24.5, it again confirms our terior score is smaller than Sig. Thresh. concern that genes of high degree tend to be ranked on done top in the network.

Prediction model and ROC Candidate genes The GBA candidate genes are important nodes in the Identified candidate genes network, because many of their neighboring genes are The study by [14] listed the top-ranked 150 genes, to associated with disease. To measure how the GBA make the results comparable, we also made a list of 150 genes collectively contribute to the disease, we used top-ranked genes with T = 1, they are ranked by the them to build prediction models and evaluate the per- p value of their posterior score obtained from permuta- formance in 10-fold cross-validation (CV). The predic- tion. As shown in Additional file 1: Table S2, one can tion models based on 90% of the cases and controls see that many genes with significant p value from were tested on the remaining 10% data, and it was GWAS are also called in our study. This is not surpris- repeated 10 times with different 90% and 10% of the ing since the prior scores from GWAS contribute to the cohorts. posterior scores. Moreover, our method also identified A logistic regression model with all the SNPs as cov- some genes that are not called by standard methods in ariates is fitted by the R package glmnet [20]. Though this data set, but reported in other independent GWAS. theGBAframeworkonlychosethemostsignificant The number of genes that are validated by other studies SNP to represent the gene, we used all the SNPs and reported in GWAS catalog [22] is shown in Figure 1. located within 10Kb boundary of the candidate genes. There are 7 genes identified as association candidates in glmnet applies cyclical coordinate descent to solve elas- our study, which failed to be called by GWAS p value tic-net penalized regression models, which are mix- ranking, they are PTPN22, IRF1, PTGER4, IL12B, tures of two penalties: l1 (the lasso) and l2 (ridge IL18R1, FASLG and JAK2. Except IL12B and JAK2, the regression), and it generates models with relatively few other 5 genes also failed to be called by [14]. These can- predictors. To evaluate the performance of the predic- didate genes all have a higher rank of posterior scores in tive model, we calculated the average Area Under the the network of GWAS prior, compared to network with receiver operating characteristic Curves (AUC) [21] for random prior, as shown in Additional file 1: Table S2. all 10 trials. For example, gene JAK2 (MIM 147796) has a gene level Qian et al. BMC Systems Biology 2014, 8(Suppl 1):S6 Page 5 of 7 http://www.biomedcentral.com/1752-0509/8/S1/S6

type of chips). Nevertheless, our study showed potential power of GBA framework to boost GWAS signals. Genomic prediction model We further investigated whether the candidate genes collectively contribute to the disease and evaluated the extent to which predictions were driven by these candi- date genes. A previous study conducted pathway analy- sis on the same data set [23] and built a logistic regression model of 277 genes with a variable selection algorithm. The average AUC was 0.6 in 10-fold CV with all the SNPs within 10kb boundary of the selected genes, and it dropped to 0.56 after excluding all SNPs − with p value <5×10 7. Using GBA candidate genes, we had higher average AUC in 10-fold CV, the average is 0.705 (T = 1) and 0.730 (T=∞) in models including all the SNPs that are mapped to candidate genes, and 0.687 (T = 1) and 0.715 (T=∞)afterremovingSNPswith − p value <5×10 7.ThenumbersareshowninTable1, one can see that the increased AUC is not due to the Figure 1 Number of reported genes. Overlap of validated genes number of genes we used to build the model, on the among top 150 genes for each method, all of them map SNPs to contrary, we used fewer genes for the prediction model, genes within 10kb boundary. Validated genes denote genes that with 130 (T = 1) and 184 (T=∞) genes respectively. are reported as candidate in later GWAS studies. Method GWAS means ranking genes by minimum p value. We also built prediction models with the same num- ber of genes that are ranked on top (based on posterior scores), which would be called candidate genes in the p value of 0.0523, but its rank in the network increases study of [14]. As shown in Table 1, there is a AUC drop from average of 104 in Nk≠0 to 54 in N0.IL18R1(MIM with models of top ranked genes. The reason might be 604494) has a GWAS p value of 0.003, and its rank of that the candidate genes in our study are GBA genes posterior score is 3342 in Nk≠0 and 1109 in N0. This gene and collectively contribute to disease, while top ranked wouldbemissedbybothsinglemarkertestaswellas genes can be special in the network topology but with ranking genes by posterior score. no association to the disease. There are 4 genes, C13orf31, MST1, RTEL1 and HLA- DQA2, that were listed on top 150 by [14] and failed to Boosting signal from IL12 pathway genes be called in our study. The reason we failed to call them Many studies of pathway analysis have uncovered signif- is because these genes are not present in the STRING icant associations between Crohn’s disease and the database we used. Therefore a more complete PPI data- IL12/23 signaling pathway [5,24]. 19 of 20 genes in IL12 base integrated from diverse sources is needed, as was pathway are included in our network analysis, most of done by [14]. them have a posterior score (rank) increase in N0, Based on the number of significant genes that are shown in Additional file 1: Table S3. As propagation reported by other independent GWAS or meta-analyses, redistributes the information of the network, standard there seems to be no big advantages of our results com- pathway enrichment analysis based on posterior scores pared with the one by [14]. However, these studies are might have an advantage over the one based on direct designed to discover different signals in the network. GWAS results. Our method focus more on genes with abundant guilty neighbors, thus it has higher power to discover local sig- Conclusions nals even with low degree genes but less power to call Combining GWAS data with function databases is very high degree genes. However, there are also two other appealing as it provides more explanatory power for the factors in the approaches that may contribute to the dif- list of candidate genes. While pathway methods have ference in performance. First, the detailed implementa- shown success in many applications, they also have lim- tion is different, [14] rank genes by posterior log odds, itations. For example, genes involved in multiple path- and choose a best prior log odds such that the number of ways might introduce bias in different pathways, different validated genes is maximized. Second, the reported genes definitions of the same pathway in different knowledge in GWAS catalog might have a bias to genes that have bases can affect performance assessment in terms of significant p value in certain type of study (eg., certain power and true positive/negative rate. [3]. Qian et al. BMC Systems Biology 2014, 8(Suppl 1):S6 Page 6 of 7 http://www.biomedcentral.com/1752-0509/8/S1/S6

Table 1 Prediction models based on candidate genes. T=1 T=∞ method Genea SNP1b SNP2c AUC1d AUC2e Gene SNP1 SNP2 AUC1 AUC2 Sig. Thresh.f 130 1176 1145 0.705(0.033) 0.687(0.032) 184 1314 1281 0.730(0.017) 0.715(0.013) PSg 130 2331 2310 0.645(0.024) 0.627(0.026) 184 3279 3256 0.645(0.024) 0.626(0.026) aGene is number of genes identified as GBA associations, bSNP1 is number of SNPs mapped to these genes. cSNP2 is bSNP1 excluding snps that reach GWAS significance of 5 × 10−7. dAUC1, mean AUC of prediction models built with bSNP1, numbers in parentheses are standard deviation in cross validations. eAUC2, similar to AUC1, prediction models built with cSNP2. fSig. Thresh. = 0.01 gPS means genes with highest posterior scores are selected.

The methods of combining GWAS and PPI networks other methods. Genes in IL12 pathway. Table S3 lists 19 genes in IL12 mainly fall into two categories, (1) dense module search pathway, most of them have an increased posterior score in the network algorithms in search of significantly enriched subnetworks of GWAS prior. [11,12,25], (2) propagation algorithms related to Google’s PageRank [13,14] in search of genes that have top ranks in the network. However, while methods in group (1) require Competing interests intensive randomization of network topology for accessing The authors declare that they have no competing interests. module significance, and often encounter multiple testing Authors’ contributions problem in searching of modules of various sizes in high The project was conceived by YQ, SB, TM and MS. SB prepared the original dimension space, methods in group (2) fail to distinguish data, YQ designed and conducted the analysis. YQ drafted the manuscript, signals from GWAS and signals from network, and there- SB, TM and MS shared in writing the manuscript. All authors have read and approved the final manuscript. fore tend to have a high false positive rate, especially in case of biased PPI database. Acknowledgements The authors thank anonymous reviewers for helpful comments. The study Our study extends the idea of network propagation with was supported by the Lundbeck Foundation (LuCAMP), the Danish Natural GWAS information, such that information from various Sciences Research Council and the EU Marie Curie action (NextGene) resources can be utilized. The performance of this method can be improved in various ways. Integration of diverse Declarations Publication of this article was funded by EU Marie Curie action (NextGene). data sources, as suggested in [14], will improve the ability This article has been published as part of BMC Systems Biology Volume 8 to prioritize disease genes. Mapping multiple SNPs to a Supplement 1, 2014: Selected articles from the Twelfth Asia Pacific Bioinformatics Conference (APBC 2014): Systems Biology. The full contents of single gene is the simplest way of obtaining genelevel sta- the supplement are available online at http://www.biomedcentral.com/ tistics, yet some collapsing-based and kernelbased meth- bmcsystbiol/supplements/8/S1. ods are worth trying for gene-level statistics [26]. There are also potential extensions of this study. For example, Published: 24 January 2014 most of the 20 major genes in IL12 pathway, identified as References associated in Crohn’s Disease, have increased posterior 1. Visscher PM, Brown MA, McCarthy MI, Yang J: Five Years of GWAS score in the network of GWAS prior, it implies that net- Discovery. The American Journal of Human Genetics 2012, 90:7-24 [http:// work propagation method redistributes the information in linkinghub.elsevier.com/retrieve/pii/S0002929711005337]. 2. Jia P, Wang L, Meltzer HY, Zhao Z: Pathway-based analysis of GWAS the network where the true associations get enriched datasets: effective but caution required. The International Journal of information. Therefore pathway analysis based on poster- Neuropsychopharmacology 2010, 14(04):567-572 [http://www.journals. ior scores may have more power than the standard path- cambridge.org/abstract\_S1461145710001446]. 3. Khatri P, Sirota M, Butte AJ: Ten Years of Pathway Analysis: Current way analysis. Many methods that detect interaction and Approaches and Outstanding Challenges. PLoS Computational Biology epistasis, such as Support Vector Machine [27] and Logic 2012, 8(2):e1002375 [http://dx.plos.org/10.1371/journal.pcbi.1002375]. Regression [28] are not applicable in genome-wide scale 4. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette Ma, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set due to high dimensions, a reduced search space such as enrichment analysis: a knowledge-based approach for interpreting interactions among GBA genes might yield some results. genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America 2005, 102(43):15545-15550 [http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1239896&tool= Additional material pmcentrez&rendertype=abstract]. 5. Wang K, Zhang H, Kugathasan S, Annese V, Bradfield JP, Russell RK, Sleiman PMa, Imielinski M, Glessner J, Hou C, Wilson DC, Walters T, Kim C, Additional file 1: Genes with highest posterior scores. Table S1 lists Frackelton EC, Lionetti P, Barabino A, Van Limbergen J, Guthery S, genes with highest posterior scores in the network, with parameter of Denson L, Piccoli D, Li M, Dubinsky M, Silverberg M, Griffiths A, Grant SFa, T = 1. Candidate gene set of top 150 genes. Table S2 lists the top 150 Satsangi J, Baldassano R, Hakonarson H: Diverse genome-wide association candidate GBA genes in our study, which are used for comparison with studies associate the IL12/IL23 pathway with Crohn Disease. American Qian et al. BMC Systems Biology 2014, 8(Suppl 1):S6 Page 7 of 7 http://www.biomedcentral.com/1752-0509/8/S1/S6

journal of human genetics 2009, 84(3):399-405 [http://www.pubmedcentral. 23. Eleftherohorinou H, Wright V, Hoggart C, Hartikainen AL, Jarvelin MR, nih.gov/articlerender.fcgi?artid=668006&tool=pmcentrez&rendertype= Balding D, Coin L, Levin M: Pathway analysis of GWAS provides new abstract]. insights into genetic susceptibility to 3 inflammatory diseases. PloS one 6. Ma J, Sartor Ma, Jagadish HV: Appearance frequency modulated gene set 2009, 4(11):e8068 [http://www.pubmedcentral.nih.gov/articlerender.fcgi? enrichment testing. BMC bioinformatics 2011, 12:81 [http://www.ncbi.nlm. artid=2778995&tool=pmcentrez&rendertype=abstract]. nih.gov/pubmed/21418606]. 24. Benson JM, Sachs CW, Treacy G, Zhou H, Pendley CE, Brodmerkel CM, 7. Tarca AL, Draghici S, Bhatti G, Romero R: Downweighting overlapping Shankar G, Mascelli Ma: Therapeutic targeting of the IL-12/23 pathways: genes improves gene set analysis. BMC bioinformatics 2012, 13:136 generation and characterization of ustekinumab. Nature biotechnology [http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3443069&tool= 2011, 29(7):615-24 [http://www.ncbi.nlm.nih.gov/pubmed/21747388]. pmcentrez&rendertype=abstract]. 25. Jia P, Zheng S, Long J, Zheng W, Zhao Z: dmGWAS: dense module 8. Feldman I, Rzhetsky A, Vitkup D: Network properties of genes harboring searching for genome-wide association studies in protein-protein inherited disease mutations. Proceedings of the National Academy of interaction networks. In Bioinformatics. Volume 27. Oxford, England; Sciences of the United States of America 2008, 105(11):4323-4328 2011:95-102 [http://www.pubmedcentral.nih.gov/articlerender.fcgi? [http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2393821&tool= artid=3008643&tool=pmcentrez&rendertype=abstract]. pmcentrez&rendertype=abstract]. 26. Li L, Zheng W, Lee JS, Zhang X, Ferguson J, Yan X, Zhao H: Collapsing- 9. Barabási AL, Gulbahce N, Loscalzo J: Network medicine: a network-based based and kernel-based single-gene analyses applied to Genetic approach to human disease. Nature reviews Genetics 2011, 12:56-68 [http:// Analysis Workshop 17 mini-exome data. BMC proceedings 2011, 5 Suppl www.ncbi.nlm.nih.gov/pubmed/21164525]. 9(Suppl 9):S117 [http://www.pubmedcentral.nih.gov/articlerender.fcgi? 10. Cai JJ, Borenstein E, Petrov Da: Broker genes in human disease. Genome artid=3287841&tool=pmcentrez&rendertype=abstract]. biology and evolution 2010, 2:815-25 [http://www.pubmedcentral.nih.gov/ 27. Ban HJ, Heo JY, Oh KS, Park KJ: Identification of type 2 diabetes- articlerender.fcgi?artid=2988523&tool=pmcentrez&rendertype=abstract]. associated combination of SNPs using support vector machine. BMC 11. Jia P, Wang L, Fanous AH, Pato CN, Edwards TL, Zhao Z: Network-Assisted genetics 2010, 11:26 [http://www.pubmedcentral.nih.gov/articlerender.fcgi? Investigation of Combined Causal Signals from Genome-Wide artid=2875201&tool=pmcentrez&rendertype=abstract]. Association Studies in Schizophrenia. PLoS computational biology 2012, 28. Kooperberg C, Ruczinski I, LeBlanc ML, Hsu L: Sequence analysis using 8(7):e1002587 [http://www.ncbi.nlm.nih.gov/pubmed/22792057]. logic regression. Genetic epidemiology 2001, 21 Suppl 1(Suppl 1):S626-31 12. Akula N, Baranova A, Seto D, Solka J, Nalls Ma, Singleton A, Ferrucci L, [http://www.ncbi.nlm.nih.gov/pubmed/11793751]. Tanaka T, Bandinelli S, Cho YS, Kim YJ, Lee JY, Han BG, McMahon FJ: A network-based approach to prioritize results from genome-wide doi:10.1186/1752-0509-8-S1-S6 association studies. PloS one 2011, 6(9):e24220 [http://www. Cite this article as: Qian et al.: Identifying disease associated genes by pubmedcentral.nih.gov/articlerender.fcgi?artid=3168369&tool= network propagation. BMC Systems Biology 2014 8(Suppl 1):S6. pmcentrez&rendertype=abstract]. 13. Vanunu O, Magger O, Ruppin E, Shlomi T, Sharan R: Associating genes and protein complexes with disease via network propagation. plos computational biology 2010, 6:e1000641 [http://www.pubmedcentral.nih. gov/articlerender.fcgi?artid=2797085&tool=pmcentrez&rendertype=abstract]. 14. Lee I, Blom UM, Wang PI, Shim JE, Marcotte EM: Prioritizing candidate disease genes by networkbased boosting of genome-wide association data. Genome research 2011, 21(7):1109-21 [http://www.pubmedcentral.nih. gov/articlerender.fcgi?artid=3129253&tool=pmcentrez&rendertype=abstract]. 15. WTCCC: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 2007, 447(7145):661-78 [http://www.ncbi.nlm.nih.gov/pubmed/17554300]. 16. Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, Doerks T, Stark M, Muller J, Bork P, Jensen LJ, von Mering C: The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic acids research 2011, 39(Database):D561-8 [http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3013807&tool= pmcentrez&rendertype=abstract]. 17. Chapman J, Whittaker J: Analysis of multiple SNPs in a candidate gene or region. Genetic Epidemiology 2009, 32(6):560-566. 18. Davis Na, Crowe JE, Pajewski NM, McKinney Ba: Surfing a genetic association interaction network to identify modulators of antibody response to smallpox vaccine. Genes and immunity 2010, 11(8):630-636 [http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3001955&tool= pmcentrez&rendertype=abstract]. 19. McKinney Ba, Pajewski NM: Six Degrees of Epistasis: Statistical Network Models for GWAS. Frontiers in genetics 2011, 2(January):109 [http://www. pubmedcentral.nih.gov/articlerender.fcgi?artid=3261632&tool= pmcentrez&rendertype=abstract]. Submit your next manuscript to BioMed Central 20. Friedman J, Hastie T, Tibshirani R: Regularization Paths for Generalized Linear and take full advantage of: Models via Coordinate Descent. Journal of Statistical Software 2010, 33. 21. Graham NE: Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: Statistical signi cance and • Convenient online submission interpretation. Quarterly Journal of the Royal Meteorological Society 2002, • Thorough peer review 2145-2166. • No space constraints or color figure charges 22. Hindorff La, Sethupathy P, Junkins Ha, Ramos EM, Mehta JP, Collins FS, Manolio Ta: Potential etiologic and functional implications of genome- • Immediate publication on acceptance wide association loci for human diseases and traits. Proceedings of the • Inclusion in PubMed, CAS, Scopus and Google Scholar National Academy of Sciences of the United States of America 2009, • Research which is freely available for redistribution 106(23):9362-7 [http://www.pubmedcentral.nih.gov/articlerender.fcgi? artid=2687147&tool=pmcentrez&rendertype=abstract]. Submit your manuscript at www.biomedcentral.com/submit name arank N0 brank random cdegree dvalidate TP53 0 0.0 615 N CDC42 1 3.4 258 N FAM48A 2 1.1 521 N INS-IGF2 3 3.0 461 N AKT1 4 4.9 497 N PLEK 5 4.2 140 N RAC1 6 6.3 290 N CALM1 7 6.9 265 N EGFR 8 9.4 344 N ESR1 9 10.9 287 N EGF 10 10.4 374 N DNAH8 11 14.9 148 N CTNNB1 12 7.4 369 N IL6 13 16.2 397 N NFKB1 14 15.0 400 N STAT3 15 46.5 282 Y JUN 16 18.8 394 N SP1 17 12.4 352 N DLG4 18 17.7 70 N IL2 19 24.3 309 N FOS 20 31.5 342 N MYC 21 17.7 370 N FYN 22 19.5 233 N EP300 23 27.0 260 N GRB2 24 41.8 242 N APP 25 32.8 334 N PRKAR2A 26 23.2 199 N CD4 27 47.9 253 N NGF 28 39.5 201 N CRCT1 29 683.5 9 N IL1B 30 36.7 269 N IFNG 31 41.3 297 N CDH1 32 43.6 223 N VEGFA 33 44.5 287 N NOD2 34 1706.6 47 Y BCL2 35 38.7 306 N PIK3R1 36 55.5 263 N IL23R 37 1185.3 61 Y

1 FN1 38 24.3 243 N PPARG 39 29.1 253 N RAC2 40 34.0 144 N ATG16L1 41 2690.1 20 Y PTEN 42 55.5 198 N MAPK1 43 35.2 246 N CDKN2A 44 37.2 176 N NKX2-3 45 4774.5 12 Y ST6GALNAC1 46 48.4 24 N ABCC11 47 135.3 51 N FURIN 48 24.0 75 N TBP 49 44.2 195 N DAG1 50 111.4 64 N GNAL 51 44.0 298 N GRIP1 52 92.7 46 N GCNT1 53 41.5 25 N JAK2 54 104.1 231 Y PARK2 55 50.7 131 N CCND1 56 46.7 315 N IGF1 57 56.0 250 N GSK3B 58 47.6 152 N B3GNT6 59 51.2 20 N GNAT3 60 124.6 37 N C1GALT1 61 46.5 21 N ACTN2 62 66.6 138 N VIM 63 91.9 190 N MTF1 64 731.9 19 N BMP4 65 54.7 159 N RELA 66 87.4 244 N MAPK8 67 55.4 269 N PTS 68 45.6 46 N NOTCH1 69 52.6 183 N RB1 70 83.8 170 N RHOQ 71 58.9 100 N RNF31 72 127.6 142 N PIK3CA 73 84.6 241 N STAT1 74 100.6 218 N PRKACG 75 68.4 96 N CREBBP 76 153.1 172 N SNAP25 77 68.3 64 N

2 IL8 78 78.5 267 N ACTN4 79 137.7 97 N IL4 80 94.3 218 N TNFSF15 81 4351.7 16 Y CBL 82 124.6 180 N HNF4A 83 85.9 178 N PCNA 84 70.2 224 N PPP2R1A 85 123.6 124 N MGAT1 86 134.8 14 N CD44 87 74.9 217 N SLC7A13 88 3495.6 2 N PIK3R2 89 146.7 175 N ACTG1 90 200.1 92 N PRKCA 91 70.8 202 N SCAPER 92 5318.3 2 N MAPK14 93 92.5 193 N POMC 94 135.7 256 N ARF6 95 101.0 95 N STX1A 96 88.3 63 N PPP1CA 97 115.5 67 N ARF1 98 105.0 58 N ABL1 99 93.3 151 N

Table S1, arank N0 is gene rank by posterior scores in networks overlaid by GWAS data (T=1 for propagaion). brank random is the average gene rank by posterior scores in networks overlaid by random prior information. cdegree is the degree of gene in the network, dvalidate denotes whether the gene is reported (Y-Yes, N-No) as association in other Crohn’s Disease studies according to GWAS catalog, www.pnas.org/cgi/doi/10.1073/pnas.0903103106

3 name arank P bpval crank N0 drank Ni evalidate f PSP NOD2 0 1.76e-15 34 1432.8 Y 0.0000 ATG16L1 1 1.08e-13 41 2311.6 Y 0.0000 IL23R 2 2.91e-13 37 1258.1 Y 0.0000 C1orf141 3 1.15e-09 1853 5869.8 Y 0.0017 CYLD 4 4.49e-09 620 4763.9 Y 0.0000 NKX2-3 5 1.61e-08 45 4156.3 Y 0.0000 PTPN2 6 4.01e-07 234 3668.3 Y 0.0000 SLC22A5 7 4.74e-07 224 3987.2 Y 0.0000 BSN 8 6.44e-07 351 1259.6 Y 0.0031 APEH 9 6.44e-07 2816 6895.4 N 0.0010 CDKAL1 10 4.86e-06 768 2806.9 Y 0.0000 STAT3 12 1.12e-05 15 44.3 Y 0.0000 DAG1 13 1.27e-05 50 128.8 N 0.0041 AMT 15 1.32e-05 481 1419.5 N 0.0060 GCKR 19 1.69e-05 1023 3427.9 Y 0.0005 IL12RB2 20 1.89e-05 2483 5702.7 Y 0.0006 USP4 26 2.89e-05 2094 6179.7 N 0.0002 CREM 30 3.88e-05 422 2606.5 Y 0.0000 BSG 31 4.11e-05 1580 4417.7 N 0.0008 DGKD 33 4.21e-05 1098 3136.3 Y 0.0005 ERC2 39 6.30e-05 827 3233.7 N 0.0085 SLC22A4 40 6.77e-05 268 3734.8 Y 0.0000 ACSL6 42 7.24e-05 861 2682.4 Y 0.0098 IP6K1 46 7.52e-05 819 5268.6 N 0.0005 TNFSF15 48 7.85e-05 81 4020.5 Y 0.0000 TNFRSF6B 51 8.17e-05 883 5865.4 N 0.0014 SP110 54 8.63e-05 2386 5565.9 N 0.0091 POU2F1 56 1.10e-04 2432 4856.6 N 0.0056 MAPT 61 1.55e-04 126 464.5 N 0.0007 CRHR1 67 1.89e-04 1174 6646.3 N 0.0014 WNT3 69 1.99e-04 1917 5421.9 N 0.0015 VWF 70 2.00e-04 658 1626.3 N 0.0058 N4BP1 83 2.39e-04 766 3780.5 N 0.0037 MOGS 87 2.67e-04 1882 6474.9 N 0.0014 GRB2 89 2.71e-04 24 36.9 N 0.0076 MYH3 90 2.71e-04 2635 5539.0 N 0.0033 DLG5 91 2.80e-04 114 5814.4 N 0.0000 USP1 92 2.84e-04 2462 5958.0 N 0.0005

4 P4HA2 99 3.26e-04 1028 6499.5 Y 0.0001 ENSP00000332488 101 3.49e-04 2526 7770.8 N 0.0026 CD48 119 4.92e-04 1978 4466.6 N 0.0028 PCDH15 128 5.97e-04 270 727.4 N 0.0067 IL18RAP 144 7.42e-04 382 3619.9 Y 0.0000 PTGFR 163 8.46e-04 1593 2982.6 N 0.0094 NEDD4L 173 9.08e-04 433 1213.3 N 0.0015 GRAP2 175 9.24e-04 2559 4973.3 N 0.0042 STX2 176 9.30e-04 548 1428.8 N 0.0059 ADCY3 182 9.62e-04 403 895.3 N 0.0056 IRF1 188 1.00e-03 781 2086.7 Y 0.0031 CAMK2G 199 1.10e-03 763 2074.0 N 0.0037 IL12B 214 1.28e-03 334 1887.7 Y 0.0000 CREBBP 227 1.39e-03 76 152.6 N 0.0016 STX16 236 1.44e-03 1214 3354.3 N 0.0033 UBE2Q1 256 1.57e-03 1515 7958.1 N 0.0015 UTS2 257 1.57e-03 1818 4044.9 N 0.0022 HERC2 259 1.59e-03 144 839.9 N 0.0004 INPP5E 261 1.64e-03 1297 3595.8 N 0.0058 PTGER4 263 1.64e-03 749 2115.6 Y 0.0021 CAPN10 288 1.91e-03 859 5083.2 N 0.0036 DEFA5 293 1.94e-03 1312 6235.9 N 0.0001 CDC37 298 1.97e-03 2517 5162.5 N 0.0070 FZD6 303 2.01e-03 423 1067.2 N 0.0084 CCNY 305 2.01e-03 239 3945.7 N 0.0000 DCTN1 330 2.19e-03 119 320.2 N 0.0033 SLC8A1 331 2.23e-03 806 2927.4 N 0.0053 C1orf43 337 2.25e-03 840 7057.2 N 0.0005 POMT2 371 2.45e-03 614 2444.7 N 0.0075 KAT2A 373 2.46e-03 733 1800.2 N 0.0044 RRAS 408 2.78e-03 1181 2943.6 N 0.0060 IL18R1 417 2.81e-03 1109 3342.2 Y 0.0028 PIM1 468 3.34e-03 1123 3365.2 N 0.0011 AHCYL1 475 3.42e-03 1801 4357.3 N 0.0093 FOS 477 3.43e-03 20 31.5 N 0.0036 ASAH1 483 3.47e-03 1151 2913.4 N 0.0103 KIAA0564 495 3.67e-03 805 2226.1 N 0.0098 PRMT5 504 3.82e-03 413 1171.4 N 0.0062 ST6GAL2 527 4.05e-03 295 4923.5 N 0.0006 CHDH 605 5.01e-03 2631 5462.3 N 0.0039

5 IL12RB1 627 5.23e-03 1432 3865.7 N 0.0002 MAP3K7 648 5.42e-03 242 685.7 N 0.0015 ZFAND6 673 5.75e-03 2113 6879.9 N 0.0011 CD3G 744 6.65e-03 2197 4901.5 N 0.0080 CD3D 745 6.65e-03 1045 3601.1 N 0.0006 LAMA5 780 7.13e-03 1412 3080.9 N 0.0042 SEC13 817 7.56e-03 399 921.8 N 0.0028 RORC 885 8.41e-03 740 1789.7 N 0.0082 TRAF1 975 9.45e-03 380 1357.1 N 0.0001 S100B 988 9.59e-03 323 1093.3 N 0.0042 ELAVL4 1054 1.09e-02 191 804.9 N 0.0091 TWSG1 1057 1.09e-02 757 2290.9 N 0.0036 IL15 1184 1.28e-02 804 1863.1 N 0.0048 ABCC11 1232 1.35e-02 47 178.0 N 0.0002 WBP4 1339 1.52e-02 215 856.7 N 0.0069 CLCN1 1356 5.59e-03 975 3401.0 N 0.0054 JAK1 1505 1.79e-02 145 332.3 N 0.0020 PIP5K1A 1589 1.91e-02 2701 5339.7 N 0.0051 CTDSP2 1605 1.94e-02 1064 3294.8 N 0.0023 SEC23B 1608 1.95e-02 2353 5739.1 N 0.0028 PHYH 1650 2.01e-02 349 1540.8 N 0.0058 PPIP5K1 1658 2.02e-02 472 3820.2 N 0.0000 CD4 1731 2.15e-02 27 52.6 N 0.0022 TNFRSF8 1766 2.21e-02 1190 2540.9 N 0.0078 OSM 1862 2.38e-02 1487 3085.3 N 0.0090 CX3CR1 1919 2.48e-02 1459 3318.6 N 0.0068 IL1R1 1951 2.55e-02 458 1599.0 N 0.0001 SCAPER 2034 2.70e-02 92 4137.7 N 0.0006 C20orf54 2043 2.72e-02 1781 6844.2 N 0.0076 UBFD1 2116 2.91e-02 330 1958.7 N 0.0073 GPT 2163 3.02e-02 2251 4594.1 N 0.0066 IPPK 2165 3.03e-02 905 4736.9 N 0.0006 LIMCH1 2242 3.17e-02 170 3303.0 N 0.0101 SOCS4 2330 3.36e-02 1021 2636.7 N 0.0035 NCKAP1 2361 3.43e-02 865 3057.7 N 0.0069 CD3E 2621 4.04e-02 1654 3943.6 N 0.0051 FKBP4 2704 4.33e-02 2629 6167.0 N 0.0020 TMC8 2869 4.74e-02 1812 7839.2 N 0.0043 TMC6 2870 4.74e-02 1811 7700.6 N 0.0045 FAM92B 2880 4.75e-02 102 5654.4 N 0.0001

6 SSU72 2885 4.78e-02 416 3835.6 N 0.0000 SEC24D 2933 4.88e-02 2360 5425.9 N 0.0085 IP6K3 3052 5.21e-02 1042 6119.4 N 0.0001 JAK2 3063 5.23e-02 54 100.2 Y 0.0025 IL5 3137 5.43e-02 440 1009.2 N 0.0101 UBAC1 3232 5.64e-02 1081 5968.7 N 0.0027 SEC23A 3285 5.80e-02 1500 4870.5 N 0.0031 RHOC 3352 5.97e-02 1463 4109.1 N 0.0089 PTPN22 3390 6.09e-02 651 2025.7 Y 0.0015 ESYT3 3425 6.20e-02 2424 6008.4 N 0.0091 GABARAPL2 3580 6.62e-02 546 2401.3 N 0.0064 HIST1H4A 3616 6.71e-02 1860 6167.5 N 0.0008 TMEM39A 3650 6.81e-02 530 6071.9 N 0.0016 IL13 3734 7.07e-02 932 2662.1 N 0.0022 PPAP2A 3741 7.09e-02 193 543.5 N 0.0044 C10orf107 3760 7.17e-02 634 2249.5 N 0.0066 FECH 3825 7.37e-02 1933 4514.8 N 0.0071 ACTG1 4576 1.01e-01 90 200.7 N 0.0084 IL23A 4657 1.04e-01 489 1467.1 N 0.0012 CXCR4 4821 1.11e-01 252 598.3 N 0.0088 SLC7A13 5476 1.41e-01 88 3944.4 N 0.0005 RIMS2 5769 1.56e-01 345 1720.0 N 0.0007 KPNA7 5815 7.68e-02 2007 7231.5 N 0.0048 CRCT1 6411 1.94e-01 29 1282.8 N 0.0003 PMM2 6910 2.27e-01 496 1918.6 N 0.0060 PPIP5K2 8248 3.29e-01 1611 5902.5 N 0.0005 C1orf64 8316 3.35e-01 637 7746.2 N 0.0033 FASLG 8483 3.54e-01 227 711.3 Y 0.0007 TRAPPC1 9276 4.48e-01 1972 6522.3 N 0.0060 GKN2 10028 5.55e-01 589 4442.7 N 0.0033 MTERF 10454 1.61e-01 2818 8682.0 N 0.0075 IGHJ6 10958 7.78e-01 324 6174.5 N 0.0060

7 Table S2, arank P is rank of genes based on gene level p values from GWAS, bpval is p values from GWAS, crank N0 is rank of posterior scores in network overlaid by GWAS prior, drank Ni is average gene rank by posterior scores in net- works overlaid by random prior information. evalidate de- notes whether the gene is reported (Y-Yes, N-No) as associa- tion in other Crohn’s Disease studies according to GWAS catalog, www.pnas.org/cgi/doi/10.1073/pnas.0903103106. f is the p value of posterior score, based on 10000 permutations, calculated as the proportion of occurrence of larger posterior scores in the permuta- tion.

8 IL12 pathway rank GWAS arank N0 brank random JUN 2916 16 18.2 MAP2K6 2025 970 718.2 IL12A 9016 3527 4461.9 IL12B 214 334 1887.7 CD3E 2621 1654 3943.6 CD3G 744 2197 4901.5 MAPK8 9032 67 56.8 ETV5 6467 1994 1884.1 IFNG 9837 31 44.9 STAT4 5617 1007 1323.9 IL12RB1 627 1432 3865.7 MAPK14 3010 93 93.0 CD247 2277 1016 1756.3 IL12RB2 20 2483 5702.7 IL18 5164 293 612.7 TYK2 9622 499 781.9 JAK2 3063 54 100.2 IL18R1 417 1109 3342.2 CD3D 745 1045 3601.1 Table S3 Ranks of IL12 pathway genes in GWAS and net- work propagation (T = 1), GWAS rank of gene is based on the most significant SNPs. arank N0 is gene rank by poste- rior scores in network of GWAS prior. brank random is the average gene rank by posterior scores in networks of random prior. 15 out of 19 genes have a rank increase in the network of GWAS prior.

9

Appendix C

Efficient clustering of identity-by-descent between multiple individuals

C.1 Summary

This paper was produced when I visited Department of Biostatistics at Uni- versity of Washington, Seattle. In collaboration with Brian Browning and Sharon Browning, I wrote this paper with me as main author. Identity-by-descent (IBD) denotes the relation of haplotypes in a finite population, i.e., two haplotypes are IBD at a locus if they are identical and inherited from a common ancestor. In this study, we consider the IBD rela- tion among multiple individuals, instead of pairwise IBD relations in common practice. Being able to detect multiple IBD relations in a fast and accurate manner, we further explore its application in GWASs. In the end, we show that encoding the multiple IBD relation as random effects in linear mixed- effects models has potential power boost in GWAS applications.

97 Vol. 30 no. 7 2014, pages 915–922 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btt734

Genome analysis Advance Access publication December 19, 2013 Efficient clustering of identity-by-descent between multiple individuals Yu Qian1,*, Brian L. Browning2,3 and Sharon R. Browning2

1Bioinformatics Research Center, Aarhus Universitet, 8000C Aarhus, Denmark, 2Department of Biostatistics and Downloaded from 3Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA, USA Associate Editor: Alfonso Valencia

ABSTRACT IBD mapping in association studies (Browning and Browning, http://bioinformatics.oxfordjournals.org/ Motivation: Most existing identity-by-descent (IBD) detection meth- 2012). ods only consider haplotype pairs; less attention has been paid to The current resolution of IBD detection in single-nucleotide considering multiple haplotypes simultaneously, even though IBD is polymorphism (SNP) array data is 1–2 cM, corresponding to a an equivalence relation on haplotypes that partitions a set of haplo- common ancestor within the past 25–50 generations; therefore, types into IBD clusters. Multiple-haplotype IBD clusters may have ad- many short segments will be missed in pairwise IBD detection vantages over pairwise IBD in some applications, such as IBD (Browning and Browning, 2012). The missing information can be mapping. Existing methods for detecting multiple-haplotype IBD clus- partly retrieved by multiple-haplotype analysis that identifies ters are often computationally expensive and unable to handle large clusters of haplotypes that are all identical by descent, which samples with thousands of haplotypes. we refer to as multiple-IBD clusters later in the text. Because Results: We present a clustering method, efficient multiple-IBD, which IBD is an equivalence relation on haplotypes, if haplotypes uses pairwise IBD segments to infer multiple-haplotype IBD clusters. It A and B are IBD and haplotypes A and C are IBD, then haplo- types B and C are also IBD by definition, even though their expands clusters from seed haplotypes by adding qualified neighbors at Aarhus Universitets Biblioteker / University Libraries on September 1, 2014 and extends clusters across sliding windows in the genome. Our shared segment may be too short to be detected in pairwise method is an order of magnitude faster than existing methods and IBD detection. Within a certain region, if we only observe pair- has comparable performance with respect to the quality of clusters wise IBD segments of (A, B) and (A, C), but not (B, C), we say it uncovers. We further investigate the potential application of multiple- that there is inconsistency in the pairwise IBD segments. haplotype IBD clusters in association studies by testing for association Such inconsistency can be resolved by grouping A, B and C between multiple-haplotype IBD clusters and low-density lipoprotein into a multiple-IBD cluster, where each member in the cluster cholesterol in the Northern Finland Birth Cohort. Using our multiple- is IBD to all the other members. haplotype IBD cluster approach, we found an association with a gen- There have been several previous attempts to detect multiple- omic interval covering the PCSK9 gene in these data that is missed by IBD. MCMC IBD finder (Moltke et al., 2011) is a Markov standard single-marker association tests. Previously published studies Chain Monte Carlo approach that considers multiple individuals confirm association of PCSK9 with low-density lipoprotein. simultaneously; however, it is not computationally tractable for Availability and implementation: Source code is available under the genome-wide analysis with large sample sizes. The DASH GNU Public License http://cs.au.dk/~qianyuxx/EMI/. method (Gusev et al., 2011) builds on pairwise IBD segments Contact: [email protected] and applies an iterative minimum cut algorithm to identify den- Supplementary information: Supplementary data are available at sely connected haplotypes as IBD clusters. The method scans the Bioinformatics online. genome through sliding windows and output subgraphs of desired density. The performance of the DASH method in Received on August 24, 2013; revised on November 2, 2013; terms of speed and accuracy has not been investigated previ- accepted on December 13, 2013 ously. IBD-Groupon (He, 2013) is a recently developed method that also detects group-wise IBD tracts based on pair- wise IBD segments. Unlike DASH, which takes the length of 1 INTRODUCTION windows and density threshold of subgraphs as input param- In a finite population, haplotypes are identity-by-descent (IBD) eters, IBD-Groupon is almost parameter-free and it uses a if they are identical and inherited from a common ancestor. hidden Markov model (HMM) to determine the cliques and Tracts of IBD are broken up by recombination during meiosis. the length of group-wise IBD tracts automatically. The perform- To be detectable, a pairwise IBD segment must be sufficiently ance of IBD-Groupon was evaluated on simulation data of long and contain a sufficient number of genotyped markers chromosome 22 with 6159 SNPs, based on a real pedigree in (Browning and Browning, 2013a; Gusev et al., 2009). Detected HapMap, and it shows high power in detecting short IBD IBD segments have several important applications, for example, tracts (He, 2013). In comparison with DASH, IBD-Groupon detecting signals of natural selection (Albrechtsen et al., 2010), has a similar execution time with higher accuracy for a sample inference of population structure (Ralph and Coop, 2013) and size of 90 related individuals; however, the relative performance of DASH and IBD-Groupon was not evaluated using popula- *To whom correspondence should be addressed. tion data, larger samples sizes or multiple parameter settings.

ß The Author 2013. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: [email protected] 915 Y.Qian et al.

In this study, we evaluate DASH using population samples with outbred populations. Standard methods used in GWAS evaluate thousands of individuals in a range of different parameters set- each variant individually with univariate statistics, such as the tings. We do not evaluate IBD-Groupon v1.1 because it required Cochran–Armitage test for trend, and are underpowered for rare too much memory (412 Gb) when analyzing the simulated variants unless sample sizes or effect sizes are large (Li and Leal, datasets in this study. 2009). Over the past few years, many group-wise association tests In this work, we present the efficient multiple-IBD (EMI) al- have emerged as to overcome this limitation, including burden gorithm to search for multiple-IBD clusters along sliding win- tests (Madsen and Browning, 2009) and variance-component dows in the genome. In contrast to DASH, which builds clusters tests (Wu et al., 2011). Burden tests are typically based on col- from a largest connected component and iteratively divides it lapsing or summarizing the rare variants within a region by a Downloaded from into smaller clusters by a minimum cut algorithm, EMI takes single value, which is then tested for association with the trait. an agglomerative approach. It builds each cluster from initial However, a limitation for such methods is that they implicitly seed haplotypes and recursively adds qualifying haplotypes to assume that all rare variants influence the phenotype in the same expand the cluster. High computational efficiency is ensured by direction, which might not be true in real applications. the use of priority queues. We evaluate our results based on Alternatively, we can cast the problem in the framework of http://bioinformatics.oxfordjournals.org/ coalescence simulations. Coalescence simulations allow the inves- linear mixed models with random effects and apply a global tigation of realistic scenarios while enabling determination of the score test for the null hypothesis that all the variance components true multiple-IBD status. We compare the performance of EMI are 0. This can be conveniently tested with a variance-component and DASH with thousands of samples. score test in the corresponding mixed model, which is known to Although DASH, IBD-Groupon and EMI all use graph-based be a locally powerful test (Lin, 1997). Because most of our IBD clustering methods, there is an important difference in how clus- clusters to be tested are rare, we chose to use mixed models and ters are identified. IBD-Groupon uses an HMM to find the most test for random effects of all the variants in a region. We chose to likely maximal cliques within each chunk of genome so as to use the sequence kernel association test (SKAT) (Wu et al.,2011) resolve inconsistencies. Such a procedure only involves pruning for our analysis, as it provides a supervised flexible regression edges (an edge is a pairwise IBD segment), even though some- model to test for association between genetic variants (common times adding a few edges can not only resolve inconsistencies but and rare) in a region. SKAT allows for adjustment for covariates

also recover missing pairwise IBD segments in the input. Both and has been shown to be powerful for most underlying hypoth- at Aarhus Universitets Biblioteker / University Libraries on September 1, 2014 EMI and DASH build clusters that are highly connected and eses concerning the relationship of variants and complex traits therefore can prune incorrect edges as well as add missing (Ladouceur et al., 2012). edges. Missing edges should not be ignored, especially for short segments, as the state-of-the-art pairwise IBD detection methods only achieve high power (e.g. 0.8) for IBD tracts longer than 2 MATERIALS AND METHODS 2 cM in SNP data (Browning and Browning, 2013a). In this section, we first describe the algorithm framework of EMI, and Detected IBD segments have many applications, one of which then we describe the simulation framework for comparing and evaluating is IBD mapping in association studies (Browning and the multiple-IBD clusters found by EMI and DASH. Finally, we describe Thompson, 2012; Francks et al., 2010; Gusev et al., 2011; Lin the real data to which we will apply IBD mapping. et al., 2013; Purcell et al., 2007). One can code the multiple-IBD clusters as genetic markers and use them for association testing in 2.1 Algorithm outline genome-wide association studies (GWAS). EMI clusters pairwise IBD segments into multiple-IBD clusters with an GWAS have identified many common variants associated with algorithm that is similar to one used in the systems biology domain for diseases, yet they have explained relatively little of the heritability fast clustering of biological networks such as protein–protein interaction of complex diseases. Rare genetic variants, often defined as vari- networks (Jiang and Singh, 2010). The predicted clusters can be used to ants with minor allele frequency (MAF) 51% 5%, can play predict missing links in the network and to search for protein complexes important roles in complex diseases and traits (Schork et al., and functional modules. Extending the algorithm to a genome-wide scale, EMI builds a graph 2009). It has been suggested that rare variants may contribute with nodes representing individual haplotypes and edges representing to disease and thus partly explain the missing heritability (Eichler IBD in the local region. The algorithm moves to the next region through et al., 2010). IBD mapping falls into the category of focusing sliding windows. The term multiple-IBD cluster means a highly connected mainly on signals from rare variants because IBD detection subgraph where each haplotype in the graph is estimated to be IBD to all methods detect long shared segments that correspond to a rela- the other haplotypes in the same subgraph. tively short time to the common ancestor. Variants arising Consider a genomic region that is divided into K consecutive windows shortly before the most recent common ancestor of a group of of length H. For each window wink, its left and right window boundary is IBD haplotypes will be correlated with the IBD group member- denoted as wink(L) and wink(R), and we have wink(R) ¼ wink(L) þ H. Given ship; these variants, being recent, are rare. Studies have shown N haploid copies of genome, the construction of multiple-IBD clusters is that IBD mapping may have higher power than association ana- fundamentally dependent on the presence of pairwise IBD segments, which can be efficiently detected by existing tools such as Beagle lysis of SNP data when multiple rare causal variants are clustered Refined IBD (Browning and Browning, 2013a) and GERMLINE within a gene (Browning and Thompson, 2012). (Gusev et al., 2009). A multiple-IBD cluster-based association test is essentially a Assume a set S of pairwise IBD segments in this dataset, where each rare-variant association test. The frequency of each cluster is element s 2 S has the form s ¼ (i,j,l,r),i,j2 1, ...,Nandrepresentsa determined by the number of haplotypes in that cluster, which shared IBD segment between haplotypes i and j in the genomic interval is often small because large IBD groups are rare, especially for [l,r]. Note that the order of i and j does not matter, and s can also be

916 Efficient clustering of IBD Downloaded from http://bioinformatics.oxfordjournals.org/

Fig. 1. A running example of EMI in two adjacent sliding windows, with density cutoff denmin ¼ 0.8. (A) An edge denotes a pairwise IBD segment that spans the entire window. Numbers denote the weights of the edges, which are determined from the total length of IBD sharing (as described in Section 2). H3 is selected as the first node because it has the highest weighted degree of 3.79. H1 has the largest edge weight among all the outgoing edges of H3 and is selected as the second seed node. The cluster expansion starts with {H3, H1}, then H2 is added, as it has the largest support for the existing cluster, which is 1.85 (0.9 þ 0.95). The cluster continues to expand and we get {H3, H1, H2, H5, H4}. However, density is below the cutoff after adding H4, so we remove H4 and output the clusters (of minimum size 3) to (A’). (B) When we move the next window, an edge is removed from the current cluster {H3, H1, H2, H5}, The new density is below the cutoff, so we remove the edge and check whether H5 should be removed. After removing H5, we start with the remaining nodes. {H3, H1, H2} stays and is considered as a single node. We then start the expansion again. We end up with two clusters {H3, H1, H2, H4} and {H5, H6, H7}, shown in (B’) at Aarhus Universitets Biblioteker / University Libraries on September 1, 2014

written as s ¼ (j, i, l, r). Within window wink, the corresponding pairwise We define the following terms that are specific to our implementation: IBD collection Sk is the subset of S, containing member s of S that cover P dw(a) ¼ wa, b, the weighted degree for node a. the entire window, that is satisfying l wink(L) and r wink(R). ea, b2E A running example is shown in Figure 1, and the details of the algo- C, a cluster of nodes (haplotypes). Each cluster of nodes determines a rithm are described below. User-defined parameters include the minimum subgraph of Gl P density of a cluster denmin and the sliding window size H. density (C) ¼ Iðea, b 2 CÞ=ðjjC ðjjC 1Þ=2Þ,thedensityofcluster C. I is an indicator function and takes the value of 1 if edge e exists 2.1.1 Single-locus clustering Readers are encouraged to read the a,b in cluster C, and 0 otherwise. paper by Jiang and Singh (2010) wherein they explain the similar tech- nical details. Here we only briefly outline the algorithm framework and its C [a, a 2 V, a temporary new cluster constructed by adding node a implementation. to cluster C. P Definitions: Within window wink along the genome, we define a weighted support (a, C) ¼ b2C, e 2E wa, b, the support of node a to cluster C. undirected graph Gk ¼ (V,Ek,Wk), V ¼ 1 ...N, with vertices V a, b representing N haploid individuals, and edges Ek corresponding to IBD Expanding clusters: Given a weighted network, the algorithm outputs k segments in Sk. An edge ea,b 2 E ,a,b2 V exists if there is an IBD a set of disjoint dense subgraphs, which are the multiple-IBD clusters segment s ¼ (a,b,l,r) in Sk; otherwise, there is no edge between node a in our application. and b. An edge ea,b has a weight wa,b 2 [Wmin,1]withWmin40, which represents a confidence score for IBD segment sharing between a and b. (i) Seed selection The weight can be determined by various methods such as the length of The vertex a with highest weighted degree in the current network IBD segment or a likelihood ratio for an IBD versus a non-IBD model in is selected as the first seed node. Then we divide the neighbor- the output of Beagle Refined IBD (Browning and Browning, 2013a). ing nodes of a into five bins based on the edge weight, for

In our analyses, we first applied 90% winsorization on all the IBD example, if Wmin ¼ 0.8, then the corresponding five bins are [0.8, segment length (i.e. segment lengths below the 5th percentile were set 0.84], [0.84, 0.88], [0.88, 0.92], [0.92, 0.96], and [0.96, 1]. We search to the 5th percentile, and segment lengths above the 95th percentile from highest weight bin [0.96, 1] to lowest. If the current bin is not were set to the 95th percentile), and we then linearly transformed the empty, node argmaxb dw(b) in this bin is chosen as the second seed resulting length into a weight between [Wmin,1],whereWmin ¼ 0.8. node. However, the winsorization might not be necessary, as results without (ii) Cluster expansion winsorization are similar. If previously detected pairwise IBD collections S are error-free, a fully The current cluster C starts with the two seed nodes and the edge connected subgraph of Gk would indicate a multiple-IBD cluster. In the between them. Then we search for a node b with a maximum value presence of error in pairwise IBD segments, we would see false or missing of support (b, C) in the remaining unclustered nodes. If density edges in the graph; thus, multiple-IBD clusters can be identified by dense (C[ b) is above a threshold denmin,nodeb is added to cluster C; subgraphs that are nearly complete. otherwise, output cluster C.

917 Y.Qian et al.

(iii) Repeating uniformly distributed between 2 and 50%. The number of variants The above procedure of seed selection and cluster expanding is corresponds to a SNP density of 1 million SNPs genome wide. repeated for the remaining unclustered nodes until all nodes are The multiple-IBD clustering requires pairwise IBD as input, which we clustered or there is no edge left among the un-clustered nodes. generated using two different approaches. The first one uses the thinned SNP array data and applies to the standard GWAS settings, where SNP (iv) Implementation data are often available. We use Beagle version 4 Refined IBD (r1058) for The implementation uses priority queues. The first priority queue is haplotype phasing and pairwise IBD detection. All the parameters were used to pick the seed haplotype with the highest weighted degree. left at their default values, i.e. ibdcm ¼ 1.0 (the minimum length in Once a haplotype has been used in a cluster, it is removed from the centiMorgans of reported IBD) and ibdlod ¼ 3 (the minimum LOD queue and the weighted degrees of all its neighbors are decreased score for reported IBD). Downloaded from accordingly. The second priority queue is used for expanding clus- The second approach uses sequencing data and evaluates the optimal ters. Each haplotype b adjacent to one of the haplotypes in cluster C performance of multiple-IBD clustering when we have nearly perfect being built is included in the queue and is prioritized based on sup- power for pairwise IBD detection, even for IBD tracts as short as port (b, C). It is implemented in the Fibonacci heap and supports 0.2 cM. Two haplotypes are identical-by-state (IBS) if they are identical insertion and key decrease, with a theoretical complexity of O (jVj within a region, and we treat all the pairwise IBS segments longer than

log(jVj) þjEj) (Fredman and Tarjan, 1987) 0.2 cM as pairwise IBD segments after removing variants with MAF http://bioinformatics.oxfordjournals.org/ 50.25%. This frequency filtering removes variants that arise from muta- 2.1.2 Multilocus clustering In application to genome-wide data, tion events since the most recent common ancestor. Using IBS to EMI slides the local window along the genome and extends the single- approximate IBD tracts is, however, not applicable in real data due to locus clustering to multilocus clustering. sequencing error and haplotype phasing error. The first window win is analyzed with the single-locus algorithm, 0 2.2.2 True IBD clusters The coalescent tree traces the ancestry of and we get an initial set of multiple-IBD clusters . When we move to 0 sample chromosomes back in time until there is a single common ances- a new window win from the previous window win , we use a modified kþ1 k tor. When recombination is involved, the ancestral relationship among version of single-locus clustering that includes three steps. chromosomes is complicated. At any single position along the genome, (i) Dissolving of old clusters there is still a tree, but the trees at nearby positions may differ. Multiple-IBD essentially means multiple haplotypes share a common For each pairwise IBD segment s ¼ (i, j, l, r) that belongs to IBD ancestor. In our simulated data, we define the true multiple-IBD cluster collection Sk but not Skþ1, if its edge ei, j connects two nodes in a as a cluster of haplotypes that share the same common ancestor across a at Aarhus Universitets Biblioteker / University Libraries on September 1, 2014 cluster C, remove this edge and check the density of the updated ¼ cluster C.DissolveclusterC if the density is below a threshold. region of length Htrue.WechoosetouseHtrue 200 kb (approximately equivalent to 0.2 cM) in the evaluation framework, which is also the (ii) Expanding and merging existing clusters. window size we used for IBD mapping. More specifically, within a

For each unclustered node a, add it to an existing cluster C 2 k region T of size Htrue, a multiple-IBD cluster CTi is defined as any sub- if support (a, C) using IBD segments in Skþ1 is above a thresh- tree of the true coalescent trees, which contains all descendants of the root old. If an IBD segment that belongs to Skþ1 but not Sk, connects of the sub-tree and spans this region without topology changes. Shorter two existing clusters Cm and Cn (Cm,Cn 2 k), merge these two Htrue results in decreased performance for EMI and DASH, due to the clusters if the density of combined cluster is above the threshold. limitations of pairwise IBD resolution, while longer Htrue leads to too few (iii) Single-locus clustering true multiple-IBD clusters, as recombination breaks down the IBD signals. For the remaining nodes that are not included in any cluster, create new clusters in window winkþ1 with the single-locus 2.2.3 Performance metrics The power and accuracy of resulting clus- clustering. ters are measured by how well different methods recover the true under-

lying genealogy. The number of true IBD clusters CTi in region T depends 2.1.3 Software implementation The above algorithm is implemented on the local recombination rate as well as on the length of region T. As in Cþþ and is freely available at http://cs.au.dk/*qianyuxx/EMI/ any sub-tree (a true IBD cluster) of the coalescent tree also forms a true IBD cluster, there is a hierarchy structure in the true multiple-IBD clus- 2.2 Evaluation of clusters ters, and therefore evaluating how well the clustering algorithm works is challenging. 2.2.1 Data simulation The coalescent model with recombination de- scribes genealogies of underlying chromosomes from unrelated individ- 2.2.4 Overlap measure Two overlap measurements, the Jaccard uals (McVean and Cardin, 2005). We used the program MaCS (version measure (Jaccard) and the Precision-Recall (PR) measure, are widely 0.4f) (Chen et al., 2009) to simulate sequence data under the coalescent used in systems biology (Kelley and Ideker, 2005; Song and Singh, model with recombination. Ten datasets were simulated; each consists of 2009) when there is a hierarchy structure in functional modules. 4000 haplotypes spanning a 10 Mb region. The simulated recent effective Jaccard: given two sets, the Jaccard similarity coefficient is defined as population size is 3000 individuals, whereas the ancient (before 5000 gen- the size of the intersection over the size of the union. For each output erations ago) population size is 24 000 individuals. The mutation rate is cluster C, its Jaccard value with a true IBD cluster CTi is defined as 8 1.38 10 , and the recombination rates follow the HapMap recombin- jC \ CTi j jC [ C j. The Jaccard measure for cluster C is the maximum Jaccard ation map for chromosome 20 (Frazer et al., 2007). The parameters for Ti T MaCS were ‘4000 10000000 T -t 0.0001656 -r 0.00012 -h 1000 -R value over all true IBD clusters in region . C C chr20map -G 0.0 -eN 0.4167 8.0’. The simulation is designed to mimic PR: Similarly, for each cluster and true IBD cluster Ti,thePR jC \ CT jjC \ CT j an isolated population such as that for the Northern Finland Birth score is defined as i i . The PR measure for cluster C is the jCj jCTi j Cohort (NFBC) data, on which we conducted IBD mapping. maximum of PR scores over all true IBD clusters CTi in region T. We then generated simulated SNP array data by thinning the sequence data. All variants with MAF 52% were removed; 3000 markers were 2.2.5 Coverage measure Recall Rate (RR) is used to quantify how selected among the remaining variants in each 10 Mb region, with MAF many individuals are assigned to a correct cluster, with the following

918 Efficient clustering of IBD

definitions: (i) True Positive (TP) node: a node is called TP node if it the proportion of haplotypes that are assigned to a correct clus- belongs to a cluster C, and there exists a true IBD cluster CTi in region T, ter, and it is similar to the definition of power in a hypothesis jC \ CTi j such that 0:8. (ii) True node: any node in a true IBD cluster CTi test. PR or the Jaccard measure measures the accuracy of clus- jC [ CTi j is called a true node. The RR is defined as the number of TP nodes ters, which is an analog to 1 minus the type 1 error rate. Often a divided by number of true nodes. Note that the number of true nodes higher power is preferred while keeping the type 1 error rate is the union of nodes in all CTi, and this can be less than the number of under control in association tests. Similarly, in our application, haplotypes. we prefer a higher RR while keeping the PR or Jaccard measure above a certain level.

2.3 IBD mapping Downloaded from 3.1.1Gridsearchofparameters DASH has two versions ac- 2.3.1 Data set We ran EMI for multiple-IBD clustering in the NFBC cording to the documentation on the Web site (http://www.cs. 1966 GWAS data, downloaded from dbGaP (accession number phs000276.v1.p1), and subsequently tested for association with low- columbia.edu/*gusev/dash/). One version is DASH_cc, which is density lipoprotein (LDL) cholesterol, which has shown a high level of parameter-free and does not apply dense-subgraph searching. heritability in previous studies (Browning and Browning, 2013b). The DASH_cc iteratively searches for all clusters without enforcing http://bioinformatics.oxfordjournals.org/ 5402 individuals included in NFBC are drawn from genetically isolated aminimumdensity,e.g.denmin ¼ 0, and thus it generates a few Finnish regions and thus have relatively more IBD than other popula- large clusters that do not accurately approximate the true mul- tions. A standard GWAS analysis has been published, and multiple tiple-IBD clusters. The other version is DASH_adv, for which genome-wide significant common variants were found (Sabatti et al., the goal of finding dense subgraphs is similar and comparable 2009). with EMI. In our experiments, DASH_adv also runs faster than Pairwise IBD detection was done as reported previously (Browning DASH_cc, so we used DASH_adv in the analysis. For simplicity, and Browning, 2013b). On chromosome 1, IBD detection took 83 h we drop the subscript adv and refer to it as DASH in the with a minimum IBD length parameter of 1.0 cM. Close relatives and outlying individuals based on the value of the first 20 eigenvectors were following. removed. For LDL, we also excluded individuals who were pregnant, had Both DASH and EMI take two input parameters, denmin diabetes or were not fasting at the time of measurement. After data fil- (minimum density) and H (window length), but no previous tering, we were left with 4406 individuals. studies have shown which parameters should be recommended.

Using simulation data, we performed a grid search in the 2D at Aarhus Universitets Biblioteker / University Libraries on September 1, 2014 2.3.2 Linear mixed model We ran EMI in this dataset with the best parameter space with denmin taking values of [0.4, 0.5, 0.6, 0.7, parameters tuned in our simulation, including the cluster density cutoff 0.8] and H taking values of [100, 200 and 400 kb]. We generated denmin and the sliding window size H. Each multiple-IBD cluster is con- sidered as a variant, and the cluster frequency is therefore defined as the 10 datasets as described in Section 2.2.1. From each dataset, 10 cluster size (number of haplotypes in this cluster) divided by the total windows,200 kb in length, were chosen at random. Within each number of haplotypes, which is 8812 for 4406 individuals. window, we evaluated the quality of clusters by comparing them Because most of the multiple-IBD clusters are rare (with fre- with the true multiple clusters. The average results from all quency50.01) and may have small effects on the trait, the standard 100 windows are shown in Table 1. One can see that with dif- single variant test will be underpowered. ferent combinations of denmin and H, both DASH and EMI In contrast to the concept behind the single SNP test, we fit the effects have similar performance. However, a lower density cutoff, of all clusters as a random effect in a linear mixed model. Consider the such as 0.5, gives better results for both EMI and DASH. We linear model decided to use denmin ¼ 0.5 and H ¼ 200 kb for further analysis, 0 0 yi ¼ 0 þ a Xi þ b Gi þ i where EMI reaches the highest RR and a high PR that is not different from the highest average PR. yi denotes the phenotype for the ith subject, a is a vector of fixed effects and b is a vector of random effects with mean 0. Xi ¼ðXi1, Xi2, ..., Xi12Þ denotes the 15 covariates of sex, oral contraceptive use and the first 10 3.1.2 Comparison with DASH Two types of pairwise IBD seg- eigenvectors. Gi ¼ðGi1, Gi2, ..., GipÞ,withGij ¼ 0, 1, 2 represents the ments are generated as input for EMI and DASH, as described number of haplotypes in cluster j for the ith subject for the p clusters in Section 2.2.1, which we refer to as SNP data and IBS data, 2 within the region. i is an error term with mean 0 and variance of . respectively. Testing association of p clusters with the trait corresponds to testing the SNP data: Given thinned SNP data, we used Beagle Refined null hypothesis H0: VarðbÞ¼0, which can be tested with a variance-com- IBD to generate pairwise IBD segments, which are used as input ponent score test. Such a test of non-zero variance in a linear mixed for multiple-IBD clustering afterward. The time and perform- model is implemented using the software SKAT (version 0.82) (Wu ance in 10 simulated datasets are shown in Table 2. Compared et al., 2011) and was used in our analysis. with DASH, EMI has slightly lower accuracy in terms of the Jaccard measure and PR, and slightly higher power in terms of 3 RESULTS how many haplotypes are assigned to the correct cluster. IBS data: It has been shown that most existing pairwise IBD 3.1 Performance of clustering in simulated data detection methods have lower power for short IBD segments In this section, we evaluate the performance of clustering with (e.g. shorter than 2 cM) given SNP data. Therefore, the perform- different parameters, and we compare EMI and DASH, with ance of multiple-IBD clustering might be limited by the input of respect to the running time and accuracy of multiple-IBD clus- pairwise IBD segments. To explore the optimal performance a ters. All the computation times in this study are from runs on a clustering method can achieve, we also generated pairwise IBS single processor of a 2.4 GHz computer. The clusters are evalu- segments from the sequencing data and used them as input for ated in terms of RR and PR (or Jaccard measure). RR measures the clustering. The results are shown in Table 2. As expected, the

919 Y.Qian et al.

Table 1. The average performance for grid search in the parameter space

Method denmin 0.50.60.70.8

H (kb) 100 200 400 100 200 400 100 200 400 100 200 400

EMI RR 0.830 0.834 0.832 0.823 0.826 0.825 0.809 0.811 0.811 0.797 0.801 0.799 J 0.744 0.753 0.767 0.731 0.742 0.756 0.735 0.744 0.760 0.726 0.734 0.750

DASH RR 0.729 0.736 0.743 0.734 0.742 0.748 0.730 0.736 0.740 0.738 0.746 0.745 Downloaded from J 0.776 0.776 0.779 0.752 0.759 0.766 0.742 0.747 0.752 0.710 0.717 0.727

Note: Each cell shows the mean value of the performance metrics in 100 random windows, with Htrue ¼ 200 kb as the benchmark for calculation. J denotes Jaccard measure and RR denotes Recall Rate. http://bioinformatics.oxfordjournals.org/

Table 2. Average performance of the clustering in simulation

Data Method J PR RR Time (seconds)

SNP DASH 0.775 (0.003) 0.767 (0.003) 0.736 (0.007) 6.016 EMI 0.753 (0.003) 0.744 (0.003) 0.835 (0.008) 0.712 IBS DASH 0.884 (0.001) 0.882 (0.001) 0.910 (0.002) 252.267 EMI 0.864 (0.001) 0.863 (0.001) 0.938 (0.002) 13.542

Note: In each of the 10 datasets, 10 random windows of size Htrue¼ 200 kb are chosen to calculate the performance metrics. The columns of J (Jaccard measure), PR

(PR measure) and RR (Recall Rate) show the mean (standard error) values in overall 100 windows. The Time column shows mean over 10 datasets. Parameters for at Aarhus Universitets Biblioteker / University Libraries on September 1, 2014

DASH and EMI are H ¼ 200 kb and denmin ¼ 0.5. The corresponding short terms are J (Jaccard measure), PR (Precision-Recall measure) and RR (Recall Rate) as defined in the main text.

performance increased significantly compared with the results with thinned SNP data. However, neither EMI nor DASH achieved 100% power or accuracy because both methods use heuristics, and the input IBS segments differ somewhat from true IBD segments. The IBS data have more pairwise IBD segments than the thinned SNP data, and the number of edges (jEj)increases within each window; hence the running time also increases signifi- cantly. Within each window, the average value of jEj is 45 325 for SNP data and 158 951 for IBS data. We can see that the comput- ing time for EMI scales better than DASH as jEj increases.

3.2 IBD mapping in NFBC data Fig. 2. The distribution of sizes of multiple-IBD clusters in NFBC data 3.2.1 Multiple-IBD clustering We use cluster density cutoff denmin ¼ 0.5 and sliding window size H ¼ 0.2 cM as input param- eters for EMI, which is close to the optimal parameters in our simulation. Runs with slightly different parameters do not association test with LDL. We ran SKAT on sliding windows change the results significantly (data not shown). Running on of size 0.2 cM, which is also the window size we used for EMI. a single processor, EMI takes 50 min to analyze the 22 auto- Across the 22 autosomes, there were 22 630 sliding windows somes in the NFBC data with 5402 individuals. EMI outputs wherein multiple-IBD clusters were found. all the clusters with a minimum size 3. The cluster frequency For each window, all the clusters spanning the entire window distribution is shown in Figure 2. As we expected, most of the are considered and contribute to the random effects in the mixed clusters have small sizes. Approximately 98% of them have a model, and a P-value is calculated against the null hypothesis of 0 frequency50.01. variance of random effects. The P-values along the sliding win- dows are shown in Supplementary Figure S1 and the Quantile– 3.2.2 Association mapping with linear mixed models After Quantile (QQ) plot is shown in Supplementary Figure S2. removing close relatives and individuals with missing covariates, To correct for multiple testing, we performed 1000 permuta- we were left with 4406 individuals in the NFBC data for the tions of the trait values, keeping the covariates unchanged, and

920 Efficient clustering of IBD

Table 3. Loci reaching the significance threshold in the analysis of LDL

Chromosome Window start Window end SKAT P Empirical P Gene

1 55230403 55258652 7.44 106 0.013 1 55258652 55293452 9.37 106 0.015 PCSK9 19 50016356 50069307 1.51 106 0.002 19 50069307 50140305 4.52 107 0.001 APO cluster Downloaded from Note: Window start and window end are the boundaries of sliding windows. SKAT P is the P-value reported by SKAT, showing the significance of random effects in the mixed model. Empirical P is obtained based on minimal P values from 1000 permutations, as detailed in the text, to adjust for multiple testing across the autosomes.

ran SKAT for each permutation. We obtained the experiment- mapping and population genetic inference. Most existing meth- http://bioinformatics.oxfordjournals.org/ wide distribution of minimal P-values over all loci from the per- ods for IBD detection only consider pairwise IBD, yet the mutation replicates. The empirical P is defined as the proportion implementation and applications of multiple-IBD have not of minimal P values from permutations that are smaller than the been fully explored. P-value on the original data in this region. The QQ plot shows In this study, we developed EMI to detect IBD segments that that the variance-component score test is anti-conservative. By are shared by multiple individuals. Unlike the existing method using empirical P-values, we not only adjust for multiple testing DASH, which searches for a highly connected graph by dividing but also correct for the anti-conservative nature of the original the big graph iteratively with a minimum cut algorithm, EMI is P-values. implemented in an agglomerative manner and uses a heuristic The empirical experiment-wide P-value 0.05 corresponds to approach to greedily build clusters. Using efficient data struc- aSKATP-value of 1.13 105 in this analysis. As shown in tures, EMI runs faster than DASH in application to genome- Table 3, we found two loci, the APO cluster and the PCSK9 wide data, with comparable performance. The theoretical time gene, that have an empirical P-value 50.05. complexity is hard to obtain because both DASH and EMI use at Aarhus Universitets Biblioteker / University Libraries on September 1, 2014 Sabatti et al. (2009) conducted standard GWAS analysis on methods to reduce computational effort when moving across this dataset and reported six loci associated with LDL traits. The sliding windows along the genome. However, our simulations APO cluster (MIM107730) was found with both standard show that the difference in running time between EMI and GWAS analysis (P-value 4.96e-8) and with our multiple-IBD DASH becomes larger in a region with increased pairwise IBD approach (P-value 4.52e-7). The signals around the APO cluster sharing, such as data from isolated populations. span several windows. The accuracy of inferred multiple-IBD clusters has not previ- The second strongest signal found in our study is located near ously been evaluated. We used coalescent simulations to evaluate the PCSK9 (MIM 607786) gene, which is known to be involved the accuracy of resulting multiple-IBD clusters and found the op- in regulation of LDL cholesterol. Previous studies suggested that timal parameters for subsequent analysis. Although multiple-IBD mutations in PCSK9 have a strong effect on LDL cholesterol clusters are supposed to be highly connected, a density cutoff as (Cohen et al., 2006). This association was first reported in a low as 0.5 has a fairly good performance. The best density cutoff GWAS-based analysis for a variant of MAF 1% in a sample may depend on the pairwise IBD detection program, which has to of 8816 individuals (Kathiresan et al., 2008). In the NFBC data- make a trade-off between power and type 1 error. For example, we set, there are no SNPs in strong LD with the reported SNP, and used Beagle Refined IBD for pairwise IBD detection, which has a therefore the standard methods failed to report it. The associ- low type 1 error rate but also low power for short IBD segments. ation with PCSK9 was also reported later by one of the largest Therefore, with Beagle Refined IBD, we use a low-density cutoff lipid meta-analyses to date with a sample size of 4100 000 indi- so that we favor adding missing edges over cutting existing edges viduals of European descent (Teslovich et al., 2010) and a recent from the pairwise IBD input. If we use GERMLINE, which has GWAS of African Americans and Hispanic Americans (Coram weaker control of type 1 error to detect pairwise IBD segments, et al., 2013), which indicates that the signal we found is real. a higher density cutoff may work better. It is worth mentioning that the third strongest signal we found The speedup EMI achieves might seem to be insignificant in is chromosome 19 [11145302, 11225495 bp] (hg18), 40 kb the context of GWAS, where a lot more time is spent on the downstream of the LDLR gene (MIM 606945). Although this 5 upstream analysis. We have shown in our study that the simple signal (P-value 8.22e ) failed to show significance after the mul- heuristic performs fairly well, but the results are not optimal and tiple-testing correction, it was one of the loci reported by Sabatti can be further improved. For example, one can combine the et al. (2009) and has been validated by many other GWAS with information of clusters of all windows into a global probabilistic larger sample sizes (Coram et al., 2013; Teslovich et al., 2010). framework, similar to the idea behind IBD-Groupon. When it requires many iterations to train an HMM, computational speed matters. Therefore, our method is suitable for use in more 4 DISCUSSION complicated models. IBD haplotype sharing is useful for many applications, such as The concept of IBD mapping is not new, yet the approach used imputation, improved accuracy in haplotype phasing, IBD here is different from the approaches used previously. Previous

921 Y.Qian et al.

IBD analyses have either used pairwise IBD and looked for dif- Francks,C. et al. (2010) Population-based linkage analysis of schizophrenia and ferences in IBD frequency between pairs of cases and pairs of bipolar case-control cohorts identifies a potential susceptibility locus on 19q13. Mol. Psychiatry, 15, 319–325. controls (Browning and Thompson, 2012; Purcell et al.,2007) Frazer,K.A. et al. (2007) A second generation human haplotype map of over 3.1 or have tested each multiple-IBD cluster individually (Gusev million SNPs. Nature, 449, 851–861. et al., 2011). Here we use a variance components approach to Fredman,M.L. and Tarjan,R.E. (1987) Fibonacci heaps and their uses in improved jointly test the multiple-IBD clusters in a region. The multiple- network optimization algorithms. J. ACM, 34, 596–615. IBD cluster approach is expected to be more powerful when Gusev,A. et al. (2009) Whole population, genome-wide mapping of hidden related- ness. Genome Res., 19, 318–326. multiple low-frequency causal variants contribute to a trait. Gusev,A. et al. (2011) DASH: a method for identical-by-descent haplotype mapping

uncovers association with recent variation. Am. J. Hum. Genet., 88, 706–717. Downloaded from Funding: Research grants (HG004960, HG005701, GM099568 He,D. (2013) IBD-Groupon: an efficient method for detecting group-wise identity- and GM075091) from the National Institutes of Health, USA, by-descent regions simultaneously in multiple individuals based on pairwise IBD and the Danish Council for Independent Research (DFF) relationships. Bioinformatics, 29, i162–i170. Natural Sciences grant (09-065923). The NFBC1966 Study is Jiang,P. and Singh,M. (2010) SPICi: a fast clustering algorithm for large biological networks. Bioinformatics, 26, 1105–1111. conducted and supported by the National Heart, and Kathiresan,S. et al. (2008) Six new loci associated with blood low-density lipopro- Blood Institute (NHLBI) in collaboration with the Broad tein cholesterol, high-density lipoprotein cholesterol or triglycerides in humans. http://bioinformatics.oxfordjournals.org/ Institute, UCLA, University of Oulu, and the National Nat. Genet., 40, 189–197. Institute for Health and Welfare in Finland. This article does Kelley,R. and Ideker,T. (2005) Systematic interpretation of genetic interactions not necessarily reflect the opinions or views of the NFBC1966 using protein networks. Nat. Biotechnol., 23, 561–566. Ladouceur,M. et al. (2012) The empirical power of rare variant association Study Investigators, Broad Institute, UCLA, University of Oulu, methods: results from sanger sequencing in 1998 individuals. PLoS Genet., 8, National Institute for Health and Welfare in Finland and the e1002496. NHLBI. Li,B. and Leal,S.M. (2009) Discovery of rare variants via sequencing: implications for the design of complex trait association studies. PLoS Genet., 5, e1000481. Conflict of Interest: none declared. Lin,R. et al. (2013) Identity-by-descent mapping to detect rare variants conferring susceptibility to multiple sclerosis. PLoS One, 8, e56379. Lin,X. (1997) Variance component testing in generalised linear models with random REFERENCES effects. Biometrika, 84, 309–326. Madsen,B.E. and Browning,S.R. (2009) A groupwise association test for rare

Albrechtsen,A. et al. (2010) Natural selection and the distribution of identity-by- mutations using a weighted sum statistic. PLoS Genet., 5, e1000384. at Aarhus Universitets Biblioteker / University Libraries on September 1, 2014 descent in the human genome. Genetics, 186, 295–308. McVean,G.A.T. and Cardin,N.J. (2005) Approximating the coalescent with Browning,S.R. and Browning,B.L. (2012) Identity by descent between distant recombination. Philos. Trans. R. Soc. Lond. B Biol. Sci., 360, 1387–1393. relatives: detection and applications. Ann. Rev. Genet., 46, 617–33. Moltke,I. et al. (2011) A method for detecting IBD regions simultaneously in Browning,S.R. and Thompson,E.A. (2012) Detecting rare variant associations multiple individuals–with applications to disease genetics. Genome Res., 21, by identity-by-descent mapping in case-control studies. Genetics, 190, 1168–1180. 1521–1531. Purcell,S. et al. (2007) PLINK: a tool set for whole-genome association and popu- Browning,B.L. and Browning,S.R. (2013a) Improving the accuracy and effi- lation-based linkage analyses. Am. J. Hum. Genet., 81, 559–575. ciency of identity by descent detection in population data. Genetics, 194, Ralph,P. and Coop,G. (2013) The geography of recent genetic ancestry across 459–471. Europe. PLoS Biol., 11, e1001555. Browning,S.R. and Browning,B.L. (2013b) Identity-by-descent-based heritability Sabatti,C. et al. (2009) Genomewide association analysis of metabolic traits in a analysis in the Northern Finland Birth Cohort. Hum. Genet., 132, 129–138. birth cohort from a founder population. Nat. Genet., 41, 35–46. Chen,G.K. et al. (2009) Fast and flexible simulation of DNA sequence data. Schork,N.J. et al. (2009) Common vs. rare allele hypotheses for complex diseases. Genome Res., 19, 136–142. Curr. Opin. Genet. Dev., 19, 212–219. Cohen,J.C. et al. (2006) Sequence variations in PCSK9, low LDL, and protection Song,J. and Singh,M. (2009) How and when should interactome-derived clusters against coronary heart disease. N. Engl. J. Med., 354, 1264–1272. be used to predict functional modules and protein function? Bioinformatics, 25, Coram,M.A. et al. (2013) Genome-wide characterization of shared and distinct 3143–3150. genetic components that influence blood lipid levels in ethnically diverse Teslovich,T.M. et al. (2010) Biological, clinical and population relevance of 95 loci human populations. Am. J. Hum. Genet., 92, 904–916. for blood lipids. Nature, 466, 707–713. Eichler,E.E. et al. (2010) Missing heritability and strategies for finding the under- Wu,M.C. et al. (2011) Rare-variant association testing for sequencing data with the lying causes of complex disease. Nat. Rev. Genet., 11, 446–450. sequence kernel association test. Am. J. Hum. Genet., 89,82–93.

922

Bibliography

[1] Yu Qian, Brian L Browning, and Sharon R Browning. Efficient cluster- ing of identity-by-descent between multiple individuals. Bioinformatics, page btt734, 2013. 5

[2] Bahareh Rabbani, Mustafa Tekin, and Nejat Mahdieh. The promise of whole-exome sequencing in medical genetics. Journal of human genetics, 59(1):5–15, 2013. 7

[3] Adam Eyre-Walker and Peter D Keightley. The distribution of fitness effects of new mutations. Nature Reviews Genetics, 8(8):610–618, 2007. 7

[4] Peter Andolfatto, Karen M Wong, and Doris Bachtrog. Effective popula- tion size and the efficacy of selection on the x chromosomes of two closely related drosophila species. Genome Biology and Evolution, 3:114–128, 2011. 8

[5] Judith E Mank, Beatriz Vicoso, Sofia Berlin, and Brian Charlesworth. Effective population size and the faster-x effect: empirical results and their interpretation. Evolution, 64(3):663–674, 2010. 8

[6] Jane Charlesworth and Adam Eyre-Walker. The mcdonald–kreitman test and slightly deleterious mutations. Molecular biology and evolution, 25(6):1007–1015, 2008. 8

[7] Michael F Hammer, Fernando L Mendez, Murray P Cox, August E Wo- erner, and Jeffrey D Wall. Sex-biased evolutionary forces shape genomic patterns of human diversity. PLoS genetics, 4(9):e1000202, 2008. 8

[8] Srikanth Gottipati, Leonardo Arbiza, Adam Siepel, Andrew G Clark, and Alon Keinan. Analyses of x-linked and autosomal genetic vari- ation in population-scale whole genome sequencing. Nature genetics, 43(8):741–743, 2011. 9

[9] Michael F Hammer, August E Woerner, Fernando L Mendez, Joseph C Watkins, Murray P Cox, and Jeffrey D Wall. The ratio of human x

107 108 BIBLIOGRAPHY

chromosome to autosome diversity is positively correlated with genetic distance from genes. Nature genetics, 42(10):830–831, 2010. 8, 9

[10] John Wakeley. Complex speciation of humans and chimpanzees. Nature, 452(7184):E3–E4, 2008. 8

[11] Alon Keinan, James C Mullikin, Nick Patterson, and David Reich. Ac- celerated genetic drift on chromosome x during the human dispersal out of africa. Nature genetics, 41(1):66–70, 2008. 9

[12] Yingrui Li, Nicolas Vinckenbosch, Geng Tian, Emilia Huerta-Sanchez, Tao Jiang, Hui Jiang, Anders Albrechtsen, Gitte Andersen, Hongzhi Cao, Thorfinn Korneliussen, et al. Resequencing of 200 human exomes identifies an excess of low-frequency non-synonymous coding variants. Nature genetics, 42(11):969–972, 2010. 9

[13] Devin P Locke, LaDeana W Hillier, Wesley C Warren, Kim C Worley, Lynne V Nazareth, Donna M Muzny, Shiaw-Pyng Yang, Zhengyuan Wang, Asif T Chinwalla, Pat Minx, et al. Comparative and demographic analysis of orang-utan genomes. Nature, 469(7331):529–533, 2011. 9

[14] John J Welch. Estimating the genomewide rate of adaptive protein evolution in drosophila. Genetics, 173(2):821–837, 2006. 9

[15] Adam Eyre-Walker and Peter D Keightley. Estimating the rate of adap- tive molecular evolution in the presence of slightly deleterious mutations and population size change. Molecular biology and evolution, 26(9):2097– 2108, 2009. 9

[16] Steven A. McCarroll. Extending genome-wide association studies to copy-number variation. Human Molecular Genetics, 17(R2):R135–R142, 2008. 13

[17] Stacey B. Gabriel, Stephen F. Schaffner, Huy Nguyen, Jamie M. Moore, Jessica Roy, Brendan Blumenstiel, John Higgins, Matthew DeFelice, Amy Lochner, Maura Faggart, Shau Neen Liu-Cordero, Charles Ro- timi, Adebowale Adeyemo, Richard Cooper, Ryk Ward, Eric S. Lander, Mark J. Daly, and David Altshuler. The structure of haplotype blocks in the human genome. Science, 296(5576):2225–2229, 2002. 14

[18] Jeffrey D Wall and Jonathan K Pritchard. Haplotype blocks and linkage disequilibrium in the human genome. Nat Rev Genet, 4(8):587–597, August 2003. 14

[19] Montgomery Slatkin. Linkage disequilibrium - understanding the evo- lutionary past and mapping the medical future. Nat Rev Genet, 9(R2):477–485, 2008. 14 BIBLIOGRAPHY 109

[20] David J Balding. A tutorial on statistical methods for population asso- ciation studies. Nature reviews. Genetics, 7(10):781–91, October 2006. 14

[21] Peter M. Visscher, Matthew A. Brown, Mark I. McCarthy, and Jian Yang. Five Years of GWAS Discovery. The American Journal of Human Genetics, 90(1):7–24, January 2012. 14

[22] Yasuhito Nannya, Kenjiro Taura, Mineo Kurokawa, Shigeru Chiba, and Seishi Ogawa. Evaluation of genome-wide power of genetic association studies based on empirical data from the hapmap project. Human molec- ular genetics, 16(20):2494–2505, 2007. 16

[23] Sulgi Kim, Nathan J. Morris, Sungho Won, and Robert C. Elston. Single-marker and two-marker association tests for unphased case- control genotype data, with a power comparison. Genetic Epidemiology, 34(1):67–77, 2010. 16

[24] Xuefeng Wang, Nathan J. Morris, Daniel J. Schaid, and Robert C. El- ston. Power of single- vs. multi-marker tests of association. Genetic Epidemiology, 36(5):480–487, 2012.

[25] David H Ballard, Judy Cho, and Hongyu Zhao. Comparisons of multi- marker association methods to detect association between a candidate region and disease. Genetic epidemiology, 34(3):201–212, 2010. 16

[26] Gang Peng, Li Luo, Hoicheong Siu, Yun Zhu, Pengfei Hu, Shengjun Hong, Jinying Zhao, Xiaodong Zhou, John D Reveille, Li Jin, et al. Gene and pathway-based second-wave analysis of genome-wide associa- tion studies. European Journal of Human Genetics, 18(1):111–117, 2009. 16

[27] Heather J Cordell. Detecting gene-gene interactions that underlie human diseases. Nature reviews. Genetics, 10(6):392–404, June 2009. 17

[28] Yue Wang, Guimei Liu, Mengling Feng, and Limsoon Wong. An empiri- cal comparison of several recent epistatic interaction detection methods. Bioinformatics (Oxford, England), 27(21):2936–43, November 2011. 17

[29] Zhengkui Wang, Yue Wang, Kian-Lee Tan, Limsoon Wong, and Di- vyakant Agrawal. eceo: an efficient cloud epistasis computing model in genome-wide association study. Bioinformatics, 27(8):1045–1051, 2011.

[30] Rui Jiang, Wanwan Tang, Xuebing Wu, and Wenhui Fu. A random forest approach to the detection of epistatic interactions in case-control studies. BMC bioinformatics, 10(Suppl 1):S65, 2009. 110 BIBLIOGRAPHY

[31] Heather J Cordell. Detecting gene–gene interactions that underlie hu- man diseases. Nature Reviews Genetics, 10(6):392–404, 2009.

[32] Shyh-Huei Chen, Jielin Sun, Latchezar Dimitrov, Aubrey R Turner, Tamara S Adams, Deborah A Meyers, Bao-Li Chang, S Lilly Zheng, Henrik Grönberg, Jianfeng Xu, et al. A support vector machine approach for detecting gene-gene interaction. Genetic epidemiology, 32(2):152– 167, 2008. 17

[33] Aravind Subramanian, Pablo Tamayo, Vamsi K Mootha, Sayan Mukher- jee, Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, Scott L Pomeroy, Todd R Golub, Eric S Lander, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide ex- pression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43):15545–15550, 2005. 17

[34] Purvesh Khatri, Marina Sirota, and Atul J Butte. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS compu- tational biology, 8(2):e1002375, 2012. 17

[35] Kai Wang, Haitao Zhang, Subra Kugathasan, Vito Annese, Jonathan P Bradfield, Richard K Russell, Patrick M a Sleiman, Marcin Imielinski, Joseph Glessner, Cuiping Hou, David C Wilson, Thomas Walters, Ce- cilia Kim, Edward C Frackelton, Paolo Lionetti, Arrigo Barabino, Johan Van Limbergen, Stephen Guthery, Lee Denson, David Piccoli, Mingyao Li, Marla Dubinsky, Mark Silverberg, Anne Griffiths, Struan F a Grant, Jack Satsangi, Robert Baldassano, and Hakon Hakonarson. Diverse genome-wide association studies associate the IL12/IL23 pathway with Crohn Disease. American journal of human genetics, 84(3):399–405, March 2009. 17

[36] Hariklia Eleftherohorinou, Victoria Wright, Clive Hoggart, Anna-Liisa Hartikainen, Marjo-Riitta Jarvelin, David Balding, Lachlan Coin, and Michael Levin. Pathway analysis of GWAS provides new insights into genetic susceptibility to 3 inflammatory diseases. PloS one, 4(11):e8068, January 2009. 17, 21

[37] Ivan P. Gorlov, Olga Y. Gorlova, Shamil R. Sunyaev, Margaret R. Spitz, and Christopher I. Amos. Shifting paradigm of association studies: Value of rare single-nucleotide polymorphisms. The American Journal of Human Genetics, 82(1):100 – 112, 2008. 18

[38] Walter Bodmer and Carolina Bonilla. Common and rare variants in mul- tifactorial susceptibility to common diseases. Nature genetics, 40(6):695– 701, 2008. BIBLIOGRAPHY 111

[39] Tom Walsh, Jon M. McClellan, Shane E. McCarthy, Anjené M. Adding- ton, Sarah B. Pierce, Greg M. Cooper, Alex S. Nord, Mary Kusenda, Dheeraj Malhotra, Abhishek Bhandari, Sunday M. Stray, Caitlin F. Rippey, Patricia Roccanova, Vlad Makarov, B. Lakshmi, Robert L. Findling, Linmarie Sikich, Thomas Stromberg, Barry Merriman, Nitin Gogtay, Philip Butler, Kristen Eckstrand, Laila Noory, Peter Gochman, Robert Long, Zugen Chen, Sean Davis, Carl Baker, Evan E. Eichler, Paul S. Meltzer, Stanley F. Nelson, Andrew B. Singleton, Ming K. Lee, Judith L. Rapoport, Mary-Claire King, and Jonathan Sebat. Rare struc- tural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia. Science, 320(5875):539–543, 2008. 18

[40] Peilin Jia, Lily Wang, Ayman H Fanous, Carlos N Pato, Todd L Ed- wards, Zhongming Zhao, International Schizophrenia Consortium, et al. Network-assisted investigation of combined causal signals from genome- wide association studies in schizophrenia. PLoS computational biology, 8(7):e1002587, 2012. 18

[41] Oron Vanunu, Oded Magger, Eytan Ruppin, Tomer Shlomi, and Roded Sharan. Associating genes and protein complexes with disease via net- work propagation. PLoS computational biology, 6(1):e1000641, 2010. 18

[42] Albert-László Barabási, Natali Gulbahce, and Joseph Loscalzo. Network medicine: a network-based approach to human disease. Nature Reviews Genetics, 12(1):56–68, 2011. 18

[43] Paul R Burton, David G Clayton, Lon R Cardon, Nick Craddock, Panos Deloukas, Audrey Duncanson, Dominic P Kwiatkowski, Mark I Mc- Carthy, Willem H Ouwehand, Nilesh J Samani, et al. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145):661–678, 2007. 21

[44] Damian Szklarczyk, Andrea Franceschini, Michael Kuhn, Milan Si- monovic, Alexander Roth, Pablo Minguez, Tobias Doerks, Manuel Stark, Jean Muller, Peer Bork, et al. The string database in 2011: func- tional interaction networks of proteins, globally integrated and scored. Nucleic acids research, 39(suppl 1):D561–D568, 2011. 21

[45] Insuk Lee, U Martin Blom, Peggy I Wang, Jung Eun Shim, and Ed- ward M Marcotte. Prioritizing candidate disease genes by network- based boosting of genome-wide association data. Genome research, 21(7):1109–1121, 2011. 21

[46] BA McKinney and Nicholas M Pajewski. Six degrees of epistasis: sta- tistical network models for gwas. Frontiers in genetics, 2, 2011. 22 112 BIBLIOGRAPHY

[47] Martin Kircher, Daniela M Witten, Preti Jain, Brian J O’Roak, Gre- gory M Cooper, and Jay Shendure. A general framework for estimating the relative pathogenicity of human genetic variants. Nature genetics, 46(3):310–315, 2014. 23

[48] Sharon R Browning. Estimation of pairwise identity by descent from dense genetic marker data in a population sample of haplotypes. Ge- netics, 178(4):2123–2132, 2008. 24

[49] Jesse M Rodriguez, Sivan Bercovici, Lin Huang, Roy Frostig, and Ser- afim Batzoglou. Parente2: A fast and accurate method for detecting identity by descent. Genome research, pages gr–173641, 2014. 24, 34

[50] Shaun Purcell, Benjamin Neale, Kathe Todd-Brown, Lori Thomas, Manuel AR Ferreira, David Bender, Julian Maller, Pamela Sklar, Paul IW De Bakker, Mark J Daly, et al. Plink: a tool set for whole- genome association and population-based linkage analyses. The Amer- ican Journal of Human Genetics, 81(3):559–575, 2007. 24

[51] Alexander Gusev, Jennifer K Lowe, Markus Stoffel, Mark J Daly, David Altshuler, Jan L Breslow, Jeffrey M Friedman, and Itsik Pe’er. Whole population, genome-wide mapping of hidden relatedness. Genome re- search, 19(2):318–326, 2009. 24, 27

[52] Brian L Browning and Sharon R Browning. A fast, powerful method for detecting identity by descent. American journal of human genetics, 88(2):173–82, February 2011. 24, 27

[53] Brian L Browning and Sharon R Browning. Improving the accuracy and efficiency of identity-by-descent detection in population data. Genetics, 194(2):459–471, 2013. 24, 27

[54] Pier Francesco Palamara, Todd Lencz, Ariel Darvasi, and Itsik Pe’er. Length distributions of identity by descent reveal fine-scale demographic history. The American Journal of Human Genetics, 91(5):809–822, 2012. 24

[55] Sharon R Browning and Brian L Browning. Identity-by-descent-based heritability analysis in the northern finland birth cohort. Human genet- ics, 132(2):129–138, 2013. 24

[56] Trevor J Pemberton, Chaolong Wang, Jun Z Li, and Noah A Rosen- berg. Inference of unexpected genetic relatedness among individuals in hapmap phase iii. The American Journal of Human Genetics, 87(4):457– 464, 2010. 24 BIBLIOGRAPHY 113

[57] Sharon R Browning and Brian L Browning. High-resolution detection of identity by descent in unrelated individuals. The American Journal of Human Genetics, 86(4):526–539, 2010. 25

[58] Sharon R Browning and Elizabeth A Thompson. Detecting rare vari- ant associations by identity-by-descent mapping in case-control studies. Genetics, 190(4):1521–1531, 2012. 25, 28

[59] Ida Moltke, Anders Albrechtsen, Thomas vO Hansen, Finn C Nielsen, and Rasmus Nielsen. A method for detecting ibd regions simultaneously in multiple individuals—with applications to disease genetics. Genome research, 21(7):1168–1180, 2011. 25, 27

[60] Alexander Gusev, Eimear E Kenny, Jennifer K Lowe, Jaqueline Salit, Richa Saxena, Sekar Kathiresan, David M Altshuler, Jeffrey M Fried- man, Jan L Breslow, and Itsik Pe’er. Dash: a method for identical-by- descent haplotype mapping uncovers association with recent variation. The American Journal of Human Genetics, 88(6):706–717, 2011. 25, 27

[61] Dan He. Ibd-groupon: an efficient method for detecting group-wise identity-by-descent regions simultaneously in multiple individuals based on pairwise ibd relationships. Bioinformatics, 29(13):i162–i170, 2013. 25, 27

[62] Peter Ralph and Graham Coop. The geography of recent genetic ances- try across europe. PLoS biology, 11(5):e1001555, 2013. 25

[63] Hugues Aschard, Jinbo Chen, Marilyn C Cornelis, Lori B Chibnik, Eliz- abeth W Karlson, and Peter Kraft. Inclusion of gene-gene and gene- environment interactions unlikely to dramatically improve risk predic- tion for complex diseases. The American Journal of Human Genetics, 90(6):962–972, 2012. 26

[64] Evan E Eichler, Jonathan Flint, Greg Gibson, Augustine Kong, Suzanne M Leal, Jason H Moore, and Joseph H Nadeau. Missing heri- tability and strategies for finding the underlying causes of complex dis- ease. Nature Reviews Genetics, 11(6):446–450, 2010. 26

[65] Linda Koch. Epigenetics: An epigenetic twist on the missing heritability of complex traits. Nature Reviews Genetics, 2014.

[66] William Schierding, Wayne S Cutfield, and Justin M O’Sullivan. The missing story behind genome wide association studies: single nucleotide polymorphisms in gene deserts have a story to tell. Frontiers in genetics, 5, 2014. 26 114 BIBLIOGRAPHY

[67] Or Zuk, Eliana Hechter, Shamil R Sunyaev, and Eric S Lander. The mys- tery of missing heritability: Genetic interactions create phantom heri- tability. Proceedings of the National Academy of Sciences, 109(4):1193– 1198, 2012. 26 [68] Teri A Manolio, Francis S Collins, Nancy J Cox, David B Goldstein, Lucia A Hindorff, David J Hunter, Mark I McCarthy, Erin M Ramos, Lon R Cardon, Aravinda Chakravarti, et al. Finding the missing heri- tability of complex diseases. Nature, 461(7265):747–753, 2009. 26 [69] Or Zuk, Stephen F Schaffner, Kaitlin Samocha, Ron Do, Eliana Hechter, Sekar Kathiresan, Mark J Daly, Benjamin M Neale, Shamil R Sunyaev, and Eric S Lander. Searching for missing heritability: Designing rare variant association studies. Proceedings of the National Academy of Sciences, 111(4):E455–E464, 2014. 26 [70] Bingshan Li and Suzanne M Leal. Discovery of rare variants via se- quencing: implications for the design of complex trait association stud- ies. PLoS genetics, 5(5):e1000481, 2009. 27 [71] Michael C Wu, Seunggeun Lee, Tianxi Cai, Yun Li, Michael Boehnke, and Xihong Lin. Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics, 89(1):82–93, 2011. 27, 33 [72] Alun Thomas, Nicola J Camp, James M Farnham, Kristina Allen-Brady, and Lisa A Cannon-Albright. Shared genomic segment analysis. map- ping disease predisposition genes in extended pedigrees using snp geno- type assays. Annals of human genetics, 72(2):279–287, 2008. 27 [73] Michael L Fredman and Robert Endre Tarjan. Fibonacci heaps and their uses in improved network optimization algorithms. Journal of the ACM (JACM), 34(3):596–615, 1987. 30 [74] Xihong Lin. Variance component testing in generalised linear models with random effects. Biometrika, 84(2):309–326, 1997. 32 [75] Chiara Sabatti, Susan K Service, Anna-Liisa Hartikainen, Anneli Pouta, Samuli Ripatti, Jae Brodsky, Chris G Jones, Noah A Zaitlen, Teppo Varilo, Marika Kaakinen, et al. Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nature genetics, 41(1):35–46, 2008. 33 [76] Tanya M Teslovich, Kiran Musunuru, Albert V Smith, Andrew C Ed- mondson, Ioannis M Stylianou, Masahiro Koseki, James P Pirruccello, Samuli Ripatti, Daniel I Chasman, Cristen J Willer, et al. Biological, clinical and population relevance of 95 loci for blood lipids. Nature, 466(7307):707–713, 2010. 33 BIBLIOGRAPHY 115

[77] Lin Huang, Sivan Bercovici, Jesse M Rodriguez, and Serafim - zoglou. An effective filter for ibd detection in large data sets. PloS one, 9(3):e92713, 2014. 34

[78] Danny S Park, Yael Baran, Farhad Hormozdiari, and Noah Zaitlen. Pigs: Improved estimates of identity-by-descent probabilities by probabilistic ibd graph sampling. Bioinformatics Research and Applications, page 400. 34

[79] Helga Westerlind, Kerstin Imrell, Ryan Ramanujam, Kjell-Morten Myhr, Elisabeth Gulowsen Celius, Hanne F Harbo, Annette Bang Otu- rai, Anders Hamsten, Lars Alfredsson, Tomas Olsson, et al. Identity-by- descent mapping in a scandinavian multiple sclerosis cohort. European Journal of Human Genetics, 2014. 34

[80] Joshua N Sampson, Bill Wheeler, Peng Li, Jianxin Shi, et al. Leveraging local identity-by-descent increases the power of case/control gwas with related individuals. The Annals of Applied Statistics, 8(2):974–998, 2014. 34

[81] P Marjoram, A Zubair, and SV Nuzhdin. Post-gwas: where next? more samples, more snps or more biology&quest. Heredity, 112(1):79– 88, 2013. 35

[82] M.A. Batzer and P.L. Deininger. Alu repeats and human genomic di- versity. Nature Reviews Genetics, 3(5):370–379, 2002. 37, 53, 57

[83] Vladimir Kapitonov and Jerzy Jurkal. The age of alu subfamilies. Jour- nal of molecular evolution, 42(1):59–65, 1996. 37

[84] Elisabetta Ullu and Christian Tschudi. Alu sequences are processed 7sl rna genes. Nature, 312(5990):171–172, November 1984. 37

[85] Jianxin Wang, Lei Song, M Katherine Gonder, Sami Azrak, David A. Ray, Mark A. Batzer, Sarah A. Tishkoff, and Ping Liang. Whole genome computational comparative genomics: A fruitful approach for ascertain- ing alu insertion polymorphisms. Gene, 365:11–20, 2006. 38

[86] Rotem Sorek, Gil Ast, and Dan Graur. Alu-containing exons are alter- natively spliced. Genome research, 12(7):1060–1067, 2002. 38

[87] Fereydoun Hormozdiari, Miriam K Konkel, Javier Prado-Martinez, Giorgia Chiatante, Irene Hernando Herraez, Jerilyn A Walker, Ben- jamin Nelson, Can Alkan, Peter H Sudmant, John Huddleston, et al. Rates and patterns of great ape retrotransposition. Proceedings of the National Academy of Sciences, 110(33):13457–13462, 2013. 38 116 BIBLIOGRAPHY

[88] Prescott L Deininger and Mark A Batzer. Alu repeats and human dis- ease. Molecular genetics and metabolism, 67(3):183–193, 1999. 38

[89] A.-H. Salem et al. Alu elements and homonid phylogenetics. Proceedings of the National Academy of Sciences, 100:12787–91, 2003. 38

[90] Wensheng Zhang, Andrea Edwards, Wei Fan, Prescott Deininger, and Kun Zhang. Alu distribution and mutation types of cancer genes. BMC genomics, 12(1):157, January 2011. 38

[91] Alexandre de Andrade, Min Wang, Maria F Bonaldo, Hehuang Xie, and Marcelo B Soares. Genetic and epigenetic variations contributed by Alu retrotransposition. BMC genomics, 12(1):617, January 2011. 38

[92] Prescott Deininger. Alu elements: know the sines. Genome Biol, 12(12):236, 2011. 38, 46

[93] Darren J Burgess. Population genetics: Mobile elements across human populations. Nature reviews. Genetics, 14(6):370, June 2013. 38

[94] Matei David, Harun Mustafa, and Michael Brudno. Detecting Alu in- sertions from high-throughput sequencing data. Nucleic acids research, 41(17):e169, 2013. 38, 42, 44, 51, 53

[95] Fereydoun Hormozdiari, Iman Hajirasouliha, Phuong Dao, Faraz Hach, Deniz Yorukoglu, Can Alkan, Evan E Eichler, and S Cenk Sahinalp. Next-generation VariationHunter: combinatorial algorithms for transpo- son insertion discovery. Bioinformatics (Oxford, England), 26(12):i350– 7, June 2010. 38, 51

[96] Thomas M Keane, Kim Wong, and David J Adams. RetroSeq: trans- posable element discovery from next-generation sequencing data. Bioin- formatics (Oxford, England), 29(3):389–90, February 2013. 38, 42, 44, 53

[97] Jón Ingi Sveinbjörnsson and Bjarni V Halldórsson. PAIR: polymorphic Alu insertion recognition. BMC bioinformatics, 13 Suppl 6(Suppl 6):S7, January 2012. 39, 45

[98] Andreas Döring, David Weese, Tobias Rausch, and Knut Reinert. SeqAn an efficient, generic C++ library for sequence analysis. BMC Bioinfor- matics, 9:11, 2008. 39, 49

[99] M. S. Waterman T. F. Smith. Identification of common molecular sub- sequences. Journal of molecular biology, 147(1):195–97, March 1981. 42 BIBLIOGRAPHY 117

[100] F. Hormozdiari, I. Hajirasouliha, P. Dao, F. Hach, D. Yorukoglu, C. Alkan, E.E. Eichler, and S.C. Sahinhalp. Next-generation Variation- Hunter: Combinatorial algorithms for transposon insertion discovery. Bioinformatics, 26:i350–7, 2010. 44, 45

[101] Yannakakis M Papadimitriou C. Optimization, approximation, and complexity classes. Journal of Computer and System Sciences, 43(3):425–440, 1991. 45

[102] Eric M Ostertag and Haig H Kazazian Jr. Biology of mammalian l1 retrotransposons. Annu Rev Genet, 35:501–538, 2001. 48

[103] Tobias Rausch, Sergey Koren, Gennady Denisov, David Weese, Anne- Katrin Emde, Andreas Döring, and Knut Reinert. A consistency-based consensus algorithm for de novo and reference-guided sequence assembly of short reads. Bioinformatics, 25(9):1118–1124, 2009. 49

[104] John Kececioglu. The maximum weight trace problem in multiple sequence alignment. In Alberto Apostolico, Maxime Crochemore, Zvi Galil, and Udi Manber, editors, Combinatorial pattern matching, volume 684 of Lecture Notes in Computer Science, pages 106–119, Berlin/Heidelberg, 1993. Springer-Verlag. 49

[105] O. Gotoh. Consistency of optimal sequence alignments. Bull Math Biol, 52(4):509–525, 1990. 49

[106] Cedrik Magis, Jean-François Taly, Giovanni Bussotti, Jia-Ming Chang, Paolo Di Tommaso, Ionas Erb, José Espinosa-Carrasco, and Cedric Notredame. T-Coffee: Tree-based consistency objective function for alignment evaluation. Methods Mol Biol, 1079:117–129, 2014.

[107] Tobias Rausch, Anne-Katrin Emde, David Weese, Andreas Döring, Cedric Notredame, and Knut Reinert. Segment-based multiple sequence alignment. Bioinformatics, 24(16):i187–i192, 2008. 49

[108] D. F. Feng and R. F. Doolittle. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol, 25(4):351–360, 1987. 49

[109] RR Sokal and CD Michener. A statistical method for evaluating sys- tematic relationships. Univ Kansas Sci Bull, 38(22):l409–l438, 1958. 49

[110] E. L. Anson and E. W. Myers. ReAligner: a program for refining DNA sequence multi-alignments. J Comput Biol, 4(3):369–383, 1997. 49

[111] Manuel Holtgrewe. Mason–a read simulator for second generation se- quencing data. Technical report, FU Berlin, 2010. 50 118 BIBLIOGRAPHY

[112] Heng Li and Richard Durbin. Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics, 25:1754–60, 2009. 51

[113] Chip Stewart, Deniz Kural, Michael P Strömberg, Jerilyn a Walker, Miriam K Konkel, Adrian M Stütz, Alexander E Urban, Fabian Gru- bert, Hugo Y K Lam, Wan-Ping Lee, Michele Busby, Amit R Indap, Erik Garrison, Chad Huff, Jinchuan Xing, Michael P Snyder, Lynn B Jorde, Mark a Batzer, Jan O Korbel, and Gabor T Marth. A com- prehensive map of mobile element insertion polymorphisms in humans. PLoS genetics, 7(8):e1002236, August 2011. 53, 54

[114] Jonathan Butler, Iain MacCallum, Michael Kleber, Ilya A Shlyakhter, Matthew K Belmonte, Eric S Lander, Chad Nusbaum, and David B Jaffe. Allpaths: de novo assembly of whole-genome shotgun microreads. Genome research, 18(5):810–820, 2008. 56

[115] Monya Baker. De novo genome assembly: what every biologist should know. Nature methods, 9(4):333, 2012. 56

[116] David J Witherspoon, Yuhua Zhang, Jinchuan Xing, W Scott Watkins, Hongseok Ha, Mark A Batzer, and Lynn B Jorde. Mobile element scan- ning (me-scan) identifies thousands of novel alu insertions in diverse human populations. Genome research, 23(7):1170–1181, 2013. 57

[117] Unnur Styrkarsdottir, Bjarni V. Halldórsson, Solveig Gretarsdottir, Daniel F. Gudbjartsson, G. Bragi Walters, et al. Multiple Genetic Loci for Bone Mineral Density and Fractures. New England Journal of Medicine, 358(22):2355–2365, 2008. 57

[118] Daniel F. Gudbjartsson, G. Bragi Walters, Gudmar Thorleifsson, Hreinn Stefansson, Bjarni V. Halldorsson, et al. Many sequence variants affect- ing diversity of adult human height. Nat Genet, 40(5):609–615, May 2008.

[119] F. Rivadeneira, U. Styrkarsdottir, K. Estrada, B. Halldorsson, et al. Twenty loci associated with bone mineral density identified by large- scale meta-analysis of genome-wide association datasets. Nature Genet- ics, 41:1199–206, Nov 2009. 57