Population-haplotype models for mapping and tagging structural variation using whole genome sequencing

Eleni Loizidou

Submitted in part fulfilment of the requirements for the degree of Doctor of Philosophy

Section of Genomics of Common Disease Department of Medicine Imperial College London, 2018


Declaration of originality

I hereby declare that the thesis submitted for a Doctor of Philosophy degree is based on my own work. Proper referencing is given to the organisations/cohorts I collaborated with during the project.


Copyright Declaration

The copyright of this thesis rests with the author and is made available under a Creative Commons Attribution Non-Commercial No Derivatives licence. Researchers are free to copy, distribute or transmit the thesis on the condition that they attribute it, that they do not use it for commercial purposes and that they do not alter, transform or build upon it. For any reuse or redistribution, researchers must make clear to others the licence terms of this work.


Abstract

The scientific interest in copy number variation (CNV) is rapidly increasing, mainly due to evidence of its phenotypic effects and its contribution to disease susceptibility. Single nucleotide polymorphisms (SNPs), which are abundant in the genome, have been widely investigated in genome-wide association studies (GWAS). Despite the notable genomic effects of both CNVs and SNPs, the correlation between them has been relatively understudied. In the past decade, next generation sequencing (NGS) has been the leading high-throughput technology for investigating CNVs and offers mapping at high resolution. We created a map of NGS-defined CNVs tagged by SNPs using the 1000 Genomes Project phase 3 (1000G) sequencing data to examine patterns between the two types of variation in protein-coding genes. To investigate potential relationships between CNV-tagging SNPs and various phenotypes, we used SNPs reported for disease/phenotype associations in the GWAS catalog. Moreover, we applied our method to DIAGRAM consortium and Northern Finland Birth Cohort (NFBC) data. Our analysis replicated existing CNV-tagging SNPs but also revealed novel relationships between them in almost all the datasets we analysed. We have developed a statistical framework, under a population perspective, for fast and accurate CNV detection. Using 202 drug-target genes defined in collaboration with GlaxoSmithKline (GSK), we applied our framework to the 1000G data. We calculated summary statistics based on the detected CNV calls, including the allele frequency (AF) for each of the 26 populations of the 1000G. In addition, we visualised our results using UCSC genome browser tracks for all 202 regions and successfully benchmarked our CNV calls by comparing them to a gold standard set of the 1000G CNVs. Overall, in this thesis we present detailed maps of CNVs and CNV-tagging SNPs to enhance existing knowledge of their impact on the human genome.


To my parents, my everything


Acknowledgments

I would like to express my deepest gratitude to my supervisor, Dr. Inga Prokopenko, for her valuable advice, guidance and patience throughout my PhD journey. Her perfectionism and level of sophistication and experience were always a source of inspiration to me. Above all, I need to say thank you for teaching me how to be a true scientist and how to grow academically and scientifically.

I am also deeply grateful to my co-supervisor Dr. Evangelos Bellos whose support and academic brilliance paved the way and made this long journey look shorter. His encouragement and calm reaction to every situation have been my source of power and strength. A thank you is not enough to express my appreciation for the faith you showed in me. I would also like to express my gratitude to my co-supervisors Prof. Michael Johnson and Dr. Lachlan J. M. Coin for their valuable advice and for accepting me in their research groups. Last but not least, I would like to truly thank Dr. Leonardo Bottolo. The person who initially believed in my abilities and is indirectly responsible for me being where I am today.

My research project would never have been initiated and completed without the support of the Medical Research Council (MRC) and GlaxoSmithKline (GSK). I would therefore like to express my strongest appreciation to them both.

I owe special thanks to the Cyprus Institute of Neurology and Genetics and specifically to Prof. Kyproula Christodoulou and Dr. George Spyrou for generously hosting me at the Bioinformatics department in Cyprus for the last year of my PhD. Thank you for giving me the chance to attend the department’s conferences as a speaker and for treating me as part of your team.

This long adventure seemed shorter with the support of my colleagues, whom I am lucky to now call friends. Special thanks to Dr. Sadia Saeed, Dr. Marika Kaakinen, Dr. Amna Khamis, Charalambos Kkoufou, Mila Anasanti, Dr. Hutokshi Crouch, Abdullah Abdulshakur and Jani Heikkinen for the unforgettable moments, outings and unstoppable laughter. I would also like to thank Patricia Murphy for her assistance with several administrative issues since the first day of my PhD and for her support throughout the years. I am mostly thankful for being a member of an international department which provided me with the opportunity to meet people from different cultures and mentalities.

Finally, since this thesis is the culmination of my PhD journey, I would like to say the biggest and warmest thank you to my family. My favourite people, who have always been my driving force, my inspiration, my inner power. My sisters Antigoni and Marina, who are a gift from God to me and whose love and support are always unconditional. Thank you for constantly being by my side to encourage me, even during periods of worry and frustration. My husband George, the man I am now sharing my life with and the person who proves every single day that true love exists. His patience during the three years we were living in different places so I could fulfil my dreams has given me the strength to move on. Even though a few words and a thank you will never be enough, I will try to express my deepest gratitude to the two people to whom I owe everything in life. My parents, who provided me with the greatest values. Just by being themselves, they taught me that the biggest achievement is to first be a good person and always treat others in the way you would like to be treated. They never stopped believing in me even when I doubted myself and supported my decisions no matter what. Through their actions, they proved that even with the greatest achievements, the important thing is to remain modest and keep working with passion and self-respect for the best outcome. Their passion for work and their love for humanity were the reasons I gained my scientific curiosity. Thank you for loving me unconditionally and giving me the “supplies” to be the person I am today and to have a successful future.


Table of Contents

Abstract
Acknowledgments
Table of Contents
List of figures
List of tables
List of Abbreviations
Chapter 1
Introduction
  1.1. Human genome
    1.1.1 Human genome variation – Single Nucleotide Polymorphisms (SNPs) and Structural Variation (SV)
    1.1.2. CNV description
  1.2. Sequencing the human genome
    1.2.1. Uncovering CNVs
    1.2.2. 1000 Genomes Sequencing Project (1000G)
    1.2.3. Data generated by 1000 Genomes Project
    1.2.4. Next generation DNA sequencing methods
    1.2.5. Whole-genome and whole-exome sequencing
    1.2.6. Investigating the role of CNVs through sequencing druggable genome targets
  1.3. CNV calling in SNP datasets
    1.3.1 Role of CNVs in variability of human phenotypes from genome wide association studies (GWAS)
    1.3.2. cnvHap: an integrative population and haplotype-based multiplatform model of SNPs and CNVs
    1.3.3. Advantages of cnvHap
  1.4. CNV detection in next generation sequencing data
    1.4.1 Sequencing data integration
    1.4.2 Definition of haplotypes
    1.4.3 The growing importance of CNVs: new insights for detection and clinical interpretation
    1.4.4. Mapping CNVs by population-scale genome sequencing
  1.5. CNVs and human disease
    1.5.1. Role of CNVs in human disease
    1.5.2. CNV maps
  1.6. Aims and structure of thesis
Chapter 2
Extraction of CNV-tagging SNPs using the 1000G sequencing data
  2.1. Materials and methods
    2.1.1 1000 Genomes project data
    2.1.2. CNV and SNP extraction using the 1000G data
    2.1.3. Evaluation of CNV-SNP LD in sequencing data
  2.2. CNV and CNV-tagging SNPs’ maps
Chapter 3
CNV-tagging SNPs’ link to human disease/traits
  3.1. Disease associations
    3.1.1. GWAS catalog
    3.1.2. DIAbetes Genetics Replication And Meta-analysis (DIAGRAM)
    3.1.3. Northern Finland Birth Cohort (NFBC)
  3.2. Summary of findings
Chapter 4
Description of the statistical framework for CNV detection
  4.1. Read depth
    4.1.1. Pre-processing
    4.1.2. Statistical modelling
    4.1.3. RD-based methods
  4.2. Hidden Markov Model
    4.2.1. Description
    4.2.2. popCNV under a population framework
Chapter 5
Construction of CNV map in drug-target genes using popCNV
  5.1. Samples and datasets
    5.1.1. 202 GSK defined genes
    5.1.2. Definition of region
  5.2. popCNV implementation
    5.2.1. CNV detection
    5.2.2. CNV visualization
  5.3. Benchmarking of popCNV
  5.4. Summary
Chapter 6
Conclusion
  6.1. Discussion
  6.2. Future directions
    6.2.1. Functional analysis of JAZF1 and TSPAN8 CNVs
    6.2.2. Testing CNV associations with disease
    6.2.3. Contribution of CNVs to pharmacogenetics
    6.2.4. Role of geographical clustering in CNVs
    6.2.5. Potential use of popCNV by external Consortia
    6.2.6. popCNV extension
References
Appendix


List of Figures

Figure 1.1: Visualisation of CNVs in the human genome
Figure 1.2: Visualization of read depth alignment signatures
Figure 1.3: Visualization of read pair alignment signatures
Figure 1.4: Visualization of split read alignment signatures
Figure 2.1: Plot of AF(CNV) vs LD max in protein-coding genes
Figure 3.1: Visualisation of current GWAS catalog
Figure 3.2: Geographic location of the NFBC1966. Dark grey represents the targeted area of the cohort
Figure 3.3: Visualization of FADS region using UCSC browser, showing the full spectrum of CNVs in this specific region. Red represents the deletions and blue the duplications. Each line shows the CNVs for each of the 26 populations of the 1000G.
Figure 4.1: Illustration of read depth created by CNV
Figure 4.2: Visualization of our statistical model
Figure 5.1: Gene region definition and visualization
Figure 5.2: popCNV implementation and processing
Figure 5.3: Read depth signature on chromosome 1
Figure 5.4: Read depth signature for HG00351 Finnish sample in IL6 gene region
Figure 5.5: UCSC genome browser plot illustrating the CNV presence at 16p12.3
Figure 5.6: UCSC genome browser plot optimising CNV calls in the JAZF1 gene for 26 populations. Red colour represents the deletions and blue colour the duplications. Each line shows CNVs for the population specified above the line.
Figure 5.7: UCSC genome browser plots showing the CNV spectrum across each European population in JAZF1 gene

List of Tables

Table 2.1: Transformation of CNV and SNP genotypes
Table 2.2: Populations of the 1000 Genomes Project phase 3 release data
Table 3.1: Identified CNV-tagging SNPs in various phenotypes
Table 3.2: Established CNV-tagging SNPs and disease/trait associations
Table 3.3: CNV-tagging SNPs (CNVs < 1kb) and disease/trait associations
Table 3.4: CNV-tagging SNPs (CNVs 1kb – 10kb) and disease/trait associations
Table 3.5: CNV-tagging SNPs (CNVs > 10kb) and disease/trait associations
Table 3.6: CNVs associated with Type-2 diabetes
Table 3.7: Allele frequencies for T2D-associated CNVs in 1000G continental groups
Table 3.8: Allele frequencies for T2D-associated CNVs in the European populations
Table 5.1: popCNV sample results from CNV analysis of chromosome 7 encompassing IL6 gene (chr7:22771285-22813003)
Table 5.2: CNVs identified within the IL6 gene region (chr7:22771285-22813003)
Table 5.3: Average CNV allele frequencies per population in IL6 gene region
Table 5.4: CNVs identified within the JAZF1 gene region
Table 5.5: Average CNV allele frequencies per population in JAZF1 gene region

List of Abbreviations

CNV      Copy number variation
SNP      Single Nucleotide Polymorphism
GWAS     Genome-wide association study(ies)
DIAGRAM  DIAbetes Genetics Replication And Meta-analysis
NGS      Next Generation Sequencing
HTS      High-throughput sequencing
SV       Structural Variation
GSK      GlaxoSmithKline
AF       Allele Frequency
UCSC     University of California Santa Cruz
MAF      Minor Allele Frequency
BP       Base pairs
MB       Mega base pairs
NCBI     National Center for Biotechnology Information
dbVar    Database of genomic variants
FISH     Fluorescence in situ hybridization
CGH      Comparative genomic hybridization
WGS      Whole-genome sequencing
WES      Whole-exome sequencing
EBI      European Bioinformatics Institute
ADME     Absorption Distribution Metabolism Excretion
NS       Non-synonymous
HMM      Hidden Markov Model
RD       Read Depth
RP       Read Pair
SR       Split Read
PEM      Paired-end mapping
LOESS    Local regression
CN       Copy number
FDR      False Discovery Rate
GTEx     Genotype-Tissue Expression
VEP      Variant Effect Predictor
HGSV     Human Genome Structural Variation
SVA      SINE-VNTR-Alus
ALU      Arthrobacter luteus
VCF      Variant call format
T2D      Type 2 diabetes
NFBC     Northern Finland Birth Cohort
FA       Fatty acids
EWT      Event-wise testing
BWA      Burrows-Wheeler Aligner

Chapter 1

Introduction

1.1. Human genome

The human genome contains the genetic material that offspring inherit from their parents and therefore defines the inherited phenotypic characteristics. It consists of 23 pairs of chromosomes, of which 22 are autosomal, whereas the 23rd pair comprises the sex chromosomes. The human genome sequence was not deciphered until a large international effort of scientists across the globe completed the 13-year-long Human Genome Project in 2003. This project estimated that the human genome contains 30,000 – 35,000 genes (https://www.genome.gov/12513430/2004-release-ihgsc-describes-finished-human-sequence/) and enabled the identification of the sequence of the ~3 billion DNA base pairs [1]. That was an early estimate; current evidence suggests that there are ~19,000 human genes [2]. Changes in the sequence or the structure of the human genome are the main sources of human genomic variation. These include single-nucleotide and multi-nucleotide variants, short deletions as well as large CNVs [3].

1.1.1 Human genome variation – Single Nucleotide Polymorphisms (SNPs) and Structural Variation (SV)

A Single Nucleotide Polymorphism (SNP) (Appendix-Figure 1) is a variation at a single position, i.e. a single nucleotide, in the DNA sequence of an individual (https://www.nature.com/scitable/definition/single-nucleotide-polymorphism-snp-295). The DNA sequence is a chain built from four nucleotide bases: A (adenine), T (thymine), C (cytosine) and G (guanine). When the nucleotide at a certain position differs in more than 1% of a population, the variation is defined as a SNP, although with the advent of large-scale studies in the past decade, rare SNPs (minor allele frequency, MAF < 1%) and other variants are also being annotated and studied [4]. Approximately 7 million of the ~11 million SNPs estimated to occur in the human genome have MAF > 5%, meaning that they are common (International HapMap Project); the 2012 release of 1000 Genomes Project (1000G) data annotated ~38 million SNPs [5].

Structural Variation (SV), on the other hand, affects larger regions of the genome and can be either unbalanced (Copy Number Variations, CNVs) or balanced. Even though SNPs are considered the most common type of genetic variation in human DNA, SV has been shown to have a larger effect on the genome than SNPs [6]. Examples of balanced SVs are inversions and balanced translocations. Balanced SVs often overlap with segmental duplications or repetitive regions of DNA. Both types of SV emerge through a break in the DNA phosphodiester backbone [7] (https://www.ncbi.nlm.nih.gov/dbvar).

1.1.2. CNV description

CNVs can be defined as DNA segments which are either lost (deletion) or gained (duplication and insertion) and therefore represent an imbalance between two genomes [7]. They are the most frequent class of SVs and range in size from 50 base pairs (bp) to several megabases (Mb). CNVs are thought to be responsible for disease susceptibility and genetic diversity among samples and populations to a larger extent than SNPs (http://www.annualreviews.org/doi/pdf/10.1146/annurev-med-100708-204735#article-denial). They are split into two groups, common and rare CNVs, defined by their population frequency. Common CNVs appear regularly in a population (MAF > 5%), whereas rare CNVs are infrequent within or between populations (MAF < 1%).
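The frequency thresholds above can be written down as a small helper. This is only a minimal sketch: the function name is hypothetical, and the "low-frequency" label for the 1% to 5% band is an assumption, since the paragraph itself names only the common and rare groups.

```python
def classify_by_frequency(maf: float) -> str:
    """Label a variant by its minor allele frequency (MAF).

    Thresholds follow the text: common if MAF > 5%, rare if MAF < 1%.
    The intermediate "low-frequency" label is an assumption of this sketch.
    """
    if not 0.0 <= maf <= 0.5:
        # A minor allele cannot, by definition, exceed a frequency of 50%.
        raise ValueError("MAF must lie between 0 and 0.5")
    if maf > 0.05:
        return "common"
    if maf < 0.01:
        return "rare"
    return "low-frequency"
```

For instance, a CNV with MAF = 0.2 would be labelled common, and one with MAF = 0.005 rare.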

CNVs at any locus of the human genome disturb the biological balance of the diploid state. Approximately 90% of CNVs have two allelic states. Relative to the National Center for Biotechnology Information (NCBI) human reference sequence, biallelic CNVs are defined as deletions when the alternate allele has fewer copies of the variable sequence than the reference, and as duplications when the alternate allele has more copies than the reference. The remaining 10% of loci have copy number states which do not fit a two-allele system [8]. SV, including CNV, accounts for a substantial proportion of human genetic variation. SVs included in the Database of Genomic Variants (http://projects.tcag.ca/variation/) have been reported in 29.7% of the human genome [9], and CNVs have been estimated to account for approximately 18% of detectable effects of genetic variation on gene expression [10].
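The biallelic definition above amounts to comparing the copy count of the alternate allele with that of the reference allele. A minimal sketch follows; the function name and the "reference" label for equal copy counts are illustrative assumptions, not part of any tool described in this thesis.

```python
def classify_cnv_allele(ref_copies: int, alt_copies: int) -> str:
    """Classify a biallelic CNV allele relative to the reference sequence.

    ref_copies: copies of the variable sequence on the reference allele.
    alt_copies: copies of the variable sequence on the alternate allele.
    """
    if ref_copies < 0 or alt_copies < 0:
        raise ValueError("copy counts cannot be negative")
    if alt_copies < ref_copies:
        return "deletion"      # alternate allele carries fewer copies
    if alt_copies > ref_copies:
        return "duplication"   # alternate allele carries more copies
    return "reference"         # no copy number difference (assumed label)
```

With one copy on the reference allele, an alternate allele with zero copies is a deletion and one with two copies is a duplication.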


Types of CNVs: There are several types of CNVs that can be found in the human genome. Each of these is explained below, and Figure 1.1 visualises them in the context of the human genome structure.

Deletion: A deletion exists when a piece of DNA sequence is missing, i.e. genetic material is lost. Depending on how many nucleotides they encompass, deletions are divided into “macrodeletions” and “microdeletions”, and in both cases their effects on the phenotype can be severe. For example, when a microdeletion occurs, genes or parts of genes are missing, which affects the gene dosage and, as a result, the phenotype.

Duplication: When a chromosome or part of a chromosome is amplified as a result of DNA replication errors, the result is a duplication. It can be present in autosomes as well as in sex chromosomes. The consequences of a duplication can be severe, leading to diseases such as Down syndrome.

Inversion: “A segment of DNA that is reversed in orientation with respect to the rest of the chromosome” [7]. A disease linked to inversions is haemophilia A [11].

Insertion: An insertion appears when a sequence of one or more nucleotides is added between two adjacent nucleotides in the sequence. A C-insertion polymorphism in the NOD2 gene on chromosome 16, which is considered a rare mutation, is linked to Crohn’s disease [12].

Translocation: When a part of a chromosome is attached to another (non-homologous) chromosome, we have a translocation. It is called a balanced translocation if there is no loss or duplication of genetic material. Usually, a balanced translocation causes no problems for the person who carries it, but the offspring of that person might be at increased risk of an unbalanced translocation. In that case, the offspring inherits the parent’s translocated chromosome but with more or less genetic material, and is therefore affected by the change in the genetic code (http://www.brusselsgenetics.be/genetic-variation) (http://projects.tcag.ca/variation/) [13]. Translocations are found in different types of cancer, such as leukaemia and tumours [14].

Figure 1.1: Visualisation of CNVs in the human genome (a) Representation of a deletion and a duplication (b) Representation of additional types of CNVs (c) Representation of a reciprocal translocation


1.2. Sequencing the human genome

1.2.1. Uncovering CNVs

CNVs were traditionally identified using microscopy-based technologies such as karyotyping and fluorescence in situ hybridization (FISH) [15]. These techniques were followed by comparative genomic hybridization (CGH), which aimed to investigate multiple targets at the same time. To achieve a better resolution, CGH was later merged with microarray technologies and, as a result, array CGH (aCGH) was created. Thus, starting from 2003, CNVs were detected genome-wide using aCGH as well as SNP-based array approaches [16]. However, these techniques were shown to have weaknesses such as hybridization noise, low genome coverage and low resolution, and, most importantly, they struggled to identify novel and rare variants [17, 18]. Also, due to the high cost of aCGH combined with the difficulties in meta-analysing SNP-derived CNV data, CNV association studies were typically conducted in substantially smaller sample sizes than GWAS and so have lacked the statistical power to robustly detect CNVs with small to moderate effect sizes.

Finally, the ground-breaking technology of Next Generation Sequencing (NGS) brought a new era to the field of genomics. Various platforms, such as Illumina and Life Technologies, have since enabled fast CNV detection at high resolution through whole-genome sequencing (WGS). NGS is able to genotype and characterize CNVs by creating millions of short reads in a single experiment [19]. Analysing CNVs using NGS rather than aCGH has several benefits. First, NGS can cover a much larger portion of the genome. It can also generate copy numbers and achieve a high breakpoint resolution, and thus detect novel CNVs. These advantages have led to the development of different CNV identification tools based on NGS features, as well as the generation of various databases which aim to provide scientists with whole-genome sequenced samples.

1.2.2. 1000 Genomes Sequencing Project (1000G)

A ground-breaking international collaboration named the “1000 Genomes Project” was established in 2008. The 1000G Project’s aim was the production of a detailed public catalogue providing a deep understanding of human genome sequence variation, including CNVs and SNPs, along with their haplotype contexts. Using NGS, the team sequenced the genomes of more than 2,500 individuals from multiple human populations [20].

Originally, the goal of the consortium was the collection, sequencing and analysis of at least 1000 samples in 3 years. The project started with a pilot phase in 2008 as a preliminary step to the main project which would follow later. The intention of this pilot phase was the investigation of low-coverage WGS of unrelated individuals, high-coverage targeted sequencing and high-coverage WGS of trios using different NGS technologies [21]. The result of this analysis was the discovery of eight million novel sequence variants. Given this promising start, the project proved that it had great potential to shed light on human genetic variation.

The main project was implemented in late 2009 with the aim of sequencing ~2,500 individuals in 3 phases. Phase 1 included 1,092 individuals from 14 different populations and was completed in mid-2011. It represented the first series of exome and low-coverage WGS. During Phase 2 and Phase 3, ~1,500 individuals from 12 other populations were added, raising the total to 2,504 samples. Both latter phases were based on Illumina sequencing platforms, which produced higher-quality variant calls. The latest 1000G data release to date (Phase 3) commenced in 2014 and largely elucidated genomic variation.

The contribution of the 1000G Project to the scientific community has been invaluable. It has not only aided the interpretation of genetic-association studies, but it has also paved the way to the implementation of novel sequencing methods and tools for analysing WGS data. It is now the largest and the most widely used repository for publicly available human genomic data. Even though CNV calling remains challenging for various reasons, the 1000G’s computational analysis techniques as well as the file formats that it has generated were a valuable resource for researchers.


1.2.3. Data generated by 1000 Genomes Project

Data released by 1000G are in various formats. Raw sequence data, the project’s main output, are released in FASTQ format. Alignments are in SAM/BAM format, whereas variant calls are in VCF format. Variant calls are available in the release directory (EBI|NCBI), alignments are listed in the alignment index (EBI|NCBI) and raw sequence data in the sequence.index (EBI|NCBI). Examples presenting the structure and formatting of the data can be found in the 1000G Tutorial (1000 Genomes).
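To give a feel for the VCF format mentioned above, the sketch below parses a single tab-separated VCF data line and extracts the fields most relevant to structural variants. It relies only on the fixed VCF column order and on the SVTYPE and END INFO keys defined in the VCF specification; the function name is hypothetical and the record in the usage note is invented for illustration, not taken from the 1000G release.

```python
def parse_sv_record(line: str) -> dict:
    """Parse one VCF data line describing a structural variant."""
    fields = line.rstrip("\n").split("\t")
    # Fixed VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
    chrom, pos, var_id, ref, alt = fields[:5]
    info = {}
    for entry in fields[7].split(";"):
        key, _, value = entry.partition("=")
        info[key] = value if value else True  # keys without a value are flags
    start = int(pos)
    end = int(info["END"]) if "END" in info else start
    return {
        "chrom": chrom,
        "id": var_id,
        "start": start,
        "end": end,
        "alt": alt,
        "svtype": info.get("SVTYPE"),
        "length": end - start,
    }
```

Called on an invented line such as "7\t1000\tsv1\tA\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=2500", the function reports a deletion on chromosome 7 spanning positions 1000 to 2500. For real analyses a dedicated VCF library would normally be preferable to hand-written parsing.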

1.2.4. Next generation DNA sequencing methods

At its inception, the 1000G Project acknowledged that it would be crucial to characterize the breadth of human genomic variation in order to fully explore the association between genotype and phenotype. Therefore, they conducted a pilot study where different strategies for genome-wide sequencing with high-throughput platforms were compared [5]. They used:

1) Low-coverage whole-genome sequencing

2) High-coverage sequencing

3) Exon-targeted sequencing

One of the main reasons the 1000G project conducted the above sequencing analyses was to discover variants. More specifically, simple models indicated that, for a fixed total amount of sequencing, the number of variants identified is maximised by sequencing many samples at low coverage [22, 23].
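The trade-off between sample number and coverage can be illustrated with a toy calculation. The sketch below is not the model used in [22, 23]; it assumes that reads carrying the alternate allele at a heterozygous site follow a Poisson distribution with mean equal to half the coverage, that a variant is detected in a carrier chromosome when at least min_alt_reads such reads are seen, and that chromosomes are independent. All function and parameter names are hypothetical.

```python
import math


def detection_probability(coverage: float, min_alt_reads: int = 2) -> float:
    """P(at least min_alt_reads reads support the alternate allele)."""
    lam = coverage / 2.0  # expected reads from one of the two haplotypes
    p_below = sum(
        math.exp(-lam) * lam ** i / math.factorial(i)
        for i in range(min_alt_reads)
    )
    return 1.0 - p_below


def discovery_probability(allele_freq: float, n_samples: int,
                          coverage: float, min_alt_reads: int = 2) -> float:
    """P(variant is seen in at least one of the 2 * n_samples chromosomes)."""
    p_det = detection_probability(coverage, min_alt_reads)
    # Each chromosome carries the variant with probability allele_freq and,
    # if it does, the variant is detected with probability p_det.
    return 1.0 - (1.0 - allele_freq * p_det) ** (2 * n_samples)


# Same total sequencing (400x), split in two different ways.
many_low = discovery_probability(0.01, n_samples=100, coverage=4)
few_high = discovery_probability(0.01, n_samples=10, coverage=40)
```

Under these assumptions, a variant with 1% allele frequency is considerably more likely to be discovered with 100 samples at 4x than with 10 samples at 40x, in line with the argument above.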

1.2.5. Whole-genome and whole-exome sequencing

Whole-genome sequencing

Originally, NGS was explicitly associated with WGS. Through random fragmentation of the DNA before sequencing starts, WGS provides a bias-free and complete overview of genomic variation. Using this technique, WGS offers high resolution and captures variants of all sizes that could previously have been missed. In addition, it produces a large amount of data that are of use to various databases. As a result, high-throughput WGS has become the leading technology for studies investigating human variation, with the 1000G paving the way. Nevertheless, WGS has its pitfalls. Despite the massive decrease in the costs of DNA sequencing during the last few years [24], sequencing the entire human genome is, given its size, not considered cost-effective for population-based analyses.

Targeted sequencing

Soon after the triumphant introduction of NGS, researchers, especially those with a restricted budget, realised that WGS would most likely be financially out of reach. To that end, targeted sequencing was developed through the ability of NGS to sequence fragments of DNA that had been pre-selected. Thus, sequencing of specific target regions (capture sequencing) enabled the identification of genomic variants in the regions of interest and provided a cost-effective solution which helped avoid the pitfalls of WGS. A popular application of targeted sequencing is whole-exome sequencing (WES), which focuses on the analysis of protein-coding regions. Even though capture sequencing was originally used for exploring SNPs, it now enables the analysis and investigation of CNVs as well [25].

1.2.6. Investigating the role of CNVs through sequencing druggable genome targets

The International Human Genome Sequencing Consortium [26] aimed to include 15,000 known genes as well as 17,000 gene predictions in a gene index in order to estimate a corrected gene number of 30,000-35,000 [27]. The newly predicted genes were of great interest to pharmaceutical companies, which sought to evaluate and understand them in depth. Every additional piece of genomic information is important in the sense that each new gene or function is a potential drug target.

Although the genes with unidentified function are potentially interesting, selecting them as targets for new compounds may be expensive: research programs would have to be initiated for extensive assessment of their function and regulation.

Through their pharmacogenetics research and development efforts, all major pharmaceutical companies seek to deliver safe and efficient molecules that would help the largest possible population. Pharmacogenetics researchers have focused on the impact SNPs have on drug response, but it is now of great interest to investigate the role of CNVs. CNVs occur across the whole human genome, including genes that are responsible for both the uptake of drugs into cells and their metabolism [28]. This was recently demonstrated in a study that analysed 152 genes in tissue and blood samples from cancer patients [28]. Cancer tissues were found to harbour an abundance of CNVs when compared with normal samples. Variations identified in a small number of the studied genes generally affect the way cancer drugs are taken up and metabolised by cells.

CNVs can impact ADME (absorption, distribution, metabolism and excretion) and drug-target genes as well as genes in which mutations have a causative role in disease. Examples of such causative genes in Early-Onset Alzheimer’s disease are the amyloid precursor protein (APP) located on chromosome 21, the presenilin 1 (PSEN1) which is located on chromosome 14 and the presenilin 2 (PSEN2) located on chromosome 1 [29]. This leads to the assumption that CNVs affect drug related genes by altering drug metabolism and drug response. Despite that, the role of CNVs in pharmacogenetics remains relatively understudied [30].

Although it is assumed that the contribution of rare and low frequency genetic variants (MAF between 0% and 5%) (Appendix, Box 1) to complex disease risk is large, the abundance of rare variants in human populations is still under investigation [31]. For this reason, 202 drug-target genes were sequenced in 14,002 people to explore the number of rare genetic variants that could be identified [32-34]. These genes were chosen for the analysis because they represent ~1% of the coding genome and ~7% of genes which are already known drug targets or have the potential to become drug targets [35].

A very important aspect of rare variant discovery is that rare variants may be confounded with sequencing errors, so many experiments need to be performed to achieve high data quality. The analysis of sequence data of these 202 drug-target genes showed an abundance of rare SNPs in comparison to common variants. Owing to purifying selection, non-synonymous (NS) variants are abundant at low frequencies, and thus the coding sequence carries a larger number of rare variants.

Rare variants can also be tested for association with disease. The power of such tests is strongly related to the cumulative minor allele frequency (cMAF) of the potentially deleterious SNPs in each gene [36].

Rare variants are expected to be geographically clustered or private to certain populations since they result from recent mutations. Variant sharing depends on geographic distance and decreases as allele frequency decreases. Thus, variant sharing is much lower when comparing populations from different continents. Therefore, catalogs of rare variants should be created locally around the world [37, 38].

Indeed, an abundance of rare variants was found within different human populations, whereas common variants account for only a small part of the genetic diversity [39]. Further investigation is needed, since the 202 drug-target genes could potentially be linked to drug response.

A deep understanding of the genetic risk factors that contribute to drug-response variability is extremely important for realising the goals of personalised medicine. Certain drugs evaluated in a recent study [40], such as carboplatin, cisplatin, etoposide and daunorubicin, are very frequently used for treating several cancer types, including ovarian, colorectal, testicular and lung cancer. In most association studies, SNPs have been used to clarify the specific effect of genetic polymorphisms on drug response, while less attention has been paid to CNVs. Since disease susceptibility cannot be fully explained by SNPs, CNVs should also be considered as potential genetic factors linked to drug response. Recent genome-wide surveys of CNVs have shown that these SVs are common in the human genome [41]. With the help of improved assays and novel analysis techniques, CNVs in genes that encode drug-metabolizing enzymes have been suggested to have severe effects on drug response [40].

1.3. CNV calling in SNP datasets

1.3.1 Role of CNVs in variability of human phenotypes from genome-wide association studies (GWAS)

GWAS are a type of genetic study design that examines hundreds of thousands of SNPs genotyped across the genome on an array. Gene microarray technologies, comprising aCGH and SNP arrays, were considered a great advancement in computational genomics. Microarray technologies were the first to allow a detailed CNV classification in the genomic sequence [42, 43]. GWAS compare genetic variation between people for a specific phenotype, binary or quantitative. They are important for the identification of associated loci, especially when the history behind a disease is unknown and the aetiology is complex, such as that of psychiatric disorders [44]. They are also used for the detection of genetic factors in common or complex diseases. GWAS have identified many loci that contribute to common complex diseases and, as sample sizes increase and technological development lowers genotyping costs, more variants are continuously being detected. Common SNPs (MAF > 5%) explain only a small part of the heritability of a disease. Thus, it is necessary to take into account other types of variation that may contribute to the heritability that is still unexplained. One promising candidate is CNV [8]. Recent GWAS detect CNVs using SNPs, after inferring the copy number status from the data generated by high-throughput genotyping platforms.

CNV and clinical interpretation

CNV genomic characterisation: Exploring the genomic context is crucial for understanding the possible impact of a single CNV. Several tools can now perform sequence-based annotation and visualisation of large datasets [45-47]. First, there are genome browsers such as the ENCODE project and the Genotype-Tissue Expression (GTEx) project [48] (https://www.gtexportal.org/home/). These are very powerful for CNV annotation and provide explanations for the putative impact of CNVs in various tissues. Variant Effect Predictor (VEP) is another tool, useful for annotating SNPs and CNVs from any species, which draws its data from Ensembl [46]. Most importantly, VEP offers allele/genotype frequencies for SNPs and a list of tagged variants [49].

Investigating the putative clinical impact: Evidence of the association between a CNV and a phenotype can be found through family studies, but these are not feasible on a large scale and remain labour- and time-consuming. Moreover, DECIPHER, an online repository of CNV and phenotype data, is used by clinical laboratories to share detected CNVs. DECIPHER’s aim is to make the clinical interpretation of CNVs feasible [50].

Text mining approaches: Text mining can be a valuable tool for extracting information from the scientific literature and identifying links between a gene and a concept term.

Geneset enrichment analyses: Geneset enrichment analyses are used in studies which perform gene expression analyses. They look at whether there is an overlap between biological annotations.


Network-based analyses: Network-guided analyses allow genes which are “unaffected” in the dataset to be associated with other “affected” genes, i.e. they account for interactions between genes. Networks of such interactions can now be constructed from gene expression data and text-mining approaches. These principles have been employed by various methods to identify SNPs which are linked to clinical traits. Furthermore, these techniques are regularly used in SNP-GWAS and, most importantly, in drug-discovery studies.

1.3.2. cnvHap: an integrative population and haplotype-based multiplatform model of SNPs and CNVs

Multiple statistical approaches have been developed for identifying CNVs in microarray datasets. One of the most powerful such methods is cnvHap, which formed the basis of our NGS-based CNV detection framework.

cnvHap algorithm

cnvHap is a population-haplotype algorithm for accurate CNV detection and genotyping using SNP microarray intensity data [51]. It also incorporates an underlying haplotype model, which has been used to accurately phase CNVs by exploiting large sample sizes and multiple assays [52]. An HMM is used for CNV modelling at the single-chromosome level. cnvHap constructs a set of cluster positions for each probe, which are modelled as “linear combinations of fixed nonlinear functions of the underlying CNV-SNP genotype” [51]. This is demonstrated in Appendix-Figure 2. cnvHap also re-clusters the transformed allele intensity measurements, i.e. log R ratio (LRR) and B allele frequency (BAF), at each position to account for batch effects and other biases. Using these cluster positions, the algorithm estimates the probability of every CNV-SNP genotype at each position.

Multiple datasets in a single HMM: By combining multiple datasets in a single HMM, the cnvHap algorithm is able to estimate copy number states from a collection of probes and transform them into a collection of unmeasured loci at which a uniform distribution of the CNV-SNP genotypes replaces the previously estimated cluster probabilities. This improved CNV genotyping accuracy and was accomplished by using the fastPHASE and polyHap haplotype models.


1.3.3. Advantages of cnvHap

cnvHap was an important addition to the growing body of research on CNV-phenotype associations. In addition to its improved CNV genotyping accuracy, cnvHap is useful for meta-analyses, specifically for combining effect size estimates through the posterior probability distributions obtained from the model. Many of the benefits of cnvHap arise from the use of its specialized HMM, which models the distribution of fluorescence intensities at each probe across the entire population, while simultaneously applying spatial smoothing on the haploid copy number state within each sample. On the other hand, the complexity of cnvHap’s model imposes computational limitations on its genome-wide applicability, especially for estimating high-copy number states [51].

1.4. CNV detection in next generation sequencing data

1.4.1 Sequencing data integration

The main concepts related to sequence data integration as well as ways to accomplish it are explained below.

Hidden Markov Model (HMM)

HMMs are widely used for CNV detection. They are learnable finite stochastic automata. In general, an HMM comprises two stochastic processes. The first is a Markov chain defined by states and transition probabilities. These states cannot be directly observed and are therefore “hidden”. The second process produces the emissions observable at each moment, which follow a state-dependent probability distribution [53]. The term “hidden” refers to the states of the Markov chain and not to the parameters of the model.
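As a minimal illustration of the two processes described above, the following Python sketch decodes the most probable sequence of hidden copy-number states from a series of window read counts using the Viterbi algorithm. The three state names, the Poisson emission model, the mean depths and the transition probability are all illustrative assumptions; they are not taken from [53] or from the framework developed in this thesis.

```python
import math

# Hypothetical three-state copy-number HMM; every number here is illustrative.
STATES = ("deletion", "normal", "duplication")
MEAN_DEPTH = {"deletion": 15.0, "normal": 30.0, "duplication": 45.0}
STAY = 0.98  # assumed probability of remaining in the same hidden state


def log_poisson(k, lam):
    """Log-probability of observing count k under a Poisson with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def log_transition(prev, cur):
    """Log transition probability between two hidden states."""
    if prev == cur:
        return math.log(STAY)
    return math.log((1.0 - STAY) / (len(STATES) - 1))


def viterbi(depths):
    """Return the most probable hidden-state path for a list of read counts."""
    start = math.log(1.0 / len(STATES))  # uniform initial distribution
    score = {s: start + log_poisson(depths[0], MEAN_DEPTH[s]) for s in STATES}
    backpointers = []
    for depth in depths[1:]:
        new_score, pointer = {}, {}
        for cur in STATES:
            # Best predecessor for the current state at this window.
            best_prev = max(STATES, key=lambda p: score[p] + log_transition(p, cur))
            pointer[cur] = best_prev
            new_score[cur] = (score[best_prev] + log_transition(best_prev, cur)
                              + log_poisson(depth, MEAN_DEPTH[cur]))
        score = new_score
        backpointers.append(pointer)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return path[::-1]
```

Here the hidden Markov chain is the sequence of copy-number states and the emissions are the observed read counts; only the latter are supplied to `viterbi`.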

Read Depth (RD): Read depth is the only alignment feature which is directly associated with the absolute copy number. It counts the number of reads (i.e. the generated NGS data, presented as short sequence fragments with a length ranging from 35 bp to 300 bp) which are mapped to each genomic region [15]. Therefore, it is proportional to the absolute copy number. For every sample that is sequenced, there is a corresponding statistic that represents the expected value of the RD. This is called depth of coverage (DOC).
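The proportionality between read depth and copy number can be sketched in a few lines of Python. This is a simplified illustration rather than the method used in this thesis: the function names are hypothetical, the expected depth uses the standard approximation (number of reads × read length / genome size), and a diploid baseline is assumed.

```python
def expected_depth_of_coverage(n_reads, read_length, genome_size):
    """Approximate expected depth: total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * read_length / genome_size


def estimate_copy_number(window_depth, expected_depth, ploidy=2):
    """Scale the observed/expected depth ratio by the assumed baseline ploidy."""
    if expected_depth <= 0:
        raise ValueError("expected_depth must be positive")
    return round(ploidy * window_depth / expected_depth)
```

For example, a window with half the expected depth in a diploid sample would be assigned copy number 1 (a heterozygous deletion) under these assumptions. Real data would additionally need correction for biases such as mappability before the ratio is interpreted.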

Figure 1.2: Visualization of read depth alignment signatures

Read pairs (RPs): Pairs of reads that map to an improbable distance apart or even to different contigs (for SVs).

Figure 1.3: Visualization of read pair alignment signatures


Split Reads (SRs): Individual reads that span a deletion or insertion breakpoint [15].

Figure 1.4: Visualization of split read alignment signatures

1.4.2 Definition of haplotypes

Widely used phasing and imputation algorithms, including PHASE, Beagle, IMPUTE, MACH and fastPHASE [39, 54-57], have been designed to resolve the haplotypic phase of SNP markers in diploid genomes. While these algorithms can be used to phase or impute short indels by recoding insertion/deletion states as bi-allelic SNP markers, they are not suited to the harder problem of haplotyping larger and more complex structural variants, which may themselves encompass many SNPs. An extension of fastPHASE [55], called polyHap [52], was previously developed in which haplotypes are sampled from a finite number of ‘ancestral haplotypes’ to accurately resolve the phase of polyploid genomes. This haplotype HMM was then fused with a haploid copy-number HMM to create the model underlying cnvHap [51]. cnvHap allows any pre-specified level of ploidy, as well as any maximum copy number per haploid chromosome. This algorithm addresses both ‘non-internal’ SV phasing, which consists of identifying the extended haplotype harbouring the structural variant, and ‘internal’ phasing, which consists of identifying the internal SNP haplotype belonging to each copy of a duplicated region. We will build on these algorithms to develop improved methods for resolving the haplotype of structural variants in WGS data, using multiple sources of information, including:

i. Sharing of short haplotype segments across unrelated individuals in the population

ii. Long-range haplotype sharing between closely related individuals.


1.4.3 The growing importance of CNVs: new insights for detection and clinical interpretation

Detecting CNVs is crucial for a better understanding of the genomic context and its contribution to disease. Even though examples of links between CNVs and disease have already been published, the impact of CNVs on disease progression and drug response has not yet been fully explored [58]. Below, we discuss the strengths, limitations, challenges and progress of CNV detection analyses (through DNA microarrays and NGS), as well as the links between known CNVs and phenotypes.

The main platforms for genome-wide CNV discovery comprise Comparative Genomic Hybridization (CGH), SNP DNA microarrays and NGS.

SNP genotyping arrays: SNP genotyping arrays were not initially designed for CNV analysis. Nevertheless, a copy number ratio can be obtained by combining the fluorescent intensities of the two alleles and normalising this quantity with respect to the reference [58]. A CNV can then be identified by detecting significant deviations from the baseline copy number (CN) measurement. Various microarray methods exist for covering CN variant regions. Selecting the appropriate method depends on certain factors:

1) The platform which is going to be analysed (Illumina or Affymetrix)

2) What the output should be, either discrete or continuous prediction of CN

3) The DNA analysis type: germline or somatic CNV analysis [58].

Combining several methods improves sensitivity and specificity.
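To make the copy number ratio obtained from SNP arrays concrete, one widely used form of this normalisation is the log R ratio (LRR). The symbols below are introduced here for illustration only and are not taken from [58]: X_A and X_B denote the normalised intensities of the two alleles at a probe, R_obs their observed sum, and R_exp the sum expected for a reference sample with two copies.

```latex
\[
R_{\mathrm{obs}} = X_A + X_B,
\qquad
\mathrm{LRR} = \log_2\!\left(\frac{R_{\mathrm{obs}}}{R_{\mathrm{exp}}}\right)
\]
```

Under this definition LRR is close to 0 at the baseline copy number, negative for deletions and positive for duplications. In the idealised case where intensity scales linearly with copy number, a single-copy loss gives log2(1/2) = -1 and a single-copy gain gives log2(3/2) ≈ 0.58; observed values on real arrays are usually closer to zero.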

Sequencing-based methods: Massively parallel sequencing is now feasible through NGS technologies. To take advantage of the wealth of data generated by NGS, many SV detection approaches have been developed, including paired-end mapping (PEM), read-depth analysis, split-read approaches and sequence assembly comparisons [59-61].

Paired-end mapping approaches: Before the arrival of NGS, SVs were identified from fosmid paired-end sequencing. PEM has both advantages and disadvantages. On the one hand, it can determine breakpoints precisely and performs well in the presence of repetitive elements. On the other hand, PEM cannot resolve SVs when both ends of a pair map to repetitive regions.
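The core idea of paired-end mapping, flagging pairs whose mapped distance is improbable given the library's insert-size distribution, can be sketched as follows. The function name, the labels and the three-standard-deviation cut-off are illustrative assumptions rather than the settings of any specific PEM tool.

```python
def classify_read_pair(insert_size, mean_insert, sd_insert, n_sd=3.0):
    """Label a read pair by comparing its mapped span with the expected insert size."""
    if sd_insert <= 0:
        raise ValueError("sd_insert must be positive")
    upper = mean_insert + n_sd * sd_insert
    lower = mean_insert - n_sd * sd_insert
    if insert_size > upper:
        # Ends map too far apart on the reference: sequence is missing in the sample.
        return "deletion-like"
    if insert_size < lower:
        # Ends map too close together: the sample carries extra sequence.
        return "insertion-like"
    return "concordant"
```

In practice a single discordant pair is weak evidence, so callers typically require a cluster of pairs supporting the same event before reporting an SV.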

Read-depth method: The aim of the read-depth analysis is to explore the local changes in read coverage in comparison to the expected depth distribution. Information obtained from paired-reads can help improve mapping quality and detect big, complex rearrangements. Nevertheless, non-unique read mapping remains a challenging aspect of the read-depth analysis.

Split-read approach: SR methods start from read pairs in which one read of each pair aligns to the reference genome at a unique position, while the other read either does not map or maps only partly to the genome. Through this procedure of unmapped or not wholly mapped reads, CNV breakpoints are generated at a unique [15].

Sequence assembly comparison: Where sequencing depth is high, de novo assembly should be attempted, as it allows intuitive detection of deletions and duplications through simple sequence comparison to the reference genome. De novo assembly has both advantages and disadvantages. It can outperform the PEM approach, particularly for deletions or insertions smaller than the paired-end insert size [58]. However, it is challenging to perform in repeat-rich regions and it requires high read-depth coverage.

Pitfalls in CNV analyses

Despite the development of novel technologies and statistical methodologies, CNV identification is still considered a very challenging task [62-64]. Batch effects can easily affect DNA microarrays as well as NGS. CNV prediction can be influenced by factors such as experiment date and the plate id. As a solution to batch effects, experimental planning and quality control measures should be undertaken to facilitate downstream analysis. Certain approaches can mitigate batch effects, such as checking for consistency between batches through the use of positive and negative controls, conducting univariate or multivariate analyses and performing sensitivity analyses for the consistency of the results.


DNA microarray limitations: Some of the problems that may arise are due to inherent DNA microarray limitations. These include probe spatial auto-correlation, non-specific hybridisation and differences between colour dyes for CGH arrays. Many of these problems can be overcome using normalisation procedures such as LOESS (LOcal regrESSion) smoothing. These procedures, however, can lead to false positives and require the use of multiple adjacent probes for the identification of a CNV, which negatively affects the detection of small CNVs. Another limitation of microarrays is their poor coverage of repeat-rich regions and regions near segmental duplications. These regions tend to be hotspots for CNV and may be entirely missed by SNP microarrays. Thus, modern genotyping arrays use a combination of SNPs and non-polymorphic probes in order to cover these CNV regions [65]. DNA microarrays are also unable to provide absolute CN estimates because of hybridization saturation. This can be overcome with de novo arrays, which are used for diagnostic purposes. These use a combination of CGH and SNP probes for better CN classification.

NGS limitations: NGS has many advantages when compared to DNA microarrays. First, NGS can detect small variants such as indels as well as more complex structural variants, such as inversions. Second, exact breakpoint locations can be estimated and more importantly NGS does not have to deal with hybridization saturation and thus multi-copy amplifications can be easily estimated too. Nevertheless, NGS is still prone to biases that need to be taken into account. Biases can be introduced by systematic sequencing errors and by mapping ambiguity of short sequencing reads.

CNV genome-wide association tests

A large number of statistical approaches could be applied to test for association between a given trait and a CNV locus. Linear regression is usually the most appropriate for quantitative traits, whereas for binary traits, logistic regression and Fisher’s exact test are preferably used. These can be directly applied to single probes but not to CN regions that cover multiple points. For the identification of CNV clusters and to facilitate interpretation, it is crucial to “align” CNVs from different individuals and assign a consensus CNV. The “merge-by-overlap” approach can do this through merging CNVs from different subjects into the same CNV region “if their reciprocal overlap satisfies a minimal cut-off” [41, 58].
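The “merge-by-overlap” idea can be sketched in Python as below. This is a simplified greedy illustration, not the procedure of [41, 58]: the function names are hypothetical, CNVs are represented as half-open (start, end) intervals on one chromosome, and the 50% reciprocal-overlap cut-off is an assumed value.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two fractions of each interval that is covered by the other."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))


def merge_by_overlap(cnvs, min_overlap=0.5):
    """Greedily group CNV calls whose reciprocal overlap with a region meets the cut-off."""
    regions = []
    for cnv in sorted(cnvs):
        for region in regions:
            if reciprocal_overlap(cnv, region["span"]) >= min_overlap:
                region["members"].append(cnv)
                # Extend the consensus region to cover the newly added call.
                region["span"] = (min(region["span"][0], cnv[0]),
                                  max(region["span"][1], cnv[1]))
                break
        else:
            # No existing region matched, so this call starts a new one.
            regions.append({"span": cnv, "members": [cnv]})
    return regions
```

Each resulting region can then be treated as a single locus, with carrier status per individual, in a regression or Fisher’s exact test.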


1.4.4. Mapping CNVs by population-scale genome sequencing

A map of unbalanced CNVs has been constructed [66], based on whole genome DNA sequencing data. The construction was based on a combination of evidence from different SV discovery methods together with experimental validation. Thus far, the focus has been on the characterization of deletions which are more often linked to disease, especially those affecting genes. The constructed map is considered a new tool for association studies based on sequencing data.

SV discovery methods: The aforementioned sequencing-based CNV map utilized multiple data sources and computational approaches to combine evidence from RD and RP data. Different alignment algorithms were used for mapping DNA reads, and false discovery rates (FDRs) were calculated using a combination of PCR and array-based analysis results. The sensitivities of deletion SV discovery methods were examined and the highest were shown to be achieved by RD and RP with an FDR > 10% [66]. The SV discovery set constructed by [66] was obtained by combining the calls of different discovery methods covering the same SV.

Impact of SV discovery methods: The SV discovery set was linked to genomic annotation to examine its impact. Coding sequences were influenced by a substantial amount of SVs which overlapped genes and disrupted . Deletions composed most of the SVs that intersected with genes.

Computational issues in SV detection: SV analyses are computationally challenging. This issue becomes critical in association studies using population-based samples of thousands of individuals. The Genome STRiP method was developed to improve the computational feasibility of SV detection within the setting of a large-scale genetic study [66, 67].

Advantages of the pre-computed reference SV map: The constructed map will make the discovery, genotyping and imputation of SVs feasible in many cohorts, where SV discovery is not possible. It will also be used for sequencing various genomes to perform disease association studies.


1.5. CNVs and human disease

1.5.1. Role of CNVs in human disease

Human DNA SNPs have been extensively explored and are considered the most common type of variation within the human genome. The association of phenotypes with genetic variants is crucial in the field of statistical genetics. Most genome-wide association studies (GWAS) have focused on the identification of SNPs and their contribution to disease risk. It is also well established that copy number variation contributes to genetic variation, so research has also turned to the role of CNVs in disease associations, especially in an effort to partially explain the “missing heritability”. This term has arisen from the unexplained heritability of the majority of complex diseases despite the GWAS efforts to explore the genetic background of human disease [68].

As identified by GWAS, many CNV-affected loci contribute to common complex diseases such as Alzheimer’s disease and obesity [69, 70]. A major challenge in the field of CNVs involves a subset known as “variants of unknown significance”. This type of variants might be involved in disease susceptibility but research still requires an addition of population-level data to study them [71].

The aetiology of complex phenotypes also lies in haplotype associations. Haplotypes can influence phenotypes directly, by affecting promoter activity and protein structure [72, 73]. They also play a crucial indirect role in GWAS when testing the relationship between CNVs and SNPs, with the aim of finding CNV-tagging SNPs after defining the copy number status from the generated data.

Despite the replicable associations between CNVs and complex disease [74-77], the contribution of CNVs to many common, complex diseases is less well understood, and there are vastly fewer robustly replicated CNV associations with disease than SNP associations from GWAS. To date, CNV has failed to account for a substantial proportion of “missing heritability” in complex disease genetics [51]. The view on SV and disease has changed considerably owing to the sub-microscopic CNVs that have been discovered in the human genome. As stated above, CNVs have been shown, to a larger extent than SNPs, to cause genetic diversity among individuals and a rapid increase in the number of traits [7]. The interaction between CNVs and genetic or environmental factors can affect their detectable phenotypic effect [78].

Examples of common complex diseases that have been found to be associated with CNVs are obesity, schizophrenia, mental retardation, Autism Spectrum Disorder (ASD) and idiopathic generalized epilepsy [79-81]. These diseases are linked to CNVs identified on chromosome 16. A ~21kb deletion at 16p12.3 appeared to be associated with obesity in European (n = 1,627) and Chinese (n = 2,286) populations [82]. The genes affected by the deletion are GPRC5B (G protein-coupled receptor, family C, group 5, member B), DNAH3 (Dynein Axonemal Heavy Chain 3), GPR139 (G protein-coupled receptor 139) and SYT17 (Synaptotagmin 17). This study suggests that CNV plays a crucial role in obesity mainly in Europeans; further work is therefore needed to identify whether this CNV is causative for obesity in other populations as well. In addition to the obesity-linked deletion, both deletions and duplications at 16p13.11 are associated with various phenotypes. Duplications located in this region are linked to diseases affecting neurodevelopment, behaviour and physical health. Examples of these are ASD, Intellectual Disability (ID) and schizophrenia. Other serious phenotypes include hypermobility and craniosynostosis. Multiple diseases have also been linked to deletions at this locus, such as focal and generalized epilepsy, ID, ASD, congenital abnormalities and microcephaly [80, 81].

1.5.2. CNV maps

CNV maps contain information from healthy individuals of different ethnicities and can be used for various purposes. They serve as a base for exploring the genetics behind phenotypic variation. They can also help determine whether a CNV is pathogenic or benign, or even identify novel CNVs, providing evolutionary clues to upcoming research [71]. Additionally, such maps are extremely beneficial for applications in medicine. This requires attention, as there are criteria to determine whether such maps can be clinically relevant to the medical community. First, CNVs found in disease cases should be looked up in the map to investigate whether they fall into a copy-number variable region (CNVR). Second, they should be linked to disease genes already established as causative.


1.6. Aims and structure of thesis

The primary aim of this project was the construction of a map of CNV-tagging SNPs to explore new disease associations and add knowledge to existing disease pathways. Furthermore, we sought to develop a novel set of algorithms under a statistical framework for the detection and investigation of CNVs in the human genome using NGS.

Chapter 2 describes the way in which CNVs and SNPs were extracted from the 1000G database as well as how they were analysed based on the linkage disequilibrium (LD) patterns between them. Based on the LD findings, Chapter 3 highlights how CNV-tagging SNPs were linked to human disease and traits through data obtained either from external sources or publicly available data.

Chapter 4 focuses on the statistical strategies that were applied to our population-based model to achieve CNV calling. The data used for developing the model and expanding it into a pipeline were pulled from the latest release of the 1000G project. Chapter 5 describes the validation of our method and explains the advantages of our CNV identification approach. Furthermore, it presents a map consisting of novel and existing CNVs together with their characteristics, including allele frequencies (AFs) for each studied population.

Chapter 6 summarizes each of the thesis goals and provides an overview of the results obtained. In addition, it proposes ideas on how the research community can build on the CNV detection method and the map of CNV-tagging SNPs in future work.


Chapter 2

Extraction of CNV-tagging SNPs using the 1000G sequencing data

In Chapter 1, we described the role of human genome variability and particularly the importance of CNVs in human disease. We turned our attention to the history behind CNV identification and went through its latest available methods and techniques. Furthermore, we presented existing techniques for NGS-based CNV detection and discussed how this brought new ways of addressing the issue of “missing heritability”. In the next two chapters, we will describe how we used sequencing data to extract and explore the genetic variability related to CNV identification present in various populations worldwide. We will emphasize the relationship between genetic markers and highlight their crucial phenotypic impact. In addition, we will explore potential functional effects of CNVs and suggest ways in which they can become useful in uncovering the genetic effect on disease susceptibility. The aim of this part of our work was to uncover the relationship between SNPs and CNVs from the 1000 Genomes project whole-genome sequencing data and therefore to identify population-specific CNV-tagging SNPs.

2.1. Materials and methods

2.1.1 1000 Genomes project data

The data that we used to evaluate the genome-wide overlap between SNPs and CNVs were the most recent (Phase 3) publicly available data from the 1000 Genomes project, released in 2014. These data contain information for 2,504 unrelated individuals with ancestry from 26 populations around the globe [20], covering five major ethnic groups: European, African, Admixed American, East Asian and South Asian. The samples we analysed were originally sequenced using a combination of low-coverage whole-genome sequencing, deep exome sequencing and dense microarray genotyping using the Illumina platform [83]. During Phase 3, the Project used data obtained from the Illumina sequencing platform, whereas Phase 1 used a combination of sequencing platforms. The datasets were aligned to the GRCh37 human reference genome (http://www.1000genomes.org/analysis). The read lengths used were ≥ 70 bp, with the aim of achieving higher-quality calls than Phase 1, where read lengths ranged from 36 bp to >100 bp. Due to the rapid development of variant calling algorithms over the years, Phase 3 incorporated an additional set of variant callers. These included haplotype-aware variant callers as well as de novo assembly variant callers. Another reason for using Phase 3 data was that low-coverage and exome sequence data were considered together rather than individually, and that the genotype calling algorithms had changed. By using SHAPEIT2 [84] and MVNcall [85], the Project was able to integrate multi-allelic variants and complex events that could not be used in previous phases.

As part of the 1000G Project, a Human Genome Structural Variation Consortium (HGSV) was established. The main objective of the HGSV was to discover and genotype structural variants, including insertions, deletions, duplications and inversions, and to create a map of SVs based on the studied populations. Our analysis focused on CNVs detected within the scope of work of the 1000G-HGSV and on SNPs, the main type of variant output by the 1000G Project. We defined a gene region (Appendix, Box 1) by its transcriptional start and stop sites and expanded it by using a 20kb window around each gene. Through this expansion, we covered all the regulatory regions (including UTRs and promoters) and therefore all the possible variants in most proximal “cis” regulatory elements.
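The region definition described above can be sketched as follows. This is an illustrative reading rather than the exact implementation: it assumes that the 20kb window means 20 kb added on each side of the gene, that coordinates are 1-based and inclusive, and the function names are hypothetical.

```python
FLANK_BP = 20_000  # assumption: 20 kb added upstream and downstream of the gene


def expand_gene_region(tx_start, tx_end, flank=FLANK_BP):
    """Extend a gene's transcribed interval by a flank, clamped at position 1."""
    return max(1, tx_start - flank), tx_end + flank


def variant_overlaps_region(region, variant_start, variant_end):
    """True if a variant interval shares at least one base with the region."""
    region_start, region_end = region
    return variant_start <= region_end and variant_end >= region_start
```

A SNP is handled as an interval of length one (start equal to end), while a CNV is tested with its full span, so a CNV that only partly covers the expanded region is still retained.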

2.1.2. CNV and SNP extraction using the 1000G data

We downloaded all SNPs and CNVs available from the 1000G which overlapped the protein-coding gene regions extracted from the Ensembl GRCh37.82 assembly. That is, our CNV and SNP identification process focused on the protein-coding gene regions that overlapped the regions of the 1000G variants. The total number of protein-coding gene regions in our downloaded list was 20,314. One gene region on chromosome Y did not have information for all 2,504 samples we were analysing and was therefore excluded from our gene list.

We started the analysis by downloading all SVs. These were available in a single file since their number to date is much smaller than that of SNPs. The variants included covered all types of SVs: inversions, insertions of SINE-VNTR-Alu (SVA) elements, insertions of LINE1 elements, insertions of Arthrobacter luteus (ALU) elements, duplications and deletions. Our analysis focused on CNVs, namely deletions and duplications. In contrast to the CNVs, the SNP files in the 1000G directories are split by chromosome. Some of these files also included information on indels, which was excluded from our downstream analyses. The resulting outputs from both CNV and SNP extractions were organised by protein-coding gene region.

Furthermore, we split all multi-allelic SNPs and CNVs into bi-allelic variants to avoid losing genotype information from specific samples. That is, if a sample had both a deletion and a duplication, we split it as if there were two separate samples. All homozygous genotypes indicating a copy-neutral state, and therefore absence of variability, were removed. The next step was to pre-process the CNV and SNP files for the LD calculation. Below, we explain how each genotype was transformed, based on the form in which it appears in each variant call format (VCF) file:

Table 2.1: Transformation of CNV and SNP genotypes

Genotypes (VCF)    Transformed genotypes (LD format)    SNPs (Example)    CNVs
0|0                0                                    G/G               Neutral copy number
0|1                1                                    A/G               Deletion
1|0                1                                    G/A               Deletion
1|1                2                                    A/A               Duplication
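The transformation in Table 2.1 can be sketched as follows. This is a minimal illustration and not the code used in the thesis; the function name `to_dosage` is hypothetical, and it assumes bi-allelic genotype strings as they appear in the VCF GT field.

```python
def to_dosage(gt):
    """Convert a bi-allelic VCF genotype string (e.g. '0|1') to the LD-format value of Table 2.1."""
    # Accept phased ('|') and unphased ('/') separators and ignore stray spaces.
    alleles = gt.replace(" ", "").replace("/", "|").split("|")
    if len(alleles) != 2 or any(a not in ("0", "1") for a in alleles):
        # Missing or non-bi-allelic genotype: left for the caller to handle (assumption).
        return None
    # Count of alternate alleles: 0|0 -> 0, 0|1 or 1|0 -> 1, 1|1 -> 2.
    return sum(int(a) for a in alleles)
```

Returning `None` for missing calls is a design choice of this sketch; the source does not state how missing genotypes were treated.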

2.1.3. Evaluation of CNV-SNP LD in sequencing data

LD calculation

We used the r² metric to calculate the LD between CNVs and SNPs in 13,003 defined protein-coding regions. This metric is the square of the Pearson correlation coefficient (r), often referred to as the Pearson R test, which measures the strength of the relationship between two variables. The formula of the Pearson correlation coefficient is given below:

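$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{2.1} $$

where $x_i$ and $y_i$ are the values of the two variables for sample $i$, $\bar{x}$ and $\bar{y}$ are their means, and $n$ is the number of samples.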

Its values range between -1 and 1; we therefore used r², which always lies between 0 and 1, for easier interpretation in terms of genetic variation.

We considered r² ≥ 0.8 as high LD, 0.6 < r² < 0.8 as moderate, 0.4 < r² ≤ 0.6 as modest, 0.2 < r² ≤ 0.4 as mild, 0.01 ≤ r² ≤ 0.2 as low and r² < 0.01 as no LD (Appendix Box 1).
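As an illustration of the LD calculation, the sketch below computes r² between a CNV and a SNP from their transformed genotype vectors and assigns an LD category. It is a hedged example rather than the thesis implementation: the function names are hypothetical, and the mild and low bins are read here as 0.2 < r² ≤ 0.4 and 0.01 ≤ r² ≤ 0.2.

```python
import math

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length genotype vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # A monomorphic variant has no variance, so r is undefined.
        return float("nan")
    r = sxy / math.sqrt(sxx * syy)
    return r * r

def ld_category(r2):
    """Map an r² value to the LD categories used in the text."""
    if r2 >= 0.8:
        return "high"
    if r2 > 0.6:
        return "moderate"
    if r2 > 0.4:
        return "modest"
    if r2 > 0.2:
        return "mild"
    if r2 >= 0.01:
        return "low"
    return "none"
```

Because r is squared, a SNP whose alternate allele is perfectly anti-correlated with the CNV allele also receives r² = 1.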

Allele frequency (AF) in relation to the LD

In addition to the LD, we calculated the allele frequency (AF) of CNVs overall as well as by population to explore the AF differences between geographical areas. The AF practically represents how often a genetic variant appears in a population (https://www.nature.com/scitable/definition/allele-frequency-298). It is calculated based on the genotype frequencies (homozygous and heterozygous) and the number of individuals of a specific population:

$$ \text{Allele frequency} = \frac{\text{Heterozygous alleles} + \text{Homozygous alleles} \times 2}{\text{Number of individuals} \times 2} \tag{2.2} $$

The VCF files we originally downloaded included the AFs for the five continental groups defined by the 1000G Project (Table 2.2): European, Admixed American, African, East Asian and South Asian. Each continental group is further split into individual populations, whose AFs were not available in the original VCF files. We therefore calculated each of those AFs ourselves, to obtain information about potential genetic drift which might have occurred over time between those populations.
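The per-population AF calculation of equation 2.2 can be sketched as below. This is an illustrative example under stated assumptions, not the thesis code: genotypes are assumed to be coded 0, 1 or 2 as in Table 2.1, and the function names and the population labels passed in are hypothetical inputs.

```python
def allele_frequency(dosages):
    """AF = (heterozygous + 2 * homozygous) / (2 * number of individuals)."""
    het = sum(1 for d in dosages if d == 1)
    hom = sum(1 for d in dosages if d == 2)
    n = len(dosages)
    return (het + 2 * hom) / (2 * n) if n else float("nan")

def af_by_population(dosages, populations):
    """Group samples by population label and compute the AF within each group."""
    groups = {}
    for dosage, pop in zip(dosages, populations):
        groups.setdefault(pop, []).append(dosage)
    return {pop: allele_frequency(values) for pop, values in groups.items()}
```

For example, `af_by_population([0, 1, 2, 0], ["CEU", "CEU", "YRI", "YRI"])` gives an AF of 0.25 for CEU and 0.5 for YRI.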

We evaluated the LD between SNPs and CNVs by the AF bins of the 13,003 predefined protein-coding regions, to investigate the frequency of highly correlated variants in each population.

We initially calculated the analysed regions’ LD for all the populations pooled together, followed by a one-to-one analysis for each population. That is, we did a separate analysis to observe the LD patterns in each of the 26 populations. We then investigated these differences for the major ethnic groups as defined above (Table 2.2). Thus, we were able to establish the frequency of common CNVs (MAF > 5%), low-frequency (1-5%) and rare CNVs (<1%) with respect to high or low LD.
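The frequency classes above, cross-tabulated against high or lower LD, can be sketched as follows. This is a hedged illustration only: the function names are hypothetical, the minor allele frequency is taken as min(AF, 1 - AF), and high LD is taken as r² ≥ 0.8.

```python
from collections import Counter

def frequency_class(af):
    """Classify a variant as common (MAF > 5%), low-frequency (1-5%) or rare (< 1%)."""
    maf = min(af, 1 - af)
    if maf > 0.05:
        return "common"
    if maf >= 0.01:
        return "low-frequency"
    return "rare"

def tally_by_frequency_and_ld(cnvs, high_ld=0.8):
    """Count CNVs per (frequency class, LD class); cnvs is a list of (AF, max r²) pairs."""
    counts = Counter()
    for af, max_r2 in cnvs:
        ld_class = "high LD" if max_r2 >= high_ld else "lower LD"
        counts[(frequency_class(af), ld_class)] += 1
    return counts
```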

Table 2.2: Populations of the 1000 Genomes Project phase 3 release data

Europeans | Admixed Americans | Africans | South Asians | East Asians
Utah residents from North and West Europe (CEU) | Mexican ancestry from Los Angeles, USA (MXL) | Yoruba in Ibadan, Nigeria (YRI) | Gujarati Indian from Houston, Texas (GIH) | Han Chinese in Beijing, China (CHB)
Toscani in Italia (TSI) | Puerto Ricans from Puerto Rico (PUR) | Luhya in Webuye, Kenya (LWK) | Punjabi from Lahore, Pakistan (PJL) | Japanese in Tokyo, Japan (JPT)
Finnish in Finland (FIN) | Colombians from Medellin, Colombia (CLM) | Gambian in Western Gambia (GWD) | Bengali from Bangladesh (BEB) | Southern Han Chinese (CHS)
British in England and Scotland (GBR) | Peruvians from Lima, Peru (PEL) | Mende in Sierra Leone (MSL) | Sri Lankan Tamil from the UK (STU) | Chinese Dai in Xishuangbanna, China (CDX)
Iberian population in Spain (IBS) | | Esan in Nigeria (ESN) | Indian Telugu from the UK (ITU) | Kinh in Ho Chi Minh City, Vietnam (KHV)
 | | Americans of African ancestry in SW USA (ASW) | |
 | | African Caribbeans in Barbados (ACB) | |


2.2. CNV and CNV-tagging SNPs’ maps

CNV map

The total number of CNVs identified by our genome-wide analysis across all protein-coding gene regions was 37,307. After the LD analysis, we observed that of the 20,313 gene regions analysed, only 13,003 harboured CNVs. The length of each identified CNV was calculated from its start and end positions in the region’s VCF file, and ranged between 207 bp and 2,258,000 bp. Through the AF calculation, we found that 33,393 (89.5%) were rare, 2,305 (5.5%) were of low frequency and only 316 (0.85%) were common.

CNV-tagging SNPs map

Figure 2.1: Plot of AF(CNV) Vs LD max in protein-coding genes

Among the 37,307 CNVs identified in the 1000G data, 16,420 (44%) were in strong LD with nearby SNPs, showing that many CNVs were well tagged by SNPs (Figure 2.1a). We investigated each continental group separately to observe the genetic properties of the 26 populations studied. Both common and rare CNVs were observed in all populations, with similar LD patterns (Figure 2.1b). The plots appear to follow the same trend, showing almost constant variability throughout the populations, although this does not imply homoscedasticity. African populations generally had the highest variant frequency, confirming previous findings [5]. While past research agrees on the excess of variants in African populations, the picture for variant frequency in Asian populations is still ambiguous. A 2011 study by Xu H. et al. focused on identifying common and rare CNVs in three Asian populations (Chinese, Malays and Indians in Singapore) [86]. It found that most CNVs had a low frequency (AF < 10%), while 40% of them were rare (AF < 1%). Findings from Victor E. et al. in 2016 show that rare variants (AF < 0.05) were three times more frequent in African populations than in European and Asian populations, using asthma and chronic obstructive pulmonary disease (COPD) as examples [87]. However, an independent investigation would be required to evaluate whether the variant frequency is disease-specific. Our results agree with both findings, since we identified both common and rare variants in Asian populations.

(a) Maximum calculated LD of CNV-tagging SNPs for all 26 populations of the 1000G together, plotted against the AF of all CNVs detected in the 1000G sequencing data. The maximum LD for several CNV-tagging SNPs in various regions was very close to 0. Measurements are denoted by red circles. (b) Maximum LD for each of the 26 populations of the 1000G separately, plotted against the AF of the CNVs present in each specific population. Only high LD results (LD ≥ 0.8) are presented on each plot. (i) African populations (ii) American populations (iii) East Asian populations (iv) South Asian populations (v) European populations

Chapter 3

CNV-tagging SNPs’ link to human disease/traits

This chapter will focus on the identification of CNV-tagging SNPs for various human phenotypes/traits. The analyses were conducted using GWAS catalog, DIAGRAM and NFBC data sources (Table 3.1). Furthermore, we present the results from each of the three analyses, discuss and compare common CNV-tagging SNPs detected in each of them.

Table 3.1: Identified CNV-tagging SNPs in various phenotypes

Data source | No of tested SNP variants | Phenotypes/traits evaluated | CNV-tagging SNPs | CNVs | No of related loci
GWAS catalog | 34,672 | http://www.ebi.ac.uk/gwas/docs/file-downloads | 53 | 41 | 63
DIAGRAM* | 27,493 | Type 2 diabetes | 5 | 4 | 6
NFBCⱡ | 2,084 | FAs¥ & polyunsaturated FA | None | None |

* DIAGRAM: Diabetes Genetics Replication and Meta-analysis
ⱡ NFBC: Northern Finland Birth Cohort
¥ FA: Fatty acids

3.1. Disease associations

3.1.1. GWAS catalog

The National Human Genome Research Institute (NHGRI) joined with the European Bioinformatics Institute (EMBL-EBI) to create a collaborative project called the “GWAS catalog” (https://www.ebi.ac.uk/gwas/). The Catalog was founded by NHGRI in 2008. The collaboration between the two institutes started in 2010 and is still ongoing. The GWAS catalog is a large collection of published GWAS, including about 100,000 SNPs and associations between SNPs and traits/diseases based on a p-value < 1.0 × 10⁻⁵ (Figure 3.1) [88].

We downloaded version v1.0 disease associations from GWAS catalog in December 2016, to check for established SNPs associated with the diseases and other human phenotypes (https://www.ebi.ac.uk/gwas/downloads) and evaluate any potential intersections with our CNV-tagging SNP map. The SNPs included in the catalog are the most significant SNPs pulled from the studies from each independent locus.

Based on the p-value filtering of the GWAS catalog (p-value < 1.0 × 10⁻⁵), we detected 54 CNV regions with tagging SNPs on 16 chromosomes (Tables 3.1, 3.2, 3.3, 3.4). Almost all CNVs tagged by SNPs are common deletions. Based on our functional analysis, the identified deletions were mainly intronic and intergenic, such as a deletion of 588bp on chromosome 3 and a deletion of 1,163bp on chromosome 4, within loci associated with major depressive disorder and systemic lupus erythematosus respectively (Tables 3.3, 3.4). This analysis suggests that CNVs play an important role in human trait variability. Intronic CNVs can alter the structure of gene transcripts and thus cause changes to the coding sequence of messenger RNAs (mRNAs). In addition, intronic deletions can be pathogenic if they interfere with splicing, such as the deletions of ~80kb, 9.7kb and 29kb in the CFHR1-3-5 genes associated with nephropathy [89]; and they can even be deleterious if they are located in small introns, such as the 723bp deletion on chromosome 18, affecting susceptibility to breast cancer.

Furthermore, if we consider “imputing” CNVs based on the tagging SNPs, we could uncover potential functional culprits behind the disease-SNP associations in the future. By definition, imputation methods aim to discover haplotype sharing between the study samples and the haplotypes present in the reference set [90], and use this sharing to impute the missing alleles in the study samples. This explains the close connection between methods for inferring haplotype phase and tagging-SNP approaches. Traditionally, the SNP tagging approach involved imputing a SNP from another set of SNPs. Specifically, the SNP to be imputed was phased with a small collection of neighbouring SNPs extracted from the studied data set, with the purpose of identifying the haplotypes in high LD with the alleles at the SNP of interest. The genotypes included in the study, as well as those from the reference panel, are then phased together with the aforementioned SNPs. During this phasing procedure, the genotypes that are missing from the study are imputed. This method is fast and efficient, and it is now used for imputing CNVs based on the tagging SNPs.

Using the above approach, the novel CNV-tagging SNPs can be used in GWAS to discover associations with various common complex diseases. NGS data contribute to accurate imputation, and databases such as the 1000G allow scientists to explore multiple variants, including CNVs. For a more precise detection of true causal variants, it is essential to analyse populations with a larger sample size and low LD.

We tested 34,672 association signals reported in the GWAS catalog in total. SNPs associated with multiple phenotypes were counted only once. We then selected SNPs with p-value ≤ 1 × 10⁻⁸ and ended up with 11,991 association signals achieving this stringent significance threshold. Out of all evaluated association signals, we identified 53 SNPs tagging 41 independent CNVs; that is, some CNVs were in high LD with more than one SNP. Pairwise LD (r²) between the 53 SNPs was also calculated across all 26 populations (Appendix Table 1), showing that some of the SNPs were not independent. In this case, 26 SNP associations were defined as CNV-tagging SNPs. These associations were located on 10 chromosomes. To provide full coverage of the GWAS catalog, we kept all 54 association regions based on the catalog’s p-value threshold.
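The intersection of GWAS catalog signals with the CNV-tagging SNP map can be sketched as below. This is a simplified illustration, not the thesis pipeline: the function name, the input structures and the CNV identifier format are hypothetical, and keeping the smallest p-value for a SNP reported under several phenotypes is an assumption of this sketch.

```python
def tagged_gwas_snps(associations, tagging_map, p_max=1e-8, r2_min=0.8):
    """Return GWAS SNPs passing p_max that tag at least one CNV with r² >= r2_min.

    associations: iterable of (rsid, p_value) pairs, possibly with repeated rsids.
    tagging_map:  dict mapping rsid -> list of (cnv_id, r2) pairs.
    """
    best_p = {}
    for rsid, p_value in associations:
        # Count each SNP once, keeping its most significant p-value (assumption).
        if rsid not in best_p or p_value < best_p[rsid]:
            best_p[rsid] = p_value

    hits = {}
    for rsid, p_value in best_p.items():
        if p_value > p_max:
            continue
        cnvs = [cnv_id for cnv_id, r2 in tagging_map.get(rsid, []) if r2 >= r2_min]
        if cnvs:
            hits[rsid] = cnvs
    return hits
```

Passing `p_max=1e-5` instead reproduces the catalog-wide threshold rather than the stringent one.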

Figure 3.1: Visualisation of current GWAS catalog

Established CNV-tagging SNPs

We identified CNV-tagging SNPs in various gene regions (Table 3.2) that have been previously reported. Specifically, the original studies detected the SNP(s) associated with a certain disease or trait and at the same time evaluated potential CNVs tagged by those SNPs. Our analysis replicated the same CNV-tagging SNPs, which serves as a validation of our approach. Below, we describe some of the most notable associations, which can be explored further to answer questions about the causal variants of a known disease/trait or about population-specific effects.

Several studies investigated potential relationships between IgA nephropathy and human genome variation [91-93]. The reported associations showed that there are deletions present in CFHR1 and CFHR3 genes which are in high LD with rs6677604 (Table 3.2). Specifically, a common deletion of ~ 84kb was considered as the likely causal allele for the disease in Han Chinese population, but this still needs to be tested by formal direct genotyping of CFHR3,1Δ in large case-control cohorts [94]. In our results, we confirm these findings through the identification of two deletions of ~ 80kb and ~ 10kb in the same genomic area tagging the same SNP (LD = 0.84, LD = 0.83). One study specifically looked at a mutation present in CFHR5 in Cypriot patients with glomerulonephritis, identifying an internal duplication which has proved to be a rare allele in this population only [95, 96].

A strong association confirming previous results was a ~30kb deletion correlated with the SNP rs12628403 (LD = 0.90, Table 3.2) on chromosome 22q13.1 [97-100]. This common deletion involves the APOBEC3A gene and was associated with an elevated risk of breast cancer. Based on our findings, three shorter CNVs of ~20kb, 1.6kb and 1.7kb form parts of this deletion. The ~20kb CNV had two variant alleles, one a deletion and the other a duplication.

SNP rs2380205, located on chromosome 10 near the ANKRD16 and FBXO18 genes, tags a deletion of 3.2kb (LD = 0.85). This appears to be a population-specific association, as two previous studies found different results: one revealed a significant association between the SNP and breast cancer risk in women of European descent, whereas the other found no significant association in women of African ancestry [101, 102].

Other previously reported CNV-tagging SNPs that we replicated include rs3101336, which tags a 45kb deletion in the NEGR1 locus and is linked to menarche (age at onset), BMI, childhood BMI, obesity and early onset extreme obesity [103-110]. Another deletion of ~31.5kb is tagged by rs4085613, located close to LCE3A, LCE3B, LCE3D and LCE3E. The deletion is linked to psoriasis and psoriasis vulgaris [75, 111-114]. In the same genomic locus, another SNP, rs1581803, tags the same deletion and is linked to inflammatory skin disease [115]. A short deletion of just 716bp is tagged by rs180242. This CNV-tagging SNP is located in the GNG11 locus and is related to heart rate [116]. In addition, the SNP rs1031391 tags a CNV (either deletion or duplication) of ~27.5kb and is located in the Tas2R31 and Tas2R43 loci. This SNP is linked to bitter taste perception, whereas bipolar disorder is associated with another SNP, rs7247513, which tags a 4kb deletion [117-120]. A drug-target gene, SULT2A1, associated with blood metabolite levels, includes the SNP rs296396, which tags a deletion of just 1.5kb [121, 122].


Table 3.2: Established CNV-tagging SNPs and disease/trait associations

Phenotype/Disease/Trait | Loci | Genomic region | SNP (rsID) | Context | P-value | CNV start pos (bp) | CNV length (bp) | CNV type | EUR AF | LD
Menarche (age at onset); BMI; Childhood BMI; Obesity; Obesity (early onset extreme) | NEGR1 | 1:71841623-72768417 | rs3101336 | Upstream gene variant | 1.00E-13 | 72766343 | 45,472 | Deletion | 0.64 | 1.00
BMI; Obesity; Weight | NEGR1 | 1:71841623-72768417 | rs2568958 | Upstream gene variant | 2.00E-14 | 72766343 | 45,472 | Deletion | 0.64 | 1.00
Psoriasis | LCE3A, LCE3D, LCE3E | 1:152518130-152559248; 1:152531857-152572980 | rs4085613 | Downstream gene variant | 7.00E-30 | 152555495 | 31,437 | Deletion | 0.64 | 0.92
Psoriasis vulgaris | LCE3A, LCE3B | 1:152553138-152593562; 1:152575310-152615579 | rs4845454 | Downstream gene variant | 4.00E-12 | 152555495 | 31,437 | Deletion | 0.64 | 1.00
Inflammatory skin disease | LCE3A, LCE3B | 1:152553138-152593562; 1:152575310-152615579 | rs1581803 | downstream | 2.00E-12 | 152555495 | 31,437 | Deletion | 0.64 | 1.00
Nephropathy | CFHR1, CFHR3, CFHR5 | 1:196601008-196736634 | rs6677604 | Intron variant | 3.00E-10 | 196728877 | 79,988 | Deletion | 0.19 | 0.84
 | | | | | | 196734744 | 9,732 | Deletion | 0.19 | 0.83
 | | | | | | 196735895 | 29,045 | CNV* | 0.19 | 0.84
IgA Nephropathy | CFHR1, CFHR3 | | | | | | | | |
Heart rate | GNG11 | 7:93200885-93560577 | rs180242 | upstream | 7.00E-12 | 93541824 | 716 | Deletion | 0.68 | 0.85
Breast cancer | ANKRD16, FBXO18 | 10:5787186-5904095; 10:5883580-5951869 | rs2380205 | Upstream | 5.00E-07 | 5889589 | 3,218 | Deletion | 0.44 | 0.85
Bitter taste perception | Tas2R31, Tas2R43 | 12:10957559-11344212; 12:11070005-11344172 | rs1031391 | intron | 2.00E-19 | 11222191 | 27,480 | CNV* | 0.49 | 0.95
Bipolar disorder | ZNF490 | 19:12616534-12711789 | rs7247513 | 3_prime_UTR_variant | 2.00E-06 | 12694867 | 4,057 | Deletion | 0.69 | 0.98
Blood metabolite levels | SULT2A1 | 19:48317701-48384769 | rs296396 | downstream | 1.00E-92 | 48372722 | 1,539 | Deletion | 0.15 | 0.86
Breast cancer | APOBEC3A | 22:39328746-39379188 | rs12628403 | intron | 4.00E-06 | 39357694 | 30,880 | Deletion | 0.07 | 0.89

*CNV = deletion or duplication


Novel CNV-tagging SNPs

Out of the 41 CNVs tagged in GWAS catalog regions, 31 are novel (Tables 3.3, 3.4). The novelty lies in the relationship between the CNVs and the SNPs, as the associations between the SNPs and the diseases were already established in the GWAS catalog. The length of the 31 tagged CNVs ranges from 275bp to ~6kb, with most of them being short. On chromosomes 2, 6, 7, 10 and 22, more than one SNP tags either one or multiple CNVs (Tables 3.3, 3.4). Specifically, on chromosome 2, rs6725887 and rs7582720, located in the WDR12 locus and associated with coronary artery disease, early onset myocardial infarction and ischemic stroke, tag a 5kb deletion. The MHC, HLA-DQA1 and HLA-DQB1 loci on chromosome 6, encompassing rs9273373, are linked to type 1 diabetes, asthma and hay fever. This SNP tags a very short deletion of 320bp. The same deletion is also tagged by rs1063355, which is located in the HLA-DQA1, HLA-DQB1-AS1 and HLA-DQB1 loci and is associated with ulcerative colitis. In addition, rs12531540 and rs849142 tag another short deletion of 378bp and are located in the JAZF1 locus on chromosome 7. These SNPs are associated with systemic lupus erythematosus. In the same genomic locus, the SNP rs849135 is associated with type 2 diabetes and tags the same deletion. Results for CNV-tagging SNPs identified in JAZF1 are also discussed and compared in Section 3.1.2, using DIAGRAM consortium data on type 2 diabetes. On chromosome 10, five SNPs tag the same 442bp deletion. The SNPs are located in the ARMS2, HTRA1, LOC105378525, BTBD16 and PLEKHA1 loci and are associated with different types of age-related macular degeneration (exudative, extreme sampling, advanced, CNV-related and CNV vs. geographic atrophy). The last chromosome on which a deletion was tagged by multiple SNPs was chromosome 22. Specifically, three different SNPs tag a 979bp deletion: the first, rs1053593, is located in the HMGXB4 locus and is associated with waist-to-hip ratio adjusted for BMI; the other two, rs138740 and rs138777, lie in the TOM1 locus and are associated with multiple myeloma and monoclonal gammopathy, and with total cholesterol, respectively.

An additional observation was that the frequency of the deletions present in European populations ranges between 10% and 76%. The genes in which CNV-tagging SNPs were identified also include drug-target genes. Specifically, these genes are HTR3D-HTR3C-HTR3E, PLEKHA1, MGST1 and SULT2A1, and they are associated with major depressive disorder, age-related macular degeneration, visceral fat and blood metabolite levels respectively. With the exception of SULT2A1, the associations of these drug-target genes with CNVs are novel. The fact that CNV-tagging SNPs were found only in the above drug targets does not mean that no other drug targets have tagged CNVs. GWAS are continuously conducted, and as novel associations are reported, more tagging SNPs will likely be uncovered in drug-target genes.


Table 3.3: CNV-tagging SNPs (CNVs < 1kb) and disease/trait associations Phenotype/Disease/Trait Loci Genomic region SNP (rsID) Context P-value CNV start pos CNV CNV EUR LD (bp) length type AF (bp) Carotid plaque burden MXD1 2:70100692- rs10205487 intron 2.00E-06 70125033 528 Deletion 0.44 0.99 (smoking interaction) 70152707 2:70104820- 70190077 Major depressive HTR3D 3:183832826- rs1969253 intron 5.00E-06 183876797 588 Deletion 0.51 0.96 disorder HTR3C 184422546 HTR3E 3:183853176- EIF2B5 183911398 ECE2 3:183872477- DVL3 183921879 AP2M1 ABCF3 VWA5B2 MIR1224 ALG3 CAMK2N2

Schizophrenia SLCO6A1 5:101687486- rs6878284 intron 9.00E-09 101693791 928 Deletion 0.64 0.97 NR 101854720 Crohn's disease HLA-B 6:32526546- rs114985235 upstream 9.00E-23 32571334 275 Deletion 0.67 0.83 HLA-C 32577625 _ALU LOC105375015

Systemic lupus LOC107986589 rs9271100 intron 1.00E-12 0.82 erythematosus HLA

Asthma HLA 6:32575956- rs9272346 intron 2.00E-08 32624896 967 Deletion 0.59 0.88 DQA1 32634839


DRB1 6:32607244- LOC107986589 32656160

Type 1 Diabetes MHC rs9273373 upstream 4.00E-14 32614932 320 Deletion 0.54 0.82 Asthma and hay fever HLA-DQA1 - 32623533 0.97 HLA-DQB1 32624896 0.97

HLA-DQB1 Ulcerative colitis HLA-DRA HLA- rs1063355 3_prime_U 2.00E-06 32614932 0.82 DQA1 TR_variant 32623533 0.97 HLA-DQB1-AS1 32624896 0.97

Systemic lupus JAZF1 7:27850192- rs12531540 intron 3.00E-08 28214282 378 Deletion 0.51 0.91 erythematosus 28240362 rs849142 intron 9.00E-11 0.87

Type 2 diabetes rs849135 intron 2.00E-09 0.80

Heart rate LOC105375402 7:93531011- rs180242 upstream 7.00E-12 93541824 716 Deletion 0.68 0.85 93577922 Age-related macular ARMS2 10:124194169- rs10490924 missense NA 124216824 442 Deletion 0.20 0.95 degeneration HTRA1 124236868 rs3793917 upstream 4.00E-60 LOC105378525 10:124194169- Exudative age-related 124236868 macular degeneration 10:124201041- 124294424 Age-related macular degeneration (CNV)


Age-related macular ARMS2 degeneration (CNV vs. BTBD16 GA) PLEKHA1 HTRA1 0.95 Age-related macular degeneration (extreme LOC105378525 sampling) rs3750848 intron 3.00E-29 ARMS2 HTRA1 LOC105378525

Advanced age-related HTRA1 intron 7E-735 0.98 macular degeneration LOC105378525 rs3750846

Age-related macular 0.98 degeneration (wet) rs11200638 upstream 8.00E-12

Perioperative myocardial SMAD9 13:37373361- rs609418 downstream 6.00E-06 37414451 328 Deletion 0.74 0.99 infarction in coronary RFXAP 37423241 artery bypass surgery 13:37398968- 37514902 Pulmonary function CFDP1 16:75307596- rs2865531 intron 2.00E-11 75429831 842 Deletion 0.57 0.95 Pulmonary function 75487383 (interaction)

Lipoprotein-associated TMEM49 17:57764553- rs11650106 intron 3.00E-09 57870250 930 Deletion 0.57 0.96 phospholipase A2 activity VMP1 57939616 and mass

Breast cancer CHST9 18:24475595- rs1436904 intron 3.00E-08 24571615 723 Deletion 0.41 0.96 24785281


Waist-to-hip ratio HMGXB4 22:35633445- rs1053593 missense 5.00E-06 35645163 979 Deletion 0.68 0.97 adjusted for body mass 35711800 (East index Asian women) 8.00E-06 (women)

TOM1 rs138740 intron 0.88 6.00E-08 Multiple myeloma

Multiple myeloma and TOM1 rs138740 intron 0.88 3.00E-07 monoclonal gammopathy

Cholesterol, total TOM1 rs138777 intron 0.85 5.00E-08

Binge eating behaviour in PRR5 22:45078113- rs6006893 intron 9.00E-07 45252401 689 Deletion 0.12 0.87 bipolar disorder ARHGAP8 45278586 22:45078355- 45278665


Table 3.3 above shows the CNV-tagging SNPs along with their associations with various diseases, focusing on CNVs with a length of < 1kb. The fact that most of the novel CNVs found to be correlated with SNPs are quite small is indicative of the power of NGS to detect such variants. Earlier in this section we described multiple SNPs tagging more than one CNV, of any length. This is particularly important since, until recently, studies have reported one-to-one associations of CNV-tagging SNPs. The detection of more than one SNP involved with a CNV can potentially lead to novel discoveries that are disease- and drug-oriented. This can be achieved through the imputation of CNVs using SNPs, as described earlier in this section.

It would be beneficial for small variants such as the ones we identified in Table 3.3 to be incorporated into genetic association studies, to help elucidate disease susceptibility. For example, our results revealed that rs1969253 tags a 588bp deletion on chromosome 3. The SNP, which is linked to various loci (HTR3D, HTR3C, HTR3E, EIF2B5, ECE2, DVL3, AP2M1, ABCF3, VWA5B2, MIR1224, ALG3 and CAMK2N2), was previously found to be associated with major depressive disorder (MDD). It is already known that there is an overlap between depression and anxiety disorder [123, 124]. Using our results on MDD, further studies could look for CNV-tagging SNPs linked to anxiety disorder and test whether these are located within the aforementioned loci associated with MDD. Such a finding could create a network between diseases, genes and variants. Moreover, using our method for finding CNV-tagging SNPs, future research can rely on publicly available variation databases to discover more potential causative CNVs. Some loci presented in our results, such as JAZF1, CHST9 and TOM1, have previously been reported to harbour CNVs, independently of the presence of CNV-tagging SNPs [125-127].

In addition to the above and as observed in our results, even these small variants are common in European populations. Further investigation would be recommended based on geographical clustering of the identified CNVs in each European population.


Table 3.4: CNV-tagging SNPs (CNVs 1kb – 10kb) and disease/trait associations

Phenotype/Disease/Trait | Loci | Genomic region | SNP (rsID) | Context | P-value | CNV start pos (bp) | CNV length (bp) | CNV type | EUR AF | LD
Schizophrenia | NR, AKT3 | 1:243631535-244034381 | rs13376709 | Intron | 6.00E-08 | 243782747 | 1,012 | Deletion | 0.67 | 0.81
Coronary artery disease; Myocardial infarction (early onset) | WDR12 | 2:203719505-203899521 | rs6725887 | Intron | 2.00E-08 | 203899045 | 5,239 | Deletion | 0.13 | 0.89
Coronary artery disease or ischemic stroke | | | rs7582720 | intron | 4.00E-09 | | | | |
White matter hyperintensity burden | NBEAL1 | 2:203859602-204111101 | rs72934505 | intron | 6.00E-08 | 203899045 | 5,239 | Deletion | 0.13 | 0.99
Amyotrophic lateral sclerosis | CPNE4 | 3:131232399-132024254 | rs1320900 | intron | 7.00E-06 | 131708252 | 5,132 | Deletion | 0.33 | 0.88
Aging traits | KCNAB1 | 3:155735490-156276545 | rs3772255 | intron | 8.00E-06 | 156092162 | 1,526 | Deletion | 0.18 | 0.87
Systemic lupus erythematosus | RCC2P8, COL25A1 | 4:109711877-110243813 | rs4956211 | intergenic | 1.00E-06 | 109721079 | 1,163 | Deletion | 0.38 | 0.87
Hepatitis B; Hepatitis B (viral clearance); Chronic hepatitis B infection | HLA-DPA1 | 6:33012346-33068552; 6:33023703-33074978 | rs3077 | 3_prime_UTR_variant | 2.00E-61 | 33026721 | 1,884 | Deletion | 0.19 | 0.98
Liver enzyme levels (gamma-glutamyl transferase) | MLIP | 6:53774780-54151078 | rs9296736 | Intron | 3.00E-09 | 53928702 | 6,118 | Deletion | 0.67 | 0.99
Bipolar disorder | NR, ADAMTS14 | 10:72412559-72542197 | rs17600642 | Intron | 5.00E-06 | 72448735 | 2,608 | Deletion | 0.25 | 0.84
Visceral fat | MGST1, LOC101928362, SLC15A5 | 12:16321419-16450619 | rs10772915 | intron | 5.00E-06 | 16420132 | 1,156 | Deletion | 0.40 | 0.95
Bipolar disorder or major depressive disorder (combined) | DLEU1 | 13:50636307-51317372 | rs1262778 | intron | 8.00E-06 | 51069347 | 5,735 | Deletion | 0.76 | 0.99
Metabolite levels (HVA/MHPG ratio) | MYO9A, SENP8, NR2E3, GRAMD2 | 15:72094632-72430918 | rs12050794 | intron | 2.00E-06 | 72385965 | 2,193 | Deletion | 0.74 | 0.94
Schizophrenia | CNOT1, SLC38A7 | 16:58533855-58683790 | rs12325245 | intergenic | 2.00E-08 | 58673838 | 3,508 | Deletion | 0.16 | 0.91
Milk allergy | FAM117A | 17:47767694-47886542 | rs9898058 | intron | 1.00E-06 | 47826809 | 1,069 | Deletion | 0.15 | 0.99
Lipoprotein-associated phospholipase A2 activity and mass | TMEM49, VMP1 | 17:57764553-57939616 | rs11650106 | intron | 3.00E-09 | 57870031 | 1,456 | Deletion | 0.57 | 0.96
Periodontitis (Mean PAL) | PSMA8 | 18:23693816-23793319 | rs11877878 | intron | 5.00E-06 | 23747780 | 3,540 | Deletion | 0.098 | 0.82
Bipolar disorder | ZNF490 | 19:12668775-12770912 | rs7247513 | 3_prime_UTR_variant | 2.00E-06 | 12694867 | 4,057 | Deletion | 0.69 | 0.98
Blood metabolite levels | TPRX2P - LOC102725176 | 19:48353723-48409654 | rs296396 | downstream | 1.00E-92 | 48372722 | 1,539 | Deletion | 0.15 | 0.86


Table 3.5: CNV-tagging SNPs (CNVs > 10kb) and disease/trait associations

Phenotype/Disease/Trait | Loci | Genomic region | SNP (rsID) | Context | P-value | CNV start pos (bp) | CNV length (bp) | CNV type | EUR AF | LD
Bitter taste perception | PRH1-PRR4, PRH1-TAS2R14, PRH1 | 12:10957559-11344212; 12:11070005-11344172 | rs1031391 | intron | 2.00E-19 | 11222191 | 27,480 | CNV* | 0.49 | 0.95

* CNV = deletion or duplication


The size of each SNP-tagged CNV varied. As stated previously, the majority of CNVs were < 1kb. However, there were also CNVs with a length between 1kb and 10kb (Table 3.4), most of them < 6kb. Additionally, we found a distinct CNV of ~28kb, the largest CNV of our GWAS catalog analysis (Table 3.5). Future GWAS could focus on CNV associations with disease and replicate the CNV-tagging SNP associations. Most importantly, such studies can impute the CNVs based on the tagging SNPs to discover novel disease pathways. A genetic association study can also combine CNVs with other factors, such as environmental, lifestyle, social status and mental health. In that way, it can determine whether a variant is causal itself or whether it becomes causal, and even deleterious, when combined with other characteristics, genetic or not. Additionally, based on our results, loci associated with a specific disease were also linked to other diseases. For example, NR is involved with both schizophrenia and bipolar disorder. The two CNVs encompassing parts of the gene might not have the same effect on each disease, and further investigation is needed.

“Correlation is not causation” [128]. This is undoubtedly true since an association itself is not a sufficient proof of function. The investigation of molecular mechanisms could unveil the true functional effects of a CNV.

Drug-target gene analysis

Based on the above findings, we performed an additional analysis of drug-target genes to observe potential differences in LD patterns within and between populations. We used the DrugBank list (https://www.drugbank.ca/) available at the time (accessed June 2016) to obtain the drug targets, and repeated the CNV and SNP extraction as well as the CNV-SNP LD calculations. To obtain the final set of drug-target genes included in the LD analysis, we pre-processed the downloaded genes. We first mapped the UniProt protein IDs to Ensembl Gene IDs and then to the corresponding gene coordinates. This list contained 2,279 drug-target genes in total. Many of the UniProt IDs corresponded to immune genes, which in turn often corresponded to multiple alternate reference sequences. To deal with this, we kept only the mapping to the main reference genome (GRCh37) and discarded the rest. Moreover, there were 353 UniProt IDs that did not have gene mappings and were marked as empty in the converted file. After a manual assessment, we concluded that they likely lacked gene mappings because they corresponded to genes of other organisms, for example HIV and E. coli. The corresponding drugs were likely designed to target the genes of the pathogens directly and thus were of no interest to our analysis. Finally, 4 of the UniProt IDs had mappings to multiple different locations on the human genome. These, together with the UniProt IDs without a gene mapping, were excluded from our list.

After the above exclusion criteria, the final number of gene regions we ended up with was 1,413. We also excluded one gene region of chromosome Y which had information on fewer samples than the rest of the genes. Of the 1,413 gene regions, 920 (65.1%) contained CNVs. The number of regions with identified CNVs varied among populations. The total number of CNVs detected in drug-target genes was 2,494. Among these CNVs, 1,180 (47%) were in high LD (Appendix, Box 1) with SNPs. Examples of such strong correlations were a ~ 4kb rare deletion on chromosome 10 (AF = 0.002) correlated to rs572009688 (LD = 1) and a rare duplication of ~36kb on chromosome 2 linked to a SNP with an undefined rsID with a genomic position chr2:98273487 (LD = 1). There were similar patterns within as well as between the populations of the 1000G (Appendix - Figure 3). Even though there seemed to be enough common CNVs present, the majority of the identified CNVs were of low-frequency or rare (Appendix, Box 1). This implied that multiple SNPs were of low frequency or rare as well (Appendix – Figure 4). This property of rare SNPs tagging rare CNVs might contribute to our capacity to identify associations with low frequency and rare SNPs in GWAS, as it was recently highlighted for type 2 diabetes [129].
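As an illustration of the CNV-SNP LD calculation, the following minimal sketch computes r² as the squared Pearson correlation between per-sample allele dosages. The dosage coding, the toy genotypes and the 0.8 cut-off used here are assumptions made for the example; the LD definition actually applied in this thesis is the one given in Appendix, Box 1.

```python
def r_squared(cnv_dosage, snp_dosage):
    """Squared Pearson correlation between two per-sample dosage vectors."""
    n = len(cnv_dosage)
    if n == 0 or n != len(snp_dosage):
        raise ValueError("dosage vectors must be non-empty and equally long")
    mean_c = sum(cnv_dosage) / n
    mean_s = sum(snp_dosage) / n
    cov = sum((c - mean_c) * (s - mean_s) for c, s in zip(cnv_dosage, snp_dosage))
    var_c = sum((c - mean_c) ** 2 for c in cnv_dosage)
    var_s = sum((s - mean_s) ** 2 for s in snp_dosage)
    if var_c == 0 or var_s == 0:
        return float("nan")  # a monomorphic variant has no defined LD
    return cov * cov / (var_c * var_s)


# Hypothetical toy data for eight samples:
# number of deleted alleles (0, 1 or 2) and number of SNP alternate alleles.
cnv = [0, 1, 2, 0, 1, 0, 2, 1]
snp = [0, 1, 2, 0, 1, 0, 2, 0]

ld = r_squared(cnv, snp)
is_tagging = ld >= 0.8  # assumed "high LD" cut-off for this example
print(round(ld, 3), is_tagging)
```

With real data, the two vectors would be extracted for the same samples from the phased 1000G genotypes of the CNV and of each nearby SNP.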


3.1.2. DIAbetes Genetics Replication And Meta-analysis (DIAGRAM)

The DIAGRAM consortium (http://diagram-consortium.org/about.html) comprises a large number of research groups interested in implementing large-scale studies focused on type 2 diabetes and the genetics behind the disease. The main focus of this consortium lies in European populations. DIAGRAM’s initial phase defined as “DIAGRAM v1”, included various T2D GWAS from the UK (WTCCC), DGI and FUSION groups. This was followed by “DIAGRAM v2” after a meta-analysis which added GWAS from five more studies – DGDG, KORA, Rotterdam, DeCODE and EUROSPAN. The latest version of DIAGRAM is “DIAGRAM v3” which is currently the largest GWA dataset for European samples, with 12,171 cases and 56,862 controls. Based on the latter dataset, the group performed analyses to select SNPs for the integration of DIAGRAM v3 with Metabochip data. Stage 1 analysis included 26,676 T2D cases in total and 132,532 controls collected from 18 GWAS. Furthermore, Stage 2 follow-up analysis included 16 studies (D2D2007, DANISH, DIAGEN, DILGOM, DRsEXTRA, EMIL-Ulm, FUSION2, NHR, IMPROVE, InterACT-CMC, Leipzig, METSIM, HUNT/TROMSO, SCARFSHEEP, STR, Warren2/58BC) combined with Metabochip data but there was no overlap between the individuals selected for Stage 2 and Stage 1 analyses [129]. The scope of DIAGRAM also includes the support of projects which aim to discover T2D variants in other populations of different ancestry.

A very recent study performed a European T2D genome-wide association analysis based on the data from “DIAGRAM v3” [129]. Using the credible sets of SNPs associated with T2D from this study, we conducted an analysis similar to the one for the GWAS catalog in Section 3.1.1. A credible set was defined as the set of SNP variants with a high probability of being causal for each disease-associated signal [129, 130]. Specifically, we looked for CNV-tagging SNPs which could be potentially correlated with the T2D SNPs. The analysis revealed four deletions tagged by different SNPs on chromosomes 7, 9, 12 and 14 encompassing JAZF1, GLIS3, PTPRD, TSPAN8 and NRXN3 respectively (Table 3.5).


Table 3.6: CNVs associated with Type-2 diabetes

| CNV type | Chr. | Start (bp) | Length (bp) | rsID | LD (r²) | Locus (CNV) |
|---|---|---|---|---|---|---|
| Deletion | 7 | 28,214,282 | 378 | rs1635852 | 0.95 | JAZF1 |
| Deletion | 9 | 3,934,397 | 1018 | rs10758593 | 1.00 | GLIS3 |
| Deletions | 9 | 8,640,832 | 844 | rs186838848 | 1.00 | PTPRD |
| | | 9,389,096 | 1645 | | 1.00 | |
| Deletion | 12 | 71,602,363 | 2117 | rs6581998 | 0.996 | TSPAN8/LGR5 |
| Deletion | 14 | 80,070,794 | 1318 | rs10146997 | 1.00 | NRXN3 |

Regarding the AFs in each of the greater populations, we observed that deletions are very common in JAZF1 and TSPAN8. These CNVs have an AF of ~50% in each European population of the 1000G (Table 3.7). In addition, deletions in TSPAN8 have a frequency of ~21% in the South Asian populations and 34% in the American populations. In JAZF1, the AFs reach ~45% in South Asians and ~37% in Americans. The highest frequency of CNVs in JAZF1 was found in Africans, where the AF reached ~71% (Table 3.6).

We decided to focus on European populations, where deletions in TSPAN8 and JAZF1 consistently reached an AF of 50%. In contrast, deletions in GLIS3, PTPRD and NRXN3 were absent in European populations according to the 1000G. Based on this information, we restricted our investigation to the JAZF1 and TSPAN8 deletions. To date, studies of human variation have also mainly focused on European populations [131-133]. Our AF results, however, indicate that future research should also cover the remaining populations, since the variants identified there can bring new insights to population genetics.


Table 3.7: Allele frequencies for T2D-associated CNVs in 1000G continental groups

| CNV genomic region | Locus | EUR (%) | EAS (%) | SAS (%) | AFR (%) | AMR (%) |
|---|---|---|---|---|---|---|
| 7:28,214,282-28,214,660 | JAZF1 | 51.00 | 1.50 | 21.30 | 7.00 | 34.00 |
| 9:3,934,397-3,935,415 | GLIS3 | 0 | 0 | 0 | 0.08 | 0 |
| 9:8,294,246-10,632,723 | PTPRD | 0.003 | 0.02 | 0 | 0.29 | 0.03 |
| | | 0 | 0 | 0 | 0 | 0.0014 |
| 12:71,602,363-71,604,480 | TSPAN8/LGR5 | 49.70 | 7.94 | 44.58 | 70.88 | 36.89 |
| 14:80,070,794-80,072,112 | NRXN3 | 0 | 0.60 | 0.20 | 0 | 0 |

Table 3.8: Allele frequencies for T2D-associated CNVs in the European populations

| CNV genomic region | Locus | GBR (%) | FIN (%) | TSI (%) | IBS (%) | CEU (%) |
|---|---|---|---|---|---|---|
| 7:28,214,282-28,214,660 | JAZF1 | 52.00 | 46.00 | 53.00 | 51.00 | 49.00 |
| 9:3,934,397-3,935,415 | GLIS3 | 0 | 0 | 0 | 0 | 0 |
| 9:8,294,246-10,632,723 | PTPRD | 0 | 0 | 0.0047 | 0 | 0.010 |
| | | 0 | 0 | 0 | 0 | 0 |
| 12:71,602,363-71,604,480 | TSPAN8/LGR5 | 45.60 | 51.00 | 48.13 | 52.34 | 51.00 |
| 14:80,070,794-80,072,112 | NRXN3 | 0 | 0 | 0 | 0 | 0 |

GBR = British in England and Scotland; FIN = Finnish in Finland; TSI = Toscani in Italy; IBS = Iberians in Spain; CEU = Utah residents with Northern and Western European ancestry


CNV-tagging SNP in TSPAN8

The meta-analysis of GWA data performed in 2008, which led to the publication of “DIAGRAM v2”, identified several susceptibility loci for T2D [134]. Among those loci was the SNP rs7961581 (p = 1.1 × 10⁻⁹), located within the TSPAN8 gene. In our credible set of SNPs, the strongest statistical signal associated with TSPAN8 in the Stage 1 analysis was rs6581998 (p = 1.23 × 10⁻⁵). The two SNPs are highly correlated (LD = 0.97). The Wellcome Trust Case Control Consortium conducted a GWAS of CNV in 2010, studying eight common diseases [135]. Their analysis revealed that a CNV encompassing TSPAN8, either partially or wholly, was associated with T2D (CNVR5583.1, p = 3.9 × 10⁻⁵). This CNV is in low LD (r² = 0.17) with the SNP identified by the DIAGRAM v2 meta-analysis. Therefore, the assumption was that the detected CNV was only weakly correlated with the SNP, which was considered the true causal variant for T2D. Our LD analysis of CNVs with SNPs showed that rs6581998 was highly correlated with a deletion of ~2.1kb at chr12:71,602,363-71,604,480 (LD = 0.996). This observation may point to population genetic drift, and it may lead us to a potential functional culprit of T2D.

CNV-tagging SNP in JAZF1

JAZF1 (juxtaposed with another zinc finger gene 1) is another gene associated with T2D. Other diseases or traits linked to the gene include prostate cancer, Crohn’s disease, systemic lupus erythematosus, systemic sclerosis, rheumatoid arthritis, diisocyanate-induced asthma and height [125, 136, 137].

The same GWAS meta-analysis published in 2008 reported that the SNP rs864745, located in intron 1 of the JAZF1 gene, showed the strongest statistical evidence for association with T2D [134]. In our analysis, the lead SNP rs1635852, which is also located in JAZF1 and is strongly correlated with the aforementioned SNP (LD = 0.95), was also found to be highly associated with T2D (p = 3 × 10⁻¹⁴).

A study published in the International Journal of Oncology performed a whole exome sequencing analysis that showed evidence of copy number changes in JAZF1 [125]. The analysis revealed that the gene appeared to be highly amplified in a metastatic colon cancer that had spread to the lung. The study discussed the potential functional effect that JAZF1 may have as an oncogene which plays a crucial role in the metastasis of colon cancer to the lung.


Even though there are indications of CNV presence in JAZF1, there seem to be no studies reporting associations between JAZF1 CNV and T2D. Our analysis showed evidence of a short deletion of just 378bp which is in high LD with rs1635852 (LD = 0.95). The same deletion was reported in Section 3.1.1, Table 3.2, with CNV-tagging SNPs identified using GWAS catalog associations. The GWAS catalog SNP correlated with the deletion is different from the aforementioned CNV-tagging SNP from the DIAGRAM data, and the two SNPs are in moderate LD (LD = 0.74). The short length of the deletion potentially explains why the variant has not been identified before. A laboratory analysis to confirm the deletion in European samples would be vital for population genetics.

3.1.3. Northern Finland Birth Cohort (NFBC)

The Northern Finland Birth Cohorts are represented by two independent studies, NFBC1966 and NFBC1986. These studies have collected data at 20-year intervals from Northern Finland and represent two prospective longitudinal birth cohorts of women expecting their offspring in the respective birth year (http://www.oulu.fi/nfbc/). The cohorts were initially designed in the 1960s. The collection of NFBC1966 data focussed particularly on the provinces of Oulu and Lapland and on mothers with an expected date of birth in 1966 (Figure 3.2) [138]. This resulted in a set of 12,055 mothers whose date of delivery was expected to fall in 1966. Even though the total number of children included in the initial set was 12,231, the number was reduced to 12,058 live-born children coming from 12,068 deliveries (13 women had twins) [139]. The cohort covered almost all births (96%) with a birth weight of ≥600 grams in the study area. The data were enriched with postal questionnaires returned at the ages of 1, 14 and 31 years, together with hospital medical records and national register data. The NFBC1986 data came from an unselected population and comprised 9,362 mothers and 9,479 children. The expected delivery date was between 1st of July 1985 and 30th of June 1986, but a small proportion of births at the end of June 1985 or the beginning of July 1986 were also considered. The total number of births was 9,362, covering 99% of all the births that occurred during the study period of the cohort. Out of the 9,479 children born, 9,432 were live-born. Similar to NFBC1966, postal questionnaires were sent at the ages of 7, 8 and 15-16 years to add to the original data. Hospital records and register data were also provided to enrich the collection.


Figure 3.2: Geographic location of the NFBC1966. Dark grey represents the targeted area of the cohort

Using 8,004 Finnish and 1,942 Dutch individuals from both NFBC1966 and NFBC1986, members of our research team undertook a study focusing on the variability in fatty acids (FAs) derived from the metabolomics platform data. FA concentrations were quantified by a high-throughput serum nuclear magnetic resonance (NMR) metabolomics platform [140]. They particularly investigated omega-3,-6,-7/-9 FAs and other polyunsaturated FAs and discovered both common and rare variants through a multi-phenotype genome-wide meta-analysis. This methodology allowed the genetic architecture of more than one fatty acid to be analysed simultaneously, exploiting the correlation between the traits to increase power for locus discovery. The study identified eight genetic loci associated with FAs, replicated previously established loci for blood lipids and identified two novel loci, LPXN and RPS6KA4, with Finnish-specific effects [manuscript in preparation]. To further characterize the associated loci, they calculated posterior probabilities for each variant within 500kb of the detected association signals to highlight the variants most likely to have a causal role in the identified associations [130].


FAs are present in human blood, cells and tissues as well as in the normal diet where they serve as nutrients [141]. They act as sources of energy and in complex systems and can modulate gene transcription [142]. We pursued three separate approaches aiming to identify CNV-tagging SNPs in (i) the credible set of ten common variant loci (Appendix-Table 2) (ii) the credible set of rare variant loci located at the top transcript in FADS region (this set was split into two lists of Finnish and Dutch populations) (Appendix-Table 3) (Figure 3.3) (iii) all the SNPs from all the transcripts reaching genome-wide significance (Appendix - Table 4).

In each of the above three cases, we used the positions of the SNPs to find potential overlaps with the SNP positions of our CNV-tagging SNPs list. The analysis was based on the criterion of a high LD (Appendix – Box 1). That is, our analysis looked for NFBC SNPs that did not only overlap positions from our CNV-tagging SNPs, but also demonstrated high LD with their respective CNVs. At this stage, we did not perform LD pruning on the NFBC SNP dataset in order to retain as much information as possible. In downstream analyses, however, when common tagging SNPs are identified, we evaluate pairwise LD between SNPs.

The results of all three analyses showed that, at an LD threshold of ≥ 0.8, neither the credible sets of common and rare variants nor the full list of genome-wide significant SNPs tagged any CNVs. Furthermore, lowering the LD threshold to ≥ 0.5 yielded no additional information.
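The lookup described above can be sketched as follows. The table layout, the function name `tagged_cnvs` and all coordinates, CNV identifiers and LD values are hypothetical and serve only to show the two conditions applied: a position match and an LD value at or above the chosen threshold.

```python
def tagged_cnvs(study_snps, tagging_table, min_ld=0.8):
    """Return (chrom, pos, cnv_id, ld) for study SNPs that tag a CNV at LD >= min_ld.

    study_snps    : iterable of (chrom, pos) tuples from the study of interest
    tagging_table : iterable of ((chrom, pos), cnv_id, ld) rows of CNV-SNP pairs
    """
    study = set(study_snps)
    hits = []
    for (chrom, pos), cnv_id, ld in tagging_table:
        if (chrom, pos) in study and ld >= min_ld:
            hits.append((chrom, pos, cnv_id, ld))
    return hits


# Hypothetical study SNPs and CNV-SNP pairs (not taken from the thesis data).
study = [("11", 1000), ("11", 2000), ("15", 3000)]
table = [
    (("11", 1000), "DEL_example_1", 0.30),  # position matches, LD too low
    (("11", 5000), "DEL_example_2", 0.95),  # high LD, but not a study SNP
    (("2", 3000), "DUP_example_3", 0.90),   # same position, other chromosome
]

print(tagged_cnvs(study, table, min_ld=0.8))   # no hit at the strict threshold
print(tagged_cnvs(study, table, min_ld=0.5))   # still no hit
print(tagged_cnvs(study, table, min_ld=0.25))  # the weakly correlated pair appears
```

Matching on the (chromosome, position) pair rather than on rsID is a design choice of this sketch; it avoids missing SNPs without an assigned rsID, provided both lists use the same reference build.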

All-in-all, after assessing each FAs dataset, we concluded that there are no CNV-tagging SNPs in the identified FAs loci. Examples of such FA SNPs we examined for each dataset, were the SNP at the position 15:58683366 in LIPC gene (common variants list), the SNPs at chr11:6146209 and chr11:61295920 for Dutch and Finnish respectively (rare variants list), and the SNPs at chr11:61356813 and chr11:61356815 (genome-wide significance list) (Appendix Tables 2,3,4).

Our list of CNV-tagging SNPs was generated from the 1000G variants which includes the genomes of 2,504 individuals. If such databases add more human genomes in the future, there is always a possibility of finding CNV-tagging SNPs in the aforementioned loci.


Figure 3.3: Visualization of the FADS region using the UCSC browser, showing the full spectrum of CNVs in this specific region. Red represents deletions and blue duplications. Each line shows the CNVs for one of the 26 populations of the 1000G.


3.2. Summary of findings

This chapter described all the analyses we conducted with the goal of identifying SNPs shared between published GWAS or other disease datasets and our list of CNV-tagging SNPs. Specifically, we used SNPs from 1) GWAS catalog reported traits, 2) the DIAGRAM consortium T2D data and 3) NFBC data on fatty acids and polyunsaturated fatty acids. The genome-wide CNV-tagging SNPs we identified were based on the sequencing data available from the 1000G, on 2,504 individuals from 5 greater populations.

The first analysis revealed common deletions at 63 loci. We observed that 53 SNPs were tagging 41 independent CNVs, meaning that in some cases more than one SNP tagged one or more CNVs. We split our analyses into established and novel CNV-tagging SNPs, as some of our identified tagged CNVs had already been reported. Most of the novel CNVs were < 1kb, which potentially explains why they were not found in the past. A remarkable relationship was the deletion of 378bp which was in high LD with three different SNPs. These SNPs were previously found to be associated with systemic lupus erythematosus and T2D and are located in the JAZF1 locus. This was particularly interesting, as the same deletion was in LD with a different SNP from the DIAGRAM T2D data, again in the JAZF1 locus. The SNPs from the GWAS catalog and the SNP from the DIAGRAM data are in moderate LD. The T2D data revealed four additional SNPs correlated with five other CNVs located in five different loci. We focused our CNV research on the JAZF1 and TSPAN8 genes, as the CNV AFs reached 50% in all European populations. The last analysis on CNV-tagging SNPs, using NFBC data on fatty acids, was split into three parts to find SNPs shared between our list and NFBC’s common variants, rare variants and variants reaching genome-wide significance from all the transcripts of the FADS region. The analysis resulted in no variants with high LD from the two lists. One possible reason for this is the number of genomes we analysed (2,504) for CNV variability and, specifically, the number of genomes from the Finnish population (n=8,004). Additional genomes could lead to a positive result using the same data in the future.


Chapter 4

Description of the statistical framework for CNV detection

This chapter will describe the way in which we extracted information from alignment features and how we turned these features into a numerical format to prepare files for CNV calling. We will first explain the pre-processing steps we undertook to prepare NGS data for CNV detection. We will conclude with a detailed description of the statistical framework we developed which formed our CNV pipeline (popCNV) as a set of algorithms. Specifically, the framework is based on a HMM framework originally developed for cnvHap which aimed to model CNV at a haplotypic level, to improve CNV detection and genotyping accuracy [51].

The development and implementation of popCNV was done using the latest phase 1000G Project data as described in Chapters 1 and 2. Sequence alignment files in the BAM format [67] constituted the input of popCNV.

4.1. Read depth

The read depth (RD)-based approach is suitable for identifying CNV by counting the number of reads which are mapped to each region of the human genome [15]. RD-based methods serve better in estimating the absolute copy numbers since the RD is directly proportional to the latter. For the same reason, RD is always compared to the depth of coverage (DOC) to obtain an estimate of the underlying copy number as follows:

$$\text{ploidy} \times \frac{RD}{DOC} = \text{absolute copy number} \tag{4.1}$$

[143] where

$$DOC = \frac{\text{read length} \times \text{number of reads}}{\text{haploid genome length}} \tag{4.2}$$

[144]

Changes in the overall RD in a genomic region indicate alterations in genomic copy numbers. Therefore, a duplicated or deleted region is expected to have a higher or lower RD respectively [145-148].
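Equations (4.1) and (4.2) can be worked through numerically as in the short sketch below. The read length, read count, genome length and regional read depths are hypothetical round numbers chosen for the arithmetic, not values from the 1000G data.

```python
def depth_of_coverage(read_length, number_of_reads, haploid_genome_length):
    """Equation (4.2): average depth of coverage across the genome."""
    return read_length * number_of_reads / haploid_genome_length


def absolute_copy_number(read_depth, doc, ploidy=2):
    """Equation (4.1): copy number estimated from the RD/DOC ratio."""
    return ploidy * read_depth / doc


# Hypothetical sample: 240 million reads of 100 bp on a 3 Gb haploid genome.
doc = depth_of_coverage(100, 240_000_000, 3_000_000_000)

cn_neutral = absolute_copy_number(8.0, doc)   # region covered at the average depth
cn_deleted = absolute_copy_number(4.0, doc)   # region with half the average depth
cn_gained = absolute_copy_number(12.0, doc)   # region with 1.5x the average depth
print(doc, cn_neutral, cn_deleted, cn_gained)
```

In this toy case a region at half the average depth corresponds to a heterozygous deletion (one copy) and a region at 1.5 times the average depth to a single-copy gain (three copies).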

The tools that are available for RD calculation are split into three categories:

1) Single samples
2) Case/control samples
3) Population of samples

In case there is a single sample, the possibility of “pooling” information across individuals is eliminated. Thus, based on the information available, the theoretical distribution of RD observations can be used to detect CNV. To that end, the most appropriate mathematical model must be chosen for RD. This method looks for regions with abnormal depth deviating from the distribution. The resulting copy numbers estimated in this case are absolute copy numbers [15].

4.1.1. Pre-processing

The pre-processing and calculation of the RD was one of the initial steps of our pipeline. More precisely, the procedure which RD methods usually follow for CNV identification comprises four steps:

• mapping
• normalisation
• copy number calculation
• segmentation

The pre-processing starts by filtering out unmapped reads, alignments with a low quality score and duplicate reads contained in the BAM alignment files, as these would negatively affect the results. The next step is the sorting of the filtered BAM files by coordinate and their conversion to the pileup format. This is achieved using the SAMtools software package (http://samtools.sourceforge.net/), which offers various utilities for alignment manipulation. The pileup format contains summary information of the alignment for each genomic position and shows how many reads are mapped to each location. The raw RD measurement can then be extracted from these counts in the pileup file. The second step, the normalisation of the RD, aims to correct possible biases produced primarily by variability in the guanine-cytosine (GC) content and by genomic regions of low complexity. Following the normalisation of the RD comes the estimation of the copy number, which determines whether there is a deletion or a duplication in the human genome (Figure 4.1).

Figure 4.1: Illustration of read depth created by CNV

[13]
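The normalisation step described above can be sketched as follows. The median-ratio correction shown here, in which each bin is rescaled by the ratio of the overall median depth to the median depth of bins with similar GC content, is a commonly used approach and is an assumption of this example; the text does not state which correction formula was applied. The bin depths and GC percentages are hypothetical.

```python
from collections import defaultdict
from statistics import median


def gc_normalise(read_depths, gc_percents, gc_bin_width=5):
    """Rescale each bin's read depth by overall median / median of its GC stratum.

    read_depths : raw read depth per genomic bin
    gc_percents : integer GC percentage (0-100) of each bin
    gc_bin_width: width of a GC stratum, in percentage points
    """
    if len(read_depths) != len(gc_percents):
        raise ValueError("one GC value is required per read-depth bin")
    overall = median(read_depths)
    keys = [gc // gc_bin_width for gc in gc_percents]
    strata = defaultdict(list)
    for key, rd in zip(keys, read_depths):
        strata[key].append(rd)
    stratum_median = {key: median(values) for key, values in strata.items()}
    normalised = []
    for key, rd in zip(keys, read_depths):
        m = stratum_median[key]
        # A stratum with zero median depth carries no usable signal.
        normalised.append(rd * overall / m if m > 0 else 0.0)
    return normalised


# Hypothetical bins: GC-rich bins (about 60% GC) are over-represented in raw depth.
raw = [20, 22, 18, 30, 33, 27]
gc = [40, 41, 42, 60, 61, 62]
norm = gc_normalise(raw, gc)
print([round(x, 2) for x in norm])
```

After the correction, bins that differed only because of their GC content end up on the same scale, so the remaining differences are more likely to reflect copy number.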

The last step is the merging of regions with similar copy number, to identify copy number regions that are discordant [149]. From a mathematical perspective, the RD observations obtained from NGS analyses after the first two steps (mapping and normalisation) are comparable to the log ratios from arrayCGH data. Thus, algorithms used for CNV identification from such data can be modified for use on NGS RD data. The approach presented here is an extension of the cnvHap model, which was originally developed for array-based CNV calling.

4.1.2. Statistical modelling

To date, the majority of NGS-based CNV detection methods were developed on the assumption that the data were normally distributed [150]. This is because RD can be compared to microarray fluorescent intensity due to their shared similarities. As with microarray intensity ratios, the variance of the underlying RD data is assumed to increase with increasing copy number.

Based on the above hypotheses, continuous data are expressed and modelled using a Normal distribution, assuming a symmetric variance:

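$$f(y) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{1}{2}\left(\frac{y-\mu}{\sigma}\right)^{2}\right\} \tag{4.3}$$

where $y \in (-\infty, +\infty)$, $\mu \in (-\infty, +\infty)$ and $\sigma \in (0, +\infty)$.

The mean and variance are thus represented by:
Mean: $E(Y) = \mu$
Variance: $\mathrm{Var}(Y) = \sigma^{2}$

The exponential representation of the above equation is:

$$f(y) = \exp\left\{\frac{y\mu - \frac{\mu^{2}}{2}}{\sigma^{2}} - \frac{1}{2}\left(\frac{y^{2}}{\sigma^{2}} + \ln(2\pi\sigma^{2})\right)\right\} \tag{4.4}$$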
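As a quick numerical check that the exponential representation in (4.4) is the same density as (4.3), the sketch below evaluates both forms at a few points. The function names and the test values are chosen for this example only.

```python
import math


def normal_pdf(y, mu, sigma):
    """Normal density in the standard form of equation (4.3)."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def normal_pdf_exponential(y, mu, sigma):
    """Normal density in the exponential form of equation (4.4)."""
    var = sigma ** 2
    return math.exp(
        (y * mu - mu ** 2 / 2) / var
        - 0.5 * (y ** 2 / var + math.log(2 * math.pi * var))
    )


# Hypothetical read-depth-like values: (y, mu, sigma).
points = [(0.0, 0.0, 1.0), (28.0, 30.0, 5.0), (47.5, 45.0, 6.0)]
for y, mu, sigma in points:
    a = normal_pdf(y, mu, sigma)
    b = normal_pdf_exponential(y, mu, sigma)
    print(y, mu, sigma, math.isclose(a, b, rel_tol=1e-12))
```

Expanding the square in (4.3) and moving the normalising constant inside the exponent gives (4.4) term by term, which is what the comparison confirms.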


4.1.3. RD-based methods

RD is commonly used by researchers aiming to analyse NGS data. This is potentially because the interpretation of the outputs is intuitive, and therefore many approaches for CNV detection are dedicated to this specific method. Below, we describe some of the most notable RD-based methods in use to date.

A method called CNVnator [147] uses the mean-shift technique, initially constructed for image processing, to partition the whole genome into segments to define copy number. This technique partitions the genomic sequence into non-overlapping bins of equal size. It then directs each bin to move towards the bins with the most similar RD signal. This way, the RD signal splits into neighbourhoods of local modes. Furthermore, the breakpoint segmentation is defined by neighbouring data points (bins) that move in opposite directions. A notable fact about CNVnator is that, by using the mean-shift technique, it reaches high breakpoint accuracy, but this does not hold when it comes to whole genome scaling. That is, the technique performs the segmentation locally, and therefore the high breakpoint accuracy exists for neighbouring bins and is independent of the read length or the DOC. As a result, for whole genome analysis, a different CNV identification should be performed.

Event-wise testing (EWT) [151] is an older RD method but one of the first to perform segmentation of CNV at high resolution. Based on significance testing, it looks for certain classes of small events genome-wide and at high speed. From the clusters of small events, it constructs groups of larger events. The criterion for grouping small events into the larger groups is their statistical significance. Both deletions and duplications identified in an individual by the method are later tested over multiple genomes to detect polymorphism between individuals. The main idea of EWT is the calculation of RD counts within a window of defined size. To detect an increase or decrease in the RD estimates, the method converts the read count into a z-score by subtracting the mean of all windows and dividing by the standard deviation. It then calculates the upper-tail and lower-tail probabilities using the obtained z-scores. To meet the assumption of normality and correct for potential biases when the data deviate from the Normal distribution, EWT increases the size of the RD windows.
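The window-level computation described for EWT can be sketched as follows. This covers only the conversion of read counts to z-scores and tail probabilities; the grouping of consecutive significant windows into events is not shown. The use of the population standard deviation and the toy read counts are assumptions of this example.

```python
import math
from statistics import mean, pstdev


def window_tail_probabilities(read_counts):
    """Return (z, upper_tail, lower_tail) for each window's read count.

    z is the count standardised against all windows; the tail probabilities
    are taken from the standard Normal distribution.
    """
    mu = mean(read_counts)
    sd = pstdev(read_counts)
    if sd == 0:
        raise ValueError("all windows have the same read count")
    results = []
    for count in read_counts:
        z = (count - mu) / sd
        upper = 0.5 * math.erfc(z / math.sqrt(2))   # P(Z > z): evidence of a gain
        lower = 0.5 * math.erfc(-z / math.sqrt(2))  # P(Z < z): evidence of a loss
        results.append((z, upper, lower))
    return results


# Hypothetical read counts in eight consecutive windows.
counts = [100, 98, 102, 101, 99, 160, 100, 40]
stats = window_tail_probabilities(counts)
for count, (z, upper, lower) in zip(counts, stats):
    print(count, round(z, 2), round(upper, 4), round(lower, 4))
```

A small upper-tail probability flags a window as a candidate duplication and a small lower-tail probability as a candidate deletion.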


Both CNVnator and EWT were tested and proved to be more computationally efficient for CNV detection than other methods [152]. Large CNVs were readily discovered by the algorithms with high breakpoint accuracy. However, they were both unable to identify CNVs smaller than 1kb. The inability to identify small events poses a risk of losing valuable information about causative variants.

4.2. Hidden Markov Model

Using the HMM of cnvHap as a foundation (described in Chapter 1), we extended the framework of the model to incorporate NGS techniques with the aim of creating an all-in-one probabilistic model. That is, we took the population-haplotype model of the algorithm as an inspiration to estimate the RD and then to define copy number. Following the path paved by cnvHap, our aim was also to build a population model which would use large sample sizes to identify variation in each sample independently.

By definition, the underlying state of the HMM is not observed. What is observed, is the output generated by these “hidden” states. Our motivation to use a HMM for our CNV detection was the fact that the observed sequence RD data are produced by separate hidden states which correspond to the copy numbers. The HMM comprises emission and transition probabilities. The statistical framework under which we built our analysis, first modelled each data point based on the hidden states (copy numbers) through the emission distribution of the RD. For the copy number change between two adjacent copy number states, the HMM estimates the transition probabilities.

4.2.1. Description

The HMM in popCNV is made of one hidden state per haploid copy number at each probe $m$. The hidden state of the HMM is denoted as $s_m$ and the haploid copy number as $l$, such that $l_m \in \{0, 1, 2, \dots, CN_{max}\}$ [51]. At each measured position, every copy number state has an emission distribution corresponding to the measured RD. The transition probabilities comprise a global transition matrix and a local transition rate which we will describe later.


4.2.1.1. Emission probabilities

Assume that $I_m$ is a vector of the observations of our algorithm’s HMM and that this vector captures the RD data at each probe $m$ as mentioned above. Based on this information, we can symbolize the emission data for a sample $j$ as $I_m^j = \{RD_m^j\}$. The emission probability, which is derived from an unordered list of states $s_m = \{l_1, l_2\}$, depends entirely on the total copy number:

$$P(I_m^j \mid c(s_m^j = \{l_1, l_2\}), \theta_m) = \{N(RD_m^j \mid \kappa_{mc}, \eta_{mc})\} \tag{4.5}$$

[51] where

$N$ = Normal distribution

$c(s_m)$ = total copy number of the haploid copy number state, such that $c(s_m) = l_1 + l_2$

Finally, $\theta_m$ denotes the parameters of the emission distributions at each measured position $m$.
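The emission probability in equation (4.5) can be illustrated with the sketch below. It assumes that the two emission parameters for total copy number c at a probe are the mean and the standard deviation of a Normal distribution, which the text does not state explicitly, and it uses hypothetical parameter values for a single probe.

```python
import math


def emission_probability(rd, l1, l2, kappa, eta):
    """Normal density of the read depth given the unordered state {l1, l2}.

    kappa, eta : dicts mapping total copy number c to the assumed mean and
                 standard deviation of the read depth at this probe
    """
    c = l1 + l2  # only the total copy number matters, as in equation (4.5)
    mu, sd = kappa[c], eta[c]
    return math.exp(-0.5 * ((rd - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


# Hypothetical emission parameters at one probe for total copy numbers 0 to 4.
kappa = {0: 0.5, 1: 15.0, 2: 30.0, 3: 45.0, 4: 60.0}
eta = {0: 2.0, 1: 4.0, 2: 5.0, 3: 6.0, 4: 7.0}

observed_rd = 44.0
# The haploid pairs (1, 2) and (2, 1) describe the same unordered state.
p_a = emission_probability(observed_rd, 1, 2, kappa, eta)
p_b = emission_probability(observed_rd, 2, 1, kappa, eta)

# Most likely total copy number for this observation, ignoring transitions.
best_c = max(kappa, key=lambda c: emission_probability(observed_rd, c, 0, kappa, eta))
print(p_a == p_b, best_c)
```

In the full model these emission values are combined with the transition probabilities, so the copy number finally assigned at a probe also depends on its neighbours.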

4.2.1.2. Transition probabilities

The transition probabilities between copy number states have been defined using continuous-time Markov chain theory. The transition between unordered sequences of states can then be represented by:

$$p(s_m = \{l_1, \dots, l_N\} \mid s_{(m-1)} = \{k_1, \dots, k_N\}) = \sum_{z \in Z(s_m)} \left( \prod_{n=1,\dots,N} p(s_{mn} = l_{z(n)} \mid s_{(m-1)n} = k_n) \right) \tag{4.6}$$

[51]

The calculation for the transition probability of the haploid HMM from hidden state $k$ to $l$ can be performed in the following way:

$$p(s_{mn} = l \mid s_{(m-1)n} = k) = \left\{ e^{Q r_m (d(m) - d(m-1))} \right\}_{k,l} \tag{4.7}$$

[51]


$Q$ defines the global transition rate matrix among different copy number states and $r_m$ represents the local rate of copy number transitions. $d(m)$ is used to express the coordinate in base pairs of the position $m$. The relationship above is easier to comprehend if we interpret $Q_{k,l}$ as the transition rate from state $k$ to state $l$. Then the exponential of this matrix $Q$ is the transition probability between two states.

The requirement behind the above result is that the row sums of the transition rate matrix equal zero, such that $Q_{k,k} = -\sum_{l \neq k} Q_{k,l}$. Furthermore, we consider $\pi$ as a stationary probability distribution of the states, so that $Q\pi = 0$. This hypothesis is valid if and only if the probability of the transition between two states in an absolute distance $d$ is not equal to 0 and the probability of passing by each state a finite number of times across an infinite distance is 1. We also use the notation $P_k(d_m)$ for the probability of a sample being in state $k$ at a genomic distance $d_m$ and $P(d_m)$ for the vector of these probabilities. The Kolmogorov differential equation, which is expressed by

$$P'(d) = QPr \tag{4.8}$$

and where $r$ denotes an arbitrary scaling constant, captures the evolution of $P$ over the genomic distance $d$. The solution of the above differential equation is the matrix exponential of $Q$ multiplied by the initial probability distribution, as shown below:

$$P(d) = P(0)e^{Qrd} \tag{4.9}$$

Based on the result of this differential equation, we get the transition probabilities of equation (4.7).
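Equation (4.6) can be illustrated for the diploid case (N = 2) with the sketch below. It assumes that Z(s_m) is the set of distinct orderings of the unordered state and takes the haploid transition probabilities as given; the matrix values are hypothetical and are not derived from Q here.

```python
from itertools import combinations_with_replacement, permutations


def unordered_transition(prev_state, next_state, haploid_t):
    """Transition probability between unordered lists of haploid states.

    prev_state, next_state : tuples of haploid copy numbers, e.g. (1, 1)
    haploid_t              : haploid_t[k][l] = P(haplotype moves from k to l)
    """
    total = 0.0
    # Sum over the distinct ways of assigning the next states to the haplotypes.
    for ordering in set(permutations(next_state)):
        prob = 1.0
        for k, l in zip(prev_state, ordering):
            prob *= haploid_t[k][l]
        total += prob
    return total


# Hypothetical haploid transition matrix over copy numbers 0, 1 and 2.
haploid_t = [
    [0.90, 0.10, 0.00],
    [0.01, 0.98, 0.01],
    [0.00, 0.10, 0.90],
]

stay = unordered_transition((1, 1), (1, 1), haploid_t)       # both haplotypes unchanged
het_loss = unordered_transition((1, 1), (0, 1), haploid_t)   # either haplotype loses a copy
all_next = list(combinations_with_replacement(range(3), 2))
row_sum = sum(unordered_transition((1, 1), s, haploid_t) for s in all_next)
print(stay, het_loss, row_sum)
```

Summing over distinct orderings is what makes the probabilities out of a diploid state add up to one: the heterozygous loss can arise on either haplotype, so it is counted twice, whereas a state with two identical haploid copy numbers has a single ordering.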

The prior information of the transition model was initially based on a user-defined equilibrium probability distribution $\pi$. Using this information, we hypothesize that the rate matrix $Q$ is reversible. In other words, at equilibrium the flow between any two copy number states is identical in both directions. In the case of a transition from deletions and duplications to copy neutral states, the rate of the transition will be greater compared to the reverse case, independently of the size of the deletion or the duplication. This happens because the equilibrium frequency of the first two copy numbers is smaller whereas that of the copy neutral state is larger. Using the equilibrium probability distribution $\pi$ mentioned above and following the example of [153], we created a symmetric $Q_{ij}$ matrix as follows:

$$Q_{ij} = \begin{cases} -1 \cdot \sqrt{\pi_j / \pi_i} & \text{if } i = j \\ \dfrac{\sqrt{\pi_j / \pi_i}}{\text{max copy number}} & \text{otherwise} \end{cases} \tag{4.10}$$

The default equilibrium probability distribution was defined as (0.01, 0.98, 0.01) for the corresponding copy numbers (0, 1, 2) respectively. The local scalar rate, which we previously denoted as $r_m$, was set to 1 by default. The schematic diagram shown below illustrates the haploid and diploid models generated by our algorithm [154].

Figure 4.2: Visualization of our statistical model

(a) Copy number states of the haploid HMM; each vertex represents a single copy number state (0 = red, 1 = grey, 2 = green) and each of the edges shows the transition between the states. (b) Pairs of haploid copy number states leading to the diploid model (0 = dark red, 1 = red, 2 = grey, 3 = green, 4 = dark green)
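To make the construction of the haploid transition probabilities concrete, the sketch below builds a rate matrix from the default equilibrium distribution and exponentiates it over a genomic distance. The off-diagonal entries follow the square-root form of equation (4.10); setting each diagonal entry to the negative sum of its row, so that the zero-row-sum requirement holds exactly, is an assumption of this sketch and not necessarily how popCNV sets it. The matrix exponential is a plain Taylor series with scaling and squaring, and the distances are arbitrary illustrative values.

```python
import math


def rate_matrix(pi, max_copy_number):
    """Rate matrix with off-diagonals sqrt(pi_j / pi_i) / max_copy_number."""
    n = len(pi)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = math.sqrt(pi[j] / pi[i]) / max_copy_number
        q[i][i] = -sum(q[i])  # assumption: diagonal enforces zero row sums
    return q


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_exp(a, terms=18):
    """Matrix exponential via Taylor series with scaling and squaring."""
    n = len(a)
    norm = max(sum(abs(x) for x in row) for row in a)
    squarings = max(0, math.ceil(math.log2(norm / 0.5))) if norm > 0 else 0
    scaled = [[x / 2 ** squarings for x in row] for row in a]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for t in range(1, terms + 1):
        term = [[x / t for x in row] for row in mat_mul(term, scaled)]
        result = [[r + x for r, x in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def transition_matrix(q, r_m, distance):
    """Haploid transition probabilities exp(Q * r_m * distance), as in (4.7)."""
    return mat_exp([[x * r_m * distance for x in row] for row in q])


pi = (0.01, 0.98, 0.01)  # default equilibrium distribution for copy numbers 0, 1, 2
q = rate_matrix(pi, max_copy_number=2)
p_near = transition_matrix(q, r_m=1.0, distance=0.001)  # adjacent positions
p_far = transition_matrix(q, r_m=1.0, distance=10.0)    # distant positions
print([round(x, 4) for x in p_near[1]])
print([round(x, 4) for x in p_far[0]])
```

With this construction the flow between any two states is balanced at equilibrium (pi_i Q_ij equals pi_j Q_ji), nearby positions almost always keep their copy number, and for large distances every row of the transition matrix approaches the equilibrium distribution.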


4.2.2. popCNV under a population framework

The main advantage of popCNV is that it can analyse CNV data at a population level by processing several samples at the same time. Our pipeline takes advantage of large sample sizes and accounts for sample-independent variation by using the population distribution of the data at each probe. The initial aim of this procedure is to update the emission parameters during the HMM training. More precisely, based on the emission distributions created originally, the observed population data are organized into clusters at each genomic position. These clusters are used to describe the copy number assigned to each sample. Unlike in a traditional HMM, the cluster positions are then re-constructed and thus the parameters of the emission distributions are updated [154].
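The idea of updating the emission parameters from the population at a single probe can be sketched as below. This is a simplified, hard-assignment illustration and not the training procedure of popCNV itself: each sample is assigned to the copy number whose current Normal emission distribution fits its RD best, and the cluster means and standard deviations are then re-estimated. All parameter values, the `min_sd` floor and the toy read depths are hypothetical.

```python
import math
from statistics import mean, pstdev


def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def update_emission_parameters(rd_values, means, sds, min_sd=0.5):
    """One hard-assignment update of per-copy-number emission parameters.

    rd_values : read depth of every sample at one probe
    means, sds: current emission parameters, keyed by total copy number
    Returns the clusters and the updated parameter dictionaries.
    """
    clusters = {c: [] for c in means}
    for rd in rd_values:
        best = max(means, key=lambda c: normal_pdf(rd, means[c], sds[c]))
        clusters[best].append(rd)
    new_means, new_sds = dict(means), dict(sds)
    for c, members in clusters.items():
        if len(members) >= 2:  # keep the prior parameters for sparse clusters
            new_means[c] = mean(members)
            new_sds[c] = max(pstdev(members), min_sd)
    return clusters, new_means, new_sds


# Hypothetical read depths of ten samples at one probe.
rd_at_probe = [29, 31, 33, 32, 30, 17, 16, 47, 31, 34]
means = {1: 15.0, 2: 30.0, 3: 45.0}
sds = {1: 4.0, 2: 5.0, 3: 6.0}

clusters, new_means, new_sds = update_emission_parameters(rd_at_probe, means, sds)
print({c: sorted(v) for c, v in clusters.items()})
print({c: round(m, 2) for c, m in new_means.items()})
```

Because the clusters are formed across samples, a probe whose depth is systematically shifted in every sample moves the cluster positions rather than producing spurious calls, which is the benefit of the population framework described above.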


Chapter 5

Construction of CNV map in drug-target genes using popCNV

In the past few years, WGS has been the leading technology for the complete characterisation of genetic variation. The main aim of analysing data genome-wide is the identification of the genetic basis of various diseases and continuous phenotypes as well as the differences that can be captured in diverse populations [155]. To date, many risk variants have been reported for hundreds of phenotypes and dozens of diseases, but they explain the phenotypic variability only partially because common variants usually exert only modest effects on a phenotype. WGS is now the proposed solution for generating large collections of rare SNPs and SVs contributing to disease susceptibility [156]. WGS is the technology “hidden” behind re-sequencing of both healthy controls and individuals presenting with abnormal phenotypes. Despite the advantages of high-coverage WGS data, the high cost of sequencing remains a challenge. As a result, various algorithms have been developed to consider low-coverage data for CNV identification, such as those used for our analyses.

In Chapter 4 we described the statistical framework we used to develop popCNV, which identifies CNVs in NGS data. We provided a detailed description of the RD method and of the population model we built using an HMM. To examine the variation present in the human genome, we performed a WGS analysis using sequencing data from the 1000G project. While the previous chapter focused on the construction of the pipeline and the model it incorporates, this chapter describes the implementation and the benchmarking of the popCNV pipeline.


5.1. Samples and datasets

Publicly available data from phase 3 of the 1000G project were used for developing and benchmarking popCNV. We specifically analysed 2,504 samples with ancestry from 26 different populations as indicated in Chapter 2. In addition, we used 202 drug-target genes defined in collaboration with GlaxoSmithKline (GSK) (Appendix – Table 5) to identify potential CNVs affecting drug-response variability.

5.1.1. 202 GSK defined genes

A team of GSK researchers conducted a study to identify rare variants in 202 drug-target genes [39]. At the time the study was conducted, the genes constituted ~1% of the coding genome and ~7% of the known or likely-to-become drug targets. The genes were sequenced in a sample of 14,002 subjects. In total, 864kb were targeted, 674kb of which corresponded to coding and UTR exon regions. The sequenced sample comprised populations of European, African American and South Asian ancestry. The study found that most of the variants (over 95%) were rare, with an MAF ≤ 0.5%, and the majority were observed in just one or two individuals. A notable part of the team’s research was the geographical clustering of variants. Common variants were mostly shared between European populations, whereas rare variants showed less sharing within these populations. Sharing generally decreased with lower AFs. The only population that showed a lower level of variant sharing with the rest of the Europeans was the Finnish. The study therefore pointed out the need to create population-specific rare variant catalogs. Furthermore, it highlighted the importance of future research to identify common variants in the 202 drug-target genes so that potential associations between risk variants and disease can be found. Rare variants in coding regions also seemed to have functional effects, thus appearing to play a critical role in genetic diversity.

The scope of our analysis using the aforementioned drug-target genes was mainly to confirm previous results, identify common variants to contribute to novel genetic insights and to optimise the popCNV pipeline.


5.1.2. Definition of gene region

Using the 202 drug-target gene list of GSK, we defined a “gene region” as the original gene coordinates (transcriptional start and stop sites) extended by 20kb on either side of the gene. This extension incorporated all regulatory regions close to the genes (including UTRs and promoters). We split each gene region into non-overlapping 500bp windows, so that the pipeline could observe a change in copy number every 500bp.
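The gene region definition can be sketched in a few lines of Python. The example assumes 0-based, half-open coordinates and uses hypothetical gene coordinates; the function name `gene_region_windows` is illustrative and is not part of popCNV.

```python
def gene_region_windows(gene_start, gene_end, flank=20_000, window=500):
    """Extend a gene by `flank` bp on each side and split it into non-overlapping windows.

    Coordinates are treated as 0-based and half-open (an assumption of this sketch).
    """
    region_start = max(0, gene_start - flank)
    region_end = gene_end + flank
    windows = []
    pos = region_start
    while pos < region_end:
        # The last window is truncated if the region is not a multiple of the window size
        windows.append((pos, min(pos + window, region_end)))
        pos += window
    return region_start, region_end, windows


# Hypothetical 5kb gene; the resulting gene region spans 45kb
region_start, region_end, windows = gene_region_windows(100_000, 105_000)
print(region_start, region_end, len(windows))
```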

Figure 5.1: Gene region definition and visualization

(Schematic: genes, each flanked by -20kb and +20kb extensions.)

*(The online source of the picture in Figure 5.1 is http://exchange.smarttech.com/search.html?q=%22DNA%20structure%22)

5.2. popCNV implementation

popCNV was implemented in a three-step process followed by visualization and benchmarking of the results (Figure 5.2). The procedure began with the use of RD as the alignment feature of the pipeline. Calculation of RD shows how many reads can be aligned to a genomic position and is directly proportional to the copy number. In popCNV, the estimation of the RD corresponded to the average normalised read count within a 500bp window. An example of RD count on chromosome 1 within a 1000G sample is shown below in Figure 5.3. The detailed description of the RD pre-processing was presented in Chapter 4.
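A windowed read depth calculation of this kind can be sketched as follows. The normalisation shown here (dividing each window count by the median window count of the sample) is an assumption made for illustration; the actual RD pre-processing is the one described in Chapter 4 and may differ. The function names and read positions are hypothetical.

```python
from statistics import median


def windowed_read_counts(read_starts, region_start, region_end, window=500):
    """Count reads whose start position falls in each non-overlapping window."""
    n_windows = -(-(region_end - region_start) // window)  # ceiling division
    counts = [0] * n_windows
    for pos in read_starts:
        if region_start <= pos < region_end:
            counts[(pos - region_start) // window] += 1
    return counts


def normalise(counts):
    """Scale window counts by the sample median (assumed normalisation)."""
    m = median(counts)
    if m == 0:
        raise ValueError("median read count is zero; cannot normalise")
    return [c / m for c in counts]


# Hypothetical read start positions in a 2kb region
read_starts = [0, 10, 499, 500, 600, 1000, 1001, 1400, 1499, 1500]
counts = windowed_read_counts(read_starts, 0, 2000)
depth = normalise(counts)
print(counts, depth)
```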

Figure 5.2: popCNV implementation and processing

202 drug-target and ADME gene regions

Tool/Model Process

Alignment feature (read depth) cnvHap extended - large scale sequencing data incorporated

Normalisation of read depth Definition of alleles

Hidden Markov Model (HMM) CNV segmentation & detection

UCSC browser Visualisation

1000G sequence data Benchmarking

Summary statistics


The HMM built for the pipeline was used to assign segments of DNA to copy numbers within the RD framework. The resulting CNV segmentation was used to create a detailed map of deletions and duplications, which were visualized with the UCSC genome browser. Finally, the benchmarking of popCNV was conducted using the 1000G data on which we initially ran our CNV calling analysis. The procedures for the identification of CNVs as well as the visualization and benchmarking of CNV calls are described below.

Figure 5.3: Read depth signature on chromosome 1. Representation of our segmentation algorithm where each point is an RD measurement. Colouring represents our detected CNVs; grey is the reference copy number, dark and light blue represent duplications and red is for deletions.

5.2.1. CNV detection

The initial step of popCNV was the calculation of normalised summary statistics for the RD, which constituted the input of the HMM. The RD was sampled over each chromosome at a default interval of 500bp (which the user can redefine). popCNV is run using a combination of R and UNIX (shell) scripts together with some Java command-line tools. Once the RD pre-processing was performed for each sequenced sample, popCNV applied the HMM for CNV segmentation and copy number assignment. Under the framework in which the HMM was constructed, a posterior probability distribution was used to define the accuracy of the model for each identified CNV. In addition, the pipeline produced segmentation plots as output (presented in Figure 5.3). These plots were obtained for each individual sample in every population of the 1000G.

Initially, our analysis focused on the identification of common and rare CNVs in the 202 drug-target gene regions globally, as well as in each of the 26 populations of the 1000G project. All the gene regions we analysed throughout the project were aligned to the GRCh37/hg19 assembly. In the course of our analysis, an international 1000G-funded collaborative study was published which, similarly to our effort, had analysed the phase 3 1000G WGS data at genome-wide scale [20]. Because the 1000G itself published an SV analysis using a collection of methods, this study provides a well-suited validation set for popCNV. Apart from the phase 3 data, the authors also used data from orthogonal techniques such as long-read single-molecule sequencing. The study constructed a map of 68,818 SVs split into eight different classes in the 2,504 individual genomes of the 1000G. To map the WGS reads, they used the algorithms BWA (Burrows-Wheeler Aligner) and mrsFAST (micro-read fast alignment and search tool), whereas for identification and genotyping they used a combination of nine algorithms. Their aim was to resolve SV classes which had not previously been resolved at the haplotype level. In addition, using the short-read DNA sequencing data, they identified various gene-SV intersections. The study highlighted how populations differ with regard to SVs and investigated the functional effect of SVs. 60% and 71% of the identified SVs were novel compared to DGV and the 1000G Project’s SVs respectively [21, 157]. As expected from previous findings, this paper also observed that African individuals showed an excess of heterozygous deletions compared to other populations (Appendix – Figure 5). The same observation held for SNPs. In contrast to the African populations, the results for the East Asian populations revealed that heterozygous deletions are a minority whereas homozygous deletions are more common.

5.2.1.1. 202 drug-target gene results

We carried out the analysis for CNV detection and benchmarking simultaneously for the 202 drug-target genes. This was achieved by extracting the 1000G CNV genomic regions which overlapped the regions of our drug-target genes and comparing our findings to the published CNVs. This analysis yielded 405 genomic regions, because the drug-target genomic positions overlapped more than one CNV position of the 1000G. Chromosome X regions were omitted from our analysis, and the initial set did not include any chromosome Y regions. Thus, we ended up with 394 genomic regions in the final set, which was almost double the number of the drug-target genes.

popCNV original output

popCNV identified heterozygous and/or homozygous deletions and duplications in all 26 studied populations. The output generated by the pipeline is a text file which includes the samples from all the populations, the chromosome, the start and end position of each CNV identified, the number of SNPs encompassed in the specific region, the length and type (deletion/duplication) of the CNVs and the confidence score for each call (Table 5.1). Based on the CNV segmentation from the HMM, the aforementioned CNV “type” represents the diploid HMM state. There are five different copy numbers: 0 for a homozygous deletion, 1 for a heterozygous deletion, 2 for copy neutral, 3 for a heterozygous duplication and 4 for a homozygous duplication.

Table 5.1: popCNV sample results from CNV analysis of chromosome 7 encompassing IL6 gene (chr7:22771285-22813003).

“Type” represents the diploid HMM state, including 0 - homozygous deletion, 1 - heterozygous deletion, 2 - copy neutral, 3 - heterozygous duplication and 4 - homozygous duplication.

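| Sample | Chr | Start pos | Last pos | No of SNPs | Length (bp) | Type | Average certainty |
| --- | --- | --- | --- | --- | --- | --- | --- |
| HG00096 | 7 | 22789637 | 22790637 | 3 | 1,000 | 3 | 1 |
| HG00114 | 7 | 22778137 | 22781137 | 7 | 3,000 | 3 | 1 |
| HG00122 | 7 | 22776637 | 22779137 | 6 | 2,500 | 3 | 1 |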

Summary statistics

The total number of non-overlapping CNVs we detected across the evaluated regions in the 1000G reference set was 87,529. Of those, 43,284 were heterozygous deletions, 43,076 were heterozygous duplications and 1,169 were homozygous duplications.


The length of the identified CNVs ranged between 500bp and 69,500bp for heterozygous deletions, 500bp and 362,000bp for heterozygous duplications and 500bp and 13,500bp for homozygous duplications. We detected 21,994 CNVs with a length < 1kb within the analysed regions. Specifically, these were divided into 10,989 deletions and 11,005 duplications (10,336 heterozygous duplications and 669 homozygous duplications). Approximately 99% of the CNVs identified were heterozygous, and only 1.34% were homozygous. We constructed detailed maps of multiple classes of CNVs including estimates of their AF across all 1000G populations as indicated below.

CNV clustering

After the CNV analysis was completed, we calculated summary statistics at the individual and population level. Initially, in order to avoid overlapping CNV calls, we merged CNVs within a sample if the distance between the calls was less than 2kb. Thus, we created CNV clusters based on consecutive CNV positions. Furthermore, according to the number of calls each sample had, we performed an additional within- and between-sample merging (distance threshold < 2kb) to reduce redundancy of numerous overlapping calls for each sample or between different samples.

popCNV final output

Based on these results, we created an all-in-one dataset consisting of the samples, the populations and the CNV type. This dataset was used as the main data source to extract the relevant summary information for our calculations. These calculations included the number of people in each population, the type of CNVs found in the studied region (either heterozygous or homozygous) and the number of people with detected CNVs, both in each population and in the region of interest. Finally, in order to obtain original results without depending on calculations from existing data sources, we calculated the AF in each population using the above information. The calculation of AF was based on equation (2.2). Even though we could have extracted the AFs from the VCF files of the 1000G, we preferred to re-estimate the frequencies and thus validate our calculations. A notable difference between our calculation and the 1000G’s AFs is that, using popCNV, we are able to look at AFs for each individual population within the 5 greater populations (Americans, Africans, Europeans, South Asians, East Asians).
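The within-sample merging described under CNV clustering can be illustrated with a short interval-merging sketch. It assumes that calls are given as (start, end) pairs for one sample and one CNV type, and that two calls are merged when the gap between them is below 2kb; the function name `merge_calls` and the coordinates are hypothetical.

```python
def merge_calls(calls, max_gap=2000):
    """Merge CNV calls of one sample whose gap is smaller than `max_gap` bp.

    calls: list of (start, end) tuples; overlapping calls are merged as well.
    """
    merged = []
    for start, end in sorted(calls):
        if merged and start - merged[-1][1] < max_gap:
            # Close enough to the previous cluster: extend it
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(m) for m in merged]


# Hypothetical calls: the first two are 1.5kb apart, the third is 3kb further on
clusters = merge_calls([(1000, 3000), (4500, 6000), (9000, 9500)])
print(clusters)
```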


Indicative example of CNV detection for a selected gene

Below, we present summary statistics for the IL6 (interleukin 6) gene (Table 5.2) obtained from popCNV, including the AFs per population (Table 5.3). IL6 is a protein-coding drug-target gene. It has a functional effect on diseases associated with inflammation, such as diabetes mellitus type 2, rheumatoid arthritis and inflammatory bowel disease 1. We chose to focus on IL6 because GSK reported an association between this gene and multiple sclerosis (OR = 12, p = 0.007) [39]. We also checked for drug targets shared between the 202 genes and the drug-target genes found in the CNV-tagging SNP list from Chapter 3. There were no overlaps between the two lists.

Table 5.2: CNVs identified within the IL6 gene region (chr7:22771285-22813003)

| CNV type | CNV count | Subject count (homozygous) | Subject count (heterozygous) |
| --- | --- | --- | --- |
| Deletions | 80 | 0 | 80 |
| Duplications | 75 | 0 | 75 |

The above results indicate that in the region of the IL6 gene, both deletions and duplications were present. Our findings show that 80 heterozygous deletions were present in 80 individuals and 75 heterozygous duplications in 75 individuals. This implies that each individual had a single CNV.

Figure 5.4: Read depth signature for HG00351 Finnish sample in IL6 gene region. Dark blue represents heterozygous duplications and red represents the heterozygous deletions.


Table 5.3: Average CNV allele frequencies per population in IL6 gene region

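| Origin | Population | No. Individuals | Allele Frequency (Deletion) | Allele Frequency (Duplication) |
| --- | --- | --- | --- | --- |
| European | Britain (GBR) | 107 | 0.009 | 0.028 |
| European | Finland (FIN) | 105 | 0.005 | 0.024 |
| European | Iberia (IBS) | 130 | 0.025 | 0.009 |
| European | Utah residents (CEU) | 183 | 0.003 | 0.014 |
| European | Toscani (TSI) | 112 | 0.009 | 0.0045 |
| Admixed American | Puerto Rico (PUR) | 150 | 0.013 | 0.020 |
| Admixed American | Colombia (CLM) | 148 | 0.007 | 0.010 |
| Admixed American | Peru (PEL) | 124 | 0.019 | 0.008 |
| Admixed American | Mexico (MXL) | 112 | 0.005 | 0.014 |
| South Asian | India (ITU) | 118 | 0.017 | 0.00 |
| South Asian | Punjabi (PJL) | 162 | 0.012 | 0.006 |
| South Asian | Gujarati India (GIH) | 113 | 0.035 | 0.044 |
| South Asian | Bengali (BEB) | 128 | 0.007 | 0.003 |
| South Asian | Sri Lanka (STU) | 128 | 0.012 | 0.008 |
| East Asian | China Beijing (CHB) | 108 | 0.014 | 0.014 |
| East Asian | Japan (JPT) | 105 | 0.019 | 0.029 |
| East Asian | South China (CHS) | 171 | 0.003 | 0.0029 |
| East Asian | Xishuangbanna (China) (CDX) | 109 | 0.009 | 0.0046 |
| East Asian | Vietnam (KHV) | 123 | 0.004 | 0.012 |
| African | Africa Caribbean (ACB) | 180 | 0.02 | 0.008 |
| African | Nigeria-Yoruba (YRI) | 186 | 0.005 | 0.005 |
| African | Kenya (LWK) | 116 | 0.013 | 0.013 |
| African | Western Gambia (GWD) | 158 | 0.014 | 0.008 |
| African | Sierra Leone (MSL) | 173 | 0.020 | 0.016 |
| African | Nigeria-Esan (ESN) | 144 | 0.003 | 0.014 |
| African | Africa/America (ASW) | 107 | 0.004 | 0.009 |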
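The per-population AFs are computed in the thesis with equation (2.2), which is not reproduced in this chapter. Purely as an illustration, the sketch below assumes the usual diploid definition, in which a heterozygous carrier contributes one CNV allele and a homozygous carrier two, out of 2N alleles in a population of N individuals. The function name and the carrier counts are hypothetical.

```python
def cnv_allele_frequency(n_het, n_hom, n_individuals):
    """Allele frequency of a CNV class under a diploid model (assumed definition).

    n_het: number of heterozygous carriers (one CNV allele each)
    n_hom: number of homozygous carriers (two CNV alleles each)
    """
    if n_individuals <= 0:
        raise ValueError("population size must be positive")
    return (n_het + 2 * n_hom) / (2 * n_individuals)


# Hypothetical example: 2 heterozygous carriers in a population of 107 individuals
af = cnv_allele_frequency(n_het=2, n_hom=0, n_individuals=107)
print(round(af, 3))
```

With these hypothetical counts the result is of the same order as the deletion frequencies listed in Table 5.3.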


The results from the AF calculation confirmed previous findings about the rare variants present in IL6. The presence of rare IL6 variants in the studied populations provides further evidence that a local catalog of rare variants should be created. Interestingly, there is a scarcity of literature on CNVs affecting IL6, so further investigation of variation within the gene is warranted.

CNV identification in JAZF1

The analysis we conducted for the T2D DIAGRAM data (Chapter 3) revealed a deletion of 378bp tagged by the SNP rs1635852 (LD = 0.95). Following this analysis, we wished to investigate potential CNV presence in the gene and examine the proportion of common and rare variation. The results of the CNV detection we performed for JAZF1 showed an abundance of both deletions and duplications in the gene. Specifically, we identified 2,418 heterozygous duplications, 22 homozygous duplications and 2,708 heterozygous deletions. As with IL6, no homozygous deletions were found. The heterozygous deletions were spread across 533 individuals from all 26 studied populations. Both the lowest and the highest levels of heterozygosity in deletions were found in East Asian populations, namely Chinese Dai in Xishuangbanna, China (CDX) and Japanese in Tokyo (JPT) respectively. The heterozygous duplications were present in 702 individuals, whereas the homozygous duplications were present in 19 individuals. The homozygous duplications were found in one American population, three European, one East Asian and two African populations. The population with the highest number of homozygous duplications was Toscani in Italy (TSI). The average CNV length in JAZF1 was ~1.6kb and it ranged between 500bp and 82kb.

Table 5.4: CNVs identified within the JAZF1 gene region

| CNV type | CNV count | Subject count (homozygous) | Subject count (heterozygous) |
| --- | --- | --- | --- |
| Deletions | 2,708 | 0 | 533 |
| Duplications | 2,440 | 19 | 702 |

The AFs calculated for JAZF1 (Table 5.5) indicate that the CNVs present in the gene are mostly common. The AFs confirm that the highest frequency of duplications was found in the Toscani population (TSI) (39%), whereas the highest frequency of deletions was identified in the African Caribbean (ACB) and Luhya (LWK) populations (15% for both). These results are in agreement with the 1000G’s CNV AFs, which showed that the majority of CNVs in JAZF1 for each of the greater populations were common. Specifically, the AF for each of the European populations was ~50%, meaning that CNVs were present in half the individuals of each population.

Table 5.5: Average CNV allele frequencies per population in JAZF1 gene region

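| Origin | Population | No. Individuals | Allele Frequency (Deletion) | Allele Frequency (Duplication) |
| --- | --- | --- | --- | --- |
| European | Britain (GBR) | 107 | 0.084 | 0.14 |
| European | Finland (FIN) | 105 | 0.11 | 0.10 |
| European | Iberia (IBS) | 130 | 0.11 | 0.11 |
| European | Utah residents (CEU) | 183 | 0.057 | 0.13 |
| European | Toscani (TSI) | 112 | 0.049 | 0.39 |
| Admixed American | Puerto Rico (PUR) | 150 | 0.037 | 0.080 |
| Admixed American | Colombia (CLM) | 148 | 0.051 | 0.061 |
| Admixed American | Peru (PEL) | 124 | 0.12 | 0.14 |
| Admixed American | Mexico (MXL) | 112 | 0.080 | 0.10 |
| South Asian | India (ITU) | 118 | 0.038 | 0.021 |
| South Asian | Punjabi (PJL) | 162 | 0.025 | 0.032 |
| South Asian | Gujarati India (GIH) | 113 | 0.11 | 0.15 |
| South Asian | Bengali (BEB) | 128 | 0.034 | 0.031 |
| South Asian | Sri Lanka (STU) | 128 | 0.039 | 0.043 |
| East Asian | China Beijing (CHB) | 108 | 0.11 | 0.11 |
| East Asian | Japan (JPT) | 105 | 0.21 | 0.26 |
| East Asian | South China (CHS) | 171 | 0.044 | 0.035 |
| East Asian | Xishuangbanna (China) (CDX) | 109 | 0.028 | 0.078 |
| East Asian | Vietnam (KHV) | 123 | 0.13 | 0.15 |
| African | Africa Caribbean (ACB) | 180 | 0.15 | 0.16 |
| African | Nigeria-Yoruba (YRI) | 186 | 0.065 | 0.12 |
| African | Kenya (LWK) | 116 | 0.15 | 0.19 |
| African | Western Gambia (GWD) | 158 | 0.040 | 0.078 |
| African | Sierra Leone (MSL) | 173 | 0.055 | 0.066 |
| African | Nigeria-Esan (ESN) | 144 | 0.046 | 0.035 |
| African | Africa/America (ASW) | 107 | 0.13 | 0.07 |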


CNV identification on chromosome 16

In Chapter 1, Section 1.5, we referred to a ~21kb deletion at 16p12.3 which was found to be linked to obesity in European and Chinese populations. Specifically, 46 of the European samples were found with a homozygous deletion (AF = 2.02%), whereas 577 were identified with a heterozygous deletion, i.e. a copy number of one (AF = 25.24%), and 1,663 with a copy number of two (AF = 72.74%). The specific location of the deletion was chr16:19,853,151-19,874,863. The genes affected by the CNV are GPRC5B, DNAH3, GPR139 and SYT17. We sought to investigate the CNV presence in the region where the deletion was previously identified. We therefore analysed the genomic region chr16:19,833,151-19,894,863, which includes the genomic position of the CNV expanded by 20kb on either side.

The heterozygous CNV detected by our analysis that lay closest to the previously reported CNV was a ~24.5kb heterozygous duplication at chr16:19,863,031-19,887,531. This duplication belonged to the sample HG01510 from the Iberian population in Spain (IBS). The AF of duplications in this specific population is 3.09%, whereas the total AF of heterozygous duplications in European populations is ~20%. Figure 5.5 illustrates the CNV spectrum of the analysed region and focuses on the IBS population, which includes the duplication detected.

Our results confirmed the presence of CNVs at the 16p12.3 locus. Further investigation of these CNVs and their association with obesity is therefore warranted, especially considering their size.

5.2.2. CNV visualization

A crucial step of our analysis was the examination of the overall spectrum of CNVs, followed by an investigation of CNV patterns in each of the 26 populations. We first converted the output of the CNV analysis from a text file format into the BED file format. A BED file is a tab-delimited text file which is used to supply genome annotation track data to the UCSC (University of California Santa Cruz) genome browser. UCSC was originally launched on the 22nd of June, 2000 as a draft version and was completed on the 7th of July, 2000. It now consists of a set of tools for the visualization, analysis and distribution of data.
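The conversion from a popCNV-style text record to a BED line can be sketched as follows. The input fields follow the columns of Table 5.1; the sketch assumes that the popCNV coordinates are 1-based and inclusive (BED uses 0-based, half-open intervals) and colours deletions red and duplications blue through the BED `itemRgb` field. The function name `call_to_bed` and the track name are hypothetical.

```python
RGB = {"deletion": "255,0,0", "duplication": "0,0,255"}

# Custom track header; itemRgb="On" tells the browser to use the per-item colours
TRACK_HEADER = 'track name="popCNV_calls" description="popCNV CNV calls" itemRgb="On"'


def call_to_bed(sample, chrom, start, end, cn_type):
    """Convert one CNV call into a 9-column BED line.

    cn_type is the diploid state: 0-1 deletions, 2 copy neutral, 3-4 duplications.
    Assumption: input coordinates are 1-based inclusive, so start is shifted by one.
    """
    if cn_type == 2:
        raise ValueError("copy-neutral segments are not CNV calls")
    kind = "deletion" if cn_type < 2 else "duplication"
    bed_start = start - 1
    fields = [f"chr{chrom}", bed_start, end, f"{sample}_{kind}", 0, ".",
              bed_start, end, RGB[kind]]
    return "\t".join(str(f) for f in fields)


# First row of Table 5.1
line = call_to_bed("HG00096", 7, 22789637, 22790637, 3)
print(TRACK_HEADER)
print(line)
```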


The overall results for JAZF1 are illustrated below (Figure 5.6). The UCSC plot confirms the CNV identification results described above and, at the same time, shows the CNV spectrum across all populations. The plot provides useful information regarding each population’s samples. We focused on the European populations, which mainly showed an abundance of common CNVs (Figure 5.7). popCNV can detect and visualize CNVs and provide summary statistics for a gene region rapidly and efficiently. In addition, popCNV has been optimised to take full advantage of high-performance computer clusters to achieve further efficiency gains.


Figure 5.5: UCSC genome browser plot illustrating the CNV presence at 16p12.3 locus


Figure 5.6: UCSC genome browser plot optimising CNV calls in the JAZF1 gene for 26 populations. Red colour represents the deletions and blue colour the duplications. Each line shows CNVs for the population specified above the line.


Figure 5.7: UCSC genome browser plots showing the CNV spectrum across each European population in JAZF1 gene.

The extended population line reveals the samples with a CNV presence. Each sample provides information such as the chromosome position, the genomic size and the DNA sequence. (a) British population (GBR) (b) Finnish population (FIN) (c)(i & ii) Toscani in Italy (d) Iberian population in Spain (e) Utah residents (CEPH) with Northern and Western European ancestry (CEU).


5.3. Benchmarking of popCNV

As mentioned in Section 5.2.1.1, our pipeline performed CNV detection and benchmarking of the CNV calls at the same time. This was achieved by using the 202 drug-target gene regions to download the CNV positions of the 1000G that overlap the regions of the drug targets. The set of SVs detected in Sudmant et al. [20] is considered a gold standard, since the study used a combination of nine different algorithms for SV identification. Therefore, we decided to compare our results with the gold standard set of deletions and duplications in the aforementioned paper.

The total number of our detected CNVs was 87,529 which reaches a 75% agreement with the gold standard set. Specifically, the latter included ~50,000 deletions and duplications (42,279 and 6,025 respectively). The difference in the proportion of CNV calls between our project and the gold standard set is likely due to the different thresholds used for the reduction of multiple overlapping calls per sample.
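The matching criterion behind the agreement figure is not specified here. As an illustration of how such a comparison can be made, the sketch below assumes a 50% reciprocal overlap rule, a commonly used convention for comparing CNV call sets; it is not necessarily the rule applied in popCNV. The function names and coordinates are hypothetical.

```python
def reciprocal_overlap(a, b):
    """Smallest fraction of either interval covered by their intersection."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))


def fraction_matched(calls, truth, threshold=0.5):
    """Fraction of calls that match at least one truth interval (assumed rule)."""
    if not calls:
        return 0.0
    matched = sum(
        1 for c in calls
        if any(reciprocal_overlap(c, t) >= threshold for t in truth)
    )
    return matched / len(calls)


# Hypothetical call set and gold standard set on one chromosome
calls = [(100, 200), (300, 400), (1000, 1100), (5000, 5100)]
truth = [(110, 210), (290, 380), (1040, 1200)]
agreement = fraction_matched(calls, truth)
print(agreement)
```

In this toy example two of the four calls satisfy the reciprocal overlap rule; the third overlaps a gold standard interval but covers too small a fraction of it, and the fourth has no counterpart.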

5.4. Summary

In this chapter we described the statistical procedures that we adopted to identify CNVs in low-depth whole-genome sequencing data. Using popCNV, a pipeline that analyses RD and incorporates this alignment feature into an HMM, we can detect and visualize CNVs efficiently under a population framework. In addition, popCNV offers methods which compute summary statistics, such as the AF, both globally and by population. To our knowledge, none of the existing catalogs provide AFs for specific populations. The AFs that are available for extraction to date are stored in the VCF files of the 1000G and are presented as an overall estimate for each of the greater populations. Thus, we propose this new pipeline, popCNV, which examines the CNV spectrum across populations and also calculates summary statistics which add to our existing knowledge of CNV.


Chapter 6

Conclusion

The past decade saw great methodological and technological advances for the discovery, characterisation and functional significance of CNVs. This has led to a new era of both CNV detection and GWAS analyses. Despite these advances, identification of CNVs remains challenging. This is the reason why databases like the 1000G project are continuously updated to allow for genome-wide investigation of human variation.

According to dbSNP, the proportion of the human genome that CNVs cover is greater than the one covered by the most common SNPs [158]. Moreover, CNV plays a crucial role in the genetics of both common and complex disease such as cancer, type 2 diabetes, neurodevelopmental disorders and many more. Studies analysing CNVs, SNPs or even a combination of the two in such diseases, have been able to detect de novo mutations which offer new insights behind disease aetiology [151]. Even though there has been a growing interest in the relationship between CNVs and SNPs and their contribution to novel disease associations, there are only a few databases providing this information. Research studies that focused on CNVs tagged by SNPs and identified major associations with various diseases, brought new ideas to the scientific community as well as to pharmaceutical companies for the development of new drugs. Thus, our project seized the opportunity to enhance existing knowledge by investigating CNV-tagging SNPs which contribute to novel disease associations.

For a proper examination of CNV impact on individuals’ genomes, NGS platforms have been applied to several studies with the purpose of a better and more accurate CNV identification. High-throughput sequencing is commonly used and is now considered as the most effective method for the elucidation of CNV. We have developed a statistical framework which achieves accurate CNV detection and calculates summary statistics at a population level.


In this thesis, we introduced our methods for the identification of CNV-tagging SNPs using sequencing data and investigated their potential association with various traits. In addition, we presented the pipeline we developed for CNV detection and explained the statistical methods behind it. This chapter comprises a summary of our discoveries in combination with themes encompassing our project. Finally, we propose ways in which our methods can be used for further research to contribute to the development of the genomics field.

6.1. Discussion

The overall aim of our project was to contribute to the explanation of “missing heritability”. The objectives were two-fold: the first was to extensively investigate the role of CNVs in human disease and in the variability of human traits from GWAS, and the second was fast and accurate genome-wide CNV detection at high resolution. In Chapter 1, we discussed the history behind human genome variation and how it is associated with common and complex diseases. In addition, we explained how CNV affects drug response and the impact it has in the field of pharmacogenetics. Moreover, we described the shift from microarrays to NGS for the analysis of genomic variation. The ground-breaking effort of the 1000G project to collect multiple genomes for sequencing contributed to a whole new era in the field of genomics. We then listed the methods and models for CNV detection using NGS and discussed available tools which have been designed for this purpose.

The rapid evolution of CNV discovery in several GWAS proved that CNVs, together with SNP-based GWAS, have vast potential in explaining the genetics behind various diseases. In Chapter 2, we used 1000G sequencing data to extract CNVs and SNPs from the sequenced genomes of 2,504 individuals with ancestry from 26 populations of 5 different continental groups. By calculating the correlation between the two variant types, we obtained what is, to our knowledge, the most comprehensive set of CNV-tagging SNPs to date.

The CNVs identified and described in Chapter 3 were mainly of small size, which might explain why no studies have been able to detect and report them before. Our method has also validated previous CNV-tagging SNPs where CNVs were either small or large. Thus, our project has become a replication study of such reported associations. One of the main aims of this analysis was to identify a combination of SNPs associated with a CNV rather than single-SNP associations. Examples of diseases linked to more than one SNP associated with a deletion were coronary artery disease and myocardial infarction as well as systemic lupus erythematosus.

As discussed above, CNV detection is crucial for the explanation of missing heritability. To achieve CNV identification, HTS generates alignment features. In Chapter 4, we examined the methods underlying read depth, the feature we used for our analysis, and we explained the statistical framework we developed to unveil CNVs. Our CNV identification, which was based on HTS, focused on estimating the absolute copy number to reach the most accurate CNV segmentation. We also modelled CNVs at a population level to account for variation in each independent sample which was unrelated to CNV. In addition, our framework incorporated summary statistics calculations to make the interpretation of CNVs more intuitive and user-friendly. Together with basic summary statistics, our pipeline, popCNV, provides AFs for each of the studied populations. In Chapter 5, we discussed the main characteristics and important benefits of WGS and how these are linked to the popCNV tool. Furthermore, we extensively explained how CNV detection was achieved through our pipeline and presented our results in detail. popCNV is able to identify small variants down to 500bp and can be used for any defined genomic region. Moreover, it can visualize both the characteristics of the detected CNVs and the AFs per population. popCNV was benchmarked at the same time as it was performing CNV calling, which enhanced the speed and efficiency of the pipeline.

Overall, our project focused on two major, related CNV analyses. The first emphasised creating a map of CNV-tagging SNPs which would contribute to the unexplained genetics behind human disease. The second stressed the importance of efficient and fast CNV detection at high resolution, as well as the estimation and optimization of summary statistics for the identified CNVs. Both aims of this project were crucial. The overall aim was to show the impact CNVs have on the human genome and on disease, as well as to discuss the potential role of the identified CNVs and CNV-tagging SNPs in GWAS. It has been widely discussed that the phenotypic impact of CNVs within genes and their contribution to disease risk, especially to common complex disorders, is not yet clear [69]. The history of CNV discovery for disease susceptibility is consistent with our findings, which show an excess of rare CNVs compared to common ones. The allele frequency distribution of CNVs has been interpreted as reflecting either purifying selection affecting the related CNV loci or population growth. Considering such a growth in populations, the patterns of observed allele frequencies between the SNPs and CNVs to date have been similar. Researchers have been able to provide a clearer explanation of purifying selected CNVs compared to SNPs and therefore CNVs might have a greater functional impact than SNPs. In our analysis, the rare CNVs reached an 89.5% AF whereas the low-frequency and common CNVs reached a 5.5% and 0.85% AF. This is based on 2,504 genomes coming from 5 ethnicities and 26 different populations. Increasing the number of analysed genomes could alter the proportions of AF worldwide, not just within European populations. These alterations can be due to a greater number of CNVs compared to our 37,307 CNVs or due to the different DNA patterns of each individual genome of a certain population. Our results revealed that individuals coming from the same population tend to have similar CNV patterns, as for example the 378bp deletion in JAZF1 found in European populations. Nevertheless, future research could examine why this deletion was less frequent in the rest of the world and whether it might be detected there in the future. A similar situation arose for psoriasis, where a CNV at the LCE3B and LCE3C loci was initially found in Europeans and later also discovered in Chinese populations [159].

There is still a gap between the statistical association and the biological background of identified CNVs or CNV-tagging SNPs, since a deletion or duplication of part of a gene, or even of an entire gene, may not lead to an obvious phenotypic change. This makes it difficult to predict which variants affect specific genes. It is therefore important not only to detect CNVs, but also to determine their functional impact by understanding the unknown mechanisms which may increase the risk of disease. Molecular mechanisms are one example, as some CNVs act on disease-causal genes which have not necessarily been shown to be associated with the CNV location. A CNV with such behavior is the 716bp common deletion we replicated and discussed in Section 3.1.1. This CNV is associated with lower expression of GNG11 in blood [116]. This specific deletion was one of the 10 established CNVs which we found to be correlated with neighboring SNPs. We also identified 31 novel CNV-tagging SNPs, where the majority of the tagged CNVs were < 1kb. CNVs of such a small length were previously difficult to detect, thus creating an additional “gap” in missing heritability and in the prediction of which variants are gene- or disease-specific. Our pipeline, popCNV, was able to identify CNVs down to 500bp, which is promising for follow-up studies. At the time of our project, we did not search the CNV map created by popCNV for any potential tagging SNPs. The analysis for CNV-tagging SNPs was entirely based on GWAS catalog SNPs and variants downloaded from the 1000G. Taking the small CNVs down to 500bp from the popCNV results, however, we could also conduct an analysis based on already established CNV-tagging SNPs, this time looking for common CNVs.
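To make the notion of a CNV-tagging SNP concrete, the correlation between a SNP and a biallelic CNV can be summarised by the squared Pearson correlation (r²) of their per-sample allele counts. The following is a minimal sketch under stated assumptions and is not the procedure used in this thesis: genotypes are coded as 0, 1 or 2 copies of the alternate (or deletion) allele, the genotype vectors are hypothetical toy data, and the threshold `R2_THRESHOLD = 0.8` is a commonly used convention that is assumed here for illustration only.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length genotype vectors."""
    if len(x) != len(y) or not x:
        raise ValueError("genotype vectors must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic variant cannot tag anything
    return cov * cov / (var_x * var_y)


# Hypothetical allele counts for eight samples (0, 1 or 2 copies per sample).
snp_genotypes = [0, 1, 2, 0, 1, 0, 2, 1]       # alternate allele of the SNP
deletion_genotypes = [0, 1, 2, 0, 1, 0, 2, 0]  # deletion allele of the CNV

R2_THRESHOLD = 0.8  # assumed tagging threshold, not taken from the thesis
r2 = r_squared(snp_genotypes, deletion_genotypes)
is_tagging = r2 >= R2_THRESHOLD
print(f"r2 = {r2:.3f}, tagging = {is_tagging}")
```

In practice the same calculation would be repeated for every SNP within a window around each CNV, keeping the SNP with the highest r² as the candidate tag.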

After conducting a functional analysis, scientists are able to choose the most suitable statistical model for CNV GWAS, as they have full information about the CNVs of interest. Among the most important criteria for CNV GWAS is whether the CNVs are common or rare, as it can be challenging to discover a statistically significant association between a phenotype and a rare allele. In such cases, the model examines the association between the total CNV load and the disease status [8]. In our project, we have detected both rare and common CNVs, including the common deletions which were tagged by various SNPs. When common CNVs are tested, the association can be demonstrated using a single CNV and the disease status, as the allele frequency is higher and there are more cases to compare to controls. Another important step before identifying associations between CNVs and disease in GWAS is to check the effect sizes of the CNVs of interest. This is particularly challenging, since previous work on the genetic epidemiology of CNVs suggests that CNVs, especially those tagged by neighboring SNPs, have small effect sizes compared to disease-associated SNPs [109]. If the effect size of such CNVs were large, the SNPs tagging them would give strong signals in GWAS and it would therefore be easier to detect the region of interest. Conversely, since in most cases the effect sizes of CNVs are small, the region is not easily identified. As a result, GWAS need large sample sizes to maximize the power of the analysis.
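The rare-variant case described above, in which the total CNV load rather than a single CNV is compared with disease status, can be illustrated with a simple burden comparison. The sketch below is only one possible approach and is an assumption on our part: it uses a one-sided permutation test on the difference in mean CNV load between cases and controls, which is not necessarily the model discussed in [8]. The function name, the number of permutations and the per-individual CNV counts are all hypothetical.

```python
import random


def burden_permutation_test(case_loads, control_loads, n_perm=2000, seed=1):
    """One-sided permutation test: do cases carry more CNVs than controls?"""
    if not case_loads or not control_loads:
        raise ValueError("both groups must contain at least one individual")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_cases = len(case_loads)

    def mean_difference(values):
        cases, controls = values[:n_cases], values[n_cases:]
        return sum(cases) / len(cases) - sum(controls) / len(controls)

    pooled = list(case_loads) + list(control_loads)
    observed = mean_difference(pooled)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # break any link between CNV load and disease status
        if mean_difference(pooled) >= observed - 1e-12:
            hits += 1
    # The add-one correction keeps the estimated p-value strictly above zero.
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value


# Hypothetical number of rare CNVs carried by each individual.
case_loads = [3, 4, 2, 5, 4, 3, 5, 4]
control_loads = [1, 2, 1, 0, 2, 1, 2, 1]

observed_diff, p_value = burden_permutation_test(case_loads, control_loads)
print(f"mean load difference = {observed_diff:.2f}, permutation p = {p_value:.4f}")
```

A permutation test is chosen here because it makes no distributional assumption about the CNV load; with real data, covariates such as ancestry and sequencing depth would also need to be accounted for.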

Based on the above, defining the role of CNVs in common disease susceptibility through GWAS data is vital. Our project presented two ways of doing so: either by identifying CNVs highly correlated with surrounding SNPs or by detecting CNV calls from sequencing data to be used for CNV GWAS in the future. Both ways are important, since each can contribute to explaining the missing heritability in its own way.

6.2. Future directions

Our CNV and CNV-tagging SNP catalogs can be used in various ways to enhance future research. Aspects such as the functional impact of CNV itself or its link to disease association, aetiology and progression have always been critical in genetic studies. We therefore propose several ways in which our catalogs could be of use to the scientific community, physicians and pharmaceutical companies. In addition, we discuss how our work can be extended to allow future research to investigate the issue of missing heritability in depth.

6.2.1. Functional analysis of JAZF1 and TSPAN8 CNVs

Our findings from the analysis of DIAGRAM data revealed CNV presence in the JAZF1 and TSPAN8 genes. The size of the CNVs suggests that they partially overlap the genes, thus leading to various genetic alterations potentially affecting gene expression. A functional analysis as well as independent genotyping of these novel CNVs using orthogonal methods would be vital. This can validate their existence and inform researchers, physicians and pharmaceutical companies in the development of new drugs. Specifically, future research can test for functional enrichment of “gene sets” connected to JAZF1 and TSPAN8, i.e. the genes which are affected by the novel CNVs. This would be useful for the identification of biological processes related to type 2 diabetes (T2D) and the discovery of new candidate T2D pathways. “Gene sets” are defined as groups of genes which either have common functions or are involved in the same pathway [160]. In this way, functional enrichment maps the effect of either JAZF1 or TSPAN8 onto larger groups of disease pathways. The latter can potentially create a network of gene sets which will make interpretation of the enrichment results easier.

There are several tools which can be used to assess enrichment for either common or rare CNVs. The ANNOVAR tool is able to show whether the CNVs contributing to the effects in the regions of interest cause protein-coding changes. The GARFIELD tool checks, among other features, for potential functional and regulatory enrichment of the variants as well as chromatin states and genic annotations. Another of the most useful tools is the UCSC browser, which coordinates all the data for the Encyclopedia of DNA Elements (ENCODE) Consortium. ENCODE’s goal was to provide scientists with a list of functional elements present in the human genome, such as elements affecting protein and RNA levels. UCSC helps with the visualisation of the ENCODE DNA elements. For the visualisation of the gene sets enriched in CNVs, which in our case could be common deletions, the Cytoscape tool would also be beneficial. The tool is used to transform the gene sets into a network, representing the functional enrichment analysis graphically [160].

Above all, CNV analysis of JAZF1 and TSPAN8 could help unveil the genetic causality of type 2 diabetes. Previous research has shown links between type 2 diabetes and Alzheimer’s disease [161]. Recently, a diabetes drug was tested in mice with Alzheimer’s and it caused a reversal in memory loss [162]. The drug is now being tested in humans and is promising for the treatment of Alzheimer’s or dementia symptoms. The relationship between CNV, type 2 diabetes and Alzheimer’s could be further elucidated by using gene networks. Using network-based approaches such as gene-to-gene or gene-to-variant, we could gain new insights into the aetiology of these two complex diseases. It is also notable that the majority of GWAS for JAZF1 and TSPAN8 have focused only on SNPs and have neither shown clear pathophysiological implications of the associated SNPs nor demonstrated their functional impact. Since there are indications of the presence of CNV in these genes, research should also focus on CNV identification and investigate the genotype-phenotype relationships.

6.2.2. Testing CNV associations with disease

Further to our discussion above in Section 6.2, the next step after identifying CNVs is to test the calls of interest for association with the disease status. Once the carrier status is defined, that is, whether a person has a deletion or a duplication, various methods can be used to test the associations. In our case, where the CNVs are biallelic, the most appropriate methods would be a chi-square test or a logistic regression. For CNV GWAS, the ideal outcome would be the rejection of the null hypothesis of no association between the disease and the biallelic variant. It is always important to account for heterogeneity in the carrier status estimate and thus, Bayesian models would be more appropriate through the use of posterior probabilities [163, 164].
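As a simple illustration of the chi-square option mentioned above, carrier status can be cross-tabulated against disease status in a 2x2 table and tested with one degree of freedom. The sketch below is a minimal example under stated assumptions: the counts are hypothetical, carrier status is treated as known without error (so the heterogeneity addressed by the Bayesian models in [163, 164] is ignored), and no continuity correction or covariate adjustment is applied. A logistic regression would be the natural alternative when covariates are needed.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 df, no continuity correction).

    Table layout: rows are cases/controls, columns are carriers/non-carriers:
        cases:    a carriers, b non-carriers
        controls: c carriers, d non-carriers
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the upper tail probability is erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


def odds_ratio(a, b, c, d):
    """Odds of carrying the CNV in cases relative to controls."""
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)


# Hypothetical counts: 60 of 200 cases and 40 of 200 controls carry the deletion.
chi2, p_value = chi_square_2x2(60, 140, 40, 160)
carrier_or = odds_ratio(60, 140, 40, 160)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}, OR = {carrier_or:.3f}")
```

The null hypothesis of no association would be rejected when the p-value falls below the chosen significance level, which in a genome-wide setting must be corrected for the number of CNVs tested.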

6.2.3. Contribution of CNVs to pharmacogenetics

Our CNV analysis conducted using popCNV resulted in the creation of a map of CNVs in a set of drug-target genes. This map will be provided as a resource for genotyping CNV in future pharmacogenetics (PGx) studies and it will also help to connect existing drugs to new disease indications (repurposing). The presence of CNVs in drug target genes can elucidate potential gene dosage effects on clinical phenotypes. This may, in turn, inform drug dosage adjustment according to CNV genotype, as well as influence the design of novel substances to counteract the effects of CNV. This is also linked to CNV-tagging SNPs, since the identification of combined SNP/CNV haplotypes can now greatly facilitate the detection of CNVs in future studies using appropriate tagging SNPs. As a result, drug companies can rely on the CNV map to conduct new clinical trials and assess how patients with deletions and/or duplications will respond to treatments based on the drug dosage they are provided.

6.2.4. Role of geographical clustering in CNVs

We observed that, occasionally, there are large differences in the AF of a CNV between populations, especially if they belong to different continental groups. This holds for both common and rare variants, but our results showed that it is mostly true for the latter. In other words, rare variant sharing decreases substantially with the geographic distance between populations. The creation of a detailed CNV catalog for each geographical area is therefore crucial. This will make CNV and AF identification more robust for each independent population.
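The per-population comparison described above rests on a simple calculation, sketched here for a biallelic deletion on an autosome. This is an illustrative example under stated assumptions rather than the popCNV implementation: each sample is assumed to have an integer copy number of 0, 1 or 2, so that the number of deletion alleles is 2 minus the copy number, and the calls themselves are hypothetical. The population labels are 1000G population codes used only as examples.

```python
from collections import defaultdict


def deletion_af_by_population(calls):
    """Deletion allele frequency per population.

    `calls` is an iterable of (population, copy_number) pairs, one per sample.
    """
    deletion_alleles = defaultdict(int)
    chromosomes = defaultdict(int)
    for population, copy_number in calls:
        if copy_number not in (0, 1, 2):
            raise ValueError("a biallelic deletion implies copy number 0, 1 or 2")
        deletion_alleles[population] += 2 - copy_number  # deleted copies in this sample
        chromosomes[population] += 2                     # two chromosomes per sample
    return {pop: deletion_alleles[pop] / chromosomes[pop] for pop in chromosomes}


# Hypothetical copy number calls for one deletion in three populations.
calls = [
    ("GBR", 2), ("GBR", 1), ("GBR", 1), ("GBR", 2),
    ("YRI", 2), ("YRI", 2), ("YRI", 2), ("YRI", 2),
    ("CHB", 1), ("CHB", 0), ("CHB", 2), ("CHB", 2),
]

af = deletion_af_by_population(calls)
for population, frequency in sorted(af.items()):
    print(population, frequency)
```

With only a handful of samples per population, as in this toy example, the estimates are very imprecise, which is one reason why detailed catalogs for each geographical area require large sample sizes.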

6.2.5. Potential use of popCNV by external Consortia

A large panel called the Haplotype Reference Consortium (HRC), consisting of sequencing data from several cohorts, is a research effort for the collection of human haplotypes. The panel’s first release included 64,976 haplotypes of European ancestry at ~39 million SNPs. Even though HRC is considered a large panel, it only uses haplotypes from European populations, which restricts the possibility of investigating variation in the rest of the world. It also currently reports only SNP-based results, therefore there is a lack of CNV information. Although our project could not use the panel’s information based on what was publicly available, we consider that our pipeline could be of great use to the panel. Since HRC is now incorporating the 1000G project as one of its cohorts, apart from providing the pipeline as a tool for CNV detection, we can also offer our summary statistics including the AFs for each of the 26 populations. This information is currently unavailable not just from the 1000G but from any other international panel. Together with our statistics and further to Section 6.2.3, there is a potential of achieving detailed catalogs of rare variants in each continental group or even in each individual population.

6.2.6. popCNV extension

popCNV has the potential to be expanded to incorporate all the alignment features available from HTS and become an integrative framework for CNV detection. These additional features are read pairs (RPs) and split reads (SRs). Even though RD, which is the feature popCNV currently uses, is the only feature providing absolute copy number estimation, RPs and SRs will help researchers wishing to use our pipeline to accurately identify breakpoints at a high resolution. They will therefore be able to compare copy number differences between the alignment features and conclude on the most reliable segmentation.
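The reason RD can provide an absolute copy number estimate, while RPs and SRs mainly inform breakpoints, is that read depth scales approximately linearly with the number of copies. The sketch below illustrates this idea only and should not be read as the popCNV model, which is population-based: it assumes a single sample, a diploid baseline, hypothetical mean depths in fixed-size windows, and a naive normalisation by the median window depth.

```python
from statistics import median


def copy_number_from_depth(window_depths, ploidy=2):
    """Naive absolute copy number per window from read depth.

    Assumes most windows are at the baseline ploidy, so the median depth
    corresponds to `ploidy` copies.
    """
    baseline = median(window_depths)
    if baseline <= 0:
        raise ValueError("baseline depth must be positive")
    return [round(ploidy * depth / baseline) for depth in window_depths]


# Hypothetical mean read depth in eight consecutive windows of one sample.
window_depths = [30, 31, 29, 15, 14, 30, 45, 30]

copy_numbers = copy_number_from_depth(window_depths)
print(copy_numbers)  # windows 4-5 look like a heterozygous deletion, window 7 like a duplication
```

Because the window boundaries are fixed in advance, such an estimate cannot place breakpoints more precisely than the window size, which is where RP and SR evidence would complement RD.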

References

1. Nan M. Laird, C.L., in The Fundamentals of Modern Statistical Genetics. 2011, Springer-Verlag New York.
2. Ezkurdia, I., et al., Multiple evidence strands suggest that there may be as few as 19,000 human protein-coding genes. Hum Mol Genet, 2014. 23(22): p. 5866-78.
3. Haraksingh, R.R. and M.P. Snyder, Impacts of variation in the human genome on gene regulation. J Mol Biol, 2013. 425(21): p. 3970-7.
4. Frazer, K.A., et al., Human genetic variation and its contribution to complex traits. Nat Rev Genet, 2009. 10(4): p. 241-51.
5. Genomes Project, C., et al., A map of human genome variation from population-scale sequencing. Nature, 2010. 467(7319): p. 1061-73.
6. Conrad, D.F., et al., Origins and functional impact of copy number variation in the human genome. Nature, 2010. 464(7289): p. 704-12.
7. Stankiewicz, P. and J.R. Lupski, Structural variation in the human genome and its role in disease. Annu Rev Med, 2010. 61: p. 437-55.
8. Zöllner, S. and T.M. Teslovich, Using GWAS Data to Identify Copy Number Variants Contributing to Common Complex Diseases. Statistical Science, 2009. 24(4): p. 530-546.
9. Zhang, F., et al., Copy number variation in human health, disease, and evolution. Annu Rev Genomics Hum Genet, 2009. 10: p. 451-81.
10. Stranger, B.E., et al., Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science, 2007. 315(5813): p. 848-53.
11. Feuk, L., Inversion variants in the human genome: role in disease and genome architecture. Genome Med, 2010. 2(2): p. 11.
12. Young, C., et al., A Crohn's disease-associated insertion polymorphism (3020insC) in the NOD2 gene is not associated with psoriasis vulgaris, palmo-plantar pustular psoriasis or guttate psoriasis. Exp Dermatol, 2003. 12(4): p. 506-9.
13. Baker, M., Structural variation: the genome's hidden architecture. Nat Methods, 2012. 9(2): p. 133-7.
14. Rabbitts, T.H., Chromosomal translocations in human cancer. Nature, 1994. 372(6502): p. 143-9.
15. Zhao, M., et al., Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives. BMC Bioinformatics, 2013. 14 Suppl 11: p. S1.
16. Carter, N.P., Methods and strategies for analyzing copy number variation using DNA microarrays. Nat Genet, 2007. 39(7 Suppl): p. S16-21.
17. Snijders, A.M., et al., Assembly of microarrays for genome-wide measurement of DNA copy number. Nat Genet, 2001. 29(3): p. 263-4.
18. Shendure, J. and H. Ji, Next-generation DNA sequencing. Nat Biotechnol, 2008. 26(10): p. 1135-45.
19. Metzker, M.L., Sequencing technologies - the next generation. Nat Rev Genet, 2010. 11(1): p. 31-46.
20. Sudmant, P.H., et al., An integrated map of structural variation in 2,504 human genomes. Nature, 2015. 526(7571): p. 75-81.

21. Genomes Project, C., et al., An integrated map of genetic variation from 1,092 human genomes. Nature, 2012. 491(7422): p. 56-65.
22. Le, S.Q. and R. Durbin, SNP detection and genotyping from low-coverage sequencing data on multiple diploid samples. Genome Res, 2011. 21(6): p. 952-60.
23. Wendl, M.C. and R.K. Wilson, The theory of discovering rare variants via DNA sequencing. BMC Genomics, 2009. 10: p. 485.
24. Wetterstrand, K., DNA sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP). 2014.
25. Bellos, E., et al., cnvCapSeq: detecting copy number variation in long-range targeted resequencing data. Nucleic Acids Res, 2014. 42(20): p. e158.
26. Lander, E.S., et al., Initial sequencing and analysis of the human genome. Nature, 2001. 409(6822): p. 860-921.
27. Grenet, O., Significance of the human genome sequence to drug discovery. Pharmacogenomics J, 2001. 1(1): p. 11-2.
28. Willyard, C., Copy number variations' effect on drug response still overlooked. Nat Med, 2015. 21(3): p. 206.
29. Rocchi, A., et al., Causative and susceptibility genes for Alzheimer's disease: a review. Brain Res Bull, 2003. 61(1): p. 1-24.
30. He, Y., J.M. Hoskins, and H.L. McLeod, Copy number variants in pharmacogenetic genes. Trends Mol Med, 2011. 17(5): p. 244-51.
31. Bodmer, W. and I. Tomlinson, Rare genetic variants and the risk of cancer. Curr Opin Genet Dev, 2010. 20(3): p. 262-7.
32. Pritchard, J.K., Are rare variants responsible for susceptibility to complex diseases? Am J Hum Genet, 2001. 69(1): p. 124-37.
33. Kryukov, G.V., L.A. Pennacchio, and S.R. Sunyaev, Most rare missense alleles are deleterious in humans: implications for complex disease and association studies. Am J Hum Genet, 2007. 80(4): p. 727-39.
34. Marth, G.T., et al., The functional spectrum of low-frequency coding variation. Genome Biol, 2011. 12(9): p. R84.
35. Russ, A.P. and S. Lampel, The druggable genome: an update. Drug Discov Today, 2005. 10(23-24): p. 1607-10.
36. Asimit, J. and E. Zeggini, Rare variant association analysis methods for complex traits. Annu Rev Genet, 2010. 44: p. 293-308.
37. Bustamante, C.D., E.G. Burchard, and F.M. De la Vega, Genomics for the world. Nature, 2011. 475(7355): p. 163-5.
38. Gravel, S., et al., Demographic history and rare allele sharing among human populations. Proc Natl Acad Sci U S A, 2011. 108(29): p. 11983-8.
39. Nelson, M.R., et al., An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science, 2012. 337(6090): p. 100-4.
40. Gamazon, E.R., et al., Copy number polymorphisms and anticancer pharmacogenomics. Genome Biol, 2011. 12(5): p. R46.
41. Redon, R., et al., Global variation in copy number in the human genome. Nature, 2006. 444(7118): p. 444-54.
42. Sebat, J., et al., Large-scale copy number polymorphism in the human genome. Science, 2004. 305(5683): p. 525-8.
43. Iafrate, A.J., et al., Detection of large-scale variation in the human genome. Nat Genet, 2004. 36(9): p. 949-51.

44. Sullivan, P.F., M.J. Daly, and M. O'Donovan, Genetic architectures of psychiatric disorders: the emerging picture and its implications. Nat Rev Genet, 2012. 13(8): p. 537-51.
45. Fiume, M., et al., Savant Genome Browser 2: visualization and analysis for population-scale genomics. Nucleic Acids Res, 2012. 40(Web Server issue): p. W615-21.
46. Flicek, P., et al., Ensembl 2012. Nucleic Acids Res, 2012. 40(Database issue): p. D84-90.
47. Kuhn, R.M., D. Haussler, and W.J. Kent, The UCSC genome browser and associated tools. Brief Bioinform, 2013. 14(2): p. 144-61.
48. Consortium, E.P., An integrated encyclopedia of DNA elements in the human genome. Nature, 2012. 489(7414): p. 57-74.
49. Kumar, P., S. Henikoff, and P.C. Ng, Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protoc, 2009. 4(7): p. 1073-81.
50. Firth, H.V., et al., DECIPHER: Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources. Am J Hum Genet, 2009. 84(4): p. 524-33.
51. Coin, L.J., et al., cnvHap: an integrative population and haplotype-based multiplatform model of SNPs and CNVs. Nat Methods, 2010. 7(7): p. 541-6.
52. Su, S.Y., et al., Inference of haplotypic phase and missing genotypes in polyploid organisms and variable copy number genomic regions. BMC Bioinformatics, 2008. 9: p. 513.
53. Kouemou, G.L., in Hidden Markov Models, Theory and Applications, P. Dymarski, Editor. 2011, INTECH.
54. Li, Y., et al., MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol, 2010. 34(8): p. 816-34.
55. Marchini, J., et al., A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet, 2007. 39(7): p. 906-13.
56. Scheet, P. and M. Stephens, A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet, 2006. 78(4): p. 629-44.
57. Stephens, M., N.J. Smith, and P. Donnelly, A new statistical method for haplotype reconstruction from population data. Am J Hum Genet, 2001. 68(4): p. 978-89.
58. Valsesia, A., et al., The Growing Importance of CNVs: New Insights for Detection and Clinical Interpretation. Front Genet, 2013. 4: p. 92.
59. Zeitouni, B., et al., SVDetect: a tool to identify genomic structural variations from paired-end and mate-pair sequencing data. Bioinformatics, 2010. 26(15): p. 1895-6.
60. Dalca, A.V., et al., VARiD: a variation detection framework for color-space and letter-space platforms. Bioinformatics, 2010. 26(12): p. i343-9.
61. Ruffalo, M., T. LaFramboise, and M. Koyuturk, Comparative analysis of algorithms for next-generation sequencing read alignment. Bioinformatics, 2011. 27(20): p. 2790-6.
62. Wineinger, N.E., et al., Statistical issues in the analysis of DNA Copy Number Variations. Int J Comput Biol Drug Des, 2008. 1(4): p. 368-95.
63. Curtis, C., et al., The pitfalls of platform comparison: DNA copy number array technologies assessed. BMC Genomics, 2009. 10: p. 588.
64. Winchester, L., C. Yau, and J. Ragoussis, Comparing CNV detection methods for SNP arrays. Brief Funct Genomic Proteomic, 2009. 8(5): p. 353-66.

65. McCarroll, S.A., et al., Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet, 2008. 40(10): p. 1166-74.
66. Mills, R.E., et al., Mapping copy number variation by population-scale genome sequencing. Nature, 2011. 470(7332): p. 59-65.
67. Handsaker, R.E., et al., Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nat Genet, 2011. 43(3): p. 269-76.
68. Eichler, E.E., et al., Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet, 2010. 11(6): p. 446-50.
69. Falchi, M., et al., Low copy number of the salivary amylase gene predisposes to obesity. Nat Genet, 2014. 46(5): p. 492-7.
70. Hooli, B.V., et al., Rare autosomal copy number variations in early-onset familial Alzheimer's disease. Mol Psychiatry, 2014. 19(6): p. 676-81.
71. Zarrei, M., et al., A copy number variation map of the human genome. Nat Rev Genet, 2015. 16(3): p. 172-83.
72. Drysdale, C.M., et al., Complex promoter and coding region beta 2-adrenergic receptor haplotypes alter receptor expression and predict in vivo responsiveness. Proc Natl Acad Sci U S A, 2000. 97(19): p. 10483-8.
73. Joosten, P.H., et al., Promoter haplotype combinations of the platelet-derived growth factor alpha-receptor gene predispose to human neural tube defects. Nat Genet, 2001. 27(2): p. 215-7.
74. Aitman, T.J., et al., Copy number polymorphism in Fcgr3 predisposes to glomerulonephritis in rats and humans. Nature, 2006. 439(7078): p. 851-5.
75. de Cid, R., et al., Deletion of the late cornified envelope LCE3B and LCE3C genes as a susceptibility factor for psoriasis. Nat Genet, 2009. 41(2): p. 211-5.
76. Hollox, E.J., et al., Psoriasis is associated with increased beta-defensin genomic copy number. Nat Genet, 2008. 40(1): p. 23-5.
77. International Schizophrenia, C., Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature, 2008. 455(7210): p. 237-41.
78. Clancy, S., Copy number variation. Nature Education, 2008. 1
79. Ingason, A., et al., Copy number variations of chromosome 16p13.1 region associated with schizophrenia. Mol Psychiatry, 2011. 16(1): p. 17-25.
80. Loureiro Susana, A.J., Cafe Catia, Conceicao Ines, Mouga Susana, Beleza Ana, Oliveira Barbara, de Sa Joaquim, Carreira Isabel, Saraiva Jorge, Vicente Astrid, Oliveira Guiomar, Copy number variations in chromosome 16p13.11-The neurodevelopmental clinical spectrum. International Journal of Pediatrics, 2017.
81. Heinzen, E.L., et al., Rare deletions at 16p13.11 predispose to a diverse spectrum of sporadic epilepsy syndromes. Am J Hum Genet, 2010. 86(5): p. 707-18.
82. Yang, T.L., et al., Ethnic differentiation of copy number variation on chromosome 16p12.3 for association with obesity phenotypes in European and Chinese populations. Int J Obes (Lond), 2013. 37(2): p. 188-90.
83. Genomes Project, C., et al., A global reference for human genetic variation. Nature, 2015. 526(7571): p. 68-74.
84. Delaneau, O., J. Marchini, and J.F. Zagury, A linear complexity phasing method for thousands of genomes. Nat Methods, 2011. 9(2): p. 179-81.
85. Menelaou, A. and J. Marchini, Genotype calling and phasing using next-generation sequencing reads and a haplotype scaffold. Bioinformatics, 2013. 29(1): p. 84-91.

86. Xu, H., et al., SgD-CNV, a database for common and rare copy number variants in three Asian populations. Hum Mutat, 2011. 32(12): p. 1341-9.
87. Ortega, V.E. and E.R. Bleecker, 45 - Genetics in Asthma and COPD A2 - Broaddus, V. Courtney, in Murray and Nadel's Textbook of Respiratory Medicine (Sixth Edition), R.J. Mason, et al., Editors. 2016, W.B. Saunders: Philadelphia. p. 786-806.e8.
88. Hindorff, L.A., et al., Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A, 2009. 106(23): p. 9362-7.
89. Wang, L.L., et al., Intron-size constraint as a mutational mechanism in Rothmund-Thomson syndrome. Am J Hum Genet, 2002. 71(1): p. 165-7.
90. Marchini, J. and B. Howie, Genotype imputation for genome-wide association studies. Nat Rev Genet, 2010. 11(7): p. 499-511.
91. Feehally, J., et al., HLA has strongest association with IgA nephropathy in genome-wide analysis. J Am Soc Nephrol, 2010. 21(10): p. 1791-7.
92. Yu, X.Q., et al., A genome-wide association study in Han Chinese identifies multiple susceptibility loci for IgA nephropathy. Nat Genet, 2011. 44(2): p. 178-82.
93. Kiryluk, K., et al., Discovery of new risk loci for IgA nephropathy implicates genes involved in immunity against intestinal pathogens. Nat Genet, 2014. 46(11): p. 1187-96.
94. Xie, J., et al., Fine Mapping Implicates a Deletion of CFHR1 and CFHR3 in Protection from IgA Nephropathy in Han Chinese. J Am Soc Nephrol, 2016. 27(10): p. 3187-3194.
95. Gale, D.P., et al., Identification of a mutation in complement factor H-related protein 5 in patients of Cypriot origin with glomerulonephritis. Lancet, 2010. 376(9743): p. 794-801.
96. Malik, T.H., et al., A hybrid CFHR3-1 gene causes familial C3 glomerulopathy. J Am Soc Nephrol, 2012. 23(7): p. 1155-60.
97. Benetkiewicz, M., et al., Chromosome 22 array-CGH profiling of breast cancer delimited minimal common regions of genomic imbalances and revealed frequent intra-tumoral genetic heterogeneity. Int J Oncol, 2006. 29(4): p. 935-45.
98. Castells, A., et al., A region of deletion on chromosome 22q13 is common to human breast and colorectal cancers. Cancer Res, 2000. 60(11): p. 2836-9.
99. Long, J., et al., A common deletion in the APOBEC3 genes and breast cancer risk. J Natl Cancer Inst, 2013. 105(8): p. 573-9.
100. Nik-Zainal, S., et al., Association of a germline copy number polymorphism of APOBEC3A and APOBEC3B with burden of putative APOBEC-dependent mutations in breast cancer. Nat Genet, 2014. 46(5): p. 487-91.
101. Kim, H.C., et al., A genome-wide association study identifies a breast cancer risk variant in ERBB4 at 2q34: results from the Seoul Breast Cancer Study. Breast Cancer Res, 2012. 14(2): p. R56.
102. Zheng, Y., et al., Fine mapping of breast cancer genome-wide association studies loci in women of African ancestry identifies novel susceptibility markers. Carcinogenesis, 2013. 34(7): p. 1520-8.
103. Perry, J.R., et al., Parent-of-origin-specific allelic associations among 106 genomic loci for age at menarche. Nature, 2014. 514(7520): p. 92-97.
104. Locke, A.E., et al., Genetic studies of body mass index yield new insights for obesity biology. Nature, 2015. 518(7538): p. 197-206.

105. Felix, J.F., et al., Genome-wide association analysis identifies three new susceptibility loci for childhood body mass index. Hum Mol Genet, 2016. 25(2): p. 389-403.
106. Zhang, D., et al., Interactions between obesity-related copy number variants and dietary behaviors in childhood obesity. Nutrients, 2015. 7(4): p. 3054-66.
107. Berndt, S.I., et al., Genome-wide meta-analysis identifies 11 new loci for anthropometric traits and provides insights into genetic architecture. Nat Genet, 2013. 45(5): p. 501-12.
108. Elks, C.E., et al., Thirty new loci for age at menarche identified by a meta-analysis of genome-wide association studies. Nat Genet, 2010. 42(12): p. 1077-85.
109. Willer, C.J., et al., Six new loci associated with body mass index highlight a neuronal influence on body weight regulation. Nat Genet, 2009. 41(1): p. 25-34.
110. Wheeler, E., et al., Genome-wide SNP and CNV analysis identifies common and low-frequency variants associated with severe early-onset obesity. Nat Genet, 2013. 45(5): p. 513-7.
111. Bowes, J., et al., Variants in linkage disequilibrium with the late cornified envelope gene cluster deletion are associated with susceptibility to psoriatic arthritis. Ann Rheum Dis, 2010. 69(12): p. 2199-203.
112. Sun, L., et al., Association between LCE gene polymorphisms and psoriasis vulgaris among Mongolians from Inner Mongolia. Arch Dermatol Res, 2018. 310(4): p. 321-327.
113. Pajic, P., et al., The psoriasis-associated deletion of late cornified envelope genes LCE3B and LCE3C has been maintained under balancing selection since Human Denisovan divergence. BMC Evol Biol, 2016. 16(1): p. 265.
114. Chandra, A., et al., Increased Risk of Psoriasis due to combined effect of HLA-Cw6 and LCE3 risk alleles in Indian population. Sci Rep, 2016. 6: p. 24059.
115. Baurecht, H., et al., Genome-wide comparative analysis of atopic dermatitis and psoriasis gives insight into opposing genetic mechanisms. Am J Hum Genet, 2015. 96(1): p. 104-20.
116. den Hoed, M., et al., Identification of heart rate-associated loci and their effects on cardiac conduction and rhythm disorders. Nat Genet, 2013. 45(6): p. 621-31.
117. Ledda, M., et al., GWAS of human bitter taste perception identifies new loci and reveals additional complexity of bitter taste genetics. Hum Mol Genet, 2014. 23(1): p. 259-67.
118. Jiang, Y. and H. Zhang, Propensity score-based nonparametric test revealing genetic variants underlying bipolar disorder. Genet Epidemiol, 2011. 35(2): p. 125-32.
119. Roudnitzky, N., et al., Genomic, genetic and functional dissection of bitter taste responses to artificial sweeteners. Hum Mol Genet, 2011. 20(17): p. 3437-49.
120. Alonso, A., et al., GStream: improving SNP and CNV coverage on genome-wide association studies. PLoS One, 2013. 8(7): p. e68822.
121. Shin, S.Y., et al., An atlas of genetic influences on human blood metabolites. Nat Genet, 2014. 46(6): p. 543-550.
122. Schulze, J., et al., SULT2A1 Gene Copy Number Variation is Associated with Urinary Excretion Rate of Steroid Sulfates. Front Endocrinol (Lausanne), 2013. 4: p. 88.
123. Goodwin, G.M., The overlap between anxiety, depression, and obsessive-compulsive disorder. Dialogues Clin Neurosci, 2015. 17(3): p. 249-60.

124. Zbozinek, T.D., et al., Diagnostic overlap of generalized anxiety disorder and major depressive disorder in a primary care sample. Depress Anxiety, 2012. 29(12): p. 1065-71.
125. Fang, L.T., et al., Comprehensive genomic analyses of a metastatic colon cancer to the lung by whole exome sequencing and gene expression analysis. Int J Oncol, 2014. 44(1): p. 211-21.
126. Zhao, X., et al., Examination of copy number variations of CHST9 in multiple types of hematologic malignancies. Cancer Genet Cytogenet, 2010. 203(2): p. 176-9.
127. Laffin, J.J., et al., Novel candidate genes and regions for childhood apraxia of speech identified by array comparative genomic hybridization. Genet Med, 2012. 14(11): p. 928-36.
128. Gibson, G., Rare and common variants: twenty arguments. Nat Rev Genet, 2012. 13(2): p. 135-45.
129. Flannick, J., et al., Sequence data and association statistics from 12,940 type 2 diabetes cases and controls. Sci Data, 2017. 4: p. 170179.
130. Wellcome Trust Case Control Consortium, et al., Bayesian refinement of association signals for 14 loci in 3 common diseases. Nat Genet, 2012. 44(12): p. 1294-301.
131. Chen, W., et al., Copy number variation across European populations. PLoS One, 2011. 6(8): p. e23087.
132. Torrico, B., et al., Lack of replication of previous autism spectrum disorder GWAS hits in European populations. Autism Res, 2017. 10(2): p. 202-211.
133. Zhang, H., et al., A Powerful Procedure for Pathway-Based Meta-analysis Using Summary Statistics Identifies 43 Pathways Associated with Type II Diabetes in European Populations. PLoS Genet, 2016. 12(6): p. e1006122.
134. Zeggini, E., et al., Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nat Genet, 2008. 40(5): p. 638-45.
135. Wellcome Trust Case Control Consortium, et al., Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature, 2010. 464(7289): p. 713-20.
136. Zhou, D.Z., et al., Variations in/nearby genes coding for JAZF1, TSPAN8/LGR5 and HHEX-IDE and risk of type 2 diabetes in Han Chinese. J Hum Genet, 2010. 55(12): p. 810-5.
137. Voight, B.F., et al., Twelve type 2 diabetes susceptibility loci identified through large-scale association analysis. Nat Genet, 2010. 42(7): p. 579-89.
138. Kaakinen, M., Genetic and life course determinants of cardiovascular risk factors. 2012.
139. Rantakallio, P., Groups at risk in low birth weight infants and perinatal mortality. Acta Paediatr Scand, 1969. 193: p. Suppl 193:1+.
140. Soininen, P., et al., High-throughput serum NMR metabonomics for cost-effective holistic studies on systemic metabolism. Analyst, 2009. 134(9): p. 1781-5.
141. Calder, P.C., Functional Roles of Fatty Acids and Their Effects on Human Health. JPEN J Parenter Enteral Nutr, 2015. 39(1 Suppl): p. 18S-32S.
142. Glick, N.R. and M.H. Fischer, The Role of Essential Fatty Acids in Human Health. SAGE, 2013. 18(4): p. 268-289.
143. Bellos, E., Statistical methods for elucidating copy number variation in high-throughput sequencing studies. 2014.

144. Sims, D., et al., Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet, 2014. 15(2): p. 121-32.
145. Wang, X., et al., CNVcaller: highly efficient and widely applicable software for detecting copy number variations in large populations. Gigascience, 2017. 6(12): p. 1-12.
146. Xie, C. and M.T. Tammi, CNV-seq, a new method to detect copy number variation using high-throughput sequencing. BMC Bioinformatics, 2009. 10: p. 80.
147. Abyzov, A., et al., CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res, 2011. 21(6): p. 974-84.
148. Szatkiewicz, J.P., et al., Improving detection of copy-number variation by simultaneous bias correction and read-depth segmentation. Nucleic Acids Res, 2013. 41(3): p. 1519-32.
149. Magi, A., et al., Read count approach for DNA copy number variants detection. Bioinformatics, 2012. 28(4): p. 470-8.
150. Vardhanabhuti, S., et al., Parametric modeling of whole-genome sequencing data for CNV identification. Biostatistics, 2014. 15(3): p. 427-41.
151. Yoon, S., et al., Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res, 2009. 19(9): p. 1586-92.
152. Duan, J., et al., Comparative studies of copy number variation detection methods for next-generation sequencing technologies. PLoS One, 2013. 8(3): p. e59128.
153. Holmes, I. and G.M. Rubin, An expectation maximization algorithm for training hidden substitution models. J Mol Biol, 2002. 317(5): p. 753-64.
154. Bellos, E., M.R. Johnson, and L.J. Coin, cnvHiTSeq: integrative models for high-resolution copy number variation detection and genotyping using population sequencing data. Genome Biol, 2012. 13(12): p. R120.
155. Genome of the Netherlands Consortium, Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nat Genet, 2014. 46(8): p. 818-25.
156. Ng, P.C. and E.F. Kirkness, in Genetic Variation. 2010, Springer. p. 215-226.
157. MacDonald, J.R., et al., The Database of Genomic Variants: a curated collection of structural variation in the human genome. Nucleic Acids Res, 2014. 42(Database issue): p. D986-92.
158. Walker, L.C., et al., Evaluation of copy-number variants as modifiers of breast and ovarian cancer risk for BRCA1 pathogenic variant carriers. Eur J Hum Genet, 2017. 25(4): p. 432-438.
159. Li, M., et al., Deletion of the late cornified envelope genes LCE3C and LCE3B is associated with psoriasis in a Chinese population. J Invest Dermatol, 2011. 131(8): p. 1639-43.
160. Pinto, D., et al., Functional impact of global rare copy number variation in autism spectrum disorders. Nature, 2010. 466(7304): p. 368-72.
161. Peila, R., et al., Type 2 diabetes, APOE gene, and the risk for dementia and related pathologies: The Honolulu-Asia Aging Study. Diabetes, 2002. 51(4): p. 1256-62.
162. Tai, J., et al., Neuroprotective effects of a triple GLP-1/GIP/glucagon receptor agonist in the APP/PS1 transgenic mouse model of Alzheimer's disease. Brain Res, 2018. 1678: p. 64-74.

163. Barnes, C., et al., A robust statistical method for case-control association testing with copy number variation. Nat Genet, 2008. 40(10): p. 1245-52.
164. Zollner, S., et al., Bayesian EM algorithm for scoring polymorphic deletions from SNP data and application to a common CNV on 8q24. Genet Epidemiol, 2009. 33(4): p. 357-68.

Appendix

Box 1

• Minor allele frequency (MAF):
✓ Common variants: MAF > 5%
✓ Low-frequency variants: 1% ≤ MAF ≤ 5%
✓ Rare variants: MAF < 1%

• Gene region: We defined a gene region by its transcriptional start and stop sites and expanded it by using a 20kb window around each gene

• Linkage disequilibrium (LD):
✓ High LD: r² ≥ 0.8
✓ Moderate LD: 0.6 < r² < 0.8
✓ Modest LD: 0.4 < r² ≤ 0.6
✓ Mild LD: 0.2 < r² ≤ 0.4
✓ Low LD: 0.01 ≤ r² ≤ 0.2
✓ No LD: r² < 0.01

• Read depth (RD): Read depth is the only alignment feature that is directly associated with CNV. It counts the reads mapped to each genomic region, where reads are the short sequence fragments generated by NGS, ranging in length from 35bp to 300bp.

• Credible set: A credible set was defined as the set of SNP variants with a high probability of being causal for each disease-associated signal
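The thresholds in Box 1 can be applied programmatically. The following Python sketch is illustrative only and is not the code used in this thesis. The function names (classify_maf, classify_ld, gene_region) are hypothetical. It reads the mild and low LD classes as 0.2 < r² ≤ 0.4 and 0.01 ≤ r² ≤ 0.2, assigns a value of exactly 0.8 to the high class, and assumes that the 20kb window is added on each side of the gene.

```python
def classify_maf(maf):
    """Return the Box 1 frequency class for a minor allele frequency (0 to 0.5)."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("MAF must lie between 0 and 0.5")
    if maf > 0.05:
        return "common"
    if maf >= 0.01:
        return "low-frequency"
    return "rare"


def classify_ld(r2):
    """Return the Box 1 LD class for a pairwise r-squared value (0 to 1)."""
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("r2 must lie between 0 and 1")
    if r2 >= 0.8:
        return "high"      # assumption: exactly 0.8 counts as high
    if r2 > 0.6:
        return "moderate"
    if r2 > 0.4:
        return "modest"
    if r2 > 0.2:
        return "mild"      # assumption: mild is 0.2 < r2 <= 0.4
    if r2 >= 0.01:
        return "low"       # assumption: low is 0.01 <= r2 <= 0.2
    return "none"


def gene_region(tx_start, tx_end, window=20_000):
    """Expand a transcript interval by `window` bp on each side (assumed),
    clamping the start at position 1."""
    if tx_start > tx_end:
        raise ValueError("tx_start must not exceed tx_end")
    return max(1, tx_start - window), tx_end + window
```

For example, classify_ld(0.993) places the rs3101336/rs2568958 pair from Appendix Table 1 in the high LD class, and gene_region(100000, 150000) yields the interval (80000, 170000).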

Appendix Table 1: Pairwise LDs between the 53 identified CNV-tagging SNPs

Chr1
rsID        rs3101336   rs2568958   rs4085613   rs4845454   rs1581803   rs6677604   rs13376709
rs3101336   1           0.993       0.001       0.002       0.002       0.009       0.008
rs2568958   0.993       1           0.001       0.001       0.001       0.009       0.008
rs4085613   0.001       0.001       1           0.919       0.918       0.002       0.005
rs4845454   0.002       0.001       0.919       1           0.998       0.003       0.009
rs1581803   0.002       0.001       0.918       0.998       1           0.003       0.009
rs6677604   0.009       0.009       0.002       0.003       0.003       1           0.014
rs13376709  0.008       0.008       0.005       0.009       0.009       0.014       1

Chr2
rsID        rs10205487  rs6725887   rs7582720   rs72934505
rs10205487  1           0.003       0.003       0.002
rs6725887   0.003       1           0.996       0.873
rs7582720   0.003       0.996       1           0.869
rs72934505  0.002       0.873       0.869       1

Chr3
rsID        rs1320900   rs3772255   rs1969253
rs1320900   1           0.015       0
rs3772255   0.015       1           0.001
rs1969253   0           0.001       1

Chr6
rsID        rs114985235 rs9271100   rs9272346   rs9273373   rs1063355   rs3077      rs9296736
rs114985235 1           0.004       0.002       0.002       0.002       0.004       0
rs9271100   0.004       1           0.31        0.316       0.314       0.004       0.003
rs9272346   0.002       0.31        1           0.74        0.735       0.001       0
rs9273373   0.002       0.316       0.74        1           0.961       0.001       0
rs1063355   0.002       0.314       0.735       0.961       1           0.001       0
rs3077      0.004       0.004       0.001       0.001       0.001       1           0.009
rs9296736   0           0.003       0           0           0           0.009       1

Chr7
rsID        rs12531540  rs849142    rs849135    rs180242
rs12531540  1           0.805       0.721       0
rs849142    0.805       1           0.9         0
rs849135    0.721       0.9         1           0
rs180242    0           0           0           1

Chr10
rsID        rs2380205   rs17600642  rs10490924  rs3750848   rs3750846   rs3793917   rs11200638
rs2380205   1           0           0.004       0.004       0.004       0.004       0.004
rs17600642  0           1           0.002       0.002       0.002       0.002       0.002
rs10490924  0.004       0.002       1           0.999       0.974       0.933       0.926
rs3750848   0.004       0.002       0.999       1           0.975       0.934       0.927
rs3750846   0.004       0.002       0.974       0.975       1           0.959       0.952
rs3793917   0.004       0.002       0.933       0.934       0.959       1           0.964
rs11200638  0.004       0.002       0.926       0.927       0.952       0.964       1

Chr12
rsID        rs1031391   rs10772915
rs1031391   1           0
rs10772915  0           1

Chr13
rsID        rs609418    rs1262778
rs609418    1           0
rs1262778   0           1

Chr16
rsID        rs12325245  rs2865531
rs12325245  1           0.001
rs2865531   0.001       1

Chr17
rsID        rs9898058   rs11650106
rs9898058   1           0.006
rs11650106  0.006       1

Chr18
rsID        rs11877878  rs1436904
rs11877878  1           0.002
rs1436904   0.002       1

Chr19
rsID        rs7247513   rs296396
rs7247513   1           0.005
rs296396    0.005       1

Chr22
rsID        rs1053593   rs138740    rs138777    rs12628403  rs6006893
rs1053593   1           0.897       0.857       0.006       0.02
rs138740    0.897       1           0.941       0.004       0.014
rs138777    0.857       0.941       1           0.004       0.013
rs12628403  0.006       0.004       0.004       1           0.011
rs6006893   0.02        0.014       0.013       0.011       1
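
Pairwise r² between two biallelic sites is conventionally defined as D² / (p_A (1 − p_A) p_B (1 − p_B)), where D = p_AB − p_A p_B, p_A and p_B are the frequencies of the alleles coded 1 at each site, and p_AB is the frequency of haplotypes carrying both. The Python sketch below computes this quantity from phased haplotypes coded 0/1. The function name and the input encoding are assumptions made for illustration, and this is not the software used to produce Appendix Table 1.

```python
def r_squared(hap_a, hap_b):
    """Pairwise LD (r-squared) between two biallelic sites.

    hap_a, hap_b: equal-length sequences of 0/1 alleles, one entry per
    phased haplotype (hypothetical encoding chosen for this sketch).
    Returns NaN when either site is monomorphic, since r-squared is
    undefined in that case.
    """
    if len(hap_a) != len(hap_b) or len(hap_a) == 0:
        raise ValueError("haplotype vectors must be non-empty and of equal length")
    n = len(hap_a)
    p_a = sum(hap_a) / n                                   # frequency of allele 1 at site A
    p_b = sum(hap_b) / n                                   # frequency of allele 1 at site B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n    # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")
    d = p_ab - p_a * p_b                                   # coefficient of disequilibrium D
    return d * d / denom
```

Two sites carried on exactly the same haplotypes give r² = 1, and two sites whose alleles are distributed independently across haplotypes give r² = 0.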

Appendix Table 2: Credible set of common variants for FAs

Locus name  Chr:position (b37)  SNP
PCSK9       1:55505647          rs11591147
GCKR        2:27730940          rs1260326
LPXN        11:58221445         rs188015703
FADS1       11:61588305         rs174564
GPR137      11:63849812         rs1006207
ZNF259      11:64131280         rs3782101
LIPC        15:58683366         rs1532085
PDXDC1      16:15172118         rs11644601
PBX4        19:19667254         rs143988316
APOE        19:45411941         rs429358

Appendix Table 3: Credible set of rare variant loci at the top transcript in FADS region

(a) Finnish credible set (b) Dutch credible set

Finnish       Dutch
11:61295920   11:61351587
11:61385870   11:61351882
11:61408729   11:61359230
11:61439288   11:61361550
11:61460673   11:61395182
11:61495943   11:61399876
11:61509711   11:61399916
11:61545842   11:61428225
11:61580504   11:61462091
11:61597461   11:61463157
11:61641717   11:61470111
11:61685755   11:61472621
11:61696999   11:61472973
11:61710547   11:61484563
11:61710872   11:61489874
11:61728147   11:61491097
11:61990026   11:61494650
11:62044156   11:61520098
11:62172978   11:61524017
11:62178954   11:61567140
-             11:61580504
-             11:61590544
-             11:61602609
-             11:61607047
-             11:61611078
-             11:61625056
-             11:61649030
-             11:61652504
-             11:61657741
-             11:61662448
-             11:61664968
-             11:61699640
-             11:61722738
-             11:61725010
-             11:61727438
-             11:61730637
-             11:61731508
-             11:61750516
-             11:61769066
-             11:61769837
-             11:61782134
-             11:61798234

-             11:61798362
-             11:61803863
-             11:61805045
-             11:61808233
-             11:61851551
-             11:61898768
-             11:61918659
-             11:61922209
-             11:61925119
-             11:61947514
-             11:61952495
-             11:62003198
-             11:62013387
-             11:62021881
-             11:62030661
-             11:62051974
-             11:62053701
-             11:62053858
-             11:62055927
-             11:62058068
-             11:62059909
-             11:62060170
-             11:62064505
-             11:62065277
-             11:62067278
-             11:62076154
-             11:62079014
-             11:62082555
-             11:62082888
-             11:62099762
-             11:62103851
-             11:62107941
-             11:62110110
-             11:62113291
-             11:62123668
-             11:62145722
-             11:62146094
-             11:62149280
-             11:62151257
-             11:62154844
-             11:62160435
-             11:62169169
-             11:62186805
-             11:62190293
-             11:62193116
-             11:62258385
-             11:62286666

Appendix Table 4: FAs SNPs reaching genome-wide significance

NTR SNPs NFBC1986 SNPs NFBC1966 SNPs chr pos chr pos chr pos 11 61356813 11 61356112 11 61356813 11 61356816 11 61356178 11 61356815 11 61356818 11 61356741 11 61356818 11 61356982 11 61356816 11 61450325 11 61357642 11 61357571 11 61450325 11 61358273 11 61358661 11 61475463 11 61358925 11 61358726 11 61482748 11 61359818 11 61359230 11 61484708 11 61361550 11 61359818 11 61490879 11 61362439 11 61360441 11 61496490 11 61362729 11 61361530 11 61498825 11 61362774 11 61362013 11 61502275 11 61362796 11 61362505 11 61502998 11 61363594 11 61362774 11 61503297 11 61364121 11 61365427 11 61507120 11 61364520 11 61366120 11 61528734 11 61365427 11 61366306 11 61367924 11 61539170 11 61366681 11 61367972 11 61544244 11 61366773 11 61368724 11 61545821 11 61368097 11 61369674 11 61553356 11 61370215 11 61371308 11 61553565 11 61370647 11 61371635 11 61567669 11 61370853 11 61371889 11 61580863 11 61370865 11 61372570 11 61588857 11 61370958 11 61373194 11 61591973 11 61373780 11 61373445 11 61605352 11 61374059 11 61373644 11 61607827 11 61448205 11 61448696 11 61621481 11 61449214 11 61449844 11 61668007 11 61450323 11 61450296 11 61674027 11 61450428 11 61450323 11 61674159 11 61451934 11 61450324 11 61675088 11 61452342 11 61450612 11 61678137 11 61452827 11 61452673 11 61729801 11 61453015 11 61453591 11 61732811 11 61453235 11 61454854 11 61897580 11 61453537 11 61455052 11 61897630 11 61453822 11 61455382 11 61672381 11 61453985 11 61455995 11 61656877 11 61453993 11 61456225 11 61510829 11 61456160 11 61456258 11 61609448 11 61456225 11 61456878 11 61360701

11 61456420 11 61457426 11 61671074 11 61458147 11 61457616 11 61904551 11 61459665 11 61458046 11 61727747 11 61460311 11 61458972 11 61628615 11 61460673 11 61459015 11 61374059 11 61460859 11 61459102 11 61460956 11 61460968 11 61459208 11 61677674 11 61461048 11 61460507 11 61670438 11 61462091 11 61461048 11 61358857 11 61463157 11 61461710 11 61669571 11 61463282 11 61462338 11 61897404 11 61463351 11 61463126 11 61599991 11 61463409 11 61465154 11 61609708 11 61463652 11 61465306 11 61687454 11 61464124 11 61465755 11 61359254 11 61464436 11 61465806 11 61720004 11 61464674 11 61466699 11 61495943 11 61465032 11 61467009 11 61370647 11 61465154 NA NA 11 61466521 11 61467282 11 61467182 11 61467287 11 61610594 11 61467301 11 61467983 11 61723633 11 61467408 11 61468340 11 61541769 11 61467721 11 61468728 11 61686613 11 61467947 11 61469180 11 61673187 11 61469229 11 61469614 11 61612560 11 61470111 11 61470238 11 61630133 11 61470630 11 61470405 11 61620079 11 61470670 11 61470408 11 61891611 11 61471290 11 61470902 11 61524172 11 61471637 11 61471092 11 61467301 11 61472609 11 61471142 11 61492620 11 61472621 11 61471324 11 61681137 11 61472841 11 61473954 11 61717563 11 61472973 11 61474156 11 61456420 11 61473187 11 61474419 11 61463282 11 61474899 11 61474593 11 61548004 11 61475289 11 61474870 11 61483671 11 61476680 11 61474899 11 61680386 11 61478771 11 61475770 11 61465654 11 61480038 11 61476680 11 61607117 11 61480075 11 61477394 11 61654471 11 61480435 11 61478614 11 61905659 11 61482216 11 61479863 11 61449214 11 61482748 11 61480419 11 61575405 11 61483671 11 61482585 11 61679346 11 61484015 11 61482965 11 61453537 11 61484563

11 61484708 11 61483725 11 61605608 11 61485760 11 61484660 11 61670646 11 61487180 11 61484947 11 61459665 11 61487510 11 61485337 11 61908692 11 61488769 11 61485784 11 61611710 11 61489874 11 61486429 11 61472609 11 61490619 11 61486777 11 61574543 11 61491250 11 61487510 11 61543408 11 61491395 11 61488661 11 61358273 11 61491467 11 61489615 11 61626161 11 61492020 11 61489622 11 61719070 11 61492087 11 61490806 11 61356982 11 61492620 11 61491299 11 61575542 11 61492671 11 61491797 11 61575763 11 61493241 11 61491804 11 61582297 11 61494650 11 61492020 11 61549399 11 61495790 11 61492029 11 61687693 11 61495917 11 61492739 11 61670745 11 61495943 11 61600167 11 61496942 11 61493008 11 61499206 11 61493682 11 61659285 11 61500020 11 61493880 11 61591059 11 61500056 11 61494664 11 61902595 11 61500665 11 61495319 11 61717400 11 61500978 11 61495917 11 61613392 11 61501051 11 61496929 11 61601152 11 61501097 11 61496942 11 61579058 11 61501577 11 61496958 11 61615005 11 61502028 11 61498975 11 61593396 11 61504315 11 61499418 11 61621185 11 61507065 11 61500862 11 61730637 11 61507473 11 61502606 11 61686335 11 61508127 11 61502864 11 61587082 11 61508409 11 61502911 11 61733576 11 61509197 11 61503878 11 61494520 11 61509711 11 61504553 11 61357641 11 61510829 11 61505008 11 61368977 11 61511271 11 61505698 11 61546414 11 61511794 11 61506322 11 61666117 11 61513071 11 61506511 NA NA 11 61521502 11 61507823 11 61605005 11 61522383 11 61509146 11 61508127 11 61522959 11 61509347 11 61539878 11 61525217 11 61509378 11 61452342 11 61525281 11 61509808 11 61720750 11 61525567 11 61510813 11 61615538 11 61526688 11 61512311 11 61601260 11 61527632

11 61527710 11 61513359 11 61574562 11 61528479 11 61514280 11 61460673 11 61528910 11 61520446 11 61527127 11 61529656 11 61520574 11 61513070 11 61530452 11 61520682 11 61509832 11 61530500 11 61521025 11 61528951 11 61531031 11 61521882 11 61493400 11 61531297 11 61521905 11 61901272 11 61531423 11 61522728 11 61370215 11 61531810 11 61523167 11 61448205 11 61532358 11 61523808 11 61472841 11 61532453 11 61525074 11 61602053 11 61533013 11 61525138 11 61645904 11 61534291 11 61525567 11 61617588 11 61535589 11 61525677 11 61914533 11 61536181 11 61525922 11 61725074 11 61538489 11 61526422 11 61544365 11 61539170 11 61527370 11 61542485 11 61539878 11 61656578 11 61540781 11 61527527 11 61540896 11 61528879 11 61364121 11 61541769 11 61529129 11 61548193 11 61542485 11 61529254 11 61686209 11 61543311 11 61529656 11 61494640 11 61545842 11 61533013 11 61531423 11 61546023 11 61534126 11 61892297 11 61547951 11 61535589 11 61720266 11 61548004 11 61535692 11 61619513 11 61548081 11 61537014 11 61610210 11 61548796 11 61537043 11 61671923 11 61549399 11 61537117 11 61453822 11 61549792 11 61537456 11 61525281 11 61551446 11 61538525 11 61479572 11 61551601 11 61538951 11 61685755 11 61551695 11 61539475 11 61461808 11 61552108 11 61540740 11 61911579 11 61552354 11 61540781 11 61731508 11 61553356 11 61541223 11 61479920 11 61554260 11 61541383 11 61894988 11 61554588 11 61541429 11 61627355 11 61567670 11 61542063 11 61651086 11 61567721 11 61543224 11 61900103 11 61567955 11 61543358 11 61588031 11 61568827 11 61543411 11 61369585 11 61569392 11 61545058 11 61720927 11 61570073 11 61545490 11 61615878 11 61570191 11 61545814 11 61493325 11 61571866

11 61572273 11 61546153 11 61920440 11 61572522 11 61546991 11 61514183 11 61573773 11 61548081 11 61463652 11 61573791 11 61548216 11 61471637 11 61573986 11 61548446 11 61627286 11 61574543 11 61549658 11 61721532 11 61574892 11 61550855 11 61678427 11 61574999 11 61550882 11 61631064 11 61575210 11 61551695 11 61717952 11 61575311 11 61552206 11 61573986 11 61579058 11 61552717 11 61917991 11 61579381 11 61553198 11 61619786 11 61580504 11 61554588 11 61546023 11 61580863 11 61568475 11 61492671 11 61581412 11 61569587 11 61909147 11 61582297 11 61570093 11 61478549 11 61582527 11 61571023 11 61896825 11 61596640 11 61571069 11 61616347 11 61597461 11 61596640 11 61597814 11 61571369 11 61598391 11 61571649 11 61370958 11 61598579 11 61572423 11 61501097 11 61599329 11 61573159 11 61542513 11 61599991 11 61574596 NA NA 11 61600167 11 61575795 11 61643775 11 61600699 11 61575910 11 61498333 11 61601202 11 61577190 11 61505035 11 61601260 11 61578896 11 61450428 11 61601682 11 61579259 11 61901716 11 61602053 11 61579615 11 61904557 11 61602463 11 61580878 11 61680130 11 61602572 11 61581412 11 61548632 11 61602632 11 61582005 11 61543419 11 61602836 11 61583773 11 61674298 11 61603228 11 61583774 11 61726953 11 61603561 11 61583957 11 61606638 11 61603725 11 61584513 11 61731608 11 61603944 11 61596280 11 61611500 11 61604324 11 61596653 11 61453985 11 61604450 11 61597033 11 61651218 11 61604584 11 61597527 11 61920225 11 61605487 11 61597554 11 61500020 11 61605578 11 61598330 11 61552108 11 61605608 11 61599407 11 61666737 11 61606275 11 61599878 11 61370865 11 61606500 11 61600632 11 61494270 11 61606638 11 61601682 11 61632895 11 61606933

11 61607047 11 61602810 11 61465032 11 61607107 11 61602836 11 61470630 11 61607457 11 61603609 11 61538489 11 61607964 11 61604317 11 61674248 11 61609323 11 61605889 11 61726251 11 61609448 11 61606737 11 61618422 11 61609708 11 61607118 11 61894872 11 61609833 11 61607292 11 61526688 11 61611033 11 61607678 11 61589843 11 61611078 11 61608649 11 61528479 11 61611414 11 61609179 11 61487180 11 61611428 11 61611255 11 61621036 11 61611500 11 61611636 11 61554260 11 61611694 11 61612576 11 61903687 11 61611710 11 61613019 11 61460311 11 61611913 11 61613153 11 61366746 11 61611933 11 61613743 11 61567158 11 61612036 11 61615170 11 61472621 11 61612468 11 61893471 11 61612560 11 61616748 11 61613137 11 61616861 11 61589682 11 61613225 11 61617410 11 61361066 11 61613392 11 61619319 11 61366398 11 61613393 11 61619666 11 61366399 11 61613508 11 61619783 11 61370853 11 61613841 11 61620501 11 61455739 11 61613889 11 61620621 11 61626856 11 61613895 11 61620622 11 61727838 11 61613897 11 61620648 11 61731977 11 61614023 11 61621727 11 61685790 11 61614403 11 61622349 11 61574999 11 61614572 11 61622605 11 61594164 11 61615542 11 61623690 11 61371283 11 61615903 11 61624065 11 61633736 11 61616069 11 61624933 11 61731810 11 61616322 11 61625062 11 61727438 11 61617015 11 61625084 11 61722627 11 61617339 11 61626803 11 61910612 11 61617410 11 61628127 11 61613393 11 61617588 11 61629143 11 61721088 11 61617753 11 61629360 11 61671169 11 61618016 11 61629938 11 61567955 11 61618088 11 61630029 11 61528910 11 61619181 11 61630950 11 61600699 11 61619294 11 61584705 11 61534291 11 61619786 11 61584936 11 61903375 11 61620079 11 61584944 11 61460968 11 61620621

11 61620622 11 61586108 11 61619334 11 61621185 11 61586518 11 61733521 11 61621481 11 61587575 11 61719762 11 61621611 11 61588103 11 61612922 11 61622191 11 61588320 11 61589587 11 61622437 11 61590177 11 61477655 11 61622483 11 61590179 11 61527710 11 61622501 11 61590249 11 61551607 11 61622968 11 61591993 11 61726428 11 61623690 11 61592239 11 61500056 11 61624920 11 61592522 11 61623679 11 61625056 11 61593490 11 61610414 11 61625112 11 61595363 11 61909899 11 61625115 11 61595487 11 61607447 11 61625997 11 61632669 11 61620488 11 61626161 11 61634611 11 61672107 11 61626599 11 61634650 11 61500978 11 61627203 11 61634727 NA NA 11 61627960 11 61609434 11 61628503 11 61641185 11 61628602 11 61642287 11 61643135 11 61628615 11 61642310 11 61450396 11 61628996 11 61642680 11 61529430 11 61629143 11 61642730 11 61628503 11 61629576 11 61644773 11 61367871 11 61630133 11 61644928 11 61907265 11 61631510 11 61645751 11 61361580 11 61631521 11 61645805 11 61634429 11 61631611 11 61646559 11 61631671 11 61585140 11 61646569 11 61575210 11 61586116 11 61647678 11 61494359 11 61586518 11 61647729 11 61911760 11 61586571 11 61648472 11 61471290 11 61587082 11 61648487 11 61725010 11 61587337 11 61653642 11 61677808 11 61588103 11 61653659 11 61371781 11 61588299 11 61655823 11 61634098 11 61589587 11 61656436 NA NA 11 61589843 11 61656981 11 61588299 11 61591059 11 61657457 NA NA 11 61591993 11 61658139 11 61532845 11 61592111 11 61658157 11 61600375 11 61592200 11 61659048 NA NA 11 61593744 11 61659289 11 61357422 11 61594164 11 61664889 NA NA 11 61631803 11 61666181 11 61655020 11 61632072 11 61666192 11 61470670 11 61632310

11 61633736 11 61666844 11 61721090 11 61634820 11 61666929 11 61648514 11 61641655 11 61667605 11 61580504 11 61641686 11 61669162 11 61573773 11 61641717 11 61669667 11 61481597 11 61643135 11 61672193 11 61605715 11 61643775 11 61673517 11 61604450 11 61644269 11 61674280 11 61567701 11 61644773 11 61674424 11 61910664 11 61646999 11 61674762 11 61656856 11 61647410 11 61675152 11 61544879 11 61647678 11 61677790 11 61569098 11 61648113 11 61680156 11 61568518 11 61648472 11 61680850 11 61623353 11 61648514 11 61681252 11 61456573 11 61648649 11 61683690 11 61372243 11 61648832 11 61684314 11 61482170 11 61649030 11 61687195 11 61505594 11 61649457 11 61356045 11 61650102 11 61717412 11 61650520 11 61719015 11 61732941 11 61650965 11 61719723 11 61366073 11 61651086 11 61720363 11 61916577 11 61651171 11 61722045 11 61613551 11 61651676 11 61722283 11 61450000 11 61652514 11 61722393 11 61577671 11 61652714 11 61722738 11 61632072 11 61652981 11 61722765 11 61500171 11 61653164 11 61723003 11 61628601 11 61656344 11 61725170 11 61357179 11 61656981 11 61725444 11 61609721 11 61657001 11 61726034 11 61725247 11 61657195 11 61726743 11 61649601 11 61657453 11 61727844 11 61603011 11 61658099 11 61729587 11 61570702 11 61666117 11 61729931 11 61633202 11 61666192 11 61730832 11 61458147 11 61666255 11 61731803 11 61599379 11 61666306 11 61732129 11 61509711 11 61666543 11 61733539 11 61729468 11 61666737 11 61734482 11 61620335 11 61666780 11 61891818 11 61907745 11 61667879 11 61892097 11 61600492 11 61667904 11 61893191 11 61522383 11 61668117 11 61893989 11 61480607 11 61668272 11 61895084 11 61583233 11 61668383 11 61895664 11 61686144 11 61668897

11 61669299 11 61897803 11 61909032 11 61669411 11 61898528 11 61668564 11 61669571 11 61902122 11 61907425 11 61669669 11 61902292 11 61625997 11 61669850 11 61903288 NA NA 11 61669946 11 61903428 11 61616380 11 61670136 11 61903442 11 61898730 11 61670288 11 61904811 11 61573791 11 61670438 11 61905584 NA NA 11 61670645 11 61905622 11 61567140 11 61670646 11 61907592 11 61486891 11 61670745 11 61908983 11 61918659 11 61671074 11 61910053 11 61621950 11 61671923 11 61911543 11 61534101 11 61672147 11 61911806 11 61679110 11 61672381 11 61913368 11 61587337 11 61674082 11 61914245 11 61533214 11 61674248 11 61914293 NA NA 11 61674277 11 61357480 11 61674297 11 61914643 11 61674298 11 61918850 11 61527403 11 61675378 11 61918947 11 61531913 11 61675714 11 61356045 11 61491870 11 61675751 11 61356422 11 61569392 11 61676505 11 61356813 11 61530743 11 61676654 11 61356818 11 61720911 11 61676897 11 61356876 11 61618887 11 61677435 11 61356886 11 61670788 11 61677693 11 61356982 11 61906341 11 61678137 11 61357179 11 61473686 11 61678327 11 61357422 NA NA 11 61678427 11 61357480 NA NA 11 61678932 11 61357641 11 61463352 11 61679346 11 61358273 11 61511528 11 61679961 11 61358925 11 61631803 11 61680386 11 61359218 NA NA 11 61681347 11 61359254 11 61684417 11 61682418 11 61359511 11 61453015 11 61683324 11 61360701 11 61728105 11 61683676 11 61361066 11 61683676 11 61684314 11 61361580 11 61478200 11 61684417 11 61361699 11 61894782 11 61685096 11 61362323 11 61631610 11 61685191 11 61362439 NA NA 11 61685866 11 61362681 11 61613949 11 61686074 11 61362981 11 61359511 11 61686124 11 61363327 11 61592620 11 61686144

11 61686176 11 61364121 11 61911752 11 61686209 11 61364392 11 61547669 11 61687130 11 61366376 11 61572637 11 61687454 11 61366398 11 61903753 11 61687693 11 61366399 11 61472566 11 61717400 11 61366746 11 61364392 11 61717563 11 61367871 11 61672727 11 61718173 11 61368097 NA NA 11 61718228 11 61368807 11 61525217 11 61718669 11 61368977 11 61489874 11 61719026 11 61369340 11 61728481 11 61719490 11 61369529 11 61586052 11 61719902 11 61369585 11 61465175 11 61720266 11 61370149 11 61551601 11 61720537 11 61370215 11 61451430 11 61720927 11 61370545 11 61356876 11 61721718 11 61370647 11 61917406 11 61722237 11 61370853 11 61531585 11 61722627 11 61459765 11 61722738 11 61370865 11 61723633 11 61370889 11 61549792 11 61725010 11 61370934 11 61540654 11 61725074 11 61370958 11 61567670 11 61725444 11 61371854 11 61603944 11 61725456 11 61372141 11 61504315 11 61726251 11 61372480 11 61373272 11 61726717 11 61372558 11 61508409 11 61726953 11 61373272 11 61370889 11 61727438 11 61374393 11 61369711 11 61727747 11 61448996 11 61719026 11 61727838 11 61449214 11 61631510 11 61728105 11 61450428 11 61546888 11 61728147 11 61451430 11 61511271 11 61728461 11 61451553 11 61914369 11 61728481 11 61451951 11 61606500 11 61729932 11 61452342 11 61919391 11 61730637 11 61452741 11 61669946 11 61730854 11 61453015 11 61537383 11 61731508 11 61453235 11 61622905 11 61731810 11 61453421 11 61641889 11 61731977 11 61453537 11 61666214 11 61732129 11 61453822 11 61728147 11 61732658 11 61453985 11 61720508 11 61732811 11 61454216 11 61525727 11 61732941 11 61455846 11 61623983 11 61733349 11 61456045 11 61917552 11 61734028 11 61456160 11 61469229 11 61891646

11 61892095 11 61456420 11 61530500 11 61892203 11 61456568 11 61676505 11 61893216 11 61457889 11 61632298 11 61893663 11 61458031 11 61584631 11 61893690 11 61458147 11 61652203 11 61893694 11 61459600 11 61641717 11 61893721 11 61459665 11 61682418 11 61893938 11 61460127 11 61579427 11 61893964 11 61460311 11 61620274 11 61894180 11 61460673 11 61368097 11 61894244 11 61460968 11 61456045 11 61894482 11 61461151 11 61464674 11 61894491 11 61463282 11 61473187 11 61894801 11 61463352 11 61547951 11 61894874 11 61463652 11 61602572 11 61894891 11 61463748 11 61674279 11 61894988 11 61464436 11 61674821 11 61895084 11 61464674 11 61686176 11 61896029 11 61613575 11 61896148 11 61465032 11 61896825 11 61465175 11 61461151 11 61897073 11 61465654 11 61728461 11 61897298 11 61466521 11 61555793 11 61897404 11 61467144 11 61905913 11 61897409 11 61467408 11 61362323 11 61897753 11 61467699 11 61719902 11 61898174 11 61467721 11 61627960 11 61898325 11 61467947 11 61374393 11 61898497 11 61468191 11 61686074 11 61898768 11 61469229 11 61545842 11 61898982 11 61470630 11 61359218 11 61899041 11 61470670 11 61472464 11 61899284 11 61472609 11 61452741 11 61899378 11 61472621 11 61498526 11 61900103 11 61472841 11 61570073 11 61901484 11 61473187 11 61733349 11 61901893 11 61475195 11 61687130 11 61901950 11 61475289 11 61568827 11 61902080 11 61475529 11 61456160 11 61902122 11 61476534 11 61644707 11 61902449 11 61478771 11 61892110 11 61903100 11 61479920 11 61672629 11 61903375 11 61480038 11 61650926 11 61903559 11 61480075 11 61631690 11 61903687 11 61480607 11 61718173 11 61903753 11 61480850 11 61490619 11 61903905 11 61481597 11 61454216 11 61904551

11 61904557 11 61482748 11 61685866 11 61905129 11 61483671 11 61654092 11 61905659 11 61484708 11 61726717 11 61905913 11 61485156 11 61586116 11 61906654 11 61485474 11 61721718 11 61906842 11 61487180 11 61667356 11 61907186 11 61489740 11 61597461 11 61907265 11 61489874 11 61630384 11 61907425 11 61490372 11 61373851 11 61907821 11 61490619 11 61453421 11 61908022 11 61491395 11 61362439 11 61908363 11 61491467 11 61631521 11 61908524 11 61491870 11 61609751 11 61908692 11 61492620 11 61920459 11 61909147 11 61492671 11 61628996 11 61909717 11 61493241 11 61358925 11 61909869 11 61493325 11 61629508 11 61910612 11 61493400 11 61607457 11 61910664

11 61910796 11 61493619 11 61910804 11 61494031 11 61911535 11 61494520 11 61911806 11 61495790 11 61912210 11 61495930 11 61913091 11 61495943 11 61913956 11 61496490 11 61914282 11 61496629 11 61914533 11 61497952 11 61916631 11 61498941 11 61916730 11 61499206 11 61917406 11 61500020 11 61917552 11 61500056 11 61917991 11 61500171 11 61918123 11 61500519 11 61918212 11 61500665 11 61918659 11 61500978 11 61919391 11 61501097 11 61919569 11 61501577 11 61920225 11 61502028 11 61503877 11 61504315 11 61504606 11 61505035 11 61505594 11 61506867 11 61507473 11 61508055

11 61508127 11 61508409 11 61509020 11 61509197 11 61509711 11 61509832 11 61510829 11 61511271 11 61511421 11 61511794 11 61511799 11 61513070 11 61514183 11 61521157 11 61522383 11 61524111 11 61525281 11 61525817 11 61526370 11 61526688 11 61527710 11 61528479 11 61528894 11 61528910 11 61528921 11 61529931 11 61530500 11 61531423 11 61531539 11 61531913 11 61532422 11 61532845 11 61533366 11 61534101 11 61534291 11 61535976 11 61539170 11 61539878 11 61540476 11 61540896 11 61541166 11 61541396 11 61541545 11 61542485 11 61542513 11 61543408


11 61544244 11 61544365 11 61544879 11 61545821 11 61545842 11 61546688 11 61546764 11 61547951 11 61548004 11 61548193 11 61549399 11 61549792 11 61551446 11 61551589 11 61551607 11 61552108 11 61553233 11 61553356 11 61553477 11 61554260 11 61555792 11 61555793 11 61567158 11 61567670 11 61567701 11 61567721 11 61567955 11 61568827 11 61569392 11 61570073 11 61570702 11 61571106 11 61573773 11 61573791 11 61574104 11 61574543 11 61574999 11 61575144 11 61575210 11 61575311 11 61575405 11 61577671 11 61579058 11 61579427 11 61580504 11 61580863


11 61582297 11 61582405 11 61582949 11 61583233 11 61597461 11 61597476 11 61597913 11 61598579 11 61599991 11 61600167 11 61600375 11 61600699 11 61601260 11 61602053 11 61602080 11 61602572 11 61603944 11 61604450 11 61605005 11 61605082 11 61605608 11 61605715 11 61606500 11 61606638 11 61607117 11 61607457 11 61607482 11 61607827 11 61609448 11 61609615 11 61609708 11 61609721 11 61609833 11 61610568 11 61611035 11 61611078 11 61611140 11 61611694 11 61611710 11 61612304 11 61612560 11 613225 11 613392 11 613393 11 613551 11 613575


11 613949 11 615538 11 615878 11 61616347 11 61618014 11 61618659 11 61619334 11 61619513 11 61619786 11 61620079 11 61620274 11 61620488 11 61621829 11 61622670 11 61624327 11 61624599 11 61625056 11 61625307 11 61625971 11 61625997 11 61627321 11 61627611 11 61628503 11 61628996 11 61629518 11 61630823 11 61631521 11 61631690 11 61586116 11 61586571 11 61587337 11 61589587 11 61589682 11 61589843 11 61591059 11 61591973 11 61592620 11 61593744 11 61594164 11 61631803 11 61631961 11 61632072 11 61632267 11 61632895 11 61633251 11 61633736


11 61633867 11 61634042 11 61641407 11 61641717 11 61641889 11 61642369 11 61643135 11 61643222 11 61643775 11 61644760 11 61644763 11 61644924 11 61645849 11 61646004 11 61646147 11 61647730 11 61648514 11 61649812 11 61650926 11 61651086 11 61651218 11 61651868 11 61654092 11 61654471 11 61655020 11 61658058 11 61658343 11 61666737 11 61667356 11 61667614 11 61668641 11 61670646 11 61670718 11 61670745 11 61671074 11 61671169 11 61671349 11 61671856 11 61671923 11 61672107 11 61672381 11 61672629 11 61674172 11 61674248 11 61674297 11 61674298


11 61674983 11 61676182 11 61676505 11 61678137 11 61678427 11 61679110 11 61679346 11 61680130 11 61680386 11 61681137 11 61682168 11 61682418 11 61683676 11 61684417 11 61685096 11 61685578 11 61685755 11 61685790 11 61685866 11 61686074 11 61686144 11 61687130 11 61687454 11 61687693 11 61717563 11 61718173 11 61718669 11 61719026 11 61719902 11 61720266 11 61720537 11 61720911 11 61720927 11 61721090 11 61721718 11 61722627 11 61723364 11 61723633 11 61724193 11 61724458 11 61724505 11 61724853 11 61725010 11 61725074 11 61726251 11 61726717


11 61726953 11 61727747 11 61728105 11 61728147 11 61728461 11 61729801 11 61729819 11 61730325 11 61730637 11 61731418 11 61731508 11 61731549 11 61731810 11 61732658 11 61732941 11 61733302 11 61733349 11 61733733 11 61734028 11 61892095 11 61892110 11 61892297 11 61893471 11 61894180 11 61894491 11 61894872 11 61894988 11 61896029 11 61896148 11 61896297 11 61896825 11 61897404 11 61897409 11 61897580 11 61897744 11 61897806 11 61898743 11 61899817 11 61900103 11 61901716 11 61901893 11 61902884 11 61903687 11 61903753 11 61904557 11 61905129


11 61905659 11 61905913 11 61906341 11 61907265 11 61907361 11 61907425 11 61907669 11 61907746 11 61907965 11 61908363 11 61908692 11 61909032 11 61909147 11 61909899 11 61910612 11 61910664 11 61912753 11 61914247 11 61914249 11 61914369 11 61914533 11 61915921 11 61916577 11 61917406 11 61917552 11 61917731 11 61917991 11 61918659 11 61919391 11 61919842 11 61920225


Appendix Table 5: 202 drug-target genes

Gene

ABCB1, ADAM10, ADIPOQ, ADORA1, ADORA2A, ADRB3, ALOX5AP, APCS,
APH1A, APH1B, APP, BDKRB2, BICD1, BRD2, BRD3, BRD4,
C5AR1, CACNA1B, CAMKK2, CASR, CCKAR, CCKBR, CCL11, CCL7,
CCL8, CCR1, CCR3, CCR5, CCR9, CD28, CD3D, CD3E,
CD3G, CD4, CDH2, CHRM3, CHRM4, SIRT1, CHRNA3, CHRNA4,
CHRNA5, CHRNA6, CHRNA7, CHRNB2, CLEC16A, CNR2, CNTN5, CTSK,
CXCL1, CXCL2, CXCL3, CXCL5, CYSLTR1, CYSLTR2, DPP3, DPP4,
DRD2, DRD3, DYRK3, EDNRA, EDNRB, EGR1, ELA2, EVI5,
FAAH, FGF10, FH, GABRA2, GABRA3, GHSR, GJD2, GLP1R,
GPBAR1, GPR119, GRIN1, GRIN2B, GRM5, GSK3B, HCRTR1, HCRTR2,
HHIP, HRH1, HRH3, HTR1A, HTR1B, HTR2C, HTR4, HTR6,
IKBKB, IL13, IL18, IL1R1, IL23A, IL28B, IL4, IL5,
IL6, IL7R, IL8, IL8RB, ITGA4, ITGAV, ITGB1, JAK3,
KCNC2, KCNMA1, KCNN4, KIAA1967, L1CAM, LDHA, LEP, LRRK2,
MAG, MAPK11, MAPK14, MCHR1, MCHR2, METAP2, MIF, MLNR,
MME, MMP12, MMP9, MS4A1, NCSTN, NFKBIL1, GPR109A, NLRP1,
NLRP3, NMNAT2, NOS2A, NR1D1, NRXN1, NTRK2, OPRK1, OPRM1,
OSM, OXTR, P2RX7, P4HA1, P4HA2, P4HB, PDE4A, PDE5A,
PGK1, PIK3CA, PLA2G7, PPARD, PRKAG1, PSEN1, PSEN2, PSENEN,
PTGDR, PTGER1, PTGES, PTGIR, PTGS1, PTGS2, PTHR1, PYGB,
RIPK2, RORA, RORC, RTN4, EDG1, SCD, SCN9A, SDHB,
SDHD, SIRT2, SIRT3, SIRT4, SIRT5, SIRT6, SIRT7, SLC10A1,
SLC10A2, SLC5A1, SLC6A4, SLC6A9, SP110, STIM1, STK39, SYK,
TACR1, TACR2, TACR3, TBXA2R, TGFB1, TGFBR1, TLR4, TLR7,
TLR9, TNFRSF1A, TNFSF11, TNNI3K, TRPC3, TRPC6, TRPM8, TRPV1,
UTS2R, ZAP70


Appendix Figure 1: Single Nucleotide Polymorphism (SNP) illustration

Appendix Figure 2: cnvHap operation – cluster positions for each probe (column). Each cross indicates a trained cluster mean. Red represents deletions, blue represents duplications and grey represents the copy-neutral state. The first and last columns show Illumina SNP probes, for which both LRR and BAF were defined, whereas the second column shows an Agilent aCGH probe, for which only the LRR was defined [51].


Appendix Figure 3: Q-Q plots of the maximum LD against the AF of the CNVs present in each population, shown separately for each of the 26 populations of the 1000G: (i) African populations, (ii) American populations, (iii) East Asian populations, (iv) South Asian populations, (v) European populations.


Appendix Figure 4: Maximum LD vs. AF of CNVs in the complete set of drug-target genes

Appendix Figure 5: Homozygous and heterozygous deletions in human populations
