Detection of Copy Number Variation (CNV) and its characterization in the Brazilian population

Ana Cláudia Martins Ciconelle

Dissertation presented to the Instituto de Matemática e Estatística of the Universidade de São Paulo to obtain the degree of Master in Bioinformatics

Program: Master's in Bioinformatics
Advisor: Prof. Dr. Júlia Maria Pavan Soler

During the development of this work the author received financial support from CNPq and CAPES

São Paulo, January 2018

This version of the dissertation contains the corrections and changes suggested by the Examining Committee during the defense of the original version of the work, held on 6 February 2018. A copy of the original version is available at the Instituto de Matemática e Estatística of the Universidade de São Paulo.

Examining Committee:

• Prof. Dr. Júlia Maria Pavan Soler - IME-USP

• Prof. Dr. Alexandre da Costa Pereira - HCFMUSP

• Prof. Dr. Benilton de Sá Carvalho - UNICAMP

Acknowledgements

I thank my parents, Claudio and Marcia, my brother Lucas, my grandmothers, Dorga and Isabel, and all the other family members who have always supported me and who make me believe in the meaning of family. I especially thank my professor, advisor and friend, Júlia M. P. Soler, who since my undergraduate research has always been available to teach, guide and advise me with great patience, affection and support. I also thank my friends from the Ciências Moleculares undergraduate program, especially Chico, Otto and Leo, and my friends from IME, who have always helped me in every possible way. I thank the professors of Ciências Moleculares and of IME for showing me the path of science. This work would not have been possible without the support of INCOR/FMUSP, which provided the data from the Corações de Baependi project, and of the agencies CNPq and CAPES, which provided financial support.

Summary

CICONELLE, A. C. M. Detection of Copy Number Variation (CNV) and its characterization in the Brazilian population. Bioinformatics Program - Instituto de Matemática e Estatística, Universidade de São Paulo, São Paulo, 2018.

Genome-wide association studies (GWAS) are a fundamental tool for associating genetic markers and genomic regions with complex diseases and phenotypes, making it possible to understand this regulatory network in more detail, to map genes and, with that, to develop diagnostic and treatment techniques. Currently, the main genetic variant used in association studies is the SNP (single nucleotide polymorphism), a variation that affects a single DNA base and is the most common type of variation both between individuals and within the genome. Despite the different techniques available for association studies, many complex diseases and traits still have part of their heritability unexplained. To contribute to these studies, reference genetic databases were created, such as HapMap and 1000 Genomes, which contain representatives of the common genetic variants of world populations (European, Asian and African). In recent years, two of the solutions adopted to try to explain the heritability of complex diseases and phenotypes have been to use different types of genetic variants and to include rare and population-specific variants. The CNV (copy number variation) is a structural variant that has been gaining ground in association studies in recent years. This variant is characterized by the deletion or duplication of a DNA region that can range from just a few base pairs to entire chromosomes, as in the case of Down syndrome. In partnership with the Instituto do Coração (InCor-FMUSP), this work uses data from the Corações de Baependi project (Baependi Heart Study) to establish a methodology for characterizing CNVs in the Brazilian population from SNP data and associating them with height. The project includes genetic and phenotype data from 1,120 related individuals (structured in families).
For CNV detection, resources from the PennCNV software are used, and the processing, normalization, identification and analysis methodologies involved are reviewed. The characterization of the CNVs obtained includes information on location, size and frequency in the population, and patterns of genetic inheritance in trios. The association of CNVs with height is carried out using linear mixed models and information on family structure. The results indicated that the Brazilian population contains (unique) regions with copy number variation that are not identified in the literature. General characteristics of the CNVs, such as size and frequency per individual, were similar to what is reported in the literature. It was also observed that CNV transmission may not follow Mendelian laws, since the frequency of trios in which one parent carries a deletion/duplication and the child is normal was higher than the frequency of trios in which the child carries the same variation. This work also identified a region on chromosome 9 that may be associated with height: carriers of a duplication in this region may have an expected decrease of approximately 3 cm in height.

Keywords: CNV, heritability, complex phenotypes, SNPs, family data.

Abstract

CICONELLE, A. C. M. Detection of Copy Number Variation (CNV) and its characterization in Brazilian population. 2016. Master in Bioinformatics - Instituto de Matemática e Estatística, Universidade de São Paulo, São Paulo, 2018.

Genome-wide association studies (GWAS) are a highly important tool for associating genetic markers, genes and genomic regions with complex phenotypes and diseases, making it possible to understand this regulation of expression in detail, as well as the genes, and then to develop new techniques for the diagnosis and treatment of diseases. Nowadays, the main genetic marker used in GWAS is the SNP (single nucleotide polymorphism), a variation that affects only one base of the DNA and is the most common type of variation both between individuals and within the genome. Even though multiple techniques are available for GWAS, several complex traits still have unexplained heritability. To contribute to these studies, reference genetic maps are being created, such as HapMap and 1000 Genomes, which contain common genetic variants from worldwide populations (including European, Asian and African populations). In recent years, two solutions adopted to address the missing heritability have been to use different types of genetic variants and to include rare and population-specific markers. Copy number variation (CNV) is a structural variant whose use in GWAS has been increasing in recent years. This variant is characterized by the deletion or duplication of a region of DNA, and its length can range from a few base pairs to a whole chromosome, as in Down syndrome. In collaboration with the Heart Institute (InCor-FMUSP), this work uses the dataset from the Baependi Heart Study to establish a methodology to characterize CNVs in the Brazilian population using SNP array data and to associate them with height. This project uses the genetic and phenotype data of 1,120 related samples (family structure). For CNV calling, resources from the PennCNV software are used, and the methodologies of preprocessing, normalization, identification and other analyses are reviewed. The characterization of CNVs includes information about location, size, frequency in our population and patterns of inheritance in trios.

The association between CNVs and height is assessed using linear mixed models and information on family structure. The results indicate that the Brazilian population has regions with variation in the number of copies that are not described in the literature. General characteristics, such as length and frequency in samples, are similar to those found in the literature. In addition, it was observed that the transmission of CNVs may not follow Mendelian laws, since the frequency of trios in which one parent has a deletion/duplication and the offspring is normal is higher than the frequency of trios in which both one parent and the offspring have the deletion/duplication. This work also identified a region on chromosome 9 that could be associated with height: carriers of a duplication in this region can have an expected height decrease of approximately 3 cm.

Keywords: CNV, heritability, complex phenotypes, SNPs, family data.

Contents

List of Abbreviations

List of Figures

List of Tables

1 Introduction

2 Copy Number Variation (CNV)
  2.1 Biological background
  2.2 Mechanisms for CNVs Generation and CNV Transmission
  2.3 CNV Calling
  2.4 Association Studies and CNVs
    2.4.1 CNVs and Height

3 Materials and Methods
  3.1 Dataset
    3.1.1 SNP array platform
  3.2 Methodology Overview
  3.3 Preprocessing of SNP data
    3.3.1 Quantile normalization
    3.3.2 Median polish
    3.3.3 SNP Genotype Calling
  3.4 Log R Ratio (LRR) and B Allele Frequency (BAF)
    3.4.1 Log R Ratio (LRR)
    3.4.2 B Allele Frequency (BAF)
  3.5 Hidden Markov Models (HMM)
  3.6 Selection of CNV Regions
    3.6.1 Quality Control
    3.6.2 Minimal Regions
    3.6.3 Filtering CNV Regions
  3.7 Association Study and Polygenic Mixed Model

4 Application in Baependi Heart Study
  4.1 Log R Ratio (LRR) and B Allele Frequency (BAF)
  4.2 CNV calling
  4.3 Quality Control
  4.4 Minimal Regions
  4.5 CNV Filter
  4.6 Baependi Samples
  4.7 CNVs in Brazilian Population
    4.7.1 How many CNVs does an individual have?
    4.7.2 How long are the CNVs?
    4.7.3 Where are the CNVs?
  4.8 CNV Inheritance
    4.8.1 CNV occurrences in trios
    4.8.2 CNV Trait Heritability
  4.9 CNVs and Height

5 Final Considerations

A Quantile Normalization

B Median Polish

C Minimal Regions

D CNV Filter

E Pedigree and Kinship Matrix

F Frequency of CNVs by Chromosome

G Proportion of CNV occurrences in trios

H IDs and CEL Files Correspondence

Bibliography

List of Abbreviations

BAF     B Allele Frequency
CGH     Comparative Genomic Hybridization
CN      Copy Number
CNP     Copy Number Polymorphism
CNV     Copy Number Variation
CNVRs   CNV-containing regions
FoSTeS  Fork Stalling and Template Switching
GWAS    Genome-Wide Association Study
HMM     Hidden Markov Model
indel   Small insertions/deletions
LCR     Region-specific Low-Copy Repeat
LD      Linkage Disequilibrium
LRR     Log R Ratio
LRT     Likelihood Ratio Test
MAF     Minor Allele Frequency
MCF     Minor Copy Frequency
NAHR    Nonallelic Homologous Recombination
NGS     Next Generation Sequencing
NHEJ    Nonhomologous End Joining
SNP     Single Nucleotide Polymorphism
SD      Segmental Duplications
SV      Structural Variants
VNTR    Variable-Number Tandem Repeats

List of Figures

1.1 Illustration of a single nucleotide polymorphism (SNP)

2.1 Types of structural variants
2.2 Illustration of a CNV
2.3 Example of aneuploidy
2.4 Proportion of CNVs in each chromosome
2.5 Illustration of the four major mechanisms underlying human genomic rearrangements and CNV formation
2.6 Publications of SNPs and CNVs
2.7 CNV-containing region (CNVRs)

3.1 Illustration of DNA Microarray
3.2 Illustration of signal extraction for a given molecular marker
3.3 Flowchart of the pipeline
3.4 Illustration of SNP clustering
3.5 Values of LRR and BAF for each case of CNV
3.6 Representation of the procedure to find minimal regions across samples
3.7 Example of the kinship matrix (φ) given the family represented by the pedigree

4.1 Intensity of probes A and B of one SNP from 1,120 samples
4.2 LRR and BAF of one SNP from 1,120 samples
4.3 X and Y probe intensities for all 1,120 samples
4.4 Histogram of the standard deviation of Log R Ratio
4.5 Histogram of the BAF mean
4.6 Histogram of the BAF drifting
4.7 Histogram of waviness factor
4.8 CNV region with four categories
4.9 CNV region with three categories
4.10 Distribution of the age and height of all the 910 samples and for males and females
4.11 Distribution of individual ancestry
4.12 Total of CNVs in each procedure
4.13 Absolute frequency of samples based on the individual number of detected CNVs
4.14 Distribution of individual number of detected CNVs for all samples with less than 100 CNVs
4.15 Distribution of CNVs regarding the number of copies
4.16 Absolute frequency of samples based on the 8,794 CNVs
4.17 Number of CNVs according to the age
4.18 Histograms of CNV length
4.19 Histogram of filtered CNVs length
4.20 Proportion of CNVs in each chromosome based in total of base pairs
4.21 Frequency of CNVs per region after finding the minimal regions
4.22 Cases of CNV transmission
4.23 Manhattan plot of the intraclass correlation coefficient for each CNV
4.24 Distribution of the intraclass correlation coefficient
4.25 Manhattan plot of the p-values from the first model
4.26 Manhattan plot of the heritability from the first model
4.27 Manhattan plot of the p-values from the second model
4.28 Manhattan plot of the heritability from the second model
4.29 Manhattan plot of the p-values from the third model
4.30 Manhattan plot of the heritability from the third model
4.31 Distribution of the height based on the number of copies

A.1 Raw data in a regular plot (a) and in a quantile-quantile plot (b)
A.2 New quantile-quantile plot

E.1 Genogram corresponding to the family data

F.1 Frequency of CNVs per region after finding the minimal regions
F.2 Frequency of CNVs per region after finding the minimal regions
F.3 Frequency of CNVs per region after finding the minimal regions

List of Tables

2.1 Number of associations between variants and phenotypes
2.2 Examples of GWAS

3.1 States defined for the HMM
3.2 Hypothetical example of the merged and cleaned outputs from PennCNV
3.3 Copy number of each sample for all minimal regions

4.1 Example of the file containing the CNVs from sample 1
4.2 Quality control measurements from PennCNV
4.3 Cumulative frequency of samples based on the number of CNVs
4.4 Absolute frequency of CNV based on relative frequency of samples
4.5 Distribution of CNV for 910 samples
4.6 Mean relative frequency (%) for CNV occurrences in trios with one normal parent and another with single deletion
4.7 Mean of the relative frequencies per chromosome (%) for CNV occurrences in trios with one normal parent and another with single duplication
4.8 Mean of the relative frequency (%) for CNV occurrences in trios with two normal parents
4.9 Models used for heritability estimation
4.10 The top-20 CNVs with lower p-values for the Model 4.2 with CNV as a dichotomous covariate
4.11 The top-20 CNVs with lower p-values for the Model 4.2 with CNV as a continuous variable
4.12 The top-20 CNVs with lower p-values for the Model 4.2 with CNV as a categorical variable
4.13 Number of individuals with normal genotype, duplication and double duplication for the two regions of chromosome 9

G.1 Relative frequency of occurrences of CNVs in trios
G.2 Relative frequency of occurrences of CNVs in trios
G.3 Relative frequency of occurrences of CNVs in trios
G.4 Relative frequency of occurrences of CNVs in trios
G.5 Relative frequency of occurrences of CNVs in trios

H.1 Correspondence between IDs of samples and CEL Files
H.2 Correspondence between IDs of samples and CEL Files

Chapter 1

Introduction

Research in genetics is growing every day, and new technologies for understanding genes, genetic variation and trait heritability in living organisms are being developed, generating a massive quantity of genetic information. In this context, genome-wide association studies (GWAS) aim to associate genetic markers, candidate genes or genomic regions with complex traits and diseases, such as height and diabetes, which are likely to derive from multiple genes and the environment (Lewis and Knight, 2012; Nature Education, 2014). In addition, discovering the associations between diseases and genetic factors is an important step towards understanding the pathogenesis of diseases and facilitating diagnosis and treatment (Lewis and Knight, 2012; The International HapMap Consortium, 2003).

Several methods for performing genome-wide association studies are described in the literature. They vary according to the type of genetic marker used. Genetic markers are DNA sequences with known physical and molecular locations on chromosomes, such as single nucleotide polymorphisms (SNPs), microsatellites and copy number probes (National Cancer Institute; Ziegler and König, 2006). They can also be called genetic variants when they indicate a common variation in the genome, usually present in at least 1-5% of the population (National Cancer Institute).

The most widely used genetic variant in GWAS is the SNP, but other variants, such as small insertions/deletions (indels), variable-number tandem repeats (VNTRs) and copy number variations (CNVs), are also available (Lewis and Knight, 2012). Single nucleotide polymorphisms (SNPs) are variations at a single position in the DNA that are present in at least 1% of the population (Figure 1.1). On average, they occur once in every 300 nucleotides, meaning that a human genome can contain roughly 10 million SNPs. Some of them are documented to have a direct influence on certain phenotypes, while others have no known effect (Genetics Home Reference, 2017). In Chapter 2, we detail the role of SNPs and CNVs in GWAS.

Figure 1.1: Illustration of a single nucleotide polymorphism (SNP). A SNP is a change in the genetic code of an individual where a single nucleotide is replaced by another nucleotide in the DNA sequence. In this hypothetical example, at position 5, one individual has a C nucleotide, while the other two have a T nucleotide.
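The idea in Figure 1.1, together with the "one SNP per 300 nucleotides" estimate, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three sequences are invented so that they differ at position 5, each sequence is treated as a single haploid copy, and the genome length of 3 billion base pairs is a rounded value that is not given in the text. With only three sequences, the sketch does not test the 1% population frequency criterion.

```python
from collections import Counter


def find_variable_sites(sequences):
    """Return {1-based position: Counter of nucleotides} for sites where
    the aligned sequences do not all carry the same nucleotide."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    sites = {}
    for pos, column in enumerate(zip(*sequences), start=1):
        counts = Counter(column)
        if len(counts) > 1:  # more than one allele observed at this site
            sites[pos] = counts
    return sites


def minor_allele_frequency(counts):
    """Frequency of the least common allele at one site."""
    return min(counts.values()) / sum(counts.values())


# Hypothetical aligned sequences: one individual has C at position 5,
# the other two have T.
individuals = ["AAGCCTA", "AAGCTTA", "AAGCTTA"]
sites = find_variable_sites(individuals)
maf = minor_allele_frequency(sites[5])

# Rough SNP count: one SNP per 300 nucleotides over an assumed
# genome length of about 3 billion base pairs.
ASSUMED_GENOME_LENGTH = 3_000_000_000
expected_snps = ASSUMED_GENOME_LENGTH // 300

print(sorted(sites), dict(sites[5]), round(maf, 3), expected_snps)
```

In a real dataset the frequency would be estimated over many individuals, and a site would be called a SNP only if the minor allele reaches the 1% threshold mentioned above.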

Several studies are being carried out to catalogue human genetic variants and thereby facilitate GWAS. A pioneering project is the HapMap Project, which aims to find the patterns of DNA sequence variation in the human genome based on SNPs, to make this information freely available in the public domain, and to help investigators discover the genetic factors that contribute to susceptibility to disease, protection against illness and drug response (The International HapMap Consortium, 2003).

This project was developed using 270 DNA samples from African, Asian and European populations, including unrelated individuals and trios. Populations from different ancestral geographic locations were chosen to ensure that the database would contain most of the common variation, as well as rare variants, from each population. In addition, this design provides information about patterns of linkage disequilibrium (LD), since common SNPs tend to be older than rare SNPs and these patterns reflect historical recombination and demographic events (The International HapMap Consortium, 2003).

As the name suggests, HapMap uses the information from 6.8 million SNPs to create a map of haplotypes, which are DNA sequences within an organism that are inherited together from a single parent because they are in linkage disequilibrium (The International HapMap Consortium, 2003; Ziegler and König, 2006). HapMap therefore maps groups of SNPs that are usually inherited as a block, allowing researchers to genotype only some SNPs and then impute the SNPs that belong to the same block.

In 2016, the HapMap Project was merged into the 1000 Genomes Project. With a goal similar to that of HapMap, 1000 Genomes identifies genetic variants with frequencies of at least 1% in the studied populations (1000 Genomes Project Consortium et al., 2016), including not only SNPs but also structural variants and small insertions/deletions.
This project comprises a total of 2,504 samples from 26 populations, which is a modest number per population; consequently, it detects only the most common variants in the worldwide population.

Even though gene discovery has been a major success, the percentage of variance explained by GWAS loci is relatively low for many traits. Thus, a substantial part of trait variation is still unexplained. This phenomenon is called missing heritability. One example of a trait with high missing heritability is height, as described in Chapter 2. In Manolio et al. (2010), two of the solutions cited for revealing the missing heritability are to use different types of genetic variants and to include both common and rare variants.

Based on these scenarios, our focus in this work is on CNV detection, since this kind of variant is not as well characterized as SNPs but is expected to be associated with several traits and diseases. Copy number variation (CNV) occurs when the number of copies of a particular region (one or more loci) of the DNA differs from two in autosomes, or from one or two in allosomes, and it plays an important role in human genetic variability. The effects of CNVs on human diseases are not yet well known (National Institutes of Health, 2017), although several traits and diseases have been associated with this kind of polymorphism, such as uric acid levels (Scharpf et al., 2014), pancreatitis (Maréchal et al., 2006) and nervous system disorders (Lee and Lupski, 2006). In addition, obtaining CNV information is not simple and involves different statistical and computational procedures. In Chapter 2, we give a detailed description of CNVs and describe the biological mechanisms that form this variant, as well as the technologies available for detecting it. We also present some association studies that include this kind of structural variant of the human genome.

Genome-wide association studies are usually based on reference maps such as 1000 Genomes, which do not take population-specific and rare variants into account because they do not have a representative number of samples per population. In addition, Sanna et al. (2011) show that adding rare variants to association studies doubled the explained heritability, and Tennessen et al. (2012) report that around 82% of rare SNPs (present in less than 1% of the population) are population specific. Therefore, identifying different types of variants and including data from specific populations may help explain the missing heritability of traits and diseases. This motivates the creation of genomic reference maps for specific populations, for example the Genome of the Netherlands (Boomsma et al., 2014), a project similar to the 1000 Genomes Project that aims to characterize genetic variants, including rare variants, in the Dutch population.

Motivated by the unknown influence of CNVs on anthropometric measurements and the lack of studies based on the Brazilian population, this work was developed in collaboration with the Laboratório de Genética e Cardiologia Molecular/InCor-FMUSP. Using the database of the Baependi Heart Study (de Oliveira et al., 2008; Egan et al., 2016), described in Chapter 3, we analyzed the genotype (SNP) and phenotype data from 80 families to characterize CNVs in the Brazilian population and to understand their association with phenotypes such as height.

In summary, the main purpose of this project is to present methodologies to quantify and call CNVs from SNP platforms and to analyze such data under family-based designs. To this end, the project focuses on the following specific aims:

• Establish a methodology, combining bioinformatics and statistical tools, to process and quantify CNVs from SNP array data;

• Apply this methodology to data from the Baependi Heart Study and characterize the patterns of the CNVs detected in this population;

• Study CNV inheritance patterns;

• Associate the CNVs with height.

The methodology comprises two procedures, CNV calling and CNV analysis, as described in Chapter 3. CNV calling consists of quantifying and genotyping CNVs from SNP array data obtained from blood samples of individuals. In other words, for each sample, we identify the CNV regions and classify them by number of copies. CNV analysis focuses on characterizing the CNVs of the Brazilian samples and identifying CNVs that might be associated with height. In addition, given that the Baependi Heart Study data include the family structure between samples, we explore the inheritance of CNVs, which is scarcely described in CNV studies.

In Chapter 4, we illustrate some steps of CNV calling, including some outputs obtained during the procedure. We also describe the main characteristics of the detected CNVs, such as length, location in the genome and number of CNVs per sample, as well as some inheritance patterns observed in trio data. The results of the association between CNVs and height are presented with the annotation of the 20 most significant CNVs. Chapter 5 presents the final considerations and conclusions, including suggestions for further analysis.

Chapter 2

Copy Number Variation (CNV)

This chapter explains the biological background of copy number variations (CNVs), including their definition, characteristics, mechanisms of formation, and detection by new technologies. The importance and role of CNVs in genome-wide association studies (GWAS) are also described to introduce the methods explained in Chapter 3.

2.1 Biological background

Segments of DNA can show different kinds of structural variants (SVs). These can be modifications in orientation (inversions), chromosomal location (translocations) or copy number (deletions, insertions and duplications). SVs can be either balanced, with no loss or gain of genetic material, or unbalanced, where part of the genome is lost or duplicated. The former comprise inversions or translocations of a stretch of DNA within or between chromosomes, while the latter are termed copy number variations (CNVs). These variations are represented in Figure 2.1 (Escaramís et al., 2015).

Figure 2.1: Types of structural variants. Unbalanced SVs are represented in the top two rows and include deletion, insertion of novel sequence and duplication (interspersed duplication and tandem duplication). Balanced SVs are represented in the third row and include inversions and translocations. Examples of complex SVs are presented in the bottom row. Source: Escaramís et al. (2015).

This work focuses on copy number variation (CNV), a subtype of SV. A CNV is an alteration in the number of copies of a segment of DNA (Figure 2.2), which disturbs the normal biological balance of the diploid state in humans at any given locus. The segment can range from a single nucleotide polymorphism (SNP) to several genes. A CNV can be classified into three groups: duplication (three or more copies of the segment), deletion (fewer than two copies) or complex (a combination of deletions and duplications). An insertion is a mutation that increases the number of DNA bases. It can be considered a CNV when the added sequence is the same as the neighboring sequence, making it equivalent to a duplication.

In the literature, CNVs are sometimes named copy number polymorphisms (CNPs); the only difference lies in the frequency of the variation in the population. A CNV is a rare mutation present in less than 1% of the population (Campbell et al., 2011), while CNPs are more common mutations, present in more than 1% of the population. In this project, we do not distinguish between CNVs and CNPs.

Studies usually define the size of a CNV as 1 kb or larger (Feuk et al., 2006). However, some works using the Watson and Venter genomes describe CNVs ranging from 300 to 350 bp (Levy et al., 2007; Wheeler et al., 2008). Therefore, there is no consensus about the size of CNVs. As analyses of higher resolution are performed, many more CNVs in smaller size ranges are likely to be discovered (Zhang et al., 2009).

CNVs shorter than 1 kb can be described as small insertions/deletions (indels). When a CNV spans a whole chromosome (Figure 2.3), so that the human somatic cell contains more or fewer than the normal 46 chromosomes, it is called aneuploidy and is considered an extreme case of unbalanced SV (Escaramís et al., 2015), as in trisomy 21 in patients with Down syndrome and monosomy X in Turner syndrome (Stankiewicz and Lupski, 2010).

Due to the difficulty of identifying CNVs and their real length and location, some works deal with CNV-containing regions (CNVRs) instead of CNVs, considering that the region contains at least one CNV whose location is not precise (McCarroll and Altshuler, 2007).

Figure 2.2: Illustration of a CNV. A human carries two copies of each chromosome, one from the mother and another from the father, so the total number of copies of each segment is two (middle). When a deletion of segment II occurs (left), the individual has only one copy of segment II instead of two; when a duplication of segment II occurs, the individual has a total of three copies. The segment can range from a single SNP to several genes.
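The classification described in this section (deletion, duplication, complex) can be written down as a small Python sketch. It is an illustration rather than the calling procedure used in this work: the function names are hypothetical, the expected copy number defaults to two for autosomes, and the caller is assumed to pass an expected value of one for allosomes in males.

```python
def classify_copy_number(copies, expected=2):
    """Classify one segment by comparing its copy number with the
    expected value (2 for autosomes; 1 for X or Y in males)."""
    if copies < 0:
        raise ValueError("copy number cannot be negative")
    if copies < expected:
        return "deletion"
    if copies > expected:
        return "duplication"
    return "normal"


def classify_region(segment_copies, expected=2):
    """Classify a region made of several segments: 'complex' when it
    combines deletions and duplications, otherwise the single state seen."""
    states = {classify_copy_number(c, expected) for c in segment_copies}
    states.discard("normal")
    if not states:
        return "normal"
    if len(states) == 2:
        return "complex"
    return states.pop()


# Segment II of Figure 2.2 with one, two and three copies.
print([classify_copy_number(c) for c in (1, 2, 3)])
# A hypothetical region whose segments carry 2, 1 and 3 copies.
print(classify_region([2, 1, 3]))
```

Keeping the expected copy number as a parameter avoids hard-coding the diploid assumption, which does not hold for allosomes.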

Association studies involving CNVs are an important step towards understanding complex diseases, traits and evolution. As described in Zhang et al. (2009), the Database of Genomic Variants included 38,406 SVs ranging from 100 bp to 3 Mb, which cover 29.74% of the reference genome. By comparison, the SNP database comprises 14,708,752 SNPs but covers less than 1% of the reference genome. Therefore, SVs can account for a large part of the genetic diversity in humans.

Figure 2.3: Example of aneuploidy. In this type of variant, a chromosome has three copies instead of two. For chromosome 21, this corresponds to Down syndrome.

For a better understanding of the variability of the human genome in healthy individuals and the role of CNVs, Zarrei et al. (2015) developed a CNV map covering data from various populations, of which less than 10% were from South American populations. They created two maps: one including all CNVs and CNV-containing regions (CNVRs) from several published studies (the inclusive map), and another containing only CNVs and CNVRs with at least two subjects in two independent studies (the stringent map). From the inclusive map, they estimated that 9.5% of the human genome contains gains or losses, while in the stringent map this value dropped to 4.8%.

The maps created by Zarrei et al. (2015) also show that CNVs are not evenly distributed across chromosomes (Figure 2.4); in general, chromosomes 19, 22 and Y have the largest proportions of CNVs. This proportion is the total number of base pairs in CNVs on a chromosome divided by the total number of base pairs of the chromosome. Comparing the proportions of losses and gains, they found that chromosomes are more susceptible to losses than to gains, with proportions ranging from 4.3% to 19.2% and from 1.1% to 16.4%, respectively.

Figure 2.4: Proportion of CNVs in each chromosome. The horizontal dashed lines indicate the genome average for the inclusive map (upper line) and the stringent map (lower line). Source: Zarrei et al. (2015).
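The proportion plotted in Figure 2.4 (base pairs in CNVs divided by base pairs of the chromosome) can be sketched as follows. The coordinates and chromosome length are invented for illustration, intervals are treated as half-open [start, end), and merging overlapping CNVs so that shared bases are counted once is an assumption of this sketch, not a detail taken from Zarrei et al. (2015).

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals [start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def cnv_proportion(intervals, chromosome_length):
    """Fraction of the chromosome covered by at least one CNV."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / chromosome_length


# Hypothetical CNV coordinates on a toy chromosome of 10,000 bp.
cnvs = [(100, 300), (250, 400), (900, 1000)]
merged = merge_intervals(cnvs)
proportion = cnv_proportion(cnvs, 10_000)
print(merged, proportion)
```

Without the merging step, bases shared by overlapping CNVs would be counted more than once and the proportion could exceed the true coverage.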

2.2 Mechanisms for CNV Generation and CNV Transmission

CNVs can be de novo or transmitted from parents. The de novo CNV rate per transmission ($\mu$) is around $2 \times 10^{-2}$ (Itsara et al., 2010); therefore, most of the CNVs of an individual are due to the presence of the CNV in the parents' haplotypes. As described in Itsara et al. (2010), this rate can differ between groups. In family data, for example, cases of multiplex autism showed $\mu = 2.2 \times 10^{-2}$, while unaffected siblings showed $\mu = 5.4 \times 10^{-3}$, implying that the presence of de novo CNVs increases the risk of autism.

De novo SVs can arise both meiotically and mitotically. Thus, monozygotic twins can differ in their SVs, and an individual can carry structural variants that differ between tissues and even within a tissue (Escaramís et al., 2015). For this reason, a CNV can be found either in one type of cell or in all the somatic cells of the individual.

Stankiewicz and Lupski (2002, 2010) describe four possible sporadic mechanisms for the formation of CNVs: NAHR, NHEJ, FoSTeS, and L1-mediated retrotransposition.

• Nonallelic Homologous Recombination (NAHR):

Region-specific Low-Copy-Repeats (LCRs), also called segmental duplications (SDs), are DNA blocks of ∼10–400 kb with ≥97% identity that exist in multiple locations as a result of duplication events. Due to their size and similarity, SDs often lead to chromosomal rearrangements and can cause genome instability. Although they are rare in most mammals, LCRs comprise a large portion of the human genome owing to a significant expansion during evolution (Stankiewicz and Lupski, 2002).

When LCRs are located less than ∼10 Mb from each other, they can lead to misalignment of chromosomes or chromatids and mediate nonallelic homologous recombination (NAHR), which can result in unequal crossing-over, with recombination hotspots, gene conversion, and apparent minimal efficient processing segments. NAHR between directly oriented LCRs results in deletions or reciprocal duplications of the genomic segment between them. This molecular mechanism has been shown to be responsible for the vast majority of the common-sized recurrent rearrangements, namely reciprocal deletions and duplications, or inversions (Stankiewicz and Lupski, 2010). An example is illustrated in Figure 2.5.

NAHR can occur both in meiosis and in mitosis, and it happens only in the presence of substrates (LCRs or SDs). In meiosis, NAHR can lead to unequal crossing-over and genomic rearrangements that will be present in all the cells. In mitosis, on the other hand, it leads to mosaic populations of somatic cells carrying CNVs or other SVs (Zhang et al., 2009).

• Nonhomologous End Joining (NHEJ):

In Nonhomologous End Joining (NHEJ), double-strand breaks are detected, and then both broken DNA ends are bridged, modified, and finally linked. The product of the repair often contains additional nucleotides at the DNA end junction, leaving a "molecular scar" (Stankiewicz and Lupski, 2010), as shown in Figure 2.5. This process does not require a substrate and usually leads to deletions or small insertions.

• Replication-Error Mechanisms (FoSTeS):

Fork stalling and template switching (FoSTeS) is a mechanism based on DNA replication error. During DNA replication, the fork of one DNA region stalls and the strand is released from its original template, resuming DNA synthesis in another replication fork in physical proximity (Stankiewicz and Lupski, 2010; Zhang et al., 2009). As shown in Figure 2.5, this process can repeat multiple times (FoSTeS x2, FoSTeS x3, ...).

FoSTeS does not require a substrate. However, a small sequence of base pairs must be the same in both forks (microhomology) so the DNA synthesis can be resumed. This is the only mechanism that creates complex CNVs.

• L1-mediated retrotransposition:

This mechanism generates only insertions, meaning that the result is not necessarily a CNV. Long interspersed element-1 (L1) is the only element of this kind still active in the human genome. It comprises approximately 17% of the genome and contains two open reading frames (ORFs), regions that indicate where translation starts and ends. The insertion happens when RNA polymerase II transcribes the region between ORFs.

Figure 2.5: Illustration of the four major mechanisms underlying human genomic rearrangements and CNV formation: Non-Allelic Homologous Recombination (NAHR); Non-Homologous End-Joining (NHEJ); Fork Stalling and Template Switching (FoSTeS); retrotransposition. Figure from Zhang et al. (2009).

Given that CNVs can be transmitted, understanding this transmission is valuable for association studies, which detect transmitted CNVs and explore how they underlie Mendelian diseases in families (McCarroll and Altshuler, 2007). For example, a triplication of an approximately 605 kb segment containing the PRSS1 and PRSS2 genes was reported to cause hereditary pancreatitis with 80% penetrance (80% of the CNV carriers develop the disease) (Maréchal et al., 2006). In addition, a research group identified a CNV responsible for Pelizaeus-Merzbacher disease, in which 65% of the cases had inherited the condition (Lee and Lupski, 2006).

The findings on CNV transmission indicate that it is consistent with normal Mendelian inheritance (Locke et al., 2006), and this information can be used by some CNV calling algorithms, such as the ones described in Wang et al. (2008a) and Chu et al. (2013). However, in Locke et al. (2006), the CNVs considered were located within duplicated regions of the human genome, and the 269 samples (individuals) studied had European, Yoruba and Asian ancestry. On the other hand, as described in Palta et al. (2015), when all the regions of the genome with a CNV are considered, the transmission rate is 45.5%, a statistically significant deviation from the expected Mendelian transmission rate of 50%, especially when the regions are smaller than 10 kb.

Thus, in this project, we aim to evaluate the CNV transmission rate in the Brazilian population, taking into account the whole genome. For this analysis, we use the family data from the Baependi Heart Study, but we consider only trio data.
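To illustrate what a deviation from the Mendelian expectation of 50% means in practice, the sketch below runs an exact two-sided binomial test on transmission counts. It is only an illustration: the counts (455 transmissions in 1,000 opportunities, chosen to match a rate of 45.5%) are hypothetical, the function names are ours, and this is not necessarily the test used by Palta et al. (2015) or in this work.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided p-value: total probability of all outcomes that are
    no more likely than the observed one under the null hypothesis."""
    observed = binom_pmf(k, n, p)
    total = sum(
        binom_pmf(i, n, p)
        for i in range(n + 1)
        # Small relative tolerance so that ties are not lost to rounding.
        if binom_pmf(i, n, p) <= observed * (1 + 1e-9)
    )
    return min(1.0, total)


# Hypothetical counts: a parental CNV was transmitted in 455 of 1,000
# parent-offspring transmission opportunities.
transmitted, opportunities = 455, 1000
rate = transmitted / opportunities
p_value = binom_two_sided_p(transmitted, opportunities)
print(f"transmission rate = {rate:.3f}, p-value = {p_value:.4f}")
```

With trio data, each heterozygous parental CNV contributes one transmission opportunity per child, so the counts would come from comparing the CNV calls of parents and offspring.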

2.3 CNV Calling

Copy Number Variations can be detected by different technologies, and the methods documented in the literature can be divided into three groups depending on the type of data: Comparative Genomic Hybridization (CGH), SNP array and Next Generation Sequencing (NGS).

The first method for the identification of CNVs was array-CGH. It uses two genomes, a sample and a control, which are hybridized against the same oligonucleotides. The fluorescent signal intensity ratio between them can be compared across each chromosome to identify copy number changes (Theisen, 2017). In spite of being a tool offered by many companies and being capable of using target arrays for clinical tests, it cannot detect an absolute number of copies (Escaramís et al., 2015).

For an in-depth study of structural variants, NGS methods are preferred, since their four strategies (Read Depth (RD), Paired Read (PR), Split Reads (SR)/Clip Reads (CR) and de novo Sequence Assembly (AS)) have different advantages that can be combined to detect all kinds of CNVs.

Another method is based on SNP array data. Single nucleotide polymorphisms (SNPs) are the most common type of genetic variation in humans and represent a difference in a single nucleotide in the DNA (Laframboise, 2009), as described in Chapter 1. Using the SNP information, it is possible to infer the presence of CNVs and, due to the large amounts of SNP data available from genome-wide association studies, several algorithms have been developed, such as PennCNV (Wang et al., 2007b) and CRLMM/VanillaIce (Carvalho et al., 2007; Scharpf et al., 2011). However, the major disadvantage is the poor overlap of results among the different software packages (Eckel-Passow et al., 2011).

As described in Wang et al. (2008b), the algorithms for CNV calling based on SNP arrays can be summarized in a 3-step process which analyzes each individual separately:

1. Preprocessing: Quantify the intensities of each allele (A and B) of a given SNP. The most common values used are the Log R Ratio and B Allele frequencies that are inferred based on the raw data (Escaramís et al., 2015).

2. CNV calling: Values obtained in the preprocessing are used to estimate a "copy-number measurement", which is usually continuously distributed across populations. Based on these values, it is possible to call the "copy-number genotypes", that can vary as a simple "loss" or "gain" qualification or as discrete values such as (0, 1, 2) referring to the number of copies of a given allele (McCarroll and Altshuler, 2007).

3. Smoothing across the chromosome: Techniques of smoothing and quality control are applied to reduce the noise and to obtain better CNV callings.

Each step can vary from software to software. The complete description of resources and methods used during this project can be found in Chapter 3.

2.4 Association Studies and CNVs

Genome-wide association studies (GWAS) are a gene mapping approach that involves the identification of candidate genes or genome regions that contribute to a specific disease by testing for association between disease status and genetic variants. They are the main tool for identifying genes that contribute to complex traits and diseases, such as diabetes and heart diseases, which are termed "complex" because they are explained by genetic and environmental factors as well as their interaction (Lewis and Knight, 2012).

Although several types of genetic markers can be used for this kind of study, SNPs are the most commonly used. The online public database ClinVar archives all published results of associations between human genetic variants and phenotypes. Table 2.1 summarizes the total number of results by type of genetic variant (Landrum et al., 2016), showing that 76% of the reported associations are based on SNPs, while CNVs (deletions and duplications) account for 19%. The massive use of SNPs in comparison with CNVs can also be seen in Figure 2.6, which represents the results of our search for papers with the keywords SNP and CNV, separately.

Table 2.1: Number of associations between human genome variants and phenotypes. It includes conditions and diseases, such as obesity and cancer.

| Genetic Variant | Associations with Phenotypes | Relative Frequency (%) |
|---|---|---|
| SNP | 266,709 | 76 |
| Deletion | 40,963 | 12 |
| Duplication | 24,107 | 7 |
| Indel | 2,392 | 0.6 |
| Insertion | 15,454 | 4.4 |

CNVs can represent benign polymorphic variations or modify expected phenotypes by mechanisms such as altered gene dosage and gene disruption (Zhang et al., 2009). In addition, CNVs have an important role in genomic variation. Several studies have already reported associations between CNVs and some conditions, as presented in Table 2.2. For a more complete database, DECIPHER (DatabasE of genomiC varIation and Phenotype in Humans using Ensembl Resources) includes a list of genetic variants (sequence variants or copy number variants) that have a role in the pathology of some specified syndromes (Firth et al., 2009; Wellcome Trust Sanger Institute, 2009). Therefore, this polymorphism is as relevant as SNPs for association studies, although, as shown before, the literature has given greater emphasis to SNP studies than to CNV studies.

Figure 2.6: Publications of SNPs and CNVs. Number of publications with keywords SNP and CNV in Google Scholar.

The statistical analysis for GWAS varies according to the subject as well as the type of data and dependent variable (phenotype); some examples are described in Table 2.2. The statistical model used in this work is described in Section 4.9. Regardless of the analysis applied to infer the association between trait and genetic marker, significant genetic associations can have three different interpretations (Lewis and Knight, 2012):

1. Direct association, in which the genetic marker is the true causal variant conferring disease susceptibility;

2. Indirect association, in which a genetic marker in linkage disequilibrium (LD) with the true causal variant is genotyped;

Table 2.2: Examples of GWAS. For each study, we have the phenotype, the type of data and software used in the CNV calling, the CNV associated to phenotype and the statistical model applied for the association analysis.

| Phenotype (Reference) | Data for CNV calling | Genetic Variation (CNV) | Statistical Analysis |
|---|---|---|---|
| Uric acid concentrations (Scharpf et al., 2014) | SNP array data (Software: APT/PennCNV/VanillaICE) | Duplication/Deletion in 4p16.1 | Adjusted serum log uric acid concentrations (continuous) in a mixed effects regression model with fixed effects for copy number (modeled as a continuous quantitative variable with scale 0-4), age, log-transformed BMI, gender, and study center. Chemistry plate was added to the model as a random effect. |
| Williams-Beuren Syndrome (WBS) (Dutra et al., 2011) | Microsatellites | Deletion in 7q11.23 | Fisher's exact test between clinical features of WBS and CNV traits. |
| Miller-Dieker lissencephaly syndrome (Cardoso et al., 2003; Ledbetter et al., 1992) | Data from fluorescence in situ hybridization (FISH) (similar to array-CGH) | Deletion in 17p13.3 | Contingency table analysis between affection and CNV presence. |
| Severe Obesity (Wheeler et al., 2013) | SNP array (Software: APT) | Duplication/Triplication in LEPR, POMC, MC4R, BDNF and SH2B1 | A likelihood ratio test that models the distribution of per-sample CNV measurements as a Gaussian mixture and compares the goodness of fit with or without association to affected status. |

3. A false-positive result, in which there is either chance or systematic confounding, such as population stratification. One way to minimize this problem is to adjust for ancestry coefficients, commonly calculated as principal components (Price et al., 2006).
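The adjustment for ancestry mentioned in the last item can be illustrated by computing the leading principal component of a genotype matrix. The sketch below is a simplified stand-in for the method of Price et al. (2006), which additionally scales each marker by its allele frequency. It uses plain power iteration, and the function name and toy genotypes are hypothetical.

```python
def leading_pc(genotypes, n_iter=200):
    """Unit vector proportional to the sample scores on the first principal
    component. `genotypes` is a list of samples, each a list of allele
    counts (0, 1 or 2); at least one marker must vary across samples."""
    n, m = len(genotypes), len(genotypes[0])
    # Centre each marker (column) at its mean across samples.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    z = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # Sample-by-sample cross-product matrix K = Z Z^T.
    k = [[sum(a * b for a, b in zip(z[i], z[l])) for l in range(n)]
         for i in range(n)]
    # Power iteration from a fixed, non-constant starting vector.
    v = [float(i + 1) for i in range(n)]
    for _ in range(n_iter):
        w = [sum(k[i][l] * v[l] for l in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


# Hypothetical toy data: samples 0-2 and samples 3-5 come from two groups
# with very different allele frequencies at four markers.
toy_genotypes = [
    [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1],
    [2, 2, 1, 2], [2, 1, 2, 2], [2, 2, 2, 1],
]
pc1 = leading_pc(toy_genotypes)
print([round(score, 3) for score in pc1])
```

In an association model, scores such as `pc1` would enter as covariates next to the genetic marker, so that differences in phenotype explained by ancestry are not attributed to the marker.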

As described in Yang et al. (2010), two problems found in association studies using SNPs are the small effect of each marker on the phenotype and the incomplete linkage disequilibrium between causal variants and the SNPs. These factors require studying the effect of blocks of SNPs on the phenotype instead of the effect of individual SNPs.

The overview by McCarroll and Altshuler (2007) discusses the evidence for the effects of CNVs on phenotypes and describes the challenges of association studies involving CNVs. Although some of the listed problems have been solved since then, for instance through the development of new technologies for CNV detection, others remain open, such as the transmission rate of CNVs.

Unlike SNPs, whose population frequencies and precise locations are very well characterized, the frequencies and locations of CNVs are still uncertain. For example, the Affymetrix 6.0 platform used in this project has approximately 1.8 million markers, of which "202,000 probes targeting 5,677 CNV regions" (Affymetrix, 2008), i.e., CNV probe locations in fact correspond to the locations of CNV-containing regions (CNVRs) (McCarroll and Altshuler, 2007). As shown in Figure 2.7, the defined position can represent more than one kind of CNVR. When SNP arrays are used for CNV calling, this problem is minimized, since the calling takes SNP information and linkage disequilibrium approaches into account. As one can see in Section 3.5, PennCNV includes the SNP information in the transition probabilities of the hidden Markov model, giving more precision to the CNV calling.

Figure 2.7: CNV-containing regions (CNVRs). Possible CNV locations related to the coordinates of a reported CNV-containing region (CNVR). For a bacterial artificial chromosome (BAC) probe (red region), it is possible to find a duplication in only part of the region or in its vicinity (blue regions). The same occurs for deletions (the last four lines). Source: McCarroll and Altshuler (2007).

As mentioned before, CNV calling techniques have low overlap among them (Eckel-Passow et al., 2011), and only a few percent of the thousands of CNVs detected can be genotyped in the available samples. This means that, for example, one could identify thousands of CNVs from a set of samples, but only a few of them can actually be genotyped, since it is difficult to detect CNVs at the same DNA position in a pertinent number of samples (McCarroll and Altshuler, 2007). This can be observed in Section 4.5, in which we filter the CNVs to keep only the ones that are present in at least 2% of the population.

These problems are related to the quality of the CNV calling, and even though it is clear that all data used in an association study must be as accurate as possible, obtaining the CNV categories is the biggest challenge during the study. Once this step is completed, different statistical models can be applied to test the association between the CNV and the desired phenotype, depending on the subject and data used. Some software and R packages were developed to facilitate the implementation of the statistical analysis in GWAS, such as SOLAR (Blangero et al., 2015) and the kinship2 R package (Therneau et al., 2015). They can deal with several kinds of data, including family information, and are described in more detail in Chapter 3.

2.4.1 CNVs and Height

Published studies estimate height heritability as approximately 80%, meaning that 80% of height variation in populations is due to genetic effects. GWAS succeeded in identifying around 50 single variants that may be associated with height and could explain up to 5% of the total heritability when considered independently. However, this value increases to 45% if groups of polymorphisms are used to describe the phenotypic variation (Yang et al., 2010). Thus, the choice of height in this work was based on the interest in characterizing the pattern of missing heritability for this phenotype in the Brazilian population through CNV analysis.

Association studies between copy number variation (CNV) and human height have already been performed, as described in Dauber et al. (2011), Kim et al. (2013) and Li et al. (2010). In all these studies, two of which were performed in Asian populations (Kim et al., 2013; Li et al., 2010), only unrelated individuals were considered. Their findings suggest an association of short height with combined deletions (Dauber et al., 2011), specifically the ones in the region 12q24.33, a neighbour of the gene GPR133, which in turn contains SNPs previously associated with height (Kim et al., 2013). Other CNV regions associated with human height were found (Li et al., 2010), but none of them was validated after correction for multiple testing.

Chapter 3

Materials and Methods

This chapter describes the dataset and the SNP array platform used in the Baependi Heart Study. The following sections present the methodologies that are part of the pipeline we proposed. They are divided into two parts: CNV calling and CNV analysis. The first part is described in Sections 3.3 and 3.4, where we explain the procedures included in the pre-processing of genetic data and obtain two new values (Log R Ratio and B Allele Frequency) from the SNP array data. Section 3.5 introduces the Hidden Markov Model (HMM) used for CNV calling. The second part of the methodology begins with the pre-processing of the HMM output and ends with the description of the models used to infer the heritability of height and the CNV transmission rate (Sections 3.5, 3.6.2, 3.6.3 and 2.4). Illustrations of some procedures are in Chapter 4.

3.1 Dataset

Due to multiple waves of immigration, Brazil has a highly admixed population, which can be driven by genetic and environmental influences on several traits. The Baependi Heart Study has been conducted by the Heart Institute since 2005 as a longitudinal family-based cohort study to understand the variation of cardiovascular risk factors within the Brazilian population and to disentangle its genetic and environmental components. The study contains two steps of data collection in accordance with a planned sample design. The first wave was performed between December 2005 and January 2006, and the follow-up second wave took place in 2010 (details are described in de Oliveira et al. (2008); Egan et al. (2016)).

The data considered in this work are from the first wave of the study and provide information about 105 families (1,666 individuals, 723 males and 943 females) living in the village of Baependi, in the state of Minas Gerais, Brazil. Data from 631 nuclear families were available, with the number of offspring ranging from 1 to 14. The number of generations per family varied from 2 to 4 (54% of the families had 3 generations, and 45% had 2 generations). Only individuals aged 18 years or older were considered eligible for participating in the study. The mean age was 44 years, with a range of 18 to 100 years (de Oliveira et al., 2008).

For each participant, a questionnaire was used to obtain information regarding family relationships, demographic characteristics, medical history and environmental risk factors. Anthropometric measurements, physical examination and electrocardiogram of the participants were performed by trained medical students. Also, fasting blood glucose, total cholesterol, lipoprotein fractions and triglycerides were obtained by standard techniques in blood samples. Serum samples were stored at −80 °C and genomic DNA was extracted by standard procedures. From the DNA samples, genotyping was performed with the Affymetrix 6.0 SNP array platform, and 1120 CEL files were obtained.

3.1.1 SNP array platform

A DNA microarray is a technology used to perform experiments on multiple genes at the same time. This tool makes it possible to determine whether the DNA of an individual contains a genetic variant. For this purpose, a DNA microarray contains multiple unique spots, each with several identical strands of DNA, as illustrated in Figure 3.1 (Genetic Science Learning Center, 2013; National Human Genome Research Institute (NHGRI)).

Figure 3.1: Illustration of DNA Microarray. The microarray chip contains several spots, and each spot includes multiple identical DNA strands. During the experiment, the isolated genetic material of a sample will bind to those DNA strands corresponding to genes that are turned on. Given the information of each spot, the final output lists the active genes. Source: Genetic Science Learning Center (2013).

A SNP array is a type of DNA microarray used to detect single nucleotide polymorphisms within a population. The Affymetrix 6.0 platform includes 906,600 SNP markers and 946,000 CN probes, of which 202,000 target 5,677 CNV regions from the Toronto Database of Genomic Variants (Affymetrix, 2008).

The genomes of any two individuals are 99.9% identical at the nucleotide level, and the presence of SNPs in the genome is the largest source of genetic diversity among humans. However, some regions of the genome can have few or no SNPs (Laframboise, 2009; Shen et al., 2008). In addition, as described in Section 2.4, a single probe can represent multiple CNV regions. For these reasons, CN probes evenly spaced along the genome were added to the platform to include the regions that would not be covered by SNPs.

The procedure to obtain the intensity of the alleles for a given marker using the Affymetrix assay is illustrated in Figure 3.2. For a given SNP, different oligonucleotides of 25 nucleotides (25-mer probes) for both alleles, containing the SNP in different positions, are used to bind to the DNA strand. When the probe is complementary to all 25 bases, a brighter signal is detected. Otherwise, if there is a mismatch at the SNP site, the signal is dimmer (Laframboise, 2009). The values of these intensities are stored in CEL files that will be used for the CNV analysis as well as the SNP analysis.

Figure 3.2: Illustration of signal extraction for a given molecular marker. In this hypothetical example, there is a sequence containing a SNP which can be A or C. For this SNP, the SNP array chip has different DNA strands (probes) including the SNP with the reference and the altered nucleotide, which are defined as alleles A and B. When the denatured DNA of the sample is inserted in the chip, a perfect match occurs when all the bases of the sample DNA and the probe bind and, then, a higher intensity signal is detected (orange). Otherwise, a lower intensity signal (yellow) is detected. Source: Laframboise (2009).

3.2 Methodology Overview

The methodology used can be summarized by Figure 3.3, which describes the preprocessing of SNP data, the CNV calling and the CNV analysis. For the preprocessing of SNP data and the CNV calling, the software Affymetrix Power Tools, PennCNV and packages from the R environment were used.

Briefly, the CEL files from the Affymetrix 6.0 platform are used to obtain the signal intensities of alleles A and B for each single nucleotide polymorphism (SNP). Based on these values, the genotypes (AA, AB and BB) are predicted by unsupervised clustering algorithms. In addition, using the intensity values, two new values are obtained by polar transformation: log R ratio (LRR) and B Allele Frequency (BAF). This new information is used in a hidden Markov model for CNV estimation.
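The polar transformation mentioned above can be sketched as follows. The exact definitions used in this work are given in Section 3.4. The sketch assumes one commonly used formulation, in which the total intensity is R = A + B, the allelic angle theta is rescaled to [0, 1], LRR is the base-2 logarithm of observed over expected R, and BAF is a piecewise linear interpolation of theta between the canonical genotype cluster positions. All function names, the cluster positions and the expected R values are hypothetical.

```python
from math import atan2, log2, pi


def polar_transform(a, b):
    """Return (R, theta) for allele intensities a and b.
    R is the total intensity; theta is the allelic angle scaled to [0, 1]."""
    r = a + b
    theta = atan2(b, a) / (pi / 2)
    return r, theta


def log_r_ratio(r_observed, r_expected):
    """LRR: log2 of the observed total intensity over the expected one."""
    return log2(r_observed / r_expected)


def b_allele_freq(theta, theta_aa, theta_ab, theta_bb):
    """BAF: interpolate theta between the canonical cluster positions."""
    if theta < theta_aa:
        return 0.0
    if theta < theta_ab:
        return 0.5 * (theta - theta_aa) / (theta_ab - theta_aa)
    if theta < theta_bb:
        return 0.5 + 0.5 * (theta - theta_ab) / (theta_bb - theta_ab)
    return 1.0


# Hypothetical heterozygous SNP with equal A and B intensities, and
# assumed cluster positions (0.1, 0.5, 0.9) and expected R of 2.0.
r, theta = polar_transform(1.0, 1.0)
lrr = log_r_ratio(r, r_expected=2.0)
baf = b_allele_freq(theta, theta_aa=0.1, theta_ab=0.5, theta_bb=0.9)
print(r, theta, lrr, baf)

# Halving the total intensity, as in a one-copy deletion, gives LRR = -1.
lrr_deletion = log_r_ratio(1.0, r_expected=2.0)
```

Under these assumptions, a normal two-copy region shows LRR near 0 and BAF near 0, 0.5 or 1, while deletions and duplications shift LRR down or up and change the set of BAF values observed.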

Figure 3.3: Flowchart of the pipeline. The number indicates which function was used. Box I indicates the CNV calling and box II indicates the CNV analysis.

The following functions were used in the preprocessing described in Figure 3.3 (Wang et al., 2007a):

1. Function apt-probeset-summarize from Affymetrix Power Tools. Given the CEL files, signal intensity values for probes are normalized using quantile normalization (Section 3.3.1). Then, the function applies the median polish (Section 3.3.2) to get the final cleaned intensity values for alleles A and B for each SNP.

2. Function apt-probeset-genotype from Affymetrix Power Tools. Given the CEL files, this function generates the individual genotype calls using the Birdseed algorithm. For each SNP in each sample, the genotype is coded as 0, 1 and 2 for AA, AB and BB, respectively, and as -1 for missing values, with the corresponding confidence scores. Also, a final report infers the sex of each sample.

3. Function generate_affy_geno_cluster.pl from PennCNV. This function generates canonical genotype clustering files based on the output files from functions 1 and 2. These files contain cluster positions of each SNP for each canonical genotype (AA, AB and BB).

4. Function normalize_affy_geno_cluster.pl from PennCNV. The LRR and BAF values for each SNP and each sample are calculated using the genotype clustering file and the intensity values of alleles A and B. More details are in Section 3.4.

The preprocessing phase generates auxiliary files, such as the genotype clustering file and genotype confidence scores, and the file containing the LRR and BAF for all markers and samples. This information is used for the CNV calling (Figure 3.3). The following functions were used in this process (Wang et al., 2007a):

5. Function kcolumn.pl from PennCNV. The output from the previous function is a table with markers in the rows and LRR and BAF values for each sample in columns. Since PennCNV detects the CNVs separately for each sample, this function splits the table into tables of three columns (Marker, LRR and BAF) for each sample.

6. Function detect_cnv.pl from PennCNV. The CNV calling is performed for each sample (Section 3.5). Some additional information from the HapMap reference is used, for example, the SNP coordinates and the probability of the B allele.

PennCNV is used since it estimates locus-level copy number, performs segmentation, evaluates CNV-specific quality-control metrics within a single software package, has relatively small bias and variability, and detects regions while maintaining an estimated false-positive rate (Eckel-Passow et al., 2011).

The identified CNV regions are specific to each sample (individual). As shown in Figure 3.3, we excluded the samples that did not pass the quality control (Section 4.3). Then, a new set of minimal regions, defined by the overlapping regions across all samples, was built. Finally, a filter was applied to remove all minimal regions with a low frequency of CNVs. The final regions are then ready for the CNV analysis of this work.

7. Function filter_cnv.pl from PennCNV. As PennCNV does not filter the samples, this function returns the values of mean, median and standard deviation of LRR and BAF for each sample. This information allows one to evaluate the quality of the CNV calling based on the criteria described in Section 3.6.1.

The output from filter_cnv.pl is loaded in R to select the samples that passed the quality control.

8. CNTools package from Bioconductor (Zhang, 2017). Each sample contains its own CNV regions. However, for subsequent analysis, we need the same variables for all samples. The solution adopted was to identify the minimal regions (Section 3.6.2), which are obtained from the overlap of all identified regions. The CNTools package was created with a similar aim, so we adapted it to suit our needs.

9. Basic functions from R. The minimal regions obtained take into account all CNVs identified in all samples. Thus, if a CNV is present in only one sample, the minimal region corresponding to this CNV will have a mutation in a single sample. For this reason, we used basic functions from R to keep only the regions in which at least 2% of the samples have a mutation. The procedure is described in Section 3.6.3.

10. Basic functions from R and polygenic from Solar (Blangero et al., 2015).

The script to analyze the identified CNVs includes simple functions from R. To estimate the CNV transmission rate, we use the polygenic function from Solar, which is described in Section 3.7.

11. Function polygenic from Solar (Blangero et al., 2015) and kinship2 package from R (Therneau and Sinnwell, 2015). To estimate the heritability of phenotypes using CNVs as covariates, we use the polygenic function from Solar and the function lmkin from kinship2 package as described in Section 3.7.
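The construction of minimal regions (item 8) and the 2% frequency filter (item 9) can be sketched for a single chromosome as follows. This is a simplified illustration rather than the adapted CNTools code used in the pipeline: the function names, the half-open interval convention and the toy data are assumptions, and the sketch does not distinguish gains from losses.

```python
def minimal_regions(cnvs_by_sample):
    """Split the chromosome at every CNV breakpoint and record, for each
    resulting segment, the set of samples whose CNVs cover it.
    `cnvs_by_sample` maps a sample ID to a list of (start, end) intervals."""
    breakpoints = sorted(
        {pos for cnvs in cnvs_by_sample.values()
         for start, end in cnvs for pos in (start, end)}
    )
    regions = []
    for start, end in zip(breakpoints, breakpoints[1:]):
        carriers = {
            sample for sample, cnvs in cnvs_by_sample.items()
            if any(c_start <= start and end <= c_end for c_start, c_end in cnvs)
        }
        if carriers:  # drop segments that lie between CNVs
            regions.append((start, end, carriers))
    return regions


def filter_by_frequency(regions, n_samples, min_freq=0.02):
    """Keep the regions carried by at least `min_freq` of all samples."""
    return [(start, end, carriers) for start, end, carriers in regions
            if len(carriers) / n_samples >= min_freq]


# Hypothetical calls for 4 CNV carriers in a cohort of 100 samples.
toy_calls = {
    "s1": [(100, 300)],
    "s2": [(200, 400)],
    "s3": [(200, 300)],
    "s4": [(900, 950)],
}
toy_regions = minimal_regions(toy_calls)
toy_kept = filter_by_frequency(toy_regions, n_samples=100)
print(toy_kept)
```

In the toy data only the segment shared by three samples reaches the 2% threshold, which shows why a CNV seen in a single individual does not survive the filter.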

3.3 Preprocessing of SNP data

The CEL file is an Affymetrix-specific file with the information for each sample. It stores the intensity value of each probe in the array and its standard deviation, a flag to indicate an outlier, user-defined flags, and the number of pixel values collected from an Affymetrix GeneArray scanner (Affymetrix, 2009a,b). To generate the intensity values and SNP genotype calls from the CEL files obtained from our samples, the Affymetrix Power Tools (APT) software was used. The procedure, described in McCall et al. (2010), involves:

• Quantile normalization;

• Median polish;

• SNP genotype calling (Birdseed).

Quantile normalization and median polish are used to normalize and remove outliers from the intensity values of alleles A and B. For a given SNP, when these intensity values from several samples are plotted, it is easy to observe the formation of three clusters, since we have three genotypic classes (AA, AB, BB). An example can be seen in Figure 3.4. The SNP genotype calling is the procedure to predict the genotypes based on the intensity values of alleles A and B.

3.3.1 Quantile normalization

Quantile normalization is a procedure to normalize two or more vectors so that they have the same or a similar distribution. This method is widely used in biostatistics for the analysis of data generated from experiments on DNA, RNA, and protein microarrays. In this analysis, quantile normalization is performed to remove unknown variation possibly due to target preparation and hybridization (McCall et al., 2010). The step aims to make the distributions of SNP intensities of the individuals more comparable.

The algorithm for this procedure is described in Bolstad et al. (2003). It was developed based on the fact that two data vectors with the same distribution show, in a quantile-quantile plot, a straight diagonal line through the origin in the direction of the unit vector $\left(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}\right)$. This can be generalized to $n$ data vectors with the same distribution, whose $n$-dimensional quantile-quantile plot is a straight line through the origin in the direction of the unit vector $\left(\frac{1}{\sqrt{n}}, \ldots, \frac{1}{\sqrt{n}}\right)$. Thus, quantile normalization is the procedure of projecting the points of an $n$-dimensional quantile-quantile plot onto the diagonal, following the 5-step algorithm:

1. Given n p−dimensional vectors to be normalized, create the X matrix of dimension p × n where each vector is a column;

2. Sort each column of X to give Xsort;

3. Take the means across rows of Xsort;

4. Substitute each element in a row of Xsort by the row mean to get X′sort;

5. Get Xnormalized by rearranging each column of X′sort to have the same ordering as the original X.

Although a quantile normalization function is available for R (package 'preprocessCore') (Bolstad, 2001), a simplified R script illustrating how quantile normalization works can be found in Appendix A. In our case, the quantile normalization is performed across chips; thus, n indicates the number of samples and p the number of SNP/CN probes.
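The five steps above can also be sketched in a few lines of Python. This is only an illustrative sketch under simple assumptions: the function name `quantile_normalize` and the toy intensities are hypothetical, ties are broken by position instead of being averaged, and it is not the preprocessCore implementation.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of equal-length columns (one column per sample)."""
    p = len(columns[0])
    n = len(columns)
    # Step 2: sort each column independently.
    sorted_cols = [sorted(col) for col in columns]
    # Step 3: mean across each row of the sorted matrix.
    row_means = [sum(col[i] for col in sorted_cols) / n for i in range(p)]
    # Steps 4-5: replace every value by the row mean of its rank and
    # put it back in the position it had in the original column.
    normalized = []
    for col in columns:
        order = sorted(range(p), key=lambda i: col[i])  # indices by increasing value
        out = [0.0] * p
        for rank, idx in enumerate(order):
            out[idx] = row_means[rank]
        normalized.append(out)
    return normalized


# Toy data: three samples (columns) measured on four probes (rows).
samples = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.5, 2.0], [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(samples)
```

After the call, every column of `normalized` contains the same set of values, only in a different order, which is exactly what "same distribution" means here.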

3.3.2 Median polish

Following the robust multiarray analysis (RMA) (McCall et al., 2010), median polish is performed after the quantile normalization to reduce the effect of possible outliers in the array. This technique, described in Tukey (1970), extracts the effects of row and column factors in a two-way table using medians. It is similar to ANOVA, but uses medians instead of means.

A given matrix $X_{p \times n} = (x_{ij})$, for $i = 1, \ldots, p$ and $j = 1, \ldots, n$, can be decomposed as in Equation 3.1, in which $\alpha$ is a constant, $r_i$ is the effect associated with the i-th row, $c_j$ is the effect associated with the j-th column and $\varepsilon_{ij}$ is the residual associated with the element $x_{ij}$ of the matrix.

$$x_{ij} = \alpha + r_i + c_j + \varepsilon_{ij}. \qquad (3.1)$$

The median polish follows the algorithm:

1. Given a matrix Xp×n, set α = 0 as the constant, r = {r1, ..., rp} and c = {c1, ..., cn} as vectors of zeros with lengths p and n to hold the row and column effects, respectively, and δ = 0 and X(t) = X as auxiliary variables;

2. For each row i of X, compute the median (mi.), subtract mi. from each element of Xi. and add mi. to ri;

3. Define δ as the median of c. Subtract δ from each element of c. Then, add δ to α;

4. For each column j of X, compute the median (m.j), subtract m.j from each element of X.j and add m.j to cj. This new X represents the residual matrix;

5. Now, define δ as the median of r. Subtract δ from each element of r. Then, add δ to α;

6. If the sum of the absolute values of X is equal to 0 or X is very similar to X(t), the values of the constant, the row and column effects and the residuals are taken as the final estimates. Otherwise, set X(t) = X and repeat steps 2-5 until one of the criteria is reached.

The residual matrix, X(t), is subtracted from the original X. This means that the outlier observation remains in the dataset, but the residual associated with it is removed. In our application, the row and column factors represent the effects of the probe (genetic variant) and of the chip (sample), and the procedure is used to protect against outlier probes (Affymetrix, 2017; Irizarry et al., 2003; Rabbee and Speed, 2006).
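A minimal Python sketch of the median polish steps is given below. The function name, the iteration limit `max_iter` and the tolerance `tol` are illustrative assumptions, and the stopping rule is a simplified reading of step 6 (it compares the sum of absolute residuals between consecutive iterations).

```python
from statistics import median


def median_polish(matrix, max_iter=10, tol=1e-6):
    """Decompose a p x n table as overall + row effect + column effect + residual."""
    p, n = len(matrix), len(matrix[0])
    resid = [row[:] for row in matrix]          # working copy, becomes the residuals
    overall, row_eff, col_eff = 0.0, [0.0] * p, [0.0] * n
    previous_total = None
    for _ in range(max_iter):
        # Step 2: sweep the row medians into the row effects.
        for i in range(p):
            m = median(resid[i])
            resid[i] = [x - m for x in resid[i]]
            row_eff[i] += m
        # Step 3: move the median of the column effects into the constant.
        delta = median(col_eff)
        col_eff = [c - delta for c in col_eff]
        overall += delta
        # Step 4: sweep the column medians into the column effects.
        for j in range(n):
            m = median(resid[i][j] for i in range(p))
            for i in range(p):
                resid[i][j] -= m
            col_eff[j] += m
        # Step 5: move the median of the row effects into the constant.
        delta = median(row_eff)
        row_eff = [r - delta for r in row_eff]
        overall += delta
        # Step 6: stop when the residuals vanish or no longer change.
        total = sum(abs(x) for row in resid for x in row)
        if total == 0 or (previous_total is not None
                          and abs(previous_total - total) <= tol * total):
            break
        previous_total = total
    return overall, row_eff, col_eff, resid


# Purely additive toy table: x_ij = 10 + r_i + c_j with r = (-1, 0, 1), c = (-2, 0, 2).
table = [[7.0, 9.0, 11.0], [8.0, 10.0, 12.0], [9.0, 11.0, 13.0]]
overall, row_eff, col_eff, resid = median_polish(table)
```

For this additive table the residuals are all zero and the constant, row and column effects are recovered exactly; with a real probe-by-sample matrix the residuals carry the outlying part of each observation.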

3.3.3 SNP Genotype Calling

Birdseed is a genotyping and clustering algorithm for Affymetrix SNP array platforms. For a given SNP, three genotypes are expected among the samples of a population: AA, AB and BB. Based on the intensities of alleles A and B for several samples, this can be observed as the formation of one to three clusters associated with those genotypes. As we can see in Figure 3.4, the first SNP has the three expected genotypes, while the second one has no samples with genotype BB.

Figure 3.4: Illustration of SNP clustering. Given the values of intensities A and B of a SNP, the plot shows the formation of clusters. The plot on the left forms three clusters, while the plot on the right forms two clusters. Source: Korn et al. (2008) (Supplemental Material).

Birdseed uses a customized Expectation-Maximization (EM) algorithm to fit two-dimensional Gaussians to normalized and summarized (median polish) SNP data (A-signal vs. B-signal) and identify the clusters. This procedure gives a genotype and a confidence score for every individual at each SNP (Broad Institute, 2008). The algorithm described by Korn et al. (2008) (Supplemental Material) has the following steps for each SNP:

1. The initial conditions for each cluster, representing genotypes AA, AB and BB, are based on a prior models file that contains SNP-specific estimates of cluster locations and variances learned from samples of a known genotype;

2. The prior model is scaled by a value s to be in the same intensity space as the samples. This value s is defined as:

$$s = \frac{d}{m} = \frac{\dfrac{1}{n}\sum_{i=1}^{n}\sqrt{A_i^2 + B_i^2}}{\dfrac{\sum_{c \in \{AA, AB, BB\}} (W_c + 0.1)\sqrt{\mu_{ca}^2 + \mu_{cb}^2}}{\sum_{c \in \{AA, AB, BB\}} (W_c + 0.1)}} \qquad (3.2)$$

where d is the mean distance of a sample from the origin, m is the weighted average distance of the prior model means from the origin, $A_i$ and $B_i$ are the intensities of alleles A and B for sample i = 1, ..., n, n is the number of samples, $\mu_c = (\mu_{ca}, \mu_{cb})$ is the mean of cluster c, and $W_c$ is the expected weight of cluster c obtained from a reference population, the HapMap reference data being the most commonly used.

3. The data are fitted considering four independent and different two-dimensional Gaussian Mixture Models (GMM), which are based on the number of possible clusters. Model 1 considers that there is only one cluster, model 2 considers two clusters, and models 3 and 4 consider three clusters, but with different initial conditions. The initial conditions of each model can be found in Korn et al. (2008).

In general, this step performs a modified Expectation-Maximization (EM) algorithm, which includes two steps: Expectation and Maximization. The first one calculates the probability of each sample belonging to each cluster, and the second one updates the parameters of each cluster based on the results of the Expectation step. This step repeats until either convergence or the maximum number of iterations (defined by the user) is reached.

4. Model selection is then performed following different criteria, such as the Bayesian information criterion (BIC) and the closeness between the final means (µ1, µ2, µ3) and the expected means (µAA, µAB, µBB).

5. Once the model is selected, genotypes and confidence scores are calculated.

The CRLMM algorithm (Carvalho et al., 2007, 2010) is also an alternative procedure for genotype calling considering SNP data and it is available in R software.
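As an illustration of the scaling in step 2 (Equation 3.2), the sketch below computes s for toy data. The function name and all numbers are hypothetical: the cluster means and weights are made up for the example and are not taken from an actual Birdseed prior models file.

```python
from math import hypot


def prior_scale_factor(intensities, prior_means, prior_weights):
    """Scale s = d / m between the sample intensities and the prior cluster means."""
    # d: mean Euclidean distance of the samples (A_i, B_i) from the origin.
    d = sum(hypot(a, b) for a, b in intensities) / len(intensities)
    # m: weighted average distance of the prior cluster means from the origin,
    # with weights (W_c + 0.1) as in Equation 3.2.
    weights = {c: w + 0.1 for c, w in prior_weights.items()}
    m = sum(weights[c] * hypot(*prior_means[c]) for c in weights) / sum(weights.values())
    return d / m


# Hypothetical inputs: two samples and three prior clusters.
intensities = [(3.0, 4.0), (6.0, 8.0)]                                # (A_i, B_i)
prior_means = {"AA": (5.0, 0.0), "AB": (3.0, 4.0), "BB": (0.0, 5.0)}  # (mu_ca, mu_cb)
prior_weights = {"AA": 0.25, "AB": 0.50, "BB": 0.25}                  # W_c
s = prior_scale_factor(intensities, prior_means, prior_weights)
```

In this toy case the samples lie on average 7.5 units from the origin and every prior mean lies 5 units from it, so the prior model would be stretched by a factor of 1.5.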

3.4 Log R Ratio (LRR) and B Allele Frequency (BAF)

Log R Ratio (LRR) and B Allele Frequency (BAF) are the parameters used by PennCNV and other methods for calling CNVs (Eckel-Passow et al., 2011; Wang et al., 2007b). They are preferred over the normalized A and B intensity values because they are simpler to interpret: they are based on polar coordinates, in which the intensity values tend to form clusters. This data transformation also takes into account a reference value obtained from different external populations, which makes the values easier to interpret. Therefore, a polar coordinate transformation of the two-channel normalized intensity data (two alleles, A and B) is performed for each SNP, obtaining an intensity value, called R, and an allelic intensity ratio, called θ (Peiffer, 2006; Wang et al., 2007b). This transformation is a function $F: (\mathbb{R}^+)^2 \to (\mathbb{R}^+)^2$ defined by $F(A, B) = (R, \theta)$, in which R and θ are given by Equations 3.3 and 3.4. R can be seen as a measure of distance and θ is the angular coordinate. An illustration of these transformations can be seen in Section 4.1.

$$R = A + B. \qquad (3.3)$$

$$\theta = \frac{\arctan(B/A)}{\pi/2}. \qquad (3.4)$$
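Equations 3.3 and 3.4 can be checked numerically with a short Python sketch. The function name is hypothetical, and `atan2(b, a)` is used as an assumption-free stand-in for arctan(B/A) that also handles A = 0 for non-negative intensities.

```python
from math import atan2, pi


def polar_transform(a, b):
    """Map normalized allele intensities (A, B) to (R, theta)."""
    r = a + b                        # Equation 3.3: total intensity
    theta = atan2(b, a) / (pi / 2)   # Equation 3.4: angle rescaled to [0, 1]
    return r, theta


r_ab, theta_ab = polar_transform(1.0, 1.0)   # balanced signal, AB-like
r_aa, theta_aa = polar_transform(2.0, 0.0)   # only allele A, AA-like
r_bb, theta_bb = polar_transform(0.0, 2.0)   # only allele B, BB-like
```

The three idealized genotypes have the same R but θ values of 0, 0.5 and 1, which is why θ separates the genotype clusters while R tracks the total amount of signal.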

3.4.1 Log R Ratio (LRR)

The log R ratio (LRR) is obtained from the function $F: \mathbb{R}^2 \to \mathbb{R}$ defined as $F(R_{obs}, R_{ref}) = LRR$, in which LRR is given by Equation 3.5:

$$LRR = \log_2\left(\frac{R_{\text{observed intensity}}}{R_{\text{reference intensity}}}\right) \qquad (3.5)$$

The LRR equation shows that choosing a proper reference panel is critical, as it can affect all subsequent analyses (Gold Helix, 2014). The reference panel used is HapMap (The International HapMap Consortium, 2003). Based on Equation 3.5, the LRR is a continuous value and can be interpreted as an indicator of deletion or duplication. If the LRR is close to 0, it indicates that the number of copies is two, as expected. As the value of LRR decreases, the chance of a deletion increases and, similarly, as the value of LRR increases, the chance of a duplication increases.
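A small numeric sketch of Equation 3.5 follows. The function name and the intensities are hypothetical, and the example assumes an idealized array in which the total intensity is exactly proportional to the number of copies, which real data only approximate.

```python
from math import log2


def log_r_ratio(r_observed, r_reference):
    """Equation 3.5: log2 ratio of observed to reference total intensity."""
    return log2(r_observed / r_reference)


r_reference = 2.0                       # hypothetical reference intensity (two copies)
lrr_normal = log_r_ratio(2.0, r_reference)       # same intensity as the reference
lrr_deletion = log_r_ratio(1.0, r_reference)     # half the intensity (one copy)
lrr_duplication = log_r_ratio(3.0, r_reference)  # 1.5 times the intensity (three copies)
```

Under this idealization a one-copy deletion gives a negative LRR and a one-copy duplication a positive one, while a sample matching the reference gives 0.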

3.4.2 B Allele Frequency (BAF)

The B allele frequency (BAF) is calculated based on the relative allelic signal intensity ratio given by Equation 3.4, in which A and B are the normalized and summarized intensities of alleles A and B.

The function $F: (\mathbb{R}^+)^4 \to \mathbb{R}^+$ is defined by $F(\theta, \theta_{AA}, \theta_{AB}, \theta_{BB}) = BAF$, in which Equation 3.6 gives the value of BAF as a function of θ, where $\theta_{AA}$, $\theta_{AB}$ and $\theta_{BB}$ are the values of θ for the three canonical genotype clusters generated from a large set of reference samples (Wang et al., 2007b).

$$BAF = \begin{cases} 0 & \text{if } \theta < \theta_{AA} \\ 0.5\,\dfrac{\theta - \theta_{AA}}{\theta_{AB} - \theta_{AA}} & \text{if } \theta_{AA} \le \theta < \theta_{AB} \\ 0.5 + 0.5\,\dfrac{\theta - \theta_{AB}}{\theta_{BB} - \theta_{AB}} & \text{if } \theta_{AB} \le \theta < \theta_{BB} \\ 1 & \text{if } \theta \ge \theta_{BB} \end{cases} \qquad (3.6)$$

The BAF provides additional information used to call CNVs. It can be interpreted as an indicator of the presence of the B allele in the genotype: BAF = 0 indicates genotype AA, BAF = 0.5 indicates AB and BAF = 1 indicates BB. The presence of a CNV alters the value of BAF and, for such events, the BAF has a higher per-probe signal-to-noise ratio (SNR) than the LRR (Alkan et al., 2011). The advantage of using the BAF is that it allows the detection of copy-neutral events, such as uniparental disomy (UPD), in which the offspring receives two copies from one parent and no copy from the other parent (Alkan et al., 2011). Figure 3.5 shows how the combination of LRR and BAF can indicate the number of copies, including the case of UPD, where the LRR is consistent with two copies, as expected, while the BAF indicates the lack of heterozygosity. As another example, a region with three copies has a mean LRR higher than 0 and the BAF can assume four different values: 0 (AAA), 0.33 (AAB), 0.66 (ABB) and 1 (BBB).
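The piecewise interpolation of Equation 3.6 can be sketched in Python as below, reading the last case as θ ≥ θBB. The function name and the cluster centres 0.1, 0.5 and 0.9 are hypothetical values chosen for the example; in practice they come from the reference samples.

```python
def b_allele_frequency(theta, theta_aa, theta_ab, theta_bb):
    """Equation 3.6: linear interpolation of theta between canonical clusters."""
    if theta < theta_aa:
        return 0.0
    if theta < theta_ab:
        return 0.5 * (theta - theta_aa) / (theta_ab - theta_aa)
    if theta < theta_bb:
        return 0.5 + 0.5 * (theta - theta_ab) / (theta_bb - theta_ab)
    return 1.0


# Hypothetical canonical cluster positions for one SNP.
centres = (0.1, 0.5, 0.9)
baf_values = [b_allele_frequency(t, *centres) for t in (0.05, 0.3, 0.5, 0.7, 0.95)]
```

A θ exactly at the AB cluster centre maps to 0.5, and values halfway between two cluster centres map to 0.25 or 0.75, which is the kind of intermediate BAF expected when the allelic balance is shifted by a copy number change.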

Figure 3.5: Values of LRR and BAF for each case of CNV. The combination of LRR and BAF can represent which type of CNV is present, including more complex cases when the number of copies is equal to two even though there is a variation. Source: Alkan et al. (2011).

Even though the LRR and the BAF can give an insight about the number of copies for a given SNP, for the CNV calling, more factors are considered, such as distance between genetic variants.

3.5 Hidden Markov Models (HMM)

A Markov process considers that the probability of observing a particular state at a particular time point depends only on the state at the previous time point. A Hidden Markov Model (HMM) is a sequence model that assumes that the sequence of random variables indexed by the time point follows a Markov process with unobserved states. It deals with 'labeling' problems, such as gene identification (label nucleotides as exons, introns, or intergenic sequence) and CNV detection (label markers as deletion, normal, or duplication) (Eddy, 2004). In Wang et al. (2007b), a first-order HMM was adopted for CNV calling because two adjacent genetic markers are more likely to have the same number of copies when they are in close linkage. Consequently, given a sequence of genetic markers (SNPs and CN probes), the HMM allows us to identify the state (number of copies) of each marker taking into account the state of the previous marker. The notation used here for each SNP/CN probe is {r, b, z} to denote the LRR, the BAF and the copy number state, respectively. The structure of an HMM must define the following parameters (Eddy, 2004):

• A symbol and the number of symbols, K:

The symbol in an HMM is the parameter that describes each observation. In this case, each marker (SNP or CN probe) is described by r and b. Since r corresponds to the LRR, it is a continuous value. The second, b, relates to the BAF and is also a continuous value, from 0 to 1. Therefore, the number of symbols is infinite for both r and b.

• Number of states, Z = 6: In this case, there are six states as described in Table 3.1.

Table 3.1: States defined for the HMM. Each state has its corresponding number of copies and the CNV genotypes associated with it.

Copy state (z)   Total copies   Description (for autosomes)   CNV genotype
1                0              Deletion of 2 copies          -
2                1              Deletion of 1 copy            A, B
3                2              Normal state                  AA, AB, BB
4                2              Normal state (LOH)            AA, BB
5                3              Single copy duplication       AAA, AAB, ABB, BBB
6                4              Double copy duplication       AAAA, AAAB, AABB, ABBB, BBBB

• Emission probability, $e_z(x)$, for each state z: This is the probability of x = r and x = b given the state z. It integrates to one over the K symbols x, $\int_x e_z(x)\,dx = 1$.

Since we have two different parameters, there are two different emission probabilities, one for the LRR (r), given by Equation 3.7, and another for the BAF (b), given by Equation 3.8 (Wang et al., 2007b).

– LRR:

$$P(r \mid z) = \pi_r + (1 - \pi_r)\,\phi(r; \mu_{r,z}, \sigma_{r,z}) \qquad (3.7)$$

where P is the conditional probability of r given the state z, φ is the density function of a normal distribution with mean $\mu_{r,z}$ and standard deviation $\sigma_{r,z}$, and $\pi_r$ is a term associated with a uniform distribution that accounts for possible random fluctuations.

– BAF:

The emission probability of BAF is more complex, since the genotype of a CNV can have different patterns of B allele frequency (Table 3.1 and Figure 3.5).

For the BAF emission probability (P(b|z)), the terms A, B and C add values based on three cases: For 0 … and C), its distribution is modeled by a mixture of point masses at 0 ($M_0$) and 1 ($M_1$), respectively, and a truncated normal (Wang et al., 2007b).

$$P(b \mid z) = \pi_b + (1 - \pi_b)\Big\{ \underbrace{\sum_{g=2}^{K(z)-1} BN[g-1; K(z)-1, p_B]\,\phi(b; \mu_{b,g}, \sigma_{b,g})}_{A} + BN[0; K(z)-1, p_B]\,\big[\mathbb{1}_{b=0}\,M_0 + \mathbb{1}_{0\,\ldots} \qquad (3.8)$$

where

$$BN[g-1; K(z)-1, p_B] = \binom{K(z)-1}{g-1}\, p_B^{\,g-1}\,(1 - p_B)^{K(z)-g} \qquad (3.9)$$

indicates the probability of the g-th genotype, that is, the genotype with g − 1 copies of allele B, $p_B$ is the population probability of the B allele and K(z) is the number of possible genotypes for the copy number at state z. The normal density φ is analogous to the one present in the emission probability of the LRR.

• Transition probabilities, $t_z(i)$: This is the probability of going from state z to a state i (including itself). It sums to one over the Z states i, $\sum_{i=1}^{6} t_z(i) = 1$. In this case, it is the probability of the copy number changing between two adjacent markers. For this, PennCNV uses Equation 3.10, with $z_i$ being the state of the current marker and $z_{i-1}$ the state of the previous marker:

$$P(z_i = l \mid z_{i-1} = j) = \begin{cases} 1 - \sum_{k=2}^{6} p_{j,k-1}\,\left(1 - e^{-d_i/D}\right) & \text{if } l = j \\ p_{j,j-1}\,\left(1 - e^{-d_i/D}\right) & \text{if } l \neq j \end{cases} \qquad (3.10)$$

where D is a constant that is set to 100 Mb for state 4 and 100 kb for the other states, and $d_i$ is the distance in base pairs (bp) between two markers. The values of p are treated as unknown parameters and estimated with the Baum-Welch algorithm (Wang et al., 2007b).

PennCNV estimates the initial model parameters for the defined HMM empirically from several large CNV regions (Wang et al., 2007b). Once the hidden Markov model is defined, the Viterbi algorithm (Eddy, 2004) implemented by PennCNV is used to infer the most likely path (state sequences for all SNPs along each chromosome).
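To illustrate how the Viterbi algorithm recovers the most likely state path, the sketch below decodes a short toy sequence of LRR values. It is a deliberately simplified assumption-laden example, not the PennCNV model: it uses three states instead of six, normal LRR emissions with made-up means and standard deviation, fixed transition probabilities instead of the distance-dependent ones of Equation 3.10, and it ignores the BAF.

```python
from math import log, pi

STATES = ["deletion", "normal", "duplication"]
LRR_MEAN = {"deletion": -0.6, "normal": 0.0, "duplication": 0.4}  # assumed means
LRR_SD = 0.2                                                      # assumed SD


def log_emission(state, lrr):
    """Log density of a normal distribution centred on the state's LRR mean."""
    var = LRR_SD ** 2
    return -0.5 * log(2 * pi * var) - (lrr - LRR_MEAN[state]) ** 2 / (2 * var)


def log_transition(previous, current):
    """Assumed fixed transitions: stay with 0.98, move to each other state with 0.01."""
    return log(0.98) if previous == current else log(0.01)


def log_start(state):
    """Assumed initial distribution favouring the normal state."""
    return log(0.98) if state == "normal" else log(0.01)


def viterbi(observations):
    """Return the most likely state sequence for the observed LRR values."""
    best = {s: log_start(s) + log_emission(s, observations[0]) for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_best, pointers = {}, {}
        for s in STATES:
            # Best predecessor of state s at this marker.
            prev = max(STATES, key=lambda q: best[q] + log_transition(q, s))
            new_best[s] = best[prev] + log_transition(prev, s) + log_emission(s, obs)
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


lrr_values = [0.0, 0.05, -0.65, -0.6, -0.55, 0.0, 0.02]
path = viterbi(lrr_values)
```

In this toy sequence the three consecutive markers with strongly negative LRR are labeled as a deletion, because the gain in emission likelihood outweighs the cost of entering and leaving the deletion state.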

3.6 Selection of CNV Regions

This 3-step process starts from the output of PennCNV and ends with a list of genomic regions with the respective number of copies for each sample. These regions are a CNV (CN = {0, 1, 3, 4}) for some of the samples and normal (CN = 2) for the remaining samples.

3.6.1 Quality Control

As described by Turner et al. (2011), the overall quality of the data is highly important for identifying true associations in GWAS. The first procedure to exclude bad samples is to cross-check the information on sex, which has three sources. From the CEL files, we obtain the intensities of alleles A and B for all SNPs of chromosomes X and Y and summarize them into two values ($\bar{X}$ and $\bar{Y}$) using Equation 3.11. During the SNP genotyping procedure (Section 3.3.3), Birdseed creates a summary file, which includes the inferred sex of each sample. The third source is the sex declared by the participant. Inconsistencies among these sources can indicate bad quality of the sample.

$$\bar{X}_i = \frac{1}{2}\cdot\frac{\sum_{j=1}^{N_X}\left(A_{ij} + B_{ij}\right)}{N_X} \quad \text{and} \quad \bar{Y}_i = \frac{1}{2}\cdot\frac{\sum_{j=1}^{N_Y}\left(A_{ij} + B_{ij}\right)}{N_Y} \qquad (3.11)$$

in which i = 1, ..., 1120 indicates the sample, and $N_X = 9508$ and $N_Y = 134$ indicate the number of SNPs from chromosomes X and Y, respectively. $\bar{X}_i$ and $\bar{Y}_i$ are the mean intensities for chromosomes X and Y.

Apart from the sex information, all the measurements for quality control are provided by PennCNV. The script filter_cnv.pl returns the following quality control values: LRR standard deviation, BAF mean, BAF drift and waviness factor (WF), which are considered to filter bad samples. As described in the PennCNV tutorial, Affymetrix data contain more noise than Illumina data, so thresholds can be set to more liberal values. The Log R Ratio (LRR) is expected to have mean 0 and, for Affymetrix platforms, the standard deviation is expected to be up to 0.35 (Wang et al., 2007a). Meanwhile, for the B Allele Frequency, the mean value must be between 0.4 and 0.6. The BAF drift summarizes the deviation of the BAF from the values expected for two copies (0, 0.5 and 1) (Marenne et al., 2012). PennCNV suggests a threshold of 0.01, although some works with Illumina platforms set this value to 0.002. The waviness factor is a value that identifies samples whose LRR is not consistent across the genome, and it must be between -0.4 and 0.4 (Marenne et al., 2012). As described in Diskin et al. (2008), this value is calculated based on the median of LRR values and follows Equation 3.12.

$$WF = \left(1 - 2\cdot\mathbb{1}[r_{GC} < 0]\right) \times \operatorname{median}\left(\left|Y_n - \operatorname{median}(Y_n)\right|\right) \qquad (3.12)$$

where $Y_n$ is a vector of median LRR values for n non-overlapping 1 Mb windows of the human genome (n = 1, ..., ∼3000) and $r_{GC}$ is the correlation between the median LRR and the local GC content for all windows. For a negative $r_{GC}$, the WF will be negative.
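Equation 3.12 can be sketched as follows. The function names and the five toy windows are hypothetical (a real genome has about 3000 windows), and the Pearson correlation is assumed for $r_{GC}$.

```python
from math import sqrt
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def waviness_factor(window_lrr_medians, window_gc):
    """Equation 3.12: signed median absolute deviation of the window LRR medians."""
    centre = median(window_lrr_medians)
    mad = median(abs(v - centre) for v in window_lrr_medians)
    sign = -1.0 if pearson(window_lrr_medians, window_gc) < 0 else 1.0
    return sign * mad


# Toy example with five 1 Mb windows.
lrr_medians = [0.0, 0.1, -0.1, 0.2, -0.2]
gc_positive = [0.40, 0.45, 0.35, 0.50, 0.30]   # LRR rises with GC content
gc_negative = [0.40, 0.35, 0.45, 0.30, 0.50]   # LRR falls with GC content
wf_positive = waviness_factor(lrr_medians, gc_positive)
wf_negative = waviness_factor(lrr_medians, gc_negative)
```

Both toy samples have the same amount of waviness (0.1); only the sign differs, reflecting the direction of the association between LRR and GC content.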

3.6.2 Minimal Regions

The output from PennCNV has one file for each sample (individual) containing different CNV regions. To define common regions across samples, we detect the overlapping regions from all samples. Suppose we are analyzing 6 samples (from A to F) in the region from positions 1bp to 55bp of chromosome 5. Table 3.2 represents this hypothetical example as the merged outputs from PennCNV and Figure 3.6 illustrates it. In this case, sample A has two CNVs detected and D has no CNVs. Minimal regions are defined using all start and end positions of all CNVs detected.

For example, as shown in Figure 3.6, "CNV 1" has no mutations and "CNV 2" is defined between positions 5 and 10, because in samples C and E, the CNV starts at position 5 and, in sample A, the CNV starts at position 10.

Table 3.2: Hypothetical example of the merged and cleaned outputs from PennCNV. Each row represents a CNV and describes in which sample it was found, where it starts and ends, its length and its copy state (described in Table 3.1).

ID   Chromosome   Start   End   Length   Copy State
A    5            10      25    15       5
A    5            40      55    15       2
B    5            15      35    20       5
C    5            5       45    40       5
E    5            5       20    15       6
F    5            45      55    10       1

Figure 3.6: Representation of the procedure to find minimal regions across samples described in Table 3.2. The line indicates the chromosome 5 from position 1bp to 55bp. The shadowed regions indicate the CNV regions with its respective number of copies. The overlap of all regions generates the minimal regions. In this case, from the 6 CNVs, we obtained 9 minimal regions.

Once the minimal regions are defined, it is possible to check the copy number of these regions for each sample, as shown in Table 3.3. For this procedure, the package CNTools is used (Zhang, 2017) and the R script of this example can be found in Appendix C.

Table 3.3: Copy number of each sample for all minimal regions.

Chromosome   Start   End   A   B   C   D   E   F
5            1       5     2   2   2   2   2   2
5            6       10    2   2   3   2   4   2
5            11      15    3   2   3   2   4   2
5            16      20    3   3   3   2   4   2
5            21      25    3   3   3   2   2   2
5            26      35    2   3   3   2   2   2
5            36      40    2   2   3   2   2   2
5            41      45    1   2   3   2   2   2
5            46      55    1   2   2   2   2   0
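The construction of minimal regions for the hypothetical example can be sketched in Python as below. This is not the CNTools implementation: the function name is hypothetical, and the boundary convention (each breakpoint closes a region, and a CNV covers a region only if it starts at or before the previous breakpoint) is an assumption chosen because it reproduces Table 3.3 from Table 3.2.

```python
# Copies associated with each HMM state (Table 3.1).
STATE_TO_COPIES = {1: 0, 2: 1, 3: 2, 4: 2, 5: 3, 6: 4}

# Rows of Table 3.2: (sample, start, end, copy state), all on chromosome 5.
cnvs = [("A", 10, 25, 5), ("A", 40, 55, 2), ("B", 15, 35, 5),
        ("C", 5, 45, 5), ("E", 5, 20, 6), ("F", 45, 55, 1)]
samples = ["A", "B", "C", "D", "E", "F"]


def minimal_regions(cnvs, samples, first_position=1):
    """Return (start, end, copies per sample) for every minimal region."""
    # Every CNV start or end is a breakpoint between minimal regions.
    breakpoints = sorted({pos for _, start, end, _ in cnvs for pos in (start, end)})
    regions = []
    previous = first_position - 1
    for bp in breakpoints:
        copies = {s: 2 for s in samples}          # default: normal state
        for sample, start, end, state in cnvs:
            if start <= previous and end >= bp:   # CNV spans the whole region
                copies[sample] = STATE_TO_COPIES[state]
        regions.append((previous + 1, bp, [copies[s] for s in samples]))
        previous = bp
    return regions


regions = minimal_regions(cnvs, samples)
```

The six CNVs produce nine minimal regions, and each tuple in `regions` corresponds to one row of Table 3.3.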

3.6.3 Filtering CNV Regions

After identifying the minimal regions, we expect to find regions with rare mutations, i.e., observed in very few samples, and regions where no sample has mutations. For this reason, a procedure for excluding these regions was developed. Similar to the minor allele frequency (MAF) in SNP data, which refers to the frequency at which the least common allele occurs in a given population (The International HapMap Consortium, 2005), we defined a "minor CNV frequency (MCF)" of 2%. Thus, given that a CNV region can have up to 5 categories based on the number of copies (0, 1, 2, 3 and 4), the CNV is only taken into account if it has at least two categories with more than 2% of the samples. For example, in Table 3.3, the CNV from position 1 to 5 would be disregarded because it has only one category (all samples have 2 copies, so category 2 has 100% of the samples). The R script used for this step can be found in Appendix D.
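The MCF rule can be expressed as a short Python check. The function name is hypothetical, and the script actually used in this work is the R script referenced above.

```python
from collections import Counter


def passes_mcf(copy_numbers, mcf=0.02):
    """Keep a region only if at least two copy-number categories exceed the MCF."""
    counts = Counter(copy_numbers)
    n = len(copy_numbers)
    frequent = [cn for cn, k in counts.items() if k / n > mcf]
    return len(frequent) >= 2


# Rows of Table 3.3 (samples A-F).
keep_region_1_5 = passes_mcf([2, 2, 2, 2, 2, 2])    # only one category
keep_region_6_10 = passes_mcf([2, 2, 3, 2, 4, 2])   # three categories

# Larger hypothetical example with 100 samples.
keep_rare = passes_mcf([2] * 99 + [3])              # the variant reaches only 1%
keep_common = passes_mcf([2] * 97 + [3] * 3)        # the variant reaches 3%
```

With only six samples every observed category exceeds 2%, so the rule only removes regions without any variation; with the 1,120 samples of this study a category needs more than 22 carriers to count.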

3.7 Association Study and Polygenic Mixed Model

The basis of the statistical methodology for association studies and heritability estimation in family designs is the linear mixed model (Equation 3.13). As in the classical model, y is the response vector, β indicates the fixed effects, and u and ε are the random effects. X and Z are known matrices linking β and u to y, respectively (Laird and Ware, 1982; McCulloch and Searle, 2001):

$$\underset{n_c \times 1}{y_c} = \underset{n_c \times p}{X_c}\;\underset{p \times 1}{\beta} + \underset{n_c \times q}{Z_c}\;\underset{q \times 1}{u} + \underset{n_c \times 1}{\varepsilon_c}, \qquad (3.13)$$

where p is the number of fixed factors, q is the number of random factors and $n_c$ is the number of samples in cluster c. Assuming multivariate normal distributions for the random effects, E[u] = 0, cov(u) = D, E[ε] = 0 and cov(ε) = R, with R and D positive definite matrices with known structure, Equation 3.14 holds:

$$u \sim N_q(0, D), \qquad \varepsilon_c \sim N_{n_c}(0, R), \qquad y_c \sim N_{n_c}(X\beta,\; ZDZ^T + R). \qquad (3.14)$$

As described in Section 4.6, we work with family-based data, which describe the relatedness between individuals. Based on this, the polygenic model described by Amos (1994) and Blangero et al. (2013) can be formulated as in Equation 3.15:

$$y_f = X_f \beta + g_f + \varepsilon_f, \qquad (3.15)$$

where $y_f$ is the response vector from family f, with f corresponding to the cluster c in Equation 3.13 and indicating here the family; $g_f$ and $\varepsilon_f$ are the polygenic and error random effect vectors, respectively.

Compared to Equation 3.13, the term $Z_f u$ is replaced by $g_f$, a vector of individual random genetic effects with $E[g_f] = 0$ and $Cov[g_f] = 2\Phi_f \sigma_g^2$, in which $2\Phi_f$ is a known genetic relationship matrix between individuals, such that $2\Phi_f = Z_f Z_f'$ and $D = I\sigma_g^2$, where $\sigma_g^2$ is the additive polygenic variance component (Blangero et al., 2013). Thus, for the polygenic model, the covariance matrix is $Cov(y_f) = \Omega_f = 2\Phi_f \sigma_g^2 + I_f \sigma_e^2$.

Since individuals of the same family share different proportions of the genome, the matrix $\Phi$ (denoted φ in Figure 3.7), called the kinship matrix, is an important element of the polygenic model. Each element of the kinship matrix is defined by $(1/2)^r$, where r indicates the degree of kinship (for example, r = 1 for a person with themselves, r = 2 for parent-child and sibling pairs, r = 3 for grandparent-grandchild pairs). Figure 3.7 illustrates an example of how this matrix is built.

Figure 3.7: Example of the kinship matrix (φ) given the family represented by the pedigree. Father-offspring (as individuals 2 and 4), mother-offspring (as individuals 1 and 4) and sibling (as individuals 4 and 5) relationships are 0.25, since they share 50% of the genome and the model works with 2φ. An R script to obtain the pedigree plot and the kinship matrix is found in Appendix E.
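The kinship coefficients can also be computed recursively from a pedigree, as in the Python sketch below. The pedigree is hypothetical and only mirrors the relationships named in the caption (1 and 2 as parents of 4 and 5), with an extra unrelated individual 3 and a grandchild 6 added for illustration. The sketch assumes non-inbred founders, that both parents of a non-founder are known, and that parents are listed before their children.

```python
def kinship_matrix(pedigree):
    """pedigree: list of (id, parent_a, parent_b); founders have None parents."""
    phi = {}
    seen = []
    for ind, parent_a, parent_b in pedigree:
        for other in seen:
            if parent_a is None:
                value = 0.0  # founders are assumed unrelated to everyone listed earlier
            else:
                # Kinship with a relative is the average kinship of that relative with the parents.
                value = 0.5 * (phi[(other, parent_a)] + phi[(other, parent_b)])
            phi[(ind, other)] = phi[(other, ind)] = value
        # Self-kinship: 0.5 for non-inbred individuals, higher if the parents are related.
        phi[(ind, ind)] = 0.5 if parent_a is None else 0.5 * (1 + phi[(parent_a, parent_b)])
        seen.append(ind)
    return phi


# Hypothetical pedigree: 1 and 2 are the parents of 4 and 5; 3 is unrelated; 6 is a child of 3 and 4.
pedigree = [(1, None, None), (2, None, None), (4, 2, 1), (5, 2, 1),
            (3, None, None), (6, 3, 4)]
phi = kinship_matrix(pedigree)
```

The resulting values agree with the $(1/2)^r$ rule for these non-inbred relationships: 0.5 for an individual with themselves, 0.25 for parent-child and sibling pairs, and 0.125 for the grandparent-grandchild pair (1, 6).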

As described in Demidenko (2004), mixed models, specifically variance components models, are important tools since they allow one to introduce different (co)variations, for instance, the within-cluster (or intra-cluster) variation and the between-cluster (or inter-cluster) variation. Thus, an important quantity is the intra-class correlation coefficient, which represents the proportion of the total variance explained by the random effect (u and g, in Equations 3.13 and 3.15, respectively).

For all quantitative phenotypes, the variance of a given trait in the population can be explained by biological and environmental factors. The polygenic model allows us to understand the proportion of the phenotypic variance explained by genetic factors using the intra-class correlation coefficient, which, in this case, is called heritability (Visscher et al., 2008) (Equation 3.16):

$$h^2 = \frac{\sigma_g^2}{\sigma_y^2} = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2}, \qquad (3.16)$$

where $\sigma_y^2$ is the total variance of the phenotype. The hypothesis of interest in this case can be described as in Equation 3.17:

$$H_0: \sigma_g^2 = 0 \quad \text{vs} \quad H_1: \sigma_g^2 > 0, \qquad (3.17)$$

which can be tested using the Likelihood Ratio Test (LRT) (Self and Liang, 1987). This test is commonly used in quantitative genetics to find traits with a significant additive polygenic effect. The LRT statistic is asymptotically distributed as a mixture of $\frac{1}{2}\chi^2_0$ and $\frac{1}{2}\chi^2_1$ (Blangero et al., 2013; Self and Liang, 1987).

In Section 4.9, a polygenic mixed model is used to understand the effect of CNVs on height, with the CNV as a covariate. In addition, in Section 4.8.2, we use the model to estimate the CNV transmission rate, in which the CNV is used as a dichotomous variable.

Chapter 4

Application in Baependi Heart Study

Chapter 2 shows that there are studies considering CNVs and their characteristics. However, as studies involving the Brazilian population are scarce, one of our aims is to detect CNVs and characterize their occurrence pattern based on data from the Baependi Heart Study.

For CNV calling, the raw data are preprocessed using quantile normalization and median polish, as presented in the previous chapter. The intensity values obtained from the SNP data are used to calculate the LRR and the BAF, and in Section 4.1 we illustrate this procedure. The CNV calling is performed using PennCNV. A description of the output containing the CNVs found and the quality control measurements is given in Sections 4.2 and 4.3.

For the CNV analysis, in Section 4.6, we first present a brief description of the family structure and the height of the Baependi data. The following sections aim to characterize the CNVs found in samples from the Baependi population, which includes answering specific questions such as "how long are the CNVs?" and "how many CNVs does an individual have?". Section 4.8 analyses the inheritance of CNVs. Section 4.9 shows the results of the association study between CNVs and height.

Some of the scripts used to implement the analysis are found in summarized form in Appendices A, B, C, D, E and G. Complete scripts are found at https://github.com/Cicconella/Mestrado.

4.1 Log R Ratio (LRR) and B Allele Frequency (BAF)

As described in Section 3.4, the raw data in the CEL files from the Baependi Heart Study contain, among other information, the intensities of the two alleles of the SNPs. The extraction, normalization and summarization of this information was performed using Affymetrix Power Tools, and the output for the SNP located at position 3302871bp of chromosome 1 is illustrated in Figure 4.1a, in which each point represents a sample (individual). As expected, it is also possible to observe the formation of clusters representing the genotypes AA, AB and BB described in Section 3.3.3.

Figure 4.1b illustrates the rotation of the raw data after the polar transformation described in Section 3.4. This transformation facilitates the interpretation of the intensities of alleles A and B. Their sum (R) indicates a possible deletion, when the total is lower than the average, or a possible duplication, when it is higher than the average. Meanwhile, θ represents the proportion of the B allele in the genotype.

(a) Raw data. (b) Polar coordinate transformation.

Figure 4.1: Intensity of probes A and B of one SNP from 1120 samples. a displays the raw intensity and b contains the same information after polar transformation.

Even though R and θ are informative, these values are normalized based on a reference panel containing the same values estimated from a reference population. By default, PennCNV uses HapMap as the reference panel for the calculation of LRR and BAF, but for a large number of samples the reference can be the set of samples itself. After the estimation of these values for each SNP, the HMM can be applied for the CNV calling.

Figure 4.2 shows the LRR and BAF obtained for the same SNP from the previous figures for the 1,120 samples. Based on Equation 3.5, an LRR around 0 implies that the sample has a value similar to the reference panel and that its number of copies is two. Hence, an LRR far from 0 indicates a loss or gain in the DNA sequence. As shown in Figure 4.2, the LRR ranges from about -1.5 to 1.5; usually, values lower than -0.5 indicate a deletion and values higher than 0.5 indicate a duplication. The BAF assumes values close to 0, 0.5 and 1 and can be seen as the proportion of the B allele in the genotype. "Big" deviations from these values can indicate duplications or deletions. For example, if the BAF is around 0.33, the expected genotype would be AAB. Based on this information, it is possible to check for Hardy-Weinberg equilibrium, since we have the frequency of samples for each genotype (AA, AB and BB).

Figure 4.2: LRR and BAF of SNP _A4265735 from the 1120 samples.

4.2 CNV calling

The CNV calling is performed with the Hidden Markov Model presented in Section 3.5. At the end of the procedure, several outputs are obtained containing different information, such as the LRR and BAF of all markers for all samples and the quality control summary for all samples. The main output is a collection of similar files describing the CNVs of each sample. An example illustrating their structure is shown in Table 4.1.

Table 4.1: Example of the file containing the CNVs from sample 1. PennCNV generates a file with this structure for each sample. Each line describes a CNV. Columns "Chr", "Start" and "End" indicate the region of the CNV. "Number" is the number of markers from the Affymetrix 6.0 platform inside the region of the CNV. "Length" is the size of the CNV in base pairs. "State" corresponds to the HMM states (Table 3.1) and "CN" is the number of copies associated with the state. "First Marker" and "Last Marker" identify the markers where the CNV starts and ends.

Sample   Chr   Start      End        Number   Length   State   CN   First Marker   Last Marker
1        15    22231485   22264715   31       33231    2       1    CN_691574      CN_691602
1        19    59989695   60040503   29       50809    2       1    CN_170378      SNP_A-4271224
1        17    18296117   18373803   21       77687    2       1    CN_749706      CN_751779
1        17    67057139   67076931   22       19793    2       1    CN_744214      CN_744222
1        9     44181813   44569219   23       387407   2       1    CN_1322576     CN_1322482
1        4     64380064   64390853   30       10790    1       0    CN_1052052     CN_1052079

4.3 Quality Control

In total, 375,312 CNVs were identified in the 1,120 samples. This corresponds to an average of 335 CNVs per sample, which is higher than what PennCNV usually finds (up to 100 CNVs per sample). In addition, as PennCNV summarizes the values used for the CNV calling per sample, it recommends evaluating the quality of the samples. Therefore, this section explains the filtering process used to obtain the samples with higher quality.

As described in Section 3.6.1, our quality control procedure is based on cross-checking the information on sex and evaluating the measurements obtained from PennCNV. All samples were evaluated for every measurement and the filtering was performed at the end of the evaluation.

Considering sex estimates from PennCNV and the intensities of SNPs from chromosomes X and Y, we calculated the mean intensity of the SNPs of chromosomes X and Y for each sample and compared them to the Birdseed estimates. Figure 4.3 shows the mean intensity of the SNPs of chromosomes X and Y for each sample. The colors indicate which sex was assigned to each sample by Birdseed. Eighteen individuals were mislabelled: 1 was labeled as male when the SNP intensities indicate female, and 17 the other way around.

Figure 4.3: X and Y probe intensities for all 1120 samples. The x-axis and y-axis indicate the sum of the average over all probes for the normalized Cartesian intensity for alleles A and B using some probes available on X chromosome and Y chromosome, respectively. Red dots indicate the female samples and blue dots, the male samples.

As described in Turner et al. (2011), another way to confirm the quality in GWAS is to compare the declared sex of each sample with the predicted sex. In this comparison, Birdseed predicted 5 females as male and 17 males as female. All 18 samples mislabelled in the first comparison were among the 22 samples mislabelled in the second comparison.

As mentioned before, PennCNV provides some quality control measurements; an example of the structure of this output is shown in Table 4.2. The first element to be examined is the LRR, where high variation can generally be

Table 4.2: Quality control measurements from PennCNV.

File  LRR Mean  LRR Median  LRR SD  BAF Mean  BAF Median  BAF SD  BAF Drift  WF      NumCNV
1     −0.019    0           0.230   0.500     0.500       0.061   0.001      0.008   51
2     −0.021    0           0.235   0.500     0.500       0.056   0.001      −0.027  992
7     −0.018    0           0.234   0.500     0.500       0.052   0.001      −0.026  948
14    −0.012    0           0.190   0.500     0.500       0.052   0.0004     0.015   165
16    −0.014    0           0.191   0.500     0.500       0.058   0.001      −0.013  77
18    −0.011    0           0.211   0.500     0.500       0.061   0.001      0.011   52

treated as bad genotyping quality. By default, samples with an LRR standard deviation higher than 0.2 are flagged. However, as described in the PennCNV manual (Wang et al., 2007a), this value is too stringent, so the threshold was set to 0.35, based on the histogram of the LRR standard deviation (Figure 4.4). In this case, 101 of the 1,120 samples did not pass the quality control.

Figure 4.4: Histogram of the standard deviation of Log R Ratio.

Considering the BAF, its mean is expected to be close to 0.5, since BAF values are classified as 0, 0.5 and 1, so a BAF mean higher than 0.6 or lower than 0.4 is treated as bad clustering quality. Fortunately, as shown in Figure 4.5, no samples failed this test. Another value associated with the BAF is its drift, which takes into account abnormal BAF patterns. Considering the standard threshold for bad genotyping (drift higher than 0.002), almost 50% of the samples failed (Figure 4.6). As these thresholds were largely based on Illumina arrays and our samples are from the Affymetrix 6.0 platform, the threshold was set to 0.01, under which 56 of the 1,120 samples failed the test.

Figure 4.5: Histogram of the BAF mean.

Figure 4.6: Histogram of the BAF drift.

The last quality control measure is the waviness factor (WF). The WF measures the waviness of the signal curves, since peaks and troughs of the wave can create artificial gains and losses in the genome. This value is expected to be between -0.04 and 0.04 and, under this criterion, about 10% of the samples failed the quality test (Figure 4.7).

After removing the samples that failed at least one of the quality controls, one more filter was applied, excluding families with no related individuals. As explained in Section 3.7, the model we use for association analysis takes into account the relatedness between family members. Families whose kinship matrix contains only null degrees of kinship do not contribute to the model. For example, when only a couple (husband and wife) passes the quality control, no relatedness is found inside this family. At the end of this filtering, the final number of samples is 910.

Figure 4.7: Histogram of waviness factor.
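The four PennCNV-based criteria of this section can be combined into a single per-sample check. The R sketch below is only an illustration under stated assumptions: the data frame `qc` holds the six samples of Table 4.2 typed in by hand, and the function `qc_pass` and its argument names are hypothetical, not part of PennCNV. The sex check and the family filter are not included.

```r
# QC summary for the six samples shown in Table 4.2 (entered by hand)
qc <- data.frame(
  File      = c(1, 2, 7, 14, 16, 18),
  LRR_SD    = c(0.230, 0.235, 0.234, 0.190, 0.191, 0.211),
  BAF_Mean  = c(0.500, 0.500, 0.500, 0.500, 0.500, 0.500),
  BAF_Drift = c(0.001, 0.001, 0.001, 0.0004, 0.001, 0.001),
  WF        = c(0.008, -0.027, -0.026, 0.015, -0.013, 0.011)
)

# TRUE when a sample passes all four criteria described in this section.
# Default arguments are the thresholds adopted in the text.
qc_pass <- function(d, lrr_sd_max = 0.35, baf_mean_range = c(0.4, 0.6),
                    baf_drift_max = 0.01, wf_max = 0.04) {
  d$LRR_SD <= lrr_sd_max &
    d$BAF_Mean >= baf_mean_range[1] & d$BAF_Mean <= baf_mean_range[2] &
    d$BAF_Drift <= baf_drift_max &
    abs(d$WF) <= wf_max
}

pass_adopted <- qc_pass(qc)                    # thresholds adopted in this work
pass_default <- qc_pass(qc, lrr_sd_max = 0.2)  # default LRR SD threshold

qc$File[pass_adopted]
qc$File[pass_default]
```

With the adopted thresholds all six example samples pass, while the default LRR standard deviation threshold of 0.2 would keep only samples 14 and 16, which illustrates why that default was considered too stringent for these data.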

4.4 Minimal Regions

Section 3.6.2 describes the procedure to obtain CNV regions that are common to all samples. Applying it to the 135,414 CNVs obtained from the 910 samples, we obtained 64,107 minimal regions. A detailed description of the CNV regions is presented in the following sections.

4.5 CNV Filter

As described in Section 3.6.3, after the minimal regions are obtained, some of them may contain few or no samples with a mutation. Minimal regions in which only one copy number category has more than 2% (18) of the samples are excluded, i.e., regions in which 98% of the samples are in category 2 (normal number of copies). After this step, 8,794 CNV regions remained from the 64,107 minimal regions. Figures 4.8 and 4.9 illustrate the difference between the copy number distribution of a region that passes the filter and that of a region that is excluded, respectively.

Figure 4.8: CNV region with four categories (Chr1: from 17,101,294 to 17,108,271). This region is not excluded since, of the four categories (1, 2, 3 and 4), three have more than 18 samples.

Figure 4.9: CNV region with three categories (Chr1: from 3,353,296 to 3,369,917). Since only category 2 has at least 18 samples, the region is excluded.
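The filter can be sketched in R as follows. This is one possible reading of the rule, not the script used in this work: a region is kept when at least two copy number categories have more than 18 samples each. The function `keep_region`, its arguments and the two copy number vectors are hypothetical and serve only as illustration.

```r
# Keep a region when at least `min_categories` copy-number categories
# have more than `min_samples` samples each (assumed reading of the rule).
keep_region <- function(cn, min_samples = 18, min_categories = 2) {
  counts <- table(factor(cn, levels = 0:4))
  sum(counts > min_samples) >= min_categories
}

# Hypothetical copy-number vectors for 910 samples
region_kept     <- rep(c(1, 2, 3, 4), times = c(100, 700, 100, 10))
region_excluded <- rep(c(1, 2, 3), times = c(5, 900, 5))

keep_region(region_kept)      # three categories exceed 18 samples
keep_region(region_excluded)  # only category 2 exceeds 18 samples
```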

4.6 Baependi Samples

As described in Section 4.3, 910 of the 1,120 samples passed the quality control criteria. Of these samples, 383 are males and 527 are females. In total, there are 80 families, with a mean of 12 individuals and a maximum of 72. The age of the samples ranges from 18 to 95 years, with a higher concentration between 32 and 56 years and no relevant differences between males and females (Figure 4.10a). The height of the samples ranges from 139cm to 192cm, with a higher concentration between 157cm and 170cm (Figure 4.10b). In addition, females are, on average, 10cm shorter than males.

(a) (b)

Figure 4.10: Distribution of the age (a) and height (b). All the 910 samples were considered together and divided by males and females.

The samples of this study are part of the Baependi population and, as the Brazilian population is highly admixed, it is important to observe the ancestry of the samples. In De Andrade et al. (2015), a principal component analysis (PCA) was performed on the Baependi Heart Study data, taking into account the family structure. Figure 4.11 shows the stratification of the Baependi individuals in two latent variables (PC1 and PC2). These PCs, which represent the ancestry coefficients, are recommended for correcting a possible admixture effect in association analysis and are considered in Section 4.9. Samples with higher values in PC1 were declared as Afro-descendants.

Figure 4.11: Distribution of individual ancestry. Source: De Andrade et al. (2015).

4.7 CNVs in Brazilian Population

As described in Section 4.3, 910 of the 1,120 samples were kept for analysis after the quality control filtering. From the original data, we were able to identify 375,312 CNVs and, after the cleaning procedure, this value dropped to 135,414 CNVs. For the following descriptive analysis, only the 910 samples are considered. As described in Section 3.6.2, from the cleaned data we obtained the minimal regions, in which the overlap of CNVs is considered. In this section, we also analyze the 64,107 minimal regions and the regions that remain after the filtering procedure (Section 4.5). In summary, Figure 4.12 highlights the impact of the filtering procedures on the total number of CNVs detected.

Figure 4.12: Total of CNVs in each procedure. The highlighted boxes indicate the number of CNVs used during the CNV analysis.

4.7.1 How many CNVs does an individual have?

The number of CNVs obtained from each sample varies from 17 to 2,921, which shows a large variability (Figure 4.13). However, from Table 4.3, we can also observe that 83% of the samples contain fewer than 100 CNVs, which is the expected limit for PennCNV. For this group of samples, the mean number of CNVs per sample is x̄ = 56.49 and the standard deviation is s = 15 (Figure 4.14). For the complete group of samples and for the subgroup, the medians of 60 and 57 CNVs, respectively, are compatible with similar studies (Itsara et al., 2010; Scharpf et al., 2014).

Figure 4.13: Absolute frequency of samples based on the individual number of detected CNVs.

Figure 4.14: Distribution of individual number of detected CNVs for all samples with less than 100 CNVs.

Table 4.3: Cumulative frequency of samples based on the number of CNVs. E.g., 29% of the samples have fewer than 50 CNVs and 84% of the samples have fewer than 125 CNVs.

Number of CNVs  Cumulative Frequency of Samples
25              0.01
50              0.29
75              0.71
100             0.83
125             0.84
150             0.86
175             0.87
200             0.88
250             0.89
300             0.90
350             0.91
400             0.92
550             0.93
600             0.94
675             0.95
850             0.96
975             0.97
1,225           0.98
1,400           0.99
1,825           1.00
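A cumulative frequency of this kind can be computed directly from the number of CNVs per sample. The R sketch below uses a small hypothetical vector `n_cnv` (ten made-up samples, not the Baependi data) and assumes that each row of the table gives the proportion of samples with fewer CNVs than the cut-off.

```r
# Hypothetical number of CNVs detected in each of ten samples
n_cnv <- c(17, 42, 48, 55, 60, 63, 71, 90, 240, 1300)

# Proportion of samples with fewer CNVs than each cut-off
cutoffs  <- c(25, 50, 75, 100, 250)
cum_freq <- sapply(cutoffs, function(k) mean(n_cnv < k))

data.frame(cutoffs, cum_freq)
```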

In general, the identification of deletions is easier than that of duplications. As expected, Figure 4.15a shows that 70% of all CNVs are deletions (categories 0 and 1). However, when only the CNVs from the subgroup of samples with fewer than 100 CNVs are considered, this value drops to 65% (Figure 4.15b).

(a) All samples (b) Samples with less than 100 CNVs

Figure 4.15: Distribution of CNVs regarding the number of copies. 0 and 1 indicate deletion, while 3 and 4 indicate duplication. Figure a contains CNVs from all the samples and b contains only CNVs from samples with fewer than 100 CNVs.

During the procedure of finding the minimal regions, the number of CNVs per sample is expected to increase, since the original CNV regions can be broken into several minimal regions. Each sample carries only a subset of the 8,794 CNVs, as shown in Figure 4.16. In this case, 88% of the samples have fewer than 1,000 CNVs.

Figure 4.16: Absolute frequency of samples based on the 8,794 CNVs.

As CNVs are a type of structural variant, de novo CNVs may occur during our lifetime. A linear regression of the number of CNVs per sample on the age of the individuals shows no significant linear trend, as illustrated in Figure 4.17, although the fitted line corresponds to one "new" CNV every 17 years.

Figure 4.17: Number of CNVs according to the age.

4.7.2 How long are the CNVs?

The length of a CNV varies from 3bp to 27,435,314bp (27.4Mb) and follows a log-normal distribution, as also obtained by Scharpf et al. (2014). Figure 4.18 shows histograms of the size of CNVs, indicating that deletions are, in general, shorter than duplications.

(a) (b) (c)

Figure 4.18: Histograms of CNV length. Figure a considers only deletions, figure b considers only duplications and c contains all CNVs. The data is presented in exponential scale.

The length of the CNVs obtained after filtering the minimal regions also shows high variation, from 1bp to 18,363,770bp (18Mb), and, as expected, the average length is shorter than that of the original CNVs, as shown in Figure 4.19.

Figure 4.19: Histogram of filtered CNVs length. Distribution of CNV length considering the filtered CNVs after obtaining the minimal regions.

4.7.3 Where are the CNVs?

As described in Chapter 2, chromosomes 19, 22 and Y present the largest proportions of CNVs. In our dataset, chromosomes 19 and 8 have the most CNV regions based on the number of base pairs, as shown in Figure 4.20. However, when only CNVs detected in at least 5% of the samples are considered, chromosomes 19 and 9 have the largest proportions. For this analysis we use the minimal regions before the filtering process (64,107 CNVs).

Considering the minimal regions, Table 4.4 shows the number of CNVs based on their frequency in our samples. For example, 23 CNVs are present in at least 60% of the samples. The CNVs that are highly present in the population could be characterized as Brazilian-specific variants.

To understand the distribution of CNVs in the genome, Figure 4.21 shows the absolute frequency of the detected CNVs along the positions of chromosomes 1 and 6 (all chromosomes are represented in Figures F.1, F.2 and F.3 in Appendix F).

Table 4.4: Absolute frequency of CNVs based on the relative frequency of samples. For example, 10,043 CNVs are present in at least 2% of the samples.

Samples (%)  Number of CNVs
2            10,043
5            4,055
10           1,983
20           1,007
30           561
40           320
50           147
60           23
70           23
80           21
90           17
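Counts such as those in Table 4.4 can be obtained from a region-by-sample indicator matrix. The following R sketch is illustrative only: the matrix `carrier` is hypothetical (four regions, ten samples), and it assumes that a sample "has" a CNV in a region whenever its copy number differs from 2.

```r
# Hypothetical indicator matrix: rows are minimal regions, columns are samples;
# 1 means the sample carries a CNV (copy number different from 2) in the region.
carrier <- rbind(
  region1 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 0),
  region2 = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0),
  region3 = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  region4 = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0)
)

# Percentage of samples carrying a CNV in each region
pct_samples <- 100 * rowSums(carrier) / ncol(carrier)

# Number of regions present in at least each percentage of the samples
thresholds <- c(10, 30, 50, 90)
n_regions  <- sapply(thresholds, function(p) sum(pct_samples >= p))

data.frame(thresholds, n_regions)
```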

Figure 4.20: Proportion of CNVs in each chromosome based on the total number of base pairs. Blue bars consider the CNVs present in at least one of the 910 samples and green bars those present in at least 20 samples.

(a) Chromosome 1 (b) Chromosome 6

Figure 4.21: Frequency of CNVs per region after finding the minimal regions. The x-axis represents the positions of the chromosome by base-pairs. The y-axis indicates the number of CNVs detected in the respective position for 910 samples.

The region with the highest presence of CNVs across samples is in chromosome 1 between positions 72,541,505bp and 72,583,736bp (Figure 4.21a), where, on average, 818 of the 910 samples have a deletion or a duplication. Studies indicate that a deletion in this region of 1p31.1 (gene NEGR1) can be associated with severe obesity (Wheeler et al., 2013) and with neuropsychiatric and behavioral problems (Genovese et al., 2015). However, based on DECIPHER, deletions and duplications in this region must be at least 4Mb long to be pathogenic, while our region has 42,231bp. The distribution of the copy number in our population is described in Table 4.5.

Table 4.5: Distribution of CNV for 910 samples.

CN       0   1    2   3    4
Samples  95  322  57  238  198

Another region with a high frequency of CNVs among samples is in chromosome 6 between positions 79,044,206bp and 79,084,489bp, which lies in region 6q14.1 and in the gene MYO6, which encodes the myosin VI protein (Figure F.1f). Microdeletions in this gene were reported to cause conditions ranging from intellectual disability to severe autistic disorder (Becker et al., 2012; Quintela et al., 2015), depending on the size of the region. Of the 910 samples, on average 487 have some kind of CNV, 60% of which are deletions (CN = {0, 1}). The size can vary from sample to sample, since the region is broken up when the minimal regions are found.

4.8 CNV Inheritance

As described in Chapter 1, one desirable characteristic of a genetic marker is to be heritable in a simple Mendelian fashion. However, the literature shows different conclusions about the inheritance of CNVs. To understand this problem, we use two methods. In the first approach, we study the CNV occurrences in trios, evaluating the segregation pattern from parents to child. In the second, we apply a linear mixed model using the CNV as a response variable.

4.8.1 CNV occurrences in trios

In total, 106 trios were analyzed. They include trios with the same parents and different offspring. Appendix G contains a script describing how the trios and the occurrences of CNVs in them are identified, as well as the final results for each chromosome. The occurrences of CNVs in trios are described as a three-digit number, in which the first digit indicates the number of copies of one parent, the second that of the other parent and the third that of the offspring. As expected, "222" (parents and offspring are normal) is the most common combination of CNV occurrences: on average, 77.45% of the trios are all normal across the 8,794 CNVs.

To understand the inheritance of CNVs, we consider the case of one parent being normal and the other having a deletion. As shown in Figure 4.22, following the Mendelian laws, we expect in this case that the proportion of children with two copies is similar to the proportion of children with one deletion. However, as shown in Table 4.6, on average 7.52% of the trios have parents with combination "21" (one normal parent and another with a single deletion), and the mean relative frequency of the trios whose offspring has a deletion is 1.29%, while that of the trios with normal offspring is 6.06%. It means that, in general, the affected parent transmits the normal allele instead of the allele with the deletion.

Figure 4.22: Cases of CNV transmission. The title indicates the case of CNV occurrences, i.e., "212" shows the trio in which one parent and the offspring are normal and the other parent has one deletion. The number inside the diamonds indicates the number of copies that the individual carries.

Table 4.6: Mean relative frequency (%) for CNV occurrences in trios with one normal parent and another with single deletion. The value is the average, over all 22 chromosomes, of the mean relative frequency per chromosome.

CNV occurrence  210   211   212   213   214
Mean Freq. (%)  0.08  1.29  6.06  0.08  0.01
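The three-digit coding and the comparison between transmitted and non-transmitted deletions can be sketched in R. This is a minimal illustration, not the script of Appendix G: the helper `trio_code` and the five trios are hypothetical, while `freq_21` holds the values of Table 4.6.

```r
# Build the three-digit code: parent 1, parent 2, offspring
trio_code <- function(parent1, parent2, child) {
  paste0(parent1, parent2, child)
}

# Hypothetical copy numbers for five trios at one CNV region
parent1 <- c(2, 2, 2, 2, 2)
parent2 <- c(1, 1, 2, 1, 2)
child   <- c(2, 1, 2, 2, 1)

codes <- trio_code(parent1, parent2, child)
table(codes)

# Mean relative frequencies (%) from Table 4.6
freq_21 <- c("210" = 0.08, "211" = 1.29, "212" = 6.06, "213" = 0.08, "214" = 0.01)

# Share of each offspring outcome among trios with parents "21"
round(100 * freq_21 / sum(freq_21), 1)
```

Because the values in Table 4.6 are averages over chromosomes, the shares computed in the last line are only a rough indication of how often the deletion is transmitted among "21" parents.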

A similar situation can be found for trios in which one of the parents is normal and the other has one duplication ("23"). As shown in Table 4.7, in more than twice as many cases the affected parent transmitted the normal allele instead of the allele with a duplication.

Table 4.7: Mean of the relative frequency per chromosome (%) for CNV occurrences in trios with one normal parent and another with single duplication. Independent of the chromosome, the mean was obtained as the average of the relative frequency mean from all 22 chromosomes.

CNV occurrence  320   321   322   323   324
Mean Freq. (%)  0.01  0.07  2.93  1.22  0.08

When both parents are normal ("22" combination), we consider that a CNV in the offspring is a de novo mutation. As shown in Table 4.8, the frequency of a de novo single deletion/duplication in the child is, as expected, almost five times higher than that of a double deletion/duplication, and deletions are more common than duplications.

Table 4.8: Mean of the relative frequency (%) for CNV occurrences in trios with both parents normal. Independent of the chromosome, the mean was obtained as the average of the relative frequency mean from all 22 chromosomes.

CNV occurrence  220   221   222    223   224
Mean Freq. (%)  0.76  3.49  77.45  1.27  0.26

As expected, cases that do not follow the Mendelian laws and would demand de novo mutations, such as "410", "104" and "114", among others, were not found in our dataset.
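Whether a trio code is compatible with Mendelian transmission can be checked with a small R sketch. It relies on a simplifying assumption that is not stated in the text: the copies of an individual are split as evenly as possible between the two homologous chromosomes (for instance, a parent with two copies carries one copy on each chromosome). The functions `transmittable` and `is_mendelian` are hypothetical names used only for illustration.

```r
# Copy numbers a parent with total copy number `cn` can transmit, assuming
# the copies are split as evenly as possible between the two chromosomes
# (0 -> 0; 1 -> 0 or 1; 2 -> 1; 3 -> 1 or 2; 4 -> 2).
transmittable <- function(cn) {
  unique(c(floor(cn / 2), ceiling(cn / 2)))
}

# TRUE when the offspring copy number can be explained without a de novo event
is_mendelian <- function(code) {
  cn <- as.integer(strsplit(code, "")[[1]])
  possible <- outer(transmittable(cn[1]), transmittable(cn[2]), "+")
  cn[3] %in% possible
}

sapply(c("212", "211", "410", "104", "114"), is_mendelian)
```

Under this assumption, "212" and "211" are compatible with Mendelian transmission, while "410", "104" and "114" are not, in agreement with the cases listed above.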

4.8.2 CNV Trait Heritability

In this section, we aim to understand how much of the presence of a CNV can be explained by genetic factors causing familial aggregation. For this purpose, the CNV was treated as a trait and modeled as a response variable. In a general sense, this formulation follows the same concept used when a molecular response, such as intensities, is treated as a trait for the estimation of its heritability (Monks et al., 2004). In addition, in our formulation the CNV was considered as a dichotomous variable: if the sample carries two copies, as expected, it is coded as 0; otherwise, it is coded as 1. Model 4.1 considers the variables age and sex as covariates.

\[ \text{CNV} \sim \mu + \text{Age} + \text{Sex} + g + \varepsilon \tag{4.1} \]

The intraclass correlation coefficient, usually called heritability in the genetic literature, varies from 0 to 1. Estimates equal or very close to 0 or 1 indicate an error in the estimation process, as found in Figure 4.23. Of the 8,794 CNVs considered in this analysis, 1,422 (16%) have an estimate of 0 and 1,141 (13%) have an estimate of 1. Excluding those values, Figure 4.24 shows that most CNVs have an intraclass correlation coefficient under 40%.

Figure 4.23: Distribution of intraclass correlation coefficient. Values obtained from all 8,794 CNVs.

Figure 4.24: Distribution of intraclass correlation coefficient. Values obtained from 6,231 CNVs.

4.9 CNVs and Height

As described in Section 2.4.1, height is a complex phenotype and its heritability is estimated to be around 80%. Due to the missing heritability, we explored this phenotype in association with CNVs. The linear mixed model defined in Equation 3.15 was applied considering height as the response variable y. First, different models were considered without using CNVs, as shown in Table 4.9. Without any covariate, the heritability of height was 54.93%. An increase in the heritability estimates can be observed when covariates are added, including age, sex and the coefficients of ancestry (PC1 and PC2) obtained by principal component analysis (De Andrade et al., 2016). Models 4 and 5 present a heritability around 83%, as expected from the literature. This increase in the h² estimates indicates that the added covariates absorb part of the variability associated with the environment (error term).

Table 4.9: Models used for heritability estimation.

Model                                            σ²g    σ²ε    Heritability (h²)
1  height ∼ µ + g + ε                            55.47  45.51  54.93%
2  height ∼ µ + Age + g + ε                      66.47  7.39   69.19%
3  height ∼ µ + Sex + g + ε                      30.61  20.09  60.37%
4  height ∼ µ + Age + Sex + g + ε                7.39   34.72  82.44%
5  height ∼ µ + Age + Sex + PC1 + PC2 + g + ε    33.88  6.75   83.38%
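The heritability column can be related to the variance components with a short R sketch. It assumes the usual definition for the polygenic mixed model, h² = σ²g / (σ²g + σ²ε), i.e., the share of the polygenic variance in the total variance; the function name `heritability` is hypothetical. The sketch uses the variance components reported for Models 1, 3 and 5 of Table 4.9, which agree with the reported heritabilities up to rounding.

```r
# Heritability as the share of the polygenic variance in the total variance
# (assumed definition: var_g / (var_g + var_e))
heritability <- function(var_g, var_e) var_g / (var_g + var_e)

# Variance components of Models 1, 3 and 5 as reported in Table 4.9
var_g <- c(model1 = 55.47, model3 = 30.61, model5 = 33.88)
var_e <- c(model1 = 45.51, model3 = 20.09, model5 = 6.75)

h2 <- heritability(var_g, var_e)
round(100 * h2, 2)
```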

Based on Table 4.9, Model 5 shows the highest heritability estimate. In addition, De Andrade et al. (2015) describe the importance of using ancestry information in association studies with family data. Therefore, we adopted this model and added the CNV as a covariate, as shown in Model 4.2. To explore the CNV effect on height, we fitted the model in three different ways, in which the CNV was considered as dichotomous, as having a linear effect and as a categorical covariate with five levels.

\[ \text{height} = \mu + \text{age} + \text{sex} + \text{PC1} + \text{PC2} + \text{CNV} + g + \varepsilon \tag{4.2} \]

To implement these models, we used the kinship2 package (Therneau and Sinnwell, 2015) from R. For the first model, the CNV was considered as a dichotomous variable: samples with two copies were coded as 0 and individuals with 0, 1, 3 or 4 copies were coded as 1. Figures 4.25 and 4.26 show the Manhattan plots obtained with this model; the first one presents −log10(p-value) and the second one the height heritability when adding the CNV.

Figure 4.25: Manhattan plot from the first model. y-axis indicates the −log10(p-value) of each CNV in association with height. The position used in the x-axis is the center position (in bp) of the CNV.

Figure 4.26: Manhattan plot from the first model. y-axis indicates the heritability of height when adding the CNV. The position used in the x-axis is the center position (in bp) of the CNV.

For the second model, the CNV was considered as a discrete variable. In this case, it is assumed that the CNV has a linear effect on height, i.e., as the number of copies increases or decreases, a corresponding linear change in height is expected. Similar to the previous model, Figures 4.27 and 4.28 show the Manhattan plots obtained from the fitted model with the CNV as covariate.

Figure 4.27: Manhattan plot from the second model. y-axis indicates the −log10(p-value) of each CNV in association with height. The position used in the x-axis is the center position (in bp) of the CNV.

Figure 4.28: Manhattan plot from the second model. y-axis indicates the heritability of height when adding the CNV. The position used in the x-axis is the center position (in bp) of the CNV.

For the third model, the CNV was considered as a categorical variable (with levels 0, 1, 2, 3 and 4), with two copies as the reference level in our formulation. Figure 4.29 shows the Manhattan plot with the p-values associated with each CNV. In this case, we selected the lowest p-value among the categories. In addition, Figure 4.30 shows the heritability of height when the CNV is added. In this model, if a category has fewer than 5 samples, those samples were excluded to avoid bias.
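The three ways of coding the CNV covariate in Model 4.2 can be illustrated in R. The copy number vector `cn` is hypothetical, and the variable names are chosen only for this sketch; fitting the mixed model itself is not shown.

```r
# Hypothetical copy numbers for eight samples at one CNV region
cn <- c(2, 2, 1, 3, 0, 2, 4, 1)

# Model 1: dichotomous (0 = two copies, 1 = any other copy number)
cnv_dichotomous <- as.integer(cn != 2)

# Model 2: discrete with linear effect (copy number used as a numeric covariate)
cnv_linear <- cn

# Model 3: categorical with two copies as the reference level
cnv_categorical <- relevel(factor(cn, levels = 0:4), ref = "2")

# Dummy columns that the categorical coding contributes to the design matrix
colnames(model.matrix(~ cnv_categorical))
```

With two copies as the reference level, the categorical coding yields one coefficient for each of the remaining categories (0, 1, 3 and 4), which is why a single p-value per CNV has to be selected among the categories.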

Figure 4.29: Manhattan plot from the third model. y-axis indicates the −log10(p-value) of each CNV in association with height. The position used in the x-axis is the center position (in bp) of the CNV.

Figure 4.30: Manhattan plot from the third model. y-axis indicates the heritability of the height when adding the CNV. The position used in the x-axis is the center position (in bp) of the CNV.

In general, adding the CNV increases the heritability, meaning that it explains part of the variability associated with the environment (error term). However, using the model with the CNV as a discrete variable (with linear effect), the CNVs from the final region of chromosome 8 were able to decrease the heritability to 82.34%, which means that they could explain a portion of the variability associated with the polygenic component. The top-20 CNVs with the most significant p-values are described in Tables 4.10, 4.11 and 4.12.

Table 4.10: The top-20 CNVs with the lowest p-values for Model 4.2 with the CNV as a dichotomous covariate.

Chromosome  Start        End          Coefficient  p-Value
9           78,960,219   78,961,487   −3.632       0.0001
9           78,961,488   78,967,224   −3.343       0.0001
4           70,266,714   70,270,740   −2.447       0.001
1           72,584,493   72,585,028   1.845        0.002
1           107,031,633  107,032,253  −3.084       0.002
15          32,560,573   32,564,059   −1.507       0.002
15          32,534,636   32,545,464   −1.490       0.003
15          32,545,465   32,550,579   −1.499       0.003
15          32,527,165   32,533,011   −1.495       0.003
9           44,153,644   44,158,278   2.096        0.003
10          54,603,595   54,606,646   −2.266       0.003
13          67,047,580   67,048,966   −2.304       0.003
15          32,533,012   32,534,635   −1.479       0.003
13          67,048,967   67,049,132   −2.253       0.003
15          32,518,395   32,518,577   −1.489       0.003
15          32,564,060   32,566,300   −1.471       0.003
15          32,518,578   32,527,164   −1.482       0.004
10          52,681,854   52,682,015   −2.007       0.004
9           44,132,217   44,141,253   2.072        0.004
15          32,550,580   32,560,572   −1.442       0.004

Table 4.11: The top-20 CNVs with the lowest p-values for Model 4.2 with the CNV with linear effect.

Chromosome  Start        End          Coefficient  p-Value
9           78,960,219   78,961,487   −3.556       7.3e-05
9           78,961,488   78,967,224   −3.038       2e-04
7           61,259,336   61,261,504   2.932        0.00026
7           61,292,763   61,311,022   3.009        5e-04
7           61,261,505   61,292,762   3.001        0.00055
7           61,311,023   61,365,842   3.389        0.00058
7           61,479,242   61,483,741   2.884        0.00086
7           61,067,917   61,070,238   2.635        0.0011
7           61,476,931   61,479,241   2.685        0.0012
1           107,031,633  107,032,253  3.071        0.0015
7           61,070,239   61,079,976   2.616        0.0015
16          14,961,383   14,961,859   −2.299       0.0018
4           34,495,607   34,497,735   −0.877       0.002
3           164,076,458  164,079,827  0.635        0.0021
7           109,228,154  109,228,159  1.161        0.0021
19          48,394,874   48,404,090   1.502        0.0021
3           164,079,828  164,081,881  0.631        0.0022
3           164,004,318  164,007,552  0.607        0.0023
3           164,004,187  164,004,317  0.601        0.0025
3           164,064,163  164,069,972  0.623        0.0025

(a) (b)

Figure 4.31: Distribution of the height. These distributions are based on the number of copies for the regions from 78,960,219bp to 78,961,487bp (a) and from 78,961,488bp to 78,967,224bp (b) of chromosome 9.

Based on these results, the region from 78,960,219bp to 78,967,224bp in chromosome 9

Table 4.12: The top-20 CNVs with the lowest p-values for Model 4.2 with the CNV as a categorical covariate.

Chromosome  Start       End         Factor  Coefficient  p-Value
9           78,961,488  78,967,224  CNV3    −3.412       9.9e-05
9           78,960,219  78,961,487  CNV3    −3.524       0.00019
11          50,691,613  50,703,802  CNV0    −3.486       0.00095
11          55,178,916  55,185,983  CNV1    −1.805       0.0011
11          55,185,984  55,186,549  CNV1    −1.806       0.0011
19          48,445,351  48,448,077  CNV0    −5.108       0.0012
19          48,394,874  48,404,090  CNV0    −4.699       0.0013
8           7,174,057   7,181,231   CNV1    −0.647       0.0014
10          54,603,595  54,606,646  CNV1    −2.955       0.0014
19          48,404,091  48,405,102  CNV0    −4.647       0.0014
19          48,405,103  48,405,574  CNV0    −4.640       0.0015
19          48,405,575  48,407,869  CNV0    −4.625       0.0015
10          54,606,647  54,609,159  CNV1    −2.872       0.0016
19          48,407,870  48,413,392  CNV0    −4.663       0.0016
15          20,084,762  20,089,383  CNV1    −1.496       0.0017
9           44,153,644  44,158,278  CNV3    1.170        0.0018
19          48,413,393  48,432,805  CNV0    −4.601       0.0018
19          48,432,806  48,445,350  CNV0    −4.607       0.0018
8           7,174,043   7,174,056   CNV1    −0.562       0.0019
15          20,073,123  20,084,266  CNV1    −1.482       0.0019

is among the CNVs with the lowest p-values for the three models. This region includes introns and exons of the gene PCSK5, which encodes a member of the subtilisin-like proprotein convertase family (UCSC Genome Browser on Human). Also, the results indicate that the presence of a duplication decreases the height by approximately 3cm, as can be seen in Figures 4.31a and 4.31b. Table 4.13 shows the number of individuals with normal genotype and with CNVs; at most 2 individuals have a double duplication. These CNVs increased the heritability associated with height, which means they explained a portion of the variability associated with the error component (environment).

Table 4.13: Number of individuals with normal genotype, duplication and double duplication for the two regions of chromosome 9.

Chromosome  Start       End         Normal  Duplication  Double Duplication
9           78,960,219  78,961,487  873     36           1
9           78,961,488  78,967,224  866     42           2

The Ensembl project describes a larger duplication region (chromosome 9: 78,879,922bp-79,031,998bp), which contains the duplication found in samples from the Baependi Heart Study. The documented region is present in 0.1% of the Asian population and in 0.48% of the Chinese population, but no association study was performed based on these data. In addition, the Ensembl project describes a SNP associated with height (rs4416887) located in the same gene (PCSK5) where our CNV is found.

Chapter 5

Final Considerations

The use of CNVs in genome-wide association studies is an important step towards the comprehension of complex diseases and traits. In this work, we explored the statistical and computational tools for CNV calling from SNP array data and for its analysis. The application of these techniques to the dataset from the Baependi Heart Study allows us to characterize CNVs of the Brazilian population. The CNV database created during this work can be used for association studies with different phenotypes, although we only analyzed the association with height.

From the descriptive analysis of the CNVs obtained from Brazilian data, we could observe that the distribution of the length of CNVs and of the number of CNVs per sample are similar to other works described in the literature, but specific CNV regions were also identified. However, the results may depend on the use of PennCNV, which could bias them.

Works considering CNVs in the Brazilian population are still scarce, and the ones we found are specific to target genes or genomic regions. During this work, we identified minimal regions in which a relevant part of the population has a deletion or duplication. These regions could be considered as genetic markers specific to the Brazilian population. For example, 4,055 CNVs were present in at least 5% of the samples.

Given that the Baependi Heart Study includes samples with family structure, we were able to explore the transmission of CNVs (deletions/duplications) in trios. We found that the frequency of trios in which parents transmit the CNV is lower than the frequency of trios in which parents transmit the normal number of copies. Further, de novo CNVs were also identified. These scenarios indicate that CNV regions may not follow the Mendelian laws.

The association of CNVs with height was performed using linear mixed models, in which we included sex, age and ancestry coefficients as covariates. Three models were considered for the parameterization of the CNV effect, in which the only difference lies in the type of CNV data (categorical with five levels, dichotomous and discrete). All models indicated that a duplication in a defined region of chromosome 9 could decrease the expected height by 3cm.

This work aimed to understand and set up a methodology to identify CNVs from SNP intensity data, to characterize the CNVs detected in the Baependi Heart Study and to associate them with height. Further work can be performed based on the CNV database obtained. For a better understanding of CNVs in the Brazilian population, the common CNVs identified can be annotated. As the Baependi Heart Study includes different information about the samples, new studies could be performed to associate those CNVs with other phenotypes using the scripts developed in this work. Different models can also be used, including new covariates, such as groups of CNVs instead of analyzing them separately as we did, as well as SNP genotypes.

Appendix A

Quantile Normalization

The following R code exemplifies the quantile normalization:

• Generate two random vectors of length 10:

    a = round(runif(10, min = 0, max = 10))
    b = round(runif(10, min = 5, max = 10))

For this example:
a is (0, 4, 3, 8, 5, 1, 2, 6, 5, 10)
b is (10, 8, 9, 9, 8, 9, 9, 9, 7, 9)

• Show plot of the vectors and the quantile-quantile plot:

    plot(a, b, pch = "x", xlim = c(0, 10), ylim = c(0, 10), xlab = "Vector A", ylab = "Vector B", main = "Raw vector values")
    lines(0:10, 0:10)

    # qqplot() sorts its arguments; sort() is applied explicitly to match the axis labels
    qqplot(sort(a), sort(b), pch = "x", xlim = c(0, 10), ylim = c(0, 10), xlab = "Sorted vector A", ylab = "Sorted vector B", main = "Quantile-quantile plot")
    lines(0:10, 0:10)

(a) Raw values. (b) Quantile-quantile plot.

Figure A.1: Raw data in a regular plot (a) and in a quantile-quantile plot (b).

• Calculate the mean of each row, replace the values of each row with the respective mean and draw the quantile-quantile plot of the new vectors:

    e = cbind(sort(a), sort(b))   # sorted vectors side by side
    f = apply(e, 1, mean)         # mean of each quantile (row)

    # rank() gives the position of each value in the sorted vector,
    # so f[rank(.)] places the means back in the original order
    qqplot(f[rank(a, ties.method = "first")], f[rank(b, ties.method = "first")], pch = "x", xlim = c(0, 10), ylim = c(0, 10), xlab = "Normalized vector A", ylab = "Normalized vector B", main = "Quantile-quantile plot")
    lines(0:10, 0:10)

Figure A.2: New quantile-quantile plot.
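The steps of this example can be collected into a single function. The block below is a minimal sketch, not the routine used in this work: quantile_normalize is a hypothetical helper name, the input reuses the two vectors of the example, and ties are broken by order of appearance (ties.method = "first"), an assumption made only to keep the code short.

```r
# Vectors from the example above.
a <- c(0, 4, 3, 8, 5, 1, 2, 6, 5, 10)
b <- c(10, 8, 9, 9, 8, 9, 9, 9, 7, 9)

# Hypothetical helper: quantile-normalize the columns of a numeric matrix.
quantile_normalize <- function(m) {
  # Mean of the i-th smallest values across columns (the reference distribution).
  target <- rowMeans(apply(m, 2, sort))
  # Replace each value by the reference value of its rank within its own column.
  apply(m, 2, function(col) target[rank(col, ties.method = "first")])
}

norm <- quantile_normalize(cbind(a, b))

# After normalization both columns hold the same set of values.
stopifnot(isTRUE(all.equal(sort(norm[, "a"]), sort(norm[, "b"]))))
```

Because every column is mapped onto the same reference values, the quantile-quantile plot of the normalized columns lies on the identity line.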

• The vectors with the new values are placed back in the original positions, that is, f[rank(a, ties.method = "first")] and f[rank(b, ties.method = "first")]. For this example:
a is (3.5, 6.5, 6.0, 8.5, 7.0, 4.5, 5.0, 7.5, 7.0, 10.0)
b is (10.0, 4.5, 6.0, 6.5, 5.0, 7.0, 7.0, 7.5, 3.5, 8.5)

Appendix B

Median Polish

The following R code exemplifies the median polish procedure:

• Create a matrix to be decomposed and set the values as defined in Step 1 in Section 3.3.2:

    x = matrix(c(2, 1, 0, 1, 1, 5, 1, 4, 2, 5, 1, 4, 3, 2, 2, 2, 4, 3, 3, 3), ncol = 4, byrow = TRUE)
    alpha = 0
    r = rep(0, nrow(x))
    c = rep(0, ncol(x))
    delta = 0
    x_prev = 0

For this example:

\[
X = \begin{pmatrix}
2.00 & 1.00 & 0.00 & 1.00 \\
1.00 & 5.00 & 1.00 & 4.00 \\
2.00 & 5.00 & 1.00 & 4.00 \\
3.00 & 2.00 & 2.00 & 2.00 \\
4.00 & 3.00 & 3.00 & 3.00
\end{pmatrix}
\]

α is 0
r is (0, 0, 0, 0, 0)
c is (0, 0, 0, 0)
δ is 0

• Steps 2 and 3 described in Section 3.3.2:

    medians = apply(x, 1, median)   # row medians
    x_prev = x
    x = x - medians                 # recycling subtracts medians[i] from row i

    r = r + medians
    delta = median(c)
    c = c - delta
    alpha = alpha + delta

The new matrix X and values:

\[
X = \begin{pmatrix}
1.00 & 0.00 & -1.00 & 0.00 \\
-1.50 & 2.50 & -1.50 & 1.50 \\
-1.00 & 2.00 & -2.00 & 1.00 \\
1.00 & 0.00 & 0.00 & 0.00 \\
1.00 & 0.00 & 0.00 & 0.00
\end{pmatrix}
\]

α is 0
r is (1, 2.5, 3, 2, 3)
c is (0, 0, 0, 0)
δ is 0

• Steps 4 and 5 described in Section 3.3.2:

    medians = apply(x, 2, median)   # column medians
    x_prev = x
    x = t(t(x) - medians)           # subtracts medians[j] from column j

    c = c + medians
    delta = median(r)
    r = r - delta
    alpha = alpha + delta

The new matrix X and values:

\[
X = \begin{pmatrix}
0.00 & 0.00 & 0.00 & 0.00 \\
-2.50 & 2.50 & -0.50 & 1.50 \\
-2.00 & 2.00 & -1.00 & 1.00 \\
0.00 & 0.00 & 1.00 & 0.00 \\
0.00 & 0.00 & 1.00 & 0.00
\end{pmatrix}
\]

α is 2.5
r is (-1.5, 0, 0.5, -0.5, 0.5)
c is (1, 0, -1, 0)
δ is 2.5

• Step 6 described in Section 3.3.2:

    sum(x)
    sum(abs(x_prev - x))

As the outputs are 3 and 10, respectively, we re-run the scripts from the second, third and fourth boxes. After this procedure, sum(abs(x_prev - x)) = 0. Therefore, the final values are:

α is 3
r is (-2, 0, 0, -1, 0)
c is (1, 0, -1, 0)

\[
X = \begin{pmatrix}
0.00 & 0.00 & 0.00 & 0.00 \\
-3.00 & 2.00 & -1.00 & 1.00 \\
-2.00 & 2.00 & -1.00 & 1.00 \\
0.00 & 0.00 & 1.00 & 0.00 \\
0.00 & 0.00 & 1.00 & 0.00
\end{pmatrix}
\]

This means that, for any value in the original X, we can obtain its decomposition into the overall effect, the row effect, the column effect and the residual (the corresponding entry of the final matrix X, written here as ε). For example:

\[
x_{22} = 5 = \alpha + r_2 + c_2 + \varepsilon_{22} = 3 + 0 + 0 + 2
\]
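Re-running the boxes by hand can be replaced by a loop. The block below is a minimal sketch of that idea, assuming the same matrix as above: median_polish is a hypothetical helper name, max_iter is an arbitrary safety limit, and the stopping rule compares the residuals before and after a full pass (row sweep plus column sweep), which is a slight variation of the check in Step 6. The last lines rebuild the original matrix from the four components.

```r
# Same matrix as in the example above.
x <- matrix(c(2, 1, 0, 1, 1, 5, 1, 4, 2, 5, 1, 4, 3, 2, 2, 2, 4, 3, 3, 3),
            ncol = 4, byrow = TRUE)

# Hypothetical helper: repeat the row and column sweeps until nothing changes.
median_polish <- function(x, max_iter = 10) {
  res <- x                      # residuals, updated in place
  alpha <- 0                    # overall effect
  row_eff <- rep(0, nrow(x))
  col_eff <- rep(0, ncol(x))
  for (iter in seq_len(max_iter)) {
    prev <- res
    # Row sweep: remove row medians, then recentre the column effects.
    m <- apply(res, 1, median)
    res <- res - m              # recycling subtracts m[i] from row i
    row_eff <- row_eff + m
    delta <- median(col_eff)
    col_eff <- col_eff - delta
    alpha <- alpha + delta
    # Column sweep: remove column medians, then recentre the row effects.
    m <- apply(res, 2, median)
    res <- sweep(res, 2, m)     # subtracts m[j] from column j
    col_eff <- col_eff + m
    delta <- median(row_eff)
    row_eff <- row_eff - delta
    alpha <- alpha + delta
    if (sum(abs(prev - res)) == 0) break
  }
  list(overall = alpha, row = row_eff, col = col_eff, residuals = res)
}

fit <- median_polish(x)

# Decomposition check: overall + row effect + column effect + residual gives x back.
recon <- fit$overall + outer(fit$row, fit$col, "+") + fit$residuals
stopifnot(isTRUE(all.equal(recon, x)))
```

For this matrix the loop stops at the values listed above (overall effect 3, row effects (-2, 0, 0, -1, 0), column effects (1, 0, -1, 0)).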

• For comparison, the R function for median polish is used and, as expected, the values are the same:

    > medpolish(x)
    1: 15
    Final: 15

    Median Polish Results (Dataset: "x")

    Overall: 3

    Row Effects:
    [1] -2  0  0 -1  0

    Column Effects:
    [1]  1  0 -1  0

    Residuals:
         [,1] [,2] [,3] [,4]
    [1,]    0    0    0    0
    [2,]   -3    2   -1    1
    [3,]   -2    2   -1    1
    [4,]    0    0    1    0
    [5,]    0    0    1    0

Appendix C

Minimal Regions

The following R code exemplifies the procedure to find the minimal regions with the respective copy number for each sample:

• Load Bioconductor and CNTools, and define a function that converts the copy number state into the copy number value:

    source("http://bioconductor.org/biocLite.R")
    library("CNTools")
    library(stargazer)

    # Map states to copy numbers: 1 -> 0, 2 -> 1, -1 (auxiliary region) -> 2, 5 -> 3, 6 -> 4
    arruma = function(x) {
      x[which(x == 1)] = 0
      x[which(x == 2)] = 1
      x[which(x == -1)] = 2
      x[which(x == 5)] = 3
      x[which(x == 6)] = 4
      return(x)
    }

• Define the samples and a fictional PennCNV output:

    samples = LETTERS[1:6]

    a = c("A", 5, 10, 25, 25 - 10, 5)
    a = rbind(a, c("A", 5, 40, 55, 15, 2))
    b = c("B", 5, 15, 35, 20, 5)
    c = c("C", 5, 5, 45, 40, 5)
    e = c("E", 5, 5, 20, 15, 6)
    f = c("F", 5, 45, 55, 10, 1)

    aux = as.data.frame(rbind(a, b, c, e, f))
    aux
    colnames(aux) = c("ID", "chrom", "loc.start", "loc.end", "num.mark", "seg.mean")
    aux$ID = as.character(aux$ID)
    aux$chrom = as.numeric(as.character(aux$chrom))
    aux$loc.start = as.numeric(as.character(aux$loc.start))
    aux$loc.end = as.numeric(as.character(aux$loc.end))
    aux$num.mark = as.numeric(as.character(aux$num.mark))
    aux$seg.mean = as.numeric(as.character(aux$seg.mean))

    > aux
      ID chrom loc.start loc.end num.mark seg.mean
    a  A     5        10      25       15        5
       A     5        40      55       15        2
    b  B     5        15      35       20        5
    c  C     5         5      45       40        5
    e  E     5         5      20       15        6
    f  F     5        45      55       10        1
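The fictional PennCNV output above is enough to illustrate what a minimal region is: the chromosome is cut at every segment boundary, and each resulting piece receives one copy number per sample. The block below is a minimal base-R sketch of that idea and not the CNTools implementation. It assumes a single chromosome, no overlapping segments within a sample, and boundaries treated as half-open intervals (a segment covers a region when it starts before the region and ends at or after its end), a convention chosen here because it reproduces the regions of this toy example. The names state_to_cn, regions and minimal are hypothetical.

```r
# Toy PennCNV-like segments from the example above (state = value stored in seg.mean).
seg <- data.frame(
  ID        = c("A", "A", "B", "C", "E", "F"),
  loc.start = c(10, 40, 15, 5, 5, 45),
  loc.end   = c(25, 55, 35, 45, 20, 55),
  state     = c(5, 2, 5, 5, 6, 1),
  stringsAsFactors = FALSE
)
samples <- LETTERS[1:6]
state_to_cn <- c("1" = 0, "2" = 1, "5" = 3, "6" = 4)  # PennCNV state -> copy number

# Breakpoints: position 1 plus every segment start and end.
bp <- sort(unique(c(1, seg$loc.start, seg$loc.end)))
regions <- data.frame(start = c(bp[1], bp[-c(1, length(bp))] + 1),
                      end   = bp[-1])

# Every sample starts with the normal copy number (2) in every region.
cn <- matrix(2, nrow = nrow(regions), ncol = length(samples),
             dimnames = list(NULL, samples))
for (j in seq_len(nrow(seg))) {
  covered <- seg$loc.start[j] < regions$start & seg$loc.end[j] >= regions$end
  cn[covered, seg$ID[j]] <- state_to_cn[[as.character(seg$state[j])]]
}
minimal <- cbind(regions, cn)
minimal
```

Sample D has no segment in the input, so it keeps copy number 2 in all nine regions.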

• Define a data.frame with the same values, adding an auxiliary region for each sample from position 1 to the last possible position:

    n = length(samples)
    aux <- data.frame(c(aux$ID, samples), c(aux$chrom, rep(unique(aux$chrom), n)),
                      c(aux$loc.start, rep(1, n)), c(aux$loc.end, rep(max(aux$loc.end), n)),
                      c(aux$num.mark, rep(1, n)), c(aux$seg.mean, rep(-1, n)))
    colnames(aux) = c("ID", "chrom", "loc.start", "loc.end", "num.mark", "seg.mean")
    aux

    > aux
       ID chrom loc.start loc.end num.mark seg.mean
    1   A     5        10      25       15        5
    2   A     5        40      55       15        2
    3   B     5        15      35       20        5
    4   C     5         5      45       40        5
    5   E     5         5      20       15        6
    6   F     5        45      55       10        1
    7   A     5         1      55        1       -1
    8   B     5         1      55        1       -1
    9   C     5         1      55        1       -1
    10  D     5         1      55        1       -1
    11  E     5         1      55        1       -1
    12  F     5         1      55        1       -1

• Find the minimal regions and assign copy numbers for each sample:

    aux$ID = as.character(aux$ID)
    str(aux)
    head(aux)
    dim(aux)

    seg = CNSeg(aux)
    seg
    rsByregion = getRS(seg, by = "region", imput = TRUE, XY = FALSE, what = "max")
    cnv = rs(rsByregion)
    cnv

    dim(cnv)
    cnv = cbind(cnv[, 1:3], apply(cnv[, -c(1:3)], 2, arruma))
    cnv

    > cnv
      chrom start end A B C D E F
    1     5     1   5 2 2 2 2 2 2
    2     5     6  10 2 2 3 2 4 2
    3     5    11  15 3 2 3 2 4 2
    4     5    16  20 3 3 3 2 4 2
    5     5    21  25 3 3 3 2 2 2
    6     5    26  35 2 3 3 2 2 2
    7     5    36  40 2 2 3 2 2 2
    8     5    41  45 1 2 3 2 2 2
    9     5    46  55 1 2 2 2 2 0

Appendix D

CNV Filter

The following R code exemplifies the procedure to remove CNVs with no or few mutations, that is, regions in which (almost) all samples share the same copy number:

• Load the functions that count the number of samples of each group (CN = 0, ..., 4) and the function that counts the number of valid groups:

    # Each function receives one row of cnv; the first three columns (chrom, start, end) are skipped
    cont_0 = function(x) {
      length(which(x[-c(1, 2, 3)] == 0))
    }

    cont_1 = function(x) {
      length(which(x[-c(1, 2, 3)] == 1))
    }

    cont_2 = function(x) {
      length(which(x[-c(1, 2, 3)] == 2))
    }

    cont_3 = function(x) {
      length(which(x[-c(1, 2, 3)] == 3))
    }

    cont_4 = function(x) {
      length(which(x[-c(1, 2, 3)] == 4))
    }

    # Number of copy number groups that contain at least a fraction mcf of the samples
    # (mcf is a global variable, defined in the next step)
    grupos = function(a) {
      t = sum(a)
      g = 5

      if (a[1] < t * mcf)
        g = g - 1
      if (a[2] < t * mcf)
        g = g - 1
      if (a[3] < t * mcf)
        g = g - 1
      if (a[4] < t * mcf)
        g = g - 1
      if (a[5] < t * mcf)
        g = g - 1

      return(g)
    }
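The five counting functions and grupos above can be folded into one short function. The block below is a minimal sketch of an equivalent formulation: count_groups is a hypothetical helper name, and it takes the copy numbers of one region without the chrom, start and end columns, which is an assumption about how it would be called. The three rows are regions taken from the Appendix C example.

```r
# Hypothetical compact equivalent of cont_0, ..., cont_4 followed by grupos.
# cn_row: copy numbers of all samples in one minimal region.
count_groups <- function(cn_row, mcf, levels = 0:4) {
  counts <- table(factor(cn_row, levels = levels))  # samples per copy number
  sum(counts >= mcf * sum(counts))                  # groups reaching the minimum frequency
}

# Three regions from the Appendix C example (samples A to F).
cn <- rbind(c(2, 2, 2, 2, 2, 2),
            c(2, 2, 3, 2, 4, 2),
            c(1, 2, 2, 2, 2, 0))
groups <- apply(cn, 1, count_groups, mcf = 0.02)
groups
```

A region with a single group, such as the first row, carries no variation among the samples and is the kind of region the filter removes.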

• Load the dataset (we use the example from Appendix C) and define the mcf:

    > cnv
      chrom start end A B C D E F
    1     5     1   5 2 2 2 2 2 2
    2     5     6  10 2 2 3 2 4 2
    3     5    11  15 3 2 3 2 4 2
    4     5    16  20 3 3 3 2 4 2
    5     5    21  25 3 3 3 2 2 2
    6     5    26  35 2 3 3 2 2 2
    7     5    36  40 2 2 3 2 2 2
    8     5    41  45 1 2 3 2 2 2
    9     5    46  55 1 2 2 2 2 0

    mcf = 0.02

• Calculate the number of groups of each CNV and remove the ones with only one group:

    mut_0 = apply(cnv, 1, cont_0)
    mut_1 = apply(cnv, 1, cont_1)
    mut_2 = apply(cnv, 1, cont_2)
    mut_3 = apply(cnv, 1, cont_3)
    mut_4 = apply(cnv, 1, cont_4)

    mutations = cbind(mut_0, mut_1, mut_2, mut_3, mut_4)

    dim(cnv)
    dim(mutations)

    groupCNV = apply(mutations, 1, grupos)

    summary(as.factor(groupCNV))

    exclude = which(groupCNV == 1)
    # Guard against an empty index: cnv[-integer(0), ] would drop every row
    if (length(exclude) > 0) cnv = cnv[-exclude, ]
    dim(cnv)

    > cnv
      chrom start end A B C D E F
    2     5     6  10 2 2 3 2 4 2
    3     5    11  15 3 2 3 2 4 2
    4     5    16  20 3 3 3 2 4 2
    5     5    21  25 3 3 3 2 2 2
    6     5    26  35 2 3 3 2 2 2
    7     5    36  40 2 2 3 2 2 2
    8     5    41  45 1 2 3 2 2 2
    9     5    46  55 1 2 2 2 2 0

Appendix E

Pedigree and Kinship Matrix

The following R code exemplifies how to obtain the pedigree of a given family and its respective kinship matrix:

• Load the kinship2 library in R (Therneau and Sinnwell, 2015) and create the family data (this is a hypothetical example):

    require(kinship2)

    exemplo = cbind(c(1:9), c(0, 0, 0, 2, 2, 2, 0, 3, 7), c(0, 0, 0, 1, 1, 1, 0, 4, 6), c(1, 0, 0, 1, 0, 1, 0, 1, 0), rep(1, 9))
    exemplo = as.data.frame(exemplo)
    colnames(exemplo) = c("ID", "PAT", "MAT", "SEX", "FID")
    exemplo

      ID PAT MAT SEX FID
    1  1   0   0   1   1
    2  2   0   0   0   1
    3  3   0   0   0   1
    4  4   2   1   1   1
    5  5   2   1   0   1
    6  6   2   1   1   1
    7  7   0   0   0   1
    8  8   3   4   1   1
    9  9   7   6   0   1

• Create and show the family pedigree in a genogram (Figure E.1):

    pedig = pedigree(id = exemplo$ID, dadid = exemplo$PAT, momid = exemplo$MAT,
                     sex = exemplo$SEX, famid = exemplo$FID, missid = 0)

    par(bg = NA)
    plot(pedig['1'], main = "Family Pedigree")

Figure E.1: Genogram corresponding to the family data.
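For a pedigree this small, the kinship coefficients can also be obtained directly from their recursive definition, which may help when reading a kinship matrix. The block below is a minimal sketch under stated assumptions and not the algorithm implemented in kinship2: kinship_recursive is a hypothetical helper, IDs are assumed to be 1, ..., n, every parent is assumed to have a smaller ID than its children, and individuals with a missing parent are treated as unrelated founders.

```r
# Family data from the example above (0 = parent not in the pedigree).
ped <- data.frame(ID  = 1:9,
                  PAT = c(0, 0, 0, 2, 2, 2, 0, 3, 7),
                  MAT = c(0, 0, 0, 1, 1, 1, 0, 4, 6))

# Hypothetical helper: kinship matrix from the recursive definition.
kinship_recursive <- function(ped) {
  n <- nrow(ped)
  K <- matrix(0, n, n, dimnames = list(ped$ID, ped$ID))
  for (i in seq_len(n)) {
    f <- ped$PAT[i]
    m <- ped$MAT[i]
    founder <- f == 0 || m == 0
    # Self-kinship: 1/2 * (1 + kinship between the two parents).
    K[i, i] <- if (founder) 0.5 else 0.5 * (1 + K[f, m])
    # Kinship with each earlier individual j: average over the parents of i.
    for (j in seq_len(i - 1)) {
      K[i, j] <- if (founder) 0 else 0.5 * (K[f, j] + K[m, j])
      K[j, i] <- K[i, j]
    }
  }
  K
}

K <- kinship_recursive(ped)
K
```

For instance, individuals 8 and 9 are first cousins (their parents 4 and 6 are full siblings), which gives a coefficient of 1/16.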

• Create kinship matrix:

    kmat = kinship(pedig)
    kmat

    > kmat
    9 x 9 sparse Matrix of class "dsCMatrix"
      1     2     3    4     5     6     7    8      9
    1 0.500 .     .    0.250 0.250 0.250 .    0.1250 0.1250
    2 .     0.500 .    0.250 0.250 0.250 .    0.1250 0.1250
    3 .     .     0.50 .     .     .     .    0.2500 .
    4 0.250 0.250 .    0.500 0.250 0.250 .    0.2500 0.1250
    5 0.250 0.250 .    0.250 0.500 0.250 .    0.1250 0.1250
    6 0.250 0.250 .    0.250 0.250 0.500 .    0.1250 0.2500
    7 .     .     .    .     .     .     0.50 .      0.2500
    8 0.125 0.125 0.25 0.250 0.125 0.125 .    0.5000 0.0625
    9 0.125 0.125 .    0.125 0.125 0.250 0.25 0.0625 0.5000

Appendix F

Frequency of CNVs by Chromosome

(a) Chromosome 1 (b) Chromosome 2 (c) Chromosome 3

(d) Chromosome 4 (e) Chromosome 5 (f) Chromosome 6

Figure F.1: Frequency of CNVs per region after finding the minimal regions. The x-axis represents the positions of the chromosome by base-pairs. The y-axis indicates the number of CNVs detected in the respective position for 910 samples.


(a) Chromosome 7 (b) Chromosome 8 (c) Chromosome 9

(d) Chromosome 10 (e) Chromosome 11 (f) Chromosome 12

(g) Chromosome 13 (h) Chromosome 14 (i) Chromosome 15

Figure F.2: Frequency of CNVs per region after finding the minimal regions. The x-axis represents the positions of the chromosome by base-pairs. The y-axis indicates the number of CNVs detected in the respective position for 910 samples.

(a) Chromosome 16 (b) Chromosome 17 (c) Chromosome 18

(d) Chromosome 19 (e) Chromosome 20 (f) Chromosome 21

(g) Chromosome 22

Figure F.3: Frequency of CNVs per region after finding the minimal regions. The x-axis represents the positions of the chromosome by base-pairs. The y-axis indicates the number of CNVs detected in the respective position for 910 samples.

Appendix G

Proportion of CNV occurrences in trios

The script to identify the trios among all the samples and the CNV occurrences in them can be summarized as follows:

• Create an example of family with 4 individuals and the CNV information.

    samples = rbind(c(123, 124, 125),
                    c(124, 0, 0),
                    c(125, 0, 0),
                    c(122, 124, 125))
    colnames(samples) = c("ID", "FA", "MO")
    samples = as.data.frame(samples)

    cnv = cbind(c(1, 1, 2, 2),
                c(2, 2, 2, 2),
                c(3, 2, NA, 2))
    colnames(cnv) = c("CN1", "CN2", "CN3")
    cnv = as.data.frame(cnv)

• For each CNV, define all the possible combinations of copy numbers in a trio. For each individual, check its copy number and the copy numbers of the parents. If everyone in the trio has a number of copies (i.e., the value is not NA for the three members), add one to the respective combination.

    for (k in 1:ncol(cnv)) {
      cn = as.numeric(cnv[, k])
      cn = cbind(samples, cn)

      # All 125 ordered combinations (parent 1, parent 2, offspring) and their counts
      comb = expand.grid(c(0:4), c(0:4), c(0:4))
      comb = data.frame(cbind(comb, rep(0, 125)))
      comb[, 4] = as.numeric(as.character(comb[, 4]))
      colnames(comb) = c("P1", "P2", "OF", "CN")

      for (i in 1:nrow(cn)) {
        if (is.na(cn[i, "cn"])) {
          print("no cnv information")
          next
        }

        if (cn[i, "FA"] != 0 & cn[i, "MO"] != 0) {
          fa = cn[which(cn[, 1] == cn[i, "FA"]), "cn"]   # father's copy number
          mo = cn[which(cn[, 1] == cn[i, "MO"]), "cn"]   # mother's copy number
          if (is.na(fa) | is.na(mo)) {
            next
          } else {
            a = t(as.matrix(c(fa, mo, cn[i, "cn"])))
            colnames(a) = colnames(comb)[-4]
            hit = which(apply(comb, 1, function(x) identical(x[1:3], a[1, ])))
            comb[hit, 4] = comb[hit, 4] + 1
          }
        } else {
          print("no parents")
          next
        }
      }
      # (the loop over CNVs is closed in the next box)

• As we are not interested in distinguishing mother from father, we merge the combinations that are equivalent, for example, 122 and 212 (one parent has a deletion and the other parent and the offspring are normal).

      for (i in 1:75) {
        # Same combination with the two parents swapped
        a = t(as.matrix(as.numeric(c(rev(comb[i, 1:2]), comb[i, 3]))))
        colnames(a) = c("P1", "P2", "OF")
        mirror = which(apply(comb, 1, function(x) identical(x[1:3], a[1, ])))
        if (mirror != i) {
          comb[i, 4] = comb[i, 4] + comb[mirror, 4]
          comb = comb[-mirror, ]
        }
      }

      # One column of counts per CNV
      if (k == 1) {
        trios = comb
      } else {
        trios = cbind(trios, comb[, 4])
      }

    }
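The counting can also be written without enumerating the 125 combinations. The block below is a minimal sketch of that alternative, assuming the same toy family as above: count_trios is a hypothetical helper, and each complete trio is labelled with the larger parental copy number first, then the smaller one, then the offspring, so that the two parents are interchangeable by construction. Combinations that never occur are simply absent from the result instead of being reported as zero.

```r
# Toy family from the example above: FA/MO give the parents' IDs (0 = founder).
samples <- data.frame(ID = c(123, 124, 125, 122),
                      FA = c(124, 0, 0, 124),
                      MO = c(125, 0, 0, 125))
cnv <- data.frame(CN1 = c(1, 1, 2, 2),
                  CN2 = c(2, 2, 2, 2),
                  CN3 = c(3, 2, NA, 2))

# Hypothetical helper: count trio configurations for one CNV.
count_trios <- function(samples, cn) {
  fa <- cn[match(samples$FA, samples$ID)]  # father's copy number (NA for founders)
  mo <- cn[match(samples$MO, samples$ID)]  # mother's copy number (NA for founders)
  ok <- !is.na(fa) & !is.na(mo) & !is.na(cn)
  # Label: larger parental copy number, smaller parental copy number, offspring.
  key <- paste0(pmax(fa, mo)[ok], pmin(fa, mo)[ok], cn[ok])
  table(key)
}

res <- lapply(cnv, function(cn) count_trios(samples, cn))
res
```

In this toy family the third CNV yields no complete trio, because the mother (ID 125) has no copy number information.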

This procedure was performed for all CNVs of all 22 chromosomes. The mean relative frequency of each of the 75 combinations of CNV occurrences for each chromosome is described in Tables G.1, G.2, G.3, G.4 and G.5.

Table G.1: Relative frequency of occurrences of CNVs in trios. The columns indicate the number of copies for parents (first two digits) and offspring (third digit) and rows indicate the relative frequency (%) of occurrences of this formation in 30 CNVs of each chromosome. For example, column "210" means that one parent is normal (2 copies), the other one has a deletion (1 copy) and the offspring has a double deletion (0 copy) and, for chromosome 1, on average, 0.07% of the trios have this formation.
Columns: 000, 100, 200, 300, 400, 110, 210, 310, 410, 220, 320, 420, 330, 430, 440. Rows: mean, chromosomes 1 to 22, standard deviation.

Table G.2: Relative frequency of occurrences of CNVs in trios, with the same layout as Table G.1.
Columns: 001, 101, 201, 301, 401, 111, 211, 311, 411, 221, 321, 421, 331, 431, 441. Rows: mean, chromosomes 1 to 22, standard deviation.

Table G.3: Relative frequency of occurrences of CNVs in trios, with the same layout as Table G.1.
Columns: 002, 102, 202, 302, 402, 112, 212, 312, 412, 222, 322, 422, 332, 432, 442. Rows: mean, chromosomes 1 to 22, standard deviation.

Table G.4: Relative frequency of occurrences of CNVs in trios, with the same layout as Table G.1.
Columns: 003, 103, 203, 303, 403, 113, 213, 313, 413, 223, 323, 423, 333, 433, 443. Rows: mean, chromosomes 1 to 22, standard deviation.

Table G.5: Relative frequency of occurrences of CNVs in trios, with the same layout as Table G.1.
Columns: 004, 104, 204, 304, 404, 114, 214, 314, 414, 224, 324, 424, 334, 434, 444. Rows: mean, chromosomes 1 to 22, standard deviation.

Appendix H

IDs and CEL Files Correspondence

Table H.1: Correspondence between IDs of samples and CEL Files.

ID      CEL     ID      CEL     ID      CEL     ID      CEL     ID      CEL
1,101   2,674   7,909   1,312   15,601  1,204   17,707  1,314   20,709  3,134
1,201   2,675   7,910   314     15,602  2,682   17,708  3,037   20,710  3,441
2,101   2,017   7,911   1,235   15,603  2,790   17,709  60      20,711  884
2,201   3,010   7,913   3,096   15,604  1,142   17,801  389     20,717  312
2,202   686     9,101   424     15,701  430     17,805  3,117   20,722  2,694
2,302   1,094   9,201   385     15,703  2,791   17,903  3,091   20,736  1,725
2,303   304     9,301   593     15,802  788     17,906  175     20,740  3,002
3,801   356     9,302   365     15,808  2,625   17,909  824     20,901  2,763
4,101   198     9,603   266     15,901  2,949   17,911  3,042   20,902  2,330
4,301   3,085   9,703   352     15,902  3,132   17,912  402     20,904  3,382
4,302   3,083   9,705   1,093   15,903  3,137   17,913  1,214   21,101  2,941
4,502   2,677   9,706   3,251   15,906  146     17,914  378     21,302  2,950
4,506   3,111   9,708   1,257   15,907  2,945   18,101  1,808   22,502  874
4,601   3,149   9,903   1,363   15,908  1,455   18,201  2,822   22,601  1,213
4,803   2,752   10,101  1,915   15,910  868     18,301  698     22,602  946
4,806   2,297   10,303  1,158   15,911  2,113   18,302  3,095   22,702  2,864
4,903   1,872   10,304  1,100   15,912  863     18,401  1,020   22,704  2,857
4,904   2,028   10,401  3,503   15,913  1,148   18,601  494     24,202  2,806
4,905   3,155   10,901  311     15,914  1,240   18,602  3,163   24,901  2,760
4,908   3,105   10,903  1,234   15,915  1,309   18,701  2,067   24,902  2,761
4,910   2,753   10,904  858     15,916  2,698   18,702  1,146   25,101  887
4,911   2,469   10,906  959     15,917  2,695   18,705  549     25,301  174
4,912   246     10,908  647     15,921  482     18,710  106     25,303  111
4,915   1,911   10,911  397     15,922  3,120   18,902  182     25,403  2,060
4,916   2,453   10,914  2,633   15,923  2,637   18,903  621     25,404  277
4,919   1,444   11,201  3,065   15,924  1,897   18,904  75      25,901  2,712
5,601   1,867   11,702  2,939   15,925  2,948   18,905  2,718   25,903  1,180
5,603   931     11,804  1,598   15,926  2,624   18,906  2,746   25,905  2,962
5,604   325     11,806  729     16,101  336     18,907  380     25,908  1,337
5,706   2,005   11,901  3,150   16,301  828     18,908  1,126   25,909  358
5,707   2,748   11,902  2,891   16,502  81      18,910  1,036   27,101  485

Table H.2: Correspondence between IDs of samples and CEL Files.

ID      CEL     ID      CEL     ID      CEL     ID      CEL     ID      CEL
5,804   307     11,903  912     16,601  2,969   18,911  269     27,502  608
5,902   284     11,911  1,408   16,703  77      18,912  1,108   27,601  832
5,903   501     12,101  105     16,901  1,137   19,101  519     27,602  1,548
5,904   1,228   12,201  2,874   16,902  166     19,201  110     27,603  3,194
5,905   2,747   14,101  286     16,903  609     19,601  662     27,605  2,690
5,906   2,815   14,201  1,395   16,905  1,339   19,703  712     27,606  1,922
5,907   1,187   14,501  1,153   16,906  2,910   19,905  3,165   27,607  590
5,909   117     14,601  185     16,909  2,961   19,906  597     27,609  945
5,910   2,865   14,602  847     16,911  2,831   19,907  91      27,610  1,874
7,101   3,046   14,603  732     16,912  2,829   19,910  2       27,702  2,812
7,501   247     14,701  1,403   17,101  330     19,911  309     27,704  3,462
7,603   242     14,702  274     17,201  503     19,912  1,034   27,706  2,928
7,801   1,356   14,703  72      17,301  272     19,913  797     27,708  643
7,807   3,109   14,801  1,196   17,302  2,982   20,101  360     27,801  2,876
7,901   1,957   14,901  1,264   17,303  69      20,201  1,256   27,902  2,855
7,902   2,669   15,301  39      17,602  362     20,302  1,385   27,905  458
7,906   3,522   15,303  99      17,703  292     20,602  1,495   27,907  2,786
7,907   1,064   15,304  553     17,705  1,588   20,603  1,439   27,909  2,879
7,908   3,606   15,502  129     17,706  969     20,704  171     27,910  587

Bibliography

1000 Genomes Project Consortium et al. (2016) 1000 Genomes Project Consortium, A Auton, LD Brooks, RM Durbin, EP Garrison, HM Kang, JO Korbel, JL Marchini, S McCarthy, GA McVean and GR Abecasis. A global reference for human genetic variation. Nature, 526(7571):68–74. doi: 10.1038/nature15393. Cited on pages: 3

Affymetrix (2017) Affymetrix. Affymetrix Power Tools: MANUAL: apt-probeset-summarize (1.20.0), 2017. URL https://www.affymetrix.com/support/developer/powertools/changelog/apt-probeset-summarize.html. Cited on pages: 33

Affymetrix (2008) Affymetrix. Genome-Wide Human SNP Array 6.0, 2008. URL http://www.affymetrix.com/catalog/131533/AFFY/Genome-Wide+Human+SNP+Array+6.0#1_1. Cited on pages: 19, 24

Affymetrix (2009a) Affymetrix. Affymetrix CEL Data File Format, 2009a. URL http://media.affymetrix.com/support/developer/powertools/changelog/gcos-agcc/cel.html#V4. Cited on pages: 30

Affymetrix (2009b) Affymetrix. Affymetrix DAT Data File Format, 2009b. URL http://media.affymetrix.com/support/developer/powertools/changelog/gcos-agcc/dat.html. Cited on pages: 30

Alkan et al. (2011) Can Alkan, Bradley P Coe and Evan E Eichler. Genome structural variation discovery and genotyping. Nat Rev Genet, 12(5):363–376. doi: 10.1038/nrg2958. URL http://dx.doi.org/10.1038/nrg2958. Cited on pages: 37

Amos (1994) C I Amos. Robust variance-components approach for assessing genetic linkage in pedigrees. American Journal of Human Genetics, 54(3):535–543. ISSN 0002-9297. Cited on pages: 45

Becker et al. (2012) Kerstin Becker, Nataliya Di Donato, Muriel Holder-Espinasse, Joris Andrieux, Jean Marie Cuisset, Louis Vallee, Ghislaine Plessis, Nolwenn Jean, Bruno Delobel, Ann Charlotte Thuresson, Goran Anneren, Kirstine Ravn, Zeynep Tumer, Sigrid Tinschert, Evelin Schrock, Aia Elise Jonch and Karl Hackmann. De novo microdeletions of chromosome 6q14.1-q14.3 and 6q12.1-q14.1 in two patients with intellectual disability - further delineation of the 6q14 microdeletion syndrome and review of the literature. European Journal of Medical Genetics, 55(8-9):490–497. ISSN 17697212. doi: 10.1016/j.ejmg.2012.03.003. Cited on pages: 67

Blangero et al. (2015) J Blangero, K Lange, L Almasy, T Dyer, H Göring, J Williams and C Peterson. SOLAR, 2015. URL http://www.biostat.wustl.edu/genetics/geneticssoft/manuals/solar210/01.chapter.html. Cited on pages: 20, 29, 30


Blangero et al.(2013) John Blangero, Vincent P Diego, Thomas D Dyer, Marcio Almeida, Juan Peralta, Jeff T Williams and Laura Almasy. A Kernel of Truth: Statistical Advances in Polygenic Variance Component Models for Complex Human Pedigrees, volume 81. ISBN 9780124076778. doi: 10.1016/B978-0-12-407677-8.00001-4. Cited on pages: 45, 47 Bolstad et al.(2003) B M Bolstad, R A Irizarry, M Astrand and T P Speed. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2):185–193. doi: 10.1093/bioinformatics/19.2.185. URL http://dx.doi.org/10.1093/bioinformatics/19.2.185. Cited on pages: 31

Bolstad(2001) Benjamin Milo Bolstad. Package ‘preprocessCore’ for R, 2001. Cited on pages: 31

Boomsma et al.(2014) Dorret I. Boomsma, Cisca Wijmenga, Eline P. Slagboom, Morris A. Swertz, Lennart C. Karssen, Abdel Abdellaoui, Kai Ye, Victor Guryev, Martijn Vermaat, Freerk Van Dijk, Laurent C. Francioli, Jouke Jan Hottenga, Jeroen F.J. Laros, Qibin Li, Yingrui Li, Hongzhi Cao, Ruoyan Chen, Yuanping Du, Ning Li, Sujie Cao, Jessica Van Setten, Androniki Menelaou, Sara L. Pulit, Jayne Y. Hehir-Kwa, Marian Beekman, Clara C. Elbers, Heorhiy Byelas, Anton J.M. De Craen, Patrick Deelen, Martijn Dijkstra, Johan T. Den Dunnen, Peter De Knijff, Jeanine Houwing-Duistermaat, Vyacheslav Koval, Karol Estrada, Albert Hofman, Alexandros Kanterakis, David Van Enckevort, Hailiang Mai, Mathijs Kattenberg, Elisabeth M. Van Leeuwen, Pieter B.T. Neerincx, Ben Oostra, Fernando Rivadeneira, Eka H.D. Suchiman, Andre G. Uitterlinden, Gonneke Willemsen, Bruce H. Wolffenbuttel, Jun Wang, Paul I.W. De Bakker, Gert Jan Van Ommen and Cornelia M. Van Duijn. The Genome of the Netherlands: Design, and project goals. European Journal of Human Genetics, 22(2):221–227. ISSN 10184813. doi: 10.1038/ejhg.2013.118. URL http://dx.doi.org/10.1038/ejhg.2013.118. Cited on pages: 4

Broad Institute(2008) Broad Institute. Birdsuite: Birdseed, 2008. URL https://www.broadinstitute.org/mpg/birdsuite/birdseed.html. Cited on pages: 34

Campbell et al.(2011) Catarina D. Campbell, Nick Sampas, Anya Tsalenko, Peter H. Sudmant, Jeffrey M. Kidd, Maika Malig, Tiffany H. Vu, Laura Vives, Peter Tsang, Laurakay Bruhn and Evan E. Eichler. Population-genetic properties of differentiated human copy-number polymorphisms. American Journal of Human Genetics. ISSN 00029297. doi: 10.1016/j.ajhg.2011.02.004. Cited on pages: 8

Cardoso et al.(2003) Carlos Cardoso, Richard J Leventer, Heather L Ward, Kazuhito Toyo-Oka, June Chung, Alyssa Gross, Christa L Martin, Judith Allanson, Daniela T Pilz, Ann H Olney, Osvaldo M Mutchinick, Shinji Hirotsune, Anthony Wynshaw-Boris, William B Dobyns and David H Ledbetter. Refinement of a 400-kb Critical Region Allows Genotypic Differentiation between Isolated Lissencephaly, Miller-Dieker Syndrome, and Other Phenotypes Secondary to Deletions of 17p13.3. Am. J. Hum. Genet, 72:918–930. Cited on pages: 18

Carvalho et al.(2007) Benilton Carvalho, Henrik Bengtsson, Terence P Speed and Rafael A Irizarry. Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data. Biostatistics, 8(2):485–499. doi: 10.1093/biostatistics/kxl042. Cited on pages: 15, 35

Carvalho et al.(2010) Benilton S. Carvalho, Thomas A. Louis and Rafael A. Irizarry. Quantifying uncertainty in genotype calls. Bioinformatics. ISSN 14602059. doi: 10.1093/bioinformatics/btp624. Cited on pages: 35

Chu et al.(2013) Jen-hwa Chu, Angela Rogers, Iuliana Ionita-Laza, Katayoon Darvishi, Ryan E Mills, Charles Lee and Benjamin A Raby. Copy number variation genotyping using family information. BMC Bioinformatics, 14(1):157. doi: 10.1186/1471-2105-14-157. URL http://dx.doi.org/10.1186/1471-2105-14-157. Cited on pages: 14

Dauber et al.(2011) Andrew Dauber, Yongguo Yu, Michael C. Turchin, Charleston W. Chiang, Yan A. Meng, Ellen W. Demerath, Sanjay R. Patel, Stephen S. Rich, Jerome I. Rotter, Pamela J. Schreiner, James G. Wilson, Yiping Shen, Bai-Lin Wu and Joel N. Hirschhorn. Genome-wide Association of Copy-Number Variation Reveals an Association between Short Stature and the Presence of Low-Frequency Genomic Deletions. The American Journal of Human Genetics, 89(6):751–759. doi: 10.1016/j.ajhg.2011.10.014. URL http://dx.doi.org/10.1016/j.ajhg.2011.10.014. Cited on pages: 20

De Andrade et al.(2015) Mariza De Andrade, Debashree Ray, Alexandre C. Pereira and Júlia P. Soler. Global Individual Ancestry Using Principal Components for Family Data. Human Heredity, 80(1):1–11. ISSN 14230062. doi: 10.1159/000381908. Cited on pages: 58, 59, 72

de Oliveira et al.(2008) Camila M de Oliveira, Alexandre C Pereira, Mariza de Andrade, Júlia M Soler and José E Krieger. Heritability of cardiovascular risk factors in a Brazilian population: Baependi Heart Study. BMC Medical Genetics, 9(1):32. ISSN 1471-2350. doi: 10.1186/1471-2350-9-32. URL http://dx.doi.org/10.1186/1471-2350-9-32 http://bmcmedgenet.biomedcentral.com/articles/10.1186/1471-2350-9-32. Cited on pages: 4, 24

Demidenko(2004) Eugene Demidenko. Mixed Models: Theory and Applications. Wiley-Interscience. ISBN 0471601616. Cited on pages: 46

Diskin et al.(2008) Sharon J Diskin, Mingyao Li, Cuiping Hou, Shuzhang Yang, Joseph Glessner, Hakon Hakonarson, Maja Bucan, John M Maris and Kai Wang. Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms. Nucleic Acids Research, 36(19):e126. ISSN 1362-4962. doi: 10.1093/nar/gkn556. URL http://www.ncbi.nlm.nih.gov/pubmed/18784189 http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC2577347. Cited on pages: 42

Dutra et al.(2011) Roberta Lelis Dutra, Patrícia de Campos Pieri, Ana Carolina Dias Teixeira, Rachel Sayuri Honjo, Debora Romeo Bertola and Chong Ae Kim. Detection of deletions at 7q11.23 in Williams-Beuren syndrome by polymorphic markers. Clinics. ISSN 1807-5932. doi: 10.1590/S1807-59322011000600007. Cited on pages: 18

Eckel-Passow et al.(2011) Jeanette E Eckel-Passow, Elizabeth J Atkinson, Sooraj Maharjan, Sharon L R Kardia and Mariza de Andrade. Software comparison for evaluating genomic copy number variation for Affymetrix 6.0 SNP array platform. BMC Bioinformatics, 12(1):220. doi: 10.1186/1471-2105-12-220. URL http://dx.doi.org/10.1186/1471-2105-12-220. Cited on pages: 15, 19, 29, 35

Eddy(2004) Sean R Eddy. What is a hidden Markov model? Nat Biotechnol, 22(10):1315–1316. doi: 10.1038/nbt1004-1315. URL http://dx.doi.org/10.1038/nbt1004-1315. Cited on pages: 38, 41

Egan et al.(2016) Kieren J Egan, Malcolm von Schantz, André B Negrão, Hadassa C Santos, Andréa R V R Horimoto, Nubia E Duarte, Guilherme C Gonçalves, Júlia M P Soler, Mariza de Andrade, Geraldo Lorenzi-Filho, Homero Vallada, Tâmara P Taporoski, Mario Pedrazzoli, Ana P Azambuja, Camila M de Oliveira, Rafael O Alvim, José E Krieger and Alexandre C Pereira. Cohort profile: the Baependi Heart Study—a family-based, highly admixed cohort study in a rural Brazilian town. BMJ Open, 6(10):e011598. ISSN 2044-6055. doi: 10.1136/bmjopen-2016-011598. URL http://bmjopen.bmj.com/lookup/doi/10.1136/bmjopen-2016-011598. Cited on pages: 4, 24

Escaramís et al.(2015) Geòrgia Escaramís, Elisa Docampo and Raquel Rabionet. A decade of structural variants: Description, history and methods to detect structural variation. Briefings in Functional Genomics. ISSN 20412657. doi: 10.1093/bfgp/elv014. Cited on pages: 7, 8, 9, 11, 14, 15

Feuk et al.(2006) Lars Feuk, Andrew R Carson and Stephen W Scherer. Structural variation in the human genome. Nature Reviews Genetics, 7(2):85–97. doi: 10.1038/nrg1767. URL https://doi.org/10.1038/nrg1767. Cited on pages: 8

Firth et al.(2009) Helen V. Firth, Shola M. Richards, A. Paul Bevan, Stephen Clayton, Manuel Corpas, Diana Rajan, Steven Van Vooren, Yves Moreau, Roger M. Pettett and Nigel P. Carter. DECIPHER: Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources. American Journal of Human Genetics, 84(4):524–533. ISSN 00029297. doi: 10.1016/j.ajhg.2009.03.010. URL http://dx.doi.org/10.1016/j.ajhg.2009.03.010. Cited on pages: 16

Genetic Science Learning Center(2013) Genetic Science Learning Center. DNA Microarray, 2013. URL http://learn.genetics.utah.edu/content/labs/microarray/. Cited on pages: 24, 25

Genetics Home Reference(2017) Genetics Home Reference. What are single nucleotide polymorphisms (SNPs)?, 2017. URL https://ghr.nlm.nih.gov/primer/genomicresearch/snp. Cited on pages: 2

Genovese et al.(2015) A Genovese, D M Cox and M G Butler. Partial Deletion of Chromosome 1p31.1 Including only the Neuronal Growth Regulator 1 Gene in Two Siblings. Journal of Pediatric Genetics, 4(1):23–28. ISSN 2146-460X. doi: 10.1055/s-0035-1554977. Cited on pages: 67

Gold Helix(2014) Gold Helix. CNV Univariate Analysis Tutorial, 2014. URL http://doc.goldenhelix.com/SVS/tutorials/cnv_univariate_analysis/index.html. Cited on pages: 36

Irizarry et al.(2003) Rafael A. Irizarry, Bridget Hobbs, Francois Collin, Yasmin D. Beazer-Barclay, Kristen J. Antonellis, Uwe Scherf and Terence P. Speed. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4(December):249–264. ISSN 14654644. doi: 10.1007/978-1-4614-1347-9_15. Cited on pages: 33

Itsara et al.(2010) Andy Itsara, Hao Wu, Joshua D. Smith, Deborah A. Nickerson, Isabelle Romieu, Stephanie J. London and Evan E. Eichler. De novo rates and selection of large copy number variation. Genome Research, 20(11):1469–1481. ISSN 10889051. doi: 10.1101/gr.107680.110. URL https://doi.org/10.1101/gr.107680.110. Cited on pages: 11, 60

Kim et al.(2013) Yun Kyoung Kim, Sanghoon Moon, Mi Yeong Hwang, Dong-Joon Kim, Ji Hee Oh, Young Jin Kim, Bok-Ghee Han, Jong-Young Lee and Bong-Jo Kim. Gene-based copy number variation study reveals a microdeletion at 12q24 that influences height in the Korean population. Genomics, 101(2):134–138. doi: 10.1016/j.ygeno.2012.11.002. URL http://dx.doi.org/10.1016/j.ygeno.2012.11.002. Cited on pages: 20, 21

Korn et al.(2008) Joshua M Korn, Finny G Kuruvilla, Steven A McCarroll, Alec Wysoker, James Nemesh, Simon Cawley, Earl Hubbell, Jim Veitch, Patrick J Collins, Katayoon Darvishi, Charles Lee, Marcia M Nizzari, Stacey B Gabriel, Shaun Purcell, Mark J Daly and David Altshuler. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nature Genetics, 40(10):1253–60. ISSN 1546-1718. doi: 10.1038/ng.237. URL http://www.ncbi.nlm.nih.gov/pubmed/18776909. Cited on pages: 33, 34

Laframboise(2009) Thomas Laframboise. Single nucleotide polymorphism arrays: a decade of biological, computational and technological advances. Nucleic Acids Research, 37(13):4181–4193. doi: 10.1093/nar/gkp552. Cited on pages: 15, 25, 26

Laird and Ware(1982) Nan M. Laird and James H. Ware. Random-Effects Models for Longitudinal Data. Biometrics, 38(4):963–974. Cited on pages: 45

Landrum et al.(2016) Melissa J Landrum, Jennifer M Lee, Mark Benson, Garth Brown, Chen Chao, Shanmuga Chitipiralla, Baoshan Gu, Jennifer Hart, Douglas Hoffman, Jeffrey Hoover, Wonhee Jang, Kenneth Katz, Michael Ovetsky, George Riley, Amanjeev Sethi, Ray Tully, Ricardo Villamarin-Salomon, Wendy Rubinstein and Donna R Maglott. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Research, 44(November 2015):862–868. doi: 10.1093/nar/gkv1222. Cited on pages: 16

Ledbetter et al.(1992) Susan A Ledbetter, Akira Kuwano, William B Dobyns and David H Ledbetter. Microdeletions of Chromosome 17p13 as a Cause of Isolated Lissencephaly. American Journal of Human Genetics, 50:182–189. Cited on pages: 18

Lee and Lupski(2006) Jennifer A Lee and James R Lupski. Genomic Rearrangements and Gene Copy-Number Alterations as a Cause of Nervous System Disorders. Neuron, 52(1):103–121. doi: 10.1016/j.neuron.2006.09.027. URL http://dx.doi.org/10.1016/j.neuron.2006.09.027. Cited on pages: 3, 14

Levy et al.(2007) Samuel Levy, Granger Sutton, Pauline C Ng, Lars Feuk, Aaron L Halpern, Brian P Walenz, Nelson Axelrod, Jiaqi Huang, Ewen F Kirkness, Gennady Denisov, Yuan Lin, Jeffrey R MacDonald, Andy Wing Chun Pang, Mary Shago, Timothy B Stockwell, Alexia Tsiamouri, Vineet Bafna, Vikas Bansal, Saul A Kravitz, Dana A Busam, Karen Y Beeson, Tina C McIntosh, Karin A Remington, Josep F Abril, John Gill, Jon Borman, Yu-Hui Rogers, Marvin E Frazier, Stephen W Scherer, Robert L Strausberg and J Craig Venter. The Diploid Genome Sequence of an Individual Human. PLoS Biology, 5(10):e254. doi: 10.1371/journal.pbio.0050254. URL https://doi.org/10.1371/journal.pbio.0050254. Cited on pages: 8

Lewis and Knight(2012) Cathryn M. Lewis and Jo Knight. Introduction to genetic association studies. Cold Spring Harbor Protocols, 2012(3):297–306. ISSN 19403402. doi: 10.1101/pdb.top068163. URL http://dx.doi.org/10.1101/pdb.top068163 http://www.ncbi.nlm.nih.gov/pubmed/22383645. Cited on pages: 1, 16, 17

Li et al.(2010) Xi Li, Lijun Tan, Xiaogang Liu, Shufeng Lei, Tielin Yang, Xiangding Chen, Fang Zhang, Yue Fang, Yan Guo, Liang Zhang, Han Yan, Feng Pan, Zhixin Zhang, Yumei Peng, Qi Zhou, Lina He, Xuezhen Zhu, Jing Cheng, Lishu Zhang, Yaozhong Liu, Qing Tian and Hongwen Deng. A genome wide association study between copy number variation (CNV) and human height in Chinese population. Journal of Genetics and Genomics, 37(12):779–785. doi: 10.1016/s1673-8527(09)60095-3. URL http://dx.doi.org/10.1016/S1673-8527(09)60095-3. Cited on pages: 20, 21

Locke et al.(2006) Devin P Locke, Andrew J Sharp, Steven A McCarroll, Sean D McGrath, Tera L Newman, Ze Cheng, Stuart Schwartz, Donna G Albertson, Daniel Pinkel, David M Altshuler and Evan E Eichler. Linkage Disequilibrium and Heritability of Copy-Number Polymorphisms within Duplicated Regions of the Human Genome. The American Journal of Human Genetics, 79(2):275–290. doi: 10.1086/505653. URL http://dx.doi.org/10.1086/505653. Cited on pages: 14

Manolio et al.(2010) Teri A. Manolio, Francis S. Collins, Nancy J. Cox, David B. Goldstein, Lucia A. Hindorff, David J. Hunter, Mark I. McCarthy, Erin M. Ramos, Lon R. Cardon, Aravinda Chakravarti, Judy H. Cho, Alan E. Guttmacher, Augustine Kong, Leonid Kruglyak, Elaine Mardis, Charles N. Rotimi, Montgomery Slatkin, David Valle, Alice S. Whittemore, Michael Boehnke, Andrew G. Clark, Evan E. Eichler, Greg Gibson, Jonathan L. Haines, Trudy F. C. Mackay, Steven A. McCarroll and Peter M. Visscher. Finding the missing heritability of complex diseases. Nature, 461(7265):747–753. doi: 10.1038/nature08494. Cited on pages: 3

Maréchal et al.(2006) Cédric Le Maréchal, Emmanuelle Masson, Jian-Min Chen, Frédéric Morel, Philippe Ruszniewski, Philippe Levy and Claude Férec. Hereditary pancreatitis caused by triplication of the trypsinogen locus. Nature Genetics, 38(12):1372–1374. doi: 10.1038/ng1904. URL http://dx.doi.org/10.1038/ng1904. Cited on pages: 3, 14

Marenne et al.(2012) Gaëlle Marenne, Francisco X Real, Nathaniel Rothman, Benjamin Rodríguez-Santiago, Luis Pérez-Jurado, Manolis Kogevinas, Montse García-Closas, Debra T Silverman, Stephen J Chanock, Emmanuelle Génin and Núria Malats. Genome-wide CNV analysis replicates the association between GSTM1 deletion and bladder cancer: a support for using continuous measurement from SNP-array data. BMC Genomics, 13(1):326. ISSN 1471-2164. doi: 10.1186/1471-2164-13-326. URL http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-13-326. Cited on pages: 42

McCall et al.(2010) M N McCall, B M Bolstad and R A Irizarry. Frozen robust multiarray analysis (fRMA). Biostatistics, pages 242–253. doi: 10.1093/biostatistics/kxp059. Cited on pages: 30, 31, 32

McCarroll and Altshuler(2007) Steven A McCarroll and David M Altshuler. Copy-number variation and association studies of human disease. Nature Genetics, 39(7s):S37–S42. ISSN 1061-4036. doi: 10.1038/ng2080. URL http://dx.doi.org/10.1038/ng2080. Cited on pages: 9, 13, 15, 18, 19, 20

McCulloch and Searle(2001) Charles E. McCulloch and Shayle R. Searle. Generalized, Linear, and Mixed Models (Wiley Series in Probability and Statistics). Wiley-Interscience. ISBN 047119364X. Cited on pages: 45

Monks et al.(2004) S A Monks, A Leonardson, H Zhu, P Cundiff, P Pietrusiak, S Edwards, J W Phillips, A Sachs and E E Schadt. Genetic Inheritance of Gene Expression in Human Cell Lines. Am. J. Hum. Genet, 75:1094–1105. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1182144/pdf/AJHGv75p1094.pdf. Cited on pages: 70

National Cancer Institute() National Cancer Institute. Definition of genetic variant - NCI Dictionary of Genetics Terms - National Cancer Institute. URL https://www.cancer.gov/publications/dictionaries/genetics-dictionary?cdrid=776887. Cited on pages: 1

National Human Genome Research Institute (NHGRI)() National Human Genome Research Institute (NHGRI). DNA Microarray Technology. URL https://www.genome.gov/10000533/dna-microarray-technology/. Cited on pages: 24

National Institutes of Health(2017) National Institutes of Health. Talking Glossary of Genetic Terms, 2017. URL https://www.genome.gov/glossary/index.cfm?id=40. Cited on pages: 3

Nature Education(2014) Nature Education. Complex trait - Definition, 2014. URL https://www.nature.com/scitable/definition/complex-trait-82. Cited on pages: 1

Palta et al.(2015) Priit Palta, Lauris Kaplinski, Liina Nagirnaja, Andres Veidenberg, Märt Möls, Mari Nelis, Tõnu Esko, Andres Metspalu, Maris Laan and Maido Remm. Phasing and Inheritance of Copy Number Variants in Nuclear Families. PLoS ONE, 10(4):e0122713. ISSN 19326203. doi: 10.1371/journal.pone.0122713. URL https://doi.org/10.1371/journal.pone.0122713. Cited on pages: 14

Peiffer(2006) D A Peiffer. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Research, 16(9):1136–1148. doi: 10.1101/gr.5402306. URL http://dx.doi.org/10.1101/gr.5402306. Cited on pages: 35

Price et al.(2006) Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick and David Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904–909. doi: 10.1038/ng1847. URL http://dx.doi.org/10.1038/ng1847. Cited on pages: 18

Quintela et al.(2015) Ines Quintela, Montse Fernandez-Prieto, Lorena Gomez-Guerrero, Mariela Resches, Jesus Eiris, Francisco Barros and Angel Carracedo. A 6q14.1-q15 microdeletion in a male patient with severe autistic disorder, lack of oral language, and dysmorphic features with concomitant presence of a maternally inherited Xp22.31 copy number gain. Clinical Case Reports, 3(6):415–423. ISSN 2050-0904 (Electronic). doi: 10.1002/ccr3.255. Cited on pages: 67

Rabbee and Speed(2006) Nusrat Rabbee and Terence P. Speed. A genotype calling algorithm for Affymetrix SNP arrays. Bioinformatics, 22(1):7–12. ISSN 13674803. doi: 10.1093/bioinformatics/bti741. Cited on pages: 33

Sanna et al.(2011) Serena Sanna, Bingshan Li, Antonella Mulas, Carlo Sidore, Hyun M Kang, Anne U Jackson, Maria Grazia Piras, Gianluca Usala, Giuseppe Maninchedda, Alessandro Sassu, Fabrizio Serra, Antonietta Palmas, Markku Laakso, Kristian Hveem, Timo A Lakka, Rainer Rauramaa, Michael Boehnke, Francesco Cucca, Manuela Uda, David Schlessinger and Ramaiah Nagaraja. Fine Mapping of Five Loci Associated with Low-Density Lipoprotein Cholesterol Detects Variants That Double the Explained Heritability. PLoS Genetics, 7(7). doi: 10.1371/journal.pgen.1002198. Cited on pages: 4

Scharpf et al.(2011) R B Scharpf, R A Irizarry, M E Ritchie, B Carvalho and I Ruczinski. Using the R Package crlmm for Genotyping and Copy Number Estimation. Journal of Statistical Software. Cited on pages: 15

Scharpf et al.(2014) Robert B Scharpf, Lynn Mireles, Qiong Yang, Anna Köttgen, Ingo Ruczinski, Katalin Susztak, Eitan Halper-Stromberg, Adrienne Tin, Stephen Cristiano, Aravinda Chakravarti, Eric Boerwinkle, Caroline S Fox, Josef Coresh and Wen Hong Linda Kao. Copy number polymorphisms near SLC2A9 are associated with serum uric acid concentrations. BMC Genetics, 15(1):1–13. ISSN 1471-2156. doi: 10.1186/1471-2156-15-81. URL http://dx.doi.org/10.1186/1471-2156-15-81. Cited on pages: 3, 18, 60, 64

Self and Liang(1987) Steven G Self and Kung-Yee Liang. Asymptotic Properties of Maximum Likelihood Estimators and Likelihood Ratio Tests under Nonstandard Conditions. Journal of the American Statistical Association, 82(398):605–610. doi: 10.1080/01621459.1987.10478472. URL http://dx.doi.org/10.1080/01621459.1987.10478472. Cited on pages: 47

Shen et al.(2008) Fan Shen, Jing Huang, Karen R Fitch, Vivi B Truong, Andrew Kirby, Wenwei Chen, Jane Zhang, Guoying Liu, Steven A McCarroll, Keith W Jones and Michael H Shapero. Improved detection of global copy number variation using high density, non-polymorphic oligonucleotide probes. BMC Genet, 9(1):27. doi: 10.1186/1471-2156-9-27. URL http://dx.doi.org/10.1186/1471-2156-9-27. Cited on pages: 25

Stankiewicz and Lupski(2002) Pawel Stankiewicz and James R Lupski. Genome architecture, rearrangements and genomic disorders. Trends in Genetics, 18(2):74–82. doi: 10.1016/s0168-9525(02)02592-1. URL http://dx.doi.org/10.1016/S0168-9525(02)02592-1. Cited on pages: 11, 12

Stankiewicz and Lupski(2010) Paweł Stankiewicz and James R Lupski. Structural Variation in the Human Genome and its Role in Disease. Annual Review of Medicine, 61(1):437–455. doi: 10.1146/annurev-med-100708-204735. URL http://dx.doi.org/10.1146/annurev-med-100708-204735. Cited on pages: 9, 11, 12, 13

Tennessen et al.(2012) Jacob A. Tennessen, Abigail W. Bigham, Timothy D. O’Connor, Wenqing Fu, Eimear E. Kenny, Simon Gravel, Sean McGee, Ron Do, Xiaoming Liu, Goo Jun, Hyun Min Kang, Daniel Jordan, Suzanne M. Leal, Stacey Gabriel, Mark J. Rieder, Goncalo Abecasis, David Altshuler, Deborah A. Nickerson, Eric Boerwinkle, Shamil Sunyaev, Carlos D. Bustamante, Michael J. Bamshad and Joshua M. Akey. Evolution and Functional Impact of Rare Coding Variation from Deep Sequencing of Human Exomes. Science, 337(6090):64–69. doi: 10.1126/science.1219240. Cited on pages: 4

The International HapMap Consortium(2003) The International HapMap Consortium. The International HapMap Project. Nature, 426:789–796. Cited on pages: 1, 2, 3, 36

The International HapMap Consortium(2005) The International HapMap Consortium. A haplotype map of the human genome. Nature, 437(7063):1299–1320. doi: 10.1038/nature04226. URL http://dx.doi.org/10.1038/nature04226. Cited on pages: 44

Theisen(2017) Aaron Theisen. Microarray-based Comparative Genomic Hybridization (aCGH). 1(2008):1–6. Cited on pages: 14

Therneau and Sinnwell(2015) Terry M Therneau and Jason Sinnwell. kinship2: Pedigree Functions, 2015. URL https://cran.r-project.org/package=kinship2. Cited on pages: 30, 72, 99

Therneau et al.(2015) Terry M Therneau, Daniel Schaid, Jason Sinnwell and Elizabeth Atkinson. Pedigree Functions - Package kinship2, 2015. URL https://cran.r-project.org/web/packages/kinship2/kinship2.pdf. Cited on pages: 20

Tukey(1970) J.W. Tukey. Exploratory Data Analysis. Addison-Wesley. ISBN 9780608082257. Cited on pages: 32

Turner et al.(2011) Stephen Turner, Loren L Armstrong, Yuki Bradford, Christopher S Carlson, C Dana, Andrew T Crenshaw, Mariza De Andrade, Kimberly F Doheny, L Jonathan, Geoffrey Hayes, Gail Jarvik, Lan Jiang, Iftikhar J Kullo, Rongling Li, Teri A Manolio, Martha Matsumoto, Catherine A McCarty, N Andrew, Daniel B Mirel, Justin E Paschall, Elizabeth W Pugh, V Luke, Russell A Wilke, Rebecca L Zuvich and Marylyn D Ritchie. Quality control procedures for genome wide association studies. Current Protocols in Human Genetics, 68(1):1–24. ISSN 1934-8266. doi: 10.1002/0471142905.hg0119s68. Cited on pages: 41, 53

UCSC Genome Browser on Human() UCSC Genome Browser on Human. Human hg19 - chr9:78,960,219-78,967,224. URL https://genome.ucsc.edu/. Cited on pages: 80

Visscher et al.(2008) Peter M Visscher, William G Hill and Naomi R Wray. Heritability in the genomics era - concepts and misconceptions. Nat Rev Genet, 9(4):255–266. doi: 10.1038/nrg2322. URL http://dx.doi.org/10.1038/nrg2322. Cited on pages: 47

Wang et al.(2007a) K Wang, M Li, D Hadley, R Liu, J Glessner, S F A Grant, H Hakonarson and M Bucan. PennCNV, 2007a. URL http://penncnv.openbioinformatics.org/en/latest/. Cited on pages: 27, 28, 42, 54

Wang et al.(2007b) Kai Wang, Mingyao Li, Dexter Hadley, Rui Liu, Joseph Glessner, Struan F.A. Grant, Hakon Hakonarson and Maja Bucan. PennCNV: An integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Research, 17(11):1665–1674. doi: 10.1101/gr.6861907. URL http://dx.doi.org/10.1101/gr.6861907. Cited on pages: 15, 35, 36, 38, 39, 40, 41

Wang et al.(2008a) Kai Wang, Zhen Chen, Mahlet G. Tadesse, Joseph Glessner, Struan F. A. Grant, Hakon Hakonarson, Maja Bucan and Mingyao Li. Modeling genetic inheritance of copy number variations. Nucl. Acids Res., 36(21). doi: 10.1093/nar/gkn641. Cited on pages: 14

Wang et al.(2008b) Wenyi Wang, Benilton Carvalho, Nathaniel D Miller, Jonathan Pevsner, Aravinda Chakravarti and Rafael A Irizarry. Estimating Genome-Wide Copy Number Using Allele-Specific Mixture Models. Journal of Computational Biology, 15(7):857–866. doi: 10.1089/cmb.2007.0148. URL http://dx.doi.org/10.1089/cmb.2007.0148. Cited on pages: 15

Wellcome Trust Sanger Institute(2009) Wellcome Trust Sanger Institute. DECIPHER (DatabasE of genomiC varIation and Phenotype in Humans using Ensembl Resources), 2009. URL https://decipher.sanger.ac.uk/. Cited on pages: 16

Wheeler et al.(2008) David A Wheeler, Maithreyan Srinivasan, Michael Egholm, Yufeng Shen, Lei Chen, Amy McGuire, Wen He, Yi-Ju Chen, Vinod Makhijani, G Thomas Roth, Xavier Gomes, Karrie Tartaro, Faheem Niazi, Cynthia L Turcotte, Gerard P Irzyk, James R Lupski, Craig Chinault, Xing-zhi Song, Yue Liu, Ye Yuan, Lynne Nazareth, Xiang Qin, Donna M Muzny, Marcel Margulies, George M Weinstock, Richard A Gibbs and Jonathan M Rothberg. The complete genome of an individual by massively parallel DNA sequencing. Nature, 452(7189):872–876. doi: 10.1038/nature06884. URL https://doi.org/10.1038/nature06884. Cited on pages: 8

Wheeler et al.(2013) Eleanor Wheeler, Ni Huang, Elena G Bochukova, Julia M Keogh, Sarah Lindsay, Sumedha Garg, Elana Henning, Hannah Blackburn, Ruth J F Loos, Nick J Wareham, Stephen O’Rahilly, Matthew E Hurles, Inês Barroso and I Sadaf Farooqi. Genome-wide SNP and CNV analysis identifies common and low-frequency variants associated with severe early-onset obesity. Nature Genetics, 45(5):513–517. ISSN 1061-4036. doi: 10.1038/ng.2607. URL http://www.nature.com/doifinder/10.1038/ng.2607. Cited on pages: 18, 67

Yang et al.(2010) Jian Yang, Beben Benyamin, Brian P McEvoy, Scott Gordon, Anjali K Henders, Dale R Nyholt, Pamela A Madden, Andrew C Heath, Nicholas G Martin, Grant W Montgomery, Michael E Goddard and Peter M Visscher. Common SNPs explain a large proportion of the heritability for human height. Nature Genetics, 42(7):565–569. doi: 10.1038/ng.608. URL http://dx.doi.org/10.1038/ng.608. Cited on pages: 18, 20

Zarrei et al.(2015) Mehdi Zarrei, Jeffrey R. MacDonald, Daniele Merico and Stephen W. Scherer. A copy number variation map of the human genome. Nature Reviews Genetics, 16(3):172–183. ISSN 1471-0056. doi: 10.1038/nrg3871. URL http://www.nature.com/doifinder/10.1038/nrg3871. Cited on pages: 10, 11

Zhang et al.(2009) Feng Zhang, Wenli Gu, Matthew E Hurles and James R Lupski. Copy Number Variation in Human Health, Disease, and Evolution. Annual Review of Genomics and Human Genetics, 10(1):451–481. ISSN 1527-8204. doi: 10.1146/annurev.genom.9.081307.164217. URL https://doi.org/10.1146/annurev.genom.9.081307.164217 www.annualreviews.org. Cited on pages: 8, 9, 12, 13, 16

Zhang(2017) Jianhua Zhang. CNTools: Convert segment data into a region by sample matrix to allow for other high level computational analyses, 2017. Cited on pages: 29, 44

Ziegler and König(2006) Andreas Ziegler and Inke R. König. A Statistical Approach to Genetic Epidemiology. Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, Germany. ISBN 3527312528. doi: 10.1002/9783527633654. URL http://doi.wiley.com/10.1002/9783527633654. Cited on pages: 1, 3