From array genotypes to whole-genome sequences: harnessing genetic variation for dairy cattle selection

by Adrien M. Butty

A Thesis presented to The University of Guelph

In partial fulfilment of requirements for the degree of Doctor of Philosophy in Animal Biosciences

Guelph, Ontario, Canada © Adrien Butty, September 2019 ABSTRACT

FROM ARRAY GENOTYPES TO WHOLE-GENOME SEQUENCES: HARNESSING

GENETIC VARIATION FOR DAIRY CATTLE SELECTION

Adrien M. Butty Advisor: University of Guelph, 2019 Prof. Dr. Christine F. Baes

Genetic markers are currently used as a tool in animal breeding to measure and make use of genetic variation. After the unsuccessful implementation of marker-assisted selection including microsatellites in genetic evaluation models, implementation of genomic selection principles in breeding programs allowed drastic acceleration of the genetic gains made in the dairy industry.

Although great genetic improvements have been made possible with the introgression of array- based single nucleotide polymorphisms (SNP) genotypes into genetic evaluation models, more variants are needed to explain a larger proportion of the genetic variance observed in economically important traits. This would allow for more precise estimation of breeding values (EBV) and greater genetic progress. The increase in the number and type of variants through using whole- genome sequencing (WGS), for instance copy number variants (CNV), in genetic evaluation models could contribute to higher accuracies of EBV. Sequencing of a large number of animals is still prohibitively expensive, but the large number of genotyped samples already available allows for 1) the accurate imputation of genotypes to WGS variants, 2) the identification of CNV relying on the signal intensity values produced at the time of array genotyping, and 3) the use of the CNV identified with high confidence in silico to gain knowledge about the genetic architecture of traits of economic importance, for example hoof health traits. In this thesis, haplotype-based methods were developed that improve reference population animal selection for sequencing, to allow for more accurate imputation of common or rare variants.

Secondly, CNV were identified with high confidence using both array genotypes and WGS information. Finally, CNV regions were identified that were associated with hoof health traits recorded for the Canadian Holstein genetic evaluation.

Starting with SNP that were phased to haplotypes and looking at the structural variants that are

CNV, this thesis bridges current and possible future genetic markers to exploit the maximum genomic information present in the dairy population. Altogether, the advances made in this thesis will permit an increase in the rate of genetic improvement for dairy cattle once breeding value estimation models have been developed that efficiently include and combine CNV and SNP information.

iv

DEDICATION

A Marie-Berthe et Henri

v

ACKNOWLEDGEMENTS

My first thank you goes to my supervisor of long date, Dr. Christine Baes, not only for her constant support and patience, but also for the great moments spent listening to her stories, which are always so interesting. Thank you, Chris, for bringing me back on track and for challenging me in situations where I definitely needed to get a kick to go on. I am also very grateful to my co-advisor, Dr. Filippo Miglior. Thank you, Filippo, for your support and mentorship through this PhD adventure. I very much appreciate your steady support, your guidance in this unknown process that is the development of a PhD thesis, but also the support in times where I needed an ear for my concerns. I would like to thank all members of my advisory committee members for their help, and constructive feedbacks: Dr. Paul Stothard, Dr. Mehdi Sargolzaei, Dr. Birgit Gredler-Grandl, and Dr. Angela Cánovas. Special thanks to Paul and his team at the University of Alberta for the opportunity that I had to visit and learn about next-generation sequence data, copy number variants and the winter in Alberta. I acknowledge the financial support of the Efficient Dairy Genome Project and all partners. In addition, I am grateful to the Swiss Association for Animal Science and Qualitas AG (Zug, Switzerland) for their support. A special thanks goes here to the genetic evaluation team of Qualitas AG who accepted me as part of their group in the times I was working on my thesis from their offices. Technical difficulties along my PhD thesis were always resolved in a very bright and rapid manner by the IT manager of the Department of Animal Sciences. I thank from the bottom of my heart Bill Szkotnicki for his immeasurable help to get or keep my (sometimes somewhat large) analyses running. The people I met in Guelph all have their share in my well-being during my stay and I am very grateful for all the encounters. I especially want to thank Marta McCarthy and the Gryphon Singers for welcoming me in the choir. A huge thank you also to Nathalie Lauriault and Gord Kirby for giving me a home away from home and so many great memories. Merci!

vi

It is a great experience to be part of the CGIL community, from everyday lunch chats to more festive party nights, I am very grateful for all the great memories and the support in my academic time. Thank you all of you! Last but definitely not least, I thank from the bottom of my heart, Marco, my family and my friends in Switzerland and in Canada for their support throughout all the more and the less good phases I went through during my thesis. The support I received every day is what got me to the end of this adventure. To all of you: thank you!

vii

TABLE OF CONTENTS

Abstract ...... ii Dedication ...... iv Acknowledgements ...... v Table of Contents ...... vii List of Tables ...... ix List of Figures ...... x List of Abbreviations ...... xiii Chapter 1: General introduction...... 1 1.1 Cattle breeding and genetic markers ...... 1 1.2 Thesis objectives ...... 5 1.3 References ...... 5 Chapter 2: Optimizing selection of the reference population for genotype imputation from array to sequence variants ...... 8 2.1 Abstract ...... 9 2.2 Introduction ...... 10 2.3 Material and methods ...... 13 2.4 Results ...... 20 2.5 Discussion ...... 24 2.6 Conclusions ...... 28 2.7 References ...... 28 2.8 Acknowledgements ...... 32 2.9 Tables ...... 33 2.10 Figures...... 37 Chapter 3: High confidence copy number variants identified in Holstein dairy cattle from whole genome sequence and genotype array data ...... 48 3.1 Abstract ...... 49 3.2 Introduction ...... 49 3.3 Materials and methods ...... 51 3.4 Results ...... 56 3.5 Discussion ...... 60

viii

3.6 Conclusions ...... 64 3.7 References ...... 65 3.8 Acknowledgments...... 73 3.9 Tables ...... 74 3.10 Figures...... 77 Chapter 4: Genome-wide association study between copy number variants and hoof health traits in Holstein dairy cattle ...... 84 4.1 Abstract ...... 85 4.2 Introduction ...... 86 4.3 Materials and methods ...... 88 4.4 Results ...... 91 4.5 Discussion ...... 93 4.6 Conclusions ...... 97 4.7 References ...... 97 4.8 Acknowledgements ...... 104 4.9 Tables ...... 105 4.10 Figures...... 108 Chapter 5: General discussion and conclusions ...... 111 5.1 General discussion ...... 111 5.2 Conclusions ...... 113 5.3 References ...... 114 Appendices ...... 115

ix

LIST OF TABLES

Table 2-1. Parameters used for the simulation of the populations LongRangeLD and CurrentPop...... 33 Table 2-2. Rate of genotyping change as introduced in the simulated whole-genome sequence genotypes. As an example, 1.1% of the simulated AB genotypes were changed to AA genotypes. Values were retrieved from the study by Baes et al. (2014)...... 34 Table 2-3. Proportion of animals overlapping between selection methods in reference populations of different sizes. The methods were the key ancestors (AMAT), the selection of Highly Segregating Haplotypes (HSH), the Inverse Weighted Selection (IWS), and the Genetic Diversity Index (GDI)...... 35 Table 2-4. Accuracies for reference populations of 50, 200, and 1,200 individuals and increasing MAF of the variants considered, all variants, the rare variants (MAF < 0.05), or the common variants (MAF ≥ 0.05). MAF bins “x-y” stands for “x < MAF  y”...... 36 Table 3-1. Studies, data type, number of sample and number of breeds composing the CNV dataset from the Database of Genomic Variants archive...... 74 Table 3-2. Summary of the whole-genome sequence dataset...... 75 Table 3-3. Summary of the genotyping information dataset...... 76 Table 4-1. Number of markers before and after quality control and number of samples for each genotype array used in the study...... 105 Table 4-2. Heritability estimates (and standard deviations) as published by the Canadian Dairy Network used for de-regression of the hoof health EBV and descriptive statistics of the de- regressed EBV...... 106 Table 4-3. CNVR significantly (P < 0.0005) associated with hoof health traits, their type (duplication or deletion), and content...... 107

x

LIST OF FIGURES

Figure 2-1. Structure and number of animals of the simulated populations...... 37 Figure 2-2. The Genetic Diversity Index is the sum of the unique haplotypes found in a group of animals. On this figure, five animals carry in total 18 unique haplotypes (5 variants of haplotypes A, 4 variants of haplotype B, 6 variants of haplotype C, and 3 variants of haplotype D). Colors highlight the unique haplotype alleles of each haplotype block...... 38 Figure 2-3. Representation of the distribution of the intervals of minor alleles frequencies used for assessing imputation accuracy...... 39 Figure 2-4. Number of Jersey (JE) animals selected by the Highly Segregating Haplotype selection (HSH), the Inverse Weighted Selection (IWS), and the Genetic Diversity Index (GDI) methods from candidate pools with different proportion of JE animals...... 40 Figure 2-5. Decay of linkage disequilibrium (LD) over genomic distance in the real and the simulated sequences...... 41 Figure 2-6. Distribution of the different groups of animals on the first and second principal components. The variance explained by the components is given in brackets. Grey crosses represent all the candidates, the red triangles are the animals selected by the key ancestors (AMAT) method, the purple plusses are the animals selected by the Inverse Weighted Selection (IWS) method, the green plusses are the animals selected by the Highly Segregating Haplotypes selection (HSH) method, and the blue crosses are the animals selected with the Genetic Diversity Index (GDI) method...... 42 Figure 2-7. Density curves of the minor allele frequencies observed in the simulated population over the generations...... 43 Figure 2-8. Selected proportion of unique haplotypes from the total haplotype library found in the reference group created with different selection methods. The methods compared are the key ancestors (AMAT), the Highly Segregating Haplotype selection (HSH), the Inverse Weighted Selection (IWS), and the Genetic Diversity Index (GDI)...... 44

Figure 2-9. Accuracy of imputation for all SNP, only common SNP (minor allele frequency  0.05), or only rare SNP (minor allele frequency < 0.05), using reference population sizes from 50 to 1,200 individuals. The methods compared are the key ancestors (AMAT), the Highly

xi

Segregating Haplotype selection (HSH), the Inverse Weighted Selection (IWS), and the Genetic Diversity Index (GDI). All standard errors are below 0.013...... 45 Figure 2-10. Genotype concordance rates for all SNP using reference population sizes from 50 to 1,200 individuals. The methods compared are the key ancestors (AMAT), the Highly Segregating Haplotype selection (HSH), the Inverse Weighted Selection (IWS), and the Genetic Diversity Index (GDI). All standard errors are below 0.009...... 46 Figure 2-11. Accuracy of imputation increases with higher minor allele frequency (MAF) of the variants. Here the accuracies reached with 50 animals in a reference group of key ancestors (AMAT) are presented. MAF bins “x-y” stands for “x < MAF  y”. All standard errors are below 0.008...... 47 Figure 3-1. Steps to define high confidence animal (ANIMAL_CNVR) and population (POPULATION_CNVR) copy number variant regions (CNVR)...... 77 Figure 3-2. Copy number variants identified with a lower density marker panel but not with a higher density marker panel are considered false positive results. Vertical bars below the horizontal bar represent markers showing copy number loss. Vertical bars above the horizontal bar represent markers showing copy number gain. Vertical bars across the horizontal bar represents markers showing no copy number variation. In this example, three markers showing a copy number variation were needed to identify a variant...... 78 Figure 3-3. Average number of copy number variants (CNV) identified per animal based on the HD (777K), 150K, and 50K marker densities. Letters represents the zones of the figures. The false positive discovery rates were the ratio (c+f)/(all CNV) and (f+g)/(all CNV) for 150K and 50K, respectively...... 79 Figure 3-4. Number of high confidence copy number variant regions (CNVR) identified at the animal level when different minimum percentages were considered for the reciprocal overlap between the copy number variants identified on genotype array or whole-genome sequence information...... 80 Figure 3-5. Distribution of the high confidence copy number variant regions (CNVR) across the bovine genome found in different sets of CNVR. Only carrying an identified variant are pictured...... 81

xii

Figure 3-6. Read alignments of four samples on Bos taurus autosome 20 from 59,901,001 base pairs (bp) to 59,910,000 bp belongs to the unique high confidence copy number variant (CNV) region no. 48. The two top samples have no predicted CNV from whole-genome sequences in the region under the red bar, whereas the two bottom samples have a copy number loss predicted. 82 Figure 3-7. Read alignment of three samples on Bos taurus autosome 20 from 2,759,000 base pairs (bp) to 2,770,000 bp. Although a copy number gain is identified on whole-genome sequences is predicted for the top sample for the region under the red bar, no difference in alignment variation can be observed...... 83 Figure 4-1. Distribution of the copy number variant regions identified on 5,845 samples over the bovine autosomes (black stripes)...... 108 Figure 4-2. Copy number variant regions associated with hoof health traits: sole hemorrhage (SH), sole ulcer (SU), heel horn erosion (HHE), digital dermatitis (DD), white line lesion (WL), interdigital hyperplasia (IH), and interdigital dermatitis (ID)...... 109 Figure 4-3. Read depth around the copy number variant region no. 7 (CNVR7; 16:54,474,882- 54,500,285) for three sequenced samples. The red bar shows the breakpoints defined for CNVR7 with genotype array information. Deletions can be observed on the two top samples. Red-colored reads in the CNV carrier sequences represent reads split at the time of alignment. No copy number variant is observed on the bottom sample...... 110

xiii

LIST OF ABBREVIATIONS

Abbreviation Description 150K Marker panel of the GeneSeek® Genome Profiler Bovine 150K 50K Marker panel of the Illumina BovineSNP50 Beadchip ACGH Array comparative genomic hybridization AMAT Key ancestor selection method ANIMAL_CNV Copy number variant regions identified at the sample level BAF B allele frequency BLUP Best linear unbiased predictor bp BTA Bos taurus autosome CCGP Canadian Cattle Genome Project CHE Switzerland CNG Copy number gain CNL Copy number loss CNV Copy number variant CNVR Copy number variant region DD Digital dermatitis dEBV De-regressed estimated breeding values DGVa Database of the Genomic Variant archive EBV Estimated breeding value EDGP Efficient Dairy Genome Project FISH Fluorescence in situ hybridization GC GenCall GDI Genetic diversity index method GEBV Genomic estimated breeding value GEN Genotype array information GEN_CNV Copy number variants identified on genotype array data GGPHD or GHD Marker panel of the GeneSeek® Genome Profiler Bovine HD GMD Marker panel of the GeneSeek® Genome Profiler Bovine 50K GO GWAS Genome-wide association study

xiv

HD Marker panel of the Illumina BovineHB Beadchip HHE Heel horn erosion HOL Holstein HSH Highly segregating haplotype method ID Interdigital dermatitis IH Interdigital hyperplasia IWS Inverse weighted selection method JE Jersey Kb Kilobase KEGG Kyoto Encyclopedia of and Genomes LD Linkage disequilibrium LRR Log R ratio MAF Minor allele frequency MAS Marker-assisted selection Mb Megabases MIX Copy number variant regions with loss and gain genotypes MS Microsatellite NCBI National Center for Biotechnology Information PC Principal component PCA Principal component analysis PCR Polymerase chain reaction POPULATION_CNVR Copy number variant regions identified at the experiment level qPCR Quantitative polymerase chain reaction QTL Quantitative trait locus RD Read depth SD Standard deviation SH Sole hemorrhage SNP Single nucleotide polymorphism SU Sole ulcer TU Toe ulcer WGS Whole-genome sequence WGS_CNV Copy number variants identified on whole-genome sequencing data WL White line lesion

CHAPTER 1: General introduction

1.1 Cattle breeding and genetic markers

Animal breeding and in particular dairy cattle breeding is already a long-lasting and successful story. Animal selection for breeding relies on genetic variation among animals (Groeneveld et al., 2010), but first breeding attempts were solely based on phenotypes. The subsequent development of statistical methods around the infinitesimal model as the Best Linear Unbiased Predictor (BLUP; Henderson, 1984), in combination with large-scale use of artificial insemination, has allowed for rapid changes in dairy cattle production. The genome was then considered as a type of “black box” (Misztal, 2006) as genetic markers were still debuting at the time. The advent of DNA genetic markers in the 1980s and especially DNA microsatellites in the early 1990s shifted the focus of research towards quantitative trait loci (QTL) and their link with genetic markers (Weller, 2009). Genetic markers were to be the key to open the “black box” that is the genome (Habier et al., 2013). Markers were defined as heritable genome loci that could have multiple states (or alleles) and reflected differences in DNA among individuals (Sunnucks, 2000). The following paragraphs present genetic markers, the genetic variation they represent, and their application in animal breeding. 1.1.1 Microsatellites and marker-assisted selection Microsatellites (MS) were the first widely used markers (Vignal et al., 2009). MS are repeated sequences of two, three or four bases usually found in one to six repetitions that have all the characteristics of reliable markers (Jarne and Lagoda, 1996). Reliable markers have to 1) be testable with polymerase chain reaction (PCR), 2) be comparable from sample to sample, 3) be DNA based (in opposition to biological markers that were blood groups for example), 4) allow allele frequency calculation, and 5) preferably only represent a single DNA locus (Sunnucks, 2000). First, MS were mainly investigated in the , but it rapidly became clear that they are abundant in the genome of many species. Their distribution across the genome was shown to be not random (Li et al., 2002) and assessment of MS was not always possible among datasets prepared in different laboratories, or even among different batches (Vignal et al., 2009). Although 1

there were undisputable drawbacks, MS were the first genetic markers that allowed identification of genome-wide associations between genes and phenotypes, of genes roles, and of mechanisms of mutations (Li et al., 2002). Moreover, MS made possible the development of precise genetic maps for multiple livestock species (Kappes et al., 1997) and the beginning of animal traceability, pedigree control and population comparison (Vignal et al., 2009). This began the genomic era of population genetics. Quantitative geneticists proposed to include MS information in genetic evaluations to increase the generational genetic gain. The idea of marker-assisted selection (MAS) was already suggested by Tanksley in 1993. The idea of MAS was to include the effects of specific markers with known associations with each trait (Van Berloo and Stam, 1998). This introgression of genetic marker information into genetic evaluation permitted the selection of traits that were not possible to evaluate traditionally: traits with very low heritability, traits difficult to measure on a large number of animals or traits that were only measurable in late life or even after death of the individual (Dekkers, 2004). A shift in the research from traditional genetic evaluation models to MAS was rapid but soon flaws of the method were to be discovered. Although MS had a high information content through their multi-allelic property, their effects on any phenotype had to be estimated for each family individually (Meuwissen et al., 2001). Moreover, the link between phenotypes and genes of interest were difficult to identify and marker associations discovered in one population could not be validated in another population (Misztal, 2006). This was due to the lack in linkage disequilibrium (LD) between the microsatellites and the QTL impacting the phenotype. Assumptions that were probably too simplistic, in inheritance pattern for instance, were also made to bypass the high demands in computing resources of more complex models (Dekkers, 2004). Eventually, MAS was implemented in some routine genetic evaluations in dairy cattle in France (Boichard et al., 2002), and in Germany (Bennewitz et al., 2003) for instance. The enthusiasm among geneticists, however, let place to more cautious optimism as described by Misztal in 2006. 1.1.2 Single nucleotide polymorphisms and genomic selection Single nucleotide polymorphisms (SNP) were first genotyped at the end of the 1970’s (Wallace et al., 1979), yet their era only came in the 1990s when PCR technologies were developed, allowing high-throughput genotyping of samples for many SNP (Syvänen, 2001). Compared to MS, SNP have the disadvantage to be less informative due to their bi-allelic characteristic. However, they 2

are much more abundant, they are often in LD with QTL affecting economically important traits, and they can be genotyped in a highly standardized way, made them the genetic markers of choice. Rapidly, a great number of cattle were genotyped, which enabled the production of an equivalent or even greater amount of information than with MS for a fraction of the cost (Syvänen, 2001; Vignal et al., 2009). The high density of the SNP maps made possible the application of the idea presented in 2001 by Meuwissen et al. that is now called genomic selection. Instead of selecting a few markers that have a recognized effect on a particular trait, the effects of all available genetic markers can be estimated and used for the prediction of the genetic value of an individual. This model is in line with the infinitesimal model supported by the successful genetic evaluation model that was the BLUP model (Georges et al., 2019). The idea was later supported with a simulation study by Schaeffer (2006) who suggested that conventional genetic evaluation should be modified to include genomic information. The author presented that the addition of genomic information in a selection scheme would double the genetic change in comparison with the traditional selection scheme and reduce the costs of proving bulls drastically, which would lead to overall reduced costs of the breeding program. The increase in genetic gain per year would be due to a reduced generation interval and the possibility to select animals from a larger pool of candidates thus increasing the selection intensity and therefore the selection differential. This follows the principle of Lush’s breeder equation 푅 = ℎ2푆, (Lush, 1943) where 푅 is the response to selection, ℎ2 is the heritability of the trait of interest, and 푆 is the selection differential, which can be expressed in terms of the phenotypic standard deviation (푝): 푆 = 𝑖푝, where 𝑖 is the selection intensity as showed in Falconer and Mackay (1996). In addition, the increase in information through genotyping results in a higher accuracy of the estimated breeding values (EBV). Following, with the development of inexpensive and high-throughput SNP genotyping array for cattle (Matukumalli et al., 2009), genomic selection in dairy cattle was rapidly implemented worldwide. Currently, years after the first implementation of genomic selection in many countries, the predicted increase in genetic gain can be observed, making genomic selection the tool of choice for breeding programs of major livestock species (Georges et al., 2019). Although genomic selection has been extremely successful and the rate of genetic gain in cattle has increased considerably, researchers continued to attempt to increase the accuracy of genomic 3

EBV. While SNP are able to explain a good proportion of the genetic variance observed in a population (e.g. 24% of proportion of black in Holstein cattle explained only with three loci, Hayes et al., 2010), part of it could not be caught by SNP solely; this has colloquially been termed “missing heritability” (Maher, 2008). One way to tackle this issue is to increase the number of variant effects estimated for each individual. Not only will this increase the number of data points per individual, but it also reduces the need for LD between markers and QTL having an effect on the trait of interest. This solution was tested in a simulation by VanRaden et al. (2011). The authors mimicked a Holstein population with animals genotyped at different densities, they imputed the lower density genotypes to the higher density marker panels and observed the changes in EBV reliability in comparison with reliabilities reached with low-density genotypes, but only small increases in EBV reliability (<1.6%) could be reached. Reliability even decreased with the addition of whole-genome sequence (WGS) variants for some traits. Overall, the addition of WGS variants to genomic selection models did not reach much more than 5% of gain in accuracy (Georges et al., 2019). In their review, Georges et al. (2019) also explains that the WGS variants used have to be imputed for use in genomic selection schemes and that this imputation process is not accurate enough to allow for the later correct SNP effect estimation in genomic selection. Chapter 2 of this thesis presents a possible solution to tackle this lack of imputation accuracy, particularly for rare variants, through selecting a reference population for genotype imputation from array to WGS variants. 1.1.3 Copy number variants: the future genetic variants in cattle breeding? In the last decade of cattle genomics, the main research and development focused on SNP along with whole-genome sequencing which led to considerable achievements, however, the totality of genetic variation has yet to be harnessed. Structural variants and more particularly copy number variants (CNV), seem to be promising to catch more of population-wide genetic variation for genetic selection. The term “structural variants” defines segments of DNA longer than 50bp that show insertion, inversion, translocation, deletion or duplication (Alkan et al., 2011). The latter two types contribute to CNV, the most commonly found class of structural variants in the animal genome. CNV have been shown to carry between 20 and 25% additional information on phenotypic variation in comparison to SNP (Xu et al., 2014; Hay et al., 2018). The identification and validation of CNV on a large number of samples, whether relying on genotype array or on 4

WGS information, is still a challenge that is presented and discussed in Chapter 3. Although no conventional methods have yet been developed for their definition, associations of CNV were demonstrated with multiple traits of economic importance for the dairy cattle industry in previous studies and in Chapter 4. Inclusion of CNV information along with SNP in genomic selection scheme could help improve further the accuracy of genomic EBV and thus increase the rates of genetic gain made in the industry.

1.2 Thesis objectives

The overall goal of this thesis was to better understand and capture genetic variation relying on single nucleotide polymorphisms or copy number variants to improve the accuracy of selection of animals that will partake to the generation of future dairy cattle. The specific objectives of this thesis were:

a) To describe two novel haplotype-based methods to select animals for sequencing to create a reference population for accurate genotype imputation from array to sequence variants in a simulation study, in particularly for rare variants. b) To develop a copy number variant identification pipeline using whole-genome sequence and array genotypes for determination of high confidence copy number variant regions, and to describe and functionally annotate those regions. c) To determine a set of high confidence copy number variant regions from genotype array information for a large group of North American Holsteins and to identify possible associations between the newly-determined variants and traits of economic importance for the Canadian dairy industry.

1.3 References

Alkan, C., B.P. Coe, and E.E. Eichler. 2011. Genome structural variation discovery and genotyping.. Nat. Rev. Genet. 12:363–76. doi:10.1038/nrg2958. Bennewitz, J., N. Reinsch, J. Szyda, F. Reinhardt, C. Kuhn, M. Schwerin, G. Erhardt, C. Weimann, and E. Kalm. 2003. Marker assisted selection in German Holstein dairy cattle breeding: outline of the program and marker assisted breeding value estimation. B. Abstr. 54th Annu. Mtg. Eur. Assoc. Anim. Prod. 5.

5

Boichard, D., S. Fritz, M.N. Rossignol, M.Y. Boscher, A. Malafosse, and J.J. Colleau. 2002. Implementation of marker-assisted selection in French dairy cattle. Pages 18–23 in Proceedings of the 7th world congress on genetics applied to livestock production, Montpellier. Van Berloo, R., and P. Stam. 1998. Marker-assisted selection in autogamous RIL populations: A simulation study. Theor. Appl. Genet. 96:147–154. doi:10.1007/s001220050721. Dekkers, J.C.M. 2004. Commercial application of marker- and gene-assisted selection in livestock : Strategies and lessons. J. Anim. Sci. 82:E313-328. doi:82/13_suppl/E313 [pii]. Falconer, D.S., and T.F.. Mackay. 1996. Introduction to Quantitative Genetics. 4th edition. Georges, M., C. Charlier, and B. Hayes. 2019. Harnessing genomic information for livestock improvement. Nat. Rev. Genet. 20:135–156. doi:10.1038/s41576-018-0082-2. Groeneveld, L.F., J.A. Lenstra, H. Eding, M.A. Toro, B. Scherf, D. Pilling, R. Negrini, E.K. Finlay, H. Jianlin, E. Groeneveld, and S. Weigend. 2010. Genetic diversity in farm animals - A review. Anim. Genet. 41:6–31. doi:10.1111/j.1365-2052.2010.02038.x. Habier, D., R.L. Fernando, and D.J. Garrick. 2013. Genomic BLUP decoded: A look into the black box of genomic prediction. Genetics 194:597–607. doi:10.1534/genetics.113.152207. Hay, E.H.A., Y.T. Utsunomiya, L. Xu, Y. Zhou, H.H.R. Neves, R. Carvalheiro, D.M. Bickhart, L. Ma, J.F. Garcia, and G.E. Liu. 2018. Genomic predictions combining SNP markers and copy number variations in Nellore cattle. BMC Genomics 19:441. doi:10.1186/s12864-018-4787- 6. Hayes, B.J., J. Pryce, A.J. Chamberlain, P.J. Bowman, and M.E. Goddard. 2010. Genetic architecture of complex traits and accuracy of genomic Prediction: Coat colour, Milk-fat percentage, and type in holstein cattle as contrasting model traits. PLoS Genet. 6. doi:10.1371/journal.pgen.1001139. Henderson, C.R. 1984. Applications of Linear Models in Animal Breeding. University of Guelph. Jarne, P., and P.J.L. Lagoda. 1996. Microsatellites, from molecules to populations and back. Trends Ecol. Evol. 11:424–429. doi:10.1016/0169-5347(96)10049-5. Kappes, S.M., J.W. Keele, R.T. Stone, R.A. McGraw, T.S. Sonstegard, T.P.L. Smith, N.L. Lopez- Corrales, and C.W. Beattie. 1997. A second-generation linkage map of the bovine genome. Genome Res. 7:235–249. doi:10.1101/gr.7.3.235. Li, Y.-C., A.B. Korol, T. Fahima, A. Beiles, and E. Nevo. 2002. Microsatellites: genomic distribution, putative functions and mutational mechanisms: a review.. Mol. Ecol. 11:2453– 65. doi:10.1046/j.1365-294X.2002.01643.x. Lush, J.L. 1943. Animal breeding plans. Anim. Breed. plans. Maher, B. 2008. Personal genomes: The case of the missing heritability. Nature 456:18–21. doi:10.1038/456018a. Matukumalli, L.K., C.T. Lawley, R.D. Schnabel, J.F. Taylor, M.F. Allan, M.P. Heaton, J. O’Connell, S.S. Moore, T.P.L. Smith, T.S. Sonstegard, and C.P. Van Tassell. 2009.

6

Development and Characterization of a High Density SNP Genotyping Assay for Cattle. PLoS One 4:e5350. doi:10.1371/journal.pone.0005350. Meuwissen, T.H., B.J. Hayes, and M.E. Goddard. 2001. Prediction of total genetic value using genome-wide dense marker maps.. Genetics 157:1819–29. Misztal, I. 2006. Challenges of application of marker assisted selection – a review. Anim. Sci. Pap. reports 24:5–10. Schaeffer, L.R. 2006. Strategy for applying genome-wide selection in dairy cattle. J. Anim. Breed. Genet. 123:218–223. doi:10.1111/j.1439-0388.2006.00595.x. Sunnucks, P. 2000. Efficient genetic markers for population biology. Trends Ecol. Evol. 15:199– 203. doi:10.1016/S0169-5347(00)01825-5. Syvänen, A.C. 2001. Accessing genetic variation: Genotyping single nucleotide polymorphisms. Nat. Rev. Genet. 2:930–942. doi:10.1038/35103535. Tanksley, S.D. 1993. Mapping polygenes. Annu. Rev. Genet. 27:205–233. Vanraden, P.M., J.R. O’Connell, G.R. Wiggans, and K.A. Weigel. 2011. Genomic evaluations with many more genotypes. Genet. Sel. Evol. 43:10. doi:10.1186/1297-9686-43-10. Vignal, A., D. Milan, M. SanCristobal, and A. Eggen. 2009. A review on SNP and other types of molecular markers and their use in animal genetics. Genet. Sel. Evol. 34:275. doi:10.1186/1297-9686-34-3-275. Wallace, R.B., J. Shaffer, R.F. Murphy, J. Bonner, T. Hirose, and K. Itakura. 1979. Hybridization of synthetic oligodeoxyribonucleotides to φX 174 DNA: The effect of single base pair mismatch. Nucleic Acids Res. 6:3543–3558. doi:10.1093/nar/6.11.3543. Weller, J.I. 2009. Quantitative Trait Loci Analysis in Animals. CABI. Xu, L., J.B. Cole, D.M. Bickhart, Y. Hou, J. Song, P.M. VanRaden, T.S. Sonstegard, C.P. Van Tassell, and G.E. Liu. 2014. Genome wide CNV analysis reveals additional variants associated with milk production traits in Holsteins. BMC Genomics 15:683. doi:10.1186/1471-2164-15-683.

7

CHAPTER 2: Optimizing selection of the reference population for genotype imputation from array to sequence variants

Adrien M. Butty1, Mehdi Sargolzaei1,2, Filippo Miglior1,3, Paul Stothard4, Flavio S. Schenkel1, Birgit Gredler-Grandl5,6, and Christine F. Baes1,7* 1Centre for Genetic Improvement of Livestock, Department of Animal Biosciences, University of Guelph, ON, Canada 2Select Sires Inc., Plain City, OH, USA 3Canadian Dairy Network, Guelph, ON, Canada 4Department of Agricultural, Food and Nutritional Science, University of Alberta, Edmonton, AB, Canada 5Qualitas AG, Zug, Switzerland 6Animal Breeding and Genomics Centre, Wageningen UR Livestock Research, Wageningen, The Netherlands 7Institute of Genetics, Vetsuisse Faculty, University of Bern, Bern, Switzerland

Keywords: dairy cattle, sequencing, imputation, haplotypes, accuracy, selection

Published reference: Butty, A. M., Sargolzaei, M., Miglior, F., Stothard, P., Schenkel, F. S., Gredler-Grandl, B., & Baes, C. F. (2019). Optimizing Selection of the Reference Population for Genotype Imputation from Array to Sequence Variants. Frontiers in Genetics, 10, 510.

8

2.1 Abstract

Imputation of high-density genotypes to whole-genome sequences is a cost-effective method to increase the density of available markers within a population. Imputed genotypes have been successfully used for genomic selection and discovery of variants associated with traits of interest for the population. To allow for the use of imputed genotypes for genomic analyses, accuracy of imputation must be high. Accuracy of imputation is influenced by multiple factors, such as size and composition of the reference group, and the allele frequency of variants included. Understanding the use of imputed whole-genome sequences prior to the generation of the reference population is important, as accurate imputation might be more focused, for instance, on common or on rare variants. The aim of this study was to present and evaluate new methods to select animals for sequencing relying on a previously genotyped population. The Genetic Diversity Index method optimizes the number of unique haplotypes in the future reference population, while the Highly Segregating Haplotype selection method targets haplotype alleles found throughout the majority of the population of interest. First the whole-genome sequences of a dairy cattle population were simulated. The simulated sequences mimicked the linkage disequilibrium level and the variants’ frequency distribution observed in currently available Holstein sequences. Then, reference populations of different sizes, in which animals were selected using both novel methods proposed here as well as two other methods presented in previous studies, were created. Finally, accuracies of imputation obtained with different reference populations were compared against each other. The novel methods were found to have overall accuracies of imputation of more than 0.85. Accuracies of imputation of rare variants reached values above 0.50. In conclusion, if imputed sequences are to be used for discovery of novel associations between variants and traits of interest in the population, animals carrying novel information should be selected and, consequently, the Genetic Diversity Index method proposed here may be used. If sequences are to be used to impute the overall genotyped population, a reference population consisting of common haplotypes carriers selected using the proposed Highly Segregating Haplotype method is recommended.

9

2.2 Introduction

Globally, over 2.6 million cattle have been genotyped to date and the number of genotyped animals is expected to further grow in the coming years1. Dairy cattle genotyping is typically performed using genotype arrays of low or medium densities. Variants on genotype arrays are not selected randomly, rather they are evenly distributed over the whole genome and selected for their high level of segregation across multiple breeds (Boichard et al., 2012). Such a selection of variants has the advantage of enabling the application of the same array for multiple breeds, thus simplifying comparison between breeds. A disadvantage, however, is that they show an ascertainment bias, and variants with a low minor allele frequency (MAF) are underrepresented in genotype array data. The term “rare variants” henceforth refers to variants with a MAF lower than 0.05. Depending on the number of animals included and the alleles they carry, each genomic dataset contains its share of rare variants. The lack of knowledge about rare variants hinders the discovery of quantitative trait loci that, for example, appeared recently in a population through mutation (Fritz et al., 2013). Observed low MAF of variants can also be due to natural or artificial selection against an allele that has a negative impact on animal fitness or performance, thus indicating that a rare variant could be linked to a trait of interest or even a lethal malformation. An example of a rare variant associated with a disease can be found in a study by Drögemüller et al. (2009) in which a variant with a MAF of 0.03 is associated to arachnomelia (a calf malformation also called spider legs) in Brown Swiss cattle. Errors during genotyping or sequencing can also lead to wrongly-identified variants with low MAF (Zhang et al., 2016). Whole-genome sequencing can help provide better insight about rare variants (Daetwyler et al., 2014) but the costs of Next-Generation Sequencing technologies are still too high for mass sequencing of animals (Fraser et al., 2018). Imputation allows inference of whole-genome sequence (WGS) information for animals genotyped with various arrays based on complete WGS information of a reference population. The in silico creation of WGS from the readily available high number of genotypes enables a drastic increase in genotypic information for a large number

1 https://queries.uscdcb.com/Genotype/cur_ctry.html, last accessed 2018-09-23 10

of animals. High levels of imputation accuracy, however, are needed to allow use of the predicted genotypes for genomic evaluation or GWAS as demonstrated by Marchini and Howie (2010). The imputation from 50K to HD has been widely studied, and accurate HD genotypes are routinely imputed in dairy cattle genetic evaluation centers (e.g. Hozé et al., 2013; Ma et al., 2013; Pausch et al., 2013). Imputation to WGS variants, however, still needs to be improved. Accuracy of imputation is influenced by: a) the size of the reference population; b) the imputation method; c) the relatedness between the reference and the target population; d) the genotyping densities used, the difference in the number of variants and the linkage disequilibrium between SNP of both low- and high-density panels; e) the MAF of the variants considered; and f) the genetic diversity of the reference population. A thorough review of the factors influencing accuracy of imputation in livestock species was written by Calus et al. in 2014. The selection of animals to include in reference populations influences many of these parameters and is thus of high importance. Druet et al. (2014) stated that as the MAF of variants becomes lower, the method used to select animals to be included in the reference population becomes more important. The international dataset created under the scope of the 1,000 Bull Genomes Project (Daetwyler et al., 2014) is a possible reference set for imputation of cattle array genotypes to WGS. Up to Run 5 of this project, most animals sequenced were selected for their high genetic contribution to the population of their breed (Goddard and Hayes, 2009). These key ancestors carry most of the common variants for the populations they were selected from but lack information on rare variants. Pausch et al. (2017) showed that overall average imputation accuracy of array genotypes to the variant list from the 1,000 Bull Genomes Project was greater than 90%, but that the imputation accuracy of rare variants did not reach 70%. Low imputation accuracy of rare variants hinders the discovery of causal variants, not only for highly polygenic traits, but also for recent mutations that lead to malformations or loss of fitness (Li et al., 2011). Zhang et al. (2017) showed that the lack of accuracy in imputation of variants with low MAF also limits the success of genomic selection, particularly for health traits. Improved accuracy of WGS imputation will not only increase the probability of discovering causal variants for newly recognized diseases or malformations but will also enable more precise categorization and selection of variants for routine genomic selection programs for traditional and novel traits.

11

Various methods have been proposed to select animals for sequencing, the first of which relied solely on pedigree information and targeted influential ancestors of the population of interest. Boichard et al. (1997) developed a method to identify animals that have the greatest genetic contribution to a population based on its pedigree information. This method was implemented and widely distributed using the software PEDIG (Boichard, 2002). The key ancestors method, developed thereafter, relied on the numerator relationship matrix of the genotyped population of interest and also aimed to maximize the proportion of genes of the population captured by the selected animals (Goddard and Hayes, 2009). As the number of genotyped animals increased, selection methods have been adapted to consider genomic information. Methods were proposed which emphasize selection of animals carrying common haplotypes. Druet et al. (2014) presented a method maximizing the number of haplotypes selected. The key contributors method presented by Neuditschko et al. (2017) defines animals as informative based on the genomic relationship matrix of the population and aims to select individuals within possible subpopulations. Another selection method developed by Gonen et al. (2017) involved the algorithm AlphaSeqOpt that not only selects individuals that, together, represent the maximum haplotype diversity of a population, but also suggests different sequencing coverages in situations where the sequencing costs are predetermined. An optimized version of AlphaSeqOpt was proposed by Ros-Freixedes et al. (2017), similarly considering situations where the sequencing costs were predetermined, but additionally targeting haplotypes instead of individuals. This method was shown to improve the phasing accuracy of the reference population it formed, even if it still maximizes the proportion of the total haplotypes included. In contrast to the previously described methods, which target representative animals of a population, the Inverse Weighted Selection Method (Bickhart et al., 2016) was developed to prioritize individuals for their higher genetic diversity at the haplotype level, classifying animals based on the rarity of their haplotypes. The Inverse Selection Methods was shown to allow sequencing of the maximum number of haplotypes with the fewest number of animals. In this study, two new selection methods are presented: the optimized Genetic Diversity Index (GDI), which targets animals carrying more rare haplotype alleles than the average individuals and the Selection of Highly Segregating Haplotype (HSH), which aims at selecting animals whose haplotypes are highly segregating, but not selected yet. The GDI method aims to improve the accuracy of imputation of rare variants through selection of animals that, together,

12

carry the most different haplotypes, whereas the HSH should help to improve overall accuracy through selection of animals that carry the highest segregating haplotypes not previously sequenced. The objectives of this study were: 1) to describe two innovative methods to select animals for sequencing from a population, and 2) to compare these methods to two previously described selection methods: the key ancestors method and the Inverse Weighted Selection method. 2.3 Material and methods

Firstly, the WGS and high-density array genotypes of a dairy cattle population were simulated. Secondly, reference populations were created by selecting animals based on four different methods. Thirdly, a set of simulated target animals were imputed using the different reference populations. Finally, the imputation accuracies of the different methods were compared to each other considering sets of variants, defined depending on their MAF (Figure 2-1). 2.3.1 Simulation 2.3.1.1 Population structure Large scale whole-genome sequence data was simulated with the QMsim program (Sargolzaei and Schenkel, 2009) using three subsequent populations. First, a historical population was simulated to create linkage disequilibrium (LD) between the variants. Then, a second population, termed LongRangeLD, was simulated to increase long-range LD between variants. Finally, a third population (CurrentPop) was simulated for downstream analysis. CurrentPop simulated the latest years of dairy cattle breeding, in which few selected sires were used heavily in the breeding population. The historical population considered an equal number of individuals from both sexes, discrete generations, random mating at the gametic level, no selection, and no migration. A total of 800 males and 800 females were simulated for 4,000 generations to achieve mutation-drift equilibrium. Ten further historical generations were generated expanding the population to 10,100 animals. In the last generation of the historical population, there were 100 males and 10,000 females. The founders of LongRangeLD were all animals of the last generation in the historical population, after which each generation was composed of 8,000 animals. Through using different replacement rates, the 20 generations of this population overlapped. The total LongRangeLD population was 13

composed of 168,100 animals. Founders of CurrentPop were 100 males and 4,000 females from the last generation of LongRangeLD and also 4,000 more females from the second-last generation of LongRangeLD. The 10 generations of this population had 6,000 animals and overlapped too. Finally, the complete population for downstream analysis had 66,100 animals, of which 30,168 (127) were males. Migration was not simulated in any scenario. Further parameters used for both LongRangeLD and CurrentPop are presented in Table 2-1. The complete simulation process was replicated 10 times and the results reported are averages of the replicates. 2.3.1.2 Genome Gene-dropping simulation was completed using QMSim (Sargolzaei and Schenkel, 2009). The same genome was simulated for all populations. Cattle autosomal chromosomes were simulated with a length that followed the results presented by Bohmanova et al. (2010) and summed up to a total of 2,496 cM. Bi-allelic markers and quantitative trait loci (QTL) were randomly distributed over all chromosomes with equal MAF in the first historical generation. The QTL effects were sampled from a gamma distribution with a shape parameter of 0.4, following the results obtained by Hayes and Goddard in 2001. The number of crossovers per was sampled from a Poisson distribution with mean equal to the chromosome length in centimorgans. The probability of a second crossover within 25cM of a first recombination event was, therefore, lower depending on the proximity of crossovers. The mutation rate of the markers and the QTL was assumed to be 10-4. For each replicate, 8,622,767 markers and 4,000 QTL were generated. 2.3.1.3 Introduction of genotyping error and selection of variant subsets Selection of markers in the simulated data was performed to ensure that the MAF distribution followed that observed in the real data, described below. From all simulated variants, a first subset representing WGS was selected that contained all QTL. Then two subsets of the WGS were selected, which simulated high-density (HD) and medium density (50K) array genotype variant panels. In contrast to the WGS set, no QTL were allowed in the HD and 50K variant panels. Minor allele frequencies considered at this stage were computed considering a random sample of 30,000 animals from CurrentPop. Those animals represented 45% of the total population. Real data comprised 425 Holstein (HOL) and 25 Red-Holstein animals from Run 5 of the 1,000 Bull Genomes Project (Daetwyler et al., 2014), 2,946 HOL animals (males and females) from the

14

Canadian Dairy Network database (as of August 2017), and 36,157 HOL bulls with a North American identification tag born after 2010 for the WGS, HD, and 50K panels, respectively. The real WGS set was filtered for a minor allele count of 1 and was composed of 31,787,016 bi-allelic variants. Variants with a MAF lower than 0.1% were filtered out from the HD dataset. The real HD genotypes contained information for 587,817 bi-allelic variants. The same filter for variants with a MAF lower than 0.1% was applied to the 50K panel leading to 44,347 bi-allelic variants. The number of selected variants per chromosome was proportional to the number of variants found in the real data. Variants were distributed by MAF in 50 bins. The sampling of the variants occurred randomly within the bin-by-chromosome groups with the function sample() in R, version 3.4.3 (R Core Team, 2017). The final simulated data was composed of 3,235,171 (155,117), 571,661 (6) and 44,288 (0) variants for WGS, HD, and 50K, respectively. Genotyping error was introduced in the WGS based on error rates observed by Baes et al. (2014) using the HaplotypeCaller function of the Genome Analysis Toolkit with a multi-sample approach (McKenna et al., 2010; Table 1-2). Missing data was also added at this stage. Inclusion of genotyping errors and missing data in the genotypes was done using snp1101 (Sargolzaei, 2014). 2.3.1.4 Creation of the reference populations and the validation set Groups of 50, 100, 200, 400, 800, and 1,200 animals were created from one pool of candidates using four selection methods. This pool of candidates was composed by all males of the CurrentPop and contained 30,027 (108) bulls. As the 50K chip represents the preferred SNP chip for bull genotyping, animals were selected on their 50K haplotypes. The groups of selected bulls were later used as the reference populations for imputation from HD to WGS genotype density. Although imputation was done from HD to WGS genotype densities, selection of animals, when performed based on genotypes, was run on the 50K array panel to mimic again real situations, where the majority of the individuals would have only 50K genotype information. Haplotypes were defined as non-overlapping segments of 20 contiguous SNP of the 50K SNP panel throughout the study and had an average length of 1,082,875 bp (264,426 bp). The same candidate pool was available for each method, so the same animal could be selected by multiple methods. The selection methods were: 1) the key ancestors method, which used the additive genetic relationship matrix; 2) a combination of the newly developed Genetic Diversity Index and the

15

simulated annealing algorithm (Kirkpatrick et al., 1983; Černý, 1985); 3) the Inverse Weighted Selection method (Bickhart et al., 2016); and 4) a second novel method aiming to select highly segregating haplotypes in the genotyped population that are not carried by any animal of the population of interest already sequenced. These methods are described in more details next. The 5,000 youngest animals (males and females) from CurrentPop that were not selected during the creation of the reference groups composed the target population of the imputation. 2.3.2 Selection methods Selection of key ancestors was the method of choice to select the first animals sequenced in populations, as a representative genotyped group of animals from the population of interest was not needed (Daetwyler et al., 2014). This key ancestor method (AMAT) was chosen for comparison because of its frequent use and because it had indirectly a similar aim than the novel Selection of Highly Segregating Haplotype (HSH) method proposed here, i.e. selection of carriers of commonly found variants. Shortly after the first draft of the optimized Genetic Diversity Index (GDI) proposed here was designed, the paper of Bickhart et al. (2016) was published that presented the Inverse Selection Method (IWS). As GDI, this method aimed at selecting animals that are genetically more diverse in the pool of candidates. IWS seemed thus to be fairly comparable to GDI and was chosen to be included in this study. Other methods of animal selection for sequencing considered other objectives such as sequencing some animals at different coverages or combination of genotyping and sequencing, given a limited budget. In contrast, this study only considers situations where a given number of animals to sequence is given. Focusing on methods with similar aims than the novel methods proposed here seemed a way to allow for an in-depth analysis of them, as for example, differentiating accuracies of imputation of variants with different MAF. 2.3.2.1 Selection of key ancestors The AMAT method aimed to identify animals explaining most of the genetic variation of a −1 population following the equation 푝푛 = 푨푛 ∗ 푐푛 where 푝푛 was a vector of the proportion of gene −1 pool captured by the n selected animals, 푨푛 was the inverse of the numerator relationship matrix of the n selected animals, and 푐푛 was a vector of the average relationships of the n selected animals with the entire population (Goddard and Hayes, 2009).

16

2.3.2.2 Inverse selection method The IWS method developed by Bickhart et al. (2016) prioritized sequencing of rare haplotypes 푁퐻퐴푃 2 following the equation 퐼푛푑푒푥 = ∑푖=1 푓푖 − 2푓푖 + 1, where 푁퐻퐴푃 was the number of haplotypes and 푓푖 was the frequency of haplotype 𝑖 in the population. This inverted parabolic function gave a high index value to individuals carrying haplotype alleles with low frequencies, as higher frequencies led to higher penalization (through the term −2푓푖) The computation of this index was iterative: 1) select the animal with the highest index; 2) recalculate the index of the remaining candidates without considering the haplotypes present in the genotypes of selected animals; and 3) pick out the next animal with the best new index. This method was used as it is implemented in the software program snp1101 (Sargolzaei, 2014). 2.3.2.3 Optimized genetic diversity index Relying on a probabilistic optimization algorithm –simulated annealing (Kirkpatrick et al., 1983; Černý, 1985) – the proposed GDI method optimized the count of unique haplotypes of a group of animals composed of all previously sequenced animals and a defined number of sequencing candidates. The simulated annealing algorithm was developed to find the global optimum of a dataset with multiple local optima. The GDI of the whole group of animals was optimized with the simulated annealing algorithm permuting one candidate at a time and recalculating the index. The GDI was computed by summing the count of unique haplotype alleles present within a group 푁퐻퐴푃 of animals following the equation 퐼푛푑푒푥 = ∑푖=1 푢푛𝑖푞푢푒 (퐻퐴푃푖), where 푁퐻퐴푃 was the number of haplotype blocks and 퐻퐴푃푖 were the haplotype variants in block 𝑖. Figure 2-2 gives an example of the index calculation based on five animals and four haplotypes. This method was also used as implemented in the program snp1101 (Sargolzaei, 2014). 2.3.2.4 Selection of highly segregating haplotypes To identify animals with the highest contribution to the population, the novel HSH method based on haplotype diversity was developed. The method had the following steps: 1) a haplotype library was created for all selection candidates using non-imputed genotypes. Haplotypes that appeared in less than 10 animals were discarded to reduce errors in the computation of their frequencies due to phasing error or haplotypes from other breeds; 2) contribution of each animal to the haplotype library based on the haplotypes’ frequency was calculated following the equation, 퐼푛푑푒푥 =

17

푁퐻퐴푃 ∑푖=1 푓푖, where 푁퐻퐴푃 was the number of haplotypes and 푓푖 was the frequency of haplotype 𝑖 in the population. The animal with the highest Index value was then selected and; 3) frequencies of all haplotypes present in the selected animal were multiplied by a factor of 0.75 to penalize these already captured haplotypes. The factor for penalization is decided based on haplotypes frequency distribution in Holstein. Then the second most influential animal was selected based on highest contribution from the penalized haplotypes frequencies of all haplotypes it carries were multiplied again by the same factor of 0.75. After selecting an influential animal, total haplotype coverage (i.e. prevalence) was calculated for the new group of selected candidates. The process was repeated until the desired number of animals was selected, increasing the number of unique haplotypes selected with each animal, but avoiding selection of possible outliers (which carry many low- frequency haplotypes from another breed), for example from crossbred individuals as long as any non-outlier animals were still in the selection pool. Because the most frequent haplotypes were penalized first, the next animal chosen tended to carry haplotypes that are less frequent in the library or population. This method was also used as it is implemented in the software program snp1101 (Sargolzaei, 2014). The HSH method could accommodate any situation where some animals were previously sequenced, as the choice of the next influential animal is a function of already selected animals. Therefore, although the selected candidates may be different depending on which initial list of sequenced animals is used, the overall contribution to the population haplotypes should change only minimally. 2.3.3 Measures of diversity in the reference population The level of genetic diversity was compared between reference populations. Next to the number of segregating variants as presented by Pluzhnikov and Donnelly (1996), the proportion of the total number of unique haplotypes alleles found in the candidate groups that were also found in the individuals selected for sequencing were used to compare the level of genetic diversity of the reference population of each scenario. The proportion of the rare haplotypes found in differently selected individuals was computed using the R package GHap (Utsunomiya et al., 2016). First, all haplotypes found within the candidates were identified. Second, the frequencies of the haplotypes within the candidates were computed. Finally, the proportions of haplotypes found in different groups were calculated. Following the construction of haplotypes when the animals were selected, haplotypes were built here again with 20-SNP windows and without overlap. 18

2.3.4 Principal component analysis Principal components analysis (PCA) is a statistical method that, when applied to genotypic data, allows detection of its structure (Ely et al., 2010). PCA was run on 50K genotypes of the candidate pool to determine the structure of the simulated population. This analysis was conducted using the implementation presented by Abraham and Inouye in 2014 and available in snp1101 (Sargolzaei, 2014) with the following parameters: a maximum of 50 iterations were allowed, 40 principal components were computed and only variants with a MAF equal or higher than 0.01 were considered. 2.3.5 Imputation Following results presented by Whalen et al. (2018), the combination of the phasing software Eagle version 2.3.5 (Loh et al., 2016) and the imputation software Minimac3 (Das et al., 2016) – two programs developed for analysis of human data for which little to no family information is available – was used without pedigree information on the differently created reference populations to impute one set of target animals. Both software programs were used in their default mode. A linear genetic map of 1cM per Mb was used to approximate the average recombination rate at phasing. From this step onwards, all genotypes were reduced to the 10 first simulated chromosomes to reduce computation time and memory load. Imputed genotype calls only were used, not the genotype probabilities. 2.3.6 Measures of imputation accuracy Imputation accuracy was computed on multiple sets of variants for each scenario. Variants were distributed over multiple bins, depending on the MAF observed in the true genotypes of the target population of each simulation replicate. Two non-overlapping subsets containing common (MAF > 0.05) or rare (MAF =< 0.05) variants were created, as well as a set of adjacent SNP bins. Variants were distributed following their MAF in the bins with boundaries at 0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45 and 0.50. The bins were created to allow for the higher bound MAF to be included but not the lower bound. The composition of all bins is represented in Figure 2-3. Imputation was evaluated at a per SNP basis by the squared correlation between the true and imputed genotypes and average. This accuracy measure, called allelic R2 by Browning and

19

Browning (2009), is advantageous, as it is independent of the MAF of the variants imputed. Correlations between true and imputed genotypes were checked to ensure that negative correlations were not present so that no variants were filtered out at this stage. Accuracies of variants that were not segregating anymore after imputation were set to zero. Genotype concordance rates between all variants of the true and imputed genotypes were also computed on all variants. This measure represents the proportion of genotypes that are correctly imputed and allowed for evaluation of the imputation on a per animal basis. 2.3.7 Performance of the haplotype-based selection methods with crossbred animals in the candidate pool Selection of animals for sequencing is often run in one population at a time. Depending on the quality of the data recording, a proportion of the animals declared to be purely from one population may be crossbred or from another population. It is important that the method of selection avoids selecting individuals that are not part of the population of interest. BovineSNP50 genotypes of 16,420 Holstein and 2,920 Jersey (JE) males born after 2011 were retrieved from the Canadian Dairy Network database to create pools of 5,840 selection candidates with different degrees of admixture as presented on the horizontal axis of Figure 2-4. From the complete dataset, animals were randomly selected to enter each pool. The IWS, GDI and HSH methods were then used to select 100 animals out of each pool and the number of JE animals that were picked were counted. 2.3.8 Statistical tests of average differences between scenarios After testing for the normality of the replicates within methods-by-reference size scenarios with Shapiro-Wilk tests, Kruskal-Wallis Rank Sum tests and Wilcoxon Rank Sum statistical tests were performed for each MAF category to determine significant differences in accuracies among all methods or pairwise, respectively. The Bonferroni correction was used to adjust for multiple comparisons for an experimental-wise significance level of 0.05. 2.4 Results

2.4.1 LD structure, MAF distribution and structure of the simulated population A rapid decrease in LD over increasing genomic distance was observed in both real and simulated genomic data (Figure 2-5). The high level of LD at distances shorter than 100kb in the real Holstein population already described by Sargolzaei et al. (2008) is mimicked in the simulation. Rare 20

variants comprised 52.43% (2.2%) of the WGS variants over the replicates. Principal component analysis showed a compactly distributed population on the two first components, which explained 6.11% of the total genomic variance (Figure 2-6). Spearman’s rank correlation between the first principal component and the generation of the animals was 0.87 (data not shown). Density curves of the MAF over the generations of the simulated population showed that an increasing number of variants became rare (Figure 2-7). 2.4.2 Haplotype coverage in the reference population The number of segregating variants and the proportion of unique haplotype alleles found in each reference population had a correlation of 0.68 (P < 0.0001). Increasing the number of animals in the reference population led to an increased proportion of unique haplotypes covered (Figure 2-8). Overall, haplotypes coverage ranged from 8.6% of the total haplotypes from the scenario with 50 animals selected on the basis of HSH, to 35.5% in the scenario including 1,200 animals selected through GDI. The reference groups created following the AMAT and HSH methods captured a lower proportion of the total haplotypes than reference populations created following the IWS and GDI methods. The proportion of haplotypes with a frequency equal or below 5% that were selected in each reference group followed the proportion of total haplotype selected. 2.4.3 Overlap in selection The same pool of candidates was made available for selection for each method and reference size so that the same animals could be selected by multiple methods. The proportions of animals present in two groups for each reference size are shown in Table 2-3. Overlaps were higher between AMAT and HSH and between IWS and GDI. Small reference population sizes led to a higher proportion of animals found in multiple reference populations, with a maximum of 26% of animals found in common between the reference groups of AMAT and HSH that contained 50 animals in total. GDI did not have any overlap with AMAT or HSH for groups containing 50 and 100 animals. The overlap in the selected references of 100 individuals can be observed in Figure 2-6 where plusses, representing the animals selected with IWS, and crosses, representing the animals selected with GDI, are superposed.

21

2.4.4 Selection with possible crossbred animals With a pool composed of animals from two populations in a 50:50 ratio, no genotype-based method could avoid selecting at least half of them from the JE population (Figure 2-4). Differences were observed between methods in the more realistic scenarios with a proportion of 5% or less non- target animals. HSH did not select any JE animals until they comprised 5% of the candidate pool. In contrast, GDI already selected 58 JE animals when JE comprised 1% of the candidate pool. The 58 JE selected in this scenario were 45% of all JE animals present in the pool. In the scenarios with a candidate pool composed of 5% or less JE animals, IWS consistently selected only 5% of the JE animals. 2.4.5 Accuracy of imputation Accuracies of imputation were observed on all variants and on two non-overlapping subsets: the rare variants with a MAF below 0.05 and the common variants with a MAF equal or above 0.05. Results are presented about these sets in the following order: first, all variants, then the rare variants and finally the common variants as the later showed re-ranking in comparison with the two other groups. Considering all variants, accuracy of imputation reached values between 0.55 and 0.85, depending on the method used to create the reference groups and their sizes. Increasing the number of animals in the reference population led to corresponding increases in accuracies. Table 2-4 shows the accuracies of imputation reached in scenarios with 50, 200 and 1,200 reference animals selected by the four methods and across all adjacent MAF bins. In general, AMAT and HSH reached lower accuracies than IWS and GDI. The differences in accuracies, however, were smaller when the reference population size increased (Figure 2-9). In the scenario in which only 50 animals composed the reference population, IWS and GDI had the highest accuracies and were not significantly different (P > 0.05). AMAT had a significantly lower accuracy and the accuracy of HSH was even lower than that of AMAT (P < 0.0001) (Table 2-4). By increasing the size of the reference population to 100, 200, or 400 animals, differences in accuracy between AMAT and HSH were small, so that only two groups of methods, AMAT/HSH and IWS/GDI, could be differentiated. With reference groups of 800 and 1,200 individuals, only GDI and AMAT were significantly different (P < 0.0001), where GDI had the highest accuracy (0.944). Genotype concordance rates reached values above 0.96 in all cases (Figure 2-10). Significant differences 22

between methods were only observed with reference populations comprising 50, 100, or 200 animals. Concordance rates were higher when animals were selected with HSH or AMAT than with IWS or GDI for reference sizes of 50 or 100 animals (P < 0.0001). Only the concordance rate of GDI for a reference population comprised of 200 animals was significantly lower than any other (P < 0.0001). When the accuracies of imputation were estimated on rare variants only, accuracies reached values between 0.33 and 0.76, but the rank of the methods from best to worst was consistent with results based on all variants (IWS/GDI > AMAT/HSH), and significant differences were also observed at any reference populations size (P < 0.0001). With reference size of 1,200 individuals, differences were only found between AMAT vs. HSH and AMAT vs. GDI, where AMAT had lower accuracy in both contrasts. In contrast, when only common variants were considered, the ranking was reversed: AMAT and HSH produced significantly higher accuracies than IWS and GDI (P < 0.0001). Accuracies took values as high as 0.99 and were never below 0.84. With a reference population of 50 animals, HSH reached a greater accuracy than AMAT and both were better than IWS and GDI. With reference sizes of 100 and 200, significant differences were again observed between the groups of methods AMAT/HSH and IWS/GDI (P < 0.0001). With 400 animals as reference, the accuracy reached by GDI was significantly lower than the other methods. Scenarios where 800 and 1,200 animals composed the reference population did not show difference in accuracy value before the fourth decimal. Although no change in the values was observed for these scenarios (Table 2-4), variances between replicates were very small (standard deviation < 0.004), therefore testing the methods against each other still led to significant results after correction for multiple testing. Distribution of the variants into 14 adjacent bins allowed a more precise evaluation of the effect of the reference composition on the imputation accuracy. With no exception, increased MAF led to increased accuracy values. For example, accuracies increased from 0.21 to 0.94 when using a reference group of 50 individuals selected with AMAT (Figure 2-11). Only pairs of contiguous MAF bins were analyzed and no significant differences within reference size-by-method scenario were found in imputation accuracy of variants with a MAF greater than 0.3 (P > 0.05).

23

2.5 Discussion

In the first step of this study, the whole-genome sequences of a dairy cattle population were simulated. They were compared to currently available real Holstein sequence data to ensure they mimicked observed levels of LD and MAF distribution. In the second step, reference populations of different size were created with animals selected by both proposed novel methods as well as two other methods presented in previous studies. The selection methods were assessed with respect to their propensity to select animals that might not be of the population of interest, the genetic diversity of the groups of animals picked, and the distribution of those over generations. Finally, accuracies of imputation were compared for imputation runs with the different reference populations. The differentiation of imputation accuracy of variants with specific MAF allowed comparison between the strengths and weaknesses of each method of selection. Different software programs were developed to simulate genomic information such as AlphaSim (Faux et al., 2016), ms2gs (Pérez-Enciso and Legarra, 2016), and QMsim (Sargolzaei and Schenkel, 2009). With its highly flexible genome and population configuration system, QMsim allowed for simulation of a great number of whole-genome sequences that had a LD structure properly following the parameters of the real data available. With the aim of simulating a Holstein population, only sequences of Holstein animals from the 1,000 Bull Genomes Project Run 5 were retrieved. These animals were mostly sequenced because they had a great genetic contribution to their population (Daetwyler et al., 2014). Although they are considered representative, these animals became influential as they were used heavily for breeding in their population and probably had a high genetic merit. They may, in fact, carry alleles at frequencies different from those in the overall population. This difference between the influential animals and the complete population limits the possible true closeness of the simulation with the whole real Holstein population. The LD level of the simulated sequences followed the real observed LD decay (Figure 2-5). Similarly, the distribution of the variants used in this work in MAF bins followed the distribution observed in real datasets. Once the sequence was simulated, multiple reference populations were created by selecting animals using methods of selection that can be divided into two groups: AMAT and HSH, which mainly target animals that are carriers of commonly found haplotypes, whereas IWS and GDI are

24

methods aiming to maximize the selection of animals carrying more haplotype alleles. Moreover, although AMAT keeps on searching for commonly found haplotypes, the penalization of those implemented in HSH leads to a shift from the search of commonly found haplotypes to rare ones. Through this shift, not only selection of common, but also of rare haplotypes is optimized. This shift, however, is highly dependent on the size of the candidate pool and the number of animals to be selected, as the increasing ratio of selected animals over the candidate pool facilitates the capture of more different haplotypes. A disadvantage of the haplotype-based selection method is that candidates must all be genotyped. In this sense, selection of animals for genotyping or sequencing in populations in which only a small proportion of individuals are genotyped should be done with AMAT, as long as a correct and complete pedigree is available. Candidate pools for animal selection are often composed of individuals belonging to more than one population due to errors at the time of data recording, and thus crossbred animals could be erroneously selected. Testing methods for their tendency to pick crossbred animals revealed that methods targeting rare variants selected more animals that were not from the population of interest, which was expected. HSH was the only method in which no animal of the JE population was selected before their proportion in the candidate pool reached 5%, which can be considered a usual proportion of crossbred animals wrongly declared as purebreds (Figure 2-4). If GDI or IWS are used on real datasets, population structure analysis and analysis of the relationships between the candidates is essential to ensure that crossbred animals are removed prior to selection. Following the control of the non-target animals selected with each method, a principal component analysis was used. This allowed for comparison of the distribution and overlap of the selected animals over the complete candidate pool. Methods targeting rare haplotypes picked the same animals more often (Table 2-3). The concentration of points representing the animals selected by IWS and GDI or the superposed dark and light brown points on Figure 2-6 follows the same idea. The animals selected for their higher genetic diversity were mostly of generation 1 and 2 out of the 10 simulated generations. Selection applied without allowing for migration in the simulated population led to a reduction of the MAF of the variants under selection pressure (Figure 2-7). Accordingly, the number of combinations of SNP alleles at the haplotype level was reduced, and less unique haplotypes alleles could be found in animals in generation three or more. Fewer unique haplotype alleles also led to higher haplotype frequencies of the remaining ones. Finally, carrying 25

less unique haplotype and haplotypes alleles of higher frequencies, individuals of generation three or more were less likely to be selected by GDI and IWS. It is of interest to assess the genetic diversity within and between the created reference populations. The proportion of selected haplotypes alleles increased with the number of animals selected, which was expected, as more animals can collectively carry additional different haplotypes. Similarly, when looking at the overlap of picked haplotypes alleles between methods, the methods presented more overlap if they targeted the common (AMAT, HSH) or the rare variants (IWS, GDI). Notably, when reference populations were smaller, animals selected with AMAT carried a greater number of different haplotypes than HSH. This can be explained by the following arguments. The HSH method makes sure that all commonly found haplotypes are selected before animals carrying rare variants get targeted, while AMAT relies solely on pedigree and thus has no possibility to consider the Mendelian sampling happening over generations. This limitation of AMAT, when compared to haplotype-based methods, was observed in our study when 1,200 animals comprised the reference group and only rare variants were considered. In this case, AMAT had a significantly lower accuracy than both HSH and GDI (Table 2-4). Moreover, it is likely that a real pedigree would contain errors that would not allow for a better haplotype coverage using AMAT than HSH as missing and incorrect information would impeach correct computation of the kinship among animals and thus the probable proportion of haplotypes they share. The pedigree-based method AMAT also showed a limitation once the number of animals increased, as redundancy of the added haplotype in the selected group was not directly avoided and effective Mendelian sampling could not be evaluated, which was in contrast to the results obtained for HSH. GDI consistently obtained greater haplotype coverage in the selected group of animals (Figure 2-8). This shows that the targeted optimization at the group level of the number of rare haplotypes was also achieved. Therefore, GDI and IWS seem to be the methods of choice when the objective is to select animals for their propensity to carry novel, rare or deleterious variants. The influence of the selection of genetically more diverse animals on the accuracy of selection, however, must be carefully assessed. Overall, accuracies of imputation from HD to WGS were similar to previous results observed in real dairy cattle datasets (e.g. Pausch et al., 2017), although rare variants were kept throughout the whole analysis in the current study. Differences between scenarios were significant (P < 0.0001), 26

however, the accuracies were mostly similar between methods. This was probably due to very low variance between the replicates, as the simulation algorithm is highly stable. All methods of selection avoided redundancy of the haplotypes selected, thus only minor differences between methods were observed after enough animals were selected. The greatest differences in accuracy of imputation between the methods were found when the reference population was small. Moreover, when observing genotype concordance rates, no differences were found when the reference populations comprised more than 200 animals. In contrary to the allelic r2 and as demonstrated in the review by Calus et al., (2014), the genotype concordance is dependent on the MAF of the variants considered and increases artificially with lower MAF. Differences in the distribution of the MAF of the rare variants between the reference population led to the observed re-ranking. Considering that animals selected with IWS and GDI were mainly from generation 1 and 2 of the simulated population (Figure 2-6), and that the MAF distribution of the rare variants shifted towards zero generation after generation (Figure 2-7), the MAF distribution within the rare variants category might be different between reference populations selected for high coverage of rare or common haplotypes. More different haplotype alleles were present in the reference populations selected with GDI and IWS (Figure 2-8), whereas animals selected with AMAT and HSH carried, as intended, more common variants. Animals selected with AMAT and HSH, however, still carried some rare variants but those had more often a MAF below 0.01. Figure 2-11 shows a distinctly bigger change in accuracy between monomorphic or rare variants with a MAF lower than 0.01 and rare variants with MAF between 0.01 and 0.05. It is this difference in the distribution of the MAF of the rare variants that explain the re-ranking of the methods between the genotype concordance and the allelic r2 values. Targeting rare haplotypes at selection (GDI and IWS) led to the creation of a reference population with more rare variants, but most of the added rare variants had a MAF between 0.01 and 0.05, whereas targeting common haplotypes led to the creation of a reference population carrying mainly common variants, but also some rare variants that mainly had a MAF below 0.01. Those variants with a MAF below 0.01 artificially increased the genotype concordance so that a re-ranking was observed. Considering the re-ranking observed between method group, i.e., HSH/AMAT and IWS/GDI when looking at either rare or common variants, the method to select animals should be chosen using one of two principles: if the future imputed genotypes will be used as full genotypes and the

27

imputation needs to be specially accurate for variants that will explain most of the genetic variation of a trait, animals should be selected using AMAT or HSH. In contrast, if future analysis will focus on the discovery of novel functional rare variants animals should be selected using IWS or GDI. Genotype concordance is the measure of imputation accuracy of choice when common variants that explain most of the genetic variance of most traits of interest for the dairy industry, are of interest for future analyses. Our results showed that genotype concordances with small reference populations were higher when the individuals were selected with AMAT or HSH. The first line of Table 2-4, where only the segregating variants with a MAF below 1% were considered, is a good example of the differences in accuracy of imputation for rare variants, variants that could have a novel deleterious effect. In this example, when the reference population only contained 50 animals, the difference in accuracy of imputation reached 0.18 points between the best (IWS) and the worst (HSH) methods. The accuracy of imputation increased with the MAF of the variants, but this increase stopped once segregation reached a level of 30% (Figure 2-11). 2.6 Conclusions

Selection of animals for sequencing is an important task, as it greatly impacts the information gained about a population of interest, especially in populations with limited effective population size. Different selection methods are available that either rely solely on pedigree or that utilize information on previously genotyped individuals. In the first case, selecting key ancestors is highly recommended. Otherwise, the best method depends on the use of the future set of sequences. If the newly selected animals will be the first sequenced animals in their population and should allow for the overall imputation of the rest of the population, it is better to select animals carrying common haplotypes using the new HSH method instead of any of the other methods described in this study. If the resulting sequences of the selection of animals in a population will be used for discovery of new variants or should allow annotation of possible deleterious ones, animals carrying novel information should be selected and, consequently, the GDI method proposed here may be used. 2.7 References

Abraham, G., and M. Inouye. 2014. Fast principal component analysis of large-scale genome-wide data. PLoS One 9:e93766. doi:10.1371/journal.pone.0093766. Baes, C.F., M.A. Dolezal, J.E. Koltes, B. Bapst, E. Fritz-Waters, S. Jansen, C. Flury, H. Signer- 28

Hasler, C. Stricker, R. Fernando, R. Fries, J. Moll, D.J. Garrick, J.M. Reecy, and B. Gredler. 2014. Evaluation of variant identification methods for whole genome sequencing data in dairy cattle.. BMC Genomics 15:948. doi:10.1186/1471-2164-15-948. Bickhart, D.M., J.L. Hutchison, D.J. Null, P.M. VanRaden, and J.B. Cole. 2016. Reducing animal sequencing redundancy by preferentially selecting animals with low-frequency haplotypes. J. Dairy Sci. 99:5526–5534. doi:10.3168/jds.2015-10347. Bohmanova, J., M. Sargolzaei, and F.S. Schenkel. 2010. Characteristics of linkage disequilibrium in North American Holsteins. BMC Genomics 11:421. doi:10.1186/1471-2164-11-421. Boichard, D. 2002. Pedig : a Fortran Package for Pedigree Analysis Suited for Large Populations. 7th World Congr. Genet. Appl. to Livest. Prod. 28–29. Boichard, D., H. Chung, R. Dassonneville, X. David, A. Eggen, S. Fritz, K.J. Gietzen, B.J. Hayes, C.T. Lawley, T.S. Sonstegard, C.P. van Tassell, P.M. VanRaden, K.A. Viaud-Martinez, and G.R. Wiggans. 2012. Design of a bovine low-density snp array optimized for imputation. PLoS One 7:e34130. doi:10.1371/journal.pone.0034130. Boichard, D., L. Maignel, and É. Verrier. 1997. The value of using probabilities of gene origin to measure genetic variability in a population. Genet. Sel. Evol. 29:5–23. doi:10.1186/1297- 9686-29-1-5. Browning, B.L., and S.R. Browning. 2009. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals.. Am. J. Hum. Genet. 84:210–23. doi:10.1016/j.ajhg.2009.01.005. Calus, M.P.L., A.C. Bouwman, J.M. Hickey, R.F. Veerkamp, and H.A. Mulder. 2014. Evaluation of measures of correctness of genotype imputation in the context of genomic prediction: A review of livestock applications. Animal 8:1743–1753. doi:10.1017/S1751731114001803. Černý, V. 1985. Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm. J. Optim. Theory Appl. 45:41–51. doi:10.1007/BF00940812. Daetwyler, H.D., A. Capitan, H. Pausch, P. Stothard, R. van Binsbergen, R.F. Brøndum, X. Liao, A. Djari, S.C. Rodriguez, C. Grohs, D. Esquerré, O. Bouchez, M.-N. Rossignol, C. Klopp, D. Rocha, S. Fritz, A. Eggen, P.J. Bowman, D. Coote, A.J. Chamberlain, C. Anderson, C.P. VanTassell, I. Hulsegge, M.E. Goddard, B. Guldbrandtsen, M.S. Lund, R.F. Veerkamp, D.A. Boichard, R. Fries, and B.J. Hayes. 2014. Whole-genome sequencing of 234 bulls facilitates mapping of monogenic and complex traits in cattle. Nat. Genet. 46:858–865. doi:10.1038/ng.3034. Das, S., L. Forer, S. Schönherr, C. Sidore, A.E. Locke, A. Kwong, S.I. Vrieze, E.Y. Chew, S. Levy, M. McGue, D. Schlessinger, D. Stambolian, P.-R. Loh, W.G. Iacono, A. Swaroop, L.J. Scott, F. Cucca, F. Kronenberg, M. Boehnke, G.R. Abecasis, and C. Fuchsberger. 2016. Next- generation genotype imputation service and methods. Nat. Genet. 48:1284–1287. doi:10.1038/ng.3656. Drögemüller, C., M. Rossi, A. Gentile, S. Testoni, H. Jörg, G. Stranzinger, M. Drögemüller, M.L. Glowatzki-Mullis, and T. Leeb. 2009. Arachnomelia in Brown Swiss cattle maps to chromosome 5. Mamm. Genome 20:53–59. doi:10.1007/s00335-008-9157-2. 29

Druet, T., I.M. Macleod, and B.J. Hayes. 2014. Toward genomic prediction from whole-genome sequence data: impact of sequencing design on genotype imputation and accuracy of predictions.. Heredity (Edinb). 112:39–47. doi:10.1038/hdy.2013.13. Ely, J.J., M.A. Bishop, M.L. Lammey, M.M. Sleeper, J.M. Steiner, and D.R. Lee. 2010. Use of biomarkers of collagen types I and III fibrosis metabolism to detect cardiovascular and renal disease in chimpanzees (Pan troglodytes). Comp. Med. 60:154–158. doi:10.1371/journal.pgen.0020190. Faux, A.-M., G. Gorjanc, R.C. Gaynor, M. Battagin, S.M. Edwards, D.L. Wilson, S.J. Hearne, S. Gonen, and J.M. Hickey. 2016. AlphaSim: Software for Breeding Program Simulation. Plant Genome 9:0. doi:10.3835/plantgenome2016.02.0013. Fraser, R.S., J.S. Lumsden, and B.N. Lillie. 2018. Identification of polymorphisms in the bovine collagenous lectins and their association with infectious diseases in cattle. Immunogenetics 70:533–546. doi:10.1007/s00251-018-1061-7. Fritz, S., A. Capitan, A. Djari, S.C. Rodriguez, A. Barbat, A. Baur, C. Grohs, B. Weiss, M. Boussaha, D. Esquerré, C. Klopp, D. Rocha, and D. Boichard. 2013. Detection of Haplotypes Associated with Prenatal Death in Dairy Cattle and Identification of Deleterious Mutations in GART, SHBG and SLC37A2. PLoS One 8:e65550. doi:10.1371/journal.pone.0065550. Goddard, M.E., and B. Hayes. 2009. Genomic Selection Based on Dense Genotypes Inferred From Sparse Genotypes. Proc. Assoc. Advmt. Anim. Breed. Genet 18:26–29. Gonen, S., R. Ros-Freixedes, M. Battagin, G. Gorjanc, and J.M. Hickey. 2017. An exact method for optimal allocation of sequencing resources in genotyped livestock populations. Genet. Sel. Evol. 49:47. doi:10.1186/1297-9686-43-7. Hayes, B., and M.E. Goddard. 2001. The distribution of the effects of genes affecting quantitative traits in livestock. Genet. Sel. Evol. 33:209. doi:10.1186/1297-9686-33-3-209. Hozé, C., M.-N. Fouilloux, E. Venot, F. Guillaume, R. Dassonneville, S. Fritz, V. Ducrocq, F. Phocas, D. Boichard, and P. Croiseau. 2013. High-density marker imputation accuracy in sixteen French cattle breeds. Genet. Sel. Evol. 45:33. doi:10.1186/1297-9686-45-33. Kirkpatrick, S., C.D. Gelatt, and M.P. Vecchi. 1983. Optimization by Simulated Annealing. Science (80-. ). 220:671–680. doi:10.1126/science.220.4598.671. Li, L., Y. Li, S.R. Browning, B.L. Browning, A.J. Slater, X. Kong, J.L. Aponte, V.E. Mooser, S.L. Chissoe, J.C. Whittaker, M.R. Nelson, and M.G. Ehm. 2011. Performance of genotype imputation for rare variants identified in exons and flanking regions of genes. PLoS One 6. doi:10.1371/journal.pone.0024945. Loh, P.-R., P. Danecek, P.F. Palamara, C. Fuchsberger, Y. A Reshef, H. K Finucane, S. Schoenherr, L. Forer, S. McCarthy, G.R. Abecasis, R. Durbin, and A. L Price. 2016. Reference-based phasing using the Haplotype Reference Consortium panel. Nat. Genet. 48:1443–1448. doi:10.1038/ng.3679. Ma, P., R. Brøndum, Q. Zhang, M. Lund, and G. Su. 2013. Comparison of different methods for imputing genome-wide marker genotypes in Swedish and Finnish Red Cattle. J. Dairy Sci. 96:4666–4677. doi:10.3168/jds.2012-6316. 30

Marchini, J., and B. Howie. 2010. Genotype imputation for genome-wide association studies. Nat. Rev. Genet. 11:499–511. doi:10.1038/nrg2796. McKenna, A., M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky, K. Garimella, D. Altshuler, S. Gabriel, M. Daly, and M.A. DePristo. 2010. The genome analysis toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20:1297–1303. doi:10.1101/gr.107524.110. Neuditschko, M., H.W. Raadsma, M.S. Khatkar, E. Jonas, E.J. Steinig, C. Flury, H. Signer-Hasler, M. Frischknecht, R. von Niederhäusern, T. Leeb, and S. Rieder. 2017. Identification of key contributors in complex population structures. PLoS One 12:e0177638. doi:10.1371/journal.pone.0177638. Pausch, H., B. Aigner, R. Emmerling, C. Edel, K.-U. Götz, and R. Fries. 2013. Imputation of high- density genotypes in the Fleckvieh cattle population. Genet. Sel. Evol. 45:3. doi:10.1186/1297-9686-45-3. Pausch, H., I.M. MacLeod, R. Fries, R. Emmerling, P.J. Bowman, H.D. Daetwyler, and M.E. Goddard. 2017. Evaluation of the accuracy of imputed sequence variant genotypes and their utility for causal variant detection in cattle. Genet. Sel. Evol. 49:24. doi:10.1186/s12711-017- 0301-x. Pérez-Enciso, M., and A. Legarra. 2016. A combined coalescence gene-dropping tool for evaluating genomic selection in complex scenarios (ms2gs). J. Anim. Breed. Genet. 133:85– 91. doi:10.1111/jbg.12200. Pluzhnikov, A., and P. Donnelly. 1996. Optimal sequencing strategies for surveying molecular genetic diversity. Genetics 144:1247–1262. R Core Team. 2017. R: A Language and Environment for Statistical Computing. Ros-freixedes, R., S. Gonen, G. Gorjanc, and J.M. Hickey. 2017. A method for allocating low- coverage sequencing resources by targeting haplotypes rather than individuals. Genet. Sel. Evol. 49:78. doi:10.1101/188896. Sargolzaei, M. 2014. SNP1101 User’s Guide. Version 1.0.. HiggsGene Solut. Inc. Sargolzaei, M., and F.S. Schenkel. 2009. QMSim: A large-scale genome simulator for livestock. Bioinformatics 25:680–681. doi:10.1093/bioinformatics/btp045. Sargolzaei, M., F.S. Schenkel, G.B. Jansen, and L.R. Schaeffer. 2008. Extent of Linkage Disequilibrium in Holstein Cattle in North America. J. Dairy Sci. 91:2106–2117. doi:10.3168/jds.2007-0553. Utsunomiya, Y.T., M. Milanesi, A.T.H. Utsunomiya, P. Ajmone-Marsan, and J.F. Garcia. 2016. GHap: An R package for genome-wide haplotyping. Bioinformatics 32:2861–2862. doi:10.1093/bioinformatics/btw356. Whalen, A., G. Gorjanc, R. Ros-Freixedes, and J.M. Hickey. 2018. Assessment of the performance of hidden Markov models for imputation in animal breeding. Genet. Sel. Evol. 50:44. doi:10.1186/s12711-018-0416-8. Zhang, Q., M.P.L.L. Calus, B. Guldbrandtsen, M.S. Lund, and G. Sahana. 2017. Contribution of 31

rare and low-frequency whole-genome sequence variants to complex traits variation in dairy cattle. Genet. Sel. Evol. 49:60. doi:10.1186/s12711-017-0336-z. Zhang, Q., B. Guldbrandtsen, M.P.L. Calus, M.S. Lund, and G. Sahana. 2016. Comparison of gene-based rare variant association mapping methods for quantitative traits in a bovine population with complex familial relationships. Genet. Sel. Evol. 48:60. doi:10.1186/s12711- 016-0238-5. 2.8 Acknowledgements

The 1,000 Bulls Genome Project is acknowledged for providing the whole-genome sequence genotypes of the Holstein animals. The Canadian Dairy Network provided the array genotypes. Computations were done at the server facilities provided by the Centre for Genetic Improvement of Livestock, Department of Animal Biosciences at the University of Guelph, ON Canada. We gratefully acknowledge funding by the Efficient Dairy Genome Project, funded by Genome Canada (Ottawa, Canada), Genome Alberta (Calgary, Canada), Ontario Genomics (Toronto, Canada), Alberta Ministry of Agriculture (Edmonton, Canada), Ontario Ministry of Research and Innovation (Toronto, Canada), Ontario Ministry of Agriculture, Food and Rural Affairs (Guelph, Canada), Canadian Dairy Network (Guelph, Canada), GrowSafe Systems (Airdrie, Canada), Alberta Milk (Edmonton, Canada), Victoria Agriculture (Melbourne, Australia), Scotland's Rural College (Edinburgh, UK), USDA Agricultural Research Service (Beltsville, United States), Qualitas AG (Zug, Switzerland), Aarhus University (Aarhus, Denmark). AB is especially grateful to the Qualitas AG team and the Swiss Association for Animal Science for their support.

32

2.9 Tables

Table 2-1. Parameters used for the simulation of the populations LongRangeLD and CurrentPop. Parameter LongRangeLD CurrentPop Number of generations 20 10 Litter size 1.0 1.0 Sire replacement rate 0.5 0.5 Dam replacement rate 0.3 0.3 Mating design Positive assortative Positive assortative Selection design onOn phenotypesphenotypes onOn EBVEBV Culling design Age Low EBV EBV estimation method None BLUP using the true additive genetic variance Number of traits 1.0 1.0genetic variance Heritability 0.3 0.3 Phenotypic variance of trait 1.0 1.0

33

Table 2-2. Rate of genotyping change as introduced in the simulated whole-genome sequence genotypes. As an example, 1.1% of the simulated AB genotypes were changed to AA genotypes. Values were retrieved from the study by Baes et al. (2014). Simulated genotypes including genotyping error and missing values AA AB BB -/-

AA 0.639 0.004 0.001 0.356

AB 0.011 0.970 0.000 0.019 True True

genotypes BB 0.002 0.004 0.976 0.018

34

Table 2-3. Proportion of animals overlapping between selection methods in reference populations of different sizes. The methods were the key ancestors (AMAT), the selection of Highly Segregating Haplotypes (HSH), the Inverse Weighted Selection (IWS), and the Genetic Diversity Index (GDI). Size Method

50 AMAT HSH IWS HSH 0.26 IWS 0.04 0.00 GDI 0.00 0.00 0.08 100 HSH 0.23 IWS 0.03 0.01 GDI 0.00 0.00 0.1 200 HSH 0.20 IWS 0.05 0.03 GDI 0.01 0.02 0.14 400 HSH 0.13 IWS 0.04 0.06 GDI 0.02 0.03 0.09 800 HSH 0.09 IWS 0.03 0.12 GDI 0.03 0.06 0.12 1,200 HSH 0.08 IWS 0.03 0.16 GDI 0.04 0.09 0.12

35

Table 2-4. Accuracies for reference populations of 50, 200, and 1,200 individuals and increasing MAF of the variants considered, all variants, the rare variants (MAF < 0.05), or the common variants (MAF ≥ 0.05). MAF bins “x-y” stands for “x < MAF  y”.

50 200 1,200 MAF bin AMAT IWS HSH GDI AMAT IWS HSH GDI AMAT IWS HSH GDI 0.00-0.01 0.212 a 0.282 b 0.146 c 0.281 b 0.471 a 0.527 b 0.468 a 0.540 b 0.625 a 0.641 a,b 0.643 a,b 0.647 b 0.01-0.02 0.624 0.640 0.557 0.629 0.903 0.912 0.908 0.906 0.960 0.962 0.961 0.961 0.02-0.03 0.712 a 0.704 b 0.678 a 0.691 b 0.931 0.936 0.935 0.929 0.970 0.972 0.971 0.970 0.03-0.04 0.761 a 0.732 b 0.746 a 0.725 b 0.944 a,b 0.948 a 0.948 a,b 0.940 b 0.975 0.976 0.975 0.975 0.04-0.05 0.793 a 0.754 b 0.790 a 0.745 b 0.952 0.954 0.956 0.946 0.979 0.979 0.978 0.978 0.05-0.10 0.837 a 0.786 b 0.848 a 0.776 b 0.964 0.966 0.966 0.957 0.983 0.984 0.983 0.983 0.10-0.15 0.887 0.834 0.898 0.825 0.974 0.975 0.976 0.969 0.987 0.988 0.987 0.987 0.15-0.20 0.911 a 0.865 b 0.920 a 0.855 b 0.979 0.980 0.980 0.974 0.990 0.990 0.989 0.989 0.20-0.25 0.925 a 0.878 b 0.932 a 0.870 b 0.982 0.982 0.983 0.978 0.991 0.991 0.990 0.990 0.25-0.30 0.933 a 0.892 b 0.940 a 0.881 b 0.984 a,b 0.984 a 0.984 a,b 0.980 b 0.991 0.992 0.991 0.991 0.30-0.35 0.939 a 0.904 b 0.944 a 0.896 b 0.985 a,b 0.985 a 0.985 a,b 0.982 b 0.992 0.993 0.992 0.992 0.35-0.40 0.942 a 0.908 b 0.948 a 0.900 b 0.986 0.986 0.986 0.982 0.992 0.993 0.992 0.992 0.40-0.45 0.944 a 0.911 b 0.949 a 0.904 b 0.986 0.986 0.986 0.983 0.993 0.993 0.992 0.992 0.45-0.50 0.945 a,b 0.914 b,c 0.949 a 0.908 b,c 0.986 a 0.986 b 0.986 a 0.983 b 0.993 a 0.993 b 0.992 b 0.993 b All 0.580 a 0.589 b 0.549 c 0.583 b 0.765 a 0.789 b 0.765 a 0.790 b 0.840 a 0.847 a,b 0.847 a,b 0.849 b Common 0.894 a 0.848 b 0.903 c 0.839 b 0.976 a 0.976 b 0.977 a 0.970 b 0.988 a 0.989 a,b 0.988 b 0.988 b Rare 0.387 a 0.429 b 0.331 c 0.425 b 0.635 a 0.673 b 0.635 a 0.679 b 0.749 a 0.759 a,b 0.761 b 0.763 b a,b,c Different letters represent significant differences in accuracies among the methods within a bin-by-size set of values (pairwise Wilcoxon Rank Sum Test significant with p-value after experimental-wise Bonferroni correction).

36

2.10 Figures

Figure 2-1. Structure and number of animals of the simulated populations.

37

Figure 2-2. The Genetic Diversity Index is the sum of the unique haplotypes found in a group of animals. On this figure, five animals carry in total 18 unique haplotypes (5 variants of haplotypes A, 4 variants of haplotype B, 6 variants of haplotype C, and 3 variants of haplotype D). Colors highlight the unique haplotype alleles of each haplotype block.

38

Figure 2-3. Representation of the distribution of the intervals of minor alleles frequencies used for assessing imputation accuracy.

39

Figure 2-4. Number of Jersey (JE) animals selected by the Highly Segregating Haplotype selection (HSH), the Inverse Weighted Selection (IWS), and the Genetic Diversity Index (GDI) methods from candidate pools with different proportion of JE animals.

40

Figure 2-5. Decay of linkage disequilibrium (LD) over genomic distance in the real and the simulated sequences.

41

Figure 2-6. Distribution of the different groups of animals on the first and second principal components. The variance explained by the components is given in brackets. Grey crosses represent all the candidates, the red triangles are the animals selected by the key ancestors (AMAT) method, the purple plusses are the animals selected by the Inverse Weighted Selection (IWS) method, the green plusses are the animals selected by the Highly Segregating Haplotypes selection (HSH) method, and the blue crosses are the animals selected with the Genetic Diversity Index (GDI) method.

42

Figure 2-7. Density curves of the minor allele frequencies observed in the simulated population over the generations.

43

Figure 2-8. Selected proportion of unique haplotypes from the total haplotype library found in the reference group created with different selection methods. The methods compared are the key ancestors (AMAT), the Highly Segregating Haplotype selection (HSH), the Inverse Weighted Selection (IWS), and the Genetic Diversity Index (GDI).

44

Figure 2-9. Accuracy of imputation for all SNP, only common SNP (minor allele frequency  0.05), or only rare SNP (minor allele frequency < 0.05), using reference population sizes from 50 to 1,200 individuals. The methods compared are the key ancestors (AMAT), the Highly Segregating Haplotype selection (HSH), the Inverse Weighted Selection (IWS), and the Genetic Diversity Index (GDI). All standard errors are below 0.013.

45

Figure 2-10. Genotype concordance rates for all SNP using reference population sizes from 50 to 1,200 individuals. The methods compared are the key ancestors (AMAT), the Highly Segregating Haplotype selection (HSH), the Inverse Weighted Selection (IWS), and the Genetic Diversity Index (GDI). All standard errors are below 0.009.

46

Figure 2-11. Accuracy of imputation increases with higher minor allele frequency (MAF) of the variants. Here the accuracies reached with 50 animals in a reference group of key ancestors (AMAT) are presented. MAF bins “x-y” stands for “x < MAF  y”. All standard errors are below 0.008.

47

CHAPTER 3: High confidence copy number variants identified in Holstein dairy cattle from whole genome sequence and genotype array data

Adrien M. Butty1, Tatiane C. S. Chud1, Filippo Miglior1, Flavio S. Schenkel1, Arun Kommadath2,3, Kirill Krivushin2 Jason R. Grant2, Irene M. Häfliger4, Cord Drögemüller4, Angela Cánovas1, Paul Stothard2, and Christine F. Baes1,4*

1Centre for Genetic Improvement of Livestock, University of Guelph, Guelph, ON, Canada 2Dept of Agricultural, Food and Nutritional Science, University of Alberta, Edmonton, AB, Canada 3Lacombe Research and Development Centre, Agriculture and Agri-Food Canada, Lacombe, AB, Canada 4Institute of Genetics, Vetsuisse Faculty, University of Bern, Bern, BE, Switzerland

Keywords: dairy cattle, structural variants, sequencing, genotyping

48

3.1 Abstract

Multiple methods to detect copy number variants (CNV) relying on different types of data have been developed and CNV have been shown to have an impact on phenotypes of numerous traits of economic importance in cattle, such as reproduction and immunity. Further improvements in CNV detection are still needed in regard to the trade-off between high-true and low-false positive variant identification rates. Instead of improving single CNV detection methods, variants can be identified in silico with high confidence when multiple methods and datasets are combined. Here, CNV were identified from whole-genome sequences (WGS) and genotype array (GEN) data. After CNV detection, two sets of high confidence CNV regions (CNVR) were created that contained variants found in both WGS and GEN data following an animal-based and a population- based pipeline. Furthermore, the change in false positive CNV identification rates using different GEN marker densities was evaluated. The population-based approach characterized CNVR, which were more often shared among animals, and were more often linked to putative functions than CNV identified with the animal-based approach. Moreover, false positive identification rates up to 22% were estimated on GEN information. Further research using larger datasets should use a population-wide approach to identify high confidence CNVR. 3.2 Introduction

Dairy cattle genetics has made great advances since the effects of single nucleotide polymorphisms (SNP) have been recognized on a wide range of mono or polygenic traits economically important for the dairy industry (Pausch et al., 2016; van den Berg et al., 2016; Sallam et al., 2017; Xiang et al., 2017; Ma et al., 2018). Genomic variation, however, is not only caused by SNP. Recent studies have shown that structural variants (SV) also have an important impact on phenotypes of a multitude of traits, such as milk production, reproduction, health, and feed efficiency (Hou et al., 2012a; Jiang et al., 2013; de Almeida Santana et al., 2016; Liu et al., 2019). Types of SV include translocations, inversions and copy number variation (Alkan et al., 2011). Copy number variants (CNV) form the most common class of SV in the human, plant and animal genome and can be identified as two types of event: copy number loss (CNL) or copy number gain (CNG). As the amount of DNA changes between samples with or without multiple copies of a segment, CNV are a type of the unbalanced structural variations (Alkan et al., 2011). Although the number of bovine

49

CNV described in the literature is lower than the number of SNP, the fact that they have multiple alleles makes them highly informative (Chen et al., 2018). The CNV can affect both monogenic traits, such as the coat color of cattle (Durkin et al., 2012), and polygenic traits such as feed efficiency, production traits, and reproduction traits of cattle (da Silva et al., 2016; Hay et al., 2018). For instance, a study by Liu et al. (2019) showed associations between CNV and production traits specifically in Holstein dairy cattle. Identifying CNV is challenging and no consensus on the best method of identification has been reached because multiple factors, starting with the source of information on which the CNV are identified, influence the results. Moreover, most CNV identification methods were developed around the trade-off between high rates of discovery and low false positive rates. CNV have been identified in cattle from SNP genotyping arrays (GEN) (Seroussi et al., 2010; Jiang et al., 2012; Salomon-Torres et al., 2014, 2016; Huang et al., 2017; Antunes de Lemos et al., 2018; Rafter et al., 2018), hybridization arrays (ACGH) (Fadista et al., 2010; Liu et al., 2010, 2019; Zhan et al., 2011; Zhou et al., 2018), and whole-genome sequences (WGS) (Keel et al., 2016, 2017; Mei et al., 2017; Mielczarek et al., 2017, 2018; Kommadath et al., 2019). The degree of resolution of the different platforms leads to the identification of different CNV with varying lengths distributed unequally in the genome. Although the first studies defined CNV as genomic segments of 1 kilobase (Kb) (Feuk et al., 2006) or more, the latest developments in their identification on whole- genome sequences have reduced the minimum size of interest to 50 base pairs (bp) (Kidd et al., 2015). Generally, variants longer than 5 megabases (Mb) are considered erroneous and thus removed (Hay et al., 2018). Variants detected with different methods, on different animals, and relying on different sources of information may overlap between 0 and 90 percent (Salomón- Torres et al., 2015; da Silva et al., 2016; Keel et al., 2016; Prinsen et al., 2016; Sasaki et al., 2016; Silva et al., 2016). Moreover, Kommadath et al. (2019) suggested the method used for CNV identification has the most impact on the overlap between studies. The advantages and disadvantages of the detection methods on GEN and WGS data were reviewed by Winchester et al. (2009) and Pirooznia et al. (2015), respectively. One main limitation for all methods and sources of information is the quality of the reference genome on which the analysis is based. With a larger N50 contig size (26.3 vs 0.097 Mb) and a drastic reduction in the number of gaps (393 vs 72,051), the latest bovine reference genome assembly (ARS-UCD1.2; Rosen et al., 2018) is a clear 50

improvement compared to its predecessor UMD3.1.1 (Zimin et al., 2009) and CNV are now being identified with more confidence. The marker density of the GEN data on which CNV identification relies also influences the set of CNV identified (Wang et al., 2007). An estimate of the effect of the marker density on the false positive CNV identification rate, however, is lacking when relying on ARS-UCD1.2. Methods based on WGS to identify CNV follow four different approaches: split read, read pair, assembly and read depth (Alkan et al., 2011). In contrast, the identification of CNV relying on GEN data is only based on the signal intensity and, in some cases, on the B allele frequency values generated at genotyping (Winchester et al., 2009). The signal intensity is a measure of the fluorescence intensity at the time of genotyping and reflects the quantity of DNA material present for a given probe on an array at the time of genotyping. The read depth approach in WGS detection approaches is its best equivalent. In both of these cases, the identification of CNV is indirectly based on the amount of DNA found for a region in a sample. CNV identification from GEN data has been developed to analyze the information of one individual at a time. To fairly compare CNV identified from these two sources of information, it is thus of importance to choose a single sample WGS identification method that relies on the read depth of the sequences. CNV regions identified based on two types of data and with two methods can be considered of high confidence (Zhan et al., 2011). The objectives of this study were: 1) to identify and describe CNV from GEN and WGS data based on ARS-UCD1.2; 2) to define sets of high confidence CNV regions following the two approaches and to find their possible impact(s) on traits of interest for the dairy industry through in silico functional analysis; 3) to compare the high confidence CNV regions with previously published variants; and 4) to estimate the effect of the marker density on false positive discovery rates. 3.3 Materials and methods

3.3.1 Animal material, genotyping and whole-genome sequencing Both WGS and GEN data of 96 Holstein animals (15 cows and 81 bulls) were available. A total of 41 animals were sequenced within the Canadian Cattle Genome Project (Stothard et al., 2015) (CCGP), 32 within the Efficient Dairy Genome Project (Kommadath et al., 2019) (EDGP), and 23 under the scope of multiple projects of the Vetsuisse Faculty of the University of Bern (CHE) and

51

were primarily selected in order to identify causative variation for a wide range of diseases (Wiedemar et al., 2015; Menzi et al., 2016; Agerholm et al., 2017). The Genetic Diversity Index method applied to select animals of the EDGP dataset for the sequencing was described in a simulation study (Butty et al., 2019). DNA was extracted from frozen semen samples for the EDGP and the CCGP bulls. The WGS was performed using Illumina HiSeqTM 2000 (CCGP samples) or Illumina HiSeqTM X (EDGP samples). All analyses were carried out following the manufacturer’s protocols as described by Stothard et al. (2015). Paired-end reads were filtered, and the remaining reads were aligned to the latest bovine genome assembly (ARS-UCD1.2) using the Burrows- Wheeler Aligner (version 0.7.17; Li and Durbin, 2009) following the protocol of the 1,000 Bull Genomes Project (http://www.1000bullgenomes.com/, last accessed 2019-03-14). Reads bases were cut out with Trimmomatic (version 0.38.1; Bolger et al., 2014) that were 1) identified as Illumina adapters, 2) at the start to the end of a read and had a quality lower than 20, and 3) not reaching a minimum average quality of 15 over a sliding window of 3 bp. After trimming, reads were dropped that did not reach an average quality of 20 and that were shorter than 35bp. After alignment, duplicate reads were marked with the function MarkDuplicates of the Picard toolkit (version 2.18.15, https://broadinstitute.github.io/picard/, last accessed 2019-03-23) and base quality recalibration was performed with the functions BaseRecalibrator and PrintReads of the Genome Analysis Toolkit (version 3.8; McKenna et al., 2010). The set of known variants provided by the consortium of the 1,000 Bull Genomes Project was used for base quality recalibration. The same 96 animals that were sequenced as described above were also genotyped. Among them, 57 animals were genotyped with the Illumina BovineHD Beadchip (HD; 777,962 markers), 22 with the GeneSeek® Genome Profiler Bovine 150K (version 2 and 3; 150K; 138,892 and 139,376 markers), 12 with the GeneSeek® Genome Profiler Bovine HD (GGPHD; 76,883 markers) and 5 with the Illumina BovineSNP50 Beadchip (50K; 54,001 markers). Thirteen additional Holstein bulls genotyped with the Illumina BovineHD Beadchip were included to analyze the influence of the genotype density on the CNV identification rates. The average gap sizes between the markers were 3.98Kb, 21.83Kb, 39.66Kb, and 66.75Kb for HD, 150K, GGPHD and 50K respectively. Markers with a GenCall (GC) score below 0.15 were removed on a per-sample basis. The SNP markers positions were updated to ARS-UCD1.2 using the map files available on the NAGRP data

52

repository (https://www.animalgenome.org/repository/cattle/UMC_bovine_coordinates/, last accessed 2019-03-14). 3.3.2 CNV identification and sets of high confidence CNV regions The CNV were identified on all available WGS and GEN data on a per-animal basis. The CNV detection software PennCNV (Wang et al., 2007) (version 1.0.3) was used to identify CNV from the GEN dataset (GEN_CNV). PennCNV relies on a hidden Markov model based not only on the Log R ratio (LRR) but also on the allelic ratio distribution (B allele frequency) of each sample. Prior to CNV identification and in order to reduce false positive results, the LRR values were corrected for genomic waves based on the Guanine-Cytosine content of the genomic regions 500Kb upstream and downstream of each investigated marker (Diskin et al., 2008). Only autosomes and markers with known position were considered for the analysis. After CNV detection, low-quality samples were removed from the analysis using the following cutoffs: LRR standard deviation above 0.3, B allele frequency drift above 0.01, and wave factor above 0.05. CNV covering less than 10 SNP in samples genotyped with the Illumina BovineHD Beadchip and less than three SNP in all other samples were also removed. The CNV detection software program CNVnator (Abyzov et al., 2011) (version 0.3.3) was used to identify CNV from the WGS data (WGS_CNV). This software partitions the read depth (RD) of the aligned genome for each individual over segments of a given length, corrects those depending on their Guanine-Cytosine content and then performs CNV identification. CNV detection was carried out by dividing the genome into segments of 200bp. With this segment length, the ratio between the RD and its variance over all samples was 4.58, which fits the recommended ratio between 4 and 5 by Abyzov et al. (2011). After CNV detection, following recommended practice from previous studies (Hay et al., 2018), variants shorter than 1Kb or longer than 5Mb were removed. Only CNV from regions with a mean RD different from the RD average of the sample (P < 0.05, t-test) and with more than 50% of the reads mapped with a quality greater than zero were kept. Next, sets of high confidence CNV regions (CNVR) were created. The CNVR are formed by collating overlapping or contiguous CNV from GEN and WGS data. Merging CNV in regions was first described by Redon et al. (2006) and allows for population-wide CNV analysis. Two sets of high confidence CNVR were considered for analysis: ANIMAL_CNVR and 53

POPULATION_CNVR (Figure 3-1). To obtain the ANIMAL_CNVR set, CNV identified within the same animal from GEN and WGS data that had a reciprocal overlap of at least 50% of their length were considered high confidence CNV and merged to CNVR over all animals. All CNV identified with one data source that had any overlap were merged to create two sets of CNVR: the GEN_CNVR and the WGS_CNVR. GEN_CNVR and WGS_CNVR that reciprocally overlapped over at least 50% of their lengths and were present in more than 5% of the samples comprised then the POPULATION_CNVR set. CNVR that were found in some samples as CNG and in other samples as CNL were called MIX. 3.3.3 Effect of the genotype array density on the identified CNV The LRR and BAF from all 70 samples genotyped with the HD panel were masked down to the SNP overlap with the markers of the 150K panel and of the 50K panel. The three datasets were edited as previously described: markers with a GC score below 0.15 were removed, LRR values were corrected for the Guanine-Cytosine content of the 500Kb regions up- and downstream each investigated marker, and markers positioned on sexual chromosomes or without known positions were removed. CNV were then identified using PennCNV (Wang et al., 2007) and filtered as described earlier for each dataset. Finally, CNV remained, which were identified on all three densities for 30 samples composed the set for analysis. CNV identified on multiple SNP densities that had an overlap of at least one base pair were considered equal. GEN_CNV identification relied on the signal intensity value produced at genotyping; a higher signal intensity was equivalent to higher DNA concentration for a position, thus indicating a possible CNG or, in case the signal intensity was less strong than expected, a possible CNL. Filters were set at the time of CNV identification so that only regions where a minimum number of contiguous markers showing the same CNG or CNL pattern were considered CNV. Analysis of lower density marker panels could lead to the identification of CNV in regions where information of more markers in the same region would show no CNV pattern. CNV identified in those regions on lower density marker panel could thus be considered false positives. A situation where a CNL was identified with a lower density marker panel but not with a higher density marker panel is represented in Figure 3-2. The upper part of the figure depicts a genomic region with a higher density marker panel, whereas the lower part depicts a genomic region with a lower density panel. In this example, a minimum of three markers had to show the same CNG or CNL pattern for the 54

identification of a CNV. Vertical bars going through the genomic region represent markers with no CNG or CNL pattern, vertical bars only above or only below the genomic region represent CNG or CNL, respectively. The lack of marker information in the lower density panel across the genomic region led to the identification of a CNL that was not found in the higher density panel; such a CNL was considered a false positive hit in this study. False positive rates were computed per sample as the proportion of CNV found only with the 150K or only with the 50K over all CNV found with any density. Confidence intervals of 95% (95%-CI) for the false positive rates were computed with 10,000 times bootstrapping for the number of CNV identified and both false positive rates. 3.3.4 Previously known variants Two sets of previously known variants were considered for comparison with the high confidence CNVR identified in this study: the CNV deposited on the Genomic Variant archive database of EMBL-EBI (DGVa; https://www.ebi.ac.uk/dgva, last accessed 2019-03-24) and the CNVR identified on the datasets A, B, and C described by Kommadath et al (2019). The CNV identified and discussed in eight studies were available from the DGVa. Out of those, four studies identified CNV using GEN data (Liu et al., 2010; Hou et al., 2011, 2012a; Karimi et al., 2017) and four studies identified CNV from WGS data (Bickhart et al., 2012; Boussaha et al., 2015; Keel et al., 2016; Mesbah-Uddin et al., 2018). The number of samples of the studies varied from 6 to 539 and different breeds were included as well as Bos indicus animals (Table 3-1). Chromosome, start position, end position and type of all published CNV were retrieved and merged to form the DGVa CNVR dataset used for comparison with our results. Kommadath et al. (2019) identified CNVR in Bos taurus animals using the multi-sample approach implemented in cn.MOPS (Klambauer et al., 2012) and relying on WGS data. They describe CNVR from four datasets with different numbers of samples of different breeds. As dataset D was the same as the 32 EDGP samples, only CNV identified on datasets A, B, and C of this study were considered. CNVR shorter than 5Mb, and that were present in at least 5% of the samples of each dataset were retrieved and merged to form the CNVR set used for comparison with our results. Both sets of known CNVR were generated based on the UMD3.1 (Zimin et al., 2009) bovine reference genome but our results were based on ARS-UCD1.2. To allow for comparison between all sets of CNVR, coordinates of the previously described variants were translated to their ARS- 55

UCD1.2 equivalent using the UCSC Genome Browser LiftOver tool (Haeussler et al., 2019). Minimum ratio of bases that had to remap was set to 0.4, all other liftOver parameters were kept at their default values. In total and after translation to ARS-UCD1.2 positions, 9,169 CNVR composed the DGVa set and 4,525 CNVR were retrieved from the database of Kommadath et al. (2019). In all comparisons between CNVR sets, regions with a reciprocal overlap of at least 50% of their length were considered equal. 3.3.5 In silico functional analysis Bos taurus coding sequences located at the same genomic regions as the high confidence CNVR were retrieved from the Ensembl Genes database (Cunningham et al., 2019) (ARS-UCD1.2, annotation release 96) with the Ensembl Biomart tool (Kersey et al., 2011) The OmicsBox (version 1.0.0; new updated software from Blast2GO (Götz et al., 2008)) was used to annotate the regions. The GO analysis was performed taking into account the three GO categories (biological process, molecular function and cellular component) using OmicsBox (Götz et al., 2008). Coding sequences were annotated with blastx and the OmicsBox mapping and GO annotation routines (Conesa et al., 2005). Nucleotide query sequences were compared against all the sequences found in the database of the National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov, last accessed 2019-04-25). Matches between sequences of this study and the database were reported that reached a significance level of at least 0.001 (e-value) and had a similarity of at least 90%. The GO significance levels were computed following Fisher’s exact test for multiple testing in OmicsBox. As described by Cánovas et al., (2013) and Li et al., (2016) the OmicsBox suite was also used to examine associated biological pathways involving the enzymes coded by the genes present in the high confidence CNVR based on the Kyoto Encyclopedia of Genes and Genomes (KEGG; Ogata et al., 1999). GeneCards information was also retrieved for the identified and named genes (Safran et al., 2010). 3.4 Results

3.4.1 Alignment and update to the bovine reference genome ARS-UCD1.2 The sequences of 96 Holstein animals were aligned to the bovine reference genome ARS-UCD1.2 (Rosen et al., 2018). On average, 559,155,098 reads (standard deviation (SD) = 258,847,099) were obtained per animal, of which 96.6% (SD = 5.3%) were correctly mapped. Differences were

56

observed depending on the sequencing project under which sequences were generated (Table 3-2). The average read depth was of 20x and ranged from 8.57x to 42.53x. CCGP samples had an average read depth of 12x (SD = 2x), EDGP samples had an average read depth of 35x (SD = 4x), and CHE samples had an average read depth of 15x (SD = 5x) [Appendices A and B]. Positions of 85.4%, 86.2%, 90.4%, and 72.3% of the GEN markers on ARS-UCD1.2 were retrieved for the Illumina BovineHD Beadchip, the GeneSeek® Genome Profiler Bovine 150K, the GeneSeek® Genome Profiler Bovine HD, and the Illumina BovineSNP50 Beadchip, respectively. Average percent of markers remaining after quality controls ranged between 68.5% and 88.5% (Table 3-3). The number of samples that passed the GEN quality filters limited the number of samples to 67 for both GEN and WGS datasets [Appendix C]. 3.4.2 Copy number variant identification On average 7 GEN_CNV (min: 1, max: 70) and 1,855 WGS_CNV (min: 1,335, max: 4,413) were identified in 67 samples after quality control on a per sample and on a per CNV basis. The total number of CNV detected were 471 in GEN and in 124,302 in WGS. More CNL were found in both GEN and WGS data with 299 (63.4%) and 96,023 (77.2%) variants, respectively. CNV discovery in GEN was based on an average of 433,619 (SD = 280,740) markers. The data available for this study did not permit investigation of the effect of different read depths on the identified WGS_CNV. Average CNV lengths were of 132 Kb (min: 5 Kb, max: 1341 Kb) for the GEN set and of 22 Kb (min: 1.2 Kb, max: 4783 Kb) for the WGS set. The WGS_CNV were significantly longer than the GEN_CNV (P < 0.0001, Wilcoxon rank sum test with continuity correction). Once collated to CNV regions, the number of variants was reduced to 246 GEN_CNVR and 8,974 WGS_CNVR that had an average length of 111 Kb (min: 5 Kb, max: 1,787 Kb) and 54 Kb (min: 1.2 Kb, max: 13,296 Kb), respectively. The length distributions of the CNVR sets were not parametric and not different (P> 0.05, Wilcoxon rank sum test with continuity correction). 3.4.3 Copy number variant and genotyping marker densities Masking marker information of the animals with HD genotypes to 150K and 50K SNP panels enabled measurement of the effect of marker panel on the false positive CNV identification rate. Among the 70 animals with HD_GEN information, 30 passed all per sample-based filters and had CNV in each density. CNV lengths were similar to the CNV lengths reached with the complete

57

GEN data, as previously described. On average, 8.451 CNV (95%-CI: 8.426, 8.476) were identified with the HD panel, 4.993 (95%-CI: 4.983, 5.002) with the 150K panel, and 1.869 (95%- CI: 1.865,1.872) with the 50K panel per animal. Average overlaps of CNV between set per sample are shown in Figure 3-3. The CNV falsely identified with the 150K marker panel composed on average 21.7% of the CNV identified with all densities (95%-CI: 21.6, 21.7) and 12.2% (95%-CI: 12.1, 12.2) of the CNV identified with 50K marker panel were false positive. 3.4.4 High confidence copy number variant regions Altering the minimum percentage of reciprocal overlap between GEN_CNV and WGS_CNV to select high confidence CNV between 20 and 80 percent changed the number of variants considered of high confidence linearly. Low required overlap percentage (10%) or high overlap percentage (90%), however, had a larger impact on the number of high confidence variants, increasing or reducing it drastically (Figure 3-4). In total, 52 ANIMAL_CNVR (30 CNL, 21 CNG, and 1 MIX) and 36 POPULATION_CNVR (15 CNL, 7 CNG, and 14 MIX) were identified on 22 and 20 chromosomes (Figure 3-5). Visual identification of true positive WGS_CNV found within high confidence CNVR was possible (e.g. Figure 3-6) as well as the identification of false positive WGS_CNV outside of any high confidence CNVR boundaries (e.g. Figure 3-7). The total genome covered was 0.22% and 0.24% for ANIMAL_CNVR and the POPULATION_CNVR, respectively. The number of animals carrying a CNVR ranged from one to 25 for the ANIMAL_CNVR and from four to 67 for the POPULATION_CNVR. On one hand, 27 ANIMAL_CNVR (52%) were detected only in one sample. On the other hand, five POPULATION_CNVR (14%) were found in all samples [Appendix D]. A total of 31 CNVR were found in common between both ANIMAL_CNVR and POPULATION_CNVR sets. The two sets of ANIMAL_CNVR and POPULATION_CNVR were merged for functional analysis to a set of unique high confidence CNVR in which overlapping CNVR were only considered once. Accordingly, 57 regions were considered unique high confidence CNVR. Half of the 57 unique high confidence CNVR were not found in the previous studies of which CNVR were retrieved. These 28 candidate CNVR were all found in the ANIMAL_CNVR set but only 12 of them were part of the POPULATION_CNVR set. No unique high confidence CNVR was found in any gap of the reference assembly. 58

3.4.5 Putative functions of the identified CNVR The 57 unique high confidence CNVR returned 188 Ensembl sequences. Peptide sequences of 160 of those could be retrieved and analyzed with the OmicsBox annotation routine. After analyses, 104 sequences were found that had BLAST hits and could be mapped and annotated for 35 high confidence CNVR. These 35 regions represented 58% of the ANIMAL_CNVR and 61% of the POPULATION_CNVR. Significant GO terms were identified in the three GO main categories biological processes, molecular functions, and cellular component. At the most informative level of the biological processes, 69% of the GO terms related to transcription-related functions and 31% to sensory perception of smell. Of the molecular function terms, 35% related to olfactory receptor activity, 36% to G -coupled receptor activity and 29% to binding. Regarding the cellular component terms, 37% of the GO terms were related to membrane components, 36% to cytoplasm elements, and 27% to intracellular elements. Enzyme codes could be retrieved for 10 sequences and connected with 8 KEGG biological pathways (in decreasing number of linked sequences): glycerolipid metabolism, fatty acid elongation, drug metabolism, glycerophospholipid metabolism, nitrogen metabolism, starch and sucrose metabolism, Th1 and Th2 cell differentiation, and T cell receptor signaling pathway. Genes present on the region and their GeneCards information as well as significantly associated GO terms and KEGG pathway are reported for each unique high confidence CNVR in the Appendix E. Olfactory receptor genes and immunity-related genes are known to be more often duplicated than other genes and are therefore candidates for CNV studies (Bickhart and Liu, 2014). Genes of the olfactory receptor family (e.g. OR5H8, OR7A10, OLF4, OR2AJ) were contained in six high confidence CNVR on BTA1 (42506338-42587000), BTA7 (8633381-9847400; 10230437- 1049220; 41582849-41756370), BTA10 (27020800-27056000), and BTA15 (79748837- 79817000). Immunity-related genes were found in five high confidence CNVR: Peptidylprolyl Isomerase A (PPIA) was found in a CNVR on BTA1 (93367289-93763299), T Cell Receptor Delta Locus (TRD) was found in a CNVR on BTA10 (22340671-22570397), Complement Factor H (CFH) and referentially Expressed Antigen In Melanoma (PRAME) were found in a CNVR on BTA16 (5701395-5889597 and 54476201-54495676, respectively), and five genes of the DEFB family (DEFB103A, DEFB, DEFB1, DEFB402, DEFB4A) were found in a CNVR on BTA27 (6696801-7186762). Moreover, a CNVR on BTA7 (69606401-69655411) was found to be 59

associated with two KEGG pathways related to immunity function: Th1 and Th2 cell differentiation and T cell receptor signaling. Finally, three regions were found that contained genes related to embryonic development (RET, COL27A1, POPDC3); BTA8 (103487143- 103513600), BTA9 (44726004- 44875567), BTA28 (13354073- 13504624), and eight genes of the homeobox A family contained in one region on BTA4 (68808125- 68911600). Genes of the homeobox A family are known for their important role in embryonic development in all mammalian species (Mark et al., 1997) [Appendix E]. 3.5 Discussion

In a first step of this study, the whole-genome sequences of 96 Holstein animals were aligned and the genotype array variant positions of 109 Holstein animals (the same 96 plus 13 bulls) were updated to the bovine reference genome ARS-UCD1.2. In a second step, copy number variants were identified on the WGS and the GEN data of all animals on a per sample basis. High confidence CNVR were then created in silico following two approaches and putative function of the resulting CNVR were retrieved. Finally, identification of CNV on the same sample with reduced marker panels was analyzed to estimate false positive CNV discovery rates when those were identified on lower density genotype array information. The average percentage of reads mapped to ARS-UCD1.2 was high (>94%) and the observed read depths were comparable with values reached previously when the same samples were mapped to UMD3.1 (data not shown No difference between CNV identified from samples with lower or higher read depth could be observed, a verification that the approach used to identify CNV from WGS data used in CNVnator correctly handled the difference in read depth. Applying a read depth based approach to WGS samples with differing coverage has been found to perform better than approaches based on split read or paired end (Keel et al., 2017). Reduction in the number of markers included for GEN_CNV identification on a per animal basis was mostly due to the loss of markers when updating their position to ARS-UCD1.2. A few more markers (3.8% to 0.4%, depending on the marker panel), however, were removed because of their low GC scores. The remaining markers were still distributed evenly on the genome, which allowed for further analyses of the data.

60

Although relying on the same samples, CNV identification on GEN and WGS data resulted in different sets of variants: more than 250 times more CNV were identified from the WGS data when compared with the GEN_CNV and the total length of the latter was six times higher than that of the WGS_CNV. Such differences in CNV number and length between CNV identified on WGS and GEN information were already described in a study by Zhan et al. (2011). Both measures were different due to the large difference in resolution between GEN and WGS data. GEN discovery only relies on a limited number of genome positions whereas the WGS discovery relies on all 2,716,000 Kb of the sequences (Rosen et al., 2018). The difference in the number and length of CNV was also due to the fact that breakpoints can only be placed at the position of a marker with GEN data, whereas any position of the genome can be defined as a CNV breakpoint using WGS. Furthermore, difference in the set of WGS_CNV and GEN_CNV comes from the relative versus absolute character of both identification methods. Whereas the CNV identification on GEN first rely only on one nucleotide and its signal intensity at a time, relative change in RD between segments of the genome are considered to identify WGS_CNV. More CNL than CNG were observed in both GEN and WGS CNV sets. Generally, CNL are more often found than CNG irrespective of the type of data used for CNV identification (Prinsen et al., 2016; Sasaki et al., 2016; Keel et al., 2017; Mielczarek et al., 2017). Although detection of both CNL and CNG is of importance, CNL are predicted to have more influence as they affect gene dosage and can lead to exposure of a normally unexpressed (deleterious) recessive allele (Bickhart and Liu, 2014). CNL are therefore expected to be found less numerous than CNG due to natural selection against deleterious variants in any species, but still found in higher number with the current CNV identification methods (Wang et al., 2007; Korn et al., 2008; Hou et al., 2012b; Xu et al., 2013), it can be concluded that CNL are easier to detect than CNG. Differences in marker densities were not directly accounted for in this study even though four marker panels were used in our analyses (Table 3-3). The number of CNV identified decreased with the number of markers on which the identification relied. In contrast, the CNV length increased when fewer markers were included, as the distance between markers increased. This is in line with the difference in CNV length observed between the GEN_CNV and the WGS_CNV described earlier in this paper. In this analysis, CNV that were identified with a lower density but not with the HD marker panel were considered false positive. False positive discovery rates of 61

12% for the 50K and of 22% for the 150K dataset were estimated. Both estimated false positive discovery rates were higher than those reported previously from studies based on WGS data, which ranged between 2% and 8% (Stothard et al., 2011; Bickhart et al., 2012). The lower resolution and lower precision of the CNV identification on GEN data could explain the higher rates observed in our study. In our study, the false positive identification rates were estimated assuming that any CNV identified with the HD marker panel was a true CNV. This strong assumption probably led to underestimation of the false positive identification rates but as these were already consequent, it is certain that CNV identification using a single method in silico on a single set of samples is not robust. The compromise between high true positive discovery rates and low false discovery rates of CNV against high numbers of non-identified CNV and high number of falsely detected CNV is addressed with various measures or quality control protocols in each newly described in silico CNV identification method. However, only experimental verification of the identified CNV with qPCR or FISH analyses, for instance, can be considered true validation (Bickhart et al., 2012). These methods have the disadvantage of being lower throughput, resulting in greater time and cost than in silico methods. In this study, sets of CNV were defined to be partly validated (or else said to be of high confidence) when the same region was detected as a CNV in GEN and in WGS data. This methodological consensus approach follows not only the conclusion of Baes et al. (2014), who found the highest quality single nucleotide variants when those were identified by multiple software but also the study by Zhan et al (2011) on CNV identification. Two sets of high confidence CNVR were created with the same samples. The main difference between those two sets lies in the level at which the consensus between the GEN and the WGS data is implemented. The high confidence of the ANIMAL_CNVR is gained through overlapping of CNV of one animal that were identified from GEN and WGS. In contrary, the high confidence of the POPULATION_CNVR is gained through overlapping of CNVR from all animals that were identified on the GEN or the WGS. In our study, the sample size is limited (n = 67 after quality control) so that although the term “population" is used in POPULATION_CNVR, the data do not represent the whole Holstein cattle population. However, an experiment following the same method with a greater number of samples could be considered representative of a population and the term is therefore kept throughout this paper. An important parameter used for the identification 62

of high confidence CNVR is the minimum percentage of reciprocal overlap set to consider two variants equals. Depending on the study, requirements vary from any overlap (minimum 1 bp) to overlap values between 50% and 90% (Hou et al., 2011; Sasaki et al., 2016; Letaief et al., 2017). In order to define this threshold in our study, the number of high confidence CNVR remaining with minimum reciprocal overlap ranging between 10% and 90% was calculated. Apart from a steady reduction of the number of high confidence CNVR with increasing minimum percentages from 20% to 80%, no plateau was observed so that any value in this interval could be selected (Figure 3-4). The greater change in number of high confidence CNVR identified with 10% and 90% of minimum reciprocal overlap are probably due to the difference in the range of the length between WGS_CNVR and GEN_CNVR as for example short WGS_CNVR cannot cover the required proportion of the relatively long GEN_CNVR even if they have breakpoints within the GEN_CNVR. Considering the relatively small sample size and the aim to compare two high confidence CNVR sets created with the datasets of the present study, a minimum percentage of reciprocal overlap of 50% was selected for both approaches. With the variants representing 0.22% and 0.24% of the total genome and the similarity of their length distributions, both sets can first be considered similar. Genome coverage of the CNV reported in previous studies on bovine CNV ranged between 0.1% and 6.0% and led to the conclusion that higher number of samples and higher data resolution (WGS versus GEN) led to higher number of CNV discovered and thus to higher experiment-wise genome coverages (Stothard et al., 2011; Zhan et al., 2011; Salomón-Torres et al., 2015; Letaief et al., 2017). In this study, high confidence CNVR had not only to be identified with both WGS and GEN information but also only relied on 67 samples. The stringent CNVR discovery parameters as well as the relatively small number of samples explain the low coverage values. Differences between the POPULATION_CNVR and the ANIMAL_CNVR, however, were observed at multiple levels. Of the CNVR identified in this study but not previously described, only half were included in the POPULATION_CNVR set, whereas all were included in the ANIMAL_CNVR set. The lower concordance of the CNVR found in the latter can be an indication that this approach is less reliable than the POPULATION_CNVR although low overlap between studies is an expected result (Salomón-Torres et al., 2015; da Silva et al., 2016; Keel et al., 2016; Prinsen et al., 2016; Sasaki et al., 2016; Silva et al., 2016). Over 50% of the ANIMAL_CNVR were only found in one sample 63

but some POPULATION_CNVR identified were shared by all samples so that this approach can be considered more inclusive. The weight given to a single sample/data combination with this approach is lower than that in the ANIMAL_CNVR approach, as CNV can be found relying on one data type in one animal and on the other data type in another animal, but still be part of the high confidence POPULATION_CNVR set. Considering that all samples came from the Holstein population, the regions found in all samples could be population-specific. For instance, a region on BTA7 (pos. 41582849-41756370) could be found in a review on selective sweeps linked with animals of the Holstein breeds only (Gutiérrez-Gil et al., 2015). Annotation and pathway analysis of the unique high confidence CNVR showed results in accordance with previous studies on cattle CNV. Bickhart and Liu (2014) previously described known association in cattle between CNV and traits of importance to the industry related to the immunity of the animals. The olfactory receptor genes family was also described in the same review as a group of candidate genes for higher duplication rates. It is thus no surprise that more than 20% of the unique high confidence CNVR contained genes associated with GO terms or KEGG pathways linked with those two functions. A previous study on Holstein dairy cattle showed SNP regions associated with immunity and reproduction that also contained olfactory receptor genes (Jaton et al., 2018) In addition to olfactory senses and immunity, CNV have been described that are related directly or indirectly to reproduction traits (Liu et al., 2019). Four regions were annotated through GeneCards information with reproductive traits. Remarkably, one of these regions (BTA4;68808125-68911600) contains eight genes of the HoxA gene family. The role of this gene family on embryonic development has been described in a study on Chinese indigenous sheep (Ma et al., 2017). 3.6 Conclusions

Although each approach has strengths and flaws, CNV can be identified with high confidence when multiple methods or data sources form the base of the identification. The update of the bovine reference genome to the ARS-UCD1.2 assembly still leads to comparable CNV identification results with the previous assembly (UMD3.1). Population-wide identification was shown to allow identification of variants that were more often already known, were more inclusive as less weight was given to the individual sample CNV, allowed identification of regions common to all samples,

64

and could more often be linked to a putative function than sample-based identified CNV. When relying only on genotyping signal intensity information, CNV identification is possible and the variants can be used in downstream analysis, but high false positive identification rates should not be ignored. Further research on larger datasets is now needed. These studies should target high confidence CNVR with a population-wide approach and account for false positive variants. 3.7 References

Abyzov, A., A.E. Urban, M. Snyder, and M. Gerstein. 2011. CNVnator: An approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 21:974–984. doi:10.1101/gr.114876.110. Alkan, C., B.P. Coe, and E.E. Eichler. 2011. Genome structural variation discovery and genotyping.. Nat. Rev. Genet. 12:363–76. doi:10.1038/nrg2958. de Almeida Santana, M.H., G.A.O. Junior, A.S.M. Cesar, M.C. Freua, R. da Costa Gomes, S. da Luz e Silva, P.R. Leme, H. Fukumasu, M.E. Carvalho, R.V. Ventura, L.L. Coutinho, H.N. Kadarmideen, and J.B.S. Ferraz. 2016. Copy number variations and genome-wide associations reveal putative genes and metabolic pathways involved with the feed conversion ratio in beef cattle. J. Appl. Genet. 57:495–504. doi:10.1007/s13353-016-0344-7. Agerholm, J.S., F.J. McEvoy, S. Heegaard, C. Charlier, V. Jagannathan, and C. Drögemüller. 2017. A de novo missense mutation of FGFR2 causes facial dysplasia syndrome in Holstein cattle. BMC Genet. 18:74. doi:10.1186/s12863-017-0541-3. Antunes de Lemos, M.V., M.P. Berton, G.M. Ferreira de Camargo, E. Peripolli, R.M. de Oliveira Silva, B. Ferreira Olivieri, A.S. Cesar, A.S.C. Pereira, L.G. de Albuquerque, H.N. de Oliveira, H. Tonhati, and F. Baldi. 2018. Copy number variation regions in Nellore cattle: Evidences of environment adaptation. Livest. Sci. 207:51–58. doi:10.1016/j.livsci.2017.11.008. Baes, C.F., M.A. Dolezal, J.E. Koltes, B. Bapst, E. Fritz-Waters, S. Jansen, C. Flury, H. Signer- Hasler, C. Stricker, R. Fernando, R. Fries, J. Moll, D.J. Garrick, J.M. Reecy, and B. Gredler. 2014. Evaluation of variant identification methods for whole genome sequencing data in dairy cattle.. BMC Genomics 15:948. doi:10.1186/1471-2164-15-948. van den Berg, I., D. Boichard, and M.S. Lund. 2016. Comparing power and precision of within- breed and multibreed genome-wide association studies of production traits using whole- genome sequence data for 5 French and Danish dairy cattle breeds. J. Dairy Sci. 99:8932– 8945. doi:10.3168/jds.2016-11073. Bickhart, D.M., Y. Hou, S.G. Schroeder, C. Alkan, M.F. Cardone, L.K. Matukumalli, J. Song, R.D. Schnabel, M. Ventura, J.F. Taylor, J.F. Garcia, C.P. Van Tassell, T.S. Sonstegard, E.E. Eichler, and G.E. Liu. 2012. Copy number variation of individual cattle genomes using next- generation sequencing. Genome Res. 22:778–790. doi:10.1101/gr.133967.111. Bickhart, D.M., and G.E. Liu. 2014. The challenges and importance of structural variation detection in livestock. Front. Genet. 5:37. doi:10.3389/fgene.2014.00037.

65

Bolger, A.M., M. Lohse, and B. Usadel. 2014. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120. doi:10.1093/bioinformatics/btu170. Boussaha, M., D. Esquerré, J. Barbieri, A. Djari, A. Pinton, R. Letaief, G. Salin, F. Escudié, A. Roulet, S. Fritz, F. Samson, C. Grohs, M. Bernard, C. Klopp, D. Boichard, and D. Rocha. 2015. Genome-wide study of structural variants in bovine Holstein, Montbéliarde and Normande dairy breeds. PLoS One 10:e0135931. doi:10.1371/journal.pone.0135931. Butty, A.M., M. Sargolzaei, F. Miglior, P. Stothard, F.S. Schenkel, B. Gredler-Grandl, and C.F. Baes. 2019. Optimizing Selection of the Reference Population for Genotype Imputation from Array to Sequence Variants. Front. Genet. 10:510. doi:10.3389/FGENE.2019.00510. Cánovas, A., Rincon, G., Islas-Trejo, A., Jimenez-Flores, R., Laubscher, A., Medrano, J.F. 2013. RNA sequencing to study and single nucleotide polymorphism variation associated with citrate content in cow milk. Journal of Dairy Science, 96, 2637-2648, doi: 10.3168/jds.2012-6213. Chen, X., Y. Zhao, S. Zhang, J. Zhang, Z. Gao, Y. Yang, X. Zhao, and Y. Wang. 2018. Forensic genetic informativeness of an SNP panel consisting of 19 multi-allelic SNPs. Forensic Sci. Int. Genet. 34:49–56. doi:10.1016/j.fsigen.2018.01.006. Conesa A, Götz S, García-Gómez JM, Terol J, Talón M, Robles M. 2005. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics. 21: 3674–6. pmid:16081474 Cunningham, F., P. Achuthan, W. Akanni, J. Allen, M.R. Amode, I.M. Armean, R. Bennett, J. Bhai, K. Billis, S. Boddu, C. Cummins, C. Davidson, K.J. Dodiya, A. Gall, C.G. Girón, L. Gil, T. Grego, L. Haggerty, E. Haskell, T. Hourlier, O.G. Izuogu, S.H. Janacek, T. Juettemann, M. Kay, M.R. Laird, I. Lavidas, Z. Liu, J.E. Loveland, J.C. Marugán, T. Maurel, A.C. McMahon, B. Moore, J. Morales, J.M. Mudge, M. Nuhn, D. Ogeh, A. Parker, A. Parton, M. Patricio, A.I. Abdul Salam, B.M. Schmitt, H. Schuilenburg, D. Sheppard, H. Sparrow, E. Stapleton, M. Szuba, K. Taylor, G. Threadgold, A. Thormann, A. Vullo, B. Walts, A. Winterbottom, A. Zadissa, M. Chakiachvili, A. Frankish, S.E. Hunt, M. Kostadima, N. Langridge, F.J. Martin, M. Muffato, E. Perry, M. Ruffier, D.M. Staines, S.J. Trevanion, B.L. Aken, A.D. Yates, D.R. Zerbino, and P. Flicek. 2019. Ensembl 2019. Nucleic Acids Res. 47:D745–D751. doi:10.1093/nar/gky1113. Daetwyler, H.D., A. Capitan, H. Pausch, P. Stothard, R. van Binsbergen, R.F. Brøndum, X. Liao, A. Djari, S.C. Rodriguez, C. Grohs, D. Esquerré, O. Bouchez, M.-N. Rossignol, C. Klopp, D. Rocha, S. Fritz, A. Eggen, P.J. Bowman, D. Coote, A.J. Chamberlain, C. Anderson, C.P. VanTassell, I. Hulsegge, M.E. Goddard, B. Guldbrandtsen, M.S. Lund, R.F. Veerkamp, D.A. Boichard, R. Fries, and B.J. Hayes. 2014. Whole-genome sequencing of 234 bulls facilitates mapping of monogenic and complex traits in cattle. Nat. Genet. 46:858–865. doi:10.1038/ng.3034. Diskin, S.J., M. Li, C. Hou, S. Yang, J. Glessner, H. Hakonarson, M. Bucan, J.M. Maris, and K. Wang. 2008. Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms. Nucleic Acids Res. 36:e126–e126. doi:10.1093/nar/gkn556. Durkin, K., W. Coppieters, C. Drögemüller, N. Ahariz, N. Cambisano, T. Druet, C. Fasquelle, A. 66

Haile, P. Horin, L. Huang, Y. Kamatani, L. Karim, M. Lathrop, S. Moser, K. Oldenbroek, S. Rieder, A. Sartelet, J. Sölkner, H. Stålhammar, D. Zelenika, Z. Zhang, T. Leeb, M. Georges, and C. Charlier. 2012. Serial translocation by means of circular intermediates underlies colour sidedness in cattle. Nature 482:81–84. doi:10.1038/nature10757. Fadista, J., B. Thomsen, L.-E. Holm, and C. Bendixen. 2010. Copy number variation in the bovine genome.. BMC Genomics 11:284. doi:10.1186/1471-2164-11-284. Feuk, L., A.R. Carson, and S.W. Scherer. 2006. Structural variation in the human genome. Nat. Rev. Genet. 7:85–97. doi:10.1038/nrg1767. Götz, S., J.M. García-Gómez, J. Terol, T.D. Williams, S.H. Nagaraj, M.J. Nueda, M. Robles, M. Talón, J. Dopazo, and A. Conesa. 2008. High-throughput functional annotation and data mining with the Blast2GO suite. Nucleic Acids Res. 36:3420–3435. doi:10.1093/nar/gkn176. Gutiérrez-Gil, B., J.J. Arranz, and P. Wiener. 2015. An interpretive review of selective sweep studies in Bos taurus cattle populations: Identification of unique and shared selection signals across breeds. Front. Genet. 6:167. doi:10.3389/fgene.2015.00167. Haeussler, M., A.S. Zweig, C. Tyner, M.L. Speir, K.R. Rosenbloom, B.J. Raney, C.M. Lee, B.T. Lee, A.S. Hinrichs, J.N. Gonzalez, D. Gibson, M. Diekhans, H. Clawson, J. Casper, G.P. Barber, D. Haussler, R.M. Kuhn, and W.J. Kent. 2019. The UCSC Genome Browser database: 2019 update. Nucleic Acids Res. 47:D853–D858. doi:10.1093/nar/gky1095. Hay, E.H.A., Y.T. Utsunomiya, L. Xu, Y. Zhou, H.H.R. Neves, R. Carvalheiro, D.M. Bickhart, L. Ma, J.F. Garcia, and G.E. Liu. 2018. Genomic predictions combining SNP markers and copy number variations in Nellore cattle. BMC Genomics 19:441. doi:10.1186/s12864-018-4787- 6. Hou, Y., D.M. Bickhart, H. Chung, J.L. Hutchison, H.D. Norman, E.E. Connor, and G.E. Liu. 2012a. Analysis of copy number variations in Holstein cows identify potential mechanisms contributing to differences in residual feed intake. Funct. Integr. Genomics 12:717–723. doi:10.1007/s10142-012-0295-y. Hou, Y., G. Liu, D. Bickhart, M. Cardone, K. Wang, E. Kim, L. Matukumalli, M. Ventura, J. Song, P. VanRadan, T. Sonstegard, and C. Van Tassell. 2011. Genomic characteristics of cattle copy number variations. BMC Genomics 12:127. doi:10.1186/1471-2164-12-127. Hou, Y., G.E. Liu, D.M. Bickhart, L.K. Matukumalli, C. Li, J. Song, L.C. Gasbarre, C.P. Van Tassell, and T.S. Sonstegard. 2012b. Genomic regions showing copy number variations associate with resistance or susceptibility to gastrointestinal nematodes in Angus cattle. Funct. Integr. Genomics 12:81–92. doi:10.1007/s10142-011-0252-1. Huang, J., R. Li, X. Zhang, Y. Huang, R. Dang, X. Lan, H. Chen, and C. Lei. 2017. Copy number variation regions detection in Qinchuan cattle. Livest. Sci. 204:88–91. doi:10.1016/j.livsci.2017.08.016. Jaton, C., Schenkel, F., Sargolzaei, M., Cánovas, A., Malchiodi, F., C. A. Price, C. Baes,F. Miglior 2018. Genome-Wide Association Study and in silico functional analysis of the number of embryos produced by Holstein donors. Journal of Dairy Science (8): 7248-7257, doi: 10.3168/jds.2017-13848. 67

Jiang, L., J. Jiang, J. Wang, X. Ding, J. Liu, and Q. Zhang. 2012. Genome-Wide Identification of Copy Number Variations in Chinese Holstein. PLoS One 7:e48732. doi:10.1371/journal.pone.0048732. Jiang, L., J. Jiang, J. Yang, X. Liu, J. Wang, H. Wang, X. Ding, J. Liu, and Q. Zhang. 2013. Genome-wide detection of copy number variations using high-density SNP genotyping platforms in Holsteins. BMC Genomics 14:131. doi:10.1186/1471-2164-14-131. Karimi, K., A. Esmailizadeh, D.D. Wu, and C. Gondro. 2017. Mapping of genome-wide copy number variations in the Iranian indigenous cattle using a dense SNP data set. Anim. Prod. Sci. 58:1192–1200. doi:10.1071/AN16384. Keel, B.N., J.W. Keele, and W.M. Snelling. 2017. Genome-wide copy number variation in the bovine genome detected using low coverage sequence of popular beef breeds. Anim. Genet. 48:141–150. doi:10.1111/age.12519. Keel, B.N., A.K. Lindholm-Perry, and W.M. Snelling. 2016. Evolutionary and functional features of copy number variation in the cattle genome. Front. Genet. 7:207. doi:10.3389/fgene.2016.00207. Kersey, P., A. Kerhornou, S. Haider, A. Kahari, J. Zamora, R.J. Kinsella, J. Almeida-King, D. Staines, P. Derwent, G. Spudich, P. Flicek, and G. Proctor. 2011. Ensembl BioMarts: a hub for data retrieval across taxonomic space. Database. doi:10.1093/database/bar030. Kidd, J.M., R. Sebra, X. Zheng-Bradley, J. Zhang, A. Bashir, K. Walter, K. Chen, E.E. Schadt, D.M. Muzny, C. Zhang, R.E. Handsaker, F. Yu, R.A. Gibbs, T. Zichner, A. Malhotra, T. Rausch, A.M. Stütz, F. Paolo Casale, X. Jasmine Mu, M. Romanovitch, J.A. Walker, Y. Zhang, J. Sebat, C.E. Mason, Y. Kong, N.F. Parrish, X. Shi, M.J.P. Chaisson, G. Marth, E. Cerveira, A. Abyzov, F. Kahveci, M.K. Konkel, M. Pendleton, M. Gujral, J. Chen, O. Stegle, E. Garrison, Z. Chong, M. Malig, F. Hormozdiari, S.E. Devine, S. Meiers, J.O. Korbel, S. Emery, S. McCarthy, P.H. Sudmant, B. Raeder, E.-W. Lameijer, A. Schlattl, J. Huddleston, M.B. Gerstein, A. Quitadamo, L. Clarke, A.A. Shabalin, R.E. Mills, K. Ye, E.J. Gardner, A. Untergasser, A. Noor, P. Chines, P. Flicek, G. Jun, W. Zhou, G. Dayama, T. Bae, S.A. McCarroll, C. Alkan, A. Menelaou, B.J. Nelson, X. Fan, C. Lee, S. Kashin, A. Auton, H.Y.K. Lam, D. Antaki, M. Wang, M.A. Batzer, E. Dal, E.E. Eichler, M. Hsi-Yang Fritz, and L. Ding. 2015. An integrated map of structural variation in 2,504 human genomes. Nature 526:75–81. doi:10.1038/nature15394. Klambauer, G., K. Schwarzbauer, A. Mayr, D.A. Clevert, A. Mitterecker, U. Bodenhofer, and S. Hochreiter. 2012. Cn.MOPS: Mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate. Nucleic Acids Res. 40:e69. doi:10.1093/nar/gks003. Kommadath A, Grant JR, Krivushin K, Butty AM, Baes CF, Carthy TR, Berry DP, and Stothard P. 2019. A large interactive visual database of copy number variants discovered in taurine cattle. Gigascience 8. doi:http://dx.doi.org/10.5524/100600. Korn, J.M., F.G. Kuruvilla, S.A. McCarroll, A. Wysoker, J. Nemesh, S. Cawley, E. Hubbell, J. Veitch, P.J. Collins, K. Darvishi, C. Lee, M.M. Nizzari, S.B. Gabriel, S. Purcell, M.J. Daly, and D. Altshuler. 2008. Integrated genotype calling and association analysis of SNPs, 68

common copy number polymorphisms and rare CNVs. Nat. Genet. 40:1253–1260. doi:10.1038/ng.237. Letaief, R., E. Rebours, C. Grohs, C. Meersseman, S. Fritz, L. Trouilh, D. Esquerré, J. Barbieri, C. Klopp, R. Philippe, V. Blanquet, D. Boichard, D. Rocha, and M. Boussaha. 2017. Identification of copy number variation in French dairy and beef breeds using next-generation sequencing. Genet. Sel. Evol. 49:77. doi:10.1186/s12711-017-0352-z. Li, H., and R. Durbin. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25:1754–1760. doi:10.1093/bioinformatics/btp324. Li, M., Riddle, S., Zhang, H., D'Alessandro, A., Serkova, N. J., Hansen KC, Moldvan R, McKeon A, Frid M, Kumar S, Li H, Liu H, Islas-Trejo A, Cánovas, A, Medrano JF, Thomas MG, Ilska D, Plecita-Hlavata L, Jetek P, Pullamsetti S, Fini MA, El Kasmi KC, Zhang Q, Stenmark K.R. 2016. Metabolic Reprogramming Regulates the Proliferative and Inflammatory Phenotype of Adventitial Fibroblasts in Pulmonary Hypertension through the Transcriptional co-Repressor C-terminal Binding Protein. Circulation, 134 (15): 1105-1121. Liu, G.E., Y. Hou, B. Zhu, M.F. Cardone, L. Jiang, A. Cellamare, A. Mitra, L.J. Alexander, L.L. Coutinho, M.E. Dell’Aquila, L.C. Gasbarre, G. Lacalandra, R.W. Li, L.K. Matukumalli, D. Nonneman, L.C.D.A. Regitano, T.P.L. Smith, J. Song, T.S. Sonstegard, C.P. Van Tassell, M. Ventura, E.E. Eichler, T.G. McDaneld, and J.W. Keele. 2010. Analysis of copy number variations among diverse cattle breeds. Genome Res. 20:693–703. doi:10.1101/gr.105403.110. Liu, M., L. Fang, S. Liu, M.G. Pan, E. Seroussi, J.B. Cole, L. Ma, H. Chen, and G.E. Liu. 2019. Array CGH-based detection of CNV regions and their potential association with reproduction and other economic traits in Holsteins. BMC Genomics 20:181. doi:10.1186/s12864-019- 5552-1. Ma, L., J.B. Cole, Y. Da, and P.M. VanRaden. 2018. Symposium review: Genetics, genome-wide association study, and genetic improvement of dairy fertility traits. J. Dairy Sci. 102:3735– 3743. doi:10.3168/jds.2018-15269. Ma, Q., X. Liu, J. Pan, L. Ma, Y. Ma, X. He, Q. Zhao, Y. Pu, Y. Li, and L. Jiang. 2017. Genome- wide detection of copy number variation in Chinese indigenous sheep using an ovine high- density 600 K SNP array. Sci. Rep. 7:912. doi:10.1038/s41598-017-00847-9. Mark, M., F.M. Rijli, and P. Chambon. 1997. Homeobox genes in embryogenesis and pathogenesis. Pediatr. Res. 42:421–429. doi:10.1203/00006450-199710000-00001. McKenna, A., M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky, K. Garimella, D. Altshuler, S. Gabriel, M. Daly, and M.A. DePristo. 2010. The genome analysis toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20:1297–1303. doi:10.1101/gr.107524.110. Mei, C., H. Wang, Q. Liao, L. Wang, G. Cheng, H. Wang, C. Zhao, S. Zhao, J. Song, X. Guang, G.E. Liu, A. Li, X. Wu, C. Wang, X. Fang, X. Zhao, S.B. Smith, W. Yang, W. Tian, L. Gui, Y. Zhang, R.A. Hill, Z. Jiang, Y. Xin, C. Jia, X. Sun, S. Wang, H. Yang, J. Wang, W. Zhu, and L. Zan. 2017. Genetic Architecture and Selection of Chinese Cattle Revealed by Whole Genome Resequencing. Mol. Biol. Evol. 35:688–699. doi:10.1093/molbev/msx322. 69

Menzi, F., N. Besuchet-Schmutz, M. Fragnière, S. Hofstetter, V. Jagannathan, T. Mock, A. Raemy, E. Studer, K. Mehinagic, N. Regenscheit, M. Meylan, F. Schmitz-Hsu, and C. Drögemüller. 2016. A transposable element insertion in APOB causes cholesterol deficiency in Holstein cattle. Anim. Genet. 47:253–257. doi:10.1111/age.12410. Mesbah-Uddin, M., B. Guldbrandtsen, T. Iso-Touru, J. Vilkki, D.J. De Koning, D. Boichard, M.S. Lund, and G. Sahana. 2018. Genome-wide mapping of large deletions and their population- genetic properties in dairy cattle. DNA Res. 25:49–59. doi:10.1093/dnares/dsx037. Mielczarek, M., M. Frąszczak, R. Giannico, G. Minozzi, J.L. Williams, K. Wojdak-Maksymiec, and J. Szyda. 2017. Analysis of copy number variations in Holstein-Friesian cow genomes based on whole-genome sequence data. J. Dairy Sci. 100:5515–5525. doi:10.3168/jds.2016- 11987. Mielczarek, M., M. Fraszczak, E. Nicolazzi, J.L. Williams, and J. Szyda. 2018. Landscape of copy number variations in Bos taurus: Individual - and inter-breed variability. BMC Genomics 19:410. doi:10.1186/s12864-018-4815-6. Ogata, H., S. Goto, K. Sato, W. Fujibuchi, H. Bono, and M. Kanehisa. 1999. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 27:29–34. doi:10.1093/nar/27.1.29. Pausch, H., R. Emmerling, H. Schwarzenbacher, and R. Fries. 2016. A multi-trait meta-analysis with imputed sequence variants reveals twelve QTL for mammary gland morphology in Fleckvieh cattle. Genet. Sel. Evol. 48:14. doi:10.1186/s12711-016-0190-4. Pirooznia, M., F. Goes, and P.P. Zandi. 2015. Whole-genome CNV analysis: Advances in computational approaches. Front. Genet. 6:138. doi:10.3389/fgene.2015.00138. Prinsen, R.T.M.M., M.G. Strillacci, F. Schiavini, E. Santus, A. Rossoni, V. Maurer, A. Bieber, B. Gredler, M. Dolezal, and A. Bagnato. 2016. A genome-wide scan of copy number variants using high-density SNPs in Brown Swiss dairy cattle. Livest. Sci. 191:153–160. doi:10.1016/j.livsci.2016.08.006. Rafter, P., D.C. Purfield, D.P. Berry, A.C. Parnell, I.C. Gormley, J.F. Kearney, M.P. Coffey, and T.R. Carthy. 2018. Characterization of copy number variants in a large multibreed population of beef and dairy cattle using high-density single nucleotide polymorphism genotype data. J. Anim. Sci. 96:4112–4124. doi:10.1093/jas/sky302. Redon, R., S. Ishikawa, K.R. Fitch, L. Feuk, G.H. Perry, T.D. Andrews, H. Fiegler, M.H. Shapero, A.R. Carson, W. Chen, E.K. Cho, S. Dallaire, J.L. Freeman, J.R. González, M. Gratacòs, J. Huang, D. Kalaitzopoulos, D. Komura, J.R. MacDonald, C.R. Marshall, R. Mei, L. Montgomery, K. Nishimura, K. Okamura, F. Shen, M.J. Somerville, J. Tchinda, A. Valsesia, C. Woodwark, F. Yang, J. Zhang, T. Zerjal, J. Zhang, L. Armengol, D.F. Conrad, X. Estivill, C. Tyler-Smith, N.P. Carter, H. Aburatani, C. Lee, K.W. Jones, S.W. Scherer, and M.E. Hurles. 2006. Global variation in copy number in the human genome. Nature 444:444–454. doi:10.1038/nature05329. Rosen, B., D. Bickhart, R. Schnabel, S. Koren, C. Elsik, A. Zimin, C. Dreischer, S. Schultheiss, R. Hall, S. Schroeder, C. Van Tassell, T. Smith, and J. Medrano. 2018. Modernizing the Bovine Reference Genome Assembly. Proc. World Congr. Genet. Appl. to Livest. Prod. 802.

70

Safran, M., I. Dalah, J. Alexander, N. Rosen, T. Iny Stein, M. Shmoish, N. Nativ, I. Bahir, T. Doniger, H. Krug, A. Sirota-Madi, T. Olender, Y. Golan, G. Stelzer, A. Harel, and D. Lancet. 2010. GeneCards Version 3: the human gene integrator.. Database (Oxford).. doi:10.1093/database/baq020. Sallam, A.M., Y. Zare, F. Alpay, G.E. Shook, M.T. Collins, S. Alsheikh, M. Sharaby, and B.W. Kirkpatrick. 2017. An across-breed genome wide association analysis of susceptibility to paratuberculosis in dairy cattle. J. Dairy Res. 84:61–67. doi:10.1017/S0022029916000807. Salomón-Torres, R., V.M. González-Vizcarra, G.E. Medina-Basulto, M.F. Montaño-Gómez, P. Mahadevan, V.H. Yaurima-Basaldúa, C. Villa-Angulo, and R. Villa-Angulo. 2015. Genome- wide identification of copy number variations in Holstein cattle from Baja California, Mexico, using high-density SNP genotyping arrays.. Genet. Mol. Res. 14:11848–11859. doi:10.4238/2015.October.2.18. Salomon-Torres, R., L.K. Matukumalli, C.P. Van Tassell, C. Villa-Angulo, V.M. Gonzalez- Vizcarra, and R. Villa-Angulo. 2014. High density LD-based structural variations analysis in cattle genome. PLoS One 9:e103046. doi:10.1371/journal.pone.0103046. Salomon-Torres, R., R. Villa-Angulo, and C. Villa-Angulo. 2016. Analysis of copy number variations in Mexican Holstein cattle using axiom genome-wide Bos 1 array. Genomics Data 7:97–100. doi:10.1016/j.gdata.2015.12.007. Sasaki, S., T. Watanabe, S. Nishimura, and Y. Sugimoto. 2016. Genome-wide identification of copy number variation using high-density single-nucleotide polymorphism array in Japanese Black cattle.. BMC Genet. 17:26. doi:10.1186/s12863-016-0335-z. Seroussi, E., G. Glick, A. Shirak, E. Yakobson, J.I. Weller, E. Ezra, and Y. Zeron. 2010. Analysis of copy loss and gain variations in Holstein cattle autosomes using BeadChip SNPs. BMC Genomics 11:673. doi:10.1186/1471-2164-11-673. da Silva, J.M., P.F. Giachetto, L.O. da Silva, L.C. Cintra, S.R. Paiva, M.E.B. Yamagishi, and A.R. Caetano. 2016. Genome-wide copy number variation (CNV) detection in Nelore cattle reveals highly frequent variants in genome regions harboring QTLs affecting production traits. BMC Genomics 17:454. doi:10.1186/s12864-016-2752-9. Silva, V.H. d., L.C. d. A. Regitano, L. Geistlinger, F. Pértille, P.F. Giachetto, R.A. Brassaloti, N.S. Morosini, R. Zimmer, and L.L. Coutinho. 2016. Genome-Wide Detection of CNVs and Their Association with Meat Tenderness in Nelore Cattle. PLoS One 11:e0157711. Stothard, P., J.-W. Choi, U. Basu, J.M. Sumner-Thomson, Y. Meng, X. Liao, and S.S. Moore. 2011. Whole genome resequencing of black Angus and Holstein cattle for SNP and CNV discovery. BMC Genomics 12:559. doi:10.1186/1471-2164-12-559. Stothard, P., X. Liao, A.S. Arantes, M. De Pauw, C. Coros, G.S. Plastow, M. Sargolzaei, J.J. Crowley, J.A. Basarab, F. Schenkel, S. Moore, and S.P. Miller. 2015. A large and diverse collection of bovine genome sequences from the Canadian Cattle Genome Project. Gigascience 4:49. doi:10.1186/s13742-015-0090-5. Wang, K., M. Li, D. Hadley, R. Liu, J. Glessner, S.F.A. Grant, H. Hakonarson, and M. Bucan. 2007. PennCNV: An integrated hidden Markov model designed for high-resolution copy 71

number variation detection in whole-genome SNP genotyping data. Genome Res. 17:1665– 1674. doi:10.1101/gr.6861907. Wiedemar, N., A.-K. Riedi, V. Jagannathan, C. Drögemüller, and M. Meylan. 2015. Genetic Abnormalities in a Calf with Congenital Increased Muscular Tonus. J. Vet. Intern. Med. 29:1418–1421. doi:10.1111/jvim.13599. Winchester, L., C. Yau, and J. Ragoussis. 2009. Comparing CNV detection methods for SNP arrays. Briefings Funct. Genomics Proteomics 8:353–366. doi:10.1093/bfgp/elp017. Xiang, R., I.M. MacLeod, S. Bolormaa, and M.E. Goddard. 2017. Genome-wide comparative analyses of correlated and uncorrelated phenotypes identify major pleiotropic variants in dairy cattle. Sci. Rep. 7:9248. doi:10.1038/s41598-017-09788-9. Xu, L., Y. Hou, D. Bickhart, J. Song, and G. Liu. 2013. Comparative Analysis of CNV Calling Algorithms: Literature Survey and a Case Study Using Bovine High-Density SNP Data. Microarrays 2:171–185. doi:10.3390/microarrays2030171. Zhan, B., J. Fadista, B. Thomsen, J. Hedegaard, F. Panitz, and C. Bendixen. 2011. Global assessment of genomic variation in cattle by genome resequencing and high-throughput genotyping. BMC Genomics 12:557. doi:10.1186/1471-2164-12-557. Zhou, B., X. Zhang, A.E. Urban, R.R. Haraksingh, S.S. Ho, and R. Pattni. 2018. Whole-genome sequencing analysis of CNV using low-coverage and paired-end strategies is efficient and outperforms array-based CNV analysis. J. Med. Genet. 55:735–743. doi:10.1136/jmedgenet- 2018-105272. Zimin, A. V, A.L. Delcher, L. Florea, D.R. Kelley, M.C. Schatz, D. Puiu, F. Hanrahan, G. Pertea, C.P. Van Tassell, T.S. Sonstegard, G. Marçais, M. Roberts, P. Subramanian, J.A. Yorke, and S.L. Salzberg. 2009. A whole-genome assembly of the domestic cow, Bos taurus.. Genome Biol. 10:R42. doi:10.1186/gb-2009-10-4-r42.

72

3.8 Acknowledgments

The authors thank Derek Bickhart and Kiranmayee Bakshy (USDA, Madison, USA) for providing the files needed to translate the CNV positions between UMD3.1 and ARS-UCD1.2. The Semex Alliance (Guelph, Canada), Genex (Shawano, USA), Alta Genetics (Balzac, Canada), Select Sires (Plain City, USA), and Neogen Corporation (Lansing, USA) are acknowledged for providing the array genotype signal intensities. Computations were done at the server facilities provided by the Centre for Genetic Improvement of Livestock, Department of Animal Biosciences at the University of Guelph, ON Canada. We gratefully acknowledge funding by the Efficient Dairy Genome Project, funded by Genome Canada (Ottawa, Canada), Genome Alberta (Calgary, Canada), Ontario Genomics (Toronto, Canada), Alberta Ministry of Agriculture (Edmonton, Canada), Ontario Ministry of Research and Innovation (Toronto, Canada), Ontario Ministry of Agriculture, Food and Rural Affairs (Guelph, Canada), Canadian Dairy Network (Guelph, Canada), GrowSafe Systems (Airdrie, Canada), Alberta Milk (Edmonton, Canada), Victoria Agriculture (Melbourne, Australia), Scotland's Rural College (Edinburgh, UK), USDA Agricultural Research Service (Beltsville, United States), Qualitas AG (Zug, Switzerland), Aarhus University (Aarhus, Denmark). AB is especially grateful to the Qualitas AG team and the Swiss Association for Animal Science for their support.

73

3.9 Tables

Table 3-1. Studies, data type, number of sample and number of breeds composing the CNV dataset from the Database of Genomic Variants archive. Study by Data type No. samples No. breeds Liu et al, 2010 GEN 90 17* Hou et al, 2011 GEN 539 21* Hou et al, 2012 GEN 472 1 Bickhart et al, 2012 WGS 6 4* Boussaha et al, 2015 WGS 62 3 Keel et al, 2016 WGS 175 20 Karimi et al, 2017 GEN 50 8* Mesbah-Uddin et al, 2017 WGS 175 3 *Bos indicus animals were included along with Bos taurus in these studies

74

Table 3-2. Summary of the whole-genome sequence dataset. Average no. of reads Average percentage of Average read No. of Sequencing project (SD*) reads mapped (SD*) depth (SD*) samples Canadian Cattle Genome Project 404,535,408 (70,157,276) 97.0 (1.1) 12x (2x) 42 Efficient Dairy Genome Project 903,140,899 (89,796,943) 97.6 (0.5) 35x (4x) 31 Vetsuisse Faculty 377,871,061 (118,817,596) 94.3 (10.5) 15x (5x) 23 *SD = standard deviation

75

Table 3-3. Summary of the genotyping information dataset. No. of No. of remapped Average no. of markers No. of No. of samples Genotype array markers markers (% total) after QC* (% total) samples after QC* Illumina BovineHD Beadchip 777,962 664,143 (85.4) 660,951 (85.0) 57 47 GeneSeek® Genome Profiler 138,892 119,665 (86.2) 116,135 (83.6) 22 14 Bovine 150K GeneSeek® Genome Profiler 76,883 69,537 (90.4) 68,059 (88.5) 12 3 Bovine HD Illumina BovineSNP50 54,001 39,030 (72.3) 36,991 (68.5) 5 3 Beadchip *QC = quality controls

76

3.10 Figures

ANIMAL_CNVR POPULATION_CNVR

GEN_CNV WGS_CNV GEN_CNV WGS_CNV

l

a

m

i n A overlapping CNV = high confidence CNV

GEN_CNVR WGS_CNVR

n high confidence CNVR

o

i

t a

l overlapping CNVR

u

p o

P Keep CNVR present in more than 5% of samples high confidence CNVR

Figure 3-1. Steps to define high confidence animal (ANIMAL_CNVR) and population (POPULATION_CNVR) copy number variant regions (CNVR).

77

higher density

lower density

copy number loss

Figure 3-2. Copy number variants identified with a lower density marker panel but not with a higher density marker panel are considered false positive results. Vertical bars below the horizontal bar represent markers showing copy number loss. Vertical bars above the horizontal bar represent markers showing copy number gain. Vertical bars across the horizontal bar represents markers showing no copy number variation. In this example, three markers showing a copy number variation were needed to identify a variant.

78

Figure 3-3. Average number of copy number variants (CNV) identified per animal based on the HD (777K), 150K, and 50K marker densities. Letters represents the zones of the figures. The false positive discovery rates were the ratio (c+f)/(all CNV) and (f+g)/(all CNV) for 150K and 50K, respectively.

79

100

R

V

N

C

e 75

c

n

e

d

i

f

n o

c 50

h

g

i

h

f

o

r

e 25

b

m

u N 0 10 20 30 40 50 60 70 80 90

Percentage of reciprocal overlap Figure 3-4. Number of high confidence copy number variant regions (CNVR) identified at the animal level when different minimum percentages were considered for the reciprocal overlap between the copy number variants identified on genotype array or whole-genome sequence information.

80

Figure 3-5. Distribution of the high confidence copy number variant regions (CNVR) across the bovine genome found in different sets of CNVR. Only chromosomes carrying an identified variant are pictured.

81

Figure 3-6. Read alignments of four samples on Bos taurus autosome 20 from 59,901,001 base pairs (bp) to 59,910,000 bp belongs to the unique high confidence copy number variant (CNV) region no. 48. The two top samples have no predicted CNV from whole-genome sequences in the region under the red bar, whereas the two bottom samples have a copy number loss predicted.

82

Figure 3-7. Read alignment of three samples on Bos taurus autosome 20 from 2,759,000 base pairs (bp) to 2,770,000 bp. Although a copy number gain is identified on whole-genome sequences is predicted for the top sample for the region under the red bar, no difference in alignment variation can be observed.

83

CHAPTER 4: Genome-wide association study between copy number variants and hoof health traits in Holstein dairy cattle

Adrien M. Butty1, Tatiane C. S. Chud1, Filippo Miglior1, Flavio S. Schenkel1, Angela Cánovas1, Irene M. Häfliger2, Cord Drögemüller2, Paul Stothard3, Francesca Malchiodi1,4, and Christine F. Baes1,2*

1Centre for Genetic Improvement of Livestock, Department of Animal Biosciences, University of Guelph, Guelph, Ontario, Canada 2Institute of Genetics, Vetsuisse Faculty, University of Bern, Bern, Switzerland 3Department of Agricultural, Food and Nutritional Science, University of Alberta, Edmonton, Alberta, Canada 4The Semex Alliance, Guelph, Ontario, Canada Keywords: genotype, dairy cattle, structural variants, functional analysis, gene

84

4.1 Abstract

Genome-wide associations studies based on single-nucleotide polymorphisms (SNP) have been completed for multiple traits in dairy cattle, however, copy number variants (CNV) add genomic information that has yet to be harnessed. The objectives of this study were to describe CNV identified in a large number of genotyped Holstein animals and to find, annotate and describe significantly associated CNV regions (CNVR) with de-regressed estimated breeding values for hoof health traits. The routine recording of hoof health traits paired with a large number of available genotypes in Canada makes the analysis of the genomic architecture of these traits possible. Lameness is the third main reason worldwide for involuntary culls in the dairy cattle industry, and thus a better understanding of hoof health traits could eventually lead to improvements in overall animal welfare and production. A total of 23,256 CNV comprising 1,645 CNVR were identified in 5,845 Holstein animals. Fourteen CNVR were found to be significantly (P < 0.0005) associated with the de-regressed EBV for the hoof health traits, which included digital dermatitis, interdigital dermatitis, heel horn erosion, sole ulcer, white line lesion, sole hemorrhage, and interdigital hyperplasia. No association was found with toe ulcer. Twenty genes overlap with the significantly associated CNVR, including PRAME, SCART1, NRXN2, KIF26A, GPHN, BTN1A1, and OR7A1. In this study, an impact on infectious hoof lesions could be attributed to the Preferentially Expressed Antigen In Melanoma (PRAME) gene. Almost all genes detected in association with non-infectious hoof lesions could be linked to known metabolic disorders, which are linked to non-infectious lesions. Significantly associated regions were detected although stringent quality controls were applied. Therefore, this explains the few high confidence associations found in this study. Introgression of the knowledge gained in this study into routine genetic evaluation through considering information of associated CNV to the traits of interest into the used models will improve accuracy of estimated breeding values and could further increase the genetic gain for these traits in the Canadian Holstein population, thus reducing the involuntary animal losses due to lameness.

85

4.2 Introduction

Since the implementation of genomic selection in dairy cattle, millions of animals have been genotyped and evaluated. Bi-allelic SNP have been the main focus in dairy cattle genomic research; the inclusion of higher density SNP information available through whole-genome sequencing into genome analyses is expected to catalyze genetic improvement of livestock species (Taylor et al., 2016) and provide better comprehension of the genetic architecture of many economically important traits (Goddard et al., 2016). The knowledge gained on the genetic architecture of traits has allowed testing of individuals at specific loci in order to avoid the propagation of genetic diseases in livestock species. The steadily expanding list of deleterious cattle haplotypes is an example of the growing knowledge based on genotyping efforts (VanRaden et al., 2011; Sahana et al., 2013; Cooper et al., 2014; Kadri et al., 2014; Pausch et al., 2015; Menzi et al., 2016; Guarini et al., 2019). One of the most common methods to discover genotype- phenotype association is genome-wide association analysis (GWAS). Many GWAS studies based on SNP have been performed on dairy cattle, however, fewer association analyses based on other types of variants, such as copy number variants (CNV), have been conducted. CNV are a class of structural variants defined as deletions or insertions in the genome that have a length ranging between 50 base pairs (bp) and 5 megabases (Mb) (Kidd et al., 2015). CNV identification is still challenging, which may explain why only few CNV association studies have been completed to date (Bickhart and Liu, 2014). Multiple methods exist to detect CNV, and all have their advantages and disadvantages (Winchester et al., 2009; Pirooznia et al., 2015). Each CNV identification method handles the trade-off between low false discovery rates and high positive discovery rates in different ways. A previous study showed that identifying CNV using multiple methods or multiple types of information and focusing on overlapping CNV as true variants avoids most false positive results (Zhan et al., 2011). A main factor affecting identification of CNV, whether from whole-genome sequence (WGS) or SNP genotyping information, is the quality of the reference assembly used to map the genomic information (Winchester et al., 2009; Pirooznia et al., 2015). Although its quality does not equal that of the latest human reference assembly, the recently released bovine reference genome ARS-UCD1.2 (Rosen et al., 2018) allows CNV to be identified more precisely in cattle. Thus, further associations between genomic regions and phenotypes can be discovered. 86

CNV identification on array genotype information can only detect CNV starting and ending at marker positions, whereas CNV identification on WGS can be more precise at detecting CNV breakpoints (Alkan et al., 2011). As more and more animals are genotyped, CNV identification on a genotype array is currently the method of choice when the number of available samples is of importance for the downstream CNV analysis planned. This is the case for association analyses where each sample is tested for a large number of data points and a large sample size is needed to accurately estimate the effect of each variant (Spencer et al., 2009). The lack of linkage disequilibrium between any SNP and 20 to 25% of the detected CNV led to the conclusion that CNV carry information that cannot be detected with SNP only. In other words, SNP can be used to tag three quarters of the CNV, but one quarter remain untagged (Xu et al., 2014; Hay et al., 2018). Moreover, CNV cover a greater percentage of the genome than SNP: 1% to 8% (Fadista et al., 2010; Liu et al., 2010; Hou et al., 2011; Stothard et al., 2011; Boussaha et al., 2015; Letaief et al., 2017; Upadhyay et al., 2017) versus 0.03% for a high-density genotype of approximately 777,000 SNP. CNV have also been shown to capture more than 17% of the gene expression variation found in the human genome (Stranger et al., 2007). Another study on the human genome estimated that 40% of the CNV overlap at least partly with protein-coding genes (Conrad et al., 2010), which also indicates highly probable effects of CNV on phenotypes. Furthermore, one study presented higher accuracies of genomic breeding values by combining SNP and CNV information for growth and meat production traits in Bos Indicus, thus showing a possible use of the information gained from association studies relying on CNV (Hay et al., 2018). Associations between traits of economic importance and CNV in dairy cattle have already been demonstrated in multiple studies involving production (Xu et al., 2014; Ben Sassi et al., 2016; Prinsen et al., 2017; Zhou et al., 2018; Liu et al., 2019), reproduction (Glick et al., 2011; Liu et al., 2019), health (Ben Sassi et al., 2016; Durán Aguilar et al., 2017; Liu et al., 2019), and conformation traits (Ben Sassi et al., 2016; Prinsen et al., 2017; Liu et al., 2019). However, there is no study presenting associations between CNV and hoof health-related traits. The major reasons for the early withdraw of dairy cattle from production worldwide are, in order of importance, mastitis, reproductive failure, and hoof disorders (Heringstad et al., 2018). Between 10 and 15% of the involuntary culls observed in previous studies were due to lameness (Green et al., 2002; Cha et al., 2010) and Canadian data showed that almost 40% of all cows presented to a 87

hoof trimmer had at least one foot disorder (Malchiodi et al., 2017). Following the claw health atlas developed by the International Committee for Animal Recording (Egger-Danner et al., 2014), genetic evaluation of hoof health was implemented in 2018 in Canada for three infectious traits: digital dermatitis (DD), interdigital dermatitis (ID), and heel horn erosion (HHE), and for five non- infectious traits: sole ulcer (SU), toe ulcer (TU), white line lesion (WL), sole hemorrhage (SH), and interdigital hyperplasia (IH) following the model developed earlier for DD (Beavers and Van Doormal, 2018; Malchiodi et al., 2018b). Although these traits have low heritability estimates, the large number of animals with both phenotypes and genotypes provide an initial basis for the analysis of associations between hoof health traits and in silico identified CNV. This study aims to describe CNV identified on a large number of genotyped Holstein animals, find associations between the identified CNV and hoof health traits using pseudo-phenotypes, and further functionally annotate and describe the associated CNV regions. 4.3 Materials and methods

4.3.1 Animals and genotypes Illumina report files containing Log R ratio (LRR) and B allele frequency (BAF) data and the related SNP map files of 10,682 Holstein animals were retrieved for CNV identification with PennCNV (Wang et al., 2007). A total of 70 animals (12 cows and 58 bulls) were genotyped with the Illumina BovineHD Beadchip (HD; 777,962 markers), 587 (497 cows and 90 bulls) with the GeneSeek® Genome Profiler Bovine 150K (version 2 and 3; 150K; 138,892 and 139,376 markers), 807 (653 cows and 154 bulls) with the GeneSeek® Genome Profiler Bovine HD (GHD; 76,883 markers), 9,035 (4,007 cows and 5,028 bulls) with the Illumina BovineSNP50 Beadchip (50K; 54,001 markers), and 183 (174 cows and 9 bulls) with the GeneSeek® Genome Profiler Bovine 50K (GMD; 49,463 markers). SNP marker positions were updated from the bovine reference genome assembly UMD3.1 (Zimin et al., 2009) to ARS-UCD1.2 (Rosen et al., 2018) using the information made available on the NAGRP data repository (https://www.animalgenome.org/repository/cattle/UMC_bovine_coordinates/, last accessed 2019-03-14) and markers with a GenCall score below 0.15 were removed on a per-sample basis before CNV identification. The average number of markers for CNV identification after data editing was 680,557 (standard deviation (SD) = 43,730), 136,968 (SD = 4,692), 76,009 (SD =

88

1,282), 46,683 (SD = 1,322), and 46,909 (SD = 307) for the HD, 150K, GHD, 50K, and GMD panels, respectively (Table 4-1). 4.3.2 CNV identification The software program PennCNV (Wang et al., 2007) (version 1.0.3) was used to identify CNV; information on the genotyping signal intensity and the BAF are integrated on a per sample basis into a hidden Markov model to determine the number of copies and genotypes of each CNV. Genotyping information was reduced to markers with known positions on autosomes. LRR values were corrected based on the guanine-cytosine content of the genomic regions 500Kb upstream and downstream of each marker following the method described by Diskin et al. (2008). The detect_cnv.pl script from the PennCNV package was used with the option --test to identify CNV sample by sample including BAF values averaged over all samples genotyped with the same marker panel. After CNV detection, the quality control script of the ParseCNV package (release 20) was used to filter out possible false positive CNV (Glessner et al., 2013). Samples that were filtered out had 1) a genotype call rate below 97%, 2) a high-intensity noise (LRR SD above 0.3), 3) extreme intensity waviness (guanine-cytosine waviness factor above 0.05 after LRR correction), 4) a BAF drift above 0.01, 5) more than 9 CNV identified, or 6) shared more than 50% of their genotypes with another animal. The minimum number of SNP covered by a CNV was set to 10 for samples with HD marker information and three for all others. Of the 56,561 CNV detected with PennCNV, 23,256 variants remained for association analysis that were identified on 5,845 samples after quality control. 4.3.3 Phenotypes Genomic estimated breeding values (GEBV) and heritability estimates for eight hoof health traits including DD, ID, HHE, SU, TU, WL, SH, and IH were retrieved from the April 2019 routine genetic evaluation performed by the Canadian Dairy Network (Guelph, Canada; Table 4-2). Hoof lesions were recorded by 54 trimmers on 206,417 cows from 1,312 herds. In total 345,436 records composed the dataset for breeding value estimation. The model used to estimate the breeding values was the same for all traits and can be presented as 푌 = 퐻퐷 + 푃 + 푇 + 푆 + 푎 + 푝푒 + 푒, where 푌 was 0 or 1 in the absence or presence of a given lesion, 퐻퐷, 푃, 푇, and 푆 were the respective fixed effects of herd by date of trimming, parity, trimmer, and stage of lactation at trimming. The

89

random effects were the animal additive effect 푎, the permanent environmental effect 푝푒, and the residual effect 푒. The reliability of the GEBV was used to quantify the amount of information available for different individuals and values were de-regressed following the method presented by VanRaden et al. (2009). The de-regressed EBV (dEBV) were used as pseudo-phenotype for the association analyses. dEBV were computed for 1,889 bulls for which CNV could be detected and were thus used for association analyses. The average and range values of the dEBV are presented in Table 4-2. 4.3.4 Association analyses The software program ParseCNV (release 20) (Glessner et al., 2013) was used to identify associations between the CNV identified and dEBV of 1,889 Holstein animals. ParseCNV converts the CNV calls into probe-based genotypes. In other words, it separates the markers depending on their CNV genotype, correcting at the same time for family structure based on the parents of each sample given in the Plink .fam file. Animals with no deletion, one deletion or two deletions received the codes 22, 12 or 11, respectively. Animals with no duplication, one duplication or two duplications received the codes 22, 11 or 12, respectively. This probe-based coding allowed to run an association analysis as it would be done for SNP with the method implemented in Plink (version 1.07) (Purcell et al., 2007) and using the --assoc argument. The association analysis was run for deletions and duplications separately. Correction for population structure was also carried out at this stage with the three first cluster dimensions using the --cluster argument in combination with --mds-plot. The model used for association testing was 푦 = 푋푏 + 푒, where 푌 was a vector of dEBV, 푋 was the design matrix of the fixed effect of one CNV genotype at a time, 푏 was the CNV effect, and 푒 was the vector of random residual effects. The output of the association tests was used to merge neighboring (<1Mb) CNV reaching a similar significance level (± 1 log P-value) to CNV regions (CNVR). This method to create CNVR was showed to be flexible and thus appropriate to define the breakpoints of the significantly associated regions (Glessner et al., 2013). P-values for each CNVR were computed with a Wald test based on the regression coefficients and the standard errors of each single CNVR. In order to account for multiple testing, CNVR with a P-value lower than 0.0005 were considered significantly associated as suggested by the ParseCNV developers. Only significantly associated regions that had overlap

90

with CNVR identified with WGS information of the 67 Holstein bulls of a previous study were further analyzed and described to reinforce control against false positive results (Chapter 3). 4.3.5 Description of associated regions Peptide sequences of the associated regions were retrieved from the Ensembl Gene database (Cunningham et al., 2019) (ARS-UCD1.2, annotation release 96) with the Ensembl Biomart tool (Kersey et al., 2011). The OmicsBox (version 1.1.0; new updated software from Blast2GO; Götz et al., 2008)) was used to annotate the significantly associated regions. The GO analyses were performed by taking the three GO categories (biological process, molecular function and cellular component) into account and using OmicsBox (Götz et al., 2008). Coding sequences were annotated with blastx and the OmicsBox mapping and GO annotation routines (Conesa et al., 2005). Query sequences were compared against all the sequences found in the database of the National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov, last accessed 2019-05-31). A significance level of at least 0.001 (e-value) and similarity of at least 70% were needed to consider a reported match for further analysis. The GO significance levels were computed following Fisher’s exact test for multiple testing in OmicsBox. As described by Cánovas et al. (2013) and Li et al. (2016), the OmicsBox suite was used to examine associations between the sequences and biological pathways from the Kyoto Encyclopedia of Genes and Genomes (KEGG; Ogata et al., 1999). Information about the genes annotated in the significantly associated sequences was retrieved from their GeneCards (Safran et al., 2010). 4.4 Results

4.4.1 CNV identification On average, 4 CNV (min: 1, max: 9) were identified per sample on the 5,845 samples remaining after quality control on a per sample basis. Of the 23,256 CNV included in the association analysis, 13,724 were deletions and 9,532 were duplications. The length of the CNV was not parametrically distributed (P < 0.05; Shapiro-Wilk normality test) and ranged from 76bp to 4.17Mb with an average length of 168.52Kb. No statistical difference in the distribution of the length of the CNV was observed between chromosomes (P> 0.05, Wilcoxon rank sum test with continuity correction). The CNV were found on all autosomes with a maximum of 2,775 CNV on the Bos taurus autosome (BTA) 12 and a minimum of 106 CNV on BTA24. Merging CNV with at least 1bp overlap to non-

91

overlapping CNVR reduced the number of variants to 1,645. Accounting for redundancy of variants over the genome, 9.43% of the total bovine genome was found to be covered by a CNVR. (Figure 4-1). 4.4.2 CNVR associated with hoof health traits Association analyses between hoof health traits and the discovered CNV of 1,889 bulls led to the identification of significant associations (P-value < 0.0005) between all traits and 23 CNVR. After control for the presence of the CNVR in the list of variants identified with the WGS information of 67 Holstein bulls, fourteen CNVR remained which were associated with at least one trait. (Table 4-3,Figure 4-2) Associations were found for all traits, except TU. Of the fourteen regions, nine were deletions and five were duplications. The significantly associated regions were distributed on thirteen chromosomes (Figure 4-2) and had an average length of 104 Kb (min. 9.8 Kb, max. 343.3 Kb). No associated CNVR was found in any gap of the reference assembly ARS-UCD1.2. All CNVR, except CNVR3 and CNVR14, were associated with one trait only. CNVR3 was associated with ID and IH, whereas CNVR14 was associated with DD and SU. All regions returned 54 Ensembl peptide sequences. Performing analyses using the OmicsBox mapping and annotation routines, 43 sequences were found to have BLAST hits and genes could be annotated for 11 associated CNVR (Table 4-3). CNVR9 contained the most genes (6), whereas only one gene was found in the sequence of CNVR1, CNVR2, CNVR4, CNVR7, CNVR11, CNVR13, and CNVR14. Associated GO terms in the three main GO categories (biological processes, molecular functions, and cellular component) were identified. At the most informative level of the biological processes, 25% of the GO terms were related to cellular processes, 14% related to metabolic processes, 11% related to biological regulation, and the remaining 50% were distributed over multiple categories that never reached more than 4% of the terms. Regarding the molecular function terms, 51% were related to binding, 27% were related to catalytic activity, 16% were related to transporter activity, and 5% were related to receptor activity. Of the cellular component terms, 46% related to cell parts, 45% related to membrane parts, and 9% related to protein-containing complex. Enzyme codes were retrieved for 13 sequences and associated with 5 KEGG pathways. Among them, the folate biosynthesis pathway was associated with CNVR4 whereas CNVR10 was associated with purine metabolism, alanine, aspartame and glutamate metabolism, cyanoamino metabolism, and thiamine metabolism pathways. 92

4.5 Discussion

In this study, after data editing, 23,256 CNV were identified relying on the genotype array data of 5,845 Holstein individuals aligned to the bovine reference genome ARS-UCD1.2. Association analysis between the identified CNV and de-regressed EBV of all hoof health traits were performed for 1,889 Canadian bulls. CNVR significantly associated with hoof health traits were analyzed for their gene content and putative functions in relation to the traits of interest. The large number of samples included in this study for CNV identification, the use of updated SNP position to ARS-UCD1.2, and the discovery of associations between CNV and hoof health traits make this study novel at multiple levels. Moreover, conservative approaches were applied: a) use of strict quality parameters for CNV identification; b) de-regression of highly reliable EBV only; and c) removal of associated CNVR that were not overlapping with CNV identified on a set of partly similar samples but relying on WGS information. Therefore, the presented associated CNVR and genes are highly reliable candidates for their impact on hoof health traits. 4.5.1 Identified CNV Although a relatively high number of CNV were identified in this study, the average number of four CNV per sample can be compared to results presented in other studies relying on genotype array data. The density of the genotype array used is known to impact the number and length of the variants identified. Results presented in our previous study show that two or five CNV are identified on average from 50K or 150K marker panels, respectively. Considering that most of the samples in this study were genotyped with marker densities between 50K and 150K, the observed average of CNV per sample can be considered valid (Chapter 3). In a similar manner, the average length of the identified CNV (168.52Kb) is equivalent to the average distance between markers of the 50K marker panel after quality control (174.53Kb) (three SNP were needed to consider a CNV valid). Eighty-five percent of the samples on which CNV identification relied were genotyped with the 50K marker panel. This shows that the higher number of samples genotyped with this panel truly influenced the final CNV set (Table 4-1). Similar to previous studies, more deletions were detected than duplications (Boussaha et al., 2015; Keel et al., 2017; Letaief et al., 2017). Current CNV detection methods relying on genotype array information have been shown to detect deletions more often than duplications (Prinsen et al., 2016; Sasaki et al., 2016; Mielczarek et al.,

93

2017). In addition, the CNV were not distributed equally over the bovine autosomes (Figure 4-1). This is due to CNV formation mechanisms (non-allelic homologous recombination, fork stalling and template switching, non-homologous end-joining, and mobile element insertion) that take place more often in some genomic regions than others, in a similar way that recombination events occur more often in hotspots of the genome (Fadista et al., 2010; Bickhart and Liu, 2014). Finally, the genome coverage of the CNVR observed in this study (9.43%) is relatively high in comparison with previous studies that mainly report coverage values below 8% in the cattle genome (Fadista et al., 2010; Liu et al., 2010; Hou et al., 2011; Stothard et al., 2011; Boussaha et al., 2015; Letaief et al., 2017; Upadhyay et al., 2017). This can be explained by the fact that variants were identified on a high number of samples in comparison with previous studies, and that those samples were mostly genotyped with medium density marker panels. Due to the lower number of possible breakpoints than with higher density genotype array information, the CNV identified were longer and covered a greater part of the genome. 4.5.2 Associations between CNVR and hoof health traits Although the analysis relied on a large number of samples, few CNVR were found associated with hoof health traits. Only two regions, CNVR3 and CNVR14, were significantly associated with two traits, whereas all other associated CNVR were linked to one single trait. No region was found to be associated that had already been described in a GWAS on the same traits in the same Canadian Holstein cattle population using SNP (Malchiodi et al., 2018a). The lack of concordance between studies and the low discovery rate can be explained by the relatively stringent conditions CNVR had to meet to be considered significantly associated in this analysis. First, the CNV had to cover at least three SNP. Second, they had to pass the PennCNV defaults and the ParseCNV adjusted filter values. Third, only CNVR that had overlap with CNVR identified on WGS information of partly different samples than the samples that were directly included in this study were kept. Further analyses with less stringent conditions at the time of CNV identification and/or at the time of CNVR association would likely result in a higher number of associated regions. However, the risk for false positives would also be higher. The increase of a false positive rate with the increase of CNV discovered was showed in our previous study (Chapter 3). As the CNV and the resulting CNVR were based on genotype array information, the breakpoints of the associated regions had to be at a marker position. Use of the same WGS CNV information to filter the associated CNVR 94

showed that the true breakpoint of the associated CNVR are probably a few bases next to the breakpoints given by the array information. For example, when observing the read depth of the two top samples on Figure 4-3, a deletion for CNVR7 can be seen that starts before and ends after the region as described with genotype array (red bar). Moreover, the red-colored reads seen in the sequences of the CNV carriers mark reads that were split at the time of alignment, a further hint on the presence of a CNV in this region. The bottom sample on the same figure has no deletion in CNVR7. Definition of the region breakpoint could therefore be more precise with additional sequencing of a selection of animal carriers of deletions or duplications at each of the significantly associated CNVR. 4.5.3 Gene content and putative function of the associated CNVR Of the seven hoof health traits for which associations were found with CNVR, three were infectious hoof lesions (DD, ID, HHE) and four were non-infectious (SU, WL, SH, IH). It was therefore expected that different genomic regions would be associated with these two groups of traits. No CNVR was significantly (P > 0.0005) associated with TU. Of the infectious traits, of which five had associated CNVR, HHE had the most associations. DD was associated with two CNVR and ID with one region only. Of the genes found in the regions associated with HHE, the PRAME gene is a promising candidate. It is known to impact the immune response of cattle and could thus impact resistance against pathogens leading to infection of the hoof horn. Moreover, it is known to be duplicated in a CNV-like manner as its amplification in autosomes of cattle has already been described (Chang et al., 2011). Two genes are present in CNVR12, VPS10 domain- containing receptor SorCS3 (SORCS3), and Scavenger Receptor Family Member Expressed On T Cells 1 (SCART1). Although no real link between SORCS3 and HHE could be made, an impact of the number of SCART1 copies on HHE (an infectious trait) can exist. A study discussing the role of gamma delta immunity related cells in cattle, rodents, and humans present the hypothesis that SCART1-coded are only found in this specific type of delta gamma T cell and serve the recognition of important pathogens (Baldwin et al., 2014). The gene Neurexin 2 (NRXN2) was mapped to CNVR13, which is also associated with HHE. The absence of neurexin affects leukocyte adhesion deficiency type 3 (Safran et al., 2010). In a higher number, this gene could lead to an increased ability of the leukocytes to act in the case of the presence of a pathogen in the organism. Of the genes present in CNVR10, Kinesin Family Member 26A (KIF26A) is the gene 95

which can be related to the other infectious trait DD. This kinesin protein is a part of the microtubules used for the formation of vacuoles in the cells and impacts their stability (Jancsik et al., 1996). The less solidified vacuoles could be more prone to fail their purpose of isolating pathogens in the cell (Mostowy and Shenoy, 2015). No gene could be mapped within the CNVR associated to ID. Relationships between infectious traits and genes are expected in the functions of the immune system of the animals, but non-infectious traits are often related to metabolic or mechanic processes (Heringstad et al., 2018). Metabolic diseases often lead to poor hoof quality and thus higher incidence of lesions. Following metabolic diseases, nutrients are not supplied to the dermal- epidermal junction between the live and the horn tissues of the hoof which slowly degenerate and lead to a lack of support inside the hoof. This can be followed by the apparition of ulcers, hemorrhages, and white line diseases (Lischer and Ossent, 2002). Of the non-infectious traits, SU is the trait associated with regions with the most genes mapped. None of the mapped genes, however, could be linked to this trait. CVNR were significantly associated with SH on BTA10 and on BTA20. The Gephyrin (GPHN) gene on BTA10 was associated with the folate biosynthesis KEGG pathway. Changes in the folate metabolism lead to an increase of metabolites in the blood that may affect hoof quality (Lischer and Ossent, 2002). Even though the genomic region of CNVR9 on BTA20 and its gene content (LPCAT1, SLC6A3, CLPTM1L, TERT, SCL6A18) was already presented in a GWAS between CNV and somatic cell score in a previous study (Durán Aguilar et al., 2017), no direct link could be found with SH. A gene could be mapped to both CNVR associated with WL. The gene BTN1A1 mapped to CNVR11 is associated with the synthesis of milk fat protein and is thus indirectly linked with the metabolism of the animal (Szyda and Komisarek, 2007). More interestingly, both CNVR1 and CNVR9 were also associated with ketosis traits (data not shown), a common metabolic disease of dairy cattle (Duffield, 2000). The only gene mapped to CNVR1 associated with WL was the olfactory receptor OR7A17. High selection pressure on the evolution of olfactory receptor genes knowingly led to the expansion of this gene family and therefore, of CNV (Bickhart and Liu, 2014). The gene OR7A17, however, only covers a part of CNVR1. Therefore, another gene that could have a clear impact on WL could be found in this region.

96

4.6 Conclusions

Identification and association analyses of CNV with hoof health traits found fourteen CNVR significantly associated with infectious and non-infectious hoof lesions. Very strict quality control thresholds were applied not only on identification of CNV, but also during the association analyses. This resulted in only few significantly associated regions but allows those identified to be high confidence associations. Genes were mapped to the associated CNVR which had previously described functions that could also be related to the recorded hoof health traits in Canada. This study is a good foundation for the analysis of association between hoof health traits and in silico identified CNV. Nevertheless, additional data will be needed to strengthen the analysis. Introgression of the knowledge gained in this study into the genetic evaluation through considering information of associated CNV to the traits of interest into the used models could lead to greater genetic improvement rates in the Holstein dairy cattle population, thus reducing the involuntary animal losses due to lameness on farms. 4.7 References

Alkan, C., B.P. Coe, and E.E. Eichler. 2011. Genome structural variation discovery and genotyping.. Nat. Rev. Genet. 12:363–76. doi:10.1038/nrg2958. Baldwin, C.L., H. Hsu, C. Chen, M. Palmer, J. McGill, W.R. Waters, and J.C. Telfer. 2014. The role of bovine γδ T cells and their WC1 co-receptor in response to bacterial pathogens and promoting vaccine efficacy: A model for cattle and humans. Vet. Immunol. Immunopathol. 159:144–155. doi:10.1016/j.vetimm.2014.02.011. Beavers, L., and B. Van Doormal. 2018. Genetic Selection for Improved Hoof Health Is Now Possible. Accessed May 31, 2019. https://www.cdn.ca/document.php?id=513. Bickhart, D.M., and G.E. Liu. 2014. The challenges and importance of structural variation detection in livestock. Front. Genet. 5:37. doi:10.3389/fgene.2014.00037. Boussaha, M., D. Esquerré, J. Barbieri, A. Djari, A. Pinton, R. Letaief, G. Salin, F. Escudié, A. Roulet, S. Fritz, F. Samson, C. Grohs, M. Bernard, C. Klopp, D. Boichard, and D. Rocha. 2015. Genome-wide study of structural variants in bovine Holstein, Montbéliarde and Normande dairy breeds. PLoS One 10:e0135931. doi:10.1371/journal.pone.0135931. Cánovas, A., Rincon, G., Islas-Trejo, A., Jimenez-Flores, R., Laubscher, A., Medrano, J.F. 2013. RNA sequencing to study gene expression and single nucleotide polymorphism variation associated with citrate content in cow milk. Journal of Dairy Science, 96, 2637-2648, doi: 10.3168/jds.2012-6213. Cha, E., J.A. Hertl, D. Bar, and Y.T. Gröhn. 2010. The cost of different types of lameness in dairy cows calculated by dynamic programming. Prev. Vet. Med. 97:1–8. 97

doi:10.1016/j.prevetmed.2010.07.011. Chang, T.C., Y. Yang, H. Yasue, A.K. Bharti, E.F. Retzel, and W.S. Liu. 2011. The expansion of the PRAME gene family in Eutheria. PLoS One 6:e16867. doi:10.1371/journal.pone.0016867. Conesa A, Götz S, García-Gómez JM, Terol J, Talón M, Robles M. 2005. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics. 21: 3674–6. pmid:16081474 Conrad, D.F., D. Pinto, R. Redon, L. Feuk, O. Gokcumen, Y. Zhang, J. Aerts, T.D. Andrews, C. Barnes, P. Campbell, T. Fitzgerald, M. Hu, C.H. Ihm, K. Kristiansson, D.G. Macarthur, J.R. Macdonald, I. Onyiah, A.W.C. Pang, S. Robson, K. Stirrups, A. Valsesia, K. Walter, J. Wei, Wellcome Trust Case Control Consortium, C. Tyler-Smith, N.P. Carter, C. Lee, S.W. Scherer, and M.E. Hurles. 2010. Origins and functional impact of copy number variation in the human genome.. Nature 464:704–12. doi:10.1038/nature08516. Cooper, T.A., G.R. Wiggans, D.J. Null, J.L. Hutchison, and J.B. Cole. 2014. Genomic evaluation, breed identification, and discovery of a haplotype affecting fertility for Ayrshire dairy cattle. J. Dairy Sci. 97:3878–3882. doi:10.3168/jds.2013-7427. Cunningham, F., P. Achuthan, W. Akanni, J. Allen, M.R. Amode, I.M. Armean, R. Bennett, J. Bhai, K. Billis, S. Boddu, C. Cummins, C. Davidson, K.J. Dodiya, A. Gall, C.G. Girón, L. Gil, T. Grego, L. Haggerty, E. Haskell, T. Hourlier, O.G. Izuogu, S.H. Janacek, T. Juettemann, M. Kay, M.R. Laird, I. Lavidas, Z. Liu, J.E. Loveland, J.C. Marugán, T. Maurel, A.C. McMahon, B. Moore, J. Morales, J.M. Mudge, M. Nuhn, D. Ogeh, A. Parker, A. Parton, M. Patricio, A.I. Abdul Salam, B.M. Schmitt, H. Schuilenburg, D. Sheppard, H. Sparrow, E. Stapleton, M. Szuba, K. Taylor, G. Threadgold, A. Thormann, A. Vullo, B. Walts, A. Winterbottom, A. Zadissa, M. Chakiachvili, A. Frankish, S.E. Hunt, M. Kostadima, N. Langridge, F.J. Martin, M. Muffato, E. Perry, M. Ruffier, D.M. Staines, S.J. Trevanion, B.L. Aken, A.D. Yates, D.R. Zerbino, and P. Flicek. 2019. Ensembl 2019. Nucleic Acids Res. 47:D745–D751. doi:10.1093/nar/gky1113. Diskin, S.J., M. Li, C. Hou, S. Yang, J. Glessner, H. Hakonarson, M. Bucan, J.M. Maris, and K. Wang. 2008. Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms. Nucleic Acids Res. 36:e126–e126. doi:10.1093/nar/gkn556. Duffield, T. 2000. Subclinical ketosis in lactating dairy cattle.. Vet. Clin. North Am. Food Anim. Pract. 16:231–253. doi:10.1016/S0749-0720(15)30103-1. Durán Aguilar, M., S.I. Román Ponce, F.J. Ruiz López, E. González Padilla, C.G. Vásquez Peláez, A. Bagnato, and M.G. Strillacci. 2017. Genome-wide association study for milk somatic cell score in holstein cattle using copy number variation as markers. J. Anim. Breed. Genet. 134:49–59. doi:10.1111/jbg.12238. Egger-Danner, C., P. Nielsen, A. Fiedler, K. Müller, T. Fjeldaas, D. Döpfer, V. Daniel, C. Bergsten, G. Cramer, A.M. Christen, K.F. Stock, G. Thomas, M. Holzhauer, A. Steiner, J. Clarke, N. Capion, N. Charfeddine, E. Pryce, E. Oakes, J. Burgstaller, B. Heringstad, C. Ødegård, and J. Kofler. 2014. ICAR Claw Health Atlas. ICAR Tech. Ser. 18:45. Fadista, J., B. Thomsen, L.-E. Holm, and C. Bendixen. 2010. Copy number variation in the bovine 98

genome.. BMC Genomics 11:284. doi:10.1186/1471-2164-11-284. Glessner, J.T., J. Li, and H. Hakonarson. 2013. ParseCNV integrative copy number variation association software with quality tracking. Nucleic Acids Res. 41:e64–e64. doi:10.1093/nar/gks1346. Glick, G., A. Shirak, E. Seroussi, Y. Zeron, E. Ezra, J.I. Weller, and M. Ron. 2011. Fine Mapping of a QTL for Fertility on BTA7 and Its Association With a CNV in the Israeli Holsteins. G3&#58; Genes|Genomes|Genetics 1:65–74. doi:10.1534/g3.111.000299. Goddard, M.E., K.E. Kemper, I.M. MacLeod, A.J. Chamberlain, and B.J. Hayes. 2016. Genetics of complex traits: Prediction of phenotype, identification of causal polymorphisms and genetic architecture. Proc. R. Soc. B Biol. Sci. 283:20160569. doi:10.1098/rspb.2016.0569. Götz, S., J.M. García-Gómez, J. Terol, T.D. Williams, S.H. Nagaraj, M.J. Nueda, M. Robles, M. Talón, J. Dopazo, and A. Conesa. 2008. High-throughput functional annotation and data mining with the Blast2GO suite. Nucleic Acids Res. 36:3420–3435. doi:10.1093/nar/gkn176. Green, L.E., V.J. Hedges, Y.H. Schukken, R.W. Blowey, and A.J. Packington. 2002. The Impact of Clinical Lameness on the Milk Yield of Dairy Cows. J. Dairy Sci. 85:2250–2256. doi:10.3168/JDS.S0022-0302(02)74304-X. Guarini, A.R., M. Sargolzaei, L.F. Brito, V. Kroezen, D.A.L. Lourenco, C.F. Baes, F. Miglior, J.B. Cole, and F.S. Schenkel. 2019. Estimating the effect of the deleterious recessive haplotypes AH1 and AH2 on reproduction performance of Ayrshire cattle. J. Dairy Sci. 102:5315–5322. doi:10.3168/jds.2018-15366. Hay, E.H.A., Y.T. Utsunomiya, L. Xu, Y. Zhou, H.H.R. Neves, R. Carvalheiro, D.M. Bickhart, L. Ma, J.F. Garcia, and G.E. Liu. 2018. Genomic predictions combining SNP markers and copy number variations in Nellore cattle. BMC Genomics 19:441. doi:10.1186/s12864-018-4787- 6. Heringstad, B., C. Egger-Danner, N. Charfeddine, J.E. Pryce, K.F. Stock, J. Kofler, A.M. Sogstad, M. Holzhauer, A. Fiedler, K. Müller, P. Nielsen, G. Thomas, N. Gengler, G. de Jong, C. Ødegård, F. Malchiodi, F. Miglior, M. Alsaaod, and J.B. Cole. 2018. Invited review: Genetics and claw health: Opportunities to enhance claw health by genetic selection. J. Dairy Sci. 101:4801–4821. doi:10.3168/jds.2017-13531. Hou, Y., G. Liu, D. Bickhart, M. Cardone, K. Wang, E. Kim, L. Matukumalli, M. Ventura, J. Song, P. VanRadan, T. Sonstegard, and C. Van Tassell. 2011. Genomic characteristics of cattle copy number variations. BMC Genomics 12:127. doi:10.1186/1471-2164-12-127. Jancsik, V., D. Filliol, and A. Rendon. 1996. Tau proteins bind to kinesin and modulate its activation by microtubules.. Neurobiology (Bp). 4:417–29. Kadri, N.K., G. Sahana, C. Charlier, T. Iso-Touru, B. Guldbrandtsen, L. Karim, U.S. Nielsen, F. Panitz, G.P. Aamand, N. Schulman, M. Georges, J. Vilkki, M.S. Lund, and T. Druet. 2014. A 660-Kb Deletion with Antagonistic Effects on Fertility and Milk Production Segregates at High Frequency in Nordic Red Cattle: Additional Evidence for the Common Occurrence of Balancing Selection in Livestock. PLoS Genet. 10:e1004049. doi:10.1371/journal.pgen.1004049. 99

Keel, B.N., J.W. Keele, and W.M. Snelling. 2017. Genome-wide copy number variation in the bovine genome detected using low coverage sequence of popular beef breeds. Anim. Genet. 48:141–150. doi:10.1111/age.12519. Kersey, P., A. Kerhornou, S. Haider, A. Kahari, J. Zamora, R.J. Kinsella, J. Almeida-King, D. Staines, P. Derwent, G. Spudich, P. Flicek, and G. Proctor. 2011. Ensembl BioMarts: a hub for data retrieval across taxonomic space. Database. doi:10.1093/database/bar030. Kidd, J.M., R. Sebra, X. Zheng-Bradley, J. Zhang, A. Bashir, K. Walter, K. Chen, E.E. Schadt, D.M. Muzny, C. Zhang, R.E. Handsaker, F. Yu, R.A. Gibbs, T. Zichner, A. Malhotra, T. Rausch, A.M. Stütz, F. Paolo Casale, X. Jasmine Mu, M. Romanovitch, J.A. Walker, Y. Zhang, J. Sebat, C.E. Mason, Y. Kong, N.F. Parrish, X. Shi, M.J.P. Chaisson, G. Marth, E. Cerveira, A. Abyzov, F. Kahveci, M.K. Konkel, M. Pendleton, M. Gujral, J. Chen, O. Stegle, E. Garrison, Z. Chong, M. Malig, F. Hormozdiari, S.E. Devine, S. Meiers, J.O. Korbel, S. Emery, S. McCarthy, P.H. Sudmant, B. Raeder, E.-W. Lameijer, A. Schlattl, J. Huddleston, M.B. Gerstein, A. Quitadamo, L. Clarke, A.A. Shabalin, R.E. Mills, K. Ye, E.J. Gardner, A. Untergasser, A. Noor, P. Chines, P. Flicek, G. Jun, W. Zhou, G. Dayama, T. Bae, S.A. McCarroll, C. Alkan, A. Menelaou, B.J. Nelson, X. Fan, C. Lee, S. Kashin, A. Auton, H.Y.K. Lam, D. Antaki, M. Wang, M.A. Batzer, E. Dal, E.E. Eichler, M. Hsi-Yang Fritz, and L. Ding. 2015. An integrated map of structural variation in 2,504 human genomes. Nature 526:75–81. doi:10.1038/nature15394. Letaief, R., E. Rebours, C. Grohs, C. Meersseman, S. Fritz, L. Trouilh, D. Esquerré, J. Barbieri, C. Klopp, R. Philippe, V. Blanquet, D. Boichard, D. Rocha, and M. Boussaha. 2017. Identification of copy number variation in French dairy and beef breeds using next-generation sequencing. Genet. Sel. Evol. 49:77. doi:10.1186/s12711-017-0352-z. Li, M., Riddle, S., Zhang, H., D'Alessandro, A., Serkova, N. J., Hansen KC, Moldvan R, McKeon A, Frid M, Kumar S, Li H, Liu H, Islas-Trejo A, Cánovas, A, Medrano JF, Thomas MG, Ilska D, Plecita-Hlavata L, Jetek P, Pullamsetti S, Fini MA, El Kasmi KC, Zhang Q, Stenmark K.R. 2016. Metabolic Reprogramming Regulates the Proliferative and Inflammatory Phenotype of Adventitial Fibroblasts in Pulmonary Hypertension through the Transcriptional co-Repressor C-terminal Binding Protein. Circulation, 134 (15): 1105-1121. Lischer, C., and P. Ossent. 2002. Pathogenesis of sole lesions attributed to laminitis in cattle. Pages 82–89 in Proceedings of the 12th international symposium on lameness in ruminants, Orlando. Liu, G.E., Y. Hou, B. Zhu, M.F. Cardone, L. Jiang, A. Cellamare, A. Mitra, L.J. Alexander, L.L. Coutinho, M.E. Dell’Aquila, L.C. Gasbarre, G. Lacalandra, R.W. Li, L.K. Matukumalli, D. Nonneman, L.C.D.A. Regitano, T.P.L. Smith, J. Song, T.S. Sonstegard, C.P. Van Tassell, M. Ventura, E.E. Eichler, T.G. McDaneld, and J.W. Keele. 2010. Analysis of copy number variations among diverse cattle breeds. Genome Res. 20:693–703. doi:10.1101/gr.105403.110. Liu, M., L. Fang, S. Liu, M.G. Pan, E. Seroussi, J.B. Cole, L. Ma, H. Chen, and G.E. Liu. 2019. Array CGH-based detection of CNV regions and their potential association with reproduction and other economic traits in Holsteins. BMC Genomics 20:181. doi:10.1186/s12864-019- 5552-1. 100

Malchiodi, F., L. Brito, F. Schenkel, A. M Christen, D. Kelton, and F. Miglior. 2018a. Genome- wide association study and functional analysis of infectious and horn type hoof lesions in Canadian Holstein cattle. Page in Proceedings of the World Congress on Genetics Applied to Livestock Production. Malchiodi, F., J. Jamrozik, A.-M. Christen, G. Kistemaker, P. Sullivan, B. Van Doormal, D. Kelton, F. Schenkel, and F. Miglior. 2018b. Implementation of genomic evaluation for digital dermatitis in Canada. Page in Interbull Bulletin. Interbull Centre. Malchiodi, F., A. Koeck, S. Mason, A.M. Christen, D.F. Kelton, F.S. Schenkel, and F. Miglior. 2017. Genetic parameters for hoof health traits estimated with linear and threshold models using alternative cohorts. J. Dairy Sci. 100:2828–2836. doi:10.3168/jds.2016-11558. Menzi, F., N. Besuchet-Schmutz, M. Fragnière, S. Hofstetter, V. Jagannathan, T. Mock, A. Raemy, E. Studer, K. Mehinagic, N. Regenscheit, M. Meylan, F. Schmitz-Hsu, and C. Drögemüller. 2016. A transposable element insertion in APOB causes cholesterol deficiency in Holstein cattle. Anim. Genet. 47:253–257. doi:10.1111/age.12410. Mielczarek, M., M. Frąszczak, R. Giannico, G. Minozzi, J.L. Williams, K. Wojdak-Maksymiec, and J. Szyda. 2017. Analysis of copy number variations in Holstein-Friesian cow genomes based on whole-genome sequence data. J. Dairy Sci. 100:5515–5525. doi:10.3168/jds.2016- 11987. Mostowy, S., and A.R. Shenoy. 2015. The cytoskeleton in cell-autonomous immunity: Structural determinants of host defence. Nat. Rev. Immunol. 15:559–573. doi:10.1038/nri3877. Ogata, H., S. Goto, K. Sato, W. Fujibuchi, H. Bono, and M. Kanehisa. 1999. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 27:29–34. doi:10.1093/nar/27.1.29. Pausch, H., H. Schwarzenbacher, J. Burgstaller, K. Flisikowski, C. Wurmser, S. Jansen, S. Jung, A. Schnieke, T. Wittek, and R. Fries. 2015. Homozygous haplotype deficiency reveals deleterious mutations compromising reproductive and rearing success in cattle. BMC Genomics 16:312. doi:10.1186/s12864-015-1483-7. Pirooznia, M., F. Goes, and P.P. Zandi. 2015. Whole-genome CNV analysis: Advances in computational approaches. Front. Genet. 6:138. doi:10.3389/fgene.2015.00138. Prinsen, R.T.M.M., A. Rossoni, B. Gredler, A. Bieber, A. Bagnato, and M.G. Strillacci. 2017. A genome wide association study between CNVs and quantitative traits in Brown Swiss cattle. Livest. Sci. 202:7–12. doi:10.1016/j.livsci.2017.05.011. Prinsen, R.T.M.M., M.G. Strillacci, F. Schiavini, E. Santus, A. Rossoni, V. Maurer, A. Bieber, B. Gredler, M. Dolezal, and A. Bagnato. 2016. A genome-wide scan of copy number variants using high-density SNPs in Brown Swiss dairy cattle. Livest. Sci. 191:153–160. doi:10.1016/j.livsci.2016.08.006. Purcell, S., B. Neale, K. Todd-Brown, L. Thomas, M.A.R. Ferreira, D. Bender, J. Maller, P. Sklar, P.I.W. de Bakker, M.J. Daly, and P.C. Sham. 2007. PLINK: a tool set for whole-genome association and population-based linkage analyses.. Am. J. Hum. Genet. 81:559–75. doi:10.1086/519795. Rosen, B., D. Bickhart, R. Schnabel, S. Koren, C. Elsik, A. Zimin, C. Dreischer, S. Schultheiss, 101

R. Hall, S. Schroeder, C. Van Tassell, T. Smith, and J. Medrano. 2018. Modernizing the Bovine Reference Genome Assembly. Proc. World Congr. Genet. Appl. to Livest. Prod. 802. Safran, M., I. Dalah, J. Alexander, N. Rosen, T. Iny Stein, M. Shmoish, N. Nativ, I. Bahir, T. Doniger, H. Krug, A. Sirota-Madi, T. Olender, Y. Golan, G. Stelzer, A. Harel, and D. Lancet. 2010. GeneCards Version 3: the human gene integrator.. Database (Oxford).. doi:10.1093/database/baq020. Sahana, G., U.S. Nielsen, G.P. Aamand, M.S. Lund, and B. Guldbrandtsen. 2013. Novel harmful recessive haplotypes identified for fertility traits in Nordic Holstein cattle. PLoS One 8:e82909. doi:10.1371/journal.pone.0082909. Sasaki, S., T. Watanabe, S. Nishimura, and Y. Sugimoto. 2016. Genome-wide identification of copy number variation using high-density single-nucleotide polymorphism array in Japanese Black cattle.. BMC Genet. 17:26. doi:10.1186/s12863-016-0335-z. Ben Sassi, N., Ó. González-Recio, R. de Paz-del Río, S.T. Rodríguez-Ramilo, and A.I. Fernández. 2016. Associated effects of copy number variants on economically important traits in Spanish Holstein dairy cattle. J. Dairy Sci. 99:6371–6380. doi:10.3168/jds.2015-10487. Spencer, C.C.A., Z. Su, P. Donnelly, and J. Marchini. 2009. Designing genome-wide association studies: Sample size, power, imputation, and the choice of genotyping chip. PLoS Genet. 5:e1000477. doi:10.1371/journal.pgen.1000477. Stothard, P., J.-W. Choi, U. Basu, J.M. Sumner-Thomson, Y. Meng, X. Liao, and S.S. Moore. 2011. Whole genome resequencing of black Angus and Holstein cattle for SNP and CNV discovery. BMC Genomics 12:559. doi:10.1186/1471-2164-12-559. Stranger, B.E., M.S. Forrest, M. Dunning, C.E. Ingle, C. Beazlsy, N. Thorne, R. Redon, C.P. Bird, A. De Grassi, C. Lee, C. Tyler-Smith, N. Carter, S.W. Scherer, S. Tavaré, P. Deloukas, M.E. Hurles, and E.T. Dermitzakis. 2007. Relative impact of nucleotide and copy number variation on gene phenotypes. Science (80-. ). 315:848–853. doi:10.1126/science.1136678. Szyda, J., and J. Komisarek. 2007. Statistical Modeling of Candidate Gene Effects on Milk Production Traits in Dairy Cattle. J. Dairy Sci. 90:2971–2979. doi:10.3168/jds.2006-724. Taylor, J.F., K.H. Taylor, and J.E. Decker. 2016. Holsteins are the genomic selection poster cows. Proc. Natl. Acad. Sci. 113:7690–7692. doi:10.1073/pnas.1608144113. Upadhyay, M., V.H. da Silva, H.-J. Megens, M.H.P.W. Visker, P. Ajmone-Marsan, V.A. Bâlteanu, S. Dunner, J.F. Garcia, C. Ginja, J. Kantanen, M.A.M. Groenen, and R.P.M.A. Crooijmans. 2017. Distribution and Functionality of Copy Number Variation across European Cattle Populations. Front. Genet. 8:1–12. doi:10.3389/fgene.2017.00108. VanRaden, P.M., C.P. Van Tassell, G.R. Wiggans, T.S. Sonstegard, R.D. Schnabel, J.F. Taylor, and F.S. Schenkel. 2009. Invited review: reliability of genomic predictions for North American Holstein bulls.. J. Dairy Sci. 92:16–24. doi:10.3168/jds.2008-1514. VanRaden, P.M., K.M. Olson, D.J. Null, and J.L. Hutchison. 2011. Harmful recessive effects on fertility detected by absence of homozygous haplotypes.. J. Dairy Sci. 94:6153–61. doi:10.3168/jds.2011-4624.

102

Wang, K., M. Li, D. Hadley, R. Liu, J. Glessner, S.F.A. Grant, H. Hakonarson, and M. Bucan. 2007. PennCNV: An integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res. 17:1665– 1674. doi:10.1101/gr.6861907. Winchester, L., C. Yau, and J. Ragoussis. 2009. Comparing CNV detection methods for SNP arrays. Briefings Funct. Genomics Proteomics 8:353–366. doi:10.1093/bfgp/elp017. Xu, L., J.B. Cole, D.M. Bickhart, Y. Hou, J. Song, P.M. VanRaden, T.S. Sonstegard, C.P. Van Tassell, and G.E. Liu. 2014. Genome wide CNV analysis reveals additional variants associated with milk production traits in Holsteins. BMC Genomics 15:683. doi:10.1186/1471-2164-15-683. Zhan, B., J. Fadista, B. Thomsen, J. Hedegaard, F. Panitz, and C. Bendixen. 2011. Global assessment of genomic variation in cattle by genome resequencing and high-throughput genotyping. BMC Genomics 12:557. doi:10.1186/1471-2164-12-557. Zhou, Y., E.E. Connor, G.R. Wiggans, Y. Lu, R.J. Tempelman, S.G. Schroeder, H. Chen, and G.E. Liu. 2018. Genome-wide copy number variant analysis reveals variants associated with 10 diverse production traits in Holstein cattle. BMC Genomics 19:314. doi:10.1186/s12864-018- 4699-5. Zimin, A. V, A.L. Delcher, L. Florea, D.R. Kelley, M.C. Schatz, D. Puiu, F. Hanrahan, G. Pertea, C.P. Van Tassell, T.S. Sonstegard, G. Marçais, M. Roberts, P. Subramanian, J.A. Yorke, and S.L. Salzberg. 2009. A whole-genome assembly of the domestic cow, Bos taurus.. Genome Biol. 10:R42. doi:10.1186/gb-2009-10-4-r42.

103

4.8 Acknowledgements

The Semex Alliance (Guelph, Canada), Genex (Shawano, USA), Alta Genetics (Balzac, Canada), Select Sires (Plain City, USA), Holstein Canada (Brantford, CA) and Neogen Corporation (Lansing, USA) are acknowledged for providing the array genotype signal intensities. Phenotypic information was provided by the Canadian Dairy Network (Guelph, Canada). Computations were done at the server facilities provided by the Centre for Genetic Improvement of Livestock, Department of Animal Biosciences at the University of Guelph (Guelph, Canada). We gratefully acknowledge funding by the Efficient Dairy Genome Project, funded by Genome Canada (Ottawa, Canada), Genome Alberta (Calgary, Canada), Ontario Genomics (Toronto, Canada), Alberta Ministry of Agriculture (Edmonton, Canada), Ontario Ministry of Research and Innovation (Toronto, Canada), Ontario Ministry of Agriculture, Food and Rural Affairs (Guelph, Canada), Canadian Dairy Network (Guelph, Canada), GrowSafe Systems (Airdrie, Canada), Alberta Milk (Edmonton, Canada), Victoria Agriculture (Melbourne, Australia), Scotland's Rural College (Edinburgh, UK), USDA Agricultural Research Service (Beltsville, United States), Qualitas AG (Zug, Switzerland), Aarhus University (Aarhus, Denmark). AB is especially grateful to the Qualitas AG team and the Swiss Association for Animal Science for their support.

104

4.9 Tables

Table 4-1. Number of markers before and after quality control and number of samples for each genotype array used in the study. Genotype array HD 150K GHD 50K GMD Total number of markers 777,962 138,892 76,883 54,001 49,463 Average number of 680,557 136,968 76,009 46,683 46,909 markers after QC* (43,730)** (4,692) (1,282) (1,322) (307) Total number of samples 71 588 808 9,035 183 Number of samples for 5 19 35 1,827 3 association analyses *QC = quality control **Numbers in brackets are the standard deviation

105

Table 4-2. Heritability estimates (and standard deviations) as published by the Canadian Dairy Network used for de-regression of the hoof health EBV and descriptive statistics of the de-regressed EBV. dEBV Trait (abbreviation) Heritability Mean Min. Max. Digital Dermatitis (DD) 0.08 (0.004) 0.27 -0.02 0.86 Interdigital Dermatitis (ID) 0.05 (0.003) 0.16 -0.04 0.62 Heel Horn Erosion (HHE) 0.08 (0.005) 0.26 -0.67 1.14 Sole Ulcer (SU) 0.05 (0.003) 0.41 -1.00 1.00 Toe Ulcer (TU) 0.04 (0.003) -0.01 -0.62 1.00 White Line Lesion (WL) 0.04 (0.003) -0.06 -0.62 0.75 Sole Hemorrhage (SH) 0.03 (0.003) 0.61 -0.33 1.25 Interdigital hyperplasia (IH) 0.07 (0.004) 0.03 -0.73 0.45

106

Table 4-3. CNVR significantly (P < 0.0005) associated with hoof health traits, their type (duplication or deletion), and gene content. CNVR BTA* Start End Type Trait(s) Genes CNVR1 7 10,422,889 10,432,630 DEL WL OR7A17 CNVR2 8 23,776,015 23,878,364 DEL IH MLLT3 CNVR3 9 44,794,304 44,864,222 DUP ID,IH CNVR4 10 78,557,712 78,830,390 DUP SH GPHN CNVR5 12 86,121,984 86,338,161 DEL SU ATP11A, TUBGCP3, SPACA7 CNVR6 15 79,760,818 79,808,157 DEL HHE CNVR7 16 54,477,653 54,495,676 DEL HHE PRAME CNVR8 18 31,109,599 31,125,563 DUP HHE CNVR9 20 70,834,509 71,177,834 DEL SH LPCAT1, CLPTM1L, NDUS6, TERT, SLC6A18, MRPL36 CNVR10 21 68,617,018 68,743,664 DEL DD ASPG, KIF26A CNVR11 23 25,984,486 26,166,446 DEL WL BTN1A1 CNVR12 26 25,491,013 25,509,679 DUP HHE SORCS3, SCART1 CNVR13 29 42,865,742 42,882,539 DUP HHE NRXN2 CNVR14 29 49,648,648 49,670,956 DEL DD,SU SYT8 *BTA stands for “Bos taurus autosome”

107

4.10 Figures

Figure 4-1. Distribution of the copy number variant regions identified on 5,845 samples over the bovine autosomes (black stripes). 108

Figure 4-2. Copy number variant regions associated with hoof health traits: sole hemorrhage (SH), sole ulcer (SU), heel horn erosion (HHE), digital dermatitis (DD), white line lesion (WL), interdigital hyperplasia (IH), and interdigital dermatitis (ID).

109

Figure 4-3. Read depth around the copy number variant region no. 7 (CNVR7; 16:54,474,882-54,500,285) for three sequenced samples. The red bar shows the breakpoints defined for CNVR7 with genotype array information. Deletions can be observed on the two top samples. Red-colored reads in the CNV carrier sequences represent reads split at the time of alignment. No copy number variant is observed on the bottom sample.

110

CHAPTER 5: General discussion and conclusions

5.1 General discussion

Single nucleotide polymorphisms (SNP) from genotype arrays explain much of the genetic variance observed in a population but not its totality (Maher, 2008). Haplotypes and whole-genome sequence (WGS) variants, therefore, were proposed to increase the information carried by each marker and their density to explain a higher percentage of the genetic variance (Daetwyler et al., 2014; Karimi et al., 2018). Haplotypes are also a genetic marker and have an advantage over the markers that can be read by routinely used genotyping arrays, bi-allelic SNP, as they can be multi- allelic. Imputation relies on the linkage disequilibrium (LD) between markers and is often relatively low between SNP with high and low minor allele frequencies (Hayes et al., 2007). The use of haplotypes derived from SNP information partly allows for counterbalancing this lack of linkage between observed variants and rare causative variants. All causative mutations and other genomic regions varying between individuals are expected to be included in the WGS data, even if the WGS data is reduced to a set of bi-allelic single nucleotide variants. Although SNP array data are the input information needed for the selection methods developed in Chapter 2, the selection of animals relies on haplotype alleles. The results of this chapter showed that the animals selected to compose a future sequenced reference population for imputation from array genotype to sequence variants impacts the overall imputation accuracy and particularly the accuracy of rare variant imputation. Therefore, it is important to define the use of the imputed WGS variants at the time of selecting animals composing the imputation reference population. If the sequenced animals are the first of a population to be sequenced and serve for general imputation of genotyped individuals, they should be representative of their population and thus carry mostly common variant alleles. If, however, the imputed variants should serve for the discovery of new and possibly deleterious variants, the animals selected for the reference population should carry novel information: rare variant alleles.

111

To be able to explain a larger proportion of the genetic variance, the addition of other types of markers in analysis is of great importance and copy number variants (CNV) are good candidates. This is discussed in Chapter 3 and Chapter 4. In the same way that an increase in prediction accuracy of two percent is considered a great increase in cattle genomic selection (Pérez-Enciso et al., 2015; VanRaden et al., 2017), the slightest increase in the information content of a genomic dataset, including CNV or even multi-allelic SNP, can be considered a great improvement. Multiple CNV detection methods on array genotype and WGS information have been developed and have their strengths and weaknesses in detection power, detection accuracy, and variant breakpoint definition. Development of quality measurements and validation methods that can be applied in silico, however, are needed to validate the large CNV datasets that will soon be generated. The CNV detection and confirmation pipeline presented in Chapter 3 is based on the idea that CNV can be detected from the genome of the same individual using different methods and data sources with high confidence. Moreover, Chapter 3 presents one of the first CNV detection on cattle genomes that relies on the bovine reference assembly ARS-UCD1.2 (Rosen et al., 2018). On one hand, this chapter showed that merging CNV with any overlap to population- wide CNV regions led to the identification of variants that were more often overlapping with previously published variants. Moreover, these variants could more often be attributed a putative function. On the other hand, the effect of the array genotype density used on the false positive CNV identification rate was demonstrated and this cannot be ignored in downstream analysis of the variants. Making use of the learning outcome in Chapter 3, but using a much larger sample size (10,126 vs 96 animals), Chapter 4 illustrates high confidence CNV detection through stringent quality control thresholds. Moreover, only variants detected on the large number of array genotypes that overlapped with variants detected on the WGS information from Chapter 3 were considered valid. Even though a stringent significance level threshold was applied on the following analysis, significant CNV regions were detected to be associated with hoof health traits. Putative functions related to the infectious or metabolic character of hoof heath traits could be attributed to those CNV regions. More traits must be analyzed in a similar manner to 1) show again the importance of CNV, 2) get an understanding of the genetic architecture for the traits, 3) understand which

112

measures allow to truly assess the genetic background of the traits, and, 4) find high confidence candidate CNV for their introgression into genetic evaluation models.

5.2 Conclusions

Genetic variance is key for animal selection under a breeding program. Opening gradually the “black box” that is the genome with different types of genetic markers, this thesis assessed different genetic markers and their use to discover, describe and make use of genomic information for selection. Advances and discoveries were made based on the North-American Holstein cattle population on three levels: 1) the effect of the individuals from which WGS information is used to impute array genotypes to WGS variants, 2) ways to improve the precision of the in silico identification of CNV, and 3) the genetic architecture of hoof health traits through discovery of CNV regions in association with them. Starting with SNP that are phased to haplotypes and looking at the structural variants that are CNV, this thesis bridges current and possible future genetic markers to exploit the maximum genomic information present in the Holstein cattle population and shows that more genetic variation can be captured with an increased number and diversity of genetic marker types. Altogether, the advances made in this thesis will eventually permit an increase in genetic improvement rates of the dairy cattle breeding industry. Genetic improvement, in comparison with management improvements, has the advantage to be cumulative from generation to generation. However, development of statistical breeding values estimation models that allow parallel inclusion of SNP and CNV information is needed to make possible full leverage of the information gathered in this thesis and in other studies of similar genre.

113

5.3 References

Daetwyler, H.D., A. Capitan, H. Pausch, P. Stothard, R. van Binsbergen, R.F. Brøndum, X. Liao, A. Djari, S.C. Rodriguez, C. Grohs, D. Esquerré, O. Bouchez, M.-N. Rossignol, C. Klopp, D. Rocha, S. Fritz, A. Eggen, P.J. Bowman, D. Coote, A.J. Chamberlain, C. Anderson, C.P. VanTassell, I. Hulsegge, M.E. Goddard, B. Guldbrandtsen, M.S. Lund, R.F. Veerkamp, D.A. Boichard, R. Fries, and B.J. Hayes. 2014. Whole-genome sequencing of 234 bulls facilitates mapping of monogenic and complex traits in cattle. Nat. Genet. 46:858–865. doi:10.1038/ng.3034. Hayes, B.J., A.J. Chamberlain, H. McPartlan, I. MacLeod, L. Sethuraman, and M.E. Goddard. 2007. Accuracy of marker-assisted selection with single markers and marker haplotypes in cattle. Genet. Res. 89:215–220. doi:10.1017/S0016672307008865. Karimi, Z., M. Sargolzaei, J.A.B. Robinson, and F.S. Schenkel. 2018. Assessing haplotype-based models for genomic evaluation in Holstein cattle. Can. J. Anim. Sci. 98:750–759. doi:10.1139/cjas-2018-0009. Maher, B. 2008. Personal genomes: The case of the missing heritability. Nature 456:18–21. doi:10.1038/456018a. Pérez-Enciso, M., J.C. Rincón, and A. Legarra. 2015. Sequence- vs. chip-assisted genomic selection: accurate biological information is advised. Genet. Sel. Evol. 47:43. doi:10.1186/s12711-015-0117-5. Rosen, B., D. Bickhart, R. Schnabel, S. Koren, C. Elsik, A. Zimin, C. Dreischer, S. Schultheiss, R. Hall, S. Schroeder, C. Van Tassell, T. Smith, and J. Medrano. 2018. Modernizing the Bovine Reference Genome Assembly. Proc. World Congr. Genet. Appl. to Livest. Prod. 802. VanRaden, P.M., M.E. Tooker, J.R. O’Connell, J.B. Cole, D.M. Bickhart, J.R. O’Connell, J.B. Cole, and D.M. Bickhart. 2017. Selecting sequence variants to improve genomic predictions for dairy cattle. Genet. Sel. Evol. 49:32. doi:10.1186/s12711-017-0307-4.

114

APPENDICES Appendix A: Description of the WGS samples

Interbull ID Illumina_Nr Project Project no Sample no HOLCANM000000290516 HiSeq2000 CCGP SRP017441 SRR1425144 HOLCANM000000327279 HiSeq2000 CCGP SRP017441 SRR1346392 HOLCANM000000334489 HiSeq2000 CCGP SRP017441 SRR1346390 HOLCANM000000335966 HiSeq2000 CCGP SRP017441 SRR1425131 HOLCANM000000376455 HiSeq2000 CCGP SRP017441 SRR1348583 HOLCANM000005279989 HiSeq2000 CCGP SRP017441 SRR1425136 HOLCANM000005402550 HiSeq2000 CCGP SRP017441 SRR1425128 HOLCANM000005757117 HiSeq2000 CCGP SRP017441 SRR1348592 HOLCANM000006026421 HiSeq2000 CCGP SRP017441 SRR1346386 HOLCANM000006790237 HiSeq2000 CCGP SRP017441 SRR1425132 HOLCANM000006820564 HiSeq2000 CCGP SRP017441 SRR1365114 HOLCANM000006830327 HiSeq2000 CCGP SRP017441 SRR1425134 HOLCANM000006947936 HiSeq2000 CCGP SRP017441 SRR1425127 HOLCANM000007220817 HiSeq2000 CCGP SRP017441 SRR1425162 HOLCANM000007746123 HiSeq2000 CCGP SRP017441 SRR1425129 HOLCANM000007816429 HiSeq2000 CCGP SRP017441 SRR1365119 HOLCANM000009324236 HiSeq2000 CCGP SRP017441 SRR1365101 HOLCANM000009428124 HiSeq2000 CCGP SRP017441 SRR1425166 HOLCANM000100745543 HiSeq2000 CCGP SRP017441 SRR1346389 HOLCANM000101098586 HiSeq2000 CCGP SRP017441 SRR1425160 HOLDEUM000000253642 HiSeq2000 CCGP SRP017441 SRR1425138 HOLGBRM000000598172 HiSeq2000 CCGP SRP017441 SRR1348572 HOLITAM006001001962 HiSeq2000 CCGP SRP017441 SRR1365104 HOLITAM017990915143 HiSeq2000 CCGP SRP017441 SRR1365123 HOLUSAM000002234121 HiSeq2000 CCGP SRP017441 SRR1365115 HOLUSAM000002266677 HiSeq2000 CCGP SRP017441 SRR1425161 HOLUSAM000002284915 HiSeq2000 CCGP SRP017441 SRR1425165 HOLUSAM000002297473 HiSeq2000 CCGP SRP017441 SRR1425140 HOLUSAM000017129288 HiSeq2000 CCGP SRP017441 SRR1425125 HOLUSAM000017349617 HiSeq2000 CCGP SRP017441 SRR1365120 HOLUSAM000050750432 HiSeq2000 CCGP SRP017441 SRR1425130 HOLUSAM000062768990 HiSeq2000 CCGP SRP017441 SRR1425159 HOLUSAM000065258473 HiSeq2000 CCGP SRP017441 SRR1425163 115

HOLUSAM000069080973 HiSeq2000 CCGP SRP017441 SRR1365117 HOLUSAM000123066734 HiSeq2000 CCGP SRP017441 SRR1365116 HOLUSAM000130588960 HiSeq2000 CCGP SRP017441 SRR1425133 HOLUSAM000132967734 HiSeq2000 CCGP SRP017441 SRR1425139 HOLUSAM000132973942 HiSeq2000 CCGP SRP017441 SRR1365147 HOLUSAM000133573930 HiSeq2000 CCGP SRP017441 SRR1365118 HOLUSAM000137974489 HiSeq2000 CCGP SRP017441 SRR1425126 HOLUSAM000139471179 HiSeq2000 CCGP SRP017441 SRR1425156 HOL840M003130854065 HiSeqX EDGP SRP153409 SRR7521425 HOLCANM000012192478 HiSeqX EDGP SRP153409 SRR7521415 HOLCANM000109127990 HiSeqX EDGP SRP153409 SRR7521416 HOL840M003131131371 HiSeqX EDGP SRP153409 SRR7521432 HOLUSAM000072495715 HiSeqX EDGP SRP153409 SRR7521421 HOL840M003128043640 HiSeqX EDGP SRP153409 SRR7521428 HOL840M003128043644 HiSeqX EDGP SRP153409 SRR7521427 HOLUSAM000072306266 HiSeqX EDGP SRP153409 SRR7521420 HOL840M003131131453 HiSeqX EDGP SRP153409 SRR7521431 HOL840M003129128755 HiSeqX EDGP SRP153409 SRR7521423 HOL840M003013523361 HiSeqX EDGP SRP153409 SRR7994015 HOLUSAM000074072229 HiSeqX EDGP SRP153409 SRR7994016 HOLUSAM000072826907 HiSeqX EDGP SRP153409 SRR7994021 HOLCANM000109243044 HiSeqX EDGP SRP153409 SRR7994017 HOL840M003125079126 HiSeqX EDGP SRP153409 SRR7994018 HOL840M003125479421 HiSeqX EDGP SRP153409 SRR7994012 HOL840M003012503598 HiSeqX EDGP SRP153409 SRR7994014 HOLUSAM000073986060 HiSeqX EDGP SRP153409 SRR7994013 HOLUSAM000074284017 HiSeqX EDGP SRP153409 SRR7994019 HOLUSAM000071451889 HiSeqX EDGP SRP153409 SRR7994029 HOLUSAM000071852979 HiSeqX EDGP SRP153409 SRR7521419 HOL840M003128792954 HiSeqX EDGP SRP153409 SRR7994031 HOL840M003129038218 HiSeqX EDGP SRP153409 SRR7994028 HOL840M003123600774 HiSeqX EDGP SRP153409 SRR7994025 HOL840M003014334782 HiSeqX EDGP SRP153409 SRR7994030 HOL840M003130641747 HiSeqX EDGP SRP153409 SRR7994027 HOL840M003125170885 HiSeqX EDGP SRP153409 SRR7994024 HOL840M003123600913 HiSeqX EDGP SRP153409 SRR7521440 HOL840M003129016258 HiSeqX EDGP SRP153409 SRR7521429 HOL840M003124584834 HiSeqX EDGP SRP153409 SRR7521439 HOLUSAM000072512148 HiSeqX EDGP SRP153409 SRR7521422 116

HOL840M003124720445 HiSeqX EDGP SRP153409 SRR7521438 HOLCHEF0000000RM112 HiSeq2000 CHE PRJEB18113 SAMEA19865668 HOLCHEF120077642018 HiSeq2000 CHE PRJEB18113 SAMEA19867168 HOLCHEF120077566710 HiSeq2000 CHE PRJEB18113 SAMEA19864168 HOLCHEF120112591455 HiSeq2000 CHE PRJEB7707 SAMEA3113485 HOLCHEF120110812262 HiSeq2000 CHE PRJEB18113 SAMEA19871668 HOLCHEF120125774197 HiSeq3000 CHE PRJEB18113 SAMEA5159886 HOLCHEF120118922192 HiSeq3000 CHE PRJEB18113 SAMEA4644729 HOLCHEM120111347343 HiSeq3000 CHE PRJEB11963 SAMEA3682653 HOLCHEM0000000RM369 HiSeq2000 CHE PRJEB18113 SAMEA19874668 HOLDNKM000000248199 HiSeq2500 CHE PRJEB18113 SAMEA32983918 HOLDEUF001404042123 HiSeq3000 CHE PRJEB18113 SAMEA32996668 HOLITAM017990516801 HiSeq3000 CHE PRJEB18113 SAMEA33004168 HOLDNKM000000256738 HiSeq3000 CHE PRJEB18113 SAMEA19322668 HOLUNKF0000000RM370 HiSeq2500 CHE PRJEB18113 SAMEA32983168 HOLUNKF0000000RM113 HiSeq3000 CHE PRJEB18113 SAMEA33000418 HOLUNKF0000000RM463 HiSeq2500 CHE PRJEB18113 SAMEA32995168 HOLUNKF0000000RM372 HiSeq2500 CHE PRJEB18113 SAMEA32984668 HOLUNKF0000000RM773 HiSeq3000 CHE PRJEB18113 SAMEA19321918 HOLUNKF0000000RM767 HiSeq3000 CHE PRJEB18113 SAMEA19321168 HOLUNKF0000000RM664 HiSeq3000 CHE PRJEB18113 SAMEA19311418 HOLUNKM0000000RM666 HiSeq3000 CHE PRJEB18113 SAMEA19312168 HOLUSAM000071088599 HiSeq3000 CHE PRJEB18113 SAMEA32995918 HOLUNKM0000000RM673 HiSeq3000 CHE PRJEB12095 SAMEA3706830

117

Appendix B: Description of the aligned WGS

ITBID Number of reads Mapped reads Paired reads Read depth HOLCANM000000290516 459336181 455624161 447646351 14.32 HOLCANM000000327279 282521064 279942891 269016888 8.89 HOLCANM000000334489 339553498 337111677 332947164 10.24 HOLCANM000000335966 462851692 458594932 453364910 13.44 HOLCANM000000376455 663424177 658124642 648831633 18.58 HOLCANM000005279989 454854098 450790644 441533081 13.57 HOLCANM000005402550 433262640 428972837 422686491 12.52 HOLCANM000005757117 354243160 353586267 348033795 10.29 HOLCANM000006026421 354448733 352500785 340192345 10.55 HOLCANM000006790237 444832578 440606393 425380760 13.19 HOLCANM000006820564 371533901 362631631 356587815 10.24 HOLCANM000006830327 462315136 458336454 449538244 13.61 HOLCANM000006947936 462649112 458533707 454515418 13.42 HOLCANM000007220817 381645159 378256070 371823773 10.87 HOLCANM000007746123 464033082 460194444 452172901 13.71 HOLCANM000007816429 425931289 421584311 407592028 12.61 HOLCANM000009324236 429648471 425656962 421248714 12.53 HOLCANM000009428124 356988098 356435402 349656118 10.35 HOLCANM000100745543 333035559 330330991 324642631 10.19 HOLCANM000101098586 329001050 324121827 311924969 9.44 HOLDEUM000000253642 338389530 333863805 321100178 10.53 HOLGBRM000000598172 316117689 314816978 309187612 9.26 HOLITAM006001001962 357479131 356603812 340412111 10.18 HOLITAM017990915143 426531933 423211476 416409797 12.87 HOLUSAM000002234121 410258976 404446760 396178982 12.53 HOLUSAM000002266677 376212856 373241422 367199233 11.28 HOLUSAM000002284915 297271359 295598825 292363277 8.81 HOLUSAM000002297473 334762091 329293870 323267411 9.85 HOLUSAM000017129288 457010697 452764576 443375736 13.27 HOLUSAM000017349617 420181079 415889098 406033788 12.68 HOLUSAM000050750432 457125386 452857485 443904487 13.2 HOLUSAM000062768990 361892244 359123949 354409536 10.99 HOLUSAM000065258473 373969170 371129937 366611584 11.41 HOLUSAM000069080973 438494939 435151663 430226983 13.46

118

HOLUSAM000123066734 432935197 429650027 425327268 13.31 HOLUSAM000130588960 466450155 461953220 450208884 13.86 HOLUSAM000132967734 319310653 314024769 304955819 9.45 HOLUSAM000132973942 493843261 489416795 480251502 13.98 HOLUSAM000133573930 442772744 439195966 434726891 13.65 HOLUSAM000137974489 449327387 436151164 426345017 13.62 HOLUSAM000139471179 319662699 317757947 314686996 9.51 HOL840M003130854065 740850036 739853050 726520568 29.45 HOLCANM000012192478 864725202 863303258 840280533 34.15 HOLCANM000109127990 860741725 859576854 837998987 33.83 HOL840M003131131371 938162183 925776620 911540451 36.52 HOLUSAM000072495715 939533168 927748224 915834020 37.06 HOL840M003128043640 934301843 921059609 909449141 36.39 HOL840M003128043644 854883746 845699052 834916613 33.95 HOLUSAM000072306266 943652189 932521781 922046315 36.78 HOL840M003131131453 894131855 881168602 870326683 34.01 HOL840M003129128755 841956455 830657348 820842706 32.53 HOL840M003013523361 1001809379 990311351 962678832 37.83 HOLUSAM000074072229 999568238 988874005 974113309 36.52 HOLUSAM000072826907 964050138 956358423 941768824 37.47 HOLCANM000109243044 985114620 973879330 958841904 35.16 HOL840M003125079126 988051153 968152479 952245808 38.1 HOL840M003125479421 948877367 938084369 921033980 35.51 HOL840M003012503598 1008365869 995551032 979086915 36.13 HOLUSAM000073986060 1011326945 996453990 982684199 38.71 HOLUSAM000074284017 754047930 748643408 738872726 29.52 HOLUSAM000071451889 902353085 893099751 880450166 35.18 HOLUSAM000071852979 842850812 835678386 823611693 32.32 HOL840M003128792954 667934674 660026949 653401010 23.57 HOL840M003129038218 901387242 893213418 883180453 31.47 HOL840M003123600774 847565121 846870218 827968856 34.38 HOL840M003014334782 871508799 870853730 854777080 34.95 HOL840M003130641747 1000733720 999819562 977458231 40.87 HOL840M003125170885 1033357414 1032380182 1013098994 42.53 HOL840M003123600913 744033841 743197080 729058230 29.86 HOL840M003129016258 907601352 906452323 889553811 36.52 HOL840M003124584834 907599252 906711491 890352390 36.87 HOLUSAM000072512148 886929320 885562687 872836810 35.73 HOL840M003124720445 1017729064 1016788554 992734481 40.41 119

HOLCHEF0000000RM112 506143596 414802018 404420874 11.78 HOLCHEF120077642018 285179084 284227615 280405648 8.57 HOLCHEF120077566710 359018718 356521977 350481340 10.82 HOLCHEF120112591455 537227111 534407111 527367656 16.3 HOLCHEF120110812262 287293683 286664668 284347842 10.4 HOLCHEF120125774197 470776797 469971238 463702338 20.9 HOLCHEF120118922192 363596748 363211558 357436308 17.01 HOLCHEM120111347343 330653720 330136954 323956224 15 HOLCHEM0000000RM369 533922583 532450316 523586260 15.26 HOLDNKM000000248199 303429038 302139853 291716284 11.19 HOLDEUF001404042123 560073218 527506319 515534746 23.73 HOLITAM017990516801 314128944 313608027 307101012 14.55 HOLDNKM000000256738 265818921 265038296 260961852 12.37 HOLUNKF0000000RM370 335060336 333856879 183137256 12.12 HOLUNKF0000000RM113 446940130 446132354 438992412 20.43 HOLUNKF0000000RM463 565113459 434384260 423295506 19.32 HOLUNKF0000000RM372 413718372 411609772 405402722 15.03 HOLUNKF0000000RM773 228975269 228259697 224872608 10.71 HOLUNKF0000000RM767 284077178 281353558 275751308 12.86 HOLUNKF0000000RM664 242736035 242482666 238945266 11.1 HOLUNKM0000000RM666 203318909 203087966 199436584 9.39 HOLUSAM000071088599 556061826 554281662 538548312 25.28 HOLUNKM0000000RM673 297770730 297374821 290680398 13.66

120

Appendix C: Genotype array density, number of markers, and number of identified CNV

ITBID GEN density Markers GEN_CNV WGS_CNV HOL840M003014334782 GGPHD 69388 9 1576 HOL840M003124720445 150K 119570 12 1724 HOL840M003125079126 150K 119528 7 1797 HOL840M003125170885 150K 119600 0 1785 HOL840M003125479421 150K 119601 7 1738 HOL840M003128043640 GGPHD 69418 2 1780 HOL840M003128043644 GGPHD 69405 5 1716 HOL840M003128792954 150K 119590 6 1640 HOL840M003129016258 150K 119335 12 1765 HOL840M003129128755 150K 119457 12 1739 HOL840M003130641747 150K 119557 8 1831 HOL840M003131131453 150K 119500 7 1732 HOLCANM000000290516 HD 662853 8 1791 HOLCANM000000335966 HD 662836 11 1707 HOLCANM000000376455 HD 662732 7 1838 HOLCANM000005279989 HD 662857 3 1690 HOLCANM000005402550 HD 662849 3 1665 HOLCANM000005429693 HD 662694 4 1687 HOLCANM000005757117 HD 662622 5 1575 HOLCANM000006026421 HD 662734 3 1613 HOLCANM000006790237 HD 662837 4 1636 HOLCANM000006830327 HD 662843 1 1667 HOLCANM000006947936 HD 662844 3 1621 HOLCANM000007220817 HD 662819 4 1607 HOLCANM000007746123 HD 662855 3 1660 HOLCANM000007816429 HD 662843 4 1670 HOLCANM000009428124 HD 662845 2 1630 HOLCANM000100745543 HD 662734 2 1657 HOLCANM000101098586 HD 662843 6 1525 HOLCHEF120077566710 HD 664071 7 1799 HOLCHEF120077642018 HD 664068 12 1588 HOLCHEF120110812262 HD 664067 20 1632

121

HOLCHEF120112591455 HD 664070 7 1864 HOLCHEF120118922192 HD 664062 3 2993 HOLCHEM0000000RM369 HD 663888 26 1865 HOLCHEM120111347343 HD 663754 7 2798 HOLDEUF001404042123 HD 664000 8 2973 HOLDEUM000000253642 HD 662854 5 1628 HOLDNKM000000248199 HD 664040 4 1624 HOLDNKM000000256738 150K 119212 5 2758 HOLGBRM000000598172 HD 662716 5 1887 HOLITAM006001001962 HD 662847 4 1605 HOLITAM017990915143 HD 662848 4 1813 HOLUNKF0000000RM370 HD 663927 70 1335 HOLUNKF0000000RM372 HD 663987 13 1657 HOLUNKF0000000RM664 HD 661419 24 2404 HOLUNKF0000000RM773 150K 119235 3 2577 HOLUNKM0000000RM666 HD 663949 3 2356 HOLUSAM000002234121 HD 662840 6 1693 HOLUSAM000002266677 HD 662805 9 4413 HOLUSAM000002284915 HD 662839 4 1589 HOLUSAM000002297473 HD 662855 6 1576 HOLUSAM000017129288 HD 662817 6 1630 HOLUSAM000017349617 HD 662853 5 1729 HOLUSAM000050750432 HD 662858 5 1714 HOLUSAM000062768990 50K 37035 0 2047 HOLUSAM000065258473 50K 37025 0 1937 HOLUSAM000069080973 HD 662854 2 1730 HOLUSAM000073986060 150K 119606 5 1705 HOLUSAM000074072229 150K 119551 7 1772 HOLUSAM000074284017 150K 119587 14 1645 HOLUSAM000123066734 HD 662853 3 1805 HOLUSAM000130588960 HD 662853 4 1715 HOLUSAM000132967734 HD 662841 1 1618 HOLUSAM000133573930 HD 662844 2 1812 HOLUSAM000137974489 HD 662860 2 1616 HOLUSAM000139471179 50K 36885 0 2338

122

Appendix D: Description of the high confidence CNV regions

Type CNVR BTA Start End Length (bp) ANIMAL POPULATION CNVR1 1 42506338 42587000 80663 DEL DEL CNVR2 1 88344064 88526602 182539 DEL MIX CNVR3 1 92845638 92950807 105170 DUP DUP CNVR4 1 93367289 93763299 396011 DEL MIX CNVR5 2 41308201 41330958 22758 DEL / CNVR6 2 77343601 77447413 103813 DUP / CNVR7 2 82604201 82671570 67370 DEL DEL CNVR8 2 123735242 123851299 116058 DEL DEL CNVR9 3 692307 728806 36500 DUP DUP CNVR10 4 68808125 68911600 103476 / MIX CNVR11 4 91444801 91497709 52909 DUP / CNVR12 4 105960151 106053600 93450 DEL / CNVR13 5 106751379 106771800 20422 DUP DUP CNVR14 6 37695352 37736960 41609 DEL DEL CNVR15 6 51884459 52200066 315608 DEL DEL CNVR16 6 116469695 116555166 85472 DUP / CNVR17 7 8633381 9847400 1214020 MIX MIX CNVR18 7 10230437 10492200 261764 DEL MIX CNVR19 7 28613892 28666800 52909 DUP / CNVR20 7 41582849 41756370 173522 DEL MIX CNVR21 8 13075336 13125539 50204 DEL DEL CNVR22 8 34528550 34568189 39640 DEL DEL CNVR23 8 103487143 103513600 26458 DUP / CNVR24 9 30540580 30579950 39371 DEL / CNVR25 9 44726004 44875567 149564 DUP DUP CNVR26 9 91877253 91992541 115289 DEL CNVR27 10 6342565 6358546 15982 DEL CNVR28 10 22340671 22570397 229727 DEL CNVR29 10 27020800 27056000 35201 DEL DEL CNVR30 10 77294140 77308252 14113 DEL CNVR31 12 71055845 71136410 80566 DEL CNVR32 13 42792001 42823557 31557 / MIX CNVR33 14 76782001 76950664 168664 DEL DEL

123

CNVR34 15 40067276 40105374 38099 DEL DEL CNVR35 15 79748837 79817000 68164 DEL DEL CNVR36 16 5701395 5889597 188203 DEL DEL CNVR37 16 9349883 9585237 235355 DEL DEL CNVR38 16 54476201 54495676 19476 DEL DEL CNVR39 17 69606401 69655411 49011 DEL / CNVR40 17 71875018 71931712 56695 DUP / CNVR41 18 12357563 12486000 128438 DUP DUP CNVR42 18 13314001 13398000 84000 DUP MIX CNVR43 18 27822572 28303561 480990 DUP MIX CNVR44 19 52157782 52264948 107167 DUP / CNVR45 20 17223949 17248086 24138 DUP / CNVR46 20 39311943 39349989 38047 DUP DUP CNVR47 20 55486598 55622986 136389 DEL DEL CNVR48 20 59885213 59962804 77592 DUP MIX CNVR49 21 29395417 29456331 60915 DUP / CNVR50 21 32064601 32151192 86592 / DUP CNVR51 22 25322727 25397487 74761 DEL / CNVR52 22 41750031 41769379 19349 DEL / CNVR53 23 24491525 24548270 56746 DUP / CNVR54 23 26717308 26841090 123783 DUP MIX CNVR55 27 6696801 7186762 489962 / MIX CNVR56 27 38222988 38409429 186442 DUP MIX CNVR57 28 13354073 13504624 150552 / MIX

124

Appendix E: Description of the high confidence CNV regions (cont’d)

No Samples No Samples Already CNVR ANIMAL POPULATION known Genes CNVR1 2 34 X OR5H8, OR5H2 CNVR2 1 67 X PLCXD1, PPP2R5B, GTPBP6 CNVR3 1 15 RF00001 CNVR4 2 4 X NLGN1, PPIA CNVR5 1 KCNJ3 CNVR6 1 CNVR7 1 5 CNVR8 1 10 X CNVR9 2 6 DCAF6 CNVR10 7 X RF02043, RF02042, RF02041, RF02040, RF02142, RF02141, RF02140, RF02139, RF02138, RF02137, HOXA11, HOXA10, HOXA13, HOXA9, HOXA7, HOXA6, HOXA5, HOXA3,bta-mir-196b CNVR11 1 GRM8 CNVR12 1 TRBV24-1 CNVR13 1 4 CNVR14 1 9 X CNVR15 5 8 X CNVR16 1 POLN CNVR17 14 67 X RF00026, OLF4-like (10 times), OR7A10, OR7A17-like (9 times) CNVR18 4 52 X OLF4-like (3 times), OR7A10, OR7A17 CNVR19 1 CNVR20 3 67 OR2AJ1, OR2L2 CNVR21 5 8 X CNVR22 2 5 X CNVR23 1 X COL27A1 CNVR24 1 X CNVR25 9 21 POPDC3 CNVR26 1 TFB1M, CLDN20 CNVR27 1 X RF00026

125

CNVR28 1 TRD CNVR29 5 8 X OR11H6, OR11G2-like (2 times) CNVR30 2 X CNVR31 1 X CNVR32 4 ABHD12, PYGB CNVR33 7 10 CNVR34 5 11 X CNVR35 25 67 X OR5M11, OR5M3 CNVR36 1 67 X CFH CNVR37 3 30 CNVR38 4 11 X PRAME CNVR39 1 CNVR40 3 SLC5A1 CNVR41 6 12 MTHFSD, FOXC2, FOXL1 CNVR42 4 11 SLC7A5, CA5A, BANP CNVR43 9 15 X CNVR44 3 bta-mir-10164 CNVR45 1 CNVR46 2 6 X RAI14 CNVR47 5 11 X CNVR48 3 43 CNVR49 1 X CNVR50 5 RCN2, PSTPIP1 CNVR51 1 PABPN1L CNVR52 1 X CNVR53 1 PKHD1 CNVR54 1 67 X CNVR55 67 X DEFB103A, EBD, DEFB, DEFB1, DEFB402, DEFB4A CNVR56 1 12 SH2D4A, CSGALNACT1, F00001 CNVR57 4 RET

126

Appendix F: Academic experiences

Regular courses ANSC*6390 – QTL and Genetic Markers – Winter 2016. Instructor: Dr. Flávio Schenkel ANSC*6210 – Selection in Animal Breeding – Winter 2016. Instructor: Dr. Angela Cánovas STAT*6950 – Statistical Methods for Life Sciences – Fall 2016. Instructor: Dr. Gary Umphrey. ANSC*6370 – Quant. Genetics and Animal Models – Fall 2016. Instructor: Dr. Flávio Schenkel ANSC*6600 – Seminar – Fall 2016. Instructor: Dr. Georgia Mason All courses were taken at the University of Guelph Short courses Dissertation Boot Camp – July 2017 – Instructor: Dr. Jodie Salter – University of Guelph Design of Breeding Programs with Genomic Selection – June 2018 – Instructors: Dr. Jack Dekkers and Dr. Julius Van der Werf – University of Guelph Multivariate Statistics in Animal Breeding/Science Using R – July 2018 – Instructor: Dr. Nicolò Macciotta – University of Guelph Writing and Presenting Scientific Papers – August 2019 – Instructors: Dr. Phil Garnsworthy and Dr. Mike Grossman – BE-Ghent International conferences Annual meeting of the European Association for Animal Sciences (EAAP) – August 2016, IR- Belfast and August 2019, BE-Ghent Annual meeting of the American Dairy Science Association (ADSA) – June 2017, US-Pittsburgh and June 2018, US-Knoxville Plant and Animal Genome Conference – January 2017, US-San Diego Gordon Research Seminar and Conference, Quantitative Genetics & Genomics – February 2017, US-Galveston World Congress on Genetics Applied to Livestock Production – February 2018, NZ-Auckland International Society for Animal Genetics Conference – July 2019, ES-Lleida National meetings 127

Dairy Cattle Breeding and Genetics Committee Meeting – Winter and Fall 2016, 2017, 2018, and 2019, CA-Guelph Open Industry Sessions – Spring 2016, 2018, and 2019, CA-Guelph Swiss Animal Breeding – Technology Platform Meeting – June 2016, CH-Zug Annual Meeting of the Swiss Association for Animal Science – April 2017, 2018, and 2019, diverse places in CH Symposiums of the Efficient Dairy Genome Project – October 2017 and December 2018, CA- Guelph Externships Qualitas AG, CH-Zug – Supervisors: Dr. Birgit Gredler-Grandl, Dr. Mirjam Frischknecht, Dr. Jürg Moll – Summer 2016, Summer 2017, Spring 2018 University of Alberta – Supervisors: Dr. Paul Stothard, Dr. Kirill Krivushin – Winter 2017 Other activities Tenor with the Gryphon’s Singers – Conductor: Dr. Marta McCarthy – Winter 2016, Fall 2016, Winter 2018, Fall 2018 Tenor in the choir of the opening concert for the Elora Music Festival – Conductor: Judith Yan – Summer 2018 Reviewer for the journals: Revista Brasileira de Zootecnica and Journal of Dairy Science Weekly seminar hosted by the Centre for Genetic Improvement of Livestock Participation to the CGIL graduate students and postdoc journal club Coordination of the maintenance of the CGIL Nespresso Coffee Machine Interview about the EDGP for the French-Ontarian television channel ICI TELEVISION – Winter 2019 Participation to the OAC Why and How podcast – Summer 2019 Teaching experience MBG*2400 – Plants & Animal Genetics – Fall 2016 – Instructor: Dr. Andrew Robinson MBG*4030 – Companion Animal Genetics – Fall 2018 – Instructor: Dr. Andrew Robinson

128

Awards Young Academic Promotion Grant – Swiss Association for Animal Sciences – 2016 to 2019 Frank Wallace Cockshutt Scholarship – University of Guelph – 2016 and 2017 Ontario Graduate Scholarship – University of Guelph – 2017 to 2018 Travel Award to the WCGALP 2018 – Genome Alberta – 2018 Taffy Davison Travel Grant – University of Guelph – 2018 Dr. G.W. Friars Award – University of Guelph – 2018 Hamilton Milk Producers’ Association Scholarship – University of Guelph – 2018 Travel bursary to the Annual Conference – International Society of Animal Genetics – 2019

129