Western University Scholarship@Western

Electronic Thesis and Dissertation Repository

4-18-2019 2:00 PM

Characterization of genomic copy number variation in Mus musculus associated with the germline of inbred and wild mouse populations, normal development, and cancer

Maja Milojevic The University of Western Ontario

Supervisor Hill, Kathleen A. The University of Western Ontario

Graduate Program in Biology A thesis submitted in partial fulfillment of the requirements for the degree in Doctor of Philosophy © Maja Milojevic 2019

Follow this and additional works at: https://ir.lib.uwo.ca/etd

Part of the Genetics and Genomics Commons

Recommended Citation Milojevic, Maja, "Characterization of genomic copy number variation in Mus musculus associated with the germline of inbred and wild mouse populations, normal development, and cancer" (2019). Electronic Thesis and Dissertation Repository. 6146. https://ir.lib.uwo.ca/etd/6146

This Dissertation/Thesis is brought to you for free and open access by Scholarship@Western. It has been accepted for inclusion in Electronic Thesis and Dissertation Repository by an authorized administrator of Scholarship@Western. For more information, please contact [email protected].

Abstract

Mus musculus is a human commensal species and an important model of human development and disease. Approaches are needed to determine the contribution of copy number variants (CNVs) to genetic variation in laboratory and wild mice, and to variation arising with normal mouse development and disease. Here, the Mouse Diversity Genotyping Array (MDGA) approach to CNV detection is developed to characterize CNV differences between laboratory and wild mice, between multiple normal tissues of the same mouse, and between primary mammary gland tumours and metastatic lung tissue.

A CNV detection pipeline was used in conjunction with evaluated probe sets, targeting 925,378 loci at an inter-probe-set median distance of 319 bp, to identify CNVs in a publicly available dataset that includes representatives of 114 classical laboratory (CL) strain mice, 52 wild-derived (WD) mice, and 19 wild-caught (WC) mice. On average, WC and WD mice (~50 CNVs/mouse) have twice as many CNVs as CL mice. DdPCR confirmed 96% of MDGA-predicted copy number states. CL CNVs impact pathways related to immunity and nucleosome-associated functions, whereas olfaction and pheromone detection are impacted in WC mice. WD mice share impacted genic pathways with both cohorts.

In a five-member C57BL/6J inbred mouse family, losses of developmentally-important HOXA were detected and confirmed in multiple normal tissues. Further confirmation of postzygotic Hoxa13 losses in unrelated C57BL/6J, CBA/CaJ, and DBA/2J mice points to a widespread phenomenon occurring in mice, involving mutation hotspots and/or programmed losses.

In comparison to normal tissues (25 CNVs/mouse), cancer samples from an MMTV-PyMT mouse breast cancer model with lung metastasis have 1.6- to 3.2-fold more CNVs. CNV size is reduced and CNV recurrence is increased among primary tumours in the absence of the hyaluronan-mediated motility receptor, suggestive of altered mechanisms of CNV formation and selection for specific phenotypes in the tumour microenvironment, respectively.

CNVs were found to arise during normal development, producing different CNV profiles than with tumorigenesis and metastasis. CNV profiles also differ between laboratory and wild mice. This thesis presents improvements to an array-based CNV detection and analysis pipeline which was used to determine the contribution of CNVs to genetic variation in M. musculus.

Keywords

Mus musculus, Mouse Diversity Genotyping Array, copy number variants, single nucleotide polymorphism, somatic mosaicism, de novo genetic variation, cancer, classical laboratory strains, wild-derived strains, wild-caught mice.


Co-Authorship Statement

Chapters 2 and 3 contain material from a manuscript published in BMC Genomics on July 4, 2015, entitled: “Genomic copy number variation in Mus musculus”. This publication was co-authored by M. Elizabeth O. Locke, Susan T. Eitutis, Nisha Patel, Andrea E. Wishart, Mark Daley and Kathleen A. Hill. I generated the filtered probe lists, compiled a list of genes expected to remain consistent in copy number, performed analysis of genic impact of CNVs, and confirmed putative CNVs. These components are included in the thesis while work primarily done by others was excluded. The CNV calls from this publication are presented in Chapter 3 in the broad survey of mouse genotypes and were generated by M.E.O. Locke. M.E.O. Locke also determined CNV concordance with previous studies. M.E.O. Locke, K.A. Hill, and I drafted the manuscript. S. Eitutis provided the initial probe list and filtering criteria. N. Patel performed probe to reference alignment. A.E. Wishart provided helpful discussions pertaining to the design of the study. M. Daley provided interpretation of statistical analysis and contributed critical revisions. K.A. Hill conceived of the study and participated in its design and coordination. All authors participated in useful discussion, as well as read, edited, and approved the final manuscript.

Dr. Melissa Holmes provided the naked mole-rat samples described in Chapter 2. Chloe D. Rose performed the DNA extraction from tails of these samples. For Chapter 4, I performed the tissue harvesting for all members of the mouse family together with Alanna K. Edge, Chloe D. Rose, Zachary Hawley, and Hasan Baassiri. DNA extractions were performed by A.K. Edge (pancreas and bladder), C.D. Rose (tail), Z. Hawley (kidney and lung) and me (hippocampus). Cancer sample data files used in Chapter 5 were provided by Dr. Eva Turley and generated by Conny Toelg and David Carter at the London Regional Genomics Centre. M.E.O. Locke generated SNP and CNV genotype calls.


Acknowledgments

There are many people involved in the making of this thesis who I would like to thank for providing guidance, research assistance, and support. First, I would like to express my gratitude to my supervisor, Dr. Kathleen Hill, for being supportive of my research and career goals and for providing me with the opportunity to work autonomously, gain new skills, and present my research to the scientific community. I would also like to thank my advisors Dr. Mark Daley and Dr. Robert Cumming for providing helpful advice and feedback throughout my studies.

This research would not have been possible without the help of Beth Locke who provided critical computational assistance and research advice, as well as great friendship. Thank-you to Freda Qi, Alanna K. Edge, Nisha Patel, and other past and present members of the Hill laboratory for their research collaboration and for all the fun times and laughter we shared.

I would like to thank Carol Curtis, Diane Gauley, Arzie Chant, Hillary Bain, and Sherri Fenton from the Biology Graduate Office for their availability, kindness, and willingness to provide assistance with administrative tasks over the years.

Thank-you to my parents for their unwavering support and love, and for encouraging me to pursue higher education. Thank-you to my sisters for bringing me so much joy and inspiration, and for patiently waiting for me to finish my studies.

Finally, I would like to thank Nicolas Bensoussan for being there for me every day, for providing moral support throughout the thesis writing process, and for inspiring me with his curiosity, determination, and ingenuity.

This research was supported by the Natural Sciences and Engineering Research Council of Canada Discovery Grant awarded to Dr. Kathleen Hill as well as funds awarded through the Western Strategic Support for NSERC Success initiative at Western University. This research was also supported by external and internal funding awarded to me, including the Queen Elizabeth II Graduate Scholarship in Science and Technology, and the Dr. Irene Uchida Fellowship in Life Sciences awarded by Western’s Biology Department. Financial support for conference attendance was provided by the Department of Biology Graduate Travel Award, and Environmental and Genomics Society Student and New Investigator Travel Awards.


Table of Contents

Abstract ...... ii

Co-Authorship Statement ...... iv

Acknowledgments ...... v

Table of Contents ...... vii

List of Tables ...... xiii

List of Figures ...... xv

List of Appendices ...... xvii

List of Abbreviations ...... xix

Chapter 1 ...... 1

1 Introduction to CNVs and Thesis Aims ...... 1

1.1 Copy Number Variants (CNVs) ...... 1

1.1.1 Defining CNVs ...... 1

1.1.2 CNVs and natural and artificial selection ...... 2

1.1.3 CNVs and human disease ...... 6

1.1.4 Mouse models of human disease ...... 9

1.1.5 Mechanisms of CNV formation ...... 10

1.1.6 CNV mutation rates ...... 13

1.2 CNV detection ...... 16

1.3 Thesis goal and specific aims ...... 18

1.4 References ...... 21

Chapter 2 ...... 41

2 The Mouse Diversity Genotyping Array: Overview ...... 41

2.1 Introduction ...... 41

2.1.1 Study motivation ...... 41


2.1.2 The Mouse Diversity Genotyping Array design ...... 41

2.1.3 SNP probes and invariant genomic probes ...... 42

2.1.4 Fluorescence-based SNP and CNV genotyping ...... 42

2.1.5 Generating SNP genotype and CNV calls ...... 43

2.1.6 Discrepancies in annotations for the probe files of the Mouse Diversity Genotyping Array ...... 45

2.1.7 Assessment of probe design for the Genome-Wide Human SNP Array 6.0 ...... 46

2.1.8 Cross-species hybridization and considerations for SNP genotyping ...... 47

2.1.9 General goal, specific objectives, and predicted outcomes ...... 48

2.2 Materials and methods ...... 49

2.2.1 Probe filtering and SNP genotyping for the Mouse Diversity Genotyping Array ...... 49

2.2.2 Probe filtering and SNP genotyping for the Genome-Wide Human SNP Array 6.0 ...... 50

2.2.3 Cross-species hybridization with SNP genotyping ...... 51

2.3 Results ...... 53

2.3.1 Mouse Diversity Genotyping Array and Genome-Wide Human SNP Array 6.0 probe filtering ...... 53

2.3.2 Cross-species hybridization ...... 59

2.4 Discussion ...... 60

2.4.1 Impact of probe filtering for the Mouse Diversity Genotyping Array and the Genome-Wide Human SNP Array 6.0 ...... 60

2.4.2 Cross-species hybridization and considerations for SNP genotyping ...... 61

2.5 Computation-based assessment of MDGA data quality ...... 62

2.5.1 Assessing MDGA data quality through visualization of fluorescence intensity data ...... 63

2.5.2 Assessing quality of SNP genotype output ...... 63

2.5.3 Assessing MDGA data quality by examining the nature of CNV calls .... 64


2.5.4 Probe annotation: Providing genomic context ...... 66

2.5.5 Assessing MDGA data quality with pairwise genetic distance comparisons ...... 67

2.5.6 Considerations for Mus musculus sample set size and sex of mice ...... 68

2.5.7 Considerations for tissue type ...... 69

2.6 Conclusion ...... 70

2.7 References ...... 71

Chapter 3 ...... 79

3 CNV Diversity in Inbred and Wild Mice Detected by the Mouse Diversity Genotyping Array ...... 79

3.1 Background ...... 79

3.1.1 Research goal, central hypothesis, and specific objectives ...... 84

3.2 Materials and methods ...... 85

3.2.1 Samples ...... 85

3.2.2 CNV identification ...... 85

3.2.3 Figure construction and statistical analysis ...... 86

3.2.4 CNV recurrence ...... 87

3.2.5 Concordance with previous studies ...... 87

3.2.6 Genes unlikely to harbour copy number losses ...... 87

3.2.7 Confirmation of select CNVRs by droplet digital PCR (ddPCR) ...... 88

3.2.8 Gene analysis ...... 89

3.2.9 Genetic distance matrices and phenogram generation ...... 90

3.3 Results ...... 90

3.3.1 Broad survey ...... 90

3.3.2 Mouse cohort comparison study ...... 98

3.4 Discussion ...... 112

3.4.1 Broad survey ...... 112

3.4.2 Mouse cohort comparison study ...... 117

3.5 Conclusion ...... 123

3.6 References ...... 125

Chapter 3B ...... 141

3B Visualizing the Distribution of CNVs across a Genome with CNV Landscape Plots ...... 141

3B.1 Background ...... 141

3B.2 Materials and methods ...... 141

3B.3 Results ...... 141

3B.3.1 Visualization of CNV spatial landscape of Chromosome 17 ...... 141

3B.3.2 CNV spatial landscape analyses for autosomes and Chromosome X ..... 144

3B.4 Discussion ...... 147

3B.5 References ...... 148

Chapter 4 ...... 149

4 Somatic Mosaicism and de novo CNVs in a C57BL/6J Mouse Family ...... 149

4.1 Background ...... 149

4.1.1 Research goal, central hypothesis, and specific objectives ...... 152

4.2 Materials and methods ...... 154

4.2.1 Samples ...... 154

4.2.2 DNA extraction ...... 154

4.2.3 Genotyping and CNV detection ...... 154

4.2.4 DdPCR confirmation ...... 155

4.3 Results ...... 156

4.3.1 CNVs detected and ddPCR confirmation ...... 156

4.4 Discussion ...... 161

4.5 Conclusion ...... 163


4.6 References ...... 165

Chapter 5 ...... 169

5 Characterization of the CNV Landscape in a Mouse Model of Breast Cancer with Lung Metastasis in the Presence and Absence of Rhamm ...... 169

5.1 Background ...... 169

5.1.1 Research goal, central hypothesis, and specific objectives ...... 173

5.2 Materials and methods ...... 174

5.2.1 Samples ...... 174

5.2.2 MDGA hybridization ...... 174

5.2.3 SNP genotyping and CNV identification ...... 174

5.2.4 Select genic CNV confirmation by droplet digital PCR (ddPCR) ...... 175

5.2.5 SNP and CNV phenogram construction ...... 176

5.2.6 Tissue-specific CNVs ...... 176

5.2.7 Recurrent gene gains and losses and IPA networks ...... 177

5.3 Results ...... 177

5.3.1 CNVs detected ...... 177

5.3.2 Droplet digital PCR confirmation of select genic CNVs ...... 183

5.3.3 CNV genic analysis ...... 185

5.4 Discussion ...... 188

5.4.1 CNVs detected ...... 188

5.4.2 Droplet digital PCR confirmation of select genic CNV regions ...... 191

5.4.3 CNV genic analysis ...... 192

5.5 Conclusion ...... 195

5.6 References ...... 196

Chapter 6 ...... 204

6 Summary and Discussion ...... 204


6.1 Study limitations ...... 206

6.2 Future extensions ...... 208

6.2.1 Evolution and adaptation studies ...... 209

6.2.2 Study extensions for genome mosaicism in healthy tissues and cancer . 210

6.3 References ...... 213

Appendices ...... 214

Curriculum Vitae ...... 221


List of Tables

Table 2-1. Mus species and Heterocephalus glaber samples...... 52

Table 2-2. Changes in SNP genotyping call rate when changing algorithm version and SNP probe lists...... 55

Table 2-3. SNP genotype call rates for Heterocephalus glaber samples genotyped together with and without a set of 27 samples representing four Mus subgenera: Mus, Pyromys, Coelomys, and Nannomys...... 59

Table 3-1. Autosomal CNV losses and gains in laboratory strains and wild-caught mice. .... 92

Table 3-2. Most common CNVs detected by the Mouse Diversity Genotyping Array in a set of 334 Mus musculus samples...... 94

Table 3-3. Top DAVID terms for genic CNVs detected in classical laboratory and wild-caught mice...... 96

Table 3-4. Summary statistics for autosome and chromosome X CNVs detected in classical laboratory, wild-derived, and wild-caught mice...... 99

Table 3-5. Number of singleton CNVs present on the autosomes and X chromosome for classical laboratory, wild-derived, and wild-caught mouse cohorts, as determined using <40% reciprocal overlap and 0% overlap criteria...... 105

Table 3-6. Average CNV- and SNP-based genetic distances within and between classical laboratory, wild-derived, and wild-caught mouse cohorts...... 108

Table 3-7. Genic CNV gains and losses in classical laboratory, wild-derived, and wild-caught mice...... 110

Table 3-8. Top Gene Ontology terms for CNV gains and losses in classical laboratory, wild-derived, and wild-caught mouse cohorts...... 111


Table 5-1. Summary of averages for CNV numbers, length, state, and genic classification, in cancer and normal sample groups...... 180

Table 5-2. Top three Ingenuity Pathway Analysis “diseases and functions” terms for gene networks describing recurrent genic CNVs within each cancer sample group...... 187


List of Figures

Figure 2-1. Impact of application of probe list filtering criteria for SNP, Exon 1 and Exon 2 Mouse Diversity Genotyping Array probes...... 54

Figure 2-2. Impact of application of probe list filtering criteria for Genome-Wide Human SNP Array 6.0 SNP probes...... 56

Figure 2-3. Call rates for 351 Jackson Laboratory samples, generated with different probe lists...... 57

Figure 2-4. Call rates for HapMap3 samples, generated with Birdseed (v1), Birdseed (v2), and filtered probe lists...... 58

Figure 3-1. Number of CNVs detected for classical laboratory, wild-derived, and wild-caught mice...... 100

Figure 3-2. Number of CNV gains and losses for classical laboratory, wild-derived, and wild-caught mice...... 101

Figure 3-3. Length of CNV gains and losses for the autosomes and X of classical laboratory, wild-derived, and wild-caught mice...... 103

Figure 3-4. CNVRs in classical laboratory, wild-derived, and wild-caught mice...... 105

Figure 3-5. Phenograms depicting relationships between mice, as determined using pairwise genetic distances calculated for all autosomal CNVs in 210 mice and autosomal SNPs in 215 mice...... 106

Figure 3B-1. Gephi-based visualization of HD-CNV output showing Chromosome 17 CNV merges and singletons for classical laboratory, wild-derived, and wild-caught mouse samples...... 106

Figure 3B-2. Distribution of CNV gains and losses across Chromosome 17 for classical laboratory, wild-derived, and wild-caught mouse samples...... 106


Figure 3B-3. Distribution of CNV gains and losses across Chromosome X for classical laboratory, wild-derived, and wild-caught mouse samples...... 106

Figure 3B-4. Distribution of CNV gains and losses across Chromosome 9 for classical laboratory, wild-derived, and wild-caught mouse samples...... 106

Figure 3B-5. Examples of three CNV distribution patterns (A-C) across a mouse chromosome...... 106

Figure 4-1. DdPCR-based copy number states for Hoxa genes in multiple tissues from a C57BL/6J mouse family and three unrelated mice...... 158

Figure 4-2. DdPCR-based copy number states for eleven genes in multiple tissues from mice of a C57BL/6J family...... 160

Figure 5-1. Phenograms representing CNV- and SNP-based pairwise genetic distance between mouse samples from four groups: Rhamm-/- primary tumour, Rhamm-/- lung with metastasis, wild-type primary tumour, wild-type lung with metastasis...... 179

Figure 5-2. Number of CNV gains and losses detected in primary tumour and lung with metastasis samples from three wild-type and three Rhamm-/- mice...... 181

Figure 5-3. CNV length distribution for primary tumour and lung with metastasis samples from three wild-type and three Rhamm-/- mice...... 182

Figure 5-4. Tissue-specific CNVs in the primary tumour and lung with metastasis tissue of wild-type and Rhamm-/- mice...... 183

Figure 5-5. Copy number state of Ilk, Taf10, and Rhamm genes as detected by ddPCR in wild-type primary tumour, wild-type lung with metastasis, Rhamm-/- primary tumour, and Rhamm-/- lung with metastasis mouse tissues...... 184


List of Appendices

Appendix 2A: Genome-Wide Human SNP Array 6.0 filtered probe list...... 214

Appendix 2B: Genome-Wide Human SNP Array 6.0 filtered probe list including flanking regions...... 214

Appendix 2C: 874 HapMap3 CEL files...... 214

Appendix 2D: Sample Affymetrix Power Tools commands for using Birdseed version 1 and version 2 algorithms for SNP genotyping with the apt-probe-genotype program...... 214

Appendix 2E: 351 Mouse Diversity Genotyping Array CEL files from The Center for Genome Dynamics at The Jackson Laboratory and sample ID...... 215

Appendix 2F: List of Mouse Diversity Genotyping Array SNP probes from filtered probe list that produced a No Call genotype in all 351 mouse samples...... 215

Appendix 2G: List of genes unlikely to vary in copy number...... 215

Appendix 3A: Mouse sample information for 351 Mouse Diversity Genotyping Array CEL files...... 215

Appendix 3B: Mouse sample cohort, subspecies and origin information for 215 Mouse Diversity Genotyping Array CEL files...... 215

Appendix 3C: Autosomal CNVs detected for 351 mouse samples...... 215

Appendix 3D: Chromosome X CNVs detected for 351 mouse samples...... 215

Appendix 3E: Summary of autosomal and chromosome X CNVs detected for 351 mouse samples...... 215

Appendix 3F: CNVs detected for 210 mouse samples...... 215

Appendix 3G: Summary of autosomal and chromosome X CNVs detected for 210 mouse samples...... 215


Appendix 3H: Summary of the predicted and experimental ddPCR copy number (CN) states for nine genic copy number variant regions (CNVRs) in three classical strains...... 216

Appendix 3I: Western University ethics approval for animal use in research...... 217

Appendix 3J: CNV and SNP genetic distance matrices...... 218

Appendix 3K: Ingenuity Pathway Analysis core analysis of genes overlapping CNV regions for different copy number states and two mouse cohorts...... 218

Appendix 3L: DdPCR confirmation of CN state in 14 mice of three different inbred strains for nine select genic CNVRs detected using the Mouse Diversity Genotyping Array...... 219

Appendix 3M: List of CNVs found overlapping genes that are unlikely to vary in copy number...... 220

Appendix 4A: Log R ratio, B allele frequency, and waviness factor values for 26 mouse tissue samples and two PennCNV runs...... 220

Appendix 4B: CNV calls for first PennCNV dataset...... 220

Appendix 4C: CNV calls for second PennCNV dataset...... 220

Appendix 5A: CNV calls for six mammary gland primary tumour samples and six lung with metastasis samples from three MMTV-PyMT Rhamm-/- mice and three MMTV-PyMT Rhamm+/+ mice...... 220

Appendix 5B: CNV and SNP genetic distance matrices...... 220

Appendix 5C: Copy number state, position and genic content of CNV regions that are recurrent in all three samples with a shared Rhamm genotype and tumour type...... 220


List of Abbreviations

aCGH  Array comparative genomic hybridization
APE  Analyses of Phylogenetics and Evolution
APT  Affymetrix® Power Tools
BAF  B allele frequency
bp  Base pair(s)
BRLMM-P  Bayesian Robust Linear Model with Mahalanobis distance classifier – Perfect-match probes
BXD  C57BL/6J x DBA/2J recombinant inbred strains
CC-UNC G2:F1  Collaborative-Cross-University of North Carolina G2:F1 generation
CEL  Cell intensity file [file extension]
CDF  Chip definition file
CGH  Comparative genomic hybridization
CL  Classical laboratory
CN  Copy number
CNA  Copy number alteration
CNV  Copy number variant
CNVR  Copy number variant region
CRLMM  Corrected Robust Linear Model with Maximum Likelihood Classification
DAVID  Database for Annotation, Visualization and Integrated Discovery
ddPCR  Droplet digital polymerase chain reaction
DNA  Deoxyribonucleic acid
DSB  Double-strand breaks
FISH  Fluorescent in-situ hybridization
GC  Guanine, cytosine
GO  Gene ontology
GRC  Genome Reference Consortium
HD-CNV  Hotspot detector for copy number variants
HMM  Hidden Markov Model
HR  Homologous recombination
IGP  Invariant genomic probe
IPA  Ingenuity Pathway Analysis
kb  Kilobase(s)
LCR  Low copy repeat
LINE  Long interspersed nuclear element
LRR  Log R ratio
LRR SD  Log R ratio standard deviation
Mb  Megabase(s)
MDGA  Mouse Diversity Genotyping Array
MMBIR  Microhomology-mediated break-induced replication
MMTV-PyMT  Mouse mammary tumour virus-polyoma middle T-antigen
NAHR  Non-allelic homologous recombination
NCBI  National Center for Biotechnology Information
NHEJ  Non-homologous end joining
nt  Nucleotide(s)
PCR  Polymerase chain reaction
PFB  Population frequency of B allele
RHAMM  Receptor for hyaluronan-mediated motility
RM  Rhamm-/- lung with metastasis
RNA  Ribonucleic acid
RP  Rhamm-/- primary tumour
SINE  Short interspersed nuclear element
SNP  Single nucleotide polymorphism
SV  Structural variant
UCSC  University of California, Santa Cruz
WC  Wild caught
WD  Wild derived
WF  Waviness factor
WM  Wild-type (Rhamm+/+) lung with metastasis
WP  Wild-type (Rhamm+/+) primary tumour


Chapter 1

1 Introduction to CNVs and Thesis Aims

1.1 Copy Number Variants (CNVs)

Copy number variants (CNVs) are a source of genomic variation that contributes to normal and pathogenic phenotypes1–4 through the structural alteration of a genome. As a result, CNVs can play an important role in fitness for an individual organism and at a population level5,6. Improvements to the resolution and sensitivity of CNV detection technology have led to the discovery that CNVs are a common genomic phenomenon, with CNV differences being found even between individual cells from the same tissue7. Here, high-resolution SNP microarray technology is employed to explore and gain a better understanding of the CNV landscape across the genome of Mus musculus, a human commensal species8 and important animal model of human diseases, from the perspective of adaptation and evolution, somatic mosaicism, and disease.

1.1.1 Defining CNVs

Copy number variants (CNVs) are defined as large segments of DNA, ranging from ~50 bp to several megabases in size, that are present in different numbers of copies between genomes9. While the lower size limit of CNVs is not clearly defined since it can overlap with other structural variant types like indels (small insertions or deletions), the limit has decreased from the commonly used 1 kb minimum as the resolution of CNV detection technology and computational methods improved9–11. CNV differences can exist at all levels, ranging from the genomes of two cells of an individual to the genomes of two different organisms. In a diploid genome, the expected copy number of most DNA regions is two. A DNA segment that occurs at a copy number state of one in a diploid genome, as opposed to the default two copies, is called a loss or deletion, while a DNA segment that is present in more than two copies is referred to as a CNV gain, duplication, or amplification. Since CNVs can span thousands to millions of base pairs in length, many CNVs overlap the full spectrum of functional elements in a genome and can impact phenotypes if these elements are dosage sensitive. As such, CNVs play important roles in both normal and pathogenic diversity in many organisms including plants, animals and unicellular organisms.
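The definitions above can be illustrated with a minimal Python sketch. The `Segment` class, the `classify_segment` function, and the two constants are hypothetical names introduced here for illustration only; they are not part of the thesis pipeline. The sketch assumes a diploid region with an expected copy number of two, treats the approximate ~50 bp lower bound as a hard cut-off purely for simplicity, and treats any copy number below the expected value (including zero) as a loss.

```python
from dataclasses import dataclass

# Assumptions for illustration only (not thesis parameters):
EXPECTED_DIPLOID_COPIES = 2   # expected copy number of most diploid regions
MIN_CNV_LENGTH_BP = 50        # approximate lower size bound (~50 bp)


@dataclass(frozen=True)
class Segment:
    """A genomic segment with an estimated copy number (hypothetical type)."""
    chrom: str
    start: int        # 1-based, inclusive
    end: int          # inclusive
    copy_number: int

    @property
    def length(self) -> int:
        # Inclusive coordinates, so add one.
        return self.end - self.start + 1


def classify_segment(seg: Segment,
                     expected: int = EXPECTED_DIPLOID_COPIES,
                     min_length: int = MIN_CNV_LENGTH_BP) -> str:
    """Label a segment as a loss, a gain, or normal relative to `expected`."""
    if seg.length < min_length:
        # Too short to be considered a CNV under the assumed cut-off.
        return "below size threshold"
    if seg.copy_number < expected:
        return "loss"
    if seg.copy_number > expected:
        return "gain"
    return "normal"


if __name__ == "__main__":
    examples = [
        Segment("chr6", 1000, 5999, 1),   # one copy in a diploid genome
        Segment("chr6", 1000, 5999, 4),   # more than two copies
        Segment("chr6", 1000, 5999, 2),   # expected diploid state
    ]
    for seg in examples:
        print(seg.chrom, seg.length, classify_segment(seg))
```

The `expected` parameter is left adjustable because the expected copy number is not two everywhere, for example on chromosome X in male mice.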

1.1.2 CNVs and natural and artificial selection

If CNVs alter the expression of genomic elements that determine phenotypes, then positive or negative selection of traits can occur if fitness is affected. Trait selection can occur quickly through artificial means when humans intentionally breed and select plants, animals and other organisms for specific qualities, sometimes following intentional mutation induction. Examples of artificial selection by humans are numerous and include generation of different dog breeds12, creating ornamental plant cultivars13, and improvement of lactic acid bacteria strains used for producing fermented dairy products14. Sometimes there are unintended consequences of artificial selection. For example, some mutations in livestock and pets produce desirable traits when present in a heterozygous state but are deleterious when present in a homozygous state15. Under natural selection, traits that confer a fitness advantage to individuals within a population will be positively selected while deleterious traits will negatively affect fitness and be under negative selection. CNVs are known to impact phenotypes and cause diseases through numerous mechanisms including gene dosage changes, gene interruption, gene fusion, position effects, unmasking recessive alleles or functional polymorphisms, and potential transvection effects16.

More than 145 CNV genes show evidence of positive selection in humans17. During recent human evolution, different human populations gained variation in the copy number of the dosage-sensitive salivary amylase gene, AMY1. Gains in AMY1 range from two to twenty copies within human populations, have not been observed in Denisovan or Neanderthal genomes, and are thought to be associated with the introduction of starch-rich foods into the human diet18–20. Individual differences in AMY1 copy number can lead to different oral perceptions of starch viscosity, and this in turn may influence individual food choices and consumption levels of starch-containing foods21. Because AMY1 is expressed in multiple tissues and salivary α-amylase has a limited role in starch digestion, it was proposed that additional adaptive advantages may be responsible for retaining AMY1 copy number gains in humans18. The spread of prehistoric agriculture not only impacted humans, it also correlates with the copy number expansion of the pancreatic amylase gene, AMY2B, in dogs22. Dog breeds that were more likely to have had high starch diets historically have more AMY2B copies on average than do dog breeds that had low starch diets23.

Positive selection can occur for copy number losses as well as gains. Three genomic elements that were deleted or pseudogenized in humans are an enhancer located near the growth arrest and DNA damage inducible gamma (GADD45G) gene, the caspase 12 (CASP12) gene, and an androgen receptor (AR) enhancer. In modern humans, CASP12 has nearly reached complete fixation for a null allele that reduces susceptibility to severe sepsis24. Deletion of the GADD45G enhancer correlates with increased growth of specific brain regions, while the deletion of the AR enhancer is associated with the loss of sensory vibrissae and penile spines25. The loss of penile spines in the human lineage may have been in response to a changing reproductive strategy that included pair bonding and monogamy26.

In the modern human genome, 91% of genes that are greater than 10 kb occur in a fixed diploid state27. Almost half of the duplicated genes are not variable in copy number between individuals, and of the genes that are variable, 80% do not exceed a copy number state of five27. This implies that the majority of the human genome is fairly stable with regard to copy number and that a limited number of gene families have extreme variation. This is consistent with studies of bacteria that have shown gene amplifications to be unstable and associated with increased fitness costs28; excess gene copies are quickly reduced to lower copy numbers. After comparing the human data to the gorilla, chimpanzee and orangutan genomes, Sudmant et al. (2010) were able to identify 53 gene families with increased copy number in humans, including eight gene families that appeared to be fixed in humans27. The human-specific duplications included genes related to brain development and function.

Similarly, many human CNVs are enriched for genes with chemosensation and immune response functions29. Chemosensation genes include olfactory receptor (OR) genes, which belong to the largest gene family in humans and are known to undergo gene duplication and pseudogenization events30. OR gene pseudogenization has occurred almost 4-fold faster in humans than in other primates, possibly resulting from a lowered dependence on chemosensory perception in humans, relative to apes31. Additional support for environmental pressures shaping the OR gene family in humans is provided by observations that the number of intact and functional OR genes differs greatly between modern human populations32. Although humans have more OR pseudogenes than many mammals, research suggests that contrary to popular belief, humans do not have a poor sense of smell in comparison to other mammals, like dogs, which are known for their sense of smell33.

CNVs are frequently found to overlap genes with functions important to the human immune system. Human antibody heavy chains are produced by immunoglobulin heavy chain (IHC) gene families that arose through duplication and diversification, and are currently hotspots for CNV formation34,35. The effects of CNVs on innate immunity have been difficult to characterize and their impact on disease susceptibility and progression has been controversial36. An example of this is the defensin genes, which are commonly studied for their antimicrobial functions37. In humans, copy number polymorphisms of defensin genes are limited to a subset of defensin genes and do not appear to be a general feature of innate immunity genes38. CNVs of defensin genes have been associated with HIV progression to AIDS, and with susceptibility to numerous disorders including cervical cancer, autoimmune disorders like ankylosing spondylitis and lupus, sarcoidosis in females, ulcerative colitis, and more39–44. Although there are many findings of associations between copy number of immunological genes and disease susceptibility, confirmation of the biological roles of candidate CNVs in disease susceptibility is required36.

Also relevant to human health is the role of CNVs in antibiotic resistance. Bacterial acquisition of antibiotic resistance is a two-step process involving the initial amplification event of dosage-sensitive genes that confers antibiotic resistance and the subsequent mutational events that reduce the fitness costs of the amplification45. These bacterial gene amplifications are generated and lost at high rates, making them difficult to study. There are at least 22 known instances of multiple bacterial species acquiring antibiotic resistance through gene amplifications occurring either on plasmid DNA or chromosomal DNA45. One

such example is the acquisition of tobramycin resistance in multidrug-resistant Acinetobacter baumannii, isolated from a patient46. Treatment of A. baumannii isolates with increasing tobramycin concentrations results in either moderate tobramycin resistance (≤8 μg/ml) with no fitness costs via low amplification of the aminoglycoside resistance gene aphA1, or high tobramycin resistance (16 μg/ml) with impaired fitness resulting from greater amplification of aphA1. This experiment mimics the selective pressures acting on bacteria outside of the laboratory that are created by humans mainly through excessive and improper use of antibiotics, as well as other actions related to infection control and prevention47. It is important to understand the mechanisms by which resistance to antibiotics is conferred since antibiotic resistance has become a serious health concern both for humans and for conventional livestock farming, where antibiotics are routinely used in a prophylactic manner48.

In agriculturally relevant animals, CNV studies are often conducted between different breeds to identify contributors to different desirable traits. For chickens, CNVs have been studied in different breeds to identify breed- or line-specific CNVs and to determine if they have an effect on egg production, body size, growth rate, abdominal fat content, feather growth and pea-comb phenotype49–52. CNV studies on pigs have been used to determine if there is an association between CNVs and coat colour, backfat thickness, meat quality, fatty acid composition and growth traits53–56. In cattle, genomes were examined for CNV impacts on growth, milk and fat traits, meat quality, and health traits57–61. CNV research that is primarily aimed at discovering associations with phenotypic traits has also been conducted on agriculturally important animals like sheep, goats, ducks, turkeys, and horses62–66 and on plants, including maize, rice, barley, wheat, grapes, tomatoes, sorghum, foxtail millet and soybeans67–75. For both plants and animals important to agriculture, there is a need to understand structural variation that occurs in the genome. This knowledge would assist with maintenance and monitoring of genetic diversity in crops or livestock, selecting desirable phenotypic traits, improving the understanding of and monitoring for health issues, and improving knowledge regarding the domestication history of different species.


1.1.3 CNVs and human disease

CNVs have been extensively studied from a human disease perspective. Over 90% of pathogenic CNVs contain ohnologs, which are genes that have been retained following ancestral whole-genome duplication events76. This is far higher than observations for nonpathogenic CNVs (~30% contain an ohnolog), which suggests that the dosage-sensitive nature of many ohnologs contained in CNVs is an important part of their pathogenicity. Following whole-genome duplication, dosage balancing was required, and this careful regulation is now disrupted if the ohnologs are duplicated or deleted77. Down syndrome (trisomy 21) is an example of a genetic disorder caused by gene dosage alteration, in this case an increase in dosage. The majority of genes associated with this genetic disorder can be classified as ohnologs76. These findings regarding the relationships between CNVs, dosage-sensitive ohnologs, and aberrant phenotypes with increased risk of multiple diseases are valuable since they may help narrow the list of candidates involved in certain diseases. However, it is important to recognize that not all dosage-sensitive genes are ohnologs.

Pathogenic CNVs are involved in both simple, single-gene diseases, and complex diseases. With simple diseases, a mutation affecting only one gene or other important genomic element (e.g. expression regulators) is enough to result in a disease phenotype. For example, CNV duplications of peripheral myelin protein 22 (PMP22), lamin B1 (LMNB1), and nuclear receptor binding SET domain protein 1 (NSD1) cause Charcot-Marie-Tooth 1A disease, autosomal dominant leukodystrophy, and a growth retardation syndrome, respectively78–80. When deleted, PMP22 causes hereditary neuropathy with liability to pressure palsies while an NSD1 deletion causes Sotos syndrome78,81,82. When a gene is impacted in such a way that it leads to decreased expression and a pathogenic phenotype, whether via a CNV deletion or other means, this is called haploinsufficiency. CNVs implicated in complex diseases, including neurological disorders like autism, vary in size, in copy number state, and in whether they are recurrent or not, or inherited or de novo83–86. Complex diseases and traits with CNV associations include Parkinson disease, Alzheimer disease, schizophrenia, epilepsy, and HIV-1 infection susceptibility87–92.


1.1.3.1 Disease thresholds, heteroplasmy and pathogenic mosaicism

CNV deletions and duplications of dosage-sensitive genes may not result in an abnormal phenotype if a certain threshold of cellular or tissue malfunction is not reached. Disease “thresholds” are commonly used when discussing mitochondrial heteroplasmy – the presence of more than one mitochondrial genome in the mitochondrial population within a cell or individual. The threshold level at which a phenotypic effect is present is dependent on the particular mutation and tissue93. Phenotypic threshold levels for mtDNA mutations are generally above 60% and deletions generally have lower thresholds than other mutation types94.
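The threshold idea can be illustrated with a minimal sketch. The function names and the default cut-off of 0.60 are assumptions made for illustration only; as noted above, real thresholds depend on the particular mutation and tissue.

```python
def heteroplasmy_fraction(mutant_copies: int, total_copies: int) -> float:
    """Return the fraction of mtDNA copies that carry the mutation."""
    if total_copies <= 0:
        raise ValueError("total_copies must be positive")
    if not 0 <= mutant_copies <= total_copies:
        raise ValueError("mutant_copies must lie between 0 and total_copies")
    return mutant_copies / total_copies


def exceeds_threshold(mutant_copies: int, total_copies: int,
                      threshold: float = 0.60) -> bool:
    """Check whether the mutant load surpasses a phenotypic threshold.

    The 0.60 default is illustrative (hypothetical), not a universal value.
    """
    return heteroplasmy_fraction(mutant_copies, total_copies) > threshold


# A cell with 90% mutant mtDNA versus one with a single mutant copy in 100
print(exceeds_threshold(90, 100))
print(exceeds_threshold(1, 100))
```

Under these assumptions, a high inherited mutant load crosses the cut-off while a single de novo mutant copy does not.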

Mitochondrial mutations are common in normal cells and likely inherited frequently. In 1995, Chen et al. found that approximately 50% of tested oocytes contained a frequently occurring 4,977 bp deletion95. This particular deletion is found in a subpopulation of people with chronic progressive external ophthalmoplegia and Kearns-Sayre syndrome95–97. A likely explanation for why some people with mitochondrial deletions develop diseases while others do not is that people who developed a disease inherited a higher mutation load or acquired spontaneous de novo mutations earlier than people who did not develop a disease. If a single mitochondrion acquires a de novo mutation in a zygote, this will have a minimal effect on the developing individual compared to a zygote where 90% of mitochondria are inherited mutants and the phenotype threshold has been surpassed. Although generally thought to occur randomly, some evidence suggests that mitochondrial segregation is not always random and heteroplasmy levels can be maintained in daughter cells for numerous mitoses98.

Genotype differences can also occur between the nuclear DNA of somatic cells, in a phenomenon called somatic mosaicism. Cancer is a well-known example of somatic mosaicism where tumour cell populations are known to have genetic instability and high heterogeneity in comparison to adjacent healthy tissues99,100. Cancer-predisposing mutations can be inherited or occur de novo in an individual. One group of individuals who are predisposed to developing cancer and have high levels of de novo CNVs are people with Li-Fraumeni syndrome (LFS)101. They are more likely to have high CNV levels than

healthy individuals or individuals who develop the same TP53 mutation spontaneously, rather than inherit it. The reason for this is that germline TP53 tumour suppressor mutations in individuals with LFS are thought to increase genomic instability, leading to the formation of more CNVs and carcinogenic mutations101. The CNV levels were found to be even higher once LFS-affected individuals developed cancer when compared to mutant carriers not yet affected by cancer. Similar to the idea of a phenotypic threshold, there appears to be a dose-response relationship between CNV frequency and cancer phenotype severity for LFS101.

CNVs are commonly implicated in cancer, particularly recurrent copy number gains of oncogenes and recurrent losses of tumour suppressor genes102. There are known associations between cancer and the copy number duplication and overexpression of over 70 genes103. With respect to copy number losses, tumour suppressor gene losses can lead to decreased gene expression which may subsequently play a role in cancer initiation or progression104. Tumour suppressor genes, including phosphatase and tensin homolog (PTEN), microcephalin 1 (MCPH1), F-box protein 25 (FBXO25), SMAD family member 4 (SMAD4), tripartite motif-containing 35 (TRIM35), RB transcriptional corepressor 1 (RB1), and methylthioadenosine phosphorylase (MTAP), have shown concordance between copy number loss and decreased gene expression in multiple tumour samples104. MTAP, which is involved in purine biosynthesis, showed this relationship in 14 cancer types104. In non-small cell lung cancer, decreased MTAP expression is associated with poor overall survival and higher risk of tumour recurrence105. The mutation and gene expression profile for a given type of cancer can help direct cancer treatments, like using trastuzumab for cancers with erb-b2 receptor tyrosine kinase 2 (ERBB2) amplifications or AKT serine/threonine kinase 1 and 2 (AKT1/2) inhibitors if there is no response to trastuzumab or lapatinib106. However, the genetically heterogeneous makeup of tumours can make cancer difficult to treat because often when a dominant cell subpopulation is targeted and eliminated, other tumour cell subpopulations become dominant107,108.

In spite of tumour heterogeneity, many cancer types were found to have unique underlying mutation profiles or “signatures” that can be used to diagnose tumour types109. The collective mutations present across a genome, and their characteristics (e.g. type, size,

location, genic content), constitute the mutation “profile” of a sample. Cancer mutation signatures commonly involve multiple genes and mutation types109. Soh et al. (2017)109 applied machine learning to tumour DNA sequence data, representative of 28 different cancer types, and found that the cancer type of a tumour sample can be correctly identified in almost 84% of cases when using SNP and CNV data for 100 genes. The same study showed that the accuracy of cancer type identification decreases when fewer genes are used and if only SNPs or only CNVs are used as a predictor. CNVs are a better predictor of cancer type than SNPs since use of CNV data alone can result in an overall prediction accuracy of ~75% while SNP data alone have a much lower prediction accuracy of ~49%109.

1.1.3.2 CNV impact on drug metabolism

CNVs can play important roles in disease treatment since they can influence drug metabolism. Cytochrome P450 2D6 (CYP2D6) is an extensively studied human gene that encodes an enzyme capable of metabolizing numerous drugs including selective serotonin reuptake inhibitors, atypical antipsychotic medications, tricyclic antidepressants, beta blockers, opioid pain medications, antimalarial medication, poly (ADP-ribose) polymerase inhibitors (anticancer agents), and more110. CYP2D6 is a dosage-sensitive gene which causes decreased drug metabolism if copies are deleted, and increased metabolism if the gene is duplicated. A study of over 30,000 Americans found that a large minority (12.6%) of people have either fewer or more than two copies of CYP2D6111. This minority group may be given incorrect medication doses and be more likely to suffer side effects when taking drugs metabolized by CYP2D6, if the dosage is not adjusted for CYP2D6 enzyme levels. One such example is codeine, which can be lethal to those with CYP2D6 duplications and the associated ultra-rapid metabolizer phenotype112,113.

1.1.4 Mouse models of human disease

While many studies have been performed to study the effects of CNVs in humans, mice are also of great importance and help to fill in gaps in human research with respect to disease etiology, progression, and treatment. Mice are one of the most commonly used animals in human disease research and their value is evident from the fact that in 2002,


Mus musculus was the first mammal to have its genome fully sequenced114. By 2011, genome sequences were available for 17 inbred laboratory mouse strains115. The popularity of the mouse as an animal model comes from the many advantages of working with mice. For example, mouse breeding can be controlled to minimize or maximize genetic heterogeneity between individuals, family studies can be conducted, studies can be conducted at any age, there is no shortage of available tissues, the scale of the experiment can be much larger than with humans, diet and environment can be controlled, and the genomes can be manipulated to generate specific phenotypes and diseases. Some disease mouse models that involve DNA gains and losses are used to study mitochondrial deletion disorders116, the contribution of dosage imbalance to complex diseases like Alzheimer’s disease117, haploinsufficiency118 and overexpression disorders119, and cancer120.

Due to the importance of mice in research, it is necessary to gain an understanding of mouse genetics and to develop tools for studying the mouse genome. Determining what normal mutation levels are for different mouse strains and tissues will help with identification of abnormal genomic alterations, discovery of mouse-, strain-, and tissue-specific mutation hotspots, and development of appropriate study designs. Although mouse studies are invaluable for human research, animal models do not always accurately mimic human diseases and study results for the two species are not always concordant, for reasons concerning both biological (i.e. species-specific differences) and non-biological aspects (e.g. study design quality, availability/suitability of technology, sample availability, cost, etc.)121,122. Mouse genetic studies can be useful beyond human-relevant medical research, and have also contributed in the areas of evolution and ecology123,124, and uncovering the history of human colonization125.

1.1.5 Mechanisms of CNV formation

There are several mutational mechanisms that give rise to CNVs, which can be grouped into homology-based and non-homology-based mechanisms126. The breakpoint junctions surrounding a CNV, as well as the CNV size, can indicate what mechanism was responsible for creating a CNV. Long homologous regions in the breakpoint junctions, like low copy repeats (LCRs), tend to be associated with non-allelic homologous recombination (NAHR). Allelic homologous recombination is normally used to repair broken chromosomes with two-ended

double-strand breaks (DSBs) but mismatches between homologues can occur in repetitive regions, leading to recombination between non-allelic homologous regions. When occurring between chromatids and chromosomes, NAHR can produce reciprocal duplications and deletions, but when occurring within a chromatid, NAHR will produce only deletions127.

In the human male germline, the rate of NAHR-mediated CNV deletions was found to be two-fold higher than the duplication rate128. However, the proportion of duplications and deletions appears to be similar in healthy humans, suggesting that other mechanisms might be biased towards duplications or there is stronger selection against cells carrying deletions128,129. NAHR is the most common mechanism for generating recurrent CNVs130. Recurrent CNVs are the same length and share fixed (common) breakpoints. Short homologous regions enriched for mobile elements like long or short interspersed nuclear elements (LINEs and SINEs), are associated with CNV insertion and deletion events that can occur via NAHR, nonhomologous end joining (NHEJ) or replication-based mechanisms131–133. NAHR breakpoints are also associated with hypomethylation and open chromatin134. Exposed DNA is more susceptible to damage like DSBs than condensed DNA is, and DNA is particularly vulnerable when it is in a single-stranded state during transcription134.

In the absence of large homologous regions, NHEJ may be used to repair DNA. NHEJ is an important mechanism for repairing DSBs involving blunt ends or short microhomologies (1-4 bp), with the repair outcome typically resulting in small insertions and deletions ranging from one base pair to a couple hundred base pairs in length135,136. When longer terminal microhomologies (>5 bp) are involved in DSB resolution, the mechanism is called microhomology-mediated end joining (MMEJ) and it may be used in place of NHEJ when some components required for NHEJ are not available136.

In addition to producing small structural alterations, NHEJ is capable of altering large regions of the genome, and producing CNV deletions involving thousands of base pairs or more137–140. NHEJ is also associated with large CNV duplications, which sometimes occur via a two-step process involving homologous recombination (HR) in

addition to NHEJ139,141,142. Unlike NAHR, the NHEJ process requires little to no homology and is known to create non-recurrent CNVs, which have unique sizes and breakpoints130.

DSBs can occur throughout all cell cycle stages, as can NHEJ-mediated repair. However, NHEJ is affected by the cell cycle phase with regard to its frequency and repair outcomes (e.g. deletions of varying sizes are common in G1)143. HR is an alternative mechanism for repairing DSBs and can occur during the S/G2 phase if homologues are present near the DSBs144. In comparison to HR, NHEJ repairs DSBs more quickly and efficiently in human cells, but also with less accuracy145. Since NHEJ is a DSB repair mechanism, NHEJ-mediated CNVs are likely to be found in regions of the genome that are susceptible to breakage. For example, LINE-1 causes DSBs via endonuclease activity during retrotransposition events, so DSBs are expected to occur in regions containing such mobile elements146. Spontaneous DSBs can also occur during DNA replication when replication forks collapse and, by extension, could occur whenever replication progress is interfered with147. However, mechanisms other than NHEJ are involved in DNA repair following replication fork collapse since the breaks involve a single double-stranded end with no other end to join it to.

Like NAHR and NHEJ, replication-based mechanisms are important contributors to CNV formation and formation of de novo CNVs can be observed when canonical NHEJ is blocked in mouse embryonic stem cells148. In the absence of canonical NHEJ, CNVs arising during replication were 1.9-fold larger when occurring at mutation hotspots than non-hotspots, and more CNV deletions than duplications were observed at both hotspots (100% deletions) and non-hotspots (79.5% deletions). Excluding possible alternative end-joining mechanisms, one replicative mechanism which could have contributed to the formation of these CNVs is microhomology-mediated break-induced replication (MMBIR), which is involved in repairing one-ended DSBs following replication fork collapse149.

Following replication fork collapse, MMBIR finds microhomologies between the broken DNA strand and a DNA template to repair the break. Breakpoint junctions following MMBIR repair can often be found near LCRs150,151. Additionally, it was

predicted that MMBIR could create LCRs which can later be utilized for NAHR149. MMBIR can create large, complex structural alterations where the outcome is dependent on which DNA template is used for repair and how many template switches occur. When a sister chromosome or homologue behind the fork breakage is used, a duplication will result, whereas the use of a template positioned ahead of the fork breakage will result in a deletion. Other outcomes include inversions, translocations, triplication and a rolling circle.

Junctions at the endpoints of MMBIR show microhomology (2-5 bp), sometimes leading to confusion as to whether MMBIR or NHEJ was responsible for the repair event, although the presence of complexity is more indicative of MMBIR than NHEJ126. Like NHEJ, MMBIR creates nonrecurrent CNVs, but in contrast to end-joining mechanisms, MMBIR is a favoured mechanism for the production of CNVs arising during replication, particularly for CNV amplifications.

NAHR, NHEJ and MMBIR are important DNA damage repair mechanisms that can produce CNVs, although other repair mechanisms exist as well. The outcome of DNA repair is dependent on numerous factors including, but not limited to, the type of damage (e.g. one or two-ended DSBs), the cell cycle stage, the surrounding genomic context (e.g. presence or absence of homologues and repeats), and the availability of key components for specific repair pathways.

1.1.6 CNV mutation rates

For CNVs greater than 500 bp in length, array-based HapMap data representing 450 individuals provided a genome-wide CNV mutation rate of 3 × 10−2 per genome per generation152. The authors predict that this mutation rate is an underestimate since purifying selection against deleterious CNVs was not taken into consideration. A second study showed that the general CNV mutation rate based on 4,187 genomic regions from HapMap Phase II human populations is estimated to be at an order of 10−5 CNVs per locus per generation, but was as high as 10−3 CNVs per generation for 2.5% of 4,187 loci, with 47% of these mutation hotspots overlapping genes153. A similar average mutation rate at an order of 10−5 was estimated based on 856 CNV loci from HapMap Phase III human

populations154. This average CNV mutation rate is consistent with NAHR-mediated CNV mutation rates in human sperm155.

In the human male germline, NAHR deletion rates at a mutation hotspot were found to differ greatly between individuals, ranging from 9.82 × 10−6 to 6.96 × 10−5 per individual, with an average deletion rate of 3.52 × 10−5 (± 3.04 × 10−6 SEM)155. An earlier study examining the same locus observed very similar NAHR deletion rates in sperm (4.20 × 10−5 ± 2.99 × 10−6 SEM)128. For three other mutation hotspots, the average NAHR deletion rates were different, ranging from 2.16 × 10−5 to 1.87 × 10−6. The average NAHR duplication rates were lower than the deletion rates for all four hotspots and ranged from 1.73 × 10−5 to 1.73 × 10−7. This finding is consistent with observations from other studies where NAHR-mediated deletions were found to be more frequent than duplications128,153. In a third study focusing on a region of Chromosome 17, NAHR CNV rates associated with male meiosis were also found to be on an order of 10−5 to 10−7 156.

In contrast to the mutation rates above, which were determined based on a few disease-relevant loci, the frequency of NAHR-mediated CNVs detected across whole genomes suggests that the mutation rates are higher than expected. For a mutation rate on the order of 10−5, 0.00324 de novo deletions and duplications per generation, or 0.324 NAHR-mediated CNVs per individual when separated from the reference genome by 100 generations, would be expected, yet 24 NAHR-mediated CNVs per individual were observed157.
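The gap between expected and observed NAHR-mediated CNV counts can be checked with a short calculation. This is a minimal sketch using only the figures quoted above; the function and variable names are illustrative, and the calculation assumes a constant per-generation rate with no selection.

```python
def expected_cnvs(per_generation_rate: float, generations: int) -> float:
    """Expected accumulated CNVs, assuming a constant rate and no selection."""
    return per_generation_rate * generations


# Figures quoted in the text
PER_GENERATION = 0.00324  # expected de novo NAHR deletions and duplications per generation
GENERATIONS = 100         # separation from the reference genome
OBSERVED = 24             # NAHR-mediated CNVs observed per individual

expected = expected_cnvs(PER_GENERATION, GENERATIONS)
fold_excess = OBSERVED / expected

print(f"expected per individual: {expected:.3f}")
print(f"observed / expected: {fold_excess:.1f}")
```

Under these assumptions, the observed count exceeds the expectation by roughly 74-fold.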

Although additional genome-wide studies would be helpful in resolving the NAHR mutation rates for CNV formation, NAHR rates have been more extensively studied than the difficult-to-determine NHEJ or MMBIR mutation rates. Identifying NAHR-associated breakpoints is simpler because it involves searching for large homologies, while mechanisms that generate nonrecurrent CNVs have little to no homology at breakpoints and it can be difficult to assign an exact mechanism to a CNV. Little is known about NHEJ rates, although the CNV generation rate via NHEJ is predicted to be less than 10−7 per locus per generation152. However, based on deletion breakpoints examined from the 1000 Genomes project data, nonhomologous mechanisms generate more CNVs (61% of

detected CNV deletions) than NAHR (13%), with 42% of these deletions predicted to have occurred via NHEJ and 58% through template-switching mechanisms like MMBIR134.
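The nested percentages above can be converted into shares of all detected deletions. This is a minimal sketch that assumes the 42%/58% split applies within the 61% of deletions attributed to nonhomologous mechanisms, which is one reading of the text; the names are illustrative.

```python
# Shares of all detected CNV deletions, as quoted in the text
NONHOMOLOGOUS_SHARE = 0.61
NAHR_SHARE = 0.13

# Split within the nonhomologous group (assumed interpretation)
NHEJ_WITHIN = 0.42
TEMPLATE_SWITCH_WITHIN = 0.58


def share_of_all(parent_share: float, within_share: float) -> float:
    """Convert a share within a subgroup into a share of the whole set."""
    return parent_share * within_share


nhej_overall = share_of_all(NONHOMOLOGOUS_SHARE, NHEJ_WITHIN)
template_overall = share_of_all(NONHOMOLOGOUS_SHARE, TEMPLATE_SWITCH_WITHIN)

print(f"NHEJ: {nhej_overall:.1%} of all detected deletions")
print(f"Template switching: {template_overall:.1%} of all detected deletions")
print(f"NAHR: {NAHR_SHARE:.1%} of all detected deletions")
```

On this reading, NHEJ (about 26%) and template switching (about 35%) would each account for more detected deletions than NAHR (13%).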

Like NHEJ, there is little information regarding MMBIR-mediated CNV rates. While MMBIR is associated with replication fork collapse, the rate of replication fork collapse in humans is unknown. If recombination frequency during replication is used as a measure of replication fork collapse, then replication fork collapse is expected to occur multiple times per cell division cycle158. Although not all replication fork collapses are resolved with a CNV outcome, the sheer number of replication events required to create an adult human from a zygote and to renew cells in tissues provides many opportunities for CNV formation. To reach the predicted 3.72 × 10^13 cells present in an adult human when starting from a zygote, the cell population will have needed to double approximately 45 times through numerous independent replication events159. These frequent but nonrecurrent mutation events would result in high levels of somatic mosaicism in healthy individuals that could be difficult to detect if there is insufficient clonal expansion to meet a required detection threshold. Mutations that arise early in embryogenesis would have high clonal expansions in an adult, assuming there are no fitness costs to the affected cells.
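The estimate of approximately 45 population doublings follows from taking the base-2 logarithm of the adult cell count of 3.72 × 10^13. The sketch below is an idealized calculation that assumes every cell divides at each round and ignores cell death and tissue renewal; the names are illustrative.

```python
import math

ADULT_CELL_COUNT = 3.72e13  # predicted number of cells in an adult human (from the text)


def doublings_from_single_cell(final_cells: float) -> float:
    """Number of population doublings needed to reach final_cells from one cell.

    Idealized: assumes synchronous division with no cell loss.
    """
    if final_cells < 1:
        raise ValueError("final_cells must be at least 1")
    return math.log2(final_cells)


doublings = doublings_from_single_cell(ADULT_CELL_COUNT)
print(f"population doublings: {doublings:.2f}")
```

Because cell death and renewal are ignored, the real number of replication events is far larger than this lower bound suggests.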

Aneuploidy mosaicism in blastocysts is known to occur frequently, increasing from ~50% to ~90% of tested blastocysts, as a woman’s age and meiotic error risk increases160. In preimplantation embryos, aneuploidy mosaicism is associated with clinical implications like reduced implantation success and increased risk of miscarriage with embryos that have aneuploidy in 20%-80% of their cells (in a 5-cell biopsy)160. CNVs are known to arise during gametogenesis and embryogenesis, with estimates of the rate of de novo CNV occurrence in offspring being placed at 1.2 × 10−2 CNVs per genome per transmission, for CNVs over 100 kb in length161. In rare cases, individuals with genomic disorders present a CNV mutator phenotype and have an unusually high number of de novo CNVs (5-10 CNVs per individual)162. This mutational process appears to be transient, only occurring perizygotically, possibly as a consequence of replicative repair. It is hypothesized that mutant maternal proteins or transcripts are responsible for driving a CNV mutator phenotype in early embryogenesis and the mutation process ceases once zygotic genome transcription is activated and wild-type proteins and transcripts are generated162. The

mutations that arise early in embryogenesis are likely to have large clonal populations in the developed individual and therefore have high phenotypic impact potential.

1.2 CNV detection

There are several approaches that can be used for the identification of CNVs. Fluorescence in situ hybridization (FISH) is a commonly used CNV detection technique, and is often used for diagnostic purposes in the clinic163–165. With FISH-based approaches, locus-specific fluorescent probes are hybridized to metaphase or interphase spreads to visualize CNVs at a single cell level166. This method provides an absolute CNV count, typically at a resolution ranging from tens of thousands to hundreds of thousands of base pairs167, although the resolution has been improved with newer methods like Fiber FISH which is performed on extended chromatin fibers168. The drawbacks to FISH approaches are that they are low throughput in comparison to other CNV-detection technologies like microarrays, and CNV discovery is limited to the loci that have sequences complementary to the designed probes, meaning a priori knowledge of the target CNVs is required166.

Higher throughput technologies for CNV detection include array-based comparative genomic hybridization (aCGH) and non-CGH microarrays like SNP-based oligonucleotide arrays169. In an aCGH approach, sample and reference DNA are cohybridized to DNA probes on an array. A sample of interest, for example, may be DNA from a disease-affected individual while the reference DNA would come from a healthy individual170. The goal would be to identify chromosomal abnormalities in the sample of interest compared to the healthy reference and see if there is an association with the disease phenotype. The sample and reference DNA are labelled with different fluorescent dyes so that for each locus, the ratio between the two fluorescence signals can be used to infer a copy number state171. Since aCGH-based CNV detection is dependent on the use of a reference genome, absolute counts of DNA copies are not obtained as they are with FISH. An advantage over FISH is that aCGH uses thousands of probes in parallel, which allows for genome-wide CNV detection. Array CGH has been used in the identification of CNVs associated with diseases including but not limited to pancreatic cancer172, metastatic breast cancer173, neurodevelopmental disorders or multiple congenital anomalies174, and autism spectrum disorder175. Array CGH has also been used in a human evolutionary study, and

found copy number expansions in humans when compared to four other hominoid species (bonobo, chimpanzee, gorilla, and orangutan) that could account for some of the species-specific phenotypic characteristics such as brain structure and function176.
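The two-colour ratio used in aCGH to infer a copy number state can be sketched as follows. This is a minimal illustration, not the method of any particular platform or analysis package: the function names and the ±0.3 log2 cut-offs are hypothetical, and real pipelines normalize intensities and segment across neighbouring probes before calling states.

```python
import math


def log2_ratio(sample_signal: float, reference_signal: float) -> float:
    """Log2 ratio of sample to reference fluorescence at one probe."""
    if sample_signal <= 0 or reference_signal <= 0:
        raise ValueError("signals must be positive")
    return math.log2(sample_signal / reference_signal)


def call_state(ratio: float, gain_cutoff: float = 0.3,
               loss_cutoff: float = -0.3) -> str:
    """Assign a relative copy number state from a log2 ratio.

    The cut-offs are illustrative assumptions, not published thresholds.
    """
    if ratio >= gain_cutoff:
        return "gain"
    if ratio <= loss_cutoff:
        return "loss"
    return "no change"


# Three copies versus two in the reference gives a signal ratio near 1.5;
# one copy versus two gives a ratio near 0.5.
print(call_state(log2_ratio(1.5, 1.0)))
print(call_state(log2_ratio(0.5, 1.0)))
print(call_state(log2_ratio(1.0, 1.0)))
```

Because the call is made relative to the reference signal, the result is a relative state (gain or loss) rather than an absolute copy count.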

Non-CGH microarrays are an alternative approach to aCGH for CNV detection and do not require cohybridization with a reference DNA sample. With SNP-based oligonucleotide microarrays, for example, the DNA of interest is hybridized to single-stranded DNA probes on an array and the overall fluorescence intensity signal of each locus is compared to a reference diploid signal to detect CNV gains and losses177. Thus, SNP microarray CNV detection is reference based and does not provide absolute copy number counts. High resolution SNP microarrays have been used for a variety of different CNV studies including but not limited to studies of human anthropometric traits178, cross-species comparisons between humans and rhesus macaques179, fetal alcohol spectrum disorder180, autism spectrum disorder181, and thoracic aortic aneurysms and dissections182. Overall, both aCGH and non-CGH microarrays that are used for CNV detection come in a variety of designs and vary greatly in several aspects including but not limited to their resolution and sensitivity, the type of mutations that they can detect, and the amount of DNA required for hybridization183–185.

A more advanced approach to CNV detection than microarrays, particularly with respect to resolution and sensitivity, is the use of sequencing technologies, which provide single base-pair resolution. Sequencing technologies, which are rapidly improving and developing, are used in high-throughput approaches for genome-wide CNV discovery, even for single cells, and they allow for the identification of translocations, rare variants, and CNV breakpoint junctions186,187. However, assembling raw sequencing data into a useful output is a complex process for numerous reasons that include but are not limited to the large quantities of output data, various limitations between the different sequencing technology platforms and algorithms, dependence on reference genomes for read alignment, and a lack of standardization and simplified workflows188,189. In regard to clinical application, many tests that look for structural alterations of the genome are not conducted using sequencing since more affordable methods like FISH or PCR-based techniques are often sufficient, especially when testing for known variants and

diseases190,191. Microarrays are currently being used in clinics for genetic testing of individuals, particularly children, with developmental delays and intellectual disabilities, autism spectrum disorders and dysmorphic features with unknown causes192. Microarrays are also used in the detection of clinically relevant CNVs in fetuses for prenatal diagnosis of cytogenetic abnormalities193. Sequencing is of great value for studying complex or monogenic diseases where rare structural or sequence alterations that are associated with the disease have yet to be discovered194.

Due to the high demand for products and services for human research, CNV detection technology and methodology for the human genome have seen rapid advances. Mouse models are still necessary for many aspects of human-relevant research, but some CNV detection methods, like whole genome sequencing, are prohibitively expensive to use on a large scale for mice. Therefore, more affordable technologies like high-density microarrays are of value to animal model research, particularly when the study requires looking at the whole genome and at many samples. One such microarray is the Mouse Diversity Genotyping Array (MDGA; Thermo Fisher Scientific Inc, Waltham, MA), a high-resolution mouse SNP genotyping array195 that was reported in 2009.

1.3 Thesis goal and specific aims

The overall goal of this thesis is to explore and characterize the CNV landscape of Mus musculus (house mouse), as detected by the MDGA, in order to advance our knowledge of an important model organism from the perspectives of adaptation and evolution, normal development, and disease. M. musculus was selected as it is the most common mouse species used in research, the reference genome has been sequenced and annotated, and it is compatible with the MDGA since the array design is based on the M. musculus reference genome195. The MDGA was selected as the detection technology of choice due to its affordability and because it was the highest resolution mouse microarray available at the time. The chapters in this thesis describe experiments that were conducted with the goals of improving the reliability of the MDGA CNV detection pipeline and determining its utility for CNV detection in mice. This thesis will also explore how CNVs contribute to the genomic landscape in different mouse groups affected by natural or artificial selection, in multiple tissues within an inbred mouse family, and in a mouse model of cancer. The specific aims and the rationale for these aims are described below.

Aim 1: To assist in the development of a reliable, validated, and user-friendly CNV detection pipeline for use with the MDGA by 1) identifying probes that are predicted to perform poorly and excluding them from computations associated with CNV calling, 2) updating probe annotations where necessary, and 3) developing a CNV output visualization method to assist with identifying patterns in the number, state, and spatial distribution of CNVs within and across the genomic landscapes of multiple samples.

This aim was necessary due to inconsistencies found in the original probe annotation files and a lack of CNV detection software that can generate calls using both the single nucleotide polymorphism-based probes and the invariant genomic probes. Results relating to MDGA improvements and recommendations for use are presented mostly in Chapter 2, and the CNV detection pipeline was used in the experiments described in Chapters 3-5, with some stated modifications between different experiments. A method for visualizing the CNV distribution across a genome is described in Chapter 3B.

Aim 2: To use the developed CNV detection pipeline to explore the CNV landscape of multiple mouse subspecies and characterize differences between laboratory-bred and wild-caught mice.

This aim tests the pipeline developed in Aim 1 on a large M. musculus dataset; the results of this broad survey have been published. Chapter 3 also includes a mouse cohort comparison study that is an extension of the broad survey and has a stronger focus on the genetic variation between classical laboratory-bred, wild-caught, and wild-derived mouse cohorts.

Aim 3: To study the contribution of CNVs to somatic mosaicism in an inbred mouse family and assess the MDGA tool for this objective.

The goal of this aim is to provide a mutational baseline for what a CNV profile looks like across multiple healthy tissues of one of the most commonly used laboratory mouse strains, C57BL/6J, as this would allow for identification of abnormal mutation profiles in future mutation studies. The results of this work are presented in Chapter 4.

Aim 4: To assess the utility of the MDGA in the context of tumorigenesis and metastasis.

The MDGA was originally used to genotype normal mouse samples. The purpose of this aim is to determine whether the MDGA can be applied to disease samples like cancerous tissue, where there is a high degree of genetic heterogeneity that also includes small clones carrying de novo mutations. The work from this aim is presented in Chapter 5.

1.4 References

1. Nozawa, M. & Nei, M. Genomic drift and copy number variation of chemosensory receptor genes in humans and mice. Cytogenet. Genome Res. 123, 263–269 (2008).

2. Stranger, B. E. et al. Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 315, 848–853 (2007).

3. Hollox, E. J., Armour, J. A. L. & Barber, J. C. K. Extensive normal copy number variation of a beta-defensin antimicrobial-gene cluster. Am. J. Hum. Genet. 73, 591–600 (2003).

4. Mihaylova, M., Staneva, R., Toncheva, D., Pancheva, M. & Hadjidekova, S. Benign, pathogenic and copy number variations of unknown clinical significance in patients with congenital malformations and developmental delay. Balkan J. Med. Genet. 20, 5–12 (2017).

5. Farslow, J. C. et al. Rapid increase in frequency of gene copy-number variants during experimental evolution in Caenorhabditis elegans. BMC Genomics 16, 1044 (2015).

6. Katju, V. & Bergthorsson, U. Copy-number changes in evolution: rates, fitness effects and adaptive significance. Front. Genet. 4, 273 (2013).

7. McConnell, M. J. et al. Mosaic copy number variation in human neurons. Science 342, 632–637 (2013).

8. Boursot, P., Auffray, J.-C., Britton-Davidian, J. & Bonhomme, F. The evolution of house mice. Annu. Rev. Ecol. Syst. 24, 119–152 (1993).

9. Mills, R. E. et al. Mapping copy number variation by population-scale genome sequencing. Nature 470, 59–65 (2011).

10. Xi, R. et al. Copy number variation detection in whole-genome sequencing data using the Bayesian information criterion. Proc. Natl. Acad. Sci. U. S. A. 108, E1128–E1136 (2011).

11. Magi, A., Benelli, M., Yoon, S., Roviello, F. & Torricelli, F. Detecting common copy number variants in high-throughput sequencing data by using JointSLM algorithm. Nucleic Acids Res. 39, e65 (2011).

12. Parker, H. G. et al. Genetic structure of the purebred domestic dog. Science 304, 1160–1164 (2004).

13. Schum, A. & Preil, W. Induced mutations in ornamental plants. in Somaclonal variation and induced mutations in crop improvement. Current plant science and biotechnology in agriculture (eds. Jain, S. M., Brar, D. S. & Ahloowalia, B. S.) 333–366 (Springer, Dordrecht, 1998). doi:10.1007/978-94-015-9125-6_17

14. Derkx, P. M. F. et al. The art of strain improvement of industrial lactic acid bacteria without the use of recombinant DNA technology. Microb. Cell Fact. 13 Suppl 1, S5 (2014).

15. Hedrick, P. W. Heterozygote advantage: The effect of artificial selection in livestock and pets. J. Hered. 106, 141–154 (2015).

16. Zhang, F., Gu, W., Hurles, M. E. & Lupski, J. R. Copy number variation in human health, disease, and evolution. Annu. Rev. Genomics Hum. Genet. 10, 451–481 (2009).

17. Iskow, R. C., Gokcumen, O. & Lee, C. Exploring the role of copy number variants in human adaptation. Trends Genet. 28, 245–257 (2012).

18. Fernández, C. I. & Wiley, A. S. Rethinking the starch digestion hypothesis for AMY1 copy number variation in humans. Am. J. Phys. Anthropol. 163, 645–657 (2017).

19. Perry, G. H. et al. Diet and the evolution of human amylase gene copy number variation. Nat. Genet. 39, 1256–1260 (2007).

20. Perry, G. H., Kistler, L., Kelaita, M. A. & Sams, A. J. Insights into hominin phenotypic and dietary evolution from ancient DNA sequence data. J. Hum. Evol. 79, 55–63 (2015).

21. Mandel, A. L., Peyrot des Gachons, C., Plank, K. L., Alarcon, S. & Breslin, P. A. S. Individual differences in AMY1 gene copy number, salivary α-amylase levels, and the perception of oral starch. PLoS One 5, e13352 (2010).

22. Arendt, M., Cairns, K. M., Ballard, J. W. O., Savolainen, P. & Axelsson, E. Diet adaptation in dog reflects spread of prehistoric agriculture. Heredity (Edinb). 117, 301–306 (2016).

23. Reiter, T., Jagoda, E. & Capellini, T. D. Dietary variation and evolution of gene copy number among dog breeds. PLoS One 11, e0148899 (2016).

24. Wang, X., Grus, W. E. & Zhang, J. Gene losses during human origins. PLoS Biol. 4, e52 (2006).

25. McLean, C. Y. et al. Human-specific loss of regulatory DNA and the evolution of human-specific traits. Nature 471, 216–219 (2011).

26. van Driel, M. F. Re: Human-Specific Loss of Regulatory DNA and the Evolution of Human-Specific Traits. Eur. Urol. 60, 1123–1124 (2011).

27. Sudmant, P. H. et al. Diversity of human copy number variation and multicopy genes. Science 330, 641–646 (2010).

28. Adler, M., Anjum, M., Berg, O. G., Andersson, D. I. & Sandegren, L. High fitness costs and instability of gene duplications reduce rates of evolution of new genes by duplication-divergence mechanisms. Mol. Biol. Evol. 31, 1526–1535 (2014).

29. Nguyen, D.-Q., Webber, C. & Ponting, C. P. Bias of selection on human copy-number variants. PLoS Genet. 2, e20 (2006).

30. Jiang, Y. & Matsunami, H. Mammalian odorant receptors: functional evolution and variation. Curr. Opin. Neurobiol. 34, 54–60 (2015).

31. Gilad, Y., Man, O., Pääbo, S. & Lancet, D. Human specific loss of olfactory receptor genes. Proc. Natl. Acad. Sci. U. S. A. 100, 3324–3327 (2003).

32. Menashe, I., Man, O., Lancet, D. & Gilad, Y. Different noses for different people. Nat. Genet. 34, 143–144 (2003).

33. McGann, J. P. Poor human olfaction is a 19th-century myth. Science 356, eaam7263 (2017).

34. Watson, C. T. et al. Complete haplotype sequence of the human immunoglobulin heavy-chain variable, diversity, and joining genes and characterization of allelic and copy-number variation. Am. J. Hum. Genet. 92, 530–546 (2013).

35. Keyeux, G., Lefranc, G. & Lefranc, M. P. A multigene deletion in the human IGH constant region locus involves highly homologous hot spots of recombination. Genomics 5, 431–441 (1989).

36. Olsson, L. M. & Holmdahl, R. Copy number variation in autoimmunity--importance hidden in complexity? Eur. J. Immunol. 42, 1969–1976 (2012).

37. Ganz, T. Defensins: antimicrobial peptides of innate immunity. Nat. Rev. Immunol. 3, 710–720 (2003).

38. Linzmeier, R. M. & Ganz, T. Copy number polymorphisms are not a common feature of innate immune genes. Genomics 88, 122–126 (2006).

39. Mehlotra, R. K. et al. Copy number variation within human β-defensin gene cluster influences progression to AIDS in the multicenter AIDS cohort study. J. AIDS Clin. Res. 3, 3–7 (2012).

40. Abe, S. et al. Copy number variation of the antimicrobial-gene, defensin beta 4, is associated with susceptibility to cervical cancer. J. Hum. Genet. 58, 250–253 (2013).

41. Wang, L. et al. Association study of copy number variants in FCGR3A and FCGR3B gene with risk of ankylosing spondylitis in a Chinese population. Rheumatol. Int. 36, 437–442 (2016).

42. Yuan, J. et al. FCGR3B copy number loss rather than gain is a risk factor for systemic lupus erythematous and lupus nephritis: A meta-analysis. Int. J. Rheum. Dis. 18, 392–397 (2015).

43. Wu, J. et al. FCGR3A and FCGR3B copy number variations are risk factors for sarcoidosis. Hum. Genet. 135, 715–725 (2016).

44. Asano, K. et al. Impact of allele copy number of polymorphisms in FCGR3A and FCGR3B genes on susceptibility to ulcerative colitis. Inflamm. Bowel Dis. 19, 2061–2068 (2013).

45. Sandegren, L. & Andersson, D. I. Bacterial gene amplification: implications for the evolution of antibiotic resistance. Nat. Rev. Microbiol. 7, 578–588 (2009).

46. McGann, P. et al. Amplification of aminoglycoside resistance gene aphA1 in Acinetobacter baumannii results in tobramycin therapy failure. MBio 5, e00915 (2014).

47. Tanwar, J., Das, S., Fatima, Z. & Hameed, S. Multidrug resistance: an emerging crisis. Interdiscip. Perspect. Infect. Dis. 2014, 541340 (2014).

48. Mie, A. et al. Human health implications of organic food and organic agriculture: a comprehensive review. Environ. Health 16, 111 (2017).

49. Sohrabi, S. S., Mohammadabadi, M., Wu, D.-D. & Esmailizadeh, A. Detection of breed-specific copy number variations in domestic chicken genome. Genome 61, 7–14 (2018).

50. Zhang, H. et al. Detection of genome-wide copy number variations in two chicken lines divergently selected for abdominal fat content. BMC Genomics 15, 517 (2014).

51. Elferink, M. G., Vallée, A. A. A., Jungerius, A. P., Crooijmans, R. P. M. A. & Groenen, M. A. M. Partial duplication of the PRLR and SPEF2 genes at the late feathering locus in chicken. BMC Genomics 9, 391 (2008).

52. Wright, D. et al. Copy number variation in intron 1 of SOX5 causes the Pea-comb phenotype in chickens. PLoS Genet. 5, e1000512 (2009).

53. Rubin, C.-J. et al. Strong signatures of selection in the domestic pig genome. Proc. Natl. Acad. Sci. U. S. A. 109, 19529–19536 (2012).

54. Schiavo, G. et al. Copy number variants in Italian Large White pigs detected using high-density single nucleotide polymorphisms and their association with back fat thickness. Anim. Genet. 45, 745–749 (2014).

55. Wang, L. et al. Copy number variation-based genome wide association study reveals additional variants contributing to meat quality in Swine. Sci. Rep. 5, 12535 (2015).

56. Revilla, M. et al. A global analysis of CNVs in swine using whole genome sequence data and association analysis with fatty acid composition and growth traits. PLoS One 12, e0177014 (2017).

57. Yang, M. et al. Association study and expression analysis of CYP4A11 gene copy number variation in Chinese cattle. Sci. Rep. 7, 46599 (2017).

58. Gao, Y. et al. CNV discovery for milk composition traits in dairy cattle using whole genome resequencing. BMC Genomics 18, 265 (2017).

59. Stothard, P. et al. Whole genome resequencing of black Angus and Holstein cattle for SNP and CNV discovery. BMC Genomics 12, 559 (2011).

60. Silva, V. H. da et al. Genome-wide detection of CNVs and their association with meat tenderness in Nelore cattle. PLoS One 11, e0157711 (2016).

61. Salomón-Torres, R. et al. Genome-wide SNP signal intensity scanning revealed genes differentiating cows with ovarian pathologies from healthy cows. Sensors 17, 1920 (2017).

62. Jenkins, G. M. et al. Copy number variants in the sheep genome detected using multiple approaches. BMC Genomics 17, 441 (2016).

63. Fontanesi, L. et al. An initial comparative map of copy number variations in the goat (Capra hircus) genome. BMC Genomics 11, 639 (2010).

64. Skinner, B. M. et al. Comparative genomics in chicken and Pekin duck using FISH mapping and microarray analysis. BMC Genomics 10, 357 (2009).

65. Griffin, D. K. et al. Whole genome comparative studies between chicken and turkey and their implications for avian genome evolution. BMC Genomics 9, 168 (2008).

66. Ghosh, S. et al. Copy number variation in the horse genome. PLoS Genet. 10, e1004712 (2014).

67. Jiao, Y. et al. Genome-wide genetic changes during modern breeding of maize. Nat. Genet. 44, 812–815 (2012).

68. Bai, Z. et al. The impact and origin of copy number variations in the Oryza species. BMC Genomics 17, 261 (2016).

69. Muñoz-Amatriaín, M. et al. Distribution, functional impact, and origin mechanisms of copy number variation in the barley genome. Genome Biol. 14, R58 (2013).

70. Zhang, X., Gao, M., Wang, S., Chen, F. & Cui, D. Allelic variation at the vernalization and photoperiod sensitivity loci in Chinese winter wheat cultivars (Triticum aestivum L.). Front. Plant Sci. 6, 470 (2015).

71. Yin, L. et al. Phytohormone and genome variations in Vitis amurensis resistant to downy mildew. Genome 60, 791–796 (2017).

72. Causse, M. et al. Whole genome resequencing in tomato reveals variation associated with introgression and breeding events. BMC Genomics 14, 791–805 (2013).

73. Zheng, L.-Y. et al. Genome-wide patterns of genetic variation in sweet and grain sorghum (Sorghum bicolor). Genome Biol. 12, R114 (2011).

74. Bai, H. et al. Identifying the genome-wide sequence variations and developing new molecular markers for genetics research by re-sequencing a Landrace cultivar of foxtail millet. PLoS One 8, e73514 (2013).

75. Maldonado dos Santos, J. V. et al. Evaluation of genetic variation among Brazilian soybean cultivars through genome resequencing. BMC Genomics 17, 110 (2016).

76. McLysaght, A. et al. Ohnologs are overrepresented in pathogenic copy number mutations. Proc. Natl. Acad. Sci. U. S. A. 111, 361–366 (2014).

77. Makino, T. & McLysaght, A. Ohnologs in the human genome are dosage balanced and frequently associated with disease. Proc. Natl. Acad. Sci. U. S. A. 107, 9270–9274 (2010).

78. Li, J., Parker, B., Martyn, C., Natarajan, C. & Guo, J. The PMP22 gene and its related diseases. Mol. Neurobiol. 47, 673–698 (2013).

79. Padiath, Q. S. et al. Lamin B1 duplications cause autosomal dominant leukodystrophy. Nat. Genet. 38, 1114–1123 (2006).

80. Sachwitz, J. et al. NSD1 duplication in Silver–Russell syndrome (SRS): molecular karyotyping in patients with SRS features. Clin. Genet. 91, 73–78 (2017).

81. Kurotaki, N. et al. Haploinsufficiency of NSD1 causes Sotos syndrome. Nat. Genet. 30, 365–366 (2002).

82. Abdalla, E., Bartsch, O., Galetzka, D. & Zechner, U. Novel clinical findings in the first Egyptian case of Sotos syndrome caused by complete deletion of the NSD1 gene. Am. J. Med. Genet. A 173, 1090–1093 (2017).

83. Glessner, J. T. et al. Autism genome-wide copy number variation reveals ubiquitin and neuronal genes. Nature 459, 569–573 (2009).

84. Matsunami, N. et al. Identification of rare DNA sequence variants in high-risk autism families and their prevalence in a large case/control population. Mol. Autism 5, 5 (2014).

85. Leppa, V. M. et al. Rare inherited and de novo CNVs reveal complex contributions to ASD risk in multiplex families. Am. J. Hum. Genet. 99, 540–554 (2016).

86. Reis, V. N. de S. et al. Integrative variation analysis reveals that a complex genotype may specify phenotype in siblings with syndromic autism spectrum disorder. PLoS One 12, e0170386 (2017).

87. Chartier-Harlin, M. C. et al. Alpha-Synuclein locus duplication as a cause of familial Parkinson’s disease. Lancet 364, 1167–1169 (2004).

88. Rovelet-Lecrux, A. et al. APP locus duplication causes autosomal dominant early-onset Alzheimer disease with cerebral amyloid angiopathy. Nat. Genet. 38, 24–26 (2006).

89. International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 455, 237–241 (2008).

90. Xu, B. et al. Strong association of de novo copy number mutations with sporadic schizophrenia. Nat. Genet. 40, 880–885 (2008).

91. Helbig, I. et al. 15q13.3 microdeletions increase risk of idiopathic generalized epilepsy. Nat. Genet. 41, 160–162 (2009).

92. Gonzalez, E. et al. The influence of CCL3L1 gene-containing segmental duplications on HIV-1/AIDS susceptibility. Science 307, 1434–1440 (2005).

93. Rossignol, R., Malgat, M., Mazat, J. P. & Letellier, T. Threshold effect and tissue specificity. Implication for mitochondrial cytopathies. J. Biol. Chem. 274, 33426–33432 (1999).

94. Rossignol, R. et al. Mitochondrial threshold effects. Biochem. J. 370, 751–762 (2003).

95. Chen, X. et al. Rearranged mitochondrial genomes are present in human oocytes. Am. J. Hum. Genet. 57, 239–247 (1995).

96. Schon, E. A. et al. A direct repeat is a hotspot for large-scale deletion of human mitochondrial DNA. Science 244, 346–349 (1989).

97. Mita, S. et al. Recombination via flanking direct repeats is a major cause of large-scale deletions of human mitochondrial DNA. Nucleic Acids Res. 18, 561–567 (1990).

98. Raap, A. K. et al. Non-random mtDNA segregation patterns indicate a metastable heteroplasmic segregation unit in m.3243A>G cybrid cells. PLoS One 7, e52080 (2012).

99. Negrini, S., Gorgoulis, V. G. & Halazonetis, T. D. Genomic instability — an evolving hallmark of cancer. Nat. Rev. Mol. Cell Biol. 11, 220–228 (2010).

100. Chen, W. et al. Identification of chromosomal copy number variations and novel candidate loci in hereditary nonpolyposis colorectal cancer with mismatch repair proficiency. Genomics 102, 27–34 (2013).

101. Shlien, A. et al. Excessive genomic DNA copy number variation in the Li-Fraumeni cancer predisposition syndrome. Proc. Natl. Acad. Sci. U. S. A. 105, 11264–11269 (2008).

102. Zhang, L., Yuan, Y., Lu, K. H. & Zhang, L. Identification of recurrent focal copy number variations and their putative targeted driver genes in ovarian cancer. BMC Bioinformatics 17, 222 (2016).

103. Santarius, T., Shipley, J., Brewer, D., Stratton, M. R. & Cooper, C. S. A census of amplified and overexpressed human cancer genes. Nat. Rev. Cancer 10, 59–64 (2010).

104. Zhao, M. & Zhao, Z. Concordance of copy number loss and down-regulation of tumor suppressor genes: a pan-cancer study. BMC Genomics 17, 532 (2016).

105. Su, C. Y. et al. MTAP is an independent prognosis marker and the concordant loss of MTAP and p16 expression predicts short survival in non-small cell lung cancer patients. Eur. J. Surg. Oncol. 40, 1143–1150 (2014).

106. Jernström, S. et al. Drug-screening and genomic analyses of HER2-positive breast cancer cell lines reveal predictors for treatment response. Breast Cancer (Dove Med. Press). 9, 185–198 (2017).

107. Kreso, A. et al. Variable clonal repopulation dynamics influence chemotherapy response in colorectal cancer. Science 339, 543–548 (2013).

108. Gillies, R. J., Verduzco, D. & Gatenby, R. A. Evolutionary dynamics of carcinogenesis and why targeted therapy does not work. Nat. Rev. Cancer 12, 487–493 (2012).

109. Soh, K. P., Szczurek, E., Sakoparnig, T. & Beerenwinkel, N. Predicting cancer type from tumour DNA signatures. Genome Med. 9, 104 (2017).

110. U.S. Food & Drug Administration. Table of pharmacogenomic biomarkers in drug labeling. (2018). Available at: https://www.fda.gov/Drugs/ScienceResearch/ucm572698.htm.

111. Beoris, M., Amos Wilson, J., Garces, J. A. & Lukowiak, A. A. CYP2D6 copy number distribution in the US population. Pharmacogenet. Genomics 26, 96–99 (2016).

112. Kelly, L. E. et al. More codeine fatalities after tonsillectomy in North American children. Pediatrics 129, e1343–e1347 (2012).

113. Gasche, Y. et al. Codeine intoxication associated with ultrarapid CYP2D6 metabolism. N. Engl. J. Med. 351, 2827–2831 (2004).

114. Mouse Genome Sequencing Consortium et al. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002).

115. Keane, T. M. et al. Mouse genomic variation and its effect on phenotypes and gene regulation. Nature 477, 289–294 (2011).

116. Inoue, K. et al. Generation of mice with mitochondrial dysfunction by introducing mouse mtDNA carrying a deletion into zygotes. Nat. Genet. 26, 176–181 (2000).

117. Sekine, M. & Makino, T. Inference of causative genes for Alzheimer’s disease due to dosage imbalance. Mol. Biol. Evol. 34, 2396–2407 (2017).

118. Koenig, S. N. et al. Notch1 haploinsufficiency causes ascending aortic aneurysms in mice. JCI Insight 2, e91353 (2017).

119. Anderson, T. J. et al. Distinct phenotypes associated with increasing dosage of the PLP gene: implications for CMT1A due to PMP22 gene duplication. Ann. N. Y. Acad. Sci. 883, 234–246 (1999).

120. Jahid, S. et al. Inhibition of colorectal cancer genomic copy number alterations and chromosomal fragile site tumor suppressor FHIT and WWOX deletions by DNA mismatch repair. Oncotarget 8, 71574–71586 (2017).

121. Ben-David, U. et al. Patient-derived xenografts undergo mouse-specific tumor evolution. Nat. Genet. 49, 1567–1575 (2017).

122. Perel, P. et al. Comparison of treatment effects between animal experiments and clinical trials: systematic review. BMJ 334, 197 (2007).

123. Bryk, J. & Tautz, D. Copy number variants and selective sweeps in natural populations of the house mouse (Mus musculus domesticus). Front. Genet. 5, 153 (2014).

124. Pezer, Ž., Harr, B., Teschke, M., Babiker, H. & Tautz, D. Divergence patterns of genic copy number variation in natural populations of the house mouse (Mus musculus domesticus) reveal three conserved genes with major population-specific expansions. Genome Res. 25, 1114–1124 (2015).

125. Jones, E. et al. Fellow travellers: a concordance of colonization patterns between mice and men in the North Atlantic region. BMC Evol. Biol. 12, 35 (2012).

126. Hastings, P. J., Lupski, J. R., Rosenberg, S. M. & Ira, G. Mechanisms of change in gene copy number. Nat. Rev. Genet. 10, 551–564 (2009).

127. Stankiewicz, P. & Lupski, J. R. Genome architecture, rearrangements and genomic disorders. Trends Genet. 18, 74–82 (2002).

128. Turner, D. J. et al. Germline rates of de novo meiotic deletions and duplications causing several genomic disorders. Nat. Genet. 40, 90–95 (2008).

129. Redon, R. et al. Global variation in copy number in the human genome. Nature 444, 444–454 (2006).

130. Gu, W., Zhang, F. & Lupski, J. R. Mechanisms for human genomic rearrangements. Pathogenetics 1, 4 (2008).

131. Bailey, J. A., Liu, G. & Eichler, E. E. An Alu transposition model for the origin and expansion of human segmental duplications. Am. J. Hum. Genet. 73, 823–834 (2003).

132. Han, K. et al. L1 recombination-associated deletions generate human genomic variation. Proc. Natl. Acad. Sci. U. S. A. 105, 19366–19371 (2008).

133. Gu, S. et al. Alu-mediated diverse and complex pathogenic copy-number variants within human chromosome 17 at p13.3. Hum. Mol. Genet. 24, 4061–4077 (2015).

134. Abyzov, A. et al. Analysis of deletion breakpoints from 1,092 humans reveals details of mutation mechanisms. Nat. Commun. 6, 7256 (2015).

135. Liang, F., Han, M., Romanienko, P. J. & Jasin, M. Homology-directed repair is a major double-strand break repair pathway in mammalian cells. Proc. Natl. Acad. Sci. U. S. A. 95, 5172–5177 (1998).

136. Lieber, M. R. The mechanism of human nonhomologous DNA end joining. J. Biol. Chem. 283, 1–5 (2008).

137. Lee, S. J. et al. Non-homologous end joining repair mechanism-mediated deletion of CHD7 gene in a patient with typical CHARGE syndrome. Ann. Lab. Med. 35, 141–145 (2015).

138. Shaw, C. J. & Lupski, J. R. Non-recurrent 17p11.2 deletions are generated by homologous and non-homologous mechanisms. Hum. Genet. 116, 1–7 (2005).

139. Inoue, K. et al. Genomic rearrangements resulting in PLP1 deletion occur by nonhomologous end joining and cause different dysmyelinating phenotypes in males and females. Am. J. Hum. Genet. 71, 838–853 (2002).

140. Conrad, D. F. et al. Mutation spectrum revealed by breakpoint sequencing of human germline CNVs. Nat. Genet. 42, 385–391 (2010).

141. Lee, H. J., Kweon, J., Kim, E., Kim, S. & Kim, J. S. Targeted chromosomal duplications and inversions in the human genome using zinc finger nucleases. Genome Res. 22, 539–548 (2012).

142. Woodward, K. J. et al. Heterogeneous duplications in patients with Pelizaeus-Merzbacher disease suggest a mechanism of coupled homologous and nonhomologous recombination. Am. J. Hum. Genet. 77, 966–987 (2005).

143. Moore, J. K. & Haber, J. E. Cell cycle and genetic requirements of two pathways of nonhomologous end-joining repair of double-strand breaks in Saccharomyces cerevisiae. Mol. Cell. Biol. 16, 2164–2173 (1996).

144. Lieber, M. R. The mechanism of double-strand DNA break repair by the nonhomologous DNA end-joining pathway. Annu. Rev. Biochem. 79, 181–211 (2010).

145. Mao, Z., Bozzella, M., Seluanov, A. & Gorbunova, V. Comparison of nonhomologous end joining and homologous recombination in human cells. DNA Repair (Amst). 7, 1765–1771 (2008).

146. Gasior, S. L., Wakeman, T. P., Xu, B. & Deininger, P. L. The human LINE-1 retrotransposon creates DNA double-strand breaks. J. Mol. Biol. 357, 1383–1393 (2006).

147. Michel, B., Ehrlich, S. D. & Uzest, M. DNA double-strand breaks caused by replication arrest. EMBO J. 16, 430–438 (1997).

148. Arlt, M. F., Rajendran, S., Birkeland, S. R., Wilson, T. E. & Glover, T. W. De novo CNV formation in mouse embryonic stem cells occurs in the absence of Xrcc4-dependent nonhomologous end joining. PLoS Genet. 8, e1002981 (2012).

149. Hastings, P. J., Ira, G. & Lupski, J. R. A microhomology-mediated break-induced replication model for the origin of human copy number variation. PLoS Genet. 5, e1000327 (2009).

150. Stankiewicz, P. et al. Genome architecture catalyzes nonrecurrent chromosomal rearrangements. Am. J. Hum. Genet. 72, 1101–1116 (2003).

151. Lee, J. A. et al. Role of genomic architecture in PLP1 duplication causing Pelizaeus-Merzbacher disease. Hum. Mol. Genet. 15, 2250–2265 (2006).

152. Conrad, D. F. et al. Origins and functional impact of copy number variation in the human genome. Nature 464, 704–712 (2010).

153. Fu, W., Zhang, F., Wang, Y., Gu, X. & Jin, L. Identification of copy number variation hotspots in human populations. Am. J. Hum. Genet. 87, 494–504 (2010).

154. Hu, X.-S. et al. High mutation rates explain low population genetic divergence at copy-number-variable loci in Homo sapiens. Sci. Rep. 7, 43178 (2017).

155. MacArthur, J. A. L. et al. The rate of nonallelic homologous recombination in males is highly variable, correlated between monozygotic twins and independent of age. PLoS Genet. 10, e1004195 (2014).

156. Liu, P. et al. Frequency of nonallelic homologous recombination is correlated with length of homology: Evidence that ectopic synapsis precedes ectopic crossing-over. Am. J. Hum. Genet. 89, 580–588 (2011).

157. Parks, M. M., Lawrence, C. E. & Raphael, B. J. Detecting non-allelic homologous recombination from high-throughput sequencing data. Genome Biol. 16, 72 (2015).

158. Cortez, D. Preventing replication fork collapse to maintain genome integrity. DNA Repair (Amst). 32, 149–157 (2015).

159. Bianconi, E. et al. An estimation of the number of cells in the human body. Ann. Hum. Biol. 40, 463–471 (2013).

160. Munné, S. & Wells, D. Detection of mosaicism at blastocyst stage with the use of high-resolution next-generation sequencing. Fertil. Steril. 107, 1085–1091 (2017).

161. Itsara, A. et al. De novo rates and selection of large copy number variation. Genome Res. 20, 1469–1481 (2010).

162. Liu, P. et al. An organismal CNV mutator phenotype restricted to early human development. Cell 168, 830–842.e7 (2017).

163. Jacobs, P. A., Browne, C., Gregson, N., Joyce, C. & White, H. Estimates of the frequency of chromosome abnormalities detectable in unselected newborns using moderate levels of banding. J. Med. Genet. 29, 103–108 (1992).

164. Halling, K. C. & Kipp, B. R. Fluorescence in situ hybridization in diagnostic cytology. Hum. Pathol. 38, 1137–1144 (2007).

165. Savic, S. & Bubendorf, L. Common fluorescence in situ hybridization applications in cytology. Arch. Pathol. Lab. Med. 140, 1323–1330 (2016).

166. Cantsilieris, S., Baird, P. N. & White, S. J. Molecular methods for genotyping complex copy number polymorphisms. Genomics 101, 86–93 (2013).

167. Kallioniemi, A., Visakorpi, T., Karhu, R., Pinkel, D. & Kallioniemi, O. P. Gene copy number analysis by fluorescence in situ hybridization and comparative genomic hybridization. Methods 9, 113–121 (1996).

168. Kraan, J. et al. Multicolor Fiber FISH. Methods Mol. Biol. 204, 143–153 (2002).

169. Ylstra, B., van den IJssel, P., Carvalho, B., Brakenhoff, R. H. & Meijer, G. A. BAC to the future! Or oligonucleotides: A perspective for micro array comparative genomic hybridization (array CGH). Nucleic Acids Res. 34, 445–450 (2006).

170. Coughlin, C. R., Scharer, G. H. & Shaikh, T. H. Clinical impact of copy number variation analysis using high-resolution microarray technologies: advantages, limitations and concerns. Genome Med. 4, 80 (2012).

171. Lockwood, W. W., Chari, R., Chi, B. & Lam, W. L. Recent advances in array comparative genomic hybridization technologies and their applications in human genetics. Eur. J. Hum. Genet. 14, 139–148 (2006).

172. Rausch, V. et al. Array comparative genomic hybridization of 18 pancreatic ductal adenocarcinomas and their autologous metastases. BMC Res. Notes 10, 560 (2017).

173. Magbanua, M. J. M. et al. Expanded genomic profiling of circulating tumor cells in metastatic breast cancer patients to assess biomarker status and biology over time (CALGB 40502 and CALGB 40503, Alliance). Clin. Cancer Res. 24, 1486–1499 (2018).

174. Maini, I. et al. Prematurity, ventricular septal defect and dysmorphisms are independent predictors of pathogenic copy number variants: a retrospective study on array-CGH results and phenotypical features of 293 children with neurodevelopmental disorders and/or multiple c. Ital. J. Pediatr. 44, 34 (2018).

175. Lovrečić, L. et al. Diagnostic efficacy and new variants in isolated and complex autism spectrum disorder using molecular karyotyping. J. Appl. Genet. 59, 179–185 (2018).

176. Fortna, A. et al. Lineage-specific gene duplication and loss in human and great ape evolution. PLoS Biol. 2, E207 (2004).

177. Winchester, L., Yau, C. & Ragoussis, J. Comparing CNV detection methods for SNP arrays. Brief. Funct. Genomic. Proteomic. 8, 353–366 (2009).

178. Macé, A. et al. CNV-association meta-analysis in 191,161 European adults reveals new loci associated with anthropometric traits. Nat. Commun. 8, 744 (2017).

179. Ng, J., Fass, J. N., Durbin-Johnson, B., Smith, D. G. & Kanthaswamy, S. Identifying rhesus macaque gene orthologs using heterospecific human CNV probes. Genom. Data 6, 202–207 (2015).

180. Zarrei, M. et al. Copy number variation in fetal alcohol spectrum disorder. Biochem. Cell Biol. 96, 161–166 (2018).

181. Mkrtchyan, H. et al. The human genome puzzle – the role of copy number variation in somatic mosaicism. Curr. Genomics 11, 426–431 (2010).

182. Prakash, S. et al. Recurrent rare genomic copy number variants and bicuspid aortic valve are enriched in early onset thoracic aortic aneurysms and dissections. PLoS One 11, e0153543 (2016).

183. Hester, S. D. et al. Comparison of comparative genomic hybridization technologies across microarray platforms. J. Biomol. Tech. 20, 135–151 (2009).

184. Haraksingh, R. R., Abyzov, A., Gerstein, M., Urban, A. E. & Snyder, M. Genome-wide mapping of copy number variation in humans: comparative analysis of high resolution array platforms. PLoS One 6, e27859 (2011).

185. Haraksingh, R. R., Abyzov, A. & Urban, A. E. Comprehensive performance comparison of high-resolution array platforms for genome-wide Copy Number Variation (CNV) analysis in humans. BMC Genomics 18, 321 (2017).

186. Mardis, E. R. DNA sequencing technologies: 2006–2016. Nat. Protoc. 12, 213–218 (2017).

187. van Dijk, E. L., Auger, H., Jaszczyszyn, Y. & Thermes, C. Ten years of next-generation sequencing technology. Trends Genet. 30, 418–426 (2014).

188. Zhao, M., Wang, Q., Wang, Q., Jia, P. & Zhao, Z. Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives. BMC Bioinformatics 14, S1 (2013).

189. Endrullat, C., Glökler, J., Franke, P. & Frohme, M. Standardization and quality management in next-generation sequencing. Appl. Transl. genomics 10, 2–9 (2016).

190. Cui, C., Shu, W. & Li, P. Fluorescence in situ hybridization: Cell-based genetic diagnostic and research applications. Front. Cell Dev. Biol. 4, 89 (2016).

191. Butchbach, M. E. R. Applicability of digital PCR to the investigation of pediatric-onset genetic disorders. Biomol. Detect. Quantif. 10, 9–14 (2016).

192. Battaglia, A. et al. Confirmation of chromosomal microarray as a first-tier clinical diagnostic test for individuals with developmental delay, intellectual disability, autism spectrum disorders and dysmorphic features. Eur. J. Paediatr. Neurol. 17, 589–599 (2013).

193. Van den Veyver, I. B. et al. Clinical use of array comparative genomic hybridization (aCGH) for prenatal diagnosis in 300 cases. Prenat. Diagn. 29, 29–39 (2009).

194. Kingsmore, S. F., Dinwiddie, D. L., Miller, N. A., Soden, S. E. & Saunders, C. J. Adopting orphans: comprehensive genetic testing of Mendelian diseases of childhood by next-generation sequencing. Expert Rev. Mol. Diagn. 11, 855–868 (2011).

195. Yang, H., Ding, Y., Hutchins, L. & Szatkiewicz, J. A customized and versatile high-density genotyping array for the mouse. Nat. Methods 6, 663–666 (2009).

Chapter 2

2 The Mouse Diversity Genotyping Array: Overview

Parts of this chapter were published in Locke et al, BMC Genomics (2015)1. Probe filtering based on BLAST analysis was performed by Nisha Patel. Chloe Rose isolated DNA from naked mole-rat samples.

2.1 Introduction

2.1.1 Study motivation

To assist with the fulfillment of Aim 1 in this thesis - the development of a reliable, validated and user-friendly CNV detection pipeline for use with the Mouse Diversity Genotyping Array (MDGA) - probe lists were filtered according to defined criteria to include only probes predicted to perform well in CNV detection. The impact of probe list filtering was assessed by comparing SNP genotyping results pre- and post-filtering. The MDGA probe design was evaluated in comparison to the Genome-Wide Human SNP Array 6.0, which was selected as a standard for high quality. Additionally, the utility of the MDGA for rodent samples within and outside of the genus Mus was tested to identify cross-species hybridization limitations. This chapter provides background on the design of the MDGA, and discusses sample selection, the CNV calling process, and data quality assessment.

2.1.2 The Mouse Diversity Genotyping Array design

The Mouse Diversity Genotyping Array (Thermo Fisher Scientific Inc, Waltham, MA) was designed by Yang et al (2009) with the intent to assess genetic diversity in Mus2. It is a high-density genotyping array with probe sequences designed based on the sequenced C57BL/6J mouse strain reference genome (Mus musculus domesticus)2. The MDGA can be used to determine single nucleotide polymorphism (SNP) genotypes and identify CNVs. Single nucleotide variants are considered polymorphic if they are present in at least 1% of the individuals of a population3. The MDGA has 2,093,288 different probes that interrogate two SNP alleles (alleles “A” and “B”) at 623,124 uniformly-spaced loci (~1 SNP per 4.3 kb)2,4. These probes were designed to target SNPs selected to capture the genetic diversity present in classical laboratory and wild-derived mouse strains. A detailed description of how the loci were selected is available in the supplementary information of the Yang et al paper2. A second group of probes, called invariant genomic probes (IGPs), target 597,758 non-polymorphic regions in exons and are used in CNV calling. IGPs were designed to target ~93% of known exons in the mouse genome (Ensembl version 49)2. Exon 1 and Exon 2 probe types make up the largest group of IGPs on the MDGA at a total of 1,195,516 probes and are divided into the two exon groups based on Affymetrix-specified criteria for meeting probe design standards. CNV calling in this thesis is performed using both SNP probes and Exon 1 and 2 probes.

2.1.3 SNP probes and invariant genomic probes

For each SNP locus, there are four different probe sequences that make up a probe set – a sense and antisense sequence for each of the two alleles. The probes that make up a SNP probe set are found in a total of eight separate locations or “features” on the array – there are two features for each of the four SNP probe sequences. SNP genotyping will result in one of four genotype calls being generated per SNP locus: homozygous for allele A (AA), homozygous for allele B (BB), heterozygous (AB), or an inability to determine a genotype (No Call). The invariant genomic probes (IGPs), Exon 1 and Exon 2 probes, were designed so that there are six probe sequences per exon. The IGPs target approximately 93% of the annotated exons in the mouse genome2. There are three sense strand probes spaced out along each exon in conserved (invariant) regions, and for each sense probe there is a complementary antisense probe. These six probes make up an IGP set and, unlike the SNP probes, there is only one feature on the array per probe sequence. On the MDGA, each 5 µm x 5 µm feature has approximately 1.66 x 10^-6 picomoles of copies of a particular probe sequence5.
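
As a rough check on scale, the molar amount quoted for a feature can be converted to a number of probe molecules with Avogadro's constant. The sketch below assumes the quoted amount is read as 1.66 x 10^-6 picomoles per feature; the function name is illustrative and is not part of any array software.

```python
AVOGADRO = 6.02214076e23  # molecules per mole (exact SI value)

def copies_per_feature(picomoles):
    """Convert an amount in picomoles to a number of molecules."""
    moles = picomoles * 1e-12  # 1 pmol = 1e-12 mol
    return moles * AVOGADRO

copies = copies_per_feature(1.66e-6)
print(f"{copies:.2e} probe molecules per feature")
```

Under that reading, each feature carries on the order of one million copies of its probe sequence.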

2.1.4 Fluorescence-based SNP and CNV genotyping

Prior to hybridization to the MDGA, sample DNA undergoes several preparatory steps, including biotin labelling. Biotin-labelled sample DNA that has hybridized to MDGA probes can be stained with streptavidin-bound phycoerythrin, a fluorescent protein. The fluorescence intensity from sample DNA that is bound to IGPs and SNP probes can be used to determine the relative number of copies of that particular DNA sequence present in the sample genome.

If a feature on the array does not emit light, then no hybridization occurred. This could occur due to the absence of DNA template or the presence of single nucleotide mutations within the DNA template. Depending on the type of variant calling performed, a lack of a fluorescence signal would translate into either an inability to detect a SNP genotype or a copy number loss. To call a copy number loss, a predetermined number of probes targeting consecutive loci would have to emit a similar fluorescence signal. If the fluorescence intensity for probes targeting a given genomic region is higher or lower than what is expected for a copy number of two, then this would suggest the presence of a copy number gain or loss, respectively.

2.1.5 Generating SNP genotype and CNV calls

2.1.5.1 SNP genotyping with Affymetrix Power Tools

Prior to CNV calling, SNP genotyping is performed. SNP genotype calls for the MDGA can be generated using the BRLMM-P algorithm with Affymetrix Power Tools’ apt-probeset-genotype program (APT; Thermo Fisher Scientific)6,7. Data from multiple microarrays are genotyped together and any inter-sample variability in the signal intensity needs to be corrected. This can be done with an APT normalization step. Quantile normalization is typically used for SNP genotyping, with the exception of cancer samples where median normalization is recommended8. A median polish summarization step follows normalization and is applied to summarize the information from all the probes for one allele of a SNP by using the median value. Median polish takes into account both inter-sample variation and variation in signal intensity arising from sequence-dependent hybridization specificity9.
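
The idea behind quantile normalization can be illustrated with a minimal sketch: every array is forced onto the same intensity distribution by replacing each value with the mean of the values that share its rank across arrays. This is a generic textbook version written for illustration, not the APT implementation; the function name and the toy intensities are hypothetical, and tied values are ranked in their original order.

```python
def quantile_normalize(samples):
    """Quantile-normalize equal-length lists of intensities (one list per array)."""
    n_probes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [
        sum(s[k] for s in sorted_samples) / len(samples) for k in range(n_probes)
    ]
    normalized = []
    for sample in samples:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda i: sample[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = reference[rank]
        normalized.append(out)
    return normalized

arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]  # toy intensities for two arrays
print(quantile_normalize(arrays))
```

After normalization both toy arrays contain the same set of values, and only the assignment of values to probes differs between them.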

After genotyping, other APT programs are used to generate log R Ratio (LRR) and B Allele Frequency (BAF) values for the SNP probes. The LRR represents the total fluorescence intensity signals from each SNP probe set. The BAF values represent the fluorescence signal ratio between the B and A allele probes at each SNP locus. The clustering of genotype calls into AA, AB, and BB genotypes, and the generation of LRR and BAF values, is assisted by a model clustering file that specifies where each genotype cluster for a given SNP is likely to be located. This file can be created by the user or downloaded from the product page for the MDGA on the Thermo Fisher Scientific website10. Specific information about APT installation and required files for MDGA-based SNP genotyping are available online, along with a sample genotyping script11.
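
To make the two quantities concrete, the sketch below computes simplified versions of them from a pair of allele intensities. It is an illustration only: the expected total intensity is passed in as a hypothetical parameter, and the naive B-allele fraction ignores the per-SNP genotype cluster positions that the model clustering file supplies in the actual pipeline.

```python
import math

def log_r_ratio(a_signal, b_signal, expected_total):
    """Log2 ratio of observed total intensity to the total expected for two copies."""
    return math.log2((a_signal + b_signal) / expected_total)

def naive_b_allele_fraction(a_signal, b_signal):
    """Fraction of the total signal contributed by the B allele (simplified BAF)."""
    return b_signal / (a_signal + b_signal)

# Toy intensities; 1000 is an assumed two-copy total for this locus.
print(log_r_ratio(500, 500, 1000))        # two-copy signal
print(log_r_ratio(250, 250, 1000))        # half the expected signal
print(naive_b_allele_fraction(300, 100))  # signal skewed toward allele A
```

In this simplified view, an LRR near zero is consistent with two copies, negative values suggest a loss, and positive values suggest a gain.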

SNP genotyping is not only needed for generating CNV calls; it can also be used to identify false positive calls for putative CNV losses. A requirement for CNV identification in this thesis is that the CNV calls include SNP probes and not only IGPs. SNP probes within a CNV with a predicted state of zero should only have “No Call” genotypes since a state of zero would mean that no target DNA is present in that region of the genome. Heterozygous SNP genotypes (“AB” calls) would not be expected for SNP probes underlying a CNV with a state of one since only one copy of the allele would be present.
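
The two consistency rules described above can be written as a small check. This is a minimal sketch: the function name and the genotype labels ("AA", "AB", "BB", "NoCall") are assumptions chosen for illustration and do not correspond to a specific APT or PennCNV output format.

```python
def genotypes_consistent_with_state(copy_state, genotypes):
    """Return True if SNP genotypes inside a putative loss fit its copy number state."""
    if copy_state == 0:
        # No template DNA is present, so every SNP should be a No Call.
        return all(g == "NoCall" for g in genotypes)
    if copy_state == 1:
        # A single copy cannot produce a heterozygous genotype.
        return "AB" not in genotypes
    return True  # no genotype-based rule is applied to other states here

print(genotypes_consistent_with_state(0, ["NoCall", "NoCall"]))  # True
print(genotypes_consistent_with_state(1, ["AA", "AB", "BB"]))    # False
```

A putative loss that fails this check is a candidate false positive and can be flagged for closer inspection.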

2.1.5.2 CNV calling with PennCNV

Following SNP genotyping, PennCNV software11 can be used to generate CNV calls. PennCNV is open source software capable of using both SNP probes and IGPs for CNV calling and of detecting copy number states ranging from zero to four. The SNP genotyping step is a necessary step prior to CNV calling because PennCNV requires LRR and BAF information to identify CNVs. Before the LRR and BAF files are used for CNV calling, the SNP genotyping call rates for each sample are assessed and any unusual outliers are removed. The call rate for a sample is the percentage of loci that have a genotype call12. After outliers are removed, the genotyping step is performed again and new LRR and BAF files are created. SNP genotyping does not have to be repeated if there are no failing samples to remove. Another file necessary for CNV calling is a PFB-formatted file that contains the genomic location for all SNP probes and IGPs. For a given copy number state to be assigned to a genomic region, consecutively located target DNA sequences must emit the same fluorescence intensity after hybridizing to the probes. The MDGA probe locations are based on the sequenced Mus musculus genome and need to be updated as the reference genome is updated. Probes excluded from the PFB file will not be used by PennCNV when it generates CNV calls.
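
The call rate definition given above translates directly into code. In the sketch below, the 95% cut-off, the sample names, and the genotype labels are hypothetical values used only for illustration; the text does not specify a numeric threshold for identifying outlier samples.

```python
def call_rate(genotypes):
    """Percentage of loci that received a genotype call."""
    called = sum(1 for g in genotypes if g != "NoCall")
    return 100.0 * called / len(genotypes)

def failing_samples(genotypes_by_sample, min_rate=95.0):
    """Names of samples whose call rate falls below an assumed cut-off."""
    return [
        name
        for name, genotypes in genotypes_by_sample.items()
        if call_rate(genotypes) < min_rate
    ]

samples = {
    "sample_1": ["AA", "AB", "BB", "AA"],
    "sample_2": ["AA", "NoCall", "NoCall", "BB"],
}
print(call_rate(samples["sample_2"]))  # 50.0
print(failing_samples(samples))        # ['sample_2']
```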

PennCNV uses a hidden Markov model (HMM) algorithm to make CNV calls. The HMM algorithm determines how likely a probe signal is to represent a particular copy number state while taking the inter-probe distance and the fluorescence signal of the previous probe into consideration11. CNV calling in PennCNV for a dataset of interest is assisted by a trained HMM model file that provides information about probabilities and possible copy number outcomes, based on a training dataset.
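
The general approach can be illustrated with a toy Viterbi decoder in which the chance of changing state grows with the distance between neighbouring probes. This is a deliberately reduced sketch, not PennCNV's model: it uses three states and LRR values only, and every numeric parameter (state means, standard deviation, starting probabilities, transition settings) is a hypothetical value chosen for the example rather than a trained one.

```python
import math

STATES = ("loss", "normal", "gain")
MEAN_LRR = {"loss": -0.6, "normal": 0.0, "gain": 0.4}  # hypothetical state means
SD_LRR = 0.2                                           # hypothetical spread
START = {"loss": 0.01, "normal": 0.98, "gain": 0.01}   # hypothetical priors

def log_emission(state, lrr):
    """Log density of an LRR value under a Gaussian centred on the state mean."""
    z = (lrr - MEAN_LRR[state]) / SD_LRR
    return -0.5 * z * z - math.log(SD_LRR * math.sqrt(2.0 * math.pi))

def log_transition(prev, cur, distance_bp, p_max=0.05, scale_bp=10_000):
    """Leaving a state becomes more likely as the inter-probe distance grows."""
    p_leave = max(p_max * (1.0 - math.exp(-distance_bp / scale_bp)), 1e-12)
    if prev == cur:
        return math.log(1.0 - p_leave)
    return math.log(p_leave / (len(STATES) - 1))

def viterbi(lrr_values, positions):
    """Most likely state path for probes sorted by genomic position."""
    scores = {s: math.log(START[s]) + log_emission(s, lrr_values[0]) for s in STATES}
    backpointers = []
    for i in range(1, len(lrr_values)):
        distance = positions[i] - positions[i - 1]
        new_scores, pointers = {}, {}
        for cur in STATES:
            best_prev = max(
                STATES, key=lambda p: scores[p] + log_transition(p, cur, distance)
            )
            new_scores[cur] = (
                scores[best_prev]
                + log_transition(best_prev, cur, distance)
                + log_emission(cur, lrr_values[i])
            )
            pointers[cur] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

lrr = [0.02, -0.05, -0.62, -0.58, -0.65, 0.01, 0.03]
pos = [1000, 5000, 9000, 13000, 17000, 21000, 25000]
print(viterbi(lrr, pos))
```

With these toy settings, the three consecutive probes with strongly negative LRR are decoded as a loss flanked by normal probes, because the evidence from three probes outweighs the cost of entering and leaving the loss state.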

Several consecutive probes of the same assigned copy number state are required to call a putative CNV. The specific number of consecutive probes required to make a CNV call is selected by the user although a minimum of three or more probes is generally used13. From a biological perspective, most regions of the genome are expected to be in a copy state of two and it is uncommon to have two opposite copy number states next to each other, like a loss immediately followed by a gain or vice versa.
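
The minimum-probe requirement can be sketched as a pass over per-probe copy number states that keeps only sufficiently long runs of a non-diploid state. The function name and the input layout (one state per probe, in genomic order) are assumptions made for this illustration.

```python
from itertools import groupby

def call_cnv_segments(states, min_probes=3, normal_state=2):
    """Return (first_index, last_index, state) for runs of a non-normal state."""
    segments = []
    index = 0
    for state, run in groupby(states):
        length = len(list(run))
        if state != normal_state and length >= min_probes:
            segments.append((index, index + length - 1, state))
        index += length
    return segments

states = [2, 2, 1, 1, 1, 2, 3, 3, 2, 0, 0, 0, 0]
print(call_cnv_segments(states))  # [(2, 4, 1), (9, 12, 0)]
```

In the toy input, the two-probe run of state three is discarded because it is shorter than the three-probe minimum.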

The CNVs called by PennCNV require experimental confirmation to determine if the CNVs are biological events, and to locate the exact position of the CNV junctions. PennCNV provides start and end positions for each CNV call but the array does not have single nucleotide-level resolution, so these positions are rough estimations of the junction locations. The true CNV junctions are assumed to lie somewhere between the probes at the ends of the predicted CNV and the probes just outside of the CNV ends. Further information about the use of PennCNV with microarrays designed by Affymetrix is available on the PennCNV website14.
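
The uncertainty around each junction can be expressed as an interval bounded by the outermost probe inside the call and the nearest probe outside it. The sketch below assumes probe positions sorted along one chromosome; the function name and the toy coordinates are illustrative.

```python
def junction_intervals(positions, first_index, last_index):
    """Intervals expected to contain the left and right junctions of a CNV call.

    positions holds sorted probe coordinates on one chromosome; first_index and
    last_index are the first and last probes inside the call. None marks a
    missing flanking probe at a chromosome end.
    """
    left_outer = positions[first_index - 1] if first_index > 0 else None
    right_outer = (
        positions[last_index + 1] if last_index + 1 < len(positions) else None
    )
    return (left_outer, positions[first_index]), (positions[last_index], right_outer)

probe_positions = [100, 400, 900, 1300, 2000]
print(junction_intervals(probe_positions, 1, 3))  # ((100, 400), (1300, 2000))
```

The width of each interval reflects local probe spacing, which is why junction positions reported from array data are approximate.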

2.1.6 Discrepancies in annotations for the probe files of the Mouse Diversity Genotyping Array

Following the release of the MDGA, discrepancies were found in the chromosomal positions listed in the annotation files for the SNP probes15. Later, the probe annotation files were filtered by S.T. Eitutis (2013)16 based on several inclusion criteria including, but not limited to, probe length consistency, perfect match genome alignment scores, the absence of restriction enzyme recognition sites in the target DNA, presence of only one known SNP locus within a probe set, and sufficient genomic distance between target SNPs. In this chapter, the probe annotation files for both SNP probes and IGPs are filtered to remove unreliable probes from use in CNV identification. The CNV-specific filtering criteria for the MDGA are described in the methods section.

Although probe filtering can be used to improve CNV calling reliability, the CNVs identified by microarrays are putative and require additional experimental confirmation. Several computational based methods, discussed later in Section 2.5, can be used to assess the quality of the MDGA data prior to experimental confirmation.

2.1.7 Assessment of probe design for the Genome-Wide Human SNP Array 6.0

Following assessment of the MDGA probe design, probe annotation files from the Genome-Wide Human SNP Array 6.0 (SNP Array 6.0; Affymetrix®, Thermo Fisher Scientific Inc, Waltham, MA) were assessed for issues in SNP probe design, for comparison purposes (see Results section 2.3.1). The SNP Array 6.0, designed by McCarroll et al (2008)17, is a human SNP-based genotyping array comparable to the MDGA, which was available for use one year after the SNP Array 6.018. The SNP Array 6.0 contains 1,863,892 different probe sequences that provide SNP genotype calls at 931,946 loci (full CDF file)19. Each SNP genotype is determined by a set of two probe sequences representing two alleles.

Like the MDGA, the SNP Array 6.0 can provide both SNP and CNV genotype calls17. The impact of probe filtering on the two arrays can be assessed and compared by looking at the SNP genotyping call rates pre- and post-filtering. An increase in SNP genotype call rates is expected if the probe filtering is effective at removing probes that do not return a genotype call. In this chapter, SNP Array 6.0 probe filtering is limited to inclusion criteria that are relevant to SNP genotype calling. The probe inclusion criteria include: probe length of 25 nt, probe sequence uniqueness, and absence of NspI or StyI recognition sites in the target DNA. If the target DNA is unable to bind the probes due to being digested at the probe target site, then a genotype call cannot be made, and removal of these poorly-performing probes should result in an increased call rate.
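
The three inclusion criteria listed above can be sketched as a simple filter. The recognition patterns used are the standard degenerate sites for NspI (RCATGY) and StyI (CCWWGG); because both are palindromic, scanning one strand is sufficient. The function name, the input layout, and the toy sequences are assumptions for illustration, and this is not the in-house script used in this chapter.

```python
import re
from collections import Counter

NSPI_SITE = re.compile(r"[AG]CATG[CT]")   # RCATGY
STYI_SITE = re.compile(r"CC[AT][AT]GG")   # CCWWGG

def filter_probes(probes, required_length=25):
    """Return IDs of probes that pass length, uniqueness, and restriction-site checks.

    probes maps a probe ID to its sequence.
    """
    sequences = {pid: seq.upper() for pid, seq in probes.items()}
    counts = Counter(sequences.values())
    kept = []
    for pid, seq in sequences.items():
        if len(seq) != required_length:
            continue  # wrong probe length
        if counts[seq] > 1:
            continue  # sequence shared with another probe
        if NSPI_SITE.search(seq) or STYI_SITE.search(seq):
            continue  # target would be cut during sample preparation
        kept.append(pid)
    return kept

toy_probes = {
    "p1": "ACGT" * 6 + "A",     # passes all checks
    "p2": "T" * 25,             # duplicate of p3
    "p3": "T" * 25,
    "p4": "ACATGT" + "A" * 19,  # contains an NspI site
    "p5": "CCAAGG" + "T" * 19,  # contains a StyI site
    "p6": "ACGT" * 5,           # only 20 nt long
}
print(filter_probes(toy_probes))  # ['p1']
```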

SNP genotyping can also be affected by the genotyping algorithm used. The Birdseed algorithm is commonly used for the SNP Array 6.0 and there are two versions available (v1 and v2). The two Birdseed versions differ in their use of SNP-specific models. For Birdseed v1, a Gaussian mixture model is fitted into two-dimensional A- and B-signal space and the SNP-specific models are used as starting points for Expectation-Maximization algorithm iterations for genotype clustering6. Birdseed v2 is more robust than Birdseed v1 in the sense that the clustering is more reliant on SNP-specific priors.

Here, both Birdseed versions are tested on a HapMap320 dataset of 874 samples, to observe the effect on SNP call rates, in comparison to probe filtering effects. The Birdseed algorithm is used as opposed to other existing algorithms because Birdseed is used by Affymetrix® for array testing and validation21. Two versions of Birdseed are included in this study rather than one because they are known to produce slightly different genotyping outcomes for the same dataset22.

2.1.8 Cross-species hybridization and considerations for SNP genotyping

SNP microarrays are generally made for a limited number of organisms that are of scientific interest, including humans, model organisms like mice, agriculturally-important animals like cows, and companion species like dogs1,21,23,24. These microarrays can sometimes be used for organisms for which they were not designed. The hybridization of Antarctic fur seal (Arctocephalus gazella) DNA to a dog (Canis familiaris) array is one such example and the study obtained SNP genotype calls for 19.2% of loci targeted by the array25. This low percentage of usable SNP loci is expected since, according to this study, there are 44 million years of divergence between the seal and dog.

Cross-species array hybridization has also been used for gene expression studies. Some examples include Weddell seal (Leptonychotes weddellii) samples being applied to a human gene expression array26, Sordaria macrospora fungus samples being applied to an array based on the closely related Neurospora crassa fungus27, and bell pepper and eggplant samples being applied to a tomato gene expression array28.

For cross-species microarray assays, it is expected that if there is low sequence divergence between the species being tested and the species upon which an array was designed, as is expected with closely related species, then there will be high levels of probe-target hybridization. Sequence divergence as small as 1% will cause a detectable decrease in hybridization levels29. Fish gene expression array experiments have shown that gene expression results are most consistent for species that diverged from each other less than 10 million years ago30. The same study showed that some results, although more limited, can be generated by expression arrays for species that diverged from the array reference species more than 200 million years ago. A similar observation was found for the MDGA, where increased genetic distance of Mus musculus samples from the C57BL/6J reference strain was associated with an increase in heterozygous SNP calls and No Calls31.

CNVs are identified by collective calls of an altered copy number state using consecutive uninterrupted markers. Therefore, unlike SNP genotyping, the ability to identify CNV events is heavily dependent on knowing the genomic position of probe targets, relative to each other. In order to use the MDGA for CNV calling in a different species, the MDGA probe sequences would have to be reannotated based on the sequenced reference genome for the species of interest, given the likelihood of genomic rearrangements over evolutionary time. To ensure that only consecutively located probes are used to identify CNVs, probes with sequences that occur more than once in the genome should be excluded from CNV calling and the probes need to be annotated so that they are assigned a genomic position.

2.1.9 General goal, specific objectives, and predicted outcomes

Overall, the filtered probe lists and recommendations for MDGA use and data quality assessment described in this chapter are expected to assist in the improvement of the MDGA as a tool for SNP genotyping and CNV detection in mice.

The specific objectives of this chapter are:

1. To identify MDGA probes that are predicted to perform poorly in CNV detection based on defined criteria for probe design, and to generate a list of probes recommended for use in CNV detection.

2. To assess the impact of filtering probe lists to contain valid probes only, with filtering success being evaluated in terms of increased SNP genotype call rates.

3. To compare the impact of MDGA probe list filtering on SNP genotype calling to the impact of Genome-Wide Human SNP Array 6.0 probe list filtering.
   • The Genome-Wide Human SNP Array 6.0 is predicted to have a better probe design and is used here as a standard for comparison.

4. To evaluate the applicability and hence the utility of the MDGA for rodent species other than Mus musculus.

5. To make recommendations for study design considerations to ensure that reliable and useful MDGA data is generated.

6. To discuss approaches for assessing quality of MDGA data.

The filtering of probe annotation files for the MDGA is expected to result in improved SNP genotype calling and help create a more reliable list of probes for use in CNV calling. SNP Array 6.0 probe filtering is also predicted to result in improvements to SNP genotype calling. However, the greater use and earlier development of human versus mouse microarray technologies may have resulted in a better designed microarray for humans, requiring fewer probes to be excluded from genotyping. It is expected that the MDGA can be used for cross-species hybridization studies of SNP-based genetic diversity and that the array will be most informative for samples closely related to Mus musculus.

2.2 Materials and methods

2.2.1 Probe filtering and SNP genotyping for the Mouse Diversity Genotyping Array

MDGA annotation files for IGPs and SNP probes were filtered using criteria relevant for CNV calling, and these filtered probe lists were used to identify CNVs described in multiple projects in subsequent chapters of this thesis. IGP annotation files were downloaded from the Center for Genome Dynamics website32. IGPs that were classified as Exon 1 and Exon 2 were locally run through BLAST to ensure that the probe sequences were found only once in the mouse haploid genome (UCSC:mm9) and to verify the annotated position (Fig. 2-1). Inclusion criteria for probe sequences required a length of 25 nt, one probe ID per probe sequence and the presence of complementary sense and antisense sequences.
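
The three sequence-level inclusion criteria can be sketched as follows. The sketch assumes that an antisense probe is stored as the reverse complement of its sense partner, which the annotation files may or may not do; the function names and toy sequences are hypothetical, and the BLAST-based uniqueness and position checks are not reproduced here.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def igp_probes_passing(probes, required_length=25):
    """Return IDs of probes meeting the three sequence-level inclusion criteria.

    probes is a list of (probe_id, sequence) pairs.
    """
    ids_by_sequence = defaultdict(set)
    for probe_id, seq in probes:
        ids_by_sequence[seq].add(probe_id)
    kept = []
    for probe_id, seq in probes:
        if len(seq) != required_length:
            continue  # wrong probe length
        if len(ids_by_sequence[seq]) != 1:
            continue  # more than one probe ID for this sequence
        if reverse_complement(seq) not in ids_by_sequence:
            continue  # no complementary partner on the array
        kept.append(probe_id)
    return kept

sense = "ACGT" * 6 + "A"
toy = [
    ("sense_1", sense),
    ("antisense_1", reverse_complement(sense)),
    ("orphan", "A" * 25),      # partner sequence is absent
    ("dup_1", "C" * 25),       # same sequence listed under two IDs
    ("dup_2", "C" * 25),
]
print(igp_probes_passing(toy))  # ['sense_1', 'antisense_1']
```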

In-house scripts removed probe sets likely to contribute to background noise and false positives, including those containing palindromic NspI or StyI recognition sites within a given probe sequence and its 12 nt flanking region (as the genomic target sequence is digested by these restriction enzymes prior to hybridization to the array) as well as probe sets overlapping other probe sets based on genomic position, as these would compete for genomic DNA template (Fig. 2-1). This filtered list was used to identify CNVs in the broad survey discussed in Chapter 3. Following the completion of CNV calling and analyses, the SNP and IGP annotation files were further filtered to create a more stringent probe list (Fig. 2-1), which was used in the mouse cohort comparison study of Chapter 3. The additional filtering included removing overlapping probes because they would compete for the same DNA template, which could affect the fluorescence intensity of that genomic region. The check for overlapping probes included all probe types on the MDGA. It is important to note that overlap was determined based on the reference genome locations (UCSC:mm9/NCBI Build 37) and that which probes are overlapping each other may change with updates to the reference genome. The probe identifiers for the stringently filtered probe list are provided in Locke et al. (2015)33.
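
One way to detect probes that overlap by genomic position is a sorted sweep along each chromosome. The sketch below is a generic interval-overlap check written for illustration and is not the in-house script used in this work; the tuple layout, the use of inclusive coordinates, and the toy positions are assumptions.

```python
def overlapping_probe_ids(probes):
    """Return the IDs of probes whose target intervals overlap another probe.

    probes is a list of (probe_id, chromosome, start, end) with inclusive
    coordinates on a single reference build.
    """
    by_chromosome = {}
    for probe_id, chromosome, start, end in probes:
        by_chromosome.setdefault(chromosome, []).append((start, end, probe_id))
    flagged = set()
    for intervals in by_chromosome.values():
        intervals.sort()
        furthest_end, furthest_id = None, None  # interval reaching furthest right
        for start, end, probe_id in intervals:
            if furthest_end is not None and start <= furthest_end:
                flagged.update((probe_id, furthest_id))
            if furthest_end is None or end > furthest_end:
                furthest_end, furthest_id = end, probe_id
    return flagged

toy = [
    ("a", "chr1", 100, 124),
    ("b", "chr1", 120, 144),  # overlaps probe a
    ("c", "chr1", 200, 224),
    ("d", "chr2", 100, 124),  # same coordinates as a, different chromosome
]
print(sorted(overlapping_probe_ids(toy)))  # ['a', 'b']
```

Because the check depends entirely on annotated coordinates, rerunning it against an updated reference build may flag a different set of probes.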

For 351 mouse .CEL files downloaded from the Center for Genome Dynamics at The Jackson Laboratory website34, SNP genotype calls were generated using the APT BRLMM-P algorithm and default parameters. The three probe lists used for SNP genotyping were the unfiltered MDGA probe list, the filtered list provided by S. T. Eitutis (2013) and the stringent list recommended for CNV calling in Locke et al16,33.

2.2.2 Probe filtering and SNP genotyping for the Genome-Wide Human SNP Array 6.0

SNP Array 6.0 probe sequence, flanking sequence, probe annotation, and library files were downloaded from the Thermo Fisher Scientific website35. SNP probe sequences with “SNP_A” identifiers were filtered, using in-house scripts, based on probe length, sequence uniqueness among other SNP probes, and restriction enzyme recognition sites (Fig. 2-2; Appendix 2A). If one probe sequence in the set of two failed to meet inclusion criteria, the whole probe set was removed from analysis. SNP probes were not filtered based on probe overlap criteria which would remove probes competing for the same target DNA. Unlike CNV calling, SNP genotyping is not dependent on the fluorescence intensity levels relative to an expected diploid intensity, which can be affected by probe competition for shared DNA targets. After filtering was applied to the 25 nt probe sequences, restriction enzyme recognition site filtering was applied to the annotation files containing flanking sequence information for each probe (Appendix 2B). Flanking sequences contain the 25 nt probe sequence and eight additional nucleotides.

SNP genotype calls were generated using Affymetrix® Power Tools with the Birdseed (version 1 and version 2) algorithm and default parameters6, and 874 HapMap3 CEL files (Appendix 2C) that were downloaded from NCBI’s HapMap ftp site36. This dataset represents individuals from various ethnic backgrounds including African ancestry, Gujarati Indian, Han Chinese, Japanese, Luhya, Maasai, Mexican ancestry, Toscani, Utah/Mormon and Yoruba. Sample APT scripts are provided in the supplementary materials (Appendix 2D).

Graphical images were generated using R37. Sample numbers in Figure 2-3 correspond with the “Sample_Number” column in Appendix 2C and sample numbers in Figure 2-4 correspond with the “Sample_Number” column in Appendix 2E.

2.2.3 Cross-species hybridization with SNP genotyping

The Mus subgenera sample set includes a total of 27 Mus, Pyromys, Coelomys, and Nannomys samples (Table 2-1). MDGA CEL files for Mus samples were downloaded from the Center for Genome Dynamics website38. Naked mole-rat (Heterocephalus glaber) tail tissue samples were provided by Dr. Melissa Holmes (University of Toronto Mississauga). DNA was extracted according to the Wizard® Genomic DNA Purification Kit protocol (Promega, Madison, Wisconsin, USA), with two modifications: 1) tissues were digested by proteinase K for 24 to 48 hours, and 2) RNAse digestion was used for all tissues. DNA quantity and purity were assessed using a NanoDrop 2000c spectrophotometer (Thermo Fisher Scientific, Waltham, Massachusetts, USA). MDGA hybridization was performed at the London Regional Genomics Centre (Robarts Research Institute, Western University, London, ON). SNP genotyping was performed for combined Mus and H. glaber samples as well as separately for each of these two groups. Genotype calls were generated using the stringent filtered MDGA probe list (Section 2.2.1), APT BRLMM-P algorithm, and default parameters.

Table 2-1. Mus species and Heterocephalus glaber samples.

| CEL ID | Mus Subgenus | Species | Sex |
|---|---|---|---|
| SNP_mDIV_D7-473_012209.CEL | Coelomys | Mus pahari | Male |
| SNP_mDIV_D6-472_012209.CEL | Mus | Mus caroli | Male |
| SNP_mDIV_D7-644_101509-redo.CEL | Mus | Mus caroli | Male |
| SNP_mDIV_D7-644_91809.CEL | Mus | Mus caroli | Male |
| SNP_mDIV_D3-639_101509-redo.CEL | Mus | Mus castaneus | Female |
| SNP_mDIV_D3-639_91809.CEL | Mus | Mus castaneus | Female |
| SNP_mDIV_D8-646_101509-redo.CEL | Mus | Mus cervicolor | Male |
| SNP_mDIV_D8-646_91809.CEL | Mus | Mus cervicolor | Male |
| SNP_mDIV_A2-645_102109.CEL | Mus | Mus cookii | Male |
| SNP_mDIV_D9-647_101509-redo.CEL | Mus | Mus dunni | Male |
| SNP_mDIV_D9-647_91809.CEL | Mus | Mus dunni | Male |
| SNP_mDIV_D4-640_101509-redo.CEL | Mus | Mus famulus | Male |
| SNP_mDIV_D4-640_91809.CEL | Mus | Mus famulus | Male |
| SNP_mDIV_D8-474_012209.CEL | Mus | Mus famulus | Male |
| SNP_mDIV_D5-642_101509-redo.CEL | Mus | Mus fragilicauda | Male |
| SNP_mDIV_D5-642_91809.CEL | Mus | Mus fragilicauda | Male |
| SNP_mDIV_D6-643_101509-redo.CEL | Mus | Mus fragilicauda | Male |
| SNP_mDIV_D6-643_91809.CEL | Mus | Mus fragilicauda | Male |
| SNP_mDIV_A7-654_102109.CEL | Nannomys | Mus mattheyi | Male |
| SNP_mDIV_D11-653_101509-redo.CEL | Nannomys | Mus minutoides | Male |
| SNP_mDIV_D11-653_91809.CEL | Nannomys | Mus minutoides | Male |
| SNP_mDIV_D10-652_101509-redo.CEL | Nannomys | Mus orangiae | Male |
| SNP_mDIV_D10-652_91809.CEL | Nannomys | Mus orangiae | Male |
| SNP_mDIV_A3-648_102109.CEL | Pyromys | Mus platythrix | Male |
| SNP_mDIV_A4-649_102109.CEL | Pyromys | Mus platythrix | Male |
| SNP_mDIV_A5-650_102109.CEL | Pyromys | Mus saxicola | Male |
| SNP_mDIV_A6-651_102109.CEL | Pyromys | Mus saxicola | Male |
| DNA3337.CEL | N/A | Heterocephalus glaber | Female |
| DNA3338.CEL | N/A | Heterocephalus glaber | Female |
| DNA3339.CEL | N/A | Heterocephalus glaber | Male |
| DNA3340.CEL | N/A | Heterocephalus glaber | Male |

2.3 Results

2.3.1 Mouse Diversity Genotyping Array and Genome-Wide Human SNP Array 6.0 probe filtering

Following filtering of MDGA probe sequences, 94% of SNP probe sequences and 71% of exon probe sequences remained (Fig. 2-1). In total, approximately 4.79 million unique probes targeting 915,195 loci are available for use in CNV calling. In comparison to the original, unfiltered SNP probe list, SNP genotype calling improved for 100% of tested samples when using both the filtered list by S.T. Eitutis (2013)16, and the filtered list in this study (Table 2-2). The average percent by which the call rates increased is similar for the two filtered probe lists.

The SNP Array 6.0 probe files were filtered for the purpose of determining if SNP genotyping call rates would improve or not following filtering. SNP probe filtering left 95% of SNP probes available for genotyping (Fig. 2-2). For this array, the SNP call rate is most improved when using a different calling algorithm with the original probe list and not by filtering the probe list (Table 2-2). When using the unfiltered human array probe list, 174 probes fail to call a genotype in any of the 874 samples (Appendix 2F). Filtering the probe list does not result in the removal of any of the 174 probes and they continue to provide No Call genotypes. The call rate improvement following probe filtering proved to be minimal for the human array, with an average call rate increase of 0.009% (Birdseed v1) or 0.0093% (Birdseed v2), and a maximum individual sample increase of approximately 0.06% (Table 2-2).

Figure 2-1. Impact of application of probe list filtering criteria for SNP, Exon 1 and Exon 2 Mouse Diversity Genotyping Array probes. Non-bolded criteria were used to construct the probe list used in the broad survey in Chapter 3. Bolded criteria were used to construct the stringent probe list used in the mouse cohort comparison study in Chapter 3. SNP probe filtering was performed on a probe list previously filtered by S. T. Eitutis (2013)16. The Exon probe list had not been filtered previously.

Table 2-2. Changes in SNP genotyping call rate when changing algorithm version and SNP probe lists.

Array | Algorithm and probe list (a) | Percentage of samples (e) with improved call rates (%) | Average call rate increase (%; range) | Average call rate decrease (%; range)
--- | --- | --- | --- | ---
Mouse Diversity Genotyping Array | BRLMM-P, S. T. Eitutis (b) | 100 | 0.538 (0.16 to 1.08) | N/A
Mouse Diversity Genotyping Array | BRLMM-P, Locke et al (b) | 100 | 0.536 (0.16 to 1.08) | N/A
Genome-Wide Human SNP Array 6.0 | Birdseed (v1), filtered (c) | 72.65 | 0.009 (0 to 0.061) | 0.0044 (2x10^-5 to 0.018)
Genome-Wide Human SNP Array 6.0 | Birdseed (v1), flanking (c) | 31.24 | 0.011 (8x10^-5 to 0.051) | 0.013 (5x10^-5 to 0.061)
Genome-Wide Human SNP Array 6.0 | Birdseed (v1), original list (d) | 95.88 | 0.083 (6.7x10^-4 to 0.17) | 0.018 (3.3x10^-4 to 0.065)
Genome-Wide Human SNP Array 6.0 | Birdseed (v2), filtered (d) | 74.37 | 0.0093 (3x10^-5 to 0.057) | 0.0044 (3x10^-5 to 0.018)
Genome-Wide Human SNP Array 6.0 | Birdseed (v2), flanking (d) | 29.98 | 0.011 (5x10^-5 to 0.045) | 0.013 (5x10^-5 to 0.064)

(a) ‘Filtered’ refers to the probe list filtering applied to only the 25 nt probe sequence. ‘Flanking’ excludes ‘Filtered’ probes whose flanking regions contain NspI or StyI recognition sites.
(b) Compared to original MDGA list with BRLMM-P.
(c) Compared to original SNP Array 6.0 list with Birdseed (v1).
(d) Compared to original SNP Array 6.0 list with Birdseed (v2).
(e) Genotyping was performed for 351 MDGA CEL files and 874 HapMap3 CEL files for the MDGA and SNP Array 6.0, respectively.

[Figure 2-2 flow chart. Remaining loci after successive filtering criteria: 931,946 (25 nt probe length); 931,932; 903,296.]

Figure 2-2. Impact of application of probe list filtering criteria for Genome-Wide Human SNP Array 6.0 SNP probes. Probe sequences representing both SNP alleles of a locus were excluded if at least one probe sequence did not meet the inclusion criteria.

When Birdseed (v1) is used instead of Birdseed (v2) as the genotyping algorithm, the call rate improves for 96% of samples (Table 2-2). However, the change in call rate is small. The average call rate across samples increases by 0.083%, and the highest increase in an individual sample’s call rate is 0.17%. For samples where the call rate decreases, the average decrease in call rate is 0.018%. Judged solely by call rate, the differences between algorithm versions are very small. At an individual sample level, some call rates are not greatly affected by changes in algorithm version or probe list used, while other samples are more affected (Fig. 2-3).

Figure 2-3. Call rates for 351 Jackson Laboratory samples, generated with different probe lists. Black lines connect the call rates for individual samples and represent the call rate difference.

The difference in effect of probe filtering on SNP genotype call rates for the mouse and human arrays can be best observed in Figures 2-3 and 2-4. The average call rate increase following filtering for the MDGA is approximately 0.54%, which is nine times higher than the maximum call rate increase for an individual human sample following filtering (~0.06%). The highest call rate increase for an individual mouse sample post-filtering is 1.08%.

Figure 2-4. Call rates for HapMap3 samples, generated with Birdseed (v1), Birdseed (v2), and filtered probe lists. Black lines connect the call rates for individual samples and represent the call rate difference.

2.3.2 Cross-species hybridization

The average SNP genotyping call rate is highest for members of the subgenus Mus while H. glaber has the lowest average call rate at approximately half the call rate of the Mus subgenus group (Table 2-3). The H. glaber call rate almost doubles when it is genotyped together with Mus samples.

Table 2-3. SNP genotype call rates for Heterocephalus glaber samples genotyped together with and without a set of 27 samples representing four Mus subgenera: Mus, Pyromys, Coelomys, and Nannomys.

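Samples | Divergence time from Mus musculus domesticus | Sample size | Genotyped with or without other samples | Average call rate
--- | --- | --- | --- | ---
Heterocephalus glaber (Naked mole-rat) | 73.1 million years ago39 | 4 | Without | 44%
Heterocephalus glaber (Naked mole-rat) | 73.1 million years ago39 | 4 | With Mus set | 86%
Mus set | ≤ 7.6 million years ago40 | Mus (17), Pyromys (4), Coelomys (1), Nannomys (5) | Without | Mus (90%), Pyromys (85%), Coelomys (83%), Nannomys (86%)
Mus set | ≤ 7.6 million years ago40 | Mus (17), Pyromys (4), Coelomys (1), Nannomys (5) | With four H. glaber samples | Mus (91%), Pyromys (88%), Coelomys (86%), Nannomys (88%)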

2.4 Discussion

2.4.1 Impact of probe filtering for the Mouse Diversity Genotyping Array and the Genome-Wide Human SNP Array 6.0

MDGA probe filtering improved overall SNP genotype call rates, as predicted, and it did so for all tested mouse samples. Therefore, filtered probe lists are recommended for use in variant detection with the MDGA, particularly for SNP genotyping. Testing the effect of probe filtering on CNV calling is more challenging, as it would require algorithms to be tested on samples where the CNVs have been previously identified so that the copy number state and genomic location are known. Although not tested here, probe filtering is expected to improve CNV detection, so filtered probe lists will be used for CNV detection in the studies described later in this thesis. CNV calls are directly affected by SNP probe performance. If a probe does not bind its target efficiently, then the fluorescence intensity of that region will be lower than the fluorescence intensity of the neighbouring probes and may break up a larger CNV gain, making it falsely appear to be two separate CNV events. False negatives can occur when minimum marker requirements are used for CNV calling and poorly performing probes in a CNV region prevent the cut-offs from being met. For example, a CNV spanning a required minimum of three probe loci may not be identified as a CNV because one of the probes is providing false signal information due to poor target DNA binding.

SNP Array 6.0 data indicate that probe filtering based on NspI and StyI recognition sites does not greatly improve the average call rate, particularly when compared to the effect of changing the algorithm version. Furthermore, a group of 174 probes that produced No Call genotypes in all tested samples was not removed by probe filtering. This implies that factors other than restriction enzyme digestion are involved in preventing hybridization. No combination of probe list and algorithm resulted in an increased call rate for all samples. The highest call rates overall were observed when using the unfiltered probe list and the Birdseed (v1) algorithm. This suggests that the SNP Array 6.0 is well-designed with respect to the location of restriction enzyme recognition sites and does not appear to benefit from the use of a filtered probe list. The difference in the average call rate was very minor (<0.1%) between algorithm versions, and this finding is consistent with previous algorithm comparisons when genotyping using the Genome-Wide Human SNP Array 6.0 (ref. 22). However small the differences may appear to be, the algorithm choice does still impact the genotyping results, and selection of the wrong algorithm can result in variants going undetected or being incorrectly associated with a biological event or phenotype. The importance of algorithm selection is supported by Bucasas et al.41, who found that the CRLMM algorithm yielded higher call rates than the Birdseed (v2) algorithm but also showed decreased tolerance to low quality samples. Users selecting algorithms for variant detection should be aware that some outcomes may be algorithm-specific. The reasons why Birdseed (v1) yields higher call rates than Birdseed (v2) would need to be investigated in a future study.

2.4.2 Cross-species hybridization and considerations for SNP genotyping

Like other microarrays, the MDGA can also be used for cross-species hybridization but the call rates are heavily influenced by how the genotyping is performed. For instance, naked mole-rat samples applied to the MDGA yield very different genotyping call rates depending on whether or not the samples are genotyped along with Mus samples or independently. By genotyping the naked mole-rat samples along with Mus samples, the call rates are artificially increased so that they are more similar to the mouse samples. Genotyping Mus samples with naked mole-rat samples results in a slight increase, up to 3%, in the average Mus call rate. These results suggest that when conducting cross-species hybridization analysis, it is important to perform the genotyping for the different species separately.

Aside from technical issues, low call rates can occur if a sample’s genome is distantly related to the reference genome on which the probe design was based. In this case, sequence divergence between species results in hybridization incompatibilities between the target DNA and the array probes. Sample DNA hybridizes with the MDGA SNP probes to a degree that is associated with the time since the species diverged from Mus musculus domesticus. Mus subgenus samples have the highest call rates among the Mus genus samples tested, likely due to having a closer genetic relationship to M. m. domesticus42 and therefore more sequence complementarity between the MDGA probe sequences and the target DNA. The subgenera Pyromys, Coelomys, and Nannomys have call rates between 83% and 88% (Table 2-3) and are more distantly related to the M. musculus species42. Heterocephalus glaber (naked mole-rat) samples, which have the lowest call rates, are also the samples that are the most distantly related to M. m. domesticus39.

Cross-species hybridization is more difficult to apply to CNV detection than to SNP genotyping. This is because accurate CNV detection requires high genomic coverage by probes that are contiguously located in the genome. If an organism is distantly related to the organism whose genome was used to design the array, then many probes are likely to lack a target because of sequence divergence, and probes that do bind a target may not be contiguously located because of low synteny. When choosing to perform cross-species hybridization, it is advisable to check the array probe sequences against a reference genome (if available) to determine if the sequence is present in the genome and where it is located. This will provide the user with an idea of how many probes should return a genotype call and which part of the genome is being targeted. The naked mole-rat, for example, has 41,225 syntenic regions with the mouse genome, covering ~83% of the naked mole-rat genome43. In comparison, there are 24,999 syntenic regions between the naked mole-rat and human genomes, covering ~92% of the naked mole-rat genome43. This means that both mouse and human microarrays could be used for CNV detection in the naked mole-rat, provided that there is sufficient sequence similarity between the microarray probes and the sample DNA.
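The probe check described above can be illustrated with a minimal sketch. It assumes probes and reference sequences are already loaded as plain Python dictionaries and searches for exact matches on both strands; the function names and input formats are hypothetical, and a real analysis of millions of 25 nt probes would more likely rely on a dedicated sequence aligner that tolerates mismatches.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def locate_probes(probes, genome):
    """Find exact matches of each probe in a reference genome.

    probes: dict of probe ID -> probe sequence (hypothetical input format)
    genome: dict of chromosome name -> sequence (hypothetical input format)
    Returns a dict of probe ID -> list of (chromosome, 0-based start, strand),
    where the start is always given on the forward strand of the reference.
    """
    hits = {probe_id: [] for probe_id in probes}
    for probe_id, seq in probes.items():
        forward = seq.upper()
        reverse = reverse_complement(forward)
        queries = [("+", forward)]
        if reverse != forward:  # avoid double counting palindromic probes
            queries.append(("-", reverse))
        for chrom, chrom_seq in genome.items():
            chrom_seq = chrom_seq.upper()
            for strand, query in queries:
                start = chrom_seq.find(query)
                while start != -1:
                    hits[probe_id].append((chrom, start, strand))
                    start = chrom_seq.find(query, start + 1)
    return hits


def summarize(hits):
    """Count probes with no match, exactly one match, and multiple matches."""
    counts = {"absent": 0, "unique": 0, "multiple": 0}
    for positions in hits.values():
        if not positions:
            counts["absent"] += 1
        elif len(positions) == 1:
            counts["unique"] += 1
        else:
            counts["multiple"] += 1
    return counts


# Toy data for illustration only
toy_genome = {"chr1": "ACGTTGCAAGGCTTAACC"}
toy_probes = {"p1": "TTGCAAGG", "p2": "GGGGGGGG", "p3": "CCTTGCAA"}
print(summarize(locate_probes(toy_probes, toy_genome)))
```

Probes reported as absent are not expected to return a genotype call, and probes with multiple matches cannot be assigned a single genomic position for CNV calling.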

2.5 Computation-based assessment of MDGA data quality

CNV results generated by microarrays require experimental confirmation to ensure that the calling methods are reporting biological events. Even if the microarray probes are designed well, false positive and negative calls can occur. Some CNV calling errors can result from hybridization problems while other errors result from the CNV calling methods or algorithms employed44,45. A study comparing multiple algorithms on SNP microarray data found that concordance in detected CNVs between any two of the algorithms tested was below approximately 50%46. The study’s authors advised, based on findings in multiple studies, that the use of multiple algorithms for a dataset can reduce false negative calls but will also increase false positive calls. Typically, PCR-based methods are used to experimentally confirm putative CNVs. In studies where a large number of CNVs are detected, it is not practical or cost-effective to confirm every putative CNV. Therefore, a targeted approach to confirmation can be used. With a targeted approach, computational assessment of the data quality can be used first to evaluate if the CNV call quality is sufficient to move forward with wet lab confirmation. The following chapter sections discuss computational methods of data quality assessment.

2.5.1 Assessing MDGA data quality through visualization of fluorescence intensity data

One of the simplest ways of checking the quality of microarray hybridization for newly generated data or unfiltered, publicly available data is to generate coloured images showing the hybridization signal for each array feature. Some arrays can have regions of abnormally low or high hybridization signal, which can result from an unequal application of the DNA solution, precipitate formation, or poor sample preparation47. Once array hybridization has been assessed visually for the presence of artifacts, and problematic arrays have been removed from the dataset, SNP genotyping and CNV calling can be performed. Depending on the software used for interpreting the fluorescence intensity data and the size of the artifacts, artifact reduction may be possible. For small artifacts, the probes in those regions can be identified and excluded from analysis. Normalization may help in cases where the overall array fluorescence intensity is lower or higher than expected, unless the signal is too low for detection or the array is oversaturated.

2.5.2 Assessing quality of SNP genotype output

SNP genotype call rates are an indicator of the array quality and sample suitability. For a call to be generated, the probes have to bind sufficient DNA template for the hybridization fluorescence signal to be detectable. Errors affecting DNA preparation and inappropriate hybridization conditions can reduce probe-target binding48, leading to low SNP genotype call rates.
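As a small illustration, the call rate of a sample can be computed as the percentage of SNP loci that received a genotype other than No Call. The sketch below assumes genotypes are stored as a list of strings and that No Call is encoded as "NoCall"; both the encoding and the function name are assumptions made for the example.

```python
def call_rate(genotypes, no_call="NoCall"):
    """Return the percentage of loci with a genotype call other than No Call."""
    if not genotypes:
        raise ValueError("at least one genotype is required")
    called = sum(1 for genotype in genotypes if genotype != no_call)
    return 100.0 * called / len(genotypes)


# Toy example: three of four loci were called
print(call_rate(["AA", "AB", "NoCall", "BB"]))  # 75.0
```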

For samples closely related to the reference genome, hybridization success may be reduced if off-target mutations (nucleotide mismatches at sites other than the SNP locus) are present. The expected MDGA SNP call rate for a Mus musculus domesticus laboratory strain is 97% or higher7. Lower call rates are expected for other subspecies of mouse like M. m. musculus, M. m. castaneus and M. m. molossinus, and different species of mouse like M. spretus and M. spicilegus are expected to have call rates in the low nineties1. To ensure the highest call rates possible for MDGA samples, probes that are predicted to perform poorly with regard to hybridization, or cannot be annotated, should be excluded from genotyping.

The SNP genotyping output is important for CNV calling when using PennCNV software because PennCNV requires information about the frequency of the B allele (BAF) and the Log R ratio (LRR)11,49. The LRR represents the normalized, total fluorescent intensity signals from each SNP probe set and is related to the amount of sample DNA bound to the probes. BAF values represent the normalized fluorescent signal ratio between the B and A allele probes at each SNP locus. The amount of deviation of BAF values from the expected BAF values for AA (0.0), AB (0.5) and BB (1.0) clusters, is called the BAF drift. Commonly, a filtering step is applied to the LRR standard deviation (LRR_SD) and BAF drift values to determine if a sample is suitable to use for CNV calling. Low LRR_SD and BAF drift values are desirable and PennCNV uses cutoffs of 0.30 for LRR_SD and 0.01 for BAF drift50. Exclusion criteria based on the waviness factor (WF <0.05) are used as well50,51. The waviness factor describes the dispersion in the signal intensity; low amounts of dispersion are desirable and are an indicator of DNA quality. Lower or higher LRR_SD, BAF drift, and WF cutoffs can be used depending on the type of array used and user preference for cutoffs. Affymetrix arrays, for example, tend to produce more noise than Illumina arrays so less stringent cutoffs should be used for Affymetrix array data52.
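The sample-level filter described above can be sketched as follows. The cutoffs are the values quoted in the text; the dictionary layout, the function names, and the decision to compare the absolute value of WF against the cutoff are assumptions made for illustration and do not reproduce PennCNV's own filtering scripts.

```python
# Cutoffs quoted in the text above
LRR_SD_MAX = 0.30
BAF_DRIFT_MAX = 0.01
WF_MAX = 0.05


def qc_failures(sample, lrr_sd_max=LRR_SD_MAX,
                baf_drift_max=BAF_DRIFT_MAX, wf_max=WF_MAX):
    """Return the names of the QC metrics that a sample fails (empty if none)."""
    failures = []
    if sample["lrr_sd"] > lrr_sd_max:
        failures.append("LRR_SD")
    if sample["baf_drift"] > baf_drift_max:
        failures.append("BAF_drift")
    # Assumption: WF may be negative, so its magnitude is compared to the cutoff
    if abs(sample["wf"]) >= wf_max:
        failures.append("WF")
    return failures


def split_by_qc(samples):
    """Split samples into names that pass QC and a dict of failures per sample."""
    passed, failed = [], {}
    for sample in samples:
        failures = qc_failures(sample)
        if failures:
            failed[sample["name"]] = failures
        else:
            passed.append(sample["name"])
    return passed, failed


# Toy values for illustration only
toy_samples = [
    {"name": "s1", "lrr_sd": 0.21, "baf_drift": 0.004, "wf": 0.01},
    {"name": "s2", "lrr_sd": 0.42, "baf_drift": 0.004, "wf": 0.01},
    {"name": "s3", "lrr_sd": 0.25, "baf_drift": 0.020, "wf": -0.02},
    {"name": "s4", "lrr_sd": 0.25, "baf_drift": 0.002, "wf": -0.08},
]
print(split_by_qc(toy_samples))
```

Passing the cutoffs as keyword arguments makes it straightforward to relax them for noisier array platforms.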

2.5.3 Assessing MDGA data quality by examining the nature of CNV calls

There are several indicators that are helpful in predicting the reliability of CNV calls. In the case of CNV losses, state-zero losses should not overlap SNP probes with genotype calls other than No Call genotypes, and state-one losses should not overlap SNP probes with heterozygous genotype calls. The reason for this is that if there is only one copy of a DNA segment that contains SNP loci, only one allele should be present, which would result in the generation of a homozygous genotype call. If there are no copies of the DNA segment, SNP probes would not bind target DNA, so it would not be possible to generate a genotype call.
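The consistency rule for losses can be expressed as a short check. The sketch below assumes genotypes are encoded as "AA", "AB", "BB" and "NoCall", and it treats No Call genotypes within a state-one loss as acceptable; the encoding, that allowance, and the function name are assumptions made for the example.

```python
def conflicting_genotypes(copy_number, genotypes):
    """List SNP calls that conflict with a CNV loss of the given state.

    copy_number: 0 (state-zero loss) or 1 (state-one loss)
    genotypes: dict of SNP ID -> genotype for the SNP probes within the CNV
    """
    if copy_number == 0:
        allowed = {"NoCall"}  # no template DNA, so no call is expected
    elif copy_number == 1:
        allowed = {"AA", "BB", "NoCall"}  # one copy cannot be heterozygous
    else:
        raise ValueError("this check applies to copy number losses only")
    return [(snp, call) for snp, call in genotypes.items() if call not in allowed]


# Toy genotypes for SNP probes underlying one putative loss
toy_calls = {"snp1": "AA", "snp2": "AB", "snp3": "NoCall"}
print(conflicting_genotypes(1, toy_calls))  # [('snp2', 'AB')]
print(conflicting_genotypes(0, toy_calls))  # [('snp1', 'AA'), ('snp2', 'AB')]
```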

CNV losses are less likely to occur and be maintained in genomic regions containing functional elements, in particular for genes vital for cell survival53,54. Although it is possible for such losses to be present in subpopulations of cells in cases of mosaicism, these losses are unlikely to have been inherited or to have arisen early in development. There are over 700 genes and noncoding elements, overlapping MDGA probes, that are known to cause deleterious phenotypes when copies are lost, or are highly conserved and unlikely to vary in copy number (Appendix 2G). Most of the genes listed in Appendix 2G come from the International Mouse Phenotyping Consortium (IMPC)55. The IMPC’s purpose is to generate and characterize 20,000 knockout mouse strains, and it is a useful source for determining the phenotypic impact of gene deletions in mice. The remaining genes in Appendix 2G come from independent studies and were compiled as a resource in Locke et al33. These genes are expected to overlap very few putative CNVs detected in normal healthy tissues, with the exception of some mosaicism events, and can therefore be used to assess the MDGA data quality.

When generating CNV calls for a dataset with mice of different backgrounds, it is expected that the number of CNVs will differ between mice and there will be a mix of copy number losses and gains called13,33,56. Excessive CNV calls in one sample may indicate poor array quality or an unusually high mutation load in that individual. Similarly, detecting only gains or only losses in a dataset may be an indicator of either technical issues or true mutational events (e.g. from exposure to a mutagen or from disease models). Based on Locke et al.’s33 study, which uses the MDGA, Mus musculus subspecies have an average of 29 CNVs per mouse, although the number of CNVs can differ greatly depending on the mouse genetic background and health. To determine what is a normal CNV profile for a single healthy mouse of a specific background, comparisons should be made to CNVs from samples from multiple individuals of this background. For these comparisons, the CNV detection approach used should be the same.

The size of the CNVs detected is dependent on the probe density and spacing across the genome. CNVs that are most likely to be biological events are expected to be those that are detected by many, closely-spaced probes since the resolution would be higher. A consequence of low resolution is that a large detected CNV may actually be multiple CNVs that have been called as one event due to low probe density or uneven probe spacing that includes large gaps13. CNV calls can be excluded based on the number of probes used to make that call as well as the probe density. The probe density is calculated by dividing the CNV length by the number of probes underlying a call, so it does not take into account probe spacing within the CNV region. Therefore, large CNVs should be assessed for probe spacing to ensure that the call does not include probe “deserts”, which can affect the accuracy of the call.
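The limitation of the density measure can be shown with a short sketch that computes both the probe density, as defined above, and the largest gap between consecutive probes within a call. Probe positions are assumed to be base-pair coordinates on one chromosome and the CNV length is taken as end minus start plus one; these conventions and the function names are assumptions made for the example.

```python
def probes_in_cnv(cnv_start, cnv_end, probe_positions):
    """Return the sorted probe positions that fall within the CNV boundaries."""
    return sorted(p for p in probe_positions if cnv_start <= p <= cnv_end)


def probe_density(cnv_start, cnv_end, probe_positions):
    """CNV length divided by the number of probes underlying the call."""
    inside = probes_in_cnv(cnv_start, cnv_end, probe_positions)
    if not inside:
        raise ValueError("no probes underlie this CNV")
    return (cnv_end - cnv_start + 1) / len(inside)


def largest_gap(cnv_start, cnv_end, probe_positions):
    """Largest distance between consecutive probes inside the CNV."""
    inside = probes_in_cnv(cnv_start, cnv_end, probe_positions)
    gaps = [b - a for a, b in zip(inside, inside[1:])]
    return max(gaps) if gaps else 0


# Two toy calls with identical density but very different probe spacing
clustered = [1001, 1200, 1400, 1600, 11000]
even = [1001, 3500, 6000, 8500, 11000]
for positions in (clustered, even):
    print(probe_density(1001, 11000, positions),
          largest_gap(1001, 11000, positions))
```

Both toy calls have a density of 2,000 bp per probe, but the first contains a 9,400 bp stretch with no probes, which the density value alone does not reveal.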

Confidence in the putative CNV calls is greater if the CNVs have been detected before in multiple, independent studies. Recurrent CNVs can be found in mutation hotspots or they may be mouse strain- or species-specific as a result of an inherited mutational event in a common ancestor. One such example is the presence of duplications of the insulin-degrading enzyme (Ide) gene in some C57BL/6J mice from The Jackson Laboratory colony, which is thought to have arisen sometime after 1994 and spread throughout the colony via breeding57. The Ide duplication may have an impact on disease models but does not appear to be under negative selection in the general C57BL/6J population57. The support for recurrent, strain- or species-specific CNVs being true events is increased when the same CNVs have been observed in the same mouse strain or species in previous studies.

2.5.4 Probe annotation: Providing genomic context

Updating the genomic positions of SNP probe and IGP sequences to the latest mouse genome build is required when using updated annotation information from databases like Ensembl and UCSC genome browser58–60. Probe annotation updates are also important for CNV calling accuracy, which is heavily dependent on knowing which probes target consecutively located DNA regions. Updated genomic positions for the MDGA can be downloaded from the Thermo Fisher Scientific website9. It is not uncommon, however, for there to be compatibility issues between the genome version used to specify probe locations and the genome version for genomic information downloaded from various databases.

2.5.5 Assessing MDGA data quality with pairwise genetic distance comparisons

The genetic distance between samples can be used as an indicator of microarray data quality. Closely related individuals are expected to have a smaller genetic distance between them than unrelated or distantly related individuals. Likewise, samples taken from the same individual would have a smaller genetic distance value than samples coming from different individuals. If the pairwise genetic distances calculated for a sample set do not reproduce the known relative relationships between those samples, then there may be a problem with the array data, or there may not have been sufficient genetic variation at the loci that were used for the calculations.

Genetic distance can be calculated using SNP or CNV genotypes. The pairwise genetic distance values generated for a sample set can then be used to create distance matrices. The following calculation is used to produce a pairwise SNP genetic distance value for two samples:

\[ \text{Genetic distance} = \frac{\text{Total SNP genotype differences between two samples}}{\text{Total number of SNP loci compared}} \]

There are multiple definitions for what constitutes a SNP difference. One definition includes counting any genotype (AA, AB, BB, No Call) difference between the two samples, at a given locus, as a difference. Alternatively, No Calls can be excluded from this calculation since a genotype could not be determined. Using this formula, the pairwise genetic distance value will range from 0 to 1. If two samples have no SNP genotype differences, the value will be 0. The more similar two samples are genetically, the closer the value will be to 0.
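Both definitions of a SNP difference can be sketched in a few lines. The example assumes each sample is a list of genotype strings in the same locus order, with No Call encoded as "NoCall"; the encoding, the function name, and the `exclude_no_calls` switch are assumptions made for illustration.

```python
def snp_genetic_distance(sample_a, sample_b, exclude_no_calls=False,
                         no_call="NoCall"):
    """Pairwise SNP genetic distance: genotype differences / loci compared.

    With exclude_no_calls=False, any genotype difference (including a
    No Call versus a called genotype) counts as a difference. With
    exclude_no_calls=True, loci with a No Call in either sample are skipped.
    """
    if len(sample_a) != len(sample_b):
        raise ValueError("samples must be genotyped at the same loci")
    differences = 0
    compared = 0
    for genotype_a, genotype_b in zip(sample_a, sample_b):
        if exclude_no_calls and no_call in (genotype_a, genotype_b):
            continue
        compared += 1
        if genotype_a != genotype_b:
            differences += 1
    if compared == 0:
        raise ValueError("no loci available for comparison")
    return differences / compared


# Toy genotypes at five loci
mouse_1 = ["AA", "AB", "BB", "NoCall", "AA"]
mouse_2 = ["AA", "BB", "BB", "AA", "NoCall"]
print(snp_genetic_distance(mouse_1, mouse_2))                         # 0.6
print(snp_genetic_distance(mouse_1, mouse_2, exclude_no_calls=True))  # 1 of 3 loci differ
```

In the toy data, the choice of definition changes the distance from 3/5 to 1/3, so the same definition should be applied to every pair in a distance matrix.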

A matrix containing the pairwise SNP genetic distance values for a sample set can be used to construct a phenogram. A phenogram depicts the degree of similarity between individuals, based on selected characteristics, without including measures of evolutionary time or defining common ancestors.

CNVs can also be used to calculate genetic distance. For CNV calls, the probes underlying a CNV are assigned the copy number state of that CNV as a genotype. Then a formula similar to the one for SNP genetic distance can be used to calculate CNV genetic distance between two samples:

\[ \text{Genetic distance} = \frac{\text{Total CNV genotype differences between two samples}}{\text{Total number of loci compared}} \]
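A minimal sketch of this calculation is given below. It assumes probes lie on a single autosome, that probes outside any CNV call are assigned a copy number state of 2, and that CNV calls are supplied as (start, end, copy number) tuples; these conventions and the function names are assumptions made for the example.

```python
def probe_states(probe_positions, cnvs, normal_state=2):
    """Assign each probe the copy number state of the CNV that contains it.

    cnvs: list of (start, end, copy_number) tuples for one sample.
    Probes outside every CNV receive normal_state (an assumed default of 2).
    """
    states = []
    for position in probe_positions:
        state = normal_state
        for start, end, copy_number in cnvs:
            if start <= position <= end:
                state = copy_number
                break
        states.append(state)
    return states


def cnv_genetic_distance(probe_positions, cnvs_a, cnvs_b, normal_state=2):
    """CNV genotype differences between two samples / number of loci compared."""
    if not probe_positions:
        raise ValueError("at least one probe locus is required")
    states_a = probe_states(probe_positions, cnvs_a, normal_state)
    states_b = probe_states(probe_positions, cnvs_b, normal_state)
    differences = sum(1 for a, b in zip(states_a, states_b) if a != b)
    return differences / len(probe_positions)


# Toy data: five probe loci, one loss in sample A and one gain in sample B
toy_probes = [100, 200, 300, 400, 500]
sample_a_cnvs = [(150, 350, 1)]
sample_b_cnvs = [(250, 450, 3)]
print(cnv_genetic_distance(toy_probes, sample_a_cnvs, sample_b_cnvs))  # 0.6
```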

SNP- and CNV-based genetic distances will not necessarily show the same relationship between the same samples and SNP-based distance is more likely to reiterate known genealogy13. Cutler et al.13 attribute these differences to the nature of CNV inheritance from parents as well as the smaller numbers of CNVs, compared to SNPs, which leads to greater intergenerational fluctuations in CNV content. However, CNV-based genetic distance can be informative about relatedness and ancestry when population-specific copy number variable regions61 or fixed deletions and duplications in different species’ lineages62 are used.

2.5.6 Considerations for Mus musculus sample set size and sex of mice

When designing an experiment, it is important to select the mouse species, number of samples and tissue type, with considerations of MDGA detection limits in mind. The MDGA is best suited for detecting genetic variation in Mus musculus subspecies. When using a BRLMM-P algorithm for SNP genotyping of MDGA data, it is recommended that at least 60 samples are genotyped together for the data to cluster properly63. However, other sources using a different microarray found that high performance BRLMM-P clustering can be achieved with 44 samples or fewer64. In genome-wide association studies (GWAS) using SNP microarrays, thousands of disease and control samples may be required if rare variants are involved65. Increasing sample size generally has a greater effect than increasing microarray genome coverage with regards to identifying disease-associated SNP variants66.

In addition to sample size, the sex of the mice for the samples in the dataset should be given consideration. Ideally, the user will know the sex of the samples and provide this information to PennCNV. Where the sex of the sample is unknown, PennCNV will use B allele frequency (BAF) values from the X chromosome SNPs to generate CNV calls for the X chromosome. Information regarding the sex of the samples is not required when using PennCNV for the autosomes, which are run separately from the X chromosome67. From a SNP genotyping perspective, greatly skewed sex ratios in a dataset might negatively affect clustering and the BAF values for the X chromosome. Datasets using only females can have three genotype clusters (AA, AB, and BB genotypes) while male samples are expected to lack a heterozygous genotype cluster.

2.5.7 Considerations for tissue type

The MDGA, like other microarrays, is more likely to detect CNVs that are in the majority of cells of the tissue sample. When using a microarray, a consensus result will be generated for all the genomes present within a sample. Rare losses or gains may not influence the fluorescence signal sufficiently to be detected. Genome mosaicism has been detected at levels as low as <5% using a SNP-based microarray, although detection limits are generally higher68. Another study found that mosaicism is readily detectable at 20%, although the threshold can be lowered to 10% following additional statistical calculations69. It would be expected that a detection threshold will differ based on the specific microarray and computational detection method employed.

The implication of microarray sensitivity limitations is that the MDGA may not be suitable for identifying rare mosaicism events in diseased or normal tissue. Tumour samples, for example, tend to have high amounts of genetic heterogeneity due to genomic instability70. While low level tumour mosaicism may not be detected by microarrays, SNP microarrays have been of use in cancer research71–73 since many mutations are detectable. In normal tissues, genetic variation can arise spontaneously during development or later in life and is a commonly occurring phenomenon74. The clonal size of a genetic variant in an individual will vary based on factors including but not limited to when the variant arose and the phenotypic impact of the variant75–78. To identify cell lineage-specific variants, variant detection should be conducted on isolated cell types or subpopulations within a tissue if it is comprised of heterogeneous cell populations78–81.

2.6 Conclusion

When compared to the Genome-Wide Human SNP Array 6.0, the MDGA has more probes that are predicted to perform poorly. However, these probes can be identified and excluded from analyses to improve the overall performance of the MDGA, which was assessed using SNP genotype call rate in this study. The generation of good quality array data is also dependent on having study designs that incorporate appropriate sample sets and variant detection methods. It is important to design studies that are suitable for MDGA’s sensitivity and resolution limits. For example, the MDGA cannot be used to detect very rare variants or to find the exact locations of CNV junctions.

Well-designed studies do not guarantee the production of good quality data since sample preparation or hybridization errors can occur. Several computational methods are available for assessing the quality of MDGA data. These include visualizing the array to look for hybridization abnormalities and assessing the SNP genotyping and CNV calling results using metrics like SNP call rates, the presence of heterozygous SNP calls underlying CNV losses, the presence of CNVs in genes unlikely to vary in copy number, and BAF drift, LRR SD, and WF values. Ultimately, even if computational assessment indicates good quality data, experimental confirmation is necessary to ensure that true genomic alterations were identified. Preferably, non-hybridization-based methods would be used for confirmation to avoid similar errors that occur with hybridization-based methods like microarrays.

2.7 References

1. Yang, H. et al. A customized and versatile high-density genotyping array for the mouse. Nat. Methods 6, 663–666 (2009).

2. Nature Education. SNP. Scitable (2014). Available at: https://www.nature.com/scitable/definition/single-nucleotide-polymorphism-snp-295.

3. The Jackson Laboratory. Jax Notes - The Mouse Diversity Genotyping Array: An advanced, high-density mouse genotyping microarray. Available at: https://www.jax.org/news-and-insights/2008/january/the-mouse-diversity-genotyping-array-an-advanced-high-density-mouse-genotyp.

4. Wishart, A. E. Somatic copy number mosaicism contributes to genomic diversity in Mus musculus. (The University of Western Ontario, 2014).

5. Thermo Fisher Scientific. Affymetrix Power Tools. Available at: https://www.thermofisher.com/us/en/home/life-science/microarray-analysis/microarray-analysis-partners-programs/affymetrix-developers-network/affymetrix-power-tools.html.

6. Affymetrix Power Tools MANUAL: apt-probeset-genotype (1.20.0). Available at: http://www.affymetrix.com/support/developer/powertools/changelog/apt-probeset-genotype.html.

7. Affymetrix. Genotyping console 4.0 user manual. (2009).

8. Bolstad, B., Ghosh, S. & Turpaz, Y. SNP array-based analysis for detection of chromosomal aberrations and copy number variations. in Methods in Microarray Normalization (ed. Stafford, P.) 9, 245–264 (Boca Raton: CRC Press, 2008).

9. Thermo Fisher Scientific. Mouse Diversity Genotyping Array. Available at: https://www.thermofisher.com/order/catalog/product/901615.

10. Affymetrix. Vignettes: Mouse Diversity Genotyping Array clustering analysis. (2009). Available at: https://media.affymetrix.com/support/developer/powertools/changelog/VIGNETTE-Mouse-WGSA-genotyping.html.

11. Wang, K., Li, M., Hadley, D. & Liu, R. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res. 17, 1665–1674 (2007).

12. Anderson, C. A. et al. Data quality control in genetic case-control association studies. Nat. Protoc. 5, 1564–1573 (2010).

13. Cutler, G. & Kassner, P. D. Copy number variation in the mouse genome: implications for the mouse as a model organism for human disease. Cytogenet. Genome Res. 123, 297–306 (2008).

14. PennCNV. PennCNV-Affy User Guide. Available at: http://penncnv.openbioinformatics.org/en/latest/user-guide/affy/.

15. Fadista, J. & Bendixen, C. Genomic position mapping discrepancies of commercial SNP chips. PLoS One 7, e31025 (2012).

16. Eitutis, S. T. Array-based genomic diversity measures portray Mus musculus phylogenetic and genealogical relationships, and detect genetic variation among C57Bl/6J mice and between tissues of the same mouse. (The University of Western Ontario, 2013).

17. McCarroll, S. A. et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat. Genet. 40, 1166–1174 (2008).

18. Yang, H., Ding, Y., Hutchins, L. & Szatkiewicz, J. A customized and versatile high-density genotyping array for the mouse. Nat. Methods 6, 663–666 (2009).

19. Thermo Fisher Scientific. Genome-Wide Human SNP Array 6.0. Available at: https://www.thermofisher.com/order/catalog/product/901182.


20. Gibbs, R. A. et al. The international HapMap project. Nature 426, 789–796 (2003).

21. Affymetrix. Data Sheet: Genome-Wide Human SNP Array 6.0. 1–4 (2012). Available at: https://tools.thermofisher.com/content/sfs/brochures/genomewide_snp6_datasheet.pdf.

22. Hong, H., Xu, L. & Tong, W. Assessing consistency between versions of genotype-calling algorithm Birdseed for the Genome-Wide Human SNP Array 6.0 using HapMap samples. in Advances in Computational Biology (ed. Arabnia, H. R.) 355–360 (Springer New York, 2010). doi:10.1007/978-1-4419-5913-3_40

23. Matukumalli, L. K. et al. Development and characterization of a high density SNP genotyping assay for cattle. PLoS One 4, e5350 (2009).

24. Illumina. CanineHD BeadChip. (2010). Available at: https://www.illumina.com/documents/products/datasheets/datasheet_caninehd.pdf. (Accessed: 1st January 2017)

25. Hoffman, J. I., Thorne, M. A. S., McEwing, R., Forcada, J. & Ogden, R. Cross-amplification and validation of SNPs conserved over 44 million years between seals and dogs. PLoS One 8, e68365 (2013).

26. Ptitsyn, A., Schlater, A. & Kanatous, S. Transformation of metabolism with age and lifestyle in Antarctic seals: a case study of systems biology approach to cross-species microarray experiment. BMC Syst. Biol. 4, 133 (2010).

27. Nowrousian, M., Ringelberg, C., Dunlap, J. C., Loros, J. J. & Kück, U. Cross-species microarray hybridization to identify developmentally regulated genes in the filamentous fungus Sordaria macrospora. Mol. Genet. Genomics 273, 137–149 (2005).

28. Moore, S., Payton, P., Wright, M., Tanksley, S. & Giovannoni, J. Utilization of tomato microarrays for comparative gene expression analysis in the Solanaceae. J. Exp. Bot. 56, 2885–2895 (2005).

29. Gilad, Y., Rifkin, S. A., Bertone, P., Gerstein, M. & White, K. P. Multi-species microarrays reveal the effect of sequence divergence on gene expression profiles. Genome Res. 15, 674–680 (2005).

30. Renn, S. C. P., Aubin-Horth, N. & Hofmann, H. A. Biologically meaningful expression profiling across species using heterologous hybridization to a cDNA microarray. BMC Genomics 5, 42 (2004).

31. Didion, J. P. et al. Discovery of novel variants in genotyping arrays improves genotype retention and reduces ascertainment bias. BMC Genomics 13, 34 (2012).

32. The Center for Genome Dynamics at The Jackson Laboratory. Mouse Diversity Genotyping Array - Annotation files. Available at: http://cgd.jax.org/datasets/diversityarray/annotation.shtml.

33. Locke, M. E. O. et al. Genomic copy number variation in Mus musculus. BMC Genomics 16, 497 (2015).

34. The Center for Genome Dynamics at The Jackson Laboratory. Mouse Diversity Genotyping Array - CEL files. Available at: http://cgd.jax.org/datasets/diversityarray/CELfiles.shtml.

35. Thermo Fisher Scientific. Thermo Fisher Scientific - Documents: Support Files. (2017). Available at: www.thermofisher.com/order/catalog/product/901182.

36. HapMap - NCBI FTP Site - NIH. Available at: ftp://ftp.ncbi.nlm.nih.gov/hapmap/raw_data/hapmap3_affy6.0/.

37. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing. (2016).

38. Center for Genome Dynamics - Mouse Diversity Array CEL files. Available at: http://cgd.jax.org/datasets/diversityarray/CELfiles.shtml.


39. Lewis, K. N. et al. Unraveling the message: insights into comparative genomics of the naked mole-rat. Mamm. Genome 27, 259–278 (2016).

40. Chevret, P., Veyrunes, F. & Britton-Davidian, J. Molecular phylogeny of the genus Mus (Rodentia: Murinae) based on mitochondrial and nuclear data. Biol. J. Linn. Soc. 84, 417–427 (2005).

41. Bucasas, K. L. et al. Assessing the utility of whole-genome amplified serum DNA for array-based high throughput genotyping. BMC Genet. 10, 85 (2009).

42. Lundrigan, B. L., Jansa, S. A. & Tucker, P. K. Phylogenetic relationships in the genus Mus, based on paternally, maternally, and biparentally inherited characters. Syst. Biol. 51, 410–431 (2002).

43. Kim, E. B. et al. Genome sequencing reveals insights into physiology and longevity of the naked mole rat. Nature 479, 223–227 (2011).

44. Carter, N. P. Methods and strategies for analyzing copy number variation using DNA microarrays. Nat. Genet. 39, S16–S21 (2007).

45. Marioni, J. C. et al. Breaking the waves: Improved detection of copy number variation from microarray-based comparative genomic hybridization. Genome Biol. 8, (2007).

46. Xu, L., Hou, Y., Bickhart, D., Song, J. & Liu, G. Comparative analysis of CNV calling algorithms: Literature survey and a case study using bovine high-density SNP data. Microarrays 2, 171–185 (2013).

47. Jaksik, R., Iwanaszko, M., Rzeszowska-Wolny, J. & Kimmel, M. Microarray experiments and factors which affect their reliability. Biol. Direct. 10, 46 (2015).

48. Koltai, H. & Weingarten-Baror, C. Specificity of DNA microarray hybridization: Characterization, effectors and approaches for data correction. Nucleic Acids Res. 36, 2395–2405 (2008).


49. PennAffy [http://www.openbioinformatics.org/penncnv/penncnv_download.html].

50. Lin, C. F., Naj, A. C. & Wang, L. S. Analyzing copy number variation using SNP array data: Protocols for calling CNV and association tests. Curr. Protoc. Hum. Genet. 79, Unit-1.27 (2013).

51. Diskin, S. J. et al. Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms. Nucleic Acids Res. 36, e126 (2008).

52. PennCNV. PennCNV-Affy: Additional Topics. (2017). Available at: http://penncnv.openbioinformatics.org/en/latest/user-guide/affy/.

53. Hart, T. et al. High-resolution CRISPR screens reveal fitness genes and genotype-specific cancer liabilities. Cell 163, 1515–1526 (2015).

54. Mihaly, S. R., Ninomiya-Tsuji, J. & Morioka, S. TAK1 control of cell death. Cell Death Differ. 21, 1667–1676 (2014).

55. Dickinson, M. E. et al. High-throughput discovery of novel developmental phenotypes. Nature 537, 508–514 (2016).

56. Pezer, Ž., Harr, B., Teschke, M., Babiker, H. & Tautz, D. Divergence patterns of genic copy number variation in natural populations of the house mouse (Mus musculus domesticus) reveal three conserved genes with major population-specific expansions. Genome Res. 25, 1114–1124 (2015).

57. Watkins-Chow, D. & Pavan, W. Genomic copy number and expression variation within the C57BL/6J inbred mouse strain. Genome Res. 18, 60–66 (2008).

58. Aken, B. L. et al. Ensembl 2017. Nucleic Acids Res. 45, D635–D642 (2017).

59. Karolchik, D. et al. The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 32, D493–D496 (2004).

60. Tyner, C. et al. The UCSC Genome Browser database: 2017 update. Nucleic Acids Res. 45, D626–D634 (2017).


61. Lou, H. et al. A map of copy number variations in Chinese populations. PLoS One 6, e27341 (2011).

62. Sudmant, P. H. et al. Evolution and diversity of copy number variation in the great ape lineage. Genome Res. 23, 1373–1382 (2013).

63. Affymetrix. Frequently asked questions: Affymetrix Mouse Diversity Genotyping Array. 1–5 (2009). Available at: https://www.affymetrix.com/support/help/faqs/pdf/mouse_diversity/mouse_diversity_array.pdf. (Accessed: 1st January 2018)

64. Thermo Fisher Scientific. BRLMM-P : a genotype calling method for the SNP 5.0 array. 1–16 (2007).

65. Hong, E. P. & Park, J. W. Sample size and statistical power calculation in genetic association studies. Genomics Inform. 10, 117 (2012).

66. Lindquist, K. J., Jorgenson, E., Hoffmann, T. J. & Witte, J. S. The impact of improved microarray coverage and larger sample sizes on future genome-wide association studies. Genet. Epidemiol. 37, 383–392 (2013).

67. PennCNV. FAQ - PennCNV. (2017). Available at: http://penncnv.openbioinformatics.org/en/latest/misc/faq/.

68. Conlin, L. K. et al. Mechanisms of mosaicism, chimerism and uniparental disomy identified by single nucleotide polymorphism array analysis. Hum. Mol. Genet. 19, 1263–1275 (2010).

69. Cross, J., Peters, G., Wu, Z., Brohede, J. & Hannan, G. N. Resolution of trisomic mosaicism in prenatal diagnosis: Estimated performance of a 50K SNP microarray. Prenat. Diagn. 27, 1197–1204 (2007).

70. Burrell, R. A., McGranahan, N., Bartek, J. & Swanton, C. The causes and consequences of genetic heterogeneity in cancer evolution. Nature 501, 338–345 (2013).


71. Mao, X., Young, B. D. & Lu, Y.-J. The application of single nucleotide polymorphism microarrays in cancer research. Curr. Genomics 8, 219–228 (2007).

72. Van Loo, P. et al. Allele-specific copy number analysis of tumors. Proc. Natl. Acad. Sci. 107, 16910–16915 (2010).

73. Lindgren, D., Höglund, M. & Vallon-Christersson, J. Genotyping techniques to address diversity in tumors. Adv. Cancer Res. 112, 151–182 (2011).

74. Vattathil, S. & Scheet, P. Extensive hidden genomic mosaicism revealed in normal tissue. Am. J. Hum. Genet. 98, 571–578 (2016).

75. Otto, S. P. & Orive, M. E. Evolutionary consequences of mutation and selection within an individual. Genetics 141, 1173–1187 (1995).

76. Frank, S. A. Somatic evolutionary genomics: Mutations during development cause highly variable genetic mosaicism with risk of cancer and neurodegeneration. Proc. Natl. Acad. Sci. 107, 1725–1730 (2010).

77. Samuels, M. E. & Friedman, J. M. Genetic mosaics and the germ line lineage. Genes 6, 216–237 (2015).

78. Bolton, H. et al. Mouse model of chromosome mosaicism reveals lineage-specific depletion of aneuploid cells and normal developmental potential. Nat. Commun. 7, 11165 (2016).

79. Colom, B. & Jones, P. H. Clonal analysis of stem cells in differentiation and disease. Curr. Opin. Cell Biol. 43, 14–21 (2016).

80. Der, E. et al. Single cell RNA sequencing to dissect the molecular heterogeneity in lupus nephritis. JCI Insight 2, e93009 (2017).

81. Suzuki, Y. et al. Multiregion ultra-deep sequencing reveals early intermixing and variable levels of intratumoral heterogeneity in colorectal cancer. Mol. Oncol. 11, 124–139 (2017).


Chapter 3

3 CNV Diversity in Inbred and Wild Mice Detected by the Mouse Diversity Genotyping Array

A version of this chapter’s broad survey of mouse genomes for CNVs has been published in Locke et al., BMC Genomics (2015)1.

3.1 Background

The house mouse, Mus musculus, has a long history linked with humans as a commensal species2,3 and as a valuable animal model of human biology and disease4–6, yet there are many aspects of its basic biology left to uncover. One such aspect, which is explored in this chapter, is the contribution of copy number variants (CNVs) to mouse genetic diversity and to phenotypic traits important for species adaptation, evolution, and use as a model organism. M. musculus is known to be a phenotypically and genetically diverse species, comprising three major subspecies in the wild3 and including numerous laboratory strains which were genetically manipulated and bred to express specific traits7. M. musculus subspecies are sufficiently diverse genetically that hybrid sterility and reduced fertility have been known to occur8–11. CNVs are one of the contributors to the genetic diversity in M. musculus12,13. CNVs can alter phenotypes by modifying expression levels of protein-coding genes and regulatory or noncoding elements14,15. Deleterious phenotypes will be selected against in a mouse population, while phenotypes that confer a fitness advantage are more likely to remain or expand within a population. An exception to this is the intentional introduction and maintenance of deleterious variants in mice by humans, to create a variety of genetically and phenotypically different mouse models of human biology and disease4,16,17. The genetic diversity of a mouse model can be increased by using wild-derived strains, which capture the genetic diversity of the three major mouse subspecies in natural populations, or by breeding wild-derived strains with classical laboratory strains (usually domesticus) of interest18.

An important part of understanding why there are genetic differences among members of the M. musculus species is understanding its history. The musculus, domesticus, and castaneus subspecies of M. musculus diverged approximately 350–500 thousand years ago19,20. Ancestral M. musculus populations are thought to have lived in the north of the Indian subcontinent and dispersed to other parts of the world from there3. Today, this species can be found world-wide with different subspecies being predominant in certain regions. After M. musculus subspecies divergence, the subspecies are thought to have independently evolved commensalism following the rise of human agrarian societies7. As a human commensal species, M. musculus movement often followed human movement and this allowed for the establishment of populations where none would be expected due to large natural barriers such as oceans. During the Viking Age, for example, mouse stowaways travelled with European colonizers, and this resulted in the establishment of M. m. domesticus populations in Iceland and Greenland (later replaced by Danish M. m. musculus)2.

M. m. musculus is commonly found in a region ranging from central Europe to northern China and in Greenland, while M. m. castaneus is common in southeast Asia3,21. M. m. domesticus has the greatest range and can be found in western Europe, the Mediterranean basin, Africa, North and South America, and Australia21. Another mouse subspecies was thought to be present in Japan, M. m. molossinus, until DNA studies showed that Japanese mice have mostly M. m. musculus DNA along with some M. m. domesticus and M. m. castaneus DNA22,23. Mice in New Zealand are hybrids of all three mouse subspecies representing mouse populations from different regions of Europe and Asia24. Subspecies hybrids can be found in hybrid zones where the subspecies ranges meet and overlap. Isolated populations of house mouse subspecies can be found where there are natural barriers between the subspecies. One such example is the Himalayan mountains which separate M. m. musculus and M. m. castaneus populations21.

In addition to natural barriers isolating mouse subspecies, there are also biological barriers that can interfere with hybridization of subspecies. M. m. musculus and M. m. domesticus, for example, were found to have divergent urinary odors occurring in allopatric European populations and the divergence was more pronounced in regions where the two subspecies come in contact with each other25. It is suspected that the urinary odors play a role in mating and that the presence of different proteins in the mouse urine is responsible for the divergent odors. This may suggest that genetic differences between the mouse subspecies are contributing to differences in odor production and detection, and preferences for certain odors, leading to interference with hybridization.

Although mating can occur between subspecies, genetic incompatibilities can affect the offspring fitness by causing hybrid sterility8. Hybrid mating experiments between M. m. musculus and M. m. domesticus reveal that hybrid males are sterile if they inherit a M. m. musculus X chromosome while hybrid females are not affected26. Sex chromosome incompatibilities are not the only genetic contributors to infertility. The autosomal PR-domain containing 9 (Prdm9) gene was found to play a role in hybrid sterility with fertility levels being affected by which alleles are present and the number of copies of Prdm927. Prdm9 is highly polymorphic in M. musculus populations28. During meiosis, Prdm9 initiates meiotic recombination by promoting double strand break (DSB) formation near its DNA binding sites29. Sterile mouse hybrids have defects in meiotic prophase following the DSB formation and different Prdm9 alleles influence each other’s behaviour regarding DSBs, but the defects causing sterility can be reversed by introducing altered Prdm9 alleles into the mouse hybrids30. In all, it has been found that multiple genomic loci can independently impact hybrid fertility due to genetic differences between subspecies at those loci.

Within the same subspecies, genetic variation was found between different wild populations of house mouse, with CNVs being a large contributor to this variation12. When multiple mice have a CNV overlapping the same genomic region, that region is referred to as a CNV region (CNVR). CNVRs that overlap large segmental duplications have been found to vary more in copy number in mice than those that do not overlap segmental duplications, and they are enriched for nonessential, environmentally-responsive genes including those with functions related to olfaction, production of urinary proteins, as well as production of proteins with unknown functions12. CNVRs that do not overlap segmental duplications are thought to be under stronger selective constraints, since these CNVRs contain CNVs that are smaller, less frequent, and linked to Mendelian disease genes.


Among inbred mice, many classical laboratory strains share a predominantly M. m. domesticus background, but their genomes have been modified through generations of inbreeding and selection for specific traits. Many laboratory mice have been selectively bred for traits related to human health, with numerous strains being used as models of cancer, complex and single gene diseases, and for determining gene function (e.g. gene knockouts)31–33. Genetic differences among laboratory strains also contribute to differences in non-pathogenic traits like taste thresholds and macronutrient selection preference (related to taste)34,35.

CNVs are known to contribute to some genetic differences in laboratory mice. For example, an Ide (insulin-degrading enzyme) gene duplication can be found segregating in C57BL/6J mice36. Likewise, α-defensin gene diversification occurring through tandem duplication has been observed specifically in C57BL/6 mice and results in an increased variety of antimicrobial peptides24. CNV studies of laboratory mice have found CNVs that are enriched for genes with functions related to immunity, olfaction, and pheromone detection38,39.

Due to inbreeding over numerous generations and having a small founding population, classical laboratory mice do not capture the extent of genetic diversity occurring in wild populations of mice7. Thus, wild-derived mouse lines were established to introduce greater genetic variation and novel phenotypes40. The genetic variation among lines, particularly from a single nucleotide polymorphism perspective, has been shown to be greater in wild-derived strains than classical laboratory strains17,41. Similarly, the contribution of CNVs to genetic diversity is expected to be greater in wild-derived mice than classical laboratory mice.

In this study, the CNV landscape is characterized in hundreds of M. musculus individuals from a variety of genetic backgrounds. CNV differences within and between multiple mouse cohorts are identified using the Mouse Diversity Genotyping Array42, with an aim of gaining insight into potential phenotypic impact and relevance to adaptation and evolution in natural and laboratory environments. This chapter is divided into two studies: 1) a broad survey of mouse genetic diversity and 2) a comparison of genetic diversity between mouse cohorts. The broad survey tests a CNV detection pipeline on a large mouse dataset and assesses its variant detection capabilities by experimentally confirming putative CNVs. This survey will characterize the CNVs in a diverse group of M. musculus samples that includes multiple subspecies, a variety of inbred laboratory mice, and wild mice. Reported here is the CNV analysis of 351 mice, representing 290 strains that have not been studied for CNVs previously.

The mouse cohort comparison study uses a refined dataset that focuses on the comparison of three mouse cohorts: classical laboratory (CL) strains, wild-derived (WD) inbred strains, and wild-caught (WC) mice. These cohorts were selected to study how differences in genealogy, breeding schema, origins of genetic diversity, and housing shape the CNV landscape of a mouse genome. In comparison to the broad survey, the mouse cohort comparison study uses 10 additional WD samples and 14 additional WC samples. CL strain duplicates were excluded for a total of 114 CL strains selected for study, each with a different breeding and phenotype selection history but all predominantly M. m. domesticus. In this study, there is a greater emphasis on the differences between the CL, WD, and WC mouse cohorts with respect to CNV characteristics (e.g. number, length, state, hotspots, genic content), and the potential implications for adaptation and evolution.

Genetic variants that are recurrent in the CL cohort are predicted to be associated with the shared M. m. domesticus background, husbandry practices, and adaptation to living in a laboratory environment. Since the WC cohort includes multiple M. musculus subspecies from different geographical regions, WC diversity is predicted to be associated with the inclusion of different subspecies and the different geographic environments. Recurrent CNVs that are either subspecies-specific or geographically-specific may be observed for the WC cohort. The WD cohort also includes multiple subspecies, but unlike WC mice, WD mice are inbred laboratory strains. The CNVs in this cohort are likely shared with mice in the CL and WC cohorts, resulting from inheritance given common ancestry or from selection pressures given similar environments. The WC and WD cohorts are predicted to have the greatest genetic diversity due to greater diversity in environments and the inclusion of different subspecies. The WC and CL cohorts are expected to have the fewest CNVs in common given their histories in different environments and the different genealogies for the two cohorts.

3.1.1 Research goal, central hypothesis, and specific objectives

Research goal: The overall goal of this study is to broadly characterize the CNV landscape of Mus musculus using the MDGA, and to expand on this initial study by identifying CNV differences within and between mouse cohorts that can be further studied to gain insight into the CNV origins, phenotypic impact and relevance to adaptation and evolution in natural and laboratory environments.

Central hypothesis: First, given that CNVs are heritable, different mice of different strains and origins will have different CNV profiles reflecting similarities and differences in genealogy. Second, given evidence of the phenotypic impact of CNVs, the profile of CNVs will differ between mouse cohorts that differ in environment and phenotypic selection.

The specific objectives of the broad survey are:

1. To identify CNVs, using a filtered probe list containing both SNP probes and IGPs, in Mus musculus samples from different lines, subspecies and geographic locations.

2. To use genes that are unlikely to vary in copy number as a measure of false CNV discovery.

3. To characterize CNV differences and impacted gene pathways between CL and WC mouse cohorts.

4. To evaluate the reliability of the candidate CNV detection method via confirmation of select genic CNVs by ddPCR.

The specific objectives of the mouse cohort comparison study are:

1. To identify CNVs in CL, WD, and WC Mus musculus samples using the stringently filtered probe list recommended in Locke et al.1

2. To use genes that are unlikely to vary in copy number as a measure of false CNV discovery.

3. To characterize CNV differences and impacted gene pathways in and between CL, WD, and WC cohorts.


3.2 Materials and methods

3.2.1 Samples

Publicly available Mouse Diversity Genotyping Array CEL files were downloaded from the Center for Genome Dynamics at The Jackson Laboratory43. The CEL files for the broad mouse survey contain raw array intensity data for mouse tail samples from 120 CL strains, 58 WD strains, 10 consomic strains, one congenic strain, 44 BXD recombinant inbred strains, 40 CC-UNC G2:F1 strains, 55 F1 hybrids of inbred strains and 23 WC mice, for a total of 351 samples (Appendix 3A). Consomic strains are inbred mice that contain one entire chromosome from a different mouse strain44. Congenic strains are generated to contain a particular marker from another mouse strain44. BXD recombinant inbred strains contain approximately equal amounts of genetic material from a C57BL/6 and DBA/2 strain background44. CC-UNC G2:F1 mice are the first generation of collaborative cross mice generated at the University of North Carolina by breeding eight extant and genetically diverse laboratory strains together to create recombinant inbred lines45,46. For the mouse cohort comparison study, CEL files for 110 CL strains without strain duplicates, 37 WC mice, and 68 WD mice were selected for analysis (Appendix 3B).

3.2.2 CNV identification

The probe list provided in Additional file 2 of Locke et al.1 was used for SNP genotyping and CNV calling in the broad survey, while the more stringently filtered list, provided in Additional file 7 of Locke et al., was used for the mouse cohort comparison study. Genotype calls were generated using the BRLMM-P algorithm implemented in Affymetrix® Power Tools47 using default parameters as specified by Genotyping Console48, which includes quantile normalization. To pass SNP genotype call rate requirements, CL mouse samples in the mouse cohort study, but not the broad survey, were required to have an overall call rate greater than 97%. Low call rates are expected for WD and WC samples1,49, so call rate was not used as an exclusion criterion for these samples.

A canonical genotype clustering file was generated and used to calculate Log R Ratio (LRR) and B allele frequency (BAF) values using the PennAffy package50. PennCNV was used to generate PFB (population frequency of the B allele) reference files from the dataset for each study51. A GC model file, containing the percent GC content of the 1 Mb region surrounding each marker (or the genome-wide average of 42% GC content, if this could not be calculated), was generated using KentUtils52 and an in-house script based on the reference genome used for the broad survey (UCSC:mm9) and the mouse cohort study (UCSC:mm10).
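The per-marker GC calculation with its genome-wide fallback can be illustrated with a minimal Python sketch. This is not the in-house script used in this study; the function names are hypothetical, the window is assumed to be centred on the marker, and "could not be calculated" is assumed to mean that the window contains no unambiguous bases.

```python
from typing import Optional

GENOME_WIDE_GC = 42.0  # fallback percent GC stated in the text


def percent_gc(sequence: str) -> Optional[float]:
    """Percent GC over unambiguous bases; None if the window has none (e.g. all N)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        return None
    return 100.0 * gc / (gc + at)


def gc_for_marker(chrom_seq: str, pos: int, window: int = 1_000_000) -> float:
    """GC content of the window around a 1-based marker position.

    Assumption: the window is centred on the marker and clipped at the
    chromosome ends; the fallback value is used when GC is undefined.
    """
    half = window // 2
    start = max(0, pos - 1 - half)
    end = min(len(chrom_seq), pos - 1 + half)
    gc = percent_gc(chrom_seq[start:end])
    return GENOME_WIDE_GC if gc is None else gc
```

In practice the chromosome sequence would be read from the relevant reference build (mm9 or mm10) rather than passed as a short string.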

CNVs were detected with PennCNV using default parameters and GC model correction53. CNVs on the X chromosome were detected in a separate run of PennCNV using the -chrx option. Calls were filtered to be 500 bp to 1 Mb in length, to have at least three markers, a marker density of 0.00013 markers/bp, a log-R ratio standard deviation below 0.35, and a B-allele frequency drift below 0.01. CNV data for the broad survey are provided in Appendices 3C-E. CNV calls for the mouse cohort study are provided in Appendices 3F and 3G.
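The filtering criteria above can be expressed as a short Python sketch. It is illustrative only and is not the filtering code used in this study. Three assumptions are made that the text does not state: the marker density value is treated as a minimum, call coordinates are treated as 1-based and inclusive, and the log-R ratio standard deviation and B-allele frequency drift are treated as per-sample summary values. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CnvCall:
    chrom: str
    start: int      # 1-based, inclusive (assumption)
    end: int        # 1-based, inclusive (assumption)
    n_markers: int


def passes_call_filters(call: CnvCall,
                        min_len: int = 500,
                        max_len: int = 1_000_000,
                        min_markers: int = 3,
                        min_density: float = 0.00013) -> bool:
    """Apply the length, marker count, and marker density criteria to one call."""
    length = call.end - call.start + 1
    if not (min_len <= length <= max_len):
        return False
    if call.n_markers < min_markers:
        return False
    # Assumption: the stated density is a lower bound (markers per bp).
    return call.n_markers / length >= min_density


def passes_sample_filters(lrr_sd: float, baf_drift: float,
                          max_lrr_sd: float = 0.35,
                          max_baf_drift: float = 0.01) -> bool:
    """Apply the signal-quality criteria, assumed here to be per-sample values."""
    return lrr_sd < max_lrr_sd and baf_drift < max_baf_drift
```

A call would be retained only if both functions return True for the call and for the sample in which it was detected.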

For the mouse cohort study, CL samples failing to meet the inclusion criteria, based on their autosomal CNV calls, were excluded from all subsequent analyses (five samples failed: MA/MyJ, NONcNZO10/LtJ, C57BL/10ScSnJ, C57BL/10ScNJ, GR/J). The failing samples were included in the phenogram section for comparison to the SNP data since the SNP genotyping threshold was met. The MDGA cannot always capture the diversity of wild-type samples, so WD and WC samples are more likely than CL strains to fail the inclusion criteria. As such, failing WD and WC samples were not excluded from analysis. Mouse samples were also not excluded from subsequent analysis if their Chromosome X CNV calls failed quality controls, because X chromosome calling requires special treatment51 and the CNV calls for the X chromosome may therefore be less reliable than autosome calls. In addition, sex prediction by PennCNV is more suitable for Illumina arrays when using default parameters than for Affymetrix arrays54.

3.2.3 Figure construction and statistical analysis

Boxplots were generated in R (v3.2.4) using the Boxplot function. Figure 3-2 was generated using the ggplot2 (v3.1.0) package and the geom_point and geom_density functions. Wilcoxon and Kruskal-Wallis tests were performed with R using the stat_compare_means function from the ggpubr (v0.2) package.


3.2.4 CNV recurrence and CNV landscape plot visualization

Recurrent CNVs were identified using the default overlap percentages (40% for a merge, and 99% for a family) and reciprocal overlap in HD-CNV55. Recurrent CNVs must be found in at least two mice within a dataset, regardless of cohort, and they are not required to share the same copy number state. Genomic regions containing recurrent CNVs are referred to as CNV regions (CNVRs) in this chapter. Appendix 3C lists broad survey CNVs as unique (also known as singleton CNVs) or recurrent. For the mouse cohort study, the number of CNVRs detected by HD-CNV was represented with a Venn diagram generated in R with the venneuler package (v1.1.0). Each CNVR count includes multiple CNVs, and CNVs can be included in more than one CNVR.

3.2.5 Concordance with previous studies

Data for concordance analysis for the broad survey were downloaded from the Database of Genomic Variants56 or from supplementary tables depending on availability1. Overlap analysis at 20% reciprocal overlap was performed using the intersect function of Bedtools (version 2.17.0)57. The copy number state of the call was not considered; the presence of a call in a previous study was considered evidence that variability occurs in this region. Chromosome X was excluded from analysis since many studies did not have CNV data for this chromosome.
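The 20% reciprocal overlap criterion can be illustrated with a minimal Python sketch, assuming half-open (BED-style) coordinates. The function name is hypothetical, and the sketch does not reproduce the Bedtools run itself; the exact Bedtools options used are not given here, although the intersect function offers a minimum overlap fraction (-f) together with a reciprocal requirement (-r) for this purpose.

```python
def reciprocal_overlap(a_start: int, a_end: int,
                       b_start: int, b_end: int,
                       min_fraction: float = 0.20) -> bool:
    """True if the shared span covers at least min_fraction of BOTH intervals.

    Coordinates are half-open [start, end), as in BED files (assumption).
    """
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return False
    return (overlap >= min_fraction * (a_end - a_start)
            and overlap >= min_fraction * (b_end - b_start))
```

Requiring the fraction in both directions prevents a small call nested inside a very large call from counting as concordant on the basis of the small call alone.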

To identify CNVs in the mouse cohort study that were observed in the broad survey, the genome coordinates from the Locke et al.1 CNVs were converted to the newer genome build (GRCm38/mm10) using the UCSC Genome Browser LiftOver tool. Nine positions could not be updated due to sequence additions or deletions within the region. CNVs that overlapped by at least 1 bp with the Locke et al. CNVs were labeled as “previously observed”. Overlapping CNVs were identified using in-house Python scripts.
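The "previously observed" labelling can be sketched as follows. This is an illustration under stated assumptions and not the in-house script: calls are represented as hypothetical (chromosome, start, end) tuples with 1-based inclusive coordinates, both call sets are assumed to be on the same genome build already, and a simple pairwise comparison is used rather than an indexed interval search.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Call = Tuple[str, int, int]  # (chrom, start, end), 1-based inclusive (assumption)


def label_previously_observed(new_calls: Iterable[Call],
                              prior_calls: Iterable[Call]) -> List[bool]:
    """Return one flag per new call: True if it shares at least 1 bp with a prior call."""
    prior_by_chrom = defaultdict(list)
    for chrom, start, end in prior_calls:
        prior_by_chrom[chrom].append((start, end))

    labels = []
    for chrom, start, end in new_calls:
        # Two inclusive intervals share a base when each starts no later
        # than the other ends.
        seen = any(start <= p_end and p_start <= end
                   for p_start, p_end in prior_by_chrom.get(chrom, []))
        labels.append(seen)
    return labels
```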

3.2.6 Genes unlikely to harbour copy number losses

In-house scripts were used to determine if CNVs overlapped “control” genes unlikely to vary in copy number (Appendix 2G). The genes used in overlap analysis in the broad survey from the list in Appendix 2G can be identified by “Locke et al” and “Gatesy et al” in the source column. All genes in Appendix 2G were used for overlap analysis in the mouse cohort study. A CNV was considered to overlap a control gene if there was an overlap of at least 1 bp between the CNV and the control gene.

3.2.7 Confirmation of select CNVRs by droplet digital PCR (ddPCR)

For the broad survey, nine genic CNVRs found in C57BL/6J mice were selected for CNV confirmation by ddPCR in five C57BL/6J, five CBA/CaJ and four DBA/2J inbred mice (Appendix 3H). C57BL/6J and CBA/CaJ mouse samples came from Dr. Kathleen Hill’s laboratory (Appendix 3I) and DBA/2J mice were provided by Dr. Shiva Singh. For C57BL/6J mice, DNA was extracted from tail samples, with the exception of C57BL/6J mouse 2, where ear clip tissue was used. DNA was extracted from the cerebellum for DBA/2J mice and tail samples for CBA/CaJ mice. For each CNVR, one TaqMan® Copy Number Assay (Thermo Fisher Scientific, Waltham, Massachusetts, USA) was selected for a gene overlapping that CNVR. Overall, nine gene assays were conducted for the 14 mice with inclusion of two technical replicates per DNA sample. A TaqMan® Copy Number Reference Assay (Thermo Fisher Scientific, Waltham, Massachusetts, USA) for the transferrin receptor gene (Tfrc) was used as a reference with an expected copy number of two. Negative controls lacking DNA template were included for each gene assay, including the reference gene.

Prior to ddPCR, DNA samples were extracted using the Wizard® Genomic DNA Purification Kit (Promega, Madison, Wisconsin, USA), assessed for quantity using a NanoDrop 2000c spectrophotometer (Thermo Fisher Scientific, Waltham, Massachusetts, USA) and diluted to approximately 8 ng/μl. The DNA was then fragmented by centrifuging 140 μl of DNA sample at 16,000 × g for 3 min in a QIAshredder column (Qiagen, Venlo, Limburg, Netherlands) to prevent inaccuracies in copy number detection due to tandem duplications that are not efficiently sorted in the ddPCR assay58.

Each 20 μl PCR reaction contained 8 μl of DNA template (~4 ng/μl), 10 μl of the ddPCR™ Supermix for Probes (Bio-Rad, Hercules, California, USA), 1 μl of the FAM™ dye-labelled TaqMan® assay for the gene target of interest, and 1 μl of the VIC® dye-labelled TaqMan® reference assay. Droplets were generated by a QX200™ droplet generator (Bio-Rad, Hercules, California, USA). A C1000 Touch™ thermal cycler (Bio-Rad, Hercules, California, USA) was used to perform PCR using the following program: 1 cycle at 95°C for 10 min; 45 cycles of denaturation at 95°C for 30 s followed by annealing and extension at 60°C for 1 min; and enzyme deactivation at 98°C for 10 min. Droplets were read using a QX200™ droplet reader and analyzed with QuantaSoft™ software (Version 1.7.4.0917; Bio-Rad, Hercules, California, USA).
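The way a copy number estimate follows from droplet counts can be sketched with the standard Poisson correction used in ddPCR. This is an illustrative sketch only and is not the QuantaSoft™ implementation: the function names and droplet counts are hypothetical, and rounding the estimate to the nearest integer to obtain a copy number state is an assumption made here for illustration. The reference copy number of two corresponds to the Tfrc reference assay.

```python
import math


def copies_per_droplet(positive: int, total: int) -> float:
    """Poisson-corrected mean number of target copies per droplet."""
    if total <= 0 or not 0 <= positive < total:
        raise ValueError("need 0 <= positive < total")
    return -math.log(1.0 - positive / total)


def copy_number(target_pos: int, ref_pos: int, total: int,
                ref_copies: int = 2) -> float:
    """Estimate target copy number relative to a reference of known copy number."""
    target = copies_per_droplet(target_pos, total)
    reference = copies_per_droplet(ref_pos, total)
    return ref_copies * target / reference


# Hypothetical droplet counts, for illustration only.
estimate = copy_number(target_pos=3000, ref_pos=2000, total=15000)
state = round(estimate)  # assumption: nearest integer taken as the state
```

Because the target and reference assays are read in the same droplets, the droplet volume cancels in the ratio and only the fractions of positive droplets are needed.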

3.2.8 Gene analysis

Gene annotations consistent with the broad survey reference genome (UCSC:mm9) and the mouse cohort study reference genome (UCSC:mm10) were downloaded from Ensembl BioMart59,60. Genes found in CNVs were identified using in-house scripts. The Database for Annotation, Visualization and Integrated Discovery (DAVID) Functional Annotation tool was used to identify gene ontology (GO) term enrichment for genes overlapping CNVs61,62. DAVID versions 6.7 and 6.8 were used for the broad survey and the mouse cohort study, respectively. DAVID automatically excludes redundant genes from its analysis. The three GO categories “GOTERM_BP_FAT”, “GOTERM_CC_FAT”, and “GOTERM_MF_FAT” were used to identify the most relevant GO terms for each broad survey gene list. Occasionally, pseudogenes can be “resurrected” and produce translated products63. For this reason, pseudogenes classified as having a protein-coding biotype by Ensembl were included in the gene analysis. For the mouse cohort study, the default “GOTERM_BP_DIRECT” category was used to identify the most relevant biological process GO terms for each gene list. The gene lists comprised genes that were completely encompassed by CNVs in the mouse cohorts.

Ingenuity® Pathway Analysis’ Core Analysis64 was used to determine disease and biological function networks for genes overlapping CNVs from the broad survey. Direct and indirect relationships with a maximum of 35 focus molecules per network were included. Human, mouse and rat genes were included. The confidence level was set to include experimentally observed relationships between focus molecules as well as predicted relationships that have a high confidence. Molecule relationships with endogenous chemicals were excluded.


3.2.9 Genetic distance matrices and phenogram generation

SNP genetic distance was calculated based on pairwise genotype differences at SNP loci. All four genotype categories were used in the calculations: AA, AB, BB, and No Call. To calculate pairwise CNV genetic distance, probes underlying a CNV call were assigned the state of that CNV, while probes not underlying a CNV call were assigned a default state value of two. Distance matrices were generated using R (v3.2.4). From the distance matrices (Appendix 3J), phenogram files were constructed in R using the BIONJ function in the APE package (v3.4) and saved in Newick format. The Newick files were uploaded to FigTree (v1.4.2) to generate coloured phenogram images.
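The probe-state assignment and pairwise comparison described above can be sketched as follows. The exact distance formula is not stated in the text, so the sketch assumes that distance is the proportion of loci at which two samples differ; the analysis itself was performed in R, and the function names, sample names and probe indices below are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence


def mismatch_distance(a: Sequence, b: Sequence) -> float:
    """Proportion of loci at which two samples carry different states."""
    if len(a) != len(b) or not a:
        raise ValueError("profiles must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def probe_states(n_probes: int, cnv_calls: List[tuple]) -> List[int]:
    """Assign each probe the state of its CNV call, or 2 if outside any call.

    cnv_calls holds (first_probe, last_probe, state) with inclusive indices.
    """
    states = [2] * n_probes
    for first, last, state in cnv_calls:
        for i in range(first, last + 1):
            states[i] = state
    return states


def distance_matrix(profiles: Dict[str, Sequence]) -> Dict[tuple, float]:
    """Pairwise distances for every unordered pair of samples."""
    return {(s, t): mismatch_distance(profiles[s], profiles[t])
            for s, t in combinations(sorted(profiles), 2)}


# Hypothetical samples with ten probes each, for illustration only.
cnv_profiles = {
    "mouse_A": probe_states(10, [(2, 4, 1)]),  # single-copy loss over 3 probes
    "mouse_B": probe_states(10, [(3, 5, 3)]),  # gain over 3 probes
    "mouse_C": probe_states(10, []),           # no CNV calls
}
cnv_dist = distance_matrix(cnv_profiles)

# SNP genotypes are compared the same way; "No Call" is treated as a category.
snp_dist = mismatch_distance(["AA", "AB", "BB", "No Call"],
                             ["AA", "BB", "BB", "AA"])
```

Under this assumed definition, most probes share the default state of two, so CNV-based distances are expected to be much smaller than SNP-based distances.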

3.3 Results

3.3.1 Broad survey

3.3.1.1 CNVs detected

For 334 samples passing quality control criteria, a total of 9,634 CNVs were identified on the autosomes, with an average of 29 CNVs per sample (Table 3-1). On the X chromosome, 1,218 CNVs were found (Appendix 3D), with an average of four CNVs per sample. Calls across all samples affect 6.87% (169.9 Mb) of the autosomal genome or 8.15% (215.2 Mb) when including calls on the X chromosome.

Strains classified as CL strains have a mean of 0.065% (1.6 Mb) of the autosomes affected by CNVs, or 0.065% (1.7 Mb) when the X chromosome is included. The mean autosome and genome percentages affected for the WD strains (0.15% or 3.6 Mb and 0.14% or 3.8 Mb, respectively) and WC mice (0.14% or 3.5 Mb and 0.14% or 3.8 Mb, respectively) were significantly different from those of the CL strains (P < 0.01, Mann–Whitney test).

The CNVs on the autosomes have an average length of 54,037 bp, with a median length of 26,340 bp. The majority (81%) of CNV calls are between 1 kb and 100 kb. Gains are significantly larger than losses (P < 2.2 × 10^-16, Mann–Whitney test), where gains have a median length of 36,708 bp compared to losses at 20,091 bp. Copy-state-zero losses are significantly smaller than copy-state-one losses (P < 2.2 × 10^-16, Mann–Whitney test), where copy-state-zero losses have a median length of 13,766 bp compared to copy-state-one losses at 26,980 bp. Losses outnumber gains by a ratio of 1.42:1 on the autosomes (Table 3-1) and only the CL cohort has more gains than losses. Unlike the CL and WC cohorts, which have roughly twice as many state-one CNVs as state-zero CNVs, the WD cohort has approximately twice as many state-zero CNVs as state-one CNVs.


Table 3-1. Autosomal CNV losses and gains in laboratory strains and wild-caught mice.

| Sample Group | Number of Samples | Number of CNV calls | Copy Number State^b 0 | Copy Number State^b 1 | Copy Number State^b 3+ | Loss/Gain^c |
|---|---|---|---|---|---|---|
| All Mice | 334 | 9,634 (29)^a | 1,995 (6) | 3,661 (11) | 3,978 (12) | 1.42 |
| Classical Inbred | 114 | 2,824 (25) | 424 (4) | 867 (7) | 1,533 (13) | 0.84 |
| Wild Derived | 52 | 2,611 (50) | 1,214 (23) | 594 (11) | 803 (15) | 2.25 |
| Wild Caught | 19 | 969 (51) | 231 (12) | 491 (26) | 247 (13) | 2.92 |
| C57BL/6J | 8 | 90 (11) | 0 (0) | 38 (5) | 52 (6) | 0.73 |
| C57BL/6NJ | 6 | 46 (7) | 5 (1) | 23 (4) | 18 (3) | 1.56 |

^a Values in parentheses are the average number of CNVs per sample.
^b Copy number is determined in reference to the diploid standard, with 0 indicative of loss of paternal and maternal copies; 1 and 3+ indicate single-copy loss and the occurrence of duplication events, respectively.
^c Loss/Gain is the total number of deletions (0 and 1 copy-state call counts) divided by the number of gains (3+ copy-state call counts).
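The Loss/Gain column of Table 3-1 can be reproduced from the copy-state call counts. The short sketch below uses the counts reported in the table; the dictionary layout and function name are illustrative only.

```python
# Autosomal CNV call counts per copy number state, taken from Table 3-1.
table_3_1 = {
    "All Mice":         {"0": 1995, "1": 3661, "3+": 3978},
    "Classical Inbred": {"0": 424,  "1": 867,  "3+": 1533},
    "Wild Derived":     {"0": 1214, "1": 594,  "3+": 803},
    "Wild Caught":      {"0": 231,  "1": 491,  "3+": 247},
    "C57BL/6J":         {"0": 0,    "1": 38,   "3+": 52},
    "C57BL/6NJ":        {"0": 5,    "1": 23,   "3+": 18},
}


def loss_gain_ratio(counts: dict) -> float:
    """Losses (state 0 plus state 1) divided by gains (state 3+)."""
    return (counts["0"] + counts["1"]) / counts["3+"]


ratios = {group: round(loss_gain_ratio(c), 2) for group, c in table_3_1.items()}
```

A ratio below one, as for the Classical Inbred group and C57BL/6J, indicates more gains than losses.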


3.3.1.2 Genic content and analysis

A majority (65.7%) of CNVs within the dataset entirely encompass at least one gene, are entirely encompassed by a gene, or partially overlap with at least one gene. The three main Ensembl classification types, excluding regulatory elements, for regions that overlap CNVs are protein-coding genes (76%), pseudogenes (11%) and multiple classes of RNAs (10%). The percentage of CNVs containing protein-coding genes in the CL mice (76.7%) is higher than in the WC mice (54.2%).

Overall, protein coding genes were found in a higher percentage of gains (88.8% of gain calls overlapped a protein coding gene region) than losses (55.6%). Pseudogenes were also found to overlap a higher percentage of gains (18.0%) than losses (13.9%), as were RNAs (18.2% vs 7.1%) and antisense gene regions (5.1% vs 2.6%).

The most common CNV (when considering events with the same start and end position in each sample) is in 66 mice on Chromosome 17 and contains the Tmem181c-ps pseudogene (Table 3-2). Almost all (93%) CL mice with this CNV have a gain, while all WC mice have a single-copy loss. The second most common CNV (Table 3-2) contains two pseudogenes, Ear-ps7 and Ear-ps10, as well as two protein-coding genes, Ang5 and Ang6. This CNV was observed only as a copy number state of either zero or four and both states existed in CL and WC cohorts. This CNV occurred most frequently in the BXD cohort.


Table 3-2. Most common CNVs detected by the Mouse Diversity Genotyping Array in a set of 334 Mus musculus samples.

| Genomic location of CNV | Number of mice with the CNV | CNV state | Gene name (Gene symbol; Gene type)^a |
|---|---|---|---|
| chr17:6635443-6646618 | 66 | Mixed | Transmembrane protein 181 C, pseudogene (Tmem181c-ps; ps) |
| chr14:44540155-44579921 | 43 | Mixed | Angiogenin, ribonuclease A family, member 5 (Ang5; pc), Eosinophil-associated, ribonuclease A family, pseudogene 7 (Ear-ps7; ps), Eosinophil-associated, ribonuclease A family, pseudogene 10 (Ear-ps10; ps), Angiogenin, ribonuclease A family, member 6 (Ang6; pc) |
| chr17:35383895-35392718 | 41 | Gain | DEAD (Asp-Glu-Ala-Asp) box polypeptide 39B (Ddx39b; pc), histocompatibility 2, Q region locus 4 (H2-Q4; pc) |
| chr11:116603748-116629092 | 40 | Mixed | ST6 (alpha-N-acetyl-neuraminyl-2,3-beta-galactosyl-1,3)-N-acetylgalactosaminide alpha-2,6-sialyltransferase 1 (St6galnac1; pc), Predicted gene 11735 (Gm11735; ps) |
| chr4:122366514-122382286 | 38 | Loss | RIKEN cDNA 9530002B09 gene (9530002B09Rik; pc) |
| chr7:111681502-111683670 | 35 | Mixed | Tripartite motif-containing 30E, pseudogene 1 (Trim30e-ps1; ps) |
| chr4:111790559-111972640 | 35 | Gain | Selection and upkeep of intraepithelial T cells 4 (Skint4; pc), Predicted gene 12820 (Gm12820; ps), predicted gene 12815 (Gm12815; ps), selection and upkeep of intraepithelial T cells 3 (Skint3; pc) |
| chr17:30593663-31058945 | 34 | Gain | BTB (POZ) domain containing 9 (Btbd9; pc), predicted gene 9874 (Gm9874; pc), glyoxalase 1 (Glo1; pc), dynein, axonemal, heavy chain 8 (Dnahc8; pc) |
| chr5:114856193-114895051 | 34 | Gain | Ubiquitin protein ligase E3B (Ube3b; pc), methylmalonic aciduria (cobalamin deficiency) cblB type homolog (human) (Mmab; pc), mevalonate kinase (Mvk; pc) |
| chr14:20443929-20587951 | 34 | Gain | Predicted gene 17030 (Gm17030; ps), nidogen 2 (Nid2; pc) |

^a Gene names are as in Mouse Genome Informatics Symbol. Gene types are one of: Protein coding (pc), RNA type as listed, or pseudogene (ps).


CNV differences were observed in samples from the same mouse strain. Six of eight C57BL/6J mice have an extra copy of the insulin-degrading enzyme (Ide) gene and half of the C57BL/6J mice have an extra copy of the fibroblast growth factor binding protein 3 (Fgfbp3) gene. None of the C57BL/6NJ mice have the Ide or Fgfbp3 gain. All eight C57BL/6J mice in this study also have CNV gains overlapping most of Skint4, NLR family, pyrin domain containing 1B (Nlrp1b), and solute carrier family 25, member 37 (Slc25a37), although, unlike Ide and Fgfbp3, none of these genes were encompassed completely by a CNV. Single-copy losses overlapping predicted gene 9765 (Gm9765) and Btbd9 are also common (found in > 50% of samples). The Skint4 two-copy gain is also in all six C57BL/6NJ mice.

When only considering genes completely encompassed by CNVs and CNVs completely encompassed by genes (complete overlap), the top gene enrichment terms differed between WC and CL mice (Table 3-3). Across CL mice, only the gene ontology (GO) terms for gains are significant, while in WC mice, GO terms for both losses and gains are significant. The most significant GO term across classical laboratory mice is “antigen processing and presentation of peptide antigen” (Padj = 3.26 × 10^-10). Most of the top GO terms for CL mice are related to immunity or structural organization of the genome. Across WC mice, GO terms related to olfaction are significant for losses while GO terms related to pheromone response are significant for gains.


Table 3-3. Top DAVID Gene Ontology terms for genic CNVs detected in classical laboratory and wild-caught mice.

| Mouse cohort | CNV state | Gene Ontology term (Category^a) | Involved genes (% total) | Fold enrichment | Padj (Benjamini-Hochberg) |
|---|---|---|---|---|---|
| Classical | 3+ | Antigen processing and presentation of peptide antigen (BP) | 16 (1.95) | 13.08 | 3.26E-10 |
| Classical | 3+ | Antigen processing and presentation (BP) | 22 (2.68) | 7.23 | 1.24E-09 |
| Classical | 3+ | MHC protein complex (CC) | 17 (2.07) | 9.19 | 5.60E-09 |
| Classical | 3+ | Nucleosome (CC) | 16 (1.95) | 7.25 | 5.12E-07 |
| Classical | 3+ | Nucleosome assembly (BP) | 17 (2.07) | 6.66 | 1.81E-06 |
| Classical | 3+ | Protein-DNA complex assembly (BP) | 17 (2.07) | 6.4 | 2.02E-06 |
| Classical | 3+ | Nucleosome organization (BP) | 17 (2.07) | 6.4 | 2.02E-06 |
| Classical | 3+ | Chromatin assembly (BP) | 17 (2.07) | 6.48 | 2.06E-06 |
| Wild caught | 1 | Sensory perception of chemical stimulus (BP) | 71 (26.59) | 5.16 | 1.46E-30 |
| Wild caught | 1 | Sensory perception (BP) | 72 (26.97) | 4.44 | 1.74E-27 |
| Wild caught | 1 | Neurological system process (BP) | 78 (29.21) | 4.02 | 2.25E-27 |
| Wild caught | 1 | Cognition (BP) | 72 (26.97) | 4.21 | 3.88E-26 |
| Wild caught | 1 | Olfactory receptor activity (MF) | 62 (23.22) | 4.81 | 1.10E-24 |
| Wild caught | 1 | Sensory perception of smell (BP) | 61 (22.85) | 4.73 | 8.03E-24 |
| Wild caught | 1 | G-protein coupled receptor protein signaling pathway (BP) | 76 (28.46) | 3.5 | 4.46E-23 |
| Wild caught | 1 | Cell surface receptor linked signal transduction (BP) | 79 (29.59) | 2.74 | 1.50E-17 |
| Wild caught | 3+ | Pheromone binding (MF) | 13 (5.39) | 15.33 | 9.35E-09 |
| Wild caught | 3+ | Odorant binding (MF) | 13 (5.39) | 14.17 | 1.20E-08 |
| Wild caught | 3+ | Response to pheromone (BP) | 13 (5.39) | 13.92 | 1.03E-07 |
| Wild caught | 3+ | Pheromone receptor activity (MF) | 13 (5.39) | 11.13 | 1.36E-07 |

^a BP, Biological Process; CC, Cellular Component; MF, Molecular Function
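The adjusted P values in Table 3-3 are reported with a Benjamini-Hochberg correction. As a generic illustration of that procedure, and not of DAVID's internal implementation, the following sketch adjusts a hypothetical list of raw P values; the function name and input values are assumptions made for the example.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw P values, for illustration only.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
```

Each raw P value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values from decreasing as the raw values increase.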


Ingenuity Pathway Analysis (IPA) gene groupings into top diseases and functions networks show differences between WC and CL mice for CNVs completely within or completely containing a gene, although the distinction is not as clear as with DAVID (Appendix 3K). A total of 45 networks with an IPA score not less than 10 were identified. More networks are affected by gains (28) than by losses (17) and, in particular, by gains across the CL strains (22). “Lipid metabolism” is among the top biological functions for an IPA network associated with gains across WC mice and is not found for CNVs in CL mice. Conversely, CL mice have a network associated with “ metabolism” in gains, as well as “amino acid metabolism” in one-copy losses.

Development terms were found in 23 of the 45 networks associated with CNV regions and included cellular development, tissue development and the development of a variety of systems (e.g. neurological, hematological, gastrointestinal). By comparison, when all genes present in the mouse (Ensembl:67) were analyzed as a whole with IPA, 34 of their 50 associated networks included development terms.

Across mouse strains, networks involved in “endocrine system development” are associated with gains in WC mice and with state-zero losses in CL mice. Networks involved in “cardiac system development” are only associated with gains in CL mice and not associated with CNVs in WC mice. Networks involved in “inflammatory response” are associated with CNVs (both in losses and gains) in the CL mice, but not in the WC mice. Networks involved in “cell mediated immune response” were found to be associated with gains in both CL and WC mice.

3.3.1.3 Genes unlikely to harbour copy number losses

Mouse CNVs that were detected by the MDGA did not overlap 26 gene regions that are conserved in copy number across mammalian species (Appendix 2G)65. For autosomal genes that are unlikely to contain losses, a partial loss of one copy of Col7a1 was detected in three mice. Two male mice are partially missing the Cask gene on the X chromosome (approximately 33% and 6.5% missing).


3.3.1.4 Droplet digital PCR confirmation of select genic CNVRs

Of a total of 252 ddPCR confirmation assays (nine gene assays × 14 mice × two technical replicates), 242 (96%) were in agreement with MDGA predictions (Appendices 3H and 3L). There was no discordance between ddPCR technical replicates. Predicted intra-strain differences in Fgfbp3 copy number were also confirmed by ddPCR assays for the C57BL/6J samples. Inter-strain differences in copy number state for CNVRs affecting haloacid dehalogenase-like hydrolase domain containing 3 (Hdhd3), selection and upkeep of intraepithelial T cells 3 (Skint3), and glyoxalase 1 (Glo1) genes were also confirmed by ddPCR. Three of nine ddPCR gene assay results (B4galt3, Ide and Fgfbp3) matched the predicted state for all three mouse strains. Skint3 and Trim30e-ps1 copy number states were zero for all CBA/CaJ and DBA/2J mice when a state of two was predicted. However, the MDGA predicted a copy number difference of two for Skint3 and Trim30e-ps1 when comparing CBA/CaJ and DBA/2J to C57BL/6J, so ddPCR results were considered to confirm MDGA predictions. The gains predicted for Hdhd3 in CBA/CaJ and DBA/2J mice were detected by ddPCR and called as a state of six in both strains. Skint3 ddPCR copy number states were found to be increased by one for all three mouse strains when compared to the predicted states. Contrary to array-based predictions, ddPCR targeting intelectin-1 (Itln1) determined that copy number states did not differ from two in the five C57BL/6J mice tested.

3.3.2 Mouse cohort comparison study

3.3.2.1 CNV number, state and length in three mouse cohorts

In total, 4,718 autosomal and 683 Chromosome X CNVs were detected, and passed filtering criteria, for 210 mouse samples (Appendices 3F and 3G). For autosome CNVs, WD mice had the highest median number of CNVs, which was two-fold higher than the CL strain median while there was no difference for the median number of X chromosome CNVs between WD and CL cohorts (Table 3-4). WC mice had a three-fold lower median number of X chromosome CNVs than the WD and CL cohorts and an intermediate autosome CNV median. WC mice had the greatest range in the total number of CNVs detected per mouse (7-98 CNVs) when compared to WD mice (12-75) and CL mice (4-80; Fig. 3-1).


Table 3-4. Summary statistics for autosome and Chromosome X CNVs detected in classical laboratory, wild-derived, and wild-caught mice.

| Genomic location | Mouse cohort^a | Total number of samples | Samples with at least one CNV | Total number of CNVs | Average number of CNVs | Median number of CNVs | Min, max number of CNVs | CNV gains (%) |
|---|---|---|---|---|---|---|---|---|
| Autosomes | CL | 105 | 105 | 1706 | 16 | 15 | 4, 67 | 54 |
| Autosomes | WD | 68 | 68 | 2098 | 31 | 30 | 9, 72 | 34 |
| Autosomes | WC | 37 | 37 | 914 | 25 | 21 | 7, 90 | 3 |
| Chr. X | CL | 105 | 95 | 361 | 3 | 3 | 0, 15 | 92 |
| Chr. X | WD | 68 | 65 | 240 | 4 | 3 | 0, 11 | 75 |
| Chr. X | WC | 37 | 25 | 82 | 2 | 1 | 0, 8 | 54 |

^a CL, Classical laboratory strains; WD, wild-derived strains; WC, wild-caught mice

[Figure 3-1: box plots; y-axis: Number of CNVs; x-axis: Mouse Cohort (Classical, Wild Derived, Wild Caught)]

Figure 3-1. Number of CNVs detected for classical laboratory, wild-derived, and wild-caught mice. The box plots show the median number of CNVs with the quartiles, minimum and maximum non-outlier values, and outliers. Asterisks indicate significant p-values of <0.001 (***) or <0.0001 (****) for Wilcoxon tests performed following a Kruskal-Wallis test (p = 7 × 10^-16).

CL mice have a higher ratio of gains-to-losses than do either WD or WC mice, by 1.6- and 1.8-fold respectively (Fig. 3-2). All cohorts have a higher proportion of CNV gains on the X chromosome than on the autosomes and the greatest difference is seen in the WD mice, which have a 2.2-fold increase in CNV gains on the X chromosome when compared to the autosomes (Table 3-4). WC and CL mice both have 1.7-fold more gains on the X chromosome than on the autosomes.


Figure 3-2. Number of CNV gains and losses for classical laboratory, wild-derived, and wild-caught mice. Individual data points represent a single sample within the classical laboratory (pink circle), wild-derived (light green), and wild-caught (dark green) mouse cohorts. The center dotted line indicates where a data point will lie if a sample has an equal number of CNV gains and losses. The dotted lines above and below the center line indicate where data points will lie if the sample has 75% losses with 25% gains and 75% gains with 25% losses, respectively. The top density plot shows the distribution of CNV gains for each cohort while the right-side density plot shows the distribution of losses.


For all cohorts, the median length is longer for autosomal CNV gains than the median length for CNV losses (Fig. 3-3). The WD cohort has the greatest fold difference in length between gains and losses, where gains are 2.2-fold longer in median length. CL and WC cohorts had a similar length difference between gains and losses, at 1.6- and 1.5-fold respectively. The same pattern for median length was observed between gains and losses for Chromosome X CNVs, except the fold difference was much higher at 4.9-, 7.7-, and 8.5-fold for CL, WD and WC mice, respectively. Significant differences in the mean CNV length were observed for all autosomal intra- and inter-cohort comparisons while for the X chromosome CNVs, there were significant differences observed only at the intra-cohort level.

[Figure 3-3: box plots; y-axis: Length (Mb); x-axis: Mouse Cohort (CL, WD, WC), shown for each of four CNV groups]

Figure 3-3. Length of CNV gains and losses for the autosomes and X chromosomes of classical laboratory, wild-derived, and wild-caught mice. Classical laboratory, wild-derived, and wild-caught mouse cohorts are represented by CL, WD, and WC, respectively. CNVs are grouped by autosomal gains (dark blue), autosomal losses (light blue), Chromosome X gains (dark yellow), and Chromosome X losses (light yellow). The box plots show the median CNV length with the quartiles, minimum and maximum non-outlier values, and outliers. P-values of > 0.5 (ns), ≤ 0.01 (**), ≤ 0.001 (***), and ≤ 0.0001 (****) are indicated for Wilcoxon tests performed following a Kruskal-Wallis test for the autosome (p < 2.2 × 10^-16) and X chromosome data (p < 2.2 × 10^-16).


3.3.2.2 Regions with recurrent CNVs and singleton CNVs

When considering all cohorts combined, 67% of CNVs were recurrent (Fig. 3-4). For a CNV from one mouse sample to be considered recurrent, it must reciprocally overlap a CNV from at least one other mouse sample by at least 40%, and the copy number states of the overlapping CNVs are permitted to differ. A genomic region containing recurrent CNVs is defined here as a CNV region or CNVR. WD mice had 3.2 times more cohort-specific CNVRs than WC mice and 1.2 times more CNVRs than CL mice. A similar number of CNVRs were shared between the WD and WC cohorts, and the WD and CL cohorts. In contrast, 3.1 times fewer CNVRs were shared between CL and WC mice. The CNVRs are not uniformly distributed across the mouse genome.

A total of 1,783 singleton CNVs (<40% reciprocal overlap) were identified (Table 3-5). WD mice have double the number of singletons CL mice have (1.97-fold more) and more than double the number of singletons WC mice have (2.2-fold more). When a more stringent singleton cutoff of 0% reciprocal overlap is applied, the majority of CL (91.1%), WD (90.4%), and WC (89.0%) CNVs that were singletons at the 40% overlap cutoff remain singletons. At 0% and 40% reciprocal overlap respectively, CL strains had the fewest singletons as a cohort (20% and 22%), the WC cohort had the greatest (38% and 41%) and the WD cohort had slightly fewer singleton CNVs than the WC cohort (35% and 39%).
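The recurrent and singleton classification can be sketched as follows. The sketch assumes the common interpretation of reciprocal overlap, in which the shared segment must cover at least 40% of the length of each of the two CNVs; the tuple layout, function names and example calls are hypothetical, and coordinates are assumed to be 1-based and inclusive.

```python
from typing import List, Tuple

# (sample, chromosome, start, end); coordinates assumed 1-based and inclusive.
Cnv = Tuple[str, str, int, int]


def reciprocal_overlap(a: Cnv, b: Cnv) -> float:
    """Smaller of the two overlap fractions (overlap / length of each CNV)."""
    if a[1] != b[1]:
        return 0.0
    shared = min(a[3], b[3]) - max(a[2], b[2]) + 1
    if shared <= 0:
        return 0.0
    return min(shared / (a[3] - a[2] + 1), shared / (b[3] - b[2] + 1))


def is_recurrent(cnv: Cnv, all_cnvs: List[Cnv], threshold: float = 0.40) -> bool:
    """True if the CNV reciprocally overlaps a CNV from another sample.

    Copy number state is deliberately ignored, as states may differ.
    """
    return any(other[0] != cnv[0] and reciprocal_overlap(cnv, other) >= threshold
               for other in all_cnvs)


# Hypothetical calls, for illustration only.
calls = [
    ("mouse_A", "chr1", 1, 100),
    ("mouse_B", "chr1", 51, 150),   # 50 bp shared with mouse_A: 50% both ways
    ("mouse_C", "chr1", 91, 1090),  # 10 bp shared with mouse_A: 1% of mouse_C
]
recurrent = [is_recurrent(c, calls) for c in calls]
singletons = [c for c, r in zip(calls, recurrent) if not r]
```

In this example the long mouse_C call overlaps both other calls, yet it remains a singleton at the 40% cutoff because the shared segment is a small fraction of its own length.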

[Figure 3-4: Venn diagram; sets: Wild caught, Classical laboratory, Wild derived]

Figure 3-4. CNVRs in classical laboratory, wild-derived, and wild-caught mice. Classical laboratory, wild-derived, and wild-caught mouse cohorts are represented in pink, light green, and dark green colours, respectively. A CNV is considered to be recurrent if it has at least 40% reciprocal overlap with another CNV of any state in a different sample. Genomic regions containing recurrent CNVs are CNV regions (CNVRs). CNVs include autosome and Chromosome X CNVs.

Table 3-5. Number of singleton CNVs present on the autosomes and X chromosome for classical laboratory, wild-derived, and wild-caught mouse cohorts, as determined using <40% reciprocal overlap and 0% overlap criteria.

| Mouse cohort | Autosomal singletons (<40% overlap) | Chr. X singletons (<40% overlap) | Autosomal singletons (0% overlap) | Chr. X singletons (0% overlap) |
|---|---|---|---|---|
| Classical | 438 | 24 | 399 | 22 |
| Wild derived | 868 | 44 | 788 | 36 |
| Wild caught | 385 | 24 | 351 | 23 |


3.3.2.3 Genetic distance

The CNV-based genetic distance between mouse pairs groups some mice with common ancestry together but it does not recapitulate known genealogy as well as SNP-based genetic distance does (Fig. 3-5). The SNP-based genetic distance shows a clear division in genetic distance between the CL samples and the WD and WC samples, which do not have clear cohort divisions. The MOR/RkJ WD mouse is genetically most similar to the C57BL strains according to SNP-based genetic distance calculations, but this is not unexpected because MOR/RkJ contains some C57BL/6J ancestry66. CNV-based genetic distance does not show any clear cohort grouping.

[Figure 3-5: two phenogram panels, labelled CNV and SNP]

Figure 3-5. Phenograms depicting relationships between mice, as determined using pairwise genetic distances calculated for all autosomal CNVs in 210 mice and autosomal SNPs in 215 mice. The mouse samples are coloured by cohort where pink indicates classical laboratory strains, light green represents wild-derived mouse strains and dark green is used for wild-caught mice. Samples C57BL/6NCr, C57BL/6NCrl, and C57BL/6NTac are listed as C57BL/6NCl, C57BL/6Crl, and C57BL/6Tc, respectively, in Appendix 3B.


When looking at CNV-based genetic distance values, the average intra-cohort genetic distance is lowest within the CL cohort, although it is very similar to the WD and WC cohorts (Table 3-6). The average inter-cohort genetic distance is similar when comparing the CL and WD cohorts and the CL and WC cohorts. The average inter-cohort genetic distance is slightly greater between the WD and WC cohort samples. SNP-based genetic distance shows that the average genetic distance within the CL cohort is two times lower than the average intra-cohort genetic distance for the WD and WC cohorts. The greatest average inter-cohort genetic distance is between CL and WC mice and lowest between CL and WD mice. The minimum and maximum genetic distance values indicate that CL mice are genetically more similar to each other than to mice in other cohorts. In contrast, some mice in the WD and WC cohorts are more genetically similar to mice from other cohorts than to mice within their cohorts. In comparison to CNV-based genetic distance values, the SNP distance values are greater by 100-fold indicating that there are more SNP differences between samples than CNV differences.


Table 3-6. Average CNV- and SNP-based genetic distances within and between classical laboratory, wild-derived, and wild-caught mouse cohorts.

| Variant type used for genetic distance calculations | Mouse cohort | Average intra-cohort genetic distance (min, max) | Average inter-cohort genetic distance (min, max): Classical | Average inter-cohort genetic distance (min, max): Wild derived | Average inter-cohort genetic distance (min, max): Wild caught |
|---|---|---|---|---|---|
| Copy number variant | Classical | 2.57 × 10^-3 (1.06 × 10^-4, 1.45 × 10^-2) | - | 2.85 × 10^-3 (3.20 × 10^-4, 1.71 × 10^-2) | 2.90 × 10^-3 (3.78 × 10^-4, 1.75 × 10^-2) |
| Copy number variant | Wild derived | 2.97 × 10^-3 (4.91 × 10^-4, 1.20 × 10^-2) | - | - | 3.07 × 10^-3 (5.64 × 10^-4, 2.03 × 10^-2) |
| Copy number variant | Wild caught | 2.92 × 10^-3 (2.11 × 10^-4, 1.86 × 10^-2) | - | - | - |
| Single nucleotide polymorphism | Classical | 1.89 × 10^-1 (1.84 × 10^-3, 2.36 × 10^-1) | - | 3.84 × 10^-1 (2.54 × 10^-2, 5.58 × 10^-1) | 4.67 × 10^-1 (2.50 × 10^-1, 5.72 × 10^-1) |
| Single nucleotide polymorphism | Wild derived | 3.86 × 10^-1 (2.17 × 10^-2, 5.91 × 10^-1) | - | - | 4.01 × 10^-1 (1.33 × 10^-1, 6.00 × 10^-1) |
| Single nucleotide polymorphism | Wild caught | 3.91 × 10^-1 (3.03 × 10^-2, 5.68 × 10^-1) | - | - | - |


3.3.2.4 Genes unlikely to vary in copy number

Of the 5401 detected CNVs, 99% did not overlap, to any degree, with genes that were unlikely to vary in copy number. The full list of genes checked for CNV overlap can be found in Appendix 2G. Of the 47 CNVs that did overlap these genes, only four had a copy number state of zero (Appendix 3M). However, none of the four CNVs completely overlapped these genes, with the percentage of gene overlap ranging from 0.5% to 20.5%.

3.3.2.5 Concordance between mouse cohort comparison study and broad survey

Of the autosomal CNVs detected in this study, 522 (11%) were not observed in the broad survey. However, the majority of these novel CNVs (68%) were found in samples not used in the broad survey. When only comparing the same samples from both studies, 167 autosomal CNVs from 94 mice were novel in the mouse cohort comparison study. For the X chromosome, 15% of CNVs were not found in the broad survey. However, unlike for the autosomes, the majority (68%) of these novel CNVs were detected in samples used in both studies. In total, 71 novel CNVs were detected in 39 mouse samples in the mouse cohort comparison study.

3.3.2.6 Genic content

Within the WD and WC mouse cohorts, fewer than half of the CNVs overlapped at least one complete gene (Table 3-7). The CL cohort has an approximately equal number of genic and non-genic CNVs. The majority of genic CNVs in the CL and WD mouse samples were gains, while approximately the same proportion of WC genic CNVs were gains and losses. The top-ranked gene ontology (GO) terms for genes overlapping CNVs differ between CL and wild mouse samples (Table 3-8). CL mice had CNV losses and gains in genes with terms related to epigenetics, like nucleosome assembly and methylation. CL CNV losses were found in genes with antigen-related functions. Nucleosome assembly is also found as a top GO term in WD CNV gains. Unlike the CL cohort, the WD cohort has gene enrichment for sensory perception of smell, which is also found in the WC cohort. The WC cohort genes are also enriched for pheromone response in both CNV losses and gains.


Table 3-7. Genic CNV gains and losses in classical laboratory, wild-derived, and wild-caught mice.

| Mouse cohort | Genic CNVs^a: Gain | Genic CNVs^a: Loss | Non-genic CNVs: Gain | Non-genic CNVs: Loss | Genic CNVs (% of total) |
|---|---|---|---|---|---|
| Classical | 773 (73%)^b | 286 (27%) | 482 (48%) | 526 (52%) | 51 |
| Wild derived | 529 (57%) | 399 (43%) | 363 (26%) | 1047 (74%) | 40 |
| Wild caught | 224 (49%) | 236 (51%) | 108 (20%) | 428 (80%) | 46 |

^a Genic CNVs overlap at least one whole gene
^b Percentage of gains in genic CNVs


Table 3-8. Top Gene Ontology terms for CNV gains and losses in classical laboratory, wild-derived, and wild-caught mouse cohorts.

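| Mouse cohort | CNV states | GO term (biological process) | Number of genes | P-value | Padj (Benjamini-Hochberg) |
|---|---|---|---|---|---|
| Classical laboratory | Losses | Nucleosome assembly | 25 | 1.40E-16 | 2.00E-13 |
| Classical laboratory | Losses | Antigen processing and presentation | 16 | 4.30E-12 | 3.90E-09 |
| Classical laboratory | Losses | DNA methylation on cytosine | 13 | 1.10E-11 | 6.60E-09 |
| Classical laboratory | Gains | Nucleosome assembly | 25 | 9.40E-18 | 1.60E-14 |
| Classical laboratory | Gains | DNA methylation on cytosine | 13 | 2.70E-12 | 2.20E-09 |
| Classical laboratory | Gains | Positive regulation of gene expression, epigenetic | 13 | 4.20E-12 | 2.30E-09 |
| Wild derived | Losses | Sensory perception of smell | 105 | 1.40E-69 | 6.50E-67 |
| Wild derived | Losses | G-protein coupled receptor signaling pathway | 109 | 2.00E-56 | 4.70E-54 |
| Wild derived | Losses | Detection of chemical stimulus involved in sensory perception | 16 | 1.40E-08 | 2.10E-06 |
| Wild derived | Gains | Sensory perception of smell | 71 | 3.80E-17 | 4.60E-14 |
| Wild derived | Gains | G-protein coupled receptor signaling pathway | 83 | 7.80E-14 | 4.80E-11 |
| Wild derived | Gains | Nucleosome assembly | 17 | 3.10E-10 | 1.30E-07 |
| Wild caught | Losses | Sensory perception of smell | 67 | 1.50E-33 | 5.20E-31 |
| Wild caught | Losses | G-protein coupled receptor signaling pathway | 74 | 4.70E-29 | 8.30E-27 |
| Wild caught | Losses | Response to pheromone | 18 | 2.70E-16 | 2.60E-14 |
| Wild caught | Gains | Response to pheromone | 10 | 2.60E-04 | 2.90E-01 |
| Wild caught | Gains | Cell adhesion | 23 | 3.80E-04 | 2.20E-01 |
| Wild caught | Gains | Glutathione metabolic process | 7 | 4.40E-04 | 1.80E-01 |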


3.4 Discussion

3.4.1 Broad survey

3.4.1.1 CNVs detected

The percentage of the reference mouse genome affected by CNVs detected in this study (6.87-8.15%) falls within the range found in other studies which found between 1.2%67 and 10.7%68 of the reference genome affected by SVs and CNVs, respectively. In comparison to WD (0.15%) and WC (0.14%) mice in this study, the percent of the genome affected was higher in other studies for WD mouse samples (3.4%)67 and WC samples (10.7%)68. CL mice had a lower percentage (0.065%) of the genome affected by CNVs than WD or WC mice, which would be expected with inbreeding practices leading to reduced inter-strain and intra-strain genetic diversity. These values are all affected by the sample size, capture technology and diversity of samples, which differ between studies. The amount of the mouse genome affected by CNVs is greater than that reported for dog (1.08%)69, cattle (1.61–4.60%)70,71 and swine (4.23%)72 but is similar to that reported for humans (3.7%, 7.6%, 12%)73–75.

The higher ratio of CNV losses to gains observed in this study is consistent with a previous study76. The larger median CNV length of gains, compared to losses, is consistent with the idea that large genomic gains are less likely to be deleterious than losses and thus more likely to be present in the genome, particularly for genic CNVs77. Similarly, a significant size difference between copy-state-zero and copy-state-one losses was observed, where the median length of copy-state-one CNVs is approximately double that of copy-state-zero CNVs. A complete loss of a genomic region is more likely to be deleterious if it is large because it is more likely to overlap regions that code for important biological functions.

3.4.1.2 Genic content and analysis

While reduced genetic diversity is expected in CL mice, CNVs can arise even in well-established strains like C57BL/6J mice. Watkins-Chow and Pavan78 found dosage-sensitive Ide and Fgfbp3 copy number gains segregating in the Jackson Laboratory C57BL/6J colony with the occurrence of the gains having increased in the colony since the mutation was predicted to have arisen, sometime after 1994. The Ide and Fgfbp3 gains in C57BL/6J samples were confirmed by ddPCR in our study as well. The intra-strain CNV differences that were detected by the array continue to support intra-strain CNV differences as important contributors to divergence from isogeneity53. This divergence from isogeneity, however, can only occur in certain regions of the genome and is dependent on the mutation type. For example, few losses were expected to occur in regions important for biological function as they are likely to be deleterious79. Thus, to some extent, dosage-sensitive genes can be used to assess the quality of CNV calls.

Recurrent detection of specific CNVs by different research groups may indicate a mutation hotspot in a region of the genome or inheritance through relatedness (e.g. same supplier), but it is also important to consider the reference genome used when generating CNV calls. The C57BL/6J strain is commonly used as a reference. CNV gains that overlap with Skint4, Nlrp1b, Slc25a37, Ide and Fgfbp3 in this study were called as CNV losses in non-C57BL/6J laboratory strains in previous studies38,39,68,80–84. Similarly, the CNV losses in Btbd9 were called as CNV gains in previous studies38,76,81–84. Gm9765, which appears as a loss in this study’s C57BL/6J mice, appeared as a gain in inbred mice in six other studies38,76,81–84, while one study found a mix of losses and gains in this region68. This may indicate that the CNVs overlapping with these six regions (excluding Gm9765) are widespread in some C57BL/6J mouse stocks and that using this mouse strain as a reference (expected copy number state of two) may result in incorrect CNV states being reported in other strains.
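The reference effect described above can be illustrated with a small sketch. It assumes, purely for illustration, that a call reflects the sample's copy number relative to the reference sample's true copy number; the function name and the copy numbers are hypothetical.

```python
def apparent_call(sample_copies, reference_copies):
    """Classify a CNV call relative to the reference sample's copy number."""
    if sample_copies > reference_copies:
        return "gain"
    if sample_copies < reference_copies:
        return "loss"
    return "no change"


# Hypothetical locus where the C57BL/6J reference itself carries four
# copies while another strain carries the usual two copies.
call_in_other_strain = apparent_call(sample_copies=2, reference_copies=4)

# The same locus assessed in C57BL/6J against an expected state of two.
call_in_reference_strain = apparent_call(sample_copies=4, reference_copies=2)
```

Under this assumption the same underlying difference is reported as a loss in the non-reference strain and as a gain in C57BL/6J, which matches the pattern of reversed calls between studies.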

Gene ontology analysis, using DAVID, showed differences in gene enrichment between mouse cohorts. Laboratory mouse strains are frequently bred to display specific immunity or disease phenotypes85 and this may in part explain the GO term enrichment across the CL mouse strains for immunity-related terms. Olfaction- and pheromone-related genes, which can assist mice with social interactions and gaining information about their environment86, are not highly enriched in analysis of GO terms for CL mouse strains, consistent with their laboratory breeding history and less diverse ancestry. Similar to copy number variation, SNP variation in pheromone receptors is lower in CL mice when compared to WD mice87.

Using IPA, an overrepresentation of lipid metabolism genes has been shown in CNV regions in WC mice68. Different sets of metabolism genes were overrepresented in CL mouse CNVs: carbohydrate metabolism genes in gains and amino acid metabolism genes in losses. This difference may indicate copy number variation as an adaptive change to diet between WC mice and CL strains. In humans and dogs, the copy number of amylase genes (AMY1 and AMY2B, respectively) was found to vary, and in dogs the gene is amplified relative to wolves, conferring adaptation to a starch-rich diet88,89. Across all of the samples, there is only one gain in the mouse orthologs of these genes (Amy1, Amy2), found in the YBR/EiJ CL strain, so there is no evidence for an adaptive change to diet involving CNVs in the mouse amylase genes within this sample mouse population.

Since many development-related genes are present in the mouse genome, it is not unexpected for development-related IPA networks to appear for the detected CNVs. The types of developmental genes that are overrepresented differ between mouse cohorts although it is not clear if the differences result from factors including but not limited to environmental influences and sampling biases (e.g. sample size, inter-cohort differences in subspecies composition, and genetic relatedness).

Some CNV calls may differ by strain due to strain-specific SNPs preventing the hybridization of probes and the target DNA. As a result, a bias in gene enrichment may be present depending on how closely related a mouse is to the probe design reference. Biases in gene enrichment can also occur with large gene families or if there are many genes associated with specific Gene Ontology terms. These biases can be overcome in programs like DAVID by inputting all of the genes in the mouse genome as a “background” for statistical analysis.
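The role of the background gene set can be sketched with a plain hypergeometric tail test. DAVID itself reports a modified Fisher's exact test (the EASE score), so the function below is a simplified stand-in rather than DAVID's calculation, and all names and counts are illustrative assumptions.

```python
from math import comb


def enrichment_p_value(overlap, list_size, term_size, background_size):
    """P(X >= overlap) under a hypergeometric model.

    background_size: genes in the background (e.g. all mouse genes)
    term_size:       background genes annotated with the GO term
    list_size:       genes in the CNV-affected gene list
    overlap:         genes in the list annotated with the GO term
    """
    upper = min(list_size, term_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(term_size, i) * comb(background_size - term_size, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 10 background genes, 5 annotated with the term,
# a list of 5 CNV-affected genes, all 5 of which carry the term.
p_small_background = enrichment_p_value(5, 5, 5, 10)
```

Because `term_size` and `background_size` enter the calculation directly, a large gene family inflates the expected overlap, and an inappropriate background shifts every P-value; this is the bias that supplying the full mouse gene set as background is meant to control.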

3.4.1.3 Genes unlikely to harbour copy-number losses

The autosomal genes A disintegrin and metallopeptidase domain 17 (Adam17)90, cyclin-dependent kinase 8 (Cdk8)91, collagen type VII alpha 1 (Col7a1)92, delta like canonical Notch ligand 1 (Dll1)93, DNA methyltransferase 3B (Dnmt3b)94, dual-specificity tyrosine-(Y)-phosphorylation regulated kinase 1a (Dyrk1a)95, embryonic ectoderm development (Eed)96, elastin (Eln)97, enhancer of zeste 2 polycomb repressive complex 2 subunit (Ezh2)98, insulin-like growth factor 1 (Igf1)99, laminin alpha 5 (Lama5)100, mediator complex subunit 1 (Med1)101, mediator complex subunit 21 (Med21)102, mediator complex subunit 24 (Med24)103, mediator complex subunit 30 (Med30)104, peroxisomal biogenesis factor 7 (Pex7)105, Pbx/knotted 1 homeobox (Pknox1)106, 3-phosphoinositide dependent protein kinase 1 (Pdpk1)107, solute carrier family 2 (facilitated glucose transporter) member 1 (Slc2a1)108, SUZ12 polycomb repressive complex 2 subunit (Suz12)109,110, VPS35 retromer complex component (Vps35)111, and transferrin receptor (Tfrc)112 are known to cause deleterious phenotypes when gene expression levels are reduced and may be lethal when inherited at a zero-copy state or one-copy state, depending on the gene. Therefore, losses in these gene regions, particularly state-zero losses, are not expected to be inherited or arise early in development and be present in the adult mouse. Although three mice appear to have partially lost one copy of Col7a1, unlike a zero-copy loss, a single-copy loss of Col7a1 is not lethal92. Mice in this latter case are expected to have a normal phenotype if gene expression levels are high enough92. As expected, no losses were detected in any of the other autosomal genes listed above.

A number of genes on the X chromosome cause deleterious phenotypes when deleted or inactivated113, including apoptosis-inducing factor mitochondrion-associated 1 (Aifm1)114, aminolevulinic acid synthase 2 erythroid (Alas2)115, APC membrane recruitment 1 (Amer1, synonyms Wtx and Fam123b)116, BCL6 interacting corepressor (Bcor)117, calcium/calmodulin-dependent serine protein kinase (MAGUK family; Cask)118, cullin 4B (Cul4b)119, phenylalkylamine Ca2+ antagonist (emopamil) binding protein (Ebp)120, filamin alpha (Flna)121, glucose-6-phosphate dehydrogenase X-linked (G6pdx)122, glycerol kinase (Gyk)123, inhibitor of kappaB kinase gamma (Ikbkg)124, methyl CpG binding protein 2 (Mecp2)125, mediator complex subunit 12 (Med12)126, X-linked myotubular myopathy gene 1 (Mtm1)127, NAD(P) dependent steroid dehydrogenase-like (Nsdhl)128, OFD1 centriole and centriolar satellite protein (Ofd1)129, phosphatidylinositol glycan anchor biosynthesis class A (Piga)130, and porcupine O-acyltransferase (Porcn)131,132. The partial loss of the Cask gene in two mice may be a true biological event because although a knockout of Cask is lethal, mice are still viable even if Cask expression has been reduced by ~70%118. Losses up to 4761 bp in length have been found in Cask67 and a large CNV loss covering the entire Cask gene was identified in an aCGH study133. As long as some degree of the functioning Cask gene is maintained in the mouse it is possible for Cask to acquire mutations or be lost in a cell population.

There are several possible explanations for observing gene losses that contribute to deleterious phenotypes. The losses could be false positive calls or could be due to off-target mutations in the samples that prevent the sample DNA from binding to the array probes. In other cases, a potentially deleterious genic loss could have arisen in a mouse, but it is not harmful to the individual because it occurred after a specific developmental time point, in a tissue where it is not vital for proper function, or it is present in a clonal size insufficient to produce a negative phenotype. The following genes were also reported to overlap losses, over 500 bp in size, in previous studies67,68,83,134: Adam17, Cdk8, Dnmt3b, Dyrk1a, Eed, Ezh2, Lama5, Med21, Pdpk1, and Pex7, as well as on the X chromosome38,67,83: Aifm1, Bcor, Cask, Cul4b, Ebp, Flna, G6pdx, Gyk, Ikbkg, Mecp2, Nsdhl, and Porcn. The reports for the genes listed above do not provide an integer copy number state, so it is likely that the reported losses in these gene regions are one-copy-state losses since single-copy losses or minimal expression of each of these genes can be tolerated in mice.

3.4.1.4 Droplet digital PCR confirmation of select genic CNVRs

Biological validation, such as qPCR, would normally be performed using the same DNA samples. Since we did not have access to those exact samples, strain-matched mice were used instead: mice of the same strain are related, so there is a possibility that the CNVs selected for confirmation were inherited in the lineage. Confirmation of select genic CNVs in classical inbred strains is a first step toward biological validation, and future work could be expanded to include more CNVs and to evaluate their phenotypic impact.

The high rate of CNV confirmation (96%) in mouse samples that differed from those used on the array can be partially attributed to the occurrence of strain-specific CNVs. For example, two of the CNVRs selected for confirmation are known to contain genes Ide and Fgfbp3 that vary in copy number within the C57BL/6J mouse strain78. In addition, not all strain-specific CNVs are found in every individual of that strain, so a copy number state of two (“default” state) is also considered to be an acceptable alternate state for confirmation purposes, thereby increasing the probability that a CNV state can be “confirmed”. In the case of Skint3 in DBA/2J mice, a copy number state of two was predicted by the array, but ddPCR results showed a state of zero. Complete losses of Skint3 have been previously observed in DBA/2J mice135. Notably, any differences from the predicted copy number states are not necessarily indicative of MDGA performance given that different mice were used for the microarray and ddPCR-based determinations.
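The confirmation rule described above, in which a ddPCR state of two is accepted as an alternate outcome, can be written as a short sketch. The function names and the example state pairs are hypothetical and are not the study's data.

```python
def is_confirmed(predicted_state, ddpcr_state, default_state=2):
    """A call counts as confirmed if ddPCR matches the array prediction
    or shows the default (two-copy) state."""
    return ddpcr_state == predicted_state or ddpcr_state == default_state


def confirmation_rate(state_pairs):
    """state_pairs: list of (array-predicted state, ddPCR state)."""
    confirmed = sum(is_confirmed(pred, obs) for pred, obs in state_pairs)
    return confirmed / len(state_pairs)


# Hypothetical assays: exact match, default-state match,
# a predicted state of two that ddPCR shows as zero, and an exact match.
example_pairs = [(3, 3), (3, 2), (2, 0), (1, 1)]
example_rate = confirmation_rate(example_pairs)
```

Note that the rule is asymmetric: a predicted gain with a ddPCR state of two is accepted, whereas a predicted state of two with a ddPCR state of zero, as for Skint3 in DBA/2J, is not.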

3.4.2 Mouse cohort comparison study

3.4.2.1 CNVs detected

Among the three cohorts, WD mice had the greatest number of CNVs, which is consistent with the findings of Locke et al.1, whose study includes ~85% of the samples used in the mouse cohort comparison study. WD mice are laboratory strains that have greater inter-strain genetic heterogeneity than CL strains, so the increased number of CNVs may have artificially resulted from breeding practices intended to maintain higher genetic diversity than CL mice136. At the same time, sibling mating has occurred with WD mice, leading to reduced heterozygosity, and the WD mice may also be gaining CNVs common to CL strains, as a result of positive selection for variants with higher fitness in mouse facility environments136. Further genomic studies would be required to determine if the increased number of CNVs in WD mice is due to animal husbandry, the genetic diversity of the founding wild animals and animals used in maintaining the stock, or some other reason.

In general, inbreeding is associated with reduced genetic diversity and CL strains have been maintained through inbreeding for many generations. Inter-line genetic relatedness can be attributed to founder effects and a small number of related founders. Intra-strain homogeneity can be valuable in that it helps reduce confounding factors arising from genetic variation in research studies. Many CL strains are known for having specific traits and were intentionally bred to study disorders caused by different mutations4. Some mouse strains are used to model diseases that are known to have genomic instability, such as cancer, and therefore could have a number of structural alterations that is higher than expected with other disease models or with normal development137–139. This may have been the case with CL samples which are outliers in the cohort with respect to number of CNVs detected. The MRL/MpJ sample, for example, which had the highest number of CNVs in the mouse cohort study (80 CNVs; 100 CNVs in the broad survey), is prone to developing lupus, and differences in anergic B cell gene expression have been identified between the MRL/MpJ strain background and the C57BL/6J strain, which is not prone to developing lupus140. The differentially expressed genes were involved in functional networks relating to the regulation of cell growth, signalling and apoptosis140. It has yet to be determined if the high number of CNVs in the MRL/MpJ sample is characteristic of this strain as a result of genetic background or if it is disease-associated, or both.

Given that CNV gene gains are generally thought to result in milder phenotypes than losses and that strong purifying selection against genic CNV losses, but not gains, has been observed in a eukaryote species77, it was expected that all mouse cohorts would have more CNV gains than losses. However, only CL mice were found to have more gains than losses. More losses than gains were also found in all but one of 59 sequenced samples assessed in a study of wild-caught domesticus and musculus mice (including wild-caught mice bred for one or two generations)13. Losses were generally found to be more frequent in studies of other organisms as well, including silkworms141, dogs142, goats143, and cattle144, and are thought to be found more frequently due to mutation mechanism biases and biases in the detection technology used141,143. The mouse cohort comparison study also showed the median length of CNV losses to be smaller than gains for all cohorts and this length difference was larger for Chromosome X CNVs. The reason for this difference in length has yet to be determined, although it is possible that large deletions are more likely to be deleterious than large gains and therefore are selected against. A study of CNVs in wild-caught mice found that, although more numerous, losses generally impacted a smaller proportion of the genome than gains, which indicates that many losses are small in size13. In humans, the CNV length was found to be associated with pathogenicity, since pathogenic CNVs for both gains and losses were longer on average than benign CNVs, thereby impacting more of the genome145. Pathogenic CNV losses were also found to outnumber pathogenic gains by ~3-fold while benign CNVs were made up of slightly more gains than losses145, further providing support that CNV losses are more likely to be deleterious than gains.

3.4.2.2 CNV recurrence

The CNV diversity was predicted to be the greatest in the WC cohort because of the heterogeneous makeup of the samples in this cohort, which included mice from different geographical regions and multiple subspecies, and because these captured mice were not inbred for numerous generations like CL mice. This diversity was reflected in the percentage of singleton (non-recurrent) CNVs in this cohort, which is the highest among the three cohorts. Likewise, the WC cohort had the highest maximum genetic distance between two mice, although the average genetic distance was similar to the WD cohort. In contrast, the CL cohort mostly had recurrent CNVs, some of which appear to be specific to a mouse strain or lineage. These may have resulted from random mutations that arose in the lineage and were passed on, or they may have been selected for in the generation of a mouse strain due to phenotypic impact. An example of this is the Ide gain in C57BL/6J mice78. The high level of CNV recurrence in the CL cohort is evident from the pairwise genetic distance values which show lower intra-cohort diversity for CL mice than WC or WD mice. WD mice had a similar number of singletons as WC mice, indicative of genetic diversity within the cohort. However, the WD cohort also had 167 and 166 CNVRs shared with the WC and CL cohorts, respectively. These shared CNVs may be the result of shared ancestry or they could be variants that arose independently and were positively selected for in the wild or laboratory environments.

The WC and CL cohorts shared more than 3-fold fewer CNVRs than the WD cohort shared with either of these cohorts. This suggests that the WD mice are more closely related to mice in the other two cohorts, either as a result of shared genealogy or shared environmental influence on the genome, while the WC and CL mice had less in common. Overall, there were more CNVRs shared by all three cohorts than by any two alone. Some of these recurrent CNVs may be specific to mouse subspecies, since multiple subspecies or mixtures of subspecies can be found within the cohorts. If shared among all subspecies, recurrent CNVs may have arisen in ancient ancestors and could be informative about the evolutionary history of mice. Alternatively, recurrent CNVs may have arisen independently in individuals from each cohort in mutation hotspots in the genome.
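Counts of CNVRs that are cohort-specific, shared by two cohorts, or shared by all three can be obtained by recording, for each CNVR, the exact set of cohorts in which it occurs. The sketch below assumes CNVRs have already been reduced to shared identifiers; the function name and the toy identifiers are hypothetical.

```python
def partition_cnvrs(cohorts):
    """Count CNVRs by the exact combination of cohorts containing them.

    cohorts: dict mapping cohort name -> set of CNVR identifiers.
    Returns a dict mapping a sorted tuple of cohort names -> CNVR count.
    """
    membership = {}
    for cohort, cnvrs in cohorts.items():
        for cnvr in cnvrs:
            membership.setdefault(cnvr, set()).add(cohort)
    counts = {}
    for present_in in membership.values():
        key = tuple(sorted(present_in))
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical CNVR identifiers for the three cohorts.
toy_cohorts = {
    "CL": {"r1", "r2", "r3"},
    "WD": {"r2", "r3", "r4"},
    "WC": {"r3", "r4", "r5"},
}
toy_counts = partition_cnvrs(toy_cohorts)
```

Counting exact combinations, rather than pairwise intersections, keeps a CNVR shared by all three cohorts from also being counted in each two-cohort category, which is the distinction drawn between "all three cohorts" and "any two alone".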

CNVRs that overlap large segmental duplications are more likely to contain CNVs that are variable in copy number between individuals and overlap environmentally responsive genes while CNVRs outside of segmental duplications are more likely to contain pathogenic CNVs12,146. Several CNVRs observed in this study have been identified in other studies, including the regions that contain the Itln1 and Hjurp (Holliday junction recognition protein) genes, which have roles in metabolism and chromosome segregation during cell division, respectively12,147. The genomic context of CNVRs and regions with unusual CNV distribution patterns should be looked at in depth in future studies to determine potential mechanisms of CNV formation and phenotypic impact. CNV landscape plots, described in Chapter 3B, can be used to aid in the visual identification of major CNVRs and unusual CNV distribution patterns.

3.4.2.3 Genic content of CNVs and analysis

The predicted functional impact of genic CNVs differed between mouse cohorts. Similar to the DAVID results in the broad survey, the CL mice had CNVs in genes with functions relating to immunity like “antigen processing and presentation”, and to epigenetic regulation like “nucleosome assembly”. The highest-ranking gene ontology terms for CNVs in the WD and WC cohorts included terms related to olfaction and pheromone response, which were also observed for the WC samples in the broad survey. This is consistent with a previous study which found that copy number variable regions in wild mouse populations generally contained genes relevant to environmental and behavioural interactions, such as vomeronasal and olfactory receptor genes12. Pheromone response, which is important for mouse mate selection and for influencing sexual behaviour and reproductive function148, would be expected to be more diverse in natural mouse populations where a choice of potential mates is available than in laboratory mice which have mates selected for them by humans to create and maintain inbred lines. Likewise, the ability to detect a wide range of scents via olfaction is important in the wild for obtaining food149,150 and avoiding predators151,152, neither of which is a concern for laboratory mice since they live in controlled environments. Many laboratory mouse strains have been created to express disease-related traits4, so it is expected that immunity-related terms would appear in this cohort.

The G-protein coupled receptor signaling pathway found in the WD and WC mouse cohorts is likely to have an important role in adaptation and evolution. The reason for this is that these proteins belong to a protein superfamily and their common function is in acting as receptors for signals for different biological pathways relating to olfaction, pheromone detection, taste, regulation of metabolism, reproduction, development, and more153. Because genes associated with the G-protein coupled receptor signaling pathway have numerous functional roles, the genes impacted by CNVs will need to be identified individually to determine which specific biological pathways and functions within this broad category are associated with each cohort.

3.4.2.4 Genes unlikely to harbour copy-number losses

Very few CNV losses occurred in genes that were predicted to be unlikely to vary in copy number, particularly if the CNV state was zero. Where state-zero-losses of control genes did occur, the losses covered 20% or less of the gene. The presence of such losses may not have phenotypic consequences if the losses arise during a period where their expression is not vital for survival (e.g. after development). The CNV that results in a 20% loss of a gene was found in the CL mouse, TSJ/Le, in the Tcf7l2 gene. Had this loss disrupted the gene function, the mouse would have been severely hypoglycemic and would have died perinatally154. Furthermore, the CNV breakpoints are approximations of the real CNV breakpoints so the loss may overlap less than 20% of the gene. Additional experiments would be required to confirm that the array is detecting biological events but the low presence of CNVs in control genes is consistent with where CNV losses would be less likely to occur.
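The proportion of a gene covered by a CNV loss, such as the 20% figure above, follows from the overlap of two intervals. The sketch below uses half-open coordinates and hypothetical positions; the function name is an assumption, and because array-derived breakpoints are approximate, the result should be read as an estimate.

```python
def fraction_of_gene_lost(gene_start, gene_end, cnv_start, cnv_end):
    """Fraction of the gene interval overlapped by the CNV interval."""
    overlap = max(0, min(gene_end, cnv_end) - max(gene_start, cnv_start))
    return overlap / (gene_end - gene_start)


# Hypothetical gene spanning 1,000 bp with a loss covering its first 200 bp.
toy_fraction = fraction_of_gene_lost(1000, 2000, 900, 1200)
```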

3.4.2.5 Confirmation of CNV data

Since the data for this study were obtained from publicly available MDGA data from an online source, there is no way to directly confirm the array findings using the DNA from those specific samples. Instead, the same mouse strains could be tested to identify strain-specific CNVs, as was done in the broad survey.

In the broad survey, Hdhd3 and Glo1 CNV gains were identified in CBA/CaJ and DBA/2J samples and confirmed using ddPCR. In the current study, the same gains were detected again in these mouse samples. DdPCR was also performed to confirm CNV gains in Ide, Fgfbp3, and Skint3, all of which also appeared as gains in the C57BL/6J sample in this study. Nlrp1b, B4galt3, and Trim30-ps1 appeared either as a state of two or as gains in the eight C57BL/6J mice in the broad survey and as a state of two in the mouse in this study. A loss of a copy of Itln1 was found in C57BL/6J samples in both studies.

Further support for the accuracy of array data is the genetic distance information generated from the SNP genotyping data. In the phenogram representing the genetic distance relationships between mice, the CL mouse samples cluster together while the WD and WC samples are clustered together. Related mouse strains were also found to cluster together, which indicates genetic similarity. The WC mouse samples group together by mouse subspecies, consistent with a previous study on some of these mouse samples21. One WD mouse sample, MOR/RkJ, was found to cluster with the C57BL strains, which was not expected. However, MOR/RkJ and MOR/Rk mice consistently cluster with C57BL-type strains in other SNP-based studies66,155. These findings support the inclusion of C57BL-type strains in the MOR/RkJ ancestry.

Two other outliers on the SNP-based phenogram are the CL strains, CE/J and SM/J, which are found on their own branches. The background of these two strains likely accounts for the increased genetic distance from the other classical laboratory strains. The CE/J mouse originated from a wild mutant mouse that was trapped in Illinois in 1920 and then inbred for many generations to become a CL strain156. The SM/J mouse was created in 1939 using seven different inbred mouse stocks and selectively bred for a small size phenotype157. Didion and De Villena (2013) also observed segregation of CE/J and SM/J from the other CL strains and attributed this to the highly mixed background of these mice, which includes large contributions from a WC mouse in the CE/J strain21.

The CNV phenogram shows some related mice clustering together but there is not a clear distinction between the CL and wild cohorts like there is when using SNP-based genetic distance. This lack of clustering based on known genealogy is likely due to insufficient numbers of strain-specific CNVs, which is not an issue with SNP genotype data. SNP information is regularly used to determine genetic relationships in a variety of organisms.
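A pairwise genetic distance of the kind that underlies a SNP-based phenogram can be sketched as follows. The distance measure actually used in this study is not restated here, so the allele-difference measure, the 0/1/2 genotype coding, and the function name below are assumptions for illustration only.

```python
def genotype_distance(sample_a, sample_b):
    """Mean per-locus allele difference, scaled to the range 0 to 1.

    Genotypes are coded as the count of the alternate allele (0, 1 or 2);
    None marks a missing call, and such loci are skipped.
    """
    diffs = [
        abs(a - b) / 2
        for a, b in zip(sample_a, sample_b)
        if a is not None and b is not None
    ]
    if not diffs:
        raise ValueError("no loci with calls in both samples")
    return sum(diffs) / len(diffs)


# Hypothetical genotypes at four loci for two mice.
mouse_1 = [0, 2, 1, None]
mouse_2 = [2, 2, 0, 1]
toy_distance = genotype_distance(mouse_1, mouse_2)
```

With hundreds of thousands of SNP loci, such averages are stable and separate strains cleanly. The same calculation over a few dozen CNV states per mouse rests on far fewer informative loci, which is one way to see why the CNV phenogram resolves genealogy less clearly.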

3.5 Conclusion

The microarray is a valuable tool for large-scale analysis and when analyzed with rigour can provide insight into SNP and copy number variation. This study provides researchers with a CNV detection pipeline that has been tested on a large, publicly available dataset of Mus musculus samples and was confirmed using computational and PCR-based approaches.

In the broad survey, differences were found in the genes affected by putative CNVs between WC and CL mice, most notably in genes related to lipid, carbohydrate and amino acid metabolism, as well as immunity, pheromone response and olfaction. This supports the hypothesis that CNVs play a role in increasing genetic diversity and have phenotypic impacts that when shaped by selective pressures confer adaptation.

The mouse cohort study expanded on the findings in the broad survey and identified cohort-specific differences in CNV number, state and genes impacted. CNV length is not a defining characteristic among the cohorts, suggesting that the CNVs arose in samples from the three cohorts via common mechanisms of CNV formation. Further analysis is required to determine if there is a biological significance behind the differences in CNV number and state between cohorts. The gene pathways impacted by CNVs in CL, WD, and WC cohorts have associations consistent with the histories of these mice with respect to their environments and breeding histories.

In future studies, recurrent CNVs could be studied in detail from a cohort and subspecies perspective, to identify evolutionary-relevant mutations by finding cohort- or subspecies-specific CNVRs and determining if there are dosage-sensitive elements contained within those regions that can impact phenotypes. Dosage-sensitivity of a gene can be determined by confirming the predicted copy number state through ddPCR and then measuring levels of the transcribed product with a gene expression microarray or with RNA sequencing. The mechanism of CNV formation in regions of interest can also be determined by sequencing the breakpoint junctions and searching for repetitive or homologous regions, or nucleotide deletions and additions that are known to be associated with specific mechanisms of CNV formation. Uncharacterized genomic elements in recurrent CNVs could be characterized through various functional genomics approaches to determine their function and if they are dosage sensitive. For example, a region of interest could be sequenced and checked for the presence of transcription-associated sequences like open reading frames, followed by measuring levels of predicted transcript and translated product. Protein or RNA function can be predicted through ortholog comparisons and verified with gene knockout experiments. Any future MDGA-based experiments should be carried out on samples where tissues are available for confirmation of microarray results using methods not based on hybridization.

Overall, the two studies in this chapter presented an established CNV detection pipeline, introduced a method to visualize the distribution of CNVs across a genome, and characterized CNVs in a large variety of M. musculus individuals, including three mouse cohorts, from the perspective of CNV number, state, length and genomic context.

3.6 References

1. Locke, M. E. O. et al. Genomic copy number variation in Mus musculus. BMC Genomics 16, 497 (2015).

2. Jones, E. et al. Fellow travellers: a concordance of colonization patterns between mice and men in the North Atlantic region. BMC Evol. Biol. 12, 35 (2012).

3. Boursot, P., Auffray, J.-C., Britton-Davidian, J. & Bonhomme, F. The evolution of house mice. Annu. Rev. Ecol. Syst. 24, 119–152 (1993).

4. Rosenthal, N. & Brown, S. The mouse ascending: perspectives for human-disease models. Nat. Cell Biol. 9, 993–999 (2007).

5. Bolton, H. et al. Mouse model of chromosome mosaicism reveals lineage-specific depletion of aneuploid cells and normal developmental potential. Nat. Commun. 7, 11165 (2016).

6. Maslov, A. Y. et al. DNA damage in normally and prematurely aged mice. Aging Cell 12, 467–477 (2013).

7. Phifer-Rixey, M. & Nachman, M. W. Insights into mammalian biology from the wild house mouse Mus musculus. Elife 2015, 1–13 (2015).

8. Turner, L. M. & Harr, B. Genome-wide mapping in a house mouse hybrid zone reveals hybrid sterility loci and Dobzhansky-Muller interactions. Elife 3, e02504 (2014).

9. Turner, L. M., Schwahn, D. J. & Harr, B. Reduced male fertility is common but highly variable in form and severity in a natural house mouse hybrid zone. Evolution 66, 443–458 (2012).

10. Suzuki, T. A. & Nachman, M. W. Speciation and reduced hybrid female fertility in house mice. Evolution 69, 2468–2481 (2015).

11. Davis, R. C. et al. A genome-wide set of congenic mouse strains derived from CAST/Ei on a C57BL/6 background. Genomics 90, 306–313 (2007).

12. Pezer, Ž., Harr, B., Teschke, M., Babiker, H. & Tautz, D. Divergence patterns of genic copy number variation in natural populations of the house mouse (Mus musculus domesticus) reveal three conserved genes with major population-specific expansions. Genome Res. 25, 1114–1124 (2015).

13. Harr, B. et al. Genomic resources for wild populations of the house mouse, Mus musculus and its close relative Mus spretus. Sci. Data 3, 160075 (2016).

14. Gamazon, E. R. & Stranger, B. E. The impact of human copy number variation on gene expression. Brief. Funct. Genomics 14, 352–357 (2015).

15. de Smith, A. J., Walters, R. G., Froguel, P. & Blakemore, A. I. Human genes involved in copy number variation: mechanisms of origin, functional effects and implications for disease. Cytogenet. Genome Res. 123, 17–26 (2008).

16. Churchill, G. A., Gatti, D. M., Munger, S. C. & Svenson, K. L. The Diversity Outbred mouse population. Mamm. Genome 23, 713–718 (2012).

17. Ideraabdullah, F. Y. et al. Genetic and haplotype diversity among wild-derived mouse inbred strains. Genome Res. 14, 1880–1887 (2004).

18. Poltorak, A., Apalko, S. & Sherbak, S. Wild-derived mice: from genetic diversity to variation in immune responses. Mamm. Genome 29, 577–584 (2018).

19. Geraldes, A. et al. Inferring the history of speciation in house mice from autosomal, X-linked, Y-linked and mitochondrial genes. Mol. Ecol. 17, 5349–5363 (2008).

20. Geraldes, A., Basset, P., Smith, K. L. & Nachman, M. W. Higher differentiation among subspecies of the house mouse (Mus musculus) in genomic regions with low recombination. Mol. Ecol. 20, 4722–4736 (2011).

21. Didion, J. P. & De Villena, F. P. M. Deconstructing Mus gemischus: Advances in understanding ancestry, structure, and variation in the genome of the laboratory mouse. Mamm. Genome 24, 1–20 (2013).

22. Yonekawa, H. et al. Hybrid origin of Japanese mice ‘Mus musculus molossinus’: evidence from restriction analysis of mitochondrial DNA. Mol. Biol. Evol. 5, 63–78 (1988).

23. Nunome, M. et al. Detection of recombinant haplotypes in wild mice (Mus musculus) provides new insights into the origin of Japanese mice. Mol. Ecol. 19, 2474–2489 (2010).

24. Searle, J. B. et al. The diverse origins of New Zealand house mice. Proc. Biol. Sci. 276, 209–217 (2009).

25. Smadja, C. & Ganem, G. Divergence of odorant signals within and between the two European subspecies of the house mouse. Behav. Ecol. 19, 223–230 (2008).

26. Hashemian, N., Rajabi-Maham, H. & Edrisi, M. Genetic vs environment influences on house mouse hybrid zone in Iran. J. Genet. Eng. Biotechnol. 15, 483–488 (2017).

27. Flachs, P. et al. Interallelic and intergenic incompatibilities of the Prdm9 (Hst1) gene in mouse hybrid sterility. PLoS Genet. 8, e1003044 (2012).

28. Kono, H. et al. Prdm9 polymorphism unveils mouse evolutionary tracks. DNA Res. 21, 315–326 (2014).

29. Baudat, F. et al. PRDM9 is a major determinant of meiotic recombination hotspots in humans and mice. Science 327, 836–840 (2010).

30. Davies, B. et al. Re-engineering the zinc fingers of PRDM9 reverses hybrid sterility in mice. Nature 530, 171–176 (2016).

31. Murphy, E. D. Characteristic Tumors. in Biology of the Laboratory Mouse (ed. Green, E. L.) (Dover Publications, Inc., 1966).

32. Russell, E. S. & Meier, H. Constitutional Diseases. in Biology of the Laboratory Mouse (ed. Green, E. L.) (Dover Publications, Inc., 1966).

33. Hall, B., Limaye, A. & Kulkarni, A. B. Overview: generation of gene knockout mice. Curr. Protoc. Cell Biol. Chapter 19, Unit 19.12 19.12.1-17 (2009).

34. Ishiwatari, Y. & Bachmanov, A. A. NaCl taste thresholds in 13 inbred mouse strains. Chem. Senses 37, 497–508 (2012).

35. Tordoff, M. G., Downing, A. & Voznesenskaya, A. Macronutrient selection by seven inbred mouse strains and three taste-related knockout strains. Physiol. Behav. 135, 49–54 (2014).

36. Watkins-Chow, D. E. & Pavan, W. J. Genomic copy number and expression variation within the C57BL/6J inbred mouse strain. Genome Res. 18, 60–66 (2008).

37. Shanahan, M. T., Tanabe, H. & Ouellette, A. J. Strain-specific polymorphisms in Paneth cell α-defensins of C57BL/6 mice and evidence of vestigial myeloid α-defensin pseudogenes. Infect. Immun. 79, 459–573 (2011).

38. Cutler, G., Marshall, L. A., Chin, N., Baribault, H. & Kassner, P. D. Significant gene content variation characterizes the genomes of inbred mouse strains. Genome Res. 17, 1743–1754 (2007).

39. Graubert, T. A. et al. A high-resolution map of segmental DNA copy number variation in the mouse genome. PLoS Genet. 3, e3 (2007).

40. Yoshiki, A. & Moriwaki, K. Mouse phenome research: implications of genetic background. ILAR J. 47, 94–102 (2006).

41. Yang, H. et al. Subspecific origin and haplotype diversity in the laboratory mouse. Nat. Genet. 43, 648–655 (2011).

42. Yang, H. et al. A customized and versatile high-density genotyping array for the mouse. Nat. Methods 6, 663–666 (2009).

43. Center for Genome Dynamics - Mouse Diversity Array CEL files. Available at: http://cgd.jax.org/datasets/diversityarray/CELfiles.shtml.

44. International Committee on Standardized Genetic Nomenclature for Mice & Rat Genome and Nomenclature Committee. Guidelines for nomenclature of mouse and rat strains. Mouse Genome Informatics Database (2016). Available at: http://www.informatics.jax.org/nomen/strains.shtml.

45. Welsh, C. E. et al. Status and access to the Collaborative Cross population. Mamm. Genome 23, 706–712 (2012).

46. Threadgill, D. W. & Churchill, G. A. Ten years of the Collaborative Cross. Genetics 190, 291–294 (2012).

47. Affymetrix Power Tools MANUAL: apt-probeset-genotype (1.20.0). Available at: http://www.affymetrix.com/support/developer/powertools/changelog/apt-probeset-genotype.html.

48. Werness, S. & Anderson, D. Genotyping console 4.0 User Manual. Comput. Programs Biomed. 18, 99–108

49. Yang, H., Ding, Y., Hutchins, L. & Szatkiewicz, J. A customized and versatile high-density genotyping array for the mouse. Nat. Methods 6, 663–666 (2009).

50. PennAffy [http://www.openbioinformatics.org/penncnv/penncnv_download.html].

51. Wang, K. et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res. 17, 1665–1674 (2007).

52. Kent Utils [https://github.com/NullModel/kentUtils].

53. Diskin, S. J. et al. Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms. Nucleic Acids Res. 36, e126 (2008).

54. PennCNV. FAQ - PennCNV. (2017). Available at: http://penncnv.openbioinformatics.org/en/latest/misc/faq/.

55. Butler, J. L., Osborne Locke, M. E., Hill, K. & Daley, M. HD-CNV: hotspot detector for copy number variants. Bioinformatics 29, 262–263 (2013).

56. Karolchik, D. et al. The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 32, D493–D496 (2004).

57. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).

58. Yukl, S. A., Kaiser, P., Kim, P., Li, P. & Wong, J. K. Advantages of using the QIAshredder instead of restriction digestion to prepare DNA for droplet digital PCR. Biotechniques 56, 194–196 (2014).

59. Kinsella, R. J. et al. Ensembl BioMarts: a hub for data retrieval across taxonomic space. Database (Oxford). 2011, bar030 (2011).

60. Ensembl! Archive [http://may2012.archive.ensembl.org/index.html].

61. Huang, D. W., Sherman, B. T. & Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4, 44–57 (2009).

62. Huang, D. W., Sherman, B. T. & Lempicki, R. A. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37, 1–13 (2009).

63. Brosch, M. et al. Shotgun proteomics aids discovery of novel protein-coding genes, alternative splicing, and ‘resurrected’ pseudogenes in the mouse genome. Genome Res. 21, 756–767 (2011).

64. IPA®, QIAGEN Redwood City [www.qiagen.com/ingenuity].

65. Gatesy, J. et al. A phylogenetic blueprint for a modern whale. Mol. Phylogenet. Evol. 66, 479–506 (2013).

66. Petkov, P. M. et al. An efficient SNP system for mouse genome scanning and elucidating strain relationships. Genome Res. 14, 1806–1811 (2004).

67. Yalcin, B. et al. Sequence-based characterization of structural variation in the mouse genome. Nature 477, 326–329 (2011).

68. Henrichsen, C. N. et al. Segmental copy number variation shapes tissue transcriptomes. Nat. Genet. 41, 424–429 (2009).

69. Berglund, J. et al. Novel origins of copy number variation in the dog genome. Genome Biol. 13, R73 (2012).

70. Jiang, L. et al. Genome-wide detection of copy number variations using high-density SNP genotyping platforms in Holsteins. BMC Genomics 14, 131 (2013).

71. Hou, Y. et al. Genomic characteristics of cattle copy number variations. BMC Genomics 12, 127 (2011).

72. Wang, J. et al. A genome-wide detection of copy number variations using SNP genotyping arrays in swine. BMC Genomics 13, 273 (2012).

73. Conrad, D. F. et al. Origins and functional impact of copy number variation in the human genome. Nature 464, 704–712 (2010).

74. Iskow, R. C., Gokcumen, O. & Lee, C. Exploring the role of copy number variants in human adaptation. Trends Genet. 28, 245–257 (2012).

75. Redon, R. et al. Global variation in copy number in the human genome. Nature 444, 444–454 (2006).

76. Yalcin, B. et al. The fine-scale architecture of structural variants in 17 mouse genomes. Genome Biol. 13, R18 (2012).

77. Hartmann, F. E. & Croll, D. Distinct trajectories of massive recent gene gains and losses in populations of a microbial eukaryotic pathogen. Mol. Biol. Evol. 34, 2808–2822 (2017).

78. Watkins-Chow, D. & Pavan, W. Genomic copy number and expression variation within the C57BL/6J inbred mouse strain. Genome Res. 18, 60–66 (2008).

79. Conrad, D. F., Andrews, T. D., Carter, N. P., Hurles, M. E. & Pritchard, J. K. A high-resolution survey of deletion polymorphism in the human genome. Nat. Genet. 38, 75–81 (2006).

80. She, X., Cheng, Z., Zöllner, S., Church, D. M. & Eichler, E. E. Mouse segmental duplication and copy number variation. Nat. Genet. 40, 909–914 (2008).

81. Cahan, P., Li, Y., Izumi, M. & Graubert, T. A. The impact of copy number variation on local gene expression in mouse hematopoietic stem and progenitor cells. Nat. Genet. 41, 430–437 (2009).

82. Agam, A. et al. Elusive copy number variation in the mouse genome. PLoS One 5, e12839 (2010).

83. Wong, K. et al. Sequencing and characterization of the FVB/NJ mouse genome. Genome Biol. 13, R72 (2012).

84. Keane, T. M. et al. Mouse genomic variation and its effect on phenotypes and gene regulation. Nature 477, 289–294 (2011).

85. Sellers, R. S., Clifford, C. B., Treuting, P. M. & Brayton, C. Immunological variation between inbred laboratory mouse strains: points to consider in phenotyping genetically immunomodified mice. Vet. Pathol. 49, 32–43 (2012).

86. Restrepo, D., Arellano, J., Oliva, A. M., Schaefer, M. L. & Lin, W. Emerging views on the distinct but related roles of the main and accessory olfactory systems in responsiveness to chemosensory signals in mice. Horm. Behav. 46, 247–256 (2004).

87. Wynn, E. H., Sánchez-Andrade, G., Carss, K. J. & Logan, D. W. Genomic variation in the vomeronasal receptor gene repertoires of inbred mice. BMC Genomics 13, 415 (2012).

88. Perry, G. H. et al. Copy number variation and evolution in humans and chimpanzees. Genome Res. 18, 1698–710 (2008).

89. Axelsson, E. et al. The genomic signature of dog domestication reveals adaptation to a starch-rich diet. Nature 495, 360–364 (2013).

90. Hassemer, E. L. et al. The waved with open eyelids (woe) locus is a hypomorphic mouse mutation in Adam17. Genetics 185, 245–255 (2010).

91. Westerling, T., Kuuluvainen, E. & Mäkelä, T. P. Cdk8 is essential for preimplantation mouse development. Mol. Cell. Biol. 27, 6177–6182 (2007).

92. Fritsch, A. & Loeckermann, S. A hypomorphic mouse model of dystrophic epidermolysis bullosa reveals mechanisms of disease and response to fibroblast therapy. J. Clin. Invest. 118, 1669–1679 (2008).

93. Rubio-Aliaga, I. et al. Dll1 haploinsufficiency in adult mice leads to a complex phenotype affecting metabolic and immunological processes. PLoS One 4, e6054 (2009).

94. Ueda, Y. et al. Roles for Dnmt3b in mammalian development: a mouse model for the ICF syndrome. Development 133, 1183–1192 (2006).

95. Rachdi, L. et al. Dyrk1a haploinsufficiency induces diabetes in mice through decreased pancreatic beta cell mass. Diabetologia 57, 960–969 (2014).

96. Faust, C., Lawson, K. A., Schork, N. J., Thiel, B. & Magnuson, T. The Polycomb-group gene eed is required for normal morphogenetic movements during gastrulation in the mouse embryo. Development 125, 4495–4506 (1998).

97. Faury, G. et al. Developmental adaptation of the mouse cardiovascular system to elastin haploinsufficiency. J. Clin. Invest. 112, 1419–1428 (2003).

98. O’Carroll, D. et al. The Polycomb-group gene Ezh2 is required for early mouse development. Mol. Cell. Biol. 21, 4330–4336 (2001).

99. Mohan, S. & Baylink, D. J. Impaired skeletal growth in mice with haploinsufficiency of IGF-I: genetic evidence that differences in IGF-I expression could contribute to peak bone mineral density differences. J. Endocrinol. 185, 415–420 (2005).

100. Shannon, M. B., Patton, B. L., Harvey, S. J. & Miner, J. H. A hypomorphic mutation in the mouse laminin alpha5 gene causes polycystic kidney disease. J. Am. Soc. Nephrol. 17, 1913–1922 (2006).

101. Ito, M., Yuan, C. X., Okano, H. J., Darnell, R. B. & Roeder, R. G. Involvement of the TRAP220 component of the TRAP/SMCC coactivator complex in embryonic development and thyroid hormone action. Mol. Cell 5, 683–693 (2000).

102. Tudor, M., Murray, P. J., Onufryk, C., Jaenisch, R. & Young, R. A. Ubiquitous expression and embryonic requirement for RNA polymerase II coactivator subunit Srb7 in mice. Genes Dev. 13, 2365–2368 (1999).

103. Ito, M., Okano, H. J., Darnell, R. B. & Roeder, R. G. The TRAP100 component of the TRAP/Mediator complex is essential in broad transcriptional events and development. EMBO J. 21, 3464–3475 (2002).

104. Krebs, P. et al. Lethal mitochondrial cardiomyopathy in a hypomorphic Med30 mouse mutant is ameliorated by ketogenic diet. Proc. Natl. Acad. Sci. U. S. A. 108, 19678–19682 (2011).

105. Braverman, N. et al. A Pex7 hypomorphic mouse model for plasmalogen deficiency affecting the lens and skeleton. Mol. Genet. Metab. 99, 408–416 (2010).

106. Ferretti, E. et al. Hypomorphic mutation of the TALE gene Prep1 (pKnox1) causes a major reduction of Pbx and Meis proteins and a pleiotropic embryonic phenotype. Mol. Cell. Biol. 26, 5650–5662 (2006).

107. Rexhepaj, R. et al. Reduced intestinal and renal amino acid transport in PDK1 hypomorphic mice. FASEB J. 20, 2214–2222 (2006).

108. Wang, D. et al. A mouse model for Glut-1 haploinsufficiency. Hum. Mol. Genet. 15, 1169–1179 (2006).

109. Pasini, D., Bracken, A. P., Jensen, M. R., Lazzerini Denchi, E. & Helin, K. Suz12 is essential for mouse development and for EZH2 histone methyltransferase activity. EMBO J. 23, 4061–4071 (2004).

110. Miró, X. et al. Haploinsufficiency of the murine polycomb gene Suz12 results in diverse malformations of the brain and neural tube. Dis. Model. Mech. 2, 412–418 (2009).

111. Liu, W. et al. Vps35 haploinsufficiency results in degenerative-like deficit in mouse retinal ganglion neurons and impairment of optic nerve injury-induced gliosis. Mol. Brain 7, 10 (2014).

112. Levy, J. E., Jin, O., Fujiwara, Y., Kuo, F. & Andrews, N. C. Transferrin receptor is necessary for development of erythrocytes and the nervous system. Nat. Genet. 21, 396–399 (1999).

113. Franco, B. & Ballabio, A. X-inactivation and human disease: X-linked dominant male-lethal disorders. Curr. Opin. Genet. Dev. 16, 254–259 (2006).

114. Brown, D. et al. Loss of Aif function causes cell death in the mouse embryo, but the temporal progression of patterning is normal. Proc. Natl. Acad. Sci. U. S. A. 103, 9918–9923 (2006).

115. Nakajima, O. et al. Heme deficiency in erythroid lineage causes differentiation arrest and cytoplasmic iron overload. EMBO J. 18, 6282–6289 (1999).

116. Moisan, A. et al. The WTX tumor suppressor regulates mesenchymal progenitor cell fate specification. Dev. Cell 20, 583–596 (2011).

117. Ng, D. et al. Oculofaciocardiodental and Lenz microphthalmia syndromes result from distinct classes of mutations in BCOR. Nat. Genet. 36, 411–416 (2004).

118. Atasoy, D. et al. Deletion of CASK in mice is lethal and impairs synaptic function. Proc. Natl. Acad. Sci. U. S. A. 104, 2525–2530 (2007).

119. Jiang, B. et al. Lack of Cul4b, an E3 ubiquitin ligase component, leads to embryonic lethality and abnormal placental development. PLoS One 7, e37070 (2012).

120. Seo, K. W., Kelley, R. I., Okano, S. & Watanabe, T. Mouse Td ho abnormality results from double point mutations of the emopamil binding protein gene (Ebp). Mamm. Genome 12, 602–605 (2014).

121. Feng, Y. et al. Filamin A (FLNA) is required for cell-cell contact in vascular development and cardiac morphogenesis. Proc. Natl. Acad. Sci. U. S. A. 103, 19836–19841 (2006).

122. Longo, L. et al. Maternally transmitted severe glucose 6-phosphate dehydrogenase deficiency is an embryonic lethal. EMBO J. 21, 4229–4239 (2002).

123. Huq, A. H. M. M., Lovell, R. S., Ou, C.-N., Beaudet, A. L. & Craigen, W. J. X-linked glycerol kinase deficiency in the mouse leads to growth retardation, altered fat metabolism, autonomous glucocorticoid secretion and neonatal death. Hum. Mol. Genet. 6, 1803–1809 (1997).

124. Smahi, A. et al. Genomic rearrangement in NEMO impairs NF-κB activation and is a cause of incontinentia pigmenti. Nature 405, 466–472 (2000).

125. Chen, R. Z., Akbarian, S., Tudor, M. & Jaenisch, R. Deficiency of methyl-CpG binding protein-2 in CNS neurons results in a Rett-like phenotype in mice. Nat. Genet. 27, 327–331 (2001).

126. Rocha, P. P., Scholze, M., Bleiss, W. & Schrewe, H. Med12 is essential for early mouse development and for canonical Wnt and Wnt/PCP signaling. Development 137, 2723–2731 (2010).

127. Buj-Bello, A. et al. The lipid phosphatase myotubularin is essential for skeletal muscle maintenance but not for myogenesis in mice. Proc. Natl. Acad. Sci. U. S. A. 99, 15060–15065 (2002).

128. Liu, X. Y. et al. The gene mutated in bare patches and striated mice encodes a novel 3beta-hydroxysteroid dehydrogenase. Nat. Genet. 22, 182–187 (1999).

129. Ferrante, M. I. et al. Oral-facial-digital type I protein is required for primary cilia formation and left-right axis specification. Nat. Genet. 38, 112–117 (2006).

130. Hara-Chikuma, M. et al. Epidermal-specific defect of GPI anchor in Pig-a null mice results in Harlequin ichthyosis-like features. J. Invest. Dermatol. 123, 464–469 (2004).

131. Liu, W. et al. Deletion of Porcn in mice leads to multiple developmental defects and models human focal dermal hypoplasia (Goltz syndrome). PLoS One 7, e32331 (2012).

132. Biechele, S., Adissu, H. A., Cox, B. J. & Rossant, J. Zygotic Porcn paternal allele deletion in mice to model human focal dermal hypoplasia. PLoS One 8, e79139 (2013).

133. Cutler, G. & Kassner, P. D. Copy number variation in the mouse genome: implications for the mouse as a model organism for human disease. Cytogenet. Genome Res. 123, 297–306 (2008).

134. Quinlan, A. R. et al. Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Res. 20, 623–635 (2010).

135. Boyden, L. M. et al. Skint1, the prototype of a newly identified immunoglobulin superfamily gene cluster, positively selects epidermal γδ T cells. Nat. Genet. 40, 656–662 (2008).

136. Harper, J. M. Wild-derived mouse stocks: An underappreciated tool for aging research. Age 30, 135–145 (2008).

137. Celeste, A. et al. Genomic instability in mice lacking histone H2AX. Science 296, 922–927 (2002).

138. Perera, S. A. et al. Telomere dysfunction promotes genome instability and metastatic potential in a K-ras p53 mouse model of lung cancer. Carcinogenesis 29, 747–753 (2008).

139. Mostoslavsky, R. et al. Genomic instability and aging-like phenotype in the absence of mammalian SIRT6. Cell 124, 315–329 (2006).

140. Clark, A. G., Mackin, K. M. & Foster, M. H. Tracking differential gene expression in MRL/MpJ versus C57BL/6 anergic B cells: molecular markers of autoimmunity. Biomark. Insights 3, 335–350 (2008).

141. Zhao, Q., Han, M.-J., Sun, W. & Zhang, Z. Copy number variations among silkworms. BMC Genomics 15, 251 (2014).

142. Nicholas, T. J. et al. The genomic architecture of segmental duplications and associated copy number variants in dogs. Genome Res. 19, 491–499 (2009).

143. Fontanesi, L. et al. An initial comparative map of copy number variations in the goat (Capra hircus) genome. BMC Genomics 11, 639 (2010).

144. Bae, J. S. et al. Identification of copy number variations and common deletion polymorphisms in cattle. BMC Genomics 11, 232 (2010).

145. Rice, A. M. & McLysaght, A. Dosage sensitivity is a major determinant of human copy number variant pathogenicity. Nat. Commun. 8, 14366 (2017).

146. Nguyen, D. Q. et al. Reduced purifying selection prevails over positive selection in human copy number variant evolution. Genome Res. 18, 1711–1723 (2008).

147. Cutler, G. & Kassner, P. D. Copy number variation in the mouse genome: implications for the mouse as a model organism for human disease. Cytogenet. Genome Res. 123, 297–306 (2008).

148. Asaba, A., Hattori, T., Mogi, K. & Kikusui, T. Sexual attractiveness of male chemicals and vocalizations in mice. Front. Neurosci. 8, 231 (2014).

149. Price, C. J. & Banks, P. B. Food quality and conspicuousness shape improvements in olfactory discrimination by mice. Proc. R. Soc. B Biol. Sci. 284, 20162629 (2017).

150. Vander Wall, S. B. et al. Interspecific variation in the olfactory abilities of granivorous rodents. J. Mammal. 84, 487–496 (2003).

151. Voznessenskaya, V. V. Influence of cat odor on reproductive behavior and physiology in the house mouse (Mus musculus). in Neurobiology of Chemical Communication (ed. Mucignat-Caretta, C.) 389–405 (Boca Raton (FL): CRC Press/Taylor & Francis, 2014).

152. Li, Q. & Liberles, S. D. Aversion and attraction through olfaction. Curr. Biol. 25, R120–R129 (2015).

153. Vassilatis, D. K. et al. The G protein-coupled receptor repertoires of human and mouse. Proc. Natl. Acad. Sci. U. S. A. 100, 4903–4908 (2003).

154. Savic, D. et al. Alterations in TCF7L2 expression define its role as a key regulator of glucose metabolism. Genome Res. 21, 1417–1425 (2011).

155. Zhang, J. et al. A high-resolution multistrain haplotype analysis of laboratory mouse genome reveals three distinctive genetic variation patterns. Genome Res. 15, 241–249 (2005).

156. The Jackson Laboratory. Mouse Strain Datasheet - 000657. Available at: www.jax.org/strain/000657.

157. The Jackson Laboratory. Mouse Strain Datasheet - 000687. Available at: www.jax.org/strain/000687.

3B Visualizing the Distribution of CNVs across a Genome with CNV Landscape Plots

3B.1 Background

An informative way to view CNVs across the genomic landscape is to plot them at chromosomal base pair positions, similar to the plot presented in Figure 1 of Cutler et al. (2007)1. This CNV landscape plot has advantages over Gephi-based visualization of HD-CNV output in that it shows the genomic locations and copy number states of CNVs in each sample with easy recognition of singletons and CNVRs that are specific to a mouse strain or cohort, or that are shared by multiple mouse groups. Here, an example CNV landscape plot of CNVs detected on Chromosome 17 in the mouse cohort comparison dataset is presented alongside the corresponding HD-CNV Gephi image for comparison.

3B.2 Materials and methods

The mouse samples used here are described in section 3.2.1. CNV identification was performed as described in section 3.2.2. Recurrent and singleton CNVs were determined using HD-CNV, described in section 3.2.3. HD-CNV output was converted into a chromosome image using the Fruchterman-Reingold layout in Gephi2. To visualize CNV recurrence across individual chromosomes, a timeline-style plot was generated in R using the geom_curve() function in ggplot2 (v2.1.0).
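The data layout behind such a timeline-style plot can be sketched as follows. The thesis generated its plots in R with ggplot2; this Python sketch only prepares the per-CNV segment coordinates that a plotting layer would consume, and is not the author's script. The record fields, the example values, and the classification of states against a diploid baseline of 2 are assumptions for illustration; the colour names follow the figure captions.

```python
from collections import namedtuple

# Hypothetical record layout; the thesis does not specify its table columns.
Cnv = namedtuple("Cnv", "sample cohort chrom start end copy_number")

# Colour scheme as described in the figure captions.
STATE_COLOURS = {
    "gain": "yellow",
    "state-one-loss": "light blue",
    "state-zero-loss": "dark blue",
}

def cnv_state(copy_number):
    """Classify a copy number relative to an assumed diploid baseline of 2."""
    if copy_number == 0:
        return "state-zero-loss"
    if copy_number == 1:
        return "state-one-loss"
    if copy_number > 2:
        return "gain"
    raise ValueError("copy number 2 is not a CNV under a diploid baseline")

def landscape_rows(cnvs, chrom, cohort_order):
    """Return one plot segment per CNV on `chrom`: x-range in Mb, y row, colour."""
    on_chrom = [c for c in cnvs if c.chrom == chrom]
    # Sort samples by cohort first so each cohort forms a contiguous band of rows.
    samples = sorted({(cohort_order.index(c.cohort), c.sample) for c in on_chrom})
    row_of = {sample: i for i, (_, sample) in enumerate(samples)}
    return [
        {
            "y": row_of[c.sample],
            "x_start_mb": c.start / 1e6,
            "x_end_mb": c.end / 1e6,
            "colour": STATE_COLOURS[cnv_state(c.copy_number)],
        }
        for c in on_chrom
    ]

# Made-up example records, not data from the thesis.
example = [
    Cnv("CL_1", "CL", "chr17", 31_000_000, 31_250_000, 3),
    Cnv("WD_1", "WD", "chr17", 31_050_000, 31_200_000, 1),
    Cnv("WC_1", "WC", "chr17", 6_000_000, 6_100_000, 0),
    Cnv("WC_1", "WC", "chr9", 1_000_000, 1_050_000, 3),
]
rows = landscape_rows(example, "chr17", ["CL", "WD", "WC"])
```

Each returned dictionary corresponds to one horizontal segment: samples occupy rows grouped by cohort, and the x-range gives the CNV position in Mb.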

3B.3 Results

3B.3.1 Visualization of CNV spatial landscape of Chromosome 17

In Figures 3B-1 and 3B-2, several regions with recurrent CNVs, also known as copy number variant regions (CNVRs), can be seen. Region A contains CNVs from all mouse cohorts, as does region C1. Regions C1 and C2 in Figure 3B-1 flank the ATP-binding cassette transporter G1 (Abcg1) and trefoil factor 3 (Tff3) genes, and correspond with region C in Figure 3B-2, which cannot be clearly identified as two separate merge regions at that scale. Region C appears to be a major CNV hotspot in Mus musculus since it is present in many mice from all three cohorts. Region D has CNVs that are recurrent mostly in the CL cohort, although one WD sample is included as well.

[Figure 3B-1 image. Merge regions are labelled A, B, C1, C2, and D.]

Figure 3B-1. Gephi-based visualization of HD-CNV output showing Chromosome 17 CNV merges and singletons for classical laboratory, wild-derived, and wild-caught mouse samples. Each node (circle) represents a CNV, and nodes that are connected by edges (lines) indicate CNVs that overlap reciprocally by at least 40%. Warmer node colours indicate that more CNVs are involved in a merge region, while cooler colours indicate fewer. Merge regions A-D correspond with regions A-D in Figure 3B-2.
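The 40% reciprocal-overlap criterion in the caption above can be illustrated with a minimal sketch. It assumes the common definition in which the shared length must cover at least the threshold fraction of both CNVs, and it groups CNVs into connected components of the resulting graph. That grouping is a simplification chosen for illustration; HD-CNV's own merging rules may differ, and the intervals are made up.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions for intervals a=(start, end), b=(start, end)."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))

def merge_groups(cnvs, threshold=0.4):
    """Group CNV indices linked, directly or via other CNVs, by reciprocal overlap."""
    parent = list(range(len(cnvs)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Add an edge (union) for every pair that meets the threshold.
    for i in range(len(cnvs)):
        for j in range(i + 1, len(cnvs)):
            if reciprocal_overlap(cnvs[i], cnvs[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(cnvs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Illustrative intervals in bp; a group of size one is a singleton.
cnvs = [(100, 200), (120, 220), (150, 400), (1000, 1100)]
groups = merge_groups(cnvs)
```

In this example the first two intervals share 80% of their lengths and are grouped. The third overlaps both of them, but the shared length covers well under 40% of its own length, so it stays a singleton, as does the distant fourth interval.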

[Figure 3B-2 plot. Regions A-D are labelled; y-axis: mouse samples grouped as Classical, Wild Derived, and Wild Caught; x-axis: Genome Position (Mb).]

Figure 3B-2. Distribution of CNV gains and losses across Chromosome 17 for classical laboratory, wild-derived, and wild-caught mouse samples. Regions A-D correspond with merge regions A-D in Figure 3B-1. Different colours represent CNV gains (yellow), state-one-losses (light blue), and state-zero-losses (dark blue).

3B.3.2 CNV spatial landscape analyses for autosomes and Chromosome X

A visual inspection of the autosomal landscape plots (data not shown) reveals 21 CNVRs that are shared by mice in all three cohorts. There are also six regions where CNVs commonly occur in both WD and WC mice. Based solely on visual inspection, no large multi-sample CNVRs are shared exclusively by CL mice and either WD or WC mice. Nine genomic regions were found to have cohort-specific CNVRs, seven of which were unique to the CL cohort and were predominantly gains among the CL samples. The X chromosome (Fig. 3B-3) has over 10 CNVRs between and within the mouse cohorts, with almost all of these regions having CNV gains. CNV losses on the X chromosome are noticeably more common in the WD and WC cohorts than in the CL cohort. CNVRs on the X chromosome are difficult to quantify visually due to the large number of CNVs detected on this chromosome and would be better suited for HD-CNV analysis. Strain-specific CNVRs were observed as well, including a region on Chromosome 9 (Fig. 3B-4) where recurrent gains can be found in the 129-type strains.

The CNV landscape plots also revealed an unusual distribution of CNV occurrence on five autosomes of the PWK x Domesticus F1 sample, where many CNVs were found somewhat uniformly distributed either in regions of an autosome or across the entire autosome. Such a pattern can be observed in other samples within the dataset, for example in the RDS10105 WC sample (Fig. 3B-5A), but only PWK x Domesticus F1 has this pattern on multiple chromosomes. Other CNV distribution patterns can be observed as well, such as clustered gains made up of two or more closely spaced, large CNV gains (Fig. 3B-5B) and paired, small CNV gains and deletions where a CNV gain is located almost immediately next to a CNV loss or vice versa (Fig. 3B-5C).
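The paired gain and loss pattern was identified visually in the thesis. As a hedged illustration of how such pairs could be flagged programmatically, the sketch below scans the CNVs of one sample on one chromosome for consecutive calls of opposite type separated by a small gap. The tuple layout, the example coordinates, and the 10 kb `max_gap` cutoff are hypothetical and not taken from the thesis.

```python
def paired_gain_loss(cnvs, max_gap=10_000):
    """Find consecutive CNVs of opposite type separated by at most `max_gap` bp.

    `cnvs` holds (start, end, kind) tuples for one sample on one chromosome,
    with kind either "gain" or "loss". `max_gap` is a hypothetical cutoff.
    """
    ordered = sorted(cnvs)
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        gap = right[0] - left[1]
        if left[2] != right[2] and 0 <= gap <= max_gap:
            # Label records the order, e.g. "gain-then-loss".
            pairs.append((left[2] + "-then-" + right[2], left, right))
    return pairs

# Made-up coordinates for illustration.
sample = [
    (5_000_000, 5_020_000, "gain"),
    (5_021_000, 5_030_000, "loss"),  # 1 kb after the gain: flagged as a pair
    (9_000_000, 9_010_000, "loss"),
    (9_500_000, 9_520_000, "gain"),  # 490 kb after the loss: too far apart
]
pairs = paired_gain_loss(sample)
```

The label distinguishes a gain closely followed by a loss from a loss closely followed by a gain, mirroring the two asterisk colours used in Figure 3B-5.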

[Figure 3B-3 plot. Y-axis: mouse samples grouped as Classical, Wild Derived, and Wild Caught; x-axis: Genome Position (Mb).]

Figure 3B-3. Distribution of CNV gains and losses across Chromosome X for classical laboratory, wild-derived, and wild-caught mouse samples. Different colours represent CNV gains (yellow), state-one-losses (light blue), and state-zero-losses (dark blue).

[Figure 3B-4 plot. Y-axis: mouse samples grouped as Classical (with the 129-type strains marked), Wild Derived, and Wild Caught; x-axis: Genome Position (Mb).]

Figure 3B-4. Distribution of CNV gains and losses across Chromosome 9 for classical laboratory, wild-derived, and wild-caught mouse samples. Different colours represent CNV gains (yellow), state-one-losses (light blue), and state-zero-losses (dark blue).

Figure 3B-5. Examples of three CNV distribution patterns (A-C) across a mouse chromosome. Different colours represent CNV gains (yellow), state-one-losses (light blue), and state-zero-losses (dark blue). Dark red asterisks indicate a gain closely followed by a loss. Light green asterisks indicate a loss closely followed by a gain.

3B.4 Discussion

When CNVs from the mouse cohort comparison study are plotted on CNV landscape plots, several CNVRs that involve a large number of CNVs are revealed. In total, there are at least 21 of these CNVRs shared among M. musculus samples in all three cohorts, as well as over 15 regions that are specific to two cohorts, one cohort, or a mouse strain. HD-CNV analyses detected far more CNVRs (see section 3.3.2.2), but it is not possible to visually identify all these CNVRs from the CNV landscape plot. In general, the CNV landscape plot is more useful in identifying major mutation hotspots and CNVRs in a large dataset, such as Regions A and D on Chromosome 17. The plot is also highly informative about general CNV distribution patterns across the mouse genome. HD-CNV Gephi images are most helpful in visualizing the number of singletons and recurrent CNVs, but they are not informative about the nature of the CNVs, nor do they allow for visual identification of CNV occurrence patterns along the chromosome landscape. Following CNVR identification in the mouse cohort comparison dataset, the next step would be to examine the genomic context of the CNVRs in order to determine the mechanisms of CNV formation and their impact on phenotype or biological relevance.

Given a sufficient sample size for pattern observation, the CNV landscape plot can be used for data visualization with any genome and for any mutation type, provided that the genomic positions of the mutations are known. Furthermore, the CNV landscape plot can be used in conjunction with HD-CNV output to visualize merged regions and singletons across each chromosome. In all, the CNV landscape plot is a useful tool for the visual identification of mutation hotspots in a genome and of CNVRs relevant to adaptation and evolution. This type of CNV plot may also prove valuable in identifying mutation signatures in disease studies involving unstable genomes, such as cancer, and in mutagen exposure studies, such as exposure to ionizing radiation, which is known to cause increased CNV formation3.

3B.5 References

1. Cutler, G., Marshall, L. A., Chin, N., Baribault, H. & Kassner, P. D. Significant gene content variation characterizes the genomes of inbred mouse strains. Genome Res. 17, 1743–1754 (2007).

2. Bastian, M., Heymann, S. & Jacomy, M. Gephi: an open source software for exploring and manipulating networks. ICWSM (2009).

3. Arlt, M. F., Rajendran, S., Birkeland, S. R., Wilson, T. E. & Glover, T. W. Copy number variants are produced in response to low-dose ionizing radiation in cultured cells. Environ. Mol. Mutagen. 55, 103–13 (2014).

Chapter 4

4 Somatic Mosaicism and de novo CNVs in a C57BL/6J Mouse Family

4.1 Background

The presence of two or more genetically distinct cell populations within an individual is a common phenomenon known as mosaicism, which can impact phenotypes, in some cases causing deleterious effects. Somatic mosaicism, which occurs postzygotically in somatic cells1 via multiple mechanisms, including imperfect DNA replication and exogenous and endogenous mutagen exposure2,3, has been shown to be widespread in normal human tissues4–6. Mosaicism can also occur in the germline, in which case it is referred to as germline mosaicism7. Unlike somatic mutations, mutations in the germline can impact offspring. The proportion of cells with a given de novo (i.e. postzygotic) mutation is largely dependent on when the mutation arose. CNVs in mice were observed to arise as early as the first division following zygote formation, and although CNVs were not found in all preimplantation embryos, there was a general pattern of increasing CNV numbers as preimplantation embryos continued to grow in cell number and develop8. Similarly, chromosomal mosaicism was found in approximately half of human embryos, with the frequency of mosaic embryos increasing as embryos progressed through the developmental stages9. A mutational event that occurs when a two-cell embryo is formed can result in genotypic differences between half of an individual’s cells, if the mutation does not reduce cell fitness, and could have potential phenotypic implications for the affected individual if genes or regulatory regions are impacted10,11. Early-occurring mutations may lead to mosaicism in both the soma and germline if they arise prior to the separation of the soma and germline7, resulting in gonosomal mosaicism. In such a case, there is potential for the phenotype to be impacted in both the affected individual and the individual’s offspring7. Due to its association with different diseases, mosaicism has been commonly studied from a human disease perspective. There are comparatively fewer studies on mosaicism occurring with normal development in healthy individuals.

Mosaicism can lead to several potential consequences depending on several factors including but not limited to the mutation type, location (i.e. genomic content), and proportion of cells affected12,13. With respect to mutation type, aneuploidy mutations are more likely to cause severe phenotypic consequences than smaller genome alterations or point mutations since they affect a larger region of the genome that includes numerous genes and regulatory elements. One of the most commonly known examples of aneuploidy in humans is Trisomy 21 and it can be found in a mosaic form. The severity of the affected person’s symptoms is dependent on the proportion of cells containing the extra Chromosome 21 and if a full third chromosome or partial copy, similar to a CNV gain, is present14,15. Other chromosomal aneuploidies can be lethal when inherited but can produce less severe consequences if a small proportion of the person’s cells are affected, as is the case with trisomy 18 mosaicism16. Smaller genomic alterations like CNVs are also contributors to genome mosaicism between17,18 and within6 normal human tissues and do not always result in negative phenotypes. De novo CNVs are estimated to affect approximately 30% of normal human skin fibroblast cells and between 13-40% of human frontal cortex neurons4,6, making somatic mosaicism a common phenomenon in normal human tissues.

The observation that CNV mosaicism can occur in normal tissues without phenotypic consequences can be explained by several reasons. First, some genes have tissue-specific or time-specific expression and are not needed in the tissue or during the time point when the CNV arises. For example, of 19,628 putative protein coding genes in humans for which there is transcriptome data available in major organs and tissue types, 7,367 genes are expressed in all tissues and 7,835 have tissue-specific or related tissue group-specific expression patterns19. Even if CNVs overlap genic or non-coding elements that are expressed, the elements might not be dosage sensitive (i.e. expression is dependent on copy number), so there may not be any impact on phenotype for gains or partial losses. Individual cells that acquire mutations that disrupt vital cell functions will die or stop dividing, resulting in the loss of these mutant cells from proliferating cell populations20,21. Furthermore, the presence of cancer-causing mutations does not always result in the development of cancer. An example of this was observed in the human eyelid where more than 25% of cells in the normal human eyelid epidermis were found to contain known cancer-causing mutations as a result of UV exposure, yet still appeared to be maintaining normal functions22. Although the reason why the cells did not present malignant phenotypes was not determined in the study, it has been proposed that the cellular microenvironment is an important factor in the development of cancer and that genetic mutations alone, within the context of the eyelid epidermis, are not sufficient to cause cancer23. The microenvironment is an important area of study in cancer research and multiple studies have shown the impact of the microenvironment on both allowing and preventing cancer initiation and progression24,25.

In contrast to mutations that do not impact phenotypes because of their timing or location in a specific cell type or tissue, some mutations are linked to different diseases because they generate different phenotypes depending on the cell or tissue type and the developmental timing of the origin of the mutation. For example, a study sampling brain and skin tissue from individuals with Sturge-Weber syndrome found that a specific nonsynonymous point mutation in the guanine nucleotide-binding protein G(q) subunit alpha (GNAQ) gene is associated with this neurocutaneous disorder and likely occurs early in fetal development26. This same mutation occurring later and in a different cell type, melanocytes, is linked to uveal melanoma risk27.

When attempting to find a genetic link to a phenotype, it is important to account for possible somatic mosaicism. If the mutations are tissue-specific in tissues that are hard to access, such as the brain, then the mutations might not be detected when a sentinel tissue such as blood, saliva or skin is used for disease testing. In one case, a CNV gain present in approximately 20% of neurons was enough to cause brain dysfunction (i.e. hemimegalencephaly)28. This CNV gain was discovered in a post-mortem brain and likely would not have been discovered in the living individual if sentinel tissues were tested instead of the brain tissue. It is possible that many disease-causing mutations are missed even when the appropriate tissue is selected, because of low-level mosaicism that is below the detection threshold of the technology used.

De novo CNV discovery is heavily dependent on the technology used, making the true number of de novo CNVs present in an individual’s tissues and cells difficult to ascertain. Estimates of the minimum detectable level of mosaicism using SNP microarray technology range from <5% to 20% depending on the array resolution and probe types, algorithms used, amount of tissue or number of cells examined, the heterogeneity level of the cell population, and type of mutation being assayed29–32. Alternatively, sequencing strategies offer high depth of coverage and single-nucleotide resolution, allowing for the detection of rare, low-frequency events in a cell population or even individual CNV events in a single cell6,33,34. While there are multiple CNV mosaicism detection approaches available, the most appropriate choice will depend on cost limitations and on the research question to be answered. There are also several additional challenges when studying mosaicism in humans, including but not limited to the low availability of multiple human tissues, limitations to conducting family studies, and the difficulty in controlling for the subject’s environment and genetic background.

For this study, a cost-effective MDGA approach is used to identify somatic mosaicism in multiple healthy tissues from the parents and sons of a C57BL/6J mouse family. The MDGA has not been used previously in a study of somatic mosaicism occurring with normal development in a mouse family. The C57BL/6J strain was selected as it is a highly inbred strain that is commonly used by mouse researchers and it is most compatible with the MDGA, which was built based on the C57BL/6J reference genome35. With a family study, genetic variants can be compared between parents and siblings to help establish when the variants may have arisen. The use of multiple tissues for each individual can help identify tissue-specific mutations, and examining tissues derived from different developmental germ layers can help establish when the variants arose. Overall, this study will establish what a normal CNV profile looks like across the genomes of multiple healthy tissues in a C57BL/6J mouse family.

4.1.1 Research goal, central hypothesis, and specific objectives

Research goal: The purpose of this study is to characterize the CNV landscape in multiple, normal tissues of family members of a commonly used laboratory mouse strain to determine the contribution of CNVs to somatic mosaicism occurring with normal development.

Central hypothesis: CNVs that are inherited in family members or arise spontaneously within an individual mouse during postzygotic development can be detected using MDGA analysis.

The specific objectives of this chapter are:

1. To detect and characterize putative CNVs in both parents and three sons of a single C57BL/6J mouse family across four tissues in the parents and six tissues in the sons.

Predictions: Based on CNV data from eight C57BL/6J tail samples, presented in the broad survey in Chapter 3 and Locke et al.36, there should be an average of 11 CNVs per mouse for the mouse family tail samples. This average increases to 14 if Chromosome X CNVs are included. It is also predicted that there will be more gains than losses since 57% of autosomal CNVs detected in the eight samples were gains. All the Chromosome X CNVs found in the eight C57BL/6J samples were gains and are therefore predicted to be gains in the C57BL/6J mouse family in this analysis. Considering previous findings in C57BL/6J samples, copy number gains of the Ide gene are likely to be found since these mouse samples also came from The Jackson Laboratory.

2. To determine what the level of de novo CNV occurrence is within an individual mouse and to determine if there are tissue-specific CNVs.

Predictions: Most CNVs detected are predicted to be recurrent across different tissues within an individual mouse. Because somatic mosaicism is a common phenomenon, the MDGA is predicted to detect some de novo CNVs, provided that the affected clone is sufficiently large, but it is unknown if tissue-specific CNVs will be detected. Recurrent CNVs are expected to be shared among members of the mouse family due to inheritance.

3. To confirm the presence of select candidate CNVs in the biological samples using ddPCR.

4.2 Materials and methods

4.2.1 Samples

All mouse housing, care and animal use procedures were approved by Western University’s Animal Care Committee (Appendix 3I). One family of C57BL/6J mice, consisting of a sire (identifier number 1), dam (identifier number 2), and three adult sons (numbers 3-5), was used in this study. The mice were euthanized by CO2 asphyxiation according to approved protocols and exsanguinated via cardiac puncture. Tissues were harvested, frozen in liquid nitrogen and stored at -80°C. The tissues chosen for this study were selected based on collectively representing different compositions of cells derived from different embryological germ layers. The selected tissues were hippocampus (ectoderm), lung (endoderm), bladder (mesoderm), and tail (mesoderm and ectoderm) for all five mice, and in addition, kidney (mesoderm), and pancreas (endoderm) from the three sons. A tail sample from a C57BL/6J mouse not related to the family, a tail sample from a CBA/CaJ mouse, and a DBA/2J brain sample (provided by Dr. Shiva Singh), were also included in this study for comparison to the mouse family and were used for ddPCR confirmation of select candidate CNVs.

4.2.2 DNA extraction

The DNA was extracted according to the Wizard® Genomic DNA Purification Kit protocol (Promega, Madison, Wisconsin, USA), with two modifications: 1) tissues were digested by proteinase K for 24-48 hours, and 2) RNase digestion was used for all tissues. DNA quantity and purity were assessed using a NanoDrop 2000c spectrophotometer (Thermo Fisher Scientific, Waltham, Massachusetts, USA). DNA preparation and hybridization to the Mouse Diversity Genotyping Array were performed according to the standard protocol37 at the London Regional Genomics Centre (Robarts Research Institute, London, Ontario, Canada). MDGA hybridization data were output in CEL file format.

4.2.3 Genotyping and CNV detection

Genotyping was performed using Affymetrix Power Tools’ (APT)38 BRLMM-P algorithm, and default parameters as specified by Genotyping Console39, which includes quantile normalization. Following a summarization step using APT, the Log R Ratio (LRR) and B allele frequency (BAF) values were generated using the PennAffy package40. CNV calls were made using PennCNV41 with GC model correction. The GC model file was generated using in-house scripts and KentUtils42, and was based on the mouse reference genome (UCSC:mm10). CNV calls on the X chromosome were generated separately using the --chrX option in PennCNV. The first round of genotyping and CNV calling used a reference set of 313 mice36 to create a canonical genotype clustering file for use with the PennAffy package, as well as the Population Frequency of B Allele (PFB) file used with PennCNV. However, only eight of twenty-six samples had an autosomal LRR standard deviation (LRR SD) below the cutoff of 0.35 (Appendix 4A). The samples passing the LRR SD cutoff were also below the B-allele frequency drift (BAF drift <0.01) and waviness factor (WF <0.05) cutoffs. Confirmation of select CNVs based on this dataset was low (see section 4.3.1), so CNV calling was repeated with modifications, guided by the ddPCR results, to eliminate the false positive calls.
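The three sample-level quality cutoffs described above can be applied as a simple filter. The following is a minimal sketch only, not the pipeline used in this study: the class name, field names, and example values are hypothetical, and comparing the waviness factor by magnitude is an assumption, made because this metric can take negative values.

```python
from dataclasses import dataclass

# Autosomal cutoffs quoted in the text above.
LRR_SD_MAX = 0.35
BAF_DRIFT_MAX = 0.01
WF_MAX = 0.05


@dataclass(frozen=True)
class SampleQC:
    """Per-sample quality metrics (class and field names are hypothetical)."""
    sample_id: str
    lrr_sd: float
    baf_drift: float
    waviness: float


def failed_metrics(sample: SampleQC) -> list[str]:
    """Return the names of the quality metrics that a sample fails."""
    failed = []
    if sample.lrr_sd >= LRR_SD_MAX:
        failed.append("LRR SD")
    if sample.baf_drift >= BAF_DRIFT_MAX:
        failed.append("BAF drift")
    # Assumption: the waviness factor may be negative, so compare its magnitude.
    if abs(sample.waviness) >= WF_MAX:
        failed.append("WF")
    return failed


def passes_qc(sample: SampleQC) -> bool:
    """A sample passes only if it is below all three cutoffs."""
    return not failed_metrics(sample)


# Made-up values for illustration only; these are not data from this study.
EXAMPLE_SAMPLES = [
    SampleQC("example_A", lrr_sd=0.21, baf_drift=0.002, waviness=0.01),
    SampleQC("example_B", lrr_sd=0.48, baf_drift=0.004, waviness=-0.02),
    SampleQC("example_C", lrr_sd=0.30, baf_drift=0.015, waviness=-0.07),
]

for s in EXAMPLE_SAMPLES:
    status = "pass" if passes_qc(s) else "fail: " + ", ".join(failed_metrics(s))
    print(s.sample_id, status)
```

Reporting which metric failed, rather than a single pass/fail flag, makes it easier to see whether a change of reference set improves LRR SD, BAF drift, or both.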

The quality of the calls was improved when using a different reference set (214 classical laboratory mice with more females), and only SNP probes. With those modifications, all samples passed the LRR SD and WF cutoffs but only nine samples were below the BAF drift cutoff (Appendix 4A). All SNP genotype call rates were above 99% for every sample and CNVs that failed to be confirmed by ddPCR did not appear in this dataset. The CNV dataset created using the 313 sample reference file will be referred to as dataset 1 and the CNV dataset created using the 214 sample reference file will be referred to as dataset 2.

4.2.4 DdPCR confirmation

DdPCR was performed using nine TaqMan® Copy Number Assays (Thermo Fisher Scientific, Waltham, Massachusetts, USA), selected based on results from dataset 1: Hoxa1 (Mm00563305_cn), Hoxa2 (Mm00563310_cn), Hoxa3/5 (Mm00736986_cn), Hoxa13 (Mm00563296_cn), Glo1 (Mm00735212_cn), Ide (Mm00496897_cn), Fgfbp3 (Mm00630217_cn), Skint3 (Mm00735949_cn), and Itln1 (Mm00534147_cn). Six additional gene assays were selected to confirm CNVs from dataset 2: Slamf9 (Mm00736858_cn), Lama2 (Mm00307917_cn), Ebf1 (Mm00734275_cn), Loxl2 (Mm00611995_cn), Mum1l1 (Mm00631824_cn), and Map3k7 (Mm00735116_cn). The transferrin receptor gene (Tfrc) was used as the diploid copy number reference and no-template controls were used in all assays. Gene assays were selected to test a mixture of CNV gains, losses, singletons and recurrent CNVs. Control samples with no expected gains or losses, according to the MDGA results, were used for each gene assay in addition to samples for which a CNV was called.

DNA quantity and quality were assessed prior to ddPCR using a NanoDrop 2000c spectrophotometer (Thermo Fisher Scientific, Waltham, Massachusetts, USA), and samples were diluted to approximately 8 ng/μl. The DNA was then fragmented by centrifuging 140 μl of DNA sample at 16,000 × g for 3 min in a QIAshredder column (Qiagen, Venlo, Limburg, Netherlands).

The PCR mixture for a single reaction (25 μl in total) contained 5 μl of DNA template, 5 μl of PCR-grade water, 12.5 μl of the ddPCR™ Supermix for Probes (Bio-Rad, Hercules, California, USA), 1.25 μl of the FAM™ dye-labelled TaqMan® assay for the gene target of interest, and 1.25 μl of the VIC® dye-labelled TaqMan® reference assay. Additional water was used instead of template DNA or gene assay solution for controls. From this mixture, 20 μl was used for droplet generation and PCR. Droplets were generated by a QX200™ droplet generator (Bio-Rad, Hercules, California, USA). PCR was carried out in a C1000 Touch™ thermal cycler (Bio-Rad, Hercules, California, USA) with the following program: 1 cycle at 95°C for 10 min; 45 cycles of denaturation at 95°C for 30 s and annealing and extension at 60°C for 1 min; and enzyme deactivation at 98°C for 10 min. Droplets were read using a QX200™ droplet reader and analyzed with QuantaSoft™ software (Version 1.7.4.0917; Bio-Rad, Hercules, California, USA).
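For readers unfamiliar with how a ddPCR copy number state is derived, the sketch below shows the general Poisson-based calculation for a duplex assay with a diploid reference such as Tfrc. It is an illustration of the standard principle and not a reproduction of the QuantaSoft™ implementation; the function names and droplet counts are hypothetical.

```python
import math


def copies_per_droplet(positive: int, total: int) -> float:
    """Poisson-corrected mean number of template copies per droplet.

    If a fraction p of droplets is positive, the mean occupancy is
    lambda = -ln(1 - p), which accounts for droplets holding several copies.
    """
    if total <= 0 or not 0 <= positive < total:
        raise ValueError("need 0 <= positive < total and total > 0")
    return -math.log(1.0 - positive / total)


def copy_number(target_positive: int, reference_positive: int,
                total_droplets: int, reference_copies: int = 2) -> float:
    """Copy number of the target relative to a reference of known copy number.

    Assumes target (FAM) and reference (VIC) are read in the same droplets,
    so the droplet volume cancels out of the ratio.
    """
    target = copies_per_droplet(target_positive, total_droplets)
    reference = copies_per_droplet(reference_positive, total_droplets)
    if reference == 0.0:
        raise ValueError("reference assay has no positive droplets")
    return reference_copies * target / reference


# Hypothetical droplet counts for illustration only.
print(round(copy_number(3000, 3000, 15000), 2))  # equal signal gives the diploid state
print(round(copy_number(2300, 3000, 15000), 2))  # less target signal gives a state below two
```

Because the result is a ratio of two concentrations, it is continuous rather than an integer, which is why a mixed cell population can yield a value between two integer copy number states.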

4.3 Results

4.3.1 CNVs detected and ddPCR confirmation

From dataset 1, 112 autosomal and 25 Chromosome X CNVs were detected (Appendix 4B). From this dataset, nine genic regions were selected for ddPCR confirmation. The genic regions included targets within the Hoxa cluster, which is predicted to be present as a copy number loss, and targets confirmed as CNVs in Chapter 3 that were also predicted to be CNVs in the mouse family samples. Losses were predicted in a region on Chromosome 6 that spans several Hoxa cluster genes in the pancreas and bladder of mice 4 and 5, with the CNV loss starting at Hoxa2 or Hoxa3, depending on the sample, and ending at Hoxa13. Hoxa1 ddPCR assays consistently showed a state of two in tested samples (Fig. 4-1). Copy number states for Hoxa2 and Hoxa3/5 were between 1.5 and 2, while copy number states for Hoxa13 were generally between 1 and 1.5. There was a trend of decreasing copy number state with increasing gene proximity to Hoxa13, observed for mice 1 and 5 (Fig. 4-1). For three mice (C57BL/6J, DBA/2J, and CBA/CaJ) not related to the C57BL/6J mouse family, the copy number state for Hoxa13 was lower than an expected default state of two and ranged between 1.5 and 1.77.

[Figure 4-1 image: bar chart of ddPCR copy number state (y-axis, 0 to 3) by mouse ID, gene assay, and tissue (x-axis).]

Figure 4-1. DdPCR-based copy number states for Hoxa genes in multiple tissues from a C57BL/6J mouse family and three unrelated mice. Hoxa13 (H13) results are shown in dark green while ddPCR states for Hoxa1 (H1), Hoxa2 (H2), and Hoxa3/5 (H3) are shown in light green. Individual mice are represented by a number if they are members of the C57BL/6J family, where 1, 2 and 3 to 5 represent the sire, dam and three sons, respectively. Letters are used to indicate individuals not related to the mouse family (B for C57BL/6J, C for CBA/CaJ, and D for DBA/2J). The letters above the bars indicate samples for which no MDGA data were available (n), samples for which a copy number state of two is expected based on array findings (t), and samples for which a copy number state of one is expected based on array findings (o). The bars show the average state of two technical replicates, with the exception of bars marked with an asterisk, which represent one assay. Error bars represent standard deviation and were used when the average was calculated from two, three (bold lowercase letter), or four (bold, underlined lowercase letter) separate ddPCR assays for the same gene and sample.

CNV losses that overlapped Glo1 and Itln1 were also detected by the array in several mouse samples, yet all tested samples showed a default copy number state of two (Fig. 4-2). Likewise, a Skint3 gain that was identified in the lung of mouse 3 could not be confirmed with ddPCR. Copy number gains for the Ide and Fgfbp3 genes were confirmed in mouse 2. An Ide gain was also identified by ddPCR in the kidney of mouse 5, although it was not identified as a CNV by the MDGA. Overall, the confirmation rate for CNV gains and losses for non-Hoxa regions was 23.5%. Hoxa gene assay results are excluded from the confirmation rate because the copy number state for most samples is neither one nor two.

For dataset 2, 70 autosomal and 38 Chromosome X CNVs were identified (Appendix 4C). CNVs from dataset 1 that failed to be confirmed by ddPCR were not present in dataset 2. Six CNV gene regions were selected for confirmation. These regions represented a mixture of copy number states and were found to differ between tissues or mouse samples. Four of the selected genes (Ebf1, Lama2, Map3k7, and Slamf9) were predicted as copy number losses but none could be confirmed by ddPCR (Fig. 4-2). Mum1l1 (chr. X) was predicted as a gain in the lung of mouse 3 and kidney of mouse 5. Like the CNV losses, this gain could not be confirmed as a CNV using ddPCR.

A CNV gain was detected on Chromosome 14 in all tissues of every mouse family member. This CNV overlapped Synb and Entpd4 but as no assays were available for these genes, a gene neighbouring this CNV region, Loxl2, was selected for confirmation. Loxl2 could not be confirmed as a gain. In all, no CNV gains or losses from dataset 2 could be confirmed.

[Figure 4-2 image: bar chart of ddPCR copy number state (y-axis, 0 to 5) by gene assay and mouse sample (x-axis), for the Fgfbp3, Ide, Glo1, Itln1, Skint3, Ebf1, Lama2, Loxl2, Map3k7, Slamf9, and Mum1l1 assays.]

Figure 4-2. DdPCR-based copy number states for eleven genes in multiple tissues from mice of a C57BL/6J family. Gene assays were selected based on either the first set of MDGA results (dark blue) or the second set of results (light blue). Individual mice are represented by a number if they are members of the C57BL/6J family, where 1, 2 and 3-5 represent the sire, dam and sons, respectively. Bolded letters represent genes Fgfbp3 (F), Skint3 (S), and Map3k7 (M). The letters above the bars indicate confirmed copy number states that matched predicted MDGA states (a), ddPCR copy number states that did not match the predicted MDGA copy number state (b), and samples for which no MDGA data were available (n). All bars show the average state of two technical replicates for an assay, with the exception of bars marked with an asterisk, which do not have replicates. Error bars represent standard deviation and were used when the average was calculated from two separate ddPCR assays, each with replicates, for the same gene and sample. A bolded lowercase letter indicates that two separate ddPCR assay values were averaged but one assay had two technical replicates and the other did not have replicates.

4.4 Discussion

Among the tail samples, the highest average number of CNVs was for dataset 1 when counting both autosomal and Chromosome X CNVs. This average number of CNVs was six, which is much lower than the predicted average of 14 CNVs per C57BL/6J tail sample36. However, the lowest number of CNVs detected for one of the eight C57BL/6J samples was also six, indicating that an average of six CNVs for the mouse family could be within the normal range. More mouse samples would be needed to firmly establish the normal range of the number of CNVs present in healthy C57BL/6J mice.

Predictions for the proportion of gains to losses were met for dataset 1. It was predicted that the majority (57%) of CNVs detected would be gains for the autosomes and that all Chromosome X CNVs would be gains. For dataset 1, 61% of autosomal CNVs were gains and all Chromosome X CNVs were gains. For dataset 2, 50% of the autosomal CNVs were gains and all Chromosome X CNVs were gains.

Unfortunately, several arrays failed to pass quality control criteria and the CNV confirmation rate was low for both sets of data. The CNVs that were confirmed, like gains of Ide and Fgfbp3, are already known to be widespread in C57BL/6J mice43, so it was anticipated that they might be present in the mouse family. Ide was confirmed to have a copy number of four in the dam (mouse 2). The sire (mouse 1) likely had a copy number state of two for Ide, since one son (mouse 5) was found to have an Ide copy number of three. Further ddPCR experiments would be required to confirm the copy number status of Ide in the other two sons and the sire.

Although there were issues with the quality of the data, an unexpected biological event was discovered. A Hoxa13 loss was detected in dataset 1 and ddPCR assays for this gene consistently returned copy number states between one and two for all samples tested except bladder of mouse 4, which had a copy number state of one. A similar pattern was observed in tissues from three unrelated mice of different backgrounds (C57BL/6J, CBA/CaJ, and DBA/2J). A copy number state between one and two is consistent with the presence of a mixed cell population where some cells have a Hoxa13 loss and some cells have two copies of Hoxa13. A full loss of Hoxa13 is unlikely to be inherited because it can cause embryonic lethality44. Surviving Hoxa13-/- mice have been recovered when the genetic contribution of the C57BL/6J background in the mice is at least 87.5%; however, these mice are infertile, and exhibit hypodactyly, syndactyly, and stiffness when walking45. Mice with a high contribution of C57BL/6J to the genetic background and one functional Hoxa13 copy do not exhibit visible autopod skeletal defects and are capable of producing offspring. In 129/SV strains of mice or in mice with lower background contributions of C57BL/6, varying degrees of hypodactyly and syndactyly of the forelimbs and hindlimbs have been observed with a Hoxa13+/- genotype, with greater disruption of Hoxa13 function being associated with increased phenotype severity44. All of the mice in the C57BL/6J family appeared phenotypically normal, and it is likely that the mice inherited two normal copies of Hoxa13 which, based on array findings, may have been lost in a subpopulation of cells as an example of postzygotic somatic mosaicism.
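The mixed-population interpretation can be made concrete with a simple two-population model. If a fraction f of cells has lost k copies from a normal state of two, the bulk copy number is 2 - k·f, so f = (2 - observed state) / k. The sketch below applies this relationship under stated assumptions: only two cell populations are present, every affected cell carries the same loss, and the ddPCR value is an unbiased bulk average. The function name is hypothetical and the study did not report mosaic fractions.

```python
def mosaic_loss_fraction(observed_cn: float, normal_cn: int = 2,
                         copies_lost: int = 1) -> float:
    """Estimate the fraction of cells carrying a loss from a bulk copy number.

    Model: observed_cn = normal_cn - copies_lost * fraction.
    The result is clipped to the interval [0, 1] to absorb measurement noise.
    """
    if not 0 < copies_lost <= normal_cn:
        raise ValueError("copies_lost must be between 1 and normal_cn")
    fraction = (normal_cn - observed_cn) / copies_lost
    return min(1.0, max(0.0, fraction))


# A bulk state of 1.5 corresponds to half of the cells having a one-copy loss,
# or a quarter of the cells having lost both copies.
print(mosaic_loss_fraction(1.5))                 # 0.5
print(mosaic_loss_fraction(1.5, copies_lost=2))  # 0.25
```

Under this model, a bulk value alone cannot distinguish many cells with a one-copy loss from fewer cells with a two-copy loss; single-cell or breakpoint-level data would be needed to separate the two.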

An emergent hypothesis that is consistent with findings is that Hoxa13 losses may have occurred early in development, after limb development, as a non-random, “programmed” deletion. Programmed genomic deletions, which typically occur in the form of chromatin diminution or elimination, are known to occur in over 100 multicellular animal species from nine taxonomic groups, as well as in single-cell ciliates, and can serve different purposes46. In sea lampreys and the nematode Ascaris suum, ~20% and ~13% of the somatic genome, respectively, is deleted during early development to create distinct genomes and transcriptomes between the germline and soma47,48. Many of the deleted lamprey genes have roles in germline development and pluripotency47. Similarly, 85% of the deleted genes in A. suum are expressed during gametogenesis (65%) or early embryogenesis (20%)48. Sciarid flies undergo three tissue-specific types of chromosome elimination which serve to determine the sex of the embryo and to develop gametes49. Like many eliminated genes, Hoxa13 has functions important for organism development, and has been shown to be required for autopod development44 and placental function50 in mice. It is unknown what biological purpose it would serve for Hoxa13 or neighbouring Hoxa genes to be deleted in somatic cells during early development. An alternative explanation for the Hoxa13 deletions is that there is no programmed modification of Hoxa13, but rather, this region is a mutational hotspot that is susceptible to replication- or transcription-induced deletions. The presence of low copy repeats like segmental duplications and self-chains, for example, can facilitate repeat-induced replication errors leading to CNV formation51. On the other hand, repetitive DNA sequences are also commonly associated with programmed chromosome diminution, as the majority of eliminated DNA sequences are repetitive46.

To determine the mutational mechanism for CNV losses in the Hoxa cluster, the breakpoint junctions of the CNVs need to be identified and sequenced. Given that all Hoxa1 ddPCR assays indicated a copy number state of two, it can be assumed that one breakpoint junction occurs between Hoxa1 and Hoxa13. For some samples, Hoxa2 and Hoxa3 had lower copy number states than Hoxa1 but higher than Hoxa13 which may suggest the presence of multiple deletion initiation or end points within the Hoxa cluster region.

Initial Hoxa13 ddPCR results for mice not related to the C57BL/6J family suggest that the occurrence of Hoxa13 losses might be a widespread phenomenon in M. musculus. Further testing of multiple tissues from individuals of different strains and subspecies will need to be conducted to determine how prevalent Hoxa13 losses are in mice. Detailed analysis of different tissues can also assist in delineating the cell lineages affected and the developmental timeline. Screening mouse embryos at different developmental stages for Hoxa13 losses would help establish if Hoxa13 losses occur early in development and are present as a minority cell population in all germ layer lineages or if the deletions are more likely to independently arise numerous times, in multiple cell populations. Early Hoxa13 losses might not result in a deleterious phenotype if not enough cells are affected for gene expression levels to meet a phenotypic threshold. No study to date has observed intra-tissue losses of this gene in multiple tissues in all members of a mouse family. Future work could determine the mutational mechanism and purpose of Hoxa13 deletions and what proportion of the mouse genome regularly undergoes such alterations.

4.5 Conclusion

Ultimately, this study was unable to examine somatic mosaicism in a mouse family due to technical issues that may have occurred during the DNA preparation or hybridization process. To gather the data again will require DNA from the mouse family members to be hybridized to a new set of MDGA arrays. Alternatively, a different technology, such as sequencing, could be used to identify somatic mosaicism in mice. Since this study was completed, there has been significant progress in the methods used to study how mutations arise in cell lineages of mice. By using CRISPR technology to insert known DNA sequences called “barcodes” into mouse embryos, deletion and insertion mutations can be tracked through cell lineages and developmental germ layers of a whole mouse52. In the future, the mouse family study could be expanded upon by using methods with higher resolution and sensitivity, and including more families, a greater variety of tissues and subpopulations of cells, and more mouse strains or subspecies.

4.6 References

1. Freed, D., Stevens, E. L. & Pevsner, J. Somatic mosaicism in the human genome. Genes 5, 1064–1094 (2014).

2. Mazouzi, A., Velimezi, G. & Loizou, J. I. DNA replication stress: causes, resolution and disease. Exp. Cell Res. 329, 85–93 (2014).

3. Ignatov, A. V., Bondarenko, K. A. & Makarova, A. V. Non-bulky lesions in human DNA: the ways of formation, repair, and replication. Acta Naturae 9, 12–26 (2017).

4. Abyzov, A. et al. Somatic copy number mosaicism in human skin revealed by induced pluripotent stem cells. Nature 492, 438–442 (2012).

5. Vattathil, S. & Scheet, P. Extensive hidden genomic mosaicism revealed in normal tissue. Am. J. Hum. Genet. 98, 571–578 (2016).

6. McConnell, M. J. et al. Mosaic copy number variation in human neurons. Science 342, 632–637 (2013).

7. Samuels, M. E. & Friedman, J. M. Genetic mosaics and the germ line lineage. Genes (Basel). 6, 216–237 (2015).

8. Guo, F. et al. Single-cell multi-omics sequencing of mouse early embryos and embryonic stem cells. Cell Res. 27, 967–988 (2017).

9. Bielanska, M., Tan, S. L. & Ao, A. Chromosomal mosaicism throughout human preimplantation development in vitro: incidence, type, and relevance to embryo outcome. Hum. Reprod. 17, 413–419 (2002).

10. Campbell, I. M., Shaw, C. A., Stankiewicz, P. & Lupski, J. R. Somatic mosaicism: implications for disease and transmission genetics. Trends Genet. 31, 382–392 (2015).

11. De, S. Somatic mosaicism in healthy human tissues. Trends Genet. 27, 217–223 (2011).

12. Kosztolányi, G. It is time to take timing seriously in clinical genetics. Eur. J. Hum. Genet. 23, 1435–1437 (2015).

13. Forsberg, L. A., Gisselsson, D. & Dumanski, J. P. Mosaicism in health and disease — clones picking up speed. Nat. Rev. Genet. 18, 128–142 (2016).

14. Papavassiliou, P. et al. The phenotype of persons having mosaicism for trisomy 21/Down syndrome reflects the percentage of trisomic cells present in different tissues. Am. J. Med. Genet. Part A 149, 573–583 (2009).


15. Pelleri, M. C. et al. Systematic reanalysis of partial trisomy 21 cases with or without Down syndrome suggests a small region on 21q22.13 as critical to the phenotype. Hum. Mol. Genet. 25, 2525–2538 (2016).

16. Tucker, M. E., Garringer, H. J. & Weaver, D. D. Phenotypic spectrum of mosaic trisomy 18: two new patients, a literature review, and counseling issues. Am. J. Med. Genet. A 143A, 505–517 (2007).

17. O’Huallachain, M., Karczewski, K. J., Weissman, S. M., Urban, A. E. & Snyder, M. P. Extensive genetic variation in somatic human tissues. Proc. Natl. Acad. Sci. 109, 18018–18023 (2012).

18. Piotrowski, A. et al. Somatic mosaicism for copy number variation in differentiated human tissues. Hum. Mutat. 29, 1118–1124 (2008).

19. Thul, P. J. et al. A subcellular map of the human proteome. Science 356, eaal3321 (2017).

20. Gatti, M. & Baker, B. S. Genes controlling essential cell-cycle functions in Drosophila melanogaster. Genes Dev. 3, 438–453 (1989).

21. Bolton, H. et al. Mouse model of chromosome mosaicism reveals lineage-specific depletion of aneuploid cells and normal developmental potential. Obstet. Gynecol. Surv. 71, 665–666 (2016).

22. Martincorena, I. et al. Tumor evolution. High burden and pervasive positive selection of somatic mutations in normal human skin. Science 348, 880–886 (2015).

23. Bissell, M. J. & Labarge, M. A. Context, tissue plasticity, and cancer: are tumor stem cells also regulated by the microenvironment? Cancer Cell 7, 17–23 (2005).

24. Quail, D. F. & Joyce, J. A. Microenvironmental regulation of tumor progression and metastasis. Nat. Med. 19, 1423–1437 (2013).

25. Bissell, M. J. & Hines, W. C. Why don’t we get more cancer? A proposed role of the microenvironment in restraining cancer progression. Nat. Med. 17, 320–329 (2011).

26. Shirley, M. D. et al. Sturge-Weber syndrome and port-wine stains caused by somatic mutation in GNAQ. N. Engl. J. Med. 368, 1971–1979 (2013).

27. Van Raamsdonk, C. D. et al. Frequent somatic mutations of GNAQ in uveal melanoma and blue naevi. Nature 457, 599–602 (2009).

28. Cai, X. et al. Single-Cell, Genome-wide Sequencing Identifies Clonal Somatic Copy-Number Variation in the Human Brain. Cell Rep. 8, 1280–1289 (2014).


29. Conlin, L. K. et al. Mechanisms of mosaicism, chimerism and uniparental disomy identified by single nucleotide polymorphism array analysis. Hum. Mol. Genet. 19, 1263–1275 (2010).

30. Cross, J., Peters, G., Wu, Z., Brohede, J. & Hannan, G. N. Resolution of trisomic mosaicism in prenatal diagnosis: Estimated performance of a 50K SNP microarray. Prenat. Diagn. 27, 1197–1204 (2007).

31. Yamamoto, G. et al. Highly sensitive method for genomewide detection of allelic composition in nonpaired, primary tumor specimens by use of affymetrix single-nucleotide-polymorphism genotyping microarrays. Am. J. Hum. Genet. 81, 114–126 (2007).

32. Mason-Suares, H. et al. Density matters: comparison of array platforms for detection of copy-number variation and copy-neutral abnormalities. Genet. Med. 15, 706–712 (2013).

33. Wang, J., Fan, H. C., Behr, B. & Quake, S. R. Genome-wide single-cell analysis of recombination activity and de novo mutation rates in human sperm. Cell 150, 402–412 (2012).

34. Hou, Y. et al. Genome analyses of single human oocytes. Cell 155, 1492–1506 (2013).

35. Yang, H., Ding, Y., Hutchins, L. & Szatkiewicz, J. A customized and versatile high-density genotyping array for the mouse. Nat. Methods 6, 663–666 (2009).

36. Locke, M. E. O. et al. Genomic copy number variation in Mus musculus. BMC Genomics 16, 497 (2015).

37. Thermo Fisher Scientific. Genome-Wide Human SNP Nsp/Sty 6.0 User Guide. (2008).

38. Affymetrix Power Tools MANUAL: apt-probeset-genotype (1.20.0). Available at: http://www.affymetrix.com/support/developer/powertools/changelog/apt-probeset-genotype.html.

39. Affymetrix. Genotyping console 4.0 user manual. (2009).

40. PennAffy [http://www.openbioinformatics.org/penncnv/penncnv_download.html].

41. Wang, K. et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res. 17, 1665–1674 (2007).

42. Kent Utils [https://github.com/NullModel/kentUtils].

43. Watkins-Chow, D. E. & Pavan, W. J. Genomic copy number and expression variation within the C57BL/6J inbred mouse strain. Genome Res. 18, 60–66 (2008).

44. Fromental-Ramain, C. et al. Hoxa-13 and Hoxd-13 play a crucial role in the patterning of the limb autopod. Development 122, 2997–3011 (1996).

45. Perez, W. D. et al. Survival of Hoxa13 homozygous mutants reveals a novel role in digit patterning and appendicular skeletal development. Dev. Dyn. 239, 446–457 (2010).

46. Wang, J. & Davis, R. E. Programmed DNA elimination in multicellular organisms. Curr. Opin. Genet. Dev. 27, 26–34 (2014).

47. Bryant, S. A., Herdy, J. R., Amemiya, C. T. & Smith, J. J. Characterization of somatically-eliminated genes during development of the sea lamprey (Petromyzon marinus). Mol. Biol. Evol. 33, 2337–2344 (2016).

48. Wang, J. et al. Silencing of germline-expressed genes by DNA elimination in somatic cells. Dev. Cell 23, 1072–1080 (2012).

49. Goday, C. & Rosario Esteban, M. Chromosome elimination in sciarid flies. BioEssays 23, 242–250 (2001).

50. Shaut, C. A. E., Keene, D. R., Sorensen, L. K., Li, D. Y. & Stadler, H. S. HOXA13 Is essential for placental vascular patterning and labyrinth endothelial specification. PLoS Genet. 4, e1000073 (2008).

51. Chen, L., Zhou, W., Zhang, L. & Zhang, F. Genome architecture and its roles in human copy number variation. Genomics Inform. 12, 136 (2014).

52. Kalhor, R. et al. Developmental barcoding of whole mouse via homing CRISPR. Science 361, eaat9804 (2018).


Chapter 5

5 Characterization of the CNV Landscape in a Mouse Model of Breast Cancer with Lung Metastasis in the Presence and Absence of Rhamm

5.1 Background

Although the MDGA was designed to detect genetic variation in normal mouse tissues1, it may prove to be a useful tool in detecting genetic variation associated with cancer. Currently, its application in detecting structural variation in tissues with genomic instability, like cancer tissues, has been limited2. Conversely, numerous human SNP microarray-based studies have been conducted to study genomic structural alterations, like copy number alterations and aneuploidy, in cancer3–6. As with human studies, the use of SNP-based microarrays in mouse models of cancer may help expand our understanding of the role of genetic alterations in cancer. Copy number alterations disrupt normal cell function by increasing, decreasing, or preventing the expression of dosage-sensitive genes, thereby contributing to various cancer characteristics like initiation timing and location, tumour cell metabolism, tumour growth, ability to metastasize, and response to treatment7–12.

Numerous mouse models of human cancer have been developed to help fill in gaps in human studies, and the design of mouse models of cancer has been constantly improving to help increase the predictive value when compared to human-based studies13. Mouse models of human breast cancer are among the most common animal cancer models and can be generated in several ways, the earliest being generated through a transgenic approach where foreign, oncogenic DNA is integrated into the genome14,15. One example of an oncogenic transgene is a mouse mammary tumour virus (MMTV) promoter paired with a polyoma middle T-antigen (PyMT) gene, which is used to model luminal breast cancer with lung metastasis16. MMTV is active in estrogen-sensitive tissues such as the mammary epithelium where it drives the expression of the oncogenic PyMT which simulates growth signaling pathways, resulting in unregulated cell division and the development of mammary gland primary tumours and secondary tumours in the lungs16–18. Other cancer models can be generated through the use of various combinations of promoters and oncogenes. The whey acidic protein (WAP) promoter, for example, drives gene expression in the mouse mammary gland epithelium like MMTV, but unlike MMTV, which is activated during puberty16, WAP is responsive during pregnancy under the influence of multiple lactogenic hormones and continues after weaning19. WAP can be used to create a model of breast cancer when paired with the simian virus 40 (SV40) T/t antigen, a protein which inactivates the tumour suppressors p53 and Rb20. Ectopic expression of the SV40 T/t antigen has been used to create over 20 different transgenic cancer models affecting the bladder, liver, retina, skin, intestines, stomach, and mammary glands20. Mouse cancer models can also be created in several other ways that do not require the introduction of a transgene, including but not limited to mutagen or carcinogen exposure, knockouts of tumour suppressor genes, and the use of human tumour xenografts15,21.

Cancer cells routinely exhibit several common properties termed “hallmarks”. Three of these hallmarks are sustained proliferation, genomic instability, and escape from apoptosis22. It was proposed by Macheret and Halazonetis (2015)22 that DNA replication stress should be included as another hallmark of cancer. DNA replication stress is connected to the previous three hallmarks since sustained cellular proliferation induces DNA replication stress, which in turn can lead to genomic instability, such as the formation of CNVs, and some of these de novo mutations may alter normal cell phenotypes, allowing them to escape from apoptosis22. DNA replication stress in yeast has been observed to induce high rates of both large (e.g. aneuploidy and CNVs) and small genomic alterations (point mutations and small insertions/deletions), with tandemly repeated genes like ribosomal RNA genes being particularly susceptible to deletions23. The observed genomic alterations preferentially occurred in regions with slow-moving replication forks, with some alterations conferring a selective growth advantage to cells23. Cancer-associated copy number alterations that confer an advantage for growth are important for the development of some cancer types while passenger mutations can help elucidate the mechanisms behind genomic instability22.


The roles of genomic alterations in certain cancers are thought to be important to tumorigenesis and cancer progression since multiple recurrent and syntenic copy number alterations have been found between a mouse model of T-cell lymphoma and multiple human cancers, suggesting the existence of genetic modifications and biological pathways that are common between cancers across and within species24. This finding is further supported by Rennhack et al. (2018), who found that copy number alterations are very common in 27 major models of breast cancer and that, although there is a high degree of heterogeneity, there are many conserved gene copy number alterations within and between mouse and human tumors that are associated with tumour progression and secondary tumour characteristics like histological appearance, metastatic potential, and oncogenic signalling25. It was also noted that not all mouse models had equal amounts of genomic instability, which differed based on the cancer driver, promoter, and mouse strain used25.

Copy number alteration differences are known to exist between primary and secondary tumours26–28. For some cancers, there are greater mutation constraints in the primary tumours, possibly due to microenvironment effects or requirements to maintain specific functional pathways29. Depending on how early or late in primary tumour development metastasis-capable cells develop and metastasize, and on how quickly new subpopulations arise in the primary and metastatic tumours, more or fewer genetic similarities may be detected between the two tumour types30. If only one subpopulation of a primary tumour is capable of metastasizing, the initial metastatic tumour cell population will have less genetic diversity than the primary tumour31. The level of heterogeneity in a metastatic tumour may increase as the tumour continues to grow and acquire mutations, leading to increased genetic distance between the primary and metastatic tumours31. In a study of human colorectal cancer with liver metastasis, the discordance in average copy number between the primary tumour and metastases was 22%, leading the authors to propose that, due to substantial genomic differences, different treatment strategies may be required for primary tumors and metastasis tissue26.

In this study, the MDGA is used to assess the CNV landscape of mammary gland primary tumours and lung with metastasis in an MMTV-PyMT mouse model of luminal breast cancer, in the presence and absence of the receptor for hyaluronan-mediated motility (Rhamm) gene. RHAMM is a receptor for hyaluronan (HA) and has intracellular and extracellular functions relating to normal cellular function and cancer. Inside the cell, one important function of RHAMM is to localize to the centromeres during mitosis and maintain mitotic spindle integrity32. RHAMM normally resides intracellularly but it can be exported to the cell surface to bind HA in response to wounding and cytokine signalling33. Extracellularly, RHAMM can promote cell motility and invasion via association with CD44, the main HA receptor, and subsequent activation of MEK1 and ERK1/2 kinases34,35. Elevated RHAMM expression is associated with poor clinical outcome in the majority of breast cancers36 and high levels of HA binding to RHAMM are associated with increased breast cancer invasiveness and lung metastases, but lower proliferation37.

A lack of Rhamm expression in cells causes multi-pole mitotic spindles, aberrant chromosome segregation, and inappropriate cytokinesis during mitosis38. Crossing Rhamm knockout mice to the MMTV-PyMT mouse model of luminal breast cancer leads to a significant increase in the number of metastasis nodules in the lung compared to wild-type MMTV-PyMT mice (unpublished). The increased metastases observed in Rhamm-/- mice suggest that elevated genomic instability in the absence of Rhamm expression is a driver for tumour progression. The association between loss of Rhamm expression and genomic instability makes this cancer model suitable for studying genome structural alterations.

The MDGA has been used previously to study copy number alterations in a mouse model of breast cancer. Standfuss et al.2 found that there was an increase in copy number alterations in primary tumours in transgenic mice that developed breast cancer in comparison to mammary gland tissue from two non-transgenic mice, of the same strain, that did not develop cancer. Transgenic mammary gland tissue that was collected prior to tumour development showed fewer copy number alterations than tumour samples but more alterations than normal tissue.

There are several differences between the Standfuss study and this study. Standfuss et al. used a WAP-SVT/t-driven cancer model with outbred mouse strains and included WAP-SVT/t breast cancer derived cell lines, while this study used an MMTV-PyMT cancer model in inbred mouse strains with Rhamm-/- and wild-type genotypes. The results of the Standfuss study and this study are not expected to be very similar since cancer models using different oncogenic drivers have been observed to have different CNV profiles, and even when the same cancer driver is used in a mouse model of breast cancer, the use of different background strains can produce copy number differences25. Similarities between the two studies include the use of a small sample set of six mice and the absence of adjacent normal tissue for comparison to cancer tissue. However, the Standfuss study assessed copy number alterations at different stages of tumorigenesis up to the development of primary tumours, while our study assesses genetic differences occurring with different Rhamm genotypes and between primary tumours and metastasis in the same animal. The study in this chapter advances beyond Standfuss et al. by being the first to use both SNP probes and IGPs for CNV detection with the MDGA, which increases the MDGA resolution and genic representation. The increased resolution, as well as the use of filtered probe lists, is expected to allow for the more reliable detection of CNVs, particularly of genic CNVs, as well as help in the detection of smaller CNVs that can be missed at a low resolution.

5.1.1 Research goal, central hypothesis, and specific objectives

Research goal: The purpose of this study is 1) to assess the utility of the MDGA in the context of tumorigenesis and metastasis, specifically in an MMTV-PyMT mouse model of breast cancer using wild-type and Rhamm-/- mice, and 2) to analyze CNV differences between Rhamm genotypes and between cancer tissue types (primary tumour or tissue with metastases).

Central hypothesis: Given that genomic instability is a hallmark of cancer, primary tumours and lung tissue with metastases will have more CNVs than normal tissues. If metastatic capabilities are specific to a subset of primary tumour cells with a given genotype, then the primary tumor will have more CNV diversity than metastatic tumours and there will be more inter-animal variability in CNV profiles of primary tumours versus metastatic tumours. Given the phenotype of increased metastases in Rhamm-/- mice, primary tumour and metastatic tumour samples will have Rhamm genotype-specific CNV profiles which may include increased genomic instability.


The specific objectives are:

1. To identify CNV differences between primary tumours and lung with metastasis in wild-type and Rhamm-/- MMTV-PyMT mouse samples.

2. To confirm the presence of select candidate CNVs in the biological samples using ddPCR.

5.2 Materials and methods

5.2.1 Samples

Mouse MDGA CEL files and DNA samples were provided by our collaborator, Dr. Eva Turley. All six mice used in this study are female, carry an MMTV-PyMT transgene, and are of a mixed FVB/N and C57BL/6 background (E. Turley, personal communication). Three of the mice have a Rhamm-/- genotype and are represented by numeric identifiers: 10.4, 45.2, and 63.1. The three wild-type mice are represented by the numeric identifiers 36.1, 36.2, and 76.3. All mice developed mammary gland primary tumours that metastasized to the lungs. For each mouse, DNA was isolated from a primary mammary gland tumour and from normal lung tissue that contained metastatic tumours. In total, there are four sample groups in this study: wild-type primary tumour (WP), wild-type lung with metastasis (WM), Rhamm-/- primary tumour (RP), and Rhamm-/- lung with metastasis (RM). The metastasis samples also contain surrounding normal tissue since it was not feasible to extract only the individual metastasis nodules.

5.2.2 MDGA hybridization

Mouse DNA was prepared by C. Tolg. MDGA hybridization was performed at the London Regional Genomics Centre (Robarts Research Institute, London, ON) according to instructions in the Affymetrix® Genome-Wide Human SNP Nsp/Sty 6.0 manual39. The resulting CEL files were then used for SNP genotyping and CNV identification.

5.2.3 SNP genotyping and CNV identification

M.E.O. Locke generated SNP genotypes and CNV calls. Genotyping used the BRLMM-P algorithm implemented in Affymetrix® Power Tools40, with default clustering parameters as specified by Genotyping Console, and included median normalization and artifact reduction, as recommended for cancer samples by the Affymetrix® Genotyping Console 4.0 User Manual41. Analysis was limited to a stringent probe list42. A canonical genotype clustering file42 was used to calculate Log R Ratio (LRR) and B allele frequency (BAF) values using the PennAffy package43. A GC model file, containing the percent GC content of the 1 Mb region surrounding each marker (or the genome-wide average of 42% if this could not be calculated), was generated using KentUtils44 and an in-house script based on the reference genome (UCSC:mm9). CNVs were detected with PennCNV using default parameters, a Population Frequency of B Allele (PFB) file based on a collection of reference strains42, and GC model correction45. Calls were not filtered based on length or number of markers, as all passed the minimum criteria of at least a 500 bp length and more than 10 markers underlying the call. Samples were not excluded based on the standard deviation of the LRR values or B allele frequency drift. The CNV calls are available in Appendix 5A.
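The minimum call criteria stated above (a length of at least 500 bp and more than 10 underlying markers) can be expressed as a simple check. The following is a minimal sketch in Python, not the pipeline used in this study; the `CnvCall` record, its field names, the 1-based inclusive coordinate convention, and the example calls are all hypothetical.

```python
from dataclasses import dataclass

MIN_LENGTH_BP = 500  # "at least a 500 bp length"
MIN_MARKERS = 10     # "more than 10 markers", so the comparison is strict


@dataclass(frozen=True)
class CnvCall:
    """Hypothetical record for one CNV call (not a PennCNV data structure)."""
    chrom: str
    start: int        # assumed 1-based, inclusive
    end: int          # assumed 1-based, inclusive
    n_markers: int    # SNP and IGP markers underlying the call
    copy_number: int

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def passes_minimum_criteria(call: CnvCall) -> bool:
    """True if the call is at least 500 bp long and has more than 10 markers."""
    return call.length >= MIN_LENGTH_BP and call.n_markers > MIN_MARKERS


# Made-up calls for illustration only.
calls = [
    CnvCall("chr7", 1_000, 1_499, 11, 1),   # 500 bp, 11 markers: passes
    CnvCall("chr7", 5_000, 5_300, 25, 3),   # 301 bp: fails on length
    CnvCall("chr2", 9_000, 12_000, 10, 1),  # 10 markers: fails on marker count
]
kept = [c for c in calls if passes_minimum_criteria(c)]
```

In this study every call met both criteria, so a check of this kind would leave the call set unchanged.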

5.2.4 Select genic CNV confirmation by droplet digital PCR (ddPCR)

A recurrent CNV region located on Chromosome 7 was selected for confirmation by ddPCR in all 12 samples. Two TaqMan® Copy Number Assays (Thermo Fisher Scientific, Waltham, Massachusetts, USA) for the Ilk/Rrp8 (Mm00216640_cn) and Taf10 (Mm00232271_cn) genes, which overlap the CNV region, were used for confirmation. Additionally, a Rhamm TaqMan® Copy Number Assay (Mm00344889_cn) was used to confirm the copy number state of Rhamm in each mouse sample. The transferrin receptor gene (Tfrc) was used as the diploid copy number reference for the Ilk, Taf10 and Rhamm assays. No-template controls were used in all assays and there were two technical replicates for each assay.

DNA quantity and quality were assessed prior to ddPCR, using a NanoDrop 2000c spectrophotometer (Thermo Fisher Scientific, Waltham, Massachusetts, USA), and diluted to approximately 8 ng/μl. Subsequently, the DNA was fragmented by centrifuging 140 μl of DNA sample at 16,000xg for 3 min in a QIAshredder column (Qiagen, Venlo, Limburg, Netherlands).


Each PCR reaction, with the exception of negative controls, contained 5 μl of DNA template, 5 μl of PCR-grade water, 12.5 μl of the ddPCR™ Supermix for Probes (Bio-Rad, Hercules, California, USA), 1.25 μl of the FAM™ dye-labelled TaqMan® assay for the gene target of interest, and 1.25 μl of the VIC® dye-labelled TaqMan® reference assay, for a total volume of 25 μl. Of this mixture, 20 μl was used for droplet generation and PCR. Droplets were generated by a QX200™ droplet generator (Bio-Rad, Hercules, California, USA). PCR was carried out in a C1000 Touch™ thermal cycler (Bio-Rad, Hercules, California, USA) with the following program: 1 cycle at 95 °C for 10 min; 45 cycles of denaturation at 95 °C for 30 s followed by annealing and extension at 60 °C for 1 min; and enzyme deactivation at 98 °C for 10 min. Droplets were read using a QX200™ droplet reader and analyzed with QuantaSoft™ software (Version 1.7.4.0917; Bio-Rad, Hercules, California, USA).
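For readers unfamiliar with how droplet counts translate into a copy number state, the sketch below shows the standard Poisson-based estimate relative to a diploid reference. It is an illustration only: the calculation performed internally by QuantaSoft™ is not described in this chapter, the function names are hypothetical, and the droplet counts are invented. The sketch also assumes the target (FAM™) and reference (VIC®) assays are read from the same set of droplets, so the droplet volume cancels in the ratio.

```python
import math


def copies_per_droplet(positive: int, total: int) -> float:
    """Poisson-corrected mean number of template copies per droplet.

    If a fraction p of droplets is positive, the mean occupancy is -ln(1 - p).
    """
    if not 0 <= positive < total:
        raise ValueError("need 0 <= positive < total droplets")
    return -math.log(1.0 - positive / total)


def copy_number(target_positive: int, reference_positive: int,
                total_droplets: int, reference_copies: int = 2) -> float:
    """Estimate target copy number relative to a reference of known copy number.

    reference_copies defaults to 2 because the reference gene is assumed diploid.
    """
    target = copies_per_droplet(target_positive, total_droplets)
    reference = copies_per_droplet(reference_positive, total_droplets)
    return reference_copies * target / reference


# Invented droplet counts, for illustration only.
balanced = copy_number(3000, 3000, 15000)   # target as abundant as reference
reduced = copy_number(1500, 3000, 15000)    # fewer target-positive droplets
```

A non-integer estimate between one and two, as in the second call, is the kind of result that would be expected from a mixed cell population in which only some cells carry a deletion.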

5.2.5 SNP and CNV phenogram construction

SNP distance was calculated by totaling the number of loci where pairs of samples did not share the same genotype call and dividing by the total number of SNP loci. Loci where both mice had a No Call genotype were not counted as a difference. For pairwise CNV distance calculations, SNP and IGP markers were assigned the copy number state (0, 1, 2, or 3+) that they were called as, which served as their “genotype”. The total number of CNV genotype differences between pairs of samples was divided by the total number of SNP and IGP loci to obtain CNV genetic distance values. SNP and CNV distance matrices are available in Appendix 5B. The genetic distance matrices were used to construct a phenogram file using the BIONJ function of the APE package (version 3.3) for R (version 3.2.2), which implements the algorithm described by Gascuel et al46. The phenogram file was saved in Newick format and uploaded to Figtree (v1.4.2) to generate coloured phenogram images.
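The pairwise distance described above is the proportion of loci at which two samples carry different calls. A minimal Python sketch follows; it is not the script used in this study, and the sample labels, genotype codes, and the `"NC"` No Call symbol are placeholders. One assumption goes beyond the text: a locus where only one sample has a No Call is counted as a difference, since the text specifies only the case where both samples have a No Call. The same function applies to CNV distance if copy number states are supplied in place of SNP genotypes.

```python
from itertools import combinations

NO_CALL = "NC"  # placeholder symbol for a No Call genotype


def genotype_distance(a: list[str], b: list[str]) -> float:
    """Proportion of loci at which two samples have different calls.

    Two No Calls at the same locus compare equal, so they add no difference.
    Assumption: a No Call versus a called genotype counts as a difference.
    """
    if len(a) != len(b) or not a:
        raise ValueError("samples must cover the same non-empty set of loci")
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return differences / len(a)


def distance_matrix(samples: dict[str, list[str]]) -> dict[tuple[str, str], float]:
    """Pairwise distances for every unordered pair of samples."""
    return {
        (s, t): genotype_distance(samples[s], samples[t])
        for s, t in combinations(sorted(samples), 2)
    }


# Invented genotypes at four loci, for illustration only.
samples = {
    "sample_A": ["AA", "AB", NO_CALL, "BB"],
    "sample_B": ["AA", "BB", NO_CALL, "BB"],
    "sample_C": ["BB", "BB", "AA", "BB"],
}
matrix = distance_matrix(samples)
```

A matrix of this form is what a neighbour-joining method such as BIONJ takes as input.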

5.2.6 Tissue-specific CNVs

Tissue-specific CNVs within a mouse were identified using in-house scripts. If a CNV in one mouse tissue overlapped to any degree with another CNV in a different tissue of the same mouse, regardless of copy number state, it was considered to be recurrent within the mouse. Conversely, CNVs that did not overlap any CNVs in a different tissue within the same mouse were considered to be tissue-specific CNVs (i.e. somatic mutations). It is important to note that the inherited copy number state is unknown. Therefore, in cases where the copy number state of the same locus differs between two tissues of the same mouse, either one or both of the tissues sampled may contain a de novo CNV.
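The overlap rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the in-house script itself: the `Cnv` record, the inclusive coordinate convention, and the example intervals are hypothetical.

```python
from typing import NamedTuple


class Cnv(NamedTuple):
    """Hypothetical CNV record; coordinates are assumed inclusive."""
    chrom: str
    start: int
    end: int
    state: int  # copy number state; deliberately ignored by the overlap test


def overlaps(a: Cnv, b: Cnv) -> bool:
    """True if two CNVs on the same chromosome share at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def tissue_specific(own: list[Cnv], other: list[Cnv]) -> list[Cnv]:
    """CNVs in one tissue that overlap no CNV in the other tissue of the same mouse."""
    return [c for c in own if not any(overlaps(c, o) for o in other)]


# Invented calls for the two tissues of one mouse.
primary_tumour = [Cnv("chr7", 100, 200, 1), Cnv("chr3", 500, 900, 3)]
lung_with_metastasis = [Cnv("chr7", 150, 400, 3)]

primary_only = tissue_specific(primary_tumour, lung_with_metastasis)
lung_only = tissue_specific(lung_with_metastasis, primary_tumour)
```

In this example the two chr7 calls overlap and are therefore recurrent within the mouse even though their copy number states differ, which is the situation the caveat about the unknown inherited state addresses.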

5.2.7 Recurrent gene gains and losses and IPA networks

Genic annotation used to identify the genic content of CNVs was obtained from Ensembl’s BioMart (Ensembl genes 67, NCBIM37). Protein-coding genes, non-coding genes, and pseudogenes that completely overlapped CNVs of the same state, in all three samples of a group (shared Rhamm genotype and tumour type), were considered to be recurrent within that group (Appendix 5C). Lists of recurrent gene gains and losses were made for each of the four groups. The four sample groups were wild-type primary tumour (WP) samples, wild-type metastasis (WM) samples, Rhamm-/- primary tumour (RP) samples, and Rhamm-/- metastasis (RM) samples.
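The recurrence rule described above (a gene completely overlapped by a CNV of the same state in all three samples of a group) can be sketched as follows. This Python example is a hedged illustration rather than the analysis code; the record types, gene names, and coordinates are invented, and "same state" is interpreted here as the same integer copy number state.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int


class Cnv(NamedTuple):
    chrom: str
    start: int
    end: int
    state: int  # copy number state, e.g. 1 for a single-copy loss


def genes_covered(genes: dict[str, Interval], cnvs: list[Cnv], state: int) -> set[str]:
    """Genes lying entirely inside at least one CNV of the given state."""
    return {
        name
        for name, g in genes.items()
        if any(
            c.state == state and c.chrom == g.chrom
            and c.start <= g.start and g.end <= c.end
            for c in cnvs
        )
    }


def recurrent_genes(genes: dict[str, Interval], samples: list[list[Cnv]], state: int) -> set[str]:
    """Genes completely overlapped by a CNV of the given state in every sample."""
    per_sample = [genes_covered(genes, cnvs, state) for cnvs in samples]
    return set.intersection(*per_sample) if per_sample else set()


# Invented annotation and calls for one three-sample group.
genes = {
    "GeneA": Interval("chr7", 1_000, 2_000),
    "GeneB": Interval("chr7", 5_000, 6_000),
}
group = [
    [Cnv("chr7", 900, 2_100, 1), Cnv("chr7", 4_900, 6_100, 1)],
    [Cnv("chr7", 500, 2_500, 1), Cnv("chr7", 5_500, 6_500, 1)],  # GeneB only partly covered
    [Cnv("chr7", 1_000, 2_000, 1)],
]
recurrent_losses = recurrent_genes(genes, group, state=1)
```

Partial overlaps, such as the second sample's CNV that covers only part of GeneB, do not count toward recurrence under this rule.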

Each gene list was inputted into the Ingenuity® Pathway Analysis (IPA) Core Analysis (Qiagen, Venlo, Limburg, Netherlands) program. IPA identifies relevant relationships, functions and pathways for inputted datasets such as gene lists. IPA parameters were set to include: direct and indirect relationships with a maximum of 35 focus molecules per network, mouse genes only, an “experimentally observed” confidence level, and the Ingenuity Knowledge Base Genes only reference set. “Endogenous Chemicals” was not selected since the focus of this study is on genetic aspects rather than metabolomics.

5.3 Results

5.3.1 CNVs detected

Across 12 cancer samples, 665 CNVs were identified (Appendix 5A). CNV genetic distances do not show samples clustering together by mouse or genotype (Fig. 5-1). There is no readily apparent relationship between the samples. In contrast, SNP distances are greater between different mice than between two tissues of the same mouse and the samples also cluster by genotype (Fig. 5-1).

Primary tumours in Rhamm-/- and wild-type mice have greater inter-animal variation in the number of CNVs detected than do metastatic tumour samples (Table 5-1).


Rhamm-/- metastatic tumour samples have 1.4-fold more CNVs on average, with more inter-animal variation in the number of CNVs, than wild-type metastatic tumour samples. Cancer samples range from having 1.6- to 3.2-fold more CNVs, on average, than normal tail samples which have an average of 25 CNVs per sample (Table 5-1). In contrast to normal samples, which have slightly more gains than losses on average, all cancer samples have more losses than gains (Table 5-1; Fig. 5-2).
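The fold differences quoted above follow directly from the group averages in Table 5-1. The short Python check below reproduces the arithmetic; the group abbreviations are those defined in Section 5.2.1, and the variable names are otherwise arbitrary.

```python
# Average number of CNVs per mouse, taken from Table 5-1.
normal_mean = 25  # 114 classical laboratory mouse normal tail tissue samples
group_means = {"RP": 79, "WP": 46, "RM": 56, "WM": 41}

# Fold difference of each cancer group relative to normal tail tissue.
fold_vs_normal = {g: round(m / normal_mean, 1) for g, m in group_means.items()}

# Fold difference between Rhamm-/- and wild-type lung with metastasis samples.
rm_vs_wm = round(group_means["RM"] / group_means["WM"], 1)

lowest = min(fold_vs_normal.values())
highest = max(fold_vs_normal.values())
```

The lowest and highest ratios correspond to the wild-type lung with metastasis and Rhamm-/- primary tumour groups, respectively.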

Rhamm-/- primary tumour samples have the smallest CNVs, with the average length being less than half the size of CNVs found in any other cancer sample group, and closest in size to C57BL/6J normal tail tissue samples (Table 5-1). By plotting the length distribution of the CNVs for each sample, it is evident that only the Rhamm-/- primary tumour samples all consistently have a high proportion of smaller CNVs and this is not seen with the metastasis samples and only seen in one wild-type primary tumour sample (Fig. 5-3). Wild-type primary tumour samples were found to have the longest average CNV length but also had the greatest inter-animal variation (Table 5-1). The average CNV size of metastasis samples is similar between genotypes (Table 5-1; Fig. 5-3). All cancer groups have longer CNVs, on average, than do normal tissues (Table 5-1).


[Figure 5-1: two phenogram panels, labelled CNV and SNP]

Figure 5-1. Phenograms representing CNV- and SNP-based pairwise genetic distance between mouse samples from four groups: Rhamm-/- primary tumour, Rhamm-/- lung with metastasis, wild-type primary tumour, wild-type lung with metastasis. Rhamm-/- primary tumour (RP) and lung with metastasis (RM) samples are coloured in blue while green indicates wild-type primary tumour (WP) and lung with metastasis (WM) samples. The numeric values in the sample labels represent the individual mouse identifier number. Scale bars represent genetic distance.


Table 5-1. Summary of averages for CNV numbers, length, state, and genic classification, in cancer and normal sample groups.

| Sample group | Number of samples | Average number of CNVs per mouse per group | Average CNV length (bp) | Gains (%) | Genic (%) | Genic gains (%) |
|---|---|---|---|---|---|---|
| Rhamm-/- primary tumour | 3 | 79 ± 46 | 91,239 ± 16,525 | 3.16 ± 4.03 | 62.08 ± 7.57 | 3.25 ± 3.61 |
| Wild-type primary tumour | 3 | 46 ± 42 | 272,356 ± 134,633 | 14.86 ± 23.90 | 71.46 ± 4.31 | 15.82 ± 26.01 |
| Rhamm-/- lung with metastasis | 3 | 56 ± 21 | 235,703 ± 77,512 | 14.15 ± 6.93 | 75.79 ± 7.98 | 8.11 ± 3.47 |
| Wild-type lung with metastasis | 3 | 41 ± 9 | 229,280 ± 46,936 | 11.94 ± 3.93 | 74.92 ± 6.56 | 8.22 ± 4.18 |
| 114 classical laboratory mouse normal tail tissue42 | 114 | 25 ± 14 | 52,490 ± 18,840 | 51.28 ± 17.54 | 23.87 ± 11.23 | 59.11 ± 28.94 |
| C57BL/6J normal tail tissue42 | 8 | 11 ± 4 | 76,469 ± 19,760 | 57.54 ± 13.76 | 43.66 ± 11.65 | 58.12 ± 15.36 |
| FVB/NJ normal tail tissue42 | 1 | 28 | 46,279 | 53.57 | 17.86 | 40 |

Averages are presented along with standard deviation.


[Figure 5-2: bar chart. y-axis: Number of CNVs (0 to 120). x-axis: Mouse ID and Sample Type, grouped as wild-type primary tumour (36.1, 36.2, 76.3), wild-type metastasis (36.1, 36.2, 76.3), Rhamm-/- primary tumour (63.1, 45.2, 10.4), and Rhamm-/- metastasis (63.1, 45.2, 10.4).]

Figure 5-2. Number of CNV gains and losses detected in primary tumour and lung with metastasis samples from three wild-type and three Rhamm-/- mice. Gains are shown in dark blue while losses are shown in light blue. Each mouse is represented by a numeric identifier.


Figure 5-3. CNV length distribution for primary tumour and lung with metastasis samples from three wild-type and three Rhamm-/- mice. Primary tumour data are coloured in blue and lung with metastasis data are coloured in green. Individual mice are represented by the numeric identifiers at the top of the figure.

Cancer samples have a similar percentage of genic CNVs on average at around 71-76%, except for Rhamm-/- primary tumour samples, which have 62% genic CNVs (Table 5-1). In contrast, 18-44% of CNVs in normal tissues are genic, depending on the sample group (Table 5-1). A high proportion of genic CNVs in cancer samples are losses, ranging from 84-97% genic losses, depending on the sample group. For normal tissues, more genic CNV gains than losses were observed in 64% of the 114 mouse samples.

Here, a CNV is defined as tissue-specific if it is present in only one of two tissues of an individual mouse. The majority of CNVs (68-97%) in the wild-type and Rhamm-/- mice are tissue-specific within an individual mouse (Fig. 5-4). The proportion of tissue-specific CNVs detected in a particular tissue type did not appear to be dependent on genotype.


[Figure 5-4: bar chart. y-axis: Tissue-specific CNVs (%). x-axis: Mouse ID and Rhamm Genotype, grouped as wild-type (36.1, 36.2, 76.3) and Rhamm-/- (63.1, 45.2, 10.4).]

Figure 5-4. Tissue-specific CNVs in the primary tumour and lung with metastasis tissue of wild-type and Rhamm-/- mice. Dark green bars represent primary tumour data and light green bars represent lung with metastasis data. Each mouse is represented by a numeric identifier. Values over each bar indicate the number of CNVs represented by the bar.

5.3.2 Droplet digital PCR confirmation of select genic CNVs

The ddPCR results confirmed that the Rhamm gene was knocked out in Rhamm-/- samples and was present as a copy number state of two in wild-type samples (Fig. 5-5). Based on MDGA results, Ilk and Taf10 were predicted to have a copy number state of two in all samples except Rhamm-/- primary tumours, where a state of one was predicted. The ddPCR results did not confirm a copy number state of one for these genes in Rhamm-/- primary tumour samples. Instead, there appears to be a mixture of genotypes present in the cell population, with a subset of cells carrying the deletion. Furthermore, several other samples also appear to have lost copies of Ilk and Taf10 in a subset of the cell population sampled within the tissue sample.


[Figure 5-5: bar chart. y-axis: Copy Number State (0 to 2.5); x-axis: Gene Assayed (Ilk, Taf10, Rhamm), Sample Genotype (Wild Type, Rhamm-/-), Tumour Type (P, M) and Mouse Identifier Number (36.1, 36.2, 76.3, 10.4, 45.2, 63.1).]

Figure 5-5. Copy number state of Ilk, Taf10, and Rhamm genes as detected by ddPCR in wild-type primary tumour, wild-type lung with metastasis, Rhamm-/- primary tumour, and Rhamm-/- lung with metastasis mouse tissues. Bars are coloured according to sample group (genotype and tumour type). Each mouse is represented by a numeric identifier on the x-axis and primary tumour and metastatic tumour tissues are represented by “P” and “M”, respectively.


5.3.3 CNV genic analysis

Genic CNVs, especially genic losses, are found in the majority of cancer samples, which is in contrast to normal tissue samples, where the majority of CNVs detected are not genic (Table 5-1). For each of the four cancer sample groups, recurrent CNV-affected pseudogenes, non-coding genes and protein-coding genes can be found that are common to all samples within the group (Appendix 5C). Rhamm-/- primary tumour samples have more genes in common that are impacted by CNVs than wild-type primary tumour samples do. More specifically, Rhamm-/- primary tumour samples had eight recurrent CNV regions overlapping 48 protein-coding or non-coding genes, while only six protein-coding or non-coding genes, all in one CNV region, were affected by CNVs in the three wild-type primary tumour samples.

The top IPA network for the 48 shared Rhamm-/- genes is “connective tissue development and function, tissue morphology, cellular growth and proliferation” (Table 5-2). Vascular endothelial growth factor B (Vegfb), a regulator of Rhamm, is found in this network, but Vegfb-containing CNVs were also found in some wild-type samples. Only three genes, integrin-linked kinase (Ilk), ribosomal RNA processing 8 methyltransferase homolog (yeast; Rrp8), and TATA-box binding protein associated factor 10 (Taf10), were unique to Rhamm-/- primary tumours as copy number losses and not found to vary in copy number in any other sample. These genes were completely encompassed by a single CNV, which also partially overlapped the dynein heavy chain domain 1 (Dnhd1) and tripeptidyl peptidase I (Tpp1) genes. Wild-type primary tumour samples did not have any group-specific genes.
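One way to sketch how recurrent and group-specific genes could be derived from per-sample lists of CNV-affected genes is with set operations. The sample labels and gene assignments below are toy values chosen only to mirror the pattern described above; they are not the study's data, `GeneX` is a placeholder, and the function names are hypothetical.

```python
from typing import Dict, Set


def recurrent_genes(samples: Dict[str, Set[str]]) -> Set[str]:
    """Genes affected by a CNV in every sample of a group."""
    gene_sets = list(samples.values())
    return set.intersection(*gene_sets) if gene_sets else set()


def group_unique_genes(group: Dict[str, Set[str]], others: Dict[str, Set[str]]) -> Set[str]:
    """Recurrent genes of a group that are not CNV-affected in any other sample."""
    seen_elsewhere = set().union(*others.values())
    return recurrent_genes(group) - seen_elsewhere


# Toy input: CNV-affected genes per sample (hypothetical labels and assignments).
ko_primary = {
    "KO_P1": {"Ilk", "Rrp8", "Taf10", "Vegfb"},
    "KO_P2": {"Ilk", "Rrp8", "Taf10", "Vegfb", "GeneX"},
    "KO_P3": {"Ilk", "Rrp8", "Taf10", "Vegfb"},
}
other_samples = {
    "WT_P1": {"Vegfb"},
    "WT_M1": {"Ccnd1"},
}

shared = recurrent_genes(ko_primary)
unique = group_unique_genes(ko_primary, other_samples)
```

With this toy input, `shared` contains the four genes present in all three knockout samples, and `unique` drops `Vegfb` because it also appears in a wild-type sample.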

Although an Ilk loss was not observed in Rhamm-/- metastasis samples, the top canonical pathway for this cancer group in IPA was “ILK Signaling” (p = 1.43 × 10⁻⁵) and included insulin receptor substrate 3 (Irs3), protein phosphatase 1 regulatory inhibitor subunit 14B (PPP1R14B), ribosomal protein S6 kinase polypeptide 4 (RPS6KA4), and VEGFB. Rhamm-/- metastasis samples shared CNVs affecting 31 genes in six genomic regions. Wild-type metastasis samples shared 20 genes, also in six genomic regions. In Rhamm-/- metastasis samples, only snoMe28S-Cm3227 (snoRNA) was unique to that group, while in wild-type metastasis samples, AC161763.1 (miRNA), ABCE maturation factor (LTO1), and cyclin D1 (Ccnd1) were unique and not found in Rhamm-/- metastasis samples. The top Rhamm-/- metastasis network is immune-related cell signaling (Table 5-2), and the top wild-type metastasis network is cell function related to cancer and injury response. The remaining top networks for both metastasis groups only had one associated focus molecule.


Table 5-2. Top three Ingenuity Pathway Analysis “diseases and functions” terms for gene networks describing recurrent genic CNVs within each cancer sample group.

| Group | IPA Score (a) | Number of Focus Molecules (b) | Top Diseases and Functions |
|---|---|---|---|
| Rhamm-/- primary tumour | 39 | 17 | Connective Tissue Development and Function, Tissue Morphology, Cellular Growth and Proliferation |
| Rhamm-/- primary tumour | 12 | 7 | Digestive System Development and Function, Hepatic System Development and Function, Inflammatory Response |
| Rhamm-/- primary tumour | 2 | 1 | Carbohydrate Metabolism, Small Molecule Biochemistry, Post-Translational Modification |
| Wild-type primary tumour | 3 | 1 | Amino Acid Metabolism, Post-Translational Modification, Small Molecule Biochemistry |
| Wild-type primary tumour | 3 | 1 | Cancer, Gastrointestinal Disease, Hepatic System Disease |
| Wild-type primary tumour | 2 | 1 | Cell-To-Cell Signaling and Interaction, Hematological System Development and Function, Immune Cell Trafficking |
| Rhamm-/- metastasis | 14 | 6 | Cell-To-Cell Signaling and Interaction, Hematological System Development and Function, Immune Cell Trafficking |
| Rhamm-/- metastasis | 3 | 1 | Tissue Morphology, Organ Development, Reproductive System Development and Function |
| Rhamm-/- metastasis | 3 | 1 | Cancer, Cell-To-Cell Signaling and Interaction, Immunological Disease |
| Wild-type metastasis | 16 | 7 | Cellular Function and Maintenance, Cancer, Organismal Injury and Abnormalities |
| Wild-type metastasis | 3 | 1 | Tissue Morphology, Cellular Development, Tissue Development |
| Wild-type metastasis | 3 | 1 | Amino Acid Metabolism, Post-Translational Modification, Small Molecule Biochemistry |

a The IPA score is equal to -log10 of the p-value.
b Number of genes that were inputted into IPA by the user and are present in the top diseases and functions network.
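Because the IPA score is defined in the table footnote as -log10 of the p-value, scores and p-values can be converted in either direction. The short sketch below shows the arithmetic only; the function names are hypothetical. Applying the same transformation to the canonical pathway p-value quoted in the text is done purely for scale comparison and does not imply that IPA reports a network score for that pathway.

```python
import math


def score_to_p_value(score: float) -> float:
    """Invert score = -log10(p) to recover the p-value."""
    return 10.0 ** (-score)


def p_value_to_score(p_value: float) -> float:
    """Apply score = -log10(p); p must be positive."""
    if p_value <= 0:
        raise ValueError("p-value must be positive")
    return -math.log10(p_value)


# Top Rhamm-/- primary tumour network score from Table 5-2.
top_network_p = score_to_p_value(39)

# "ILK Signaling" p-value from the text, expressed on the same -log10 scale.
ilk_signaling_on_score_scale = p_value_to_score(1.43e-5)
```

A score of 39 therefore corresponds to a p-value of 1 × 10⁻³⁹, while networks with a score of 2 or 3 correspond to p-values of 0.01 and 0.001.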


5.4 Discussion

The MDGA was able to provide information regarding the number, size, state and gene content of CNVs in cancer samples. Comparisons of CNVs in cancer tissues and normal tissues revealed that cancer CNV profiles differ from normal CNV profiles with respect to CNV number, length, state, and genic content. Contrary to predictions, there is no clear association between the number of CNVs detected and the presence or absence of Rhamm expression. However, the absence of Rhamm expression is associated with alterations to the primary tumour genome that were not observed in metastasis tissue or with a wild-type genotype, particularly with respect to the high proportion of smaller CNV losses. With respect to CNV length only, Rhamm-/- primary tumour CNVs are more similar to normal tissues than to other cancer samples. Rhamm-/- primary tumours also have multiple genic CNVs shared between samples within the group that are associated with IPA disease and function terms relevant to cancer, while the wild-type primary tumours have more diversity in the genes impacted by CNVs, suggesting that Rhamm absence in the microenvironment leads to selection for specific phenotypes resulting from genotypic changes. This genotype-specific difference was not observed with metastasis samples, possibly because of the presence of normal tissue and a difference in the lung versus mammary gland microenvironment. As predicted, confirmation of select CNV targets revealed that there are SNP microarray sensitivity limitations for genetically heterogeneous samples, such that small mutant clones cannot be detected. Ideally, tumour subpopulations should be isolated and characterized for CNVs using CNV-detection methods appropriate for small sample sizes.
Regardless of its limitations, the MDGA provided important leads to follow up on relating to the influence of Rhamm-/- on the CNV landscape, possibly through tissue microenvironment changes that lead to selection for specific phenotypes associated with the detected genotypes.

5.4.1 CNVs detected

The genetic distance between cancer samples can differ greatly based on which mutation types are used in the calculations (e.g. SNPs or CNVs), the algorithm or model used, and whether the sample is genetically heterogeneous or more homogeneous cell subpopulations are used47. When using SNPs, the genetic distance between cancer samples reflected the known genetic relationships between the samples. On the other hand, CNV genetic distance showed that the genetic variation acquired by the cancer samples exceeds the shared, inherited genetic variation, such that the known relationships are not observed in the CNV phenogram. The absence of sample clustering by animal in the CNV phenogram would suggest that there is high intra-animal copy number variation in cancerous tissues, regardless of Rhamm genotype. However, the SNP genetic distance values are larger than CNV genetic distance values, which means that there is a larger contribution from SNPs to genetic differences between the samples than there is from CNVs.

Differences between SNP and CNV genetic distance values and how closely they represent known genetic relationships are a result of SNP and CNV frequency in the genome and how that is applied to genetic distance calculations. SNP genetic distance is calculated for hundreds of thousands of single-nucleotide loci while CNV genetic distance is dependent on CNV number, length and state. Assuming that the majority of the genome is a copy number state of two in all individuals, a sample would need to have a great number of small CNVs with mixed states to achieve a high level of uniqueness and thereby distinction from other samples – similar to SNPs. Within a mouse, the CNV landscape between two tissues would have to be very similar for the tissues to cluster together since genetic distance can be greatly impacted by small differences in CNV number, length or state. If there is a high incidence of de novo or tissue-specific CNVs in tissue samples, as can occur with cancer or mutagen exposure, the genetic distance will be greater than for tissues that only contained inherited CNVs.
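The dependence of a CNV-based distance on CNV number, length and state can be illustrated with a simplified sketch. It assumes, for illustration only, that each sample is encoded as one copy number state per probe locus (state two by default) and that the distance is the fraction of loci whose states differ; this is not necessarily the distance measure used to build the phenograms in this chapter. The locus indices, segment format and function names are hypothetical.

```python
from typing import List, Tuple

# (first_locus_index, last_locus_index, copy_number_state), indices inclusive
Segment = Tuple[int, int, int]


def state_vector(n_loci: int, cnvs: List[Segment]) -> List[int]:
    """Per-locus copy number states, assuming state two wherever no CNV is called."""
    states = [2] * n_loci
    for first, last, state in cnvs:
        for i in range(first, last + 1):
            states[i] = state
    return states


def cnv_distance(a: List[int], b: List[int]) -> float:
    """Fraction of loci at which two samples have different copy number states."""
    if len(a) != len(b):
        raise ValueError("samples must cover the same loci")
    differing = sum(1 for x, y in zip(a, b) if x != y)
    return differing / len(a)


N_LOCI = 20  # toy genome of 20 probe loci
primary = state_vector(N_LOCI, [(0, 4, 1)])
lung_same_mouse = state_vector(N_LOCI, [(0, 4, 1), (10, 11, 3)])
primary_other_mouse = state_vector(N_LOCI, [(5, 9, 1)])

d_within = cnv_distance(primary, lung_same_mouse)
d_between = cnv_distance(primary, primary_other_mouse)
```

In this toy case a single extra two-locus gain gives a within-mouse distance of 0.1, while two non-overlapping five-locus losses give a between-mouse distance of 0.5, showing how a few tissue-specific CNVs can outweigh shared inherited ones.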

The number of CNVs detected in the cancer samples exceeded what is typical of normal mouse tissues, in some cases by more than four-fold. This observation of high CNV levels in cancer is not found in all cancer studies, as mutation profiles can vary based on cancer type10,48. In a study on human colorectal cancer49, the median number of CNVs was found to be similar between individual tumours (480 CNVs, with a range of 7-969 CNVs) and matched normal tissue (383 CNVs, with a range of 3-762 CNVs). A large range in CNV number was also found in the cancerous tissues of the six mice, unlike the CNV range for normal mouse samples, which was far smaller. In the human colorectal cancer study49, there were more than four times the number of CNVs detected overall in cancer samples than in normal samples, similar to the findings in the mouse study presented here. A larger sample size of primary and secondary tumour samples with matched normal tissue will be needed in future experiments to determine if the high CNV loads detected in the MMTV-PyMT mice are representative of this particular cancer model or if these six mice are outliers and other mice would have lower CNV loads.

Tissue-specific copy number alterations were detected in all tissues of this study and are a common occurrence in tumorigenesis and metastasis29,50. The number of mutations that arise in a tissue can depend on many factors including, but not limited to, which genes are affected (e.g. mutation repair genes), tissue type, and microenvironment. Ben-David et al.51 found that aneuploidy and large copy number alterations become dominant as non-invasive lesions progress to invasive carcinomas in SV40 Tag-induced cancer, but that an increase in chromosomal aberrations is not associated with PyMT-related metastasis development. The authors also found recurrent copy number alterations that are unique to specific breast cancer drivers51. Recurrent CNVs should be studied at a cell subpopulation level within the primary tumours and compared to isolated metastasis tumours to identify chromosomal changes that promote tumorigenesis and metastasis. However, higher resolution technology is required to study cell subpopulations since the MDGA only reports the most abundant copy number state.

One of the most striking differences when comparing cancer samples to normal tissues is the high number of CNV losses observed with cancer. Generally, genic losses are more likely to be deleterious than gains, particularly when there is a loss of genes necessary for cell function. Therefore, CNV losses are not expected to be prevalent in healthy tissues. However, sometimes losses are advantageous. In hematopoietic cancer stem cells from a Pten-/- mouse model of leukemia, loss of ribosomal DNA regions was counterintuitively associated with increased cell proliferation, rRNA production and protein synthesis52. Alternatively, copy number losses could be neutral, they could contain tumour suppressor genes that would contribute to tumour progression if gene copies are lost, or they could confer minor fitness disadvantages to cells but be linked to a strong cancer driver and therefore not be selected against.

Although there were no clear genotype-specific differences regarding the number of CNVs detected by the array, genotype-specific differences were observed when looking at the length of CNVs in primary tumour samples. In the absence of Rhamm, primary tumour CNV losses were found to be smaller than in wild-type primary tumours, but still larger than CNVs in healthy mice. This difference was not seen in metastatic tumour samples, possibly because of the presence of some normal tissue and a different microenvironment in the lung than in the mammary gland. Alternatively, it is possible that the successfully metastasizing cells had a genotype that did not contain many of the CNV losses present in the primary tumour samples. Small losses that are genotype-specific may share a common mechanism. The mechanisms known to generate smaller deletions include: double-strand breaks associated with elevated and this being more likely with open DNA53. Therefore, the wild-type and Rhamm-/- groups could be examined for differences in chromatin structure, which may result from differing levels of transcription or influences on histone activity. The level of DNA accessibility and susceptibility to damage can be determined using a DNase I footprinting assay. Other assays could be applied to test for genotype-specific mutation repair deficiencies.

Aneuploidy was not detected by the MDGA, even though it is known to occur in the absence of RHAMM38. Aneuploidy would not be detectable by the array unless a large proportion of the cells had the same chromosomal gain or loss.

5.4.2 Droplet digital PCR confirmation of select genic CNV regions

Confirmation of selected recurrent CNVs revealed a limitation of microarrays when used with cancer samples or any sample that is a composite of subpopulations with different CNV genotypes. The MDGA is only capable of calling whole number CNV states from zero to four, while ddPCR can detect a mixture of copy number states, such as a state of 1.5. In a genome, a copy number state can only exist as a whole number. A state of 1.5 could appear if there is mosaicism in a population of cells where some cells have a CNV loss while others do not, so an averaged copy number state is detected. In this event, ddPCR would make a copy number call in between whole-number states, while the MDGA would round the state up or down to a whole number. CNV detection is often more problematic in cancerous tissues than in normal tissues because of the heterogeneous genetic makeup of tumours54,55.
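The contrast between an averaged ddPCR copy number and a whole-number array call can be made concrete with a short sketch. It assumes the commonly used Poisson correction for droplet counts (mean copies per droplet = -ln of the fraction of negative droplets), a duplexed reference assay present at two copies per genome, and a two-population mixture of cells with states two and one. The droplet counts are invented, the function names are hypothetical, and nearest-integer rounding is only a crude stand-in for how an array calling algorithm assigns states.

```python
import math


def copies_per_droplet(positive: int, total: int) -> float:
    """Poisson-corrected mean number of template copies per droplet."""
    if not 0 <= positive < total:
        raise ValueError("need 0 <= positive < total droplets")
    return -math.log(1 - positive / total)


def ddpcr_copy_number(target_positive: int, reference_positive: int,
                      total_droplets: int, reference_copies: int = 2) -> float:
    """Copy number of the target relative to a reference of known copy number."""
    target = copies_per_droplet(target_positive, total_droplets)
    reference = copies_per_droplet(reference_positive, total_droplets)
    return reference_copies * target / reference


def fraction_with_loss(mean_cn: float, normal: int = 2, loss: int = 1) -> float:
    """Fraction of cells carrying the loss, assuming only two cell populations."""
    return (normal - mean_cn) / (normal - loss)


def nearest_integer_call(mean_cn: float) -> int:
    """Crude model of a whole-number array call restricted to states 0 to 4."""
    return max(0, min(4, round(mean_cn)))


# Invented droplet counts for one well: 10,000 droplets in total.
mosaic_cn = ddpcr_copy_number(target_positive=4054, reference_positive=5000,
                              total_droplets=10000)
loss_fraction = fraction_with_loss(mosaic_cn)
```

With these invented counts the estimated copy number is close to 1.5, which under the two-population assumption corresponds to roughly half of the cells carrying a single-copy loss; a whole-number call cannot represent that mixture.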

The MDGA has the capability to detect a knockout of Rhamm since there are probes present in that region of the genome. However, no state zero CNVs were detected in the entire dataset. This may have occurred for a few reasons such as uneven or excessive staining, or it could be due to highly variable fluorescence intensities that can occur with cancer samples or can arise from unequal sample application during hybridization56–58. Visual images of the arrays showed hybridization abnormalities like the presence of curved line artifacts across the arrays which would likely cause high variation in fluorescence intensities. Therefore, as with any technology based on massively parallel hybridization, conclusions drawn from the array data require additional experimental confirmation. In this study, ddPCR analysis did successfully confirm the Rhamm genotype of the samples.

5.4.3 CNV genic analysis

Due to the heterogeneous nature of cancer tissues and microarray limitations, additional confirmation of the results is required, particularly when attempting to assign cancer-related roles to genes. For example, the CNV region containing Ilk, Rrp8, and Taf10, which are genes that have been implicated in cancer, was identified by the MDGA as a copy number loss in Rhamm-/- primary tumour samples. However, ddPCR results suggest that the samples contain cell populations with mixed copy number states of two or lower. This intra-sample heterogeneity cannot be captured by the MDGA, which is limited to calling whole-number CNV states.

Ilk, Rrp8, and Taf10 are found contiguously and partially overlapping each other on Chromosome 7 and were detected as a copy number state of one in all three Rhamm-/- primary tumour samples using MDGA analysis. This CNV was confirmed as a copy number state of less than two, but more than one, by ddPCR in all Rhamm-/- primary tumour samples. Although other samples also had a copy number state less than two for this region, there was no other sample group wherein all the samples showed this loss. Ilk, Rrp8, and Taf10 may impact cancer proliferation, apoptosis resistance, and growth, respectively, and might contribute to some Rhamm-/--specific phenotypes like increased lung metastases.

During mitosis, ILK forms a complex with an integrin receptor, α-Parvin and Dynactin-2/1 to orient mitotic spindles59. When Ilk is knocked out, mitotic spindles become more randomly oriented, leading to changes in the axis of mitosis and subsequently allowing cells to grow outside of their normal cell layer. With less cell-cell contact limiting cell division, cells are able to proliferate more rapidly and spread outside of their normal layers. ILK has overlapping regulatory pathways with RHAMM as they both act as regulators of ERK1/2 MAP kinases38,60, which are commonly involved in cancer61. ILK and RHAMM are also involved in fibronectin (FN1) regulation. RHAMM is an up-regulator of Fn1 when associated with E2F162, and ILK is required for the induction of FN1 fibrillogenesis63. Fibronectin is important for cell adhesion and invasion64, including for metastatic breast cancers65. What impact the combination of ILK reduction and RHAMM absence has on fibronectin and cell behaviour remains to be determined.

Ribosomal RNA-processing protein 8 (RRP8) is thought to be a methyltransferase in the energy-dependent nucleolar silencing complex (eNoSC). One established function of eNoSC involves sensing the energy status in a cell, and when energy levels are low, it will suppress rRNA transcription and ribosome biogenesis through histone acetylation and methylation to prevent energy deprivation-dependent apoptosis66. Rrp8 heterozygous deletions have been found in several different cancers and reduced expression is thought to play a role in tumor formation and poor survival in breast cancer67.

TAF10 is also involved in transcriptional regulation and is required for mouse embryogenesis and cell cycle progression68, and erythropoiesis69. TAF10 is a component of multiple complexes including TFIID, a general transcription factor which binds TATA boxes, and TATA box-binding protein-free TAF-containing complexes like TFTC, PCAF, and STAGA, which also have roles in regulating transcription70–72. In addition, TAF10 can directly associate with estrogen receptors (ER) and when it is knocked down, there is a significant reduction in estradiol-induced repression of the folate receptor-α (Folr1) P4 core promoter73. With Taf10 deleted in the Rhamm-/- primary tumour cells, Folr1 repression through ER would be reduced. Since Folr1 expression helps triple negative (ER-, PR-, HER2-) breast cancer cells increase folate uptake and grow in vitro, it was suggested that Folr1 could be targeted in cancer treatments74.

The CNV loss encompassing Ilk, Rrp8 and Taf10 could not have been inherited as a complete loss (copy number = 0) since at least one copy of Taf10 is required for mouse embryogenesis68. Whether or not a loss of this region plays an important biological role in the MMTV-PyMT model studied here remains to be seen. An L1-family LINE located between exons in Dnhd1, as well as several SINEs and LTRs may provide mechanisms for CNV formation in this region.

Wild-type primary tumours had many mouse-specific CNVs, as the three primary tumour samples only shared one CNV, and this shared CNV was not unique to this cancer group. The high level of unique CNVs in wild-type primary tumours and the contrasting high levels of CNV recurrence in Rhamm-/- primary tumours may indicate microenvironment differences that result in selection for phenotypes, produced by specific CNV genotypes, in the absence of RHAMM. The idea that Rhamm absence or presence can influence the tumour microenvironment is plausible since there is emerging evidence that suggests the tumour microenvironment can be modified by different gene alleles75. Furthermore, the tumour microenvironment is hypothesized to be involved in the development of highly malignant cells through a combination of selection and education (characteristic alteration) of tumour cells76. A larger sample size and additional experiments would be required to ascertain if the microenvironment influences the genotype of the cells and what advantages are conferred to the primary tumour as a result of the selection for specific genotypes.

In Rhamm-/- metastasis samples, no CNVs containing protein-coding genes were unique to the group. SnoMe28S-Cm3227 was unique to Rhamm-/- metastasis samples but its association with cancer has not been studied. In wild-type metastasis samples, AC161763.1, LTO1, and Ccnd1 were unique to the group and observed as losses. AC161763.1, also known as brain cytoplasmic RNA 1 or Bc1, is a lincRNA77 that promotes TGF-β-induced smooth muscle cell differentiation when knocked down and acts as a suppressor of differentiation when overexpressed78. Overexpression of LTO1 but not under-expression, as would be expected with a loss, has been commonly observed in cancer79. Ccnd1 is a cell cycle regulator whose reduced expression has been linked to resistance to breast cancer80 and reduced tumour incidence in teratoma-susceptible mice81. The role that genic CNVs play in relation to cancer in this study’s metastasis samples cannot be determined solely from SNP microarray data, especially because the metastasis samples contain a mixture of normal and cancerous tissue.

5.5 Conclusion

In this study, cancer samples differed in CNV landscape in comparison to normal, healthy tissues. Primary tumour samples had a greater inter-animal variability in the number of CNVs detected than did metastasis samples. The absence of Rhamm is associated with the presence of small CNVs in primary tumours but not in lung tissue with metastasis or in the wild-type tissue samples. This is suggestive of different CNV mutation mechanisms in the primary tumour tissue in the presence and absence of Rhamm. Based on CNV recurrence and genic analysis, an absence of Rhamm is now hypothesized to produce a microenvironment where there is selection for a specific cell phenotype.

This study was designed as a pilot study to determine whether the MDGA could be used for CNV detection with cancer samples and, if so, to identify and characterize differences between wild-type and Rhamm-/- cancer tissue. Although microarray technology is limited in its ability to detect copy number variants in tissue samples composed of heterogeneous cell populations, the MDGA was able to identify differences in comparison to normal tissues and between cancer groups in regard to CNV number, size, state, recurrence and genic content. To explore the CNV landscape in more depth, larger sample sizes, the use of normal tissue adjacent to assayed tumour samples, higher resolution technology, and a focus on cell subpopulations within tumours are recommended. This study shows the utility of the MDGA in searching for general mutation patterns between cancer groups that could be further investigated to identify mutational mechanisms and genes associated with specific cancer phenotypes.

5.6 References

1. Yang, H. et al. A customized and versatile high-density genotyping array for the mouse. Nat. Methods 6, 663–666 (2009).

2. Standfuss, C., Pospisil, H. & Klein, A. SNP microarray analyses reveal copy number alterations and progressive genome reorganization during tumor development in SVT/t driven mice breast cancer. BMC Cancer 12, 380 (2012).

3. Ho, C. C., Mun, K. S. & Naidu, R. SNP array technology: An array of hope in breast cancer research. Malays. J. Pathol. 35, 33–43 (2013).

4. Van Loo, P. et al. Analyzing cancer samples with SNP arrays. Methods Mol. Biol. 802, 57–72 (2012).

5. Cheng, J. et al. Pan-cancer analysis of homozygous deletions in primary tumours uncovers rare tumour suppressors. Nat. Commun. 8, 1221 (2017).

6. Li, M., Wen, Y. & Fu, W. A single-array-based method for detecting copy number variants using affymetrix high density SNP arrays and its application to breast cancer. Cancer Inform. 13, 95–103 (2014).

7. Ohshima, K. et al. Integrated analysis of gene expression and copy number identified potential cancer driver genes with amplification-dependent overexpression in 1,454 solid tumors. Sci. Rep. 7, 641 (2017).

8. Takano, T. et al. Epidermal growth factor receptor gene mutations and increased copy numbers predict gefitinib sensitivity in patients with recurrent non-small-cell lung cancer. J. Clin. Oncol. 23, 6829–6837 (2005).

9. Brown, D. et al. Phylogenetic analysis of metastatic progression in breast cancer using somatic mutations and copy number aberrations. Nat. Commun. 8, 14944 (2017).

10. Nik-Zainal, S. et al. Landscape of somatic mutations in 560 breast cancer whole-genome sequences. Nature 534, 47–54 (2016).

11. Zhang, Y. et al. Copy number alterations that predict metastatic capability of human breast cancer. Cancer Res. 69, 3795–3801 (2009).

12. Sharma, A. K., Eils, R. & König, R. Copy number alterations in enzyme-coding and cancer-causing genes reprogram tumor metabolism. Cancer Res. 76, 4058–4067 (2016).

13. Landgraf, M., McGovern, J. A., Friedl, P. & Hutmacher, D. W. Rational design of mouse models for cancer research. Trends Biotechnol. 36, 242–251 (2018).

14. Stewart, T. A., Pattengale, P. K. & Leder, P. Spontaneous mammary adenocarcinomas in transgenic mice that carry and express MTV/myc fusion genes. Cell 38, 627–637 (1984).

15. Cheon, D.-J. & Orsulic, S. Mouse models of cancer. Annu. Rev. Pathol. 6, 95–119 (2011).

16. Guy, C. T., Cardiff, R. D. & Muller, W. J. Induction of mammary tumors by expression of polyomavirus middle T oncogene: a transgenic mouse model for metastatic disease. Mol. Cell. Biol. 12, 954–961 (1992).

17. Rodriguez-Viciana, P., Collins, C. & Fried, M. Polyoma and SV40 proteins differentially regulate PP2A to activate distinct cellular signaling pathways involved in growth control. Proc. Natl. Acad. Sci. U. S. A. 103, 19290–19295 (2006).

18. Ichaso, N. & Dilworth, S. M. Cell transformation by the middle T-antigen of polyoma virus. Oncogene 20, 7908–7916 (2001).

19. Nukumi, N. et al. Regulatory function of whey acidic protein in the proliferation of mouse mammary epithelial cells in vivo and in vitro. Dev. Biol. 274, 31–44 (2004).

20. Hudson, A. L. & Colvin, E. K. Transgenic mouse models of SV40-induced cancer. ILAR J. 57, 44–54 (2016).

21. Kohnken, R., Porcu, P. & Mishra, A. Overview of the use of murine models in leukemia and lymphoma research. Front. Oncol. 7, 22 (2017).

22. Macheret, M. & Halazonetis, T. D. DNA replication stress as a hallmark of cancer. Annu. Rev. Pathol. 10, 425–448 (2015).

23. Zheng, D.-Q., Zhang, K., Wu, X.-C., Mieczkowski, P. A. & Petes, T. D. Global analysis of genomic instability caused by DNA replication stress in Saccharomyces cerevisiae. Proc. Natl. Acad. Sci. U. S. A. 113, E8114–E8121 (2016).

24. Maser, R. S. et al. Chromosomally unstable mouse tumours have genomic alterations similar to diverse human cancers. Nature 447, 966–971 (2007).

25. Rennhack, J., To, B., Wermuth, H. & Andrechek, E. R. Mouse models of breast cancer share amplification and deletion events with human breast cancer. J. Mammary Gland Biol. Neoplasia 22, 71–84 (2017).

26. Kawamata, F. et al. Copy number profiles of paired primary and metastatic colorectal cancers. Oncotarget 9, 3394–3405 (2018).

27. Li, F., Sun, L. & Zhang, S. Acquirement of DNA copy number variations in non-small cell lung cancer metastasis to the brain. Oncol. Rep. 34, 1701–1707 (2015).

28. Gao, Y. et al. Single-cell sequencing deciphers a convergent evolution of copy number alterations from primary to circulating tumor cells. Genome Res. 27, 1312–1322 (2017).

29. Malek, J. A. et al. Copy number variation analysis of matched ovarian primary tumors and peritoneal metastasis. PLoS One 6, e28561 (2011).

30. Turajlic, S. et al. Metastasis as an evolutionary process. Science 352, 169–175 (2016).


31. Caswell, D. R. & Swanton, C. The role of tumour heterogeneity and clonal cooperativity in metastasis, immune evasion and clinical outcome. BMC Med. 15, 133 (2017).

32. Maxwell, C. A. et al. RHAMM Is a Centrosomal Protein That Interacts with Dynein and Maintains Spindle Pole Stability. Mol. Biol. Cell 14, 2262–2276 (2003).

33. Hardwick, C. et al. Molecular cloning of a novel hyaluronan receptor that mediates tumor cell motility. J. Cell Biol. 117, 1343–1350 (1992).

34. Tolg, C. et al. Rhamm-/- fibroblasts are defective in CD44-mediated ERK1,2 motogenic signaling, leading to defective skin wound repair. J. Cell Biol. 175, 1017–1028 (2006).

35. Hamilton, S. R. et al. The hyaluronan receptors CD44 and Rhamm (CD168) form complexes with ERK1,2 that sustain high basal motility in breast cancer cells. J. Biol. Chem. 282, 16667–16680 (2007).

36. Wang, C. et al. The overexpression of RHAMM, a hyaluronan-binding protein that regulates ras signaling, correlates with overexpression of mitogen-activated protein kinase and is a significant parameter in breast cancer progression. Clin. Cancer Res. 4, 567–576 (1998).

37. Veiseh, M. et al. Cellular heterogeneity profiling by hyaluronan probes reveals an invasive but slow-growing breast tumor subset. Proc. Natl. Acad. Sci. 111, E1731–E1739 (2014).

38. Tolg, C. et al. RHAMM promotes interphase microtubule instability and mitotic spindle integrity through MEK1/ERK1/2 activity. J. Biol. Chem. 285, 26461–26474 (2010).

39. Affymetrix. Genome-Wide Human SNP Nsp/Sty Assay 5.0 User Manual. (2007).

40. Affymetrix Power Tools MANUAL: apt-probeset-genotype (1.20.0). Available at: http://www.affymetrix.com/support/developer/powertools/changelog/apt-probeset-genotype.html.

41. Affymetrix. Genotyping console 4.0 user manual. (2009).

42. Locke, M. E. O. et al. Genomic copy number variation in Mus musculus. BMC Genomics 16, 497 (2015).

43. PennAffy [http://www.openbioinformatics.org/penncnv/penncnv_download.html].

44. Kent Utils [https://github.com/NullModel/kentUtils].

45. Diskin, S. J. et al. Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms. Nucleic Acids Res. 36, e126 (2008).

46. Gascuel, O. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol. 14, 685–695 (1997).

47. Schwartz, R. & Schäffer, A. A. The evolution of tumour phylogenetics: principles and practice. Nat. Rev. Genet. 18, 213–229 (2017).

48. Alexandrov, L. B. et al. Signatures of mutational processes in human cancer. Nature 500, 415–421 (2013).

49. Chen, W. et al. Identification of chromosomal copy number variations and novel candidate loci in hereditary nonpolyposis colorectal cancer with mismatch repair proficiency. Genomics 102, 27–34 (2013).

50. Shlien, A. & Malkin, D. Copy number variations and cancer. Genome Medicine 1, (2009).

51. Ben-David, U. et al. The landscape of chromosomal aberrations in breast cancer mouse models reveals driver-specific routes to tumorigenesis. Nat. Commun. 7, 12160 (2016).

52. Xu, B. et al. Ribosomal DNA copy number loss and sequence variation in cancer. PLOS Genet. 13, e1006771 (2017).

53. Falk, M., Lukášová, E. & Kozubek, S. Chromatin structure influences the sensitivity of DNA to γ-radiation. Biochim. Biophys. Acta - Mol. Cell Res. 1783, 2398–2414 (2008).

54. Chen, G. K., Chang, X., Curtis, C. & Wang, K. Precise inference of copy number alterations in tumor samples from SNP arrays. Bioinformatics 29, 2964–2970 (2013).

55. Zare, F., Dow, M., Monteleone, N., Hosny, A. & Nabavi, S. An evaluation of copy number variation detection tools for cancer using whole exome sequencing data. BMC Bioinformatics 18, 286 (2017).

56. Jaksik, R., Iwanaszko, M., Rzeszowska-Wolny, J. & Kimmel, M. Microarray experiments and factors which affect their reliability. Biol. Direct. 10, 46 (2015).

57. Singh, R. Signal oscillation is another reason for variability in microarray-based gene expression quantification. PLoS One 8, e54753 (2013).

58. Bilban, M., Buehler, L. K., Head, S., Desoye, G. & Quaranta, V. Defining signal thresholds in DNA microarrays: exemplary application for invasive cancer. BMC Genomics 3, 19 (2002).

59. Morris, E. J., Assi, K., Salh, B. & Dedhar, S. Integrin-linked kinase links dynactin-1/dynactin-2 with cortical integrin receptors to orient the mitotic spindle relative to the substratum. Sci. Rep. 5, 8389 (2015).

60. Huang, Y., Li, J., Zhang, Y. & Wu, C. The roles of integrin-linked kinase in the regulation of myogenic differentiation. J. Cell Biol. 150, 861–871 (2000).

61. Roskoski, R. ERK1/2 MAP kinases: Structure, function, and regulation. Pharmacol. Res. 66, 105–143 (2012).

62. Meier, C. et al. Association of RHAMM with E2F1 promotes tumour cell extravasation by transcriptional up-regulation of fibronectin. J. Pathol. 234, 351–364 (2014).

63. Elad, N. et al. The role of integrin-linked kinase in the molecular architecture of focal adhesions. J. Cell Sci. 126, 4099–4107 (2013).

64. Ruoslahti, E. Fibronectin in cell adhesion and invasion. Cancer Metastasis Rev. 3, 43–51 (1984).

65. Fernandez-Garcia, B. et al. Expression and prognostic significance of fibronectin and matrix metalloproteases in breast cancer metastasis. Histopathology 64, 512–522 (2014).

66. Grummt, I. & Ladurner, A. G. A metabolic throttle regulates the epigenetic state of rDNA. Cell 133, 577–580 (2008).

67. Yang, L., Song, T., Chen, L., Soliman, H. & Chen, J. Nucleolar repression facilitates initiation and maintenance of senescence. Cell Cycle 14, 3613–3623 (2015).

68. Mohan II, W. S., Scheer, E., Wendling, O. & Metzger, D. TAF10 (TAF(II)30) is necessary for TFIID stability and early embryogenesis in mice. Mol. Cell. Biol. 23, 4307–4318 (2003).

69. Papadopoulos, P. et al. TAF10 interacts with the GATA1 transcription factor and controls mouse erythropoiesis. Mol. Cell. Biol. 35, 2103–2118 (2015).

70. Wieczorek, E., Brand, M., Jacq, X. & Tora, L. Function of TAF(II)-containing complex without TBP in transcription by RNA polymerase II. Nature 394, 172–175 (1998).

71. Ogryzko, V. V. et al. Histone-like TAFS within the PCAF histone acetylase complex. Cell 94, 35–44 (1998).

72. Martinez, E. et al. Human STAGA complex is a chromatin-acetylating transcription coactivator that interacts with pre-mRNA splicing and DNA damage-binding factors in vivo. Mol. Cell. Biol. 21, 6782–6795 (2001).

73. Hao, H. et al. Estrogen-induced and TAFII30-mediated gene repression by direct recruitment of the estrogen receptor and co-repressors to the core promoter and its reversal by tamoxifen. Oncogene 26, 7872–7884 (2007).

74. Necela, B. M. et al. Folate receptor-α (FOLR1) expression and function in triple negative tumors. PLoS One 10, e0122209 (2015).

75. Flister, M. J. & Bergom, C. Genetic modifiers of the breast tumor microenvironment. Trends Cancer 4, 429–444 (2018).

76. Takahashi, K. et al. Pancreatic tumor microenvironment confers highly malignant properties on pancreatic cancer cells. Oncogene 37, 2757–2772 (2018).

77. The Jackson Laboratory. Sequence detail (ENSMUSG00000115783). Mouse Genome Informatics Database (2018). Available at: http://www.informatics.jax.org/sequence/ENSMUSG00000115783.

78. Wang, Y.-C., Chuang, Y.-H., Shao, Q., Chen, J.-F. & Chen, S.-Y. Brain cytoplasmic RNA 1 suppresses smooth muscle differentiation and vascular development in mice. J. Biol. Chem. 293, 5668–5678 (2018).

79. Zhai, C. et al. The function of ORAOV1/LTO1, a gene that is overexpressed frequently in cancer: essential roles in the function and biogenesis of the ribosome. Oncogene 33, 484–494 (2014).

80. Yu, Q., Geng, Y. & Sicinski, P. Specific protection against breast cancers by cyclin D1 ablation. Nature 411, 1017–1021 (2001).

81. Lanza, D. G., Dawson, E. P., Rao, P. & Heaney, J. D. Misexpression of cyclin D1 in embryonic germ cells promotes testicular teratoma initiation. Cell Cycle 15, 919–930 (2016).

Chapter 6

6 Summary and Discussion

The overall goal of this thesis was to identify CNVs in inbred and wild mice, using the Mouse Diversity Genotyping Array (MDGA), and to explore the CNV landscape of Mus musculus from the perspectives of evolution and adaptation, normal development, and cancer. To accomplish this goal, a CNV detection pipeline for the MDGA needed to be developed and tested on a sample dataset. Once a reliable pipeline was established, the MDGA could be used to characterize the CNV landscape in different mouse subspecies, in multiple normal tissues within an individual mouse, and in primary tumour samples compared to metastatic tissue. The outcome of the first thesis aim, developing a CNV detection pipeline, was the generation of filtered probe lists recommended for use in CNV detection, updated probe annotation files, and a list of genes predicted to have conserved copy number states, for use in the assessment of CNV call reliability. It was also shown that the MDGA probes were not designed as carefully as the Genome-Wide Human SNP Array 6.0 probes. However, filtering of the MDGA probes increases SNP genotyping performance and is predicted to improve CNV genotyping. Cross-species MDGA hybridization data showed that MDGA use is not limited to M. musculus: the array has the potential to be a SNP and CNV detection tool for Mus genus samples and may be of more limited use for genotyping distantly related rodents like H. glaber. Cross-species hybridization studies using the MDGA would require identifying which probes are usable in the species of interest by determining which probe sequences are present in the reference genome; reannotation of the probe genomic positions is also necessary for accurate CNV genotyping in a species other than M. musculus.
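The probe-usability check described for cross-species studies can be illustrated with a minimal Python sketch. It assumes exact string matching on either strand against a toy reference sequence; the function names (reverse_complement, annotate_probes), the probe IDs, and all sequences are hypothetical and are not part of the thesis pipeline. A genome-scale analysis would normally rely on a dedicated alignment tool rather than substring search.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def annotate_probes(probes, reference):
    """Map each probe ID to (strand, 0-based position) of its first exact match.

    Probes with no exact match on either strand are left out of the result,
    i.e. they would be treated as unusable in the species of interest.
    """
    reference = reference.upper()
    hits = {}
    for probe_id, seq in probes.items():
        pos = reference.find(seq.upper())
        if pos != -1:
            hits[probe_id] = ("+", pos)
            continue
        pos = reference.find(reverse_complement(seq))
        if pos != -1:
            hits[probe_id] = ("-", pos)
    return hits


# Hypothetical toy data: one forward-strand probe, one reverse-strand probe,
# and one probe that is absent from the reference.
toy_reference = "ACGTTGCAAGGCTTAACCGGATCGATCGTA"
toy_probes = {"p1": "TTGCAAGG", "p2": "CGATCCGG", "p3": "GGGGGGGG"}
print(annotate_probes(toy_probes, toy_reference))
# {'p1': ('+', 3), 'p2': ('-', 16)}
```

Probes missing from the returned dictionary would be excluded from genotyping, and the reported positions would replace the M. musculus annotations for the retained probes.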

The developed MDGA CNV detection pipeline was first used in a broad mouse survey study, where CNVs were shown to commonly occur in the genomes of M. musculus individuals, with 9,634 CNVs being detected in 334 mouse samples. The reliability of the pipeline for CNV detection was established by confirmation of putative CNVs by ddPCR. Indirect confirmation of CNVs on strain-matched samples suggests that some CNVs may be strain-specific. CNVs were also found to impact different genes and biological pathways in classical laboratory strain and wild-caught mouse cohorts that appear to be linked to differences in their respective environments and mate selection. As predicted, these data imply that some adaptive traits can be conferred through variation in gene copy number, although experimental confirmation of these findings is required. The discovery of CNV differences between the classical laboratory strains and wild-caught mice led to the development of the mouse cohort study to further explore mouse cohort differences in CNV landscapes.

The mouse cohort study compared the CNV landscapes in classical laboratory, wild-derived, and wild-caught mouse cohorts. The differences in gene pathways between the classical laboratory strains and wild-caught mice were similar to findings in the broad survey study. The wild-derived strain cohort, which was not assessed for CNV genic content in the broad survey study, was found to have gene pathway overlap with both the classical laboratory and wild-caught cohorts. These data may show how CNV-affected gene pathways are altered through the process of creating inbred laboratory strains. A useful tool that came out of this study to assist with future identification of cohort-specific CNVs and mutation hotspots in the mouse genome was the CNV landscape plot. Given a sufficient sample size for pattern observation, the CNV landscape plot can be used for data visualization with any genome and for any mutation type, provided that the genomic position is known. Furthermore, the CNV landscape plot can be used in conjunction with HD-CNV output to visualize merged regions and singletons across each chromosome.

After looking at CNV differences between mice, where all the data originated from tail samples, the next aim was to characterize the CNV landscape between multiple tissues of individual mice. However, low data quality produced too many false positives to accurately characterize the CNVs in the sample tissues from members of a C57BL/6J mouse family. A ddPCR confirmation of some of the putative CNVs detected may have uncovered a mutational hotspot or programmed deletion in the Hoxa gene cluster, which always included Hoxa13. This deletion was not only present in all tested C57BL/6J mice and tissue samples, but also appeared to be occurring as somatic mosaicism at the intra-tissue level. Evidence of Hoxa deletion mosaicism was also found in two other M. domesticus laboratory strains and in a C57BL/6J mouse unrelated to the mouse family. The significance of a Hoxa13 gene loss is that it is unlikely to be inherited without causing embryonic lethality or some level of abnormal development, depending on whether the deletion is a copy state of zero or one, respectively, and depending on the mouse genetic background. The emergent hypothesis from these data is that this deletion is a case of nonstochastic somatic mosaicism. Future testing is required to determine if the data show a developmentally-programmed deletion in a developmentally-relevant gene, within the context of specific tissues.

The final thesis aim was to study CNVs in the context of tumorigenesis and metastasis. Using an MMTV-PyMT mouse model of breast cancer with either a knockout of Rhamm or a wild-type genotype, the CNV profile was characterized for mammary gland primary tumours and lung tissue with metastasis for six mice. CNV profiles in cancer samples were found to differ from normal tissues and have greater numbers of CNVs, particularly genic losses. Among the four cancer tissue groups, the CNV profiles of Rhamm-/- primary tumour samples were unique in having smaller CNV deletions than the other groups. This would suggest that different mutational mechanisms are forming CNVs in the presence and absence of Rhamm. Furthermore, the higher frequency of recurrent CNVs, or greater genetic homogeneity, among Rhamm-/- primary tumour samples compared to samples with a wild-type genotype suggests that Rhamm absence may be altering the microenvironment so that certain cancer cell genotypes are selected for. In the absence of positive selection for a specific phenotype and genotype, primary tumours would be expected to have greater genetic heterogeneity. Overall, this study shows the utility of the MDGA in cancer research and provides further leads to explore relating to the impact of Rhamm absence on the CNV landscape of the cancer genome, particularly in the mechanisms of CNV formation and role of recurrent, genic CNVs.

6.1 Study limitations

There are several limitations for the studies described in this thesis that could be improved upon in future studies. When the MDGA probe lists were filtered, the impact of the filtering was assessed indirectly via SNP genotype call rate changes without additional biological confirmation. Following filtering and overall increases in SNP genotype call rates, some poorly performing SNP probes that returned only “No Calls” remained. This could result from probe properties that were not filtered, like the presence of mononucleotide repeats or GC content, or from other causes like low sequence complementarity between the probes and sample DNA. Since the probe filtering was performed to improve CNV calling, SNP probes that are suitable for SNP genotyping but not for CNV calling would have been removed. Thus, the recommended filtered probe lists provided in Locke et al.1 are better suited for CNV studies than SNP studies.

When testing the ability of the MDGA to provide genotype calls for the genus Mus and for H. glaber samples, it was found that samples that are distantly related to mice can artificially be made to appear more closely related if genotyped together with mouse samples. Therefore, it is important that genotyping is performed using species-appropriate reference files and genotyping groups. Likewise, the sample size of the species of interest should be sufficiently large so that distinct genotype clusters can be generated. Lastly, because the genotyping output for cross-species hybridization experiments was observed to differ greatly depending on the approach used, biological confirmation is essential to ensure that the correct approach is being applied to detect biological events and minimize false calls.

For the broad mouse survey study and the mouse cohort study, the greatest limitation is that the tissue that was used to generate the CEL files was not available for CNV confirmation. Although mouse-specific CNVs cannot be confirmed without the sample tissue, it may be possible to confirm strain-specific CNVs in a different member of the same strain. For identification of strain-specific and subspecies-specific CNVs to be possible, the sample size for each group should be increased so that recurrent CNVs can be identified. Using the same mouse subspecies for classical laboratory, wild-derived, and wild-caught mouse cohort comparisons would help identify cohort differences that result from life in a laboratory environment or in a natural environment without any confounding effects from the inclusion of multiple subspecies.

When using the MDGA to study somatic mosaicism, it is important to keep in mind that microarrays have limited sensitivity and resolution. This means that the CNVs that are most likely to be detected in a tissue sample are those that have a clonal size that meets the required detection threshold. For large clonal sizes to be present in normal tissue, the CNV event either arose early in development or in a tissue with a high cell turnover rate where there is more opportunity for CNV formation to occur via replication errors2. The MDGA’s resolution limit does not allow for exact CNV breakpoint identification, so sequencing of the predicted breakpoint regions is required to locate the exact CNV start and end. Sequencing of these regions may also uncover the potential mechanisms by which CNVs arose. In the mouse family study, sequencing is required to determine if the Hoxa cluster deletions involve the same mechanism and CNV breakpoints in all members of the mouse family.
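The dependence of detection on clonal size can be made concrete with simple mixture arithmetic. The Python sketch below assumes a bulk sample in which a fraction of cells carries a CNV and the remaining cells have the background copy number, and that the measured signal scales linearly with copy number, which microarray intensities only approximate. The function names and example values are illustrative assumptions, not quantities measured in this thesis.

```python
import math


def expected_mean_copy_number(clonal_fraction, variant_cn, background_cn=2):
    """Average copy number across a mixed cell population."""
    if not 0.0 <= clonal_fraction <= 1.0:
        raise ValueError("clonal_fraction must be between 0 and 1")
    return background_cn * (1.0 - clonal_fraction) + variant_cn * clonal_fraction


def expected_log2_ratio(clonal_fraction, variant_cn, background_cn=2):
    """Log2 ratio of the mixed-sample mean copy number to the background."""
    mean_cn = expected_mean_copy_number(clonal_fraction, variant_cn, background_cn)
    if mean_cn == 0:
        return float("-inf")  # homozygous deletion present in every cell
    return math.log2(mean_cn / background_cn)


# A one-copy deletion carried by 20% of cells shifts the mean only slightly.
print(round(expected_mean_copy_number(0.2, 1), 2))  # 1.8
print(round(expected_log2_ratio(0.2, 1), 3))        # -0.152
```

Under these assumptions, a deletion confined to a small clone produces a shift that may fall within array noise, which is why only CNVs of sufficient clonal size are expected to be called.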

The inclusion of adjacent normal tissue is needed to establish which CNVs in the cancerous tissues are associated with tumorigenesis and metastasis and which CNVs constitute the normal background genotype. The high amount of genetic heterogeneity present in cancerous tissues3 poses a challenge in CNV detection since CNVs will only be detected if they are present in a sufficient clonal size. Therefore, experimental confirmation of putative CNVs is necessary to identify biological events and determine false positive rates. Due to genomic instability in cancer, the CNVs found in cancerous tissues may be byproducts of an unstable genome without being contributors to cancer phenotypes. Therefore, confirmed CNVs need to be characterized through further experimentation to determine their role in cancer formation and development, metastasis, and treatment resistance. Another caveat of the cancer study described here, is that the CNV landscape that was observed is specific to the cancer model used here and could differ with the use of different cancer models, mouse strains, and mouse habitat conditions4–6. Likewise, the results observed in mouse models may not be reproducible in human studies due to biological differences between the species7.

6.2 Future extensions

The experiments conducted in this thesis all used filtered probe lists for SNP and CNV genotyping. To further improve on the probe lists used, a future study should be conducted to provide biological confirmation of the impact of probe filtering on genotyping. The impact of probe list filtering on CNV calling reliability can be assessed by using known CNVs (i.e. samples with known genotypes) to determine whether CNV calling is more accurate before or after probe filtering. Similarly, the impact of probe list filtering on SNP genotype calling can be assessed with PCR-based genotyping of samples to determine if the filtered probe list provides more reliable SNP genotype calls than the unfiltered list. Since algorithm selection and the composition of the dataset can influence SNP and CNV genotyping, experimental confirmation can be used to determine the best approach for detecting SNPs and CNVs with accuracy. Further experiments can be performed to assess the variant detection abilities of the MDGA in cross-species hybridization experiments. These experiments would need large sample sizes with species-appropriate model files, probe filtering based on a reference genome to identify and annotate useable probes in the species of interest, and biological confirmation of the results. Overall, these experiments would greatly improve the reliability of the MDGA in SNP and CNV detection in mice and possibly provide a SNP and CNV detection tool for studies where there is not a microarray available for the species of interest.
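The proposed comparison of CNV calls against known CNVs, before and after probe filtering, could be scored as sketched below in Python. The sketch treats any positional overlap with a known CNV as a correct call and ignores copy number state; that criterion, the function names, and all coordinates are hypothetical choices for illustration rather than an evaluation performed in this thesis.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one position."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def call_accuracy(calls, known):
    """Return (precision, recall) of CNV calls against a set of known CNVs."""
    correct_calls = [c for c in calls if any(overlaps(c, k) for k in known)]
    recovered = [k for k in known if any(overlaps(k, c) for c in calls)]
    precision = len(correct_calls) / len(calls) if calls else 0.0
    recall = len(recovered) / len(known) if known else 0.0
    return precision, recall


# Hypothetical known CNVs and hypothetical calls from two probe lists.
known_cnvs = [("chr4", 100, 200), ("chr7", 500, 900)]
unfiltered_calls = [("chr4", 150, 250), ("chr1", 10, 20), ("chr2", 5, 50), ("chr9", 1, 2)]
filtered_calls = [("chr4", 150, 250), ("chr7", 600, 700)]

print(call_accuracy(unfiltered_calls, known_cnvs))  # (0.25, 0.5)
print(call_accuracy(filtered_calls, known_cnvs))    # (1.0, 1.0)
```

A stricter criterion, such as a minimum reciprocal overlap or agreement in copy number state, could be substituted in overlaps without changing the rest of the comparison.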

6.2.1 Evolution and adaptation studies

There are also multiple future extensions possible for the broad mouse survey study and the mouse cohort study. Future studies can be designed to investigate the role of CNVs in different mouse strains, and in mouse evolution and adaptation to different environments. At a mouse strain level, study extensions could include characterizing mouse strain-specific CNV differences to assist with phenotype characterization and the determination of appropriate background strains to use in mouse model studies. From the perspective of evolution, evolutionary differences would be observable by studying different mouse subspecies and populations to find and characterize CNVs that contribute to phenotypic differences among these groups. A study that would be particularly relevant to the evolution of M. musculus subspecies is the role of CNVs in reproductive isolation, speciation, and hybrid sterility, since genetic incompatibilities leading to hybrid sterility are known to occur between M. musculus subspecies8. These CNV studies would first require characterization of the mating outcome of subspecies crosses and the fitness of offspring. Then CNV studies can be performed to identify differences between samples where mating was unsuccessful, or offspring fitness was reduced, in comparison to samples where cross-subspecies mating was successful and offspring fitness was not affected. For all future experiments, CNVs detected with microarray approaches would require experimental confirmation, and the impact of a CNV on phenotype can only be established following additional functional genomics experiments that focus on the impact of the CNV on transcription, translation, gene expression regulation, protein or RNA functions and interactions, cell function, and organism development and fitness.

In the mouse cohort study, CNV differences between mouse cohorts were indirectly studied but can be expanded upon using a more informative direct experimental approach. Further extensions would include identifying recurrent CNVs, characterizing them for phenotypic impact, and sequencing the breakpoint junctions to identify the mechanisms of CNV formation. The mouse cohort study can be improved by using the same mouse subspecies for each cohort to reduce subspecies-related confounding factors. Ideally, CNV changes related to environmental adaptation would be studied by characterizing CNV profiles in wild-caught founder populations and then observing how the CNV landscape is altered following the generation of inbred strains from the founding population, under laboratory housing conditions. The generated inbred strains would not contain genetic modifications like gene knockouts or selection for disease-related phenotypes since targeted genetic modifications may be confounding factors when studying CNV landscape alterations over generations. The goal would be to ensure that a different environment is the only or predominant factor influencing the CNV landscape in the mice. Conducting additional broad CNV survey studies across various mouse groups would also be of great value. Broad survey studies with single nucleotide resolution of CNV breakpoint junctions would be informative about common mechanisms of CNV formation in the mouse genome and where mutation hotspots are located. Furthermore, these studies could help identify which genomic regions are highly conserved regardless of environment, and which are copy number variable, allowing for different phenotypic traits.

6.2.2 Study extensions for genome mosaicism in healthy tissues and cancer

In this thesis, the contribution of CNVs to mosaicism was studied from the perspective of healthy tissues (mouse family study) and cancerous tissues. The mouse family study will need to be repeated due to problems with the data quality, and there are several modifications that can be included in the repeat experiments. First, the tissue samples that are selected for study should include samples that represent only one germ layer rather than a mix of germ layers. This will allow for the determination of when a CNV arose in somatic tissue. Blood should also be included in the study to assess its usefulness as a sentinel tissue. More C57BL/6J mouse families, as well as other strains, can be included in future studies to generate CNV profiles that are representative of most laboratory strains and to ensure that the data are not biased due to small sample sizes. In addition, somatic mosaicism experiments can be performed using wild mouse populations to determine if CNV mosaicism levels are influenced by the level of genetic diversity in a population and by an individual’s environment.

Although reliable CNV profiles could not be generated for the mouse family study, the suggestive occurrence of a nonstochastic, postzygotic Hoxa gene cluster deletion opened an unexpected area for confirmation and hypothesis testing related to a developmentally-programmed mutational event. Future studies can be performed to determine if the occurrence of the deletion is a widespread phenomenon in Mus, and if the deletions are programmed or occur due to being in an area susceptible to structural mutations. Sequencing of the Hoxa gene region will provide information on the exact genomic position and surrounding genomic context of the deletion in different tissues, which can be used to determine the mutation’s origin and mechanism of formation. If the Hoxa cluster is a mutation hotspot, then it is likely that germ cells are affected as well as the soma. Therefore, germ cells could be included in future studies to determine the frequency of Hoxa deletion occurrence and its impacts on reproductive success and offspring health.

When studying mosaicism in cancer, the genetic heterogeneity of the cancerous tissue created challenges in microarray-based CNV detection. To better characterize the CNV landscape in tumours with respect to improved resolution and sensitivity, a sequencing-based approach would be recommended in combination with isolation of cell subpopulations. Comparisons between CNV profiles of primary tumour cell populations to metastasis cells and normal adjacent cells could provide insight into mechanisms of tumorigenesis and successful metastasis. The use of large sample sizes would be required to be able to identify patterns and statistically significant differences between sample groups. To attribute functional roles to CNVs of interest, functional genomics and phenotypic impact experiments would be required. Future aims for the MMTV-PyMT Rhamm cancer study include conducting additional experiments to determine phenotypic impacts of genic CNVs, determining if Rhamm genotype influences the mechanism of CNV formation, and establishing if there is positive selection for cells with a specific genotype due to microenvironment changes in the absence of Rhamm.

The MDGA may also prove useful in studying how the CNV landscape is altered in a mouse or specific tissues of a mouse following mutagen exposure. Normal CNV profiles from somatic mosaicism studies can serve as a baseline for comparison to mutagen-exposed samples, while some cancer profiles may represent the extreme end of genomic instability. There is also potential for using the MDGA as a biomonitoring tool for laboratory and natural mouse populations with the goal of monitoring levels of genetic diversity and checking for environmental mutagen exposure. In conclusion, the work presented in this thesis provides researchers with a CNV detection pipeline compatible with MDGA data, with potential use in studying the CNV landscapes of Mus musculus and closely related species from the perspective of evolution and adaptation, and genome mosaicism in healthy and disease-afflicted individuals.

6.3 References

1. Locke, M. E. O. et al. Genomic copy number variation in Mus musculus. BMC Genomics 16, 497 (2015).

2. Chen, L. et al. CNV instability associated with DNA replication dynamics: evidence for replicative mechanisms in CNV mutagenesis. Hum. Mol. Genet. 24, 1574–1583 (2015).

3. Burrell, R. A., McGranahan, N., Bartek, J. & Swanton, C. The causes and consequences of genetic heterogeneity in cancer evolution. Nature 501, 338–345 (2013).

4. Zhang, N., Wang, M., Zhang, P. & Huang, T. Classification of cancers based on copy number variation landscapes. Biochim. Biophys. Acta - Gen. Subj. 1860, 2750–2755 (2016).

5. Hunter, K. W. Mouse models of cancer: does the strain matter? Nat. Rev. Cancer 12, 144–149 (2012).

6. Justice, M. J. & Dhillon, P. Using the mouse to model human disease: increasing validity and reproducibility. Dis. Model. Mech. 9, 101–103 (2016).

7. Burkhardt, A. M. & Zlotnik, A. Translating translational research: mouse models of human disease. Cell. Mol. Immunol. 10, 373–374 (2013).

8. Turner, L. M. & Harr, B. Genome-wide mapping in a house mouse hybrid zone reveals hybrid sterility loci and Dobzhansky-Muller interactions. Elife 3, e02504 (2014).

Appendices

Appendix 2A: Genome-Wide Human SNP Array 6.0 filtered probe list. (Online)

Appendix 2B: Genome-Wide Human SNP Array 6.0 filtered probe list including flanking regions. (Online)

Appendix 2C: 874 HapMap3 CEL files. (Online)

Appendix 2D: Sample Affymetrix Power Tools commands for using Birdseed version 1 and version 2 algorithms for SNP genotyping with the apt-probeset-genotype program.

#Birdseed v1 example
apt-probeset-genotype \
    -o Output_Directory/ \
    -c GenomeWideSNP_6.Full.cdf \
    -s Probe_list.txt \
    --special-snps GenomeWideSNP_6.Full.specialSNPs \
    --read-models-birdseed GenomeWideSNP_6.birdseed.models \
    -a birdseed \
    --cel-files CEL_Files.txt \
    --verbose 2

#Birdseed v2 example
apt-probeset-genotype \
    -o Output_Directory/ \
    -c GenomeWideSNP_6.Full.cdf \
    -s Probe_list.txt \
    --chrX-probes GenomeWideSNP_6.chrXprobes \
    --chrY-probes GenomeWideSNP_6.chrYprobes \
    --special-snps GenomeWideSNP_6.Full.specialSNPs \
    --read-models-birdseed GenomeWideSNP_6.birdseed-v2.models \
    -a birdseed-v2 \
    --cel-files CEL_Files.txt \
    --verbose 2

Appendix 2E: 351 Mouse Diversity Genotyping Array CEL files from The Center for Genome Dynamics at The Jackson Laboratory and sample ID. (Online)

Appendix 2F: List of Mouse Diversity Genotyping Array SNP probes from filtered probe list that produced a No Call genotype in all 351 mouse samples. (Online)

Appendix 2G: List of genes unlikely to vary in copy number. (Online)

Appendix 3A: Mouse sample information for 351 Mouse Diversity Genotyping Array CEL files. (Online)

Appendix 3B: Mouse sample cohort, subspecies and origin information for 215 Mouse Diversity Genotyping Array CEL files. (Online)

Appendix 3C: Autosomal CNVs detected for 351 mouse samples. (Online)

Appendix 3D: Chromosome X CNVs detected for 351 mouse samples. (Online)

Appendix 3E: Summary of autosomal and Chromosome X CNVs detected for 351 mouse samples. (Online)

Appendix 3F: CNVs detected for 210 mouse samples. (Online)

Appendix 3G: Summary of autosomal and Chromosome X CNVs detected for 210 mouse samples. (Online)

Appendix 3H: Summary of the predicted and experimental ddPCR copy number (CN) states for nine genic copy number variant regions (CNVRs) in three classical laboratory mouse strains.

| CNVR gene target | TaqMan® Copy Number Assay ID | CNVR location [a] | CNVR size (bp) | Number of probes in CNVR (SNP, IGP) | Predicted CN state (number of mice): C57BL/6J (n = 8) | Predicted CN state: CBA/CaJ (n = 1) | Predicted CN state: DBA/2J (n = 1) | ddPCR CN state: C57BL/6J (n = 5) | ddPCR CN state: CBA/CaJ (n = 5) | ddPCR CN state: DBA/2J (n = 4) |
|---|---|---|---|---|---|---|---|---|---|---|
| B4galt3 [b] | Mm00735212_cn | chr1:173206026-173208119 | 2,093 | 2, 18 | 2 (5), 3 (3) [c] | 2 | 2 | 2 | 2 | 2 |
| Fgfbp3 | Mm00630217_cn | chr19:36990335-37045706 | 55,371 | 12, 38 | 2 (5), 3 (3) | 2 | 2 | 2 and 3 | 2 | 2 |
| Glo1 | Mm00735212_cn | chr17:30593663-31058945 | 465,282 | 61, 273 | 1 (6), 2 (2) | 3 | 3 | 2 | 4 | 4 |
| Hdhd3 | Mm00553493_cn | chr4:62157378-62185081 | 27,703 | 3, 33 | 2 (8) | 4 | 3 | 2 | 6 | 6 |
| Ide | Mm00496897_cn | chr19:37317974-37451133 | 133,159 | 31, 54 | 2 (2), 3 (5,1) | 2 | 2 | 2 | 2 | 2 |
| Itln1 | Mm00534147_cn | chr1:173391344-173508077 | 116,733 | 12, 43 | 1 (8) | 2 | 2 | 2 | 2 | 2 |
| Nlrp1b [b] | Mm00635379_cn | chr11:71014664-71099017 | 84,353 | 3, 29 | 2 (1), 3 (2), 4 (5) | 2 | 2 | 2 | 2 | 2 |
| Skint3 | Mm00735949_cn | chr4:111745396-112064463 | 319,067 | 0, 61 | 4 (2,6) | 2 | 2 | 2 | 0 | 0 |
| Trim30e-ps1 | Mm00657977_cn | chr7:111681502-111683670 | 2,168 | 0, 7 | 2 (5), 4 (3) | 2 | 2 | 2 | 0 | 0 |

[a] The CNVR start and end positions are determined by the earliest start position and latest end position in a group of overlapping CNVs for CNVs detected in C57BL/6J, CBA/CaJ and DBA/2J mice
[b] These genes partially overlap the CNVR
[c] The number of mice where the observed CNV only partially overlapped the gene of interest is underlined
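Footnote a defines each CNVR by the earliest start and latest end among a group of overlapping CNVs. A minimal Python sketch of that merging rule follows. The function name, the treatment of touching intervals as overlapping, and the individual CNV calls in the example are assumptions made for illustration; only the resulting CNVR boundaries and sizes for B4galt3 and Fgfbp3 come from the table.

```python
def merge_into_cnvrs(cnvs):
    """Merge overlapping (chrom, start, end) CNV calls into CNV regions.

    Returns a list of (chrom, start, end, size) tuples, where start is the
    earliest start and end is the latest end within each overlapping group,
    and size is end - start in base pairs.
    """
    regions = []
    for chrom, start, end in sorted(cnvs):
        if regions and regions[-1][0] == chrom and start <= regions[-1][2]:
            # Extend the current region to the latest end seen so far.
            prev_chrom, prev_start, prev_end = regions[-1]
            regions[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            regions.append((chrom, start, end))
    return [(chrom, start, end, end - start) for chrom, start, end in regions]


# Hypothetical per-mouse calls chosen so that their union matches two CNVRs.
example_calls = [
    ("chr1", 173206026, 173207500),
    ("chr1", 173207000, 173208119),
    ("chr19", 36990335, 37045706),
]
print(merge_into_cnvrs(example_calls))
# [('chr1', 173206026, 173208119, 2093), ('chr19', 36990335, 37045706, 55371)]
```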

Appendix 3I: Western University ethics approval for animal use in research.

2009-033::6: AUP Number: 2009-033 AUP Title: Mutational Mechanisms

Yearly Renewal Date: 10/01/2015

The YEARLY RENEWAL to Animal Use Protocol (AUP) 2009-033 has been approved and will be approved for one year following the above review date. 1. This AUP number must be indicated when ordering animals for this project. 2. Animals for other projects may not be ordered under this AUP number. 3. Purchases of animals other than through this system must be cleared through the ACVS office. Health certificates will be required.

REQUIREMENTS/COMMENTS Please ensure that individual(s) performing procedures on live animals, as described in this protocol, are familiar with the contents of this document.

The holder of this Animal Use Protocol is responsible to ensure that all associated safety components (biosafety, radiation safety, general laboratory safety) comply with institutional safety standards and have received all necessary approvals. Please consult directly with your institutional safety officers.

Submitted by: Kinchlea, Will D on behalf of the Animal Use Subcommittee

Appendix 3J: CNV and SNP genetic distance matrices. (Online)

Appendix 3K: Ingenuity Pathway Analysis core analysis of genes overlapping CNV regions for different copy number states and two mouse cohorts. (Online)

Appendix 3L: DdPCR confirmation of CN state in 14 mice of three different inbred strains for nine select genic CNVRs detected using the Mouse Diversity Genotyping Array.

ddPCR CN state

Gene assay    | C57BL/6J mouse           | CBA/CaJ mouse            | DBA/2J mouse
              | 1    2    3    4    5    | 1    2    3    4    5    | 1    2    3    4
B4galt3       | 1.99 1.80 2.01 1.90 1.94 | 2.04 2.09 2.02 1.97 1.89 | 2.01 1.94 1.94 1.97
              | 1.87 1.87 1.87 1.99 2.00 | 2.02 2.17 2.00 1.85 1.96 | 1.87 1.90 2.02 2.03
Fgfbp3^a      | 1.97 2.92 2.06 2.07 2.12 | 2.02 2.11 2.00 1.99 2.11 | 2.11 2.09 2.00 2.05
              | 2.04 2.89 2.03 2.04 2.04 | 2.05 2.03 2.15 2.00 2.05 | 1.99 2.02 1.97 2.00
Glo1^a        | 1.89 2.04 2.05 1.94 1.94 | 3.94 3.88 3.84 3.97 3.99 | 3.86 4.12 3.87 3.90
              | 2.01 1.97 1.97 2.01 1.99 | 3.93 3.94 3.99 4.00 3.99 | 3.96 4.10 3.94 3.74
Hdhd3^a,c     | 1.93 1.78 1.98 1.98 1.96 | 5.85 5.87 5.93 6.02 5.86 | 5.66 5.90 5.90 5.70
              | 1.96 1.71 1.95 1.98 1.96 | 5.80 5.80 6.01 5.78 5.81 | 5.69 5.71 5.89 5.90
Ide           | 1.89 1.86 1.92 1.88 1.87 | 1.91 1.90 1.94 1.89 1.94 | 1.97 1.94 1.86 1.99
              | 1.94 1.89 1.93 1.92 1.89 | 1.94 1.91 2.01 1.94 1.90 | 1.99 1.94 1.96 1.97
Itln1^a,b     | 1.96 1.89 1.95 1.95 2.06 | 1.94 1.92 1.98 1.93 1.92 | 1.95 1.89 2.01 1.94
              | 1.90 1.90 2.03 2.02 1.97 | 1.9  1.93 1.96 1.88 2.06 | 2.01 1.94 1.99 1.92
Nlrp1b        | 2.09 2.01 2.01 1.99 1.86 | 1.98 1.96 1.95 1.92 1.99 | 2.09 2.11 2.04 2.06
              | 2.09 1.97 1.89 1.99 1.97 | 1.99 2.01 2.04 1.99 2.02 | 1.96 2.04 1.98 1.99
Skint3        | 2.09 2.19 1.91 2.08 2.09 | 0    0    0    0    0    | 0    0    0    0
              | 2.02 2.04 2.02 1.98 1.99 | 0    0    0    0    0    | 0    0    0    0
Trim30e-ps1   | 2.07 1.88 1.98 2.10 1.94 | 0    0    0    0    0    | 0    0    0    0
              | 2.00 1.91 1.94 2.01 1.99 | 0    0    0    0    0    | 0    0    0    0

a The gene is entirely within the CNV observed in each mouse
b All DBA/2J and CBA/CaJ samples had multiple droplet populations between the lowest amplitude negative droplet population and the highest amplitude Itln1 positive droplet population. This may be caused by strain-specific genetic variation affecting amplification efficiency
c For C57BL/6J mouse 2, the CN is likely two, but ddPCR values were lower than expected for a CN of two, possibly because the duplicate regions are very close together and further fragmentation of the sample is required to accurately detect the CN state

Appendix 3M: List of CNVs found overlapping genes that are unlikely to vary in copy number. (Online)

Appendix 4A: Log R ratio, B allele frequency, and waviness factor values for 26 mouse tissue samples and two PennCNV runs. (Online)

Appendix 4B: CNV calls for first PennCNV dataset. (Online)

Appendix 4C: CNV calls for second PennCNV dataset. (Online)

Appendix 5A: CNV calls for six mammary gland primary tumour samples and six lung with metastasis samples from three MMTV-PyMT Rhamm-/- mice and three MMTV-PyMT Rhamm+/+ mice. (Online)

Appendix 5B: CNV and SNP genetic distance matrices. (Online)

Appendix 5C: Copy number state, position and genic content of CNV regions that are recurrent in all three samples with a shared Rhamm genotype and tumour type. (Online)

Curriculum Vitae

Name: Maja Milojevic

Post-secondary Education and Degrees:

2012-2019 Ph.D. Biology, 2008-2012 B.Sc.

The University of Western Ontario

London, Ontario, Canada

Honours and Awards:

2017 Dr. Irene Uchida Fellowships in Life Sciences

2016 Queen Elizabeth II Graduate Scholarship in Science and Technology

2016 EMGS Student and New Investigator Travel Award

2015 EMGS Student and New Investigator Travel Award

2015 Department of Biology Graduate Travel Award

2015 Department of Biology Graduate Student Teaching Award

2012 Dean's Honor List

2008 Western Scholarship of Excellence

2008 Queen Elizabeth II Aiming for the Top Tuition Scholarship

Related Work Experience:

2012-2018 Teaching Assistant

The University of Western Ontario

London, Ontario, Canada

Publications:

Locke, M.E.O., Milojevic, M., Eitutis, S.T., Patel, N., Wishart, A.E., Daley, M., Hill, K.A. Genomic copy number variation in Mus musculus. BMC Genomics 16, 497 (2015).
