GENETIC VARIATION IN THE DOMESTICATED DOG AS A MODEL OF HUMAN DISEASE

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Jennie Lynn Rowell, M.S.

Graduate Program in Nursing

The Ohio State University

2012

Dissertation Committee:

Donna O. McCarthy, Adviser

Carlos E. Alvarez, Cognate Adviser

Kim McBride

Jodi McDaniel


Copyright by

Jennie Lynn Rowell

2012

ABSTRACT

One of the greatest challenges facing clinical scientists is developing an understanding of the genetic basis for complex human diseases such as cancer. Despite many technological advances in genetics, progress has been slow. This is owed, in part, to intricate gene-gene interactions as well as poorly understood environmental influences on genetic traits. The high level of genetic heterogeneity in humans makes identification of these interactions and environmental influences difficult. Recently, the domesticated dog (Canis lupus familiaris) has demonstrated its powerful applicability to human disease as a new model of genetic variation. Dogs are an excellent model for the study of complex disease in humans for a variety of reasons, including the extensive level of health care they receive and their phenotypic diversity. Approximately 400 inherited diseases similar to those of humans are characterized in dogs, including complex disorders such as cancers, heart disease, and neurological disorders. The purpose of this dissertation project is to elucidate the role of genetic variation in the domesticated dog as a model for understanding human disease; it comprises four manuscripts. I establish 1) the utility of the canine model for the study of human disease, 2) the theoretical underpinnings and empirical validation of a novel method for genomewide genetic analysis, and 3) the identification of germline loci associated with the risk of developing osteosarcoma. Finally, 4) I describe a novel mechanism involving structural variation which results in an epigenetic readout for a highly penetrant Mendelian trait. I discuss how each of these studies stands to broadly accelerate biomedical investigation.

DEDICATION

This document is dedicated to the people who supported me in following my dreams.

ACKNOWLEDGEMENTS

I would like to express my deepest gratitude to the individuals who, through their support and encouragement, made this dissertation project possible both academically and personally. First, I would like to thank both my nursing and cognate advisers. Dr. Donna McCarthy helped me to develop my scientific career goals. She demanded my best efforts, and in return devoted considerable time and effort to provide support, including additional assertiveness when needed. Dr. Carlos Alvarez allowed me to learn and work in his genetics lab when I had no prior experience. He was incredibly patient and kind in his teaching, providing me with a strong basis on which to build my understanding of genetics. He allowed me to participate in countless activities that strengthened my skills as a future principal investigator. To both of them, I am completely indebted for the investment they made in me. I would like to thank my committee members, Drs. Kim McBride and Jodi McDaniel. Dr. McBride provided much support for the statistical understanding of this project, including assisting in manuscript preparation and principle dialogue that helped to shape this project. Although she joined my dissertation committee near the end of the project, Dr. McDaniel provided insightful comments and questions to spur critical thinking, as well as generous encouragement.

I would also like to acknowledge all of the faculty and staff at The Ohio State University College of Nursing and The Research Institute at Nationwide Children’s Hospital Center for Human & Molecular Genetics for their willing assistance with many technical aspects of these experiments.

I would like to recognize Dr. C. Guillermo Couto, Dr. Sara Zaldivar-Lopez, and Dr. Lilian Marin for sample collection, clinical expertise, and support of this work. Working with them demonstrates the heart of collaborative research.

I would also like to thank Leszek Rybaczyk, Elise Fiala, and Fortune Shea. These members of my lab assisted me throughout this project with technical aspects of this work, and challenged me to further develop my computational and analytical skills.

I would like to personally acknowledge many friends who have become family, Dr. Anjel (Pook) Stough-Hunter and Nikki Seagraves, who, through their constant support and encouragement, made this process easier. I would also like to recognize Gloria Zender, Tracey Edwards, Cassie Span, Shannon Schlagbaum, Valerie Brown, Deb Osborn, Jackie Crouthers, Marni Dunlap, Heather Osborn, and Joel Brummel, who provided much-needed support that enabled me to complete this aspect of my education.

I would also like to express my immense debt of gratitude to Beth McKee and Abbey Carter-Logan. Without their kindness, generosity, example, and ability to see what my future could be, I would not have been able to achieve any of my dreams academically or personally.

I would like to thank my family, whose critical role is often played behind the scenes of such an accomplishment. I thank my grandparents (Alex and Emily Marzek) for all of their love and support. I thank my parents (Dean and Betsy Rowell) for helping to shape my aspirations, teaching me to love learning at an early age and allowing me to pursue my dreams through their own personal sacrifice. I thank my mother (“mumso”) for her unconditional love, constant encouragement, and especially for reminding me “this too shall pass”. I thank Amy and Brian Newby for their unconditional love, reassurance, and sense of humor. I also thank my brother, who as my roommate was possibly most affected during the last months of completing this process. I appreciate his love, encouragement, and support. Lastly, this work would not have been possible without the grace of Jesus Christ. I have been blessed with an amazing support system professionally and personally.

VITA

2001...... Diploma, Rowell Homeschool

2005...... B.S. Nursing, Cedarville University

2009...... M.S. Nursing Research, College of Nursing,

The Ohio State University

PUBLICATIONS

Rowell JL, McCarthy DO, Alvarez CE. Dog models of naturally occurring cancer. Trends Mol Med. 2011 Mar 23. PMID: 21439907.

Rowell, J.L. & Blakely, W.P. (2008). Discussion questions in Pathophysiology: Concepts of altered health states, 8th Ed. Edited by Carol Porth. Philadelphia: Lippincott, Williams & Wilkins.

FIELD OF STUDY

Major Field: Nursing

TABLE OF CONTENTS

ABSTRACT
DEDICATION
ACKNOWLEDGEMENTS
VITA
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1: INTRODUCTION
CHAPTER 2: DOG MODELS OF NATURALLY OCCURRING CANCER
CHAPTER 3: GENOMEWIDE INTERSECTION-UNION ANALYSIS (GIA) INCREASES THE POWER TO DETECT GENETIC ASSOCIATIONS
CHAPTER 4: USE OF GENOMEWIDE INTERSECTION-UNION ANALYSIS IDENTIFIES RISK LOCI FOR CANINE OSTEOSARCOMA
CHAPTER 5: GENETIC AND MOLECULAR MECHANISMS OF A MENDELIAN TRAIT WITH EPIGENETIC READOUT: CANINE BRINDLE COAT PATTERN
    Materials and Methods
CHAPTER 6: CONCLUSION
REFERENCES

LIST OF TABLES

CHAPTER 3: GENOMEWIDE INTERSECTION-UNION ANALYSIS (GIA) INCREASES THE POWER TO DETECT GENETIC ASSOCIATIONS

Table 3.1. Comparison of published GWA results and GIA reanalysis for four traits in LUPA data
Table 3.2. QTL of size

CHAPTER 4: USE OF GENOMEWIDE INTERSECTION-UNION ANALYSIS IDENTIFIES RISK LOCI FOR CANINE OSTEOSARCOMA

Table 4.1. Phenotype information of Greyhounds used in this study
Table 4.2. Listing of all of the reference genome from the UCSC genome browser within the specified interval in Fig. 4.4
Table 4.3. Population stratification using PLINK
Table 4.4. Haplotypes of 15 SNPs associated with the development of osteosarcoma in racing Greyhounds and suggestion of cancer role
Table 4.5. Greyhound risk loci cancer implications

CHAPTER 5: GENETIC AND MOLECULAR MECHANISMS OF A MENDELIAN TRAIT WITH EPIGENETIC READOUT: CANINE BRINDLE COAT PATTERN

Table 5.1. Basic colors and associated alleles in dogs
Table 5.2. Breeds with known e/e genotype

LIST OF FIGURES

CHAPTER 2: DOG MODELS OF NATURALLY OCCURRING CANCER

Figure 2.1: Dog Cancer Genetics
Figure 2.2: An example of the clinical relevance of dogs for cancer treatments
Figure 2.3: Prevalence of B- and T-cell lymphoma in dog breeds
Figure 2.4: Translational potential of tumor bearing dogs

CHAPTER 3: GENOMEWIDE INTERSECTION-UNION ANALYSIS (GIA) INCREASES THE POWER TO DETECT GENETIC ASSOCIATIONS

Figure 3.1: Overview of our application of GIA
Figure 3.2: Frequency analysis for groups of resampled datasets using GWA
Figure 3.3: Scree plot of SNPs significantly associated with furnishings
Figure 3.4: The robustness of GIA as observed with simulation
Figure 3.5: A flow diagram representing the GIA process
Figure 3.6: Dog breeds demonstrate extreme variation in ear morphology

CHAPTER 4: USE OF GENOMEWIDE INTERSECTION-UNION ANALYSIS IDENTIFIES RISK LOCI FOR CANINE OSTEOSARCOMA

Figure 4.1: Principal component analysis (PCA) of racing (RRG and OFR) and AKC show Greyhounds
Figure 4.2: Application of the GIA to a published SNP dataset
Figure 4.3: Identification of the previously mapped locus for brindle coat pattern in our Greyhound cohort
Figure 4.4: Gene annotation of candidate association loci
Figure 4.5: GWAS in 36 dogs
Figure 4.6: Comparison of Illumina genotyping calls to sequencing data
Figure 4.7: Histogram of alleles in OFR and OSA greyhounds for SNP chr34:35,156,555
Figure 4.8: Histogram of alleles in OFR and OSA greyhounds for SNP chr10:5,894,741
Figure 4.9: Kaplan-Meier analysis of ZBBX
Figure 4.10: Graph of homozygosity within region chr10:5,647,387-6,238,255
Figure 4.11: Gene Atlas expression values BioGPS

CHAPTER 5: GENETIC AND MOLECULAR MECHANISMS OF A MENDELIAN TRAIT WITH EPIGENETIC READOUT: CANINE BRINDLE COAT PATTERN

Figure 5.1: Coat Color in Greyhounds
Figure 5.2: New CNV calling algorithm identifies CNV overlapping K-locus
Figure 5.3: 1M probe oligonucleotide array confirms CNV in Brindle dogs
Figure 5.4: Southern blot of Left breakpoint with restriction enzyme 1
Figure 5.5: Southern blot with Restriction Enzyme 2
Figure 5.6: Repetitive elements within the breakpoint region
Figure 5.7: PCR Scanning of the region to narrow breakpoint
Figure 5.8: DLA results
Figure 5.9: UCSC genome browser with annotation
Figure 5.10: A Non-Long terminal repeat retrotransposon
Figure 5.11: A B2 SINE assists transcription during pituitary development in mouse as a boundary element
Figure 5.12: Expression of CBD103
Figure 5.13: A model of a regulatable epigenetic switch created by CTCF and Tsix
Figure 5.14: Predicted CTCF Locus
Figure 5.15: Bisulfite Methylation Results
Figure 5.16: Spliced cDNA
Figure 5.17: A schematic overview of region associated with Brindle

CHAPTER 1: INTRODUCTION

Recently, there has been growing interest in the application of genetics to human health. The completion of the human genome DNA sequence in 2003 led to a rapid increase in gene and mutation discovery; currently, ~3,500 Mendelian disorders are catalogued for which at least some of the molecular basis is understood [http://omim.org/statistics/entry; 1, 2]. However, several questions have yet to be answered. For instance, why do individuals with the same Mendelian disorder not have the same identifiable mutations, and why are individuals with identical mutations (even within the same family) not equally affected with the resultant phenotype [3]? One hypothesis is that genetic variation within the primary gene’s regulatory sequence could explain such discrepancies. A recent study on Joubert syndrome (a human ciliopathy) found that mutation of either one of two adjacent genes, TMEM138 and TMEM216, causes a phenotypically indistinguishable disease [4]. Despite a lack of sequence homology between the genes, a conserved regulatory element in the noncoding intergenic region controls their coordinated expression, which is essential for their interdependent cellular role [4].

Another example is seen in Hirschsprung disease, a congenital intestinal aganglionosis (a developmental defect associated with the lack of intramural ganglion cells in the gastrointestinal tract) [5]. Disease symptomatology is contingent on both a mutation in the RET gene and a common population variant in an intronic enhancer. Strikingly, the population variant is inherited from a parent who does not contribute the actual RET mutation [5]. As this study shows, a disorder that is monogenic is not necessarily monocausal [3]. This is a tenacious challenge when studying complex outbred populations like humans, and suggests that alternative models are needed to facilitate dissection of human disease.

One such model is the domesticated dog (Canis lupus familiaris). Within the last 7 years, canine genetics has shown its powerful applicability to human disease, with >80 canine disease mutations now known to have a human disease analog [6, 7]. The genetic advantages of dogs are based largely on the occurrence of two population bottlenecks, one when dogs diverged from the gray wolf and another at breed formation.

Archaeological evidence suggests that an early period of dog domestication from the gray wolf happened well before any other animal or plant domestication (proto-domestication, largely unintentional), as early as ~35,000 YA [8]. Domestication took place globally at that time, as evidenced by the remains of a 33,000-year-old dog found in southern Siberia whose morphological features are consistent with a transitional form between wild wolves and domesticated dogs [9]. The major period of dog domestication related to today’s breeds occurred ~15,000-16,000 YA, when the gray wolf and dog diverged in the Middle East, with potential secondary sources from Europe and East Asia [10, 11].

Subsequently, a second pronounced population bottleneck occurred ~200 years ago when most dog breeds were created by the selection of morphological and behavioral traits [7]. This was vastly accelerated during the controlled breeding practices of the Victorian era (circa 1830–1900), when crosses between breeds from divergent genetic lineages became highly desirable [12, 13]. Today’s breeds are essentially isolated genetic populations whose genetic similarities and differences can be exploited to identify disease mutations [7].

The purpose of this dissertation project is to elucidate the role of genetic variation in the domesticated dog as a model for understanding human disease. This project is composed of four manuscripts. (1) “Dog Models of Naturally Occurring Cancer” provides background on dogs as a genetic model for studying human diseases, with a focus on three types of cancer (soft tissue sarcomas, osteosarcoma, and lymphomas). (2) “Genomewide Intersection-Union Analysis (GIA) Increases the Power to Detect Genetic Associations” provides the theoretical underpinnings and validation of a novel method for genomewide genetic analysis, the GIA. (3) “Use of Genomewide Intersection-Union Analysis Identifies Risk Loci for Canine Osteosarcoma” demonstrates an application of the GIA methodology and reports the first genomewide genetic association study of osteosarcoma risk in Greyhounds. Empirical validation of two loci is reported. Both of the above manuscripts are based on genetic variation as determined using single nucleotide polymorphism (SNP) genotyping data. Finally, (4) “Genetic and molecular mechanisms of a Mendelian trait with epigenetic readout: canine brindle coat pattern” addresses another type of genetic variation: copy number (CN). Here, we elucidate the complex mechanism that governs brindle coat color in dogs. We discuss how these findings can be broadly generalized to unappreciated mechanisms in human disease.

Taken together, this dissertation project establishes 1) the utility of the canine model for the study of human disease, 2) the theoretical underpinnings and empirical validation of a novel method for genomewide genetic analysis, 3) the identification of germline risk loci associated with development of osteosarcoma, and 4) the description of a highly penetrant Mendelian trait that is due to a novel mechanism involving structural variation which results in an epigenetic readout. I discuss how each of these studies stands to broadly accelerate biomedical investigation.

CHAPTER 2: DOG MODELS OF NATURALLY OCCURRING CANCER*

*Rowell JL, McCarthy DO, Alvarez CE. (2011). Dog models of naturally occurring cancer. Trends Mol Med. Jul;17(7):380-8. Epub 2011 Mar 24. Review. PubMed PMID: 21439907. Reprinted by permission from Elsevier.

Abstract

Studies using dogs provide an ideal solution to the gap in animal models for natural disease and translational medicine. This is evidenced by approximately 400 inherited disorders being characterized in domesticated dogs, most of which are relevant to humans. There are several hundred isolated populations of dogs (breeds) and each has a vastly reduced genetic variation compared with humans; this simplifies disease mapping and pharmacogenomics. Dogs age five- to eight-fold faster than do humans, share environments with their owners, are usually kept until old age and receive a high level of health care. Farseeing investigators recognized this potential and, over the past decade, have developed the necessary tools and infrastructure to utilize this powerful model of human disease, including the sequencing of the dog genome in 2005. Here, we review the nascent convergence of genetic and translational canine models of spontaneous disease, focusing on cancer.

New Models of Complex Disease Needed

The greatest challenge facing clinical scientists is an incomplete understanding of the genetic basis for complex human diseases [14]. Despite a myriad of technological advances in genetics, progress has been slow. This is owed, in part, to intricate gene-gene interactions and poorly understood environmental effects [15]. These interactions and environmental influences are difficult to dissect in humans due to the high level of genetic heterogeneity [16]. Most genome-wide association studies (GWAS) have identified only a small fraction of the genetic basis of complex diseases [17]. And yet disease heritability is critical to understanding disease risk, the effects of environment and lifestyle on disease development, and response to treatment.

Much of the research on human disease genetics relies on animal models. The most frequently used model, the mouse, has several advantages. Mice have short gestation times and are small, making their generation relatively rapid and inexpensive compared to other mammals. Moreover, technologies exist to manipulate the expression of genes in the entire organism or in select cells or tissues [18]. However, mouse models of cancer have limitations. The most notable is that tumors arise spontaneously in humans, but must be induced in most mouse models. Whereas human disease is polygenic, genetic manipulations in mouse models often involve one or a few genes and/or environmental conditions that affect expression of specific genes in an inbred mouse line with undetermined human relevance [16]. Mouse models of cancer in humans are thus missing vast gene networks and interactions that are responsible for, or contribute to, disease in humans. Here we discuss the advantages of tumor-bearing dogs as an alternative model

for understanding the genetic basis of human disease [19], highlighting three cancer types as examples.

Advantages of Dog Models

Domesticated dogs (Canis lupus familiaris) are excellent models of human complex diseases for several reasons, including their easy accessibility and prominent status in diverse cultures. For instance, >73 million dogs live in ~40% of households in the USA [20] and 54% of them are considered to be a “family member” [21]. Over 40 billion USD is spent annually on dog health care [21], and dogs are second only to humans in the level of health care received [22]. That, combined with the shared environment of owners and dogs, can be exploited for epidemiological studies of diseases common to dogs and humans.

Next to humans, domesticated dogs have the most phenotypic diversity and known naturally occurring diseases of all land mammals [23]. For example, the average weight of Chihuahuas and English Mastiffs differs by 65-fold. Dogs share ~650 Mb of ancestral sequence with humans that is absent in mice, and canine DNA and protein sequence is more similar to human than mouse is [24; Fig. 2.1A]. Analysis of the 13,816 protein coding genes with 1:1:1 orthology in human, mouse and dog showed that the numbers of lineage-specific non-synonymous substitutions (i.e., amino acid changing; KA) are 0.017, 0.038, and 0.021, respectively [24]. Thus, many aspects of human biology are presumably more relevant in dogs than in mice [25]. Approximately 400 inherited diseases similar to those of humans are characterized in dogs, including complex disorders such as cancers, heart disease, and neurological disorders [26, 27]. More than 40 naturally occurring canine diseases have mutations in a homologous human gene associated with a similar disease [28]. Additionally, depending on breed size, dogs have a five- to eight-fold accelerated aging process compared to humans [http://www.avma.org/animal_health/care_older_pet_faq.asp]. Moreover, dogs are kept as companion animals well into their old age [29, 30]. The most recently available data (2006) show that ~45% of companion dogs were >6 years old [21], the human equivalent of ~60-95. Thus, dog models hold great promise for accelerating the understanding of genetic and environmental contributions to human disease, particularly those that are chronic or associated with aging.

The greatest advantage of dog models is the result of their evolutionary history which involved at least two severe population bottlenecks [27]. The first occurred when dogs were domesticated from wolves ~15,000-40,000 years ago [31]. The second was most pronounced ~200 years ago when most dog breeds were created by selection of morphological and behavioral traits. Today there are ~400 isolated populations or breeds.

Breed creation inadvertently selected many “founder” mutations that are associated with specific traits and diseases; this translates into reduced disease and genetic heterogeneity, consistent with the fact that most breeds are predisposed to a distinct set of diseases.

Because linkage disequilibrium is up to 100-fold greater in dogs than humans, single breeds are powerful subjects for broad genetic mapping [27]. On the other hand, related breeds that share a trait are powerful subjects for fine mapping. This advantage is illustrated by the recent analysis of polyneuropathy with juvenile onset in dogs, which is similar to human Charcot-Marie-Tooth (CMT) syndrome [32]. The comparison of 7 affected and 17 related unaffected control Greyhounds identified a 19.5 Mb region that was homozygous in the affected dogs and contained a 10 bp deletion in N-myc downstream regulated 1 (NDRG1), orthologous to a known human CMT gene. Pedigree information and the extended homozygosity suggest the mutation arose in a popular sire in 1968. Now the disease can be eradicated from the breed through selective breeding, and the dog model can be used to better understand and treat human CMT [32].

Additionally, dogs might provide clues about the “missing heritability” of human complex genetics. Recently, a group of 300 investigators [33] performed a meta-analysis of GWAS (an approach using SNP markers across the entire genomes of many people to find genetic variations associated with a particular disease) of 180,000 individuals characterized for height (known to be 80% heritable). They identified 180 loci that together explain 10% of height heritability. Similarly, Boyko et al. studied 57 quantitative morphological traits in 915 dogs that included samples from 80 breeds; traits included body size and external dimensions, and cranial/dental/long bone size and shape [34]. In contrast to human studies, they found that one to three quantitative trait loci explain the majority of phenotypic variation for most of the dog traits examined. The question now is whether canine complex diseases will turn out to have a similarly simplified genetic architecture.

Cancer Development in Dogs

Dogs are exceptional models of cancer because they naturally develop the same cancers as humans [35]. Indeed, dog tumors are histologically similar to human, and respond similarly to conventional therapies [19]. Although disease course is reported to

be more aggressive in dogs than humans for some cancer types [19], it is not clear whether dog cancer is generally more aggressive than human. This issue is complicated because dog cancers are not treated as aggressively as human, resulting in shorter survival times and faster evaluations of outcomes. Moreover, disease-bearing dogs tend to present for treatment at later stages than humans. Regardless, the significantly shorter duration of canine clinical trials is a major advantage [19; Fig. 2.2]. The disease-free time interval in dogs treated for cancer is 18 months, compared to >7 years needed to assess treatment outcomes in humans [19]. Additionally, many histological types of cancer are associated with similar genetic alterations in humans and dogs. For instance, statistical analysis of genomic alterations in human and dog colorectal tumors showed that samples clustered according to stage, origin and instability status across species [36]. Strikingly, cluster analysis of genome regions affected by DNA copy number alterations showed branching together of human and dog tumors according to colorectal cancer subtypes (vs. species) [36]. This suggests the same genetic pathways are affected in colorectal tumorigenesis in both species. By contrast, species-specific alterations tended to localize to evolutionarily unstable genome regions. These observations thus hint that the alterations common to both species are more likely to cause cancer than those found in only one (i.e., the latter could be irrelevant species-specific mutation hotspots). In summary, dogs are useful in multiple approaches to cancer investigation [37]: breed-specific risk can be used to discover disease pathways; human cancer pathways can be tested for roles, and targeted for treatment, in canine disease; and canine somatic mutations and genome alterations can be used to narrow down human mutations (Fig. 2.1B-D). Below we provide three examples of canine-human comparative oncology.

Soft-tissue sarcomas

Soft tissue sarcomas (STS) comprise 1% of all newly diagnosed cancer types in humans [38] and represent a heterogeneous group of mesenchymal neoplasms which demonstrate a high degree of variation in clinical presentation and cellular morphology

[39]. These genetically complex cancers include angiosarcomas (hemangiosarcomas in dogs), fibrosarcomas, and histiocytomas. Recent advances in immunohistochemistry, cytogenetics, and molecular genetic analysis have allowed a clinically relevant division of STS to improve diagnosis and treatment [40]. Based on clinical and biological variation among these neoplasms, STS can be broadly dichotomized into two groups.

One is characterized by specific, balanced chromosomal translocations, whereas the other typically shows more extensive chromosomal rearrangements leading to recurrent, but non-specific, chromosomal gains and losses [40]. Owing to their complex nature, the specific cells from which most of this group of cancers develop remain largely unknown.

Although some strains of mice have developed spontaneous STS, rodent models generally require induction of STS [41]. By contrast, dogs are an excellent model of STS because they have similar tumor genetic complexity to humans [42]. For instance, two poorly differentiated fibrosarcomas taken from Labrador Retrievers had large chromosomal rearrangements, amplifications and deletions similar to those observed in human fibrosarcoma [43]. Notably, these fibrosarcomas had loss of heterozygosity affecting cyclin-dependent kinase inhibitor 2A and 2B (CDKN2A/CDKN2B). Given that deletions of CDKN2A and CDKN2B have been reported in other cancer types, including STS in humans, this offers a novel target for discovering common pathways and genes in both dogs and humans that affect the development or progression of STS [42].

Another advantage of using canines for studying STS is breed predispositions to specific types of STS, including increased incidences in Flat-Coated Retrievers and Rhodesian Ridgebacks [26]. For example, hemangiosarcomas are relatively common in dogs, accounting for ~5-7% of all observed tumors [44]. The dogs at greatest risk for hemangiosarcoma are Golden Retrievers (GR), German Shepherds, and Boxers [45]. One group recently compared gene expression profiles in hemangiosarcoma tumors from multiple dog breeds [35]. They found that the GR was unique in over-expression of vascular endothelial growth factor 1 (VEGF1) compared to other breeds, whereas VEGF2 was more highly expressed in the other breeds compared to the GR. When VEGF2 expression was blocked in hemangiosarcoma-derived tumor cell lines, the rate of cell growth slowed, except in cell lines derived from GR tumors. This finding implies that the unique genetic background of the GR influenced this breed’s susceptibility to the development of hemangiosarcoma, suggesting that canine tumors can be used to understand how genetic background can influence susceptibility of an individual to non-inherited cancers. Clinical trials involving tyrosine kinase inhibitor (TKI) treatment of STS found that the most effective TKIs (such as ) also targeted all VEGF isoforms [46]. Performing clinical trials on pedigree dogs, such as the GR, could provide novel information regarding genetic background effects on tumor progression. Thus, given the increased incidence of STS in dogs, the diversity of naturally occurring ‘complex’ and ‘simple’ sarcoma similarity in humans and dogs, and the availability of different genetic backgrounds across breeds for clinical therapy testing, the canine model is more relevant than other animal models for direct human STS applications.

Osteosarcoma

In humans, the most commonly diagnosed primary malignant tumor of the bone is osteosarcoma (OSA). It is the third most frequent cancer in adolescents and represents over 56% of all bone tumors. The prognosis for patients with metastatic OSA is poor, with only 20% surviving event-free for 5 years post-diagnosis [47], and > 30% of patients do not respond to chemotherapy [48]. Roughly 10,000 dogs are diagnosed with OSA yearly in the USA [49], compared to 2,650 new cases of human primary bone cancer [including OSA, Ewing sarcoma, malignant fibrous histiocytoma, and chondrosarcoma; http://www.cancer.gov/cancertopics/types/bone/]. Because there is no consistent method for reporting cancer in dogs, we estimate OSA incidence is at least 13.9/100,000 [21, 50], as opposed to the actual incidence of 1.02/100,000 in humans (across all ages) [51]. In both humans and dogs, OSA has a bimodal age distribution and the main cause of death is pulmonary metastasis. It accounts for 85% of malignancies originating in the bone [52] in large and giant dog breeds [53], which have an OSA risk 61 times higher than all breeds [45]. The canine disease is much more aggressive than the human, with surgical treatment alone producing a 5% survival rate [49]. The same treatments for OSA are used in both humans and dogs [54]. Dogs develop OSA at similar sites as humans and both have similar histology and response to treatment [49, 55].

Indeed, dogs have been a valuable model of OSA since they first participated in clinical trials pioneering limb salvage techniques that are now used in humans [56].

In addition to the similarity of tumor biological behavior of human and dog OSA, recent studies identified parallel genetic features [57]. Both human and canine OSA have a 75% aneuploid DNA index, and both share similar genetic alterations [55]. Moreover, many candidate genes implicated in pediatric OSA have also been implicated in the canine disease: phosphatase and tensin homolog (PTEN), retinoblastoma 1 (RB1), ezrin (EZR), met proto-oncogene (hepatocyte growth factor receptor; MET), v-erb-b2 erythroblastic leukemia viral oncogene homolog 2, neuro/glioblastoma derived oncogene homolog (ERBB2), and tumor protein 53 (TP53) [58]. The commonly affected TP53 tumor suppressor pathway has similar alterations in human and canine OSA [59].

Because human TP53 is more similar to dog than mouse [60], and because mutations occur naturally in dogs, the canine OSA model is presumably more relevant to humans.

Additionally, recent work in dogs focused on differential OSA tumor expression of genes associated with short and long term survival [61]. In experiments using cDNA microarrays, investigators found deregulated expression of the following signaling pathways that were previously reported in human OSA: Wnt, chemokine/cytokine, apoptosis signaling, , and Ras [61]. Co-expression of hepatocyte growth factor

(HGF) and the proto-oncogenic receptor c-Met are implicated in growth, invasion, and metastasis in human OSA. Although more frequently over-expressed in human OSA, another study found co-expression of HGF and c-Met in all 59 OSA canine tumors in the study, with over-expression of both present in 24% of cases [62]. Other investigators

identified two genes, interleukin 8 (IL8) and solute carrier family 1 (glial high affinity glutamate transporter), member 3 (SLC1A3), that were uniformly expressed in all canine OSA tumors, but not in all human pediatric OSA tumors. However, pediatric patients who did over-express IL8 and SLC1A3 had poorer outcomes than those who did not [63].

Yet another gene expression study of canine OSA tumors identified 10 significantly differentiated pathways between responders to treatment and non-responders [64]. These pathways [including cyclic adenosine monophosphate (cAMP) signaling; Chemokines and Adhesion; Sonic Hedgehog and Parathyroid Hormone Signaling Pathways in Bone and Cartilage Development] are also disrupted in human cancers. These various findings suggest that alterations of similar pathways occur in human and canine OSA, but that species-specific genetic changes might account for overall disparity in incidence and aggressiveness. Related to that, Phillips et al. used a whole-genome linkage approach to map OSA segregating in a four-generation pedigree of Scottish Deerhounds [56]. They found evidence of linkage (Zmax=5.766) consistent with a dominant OSA mutation in a 4.5 Mb region of 34q16.2–q17.1 (syntenic to human 3q26). Because OSA is relatively rare and most cases are sporadic in humans, inherited forms and different risks across dog breeds offer a great opportunity to identify pathogenetic pathways.

OSA tumors in dogs and humans also share DNA structural changes. Analyzing 38 OSA tumors from 29 Rottweilers and 9 Golden Retrievers, a recent study demonstrated that, as with its human counterpart, dog OSA has a tendency toward highly complex and chaotic karyotypes [65]. These comprise structural and numerical aberrations, including gene dosage imbalances of known oncogenes and tumor suppressors. The most frequently observed genome alteration was an amplification affecting both the MYC and KIT (c-KIT) oncogenes. This is consistent with observations of genome alterations in human OSA that are predictive of clinical outcome. Notably, KIT was recently proposed as a novel therapeutic target for pediatric OSA [66]. This not only supports the genetic relevance of the canine model, but also the clinical utility of including dogs in OSA clinical trials. Thus, the canine OSA model recapitulates the human cancer and, because OSA occurs 20 times more often in dogs than humans [55], it provides an unparalleled opportunity for identifying key cellular pathways in this cancer [25].

Lymphomas

Cancers affecting the lymph tissue are collectively known as lymphomas. Lymphomas represent ~5% of all human cancers in the United States and account yearly for 4.6 billion USD worth of treatment [http://www.cancer.gov/aboutnci/servingpeople/snapshots/lymphoma]. One specific class, non-Hodgkin’s lymphoma (NHL), occurs in B- or T-cells, with >65,000 new cases reported in 2009 [for types of NHL, see http://www.cancer.gov/cancertopics/types/non-hodgkins-lymphoma]. Notably, the incidence of NHL is increasing but the etiology remains obscure [67]. Thus, an alternative model of lymphoma is needed to elucidate the causes and identify clinically meaningful cancer biology. Dogs and humans have similar tumor biology, tumor biological behavior, and genetic aberrations. The incidence of lymphoma in humans and dogs is similar [68]: 15.5-29.9 [69] and 15-30 [70] per 100,000, respectively. The most common type of NHL is the same in both humans and dogs – diffuse large B-cell – and the same chemotherapy agents are used to treat it [68].

An additional advantage of the dog model is the increased prevalence of lymphoma within specific dog breeds. Lymphoma is the most common life-threatening cancer in all dogs, accounting for 24% of all canine cancers [71]. Approximately 1 in 4 Boxers and 1 in 8 Golden Retrievers develop lymphoma [45]. Additionally, there is a breed-specific distribution of B-cell and T-cell lymphomas [72; Fig. 2.3]: whereas excess incidence of T-cell lymphomas was noted in 10 breeds, the most striking occurred (in order of observed frequency) in Irish Wolfhounds, Siberian Huskies, and Shih Tzus. By contrast, the breeds with excessive occurrence of B-cell lymphomas were Cocker Spaniels and

Basset Hounds. A second study conducted in Norway grouped together all types of lymphomas and also identified excessive occurrence of lymphomas in specific breeds, lending credence to a breed-specific risk for lymphoma development [73]. They found the relative risk for lymphoma was highest in the Boxer and Flat-Coated Retriever. More recently, a study examined records from the Veterinary Medical Database and selected cases with a diagnosis of lymphoma type not specified, giant follicular lymphoma, and lymphosarcoma and used controls with any diagnosis other than lymphoma [74]. This study also identified a breed-specific risk for lymphoma, with the highest-risk breeds including Bullmastiff [odds ratio (OR) 4.83 vs control], Boxer [OR 4.05 vs control] and Bernese Mountain dog [OR 3.64 vs control]. Notably, although the former and latter studies examined different subsets of lymphoma, they included many of the same breeds and had similar findings. For instance, the Irish Wolfhound had the highest rate of T-cell lymphoma in the Modiano et al. study [72], and also had an OR of 3.23 for lymphoma compared to other dogs in the Villamil et al. study [74]. The underlying cytogenetic basis

of lymphoma appears to be shared in human and dog. The examination of three canine hematological cancers, including Burkitt lymphoma and small lymphocytic lymphoma [75], showed that these canine cancers shared cytogenetic abnormalities with those characteristic of their human counterparts. This suggests that humans and dogs share common pathways or an ancestrally retained pathogenetic basis for lymphoma [75]. Consequently, by using the dog genome in comparison with the human genome, relevant genetic aberrations can be identified.

Finally, the relevance of dogs as a lymphoma model is supported by use in clinical trials. Given that dogs develop spontaneous B-cell NHL and share many characteristics in common with human B-cell NHL [such as diagnostic criteria and response to a chemotherapy-based regimen that includes cyclophosphamide, doxorubicin, vincristine, and prednisone (commonly referred to as the CHOP chemotherapy)], dogs were recently enrolled in a clinical trial of a selective and irreversible Bruton tyrosine kinase (Btk) inhibitor, PCI-32765, which blocks B-cell activation [76]. Activation of the B-cell antigen receptor signaling pathway contributes to the initiation and maintenance of B-cell malignancies [76]. This clinical trial research began when the same group described the synthesis of a series of Btk inhibitors that bind covalently to a cysteine residue leading to potent and irreversible inhibition of Btk enzymatic activity. In that study, after additional analysis of this agent in both cell lines and mouse models, they initiated a canine clinical trial.

Although the clinical trial is ongoing, to date 8 dogs have been treated, with 3 demonstrating stable disease and 3 with partial responses, including one dog with a 77% decrease in tumor size [this drug is now undergoing human clinical development in patients with B-cell malignancies]. Finally, a recent pilot study used anti-human leukocyte antigen (HLA)-DR monoclonal antibody (mAb) as a treatment for dogs with lymphoma [77]. Preliminary results demonstrated that humanized IgG4 anti-HLA-DR, currently under evaluation preclinically for human trials, also bound malignant canine lymphocytes. These findings provide justification for using dogs with lymphoma in safety and efficacy evaluations of therapy for both veterinary and human purposes [77, Fig. 2.4].

Potential utility of dogs in translational medicine

The naturally occurring relevance of the canine model to cancer in humans can be exploited to generate new treatments relatively quickly (Fig. 2.4). Whereas there are strict Food and Drug Administration (FDA) regulations concerning treatments to be used and commercialized, as well as for clinical trials in humans, there are fewer regulations for Phase I/II/III clinical trials before drug use in pets [78; http://prsinfo.clinicaltrials.gov].

Rather, it is left to the discretion of the owner, who could approve the use of investigational therapeutics before conventional treatments. There are several trends in drug development that suggest increased use of dogs as translational models. Two of these are the rising proportion of biological vs. chemical compounds, and the growing focus on targeting genetic/biochemical pathways (or disease subtypes) vs. broad diseases or types of cancer. Here we propose that dogs are ideal patients in which to develop novel therapeutics. Several facts indicate that using dogs in translational medicine can greatly accelerate drug development: reduced regulatory guidelines, vastly diminished and soon-to-be fully defined genetic variation within breeds (although the level of variation across all breeds is similar to that in humans), reduced disease heterogeneity (i.e., breed-specific risks of diseases are often associated with a single founder mutation), and accelerated aging/disease progression compared to humans. These genetic benefits translate into faster progress at every stage – e.g., identifying disease mutations in discovery, identifying biomarkers and endpoints in clinical trials, and using pharmacogenetics from preclinical research to post-approval studies. Indeed, dogs have been instrumental in rapid development of biological and biological-like therapeutics, including gene therapies

(e.g., for specific inherited forms of muscular and retinal dystrophies [79]) and antisense morpholino oligonucleotides (e.g., to alter mRNA splicing and avert nonsense-mediated decay of dystrophin [80]). However, we believe dog patients are greatly underutilized in development of therapeutic interventions. Drug development is difficult and risky, with the average drug costing ~800M USD to develop. One of the most challenging go/no-go decision points is determination that a therapeutic agent is effective in humans. This is established by a small clinical study of select subjects that will likely respond to therapy. Dog breeds with known disease mutations are ideal lead-ins to such studies. Depending on the disease, such proof-of-concept studies could be done robustly in even fewer than 10 subjects, and at a pace proportional to the accelerated disease progression. Such studies would not only establish efficacy, pharmacokinetics/dynamics, and toxicity, but also dosing, biomarkers/endpoints, and adverse effects. This could dramatically reduce the failure rate of human proof-of-concept studies, and thus time and costs.

Concluding remarks and future perspectives

Dogs are uniquely suited for use as an animal model of complex human disease due to their phenotypic diversity and the similarity of their naturally occurring diseases to human conditions. The evolutionary history of dogs, their position as a family member in many households, and the high level of health care they receive offer tremendous opportunities. That, combined with recently developed genetic resources, makes dogs outstanding models for the study of known genetic pathways, discovery of genetic and environmental contributions to disease, and translational studies in cancer risk, prevention, and treatments [19, 27]. The full utilization of canine models of cancer will require expertise in basic science, translation, and direct clinical relevance. This will necessitate large collaborations across almost all aspects of veterinary and human medicine, including molecular biology and genetics, epidemiology, pharmacology, bioinformatics, statistics, and engineering. Developing these pipelines now will speed potential therapeutic outcomes. Although this review focused on the relevance of the dog as a model for research in cancer genetics, biomedical research has long included canine models of numerous other diseases and their treatments [27]. For example, dogs are also increasingly used in behavioral research, including learning [81], social cognition [82], and the effects of diet and behavior enrichment on executive functioning [83]. Increased appreciation of the unique and comparative value of the dog as a model of diverse human diseases should accelerate research leading to new treatments, and improved health care for both us humans and our best friend.

Figure 2.1: Dog Cancer Genetics (A) Protein sequence conservation in dogs. (A top) Phylogenetic tree of the mammalian c-Met receptor. The branching pattern corresponds well with the organismal relationships. For example, the Boreoeutheria clade comprises two sister taxa which include primates, rodents, rabbits and a taxon including carnivorans and most hoofed animals. Although mouse and human c-Met branch together, the branch length of mouse c-Met shows that the protein sequence is more divergent than human and dog (scale bar shows amino acid changes per site). (A bottom) Dog proteins are more similar to those of humans than are mouse proteins. Phylogenetic treeing analysis of a composite of 10 cancer proteins branches human and dog proteins apart from mouse with a bootstrap value of 100. The following proteins were included: MYC, ERBB2, KIT, ret proto-oncogene (RET), v-raf murine sarcoma viral oncogene homolog B1 (BRAF), PTEN, RB1, CDKN2A, breast cancer 1, early onset (BRCA1), TP53. [Neighbor-Joining trees shown (500-replicate bootstrap values); Maximum Parsimony topology is the same]. (B) Examples of breed-specific germline variation with potential cancer relevance. (B top) Common missense variant in Rottweiler c-Met receptor. WebLogo analysis shows a close-up of the consensus amino acid sequence of c-Met from 23 mammals. Letter height corresponds to the frequency of a given amino acid at each position, with the highest letters signifying complete conservation. 70% of Rottweilers have a missense variant at Gly 966, which is located in the extracellular region and could thus affect ligand binding or receptor signaling [84]. (B middle) More than 60% of Rottweilers have a 273-kb copy number variant (CNV) in an intron of CSMD1, but it has not been observed in diverse other breeds (UCSC Browser; human gene transcribed right to left) [85]. (B bottom) Close-up of one of several non-coding conserved elements within the CSMD1 CNV (Vista Browser, conservation with human >60% shown by red coloring). The most conserved region within this area contains three candidate binding sites for the tumor suppressor E2A (another conserved element contains TP53 binding sites [85]). The conservation (which is absent in chicken) is reduced in mouse in comparison to more distantly related mammals, horse and dog. (C) Somatic genome alterations in canine cancer. Kisseberth et al. isolated the OSW T cell lymphoma cell line and identified several genomic alterations [86]. A single two-copy loss was found, and it affects the CDKN2A tumor suppressor gene. Subsequent analysis of OSW by high resolution tiling oligonucleotide-array CGH revealed many additional alterations, including focal two-copy deletions affecting as few as a single gene [85]. (C top) Whole genome display of CGH analysis of OSW [85]. The midline shows a 1:1 DNA ratio to the reference genome of a boxer. Deletion CNVs are segments below the midline, and gains are above the midline (log 2 scale). “Un” denotes unmapped contigs and is highly enriched for repetitive sequences; the Y chromosome is absent from the canFam2 genome assembly. (C middle/bottom) Close-up of CGH analysis of chromosomes 11 and 22. Both chromosomes have 2-copy microdeletions. One confirms complete deletion of the tumor suppressor, p16/CDKN2A. The other spans a single active gene, SLITRK1, which was previously implicated in malignant hematopoiesis [87]. This illustrates how dogs can be used as translational models of known human cancer genetics, as well as for discovery of novel genes in the same genetic pathways.

Figure 2.1. continued (D) Second generation genotyping technology allows the integration of single nucleotide polymorphism and CNV maps. CNVs from two Greyhounds are shown. This 170k oligonucleotide array enables simultaneous SNP genotyping and DNA copy number determination (Illumina CanineHD). For each pair, the top window shows DNA copy number as Log 2 R ratios, with the midline generally corresponding to copy number of 2. The bottom windows show allele frequencies. A copy number gain is detected as an upward shift on Log R ratio, and as a shift from B allele ratios of 1:1 (left and right segments) to 1:2 and 2:1 allele ratios (center segment). A copy number loss is detected as a downward shift in Log R ratio, and as a shift from allele ratios of 1:1 (left, right segments) to an allele ratio of 1:0 (or loss of heterozygosity; center).

Figure 2.2: An example of the clinical relevance of dogs for cancer treatments. Canines are increasingly being used in clinical cancer drug trials to determine the efficacy of treatment, given how closely many of the cancers they develop recapitulate human cancers. (A) A picture of a Boston terrier, a breed predisposed to the development of mast cell tumors. (B) London et al. conducted a clinical trial of an oral tyrosine kinase inhibitor, Palladia, on dogs with recurrent mast cell tumors. Shown here is a Kaplan-Meier survival analysis demonstrating time to tumor progression in placebo-treated and Palladia-treated dogs with mast cell tumors [88]. (C) A breakdown of the clinical trial of Palladia, including the demonstrated advantages of dogs as models of pharmacologic cancer intervention. Reproduced with permission from [88].

Figure 2.3: Prevalence of B- and T-cell lymphoma in dog breeds. A varying excess of T- and B-cell lymphoma, in a breed-specific manner, has been noted. Presented here is the observed percentage of T- vs B-cell lymphoma by breed: Irish wolfhounds (100:0), Siberian huskies (88.9:11.1), Shih Tzus (81:19), Airedale terriers (80:20), Cavalier King Charles spaniels (80:20), and Yorkshire terriers (80:20). By contrast, the breeds with an excessive occurrence of B-cell compared to T-cell lymphomas were cocker spaniels (93.2:6.8) and basset hounds (94.4:5.6) [72]. Photo sources [89-100].

Figure 2.4: Translational potential of tumor-bearing dogs. On the bottom is the typical course of human drug research and development. There is no established paradigm for drug research and development in dogs and other companion animals [19]. Although our schematic mirrors the same process in pets, most drugs used on patient animals are taken from human drug development or are approved human drugs used off-label. Indeed, few regulations exist for Phase I/II/III clinical trials before drugs are used in pets.

CHAPTER 3: GENOMEWIDE INTERSECTION-UNION ANALYSIS (GIA) INCREASES THE POWER TO DETECT GENETIC ASSOCIATIONS*

*Leszek A. Rybaczyk§, Jennie L. Rowell§, Bogdan A. Pathak, Kun Huang, Pramod K. Pathak, Carlos E. Alvarez. Genomewide Intersection-Union Analysis (GIA) Increases the Power to Detect Genetic Associations. Submitted for publication. §Denotes co-first authors.

Abstract

The key question in genomics and genetics is to what extent biological signal can be isolated from noise. For example, one million marker measurements can be considered in each individual of a genomewide genetic association study. Establishing disease association using traditional methods thus requires large populations and tolerating a substantial amount of background noise created by latent variables. The Intersection-Union Test (IUT) is a well-established method for statistical analysis that is broadly used for industrial and biomedical applications, including by the Food and Drug Administration to test the safety of drugs in clinical trials. Here, we report that an adaptation of the IUT we call Genomewide Intersection-Union Analysis (GIA) inherently increases resolvability of true positive associations. GIA overcomes limitations of traditional genomewide genetic association studies by 1) accounting for latent variables without information loss or distortion, 2) not requiring multiplicity testing correction and 3) significantly increasing power. This method is computationally fast and broadly applicable to simple and complex genetics, as well as to genomics. We propose this method will vastly accelerate genetic discovery.

All of the data analyzed are publicly available (with access information provided within) or included as supplemental information.

Introduction

A major challenge for genomewide genetic association (GWA) analysis is distinguishing true signal from noise, regardless of whether such signal is modest or pronounced [101]. This issue has led to the development of new multifaceted theories and statistical analyses to aid in the search for causal variation [102]. Because massive amounts of data are analyzed in those studies, intense focus has been applied to the removal of latent variables (trait-related variables that are not directly measured); the approaches to do this involve various statistical alterations of the data in attempts to isolate the true positives [103, 104]. This task is vastly complicated when using traditional methods that apply thousands to millions of association tests, which necessitate either multiple testing corrections (to account for the family-wise error rate) [105] or posterior probability [106] measures that rely on information that may vary even within the same population. While these methods have achieved some success in the search for variation contributing to common, complex diseases, the majority of causative variation remains unknown [107]. It is thus important to consider novel statistical approaches that can accurately detect the true signal among the background noise, while simultaneously accounting for latent variables. Here, we propose that a modified Intersection-Union Test (IUT), which we call Genomewide Intersection-Union Analysis (GIA), increases the ability to identify true signal despite great background noise.

The IUT is a well-established statistical analysis method used for diverse industrial and biomedical applications, including by the Food and Drug Administration (FDA) to test the safety of drugs in clinical trials [108, 109]. Allison and others have proposed the use of IUT to address intersections between sets of findings in microarray gene expression datasets [110], and he, Kim and colleagues presented an excellent discussion of IUT/UIT uses, including for the analysis of such data across species [111]. Yet, surprisingly, its application has remained conspicuously missing in the area of high-density GWA. [The Union-Intersection Test (UIT) has been applied to GWA for reasons distinct from ours [112]; despite the similar names, the IUT and UIT have very different statistical properties [111].] The absence of IUT use for GWA is likely due to its conservative nature (propensity for false negatives) and possibly to a perceived lack of power. There may also be a common misunderstanding that IUT is more suitable for uses akin to quality control of precision in manufacturing, but not necessarily for genetics, which is subject not only to the laws of physics, but to those of evolution. Here we show that the IUT and GIA are appropriate and powerful for high throughput genomewide analyses.

Our application of GIA is straightforward, involving three primary steps. 1) Stratification of the entire sampled population into multiple groups that each contain the variables of interest. Stratification increases the precision and power of our analysis by intensifying the contrast between case and control populations; ideally, stratification occurs according to known population differences or sub-hypotheses. However, it is also possible to stratify into subpopulations based on random groupings. The goal of this step is to maximize differences unrelated to the trait of interest across groupings in order to eliminate noise and isolate the true signal in future steps. [Notably, depending on the hypotheses under consideration, it may be appropriate to stratify across groups (e.g., including different subpopulations in each group to negate population structure) or within groups (e.g., including the same subpopulations in each group to isolate population structure).] 2) Association testing within stratified groups. Association testing identifies variables that are significantly different between case and control populations. Only markers that are statistically significant will be retained for continued analysis. This results in independent identification of divergent markers. Since each group is assumed to be independent, the accrual of the same markers in multiple groups provides greater evidence of their relationship. 3) Aggregation of significant results across all subgroups. False positives (type I error) are assumed to be random independent events. According to probability theory, if events A and B are independent, then the probability of intersection is the product of probabilities, or p(AB) = p(A) p(B). Therefore, the more groups used, the smaller the probability of type I error. This theorem also allows us to calculate cumulative p-values. Results that remain significant in all subgroups at this stage have their p-values multiplied and are ranked according to the combined p-values.
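The aggregation step described above can be illustrated with a minimal sketch. It assumes that per-group association p-values have already been computed for each marker. The function name `gia_aggregate`, the dictionary layout, and the significance threshold `alpha` are hypothetical choices for illustration and are not taken from the study.

```python
from math import prod


def gia_aggregate(pvalues_by_group, alpha=0.05):
    """Intersect significant markers across groups and rank by combined p-value.

    pvalues_by_group: list of dicts, one per stratified group, mapping a
    marker name to its association p-value within that group (hypothetical
    layout). alpha is an assumed per-group significance threshold.
    """
    if not pvalues_by_group:
        raise ValueError("at least one group is required")

    # Keep only markers that are significant in every group (the intersection).
    shared = set(pvalues_by_group[0])
    for group in pvalues_by_group:
        shared &= {marker for marker, p in group.items() if p < alpha}

    # Under the independence assumption p(AB) = p(A) p(B), multiply the
    # per-group p-values to obtain a combined value for each retained marker.
    combined = {m: prod(group[m] for group in pvalues_by_group) for m in shared}

    # Rank from smallest to largest combined p-value.
    return sorted(combined.items(), key=lambda item: item[1])


if __name__ == "__main__":
    groups = [
        {"snp1": 0.001, "snp2": 0.04, "snp3": 0.5},
        {"snp1": 0.002, "snp2": 0.2, "snp3": 0.01},
    ]
    for marker, p in gia_aggregate(groups):
        print(marker, p)
```

In this toy input only `snp1` is significant in both groups, so it is the only marker retained. A marker that is significant in a single group, such as `snp2` or `snp3`, is discarded no matter how small its p-value is there.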

These three steps combined result in the diminution of the signal related to latent variables and resolution of signal related to the trait of interest. The simple logic of GIA is an alternative application of Berger’s IUT theorem and is applicable to multiple types of genomic data [113]. Our interest is the commonality that exists between the subpopulations in the final step of GIA. The commonality will be due to the trait of interest, and, while any signal that is present due to latent variables may pass the second step of GIA, it will be eliminated in the final step. We propose that this results in 1) increased power, 2) ability to negate latent variables, and 3) avoidance of multiplicity corrections. GIA is thus a paradigm shift from the theoretical underpinnings of not only GWA but of diverse genomic analyses. As noted above, the optimal design of GIAS involves stratification by known confounds. For instance, many trait-associated mutations are shared across dozens or hundreds of the 400 or so existing dog breeds. The idealized GIAS design in such cases would be stratified to include different breeds in each case-control IUT/GIA group. This leverages the known confounds (and latent variables) to increase the disparity between the central measures both within and across groups. An important trait of the IUT is that it is highly flexible and, even with random groupings, still exploits countervailing noise to isolate true signal. To demonstrate the utility of GIA, we conducted two studies. The first, reported here, establishes the theoretical basis along with analysis and simulation studies of published data. In a second, partnering study (Rowell et al., submitted for publication), we validate the method by mapping a known locus and successfully conduct the first association study of the complex trait of osteosarcoma risk in a cohort of 36 Greyhounds (followed by validation in a second group of twice as many dogs). The present study lays the theoretical foundation of GIA and demonstrates its power, robustness, and validity.

Results

In Methods, we provide the theoretical framework of GIA along with the theorems and proofs that establish its mathematical validity. In order to test the performance of GIA, we conducted simulation studies using a public canine dataset from the LUPA Consortium [114] that consists of single nucleotide polymorphism genotypes (~174,000 SNP platform) for 456 dogs from 30 breeds [115]. We selected this dataset from Vaysse, Ratnakumar and colleagues because phenotype-genotype associations were obtained using traditional GWA and the published findings include multiple Mendelian and complex traits (or nominal variables and Quantitative Trait Loci, QTLs). Additionally, we sought to focus our initial testing of GIA in dogs, which represent an outbred population with great phenotypic diversity [7]. The total levels of genetic variation in humans and dogs are similar; however, the difference between human populations corresponds to 5-10% of that, while the difference between dog breeds is approximately 30% [116]. The LUPA dataset is thus ideal here for two aspects of its inter-breed design: increased latent variables across breeds and breakdown of the extensive linkage disequilibrium that exists within breeds.

To evaluate the performance of GIA, we first conducted simulation analysis using the canine trait of furnishings (which includes exaggerated moustache and eyebrows) from the LUPA dataset, which was shown to segregate the causal variant. Cadieu et al. originally identified the furnishings variant as a 167 bp insertion in RSPO2, which begins at chr13:11,634,766 (CanFam2 assembly) and occurs within a 718 kb haplotype from chr13:11,593,074-11,718,754 [117]. Likewise, Vaysse et al. identified that locus for furnishings, with the most significant SNP at chr13:11,678,731 and a genomewide significance pattern observed between 10.42-11.68 Mb. We simulated 1000 replicates of the LUPA data by naïve bootstrapping (simple random sampling with replacement; Fig. 3.1A), ensuring independence of replicate data. We performed association tests on the trait of furnishings and conducted a frequency analysis to determine the percentage of times that we could correctly identify the true associated SNP using GWA and GIA on this simulated data, and then validated this approach in a second independent simulation set.

Using the traditional GWA method, we found that although GWA was able to detect the previously associated SNP, it lacked specificity (Fig. 3.2). Distinguishing actual signal from noise was primarily contingent on the replicate group used for analysis, suggesting substantially limited power based on the latent variables of the samples. In contrast, using GIA we were able to identify only the true locus in all resampled datasets. We performed GIA on two (2 groups of 456 dogs x 500 GIA) and four resampled datasets (4 groups of 456 dogs x 250 GIA) and calculated the FI, where a value of 1 indicates a significant SNP present in all GIAs (Fig. 3.1A). In both the two- and four-group GIA, we identified 8 loci associated with furnishings that spanned the known causal locus, from chr13:10,210,459-11,678,731 (two groups FI=0.93-1; four groups FI=0.97-1). Most strikingly, we identified 3 SNPs with FI=1 in both two and four groups. One of these SNPs (chr13:11,678,731) was identified by Vaysse et al. as the SNP most significantly associated with furnishings. Our two additional SNPs (chr13:11,660,194 and chr13:11,659,792) are nearest chr13:11,678,731 (Fig. 3.3). These results indicate increased specificity of GIA compared to GWA.

We next assessed the robustness of GIA by designing our analysis to be minimally powered to detect true associations. That is, instead of using the ideal GIA design, which would involve maximizing the contrast through stratification according to a known confound (such as breed), we used randomized grouping. We analyzed the 456 dogs from each resampled simulation by proportionally allocating the dogs with and without furnishings (cases and controls) into four groupings by simple random sampling without replacement. Each subgroup contained approximately 114 dogs (456 dogs / 4 groups x 1000 GIA; Fig. 3.1B). GIA was then performed within (instead of across) each of the simulated sets in these four groups in order to test the sensitivity of our procedure. Our results again identified the same 8 significant SNPs (chr13:10,210,459-11,678,731, with FI=1; Fig. 3.4).

After the simulation set verified the increased power and robustness of GIA, we next sought to validate GIA sensitivity directly within the LUPA data (Fig. 3.1C). We focused on four traits mapped in that study: boldness, ear erectness, size, and sociability [115]. We set the false positive rate (α) at 0.05 and thus calculated our combined α across all intersections to be 3.13 x 10^-7 (α^n, where n = number of data groups). This, combined with the total number of SNP markers used, requires a total of five datasets to obtain a false positive rate <1 (total markers used × α^n; 171,361 × 0.000000313 ≈ 0.054 false positives). For each trait, we randomly distributed all of the 456 dogs with phenotype information into five groups (Fig. 3.5), with each of the breeds represented within all five groupings (again, this random distribution is not the optimal application of GIA, but provides increased evidence for the validity of the method); we performed the random groupings separately for phenotype positive and negative dogs to ensure we had proportional allocation of cases and controls within each grouping. Using PLINK [118], we performed case-control association analysis by chi-square testing in each group. We compared the significant SNPs across all five groups, retaining only the SNPs that were significant at p<0.05 in each of the groups (Table 3.1). For each of those SNPs, we multiplied the p-values across all groups to generate a final confidence measure that can be used for ranking hits.
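The arithmetic behind the choice of five groups can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names and the `max_expected` threshold of one expected false positive are assumptions introduced here, and the calculation treats the per-group tests as independent with every null hypothesis true.

```python
def combined_alpha(alpha, n_groups):
    """Chance that a null SNP passes the per-group cutoff in every group,
    assuming independent groups (alpha ** n)."""
    return alpha ** n_groups


def expected_false_positives(n_markers, alpha, n_groups):
    """Expected number of null SNPs retained across the whole platform."""
    return n_markers * combined_alpha(alpha, n_groups)


def min_groups(n_markers, alpha, max_expected=1.0):
    """Smallest number of groups giving fewer than max_expected false positives.
    max_expected is a hypothetical parameter chosen for this sketch."""
    n = 1
    while expected_false_positives(n_markers, alpha, n) >= max_expected:
        n += 1
    return n


n_markers = 171_361  # SNP markers used, as stated in the text
alpha = 0.05

for n in range(1, 6):
    print(n, combined_alpha(alpha, n), expected_false_positives(n_markers, alpha, n))

print("groups needed:", min_groups(n_markers, alpha))
```

With four groups the expected count is still slightly above one, which is why a fifth group is needed under these assumptions.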

For the Mendelian, dichotomous variable of boldness [as defined in 119], Vaysse et al.’s top locus was observed 271 kb downstream of a previously size-associated region (within an intron of HMGA2, ~chr10:11,195,975 [10, 119]), with additional highly significant SNPs at chr10:11,440,860 and chr10:10,804,969. When we applied GIA to that same dataset, 8 of the top 10 SNPs associated with boldness spanned this same locus (10,703,666-11,758,427; p = 9.39 x 10^-44 to 6.33 x 10^-77) and included both of the SNPs identified by them (11,440,860 and 10,804,969), as well as the HMGA2 locus previously associated with boldness [119].

Next, we applied GIA to three complex traits: ear erectness, size, and sociability [115, 120]. Dog breeds demonstrate a wide variation in ear morphology, from prick ear to drop ear (Figs. 3.6A, B). In phenotyping this dataset, dog breeds were assigned a number from 1 (prick ear) to 5 (drop ear). Vaysse et al.’s most strongly associated SNP occurred at chr10:11,072,007. This region, between ~10.4 and 11.4 Mb on chr10, was previously shown to be associated with ear type [34]. Using PLINK’s Quantitative Trait Association, an asymptotic p-value was generated (using a combination of a likelihood ratio test and Wald test), and then GIA was performed. Strikingly, we had 14 SNPs that were most significantly associated with the ear phenotype; these SNPs were all located on chr10:10,462,369-11,792,711 (p = 2.90 x 10^-50 to 4.22 x 10^-96; Figs. 3.6C-E). Not only was our most significantly associated SNP the same as that identified by Vaysse et al., but we also identified SNPs comprising a nearby haplotype previously associated with ear phenotype [34].

Then, we considered the complex trait of breed size, using weight as a proxy [115]. Vaysse et al. identified chr15:44.23-44.44 Mb as being strongly associated with size, with the top hit at chr15:44,242,609. They also observed an association within a known dog size locus at chr10:11,169,956 that is 500 kb away from HMGA2, which is implicated in human size [121]. Within the top 9 GIA regions of association, not only did we identify the same locus (chr15:44,216,576-44,437,773; p = 3.22 x 10^-44 to 3.50 x 10^-58), but we also identified the same chr10:11,169,956 SNP (Tables 3.1 & 3.2).

Finally, we considered the trait of sociability. Vaysse et al. identified a region on chrX that, while not statistically significant in their GWAS, did show strong evidence for a potential association with sociability. In order to accurately measure their genomewide values for the X chromosome (and achieve statistical significance), they removed all males from the analysis and identified 10 SNPs on chrX:106.03-106.61 Mb. Using GIA, we identified 7 SNPs on chrX:106.30-106.61 Mb (p = 3.50 x 10^-28) without removing any males from the analysis. [We did not have sex information for the LUPA data and thus could not compare our results with and without males. We identified additional significant hits outside the X chromosome that will have to be validated in future studies.] Gender can be considered an extreme form of population structure affecting all SNPs on the sex chromosomes. The X chromosome is often removed for statistical analysis because the additional copy in females can inflate apparent differences that are related to gender rather than to disease. Strikingly, GIA was impervious to this latent variable. Thus, we were able to show the specificity of GIA on these four additional phenotypic traits applied directly to the LUPA dataset.

Discussion

The theoretical basis of GIA is a combination of information theory, the law of parsimony and the IUT. Information theory stipulates that information is lost each time a variable is manipulated [122]. Any data transformation results in information loss or distortion, which can adversely influence understanding of the complexity of individual occurrences [123]. This suggests that currently accepted practices of manipulating genomewide data in order to control for latent variables may not be desirable, and it implies that the largest amount of information will be retained using a method that maximizes the probability of correctly identifying true positives. Applying the law of parsimony leads to the conclusion that the proposed method must be sufficiently simple to ensure minimal information loss. Finally, GIA leverages the IUT's ability to account for latent variables and confounds by looking across groups and only retaining statistically significant findings. This decreases the differences that result from random variation and accumulates the effects due to association of the SNP with the trait of interest. These theories combine within GIA to provide a simple method that minimizes information loss, leverages latent variables to increase the disparity between the central means, and yet increases resolvability.

GIA conceptually differs from past genetic analysis because it is focused on detecting the similarities between cases rather than the differences between cases and controls. Because the approach to detection is entirely different, GIA is able to accommodate latent variables rather than correcting for them by traditional means such as principal component analysis and regressions. These manipulations of variables result in loss of information.

Statistically, greater GIA precision is achieved if the aggregate cohort can be divided based on individual sub-hypotheses; however, the flexibility of this method also allows for accurate results with random group assignment. The advantage of GIA over traditional genomewide analyses is increased ability to identify true positives despite high levels of noise, and to do so with significantly reduced numbers of subjects. By using this method, we were able to correctly identify SNPs associated with traits in Vaysse et al.’s large GWAS using fewer samples (Rowell et al., submitted for publication), despite the presence of exaggerated latent variables represented by dog breeds. Further research is needed to optimize this methodology for diverse genomic data (including whole genome sequences) and applications in all species.

In the studies presented here, we provide evidence of increased ability to detect true signals with a much smaller sample size relative to GWA analysis. One indication of this is our ability to detect not only SNPs within haplotypes previously reported by Vaysse et al., but also SNPs in flanking haplotypes that have reduced levels of association. For instance, in ear erectness, we were able to identify Vaysse et al.’s specific SNP of association; but our results also detected SNPs of moderate association in the nearby haplotypes. Notably, the inability to detect moderate levels of association was recently noted to be a significant gap in current GWA methodologies [101]. In a partnering study, we also report strong evidence that GIA has the ability to identify association with dramatically reduced sample numbers (Rowell et al., submitted for publication). However, investigators conducting GIAS need to be aware of its potential limitations. (1) GIA is overly conservative. This can be mitigated by increasing α for statistical calculations within analysis groups; unfortunately this will also increase the false positive rate. (2) Biological relevance of subjects (i.e., shared phenotype and associated genetic variation) within analytical groups is critical to avoid increasing the false positive rate. In this study, we showed how this can be addressed by bootstrapping analysis. (3) Optimal results are attained by maximizing variability across groups and minimizing it within groups. Beyond these three criteria, the most ideal design could also address ascertainment bias. This would be done by using independent cohorts of cases and controls, with all subpopulations represented in each analysis group. Such issues will require future evaluations that may largely depend on experimental validation (e.g., for assessing the false negative rate). It will be interesting to see how much complex disease heritability can be explained by ideally-designed GIAS and how much additional information may be extracted from existing data. We propose that a wide spectrum of GIA applications will vastly accelerate high throughput genomic analyses.

Methods

Information Theory Applied to IUT Theory

Treatment of latent variables is generally formal (as a direct and technical occurrence that requires mathematical adjustment) or empirical [as a function of the observed scores, resulting in methods to weight the results; 104]. In order to statistically correct for this type of variable, it is considered “locally” independent from the genetic trait being examined. Yet, inherent within this definition is the violation of the biological reality that complex traits do not occur in isolation from other genetic and environmental modifiers.

However, Information Theory has been applied to account for both missing information and large networks of biological information [122, 124]. If we also consider Hill’s criteria for causation [125] and Prentice’s criteria for surrogate markers [126], the most direct and parsimonious method for control of latent variables will also retain all possible information and therefore be an application of information theory [127].

In Berger's original conception, he used bioequivalence testing in the pharmaceutical industry as a method to illustrate the IUT. The IUT was applied to bioequivalency data in which two primary data points were of most interest, 1) area under the curve (AUC), and 2) time until maximum concentration (Tmax). The two drugs were considered bioequivalent if the respective population means of AUC and Tmax were sufficiently close. In order to reject the H0, all components of the H0 must be rejected.

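This, therefore, decreases the consumer's risk. Suppose that H0 is a union of k sets. Then the IUT can be stated as follows:

$$H_0: \theta \in \Theta_0 = \bigcup_{i=1}^{k} \Theta_i \quad \text{versus} \quad H_a: \theta \in \Theta_a = \bigcap_{i=1}^{k} \Theta_i^{c}$$

Here each component hypothesis is $H_{0i}: \theta \in \Theta_i$ versus $H_{ai}: \theta \in \Theta_i^{c}$, $R_i$ denotes the rejection region of the test of $H_{0i}$, and the IUT rejects H0 only when every component test rejects.

Berger (1982) proved the following two theorems concerning the IUTs.

Theorem 1: If Ri is a level-α test of H0i for i = 1, . . ., k, then the IUT with the rejection region $R = \bigcap_{i=1}^{k} R_i$ is a level-α test of H0 versus Ha.

Proof: Let $\theta \in \Theta_0$. Then $\theta \in \Theta_i$ for some i = 1, . . ., k, and

$$P_\theta(X \in R) \le P_\theta(X \in R_i) \le \alpha.$$

This result provides a Type I error rate of at most α without the need for multiplicity correction [115]. According to Berger's first theorem (Theorem 1), the IUT is quite conservative and thus no multiple testing corrections are needed (because the overall size of the test is at most α [113, 128]).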
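The bound in Theorem 1 can be illustrated with a small Monte Carlo sketch. It is an idealised toy, not part of the original analysis: a true component null is modelled by a Uniform(0, 1) p-value, a false component null by a perfectly powered test whose p-value is always 0, and the function name, seed and trial count are assumptions made here.

```python
import random


def iut_rejection_rate(null_flags, alpha, n_trials, seed):
    """Estimate how often an IUT rejects, given which component nulls are true.

    null_flags: one boolean per component test (True = that null is true).
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        # True null: p-value is Uniform(0, 1). False null: idealised p-value of 0.
        pvals = [rng.random() if is_null else 0.0 for is_null in null_flags]
        # The IUT rejects only if every component test rejects at level alpha.
        if all(p < alpha for p in pvals):
            rejections += 1
    return rejections / n_trials


alpha = 0.05
one_null = iut_rejection_rate([True, False, False], alpha, 200_000, seed=1)
all_null = iut_rejection_rate([True, True, True], alpha, 200_000, seed=1)
print("one true null:", one_null)
print("all nulls true:", all_null)
```

In this toy setting the rejection rate stays at or below α whenever at least one component null is true. It approaches α when only one null is true and falls far below it when all are true.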

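Theorem 2: For some i = 1, . . ., k, suppose Ri is a size-α rejection region for testing H0i versus Hai. For every j = 1, . . ., k, j ≠ i, suppose Rj is a level-α rejection region for testing H0j versus Haj. Suppose there exists a sequence of parameter points $\theta_l$, l = 1, 2, . . ., in $\Theta_i$ (the parameter set of H0i) such that

$$\lim_{l \to \infty} P_{\theta_l}(X \in R_i) = \alpha$$

and, for every j = 1, . . ., k, j ≠ i,

$$\lim_{l \to \infty} P_{\theta_l}(X \in R_j) = 1.$$

Then the IUT with rejection region $R = \bigcap_{j=1}^{k} R_j$ is a size-α test of H0 versus Ha.

Proof: Since the IUT is a level-α test,

$$\sup_{\theta \in \Theta_0} P_\theta(X \in R) \le \alpha.$$

Conversely, because every $\theta_l$ lies in $\Theta_i \subseteq \Theta_0$, the Bonferroni inequality gives

$$\sup_{\theta \in \Theta_0} P_\theta(X \in R) \ge \lim_{l \to \infty} P_{\theta_l}\Big(X \in \bigcap_{j=1}^{k} R_j\Big) \ge \lim_{l \to \infty} \sum_{j=1}^{k} P_{\theta_l}(X \in R_j) - (k-1) = \alpha + (k-1) - (k-1) = \alpha.$$

Hence the size of the IUT is exactly α.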

The logic of GIA is an alternative way to apply Berger's theorem. As an example, let us consider a SNP data set that sampled N individuals, a portion of which have a genetic phenotype of interest, X. The population under study is subdivided, based on known differences, into three subpopulations (A, B, C) that contain both cases and controls. The H0 hypothesis, that there is no genetic difference between cases and controls, will be tested using a chi-square test in each of the subpopulations. Only those SNPs that reject H0 at level-α in each subpopulation are retained. Now, in order to compare each of the results across the three subpopulations, Berger's theorem is applied. In this case, the overall null hypothesis is the union of the subpopulation nulls (no difference in at least one subpopulation), and the alternative hypothesis states that there is a difference within each of the subpopulations. Thus, to reject the hypothesis that there is no difference between cases and controls, we must reject H0 within each subpopulation comparison.
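The three-subpopulation example above can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the implementation used in the study: each SNP is summarised by a hypothetical 2x2 count table (a, b, c, d) per subpopulation, the test is a plain Pearson chi-square with 1 degree of freedom and no continuity correction, and the names `chi2_pvalue_2x2` and `gia` are introduced here.

```python
import math


def chi2_pvalue_2x2(a, b, c, d):
    """Pearson chi-square p-value (1 df) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    # For 1 df, P(chi-square > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def gia(tables_by_group, alpha=0.05):
    """Keep SNPs with p < alpha in every group; return the product of their p-values."""
    pvals = {}
    retained = None
    for group, tables in tables_by_group.items():
        pvals[group] = {snp: chi2_pvalue_2x2(*t) for snp, t in tables.items()}
        hits = {snp for snp, p in pvals[group].items() if p < alpha}
        retained = hits if retained is None else retained & hits
    retained = retained or set()
    return {snp: math.prod(pvals[g][snp] for g in pvals) for snp in retained}


# Hypothetical counts: snp1 differs in A, B and C; snp2 differs only in A.
tables_by_group = {
    "A": {"snp1": (40, 10, 10, 40), "snp2": (40, 10, 10, 40)},
    "B": {"snp1": (40, 10, 10, 40), "snp2": (25, 25, 25, 25)},
    "C": {"snp1": (40, 10, 10, 40), "snp2": (25, 25, 25, 25)},
}
result = gia(tables_by_group)
print(result)
```

Only snp1 survives the intersection, because snp2 fails to reject H0 in subpopulations B and C.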

Therefore, we suggest that by modifying the theoretical framework of the IUT, GIA allows for the control of latent variables without sacrificing family-wise error rate and thus without requiring multiplicity correction. Consider as an example the case of one common latent variable which most genetic studies attempt to account for, population structure. This is the presence of variation in levels of genetic similarity within the population as a consequence of factors such as geographical subdivision and finite population size [129]. To correct for this latent variable, traditional approaches attempt to remove the environmental influences from population structure, or incorporate variances based on environmental or geographic stratifications. Use of traditional statistical procedures such as Principal Component Analysis [PCA; 130] or Multidimensional Scaling [MDS; 131] to remove latent variables may still fail to account for additional hidden confounds and artifacts. These methods are forced to ignore the synergism that exists between the organism, its immediate surroundings, and the environment as a whole. In contrast, information theory would stipulate that to elucidate the genetic trait of interest, we must exploit all the available information to control for spurious associations.

Previous work by others has already established equations to calculate at what point a threshold should be set to determine whether the results from the IUT should be considered a true positive or should be eliminated as a false positive [132-135].

Simulation Data

LUPA Consortium data published by Vaysse et al. was used for the simulation studies. It was downloaded from http://dogs.genouest.org/SWEEP.dir/Supplemental.html [115]. The phenotypes are also available from that source [open-access information from 115; phenotypes listed in Table S3]. Genotypes were determined with the Illumina CanineHD bead array containing 173,622 probes for SNP markers with an average of greater than 70 markers per Mb. Of these SNPs, 0.9% were recently discovered through targeted resequencing, 65.1% were present in a comparison between the boxer reference genome and the low coverage poodle genome, 21.7% were present from low coverage sequence reads in diverse dog breeds compared to the boxer genome, 25.4% were present within the boxer reference, and 1.2% were present within alignments of wolf and/or coyote sequences with the reference boxer genome [136]; the array was validated in 450 samples from 26 dog breeds [115]. It should be noted that SNP selection for this platform was based on comparison with the reference boxer genome (CanFam2). LUPA data was downloaded and processed using PLINK. A script was written in PHP to bootstrap the LUPA data.

Briefly, we resampled all 456 dogs by simple random sampling with replacement for our simulation datasets. Visual inspection of the data and preliminary analysis revealed that we had appropriate distributions for our resampled data. In total we had 1000 bootstrapped replicates. We carried out the simulation independently in three separate locations and all analyses were verified among all three independent simulations. We then performed chi-squared tests for association using PLINK. All PLINK association data sets were aggregated and then parsed using MATLAB for frequency analysis and PHP and/or R for GIA analysis.

Simulation Methods

A set of 1000 resamples of 456 dogs was generated by simple random sampling with replacement (naïve bootstrap) in the R project environment. The labels so generated were placed into a table that was saved into a file to ensure repeatability. The following steps were implemented using PHP as a cross-platform scripting language. Using this table, resamples (cloned populations) were generated from the original LUPA data in text format for use in PLINK (software) by appending a repetition number to each of the original dog identifiers. The resultant files were separated into child directories indexed by resample identifier from 0001 to 1000. A derived binary file of each resample was generated by calling PLINK, and phenotype mapping files were generated for each resample from the PLINK-generated .fam file and data supplied by [115]. PLINK was then re-run to generate 1000 derivative association files corresponding to each of the resamples.
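The resampling step described above can be sketched as follows. This is a minimal Python illustration rather than the R and PHP code used in the study: the dog identifiers, the seed, the `_n` suffix format for the repetition number and the helper names are all assumptions made for the example.

```python
import random
from collections import Counter


def bootstrap_labels(dog_ids, n_replicates, seed):
    """Draw n_replicates naive bootstrap resamples (with replacement) of dog_ids.

    Each drawn identifier gets a repetition number appended, so a dog that is
    drawn several times appears as distinct individuals in that resample.
    """
    rng = random.Random(seed)  # fixed seed so the table of labels is repeatable
    resamples = []
    for _ in range(n_replicates):
        draws = rng.choices(dog_ids, k=len(dog_ids))
        seen = Counter()
        labels = []
        for dog in draws:
            seen[dog] += 1
            labels.append(f"{dog}_{seen[dog]}")
        resamples.append(labels)
    return resamples


def resample_dir(index):
    """Child directory name for a resample, zero-padded from 0001 to 1000."""
    return f"{index:04d}"


dog_ids = [f"dog{i:03d}" for i in range(1, 457)]  # placeholder identifiers
resamples = bootstrap_labels(dog_ids, n_replicates=1000, seed=2012)
print(resample_dir(1), resamples[0][:5])
```

Suffixing the identifiers keeps every record in a resample unique, which matters for downstream tools that expect distinct individual IDs.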

The association files were imported in groups corresponding to the number of tables required for a single GIA at a specified p-value cutoff. This cutoff was selected to ensure that the final number of significant SNPs in a single GIA was below a certain threshold (50, Pcutoff < 1 x 10^-32). The 1000/(no. of groups) GIAs were then computed and the significant SNPs recorded individually in data and log files. Significant SNPs had their maximum, minimum, and mean p-values recorded in the data files. The data and log files were then summarized to generate FI for the significant SNPs. In addition, the maximum, minimum, and mean number of significant SNPs that were selected in each GIA were reported.
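The summarization into FI can be sketched as below. This is an illustrative reading of the text, not the original PHP code: FI is taken to be the fraction of GIAs in which a SNP is retained, each association table is reduced to a set of SNP identifiers already filtered at the p-value cutoff, and the function name and toy data are assumptions.

```python
from collections import Counter


def frequency_index(significant_sets, group_size):
    """Compute FI per SNP from per-resample sets of significant SNPs.

    Consecutive batches of group_size sets form one GIA; a SNP is retained by
    a GIA only if it is significant in every set of that batch.
    """
    n_gias = len(significant_sets) // group_size
    counts = Counter()
    for g in range(n_gias):
        batch = significant_sets[g * group_size:(g + 1) * group_size]
        shared = set.intersection(*batch)
        counts.update(shared)
    return {snp: c / n_gias for snp, c in counts.items()}


# Toy input: four resamples, grouped in pairs (two GIAs in total).
significant_sets = [
    {"snpA", "snpB"},
    {"snpA"},
    {"snpA", "snpC"},
    {"snpA", "snpC"},
]
fi = frequency_index(significant_sets, group_size=2)
print(fi)
```

Here snpA is retained by both GIAs (FI = 1), snpC by one of two (FI = 0.5), and snpB by none, so it does not appear in the result.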

The following steps were required for GIAs that were conducted on the intra-resample subgroups. The same resamples were used as before. For each resample, the 456 dogs were imported into memory and segregated by phenotype. A Mersenne-Twister random number generator [137] was seeded with a random number recorded in a file (again, for repeatability). The generated random numbers were used to randomly allocate the cases and controls to one of four groups of approximately 114; proportional allocation was used to ensure that case and control percentages remained close to that of the resample itself. PLINK was then run on the resultant four sub-resamples to generate four association tables. These tables were then combined into a single GIA. This process was repeated 999 more times, resulting in 1000 data files containing sets of significant SNPs. The 1000 data files were then summarized in the same manner as described before.
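The proportional allocation step can be sketched as follows. This is a minimal illustration under stated assumptions: Python's `random` module (which is based on the Mersenne Twister) stands in for the generator used in the study, the case and control counts are hypothetical, and dealing shuffled individuals round-robin is one simple way to keep case and control proportions close to those of the resample.

```python
import random


def allocate_groups(cases, controls, n_groups, seed):
    """Randomly split cases and controls into n_groups with proportional allocation."""
    rng = random.Random(seed)  # seed would be read from a file for repeatability
    groups = [[] for _ in range(n_groups)]
    for stratum in (cases, controls):
        shuffled = list(stratum)
        rng.shuffle(shuffled)
        # Deal each stratum round-robin so every group gets an equal share of it.
        for i, dog in enumerate(shuffled):
            groups[i % n_groups].append(dog)
    return groups


# Hypothetical phenotype split of one 456-dog resample.
cases = [f"case{i}" for i in range(120)]
controls = [f"ctrl{i}" for i in range(336)]
groups = allocate_groups(cases, controls, n_groups=4, seed=137)
print([len(g) for g in groups])
```

Each of the four groups would then be written out and analysed separately before the four association tables are intersected.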

Association calculations took approximately 32 hours for the first set, and 48 hours for the second set. GIA calculations took approximately 45 seconds per intersection on a workstation with a Pentium 630 3.0 GHz processor.

Data access

The LUPA Consortium data used for the simulation studies was published by Vaysse et al. and is publicly available at http://dogs.genouest.org/SWEEP.dir/Supplemental.html [115]. The relevant phenotype data is available from the same publication [115].

Acknowledgements

The authors would like to acknowledge Drs. Kim McBride and Dorothy Pathak for critical reading of the manuscript. We would also like to acknowledge Kelly Rybaczyk for her advice. We are grateful to the LUPA Consortium investigators for their invaluable dataset. This study was supported in part by a research grant (R210602710) from the National Institutes of Health and a research grant (CA100865) from the Department of Defense Congressionally Directed Medical Research Programs to CEA. JLR was supported by a fellowship from the National Institutes of Health (NINR 5F31NR011559).

Author contributions

JLR and LAR contributed equally to this work. LAR and PKP led the statistics aspects of this study, CEA and JLR the genetics, and BAP and KH the computation. LAR conceived of the initial idea and algorithm to apply IUT to genomewide association. LAR, JLR and CEA developed and implemented GIA; JLR ran most of the code and executed the algorithm. PKP designed the simulation studies and BAP and KH wrote the code for those; BAP and PKP performed them; and all authors participated in optimization. JLR wrote most of the manuscript with significant contributions from CEA, LAR, BAP and PKP.

Figure 3.1. Overview of our application of GIA (panels A-C). Blue circles represent simulated data; red circles represent original data. (A) The original LUPA dataset was resampled with replacement to develop 1,000 replicates. By simulating this dataset, we were able to retain the characteristics inherent to the canine populations included within the dataset. A chi-square analysis was performed on each simulated dataset, and GIA was completed on groupings of either 2 or 4 to yield the results for comparison with published GWA results [115]. (B) To test the robustness of our method, we then performed a simulation analysis in which each of the simulated data sets was divided into 4 separate groups and GIA was completed across the groups. (C) To test the application of GIA to the original dataset, we randomly divided the data into 5 groupings for each of the variables tested. We performed chi-square analysis on each sub-group, retaining only the significant results. Subsequently, we compared the results and retained only those significant results present in each of the 5 groups.

Figure 3.2. Frequency analysis for groups of resampled datasets using GWA (panels A-D). Using the traditional GWA method, we found that regardless of how many datasets we added, we were unable to increase the resolution. In fact, as the group size increases, we significantly increase the amount of noise retained in the analysis using a traditional approach for (A) 5 groups, (B) 25 groups, (C) 50 groups and (D) 100 groups.

Figure 3.3. Scree plot of SNPs significantly associated with furnishings using GIA: (A) the significant SNPs in common across groupings of 2 GIAs and (B) of 4 GIAs. A 718 kb haplotype from chr13:11,593,074-11,718,754 bp was previously associated with furnishings [117]. We identified that locus, as well as Vaysse et al.’s most significantly associated SNP [chr13:11,678,731; 115]. Here, we present the combined observations across simulations using the LUPA dataset. The y-axis represents FI value, where 1 is a significant SNP observed across all GIA datasets. The x-axis shows the corresponding SNPs in decreasing order of p-values. The dashed orange line represents the end of the previously observed associations. Note the precipitous drop in observations corresponding to regions that have not been previously implicated for furnishings.

Figure 3.4. The robustness of GIA as observed with simulation that randomly divided each group of 456 dogs into 4 separate groups and completed a GIA on the data. Our results identified the same 8 significant SNPs previously associated with the trait of furnishings [115, 117]. (A) Scree plot of FI values for furnishings. Note that the first 8 SNPs have an FI value of 1 and are followed by rapidly decreasing values. (B) For the same SNPs observed in “A”, we show the highest boundary of the p-value (that is, those SNPs that were least significant; blue line), the lowest boundary of the p-value (that is, those SNPs that were most significant; red line), and mean p-value (green line) observed across all GIAs.

Figure 3.5. A flow diagram representing the GIA process as applied to the LUPA dataset with the trait of furnishings. The final result is the product of the significant results within and then across each group, giving GIA increased power to detect true association.

Figure 3.6. Dog breeds demonstrate extreme variation in ear morphology ranging from (A) pricked ears, as seen in this Belgian Tervuren, to (B) dropped ears, as seen in this Bernese Mountain Dog. (C, D) Applying GIA, we identified locus chr10:10,462,369-11,792,711 (p = 2.90 x 10^-50 to 4.22 x 10^-96). The y-axis represents -log10 p-value, while the x-axis shows the genomic location in Mb on chr10: 9.5-12.5 Mb. Note the blue bars above “C”, which correspond to haplotypes identified using PLINK. (C) This histogram plot demonstrates the signal of all SNPs (including those not statistically significant) within the region. (D) A plot of only the SNPs significantly associated with the trait of ear morphology. (E) A previous large scale dog GWAS also identified this locus on chr10 as being significantly associated with ear morphology [34]. Shown are the association values across breeds that correlated with the single-marker p-value. Modified from [115]. Photos courtesy of http://www.alicambernese.com/bernesemountaindogpuppies.html and http://whatafy.com/the-belgian-tervuren-proves-that-belgian-dogs-are-the-best-for-protection.html.

Table 3.1. Comparison of published GWA results and GIA reanalysis for four traits.

                       GIA                                     GWAS
             (As reported in this study)       [As performed by LUPA consortium, 115]
Phenotype    CFA  BP            p-value          CFA  BP                p-value†

Boldness     10   11,440,860*   3.69 x 10^-77    10   11,440,860        p-genome <0.001
             10   10,804,969    2.33 x 10^-64    10   10,804,969        p-genome = 0.006
             10   11,072,007    4.33 x 10^-55    10   ~11,195,975*      p-genome <0.001
             10   11,056,641    1.83 x 10^-49
             7    78,399,885    2.59 x 10^-48
             2    38,145,114    1.02 x 10^-47
             10   10,703,666    1.77 x 10^-47
             10   11,384,057    2.91 x 10^-46
             10   11,758,427    4.24 x 10^-46
             10   11,100,691    1.65 x 10^-44

Ears         10   11,072,007*   4.22 x 10^-96    10   11,072,007*       p-genome <0.001
             10   11,056,641    7.45 x 10^-89    10   10.27-11.79 Mb    p-genome <0.05
             10   10,907,439    5.98 x 10^-72
             10   11,100,691    1.57 x 10^-70
             10   10,871,535    1.06 x 10^-69
             10   11,081,762    9.55 x 10^-66
             10   11,086,490    1.69 x 10^-65
             10   11,121,003    2.68 x 10^-63
             10   10,491,306    3.66 x 10^-62
             10   11,384,057    1.39 x 10^-56
             10   11,792,711    2.57 x 10^-56
             10   10,804,969    9.40 x 10^-56
             10   10,707,193    1.17 x 10^-55
             10   10,462,369    2.90 x 10^-50

Size         15   44,226,659    3.50 x 10^-58    15   44,242,609        p-genome = 0.004*
             15   44,231,500    3.50 x 10^-58    15   44.23-44.44 Mb    p-genome <0.05
             15   44,242,609    3.03 x 10^-57
             15   44,258,017    8.81 x 10^-52
             15   44,267,011    2.58 x 10^-51
             10   11,169,956    1.29 x 10^-48    10   11,169,956        p-genome = 0.036
             15   44,427,593    2.64 x 10^-46
             15   44,437,773    2.64 x 10^-46
             15   44,216,576    3.22 x 10^-44

Sociability  37   17,422,040    1.83 x 10^-36
             23   19,918,839    7.94 x 10^-34
             11   72,070,527    1.65 x 10^-30
             22   4,080,466     5.31 x 10^-30
             15   45,840,380    1.37 x 10^-29
             9    7,282,996     2.57 x 10^-29
             28   31,503,958    2.68 x 10^-29
             30   15,929,787    4.01 x 10^-29
             1    97,463,560    4.17 x 10^-29
             16   40,621,962    5.04 x 10^-29
             X    106,306,300   1.41 x 10^-28    X    106.03-106.61 Mb  p-genome <0.05 (only male canines)
             X    106,381,235   1.41 x 10^-28
             X    106,423,498   1.41 x 10^-28
             X    106,508,937   1.41 x 10^-28
             X    106,545,078   1.41 x 10^-28
             X    106,556,584   1.41 x 10^-28
             X    106,614,877   1.41 x 10^-28

† The GWAS did not use a traditional p-value, but rather a genomewide significance p-value was calculated by comparing the true genotype-phenotype correlation of each SNP to the maximum permuted value of all SNPs across the platform [115].
* Denotes where the most significantly associated call by GWA was also the most significant by GIA.

Table 3.2. For the QTL of size, we used the phenotype proxy of breed weight previously published by Vaysse et al. [115]. We used all breeds that had weight information (kg) and randomly distributed the dogs across five groups, ensuring that each breed was represented by at least 2 dogs in each group. Using PLINK's Quantitative Trait Association, an asymptotic p-value was generated (using a combination of a likelihood ratio test and a Wald test). We retained all p-values that were <0.05 in each individual group. We then compared across all groups and retained only the SNPs that were significant in each of the 5 groups. To rank the SNPs from most to least associated, we multiplied the p-values across the groups (since GIA gives level-α as our upper bound, we can use the asymptotic p-value across all 5 groups to determine an approximate ordering).

Chr  BP          Group 1        Group 2        Group 3        Group 4        Group 5        Combined p-value
15   44,226,659  1.05 x 10^-10  3.85 x 10^-10  5.11 x 10^-13  2.44 x 10^-14  6.95 x 10^-13  3.50 x 10^-58
15   44,231,500  1.05 x 10^-10  3.85 x 10^-10  5.11 x 10^-13  2.44 x 10^-14  6.95 x 10^-13  3.50 x 10^-58
15   44,242,609  1.01 x 10^-11  1.21 x 10^-09  3.54 x 10^-12  1.74 x 10^-13  4.03 x 10^-13  3.03 x 10^-57
15   44,258,017  1.20 x 10^-10  1.72 x 10^-08  9.37 x 10^-12  2.04 x 10^-12  2.24 x 10^-11  8.81 x 10^-52
15   44,267,011  8.56 x 10^-11  2.00 x 10^-08  9.31 x 10^-12  4.9 x 10^-12   3.30 x 10^-11  2.58 x 10^-51
10   11,169,956  3.67 x 10^-09  1.92 x 10^-10  3.26 x 10^-10  5.12 x 10^-11  1.10 x 10^-10  1.29 x 10^-48
15   44,427,593  5.13 x 10^-10  4.85 x 10^-09  3.81 x 10^-08  3.66 x 10^-12  7.63 x 10^-10  2.64 x 10^-46
15   44,437,773  5.13 x 10^-10  4.85 x 10^-09  3.81 x 10^-08  3.66 x 10^-12  7.63 x 10^-10  2.64 x 10^-46
15   44,216,576  8.18 x 10^-08  4.57 x 10^-08  4.69 x 10^-09  1.85 x 10^-09  9.98 x 10^-13  3.22 x 10^-44
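The ranking step described in the caption of Table 3.2 can be illustrated with a short sketch. This is a minimal example rather than the analysis code used in the study: the function name `rank_by_combined_p` and the dictionary layout are assumptions made for illustration, and only three rows of the table are included. Summing log10 values is used in place of direct multiplication purely as a numerical convenience.

```python
import math

# Per-group asymptotic p-values for three SNPs taken from Table 3.2,
# keyed by (chromosome, base-pair position).
group_p = {
    ("15", 44226659): [1.05e-10, 3.85e-10, 5.11e-13, 2.44e-14, 6.95e-13],
    ("15", 44242609): [1.01e-11, 1.21e-09, 3.54e-12, 1.74e-13, 4.03e-13],
    ("10", 11169956): [3.67e-09, 1.92e-10, 3.26e-10, 5.12e-11, 1.10e-10],
}

ALPHA = 0.05  # per-group retention threshold stated in the caption


def rank_by_combined_p(p_by_snp, alpha=ALPHA):
    """Keep SNPs with p < alpha in every group, then rank by the product.

    The product is returned as log10(combined p) and is used for ordering
    only; it is an approximation, not an exact p-value.
    """
    ranked = []
    for snp, pvals in p_by_snp.items():
        if all(p < alpha for p in pvals):
            # Sum of logs equals the log of the product and avoids underflow.
            log10_combined = sum(math.log10(p) for p in pvals)
            ranked.append((snp, log10_combined))
    # Most negative log10 value (smallest combined p) comes first.
    return sorted(ranked, key=lambda item: item[1])


ranked = rank_by_combined_p(group_p)
for (chrom, bp), log10_p in ranked:
    print(f"chr{chrom}:{bp}  log10(combined p) = {log10_p:.2f}")
```

Under these assumptions the ordering of the three SNPs matches their order in the table, and the combined values agree with the tabulated products to the reported precision.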

CHAPTER 4: USE OF GENOMEWIDE INTERSECTION-UNION ANALYSIS IDENTIFIES RISK LOCI FOR CANINE OSTEOSARCOMA*

*Rowell, J., Rybaczyk, L.A., Zaldivar-Lopez, S., Marin, L.M., Fiala, E.M., Couto, C.G., Alvarez, C.E. (2012). Identifying Risk Loci for Canine Osteosarcoma. Submitted for publication.

Abstract

Osteosarcoma is the most common bone cancer in humans. Identifying germ line susceptibility loci has proven difficult, due in part to its rare occurrence in humans, as well as the complex disease nature of cancers. Dogs have a 13-fold increased rate of developing osteosarcoma and vastly reduced genetic variation. Here, we use Genomewide Intersection-Union Analysis (GIA) to conduct the first mapping of osteosarcoma (OSA) risk loci in Greyhounds. GIA, a modified Intersection-Union Test (IUT), sensitively, specifically and robustly detects genetic signal in noise. It does so, in part, by circumventing two major limitations of high density genetic analysis: i) corrections for latent variables (resulting in information distortion and loss) and ii) correction for multiple tests (resulting in decreased power). We report multiple loci associated with osteosarcoma development, and the testing and successful validation of two of those in a second, larger group of Greyhounds. Two of the associated SNPs have biological relevance that is particularly suggestive of a role in OSA. The chr34:35,156,555 SNP lies within the linkage interval previously reported for osteosarcoma risk in a closely related dog breed, the Scottish Deerhound, and confers an odds ratio (OR) of 3.7 for the development of OSA in our study population. A second candidate SNP, chr10:5,894,741, suggests an association between racing performance and osteosarcoma risk, and confers an OR of 5.7. We propose that these findings will accelerate translational research of osteosarcoma, and that GIA will be widely used in diverse applications for genomewide genomic analysis.

Author Summary

Osteosarcoma is the most common bone cancer in humans. Because it is a disease with complex genetics and occurs rarely in humans, the genes associated with osteosarcoma risk have not been identified. We thus chose to study osteosarcoma genetics in dogs, which have much higher rates of this cancer and vastly reduced levels of genetic variation. Here we report the development of a novel analytical approach for identifying genetic variation associated with disease. We first validate that methodology by i) re-analyzing published data and ii) correctly mapping a previously mapped trait. We then identify the first candidate genes for canine osteosarcoma, and subsequently validate those in a second group of twice as many dogs. Supporting our findings, one validated gene is in a region associated with osteosarcoma risk in two closely related Sighthounds: the Greyhound and the Scottish Deerhound. We also identified a risk region (with two candidate genes) that appears to be under selection for Greyhound racing performance. This suggests that racing selection may explain the high rate of osteosarcoma in this breed.

Our methodology will accelerate genomewide genetic studies, and our cancer findings will improve the understanding of cancer risk and yield animal models for development of new therapies.

Introduction

Identifying germ line susceptibility loci for complex genetic diseases such as cancer has proven difficult, due in part to the multifactorial nature of disease. Cancer risk includes behaviorally mediated lifestyle and environmental factors, as well as genetically heritable risk variants that segregate in both Mendelian and non-Mendelian patterns [138]. In humans, genomewide association studies (GWAS) had catalogued 6,068 SNPs associated with a disease or trait as of March 27, 2012 [http://www.genome.gov/gwastudies; 139, 140]. Even so, the actual effect sizes remain low, with odds ratios (OR) typically ranging from 1.1 to 1.4 [141], likely because traditional approaches suffer from low power (requiring large sample sizes) and thus risk allele effects are marginalized over genetic and environmental backgrounds [112]. Within the last 7 years, canine genetics has demonstrated a powerful applicability to human disease, with >80 canine disease mutations known to have a human disease analog [6, 7]. Owing to at least two population bottlenecks, which occurred at the domestication of the dog from the gray wolf and at the formation of breeds, today's breeds are essentially isolated genetic populations with increased risk for the development of specific diseases [7].

Despite these genetic advantages, canine GWAS still suffer from many of the same issues that are present in human studies. Here, we report the first germ line susceptibility loci associated with osteosarcoma in Greyhounds using 12 osteosarcoma (OSA) positive retired racers (cases), 12 OSA free racers (racing-controls), and 12 OSA free show Greyhounds (show-controls), and establish the applicability of a powerful method for GWA, the genomewide Intersection-Union (Test) analysis (GIA).

Background

Greyhound OSA

US Greyhounds comprise two pedigreed populations, each with its own registry: the National Greyhound Association (NGA) for racers and the American Kennel Club (AKC) for show Greyhounds. These two sub-populations diverged 110 years ago and have continued to be bred for different traits since then [142]. As a result, these two sub-breeds look very similar but have strikingly different disease predispositions [143].

The most notorious difference between racing and show Greyhounds is susceptibility to OSA development. Osteosarcoma in domestic dogs closely recapitulates the human disease in tumor biological behavior, genetic features, age of onset distribution, and treatment modalities [7]. In addition to the risk conferred upon the Greyhound simply due to breed size, evidence suggests that racers have added risk for the development of OSA over show Greyhounds and other breeds [53]. While the heritability of OSA has not been established in Greyhounds, a closely related Sighthound breed, the Scottish Deerhound, has an OSA heritability of 69% [144]. The overall incidence is 6% in racing Greyhounds, yet OSA accounts for 45% of all cancer types and is the cause of death in 25% of this population [145]. Strikingly, <0.002% of AKC show Greyhounds developed OSA in a recent survey of that sub-breed [143]. Additionally, racing-related environmental effects do not account for this difference. Racing Greyhounds that actually race and those that do not develop OSA at the same rates, and factors such as track direction (racing Greyhounds that track race tend to repeatedly lean in toward their left forelimb) have no correlation with OSA development in racers [146]. Thus, genetic factors are likely to confer high risk for OSA development on these dogs, and they have not yet been elucidated within Greyhounds.

Intersection-Union Test (IUT)

The IUT as a statistical test is well established in clinical trials [108, 109], and has been applied to genomewide gene expression analysis and to very low resolution mapping of quantitative trait loci [141, 147]; yet, surprisingly, it has not been used for high density genomewide genetic association analysis. A similar method, the Union-Intersection Test (UIT), was recently applied to GWAS with an emphasis on epistatic interactions [112]; however, among other distinctions, the UIT differs from the IUT in that it requires multiple testing corrections. Alongside the present empirical demonstration of the GIA approach for GWAS, we laid the theoretical groundwork and conducted additional simulations in a partnering study (LR et al., manuscript submitted for publication).

(1) The IUT does not require multiple-testing correction across groups [113]. It is well known that as the total number of comparisons increases, so too does the family-wise error rate, even though each individual test is set at level α. Thus, for as few as 45 comparisons, the Type I error rate is actually 90%, rather than the 5% set for each individual marker [148]. However, the null hypothesis (H0) of the IUT is a composite of several individual null hypotheses, so that in order to pass the test, all individual H0 must pass. This confers a high level of power when using this test [see Materials and Methods, IUT methods; LR et al., manuscript submitted; 149]. Thus, when applying the IUT, the maximum Type I error is level α.

(2) The IUT negates data artifacts. Because failure of a single test will result in failure of the H0, the IUT identifies associations independent of confounding variables. When appropriately applied, the IUT removes artifactual variation in each group unless the variation is systemic to all data across groups. Thus, only factors associated with the hypothesis will be retained [112]. The artifactual variation removed can include both population structure (true and cryptic) and technical or experimental artifacts (LR et al., manuscript submitted). We used the IUT as a basis for the development of the Genomewide Intersection-Union Analysis (GIA).
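The 90% figure quoted for 45 comparisons follows from the standard family-wise error rate formula, 1 - (1 - α)^m. A minimal sketch of that arithmetic is given here; the function name is hypothetical and the formula assumes that the m tests are independent, which is an idealization for linked SNPs.

```python
def family_wise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive among n independent tests,
    each run at significance level alpha (independence is an assumption)."""
    return 1.0 - (1.0 - alpha) ** n_tests


# 45 comparisons at alpha = 0.05, as in the text.
fwer_45 = family_wise_error_rate(0.05, 45)
print(f"FWER for 45 tests at alpha=0.05: {fwer_45:.3f}")
```

Evaluated at α = 0.05 and m = 45, the expression gives a value of approximately 0.90, in agreement with the text.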

Application of the GIA

We genotyped 36 Greyhound samples (12 cases, 12 racing-controls, and 12 show-controls; see Table 4.1 for sample information) using 3 CanineHD Beadchip SNP arrays containing ~174,000 probes (12 samples/array with all sub-breeds represented on each array). Based on the above principles of the IUT, we developed the GIA. GIA has three primary steps:

1) Stratification. The optimal design of the GIA is to maximize differences across groups and minimize differences within groups. This allows for separation of the non-relevant variation, and increases the signal to noise ratio. We conducted our analysis using only Greyhounds (minimizing variation within groups), with three groups composed of cases, racer-controls, and show-controls.

2) Screening of SNPs to develop a composite null hypothesis. Importantly, the GIA tests a composite null hypothesis (H0 = H01 + H02 + H03 + ..., etc.), where the combined null is that the groups are the same (the alternative hypothesis is that the groups are different). In our case, we are interested in whether a select set of SNPs in group 1 are the same as the SNPs in group 2 and the SNPs in group 3. In order to determine which SNPs to include in our compound H0, we first screen the possible ~174,000 SNPs using a chi-square test on cases and controls, setting α = 0.05. Notably, the significance level of <0.05 is only used to determine whether the individual null hypothesis will be rejected, to provide composite null hypotheses for each group [as suggested in 111]. Our hypothesis is concentrated on determining commonality between groups, not differences between cases and controls (as is the traditional use). Thus, we are not conducting a traditional hypothesis test with the chi-square. At this stage we may retain many false positives, but these will be lost when we test our composite null hypothesis across groups (since, by definition, a p-value cutoff bounds the probability that an association is due to random chance, such random events will not persist across multiple groups tested and will therefore not be retained). The null hypothesis is determined at the group level, not the SNP level, and therefore no correction for multiple testing is required under Berger's first theorem. Berger's first theorem states that the probability of an accepted compound H0 across groups (that the groups are the same for a certain SNP) being due solely to chance is equal to the parameters under which each individual H0 was determined. For example, each individual SNP test was conducted at a set maximum threshold (in our case, level α = 0.05) for the probability that the association was due to chance as part of the screening process. Therefore, the combined group's rejection region, or the maximum threshold value associated with Type I error, must be ≤ a 5% probability that the result is due completely to random chance (H01 at α = 0.05, H02 at α = 0.05, H03 at α = 0.05, so H0 at α ≤ 0.05).

This also takes advantage of simple probabilities. The probability that any one SNP found significant in an individual association test of one group will also be significant in another independent group is given by the product of the probabilities, or p(AB) = p(A) p(B). Therefore, the expected number of common false positives across 3 groups for a single SNP is approximately n × α^k (where n is the number of tests, α is the significance level, and k is the number of datasets), or 1.25 x 10^-4. As the number of datasets analyzed increases, the probability of committing the same Type I error in multiple datasets decreases exponentially. Therefore, we maintain very low false positive rates without needing to perform a multiple testing correction (MTC).

3) Combined results. In the final stage of the GIA, the composite H0 is that all of the individual hypotheses are the same. This way, if any one group rejects the null (that the SNP is the same across groups), then all of the groups reject and this SNP would not be included as a candidate SNP locus.
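The expected count of false positives that are shared across independent groups can be worked through numerically. The sketch below uses the expression n × α^k, with n the number of tests, α the per-group significance level, and k the number of groups; the function name is hypothetical, and the calculation assumes that the groups are independent and that every SNP is truly null.

```python
def expected_shared_false_positives(n_tests: int, alpha: float, k_groups: int) -> float:
    """Expected number of null SNPs passing the alpha threshold in all k groups,
    assuming independent groups (an idealization)."""
    return n_tests * alpha ** k_groups


# One SNP screened at alpha = 0.05 in each of 3 groups.
per_snp = expected_shared_false_positives(1, 0.05, 3)

# All ~174,000 probes on the array, same settings.
array_wide = expected_shared_false_positives(174_000, 0.05, 3)

print(f"per SNP: {per_snp:.3g}, array-wide: {array_wide:.4g}")
```

For a single SNP this gives 1.25 x 10^-4, and for the full array it gives roughly 22 expected chance survivors, which is why the ranking of the surviving SNPs and the follow-up validation still matter.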

Results

Examination of the data by Principal Component Analysis (PCA) revealed a clear separation between cases, racer-controls, and show-controls (Fig. 4.1). Based on prior experience with IUT methods applied to functional genomics [133, 150], we applied a modified IUT to our genetic association analysis.

External Validation

To validate our approach, we used the LUPA Consortium's public SNP dataset of 456 dogs from 30 breeds that were genotyped on the same platform used here [115]. Vaysse, Ratnakumar et al. reported GWAS findings for several traits. We applied GIA to five traits: furnishings (coat type with moustache and eyebrows), boldness, size, ear morphology, and sociability. We proportionally allocated the dogs with and without the trait of interest (cases and controls) into five groups by simple random sampling without replacement. For each group, we used a chi-square analysis to perform screening and retained only the SNPs at the threshold of p<0.05. We then retained only the SNPs that were below the threshold in all five groups, multiplied the p-values, and rank-ordered the results. In this case, the multiplied p-value is used only as a method for ranking the results, as an approximation rather than an exact value. Using our method, we identified Vaysse et al.'s most significant loci for all five traits (see Rybaczyk et al., manuscript submitted).

Subsequently, we randomly selected only 24 dogs (12 cases and 12 controls) from the larger dataset of 456 dogs, and divided them into three groups. We then performed a traditional GWAS analysis and GIA on six traits: furnishings, boldness, size, ear morphology, sociability, and tail curliness. We found that the signal was strong enough for three of the traits (furnishings, ear morphology, and tail curliness) that even using traditional GWA in a sample of 24 dogs we were able to detect the top association loci reported by Vaysse et al. The GWA result for a fourth trait, size, was contingent on the type of multiplicity correction used. However, even using the weakest implementation of GIA (that is, without breed stratification), we were able to correctly identify the previously associated loci within our top regions for five of the six traits. (The lone 456-dog GWA result that was not replicated in our vastly reduced sample of 24 dogs was sociability, a trait that Vaysse et al. identified using gender as a control; that information is not available with the public data.)

As an example, for the trait of boldness, we identified the same two highly significant SNPs as Vaysse et al. (chr10:10,804,969 bp CanFam2 assembly, p = 1.09 x 10^-7, and chr10:11,440,860, p = 2.65 x 10^-8; Fig. 4.2A), with multiple SNPs overlapping their reported locus. For furnishings, our top SNP (chr13:11,678,731, p = 2.54 x 10^-13; Fig. 4.2B) was the precise SNP identified by Vaysse et al. Our approach is thus powerful, as we were able to identify the same genetic associations with only 24 of their 456 dogs.
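The allocation, screening, and intersection steps used in this validation can be sketched in a few functions. This is an illustrative sketch under stated assumptions, not the code used in the study: the allelic 2x2 chi-square test (1 degree of freedom), the function names, the genotype encoding (count of one allele, 0 to 2, per dog), and the toy data are all hypothetical choices made so the example is self-contained.

```python
import math
import random


def chi_square_p(a, b, c, d):
    """P-value of a 1-df Pearson chi-square test on the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 1.0  # degenerate table carries no evidence of association
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-square distribution with 1 df.
    return math.erfc(math.sqrt(chi2 / 2.0))


def split_proportionally(cases, controls, k, seed=0):
    """Randomly allocate cases and controls into k groups without replacement."""
    rng = random.Random(seed)
    cases, controls = list(cases), list(controls)
    rng.shuffle(cases)
    rng.shuffle(controls)
    return [(cases[i::k], controls[i::k]) for i in range(k)]


def gia_candidates(genotypes, cases, controls, k=3, alpha=0.05, seed=0):
    """Return SNPs with p < alpha in every group, ranked by the product of p-values.

    genotypes maps SNP id -> {dog id: count of the tested allele (0, 1, or 2)}.
    """
    groups = split_proportionally(cases, controls, k, seed)
    retained = {}
    for snp, calls in genotypes.items():
        pvals = []
        for grp_cases, grp_controls in groups:
            a = sum(calls[dog] for dog in grp_cases)      # tested allele, cases
            b = 2 * len(grp_cases) - a                    # other allele, cases
            c = sum(calls[dog] for dog in grp_controls)   # tested allele, controls
            d = 2 * len(grp_controls) - c                 # other allele, controls
            pvals.append(chi_square_p(a, b, c, d))
        if all(p < alpha for p in pvals):                 # intersection across groups
            retained[snp] = math.prod(pvals)              # used for ranking only
    return sorted(retained.items(), key=lambda item: item[1])


# Toy data (hypothetical): one perfectly associated SNP and one uninformative SNP.
cases = [f"case{i}" for i in range(12)]
controls = [f"ctrl{i}" for i in range(12)]
genotypes = {
    "snp_assoc": {**{dog: 2 for dog in cases}, **{dog: 0 for dog in controls}},
    "snp_null": {dog: 1 for dog in cases + controls},
}
hits = gia_candidates(genotypes, cases, controls, k=3)
print(hits)
```

Only the SNP that passes the screen in every group survives the intersection; the uninformative SNP is dropped without any multiplicity correction being applied.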

Internal Validation

To validate the GIA method with our Greyhound dataset, we selected a previously mapped trait in dogs: brindle coat pattern (chr16:56.01-62.01 Mb, CanFam2; 23, 24; see Fig. 4.3A). Using a traditional GWAS with multiplicity correction, we were not able to identify any significant associations (p ≥ 0.9). We implemented the GIA, and the top SNP (p = 1.15 x 10^-09; see Fig. 4.3B) fell within the known interval. Notably, this interval occurs very near the end of chr16; this region contains an assembly gap and diverse structural variation in all breeds [85]. Thus, the majority of the region appears to be uninformative for SNP associations; yet, we were able to identify the nearest informative SNP (Fig. 4.3A; Table 4.2).

Experimental Data Set

For our OSA inquiry, we considered each array a subgroup of analysis (thus controlling for batch effect). Briefly, each array was treated as a separate analysis and a chi-square test was performed on each to determine inclusion in the composite null hypothesis, considering each group individually against the other two separately (cases vs. racer-controls; cases vs. show-controls, etc.). Phenotypically, Greyhounds develop OSA at an average onset of 8 years of age [151]. We considered Greyhounds OSA negative if they were ≥ 9 years of age and cancer free. For 4 samples that were 5 years of age or less, PCA showed that the young racer-controls grouped with the true racer-controls, and these were included in the analysis (see Materials and Methods, OSA cases). An initial GWA analysis with Bonferroni multiple testing correction revealed no significant results (p>0.05; Fig. 4.5). We separately applied GIA using two SNP calling algorithms on the raw data; most significant SNPs were called similarly by both algorithms and the remainder were discarded (Fig. 4.6; Materials and Methods, Platform & SNP calling). We identified 15 significant SNPs across the three datasets for the comparison of cases and racer-controls (by including the show-controls, we were able to control for associations that were due only to racing status). We confirmed the reliability of these 15 SNPs by sequencing them in four of the original genotyped dogs. We conducted population stratification analysis and determined that these results were not due to the cases being more related to each other than to the controls (see Materials and Methods, Spurious Results from Genotype Associations, Table 4.3). We imputed genomewide haplotypes of both cases and racing-controls (PLINK haplotyping; Table 4.4).

Experimental Validation of Two SNPs

We selected two SNPs (chr34:35,156,555, chr10:5,894,741) for validation in a new group of racer-controls and cases that had not been previously genotyped. The first SNP we validated was chr34:35,156,555, located in the Deerhound OSA linkage interval (26). In the discovery set, we identified the risk allele as T. We sequenced a total of 58 dogs (29 racer-controls and 29 cases) for this SNP and found a significant difference between the two groups, with the T allele confirmed as the risk allele (p=0.01; Fig. 4.7). We also sequenced 61 racing Greyhounds (31 racer-controls and 30 cases) for their chr10:5,894,741 (LRIG3) SNP genotype, and found a significant difference (p=0.001; Fig. 4.8) between the two groups in the presence of the risk allele. These genotyping results in a new population of case and racer-control Greyhounds validated the GIA findings, and provided further evidence of Greyhound OSA-associated variation at these loci.

Biological Implications of Candidate Regions

Of our 15 candidate regions, almost all contain genes that are implicated in cancer (Tables 4.4, 4.5). The biological implications of candidate regions are presented in detail in Table 4.5. Here, we focus on 2 SNPs that are particularly relevant to Greyhounds as well as cancer. SNP chr34:35,156,555 lies within a region previously associated with the development of OSA in a very closely related Sighthound, the Scottish Deerhound [34]. The Deerhound is a giant breed of dog, with an OSA heritability of 69% and overall incidence estimated to be >150/100,000 [144, 152]. Applying a linkage approach to a four-generation, 135-Deerhound pedigree (with 60 dogs genotyped for 610 microsatellite markers), Phillips et al. identified a 4.5 Mb OSA interval on chr34:34.9-39.4 Mb [Zmax = 5.766; 152]. Ongoing studies will determine whether Greyhounds and Deerhounds share an ancestral risk haplotype at this locus. The nearest gene to our predicted haplotype interval is ZBBX, ~7 kb away. While the function of this gene is not clear, multiple B-box proteins are associated with the development of different cancers [153].

IntOGen mining [154] of ZBBX across diverse cancer datasets shows that all associations with p<1E-3 are either reduced gene expression (endocrine, male germ cell, skin) or DNA copy number loss (bladder, mouth/oral squamous cell carcinoma). The single study to find significant somatic genome alterations affecting ZBBX in human osteosarcoma reported frequent deletions [2 gains, 12 deletions in 38 tumors; p=4.617E-3; 155]. We tested the largest cancer resource of associated gene expression and survival data, breast cancer (2,324 patients total), by Kaplan-Meier analysis [Fig. 4.9; 156]. Without stratifying by molecular subtype, decreased breast cancer survival is strongly associated with reduced expression levels of ZBBX. Analysis of gene expression between cancer and non-cancer matched tissue [NextBio, 157] showed a pattern of significantly decreased gene expression in a set of cancers (lung, nasopharyngeal and fallopian tube carcinoma, and male germ cell tumors), but increased expression in cancers suggestive of sex hormone biology (ovarian and prostate carcinoma and gonadotrope pituitary adenoma) (data not shown). The combined findings suggest ZBBX could be a tumor suppressor gene sensitive to sex hormones. Our findings indicate that in our study population, presence of the risk allele conferred an odds ratio of 3.7 for the development of OSA.

The most intriguing candidate SNP is chr10:5,894,741 (p = 9.15 x 10^-6), located within the 1 Mb region that Akey et al. reported to be most highly evolutionarily selected in Greyhounds compared with other breeds [di=18.85; 158]. Within this 1 Mb region there is only one gene annotated, leucine-rich repeats and immunoglobulin-like domains protein 3 (LRIG3), 142 kb downstream of our SNP. This SNP exists in a 577 kb run of otherwise shared homozygosity among cases, racer-controls, and show-controls (in which only a single SNP in three dogs is heterozygous, chr10:5,979,022). None of the other 14 SNPs associated with OSA exhibited this pattern of extended homozygosity surrounding a single polymorphic SNP (Fig. 4.10). SNP chr10:5,894,741 is significant in the comparison not only between cases and racer-controls, but also between show-controls and cases. However, this SNP is not significant in the comparison between show-controls and racer-controls. The haplotype detected by Akey et al. is thus an ancestral Greyhound haplotype, which our higher density SNP array splits into two haplotypes. That suggests there may have been two strong selection events at this locus, a relatively ancient one and a recent one. Sequencing and further analysis are required to identify variants in the risk allele and to dissect the evolutionary history. That work, together with functional studies, will establish whether the risk haplotype is under selection for racing performance and coincidentally increases osteosarcoma risk.

This SNP is of additional interest because LRIG family members, mainly LRIG1, have been shown to function in ubiquitination-associated regulation of cell surface expression of EGFR/ErbB family receptors and are subjects of great interest in cancer [159, 160]. However, their roles in tumorigenesis are complex and cannot be generalized [161, 162]. Few studies of LRIG3 have been conducted, but there are reports that it has a role in Fgf and Wnt signaling [163] and that its subcellular localization pattern is statistically significantly associated with survival in astrocytic tumors [164]. LRIG3 gene expression suggests it has a prominent role in bone development, which is interesting for a candidate osteosarcoma gene. In the BioGPS panel of dozens of mouse tissues, a handful of tissues show high levels, and osteoblasts are the second highest of all [Gene Atlas MOE430, GCRMA, 165; Fig. 4.11]. IntOGen analysis shows that, with the single exception of male germ cell tumors, all p<1E-3 significant associations found in different cancers are DNA copy number gains (fibrosarcoma, diffuse large B-cell lymphoma, follicular lymphoma) or increased mRNA expression in tumors vs. normal tissue (urothelial carcinoma). Although it is not statistically significant, one study showed an increased frequency of LRIG3 DNA copy number gain in osteosarcoma [9 gain, 3 loss in 38 tumors; p=0.225; 155]. Similar findings were reported for chondrosarcoma [7 gain, 0 loss in 33 tumors; p=0.186; 166].

These various clues hint that LRIG3 may be an oncogene in osteosarcoma. In addition, we found evidence that the flanking monocarboxylic acid transporter 2 gene (SLC16A7), an excellent osteosarcoma candidate, has regulatory sequence within this candidate interval (Table 4.4). Our findings indicate that in our study population, presence of the risk allele conferred an odds ratio of 5.7 for the development of OSA.

Discussion

Canine models of human diseases have long been appreciated. Here we combined two approaches to accelerate discovery of human disease genetics. The first is to use the dog model, which has 100-fold reduced genetic variation (vs. human) and exhibits natural human-relevant disease in many outbred populations [7]. For instance, natural mutations in hypocretin (orexin) receptor 2 (Hcrtr2) were first found to cause narcolepsy in Doberman Pinschers and Labrador Retrievers [167], and those findings led to the discovery that the human disease results from hypocretin peptide deficiency [168]. The second approach is to develop new statistical analysis methodologies with significantly increased power to isolate signal from noise (vs. traditional approaches). In this study, we conducted empirical demonstrations that GIA can be used to map simple and complex traits with great power. We establish the theoretical basis for this in a partnering work (LR et al., submitted). There we include simulation studies of published datasets and present the first components of a larger framework for understanding the results generated by these methods. If the study design accounts for the following three aspects, GIA should be accurate without further statistical analysis. (1) GIA is overly conservative, having decreased probability of false positives and increased probability of false negatives. That can be addressed by increasing α (i.e., raising the p-value threshold for chi-square analysis) within analysis groups and using the rank order of cumulative p-values for SNPs. (2) Without strong biological relatedness of samples (i.e., a common phenotype and genetic mechanism in cases), the probability of false positives increases. This can be addressed by conducting bootstrap analysis (Rybaczyk et al., submitted for publication). (3) The best results are achieved with minimum intra-analysis group variability and maximum inter-group variability. Consideration of this aspect will be taken up in future studies, but may ultimately require deep experimental data for definitive evaluation. The evidence presented here indicates that GIA of diverse datasets across all species will be fruitful.

Here, we used GIA to report the first OSA risk loci in Greyhounds. Two of our reported loci are of particular interest given their presence in 1) a linkage interval reported in a closely related Sighthound breed and 2) a region of highest selection within Greyhounds. We empirically validated these two SNPs; further analysis is now needed to discover the specific variants within these regions. Both SNPs produce high ORs for the development of OSA in dogs carrying the risk allele (OR: 3.7, chr34:35,156,555; OR: 5.7, chr10:5,894,741). While our initial analysis did not indicate any direct epistatic events, additional analysis will need to be conducted to determine the possible combined role of the loci. Some clinical evidence suggests that OSA as developed by racing Greyhounds may be different from that of other breeds [146]. This is of particular interest, as we often see human cancer with multiple different "sub-phenotypes" of the dominant cancer. Using several Sighthound breeds with OSA, as well as breeds without, will lead to further insight into the functional role of these candidate variants.

GIA is conceptually simple and intuitive, and is implemented through thoroughly validated statistical methods. Optimal use of GIA requires clear understanding of each step and appropriate experimental design and analysis. The GIA shifts the focus from the differences in each case-control SNP association to the commonality between groups. The null hypothesis is determined at the group level, not the SNP level, and therefore no correction for multiple testing is required under Berger's first theorem. We believe there is no "magic" in GIA, but rather a simple way of looking at the probability that a significant SNP in one group will be present in the other groups as well. As the number of datasets analyzed increases, the probability of committing the same Type I error in multiple datasets decreases exponentially. Therefore, we maintain very low false positive rates without needing to perform a multiple testing correction (MTC). Here, we identified 15 loci with this method. The probability that all 15 of our loci are false positives is 4.57 x 10^-67. This, combined with our simulation analysis, external and internal dataset validation, and empirical validation of two SNPs, supports our application of the GIA. Additional simulations are needed to test multiple sampling scenarios and the complete parameters of GIA, but these are outside the scope of this paper. We propose that GIA will greatly accelerate (and reduce costs of) genetics by radically improving sensitivity and specificity, and that our cancer findings will dramatically improve the understanding of cancer risk and yield animal models for development of new therapies.

Materials and Methods

Ethics statement: All blood samples were taken from dogs by trained veterinarians or veterinary technicians according to relevant national and international guidelines with prior informed consent of the owners.

IUT Methods: As part of an effort to identify common genetic variants in dog breeds and to test those for disease association, we became interested in how human-guided selection had conferred different disease risks between two sub-breeds that recently diverged from the same breed: racing and show Greyhounds. In the original design of this study, we sought to isolate differences in population structure between racing and show Greyhounds. We genotyped 36 Greyhounds: 12 OSA free racing Greyhounds, 12 AKC show Greyhounds, and 12 OSA positive Greyhounds. After a preliminary Principal Component Analysis (PCA) revealed a clear separation between OSA positive racers, OSA negative racers, and AKC Greyhounds (Fig. 4.1), we decided to focus on this trait. Due to the small sample size, we were underpowered to conduct traditional GWAS. Based on previous experience with the IUT [133, 150], we developed a modified IUT, the GIA (Rybaczyk et al., manuscript submitted), and applied it to our data.

The following two theorems explain a) how an IUT is at least a level-α test and thus requires no multiple testing corrections and b) how specific applications of the IUT cannot result in size-α tests.

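The hypothesis to be tested under an IUT can be stated as follows, in which the null hypothesis $H_0$ is a union of $k$ sets:

$$H_0 : \theta \in \bigcup_{i=1}^{k} \Theta_i \quad \text{versus} \quad H_a : \theta \in \bigcap_{i=1}^{k} \Theta_i^{c}$$

Berger (1982) proved the following theorem concerning IUTs.

Theorem 1:

If $R_i$ is the rejection region of a level-$\alpha$ test of $H_{0i} : \theta \in \Theta_i$ versus $H_{ai} : \theta \in \Theta_i^{c}$ for $i = 1, \ldots, k$, then the IUT with the rejection region

$$R = \bigcap_{i=1}^{k} R_i$$

is a level-$\alpha$ test of $H_0$ versus $H_a$.

Proof:

For any $\theta \in \bigcup_{i=1}^{k} \Theta_i$ there is some $i$ with $\theta \in \Theta_i$, and because $R \subseteq R_i$,

$$P_\theta(R) \le P_\theta(R_i) \le \alpha .$$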

This result provides a Type I error rate of at most α without the need for multiplicity correction. According to Berger's first theorem (Theorem 1), the IUT is quite conservative, and thus no multiple testing corrections are needed across groups because the overall size of the test is at most α [113, 128, 149].
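As a worked illustration only (a sketch under stated assumptions, not part of Berger's proof): let $R_i$ denote the rejection region of the $i$-th of $k$ component tests, each run at level α on an independent group, and suppose a SNP is truly null in every group. The chance that it is rejected in all $k$ groups, and therefore retained by the intersection, is then

```latex
P\left(\bigcap_{i=1}^{k} R_i\right) = \prod_{i=1}^{k} P(R_i) \le \alpha^{k},
\qquad \text{e.g. } \alpha = 0.05,\ k = 3:\quad \alpha^{3} = 1.25 \times 10^{-4} \le \alpha .
```

Independence is assumed here only to make the arithmetic concrete; Theorem 1 itself guarantees the bound α without it.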

2) The IUT can negate data artifacts. Microarray data are extremely sensitive to minor artifactual variations (e.g., hybridization technique or technical printing differences between commercial arrays). There are multiple statistical algorithms to compensate for such effects. However, inherent in the IUT is the removal of artifactual variation arising from a single measure, which would fail in at least one hypothesis test. Additionally, rather than compensating for population structure, the IUT can negate its effects. Population structure has been the bane of many GWAS [112]; as a result, traditional approaches suggest controlling for population structure, and multiple models have been created to that end [169, 170]. The GIA identifies associations independent of other confounding variables. Inherent in the GIA, failure of a single test results in failure to reject the overall null hypothesis (the SNP is not significant in every group). Thus, the IUT is one of the most powerful tests for association because only factors truly associated with the hypothesis will be retained.

Based on sampling 173,000 SNPs at α = 0.05, we conservatively use p ≤ 10^-4 as our cutoff value for genomewide significance [171]. In this paper, we have employed Bayesian statistical inference, in which Bayes' theorem is used to calculate how belief in a proposition changes due to evidence. The GIA can also be thought of as the probability of event A multiplied by the conditional probability of event B given A. Thus, if we assumed a prior probability in which the phenotypes associated with any genotypes in the dataset were completely biologically independent (equivalent to randomly assigning genotypes and phenotypes), then we would predict ~21 false positives in our data (p^3 (0.000125) x 171,993 SNPs). In our case, since we have a clinically validated phenotype and reliable genotyping call information that we independently validated, the Bayesian false positive rate depends on two factors: 1) heritability and 2) prevalence. While no quantitative figure exists for OSA heritability in Greyhounds, OSA heritability in the closely related Scottish Deerhound is 69%. Assuming a similar heritability, this reduces the number of false positives to 14.8 (i.e., under the assumption that phenotype classifications were randomly assigned irrespective of true phenotype). Additionally, OSA prevalence depends on the definition of the population used: in Greyhounds, 25% of all deaths due to cancer are from OSA, while the total population prevalence is only ~6%.
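The false-positive arithmetic above can be reproduced in a few lines. This is a minimal sketch, assuming (as the text implies) that the three per-group tests are treated as independent, so that the joint null probability is α^3, and that the heritability adjustment is a simple multiplication by 0.69. The variable names are illustrative and are not taken from the original analysis.

```python
# Expected false positives when a SNP must pass three tests at alpha = 0.05.
ALPHA = 0.05          # per-group significance level
N_GROUPS = 3          # number of groups whose results are intersected
N_SNPS = 171_993      # SNPs remaining after quality control
HERITABILITY = 0.69   # OSA heritability reported for Scottish Deerhounds

# Assumption: the three tests are independent under the null.
joint_null_probability = ALPHA ** N_GROUPS
expected_false_positives = joint_null_probability * N_SNPS
# Assumption: the heritability adjustment is a plain multiplication.
adjusted_false_positives = expected_false_positives * HERITABILITY

print(f"joint null probability:   {joint_null_probability:.6f}")
print(f"expected false positives: {expected_false_positives:.1f}")
print(f"heritability-adjusted:    {adjusted_false_positives:.1f}")
```

Under these assumptions the script gives roughly 21.5 and 14.8, in line with the ~21 and 14.8 quoted in the text.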

OSA Cases: The average age of onset for OSA is 8 years. We classified dogs as OSA negative if they were at least 10 years of age with a recent healthy veterinary physical. Our original design was not structured to analyze OSA and, consequently, the first group of racer-controls did not meet these criteria (average age 5.5 years). To establish whether these currently young OSA-negative racers could be classified as racer-controls despite their age, we conducted a principal components analysis (PCA) using AKC show dogs, old racer-controls, and OSA-positive cases. We found that the young racer-controls grouped together with both the show-controls and the old racer-controls that met our cutoff criteria for being OSA negative, and grouped apart from the OSA-positive cases (see Fig. 4.1). This suggested that the young racer-controls could be considered OSA negative. The 6% overall incidence of OSA development [172] supported this finding, and therefore these dogs were included in the GIA.

Animals: We collected samples from osteosarcoma-positive racing Greyhounds (n=12) and osteosarcoma-negative racing Greyhounds (n=12) in collaboration with The Ohio State University Veterinary Medical Hospital and Greyhound Health and Wellness Program. Samples from AKC show dogs (n=12) were collected in collaboration with the AKC at a show in California. Samples were shipped overnight on ice for processing in the laboratory at The Research Institute at Nationwide Children’s Hospital. All samples were verified for registration and pedigree information (for racing Greyhounds, NGA registration; for show Greyhounds, AKC registration), and the dogs were determined to be unrelated to at least the level of grandparents.

Clinical Characterization of Osteosarcoma: For cases, osteosarcoma suspected on x-ray was diagnosed by a veterinary oncologist after surgical excision of the tumor and a subsequent pathology report by a veterinary pathologist. All blood samples were obtained prior to initiation of chemotherapy. For controls, the Greyhound Health and Wellness Program at The Ohio State University invited owners of adopted retired racing Greyhounds that were osteosarcoma free and >10 years of age to participate in this research project. Interested owners were screened for inclusion; informed consent for blood collection was then obtained from the owner, and the sample was collected by a trained veterinary technician in one or two 7 mL BD lavender-top tubes.

DNA Isolation: DNA was isolated from whole blood using the Puregene Genomic DNA Purification Kit (Gentra), with an additional ethanol precipitation step for optimal DNA quality. Samples were selected based on purity and high-molecular-weight DNA (as determined by Nanodrop readings and agarose gel electrophoresis, respectively) and confirmation of complete registration and pedigree information (for racing Greyhounds, NGA registration; for show Greyhounds, AKC registration).

Platform & SNP calling: We genotyped 36 samples using 3 CanineHD Beadchip SNP arrays (12 samples/array; Illumina). The CanineHD Beadchip contains 173,622 probes for SNP markers, with an average of more than 70 markers per Mb. Of these SNPs, 0.9% were recently discovered through targeted resequencing of previous gaps, 65.1% were present in a comparison between the boxer reference genome and the low-coverage poodle genome, 21.7% were present in low-coverage sequence reads from diverse dog breeds compared to the boxer genome, 25.4% were present within the boxer reference, and 1.2% were present within alignments of wolf and/or coyote sequences with the reference boxer genome [136]. The platform was validated in 450 samples from 26 dog breeds [115]. It should be noted that a presumptive weakness in the design of this array is that the selection of SNPs was based on comparison with the reference boxer genome (CanFam2).

In the original design of this study, we sought to map genetic variation differences between racing and show Greyhounds. Using the PLINK software package [118], we excluded samples with ≥25% of total genotyping calls missing (none), and any SNP that was missing in ≥25% of the samples (conducted on each array separately rather than on the samples collectively). Out of a possible 173,663 markers, this left 171,993 SNPs, with a genotyping call rate per individual of >99.9%. To ensure the highest accuracy of the genotyping calls, after our initial genotyping analysis with PLINK (using Illumina’s genotyping calls), we custom designed 19 PCR assays to flank the genotyped SNPs for validation and selected 2 OSA dogs and 2 racer-control dogs to confirm a subset of 5 SNP calls made by the Illumina algorithm. We found a number of miscalls for some SNPs within this verification set. We then expanded the sequencing of our SNP discovery set to include all 24 racing Greyhounds genotyped on the CanineHD array for 2 SNPs, and a smaller subset of these dogs for an additional 13 of our significant SNPs. We compared 92 sequence results to Illumina’s genotyping and found that only 53% were accurately called on the array. The most common type of error we observed was a homozygous call that sequencing revealed to be heterozygous. This is consistent with previous literature that finds a tendency for SNPs to be called homozygous rather than heterozygous [173]. Most problematic, in 12% of calls, Illumina’s algorithm called homozygous for the wrong allele (when sequenced, the sample was homozygous for the alternative allele). We manually examined the GenTrain scores (Illumina’s algorithm for SNP intensities that determines genotyping calls) for these miscalled SNPs and noted a pattern in which the miscalled genotypes clustered outside of Illumina’s highest confidence ranking (Fig. 4.6).

To determine whether this high proportion of miscalls was due to the calling algorithm, we repeated the GIA using Partek SNP calling and compared these results both to Illumina’s calls and to the sequencing data. Using Partek, 72% of calls matched the sequencing results, and only 1 of the calls inaccurately classified a sample as homozygous for one allele when it was homozygous for the alternative allele (this overlapped with one of Illumina’s miscalls for the same sample, suggesting an error with the sample or possible copy number differences). While our error rate appears high, we believe our data to be reliable for several reasons. 1) We used very stringent quality control measures for the genomic DNA selected for this analysis. 2) The Illumina CanineHD SNP array was processed by an Illumina-certified lab, which included an additional quality control measurement of the DNA. 3) We validated two of our calls in a second cohort of Greyhounds. 4) Our overall call rate per dog averaged >99.9%. This is consistent with the study that established the use of this array technology [115]. Notably, that study did not report sequencing data to verify its reported calls. We also find a lack of examples in the literature in which studies validate a portion of their highly significant calls to determine error rates. Thus, we believe our study stands out for reporting this information, but believe its error rate to be consistent with others.
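The comparison of array calls with sequencing results described above amounts to classifying each call and tallying the categories. The following is a minimal sketch of that bookkeeping, assuming genotypes are written as two-letter strings such as "AG"; the function names, category labels, and example data are hypothetical and do not come from the original analysis.

```python
from collections import Counter


def classify_call(array_call, sequenced):
    """Compare one array genotype with its sequenced genotype (e.g. 'AA', 'AG')."""
    array_gt = "".join(sorted(array_call))   # 'GA' and 'AG' are the same genotype
    seq_gt = "".join(sorted(sequenced))
    if array_gt == seq_gt:
        return "concordant"
    array_hom = array_gt[0] == array_gt[1]
    seq_hom = seq_gt[0] == seq_gt[1]
    if array_hom and not seq_hom:
        return "hom_called_for_het"          # homozygous call, heterozygous by sequencing
    if array_hom and seq_hom:
        return "wrong_homozygote"            # homozygous for the wrong allele
    return "other_miscall"


def summarize(pairs):
    """Return the fraction of calls that fall in each category."""
    counts = Counter(classify_call(a, s) for a, s in pairs)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


# Hypothetical (array call, sequenced genotype) pairs, for illustration only.
example_pairs = [("AA", "AA"), ("AA", "AG"), ("GG", "AA"), ("AG", "GA"), ("AG", "GG")]
summary = summarize(example_pairs)
print(summary)
```

Keeping "homozygous called for heterozygous" separate from "wrong homozygote" mirrors the two error types distinguished in the text.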

In this case, we eliminated all SNPs that had at least one no-call in our dataset of interest, but did not implement any Hardy-Weinberg equilibrium (HWE) criteria. SNPs that violate HWE are traditionally removed at the beginning of genetic studies, especially in humans, as they can result in false associations and are commonly evidence of genotyping errors. However, another reason that SNPs fail HWE is an association with disease [173]. While some groups have used this to find variants associated with disease or as a methodological premise [174-176], this fact is rarely exploited by researchers. Moreover, one of the basic assumptions of HWE is random mating with no specific selection pressure [177]. Thus, in pure-bred dogs we expect many of our loci to violate HWE because of current breeding practices and human-guided selection. One study that specifically addressed this issue found that HWE was a significant concern across breeds but had little effect within breeds [178]. However, this group only looked at 109 SNPs representing 13 genes. The application of HWE and the extent to which it must be corrected for is still quite controversial. While we violated HWE in our comparison of cases and racer-controls, this comparison suggests that population differentiation between the two groups is expected and related to our phenotype of interest.

Our sampling method was specifically designed to aggregate Greyhounds from the same population of racers (which are randomly assigned to different case-control groups for the GIA); thus, any separation that occurs should be due to the phenotype of interest, without interference from population structure. We thus propose that the only loci with population structure identified using our GIA methods here are those associated with OSA.
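For readers unfamiliar with how an HWE violation is detected, a minimal sketch of the standard chi-square test on genotype counts follows. It is an illustration, not the procedure used in this study (no HWE filter was applied here), and with groups as small as 12 dogs an exact test would normally be preferred. The function name and example counts are hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of genotype counts against Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)          # frequency of allele A
    q = 1.0 - p                              # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic SNPs).
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts: the first set matches HWE, the second has no heterozygotes.
print(hwe_chi_square(25, 50, 25))
print(hwe_chi_square(50, 0, 50))
```

A complete absence of heterozygotes, as in the second call, is the kind of pattern that strong selection or inbreeding can produce and that a routine HWE filter would discard.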

PCR and Genotyping Sequencing: PCR forward and reverse primers were custom designed to flank the SNP of interest, with an average Tm of 60 °C and an amplicon size of ~400 bp. PCR was conducted as per protocol (JumpTaq Hot Start Polymerase, Sigma). Products were purified as per protocol (PCR Purification, QIAGEN) and then sequenced (MWG, Operon). Results were interpreted using SeqMan (DNASTAR, LaserGene); samples with ambiguous results were resequenced.

Validation of GIA Method with Public Canine GWAS Data: To validate our application of the GIA, we used Vaysse, Ratnakumar and colleagues’ recently released SNP dataset of 456 dogs from 30 breeds that were genotyped on the same platform we used for this study [115]. We tested six variables (furnishings, boldness, size, ear morphology, sociability, and tail curliness) and selected two (furnishings and boldness) to report here; additional analysis of these and other traits using these data is presented in a partnering publication focused more on the methodology of the GIA for genomewide analysis (Rybaczyk et al., manuscript submitted). Vaysse et al. identified the previously reported locus for furnishings, with the most significant SNP at chr13:11,678,731 (CanFam2) and a genomewide significance pattern observed between 10.42 and 11.68 Mb. To apply the GIA approach to their data, we randomly distributed all 456 dogs into 3 groups, with each of the breeds represented within all three groups. Using PLINK, we performed chi-square association tests within each group. We compared the significant calls across all three groups, retaining only the calls that were significant at p<0.05 in each of the groups. Our top SNP (p = 3.52 x 10^-119) was the same SNP that was most significant in Vaysse et al.’s study. Of our top 10 most significantly associated SNPs, 9 were present at this locus.
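The split, test, and intersect procedure described above can be sketched as follows. This is a simplified illustration and not the authors' implementation or PLINK's: it assumes an allelic 2x2 chi-square test without continuity correction, treats the groups as independent when multiplying p-values, and uses hypothetical function names, SNP names, and allele counts.

```python
import math


def chi2_allelic_p(case_a, case_b, ctrl_a, ctrl_b):
    """P-value of a 1-df chi-square test on a 2x2 table of allele counts."""
    n = case_a + case_b + ctrl_a + ctrl_b
    row1, row2 = case_a + case_b, ctrl_a + ctrl_b
    col1, col2 = case_a + ctrl_a, case_b + ctrl_b
    if 0 in (row1, row2, col1, col2):
        return 1.0                            # uninformative table
    stat = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2 / (row1 * row2 * col1 * col2)
    return math.erfc(math.sqrt(stat / 2.0))   # chi-square survival function, 1 df


def intersection_analysis(tables_by_group, alpha=0.05):
    """Keep SNPs significant at alpha in every group; return their joint p-value.

    tables_by_group: one dict per group mapping
    SNP name -> (case_a, case_b, ctrl_a, ctrl_b) allele counts.
    """
    shared = set.intersection(*(set(group) for group in tables_by_group))
    retained = {}
    for snp in shared:
        p_values = [chi2_allelic_p(*group[snp]) for group in tables_by_group]
        if all(p < alpha for p in p_values):
            # Assumption: independent groups, so the joint p is the product.
            retained[snp] = math.prod(p_values)
    return retained


# Hypothetical allele counts for three groups.
associated, null = (7, 1, 1, 7), (4, 4, 4, 4)
groups = [
    {"snp_assoc": associated, "snp_null": null, "snp_partial": associated},
    {"snp_assoc": associated, "snp_null": null, "snp_partial": associated},
    {"snp_assoc": associated, "snp_null": null, "snp_partial": null},
]
retained = intersection_analysis(groups)
print(retained)
```

In this toy input only `snp_assoc` survives: `snp_partial` is significant in two groups but fails in the third, which is exactly the kind of signal the intersection step is meant to discard.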

Subsequently, we applied our method to a smaller sample of 24 randomly selected dogs from their data. Again, our most significantly associated SNP was chr13:11,678,731 bp (p = 2.54 x 10^-13), with our top 4 SNPs present within the previously associated haplotype. We also looked at the phenotype of boldness in Vaysse et al.’s dataset [boldness as defined in 119]. Their top locus occurred 271 kb 3’ of a region previously shown to be associated with dog size [within an intron of HMGA2, ~11,195,975 bp; 34, 119], with additional highly significant results at chr10:10,804,969-11,440,860 bp. When we applied the GIA to this dataset, our top 5 hits spanned this same locus (10,703,666-11,440,860 bp; p = 9.39 x 10^-46 to 6.33 x 10^-76) and included their SNP located at 11,440,860 bp.

To validate our implementation of the GIA method within our own dataset, we selected a previously mapped trait in dogs: brindle coat pattern. Previous studies identified the K locus for dominant black coat color and refined the overlapping brindle region to <2 Mb, chr16:54.65-56.53 Mb [CanFam1; 179, 180]. Using the coat color listed for each dog on its official pedigree from its registering institution (either NGA or AKC), we categorized all dogs with brindle identified in their coat color as affected and all dogs with no brindle identified in their coat color as non-affected. We implemented a GIA in which each array was treated as a separate analysis and association testing (chi-square) was performed. Selecting a cutoff of α = 0.05, we completed an intersection analysis to determine what remained statistically significant across all three datasets. Most notably, our most significant SNP (BICF2S23764760, chr16:60,483,301 bp, p = 1.15 x 10^-9) falls 675 kb downstream from the region of association previously reported with brindle in 3 breeds. It is directly within a region that segregated with brindle coat pattern in a Lab x Greyhound cross [chr16:52.72-57.32 Mb; 180]. The SNP we have significantly associated with brindle is the closest SNP to the region that is polymorphic in this breed.

Because black is dominant over brindle, some black or parti-colored dogs may be brindle carriers without exhibiting it.

Population Stratification Analysis: Spurious results from case-control genotype associations can be attributed to a number of factors, including DNA quality, differing algorithms, population stratification, and the combining of data from different commercial array releases [181-184]. Given that, in our use of the GIA, we randomly assigned cases and controls to multiple groups, the method should be impervious to population structure and most types of cryptic population structure (except ascertainment, or sampling, bias) (Rybaczyk et al., submitted). The gravest potential problem is population structure, such as the possibility that cases are on average more closely related than controls [173]. This is of serious concern in founder populations that have grown rapidly from a small size, or when extensive inbreeding occurs [185]. However, ascertainment bias is largely (if not completely) non-genetic, which reduces the likelihood of its influence on our strongest results [173]. To investigate the role that population stratification may be playing in our data, we performed a population stratification analysis using PLINK, comparing the cases and racer-controls. PLINK uses a complete-linkage agglomerative clustering method based on pairwise identity-by-state (IBS) distance, with modifications to the clustering process that include restrictions based on a significance test for whether two individuals belong to the same population, a phenotype criterion, and cluster size restrictions [118]. We first performed a genomewide IBS clustering and, as expected, the cases and racer-controls clustered together, with an approximate proportion of variation between the two groups of 1.83 x 10^-3. We then conducted a permutation test for between-group IBS differences based on OSA status. This method uses 10,000 permutations to ask whether, on average, an individual is less similar to a phenotypically discordant individual than would be expected by chance [i.e., if we randomized phenotype labels; 118]. This is performed for 12 separate tests, and there was no significant difference between the groups in any comparison (case/control more/less similar; case/case more/less similar than control/control, etc., with all of the resultant p-values >0.1). This suggests that cases being more related to each other than controls is not a concern in this GIA study.
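The logic of the between-group permutation test can be sketched in a few lines. This is a simplified stand-in for illustration, not PLINK's implementation: it assumes genotypes coded 0/1/2, uses a plain mean of shared alleles as the IBS similarity, and covers only one of the comparisons (whether discordant pairs are less similar than chance). All function names and the example data are hypothetical.

```python
import itertools
import random


def ibs_similarity(g1, g2):
    """Mean proportion of alleles shared IBS between two 0/1/2 genotype vectors."""
    return sum(2 - abs(a - b) for a, b in zip(g1, g2)) / (2 * len(g1))


def between_group_ibs(genos, labels):
    """Mean IBS similarity over all phenotypically discordant pairs."""
    sims = [ibs_similarity(genos[i], genos[j])
            for i, j in itertools.combinations(range(len(genos)), 2)
            if labels[i] != labels[j]]
    return sum(sims) / len(sims)


def permutation_p(genos, labels, n_perm=10000, seed=0):
    """One-sided p-value: are discordant pairs less similar than under shuffled labels?"""
    rng = random.Random(seed)
    observed = between_group_ibs(genos, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if between_group_ibs(genos, shuffled) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)          # add-one correction avoids p = 0


# Hypothetical data: 4 cases and 4 controls typed at 10 SNPs.
example_genos = [[0, 1, 2, 1, 0, 2, 1, 1, 0, 2] for _ in range(8)]
example_labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(permutation_p(example_genos, example_labels, n_perm=200))
```

A large p-value, as obtained for every comparison in this study, indicates that cases and controls are no less similar to each other than randomly labeled dogs would be.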

Acknowledgments:

The authors would like to acknowledge Dr. Helen Hamilton, a member of the Board for the Greyhound Club of North America and the health Chair for the Greyhound Club of America from the AKC, for collecting samples from AKC Greyhounds. We thank Tamra Mathie, RVT, and Nicole Stingle, RVT, of the Clinical Trials Office at The Ohio State University College of Veterinary Medicine for coordinating and collecting racing Greyhound blood samples, Kelly Rybaczyk for her advice, and Drs. Kim McBride and Donna McCarthy for critical reading of the manuscript.

This study was supported in part by a research grant (R210602710) from the National Institutes of Health and a research grant (CA100865) from the Department of Defense Congressionally Directed Medical Research Programs to CEA, and by an Oncology Nursing Society small grant award, a Midwest Nursing Research Society Susan Elek Dissertation Research Grant, an OSU Alumni Grants for Graduate Research Dissertation award, and an International Society of Nurses in Genetics Small Grant Award to JLR. JLR was supported by a fellowship from the National Institutes of Health (NINR 5F31NR011559). SZL was supported by a fellowship from Caja Madrid Foundation (Madrid, Spain).

All data reported here will be available upon publication and can be found at NCBI GEO Datasets, GSE XXX-XXXX.

Figure 4.1. Principal components analysis (PCA) of racing (cases and racer-controls) and AKC show Greyhounds. (A) A PCA, uncorrected for racing status, of covariation among our three groups (cases, green; racer-controls, blue; AKC show, red). Note that the AKC show dogs cluster together tightly on the left side of the figure, while both the OSA-positive and OSA-negative racing Greyhounds cluster on the right. PC2 explains the difference between these groupings. (B) A PCA corrected for racing status. By taking PC2 from the uncorrected data and removing the SNPs that corresponded to that eigenvector, we are able to correct for the effect of racing status. Note that the separation now occurs on OSA status. Cases that are OSA positive cluster together on the left side of the figure, while AKC show dogs that are OSA negative and racer-controls cluster together on the right.

Figure 4.2. Application of the GIA to a published SNP dataset containing 456 dogs from 30 breeds. Vaysse et al. used a GWAS approach to identify several phenotype/genotype associations on the same platform we used [CanineHD, Illumina, >170,000 SNPs; 115]. The gene annotation listed represents a UCSC Genome Browser view of human proteins mapped to the genomic locations. (A) A close-up of the chr10:10.5-12.0 Mb region associated with the phenotype of boldness. The red pins above the solid black line represent SNPs identified by Vaysse et al., while the blue pins represent SNPs identified by our application of the GIA in this study. Both studies called two of the same SNPs as significant (chr10:10804969 and chr10:11440860). (B) A close-up of the chr13:10.0-12.0 Mb region associated with furnishings. The hashed red line indicates the region associated by Vaysse et al., with the red pins marking the SNP boundaries of the associated region. The blue pins represent SNPs identified by our application of the GIA in this study. Vaysse et al.’s SNP of highest association (chr13:11,678,731) was also ours.

Figure 4.3. Identification of the previously mapped locus for brindle coat pattern in our Greyhound cohort. (A) A racing Greyhound with a brindle coat pattern is shown. (B) The regions significantly associated with brindle coat pattern in this study are shown as -log10 of the p-value. Note that SNP BICF2S23764760 (chr16:60,483,301, p = 1.15 x 10^-9), which lies in the previously mapped brindle locus, is significantly higher than the others in association with brindle coat pattern.

Figure 4.4. Gene annotation of candidate association loci. The light blue pin marks the SNP associated with OSA in this study. Genes are taken from the UCSC Genome Browser and represent a subset of genes annotated in organisms present in this server using the RefSeq track. For a comprehensive list of genes, see Table 4.2. (A) The Chr16 region associated with brindle is shown. The solid red bar indicates a region previously associated with brindle; the gene symbol for the Dominant Black gene is CBD103 [179, 180]. The navy bar represents the region of homozygosity within Greyhounds. (B) The Chr10 osteosarcoma region is shown. The solid navy bar indicates the locus previously identified as having the strongest evidence of population differentiation in Greyhounds [158]. In the current study, an additional SNP in this region splits this ancestral haplotype into two. (C) The Chr34 osteosarcoma region is shown. This locus was previously shown to be associated with the development of OSA in Scottish Deerhounds; that linkage interval is indicated by the solid red bar spanning 4 Mb [152]. The navy bar represents our candidate haplotype of 72 kb.

Figure 4.5. GWAS in 36 dogs. The x-axis represents the chromosome and the y-axis the p-value. The solid black line corresponds to the significance level α. (A) Using all 36 Greyhounds, with the AKCG and OFRs coded as negative for OSA and the RRG coded as OSA positive, we conducted a GWAS using a chi-square test. This is uncorrected for multiple testing. With >171,000 SNPs used in this association at our set α-level of 0.05, we will have ~8,574 false positives. (B) To correct for multiple testing, we used a Bonferroni correction and found no statistically significant results. While the Bonferroni correction is thought to be overly conservative, we had similar results using the Bonferroni-Holm method, Sidak’s correction, and Dunnett's correction.

Figure 4.6. Comparison of Illumina genotyping calls to sequencing data. (A) The calls made by Illumina are represented in three categories. Those to the left (light blue) are homozygous for allele B, those in the middle (light purple) are heterozygous, and those to the right (light red) are homozygous for allele A. Note that the samples outside of the blue circle (GenTrain call of 95% confidence) are also closest to the heterozygous line. (B) Sequencing data demonstrate a correct homozygous call within the 95% confidence region (top) and two incorrectly called heterozygous samples (middle, bottom).

Figure 4.7. Histogram of alleles in OFR and OSA Greyhounds for SNP chr34:35,156,555. We sequenced 61 racing Greyhounds (29 OSA negative, OFR; 29 OSA positive, OSA) for their genotype at chr34:35,156,555. We converted the genotype calls to the number of copies present for each allele (columns T and C; 0 = none, 1 = 1 allele present, 2 = 2 alleles present), performed a Student’s t-test, and found a significant difference between groups for both the risk allele (T) and the wild-type allele (C). The y-axis represents percent of total genotyped population, and the x-axis represents the different genotypes. Red bars represent OSA-negative racing Greyhounds and gray bars represent OSA-positive racing Greyhounds.

Figure 4.8. Histogram of alleles in OFR and OSA Greyhounds for SNP chr10:5,894,741. We sequenced 61 racing Greyhounds (31 OSA negative, OFR; 30 OSA positive, OSA) for their genotype at chr10:5,894,741. We converted the genotype calls to the number of copies present for each allele (columns T and C; 0 = none, 1 = 1 allele present, 2 = 2 alleles present), performed a Student’s t-test, and found a significant difference between groups for both the risk allele (C) and the wild-type allele (A). The y-axis represents percent of total genotyped population, and the x-axis represents the different genotypes. Red bars represent OSA-negative racing Greyhounds and gray bars represent OSA-positive racing Greyhounds.

Figure 4.9. Kaplan-Meier analysis of ZBBX. We used the genomic database and analysis server Kaplan-Meier Plotter [http://kmplot.com/analysis/; 156], which contains gene expression data from Affymetrix HGU133A and HGU133+2 microarrays on 2,472 breast cancer patients. We explored the relationship between ZBBX expression and breast cancer survival by plotting low and high expressers of ZBBX against probability of survival (y-axis) and time in years of survival (x-axis). Without stratifying by molecular subtype, decreased breast cancer survival is strongly associated with reduced expression levels of ZBBX (Bonferroni corrected, p=0.005).

Figure 4.10. Graph of homozygosity within region chr10:5,647,387-6,238,255. This demonstrates a unique pattern of reduced heterozygosity within a shared haplotype across all three sub-breeds of Greyhounds (RRG, OFR, AKCG). Here, the light blue squares correspond to the shared region of homozygosity, the yellow bar is our identified risk SNP for OSA, and the orange indicates heterozygous calls. Each row represents a unique SNP, each column a unique dog.

Figure 4.11. BioGPS [165] Gene Atlas (MOE430; 1430555) expression values from Affymetrix chips, normalized using GCRMA, relate to fluorescence intensity. A close-up of LRIG3, showing a 30-fold increase in expression in day 5, 7, and 14 mouse osteoblasts. In Xenopus, LRIG3 regulates neural crest formation by influencing FGF and WNT signaling pathways; additionally, use of a splicing morpholino led to severe anterior defects and altered body axis in embryos [163].

Table 4.1: Phenotype information of Greyhounds used in this study

Type of Greyhound | ID/Name of Dog | Year of Birth | Sex | Coat Color Listed on Registration | Diseases (per owner)
OFR | OFR_1 | 2007 | Male | Brindle | None
OFR | OFR_2 | 2007 | Female | Fawn | None
OFR | OFR_3 | 2005 | Female | Fawn | None
OFR | OFR_4 | 2005 | Female | Brindle | None
OSA | OSA_1 | 2000 | Female | Brindle & White | OSA
OSA | OSA_2 | 1999 | Female | Fawn | OSA
OSA | OSA_4 | 1998 | Male | White & Fawn | OSA
OSA | OSA_11 | 2000 | Male | Brindle | OSA
AKC | AKC_4 | 2003 | Male | Blue Brindle | None
AKC | AKC_6 | 2007 | Female | Blue | None
AKC | AKC_8 | 2004 | Male | Blue Fawn | heart murmur, mitral insufficiency
AKC | AKC_10 | 2008 | Male | Blue Brindle/ Parti | None
OFR | Jams | 1996 | Female | Dark Brindle | None
AKC | AKC1 | 2005 | Female | Black Parti | None
OFR | Tiny | 2001 | Female | White and Brindle | None
OFR | Clifford | 2002 | Male | Brindle | None
OFR | Walter | 1999 | Male | Black | None
AKC | AKC3 | 2006 | Female | Black & White Parti | None
AKC | AKC 4 | 2003 | Male | Blue Brindle | None
AKC | AKC5 | 2005 | Female | Red | None
OSA | Cofax | 2001 | Female | Black | OSA
OSA | Joe | 2003 | Male | Dark Brindle | OSA
OSA | Manny | 2004 | Male | Brindle | OSA
OSA | OSA14 | 1997 | Male | White and Brindle | OSA
OFR | Oshkosh | 1998 | Male | Light Brindle | None
OFR | Deb | 1998 | Male | White & Fawn | None
OFR | Babe | 1999 | Male | White & Brindle | None
OFR | Mac | 2000 | Female | Brindle | None
AKC | AKC2 | 2001 | Female | Parti Colored | None
AKC | AKC9 | 2001 | Male | Black Brindle | None
AKC | AKC11 | 2008 | Male | Red and White | None
AKC | AKC12 | 1999 | Female | Red Brindle | None
OSA | Jiminy | 2005 | Male | Black Brindle & White | OSA
OSA | Topper | 2003 | Male | Fawn | OSA
OSA | Raze | 1999 | Male | Black | OSA
OSA | Kiowa | 2001 | Male | White & Light Brindle | OSA

Table 4.2. Listing of all of the reference genome genes from the UCSC Genome Browser within the specified intervals in Fig. 4.4.

Chr10 Chr16 Chr34

Accession Gene Name Accession Gene Name Accession Gene Name

NM_001095046 LOC495385 NM_001254477 LOC548053 NM_001124255 LOC100135879 NM_001205591 LRIG3 NM_021928 SPCS3 NM_001199201 ZBBX NM_001136051 LRIG3 NM_001024267 MGC109340 NM_001102054 SERPINI2 NM_153377 LRIG3 NM_029541 Arxes1 NM_178824 WDR49 NM_177152 Lrig3 NM_029823 Arxes2 NM_001083712 MGC151671 NM_001199555 LRIG3 NM_001098293 LOC778464 NM_019745 Pdcd10 NM_001192860 LRIG2 NM_142977 Spase22-23 NM_212933 pdcd10b NM_001107710 Lrig2 NM_029569 Asb5 NM_001200435 pdc10 NM_001025067 Lrig2 NM_001017746 zgc:112531 NM_200555 pdcd10a NM_014813 LRIG2 NM_001002852 Spata4 NM_001122752 SERPINI1 NM_001110370 lrig3 NM_181265 WDR17 NM_001245196 LOC100190743 NM_001110347 lrig3 NM_001201544 si:ch211-72g23.1 NM_001095681 LOC734183 NM_011391 Slc16a7 NM_001035474 C15H11orf58 NM_001140417 neus NM_001076336 SLC16A7 NM_001013898 RGD1311703 NR_033843 LOC646168 NM_001099419 zgc:165507 NM_001193864 C11orf58 NM_001077122 MGC133804 NM_001199604 SLC16A7 NM_001093296 smap NM_001205553 GOLIM4 NM_004731 SLC16A7 NM_201591 GPM6A NM_001098192 golim4a NM_001112921 slc16a7 NM_001173936 dmb NM_001167748 Egfem1 NM_017302 Slc16a7 NM_213200 gpm6aa NR_021485 EGFEM1P NM_001037408 slc16a7 NM_001135460 DKFZP469H0415 NM_001183904 RPS6A NM_001146627 AGXT2L1 NM_214687 gpm6ab NM_211925 AGOS_AGR197C NR_027474 AGXT2L1 NM_001037526 Defb41 NM_001178529 RPS6B NM_001146590 AGXT2L1 NM_001037804 DEFB130 NR_032606 MIR551B NM_031279 AGXT2L1 NM_001039125 Defb47 NR_034820 MIR551 NR_027475 AGXT2L1 NM_001195257 LOC100133267 NM_001193396 HMGN4 Continued

Table 4.2 continued

Chr16 Chr34

Accession Gene Name Accession Gene Name

NM_001037751 Defb48 NM_005241 MECOM NR_003668 DEFB109P1B NM_001095670 mecom-a NM_001129759 LOC100170034 NM_001095633 evi1 NM_001037532 Defb42 NM_001161389 mecom-b NR_024044 DEFB109P1 NM_001177995 Prdm16 NM_001002035 DEFB108B NR_001576 TERC NR_036558 DEFB108P NM_001038673 ARPM1 NM_001244455 PSMB4 NM_030557 Mynn NM_001105032 ZNF705A NM_001242944 Gm14306 NM_001193630 LOC100132396 NM_001177832 Gm10324 NM_001164457 ZNF705G NM_144546 Zfp119a NR_046212 C130030K03Rik NM_001162911 Zfp934 NM_001002589 zgc:92716 NM_145490 Zfp959 NM_153655 psma6a NM_145563 Zfp932 NM_165567 CG30382 NM_145591 Zfp958 NM_001251322 PAA1 NM_172491 D130040H23Rik NM_165565 Prosalpha1 NM_001085546 Gm14393 NM_131795 psma6b NM_001134998 LOC100188984 NM_001200869 psa6 NM_146231 Zfp825 NM_001108931 RGD1560350 NM_001177525 EU599041 NM_074170 pas-1 NM_001162922 Zfp931 NM_001111668 paa NM_001244038 ZNF726 NM_001055697 Os03g0180400 NM_001101948 ZNF596 NM_001136590 pco108605 NM_001256173 ZNF85 NM_126597 PAA2 NM_001099327 100043387 NM_001046994 LOC732998 NM_001008727 ZNF121 NM_211639 AGOS_AGL089W NM_138330 ZNF675 NM_001021282 SPBC646.16 NM_001139487 LOC100125368 NM_001180876 SCL1 NM_146249 Zfp119b NR_046351 FAM90A10P NM_001001130 Zfp85-rs1 NM_001256857 LOC100287327 NM_001077624 ZNF846 Continued

Table 4.2 continued

Chr16 Chr34

Accession Gene Name Accession Gene Name

NM_001242329 USP17L5 NM_001100416 OTTMUSG00000016609 NM_001256855 LOC100287238 NM_001099308 Gm14391 NM_001256852 LOC100287144 NM_001034900 Zfp345 NM_001256853 LOC100287205 NM_001145863 Gm14139 NM_001256854 LOC100287178 NM_001078504 zf(c2h2)-29 NM_001256863 LOC100287513 NM_001105557 Zfp938 NM_201402 USP17L2 NM_001136496 Zfp935 NM_001256859 LOC100287364 NM_001105028 LOC100125388 NM_001256862 LOC100287478 NM_001099349 Gm14308 NM_001256869 USP17L7 NM_001142572 ZNF669 NM_001256860 LOC100287404 NM_173480 ZNF57 NM_001256871 USP17L3 NM_001024731 LOC100044193 NM_001256873 USP17L1P NM_001033719 ZNF404 NM_001256874 USP17L4 NR_040256 Gm14405 NM_001256894 LOC100288520 NM_001194298 ZNF501 NM_001256872 USP17L8 NM_001078390 zf(c2h2)-13 NM_001256861 LOC100287441 NM_001172779 LRRC34 NM_201409 Dub1a NM_001080460 LRRIQ4 NM_001256973 Dub3 NM_024727 LRRC31 NM_007887 Dub1 NM_001166606 C25H16orf88 NM_001001559 Dub2a NR_029451 LINC00477 NR_046415 LOC649352 NM_001201058 usmg5 NR_046416 USP17 NM_001200033 LOC794625 NM_001256867 LOC728419 NM_182610 SAMD7 NM_001242331 LOC728400 NR_024409 LOC100128164 NM_001242327 LOC728369 NM_027016 Sec62 NM_001242326 LOC728373 NM_001093356 MGC82698 NM_001242328 LOC728379 NM_001025147 Gpr160 NM_001242330 LOC728393 NM_024947 PHC3 NM_001242332 LOC728405 NM_001031319 PRKCI NR_027279 USP17L6P NM_001030091 prkcz Continued

Table 4.2 continued

Chr16 Chr34

Accession Gene Name Accession Gene Name

NR_003275 LOC392196 NM_001204587 LOC100533284 NM_014629 ARHGEF10 NM_001043458 Apkc NM_001038140 MYOM2 NM_062610 pkc-3 NM_001206128 UBASH3B NM_001248008 SKIL NM_033225 CSMD1 NM_001130669 skilb NM_052896 CSMD2 NM_001161641 CLDN11 NM_001153786 magi104697 NM_001002624 cldn11a NM_001252218 Rpl31 NM_174623 TMSB10 NR_002595 RPL31P11 NM_020949 SLC7A14 NM_001200721 rl31 NM_134506 CG12531 NM_001248470 LOC100499742 NM_141171 slif NM_001250275 LOC100306335 NM_140962 CG13248 NM_001064043 Os06g0319700 NM_001246309 LOC100164077 NM_001159008 LOC100286120 NM_001192552 SNRPB2 NM_001111810 TIDP3261 NM_001200570 ru2b NM_001158649 LOC100285759 NM_001043919 SNF NM_001182294 RPL31B NM_180605 F10K1.32; NM_127532 F6F22.23; NM_001056381 Os03g0298800 NM_125054 MIK19.16; NM_001162674 LOC100162519 NM_118756 T25K17.40; NM_001175915 LOC100383254 NM_001017855 zgc:110146 NM_001156411 LOC100283511 NM_001092983 mdh2-a NM_001254349 LOC100805822 NM_001138756 LOC100193663 NM_001250958 LOC100499914 NM_209285 AGOS_ADL164C NM_001004855 eif5a NM_001066892 Os07g0630800 NM_145687 MAP4K4 NM_209701 AGOS_ADR252W NM_001091926 idh3b NM_001138830 pco143139c NM_142743 CG6439 NM_001051226 Os01g0829800 NM_001171915 LOC100329146 NM_001254820 LOC100856934 NM_001137265 LOC100191841 NM_142439 CG7998 NM_001156475 LOC100283574 NM_112364 mMDH2 NM_001049260 Os01g0276100 Continued

Table 4.2 continued

Chr16 Chr34

Accession Gene Name Accession Gene Name

NM_001022809 SPCC306.08c NM_120407 IDH-V NM_001009348 MCPH1 NM_059588 C30F12.7 NM_001245208 LOC100189979 NM_001147405 LOC100272953 NM_001118887 ANGPT2 NM_119730 IDH-III NM_009641 Angpt4 NM_001059628 Os04g0479200 NM_001131165 AGPAT5 NM_001053849 Os02g0595500 NM_001076745 zgc:154071 NM_179632 IDH2 NM_207411 XKR5 NM_119692 IDH1 NM_214442 PBD-2 NM_001148411 LOC100274026 NM_001078133 DEFB NM_001099645 RPL22L1 NM_001032857 RHBD-1 NR_004442 Gm15421 NM_001009151 DEFB1 NM_001245860 LOC100219288 NM_001025353 Gm6040 NM_001245671 LOC100190148 NM_001037547 Defb51 NM_001141369 rl22l NM_001037524 Defb52 NR_033152 CR42491 NM_001037523 Defb33 NM_062531 rpl-22 NM_001168245 LOC783012 NM_001097788 LOC100037062 NM_022327 Ralb NM_001249505 LOC100500073 NM_214444 LOC404703 NM_001249933 LOC100305544 NM_018661 DEFB103B NM_100164 F22D16.17; NM_183026 Defb14 NM_177586 Eif5a2 NM_001081551 DEFB103A NM_001113938 LOC100135249 NM_001110238 SPAG11 NM_001044073 LOC693075 NM_058201 SPAG11B NM_138034 eIF-5A NM_153115 Spag11a NM_001162016 LOC100161411 NM_001037852 Spag11c NM_001181705 ANB1 NM_080389 DEFB104A NM_001251538 LOC100306157 NM_001040702 DEFB104B NM_001178849 HYP2 NM_139222 Defb15 NM_001247570 eIF-5A1 NM_183035 Defb34 NM_001247578 eIF-5A3 NM_001040704 DEFB106B NM_001160661 if5a1 Continued

Table 4.2 continued

Chr16 Chr34

Accession Gene Name Accession Gene Name

NM_001129756 DEFB106A NM_063406 iff-2 NM_001037514 Defb12 NM_066751 iff-1 NM_139224 Defb35 NM_001138162 LOC100192991 NM_001129753 DEFB105A NM_101261 ELF5A-1 NM_001040703 DEFB105B NM_001249052 LOC100305991 NM_001037668 DEFB107A NM_105608 ELF5A-3 NM_001040705 DEFB107B NM_001247786 eIF-5A4 NR_046354 FAM90A2P NM_001249466 LOC100305510 NM_001250529 LOC100499632 NM_001123497 LOC548620 NM_001103222 SLC2A2 NM_001124289 LOC100135945 NM_198370 zcchc9 NM_001019910 SPAC683.02c NM_001163007 Tnik NM_001037676 tnikb NM_001024937 MINK1 NM_199703 tnika NM_001031126 NRK NM_206252 msn NM_001029803 mig-15 NR_032617 MIR569 NM_001130081 PLD1 NM_001160095 pld1a NM_001075827 PLD2 NM_001164436 TMEM212 NM_001132462 FNDC3B NM_001034427 RPA3 NM_001159832 fndc3ba Continued

Table 4.3. Population stratification using PLINK. Any significant results could suggest population relatedness of case/case vs control/control.

                            Mean        SD
Between-group IBS           0.798521    0.027253
In-group (OSA) IBS          0.797395    0.027205
In-group (OFR) IBS          0.79507     0.025762

Approximate proportion of variance between OSA and OFR groups = 0.00182649

IBS group-difference test                                        p-value
T1:  Case/control less similar                                   0.898081
T2:  Case/control more similar                                   0.101929
T3:  Case/case less similar than control/control                 0.621324
T4:  Case/case more similar than control/control                 0.378686
T5:  Case/case less similar                                      0.596754
T6:  Case/case more similar                                      0.403256
T7:  Control/control less similar                                0.317967
T8:  Control/control more similar                                0.682043
T9:  Case/case less similar than case/control                    0.420226
T10: Case/case more similar than case/control                    0.579784
T11: Control/control less similar than case/control              0.125129
T12: Control/control more similar than case/control              0.874881
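The logic behind a test such as T1 in Table 4.3 can be illustrated with a label-permutation sketch. This is a minimal illustration only, not PLINK's implementation: the function names, the square-matrix input, and the add-one p-value estimate are assumptions made for the example, and PLINK's own procedure may differ in detail. The observed mean between-group IBS is compared with the same statistic after shuffling the case/control labels.

```python
import random
from itertools import combinations
from statistics import mean


def between_group_mean(ibs, labels):
    """Mean IBS over all sample pairs whose members carry different labels."""
    pairs = [ibs[i][j]
             for i, j in combinations(range(len(labels)), 2)
             if labels[i] != labels[j]]
    return mean(pairs)


def perm_pvalue_less_similar(ibs, labels, n_perm=1000, seed=0):
    """Empirical p-value for 'case/control pairs are less similar than chance'.

    ibs    : symmetric square matrix (list of lists) of pairwise IBS values
    labels : one group label per sample (e.g. 1 = case, 0 = control)
    Hypothetical helper for illustration; assumes both groups are present.
    """
    rng = random.Random(seed)
    observed = between_group_mean(ibs, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between label and genotype
        if between_group_mean(ibs, shuffled) <= observed:
            hits += 1
    # Add-one estimate keeps the p-value strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Toy data: two cases and two controls, more similar within than between groups.
toy_ibs = [
    [1.0, 0.9, 0.7, 0.7],
    [0.9, 1.0, 0.7, 0.7],
    [0.7, 0.7, 1.0, 0.9],
    [0.7, 0.7, 0.9, 1.0],
]
toy_labels = [1, 1, 0, 0]
p_value = perm_pvalue_less_similar(toy_ibs, toy_labels)
```

With only four samples the toy p-value cannot become small, because one third of all label assignments reproduce the original grouping. In Table 4.3 none of the twelve tests falls below 0.05, which is the pattern expected when cases and controls are not stratified.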

Table 4.4. Haplotypes of 15 SNPs Associated with the Development of Osteosarcoma in Racing Greyhounds and suggestion of cancer role.

Columns: SNP; Haplotype Region; Haplotype Size (kb); Genes; Suggestion of tumor suppressor or oncogene biology by evidence in this table (all notes refer to Table 4.5). Each haplotype is listed once, followed by the genes it contains.

SNP: BICF2G630722811
Haplotype region: Chr1:41484457-41729672
Haplotype size (kb): 245
  Gene: sterile alpha motif domain containing 5, SAMD5
  Suggestion: frequent amplification in human osteosarcoma consistent with oncogenic role (possibly cell autonomously)

SNP: BICF2P846957
Haplotype region: Chr2:74140437-74325097
Haplotype size (kb): 184
  Gene: protein tyrosine phosphatase, receptor type, U, PTPRU
  Suggestion: implicated tumor suppressor in multiple cancers; limited osteosarcoma genetics suggest oncogenic role in osteosarcoma (possibly cell autonomously); evidence of estrogen responsiveness
  Gene: serine/arginine-rich splicing factor 4, SRSF4
  Gene: mitochondrial trans-2-enoyl-CoA reductase, MECR
  Suggestion: oncogenic, possibly cell autonomously; this is consistent with observed increased mRNA expression in Greyhound osa tumors relative to other breeds (see "Relative gene expression…")

SNP: BICF2S23518547
Haplotype region: Chr8:66558701-66647870
Haplotype size (kb): 89
  Gene: serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 4, SERPINA4
  Suggestion: implicated to have anti-tumor activity in multiple cancers (as anti-angiogenic); limited osteosarcoma genetics suggest oncogenic role in osteosarcoma; evidence of sex hormone responsiveness; evidence of tumor suppression in human breast cancer (see Notes)
  Gene: serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 5; SERPINA5
  Suggestion: implicated to be tumor suppressor in other cancers; no apparent evidence for osteosarcoma
  Gene: serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 3, SERPINA3
  Suggestion: frequent amplification in human osteosarcoma consistent with oncogenic role
  Gene: serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 13 (pseudogene), SERPINA13
  Suggestion: weak suggestion of tumor suppressive role from human osteosarcoma and other cancer genetics

SNP: G570f26S182
Haplotype region: Chr10:5555944-6324274
Haplotype size (kb): 768
  Gene: leucine-rich repeats and immunoglobulin-like domains protein 3, LRIG3
  Suggestion: suggestion of tumor suppressive role from human osteosarcoma and other cancer genetics, and from published studies of a glioma cell line
  Gene: [solute carrier family 16, member 7 (monocarboxylic acid transporter 2), SLC16A7 (aka monocarboxylic acid transporter 2, MCT2) is 73,813 bp outside, but appears to have associated regulatory elements within candidate OSA haplotype]
  Suggestion: SLC16A7 is a potential tumor suppressor in human osteosarcoma: it is a validated target of mir-29b, which has reduced expression in osteosarcoma; this is consistent with decreased mRNA expression in Greyhounds relative to other breeds (see "Relative gene expression ..." column here)

SNP: BICF2P810812/BICF2S23914760
Haplotype region: Chr12:56839735-57170298
Haplotype size (kb): 330.563
  Gene: mannosidase, endo-alpha, MANEA
  Suggestion: no cancer implication except for reduced survival associated with increased expression in breast cancer (see Notes), which suggests an oncogenic role; highly enriched in osteoblasts vs all other cells and tissues except microglia

SNP: BICF2P906491
Haplotype region: Chr13:7469784-7513787
Haplotype size (kb): 44.003
  Gene: Sin3A-associated protein, 18kDa, SAP18
  Suggestion: frequent amplification in human osteosarcoma and other cancers suggests oncogenic role (but gene expression studies show both up and downregulation in different cancers); most highly expressed in osteoblasts compared to other cells and tissues

SNP: TIGRP2P216797_rs8858906
Haplotype region: Chr16:45476333-45615615
Haplotype size (kb): 139.282
  Gene: no

SNP: BICF2S233442
Haplotype region: Chr16:45804993-46056184 (verified that the same 189 kb gap exists in canfam3 build)
Haplotype size (kb): 251.191
  Gene: tripartite motif family-like 1, TRIML1
  Suggestion: frequent deletion in human osteosarcoma and other cancers consistent with tumor suppression
  Gene: tripartite motif family-like 2, TRIML2
  Suggestion: frequent amplification in human osteosarcoma consistent with oncogenic role
  Gene: zinc finger protein 42 homolog, ZFP42
  Suggestion: statistically significant increased expression in Greyhound and trend of amplification in human osteosarcoma tumors are consistent with oncogenic role

SNP: TIGRP2P297646_rs9178782
Haplotype region: Chr22:58921599-59410700
Haplotype size (kb): 489.101
  Gene: family with sequence similarity 155, member A, FAM155A
  Suggestion: statistically significant frequent gene amplification in human osteosarcoma and other cancers suggests oncogenic role; may be affected by estrogen signaling

SNP: TIGRP2P300736_rs8767639
Haplotype region: Chr23:17955842-17996347
Haplotype size (kb): 40.505
  Gene: RNA binding motif, single stranded interacting protein 3, RBMS3
  Suggestion: statistically significant frequent gene deletion in human osteosarcoma suggests tumor suppressor role, as has been reported in two other cancers; multiple suggestions of osteoblast/bone-specific biological roles

SNP: BICF2P627819
Haplotype region: Chr29:15643604-15674170
Haplotype size (kb): 30.566
  Gene: Na+/K+ transporting ATPase interacting 3, NKAIN3

SNP: BICF2G630632308
Haplotype region: Chr29:35992994-36565747 (verified that 20 Mb gap is present in canfam3 build)
Haplotype size (kb): 572.753
  Gene: cyclic nucleotide binding domain containing 1, CNBD1
  Suggestion: statistically significant frequent gene deletion in human osteosarcoma and other cancers suggests tumor suppressor role

SNP: BICF2G630453521
Haplotype region: Chr34:35098110-35170789
Haplotype size (kb): 72.679
  Gene: [zinc finger, B-box domain containing, ZBBX 7390 bp outside candidate OSA interval]
  Suggestion: statistically significant frequent gene deletion of ZBBX in human osteosarcoma, deletion in other human cancers, and downregulation of gene expression in three cancer types are all consistent with a tumor suppressor role

SNP: BICF2P1081100
Haplotype region: Chr37:23255842-23798687
Haplotype size (kb): 542.845
  Gene: Ikaros family zinc finger 2 (Helios), IKZF2
  Suggestion: [see mention of non-cancer cell autonomous immune theory in Notes]
  Gene: sperm associated antigen 16, SPAG16
  Suggestion: statistically significant frequent gene deletion in human osteosarcoma and other cancers suggests tumor suppressor role; in other cancers, there are reports of increased mRNA expression (which would be consistent with an oncogenic role)
  Gene: cluster of large intergenic non coding RNAs (lincRNAs) and transcripts of uncertain coding potential (TUCPs) (human nomenclature): TCONS_00004996, TCONS_00003093, TCONS_00003095, TCONS_00003490, TCONS_00004569, TCONS_00004009, TCONS_00004997, TCONS_00004570
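The haplotype sizes in Table 4.4 can be checked against the region coordinates. The sketch below is a minimal illustration that assumes size is simply end minus start, expressed in kb; the function name and the accepted coordinate format are hypothetical choices for this example, not part of the original analysis.

```python
import re


def haplotype_size_kb(region: str) -> float:
    """Return the size in kb of a region written as 'ChrN:start-end'.

    Hypothetical helper: assumes size = (end - start) / 1000.
    """
    match = re.fullmatch(r"Chr(\w+):(\d+)-(\d+)", region)
    if match is None:
        raise ValueError(f"unrecognized region format: {region!r}")
    start, end = int(match.group(2)), int(match.group(3))
    if end < start:
        raise ValueError("region end precedes region start")
    return round((end - start) / 1000, 3)


# The Chr12 haplotype tagged by BICF2P810812 is listed as 330.563 kb in Table 4.4.
manea_region_kb = haplotype_size_kb("Chr12:56839735-57170298")
```

Under this assumption the computed values agree with the sizes listed to three decimals, for example 330.563 kb for Chr12 and 44.003 kb for Chr13. The first four haplotypes are listed as whole numbers (245, 184, 89, 768), which match the computed sizes with the decimals dropped.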

Table 4.5. Greyhound Risk Loci Cancer Implications.

Columns:
1. Haplotype Region/Size
2. Genes
3. Gene/protein function (NCBI)
4. Gene expression focused on osteoblasts and osteosarcoma (NCBI dbEST, cGAP, BioGPS)
5. Published implications of cancer role (Pubmed)
6. Nextbio osteosarcoma-oriented functional genomic analysis
7. Intogen cancer genetics analysis (cut-off p<1E-4 unless stated)
8. Relative gene expression in canine osteosarcoma tumors (4 Greyhounds vs 7 mixed dogs and 7 pedigree dogs; taken from PMID:20860831; significant in bold)
9. Potentially relevant Quantitative Trait Loci (QTL) and OMIM Phenotype/Gene Unknown (Note: these are very often multi-Mb intervals)
10. Notes


Haplotype region: Chr1:41484457-41729672
Gene: sterile alpha motif domain containing 5, SAMD5
Gene/protein function (NCBI): SAMD5 function and biological role is unknown; SAM domains are known to be involved in protein-protein interactions and are common in proteins that polymerize; SAM domains are very common in eukaryotic genes
Gene expression (osteoblasts and osteosarcoma): widely expressed at low levels; moderately highly expressed in osteoblasts relative to most tissues (BioGPS, mouse)
Published implications of cancer role: none
Nextbio functional genomic analysis: elevated expression in human osteosarcoma relative to other cell lines; gene deletion in primary osteosarcoma (frequency unspecified); gene deletion in U2OS osteosarcoma cell line
Intogen cancer genetics analysis: frequent gene amplification in human osteosarcoma (15 gain, 1 deletion in 38 tumors, p=1.395E-3; Ozaki et al.; PMID:12402305); frequent gene amplification (but not deletion) in cancers from several tissues, mRNA up regulation (but not down) in two cancers
Relative expression in canine osteosarcoma tumors: no probe on array
QTL and OMIM: Rat QTL Lmblg1: Limb length QTL 1, RGD:1354636; Rat QTL Hcas2: Hepatocarcinoma susceptibility QTL 2, RGD:631688; Rat QTL Sald1: Serum aldosterone level QTL 1, RGD:631508; Rat QTL Bmd1: Bone mineral density QTL 1, RGD:1554320; Rat QTL Sald1: Serum aldosterone level QTL 1, RGD:631508; Rat QTL Scort1: Serum corticosterone level QTL 1, RGD:1354580; Rat QTL Hrtrt2: Heart rate QTL 2, RGD:1300167; OMIM: 606255: Stature quantitative trait locus 1; OMIM: 608935: Lung cancer 1
Notes: Geo GDS2755 Analysis of HCT116 cells overexpressing miR-34, a microRNA. miR-34 is commonly deleted in human cancers. Results provide insight into the function of miR-34.; mir-34 down regulated in osteosarcoma: "A study of osteosarcoma cell lines and primary tumor samples revealed an interaction between miR-34 and ; tumor samples showed a decreased expression of miR-34 and inhibited p53-mediated cell cycle arrest and apoptosis [10]." 10. C. He, J. Xiong, X. Xu et al., “Functional elucidation of MiR-34 in osteosarcoma cells and primary tumor samples,” Biochemical and Biophysical Research Communications, vol. 388, no. 1, pp. 35–40, 2009.


Haplotype region: Chr2:74140437-74325097
Gene: protein tyrosine phosphatase, receptor type, U, PTPRU
Gene/protein function (NCBI): summary from RefSeq (adapted from UCSC genome server): member of the protein tyrosine phosphatase (PTP) family, known to be signaling molecules that regulate a variety of cellular processes including cell growth, differentiation, mitotic cycle, and oncogenic transformation; possesses an extracellular region, a single transmembrane region, and two tandem intracellular catalytic domains, and thus represents a receptor-type PTP; extracellular region contains a meprin-A5 antigen-PTP (MAM) domain, Ig-like and fibronectin type III-like repeats; thought to play roles in cell-cell recognition and adhesion; studies of the similar gene in mice suggested role in early neural development
Gene expression (osteoblasts and osteosarcoma): expressed sequence tags from human trabecular bone and chondrosarcoma cells (dbEST)
Published implications of cancer role: implicated to be a tumor suppressor in multiple cancers: lung (PMID:15871143, 15356345), colon (PMID:15059896), skin (PMID:11710941)
Nextbio functional genomic analysis: comparison of gene expression in U2OS osteosarcoma cells overexpressing inducible estrogen receptors ERα or ERβ and then treated with or without the corresponding ligand, estradiol: observed PTPRU upregulation of [...]-fold in ERα (p=4.5E-5), 1.25-fold in ERβ (p=0.0297) in response to estradiol exposure
Intogen cancer genetics analysis: frequent gene amplification in human osteosarcoma (15 gain, 1 deletion in 38 tumors, p=1.395E-3; Ozaki et al.; PMID:12402305); common gene deletion in cancers from 5 tissues; by sequencing, tumor mutations detected in 5 cancers (frequencies from 2/14 to 1/101)
Relative expression in canine osteosarcoma tumors: non-significant increased expression in Greyhounds (fold change 1.68), p=0.13
QTL and OMIM: Rat QTL Emca1: Estrogen-induced mammary cancer QTL 1, RGD:1358187; Rat QTL Bss4: Bone structure and strength QTL 4, RGD:1549838; Mouse bone mineral density 7, MGI:2389131
Notes: Member of the receptor-type protein tyrosine phosphatase (PTP) family

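Haplotype region: Chr2:74140437-74325097
Gene: serine/arginine-rich splicing factor 4, SRSF4
Gene/protein function (NCBI): RNA splicing; probable role in alternative splice site selection during pre-mRNA splicing (UniProtKB)
Gene expression (osteoblasts and osteosarcoma): widely or ubiquitously expressed (dbEST; PMID:17640361)
Published implications of cancer role: none
Intogen cancer genetics analysis: frequent gene amplification in human osteosarcoma (15 gain, 1 deletion in 38 tumors, p=1.395E-3; Ozaki et al.; PMID:12402305); common gene deletion in cancers from 5 tissues and mRNA upregulation in cancers from 5 other (non-overlapping) tissues
Relative expression in canine osteosarcoma tumors: no probe on array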

Haplotype region: Chr2:74140437-74325097
Gene: mitochondrial trans-2-enoyl-CoA reductase, MECR
Gene/protein function (NCBI): fatty acid metabolism and biosynthesis; shown to reduce trans-2-enoyl-CoA to acyl-CoA with chain lengths from C6 to C16 in an NADPH-dependent manner, with preference for medium chain-length substrates (PMID:12654921)
Gene expression (osteoblasts and osteosarcoma): widely expressed at generally low levels, expressed at low levels in osteoblasts, very highly enriched in one tissue: brown adipose (BioGPS); expressed sequence tag from Ewing's sarcoma (dbEST)
Published implications of cancer role: none
Nextbio functional genomic analysis: mRNA very highly upregulated in efferent duct in response to both estradiol and [...] in castrated mice
Intogen cancer genetics analysis: frequent amplification in human osteosarcoma (20 gain, 2 deletion in 38 tumors, p=1.238E-6; Ozaki et al.; PMID:12402305); evidence of both gene amplification (n=3) and loss (n=1), and mRNA up (n=1) and down (n=1) regulation in different cancers (incl. osteosarcoma)
Relative expression in canine osteosarcoma tumors: increased expression in Greyhounds (fold change 1.81), p=0.023

Haplotype region: Chr8:66558701-66647870
Gene: serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 4, SERPINA4
Gene/protein function (NCBI): negative regulation of endopeptidase activity; kallikrein inhibitor (SERPINA4 aka [...]); serine proteinase inhibitor (serpin) and a heparin-binding protein; localized in vascular smooth muscle cells and endothelial cells of blood vessels, suggesting that it may be involved in the regulation of vascular function; plays a role in neointima hyperplasia
Gene expression (osteoblasts and osteosarcoma): human mRNA expressed in several tissues, very highly enriched in adult liver (BioGPS, dbEST)
Published implications of cancer role: SERPINA4 is an anti-angiogenic therapeutic candidate for multiple cancers (PMID:12384424, 17714861, 17729417, 18089723, 18338836, 19709125, 20509975)
Nextbio functional genomic analysis: mRNA very highly upregulated in liver from sex hormone-treated gonadectomized mice
Intogen cancer genetics analysis: non-significant trend of gene amplification in human osteosarcoma (10 gain, 2 deletion in 38 tumors, p=0.125; Ozaki et al.; PMID:12402305); frequent gene loss in cancer of 4 other tissues
Relative expression in canine osteosarcoma tumors: non-significantly increased expression in Greyhounds (fold change 1.01), p=0.68
Notes: our Kaplan Meier analysis of 2324 breast cancer patients shows higher expression is associated with improved survival (HR=0.56 (0.48-0.64) logrankP=0, multiple testing corrected p value: 0) [KM Plotter]

Haplotype region: Chr8:66558701-66647870
Gene: serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 5; SERPINA5
Gene/protein function (NCBI): SERPINA5 (aka [...]) is a protein inhibitor of activated protein C; protein C is a vitamin K-dependent serine protease zymogen present in human plasma; activated protein C is a potent anticoagulant; is sometimes referred to as plasminogen activator inhibitor-3 (PAI3) because it also inhibits plasminogen activators; SERPINA5 has structural similarities to PAI1
Gene expression (osteoblasts and osteosarcoma): expressed in several tissues, highly enriched in testis
Published implications of cancer role: implicated to be a tumor suppressor in cancer of multiple tissues: breast (PMID:12176977), kidney (PMID:17450526), ovarian (PMID:21102419)
Intogen cancer genetics analysis: non-significant gene amplification in human osteosarcoma (7 gain, 4 deletion in 38 tumors, p=0.527; Ozaki et al.; PMID:12402305); frequent gene gain in cancer of pharynx; mRNA down regulation in cancer of testis
Relative expression in canine osteosarcoma tumors: non-significantly increased expression in Greyhounds (fold change 1.01), p=0.61

Haplotype region: Chr8:66558701-66647870
Gene: serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 3, SERPINA3
Gene/protein function (NCBI): SERPINA3 (aka alpha-1-antichymotrypsin; a plasma protease inhibitor) belongs to the class of serine protease inhibitors; it is reported to be synthesized in the liver; normal serum level is about one-tenth that of alpha-1-antitrypsin, with which it shares nucleic acid and protein sequence homology; both are major acute phase reactants and their concentrations in plasma increase in response to trauma, surgery, and infection
Gene expression (osteoblasts and osteosarcoma): mRNA expressed at moderate to high levels in a small number of tissues; highest expression is in liver, retinal ciliary body and pancreatic islet cells (BioGPS); only expressed sequence tag expression in bone is from chondrosarcoma (dbEST)
Intogen cancer genetics analysis: frequent gene amplification in human osteosarcoma (13 gain, 4 deletion in 38 tumors, p=0.012; Ozaki et al.; PMID:12402305); tissues from 4 different cancers each has one of the following: gene gain or loss and mRNA up or down regulation; sequencing has revealed mutations in cancer from 3 tissues (ranging in frequency from 2/2 to 4/447 tumors)
Relative expression in canine osteosarcoma tumors: non-significant increased expression in Greyhounds (fold change 1.02), p=0.57
QTL and OMIM: OMIM Disease-causing: Alpha-1-antichymotrypsin deficiency, Cerebrovascular disease, occlusive, OMIM: 107280

Haplotype region: Chr8:66558701-66647870
Gene: serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 13 (pseudogene), SERPINA13
Gene/protein function (NCBI): pseudogene
Intogen cancer genetics analysis: non-statistically significant trend of gene deletion in human osteosarcoma (3 gain, 9 deletion in 38 tumors, p=0.077; Ozaki et al.; PMID:12402305); gene loss in cancer from two other tissues and sequencing based evidence of mutation in a third cancer type
Relative expression in canine osteosarcoma tumors: no probe on array

Haplotype region: Chr10:5555944-6324274
Gene: leucine-rich repeats and immunoglobulin-like domains protein 3, LRIG3
Gene/protein function (NCBI): modulator of FGF and WNT signaling (PMID:18287203); paralog LRIG1 induces ligand-dependent ubiquitination and degradation of EGFR (PMID:15282549, 19216216)
Gene expression (osteoblasts and osteosarcoma): highly expressed in a small number of tissues, highest in large intestine, osteoblasts and epidermis (in that order) (BioGPS); highly expressed in vascular tissue (dbEST)
Published implications of cancer role: repression of LRIG3 in glioma cell line GL15 significantly increased in vitro invasion and adhesion activity and markedly promoted cell growth, and induced increment of the proportion of G0/G1 cells and inhibited apoptosis (PMID:19200647); perinuclear staining of LRIG3 associated with lower proliferation index in gliomas and was, in addition to tumor grade, an independent prognostic factor, and, within the groups of grade III and
Intogen cancer genetics analysis: non-significant gene amplification in human osteosarcoma (9 gain, 3 deletion in 38 tumors, p=0.225; Ozaki et al.; PMID:12402305); frequent gene gain in cancer of 5 different tissues at p<0.5 (2 at p<1E-4), with only exception to this trend being testis cancer (frequent deletion, p=0.019); significant mRNA upregulation in two cancers (p<0.05; one p<1E-4)
Relative expression in canine osteosarcoma tumors: non-significant decreased expression in Greyhounds (fold change -1.46), p=0.49
QTL and OMIM: Human Stature quantitative trait locus 3, OMIM: 606257; Mouse single gene-spanning QTL: Lmr5, leishmaniasis resistance 5, MGI:2656511; Rat QTL Bss5: Bone structure and strength QTL 5, RGD:1549840; Rat QTL Lnnr1: Liver neoplastic nodule remodeling QTL 1, RGD:631534; Rat QTL Mcs6: Mammary carcinoma susceptibility QTL 6, RGD:70190
Notes: there is a significant body of work on the role of LRIG1 in the tumor suppressive downregulation of EGFR signaling (PMID:21576352)

Haplotype region: Chr10:5555944-6324274
Gene: [solute carrier family 16, member 7 (monocarboxylic acid transporter 2), SLC16A7 (aka monocarboxylic acid transporter 2, MCT2) is 73,813 bp outside, but appears to have associated regulatory elements within candidate OSA haplotype]
Gene/protein function (NCBI): tissues with few or no mitochondria, such as erythrocytes and tumor cells, depend largely on glycolysis to generate ATP. The major end products of glycolysis, pyruvate and lactate, must be eliminated from these cells to enable continued glycolytic flux and prevent toxic effects. H+/monocarboxylate transporters (MCTs) mediate the transport of lactate and pyruvate. Human MCT2 has a high affinity for the transport of pyruvate (PMID:9786900)
Gene expression (osteoblasts and osteosarcoma): expressed at low levels in several to many tissues; no reported expression in osteoblasts or bone cancer in BioGPS, dbEST (but is expressed in osteosarcoma according to studies in Nextbio)
Published implications of cancer role: reduced expression in colorectal carcinoma (PMID:18188595); SLC16A7 is a potential tumor suppressor in human osteosarcoma: it is a validated target of mir-29b (PMID:22350417), which has reduced expression in osteosarcoma (PMID:19342382); there are >30 PubMed references linking miR-29b and diverse cancers
Nextbio functional genomic analysis: osteosarcoma samples from 28 patients at diagnosis were analyzed, differential gene expression between the following were significant: male vs female, 1.8-fold downregulated (0.0039)
Intogen cancer genetics analysis: non-significant gene amplification in human osteosarcoma (9 gain, 3 deletion in 38 tumors, p=0.225; Ozaki et al.; PMID:12402305); frequent gene gain in cancer of 2 different tissues; frequent mRNA upregulation in one cancer and downregulation in another
Relative expression in canine osteosarcoma tumors: decreased expression in Greyhounds (fold change -1.87), p=0.03
QTL and OMIM: Mouse single gene-spanning QTL Pbwg5 Description: postnatal body weight growth 5, MGI:3035962
Notes: SLC16A7, the next closest gene to LRIG3 haplotype, is of potential interest and its relevance is supported by a placental mammal-specific evolutionarily conserved element (canFam2 phastConsElements4way lod=176, range = chr10:6279670-6280025) that is in LRIG3 haplotype and appears to be evolutionarily conserved in a position much closer to SLC16A7 than LRIG3

Haplotype region: Chr12:56839735-57170298
Gene: mannosidase, endo-alpha, MANEA
Gene/protein function (NCBI): Cellular protein metabolic process, post-translational protein modification, protein N-linked glycosylation: N-glycosylation of proteins is initiated in the endoplasmic reticulum (ER) by the transfer of the preassembled oligosaccharide glucose-3-mannose-9-N-acetylglucosamine-2 from dolichyl pyrophosphate to acceptor sites on the target protein by an oligosaccharyltransferase complex; this core oligosaccharide is sequentially processed by several ER glycosidases and by an endomannosidase (E.C. 3.2.1.130), such as MANEA, in the Golgi; MANEA catalyzes the release of mono-, di-, and triglucosylmannose oligosaccharides by cleaving the alpha-1,2-mannosidic bond that links them to high-mannose
Gene expression (osteoblasts and osteosarcoma): most highly expressed in microglia, osteoblasts and embryonic fibroblasts; also expressed at low to moderate levels in several other cells (BioGPS, dbEST)
Published implications of cancer role: none
Nextbio functional genomic analysis: osteosarcoma samples from 28 patients at diagnosis were analyzed, differential gene expression between the following were significant: osteosarcoma subtypes, telangiectatic vs osteoblastic 3.05-fold upregulated (1.8E-11), fibroblastic vs osteoblastic 2.37-fold upregulated (0.0398), chondroblastic vs telangiectatic 4.04-fold down regulated (0.0338)
Intogen cancer genetics analysis: no gene deletion or amplification observed in 38 human osteosarcoma tumors (Ozaki et al.; PMID:12402305)
Relative expression in canine osteosarcoma tumors: non-significant decreased expression in Greyhounds (fold change -1.09), p=0.78
QTL and OMIM: Rat QTL Thshl2: Thyroid stimulating hormone level QTL 2, RGD:1331796
Notes: our Kaplan Meier analysis of 2324 breast cancer patients shows higher expression is associated with reduced survival (HR=1.4 (1.3-1.7) logrankP=8E-8, multiple testing corrected p value: 0.0324) [KM Plotter]

Haplotype region: Chr13:7469784-7513787
Gene: Sin3A-associated protein, 18kDa, SAP18
Gene/protein function (NCBI): component of the histone deacetylase complex, which includes SIN3, SAP30, HDAC1, HDAC2, RbAp46, RbAp48, and other polypeptides; directly interacts with SIN3 and enhances SIN3-mediated transcriptional repression when tethered to the promoter
Gene expression (osteoblasts and osteosarcoma): widely expressed at moderately high levels, most abundantly expressed in osteoblasts (BioGPS, dbEST)
Published implications of cancer role: none
Nextbio functional genomic analysis: mouse genetically engineered mouse model of osteosarcoma: primary osteosarcoma vs in vitro differentiated primary osteoblast, 2.32-fold down regulation (2.0E-6)
Intogen cancer genetics analysis: frequent amplification in human osteosarcoma (17 gain, 1 deletion in 38 tumors, p=1.131E-4; Ozaki et al.; PMID:12402305); gene deletion in 3 and amplification in 1 cancer; mRNA up regulation in 4 and down regulation in 5 cancers
Relative expression in canine osteosarcoma tumors: increased expression in Greyhounds (fold change 1.22), p=0.46

Haplotype region: Chr16:45476333-45615615
QTL and OMIM: Rat QTL Arunc1: Aerobic running capacity QTL 1, RGD:1298529; Rat QTL Arunc2: Aerobic running capacity QTL 2, RGD:1298527

Haplotype region: Chr16:45804993-46056184 (verified that the same 189 kb gap exists in canfam3 build)
Gene: tripartite motif family-like 1, TRIML1
Gene/protein function (NCBI): probable E3 ubiquitin-protein ligase; RING finger E3 ligase family
Gene expression (osteoblasts and osteosarcoma): very highly expressed in testis and embryonic stem cells and less so in few tissues: adrenal gland and placenta; no reported expression in osteoblasts and bone cancer in BioGPS or dbEST
Published implications of cancer role: none
Intogen cancer genetics analysis: frequent gene deletion in human osteosarcoma (1 gain, 13 deletion in 38 tumors, p=1.446E-3; Ozaki et al.; PMID:12402305); frequent deletion in two other cancers; sequencing-detected mutations in three cancers (ranging from 1/1 to 1/48 tumors)
Relative expression in canine osteosarcoma tumors: non-significant increased expression in Greyhounds in 4 probes (fold change range 1.00-1.02), p=0.19-0.98
Notes: Pre-implantation embryo-specific RING finger protein (Tian 2009)


Haplotype region: Chr16:45804993-46056184
Gene: tripartite motif family-like 2, TRIML2
Gene/protein function (NCBI): probable E3 ubiquitin-protein ligase; RING finger E3 ligase family
Gene expression (osteoblasts and osteosarcoma): expressed at low levels in small number of tissues; no reported expression in osteoblasts and bone cancer in BioGPS or dbEST
Published implications of cancer role: none
Intogen cancer genetics analysis: frequent amplification in human osteosarcoma (15 gain, 1 deletion in 38 tumors, p=1.395E-3; Ozaki et al.; PMID:12402305); frequent deletion in cancers from 8 tissues at p<1E-4 and 6 more gene deletion and 3 amplification at p<0.05; frequent mRNA downregulation in 2 cancers and upregulation in another
Relative expression in canine osteosarcoma tumors: non-significant increased expression in Greyhounds in 4 probes (fold change range 1.01), p=0.16
Notes: Intracellular, ligase activity, zinc ion binding

Haplotype region: Chr16:45804993-46056184
Gene: zinc finger protein 42 homolog, ZFP42
Gene/protein function (NCBI): probably involved in self-renewal property of ES cells (by similarity); may be involved in transcriptional regulation (by similarity)
Gene expression (osteoblasts and osteosarcoma): predominantly or specifically expressed in embryonic tissue, embryonic stem cells and placenta (BioGPS, dbEST)
Published implications of cancer role: decreased expression in renal cell carcinoma (PMID:16344273)
Intogen cancer genetics analysis: non-significant gene amplification in human osteosarcoma (7 gain, 3 deletion in 38 tumors, p=0.527; Ozaki et al.; PMID:12402305); gene amplification in 2 cancers and deletion in 2 cancers
Relative expression in canine osteosarcoma tumors: increased expression in Greyhounds (fold change 1.07), p=0.03 [no annotated probe; mapped human protein to canine chr16:46,053,060-46,054,573 which overlaps probe CfaAffx.11730.1.S1 at chr16:46053559-46054474]


Haplotype region: Chr22:58921599-59410700
Gene: family with sequence similarity 155, member A, FAM155A
Gene/protein function (NCBI): probable transmembrane protein of unknown function
Gene expression (osteoblasts and osteosarcoma): highly enriched in diverse brain regions and pituitary; also expressed in retina, adrenal gland and mast cells (BioGPS), and lung and prostate (dbEST)
Nextbio functional genomic analysis: comparison of gene expression in U2OS osteosarcoma cells overexpressing inducible estrogen receptors ERα or ERβ and then treated with or without the corresponding ligand: U2OS cells expressing ERβ treated with estradiol vs untreated show 1.65-fold upregulation (p=0.0155), and U2OS cells expressing estrogen receptor ERα treated with estradiol vs untreated show 1.61-fold upregulation (p=0.0112)
Intogen cancer genetics analysis: frequent amplification in human osteosarcoma (17 gain, 4 deletion in 38 tumors, p=1.131E-4; Ozaki et al.; PMID:12402305); frequent amplification in cancers from 4 tissues and deletions from 1; frequent mRNA upregulation in 2 cancers
Relative expression in canine osteosarcoma tumors: no probe on array
Notes: In differential gene expression studies, epithelium from invasive breast tumor grade 2 vs grade 1 has 12.8-fold downregulation (rank top 1%, p=0.0236) and epithelium from invasive breast tumor grade 1 vs normal (rank top 2%, p=0.0204) [Nextbio, NCBI Geo GSE14548]; our Kaplan Meier analysis of 2324 breast cancer patients shows higher expression is associated with improved survival (HR=0.68 (0.59-0.78) logrankP=2.5E-8, multiple testing corrected p value: 0.00997) [KM Plotter]

Chr23:17955842- RNA binding probable RNA- very highly downregulati mouse model of frequent gene non-significant Rat QTL Bmd3: Genomewide 17996347 motif, single binding protein that expressed in on of RBMS3 osteosarcoma: deletion in human increased Bone mineral association candidate stranded belongs to the c-myc dorsal root expression is primary osteosarcoma (4 expression in density QTL 3, for bisphosphonate- interacting gene single-strand ganglia and associated osteosarcoma vs gain, 12 deletion in Greyhounds in 2 RGD:1554321 related osteonecrosis protein 3, binding protein abundantly with poor in vitro 38 tumors, p=4.617E- probes, (fold change of the jaw, p<7E-8; RBMS3 family; localizes expressed in prognosis in differentiated 3; Ozaki et al., range 1.10-1.20), odds ratio 5.8 (3.1- mostly to the several tissues, esophageal primary osteoblast, PMID:12402305); p=0.43-0.74; 11.1) cytoplasm suggesting including squamous 3.08-fold frequent deletion in decreased (PMID:22267851) that it may be osteoblasts cell upregulation (1.5E- cancers from 11 expression in involved in a (BioGPS) carcinoma 10); comparison of tissues Greyhounds in 1 cytoplasmic function (PMID:21844 gene expression in probes (fold change such as controlling 183); gene U2OS -1.15), p=0.69 RNA metabolism, deletion osteosarcoma rather than reported in cells transcription neuroblastom overexpressing a inducible estrogen (PMID:18664 receptors ERα or 255) ERβ and then treated with or without the corresponding ligand: U2OS cells expressing estrogen receptor ERβ treated with estradiol vs untreated show

1.36-fold downregulation (p=0.0134), and U2OS cells expressing estrogen receptor ERα treated with estradiol vs untreated show 1.75-fold downregulation (p=0.0002) Continued Table 4.5 continued

Chr29:15643604- Na+/K+ Na+/K+ transporting expressed at low none mouse genetically no gene deletion or non-significant OMIM: 606789: Na+/K+ transporting 15674170 transporting ATPase interacting levels in several engineered mouse amplification increased Fetal hemoglobin ATPase interacting 3 ATPase transmembrane to many tissues; model of observed in 38 expression in quantitative trait interacting 3, protein; function no reported osteosarcoma: human Greyhound (fold locus 4 NKAIN3 unknown expression in primary osteosarcoma change 1.01), osteoblasts or osteosarcoma vs tumors (Ozaki et al. p=0.33. bone cancer in in vitro PMID:12402305); BioGPS, dbEST differentiated frequent gene (but is expressed primary amplification in in osteosarcoma osteoblasts, 1.2- cancer from 4 according to fold upregulation tissues and deletion studies in (p=0.043) in 1 cancer; frequent Nextbio) mRNA downregulation in 2 cancers and upregulation in 1

Chr29:35992994- cyclic probable cyclic predominantly none frequent gene non-significant OMIM: 606789: 36565747 verified nucleotide nucleotide binding expressed in deletion in human increased Fetal hemoglobin that 20Mb gap is binding protein; function testis, and less osteosarcoma (0 expression in quantitative trait present in canfam3 domain unknown so in spleen, gain, 22 deletion in Greyhound (fold locus 4; Rat QTL build containing 1, stomach, skin, 38 tumors, p=4.084E- change 1.02), Thshl2: Thyroid CNBD1 retina (BioGPS, 10; Ozaki et al., p=0.61 stimulating dbEST) PMID:12402305); hormone level QTL 133 frequent deletion in 2, RGD:1331796 cancers from 6 tissues

Continued Table 4.5 continued

Chr34:35098110- [zinc finger, unknown predominantly none frequent gene non-significant OMIM: 109200: 8 PhastCons 35170789 B-box expressed in deletion in human increased Alopecia, (candidate gene domain testis and less so osteosarcoma (2 expression in androgenetic, 1; regulatory) elements containing, in pituitary, lung, gain, 12 deletion in Greyhounds in 2 Rat QTL Bss11: >50 score, highest ZBBX 7390 connective tissue 38 tumors, p=4.617E- probes, (fold change Bone structure and 116; nearest gene is bp outside and uterus 3; Ozaki et al., range 1.01-1.02), strength QTL 11, ZBBX ; our Kaplan candidate PMID:12402305); p=0.48-0.68 RGD:1578648; Rat Meier analysis of 2324 OSA interval] frequent deletion in QTL Bmd2: Bone breast cancer patients cancers from 2 mineral density QTL shows higher tissues; mRNA 2, RGD:1554319 expression is downregulation in 3 associated with cancers improved survival (HR=0.67 (0.59-0.77) logrankP=1.3E-8, multiple testing corrected p value: 0.00504) [KM Plotter]

Chr37:23255842- Ikaros family IKZF2 (aka, highly enriched in Associated no significant pattern no probe on array strong evidence from 23798687 zinc finger 2 HELIOS , ZNFN1A2 ): T cells, mast with CD4 T of gene amplification multiple studies (Helios), members of this cells, thymus, cells (n=6) or deletion (Nextbio) shows that IKZF2 protein family (Ikaros, lymph nodes, differentiating (n=4) observed in 38 IKZF2 mRNA Aiolos and Helios) epidermis, to T helper 2 human expression levels are

134 are hematopoietic- thyroid, bladder, and follicular osteosarcoma associated with specific transcription mouth, pharynx, helper T cells tumors (Ozaki et al. FOXP3 expression: factors involved in cornea, stem in vivo (55); PMID:12402305); e.g., CD4+ T cells regulation of cells, myeloid Hodgkin and frequent from small intestinal lymphocyte progenitor cells non-Hodgkin amplification in lamina propria of development; IKZF2 and other cells lymphoma cancers from 4 wildtype mice, relative forms homo- or and tissues, (56); tissues and frequent expression of Foxp3- hetero-dimers with including low leukemogene mRNA GFP+ vs Foxp3-GFP- other Ikaros family levels in bone sis (57) downregulation in shows 18.8-fold members, and is (but no significant one cancer upregulation (rank top thought to function expression 1%, p=0.0374) (NCBI predominantly in detected in Geo GSE20366); this early hematopoietic osteoblasts or is of interest because development osteoclasts) there is a theory that (BioGPS, dbEST) immunosuppression by FOXP3-expressing Regulatory T cells (Tregs) may interfere Continued Table 4.5 continued

sperm cilia and flagella are widely expressed frequent relative gene frequent gene non-significant SPAG16 encodes 2 associated comprised of a (with testis- elevation of expression in deletion in human increased major proteins that antigen 16, microtubular specific mRNA human osteosarcoma (2 expression in associate with the SPAG16 backbone, the alternatively- expression in osteosarcoma gain, 16 deletion in Greyhounds in 4 axoneme of sperm tail axoneme, organized spliced isoform, multiple subtypes - 38 tumors, p=2.365E- probes, (fold change and the nucleus of by the basal body PMID:21150711), cancers fibroblastic vs 5; Ozaki et al., range 1.01-1.56), postmeiotic germ cells and surrounded by enriched in brain (PMID: osteoblastic 2.94- PMID:12402305); p=0.15-0.46 plasma membrane; and pineal, 21150711) fold upregulation frequent mRNA SPAG16 encodes 2 thyroid, prostate, (p=0.0328) upregulation in major proteins that pituitary, lung, (osteoblastic cancers of the brain associate with the adipose; robulstly expression at and bladder, and axoneme of sperm expressed in probe 219109_at = mutation detected in tail and the nucleus human 1813.8; NCBI Geo 1/1 ovarian tumor of postmeiotic germ osteosarcoma GSE14827) tested by sequencing cells, respectively (see "Nextbio..." (PMID:17699735) column; expressed in BioGPS

135 osteoblasts and " bone, but the expression microarray signal is barely above the median for all tissues; only bone-related expressed sequence tags are from chondrosarcoma (dbEST) Continued Table 4.5 continued

cluster of large intergenic non-coding RNAs (lincRNAs) and transcripts of uncertain coding potential (TUCPs) (human nomenclature): TCONS_00004996, TCONS_00003093, TCONS_00003095, TCONS_00003490, TCONS_00004569, TCONS_00004009, TCONS_00004997, TCONS_00004570; no probe on array; ERBB4 is 121.7 kb away (potentially relevant to osteosarcoma); lincRNA cluster in candidate haplotype.

CHAPTER 5: GENETIC AND MOLECULAR MECHANISMS OF A MENDELIAN TRAIT WITH EPIGENETIC READOUT: CANINE BRINDLE COAT PATTERN

*Rowell, J., Fiala, E.M., Zaldivar-Lopez, S., Marin, L.M., Fiala, E.M., Couto, C.G., Alvarez, C.E. (2012). Identifying Risk Loci for Canine Osteosarcoma. Submitted for publication.

Introduction

Recently, studies on the domesticated dog have answered several questions concerning selection under domestication. Dogs have many similarities to humans, including genetics (types and total levels of genetic variation) and certain disease predispositions [7]. Dogs also have some intriguing differences, such as reduced genetic heterogeneity within breeds (like that seen in isolated human populations of Iceland or some Mormon and Amish groups), but increased heterogeneity across breeds (three times higher than across human ethnic groups). This underlies many of the breed-to-breed differences in disease risk; other traits were likely introduced purposely by breeders (crossing carriers and selecting for transmission), e.g., short legs (chondrodysplasia) [34, 186].

Understanding such differences is key to understanding the role of selection under domestication [as postulated by Darwin; 187]. This knowledge will give further insight into evolutionary genetics and will provide the possibility of understanding complex genetic architecture that may not yet have been elucidated in human populations.

Canine genetics is vastly underappreciated as potentially the best mammalian model for determining the relative contributions of different types of genetic variation and molecular mechanisms [188]. For example, in dogs it has been possible to estimate the relative contributions of DNA copy number variation (CNV) or retroposon insertions to all natural inherited disease. In contrast, discovery of human disease-associated variation is often biased (i.e., it does not rely on genome-wide approaches such as high-throughput sequencing or array analysis), and inbred rodents do not represent natural disease in outbred populations. Dog breeds are thus powerful genetic models because of their i) remarkable spectrum of traits and ii) simplified genetic structure, which greatly facilitates and accelerates variation discovery.

Dissection of trait genetics, in turn, leads to new molecular understanding and translational models. In some cases, mutation discovery reveals extraordinary or novel molecular mechanisms that might otherwise not be decipherable or imaginable. For instance, Parker et al. showed that short legs in all twenty common short-legged breeds can be traced back to a single recent retroposition event of the retro-FGF4 gene [186]. We have identified the dog brindle coat color phenotype as having the potential to reveal novel genetic mechanisms that are likely to have broad biomedical implications (see below).

Background

Mammalian coat color has a long history of revealing previously unknown genetic mechanisms, such as X-inactivation [189]. Intriguingly, eumelaninic together with pheomelaninic coloration appears to be unique to dogs [179]. In most mammals, melanocytes synthesize red/yellow (pheomelaninic) or black/brown (eumelaninic) color that is governed by two genes: Agouti, which encodes a paracrine-signaling ligand secreted by cells adjacent to hair follicle melanocytes, and Melanocortin 1 receptor (MC1R), which encodes a seven-transmembrane receptor that, when active, causes melanocytes to produce eumelanin. Agouti coat color effects are mediated by competitive inhibition of alpha-melanocyte-stimulating hormone binding at MC1R. Thus, MC1R loss of function or Agouti gain of function results in production of pheomelanin (red/yellow), while MC1R gain of function or Agouti loss of function results in production of eumelanin (black/brown) [Table 5.1; 190]. However, it was recently revealed that the pigment type-switching mechanism in the dog has additional complexity that was previously unknown in any species, including the presence of a black coat color that is dominant to yellow [Dominant Black (KB); 179, 180].

Candille et al. showed that dominant black is the result of a mutation in the beta-defensin 3 gene, CBD103 [179]. Specifically, a three-base-pair deletion results in loss of a single glycine residue (“ΔG”). That variation has two effects: i) increased extracellular levels (mechanism unknown) and ii) increased affinity of CBD103 for MC1R. The dominant black effect is mediated by CBD103 inhibition of Agouti binding at MC1R.

[Intriguingly, this suggests the same gene may be responsible for red coat color in cattle, which maps to the same locus; 191.] Yet another unique aspect of this gene’s role in dogs is the coat coloring mechanism known as brindle. Brindle is a pattern of alternating black/brown and red/yellow coat color that forms an irregular pattern, typically a “v” over the dorsum and an “s” over the flanks and ventrum [see Fig. 5.1; 180]. The pattern, which resembles that observed in pigmentary mosaicism (aka Blaschko's lines) in humans, appears to similarly mark the migration routes of two melanocyte precursor cell populations that emanate from the neural crest and end in the skin. However, rather than resulting from mosaicism, in dogs it is a Mendelian trait. A linkage study first implicated chr16 as the locus for brindle coat color, and as part of this analysis Kerns et al. were able to define a dominance order of KBlack > KBrindle > KYellow (hereafter KB, Kbr, Ky) [180]. They postulated that brindle likely results from an unstable allele that switches from yellow to black through some epigenetic process. Chen et al. subsequently identified common structural variation at the CBD103 locus (directly affecting its copy number and having a breakpoint proximal to the gene) and proposed that a structural variant could be responsible for such a Mendelian trait with an epigenetic readout [85].
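The dominance order reported by Kerns et al. can be written down as a small lookup. The following is a minimal illustrative sketch, not part of the original analysis; the function name `k_locus_phenotype` is hypothetical, and the sketch deliberately ignores epistasis from MC1R and Agouti (for example, dogs that cannot produce eumelanin are yellow regardless of their K-locus genotype).

```python
# Dominance order at the K locus as reported in the text: KB > Kbr > Ky.
DOMINANCE = ["KB", "Kbr", "Ky"]

# Coat phenotype associated with the dominant allele of a genotype.
PHENOTYPE = {"KB": "black", "Kbr": "brindle", "Ky": "yellow"}


def k_locus_phenotype(allele_a: str, allele_b: str) -> str:
    """Return the expected coat phenotype for a diploid K-locus genotype.

    Hypothetical helper for illustration only; MC1R and Agouti
    genotypes are assumed not to mask the K locus.
    """
    for allele in (allele_a, allele_b):
        if allele not in DOMINANCE:
            raise ValueError(f"unknown K-locus allele: {allele}")
    # The allele listed earlier in DOMINANCE masks the other allele.
    top = min((allele_a, allele_b), key=DOMINANCE.index)
    return PHENOTYPE[top]


if __name__ == "__main__":
    for genotype in [("KB", "Kbr"), ("Kbr", "Ky"), ("Ky", "Ky")]:
        print(genotype, "->", k_locus_phenotype(*genotype))
```

Under this simplification, a black dog may carry any of three genotypes (KB/KB, KB/Kbr, KB/Ky), which is why coat color alone cannot resolve the genotype of black dogs.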

To verify ΔG as the causal mutation, Candille et al. generated a transgenic mouse model. While the ΔG mouse was black, unexpectedly 21/23 wildtype KY/KY mice also had black coat color. This suggests the possibility that the ΔG mutation is not necessarily the dominant black mechanism, but could instead be a marker of a CBD103 allele that is expressed at increased levels over wild type. In the same study, brindle coat color was mapped to a 1.85 Mb region overlapping CBD103. Here, we present evidence that brindle coat color is caused by a 76 kb copy number duplication on chr16. We show that the left breakpoint of this duplication occurs within a long interspersed nuclear element (LINE). We also demonstrate the presence of differential DNA methylation patterns at this locus in KB, Kbr, and Ky skin tissue, and show that this is associated with differential regulation of CBD103 mRNA expression in light and dark skin of brindle dogs. Finally, we present additional evidence suggesting that the brindle mechanism is essentially the same as that of X-inactivation.

Results

CNV Analysis

Previously, our lab reported the presence of a 611 kb CNV-rich region overlapping the KB locus at chr16:61,902,802-62,514,014 in 8 dogs analyzed using a Nimblegen 385k aCGH array [85]. Notably, this is a subtelomeric region (the canine chr16 sequence ends at chr16:62,570,175) and thus is highly enriched for low copy repeats (or segmental duplications) and structural polymorphisms. As a result, the genome assembly of this region includes multiple gaps. Subsequently, an improved CNV calling algorithm, segMNT, was developed, which called 30% more CNVs than the original (DNACopy) using the same dataset (Fig. 5.2). Interestingly, the CNV present at the subtelomeric locus on chr16 was now clearly resolved into multiple CNV regions. This algorithm indicated that two of the eight dogs with CNV at this locus had a relatively small CNV that overlapped the KB locus (CBD103), while the others appeared to have a shared segmental duplication of a large downstream sequence [this region matched the segmental duplications published previously by Akey; 192]. Although these two dogs were from different breeds (Akita and Bulldog), they happened to be the only two brindle dogs in the study. Southern blotting with a CBD103 probe confirmed that those two dogs and a third, black-coated dog were carriers of the CBD103-duplication allele [two heterozygotes and one homozygote; 85]. We confirmed the presence of a 76 kb CNV in brindle dogs using a larger cohort of 15 dogs on an ultra-high density 1 M oligonucleotide spot array CGH (Agilent; Fig. 5.3, see Methods).

Characterization of Brindle Genotype

We were most interested in the centromeric breakpoint because it is close to CBD103. To map the breakpoint, we designed a Southern blot analysis using two probes within an 8 kb region spanning from the last unaffected comparative genomic hybridization (CGH) probe to the first affected probe for the CNV in the Nimblegen data. For this analysis, we used genomic DNA isolated from the blood of American Kennel Club (AKC) registered dogs from several breeds and of Greyhounds registered for racing with the National Greyhound Association (NGA). We used the coat color listed on the registration, and included dogs whose coat color was listed as tan or fawn (Ky), any coat color with brindle (Kbr), and black (KB). Most of the resultant bands were not the predicted 6.6 kb size (based on the CanFam2 genome assembly; Fig. 5.4). We found that our two black dogs (which genotypically can be KB/KB, KB/Kbr, or KB/Ky) each had a single, differently sized band, of approximately 6.6 kb and 7 kb respectively. We also identified an 8 kb band in yellow-coated dogs (which are necessarily Ky/Ky or have an MC1R mutation that makes them unable to express eumelanin). Strikingly, for brindle we identified two band patterns: the first was a single 7.5 kb band, and the second had two bands, one matching the 7.5 kb band and the other matching the 8 kb band found in yellow-coated dogs. These results suggested four different alleles: two in black dogs, one for brindle, and one for yellow (Fig. 5.4). Thus, with this assay we were able to identify dogs homozygous and heterozygous for the brindle allele. We repeated this analysis with a second restriction enzyme and found similar results for the size of the bands (the black band was the smallest, and brindle and yellow dogs had a band that was ~1.5 kb larger than expected; Fig. 5.5). This suggests that the CanFam2 assembly of this region is incorrect (since the expected band size was based on a phenotypically apparent Ky Boxer, whereas our results showed one of the KB alleles to be the size indicated in CanFam2; see Fig. 5.5 A,B).
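The way band sizes were read as alleles can be sketched as a nearest-match lookup. This is an illustration under stated assumptions rather than the procedure used in the study: the band sizes are the approximate values quoted above, the labels "KB (allele 1)" and "KB (allele 2)" are placeholder names for the two black-associated bands, the `tolerance_kb` cutoff is hypothetical, and a single band is simply treated as a homozygote.

```python
# Approximate Southern band sizes (kb) quoted in the text for each allele class.
# The two KB labels are placeholders; the text does not name these alleles.
ALLELE_BANDS_KB = {
    "KB (allele 1)": 6.6,
    "KB (allele 2)": 7.0,
    "Kbr": 7.5,
    "Ky": 8.0,
}


def call_alleles(observed_bands_kb, tolerance_kb=0.2):
    """Assign each observed band to the nearest expected allele band.

    tolerance_kb is a hypothetical gel-resolution cutoff, not a value
    taken from the study. A single band is read as a homozygote.
    """
    calls = []
    for band in observed_bands_kb:
        allele, expected = min(
            ALLELE_BANDS_KB.items(), key=lambda item: abs(item[1] - band)
        )
        if abs(expected - band) > tolerance_kb:
            calls.append("unassigned")
        else:
            calls.append(allele)
    if len(calls) == 1:
        calls = calls * 2  # one band: both chromosomes carry the same allele
    return sorted(calls)


if __name__ == "__main__":
    print(call_alleles([7.5]))        # single brindle band
    print(call_alleles([7.5, 8.0]))   # brindle band plus yellow band
```

In this simplified reading, the single-band brindle pattern corresponds to Kbr/Kbr and the two-band pattern to Kbr/Ky, matching the interpretation given in the text.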

We then used overlapping PCR assays to scan the full 6.6 kb region between the restriction enzyme cleavage sites mentioned above. Based on the initial results, we were able to conduct this analysis on homozygous black, yellow and brindle dogs. This ensured that a positive PCR product for one allele did not mask the absence of a product for a second allele. This region is characterized by many highly repetitive elements (Fig. 5.6), so PCR amplicons were designed based on uniqueness of primers; these assays were positioned to generate overlapping products (Fig. 5.7A). We identified a discrepancy in band size (versus that predicted from the CanFam2 assembly) in a region corresponding to a Long Interspersed Element (LINE) in yellow and brindle. This 2.8 kb LINE fragment is immediately preceded by a Short Interspersed Element (SINE) fragment. We used a PCR assay that, due to the high number of repetitive elements in the region, could not be designed to be <4 kb in length and that encompassed both the SINE and the LINE. Surprisingly, we were unable to obtain a band from black-coated dogs, and identified a very strong band at 1.5 kb in yellow and brindle (2.5 kb smaller than predicted; Fig. 5.7B).

Interestingly, sequencing of this band in both directions yielded portions of the SINE, but no part of the LINE (despite the presence of unique primers downstream of the LINE and upstream of the SINE). This suggests it is a CBD103 locus-specific amplicon correctly priming on the SINE side, but with the primer on the LINE side priming an off-target site that is absent or more distant in black dogs. We next performed adaptor-mediated PCR walking with the Digestion, Ligation, Amplification methodology [DLA; 193]. We used 4 different restriction enzymes for adaptor ligation to “walk” from known unique sequence into unknown sequence. With DLA, we were only able to generate the same 1.5 kb band observed previously, which when sequenced matched the SINE but not the LINE (despite our restriction enzyme adaptor sites being present within the LINE according to the CanFam2 assembly; Fig. 5.8). These findings indicate differences between brindle/yellow and black alleles and suggest the possibility of sequence that is difficult to PCR-amplify or that differs from the CanFam2 assembly (e.g., spanning a small duplication, inversion or insertion). Although further studies are necessary to define the breakpoint and allele-specific variations at sequence resolution, the aCGH and Southern blotting results clearly define the small interval of interest (Fig. 5.9). Those results are also consistent with Kerns et al.’s mapping of brindle to this region, demonstrating that the brindle duplication is local and not a translocation to another chromosome.

With the left-sided breakpoint fine-mapped to within the LINE, we then sought to isolate the right-sided breakpoint. We first tested whether the CNV was a tandem duplication [segmental duplications are the source of the majority of full-length copy number polymorphic genes, with most of the variant genes organized as tandem duplications, i.e., the most common occurrence; 194]. Having had previous success with this simple approach, we attempted to PCR-amplify across a head (forward PCR primer)-to-tail (reverse PCR primer) junction to establish the breakpoint [85]. Using a long-range DNA polymerase (KOD XL, Novagen), we could not isolate a fragment for the right-sided breakpoint. Because this design automatically tests for a tail-to-tail or head-to-head orientation as well, this suggests that the combination of length and sequence content is not amenable to PCR amplification, or that there are sequence differences from the CanFam2 assembly that are critical to this assay.
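The logic of a junction-spanning PCR for a tandem duplication can be made concrete with a short calculation. The sketch below is illustrative only: the function name and all coordinates in the usage example are hypothetical, and it assumes a simple head-to-tail duplication with no inserted or deleted sequence at the junction, which the results above suggest may not hold at this locus.

```python
def tandem_junction_product(dup_start, dup_end, fwd_primer_pos, rev_primer_pos):
    """Expected amplicon length across a head-to-tail duplication junction.

    All arguments are reference coordinates (hypothetical in the example).
    The forward primer sits near the end of the duplicated segment and the
    reverse primer near its start, so on the unduplicated reference the two
    primers point away from each other and give no product. In a tandem
    duplication the end of copy 1 abuts the start of copy 2, which places
    the primers in a convergent orientation across the junction.
    """
    if not (dup_start <= rev_primer_pos < fwd_primer_pos <= dup_end):
        raise ValueError(
            "primers must lie inside the duplicated segment, with the "
            "reverse primer upstream of the forward primer"
        )
    tail_of_copy1 = dup_end - fwd_primer_pos    # forward primer to end of copy 1
    head_of_copy2 = rev_primer_pos - dup_start  # start of copy 2 to reverse primer
    return tail_of_copy1 + head_of_copy2


if __name__ == "__main__":
    # Hypothetical 76 kb duplication with primers 500 bp and 400 bp from its ends.
    print(tandem_junction_product(1_000, 77_000, 76_500, 1_400))
```

Because the expected product depends on both breakpoints, uncertainty in either one (here, the left breakpoint lies somewhere inside a repetitive LINE) directly translates into uncertainty in the amplicon length, which is one reason such an assay can fail.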

General LINE Analysis

The RepeatMasker track on the UCSC genome server (CanFam2) showed that the breakpoint site is somewhere within a ~2.7 kb LINE (non-LTR retrotransposon) fragment that lies ~2 kb from the 3’ terminus of CBD103 (Fig. 5.10A). We used EGPRED [a suite of gene prediction algorithms; 195] to scan the entire ~9 kb region between CBD103 and the preceding gene, beta-defensin 4 (DEFB4A), for gene coding potential. Several programs converged on exactly the same sequence as being part of a candidate gene. The predicted genic sequence is precisely the LINE element. Although there is a long uninterrupted open reading frame preserving a portion of the reverse transcriptase “ORF2” protein (692 amino acid ORF), it has the appearance of being an internal exon (predicted to splice to a terminal 126 bp/45 aa exon within the LINE, followed by a potential polyadenylation site ~350 bp downstream). This could be consistent with a recent transposition event that has since undergone structural variation, or with it being the 3’ terminus of a gene that begins more distantly than we have searched. However, it is also possible that the LINE has the full ORF of an active gene driven by a nearby promoter.
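The idea of a "long uninterrupted open reading frame" can be illustrated with a six-frame scan for the longest stop-free run of codons. This is a generic sketch and not the EGPRED method; the function names are hypothetical, and, as for an internal exon, no start codon is required.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_stop_free_run(seq, frame):
    """Longest run of consecutive non-stop codons in one reading frame."""
    best = run = 0
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            run = 0
        else:
            run += 1
            best = max(best, run)
    return best


def six_frame_scan(seq):
    """Map (strand, frame) to the longest stop-free run, in codons."""
    seq = seq.upper()
    results = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            results[(strand, frame)] = longest_stop_free_run(s, frame)
    return results


if __name__ == "__main__":
    toy = "TAAGCCGCCTAGGCC"  # toy sequence, not from the CBD103 locus
    for key, codons in six_frame_scan(toy).items():
        print(key, codons)
```

For scale, a 692-codon run corresponds to 2,076 bp, so an ORF of the reported length would occupy most of a ~2.7 kb LINE fragment.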

We screened the sequence between DEFB4A and the LINE for promoter potential using Promoter 2.0 [Polymerase II promoter prediction; 196] and found a single site (score 1.047, “highly likely prediction”) in the middle of the 185 bp B2 SINE fragment that lies 30 bp upstream of the LINE. Such a possibility is consistent with previously described RNA polymerase II promoter activity within B2 SINEs [which are derived from tRNA and generate short untranslated SINE transcripts by RNA polymerase III; 197, 198]. B2 SINE elements have recently been appreciated to regulate transcription in a developmental and tissue-specific way [198, 199]. For example, the mouse growth hormone locus has different expression profiles during development: the locus is initially silenced and then becomes transcriptionally active by day 17.5. It was discovered that a B2 SINE serves as a boundary element, with bidirectional transcription of the B2 SINE upstream of the growth hormone locus facilitating a change in structure from a repressed heterochromatic state to a permissive euchromatic state [Fig. 5.11; 198, 199]. This is reminiscent of the effect seen in brindle coat color.

Characterizing Expression

Next, we tested for differences in CBD103 expression between brindle and non-brindle dogs. We obtained 5 mm skin punches from black, yellow (breeds without a known e/e genotype for MC1R; see Table 5.2), and brindle dogs (including punches of both the dark and the light stripe), isolated RNA, and made first-strand cDNA from these samples. Using RPL13A as a reference gene [a highly stably expressed gene in canine whole skin; 200], we performed quantitative PCR (qPCR) analysis of 3 black, 3 brindle dark, and 3 brindle light skin samples (Fig. 5.12A & B). We found a significant increase in expression of CBD103 in dark brindle tissue compared to both light brindle and black tissue (p=0.008; Fig. 5.12C).
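Reference-gene normalization of qPCR data is commonly done with the 2^-ΔΔCt method. The text does not state which quantification method was used, so the sketch below is an assumption for illustration only: the function name is hypothetical, the Ct values in the usage example are invented, and the calculation assumes roughly 100% amplification efficiency for both the CBD103 and RPL13A assays.

```python
def relative_expression(ct_target, ct_reference, ct_target_cal, ct_reference_cal):
    """Fold change of a target gene by the 2^-ddCt method.

    ct_target / ct_reference: Ct of the target gene (e.g. CBD103) and the
    reference gene (e.g. RPL13A) in the sample of interest.
    ct_target_cal / ct_reference_cal: the same two Ct values in the
    calibrator sample. Assumes ~100% efficiency for both assays.
    """
    delta_ct_sample = ct_target - ct_reference
    delta_ct_calibrator = ct_target_cal - ct_reference_cal
    delta_delta_ct = delta_ct_sample - delta_ct_calibrator
    return 2.0 ** (-delta_delta_ct)


if __name__ == "__main__":
    # Invented Ct values: dark stripe as sample, light stripe as calibrator.
    fold = relative_expression(24.0, 18.0, 26.0, 18.0)
    print(f"CBD103 fold change, dark vs light stripe: {fold:.1f}")
```

With these invented values the target reaches threshold two cycles earlier in the sample than in the calibrator after normalization, which corresponds to a four-fold higher relative expression.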

CTCF Locus and Characterizing Methylation

It has previously been hypothesized that epigenetic regulation is involved in color type-switching and its maintenance in brindle [180]. One relevant sequence element that could be involved in such an effect is a CTCF binding site. CTCF sites are found throughout the genome and are known to have several roles in epigenetic regulation, including serving as insulators against the spread of epigenetic modifications and differentially regulating the active and inactive chromosomes in X-inactivation (see Discussion; Fig. 5.13). Using the CTCFBS prediction tool at the InsulatorDB database [http://insulatordb.uthsc.edu; 201], we identified a predicted CTCF site at chr16:61,896,184-61,896,203 with a very strong confidence score of 18.83 (>3 is considered evidence of a CTCF binding site; Fig. 5.14). The sequence, TTGCCACCAGGTGATGGTAA, is located 1.02 kb from the SINE (chr16:61,897,257-61,897,441) and 1.27 kb from the LINE (chr16:61,897,474-61,900,156). The predicted CTCF locus matches a consensus CTCF sequence from humans [202].
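The spatial relationship between the predicted CTCF site, the B2 SINE and the LINE follows directly from the CanFam2 coordinates quoted above. The sketch below is a simple check under the assumption that the coordinates are inclusive and on the same strand; the helper names are hypothetical. Small differences from the rounded distances in the text may reflect which endpoints are used for the measurement.

```python
# CanFam2 chr16 coordinates as quoted in the text (assumed inclusive).
CTCF_SITE = (61_896_184, 61_896_203)
B2_SINE = (61_897_257, 61_897_441)
LINE = (61_897_474, 61_900_156)
CTCF_SEQUENCE = "TTGCCACCAGGTGATGGTAA"


def length(interval):
    """Length in bp of an inclusive (start, end) interval."""
    return interval[1] - interval[0] + 1


def gap(upstream, downstream):
    """Distance in bp from the end of `upstream` to the start of `downstream`."""
    return downstream[0] - upstream[1]


if __name__ == "__main__":
    print("CTCF site length:", length(CTCF_SITE), "bp")
    print("B2 SINE length:", length(B2_SINE), "bp")
    print("LINE length:", length(LINE), "bp")
    print("CTCF site to SINE:", gap(CTCF_SITE, B2_SINE), "bp")
    print("CTCF site to LINE:", gap(CTCF_SITE, LINE), "bp")
    print("SINE to LINE:", gap(B2_SINE, LINE), "bp")
```

The interval lengths agree with the 20 bp site sequence, the 185 bp B2 SINE fragment and the ~2.7 kb LINE fragment described in the text.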

We then became interested in the effects this putative CTCF binding site could have on DNA methylation within the region. We isolated genomic DNA from black skin, yellow skin, and the dark and light stripes of brindle skin. We conducted bisulfite conversion and sequencing (see Methods). We designed PCR primers for amplicons upstream and downstream of both the CTCF site and CBD103 [http://bisearch.enzim.hu/?m=genompsearch; 203, 204]. We cloned the bisulfite-converted PCR product and sequenced the appropriately sized clones. Interestingly, we found differential patterns of methylation (Fig. 5.15). In light brindle and yellow skin, we found methylation after the CTCF site, whereas in dark brindle and black skin we found no DNA methylation of the CTCF locus.
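How methylation is read from a bisulfite-sequenced clone can be sketched as a per-CpG comparison between the reference and the converted read. This is a generic illustration, not the analysis pipeline used in the study: the function name is hypothetical, the example sequences are invented, and the sketch assumes a top-strand read aligned to the reference without insertions or deletions.

```python
def cpg_methylation_calls(reference, converted_read):
    """Call methylation at each reference CpG from an aligned bisulfite read.

    After bisulfite conversion and PCR, an unmethylated cytosine reads as T,
    whereas a methylated cytosine is protected and still reads as C.
    Returns a dict mapping the 0-based position of each CpG cytosine to
    'methylated', 'unmethylated' or 'ambiguous'.
    """
    if len(reference) != len(converted_read):
        raise ValueError("reference and read must be the same length")
    reference = reference.upper()
    converted_read = converted_read.upper()
    calls = {}
    for i in range(len(reference) - 1):
        if reference[i:i + 2] == "CG":
            base = converted_read[i]
            if base == "C":
                calls[i] = "methylated"
            elif base == "T":
                calls[i] = "unmethylated"
            else:
                calls[i] = "ambiguous"  # sequencing error or polymorphism
    return calls


if __name__ == "__main__":
    ref = "ACGTTCGA"    # invented reference with CpGs at positions 1 and 5
    clone = "ACGTTTGA"  # first CpG protected, second converted to T
    print(cpg_methylation_calls(ref, clone))
```

Repeating this call over several sequenced clones per sample gives the per-CpG pattern that is compared between skin types.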

Alternatively Spliced cDNA

Finally, we considered the possibility that the CTCF site may regulate the expression of an antisense transcript emanating from the SINE/LINE elements and downregulating CBD103, similarly to the mechanism of X-inactivation (Fig. 5.16). We designed primers to amplify the exons of CBD103 and downstream targets. Using cDNA, we were able to amplify this target only in the dark brindle stripe (it was absent in yellow, black, and light brindle stripe samples; Fig. 5.15). Sequencing of this band identified a spliced cDNA partially overlapping CBD103. These data suggest that an alternatively spliced transcript of CBD103 exists only in dark brindle.

Discussion

Brindle coat pattern has been hypothesized to be the product of a binary (yellow/black) type-switching allele that i) manifests no later than early neural crest differentiation of melanocyte lineage cells and ii) is generated and maintained epigenetically [180], but the exact mechanism has remained elusive. This is largely due to an abundance of highly repetitive elements in the locus, which has resulted in putative genome assembly errors and hindered determination of the genetic variation that differs between K alleles. Here, we present work that gives insight into the complex mechanism of brindle coat pattern in dogs (Fig. 5.16). CNVs have been reported to affect coloration before: in the coat color of Massese sheep [205], in dermal hyperpigmentation in chickens [206], in color sidedness in cattle (caused by a 492 kb translocation) [207], and in dominant white in pigs (caused by a ~450 kb duplication) [208]. Interestingly, this last group identified the mechanism that generated the duplication in dominant white pigs as unequal homologous recombination between two LINE elements flanking the receptor tyrosine kinase gene, KIT.

Coat variation in animals has revealed diverse genetic, biochemical and physiological mechanisms. For instance, the EM genotype, which causes a melanistic (black) mask in certain breeds, is caused by a valine substitution for methionine at MC1R amino acid 264 (M264V) [190]. Because of the order of dominance, a single EM allele will cause a melanistic mask on the face in an otherwise pheomelanin coat pattern on the body. Although the mechanism by which this occurs is unknown, it is suggested to be an interaction between MC1R (E), Agouti (A) and melanocyte-stimulating hormone; that is, the presence of the EM allele allows Agouti to bind proportionally to cause fawn pigment on the body and the melanocyte-stimulating hormone to bind on the face instead [190]. This suggests an underlying pattern that affects MC1R in certain regions of the body, rather than a specific mutation in the receptor [190]. Additional evidence for such a pattern was recently identified in a study of CBD103 in canines, which found that the two tissues with the highest expression were skin and, surprisingly, the tongue [209]. Melanocytes differentiate from the neural crest [NC; a transient population of cells that delaminates from the neural tube and migrates considerably throughout the embryo during vertebrate development; 210]. Interestingly, melanocytes of the head and neck can be generated directly from the neural crest, or indirectly through several different progenitor cell types, including nerve-derived Schwann cell precursors [211, 212]. In addition, cranial neural crest cells occupy the tongue buds before myogenic progenitors migrate into the tongue primordium, suggesting that cranial neural crest cells are involved in tongue muscle development. Further, the same signaling cascade is important for both melanocyte and tongue development [213, 214]. This suggests fundamental variances in the mechanisms involved in color differences.

The nature of the genetic mechanism underlying brindle is reminiscent of X-inactivation in mammals [reviewed in 215]. There, two non-coding genes, Tsix and Xist, act in cis and their expressed RNAs are localized to the nucleus. X-inactivation is mediated by the coating of one X chromosome with Xist RNA. After the initiation of X-inactivation, expression of Tsix (which is proximal to Xist, in an antisense position) is downregulated, creating a permissive state for Xist upregulation by activators. Tsix RNA associates with DNA methyltransferase 3A (DNMT3A), either directly or through accessory factors, and thereby stably silences the Xist promoter via DNA methylation and other changes in chromatin structure [216, 217]. The key mechanism that regulates this program is the binding of CTCF exclusively to the Tsix locus of the inactivated X chromosome. That is associated with increased expression of that copy of Tsix through the Xist locus, somehow silencing it. On the active X, the same CTCF site is methylated and Tsix is not expressed. The evidence we have accumulated suggests the same mechanism as is seen in random X-inactivation, except that both copies of a brindle homozygote are either activated or inactivated. Our working hypothesis is that CTCF regulates a promoter in the B2 SINE. If CTCF is bound, the B2 SINE activates expression of the antisense LINE transcript; this induces epigenetic silencing of CBD103 and results in stripes of yellow coat. Alternatively, if CTCF is not bound and the site is methylated, there is no B2 SINE-mediated expression of the antisense LINE element, and CBD103 is expressed and manifested as stripes of black coat.

The mechanism described here is of particular interest because it would be difficult to identify even if it were commonly associated with human complex diseases. The brindle locus could be mapped because brindle is a highly penetrant Mendelian trait. Brindle's outward epigenetic manifestation and our discovery of a brindle structural mutation provided us with the necessary clues to dissect the mechanism. These findings can be applied to identifying other instances of such regulation in health and disease through integrative genomics. It also seems likely that these phenomena can be exploited as genetic tools for diverse applications in gene therapy or transgenesis, and for screening of epigenetic therapeutics.

Figure 5.1: Coat Color in Greyhounds. (A) Light Brindle (Kbr/Ky, Kbr/Kbr) (B) Dark Brindle (Kbr/Ky, Kbr/Kbr) (C) Yellow (Ky/Ky) (D) Black (KB/Ky, KB/Kbr, KB/KB). Photos courtesy of www.greyhound-data.com; Tio_Loco-big; Salacres_Toffees-big; Sansone_Della_Caveja-big; Undertheradar-big.

Figure 5.2. New CNV calling algorithm identifies CNV overlapping the K locus in only brindle dogs. The two dogs with a CNV were both AKC-registered dogs with brindle listed in their coat pattern, from two different breeds: (A) Akita (B) Bulldog. (C) New partitioning of the CNV and segmental duplication in all 8 dogs. Note that the 6 dogs share a portion of the segmental duplication.

Figure 5.3. 1M probe oligonucleotide array confirms CNV in brindle dogs. The solid line at 0.00 represents two copies in the diploid genome. Values above the line represent CNV amplification, and values below represent CNV deletion. The red color corresponds to amplification. Note the presence of a CNV overlapping the K locus in this brindle dog, as well as a segmental duplication.

Figure 5.4. Southern blot of the left breakpoint with restriction enzyme 1. (A) A screenshot of the UCSC Genome Browser, with the segmented lines representing the restriction enzyme cleavage sites. The black box represents the probe placement (“blat sequence”), shown with the corresponding repetitive elements in the genome masked by RepeatMasker. (B) The Southern blot results.

Figure 5.5. Southern blot with restriction enzyme 2. (A) The predicted 4.4 kb band with probe placement in the UCSC Genome Browser. (B) Tasha, the Boxer whose DNA sequence was used for CanFam2. (C) Southern blot results for restriction enzyme 2.

Figure 5.6. Repetitive elements within the breakpoint region on chr16. This screenshot was taken from the UCSC Genome Browser. Note that the left panel identifies SINEs, LINEs, LTRs, and simple repeats.

Figure 5.7. PCR scanning of the region to narrow the breakpoint. (A) Blue bars represent PCRs completed in all coat types that gave the same result. The red bar represents a PCR product that was not the expected size. Segmented vertical lines represent the cleavage sites of both restriction enzymes used for this analysis. (B) PCR product predicted to be 4 kb, with bands of ~1.5 kb, and no band in black.

Figure 5.8. DLA results. (A) A screenshot of the UCSC Genome Browser, where the lime green bar indicates the sequencing results, which place the fragment over the SINE. (B) Sequencing from the DLA with a primer upstream of the SINE and downstream of the LINE. (C) Agarose gel image of the ~1.5 kb band.

Figure 5.9. UCSC Genome Browser with annotation of the CNV called using the Agilent 1M array in brindle dogs. Note that the breakpoint falls directly into the LINE.

Figure 5.10. A non-long terminal repeat retrotransposon. UTR, untranslated region; CC, coiled coil; RRM, RNA recognition motif; CTD, carboxyl-terminal domain; EN, endonuclease; RT, reverse transcriptase; C, cysteine-rich domain. Modified from [218].

Figure 5.11: A B2 SINE assists transcription during pituitary development in mouse as a boundary element. Bidirectional transcription of a B2 SINE upstream of the growth hormone locus facilitates a change in chromatin structure. Modified from [198].

Figure 5.12. Expression of CBD103. Brindle skin punches taken from the same dog: (A) Light brindle stripe (B) Dark brindle stripe. (C) cDNA was normalized to RPL13A and absolute values of mRNA copy number were determined by qPCR.

Figure 5.13: A model of a regulatable epigenetic switch created by CTCF and Tsix. Xi represents the inactive X chromosome. Xa is the active chromosome. Modified from [215].

Figure 5.14. Predicted CTCF Locus. (A) -61,896,184 from the UCSC Genome Browser is the predicted CTCF locus. (B) Consensus sequence modified from [202].

Figure 5.15: Bisulfite Methylation Results. (A) Each column corresponds to a CpG (3 total) overlapping the CTCF locus, with each row representing a clone. Here, we demonstrate that in Kyellow, each CpG is methylated; in KbrLight, 2 of the 3 CpGs are methylated (a single point mutation makes the center CpG unusable for determining methylation status). For KbrDark, we find methylation of the first CpG, but no methylation thereafter. (B) A direct comparison of the CTCF locus in light and dark brindle stripes from the same dog.

Figure 5.16: Spliced cDNA. (A) PCR results, with a band present only in the dark stripe of brindle. (B) Alignment with CBD103 using the UCSC Genome Browser “Blast” program.

Figure 5.17. A schematic overview of the region associated with brindle. A representation of the information generated in this dissertation as evidence suggesting a mechanism.

Table 5.1: Basic colors and associated alleles in dogs.

Basic Colors                                  Alleles          Effect
A (Agouti)                                    ay               Fawn/sable (cream to yellow to red with darker tips) (some solid black hairs intermingled amongst reddish hair in some breeds)
                                              aw               Wolf sable, wild-type color (many banded hairs: black-reddish-black)
                                              at               Black-and-tan or brown-and-tan
                                              a                Recessive black
B (brown) = tyrosinase related protein 1      B                Black eumelanin
                                              b (bs, bd, bc)   Brown eumelanin
E (extension) = melanocortin receptor 1       EM               Melanistic mask
                                              E                Eumelanin (black, brown, blue) can be produced
                                              e                Only phaeomelanin (red, yellow, cream) produced
K (dominant black) = CBD103                   KB               Black, brown, or blue (eumelanin pigmentation only)
                                              Kbr              Brindle (on body region that would be phaeomelanin pigmented otherwise)
                                              Ky               Expression of agouti alleles that express phaeomelanin possible

Table 5.2: Breeds with known e/e genotype (dogs that are e/e are red or yellow due to phaeomelanin production; this is the recessive genotype).

Beagle

Brittany Spaniel

Cardigan Welsh Corgi

Chinese Shar-Pei

Clumber Spaniel

Cocker Spaniel

Dachshund

English Setter

English Pointer

Flatcoated Retriever

French Bulldog

Golden Retriever

Irish Setter

Japanese Chin

Labrador Retriever

Poodle

Pomeranian

Portuguese Water Dog

Vizsla

Materials and Methods

Animals

The samples used in this study were collected under a larger study of breed variation and disease association in purebred dogs. Samples were collected in collaboration with The Ohio State University Veterinary Medical Hospital and Greyhound Health and Wellness Program. Interested owners were screened for inclusion; informed consent for blood collection was then obtained from the owner, and the sample was collected by a trained veterinary technician in one or two 7 mL BD lavender-top tubes. All samples were classified according to breed and known phenotypes. Dogs were selected for use in this study based on the coat color pattern listed on their registration, regardless of known disease status. Breeds were excluded from this study if they were known to carry e/e genotypes (that is, a recessive melanocortin receptor 1 genotype that makes the coat appear yellow/red even when the dog carries the underlying Kbr genotype). Coat colors were subsequently placed into the following phenotype categories: KB, Kbr, or Ky. All dogs used in further analysis were then genotyped and sequenced.

DNA/RNA Isolation

Blood: Genomic DNA (gDNA) was isolated from whole blood using the Puregene (Gentra) Genomic DNA Purification Kit, with an additional ethanol precipitation step. Samples were selected based on high-molecular-weight DNA (as determined by Nanodrop readings and agarose gel electrophoresis), confirmation of complete registration, and pedigree information.

Tissue: gDNA was isolated from 5 mm skin punches obtained by a licensed Doctor of Veterinary Medicine (DVM) from animals that were being euthanized for health reasons. We used the Puregene (Gentra) Genomic DNA Purification Kit, with an additional Proteinase K step followed by ethanol precipitation. Samples were selected based on high-molecular-weight DNA (as determined by Nanodrop readings and agarose gel electrophoresis), confirmation of complete registration, and pedigree information.

RNA was isolated from 5 mm skin punches; whenever possible, RNA was generated from the same subject used to generate gDNA (but from a different skin punch). We used the RNeasy Fibrous Tissue Mini Kit (Qiagen) with a tissue pulverizer to isolate RNA. When necessary, samples were placed in RNAlater-ICE frozen tissue transition solution. After RNA isolation, SuperScript III First-Strand Synthesis (Invitrogen) was used to generate cDNA.

Platform: CNV & SNP calling

Our initial CNV observation was made on the Nimblegen 385K array [85]. In that study, we identified a 611 kb CNV on chr16 that overlapped the known K locus [85]. Subsequent to this finding, we custom designed our own ultra-high resolution Canine CGH array (Agilent), using the Genotypic bioinformatics core to develop probes that covered the CanFam2 genome. To avoid cross-hybridization, each probe was aligned to the CanFam2 genome using BLAST; any probe that did not map uniquely was removed, except for probes in the targeted segmental duplications. This array includes 1 million oligonucleotide probes with an average spacing of <1.7 kb genome-wide. Additionally, our array includes known segmental duplications based on Akey's map of segmental duplications, with an average spacing per probe of <1.5 kb [192]. As is essential for an oligo array, this platform uses 50–75mers designed to have melting temperatures (Tm) within ±3 °C of each other. This platform also uses 3 µg of DNA without requiring complexity reduction or amplification (which generate incomplete sampling and signal noise, respectively). The advantage of the Agilent array over other commercially available arrays is its low background and correspondingly high signal-to-noise ratio. This ratio allows for adequate CNV detection at 3 probes or fewer on an Agilent 1M probe array, compared to the >10 probes needed for Nimblegen's 2.1M probe array [219]. All samples were two-color comparatively hybridized against the gDNA from a single reference dog, a Labrador Retriever. All samples were processed from whole blood to isolate gDNA as above. Hybridization to arrays was completed by The BioGenomics Core at The Research Institute at Nationwide Children's Hospital (TRINCH; Columbus, OH). The protocol for DNA digestion, labeling, purification and hybridization to the arrays followed the manufacturer's instructions.

CNV Call Validation

We validated a subset of the CNV calls made from the 1M Agilent array by Southern blotting. We selected 3 CNVs overlapping known genes (HMGCS2, ZFHX3, SOX9) and 1 reference gene [VEGF, as used previously in 85]. The exact methods used are as described in [220]. Briefly, probes were designed to target unique sequence in HindIII fragments that did not overlap predicted breakpoints, to generate fragments of different sizes to allow multiplex hybridizations, and to have similar melting temperatures so as to give comparable signal in multiplex format. Probes were generated by PCR, confirmed by agarose gel electrophoresis, purified using the PCR Purification kit (QIAGEN), and labeled with 32P by random priming. Signal was quantified by PhosphorImaging (Storm, GE Healthcare Life Sciences).

CNV Data Analysis

Microarray image files were quantified using the Agilent Feature Extraction software and then imported into Partek. We set the criteria for calling CNVs intentionally low to minimize false negatives at the cost of accepting some false positives. First, a filtering procedure was used to flag low-intensity features: probes with a combined Cy3 and Cy5 intensity more than 3 SD below the mean of the high-intensity mode were flagged and excluded from further analysis. Next, the remaining data were segmented using a circular binary segmentation method, with post-processing to ensure that each called region contained at least three probes at consecutive genomic coordinates with the same sign of deviation in the log2 ratio, and that its median log2 ratio exceeded 0.3 in absolute value. A subset of previously reported CNVs was manually confirmed within the dataset.
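The filtering and post-processing criteria described above can be illustrated with a short Python sketch. This is not the Partek implementation. It assumes that segmentation has already been run and has produced inclusive (start, end) probe-index ranges, that the mean and SD of the high-intensity mode were estimated elsewhere, and that "same sign" refers to a run of consecutive probes deviating in the same direction as the segment median. All function and variable names are hypothetical.

```python
import math
from statistics import median

MIN_PROBES = 3       # minimum run of consecutive same-sign probes
MIN_ABS_LOG2 = 0.3   # median |log2 ratio| must exceed this value


def expected_log2_ratio(copies, reference_copies=2):
    """Idealized log2 ratio for a test copy number versus a diploid reference."""
    return math.log2(copies / reference_copies)


def flag_low_intensity(combined_intensity, mode_mean, mode_sd, n_sd=3.0):
    """Flag probes whose combined Cy3 + Cy5 intensity lies more than n_sd
    standard deviations below the mean of the high-intensity mode."""
    cutoff = mode_mean - n_sd * mode_sd
    return [value < cutoff for value in combined_intensity]


def longest_run_with_sign(values, positive):
    """Length of the longest run of consecutive probes deviating in one direction."""
    best = run = 0
    for v in values:
        if v != 0 and (v > 0) == positive:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best


def passes_call_criteria(values):
    """Apply the post-segmentation criteria to one segment's log2 ratios."""
    if not values:
        return False
    med = median(values)
    if abs(med) <= MIN_ABS_LOG2:
        return False
    return longest_run_with_sign(values, positive=med > 0) >= MIN_PROBES


def call_cnvs(log2_ratios, segments):
    """Return (start, end, state) for every segment that passes the criteria."""
    calls = []
    for start, end in segments:
        values = log2_ratios[start:end + 1]
        if passes_call_criteria(values):
            state = "gain" if median(values) > 0 else "loss"
            calls.append((start, end, state))
    return calls
```

Against a two-copy reference, a single-copy loss ideally gives log2(1/2) = -1 and a single-copy gain gives log2(3/2), roughly 0.58, so a 0.3 threshold on the median sits below both idealized values. This is consistent with the stated intent of keeping the calling criteria permissive.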

Methylation Analysis

We used the EZ DNA Methylation-Gold kit (Zymo Research) to bisulfite convert gDNA according to the manufacturer's protocol. We then PCR amplified the converted gDNA with primers designed for bisulfite-converted gDNA to result in amplicons of ~[...] bp with [...] CpG islands within the amplicon, using the online database Bisearch [http://bisearch.enzim.hu/?m=genompsearch; 203, 204]. We then used the CloneJET PCR cloning kit (Fermentas, Thermo Scientific) with Z-competent DH5α E. coli cells (Zymo Research). After colonies were identified by PCR screening, the PCR product was purified (QIAGEN) and then sequenced at the DNA sequencing core (Eurofins, MWG, Operon). Sequences were then analyzed and figures created using the Bisulfite Sequencing DNA Methylation Analysis (BISMA) tool, available from Bisulfite sequencing Data Presentation and Compilation (BDPC) [http://biochem.jacobs-university.de/BDPC; 221, 222, 223].
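The per-clone CpG scoring behind displays such as Figure 5.15 can be illustrated with a minimal Python sketch. It is not the BISMA implementation. It assumes each clone read has already been aligned to the unconverted reference without gaps and in the same strand orientation; the function names and toy sequences are hypothetical. The logic rests on the standard interpretation of bisulfite data: a CpG cytosine that still reads as C was protected (methylated), one that reads as T was converted (unmethylated), and a site disrupted by a point mutation is left uncalled.

```python
def cpg_positions(reference):
    """0-based indices of the C in every CpG dinucleotide of the reference."""
    ref = reference.upper()
    return [i for i in range(len(ref) - 1) if ref[i:i + 2] == "CG"]


def score_clone(reference, clone):
    """Return one call per reference CpG: True (methylated), False
    (unmethylated), or None (uncallable, e.g. because of a point mutation)."""
    if len(reference) != len(clone):
        raise ValueError("this sketch assumes a gap-free, equal-length alignment")
    read = clone.upper()
    calls = []
    for i in cpg_positions(reference):
        if read[i + 1] != "G":
            calls.append(None)    # the G of the CpG is mutated
        elif read[i] == "C":
            calls.append(True)    # protected from conversion
        elif read[i] == "T":
            calls.append(False)   # converted by bisulfite
        else:
            calls.append(None)    # unexpected base at the cytosine
    return calls


def methylation_fraction(reference, clones):
    """Fraction of callable clones methylated at each CpG (None if no clone is callable)."""
    per_clone = [score_clone(reference, c) for c in clones]
    fractions = []
    for site in range(len(cpg_positions(reference))):
        called = [calls[site] for calls in per_clone if calls[site] is not None]
        fractions.append(sum(called) / len(called) if called else None)
    return fractions
```

Each row returned by `score_clone` corresponds to one clone and each entry to one CpG, which matches the clone-by-CpG layout described for the bisulfite figure.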

Polymerase Chain Reaction (PCR) Methods

PCR was carried out per the JumpStart REDTaq DNA Polymerase protocol (Sigma-Aldrich), with optimization for each primer pair. PCR primers were custom designed using Oligo Calc nearest-neighbor-based melting temperature (Tm) calculations [224], and their uniqueness was then confirmed using the UCSC Genome Browser of CanFam2 [24, 225]. Oligos were generated by Integrated DNA Technologies (IDT; http://www.idtdna.com/site). All sequencing quality control was performed by the DNA sequencing core (Eurofins, MWG, Operon). For the Digestion, Ligation, Amplification (DLA) PCR walking method, we followed the protocol in [193]. We generated four unique random adaptors not present within the genome, and then paired these with rare-cutter restriction enzymes. The final adaptor-ligated gDNA was then combined with unique primers from the locus of interest in two reactions: the first an amplification with an external primer pair, followed by an internal PCR of the same locus.

CHAPTER 6: CONCLUSION

The amount of identified genetic variation that explains common complex diseases or traits to date is relatively small. For instance, while human height is ~80% heritable [226-228], the majority of variation remains unaccounted for. One study found 20 variants explaining 3% of human height, while another found 180 loci that explained 10% of human height [229]. More recently, Yang et al. used a statistical model to identify 249,831 SNPs associated with human height that explain 45% of the variation. Yet this is consistent with the proportion of heritability explained by genomewide significant SNPs in a number of diseases: schizophrenia, 1%; type 2 diabetes, 5-10%; breast cancer, 8% [230]. This picture also appears to be consistent in other species, including Mus musculus (mouse), Drosophila melanogaster (fruit fly), and Zea mays (maize) [231, 232]. However, one species demonstrates a sharp contrast: the domesticated dog. For 55 morphological traits measured in 915 dogs from 80 breeds, 3 or fewer loci explained 67% of the variation within those traits [34].

This dissertation project has elucidated the role of genetic variation in the domesticated dog as a model for human disease. Based on what is presented in this document, the next steps will include the following:

(1) Additional follow-up of the statistical boundaries and optimization of the GIA methodology will be necessary. While this was outside the scope of this dissertation, additional measurements, such as the false positive and false negative rates under different conditions (varying the minor allele frequency, the percentage of the population affected, etc.), are necessary to fully understand the constraints on the application of the GIA.

(2) Follow-up on the loci associated with osteosarcoma. The two priority loci would be Chr34 and Chr10. We were able to validate these SNPs in a separate population, but further analysis is needed to determine which specific variants and mechanisms make Greyhounds more susceptible to the development of OSA. This can be done by obtaining blood samples from closely related breeds that have a significant prevalence of OSA and are likely to share the same risk loci, such as Irish Wolfhounds. As a first step, it is possible to determine which risk haplotypes are associated with OSA in different Sighthounds. A second approach that is likely to be undertaken in the near future is to identify all genetic variation that is uniquely present in risk haplotypes, such as those where LRIG3 and ZBBX are present. This has been difficult to do in the past, but can now be done with relative ease by sequence capture and high-throughput sequencing for unique or low copy repeat variation, or by array CGH for structural variation.

(3) Finally, significant work remains to fully elucidate the mechanism behind brindle coat color. The current findings suggest first establishing the specific alleles for brindle, yellow, and black coat colors for the 4 kb fragment overlapping the LINE. We also need to analyze the capability of the SINE as a promoter element to drive the expression of the LINE, the true composition of the LINE, and histone modifications that differ between coat colors. Ultimately, transgenic mouse models and mutagenesis studies are likely to be required to conclusively and comprehensively dissect the brindle mechanism.

Taken together, this dissertation has established 1) the utility of the dog model for the study of human disease, 2) the theoretical and empirical basis for using GIA as a method for genomewide genetic analysis, 3) germline loci associated with the risk for development of OSA, and 4) mechanistic clues into a highly penetrant Mendelian trait that may provide downstream clues to understanding human disease.

References

1. Amberger, J., C. Bocchini, and A. Hamosh, A new face and new challenges for Online Mendelian Inheritance in Man (OMIM(R)). Hum Mutat, 2011. 32(5): p. 564-7.
2. Online Mendelian Inheritance in Man, OMIM. McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University (Baltimore, MD), {date}. World Wide Web URL: http://omim.org/.
3. Chakravarti, A. and A. Kapoor, Genetics. Mendelian puzzles. Science, 2012. 335(6071): p. 930-1.
4. Lee, J.H., et al., Evolutionarily assembled cis-regulatory module at a human ciliopathy locus. Science, 2012. 335(6071): p. 966-9.
5. Emison, E.S., et al., Differential contributions of rare and common, coding and noncoding Ret mutations to multifactorial Hirschsprung disease liability. Am J Hum Genet, 2010. 87(1): p. 60-74.
6. Boyko, A.R., The domestic dog: man's best friend in the genomic era. Genome Biol, 2011. 12(2): p. 216.
7. Rowell, J.L., D.O. McCarthy, and C.E. Alvarez, Dog models of naturally occurring cancer. Trends Mol Med, 2011. 17(7): p. 380-8.
8. vonHoldt, B.M., et al., A genome-wide perspective on the evolutionary history of enigmatic wolf-like canids. Genome Res, 2011. 21(8): p. 1294-305.
9. Ovodov, N.D., et al., A 33,000-year-old incipient dog from the Altai Mountains of Siberia: evidence of the earliest domestication disrupted by the Last Glacial Maximum. PLoS One, 2011. 6(7): p. e22821.
10. Vonholdt, B.M., et al., Genome-wide SNP and haplotype analyses reveal a rich history underlying dog domestication. Nature, 2010. 464(7290): p. 898-902.
11. Savolainen, P., et al., Genetic evidence for an East Asian origin of domestic dogs. Science, 2002. 298(5598): p. 1610-3.
12. Wilcox, B. and C. Walkowicz, Atlas of dog breeds of the world. 5th ed. 1995, Neptune City, NJ: TFH Publications; Lanham, MD: distributed in the U.S. to the bookstore and library trade by National Book Network. 912 p.
13. The complete dog book. 20th ed., ed. American Kennel Club. 2006, New York: Ballantine Books. xxi, 858 p.
14. Lander, E.S. and N.J. Schork, Genetic dissection of complex traits. Science, 1994. 265(5181): p. 2037-48.

15. Strauch, K., et al., How to model a complex trait. 1. General considerations and suggestions. Hum Hered, 2003. 55(4): p. 202-10.
16. Karlsson, E.K. and K. Lindblad-Toh, Leader of the pack: gene mapping in dogs and other model organisms. Nat Rev Genet, 2008. 9(9): p. 713-25.
17. Manolio, T.A., et al., Finding the missing heritability of complex diseases. Nature, 2009. 461(7265): p. 747-53.
18. Gondo, Y., et al., Next-generation gene targeting in the mouse for functional genomics. BMB Rep, 2009. 42(6): p. 315-23.
19. Paoloni, M. and C. Khanna, Translation of new cancer treatments from pet dogs to humans. Nat Rev Cancer, 2008. 8(2): p. 147-56.
20. (2005) American Pet Products Manufacturers Association (APPMA) Report.
21. Twigger, S.N., et al., The Rat Genome Database, update 2007--easing the path from disease to data and back again. Nucleic Acids Res, 2007. 35(Database issue): p. D658-62.
22. Patterson, D.F., Companion animal medicine in the age of medical genetics. J Vet Intern Med, 2000. 14(1): p. 1-9.
23. Starkey, M.P., et al., Dogs really are man's best friend--canine genomics has applications in veterinary and human medicine! Brief Funct Genomic Proteomic, 2005. 4(2): p. 112-28.
24. Lindblad-Toh, K., et al., Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature, 2005. 438(7069): p. 803-19.
25. Khanna, C., et al., The dog as a cancer model. Nat Biotechnol, 2006. 24(9): p. 1065-6.
26. Sargan, D.R., IDID: inherited diseases in dogs: web-based information for canine inherited disease genetics. Mamm Genome, 2004. 15(6): p. 503-6.
27. Parker, H.G., A.L. Shearin, and E.A. Ostrander, Man's best friend becomes biology's best in show: genome analyses in the domestic dog. Annu Rev Genet, 2010. 44: p. 309-36.
28. Ostrander, E.A., F. Galibert, and D.F. Patterson, Canine genetics comes of age. Trends Genet, 2000. 16(3): p. 117-24.
29. Cummings, B.J., et al., The canine as an animal model of human aging and dementia. Neurobiol Aging, 1996. 17(2): p. 259-68.
30. Bonnett, B.N. and A. Egenvall, Age patterns of disease and death in insured Swedish dogs, cats and horses. J Comp Pathol, 2010. 142 Suppl 1: p. S33-8.
31. Germonpré, M., et al., Fossil dogs and wolves from Palaeolithic sites in Belgium, the Ukraine and Russia: osteometry, ancient DNA and stable isotopes. Journal of Archaeological Science, 2009. 36(2): p. 473-490.
32. Drogemuller, C., et al., A deletion in the N-myc downstream regulated gene 1 (NDRG1) gene in Greyhounds with polyneuropathy. PLoS One, 2010. 5(6): p. e11258.
33. Lango Allen, H., et al., Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature, 2010. 467(7317): p. 832-8.
34. Boyko, A.R., et al., A simple genetic architecture underlies morphological variation in dogs. PLoS Biol, 2010. 8(8): p. e1000451.
35. Tamburini, B.A., et al., Gene expression profiles of sporadic canine hemangiosarcoma are uniquely associated with breed. PLoS One, 2009. 4(5): p. e5549.
36. Tang, J., et al., Copy number abnormalities in sporadic canine colorectal cancers. Genome Res.
37. Breen, M., Update on genomics in veterinary oncology. Top Companion Anim Med, 2009. 24(3): p. 113-21.
38. Krikelis, D. and I. Judson, Role of chemotherapy in the management of soft tissue sarcomas. Expert Rev Anticancer Ther. 10(2): p. 249-60.
39. Guillou, L. and A. Aurias, Soft tissue sarcomas with complex genomic profiles. Virchows Arch. 456(2): p. 201-17.
40. Mertens, F., I. Panagopoulos, and N. Mandahl, Genomic characteristics of soft tissue sarcomas. Virchows Arch, 2010. 456(2): p. 129-39.
41. Cohen, S.M., et al., Hemangiosarcoma in rodents: mode-of-action evaluation and human relevance. Toxicol Sci, 2009. 111(1): p. 4-18.
42. Aguirre-Hernandez, J., et al., Disruption of chromosome 11 in canine fibrosarcomas highlights an unusual variability of CDKN2B in dogs. BMC Vet Res, 2009. 5: p. 27.
43. Sargan, D.R., et al., Chromosome rearrangements in canine fibrosarcomas. J Hered, 2005. 96(7): p. 766-73.
44. Modiano, J., Canine Hemangiosarcoma - The Road from Despair to Hope. National Canine Cancer Foundation, 2008.
45. Shearin, A.L. and E.A. Ostrander, Leading the way: canine models of genomics and disease. Dis Model Mech, 2010. 3(1-2): p. 27-34.
46. Chao, J., W.A. Chow, and G. Somlo, Novel targeted therapies in the treatment of soft-tissue sarcomas. Expert Rev Anticancer Ther, 2010. 10(8): p. 1303-11.
47. Mialou, V., et al., Metastatic osteosarcoma at diagnosis: prognostic factors and long-term outcome--the French pediatric experience. Cancer, 2005. 104(5): p. 1100-9.
48. Mankin, H.J., et al., Survival data for 648 patients with osteosarcoma treated at one institution. Clin Orthop Relat Res, 2004(429): p. 286-91.
49. Withrow, S.J. and D.M. Vail, eds., Withrow & MacEwen's small animal clinical oncology. 4th ed. 2007, St. Louis, MO: Saunders Elsevier. xvii, 846 p.
50. Withrow, S.J. and E.G. MacEwen, Small animal clinical oncology. 3rd ed. 2001, Philadelphia: W. B. Saunders. xvii, 718 p.
51. Mirabello, L., R.J. Troisi, and S.A. Savage, Osteosarcoma incidence and survival rates from 1973 to 2004: data from the Surveillance, Epidemiology, and End Results Program. Cancer, 2009. 115(7): p. 1531-43.
52. Harari, J., The prevalence of and risk factors for canine appendicular osteosarcoma (study recaps and comments). Veterinary Medicine, 2008. 103(2): p. 79(1).
53. Ru, G., B. Terracini, and L.T. Glickman, Host related risk factors for canine osteosarcoma. Vet J, 1998. 156(1): p. 31-9.
54. Messerschmitt, P.J., et al., Osteosarcoma. J Am Acad Orthop Surg, 2009. 17(8): p. 515-27.
55. Withrow, S.J. and R.M. Wilkins, Cross talk from pets to people: translational osteosarcoma treatments. ILAR J, 2010. 51(3): p. 208-13.
56. LaRue, S.M., et al., Limb-sparing treatment for osteosarcoma in dogs. J Am Vet Med Assoc, 1989. 195(12): p. 1734-44.
57. Mueller, F., B. Fuchs, and B. Kaser-Hotz, Comparative biology of human and canine osteosarcoma. Anticancer Res, 2007. 27(1A): p. 155-64.
58. Kirpensteijn, J., et al., TP53 gene mutations in canine osteosarcoma. Vet Surg, 2008. 37(5): p. 454-60.
59. Levine, R.A. and M.A. Fleischli, Inactivation of p53 and retinoblastoma family pathways in canine osteosarcoma cell lines. Vet Pathol, 2000. 37(1): p. 54-61.
60. Nasir, L., et al., Nucleotide sequence of a highly conserved region of the canine p53 tumour suppressor gene. DNA Seq, 1997. 8(1-2): p. 83-6.
61. Selvarajah, G.T., et al., Gene expression profiling of canine osteosarcoma reveals genes associated with short and long survival times. Mol Cancer, 2009. 8: p. 72.
62. Fieten, H., et al., Expression of hepatocyte growth factor and the proto-oncogenic receptor c-Met in canine osteosarcoma. Vet Pathol, 2009. 46(5): p. 869-77.
63. Paoloni, M., et al., Canine tumor cross-species genomics uncovers targets linked to osteosarcoma progression. BMC Genomics, 2009. 10: p. 625.
64. O'Donoghue, L.E., et al., Expression profiling in canine osteosarcoma: identification of biomarkers and pathways associated with outcome. BMC Cancer, 2010. 10: p. 506.

65. Thomas, R., et al., Influence of genetic background on tumor karyotypes: evidence for breed-associated cytogenetic aberrations in canine appendicular osteosarcoma. Chromosome Res, 2009. 17(3): p. 365-77.
66. Entz-Werle, N., et al., KIT gene in pediatric osteosarcomas: could it be a new therapeutic target? Int J Cancer, 2007. 120(11): p. 2510-6.
67. National Cancer Institute, Cancer Trends Progress Report-2009/2010 Update. http://progressreport.cancer.gov/doc_detail.asp?pid=1&did=2009&chid=93&coid=920&mid=#measuring, 2010.
68. Hansen, K. and C. Khanna, Spontaneous and genetically engineered animal models; use in preclinical cancer drug development. Eur J Cancer, 2004. 40(6): p. 858-80.
69. Vail, D.M. and E.G. MacEwen, Spontaneously occurring tumors of companion animals as models for human cancer. Cancer Invest, 2000. 18(8): p. 781-92.
70. Hahn, K.A., et al., Naturally occurring tumors in dogs as comparative models for cancer therapy research. In Vivo, 1994. 8(1): p. 133-43.
71. Breen, M., The Prognostic Significance of Chromosome Aneuploidy in Canine Lymphoma. http://www.akcchf.org/pdfs/2009FundingRequest.pdf, 2009.
72. Modiano, J.F., et al., Distinct B-cell and T-cell lymphoproliferative disease prevalence among dog breeds indicates heritable risk. Cancer Res, 2005. 65(13): p. 5654-61.
73. Gamlem, H., K. Nordstoga, and E. Glattre, Canine neoplasia--introductory paper. APMIS Suppl, 2008(125): p. 5-18.
74. Villamil, J.A., et al., Hormonal and sex impact on the epidemiology of canine lymphoma. J Cancer Epidemiol, 2009. 2009: p. 591753.
75. Breen, M. and J.F. Modiano, Evolutionarily conserved cytogenetic changes in hematological malignancies of dogs and humans--man and his best friend share more than companionship. Chromosome Res, 2008. 16(1): p. 145-54.
76. Honigberg, L.A., et al., The Bruton tyrosine kinase inhibitor PCI-32765 blocks B-cell activation and is efficacious in models of autoimmune disease and B-cell malignancy. Proc Natl Acad Sci U S A, 2010. 107(29): p. 13075-80.
77. Stein, R., et al., Evaluation of anti-human leukocyte antigen-DR monoclonal antibody therapy in spontaneous canine lymphoma. Leuk Lymphoma, 2010.
78. Khanna, C., et al., Guiding the optimal translation of new cancer treatments from canine to human cancer patients. Clin Cancer Res, 2009. 15(18): p. 5671-7.
79. Wang, Z., et al., Gene therapy in large animal models of muscular dystrophy. ILAR J, 2009. 50(2): p. 187-98.
80. Yokota, T., et al., Efficacy of systemic morpholino exon-skipping in Duchenne dystrophy dogs. Ann Neurol, 2009. 65(6): p. 667-76.
81. Elgier, A.M., et al., Communication between domestic dogs (Canis familiaris) and humans: dogs are good learners. Behav Processes, 2009. 81(3): p. 402-8.
82. Morell, V., Animal behavior. Going to the dogs. Science, 2009. 325(5944): p. 1062-5.
83. Cotman, C.W. and E. Head, The canine (dog) model of human aging and disease: dietary, environmental and immunotherapy approaches. J Alzheimers Dis, 2008. 15(4): p. 685-707.
84. Liao, A.T., M. McMahon, and C.A. London, Identification of a novel germline MET mutation in dogs. Anim Genet, 2006. 37(3): p. 248-52.
85. Chen, W.K., et al., Mapping DNA structural variation in dogs. Genome Res, 2009. 19(3): p. 500-9.
86. Kisseberth, W.C., et al., A novel canine lymphoma cell line: A translational and comparative model for lymphoma research. Leuk Res, 2007.
87. Milde, T., et al., A novel family of slitrk genes is expressed on hematopoietic stem cells and leukemias. Leukemia, 2007. 21(4): p. 824-7.
88. London, C.A., et al., Multi-center, placebo-controlled, double-blind, randomized study of oral toceranib phosphate (SU11654), a receptor tyrosine kinase inhibitor, for the treatment of dogs with recurrent (either local or distant) mast cell tumor following surgical excision. Clin Cancer Res, 2009. 15(11): p. 3856-65.
89. http://www.dublinirishfestival.org/animals/irishwolfhound.php.
90. http://blogneffy.blogspot.com/2010/06/wanted-shih-tzu-breeders-in-davao-city.html.
91. http://sentinelkennels.com/images/airedale.jpg.
92. http://www.justdogbreeds.com/images/breeds/cavalier-king-charles-spaniel.jpg.
93. http://www.petsflick.com/images/yorkshire-terrier.jpg.
94. http://tidyyourdog.com/wp-content/uploads/2009/04/siberian-husky.jpg.
95. http://www.petside.com/breeds/chinese-shar-pei.php.
96. http://www.fordogtrainers.com/ProductImages/dog-breeds-muzzles/Australian-Shepherd-muzzle-Australian-Shepherd.jpg.
97. http://www.breederretriever.com/photopost/pindex/516/.
98. http://retrieverman.files.wordpress.com/2009/01/white-golden-retriever-wikipedia.jpg.
99. http://www.dogtastic.org/dogtastic/images/BreedPics/cocker%20spaniel.jpg.
100. http://a1.cdnsters.com/static/images/dogster/breeds/basset_hound.jpg.
101. Martin, L.J., et al., Improving the signal-to-noise ratio in genome-wide association studies. Genet Epidemiol, 2009. 33 Suppl 1: p. S29-32.
102. Lango Allen, H., et al., Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature, 2010. 467(7317): p. 832-8.
103. McArdle, J.J., Latent variable modeling of differences and changes with longitudinal data. Annu Rev Psychol, 2009. 60: p. 577-605.
104. Borsboom, D., G.J. Mellenbergh, and J. van Heerden, The theoretical status of latent variables. Psychol Rev, 2003. 110(2): p. 203-19.
105. Rice, T.K., N.J. Schork, and D.C. Rao, Methods for handling multiple testing. Adv Genet, 2008. 60: p. 293-308.
106. Jiang, X., et al., A bayesian method for evaluating and discovering disease loci associations. PLoS One, 2011. 6(8): p. e22075.
107. Juran, B.D. and K.N. Lazaridis, Genomics in the post-GWAS era. Semin Liver Dis, 2011. 31(2): p. 215-22.
108. Nishikawa, M., T. Tango, and M. Ohtaki, Statistical tests based on new composite hypotheses in clinical trials reflecting the relative clinical importance of multiple endpoints quantitatively. Biom J, 2009. 51(5): p. 749-62.
109. Xiong, C., et al., Power and sample size for clinical trials when efficacy is required in multiple endpoints: application to an Alzheimer's treatment trial. Clin Trials, 2005. 2(5): p. 387-93.
110. Allison, D.B., et al., Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet, 2006. 7(1): p. 55-65.
111. Kim K, Z.S., Loraine A & Allison DB., Picking the most likely candidates for further development: Novel intersection-union tests for addressing multi-component hypotheses in comparative genomics, in Proceedings of the American Statistical Association Joint Statistical Meeting, ENAR Section [CD-ROM]. 2004.
112. Shriner, D. and L.K. Vaughan, A unified framework for multi-locus association analysis of both common and rare variants. BMC Genomics, 2011. 12: p. 89.
113. Berger, R.L., Multiparameter Hypothesis Testing and Acceptance Sampling. Technometrics, 1982. 24(4): p. 295-300.
114. Lequarre, A.S., et al., LUPA: a European initiative taking advantage of the canine genome architecture for unravelling complex disorders in both human and dogs. Vet J, 2011. 189(2): p. 155-9.
115. Vaysse, A., et al., Identification of Genomic Regions Associated with Phenotypic Variation between Dog Breeds using Selection Mapping. PLoS Genet, 2011. 7(10): p. e1002316.
116. Parker, H.G., et al., Genetic structure of the purebred domestic dog. Science, 2004. 304(5674): p. 1160-4.
117. Cadieu, E., et al., Coat variation in the domestic dog is governed by variants in three genes. Science, 2009. 326(5949): p. 150-3.
118. Purcell, S., et al., PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet, 2007. 81(3): p. 559-75.
119. Jones, P., et al., Single-nucleotide-polymorphism-based association mapping of dog stereotypes. Genetics, 2008. 179(2): p. 1033-44.
120. Svartberg, K. and B. Forkman, Personality traits in the domestic dog (Canis familiaris). Applied Animal Behaviour Science, 2002. 79(2): p. 133-155.
121. 
Gudbjartsson, D.F., et al., Many sequence variants affecting diversity of adult human height. Nat Genet, 2008. 40(5): p. 609-15. 122. Adami, C., The use of information theory in evolutionary biology. Ann N Y Acad Sci, 2012. 123. C. E. Shannon, A.m.t.o.c., '' Bell System Technical Journal, vol. 27, pp. 379-423 and 623-656, July and October, 1948. 124. Li, H., et al., Complex-disease networks of trait-associated single-nucleotide polymorphisms (SNPs) unveiled by information theory. J Am Med Inform Assoc, 2012. 19(2): p. 295-305. 125. Hill, A.B., The Environment and Disease: Association or Causation? Proc R Soc Med, 1965. 58: p. 295-300. 126. Prentice, R.L., Surrogate endpoints in clinical trials: definition and operational criteria. Stat Med, 1989. 8(4): p. 431-40. 127. Morabia, A., On the origin of Hill's causal criteria. Epidemiology, 1991. 2(5): p. 367-9. 128. Berger, R.L., Likelihood Ratio Tests and Intersection-Union Tests, in Advances in statistical decision theory and applications, S. Panchapakesan, N. Balakrishnan, and S.S. Gupta, Editors. 1997, Birkhäuser: Boston. 129. Novembre, J. and S. Ramachandran, Perspectives on human population structure at the cusp of the sequencing era. Annu Rev Genomics Hum Genet, 2011. 12: p. 245-74. 130. Gu, H., et al., Principal component directed partial least squares analysis for combining nuclear magnetic resonance and mass spectrometry data in metabolomics: application to the detection of breast cancer. Anal Chim Acta, 2011. 686(1-2): p. 57-63. 131. Zhu, C. and J. Yu, Nonmetric multidimensional scaling corrects for population structure in association mapping with different sample types. Genetics, 2009. 182(3): p. 875-88. 132. Rybaczyk, L.A., et al., An overlooked connection: serotonergic mediation of estrogen-related physiology and pathology. BMC Womens Health, 2005. 5: p. 12. 133. Rybaczyk, L.A., et al., An indicator of cancer: downregulation of monoamine oxidase-A in multiple organs and species. BMC Genomics, 2008. 9: p. 
134.

134. Rybaczyk, L., et al., New bioinformatics approach to analyze gene expressions and signaling pathways reveals unique purine gene dysregulation profiles that distinguish between CD and UC. Inflamm Bowel Dis, 2009. 15(7): p. 971-84.
135. Rybaczyk, L.A., Comparative gene expression analysis to identify common factors in multiple cancers. 2008, Ohio State University: Columbus, Ohio. p. x, 150 p.
136. Germonpré, M., et al., Fossil dogs and wolves from Palaeolithic sites in Belgium, the Ukraine and Russia: osteometry, ancient DNA and stable isotopes. Journal of Archaeological Science, 2009. 36(2): p. 473-490.
137. Matsumoto, M. and T. Nishimura, Mersenne Twister: A 623-Dimensionally Equidistributed Uniform Pseudo-Random Number Generator. ACM Transactions on Modeling and Computer Simulation, 1998. 8(1): p. 3-31.
138. Offit, K., Personalized medicine: new genomics, old lessons. Hum Genet, 2011. 130(1): p. 3-14.
139. Hindorff, L.A., et al., Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A, 2009. 106(23): p. 9362-7.
140. Rakyan, V.K., et al., Epigenome-wide association studies for common human diseases. Nat Rev Genet, 2011. 12(8): p. 529-41.
141. Chung, C.C. and S.J. Chanock, Current status of genome-wide association studies in cancer. Hum Genet, 2011. 130(1): p. 59-78.
142. Branigan, C.A., The reign of the greyhound: a popular history of the oldest family of dogs. 2004, Hoboken, N.J.: Howell Book House.
143. Zaldívar-López, S., Marín, L.M., Hamilton, H., Couto, C.G., Disease prevalence and causes of death in American Kennel Club registered Greyhounds, in ACVIM Meeting. 2009.
144. Phillips, J.C., et al., Heritability and segregation analysis of osteosarcoma in the Scottish deerhound. Genomics, 2007. 90(3): p. 354-63.
145. Lord, L.K., et al., Results of a web-based health survey of retired racing Greyhounds. J Vet Intern Med, 2007. 21(6): p. 1243-50.
146. Couto, G.C., Greyhound Racing Risk Factors, C.E. Alvarez, Editor. 2012: Columbus, Ohio.
147. Zhang, Q., et al., Mapping quantitative trait loci for milk production and health of dairy cattle in a large outbred pedigree. Genetics, 1998. 149(4): p. 1959-73.
148. Doi, J.A., Introduction to Intersection-Union Tests. 2010.

149. Berger, R.L. and J.C. Hsu, Bioequivalence Trials, Intersection-Union Tests and Equivalence Confidence Sets. Statistical Science, 1996. 11(4): p. 283-302.
150. Rybaczyk, L.A., Comparative Gene Expression Analysis To Identify Common Factors In Multiple Cancers. The Ohio State University, 2008. Dissertation.
151. Marín, L.M., et al., Fresh frozen plasma or epsilon aminocaproic acid for the prevention of post-amputation bleeding in retired racing Greyhounds with appendicular bone tumors: A retrospective study of 46 cases (2003-2008). Journal of Veterinary Emergency and Critical Care, 2012. In press.
152. Phillips, J.C., L. Lembcke, and T. Chamberlin, A novel locus for canine osteosarcoma (OSA1) maps to CFA34, the canine orthologue of human 3q26. Genomics, 2010. 96(4): p. 220-7.
153. Torok, M. and L.D. Etkin, Two B or not two B? Overview of the rapidly expanding B-box family of proteins. Differentiation, 2001. 67(3): p. 63-71.
154. Gundem, G., et al., IntOGen: integration and data mining of multidimensional oncogenomic data. Nat Methods, 2010. 7(2): p. 92-3.
155. Ozaki, T., et al., Genetic imbalances revealed by comparative genomic hybridization in osteosarcomas. Int J Cancer, 2002. 102(4): p. 355-65.
156. Gyorffy, B., et al., An online survival analysis tool to rapidly assess the effect of 22,277 genes on breast cancer prognosis using microarray data of 1,809 patients. Breast Cancer Res Treat, 2010. 123(3): p. 725-31.
157. Kupershmidt, I., et al., Ontology-based meta-analysis of global collections of high-throughput public data. PLoS One, 2010. 5(9).
158. Akey, J.M., et al., Tracking footprints of artificial selection in the dog genome. Proc Natl Acad Sci U S A, 2010. 107(3): p. 1160-5.
159. Miller, J.K., et al., Suppression of the negative regulator LRIG1 contributes to ErbB2 overexpression in breast cancer. Cancer Res, 2008. 68(20): p. 8286-94.
160. Gur, G., et al., LRIG1 restricts growth factor signaling by enhancing receptor ubiquitylation and degradation. EMBO J, 2004. 23(16): p. 3270-81.
161. Hedman, H. and R. Henriksson, LRIG inhibitors of growth factor signalling - double-edged swords in human cancer? Eur J Cancer, 2007. 43(4): p. 676-82.
162. Thomasson, M., et al., LRIG1 and the liar paradox in prostate cancer: a study of the expression and clinical significance of LRIG1 in prostate cancer. Int J Cancer, 2011. 128(12): p. 2843-52.
163. Zhao, H., et al., Lrig3 regulates neural crest formation in Xenopus by modulating Fgf and Wnt signaling pathways. Development, 2008. 135(7): p. 1283-93.
164. Guo, D., et al., Perinuclear leucine-rich repeats and immunoglobulin-like domain proteins (LRIG1-3) as prognostic indicators in astrocytic tumors. Acta Neuropathol, 2006. 111(3): p. 238-46.
165. Wu, C., et al., BioGPS: an extensible and customizable portal for querying and organizing gene annotation resources. Genome Biol, 2009. 10(11): p. R130.
166. Larramendy, M.L., et al., Clinical significance of genetic imbalances revealed by comparative genomic hybridization in chondrosarcomas. Hum Pathol, 1999. 30(10): p. 1247-53.
167. Lin, L., et al., The sleep disorder canine narcolepsy is caused by a mutation in the hypocretin (orexin) receptor 2 gene. Cell, 1999. 98(3): p. 365-76.
168. Peyron, C., et al., A mutation in a case of early onset narcolepsy and a generalized absence of hypocretin peptides in human narcoleptic brains. Nat Med, 2000. 6(9): p. 991-7.
169. Zhuang, J.J., et al., Optimizing the power of genome-wide association studies by using publicly available reference samples to expand the control group. Genet Epidemiol, 2010. 34(4): p. 319-26.
170. Meirmans, P.G. and P.W. Hedrick, Assessing population structure: F(ST) and related measures. Mol Ecol Resour, 2011. 11(1): p. 5-18.
171. Ioannidis, J.P., G. Thomas, and M.J. Daly, Validating, augmenting and refining genome-wide association signals. Nat Rev Genet, 2009. 10(5): p. 318-29.
172. Rosenberger, J.A., N.V. Pablo, and P.C. Crawford, Prevalence of and intrinsic risk factors for appendicular osteosarcoma in dogs: 179 cases (1996-2005). J Am Vet Med Assoc, 2007. 231(7): p. 1076-80.
173. Balding, D.J., A tutorial on statistical methods for population association studies. Nat Rev Genet, 2006. 7(10): p. 781-91.
174. Jiang, R., et al., Fine-scale mapping using Hardy-Weinberg disequilibrium. Ann Hum Genet, 2001. 65(Pt 2): p. 207-19.
175. Song, K. and R.C. Elston, A powerful method of combining measures of association and Hardy-Weinberg disequilibrium for fine-mapping in case-control studies. Stat Med, 2006. 25(1): p. 105-26.
176. Gao, G., et al., A Generalized Sequential Bonferroni Procedure Using Smoothed Weights for Genome-Wide Association Studies Incorporating Information on Hardy-Weinberg Disequilibrium among Cases. Hum Hered, 2011. 73(1): p. 1-13.
177. Sha, Q. and S. Zhang, A test of Hardy-Weinberg equilibrium in structured populations. Genet Epidemiol, 2011. 35(7): p. 671-8.
178. Short, A.D., et al., Hardy Weinberg expectations in canine breeds: implications for genetic studies. J Hered, 2007. 98(5): p. 445-51.
179. Candille, S.I., et al., A β-defensin mutation causes black coat color in domestic dogs. Science, 2007. 318(5855): p. 1418-23.
180. Kerns, J.A., et al., Linkage and segregation analysis of black and brindle coat color in domestic dogs. Genetics, 2007. 176(3): p. 1679-89.
181. Clayton, D. and H.T. Leung, An R package for analysis of whole-genome association studies. Hum Hered, 2007. 64(1): p. 45-51.
182. Plagnol, V., et al., A method to address differential bias in genotyping in large-scale association studies. PLoS Genet, 2007. 3(5): p. e74.
183. Lasky-Su, J., et al., On the replication of genetic associations: timing can be everything! Am J Hum Genet, 2008. 82(4): p. 849-58.
184. Mitry, D., et al., SNP mistyping in genotyping arrays--an important cause of spurious association in case-control studies. Genet Epidemiol, 2011. 35(5): p. 423-6.
185. Voight, B.F. and J.K. Pritchard, Confounding from cryptic relatedness in case-control association studies. PLoS Genet, 2005. 1(3): p. e32.
186. Parker, H.G., et al., An expressed retrogene is associated with breed-defining chondrodysplasia in domestic dogs. Science, 2009. 325(5943): p. 995-8.
187. Darwin, C. On the origin of species by means of natural selection, or, the preservation of favoured races in the struggle for life. 1859.
188. Alvarez, C.E. and J.M. Akey, Copy number variation in the domestic dog. Mamm Genome, 2012. 23(1-2): p. 144-63.
189. Lyon, M.F., Gene action in the X-chromosome of the mouse (Mus musculus L.). Nature, 1961. 190: p. 372-3.
190. Schmutz, S.M., et al., MC1R studies in dogs with melanistic mask or brindle patterns. J Hered, 2003. 94(1): p. 69-73.
191. Dreger, D.L. and S.M. Schmutz, The variant red coat colour phenotype of Holstein cattle maps to BTA27. Anim Genet, 2010. 41(1): p. 109-12.
192. Nicholas, T.J., et al., The genomic architecture of segmental duplications and associated copy number variants in dogs. Genome Res, 2009. 19(3): p. 491-9.
193. Liu, S., C.R. Dietrich, and P.S. Schnable, DLA-based strategies for cloning insertion mutants: cloning the gl4 locus of maize using Mu transposon tagged alleles. Genetics, 2009. 183(4): p. 1215-25.
194. Bailey, J.A., J.M. Kidd, and E.E. Eichler, Human copy number polymorphic genes. Cytogenet Genome Res, 2008. 123(1-4): p. 234-43.
195. Issac, B. and G.P. Raghava, EGPred: prediction of eukaryotic genes using ab initio methods after combining with sequence similarity approaches. Genome Res, 2004. 14(9): p. 1756-66.
196. Knudsen, S., Promoter2.0: for the recognition of PolII promoter sequences. Bioinformatics, 1999. 15(5): p. 356-61.
197. Ferrigno, O., et al., Transposable B2 SINE elements can provide mobile RNA polymerase II promoters. Nat Genet, 2001. 28(1): p. 77-81.

198. Ponicsan, S.L., J.F. Kugel, and J.A. Goodrich, Genomic gems: SINE RNAs regulate mRNA production. Curr Opin Genet Dev, 2010. 20(2): p. 149-55.
199. Lunyak, V.V., et al., Developmentally regulated activation of a SINE B2 repeat as a domain boundary in organogenesis. Science, 2007. 317(5835): p. 248-51.
200. Wood, S.H., et al., Reference genes for canine skin when using quantitative real-time PCR. Vet Immunol Immunopathol, 2008. 126(3-4): p. 392-5.
201. Bao, L., M. Zhou, and Y. Cui, CTCFBSDB: a CTCF-binding site database for characterization of vertebrate genomic insulators. Nucleic Acids Res, 2008. 36(Database issue): p. D83-7.
202. Kim, T.H., et al., Analysis of the vertebrate insulator protein CTCF-binding sites in the human genome. Cell, 2007. 128(6): p. 1231-45.
203. Tusnady, G.E., et al., BiSearch: primer-design and search tool for PCR on bisulfite-treated genomes. Nucleic Acids Res, 2005. 33(1): p. e9.
204. Aranyi, T., et al., The BiSearch web server. BMC Bioinformatics, 2006. 7: p. 431.
205. Fontanesi, L., et al., Coat colours in the Massese sheep breed are associated with mutations in the agouti signalling protein (ASIP) and melanocortin 1 receptor (MC1R) genes. Animal, 2011. 5(1): p. 8-17.
206. Dorshorst, B., et al., A complex genomic rearrangement involving the endothelin 3 locus causes dermal hyperpigmentation in the chicken. PLoS Genet, 2011. 7(12): p. e1002412.
207. Durkin, K., et al., Serial translocation by means of circular intermediates underlies colour sidedness in cattle. Nature, 2012. 482(7383): p. 81-4.
208. Giuffra, E., et al., A large duplication associated with dominant white color in pigs originated by homologous recombination between LINE elements flanking KIT. Mamm Genome, 2002. 13(10): p. 569-77.
209. Leonard, B.C., et al., Activity, Expression and Genetic Variation of Canine beta-Defensin 103: A Multifunctional Antimicrobial Peptide in the Skin of Domestic Dogs. J Innate Immun, 2012.
210. Thomas, A.J. and C.A. Erickson, The making of a melanocyte: the specification of melanoblasts from the neural crest. Pigment Cell Melanoma Res, 2008. 21(6): p. 598-610.
211. Sommer, L., Generation of melanocytes from neural crest cells. Pigment Cell Melanoma Res, 2011. 24(3): p. 411-21.
212. Adameyko, I., et al., Sox2 and Mitf cross-regulatory interactions consolidate progenitor and melanocyte lineages in the cranial neural crest. Development, 2012. 139(2): p. 397-410.
213. Okubo, T., L.H. Pevny, and B.L. Hogan, Sox2 is required for development of taste bud sensory cells. Genes Dev, 2006. 20(19): p. 2654-9.
214. Han, D., et al., A TGFbeta-Smad4-Fgf6 signaling cascade controls myogenic differentiation and myoblast fusion during tongue development. Development, 2012.
215. Chao, W., et al., CTCF, a candidate trans-acting factor for X-inactivation choice. Science, 2002. 295(5553): p. 345-7.
216. Sado, T., Y. Hoki, and H. Sasaki, Tsix silences Xist through modification of chromatin structure. Dev Cell, 2005. 9(1): p. 159-65.
217. Sun, B.K., A.M. Deaton, and J.T. Lee, A transient heterochromatic state in Xist preempts X inactivation choice without RNA stabilization. Mol Cell, 2006. 21(5): p. 617-28.
218. Beck, C.R., et al., LINE-1 retrotransposition activity in human genomes. Cell, 2010. 141(7): p. 1159-70.
219. Alkan, C., B.P. Coe, and E.E. Eichler, Genome structural variation discovery and genotyping. Nat Rev Genet, 2011. 12(5): p. 363-76.
220. Chung, E.K., et al., Human complement components C4A and C4B genetic diversities: complex genotypes and phenotypes. Curr Protoc Immunol, 2005. Chapter 13: p. Unit 13.8.
221. Rohde, C., et al., BISMA--fast and accurate bisulfite sequencing data analysis of individual clones from unique and repetitive sequences. BMC Bioinformatics, 2010. 11: p. 230.
222. Rohde, C., et al., Bisulfite sequencing Data Presentation and Compilation (BDPC) web server--a useful tool for DNA methylation analysis. Nucleic Acids Res, 2008. 36(5): p. e34.
223. Rohde, C., et al., New clustering module in BDPC bisulfite sequencing data presentation and compilation web application for DNA methylation analyses. Biotechniques, 2009. 47(3): p. 781-3.
224. Kibbe, W.A., OligoCalc: an online oligonucleotide properties calculator. Nucleic Acids Res, 2007. 35(Web Server issue): p. W43-6.
225. Karolchik, D., et al., The UCSC Genome Browser Database. Nucleic Acids Res, 2003. 31(1): p. 51-4.
226. Yang, J., et al., Common SNPs explain a large proportion of the heritability for human height. Nat Genet, 2010. 42(7): p. 565-9.
227. Macgregor, S., et al., Bias, precision and heritability of self-reported and clinically measured height in Australian twins. Hum Genet, 2006. 120(4): p. 571-80.
228. Silventoinen, K., et al., Heritability of adult body height: a comparative study of twin cohorts in eight countries. Twin Res, 2003. 6(5): p. 399-408.

229. Weedon, M.N., et al., Genome-wide association analysis identifies 20 loci that influence adult height. Nat Genet, 2008. 40(5): p. 575-83.
230. Visscher, P.M., et al., Five years of GWAS discovery. Am J Hum Genet, 2012. 90(1): p. 7-24.
231. Flint, J. and T.F. Mackay, Genetic architecture of quantitative traits in mice, flies, and humans. Genome Res, 2009. 19(5): p. 723-33.
232. Buckler, E.S., et al., The genetic architecture of maize flowering time. Science, 2009. 325(5941): p. 714-8.
