Canine Disease Gene Identification

by

Jeremy R. Shearman

The University of New South Wales

2011

School of Biotechnology and Biomolecular Sciences
University of New South Wales

THE UNIVERSITY OF NEW SOUTH WALES
Thesis/Dissertation Sheet

Surname or Family name: Shearman
First name: Jeremy
Other name/s: Ross
Abbreviation for degree as given in the University calendar: PhD
School: BABS
Faculty: Science
Title: Canine disease gene identification

Abstract 350 words maximum: (PLEASE TYPE)

The dog (Canis familiaris) was domesticated from wolves around 15,000 years ago in multiple locations in the northern hemisphere. Most modern dogs were developed from domestic dogs in the past 200 years in Europe. This breed development resulted in genetically isolated, highly inbred populations segregating diseases. Two autosomal recessive diseases are Trapped Neutrophil Syndrome (TNS) in Border collies and cerebellar abiotrophy in Australian Kelpies. TNS is an inherited neutropenia resulting in a compromised immune system. TNS was mapped to VPS13B using a candidate gene approach and linkage analysis. Sequencing of the gene in affected and control dogs identified a 4 bp deletion in exon 19 causing a frame shift and premature truncation. Alternate transcripts of VPS13B are expressed in the brain of humans but not mice. Sequencing of cDNA from healthy dogs revealed that dogs also express alternate transcripts in the brain. Cerebellar abiotrophy in the Australian Kelpie results in an ataxia. Affymetrix SNP array v2 was used to perform whole genome mapping in twelve affecteds and twenty control Kelpies. Association analysis failed to identify the disease region. Homozygosity analysis identified a five megabase region where all affecteds were homozygous for a common haplotype. This region was enriched for two affecteds and one control using Nimblegen sequence capture arrays and sequenced on a 454 using titanium chemistry and multiplex identifiers. A total of 2019 differences were identified as homozygous in the affecteds compared to controls; 682 of those were in genic regions, 25 were in exons and 8 changed an amino acid.

Declaration relating to disposition of project thesis/dissertation I hereby grant to the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or in part in the University libraries in all forms of media, now or here after known, subject to the provisions of the Copyright Act 1968. I retain all property rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.

I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstracts International (this is applicable to doctoral theses only).

Signature ……………………………………………
Witness ……………………………………………
Date ……………………………………………

The University recognises that there may be exceptional circumstances requiring restrictions on copying or conditions on use. Requests for restriction for a period of up to 2 years must be made in writing. Requests for a longer period of restriction may be considered in exceptional circumstances and require the approval of the Dean of Graduate Research.

FOR OFFICE USE ONLY Date of completion of requirements for Award:

THIS SHEET IS TO BE GLUED TO THE INSIDE FRONT COVER OF THE THESIS

ORIGINALITY STATEMENT

‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.’

Signed …………………………………………………...

Date …………………………………………………...

Acknowledgements

I would like to thank Alan Wilton first and foremost for taking me on as a student despite my not having a scholarship. I was able to survive thanks to Angela Higgins from the UNSW sequencing centre employing me as a casual research assistant two days a week for most of my PhD. The sequencing centre then became a part of the Ramaciotti Centre for Gene Function Analysis, which fortunately kept me on as a casual employee; thanks to Ian Dawes, Helen Speirs and Jason Koval. Thanks to all the past and current members of the Wilton lab: Louise, Carol, Scott, Natsuki, Pete, Julia, Yosh, Claire, Auda, Zi, Paulina, Pranoy, Mathew and Annie (roughly in order of appearance). Thanks also to my co-supervisor, Peter Little, and his lab group: Rohan, Mark, Oscar and Michael, who occupied the other end of the lab. Thanks as well to the next generation of occupants of the other end of the lab, my new co-supervisor Bill Ballard and his group: Rich, Innes, Jonci, Pann Pann, Lou, Carolina and Kylie (there is something about that side of the lab that attracts head-of-school type PIs). On the note of thanking all of those people (most of whom have left UNSW already), it is good to be the one leaving for a change.

Thanks to all my friends: Alex, Paul, Kellie, Kevin, Frances, Kye, Jackie, Grace, Alex, Richard, Michael, Claire, Auda, Eser, Paulina, Zi, Jonci, Pranoy, Natsuki, Lee-Anne, Clare, Emily, Michael, Allan, Graham, Aisha, Adam and Ken. Thanks to my family, especially my brothers, who find it quite amusing that after 9 years of university I will still be earning less than their truck driver wage.

Special thanks to Auda for printing, binding and submitting my thesis for me while I lived it up in Thailand. Most special thanks to Kittiya - the reason that I am in Thailand.

Abstract

The dog (Canis familiaris) was domesticated from wolves around 15,000 years ago in multiple locations in the northern hemisphere. Most modern dogs were developed from domestic dogs in the past 200 years in Europe. This breed development resulted in genetically isolated, highly inbred populations segregating diseases. Two autosomal recessive diseases are Trapped Neutrophil Syndrome (TNS) in Border collies and cerebellar abiotrophy in Australian Kelpies. TNS is an inherited neutropenia resulting in a compromised immune system. TNS was mapped to VPS13B using a candidate gene approach and linkage analysis. Sequencing of the gene in affected and control dogs identified a 4 bp deletion in exon 19 causing a frame shift and premature truncation. Alternate transcripts of VPS13B are expressed in the brain of humans but not mice. Sequencing of cDNA from healthy dogs revealed that dogs also express alternate transcripts in the brain. Cerebellar abiotrophy in the Australian Kelpie results in an ataxia. Affymetrix SNP array v2 was used to perform whole genome mapping in twelve affecteds and twenty control Kelpies. Association analysis failed to identify the disease region. Homozygosity analysis identified a five megabase region where all affecteds were homozygous for a common haplotype. This region was enriched for two affecteds and one control using Nimblegen sequence capture arrays and sequenced on a 454 using titanium chemistry and multiplex identifiers. A total of 2019 differences were identified as homozygous in the affecteds compared to controls; 682 of those were in genic regions, 25 were in exons and 8 changed an amino acid.

List of Publications

Shearman JR and Wilton AN. (2011) Mapping Cerebellar Abiotrophy in Australian Kelpies. Animal Genetics. doi:10.1111/j.1365-2052.2011.02199.x

Shearman JR and Wilton AN. (2011) The effects of inbreeding on the incidence of disease in a pedigree population. Animal Genetics. In prep

Shearman JR and Wilton AN. (2011) A Canine Model of Cohen Syndrome: Trapped Neutrophil Syndrome. BMC Genomics. doi:10.1186/1471-2164-12-258

Shearman JR and Wilton AN. (2011) Origins of the domestic dog and the rich potential for gene mapping. Genetics Research International. 2011: 1-6.

vonHoldt BM, Pollinger JP, Lohmueller KE, Han E, Parker HG, Quignon P, Degenhardt JD, Boyko AR, Earl DA, Auton A, Reynolds A, Bryc K, Brisbin A, Knowles JC, Mosher DS, Spady TC, Elkahloun A, Geffen E, Pilot M, Jedrzejewski W, Greco C, Randi E, Bannasch D, Wilton A, Shearman J, Musiani M, Cargill M, Jones PG, Qian Z, Huang W, Ding ZL, Zhang YP, Bustamante CD, Ostrander EA, Novembre J and Wayne RK. (2010) Genome-wide SNP and haplotype analyses reveal a rich history underlying dog domestication. Nature. 464: 898-902.

Shearman JR, Lau VM and Wilton AN. (2008) Elimination of SETX, SYNE1 and ATCAY as the cause of Cerebellar Abiotrophy in Australian Kelpies. Animal Genetics. 39: 573.

Shearman JR and Wilton AN. (2007) Elimination of neutrophil elastase and the genes for adaptor protein complex 3 subunits as the cause of trapped neutrophil syndrome in Border collies. Animal Genetics. 38: 188-189.

Shearman JR, Zhang QY and Wilton AN. (2006) Exclusion of CXCR4 as the cause of Trapped Neutrophil Syndrome in Border Collies using five microsatellites on canine chromosome 19. Animal Genetics. 37: 89.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
OVERVIEW OF THESIS CHAPTERS

1 INTRODUCTION
  1.1 ORIGINS OF THE DOMESTIC DOG AND THE RICH POTENTIAL FOR GENE MAPPING
    1.1.1 Summary
    1.1.2 Domestication from the wolf
    1.1.3 Dog population structure
    1.1.4 Gene mapping in dogs
    1.1.5 Tools available
    1.1.6 Disease genetics
    1.1.7 Morphology and behavioural genetics
    1.1.8 Conclusion
    1.1.9 Acknowledgements
    1.1.10 References
  1.2 GENE MAPPING APPROACHES FOR ANIMALS
    1.2.1 Summary
    1.2.2 Gene mapping
    1.2.3 Functional candidate gene approach
    1.2.4 Whole genome analysis
    1.2.5 Sequence capture
    1.2.6 Next generation sequencing
    1.2.7 Third generation sequencing
    1.2.8 Conclusion
    1.2.9 References

2 A CANINE MODEL OF COHEN SYNDROME: TRAPPED NEUTROPHIL SYNDROME
  2.1 ABSTRACT
  2.2 INTRODUCTION
  2.3 MATERIALS AND METHODS
  2.4 RESULTS
  2.5 DISCUSSION
  2.6 ACKNOWLEDGMENTS
  2.7 REFERENCES
  2.8 SUPPLEMENTARY MATERIAL

3 THE EFFECTS OF INBREEDING ON THE NEW SOUTH WALES POPULATION
  3.1 SUMMARY
  3.2 INTRODUCTION
  3.3 MATERIALS AND METHODS
    3.3.1 Pedigree analysis
    3.3.2 TNS and NCL testing
  3.4 PEDIGREE ANALYSIS
  3.5 DISEASE ANALYSIS
  3.6 CONCLUSION
  3.7 REFERENCES

4 MAPPING CEREBELLAR ABIOTROPHY IN AUSTRALIAN KELPIES
  4.1 SUMMARY
  4.2 INTRODUCTION
  4.3 MATERIALS AND METHODS
    4.3.1 Samples
    4.3.2 Histopathology
    4.3.3 SNP analysis
    4.3.4 Linkage analysis
  4.4 RESULTS AND DISCUSSION
    4.4.1 Histopathological findings
    4.4.2 Genome wide association study
    4.4.3 Homozygosity analysis
    4.4.4 Linkage analysis
  4.5 CONCLUSION
  4.6 REFERENCES

5 SEQUENCING THE CEREBELLAR ABIOTROPHY REGION IN THE AUSTRALIAN KELPIE
  5.1 SUMMARY
  5.2 INTRODUCTION
  5.3 MATERIALS AND METHODS
    5.3.1 Australian Kelpie samples
    5.3.2 Nimblegen sequence capture
    5.3.3 454 sequencing of captured samples
    5.3.4 Sequencing of candidate differences
  5.4 RESULTS AND DISCUSSION
    5.4.1 Nimblegen sequence capture
    5.4.2 454 sequence data analysis
    5.4.3 Sequencing of candidate differences
  5.5 CONCLUSION
  5.6 ACKNOWLEDGMENTS
  5.7 REFERENCES

6 CONCLUSION

APPENDICES
  I. EXCLUSION OF CXCR4 AS THE CAUSE OF TRAPPED NEUTROPHIL SYNDROME IN BORDER COLLIES USING FIVE MICROSATELLITES ON CANINE CHROMOSOME 19
    SOURCE/DESCRIPTION
    SEQUENCING AND AUTOZYGOSITY ANALYSIS
    COMMENTS
    ACKNOWLEDGEMENTS
    REFERENCES
    SUPPLEMENTARY MATERIAL
  II. ELIMINATION OF NEUTROPHIL ELASTASE AND ADAPTOR PROTEIN COMPLEX 3 SUBUNIT GENES AS THE CAUSE OF TRAPPED NEUTROPHIL SYNDROME IN BORDER COLLIES
    SOURCE/DESCRIPTION
    LINKAGE ANALYSIS AND AUTOZYGOSITY MAPPING
    CONCLUSION
    REFERENCES
    SUPPLEMENTARY MATERIAL
  III. ELIMINATION OF SETX, SYNE1 AND ATCAY AS THE CAUSE OF CEREBELLAR ABIOTROPHY IN AUSTRALIAN KELPIES
    SOURCE/DESCRIPTION
    HOMOZYGOSITY ANALYSIS
    REFERENCES
    SUPPORTING INFORMATION
  IV. GENOME-WIDE SNP AND HAPLOTYPE ANALYSES REVEAL A RICH HISTORY UNDERLYING DOG DOMESTICATION

LIST OF FIGURES

Figure 1-1 Comparison of second generation sequencing protocols showing basic workflow and sequencing procedure
Figure 2-1: Multipoint LOD scores for linkage to TNS gene by microsatellites C13.0390, C13.0423, C13.0449, C13.0478 in VPS13B region on CFA3
Figure 2-2: Genotypes at 4 microsatellite loci near VPS13B in 8 TNS affected Border collies identified by ID number
Figure 2-3: Sequence comparison of an affected, carrier and control Border collie showing 4 bp deletion in exon 19 of VPS13B at base 2894 to 2897 (transcript variant 1)
Figure 2-4 Partial DNA and AA sequence of normal (Norm) and mutant (Mut) VPS13B sequence (above) and translation (below)
Figure 2-5: PCR product from cerebellum and cortex of 3 healthy dogs for primers spanning exon 25-28 of cDNA
Figure 2-6: Forward (above) and reverse (below) sequence data from cerebellum (top) and cortex (bottom) cDNA of a healthy dog
Figure 2-7: Sequence comparison of VPS13B exons 28 and 28b in human to dog, , cow, rat and mouse
Figure 2-8: Sequence comparison between mouse and rat of VPS13B sequence homologous to exon 28 in human
Figure 3-1 Pedigree of offspring with inbreeding coefficient of 50.02%
Figure 4-1 Samples received from Thomas and Robertson (1989) processed on SNP arrays and used for linkage analysis
Figure 4-2 Pedigree of all samples processed on Affymetrix SNP arrays
Figure 4-3: Giemsa stained sections of cerebellum folia from four Australian Kelpies affected with cerebellar abiotrophy
Figure 4-4 Whole genome association study for the CA region in Australian Kelpies using 11 affecteds and 19 controls
Figure 4-5 Whole genome association study for the CA region in Australian Kelpies using 3 affecteds, 19 controls and 8 suspected cases set to unknown status
Figure 4-6 SNP data of candidate region for 11 affected Kelpies, 19 control Kelpies and 3 unaffected siblings to affected 6149 for SNPs between 27.6 and 35.7 Mb on chromosome 3
Figure 4-7 Inheritance patterns in affected families for microsatellites in the long microsatellite group
Figure 4-8 Inheritance patterns in affected families for microsatellites in the short microsatellite group
Figure 4-9 Multipoint linkage analysis of long and short microsatellite groups. Significance threshold is indicated by the red line

LIST OF TABLES

Table 2-1: Primer names and sequences for VPS13B sequencing primers
Table 2-2: Microsatellite primers used for linkage analysis of TNS candidate genes
Table 2-3: LOD score for linkage analysis using microsatellites in the region of 9 candidate genes for TNS
Table 2-4: Microsatellite allele sizes in 72 Border collies for the 4 microsatellites in the VPS13B region
Table 2-5: Non disease haplotypes for the 4 microsatellites in the VPS13B region in 72 Border collies
Table 2-6: Single nucleotide variations identified from sequencing genomic DNA of VPS13B showing their genomic location, predicted effect on the protein produced by transcript variant 1 and dbSNP ID where available
Table 2-7: Deletion detection test results for individuals in the TNS pedigree, a sample set of individuals unrelated to TNS affected lineages and a set of samples collected randomly from Norway
Table 2-8: TNS testing results and allele proportions of sample sets per country after correcting for ascertainment bias
Table 2-9: Comparison of clinical signs with percentages between Cohen syndrome patients and seven TNS affected Border collies
Table 3-1 Population size (N), offspring number, number and sex distribution of parents and proportion of parents to population size per generation
Table 3-2 Wright’s coefficient of inbreeding for the New South Wales database of registered pure bred Border collies
Table 3-3 Average and median F per generation for all Border collies registered in New South Wales up to the end of 2009
Table 3-4 Number of popular sires and dams per offspring count range in the Border collie database for New South Wales
Table 3-5 Expected number of total, coding, AA changing, protein function destroying and passed to next generation per generation
Table 3-6 Average and median inbreeding coefficients for clear, carrier and affected disease status for TNS and NCL of tested dogs compared to the dogs in the New South Wales database
Table 4-1 Long microsatellite group: microsatellite name, primer sequence, repeat unit type and count in reference sequence
Table 4-2 Short microsatellite group: microsatellite name, primer sequence, repeat unit type and count in reference sequence
Table 4-3 Total allele counts and counts between affecteds and controls for each microsatellite in the long microsatellite group
Table 4-4 Total allele counts and counts between affecteds and controls for each microsatellite in the short microsatellite group
Table 5-1: Genes in the Kelpie cerebellar abiotrophy candidate region showing location and known structure/function information
Table 5-2: Sequencing primers for typing candidate differences identified between cerebellar abiotrophy affected Kelpies and a control Kelpie
Table 5-3 Mapping statistics for sequence output of captured DNA for CA affecteds and the Kelpie control
Table 5-4 Sequence differences identified between affected and control dogs from 454 sequencing
Table 5-5 Genomic position, sequencing depth and coding effect of differences in exons identified homozygous in affected dogs that were not homozygous in the control dog
Table 5-6 Genotypes of differences identified between CA affected and a control Kelpie typed in 96 Kelpies

OVERVIEW OF THESIS CHAPTERS

The thesis is in the format of a series of publications. Individual chapters were published in Animal Genetics, Genetics Research International and BMC Genomics, but have all been reformatted to suit the format of Animal Genetics as this is the top ranking journal in the field of animal genetics. The theme of this thesis is disease gene mapping in dogs. The introduction chapter consists of two papers: the first is a general introduction to canine genetics (Shearman and Wilton 2011. Genetics Research International) and the second is an introduction to gene mapping methods in model organisms. These are intended to give the reader the necessary background information for the theme of the thesis.

Two different mapping approaches were utilised in this thesis: the candidate gene approach and a genome wide association study. The candidate gene approach was applied to an autosomal recessive disease, Trapped Neutrophil Syndrome (TNS), in Border collies. The disease results in a poor immune system and failure to thrive. The second chapter presents the findings of using the candidate gene approach to identify the causative mutation of TNS (Shearman and Wilton 2011. BMC Genomics doi:10.1186/1471-2164-12-258). The outcome of this body of work was a commercially available diagnostic test allowing the laboratory to screen for carriers of the TNS mutation. Breeders are then able to use this information to ensure that matings do not occur between two carriers, thus removing the chance for affected pups to be born. The lab began offering this test in 2007 and has developed a comprehensive database of Border collies and their disease status. Chapter three makes use of this database to compare inbreeding coefficients of carriers and affecteds to the inbreeding coefficients of the entire New South Wales database of Border collies from 1982-2009 (Shearman and Wilton 2011 - to be submitted). In addition, two brief notes were published in Animal Genetics describing the exclusion of some of the candidate genes for TNS; these can be found in appendices I and II and give information regarding the samples used (Shearman et al. 2006, Shearman and Wilton 2007).

With the causative mutation for TNS identified and a test successfully in use, research into the cause of a second disease, Cerebellar Abiotrophy (CA) in the Australian Kelpie, was started. The clinical signs for CA are an intention tremor and loss of coordination. A candidate gene approach was initially adopted to find the causative mutation; appendix III contains a brief note published in Animal Genetics on the genes excluded by the candidate gene approach (Shearman et al. 2008). A genome wide association study approach was undertaken in light of the vast number of good candidate genes. Chapter four reports on the genome wide association study using Affymetrix SNP array v2.0 to map the causative mutation for this disease to a 5 Mb region on chromosome 3 (Shearman and Wilton 2011. Animal Genetics doi:10.1111/j.1365-2052.2011.02199.x). The identified region containing the causative mutation for CA did not contain any genes known to cause CA when disrupted, so a target capture and 454 sequencing approach was taken to find the causative mutation. Chapter five reports on the findings from this work but is not published as the causative mutation remains to be identified. However, this chapter is written in publication format and will be updated and published when the work is completed by the next student to take on the project.

In addition to the Kelpie samples, six Dingo samples were processed on the SNP arrays. The Dingo is a native wild dog to Australia and is currently under threat of extinction by hybridisation with non-native dogs. The data from these samples was sent to collaborators Bridget vonHoldt and Robert Wayne to be included in a study using domestic dogs and wild wolves to trace the origins of dog domestication. The Dingo was found to group with the ‘ancient dog’ breeds and turned out to be one of the oldest dog breeds in the world. Contribution to this work consisted of the Dingo SNP data and minor editing to the final draft of the manuscript prior to submission. The published work can be found in appendix IV; this work has not been reformatted to suit the style of Animal Genetics as it would be inappropriate given the limited input by the author of this thesis.

The use of two different gene mapping approaches applied to finding the causative mutation of two different diseases allows for a comparison of the different methods and covers the field of disease gene mapping approaches in a comprehensive manner.

1 INTRODUCTION

1.1 Origins of the domestic dog and the rich potential for gene mapping

[2011, Genetics Research International. 2011: 1-6.]

Jeremy R. Shearman*,±,‡ and Alan N. Wilton*,±

* School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW 2052, Australia
± Clive and Vera Ramaciotti Centre for Gene Function Analysis, University of New South Wales, Sydney, NSW 2052, Australia
‡ National Center for Genetic Engineering and Biotechnology, 113 Phahonyothin Road Klong 1, Klong Luang, Pathumthani 12120, Thailand

Address for correspondence
Alan Wilton, School of Biotechnology, University of NSW, Sydney NSW 2052, Australia
E-mail: [email protected]
Fax: +61 2 9385 1483

1.1.1 Summary

The unique breeding structure of the domestic dog makes canine genetics a useful tool to further the understanding of inherited diseases and gene function. Answers to the questions of when and where the dog was domesticated from the wolf are uncertain, but how the modern diversity of dog breeds was developed is documented. Breed development has resulted in many genetically isolated populations segregating for different alleles for disease, morphological and behavioural traits. Many genetic tools are available for dog research allowing investigation into the genetic basis of these phenotypes. Research into causes of diseases in dogs is relevant to humans and other species; comparative genomics is being used to transfer genetic information to them, including some studies on morphological and behavioural phenotypes. Because of the unique breed structure and well maintained pedigrees, dogs represent a model organism containing a wealth of genetic information.

1.1.2 Domestication from the wolf

Domestic dogs can be viewed as one of mankind’s largest and longest running breeding experiments. The process has resulted in over 400 breeds with considerable morphologic and behavioural diversity compared to the gray wolf ancestor. The origin and time frame of domestication from the gray wolf are hotly debated. Early work using phylogenetic substitution rates in mitochondrial D-loop sequence suggested that dogs may have originated as early as 100,000 years ago (Vila et al. 1997; Wayne and Ostrander 1999); however, this figure is based on an unlikely assumption of a single founding mitochondrial DNA (mtDNA) haplotype. A similar study by Savolainen et al. (2002), using samples from a more widely distributed area and allowing for multiple mtDNA haplotypes in the founding population, suggested a domestication time of 15,000 years ago. Pang et al. (2009), using entire mitochondrial genomes from 169 dogs and mitochondrial control region sequence data from 1543 dogs, suggested a domestication time of 5,400 to 16,300 years ago. However, dog-like fossils have been dated as early as 31,000 years ago (Germonpre et al. 2009). The discrepancy between genetic and archaeological data could be caused by several things. One is incomplete separation of wolf and dog populations with recent admixture, as has been observed in a US wolf population, which would reduce the apparent time since domestication (Anderson et al. 2009).

Identifying the location of dog domestication has proven difficult, partly because it is confounded by the choice of samples from wild relatives for comparison to dog. mtDNA data have been used to support East Asia as the origin of all modern dog breeds (Savolainen et al. 2002; Pang et al. 2009). Verginelli et al. (2005) have used mtDNA to suggest an Eastern European origin of domestication. The number of dogs included from a region’s native dog population can influence the conclusions, as shown by Boyko et al. (2009), who examined mtDNA sequence from native African village dogs (representing domestic dogs prior to breed development). They suggested that genome wide autosomal markers were required to answer the question of where dogs were first domesticated. vonHoldt et al. (2010) typed a set of 48,000 SNPs on Affymetrix mapping array version 2 in 912 dogs from 85 modern and ancient breeds and 225 gray wolves and concluded that dogs were likely domesticated from multiple locations. Some ancient breeds seem to have a primary ancestry in East Asia but the majority of breeds have ancestry in the Middle East (vonHoldt et al. 2010). There were no clines of genetic diversity in dog populations, so unlike in human populations genetic diversity cannot be used to trace ancestral origins of dog. The data lead one to speculate that there were multiple origins of domestication, but this does not fit well with the global distribution of all mtDNA clades (Pang et al. 2009).

1.1.3 Dog population structure

The domestication of the wolf established populations of native dogs in several places around the world, and these native populations existed for some time, allowing some genetic diversity to rebuild after the original domestication bottleneck. The breed structure and relationship between dog breeds can be teased apart using SNP data such as that from vonHoldt et al. (2010). The ancient dog breeds, such as the Australian Dingo and Chinese Shar Pei, were isolated from these early dogs thousands of years ago and remained more or less distinct breeding populations (vonHoldt et al. 2010; Parker et al. 2007). Most modern dog breeds have been developed in the last two hundred years (Parker et al. 2004; Lindblad-Toh et al. 2005) by selecting dogs with certain phenotypes, primarily in Europe (vonHoldt et al. 2010). Line breeding (inbreeding) with strong artificial selection for generations has resulted in different characteristics becoming fixed in each breed. Today, dogs have one of the most diverse phenotypic ranges of any species (Spady et al. 2008).

Dog breeds can be classified into nine groups based on form or function: toy dogs, spaniels, scent hounds, working dogs, mastiff-like breeds, small terriers, retrievers, herding breeds, and sight hounds (vonHoldt et al. 2010). Dogs within a classification tend to be more similar genetically and group together in neighbour-joining trees built from SNP data (vonHoldt et al. 2010). Breeds of dog generated by crossing dogs from two different groups are also reflected in the neighbour-joining tree as having ancestry from both groups (vonHoldt et al. 2010). Understanding the origins of a breed is important for genetic studies as the history can give an indication of potential allele sharing between breeds.

For a dog to be classified as pure bred it has to be the offspring of pure bred parents. Pedigree records are well documented and dogs of mixed ancestry are excluded from any breed. This unique population structure results in a significant degree of inbreeding and strong population substructure. One factor that influences these processes is the popular sire effect. A popular sire is a male that is highly sought after for breeding purposes, usually from winning dog shows or herding competitions. A popular sire can produce hundreds of offspring, contributing significantly to the gene pool of the next generation (Calboli et al. 2008). This results in inbreeding effective population sizes that are around 50 individuals for each breed (Calboli et al. 2008). Such pure bred populations have strong genetic drift which can result in genetic diseases or in the breed. A dog carrying a recessive disease allele can pass it on to hundreds of offspring, rapidly spreading it through the population. Most pure bred dogs will carry several such disease alleles and many of them will be unique to the breed due to new mutations or drift increasing the frequency of a mutation present in a founder dog.

Breeding of dogs is easily manipulated and planned. Single animals with a rare disorder can be bred into disease colonies for research. Crosses between breeds can be used to place genes for genetic traits in different genetic backgrounds to allow study of the influence of modifying genes on phenotype. This can be important when studying diseases with low penetrance or variable expression.
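As a rough illustration of how a small number of popular sires can hold the inbreeding effective population size near 50, the sketch below applies the standard textbook approximation for unequal numbers of breeding males and females, Ne = 4 * Nm * Nf / (Nm + Nf), together with the expected per-generation rise in inbreeding, delta F = 1 / (2 * Ne). The function names and the sire and dam counts are hypothetical and are not taken from Calboli et al. (2008).

```python
def effective_population_size(n_sires: int, n_dams: int) -> float:
    """Inbreeding effective size for unequal numbers of breeding males and females.

    Uses the standard approximation Ne = 4 * Nm * Nf / (Nm + Nf).
    """
    if n_sires <= 0 or n_dams <= 0:
        raise ValueError("both sexes must contribute at least one parent")
    return 4.0 * n_sires * n_dams / (n_sires + n_dams)


def inbreeding_increment(ne: float) -> float:
    """Expected increase in the inbreeding coefficient per generation, 1 / (2 * Ne)."""
    if ne <= 0:
        raise ValueError("effective population size must be positive")
    return 1.0 / (2.0 * ne)


# Hypothetical breed with 500 breeding dams in both scenarios (assumed numbers).
balanced = effective_population_size(500, 500)  # males and females used equally
skewed = effective_population_size(13, 500)     # breeding dominated by a few popular sires

if __name__ == "__main__":
    for label, ne in (("balanced", balanced), ("popular sires", skewed)):
        print(f"{label}: Ne = {ne:.1f}, delta F = {inbreeding_increment(ne):.4f}")
```

Under these assumed numbers, restricting breeding to 13 sires brings the effective size down to about 50.7 even though more than 500 animals are breeding, which is the scale of effect described above.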

1.1.4 Gene mapping in dogs Dogs have large haplotype blocks (regions of linked alleles in strong linkage disequilibrium, see Wall and Pritchard 2003 for review of linkage disequilibrium and haplotype blocks) within a breed and smaller haplotype blocks between breeds. A haplotype block is a long stretch of DNA with a particular combination of allelic variants that often occur together and in canines these haplotype blocks can be up to ten times the length of haplotype blocks found in humans (Lindblad-Toh et al. 2005; Gray et al. 2009). Large haplotype blocks allow mapping in dogs to be performed with fewer polymorphic markers and fewer individuals as compared to human studies and makes pure bred dogs an ideal model for the study of genetic traits and diseases. However, large haplotype blocks also mean that any trait region identified within a single breed can be in the range of several Mb incorporating tens of genes to over a hundred (Parker et al. 2007; Akey et al. 2010). Such a large trait region is a significant problem for identifying causative mutations. In some cases this problem can be overcome by using related breeds that share the same trait to help narrow the interval containing the mutation (Karlsson et al. 2007). For cases where the trait of interest is restricted to a single breed, a search may indicate a large possible trait region requiring a candidate gene approach to be applied within that region. For identifying novel genes involved in genetic pathways, the study of canine traits can be useful. Conditions that are rare in outbred populations, such as human, can become common within one or several inbred breeds and so there are many traits that can be readily studied in dogs. Purebred dogs typically have less heterogeneity (multiple 4 alleles causing indistinguishable phenotypes) than an outbred populations (IDID, http://www.vet.cam.ac.uk/idid/) which means that analysis of a canine phenotype will result in a stronger genetic signal. 
Mapping canine homologues of complex traits is therefore likely to identify single, high-effect loci as a result of the breeding structure (Cadieu et al. 2009). Dog genetics may not hold all the answers to the causes of complex trait phenotypes in outbred populations, but it can shed light on at least some genes and pathways involved. Dogs can make a good model organism as they generally share the same environment as humans, supplementing the use of mouse, zebra fish and yeast as models. A typical mapping experiment in dogs would make use of an association study using SNP arrays on a trait or set of traits that exists in multiple breeds, such as coat variation (Cadieu et al. 2009). In the example of coat variation, three phenotypes were each mapped within a single breed: obvious moustache and eyebrows, hair length, and curled hair. Mapping was then expanded to include dogs from 80 breeds allowing the authors to exclude false positives caused by sample stratification and to narrow the candidate region by taking advantage of the smaller haplotype block sharing between breeds. This made the identification of genes simpler.
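The linkage disequilibrium that defines the haplotype blocks discussed in this section can be quantified for a pair of biallelic loci. The following is a minimal sketch, assuming phased haplotype counts are available; it computes the standard coefficient D and the squared correlation r^2. The function name and the example counts are hypothetical and are not taken from any of the studies cited.

```python
def linkage_disequilibrium(n_AB: int, n_Ab: int, n_aB: int, n_ab: int):
    """Return (D, r_squared) for two biallelic loci from haplotype counts.

    n_AB, n_Ab, n_aB and n_ab are counts of the four possible haplotypes.
    D = p_AB - p_A * p_B and r^2 = D^2 / (p_A * p_a * p_B * p_b).
    """
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no haplotypes supplied")
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total   # frequency of allele A at locus 1
    p_B = (n_AB + n_aB) / total   # frequency of allele B at locus 2
    d = p_AB - p_A * p_B
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    return d, d * d / denom


# Hypothetical within-breed sample: only two haplotypes are present,
# so the loci are in complete linkage disequilibrium.
d_block, r2_block = linkage_disequilibrium(50, 0, 0, 50)

# Hypothetical sample with all four haplotypes equally common:
# the alleles at the two loci are independent.
d_free, r2_free = linkage_disequilibrium(25, 25, 25, 25)

print(f"within block:  D = {d_block:.2f}, r^2 = {r2_block:.2f}")
print(f"no LD:         D = {d_free:.2f}, r^2 = {r2_free:.2f}")
```

Markers with r^2 close to 1 carry nearly the same information, which is why long blocks allow a genome to be covered with fewer markers.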

1.1.5 Tools available

A 1.5x coverage sequence of a poodle (Kirkness et al. 2003) and a 7.5x coverage sequence of a boxer (Lindblad-Toh et al. 2005) have provided an annotated dog genome and allowed for comparative genomics and the establishment of the dog as a mammalian model organism. Other important milestones were the development of a canine expression array (Affymetrix) and several canine SNP arrays with 100,000s of loci (Karlsson et al. 2007; www.affymetrix.com; www.illumina.com). SNP arrays replace the need to laboriously type large numbers of microsatellites for whole genome analysis. A comprehensive linkage map for all dog chromosomes is now also available and can be used in conjunction with whole genome mapping (Wong et al. 2010). The availability of high throughput genotyping technologies allows large-scale mapping experiments to be performed rapidly, with markers spaced densely enough that fine mapping to localise the gene after initial mapping studies will be easier or may even be unnecessary. It is important to note that the SNPs on these arrays are based on a subset of breeds and thus may not be applicable to all breed types; however, they have been successfully used in dogs as distantly related as the dingo (vonHoldt et al. 2010).

With the development of next generation sequencing, which allows gigabases of DNA sequence to be generated from a single sample (see Metzker 2009 for review of next generation sequencing), several technologies have become available to address the issue of targeting particular part(s) of the genome to be sequenced (Albert et al. 2007; Gnirke et al. 2009; Tewhey et al. 2009). Sequence capture uses many DNA probes that together represent a target sequence to hybridise with DNA from the sample and temporarily capture specific target regions of the genome, which are then recovered. Sequence capture followed by next generation sequencing is useful when a trait or disease gene is mapped to a region of several megabases in size (Karlsson et al. 2007).

1.1.6 Disease genetics

The gene complement of most eutherian mammals is very similar and dogs have similar genetic diseases to those observed in large outbred populations, such as humans, from simple monogenic traits to complex disorders. For a comprehensive listing of dog diseases with human analogues see Table S1 in Boyko (2011). The medical attention we provide our much loved canine companions has led to an extensive list of known disorders, second only to human and mouse (see Sargan 2004; Inherited Diseases In Dogs database, http://server.vet.cam.ac.uk/index.html). Information on the genetic basis of common complex disorders such as cancers, heart diseases and diabetes in the dog can be informative for disease gene identification in other species. While few examples of shared genes for complex disorders currently exist, it is likely that such genes will be found, based on the clinical sign similarities discussed below.

A benefit of using the dog model for disease studies is the well-documented pedigrees, which provide information on relatedness, inbreeding coefficients, common ancestors and thus high-risk family lines. This information can aid in the selection of samples for genetic studies on diseases and traits by allowing the researcher to identify potential carriers of a disorder.

Cancers of many different types exist in different breeds, offering the potential for insight into disease mechanisms and treatment options for cancers in humans and other species, and this is a major research focus of several groups, e.g. LUPA (www.eurolupa.com). For example, there is a familial medullary thyroid cancer common in the Alaskan Malamute (Lee et al. 2006), a non-Hodgkin’s lymphoma common in the Boxer, Setter and Cocker Spaniel (Pastor et al. 2009) and mammary tumours in the English Springer Spaniel (Rivera et al. 2009), to name a few. The types of cancers observed in dogs are, in many cases, similar to forms found in humans.
Gene expression profiling of 32 cases of canine osteosarcoma has identified expression patterns associated with short versus long term survival similar to those found in humans (Selvarajah et al. 2009). Genomic regions with copy number abnormalities that were identified in cases of canine colorectal cancer contain many genes known to be disrupted in human colorectal cancer. Furthermore, clustering of human and dog copy number abnormalities grouped samples into tumour subtypes rather than species (Tang et al. 2010). The genetic similarities in cancer subtypes between human and dog suggest that genetic pathways leading to cancer may be similar across species. Cases of canine hemangiosarcoma have also been suggested as good models to study the effect of cancers in varying genetic backgrounds, because the genetic stratification between dog breeds is somewhat similar to the genetic stratification observed among different human ethnicities (Tamburini et al. 2009; Thomas et al. 2009). Dogs also suffer from inherited high blood pressure and various cardiovascular disorders such as arrhythmias, cardiomyopathy and dilated cardiomyopathy. Dilated cardiomyopathy in dogs presents with clinical signs similar to human symptoms such as shortness of breath, decreased appetite, weakness and collapse. Interestingly, individual breeds differ in which of these clinical signs is the most common (Martin et al. 2009). Dogs suffer from many immune-mediated disorders, similar to humans (see Gershwin 2010 for review of immune disorders), which may be due to disease alleles at several loci segregating within dog populations, a shared environment with humans or a combination of both of these factors. These examples represent naturally occurring diseases of biomedical significance, segregating in pure bred dog populations. Mapping of these disease genes in the dog could aid in elucidating the disease mechanism in humans and other species. 
The above examples are areas where canine genetics could significantly aid the understanding of complex disease. Two examples where canine genetics has shed light on previously unknown disease mechanisms in humans are the discovery in dogs of a narcolepsy gene, HCRTR2 (Lin et al. 1999; Hungs et al. 2001), and of a novel photoreceptor gene, PRCD, involved in cases of retinitis pigmentosa (Zangerl et al. 2006). Other cases where mapped dog diseases have been speculated to correspond to unmapped homologous diseases in humans include: a duplication of four genes predisposing to dermal sinus, which in humans is often associated with spina bifida (Salmon Hillbertz et al. 2007); a set of five loci associated with systemic lupus erythematosus (Wilbe et al. 2010); and hyperuricosuria in the Dalmatian caused by mutations in the SLC2A9 gene (Bannasch et al. 2008). In most cases the identification of the cause of a canine disease identifies a gene where mutations in homologues cause a similarly characterised disease in other species. Identifying canine homologues of human disease genes allows affected dogs to be used as a mammalian model to further study the disease mechanism and potential treatment options.

1.1.7 Morphology and behavioural genetics

Canine genetics also has the potential to identify the genetic basis of morphological and behavioural traits. Any dog chosen from a pure bred population will be morphologically defined by the breed defining traits, and thus measurements of characters are not required for all individuals when comparing across breeds (Sutter et al. 2008). Consistent phenotypes mean that SNP data from multiple studies can be pooled and used to map genes for these breed defining traits. Examples where large datasets incorporating large numbers of breeds have been used to map such traits are beginning to appear, such as the coat variation study by Cadieu et al. (2009).

Loss of function mutations in myostatin (MSTN) are a good example of a trait transferable between species using comparative genomics. They cause increased muscle mass in several species, including dogs and (see Rodgers and Garikipati 2008 for review). Heterozygosity for a MSTN mutation has been found to increase racing ability in both Whippets and racing horses (Mosher et al. 2007; Hill et al. 2010). Whippets are a racing dog breed that has been selected for a combination of slim build, deep chest and powerful legs, allowing them to reach speeds over 50 km per hour. Analysis of the genetic factors behind the Whippet phenotype and running speed may complement studies into the genetics of running speed in racehorses. Understanding the genetics of traits such as skeletal structure, muscle density and muscle mass would benefit breeding studies for these species and others.

One of the most remarkable characteristics of domestic dogs is their ability to pick up and understand human cues and emotions. Dogs show a strong attachment relationship with their caregiver and are more amenable to training than wolves raised in the same environment (Topal et al. 2005; Gacsi et al. 2005; Gacsi et al. 2009).
This suggests that the characteristics that allowed dogs to be domesticated have a genetic component. vonHoldt et al. (2010) found a strong selection signal in domestic dogs at a gene, WBSCR17, which in humans is involved in Williams-Beuren syndrome, a disease that includes mental retardation, ease with strangers and a desire to be in groups. Such characteristics would make dogs easier to handle and could have been strongly selected early during the domestication process.

Individual dog breeds are enriched or fixed for innate behavioural characteristics including pointing, herding and aggressive behaviour. While these breeds still require training for pointing and herding, they are far more responsive to the training than other breeds, which suggests that these traits have a degree of genetic predisposition. Mapping for these traits has identified genomic regions that appear strongly associated with these behaviours (Jones et al. 2008). Different dog breeds show variation in the amount of confrontational or aggressive behaviour they exhibit towards humans and other dogs. Takeuchi et al. (2009) have mapped a trait they call ‘aggression towards strangers’ to a variant in SLC1A2, which may be responsible for overly aggressive behaviour. These few examples show how canine genetics can be used to identify genes potentially affecting behaviours, which may assist in identifying similar genes affecting behaviour in other species.

1.1.8 Conclusion

Dogs represent a rich potential resource for furthering the understanding of diseases and genetic traits because of their history of domestication and breed development. Domestication of the dog has resulted in many isolated populations, much like a breeding experiment with gene mapping as the aim. Recent advances in understanding this genetic history are important for mapping genes for various phenotypes and traits. Breeds fixed or highly enriched for certain phenotypes already exist. Identifying the genetics responsible for breed defined phenotypes is potentially as simple as collating the existing SNP array data and performing the analyses. Confounding issues that could pose a problem for mapping the phenotypes listed above are sample stratification and the large haplotype blocks that exist within breeds. However, canine genetics has significant potential to contribute to the understanding of genetic disorders and functional genomics in other species and will compete with other species as a genetic model organism.

1.1.9 Acknowledgements

The authors would like to acknowledge members of the Ballard lab of the University of New South Wales for proofreading and feedback.

1.1.10 References

Akey J.M., Ruhe A.L., Akey D.T., Wong A.K., Connelly C.F., Madeoy J., Nicholas T.J. and Neff M.W. (2010) Tracking footprints of artificial selection in the dog genome. Proceedings of the National Academy of Sciences of the United States of America. 107: 1160-1165.
Albert T.J., Molla M.N., Muzny D.M. et al. (2007) Direct selection of human genomic loci by microarray hybridization. Nature Methods. 4: 903-905.
Anderson T., vonHoldt B.M., Candille S.I. et al. (2009) Molecular and evolutionary history of melanism in North American gray wolves. Science. 323: 1339-1343.
Bannasch D., Safra N., Young A., Karmi N., Schaible R.S. and Ling G.V. (2008) Mutations in the SLC2A9 gene cause hyperuricosuria and hyperuricemia in the dog. PLoS Genetics. 4: e1000246.
Boyko A.R., Boyko R.H., Boyko C.M. et al. (2009) Complex population structure in African village dogs and its implications for inferring dog domestication history. Proceedings of the National Academy of Sciences of the United States of America. 106: 13903-13908.
Boyko A.R. (2011) The domestic dog: man's best friend in the genomic era. Genome Biology. 12: 216-225.
Cadieu E., Neff M.W., Quignon P. et al. (2009) Coat variation in the domestic dog is governed by variants in three genes. Science. 326: 150-153.
Calboli F.C., Sampson J., Fretwell N. and Balding D.J. (2008) Population structure and inbreeding from pedigree analysis of purebred dogs. Genetics. 179: 593-601.
Gacsi M., Gyori B., Miklosi A., Viranyi Z., Kubinyi E., Topal J. and Csanyi V. (2005) Species-specific differences and similarities in the behavior of hand-raised dog and wolf pups in social situations with humans. Developmental Psychobiology. 47: 111-122.
Gacsi M., Gyori B., Viranyi Z., Kubinyi E., Range F., Belenyi B. and Miklosi A. (2009) Explaining dog wolf differences in utilizing human pointing gestures: selection for synergistic shifts in the development of some social skills. PLoS One. 4: e6584.
Germonpre M., Sablin M.V., Stevens R.E. et al. (2009) Fossil dogs and wolves from Palaeolithic sites in Belgium, the Ukraine and Russia: osteometry, ancient DNA and stable isotopes. Journal of Archaeological Science. 36: 473-490.
Gershwin L.J. (2010) Autoimmune diseases in small animals. Veterinary Clinics of North America. Small Animal Practice. 40: 439-457.
Gnirke A., Melnikov A., Maguire J. et al. (2009) Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nature Biotechnology. 27: 182-189.
Gray M.M., Granka J.M., Bustamante C.D., Sutter N.B., Boyko A.R., Zhu L., Ostrander E.A. and Wayne R.K. (2009) Linkage disequilibrium and demographic history of wild and domestic canids. Genetics. 181: 1493-1505.
Hill E.W., Gu J., Eivers S.S., Fonseca R.G., McGivney B.A., Govindarajan P., Orr N., Katz L.M. and MacHugh D. (2010) A sequence polymorphism in MSTN predicts sprinting ability and racing stamina in thoroughbred horses. PLoS One. 5: e8645.
Hungs M., Lin L., Okun M. and Mignot E. (2001) Polymorphisms in the vicinity of the hypocretin/orexin are not associated with human narcolepsy. Neurology. 57: 1893-1895.
Jones P., Chase K., Martin A., Davern P., Ostrander E.A. and Lark K.G. (2008) Single-nucleotide-polymorphism-based association mapping of dog stereotypes. Genetics. 179: 1033-1044.
Karlsson E.K., Baranowska I., Wade C.M. et al. (2007) Efficient mapping of mendelian traits in dogs through genome-wide association. Nature Genetics. 39: 1321-1328.
Kirkness E.F., Bafna V., Halpern A.L. et al. (2003) The dog genome: survey sequencing and comparative analysis. Science. 301: 1898-1903.
Lee J.J., Larsson C., Lui W.O., Hoog A. and Von Euler H. (2006) A dog pedigree with familial medullary thyroid cancer. International Journal of Oncology. 29: 1173-1182.
Lin L., Faraco J., Li R., Kadotani H., Rogers W., Lin X., Qiu X., de Jong P.J., Nishino S. and Mignot E. (1999) The sleep disorder canine narcolepsy is caused by a mutation in the hypocretin (orexin) receptor 2 gene. Cell. 98: 365-376.
Lindblad-Toh K., Wade C.M., Mikkelsen T.S. et al. (2005) Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature. 438: 803-819.
Martin M.W., Stafford Johnson M.J. and Celona B. (2009) Canine dilated cardiomyopathy: a retrospective study of signalment, presentation and clinical findings in 369 cases. The Journal of Small Animal Practice. 50: 23-29.
Metzker M.L. (2009) Sequencing technologies - the next generation. Nature Reviews. Genetics. 11: 31-46.
Mosher D.S., Quignon P., Bustamante C.D., Sutter N.B., Mellersh C.S., Parker H.G. and Ostrander E.A. (2007) A mutation in the myostatin gene increases muscle mass and enhances racing performance in heterozygote dogs. PLoS Genetics. 3: e79.
Pang J.F., Kluetsch C., Zou X.J. et al. (2009) mtDNA data indicate a single origin for dogs south of Yangtze River, less than 16,300 years ago, from numerous wolves. Molecular Biology and Evolution. 26: 2849-2864.
Parker H.G., Kim L.V., Sutter N.B., Carlson S., Lorentzen T.D., Malek T.B., Johnson G.S., DeFrance H.B., Ostrander E.A. and Kruglyak L. (2004) Genetic structure of the purebred domestic dog. Science. 304: 1160-1164.
Parker H.G., Kukekova A.V., Akey D.T., Goldstein O., Kirkness E.F., Baysac K.C., Mosher D.S., Aguirre G.D., Acland G.M. and Ostrander E.A. (2007) Breed relationships facilitate fine-mapping studies: a 7.8-kb deletion cosegregates with Collie eye anomaly across multiple dog breeds. Genome Research. 17: 1562-1571.
Pastor M., Chalvet-Monfray K., Marchal T., Keck G., Magnol J.P., Fournel-Fleury C. and Ponce F. (2009) Genetic and environmental risk indicators in canine non-Hodgkin's lymphomas: breed associations and geographic distribution of 608 cases diagnosed throughout France over 1 year. Journal of Veterinary Internal Medicine. 23: 301-310.
Rivera P., Melin M., Biagi T., Fall T., Haggstrom J., Lindblad-Toh K. and von Euler H. (2009) Mammary tumor development in dogs is associated with BRCA1 and BRCA2. Cancer Research. 69: 8770-8774.
Rodgers B.D. and Garikipati D.K. (2008) Clinical, agricultural, and evolutionary biology of myostatin: a comparative review. Endocrine Reviews. 29: 513-534.
Salmon Hillbertz N.H., Isaksson M., Karlsson E.K. et al. (2007) Duplication of FGF3, FGF4, FGF19 and ORAOV1 causes hair ridge and predisposition to dermoid sinus in Ridgeback dogs. Nature Genetics. 39: 1318-1320.
Sargan D.R. (2004) IDID: inherited diseases in dogs: web-based information for canine inherited disease genetics. Mammalian Genome. 15: 503-506.
Savolainen P., Zhang Y.P., Luo J., Lundeberg J. and Leitner T. (2002) Genetic evidence for an East Asian origin of domestic dogs. Science. 298: 1610-1613.
Selvarajah G.T., Kirpensteijn J., van Wolferen M.E., Rao N.A., Fieten H. and Mol J.A. (2009) Gene expression profiling of canine osteosarcoma reveals genes associated with short and long survival times. Molecular Cancer. 8: 72.
Spady T.C. and Ostrander E.A. (2008) Canine behavioral genetics: pointing out the phenotypes and herding up the genes. American Journal of Human Genetics. 82: 10-18.
Sutter N.B., Mosher D.S., Gray M.M. and Ostrander E.A. (2008) Morphometrics within dog breeds are highly reproducible and dispute Rensch's rule. Mammalian Genome. 19: 713-723.
Takeuchi Y., Kaneko F., Hashizume C., Masuda K., Ogata N., Maki T., Inoue-Murayama M., Hart B.L. and Mori Y. (2009) Association analysis between canine behavioural traits and genetic polymorphisms in the Shiba Inu breed. Animal Genetics. 40: 616-622.
Tamburini B.A., Trapp S., Phang T.L., Schappa J.T., Hunter L.E. and Modiano J.F. (2009) Gene expression profiles of sporadic canine hemangiosarcoma are uniquely associated with breed. PLoS One. 4: e5549.
Tang J., Le S., Sun L. et al. (2010) Copy number abnormalities in sporadic canine colorectal cancers. Genome Research. 20: 341-350.
Tewhey R., Warner J.B., Nakano M. et al. (2009) Microdroplet-based PCR enrichment for large-scale targeted sequencing. Nature Biotechnology. 27: 1025-1031.
Thomas R., Wang H.J., Tsai P.C., Langford C.F., Fosmire S.P., Jubala C.M., Getzy D.M., Cutter G.R., Modiano J.F. and Breen M. (2009) Influence of genetic background on tumor karyotypes: evidence for breed-associated cytogenetic aberrations in canine appendicular osteosarcoma. Chromosome Research. 17: 365-377.
Topal J., Gacsi M., Miklosi A., Viranyi Z., Kubinyi E. and Csanyi V. (2005) Attachment to humans: a comparative study on hand-reared wolves and differently socialized dog puppies. Animal Behaviour. 70: 1367-1375.
Verginelli F., Capelli C., Coia V., Musiani M., Falchetti M., Ottini L., Palmirotta R., Tagliacozzo A., De Grossi Mazzorin I. and Mariani-Costantini R. (2005) Mitochondrial DNA from prehistoric canids highlights relationships between dogs and South-East European wolves. Molecular Biology and Evolution. 22: 2541-2551.
Vila C., Savolainen P., Maldonado J.E., Amorim I.R., Rice J.E., Honeycutt R.L., Crandall K.A., Lundeberg J. and Wayne R.K. (1997) Multiple and ancient origins of the domestic dog. Science. 276: 1687-1689.
vonHoldt B.M., Pollinger J.P., Lohmueller K.E. et al. (2010) Genome-wide SNP and haplotype analyses reveal a rich history underlying dog domestication. Nature. 464: 898-902.
Wall J.D. and Pritchard J.K. (2003) Haplotype blocks and linkage disequilibrium in the human genome. Nature Reviews. Genetics. 4: 587-597.
Wayne R.K. and Ostrander E.A. (1999) Origin, genetic diversity, and genome structure of the domestic dog. BioEssays. 21: 247-257.
Wilbe M., Jokinen P., Truve K. et al. (2010) Genome-wide association mapping identifies multiple loci for a canine SLE-related disease complex. Nature Genetics. 42: 250-254.
Wong A.K., Ruhe A.L., Dumont B.L. et al. (2010) A comprehensive linkage map of the dog genome. Genetics. 184: 595-605.
Zangerl B., Goldstein O., Philp A.R. et al. (2006) Identical mutation in a novel retinal gene causes progressive rod-cone degeneration in dogs and retinitis pigmentosa in humans. Genomics. 88: 551-563.

1.2 Gene mapping approaches for animals

Jeremy R. Shearman and Alan N. Wilton

School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW 2052, Australia and Clive and Vera Ramaciotti Centre for Gene Function Analysis, University of New South Wales, Sydney, NSW 2052, Australia

Address for correspondence Alan Wilton, School of Biotechnology, University of NSW, Sydney NSW 2052, Australia E-mail: [email protected] Fax: +61 2 9385 1483

1.2.1 Summary

This review examines mapping approaches and technologies available in both model and non-model organisms. Mapping approaches covered include the functional candidate gene approach and how this can be combined with linkage analysis around a candidate gene. Whole genome mapping approaches using microsatellites and single nucleotide polymorphisms are reviewed and examples of microsatellite screening sets and SNP arrays are given. The technologies of next generation sequencing and sequence capture methods are described and compared. As high throughput sequencing technology advances and becomes cheaper, animal genetics will benefit greatly.

1.2.2 Gene mapping

Gene mapping is a method used to link the genotype of an organism to the phenotype. Disease states are a common phenotype studied in animals, and the mutation(s) resulting in those diseases can be identified by one of several methods discussed below. Information on the gene’s function can be extrapolated by identifying which biological processes are affected when the gene is disrupted.

Domestic animals such as the dog or cow can be manipulated through breeding studies with selection for morphological or behavioural traits. The dog is an exceptionally good model for studying these characters as its entire population history has been one of selection for a staggering range of phenotypes (Spady and Ostrander 2008). Mapping the genetic variants responsible for morphological and behavioural phenotypes allows identification of the gene responsible and provides information on gene function in that organism and also on the role of homologous genes in other organisms.

There are two main approaches to identification of a gene responsible for a phenotype: the Functional Candidate Gene (FCG) approach and a whole genome scan (Collins 1995). In addition, next generation sequencing technologies are capable of sequencing whole genomes from affected and control individuals to identify a single base change that is responsible for disease (Roach et al. 2010).

1.2.3 Functional candidate gene approach

The FCG approach involves identifying candidate genes based on information about the function of the gene. The molecular pathways or processes which may be involved in or affected by a disease process can reveal candidate genes. Once a candidate gene has been identified, sequencing of the exons of affected and control individuals can be used to search for differences associated with disease. The FCG approach is best suited to monogenic Mendelian diseases because this process works best for mutations of significant effect. FCG is heavily reliant on existing information, which is a severe bottleneck in this approach.

The FCG approach is a very useful method in model organisms, with or without a reference genome. Comparative genomics can be applied to use gene annotation in any organism to identify potential candidates (McGary et al. 2010). If the target species has a reference sequence available, then the process simply involves identifying the homologue to the candidate from a model organism and sequencing it. In species without a reference sequence, comparative mapping can be employed to first locate a putative homologous gene, which can then be sequenced in affecteds and controls to test sequence variants for association with disease.

Sequencing of gene exons is usually carried out first, as variants affecting function are more easily identified in coding regions. Sequencing only the exons from a candidate gene can, however, result in a false negative for a candidate if the mutation responsible for the disorder is non-coding, e.g. a promoter or splicing variant. This problem can be overcome by performing localised mapping and linkage analysis around the candidate gene prior to sequencing, which will reveal whether the disease gene lies in the candidate region. For model organisms with a reference genome this approach simply involves using the sequence to identify simple sequence repeats (microsatellites) in the region, which are large blocks of tandem repeats likely to be polymorphic, and genotyping them in affecteds and controls. Inherited diseases with a recent origin should show a strong association between microsatellite alleles and disease alleles through linkage disequilibrium if the loci are located close to each other. In many cases organisms that do not yet have a reference genome will have a selection of mapped microsatellites that can be used. If not, microsatellites around the candidate gene need to be identified, and comparison to sequence in closely related species can be helpful for this.

The FCG approach is useful as it can rapidly return a positive result at a low cost if the right candidate is chosen. This approach works best for diseases where phenotypes are well characterised in other organisms that have a mutation in a gene of known function, or where the phenotype represents a disruption to a well characterised pathway. If the phenotype being studied is not well defined or could be caused by mutations in a large number of genes, then the large number of possible candidates makes a whole genome analysis a more productive approach. A whole genome analysis will usually localise the mutation to a region of the genome in the range of 100 kb to several Mb (John et al. 2004), and the FCG approach can then be applied to a smaller set of genes to identify the causative mutation.

1.2.4 Whole genome analysis

A whole genome analysis uses a set of microsatellites or SNPs distributed across the entire genome to locate the genetic basis for a phenotype. Depending on the size of the genome being studied and the recombination rate, a Minimal Microsatellite Set (MMS) might consist of 200-800 microsatellites. To detect linkage, a microsatellite allele has to lie in linkage disequilibrium (physically linked) with the disease causing variant. Thus, older genetic variants, for which linkage disequilibrium has had more time to break down by recombination, may require more microsatellites to detect linkage. Microsatellite sets that can be used for whole genome analysis have been developed for many animals such as the dog (Sargan et al. 2007), horse (Mittmann et al. 2009; Dierks et al. 2007), pig (Rohrer et al. 1997; Gallardo et al. 2008), trout (Guyomard et al. 2006), macaque (Higashino et al. 2009), alpaca (Reed and Chaves 2008), chicken (Crooijmans et al. 1997) and crocodile (Miles et al. 2009).

If family data are available, microsatellite data can be analysed using linkage analysis, which returns a Logarithm of ODds (LOD) score for each locus with the trait of interest. If one allele at a locus is always inherited with the trait being studied then a high LOD score will result, representing physical linkage between that locus and the trait locus. The main advantage of microsatellites is the large number of alleles that each locus can have. Allele identification is usually based on sizing a PCR product, often performed using fluorescently-labelled primers and capillary or gel electrophoresis (Mansfield et al. 1996). Linkage analysis mapping with microsatellites typically uses multiple generations and large families segregating for a trait of interest. Large, multigenerational families can be created in domestic animals through breeding, which gives animal studies an advantage over studies in human populations.

The disadvantages of using microsatellites are the high mutation rates, which can result in non-Mendelian inheritance, and the requirement to perform multiple PCRs per sample if examining several loci. Microsatellites have reported mutation rates between 10^-6 and 10^-2 (Eckert and Hile 2009). The mutation rates are dependent on the number of repeat units and the length of the repeat. The number of repeat units correlates positively with the mutation rate (Eckert and Hile 2009). This means that new mutations can occur in large multigenerational pedigrees and may result in lower scores in a linkage analysis. In order to get data for several hundred microsatellites for a whole genome scan, 30 to 60 multiplex PCRs are required per sample, which can be laborious and time consuming.

For these reasons Single Nucleotide Polymorphisms (SNPs) have replaced microsatellites as the marker of choice for genome wide analysis. Oligonucleotides for detection of a large number of SNPs can be tiled onto a microarray and all SNPs genotyped in a single hybridisation. Multiple samples can be processed quickly for over one million SNPs, making SNP genotyping more efficient than microsatellite scans and at least as informative.
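The LOD score mentioned above can be made concrete for the simplest case. The sketch below assumes phase-known, fully informative meioses and a single marker, in which case LOD(theta) = log10[theta^r (1 - theta)^(n - r) / 0.5^n] for r recombinants among n meioses. Real pedigree analysis is handled by dedicated linkage software and is considerably more involved; the function names and the counts used here are hypothetical.

```python
import math


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """Two-point LOD score for phase-known meioses at recombination
    fraction theta, compared with free recombination (theta = 0.5)."""
    if not 0 <= recombinants <= meioses or meioses == 0:
        raise ValueError("need 0 <= recombinants <= meioses and meioses > 0")
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    non_recombinants = meioses - recombinants
    if theta == 0.0:
        # Any observed recombinant excludes complete linkage.
        if recombinants > 0:
            return float("-inf")
        return meioses * math.log10(2.0)
    return (recombinants * math.log10(theta)
            + non_recombinants * math.log10(1.0 - theta)
            - meioses * math.log10(0.5))


def max_lod(recombinants: int, meioses: int):
    """Return (theta_hat, LOD) at the maximum likelihood estimate of theta."""
    theta_hat = min(recombinants / meioses, 0.5)
    return theta_hat, lod_score(recombinants, meioses, theta_hat)


# Hypothetical family: a marker allele co-segregates with the trait in
# all 10 informative meioses.
print(max_lod(0, 10))

# Hypothetical family with one recombinant among 10 meioses.
print(max_lod(1, 10))
```

With no recombinants in ten meioses the score is 10 x log10(2), about 3.01, just above the value of 3 conventionally taken as evidence of linkage. A single recombinant, such as one caused by a new microsatellite mutation, lowers the maximum score to about 1.6.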
The high throughput processing of SNPs allows for hundreds to tens of thousands of samples to be genotyped in a single study. Because SNPs have only two alleles, their information content is much lower than that of microsatellites, so a study requires ten times as many SNPs as microsatellites for the same information content (John et al. 2004). The large capacity of arrays makes testing such a large number of loci possible. Linkage analysis can be performed using SNPs but microsatellites have more commonly been used. The large capacity of SNP arrays allows unparalleled coverage of the genome with marker loci, which has provided the power for a different type of genetic analysis from linkage studies, the association study. Genetic variants of small effect size can be more readily detected with an association study because of the increased density of loci tested compared to linkage analysis. In addition, association studies do not require complex pedigrees, which are often difficult to collect, and can therefore be more readily undertaken using resources available in the population. For these reasons the association study is becoming more common than linkage analysis for gene mapping. Association studies require two sample groups: cases and controls. The statistical analysis assesses the association between an allele and disease by looking for an increase in the allele proportion in cases compared to controls.

The first SNP arrays were developed for human studies, with mouse arrays following shortly after (Sapolsky et al. 1999; Lindblad-Toh et al. 2000). SNP arrays are now available for the dog (Karlsson et al. 2007 with Affymetrix; Illumina), sheep (International Sheep Genomics Consortium with Illumina), cow (Illumina; Affymetrix), horse (Illumina), rat (Illumina; Affymetrix) and pig (Illumina). On top of these mass manufactured arrays, both Illumina and Affymetrix offer custom arrays, Agilent is about to release its custom SNP array based on its unique array manufacturing process, and several lower throughput systems are available for custom SNP typing, e.g. Sequenom mass spectrometry and the Illumina Golden Gate assay.

Several considerations need to be made when choosing which mapping strategy to follow. Microsatellites are fairly cheap to type individually, but for whole genome analyses the numbers of microsatellites and the workload required make the experiment quite costly and time consuming. Each SNP array is relatively expensive, but the processing of all loci in parallel is rapid and the high SNP density allows for large numbers of loci to be genotyped without an increase in experimental workload. Considering experimental costs, processing time and marker density, SNP arrays have advantages over microsatellite genotyping.
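The case-control comparison of allele proportions described earlier in this section can be sketched for a single SNP as a Pearson chi-square test on a 2 × 2 table of allele counts. This is a minimal illustration, assuming an allelic test with one degree of freedom and no continuity correction; the function name and the allele counts are hypothetical and do not come from this study. In a genome wide scan the resulting p-values would also need correcting for the number of SNPs tested, which is not shown here.

```python
import math


def allelic_association(case_a: int, case_b: int,
                        control_a: int, control_b: int) -> tuple[float, float]:
    """Chi-square statistic (1 d.f.) and p-value for a 2x2 allele-count table.

    Each individual contributes two alleles, so n individuals give 2n counts.
    """
    cases = case_a + case_b
    controls = control_a + control_b
    allele_a = case_a + control_a
    allele_b = case_b + control_b
    if min(cases, controls, allele_a, allele_b) == 0:
        raise ValueError("every row and column total must be non-zero")

    total = cases + controls
    cross = case_a * control_b - case_b * control_a
    chi2 = total * cross ** 2 / (cases * controls * allele_a * allele_b)
    # For 1 d.f., P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: 12 cases (24 alleles) and 20 controls (40 alleles).
chi2, p_value = allelic_association(case_a=20, case_b=4,
                                    control_a=16, control_b=24)
print(f"chi-square = {chi2:.2f}, p = {p_value:.2g}")
```

An allele that is equally common in both groups gives a statistic of zero, and the statistic grows as the allele proportion in cases diverges from that in controls.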
Typically, microsatellites will be used in linkage analysis and SNPs for association studies, but the choice of analysis method depends on the samples chosen and the marker density required rather than on the genotyping method. In addition to linkage analysis and association studies, homozygosity analysis is a method that can be performed using either SNPs or microsatellites. Homozygosity analysis is based on identification of a haplotype, i.e. a block of alleles, that has been inherited identical-by-descent in a large proportion of affected individuals. It can be a powerful analysis method for recessive traits, as commonly found in animals where there are high levels of inbreeding.

A whole genome analysis can identify a large genomic region that contains the disease gene. The region can span a number of megabases of DNA containing up to hundreds of genes, which can make the disease gene difficult to identify (Karlsson et al. 2007). In this situation either the FCG approach can be employed to rank the genes, which are then sequenced in order of rank, or the mapped region can be sequenced in its entirety in an affected individual to identify the disease-causing variant. However, cloning and sequencing the region is time consuming and expensive, so enrichment of the region using sequence capture arrays followed by sequencing on a next generation sequencer has become a popular technique.
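The homozygosity analysis described above can be sketched as a scan for runs of consecutive markers at which every affected animal is homozygous for the same allele. This is a simplified illustration, not the procedure used in this thesis: the genotype coding ('AA', 'AB', 'BB'), the function name and the example data are assumptions, and missing calls, genotyping errors and comparison against controls are ignored.

```python
def shared_homozygous_runs(genotypes: dict[str, list[str]],
                           positions: list[int],
                           min_snps: int = 3) -> list[tuple[int, int, int]]:
    """Return (start_bp, end_bp, n_snps) for each run of at least min_snps
    consecutive markers where all animals carry the same homozygous call.

    genotypes maps an animal identifier to its calls in marker order.
    """
    for animal, calls in genotypes.items():
        if len(calls) != len(positions):
            raise ValueError(f"{animal}: one call per marker is required")

    runs = []
    start = None  # index of the first marker in the current run
    for i in range(len(positions)):
        calls_here = {calls[i] for calls in genotypes.values()}
        shared = len(calls_here) == 1 and calls_here <= {"AA", "BB"}
        if shared and start is None:
            start = i
        elif not shared and start is not None:
            if i - start >= min_snps:
                runs.append((positions[start], positions[i - 1], i - start))
            start = None
    if start is not None and len(positions) - start >= min_snps:
        runs.append((positions[start], positions[-1], len(positions) - start))
    return runs


# Hypothetical data: three affected animals typed at five markers.
affected = {
    "dog1": ["AB", "AA", "BB", "AA", "AB"],
    "dog2": ["AA", "AA", "BB", "AA", "BB"],
    "dog3": ["AA", "AA", "BB", "AA", "AA"],
}
print(shared_homozygous_runs(affected, [100, 200, 300, 400, 500]))
```

In practice a candidate region found this way would be checked against control animals, since a haplotype that is also homozygous in unaffected dogs is unlikely to carry a recessive disease variant.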

1.2.5 Sequence capture

With the development of next generation sequencing (see section 1.2.6), several sequence capture technologies have become available to address the issue of targeting only part(s) of a genome for sequencing. The basic principle of the sequence capture methods involves using oligonucleotide probes specific to target regions of the genome to enrich complementary sequences from a specific individual. The regions are captured by hybridising to overlapping 100 bp oligos that cover the target sequence, and non-target sequence is washed away. The enriched target sequence is then recovered and sequenced. The organism being studied must have a reference genome available, as the sequence of the target region(s) must be known to develop the probes.

There are three different approaches to target sequence enrichment currently available: array based, solution based and emulsion PCR based. Array based approaches currently offer the largest capture capacity, with arrays from Agilent and Nimblegen being able to capture up to 30 Mb. Array based approaches involve fragmenting the target genome, ligating a universal primer, hybridising fragments to the array, washing away unbound DNA, recovering the target DNA fragments and then amplifying them by PCR. Solution based approaches such as Agilent’s SureSelect use biotinylated RNA capture probes to hybridise with the fragmented genomic DNA. Streptavidin coated magnetic beads are used to immobilise the captured target sequence while the non-target DNA is washed away; the RNA probes are then degraded and the enriched fragments amplified by PCR. A third method is offered by Raindance. It uses emulsion PCR of water microdroplets in oil containing PCR components, fragmented genomic DNA and sets of primers to amplify the targeted region in a series of PCRs.
Hundreds of thousands of microdroplets, each with a single pair of primers, can be combined into a single PCR mix without any primer competition as each primer pair is contained in its own water droplet.

The choice of which technology to use should be based on the size of the target region and the subsequent processing steps. Array and solution based approaches have a tendency to produce chimeric fragments during the ligation steps, which appear as inversion or translocation sequences (Quail et al. 2008). Sequences can be identified as chimeric if there are many fragments captured at these regions. However, in cases of low capture the chimeric sequences would be limited in number and there would be no evidence to suggest the sequences were artefacts. The emulsion PCR based approach will not produce these chimeric sequences. However, the PCR will not amplify across large inversion or translocation breakpoints, which would show up as regions of no coverage in the sequence data requiring further investigation. As next generation sequencing becomes cheaper, sequence enrichment technologies may become redundant unless capture costs can be scaled down as well.

1.2.6 Next generation sequencing

Next generation sequencing (next gen sequencing) refers to the high throughput sequencing machines that have become commercially available in the past few years. These technologies represent a significant leap in sequence data acquisition. The process involves fragmenting a genome (or target DNA), isolating hundreds of thousands to millions of DNA fragments and sequencing them individually in parallel. In this way entire genomes can be sequenced and assembled without the need for cloning and Sanger sequencing. The human genome was first sequenced using cloning and Sanger sequencing; the first draft took 5 years to obtain and was completed 2 years later, at a cost of about $3 billion (International Human Genome Sequencing Consortium 2004). The genome of James Watson was sequenced in 2007 using next gen sequencing, which took approximately 2 months and cost less than $1 million (Wheeler et al. 2007). The low cost per base of next gen sequencing is key to the success of the technology and places next gen sequencing runs within the budget of most research groups. This low cost has led to the recent explosion of completed and draft genome sequences for many organisms.

There are currently three next generation sequencing platforms on the market: the Roche 454 pyrosequencer, the Illumina Genome Analyser II (GAII) and the Applied Biosystems SOLiD (see Figure 1-1 for workflow comparison). Each platform is being updated continuously with improvements in both software and sequencing chemistry, so reports of output limitations become outdated very quickly. Each platform is currently able to perform single read sequencing and paired-end sequencing, which involves sequencing both ends of a small fragment. Mate paired libraries can be constructed to obtain sequences a known distance apart, which can be useful in sequencing over interspersed repeat sequences.
For mate paired libraries, large fragments in the range of 3-15 kb are circularised and the ligated ends sequenced. These methods are very useful for sequence assembly and mean that a single sequencing platform can be used to completely sequence and assemble a whole mammalian genome.

The first platform to be released was the 454 pyrosequencer, currently able to produce up to 800 Mb of sequence data from reads up to 500 bp using titanium sequencing chemistry. The 454 supports mate pair sequencing of fragment sizes 3 kb, 8 kb and 20 kb. The DNA preparation involves fragmenting the genome then ligating an A adapter and a biotin labelled B adapter. The biotin label allows the DNA fragment to attach to a streptavidin coated bead and only fragments with an A and a B adapter will amplify. These fragments are then clonally amplified on the beads using emulsion PCR at a DNA:bead ratio optimised so that one bead contains clonal sequences from a single DNA fragment. These beads are deposited onto a PicoTitrePlate (PTP), which contains wells large enough to fit a single DNA bead plus packing beads and enzyme beads. Nucleotides are sequentially flowed across the PTP surface. Incorporation of one base releases one pyrophosphate, which fuels a luciferase reaction releasing one photon of light. The resulting light flashes are recorded and used to identify the sequence. Problems arise from large homopolymeric tracts, as the light intensity cannot be resolved, resulting in incorrect calls of the number of bases.

The next platform to be released was the Illumina Genome Analyser (GA). The Illumina GA can currently produce about 20 Gb of sequence data from reads up to 150 bp in length and supports paired-end sequencing of 200-500 bp and mate paired sequencing of 2-5 kb fragments. The Illumina GA has a DNA preparation process similar to the 454, consisting of DNA fragmentation with ligation of forward and reverse adapters.
Fragments are flowed across a glass slide with forward and reverse primers attached. Fragments anneal to the primers and are amplified utilising the nearby primer pair, which results in isolated islands of clonal sequences attached to the slide. Sequencing involves flowing reversible fluorescent terminators across the slide and recording images of fragment clusters; the fluorophore is then removed and the next nucleotide flowed. Because each base is a terminator, each fragment in a sequence cluster remains synchronised to the cluster. The shorter sequence lengths generated result in a higher computing load to assemble the reads and in repetitive sequences that cannot be spanned without the use of paired-end sequencing. Illumina have released the HiSeq 2000, a machine capable of 200 Gb of sequence output from a single run.

The third platform released was the Applied Biosystems SOLiD. The SOLiD is currently able to produce 60 Gb of sequence data from 50 bp reads and supports paired-end reads and mate paired reads from 600 bp up to 10 kb. Library preparation is based on emulsion PCR of DNA fragments attached to beads, similar to the library prep for 454, except the beads are much smaller, allowing for greater sequence output. Sequencing is based on ligation of fluorescently labelled 8-mer probes where the dye colour is assigned based on the first and second nucleotide at the 3′ end of the probe. The first cycle begins with a primer complementary to the adaptor sequence followed by sequential flows of the 8-mer probes. Detection of the dye colour is used to identify the 2 base combination and the three 5′ bases are cleaved for the next cycle, identifying every 5th and 6th base. After a series of ligation steps the extension product is removed and a primer complementary to the adapter, one base shorter at the 5′ end, is used for the next round of ligation steps. This process is repeated, reducing the primer length incrementally down to four bases shorter. The result is that each base of the DNA fragment is interrogated twice, meaning a higher sequence quality. The disadvantage of the SOLiD is the low read length, causing many of the reads to remain un-mappable unless performed as paired-end sequencing.

Next generation sequencing is a technique that can supplement or replace mapping with microsatellites or SNPs. Using sequence capture methods allows an entire mapped region to be sequenced in healthy and affected animals. One could even replace the need to first map the location of a mutation by sequencing the entire exome (all exon sequences of a genome).
As throughput increases mapping may be replaced by simply sequencing an entire genome.
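The two-base encoding used by the SOLiD, in which each dye colour reports a pair of adjacent nucleotides, can be illustrated with a short Python sketch. The colour assignment used here is an assumption not stated in the text above: it follows the commonly described scheme in which four colours each cover four dinucleotides, which is equivalent to an exclusive-or of two-bit base codes. The colour labels 0-3 and the function names are arbitrary.

```python
# Two-bit codes chosen so that colour = code(first base) XOR code(second base).
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def encode_colour_space(sequence: str) -> list[int]:
    """One colour per adjacent pair of bases, so n bases give n - 1 colours."""
    return [BASE_TO_CODE[first] ^ BASE_TO_CODE[second]
            for first, second in zip(sequence, sequence[1:])]


def decode_colour_space(first_base: str, colours: list[int]) -> str:
    """Rebuild the bases from a known first base (e.g. the last adapter base)."""
    bases = [first_base]
    for colour in colours:
        bases.append(CODE_TO_BASE[BASE_TO_CODE[bases[-1]] ^ colour])
    return "".join(bases)


reference = encode_colour_space("ATGGC")  # [3, 1, 0, 3]
variant = encode_colour_space("ATCGC")    # [3, 2, 3, 3]
changed = [i for i, (r, v) in enumerate(zip(reference, variant)) if r != v]
print(reference, variant, changed)        # colours 1 and 2 differ
print(decode_colour_space("A", reference))
```

Because each internal base contributes to two adjacent colours, a genuine single base difference changes two neighbouring colours, whereas a single miscalled colour does not fit that pattern. This is the sense in which interrogating each base twice supports higher sequence quality.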

Figure 1-1 Comparison of second generation sequencing protocols showing basic workflow and sequencing procedure

Next generation sequencers do have inherent biases. The biggest biases are the need for PCR based amplification and the need for multiple copies of a DNA fragment to contribute to the sequence signal. Amplification of the genome to be sequenced can introduce mutations, skew allelic ratios and bias against very high or very low GC/AT content sequences (Acinas et al. 2005). The adapter ligation step can produce chimeric sequences resulting in assembly errors for de novo applications and false identification of inversions or translocations for resequencing applications. When sequence capture and next generation sequencing are combined these chimeric sequences can become quite common (Shearman, unpublished data). Avoiding these problems would require sequencing of single DNA molecules without modification or PCR. Helicos BioSciences was the first to release such technology capable of sequencing single DNA fragments without the need for PCR amplification.

1.2.7 Third generation sequencing

Third generation sequencing is high throughput sequencing of single DNA molecules. The first of these platforms to be released was the Helicos Heliscope, capable of delivering 21-28 Gb of sequence from 25-55 bp reads. The preparation involves fragmenting genomic DNA and adding a poly A tail terminating with a fluorescently labelled A. These fragments are then flowed across a glass slide coated in poly T primers to which individual DNA fragments anneal. The positions of labelled fragments are recorded and the fluorescent label removed. Fluorescently labelled bases are then sequentially washed over the slide and an image taken with a microscopic camera; the fluorescent label is then removed to allow for the next base to be added. The chemistry does not use terminators but rather relies on steric hindrance caused by the fluorophore to ensure only one base is added at a time. In this way single DNA molecules are sequenced without the biases introduced by ligation or PCR; however, the enzyme used to add the poly A tail is not well characterised and may introduce unknown biases (Yehudai-Resheff and Schuster 2000). The Heliscope has problems of base skipping, resulting in apparent deletions, and currently has no paired end read method.

One of the most promising approaches to high throughput sequencing is a technology being developed by Oxford Nanopore (Clarke et al. 2009). The principle involves using an exonuclease to cleave off bases from a single stranded DNA fragment, which are directed through a protein nanopore. The identity of each base passing through can be determined by passing a current across the nanopore and recording the decrease in current from each base. The nanopore also interacts with each nucleotide passing through it, causing different bases to dwell within the nanopore for different lengths of time and allowing for a dual check system per nucleotide.
One of the most exciting benefits of the technology is the ability to detect methylated cytosine directly based on current and dwell time. The proof of concept phase is complete and the results look very promising (Clarke et al. 2009). The true test for the technology will be how well it scales up to produce a sequencing platform. The technology has many benefits over current sequencing platforms. Data collection is a series of electrical currents, so the data generated will not require as much storage space, making long term data storage much cheaper and enabling storage of raw data. Raw data from next gen sequencers are a series of images that are too large to store long term; they are analysed to obtain the sequence information and then usually deleted. The ability to simultaneously perform two measurements per base will result in very high accuracies, and read lengths should be much larger than any current method (Clarke et al. 2009).

The reduced costs and high throughput of third generation sequencers have many implications for gene mapping. Sequencing an entire genome may be possible at a low cost. The potential for whole genome sequence data to replace SNP array data was hotly debated at the 30th Annual Lorne Genome Conference (15th – 18th Feb 2009). A strong argument for SNP arrays relied on the potential to pool up to thousands of samples onto a single array. This allows a case-control experiment of thousands of individuals to be performed with only 2 SNP arrays. However, most experiments do not pool samples for processing on arrays, so this application may not be enough to keep the technology alive. The strongest argument in favour of whole genome sequence data was the ability to extract both mapping and potentially causative sequence variants from the same data set. This approach would be best suited to smaller sample sizes and does not require a reference genome or the availability of SNP arrays for the organism being studied.

1.2.8 Conclusion

Many different mapping approaches are available for genetic studies in animals. Currently the most popular method is SNP array based genotyping, but microsatellites work very well for animals where this technology has not yet been developed. The FCG approach is unavoidable in most circumstances and will continue to play a role in gene mapping until the cliché $1000 genome is finally realised. When an entire genome can be sequenced at costs lower than or equal to the current cost of mapping there will be very little need for SNP arrays or sequence capture technologies. Mapping the gene(s) responsible for a phenotype could be performed using whole genome sequences rather than sets of SNPs, and causative mutations or changes could be identified from the same data. However, the computing power required for such analyses exceeds what is currently available, and the limiting factors will be the computing power, statistical approach and capacity for bioinformatic analysis. One way of getting around this problem is not to use all of the data, but to simply extract SNP information from the sequence data and perform mapping analyses using this. In any case, smaller research groups will no longer have to wait for molecular tools to be developed for their organism of interest.

1.2.9 References

Acinas S.G., Sarma-Rupavtarm R., Klepac-Ceraj V. and Polz M.F. (2005) PCR-induced sequence artifacts and bias: insights from comparison of two 16S rRNA clone libraries constructed from the same sample. Applied and Environmental Microbiology. 71: 8966-8969.
Clarke J., Wu H.C., Jayasinghe L., Patel A., Reid S. and Bayley H. (2009) Continuous base identification for single-molecule nanopore DNA sequencing. Nature Nanotechnology. 4: 265-270.
Collins F.S. (1995) Positional cloning moves from perditional to traditional. Nature Genetics. 9: 347-350.
Crooijmans R.P., Dijkhof R.J., van der Poel J.J. and Groenen M.A. (1997) New microsatellite markers in chicken optimized for automated fluorescent genotyping. Animal Genetics. 28: 427-437.
Dierks C., Lohring K., Lampe V., Wittwer C., Drogemuller C. and Distl O. (2007) Genome-wide search for markers associated with osteochondrosis in Hanoverian warmblood horses. Mammalian Genome. 18: 739-747.
Eckert K.A. and Hile S.E. (2009) Every microsatellite is different: Intrinsic DNA features dictate mutagenesis of common microsatellites present in the human genome. Molecular Carcinogenesis. 48: 379-388.
Gallardo D., Pena R.N., Amills M. et al. (2008) Mapping of quantitative trait loci for cholesterol, LDL, HDL, and triglyceride serum concentrations in pigs. Physiological Genomics. 35: 199-209.
Guyomard R., Mauger S., Tabet-Canale K., Martineau S., Genet C., Krieg F. and Quillet E. (2006) A type I and type II microsatellite linkage map of rainbow trout (Oncorhynchus mykiss) with presumptive coverage of all chromosome arms. BMC Genomics. 7: 302.
Higashino A., Osada N., Suto Y., Hirata M., Kameoka Y., Takahashi I. and Terao K. (2009) Development of an integrative database with 499 novel microsatellite markers for Macaca fascicularis. BMC Genomics. 10: 24.
International Human Genome Sequencing Consortium. (2004) Finishing the euchromatic sequence of the human genome. Nature. 431: 931-945.
John S., Shephard N., Liu G. et al. (2004) Whole-genome scan, in a complex disease, using 11,245 single-nucleotide polymorphisms: comparison with microsatellites. American Journal of Human Genetics. 75: 54-64.
Karlsson E.K., Baranowska I., Wade C.M. et al. (2007) Efficient mapping of mendelian traits in dogs through genome-wide association. Nature Genetics. 39: 1321-1328.
Lindblad-Toh K., Winchester E., Daly M.J. et al. (2000) Large-scale discovery and genotyping of single-nucleotide polymorphisms in the mouse. Nature Genetics. 24: 381-386.
Mansfield E.S., Vainer M., Enad S., Barker D.L., Harris D., Rappaport E. and Fortina P. (1996) Sensitivity, reproducibility, and accuracy in short tandem repeat genotyping using capillary array electrophoresis. Genome Research. 6: 893-903.
McGary K.L., Park T.J., Woods J.O., Cha H.J., Wallingford J.B. and Marcotte E.M. (2010) Systematic discovery of nonobvious human disease models through orthologous phenotypes. Proceedings of the National Academy of Sciences of the United States of America. 107: 6544-6549.
Miles L.G., Isberg S.R., Glenn T.C., Lance S.L., Dalzell P., Thomson P.C. and Moran C. (2009) A genetic linkage map for the saltwater crocodile (Crocodylus porosus). BMC Genomics. 10: 339.
Mittmann E.H., Lampe V., Momke S., Zeitz A. and Distl O. (2010) Characterization of a Minimal Microsatellite Set for Whole Genome Scans Informative in Warmblood and Coldblood Horse Breeds. Journal of Heredity. 101: 246-250.
Quail M.A., Kozarewa I., Smith F., Scally A., Stephens P.J., Durbin R., Swerdlow H. and Turner D.J. (2008) Nature Methods. 5: 1004-1010.
Reed K.M. and Chaves L.D. (2008) Simple sequence repeats for genetic studies of alpaca. Animal Biotechnology. 19: 243-309.
Roach J.C., Glusman G., Smit A.F. et al. (2010) Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science. 328: 636-639.
Rohrer G.A., Vogeli P., Stranzinger G., Alexander L.J. and Beattie C.W. (1997) Mapping 28 erythrocyte antigen, plasma protein and enzyme polymorphisms using an efficient genomic scan of the porcine genome. Animal Genetics. 28: 323-330.
Sapolsky R.J., Hsie L., Berno A., Ghandour G., Mittmann M. and Fan J.B. (1999) High-throughput polymorphism screening and genotyping with high-density oligonucleotide arrays. Genetic Analysis: Biomolecular Engineering. 14: 187-192.
Sargan D.R., Aguirre-Hernandez J., Galibert F. and Ostrander E.A. (2007) An extended microsatellite set for linkage mapping in the domestic dog. Journal of Heredity. 98: 221-231.
Spady T.C. and Ostrander E.A. (2008) Canine behavioral genetics: pointing out the phenotypes and herding up the genes. American Journal of Human Genetics. 82: 10-18.
Wheeler D.A., Srinivasan M., Egholm M. et al. (2007) The complete genome of an individual by massively parallel DNA sequencing. Nature. 452: 872-876.
Yehudai-Resheff S. and Schuster G. (2000) Characterization of the E.coli poly(A) polymerase: nucleotide specificity, RNA-binding affinities and RNA structure dependence. Nucleic Acids Research. 28: 1139-1144.

2 A CANINE MODEL OF COHEN SYNDROME: TRAPPED NEUTROPHIL SYNDROME

[2011, BMC Genomics. doi:10.1186/1471-2164-12-258]

Jeremy R. Shearman1,2, and Alan N. Wilton1,3

1School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW 2052, Australia
2National Center for Genetic Engineering and Biotechnology, 113 Phahonyothin Rd., Klong 1, Klong Luang, Pathumthani 12120, Thailand
3Clive and Vera Ramaciotti Centre for Gene Function Analysis, University of New South Wales, Sydney, NSW 2052, Australia

Address for correspondence: Alan Wilton, School of Biotechnology, University of NSW, Sydney NSW 2052, Australia
E-mail: [email protected]
Fax: +61 2 9385 1483

Running Title: TNS is Cohen syndrome in dogs

Key words: Trapped Neutrophil Syndrome, Cohen Syndrome, neutropenia, vesicle transport, expression, vesicle protein sorting 13 B, VPS13B, dog

2.1 Abstract

Trapped Neutrophil Syndrome (TNS) is a common autosomal recessive neutropenia in Border collie dogs. A candidate gene approach and linkage analysis were used to show that the causative gene is VPS13B. VPS13B was chosen as a candidate due to similarities in clinical signs between TNS and Cohen Syndrome such as neutropenia and a typical facial dysmorphism. Linkage analysis using microsatellites close to VPS13B showed positive linkage of the region to TNS. Each of the 63 exons of VPS13B was sequenced in affected and control dogs. The causative mutation in Border collies is a four bp deletion in exon 19 of the largest transcript and results in premature truncation of the protein. Cohen syndrome patients present with mental retardation in 99% of cases, but learning disabilities featured in less than half of TNS affected dogs. It has been implied that loss of the alternate transcript of VPS13B in the human brain utilising an alternate exon, 28, may cause mental retardation. Mice cannot be used to test this hypothesis as they do not express the alternate exon. Dogs were found to express alternate transcripts in the brain utilising an alternate exon homologous to human exon 28. This would allow dogs to be used as a model organism to explore the function of the alternately spliced transcript in the brain. TNS in Border collies is the first animal model for Cohen syndrome and can be used to study the disease aetiology.

2.2 Introduction

Dogs consist of over 400 genetically isolated breeds with considerable morphological and behavioural diversity. Dog breeds have undergone two major bottlenecks, the first when they were domesticated from the wolf ~15,000 years ago (vonHoldt et al. 2010, Pang et al. 2009) and the second in the last few hundred years during development of the modern breeds from a low number of individuals selected for certain physical or behavioural traits (vonHoldt et al. 2010, Parker et al. 2004). This unique population structure results in limited genetic variation within each breed and many inherited diseases. Inherited diseases in dogs are predominantly recessive often resulting from strong inbreeding, usually showing allelic homogeneity within a breed or group of related breeds and allelic heterogeneity between less related breeds (Ostrander and Wayne 2005). An autosomal recessive neutropenia, known as Trapped Neutrophil Syndrome (TNS), has been identified in the Border collie breed. The disease was originally described in the Australian and New Zealand population of Border collies and is characterised by a deficiency of segmented neutrophils in the blood and hyperplasia of myeloid cells in the bone marrow (Allan et al. 1996). Severely affected pups show abnormal craniofacial development with a narrow, ferret-like face. Affected pups are generally smaller than their litter mates and suffer from chronic infections and failure to thrive resulting from a compromised immune system. Some show early infections from six weeks of age. For others the first sign of TNS is a bad reaction to immunisation at 12 weeks, while in a few cases clinical signs are very mild and not recognised until two or more years of age. All known TNS affected dogs at the start of this research could be traced back to a single common ancestor (Shearman and Wilton 2007 - see appendices pages 135-148).

This suggested that the disease was the result of a single mutational event that had been spread through the population by the champion sire effect. Show champion male dogs are often used heavily for breeding to pass on show winning traits. Any detrimental mutations that a show winning male may be carrying can rapidly spread through a population. When coupled with inbreeding, recessive diseases can manifest within a few generations after the champion is used. All affected dogs are then homozygous and identical-by-descent for the genomic region surrounding the mutation. Allelic heterogeneity was considered unlikely for TNS given the recent common ancestor.

A promising approach to identify disease genes is the candidate gene approach coupled with linkage analysis using microsatellites from the candidate gene region. This approach involves selecting candidate genes based on gene function or similarities of clinical signs when the homologous gene is disrupted in a model organism. Instead of extensive sequencing of candidate genes, applying linkage analysis using markers in the candidate gene region can confirm whether the region is the correct one even if the mutation is non-coding. Candidate genes in this study were chosen based on their role in pathways known to be associated with neutrophil maturation or known to cause neutropenia in humans. Ten genes were investigated as candidates for TNS; four of these are known to cause a disease featuring neutropenia and six are linked to neutrophil function. Six of the genes from this list have previously been excluded. These include chemokine (C-X-C motif) receptor 4 (CXCR4) responsible for WHIM syndrome (Warts, Hypogammaglobulinaemia, Immunodeficiency and Myelokathexis; MIM:193670) (Shearman et al. 2006 - see appendices pages 135-141), neutrophil elastase (ELANE) which is responsible for severe congenital neutropenia (MIM:202700) and cyclic neutropenia (MIM:162800), adaptor-related protein complex 3, beta 1 subunit (AP3B1) which is responsible for Hermansky-Pudlak Syndrome type 2 (MIM:608233) and adaptor-related protein complex subunits AP3D1, AP3S1 and AP3M1 (Shearman and Wilton 2007 - see appendices pages 141-148). Other candidates were the ligand of CXCR4, chemokine (C-X-C motif) ligand 12 (CXCL12), which was investigated because it may have the potential to cause similar clinical signs to WHIM syndrome when disrupted. Another is an anti-apoptotic factor, BCL2-like 1 (BCL2L1), which has been shown to be downregulated in cases of neutropenia (Aprikyan et al. 2000). Interleukin 8 receptor B (CXCR2) was investigated as it is able to attract and activate neutrophils from the blood to sites of inflammation (Holmes et al. 1991). Cohen syndrome (Cohen et al. 1973, Kolehmainen et al. 2003; MIM:216550) was considered as a model for TNS due to presence of the clinical signs of neutropenia with hyperplasia of the bone marrow enriched for the myeloid cell lines (Kivitie-Kallio et al. 1997) and a typical craniofacial dysmorphism. Cohen syndrome is caused by mutations in Vesicle Protein Sorting 13B (VPS13B). Linkage analysis followed by sequencing was used to investigate whether a mutation in one of these genes is responsible for TNS in Border collies.

2.3 Materials and Methods

DNA from Border collie blood samples was extracted following a standard salting out method (Miller et al. 1988). See appendix publications for sample descriptions and pedigree (pages 135-148). Primers were designed for microsatellite amplification and VPS13B sequencing (Table 2-1) based on the canine genomic sequence (build 2.1) using primer3 (http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi). Primers were developed for four new microsatellites from the VPS13B region and one microsatellite for BCL2L1 using the reference canine sequence (build 2.1) (Table 2-2). Microsatellites with a minimum of 20 repeat units of two to four bp in the reference sequence were chosen and primers were designed to amplify them. Microsatellites were named based on chromosome number and location in units of 10,000 bp. C13.0390 (CFA13 at 390 x 10^4 bp) is approximately 197 kb before the transcription start site of VPS13B, C13.0423 is in intron 16, C13.0449 is in intron 24 and C13.0478 is in intron 43. Additional microsatellites were used from dbSTS (NCBI) (Table 2-2). The universal priming method was used (Oetting et al. 1995; Neilan et al. 1997) to label PCR products, which were sized on an ABI3730 DNA analyser (Applied Biosystems) and analysed using ABI Genemapper software (Applied Biosystems). Linkage analysis was performed using Superlink Online (Silberstein et al. 2006). Sequencing primers were designed in intron sequences to capture each entire exon including the intron-exon boundaries (Table 2-1). Sequencing was performed using ABI BigDye chemistry V3.1 (Applied Biosystems) and run on an ABI3730 DNA analyser at the Ramaciotti Centre for Gene Function Analysis. All exon sequences were aligned to the reference dog genome using Seqscape (Applied Biosystems) to compare carriers and affecteds to control Border collie sequences.
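The microsatellite naming convention described above (chromosome number followed by the position in units of 10,000 bp) can be illustrated with a short Python sketch. The function names `marker_name` and `parse_marker_name` are hypothetical and are not part of the original methods, and the four-digit zero padding is an assumption inferred from the marker names listed above.

```python
def marker_name(chromosome: int, position_bp: int) -> str:
    """Build a name such as 'C13.0390' from a chromosome number and a bp position."""
    units = position_bp // 10_000  # position expressed in units of 10,000 bp
    return f"C{chromosome}.{units:04d}"  # four-digit padding is an assumption


def parse_marker_name(name: str) -> tuple[int, int]:
    """Return (chromosome, approximate position in bp) for a marker name."""
    chrom_part, units_part = name[1:].split(".")
    return int(chrom_part), int(units_part) * 10_000


print(marker_name(13, 3_900_000))     # C13.0390
print(parse_marker_name("C13.0478"))  # (13, 4780000)
```

The parsed position is only accurate to the nearest 10,000 bp, which is all the name encodes.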

Primers were designed flanking the deletion (Table 2-1) to develop a carrier test based on sizing the PCR product on an ABI 3730 DNA Analyzer to identify presence of the deletion.

Table 2-1: Primer names and sequences for VPS13B sequencing primers

Sequencing primers, VPS13B
Exon        Forward primer  Reverse primer
Exon 1      AGTTGCAGCTGGAGTAGACACAG  AACAAACCATCACTAGTAACGAAGC
Exon 2      TGTGTGAACCTTTGTTATGTACCAC  AATAGGCATGTGAGAAACAGTAAGTG
Exon 3      TTTTAAACGCTGTTTGAGAATGC  GTAAGTTAAAATCCTGCTGAAGAGC
Exon 4      CCAGTTTTTGTAAGCCTGTCTACC  TACTACACTTGAAACAGGCATGTGG
Exon 5      ATAAATGAGAGATTCGTTGCATAGG  ACATCTGGACAAACTTAAAAGGAAG
Exon 6      AGTGAACTGTTTTGGCTTATATTGG  AAAAATCAAAGTAGTGTCAAACAACC
Exon 7      CCAACAAACTTGAAAGTAGTAATGC  AACAAACTTCAAATCAGATCCTTCC
Exon 8-9    TGGTATTTCAGTTCTCTCTTTGAGC  TCCCTGTATTCTATCCATAAAGAGG
Exon 10     TTTTCTTTAACAGTTTAATGATTTTTCC  TCTTCAAACTTCTCTAGATGTAGAACC
Exon 11     TAGAATTACGGATGTCCTTTGG  AGAACCAACTGATTCTGATGAGG
Exon 12     GAGAATTGGTTTGAAGGCATTAAG  AACTAGCAAGAGACTCAAAATTAGGC
Exon 13     AAAGTATAATTTGACCTTGAAATTTGG  CACAGTTACATTGCTTTAACACACC
Exon 14     TTCTAGATTGATTTGATGACATTGC  TTACATATCATTTAAATCTCATCTTATGG
Exon 15     TAGTAGGAAGCAATGATTTTTCAGG  ATACCCACCAACACTTAACACTACC
Exon 16     TCTTTCTGAAGATTGTCTAGATTGAAAG  TTACCTCGCATATCCTGAATTTTAC
Exon 17     GGTGATTCTAACAAAAACTTAATAAGAGG  ATGTCACTACCCTCTTTCCTAGTCG
Exon 18     ATTAAGTTTAAATGCCTTGGTGAAG  CAGTAACGCTCCTTGTAAAATGC
Exon 19     CTTACTTTGTTAAGAATCAAAAGTGC  AGTTTCATAGAGCACAACTGTAGGG
Exon 20     TATGAATACTGCCTTGAAAAGTTGG  TTAAACTAGTCTCTCCAGTGTTTCG
Exon 21     CCAAGGGAGCTACATATTGATTG  TCATGTCTCTTTAGCTGGCATTC
Exon 22     CTGACAAGAGTAACCAACACATTCC  ATAAAGTGCACCAAGAATTTTCACC
Exon 23     GTTACTCATAAAGGCATGAATATGG  ACATCAATATGATTAAAAGGTGTGG
Exon 24     GCTGCATTAAAAATGGTTTTACCTC  CTTATCTAACCACGCTGATCTTGAC
Exon 25     TTAATTGTGATTCGAAGTTGAAAGG  AGAAACATTAATAAACAGCCACTGC
Exon 26-27  TCGTTAGCATTTATGATATTGAAAAG  GACTGTGTTCAAAGAAAGTATAGCTC
Exon 28-29  TCAGTGTAGTGTTGTTTACATAATAACC  TCATTGACAAAGTAGATGATACAGAGG
Exon 30     TCAAGGGATGTAGACATAATAAAAGG  ACACCAAAGTGCAAACTACAACC
Exon 31     TTTAATTCATGTGTTTTAATGTTGG  CTTCTTTACAGGATGAAATTGTTGC
Exon 32     TCCAATGTGGTTACTTCCTAATACC  AAAAGTGGTACCCAAACACTTAGG
Exon 33     CAAATATAACTGTCATACATGTTTGTGG  TAATAAGCAATTTGACATTCTGTGC
Exon 34     ATTAAACTATGTTGAGTGGTTGTGC  TAATAGAGGCTTTTCTGAGGGTAGG
Exon 35     ACCATTTTAACCTCTTTATTACATCC  TTGATCGATCTTGATACAATGG
Exon 36     TGAGTACATACAAGAACATGCAAGG  ACTATTTATTTTTGGCAAGGAATGC
Exon 37     GCTTACATTCAATGACAGAGCTTC  TGGATACAAACACTTTAGTTTACCG
Exon 38     CAACTGAAAAAGACTCTGACAAAGG  TCAATCTGATAATACTCTTTCTCACC
Exon 39-40  TGCATTAATTCTAGTGTGATTGTGG  TTTCCTAGTATATAATCAGCACTTTTCC
Exon 41     AATATGCATTTCAACAAATACAGAAC  ATCTGTAGATCACACGGATGCAG
Exon 42     CATAAGTTGTGTGATATGTGGATGG  TCACAGCACTTTGAAACATCAATAC
Exon 43     GTAATGCCTTCAAGAGAGTGTTAGG  AGAGTGCCTAGTAAAGATCACATGC
Exon 44     AAACATCTTGCAATTAATCTGATGC  GAATAAATGAATCCCCAACCTTAAC
Exon 45     TGACATATTTGCTATTCTTAGCTTGG  GCAAGGTAGTAGTAAGCGTACATAGG
Exon 46     CAGGCACACAACCAGTATTTCC  AAACTACTGGGACTGATACAACAGC
Exon 47-48  TCTATCAACCTCTCCTCACACAAAC  AACGGTACAAGGTTGCTTTTCTAC
Exon 49     GTCCCTAAATCAACAAAAATTAGCC  ACACCTCTTCACACTAGAATCATCC
Exon 50     ATTGGATGATTCTAGTGTGAAGAGG  GAATATCAAAGAACATTCCATACCC
Exon 51     TGTACCTAAGGTTGCATATTATTCTTTAC  TTGTCTAGAGCATGTAATACTAAAAAGG
Exon 52     GAACACAAGTTAAAATTGGTACTATCC  GATTTCTGCTCATTAACTACCTTGG
Exon 53     AGGATACGTGGTTATGGTTTCTTAG  CACTGCAGCTACTCTTTTGTCTCTC
Exon 54-55  CAAATCTAAATAAACAGGTACCAAGC  TATAAAAATGGATTCACAGGTGTCC
Exon 56     ATAGAACTAGCTGTTGAGCCAGAAC  GATGCTAATATGTAATTCGAAGATGC
Exon 57     AGGGTGAGACTGATGTTCTTCC  AGGTCTTGAGGGCGATAAGG
Exon 58     GTCAGAGTAAAGCTATTTGCTAGGC  GGAATAACAGTTGAGCCAATCC
Exon 59     ATTGTCTTAAGGCCCTTGCTTC  GCTCACTGTGGAGCACGAAC
Exon 60     TGATGAAGATCATAAGTGTCTTAGTAACC  TAAGAATGCAAGTTAGTCAGTGAGG
Exon 61     TGTAAACAGATGACATCACTACAGC  AAGGTCACCTCTGTTAGAATGTGG
Exon 62     TGGTAAAGTGAGTGCAAATACACG  ATGACAGGCTCCAATACACACC
Exon 63     TACTCAGATGTATCAAGGCAAAAGC  TGTGTTAGAGCAGCGAGTTAACAG

Splice primers, VPS13B cDNA
Exon 25-28  GATCCAGTGGACGTATTAGTCTGTG  GTAAACTGCGTGAAGTCAATGTATG

Table 2-2: Microsatellite primers used for linkage analysis of TNS candidate genes

Gene    Microsatellite name  Forward sequence  Reverse sequence  Repeat type
VPS13B  C13.0390    TGGAGCTCTGAAGCTAGTAAGGA  TCCAAAGCTGGAAACAGAGAA  AAAGG x 21
        C13.0423    GACACTGGCTTTTCTGTTGTTC  GGCTAATGAAACATTTGCTTTGT  TTTC x 39
        C13.0449    ATCATTTTGGTGGCATCCAT  TGAAATATTGCTTTGTTCAAAACTT  TTTC x 28 - CT x 13
        C13.0478    TGGGTGTTTTCAGGTTGAAG  TGCTTCTTGTCTGTCATGGAA  TA x 30
BCL2L1  FH3616*     GTGGGGAGTCTGTTTGTCC  CAACGGAGTCAGGAGTATAACC  AAAG x 19
        FH2495*     ATTTCATATGTGAGGCTGAGATTG  CAGTGGGAGAAAGATGCCAT  AAAG x 15
        REN127K17*  TATTTCTGCTATTTGTGTC  ATTCCATGTAGATTGTCA  TG x 11
CXCL12  FH2759*     AGTACTTGAGGCTTGGAGTCAG  CAAGCTGAGAGCCATGTAGG  GT x 23
        C28.176*    TCAGGCTCTCCAGAGACACA  TTGCCAATAGCAGACTGTGG  CA x 13
        REN205C12*  TGTTCTCATCCCATGCTTCC  TAGCTGCCTGATGGGGTAAC  GT x 12
CXCR2   REN41P15*   GTTCCGGTTATTGCGTGTG  CTGGGTCGGAGGAAGATG  AC x 10

* Microsatellites from dbSTS (NCBI)

The clinical sign analysis was a yes/no questionnaire (Section 2.8), designed for owners of affected dogs and based on clinical signs observed in humans, in which a short description was required for each yes answer. To detect the presence of the alternately spliced transcripts in dog, primers were designed in exons 25 and 28 (Table 2-1) to produce a product incorporating the alternately spliced exon 27 (XM_539102) or 27b (XM_850840). The expected PCR product sizes from transcripts XM_539102 and XM_850840 using these primers were 684 bp and 609 bp, respectively, based on the reference genome sequence. Cortex and cerebellum cDNA was prepared from three healthy dogs. The cortex tissue was removed first and cerebellum tissue approximately ten minutes later from dogs immediately after euthanasia, and the samples were placed into liquid nitrogen. RNA was extracted using the RNeasy Lipid Tissue Mini Kit from Qiagen and cDNA was prepared using poly-T primers and PCR. PCR was performed on the cDNA, the PCR product was sized on a 1% agarose gel and the PCR mix was sequenced using the above sequencing method.

2.4 Results

Ten candidate genes were examined for linkage to TNS (Table 2-3). Eight were eliminated based on maximum logarithm of odds (LOD) scores less than zero, and one showed no significant linkage, in multipoint linkage analysis of an eight-generation pedigree with seven to 12 affecteds depending on the analysis (Shearman et al. 2006; Shearman and Wilton 2007; Table 2-3). The microsatellites from the VPS13B region were highly polymorphic with between eight and 14 alleles in a sample of 72 Border collies (Table 2-4). Twenty-three haplotypes were observed involving alleles at these loci and there was a large amount of linkage disequilibrium between alleles (Table 2-5). Multipoint linkage analysis for the TNS pedigree with multiple inbreeding loops (Shearman and Wilton 2007) gave significant evidence of linkage with a maximum LOD score of 8.86 at marker C13.0390 (Figure 2-1).
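For readers unfamiliar with LOD scores, the following Python sketch shows the two-point calculation for the simplest case of phase-known, fully informative meioses. It is an illustration only: the analysis reported here used multipoint linkage in Superlink Online on a pedigree with inbreeding loops, which this toy function does not reproduce. The function name `two_point_lod` and the meiosis counts in the example are invented for illustration.

```python
import math


def two_point_lod(recombinants: int, meioses: int, theta: float) -> float:
    """LOD = log10( L(theta) / L(0.5) ) for phase-known, informative meioses."""
    non_recombinants = meioses - recombinants
    likelihood = (theta ** recombinants) * ((1.0 - theta) ** non_recombinants)
    if likelihood == 0.0:
        return float("-inf")  # the observed data are impossible at this theta
    null_likelihood = 0.5 ** meioses  # free recombination (theta = 0.5)
    return math.log10(likelihood / null_likelihood)


# Hypothetical example: 10 informative meioses with no recombinants.
print(round(two_point_lod(0, 10, 0.0), 2))  # 3.01
# A single recombinant rules out complete linkage (theta = 0).
print(two_point_lod(1, 10, 0.0))            # -inf
```

Positive scores favour linkage at the tested recombination fraction and negative scores count against it, which is the sense in which the candidate regions in Table 2-3 were excluded.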

Table 2-3: LOD score for linkage analysis using microsatellites in the region of 9 candidate genes for TNS.

Candidate  Name                           LOD     Location
CXCR4†     chemokine receptor 4           -10.18  Chr19, 41.9 Mb
BCL2L1     BCL-XL                         -6.05   Chr24, 24.1 Mb
CXCL12     stromal cell derived factor 1  -5.90   Chr28, 58.9 Mb
CXCR2      Interleukin 8 receptor         -3.34   Chr37, 27.9 Mb
ELANE‡     neutrophil elastase            -16.88  Chr20, 60.9 Mb
AP3D1‡     AP3 delta subunit              -16.88  Cfa20, 60 Mb
AP3B1‡     AP3 beta subunit               -0.85   Chr3, 31.4 Mb
AP3M1‡     AP3 mu subunit                 -24.65  Chr4, 27.7 Mb
AP3S1‡     AP3 sigma subunit              -10.80  Chr11, 8.6 Mb

† Shearman et al. 2006. ‡ Shearman and Wilton 2007.

[Plot: Multipoint LOD Scores. Vertical axis: LOD (8 to 9). Horizontal axis: chromosomal location (Mb), 3 to 6.]

Figure 2-1: Multipoint LOD scores for linkage to TNS gene by microsatellites C13.0390, C13.0423, C13.0449, C13.0478 in VPS13B region on CFA13.

Table 2-4: Microsatellite allele sizes in 72 Border collies for the 4 microsatellites in the VPS13B region

Microsatellite  Allele sizes
C13.0390        259 269 274 277 279 282 283 288 292 293 303 307 312 313
C13.0423        336 338 342 345 352 356 369 374 377 382 389 393 398 410 415 418 444 457
C13.0449        368 372 376 378 380 384 386 388 390 392 396
C13.0478        183 185 187 189 193 195 199 205

Table 2-5: Non disease haplotypes for the 4 microsatellites in the VPS13B region in 72 Border collies

Non disease haplotypes
C13.0390   259 269 274 274 279 282 283 283 283 283 288
C13.0423   410 374 418 415 352 345 356 352 356 356 444
C13.0449   380 372 378 378 388 392 384 388 392 388 384
C13.0478   185 193 185 185 183 195 183 183 183 183 189
frequency    5   3   5  10   5   2   1   4   5   2   7

C13.0390   288 288 288 288 293 293 303 303 307 313 313
C13.0423   457 342 357 349 398 382 374 377 393 369 369
C13.0449   380 376 380 380 372 368 396 396 376 396 396
C13.0478   187 187 187 187 183 199 193 193 205 195 193
frequency    7   2   1   2  11   3   4   1   1   2   2

Strong linkage disequilibrium between TNS and the microsatellites was observed at VPS13B, with the majority of affected dogs homozygous for a single haplotype (Figure 2-2). Variations in the disease haplotype occurred in three dogs, each at one locus only. Each of the 63 exons with at least 100 bp of flanking intron was sequenced in two controls, two carriers and two affecteds. Thirteen coding sites were found to be polymorphic in these dogs or different to the reference genome (Table 2-6). A homozygous four bp deletion was found, relative to Border collie controls and the dog reference genome, in the coding sequence of the affected allele at base 2894 to 2897 of cDNA VPS13B in exon 19 (c.2894_2897delGTTT transcript variant one, accession# HM036106). Obligate carriers (known to have produced affected offspring) were heterozygous for the four bp deletion (Figure 2-3). The deletion results in a frame shift of putative transcript variants one, three and four and premature truncation after the 980th amino acid of the 4019, 3994 and 1427 amino acid VPS13B protein isoforms, respectively (Figure 2-4).

Microsatellite  Sample 537  Sample 541  Sample 576  Sample 577  Sample 621  Sample 893  Sample 1137  Sample 1151
C13.0390        277/277     277/290     277/277     277/290     277/277     277/277     277/277      277/277
C13.0423        384/384     384/384     384/384     384/384     384/384     384/384     384/384      384/384
C13.0449        386/386     386/386     386/386     386/386     386/386     386/390     386/386      386/386
C13.0478        195/195     195/195     195/195     195/195     195/195     195/195     195/195      195/195

Figure 2-2: Genotypes at 4 microsatellite loci near VPS13B in 8 TNS affected border collies identified by ID number. Alleles are represented by sizes of PCR products in base pairs.

Table 2-6: Single nucleotide variations identified from sequencing genomic DNA of VPS13B showing their genomic location, predicted effect on the protein produced by transcript variant 1 and dbSNP ID where available

Exon†    Reference  Variant  Genomic location  aa Effect  Accession #  dbSNP ID
6        T          A        4159271           syn        HM149237     rs22238678
8        T          C        4176521           syn        HM149238     rs22238688
11       G          A        4178874           G -> S     HM149239     rs22238710
16       A          T        4232838           Q -> H     HM149240     n/a
19 (18)  G          A        4319973           syn        HM149241     n/a
37 (35)  G          A        4676737           C -> Y     HM149242     n/a
43 (41)  A          G        4744210           syn        HM149243     rs22277807
45 (43)  G          A        4772965           R -> Q     HM149244     n/a
52 (50)  C          T        4789089           syn        HM149245     rs22310936
57 (55)  G          A        4812943           A -> T     HM149246     n/a
57 (55)  C          T        4813323           syn        HM149246     rs22253538
59 (57)  C          A        4819760           syn        HM149247     n/a
63 (61)  A          G        4833061           syn        HM149248     n/a

† Genomic exons numbered 1-63. Numbers in brackets represent exon numbering for transcript variant 1

Figure 2-3: Sequence comparison of an affected, carrier and control Border collie showing 4 bp deletion in exon 19 of VPS13B at base 2894 to 2897 (transcript variant 1)

DNA Norm  2881  tataactggcttgtttatcagcctcagaaacgaaccagtagacat
aa Norm         Y  N  W  L  V  Y  Q  P  Q  K  R  T  S  R  H
DNA Mut   2881  tataactggctt----atcagcctcagaaacgaaccagtagacat
aa Mut          Y  N  W  L      I  S  L  R  N  E  P  V  D  I

DNA Norm  2926  atgcagcagcagcctgtcatagctgttcctcttgttacaccaatt
aa Norm         M  Q  Q  Q  P  V  I  A  V  P  L  V  T  P  I
DNA Mut   2926  atgcagcagcagcctgtcatagctgttcctcttgttacaccaatt
aa Mut           C  S  S  S  L  S  *

Figure 2-4: Partial DNA and AA sequence of normal (Norm) and mutant (Mut) VPS13B sequence (above) and translation (below). Underlined aa represent expected aa after the frameshift due to deletion, stop codon represented by *.
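The frameshift shown in Figure 2-4 can be reproduced with a short Python sketch that translates the normal sequence and the same sequence with the GTTT deletion. This is an illustration rather than the method used in the thesis: the helper `translate` is hypothetical, the standard genetic code is assumed, and the residue numbering at the end assumes that cDNA position 1 is the first base of the start codon.

```python
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in the base order t, c, a, g.
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate from the first base, stopping after the first stop codon (*)."""
    protein = []
    for start in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[start:start + 3]]
        protein.append(residue)
        if residue == "*":
            break
    return "".join(protein)


# Normal sequence from Figure 2-4, starting at cDNA position 2881.
normal = (
    "tataactggcttgtttatcagcctcagaaacgaaccagtagacat"
    "atgcagcagcagcctgtcatagctgttcctcttgttacaccaatt"
)
# Remove the first 'gttt', the 4 bp deleted in the affected allele.
mutant = normal.replace("gttt", "", 1)

print(translate(normal))  # YNWLVYQPQKRTSRHMQQQPVIAVPLVTPI
print(translate(mutant))  # YNWLISLRNEPVDICSSSLS*

# Assumption: numbering starts at the start codon, so base 2881 opens codon 961.
first_codon = (2881 - 1) // 3 + 1
last_residue = first_codon + len(translate(mutant).rstrip("*")) - 1
print(last_residue)  # 980
```

Under that numbering assumption the last residue before the premature stop is number 980, in agreement with the truncation point reported above.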

Exon 19 was sequenced in 63 Border collies (6 affected, 10 obligate carriers and 47 of unknown status) confirming that all affecteds were homozygous for the GTTT deletion and all obligate carriers were heterozygotes (Table 2-7). Of the 47 Border collies related to the TNS pedigree with unknown status, 29 were heterozygotes and 18 were homozygous for the normal sequence.

Table 2-7: Deletion detection test results for individuals in the TNS pedigree, a sample set of individuals unrelated to TNS affected lineages and a set of samples collected randomly from Norway.

Status            Homozygote deletion  Heterozygote  Homozygote normal
affected          6                    -             -
Obligate carrier  -                    10            -
related†          -                    29            18
unrelated‡        -                    40            220
Norway*           1                    9             70

† ancestors and siblings of TNS affected dogs
‡ Australian Border collie sample set collected prior to TNS work
* sample set collected randomly to represent Border collie population of Norway

The TNS carrier test was applied to a sample set of 260 Australian Border collies collected before the TNS research began. These samples can be considered randomly sampled with respect to TNS. A set of 80 samples that were collected in Norway to represent the local Border collie population was also tested (Table 2-8). These two sample sets give unbiased allele proportions of 0.08 and 0.07, respectively, for the TNS mutation. A total of 5000 Border collies have been tested from Europe, USA, UK, Australia, New Zealand and Japan; 1100 were carriers and 30 were affected. This sample set is biased towards dogs with familial links to known carriers. To correct for this ascertainment bias, any samples where either of the parents had been tested as carriers were removed. Allele frequencies ranged from 0.04 to 0.08 for all populations, with an average of 0.064 from a sample set of 2100 dogs after ascertainment bias correction (Table 2-8).

Table 2-8: TNS testing results and allele proportions of sample sets per country after correcting for ascertainment bias

Country       Not TNS  TNS carrier  TNS affected  Total  TNS allele proportion  Carrier rate
Australia(1)  220      40           0             260    0.077                  0.154
Norway(2)     71       9            1             81     0.068                  0.111
Japan(3)      69       14           0             83     0.084                  0.169
Finland(3)    120      18           0             138    0.065                  0.130
Czech Rep(3)  55       8            0             63     0.063                  0.127
Netherlands(3) 136     12           0             148    0.041                  0.081
Germany(3)    223      33           0             256    0.064                  0.129
UK(3)         665      85           0             750    0.057                  0.113
US(3)         263      52           0             315    0.083                  0.165
Total         1822     271          1             2094   0.065                  0.129

(1) Samples collected before TNS research
(2) Random sample of unrelated dogs
(3) Samples tested for TNS excluding any with either parent tested as TNS carriers
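The allele proportions and carrier rates in Table 2-8 follow directly from the genotype counts: each carrier contributes one TNS allele, each affected dog two, and every dog two alleles in total. The Python sketch below recomputes three rows of the table; the function names are hypothetical and are used here only for illustration.

```python
def tns_allele_proportion(carriers: int, affected: int, total: int) -> float:
    """Proportion of TNS alleles among the 2 * total alleles sampled."""
    return (carriers + 2 * affected) / (2 * total)


def carrier_rate(carriers: int, total: int) -> float:
    """Proportion of tested dogs that are heterozygous carriers."""
    return carriers / total


# (carriers, affected, total) taken from Table 2-8
samples = {
    "Australia": (40, 0, 260),
    "Norway": (9, 1, 81),
    "Total": (271, 1, 2094),
}
for country, (carriers, affected, total) in samples.items():
    q = tns_allele_proportion(carriers, affected, total)
    rate = carrier_rate(carriers, total)
    print(f"{country}: allele proportion = {q:.3f}, carrier rate = {rate:.3f}")
```

Run as written, this gives 0.077 and 0.154 for Australia, 0.068 and 0.111 for Norway, and 0.065 and 0.129 for the combined sample, matching the tabulated values.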

A comparison between Cohen syndrome patients and 7 TNS affected Border collies identified an overlap in disease characteristics (Table 2-9). Features common to both Cohen syndrome and TNS include typical facies, slender extremities, infantile hypotonia, developmental delay and neutropenia. However, obesity and ophthalmic abnormalities, which are other common Cohen syndrome clinical signs, were not found in Border collies. Mental retardation is a clinical sign in Cohen syndrome which is difficult to assess in dogs, but inability to learn simple commands was reported by breeders for three of the seven affected dogs while some dogs were reported to be quick learners.

Table 2-9: Comparison of clinical signs with percentages between Cohen syndrome patients and seven TNS affected Border collies.

Clinical signs       Cohen Syndrome (%)†  TNS (%)
typical facies       81*                  43
slender extremities  91                   86
mental retardation   99                   43
microcephaly         53                   71
infantile hypotonia  86                   71
developmental delay  81                   86
delayed puberty      41                   86
short stature        47                   60‡
truncal obesity      65                   0
neutropenia          75§                  100
myopia               44                   n/a
cataracts            64                   n/a

n/a - not available
* average for clinical signs: short philtrum, high narrow palate, prominent nasal bridge and prominent central upper incisors
† Data summarised from Taban et al. 2007 where the number of individuals for each category varies depending on what has been reported in the literature
‡ out of five dogs as two died before reaching maturity
§ Hennies et al. 2006, Kivitie-Kallio and Norio 2001, Chandler et al. 2006, Kolehmainen et al. 2004, Bugiani et al. 2008, Kivitie-Kallio et al. 1997, Seifert et al. 2006.

To determine whether the mutation is causative rather than linked to a non-coding VPS13B mutation or a mutation in another gene in the region, sequence data was compared between species. High conservation of VPS13B is observed at both the amino acid and DNA level across mammalian species. Homology between human, dog and mouse emphasizes this: human – dog 90.4% amino acid, 90.7% mRNA; human – mouse 87.3% amino acid, 86.5% mRNA; dog – mouse 85.6% amino acid, 84.8% mRNA. A multiple species comparison using PhastCons from the UCSC genome browser shows the most highly conserved regions to be exons and transcription factor binding sites, suggesting conservation in the regulation of VPS13B. There is a large VPS13B transcript variant in human that is not detected in mouse. The two largest putative transcripts of VPS13B in the dog are variant one (XM_539102) and variant three (XM_850840), the latter of which utilises an alternative exon 27, referred to as exon 27b. To investigate the presence of these transcript variants in dog brain tissue, cortex and cerebellum cDNA was amplified from three healthy dogs by PCR using primers spanning the alternate exon region (Table 2-1). Products were observed at 684 bp corresponding to transcript variant one and 609 bp corresponding to transcript variant three, present in both tissue types (Figure 2-5). The PCR fragment for transcript variant one showed higher intensity on the gel, suggesting higher expression levels. A product of approximately 750 bp was also present on the gel at a lower concentration in cerebellum and was barely detectable in cortex.

Figure 2-5: PCR product from cerebellum and cortex of 3 healthy dogs for primers spanning exon 25-28 of cDNA. Products of 609 bp and 684 bp are present, corresponding to mRNA incorporating exons 27b or 27, respectively. Lanes 1 and 8 contain a 100 bp ladder, lanes 2-4 are cerebellum samples and 5-7 are cortex samples from 3 healthy dogs.

The PCR product mixture was sequenced and showed sequence homozygosity up to the expected alternate exon site followed by mixed sequence. The mixed sequence could be resolved to the expected component sequences of the two transcripts, confirming that the bands on the gel were the expected PCR products from the transcript variants (Figure 2-6). Sequence from the 750 bp product was not detectable in the sequence results from cortex or cerebellum; clear PCR termination sites were identified at 589 bp and 664 bp from the primer binding site only.

Figure 2-6: Forward (above) and reverse (below) sequence data from cerebellum (top) and cortex (bottom) cDNA of a healthy dog. The figure shows alternate exon usage in transcript variant 1 (XM_539102) and variant 3 (XM_850840). The first 16 bp from the alternate exon site corresponding to each of the 2 transcript variants in the mixed sequence is shown separated for each sequence.

Comparative genomics was used on reference genome sequence from NCBI to identify species with sequences similar to canine exons 27 and 27b. Human, cow, horse, mouse and rat all have an exon which is 67 bp in length with highly conserved sequence to exon 27b in dogs (Figure 2-7). Human, horse and cow but not mouse or rat have an exon 142 bp in length with highly conserved sequence to exon 27 in dogs. For this exon, the mouse reference genome (C57BL/6J) had 80% identity with a seven bp deletion compared to dog. The rat (BN/SsNHsd) showed 80% identity for the first 59 bp of the sequence (Figure 2-7), but a 271 bp deletion, relative to the mouse, disrupts the sequence (Figure 2-8).
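The exon lengths given above can be checked against the PCR product sizes reported earlier, and they also show which exon combinations preserve the reading frame. The Python sketch below is simple arithmetic on figures taken from the text. It assumes that the mouse sequence differs in length from the canine exon only by the seven bp deletion, and the constant names are hypothetical.

```python
EXON_27_BP = 142      # canine exon 27 (homologous to human exon 28)
EXON_27B_BP = 67      # canine exon 27b (homologous to human exon 28b)
PRODUCT_27_BP = 684   # expected product, transcript variant one (XM_539102)
PRODUCT_27B_BP = 609  # expected product, transcript variant three (XM_850840)

# Swapping exon 27 for exon 27b changes the product by the exon length difference.
assert PRODUCT_27_BP - PRODUCT_27B_BP == EXON_27_BP - EXON_27B_BP  # 75 bp

# 75 is a multiple of three, so both variants share the downstream reading frame.
print((EXON_27_BP - EXON_27B_BP) % 3)  # 0

# A transcript containing both exons would give a product of 609 + 142 bp,
# close to the weak band observed at approximately 750 bp.
both_exons_product = PRODUCT_27B_BP + EXON_27_BP
print(both_exons_product)  # 751
# The extra 142 bp relative to variant three is not a multiple of three.
print(EXON_27_BP % 3)  # 1

# Mouse (assumption: 142 bp minus the 7 bp deletion): using this exon in place
# of the 67 bp exon changes the length by 68 bp, which shifts the frame.
mouse_exon_bp = EXON_27_BP - 7
print((mouse_exon_bp - EXON_27B_BP) % 3)  # 2
```

A remainder of zero means the downstream reading frame is unchanged; any other remainder indicates a frame shift.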

exon 28
>Human  1 CCTTGGGGAAGAGTGTTGGTCTTTGGGGCAATGTGGAGGTGTCTTCCTTTCCTGTACTGACAAGCTGAACAGACGCACCT
>Dog    1 CTTTGGGGAAGAGTGTTGGTCTCTGGGGCAATGTGGAGGTGTCTTCCTTTCCTGTACTGACAAGCTGAACAGACGCACCT
>Horse  1 CCTTGGGGAAGAGTGTTGGTCTCTGGGGCAATGTGGAGGTGTCTTCCTTTCCTGTACTGACAAGCTGAACAGACGCACCT
>Cow    1 CCTTGGGGAAGAGTGTTGGTCTCTGGGGCAATGTGGAGGTGTCTTCCTTTCCTGTACTGACAAGCTGAACAGACGCACCT
>Rat    1 TCTTGGGGAAGAGTGTTGGTCTTTGGGGGAGTGTGGAGGTGTCT------------------------------------
>Mouse  1 CCTTGGAGAAGAGTGTT-------GGGGCACCGTGGAGGTGCCTTCCTTTCCTGTACTGAGTAGCTGGACACTTGCACCT

>Human 81 TGTTGGTTCGACCCATCAGCAAGCAGGACCCTTTCAGTAATTGCTCTGGCTTCTTTCCTTCT
>Dog   81 TGTTGGTTCGACCCATCAGCAAGCAGGACCCTTTCAGTAATTGCTCTGGCTTCTTTCCTTCT
>Horse 81 TGTTGGTTCGACCCATCAGCAAGCAGGACCCTTTCAGTAATTGCTCTGGCTTCTTTCCTTCT
>Cow   81 TGTTGGTTCGACCCATCAGCAACCAGGACCCTTTCAGTAATTGCTCTGGCTTCTTTCCTTCT
>Rat   81 --------------------------------------------------------------
>Mouse 81 TGTTGGTTCGACCCCTTAGCAAGCAGGACTCTTGAAGTAACTGTCCTGCCTTCCTTTCTTCT

exon 28b
>Human  1 GCCAGGGGAAGGTTGGCAGTCAGGACATTTTGAAGGAGTATTTCTACAATGCAAAGAAAAATCTGTG
>Dog    1 GCCAGGGGAAGGTTGGCAGTCAGGACATTTTGAAGGAGTATTTCTACAGTGCAAAGAAAAACCTGTG
>Horse  1 GCCAGGGGAAGGTTGGCAGTCAGGACATTTTGAAGGAGTATTTCTACAGTGCAAGGAAAAACCTGTG
>Cow    1 GCCAGGGGAAGGTTGGCAGTCAGGACATTTTGAAGGAGTATTTCTACAGTGCAAAGAAAAACCTGTG
>Rat    1 GCCTGGGGAAGGTTGGCAATCAGGACAATTTGAAGGAGTGTTTCTGCAGTGCAAAGAAAAACCTGTG
>Mouse  1 GCCTGGGGAAGGTTGGCAGTCAGGACATTTTGAAGGAGTGTTTCTGCAGTGCAAAGAAAAACCTGTG

Figure 2-7: Sequence comparison of VPS13B exons 28 and 28b in human to dog, horse, cow, rat and mouse.
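Percent identity between two aligned sequences, as quoted in the text, can be computed column by column. The Python sketch below is one possible definition, applied to the human and dog exon 28b sequences from Figure 2-7. It assumes that gap columns are skipped rather than counted as mismatches, which may differ from the convention used for the figures quoted in this chapter, and the function name `percent_identity` is hypothetical.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of aligned columns with identical bases; gap columns are skipped."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # assumption: gaps are excluded from the comparison
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared


# Exon 28b sequences (67 bp) from Figure 2-7
human = "GCCAGGGGAAGGTTGGCAGTCAGGACATTTTGAAGGAGTATTTCTACAATGCAAAGAAAAATCTGTG"
dog = "GCCAGGGGAAGGTTGGCAGTCAGGACATTTTGAAGGAGTATTTCTACAGTGCAAAGAAAAACCTGTG"

print(round(percent_identity(human, dog), 1))  # 97.0
```

The two sequences differ at 2 of 67 positions, giving 97.0% identity for this exon.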

Mouse CCTTGGAGAAGAGTGTTGG-------GGCACCGTGGAGGTGCCTTCCTTTCCTGTACTGA
Rat   TCTTGGGGAAGAGTGTTGGTCTTTGGGGGAGTGTGGAGGTGTCT----------------

Mouse GTAGCTGGACACTTGCACCTTGTTGGTTCGACCCCTTAGCAAGCAGGACTCTTGAAGTAA
Rat   ------------------------------------------------------------

Mouse CTGTCCTGCCTTCCTTTCTTCTGTAAGAAATCACTTTAAAGCTGTACCCCAGAATTCCCC
Rat   ------------------------------------------------------------

Mouse CTAGCAAGTACCTAAAATAGGACCTCAGCTGCCATACAGTCATCACTTGATTTTCATTGT
Rat   ------------------------------------------------------------

Mouse TGCCAATGAAATACTCTTTGAAATAGAGTGAACTATAATCTGTTTTTGTATCATAAATTT
Rat   ------------------------------------------------------------

Mouse GTTCTGTCATGTGCTGATTGCCA-TCTGTATGTA-TAT-AGCTAGT-CTGAGATTGTGAG
Rat   ---------------GGTTGTCAGTCTGTATTTACTATCA-CT-GTTCTGAGTTTGTGAG

Mouse TAAGAGTTAATATTGTCATCAATATAAGAAAATCAAAGTAAGTTATTAAGTAAAATGGCT
Rat   TAGGAATTAATATTGCCATCAGTAGAAGAAAGTTAAAGTAAGTTATTAAGTAAAATGGCT

Mouse TTCTACTTGTGTTGAAC
Rat   TTTTACTTGTGTTGAAT

Figure 2-8: Sequence comparison between mouse and rat of VPS13B sequence homologous to exon 28 in human.

2.5 Discussion

The high frequency of the TNS deletion in the Border collie population suggests that the mutation is much older than the recent common ancestor of our affected cases (Shearman and Wilton 2007). The mutation is present in Border collies in all countries tested and in both working dogs and show dogs, which have been genetically isolated for around 50 years. The widespread nature of the disease indicates that the mutation was either already present in the founder dogs used to establish the breed or originated very early in the breed. If the mutation predates development of the breed it may exist in other collie breeds, as does Collie Eye Anomaly (Parker et al. 2007). The sample sets collected from Norway and Australia prior to TNS research represent an unbiased sample set giving the most reliable disease allele proportion. The allele proportions from other countries after correction for ascertainment bias support the values obtained from these two sample sets. TNS in Border collies is caused by a mutation in the same gene that causes Cohen syndrome in humans, VPS13B. Over 80 mutations in human VPS13B have been identified, all resulting in the characteristic Cohen phenotype. The highly conserved nature of VPS13B suggests that the mutation identified in the canine form of the gene would cause clinical signs similar to those observed in humans. The strong clinical sign similarity between Cohen syndrome and TNS is evidence that VPS13B is the TNS gene. No reports have been found of mutations in this gene in other organisms, which makes TNS in the dog the first model for Cohen syndrome. Cohen syndrome is rare (~190 patients worldwide) with a homogeneous phenotype reported in Finnish patients, who carry mainly the c.3348_3349delCT mutation, and large variation in clinical signs in non-Finnish patients who carry many different mutations (Kolehmainen et al. 2003; Seifert et al. 2006).
Diagnosis of Cohen syndrome is based on the presence of at least six of the following main clinical signs: developmental delay, microcephaly, typical Cohen syndrome facial gestalt, truncal obesity with slender extremities, overly sociable behaviour, joint hypermobility, high myopia and/or retinal dystrophy, and neutropenia (Kolehmainen et al. 2004). Border collies show developmental delay, microcephaly, typical facial gestalt, slender extremities, and neutropenia, which was enough overlap in clinical signs for TNS to be suspected as homologous to Cohen syndrome. The Cohen syndrome gene, VPS13B, encodes a potential transmembrane protein involved in vesicle-mediated transport and sorting within the cell (Kolehmainen et al. 2003). The gene is similar to yeast Vps13 and is most highly conserved at the ends of the protein, which are similar to vacuolar protein sorting domains (Kolehmainen et al. 2003). Vps13 has undergone duplication events early in vertebrate evolution resulting in four paralogue copies VPS13A, B, C and D, which have diverged in function. The transcripts of each of the four genes typically vary by a few amino acids in the middle of the sequence and exhibit tissue-specific expression. Each has one ubiquitous transcript and one transcript expressed in brain (Velayos-Baeza et al. 2004). Human VPS13B has four transcript variants: two large transcripts utilising 62 exons each and two small transcripts utilising eight and 18 exons, respectively. The canine form of VPS13B is located at approximately 4.1 Mb on canine chromosome 13 (CFA13) and contains 63 exons with five putative transcript variants (NCBI). Seifert et al. (2009) found that the full length transcript variants showed the highest expression with smaller transcripts barely detectable in both humans and mice. This suggests that the smaller transcripts are either from imperfect splicing or play a regulatory role in VPS13B expression. It is also likely that the smaller of the five putative transcripts listed for canine VPS13B have similar expression and functions. The two largest transcripts of VPS13B in humans utilise alternate splicing to include either exon 28 (NM_017890) or 28b (NM_152564), which are homologous to dog exons 27 and 27b, respectively. In humans, transcript NM_152564 is ubiquitous and transcript NM_017890 is expressed at roughly 20% of that concentration in kidney, placenta, small intestine and lung, while in the brain and retina both transcripts have roughly equal expression (Seifert et al. 2009). Seifert et al. (2009) have shown that mice lack a transcript homologous to human NM_017890, with only the ubiquitous transcript detectable.
We have shown that dogs express transcripts in the brain incorporating exons 27 or 27b, homologous to human transcripts incorporating exons 28 or 28b, respectively. The difference in exon numbering is due to a 5' untranslated exon annotated in human but not in dog. We also detected a weak product at around 750 bp which may correspond to the incorporation of both exons 27 and 27b. The product was too weak to be detected in mixed sequence in either cerebellum or cortex. An mRNA incorporating both exons would change the reading frame and, considering the low levels of expression, is likely to be a splicing artefact. Comparative genomics suggests that cow and horse are also likely to express both transcripts. The mouse and rat were found to have independent deletions disrupting use of the exon homologous to canine exon 27. In the case of the mouse, utilising this exon would change the reading frame and destroy protein function. The deletion in the rat removes the 3' half of the exon, destroying the splice signal. This shows that the alternate transcript is not required in either mouse or rat. Mental retardation observed in Cohen syndrome suggests that VPS13B plays a role in brain development or maintaining brain function in humans. Seifert et al. (2009) highlighted that no mutations have been discovered in humans that affect either exon 28 or 28b solely and implied that disruption to the transcript utilising exon 28 may be responsible for the mental retardation observed in Cohen syndrome. Dogs express both transcript variants in the brain and mental retardation is observed in some TNS affected dogs. This identifies dogs as a good model to test the hypothesis that the transcript utilising exon 28 may be responsible for mental retardation. TNS in Border collies is the result of a mutation to VPS13B and dogs, like humans, express alternately spliced transcripts of VPS13B in the brain. Comparative genomics shows that other mammalian species are likely to express alternately spliced transcripts in the brain, similar to human and dog. Dogs affected with TNS are the first animal model for Cohen syndrome and can be used to study the development of the disease and the effect of varying expression of VPS13B.

2.6 Acknowledgments

We would like to acknowledge the Pastoral Breeds Health Foundation (UK) and the Border collie clubs and individual Border collie breeders and owners of New South Wales, Victoria and Queensland for sample submission and financial support. We thank Frode Lingaas for the Norwegian Border collie samples. We thank Rosanne Taylor for feedback on the manuscript. Alan Wilton’s lab at UNSW runs a DNA testing service for the TNS mutation.

2.7 References

Allan FJ, Thompson KG, Jones BR, Burbidge HM and McKinley RL. 1996. Neutropenia with a probable hereditary basis in Border collies. N Z Vet J 44: 67-72.
Aprikyan AAG, Liles WC, Park JR, Jonas M, Chi EY and Dale DC. 2000. Myelokathexis, a congenital disorder of severe neutropenia characterized by accelerated apoptosis and defective expression of bcl-x in neutrophil precursors. Blood 95: 320-327.
Berliner N, Horwitz M and Loughran TP. 2004. Congenital and acquired neutropenia. Hematology 2004: 63-79.
Bugiani M, Gyftodimou Y, Tsimpouka P, Lamantea E, Katzaki E, d'Adamo P, Nakou S, Georgoudi N, Grigoriadou M, Tsina E, et al. 2008. Cohen syndrome resulting from a novel large intragenic COH1 deletion segregating in an isolated Greek island population. Am J Med Genet A 146A: 2221-2226.
Chandler KE, Kidd A, Al-Gazali L, Kolehmainen J, Lehesjoki AE, Black GC and Clayton-Smith J. 2003. Diagnostic criteria, clinical characteristics, and natural history of Cohen syndrome. J Med Genet 40: 233-241.
Cohen MM Jr, Hall BD, Smith DW, Graham CB and Lampert KJ. 1973. A new syndrome with hypotonia, obesity, mental deficiency, and facial, oral, ocular, and limb anomalies. J Pediatr 83: 280-284.
Hennies HC, Rauch A, Seifert W, Schumi C, Moser E, Al-Taji E, Tariverdian G, Chrzanowska KH, Krajewska-Walasek M, Rajab A, et al. 2004. Allelic heterogeneity in the COH1 gene explains clinical variability in Cohen syndrome. Am J Hum Genet 75: 138-145.
Holmes WE, Lee J, Kuang WJ, Rice GC and Wood W. 1991. Structure and functional expression of a human interleukin-8 receptor. Science 253: 1278-1280.
Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM and Haussler D. 2002. The human genome browser at UCSC. Genome Res 12: 996-1006.
Kivitie-Kallio S, Rajantie J, Juvonen E and Norio R. 1997. Granulocytopenia in Cohen syndrome. Br J Haematol 98: 308-311.
Kivitie-Kallio S and Norio R. 2001. Cohen Syndrome: essential features, natural history, and heterogeneity. Am J Med Genet 102: 125-135.
Kolehmainen J, Black GCM, Saarinen A, Chandler K, Clayton-Smith J, Traskelin A, Perveen R, Kivitie-Kallio S, Norio R, Warburg M, et al. 2003. Cohen syndrome is caused by mutations in a novel gene, COH1, encoding a transmembrane protein with a presumed role in vesicle-mediated sorting and intracellular protein transport. Am J Hum Genet 72: 1359-1369.
Kolehmainen J, Wilkinson R, Lehesjoki AE, Chandler K, Kivitie-Kallio S, Clayton-Smith J, Traskelin AL, Waris L, Saarinen A, Khan J, et al. 2004. Delineation of Cohen syndrome following a large-scale genotype-phenotype screen. Am J Hum Genet 75: 122-127.
Miller SA, Dykes DD and Polesky HF. 1988. A simple salting out procedure for extracting DNA from human nucleated cells. Nucleic Acids Res 16: 1215.
Neilan BA, Wilton AN and Jacobs D. 1997. A universal procedure for primer labelling of amplicons. Nucleic Acids Res 25: 2938-2939.
Oetting WS, Lee HK, Flanders DJ, Wiesner GL, Sellers TA and King RA. 1995. Linkage analysis with multiplexed short tandem repeat polymorphisms using infrared fluorescence and M13 tailed primers. Genomics 30: 450-458.
Ostrander EA and Wayne RK. 2005. The canine genome. Genome Res 15: 1706-1716.
Pang JF, Kluetsch C, Zou XJ, Zhang AB, Luo LY, Angleby H, Ardalan A, Ekstrom C, Skollermo A, Lundeberg J, et al. 2009. mtDNA data indicate a single origin for dogs south of Yangtze River, less than 16,300 years ago, from numerous wolves. Mol Biol Evol 26: 2849-2864.
Parker HG, Kim LV, Sutter NB, Carlson S, Lorentzen TD, Malek TB, Johnson GS, DeFrance HB, Ostrander EA and Kruglyak L. 2004. Genetic structure of the purebred domestic dog. Science 304: 1160-1164.
Parker HG, Kukekova AV, Akey DT, Goldstein O, Kirkness EF, Baysac KC, Mosher DS, Aguirre GD, Acland GM and Ostrander EA. 2007. Breed relationships facilitate fine-mapping studies: a 7.8-kb deletion cosegregates with Collie eye anomaly across multiple dog breeds. Genome Res 17: 1562-1571.
Velayos-Baeza A, Vettori A, Copley RR, Dobson-Stone C and Monaco AP. 2004. Analysis of the human VPS13 gene family. Genomics 84: 536-549.
vonHoldt BM, Pollinger JP, Lohmueller KE, Han E, Parker HG, Quignon P, Degenhardt JD, Boyko AR, Earl DA, Auton A, et al. 2010. Genome-wide SNP and haplotype analyses reveal a rich history underlying dog domestication. Nature. DOI: 10.1038/nature08837.
Schuelke M. 2000. An economic method for the fluorescent labeling of PCR fragments. Nat Biotechnol 18: 233-234.
Seifert W, Holder-Espinasse M, Spranger S, Hoeltzenbein M, Rossier E, Dollfus H, Lacombe D, Verloes A, Chrzanowska KH, Maegawa GH, et al. 2006. Mutational spectrum of COH1 and clinical heterogeneity in Cohen syndrome. J Med Genet 43: e22.
Seifert W, Holder-Espinasse M, Kuhnisch J, Kahrizi K, Tzschach A, Garshasbi M, Najmabadi H, Walter Kuss A, Kress W, Laureys G, et al. 2009. Expanded mutational spectrum in Cohen syndrome, tissue expression, and transcript variants of COH1. Hum Mutat 30: E404-420.
Shearman JR, Zhang QY and Wilton AN. 2006. Exclusion of CXCR4 as the cause of Trapped Neutrophil Syndrome in Border collies using five microsatellites on canine chromosome 19. Anim Genet 37: 89.
Shearman JR and Wilton AN. 2007. Elimination of neutrophil elastase and adaptor protein complex 3 subunit genes as the cause of trapped neutrophil syndrome in Border collies. Anim Genet 38: 188-189.
Silberstein M, Tzemach A, Dovgolevsky N, Fishelson M, Schuster A and Geiger D. 2006. Online system for faster multipoint linkage analysis via parallel execution on thousands of personal computers. Am J Hum Genet 78: 922-935.
Taban M, Memoracion-Peralta DS, Wang H, Al-Gazali LI and Traboulsi EI. 2007. Cohen syndrome: report of nine cases and review of the literature, with emphasis on ophthalmic features. J AAPOS 11: 431-437.

2.8 Supplementary material

TNS Questionnaire

Q: To what age did the dog(s) live, in terms of number of years and developmental stage?
Q: Did his/her legs become thinner than normal moving from the shoulder down towards the feet, in other words were they tapered towards the bottom?
Q: Did he/she show any sign of mental retardation or slowness? Did you find him/her more difficult to train than other dogs? Did he/she pick things up more slowly than other dogs?
Q: Did he/she show any sort of developmental delay (grew slower than normal, took longer than normal before his/her first steps as a pup, etc.)?
Q: Did he/she show any signs of obesity during his/her teenage years?
Q: Did he/she have any eye problems such as colobomas, cataracts, pigmentary deposits, or short sightedness?
Q: Did his/her spine show more curvature, sideways or up/down, than seen in an unaffected dog?
Q: Was he/she very affectionate or happy, more than an unaffected dog (if it's even possible to tell)?
Q: Did he/she have less muscle than an unaffected dog?
Q: Did his/her face have the ferret like appearance commonly seen in TNS affecteds?
Q: Did he/she have knock knees (when standing, knees are kept closer than normal) or bow legs (knees further apart than normal)?
Q: Did he/she have a smaller head compared to his/her body size than normal for his/her age?

3 THE EFFECTS OF INBREEDING ON THE NEW SOUTH WALES BORDER COLLIE POPULATION

Jeremy R. Shearman and Alan N. Wilton

School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW 2052, Australia and Clive and Vera Ramaciotti Centre for Gene Function Analysis, University of New South Wales, Sydney, NSW 2052, Australia

Address for correspondence Alan Wilton, School of Biotechnology, University of NSW, Sydney NSW 2052, Australia E-mail: [email protected] Fax: +61 2 9385 1483

3.1 Summary

Pure breed dogs are bred for specific traits and breeders often use line breeding, the breeding of related individuals, to enrich for such traits. The degree of inbreeding can be measured using Wright’s coefficient of inbreeding, which is the probability that an individual is homozygous for an allele identical-by-descent at any given locus. The database of pure bred Border collies in New South Wales, Australia, was obtained from the Australian National Kennel Council covering dogs from 1982 until the end of 2009 and was used to study the effect of inbreeding in the population. Testing results for two common diseases in Border collies, Trapped Neutrophil Syndrome (TNS) and Neuronal Ceroid Lipofuscinosis (NCL), have been collected from a total of 5780 Border collies around the world. The coefficient of inbreeding was calculated for the tested dogs and compared to the New South Wales Border collies. The median inbreeding coefficient was 8.98% for the New South Wales Border collie population and 6.42% and 15.91% for TNS and NCL affected dogs, respectively. This supports the hypothesis that Trapped Neutrophil Syndrome is caused by an older mutation than Neuronal Ceroid Lipofuscinosis. Heavy use of popular sires is responsible for recessive diseases becoming widespread and the inbreeding then causes these diseases to be expressed.

3.2 Introduction

The dog was first domesticated from the wolf around 15,000 years ago (Pang et al. 2009) and was most likely spread around the world by human trade. Domestication appears to have occurred multiple times from many different wolf subspecies, which may explain some of the variation in modern domestic dogs (vonHoldt et al. 2010). The thousands of years between domestication and the appearance of breeds allowed sufficient genetic diversity to accumulate, which could then be selected upon for the development of modern breeds. Modern dog breeds were established through selective breeding from the native population of domesticated dogs within the last 200 years (Parker et al. 2004; Lindblad-Toh et al. 2005; vonHoldt et al. 2010). A total of 197 breeds of dog are officially acknowledged by the Australian National Kennel Council (ANKC; http://www.ankc.org.au/Print---ANKC-Group-Listing-1.aspx) and over 400 by kennel councils globally. Pure breed dogs are bred for specific traits and breeders often use line breeding to enrich for such traits. Line breeding involves the breeding of related individuals expressing a particular trait or set of traits of interest to enrich for those traits. Line breeding is different to the type of inbreeding that occurs in natural populations and is also different to intentional inbreeding such as in the production of inbred mouse lines. Inbreeding in natural populations is a function of mate availability, population structure and chance. Intentional inbreeding is the choice of matings of pairs as closely related as possible (Morse 1978). Line breeding, on the other hand, is usually the choice of mating pairs related at the degree of cousins or more distant. The degree of inbreeding can be measured using Wright’s coefficient of inbreeding (F), which is the probability that an individual is homozygous for an allele identical-by-descent at any given locus (Wright 1922).
Often pure bred dogs are entered into show competitions where the animal is judged on its adherence to the breed defining traits. Dogs that do well at shows are preferred for breeding and champion males can sire hundreds to thousands of offspring (Calboli et al. 2008). These dogs are referred to as popular sires and can have a significant effect on the population. Simulations of the popular sire effect have identified a 4.4 fold risk of disease dissemination compared to random mating conditions (Leroy and Baumung 2010). Leroy and Baumung (2010) found that line breeding and closed breeding decreased the risk of disease spread, but did not perform any simulations in which the popular sire effect and a degree of line breeding were co-occurring. Dogs have 506 recorded genetic diseases according to Online Mendelian Inheritance in Animals (OMIA; http://www.ncbi.nlm.nih.gov/omia/). Dominant diseases are rarely a problem in dog populations as affected dogs are easily excluded from breeding, removing the disease allele from the population. Recessive diseases pose considerable problems for dog populations as these populations tend to have small inbreeding effective population sizes and large influence from the popular sire effect (Calboli et al. 2008). If a popular sire happens to carry a recessive mutation it can rapidly be spread through the global population through association with the set of desirable physical traits. By the time the recessive disease begins to manifest, it can already have allele frequencies as high as 10% (Shearman et al. 2011). A very common breed of dog in Australia is the Border collie. Two populations of Border collies exist which have been fairly isolated genetically for roughly the last 50 years. They are the International Sheep Dog Society (ISDS) working Border collie population in the UK and the show Border collie population. Border collies first appeared in Australia in 1901 and originally came from working stock sharing ancestry with ISDS dogs.
The ANKC adopted a national breed standard for show Border collies in 1963. Prior to this time Border collies could be registered from working stock or with unknown pedigrees. Genetic flow between show dogs and working dogs has been very small to nonexistent since registration in 1963. Pet dogs live very sheltered lives and factors such as predation and starvation are negligible to life expectancy. The top 3 causes of mortality for Border collies are cancer (24%), old age (18%) and stroke (9%) (http://www.thekennelclub.org.uk/item/570). A Border collie may only be registered in Australia as pure bred if both parents are registered as pure bred. Any offspring from a mating where at least one parent was not registered (pure bred) cannot be registered and thus can be considered lost to the pure bred population. Because of this strict isolation the genetic variation in this breed is limited to the variation present in the founding individuals and any subsequent mutation. Identifying the founding breeds and number of founders used to develop the Border collie is very difficult as the pedigree records do not date back far enough. Historical information on the Border collie indicates that it was created from several different breeds of collie, suggesting a reasonable degree of genetic variation in the founders and some allele sharing with other collie breeds. It is possible that some disease alleles were present in the founders and spread through the Border collie population from the very beginning. Subsequent mutation and line breeding would have also introduced potentially disease causing mutations into the population.

Border collies suffer from thirteen genetic diseases according to the Inherited Diseases In Dogs database (IDID; Sargan 2004). Two common diseases in Border collies are Trapped Neutrophil Syndrome (TNS), caused by a 4 bp deletion in VPS13B (Shearman et al. 2006; Shearman and Wilton 2007; Shearman et al. 2011 accepted), and Neuronal Ceroid Lipofuscinosis (NCL), caused by a single base substitution in CLN5 (Melville et al. 2005). The New South Wales (NSW) database of pure bred Border collies was obtained from the ANKC and used for a pedigree analysis in conjunction with genetic testing results for TNS and NCL collected from over 5000 Border collies.

3.3 Materials and methods

3.3.1 Pedigree analysis

The database of pure bred Border collies in NSW was obtained from the ANKC covering dogs from 1982 until the end of 2009. The database contains information on: name of the dog, date of birth, parents and registration number. The database was maintained using the program BreedMate (Wild Systems, Australia). The Wright’s inbreeding coefficient calculation function of this program was used to calculate 8 generation inbreeding coefficients for each dog. Excel (Microsoft) was used for data sorting and analysis. The generation time of Border collies was calculated by finding the average and median of the difference in age between each parent and offspring, excluding matings where the difference was greater than 15 years. Each dog with a recorded date of birth was assigned a Generation Number (GN) based on intervals of generation time rounded off to the nearest year, starting with 1982 as GN1. Population size at GN X was calculated by summing the number of offspring at GN X, GN X-1, 1/2 GN X-2 and 1/3 GN X-3. The number of expected mutations per generation was estimated using the human mutation rate of 1.3 × 10⁻⁸ per nucleotide per generation (Lynch 2010) multiplied by the canine genome size of 2.5 Gb multiplied by the number of offspring in each generation. Parent numbers were taken from the pedigree information and this number accounts for any drift that occurs based on population size. A coding size of 1% of the genome was used to estimate the number of coding mutations. Assuming random mutation, a nonsynonymous substitution rate of 74%, taking into account transitions being twice as likely as transversions (Zhang and Gerstein 2003), was used to estimate the number of mutations expected to change an amino acid. The number of potentially disease causing mutations was estimated by multiplying the number of mutations expected to change an amino acid by 34 ± 6%, which is the proportion of amino acid changes found to disrupt the function of a protein, defined by Guo et al. (2004) as the ‘x-factor’.

3.3.2 TNS and NCL testing

A total of 5780 Border collie samples have been collected since 1990 for research into the cause of NCL and TNS and for subsequent testing. The NCL mutation test utilises a set of three primers: a common reverse primer and a forwards primer specific to each allele, each with a different fluorescent label. The 3′ ends of the forwards primers fall on the causative base substitution such that the forward mutant primer will have a 3′ mismatch to the normal DNA strand and vice versa. An additional change is introduced at the second base in from the 3′ end of the mutant primer, resulting in a double mismatch of the normal primer to mutant allele PCR product and vice versa, producing higher specificity. To improve resolution, the mutant allele primer has an additional 4 bp, making the normal and mutant alleles 123 bp and 127 bp, respectively. The TNS test uses a set of primers spanning the mutation. The forwards primer is fluorescently labelled, allowing for detection of the mutant and normal alleles based on a 4 bp difference in product size as described by Shearman et al. (2011). Sample testing for both TNS and NCL was performed using PCR and sizing the product with capillary electrophoresis on an ABI3730 (Applied Biosystems). The resulting data was analysed using GeneMapper software (Applied Biosystems).
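As an illustration of how the two NCL product sizes separate the three genotype classes, the sketch below assigns a disease status from a list of fragment sizes. It is not the GeneMapper workflow used in this study: the function name `call_ncl_genotype` and the 1 bp sizing tolerance are assumptions made only for this example.

```python
NORMAL_BP = 123.0   # product size of the normal NCL allele (from the text)
MUTANT_BP = 127.0   # product size of the mutant NCL allele (from the text)


def call_ncl_genotype(peak_sizes_bp, tolerance_bp=1.0):
    """Classify a sample from its sized fragments (hypothetical helper).

    tolerance_bp is an assumed sizing tolerance, not a value from the study.
    """
    has_normal = any(abs(p - NORMAL_BP) <= tolerance_bp for p in peak_sizes_bp)
    has_mutant = any(abs(p - MUTANT_BP) <= tolerance_bp for p in peak_sizes_bp)
    if has_normal and has_mutant:
        return "carrier"      # one copy of each allele
    if has_mutant:
        return "affected"     # only the mutant product is present
    if has_normal:
        return "clear"        # only the normal product is present
    return "no call"          # neither expected product was detected


print(call_ncl_genotype([122.8, 127.1]))
```

The same size-based logic applies to the TNS test, where the two alleles also differ by 4 bp, although the absolute product sizes are not given here.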

3.4 Pedigree analysis

Having a record of all individuals in a large inbreeding population for many generations is a unique situation and can be used to assess how the population behaves over time. The database of the New South Wales Border collies is previously undescribed and, since it is not publicly available, will be described here. The database contains 62,607 dogs born between 1982 and the end of 2009 (Table 3-1). Border collies generally reach breeding age at around 6-12 months, but are usually not bred until at least 12 months old, and they live for between 10 and 17 years. The average and median generation time for Border collies were both 4 years, giving a total of 7 generations counting back from 2009. The dam-offspring generation time average was 3.7 years and the median was 3 years, lower than the sire-offspring average generation time of 4.4 years and median of 4 years. The oldest sire was 19 years of age, which most likely represents the use of stored semen; the oldest dam was 15 years of age. The generation time is roughly four fold the breeding age and greatest for males because of the artificial selection imposed by breeders heavily using show winning dogs. The effects of artificial selection and heavy use of champion animals have produced characteristic effects on the New South Wales Border collie population demographics. The population size and number of mating dogs used steadily increased per generation until 1998, at which time the population begins to decrease. This may be due either to a decrease in dogs being registered or to a decrease in demand for Border collie pups. The heavy use of popular sires has produced significant disparity in the ratio of mating males to females, observed at an average of 1:1.5. The proportion of parents to the population size and to the number of offspring recorded in the previous generation are also observed to decrease per generation.
The decreasing proportion of parents represents the artificial selection pressure where only the ‘best’ dogs are chosen to breed. The result of such a heavy artificial selection pressure has produced population demographics very different to those observed in natural populations.


Table 3-1 Population size (N), offspring number, number and sex distribution of parents and proportion of parents to population size per generation

Generation  Years      N      Growth rate  Offspring  Parents  Sires  Dams  Parents/N  Parents/last GN offspring
GN1         1982-1985  6455   -            6455       1198     406    792   0.19       -
GN2         1986-1989  15993  2.48         9538       1438     514    924   0.09       0.22
GN3         1990-1993  22603  1.41         9837       1339     453    886   0.06       0.14
GN4         1994-1997  27880  1.23         11144      1396     491    905   0.05       0.14
GN5         1998-2001  28243  1.01         9033       1057     365    692   0.04       0.09
GN6         2002-2005  26146  0.93         8295       926      304    622   0.04       0.10
GN7         2006-2009  24794  0.95         8305       369      129    240   0.01       0.04
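The population size rule given in the methods (offspring at GN X, plus GN X-1, plus half of GN X-2, plus a third of GN X-3) can be sketched in Python as follows. The function name `population_sizes` and the `weights` parameter are assumptions for illustration; the offspring counts are those listed in Table 3-1.

```python
def population_sizes(offspring_per_gen, weights=(1.0, 1.0, 0.5, 1.0 / 3.0)):
    """Weighted sum of the current and up to three previous offspring counts.

    weights[lag] is the weight given to the generation `lag` steps back.
    """
    sizes = []
    for x in range(len(offspring_per_gen)):
        n = 0.0
        for lag, weight in enumerate(weights):
            if x - lag >= 0:  # earlier generations are not recorded before GN1
                n += weight * offspring_per_gen[x - lag]
        sizes.append(n)
    return sizes


# Offspring recorded for GN1 to GN7 (Table 3-1)
offspring = [6455, 9538, 9837, 11144, 9033, 8295, 8305]
sizes = population_sizes(offspring)
print(sizes)
```

For GN2 the rule gives 9538 + 6455 = 15993, and for GN3 it gives 22602.5, in agreement with the N column of the table. The weights are exposed as a parameter because the value obtained for later generations depends on how the one-third weight is rounded.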

Wright’s Coefficient Of Inbreeding (F) was calculated to 8 generations for all dogs in the pedigree where sufficient information was available. The average and median F for the dogs in the database was 9.85% and 8.77%, respectively. The range of F was from 0.00% to 50.02%, with over half the dogs having an F of 5 - 20% (Table 3-2). Such high F values are the result of many generations of inbreeding. A pedigree of the dog with the highest F is shown in Figure 3-1. The median F was calculated for each generation and shows a steady increase per generation up to generation four, after which it begins to decline (Table 3-3). The decreasing F may be due to breeders becoming more aware of the negative impacts of inbreeding.
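Wright's F for an individual is equal to the kinship (coancestry) coefficient of its two parents, which allows F to be computed recursively from a pedigree. The Python sketch below illustrates this standard recursion. It is not the BreedMate implementation used in this study, it follows the whole recorded pedigree instead of truncating at 8 generations, and it treats unknown parents as unrelated founders. The function name and the pedigree layout (a dictionary mapping each dog to a sire and dam pair) are assumptions for this example.

```python
from functools import lru_cache


def inbreeding_coefficients(pedigree):
    """Return {individual: F} for a pedigree of the form {name: (sire, dam)}."""

    def parents(x):
        # Individuals without a record are treated as founders
        return pedigree.get(x, (None, None))

    @lru_cache(maxsize=None)
    def depth(x):
        # Number of generations between x and its most distant recorded ancestor
        if x is None:
            return -1
        sire, dam = parents(x)
        return 1 + max(depth(sire), depth(dam))

    @lru_cache(maxsize=None)
    def kinship(a, b):
        if a is None or b is None:
            return 0.0
        if a == b:
            sire, dam = parents(a)
            return 0.5 * (1.0 + kinship(sire, dam))
        # Recurse through the deeper individual, which cannot be an ancestor of the other
        if depth(a) < depth(b):
            a, b = b, a
        sire, dam = parents(a)
        return 0.5 * (kinship(sire, b) + kinship(dam, b))

    return {x: kinship(*parents(x)) for x in pedigree}


# Example: X is the offspring of a full-sibling mating (C x D)
example = {"C": ("A", "B"), "D": ("A", "B"), "X": ("C", "D")}
F = inbreeding_coefficients(example)
print(F["X"])
```

A single full-sibling mating between non-inbred parents gives F = 0.25, so values approaching 50%, as seen for the most inbred dog in the database, require repeated close matings over several generations.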

Table 3-2 Wright’s coefficient of inbreeding for the New South Wales database of registered pure bred Border collies

F             Count   % Entries
> 40 %        28      0.04
30 < 40 %     582     0.93
20 < 30 %     3705    5.92
10 < 20 %     22096   35.29
5 < 10 %      22035   35.20
0.01 < 5 %    12311   19.66
0 %           1850†   2.95

† includes individuals with insufficient pedigree information to calculate F

Table 3-3 Average and median F per generation for all Border collies registered in New South Wales up to the end of 2009

Generation  Average F (%)  Median F (%)
GN1         7.45           5.76
GN2         9.24           8.22
GN3         10.61          9.63
GN4         11.24          10.06
GN5         10.85          9.61
GN6         9.63           8.68
GN7         8.78           7.65

Figure 3-1 Pedigree of offspring with inbreeding coefficient of 50.02%. Squares represent males, circles represent females, small diamonds represent matings and the shaded individual has an F of 50.02%.

The number of registered offspring can be used to identify popular sires and dams that have had significant genetic contribution to the population. The highest producing sire has 336 recorded offspring and the average and median total lifetime offspring per sire is 20 and 10, respectively. The highest producing dam has 78 recorded offspring and the average and median total lifetime offspring per dam is 11 and 8, respectively. Popular sires produce more offspring than popular dams as females can only produce two litters per year while males can produce many (Table 3-4). The registry of New South Wales Border collies did not include the number of show titles obtained by each dog, making it difficult to quantify the correlation between title and offspring number.

Table 3-4 Number of popular sires and dams per offspring count range in the Border collie database for New South Wales

Recorded offspring  Number of Sires  Number of Dams
>300                1                0
200<300             5                0
100<200             50               0
50<100              181              13

3.5 Disease analysis

The number of expected new mutations per generation can be estimated using the number of offspring recorded per generation (Table 3-5). As the number of offspring increases so does the number of expected mutations. When considering random mutations in the genome, roughly 1% of these mutations are expected to be coding. Considering a transition to transversion ratio of 2:1 (Zhang and Gerstein 2003), 74% of random coding mutations are expected to change an amino acid. To this probability the chance that an amino acid change will disrupt a protein's function (0.34 ± 0.06) was applied (Guo et al. 2004), resulting in the number of mutations that will potentially cause a disease. The proportion of dogs registered in each generation that are parents was applied to the number of expected mutations that could cause a disease to represent how many of those mutations are likely to have been passed on to the next generation. From these estimates a total of 35 ± 6 potentially disease causing mutations are likely to currently be in the New South Wales Border collie population. The popular sire effect and low proportion of the population used for breeding result in many of the mutations that arise per generation being lost to the population. The most highly deleterious mutations will result in unviable zygotes and several will be dominant, resulting in dogs that are not used for breeding, so these mutations will also be lost to the population. The number of expected disease causing mutations in the current New South Wales Border collie population is far higher than the 13 recorded genetic diseases in IDID, reflecting the fact that most mutations are lost to the population within a few generations of when they arise. The most likely scenario for one of these mutations to spread through the population is if it arises in or is passed on to a popular sire, as is expected to have happened for TNS and NCL.
The highest producing sire appears in the ancestry of many carriers and affecteds, but his carrier status cannot be inferred with the available data.

Table 3-5 Expected number of total, coding, AA changing, protein function destroying and passed to next generation mutations per generation

Gen  Expected mutations  Expected coding  Expected AA change  Expected destroy protein function  Cumulative
GN1  209788              2098             1594                542 ± 96                           -
GN2  309985              3100             2356                801 ± 141                          121 ± 21
GN3  319702              3197             2430                826 ± 146                          129 ± 23
GN4  362180              3622             2753                936 ± 165                          136 ± 24
GN5  293573              2936             2231                759 ± 134                          102 ± 18
GN6  269588              2696             2049                697 ± 123                          88 ± 16
GN7  269913              2699             2051                697 ± 123                          35 ± 6

The F was calculated per disease state for both TNS and NCL to gain an understanding of the effect of inbreeding on disease prevalence and mutation age. Testing results for TNS and NCL have been collected from a total of 5780 Border collies around the world. Of these, 5275 samples have been tested for TNS and 4785 samples have been tested for NCL. A total of 1576 of the 5780 samples are represented in the New South Wales Border collie database and the rest are from other Australian states or other countries. The pedigree information for samples collected from other countries was added to a separate database and used to calculate the F for tested individuals. The average and median F value was calculated for each disease status of dogs tested for TNS and NCL (Table 3-6). Dogs affected with NCL had an F 1.5 times as high as the median of the dogs in the New South Wales database (p-value = 0.006) and dogs affected with TNS had an F slightly smaller than the median of the dogs in the New South Wales database (p-value = 0.121). Dogs that were found to be carriers or clear of either disease had F values lower than the median of the dogs in the New South Wales database. The difference between the unaffected dogs and the dogs in the New South Wales database arises because each group represents a different but partially overlapping set of dogs. Both TNS and NCL are recessive diseases caused by a single mutation that has spread through the population. This means affecteds have inherited the mutation identical by descent and as such the disease state should be accompanied by a high inbreeding coefficient, depending on the age of the mutation. A recent mutation should not have had time to spread through an entire population and a very high inbreeding coefficient would be expected.
An older mutation will have had more opportunity to disperse through the population and thus an individual could inherit both copies of the mutation identical by descent from less inbred parents, resulting in a lower inbreeding coefficient. It is possible that either or both of the mutations for TNS and NCL have inadvertently been selected for while selecting for a favourable trait in linkage disequilibrium. This could account for why these two diseases are so widespread in the population. The TNS mutation is estimated to be very old as it appears in both the show dog and working dog populations, which have been genetically isolated for at least 50 years. The NCL mutation is estimated to be much younger as it has not been observed in the working dog population and has a lower incidence than TNS. These F values support the estimates of the ages of the TNS and NCL mutations. The fact that dogs affected with TNS are not more inbred than the median of the dogs in the New South Wales database shows how widely distributed the mutation has become in the global populations of pure bred Border collies.

Table 3-6 Average and median inbreeding coefficients for clear, carrier and affected disease status for TNS and NCL of tested dogs compared to the dogs in the New South Wales database

Inbreeding coefficients
Disease  Clear            Carrier          Affected         Population†
         Avg     Median   Avg     Median   Avg     Median   Avg     Median
TNS      5.77%   4.05%    7.11%   5.97%    8.18%   6.42%    9.85%   8.77%
NCL      5.96%   4.41%    7.64%   7.12%    14.43%  15.91%   9.85%   8.77%

† The population value of F represents the median F for the 62,607 Border collies in the New South Wales database. The clear, carrier and affected values of F are calculated only from tested dogs, some but not all of which are included in the New South Wales database.

3.6 Conclusion

The consistent genetic isolation of pure bred dog populations from other breeds ensures that the only new genetic material entering the population is from mutation. A closed population is the most effective way to maintain the phenotype of the population to the breed standard. Any genetic variation introduced from other breeds would shift the phenotype of the offspring away from the breed standard and closer to the introduced breed. However, maintaining the breed to standard decreases genetic variation as dogs not adhering to the standard are removed from the gene pool. The unavoidable inbreeding from this practice allows recessive mutations to quickly spread and manifest in the population. By the time a recessive disease is identified in the population it is already fairly widespread. In order to then rid the population of the recessive disease, a genetic test must be established allowing for selection of offspring that do not carry the mutation. This may have the effect of further reducing genetic variation in the population, potentially causing an unknown mutation to become common. For this reason the best control strategy to remove a disease from the population is to breed it out very slowly. The popular sire effect coupled with line breeding allows recessive mutations to rapidly increase in frequency and genetic disease to become a problem in pedigree dog populations. In order to rid the population of a recessive disease, a genetic test must be established allowing for selection of offspring that do not carry the mutation. Suddenly removing all carriers may have the effect of further reducing genetic variation in the population and potentially allowing another mutation to become common. The inbreeding levels in affecteds may be useful in determining a strategy to remove a recessive disease from a pedigree population without creating a population bottleneck that could introduce more genetic problems.
The data provided here support the findings of Leroy and Baumung (2010) that the popular sire effect increases the spread of disease alleles. The data presented here represent a real population in which the popular sire effect and line breeding are co-occurring. Since the inbreeding coefficient of affected dogs was significantly higher than that of clear dogs, this suggests that the popular sire effect works to spread mutations and line breeding results in those mutations manifesting.

3.7 References

Calboli F.C., Sampson J., Fretwell N. and Balding D.J. (2008). Population structure and inbreeding from pedigree analysis of purebred dogs. Genetics. 179: 593-601.

Guo H.H., Choe J. and Loeb L.A. (2004). Protein tolerance to random amino acid change. Proceedings of the National Academy of Sciences. 101: 9205-9210.

Leroy G. and Baumung R. (2010). Mating practices and the dissemination of genetic disorders in domestic animals, based on the example of dog breeding. Animal Genetics. 42: 89.

Lindblad-Toh K., Wade C.M., Mikkelsen T.S. et al. (2005). Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature. 438: 803-819.

Lynch M. (2010). Rate, molecular spectrum, and consequences of human mutation. Proceedings of the National Academy of Sciences. 107: 961-968.

Melville S.A., Wilson C.L., Chiang C.S., Studdert V.P., Lingaas F. and Wilton A.N. (2005). A mutation in canine CLN5 causes neuronal ceroid lipofuscinosis in Border collie dogs. Genomics. 86: 287-294.

Morse H.C. (1978). Origins of Inbred Mice. Academic Press, New York, 1978.

Pang J.F., Kluetsch C., Zou X.J. et al. (2009). mtDNA data indicate a single origin for dogs south of Yangtze River, less than 16,300 years ago, from numerous wolves. Molecular Biology and Evolution. 26: 2849-2864.

Parker H.G., Kim L.V., Sutter N.B., Carlson S., Lorentzen T.D., Malek T.B., Johnson G.S., DeFrance H.B., Ostrander E.A. and Kruglyak L. (2004). Genetic structure of the purebred domestic dog. Science. 304: 1160-1164.

Sargan D.R. (2004). IDID: inherited diseases in dogs: web-based information for canine inherited disease genetics. Mammalian Genome. 15: 503-506.

Shearman J.R. and Wilton A.N. (2007). Elimination of neutrophil elastase and the genes for adaptor protein complex 3 subunits as the cause of trapped neutrophil syndrome in Border collies. Animal Genetics. 38: 188-189.

Shearman J.R., Zhang Q.Y. and Wilton A.N. (2006). Exclusion of CXCR4 as the cause of Trapped Neutrophil Syndrome in Border Collies using five microsatellites on canine chromosome 19. Animal Genetics. 37: 89.

Shearman J.R. and Wilton A.N. (2011). A Canine Model of Cohen Syndrome: Trapped Neutrophil Syndrome. BMC Genomics. Accepted.

vonHoldt B.M., Pollinger J.P., Lohmueller K.E. et al. (2010). Genome-wide SNP and haplotype analyses reveal a rich history underlying dog domestication. Nature. 464: 898-902.

Wright S. (1922). Coefficients of inbreeding and relationship. The American Naturalist. 56: 330-338.

Zhang Z. and Gerstein M. (2003). Patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes. Nucleic Acids Research. 31: 5338-5348.

4 MAPPING CEREBELLAR ABIOTROPHY IN AUSTRALIAN KELPIES

[2011, Animal Genetics. doi:10.1111/j.1365-2052.2011.02199.x]

Jeremy R. Shearman *,±, Roger W. Cook ‡, Christina McCowan ‡, Jessica L. Fletcher §, Rosanne M. Taylor § and Alan N. Wilton *,†

* School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW 2052, Australia
± National Center for Genetic Engineering and Biotechnology, 113 Phaholyothin Rd., Klong 1, Klong Luang, Pathumthani 12120, Thailand
‡ Faculty of Veterinary Science, University of Melbourne, Werribee 3030, Australia
§ The Faculty of Veterinary Science, The University of Sydney, Camperdown, NSW 2006, Australia
† Clive and Vera Ramaciotti Centre for Gene Function Analysis, University of New South Wales, Sydney, NSW 2052, Australia

Address for correspondence Alan Wilton, School of Biotechnology, University of NSW, Sydney NSW 2052, Australia E-mail: [email protected] Fax: +61 2 9385 1483

4.1 Summary

An autosomal recessive form of cerebellar abiotrophy occurs in Australian Kelpie dogs. Clinical signs range from mild ataxia with intention tremor to severe ataxia with seizures. A whole genome mapping analysis was performed using the Affymetrix Canine SNP array v2 on 11 affected and 19 control dogs, but there was no significant association with disease. A homozygosity analysis identified a three megabase region likely to contain the disease mutation. The region spans 29.8 to 33 Mb on chromosome 3, for which all affected dogs were homozygous for a common haplotype. Microsatellite markers were developed in the candidate region for linkage analysis, which resulted in a LOD score suggestive of linkage. The candidate region contains 29 genes, none of which are known to cause ataxia.

4.2 Introduction

Cerebellar abiotrophy (CA) is the loss of one or more cell types in the cerebellum. Ataxia is the main symptom of CA, but ataxia can also result from several other causes such as viral infections, physical trauma to the head or inherited diseases (for review see Manto and Marmolino 2009). Inherited ataxia can be caused by disruption to any one of a long list of seemingly unrelated genes. Online Mendelian Inheritance in Man contains 296 disease entries with a known molecular basis which refer to ataxia. These include both dominant and recessive diseases. Cerebellar abiotrophies have been described in many breeds of dog. They can be placed into four groups based on the type of cell loss observed. The first group shows loss of Purkinje cells and a decrease in the depth of the granular and molecular layers as occurs in the , Labrador, Miniature Schnauzer and American Staffordshire (Yasuba et al. 1988; Kent et al. 2000; Perille et al. 1991; Bildfell et al. 1995; Berry and Blas-Machado 2003; Hanzlicek et al. 2003; Speciale and de Lahunta 2003; Olby et al. 2004). The second group shows loss of Purkinje cells and a decrease in depth of the granular layer but has no reported loss in the molecular layer. This group includes the , Rhodesian Ridgeback, , Bull dog and Miniature Poodle (Cork et al. 1981; Chieffo et al. 1994; van der Merwe et al. 2001; Gandini et al. 2005; Cummings and de Lahunta 1988). Chen and Hillman (1989) have shown that the number of Purkinje cells present in the developing cerebellum determines the maximum number of granule cells that develop. The dogs in this group may have a decreased number of granule cells as a direct result of fewer Purkinje cells during brain development. The other possibility is that these dogs have a form of CA resulting in apoptosis of Purkinje and granule cells after the brain has developed. The third group shows loss of the granular and molecular cell layers but normal numbers of Purkinje cells. 
In some cases the Purkinje cells appear crowded due to the loss of surrounding cell layers. Dog breeds affected by this form of the disease are the , Bavarian mountain dog, , Italian and Border collie (Tatalick et al. 1993; Flegel et al. 2007; Jokinen et al. 2007; Cantile et al. 2002; Sandy et al. 2002). Other cases of CA in the Brittany were found to have Purkinje cell loss and degeneration of the medulla oblongata and spinal cord (Higgins et al. 1998). This could be a distinct disease in the Brittany or they could be a single disorder with variable caused by variable cell loss. Additional forms of CA that do not fit directly into the above groups form the fourth group. The American Staffordshire and Pit Bull Terrier show loss of Purkinje cells and a fluorescent lipopigment deposit (Siso et al. 2004). Coton de Tulear dogs show a reduced molecular layer while Purkinje cells and the granule cell layer appear normal (Coates 2002). Presence of immune cells in the cerebellum suggests an autoimmune role in Coton de Tulears (Tipold et al. 2000). Kerry blue terriers show loss of Purkinje cells and progressive degeneration of additional brain regions (de Lahunta and Averill 1976; Deforest et al. 1978; Montgomery and Storts 1983; Montgomery and Storts 1984). Bernese mountain dogs show Purkinje cell loss, liver damage and abdominal varicosities (Carmichael et al. 1996). Each pure bred dog population represents a gene pool isolated by controlled breeding practices. This makes it likely that CAs in each breed are due to independent mutations in a number of genes. Identifying the causative mutation in each breed would require separate gene mapping analyses for each breed. An autosomal recessive CA in the Australian Kelpie fits into group three (Thomas and Robertson 1989; Shearman et al. 2008). Australian Kelpies with CA present with ataxia and intention tremors. The disease has been identified in distantly related dogs, indicating that the disease allele is prevalent in the population. The disease shows variable age of onset with symptoms that range from mild to severe. 
Mild symptoms consist of a noticeable intention tremor, a barely noticeable dysmetria and a high step while walking. Severe symptoms are a pronounced intention tremor, a complete lack of coordination and occasional fitting. Three closely related affected dogs from a breeding study by Thomas and Robertson (1989), two litter mates and a half sibling, showed the full range of symptom severity. One pup was mildly affected, another moderately affected and the third severely affected, exhibiting occasional seizures. Affected dogs from that study showed no signs of symptom progression. Histopathological reports on affected dogs suggested a loss of Purkinje cells and decrease in depth of the granular cell layer in the cerebellum (Thomas and Robertson 1989).

Identification of the disease gene would allow breeders to prevent further cases by not mating two carriers and to reduce the incidence of carriers over a number of generations. There are two main approaches to disease gene identification: the functional candidate gene approach and whole genome analysis. The functional candidate gene approach involves identifying candidate genes to be tested based on gene function or a similar disease in a model organism (Zhu and Zhao 2007). There are hundreds of candidate genes for CA, which would require a costly and time consuming process of elimination. A whole genome approach interrogates genetic markers, such as Single Nucleotide Polymorphisms (SNPs) or microsatellites, evenly spaced through the genome (John et al. 2004). A whole genome approach can rapidly identify the disease region using a small number of affecteds and controls (Karlsson et al. 2007). This can reduce the number of good candidate genes from hundreds to tens by identifying a small genomic region containing the mutation. A whole genome analysis was undertaken to identify the region containing the causative mutation for CA occurring in the Australian Kelpie.

4.3 Materials and Methods

4.3.1 Samples

Blood samples of CA affected Australian Kelpie dogs and relatives were collected from breeders. DNA was extracted using a standard salting out method (Miller et al. 1988). Samples from first and second degree relatives of affecteds were collected where possible. A total of 96 samples comprising affected and unaffected control families were used for haplotyping. The samples included three affecteds from Thomas and Robertson (1989; Figure 4-1).

Figure 4-1 Samples received from Thomas and Robertson (1989), processed on SNP arrays and used for linkage analysis. Shaded dogs are affected; numbered dogs represent received samples.

4.3.2 Histopathology

[This section was written by collaborators Roger Cook and Jessica Fletcher and performed by collaborators Roger Cook, Jenny Charles, Jessica Fletcher and Rosanne M. Taylor]

Brain was extracted from CA affected Australian Kelpies post mortem and formalin fixed. Cerebellum was sectioned and stained with Giemsa allowing visualisation of the cell types present.

4.3.3 SNP analysis

The canine SNP array v2 (Karlsson et al. 2007; Affymetrix) was used for whole genome analysis on 33 Australian Kelpies (Figure 4-2). Eleven of the Kelpies were affected with CA, 19 were controls and three were unaffected siblings to an affected. Control dogs were chosen to cover a range of relatedness to affected dogs from close (with common ancestors to affecteds) to distantly related (ten generations or more removed). The arrays were processed by the Clive and Vera Ramaciotti Centre for Gene Function Analysis using the Mapping 500k protocol supplied by Affymetrix. SNP calls were performed using ‘SNP 5 command console’ as per the Broad Institute instructions (http://www.broadinstitute.org/science/projects/mammals-models/dog/canine-array/canine-array-faq#7). Data cleanup was performed using Excel (Microsoft) by converting SNP calls with low confidence (>0.0001) to a no-call. SNPs with greater than five no-calls across the samples were removed. SNPs with a heterozygosity rate greater than 0.7 and SNPs with no variation among the samples were removed using PLINK (Purcell et al. 2007; http://pngu.mgh.harvard.edu/purcell/plink/), leaving 28,479 SNPs. The Hardy-Weinberg equilibrium setting in PLINK was set to the lowest cut off (10^-12) as SNPs in the region surrounding the mutation should by definition not be in Hardy-Weinberg equilibrium. Association and homozygosity analyses were performed on the SNP data using PLINK. The association analysis was performed with 10,000 permutations.
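The three SNP filters described above (no-call count, heterozygosity rate and lack of variation) can be sketched as follows. This is a minimal illustration only, not the Excel and PLINK workflow that was actually used: the genotype coding (0 and 2 for the two homozygotes, 1 for heterozygotes, None for a no-call), the function names and the toy data are all assumptions, as is the choice to compute the heterozygosity rate over called genotypes only.

```python
from typing import Dict, List, Optional

# One genotype call per sample: 0 or 2 = homozygous, 1 = heterozygous,
# None = no-call. This coding is an assumption made for illustration.
Calls = List[Optional[int]]


def passes_qc(calls: Calls, max_no_calls: int = 5, max_het_rate: float = 0.7) -> bool:
    """Return True if one SNP survives the three filters described in the text."""
    called = [c for c in calls if c is not None]
    if len(calls) - len(called) > max_no_calls:  # more than five no-calls
        return False
    if not called:                               # nothing left to assess
        return False
    het_rate = sum(1 for c in called if c == 1) / len(called)
    if het_rate > max_het_rate:                  # heterozygosity rate above 0.7
        return False
    if len(set(called)) == 1:                    # no variation among the samples
        return False
    return True


def filter_snps(snps: Dict[str, Calls]) -> Dict[str, Calls]:
    """Keep only the SNPs that pass every filter."""
    return {name: calls for name, calls in snps.items() if passes_qc(calls)}


# Hypothetical toy data: four SNPs typed in eight dogs.
toy = {
    "snp_ok": [0, 1, 2, 0, 2, 1, 0, 2],
    "snp_monomorphic": [2, 2, 2, 2, 2, 2, 2, 2],
    "snp_mostly_het": [1, 1, 1, 1, 1, 1, 1, 0],
    "snp_many_no_calls": [0, None, None, None, None, None, None, 2],
}
kept = filter_snps(toy)
print(sorted(kept))
```

Only the first toy SNP survives: the others fail the variation, heterozygosity and no-call filters respectively.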

Figure 4-2 Pedigree of all samples processed on Affymetrix SNP arrays. Affected dogs are shaded; numbered dogs represent received samples. Relationship is shown where known. Full pedigree information is missing for samples 6065, 6066, 6067 and 6124 (top right). Many of the control samples listed at the top right have a known pedigree, but the information was not included as the pedigree would require an additional 10 generations to relay the information.

4.3.4 Linkage analysis

Microsatellite markers were developed within the candidate region to provide additional highly informative markers and enable typing of new samples. All samples were tested with these microsatellites, including each sample processed on the microarrays. Five long microsatellites (>30 tetranucleotide repeats) were chosen to maximise the potential for variation (Table 4-1). One minisatellite and six short microsatellites (20-30 dinucleotide repeats) were also chosen (Table 4-2). Primers were designed using Primer3 (Rozen and Skaletsky 2000) and PCR products were labelled for fragment analysis using the universal priming method (Neilan et al. 1997). Fragment analysis was performed using capillary electrophoresis on an ABI3730 and analysed using GeneMapper software (Applied Biosystems). Haplotypes in affected families were identified using cosegregation of linked alleles. Linkage analysis was performed using SuperLink Online (Fishelson and Geiger 2002; Silberstein et al. 2006). The large pedigree was split into four nuclear families with seven affected offspring and 13 unaffected to reduce the complexity so it could be analysed. Splitting the pedigree also accommodated the changes in haplotype caused by the high mutation rate in the long microsatellite group.

Table 4-1 Long microsatellite group: microsatellite name, primer sequence, repeat unit type and count in reference sequence

Microsatellite name†   Forward primer           Reverse primer             Repeat
C3.2871                CAGGCTGCCTAGACCTGACC     TTTTCTCCCTAGCCCGTACC       AAAG x 38
C3.2997                CTTTCAGACCACCATGGAATG    TGTGTTGCTTACTGTCTACTCAGG   AAAG x 36
C3.3193                ATATGAGCTTGTCTCCCTCCAC   AGCACCTAAAGCCTGAACACTG     AAAG x 22, AAAAG x 18
C3.3274                CTCCCTGGTGCCCTTAGTACC    AAGGGTCTGCGTCTCCTTTG       TTTC x 20, CCTTT x 26
C3.3301                TATGGAGAAACACTGGGTCAAG   GGACTGAAAAAGAGAGATAACTGC   AAAGG x 14, AAAAG x 17

† Microsatellites were named based on chromosome number and genomic location in 10s of kb (e.g. C3.2871 is on chromosome 3 at 28.71 Mb).

Table 4-2 Short microsatellite group: microsatellite name, primer sequence, repeat unit type and count in reference sequence

Microsatellite name   Forward primer         Reverse primer           Repeat
C3.2805               TGAAGACAACCAGCTCACCA   GAACATGATGCCTGAAACCTT    AC x 21
C3.2856               TTGAGGCAAGTGAACACACC   TTTGAGTGCCAAAACAGCAG     TG x 25
C3.2869               TTCTTGCCCTTGCTATGCTT   GCACACCAGAGGGAATACAGT    TG x 21
C3.2984               TGCTATGTCCTGGGTGACG    CCCACGTGGAGGTAGTCCT      AG x 28
C3.3228               GGTCTCTGGACCAAGGGTTA   GGCCAGTAACCAGAGTGGAG     TG x 22
C3.3265               GTCTGCCCAGGCGTATTTAG   TGAGGCTCAGAACAATAAAGCA   AC x 25
C3.3312M              CTCCTCCAGCTAACCCAGAG   ACCACCTTGATTTTCCCACA     Minisat x 19
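The marker naming convention used in Tables 4-1 and 4-2 (chromosome number, then location in units of 10 kb) can be decoded mechanically. The helper below is a hypothetical sketch and not part of the original analysis; it assumes that a trailing "M", as in C3.3312M, marks the minisatellite and carries no positional information.

```python
import re
from typing import Tuple


def parse_marker_name(name: str) -> Tuple[int, float]:
    """Return (chromosome, position in Mb) for a marker name such as 'C3.2871'."""
    # Optional trailing 'M' is assumed to flag a minisatellite.
    match = re.fullmatch(r"C(\d+)\.(\d+)M?", name)
    if match is None:
        raise ValueError(f"unrecognised marker name: {name}")
    chromosome = int(match.group(1))
    position_mb = int(match.group(2)) / 100  # units of 10 kb converted to Mb
    return chromosome, position_mb


print(parse_marker_name("C3.2871"))
print(parse_marker_name("C3.3312M"))
```

Applied to C3.2871, the helper returns chromosome 3 and 28.71 Mb, matching the example given in the footnote to Table 4-1.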

4.4 Results and Discussion

4.4.1 Histopathological findings

[This section was written by the thesis author. Experimental work was performed entirely by collaborators Roger Cook, Jenny Charles, Jessica Fletcher and Rosanne Taylor]

Histopathology of four CA affected Kelpies showed that there was a general decrease in the depth of the granular and molecular layers (Figure 4-3). Loss of Purkinje cells was only observed in the cerebellum of a five year old severely affected case. Purkinje cell loss in this case was patchy and strongly associated with depth reduction of the granular and molecular layers and may be a downstream result of cell death in these regions in the cerebellum. Thomas and Robertson (1989) reported a loss of Purkinje cells in Kelpie CA cases with mention of granular layer reduction, but no mention of molecular layer reduction. The cerebellum section published by Thomas and Robertson does show a depth reduction of the molecular layer, but it was not commented on in the paper. The degree of molecular and granular layer loss appears to correlate with symptom severity in these four samples (Figure 4-3). The sample with the most pronounced cell loss was the five year old Kelpie (Figure 4-3D), which was also the most severely affected. Of the eight-month old Kelpies, the most severely affected had the most pronounced cerebellum changes (Figure 4-3C), the moderately affected had less pronounced changes (Figure 4-3B) and the mildly affected dog had no noticeable changes (Figure 4-3A). No loss of Purkinje cells was observed in any of the eight-month old affecteds. Furthermore, the Purkinje cells appeared numerous and clumped in regions of the most pronounced granular and molecular layer reduction in these pups. If these dogs were tested at an older age, it is possible that Purkinje cell loss would develop as observed in the five year old affected. The histopathological findings in this study appear to be consistent with the findings of Thomas and Robertson (1989), if Purkinje cell loss is a downstream effect. The inheritance pattern of CA in Kelpies is consistent with a single autosomal recessive mutation. If samples contained a mixture of two different diseases then two association signals corresponding to two different regions of homozygosity should be apparent in the mapping study.

Figure 4-3: Giemsa stained sections of cerebellum folia from four Australian Kelpies affected with cerebellar abiotrophy. A: eight month old mildly affected Kelpie. B: eight month old moderately affected Kelpie. C: eight month old severely affected Kelpie. D: five year old severely affected Kelpie.

[The section slides in this figure were prepared by Jessica Fletcher and Roger Cook. The section photographs for this figure were taken by Jessica Fletcher]

4.4.2 Genome wide association study

The genome wide association study was carried out using 28,479 SNPs for 11 affected and 19 control dogs (Figure 4-4) using PLINK with 10,000 permutations. If CA is the result of a single mutation in a common ancestor, then all affected dogs should share a large block of homozygosity, which should result in a cluster of significant SNPs in the disease region. No SNPs showed a significant association, but several SNPs spread through the genome gave a –log10 P-value greater than zero. The highest peak was for SNP rs22287890 at 26.66 Mb on chromosome 14, with a value of 0.62 (significance threshold corrected for multiple testing: 5.76). The clustering of significant SNPs expected for a simple genetic cause of CA was not observed (Figure 4-4). Each SNP with a –log10 P-value of ~0.1 or greater was investigated further. The surrounding SNPs were checked for regions of shared homozygosity and genes surrounding the SNP were checked for association with CA or ataxia in human and model organisms. None of the SNPs was in a region homozygous in affected dogs and there were no genes with known links to ataxia or CA within 2 Mb on either side. In case CA is caused by two different genetic defects in Kelpies, a second analysis was carried out where only the three CA cases from the original work by Robertson (1980) were set as affected. The number of samples used was the same, but other CA cases were set as unknown. Even though this could not produce a significant result, it could give some clues to possible gene location. Similar to the previous results, several signals with a –log10 P-value of ~0.1 were observed (Figure 4-5). There were no large regions of homozygosity and no genes with known CA links near these SNPs. The 50 most significant SNPs in this analysis were different from the 50 most significant in the previous analysis, with the exception of SNP rs8789909 at 59.45 Mb on chromosome 20, which scored ~0.1 in both analyses. Since only one of the top 50 SNPs identified was the same in both analyses, the signals observed are probably not related to the presence of a causative gene for CA and are likely due to insignificant differences between the samples. The lack of signal in the genome wide association study is most likely due to lack of variation at the region. A power study shows that sufficient power is achieved with as few as 20 affected and 20 control dogs for an autosomal recessive mutation (Wade et al. 2006). This power study has been confirmed by mapping a trait using only 9 affected and 12 control dogs (Karlsson et al. 2007). The lack of significant peaks in the association analysis shows that there is little to no stratification between affected and control samples.
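The text does not state how the multiple-testing threshold of 5.76 was derived. As a hedged illustration, the value is what a Bonferroni correction would give if a genome-wide significance level of 0.05 is assumed across the 28,479 SNPs tested; the significance level and the choice of Bonferroni are assumptions here, not details taken from the analysis.

```python
import math

n_snps = 28479        # SNPs remaining after data cleanup (from the text)
alpha = 0.05          # assumed genome-wide significance level

# Bonferroni-corrected per-SNP threshold, expressed as -log10(P).
threshold = -math.log10(alpha / n_snps)
print(round(threshold, 2))

best_observed = 0.62  # highest -log10 P-value reported, for rs22287890
print(best_observed < threshold)
```

Under these assumptions the strongest observed signal falls roughly five orders of magnitude short of the corrected threshold.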

[Figure 4-4 plot. Title: All cases GWAS. Y-axis: -log10 P-value (0 to 0.7). X-axis: SNPs across whole genome.]

Figure 4-4 Whole genome association study for the CA region in Australian Kelpies using 11 affecteds and 19 controls

[Figure 4-5 plot. Title: 3 confirmed cases GWAS. Y-axis: -log10 P-value (0 to 1.4). X-axis: SNPs across whole genome.]

Figure 4-5 Whole genome association study for the CA region in Australian Kelpies using 3 affecteds, 19 controls and 8 suspected cases set to unknown status

4.4.3 Homozygosity analysis

A homozygosity analysis was carried out on the SNP data using PLINK to identify any loci where affecteds were homozygous but controls were not. A region on chromosome 3 from 28 Mb to 33.8 Mb where all affecteds were homozygous for a common haplotype was identified (Figure 4-6). The region is defined by the SNPs ‘chr3.27996340’ and ‘chr3.33836297’. One affected is heterozygous at several SNPs up to and including ‘chr3.29768263’ and shares the homozygous haplotype only between ‘chr3.29768263’ and ‘chr3.33836297’. If the diagnosis of CA is correct in this case, the disease gene region must be between 29.8 Mb and 33.8 Mb. No other regions of homozygosity greater than 100 kb were shared by all affecteds. Control dogs showed a mixture of homozygosity and heterozygosity for the region (Figure 4-6). Six control dogs were homozygous for the SNPs that define the candidate region. Three of these were homozygous for the same SNP haplotype as the affecteds in the candidate region. Thirteen control dogs were heterozygous at twenty SNPs on average out of the fifty-two that define the candidate region. In most cases the same haplotype as identified in affecteds is observed in controls. This lack of variation in the region would explain why the genome wide association study did not identify any significant associations in the region. The three unaffected siblings of a CA affected dog were also processed on the arrays and found to be homozygous for the same haplotype as the affecteds. Together with the lack of variation in the controls, this suggests that the mutation responsible for CA has occurred in a common haplotype and only some copies carry the CA mutation. This means that the haplotype cannot be used to identify likely carriers of the CA mutation.
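The idea behind the homozygosity analysis, finding stretches of consecutive SNPs at which every affected dog carries the same homozygous genotype, can be sketched as below. This is a simplified illustration and not PLINK's own algorithm; the genotype coding ('0' and '2' for homozygotes, '1' for heterozygotes), the function name and the toy data are assumptions.

```python
from typing import List, Tuple


def shared_homozygous_runs(
    positions: List[int], affected_calls: List[str], min_snps: int = 2
) -> List[Tuple[int, int, int]]:
    """Find runs of consecutive SNPs where all affecteds share one homozygous genotype.

    affected_calls[i] holds one character per affected dog for SNP i.
    Returns a list of (start_bp, end_bp, n_snps) tuples.
    """
    runs = []
    start = None
    for i, calls in enumerate(affected_calls):
        # Shared means a single genotype across all dogs, and that genotype is homozygous.
        shared = len(set(calls)) == 1 and calls[0] in "02"
        if shared and start is None:
            start = i
        if not shared and start is not None:
            if i - start >= min_snps:
                runs.append((positions[start], positions[i - 1], i - start))
            start = None
    # Close a run that extends to the last SNP.
    if start is not None and len(affected_calls) - start >= min_snps:
        runs.append((positions[start], positions[-1], len(affected_calls) - start))
    return runs


# Hypothetical toy data: six SNPs typed in four affected dogs.
positions = [100, 200, 300, 400, 500, 600]
calls = ["0012", "2222", "0000", "2222", "2212", "0000"]
print(shared_homozygous_runs(positions, calls))
```

In the toy data only the second to fourth SNPs form a shared run; a single heterozygous call in one dog is enough to break a run, which is how one affected can narrow the candidate region.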

Sample order (one genotype digit per sample, coded 0, 1 or 2)
Affecteds: 6001 6003 6025 6044 6050 6054 6065 6066 6067 6124 6149
Controls: 6088 6018 6043 6047 6059 6062 6064 6072 6073 6074 6102 6077 6078 6079 6082 6091 6099 6052 6053
Unaffected siblings: 6145 6147 6148

SNP name        Affecteds     Controls              Unaffected siblings
chr3.27587058   00100022101   0010010011001100000   011
chr3.27587343   22122200121   2212212211221122222   211
chr3.27861785   00000000001   0000010002001111022   011
chr3.27972664   11122200121   1211212100201111000   211
chr3.27996340   00110022101   1011010122021111101   011
chr3.28291082   22122222222   2212222220222122101   222
chr3.28354775   22222222222   1221212210221121122   222
chr3.28945514   22222222222   2222222220221201221   222
chr3.29002208   22222222222   2222222220221222222   222
chr3.29459535   00000000000   0000000000000021100   000
chr3.29465502   00000000000   0000000000001020122   000
chr3.29700176   22122222222   1201212110201101200   222
chr3.29723351   00100000000   1021010110020101121   000
chr3.29763908   22122222222   1211212111201222222   222
chr3.29768263   00100000000   1021010111021101121   000
chr3.29780756   00000000000   0000000000000000021   000
chr3.29869180   22222222222   2222222222222222122   222
chr3.29903841   00000000000   0010000000001100021   000
chr3.29949728   22222222222   2222222121202221122   222
chr3.30002398   00000000000   1011000110020100000   000
chr3.30002474   00000000000   1011000110020100000   000
chr3.30059175   00000000000   1011000110021120022   000
chr3.30143129   00000000000   0000000000001020101   000
chr3.30241896   22222222222   2212212222221101221   222
chr3.30326469   22222222222   1211212110201101121   222
chr3.30364362   22222222222   2222222121201202221   222
chr3.30416931   00000000000   1001010011001021101   000
chr3.30545981   22222222222   1221212111201201121   222
chr3.30597963   22222222222   1211212111201101121   222
chr3.30606257   00000000000   0000010000001021001   000
chr3.30648725   00000000000   0000000000001020001   000
chr3.30684680   00000000000   1001010010000001100   000
chr3.30684786   00000000000   0010000101020100000   000
chr3.30695678   22222222222   1211212111202121122   222
chr3.30769780   22222222222   2222222222222222222   222
chr3.30774041   22222222222   2222222222222221222   222
chr3.30782110   00000000000   1011010111021120122   000
chr3.30899994   22222222222   1221212111202222122   222
chr3.30942338   00000000000   1011010011001120122   000
chr3.30943402   22222222222   1211212211222122122   222
chr3.30949459   22222222222   1211212211222121122   222
chr3.30986655   00000000000   0000000000000001000   000
chr3.31010174   22222222222   2222222221222212201   222
chr3.31059268   00000000000   0000000000000010021   000
chr3.31108824   22222222222   2222222221221202100   222
chr3.31163217   22222222222   1211212212221111121   222
chr3.31766592   22222222222   2222222221222222222   222
chr3.31823224   00000000000   0000000000000000000   000
chr3.32049502   22222222222   2222222222221211221   222
chr3.32288273   22222222222   2222212222201211221   222
chr3.32292529   00000000000   0000000101000010121   000
chr3.32310426   22222222222   2222222121222212101   222
chr3.32610115   00000000000   0010010101021100000   000
chr3.32913190   00000000000   0010000100001110121   000
chr3.33008129   00000000000   0000000000011000100   000
chr3.33061677   22222222222   2212212221211111201   222
chr3.33202953   22222222222   1221212212222222122   222
chr3.33836297   11211200121   1212222112201101200   211
chr3.33856364   11011022101   1010000110021121122   011
chr3.34131884   11211200121   0211212102211121022   211
chr3.34560969   11221222222   2222212221221221122   021
chr3.34668969   11001000000   0000010001001001100   201
chr3.34672648   11001000000   0000010001001001100   201
chr3.35172960   00000000000   0000010001012111121   000
chr3.35393431   11212200121   1212212112212102200   211
chr3.35490383   11212200121   1212212112212102200   211
chr3.35561727   11212200121   1212212112211222222   211
chr3.35663990   11212200121   1212212112212222222   211
chr3.35742496   11212200121   1212212111212221222   211

Figure 4-6 SNP data of candidate region for 11 affected Kelpies, 19 control Kelpies and 3 unaffected siblings to affected 6149 for SNPs between 27.6 and 35.7 Mb on chromosome 3. Alleles are colour coded, green represents homozygosity for allele 1, blue represents homozygosity for allele 2 and red represents heterozygosity. SNP names (left) are given as chromosome number and genomic location. Samples (top) are divided into three groups, affecteds on the left, controls in the middle and unaffected siblings on the right.

4.4.4 Linkage analysis

To differentiate between CA affected and unaffected copies of the SNP haplotype in the identified homozygous region, microsatellites were typed for all dogs processed on the SNP arrays plus their family members. Each of the long microsatellites revealed a large amount of variation in the Kelpies. Differences in alleles were observed in most dogs, including between affecteds (Table 4-3). This suggests accumulation of mutations on a recent ancestral haplotype which carried the CA mutation. Segregation of these microsatellite haplotypes in families with CA shows inheritance patterns consistent with this region containing the CA mutation (Figure 4-7). Since affecteds have a different haplotype to unaffected siblings, there is no evidence to suggest incomplete penetrance.

Table 4-3 Total allele counts and counts between affecteds and controls for each microsatellite in the long microsatellite group

C3.2871
Allele size   Total count   Control count   Affected count
340           5             2               0
344           2             1               0
356           5             1               0
360           1             0               1
363           16            2               2
367           33            4               4
371           48            7               10
375           23            8               0
378           5             0               1
394           1             0               0
400           2             1               0
404           4             3               0
408           1             1               0

C3.2997
Allele size   Total count   Control count   Affected count
251           2             1               0
254           4             2               0
258           5             2               0
262           5             2               0
265           5             2               0
270           2             1               0
274           1             1               0
279           3             1               0
282           2             1               0
297           4             0               2
301           72            5               10
305           32            9               3
309           7             1               3

C3.3193
Allele size   Total count   Control count   Affected count
356           2             1               0
383           8             0               1
387           39            2               5
390           62            14              9
394           15            5               2
399           5             1               1
404           1             1               0
412           5             2               0
416           1             0               0
422           2             1               0
426           1             1               0
433           1             0               0

C3.3274
Allele size   Total count   Control count   Affected count
336           1             1               0
340           5             1               0
363           1             0               0
376           2             2               0
387           1             1               0
417           3             1               0
423           2             2               0
430           7             1               0
435           1             1               0
441           1             1               0
486           4             3               0
490           16            6               3
494           10            2               1
500           34            3               6
504           26            2               5
508           27            2               3
520           4             0               0
537           1             1               0

C3.3301
Allele size   Total count   Control count   Affected count
464           4             1               0
469           10            4               0
474           51            4               8
478           40            7               5
483           31            6               5
488           2             1               0
509           1             0               0
524           1             1               0

Figure 4-7 Inheritance patterns in affected families for microsatellites in the long microsatellite group. Affecteds are bold and underlined. Alleles have been colour coded per microsatellite for ease of viewing.

Kelpie samples were also typed for the short microsatellites, which showed genetic variation between affecteds and controls (Table 4-4). Within CA families, no differences were observed between the common haplotype carrying the CA defect and the common haplotype without it, so parents of CA cases were homozygous for the short microsatellite haplotype (Figure 4-8). This lack of variation in short microsatellites within the region is consistent with the SNP data and confirms that the haplotype associated with the disease cannot be used to identify the disease allele. The two outside markers, microsatellite C3.2805 and minisatellite C3.3312M, were heterozygous in several affecteds, which could be due to recombination in past generations or to mutation (Figure 4-8). The affected case with the short SNP haplotype was heterozygous at microsatellites C3.2856 and C3.2869 in the heterozygous region, which is consistent with the SNP data.

Linkage analysis was performed on each set of microsatellite data. The result for the long microsatellite group is indicative of linkage, with a maximum logarithm of odds (LOD) score of 2.37, but the short microsatellites show no evidence of linkage (Figure 4-9). Splitting the pedigree into four nuclear families and treating them as unrelated is conservative, as it reduces the LOD score relative to that which would be obtained from one large pedigree with the same haplotype segregating with the disease in each branch. The amount of pedigree information included in the analysis of the short microsatellites does not change the LOD score, as the markers are not informative. The positive LOD score for the long microsatellite group is good supporting evidence that the region on chromosome 3 from 28-33 Mb does contain the mutation causing CA. Each of the microsatellites in the long microsatellite group was fully informative in these pedigrees, so the maximum LOD score obtainable with these pedigrees was 2.37. The lack of power to reach a statistically significant result is due to the small number of affected pedigrees available; increasing the number of affected pedigrees would be expected to give a statistically significant LOD score of greater than three.
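As a point of reference for the LOD scores discussed above, the two-point LOD score for phase-known, fully informative meioses can be written down directly. The sketch below is a simplified illustration only: it is not the multipoint calculation used for Figure 4-9, and the function name and the meiosis counts in the example are hypothetical.

```python
import math


def two_point_lod(n_meioses, n_recombinants, theta):
    """Two-point LOD score for phase-known, fully informative meioses.

    LOD(theta) = log10(L(theta) / L(0.5)), where
    L(theta) = theta**r * (1 - theta)**(n - r).
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    n, r = n_meioses, n_recombinants
    if theta == 0.0:
        # At theta = 0 a single observed recombinant makes the likelihood zero.
        if r > 0:
            return float("-inf")
        return n * math.log10(2.0)
    log_l_theta = r * math.log10(theta) + (n - r) * math.log10(1.0 - theta)
    log_l_null = n * math.log10(0.5)
    return log_l_theta - log_l_null


# Hypothetical example: non-recombinant informative meioses needed to pass
# the conventional significance threshold of LOD = 3 at theta = 0.
needed = math.ceil(3.0 / math.log10(2.0))
print(needed)                                   # 10
print(round(two_point_lod(needed, 0, 0.0), 2))  # 3.01
```

Under this simplified model each non-recombinant informative meiosis adds log10(2), about 0.30, to the score, which illustrates why a small number of pedigrees limits the LOD score that can be reached.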

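Table 4-4 Total allele counts and counts between affecteds and controls for each microsatellite in the short microsatellite group

Microsatellite | Allele size | Total count | Control count | Affected count
C3.2805 | 208 | 69 | 5 | 12
C3.2805 | 210 | 9 | 1 | 1
C3.2805 | 214 | 2 | 2 | 0
C3.2805 | 216 | 17 | 4 | 0
C3.2805 | 218 | 12 | 4 | 0
C3.2805 | 220 | 70 | 17 | 13
C3.2805 | 223 | 1 | 1 | 0
C3.2856 | 248 | 14 | 4 | 1
C3.2856 | 252 | 2 | 0 | 0
C3.2856 | 256 | 12 | 4 | 0
C3.2856 | 258 | 2 | 1 | 1
C3.2856 | 260 | 9 | 2 | 0
C3.2856 | 262 | 129 | 19 | 16
C3.2856 | 264 | 1 | 1 | 0
C3.2856 | 266 | 7 | 1 | 0
C3.2869 | 349 | 2 | 0 | 0
C3.2869 | 355 | 1 | 1 | 0
C3.2869 | 357 | 142 | 22 | 27
C3.2869 | 359 | 18 | 4 | 0
C3.2869 | 361 | 3 | 2 | 0
C3.2869 | 363 | 8 | 3 | 0
C3.2869 | 365 | 10 | 2 | 1
C3.2984 | 292 | 46 | 15 | 0
C3.2984 | 303 | 3 | 0 | 2
C3.2984 | 309 | 128 | 16 | 22
C3.2984 | 313 | 1 | 1 | 0
C3.3228 | 217 | 6 | 3 | 0
C3.3228 | 230 | 8 | 2 | 0
C3.3228 | 231 | 151 | 22 | 18
C3.3228 | 234 | 1 | 0 | 0
C3.3228 | 238 | 8 | 3 | 0
C3.3228 | 240 | 1 | 1 | 0
C3.3228 | 242 | 1 | 1 | 0
C3.3265 | 252 | 171 | 29 | 18
C3.3265 | 262 | 3 | 1 | 0
C3.3312M | 608 | 71 | 5 | 10
C3.3312M | 786 | 64 | 11 | 12
C3.3312M | 822 | 4 | 1 | 0
C3.3312M | 932 | 17 | 6 | 0
C3.3312M | 968 | 18 | 7 | 0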

Figure 4-8 Inheritance patterns in affected families for microsatellites in the short microsatellite group. Affecteds are bold and underlined. Alleles have been colour coded per microsatellite for ease of viewing.

Figure 4-9 Multipoint linkage analysis of long and short microsatellite groups. Significance threshold is indicated by the red line.

4.5 Conclusion

The 3 Mb region on CFA 3 identified as the location of the CA-causing mutation contains 29 genes. None of these genes is known to be associated with CA or ataxia in human or mouse according to the OMIM or MGI databases. The wide range in severity of clinical signs for CA suggests that CA could be the result of a regulatory mutation rather than an altered protein sequence. Identifying the causative mutation may require expression analysis of the genes in the region and/or sequencing of the entire candidate region using next generation sequencing. It is important to note that genome wide association studies can fail to predict the location of a disease gene occurring on a common haplotype. Linkage analysis was able to support the homozygosity mapping result for the disease gene location because it used long-repeat microsatellite loci with high mutation rates. These high mutation rates resulted in the common haplotype mutating to several new haplotypes over the past few generations, allowing haplotypes that are identical at the SNP level within families to be distinguished.

4.6 References

Berry M.L. and Blas-Machado U. (2003) Cerebellar abiotrophy in a miniature schnauzer. The Canadian Veterinary Journal. 44: 657-659.
Bildfell R.J., Mitchell S.K. and de Lahunta A. (1995) Cerebellar cortical degeneration in a . The Canadian Veterinary Journal. 36: 570-572.
Cantile C., Salvadori C., Modenato M., Arispici M. and Fatzer R. (2002) Cerebellar granuloprival degeneration in an Italian hound. Journal of veterinary medicine. A, Physiology, pathology, clinical medicine. 49: 523-525.
Carmichael K.P., Miller M., Rawlings C.A., Fischer A., Oliver J.E. and Miller B.E. (1996) Clinical, hematologic, and biochemical features of a syndrome in Bernese mountain dogs characterized by hepatocerebellar degeneration. Journal of the American Veterinary Medical Association. 208: 1277-1279.
Chen S. and Hillman D.E. (1989) Regulation of granule cell number by a predetermined number of Purkinje cells in development. Brain Research Developmental Brain Research. 45: 137-147.
Chieffo C., Stalis I.H., Van Winkle T.J., Haskins M.E. and Patterson D.F. (1994) Cerebellar Purkinje's cell degeneration and coat color dilution in a family of Rhodesian Ridgeback dogs. Journal of Veterinary Internal Medicine. 8: 112-116.
Coates J.R., O'Brien D.P., Kline K.L., Storts R.W., Johnson G.C., Shelton G.D., Patterson E.E. and Abbott L.C. (2002) Neonatal cerebellar ataxia in Coton de Tulear dogs. Journal of Veterinary Internal Medicine. 16: 680-689.
Cork L.C., Troncoso J.C. and Price D.L. (1981) Canine inherited ataxia. Annals of Neurology. 9: 492-498.
Cummings J.F. and de Lahunta A. (1988) A study of cerebellar and cerebral cortical degeneration in miniature poodle pups with emphasis on the ultrastructure of Purkinje cell changes. Acta Neuropathologica. 75: 261-271.
Deforest M.E., Eger C.E. and Basrur P.K. (1978) Hereditary cerebellar neuronal abiotrophy in a Kerry Blue Terrier dog. The Canadian Veterinary Journal. 19: 198-202.
de Lahunta A. and Averill D.R. Jr. (1976) Hereditary cerebellar cortical and extrapyramidal nuclear abiotrophy in Kerry Blue Terriers. Journal of the American Veterinary Medical Association. 168: 1119-1124.
Fishelson M. and Geiger D. (2002) Exact genetic linkage computations for general pedigrees. Bioinformatics. 18 Suppl. 1: S189-S198.
Flegel T., Matiasek K., Henke D. and Grevel V. (2007) Cerebellar cortical degeneration with selective granule cell loss in Bavarian mountain dogs. The Journal of Small Animal Practice. 48: 462-465.
Gandini G., Botteron C., Brini E., Fatzer R., Diana A. and Jaggy A. (2005) Cerebellar cortical degeneration in three English bulldogs: clinical and neuropathological findings. The Journal of Small Animal Practice. 46: 291-294.
Hanzlícek D., Kathmann I., Bley T., Srenk P., Botteron C., Gaillard C. and Jaggy A. (2003) Cerebellar cortical abiotrophy in American Staffordshire terriers: clinical and pathological description of 3 cases. Schweizer Archiv fur Tierheilkunde. 145: 369-375.
Higgins R.J., LeCouteur R.A., Kornegay J.N. and Coates J.R. (1998) Late-onset progressive spinocerebellar degeneration in Brittany Spaniel dogs. Acta Neuropathologica. 96: 97-101.
John S., Shephard N., Liu G., et al. (2004) Whole-genome scan, in a complex disease, using 11,245 single-nucleotide polymorphisms: comparison with microsatellites. American Journal of Human Genetics. 75: 54-64.
Jokinen T.S., Rusbridge C., Steffen F., Viitmaa R., Syrja P., de Lahunta A., Snellman M. and Cizinauskas S. (2007) Cerebellar cortical abiotrophy in Lagotto Romagnolo dogs. The Journal of Small Animal Practice. 48: 470-473.
Karlsson E.K., Baranowska I., Wade C.M. et al. (2007) Efficient mapping of mendelian traits in dogs through genome-wide association. Nature Genetics. 39: 1321-1328.
Kent M., Glass E. and de Lahunta A. (2000) Cerebellar cortical abiotrophy in a beagle. The Journal of Small Animal Practice. 41: 321-323.
Manto M. and Marmolino D. (2009) Cerebellar . Current Opinion in Neurology. 22: 419-429.
Miller S.A., Dykes D.D. and Polesky H.F. (1988) A simple salting out procedure for extracting DNA from human nucleated cells. Nucleic Acids Research. 16: 1215.
Montgomery D.L. and Storts R.W. (1983) Hereditary striatonigral and cerebello-olivary degeneration of the Kerry blue terrier. I. Gross and light microscopic central nervous system lesions. Veterinary Pathology. 20: 143-159.
Montgomery D.L. and Storts R.W. (1984) Hereditary striatonigral and cerebello-olivary degeneration of the Kerry Blue Terrier. II. Ultrastructural lesions in the caudate nucleus and cerebellar cortex. Journal of neuropathology and experimental neurology. 43: 263-275.
Olby N., Blot S., Thibaud J.L., Phillips J., O'Brien D.P., Burr J., Berg J., Brown T. and Breen M. (2004) Cerebellar cortical degeneration in adult American Staffordshire Terriers. Journal of Veterinary Internal Medicine. 18: 201-208.
Perille A.L., Baer K., Joseph R.J., Carrillo J.M. and Averill D.R. (1991) Postnatal cerebellar cortical degeneration in Labrador Retriever puppies. The Canadian Veterinary Journal. 32: 619-621.
Purcell S., Neale B., Todd-Brown K. et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics. 81: 559-575.
Rozen S. and Skaletsky H. (2000) Primer3 on the WWW for general users and for biologist programmers. Methods in Molecular Biology. 132: 365-386.
Sandy J.R., Slocombe R.E., Mitten R.W. and Jedwab D. (2002) Cerebellar abiotrophy in a family of Border Collie dogs. Veterinary Pathology. 39: 736-738.
Shearman J.R., Lau V.M. and Wilton A.N. (2008) Elimination of SETX, SYNE1 and ATCAY as the cause of Cerebellar Abiotrophy in Australian Kelpies. Animal Genetics. 39: 573.
Silberstein M., Tzemach A., Dovgolevsky N., Fishelson M., Schuster A. and Geiger D. (2006) Online system for faster multipoint linkage analysis via parallel execution on thousands of personal computers. American Journal of Human Genetics. 78: 922-935.
Siso S., Hanzlícek D., Fluehmann G., Kathmann I., Tomek A., Papa V. and Vandevelde M. (2004) Neurodegenerative diseases in domestic animals: a comparative review. Veterinary Journal. 171: 20-38.
Speciale J. and de Lahunta A. (2003) in a mature Staffordshire terrier. Journal of the American Animal Hospital Association. 39: 459-462.
Tatalick L.M., Marks S.L. and Baszler T.V. (1993) Cerebellar abiotrophy characterized by granular cell loss in a Brittany. Veterinary Pathology. 30: 385-388.
Thomas J.B. and Robertson D. (1989) Hereditary cerebellar abiotrophy in Australian kelpie dogs. Australian Veterinary Journal. 66: 301-302.
Tipold A., Fatzer R., Jaggy A., Moore P. and Vandevelde M. (2000) Presumed immune-mediated cerebellar granuloprival degeneration in the Coton de Tulear breed. Journal of Neuroimmunology. 110: 130-133.
Neilan B.A., Wilton A.N. and Jacobs D. (1997) A universal procedure for primer labelling of amplicons. Nucleic Acids Research. 25: 2938-2939.
van der Merwe L.L. and Lane E. (2001) Diagnosis of cerebellar cortical degeneration in a Scottish terrier using magnetic resonance imaging. The Journal of Small Animal Practice. 42: 409-412.
Yasuba M., Okimoto K., Iida M. and Itakura C. (1988) Cerebellar cortical degeneration in beagle dogs. Veterinary Pathology. 25: 315-317.
Zhu M. and Zhao S. (2007) Candidate gene identification approach: progress and challenges. International Journal of Biological Sciences. 3: 420-427.

5 SEQUENCING THE CEREBELLAR ABIOTROPHY REGION IN THE AUSTRALIAN KELPIE

Jeremy R. Shearman and Alan N. Wilton

School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW 2052, Australia and Clive and Vera Ramaciotti Centre for Gene Function Analysis, University of New South Wales, Sydney, NSW 2052, Australia

Address for correspondence Alan Wilton, School of Biotechnology, University of NSW, Sydney NSW 2052, Australia E-mail: [email protected] Fax: +61 2 9385 1483

5.1 Summary

An autosomal recessive Cerebellar Abiotrophy (CA) in the Australian Kelpie, characterised by a high step when walking, intention tremor and ataxia, has been mapped to canine chromosome three. Nimblegen sequence capture arrays were used to enrich for the candidate region from 27 to 33 Mb, for a total tiling region of 5.2 Mb. Two affected Kelpies and one control Kelpie were processed, resulting in enrichments of 251x and 430x for the affecteds and 316x for the control. Multiplex identifiers were added and the DNA was sequenced using titanium chemistry on a 454 sequencer. The total sequence output was 313.8 Mb: one affected gave 31.1 Mb of sequence, the other affected gave 112 Mb and the control gave 170.6 Mb. A comparison of gaps in the sequencing coverage with gaps in the tiling of the sequence capture probes shows that 98% of gaps in the affected sequence and 97% of gaps in the control sequence are located within 200 bp of a gap in capture probe coverage. A total of 2019 sequence differences that were homozygous in the affecteds and differed from the control were identified; 682 of those were in genic regions, 25 were in exons and eight changed an amino acid. Seven of the non-synonymous differences were investigated in a sample set of 96 Kelpies and did not segregate with the disease. The remaining differences have been prioritised for further investigation and include the unchecked coding difference, differences in 5′ and 3′ untranslated regions and differences in intergenic regions that are highly conserved between dog and human.

5.2 Introduction

An autosomal recessive Cerebellar Abiotrophy (CA) in the Australian Kelpie appears widespread in both working and show lineages of the breed. The disease is characterised by ataxia, including intention tremor and a high stepping gait. Clinical signs range from mild and barely noticeable to severe, including occasional fitting. Large variation in clinical signs has been observed between full and half siblings (Thomas and Robertson 1989). Histopathological changes in the cerebellum were described by Thomas and Robertson as a loss of Purkinje cells and a reduction in granular and molecular layer depth in affected individuals. Cerebellum from four recent samples showed a reduction in the depth of the granular and molecular layers, with apparent secondary loss of Purkinje cells in a severe case in a five-year-old dog (Shearman et al. 2011). Histopathology of this dog shows localised regions of reduced cell density in the cerebellum rather than uniform reduction (Shearman et al. 2011).

A whole genome scan using the Affymetrix canine SNP chip v2 identified a region of homozygosity common to all affecteds. This homozygous region localised the causative mutation to chromosome 3 between 28 and 33.1 Mb, with a possible recombinant narrowing the region to between 29.8 and 33.1 Mb (Shearman et al. 2011). The region did not show significant association with markers because some controls shared the same haplotype. It is suspected that there are two versions of the haplotype in the population, one with the CA mutation and one without. Linkage analysis using microsatellites with a high mutation rate and variability within the common haplotype was suggestive of the region being linked to the disease (Shearman et al. 2011).

The dog genome sequence (NCBI) lists 44 genes in the larger CA region and 29 genes in the narrowed CA region. Eight of the 44 genes have a known phenotype when disrupted by mutation in humans (OMIM). The canine genes in the region are predicted from homology to human and mouse sequence, with only AP3B1 confirmed experimentally in the dog. Information available in the literature on the genes is summarised in Table 5-1.

Table 5-1: Genes in the Kelpie cerebellar abiotrophy candidate region showing location and known structure/function information

Canine gene | Gene position (Mb on CFA3) | Gene name | Gene symbol (Human) | Structure/function information
LOC610097 | 28.052 – 28.067 | Vacuolar ATP synthase subunit S1 precursor (V-ATPase S1 subunit) | ATP6AP1 | A subunit of a V-ATPase proton pump expressed in all cells (Beyenbach and Wieczorek 2006). Zebrafish deficient in atp6ap1 develop oculocutaneous albinism and experienced elevated levels of cell apoptosis in their retinas and (Nuckels et al. 2009).
LOC479159 | 28.092 – 28.093 | Ribosomal protein S23 | RPS23 | Mouse knock outs of a ribosomal gene are homozygous lethal. Heterozygous knock outs can result in haploinsufficiency disorders, such as the Minute phenotype in Drosophila (Marygold et al. 2007).
LOC610103 | 28.107 – 28.307 | APG10 autophagy 10-like | ATG10 | Essential for the formation of autophagosomes (Ravikumar et al. 2009). Blocking autophagosome formation increases the susceptibility of cells to pro-apoptotic insults (Ravikumar et al. 2006).
LOC479160 | 28.499 – 28.783 | Single-stranded DNA-binding protein 2 | SSBP2 | Commonly deleted in cases of acute myeloid leukemia in humans. Inducing expression causes loss of C-MYC expression and cell cycle arrest (Liang et al. 2005).
LOC488925 | 28.811 – 28.854 | Cytosolic acetyl-CoA hydrolase | ACOT12 | Acyl-CoA thioesterase 12 converts acetyl-CoA to acetate and is most highly expressed in proximal intestinal epithelium, liver and kidney (Westin et al. 2008).
LOC479161 | 28.856 – 28.864 | Zinc finger, CCHC domain containing 9 | ZCCHC9 | No information on ZCCHC9. ZCCHC12 is a transcriptional co-activator of activator protein 1 (AP-1) and cAMP response element binding protein (CREB) (Li et al. 2009). ZCCHC11 is a 3′ terminal uridylyl transferase that uridylates cytokine-targeting miRNAs (Jones et al. 2009; Hagan et al. 2009).
LOC479163 | 28.894 – 28.909 | Creatine kinase, sarcomeric mitochondrial precursor | CKMT2 | Catalyses the transfer of phosphate from ATP to creatine generating ADP and creatine phosphate (Klein et al. 1991).
LOC610203 | 28.925 – 29.170 | Ras protein-specific guanine nucleotide-releasing factor 2 | RASGRF2 | Component of the signalling machinery involved in T cell mediated immune responses (Ruiz et al. 2007). Homozygous knock-out mice have a normal phenotype (Fernandez-Medarde et al. 2002), but a higher tendency to develop lymphoblastic lymphoma-like tumors in aged mice (Ruiz et al. 2009).
LOC479164 | 29.219 – 29.404 | mutS homolog 3 | MSH3 | DNA mismatch repair gene, mutations associated with cancer and microsatellite instability (Risinger et al. 1996; Akiyama et al. 1997; Lipkin et al. 2000).
LOC479165 | 29.404 – 29.430 | Dihydrofolate reductase | DHFR | Converts dihydrofolate into tetrahydrofolate, role in the synthesis of purines, thymidylic acid, and some amino acids. Deficiency results in megaloblastic anaemia (Tauro et al. 1976), while over expression results in methotrexate resistance (Mishra et al. 2007).
LOC488927 | 29.483 – 29.485 | Ankyrin repeat domain 34B | ANKRD34B | Upregulated in bone marrow cells that become dendritic cells (Al-Shaibi and Ghosh 2004). Ankyrins in general function to target ion channels and pumps to their appropriate membrane (Bennet and Chen 2001).
LOC479166 | 29.499 – 29.527 | AASA9217 | AASA9217 | Hypothetical protein only.
LOC479167 | 29.547 – 29.582 | Zinc finger, FYVE domain containing 16 (KIAA0305) | ZFYVE16 | Implicated in membrane trafficking and functions as a scaffold protein to facilitate transforming growth factor-β signaling (Seet et al. 2004; Shi et al. 2007).
LOC488928 | 29.632 – 29.634 | Likely ortholog of mouse spermatogenic Zip 1 | SPZ1 | Strongly expressed in testis and suspected to play a role in spermatogenesis (Sha et al. 2003).
LOC479168 | 29.671 – 29.773 | Developmentally regulated protein TPO1 | SERINC5 | A carrier protein associated with phosphatidylserine and sphingolipid production (Inuzuka et al. 2005). Sphingolipid synthesis is involved in dendrite genesis in Purkinje neurons and is required for Purkinje cell survival (Furuya et al. 1995).
LOC488929 | 29.816 – 29.819 | similar to WW domain binding protein 11 | stWBP11 | Processed pseudogene of WBP11, which binds Ataxin-1 (Okazawa et al. 2002). Upregulation of WBP11 inhibits basal transcription in cerebellar neurons increasing their vulnerability to low potassium conditions (Enokido et al. 2002). Mutations in WBP11 have been identified in humans with X-linked mental retardation (Cossee et al. 2006).
LOC488930 | 29.827 – 29.887 | Thrombospondin 4 precursor | THBS4 | Associated with coronary heart disease (McCarthy et al. 2004; Cui et al. 2006).
LOC488931 | 29.887 – 29.916 | Metaxin 3 (hypothetical protein) | MTX3 | No information found on metaxin 3 (MTX3). Mtx1 mutations cause embryonic lethality in mice that are also mutant in glucocerebrosidase (Bornstein et al. 1995).
LOC479170 | 30.364 – 30.119 | Myospryn protein | CMYA5 | Muscle-specific protein kinase A anchoring protein implicated in cardiac disease (Sarparanta et al. 2008; Reynolds et al. 2008).
LOC607764 | 30.156 – 30.218 | PAP associated domain containing 4 | PAPD4 | Poly(A) elongation protein required for long-term memory (Kwak et al. 2008) and the final stage of oogenesis (Benoit et al. 2008).
LOC488933 | 30.283 – 30.406 | Homer protein homolog 1 | HOMER1 | Scaffold protein binds receptors that regulate glutamine levels in brain (Szumlinski et al. 2006). Knock out mice have mild somatic growth retardation, poor motor coordination, enhanced sensory reactivity and learning deficits (Jaubert et al. 2007).
LOC610323 | 30.434 – 30.529 | Junction-mediating and regulatory protein | JMY | Coactivator of p53 (Shikama et al. 1999), controls actin dynamics in motile cells (Roadcap and Bear 2009).
LOC479171 | 30.616 – 30.635 | Betaine homocysteine methyl transferase | BHMT | Catalyzes the remethylation of homocysteine using betaine as a methyl donor (Li et al. 2008).
LOC488934 | 30.659 – 30.672 | Betaine-homocysteine methyltransferase 2 | BHMT2 | Like BHMT, but uses S-methylmethionine as a methyl donor rather than betaine (Szegedi et al. 2008).
LOC488935 | 30.673 – 30.739 | Dimethylglycine dehydrogenase, mitochondrial precursor | DMGDH | Mitochondrial matrix enzyme that converts dimethylglycine to sarcosine. Mutations cause dimethylglycine dehydrogenase deficiency resulting in a fish-like body odour and chronic muscle fatigue (Binzak et al. 2001).
LOC610364 | 30.753 – 30.916 | Arylsulfatase B precursor | ARSB | Catalyzes the hydrolysis of the sulfate ester group of dermatan sulphate, mutations cause mucopolysaccharidosis type VI (O'Brien et al. 1974; Matalon et al. 1974).
LOC610372 | 30.986 – 30.988 | Copine-1 | CPNE1 | Involved in membrane trafficking (Creutz et al. 1998), forms lattice structures which may provide a scaffold for signalling complexes (Creutz and Edwardson 2009).
LOC610427 | 31.146 – 31.163 | Lipoma HMGIC fusion partner-like 2 | LHFPL2 | Member of the gene family suspected to play a role in translocation-associated lipomas (Petit et al. 1999).
LOC479172 | 31.187 – 31.305 | Secretory carrier membrane protein 1 | SCAMP1 | Membrane protein of transport vesicles, knock out mice show no phenotypic abnormalities (Fernandez-Chacon et al. 1999).
AP3B1 | 31.370 – 31.636 | adaptor-related protein complex AP3 beta 1 subunit | AP3B1 | Involved in protein sorting to lysosomes. Mutations cause: Hermansky-Pudlak Syndrome 2 in humans characterised by oculocutaneous albinism and platelet defects (Dell'Angelica et al. 1999); and canine cyclic neutropenia (Benson et al. 2004).
LOC479173 | 31.880 – 31.913 | Tubulin binding cofactor A | TBCA | A cofactor involved in β-tubulin folding (Tian et al. 1996). Knock down in mammalian cells produces a decrease in soluble tubulin and cell cycle arrest leading to apoptosis (Nolasco et al. 2005).
LOC488938 | 31.952 – 31.959 | Orthopedia | OTP | Role in hypothalamic neural cell differentiation. Knock out mice die soon after birth and display neuroendocrine impairment (Acampora et al. 1999).
LOC610498 | 32.129 – 32.174 | WD repeat domain 41 | WDR41 | No specific information, but the WD repeat domain binds RNA (Lau et al. 2009).
LOC488939 | 32.179 – 32.331 | 3,5 cyclic nucleotide phosphodiesterase 8B | PDE8B | Degrades cyclic AMP. Mutations cause adrenal hyperplasia (Horvath et al. 2008) and autosomal-dominant striatal degeneration which resembles idiopathic Parkinson disease, but without tremor (Appenzeller et al. 2010).
LOC479175 | 32.466 – 32.472 | Hypothetical | ZBED3 | Binds to axin from the Wnt signalling pathway (Chen et al. 2009).
LOC479176 | 32.535 – 32.570 | Angiogenic factor VG5Q | AGGF1 | Factor for the growth of new blood vessels (Tian et al. 2004).
LOC610584 | 32.614 – 32.649 | Corticotropin releasing hormone binding protein | CRHBP | Binds to and inactivates corticotropin-releasing hormone (Potter et al. 1991).
LOC610592 | 32.661 – 32.688 | S100 calcium binding protein, zeta | S100Z | Calcium binding protein, with highest levels of expression in spleen and leukocytes (Gribenko et al. 2001).
LOC488940 | 32.688 – 32.697 | Proteinase activated receptor 2 precursor | F2RL1 | Role in mediating inflammation (Fiorucci et al. 2001). Mouse mutants show a reduced inflammatory response (Ferrell et al. 2003).
LOC488941 | 32.719 – 32.734 | Eukaryotic translation elongation factor 1 alpha 1 | EEF1A1 | Subunit of elongation factor-1 complex, responsible for delivering aminoacyl tRNAs to the ribosome. Felty's syndrome, characterised by rheumatoid arthritis, splenomegaly and neutropenia, involves EEF1A1 autoantibodies (Ditzel et al. 2000). Downregulated in some cervical cancers while overexpression induces apoptosis (Rho et al. 2006).
LOC488942 | 32.760 – 32.777 | Coagulation factor II receptor, precursor | F2R | Mediates anti-apoptotic signalling and plays a role in inflammation, fibrogenesis, and extracellular matrix remodeling (Guo et al. 2004; Ruller et al. 2007). Mouse knock-outs show reduced embryo survival (Conolly et al. 1996).
LOC479177 | 32.782 – 33.063 | Ras GTPase-activating-like protein IQGAP2 | IQGAP2 | May play a role in generation of actin structures, is expressed in mouse liver, but was not detected in brain (Brill et al. 1996).
LOC607963 | 32.855 – 32.862 | Proteinase activated receptor 3 precursor | F2RL2 | Cofactor for F2RL3, which plays a role in platelet activation (Sambrano et al. 2001).
LOC488943 | 33.120 – 33.274 | Synaptic vesicle glycoprotein 2C | SV2C | Glycoprotein present on synaptic vesicles expressed in the pallidum, substantia nigra, midbrain, brainstem and olfactory bulb (Janz and Sudhof 1999).

Genes identified in the reduced candidate region are in bold.
All gene names (except AP3B1) are listed as “similar to” in the dog reference genome, but this has been omitted from the gene names here.

None of the genes is an obvious candidate for the Kelpie CA gene, except perhaps stWBP11, which is a processed pseudogene. Based on their function, the genes in the CA candidate region most likely to be involved in ataxia include four genes that are not within the narrowed candidate CA region. These genes are nevertheless being considered because the narrowed region is based on a single case that has not been confirmed by histopathology.

1: ATPase, H+ transporting, lysosomal accessory protein 1 (ATP6AP1): Multiple copies of several subunit genes exist in higher eukaryotes, allowing for tissue specific expression (Murata et al. 2002). Knock out mutants in yeast result in unviable cells, but mutations in higher organisms with multiple copies of a subunit gene can result in disease (Murata et al. 2002). It is unlikely that this is the causative gene, as it lies outside the narrowed region and no oculocutaneous defects are observed in CA affected Kelpies.

2: Autophagy related 10 homolog (ATG10): A mutation in ATG10 may make cells more susceptible to aggregate build up and cell death.

3: Ankyrin repeat domain 34B (ANKRD34B): This gene could be causative, assuming a function similar to Ank-1.

4: Serine incorporator 5 (SERINC5): This gene is considered because humans deficient in 3-phosphoglycerate dehydrogenase, which catalyses the first step of serine biosynthesis, have congenital microcephaly, profound psychomotor retardation, hypertonia, epilepsy, growth retardation, and hypogonadism (Jaeken et al. 1996). Many of these clinical signs are observed in CA affected dogs and mutations in a related enzyme might be a cause.

There are five genes within the narrowed candidate region that are preferred candidates.

1. Similar to WW domain binding protein 11 (stWBP11) consists of a single exon and has a poly A tail, while the parent gene is on chromosome 27 and contains 13 exons. If stWBP11 were to somehow become activated or upregulated it may interfere with WBP11 expression and cause CA.

2. Homer homolog 1 (HOMER1) interacts with glutamate receptor, metabotropic 1 (GRM1). Knock-out mice for GRM1 are ataxic and have intention tremors but no apparent anatomical abnormalities in the cerebellum (Aiba et al. 1994). For a mutation in HOMER1 to cause CA, it would need to be a gain of function mutation that removes the GRM1 product, and recessive gain of function mutations are rare.

3-4. DNA methylation has been suggested to play a role in epigenetic control of neuronal function (Kriaucionis et al. 2009). There are two methyl transferases within the region, BHMT and BHMT-2. Disruption of DNA methylation within the cerebellum may modify expression of multiple genes in a variable manner in different affecteds, which could account for the variation in clinical signs observed in Kelpie CA cases.

5. LOC479175 was annotated as ‘hypothetical’. It has homology to a novel mouse gene, Zbed3, which plays a role in Wnt signalling. ZBED3 binds AXIN1, blocking the phosphorylation of β-catenin (CTNNB1) and thus increasing the CTNNB1 half life. Conditional knock-out mice for CTNNB1 at mid-gestation show developmental abnormalities in the midbrain hindbrain region (Schuller and Rowich 2007). Knock-out mice for wingless-related MMTV integration site 1 (Wnt1) show a similar phenotype, with the caudal two thirds of the midbrain and the cerebellum failing to develop, resulting in some embryos dying and those that survive developing severe ataxia (Thomas and Capechi 1990; McMahon and Bradley 1990). This shows that the Wnt signalling pathway is involved in cerebellum formation. It is possible that ZBED3 is a cerebellum specific Wnt signalling pathway gene and it is a candidate for the Kelpie CA gene.

Disruptions to micro RNA (miRNA) processing pathways can also cause ataxia. Purkinje cell specific ablation of dicer in mice results in Purkinje cell loss and ataxia (Schaefer et al. 2007). Dicer ablation in dopaminoceptive neurons also results in ataxia, as well as front and hind limb clasping and reduced brain size due to smaller neurons, but not apoptosis (Cuellar et al. 2008). CA in Kelpies could be due to loss of one or more miRNAs from the CA candidate region or to mutations in a miRNA binding site in one of the genes in the region. It may be difficult to identify a 3′ untranslated region mutation in one of the genes in the region, as for many of them the 3′ untranslated region is not annotated and there is no empirical evidence to define the full mRNA. There are no miRNAs reported in the CA region of dog chromosome 3 or in the homologous human region on chromosome 5, 75-82 Mb. One miRNA, mmu-miR-1940, is present in the homologous mouse region (chromosome 13, 91-97 Mb) within the intron of Zbed3. A sequence similar to this miRNA can be found by BLAST search in the dog within the same intron of ZBED3.

5.3 Materials and methods

5.3.1 Australian Kelpie samples

Australian Kelpie samples used were as previously described in Shearman et al. (2011) [chapter 4], consisting of 15 CA affecteds, 17 controls, 16 obligate CA carriers and 51 Kelpies of unknown status (Figure 4-2).

5.3.2 Nimblegen sequence capture

The target region was specified as canine chromosome 3, genomic position 27996340 – 33202324 bp (Dog build 2.1). Of this 5.2 Mb, 400 kb consisted of highly repetitive elements and was not tiled. Two affected samples (6025 and 6065) and one unaffected control sample (6072) were processed following the recommended protocols (Nimblegen arrays user’s guide v 3). Briefly, the process involved shearing genomic DNA to 500 bp fragments, blunting the ends and ligating universal primers to each end. Fragments were size selected and purified using a magnetic bead cleanup and hybridised to the capture array. Unbound DNA was then washed off and the bound fragments were eluted, then amplified using PCR with the universal primers. The amplified product was cleaned using a spin column and used in three ligation mediated PCRs per sample.

Quality control (QC) was performed on aliquots of the mix after blunt ending, after linker ligation and after bead cleanup. PCR was performed on each sample prior to hybridisation to confirm that it could be amplified. This product was used as a control to determine the enrichment achieved by the capture. Primers were designed for three unique loci within tiled regions and one locus on chromosome 13 as a control. Quantitative PCR was performed on each sample in duplicate using sequence-captured amplified product, non-captured amplified product and genomic DNA as a control. Fold enrichment was calculated as 2^ΔCt, where ΔCt is the PCR cycle difference in product detection between the sequence-captured and non-captured sample, and between the target region and non-target region within each sample. Melt curves were used to determine whether the observed signal came from target product or from primer dimer formation.
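The fold enrichment calculation described above can be sketched as follows. This is one reading of the description, assuming perfect doubling of product per PCR cycle; the function name, argument names and Ct values are hypothetical and are not taken from the experiment.

```python
def fold_enrichment(ct_target_captured, ct_target_uncaptured,
                    ct_control_captured, ct_control_uncaptured):
    """Fold enrichment as 2**delta_ct, assuming 100% PCR efficiency.

    The cycle gain at the target locus after capture is corrected by the
    cycle difference at an off-target control locus in the same samples.
    """
    # Earlier detection (lower Ct) after capture means more target template.
    delta_target = ct_target_uncaptured - ct_target_captured
    # Same comparison at the non-target locus, used to normalise input amount.
    delta_control = ct_control_uncaptured - ct_control_captured
    delta_ct = delta_target - delta_control
    return 2.0 ** delta_ct


# Hypothetical Ct values: the target is detected 8 cycles earlier after
# capture, while the control locus is unchanged.
print(fold_enrichment(14.0, 22.0, 24.0, 24.0))  # 256.0
```

With three target loci per sample, as used here, a per-sample figure could be obtained by averaging the per-locus values, although the averaging method is an assumption and is not stated in the text.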

5.3.3 454 sequencing of captured samples

Sequence capture samples were processed on a 454 sequencer using titanium chemistry with multiplex identifiers (MIDs), enabling samples to be pooled and run together. Reads were mapped against the entire dog genome using GS Reference Mapper software (Roche). Regions of apparent heterozygosity (30 – 80% of each allele based on number of reads) in each affected were visually inspected using Eagleview (Huang and Marth 2008).
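The read-fraction window used to flag apparent heterozygosity can be illustrated with a short sketch. Only the 30 – 80% window comes from the text; the function name, the category labels and the handling of sites outside the window are assumptions made for illustration, and this is not how GS Reference Mapper itself reports variants.

```python
def classify_site(ref_reads, alt_reads, low=0.30, high=0.80):
    """Classify a site from read counts for the reference and variant allele."""
    total = ref_reads + alt_reads
    if total == 0:
        return "no coverage"
    alt_fraction = alt_reads / total
    if low <= alt_fraction <= high:
        # Falls inside the window flagged for visual inspection.
        return "apparent heterozygote"
    if alt_fraction > high:
        # Assumed label: nearly all reads carry the variant allele.
        return "homozygous difference"
    # Assumed label: nearly all reads match the reference.
    return "reference-like"


# Hypothetical read counts (reference reads, variant reads).
print(classify_site(6, 4))   # apparent heterozygote
print(classify_site(0, 12))  # homozygous difference
print(classify_site(9, 1))   # reference-like
```

At low read depth a fraction-based rule like this is noisy, which is consistent with the flagged regions being inspected visually rather than accepted automatically.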

5.3.4 Sequencing of candidate differences

Differences identified between the affected samples and control sample that were considered as possible causes of CA were sequenced. Primers were designed spanning each candidate difference (Table 5-2) using Primer3 (Rozen and Skaletsky 2000). The region was amplified using PCR, sequenced using BigDye terminators V3.1 and processed using capillary electrophoresis on an ABI3730 (Applied Biosystems). Sequence data was analysed using Seqscape (Applied Biosystems).

Table 5-2: Sequencing primers for typing candidate differences identified between cerebellar abiotrophy affected Kelpies and a control Kelpie

Gene name  Control  Affected  aa change  Location  Forward primer              Reverse primer
DMGDH      G        A         R -> Q     30691084  AATAAAATAAATTGCCCTATCAATGC  GTATTATGACACCGGCTGACTCTAC
SCAMP1     C        T         A -> T     31206920  CATACTTAAGGCTTGGTGCTTAGTG   GTTACACTGAATCAAGAGCAAAATG
EEF1A1     A        G         M -> T     32733055  GTCTTGGATGAACTGAAAGATGAAC   GGGACTCTCATCCCTTGAACC
THBS4†     C        T         S -> W     29886758  CTGAAGCATCAGAGGTGATACCC     ATAATAGAAGCATGCGCCATACC
THBS4†     C        T         A -> V     29886785  "                           "
CMYA5      T        C         M -> T     30114307  ACATCTGAACTAGAACCGAGGATG    AGTTTCTACGTTGGACAAAGAATGC
CMYA5      G        A         V -> I     30110351  GAGCCATCAGAAGGTAGTTCAATAG   GATATTTCTCTAGGAGGAGCCATTC
PDE8B      C        T         R -> C     32292988  TCTCAACTCCTTTCAACTTCTCAAG   TAAGCAAAGGGCTCCTAGGTTAG

† both differences are captured with a single set of primers

5.4 Results and discussion

5.4.1 Nimblegen sequence capture

Each of the three samples processed with sequence capture arrays gave sufficient PCR enrichment to allow sequencing. Ligation mediated PCR of the 1 μl aliquot yielded 200 μl of product at an average concentration of 50 ng μl⁻¹ per sample. Sequence capture gave fold enrichment of 251x for affected 6025, 430x for affected 6065 and 316x for control 6074. The total amount of product recovered was 81.7 μg for affected 6025, 75 μg for affected 6065 and 70 μg for control 6074.

5.4.2 454 sequence data analysis

The total sequence output from all three dogs was 1,176,169 individual reads totalling 313.8 Mb. The distribution of this sequence was uneven, with CA affected 6025 giving 113,515 reads and 31.1 Mb of sequence, affected 6065 giving 419,919 reads and 112 Mb of sequence and control 6074 giving 642,745 reads and 170.6 Mb of sequence. Variation in the efficiency of the separate emulsion PCRs may account for some of the large variation in total sequence output per sample. The MIDs allowed individual samples to be selected for analysis from the pool of sequence data. Two separate mapping analyses were performed, one with the affecteds combined and the other using only the control. Sequence data was aligned to the entire canine genome as the reference, and an output file of sequences from the candidate region was obtained (Table 5-3). This alignment was used to identify differences between the affecteds and the control.
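To make the unevenness of the output concrete, the short sketch below takes the per-sample read counts and sequence yields reported above and derives each sample's share of the pooled reads and its mean read length. The dictionary layout and function name are illustrative choices, and 1 Mb is assumed to mean 10^6 bases.

```python
# Read counts and sequence yield per sample, as reported in the text above.
samples = {
    "affected 6025": {"reads": 113515, "megabases": 31.1},
    "affected 6065": {"reads": 419919, "megabases": 112.0},
    "control 6074": {"reads": 642745, "megabases": 170.6},
}


def summarise(samples):
    """Return each sample's share of pooled reads and mean read length in bp."""
    total_reads = sum(s["reads"] for s in samples.values())
    summary = {}
    for name, s in samples.items():
        summary[name] = {
            "share_of_reads": s["reads"] / total_reads,
            # Assumes 1 Mb = 1e6 bases.
            "mean_read_length_bp": s["megabases"] * 1e6 / s["reads"],
        }
    return summary


summary = summarise(samples)
for name, stats in summary.items():
    print(f"{name}: {stats['share_of_reads']:.1%} of reads, "
          f"mean read length {stats['mean_read_length_bp']:.0f} bp")
```

On these figures the control contributes more than half of the pooled reads while the mean read length is similar across samples, which is consistent with the variation arising in the amount of library sequenced rather than in read quality.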

Table 5-3 Mapping statistics for sequence output of captured DNA for CA affecteds and the Kelpie control

Sample     Total sequence  Mapped to canine genome  Mapped to target region  Unmapped  Partially mapped  Mapped to multiple locations
Affecteds  143             122                      36                       1.3       11.3              17.6
Control    170.6           146                      41                       1.6       14.4              19.7

All values expressed as Mb

There was a significant difference in sequence coverage between the affecteds and the control. Genic regions were generally well covered, but some regions within introns had no coverage in either the control or affecteds. These regions will require Sanger sequencing to fill in the gaps. In the affecteds, a total of 457 kb of the reference sequence had no sequence coverage, and 389 kb (85%) of this coincided with probe coverage gaps. The control had 298 kb of the reference sequence with no sequence coverage, and 250 kb (84%) of this coincided with probe coverage gaps. If the effect range of probe coverage gaps was considered to extend for 200 bp each side of the gap then the affecteds and the control had similar coverage values. In the affecteds, 445 kb (98%) of the reference sequence that had no sequence coverage was within 200 bp of a probe coverage gap, and in the control 288 kb (97%) was within 200 bp of a probe coverage gap. Sequence data in the control should reflect the SNP array data for that sample. However, deviations from the expected 50:50 allele ratios were observed for SNPs identified as heterozygous, with the proportion of the non-reference allele ranging from 30% to 80%. Since the target capture probes were designed based on the reference allele, ratio deviations caused by capture affinity should be in favour of the reference allele. In some cases the non-reference allele outnumbered the reference allele; for example, SNP chr3.30782110 had 80% of the sequence data from the non-reference allele. This suggests that the allele ratio difference is not due to different capture affinities between sequence variants. The difference is most likely a combination of chance events from the hybridisation step, the ligation mediated amplification PCR, the 454 library preparation and the 454 sequencing. The sample 6025 was homozygous for the narrowed candidate region only. The sequence data was found to be consistent with the SNP calls from the SNP arrays (Shearman et al. 2011).
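The comparison of uncovered sequence with probe coverage gaps described above amounts to an interval overlap calculation. The following is a minimal sketch of one way to compute it, assuming half-open (start, end) coordinates on a single chromosome. The function names and the example coordinates are hypothetical and are not taken from the data in this chapter.

```python
def merge(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def overlap_bp(a, b):
    """Total bases shared by two sorted, merged interval lists."""
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def fraction_near_probe_gaps(uncovered, probe_gaps, pad=200):
    """Fraction of uncovered bases lying within `pad` bp of a probe gap."""
    uncovered = merge(uncovered)
    padded = merge([(max(0, s - pad), e + pad) for s, e in probe_gaps])
    total = sum(e - s for s, e in uncovered)
    return overlap_bp(uncovered, padded) / total if total else 0.0


# Hypothetical coordinates for illustration only.
uncovered = [(1000, 1500), (5000, 5100)]
probe_gaps = [(1100, 1400)]
strict = fraction_near_probe_gaps(uncovered, probe_gaps, pad=0)
padded = fraction_near_probe_gaps(uncovered, probe_gaps, pad=200)
```

In this toy example the fraction of uncovered sequence explained by the probe gap rises when the gap is extended by 200 bp on each side, which mirrors the increase from 85% to 98% reported for the affecteds.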
Identification of differences between the affecteds and control was performed using the high confidence differences output file generated by the reference mapper program. A total of 4172 differences were found in the affecteds and 5207 differences in the control compared to the reference sequence, each supported by between 10% and 100% of reads at that locus (Table 5-4). The differences in the affecteds were compared to the control to identify differences potentially causative of CA. Homozygosity in the affecteds was considered as a minimum of 80% of reads showing the difference, and a range of 30% to 80% was considered as representing heterozygosity in the control. These cut offs were chosen based on the findings from the comparison of the SNP array data to the sequence data. Homozygous differences in the affecteds where the control was also homozygous were considered to be breed differences between Kelpies and the Boxer reference and were removed from the list of differences to investigate, leaving 2107 differences. Known SNPs were identified using the canine SNP database and removed, leaving 1577 differences. Of these, 691 were in gene regions, 27 within an exon and eight were non-synonymous (Table 5-5). Four differences within 50 bp of an exon were investigated as potential splice site disruptors.
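The read fraction cut offs and the removal of shared homozygous differences can be sketched as a simple filter. This is an illustration under stated assumptions rather than the procedure actually run on the mapper output: the function names and example positions are hypothetical, a fraction of exactly 80% is treated as homozygous, and a position absent from the control is treated as having no variant reads.

```python
HOM_MIN = 0.80  # at least 80% of reads: treated as homozygous
HET_MIN = 0.30  # from 30% up to (but not including) 80%: treated as heterozygous


def call_zygosity(variant_fraction):
    """Classify a difference by the fraction of reads that support it."""
    if variant_fraction >= HOM_MIN:
        return "homozygous"
    if variant_fraction >= HET_MIN:
        return "heterozygous"
    return "uncalled"


def candidate_differences(affected, control):
    """Keep positions homozygous in affecteds but not homozygous in the control.

    Both arguments map genomic position -> fraction of reads with the variant.
    """
    kept = []
    for position, fraction in sorted(affected.items()):
        if call_zygosity(fraction) != "homozygous":
            continue
        if call_zygosity(control.get(position, 0.0)) == "homozygous":
            continue  # shared with the control: treated as a breed difference
        kept.append(position)
    return kept


# Hypothetical positions and read fractions for illustration only.
affected = {101: 1.00, 202: 0.95, 303: 0.50, 404: 0.85}
control = {101: 1.00, 202: 0.45, 404: 0.10}
kept = candidate_differences(affected, control)
```

Here position 101 is dropped as a breed difference and position 303 is dropped because it is not homozygous in the affecteds, leaving positions 202 and 404 as candidates.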

Table 5-4 Sequence differences identified between affected and control dogs from 454 sequencing

Difference description                                            Number
differences to reference genome in affecteds                      4172
differences in affecteds supported by >79% of reads               3776
differences in affecteds supported by 100% of reads               3034
differences to reference genome in control                        5207
differences in controls supported by >79% of reads                2555
differences in controls supported by 100% of reads                2145

differences in affecteds >79% where controls are less than 81%    2107
number of these that are known SNPs                               550 (leaving 1577)
number of differences in genic regions                            691
number of differences in exons                                    27
number of differences that change an amino acid                   8

InDel differences homozygous in affecteds                         67
InDel differences not homozygous in control                       30
InDel differences in genic regions                                10
InDel differences upstream of genes                               6

differences within the first and last 50 bp of an intron          4

Table 5-5 Genomic position, sequencing depth and coding effect of differences in exons identified homozygous in affected dogs that were not homozygous in the control dog

Genomic position  Reference  Variant  Sequence depth†  Gene     Effect‡
28056759          G          A        15               ATP6AP1  nc
28092993          G          C        10               RPS23    nc
28903436          A          G        13               CKMT2    nc
29567856          G          A        8                ZFYVE16  nc
29632686          G          A        11               SPZ1     nc
29818867          T          C        4                WBP11    nc
29837088          G          A        21               THBS4    nc
29886758          G          C        36               THBS4    S -> W
29886785          G          A        37               THBS4    A -> V
30110351          C          T        16               CMYA5    V -> I
30114307          A          G        11               CMYA5    M -> T
30173203          T          C        7                PAPD4    nc
30616120          C          T        4                BHMT     nc
30616445          T          C        6                BHMT     nc
30616525          C          G        8                BHMT     nc
30616716          C          T        8                BHMT     nc
30691084          G          A        5                DMGDH    R -> Q
30691196          A          G        12               DMGDH    nc
30704150          C          T        6                DMGDH    nc
30738644          T          C        8                DMGDH    nc
30738713          C          T        6                DMGDH    nc
30917504          A          G        11               ARSB     nc
31206920          C          T        4                SCAMP1   A -> T
32292988          G          A        9                PDE8B    R -> C
32733055          A          G        15               EEF1A1   M -> T

† number of reads at the locus
‡ nc: synonymous or within predicted 3' or 5' sequences

No differences were identified in any of the preferred candidate genes described in the introduction. However, coverage of some genes was less than 100% and will require Sanger sequencing to fill in the coverage gaps. One such gap occurred in HOMER1. There are no miRNAs reported in the CA region of dog chromosome 3 or the homologous human region on chromosome 5, 75-82 Mb. One miRNA, mmu-miR-1940, is present in the homologous mouse region (chromosome 13, 91-97 Mb) within the intron of Zbed3. A similar sequence to the miRNA can be found by BLAST search in the dog within the same intron of ZBED3. No differences were identified within the miRNA in the intron of ZBED3 between affecteds and the control. Sequence data in affecteds was checked using EagleView to identify potential InDel sites, and 67 were identified as being homozygous in affecteds. These sites were then checked in the control to exclude InDels that were Kelpie specific, and 30 sites remained. Of the 30 sites left, 10 were within genic regions and six were within 30 kb upstream of a gene.

5.4.3 Sequencing of candidate differences

A total of 14 differences were investigated as a possible cause of CA in Kelpies. Eight of these were differences that change an amino acid and six were InDels upstream of a gene. Five coding differences from the genes THBS4, CMYA5 and PDE8B were sequenced in 96 individuals, and almost all carriers and many Kelpies of unknown status were homozygous for the same allele as the affecteds (Table 5-6). This is strong evidence that these differences are not involved in CA in Kelpies. Three other coding differences, in DMGDH, SCAMP1 and EEF1A1, were sequenced in five affecteds, two controls and one obligate carrier, and in each case at least one control was also homozygous for the difference. This is suggestive evidence that these differences do not cause CA in Kelpies. Of six InDels chosen as potentially disrupting expression of a downstream gene, five were identified by checking the sequence data in EagleView and the other was a 5 bp InDel identified by the mapping program. Sanger sequencing was performed on five affecteds, two controls and one obligate carrier for each InDel. The affected samples included both of the affecteds, 6025 and 6065, that were processed on the capture arrays and 454 sequenced. The InDel upstream of ARSB, identified by the mapping program, was consistent with the 454 sequence data but was also homozygous in both controls, making it unlikely to be causative. The sequencing results for four of the other five InDels were not consistent with an InDel being present in 6025 or 6065; these InDels were identified by one to three 454-reads. The only other InDel that was consistent with the 454 sequence data was the InDel upstream of PDE8B, identified by 15 454-reads. The large number of false positive InDels is likely due to the two rounds of ligation and PCR in the experimental steps from sequence capture to 454 sequencing. The ligation steps attach universal primers to the end of each DNA fragment.
Each ligation step produces a small number of chimeric sequences, where two DNA fragments ligate and appear to be a translocation, inversion or InDel depending on the uniqueness or location of the sequence (Quail et al. 2008). If universal primers then ligate to each end of this chimeric sequence it can be amplified by PCR. The ratio of primers to DNA fragments is optimised to minimise the formation of chimeric fragments. However, with two ligation steps chimeric sequences can become common enough to be problematic, as was the case here, where chimeric fragments typically consisted of one unique sequence and one repetitive element from elsewhere in the genome. With enough sequencing depth these fragments can be easily identified as chimeric and excluded. The sequencing depth obtained for each sample was an average of 9.7, but ranged from 0 to 285. Several sequences were obtained from a single fragment, resulting in multiple reads with the same start and stop point. This is likely because the PCR step after the sequence capture amplified single captured fragments such that they could be sequenced multiple times. In places of low sequence coverage this can cause a chimeric sequence to artificially appear more reliable, as the sequence depth appears to be high but in fact only represents a single DNA fragment. This can be avoided by using a smaller target region, allowing more fragments to be captured at each region. Performing an additional capture and sequencing on the same samples may improve sequencing depth but would not change the coverage distribution because regions of low coverage represent low capture from the capture arrays.
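The effect of repeated reads from a single fragment on apparent depth can be illustrated with a small sketch. It assumes that reads sharing both start and end coordinates derive from one amplified fragment, which is a simplification (strand and sequence content are ignored), and the function names and read coordinates are hypothetical.

```python
def depth_at(position, reads):
    """Number of reads covering a position; reads are half-open (start, end)."""
    return sum(1 for start, end in reads if start <= position < end)


def unique_fragment_depth(position, reads):
    """Depth after collapsing reads that share both start and end coordinates."""
    return depth_at(position, set(reads))


# Hypothetical reads: five copies of one fragment plus one independent read.
reads = [(100, 350)] * 5 + [(120, 380)]
raw_depth = depth_at(200, reads)
collapsed_depth = unique_fragment_depth(200, reads)
print(raw_depth, collapsed_depth)
```

In this example the raw depth at position 200 is six reads, but only two independent fragments support it, which is the situation in which a chimeric sequence can appear better supported than it really is.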

Table 5-6 Genotypes of differences identified between CA affected and a control Kelpie typed in 96 Kelpies

Gene name  Control  Affected  aa change  Affected genotypes  Carrier genotypes  Unknown genotypes              Control genotypes
THBS4      C        G         S -> W     11 GG               13 GG; 1 GC        29 GG; 9 GC; 2 CC; 4 CT; 1 TT  4 GG; 9 GC; 2 CC; 1 GT; 1 TT
THBS4      C        T         A -> V     11 TT               13 TT; 1 TC        26 TT; 9 TC; 8 CC              3 TT; 11 TC; 3 CC
CMYA5      T        C         M -> T     10 CC               12 CC; 1 CT        33 CC; 7 CT; 6 TT              7 CC; 7 CT; 2 CC
CMYA5      G        A         V -> I     11 AA               12 AA; 2 AG        34 AA; 6 AG; 6 GG              7 AA; 7 AG; 2 GG
PDE8B      C        T         R -> C     12 TT               13 TT; 2 TC; 1 CC  28 TT; 14 TC; 5 CC             4 TT; 10 TC; 3 CC
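The reasoning used to exclude the differences in Table 5-6 can be sketched as a simple check: under a fully penetrant autosomal recessive model, a causative variant should not be homozygous in any dog known to be a carrier or a control. The function below is a hypothetical helper written for illustration. It ignores dogs of unknown status, genotyping error and incomplete penetrance. The PDE8B counts are taken from Table 5-6, while the second set of counts is invented.

```python
def excluded_as_recessive_cause(affected_allele, unaffected_groups):
    """Return True if any known unaffected dog is homozygous for the allele.

    unaffected_groups is a list of dicts mapping genotype -> number of dogs,
    one dict per group (for example carriers and controls).
    """
    homozygous = affected_allele * 2
    return any(group.get(homozygous, 0) > 0 for group in unaffected_groups)


# PDE8B R -> C: carrier and control genotype counts as listed in Table 5-6.
pde8b_carriers = {"TT": 13, "TC": 2, "CC": 1}
pde8b_controls = {"TT": 4, "TC": 10, "CC": 3}
pde8b_excluded = excluded_as_recessive_cause("T", [pde8b_carriers, pde8b_controls])

# Invented counts showing a pattern that would NOT exclude a variant.
consistent_carriers = {"TC": 14}
consistent_controls = {"TC": 8, "CC": 9}
consistent_excluded = excluded_as_recessive_cause(
    "T", [consistent_carriers, consistent_controls])
```

For PDE8B the check returns True because carriers and controls include TT homozygotes, matching the conclusion in the text that this difference is not causative.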

5.5 Conclusion

The sequence capture gave high enrichment and a high enough yield to proceed with 454 sequencing. The total sequence output quality was good, with over 80% of the reads mapping to the canine genome and 30% of those mapping to the target region. The affected dogs were found to be homozygous, supporting the previously reported SNP and microsatellite data. Coverage correlated with the sequence capture probes. Coverage of genic regions was generally good, but some gap filling with Sanger sequencing will be required. Problems occurred from chimeric fragments in regions of low coverage. This could potentially have been avoided by targeting a smaller region on the capture arrays, which would allow more fragments per region to be captured, or by using the RainDance oil based sequence capture method. Eight coding differences were found that changed an amino acid but none were consistent with causing CA. Several synonymous coding differences were identified that were considered unlikely to be causative and the obvious candidates were eliminated. Several differences were identified in the 5′ or 3′ untranslated regions of an mRNA. These differences may have a regulatory role and are the next obvious group of differences to investigate as causing CA, along with Sanger sequencing to fill in sequencing gaps in exons. This method may have difficulty identifying insertions in the candidate region, such as transposable elements, so other methods should be included to help identify the mutation. Sequencing of unaffected Kelpies homozygous for the same haplotype as affecteds would also help in reducing the number of differences to investigate. The possibility that the mutation does not lie in the candidate region cannot be definitively excluded. However, the inheritance pattern is consistent with autosomal recessive inheritance.
Modes of inheritance that can mimic an autosomal recessive pattern, namely mtDNA inheritance and autosomal dominant trinucleotide expansion, are unlikely since an affected female has been observed to give birth to four unaffected offspring. It is possible that the disease is caused by an autosomal dominant mutation with low penetrance elsewhere in the genome. However, because both the inheritance pattern of the disease and the microsatellite inheritance pattern are consistent with an autosomal recessive mutation in the candidate region, it is unlikely that the gene is located elsewhere.

5.6 Acknowledgments

The authors would like to thank Aaron Statham from the Garvan Institute for equipment use and help with the Nimblegen capture arrays.

5.7 References

Acampora D., Postiglione M.P., Avantaggiato V., Di Bonito M., Vaccarino F.M., Michaud J. and Simeone A. (1999) Progressive impairment of developing neuroendocrine cell lineages in the hypothalamus of mice lacking the Orthopedia gene. Genes and Development. 13: 2787-2800.
Aiba A., Kano M., Chen C., Stanton M.E., Fox G.D., Herrup K., Zwingman T.A. and Tonegawa S. (1994) Deficient cerebellar long-term depression and impaired motor learning in mGluR1 mutant mice. Cell. 79: 377-388.
Akiyama Y., Tsubouchi N. and Yuasa Y. (1997) Frequent somatic mutations of hMSH3 with reference to microsatellite instability in hereditary nonpolyposis colorectal cancers. Biochemical and Biophysical Research Communications. 236: 248-252.
Al-Shaibi N. and Ghosh S.K. (2004) A novel phosphoprotein is induced during bone marrow commitment to dendritic cells. Biochemical and Biophysical Research Communications. 321: 26-30.
Appenzeller S., Schirmacher A., Halfter H. et al. (2010) Autosomal-dominant striatal degeneration is caused by a mutation in the phosphodiesterase 8B gene. American Journal of Human Genetics. 86: 83-87.
Bennett V. and Chen L. (2001) Ankyrins and cellular targeting of diverse membrane proteins to physiological sites. Current Opinion in Cell Biology. 13: 61-67.
Benoit P., Papin C., Kwak J.E., Wickens M. and Simonelig M. (2008) PAP- and GLD-2-type poly(A) polymerases are required sequentially in cytoplasmic polyadenylation and oogenesis in Drosophila. Development. 135: 1969-1979.
Benson K.F., Person R.E., Li F.Q., Williams K. and Horwitz M. (2004) Paradoxical homozygous expression from heterozygotes and heterozygous expression from homozygotes as a consequence of transcriptional infidelity through a polyadenine tract in the AP3B1 gene responsible for canine cyclic neutropenia. Nucleic Acids Research. 32: 6327-6333.

Beyenbach K.W. and Wieczorek H. (2006) The V-type H+ ATPase: molecular structure and function, physiological roles and regulation. The Journal of Experimental Biology. 209: 577-589.
Binzak B.A., Wevers R.A., Moolenaar S.H., Lee Y.M., Hwu W.L., Poggi-Bach J., Engelke U.F., Hoard H.M., Vockley J.G. and Vockley J. (2001) Cloning of dimethylglycine dehydrogenase and a new human inborn error of metabolism, dimethylglycine dehydrogenase deficiency. American Journal of Human Genetics. 68: 839-847.
Bornstein P., McKinney C.E., LaMarca M.E., Winfield S., Shingu T., Devarayalu S., Vos H.L. and Ginns E.I. (1995) Metaxin, a gene contiguous to both thrombospondin 3 and glucocerebrosidase, is required for embryonic development in the mouse: implications for Gaucher disease. Proceedings of the National Academy of Sciences of the United States of America. 92: 4547-4551.
Brill S., Li S., Lyman C.W., Church D.M., Wasmuth J.J., Weissbach L., Bernards A. and Snijders A.J. (1996) The Ras GTPase-activating-protein-related human protein IQGAP2 harbors a potential actin binding domain and interacts with calmodulin and Rho family GTPases. Molecular and Cellular Biology. 16: 4869-4878.
Chen T., Li M., Ding Y., Zhang L.S., Xi Y., Pan W.J., Tao D.L., Wang J.Y. and Li L. (2009) Identification of zinc-finger BED domain-containing 3 (Zbed3) as a novel Axin-interacting protein that activates Wnt/beta-catenin signaling. Journal of Biological Chemistry. 284: 6683-6689.
Connolly A.J., Ishihara H., Kahn M.L., Farese R.V. and Coughlin S.R. (1996) Role of the thrombin receptor in development and evidence for a second receptor. Nature. 381: 516-519.
Cossee M., Demeer B., Blanchet P. et al. (2006) Exonic microdeletions in the X-linked PQBP1 gene in mentally retarded patients: a pathogenic mutation and in-frame deletions of uncertain effect. European Journal of Human Genetics. 14: 418-425.
Creutz C.E. and Edwardson J.M. (2009) Organization and synergistic binding of copine I and annexin A1 on supported lipid bilayers observed by atomic force microscopy. Biochimica et Biophysica Acta. 1788: 1950-1961.
Creutz C.E., Tomsig J.L., Snyder S.L., Gautier M.C., Skouri F., Beisson J. and Cohen J. (1998) The copines, a novel class of C2 domain-containing, calcium-dependent, phospholipid-binding proteins conserved from Paramecium to humans. The Journal of Biological Chemistry. 273: 1393-1402.
Cuellar T.L., Davis T.H., Nelson P.T., Loeb G.B., Harfe B.D., Ullian E. and McManus M.T. (2008) Dicer loss in striatal neurons produces behavioral and neuroanatomical phenotypes in the absence of neurodegeneration. Proceedings of the National Academy of Sciences of the United States of America. 105: 5614-5619.
Cui J., Randell E., Renouf J., Sun G., Green R., Han F.Y. and Xie Y.G. (2006) Thrombospondin-4 1186G>C (A387P) is a sex-dependent risk factor for myocardial infarction: a large replication study with increased sample size from the same population. American Heart Journal. 543: e1-5.
Dell'Angelica E.C., Shotelersuk V., Aguilar R.C., Gahl W.A. and Bonifacino J.S. (1999) Altered trafficking of lysosomal proteins in Hermansky-Pudlak syndrome due to mutations in the beta 3A subunit of the AP-3 adaptor. Molecular Cell. 3: 11-21.
Ditzel H.J., Masaki Y., Nielsen H., Farnaes L. and Burton D.R. (2000) Cloning and expression of a novel human antibody-antigen pair associated with Felty's syndrome. Proceedings of the National Academy of Sciences of the United States of America. 97: 9234-9239.
Enokido Y., Maruoka H., Hatanaka H., Kanazawa I. and Okazawa H. (2002) PQBP-1 increases vulnerability to low potassium stress and represses transcription in primary cerebellar neurons. Biochemical and Biophysical Research Communications. 294: 268-271.
Fernandez-Chacon R., Alvarez de Toledo G., Hammer R.E. and Sudhof T.C. (1999) Analysis of SCAMP1 function in secretory vesicle exocytosis by means of gene targeting in mice. The Journal of Biological Chemistry. 274: 32551-32554.
Fernandez-Medarde A., Esteban L.M., Nunez A., Porteros A., Tessarollo L. and Santos E. (2002) Targeted disruption of Ras-Grf2 shows its dispensability for mouse growth and development. Molecular and Cellular Biology. 22: 2498-2504.
Ferrell W.R., Lockhart J.C., Kelso E.B. et al. (2003) Essential role for proteinase-activated receptor-2 in arthritis. The Journal of Clinical Investigation. 111: 35-41.
Fiorucci S., Mencarelli A., Palazzetti B., Distrutti E., Vergnolle N., Hollenberg M.D., Wallace J.L., Morelli A. and Cirino G. (2001) Proteinase-activated receptor 2 is an anti-inflammatory signal for colonic lamina propria lymphocytes in a mouse model of colitis. Proceedings of the National Academy of Sciences of the United States of America. 98: 13936-13941.
Furuya S., Ono K. and Hirabayashi Y. (1995) Sphingolipid biosynthesis is necessary for dendrite growth and survival of cerebellar Purkinje cells in culture. Journal of Neurochemistry. 65: 1551-1561.
Gribenko A.V., Hopper J.E. and Makhatadze G.I. (2001) Molecular characterization and tissue distribution of a novel member of the S100 family of EF-hand proteins. Biochemistry. 40: 15538-15548.
Guo H., Liu D., Gelbard H., Cheng T., Insalaco R., Fernandez J.A., Griffin J.H. and Zlokovic B.V. (2004) Activated protein C prevents neuronal apoptosis via protease activated receptors 1 and 3. Neuron. 41: 563-572.
Hagan J.P., Piskounova E. and Gregory R.I. (2009) Lin28 recruits the TUTase Zcchc11 to inhibit let-7 maturation in mouse embryonic stem cells. Nature Structural and Molecular Biology. 16: 1021-1025.
Horvath A., Giatzakis C., Tsang K. et al. (2008) A cAMP-specific phosphodiesterase (PDE8B) that is mutated in adrenal hyperplasia is expressed widely in human and mouse tissues: a novel PDE8B isoform in human adrenal cortex. European Journal of Human Genetics. 16: 1245-1253.
Huang W. and Marth G. (2008) EagleView: a genome assembly viewer for next-generation sequencing technologies. Genome Research. 18: 1538-1543.
Inuzuka M., Hayakawa M. and Ingi T. (2005) Serinc, an activity-regulated protein family, incorporates serine into membrane lipid synthesis. The Journal of Biological Chemistry. 280: 35776-35783.
Jaeken J., Detheux M., Van Maldergem L., Foulon M., Carchon H. and Van Schaftingen E. (1996) 3-Phosphoglycerate dehydrogenase deficiency: an inborn error of serine biosynthesis. Archives of Disease in Childhood. 74: 542-545.
Janz R. and Sudhof T.C. (1999) SV2C is a synaptic vesicle protein with an unusually restricted localization: anatomy of a synaptic vesicle protein family. Neuroscience. 94: 1279-1290.
Jaubert P.J., Golub M.S., Lo Y.Y., Germann S.L., Dehoff M.H., Worley P.F., Kang S.H., Schwarz M.K., Seeburg P.H. and Berman R.F. (2007) Complex, multimodal behavioral profile of the Homer1 knockout mouse. Genes, Brain, and Behavior. 6: 141-154.

Jones M.R., Quinton L.J., Blahna M.T., Neilson J.R., Fu S., Ivanov A.R., Wolf D.A. and Mizgerd J.P. (2009) Zcchc11-dependent uridylation of microRNA directs cytokine expression. Nature Cell Biology. 11: 1157-1163.
Klein S.C., Haas R.C., Perryman M.B., Billadello J.J. and Strauss A.W. (1991) Regulatory element analysis and structural characterization of the human sarcomeric mitochondrial creatine kinase gene. The Journal of Biological Chemistry. 266: 18058-18065.
Kriaucionis S. and Heintz N. (2009) The nuclear DNA base 5-hydroxymethylcytosine is present in Purkinje neurons and the brain. Science. 324: 929-930.
Kwak J.E., Drier E., Barbee S.A., Ramaswami M., Yin J.C. and Wickens M. (2008) GLD2 poly(A) polymerase is required for long-term memory. Proceedings of the National Academy of Sciences of the United States of America. 105: 14644-14649.
Lau C.K., Bachorik J.L. and Dreyfuss G. (2009) Gemin5-snRNA interaction reveals an RNA binding function for WD repeat domains. Nature Structural and Molecular Biology. 16: 486-491.
Li F., Feng Q., Lee C. et al. (2008) Human betaine-homocysteine methyltransferase (BHMT) and BHMT2: common gene sequence variation and functional characterization. Molecular Genetics and Metabolism. 94: 326-335.
Li H., Liu Q. and Hu X. (2009) Human ZCCHC12 activates AP-1 and CREB signaling as a transcriptional co-activator. Acta Biochimica et Biophysica Sinica. 41: 535-544.
Liang H., Samanta S. and Nagarajan L. (2005) SSBP2, a candidate tumor suppressor gene, induces growth arrest and differentiation of myeloid leukemia cells. Oncogene. 24: 2625-2634.
Lipkin S.M., Wang V., Jacoby R., Banerjee-Basu S., Baxevanis A.D., Lynch H.T., Elliott R.M. and Collins F.S. (2000) MLH3: a DNA mismatch repair gene associated with mammalian microsatellite instability. Nature Genetics. 24: 27-35.
Marygold S.J., Roote J., Reuter G. et al. (2007) The ribosomal protein genes and Minute loci of Drosophila melanogaster. Genome Biology. 8: R216.
Matalon R., Arbogast B. and Dorfman A. (1974) Deficiency of chondroitin sulfate N-acetylgalactosamine 4-sulfate sulfatase in Maroteaux-Lamy syndrome. Biochemical and Biophysical Research Communications. 61: 1450-1457.

McCarthy J.J., Parker A., Salem R. et al. (2004) Large scale association analysis for identification of genes underlying premature coronary heart disease: cumulative perspective from analysis of 111 candidate genes. Journal of Medical Genetics. 41: 334-341.
McMahon A.P. and Bradley A. (1990) The Wnt-1 (int-1) proto-oncogene is required for development of a large region of the mouse brain. Cell. 62: 1073-1085.
Mishra P.J., Humeniuk R., Mishra P.J., Longo-Sorbello G.S., Banerjee D. and Bertino J.R. (2007) A miR-24 microRNA binding-site polymorphism in dihydrofolate reductase gene leads to methotrexate resistance. Proceedings of the National Academy of Sciences of the United States of America. 104: 13513-13518.
Murata Y., Sun-Wada G.H., Yoshimizu T., Yamamoto A., Wada Y. and Futai M. (2002) Differential localization of the vacuolar H+ pump with G subunit isoforms (G1 and G2) in mouse neurons. The Journal of Biological Chemistry. 277: 36296-36303.
Nolasco S., Bellido J., Gonçalves J., Zabala J.C. and Soares H. (2005) Tubulin cofactor A gene silencing in mammalian cells induces changes in microtubule cytoskeleton, cell cycle arrest and cell death. FEBS Letters. 579: 3515-3524.
Nuckels R.J., Ng A., Darland T. and Gross J.M. (2009) The vacuolar-ATPase complex regulates retinoblast proliferation and survival, photoreceptor morphogenesis, and pigmentation in the zebrafish eye. Investigative Ophthalmology and Visual Science. 50: 893-905.
O'Brien J.F., Cantz M. and Spranger J. (1974) Maroteaux-Lamy disease (mucopolysaccharidosis VI), subtype A: deficiency of a N-acetylgalactosamine-4-sulfatase. Biochemical and Biophysical Research Communications. 60: 1170-1177.
Okazawa H., Rich T., Chang A. et al. (2002) Interaction between mutant ataxin-1 and PQBP-1 affects transcription and cell death. Neuron. 34: 701-713.
Peters L.L., Birkenmeier C.S., Bronson R.T., White R.A., Lux S.E., Otto E., Bennett V., Higgins A. and Barker J.E. (1991) Purkinje cell degeneration associated with erythroid ankyrin deficiency in nb/nb mice. The Journal of Cell Biology. 114: 1233-1241.
Petit M.M., Schoenmakers E.F., Huysmans C., Geurts J.M., Mandahl N. and Van de Ven W.J. (1999) LHFP, a novel translocation partner gene of HMGIC in a lipoma, is a member of a new family of LHFP-like genes. Genomics. 57: 438-441.

Potter E., Behan D.P., Fischer W.H., Linton E.A., Lowry P.J. and Vale W.W. (1991) Cloning and characterization of the cDNAs for human and rat corticotropin releasing factor-binding proteins. Nature. 349: 423-426.
Quail M.A., Kozarewa I., Smith F., Scally A., Stephens P.J., Durbin R., Swerdlow H. and Turner D.J. (2008) Nature Methods. 5: 1004-1010.
Ravikumar B., Berger Z., Vacher C., O'Kane C.J. and Rubinsztein D.C. (2006) Rapamycin pre-treatment protects against apoptosis. Human Molecular Genetics. 15: 1209-1216.
Ravikumar B., Futter M., Jahreiss L. et al. (2009) Mammalian macroautophagy at a glance. 122: 1707-1711.
Reynolds J.G., McCalmon S.A., Donaghey J.A. and Naya F.J. (2008) Deregulated protein kinase A signaling and myospryn expression in muscular dystrophy. The Journal of Biological Chemistry. 283: 8070-8074.
Rho S.B., Park Y.G., Park K., Lee S.H. and Lee J.H. (2006) A novel cervical cancer suppressor 3 (CCS-3) interacts with the BTB domain of PLZF and inhibits the cell growth by inducing apoptosis. FEBS Letters. 580: 4073-4080.
Risinger J.I., Umar A., Boyd J., Berchuck A., Kunkel T.A. and Barrett J.C. (1996) Mutation of MSH3 in endometrial cancer and evidence for its functional role in heteroduplex repair. Nature Genetics. 14: 102-105.
Roadcap D.W. and Bear J.E. (2009) Double JMY: making actin fast. Nature Cell Biology. 11: 375-376.
Rozen S. and Skaletsky H. (2000) Primer3 on the WWW for general users and for biologist programmers. Methods in Molecular Biology. 132: 365-386.
Ruiz S., Santos E. and Bustelo X.R. (2007) RasGRF2, a guanosine nucleotide exchange factor for Ras GTPases, participates in T-cell signaling responses. Molecular and Cellular Biology. 27: 8127-8142.
Ruiz S., Santos E. and Bustelo X.R. (2009) The use of knockout mice reveals a synergistic role of the Vav1 and Rasgrf2 gene deficiencies in lymphomagenesis and metastasis. PLoS One. 4: e8229.
Rullier A., Gillibert-Duplantier J., Costet P. et al. (2008) Protease-activated receptor 1 knockout reduces experimentally induced liver fibrosis. American Journal of Physiology. Gastrointestinal and Liver Physiology. 294: G226-235.
Sambrano G.R., Weiss E.J., Zheng Y.W., Huang W. and Coughlin S.R. (2001) Role of thrombin signalling in platelets in haemostasis and thrombosis. Nature. 413: 74-78.

Sarparanta J. (2008) Biology of myospryn: what's known? Journal of Muscle Research and Cell Motility. 29: 177-180.
Schaefer A., O'Carroll D., Tan C.L., Hillman D., Sugimori M., Llinas R. and Greengard P. (2007) Cerebellar neurodegeneration in the absence of microRNAs. The Journal of Experimental Medicine. 204: 1553-1558.
Schuller U. and Rowitch D.H. (2007) Beta-catenin function is required for cerebellar morphogenesis. Brain Research. 1140: 161-169.
Seet L.F., Liu N., Hanson B.J. and Hong W. (2004) Endofin recruits TOM1 to endosomes. The Journal of Biological Chemistry. 279: 4670-4679.
Sha J.H., Zhou Z.M., Li J.M., Lin M., Zhu H., Zhu H., Zhou Y.D., Wang L.L., Wang Y.Q. and Zhou K.Y. (2003) Expression of a novel bHLH-Zip gene in human testis. Asian Journal of Andrology. 5: 83-88.
Shearman J.R., Cook R., Charles J., Fletcher J., Taylore R.M. and Wilton A.N. (2011) Mapping cerebellar abiotrophy in Australian Kelpies. Animal Genetics. accepted.
Shi W., Chang C., Nie S., Xie S., Wan M. and Cao X. (2007) Endofin acts as a Smad anchor for receptor activation in BMP signaling. Journal of Cell Science. 120: 1216-1224.
Shikama N., Lee C.W., France S., Delavaine L., Lyon J., Krstic-Demonacos M. and La Thangue N.B. (1999) A novel cofactor for p300 that regulates the p53 response. Molecular Cell. 4: 365-376.
Szegedi S.S., Castro C.C., Koutmos M. and Garrow T.A. (2008) Betaine-homocysteine S-methyltransferase-2 is an S-methylmethionine-homocysteine methyltransferase. Journal of Biological Chemistry. 283: 8939-8945.
Szumlinski K.K., Kalivas P.W. and Worley P.F. (2006) Homer proteins: implications for neuropsychiatric disorders. Current Opinion in Neurobiology. 16: 251-257.
Tauro G.P., Danks D.M., Rowe P.B., Van der Weyden M.B., Schwarz M.A., Collins V.L. and Neal B.W. (1976) Dihydrofolate reductase deficiency causing megaloblastic anemia in two families. The New England Journal of Medicine. 294: 466-470.
Thomas J.B. and Robertson D. (1989) Hereditary cerebellar abiotrophy in Australian kelpie dogs. Australian Veterinary Journal. 66: 301-302.
Thomas K.R. and Capecchi M.R. (1990) Targeted disruption of the murine int-1 proto-oncogene resulting in severe abnormalities in midbrain and cerebellar development. Nature. 346: 847-850.

Tian G., Huang Y., Rommelaere H., Vandekerckhove J., Ampe C. and Cowan N.J. (1996) Pathway leading to correctly folded beta-tubulin. Cell. 86: 287-296. Tian X.L., Kadaba R., You S.A. et al. (2004) Identification of an angiogenic factor that when mutated causes susceptibility to Klippel-Trenaunay syndrome. Nature. 427: 640-645. Westin M.A., Hunt M.C. and Alexson S.E. (2008) Short- and medium-chain carnitine acyltransferases and acyl-CoA thioesterases in mouse provide complementary systems for transport of beta-oxidation products out of peroxisomes. Cellular and Molecular Life Sciences. 65: 982-990.

6 CONCLUSION

This thesis has explored two approaches to mapping disease genes: the candidate gene approach and the genome wide association study. The candidate gene approach was used successfully to map the causative mutation for Trapped Neutrophil Syndrome (TNS) to a four base pair deletion in vacuolar protein sorting 13 homolog B (VPS13B). The causative mutation was found after examining only 9 genes, a good example of how effective a candidate gene approach can be when there are good candidate genes to examine. A genome wide association study was used successfully to identify a 5 Mb region containing the causative mutation for cerebellar abiotrophy (CA) in the Australian Kelpie. This region contained no obvious candidate genes, so targeted sequence capture and 454 sequencing were used to sequence the entire region. The causative mutation, however, was not identified despite investigation of the differences between the two affected Kelpies and one control Kelpie. A significant amount of work remains to discover the causative mutation for this disease; the next obvious candidates are differences in the 3′ or 5′ untranslated regions of genes.

Both approaches have advantages and disadvantages. A candidate gene approach can identify the causative mutation cheaply and effectively if there are good candidates; without good candidates, a significant amount of time and money can be wasted in excluding genes. A genome wide association study can rapidly identify a candidate region in the absence of good candidate genes, but is an expensive exercise. Both of these approaches have been used successfully here to identify a region containing the causative mutation. If a genome wide association study had been applied to find the TNS mutation, it would have identified a large region containing VPS13B. A literature search of genes in that region would then have identified VPS13B as a good candidate, allowing the mutation to be found. However, if a candidate gene approach had been applied to find the CA mutation, it would have been a costly and time consuming exercise, as the obvious candidates are genes with known links to ataxia in human or mouse and none of these genes are in the candidate region.

In conclusion, genome wide association studies using SNP arrays are the optimum mapping approach. Although this approach is more expensive, the advantages of locating the mutation containing region and requiring only a few weeks to perform far outweigh any cost savings that might be made by getting lucky with the candidate gene approach.

APPENDICES

I. Exclusion of CXCR4 as the cause of Trapped Neutrophil Syndrome in Border Collies using five microsatellites on canine chromosome 19

[2006. Animal Genetics. 37: 89.]

J. R. Shearman*, Q. Y. Zhang* and A. N. Wilton*,†

*School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW 2052, Australia. †Clive and Vera Ramaciotti Centre for Gene Function Analysis, University of New South Wales, Sydney, NSW 2052, Australia.

Correspondence: A. Wilton ([email protected])

Source/description

The autosomal recessive disease Trapped Neutrophil Syndrome (TNS) has become common in Australian and New Zealand Border Collies. All cases of the disease are likely to be derived from a single mutation event as all have the same common ancestor on both sides of the pedigree within the last six generations (Fig. S1). Affected dogs should then be homozygous for a common haplotype around the disease locus. In TNS, mature neutrophils remain trapped in the bone marrow, resulting in recurrent infections and failure of the affected pups to thrive.1 TNS shares several symptoms with WHIM disease (warts, hypogammaglobulinaemia, immunodeficiency and myelokathexis) in humans, which is caused by mutations in the last 20 amino acids of the cytoplasmic tail encoded by exon 2 of chemokine (C-X-C motif) receptor 4 (CXCR4).2 This suggests that canine CXCR4 is a candidate for the TNS gene. Canine CXCR4 is 1646 bp in length and consists of two exons: exon 1 has 32 bp of coding sequence, and exon 2 has 1081 bp of coding sequence (Gene ID:483900; http://www.ncbi.nlm.nih.gov).

Sequencing and autozygosity analysis

Polymerase chain reaction (PCR) amplification was performed using standard PCR protocols.3 Sequencing of the first 1026 bp of exon 2 of CXCR4 in TNS affected, carrier and control dogs showed no difference when compared with the reference dog genome (AAEX00000000; http://www.ncbi.nlm.nih.gov). Five short tandem repeats (STRs) of at least 26 bp within 280 kb of CXCR4 on chromosome 19 in the dog genome sequence were chosen as potentially useful, previously uncharacterized microsatellites. Primers (Table S1) were designed to flanking regions of each STR using Primer3 (http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi). All five microsatellites were polymorphic in 40 Border Collies and six dogs of other breeds (one Dingo, two Australian cattle dogs, one Lhasa Apso and two cross-bred dogs), and a range of 3–6 alleles were identified. An autozygosity analysis was performed using these newly identified microsatellites on 40 Border Collies, including six affected Border Collies from four families. Microsatellite products were labelled using the universal priming method.4,5 Haplotypes were determined from segregation in pedigrees to identify any common haplotypes associated with TNS. Seven different haplotypes were identified among the affected individuals, and only one homozygous animal was identified (Fig. 1).

Comments

Exon 2 of CXCR4 was considered the most likely location of the putative TNS mutation because mutations in exon 2 lead to the WHIM syndrome in humans, while mutations in exon 1 are lethal. No mutations were identified in exon 2 of CXCR4 in affected dogs. To eliminate the possibility of a mutation elsewhere in CXCR4, an autozygosity analysis was performed using the five newly identified flanking microsatellites. Because of the recent emergence of the disease from a common ancestor about six generations ago, affected dogs would be expected to be homozygous in the region around the mutation. The appearance of seven different haplotypes among affected dogs strongly suggests that the TNS mutation is not in the CXCR4 region and that CXCR4 is not the cause of TNS in Border Collies. CXCL12, the ligand for this receptor, and genes from other pathways are now being investigated as candidates for TNS.

Locus        Sample 577   Sample 537   Sample 576   Sample 541   Sample 621   Sample 893
C19.41902    241 / 245    243 / 243    239 / 241    241 / 241    241 / 245    245 / 245
C19.42172    234 / 237    234 / 234    234 / 237    237 / 237    234 / 237    232 / 234
C19.42154    206 / 209    206 / 206    206 / 206    206 / 206    206 / 209    206 / 209
C19.42152    256 / 260    272 / 272    256 / 260    260 / 272    260 / 260    256 / 260
C19.42151    270 / 271    271 / 271    270 / 271    270 / 270    270 / 271    270 / 271

Figure 1 Haplotypes for 5 microsatellites in the CXCR4 region in 6 dogs affected with TNS
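The autozygosity argument can be checked directly against the genotypes in Figure 1. The following is a minimal sketch, not the analysis pipeline used for the publication: it uses the unphased allele sizes as printed (the published haplotypes were phased from pedigrees, which is not reproduced here), and the function and variable names are hypothetical.

```python
# Allele sizes (bp) transcribed from Figure 1: five microsatellites
# flanking CXCR4, two alleles per TNS-affected dog.
LOCI = ["C19.41902", "C19.42172", "C19.42154", "C19.42152", "C19.42151"]
GENOTYPES = {
    "577": [(241, 245), (234, 237), (206, 209), (256, 260), (270, 271)],
    "537": [(243, 243), (234, 234), (206, 206), (272, 272), (271, 271)],
    "576": [(239, 241), (234, 237), (206, 206), (256, 260), (270, 271)],
    "541": [(241, 241), (237, 237), (206, 206), (260, 272), (270, 270)],
    "621": [(241, 245), (234, 237), (206, 209), (260, 260), (270, 271)],
    "893": [(245, 245), (232, 234), (206, 209), (256, 260), (270, 271)],
}


def is_fully_homozygous(genotype):
    """True if both alleles match at every locus in the region."""
    return all(a == b for a, b in genotype)


# Dogs that are homozygous across the whole CXCR4 region.
homozygous_dogs = [dog for dog, g in GENOTYPES.items() if is_fully_homozygous(g)]

# A region carrying the mutation identical-by-descent would require every
# affected dog to be homozygous AND to carry the same alleles.
region_consistent = (
    len(homozygous_dogs) == len(GENOTYPES)
    and len({tuple(g) for g in GENOTYPES.values()}) == 1
)

print("Homozygous at all loci:", homozygous_dogs)
print("Consistent with autozygosity:", region_consistent)
```

Only one affected dog is homozygous across all five loci, which agrees with the exclusion of the CXCR4 region described above.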

Acknowledgements

We thank Angela Higgins for running sequences and genotyping. The work was funded in part by a grant from the Canine Research Foundation and the Faculty of Science, UNSW.

References

1. Allan F. J. et al. (1996) New Zeal Vet J 44, 67–72.
2. Hernandez P. A. et al. (2003) Nat Genet 34, 70–4.
3. Melville S. A. et al. (2005) Genomics 86, 287–94.
4. Schuelke M. (2000) Nat Biotech 18, 233–4.
5. Neilan B. A. et al. (1997) Nucl Acids Res 25, 2938–9.

Supplementary Material

Table S1. Polymerase chain reaction primers and products from chromosome 19 used to sequence exon 2 of the CXCR4 gene and genotype-associated microsatellites in forty Border Collies plus one Dingo, two Australian cattle dogs, one Lhasa Apso and two cross-bred dogs.

Locus¹      Location (bp)        Forward primer²                    Reverse primer³                    Repeat⁴            No. alleles   Size (bp)
C19.41902   41901868-41902092    M13 - AAGATCTGCAGGGTTTCAGG         gtttGAGATCTTGGTGGGGTCAT            AC(20)             6             237-247
CXCR4 e2    41893825-41894905    TGGAGAATGCCAGCTGAACT               AGACTCCGACTCAGTTGAAACA             NA                 NA            1082
C19.42172   42171882-42172095    M13 - TTCAAGCCCAGAAATCATCA         gtttGTCCAAAAACTGGGTCATC            A(18)              3             234-237
C19.42154   42154403-42154587    M13 - CCACTTGTTTGTGCCCTCTT         gtttAGCAGGTCTCCTGACTCCAG           (11)               3             206-212
C19.42152   42152451-42152673    gtttCAAGTCCCACATCAAGCTCTTAG        M13 - CACCCATATCTCACCATCCA         AAAT(11) A(16)     6             251-272
C19.42151   42150756-41251005    M13 - GCCAGGGAGCATCAAAAATA         gtttCTCTGGGGTTGGAATGAAGA           CT(12)             3             264-271

¹ Name based on chromosome number and location in kb on build v2.1 of boxer dog genome.
² M13: CACGACGTTGTAAAACGAC tag used to label microsatellite products (ref 4).
³ gttt: GTTT added to encourage addition of A by Taq polymerase (Brownstein et al. (1996). Biotechniques. 20: 1004-1010).
⁴ Repeat length and sequence in boxer dog genome sequence Acc. No. AAEX00000000.
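The primer tails described in the footnotes to Table S1 can be sketched as a simple string operation. This is an illustration only: the function name `tailed_primers` and its `tag_on_forward` argument are hypothetical, and the tag sequences are the ones given in the footnotes.

```python
M13_TAG = "CACGACGTTGTAAAACGAC"  # universal tag used to label products
PIGTAIL = "GTTT"                 # tail that encourages A addition by Taq


def tailed_primers(forward, reverse, tag_on_forward=True):
    """Return (forward, reverse) oligos with the M13 tag on one primer
    and the GTTT tail on the other, both added at the 5' end."""
    if tag_on_forward:
        return M13_TAG + forward, PIGTAIL + reverse
    # Layout used for C19.42152 in Table S1, where the tails are swapped.
    return PIGTAIL + forward, M13_TAG + reverse


# Locus C19.41902 from Table S1.
fwd, rev = tailed_primers("AAGATCTGCAGGGTTTCAGG", "GAGATCTTGGTGGGGTCAT")
print(fwd, len(fwd))
print(rev, len(rev))
```

Keeping the locus-specific sequence separate from the tails makes it easy to see which part of each oligo anneals to the dog genome.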


Figure S1. Pedigrees of affected Border Collies 537, 541, 577, 621 and 893, showing their common ancestor. (Pedigree is unknown for TNS-affected Border Collie 576.) Six other obligate carriers also share this common ancestor.

II. Elimination of neutrophil elastase and adaptor protein complex 3 subunit genes as the cause of trapped neutrophil syndrome in Border collies

[2007. Animal Genetics. 38: 188-189.]

J. R. Shearman and A. N. Wilton

School of Biotechnology and Biomolecular Sciences, Clive and Vera Ramaciotti Centre for Gene Function Analysis, University of New South Wales, Sydney, NSW 2052, Australia

Accepted for publication 25 November 2006

Source/description

A static form of neutropenia, known as trapped neutrophil syndrome (TNS), is becoming common in Australian and New Zealand Border collie dogs due to frequent breeding of champion show dogs that are carriers for the disease allele, and subsequent inbreeding.1,2 This syndrome has an autosomal recessive mode of inheritance. Clinical presentation of TNS is static neutropenia and myelodysplasia, which are also symptoms of human severe congenital neutropenia (SCN), along with acute myeloid leukaemia.3 Mis-trafficking of neutrophil elastase (NE) is responsible for many of the neutropenic syndromes. Excessive NE routed to the plasma membrane due to loss of AP3 causes the static neutropenias observed in SCN and Hermansky–Pudlak syndrome type 2 (HPS2) in humans, as well as the cyclic neutropenia observed in dogs. Cyclic neutropenia in humans is the result of excessive routing of NE to lysosomes.4 Human SCN is caused by mutations in the neutrophil elastase gene (ELA2), which can also cause cyclic neutropenia.4 Canine cyclic neutropenia, called Gray collie syndrome,5 is caused by mutations in the beta 1 subunit of the adaptor-related protein complex 3 gene (AP3B1).6 Another static neutropenia in humans, HPS2, is caused by mutations in AP3B1.7 Because TNS has symptoms similar to SCN and HPS2, ELA2 and the AP3B1 genes are good candidates for TNS.

Dogs affected with TNS in our study could be traced back to a single common ancestor up to seven generations on both sides of a highly inbred pedigree (Fig. 1). Thus, each affected dog should be homozygous and identical-by-descent for the disease gene and surrounding chromosomal region. To investigate the causal mutation for TNS, linkage analysis and autozygosity mapping were performed using 12 microsatellite markers within and near neutrophil elastase (ELA2) and the subunits of AP3 (AP3D1, AP3B1, AP3S1 and AP3M1). AP3D1 and ELA2 are located on CFA20 at 60 and 61.3 Mb respectively. AP3B1 is on CFA3 at 31.4 Mb, AP3S1 is on CFA11 at 8.5 Mb and AP3M1 is on CFA4 at 28.1 Mb (dog genome Build 2.1; http://www.ncbi.nlm.nih.gov/mapview/maps.cgi?taxid=9615).

Linkage analysis and autozygosity mapping

Microsatellites longer than 30 bp located within and near each of the candidate genes were identified using the dog reference sequence (dog genome Build 2.1). Primers (Table S1) were designed using PRIMER3 (http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi) and were checked against the dog genome using BLASTN (http://www.ncbi.nlm.nih.gov) to ensure they were unique. PCR products were labelled using universal priming.8 Each of the microsatellites was polymorphic, with 2–14 alleles in the 72 Border collies within our pedigree (Table S1). Multipoint linkage analysis was performed using SUPERLINK Online (http://bioinfo.cs.technion.ac.il/superlink-online/). The complete pedigree was analysed with up to five marker loci simultaneously. Multipoint linkage analysis excluded linkage of TNS with each of the microsatellites (LOD < –2.0), with the exception of AP3B1, which showed no significant evidence of linkage at LOD –0.8 (Table 1). Autozygosity analysis also excluded each of the candidate regions because of the presence of several different haplotypes in the affected dogs (Table S2).

Conclusion

These results demonstrate that TNS in Border collies is not caused by mutations in the genes that are known to be involved in human HPS2 or SCN. Therefore, we conclude that TNS is distinct from canine cyclic neutropenia and is not the canine equivalent to HPS2 and SCN.

Figure 1 Pedigree of trapped neutrophil syndrome-affected Border collies (shaded) showing common ancestor. Genotypes were available for only numbered individuals (pedigree produced with OPediT at http://www.circusoft.com/content/product/5).

Table 1 Multipoint LOD scores averaged across microsatellites in candidate genes AP3S1, AP3M1, AP3D1, ELA2 and AP3B1 with trapped neutrophil syndrome in Border collies.

Candidate gene    LOD score
AP3S1             -9.2
AP3M1             -21.3
AP3D1 + ELA2      -16.1
AP3B1             -0.8
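The averages in Table 1 and the exclusion rule stated in the text (LOD < –2.0) can be reproduced from the per-microsatellite LOD scores listed in Table S2. The sketch below is illustrative only; the dictionary layout and the function name `summarise` are assumptions, and the LOD scores themselves came from SUPERLINK Online rather than from this code.

```python
from statistics import mean

# Per-microsatellite multipoint LOD scores transcribed from Table S2.
MARKER_LODS = {
    "AP3S1": [-7.5961, -10.7984],
    "AP3M1": [-20.2987, -19.0186, -24.6507],
    "AP3D1 + ELA2": [-16.7478, -16.8821, -14.7389, -15.8389],
    "AP3B1": [-0.8468, -0.8441, -0.8436],
}

EXCLUSION_THRESHOLD = -2.0  # linkage is excluded below this LOD score


def summarise(marker_lods):
    """Map each region to (average LOD rounded to 1 dp, excluded?)."""
    summary = {}
    for region, lods in marker_lods.items():
        average = mean(lods)
        summary[region] = (round(average, 1), average < EXCLUSION_THRESHOLD)
    return summary


for region, (average, excluded) in summarise(MARKER_LODS).items():
    status = "excluded" if excluded else "not excluded by linkage"
    print(f"{region:14s} {average:6.1f}  {status}")
```

AP3B1 is the only region that does not pass the exclusion threshold, which is why it was ruled out by the autozygosity analysis instead.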

References

1. Allan F. J. et al. (1996) N Z Vet J 44, 67–72.
2. Shearman J. R. et al. (2006) Anim Genet 37, 89.
3. Berliner N. et al. (2004) Hematology Am Soc Hematol Educ Program 2004, 63–79.
4. Horwitz M. et al. (2004) Trends Mol Med 10, 163–70.
5. Lothrop C. D. et al. (1987) Endocrinology 120, 1027–32.
6. Benson K. F. et al. (2003) Nat Genet 35, 90–6.
7. Jung J. et al. (2006) Blood 108, 362–9.
8. Neilan B. A. et al. (1997) Nucleic Acids Res 25, 2938–9.

Supplementary Material

Table S1 Microsatellite loci used to genotype 72 Border collies for autozygosity mapping and linkage analysis in the TNS pedigree.

Locus   Gene            Location   Primer forward                      Primer reverse             Repeat      No. alleles   Size (bp)
C…      AP3B1           Intron     M13-CCTGCTTAACTCAGGCCTCTT           GAGATCTCGAGTCCCACGTC       AG          …             …
C…      AP3B1           Intron     M13-TTCCTCTCTCCCCTATAATCCTG         AGCGATCTGAAGAAGCACCT       TTTCTTTTC   …             …
C…      AP3B1           Intron     M13-GAGGGGAAGCATCATTTTGA            CAAAGAGGGAAATATGTTCTTGC    TTTTA       …             …
C…      AP3M1           CNE        M13-AAATGGAATCTGATCATTTATTTCC       ATATTCTTCAAATTTTCCTTGGAG   AG          …             …
C…      AP3M1           CNE        M13-GTTACTCTTTGGGGGCTCCT            TGAGGGGAAGCAATACCTAAAA     TAA         …             …
C…      AP3M1           CNE        M13-AGGTTGATAGCTATTTCACCACA         TTAATTGGAGGCCAGTGTCA       AGAAAG      …             …
C…      AP3S1           CNE        M13-TTGAAAATTCCTCCACAAACAG          GCTTTGTTAATTAGGTGGCATGA    AAAAT       …             …
C…      AP3S1           Intron     M13-TTTTGCTGCAGGACACTTTTT           TGAAGCTGTCAGGCCTTCAT       TG          …             …
C…      AP3D1 (ELA2)    CNE        M13-CCACTTAAAACATACAGCCAGTG         ATGACACAGCTGCATTGGAA       AC          …             …
C…      AP3D1 (ELA2)    Intron     M13-TTCCTTCCTGTATATGAGTTTGGA        TATGCCCAACATGCACAGAC       AAAG        …             …
C…      AP3D1 (ELA2)    CNE        M13-GGGTAACCGGGAGTTGAGTT            AGAAAAAGCCCAGGCTCTGA       TC          …             …
C…      ELA2 (AP3D1)    CNE        M13-ACAGCAGGCCTGTGACCTT             CACACGCCTCTCTGTGTTTC       AG          …             …

Locus: name based on chromosome number and location (e.g. C… is on cfa… at … Mb) on build v… of boxer dog genome.
M13: CACGACGTTGTAAAACGAC tag used to label microsatellite products (Neilan et al. …).
Repeat: repeat motif and sequence length in dog genome sequence build ….
Size: expected PCR product size based on sequence in dog genome build ….


Table S2 Alleles for 12 microsatellite loci from five candidate regions in eight TNS affected dogs (537, 541, 577, 576, 621, 893, 1137 and 1151) with multipoint LOD scores for linkage of each of the microsatellites to TNS.

Loci*        537       541       577       576       621       893       1137      1151      LOD score

AP3S1
C11.0853     246 246   246 261   246 246   261 261   246 246   251 261   246 246   246 246   -7.5961
C11.08588    .   .     381 387   381 387   387 387   381 387   387 389   381 381   381 383   -10.7984

AP3M1
C4.2814      268 268   268 268   268 268   267 268   267 268   259 261   268 268   268 268   -20.2987
C4.2821      .   .     249 249   253 253   253 256   253 253   249 249   .   .     .   .     -19.0186
C4.2822      446 446   442 446   477 481   474 474   461 477   450 450   446 474   457 474   -24.6507

AP3D1 + ELA2
C20.6003     383 383   383 387   383 387   383 383   383 383   383 383   383 387   387 387   -16.7478
C20.6004     457 457   450 438   457 438   446 457   457 457   450 457   438 457   438 438   -16.8821
C20.6008     318 318   318 320   318 320   320 318   318 320   318 320   .   .     320 320   -14.7389
C20.6135     313 313   299 319   313 319   299 307   313 319   299 319   319 319   319 319   -15.8389

AP3B1
C3.3138      193 193   193 193   191 193   191 193   193 193   191 193   193 193   .   .     -0.8468
C3.3144      463 463   463 463   498 463   482 463   463 463   492 463   .   .     .   .     -0.8441
C3.3161      .   .     313 313   354 313   .   .     313 313   333 313   .   .     .   .     -0.8436

* Loci names based on chromosome number and location in 10s of kb (e.g. C4.2814 is on CFA4 at 28.14 Mb)
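The naming convention in the footnote can be decoded mechanically. The sketch below is a hypothetical helper, not part of the original analysis: it handles only four-digit names (location in 10s of kb) and rejects anything else, such as C11.08588, rather than guessing the unit. Gene positions are the rounded values quoted in the text, so distances are approximate.

```python
import re

# Approximate gene positions (chromosome, Mb) quoted for dog genome Build 2.1.
GENE_POSITIONS = {"AP3M1": (4, 28.1), "AP3B1": (3, 31.4)}


def parse_locus(name):
    """Decode e.g. 'C4.2814' into (chromosome 4, 28.14 Mb)."""
    match = re.fullmatch(r"C(\d+)\.(\d{4})", name)
    if match is None:
        raise ValueError(f"not a four-digit locus name: {name!r}")
    chromosome = int(match.group(1))
    position_mb = int(match.group(2)) / 100  # 10s of kb to Mb
    return chromosome, position_mb


def distance_kb(locus, gene):
    """Approximate distance in kb between a locus and a candidate gene."""
    chromosome, position_mb = parse_locus(locus)
    gene_chromosome, gene_mb = GENE_POSITIONS[gene]
    if chromosome != gene_chromosome:
        raise ValueError(f"{locus} and {gene} are on different chromosomes")
    return round(abs(position_mb - gene_mb) * 1000)


print(parse_locus("C4.2814"))
print(distance_kb("C4.2814", "AP3M1"), "kb from AP3M1")
```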

III. Elimination of SETX, SYNE1 and ATCAY as the cause of Cerebellar Abiotrophy in Australian Kelpies

[2008. Animal Genetics. 39: 573.]

Jeremy R. Shearman, Vivian M. Lau and Alan N. Wilton

School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW 2052, Australia and Clive and Vera Ramaciotti Centre for Gene Function Analysis, University of New South Wales, Sydney, NSW 2052, Australia

Correspondence: A. Wilton ([email protected])

Accepted for publication 15 April 2008

Source/description

An autosomal recessive form of cerebellar abiotrophy (CA) has been identified in Australian Kelpie dogs.1 The main symptom is ataxia, which is seen as head tremors, poor coordination and a high step while walking. Most affected animals present with mild non-progressive ataxia and some exhibit severe ataxia, which can include fitting. Because all affected animals can be traced back to a small number of related common ancestors within eight generations, the disease is likely to be the result of a single mutation that has become common by using a popular sire in a small gene pool and inbreeding. All affected dogs should then be homozygous, identical-by-descent, close to the mutation. Three candidate genes (SETX,2 ATCAY3 and SYNE14) that are known to cause CA with mild non-progressive ataxia in humans were tested in the Kelpies. SETX lies on CFA9 at 55.3 Mb, ATCAY is on CFA20 at 58.7 Mb and SYNE1 is on CFA1 at 45.9 Mb (dog genome Build 2.1).

Homozygosity analysis

To look for homozygosity in affected dogs, microsatellites within 220 kb of these genes were identified and primers were designed to amplify them using a previously described method.5 The distance of 220 kb is sufficiently close for mapping, considering linkage disequilibrium in dogs has been shown to be up to 100 times more extensive than in humans.6 Primers were fluorescently labelled using the universal priming method7 and the products were sized on an ABI3730 sequencer (Applied Biosystems). Each of the microsatellites was polymorphic, with between three and 12 alleles (Table S2) in 50 Australian Kelpies including seven affected dogs. Haplotypes were determined from inheritance patterns of two or three microsatellites close to each candidate locus (SETX, ATCAY and SYNE1). Homozygosity for a common haplotype in affected dogs would result if the gene for CA is close to the candidate locus. At least six different haplotypes were identified in affected dogs for each of the candidate loci (Table S1), which excludes each of them from being the gene for CA in the Australian Kelpie.

Comments

We are now extending this study to a whole-genome scan using dog SNP arrays as used by Karlsson et al.8

References

1. Thomas J. B. & Robertson D. (1989) Aust Vet J 66, 301–2.
2. Moreira M. C. et al. (2004) Nat Genet 36, 225–7.
3. Bomar J. M. et al. (2003) Nat Genet 35, 264–9.
4. Gros-Louis F. et al. (2007) Nat Genet 39, 80–5.
5. Shearman J. R. & Wilton A. N. (2007) Anim Genet 38, 188–9.
6. Sutter N. B. et al. (2004) Genome Res 14, 2388–96.
7. Neilan B. A. et al. (1997) Nucleic Acids Res 25, 2938–9.
8. Karlsson E. K. et al. (2007) Nat Genet 39, 1321–8.

Supporting information

Table S1 Allele sizes (in bp) at eight microsatellite loci from three candidate regions for seven CA-affected Australian Kelpies.

              ATCAY                   SETX                                SYNE1
              C20.5872    C20.5847    C9.5510     C9.5542     C9.5547     C1.4591     C1.4548     C1.4542
affected 1    419 - 424   228 - 228   nd†         409 - 426   465 - 465   330 - 330   286 - 286   408 - 408
affected 2    393 - 424   228 - 228   371 - 371   426 - 430   nd          330 - 330   286 - 286   408 - 408
affected 3    393 - 424   228 - 228   nd          409 - 430   463 - 463   330 - 330   286 - 286   408 - 408
affected 4    419 - 419   228 - 228   368 - 268   440 - 446   459 - 459   330 - 330   256 - 298   384 - 396
affected 5    368 - 435   228 - 230   368 - 419   369 - 446   459 - 461   330 - 330   298 - 298   400 - 404
affected 6    419 - 419   228 - 228   368 - 268   415 - 440   459 - 465   325 - 330   286 - 290   408 - 408
affected 7    358 - 424   228 - 232   375 - 393   435 - 455   461 - 461   317 - 330   283 - 283   441 - 441

† nd: not determined
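The exclusion logic can also be applied one locus at a time, allowing for genotypes that were not determined. The sketch below is an illustration with hypothetical function names; it uses three of the eight loci from Table S1 and single-locus genotypes rather than the pedigree-derived haplotypes used in the study.

```python
# Genotypes for affected dogs 1-7 at three loci from Table S1.
# None marks a genotype that was not determined (nd).
AFFECTED = {
    "C20.5847": [(228, 228), (228, 228), (228, 228), (228, 228),
                 (228, 230), (228, 228), (228, 232)],
    "C9.5547": [(465, 465), None, (463, 463), (459, 459),
                (459, 461), (459, 465), (461, 461)],
    "C1.4591": [(330, 330), (330, 330), (330, 330), (330, 330),
                (330, 330), (325, 330), (317, 330)],
}


def shared_homozygous_allele(genotypes):
    """Return the allele if every typed dog is homozygous for it, else None."""
    typed = [g for g in genotypes if g is not None]
    alleles = {allele for genotype in typed for allele in genotype}
    return next(iter(alleles)) if len(alleles) == 1 else None


def count_homozygous(genotypes, allele):
    """Number of typed dogs homozygous for the given allele."""
    return sum(1 for g in genotypes if g == (allele, allele))


for locus, genotypes in AFFECTED.items():
    shared = shared_homozygous_allele(genotypes)
    verdict = f"all homozygous for {shared}" if shared else "no shared homozygous allele"
    print(f"{locus}: {verdict}")
```

Even at the loci where one allele dominates (228 at C20.5847 and 330 at C1.4591), at least two affected dogs are heterozygous, so none of these loci is compatible with a shared region that is identical-by-descent.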

Table S2 Details of the eight novel informative microsatellite loci used to genotype 50 Australian Kelpies for homozygosity mapping.

Microsatellites   Primer F                   Primer R                   Repeat type†         no. alleles‡
C20.5872          TAACTCTGTTATGGCCCATGC      TGTTCTATATCTACGCCTTCCAC    ATTTT x 26           9
C20.5847          GCACTACCCTAGATTCTGTGACG    CGGACTTGCCTGAAAACC         TC x 23              3
C9.5510           AGGAAGCTTGCGGATGATG        CTGCGAGTTCCAGTCTGATG       AAAG x 24            8
C9.5542           TGTGCAGGACAAAACTCCTC       TCAAAGGACCCTGAAAGAGC       TnCn mix 194 bp      10
C9.5547           ACACTGCCACAGGCTACAAG       AGCGCCTAACCACTATGTCC       TG x 19              5
C1.4591           GAAGGTGTGCCAACACCAC        TGGTGGGGATAATCACAAAAG      CTTT x 19 AG x 13    7
C1.4548           TGACACTTTACCCAAGGATTTTC    TTGAACCAAGATTTGCAGGAC      GnAn mix 170 bp      12
C1.4542           CTACCTACCATTTGGGGTTCC      AACCTTCCATGCCTATTACCTG     TTTC x 50            8

† repeat motif in dog reference genome
‡ number of alleles identified from genotyping 50 Australian Kelpies

IV. Genome-wide SNP and haplotype analyses reveal a rich history underlying dog domestication

[2010. Nature. 464: 898-902.]

Bridgett M. vonHoldt1, John P. Pollinger1, Kirk E. Lohmueller2, Eunjung Han3, Heidi G. Parker4, Pascale Quignon4, Jeremiah D. Degenhardt2, Adam R. Boyko2, Dent A. Earl5, Adam Auton2, Andy Reynolds2, Kasia Bryc2, Abra Brisbin2, James C. Knowles1, Dana S. Mosher4, Tyrone C. Spady4, Abdel Elkahloun4, Eli Geffen6, Malgorzata Pilot7, Wlodzimierz Jedrzejewski8, Claudia Greco9, Ettore Randi9, Danika Bannasch10, Alan Wilton11, Jeremy Shearman11, Marco Musiani12, Michelle Cargill13, Paul G. Jones14, Zuwei Qian15, Wei Huang15, Zhao-Li Ding16, Ya-ping Zhang17, Carlos D. Bustamante2, Elaine A. Ostrander4, John Novembre1,18 & Robert K. Wayne1

1. Department of Ecology and Evolutionary Biology, 621 Charles E. Young Drive South, University of California, Los Angeles, California 90095, USA.
2. Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York 14853-2601, USA.
3. Department of Biostatistics, University of California, Los Angeles, California 14853, USA.
4. Cancer Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA.
5. Department of Biomolecular Engineering, University of California, Santa Cruz, California 95064, USA.
6. Department of Zoology, Tel Aviv University, Tel Aviv 69978, Israel.
7. Museum and Institute of Zoology, Polish Academy of Sciences, Wilcza 64, 00-679 Warszawa, Poland.
8. Mammal Research Institute, Polish Academy of Sciences, 17-230 Bialowieza, Poland.
9. Instituto Superiore per la Protezione e la Ricerca Ambientale (ISPRA), 40064 Ozzano Emilia (B), Italy.
10. Department of Population Health and Reproduction, School of Veterinary Medicine, University of California, Davis, California 95616, USA.
11. School of Biotechnology and Biomolecular Sciences and Clive and Vera Ramaciotti Center for Gene Function Analysis, University of New South Wales, Sydney NSW 2052, Australia.
12. Faculty of Environmental Design, University of Calgary, 2500 University Drive NW, Calgary, Alberta T2N 1N4, Canada.
13. Affymetrix Corporation, 3420 Central Expressway, Santa Clara, California 95051, USA.
14. The WALTHAM Centre for Pet Nutrition, Waltham on the Worlds, Leicestershire LE14 4RT, UK.
15. Affymetrix Asia Pacific, Scientific Affairs and Collaborations, 1233 Lujiazui Ring Road, AZIA Center, Suite 1508, Shanghai 200120, China.
16. Laboratory for Conservation and Utilization of Bio-resources, Yunnan University, Kunming 650091, China.
17. State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China.
18. Interdepartmental Program in Bioinformatics, 621 Charles E. Young Drive South, University of California, Los Angeles, California 90095, USA.

Advances in genome technology have facilitated a new understanding of the historical and genetic processes crucial to rapid phenotypic evolution under domestication1,2. To understand the process of dog diversification better, we conducted an extensive genome-wide survey of more than 48,000 single nucleotide polymorphisms in dogs and their wild progenitor, the grey wolf. Here we show that dog breeds share a higher proportion of multi-locus haplotypes unique to grey wolves from the Middle East, indicating that they are a dominant source of genetic diversity for dogs rather than wolves from east Asia, as suggested by mitochondrial DNA sequence data3. Furthermore, we find a surprising correspondence between genetic and phenotypic/functional breed groupings but there are exceptions that suggest phenotypic diversification depended in part on the repeated crossing of individuals with novel phenotypes. Our results show that Middle Eastern wolves were a critical source of genome diversity, although interbreeding with local wolf populations clearly occurred elsewhere in the early history of specific lineages. More recently, the evolution of modern dog breeds seems to have been an iterative process that drew on a limited genetic toolkit to create remarkable phenotypic diversity.

The dog is a striking example of variation under domestication, yet the evolutionary processes underlying the genesis of this diversity are poorly understood. To understand the geographic and evolutionary context for phenotypic diversification better, we analysed more than 48,000 single nucleotide polymorphisms (SNPs) typed in a panel of 912 dogs from 85 breeds as well as an extensive sample of 225 grey wolves (the ancestor of the domestic dog4,5) from 11 globally distributed populations (Supplementary Table 1). We constructed neighbour-joining trees using individuals and populations as units of analysis based on both individual SNP and haplotype similarity (Fig. 1 and Supplementary Note A). All trees identify dogs as a distinct cluster. Moreover, using as few as 20 diagnostic SNPs, all dog and wolf samples can be correctly assigned to species of origin with high confidence (see Supplementary Table 2, Supplementary Figs 1–4 and Supplementary Note B).

Applying the Bayesian clustering method implemented in STRUCTURE6, we found strong evidence for admixture with wolves only in a minority of breeds (Supplementary Fig. 5). Neighbour-joining trees reveal that most of these breeds (basenji, Afghan hound, Samoyed, saluki, Canaan dog, New Guinea singing dog, dingo, chow chow, Chinese Shar Pei, Akita, Alaskan malamute, Siberian husky and American Eskimo dog) are highly divergent from other dog breeds (Fig. 1 and Supplementary Figs 6–11). These highly divergent breeds have been identified previously and termed ‘ancient’ breeds (as opposed to ‘modern’)4 because, consistent with their high levels of divergence, historical information suggests that most have ancient origins (>500 years ago)7–9. The limitation of evidence for admixture to only a few breeds is striking given that hybridization between dogs and wolves is known to occur10 and dogs and wolves coexist widely. Given that modern breeds are the products of controlled breeding practices of the Victorian era (circa 1830–1900)4,7–9, the lack of detectable admixture with wolves is consistent with the strict breeding regimes recently implemented by humans.

To identify the primary source of genetic diversity for domestic dogs, we used three approaches. First, we assessed whether a single wolf population clustered with dogs in neighbour-joining trees based on allele sharing of SNPs and sharing of 5- and 10-SNP haplotypes for individuals and breed/population groupings (Fig. 1, Supplementary Figs 6–11 and Supplementary Note A). Only in individual SNP and 5-SNP haplotype trees were specific populations of Middle and Near Eastern grey wolves found to be most similar to domestic dogs (Fig. 1b and Supplementary Fig. 10). In all other trees, wolves form a single genetic group and are not informative with regard to the wolf population that is most similar to dogs.

We further tested the approach of a previous mitochondrial DNA (mtDNA) sequence study that suggested dogs have an origin in east Asia because diversity was highest in east Asian dog breeds3,11. We find that genetic diversity of dogs does not vary with geography in a consistent pattern. Specifically, breeds of east Asian origin do not have the highest level of nuclear variability, even when the SNP discovery scheme differed or haplotype measures of diversity were used to minimize ascertainment bias12 (Fig. 2a, b and Supplementary Figs 12a and 13). Furthermore, we confirmed an absence of geographic patterns in nuclear variation through a reanalysis of previously published microsatellite data4,7 (Fig. 2c and Supplementary Fig. 12b). For example, the two ancient breeds with highest SNP haplotype diversity, saluki and Chinese Shar Pei, originated in widely different areas (the Middle East and China, respectively8,9; Fig. 2b). However, ancient and island breeds are exceptions in consistently having lower diversity (basenji, Canaan dog, dingo, New Guinea singing dog; Fig. 2a–c). Thus, in contrast to previous mtDNA sequence results, current levels of autosomal diversity do not support an east Asian origin (or any other location). Indeed, if demographic history has varied substantially in dog breeds across geographic regions after domestication, current levels of genetic diversity may not directly reflect the oldest, ancestral population as it does in other species such as humans12–14. In addition, we note that recently, the use of genetic diversity to infer centres of domestication has been questioned by studies of semi-feral village dogs from Africa and Puerto Rico that found levels of mtDNA diversity as high or higher than those in east Asia11,15. High diversity in African dog populations reflects the added contribution of ancient indigenous dogs to the gene pool, which elsewhere is often dominated by modern breeds15.
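The allele-sharing comparisons that feed the neighbour-joining trees can be illustrated with a toy calculation. The sketch below uses one common definition of allele-sharing distance (the proportion of alleles not shared identical-by-state); this definition, the 0/1/2 genotype coding and the three example individuals are assumptions for illustration, since the exact procedure is given in Supplementary Note A, which is not reproduced here.

```python
def allele_sharing_distance(genotypes_a, genotypes_b):
    """Proportion of alleles NOT shared between two individuals.

    Genotypes are coded as the count (0, 1 or 2) of one allele at each SNP;
    None marks a missing call and that SNP is skipped.
    """
    mismatched = 0
    compared = 0
    for a, b in zip(genotypes_a, genotypes_b):
        if a is None or b is None:
            continue
        mismatched += abs(a - b)  # 0, 1 or 2 alleles differ at this SNP
        compared += 1
    if compared == 0:
        raise ValueError("no SNPs typed in both individuals")
    return mismatched / (2 * compared)


# Hypothetical genotypes at five SNPs (not data from the study).
dog_a = [0, 1, 2, 2, 0]
dog_b = [0, 1, 2, 1, 0]
wolf = [2, 1, 0, 0, None]

print(allele_sharing_distance(dog_a, dog_b))
print(allele_sharing_distance(dog_a, wolf))
```

A matrix of such pairwise distances is the kind of input a neighbour-joining algorithm takes to build a tree.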


Figure 1 Neighbour-joining trees of domestic dogs and grey wolves. Branch colour indicates the phenotypic/functional designation used by dog breeders8,9. A dot indicates > 95% bootstrap support from 1,000 replicates. a, Haplotype-sharing cladogram for 10-SNP windows (n = 6 for each breed and wolf population). b, Allele-sharing cladogram of individuals based on individual SNP loci. c, Haplotype-sharing phylogram based on 10-SNP windows of breeds and wolf populations. d, Allele-sharing phylogram of individual SNPs for breeds and wolf populations. For c and d, we note breeds where genetic assignments conflict with phenotypic/functional designations as follows: 1, Brussels griffon; 2, Pekingese; 3, pug; 4, Shih-tzu; 5, miniature pinscher; 6, Doberman pinscher; 7, Kuvasz; 8, Ibizian hound; 9, chihuahua; 10, Pomeranian; 11, papillon; 12, Glen of Imaal; 13, ; 14, Briard; 15, Jack Russell; 16, dachshund; 17, great schnauzer; and 18, standard schnauzer. Gt, great; mtn, mountain; PBGV, petit basset griffon vendeen; pin., pinscher; ptr, pointer; ret., retriever; shep., shepherd; sp., spaniel; Staf., Staffordshire; std, standard; terr., terrier. Canine images not drawn to scale. Wolf image adapted from ref. 31; dog images from the American Kennel Club (http://www.akc.org).

Consequently, as a third approach to determine the primary centre of dog domestication, we considered haplotype sharing of modern and ancient dog breeds with specific wolf populations (see Supplementary Note A). Haplotype diversity patterns have been shown to be less sensitive to ascertainment biases12, and the sharing of SNPs that are otherwise private to specific wolf populations provides a unique signal to support ancestry or admixture. We analysed haplotype sharing between 64 well-sampled (n ≥ 9) dog breeds and wolf populations from Europe, the Middle East and China for 500-kilobase (kb) haplotype windows containing 5 and 15 SNPs drawn at random (Fig. 2d and Supplementary Table 3, Supplementary Note A). The Middle East and China have been implicated as centres for dog origination based on the archaeological record or mtDNA diversity3,11,16–18. We also assessed haplotype sharing between dog breeds and North American wolves as a negative control because dogs did not originate there19. Across all breeds, and for both window sizes, levels of sharing between dogs and North American wolves are substantially lower than the analogous comparison with Old World wolves, as expected (Fig. 2d). For 5-SNP haplotype windows, we found that haplotype sharing was uniformly higher between modern dog breeds and Middle Eastern wolves than between other wolf populations (Fig. 2d, left). For 15-SNP windows (Fig. 2d, right), the majority of breeds show the most sharing with Middle Eastern wolves, including some dog breeds of diverse geographic origins (for example, basenji, chihuahua, basset hound and borzoi). Notably, significant sharing with European wolves is found in miniature pinschers, Staffordshire bull terriers, greyhounds and whippets.
The increased haplotype sharing between some European breeds and European wolves in the 15-SNP analysis may not be revealed in the 5-SNP windows because the European and Middle Eastern wolf haplotypes are less readily distinguished when based on fewer SNPs. Finally, only two east Asian breeds (Akita and chow chow) had higher sharing with Chinese wolves, although the results were not significant. In an analysis with fewer chromosomes per breed (n ≥ 6), four east Asian breeds (the Akita, Chinese Shar Pei, chow chow and dingo) showed most sharing with Chinese wolves (the latter two breeds showing significantly more sharing than expected), corroborating STRUCTURE clustering results (Supplementary Figs 5 and 14).
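The haplotype-sharing comparison above can be illustrated with a deliberately simplified sketch. The exact statistic and permutation tests are defined in Supplementary Note A, which is not reproduced here, so everything below is an assumption: haplotypes are phased 0/1 strings, windows are runs of consecutive SNPs rather than 500-kb windows with SNPs drawn at random, and sharing is the fraction of distinct breed haplotypes that also occur in the wolf sample.

```python
def window_haplotypes(chromosomes, start, size):
    """Distinct haplotype strings seen in one window of `size` SNPs."""
    return {chrom[start:start + size] for chrom in chromosomes}

def sharing_fraction(breed_chroms, wolf_chroms, size):
    """Fraction of distinct breed haplotypes also present in the wolves,
    pooled over non-overlapping windows of `size` consecutive SNPs.
    (Simplified, hypothetical definition for illustration.)"""
    n_snps = len(breed_chroms[0])
    shared = total = 0
    for start in range(0, n_snps - size + 1, size):
        breed_haps = window_haplotypes(breed_chroms, start, size)
        wolf_haps = window_haplotypes(wolf_chroms, start, size)
        total += len(breed_haps)
        shared += len(breed_haps & wolf_haps)
    return shared / total if total else float("nan")

# Hypothetical phased chromosomes, 10 SNPs each.
breed = ["0011010110", "0011011110", "1011010110"]
wolves_a = ["0011010110", "1011000000"]   # shares most breed haplotypes
wolves_b = ["1111111111", "0000000000"]   # shares none
frac_a = sharing_fraction(breed, wolves_a, 5)
frac_b = sharing_fraction(breed, wolves_b, 5)
print(frac_a, frac_b)
```

With longer windows, two wolf populations that carry similar short haplotypes become easier to tell apart, which is the intuition behind the different 5-SNP and 15-SNP results described above.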


Figure 2 Genome-wide analysis of SNP variation in domestic dogs and grey wolves. a, Average observed heterozygosity (Ho). b, Average number of haplotypes per breed or group for phased SNP loci (15-SNP windows). c, Average observed heterozygosity of microsatellite data4,7. d, Fraction of unique haplotypes shared between 64 dog breeds and wolf populations for 5-SNP (left) and 15-SNP (right) windows. Diamonds indicate significant sharing (P < 0.05) using permutation test 2 (Supplementary Note A). Six (a–c) or nine (d) individuals represent each breed and wolf population. Error bars indicate s.e.m. E, east; SW, southwest.

Notably, in both 5-SNP and 15-SNP window analyses, the basenji, a breed of Middle Eastern origin, had a greater proportion of shared haplotypes with Middle Eastern wolves than any other domestic dog (Fig. 2d and Supplementary Table 3). This result suggests that basenjis had a larger effective population size early in domestication or that they have more recently backcrossed with wolves. Overall, these data implicate the Middle East as a primary source of genetic variation in the dog, with potential secondary sources of variation from Europe and east Asia. In contrast to the mtDNA results, east Asian wolves are a predominant source of haplotype diversity for only a few east Asian dog breeds that have a long history in that region. Neighbour-joining trees based on SNP data provide an explicit framework for investigating hypotheses of breed history and the genesis of phenotypic diversity. Consistent with previous microsatellite results4,7, topological analyses often define three well-supported groups of highly divergent, ancient breeds: an Asian group (dingo, New Guinea singing dog, chow chow, Akita and Chinese Shar Pei), a Middle Eastern group (Afghan hound and saluki) and a northern group (Alaskan malamute and Siberian husky) as being distinct from modern domestic dogs (Fig. 1a, b and Supplementary Figs 6–11). In addition, we find that the basenji often appears as the most divergent breed in allele- and haplotype-sharing trees (Fig. 1a, b and Supplementary Figs 6–11). This finding and high haplotype sharing, as well as a long recorded history8,9, suggest that this breed is one of the most ancient extant dog breeds. The radiation of modern dog breeds has been difficult to resolve because most have originated recently and lack deep, detailed histories8,9. Consequently, the evolutionary process underlying the genesis of phenotypic/functional groupings is obscure.
Specifically, many breeds have been documented as originating through crosses of genealogically or geographically distant stocks9 and thus, parallel evolution and genetic heterogeneity within phenotypic/functional breed groupings is expected. Nonetheless, we discern distinct genetic clusters within modern dogs that largely correspond to those based on phenotype or function, including spaniels, scent hounds, mastiff-like breeds, small terriers, retrievers, herding dogs and sight hounds (see Fig. 1). Most genetic groups have short internodes and often low bootstrap support, reflecting the rapid formation of modern breeds in the Victorian era8,9. Notably, toy and working dogs have a more varied relationship to genetic groupings, which is consistent with their known histories involving crosses between breeds from divergent genetic lineages (Supplementary Table 4). The heterogeneous composition of toy breeds may specifically indicate their frequent origin as a cross between a larger dog from a distinct breed grouping and a toy or dwarfed breed (Supplementary Table 4). Finally, within each breed, there is a remarkable concordance with known origin as all dogs are correctly assigned to the breed or population from which they were sampled, with one exception (bull terrier and miniature bull terrier; Fig. 1a, b). The contribution of these groupings to genetic variation was assessed by an analysis of molecular variance (AMOVA; Supplementary Table 5) which showed that 65% of the variation is due to variation within dog breeds, and 31% is due to variation within breed groups, similar to that reported for microsatellite data4,7. However, our analysis also showed that 3.8% of the variation is between phenotypic/functional breed groups (P<0.001). Consequently, although most variation is within breeds, phenotypic/functional breed groups represent a relatively small but significant component of variation.
The process of domestication involves strong selection of specific phenotypes; therefore, a signal of this selection should be evident in the genome20. Given the genome-wide coverage of our panel of SNPs, we searched for genomic regions that might contain adaptive substitutions due to positive selection during the initial phase of dog domestication (rather than breed formation, see Supplementary Note C). For each SNP, we calculated the fixation index (FST) and cross-population extended haplotype homozygosity (XP-EHH) values between non-admixed wolves and modern dogs and considered SNPs with extreme values as candidates for recent positive selection20. These statistics measure population differentiation and relative levels of genetic diversity, both of which are robust indicators of positive selection for recently domesticated species21. We found that SNPs within the top 5% of FST values and SNPs within the highest 1% of XP-EHH values are each significantly enriched for SNPs in genic regions (P=0.04 for FST, P=0.02 for XP-EHH, one-sided exact conditional test, controlling for the ascertainment panel; Supplementary Fig. 15). This result is consistent with a history of adaptive divergence in genic regions. To identify specific regions that are candidates for recent adaptive evolution, we normalized FST and XP-EHH values within ascertainment categories, and targeted regions that have several SNPs with extreme values for both statistics (Supplementary Table 6, see Supplementary Note C for results). Notably, two of our top three signals are near genes that have been implicated in memory formation and/or behavioural sensitization in mouse or human studies (ryanodine receptor 3 (OMIM accession 180903; ref. 22), adenylate cyclase 8 (OMIM accession 103070; Supplementary Note C)). Furthermore, we observed a single SNP with a high FST value located near the WBSCR17 gene responsible for Williams–Beuren syndrome in humans (OMIM accession 194050; Supplementary Fig.
16), which is characterized by social traits such as exceptional gregariousness. These outlier SNPs provide specific candidate regions for fine-scale mapping of genes that are important in the early domestication of dogs. Our results show domestic dogs have genetic structure on three fundamental levels resulting from distinct evolutionary processes. First, within dog breeds, nearly all dogs are assigned to a breed of origin. This result is supported by previous microsatellite research4,7 and reflects the limited number of founders, inbreeding and small effective population size characteristic of many breeds23,24. Second, breed groupings are evident at a finer scale than previously described, and mirror breed classification based on form and function. We propose that this result reflects the tendency of dog breeders to develop new breeds by crossing individuals within specific functional and phenotypic groups to enhance abilities such as retrieving and herding, or further develop specific morphological traits4. However, heterogeneity within toy breeds and other breed groupings suggests the importance of discrete phenotypic mutations in the evolution of phenotypic diversity in the dog. Recent genetic studies have established that variation in coat colour25 and texture26, body size2, relative leg length27 and body proportions (A.R.B. et al., manuscript submitted) in different dog breeds are due to variation in shared genes of large phenotypic effect. For example, at least 19 distinct dog breeds with foreshortened limbs all uniquely share the same retrotransposed version of Fgf4 that is strongly implicated as the genetic basis for this phenotype27. Once such discrete mutations are fixed in a breed they can readily be crossed into unrelated lineages and thus enhance the process of phenotypic diversification. 
This process has perhaps produced more phenotypic diversity in dogs than other domesticated species because they are selected for many functions of value to humans (for example, defence, herding, retrieving, hunting, speed and companionship) as well as for novelty, which culminated in the ‘fancy breeds’ of the Victorian era8,9. Last, we identify divergent lineages of dogs distinct from those breeds that radiated during the nineteenth century and that probably derive from ancient geographically indigenous breeds. This finding mirrors recent genetic discoveries in sheep28 and cattle29 and suggests that some canine lineages may have persisted from antiquity or have more recently admixed with wolves. The latter seems unlikely given that some of these breeds have known ancient histories, exist in areas where wolves are absent, and are phenotypically highly derived8,9. For example, the chow chow originated more than 2,000 years ago8,9. Similarly, the dingo and New Guinea singing dog were probably established over 4,000 years ago and exist in areas without wolves. However, given their close proximity to extensive wolf populations, divergent northern breeds such as the Alaskan malamute, Siberian husky and American Eskimo dog may be better candidates for recent admixture. Our haplotype sharing analysis evaluates the contribution of specific wolf populations to the genome of dogs, and reveals significant Middle Eastern and, for certain breeds, European ancestry. This result is consistent with the archaeological record that identified the earliest dog remains in the Middle East (12,000 years ago)16, Belgium (31,000 years ago)17, and the Bryansk region in western Russia (15,000 years ago)17, as well as the finding of high mtDNA diversity in ancient Italian dogs18. However, some ancient east Asian breeds show affinity with Chinese wolves, which suggests that they were derived from Chinese wolves or admixed with them after domestication10.
The domestic dog seems comparable to other domestic species in containing several sources of variation from wild relatives. This dynamic process enriched the dog genome through interbreeding with wolves early in the domestication process. Similarly, mutations that have occurred since domestication, such as the mutation responsible for black coat colour, have been transferred to grey wolves30. Our genome-wide SNP analysis provides a new evolutionary framework for understanding the rapid phenotypic diversification unique to the domestic dog.

METHODS SUMMARY

SNP genotyping. Genomic DNA was isolated from blood samples of domestic dogs (Canis familiaris, n = 912) and from tissue and blood samples of grey wolves (C. lupus, n = 225) and coyotes (C. latrans, n = 60; see Supplementary Methods and Supplementary Table 1). The samples were genotyped and quality control filters were applied (see A.R.B. et al., manuscript submitted) to obtain high-quality genotypes from 48,036 autosomal SNP loci.

Cluster analysis. To visualize genetic relationships suggested by our SNP data we used principal component analysis (PCA) (ndog_breed = 2) and STRUCTURE6 (ndog_breed = 1). For tree reconstruction, we analysed two data sets. First, for individual-based allele-sharing distance analyses, we used 574 individuals (ndogs = 490; nOld_World_wolves = 84). This data set consisted of 75 dog breeds where six individuals were genotyped from each breed and an additional five dog breeds where five or fewer individuals were genotyped. The second data set was created for the population-level and haplotype-sharing distance-based analyses and used a subset of 530 individuals to provide comparable sample sizes from 79 dog breeds (nper_breed = 6) or wolf populations from China (n = 6), Middle East (n = 7), central Asia (n = 6) and Europe (n = 31). Coyotes from the western United States (n = 6) were used for rooting.

SNP haplotype analysis. From phased genotypes, we divided the genome into 500-kb windows to identify haplotypes and estimated haplotype diversity. The level of haplotype sharing was assessed between a dog breed (nindividuals ≥ 9 per breed, nbreeds = 64) and each wolf population (China, Europe, Middle East and North America).

Detecting selection. Population differentiation and extended haplotype homozygosity test statistics were calculated between modern dog breeds and grey wolves. We identified outlier SNP loci based on normalized scores and ranking in the 95th and 99th percentile.

Full Methods and any associated references are available in the online version of the paper at www.nature.com/nature.
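The outlier approach summarized under "Detecting selection" can be sketched as follows. The text does not state which FST estimator was used, so the Nei-style two-population formula with equal weights, the function names, and the toy allele frequencies are all assumptions; XP-EHH and the normalization within ascertainment categories are not reproduced.

```python
def fst_two_pops(p1, p2):
    """Nei-style F_ST for one biallelic SNP from allele frequencies p1, p2.

    Equal population weights are assumed; this is an illustrative
    estimator, not necessarily the one used in the study.
    """
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)                      # pooled expected het.
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-pop het.
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t

def top_fraction(scores, fraction):
    """Indices of SNPs whose score lies in the top `fraction` of values."""
    n_keep = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])

# Hypothetical allele frequencies at five SNPs in dogs and wolves.
dog_freqs = [0.10, 0.50, 0.95, 0.40, 1.00]
wolf_freqs = [0.12, 0.45, 0.05, 0.42, 0.00]
fst = [fst_two_pops(d, w) for d, w in zip(dog_freqs, wolf_freqs)]
outliers = top_fraction(fst, 0.2)   # top 20% here; the study used 5% and 1%
print(fst, outliers)
```

A rank-based cutoff of this kind identifies candidates only; it does not by itself establish selection, which is why the analysis above also requires several extreme SNPs for both statistics in the same region.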

Received 1 September 2009; accepted 19 January 2010. Published online 17 March 2010.

1. Liti, G. et al. Population genomics of domestic and wild yeasts. Nature 458, 337–341 (2009).
2. Sutter, N. et al. A single IGF1 allele is a major determinant of small size in dogs. Science 316, 112–115 (2007).
3. Savolainen, P., Zhang, Y. P., Luo, J., Lundeberg, J. & Leitner, T. Genetic evidence for an East Asian origin of domestic dogs. Science 298, 1610–1613 (2002).
4. Parker, H. et al. Genetic structure of the purebred domestic dog. Science 304, 1160–1164 (2004).
5. Vila, C., Maldonado, J. & Wayne, R. Phylogenetic relationships, evolution, and genetic diversity of the domestic dog. J. Hered. 90, 71–77 (1999).
6. Pritchard, J., Stephens, M. & Donnelly, P. Inference of population structure using multilocus genotype data. Genetics 155, 945–959 (2000).
7. Parker, H. et al. Breed relationships facilitate fine-mapping studies: a 7.8-kb deletion cosegregates with Collie eye anomaly across multiple dog breeds. Genome Res. 17, 1562–1571 (2007).
8. Wilcox, B. & Walkowicz, C. The Atlas of Dog Breeds of the World 5th edn (TFH Publications, 1995).
9. American Kennel Club. The Complete Dog Book 20th edn (Ballantine Books, 2006).
10. Vila, C., Seddon, J. & Ellegren, H. Genes of domestic mammals augmented by backcrossing with wild ancestors. Trends Genet. 21, 214–218 (2005).
11. Pang, J. et al. mtDNA data indicate a single origin for dogs south of Yangtze river, less than 16,300 years ago, from numerous wolves. Mol. Biol. Evol. 26, 2849–2864 (2009).
12. Conrad, D. et al. A worldwide survey of haplotype variation and linkage disequilibrium in the human genome. Nature Genet. 38, 1251–1260 (2006).
13. Jakobsson, M. et al. Genotype, haplotype and copy-number variation in worldwide human populations. Nature 451, 998–1003 (2008).
14. Li, J. et al. Worldwide human relationships inferred from genome-wide patterns of variation. Science 319, 1100–1104 (2008).
15. Boyko, A. et al. Complex population structure in African village dogs and its implication for inferring dog domestication history. Proc. Natl Acad. Sci. USA 106, 13903–13908 (2009).
16. Dayan, T. Early domesticated dogs of the Near East. J. Archaeol. Sci. 21, 633–640 (1999).
17. Germonpre, M. et al. Fossil dogs and wolves from Palaeolithic sites in Belgium, the Ukraine and Russia: osteometry, ancient DNA and stable isotopes. J. Archaeol. Sci. 36, 473–490 (2009).
18. Verginelli, F. et al. Mitochondrial DNA from prehistoric canids highlights relationships between dogs and South-East European wolves. Mol. Biol. Evol. 22, 2541–2551 (2005).
19. Leonard, J. et al. Ancient DNA evidence for Old World origin of New World dogs. Science 298, 1613–1616 (2002).
20. Sabeti, P. et al. Genome-wide detection and characterization of positive selection in human populations. Nature 449, 913–918 (2007).
21. Innan, H. & Kim, Y. Pattern of polymorphism after strong artificial selection in a domestication event. Proc. Natl Acad. Sci. USA 101, 10667–10672 (2004).
22. Balschun, D. et al. Deletion of the ryanodine receptor type 3 (RyR3) impairs forms of synaptic plasticity and spatial learning. EMBO J. 18, 5264–5273 (1999).
23. Cruz, F., Vila, C. & Webster, M. The legacy of domestication: accumulation of deleterious mutations in the dog genome. Mol. Biol. Evol. 25, 2331–2336 (2008).
24. Gray, M. et al. Linkage disequilibrium and demographic history of wild and domestic canids. Genetics 181, 1493–1505 (2009).
25. Candille, S. et al. A β-defensin mutation causes black coat color in domestic dogs. Science 318, 1418–1423 (2007).
26. Cadieu, E. et al. Coat variation in the domestic dog is governed by variants in three genes. Science 326, 150–153 (2009).
27. Parker, H. et al. An expressed Fgf4 retrogene is associated with breed-defining chondrodysplasia in domestic dogs. Science 325, 995–998 (2009).
28. Pedrosa, S. et al. Evidence of three maternal lineages in near eastern sheep supporting multiple domestication events. Proc. R. Soc. Lond. B 272, 2211–2217 (2005).
29. Elsik, C. et al. The genome sequence of taurine cattle: a window to ruminant biology and evolution. Science 324, 528–532 (2009).
30. Anderson, T. et al. Molecular and evolutionary history of melanism in North American gray wolves. Science 323, 1339–1343 (2009).
31. Macdonald, D. W. & Barrett, P. Mammals of Britain and Europe (Collins, 1993).

Acknowledgements Grants from NSF and NIH (R.K.W.; C.D.B. and J.N.), the Polish Ministry of Science and Higher Education (M.P. and W.J.), European Nature Heritage Fund EURONATUR (W.J.), National Basic Research Program of China (Y.-p.Z.), and Chinese Academy of Sciences (Y.-p.Z.) supported this research. J.N. was supported by the Searle Scholars Program. B.M.vH. was supported by a NIH Training Grant in Genomic Analysis and Interpretation. K.E.L. was supported by a NSF Graduate Research Fellowship. E.A.O., D.S.M., T.C.S., A.E. and H.G.P. are supported by the intramural program of the National Human Genome Research Institute. M.P. was supported by the Foundation for Polish Science. Wolf samples from central and eastern Europe and Turkey were collected as a result of a continuing project on genetic differentiation in Eurasian wolves. We thank the project participants (B. Jedrzejewska, V. E. Sidorovich, M. Shkvyrya, I. Dikiy, E. Tsingarskaya and S. Nowak) for their permission to use 72 samples for this study. We acknowledge R. Hefner and the Zoological collection at Tel Aviv University for Israeli wolf samples. We thank the American Kennel Club (AKC) for the dog images reproduced in Fig. 1. We also gratefully acknowledge the dog owners who generously provided samples, the AKC Canine Health Foundation, and Affymetrix Corporation. We thank B. Van Valkenburgh, K.-P. Koepfli, D. Stahler and D. Smith for reviewing the manuscript.

Author Contributions Samples were contributed by E.G., M.P., W.J., C.G., E.R., D.B., A.W., J.S., M.M., E.A.O. and R.K.W. The experiment was designed and carried out with the help of B.M.vH., J.P.P., H.G.P., P.Q., D.S.M., T.C.S., A.E., A.W., J.S., M.C., P.G.J., Z.Q., W.H., Z.-L.D., Y.-p.Z., C.D.B., E.A.O. and R.K.W. The genotyping program was written by A.R.B., A.A., A.R., K.B., A.B. and C.D.B. and further programming was completed by K.E.L., J.D.D., D.A.E., E.H. and J.N. The analyses were conducted by B.M.vH., J.P.P., K.E.L., E.H., H.G.P., J.D.D., A.R.B., D.A.E., A.A., A.R., J.C.K. and J.N. The manuscript was written by B.M.vH., K.E.L., C.D.B., E.A.O., J.N. and R.K.W.

Author Information Reprints and permissions information is available at www.nature.com/reprints. The authors declare no competing financial interests. Correspondence and requests for materials should be addressed to R.K.W. ([email protected]).

METHODS

The CanMap sample collection. Our study is part of the CanMap project (A.R.B. et al., manuscript submitted), which isolated genomic DNA from blood samples collected from domestic dogs (C. familiaris, n = 912) and from tissue and blood samples from grey wolves (C. lupus, n = 225), coyotes (C. latrans, n = 60), putative dog–wolf hybrids (n = 17), red wolves (C. rufus, n = 12), Mexican wolves (C. l. baileyi, n = 10), Ethiopian wolves (C. simensis, n = 4), black-backed jackals (C. mesomelas, n = 6), golden jackals (C. aureus, n = 2), and a side-striped jackal (C. adustus, n = 1; Supplementary Table 1). Domestic dog samples were obtained through American Kennel Club (AKC) sanctioned dog shows, speciality events, breed clubs, and veterinary clinics. Three to twelve dogs from each of 81 AKC-recognized breeds and four semi-domestic lineages (Africanis, Canaan dog, dingo, and New Guinea singing dog) were used in the analysis. Specifically, the semi-domestic dingo and New Guinea singing dog are ancient breeds that were probably established more than 4,000 years ago and have existed in isolation from wolves32. The Affymetrix Canine Mapping Array version 2 contains SNPs that were ascertained by aligning sequence reads to the boxer genome assembly (CanFam2). A large number of SNPs were discovered as heterozygous sites in the boxer genome (here denoted as boxer x boxer SNPs), and further SNPs were found by aligning sequence reads from other breeds or wild canids to the boxer genome. These extra SNPs can be categorized by the sequence aligned to the boxer as follows: (1) the standard poodle (CanFam1); (2) one of nine dog breeds; (3) one of four wolf populations (Alaskan, Chinese, Indian or Spanish wolves); and (4) coyote (see Supplementary Table 7).

Breed groupings. Several analyses are based on specific dog breed groupings for comparison purposes (Supplementary Table 1). We define ancient breeds as those that are divergent genetically (Fig.
1), corroborated by previous genetic studies33 and, in most cases, are known to have originated in ancient cultures more than 500 years ago8,9. Furthermore, we used previously defined geographical breed groupings3 (Africa, America, east Asia, Europe, Siberia, southeast Asia and southwest Asia breed groups) and functional and phenotypic breed groupings in common usage by dog breeders8,9,33 (ancient, spitz, toy, spaniels, scent hounds, working dogs, mastiff-like breeds, small terriers, retrievers, herding and sight hounds).

Identification of recently related individuals in the sample. We used PLINK34 to obtain pairwise estimates of identity by state (IBS). From the Yellowstone National Park wolves in the data set (n = 19), known pedigree relationships were used to calibrate IBS scores35. A minimum score of IBS > 0.8 indicated a relatedness status of half-siblings, and values below this level were used to identify a set of unrelated wild canids.

Single-SNP measures of genetic diversity. Single-marker descriptive statistics (observed/expected heterozygosity and polymorphism) were estimated using PLINK34 for the complete SNP data set (Fig. 2). Observed heterozygosity was also estimated using only SNPs ascertained from the grey wolf and boxer sequence comparisons (Supplementary Fig. 12). We used microsatellite genotype data from a previous study4 for an independent comparison of observed heterozygosity from loci with different mutational properties and ascertainment schemes (Fig. 2 and Supplementary Fig. 13).

STRUCTURE analysis. We used the Bayesian inference program STRUCTURE36 to assess genetic partitions and admixture for the 43,953-SNP data set (linkage disequilibrium (LD) pruned, r² < 0.5). We used 5,000 burn-in iterations and 15,000 Markov chain Monte Carlo (MCMC) iterations in STRUCTURE, with three repetitions of these parameter settings.
The alpha and likelihood statistics were verified to reach convergence before 5,000 burn-in and 15,000 MCMC iterations were completed during each repetition for each number of assumed populations analysed. We analysed domestic dogs and Old World wolves to resolve the ancestry of domestic dogs; hence, we included only one dog per breed for the analysis (n = 85; Supplementary Fig. 5). We also included only Old World wolf populations because they may be closely related to the direct ancestors of domestic dogs and we used unrelated individuals from IBS estimates. We sampled China (n = 9), central Asia (n = 3), the Middle East (n = 7) and Europe (n = 43). We excluded wolves from highly inbred populations (Italy, Spain and Sweden)24 to avoid their early partitioning in the analysis. No dog–wolf hybrids were found in the full sample of modern breeds (n = 801) as determined with the program smartpca in the Eigensoft package37. From the dog–wolf PCA (see Supplementary Note B) we identified 20 SNPs with the highest loadings on PC1 as input for an additional STRUCTURE analysis to determine the posterior probability of assignment for dogs and wolves to their corresponding species (Supplementary Table 2). Results were plotted using the circular visualization program CIRCOS (http://mkweb.bcgsc.ca/circos/; Supplementary Fig. 5). After the initial partitioning of modern domestic dogs from wild canids for K = 2 (Supplementary Fig. 5), ancient breeds are separated from other canids when a third population is assumed (K = 3). Our results show uniquely that Canaan dog, dingo, New Guinea singing dog, and Alaskan Eskimo dog are members of this cluster of ancient breeds, and confirm previous results showing basenji, Afghan hound, samoyed, saluki, Canaan dog, chow chow, Chinese Shar Pei, Akita, Alaskan malamute, and Siberian husky belong in that group3,4,8,9.

Analysis of molecular variance.
The IBS matrix was put into ARLEQUIN v3 to analyse molecular variance with 10,098 permutations for significance testing38. We defined three different analysis groups (see Fig. 1 and Supplementary Table 5): (1) breed groups in Fig. 1; (2) geographical dog breed groups (Supplementary Table 1); and (3) wolves and dogs as separate groups.
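For readers unfamiliar with identity by state, the pairwise IBS scores used above for relatedness filtering, and as input to the AMOVA, can be illustrated with a short sketch. The 0/1/2 genotype coding, the function names and the toy genotypes are assumptions; the 0.8 threshold is the value reported in Methods, and PLINK's own implementation is not reproduced here.

```python
from itertools import combinations

def ibs_similarity(g1, g2):
    """Mean proportion of alleles shared identical by state.

    Genotypes are coded 0/1/2; SNPs missing (None) in either individual
    are skipped.
    """
    shared = compared = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue
        shared += 2 - abs(a - b)   # 2, 1 or 0 alleles shared at this SNP
        compared += 2
    return shared / compared if compared else float("nan")

def related_pairs(individuals, threshold=0.8):
    """Pairs of sample names whose IBS similarity exceeds `threshold`."""
    return [
        (n1, n2)
        for (n1, g1), (n2, g2) in combinations(sorted(individuals.items()), 2)
        if ibs_similarity(g1, g2) > threshold
    ]

# Hypothetical genotypes for three wolves at six SNPs.
wolves = {
    "w1": [0, 1, 2, 1, 0, 2],
    "w2": [0, 1, 2, 1, 1, 2],
    "w3": [2, 0, 0, 1, 2, 0],
}
pairs = related_pairs(wolves)
print(pairs)
```

One individual from each flagged pair would then be dropped to obtain a set of unrelated samples; which member to drop is a separate choice not specified here.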

32. Savolainen, P., Leitner, T., Wilton, A., Matisoo-Smith, E. & Lundeberg, J. A detailed picture of the origin of the Australian dingo, obtained from the study of mitochondrial DNA. Proc. Natl Acad. Sci. USA 101, 12387–12390 (2004).
33. Sablin, M. & Khlopachev, G. The earliest Ice Age dogs: evidence from Eliseevichi I. Curr. Anthropol. 43, 795–799 (2002).
34. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
35. vonHoldt, B. et al. The genealogy and genetic viability of reintroduced Yellowstone grey wolves. Mol. Ecol. 17, 252–274 (2008).
36. Pritchard, J., Stephens, M. & Donnelly, P. Inference of population structure using multilocus genotype data. Genetics 155, 945–959 (2000).
37. Price, A. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genet. 38, 904–909 (2006).
38. Schneider, S., Roessli, D. & Excoffier, L. Arlequin: a software for population genetics data analysis v.2.000 (Genetics and Biometry Lab, Department of Anthropology, University of Geneva, 2000).
