Identification of causal rare variants in an extended pedigree

with Obsessive-Compulsive Disorder

By

Rageen Rajendram

A thesis submitted in conformity with the requirements for the degree of Master of Science

In the Graduate department of the Institute of Medical Science

University of Toronto

© Copyright by Rageen Rajendram (2014)

Identification of causal rare variants in an extended pedigree with

Obsessive-compulsive disorder

Master of Science, 2014

Institute of Medical Science, University of Toronto

Abstract

Obsessive-Compulsive Disorder (OCD) is a common, heritable and etiologically heterogeneous neuropsychiatric disorder. Despite twin and family studies supporting genetic determinants in OCD, the discovery of causal variants remains elusive. To identify causal variants, an extended pedigree consisting of multiple affected individuals with early-onset OCD was studied. Genome-wide linkage analysis was performed using both parametric and non-parametric approaches. Whole exome sequencing was conducted on two of the most distantly related affected individuals who shared a haplotype. Variants were filtered based on their presence in both individuals, rare frequency (MAF≤0.05), and location within linkage regions. Genes previously implicated in OCD were also screened.

Several putatively damaging low/rare frequency variants were identified in genes involved in immune function (TRAFD1) and glutamate pathways (DLG1, CPQ, SLC1A1). Our findings identified variants within genes/pathways known to be associated with OCD and demonstrated the ability of our methodology to identify rare variants in common complex diseases.

Acknowledgements

This thesis is dedicated to the families and children who live with Obsessive-Compulsive Disorder every day. Thank you for donating your time and energy so we can better understand your condition.

First and foremost, I would like to thank Dr. Paul Arnold for taking me on as a graduate student and for providing me with this amazing opportunity to learn and grow. I am also indebted to Christie Burton, S-M Shaheen, Julie Coste, Laura Park, Vanessa Sinopoli, Bingbin Li, and Lauren Erdman from the Arnold Lab for their support over the past two years.

Next, I would like to acknowledge my family for their tremendous support and for teaching me to follow my heart without ever giving up. Thank you to my mother and father for giving me much needed guidance and my brother Praveen for giving me a path to follow. I will always be grateful for all you have done and continue to do for me.

To my closest friend, thank you for cheering me on and making these two years some of the best of my life. I am glad to have made a discovery that will last a lifetime.

I am thankful for the support of Christian Marshall, Susan Walker, Matt, Anath Lionel, and Stephen Scherer from the Scherer Lab. I am also grateful for the support of Darci Butcher, Dasha Grafodatskya, Yi-an Chen, Sanaa Choufani and Rosanna Weksberg from the Weksberg Lab. Thank you to the IMS Students’ Association and the staff at IMS for all their support and guidance. A special thanks to 216 who have been with me since day one.

Finally, thank you to my committee members Dr. Andrew Paterson and Dr. John Vincent for your guidance through all aspects of this project.

Contributions

Dr. Paul Arnold was involved in the initial study design and funding acquisition. Dr. Gregory Hanna assessed the family’s phenotype and acquired DNA for genetic studies at the University of Michigan. S-M Shaheen and Julie Coste assisted with sample acquisition and DNA extraction. I prepared all samples for genotyping and whole exome sequencing at The Centre for Applied Genomics (TCAG) at SickKids. I also developed most of the analysis pipeline, analyzed all data and interpreted results. Dr. Andrew Paterson and Nicole Roslin assisted with genome-wide linkage analysis, and Christian Marshall, Lynette Lau, and Susan Walker assisted with whole exome sequencing analysis. I conducted primer design and PCR for sequencing validation of WES variants. Christie Burton helped with thesis editing, defense preparation, and mentorship.

The McLaughlin Centre at the University of Toronto funded this study. I was supported by an Institute of Medical Science Entrance Scholarship, a CIHR Masters award and a SickKids Research Training Competition Award.

Table of Contents

ABSTRACT

ACKNOWLEDGMENTS

CONTRIBUTIONS

TABLE OF CONTENTS

LIST OF ABBREVIATIONS

LIST OF FIGURES

LIST OF TABLES

1 Chapter 1: OCD, where we are today

1.1 Obsessive-Compulsive Disorder, a common complex disease

1.1.1 Prevalence and Symptomology

1.1.2 Assessment

1.1.3 Comorbidities

1.1.4 Distinguishing between early-onset and late-onset OCD

1.2 Genetics of OCD

1.2.1 Heritability

1.2.2 Linkage studies in OCD

1.3 Candidate Genes, GWAS, and the Glutamate Hypothesis

1.3.1 SLC1A1 (EAAC1/EAAT3)

1.3.2 Genome-wide association studies

1.3.3 Copy-number variations in OCD

1.3.4 Environmental factors

1.4 Genetic Heterogeneity in OCD: a case for rare variants

1.5 Next Generation Sequencing

1.5.1 Enrichment and Sequencing of DNA

1.5.2 SNP Calling Pipeline

1.5.3 Annotation

1.6 Approaches for Identifying Rare variants in complex diseases

1.7 Hypothesis

2 Methodology

2.1 Subject Ascertainment and Diagnosis

2.2 Genotyping and Linkage Analysis

2.2.1 SNP Quality Control

2.2.2 Linkage Analysis

2.3 CNV Analysis

2.4 Whole Exome Sequencing

2.4.1 Variant filtration

2.5 Validation by Sanger Sequencing

2.6 Preliminary Screening of SLC1A1 variant (rs34342853) in an additional cohort of OCD cases and controls

3 Results

3.1 Linkage Analysis

3.1.1 Parametric Linkage Analysis

3.1.2 Non-parametric linkage analysis

3.2 CNV Analysis

3.3 Whole Exome Sequencing

3.3.1 Parametric linkage

3.3.2 Non-parametric linkage

3.3.3 Candidate analysis

3.3.4 Missense Variants

3.4 SLC1A1 Variant

4 Discussion

4.1 Linkage findings

4.2 CNV

4.3 WES

4.4 Revisiting the role of glutamate in OCD

4.5 Limitations

4.5.1 Discrete vs. Quantitative Linkage Analysis

4.5.2 Whole Exome Sequencing vs. Whole Genome Sequencing

4.5.3 Filtering variants

5 Conclusions

6 Future Directions

7 References

List of Abbreviations

ADHD – Attention Deficit Hyperactivity Disorder

CADD – Combined Annotation Dependent Depletion

CNV – Copy Number Variant

DSM – Diagnostic and Statistical Manual of Mental Disorders

DZ – Dizygotic twins

EO – Early-onset OCD

FISC – Family Informant Schedule and Criteria

GAD – Generalized Anxiety Disorder

LO – Late-onset OCD

LOD – Logarithm of odds, linkage score

LY-BOCS – Lifetime Yale-Brown Obsessive Compulsive Scale

MAF – Minor Allele Frequency

MDD – Major Depressive Disorder

MRS – Magnetic Resonance Spectroscopy

MZ – Monozygotic twins

OCD – Obsessive-Compulsive Disorder

PANDAS – Pediatric autoimmune neuropsychiatric disorders associated with GABHS infections

PGC – Psychiatric Genomics Consortium

QC – Quality Control

SAD – Social Anxiety Disorder

SRI – Serotonin reuptake inhibitors

TCAG – The Centre for Applied Genomics

TD – Tourette’s disorder

TTM – Trichotillomania

WES – Whole Exome Sequencing

List of Figures

Figure 1: Idiogram highlighting regions of linkage with LOD >1 from the 5 primary genome-wide linkage studies conducted to date.

Figure 2: A graph depicting the feasibility of identifying genetic variants by risk allele frequency and strength of genetic effect (odds ratio).

Figure 3: General outline of a next-generation sequencing pipeline.

Figure 4: Three of the most common next-generation sequencing strategies for identifying causative variants in Mendelian disorders.

Figure 5: Pedigree of family studied.

Figure 6: General study methodology

Figure 7: Parametric linkage results for chromosomes 6, 12, and 22 (the only chromosomes with positive LOD scores).

Figure 8: Haplotype of shared region on chromosome 22.

Figure 9: Non-parametric LOD score plots for chromosomes with LOD score >1.

Figure 10: Haplotype of the region with the highest linkage score from the non-parametric analysis on chromosome 8.

Figure 11: Sanger sequencing validation of 4 of the top candidate variants in the family.

Figure 12: TAQMAN Assay output for 78 cases run on a plate.

List of Tables

Table 1: Clinical information for individuals in Family 147. Row labels are described in Table 2.

Table 2: Legend for Table 1.

Table 3: Candidate gene list from studies used in Taylor et al. 2013 meta-analysis of OCD genetic association studies.

Table 4: The 14 missense variants were compared against other currently available algorithms for predicting variant deleteriousness.

Table 5: Primers used to validate select candidate variants with their respective annealing temperatures.

Table 6: Parametric linkage analysis using a rare dominant model identified 3 regions with suggestive linkage to OCD in the family.

Table 7: Non-parametric linkage scores for Family 147.

Table 8: Non-parametric linkage scores for Family 147 with the grandfather coded as affected.

Table 9: CNVs identified in the family overlapping genes.

Table 10: Whole Exome Sequencing Quality metrics as recommended by Abecasis et al.

Table 11: Rare shared variants within Parametric linkage regions.

Table 12: Rare shared variants within non-parametric linkage regions. Variants identified in the parametric analysis are shaded in grey.

Table 13: Rare variants shared by the two individuals with exome sequence data within genes previously associated with OCD in human association studies.

Table 14: Comparison of prediction algorithms results for the 14 missense variants identified.

1 Chapter 1: OCD, where we are today

1.1 Obsessive-Compulsive Disorder, a common complex disease

1.1.1 Prevalence and Symptomology

Until the late 1970s and early 1980s, OCD was thought of primarily as a rare disorder. It was not until the 1980s that epidemiological studies demonstrated that OCD was relatively common1. With a lifetime prevalence ranging from 1% to 3%, we now know that Obsessive-Compulsive Disorder (OCD) detrimentally affects the lives of many individuals every day2. The World Health Organization has placed OCD among the ten most disabling conditions worldwide. Despite many lines of evidence supporting a genetic etiology for the disorder, identifying the precise biological mechanisms involved has been challenging. Part of the challenge is attributed to the phenotypic and genotypic heterogeneity of the disorder. OCD is considered a complex disorder, with patients exhibiting a range of unwanted repeated thoughts (obsessions) and repetitive behaviours (compulsions). The specific content of obsessions and compulsions varies among individuals, although certain symptom dimensions are common in OCD, including those of cleaning (contamination obsessions and cleaning compulsions); symmetry (symmetry obsessions and repeating, ordering, and counting compulsions); forbidden or taboo thoughts (e.g., aggressive, sexual, and religious obsessions and related compulsions); and harm (e.g., fears of harm to oneself or others and related checking compulsions) (DSM-5). Themes of obsessions and compulsions occur across different cultures and are relatively consistent over time in adults with the disorder3.

1.1.2 Assessment

Most individuals with OCD have both obsessions and compulsions. Although compulsions are not performed for pleasure, some patients experience relief from anxiety or distress when performing them4. Symptoms are time consuming (taking more than 1 hour per day) and may significantly interfere with the person’s daily functioning and quality of life. Both the severity of the disorder and patients’ level of insight into it vary4. The Yale-Brown Obsessive-Compulsive Scale (Y-BOCS) is the most commonly used assessment to rate the severity of obsessive-compulsive symptoms5.

Various versions of the Y-BOCS have been used, including the Child Y-BOCS (CY-BOCS), a version adapted for children, and the Lifetime Y-BOCS (LY-BOCS), in which OC symptom severity is assessed for the period when the symptoms were at their worst. The scale consists of 10 items, 5 measuring obsessions and 5 measuring compulsions, with each item rated from 0 (no symptoms) to 4 (extreme symptoms). This yields a maximum total score of 40. Scores of 0 to 7 represent subclinical OCD symptoms, scores of 8 to 15 represent mild symptoms, scores of 16 to 23 represent moderate symptoms, scores of 24 to 31 represent severe symptoms, and scores of 32 to 40 represent extreme symptoms. Individuals with scores over 16 are typically considered as having clinical OCD.
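The scoring scheme described above can be illustrated with a minimal Python sketch. The function and constant names (`ybocs_total`, `ybocs_severity`, `SEVERITY_BANDS`) are hypothetical and are not part of any published Y-BOCS software; the sketch simply encodes the item range and severity bands as stated in the text.

```python
from typing import Sequence

# Severity bands as listed in the text: (lowest total, highest total, label).
SEVERITY_BANDS = [
    (0, 7, "subclinical"),
    (8, 15, "mild"),
    (16, 23, "moderate"),
    (24, 31, "severe"),
    (32, 40, "extreme"),
]


def ybocs_total(obsession_items: Sequence[int], compulsion_items: Sequence[int]) -> int:
    """Sum the 5 obsession items and 5 compulsion items, each rated 0 to 4."""
    if len(obsession_items) != 5 or len(compulsion_items) != 5:
        raise ValueError("expected 5 obsession items and 5 compulsion items")
    items = list(obsession_items) + list(compulsion_items)
    if any(not 0 <= score <= 4 for score in items):
        raise ValueError("each item must be rated from 0 to 4")
    return sum(items)


def ybocs_severity(total: int) -> str:
    """Map a total score (0 to 40) onto the severity bands given in the text."""
    for low, high, label in SEVERITY_BANDS:
        if low <= total <= high:
            return label
    raise ValueError("total score must be between 0 and 40")


# Illustrative (made-up) ratings: both subscales sum to 10, giving a total of 20.
total = ybocs_total([2, 3, 2, 1, 2], [3, 2, 2, 1, 2])
print(total, ybocs_severity(total))
```

Because every item contributes equally to the sum, the sketch also makes the equal-weighting assumption of the total score explicit.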

The benefit of this scale is that it is not biased towards the type of obsessions or compulsions present. However, each item on the scale is assumed to have an equal influence on the total score (and therefore on overall severity), whereas in practice some items may be more debilitating to the individual than others. Furthermore, the scale does not examine the relationship between the obsessions and compulsions.

1.1.3 Comorbidities

Comorbidities are common in OCD and add to the heterogeneity of the disorder. Depression is the most frequent comorbid disorder in individuals with OCD, as reported in several studies6. Denys et al. (2004) found that major depressive disorder (MDD) was 10 times more prevalent in OCD patients than in the general population7. While 60–80% of patients with OCD experience a depressive episode in their lifetime, most studies agree that at least one-third of patients with OCD have concurrent MDD at the time of evaluation6.

Suicidal thoughts occur at some point in as many as half of individuals with OCD, and the presence of comorbid MDD increases the risk4. Lifetime comorbidity between OCD and other anxiety disorders has been estimated as 22% for specific phobia, 18% for social anxiety disorder (SAD), 12% for panic disorder, and 30% for generalized anxiety disorder (GAD)6. Up to 30% of individuals with OCD also have a lifetime tic disorder, such as Tourette’s disorder (TD), and this is more common in males with early-onset OCD than in other individuals with OCD3.

1.1.4 Distinguishing between early-onset and late-onset OCD

While definitions vary across studies, early-onset OCD (EO) is often defined as the development of OCD symptoms around the age of 12 years8. Individuals with late-onset OCD (LO) develop symptoms much later, on average around 23 years of age8. Clinical observations suggest that EO and LO may differ in a variety of ways, including sex distribution and comorbidity. Adults are more likely than children to display insight into the irrationality or excessive nature of their symptoms. For this reason, insight is not required for a formal diagnosis of OCD (DSM-5). Taylor et al. conducted a systematic review and evaluation of the EO and LO groups of OCD. The review found that, compared to LO, EO was more likely to occur in males, had greater global OCD severity and a higher prevalence of most types of OC symptoms, was more likely to be comorbid with tics and other OC-spectrum disorders, and had a higher prevalence of OCD in 1st degree relatives8. The latter analysis included studies in which the 1st degree relatives were examined by investigators blind to proband status. These data suggest that genetics may play a bigger role in the etiology of EO cases, as discussed further below.

1.2 Genetics of OCD

Results from twin and family studies, neuroimaging studies, treatment studies, and molecular genetic studies have all suggested that biochemical, biological, and genetic factors are important for the development of OCD.

1.2.1 Heritability

Heritability studies estimate how much of the variation in a disorder is attributable to genetic factors. Early on, clinicians who studied patients with OCD often identified close relatives with the disorder as well. Twin and family studies have now shown that OCD is in fact highly heritable, with different types of OCD patients having different levels of heritability.

1.2.1.1 Twin studies

Twin studies of OCD have collectively shown that OCD is heritable, with heritability estimates ranging from 45% to 65% in children and from 27% to 47% in adults9. The majority of twin studies conducted to date in OCD have had small sample sizes, and those with adequate sample sizes attempted to estimate the heritability of obsessive-compulsive (OC) symptoms, not OCD. A recent meta-analysis examined 37 twin pairs from 14 studies that used quantitative OC symptom measures and demonstrated that the majority of the phenotypic variance of OC symptoms is influenced by additive genetic effects and non-shared environment but not by non-additive genetic effects and shared environment10. In the largest twin study to use a DSM diagnosis (DSM-IV) of OCD, Bolton et al. 2007 found a MZ concordance of 0.57 and a DZ concordance of 0.2211. This study consisted of a community sample of 854 six-year-old twins (253 MZ & 601 DZ) who were interviewed to assign a sub-threshold DSM-IV diagnosis (6.1%, 95% CI 4.2-8.5 [n=136])11. The estimate of familial aggregation due to combined additive genetic and shared environmental effects, which could not be distinguished due to sample size, was 47%11. This study supported the earlier studies that were conducted on OC symptoms and the hypothesis that genetic factors play a significant role in the etiology of OC behaviours and OCD.

Genome-wide Complex Trait Analysis (GCTA) measures heritability by estimating the variance explained by SNPs on a chromosome or from across the whole genome for a complex trait using restricted maximum likelihood methods12. A recent GCTA heritability analysis of GWAS data in OCD resulted in an h² of 0.37 (se=0.07). The heritability estimate was higher when childhood-onset cases (0.43, se=0.10) were analyzed separately from adult-onset cases (0.26, se=0.24)13. These results further support the heritability estimates identified through twin studies.

1.2.1.2 Family studies

Family studies have generally supported the high heritability rates observed in twin studies of OCD14. Many of the early family studies relied on family history data, which can underestimate the true rates of illness in families. These studies also had no control samples and so it was difficult to interpret these findings without a reliable estimate of the population prevalence of OCD. More recent family studies of OCD directly interviewed all available first-degree relatives with structured psychiatric interviews. These studies have further supported OCD as being a familial disorder. Once again what is striking is the difference between studies of children/adolescents with OCD and adults with OCD. The rate of OCD among first-degree relatives of adults with OCD is approximately two times that among first-degree relatives of those without the disorder; however, among first-degree relatives of individuals with onset of OCD in childhood or adolescence, the rate is increased 10-fold over adult OCD rates15.

1.2.1.3 Segregation Studies

Complex segregation analysis is a statistical approach for determining whether transmission of a disorder in families is consistent with a Mendelian model of inheritance. Five complex segregation studies have been conducted in OCD, and all have found models of genetic transmission that fit the observed patterns of transmission in families. Results from these five studies suggest that one or more genes of major effect are involved in the etiology of OCD and are transmitted in a Mendelian fashion. However, the specific genetic model that best fit the transmission within families differed across studies. This could be due to the underlying genetic heterogeneity of OCD or to limitations of the methodology. As an illustration of the latter, a study of segregation in families of students attending medical school found a recessive model to provide a satisfactory fit as a general single-locus model16. In this case, it is far more likely that the major source of family resemblance for the trait derives from family culture and shared environment than from shared genes. Thus, it is important to assess results from complex segregation studies within the context of prior heritability data from twin and family studies. In addition, it would be optimal to test multiple models of inheritance when performing linkage analysis of extended pedigrees with OCD.

1.2.2 Linkage studies in OCD

Results from heritability studies and family studies in OCD encouraged linkage studies in families in order to identify loci linked to OCD. Since a core component of the study conducted here involves linkage analysis of extended pedigrees, previous linkage studies in OCD will be explored in depth.

1.2.2.1 Background to linkage analysis

Linkage analysis is used to detect co-segregation of chromosomal regions with a disease in family data. Traditionally, when the maximum linkage score (LOD, logarithm (base 10) of odds) is greater than 3 (odds of 1000:1 in favour of linkage), the null hypothesis of independent assortment is rejected and linkage between the trait and the marker locus is assumed. A LOD score greater than 2 is often considered suggestive of linkage. Conversely, for those recombination fractions where the LOD score is less than -2, the null hypothesis of independent assortment is not rejected and linkage is assumed to be excluded. Lander and Kruglyak have recommended changing the LOD score thresholds for suggestive and significant linkage to 1.9 and 3.3, respectively17. There are two major approaches to linkage analysis. The parametric approach allows one to specify the mode of segregation, i.e., rare or common, and dominant or recessive. In the non-parametric approach the mode of segregation is not specified; instead, the amount of allele sharing among affected individuals is measured. An important question that this study aims to address is how these two approaches should be integrated in next generation sequencing studies of common complex diseases such as OCD.
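To make the LOD thresholds concrete, the following Python sketch computes a two-point LOD score for the simplest textbook case: fully informative, phase-known meioses, where the likelihood at recombination fraction θ is θ^k (1-θ)^(n-k) for k recombinants among n meioses, compared against θ = 0.5 (independent assortment). This is an illustrative simplification, not the multipoint pedigree likelihood used in the studies discussed here; the function names `lod_score` and `interpret_lod` are hypothetical, and the interpretation uses the traditional thresholds (3, 2 and -2) quoted above.

```python
import math


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """Two-point LOD score for phase-known, fully informative meioses.

    LOD(theta) = log10( theta^k * (1 - theta)^(n - k) / 0.5^n )
    """
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must be between 0 and meioses")
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must be between 0 and 0.5")
    non_recombinants = meioses - recombinants
    if theta == 0.0 and recombinants > 0:
        # Any observed recombinant is impossible under complete linkage.
        return float("-inf")
    log_l_theta = 0.0
    if recombinants > 0:
        log_l_theta += recombinants * math.log10(theta)
    if non_recombinants > 0:
        log_l_theta += non_recombinants * math.log10(1.0 - theta)
    log_l_null = meioses * math.log10(0.5)
    return log_l_theta - log_l_null


def interpret_lod(lod: float) -> str:
    """Apply the traditional thresholds quoted in the text."""
    if lod > 3:
        return "significant linkage"
    if lod > 2:
        return "suggestive linkage"
    if lod < -2:
        return "linkage excluded"
    return "inconclusive"


# Ten non-recombinant meioses at theta = 0 give 10 * log10(2), about 3.01.
z = lod_score(recombinants=0, meioses=10, theta=0.0)
print(round(z, 2), interpret_lod(z))
```

Under these assumptions, at least ten informative non-recombinant meioses are needed for a single family to exceed a LOD of 3, which is one reason extended pedigrees are attractive for linkage analysis.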

1.2.2.2 Primary genome-wide linkage studies in OCD

Five primary genome-wide linkage studies of OCD have been conducted to date. All studies used multipoint linkage analysis. Significant and suggestive linkage scores from these studies are summarized in Figure 1. The first of these studied 7 pedigrees that had pediatric probands with OCD and at least two affected relatives, and found a maximum parametric homogeneity LOD score of 2.25 on 9p24 under a dominant model18. This region was then independently shown to have suggestive linkage (HLOD=2.26, alpha=0.59, dominant, narrow phenotype) in 50 small nuclear pedigrees with OCD19. The replication study also excluded probands with Tourette’s disorder, and 93% of the probands had an early age of onset. Harboring the glutamate transporter gene SLC1A1, among many other genes, the 9p24 region remains one of the most interesting and most studied regions in OCD.

More recently, Hanna et al. performed genome-wide linkage analysis on another set of 26 families ascertained through probands with early-onset OCD20. No significant evidence for linkage was obtained on any chromosome, although suggestive linkage was obtained in the chromosome 10p15 region (non-parametric LOD=2.43)20. The 9p24 region was not suggestive of linkage in this data set, nor when the data set was combined with the families from the earlier study20.

In a larger study conducted by Shugart et al., 193 families were analyzed under the narrow phenotype and 219 families under the broad phenotype. Under both the narrow and broad phenotype analyses, suggestive linkage was identified on chromosome 3q (non-parametric LOD=2.67). Additional regions of interest were identified on chromosomes 1q, 6q, 7p and 15q. While the mean age of onset for the sample used in this study was 9.5 years, the range was 5 to 54 years. In addition, approximately two-thirds of the affected individuals were women, so it is unclear how the inclusion of late-onset cases and the sex bias in the sample may have affected these results21. Samuels et al. and Wang et al. reanalyzed the same dataset but stratified the samples by hoarding and gender, respectively. They found different regions of suggestive linkage in each case. Of note, all three of these studies used only non-parametric linkage, which has been shown to have the highest power to detect loci where the underlying model is additive compared to other genetic models22.

The most recent independent genome-wide linkage study in OCD assessed 33 Caucasian families with childhood-onset OCD from the United States23. Five regions of interest were identified on chromosomes 1p36, 2p14, 5q13, 6p25, and 10p23, with the strongest score at 1p36.33-p36.32 (HLOD = 2.96, alpha=0.408, dominant, broad)23. Note that this study included some families from the earlier studies by Hanna et al.18, 20. The discrepancies in the results between these studies may be explained by genetic (locus) heterogeneity, as suggested by simulation studies demonstrating the difficulty of replicating a true linkage finding for an oligogenic phenotype24. The ascertainment strategies across studies may have also been a factor in the differing results. Simulation studies have indicated that extended pedigrees and affected relative pairs differ in their power to detect linkage, depending on the frequency of the susceptibility allele25.

One approach to reduce genetic and environmental heterogeneity in complex traits is to conduct studies using multiple affected families in genetically isolated founder populations. The strongest linkage finding in a primary analysis of OCD reported to date was on chromosome 15q14 in just three large Costa Rican families (homogeneity LOD score = 3.13, recessive, broad)26. This region overlaps the chromosome 15 region previously identified in the Shugart et al. study and is the only region other than 9p24 that has been independently replicated27. Interestingly, chromosome 15 was also reported as having high heritability in the recent GCTA analysis of OCD and TD13. A region of suggestive linkage on chromosome 16q in the Costa Rican study was also previously identified in the Hanna et al. study18.

1.2.2.3 Summary

Taken together, linkage studies in OCD have identified numerous regions of interest, but only a few have reached the threshold for suggestive genome-wide linkage and subsequently been independently replicated. Several linkage studies have been conducted in OCD, but their methodologies have differed due to phenotypic heterogeneity and advances in technologies, making cross-study comparison of results difficult. Due to the presence of individuals with subclinical phenotypes in families with OCD, both a narrow and a broad definition of OCD have been used to categorize affected individuals, resulting in different regions being identified. Another important consideration is the type of markers used for genotyping. Early linkage studies were conducted using microsatellite markers across the genome. This was followed by genome-wide SNP microarrays. Linkage studies of other disorders that used both microsatellite markers and SNP markers have shown the results to be comparable, while the high SNP density allows loci to be defined more precisely28, 29. Finally, the regions identified by linkage studies can be several megabases in length and encompass many genes, so it is often difficult to identify which of these genes harbor variants responsible for the disorder. Traditional Sanger sequencing of these large regions would be an overwhelming task, but next generation sequencing has now made this feasible and cost-effective.

Figure 1: Idiogram highlighting regions of linkage with LOD >1 from the 5 primary genome-wide linkage studies conducted to date. Gray shaded regions indicate density of genes within regions on the chromosome. Several regions of interest have been identified across the genome with minimal replication. LEGEND: Parametric LOD score (PLOD), Non-parametric LOD score (NPLOD), dominant model (dominant), recessive model (rec), analysis included individuals with subclinical OCD (broad), analysis only consisted of individuals with a strict clinical diagnosis of OCD (narrow).

1.3 Candidate Genes, GWAS, and the Glutamate Hypothesis

Promising linkage, neuroimaging, and pharmacological studies have implicated a number of positional and functional candidate genes in OCD. There has been a focus on genes important in the serotonergic and dopaminergic neurotransmitter systems due to the efficacy of serotonin reuptake inhibitors (SRIs) and dopamine antagonists in the treatment of OCD. However, the results of genetic studies of serotonergic and dopaminergic genes have thus far been inconsistent. On the other hand, increasing evidence has supported a role in OCD for glutamate, the most abundant excitatory neurotransmitter in the brain30. Morphological and functional imaging studies have identified abnormalities in the cortical-thalamic-striatal circuitry in patients with OCD. Several studies have used magnetic resonance spectroscopy (MRS) to investigate levels of glutamate and related molecules in this circuitry, and have produced some evidence of glutamate dysregulation in patients with OCD.

1.3.1 SLC1A1 (EAAC1/EAAT3)

In humans, the post-synaptic glutamate transporter gene SLC1A1 has arguably been one of the strongest candidate genes in OCD. The gene is located within the 9p24 region, which showed suggestive linkage in 7 large families with OCD18, a finding that was independently replicated in 50 pedigrees with OCD19. As a follow-up to the linkage studies, five independent family-based association studies and one case-control study in primarily Caucasian samples found that SLC1A1 may contain susceptibility alleles for OCD30. Family-based association studies are advantageous in that they control for population stratification; however, sufficiently large samples of well-characterized families are needed31. In addition, most of these analyses were based on genotyping a small number of markers in families, and it has been shown that undetected genotyping error can produce significant type I error32. Publication bias may also play a significant role in creating an apparent trend of association with a particular gene33. Meta-analyses of genetic association studies have been demonstrated to be more robust in the face of some of these limitations33. In order to clarify the nature of the association between SLC1A1 and OCD, our group conducted a meta-analysis of all previously associated SLC1A1 SNPs in 815 trios, 306 cases and 634 controls34. The results of this study revealed weak association between OCD and one of the nine tested polymorphisms (P=0.046, non-significant after correction), and modest association with a different SNP in affected males only. It has been suggested that the lack of clear association with common variants could be because the real association is with one or more rare variants.

Rare variants identified to date include a rare 11bp deletion located just 3’ of SLC1A135 and a single rare missense SNP (c.490A>G) identified through mutation screening of over 300

OCD patients36. In addition, a recent study of dicarboxylic aminoaciduria, a rare autosomal recessive renal disorder, identified rare mutations in SLC1A1, which were shown to impede glutamate transport in the kidney mediated by the EAAC1 expressed in the kidney.

One of the probands in this study had longstanding behaviors strongly suggestive of obsessive-compulsive traits, although he declined formal evaluation for OCD. This individual harbored a homozygous coding mutation in exon 12 of SLC1A1 (c.1333C>T), in close proximity to common variants identified in earlier studies of OCD37, 38. Our group additionally screened 184 males with OCD for rare variation in SLC1A1 exons; however, no new coding variation was found39. Transfection of the previously reported missense variant (c.490A>G) into human embryonic kidney cells revealed a statistically significant decrease in glutamate transport. Copy number variants (CNVs) encompassing exonic regions of SLC1A1 have been found to be enriched in schizophrenia cases versus controls (Cochran-Mantel-Haenszel P=0.0098, OR=6.19, 95% CI 1.36-28.24). However, it was not reported whether the subjects with these CNVs exhibited obsessive-compulsive symptoms either prior to or after treatment with antipsychotics40. Next-generation sequencing may help identify additional rare variants of SLC1A1 in patients with OCD, and doing so was one of the objectives of this study.

1.3.2 Genome-wide association studies

To date, there have been two published GWAS of OCD in humans. There has also been a GWAS conducted in a canine model of compulsive behaviour, which was subsequently reanalyzed using a more informative algorithm.

1.3.2.1 Human studies

The first GWAS in OCD included 1465 cases, 5557 ancestry-matched controls and 400 complete trios. This study did not find any SNPs with genome-wide significant association with OCD in the combined trio-case-control sample41. Of note, in the case-control analysis, the two lowest p-values belonged to two SNPs in DLGAP1 (18p11.31, P=2.49 x 10^-6 and P=3.44 x 10^-6), a member of the same gene family as DLGAP3. DLGAP3 is part of a family of proteins that form scaffolding complexes to regulate the trafficking and targeting of neurotransmitters to the post-synaptic membrane during excitatory synaptic transmission. DLGAP3 is also the only member of this family that is highly expressed in the striatum42. Furthermore, targeted homozygous deletion of the mouse homolog of DLGAP3 (Sapap3) leads to a behavioural phenotype similar to OCD: compulsive over-grooming behaviour, increased anxiety, and response to selective serotonin reuptake inhibitors42. It is possible that other DLGAP family proteins in humans play a role similar to that of Sapap3 in mice.

The second independent GWAS included 1065 families (containing 1406 patients with OCD), combined with population-based samples, resulting in a total sample of 5061 individuals43. This study used an integrative analysis pipeline involving association testing at the SNP and gene levels (using a hybrid approach that allowed for combined analysis of family- and population-based data). No genome-wide significant results were identified; however, the SNP with the smallest p-value was located on chromosome 9 near PTPRD (rs4401971, P=4.13 x 10^-7). PTPRD promotes the differentiation of glutamatergic synapses and interacts with SLITRK344, 45. Furthermore, as in the previous GWAS, a SNP near DLGAP1 showed some evidence for association (rs3866988, P=2.67 x 10^-4). The Psychiatric Genomics Consortium (PGC), to which OCD and our group were recently added, consists of over 500 collaborating investigators from over 80 institutions in 25 countries. Findings emerging from large consortia such as the PGC are demonstrating that larger sample sizes are needed to yield genome-wide significant findings in psychiatric disorders. Efforts currently underway to increase the sample size and statistical power for GWAS of OCD will be crucial in determining whether common variation within DLGAP1 or other genes involved in glutamate signaling is implicated in the disorder. A meta-analysis of the two GWAS has yet to be conducted, but the increase in sample size may allow for genome-wide significant results. In addition, microarrays such as the Illumina HumanCore Exome and the customized version of this array (“psych chip”), which will be used by the PGC, will allow for GWAS of uncommon and rare SNPs.

Rare damaging variants in DLGAP3 could have effects similar to those of the mouse knockout if present in conserved coding regions of the gene. Zuchner et al. (2009) resequenced the entire coding sequence and flanking intronic sequence of DLGAP3 in 165 patients with OCD and/or trichotillomania (TTM) and 178 unmatched controls, identifying seven novel nonsynonymous heterozygous variants46. A pooled analysis of these rare variants revealed that a significantly greater proportion of TTM/OCD patients had at least one rare variant (4.2% of cases vs. 1.1% of controls). These variants have yet to be screened in a larger OCD cohort. A recent study also resequenced the exonic regions of DLGAP3 in 215 patients with schizophrenia, a condition that has high comorbidity with OCD and for which glutamate mechanisms are also strongly implicated47, 48. In this study, no significant difference was found in the proportion of carriers of missense mutations between patient and control groups (6/215 of patients vs. 4/215 of controls, P=0.53).

1.3.2.2 Canine Studies

In a genome-wide association study of canines with compulsive behaviour, the only genome-wide significant SNPs were found in CDH249. Several neuropsychiatric disorders, such as epilepsy/mental retardation, autism, bipolar disease and schizophrenia, have been associated with cadherins, mostly by genome-wide association studies50. For example, CDH5, CDH8, CDH9, CDH10, CDH13, CDH15, PCDH10, PCDH19 and PCDHB4 have been associated with autism, a disorder with a repetitive behaviour phenotype similar to OCD50. Recently, Moya et al. resequenced the coding regions of CDH2 in 160 OCD cases and 160 controls and identified 4 nonsynonymous variants (the first exon was not sequenced due to amplification issues)51. These four variants were chosen for follow-up genotyping in an ‘extended’ sample of OCD (N=260), TD probands and relatives (N=454), and healthy controls (N=447). As in the previous studies, none of these variants was statistically associated with OCD, mainly because of limited sample size. However, one of the novel SNPs was found in 3/714 of the OCD/TD patients plus TD relatives but in none of the 447 controls. Of note, the inclusion of relatives in the case population may have contributed to inflated type I error.

In a recent reanalysis of the initial canine GWAS and subsequent targeted sequencing of the top candidate regions, 119 variants were identified in evolutionarily conserved sites specific to dogs with OCD52. These variants were then genotyped in an independent set of dogs from breeds with high rates of OCD (‘OCD-risk breeds’; 69 dogs) and breeds with normal rates of OCD and other psychiatric disorders (‘control breeds’; 19 dogs). Individual OCD phenotype information was not available for this independent set. Nevertheless, case-only variants identified in the sequence data were significantly more common in OCD-risk breeds (median frequency (FOCD)=0.17) than in control breeds (Fcontrol=0.05) (Wilcoxon test P=0.045). Four genes, all involved in the synapse, had the most case-only variation after resequencing: neuronal cadherin (CDH2), catenin alpha2 (CTNNA2), ataxin-1 (ATXN1), and plasma glutamate carboxypeptidase (PGCP) (FOCD=0.08 versus Fcontrol=0.026, Wilcoxon test P=2.95 x 10^-4). Case-only variants in CDH2 (FOCD=0.23 versus Fcontrol=0.027, P=0.001) and PGCP (FOCD=0.014 versus Fcontrol=0.0, P=0.047) were also individually statistically significant.

1.3.3 Copy-number variations in OCD

Copy number variations (CNVs) are submicroscopic structural variations (≥1kb) that are present at a variable copy number in comparison to the diploid human genome53. Recent investigations have implicated de novo and/or rare CNVs as potentially pathogenic factors in attention-deficit/hyperactivity disorder (ADHD), autism, and schizophrenia. Apart from a small number of case reports, which have identified rare CNVs in patients with other disorders comorbid with OCD, very few studies have screened for CNVs in a population of OCD cases. Delorme et al. screened for CNVs in the 15q11-13 and 22q11.2 regions because of the high frequency of OC symptoms in patients with Prader-Willi syndrome and 22q11.2 deletion syndrome54. No deletions or duplications involving either of these regions were identified in the 236 OCD patients studied54. Another study found that a one-copy deletion of the serotonin receptor 2A (HTR2A) promoter was associated with very early onset OCD (2.5 years earlier than typical onset; p=0.031) and with increased CY-BOCS scores, which reflect greater severity of OCD symptoms (8.7 points higher than with the “normal” copy number or duplications; p=0.004)55. However, only 136 cases were assessed in this study, and replication studies are necessary to confirm the effect of this CNV55. At this time, no systematic study of the contribution of CNVs to OCD has been published.

1.3.4 Environmental factors

It is important to recognize that genetics is most likely not the sole contributor to OCD. In fact, in their heritability study, Bolton et al. reported that 47% of the phenotypic variance could be explained by familial aggregation due to combined additive genetic and shared environmental factors11. The rest of the phenotypic variance in OCD has been attributed to the non-shared environment. Physical and sexual abuse in childhood and other stressful or traumatic events have also been associated with an increased risk for OCD56. Thus, the environment is predicted to play a critical role in the manifestation of the disorder in many individuals, although further investigation is required to identify which environmental factors are important in OCD.

Interestingly, a small subset of children develop obsessive-compulsive symptoms after streptococcal infections. Swedo et al. first characterized 50 children with acute and dramatic onset of OCD and/or tic disorder triggered by group A β-hemolytic streptococcal (GABHS) infections57. The clinical course in these children was characterized by a relapsing-remitting symptom pattern with significant psychiatric comorbidity. Although the causal link between GABHS and OCD remains somewhat controversial, patients fitting these criteria were given the name PANDAS, for pediatric autoimmune neuropsychiatric disorders associated with GABHS infections, and have since been characterized as a distinct patient group58, 59. Bhattacharyya et al. used western blot to investigate the binding of autoantibodies directed against the basal ganglia or thalamus in the serum and CSF of OCD patients and found significantly more binding than in controls60.

They also investigated the levels of amino acids (glutamate, GABA, taurine, and glycine) in CSF in the same group of patients and found significantly higher levels of glutamate and glycine in OCD patients. Multivariate analysis of variance showed that CSF glycine levels were higher in those OCD patients who had autoantibodies compared with those without, suggesting a link between autoimmune mechanisms and excitatory neurotransmission in the pathogenesis of OCD60. All patients in the study were psychotropic drug naïve.

1.4 Genetic Heterogeneity in OCD: a case for rare variants

Over 100 candidate gene association studies have been published to date in OCD; however, none has achieved genome-wide significance (P≤5 x 10^-8)14. Candidate gene studies conducted by our group and others have similarly produced suggestive findings but have not identified any clearly disease-causing variants. Furthermore, neither of the 2 published GWAS has produced clear, genome-wide significant findings. The lack of, and inconsistency in, common variant findings suggests either that sample sizes are not sufficiently large to identify these variants or that other factors are at play. In fact, common variant studies in other common complex disorders have revealed that associated variants discovered in large samples have small odds ratios and account for just a small fraction of the heritability of the disorder.
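As a purely illustrative sketch (not part of the original analyses), the genome-wide threshold quoted above can be compared against the smallest p-values reported by the two OCD GWAS summarized in section 1.3.2.1. The function name and the dictionary labels are hypothetical; only the threshold and the p-values are taken from the text.

```python
# Conventional genome-wide significance threshold, as quoted in the text.
GENOME_WIDE_ALPHA = 5e-8


def is_genome_wide_significant(p_value: float, alpha: float = GENOME_WIDE_ALPHA) -> bool:
    """Return True when a p-value meets the genome-wide threshold."""
    return p_value <= alpha


# Smallest p-values reported in the two human OCD GWAS (labels are illustrative).
reported_p_values = {
    "DLGAP1 SNP (first GWAS, lowest)": 2.49e-6,
    "DLGAP1 SNP (first GWAS, second lowest)": 3.44e-6,
    "rs4401971 near PTPRD (second GWAS)": 4.13e-7,
}

significance = {
    label: is_genome_wide_significant(p) for label, p in reported_p_values.items()
}

for label, passed in significance.items():
    print(f"{label}: {'significant' if passed else 'not significant'}")
```

Under this threshold none of the listed signals qualifies, which is consistent with the statement that neither GWAS produced a genome-wide significant finding.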

The clinical presentation of OCD is remarkably diverse, and can vary within and across patients over time. This variability in phenotypic expression has led to the hypothesis that OCD is an etiologically heterogeneous disorder and that this heterogeneity can complicate the search for risk genes. While the majority of genetic studies in OCD have focused on common variants, few studies have investigated the effect of rare variants.

Common genetic variants are generally defined as present at ≥5% frequency in the general population. Rare variants are typically described as being present at less than 1% in the population, while low frequency variants are between 1% and 5% frequency. For the purposes of this study, I will refer to any variant at less than 5% frequency as rare.
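The frequency classes just described can be written down as a small sketch. The function names are hypothetical, and the handling of the exact boundaries (1% counted as low frequency, 5% counted as common) is an assumption made for illustration.

```python
def classify_by_frequency(allele_frequency: float) -> str:
    """Assign the conventional frequency class described in the text."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    if allele_frequency >= 0.05:
        return "common"
    if allele_frequency >= 0.01:
        return "low frequency"
    return "rare"


def is_rare_in_this_study(allele_frequency: float) -> bool:
    """Study-specific definition: anything below 5% is treated as rare."""
    return allele_frequency < 0.05


for freq in (0.20, 0.03, 0.002):
    print(freq, classify_by_frequency(freq), is_rare_in_this_study(freq))
```

A variant at 3% frequency is therefore "low frequency" under the conventional scheme but counts as rare for the purposes of this study.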

Some rare variants are expected to be highly penetrant and more influential on individual phenotypes than common variants because their frequencies have been kept low by natural selection. The presence of both common and rare variants underlies one aspect of genetic heterogeneity in human disease.

Converging evidence for a wide range of common diseases indicates that genetic heterogeneity is important at multiple levels of causation61. Individually rare mutations collectively play a substantial role in causing complex illnesses. For example, inherited predisposition to breast cancer is associated with germline mutations in at least 10 genes62.

The same gene may harbor many different rare severe mutations in unrelated affected individuals. Thousands of different loss-of-function mutations have been detected in BRCA1 and BRCA2; all of these mutations are individually rare, and each independently confers high risk for breast and ovarian cancer. The same mutation may lead to different phenotypes in different individuals. For example, in the Scottish kindred harboring a chromosomal translocation disrupting DISC1, 29 translocation carriers presented variously with schizophrenia, bipolar disorder, major depressive disorder, or no mental illness63.

Finally, mutations in different genes in the same or related pathways may lead to the same disorder. Dozens of genes harbor inherited mutations leading to nonsyndromic hearing loss, and each gene harbors multiple pathogenic mutations64.

Gene-gene interactions add another layer of heterogeneity, yet most of the GWAS that have been conducted to date use a single-locus analysis strategy. If a genetic factor functions primarily through a complex mechanism that involves multiple other genes and environmental factors, the effect might be missed if genes are studied individually. Several methods have been developed that allow for assessment of statistical interactions between loci when analyzing data from GWAS. As rare variants are beginning to be included in these GWAS, it will be important to determine whether any common variants interact with them and influence their impact on the disorder.

The “rare variant hypothesis” proposes that a considerable part of the inherited susceptibility to relatively common human chronic diseases may be due to the combination of the effects of a series of low frequency acting variants of different genes (Figure 2). Each of these variants is expected to confer a moderate but detectable increase in risk. There was initially some debate over the influence of rare variants in common diseases and whether or not diseases that are influenced by rare variations are likely to exhibit familial clustering. Bodmer and Bonilla argued that such diseases are not likely to be familial, suggesting that family-based studies of the type pursued here are not likely to be useful for discovering causative genetic variations65. This argument was based on calculations with the assumption of the existence of only a few predisposing variations with low penetrance.

In contrast, Schork et al. argued that if someone has a disease that is influenced by multiple rare inherited variations that work additively, then that individual’s parents possessed the right collection of variants to increase the risk of having an offspring who could ultimately manifest the disease66. This would mean the parents would be enriched for variants predisposing to the disease and would have a higher risk of producing other offspring with the disease. This is often called the multifactorial threshold model of disease.

It is now thought that a proportion of heritable common diseases, such as schizophrenia and autism, previously attributed to complex multifactorial inheritance, represent a heterogeneous collection of rare monogenic disorders67.

Figure 2: Adapted from Manolio et al. 200968, a graph depicting the feasibility of identifying genetic variants by risk allele frequency and strength of genetic effect (odds ratio). The focus of this study was to identify a rare allele causing OCD in the family where the disease appeared Mendelian (red circle).

1.5 Next Generation Sequencing

One can determine the presence of rare variants in an individual’s genome by resequencing genomic DNA. While it has been possible to resequence genes or small regions of the genome using PCR and capillary sequencing, this method is not ideal for resequencing whole genomes in many samples. The introduction and widespread use of massively parallel sequencing has now made it possible for individual laboratories to sequence the whole genome. The cost and capacity required to do this are still significant.

Considering that the function of much of the genome is still largely unknown, initial studies focused on the best understood 1% of the genome, the protein-coding exons. Despite the limitation that exome sequencing does not currently assess the impact of non-coding alleles, there are several reasons why it is a good tool for discovering rare alleles underlying Mendelian phenotypes69. Positional cloning studies that have focused on protein-coding sequences have proved highly successful at identifying causative mutations in monogenic diseases, and most alleles known to underlie Mendelian disorders have been found to disrupt protein-coding sequences69. The general approach for exome sequencing is outlined below and consists of enrichment of the DNA to be sequenced, generation of sequence reads, alignment of reads to a reference genome, identification of variants, and annotation and functional characterization of variants (Figure 3).


Figure 3: General outline of a next-generation sequencing pipeline. Each step of analysis is described in the respective sections below. (Adapted from Foo et al.)70

1.5.1 Enrichment and Sequencing of DNA

Two capture methods have been extended to target the human exome: solid-phase and liquid-phase hybridization. Solid-phase hybridization methods use probes complementary to the sequences of interest affixed to a solid support, such as microarrays or filters71.

Liquid-phase hybridization is similar to the solid phase but instead of the probes being attached to a solid matrix they are biotinylated and are bound to magnetic streptavidin beads following hybridization. The Agilent SureSelect All Exon kit uses a liquid-phase hybridization method. Bound DNA is separated from undesired DNA by washing and then sequenced. In a comparison of three enrichment techniques the Agilent SureSelect exhibited superior on-target efficiency and correlation of read depths across samples72.

Once exonic regions are captured, the segments are amplified and sequenced on next-generation sequencing platforms such as Illumina HiSeq, ABI SOLiD, or Roche 454 GS FLX. These platforms generate hundreds of millions of short sequence reads with an average length of 50 bp to 125 bp. In order to ensure sufficient allele sampling, as well as to prevent sequencing errors from appearing to be actual variants, studies recommend a minimum sequence depth threshold, ranging from 8- to 30-fold depth of coverage. The type of sequencing platform chosen also informs the SNP calling pipeline that should be used to call variants in the sample.
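The minimum-depth rule described above amounts to a simple filter over called sites. The following is a minimal sketch, assuming each call carries a per-site read depth; the `VariantCall` fields, the example calls and the default threshold of 8 (the low end of the quoted range) are illustrative assumptions, not values taken from this study's pipeline.

```python
from typing import List, NamedTuple


class VariantCall(NamedTuple):
    """Hypothetical minimal record for one called variant."""
    chrom: str
    pos: int
    ref: str
    alt: str
    depth: int  # number of reads covering the site


def filter_by_depth(calls: List[VariantCall], min_depth: int = 8) -> List[VariantCall]:
    """Keep only calls supported by at least `min_depth` reads."""
    return [call for call in calls if call.depth >= min_depth]


# Made-up example calls.
example_calls = [
    VariantCall("chr9", 4490000, "A", "G", depth=42),
    VariantCall("chr12", 112000000, "C", "T", depth=5),
    VariantCall("chr3", 196800000, "G", "A", depth=8),
]

well_covered = filter_by_depth(example_calls, min_depth=8)
print([f"{c.chrom}:{c.pos}" for c in well_covered])
```

Raising `min_depth` toward 30 trades sensitivity for fewer false positives caused by sequencing error.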

1.5.2 SNP Calling Pipeline

The SNP calling pipeline comprises 6 general steps: base calling, quality control (QC), alignment/mapping, post-processing, quality score recalibration, and finally genotype calling. The method or tool used for each of these steps can affect the final result73.

Briefly, base calling involves evaluating images taken during the sequencing process and generating short reads. These reads then undergo basic QC, which removes low quality reads. The remaining reads are then aligned to the reference human genome. Some minor corrections are needed before variants can be called, and these are performed during the post-processing stage. Finally, quality scores issued by the sequencing platforms are recalibrated in order to more closely match the true error rate. SNPs are then called and can be annotated and then filtered using an appropriate pipeline.
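To make the link between quality scores and error rates concrete, the sketch below assumes the scores are on the Phred scale, where Q = -10 log10(p) and p is the probability that a base call is wrong. The `empirical_quality` function is a strong simplification of recalibration, shown only to illustrate the idea of replacing a reported score with one derived from observed mismatches; the pseudocounts and all function names are assumptions.

```python
import math


def phred_to_error_probability(quality: float) -> float:
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-quality / 10)


def error_probability_to_phred(error_probability: float) -> float:
    """Convert an error probability (0 < p <= 1) to a Phred-scaled score."""
    if not 0.0 < error_probability <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(error_probability)


def empirical_quality(mismatches: int, observations: int) -> float:
    """Phred score implied by the observed mismatch rate.

    A pseudocount of 1 mismatch in 2 extra observations (an assumption)
    avoids taking the logarithm of zero when no mismatches are seen.
    """
    smoothed_rate = (mismatches + 1) / (observations + 2)
    return error_probability_to_phred(smoothed_rate)


# A base reported at Q30 claims an error rate of about 1 in 1000.
print(phred_to_error_probability(30))
# If bases reported at Q30 actually mismatch 99 times in 9998 observations,
# the empirical quality is closer to Q20.
print(round(empirical_quality(99, 9998), 1))
```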

1.5.3 Annotation

Between 20000 and 50000 variants are identified per sequenced exome. Annotation is crucial for identifying disease-related alleles among the background of non-pathogenic polymorphisms and sequencing errors. The starting point is usually the type of variant and whether it is in an exonic, intronic, or intergenic region of the genome (there is some intronic and intergenic coverage in most WES platforms). Minor allele frequencies from control databases such as the 1000 Genomes database or the Exome Variant Server indicate how common the variant is in the general population. These databases are limited, however, in that they have focused on exonic regions. As more control individuals are whole genome sequenced, we will have a better idea of the frequency and relative importance of variants in intergenic regions. Evolutionary conservation of the variant or surrounding region is also important, as variants in highly conserved regions of the genome are likely to have a more deleterious effect. Scores such as GERP and phyloP rate variants by their conservation across mammals or other species74. Software such as ANNOVAR can label variants with many of these annotations at the same time, allowing for easy filtration of variants75. The challenge lies in determining which, if any, of these annotations are important for determining a variant’s predicted phenotypic effect.

Several algorithms now exist that try to combine many annotations in order to predict a variant’s relative deleteriousness. One of the most promising of these is the Combined Annotation Dependent Depletion (CADD) score76. CADD objectively integrates many diverse annotations into a single measure (C score) for each variant. C scores have been shown to correlate with allelic diversity, annotations of functionality, pathogenicity, disease severity, experimentally measured regulatory effects and complex trait associations. They have also been shown to rank known pathogenic variants highly.

1.6 Approaches for Identifying Rare variants in complex diseases

In the first reported study that used exome sequencing to identify causal variants in a rare Mendelian disorder, four individuals with Freeman-Sheldon syndrome (FSS) were sequenced77. FSS is an autosomal dominant disorder caused by mutations in MYH3. By filtering for a gene carrying novel variants in every sample, the investigators were able to identify the causative variants in MYH3. While in this proof-of-principle FSS study the gene of interest was already known, the same group later employed exome sequencing in a Mendelian disorder where the causative gene was unknown78. Here, additional filtration steps were used, such as screening against public SNP databases for common SNPs and prioritizing variants with the potential to be pathogenic, such as non-synonymous mutations, splice acceptor and donor site variants, and short coding insertions and deletions. The authors found compound heterozygous mutations in DHODH among individuals with Miller syndrome, an autosomal recessive disorder.
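The core of the filtering step described for the FSS study, looking for a gene that carries a novel variant in every sequenced individual, reduces to a set intersection. The sketch below is illustrative only: the sample identifiers and every gene label other than MYH3 are placeholders, and the function name is hypothetical.

```python
from typing import Dict, Iterable, Set


def genes_shared_by_all(genes_per_sample: Dict[str, Iterable[str]]) -> Set[str]:
    """Return genes that carry at least one candidate variant in every sample."""
    gene_sets = [set(genes) for genes in genes_per_sample.values()]
    if not gene_sets:
        return set()
    return set.intersection(*gene_sets)


# Genes with novel variants in four affected individuals (placeholder data).
novel_variant_genes = {
    "patient_1": ["MYH3", "GENE_A", "GENE_B"],
    "patient_2": ["MYH3", "GENE_B", "GENE_C"],
    "patient_3": ["GENE_A", "MYH3"],
    "patient_4": ["GENE_D", "MYH3", "GENE_C"],
}

shared_genes = genes_shared_by_all(novel_variant_genes)
print(shared_genes)
```

The intersection shrinks quickly as individuals are added, which is why this approach works best when a single gene is expected to explain the disorder in all of them.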

While these initial studies were paramount in proving that next generation sequencing could be used to identify causative variants in Mendelian diseases they highlight certain limitations one may face when expanding this type of effort to more common diseases such as OCD with lower heritability rates. In both of these studies only a single gene was expected to harbor variants associated with the disorder, and so the potential additive effect of several genes was not considered. The disorders studied were rare and the variant responsible was expected to be novel in the family. Furthermore, the disorders studied were either present or absent in the individual with minimal phenotypic heterogeneity.

Thus affected individuals could easily be genetically compared to unaffected siblings or controls. In the case of more heterogeneous disorders such as OCD, many individuals have subclinical traits making it sometimes difficult to label an individual as “affected” or not.

Variants may also not be novel but at a low frequency, and present in some controls, while still conferring considerable risk to the individual for developing the disorder.

Nevertheless, there are several strategies that can be employed to identify disease genes in Mendelian disorders. These strategies can be divided into 3 general approaches: overlap-based strategies, de novo-based strategies and linkage-based strategies (Figure 4)79. The overlap-based strategy entails sequencing multiple unrelated patients with the same phenotype in order to identify the most likely causative gene. This strategy is mainly used in the absence of genetic heterogeneity and where the disease is most likely monogenic80. Looking for a gene with shared variants across individuals can quickly identify the disease gene. The de novo-based strategy, on the other hand, is useful in parent-child trios where the disorder does not appear to be inherited. For common disorders that are genetically highly heterogeneous, such as OCD, this may be a better option; however, a large number of trios may need to be sequenced in order to identify common genes or pathways involved. The linkage-based strategy used in this study is appropriate when a large family with a prominent phenotype has been collected. Rare variants are filtered to focus on regions identified as linked to the disorder in the family. Other approaches, such as homozygosity-based or double-hit strategies (looking for a gene with a single homozygous variant or compound heterozygous variants in a single sequenced individual), are useful when dealing with consanguineous families or a rare recessive disorder80.
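A minimal sketch of the linkage-based filter follows, assuming variants are keyed by chromosome, position and alleles, that control-database frequencies are available as a lookup table, and that linkage regions are given as closed intervals. All names, coordinates and frequencies are invented for illustration, and treating variants that are absent from the control database as rare is an assumption.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple

# (chromosome, position, reference allele, alternate allele)
VariantKey = Tuple[str, int, str, str]


class Region(NamedTuple):
    """Hypothetical linkage interval with inclusive coordinates."""
    chrom: str
    start: int
    end: int


def in_linkage_region(variant: VariantKey, regions: Iterable[Region]) -> bool:
    chrom, pos, _ref, _alt = variant
    return any(r.chrom == chrom and r.start <= pos <= r.end for r in regions)


def linkage_filtered_candidates(
    sample_a: Iterable[VariantKey],
    sample_b: Iterable[VariantKey],
    control_maf: Dict[VariantKey, float],
    regions: List[Region],
    max_maf: float = 0.05,
) -> List[VariantKey]:
    """Variants shared by both individuals, rare in controls, inside a linkage region."""
    shared = set(sample_a) & set(sample_b)
    candidates = []
    for variant in sorted(shared):
        maf = control_maf.get(variant)  # None: not seen in the control database
        is_rare = maf is None or maf < max_maf
        if is_rare and in_linkage_region(variant, regions):
            candidates.append(variant)
    return candidates


# Invented example data.
regions = [Region("chr1", 1000, 2000)]
individual_a = [("chr1", 1500, "A", "G"), ("chr1", 1600, "C", "T"),
                ("chr2", 1500, "G", "A"), ("chr1", 1700, "T", "C")]
individual_b = [("chr1", 1500, "A", "G"), ("chr1", 1600, "C", "T"),
                ("chr2", 1500, "G", "A")]
control_maf = {("chr1", 1600, "C", "T"): 0.20, ("chr2", 1500, "G", "A"): 0.01}

candidates = linkage_filtered_candidates(individual_a, individual_b, control_maf, regions)
print(candidates)
```

In this toy data set only one variant survives: one is dropped for being common, one for lying outside the linkage interval, and one for not being shared.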


Figure 4: Three of the most common next-generation sequencing strategies for identifying causative variants in Mendelian disorders. A linkage-based strategy was used in this study. Adapted from Gilissen et al. 201280

1.7 Hypothesis

While GWAS and candidate gene association studies are useful for the identification of common variants with relatively small effect sizes, linkage studies of multiplex families are particularly useful for the identification of rare variants with larger effect sizes that are increasingly believed to underlie a substantial portion of the risk for complex disorders.

Individual variants identified via linkage approaches may be family-specific, each accounting for a small proportion of the overall variance; however, the genes implicated by these variants will be of interest in multiple samples and populations, likely accounting for a much larger proportion of the overall OCD risk, in addition to providing insights about the biology of OCD23. While certain candidate genes have been resequenced in OCD, only one study has conducted genome-wide resequencing in an individual with an OCD phenotype.

Low coverage genomic resequencing (average coverage 3x) was performed on a male proband with Tourette’s syndrome harboring a balanced cytogenetic chromosome translocation t(6;22)(q16.2;p13) inherited from his mother, who had OCD81. None of the variants identified after filtering were in genes associated with psychiatric diseases and so were not followed up. The authors chose to instead focus on the structural rearrangements around the translocation region.

The approach taken in this thesis was to combine linkage analysis and next generation sequencing in a family where the OCD phenotype was likely genetic. Extended pedigrees have been a significant resource in the identification of genetic variants associated with complex Mendelian disorders. As outlined above, linkage studies in OCD have identified several regions of interest. We identified an extended pedigree with many affected individuals in which OCD segregated in a manner consistent with Mendelian inheritance. We also focused on a family where the disease phenotype was as homogeneous as possible and all affected individuals had early-onset OCD, because previous studies have shown a higher heritability rate in these cases. We first used a microarray designed for CNV analysis in this extended pedigree in order to identify regions potentially linked to the disorder. A two-tiered approach, parametric and non-parametric, was used for linkage analysis in order to identify regions of variable penetrance. We then used the same microarray to identify CNVs within these regions potentially associated with the disorder.

This was followed by whole exome sequencing of the two most distantly related affected individuals in order to identify smaller genetic variants that may be causative for the disorder. We hypothesized that a single gene may be harboring a rare variant linked to the disorder in the family.

2 Methodology

2.1 Subject Ascertainment and Diagnosis

Family 147 was recruited into the study at the University of Michigan Anxiety Disorders Clinic (Figure 5). The family consisted of 6 affected individuals (4 females and 2 males) with early onset OCD, 4 unaffected individuals, and 2 individuals with an unknown affection status. The grandmother (1462) had died by suicide and was not available for direct assessment or genotyping. Some information on her was obtained through her spouse (1461) using the Family Informant Schedule and Criteria (FISC) (Mannuzza et al., 1985), and she was not identified as having OCD symptoms. The remaining individuals in the family (n=11) were directly interviewed to determine whether they met DSM-IV criteria for OCD. None of the affected individuals had 1) a chronic neurological disorder (other than tic disorder); 2) mental retardation; 3) a DSM-IV diagnosis of autistic disorder, schizophrenia, or bipolar disorder; or 4) a preexisting genetic condition. None were adopted. Subjects signed institutional review board-approved consent forms prior to enrollment in the study, and blood or saliva was obtained for genotyping.

Key clinical data on the family are summarized in Table 1. The age at assessment of the 11 individuals in the family ranged from 10 to 82 years (31.2 ± 23.5, mean±SD) (Age_Mos). The 6 affected individuals had Lifetime Yale-Brown Obsessive Compulsive Scale (LY-BOCS) scores ranging from 19 to 34, representing moderate to severe OCD (LYBOCS10). The age of onset in the affected individuals ranged from 3 to 8 years (5.7±2.2, mean±SD) (OCSonset). The paternal grandfather (1461) was coded as having an unknown affection status due to a subclinical OCD phenotype. His LY-BOCS score was 9, which corresponds to mild OCD. Two of the affected individuals (1324, 1460) had a history of TD and an additional individual had a history of tics (1325) (TouSyn4, AllTic4). None of the unaffected individuals (1227, 1326, 126, and 1327) had any history of OCD, tics or TD at the time of assessment (AllTic4). No individual had a history of obsessive-compulsive symptoms or tics related to streptococcal infection57.

Figure 5: Pedigree of family studied. (L) Individuals included for linkage analysis. (E) Individuals resequenced by whole exome sequencing.

Table 1: Clinical information for individuals in Family 147. Row labels are described in Table 2.

ID 1222 1226 1227 1324 1325 1326 1327 126 1460 1461 379 1462

Proband no no no no no no no no yes no no no

Relationship to Proband: paternal aunt; paternal cousin; uncle by marriage; brother; father; mother; brother; paternal cousin; proband; paternal grandfather; paternal cousin; paternal grandmother

Father ID 1461 1227 1325 1461 1325 1227 1325 1227

Mother ID 1462 1222 1326 1833 1326 1222 1326 1222

Gender female female male male male female male female female male female female

Birth Year 1967 1996 1965 1999 1962 1965 1996 1999 1994 1931 2002

Age (Mos.) 497 133 504 127 610 575 181 173 200 992 131

Race Caucasian Caucasian Caucasian Caucasian Caucasian Caucasian Caucasian Caucasian Caucasian Caucasian Caucasian Caucasian

Obsess 3 3 1 3 3 1 1 1 3 2 3

Ob_Onset 96 106 48 36 72 72 48

Compul 3 3 1 3 3 1 1 1 3 3 3

Co_Onset 96 106 60 36 72 108 48

Obs&Com 1 1 1 1 1 0 1

OCSonset 96 106 48 36 72 72 48

OCD_4 3 3 1 3 3 1 1 1 3 2 3

OCD_FISC 0

OCDonset 336 106 102 420 120 108

LYBOCS10 29 29 30 20 19 9 34

Most 480 300 180 30 480 30 540

TouSyn4 0 0 0 1 0 0 0 0 1 0 0

AllTic4 0 0 0 1 1 0 0 0 1 0 0

TicOnset 30 600 48

ADHD_4 0 0 0 0 1

ADDonset 60

AllAlc4 0 0 0 0 1 0 0 0 0 0 0

AlcOnset 252 240

AlcFISC 1

AllMDD4 1 0 0 0 0 0 0 0 0 1 0

MDDonset 240 72

Episode# 5

MDD_FISC 0

AllPhob4 1 0 0 0 1 1 0 1 0 1 1

PhoOnset 408 72 216 108 216 96

PhobFISC 1

GAD_OAD 0 0 0 1 0 0 0 1 0 1 0

GADonset 48 108 600 seasonal allergies, seasonal Allergies 0 0 0 0 0 dander 0 0 0 allergies

Autoimm 0 0 0 0 0 0 1 0 0 0 1 exercise- hypothyroi hypercholes induced spinal Disease dism terolemia GERD asthma stenosis buspirone; Synthroid; guanfacine; Klonapin; CurrMeds citalopram none simvastatin none none none Claritin citalopram citalopram Xanax citalopram

Table 2: Legend for Table 1.

Row Label – Description
ID – Individual ID in Pedigree
Proband – Proband in Pedigree
Relationship to Proband – How the individual is related to the proband
Father ID – Father ID in Pedigree
Mother ID – Mother ID in Pedigree
Gender – Code for male and female subjects: 1 = male; 2 = female
Birth Year – Birth year of subject
Age (Mos.) – Age of subject in months at time of assessment
Race – Racial Background
Obsess – Code for lifetime history of obsessions defined by DSM-IV based upon direct interview: 1 = no history of obsessions; 2 = thoughts that meet some but not all criteria for obsessions (i.e., possible obsessions); 3 = definite history of obsessions
Ob_Onset – Age at onset of definite obsessions in months
Compul – Code for lifetime history of compulsions defined by DSM-IV based upon direct interview: 1 = no history of compulsions; 2 = thoughts that meet some but not all criteria for compulsions (i.e., possible compulsions); 3 = definite history of compulsions
Co_Onset – Age at onset of definite compulsions in months
Obs&Com – Lifetime history of both definite obsessions and definite compulsions: 0 = absent (no); 1 = present (yes)
OCSonset – Onset of definite or subthreshold obsessive-compulsive symptoms in months
OCD_4 – Lifetime history of OCD defined by DSM-IV-TR criteria and based upon direct interview and all other sources of information: 0 = indeterminate; unclassifiable; unknown; possible obsessions or compulsions; thoughts/rituals suggestive of obsessive-compulsive symptoms (OCS); 1 = no history of OCD; 2 = subthreshold OCD; 3 = definite OCD
OCD_FISC – Lifetime history of OCD based only upon interview with the Family Informant Schedule and Criteria (FISC): 0 = possible obsessions or compulsions; recurrent thoughts or rituals that seem unwanted or involuntary to the informant, but for which the informant has no direct knowledge of the individual’s attitude; 1 = no history of OCD; 2 = probable OCD (involuntary OCS without evidence of either functional impairment, seeking/being referred for help, or taking medication); 3 = definite OCD (involuntary OCS that have resulted in either functional impairment, seeking/being referred for help, or taking medication)
OCDonset – Onset of definite or subthreshold OCD in months
LYBOCS10 – Lifetime Yale-Brown Obsessive Compulsive Scale score
Most – Most time in minutes spent on definite or possible OCS
TouSyn4 – Lifetime history of Tourette's disorder (Gilles de la Tourette’s syndrome) defined by DSM-IV: 0 = absent (no); 1 = present (yes) (definite Tourette’s disorder)
AllTic4 – Lifetime history of any DSM-IV tic disorder: 0 = absent (no); 1 = present (yes)
TicOnset – Age at onset in months of DSM-IV tic disorder
ADHD_4 – Lifetime history of Attention Deficit Hyperactivity Disorder (ADHD) defined by DSM-IV: 0 = absent (no); 1 = present (yes)
ADDonset – Age at onset in months of ADHD or undifferentiated ADD
AllAlc4 – Lifetime history of Alcohol Abuse defined by DSM-IV: 0 = absent (no); 1 = present (yes)
AlcOnset – Age at onset in months of DSM-IV Alcohol Dependence or Abuse
AlcFISC – Lifetime history of Alcoholism based upon FISC interview: 0 = absent (no); 1 = present (yes) (meets definite criteria)
AllMDD4 – Lifetime history of MDD defined by DSM-IV: 0 = absent (no); 1 = present (yes)
MDDonset – Age at onset in months of MDD defined by DSM-IV
Episode# – Lifetime number of MDD episodes defined by DSM-IV
MDD_FISC – Lifetime history of Major Depression based upon interview with FISC: 0 = absent (no); 1 = present (yes) (meets definite criteria)
AllPhob4 – Lifetime history of simple phobia, social phobia, or agoraphobia defined by DSM-IV: 0 = absent (no); 1 = present (yes)
PhoOnset – Age at onset in months of first phobic disorder defined by DSM-IV
PhobFISC – Lifetime history of Phobic Disorder based upon interview with FISC: 0 = absent (no); 1 = present (yes) (meets definite criteria)
GAD_OAD – Lifetime history of either overanxious disorder or generalized anxiety disorder defined by DSM-IV: 0 = absent (no); 1 = present (yes)
GADonset – Age at onset in months of definite or possible Generalized Anxiety Disorder
Allergies – Any allergies
Autoimm – Lifetime history of autoimmunological disorders consisting of asthma, Graves disease, Crohn's disease, symptoms of systemic lupus erythematosis, atopic dermatitis, myasthenia gravis, psoriasis, Steven-Johnson syndrome, mixed connective tissue disease, Guillain-Barre syndrome, multiple sclerosis, Sydenham’s chorea, rheumatic fever
Disease – Other diseases or disorders
CurrMeds – Current Medications

2.2 Genotyping and Linkage Analysis

Genomic DNA from all available family members was extracted from whole blood or saliva. Extracted DNA was genotyped using the Affymetrix Cytoscan HD microarray at The Centre for Applied Genomics (TCAG), SickKids. The data obtained by the Cytoscan HD array platform were analyzed using the Chromosome Analysis Suite software package (Affymetrix) using annotations of genome version GRCh37 (hg19). This family was not previously genotyped in any linkage or sequencing studies in OCD.


Figure 6: General study outline and analysis steps.

2.2.1 SNP Quality Control

Forward strand genotype calls were extracted from the microarray, and single nucleotide polymorphisms (SNPs) with possible strand ambiguity (A/T, C/G) were removed. A sex check and population stratification analysis were performed in PLINK v.1.07 to confirm that the genotyped sex and ethnicity matched the reported sex and ethnicity82. Individuals were merged with the HapMap3 phase 3 release 3 CEU population, as the family was Caucasian, for the rest of the QC process83. SNPs that had a minor allele frequency over 45%, were genotyped in all individuals, and did not fail Hardy-Weinberg equilibrium (p<0.001) were extracted. Pedstats was also used to check for Mendelian inheritance errors, and SNPs that had errors were removed. This was followed by pairwise pruning to retain SNPs that were not in linkage disequilibrium, based on pairwise r2≤0.1 (1500 marker window, 150 marker shift). A final recombination error check was performed in MERLIN v.1.1.2 to look for possible genotyping errors that may cause false recombination events during linkage analysis84. These SNPs were also removed before performing linkage analysis in the family. From the final set of SNPs used for linkage analysis (n=8,960), the average between-marker interval was 0.33cM and the average MAF was 0.48.
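The per-SNP filters described above can be illustrated with a short Python sketch. This is not the PLINK pipeline used in the study; the `Snp` record, its field names, and the `passes_qc` helper are hypothetical and only restate the stated thresholds (strand-ambiguous alleles removed, MAF over 45%, complete genotyping, Hardy-Weinberg p not below 0.001).

```python
from dataclasses import dataclass

# Allele pairs whose strand cannot be resolved from the genotypes alone.
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}


@dataclass
class Snp:
    """Hypothetical per-SNP summary record (field names are assumptions)."""
    rsid: str
    alleles: tuple      # e.g. ("A", "G")
    maf: float          # minor allele frequency
    call_rate: float    # fraction of individuals with a genotype call
    hwe_p: float        # Hardy-Weinberg test p-value


def is_strand_ambiguous(alleles):
    """A/T and C/G SNPs read the same on both strands."""
    return frozenset(alleles) in AMBIGUOUS_PAIRS


def passes_qc(snp, min_maf=0.45, hwe_threshold=0.001):
    """Apply the four per-SNP criteria stated in the text."""
    return (
        not is_strand_ambiguous(snp.alleles)
        and snp.maf > min_maf
        and snp.call_rate == 1.0          # genotyped in all individuals
        and snp.hwe_p >= hwe_threshold    # did not fail HWE at p<0.001
    )
```

Selecting highly informative markers (MAF close to 0.5) is consistent with the reported average MAF of 0.48 in the final marker set.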

2.2.2 Linkage Analysis

Pedigree relationships were confirmed before analysis using PEDCHECK v.185 and PLINK v.1.07, and linkage analysis was performed using MERLIN v1.1.2. A 2-tiered approach was used for linkage analysis consisting of both a parametric and non-parametric analysis (Figure 6). A model-based approach was first used because simulation studies have shown that formulating a genetic model that approximates true inheritance may have more power than non-parametric analysis, in part because parametric models can utilize information about unaffected individuals. A rare dominant model was used with a disease allele frequency of 0.01 and penetrances of 0.01, 0.99, and 0.99 for genotypes AA, Aa, and aa, respectively (“a” denotes the susceptibility allele). Due to the father-son transmission of the disease in the family, linkage analysis of the X-chromosome was not performed (Figure 5).
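As a rough check on what this model implies, the sketch below computes the population prevalence that follows from the stated allele frequency and penetrances. It assumes Hardy-Weinberg genotype proportions, which the text does not state explicitly, and the function names are illustrative only.

```python
def genotype_frequencies(q):
    """Hardy-Weinberg genotype frequencies; q is the frequency of allele 'a'."""
    p = 1.0 - q
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


def implied_prevalence(q, penetrance):
    """Prevalence = sum over genotypes of frequency x penetrance."""
    freqs = genotype_frequencies(q)
    return sum(freqs[g] * penetrance[g] for g in freqs)


# Rare dominant model from the text: phenocopy rate 0.01, penetrance 0.99.
PENETRANCE = {"AA": 0.01, "Aa": 0.99, "aa": 0.99}
prevalence = implied_prevalence(0.01, PENETRANCE)
print(f"Implied prevalence: {prevalence:.4f}")
```

Under these assumptions the model implies a prevalence of about 3%, most of which comes from heterozygous carriers.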

A non-parametric linkage analysis was also performed in order to identify regions of interest shared among affected individuals. Using Merlin, we computed the Kong and Cox logarithm of odds (LOD) statistics. The ‘all’ option was used instead of the ‘pairs’ option for this analysis. The ‘all’ score provides more information when relative pairs included in a data set are not independent and performs well for most simulated genetic models86. The exponential model was reported as it has been identified as being better for large increases in allele sharing in a small number of families87. Finally, in both the parametric and non-parametric approaches, a secondary analysis was conducted coding the grandfather as affected.

Haplotyping was performed in MERLIN for the regions identified with high LOD scores and visualized using Haplopainter v.1.04388. Haplotypes that were inherited identical by descent and co-segregated with the OCD phenotype were assessed in the family. Identified regions were compared to previous linkage studies and genes within regions (UCSC) were extracted and compared to previously associated genes in OCD.

2.3 CNV Analysis

The standard calling algorithm in the manufacturer-provided Affymetrix Chromosome Analysis Suite (ChAS) software v.2.1 was used to call CNVs from the Affymetrix Cytoscan HD microarray data. Copy number variants spanning a minimum of 10kb and 20 markers were called. CNVs were then filtered for presence within linkage regions.
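The two filtering steps can be sketched as follows. This is an illustration rather than the ChAS workflow itself: the dictionary layout and helper names are assumptions, the regions are the parametric linkage intervals from Table 6, and the two example CNVs are taken from Table 9.

```python
# Parametric linkage regions (chromosome, start, end; build 37), from Table 6.
LINKAGE_REGIONS = [
    ("6", 138_076_153, 144_425_062),
    ("12", 93_800_678, 114_254_995),
    ("22", 44_538_688, 45_183_545),
]


def passes_call_thresholds(cnv, min_size_kb=10.0, min_markers=20):
    """Minimum size and marker count stated in the text."""
    return cnv["size_kb"] >= min_size_kb and cnv["markers"] >= min_markers


def overlaps_region(cnv, regions=LINKAGE_REGIONS):
    """True if the CNV interval overlaps any region on the same chromosome."""
    return any(
        chrom == cnv["chrom"] and cnv["start"] <= end and start <= cnv["end"]
        for chrom, start, end in regions
    )


# Two CNVs from Table 9, used here only as example input.
cnvs = [
    {"gene": "TMTC2", "chrom": "12", "start": 83_168_047, "end": 83_207_924,
     "size_kb": 39.877, "markers": 57},
    {"gene": "SHANK3", "chrom": "22", "start": 51_103_691, "end": 51_115_526,
     "size_kb": 11.835, "markers": 60},
]

in_linkage = [c["gene"] for c in cnvs
              if passes_call_thresholds(c) and overlaps_region(c)]
print(in_linkage)  # prints [] : neither CNV falls inside a linkage region
```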

2.4 Whole Exome Sequencing

Two of the most distantly related affected individuals (1324 and 1226) were chosen for whole exome sequencing (WES), which was performed at The Centre for Applied Genomics (TCAG). Briefly, genomic DNA was fragmented and enriched for exonic regions using the Agilent SureSelect All Exon 50 Mb kit (Agilent Technologies, Santa Clara, CA). The targeted regions were then sequenced on the ABI SOLiD 5500XL system. In the SNP calling pipeline, Bfast+bwa-0.6.4e was used for alignment, srma0.1.15 for local realignment, and GATK 1.1.28 for quality score (qscore) recalibration and variant calling. ANNOVAR 2013Aug23 was used to annotate the variants. QC metrics were generated for each individual sequenced.

2.4.1 Variant filtration

Variants with a quality by depth (QD) of less than 5 or a strand bias (SB) of greater than -0.01 were removed. A more stringent cutoff of QD < 10 or SB >= -0.01 was used for removal of INDELs. Variants that were not present in both individuals sequenced were also removed. Variants were subsequently filtered for rare frequency (<5%) using the 1000 genomes database89 (phase 1 release v3, April 2012); for those variants not present in the 1000 genomes, the NHLBI exome sequencing project (6500si, February 2013) was also used as a filter. Following this, for the primary analysis, remaining variants were filtered for those within linkage regions. As a secondary analysis, after filtering for quality, minor allele frequency, and presence in both sequenced individuals, genes that had been previously associated with OCD were assessed for variants (Table 3).
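The filtration steps for the primary analysis can be summarized in a short sketch. It is illustrative only and not the pipeline that was run: the record fields and function names are hypothetical, a variant is treated as high quality when it meets the retention thresholds listed in Table 10 (QD>=5 and SB<=-0.01, with QD>=10 and SB<-0.01 for INDELs), and variants absent from both frequency databases are assumed to be retained as rare.

```python
def passes_quality(variant):
    """Keep a call only if it clears the QD and SB thresholds for its type."""
    if variant["is_indel"]:
        return variant["qd"] >= 10 and variant["sb"] < -0.01
    return variant["qd"] >= 5 and variant["sb"] <= -0.01


def is_rare(variant, max_maf=0.05):
    """Use the 1000 Genomes MAF when present, otherwise fall back to ESP6500."""
    maf = variant["maf_1000g"]
    if maf is None:
        maf = variant["maf_esp"]
    return maf is None or maf < max_maf   # assumption: unseen variants are rare


def in_any_region(variant, regions):
    """regions is a list of (chrom, start, end) tuples."""
    return any(chrom == variant["chrom"] and start <= variant["pos"] <= end
               for chrom, start, end in regions)


def shared_rare_variants(calls_a, calls_b, regions):
    """calls_a and calls_b map (chrom, pos, ref, alt) to variant records."""
    kept = []
    for key in sorted(set(calls_a) & set(calls_b)):   # present in both exomes
        va, vb = calls_a[key], calls_b[key]
        if (passes_quality(va) and passes_quality(vb)
                and is_rare(va) and in_any_region(va, regions)):
            kept.append(key)
    return kept
```

For the secondary analysis, the same functions would apply with the region test replaced by membership in the candidate gene list.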

Table 3: Candidate gene list from studies used in Taylor et al. 2013 meta-analysis of OCD genetic association studies.

SLC6A4 CRIPT GABRA2 HTR1B MOG NTRK3 SLITRK1 UCP2 DAT1 GABRB3 HTR2A MOR OLIG1 SNAP25 ACE DRD1 GABRG2 HTR2B NCAM1 OLIG2 TACR1 BDKRB2 DRD2 GALR1 HTR2C NET PIK3CG TACR2 BDNF DRD3 GLRB HTR3A NEUROD4 PLC1 TNFA CCR5 DRD4 GRIK2 HTR6 NFKBIL1 PMCH TPH CHMR5 EMT GRIK3 IL10 NGFR SAPAP3 TPH2 CHRNA1 ERE6 GRIN2B INPP1 NLGN1 SLC1A1 UBE3A CNTN3 ESR GRIN3A MAOA NPY SLC1A7 COMT GABBR1 HOXB8 SOD2 NTRK2 SLC6A3

Finally, variants from both of the above analyses were ranked by Combined Annotation-Dependent Depletion (CADD) phred-like score (normalized C scores)76, which predicts the relative pathogenicity of both coding and noncoding variants. For comparison, the potential effect of the 14 missense variants from both of the above analyses on protein function was evaluated using several additional prediction algorithms, including SIFT, PolyPhen-2, LRT, MutationTaster, MutationAssessor, FATHMM, RSVM, and LR (Table 4).
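The phred-like scaling can be made concrete with a small sketch. It assumes the usual phred convention, in which a scaled score of s corresponds to the top 10^(-s/10) fraction of ranked variants, so that a score of 10 marks the top 10% and a score of 20 the top 1%. The function names are illustrative.

```python
import math


def phred_from_rank_fraction(fraction):
    """Scaled score for a variant ranked in the top `fraction` of all variants."""
    return -10.0 * math.log10(fraction)


def top_fraction_from_phred(score):
    """Inverse mapping: fraction of variants ranked at or above this score."""
    return 10.0 ** (-score / 10.0)


for score in (10, 20, 30):
    print(score, top_fraction_from_phred(score))
```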

Table 4: The 14 missense variants were compared against other currently available algorithms for predicting variant deleteriousness. Training data and the primary information used for each score is described along with the prediction outputs.

Score | Training data | Information Used | Prediction
SIFT (“Sorting Tolerant from Intolerant”)90 | SWISS-PROT/TrEMBL | based on PSI-BLAST | D: Deleterious (sift<=0.05); T: tolerated (sift>0.05)
PolyPhen-2 HDIV91 | UniProtKB/UniRef100; PDB/DSSP; UCSC alignments of 45 vertebrate genomes. HDIV – better for distinguishing mutations with drastic effects | eight sequence-based and three structure-based predictive features | D: Probably damaging (>=0.957), P: possibly damaging (0.453<=pp2_hdiv<=0.956); B: benign (pp2_hdiv<=0.452)
PolyPhen-2 HVAR91 | Same as above, better for evaluating rare alleles at loci potentially involved in complex phenotypes | eight sequence-based and three structure-based predictive features | D: Probably damaging (>=0.909), P: possibly damaging (0.447<=pp2_hvar<=0.909); B: benign (pp2_hvar<=0.446)
LRT (Likelihood ratio test)92 | coding sequences of 32 vertebrate species | sequence homology | D: Deleterious; N: Neutral; U: Unknown
MutationTaster93 | UniProt; homologous genes in humans and 10 other species; dbSNP; HapMap | conservation, splice site, mRNA features, protein features | "A" ("disease_causing_automatic"); "D" ("disease_causing"); "N" ("polymorphism"); "P" ("polymorphism_automatic")
MutationAssessor94 | homologous sequences from Uniprot identified by BLAST | sequence homology of protein families and sub-families within and between species | H: high; M: medium; L: low; N: neutral. H/M means functional and L/N means non-functional
FATHMM95 | homologous sequences from UniRef90, SUPERFAMILY and Pfam | sequence homology | D: Deleterious; T: Tolerated
GERP++74 | genomes of 34 mammals | multiple alignments and phylogenetic tree | higher scores are predicted to be more deleterious
PhyloP96 | genomes of 33 placental mammals | multiple alignments and phylogenetic tree | higher scores are predicted to be more deleterious

2.5 Validation by Sanger Sequencing

Variants in the TRAFD1, SLC1A1, CPQ and DLG1 genes were validated in the family using Sanger sequencing. Primers were designed using the online ExonPrimer software to encompass the exon containing the variant (Table 5). Thermocycling conditions were as follows: 95°C x 5min, followed by 35 cycles (95°C x 60s, 60°C (ATem) x 60s, 70°C x 90s). The annealing temperature (ATem) was changed for each primer set. PCR products were purified using the Qiagen PCR product purification kit according to the manufacturer's protocol and sequenced at TCAG.

Table 5: Primers used to validate select candidate variants with their respective annealing temperatures.

Gene | Exon | Forward Primer | Reverse Primer | ATem (°C)
TRAFD1 | 7 | AATCGCTTCCTTAGAGTAGTGAG | tCTAAGAAGGAACAAGGCTAAG | 60
CPQ | 6 | GACAACCACACCAGTGAGAG | CAGCTCTACACCACAGCATC | 65.5
DLG1 | 21 | TGATTCTTCCTGGATGAGTAAC | GCAGACCATGATGTGTGG | 62.5
SLC1A1 | 9 | GATTAAGTGTGGGGAGGTGG | GGAGTCTGTTTTGTAAAGTGAGGTG | 62.5

2.6 Preliminary Screening of SLC1A1 variant (rs34342853) in an additional cohort of OCD cases and controls

An additional set of 188 DSM-IV diagnosed OCD cases and 131 controls were screened for the SLC1A1 variant. Cases and controls were recruited from 4 sites (University of Michigan, Wayne State University, St. Joseph’s Health Centre, The Hospital for Sick Children) and all patients signed institutional research ethics board-approved consent forms to participate in our OCD genetic studies. All subjects were inventoried for age, sex, ethnicity, weight, height, socioeconomic status, intelligence, and obstetrical history. Semi-structured interviews were conducted on all subjects, including the Schedule for Obsessive-Compulsive and Other Behavioural Syndromes (SOCOBS), which assessed obsessive-compulsive behaviours and tics. Participants with a history of intrusive thoughts, repetitive behaviors, or mental rituals meeting DSM-IV criteria for either obsessions or compulsions were interviewed with the Children’s Yale-Brown Obsessive Compulsive Scale-PL (CY-BOCS-PL) to assess OC symptom severity during the past week and during the period of time when the symptoms were at their worst. After completion of all interviews for an individual, all available materials were collated and a senior clinician independently reviewed written records of the interviews and all other sources of information. Only subjects for whom the interviewer and senior clinician both agreed on a definite Axis I diagnosis using DSM-IV criteria were coded as having a specific diagnosis. All interviewers had a master's degree and clinical training in child psychopathology and were trained by senior investigators to use the interview schedules.

Genotyping was performed using the TaqMan SNP Genotyping Assay for rs34342853 under standard conditions: a total volume of 20 µl and 20ng of genomic DNA were amplified in the presence of 1x PCR Master mix (Qiagen, Valencia, CA, USA) and 1x TaqMan Assay (Applied Biosystems, Foster City, CA, USA). Thermocycling conditions were as follows: 95°C x 10min, followed by 50 cycles (95°C x 10s, 60°C x 30s). Genotypes were called using a fluorescence reader at TCAG. Positive genotype calls were validated using Sanger sequencing using the same primers and protocol as in the family. The frequency in cases was compared against recruited controls.

3 Results

3.1 Linkage Analysis

In order to identify regions of the genome associated with OCD in the family, both a parametric and non-parametric linkage analysis were conducted.

3.1.1 Parametric Linkage Analysis

Parametric linkage analysis using a rare dominant model identified 3 regions with a LOD score of 1.78 on chromosomes 6, 12 and 22 (Figure 7, Table 6). Haplotyping determined the boundary SNPs of the 3 regions shared by all affected members of the family, either the grandmother or the grandfather, and not shared by any of the unaffected members.

Regions with LOD scores lower than 1.78 were not shared by all affected individuals (not shown). The region on was interrupted by a region with low LOD score, however after haplotyping this was attributed to genotyping error, and the regions were combined into one larger region for analysis (An apparent double recombination event within a region <2Mb long was seen in the ungenotyped grandmother). Coding the grandfather as affected resulted in only a single region with a positive LOD score of 2.06 on chromosome 22 (Figure 8). This region was transmitted from the grandfather (1461) to all other affecteds in the pedigree. Overall, parametric linkage analysis narrowed the scope of analysis to ~27Mb, or approximately 0.9% of the human genome (~3000Mb). Since the exome is ~1% of the genome this amounts to approximately 270kb within linked regions on average.
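The arithmetic behind these figures can be reproduced from the region boundaries in Table 6. The sketch below is a back-of-the-envelope check that uses the same round numbers as the text (a ~3000Mb genome and an exome that is ~1% of it); the variable names are illustrative.

```python
# Boundaries of the three parametric linkage regions (build 37), from Table 6.
REGIONS = {
    "6": (138_076_153, 144_425_062),
    "12": (93_800_678, 114_254_995),
    "22": (44_538_688, 45_183_545),
}

GENOME_BP = 3_000_000_000   # ~3000Mb, as stated in the text
EXOME_FRACTION = 0.01       # exome taken as ~1% of the genome

total_bp = sum(end - start for start, end in REGIONS.values())
genome_percent = 100.0 * total_bp / GENOME_BP
expected_exonic_bp = total_bp * EXOME_FRACTION

print(f"Linked sequence: {total_bp / 1e6:.1f} Mb")          # ~27 Mb
print(f"Share of genome: {genome_percent:.1f}%")            # ~0.9%
print(f"Expected exonic sequence: {expected_exonic_bp / 1e3:.0f} kb")
```

The same calculation applied to the 115Mb covered by the non-parametric regions gives ~3.8% of the genome and roughly 1.15Mb of expected exonic sequence.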

3.1.2 Non-parametric linkage analysis

Along with the regions identified through the parametric linkage analysis, 8 additional regions were identified in the non-parametric analysis (Figure 9, Table 7). While all affected individuals shared at least one haplotype, in these additional regions at least one unaffected individual shared a haplotype as well. Overall, non-parametric linkage analysis narrowed the scope of analysis to 115Mb, or ~3.8% of the human genome. Again, since the exome is ~1% of the genome this amounts to approximately 1.15Mb within linked regions on average.

Coding the grandfather as affected in the linkage analysis identified 5 regions shared by the grandfather and other affecteds (Table 8). The strongest linkage score was on chromosome 8 with an exponential LOD score of 2.13. Haplotyping indicated that this region was shared by all the affected individuals but also by the unaffected individuals 126 and 1227 (Figure 10).


Figure 7: Parametric Linkage results for chromosomes 6, 12, and 22 (the only chromosomes with positive LOD scores).


Figure 8: Haplotype of shared region on chromosome 22. The orange segment from rs2267617 to rs2016013 was shared by all the affected individuals but not the unaffected individuals in the family. The genotypes in the grandmother (1462) were inferred.


Figure 9: Non-parametric LOD score plots for chromosomes with LOD score >1. The blue lines represent the Kong and Cox exponential function and the black lines the Kong and Cox linear function. The “all” option was used instead of the “pairs” option for this analysis.

Table 6: Parametric linkage analysis using a rare dominant model identified 3 regions with suggestive linkage to OCD in the family. The region on chromosome 22 was the only region with a positive LOD score when the grandfather (1461) was coded as affected.

Chr | Boundary SNPs | Positions (build 37) | LOD (1461aff LOD score) | Size
6 | rs6927210-rs6912191 | 138,076,153-144,425,062 (6q23.3-q24.2) | 1.78 | 6.3Mb
12 | rs7968453-rs1043810 | 93,800,678-114,254,995 (12q22-q24.13) | 1.78 | 20Mb
22* | rs2267617-rs5766013 | 44,538,688-45,183,545 (22q13.31) | 1.78 (2.06) | 645Kb

Table 7: Non-parametric linkage scores for Family 147. 11 regions with a LOD score >1 were identified on 9 chromosomes, including the regions previously identified through parametric linkage analysis. These regions all contained an allele shared by all affected individuals and some unaffected individuals.

Chr | Boundary SNPs | Positions (build 37) | Highest LOD in region | Size
3 | rs1503079-rs4678363 | 104,759,032-137,368,142 (3q13.11-q22.3) | 1.88 | 32.6Mb
3 | rs12718037-term | 196,426,606-198,022,429 (3q29) | 1.43 | 1.6Mb
6 | rs9389951-rs787765 | 100,457,411-112,766,961 (6q16.2-q21) | 1.27 | 12.3Mb
6 | rs10485400-rs6912191 | 130,444,124-144,425,062 (6q23.1-q24.2) | 1.88 | 14Mb
8 | rs6471246-rs2513980 | 91,692,022-98,762,432 (8q21.3-q22.1) | 1.88 | 7.1Mb
11 | rs4145054-term | 131,001,402-135,006,515 (11q25) | 1.11 | 4Mb
12 | rs7968453-rs1043810 | 93,800,678-114,254,995 (12q22-q24.13) | 1.88 | 20.5Mb
16 | rs6564059-term | 84,587,249-90,354,752 (16q24.1-q24.3) | 1.66 | 5.8Mb
18 | 0-rs9959482 | 1-5,216,054 (18p11.32-p11.31) | 1.27 | 5.2Mb
20 | rs2273684-rs366860 | 33,529,766-45,141,400 (20q11.22-q13.12) | 1.27 | 11.6Mb
22 | rs4823185-rs5766013 | 44,408,241-45,183,545 (22q13.31) | 1.11 | 645Kb


Table 8: Non-parametric linkage scores for Family 147 with the grandfather coded as affected. In addition to the chromosome 22 region identified in the parametric linkage analysis, 4 additional regions were suggestively linked.

Chr | Boundary SNPs | Positions (build 37) | Highest LOD in the region | Size
3 | rs12718037-term | 196,426,606-198,022,429 (3q29) | 1.43 | 1.6Mb
8 | rs6471246-rs2513980 | 91,692,022-98,762,432 (8q21.3-q22.1) | 2.13 | 7.1Mb
16 | rs6564059-term | 84,587,249-90,354,752 (16q24.1-q24.3) | 2.01 | 5.8Mb
18 | 1-rs9959482 | 1-5,216,054 (18p11.32-p11.31) | 1.57 | 5.2Mb
22 | rs4823185-rs5766013 | 44,408,241-45,183,545 (22q13.31) | 1.44 | 645Kb


Figure 10: Haplotype of the region with the highest linkage score from the non-parametric analysis on chromosome 8. The purple bar indicates the allele shared by all the affected individuals, the grandfather (1461), and unaffected individuals 126 and 1327.

3.2 CNV Analysis

321 CNVs were identified among the genotyped individuals in the family. None of these CNVs were within the linkage regions identified. 49 unique CNVs were present in at least one affected individual and overlapped genes (Table 9). However, all of these CNVs were also present in at least one unaffected individual and were therefore unlikely to be associated with the disorder. For this reason, no attempt was made to validate these CNVs.


Table 9: CNVs identified in the family overlapping genes. Microarray Nomenclature: arr=arrangement, [build], cytoband, (genomic positions) x#of copies

Chr | Cytoband | CN State | Type | Size (kbp) | Marker Count | Gene Count | Genes | Microarray Nomenclature
1 | q21.3 | 0 | Loss | 11.995 | 28 | 1 | LCE1D | arr[hg19] 1q21.3(152,761,910-152,773,905)x0
3 | p11.1 | 1 | Loss | 29.586 | 50 | 1 | EPHA3 | arr[hg19] 3p11.1(89,386,933-89,416,519)x1
3 | p14.2 | 1 | Loss | 23.785 | 26 | 1 | FHIT | arr[hg19] 3p14.2(60,800,698-60,824,483)x1
3 | q25.1 | 1 | Loss | 35.951 | 28 | 2 | MIR548H2, AADAC | arr[hg19] 3q25.1(151,513,063-151,549,014)x1
4 | q35.2 | 1 | Loss | 196.972 | 112 | 1 | HSP90AA4P | arr[hg19] 4q35.2(190,346,146-190,543,118)x1
5 | q35.3 | 1 | Loss | 38.961 | 33 | 1 | BTNL3 | arr[hg19] 5q35.3(180,378,753-180,417,714)x1
6 | p21.33 | 1 | Loss | 57.313 | 78 | 1 | HLA-B | arr[hg19] 6p21.33(31,286,291-31,343,604)x1
6 | p22.1 | 1 | Loss | 32.683 | 20 | 1 | HCG4B | arr[hg19] 6p22.1(29,864,577-29,897,260)x1
6 | p22.1 | 1 | Loss | 56.24 | 22 | 2 | HLA-H, HCG4B | arr[hg19] 6p22.1(29,841,020-29,897,260)x1
6 | p25.3 | 1 | Loss | 40.573 | 88 | 1 | DUSP22 | arr[hg19] 6p25.3(254,253-294,826)x1
7 | q22.1 | 3 | Gain | 176.59 | 145 | 1 | EMID2 | arr[hg19] 7q22.1(100,965,600-101,142,190)x3
7 | q31.1 | 1 | Loss | 35.106 | 32 | 1 | IMMP2L | arr[hg19] 7q31.1(111,007,249-111,042,355)x1
7 | q34 | 0 | Loss | 10.376 | 22 | 2 | PRSS3P2, PRSS2 | arr[hg19] 7q34(142,475,398-142,485,774)x0
7 | q36.2 | 1 | Loss | 11.1 | 20 | 1 | DPP6 | arr[hg19] 7q36.2(154,393,059-154,404,159)x1
8 | p11.22 | 3 | Gain | 106.405 | 56 | 1 | ADAM3A | arr[hg19] 8p11.22(39,280,547-39,386,952)x3
8 | p11.22 | 1 | Loss | 139.855 | 76 | 2 | ADAM5P, ADAM3A | arr[hg19] 8p11.22(39,247,097-39,386,952)x1
8 | q21.2 | 1 | Loss | 10.711 | 26 | 1 | RALYL | arr[hg19] 8q21.2(85,257,957-85,268,668)x1
10 | p12.33 | 1 | Loss | 117.205 | 24 | 3 | MRC1, MIR511-2, MIR511-1 | arr[hg19] 10p12.33(18,106,429-18,223,634)x1
10 | q21.3 | 1 | Loss | 114.54 | 88 | 1 | CTNNA3 | arr[hg19] 10q21.3(69,141,543-69,256,083)x1
10 | q26.13 | 1 | Loss | 14.839 | 22 | 1 | DMBT1 | arr[hg19] 10q26.13(124,341,802-124,356,641)x1
11 | p15.4 | 1 | Loss | 25.321 | 48 | 2 | OR52N5, OR52N1 | arr[hg19] 11p15.4(5,783,909-5,809,230)x1
11 | q11 | 1 | Loss | 78.821 | 96 | 3 | OR4P4, OR4S2, OR4C6 | arr[hg19] 11q11(55,374,175-55,452,996)x1
11 | q14.3 | 1 | Loss | 36.947 | 52 | 1 | TYR | arr[hg19] 11q14.3(88,928,834-88,965,781)x1
11 | q24.3 | 0 | Loss | 17.288 | 40 | 1 | FLI1 | arr[hg19] 11q24.3(128,683,017-128,700,305)x0
12 | p13.2 | 4 | Gain | 26.55 | 30 | 2 | PRH1-PRR4, TAS2R43 | arr[hg19] 12p13.2(11,220,826-11,247,376)x4
12 | p13.2 | 1 | Loss | 59.47 | 86 | 3 | PRH1-PRR4, TAS2R46, TAS2R43 | arr[hg19] 12p13.2(11,192,450-11,251,920)x1
12 | q21.31 | 1 | Loss | 39.877 | 57 | 1 | TMTC2 | arr[hg19] 12q21.31(83,168,047-83,207,924)x1
13 | q31.3 | 1 | Loss | 21.457 | 20 | 1 | GPC5 | arr[hg19] 13q31.3(92,668,185-92,689,642)x1
14 | q24.3 | 1 | Loss | 28.271 | 28 | 2 | HEATR4, ACOT1 | arr[hg19] 14q24.3(73,995,760-74,024,031)x1
14 | q32.33 | 3 | Gain | 631.901 | 192 | 3 | ELK2AP, KIAA0125, ADAM6 | arr[hg19] 14q32.33(106,074,223-106,706,124)x3
14 | q32.33 | 3 | Gain | 279.011 | 143 | 2 | KIAA0125, ADAM6 | arr[hg19] 14q32.33(106,246,288-106,525,299)x3
15 | q24.3 | 1 | Loss | 10.73 | 24 | 1 | SCAPER | arr[hg19] 15q24.3(76,891,727-76,902,457)x1
16 | p11.2 | 3 | Gain | 306.222 | 176 | 3 | LOC283914, LOC146481, LOC100130700 | arr[hg19] 16p11.2p11.1(34,449,594-34,755,816)x3
16 | p11.2 | 1 | Loss | 688.116 | 22 | 5 | TP53TG3C, TP53TG3B, TP53TG3, LOC390705, RNU6-76 | arr[hg19] 16p11.2(32,937,322-33,625,438)x1
16 | p13.3 | 3 | Gain | 37.446 | 51 | 3 | SLC9A3R2, NTHL1, TSC2 | arr[hg19] 16p13.3(2,087,012-2,124,458)x3
16 | q12.2 | 1 | Loss | 26.056 | 32 | 1 | CES1P1 | arr[hg19] 16q12.2(55,796,375-55,822,431)x1
17 | q12 | 1 | Loss | 120.749 | 30 | 3 | TBC1D3F, TBC1D3, LOC440434 | arr[hg19] 17q12(36,283,806-36,404,555)x1
17 | q21.31 | 3 | Gain | 571.816 | 72 | 8 | KANSL1, KANSL1-AS1, LRRC37A, ARL17A, ARL17B, NSFP1, LRRC37A2, NSF | arr[hg19] 17q21.31(44,212,823-44,784,639)x3
19 | q13.42 | 1 | Loss | 28.404 | 48 | 2 | ZNF525, ZNF765 | arr[hg19] 19q13.42(53,888,552-53,916,956)x1
20 | p12.1 | 1 | Loss | 105.117 | 92 | 1 | MACROD2 | arr[hg19] 20p12.1(14,952,341-15,057,458)x1
20 | q13.2 | 1 | Loss | 13.84 | 36 | 1 | BCAS1 | arr[hg19] 20q13.2(52,644,753-52,658,593)x1
22 | q13.33 | 1 | Loss | 11.835 | 60 | 1 | SHANK3 | arr[hg19] 22q13.33(51,103,691-51,115,526)x1
X | p21.3 | 0 | Loss | 14.056 | 56 | 1 | IL1RAPL1 | arr[hg19] Xp21.3(29,160,655-29,174,711)x0
X | p22.33 | 3 | Gain | 15.565 | 48 | 1 | CD99 | arr[hg19] Xp22.33 or Yp11.31(2,616,386-2,631,951 or 2,566,386-2,581,951)x3
X | q22.1 | 4 | Gain | 164.946 | 21 | 2 | NXF2B, NXF2 | arr[hg19] Xq22.1(101,450,607-101,615,553)x4
X | q24 | 3 | Gain | 48.626 | 38 | 12 | CT47A2, CT47A10, CT47A12, CT47A3, CT47A9, CT47A4, CT47A6, CT47A5, CT47A1, CT47A8, CT47A11, CT47A7 | arr[hg19] Xq24(120,071,134-120,119,760)x3
X | q24 | 3 | Gain | 112.388 | 44 | 12 | CT47A3, CT47A10, CT47A12, CT47A5, CT47A1, CT47A4, CT47A8, CT47A2, CT47A9, CT47A11, CT47A6, CT47A7 | arr[hg19] Xq24(120,009,899-120,122,287)x3
X | q26.3 | 3 | Gain | 49.405 | 128 | 4 | CT45A4, CT45A5, CT45A6, SAGE1 | arr[hg19] Xq26.3(134,934,813-134,984,218)x3
X | q27.2 | 3 | Gain | 459.934 | 252 | 4 | SPANXC, SPANXA2-OT1, SPANXA2, SPANXA1 | arr[hg19] Xq27.2(140,323,108-140,783,042)x3

3.3 Whole Exome Sequencing

Both individuals sequenced had average target coverage of greater than x85 with 96% of targeted bases covered with at least x10 read depth (Table 10). The transition to transversion ratios for synonymous and nonsynonymous variants were within the expected ranges97. Based on this data we concluded that the quality of exome sequencing data was sufficient for downstream analysis.
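For readers unfamiliar with the transition to transversion (Ti/Tv) metric, the sketch below shows how it is computed from a list of single nucleotide variants. It is a generic illustration and not the QC script used for Table 10; the function names are assumptions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """Transitions swap purine for purine (A<->G) or pyrimidine for pyrimidine (C<->T)."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def titv_ratio(snvs):
    """snvs is an iterable of (ref, alt) single-base substitutions."""
    transitions = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    transversions = sum(1 for ref, alt in snvs
                        if ref != alt and not is_transition(ref, alt))
    if transversions == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return transitions / transversions
```

Ratios far below the expected values for coding variants generally suggest an excess of false-positive calls, which is why the metric is reported separately for synonymous and non-synonymous variants.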

Table 10: Whole Exome Sequencing Quality metrics as recommended by Do et al. 2012

Subject ID# | 1324 | 1226
# on target bases | 4,947,658,152 | 4,573,180,551
% selected bases covered | 91.9 | 92.4
Mean target coverage | x97 | x89
% Target bases >30x, >20x, >10x | 86%, 92%, 96% | 84%, 91%, 96%
Total Number of HQ Variants (QD>=5, SB<=-0.01) | 41,498 | 41,438
Synonymous variants (10,000-12,500) | 9,038 | 9,172
Fraction Unique | 186/9,038 | 184/9,172
Transition-transversion ratio (~5) | 4.93 | 4.81
Non-synonymous variants (9,500-12,000) | 8,709 | 8,697
Fraction Unique | 377 | 397
Transition-transversion ratio (~2) | 1.98 | 1.96
Stop or splice altering variants (100-200) | 241 | 247
Fraction Unique | 22 | 30

3.3.1 Parametric linkage

Variants were first filtered and prioritized based on the parametric linkage data from this family. 11 rare shared variants were identified (Table 11). None of these were in the chromosome 22 region, which had the highest LOD score when the grandfather was coded as affected. 3 variants were within the highest 10% of all CADD scores (CADD Phred >10). Variants with a score below this were either intronic and not predicted to alter splice sites, or synonymous SNVs. The top variant was a missense mutation in TRAFD1 (12q24.13), which is involved in the innate immune system98.

Table 11: Rare shared variants within Parametric linkage regions.

| Chr | Position | Ref | Alt | dbSNP 138 | Gene | Func | 1000g MAF | esp6500 MAF | CADD PHRED |
|---|---|---|---|---|---|---|---|---|---|
| 12 | 112583447 | A | C | rs79680080 | TRAFD1 | nonsynonymous SNV | 0.02 | 0.016838 | 18.32 |
| 12 | 113587667 | C | T | rs61748300 | CCDC42B | nonsynonymous SNV | NA | 0.003066 | 17.32 |
| 12 | 97112310 | A | G | rs368653119 | CDK17,NEDD1 | intergenic | NA | 0.000154 | 11.61 |
| 12 | 102494849 | T | C | rs17438178 | NUP37 | synonymous SNV | 0.02 | 0.026304 | 9.261 |
| 12 | 113515479 | C | G | N/A | DTX1 | synonymous SNV | NA | NA | 8.934 |
| 6 | 138417639 | A | G | N/A | PERP | intronic | NA | NA | 3.123 |
| 12 | 111993571 | G | A | rs11610849 | ATXN2 | intronic | 0.02 | NA | 2.368 |
| 12 | 109845779 | G | A | rs181015261 | MYO1H | intronic | NA | NA | 0.958 |
| 12 | 113605769 | C | T | rs117291153 | DDX54 | intronic | 0.01 | 0.017789 | 0.353 |
| 12 | 104025446 | G | A | rs147053330 | STAB2 | synonymous SNV | 0.0041 | 0.005536 | 0.082 |
| 12 | 97303693 | A | C | rs77544329 | NEDD1 | intronic | NA | 0.001922 | 0.077 |

LEGEND: Chr – Chromosome number, Position – Variant position on chromosome, Ref – Reference allele, Alt – Alternative allele, dbSNP 138 – dbSNP build 138 rs#, Func – Type of variant, 1000g MAF – Minor allele frequency from 1000 genomes database, esp6500 MAF – Minor allele frequency from exome variant server, CADD PHRED – Scaled Combined annotation dependent depletion score (0-99)
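The filtering and prioritization logic described in this section (variants shared by both sequenced individuals, rare frequency, located within a linkage region, then ranked by CADD Phred score) can be illustrated with a minimal sketch. This is not the pipeline used in the thesis: the `Variant` class, the linkage interval coordinates, the rule that a variant absent from a database (NA) counts as rare, and the rule that both databases must be at or below the threshold are all assumptions made for illustration. The TRAFD1 and NUP37 rows are taken from Table 11; the three `HYPOTHETICAL_*` rows are invented to show each filter acting.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int
    gene: str
    maf_1000g: Optional[float]  # None when the variant is absent from the database (NA)
    maf_esp: Optional[float]
    cadd_phred: float
    carriers: frozenset         # IDs of sequenced individuals carrying the variant

# Assumed values for illustration only; the interval coordinates are hypothetical.
LINKAGE_REGIONS = [("12", 97_000_000, 114_000_000)]
SEQUENCED = frozenset({"1324", "1226"})
MAF_MAX = 0.05

def in_linkage_region(v, regions):
    """True if the variant falls inside any (chrom, start, end) interval."""
    return any(c == v.chrom and s <= v.pos <= e for c, s, e in regions)

def is_rare(v, maf_max=MAF_MAX):
    """Assumption: NA counts as rare, and every reported MAF must be <= maf_max."""
    return all(m is None or m <= maf_max for m in (v.maf_1000g, v.maf_esp))

def prioritize(variants, regions=LINKAGE_REGIONS, required=SEQUENCED):
    """Keep shared, rare, linked variants and rank them by descending CADD Phred."""
    kept = [
        v for v in variants
        if required <= v.carriers and is_rare(v) and in_linkage_region(v, regions)
    ]
    return sorted(kept, key=lambda v: v.cadd_phred, reverse=True)

variants = [
    Variant("12", 102494849, "NUP37", 0.02, 0.026304, 9.261, SEQUENCED),
    Variant("12", 112583447, "TRAFD1", 0.02, 0.016838, 18.32, SEQUENCED),
    # Hypothetical rows, one per filter:
    Variant("12", 100000000, "HYPOTHETICAL_COMMON", 0.30, 0.28, 15.0, SEQUENCED),
    Variant("12", 105000000, "HYPOTHETICAL_PRIVATE", None, None, 22.0, frozenset({"1324"})),
    Variant("1", 5000000, "HYPOTHETICAL_UNLINKED", 0.01, 0.01, 25.0, SEQUENCED),
]

ranked = prioritize(variants)
for v in ranked:
    print(v.gene, v.cadd_phred)
```

Ranking is applied only after filtering, so a high-scoring variant that is common, unlinked, or carried by only one of the two sequenced individuals never reaches the prioritized list.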

3.3.2 Non-parametric linkage

Fifty-eight rare variants were identified within the non-parametric linkage regions, including the 11 variants previously identified in the parametric analysis (Table 12). Fifteen unique variants had a CADD Phred score >10, two of which had a score >20, predictive of being in the top 1% of most deleterious variants in the human genome. One of these was a missense variant in DLG1 (3q29), which encodes a protein involved in trafficking of ionotropic receptors within neurons; the other was a missense variant in KLHDC4 (Table 12). A further missense variant of interest (CADD Phred 13.91) was in CPQ (8q22.1), which encodes plasma glutamate carboxypeptidase (PGCP). CPQ has been previously associated with canine compulsive behaviour52.
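The CADD Phred thresholds used above (>10 for the top 10%, >20 for the top 1%) follow from the logarithmic scaling of the score. A minimal sketch, assuming the standard scaled CADD definition in which the Phred value is -10·log10 of a variant's rank fraction among all scored substitutions; the function names are illustrative and not part of any CADD software:

```python
import math

def phred_to_top_fraction(phred: float) -> float:
    """Fraction of scored substitutions ranked at or above this scaled score."""
    return 10 ** (-phred / 10)

def rank_to_phred(rank: int, total: int) -> float:
    """Scaled score for a variant ranked `rank` (1 = most deleterious) out of `total`."""
    return -10 * math.log10(rank / total)

for score in (10, 20, 25.5):
    print(score, phred_to_top_fraction(score))
```

Under this scaling, each additional 10 points corresponds to a tenfold smaller fraction of the ranked substitutions.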

Table 12: Rare shared variants within non-parametric linkage regions. Variants identified in the parametric analysis are marked with †. The 13 variants marked with * were within regions that segregated from the grandfather.

| Chr | Position | Ref | Alt | dbSNP 138 | Gene | Func | 1000g MAF | esp6500 MAF | CADD PHRED |
|---|---|---|---|---|---|---|---|---|---|
| 16* | 87782322 | G | C | rs3751727 | KLHDC4 | nonsynonymous SNV | 0.03 | 0.0237 | 25.5 |
| 3* | 196771513 | G | A | rs34492126 | DLG1 | nonsynonymous SNV | 0.03 | 0.035522 | 20.6 |
| 20 | 44588904 | G | A | rs117132825 | ZNF335 | nonsynonymous SNV | 0.0037 | 0.00692 | 19.89 |
| 3 | 129183403 | G | T | rs58532641 | IFT122 | intronic | 0.04 | NA | 18.93 |
| 12† | 112583447 | A | C | rs79680080 | TRAFD1 | nonsynonymous SNV | 0.02 | 0.016838 | 18.32 |
| 12† | 113587667 | C | T | rs61748300 | CCDC42B | nonsynonymous SNV | NA | 0.003066 | 17.32 |
| 3 | 130158989 | T | A | rs16827852 | COL6A5 | intronic | 0.03 | 0.023872 | 16.49 |
| 20 | 39992343 | G | T | rs146612336 | EMILIN3 | nonsynonymous SNV | NA | 0.000692 | 15.21 |
| 8* | 98041637 | G | A | rs148684432 | CPQ | nonsynonymous SNV | 0.0037 | 0.007227 | 13.91 |
| 3 | 134197429 | G | A | rs61748112 | ANAPC13 | UTR3 | 0.02 | 0.022374 | 13.43 |
| 3 | 126741544 | C | T | rs146748680 | PLXNA1 | intronic | 0.0046 | 0.013148 | 13.15 |
| 20 | 44429967 | G | T | N/A | DNTTIP1 | intronic | NA | NA | 12.72 |
| 16* | 87339450 | C | T | rs141912735 | C16orf95 | nonsynonymous SNV | 0.01 | NA | 12.71 |
| 12† | 97112310 | A | G | rs368653119 | CDK17,NEDD1 | intergenic | NA | 0.000154 | 11.61 |
| 3 | 130190716 | G | A | rs73868680 | COL6A5 | nonsynonymous SNV | NA | NA | 11.57 |
| 18* | 3075498 | ACAGAC | A | N/A | MYOM1 | intronic | NA | NA | 11.53 |
| 3 | 113212890 | G | A | N/A | SPICE1 | intronic | NA | NA | 10.74 |
| 3 | 129211069 | A | G | rs60681706 | IFT122 | intronic | 0.04 | 0.044057 | 10.17 |
| 20 | 43254298 | C | T | rs61737144 | ADA | synonymous SNV | 0.03 | 0.042284 | 9.958 |
| 3 | 118931541 | A | G | rs72655937 | B4GALT4 | intronic | 0.01 | 0.015762 | 9.878 |
| 3 | 129155463 | A | G | rs2307289 | MBD4 | nonsynonymous SNV | 0.03 | 0.043134 | 9.524 |
| 12† | 102494849 | T | C | rs17438178 | NUP37 | synonymous SNV | 0.02 | 0.026304 | 9.261 |
| 3 | 132166302 | T | G | rs116489157 | DNAJC13 | synonymous SNV | 0.0046 | 0.012927 | 9.23 |
| 12† | 113515479 | C | G | N/A | DTX1 | synonymous SNV | NA | NA | 8.934 |
| 3 | 133546728 | G | A | N/A | RAB6B | UTR3 | NA | NA | 8.462 |
| 6 | 135748451 | T | C | rs41288015 | AHI1 | intronic | 0.01 | 0.015981 | 8.254 |
| 8* | 95897794 | T | G | rs184241751 | CCNE2 | intronic | 0.0023 | 0.006698 | 8.047 |
| 3 | 122869147 | G | C | rs114278681 | PDIA5 | synonymous SNV | 0.04 | 0.022913 | 7.558 |
| 3 | 121631993 | T | G | rs34443342 | SLC15A2 | intronic | 0.04 | 0.013302 | 7.463 |
| 3 | 129195600 | G | A | rs150550701 | IFT122 | nonsynonymous SNV | 0.0027 | 0.001153 | 5.899 |
| 3* | 196771554 | T | C | rs35370245 | DLG1 | synonymous SNV | 0.03 | 0.035445 | 4.846 |
| 6 | 136597262 | T | C | rs1967444 | BCLAF1 | synonymous SNV | NA | NA | 4.733 |
| 16 | 88780641 | G | T | rs75686921 | CTU2 | intronic | NA | NA | 4.596 |
| 3 | 111298294 | C | T | rs114857763 | CD96 | intronic | 0.0027 | NA | 4.528 |
| 3 | 129140589 | G | A | rs58932597 | EFCAB12 | nonsynonymous SNV | 0.03 | 0.034743 | 4.168 |
| 6 | 136597004 | A | G | rs7762367 | BCLAF1 | synonymous SNV | NA | NA | 3.71 |
| 16* | 90142280 | T | C | rs369555207 | PRDM7 | synonymous SNV | NA | 0.000165 | 3.383 |
| 6† | 138417639 | A | G | N/A | PERP | intronic | NA | NA | 3.123 |
| 3 | 121547862 | G | A | rs111257270 | IQCB1 | intronic | 0.04 | NA | 2.636 |
| 3 | 105421032 | C | G | rs41302192 | CBLB | nonsynonymous SNV | 0.0046 | 0.005151 | 2.568 |
| 12† | 111993571 | G | A | rs11610849 | ATXN2 | intronic | 0.02 | NA | 2.368 |
| 16* | 85696615 | C | T | rs146498039 | GSE1 | synonymous SNV | 0.0009 | 0.000693 | 1.623 |
| 20 | 35632040 | A | G | N/A | RBL1 | UTR3 | NA | NA | 1.555 |
| 6 | 111896792 | A | G | rs12206072 | TRAF3IP2-AS1 | ncRNA_intronic | NA | NA | 1.337 |
| 16* | 89980738 | G | T | rs368881419 | TCF25,MC1R | intergenic | NA | NA | 1.271 |
| 6 | 101248149 | C | T | rs12216426 | ASCC3 | intronic | 0.04 | 0.070517 | 1.241 |
| 8* | 95841114 | G | T | rs77861705 | INTS8 | intronic | 0.03 | NA | 1.222 |
| 11 | 134054717 | C | T | N/A | NCAPD3 | intronic | NA | NA | 1.147 |
| 8* | 95186314 | G | A | rs72670355 | CDH17 | intronic | 0.0027 | 0.004536 | 1.07 |
| 12† | 109845779 | G | A | rs181015261 | MYO1H | intronic | NA | NA | 0.958 |
| 20 | 43625569 | A | C | N/A | STK4 | intronic | NA | NA | 0.764 |
| 20 | 36845446 | G | A | N/A | KIAA1755 | intronic | NA | NA | 0.554 |
| 16* | 88505455 | G | A | rs191554921 | ZNF469 | synonymous SNV | 0.0005 | NA | 0.537 |
| 12† | 113605769 | C | T | rs117291153 | DDX54 | intronic | 0.01 | 0.017789 | 0.353 |
| 3 | 113729790 | G | A | N/A | KIAA1407 | synonymous SNV | NA | NA | 0.106 |
| 12† | 104025446 | G | A | rs147053330 | STAB2 | synonymous SNV | 0.0041 | 0.005536 | 0.082 |
| 12† | 97303693 | A | C | rs77544329 | NEDD1 | intronic | NA | 0.001922 | 0.077 |
| 20 | 42088658 | A | C | rs117222808 | SRSF6 | intronic | 0.01 | 0.024143 | 0.001 |

LEGEND: Chr – Chromosome number, Position – Variant position on chromosome, Ref – Reference allele, Alt – Alternative allele, dbSNP 138 – dbSNP build 138 rs#, Func – Type of variant, 1000g MAF – Minor allele frequency from 1000 genomes database, esp6500 MAF – Minor allele frequency from exome variant server, CADD PHRED – Scaled Combined annotation dependent depletion score (0-99)

3.3.3 Candidate gene analysis

In screening all rare shared variants in genes previously associated with OCD (Table 3), only two variants were identified (Table 13). The region containing the rare intronic GABBR1 variant was not shown to segregate with the disorder through haplotyping. The other variant was a missense variant in exon 9 of SLC1A1. SLC1A1 encodes a neuronal glutamate transporter and has been associated with OCD (Chapter 1.3.1). The parametric and non-parametric LOD scores in this region were -0.22 and 0.059, respectively.

Table 13: Rare variants shared by the two individuals with exome sequence data within genes previously associated with OCD in human association studies.

| Chr | Start | Ref | Alt | dbSNP 138 | Gene | Func | 1000g MAF | esp6500si MAF | CADD PHRED |
|---|---|---|---|---|---|---|---|---|---|
| 6 | 29595218 | C | G | N/A | GABBR1 | intronic | NA | NA | 21.6 |
| 9 | 4576045 | T | C | rs34342853 | SLC1A1 | nonsynonymous SNV | 0.0005 | 0.001461 | 16.53 |

LEGEND: Chr – Chromosome number, Start – Variant position on chromosome, Ref – Reference allele, Alt – Alternative allele, dbSNP 138 – dbSNP build 138 rs#, Func – Type of variant, 1000g MAF – Minor allele frequency from 1000 genomes database, esp6500si MAF – Minor allele frequency from exome variant server, CADD PHRED – Scaled Combined annotation dependent depletion score (0-99)

3.3.4 Missense Variants and Sanger sequencing validation

The 14 missense variants from the above three analyses were assessed with several available algorithms for their predicted effect on protein function (Table 14). Eight variants were predicted to be damaging by at least one of the algorithms. Three of the genes harboring these variants were involved in glutamate function (CPQ, SLC1A1 and DLG1) and one was involved in the autoimmune system (TRAFD1). While we expected all 60 synonymous and non-synonymous variants to validate because of the stringent QC applied during filtration, these 4 missense variants were chosen for validation by Sanger sequencing because of their high CADD scores and biological plausibility in association with OCD (Figure 11). All 4 variants validated as expected from their respective haplotypes. Of note, the SLC1A1 variant (rs34342853) was not found in any of the unaffected individuals, but was also absent from one affected individual (379) (Figure 11).

Figure 11: Sanger sequencing validation of 4 of the top candidate variants in the family. The grandmother’s (1462) genotypes were inferred. Filled circles and squares represent individuals affected with OCD. Unfilled individuals did not have OCD symptoms. Individuals 1461 and 1462, shaded in grey, had a subclinical and an unknown OCD phenotype, respectively.

Table 14: Comparison of prediction algorithms results for the 14 missense variants identified. Algorithm descriptions are provided in Table 4. Only 4 of these missense variants, indicated with an *, were within linked regions and segregated from the grandfather.

| Chr. | Position | Gene | mRNA | Exon | cDNA | Protein | SIFT | Pp2 HDIV | Pp2 HVAR | LRT | MT | MA | FATHMM | GERP++ | PhyloP | CADD Phred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16* | 87782322 | KLHDC4 | NM_001184854 | 3 | c.C292G | p.L98V | T | D | D | D | D | M | T | 4.97 | 2.45 | 25.5 |
| 3* | 196771513 | DLG1 | NM_001204388 | 21 | c.C2348T | p.P783L | D | D | D | D | D | M | T | 5.48 | 2.587 | 20.6 |
| 20 | 44588904 | ZNF335 | NM_022095 | 14 | c.C1963T | p.P655S | T | P | P | N | N | N | T | 4.16 | 1.389 | 19.89 |
| 12 | 112583447 | TRAFD1 | NM_001143906 | 7 | c.A908C | p.E303A | D | D | D | N | D | M | T | 6.08 | 2.333 | 18.32 |
| 12 | 113587667 | CCDC42B | NM_001144872 | 1 | c.C5T | p.A2V | T | D | B | . | N | L | T | 0.322 | 0.06 | 17.32 |
| 9 | 4576045 | SLC1A1 | NM_004170 | 9 | c.T920C | p.I307T | D | B | B | D | D | L | T | 5.39 | 2.163 | 16.53 |
| 20 | 39992343 | EMILIN3 | NM_052846 | 3 | c.C449A | p.P150H | . | D | D | N | D | M | T | 2.97 | 0.688 | 15.21 |
| 8* | 98041637 | CPQ | NM_016134 | 6 | c.G968A | p.R323H | T | B | B | N | D | L | T | 3.84 | 0.723 | 13.91 |
| 16* | 87339450 | C16orf95 | NM_001195125 | 4 | c.G392A | p.C131Y | T | B | B | N | N | . | . | -3.1 | -0.623 | 12.71 |
| 3 | 130190716 | COL6A5 | NM_001278298 | 40 | c.G7765A | p.A2589T | T | P | B | . | N | . | D | 1.24 | 0.214 | 11.57 |
| 3 | 129155463 | MBD4 | NM_001276270 | 3 | c.T1024C | p.S342P | T | B | B | N | N | N | D | 2.69 | 1.402 | 9.524 |
| 3 | 129195600 | IFT122 | NM_052990 | 8 | c.G770A | p.S257N | T | B | B | D | D | N | T | 4.77 | 2.656 | 5.899 |
| 3 | 129140589 | EFCAB12 | NM_207307 | 2 | c.C107T | p.P36L | T | B | B | N | N | N | T | -7.94 | -2.921 | 4.168 |
| 3 | 105421032 | CBLB | NM_170662 | 12 | c.G1865C | p.S622T | D | B | B | N | N | L | D | 3.83 | 0.705 | 2.568 |

3.4 SLC1A1 Variant (rs34342853)

The SLC1A1 variant did not segregate perfectly with the disorder, as the youngest affected individual in the family (379) did not harbor the variant. However, because of the trend towards association of this region in multiple candidate gene association studies, including a meta-analysis34, we screened an additional OCD case and control cohort for the variant using a TAQMAN assay. We identified 1 additional case who was heterozygous for the variant (n=188, MAF=0.0027) and 0 controls (n=131) (Fisher exact test, P=1.00).
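The reported allele frequency and Fisher exact test result can be reproduced from the counts above with a short sketch. It assumes the test was applied to carrier counts (1 of 188 cases versus 0 of 131 controls) and uses the common two-sided definition that sums all tables no more probable than the observed one; the function name is illustrative.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x carriers in row 1 given fixed margins
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

n_cases, n_controls = 188, 131
case_carriers, control_carriers = 1, 0

# One heterozygous carrier contributes one minor allele out of 2 * n_cases chromosomes
maf = case_carriers / (2 * n_cases)

p_value = fisher_exact_two_sided(
    case_carriers, n_cases - case_carriers,
    control_carriers, n_controls - control_carriers,
)
print(round(maf, 4), p_value)
```

With a single carrier in the combined sample, either possible table (carrier among cases or among controls) is at least as extreme as the other, so the two-sided P value is 1.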

Figure 12: TAQMAN Assay output for 78 cases run on a plate. Green dot indicates the single case that was identified with the C/T variant. Blue dots indicate individuals homozygous for T allele. Yellow dots indicate no template controls.

The additional case identified with the variant was a 17-year-old male with a primary diagnosis of OCD and a secondary diagnosis of Tourette’s disorder. He had a LY-BOCS score of 30 corresponding to severe OCD.

4 Discussion

Whole exome sequencing has allowed for numerous discoveries in the genetics of Mendelian disorders. Here, we performed whole exome sequencing in an extended pedigree containing multiple individuals affected with OCD. We highlight the challenge of identifying which of the many rare variants present in individual human genomes are relevant to disease pathogenesis, and propose a unique, unbiased prioritization strategy involving linkage analysis and CADD annotation to identify the most likely candidates. While we were not able to implicate a single gene as causative of the disorder in the family, we identified several high-priority candidate variants in genes previously associated with OCD and the glutamate pathway. This suggests that our approach is a useful first step towards the identification of rare genetic variants that may be associated with OCD.

4.1 Linkage findings

The Affymetrix CytoscanHD array was used for linkage and CNV analysis. The SNPs present on the CytoscanHD array were sufficient to conduct both linkage analysis and CNV analysis using the same platform in a family with many affected individuals. Until next generation sequencing can accurately identify large structural variation, this approach may serve as an effective way to interrogate genome-wide structural variation in a family-based design. Linkage analysis in this family allowed us to narrow the scope of our analysis significantly, to less than 4% of the genome, under the assumption of a single causative allele. Parametric linkage analysis using a rare dominant model allowed us to identify regions that were only present within affected individuals. Three regions on chromosomes 6q, 12q, and 22q were identified to segregate with the disorder. None of these regions overlapped previous linkage findings in OCD. The strong linkage peak identified in the 6q region by Shugart et al. 2006 was more than 20Mb downstream of the region on chromosome 6q identified in this family27. Independent linkage studies have identified chromosome 12q23-24 as a genomic region that is likely to contain at least one susceptibility locus for bipolar disorder99. The region on chromosome 12 identified in this family overlaps this region, and it is possible that the two disorders share a common genetic factor in this region. The region on chromosome 22 was 5Mb from the peak in the Ross et al. study at the terminal end of the chromosome (both peaks were <1Mb in size). When the grandfather was coded as affected, the ~1Mb region on chromosome 22 was the only region with a positive LOD score (2.06). This region did not appear to harbor any genes that have been previously associated with psychiatric disorders100.

Eleven regions were identified in the non-parametric linkage analysis, including the 3 regions identified through the parametric analysis above. One of these regions was on chromosome 3, which was identified as a chromosome with more heritability than expected based on chromosomal length (h²=0.04, se=0.02) in the GCTA analysis of OCD GWAS data (P=0.02)13. Furthermore, the terminal 3q region identified with an exponential LOD=1.43 was within the 3q region identified in the Shugart et al. study27. Of note, this region contains many genes, including DLG1, which has been associated with autism, schizophrenia and bipolar disorder. One of the genes in the chromosome 18 region identified in the non-parametric linkage analysis is DLGAP1. As discussed earlier, DLGAP1 and the DLGAP family of proteins have been of interest in OCD (Chapter 1.3.2.1).

4.2 CNV

Despite the interesting genes present within the identified linkage regions, no CNVs encompassing them were identified. A small (~11kb) deletion was present at 22q13.33 (51,103,691-51,115,526), about 5Mb from the linkage peak on chromosome 22q13 in this family. This CNV is of interest because it encompasses SHANK3, a gene that is expressed in the synapse and implicated in the often comorbid autism spectrum disorders101. However, the CNV was present in only 1 of the affected individuals (1460) and in 3 unaffected individuals in the family (1326, 1327, and 1227). Therefore, this CNV was predicted to be either a false positive or a common variant, and unlikely to play a role in the etiology of OCD in this family. Since no CNVs of interest were found in this family, we hypothesized that smaller genetic variation, such as short insertions or deletions or SNVs, may be responsible.

4.3 WES

We tested for the presence of smaller genetic variation such as rare SNVs or INDELs linked to the disorder in the family by using next-generation sequencing. An unbiased filtering approach was used in order to determine which of the large number of variants present in the whole exome sequencing run was likely to play a key role in the disorder in the family.

We first filtered high quality variants from whole exome sequencing for those that were shared by the two affected individuals, located within linkage regions, and of low/rare frequency. We also investigated candidate genes previously studied in OCD. Through this process we identified a total of 60 variants of interest. These variants were prioritized based on their predicted deleteriousness using CADD. Several of the top hit rare variants were in genes previously associated with OCD and part of the glutamate pathway.

4.4 Revisiting the role of glutamate in OCD

One of the variants predicted to be most deleterious was in DLG1. DLG1 encodes a multi-domain scaffolding protein that is required for normal development and is widely expressed in the brain and multiple organs, especially in the postsynaptic density of the cerebral cortex102, 103. DLG1 is enriched in excitatory synapses, where it is strongly implicated in the trafficking and localization of both NMDA and AMPA-type glutamate receptors, regulating synaptic plasticity102. A role for DLG1 has been raised in numerous brain diseases including schizophrenia, Alzheimer’s disease, Parkinson’s disease and epilepsy. Furthermore, DLG1 has been shown to interact with DLGAP1 and DLGAP3104, which have been linked to OCD through GWAS and mouse knockouts, respectively. This suggests DLG1 may be a strong candidate for OCD as well. Because the region on chromosome 3q has shown suggestive linkage in other OCD studies27, the DLG1 variant may also be present in additional families or individuals with OCD.

CPQ is a poorly characterized glutamate carboxypeptidase that is secreted in blood plasma. It may help hydrolyze N-acetylaspartylglutamate (NAAG), the third most abundant neurotransmitter in the brain, to glutamate and N-acetylaspartate (NAA). The gene was also reported to be expressed in various regions of the brain at different time points during development105. Thus, a mutation affecting the function of the protein encoded by CPQ could result in changes in levels of glutamate in the brain and possibly in glutamatergic transmission. Variants in the canine CPQ gene (PGCP) were recently associated with a model of compulsive behaviour52. A 1.2kb deletion overlapping exon 2 of the PGCP gene, causing a frameshift and loss of 70 amino acids from the protein, was found in 3 Jack Russell terrier cases and one Welsh terrier (out of 74 dogs from OCD-risk breeds), but not in 20 controls. The identification of a rare variant in the canine model suggests rare variants in this gene may play a key role in the corresponding human condition. A common SNP (rs1835740) proximal to this gene was reported to be associated with migraine in a large GWAS (P = 5.38 × 10^-9, odds ratio = 1.23, 95% CI 1.150–1.324)106. Migraine is known to be often comorbid with obsessive-compulsive traits107.

A rare missense variant in SLC1A1 was identified in this family. SLC1A1 has been associated with OCD in several candidate gene association studies, with the majority of the associated SNPs within the 3’ region of the gene. Interestingly, the missense variant identified in this family was within the region where some of these common SNPs have been identified. While this variant was not present in one of the affected individuals, it is possible that it still plays an important role in the etiology of the disorder in this family. The individual who did not harbor the variant (379) was the youngest in the family to be assessed and could have been a phenocopy, or the variant could be more associated with a tic/TD version of OCD, which was present in three of the individuals harboring the variant. In genotyping this variant in an additional cohort of OCD cases, we identified an affected individual with the variant. This individual was also diagnosed with Tourette’s disorder, supporting the latter hypothesis. Within the small number of additional controls genotyped, none were found with the variant, and in the 1000 genomes database only a single individual was found to harbor the variant. The identification of one individual with the variant among only 188 cases genotyped encourages further genotyping of this variant in cases and controls assessed for OCD.

Furthermore, other rare variants in this region of SLC1A1 may be important as well. While SLC1A1 has been previously screened for rare variants, these studies prescreened for variants before sequencing using either single strand conformation polymorphism or heteroduplex screening. These methods may have missed rare variants present in some of the individuals screened because of their technical limitations. Direct targeted resequencing of SLC1A1 may identify additional individuals with rare, potentially deleterious variants. In addition to rare functional variants in the gene, recent evidence suggests that genomic regulation through the expression of isoforms may be responsible for the disruption of normal SLC1A1 function. Using bioinformatics, an internal promoter was discovered upstream of exon 5 and confirmed to drive the expression of a transcript consisting of the gene’s exons 5-12 (P2)108. Along with P2, two additional isoforms were discovered, one missing exon 2 (ex2skip) and the other missing exon 11 (ex11skip). In cell lines the isoforms were shown to reduce glutamate transport. Furthermore, ex2skip and ex11skip partially co-localize and interact with the primary transcript. All three isoforms are evolutionarily conserved between humans and mice, and were expressed in high abundance relative to the primary transcript within the human striatum. This evidence strongly suggests that these isoforms of SLC1A1 may serve to regulate glutamate transport in the striatum and adds another layer of complexity to the study of this gene. Research into whether the previously associated SNPs influence regulation by these isoforms should be conducted.

Another top-hit variant from the parametric analysis was in TRAFD1. TRAFD1 is a negative feedback regulator that controls excessive immune responses. The immune system has been implicated in OCD from studies of the PANDAS subset of cases of early-onset OCD patients. As mentioned before, there has also been a link between the autoimmune system and glutamate in OCD. While no genes have specifically been linked to this subset of patients, TRAFD1 may be an interesting candidate in this regard.

4.5 Limitations

4.5.1 Discrete vs. Quantitative Linkage Analysis

The phenotypic heterogeneity of OCD presented some unique challenges. In this study, as in previous linkage studies of OCD, a separate analysis was done including a broader OCD phenotype where the individuals with subclinical OCD were coded as affected. Larger families may have individuals with an unclear OCD diagnosis, making interpretation of linkage results difficult. An alternative option is to use a quantitative trait approach, which assigns a score for obsessive-compulsive symptom level. A quantitative trait approach requires that obsessive-compulsive symptoms in individuals be measured on a continuous scale. While not specifically designed for this purpose, the Y-BOCS total scores could have been used as a quantitative trait. Alternatively, our group has developed the Toronto Obsessive Compulsive Scale (Park et al., in preparation), designed to measure OC traits in the population, and we are using the scale to study the genetics of a large community sample. Future linkage studies of OCD should consider the use of quantitative traits.

In this study a slightly modified version of the Y-BOCS was used, the Lifetime Yale-Brown Obsessive Compulsive Scale. This scale allowed for the assessment of the severity of OC symptoms when they were at their worst. In some patients, especially younger patients, the current episode is the worst episode (i.e., present score = lifetime score). Otherwise, the measure was retrospective and may have been less accurate if the worst period was several decades before the time of the interview. Older subjects, such as the grandfather in this study, may have minimized their symptom severity, resulting in misclassification.

4.5.2 Whole Exome Sequencing vs. Whole Genome Sequencing

Whole exome sequencing has a number of limitations. First and foremost, WES does not cover noncoding regions of the genome, and >15% of rare Mendelian diseases have causative variants in noncoding, conserved regulatory regions79. This proportion may be higher in more common complex diseases. These regions may be important in controlling the expression of genes within pathways crucial for OCD. In addition, GC-rich sequences can lead to uneven coverage and incomplete capture because of inherent limitations of WES technologies. This can leave up to 8Mb of the exome without sufficient coverage for variant detection72. Despite these limitations, a large number of rare, protein-altering variants, such as missense or nonsense single-base substitutions or INDELs, are predicted to have functional consequences and/or to be deleterious109. Thus, focusing on these variants was a good starting point for discovering rare variants that may be associated with OCD. Continued development of the technology and the decreasing cost of whole genome sequencing will allow for more comprehensive rare variant studies of OCD.

Copy number variations, inversions, or translocations overlapping with single nucleotide variants identified in patients with the same phenotype can easily be missed by exome sequencing because of the limitations of the technology. This was illustrated in the study by Norton et al., where exome sequencing did not identify the genetic cause in a multiplex family with dilated cardiomyopathy110. Genome-wide copy number analysis, however, did identify the causal gene, which had point mutations in other patients with the same disorder. For this reason a similar approach was taken in this study, which allowed us to investigate CNVs, single nucleotide variants, and small insertions and deletions simultaneously. Despite this, inversions, translocations and INDELs of sizes between the ranges covered by WES and CNV microarrays will be missed by the current technologies. Translocations will also likely be removed when WES reads are aligned to the reference genome. Thus, further development, such as the use of whole genome sequencing, is necessary if we hope to achieve complete genomic interrogation.

4.5.3 Filtering variants

Apart from the technological limitations of whole exome sequencing, the specific pipeline that was used for filtration of variants could have been improved. By selecting two of the most distantly related affected family members we were able to reduce the amount of variation unlikely to be associated with the disorder (variants needed to be present in both affected members). Sequencing additional unaffected individuals may have helped further reduce the number of variants. In addition, we only performed WES in 2 of the affected individuals. It is possible that there are individual-specific variants that contribute to the risk for certain subcategories of OCD or even co-morbidities. While this was not the primary goal of this study, by sequencing only two individuals from this family we would not be able to identify variants that are individual-specific or that contribute to intrafamilial heterogeneity. For example, the SLC1A1 variant was not present in one of the affected individuals. This variant was not initially identified because of the low linkage score in the region.

Current public databases with control data may be neither large enough nor comprehensive enough to effectively exclude common and irrelevant variants across the genome. Specifically, the 1000 genomes database has whole genome coverage of approximately 1000 individuals, but this is further divided among specific ethnicities. Also, most of the high coverage data is only in exonic regions. On the other hand, the exome variant server has whole exome sequencing of 6500 individuals but minimal intergenic/intronic coverage. While these databases were sufficient to filter most of the exonic variants in this family, some of the intronic/intergenic variants found in the family were not covered by these databases, which presented a challenge during their filtration. As we move towards WGS we will have much better intronic/intergenic coverage, and the availability of databases with sufficient coverage of these regions will be important.

5 Conclusions

Our hypothesis was that a rare variant in a single gene was responsible for the OCD in this family. While it is quite possible that this variant was missed because of the methodological or technical limitations explored above, several low/rare frequency variants of importance were identified to segregate with the disorder. Because of the quality criteria chosen during the filtration steps, all 60 variants identified would be expected to have a high validation rate by Sanger sequencing. Nonetheless, the identification of several variants in the glutamate pathway suggests that a single gene may not be responsible for the disorder in the family; rather, the disorder may result from multiple variants in genes that are part of a single pathway or multiple related pathways. Since the minor allele frequencies of many of these variants were between 0.5% and 5% in population samples such as the 1000 genomes and Exome Variant Server, the potential relevance and impact of these variants on phenotype could be called into question. However, the importance of low-frequency variants has been demonstrated by nonsense mutations of PCSK9, which have led to new therapeutics for decreasing LDL cholesterol111, 112. In the familial context, one can envision several mildly deleterious variants accumulating in a family until their additive effect reaches the threshold needed to create sufficient risk for manifesting the disease. Future studies should investigate the polygenic effect of rare variants.

Here, we showed that by using a combined approach of linkage analysis, whole genome CNV analysis and WES in an extended pedigree, we can identify rare variants potentially causal for the disorder. We believe that using an unbiased approach such as the one used in this study was fruitful in identifying variants with a high likelihood of being associated with OCD in an individual or family setting. Using both a parametric and a non-parametric linkage design along with screening candidate genes also had distinct advantages over using either alone.

While preliminary, the results of this study support the role of rare variants in OCD and, more specifically, the role of glutamate. We believe this is an important first step towards the identification of rare causal variants in OCD. Applying our approach to additional families will expand our knowledge of the genetics of OCD and might one day provide alternate targets for drug development. More immediately, however, it may help with genetic counseling strategies for these families. For example, Costain et al. evaluated the need for and efficacy of a contemporary genetic counseling protocol for individuals with schizophrenia and clinically confirmed pathogenic CNVs likely to be associated with schizophrenia113. After genetic counseling, patients reported a significant improvement in understanding the empiric recurrence risk (P = 0.0007) and a decrease in associated concern (P = 0.0103). There were also significant gains in subjective (P = 0.0007) and objective (P = 0.0103) knowledge, and reductions in internalized stigma (P = 0.0111) and self-blame (P = 0.0401). Thus, clinically relevant variants identified in OCD through a study design such as this one could have tremendous psychosocial benefits for patients.

6 Future Directions

The first analyses of personal genome sequences were performed on Craig Venter’s and James Watson’s DNA. In these seemingly healthy individuals, several potentially disease-causing variants were identified. Thus, many of the variants identified, even if predicted to be disease-causing, may in fact be false positives109. Screening additional cases and controls for a particular variant is one way to distinguish true hits from false positives. However, the likelihood that the same rare variant will be identified in additional cases is low because of the frequency of the variant. Burden-based tests allow for aggregation of multiple potentially damaging variants within genes114. These tests not only compare the frequency of a variant between cases and controls but also assign relative weights to variants within genes and test for association by comparing the combined impact of all the variants within a particular gene. In addition to ranking variants within individuals, CADD scores may be useful in assigning weights to variants in burden-based tests. Regardless of the approach used, rare variant association studies require large sample sizes, which inevitably will require collaboration of investigators across multiple sites, much in the same way that common-variant genome-wide association studies did. The design of rare variant association studies is an active area of research.
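As a rough illustration of the idea of a CADD-weighted burden test, the sketch below sums each individual's rare alleles within one gene, weighted by CADD Phred score, and compares cases with controls by permutation. It is a simplified toy, not one of the published burden tests: the genotypes are hypothetical, the two weights are simply the CADD Phred scores of the two DLG1 variants in Table 12, and the one-sided permutation test on the difference in mean burden is an assumed choice of statistic.

```python
import random
from statistics import mean

def burden_scores(genotypes, weights):
    """Per-individual burden: sum of weight * alternate-allele count (0, 1 or 2)."""
    return [sum(w * g for w, g in zip(weights, sample)) for sample in genotypes]

def permutation_p(case_scores, control_scores, n_perm=10_000, seed=0):
    """One-sided permutation P value for a higher mean burden in cases."""
    rng = random.Random(seed)
    observed = mean(case_scores) - mean(control_scores)
    pooled = list(case_scores) + list(control_scores)
    n_case = len(case_scores)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign case/control labels
        diff = mean(pooled[:n_case]) - mean(pooled[n_case:])
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero
    return (hits + 1) / (n_perm + 1)

# Weights: CADD Phred scores of the two DLG1 variants listed in Table 12
weights = [20.6, 4.846]

# Hypothetical genotypes (rows = individuals, columns = the two variants)
case_genotypes = [[1, 0], [1, 1], [0, 0], [1, 0]]
control_genotypes = [[0, 0], [0, 1], [0, 0], [0, 0]]

case_scores = burden_scores(case_genotypes, weights)
control_scores = burden_scores(control_genotypes, weights)
p = permutation_p(case_scores, control_scores)
print(case_scores, control_scores, p)
```

With only a handful of individuals the permutation P value has very coarse resolution, which reflects the point made above that rare variant association studies need large samples.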

One of our primary goals is to screen the top hit genes identified through exome sequencing in this family within a larger set of OCD cases and controls. Eight of the top hit missense variants are present on the Illumina Human Core Exome microarray, a variation of which is currently being used by the Psychiatric Genomics Consortium to genotype a large number of OCD and other psychiatric cases. It will be important to see whether additional individuals with psychiatric conditions harbor these variants independently, a combination of these variants together, or even other rare variants within these top hit genes. As mentioned before, we also have a concurrent study being conducted on obsessive-compulsive (OC) traits in a community sample. In this study we have acquired DNA and phenotypic information on OC traits using the Toronto Obsessive Compulsive Scale in 17,263 children. We hope to screen this community sample for the top hit genes as well, so that it can serve as an additional replication cohort.

While several of the top hit variants were implicated in the glutamate pathway, without functional studies it is difficult to make the leap to causation of the disorder. Magnetic resonance spectroscopy (MRS) is a noninvasive imaging method that provides spectroscopic data from which metabolite levels in a particular region of the brain can be inferred. MRS studies in OCD have identified differences in glutamate levels within the brains of children with OCD115, 116. Another of our primary goals is to investigate the neuroimaging data in this family in order to determine whether there are differences in glutamate levels in the brain. Individuals in the family have undergone both MRI and MRS, and so we hope to compare glutamate levels in the brain between the affected and unaffected members within the same family. We also hope to compare glutamate levels between this family and other families who may have rare variants within the same pathway, and families that are unaffected.

Transfection assays in cells where these genes are expressed can help elucidate the precise biological mechanism involved in pathogenesis. This was done previously with the SLC1A1 rare variant using targeted mutagenesis of plasmids containing the gene followed by transfection into human embryonic kidney cell lines39. In that study, significant differences in glutamate transport were identified following transfection, suggesting that the same effect might be seen in neurons where the protein is damaged. We will be performing the same assay for the SLC1A1 variant identified in this study in order to determine whether glutamate transport is affected by the heterozygous mutation. Knockout mice can also be helpful in determining the effect on behaviour when a particular gene is deleted. The new CRISPR technology allows for the replacement of single variants, which may also be helpful in elucidating the functional effect of these rare SNPs in both cell lines and animal models117.

Finally, we hope to subject additional families to whole genome sequencing in a similar manner in order to identify additional rare variants that may be associated with OCD. Given the rapidly decreasing cost of whole genome sequencing, we hope to move towards this technology, as a considerable amount of the genome is missed by sequencing the exome alone. Sequencing additional families may further support the role of glutamate in OCD, or implicate other related or even unrelated pathways. Common variants may modulate the effect of rare variants in individuals, and the genes and pathways identified as important through these rare variant studies will serve to better inform genome-wide association studies. Hypothesis-driven genome-wide association studies, for example, have helped identify new SNPs in genes within pathways associated with lung disease or meconium ileus in people with cystic fibrosis118.

7 References

1. Flament MF, Whitaker A, Rapoport JL, Davies M, Berg CZ, Kalikow K et al. Obsessive compulsive disorder in adolescence: an epidemiological study. J Am Acad Child Adolesc Psychiatry 1988; 27: 764–771.

2. Ruscio AM, Stein DJ, Chiu WT, Kessler RC. The epidemiology of obsessive-compulsive disorder in the National Comorbidity Survey Replication. Mol Psychiatry 2010; 15: 53–63.

3. Leckman JF, Denys D, Simpson HB, Mataix-Cols D, Hollander E, Saxena S et al. Obsessive-compulsive disorder: a review of the diagnostic criteria and possible subtypes and dimensional specifiers for DSM-V. Depress Anxiety 2010; 27: 507–527.

4. American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5). 2013.

5. Goodman WK, Price LH, Rasmussen SA, Mazure C, Fleischmann RL, Hill CL et al. The Yale-Brown Obsessive Compulsive Scale. I. Development, use, and reliability. Arch Gen Psychiatry 1989; 46: 1006–1011.

6. Pallanti S, Grassi G, Sarrecchia ED, Cantisani A, Pellegrini M. Obsessive-compulsive disorder comorbidity: clinical assessment and therapeutic implications. Front Psychiatry 2011; 2: 70.

7. Denys D, Tenney N, van Megen HJGM, de Geus F, Westenberg HGM. Axis I and II comorbidity in a large sample of patients with obsessive-compulsive disorder. J Affect Disord 2004; 80: 155–162.

8. Taylor S. Early versus late onset obsessive-compulsive disorder: evidence for distinct subtypes. Clin Psychol Rev 2011; 31: 1083–1100.

9. van Grootheest DS, Cath DC, Beekman AT, Boomsma DI. Twin studies on obsessive-compulsive disorder: a review. Twin Res Hum Genet 2005; 8: 450–458.

10. Taylor S. Etiology of obsessions and compulsions: a meta-analysis and narrative review of twin studies. Clin Psychol Rev 2011; 31: 1361–1372.

11. Bolton D, Rijsdijk F, O'Connor TG, Perrin S, Eley TC. Obsessive-compulsive disorder, tics and anxiety in 6-year-old twins. Psychol Med 2007; 37: 39–48.

12. Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet 2011; 88: 76–82.

13. Davis LK, Yu D, Keenan CL, Gamazon ER, Konkashbaev AI, Derks EM et al. Partitioning the heritability of Tourette syndrome and obsessive compulsive disorder reveals differences in genetic architecture. PLoS Genet 2013; 9: e1003864.

14. Pauls DL, Abramovitch A, Rauch SL, Geller DA. Obsessive-compulsive disorder: an integrative genetic and neurobiological perspective. Nat Rev Neurosci 2014; 15: 410–424.

15. Pauls DL. The genetics of obsessive-compulsive disorder: a review. Dialogues in Clinical Neuroscience 2010; 12: 149.

16. McGuffin P, Huckle P. Simulation of Mendelism revisited: the recessive gene for attending medical school. Am J Hum Genet 1990; 46: 994–999.

17. Lander E, Kruglyak L. Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results. Nat Genet 1995; 11: 241–247.

18. Hanna GL, Veenstra-VanderWeele J, Cox NJ, Boehnke M, Himle JA, Curtis GC et al. Genome-wide linkage analysis of families with obsessive-compulsive disorder ascertained through pediatric probands. Am J Med Genet 2002; 114: 541–552.

19. Willour VL, Yao Shugart Y, Samuels J, Grados M, Cullen B, Bienvenu OJ et al. Replication study supports evidence for linkage to 9p24 in obsessive-compulsive disorder. Am J Hum Genet 2004; 75: 508–513.

20. Hanna GL, Veenstra-VanderWeele J, Cox NJ, Van Etten M, Fischer DJ, Himle JA et al. Evidence for a susceptibility locus on chromosome 10p15 in early-onset obsessive-compulsive disorder. Biol Psychiatry 2007; 62: 856–862.

21. Samuels JF, Riddle MA, Greenberg BD, Fyer AJ, McCracken JT, Rauch SL et al. The OCD collaborative genetics study: methods and sample description. Am J Med Genet B Neuropsychiatr Genet 2006; 141B: 201–207.

22. Whittemore AS. Genome scanning for linkage: an overview. Am J Hum Genet 1996; 59: 704–716.

23. Mathews CA, Badner JA, Andresen JM, Sheppard B, Himle JA, Grant JE et al. Genome-wide Linkage Analysis of Obsessive-Compulsive Disorder Implicates Chromosome 1p36. Biol Psychiatry 2012;

24. Suarez BK, Hampe CL. Linkage and association. Am J Hum Genet 1994;

25. Badner JA, Gershon ES, Goldin LR. Optimal ascertainment strategies to detect linkage to common disease alleles. Am J Hum Genet 1998; 63: 880–888.

26. Ross J, Badner J, Garrido H, Sheppard B, Chavira DA, Grados M et al. Genomewide linkage analysis in Costa Rican families implicates chromosome 15q14 as a candidate region for OCD. Hum Genet 2011; 130: 795–805.

27. Shugart YY, Samuels J, Willour VL, Grados MA, Greenberg BD, Knowles JA et al. Genomewide linkage scan for obsessive-compulsive disorder: evidence for susceptibility loci on chromosomes 3q, 7p, 1q, 15q, and 6q. Mol Psychiatry 2006; 11: 763–770.

28. John S, Shephard N, Liu G, Zeggini E, Cao M, Chen W et al. Whole-genome scan, in a complex disease, using 11,245 single-nucleotide polymorphisms: comparison with microsatellites. Am J Hum Genet 2004; 75: 54–64.

29. Ulgen A, Li W. Comparing single-nucleotide polymorphism marker-based and microsatellite marker-based linkage analyses. BMC Genet 2005; 6 Suppl 1: S13.

30. Wu K, Hanna GL, Rosenberg DR, Arnold PD. The role of glutamate signaling in the pathogenesis and treatment of obsessive-compulsive disorder. Pharmacol Biochem Behav 2012; 100: 726–735.

31. Evangelou E, Trikalinos TA, Salanti G, Ioannidis JPA. Family-based versus unrelated case-control designs for genetic associations. PLoS Genet 2006; 2: e123.

32. Mitchell AA, Cutler DJ, Chakravarti A. Undetected genotyping errors cause apparent overtransmission of common alleles in the transmission/disequilibrium test. Am J Hum Genet 2003; 72: 598–610.

33. Ioannidis J. Genetic associations: false or true? Trends in molecular medicine 2003;

34. Stewart SE, Mayerfeld C, Arnold PD, Crane JR, O'Dushlaine C, Fagerness JA et al. Meta-analysis of association between obsessive-compulsive disorder and the 3' region of neuronal glutamate transporter gene SLC1A1. Am J Med Genet B Neuropsychiatr Genet 2013;

35. Dickel DE, Veenstra-VanderWeele J, Cox NJ, Wu X, Fischer DJ, Van Etten-Lee M et al. Association testing of the positional and functional candidate gene SLC1A1/EAAC1 in early-onset obsessive-compulsive disorder. Arch Gen Psychiatry 2006; 63: 778–785.

36. Wang Y, Adamczyk A, Shugart YY, Samuels JF, Grados MA, Greenberg BD et al. A screen of SLC1A1 for OCD-related alleles. Am J Med Genet B Neuropsychiatr Genet 2010; 153B: 675–679.

37. Arnold PD, Sicard T, Burroughs E, Richter MA, Kennedy JL. Glutamate transporter gene SLC1A1 associated with obsessive-compulsive disorder. Arch Gen Psychiatry 2006; 63: 769–776.

38. Wendland JR, Moya PR, Timpano KR, Anavitarte AP, Kruse MR, Wheaton MG et al. A haplotype containing quantitative trait loci for SLC1A1 and its association with obsessive-compulsive disorder. Arch Gen Psychiatry 2009; 66: 408–416.

39. Veenstra-VanderWeele J, Xu T, Ruggiero AM, Anderson LR, Jones ST, Himle JA et al. Functional studies and rare variant screening of SLC1A1/EAAC1 in males with obsessive–compulsive disorder. Psychiatric Genetics 2012; 22: 256–260.

40. Rees E, Walters JTR, Chambert KD, O'Dushlaine C, Szatkiewicz J, Richards AL et al. CNV analysis in a large schizophrenia sample implicates deletions at 16p12.1 and SLC1A1 and duplications at 1p36.33 and CGNL1. Hum Mol Genet 2013;

41. Stewart SE, Yu D, Scharf JM, Neale BM, Fagerness JA, Mathews CA et al. Genome-wide association study of obsessive-compulsive disorder. Mol Psychiatry 2012;

42. Welch JM, Lu J, Rodriguiz RM, Trotta NC, Peca J, Ding J-D et al. Cortico-striatal synaptic defects and OCD-like behaviours in Sapap3-mutant mice. Nature 2007; 448: 894–900.

43. Mattheisen M, Samuels JF, Wang Y, Greenberg BD, Fyer AJ, McCracken JT et al. Genome-wide association study in obsessive-compulsive disorder: results from the OCGAS. Mol Psychiatry 2014;

44. Dunah AW, Hueske E, Wyszynski M, Hoogenraad CC, Jaworski J, Pak DT et al. LAR receptor protein tyrosine phosphatases in the development and maintenance of excitatory synapses. Nat Neurosci 2005; 8: 458–467.

45. Takahashi H, Katayama K-I, Sohya K, Miyamoto H, Prasad T, Matsumoto Y et al. Selective control of inhibitory synapse development by Slitrk3-PTPδ trans-synaptic interaction. Nat Neurosci 2012; 15: 389–398.

46. Züchner S, Wendland JR, Ashley-Koch AE, Collins AL, Tran-Viet KN, Quinn K et al. Multiple rare SAPAP3 missense variants in trichotillomania and OCD. Mol Psychiatry 2009; 14: 6–9.

47. Schirmbeck F, Zink M. Comorbid obsessive-compulsive symptoms in schizophrenia: contributions of pharmacological and genetic factors. Front Pharmacol 2013; 4: 99.

48. Li J-M, Lu C-L, Cheng M-C, Luu S-U, Hsu S-H, Chen C-H. Exonic resequencing of the DLGAP3 gene as a candidate gene for schizophrenia. Psychiatry Res 2013; 208: 84–87.

49. Dodman NH, Karlsson EK, Moon-Fanelli A, Galdzicka M, Perloski M, Shuster L et al. A canine chromosome 7 locus confers compulsive disorder susceptibility. Mol Psychiatry 2010; 15: 8–10.

50. Redies C, Hertel N, Hübner CA. Cadherins and neuropsychiatric disorders. Brain research 2012;

51. Moya PR, Dodman NH, Timpano KR, Rubenstein LM, Rana Z, Fried RL et al. Rare missense neuronal cadherin gene (CDH2) variants in specific obsessive-compulsive disorder and Tourette disorder phenotypes. Eur J Hum Genet 2013; 21: 850–854.

52. Tang R, Noh H, Wang D, Sigurdsson S, Swofford R, Perloski M et al. Candidate genes and functional noncoding variants identified in a canine model of obsessive-compulsive disorder. Genome Biol 2014; 15: R25.

53. Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD et al. Global variation in copy number in the human genome. Nature 2006; 444: 444–454.

54. Delorme R, Moreno-De-Luca D, Gennetier A, Maier W, Chaste P, Mössner R et al. Search for copy number variants in chromosomes 15q11-q13 and 22q11.2 in obsessive compulsive disorder. BMC Med Genet 2010; 11: 100.

55. Walitza S, Bové DS, Romanos M, Renner T, Held L, Simons M et al. Pilot study on HTR2A promoter polymorphism, -1438G/A (rs6311) and a nearby copy number variation showed association with onset and severity in early onset obsessive-compulsive disorder. J Neural Transm 2012; 119: 507–515.

56. Grisham JR, Fullana MA, Mataix-Cols D, Moffitt TE, Caspi A, Poulton R. Risk factors prospectively associated with adult obsessive-compulsive symptom dimensions and obsessive-compulsive disorder. Psychol Med 2011; 41: 2495–2506.

57. Swedo SE, Leonard HL, Garvey M, Mittleman B, Allen AJ, Perlmutter S et al. Pediatric autoimmune neuropsychiatric disorders associated with streptococcal infections: clinical description of the first 50 cases. Am J Psychiatry 1998; 155: 264–271.

58. Singer HS, Gilbert DL, Wolf DS, Mink JW, Kurlan R. Moving from PANDAS to CANS. J Pediatr 2012; 160: 725–731.

59. Swedo SE, Leonard HL, Rapoport JL. The pediatric autoimmune neuropsychiatric disorders associated with streptococcal infection (PANDAS) subgroup: separating fact from fiction. Pediatrics 2004; 113: 907–911.

60. Bhattacharyya S, Khanna S, Chakrabarty K, Mahadevan A, Christopher R, Shankar SK. Anti-brain autoantibodies and altered excitatory neurotransmitters in obsessive-compulsive disorder. Neuropsychopharmacology 2009; 34: 2489–2496.

61. McClellan J, King M-C. Genetic heterogeneity in human disease. Cell 2010; 141: 210–217.

62. Walsh T, King M-C. Ten genes for inherited breast cancer. Cancer Cell. 2007; 11: 103–105.

63. Chubb JE, Bradshaw NJ, Soares DC, Porteous DJ, Millar JK. The DISC locus in psychiatric illness. Mol Psychiatry 2008; 13: 36–64.

64. Dror AA, Avraham KB. Hearing loss: mechanisms revealed by genetics and cell biology. Annu Rev Genet 2009; 43: 411–437.

65. Bodmer W, Bonilla C. Common and rare variants in multifactorial susceptibility to common diseases. Nat Genet 2008; 40: 695–701.

66. Schork NJ, Murray SS, Frazer KA, Topol EJ. Common vs. rare allele hypotheses for complex diseases. Curr Opin Genet Dev 2009; 19: 212–219.

67. Cook EH, Scherer SW. Copy-number variations associated with neuropsychiatric conditions. Nature 2008; 455: 919–923.

68. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ et al. Finding the missing heritability of complex diseases. Nature 2009; 461: 747–753.

69. Bamshad MJ, Ng SB, Bigham AW, Tabor HK, Emond MJ, Nickerson DA et al. Exome sequencing as a tool for Mendelian disease gene discovery. Nat Rev Genet 2011; 12: 745–755.

70. Foo J-N, Liu J-J, Tan E-K. Whole-genome and whole-exome sequencing in neurological diseases. Nat Rev Neurol 2012; 8: 508–517.

71. Teer JK, Mullikin JC. Exome sequencing: the sweet spot before whole genomes. Hum Mol Genet 2010; 19: R145–R151.

72. Hedges DJ, Guettouche T, Yang S, Bademci G, Diaz A, Andersen A et al. Comparison of three targeted enrichment strategies on the SOLiD sequencing platform. PLoS ONE 2011; 6: e18595.

73. Altmann A, Weber P, Bader D, Preuss M, Binder EB, Müller-Myhsok B. A beginners guide to SNP calling from high-throughput DNA-sequencing data. Hum Genet 2012; 131: 1541–1554.

74. Davydov EV, Goode DL, Sirota M, Cooper GM. Identifying a High Fraction of the Human Genome to be under Selective Constraint Using GERP. PLoS computational … 2010;

75. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 2010; 38: e164.

76. Kircher M, Witten DM, Jain P, O'Roak BJ, Cooper GM, Shendure J. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet 2014; 46: 310–315.

77. Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature 2009; 461: 272–276.

78. Ng SB, Buckingham KJ, Lee C, Bigham AW, Tabor HK, Dent KM et al. Exome sequencing identifies the cause of a mendelian disorder. Nat Genet 2010; 42: 30–35.

79. Zhang X. Exome sequencing greatly expedites the progressive research of Mendelian diseases. Front Med 2014;

80. Gilissen C, Hoischen A, Brunner HG, Veltman JA. Disease gene identification strategies for exome sequencing. Eur J Hum Genet 2012; 20: 490–497.

81. Hooper SD, Johansson ACV, Tellgren-Roth C, Stattin E-L, Dahl N, Cavelier L et al. Genome-wide sequencing for the identification of rearrangements associated with Tourette syndrome and obsessive-compulsive disorder. BMC Med Genet 2012; 13: 123.

82. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D et al. PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. The American Journal of Human Genetics 2007; 81: 559–575.

83. International HapMap 3 Consortium, Altshuler DM, Gibbs RA, Peltonen L et al. Integrating common and rare genetic variation in diverse human populations. Nature 2010; 467: 52–58.

84. Abecasis GR, Cherny SS, Cookson WO, Cardon LR. Merlin--rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet 2002; 30: 97–101.

85. O'Connell JR, Weeks DE. PedCheck: a program for identification of genotype incompatibilities in linkage analysis. Am J Hum Genet 1998; 63: 259–266.

86. Sengul H, Weeks DE, Feingold E. A survey of affected-sibship statistics for nonparametric linkage analysis. Am J Hum Genet 2001; 69: 179–190.

87. Kong A, Cox NJ. Allele-sharing models: LOD scores and accurate linkage tests. Am J Hum Genet 1997; 61: 1179–1188.

88. Thiele H, Nürnberg P. HaploPainter: a tool for drawing pedigrees with complex haplotypes. Bioinformatics 2005; 21: 1730–1732.

89. 1000 Genomes Project Consortium, Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM et al. A map of human genome variation from population-scale sequencing. Nature 2010; 467: 1061–1073.

90. Kumar P, Henikoff S, Ng PC. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protoc 2009; 4: 1073–1081.

91. Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P et al. A method and server for predicting damaging missense mutations. Nat Methods 2010; 7: 248–249.

92. Chun S, Fay JC. Identification of deleterious mutations within three human genomes. Genome Res 2009; 19: 1553–1561.

93. Schwarz JM, Rödelsperger C, Schuelke M, Seelow D. MutationTaster evaluates disease-causing potential of sequence alterations. Nat Methods 2010; 7: 575–576.

94. Reva B, Antipin Y, Sander C. Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res 2011; 39: e118.

95. Shihab HA, Gough J, Cooper DN, Stenson PD, Barker GLA, Edwards KJ et al. Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models. Hum Mutat 2013; 34: 57–65.

96. Siepel A, Pollard KS, Haussler D. New Methods for Detecting Lineage-Specific Selection. Research in Computational Molecular … 2006;

97. Do R, Kathiresan S, Abecasis GR. Exome sequencing and complex disease: practical aspects of rare variant association studies. Hum Mol Genet 2012; 21: R1–R9.

98. Sanada T, Takaesu G, Mashima R, Yoshida R, Kobayashi T, Yoshimura A. FLN29 deficiency reveals its negative regulatory role in the Toll-like receptor (TLR) and retinoic acid-inducible gene I (RIG-I)-like helicase signaling pathway. J Biol Chem 2008; 283: 33858–33864.

99. Curtis D, Kalsi G, Brynjolfsson J, McInnis M, O'Neill J, Smyth C et al. Genome scan of pedigrees multiply affected with bipolar disorder provides further support for the presence of a susceptibility locus on chromosome 12q23-q24, and suggests the presence of additional loci on 1p and 1q. Psychiatric Genetics 2003; 13: 77–84.

100. Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D et al. The UCSC Table Browser data retrieval tool. Nucleic Acids Res 2004; 32: D493–D496.

101. Marshall CR, Noor A, Vincent JB, Lionel AC, Feuk L, Skaug J et al. Structural variation of chromosomes in autism spectrum disorder. Am J Hum Genet 2008; 82: 477–488.

102. Fourie C, Li D, Montgomery JM. The anchoring protein SAP97 influences the trafficking and localisation of multiple membrane channels. Biochim Biophys Acta 2014; 1838: 589–594.

103. Valtschanoff JG, Burette A, Davare MA, Leonard AS, Hell JW, Weinberg RJ. SAP97 concentrates at the postsynaptic density in cerebral cortex. Eur J Neurosci 2000; 12: 3605–3614.

104. Takeuchi M. SAPAPs. A family of PSD-95/SAP90-associated proteins localized at postsynaptic density. Journal of Biological Chemistry 1997; 272: 11943–11951.

105. Kang HJ, Kawasawa YI, Cheng F, Zhu Y, Xu X, Li M et al. Spatio-temporal transcriptome of the human brain. Nature 2011; 478: 483–489.

106. Anttila V, Stefansson H, Kallela M, Todt U, Terwindt GM, Calafato MS et al. Genome-wide association study of migraine implicates a common susceptibility variant on 8q22.1. Nat Genet 2010; 42: 869–873.

107. Curone M, Tullo V, Lovati C, Proietti-Cecchini A, D'Amico D. Prevalence and profile of obsessive-compulsive trait in patients with chronic migraine and medication overuse. Neurol Sci 2014; 35 Suppl 1: 185–187.

108. Porton B, Greenberg BD, Askland K, Serra LM, Gesmonde J, Rudnick G et al. Isoforms of the neuronal glutamate transporter gene, SLC1A1/EAAC1, negatively modulate glutamate uptake: relevance to obsessive-compulsive disorder. Transl Psychiatry 2013; 3: e259.

109. Xue Y, Chen Y, Ayub Q, Huang N, Ball EV. Deleterious- and Disease-Allele Prevalence in Healthy Individuals: Insights from Current Predictions, Mutation Databases, and Population-Scale Resequencing. The American Journal of … 2012;

110. Norton N, Li D, Rieder MJ, Siegfried JD, Rampersaud E, Zuchner S et al. Genome-wide Studies of Copy Number Variation and Exome Sequencing Identify Rare Variants in BAG3 as a Cause of Dilated Cardiomyopathy. The American Journal of Human Genetics 2011; 88: 273–282.

111. Cohen J, Pertsemlidis A, Kotowski IK, Graham R, Garcia CK, Hobbs HH. Low LDL cholesterol in individuals of African descent resulting from frequent nonsense mutations in PCSK9. Nat Genet 2005; 37: 161–165.

112. Cohen JC, Boerwinkle E, Mosley TH, Hobbs HH. Sequence variations in PCSK9, low LDL, and protection against coronary heart disease. N Engl J Med 2006; 354: 1264–1272.

113. Costain G, Esplen MJ, Toner B, Scherer SW, Meschino WS, Hodgkinson KA et al. Evaluating genetic counseling for individuals with schizophrenia in the molecular age. Schizophr Bull 2014; 40: 78–87.

114. Lee S, Wu MC, Lin X. Optimal tests for rare variant effects in sequencing association studies. Biostatistics 2012; 13: 762–775.

115. Rosenberg DR, MacMaster FP, Keshavan MS, Fitzgerald KD, Stewart CM, Moore GJ. Decrease in caudate glutamatergic concentrations in pediatric obsessive-compulsive disorder patients taking paroxetine. J Am Acad Child Adolesc Psychiatry 2000; 39: 1096–1103.

116. Rosenberg DR, Mirza Y, Russell A, Tang J, Smith JM, Banerjee SP et al. Reduced anterior cingulate glutamatergic concentrations in childhood OCD and major depression versus healthy controls. J Am Acad Child Adolesc Psychiatry 2004; 43: 1146–1153.

117. Sander JD, Joung JK. CRISPR-Cas systems for editing, regulating and targeting genomes. Nat Biotechnol 2014; 32: 347–355.

118. Sun L, Rommens JM, Corvol H, Li W, Li X, Chiang TA et al. Multiple apical plasma membrane constituents are associated with susceptibility to meconium ileus in individuals with cystic fibrosis. Nat Genet 2012; 44: 562–569.
