CIS-FEATURES MEDIATING CAG/CTG REPEAT INSTABILITY, THE SATELLOG DATABASE, AND CANDIDATE REPEAT PRIORITIZATION IN SCHIZOPHRENIA

by

PERSEUS IOANNIS MISSIRLIS

B.Sc.H., Queen's University, 2002

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

in

THE FACULTY OF GRADUATE STUDIES

GENETICS GRADUATE PROGRAM

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA

August 2004

© Perseus Ioannis Missirlis, 2004

THE UNIVERSITY OF BRITISH COLUMBIA
FACULTY OF GRADUATE STUDIES

Library Authorization

In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Name of Author (please print): Perseus Ioannis Missirlis
Date (dd/mm/yyyy): 18/08/2004

Title of Thesis: CIS-FEATURES MEDIATING CAG/CTG REPEAT INSTABILITY, THE SATELLOG DATABASE, AND CANDIDATE REPEAT PRIORITIZATION IN SCHIZOPHRENIA

Degree: Master of Science Year: 2004

Department of Genetics Graduate Program The University of British Columbia Vancouver, BC Canada

ABSTRACT

Polyglutamine repeat expansions in the coding regions of unrelated genes have been implicated in the neurodegenerative phenotype of nine separate diseases. However, little is known about the role of flanking cis-sequences in mediating this repeat instability. Brock et al. identified an association between flanking GC content and CAG/CTG repeat instability at many of these disease loci by using a relative measure of repeat instability called 'expandability'. We have extended the analysis of Brock and colleagues, using the expandability metric to test for associations with other features theorized to contribute to CAG/CTG repeat instability, such as repeat length and purity, proximity to CCCTC-binding factor (CTCF) binding sites, and the nucleosome formation potential of the surrounding DNA. Our results confirmed the earlier relationship between flanking GC content and CAG/CTG repeat instability and also suggest a novel one involving flanking CTCF binding sites. Conversely, no relationships between expandability and repeat length, purity, or nucleosome formation potential were detected.

Anticipation refers to the progressive worsening of a disease phenotype and earlier age of onset in successive generations. Anticipation has been reported in a number of diseases in which repeat expansion may have a role in etiology. We developed Satellog, a database that catalogs all pure repeats with repeat unit lengths of 1-16 bp in the human genome, along with supplementary data of use for the prioritization of repeats in disease association studies. For each pure repeat we calculate the percentile rank of its length relative to other repeats of the same class in the genome, its polymorphism within UniGene clusters, its location either within or adjacent to EnsEMBL-defined genes, and its expression profile in normal tissues according to the GeneNote database. By examining the global repeat polymorphism profile, we found that highly polymorphic coding repeats were mostly restricted to trinucleotide repeats, whereas a wider range of repeat unit lengths was tolerated in untranslated sequence. We also found that 3'-UTR sequence tolerates more repeat polymorphisms than 5'-UTR or exonic sequence. Lastly, we use Satellog to prioritize repeats for disease-association studies in schizophrenia. Satellog is available as a freely downloadable MySQL and web-based database.

TABLE OF CONTENTS

ABSTRACT

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

LIST OF ABBREVIATIONS

ACKNOWLEDGEMENTS

DEDICATION

PREFACE

CHAPTER 1 INTRODUCTION

1.1 cis-Features of unstable CAG/CTG repeats
1.1.1 Unstable repeats and disease
1.1.2 The argument for cis mediators of instability
1.1.2.1 Flanking %GC and CpG islands
1.1.2.2 Repeat length and purity
1.1.2.3 The role of the CTCF insulator
1.1.2.4 The role of nucleosomes
1.1.3 Objectives
1.1.4 Specific aims and rationale
1.2 The unstable repeat perspective of schizophrenia
1.2.1 Biology of schizophrenia
1.2.2 Genetics of schizophrenia
1.2.3 Anticipation in neuropsychiatric diseases
1.2.4 CAG/CTG repeats in schizophrenia
1.2.5 Published satellite repeat analyses and databases
1.2.6 Objectives
1.2.7 Specific aims and rationale

CHAPTER 2 MATERIALS AND METHODS

2.1 cis-Features of unstable CAG/CTG repeats
2.1.1 Collection of candidate CAG/CTG repeats for cis sequence analysis
2.1.2 Software Dependencies I
2.1.3 Implementing the gems_cis database
2.1.4 Overview of the flanker.pl script
2.1.4.1 Collection of flanking %GC, CpG islands, length and purity and other repetitive elements
2.1.4.2 Detection of flanking CTCF insulator protein binding sites
2.1.5 Detection of nucleosome formation potential with NucleoMeter
2.1.6 Statistics and plots with R
2.2 The satellog database
2.2.1 Software Dependencies II
2.2.2 Implementing the satellog database
2.2.3 Preliminary set-up
2.2.3.1 Detecting pure repeats with Tandem Repeats Finder (TRF)
2.2.3.2 Identifying unique repeat classes
2.2.3.3 Preparing expression data from the GeneNote database
2.2.3.4 Detecting repeat polymorphisms within UniGene clusters
2.2.4 Overview of the repeatalyzer.pl script
2.2.5 Generating a measure of repeat length significance
2.2.6 Detection and input of disease-associated repeats
2.3 Prioritizing candidate repeats for disease-association studies in schizophrenia
2.3.1 Input of neuropsychiatric linkage regions into Satellog
2.3.2 Prioritizing candidate repeats with Satellog

CHAPTER 3 RESULTS

3.1 cis-Features of unstable CAG/CTG repeats
3.1.1 Correlation of flanking CAG/CTG repeat features to Brock et al. expandability data
3.1.1.1 Correlation of CpG islands with expandability
3.1.1.2 Correlation of flanking %GC with expandability
3.1.1.3 Correlation of repeat length and purity with expandability
3.1.1.4 Correlation of CTCF binding sites with expandability
3.1.1.5 Correlation of nucleosome formation potential with expandability
3.2 Genomic repeat analysis with the Satellog database
3.2.1 Summary statistics
3.2.2 Characteristics of disease-associated repeats
3.2.3 Characteristics of repeats polymorphic within UniGene clusters
3.2.4 Disease-associated repeats detected in UniGene clusters
3.3 Candidate repeats for typing in schizophrenia and bipolar disorder
3.3.1 Top 20 polymorphic schizophrenia candidate repeats
3.3.2 Top 20 globally prioritized schizophrenia candidate repeats
3.3.3 Top 20 polymorphic bipolar disorder candidate repeats
3.3.4 Top 20 globally prioritized bipolar candidate repeats
3.3.5 Top 20 polymorphic schizophrenia candidate repeats from disease-associated classes
3.3.6 Top 20 globally prioritized schizophrenia candidate repeats from disease-associated classes
3.3.7 Top 20 polymorphic bipolar disorder candidate repeats from disease-associated classes
3.3.8 Top 20 globally prioritized bipolar candidate repeats from disease-associated classes

CHAPTER 4 DISCUSSION

4.1 cis-Features of unstable CAG/CTG repeats
4.1.1 Identifying cis-mediators of instability
4.1.1.1 Association between flanking %GC and instability
4.1.1.2 Association between flanking repeat length, purity and instability
4.1.1.3 Association between flanking CTCF binding sites and instability
4.1.1.4 Association between flanking nucleosome formation and instability
4.1.2 Prioritizing candidate CAG/CTG repeats
4.2 Genomic repeat analysis with the Satellog database
4.3 Repeat prioritization in schizophrenia with Satellog
4.3.1 Top 20 polymorphic schizophrenia candidate repeats
4.3.2 Top 20 globally prioritized schizophrenia candidate repeats
4.3.3 Top 20 polymorphic schizophrenia candidate repeats from disease-associated classes
4.3.4 Top 20 globally prioritized schizophrenia candidate repeats from disease-associated classes
4.4 Conclusions
4.5 Problems encountered and limitations
4.5.1 Brock et al.'s expandability metric
4.5.2 Limitations of the GeneNote dataset
4.5.3 Mapping repeats to UniGene clusters
4.5.4 Prioritizing with p-values
4.5.5 Multiple repeats detected for known diseases
4.6 Future studies
4.6.1 Identifying cis-mediators of instability
4.6.2 Improvements to Satellog
4.6.3 Disease association studies in schizophrenia
4.6.3.1 Specimens for analysis
4.7 Significance

BIBLIOGRAPHY

APPENDIX A
APPENDIX B
APPENDIX C
APPENDIX D
APPENDIX E
APPENDIX F
APPENDIX G
APPENDIX H
APPENDIX I
APPENDIX J
APPENDIX K
APPENDIX L
APPENDIX M
APPENDIX N
APPENDIX O
APPENDIX P
APPENDIX Q
APPENDIX R

LIST OF TABLES

Table 1: Genetic anticipation in schizophrenia; summary of linkage studies from 1996-1999 (adapted from Vincent et al., 2000)

Table 2: All unstable and candidate CAG/CTG repeat-containing genes located within a CpG island. 'Start' and 'End' columns refer to start and end co-ordinates of the CpG island relative to the 50 Mb slice of genomic sequence flanking the CAG/CTG repeat (i.e. the CAG/CTG repeat starts at 50,000).

Table 3: All CAG/CTG repeat-containing genes with 100 bp of flanking sequence having %GC at least equal to that of HD. The '100_bp' column summarizes the G+C fraction of 100 bp flanking the CAG/CTG repeat

Table 4: All CTCF binding sites with an HMMer score greater than 1 that are within 1,000 bp of a CAG/CTG repeat

Table 5: All CTCF binding sites with an HMMer score between 0 and 1 within 1,000 bp of a CAG/CTG repeat. These may represent true CTCF sites because of the binding degeneracy of CTCF

Table 6: The ten most unstable coding repeats organized by descending standard deviation. Repeats highlighted in bold are known disease-associated repeats

Table 7: The ten most unstable untranslated repeats organized by descending standard deviation. No disease-associated repeats are present in this sample

Table 8: Top 20 polymorphic schizophrenia candidate repeats

Table 9: Top 20 globally prioritized schizophrenia candidate repeats

Table 10: Top 20 polymorphic bipolar disorder candidate repeats

Table 11: Top 20 globally prioritized bipolar candidate repeats

Table 12: Top 20 polymorphic schizophrenia candidate repeats from disease-associated classes

Table 13: Top 20 globally prioritized schizophrenia candidate repeats from disease-associated classes

Table 14: Top 20 polymorphic bipolar disorder candidate repeats from disease-associated classes

Table 15: Top 20 globally prioritized bipolar candidate repeats from disease-associated classes

Table 16: Summary of expandable CAG loci and candidate CAG/CTG repeat-containing genes with at least one feature associated with repeat expandability. %GC (100 bp) refers to the %GC of 100 bp flanking the CAG/CTG repeat. CTCF scores and absolute distance (i.e. either upstream or downstream) relative to the repeat are listed. Multiple CTCF hits are separated by commas

Table 17: Summary of disease-associated repeats from Cleary and Pearson, 2003 as detected in Satellog. Each disease is associated with one or more repeat co-ordinates

Table 18: Summary of schizophrenia and bipolar disorder linkage regions from Sklar (2002). This table summarizes the linkage studies in the paper and includes the cytogenetic band and genetic marker (with co-ordinates) of each study cited in the review. The ref column refers to the PubMed ID of each linkage study. This represents a portion of the linkage table in Satellog.

LIST OF FIGURES

Figure 1: Flowchart outlining how the candidate GeMS list was populated

Figure 2: Schema for gems_cis database

Figure 3: Flowchart for flanker.pl, a Perl script designed to analyze the cis-elements flanking disease-associated and candidate CAG/CTG repeats

Figure 4: Alignment of experimentally identified CTCF binding sites used to build the HMM. Sequences are from DM in human, the chicken β-globin FN, mouse H19 DMD4 and DMD7, chicken myc FV, and human MYCA. Bold nucleotides are essential contact guanosines; grey bars highlight inter-site conservation (adapted from Filippova et al., 2001)

Figure 5: Satellog database schema

Figure 6: a-c) Correlation between ranked median expandability and ranked %GC in 100 bp, 500 bp, and 1,000 bp flanking unstable CAG/CTG repeats (Brock et al., 1999). d) Spearman's rank correlation (rho) of median expandability and %GC of 50 bp, 100 bp, 500 bp, 1,000 bp, 1,500 bp, 2,000 bp, 2,500 bp, 3,000 bp, 3,500 bp, 4,000 bp, 4,500 bp, and 5,000 bp of sequence flanking the CAG/CTG repeat. The P-value (P) is the probability of observing Spearman's coefficient (rho) by chance

Figure 7: Histogram of %GC of 100 bp flanking the repeat in candidate CAG/CTG repeat sequences. Red bar indicates the sole gene (SCA7) with %GC content achieving statistical significance based on z-score within this distribution

Figure 8: No correlation was observed between ranked expandability and ranked repeat length. Repeat length is defined as the absolute length of the repeat, irrespective of purity, defined by Tandem Repeats Finder co-ordinates. The P-value (P) is the probability of observing Spearman's coefficient (rho) by chance

Figure 9: Correlation between ranked expandability and ranked CAG/CTG repeat purity. No correlation was observed between ranked expandability and ranked CAG/CTG repeat purity. CAG/CTG repeat purity is defined as the longest contiguous stretch of the repeat unit specified in Tandem Repeats Finder. The P-value (P) is the probability of observing Spearman's coefficient (rho) by chance

Figure 10: a) Plot of ranked median expandability against ranked score of computationally detected CTCF binding sites (from HMMer) in 5,000 bp flanking the CAG/CTG repeats known to be unstable (Brock et al., 1999). b) Plot of ranked median expandability against ranked distance of CTCF binding site in 5,000 bp flanking the CAG/CTG repeat of genes known to be unstable (Brock et al., 1999). c) Correlation between ranked distance from the CAG/CTG repeat and ranked score of hits. Each point represents a CTCF binding site. The P-value (P) is the probability of observing Spearman's coefficient (rho) by chance

Figure 11: Genomic distribution of repeat lengths of all repeat classes associated with disease. A repeat class represents all repeat variations of a given repeat unit (i.e. CAG, AGC, GCA, GTC, TGC, CTG)

Figure 12: Polymorphic repeats make up a tiny portion of all pure repeats detected in Satellog. Approximately half of all the 111,950 transcribed repeats were mapped to UniGene clusters, but only 5,546 or 0.07 % of all repeats were detected as polymorphic within UniGene clusters

Figure 13: Median standard deviations (line through box) of polymorphic repeats detected in exonic, 5'-UTR, and 3'-UTR sequence. Median exonic and 5'-UTR standard deviations did not significantly differ from each other, but did significantly differ from that of 3'-UTR repeats, implying that the 3'-UTR tolerates larger, more expanded repeats (one-way ANOVA, P < 0.05).

Figure 14: Repeat period distribution of polymorphic non-coding repeats at increasing standard deviation (sd) cut-offs

Figure 15: Repeat period distribution of polymorphic coding repeats at increasing standard deviation (sd) cut-offs

LIST OF ABBREVIATIONS

5-HT2A serotonin 2A

ABI Applied Biosystems

ANOVA Analysis of Variance

API Application Programming Interface

AR Androgen Receptor

BCCA BC Cancer Agency

BLAST Basic Local Alignment Search Tool

BLAT BLAST-like Alignment Tool

CCD cleidocranial dysplasia

CpG cytosine and guanine separated by a phosphate

CTCF CCCTC-binding Factor

DM Myotonic Dystrophy

DNA Deoxyribonucleic Acid

DRD3 dopamine D3 receptor gene

DRPLA Dentatorubral-Pallidoluysian Atrophy

EMBOSS The European Molecular Biology Open Software Suite

EPM1 progressive myoclonic epilepsy type 1

FISH fluorescent in situ hybridization

FRAXA Fragile X Syndrome (A subtype)

FRAXE Fragile X Syndrome (E subtype)

FRDA Friedreich's Ataxia

GABA gamma-aminobutyric acid

GeMS Genomic Mutational Signatures

GeneNote Gene Normal Tissue Expression

GEO Gene Expression Omnibus

HAT histone acetyl-transferases

HD Huntington's Disease

HFGS hand-foot-genital syndrome

HMM Hidden Markov Model

hmmfs Hidden Markov Model fragment search

HUGO Human Genome Organization

indel insertion and deletion

kb kilobases

Mb Megabases

MRD Microsatellite Repeats Database

MZ Monozygotic

NA Not Applicable

ND Not Determined

NS Not Significant

OPMD oculopharyngeal muscular dystrophy

Perl Practical extraction and report language

R The R project for Statistical Computing

RAPID Repeat Analysis, Pooled Isolation and Detection

RED Repeat Expansion Detection

SBMA X-linked Spinal and Bulbar Muscular Atrophy

SCA Spinocerebellar Ataxia

SQL Structured Query Language

STR Short Tandem Repeats

TBP TATA-box binding protein

TNR(s) Trinucleotide Repeat(s)

TRF Tandem Repeats Finder

UBiC UBC Bioinformatics Centre

UCSC University of California Santa Cruz

UTR Untranslated Region

VNTR Variable Number Tandem Repeats

ACKNOWLEDGEMENTS

Dr. Rob Holt for supporting and directing my research for the past year. Thanks for tolerating, and being receptive to, my wide-ranging interests in genomics and its application to medicine.

Stefanie Butland and Francis Ouellette for a chance to contribute and collaborate with the GeMS project. Also, for the opportunity to apply my training to a tangible clinical problem in a true bioinformatics environment.

The entire BCCA Genome Sciences Centre team. My work at the Genome Sciences Centre is equally the result of the incredibly resourceful and helpful work environment. This study simply would not have been possible without your help.

Dr. Marco Marra, Dr. Steven Jones and Dr. Phil Hieter for their support and input as mentors (Drs. Marra and Jones) and senior supervisor (Dr. Hieter).

Canadian Institutes for Health Research and the Michael Smith Foundation for Health Research for funding this project and the bioinformatics program in general. Also for the tremendous opportunity in health-oriented bioinformatics research.

DEDICATION

This work is dedicated to my parents, Elly and Stellios Missirlis, for their support and encouragement throughout my undergraduate and graduate studies.

This work is also dedicated to Madeleine de Trenqualye for her support, humour, appreciation, and generosity during the twilight months of my graduate career.

PREFACE

The CIHR training program in Bioinformatics, in the Genetics Graduate Program at UBC, is structured in a rotation-based format in order to expose students to different scientific problems and laboratory cultures. I had the opportunity to rotate through three different projects before extending my final project into my Master's thesis with Dr. Rob Holt. It should be noted that the work presented here is the result of one 4-month rotation with Francis Ouellette at the University of British Columbia Bioinformatics Centre (UBiC) and the final 8 months of my graduate career with Dr. Rob Holt, not my entire two years in graduate school. These two portions of my work have been selected because they represent my primary interests towards the end of my Master's degree and provide the best framework for a thesis. That is not to say that my other rotations are insignificant, but rather, they did not fit into an organic whole that could be written as a Master's thesis.

This study is broadly divided into two sections. The first deals with my work investigating cis-mediators of CAG/CTG repeat instability, which was the result of my 4-month rotation with UBiC. The second builds upon the ideas and software tools from this initial rotation to create a comprehensive database for the prioritization of candidate repeats in association studies. Each chapter of this thesis is therefore further divided into sub-chapters dealing with these two major themes.

CHAPTER 1

INTRODUCTION

1.1 cis-Features of unstable CAG/CTG repeats

1.1.1 Unstable repeats and disease

Repeat instability as an etiological mechanism for neurodegenerative diseases is a relatively recent observation. Beginning in 1991, the expansion of trinucleotide repeats (TNRs) was identified as the disease-causing mutation for X-linked spinal and bulbar muscular atrophy (SBMA) (La Spada et al., 1991), fragile X syndrome (FRAXA) (Kremer et al., 1991; Verkerk et al., 1991), and myotonic dystrophy type 1 (DM) (Brook et al., 1992). These diseases exhibit 'genetic anticipation', a progressive worsening of the phenotype and earlier age of onset in successive generations due to the transmission of an expanding TNR. The process of inter-generational TNR expansion or contraction has been termed 'dynamic mutation' because of the repeat's unstable length. Since the research in this thesis is concerned only with repeat expansions, the term 'repeat instability' will henceforth refer solely to repeat expansions. Today, 35 human diseases, some of which also exhibit anticipation, have been associated with unstable repeats (Cleary and Pearson, 2003). Diseases in which unstable microsatellites are the causative mechanism can be divided into those caused by coding repeat expansions and those caused by non-coding repeat expansions.

The majority of disease-associated coding repeats identified to date are CAG/CTG repeats encoding an expanded polyglutamine tract in affected individuals. CAG/CTG expansion disorders include spinal and bulbar muscular atrophy (SBMA) (La Spada et al., 1991), dentatorubral-pallidoluysian atrophy (DRPLA) (Koide et al., 1994), Huntington disease (HD) (Huntington's Disease Collaborative Research Group, 1993) and a range of spinocerebellar ataxias (SCAs) including SCA1 (Banfi et al., 1994), SCA2 (Imbert et al., 1996), SCA3 (Ikeda et al., 1996), SCA6 (Zhuchenko et al., 1997), and SCA7 (David et al., 1997). In these diseases, an expanded polyglutamine tract results in a toxic gain of function causing either neuronal degeneration (Ross et al., 1998) or, in mouse models of spinocerebellar ataxia (SCA), neuronal dysfunction due to Purkinje cell abnormalities (Cummings and Zoghbi, 2000). The precise pathogenic mechanism is unknown but requires expression of the expanded polyglutamine tract. Neuronal inclusion bodies are observable on autopsy (Cummings and Zoghbi, 2000).

Untranslated repeats are diverse and include non-trinucleotide repeats. For example, progressive myoclonic epilepsy type 1 (EPM1) pathology results from an expansion of the dodecamer CCCCGCCCCGCG (Lalioti et al., 1997) and an ATTCT repeat expansion is the pathogenic agent in SCA10 (Matsuura et al., 2000). In contrast to the coding repeat disorders, non-coding repeats such as the one underlying myotonic dystrophy can expand dramatically into the range of thousands of repeats (Brook et al., 1992). Non-coding repeat expansions are not associated with neuronal inclusion bodies on autopsy (Cummings and Zoghbi, 2000).

1.1.2 The argument for cis mediators of instability

Murine models provide some of the most compelling evidence of the role of cis-elements in CAG/CTG repeat instability. Mice with genetically identical inbred backgrounds had a transgene (constructed of a large CTG repeat with little flanking human sequence) integrated randomly into their genomes. The mice demonstrated a wide range of instabilities at the repeat locus, reflecting the influence of mouse cis-sequences at the site of transgene integration (Zhang et al., 2002). In another experiment, mice had a human transgene integrated into their genomes consisting of CTG repeats plus 45 kb of flanking sequence from an affected DM patient. These mice had uniform instability regardless of the genomic insertion site (Gourdon et al., 1997), which suggested that the identical flanking cis-sequences of the human transgene dictated the level of instability.

1.1.2.1 Flanking %GC and CpG islands

An association exists between the relative expandability of repeats and flanking GC content (Brock et al., 1999). To quantify relative levels of expandability, the expandability metric uses the following formula: length change / (progenitor allele length - 35 repeats). This measure quantifies the "tendency of an above threshold repeat block to undergo further expansion". The authors reasoned that repeat length changes needed to be expressed relative to the progenitor length of the repeat. Progenitor allele length was "standardized" by subtracting 35 repeats, the hypothetical threshold of repeat instability in many coding CAG/CTG disorders (Cleary and Pearson, 2003). This expandability metric represents a summary of pedigree analyses and literature detailing expansion events at various CAG/CTG loci. Furthermore, it represents a standardized way of comparing loci irrespective of the effect of progenitor allele length. It established that variable expandability existed at CAG/CTG repeats at different loci, which the authors attributed to the particular milieu of cis-features flanking the repeats. GC-rich sequence, including CpG islands, was observed at the more expandable loci, while low GC content and no CpG islands were observed at the less expandable loci. CpG islands were significantly associated with more expandable loci (P < 0.01, Fisher's exact test). Strong positive rank correlations were observed between expandability and GC content (in the 100 bp flanking the repeat) for male transmissions determined indirectly by pedigree analysis (rho = 0.817, P < 0.01) and directly by single sperm analysis (rho = 0.9, P < 0.05). The correlation in female transmissions was similar but weaker (rho = 0.717, P < 0.025). These relationships persisted at 500 bp flanking the repeat, but the correlation coefficients were lower. No conserved flanking motifs were detected, only variations in %GC.
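The expandability formula and the rank correlation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used by Brock et al.: the function names, the decision to pool `flank` bases from each side of the repeat, the 0-based half-open coordinates, and the toy input values are all hypothetical choices made for the example.

```python
def expandability(length_change, progenitor_length, threshold=35):
    """Length change divided by the progenitor length above threshold (in repeats)."""
    excess = progenitor_length - threshold
    if excess <= 0:
        raise ValueError("progenitor allele must be longer than the threshold")
    return length_change / excess


def gc_fraction(seq):
    """Fraction of G and C bases in a non-empty DNA string."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def flanking_gc(sequence, repeat_start, repeat_end, flank=100):
    """G+C fraction of up to `flank` bases on each side of the repeat.

    Coordinates are 0-based and half-open; pooling both flanks is an
    assumption of this sketch.
    """
    upstream = sequence[max(0, repeat_start - flank):repeat_start]
    downstream = sequence[repeat_end:repeat_end + flank]
    return gc_fraction(upstream + downstream)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sample is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy loci (length change, progenitor length), not published data.
toy_expandability = [expandability(10, 45), expandability(6, 47), expandability(2, 55)]
toy_gc = [0.80, 0.65, 0.40]  # hypothetical flanking G+C fractions
rho = spearman_rho(toy_expandability, toy_gc)
```

Working on ranks rather than raw values mirrors the ranked correlations reported in the text; a significance test for rho is deliberately left out of this sketch.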

1.1.2.2 Repeat length and purity

TNR length and purity are important determinants of stability. Repeat length dictates the severity of symptoms and approximate age of onset of TNR diseases. The majority of coding CAG/CTG TNRs greater than ~35 repeats become genetically unstable (Cleary and Pearson, 2003). Most TNRs show a polymorphic length distribution in the human genome; for example, the (CAG)n repeats of SCA7 range from 10-13 repeats, with the majority consisting of 10 repeats (Gouw et al., 1998). Some large normal pre-mutation lengths approach the 20-35 range and can undergo de novo expansion to full mutation lengths. Large normal repeats may be indicative of a propensity to undergo de novo expansion.

Large pure repeat tracts are more likely to expand following transmission than impure repeat tracts. For example, normal CAG/CTG repeats in SCA1, which can be as large as 39 repeats, are stable when interrupted with a single CAT. In comparison, SCA1 repeats consisting of 40 pure repeats are genetically unstable (Chong et al., 1995). FRAXA (Eichler et al., 1994), SCA2 (Choudhry et al., 2001) and Friedreich's Ataxia (FRDA) (Cossee et al., 1997; Montermini et al., 1997) also have stabilizing repeat interruptions. Part of the stabilizing effect of interrupting mutations may come from disruptions in repeat secondary structure. In SCA1 and FRAXA, repeat purity and formation of slipped strand structures correlate with repeat instability and disease (Pearson et al., 1998). Many TNRs form stable secondary structures that may interfere with cellular replication machinery. Stable secondary structures have been theorized to facilitate polymerase slippage and contribute to repeat expansion (Cleary and Pearson, 2003). The ability of a TNR tract to form secondary structure is both length and purity dependent, and most structures cannot be formed by sub-threshold repeats. However, other interruptions within disease-associated CAG/CTG repeats are selected for following expansion in certain genes such as SCA3. In SCA3, a 5' AAG interruption codes for an essential lysine residue, which provides evidence that the protein context of some interruptions may be favoured in expanded alleles (Kawaguchi et al., 1994).
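The distinction between total repeat length and repeat purity can be made concrete with a short Python sketch. It treats purity as the longest uninterrupted run of the repeat unit, which is one possible reading of the discussion above; the function name and the two example tracts are hypothetical and are not real SCA1 allele sequences.

```python
import re


def longest_pure_run(sequence, unit="CAG"):
    """Largest number of consecutive, uninterrupted copies of `unit`.

    A lookahead is used so that runs starting at every position are
    considered, which matters for units that overlap themselves.
    """
    sequence = sequence.upper()
    unit = unit.upper()
    pattern = re.compile("(?=((?:%s)+))" % re.escape(unit))
    runs = (len(m.group(1)) // len(unit) for m in pattern.finditer(sequence))
    return max(runs, default=0)


# Hypothetical tracts: 39 units with one central CAT interruption,
# and 40 uninterrupted CAG units.
interrupted = "CAG" * 19 + "CAT" + "CAG" * 19
pure = "CAG" * 40

interrupted_total = len(interrupted) // 3            # all units, pure or not
interrupted_pure = longest_pure_run(interrupted)      # longest CAG-only run
pure_run = longest_pure_run(pure)
```

In this toy case a single interruption leaves the overall tract length almost unchanged while roughly halving the longest pure run, which is why length and purity are worth measuring separately.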

1.1.2.3 The role of the CTCF insulator protein

Myotonic dystrophy (DM) is a TNR expansion disease characterized by CTG expansions in the 3'-UTR of the DMPK gene. The polyadenylation site of DMPK is less than 300 bp from the transcriptional start site of SIX5, a gene that, when dysregulated, is theorized to cause the cataract phenotype in DM patients. Interestingly, DMPK and SIX5 have different expression profiles despite their proximity. This phenomenon can be explained by cis-acting insulator elements that act as a barrier to the local effects of other flanking cis-sequences such as enhancers. The CTCF (for 'CCCTC-binding factor') protein is an eleven zinc-finger DNA binding protein that acts as an insulator when bound to CTCF sites on DNA (Ohlsson et al., 2001). Gel mobility shift assays were used with portions of the 3'-UTR of DMPK to determine whether CTCF bound to those regions (Filippova et al., 2001). It was observed that CTCF bound to the DNA fragments immediately flanking the CTG repeat and acted as an insulator for DMPK. Interestingly, the approximately 176 bp that separates the two CTCF binding sites also binds a nucleosome, as determined by standard micrococcal nuclease assays. The DMPK insulator sites are also methylation-sensitive; when methylated, insulator function is impaired. Furthermore, the importance of CTCF binding sites flanking (CTG)20 for insulator activity was tested by constructing various vectors containing one, both or neither CTCF binding sites. Maximum insulator activity was observed with both CTCF binding sites plus the (CTG)20 repeat. The upstream CTCF site was more important than the downstream site. Most importantly, the composition of the sequence between the CTCF sites was found to be important for maximum insulator functionality. When the (CTG)20 between the two CTCF binding sites was replaced with λ phage DNA of equal length, a small but significant decrease in insulator function was observed. Flanking CTCF binding sites were also observed at the DM, DRPLA, and SCA7 CAG/CTG repeats by gel retardation assays. For HD and SCA2, only upstream CTCF binding sites were detected. This suggests that CAG/CTG repeats flanked by CTCF sites may have important insulator activity in the human genome and rationalizes their selection over evolutionary history. Nonetheless, the presence of CTCF binding sites flanking the CAG/CTG repeat within these genes suggests that some link may exist between CTCF and instability.

1.1.2.4 The role of nucleosomes

Transcriptionally inactive DNA is packaged around an octamer of histone proteins made up of pairs of H2A, H2B, H3, and H4; this unit is termed the 'nucleosome' (Kornberg and Lorch, 1999). The nucleosome is the basic unit of DNA packaging, and modification of its acetylation state is an important process of gene regulation. Nucleosomal acetylation (and the subsequent displacement of nucleosomes from DNA) by histone acetyl-transferases (HATs) has been shown to be an important regulatory mechanism in yeast (Sterner et al., 1999) and in humans (Imhof et al., 1997). Interestingly, the repeat threshold for instability (30-40 repeats or 90-120 bp) is almost the same as the amount of DNA in a nucleosome (~146 bp). Some repetitive sequences readily form nucleosomes (e.g. (CTG)n and (GGA)n) while others do not (e.g. (CGG)n and poly(A)n) (Cleary and Pearson, 2003). Extensive research contends that the most efficient sequence for nucleosome positioning is (CTG)n, whereas CAT interruptions disrupt formation. Although these observations are suggestive of some link between TNRs and nucleosome formation, the role of nucleosomes in mediating repeat instability is currently unknown.

1.1.3 Objectives

Trinucleotide repeat instability is associated with flanking %GC content and CpG islands and may be related to repeat length and purity, CTCF binding sites and nucleosome formation potential. In this analysis, the expandability metric developed from male transmissions in Brock et al. is correlated with flanking %GC, CpG islands, repeat length and purity, CTCF binding sites, and nucleosome formation potential to confirm published relationships and identify new ones between unstable coding CAG/CTG repeats and flanking sequence.

1.1.4 Specific aims and rationale

Aim 1: Expand the initial analysis by Brock et al. by correlating flanking %GC, CpG islands, repeat length and purity, other repetitive elements, CTCF binding sites, and nucleosome formation potential with expandability data.

Using an incomplete version of the human genome, Brock et al. found that the expandability of unstable CAG/CTG repeats correlated with flanking %GC and CpG islands. We extend this analysis to the complete human genome (UCSC v. 34) and analyze other cis features for their ability to identify unstable CAG/CTG repeats in the human genome.

1.2 The unstable repeat perspective of schizophrenia

1.2.1 Biology of schizophrenia

Schizophrenia is a neuropsychiatric disorder characterized by positive symptoms such as delusions, hallucinations, and thought disorder; and by negative symptoms including catatonic behaviour, affective flattening, alogia, and avolition (American Psychiatric Association. Diagnostic and statistical manual of mental disorders (4th ed.) (DSM-IV)). Prevalence of the disease ranges from 0.75% to 1.5% in the general population independent of the culture studied. Age of onset is generally in early adulthood, from the late teens to the late twenties (Bray and Owen, 2001). The lack of overt pathology in the schizophrenic brain along with the anti-psychotic properties of certain drugs gave rise to the idea that disturbances in neurotransmitter systems were responsible for the disease. In the 1950s and 1960s the "dopamine hypothesis" was born from the clinical efficacy of anti-psychotic drugs. Abundant circumstantial evidence suggests that excessive dopamine plays a role in schizophrenia, such as:

1) anti-psychotic drugs block central nervous system postsynaptic D2 receptors
2) drugs that increase dopaminergic activity either aggravate schizophrenic symptoms or give rise to symptoms in patients de novo
3) schizophrenic brains have higher dopamine receptor density post-mortem
4) schizophrenic brains have higher dopamine receptor density detected by positron emission tomography (PET)
5) increased concentrations of homovanillic acid (a metabolite of dopamine) in the cerebrospinal fluid, plasma and urine of patients successfully treated for schizophrenia

(summarized from Potter and Hollister, 2001)

These conclusions are complicated by the fact that the newer anti-psychotic drugs have been shown to interact with other receptor systems such as serotonin (Wooley and Shaw, 1954), glutamate (Kim et al., 1980), and GABA (Roberts, 1972). Therefore, direct evidence of a dopaminergic basis for schizophrenia has yet to be established. Recently, a neurodevelopmental theory of schizophrenia has been proposed due to supporting epidemiological, neuroanatomical, and histological evidence. Epidemiologic evidence has revealed a pattern of obstetric complications, childhood neuropsychological deficits, and non-specific neuropathological anomalies in schizophrenics, but not at a statistically reproducible level (Weinberger, 1995). Anecdotal observations from neuroanatomical, volumetric and histological studies offer other perspectives on the disease. Neuroimaging has revealed increased lateral ventricle size at onset and in adolescents with high risk of developing the disorder (Degreef et al., 1992). Furthermore, volumetric studies have suggested that schizophrenics have a small reduction in brain size relative to controls (McCarley et al., 1999). Lastly, histological analysis has revealed reduced axonal and dendritic markers within the frontal lobes and temporo-limbic structures in schizophrenic brains (Harrison, 1999). These observations form the basis for the newer "disconnectivity theory" of schizophrenia. According to this theory, schizophrenia is a result of subtle defects in the overall synaptic network affecting the organization of neurons in the brain (Bullmore et al., 1997), but more studies are needed to determine the likelihood of this hypothesis. In conclusion, the biological research to date on schizophrenia has yet to yield any clear and undisputed pathology for the disease.

1.2.2 Genetics of schizophrenia

There is a strong, non-Mendelian genetic component to schizophrenia that is apparent from family, twin and adoption studies (McGuffin et al., 1995). Risk of developing schizophrenia increases exponentially based on familial relatedness to affected individuals. Identical or monozygotic (MZ) twins of affected individuals have a 50% lifetime risk of developing the disease. The fact that the remaining 50% do not develop the disease highlights the importance of environmental or epigenetic factors in schizophrenia predisposition. Adoption studies have firmly established that individuals adopted into families with schizophrenic members retain their biological family's risk of developing schizophrenia, and not their adoptive family's. Genetic epidemiology makes apparent that what is inherited is a predisposition to develop the disease, rather than the certainty of doing so (Bray and Owen, 2001). Biometric models of schizophrenia have estimated that 80% of the risk of developing schizophrenia is genetic, while the remaining 20% is conferred by non-genetic factors (Cardno et al., 1999), perhaps including obstetric complications, maternal viral infections, and social stressors (Tsuang, 2000).

Identification of susceptibility genes for schizophrenia is a long-sought goal of linkage and association studies. Identification of alleles that segregate with schizophrenia could either support the current neurochemical models of the disease or introduce new ones. It is unlikely a single locus is responsible for genetic predisposition given the recurrence risk of the disease in relatives of affected individuals. It has been estimated that, assuming extreme epistasis is not occurring, between 2 and 3 genes contribute to the schizophrenia phenotype (Risch, 2000).

The control of schizophrenic symptoms by anti-psychotic drugs led to the "dopamine hypothesis of schizophrenia". Most association studies focus on genes with some relationship to dopaminergic or other neurotransmitter systems, but to date, significant polymorphisms have been detected only in the dopamine D3 receptor gene (DRD3) (Crocq et al., 1992) and the serotonin 2A (5-HT2A) receptor gene (Inayama et al., 1996). Unfortunately, the Ser9Gly mutation in DRD3 has a low odds ratio and the T102C mutation in 5-HT2A does not change the amino acid sequence of the serotonin receptor. Association studies in schizophrenia are unlikely to find a single locus of significant effect; instead, it is more probable that numerous loci of small effect each contribute a fraction of relative risk that additively increases one's risk of developing schizophrenia.

1.2.3 Anticipation in neuropsychiatric diseases

The phenomenon of anticipation in schizophrenia was first documented in the 19th century by Morel, who noted it amongst many other forms of disease that worsened in succeeding generations (Morel, 1857). The "law of anticipation" was formalized by Sir Frederick Mott by examining 420 parent-offspring pairs in the asylums of London (Mott, 1910). The leap to rationalizing anticipation in genetic terms was made unscientifically, most probably because it fit neatly into the social Darwinist paradigm of the era. This view prevailed until the mid-20th century, when the scientific rationale of anticipation in myotonic dystrophy was forcefully criticized as an artifact of ascertainment bias by Penrose (Penrose, 1948). Ascertainment bias has been reviewed elsewhere, but briefly it includes:

1) Preferential ascertainment of parents with later onset (due to decreased reproductive fitness of earlier onset individuals)
2) Preferential ascertainment of offspring with early onset, as later onset individuals may not be affected at time of study
3) Decreased memory of earlier episodes in older generation
4) Increased awareness of illness by family members and physician

(excerpted from Vincent et al., 2000)

To overcome these biases, suggested "ideal" study designs include population-based random samples of families, irrespective of family history, followed prospectively until they are through the age of risk. Unfortunately, no published epidemiological studies exist with these criteria. Eliminating ascertainment and other biases in the evaluation of anticipation is likely to be impossible, leading some investigators to speculate that anticipation is a matter of opinion in certain diseases (Ashizawa et al., 1992). Controversy surrounding conclusions of genetic anticipation predates the conclusive findings in FRAXA and HD, suggesting that these issues are in perpetual consideration and difficult to control (Ashizawa et al., 1992). Despite these issues, numerous studies of genetic anticipation in schizophrenia employing a variety of designs have found surprisingly consistent positive results (Table 1).

1.2.4 CAG/CTG repeats in schizophrenia

Table 1: Genetic anticipation in schizophrenia; summary of linkage studies from 1996-1999 (adapted from Vincent et al., 2000).1

Reference                    Ascertainment            No. families            Anticipation   Genomic Imprinting
(Bassett and Honer, 1994)    Linkage study            8 Families              +              NS
(Asherson et al., 1994)      Linkage study            28-40 Relative Pairs    11.1 years     NS
(Thibaut et al., 1995)       Linkage study            26 Families             +              NS
(Yaw et al., 1996)           Linkage study            15 Families             +              NS
(Gorwood et al., 1996)       Population based         24 Families             +              NS
(Ohara et al., 1997)         Clinic-based             24 Families             +              NS
(Bassett and Husted, 1997)   Retrospective registry   137 Pairs               +              ND
(Johnson et al., 1997)       Linkage study            33 Families             +              ND
(Imamura et al., 1998)       Clinic-based             37 Pairs                +              NS
(Valero et al., 1998)        Linkage study            25 Families             +              NS
(Heiden et al., 1999)        Clinic-based             15 Families             +              ND

Pairs = Parent-child pairs; NS = Not significant; ND = Not determined.

Interest in the idea of anticipation in psychoses was revived with the identification in 1991 of trinucleotide repeat expansion as the genetic mechanism responsible for anticipation in FRAXA (Kremer et al., 1991) and SBMA (La Spada et al., 1991). Trinucleotide repeat screens in psychoses have mostly focused on CAG/CTG type mutations based on their prevalence in neurodegenerative disorders and repeat expansion detection (RED) evidence. RED is a technique that can detect large expanded repeats in DNA without prior knowledge of their genomic co-ordinates (Schalling et al., 1993). RED uses long genomic repeats as a template to which repeat oligonucleotides anneal. Two or more oligonucleotides are ligated when they anneal adjacently on an expanded repeat. After a number of denaturation, annealing and ligation cycles, large single-stranded multimers are produced, which are then detected by polyacrylamide gel electrophoresis, blotting and hybridization with complementary 32P-ATP labelled molecules as a probe. RED studies in schizophrenia have shown that the median length of CAG/CTG repeats in affected probands was longer relative to unaffected controls (Vincent et al., 2000). With evidence of genomic CAG/CTG expansions, subsequent molecular studies tried to identify the location of these loci. These studies identified three expanded loci: one at ERDA1, which was cloned using a genomic library from a patient identified by RED as having a large CAG/CTG repeat (Nakamoto et al., 1997); another within SEF2-1B (this repeat locus is also referred to as CTG18.1), which was detected by fluorescent in situ hybridization (FISH) using a large CAG/CTG repeat probe (Haaf et al., 1996); and SCA8, which was detected by RED and then cloned from total genomic DNA by Repeat Analysis, Pooled Isolation and Detection (RAPID) (Koob et al., 1998). Although expanded in schizophrenia, these loci individually failed either to segregate with disease in family studies or to associate with disease in affected individuals versus unaffected controls, at least at levels of statistical significance (Vincent et al., 2000). A recent study specifically examined whether CAG/CTG repeat expansions as detected by RED in 100 unrelated schizophrenia patients were due to expansions in these loci. The study showed that 28% of studied probands had an expanded CAG/CTG repeat relative to controls. This study took the unprecedented step of typing the CAG/CTG repeats known to be polymorphic in the general population. The authors concluded that most repeat expansions could be rationalized by expansions in non-pathogenic polymorphic CAG/CTG repeats at the ERDA1, CTG18.1, and SCA8 loci but could not exclude a small CAG expansion independent of these regions having a limited phenotypic effect (Tsutsumi et al., 2004). Importantly, no RED or association studies have been conducted on non-CAG/CTG repeats in the genomes of schizophrenics.

1.2.5 Published satellite repeat analyses and databases

Historically, if one suspected that a polymorphic microsatellite repeat was associated with a disease, few bioinformatics resources were available to identify relevant repeats in the human genome. Repeat prioritization is the process by which candidate repeats in the genome are hierarchically ranked based on investigator-defined variables. One approach now available is to browse the Tandem Repeats Finder (TRF) (Benson, 1999) track on the University of California Santa Cruz (UCSC) genome browser (http://genome.ucsc.edu/) (Kent et al., 2002) within a genomic region of interest. TRF at UCSC was executed with liberal insertion and deletion (indel) and substitution penalties that allow the detection of larger, frequently impure repeats. Since pure repeat tracts are more likely to expand than impure repeat tracts following transmission (Chung et al., 1993; Kunst and Warren, 1994; Chong et al., 1995), a large fraction of repeats presented at UCSC are probably not relevant for disease association studies. Furthermore, certain known disease-associated repeats, such as the GAA repeat in Friedreich's Ataxia (Campuzano et al., 1996) (chr9:67,109,320-67,109,339), are not detected at all at UCSC because they are too short to be detected by the UCSC TRF parameters. Other groups have created databases of all 2-16 repeat unit satellite repeats within human gene regions (Subramanian et al., 2002; Collins et al., 2003) and of all 1-6 repeat unit microsatellites across prokaryotic and eukaryotic taxa (Subramanian et al., 2002). Collins detected microsatellites with a novel algorithm and deposited this data in a relational database called the GRID Short Tandem Repeats (STR) database (Collins et al., 2003). This database included in silico polymorphism detection of coding trinucleotide repeats by using the BLAST algorithm to detect each repeat's length polymorphisms within GenBank, but only for a subset of exonic repeats (Collins et al., 2003). These resources enrich the microsatellite repeat bioinformatics landscape but do not integrate these data with other published resources in a way relevant for repeat prioritization in disease-association studies. Also, these resources do not provide flexible interfaces for combining data in user-defined ways to allow dynamic generation of candidate repeat lists. For example, both the Microsatellite Repeats Database (MRD) (Subramanian et al., 2002) and the STR database (Collins et al., 2003) provide static co-ordinates of candidate repeats for disease-association studies defined by the authors' criteria, but lack the functionality to easily re-prioritize repeats based on user preferences.

1.2.6 Objectives

We sought to create a resource to allow prioritization of candidate repeats that utilized bioinformatics features relevant to unstable repeats. To identify which repeats are the most likely substrates for expansion, we prioritized candidate repeats using a comprehensive database named Satellog. Satellog integrates the co-ordinates of all micro- and mini-satellite repeats in the human genome with gene proximity, gene expression, and repeat polymorphism data within UniGene clusters.

Given the lack of overt neurological pathology in schizophrenics and the evidence of anticipation, we propose that a non-CAG/CTG repeat expansion may confer genetic risk that increases the probability of developing schizophrenia. Schizophrenia linkage data was incorporated into Satellog in order to prioritize candidate repeats in schizophrenia. Prioritized repeats will be analyzed with GeneScan software on an ABI 3700 sequencer by other individuals/groups at the BCCA Genome Sciences Centre.

1.2.7 Specific aims and rationale

Aim 1: Develop a comprehensive bioinformatics resource for the prioritization of candidate repeats for disease-association studies in the human genome.

No resource exists that allows the generation of candidate gene lists by integrating satellite repeat co-ordinates, gene proximity, gene expression and repeat polymorphism within UniGene clusters. We have created a database to accomplish this that we have named Satellog. This resource will broaden the bioinformatics analyses possible on micro- and minisatellite repeats and provide a powerful tool for researchers investigating repeats believed to be associated with a disease of interest.

Aim 2: Generate candidate repeat lists using Satellog based on schizophrenia linkage regions.

We intend to generate a prioritized candidate repeat list of all repeats in schizophrenia and bipolar linkage regions that have interesting features within Satellog. A portion of these repeats will be typed at the BCCA Genome Sciences Centre in disease-association studies with DNA from individuals with schizophrenia, bipolar disorder and unaffected control individuals (n=35 each).

CHAPTER 2

MATERIALS AND METHODS

2.1 cis-Features of unstable CAG/CTG repeats

2.1.1 Collection of candidate CAG/CTG repeats for cis sequence analysis

Previous work at UBiC generated a list of candidate CAG/CTG repeats (also called Genomic Mutational Signatures or GeMS) by scanning the human genome for genes with relatively large, coding CAG/CTG repeats (Butland S., personal communication). A perl script (Appendix A) was written that relied on the Tandem Repeats Finder (TRF) (Benson, 1999) results generated from the human genome assembly version 33, available from the UCSC genome browser (http://genome.ucsc.edu) (Kent et al., 2002). Using the co-ordinates from TRF for CAG/CTG repeats, each repeat was extracted and tested with the EnsEMBL API to see if it existed within an EnsEMBL (www.ensembl.org) annotated gene (Hubbard et al., 2002). If it did, and the gene was associated with at least one transcript that coded for five or more glutamines, then the gene was collected as a candidate CAG/CTG repeat or GeMS candidate repeat (Figure 1). A total of 66 repeats fulfilled the above criteria for candidate CAG/CTG repeats (Appendix B).

The next problem involved prioritizing genes based on the composition of their flanking cis sequences.
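The candidate filter described above can be illustrated with a minimal Python sketch. It is not the Appendix A perl script: the function names are hypothetical, and it assumes that "five or more glutamines" means five or more consecutive Q residues in a translated transcript, which the text does not state explicitly.

```python
import re


def longest_glutamine_run(peptide: str) -> int:
    """Length of the longest uninterrupted run of glutamine (Q) residues."""
    runs = re.findall(r"Q+", peptide.upper())
    return max((len(run) for run in runs), default=0)


def is_candidate(transcript_peptides, min_q: int = 5) -> bool:
    """True if at least one transcript encodes min_q or more consecutive Q.

    Assumption: the repeat has already been placed inside an annotated gene;
    this function only applies the glutamine criterion.
    """
    return any(longest_glutamine_run(p) >= min_q for p in transcript_peptides)
```

For example, a gene with peptides `["MKQQQQA", "MKQQQQQA"]` would qualify because its second transcript carries a run of five glutamines.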

2.1.2 Software Dependencies I

[Figure 1 flowchart: (1) use Tandem Repeats Finder (TRF) co-ordinates from the UCSC data dump table; (2) a repeat qualifies as a candidate if (a) it is located in an EnsEMBL gene and (b) one or more transcripts code for 5Q+; (3) a total of 66 genes result, which are then prioritized for genotyping.]

Figure 1: Flowchart outlining how the candidate GeMS list was populated.

All bioinformatics analysis was conducted with a single perl script entitled flanker.pl (Appendix C). This script is essentially a wrapper for a number of programs that need to be installed locally, which are: BioPerl version 0.7.2 (Stajich et al., 2002), the EnsEMBL API (Hubbard et al., 2002), EMBOSS v.2.7.1 (Rice et al., 2000), HMMer v.1.8.4 (Eddy, 1998), and MySQL v.3.23.52. An older version of HMMer was selected because v. 1.8.4 is optimized for nucleic acid analyses while newer versions focus on protein sequence. All of these programs are installed on stent.cmmt.ubc.ca and flanker.pl was run on this machine at the UBiC.

2.1.3 Implementing the gems_cis database

All data was deposited into a MySQL database entitled 'gems_cis' (Figure 2) (Appendix D). Henceforth, references to slices describe the length (in bp) of the TRF defined CAG/CTG repeat plus the up- and down-stream flanking sequence. The database is composed of 10 tables entitled build_info, gems, gc_plots, ctcf, cpg, gc, repeats, exons, flanking, and gems_feat. All tables, with the exception of build_info, can be related to each other based on the name variable, which is simply the Human Genome Organization (HUGO) name of the gene the CAG/CTG repeat is within. The build_info table stores one-time operational information about the database such as: the version of the EnsEMBL human genome assembly used, the name of this database (gems_cis), and the date and time that flanker.pl was run in order to populate the database. The gems table stores the EnsEMBL gene IDs. The gc_plots table stores the position within the slice of observed over expected %GC within a sliding window of 200 bp and the %GC value calculated by EMBOSS. The ctcf table contains the HMMer score, start, end, strand, and absolute distance from the CAG/CTG repeat as calculated for each hit within the slice. The cpg table collects the start, end and score data for each CpG island, as defined by EnsEMBL, within the slice. The gc table collects the %GC of 50 bp upwards of sequence flanking the CAG/CTG repeat. The repeats table collects the name, repeat class, start and end co-ordinates relative to the slice, score, strand, and distance from the CAG/CTG repeat of each repetitive element. The exons table collects the start and end co-ordinates of each exon. The flanking table stores the chromosome, strand, and chromosomal start and end co-ordinates of the slice, including the sequence of the slice in the forward and reverse orientations both without and with repeat masking. The gems_feat table collects the chromosome, strand, chromosomal start and end co-ordinates, repeat unit, sequence, length, purity and expandability (Brock et al., 1999) of each TRF defined CAG/CTG repeat.

Figure 2: Schema for gems_cis database.
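To make the role of the shared name variable concrete, the sketch below joins two of the tables described above. It is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for MySQL, keeps only a subset of the columns named in the text, and every row (GENE_A, GENE_B and their values) is invented for the example.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory stand-in for two gems_cis tables with invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gems_feat (name TEXT PRIMARY KEY, length INTEGER,
                                purity REAL, expandability REAL);
        CREATE TABLE ctcf (name TEXT, score REAL, distance INTEGER);
        """
    )
    conn.executemany(
        "INSERT INTO gems_feat VALUES (?, ?, ?, ?)",
        [("GENE_A", 60, 1.0, 0.5), ("GENE_B", 45, 0.9, 0.1)],  # hypothetical
    )
    conn.executemany(
        "INSERT INTO ctcf VALUES (?, ?, ?)",
        [("GENE_A", 12.5, 120), ("GENE_A", 9.0, 900)],  # hypothetical hits
    )
    return conn


def ctcf_summary(conn):
    """Per repeat: expandability, number of CTCF hits, and nearest hit distance."""
    return conn.execute(
        """
        SELECT g.name, g.expandability, COUNT(c.name), MIN(c.distance)
        FROM gems_feat AS g
        LEFT JOIN ctcf AS c ON c.name = g.name
        GROUP BY g.name, g.expandability
        ORDER BY g.name
        """
    ).fetchall()
```

The LEFT JOIN keeps repeats that have no CTCF hit at all, which matters when the absence of a site is itself the feature being tested.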

2.1.4 Overview of the flanker.pl script

The co-ordinates of candidate CAG/CTG repeats were input in flat-file format to flanker.pl (Appendix B) (Figure 3) and were looped through sequentially until the end of the file was reached. The first part of the program collected the sequence of the repeat plus a user defined amount of flanking sequence, in our analysis, 50 kb. This value was selected because mouse experiments in which a transgene consisting of ~45 kb of human sequence flanking the DM repeat was integrated into the mouse genome yielded uniform repeat instability (Gourdon et al., 1997). This experiment established that at least 45 kb of cis sequence is required to observe the effects of human cis-sequences. To be conservative, the flanking sequence considered in this analysis was extended to 50 kb on each side of each CAG-repeat.
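The slice arithmetic implied here (repeat plus 50 kb on each side) can be sketched as follows. This is a hypothetical helper, not code from flanker.pl; it assumes 1-based inclusive chromosomal co-ordinates and clips the slice at the chromosome ends, neither of which is specified in the text.

```python
def slice_coordinates(repeat_start, repeat_end, flank=50_000, chrom_length=None):
    """Return (start, end) of the repeat plus `flank` bp on each side.

    Assumptions: co-ordinates are 1-based and inclusive; the slice is
    clipped at position 1 and, if given, at chrom_length.
    """
    if repeat_start < 1 or repeat_end < repeat_start:
        raise ValueError("invalid repeat co-ordinates")
    start = max(1, repeat_start - flank)
    end = repeat_end + flank
    if chrom_length is not None:
        end = min(end, chrom_length)
    return start, end
```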

2.1.4.1 Collection of flanking %GC, CpG islands, length and purity and other repetitive elements

The next phase of the loop involves collecting data or calculating values either with Perl or with external programs. For each feature extracted or calculated, the script automatically inserted data into a MySQL database entitled 'gems_cis'. Using the EnsEMBL API, the script extracted EnsEMBL-defined CpG islands and all the repeats in the sequence surrounding the CAG/CTG repeat. CAG length and purity, and flanking %GC were calculated by the script. All data was input into gems_cis.
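The two quantities calculated directly by the script can be sketched in Python. The definitions used here are assumptions, since the text does not give formulas: flanking %GC is pooled over the same window on both sides of the repeat, and purity is taken as the fraction of bases matching a perfect tiling of the repeat unit in its best-fitting register.

```python
def percent_gc(seq: str) -> float:
    """%GC over unambiguous bases (A, C, G, T); 0.0 if there are none."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    return 100.0 * sum(b in "GC" for b in bases) / len(bases)


def flanking_gc(upstream: str, downstream: str, window: int) -> float:
    """%GC of the `window` bases immediately adjacent to the repeat on each side."""
    if window <= 0:
        raise ValueError("window must be positive")
    return percent_gc(upstream[-window:] + downstream[:window])


def repeat_purity(tract: str, unit: str = "CAG") -> float:
    """Fraction of bases matching a perfect repeat of `unit` (assumed definition)."""
    tract = tract.upper()
    if not tract:
        return 0.0
    best = 0
    for phase in range(len(unit)):
        matches = sum(
            base == unit[(i + phase) % len(unit)] for i, base in enumerate(tract)
        )
        best = max(best, matches)
    return best / len(tract)
```

Calling `flanking_gc` with windows of 50, 100, 150 bp and so on would reproduce the style of nested windows described for the gc table.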

2.1.4.2 Detection of flanking CTCF insulator protein binding sites

The CTCF binding sites were detected by using HMMer (Eddy, 1998). HMMer can build Hidden Markov Models (HMM) in a number of different ways. CTCF binding sites are expected to occur more than once and not overlap. HMMer has an option, hmmfs (HMM fragment search), that reports multiple non-overlapping Smith/Waterman matches. hmmfs corrects for the length of the model when calculating statistics by employing a cyclic model that permits multiple matches. hmmfs was used to create an HMM (ctcf-md.hmm) that defined the CTCF binding sites. The HMM for the CTCF binding sites was generated from the published multiple sequence alignment (Figure 4). This alignment highlighted essential contact guanosine nucleotides that mutational analyses revealed as the most important determinants of CTCF binding (Filippova et al., 2001).
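The idea of reporting multiple non-overlapping motif matches along a flanking sequence can be illustrated without HMMer. The sketch below deliberately swaps in a simpler technique, a log-odds position weight matrix scan with greedy removal of overlapping hits, rather than a hidden Markov model. The alignment, threshold and function names are hypothetical and do not represent the published CTCF alignment.

```python
import math


def build_log_odds(aligned_sites, pseudocount=1.0, background=0.25):
    """Column-wise log2 odds scores from equal-length aligned sites."""
    matrix = []
    for i in range(len(aligned_sites[0])):
        column = [site[i] for site in aligned_sites]
        total = len(column) + 4 * pseudocount
        matrix.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return matrix


def scan(sequence, matrix, threshold):
    """Return non-overlapping hits as (start, end, score), 0-based half-open."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((score, start))
    chosen = []
    # Greedy selection: best-scoring hits first, discarding any overlap.
    for score, start in sorted(hits, key=lambda h: (-h[0], h[1])):
        if all(start + width <= s or s + width <= start for _, s in chosen):
            chosen.append((score, start))
    return sorted((start, start + width, score) for score, start in chosen)
```

With the toy alignment `["CCGCG", "CCGCG", "CCACG"]` and a threshold of 5.0, only exact matches to CCGCG are reported; the distance of each hit from the repeat could then be stored alongside its score, as the ctcf table does.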

2.1.5 Detection of nucleosome formation potential with NucleoMeter

Nucleosome formation potential was calculated with the NucleoMeter program (http://wwwmgs.bionet.nsc.ru/mgs/programs/recon/). NucleoMeter applies discriminant analysis techniques to detect the nucleosome formation potential of eukaryotic DNA (Levitsky et al., 2001). NucleoMeter partitions the input sequence into regions with more homogeneous dinucleotide frequencies in order to maximize detection of nucleosome binding sites. The algorithm then searches for the partition that maximizes the Mahalanobis distance R² used to discriminate between potential nucleosomal sites. The nucleosome formation potential (φ(X)) is constructed from 141 nucleosome binding training sequences: input sequences similar to the training set return scores close to +1, while non-nucleosomal binding sequences have scores of -1.

2.1.6 Statistics and plots with R

R is an open-source computer language and environment for statistical analysis (R Development Core Team, 2004). The final component of the program generates an R script for visualizing all the sequence features. A file entitled genename.R is generated within each gene's directory by flanker.pl. This file automatically generates a plot with the R software package to visualize the sequence features flanking each CAG/CTG repeat.

Data was collected from the gems_cis database for statistical analysis (see Appendix F for sample queries). Statistical analysis was conducted with the R package. Spearman's rank correlations and Fisher's exact test were used to determine the significance of the associations (see Appendix G for R scripts).
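For readers unfamiliar with the two tests, the following pure-Python sketch shows what each one computes. It is not the Appendix G R code (in R these correspond to `cor.test(..., method = "spearman")` and `fisher.test`); the function names are hypothetical, and the two-sided Fisher p-value uses the common convention of summing all tables no more probable than the observed one.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the ranked data."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(k):
        # Hypergeometric probability of k in the top-left cell.
        return math.comb(row1, k) * math.comb(row2, col1 - k) / math.comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-7))
```

In this setting `x` might hold expandability values and `y` a flanking feature such as %GC, while the 2x2 table might cross "expandability above/below a cutoff" with "CTCF site present/absent"; both pairings are illustrative rather than taken from the text.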

2.2 The satellog database

Working with CAG/CTG repeats revealed that there is no easy way for researchers to extract genomic repeat sequence data and information from the current genome browsers. Expertise gained from collecting genomic CAG/CTG repeat co-ordinates was incorporated with other bioinformatics features to generate the Satellog database to address this deficiency. The build procedure for the Satellog database is outlined below and in the appendices.

2.2.1 Software Dependencies II

A perl script "repeatalyzer.pl" functions as a wrapper for a number of different programs to achieve the endpoints of Satellog. repeatalyzer.pl was run with perl v5.6.1 and used BioPerl v1.2 (Stajich et al., 2002), the EnsEMBL Perl API (May 24th, 1999 release), MySQL v10.8 Distribution 3.23.21-beta (for pc-linux-gnu), BLAT v. 28 (Kent, 2002) and v. 34 of the human genome sequence (Lander et al., 2001). repeatalyzer.pl was run against the homo_sapiens_core_19_34b EnsEMBL database and v. 34 of the human genome sequence. The script was processed in parallel on our in-house 40 processor Opteron cluster.

2.2.2 Implementing the satellog database

Prior to proceeding, a MySQL database called Satellog must be implemented to generate all the required tables (Appendix H). The database is

33 composed of 17 tables: repeats, linkage, unigene, gc, class_stats, ugstats, ugcount, rep_stats, rep_class, transcripts, ens_db, disease, go, pdb, mim, affy, and GeneNote (Figure 5). All tables are organized around the repeats table in a star schema. This table stores output from Tandem Repeats Finder (Benson,

1999) including chromosome start and end co-ordinates, repeat unit length

(referred to as period), the sequence of the repeat unit, the distinct repeat class of which the repeat is a member of, the sequence of the repeat and pure repeat length. The p-value is calculated independently and represents the fraction of repeats of the same class having the same or greater length. The linkage table contains information about genomic linkage regions implicated in diseases of interest. For each disease linkage study, the linkage table stores the cytogenetic band of the genetic marker used, marker genomic co-ordinates, the original

reference's PubMed ID, the linkage score if provided, the type of linkage, any

reported p-values and notes of interesting or confounding principles. The pstart and qend values are co-ordinates encompassing 50 Mb flanking the genetic

marker co-ordinates (recombination boundaries of the marker). The gc table

contains the %GC of the 100 bp, 500 bp, and 1,000 bp of sequence flanking the

repeat. The unigene table contains the genomic co-ordinates of each UniGene

cluster successfully mapped to the human genome including its score from the

BLAST-like Alignment Tool (BLAT) (Kent, 2002) and the percent identity of the

alignment. The rep_class table stores a unique repeat class identifier that is

created by concatenating all repeat class members in the class field. The

class_stats table stores a pvalue for each repeat class length that represents the

34 fraction of repeats of the same class having the same or greater length. The ugcount table links each unique repeat by its repeat ID (rep_id) to the UniGene cluster sequences it has been detected in by BLAT (Kent, 2002) and stores the repeat's length in each hit cluster. The ugstats table collects summary statistics of all UniGene repeat length hits for each rep_id including the count (total number of hits), minimum value, maximum value, mean, and the standard deviation of all detected repeat lengths. Supplementary information about adjacent transcripts is collected in the transcripts table if a repeat is within 60 kb of an EnsEMBL defined gene. For each such repeat this includes the EnsEMBL transcript identifier, distance from or location within the EnsEMBL transcript, coding peptide sequence (if the repeat is exonic), and the EnsEMBL gene identifier of the hit. The ens_db table stores supplementary information of all the

EnsEMBL genes that contain repeats. This table stores each EnsEMBL gene's unique identifier, Human Genome Organization (HUGO) name, text description

(if known), chromosomal co-ordinates and strand location. The go, pdb, mim, and affy tables respectively store any Gene Ontology (Ashburner et al., 2000), Protein Data Bank (Berman et al., 2002), Mendelian Inheritance in Man (Wheeler

et al., 2004), and AffyMetrix probe sets associated with each gene. The

genenote table contains AffyMetrix expression values from the Gene Normal

Tissue Expression (GeneNote) database (Shmueli et al., 2003). Specifically this

table includes each probe's identifier, expression value and expression call

(either Absent (A), Marginal (M), or Present (P)) calculated from Microarray

Analysis Suite (MAS) 5.0 with default parameters, AffyMetrix array and number.

2.2.3 Preliminary set-up

Before repeatalyzer.pl is run, a number of preliminary programs must be run and "staging" databases must be created to collect temporary data required for subsequent analyses.

2.2.3.1 Detecting pure repeats with Tandem Repeats Finder (TRF)

We were interested exclusively in pure repeat tracts, which are more likely to expand following transmission (Chung et al., 1993; Kunst and Warren, 1994;

Chong et al., 1995). Command-line TRF has seven parameters that can be manually assigned at run-time which include matching weight, mismatch and indel penalties, match probability, indel probability, minimum alignment score to

report, and maximum period size to report (Benson, 1999). We found that

matching weight, mismatch and indel penalties, minimum alignment score and

maximum period size directly affected the length and purity of hits detected by

TRF whereas changing the match and indel probability features was not useful.

The match and indel probability features refer respectively to the percent identity

and fraction of indels tolerated in each serial tandem unit detected as a hit.

These features allow users to specify alternative expected matching and indel

statistical distributions.

Next we evaluated the ability of the matching weight and maximum period size parameters to detect short repeats. Period size refers to the length of the tandemly repeated DNA unit, for instance CAG/CTG repeats have a period of 3.

Since TRF hits must be at least 10 bp, the smallest hit for each repeat class reported in Satellog spans 10 divided by the repeat unit length repeat units. For example, for CAG/CTG repeats, the smallest detectable hit that satisfies the minimum hit length is a 3 1/3 repeat-unit hit (i.e. CAG CAG CAG C). Due to this constraint, only repeats of 10 bp and up are stored in Satellog.

Lastly we investigated the utility of adjusting the mismatch and indel penalties. We found that setting the penalty for these parameters to 4090 produced no impure repeats as hits. TRF was run on whole chromosome

FASTA files from v. 34 of the human genome downloaded from the UCSC genome browser. Hit purity was confirmed by visually inspecting the highest-period hits, which have the highest probability of introducing indels due to the scoring scheme used by TRF (Benson, 1999) (Appendix I).

2.2.3.2 Identifying unique repeat classes

A repeat can be represented in a number of ways in double-stranded

DNA. TRF detects repeats by the first tandemly repeated unit, therefore,

CAGCAGCAG, AGCAGCAGC, and GCAGCAGCA are detected as repeats of

CAG, AGC, and GCA respectively. Furthermore, the reference human genome

sequence is only presented as the positive strand. Repeats of CTG, TGC, and GCT on the positive strand represent 5'->3' CAG, GCA, and AGC repeats respectively on the negative strand. Therefore, to identify all CAG/CTG repeats in the human genome it is necessary to detect all CAG, AGC, GCA, CTG, TGC, and GCT repeats on the positive strand. We developed an algorithm to generate all possible sequence varieties of a repeat unit on the positive and negative strands. Our repeat classification algorithm operates by taking an input repeat unit, e.g. CAG, removing the first letter (C in this case) and appending it to the end of the remainder (AG) to create the second repeat unit (AGC). This is then reverse complemented to generate the equivalent sequence on the negative strand (GCT). This procedure is repeated repeat unit length - 1 times, and the resulting units together form a unique identifier henceforth referred to as the repeat class. Each

repeat in Satellog is associated with a single unique repeat class (Appendix J).
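The rotation and reverse-complement procedure can be illustrated with a minimal Python sketch. This is not the code of Appendix J: the function names, the sorting of class members, and the underscore separator are assumptions made for illustration. The sketch applies a true reverse complement, so the negative-strand forms of CAG, AGC, and GCA come out as CTG, GCT, and TGC.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def repeat_class(unit):
    """Build a repeat class identifier from a repeat unit.

    Every rotation of the unit and the reverse complement of each rotation
    are collected; the sorted members are joined into a single string.
    Sorting and the "_" separator are illustrative assumptions.
    """
    unit = unit.upper()
    members = set()
    for i in range(len(unit)):
        rotation = unit[i:] + unit[:i]  # move the first i letters to the end
        members.add(rotation)
        members.add(reverse_complement(rotation))
    return "_".join(sorted(members))


print(repeat_class("CAG"))
```

Because every member of a class yields the same identifier, a repeat reported by TRF as AGC, GCA, or TGC is assigned to the same class as one reported as CAG.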

2.2.3.3 Preparing expression data from the GeneNote database

The GeneNote (Gene Normal Tissue Expression) database provides

baseline normal expression data of human genes for use in disease studies

(Shmueli et al., 2003). GeneNote data was downloaded from the Gene

Expression Omnibus (GEO) (Appendix K). A total of twelve human tissue

profiles are presented in GeneNote including bone marrow, brain, heart, kidney,

liver, lung, pancreas, prostate, skeletal muscle, spinal cord, spleen, and thymus.

These profiles were generated with the AffyMetrix HG-U95 A-E probe-set, covering 62,839 probe-sets. EnsEMBL genes have been mapped to AffyMetrix

HG-U95 probes by the EnsEMBL project (Hubbard et al., 2002). Once a repeat is detected either inside or within 60 kb of an EnsEMBL gene, that gene's normal expression profile is evaluated by cross-referencing its AffyMetrix tags to the

GeneNote database within Satellog (Appendix K).

2.2.3.4 Detecting repeat polymorphisms within UniGene clusters

UniGene contains the largest public repository of transcribed human sequence and represents an attempt to organize this wealth of expression data into discrete transcriptional loci (Wheeler et al., 2004). All human UniGene sequences were processed for use with repeatalyzer.pl (Appendix L). For each repeat detected in UTR or exonic sequence, the repeat plus 10 bp of flanking sequence was extracted from EnsEMBL and queried using the BLAT algorithm

(Kent, 2002) against a BLAT-formatted database created from sequences representing the longest, highest quality stretch of DNA from each individual

UniGene cluster (this sequence is provided by UniGene as the file Hs.seq.uniq).

Polymorphism is evaluated only if BLAT analysis against all UniGene clusters resulted in hits for which 1) the BLAT score was at least 85% of the theoretical maximum for a perfect hit, 2) 90% of the query sequence matched identically within the cluster, and 3) the repeat mapped within 10 kb of the genomic co-ordinates of the UniGene cluster (Appendix M discusses the mapping of UniGene clusters to the human genome). If a hit to a UniGene cluster satisfied these criteria, the length of the repeat in the cluster is stored in Satellog. This feature allows investigators to query all repeats with polymorphisms in UniGene clusters from genomic regions of interest.
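The three acceptance criteria can be sketched in Python as follows. This is an illustration only and not the code of repeatalyzer.pl. The dictionary keys `score` and `matches` are hypothetical, the theoretical maximum score of a perfect hit is assumed to equal the query length (one point per matching base), and the repeat and the cluster are assumed to lie on the same chromosome.

```python
def passes_polymorphism_filter(hit, query_length, repeat_start, repeat_end,
                               cluster_start, cluster_end):
    """Return True if a BLAT hit satisfies all three criteria in the text.

    hit          -- dict with hypothetical keys "score" and "matches"
    query_length -- length of the repeat plus its flanking sequence
    """
    # Assumption: a perfect hit scores one point per query base.
    max_score = query_length
    good_score = hit["score"] >= 0.85 * max_score

    # At least 90% of the query bases match identically.
    good_identity = hit["matches"] >= 0.90 * query_length

    # Distance between the repeat and the mapped cluster (0 if they overlap).
    gap = max(cluster_start - repeat_end, repeat_start - cluster_end, 0)
    near_cluster = gap <= 10_000

    return good_score and good_identity and near_cluster
```

A hit failing any single criterion is rejected, so no repeat length is recorded for that cluster.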

2.2.4 Overview of the repeatalyzer.pl script

Once the above software and data dependencies are configured, the Perl script repeatalyzer.pl automatically populates Satellog. The script processes the flat files output by TRF. These files contain the repeat co-ordinates plus the repeat period (the size of the repeated unit), the sequence of the individual repeat unit, the entire repetitive sequence and the repeat length. Repeat co-ordinates are passed to the EnsEMBL API to confirm the authenticity of the co-ordinates generated by TRF. If the repeat is not detected within a gene with the

EnsEMBL API, then progressively larger slices incrementing by 15 kb are taken

in search of flanking genes. As soon as a gene is located in flanking sequence then no further flanking sequence is collected. However, if no genes are

detected within 60 kb of the repeat co-ordinates then repeatalyzer.pl stops

searching for genes. If a repeat is detected inside or within 60 kb adjacent to an

EnsEMBL-defined gene then that gene's primary information (co-ordinates,

HUGO name, EnsEMBL ID and description) are collected along with metadata

stored in EnsEMBL such as Protein Data Bank (PDB) (Berman et al., 2002), Online

Mendelian Inheritance in Man (Wheeler et al., 2004), Gene Ontology (GO)

(Ashburner et al., 2000), and mappings to AffyMetrix probe sets. If the repeat is

located in the 5'-UTR, 3'-UTR, or exon of a gene then its polymorphism profile within UniGene clusters is evaluated.
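The progressive search for flanking genes can be sketched as below. The actual script queries the EnsEMBL Perl API; here an in-memory list of gene intervals on a single chromosome stands in for that lookup, and the function name, argument names, and return convention are assumptions made for illustration.

```python
def find_flanking_genes(repeat_start, repeat_end, genes,
                        step=15_000, limit=60_000):
    """Search for genes in progressively larger slices around a repeat.

    genes -- list of (gene_id, start, end) tuples on the repeat's chromosome
             (a stand-in for the EnsEMBL lookup).
    Returns (flank, gene_ids): flank is 0 if the repeat lies within a gene,
    the slice extension in bp at which genes were first found, or None if
    no gene lies within `limit` bp.
    """
    def overlapping(lo, hi):
        return [gene_id for gene_id, start, end in genes
                if start <= hi and end >= lo]

    hits = overlapping(repeat_start, repeat_end)
    if hits:
        return 0, hits

    flank = step
    while flank <= limit:
        hits = overlapping(repeat_start - flank, repeat_end + flank)
        if hits:
            return flank, hits  # stop as soon as a gene is located
        flank += step
    return None, []
```

With the default arguments the slices extend 15, 30, 45, and 60 kb on either side of the repeat before the search is abandoned.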

2.2.5 Generating a measure of repeat length significance

After running the script to populate Satellog, each repeat's length is compared to its class' genomic repeat length profile. The majority of repeats associated with disease undergo expansions from already large reference genome lengths relative to other repeats of the same class (Cleary and Pearson,

2003). The percentile rank of each repeat length (referred to as p-value in

Satellog) is calculated from the distribution of repeat lengths within each repeat's class (Appendix N). It reflects the proportion of repeats with the same or greater length from the repeat's genomic length distribution.
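The calculation can be sketched in Python as follows. This is an illustration and not the code of Appendix N; the function name and the tuple layout of the input are assumptions.

```python
from collections import defaultdict


def length_p_values(repeats):
    """Compute the Satellog-style p-value for each repeat.

    repeats -- list of (rep_id, repeat_class, length) tuples.
    Returns {rep_id: fraction of repeats in the same class whose length
    is the same or greater}.
    """
    lengths_by_class = defaultdict(list)
    for _, rep_class, length in repeats:
        lengths_by_class[rep_class].append(length)

    p_values = {}
    for rep_id, rep_class, length in repeats:
        lengths = lengths_by_class[rep_class]
        same_or_longer = sum(1 for other in lengths if other >= length)
        p_values[rep_id] = same_or_longer / len(lengths)
    return p_values
```

Under this definition the longest repeat of a class receives the smallest p-value, and a repeat that is the only member of its class always receives a p-value of 1.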

2.2.6 Detection and input of disease-associated repeats

Disease-associated repeats and their common properties were recently reviewed (Cleary and Pearson, 2003). Repeats that were not analyzed either had a repeat period greater than 16 (thus not detected by our TRF parameters) or were polymorphic but not associated with any disease. For these disease-associated repeats, there is no record of their precise genomic co-ordinates. To

address this, we used Satellog to probe for the probable repeat that

corresponded to each disease by selecting all repeats of the expected class

within each disease gene. Except for the repeat responsible for blepharophimosis (Crisponi et al., 2001), all repeats were detected. A total of 51 repeats were mapped for 31 diseases (Appendix O).

2.3 Prioritizing candidate repeats for disease-association studies in schizophrenia

2.3.1 Input of neuropsychiatric linkage regions into Satellog

A recent article exhaustively reviewed all schizophrenia and bipolar linkage regions identified to date (Sklar, 2002). We manually collected each linkage region and input it in a standard format into Satellog (see footnote 2). For each linkage

region, we input the genetic marker and its co-ordinates, the cytogenetic band, and PubMed ID of the source paper, the score, p-value, and type of linkage if

provided (e.g. a log odds (LOD) score of 3.4) (Appendix P). If there were any points of

interest mentioned by Sklar these were also included as supplementary notes.

2.3.2 Prioritizing candidate repeats with Satellog

2 Bipolar disorder has overlapping symptoms (DSM-IV) and linkage regions (Sklar, 2002) with schizophrenia. We decided to prioritize bipolar disorder repeats as well because the linkage regions were readily available in the same review as the schizophrenia linkage regions (Sklar, 2002) and because we had DNA from bipolar disorder probands (see 4.6.3.1). However, the rationale for this study is based strictly on the biology of schizophrenia.

Repeats were prioritized for disease association studies in schizophrenia and bipolar disorder (Appendix Q). All repeats within 50 Mb of genetic markers associated with each disease were selected from Satellog. Linkage depth was calculated for each repeat by counting the number of linkage co-ordinates that overlapped with the repeat's genomic co-ordinates. We were interested in globally prioritizing repeats in both transcribed and untranscribed regions.
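Linkage depth amounts to an interval-overlap count, which can be sketched as below. The sketch is illustrative rather than the code of Appendix Q; the tuple layouts are assumptions, with each linkage region represented by its chromosome and its pstart and qend co-ordinates.

```python
def linkage_depth(repeat, linkage_regions):
    """Count the linkage regions that overlap a repeat.

    repeat          -- (chromosome, start, end)
    linkage_regions -- list of (chromosome, pstart, qend) tuples, one per
                       linkage study marker
    """
    chrom, start, end = repeat
    return sum(
        1
        for region_chrom, pstart, qend in linkage_regions
        if region_chrom == chrom and pstart <= end and qend >= start
    )
```

A repeat covered by the regions of several independent linkage studies therefore receives a higher depth than one covered by a single study.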

Of the remaining transcribed repeats, those that had any evidence of repeat polymorphism (defined as any length polymorphism within UniGene clusters) were deposited into tables for each disease (schz_cand and bp_cand).

This restricted the analysis to repeats within either UTR or exonic sequence.

Each repeat's co-ordinates, repeat unit, period, class, length, pvalue, linkage depth, and UniGene length polymorphism statistics, location within or adjacent to

EnsEMBL genes, peptide sequence (if exonic), HUGO name, text description of associated genes, tissue expression and call within GeneNote were collected.

Lastly, only repeats with some evidence of brain expression in the GeneNote

database were retained.

All repeats were also globally prioritized without considering evidence of

repeat polymorphism within UniGene clusters and were deposited into tables for

each disease (schz_cand_global and bp_cand_global). This prioritization

paradigm therefore also considers intronic repeats. Each repeat's co-ordinates,

repeat unit, period, class, length, pvalue, linkage depth, location within or

adjacent to EnsEMBL genes, peptide sequence (if exonic), HUGO name, text description of associated genes, tissue expression and call within GeneNote were collected. Lastly, only repeats with a p-value less than 0.05 or repeat length greater than 10 and some evidence of brain expression in the GeneNote database were retained. We also checked that any repeats greater than 10 repeat units going through this filter did not have a p-value of 1. Singleton repeats have no other repeats in their repeat class and thus will have a p-value of 1 regardless of their repeat length. The repeat associated with progressive myoclonic epilepsy (EPM1) was from such a distribution and we were concerned about missing other such repeats. One such repeat (rep_id: 2829206, chr4:48577074-48577404, GGAGAAGAGGGAGAA, repeat length = 22) was detected in the bipolar linkage regions.
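The final filter of the global prioritization can be sketched as a single predicate. Two points are assumptions rather than statements from the text: the filter is read as "(p-value < 0.05 or repeat length > 10) and brain expression", and a GeneNote call of Present (P) or Marginal (M) is taken as evidence of brain expression. The function and argument names are hypothetical.

```python
def keep_global_candidate(p_value, repeat_units, brain_call):
    """Return True if a repeat passes the global prioritization filter.

    p_value      -- Satellog p-value of the repeat length within its class
    repeat_units -- repeat length in repeat units
    brain_call   -- GeneNote expression call for brain: "P", "M", or "A"
    """
    # Length criterion: significantly long for its class, or simply long.
    # The second arm rescues singleton classes, whose p-value is always 1.
    long_enough = p_value < 0.05 or repeat_units > 10

    # Assumption: Present or Marginal counts as evidence of expression.
    brain_expressed = brain_call in ("P", "M")

    return long_enough and brain_expressed
```

Under this reading a singleton repeat of 22 units with a p-value of 1 is retained provided its gene shows brain expression.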

Lastly, the prioritization approach summarized above was repeated, this time selecting repeats of the same repeat class as disease-associated repeats.

CHAPTER 3

RESULTS


3.1 cis-Features of unstable CAG/CTG repeats

3.1.1 Correlation of flanking CAG/CTG repeat features to Brock et al. expandability data

Both male and female expandability metrics based on pedigree transmission of unstable repeats were developed by Brock et al. to correlate cis-sequence features to expandability (Brock et al., 1999). All of the subsequent analyses in this report used the male pedigree transmission because the male correlation of expandability to flanking %GC was strongest and featured broader gradation of expandability values (in contrast to two 0 expandability values for

SCA1 and SBMA in the female dataset) (Brock et al., 1999). For each cis feature analyzed, any of the candidate CAG/CTG repeats (those within genes that code for 5 or more glutamines) satisfying the profile of expandable repeats within gems_cis are also displayed. All SQL code required to extract the data from the gems_cis database is available in Appendix R.

3.1.1.1 Correlation of CpG islands with expandability

Brock et al. noted that the most expandable CAG/CTG repeats were within CpG islands. Since our analysis was limited to CAG/CTG repeats within coding regions we sought to establish whether the most expandable loci were within CpG islands. The three most expandable loci were within CpG islands while the four least expandable were not (P < 0.01, Fisher's exact test) (Table 2).

3.1.1.2 Correlation of flanking %GC with expandability

GC content at 100 bp and 500 bp flanking disease-associated CAG/CTG repeat loci is positively associated with expandability (Brock et al., 1999). We extended this analysis to 50 bp, 100 bp, 500 bp, and 1,000 bp flanking all the candidate CAG/CTG repeats and known unstable repeats. The correlations observed in Brock et al. existed in our set.
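The flanking %GC measurement can be sketched in Python as follows. This is not the thesis code: the function name and the 0-based, half-open co-ordinate convention are assumptions, as is the choice to pool the upstream and downstream flanks into a single G+C fraction.

```python
def flanking_gc(sequence, repeat_start, repeat_end, flank):
    """Return the G+C fraction of `flank` bp on each side of a repeat.

    sequence     -- DNA string containing the repeat and its surroundings
    repeat_start -- 0-based index of the first repeat base
    repeat_end   -- 0-based index one past the last repeat base
    flank        -- number of flanking bases taken on each side
    The repeat itself is excluded. Returns None if no flanking sequence
    is available.
    """
    upstream = sequence[max(0, repeat_start - flank):repeat_start]
    downstream = sequence[repeat_end:repeat_end + flank]
    flanks = (upstream + downstream).upper()
    if not flanks:
        return None
    return (flanks.count("G") + flanks.count("C")) / len(flanks)


# Toy sequence: 4 bp upstream, a (CAG)3 repeat, 4 bp downstream.
toy = "GGCC" + "CAGCAGCAG" + "GTAT"
print(flanking_gc(toy, 4, 13, 4))
```

Calling the function with flank values of 50, 100, 500, and 1,000 would give the four measurements compared in this section.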

Spearman rank correlations of %GC versus expandability were calculated for 50 bp, 100 bp, 500 bp, and 1,000 bp of sequence flanking the CAG/CTG repeat. Significant positive correlations were detected for 50 bp, 100 bp, 500 bp, and 1,000 bp of flanking sequence respectively (rho = 0.82, 0.89, 0.93, 0.89, and P-value = 0.04, 0.02, 0.01, 0.02) (Figure 6). At all values of flanking sequence the associations were strong, but the association was strongest at 100 bp (Figure 7). Given this, we investigated the distribution of flanking %GC at 100 flanking bp for all candidate CAG/CTG repeats (Figure 7). The %GC values assumed a normal distribution but the repeat within one gene, SCA7, had high enough %GC to achieve statistical significance (Z-score > 1.96, P < 0.05) (Figure 7). However, if the %GC threshold is set to that of the third most expandable

Name         Start   End     Expandability   Islands
SCA7         49034   50415   1.3             TRUE
SCA2         48861   50806   0.97            TRUE
HD           48883   50320   0.29            TRUE
SI7E_HUMAN   48837   50259   NULL            TRUE
IRS1         48020   50849   NULL            TRUE
RUNX2        49004   51229   NULL            TRUE
POU3F2       46199   50723   NULL            TRUE
NM_175863    48790   51627   NULL            TRUE
PHLDA1       49109   51401   NULL            TRUE
ASCL1        47940   50545   NULL            TRUE
C14orf4      48179   51516   NULL            TRUE
POLG         49573   51779   NULL            TRUE
094795       48954   50320   NULL            TRUE
SOC6_HUMAN   48962   50278   NULL            TRUE
MN1          47193   52991   NULL            TRUE

Table 2: All unstable and candidate CAG/CTG repeat-containing genes located within a CpG island. 'Start' and 'End' columns refer to start and end co-ordinates of the CpG island relative to the 50 Mb slice of genomic sequence flanking the CAG/CTG repeat (i.e. the CAG/CTG repeat starts at 50,000).


Figure 6: a-c) Correlation between ranked median expandability and ranked %GC in 100 bp, 500 bp, and 1000 bp flanking unstable CAG/CTG repeats (Brock et al., 1999). d) Spearman's rank correlation (rho) of median expandability and %GC of 50 bp, 100 bp, 500 bp, 1,000 bp, 1,500 bp, 2,000 bp, 2,500 bp, 3,000 bp, 3,500 bp, 4,000 bp, 4,500 bp, and 5,000 bp of sequence flanking the CAG/CTG repeat. The P-value (P) is the probability of observing Spearman's coefficient (rho) by chance.

[Figure 7 plot. Title: Histogram of %GC 100 bp flanking CAG repeat of Candidate GeMS Genes. X-axis: %GC of 100 bp flanking repeat.]

Figure 7: Histogram of %GC of 100 bp flanking the repeat in candidate CAG/CTG repeat sequences. Red bar indicates sole gene (SCA7) with %GC content achieving statistical significance based on z-score within this distribution.

locus (HD) then a total of six genes have significant flanking %GC. Three other genes without expandability data, POU3F2, C14orf4, and CACNA1A, had flanking %GC at least equal to HD (Table 3).
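Spearman's rank correlation, used throughout this section, can be illustrated with a short pure-Python sketch. The statistical analyses in this thesis were run in R (Appendix G); the code below is only an illustration of the coefficient itself, computed as the Pearson correlation of average ranks, and the function names and example values are invented.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the ranked values."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example values, not data from this study.
print(round(spearman_rho([1, 2, 3, 4, 5], [5, 6, 7, 8, 7]), 4))
```

The sketch does not compute the accompanying P-value, which depends on the number of loci and was obtained from R.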

3.1.1.3 Correlation of repeat length and purity with expandability

Having confirmed the correlations observed in Brock et al., we sought to explore new relationships between expandability and CAG/CTG repeat length and purity. Repeat length was defined as the length of nucleotides bounded by the chromosomal co-ordinates detected by Tandem Repeats Finder (TRF).

Spearman ranked correlations revealed no relationship between expandability and CAG/CTG repeat length (rho = -0.32, P = 0.50) (Figure 8). Repeat purity was defined internally by flanker.pl by counting the number of contiguous occurrences of the repeat unit as specified by TRF. Spearman ranked correlations revealed no relationship between expandability and CAG/CTG repeat purity (rho = -0.11, P = 0.84) (Figure 9).
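The purity measure, taken here as the longest uninterrupted run of the repeat unit, can be sketched in Python. This is an illustration rather than the code of flanker.pl; the function name is hypothetical and the regular-expression approach is an assumption.

```python
import re


def longest_pure_run(sequence, unit):
    """Return the longest run of contiguous copies of `unit` in `sequence`,
    measured in repeat units (0 if the unit does not occur)."""
    pattern = "(?:%s)+" % re.escape(unit.upper())
    runs = re.findall(pattern, sequence.upper())
    if not runs:
        return 0
    return max(len(run) for run in runs) // len(unit)


# A (CAG)3 tract, a CAA interruption, then a (CAG)2 tract.
print(longest_pure_run("CAGCAGCAGCAACAGCAG", "CAG"))
```

An interruption such as CAA splits a tract into two shorter pure runs, so an interrupted repeat scores lower than a pure repeat of the same overall length.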

3.1.1.4 Correlation of CTCF binding sites with expandability

CTCF binding sites are known to flank unstable repeats. For this section

of the analysis we are including DMPK, a gene with an expandable CAG/CTG

repeat in its 3'-UTR, because it is the only disease-associated locus for which it

is known with experimental certainty that CTCF binds to sequences flanking the

Name      Expandability   100_bp
SCA7      1.3             0.83
SCA2      0.97            0.77
HD        0.29            0.76
POU3F2    NULL            0.76
C14orf4   NULL            0.765
CACNA1A   NULL            0.8

Table 3: All CAG/CTG repeat-containing genes with 100 bp of flanking sequence having %GC at least equal to that of HD. The '100_bp' column summarizes the G+C fraction of 100 bp flanking the CAG/CTG repeat.

[Figure 8 plot. Title: Expandability vs. Repeat Length (CAG repeats known to be unstable; repeat length derived from TRF co-ordinates). X-axis: Repeat Length (rank). rho = -0.32, P = 0.50.]

Figure 8: No correlation was observed between ranked expandability and ranked repeat length. Repeat length is defined as the absolute length of the repeat, irrespective of purity, defined by Tandem Repeats Finder co-ordinates. The P-value (P) is the probability of observing Spearman's coefficient (rho) by chance.

[Figure 9 plot. Title: Expandability vs. Repeat Purity (CAG repeats known to be unstable; purity defined as longest contiguous repeat unit). X-axis: CAG-repeat purity (rank). rho = -0.11, P = 0.84.]

Figure 9: Correlation between ranked expandability and ranked CAG/CTG repeat purity. No correlation was observed between ranked expandability and ranked CAG/CTG repeat purity. CAG/CTG repeat purity defined as longest contiguous stretch of the repeat unit specified in Tandem Repeats Finder. The P-value (P) is the probability of observing Spearman's coefficient (rho) by chance.

CAG/CTG repeat (Filippova et al., 2001). We felt that including the CTCF binding sites flanking the CAG/CTG repeat of DMPK in this analysis provides insight into the significance of CTCF binding sites flanking CAG/CTG repeats in all genomic contexts. The expandable genes were analyzed for CTCF binding sites within 5,000 bp flanking their CAG/CTG repeats. This amount of flanking sequence was chosen because a high number of hits are within this region across most of the expandable genes. CTCF binding site proximity to the CAG/CTG repeat was detected with HMMer. Spearman ranked correlations revealed no relationship between expandability and CTCF scores within 5,000 bp flanking the CAG/CTG repeat (rho = 0.08, P = 0.84) (Figure 10). Furthermore, no relationship was obvious between expandability and distance from the repeat (rho = -0.09, P = 0.81) (Figure 10). However, a significant negative association was detected between CTCF score and distance from the CAG/CTG repeat (rho = -0.75, P = 0.03) (Figure 10). These relationships were similar for 10,000 bp flanking the repeat. Next we sought to compare the distribution of CTCF binding sites in candidate CAG/CTG repeats versus expandable genes. Interestingly, when CTCF scores of 1.00 or higher within 1,000 flanking bp were evaluated, the three most expandable loci had one or more CTCF binding sites (P < 0.01, Fisher's exact test) (Table 4).

The highest scoring hits were those for the DMPK CTCF binding sites

because they were used as part of the training set for the HMM. The quality of

[Figure 10 plots. Panel a: Median Expandability vs. CTCF Score (5,000 bp); x-axis: CTCF Score (rank); rho = 0.08, P = 0.84. Panel b: Median Expandability vs. Distance from Repeat (5,000 bp); x-axis: Distance from CAG-repeat (rank); rho = -0.09, P = 0.81. Panel c: Distance of CTCF hit vs. CTCF Score (5,000 bp); x-axis: CTCF Score (rank).]

Figure 10: a) Plot of ranked median expandability against ranked score of computationally detected CTCF binding sites (from HMMer) in 5,000 bp flanking the CAG/CTG repeats known to be unstable (Brock et al., 1999). b) Plot of ranked median expandability against ranked distance of CTCF binding site in 5,000 bp flanking the CAG/CTG repeat of genes known to be unstable (Brock et al., 1999) c) Correlation between ranked distance from the CAG/CTG repeat and ranked score of hits. Each point represents a CTCF binding site. The P-value (P) is the probability of observing Spearman's coefficient (rho) by chance.

the remaining hits cannot be assessed because they have not been confirmed experimentally (Table 4). From a qualitative perspective, CTCF binding site hits can be ranked using those of DMPK as the ideal (which has a profile of high score, small distance from the CAG/CTG repeat, and presence of a flanking CTCF binding site). The CTCF binding sites detected adjacent to the CAG/CTG repeats within the KCNN3, RUNX2 and POU3F2 genes are also high scoring hits. The CTCF binding site flanking the CAG/CTG repeat within TNS is a lower scoring hit but is closer to the CAG/CTG repeat. The CAG/CTG repeats within SCA7, NM_175863, and SOC6_HUMAN have CTCF sites flanking both sides of the CAG/CTG repeat.

The CTCF binding sites are degenerate because of the domain architecture of the protein (Ohlsson et al., 2001). This degeneracy may be reflected as weaker scoring hits (score of 0 to 0.99), such as those flanking the

CAG/CTG repeat within IRS1 (Table 5).

3.1.1.5 Correlation of nucleosome formation potential with expandability

Nucleosome formation potential was evaluated with the web-based version of NucleoMeter (http://wwwmgs.bionet.nsc.ru/mgs/programs/recon/) (Levitsky et al., 2001). NucleoMeter calculates the nucleosome formation potential of input sequence within a 160 bp window. Values of +1 and higher

Gene Name    Score   Distance   Expandability   Start   End
DMPK         15.5    101        4.81            49839   49899
DMPK         16.04   34         4.81            50096   50156
SCA7         6.18    15         1.3             49924   49985
SCA7         1.31    838        1.3             49100   49162
SCA2         2.61    481        0.97            50552   50615
KCNN3        3.73    170        NULL            49769   49830
TNS          1.22    22         NULL            50052   50113
RUNX2        6.17    641        NULL            49301   49359
POU3F2       3.18    918        NULL            49019   49082
NM_175863    2.91    538        NULL            49401   49462
NM_175863    5.84    415        NULL            49524   49585
NM_175863    3.22    225        NULL            50279   50339
NM_175863    2.01    402        NULL            50456   50516
NM_175863    1.33    429        NULL            49509   49571
CIZ1_HUMAN   1.19    883        NULL            50977   51039
CIZ1_HUMAN   2.98    460        NULL            50554   50615
094795       2.26    219        NULL            50308   50369
SOC6_HUMAN   2.96    0          NULL            50025   50086
SOC6_HUMAN   2.37    566        NULL            49374   49434

Table 4: All CTCF binding sites with a HMMer score greater than 1 that are within 1,000 bp of a CAG/CTG repeat.

Name         Score   Distance   Expandability   Start   End
DRPLA        0.14    701        0.19            50761   50822
RORC         0.16    0          NULL            49991   50053
IRS1         0.08    46         NULL            50091   50152
IRS1         0.99    425        NULL            49515   49575
RUNX2        0.06    545        NULL            49394   49455
CIZ1_HUMAN   0.53    618        NULL            50712   50774
CIZ1_HUMAN   0.19    654        NULL            50748   50807
CACNA1A      0.04    602        NULL            50643   50706
PRKCBP1      0.06    395        NULL            50436   50499

Table 5: All CTCF binding sites with an HMMer score between 0 and 1 within 1,000 bp of a CAG/CTG repeat. These may represent true CTCF sites because of the binding degeneracy of CTCF.

indicate high nucleosome formation potential whereas values of -1 and lower indicate poor nucleosome formation potential. Portions of flanking sequence that had poor di-nucleotide content were inappropriate for consideration of nucleosome binding sites and were given a null score by NucleoMeter. NucleoMeter returns a running score over each 160 bp window of input sequence. We summarized flanking nucleosome formation potential by averaging the running score for the flanking sequences and then averaging the upstream and downstream scores. We felt that this gives a rough indication of the nucleosome formation potential of the flanking sequences as a whole. NucleoMeter lacks an API so nucleosome formation potential had to be assessed manually with nucleo.pl, a script designed to extract 200 bp both upstream and downstream flanking the CAG/CTG repeat defined by TRF, but not including the repeat itself (Appendix E). The SCA3 flanking sequences were composed solely of null scores and were given a score of NA for statistical calculations in R. Spearman ranked correlations revealed no relationship between expandability and nucleosome formation potential (rho = -0.37, P = 0.50).
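The summary of the running scores can be sketched in Python as follows. This is an illustration, not the procedure's actual code: the function names are hypothetical, null scores are represented by None and skipped when averaging, and returning None (standing in for NA) when either flank has no valid score is an assumption.

```python
def mean_ignoring_null(scores):
    """Average the non-null running scores; None if all are null."""
    valid = [score for score in scores if score is not None]
    return sum(valid) / len(valid) if valid else None


def flank_nucleosome_potential(upstream_scores, downstream_scores):
    """Summarize nucleosome formation potential around a repeat.

    Each argument is the list of running scores for one 200 bp flank,
    with None standing in for a null score. The two flank averages are
    themselves averaged; None is returned if either flank has no valid
    score (an assumption made for this sketch).
    """
    upstream = mean_ignoring_null(upstream_scores)
    downstream = mean_ignoring_null(downstream_scores)
    if upstream is None or downstream is None:
        return None
    return (upstream + downstream) / 2


print(flank_nucleosome_potential([1.0, None, 0.0], [-1.0, -1.0]))
```

A locus whose flanks contain only null scores yields None, mirroring the NA score assigned to SCA3.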

See Appendix G for example R scripts used for the statistical analysis discussed in this section.

3.2 Genomic repeat analysis with the Satellog database

3.2.1 Summary statistics

A total of 8,357,425 pure repeats were detected by TRF in the human genome and were stored in Satellog. Of these, 5,398,328 or 64.6% were detected within an EnsEMBL-defined gene or within 60 kb flanking either side of an EnsEMBL gene. These repeats mapped to 7,260,625 genetic locations in or near EnsEMBL genes, reflecting the fact that some repeats were located within more than one gene. Of the genes in EnsEMBL, 92% (21,654 / 23,531) had at least one pure repeat within 60 kb of their gene boundaries. All repeats in Satellog clustered into 70,318 unique repeat classes.

3.2.2 Characteristics of disease-associated repeats

Disease-associated repeats and their common properties were recently reviewed (Cleary and Pearson, 2003). We queried Satellog with these sequences to observe any characteristic features of these repeats relative to all other repeats. We asked how many of these repeats could be identified as

potentially unstable using only the bioinformatics resources within Satellog. A total of 31 of the 35 disease-associated repeats were manually collected from the

review and input into Satellog. Repeats that were not analyzed either had a

repeat period greater than 16 (thus not detected by our TRF parameters) or were

polymorphic but not associated with any disease. For these disease-associated

repeats, there is no record of their precise genomic co-ordinates. To address

this, we used Satellog to probe for the probable repeat that corresponded to each

disease by selecting all repeats of the expected class within each disease gene.

All repeats were detected, except for the repeat responsible for blepharophimosis (Crisponi et al., 2001). In 12 cases, more than one candidate was detected as the disease-associated repeat for a disease. These cases usually involve flanking repeats of the same class that are detected as two distinct repeats because of an interrupting unit, an established characteristic of some disease-associated repeats such as those responsible for SCA1 (Chung et al., 1993) and Fragile X (Kunst and Warren, 1994). In these cases, we simply retained both repeats and associated them with the disease.

A total of 51 repeats were mapped for 31 diseases. These disease-associated repeats were located in all gene locations, from exons, introns, 5'-UTR, 3'-UTR and up to 45 kb away from a gene. Interestingly, these repeats were from only 6 repeat classes. Trinucleotide repeats are the most common

repeat class implicated in disease (Cleary and Pearson, 2003), especially for disorders caused by coding repeat expansion. Of the disease-associated

repeats we analyzed, 28 of the 31 were trinucleotide repeats with 16 being from the CAG/CTG repeat class, 11 from the GCG repeat class, and one each from

the CCCCGCCCCGCG, CCTG, GAA, and ATTCT repeat classes respectively.

These disease-associated repeat classes had dramatically different genomic

distribution (Figure 11). For example, the CCCCGCCCCGCG dodecamer

implicated in progressive myoclonic epilepsy type 1 (EPM1) (Lalioti et al., 1997)

is the only pure repeat of its class detected in the human genome and therefore

has a singleton as its distribution. The remaining repeat classes have broader

distributions, particularly the GAA repeat class. GAA repeats have been reported to have a unique distribution relative to other trinucleotide repeats due to their evolutionary origin within Alu repeats (Clark et al., 2004). Satellog recapitulated a distinct, expanded profile for GAA repeats relative to all other trinucleotide repeats (Figure 11).

We define significant repeat length in the reference genome as any repeat with length within the top 5% of its class (corresponds to a p-value < 0.05 in

Satellog). Using this cut-off, we determined whether the reference genome

repeat length is significant for any of the disease-associated repeats within their

respective disease classes. Interestingly, 80% (24/30) of the disease-associated repeats in Figure 11 were significantly long in the reference genome given their repeat class' length distribution (P-value < 0.05). In fact, 20 of 30 of all disease-associated repeats had a p-value of 0.01 or less indicating that these repeats were the extreme outliers within their class.


Of the coding repeats, 12 of 17 had significant repeat lengths, including all the CAG-type repeats. Exceptions were the cleidocranial dysplasia (CCD), hand-foot-genital syndrome (HFGS), synpolydactyly, oculopharyngeal muscular dystrophy (OPMD), and holoprosencephaly coding GCG repeats. The

CCCCGCCCCGCG dodecamer implicated in progressive myoclonic epilepsy type 1 (EPM1) is not included in this comparison because there were no other pure repeats of its class in the genome.

3.2.3 Characteristics of repeats polymorphic within UniGene clusters

We used a bioinformatics approach to see if we could detect repeat polymorphisms within UniGene sequences. Of the 8,357,425 pure repeats detected by Satellog, 1.3% or 111,950 repeats were detected as transcribed by the EnsEMBL API (either in the UTR or exon sequence of the gene). Of these repeats, approximately half (57.4% or 64,116 repeats) were detected within

UniGene cluster sequences. Finally, of these repeats, only 5,546 repeats were detected as polymorphic (defined as any repeat that had at least one sequence within a cluster with a different repeat length) (Figure 12). A measure of repeat polymorphism was provided by calculating the standard deviation (sd) of all

repeat lengths detected within a UniGene cluster. A total of 3,165, 2,044, and 55

polymorphic repeats were detected in 3'-UTR, exonic, and 5'-UTR sequence

respectively (Note, repeats may exist in more than one gene which is why the

location break-down of the repeats is greater than the total number of distinct

polymorphic repeats of 5,546). The degree of polymorphism is greater in 3'-UTR sequence than in exonic or 5'-UTR sequence (One-way ANOVA, P < 0.05) (Figure 13).

[Figure 12 chart: Unstable Repeats as a Fraction of Total Repeats Detected in UniGene Clusters. Legend: untranscribed repeats; transcribed repeats not detected in UniGene clusters; transcribed repeats detected as stable in UniGene clusters; transcribed repeats detected as unstable in UniGene clusters.]

Figure 12: Polymorphic repeats make up a tiny portion of all pure repeats detected in Satellog. Approximately half of all the 111,950 transcribed repeats were mapped to UniGene clusters, but only 5,546 or 0.07% of all repeats were detected as polymorphic within UniGene clusters.

Next we evaluated the tolerance of repeat polymorphisms by various repeat periods in exonic and UTR sequence. To observe whether highly polymorphic repeats were restricted to certain repeat periods (defined as repeat unit length), the repeat period distribution was observed at progressively increasing sd values (Figures 14 and 15). Untranslated repeats were well distributed across all repeat periods except for 16mers at an sd cut-off of 1

(which roughly corresponded to repeat polymorphisms of 1 repeat unit). At increasing sd cut-offs, untranslated polymorphic repeats were detected as penta-, tri-, and mainly di-nucleotide repeats (Figure 14). In contrast, while coding repeat polymorphisms were widely distributed at an sd of 1, they were mainly restricted to trinucleotide repeats at higher sd cut-offs (Figure 15). Although the untranslated repeats had higher sd values, their most polymorphic sd values were restricted to mono- and di-nucleotide repeats.
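The polymorphism measure and the sd cut-off analysis described above can be sketched as follows. This is an illustration under stated assumptions: the population standard deviation is used (the text does not say whether the sample or population form was applied), a cut-off is treated as inclusive, and all names and toy values are hypothetical.

```python
from collections import Counter
from statistics import pstdev


def summarize_cluster(observed_lengths):
    """Summarize the repeat lengths observed for one repeat within a UniGene cluster."""
    distinct = set(observed_lengths)
    return {
        # Polymorphic: at least one sequence shows a different repeat length.
        "polymorphic": len(distinct) > 1,
        # Assumption: population standard deviation of all observed lengths.
        "sd": pstdev(observed_lengths),
    }


def period_distribution(repeats, sd_cutoff):
    """Count repeat periods among (period, sd) pairs with sd >= sd_cutoff."""
    return Counter(period for period, sd in repeats if sd >= sd_cutoff)


# Toy (period, sd) pairs, illustrative only.
toy_repeats = [(2, 3.5), (3, 1.2), (2, 0.5), (5, 1.0)]
print(summarize_cluster([10, 10, 12]))
print(period_distribution(toy_repeats, sd_cutoff=1))
```

Raising `sd_cutoff` step by step and re-counting the periods reproduces the kind of comparison summarized in Figures 14 and 15.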

3.2.4 Disease-associated repeats detected in UniGene clusters

We were interested in whether known disease-associated repeats were polymorphic within UniGene clusters. We extracted the ten most polymorphic coding and non-coding repeats, based on their sd values, and determined whether any of the disease-associated repeats were among them. The repeats associated with SBMA (AR is the gene mutated in individuals affected with SBMA), DRPLA, and SCA17 (TBP is the gene mutated in individuals affected with SCA17) were detected as the first-, third-, and fourth-most polymorphic coding repeats (Table 6). The coding AIB-I repeat that confers increased risk of prostate cancer was also detected as polymorphic, but not in the top ten. Of the non-coding repeats, the repeat responsible for FRAXE was detected as polymorphic, but not as one of the ten most polymorphic untranslated repeats (Table 7).

[Figure 13 plot: Boxplot Comparison of Polymorphic Repeats from Exonic, 5'-UTR and 3'-UTR Sequence. x-axis categories: Exon, 5'-UTR, 3'-UTR.]

Figure 13: Median standard deviations (line through box) of polymorphic repeats detected in exonic, 5'-UTR, and 3'-UTR sequence. Median exonic and 5'-UTR standard deviations did not significantly differ from each other, but did significantly differ from that of 3'-UTR repeats, implying that the 3'-UTR tolerates larger, more expanded repeats (One-way ANOVA, P < 0.05).

Of the 31 disease-associated repeats discussed previously, only 5 repeats were detected as polymorphic within UniGene clusters. We sought to understand why this occurred. Of the 31 disease-associated repeats, 4 failed to map within the genomic co-ordinates of any mapped UniGene cluster. The

remaining 27 repeats mapped within a UniGene cluster's genomic co-ordinates.

However, 16 of these failed to be detected within UniGene sequences even though they mapped within a UniGene cluster. This could be because of the 3'

bias of the UniGene sequences, the incomplete nature of the clusters (Wheeler et al., 2004), sequence errors in the representative UniGene cluster sequence

we searched against for hits (Hs.seq.uniq - see METHODS for details), or the

limitations of our mapping algorithm. Our approach enforces that the repeat

must exist with at least 10 bp of flanking sequence, which leaves out repeats at

the edge of UniGene clusters. The remaining 11 disease-associated repeats

were detected within UniGene clusters, but only 5 of these repeats were

polymorphic. On average, the repeats detected as polymorphic had more hits

within UniGene clusters than those detected as stable (an average of 17.4 observations per repeat for the polymorphic repeats compared with 4.54 for stable repeats). This suggests that there is a greater chance of observing repeat polymorphism with deeper sampling. All of the polymorphic repeats were limited to one UniGene cluster and none of the lengths surpassed the disease premutation thresholds of 29, 25, 36, 42, and 39 pure repeats for the repeats responsible for increased prostate cancer risk (AIB-I), DRPLA, SBMA, SCA17, and FRAXE respectively (Cleary and Pearson, 2003).

3.3 Candidate repeats for typing in schizophrenia and bipolar disorder

Polymorphic candidate repeats were scored and selected in descending order of score. The score was calculated by multiplying

linkage depth by the standard deviation of repeat polymorphisms within UniGene clusters and then dividing by the repeat's p-value. Candidate repeats were also

selected by a global prioritization approach that did not take into account repeat

polymorphism. This approach selected only those repeats with a p-value less

than 0.05 and repeat length greater than 10. Lastly, we prioritized repeats with

the same approach, but restricted the repeats to those from repeat classes

associated with disease. One repeat in bipolar linkage regions was extracted

separately (see 2.3.2) (rep_id: 2829206, chr 4: 48577074-48577404,

GGAGAAGAGGGAGAA, repeat length = 22).
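The two prioritization rules described in this section can be sketched as below. The score follows the stated formula (linkage depth multiplied by the sd of repeat polymorphism, divided by the p-value), and the global filter applies the stated thresholds (p-value below 0.05 and repeat length above 10). The `Repeat` record, its field names, and the example values are hypothetical and are not taken from Satellog's schema.

```python
from dataclasses import dataclass


@dataclass
class Repeat:
    rep_id: int           # hypothetical identifier
    linkage_depth: float  # linkage depth of the region containing the repeat
    sd: float             # sd of repeat lengths within UniGene clusters
    p_value: float        # p-value of the repeat length within its class
    length: float         # repeat length in the reference genome


def polymorphism_score(repeat):
    """Score = linkage depth * sd / p-value, as stated in the text."""
    return repeat.linkage_depth * repeat.sd / repeat.p_value


def rank_polymorphic(repeats, top_n=20):
    """Return the top_n repeats in descending order of score."""
    return sorted(repeats, key=polymorphism_score, reverse=True)[:top_n]


def global_candidates(repeats, alpha=0.05, min_length=10):
    """Keep repeats with p-value < alpha and length > min_length, ignoring polymorphism."""
    return [r for r in repeats if r.p_value < alpha and r.length > min_length]


# Illustrative records only.
examples = [Repeat(1, 2, 1.5, 0.01, 22), Repeat(2, 1, 4.0, 0.5, 8)]
print([r.rep_id for r in rank_polymorphic(examples)])
print([r.rep_id for r in global_candidates(examples)])
```

Dividing by the p-value rewards repeats that are unusually long for their class, so a small p-value can outweigh a modest sd; a p-value of exactly zero would need separate handling, which this sketch does not attempt.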

CHAPTER 4

DISCUSSION

4.1 cis-Features of unstable CAG/CTG repeats

4.1.1 Identifying cis-mediators of instability

Research into cis-sequences mediating the instability of CAG/CTG repeats has suffered due to a lack of theoretical knowledge about repeat expansion in general. However, a useful measure of genomic CAG/CTG repeat instability has been developed, termed 'expandability' (Brock et al., 1999). This expandability metric allows the correlation of flanking sequence features to repeat instability at disparate CAG/CTG loci. We have repeated this analysis and extended it to identify novel flanking sequence features associated with instability, with one modification to the approach of Brock et al., 1999: only coding

CAG/CTG repeats were analyzed. We limited our analysis in this respect because the selective constraints on expansion differ between coding and non- coding regions (Cleary and Pearson, 2003). Our results agree with those of

Brock et al., 1999 to the extent that we found flanking %GC and presence within

CpG islands to be correlated with repeat expandability. Furthermore, we have extended the original analysis to evaluate associations between expandability,

repeat length and purity, CTCF binding sites, and nucleosome formation

potential. We have also summarized a number of candidate CAG/CTG repeats that have cis-features similar to genes associated with instability.

4.1.1.1 Association between flanking %GC and instability

We observed a positive association between flanking GC content and expandability but the exact percentages and trends were different from those published (Brock et al., 1999). Of immediate concern were the differences in flanking GC content between our study and Brock et al., 1999. In Brock et al.,

1999 and in our study the SCA7 locus had the highest flanking GC content with

83.5% and 83.0% respectively. We sought to understand why these differences

existed. Flanking sequences were manually extracted and counted and the

reason for this difference became apparent. We relied on TRF sequence co-ordinates to locate the repeat, but TRF identifies repeats by their first tandemly repeated unit. Therefore, if a CAG/CTG repeat was preceded by a non-repeat G, the first tandemly repeating unit, GCA, would be counted. For example, in the sequence AATGCAGCAGCAGGGAG the CAG repeat is CAGCAGCAG, whereas the repeat detected by TRF is GCAGCAGCAG, which begins one base upstream. In contrast, Brock et al., 1999 manually extracted the sequence based on the absolute co-ordinates of each CAG repeat. When we repeated the analysis in this way for the SCA7 locus, we obtained the same value as that published (83.5%). The %GC difference between our study and Brock et al., 1999 at the SCA3 locus was a result of using the complete human genome sequence versus the incomplete version available at the time of their study. Another difference between our study and Brock et al., 1999 was the correlation co-efficient trend. The correlation co-efficient in Brock et al. decreased when 100 bp to 500 bp flanking the repeat was considered. Conversely, in our study, the strongest correlation co-efficient was observed at 500 bp (0.93), although the difference was minuscule (0.4).

Interestingly, our weakest correlation was at 50 bp flanking the repeat. This difference arises because we did not consider the ERDA1 locus, which is in a GC-poor portion of the genome and skews correlations higher. CpG islands, on the other hand, were detected as expected from Brock et al., with the more expandable loci harbouring them.
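The effect of a one-base shift in the repeat co-ordinates on flanking %GC can be illustrated with a short sketch. It uses the example sequence from the text with 0-based, half-open co-ordinates; the 3 bp flank is chosen only to keep the toy example readable (the analyses above used flanks of 50 bp to 500 bp), and the function names are hypothetical.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a sequence."""
    if not seq:
        raise ValueError("sequence must not be empty")
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def flanking_gc(sequence, start, end, flank):
    """%GC of up to `flank` bases on each side of the repeat at [start, end)."""
    left = sequence[max(0, start - flank):start]
    right = sequence[end:end + flank]
    return gc_percent(left + right)


example = "AATGCAGCAGCAGGGAG"
cag_span = (4, 13)  # CAGCAGCAG, the repeat as defined by absolute co-ordinates
trf_span = (3, 13)  # GCAGCAGCAG, the repeat as reported from the first tandem unit

print(flanking_gc(example, *cag_span, flank=3))
print(flanking_gc(example, *trf_span, flank=3))
```

With the shifted start, the G immediately upstream of the repeat is absorbed into the repeat and an extra upstream base enters the flank, so the two spans give different flanking compositions even though they describe the same locus.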

4.1.1.2 Association between flanking repeat length, purity and instability

No correlation between expandability and repeat length or purity was observed in our analysis. Brock et al., 1999 likewise found no association between repeat length and expandability; they did not explore relationships with repeat purity. In one respect this is encouraging, as it indicates that the expandability metric is independent of repeat length. Repeat length and purity remain useful tools in haplotype analysis. For example, in HD, longer and purer repeats are on average diagnostic of earlier and more severe disease onset (Cleary and Pearson, 2003).

4.1.1.3 Association between flanking CTCF binding sites and instability

CTCF binding sites have been documented to be associated with expandable loci (Filippova et al., 2001). We have used an HMM to detect CTCF binding sites in 1,000 bp flanking the CAG/CTG repeat. Important points from experimentation with CTCF binding sites at the DM locus are a) flanking proximal

CTCF binding sites in conjunction with CAG/CTG repeats form an insulator module, b) the CAG/CTG repeats are an important component of this biological system, c) the upstream binding site is more important than the downstream site for insulator activity (Filippova et al., 2001). Of initial interest was whether our

HMM detected experimentally determined CTCF binding sites flanking CAG/CTG repeats. As revealed by gel-shift assays, the CAG/CTG loci within the DM,

DRPLA and SCA7 had upstream and downstream CTCF binding sites, while HD and SCA2 only had upstream sites. We detected CTCF binding sites flanking

DMPK (as expected as DMPK CTCF binding sites were used in training the

HMM), two upstream of SCA7, one downstream of DRPLA and SCA2, and none

at HD. This is not necessarily a failure of the HMM because the precise

sequence that CTCF bound to was not determined in these experiments and it is

not known if these sequences are of the same family as those used to train our

HMM. The CTCF protein has 11 zinc-fingers which may mediate interactions

with a range of sequence profiles (Ohlsson et al., 2001). Our HMM can only

detect binding sites that it has been trained on, therefore hits must share

sequence features with the CTCF sites previously determined. The proximity (15

bp) and high score (6.18) of the SCA7 locus makes that hit appear more 'real'

than others but such statements are strictly qualitative without further

experimental data. Interestingly, La Spada et al. presented evidence at a recent microsatellites conference that the particular sequence motif of the CTCF binding site flanking the CAG/CTG at the SCA7 locus mediates repeat instability (La

Spada et al., 2004). In their work, direct mutation of known CTCF contact nucleotides in a binding site adjacent to the SCA7 locus, the most unstable

CAG/CTG repeat, resulted in a significant decrease in repeat instability.

4.1.1.4 Association between flanking nucleosome formation and instability

Spearman rank correlations revealed no relationship between expandability and nucleosome formation potential as defined by NucleoMeter

(Levitsky et al., 2001) (rho = -0.37, P = 0.50). NucleoMeter assigns high nucleosome formation potential scores to DNA regions most likely to form nucleosomes. In their training set, DNA with tissue-specific expression profiles had the highest scores, reflecting the fact that in most tissues these genes would be heterochromatinized (Levitsky et al., 2001). We cross-referenced the unstable genes summarized by Brock et al. with the GeneNote database to see if they exhibited tissue-specific expression. Over half of the genes including SCA7,

SCA2, SCA1, and SCA3 had evidence of expression in all tissues. HD and

SBMA were expressed in seven tissues, which is over half of the tissues profiled

by GeneNote. DRPLA had the most limited expression profile as it was expressed in only three tissues. Therefore, none of the genes analyzed, and especially not the most unstable ones (SCA7 and SCA2), had tissue-specific expression. It is not surprising that the nucleosome formation potential at these sites does not correlate with expandability, as it appears that the majority of the genes evaluated are ubiquitously expressed and would likely not share cis-features with genes having tissue-specific expression. It should be noted that the field of computational nucleosome formation potential prediction is new and Levitsky's group is the only one solely focused on this problem. It is thus challenging to assess the accuracy of NucleoMeter's predictions as there are no competing tools investigating the problem.
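For readers unfamiliar with the statistic, a Spearman rank correlation can be computed as the Pearson correlation of the ranked values. The sketch below is a plain-Python illustration with average ranks for ties; it is not the software used in this work, it does not compute a P-value, and the example vectors are toy values rather than expandability or NucleoMeter scores.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation co-efficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy vectors, illustrative only.
print(spearman_rho([1, 2, 3, 4], [40, 30, 20, 10]))
```

Because only ranks are used, the statistic is insensitive to the scale of the expandability metric, which suits a relative measure of instability.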

4.1.2 Prioritizing candidate CAG/CTG repeats

We sought to apply the conclusions from both our re-analysis of the Brock

et al. data and our own analyses to the set of candidate CAG/CTG repeats. Our assumption is that candidate CAG/CTG repeats sharing features with unstable

loci are more likely themselves to be unstable and should be set as higher

priority regions to genotype. We identified statistically significant associations

between expandability and CpG islands, GC content, and CTCF binding sites. Candidate

CAG/CTG repeats having at least one of these features were compared to

identify co-occurrences (Table 16). Since no expandability data exists for the

candidate CAG/CTG loci, statistical associations cannot be directly calculated for

these loci and their flanking sequence features. Instead we sought to identify

candidate loci that fit the 'expandability profile'. Eight genes fit the 'expandability profile': CACNA1A, C14orf4, POU3F2, ASCL1, RUNX2, SOC6_HUMAN, IRS1, and NM_175863. CACNA1A, C14orf4, POU3F2, ASCL1, and RUNX2 were interesting based on their high flanking GC content, presence within CpG islands and CTCF binding sites. SOC6_HUMAN, IRS1, and NM_175863 were interesting primarily because of the presence of pairs of flanking CTCF binding sites and, in the case of SOC6_HUMAN, the close proximity of the CTCF site to the CAG/CTG repeat.

4.2 Genomic repeat analysis with the Satellog database

Satellog presents human microsatellite repeat data in a manner relevant to disease association studies. The selection of each bioinformatics feature or supplementary data source in Satellog is rationalized by its biological relevance to polymorphic satellite repeats. Satellog recapitulates many known biological facts about micro- and minisatellite repeats and reveals new patterns of disease- associated repeats.

The literature does not document differences in repeat polymorphism between repeats residing in different genetic regions. Although one might expect greater polymorphism in UTR sequence relative to exons due to reduced evolutionary constraints, both 5'-UTR and exonic repeats had similar rates of

polymorphism, whereas 3'-UTR repeats had significantly greater polymorphism

compared to these two groups (Figure 13). This may be due to the documented

3'-UTR sequence over-representation in UniGene (Wheeler et al., 2004).

However, depending on whether the repeat is within exonic or UTR sequence, there appear to be constraints regarding what repeat unit sizes can tolerate large polymorphisms. Of the more polymorphic UTR repeats (those with sd values greater than 3), there was a single trinucleotide repeat amongst mainly dinucleotide and mononucleotide repeats (Figure 14, Table 7). On the other hand, the majority of exonic polymorphism, although less pronounced, is almost entirely in multiples of three (Figure 15, Table 6). Our results support the observation that coding microsatellite polymorphisms are usually in-frame in order to avoid a deleterious phenotype resulting from frame-shift (Metzgar and

Wills, 2000) or to provide a rapid evolutionary response to a changing environment (Kashi et al., 1997). It is interesting that polymorphism data present in the UniGene dataset recapitulate this biological principle.
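The in-frame argument above reduces to simple arithmetic: a change in copy number alters the sequence length by the repeat period multiplied by the number of units gained or lost, and the reading frame is preserved only when that change is a multiple of three. A minimal sketch, with hypothetical function names and toy copy numbers:

```python
def length_changes_bp(period, copy_numbers):
    """Length change in bp of each observed allele relative to the shortest one."""
    shortest = min(copy_numbers)
    return [period * (n - shortest) for n in copy_numbers]


def preserves_reading_frame(period, copy_numbers):
    """True when every observed length change is a multiple of three bases."""
    return all(change % 3 == 0 for change in length_changes_bp(period, copy_numbers))


# Toy observations: (repeat period, copy numbers seen within a cluster).
print(preserves_reading_frame(3, [10, 11, 14]))
print(preserves_reading_frame(2, [10, 11]))
print(preserves_reading_frame(2, [10, 13]))
```

Any copy-number change in a trinucleotide or hexanucleotide repeat passes this check, whereas repeats of other periods pass only when the number of units gained or lost happens to make the total change divisible by three.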

4.3 Repeat prioritization in schizophrenia with Satellog

We selected repeats that were within the recombination limits (50 Mb or the end of the chromosome) of genetic markers with evidence of linkage to schizophrenia and bipolar disorder in multiple studies. We felt that these broad

regions that had some evidence of association with schizophrenia were of more

interest than random genomic sequence. Our prioritization strategy looked at

polymorphic repeats in linkage regions, any repeats in linkage regions, and then

lastly disease-associated repeat classes that were polymorphic or had significant

lengths. Since we felt that an unstable repeat may confer genetic risk for

developing schizophrenia, repeats that had shown some evidence of repeat polymorphism in UniGene clusters were prioritized first; a parallel strategy did not take into account polymorphism profiles and instead emphasized the repeat's p-value and length. The candidate repeat lists present the first objective prioritization of candidate repeats in schizophrenia and bipolar disorder linkage regions. Previous disease association studies looking at candidate repeats in schizophrenia investigated long CAG/CTG repeats close to schizophrenia linkage regions due to their prevalence in polyglutamine encoding expansion disorders. We hope our approach will help identify the repeats implicated in schizophrenia if the disorder is in fact mediated by an unstable repeat tract. Here we summarize the interesting repeats from each prioritization paradigm and highlight those with a putative role in neurobiology. The function of these repeat-containing genes is summarized from their GeneCards entries (Rebhan et al., 1997).
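The selection of repeats within the recombination limits of a linked marker can be sketched as follows. The sketch assumes that the 50 Mb limit applies on each side of the marker and that a repeat must lie entirely inside the resulting window; neither detail is stated in this section, and the positions and chromosome length in the example are invented for illustration.

```python
RECOMBINATION_LIMIT_BP = 50_000_000  # 50 Mb, as stated in the text


def linkage_window(marker_pos, chrom_length, limit=RECOMBINATION_LIMIT_BP):
    """Window around a marker, truncated at the ends of the chromosome."""
    start = max(1, marker_pos - limit)
    end = min(chrom_length, marker_pos + limit)
    return start, end


def in_linkage_region(repeat_start, repeat_end, marker_pos, chrom_length,
                      limit=RECOMBINATION_LIMIT_BP):
    """True when the whole repeat lies inside the marker's window (assumed rule)."""
    start, end = linkage_window(marker_pos, chrom_length, limit)
    return start <= repeat_start and repeat_end <= end


# Invented marker position and chromosome length, for illustration only.
print(linkage_window(30_000_000, 100_000_000))
print(in_linkage_region(75_000_000, 75_000_300, 30_000_000, 100_000_000))
```

Truncating the window at position 1 and at the chromosome length corresponds to the "or the end of the chromosome" clause in the definition above.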

4.3.1 Top 20 polymorphic schizophrenia candidate repeats

Interesting polymorphic repeats were detected within genes with evidence

of neuronal function such as Neurochondrin (chr 1: 35459893-35459943), MOG

(Myelin-oligodendrocyte glycoprotein precursor) (chr 6: 29732916-29732936),

VANGL2 (van gogh, (Drosophila)-like 2) (chr 1: 157614137-157614167) (Table

8). Neurochondrin cDNAs have high expression in brain and kidney tissue.

Little is known about the function of the gene, but homozygous null mutations in

mice are lethal (Mochizuki et al., 2003). MOG is a minor component of the

myelin sheath and is linked to GO terms for synaptic transmission and central

nervous system development. VANGL2 is involved in morphogenesis and

patterning of the neural plate.

4.3.2 Top 20 globally prioritized schizophrenia candidate repeats

The TATAT repeat within the intron of GRIK2 (glutamate receptor,

ionotropic, kainate 2) (chr 6: 102371516-102371577) (Table 9) was the sole

repeat within a gene with evidence of neuronal function. Glutamate mediates excitatory neurotransmission in the brain and anti-psychotics have shown effects on this system (Kim et al., 1980). This receptor has marked expression in the brain and to a lesser extent in the spinal cord.

4.3.3 Top 20 polymorphic schizophrenia candidate repeats from disease- associated classes

A coding CTG repeat was detected in PCSK9 (proprotein convertase subtilisin/kexin type 9) (chr 1: 54875471-54875493) (Table 12). PCSK9 has been implicated in the differentiation of cortical neurons and its expression has been noted in embryonic brain telencephalon neurons (Seidah et al., 2003).

4.3.4 Top 20 globally prioritized schizophrenia candidate repeats from disease-associated classes

No repeats from disease-associated classes were detected within genes with obvious neurobiological function by global prioritization (Table 13).

However, the disease-associated repeats for SCA1 were detected (chr 6: 16435844-16435887 and chr 6: 16435895-16435934).

4.4 Conclusions

Brock et al. identified an association between flanking GC content and

CAG/CTG repeat instability at disease loci by using a relative measure of repeat

instability called 'expandability'. Using this expandability measure, we have

extended the analysis of Brock and colleagues to include sequence regions

omitted due to the incomplete state of the human genome sequence at the time

of their study. Furthermore, we have used the expandability metric to test for associations between instability and other features theorized to contribute to it, such as CAG/CTG repeat length and purity, proximity to CCCTC-binding factor (CTCF) binding sites, and the nucleosome formation potential of the surrounding DNA.

Our results recapitulated Brock's observations regarding flanking GC, CpG

islands and CAG/CTG repeat instability. Specifically, we observed a positive

association between flanking GC content and expandability but the exact

percentages and trends were different from those published (Brock et al., 1999).

We also observed that in general unstable repeats were located within CpG islands. Our work also suggested a novel relationship between flanking CTCF binding sites and unstable repeats. Conversely, no relationships between expandability and repeat length, purity, and nucleosome formation were detected. Our results provide further insight regarding what cis-sequences may contribute to CAG/CTG repeat instability.

We developed Satellog, a database that catalogs all pure repeats with repeat units of 1-16 bp in the human genome along with supplementary data we believe to be of use for the prioritization of satellite repeats in disease association studies. For each pure repeat we also record a p-value for its length relative to other repeats of the same class in the genome, its frequency of polymorphism within

UniGene clusters, its location either within or adjacent to EnsEMBL-defined genes, and its expression profile in normal tissues according to the GeneNote database. Satellog is the first database capable of dynamic candidate repeat prioritization in the human genome based on these features.

By examining the global repeat polymorphism profile, we found that highly

polymorphic coding repeats were mostly restricted to trinucleotide repeats,

whereas a wider range of repeat unit lengths was tolerated in untranslated

sequence. We also found that 3'-UTR sequence has more repeat

polymorphisms than 5'-UTR or exonic sequence. To showcase Satellog's

potential utility, we used it to prioritize repeats for disease-association

studies in schizophrenia. We hope that Satellog proves useful for candidate repeat prioritization in schizophrenia or any other disease in which unstable repeats are thought to have a role in disease etiology.

4.5 Problems encountered and limitations

4.5.1 Brock et al.'s expandability metric

The expandability metric employed by Brock et al. is based on the authors' collation

of pedigree analyses and published results. To quantify relative levels of

instability, the metric uses the following formula: length change / (progenitor

allele length - 35 repeats). This measure quantifies the "tendency of an above

threshold repeat block to undergo further expansion". The authors believed that

repeat length changes needed to be relative to the progenitor length of repeats.

Progenitor allele length was "standardized" by subtracting 35 repeats, the

hypothetical threshold of coding CAG/CTG repeat instability in many coding

CAG/CTG disorders (Cleary and Pearson, 2003). This assumption has a

number of limitations, mainly because the myotonic dystrophy repeat was

included in their study. Since myotonic dystrophy is a non-coding CAG/CTG

disorder, it has different molecular genetic and etiological disease properties.

Most importantly, the published minimum threshold for disease in myotonic

dystrophy is 50 repeats, greater than the threshold of 35 used by Brock et al.

(1999). Furthermore, this metric was never tested, nor were the raw data used to

generate it provided. This weakens our ability to rely on Brock et al.'s expandability as a useful measure of genomic CAG/CTG repeat instability.
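
As a minimal illustration of the formula above, the following Python sketch computes expandability for a single transmission. The function name, the `threshold` parameter, and the example allele sizes are hypothetical and are not taken from Brock et al.; the sketch only shows how strongly the value depends on the threshold that is subtracted.

```python
def expandability(length_change, progenitor_length, threshold=35):
    """Length change divided by the progenitor length above the threshold.

    All lengths are in repeat units. The default threshold of 35 follows the
    formula quoted in the text; any other value is an assumption.
    """
    excess = progenitor_length - threshold
    if excess <= 0:
        raise ValueError("progenitor allele must be longer than the threshold")
    return length_change / excess


# Hypothetical transmission: a 60-repeat progenitor allele gains 10 repeats.
with_35 = expandability(10, 60)                # 10 / (60 - 35)
with_50 = expandability(10, 60, threshold=50)  # 10 / (60 - 50)
print(with_35, with_50)
```

With these invented numbers, the same 10-repeat gain gives a 2.5-fold higher expandability under a 50-repeat threshold than under the 35-repeat threshold, which is one way to see why a single fixed threshold may not suit the myotonic dystrophy locus.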

4.5.2 Limitations of the GeneNote dataset

The GeneNote Affymetrix microarray experiments (Shmueli et al., 2003) were based on whole-tissue RNA samples, which is an important consideration if one is interested in gene over-expression in a particular anatomical region. For instance, high gene expression local to the frontal lobe would be diluted in the larger tissue section used for the GeneNote experiments. Repeat-prioritization approaches enforcing particular tissue expression should bear this in mind.

4.5.3 Mapping repeats to UniGene clusters

A major problem with mapping repeats to UniGene clusters is that

repetitive sequence usually hits many clusters, the majority of which are false

positives. To ensure that each hit was real, we pre-mapped each unique

UniGene cluster to the human genome and stored the chromosomal co-ordinates

in a table named unigene (Appendix M). At run-time, every time a repeat was

detected in a UniGene cluster, the hit's co-ordinates were compared to the

mapped co-ordinates of the cluster. If the repeat co-ordinates were within 10 kb

of the UniGene genomic co-ordinates, then the repeat length hits were retained

and merged into a single sd value. It is also important to consider that larger

repeat polymorphisms could cause a UniGene cluster to "split" into two distinct clusters. This could downplay a repeat's polymorphism because the observed lengths would not be evaluated as a single group, therefore decreasing the repeat's sd value. Pre-mapping the clusters controlled for this as well, because if a cluster was split by a large repeat polymorphism, then both clusters should be mapped to the same genomic co-ordinates. In practical terms this was not an issue, since only one of our most polymorphic repeats (sd > 2) mapped to two clusters.
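
The filtering step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Satellog implementation: the function names, the cluster identifiers, the co-ordinates, and the use of the sample standard deviation for the sd value are all hypothetical.

```python
from statistics import stdev

WINDOW = 10_000  # 10 kb tolerance around the pre-mapped cluster co-ordinates


def retained_lengths(hits, cluster_coords, repeat_chrom, repeat_pos):
    """Keep repeat lengths from clusters that map near the genomic repeat.

    hits: list of (cluster_id, observed_repeat_length) pairs.
    cluster_coords: dict mapping cluster_id -> (chrom, start, end).
    """
    kept = []
    for cluster_id, length in hits:
        chrom, start, end = cluster_coords[cluster_id]
        near = start - WINDOW <= repeat_pos <= end + WINDOW
        if chrom == repeat_chrom and near:
            kept.append(length)
    return kept


def merged_sd(lengths):
    """Merge retained lengths into one sd value (sample sd is an assumption)."""
    return stdev(lengths) if len(lengths) > 1 else 0.0


# Hypothetical data: Hs.A and Hs.B are a "split" cluster at one locus,
# while Hs.C is a false-positive hit on another chromosome.
coords = {
    "Hs.A": ("chr6", 16_430_000, 16_440_000),
    "Hs.B": ("chr6", 16_436_500, 16_450_000),
    "Hs.C": ("chr12", 5_000_000, 5_010_000),
}
hits = [("Hs.A", 28), ("Hs.A", 30), ("Hs.B", 32), ("Hs.C", 90)]
lengths = retained_lengths(hits, coords, "chr6", 16_435_844)
print(lengths, merged_sd(lengths))
```

Because both halves of the split cluster pass the same co-ordinate test, their lengths are pooled before the sd is computed, while the distant hit is discarded.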

4.5.4 Prioritizing with p-values

The genomic length distribution of a repeat class determines each length's p-value in Satellog (2.2.3.1), but it should be emphasized that this p-value is meaningless for repeat classes that have few members in their distribution. For example, the

CCCCGCCCCGCG dodecamer implicated in progressive myoclonic epilepsy type 1 (EPM1) (Lalioti et al., 1997) is the only pure repeat of its class detected in

the human genome and therefore has a singleton as its distribution. The p-value

for this disease-associated repeat is 1. P-values should be used carefully when

considering larger period repeats as their distributions contain fewer, usually

single, observations.
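
The singleton problem can be made concrete with a small Python sketch. It assumes an empirical, right-tailed p-value (the fraction of repeats in the class that are at least as long as the repeat in question); this definition is an assumption for illustration, and the method described in 2.2.3.1 may differ in detail. The length values are invented.

```python
def length_p_value(length, class_lengths):
    """Fraction of repeats in the class that are at least as long as `length`."""
    if not class_lengths:
        raise ValueError("the repeat class has no observations")
    at_least_as_long = sum(1 for other in class_lengths if other >= length)
    return at_least_as_long / len(class_lengths)


# A well-populated class: long repeats are rare, so their p-value is small.
common_class = [5] * 90 + [10] * 8 + [20] * 2
print(length_p_value(20, common_class))

# A singleton class: the only repeat is compared with itself, so p is 1
# regardless of how long the repeat is.
print(length_p_value(3, [3]))
```

Under this definition any class with a single member returns 1, which matches the behaviour noted above for the EPM1 dodecamer and shows why the value carries no information in that case.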

4.5.5 Multiple repeats detected for known diseases

Using the repeat type and gene information in Cleary and Pearson, 2003, we attempted to uniquely identify all disease-associated repeats. In some cases, more than one repeat was a candidate as the disease-associated repeat for a disease. These cases usually involved adjacent repeats of the same class that were detected as two distinct repeats because of an interrupting unit, a known feature of some disease-associated repeats (e.g. SCA1; Chung et al., 1993;

Kunst and Warren, 1994; Chong et al., 1995). In these cases, we simply retained both repeats and associated them with the disease.
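
The following Python sketch shows how a single interrupting unit causes one tract to be reported as two pure repeats, and how adjacent repeats can then be grouped so that both are kept for the same disease. The sequence, the function names, and the `min_copies` and `max_gap` parameters are hypothetical; this is not the detection code used for Satellog.

```python
import re


def pure_repeats(seq, unit, min_copies=3):
    """Return (start, end, copies) for each pure run of `unit` in `seq`."""
    pattern = re.compile(f"(?:{re.escape(unit)}){{{min_copies},}}")
    return [
        (m.start(), m.end(), (m.end() - m.start()) // len(unit))
        for m in pattern.finditer(seq)
    ]


def group_adjacent(repeats, max_gap):
    """Group repeats whose gap to the previous repeat is at most `max_gap` bp."""
    groups = []
    for rep in repeats:
        if groups and rep[0] - groups[-1][-1][1] <= max_gap:
            groups[-1].append(rep)
        else:
            groups.append([rep])
    return groups


# Invented sequence: a CAG tract interrupted by a single CAT unit.
seq = "TT" + "CAG" * 12 + "CAT" + "CAG" * 15 + "GG"
found = pure_repeats(seq, "CAG")
groups = group_adjacent(found, max_gap=3)
print(found)
print(len(groups))
```

Here the interruption splits the tract into a 12-copy and a 15-copy pure repeat, and grouping with a gap of one repeat unit places both in the same candidate set.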

4.6 Future studies

4.6.1 Identifying cis-mediators of instability

It is encouraging that recent research on CTCF binding sites flanking

CAG/CTG repeats implicated the sites in instability (La Spada et al., 2004).

Future experiments should aim to explain why the published CTCF binding sites flanking DRPLA, SCA7, HD and SCA2 (Filippova et al., 2001) failed to be

detected by our HMM. Implicit in the published analysis was that the multiple

sequence alignment of known sites revealed a consensus sequence of

conserved guanine nucleotides essential for CTCF recognition (Figure 4).

Perhaps the sites not detected by the HMM contained a consensus sequence

that interacts with a different combination of zinc fingers of CTCF than the

binding sites published in the multiple alignment. Gel-shift assays established

that CTCF binding sites exist flanking DM, DRPLA and SCA7, while HD and

SCA2 only had upstream sites (Filippova et al., 2001). These sites should be characterized by DNase I footprinting to observe the consensus site that CTCF interacts with and to enrich our HMM to optimize detection of CTCF binding sites flanking CAG/CTG repeats and/or at any other genomic locus. New CTCF binding sites should be incorporated into our HMM with the eventual goal of creating a CTCF binding site predictor.

4.6.2 Improvements to Satellog

Satellog can be improved in many ways in order to increase its utility to the micro- and minisatellite research community. More sophisticated algorithms can be deployed to detect repeat polymorphisms in UniGene clusters. Once a repeat had been mapped to a UniGene cluster with BLAT, our approach relied on an exact match of at least 10 bp of flanking sequence to register a hit to a specific cluster sequence. Future approaches should further incorporate the

BLAT algorithm for this procedure, as it is optimized to detect short, nearly exact matches rapidly (Kent, 2002) while tolerating indels. Rigorous parameter testing will be needed to determine how to optimize detection of repeats with BLAT.

This should greatly enrich the polymorphism profile of repeats within Satellog.

Furthermore, new disease-associated repeats should be added to the database

as they're published.
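
To illustrate why an exact flank match limits sensitivity, the sketch below measures the repeat copy number in a cluster sequence only when both 10 bp flanks match exactly. It is a simplified stand-in written for this discussion, not the code used to build Satellog; the function name, flank sequences, and example reads are invented.

```python
def copies_between_flanks(read, left_flank, right_flank, unit):
    """Return the copy number of `unit` between exact flank matches, or None.

    None is returned when either flank is not found exactly, or when the
    sequence between the flanks is not a pure run of `unit`.
    """
    left = read.find(left_flank)
    if left == -1:
        return None
    tract_start = left + len(left_flank)
    tract_end = read.find(right_flank, tract_start)
    if tract_end == -1:
        return None
    tract = read[tract_start:tract_end]
    copies, remainder = divmod(len(tract), len(unit))
    if remainder or tract != unit * copies:
        return None
    return copies


LEFT = "ACGTTGCAAC"   # hypothetical 10 bp upstream flank
RIGHT = "TTGACCGTAA"  # hypothetical 10 bp downstream flank

clean_read = "GG" + LEFT + "CAG" * 7 + RIGHT + "AA"
# Same read with a single-base error in the upstream flank.
noisy_read = "GG" + "ACGTTGCTAC" + "CAG" * 7 + RIGHT + "AA"

print(copies_between_flanks(clean_read, LEFT, RIGHT, "CAG"))
print(copies_between_flanks(noisy_read, LEFT, RIGHT, "CAG"))
```

A single mismatch in a flank is enough to drop the observation entirely, which is the kind of loss that a mismatch- and indel-tolerant aligner would be expected to reduce.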

4.6.3 Disease association studies in schizophrenia

Repeats from the candidate repeats lists (3.3.1-8) will be analyzed by

GeneScan software on an ABI 3700 sequencer to determine whether they are

expanded in schizophrenics versus controls. GeneScan is a fragment analysis

package that automatically identifies, quantitates, and sizes each DNA fragment

that passes through an ABI instrument.

4.6.3.1 Specimens for analysis

High quality, high molecular weight DNA from individuals with

schizophrenia, bipolar disorder and unaffected control individuals (n=35 each)

has been obtained from the Stanley Medical Research Institute (Bethesda,

Maryland) free of charge, specifically for the work outlined in this proposal. The

Stanley Array Collection is a brain collection developed specifically for molecular

studies using high throughput methodologies. DNA has been extracted from

post-mortem brain specimens collected, with informed consent from next-of-kin,

by participating medical examiners between January 1995 and June 2002. The

specimens were all collected, processed, and stored in a standardized way.

Exclusion criteria for all specimens included:

• Significant structural brain pathology on post-mortem examination by a qualified neuropathologist, or by pre-mortem imaging,

• History of significant focal neurological signs pre-mortem,

• History of central nervous system disease that could be expected to alter gene expression in a persistent way,

• Documented IQ < 70.

Additional exclusion criteria for unaffected controls included:

• Age less than 30 (thus, still in the period of maximum risk),

• Substance abuse within one year of death or evidence of significant alcohol-related changes in the liver.

Diagnoses were made by two senior psychiatrists, using DSM-IV criteria, based on medical records, and when necessary, telephone interviews with family members. Diagnoses of unaffected controls were based on structured interviews by a senior psychiatrist with family member(s) to rule out Axis I diagnoses. There are several distinct advantages in using the Stanley array samples for this study.

First, there is a control group that is matched to the degree possible for age, ethnicity, gender and cause of death. A matched control group will allow polymorphic DNA repeat length changes that are present in both patients and controls to be recognized as unlikely candidate susceptibility loci. While this study could be done using DNA from either brain or peripheral lymphocytes or any other tissue, data obtained from peripheral lymphocytes would only be

revealing in the case of inherited or constitutive changes. While brain tissue is essentially post-mitotic, an advantage of using DNA from brain is the opportunity to detect sporadic repeat expansions that might occur during brain development.

Finally, brain tissue and high quality RNA samples are also available through the

Stanley Medical Research Institute and these tissue and RNA samples are

derived from the same individuals, described above, that were the source for

DNA. This additional biological material will be an important resource for relating

any observed unstable repeats to neuronal pathology in these diseases in

subsequent studies.

4.7 Significance

Satellog enriches the current bioinformatics landscape in which repeats are viewed. For example, the GAA repeat in Friedreich's Ataxia (Campuzano et al., 1996) is not detected at all (chr9:67,109,320-67,109,339) in the UCSC genome browser (Kent, 2002) by the TRF (Benson, 1999) and Variable Number

Tandem Repeats (VNTR) tracks. The VNTR feature in UCSC detects all perfect

2 to 10 repeat units with 10 or more copies. Repeats detected by this method may over-represent insignificant low period repeats and under-represent potentially interesting high period repeats. In Satellog, not only is the Friedreich's

Ataxia GAA repeat detected, but its p-value also suggests that this size of GAA repeat is a relatively rare observation in the human genome (P = 0.045).

Satellog integrates disparate data sources to give researchers an idea of how interesting certain repeats are based on their genetic location, tissue expression profile and polymorphism. It should be noted that Satellog is not intended to be a de novo detection method for disease-associated repeats. Instead, it provides a comprehensive, integrated bioinformatics platform to prioritize repeats in a convenient and efficient manner. Satellog also presents the first comprehensive

identification and integration of disease-associated repeats with other genomic

resources for use as bioinformatics reagents in other studies. Satellog should

prove useful to investigators interested in prioritizing repeats for typing in

diseases showing anticipation or in which repeat polymorphism is thought to play

a role in etiology and as a general bioinformatics resource for microsatellite repeat studies.

Secondly, we have produced the first objective lists of candidate repeats for association studies in schizophrenia and bipolar disorder. Previous studies have arbitrarily selected repeats based on their prevalence in other etiologically distinct neurological diseases (for example CAG/CTG repeats in inclusion disorders). Our study is the first ever to objectively prioritize repeats in the human genome by integrating disparate bioinformatics resources. We also provide the infrastructure to dynamically redefine candidate sets based on new biological knowledge or research interests. We hope that Satellog will facilitate the identification of disease-associated repeats in schizophrenia and other

ailments. Satellog is available as a freely downloadable MySQL and web-based

database.

BIBLIOGRAPHY

American Psychiatric Association. (1994). Diagnostic and statistical manual of mental disorders (4th ed.). Washington, DC.

Ashburner, M., C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin and G. Sherlock (2000). "Gene ontology: tool for the unification of biology. The Gene Ontology Consortium." Nat Genet 25(1): 25-9.

Asherson, P., C. Walsh, J. Williams, M. Sargeant, C. Taylor, A. Clements, M. Gill, M. Owen and P. McGuffin (1994). "Imprinting and anticipation. Are they relevant to genetic studies of schizophrenia?" Br J Psychiatry 164(5): 619-24.

Ashizawa, T., C. J. Dunne, J. R. Dubel, M. B. Perryman, H. F. Epstein, E. Boerwinkle and J. F. Hejtmancik (1992). "Anticipation in myotonic dystrophy. I. Statistical verification based on clinical and haplotype findings." Neurology 42(10): 1871-7.

Banfi, S., A. Servadio, M. Y. Chung, T. J. Kwiatkowski, Jr., A. E. McCall, L. A. Duvick, Y. Shen, E. J. Roth, H. T. Orr and H. Y. Zoghbi (1994). "Identification and characterization of the gene causing type 1 spinocerebellar ataxia." Nat Genet 7(4): 513-20.

Bassett, A. S. and W. G. Honer (1994). "Evidence for anticipation in schizophrenia." Am J Hum Genet 54(5): 864-70.

Bassett, A. S. and J. Husted (1997). "Anticipation or ascertainment bias in schizophrenia? Penrose's familial mental illness sample." Am J Hum Genet 60(3): 630-7.

Benson, G. (1999). "Tandem repeats finder: a program to analyze DNA sequences." Nucleic Acids Res 27(2): 573-80.

Berman, H. M., T. Battistuz, T. N. Bhat, W. F. Bluhm, P. E. Bourne, K. Burkhardt, Z. Feng, G. L. Gilliland, L. Iype, S. Jain, P. Fagan, J. Marvin, D. Padilla, V. Ravichandran, B. Schneider, N. Thanki, H. Weissig, J. D. Westbrook and C. Zardecki (2002). "The Protein Data Bank." Acta Crystallogr D Biol Crystallogr 58(Pt 6 No 1): 899-907.

Bray, N. J. and M. J. Owen (2001). "Searching for schizophrenia genes." Trends Mol Med 7(4): 169-74.

Brock, G. J., N. H. Anderson and D. G. Monckton (1999). "Cis-acting modifiers of expanded CAG/CTG triplet repeat expandability: associations with flanking GC content and proximity to CpG islands." Hum Mol Genet 8(6): 1061-7.

Brook, J. D., M. E. McCurrach, H. G. Harley, A. J. Buckler, D. Church, H. Aburatani, K. Hunter, V. P. Stanton, J. P. Thirion, T. Hudson and et al. (1992). "Molecular basis of myotonic dystrophy: expansion of a trinucleotide (CTG) repeat at the 3' end of a transcript encoding a protein kinase family member." Cell 68(4): 799-808.

Bullmore, E. T., S. Frangou and R. M. Murray (1997). "The dysplastic net hypothesis: an integration of developmental and dysconnectivity theories of schizophrenia." Schizophr Res 28(2-3): 143-56.

Campuzano, V., L. Montermini, M. D. Molto, L. Pianese, M. Cossee, F. Cavalcanti, E. Monros, F. Rodius, F. Duclos, A. Monticelli and et al. (1996). "Friedreich's ataxia: autosomal recessive disease caused by an intronic GAA triplet repeat expansion." Science 271(5254): 1423-7.

Cardno, A. G., E. J. Marshall, B. Coid, A. M. Macdonald, T. R. Ribchester, N. J. Davies, P. Venturi, L. A. Jones, S. W. Lewis, P. C. Sham, Gottesman, II, A. E. Farmer, P. McGuffin, A. M. Reveley and R. M. Murray (1999). "Heritability estimates for psychotic disorders: the Maudsley twin psychosis series." Arch Gen Psychiatry 56(2): 162-8.

Chong, S. S., A. E. McCall, J. Cota, S. H. Subramony, H. T. Orr, M. R. Hughes and H. Y. Zoghbi (1995). "Gametic and somatic tissue-specific heterogeneity of the expanded SCA1 CAG repeat in spinocerebellar ataxia type 1." Nat Genet 10(3): 344-50.

Choudhry, S., M. Mukerji, A. K. Srivastava, S. Jain and S. K. Brahmachari (2001). "CAG repeat instability at SCA2 locus: anchoring CAA interruptions and linked single nucleotide polymorphisms." Hum Mol Genet 10(21): 2437-46.

Chung, M. Y., L. P. Ranum, L. A. Duvick, A. Servadio, H. Y. Zoghbi and H. T. Orr (1993). "Evidence for a mechanism predisposing to intergenerational CAG repeat instability in spinocerebellar ataxia type I." Nat Genet 5(3): 254-8.

Clark, R. M., G. L. Dalgliesh, D. Endres, M. Gomez, J. Taylor and S. I. Bidichandani (2004). "Expansion of GAA triplet repeats in the human genome: unique origin of the FRDA mutation at the center of an Alu." Genomics 83(3): 373-83.

Cleary, J. D. and C. E. Pearson (2003). "The contribution of cis-elements to disease-associated repeat instability: clinical and experimental evidence." Cytogenet Genome Res 100(1-4): 25-55.

Collins, J. R., R. M. Stephens, B. Gold, B. Long, M. Dean and S. K. Burt (2003). "An exhaustive DNA micro-satellite map of the human genome using high performance computing." Genomics 82(1): 10-9.

Cossee, M., M. Schmitt, V. Campuzano, L. Reutenauer, C. Moutou, J. L. Mandel and M. Koenig (1997). "Evolution of the Friedreich's ataxia trinucleotide repeat expansion: founder effect and premutations." Proc Natl Acad Sci U S A 94(14): 7452-7.

Crisponi, L., M. Deiana, A. Loi, F. Chiappe, M. Uda, P. Amati, L. Bisceglia, L. Zelante, R. Nagaraja, S. Porcu, M. S. Ristaldi, R. Marzella, M. Rocchi, M. Nicolino, A. Lienhardt-Roussie, A. Nivelon, A. Verloes, D. Schlessinger, P. Gasparini, D. Bonneau, A. Cao and G. Pilia (2001). "The putative forkhead transcription factor FOXL2 is mutated in blepharophimosis/ptosis/epicanthus inversus syndrome." Nat Genet 27(2): 159-66.

Crocq, M. A., R. Mant, P. Asherson, J. Williams, Y. Hode, A. Mayerova, D. Collier, L. Lannfelt, P. Sokoloff, J. C. Schwartz and et al. (1992). "Association between schizophrenia and homozygosity at the dopamine D3 receptor gene." J Med Genet 29(12): 858-60.

Cummings, C. J. and H. Y. Zoghbi (2000). "Trinucleotide repeats: mechanisms and pathophysiology." Annu Rev Genomics Hum Genet 1: 281-328.

David, G., N. Abbas, G. Stevanin, A. Durr, G. Yvert, G. Cancel, C. Weber, G. Imbert, F. Saudou, E. Antoniou, H. Drabkin, R. Gemmill, P. Giunti, A. Benomar, N. Wood, M. Ruberg, Y. Agid, J. L. Mandel and A. Brice (1997). "Cloning of the SCA7 gene reveals a highly unstable CAG repeat expansion." Nat Genet 17(1): 65-70.

Degreef, G., M. Ashtari, B. Bogerts, R. M. Bilder, D. N. Jody, J. M. Alvir and J. A. Lieberman (1992). "Volumes of ventricular system subdivisions measured from magnetic resonance images in first-episode schizophrenic patients." Arch Gen Psychiatry 49(7): 531-7.

Eddy, S. R. (1998). "Profile hidden Markov models." Bioinformatics 14(9): 755-63.

Eichler, E. E., J. J. Holden, B. W. Popovich, A. L. Reiss, K. Snow, S. N. Thibodeau, C. S. Richards, P. A. Ward and D. L. Nelson (1994). "Length of uninterrupted CGG repeats determines instability in the FMR1 gene." Nat Genet 8(1): 88-94.

Filippova, G. N., C. P. Thienes, B. H. Penn, D. H. Cho, Y. J. Hu, J. M. Moore, T. R. Klesert, V. V. Lobanenkov and S. J. Tapscott (2001). "CTCF-binding sites flank CTG/CAG repeats and form a methylation-sensitive insulator at the DM1 locus." Nat Genet 28(4): 335-43.

Gorwood, P., M. Leboyer, B. Falissard, M. Jay, F. Rouillon and J. Feingold (1996). "Anticipation in schizophrenia: new light on a controversial problem." Am J Psychiatry 153(9): 1173-7.

Gourdon, G., F. Radvanyi, A. S. Lia, C. Duros, M. Blanche, M. Abitbol, C. Junien and H. Hofmann-Radvanyi (1997). "Moderate intergenerational and somatic instability of a 55-CTG repeat in transgenic mice." Nat Genet 15(2): 190-2.

Gouw, L. G., M. A. Castaneda, C. K. McKenna, K. B. Digre, S. M. Pulst, S. Perlman, M. S. Lee, C. Gomez, K. Fischbeck, D. Gagnon, E. Storey, T. Bird, F. R. Jeri and L. J. Ptacek (1998). "Analysis of the dynamic mutation in the SCA7 gene shows marked parental effects on CAG repeat transmission." Hum Mol Genet 7(3): 525-32.

Haaf, T., G. Sirugo, K. K. Kidd and D. C. Ward (1996). "Chromosomal localization of long trinucleotide repeats in the human genome by fluorescence in situ hybridization." Nat Genet 12(2): 183-5.

Harrison, P. J. (1999). "The neuropathology of schizophrenia. A critical review of the data and their interpretation." Brain 122 ( Pt 4): 593-624.

Heiden, A., U. Willinger, J. Scharfetter, K. Meszaros, S. Kasper and H. N. Aschauer (1999). "Anticipation in schizophrenia." Schizophr Res 35(1): 25-32.

Hubbard, T., D. Barker, E. Birney, G. Cameron, Y. Chen, L. Clark, T. Cox, J. Cuff, V. Curwen, T. Down, R. Durbin, E. Eyras, J. Gilbert, M. Hammond, L. Huminiecki, A. Kasprzyk, H. Lehvaslaiho, P. Lijnzaad, C. Melsopp, E. Mongin, R. Pettett, M. Pocock, S. Potter, A. Rust, E. Schmidt, S. Searle, G. Slater, J. Smith, W. Spooner, A. Stabenau, J. Stalker, E. Stupka, A. Ureta-Vidal, I. Vastrik and M. Clamp (2002). "The Ensembl genome database project." Nucleic Acids Res 30(1): 38-41.

Huntington's Disease Collaborative Research Group, T. (1993). "A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington's disease chromosomes. The Huntington's Disease Collaborative Research Group." Cell 72(6): 971-83.

Ikeda, H., M. Yamaguchi, S. Sugai, Y. Aze, S. Narumiya and A. Kakizuka (1996). "Expanded polyglutamine in the Machado-Joseph disease protein induces cell death in vitro and in vivo." Nat Genet 13(2): 196-202.

Imamura, A., S. Honda, Y. Nakane and Y. Okazaki (1998). "Anticipation in Japanese families with schizophrenia." J Hum Genet 43(4): 217-23.

Imbert, G., F. Saudou, G. Yvert, D. Devys, Y. Trottier, J. M. Garnier, C. Weber, J. L. Mandel, G. Cancel, N. Abbas, A. Durr, O. Didierjean, G. Stevanin, Y. Agid and A. Brice (1996). "Cloning of the gene for spinocerebellar ataxia 2 reveals a locus with high sensitivity to expanded CAG/glutamine repeats." Nat Genet 14(3): 285-91.

Imhof, A., X. J. Yang, V. V. Ogryzko, Y. Nakatani, A. P. Wolffe and H. Ge (1997). "Acetylation of general transcription factors by histone acetyltransferases." Curr Biol 7(9): 689-92.

Inayama, Y., H. Yoneda, T. Sakai, T. Ishida, Y. Nonomura, Y. Kono, R. Takahata, J. Koh, J. Sakai, A. Takai, Y. Inada and H. Asaba (1996). "Positive association between a DNA sequence variant in the serotonin 2A receptor gene and schizophrenia." Am J Med Genet 67(1): 103-5.

Johnson, J. E., J. Cleary, H. Ahsan, J. Harkavy Friedman, D. Malaspina, C. R. Cloninger, S. V. Faraone, M. T. Tsuang and C. A. Kaufmann (1997). "Anticipation in schizophrenia: biology or bias?" Am J Med Genet 74(3): 275-80.

Kashi, Y., D. King and M. Soller (1997). "Simple sequence repeats as a source of quantitative genetic variation." Trends Genet 13(2): 74-8.

Kawaguchi, Y., T. Okamoto, M. Taniwaki, M. Aizawa, M. Inoue, S. Katayama, H. Kawakami, S. Nakamura, M. Nishimura, I. Akiguchi and et al. (1994). "CAG expansions in a novel gene for Machado-Joseph disease at chromosome 14q32.1." Nat Genet 8(3): 221-8.

Kent, W. J. (2002). "BLAT--the BLAST-like alignment tool." Genome Res 12(4): 656-64.

Kent, W. J., C. W. Sugnet, T. S. Furey, K. M. Roskin, T. H. Pringle, A. M. Zahler and D. Haussler (2002). "The human genome browser at UCSC." Genome Res 12(6): 996-1006.

Kim, J. S., H. H. Kornhuber, W. Schmid-Burgk and B. Holzmuller (1980). "Low cerebrospinal fluid glutamate in schizophrenic patients and a new hypothesis on schizophrenia." Neurosci Lett 20(3): 379-82.

Koide, R., T. Ikeuchi, O. Onodera, H. Tanaka, S. Igarashi, K. Endo, H. Takahashi, R. Kondo, A. Ishikawa, T. Hayashi and et al. (1994). "Unstable expansion of CAG repeat in hereditary dentatorubral-pallidoluysian atrophy (DRPLA)." Nat Genet 6(1): 9-13.

Koob, M. D., K. A. Benzow, T. D. Bird, J. W. Day, M. L. Moseley and L. P. Ranum (1998). "Rapid cloning of expanded trinucleotide repeat sequences from genomic DNA." Nat Genet 18(1): 72-5.

Kornberg, R. D. and Y. Lorch (1999). "Twenty-five years of the nucleosome, fundamental particle of the eukaryote chromosome." Cell 98(3): 285-94.

Kremer, E. J., M. Pritchard, M. Lynch, S. Yu, K. Holman, E. Baker, S. T. Warren, D. Schlessinger, G. R. Sutherland and R. I. Richards (1991). "Mapping of DNA instability at the fragile X to a trinucleotide repeat sequence p(CCG)n." Science 252(5013): 1711-4.

Kunst, C. B. and S. T. Warren (1994). "Cryptic and polar variation of the fragile X repeat could result in predisposing normal alleles." Cell 77(6): 853-61.

La Spada, A. R., R. I. Richards and B. Wieringa (2004). "Dynamic mutations on the move in Banff." Nat Genet 36(7): 667-70.

La Spada, A. R., E. M. Wilson, D. B. Lubahn, A. E. Harding and K. H. Fischbeck (1991). "Androgen receptor gene mutations in X-linked spinal and bulbar muscular atrophy." Nature 352(6330): 77-9.

Lalioti, M. D., H. S. Scott, C. Buresi, C. Rossier, A. Bottani, M. A. Morris, A. Malafosse and S. E. Antonarakis (1997). "Dodecamer repeat expansion in cystatin B gene in progressive myoclonus epilepsy." Nature 386(6627): 847-51.

Lander, E. S., L. M. Linton, B. Birren, C. Nusbaum, M. C. Zody, J. Baldwin, K. Devon, K. Dewar, M. Doyle, W. FitzHugh, R. Funke, D. Gage, K. Harris, A. Heaford, J. Howland, L. Kann, J. Lehoczky, R. LeVine, P. McEwan, K. McKernan, J. Meldrim, J. P. Mesirov, C. Miranda, W. Morris, J. Naylor, C. Raymond, M. Rosetti, R. Santos, A. Sheridan, C. Sougnez, N. Stange-Thomann, N. Stojanovic, A. Subramanian, D. Wyman, J. Rogers, J. Sulston, R. Ainscough, S. Beck, D. Bentley, J. Burton, C. Clee, N. Carter, A. Coulson, R. Deadman, P. Deloukas, A. Dunham, I. Dunham, R. Durbin, L. French, D. Grafham, S. Gregory, T. Hubbard, S. Humphray, A. Hunt, M. Jones, C. Lloyd, A. McMurray, L. Matthews, S. Mercer, S. Milne, J. C. Mullikin, A. Mungall, R. Plumb, M. Ross, R. Shownkeen, S. Sims, R. H. Waterston, R. K. Wilson, L. W. Hillier, J. D. McPherson, M. A. Marra, E. R. Mardis, L. A. Fulton, A. T. Chinwalla, K. H. Pepin, W. R. Gish, S. L. Chissoe, M. C. Wendl, K. D. Delehaunty, T. L. Miner, A. Delehaunty, J. B. Kramer, L. L. Cook, R. S. Fulton, D. L. Johnson, P. J. Minx, S. W. Clifton, T. Hawkins, E. Branscomb, P. Predki, P. Richardson, S. Wenning, T. Slezak, N. Doggett, J. F. Cheng, A. Olsen, S. Lucas, C. Elkin, E. Uberbacher, M. Frazier, R. A. Gibbs, D. M. Muzny, S. E. Scherer, J. B. Bouck, E. J. Sodergren, K. C. Worley, C. M. Rives, J. H. Gorrell, M. L. Metzker, S. L. Naylor, R. S. Kucherlapati, D. L. Nelson, G. M. Weinstock, Y. Sakaki, A. Fujiyama, M. Hattori, T. Yada, A. Toyoda, T. Itoh, C. Kawagoe, H. Watanabe, Y. Totoki, T. Taylor, J. Weissenbach, R. Heilig, W. Saurin, F. Artiguenave, P. Brottier, T. Bruls, E. Pelletier, C. Robert, P. Wincker, D. R. Smith, L. Doucette-Stamm, M. Rubenfield, K. Weinstock, H. M. Lee, J. Dubois, A. Rosenthal, M. Platzer, G. Nyakatura, S. Taudien, A. Rump, H. Yang, J. Yu, J. Wang, G. Huang, J. Gu, L. Hood, L. Rowen, A. Madan, S. Qin, R. W. Davis, N. A. Federspiel, A. P. Abola, M. J. Proctor, R. M. Myers, J. Schmutz, M. Dickson, J. Grimwood, D. R. Cox, M. V. Olson, R. Kaul, N. Shimizu, K. Kawasaki, S. Minoshima, G. A. Evans, M. Athanasiou, R. Schultz, B. A. Roe, F. Chen, H. Pan, J. Ramser, H. Lehrach, R. Reinhardt, W. R. McCombie, M. de la Bastide, N. Dedhia, H. Blocker, K. Hornischer, G. Nordsiek, R. Agarwala, L. Aravind, J. A. Bailey, A. Bateman, S. Batzoglou, E. Birney, P. Bork, D. G. Brown, C. B. Burge, L. Cerutti, H. C. Chen, D. Church, M. Clamp, R. R. Copley, T. Doerks, S. R. Eddy, E. E. Eichler, T. S. Furey, J. Galagan, J. G. Gilbert, C. Harmon, Y. Hayashizaki, D. Haussler, H. Hermjakob, K. Hokamp, W. Jang, L. S. Johnson, T. A. Jones, S. Kasif, A. Kaspryzk, S. Kennedy, W. J. Kent, P. Kitts, E. V. Koonin, I. Korf, D. Kulp, D. Lancet, T. M. Lowe, A. McLysaght, T. Mikkelsen, J. V. Moran, N. Mulder, V. J. Pollara, C. P. Ponting, G. Schuler, J. Schultz, G. Slater, A. F. Smit, E. Stupka, J. Szustakowski, D. Thierry-Mieg, J. Thierry-Mieg, L. Wagner, J. Wallis, R. Wheeler, A. Williams, Y. I. Wolf, K. H. Wolfe, S. P. Yang, R. F. Yeh, F. Collins, M. S. Guyer, J. Peterson, A. Felsenfeld, K. A. Wetterstrand, A. Patrinos, M. J. Morgan, J. Szustakowki, P. de Jong, J. J. Catanese, K. Osoegawa, H. Shizuya, S. Choi and Y. J. Chen (2001). "Initial sequencing and analysis of the human genome." Nature 409(6822): 860-921.

Levitsky, V. G., O. A. Podkolodnaya, N. A. Kolchanov and N. L. Podkolodny (2001). "Nucleosome formation potential of eukaryotic DNA: calculation and promoters analysis." Bioinformatics 17(11): 998-1010.

Matsuura, T., T. Yamagata, D. L. Burgess, A. Rasmussen, R. P. Grewal, K. Watase, M. Khajavi, A. E. McCall, C. F. Davis, L. Zu, M. Achari, S. M. Pulst, E. Alonso, J. L. Noebels, D. L. Nelson, H. Y. Zoghbi and T. Ashizawa (2000). "Large expansion of the ATTCT pentanucleotide repeat in spinocerebellar ataxia type 10." Nat Genet 26(2): 191-4.

McCarley, R. W., C. G. Wible, M. Frumin, Y. Hirayasu, J. J. Levitt, I. A. Fischer and M. E. Shenton (1999). "MRI anatomy of schizophrenia." Biol Psychiatry 45(9): 1099-119.

McGuffin, P., M. J. Owen and A. E. Farmer (1995). "Genetic basis of schizophrenia." Lancet 346(8976): 678-82.

Metzgar, D. and C. Wills (2000). "Evidence for the adaptive evolution of mutation rates." Cell 101(6): 581-4.

Mochizuki, R., M. Dateki, K. Yanai, Y. Ishizuka, N. Amizuka, H. Kawashima, Y. Koga, H. Ozawa and A. Fukamizu (2003). "Targeted disruption of the neurochondrin/norbin gene results in embryonic lethality." Biochem Biophys Res Commun 310(4): 1219-26.

Montermini, L., E. Andermann, M. Labuda, A. Richter, M. Pandolfo, F. Cavalcanti, L. Pianese, L. Iodice, G. Farina, A. Monticelli, M. Turano, A. Filla, G. De Michele and S. Cocozza (1997). "The Friedreich ataxia GAA triplet repeat: premutation and normal alleles." Hum Mol Genet 6(8): 1261-6.

Morel, B. (1857). "Traite des degenerescences." J.B. Bailliere. 1.

Mott, F. (1910). "Hereditary aspects of nervous and mental diseases." Br Med J 2: 1013-1020.

Nakamoto, M., H. Takebayashi, Y. Kawaguchi, S. Narumiya, M. Taniwaki, Y. Nakamura, Y. Ishikawa, I. Akiguchi, J. Kimura and A. Kakizuka (1997). "A CAG/CTG expansion in the normal population." Nat Genet 17(4): 385-6.

Ohara, K., H. D. Xu, N. Mori, Y. Suzuki, D. S. Xu and Z. C. Wang (1997). "Anticipation and imprinting in schizophrenia." Biol Psychiatry 42(9): 760-6.

Ohlsson, R., R. Renkawitz and V. Lobanenkov (2001). "CTCF is a uniquely versatile transcription regulator linked to epigenetics and disease." Trends Genet 17(9): 520-7.

Pearson, C. E., E. E. Eichler, D. Lorenzetti, S. F. Kramer, H. Y. Zoghbi, D. L. Nelson and R. R. Sinden (1998). "Interruptions in the triplet repeats of SCA1 and FRAXA reduce the propensity and complexity of slipped strand DNA (S-DNA) formation." Biochemistry 37(8): 2701-8.

Penrose, L. (1948). "The problem of anticipation in pedigrees of dystrophia myotonica." Ann Eugenics 14: 125-132.

Potter and Hollister (2001). Antipsychotic Agents & Lithium. Basic & Clinical Pharmacology, McGraw-Hill Publishers.

R Development Core Team (2004). R: A language and environment for statistical computing.

Rebhan, M., V. Chalifa-Caspi, J. Prilusky and D. Lancet (1997). "GeneCards: integrating information about genes, proteins and diseases." Trends Genet 13(4): 163.

Rice, P., I. Longden and A. Bleasby (2000). "EMBOSS: the European Molecular Biology Open Software Suite." Trends Genet 16(6): 276-7.

Risch, N. J. (2000). "Searching for genetic determinants in the new millennium." Nature 405(6788): 847-56.

Roberts, E. (1972). "Prospects for research on schizophrenia. An hypothesis suggesting that there is a defect in the GABA system in schizophrenia." Neurosci Res Program Bull 10(4): 468-82.

Ross, C. A., R. L. Margolis, M. W. Becher, J. D. Wood, S. Engelender, J. K. Cooper and A. H. Sharp (1998). "Pathogenesis of neurodegenerative diseases associated with expanded glutamine repeats: new answers, new questions." Prog Brain Res 117: 397-419.

Schalling, M., T. J. Hudson, K. H. Buetow and D. E. Housman (1993). "Direct detection of novel expanded trinucleotide repeats in the human genome." Nat Genet 4(2): 135-9.

Seidah, N. G., S. Benjannet, L. Wickham, J. Marcinkiewicz, S. B. Jasmin, S. Stifani, A. Basak, A. Prat and M. Chretien (2003). "The secretory proprotein convertase neural apoptosis-regulated convertase 1 (NARC-1): liver regeneration and neuronal differentiation." Proc Natl Acad Sci U S A 100(3): 928-33.

Shmueli, O., S. Horn-Saban, V. Chalifa-Caspi, M. Shmoish, R. Ophir, H. Benjamin-Rodrig, M. Safran, E. Domany and D. Lancet (2003). "GeneNote: whole genome expression profiles in normal human tissues." C R Biol 326(10-11): 1067-72.

Sklar, P. (2002). "Linkage analysis in psychiatric disorders: the emerging picture." Annu Rev Genomics Hum Genet 3: 371-413.

Stajich, J. E., D. Block, K. Boulez, S. E. Brenner, S. A. Chervitz, C. Dagdigian, G. Fuellen, J. G. Gilbert, I. Korf, H. Lapp, H. Lehvaslaiho, C. Matsalla, C. J. Mungall, B. I. Osborne, M. R. Pocock, P. Schattner, M. Senger, L. D. Stein, E. Stupka, M. D. Wilkinson and E. Birney (2002). "The Bioperl toolkit: Perl modules for the life sciences." Genome Res 12(10): 1611-8.

Sterner, D. E., P. A. Grant, S. M. Roberts, L. J. Duggan, R. Belotserkovskaya, L. A. Pacella, F. Winston, J. L. Workman and S. L. Berger (1999). "Functional organization of the yeast SAGA complex: distinct components involved in structural integrity, nucleosome acetylation, and TATA-binding protein interaction." Mol Cell Biol 19(1): 86-98.

Subramanian, S., V. M. Madgula, R. George, R. K. Mishra, M. W. Pandit, C. S. Kumar and L. Singh (2002). "MRD: a microsatellite repeats database for prokaryotic and eukaryotic genomes." Genome Biol 3(12): PREPRINT0011.

Thibaut, F., M. Martinez, M. Petit, M. Jay and D. Campion (1995). "Further evidence for anticipation in schizophrenia." Psychiatry Res 59(1-2): 25-33.

Tsuang, M. (2000). "Schizophrenia: genes and environment." Biol Psychiatry 47(3): 210-20.

Tsutsumi, T., S. E. Holmes, M. G. McInnis, A. Sawa, C. Callahan, J. R. DePaulo, C. A. Ross, L. E. DeLisi and R. L. Margolis (2004). "Novel CAG/CTG repeat expansion mutations do not contribute to the genetic risk for most cases of bipolar disorder or schizophrenia." Am J Med Genet 124B(1): 15-9.

Valero, J., L. Martorell, J. Marine, E. Vilella and A. Labad (1998). "Anticipation and imprinting in Spanish families with schizophrenia." Acta Psychiatr Scand 97(5): 343-50.

Verkerk, A. J., M. Pieretti, J. S. Sutcliffe, Y. H. Fu, D. P. Kuhl, A. Pizzuti, O. Reiner, S. Richards, M. F. Victoria, F. P. Zhang and et al. (1991). "Identification of a gene (FMR-1) containing a CGG repeat coincident with a breakpoint cluster region exhibiting length variation in fragile X syndrome." Cell 65(5): 905-14.

Vincent, J. B., A. D. Paterson, E. Strong, A. Petronis and J. L. Kennedy (2000). "The unstable trinucleotide repeat story of major psychosis." Am J Med Genet 97(1): 77-97.

Weinberger, D. R. (1995). "From neuropathology to neurodevelopment." Lancet 346(8974): 552-7.

Wheeler, D. L., D. M. Church, R. Edgar, S. Federhen, W. Helmberg, T. L. Madden, J. U. Pontius, G. D. Schuler, L. M. Schriml, E. Sequeira, T. O. Suzek, T. A. Tatusova and L. Wagner (2004). "Database resources of the National Center for Biotechnology Information: update." Nucleic Acids Res 32 Database issue: D35-40.

Woolley, D. W. and E. Shaw (1954). "A biochemical and pharmacological suggestion about certain mental disorders." Proc Natl Acad Sci U S A 40: 228-231.

Yaw, J., M. Myles-Worsley, M. Hoff, J. Holik, R. Freedman, W. Byerley and H. Coon (1996). "Anticipation in multiplex schizophrenia pedigrees." Psychiatr Genet 6(1): 7-11.

Zhang, Y., D. G. Monckton, M. J. Siciliano, T. H. Connor and M. L. Meistrich (2002). "Age and insertion site dependence of repeat number instability of a human DM1 transgene in individual mouse sperm." Hum Mol Genet 11(7): 791-8.

Zhuchenko, O., J. Bailey, P. Bonnen, T. Ashizawa, D. W. Stockton, C. Amos, W. B. Dobyns, S. H. Subramony, H. Y. Zoghbi and C. C. Lee (1997). "Autosomal dominant cerebellar ataxia (SCA6) associated with small polyglutamine expansions in the alpha 1A-voltage-dependent calcium channel." Nat Genet 15(1): 62-9.

APPENDIX A

polyQ.pl
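The script listed in this appendix relies on two string operations that can be tried without an EnsEMBL connection: listing runs of identical amino acids in a translated peptide, and finding the longest uninterrupted tract of a repeat unit in a nucleotide sequence. The following is a minimal sketch of both, not the thesis code itself. The helper names `aa_runs` and `max_pure_repeats` and the toy sequences are assumptions made for illustration, and the run scan uses a backreference regular expression in place of the character-by-character loop used in the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of polyQ.pl): list every run of at least
# $min identical residues in a peptide as [residue, run_length] pairs.
sub aa_runs {
    my ($peptide, $min) = @_;
    $min = 5 unless defined $min;
    my @runs;
    # (.) captures one residue; \2* extends the match over identical residues
    while ($peptide =~ /((.)\2*)/g) {
        push @runs, [ $2, length($1) ] if length($1) >= $min;
    }
    return @runs;
}

# Hypothetical helper: longest uninterrupted tract of $unit, counted in
# repeat units (polyQ.pl divides the tract length by 3 for trinucleotides).
sub max_pure_repeats {
    my ($seq, $unit) = @_;
    my $max = 0;
    while ($seq =~ /((?:\Q$unit\E)+)/g) {
        my $n = length($1) / length($unit);
        $max = $n if $n > $max;
    }
    return $max;
}

# Toy inputs, invented for illustration only
my $peptide = 'MAQQQQQQQPPPPPLK';
printf "%d %s\n", $_->[1], $_->[0] for aa_runs($peptide);
print max_pure_repeats('CAGCAGCAACAGCAGCAG', 'CAG'), "\n";
```

For the toy peptide this reports a run of seven Q and a run of five P, and the interrupted CAG tract yields a maximum of three pure repeats.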

#!/usr/local/bin/perl -w
# polyq_ens_july28_2003.pl
# usage: polyq_ens.pl trf_cag_build33.out

# Stef Butland, updated by Perseus Missirlis July 25, 2003

use strict;
use Bio::EnsEMBL::DBSQL::DBAdaptor;

####################
# Global Variables #
####################

# EnsEMBL API
my $host    = 'kaka.sanger.ac.uk';
my $user    = 'anonymous';
my $db_name = 'homo_sapiens_core_15_33';

###############################
# Connect to EnsEMBL with API #
###############################

my $db = new Bio::EnsEMBL::DBSQL::DBAdaptor(-host   => $host,
                                            -user   => $user,
                                            -dbname => $db_name);
my $slice_adaptor = $db->get_SliceAdaptor;

#######################################################################
# FILEHANDLES                                                         #
# six outfiles for coordinates, one for each category                 #
# will need sequence outfiles for the three polyQ gene categories     #
# the polyQ gene categories represent the diff priorities with        #
# a polyQ known gene with multiple aa repeats as the top priority for #
# disease candidate                                                   #
#######################################################################

open GENE, ">> gene.out" or die "cannot open/write to gene.out: $!\n";
open GENEQ, ">> geneq.out" or die "cannot open/write to geneq.out: $!\n";
open GENEQSEQ, ">> geneqseq.out" or die "cannot open/write to geneqseq.out: $!\n";

###############################################################################
# Message to print at start of sequence output...hmmm....print to which file? #
###############################################################################

print GENEQSEQ "EnsEMBL info\: host\: $host\, db\: $db_name\nNote that the sequences in this file are all from the plus strand regardless of the strand on which a given gene is encoded. These sequences are intended for PCR primer prediction. Each sequence is 500bp on either side of the repeat in question.\n";

############################################
# Message to print at start of data output #
############################################

print GENE "EnsEMBL info\: host\: $host\, db\: $db_name\nThe genes listed in this file contain >=8 CAG/CTG repeats (not necessarily 8 pure rpts) but did not have >=5 Q's in protein sequence. If the coding region of one of these genes is not properly represented (example: true start ATG is farther upstream), the true protein may contain >=5 Q's. This was the case for HD protein in EnsEMBL 110 but the full length protein is represented in RefSeq. Therefore, this file can be combed for more polyQ candidates by checking coding sequence from different db's.\n";

print GENEQ "EnsEMBL info\: host\: $host\, db\: $db_name\nThis file contains a list of genes that contain >=8 CAGs (not necessarily 8 pure rpts) in nucleotide sequence (as predicted by Tandem Repeat Finder) and >=5 Qs in protein sequence. Each gene is listed as 'known' or 'novel' according to EnsEMBL definitions. Known genes are EnsEMBL predicted genes with SwissProt or RefSeq records. For known genes, this file will contain accession numbers for external databases. Occasionally, a known gene will lack a description line. Novel genes will lack descriptions and accessions to external databases.\n\nTranslation numbers for each candidate gene indicate, where the same candidate gene has been listed, that there are multiple transcripts in EnsEMBL.\n";

##################################################################################
# read infile specified on the cmd line                                          #
# read multiple coords of CAG/CTG repeat locations from STDIN one line at a time #
##################################################################################

while (<>) {
    chomp;

    # put elements into an array
    my @rpt_coords = split /\t/;
    print "array values: $rpt_coords[0] $rpt_coords[1] $rpt_coords[2] $rpt_coords[3] $rpt_coords[5] $rpt_coords[16]\n\n";

    # pull out coords of interest
    my $chrom      = $rpt_coords[1];
    my $chromStart = $rpt_coords[2];
    my $chromEnd   = $rpt_coords[3];
    my $rptsize    = $rpt_coords[5];
    my $rptUnit    = $rpt_coords[16];

    # following line strips text "chr" from "chr1, chrX etc" in UCSC simpleRepeat table for use in EnsEMBL coord system
    $chrom =~ s/chr(.+)/$1/;

    #############
    # Get Slice #
    #############

    my $slice = $slice_adaptor->fetch_by_chr_start_end($chrom, $chromStart, $chromEnd);
    my $rptRegion = $slice->seq;

    ############################
    # Get all genes from slice #
    ############################

    my $genes = $slice->get_all_Genes;
    foreach my $gene (@$genes) {

        #################################
        # Get all transcripts from gene #
        #################################

        my $transcripts = $gene->get_all_Transcripts;
        my $i = 0;
        foreach my $transcript (@$transcripts) {
            $i++;

            my $peptide = $transcript->translate;

            if ($peptide->seq =~ /(Q{5,})/i) {

                my $qlength = length $1;
                print GENEQ " \n";
                print GENEQ join "| ", @rpt_coords,"\n";
                print GENEQ "Translation " . $i . " of " . $gene->stable_id . " has " . $qlength . " Qs in its first Q run\n";
                print GENEQ "http://www.ensembl.org/perl/geneview?gene=" . $gene->stable_id . "\n";
                print GENEQ "Description: " . $gene->description . "\n";
                print GENEQ "List of amino acid runs of five or more:\n";

                #############################################################################
                # screen for multiple amino acid runs here                                  #
                # code from http://lists.evolt.org/archive/Week-of-Mon-20010326/153470.html #
                # works in independent code test with peptide string below                  #
                # test with hd gene coords                                                  #
                #############################################################################

                my @myChar;
                my $x;
                my $compare;
                my $myCompare;
                my $repeated = 1;
                $myChar[0] = substr($peptide->seq, 0, 1);
                for ($x = 1; $x < length($peptide->seq); $x++) {
                    $myChar[$x] = substr($peptide->seq, $x, 1);
                    $compare = $x - 1;
                    $myCompare = $myChar[$compare];

                    if ($myCompare eq $myChar[$x]) {
                        $repeated++;
                    } else {
                        print GENEQ "$repeated $myCompare\n" if $repeated >= 5;
                        $repeated = 1;
                    }
                }
                # report a run that extends to the end of the peptide
                print GENEQ "$repeated $myCompare\n" if $repeated >= 5;

                ###########################
                # Calculate repeat purity #
                ###########################

                my $max = 0;
                while ($rptRegion =~ /(($rptUnit)+)/g) {
                    my $len = $1;
                    my $l_hit = (length($len))/3;
                    if ($l_hit > $max) {
                        $max = $l_hit;
                    }
                }
                print GENEQ "Slice has a maximum of $max pure $rptUnit repeats\n\n";

                ###################################################################################
                # cat DNA seq of slice to file in fasta format using chromosome coords in defline #
                ###################################################################################

                print GENEQSEQ "\>chr$chrom\:" . ($chromStart-500) . "\-" . ($chromEnd+500) . " " . $rptUnit . " (CAG-type) rpt in " . $gene->stable_id . " " . $gene->description . "\n";
                my $primer_slice = $slice_adaptor->fetch_by_chr_start_end($chrom, $chromStart-500, $chromEnd+500);
                print GENEQSEQ $primer_slice->seq . "\n";

                ##########################################################################
                # Collect supplementary information if slice contains known EnsEMBL gene #
                ##########################################################################

                if ($gene->is_known) {
                    print GENEQ "Known gene\n";
                    print GENEQ "DBLinks and synonyms:\n";
                    foreach my $link (@{$gene->get_all_DBLinks}) {
                        print GENEQ $link->display_id . "| " . $link->database . "\n";
                        my @syns = @{$link->get_all_synonyms};
                        print GENEQ "@syns\n";
                    }
                } else {
                    print GENEQ "Not a known gene\n";
                }
            } else {
                # blank separator line, as in the GENEQ branch above
                print GENE " \n";
                print GENE join "| ", @rpt_coords,"\n";
                print GENE "No polyQ in this translation\n";
            }
        }
    }
}

#####################
# close filehandles #
#####################

close GENE or warn "errors while closing gene.out: $!\n";
close GENEQ or warn "errors while closing geneq.out: $!\n";
close GENEQSEQ or warn "errors while closing geneqseq.out: $!\n";
exit;

APPENDIX B

Flat file of candidate CAG/CTG repeats (GeMS) (used as input to flanker.pl)

Gene         EnsEMBL Gene ID   Chr   Start       End         Repeat
SCA3_MJD     ENSG00000066427   14    90527395    90527437    CTG
DMPK         ENSG00000104936   19    50949512    50949573    CAG
SCA7         ENSG00000163635   3     63753267    63753299    GCA
SCA2         ENSG00000089232   12    111819606   111819676   GCT
HD           ENSG00000125387   4     3113331     3113395     CAG
DRPLA        ENSG00000111676   12    6925140     6925199     CAG
SCA1         ENSG00000124788   6     16390403    16390494    TGC
SBMA         ENSG00000169083   X     64998383    64998486    GCA
SI7E_HUMAN   ENSG00000117069   1     76750189    76750226    GCA
TNRC4        ENSG00000159409   1     148453802   148453843   TGC
RORC         ENSG00000143365   1     148552932   148552983   GCT
KIAA0476     ENSG00000130568   1     150682577   150682628   CTG
KCNN3        ENSG00000143603   1     151617621   151617660   GCT
TNS          ENSG00000079308   2     218676908   218676937   GCT
IRS1         ENSG00000169047   2     227628083   227628127   GCT
TNRC15       ENSG00000066216   2     233676219   233676265   CAG
SATB1        ENSG00000182568   3     18265417    18265463    CTG
BAIAP1       ENSG00000151276   3     65280467    65280529    CTG
Q8IVF3       ENSG00000176542   3     114657333   114657376   TGC
HYP_95.5     ENSG00000138756   4     80184862    80184945    CAG
TNRC3        ENSG00000179637   4     141277244   141277331   TGC
TFEB         ENSG00000112561   6     41660233    41660265    GCT
RUNX2        ENSG00000124813   6     45391833    45391899    CAG
POU3F2       ENSG00000184486   6     99283278    99283355    GCA
NM_175863    ENSG00000049618   6     157054532   157054585   CAG
TBP          ENSG00000112592   6     170546468   170546579   GCA
RD_POU       ENSG00000106536   7     39086067    39086098    CAG
FOXP2        ENSG00000128573   7     113810616   113810739   GCA
PAXIP1L      ENSG00000157212   7     154075553   154075591   TGC
SMARCA2      ENSG00000080503   9     2029754     2029837     GCA
NM_152786    ENSG00000157653   9     109641295   109641327   CAG
CIZ1_HUMAN   ENSG00000148337   9     124406704   124406797   CTG
MAML2        ENSG00000184384   11    96009730    96009957    TGC
PRDMO        ENSG00000170325   11    129813900   129813925   CTG
FLJ31638     ENSG00000151065   12    1941584     1941613     TGC
ZNF384       ENSG00000126746   12    6656325     6656374     GCT
EDR1         ENSG00000111752   12    8985591     8985638     GCA
MLL2         ENSG00000167548   12    49143997    49144033    TGC
PHC1         ENSG00000179899   12    55523949    55523996    GCT
PHLDA1       ENSG00000139289   12    76141663    76141715    TGC
ASCL1        ENSG00000139352   12    103285098   103285155   GCA
NCOR2        ENSG00000139720   12    124611001   124611053   GCT
EP400        ENSG00000183495   12    132427572   132427656   GCA
C14orf4      ENSG00000119669   14    75483802    75483867    TGC
BRIGHT       ENSG00000179361   15    72412214    72412254    CAG
POLG         ENSG00000140521   15    87464052    87464093    GCT
MEF2A        ENSG00000068305   15    97845438    97845472    CAG
CREBBP       ENSG00000005339   16    3778685     3778739     TGC
TNRC6        ENSG00000090905   16    24715802    24715902    CAG
O94795       ENSG00000168286   16    67612229    67612317    GCA
NFAT5        ENSG00000102908   16    69462945    69462986    CAG
MINK-1       ENSG00000141503   17    4738742     4738792     GCA
RAI1         ENSG00000108557   17    17640149    17640190    CAG
S0C6_HUMAN   ENSG00000174111   17    36419000    36419025    AGC
ZNF161       ENSG00000136451   17    56398683    56398723    TGC
CACNA1A      ENSG00000141837   19    13163881    13163921    CTG
BRD4         ENSG00000141867   19    15194877    15194932    GCT
CHERP        ENSG00000085872   19    16485772    16485809    GCT
NUMBL        ENSG00000105245   19    45849908    45849973    GCT
NCOA6        ENSG00000088297   20    34014378    34014452    TGC
PRKCBP1      ENSG00000101040   20    46505971    46506011    GCT
NCOA3        ENSG00000124151   20    46918236    46918323    GCA
PCQAP        ENSG00000099917   22    19245310    19245403    CAG
MN1          ENSG00000169184   22    26520156    26520207    TGC
KIAA1093     ENSG00000100354   22    38940215    38940240    GCA
MKL1         ENSG00000100361   22    39225928    39225985    TGC
TNRC11       ENSG00000184634   X     68594302    68594402    AGC
KIAA1817     ENSG00000147234   X     104879384   104879465   CAG
CXorf6       ENSG00000013619   X     147410589   147410627   GCA

APPENDIX C

flanker.pl
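flanker.pl reads the Appendix B flat file one row at a time and, for each repeat, measures the GC fraction of the sequence flanking it at several window widths. The two steps can be sketched without the EnsEMBL API or the MySQL database, as below. This is a minimal illustration under stated assumptions: the helper names `parse_gems_line`, `gc_fraction` and `flank_coords` are hypothetical, the row is split on whitespace instead of being matched by the script's own regular expression, and the flanking sequence is a made-up toy string, not genomic data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical parser for one Appendix B row:
# gene name, EnsEMBL gene ID, chromosome, start, end, repeat unit.
sub parse_gems_line {
    my ($line) = @_;
    my @f = split ' ', $line;
    # the header row and malformed rows are rejected
    return unless @f == 6 && $f[1] =~ /^ENSG\d+$/i;
    my %rec;
    @rec{qw(gene gene_id chr start end unit)} = @f;
    $rec{gene_id} = uc $rec{gene_id};
    return \%rec;
}

# Fraction of G and C bases in a sequence (0 for an empty string).
sub gc_fraction {
    my ($seq) = @_;
    return 0 unless length($seq);
    my $gc = ($seq =~ tr/GCgc//);    # tr counts matches without editing $seq
    return $gc / length($seq);
}

# Coordinates of the left and right flanks of width $w around a repeat,
# mirroring the start-w..start-1 and end+1..end+w slices in flanker.pl.
sub flank_coords {
    my ($start, $end, $w) = @_;
    return ([ $start - $w, $start - 1 ], [ $end + 1, $end + $w ]);
}

my $rec = parse_gems_line('HD ENSG00000125387 4 3113331 3113395 CAG');
for my $w (50, 100) {
    my ($left, $right) = flank_coords($rec->{start}, $rec->{end}, $w);
    print "$rec->{gene} ${w}bp flanks: chr$rec->{chr}:$left->[0]-$left->[1], ",
          "chr$rec->{chr}:$right->[0]-$right->[1]\n";
}

# Toy left and right flanks, invented for illustration
print gc_fraction('GGCCAT' . 'ATGC'), "\n";
```

Wrapping the window arithmetic in one function makes it possible to loop over the window widths, whereas the script spells out each width as a separate block.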

#!/usr/local/bin/perl -w
# flanker.pl
# Mightily committed to code by Perseus Missirlis (perseus@bioinformatics.ubc.ca / perseus@canada.com)
# Bioinformatics Graduate Student
# UBC Bioinformatics Centre (UBiC) at the Centre for Molecular Medicine and Therapeutics (CMMT)
# http://bioinformatics.ubc.ca/
# Last update: Aug 27, 2003

use strict;
use Bio::EnsEMBL::DBSQL::DBAdaptor;
use DBI;
use Data::Dumper;

####################
# Global Variables #
####################

# EnsEMBL API

my $host         = 'ensembldb.ensembl.org';
my $user         = 'anonymous';
my $db_name      = 'homo_sapiens_core_16_33';
my $prog_version = '0.4';

# DBI

my ($dsn)       = "DBI:mysql:gems_cis:stent.cmmt.ubc.ca";
my ($user_name) = "gems_rw";
my ($password)  = "g7e6m5";
my ($dbh, $sth);
my (@ary);

#######################
# Connect to Database #
#######################

$dbh = DBI->connect ($dsn, $user_name, $password, { RaiseError => 1 });

##############################
# Insert to build_info table #
##############################

print "\n\n###################################\n";
print "# Inserting into build_info table #\n";
print "###################################";

$sth = $dbh->prepare ("INSERT INTO build_info VALUES('$host','$db_name',NOW())");
$sth->execute ();

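######################################
# Check for proper command-line file #
######################################

my($USAGE) = "
+-------------------------------------------------------------------------------------------+
| Automatic Sequence Analysis Tool for Disease-Associated Repeat Instability Studies - v$prog_version |
| UBC Bioinformatics                                                                        |
| By: Perseus Missirlis (perseus\@canada.com)                                                |
+-------------------------------------------------------------------------------------------+\n
USAGE: $0 coordfile

OPTIONS: None so far!\n\n";

unless(@ARGV) {
    print $USAGE;
    exit;
}

############################################
# Figure out how many sequences to compare #
############################################

print "\n
+-------------------------------------------------------------------------------------------+
| Automatic Sequence Analysis Tool for Disease-Associated Repeat Instability Studies - v$prog_version |
| UBC Bioinformatics                                                                        |
| By: Perseus Missirlis (perseus\@canada.com)                                                |
+-------------------------------------------------------------------------------------------+\n\n";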

################################
# Open files for sequence data #
################################

my $outputfile1 = "seq.fa";

unless ( open(SEQ, ">$outputfile1") ) {
    die "Cannot open file \"$outputfile1\" to write to!\n\n";
}

my $outputfile2 = "seq.fa.masked";

unless ( open(REP_SEQ, ">$outputfile2") ) {
    die "Cannot open file \"$outputfile2\" to write to!\n\n";
}

my $outputfile3 = "gc.txt";

unless ( open(GC, ">$outputfile3") ) {
    die "Cannot open file \"$outputfile3\" to write to!\n\n";
}

print GC "Gene\tGC\%_50bp\tGC\%_100bp\tGC\%_150bp\tGC\%_200bp\tGC\%_250bp\tGC\%_300bp\tGC\%_350bp\tGC\%_400bp\tGC\%_450bp\tGC\%_500bp\tGC\%_1000bp\n";

my $outputfile4 = "ctcf_scores_dist.txt";

unless ( open(CTCF_SCORE, ">$outputfile4") ) {
    die "Cannot open file \"$outputfile4\" to write to!\n\n";
}

print CTCF_SCORE "Gene\tScore\tDistance\n";

my $outputfile5 = "ctcf_scores_dist2.txt";

unless ( open(CTCF_SCORE2, ">$outputfile5") ) {
    die "Cannot open file \"$outputfile5\" to write to!\n\n";
}

# header goes to the second CTCF score file
print CTCF_SCORE2 "Gene\tScore\tDistance\n";

#################################
# Specify input file from STDIN #
#################################

my $aln_filename = "$ARGV[0]";

unless ( -e $aln_filename) {
    die "File \"$aln_filename\" doesn\'t seem to exist!!\n";
}

unless ( open(ALN_FILE, $aln_filename) ) {
    die "Cannot open file \"$aln_filename\"\n\n";
}

####################################
# Collect User Input Prior to Loop #
####################################

# Sequence flanking the repeat
print "How much sequence flanking the repeat do you wish to collect? (0-100,000) -> ";
my $flanker = <STDIN>;

$flanker =~ s/,//g;
$flanker =~ s/\s//g;

while (($flanker < 0) || ($flanker > 100000)) {
    print "\nIncorrect amount of flanking sequence, try again (0-100,000) : ";
    $flanker = <STDIN>;
    $flanker =~ s/,//g;
    $flanker =~ s/\s//g;
}

###############################
# Connect to EnsEMBL with API #
###############################

my $db = new Bio::EnsEMBL::DBSQL::DBAdaptor(-host   => $host,
                                            -user   => $user,
                                            -dbname => $db_name);

my $slice_adaptor = $db->get_SliceAdaptor;

############################################################
# Get info from flat file to interact with the EnsEMBL API #
############################################################

my $n = 1;
my $input_filename;
my $outputfile;

while (<ALN_FILE>) {
    chomp;

    #########################################
    # EnsEMBL API sequence extraction phase #
    #########################################

    # in this regex
    # $1 collects the common gene name
    # $2 collects the EnsEMBL gene ID
    # $3 collects the last digit from the EnsEMBL gene ID, it's an effect of the internal brackets
    # $4 collects the chromosome number
    # $5 collects the repeat start position in chromosome $4
    # $6 collects the repeat end position in chromosome $4
    # $7 collects the repeat unit

if

        print "1 is $1, 2 is $2, 3 is $3, 4 is $4, 5 is $5, 6 is $6, 7 is $7\n";

        my $gene_name = $1;
        $gene_name =~ s/\s//g;

        my $gene_id = $2;
        $gene_id =~ s/\s//g;
        $gene_id =~ tr/a-z/A-Z/;

        my $chrom = $4;
        my $rep_start = $5;
        my $rep_end = $6;
        my $repeat_unit = $7;

        ########################
        # Insert to gems table #
        ########################

        $sth = $dbh->prepare ("INSERT INTO gems VALUES('$gene_name','$gene_id')");
        $sth->execute ();

        ###################################
        # Collect flanking GC percentages #
        ###################################

        # 50

        my $slice_left_50 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-50,$rep_start-1);
        my $slice_right_50 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_end+1,$rep_end+50);
        my $left_50 = $slice_left_50->seq;
        my $right_50 = $slice_right_50->seq;
        my $gc_50_seq = $left_50 . $right_50;

        my $g=0;
        my $c=0;
        my $gc_length_50 = length($gc_50_seq);
        while($gc_50_seq =~ /g/ig){$g++}
        while($gc_50_seq =~ /c/ig){$c++}
        my $gc_50 = ($g+$c)/$gc_length_50;

        # 100

        my $slice_left_100 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-100,$rep_start-1);
        my $slice_right_100 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_end+1,$rep_end+100);
        my $left_100 = $slice_left_100->seq;
        my $right_100 = $slice_right_100->seq;
        my $gc_100_seq = $left_100 . $right_100;

        $g=0;
        $c=0;
        my $gc_length_100 = length($gc_100_seq);
        while($gc_100_seq =~ /g/ig){$g++}
        while($gc_100_seq =~ /c/ig){$c++}
        my $gc_100 = ($g+$c)/$gc_length_100;

        # 150

        my $slice_left_150 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-150,$rep_start-1);
        my $slice_right_150 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_end+1,$rep_end+150);
        my $left_150 = $slice_left_150->seq;
        my $right_150 = $slice_right_150->seq;
        my $gc_150_seq = $left_150 . $right_150;

        $g=0;
        $c=0;
        my $gc_length_150 = length($gc_150_seq);
        while($gc_150_seq =~ /g/ig){$g++}
        while($gc_150_seq =~ /c/ig){$c++}
        my $gc_150 = ($g+$c)/$gc_length_150;

        # 200

        my $slice_left_200 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-200,$rep_start-1);
        my $slice_right_200 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_end+1,$rep_end+200);
        my $left_200 = $slice_left_200->seq;
        my $right_200 = $slice_right_200->seq;
        my $gc_200_seq = $left_200 . $right_200;

        $g=0;
        $c=0;
        my $gc_length_200 = length($gc_200_seq);
        while($gc_200_seq =~ /g/ig){$g++}
        while($gc_200_seq =~ /c/ig){$c++}
        my $gc_200 = ($g+$c)/$gc_length_200;

        # 250

        my $slice_left_250 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-250,$rep_start-1);
        my $slice_right_250 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_end+1,$rep_end+250);
        my $left_250 = $slice_left_250->seq;
        my $right_250 = $slice_right_250->seq;
        my $gc_250_seq = $left_250 . $right_250;

        $g=0;
        $c=0;
        my $gc_length_250 = length($gc_250_seq);
        while($gc_250_seq =~ /g/ig){$g++}
        while($gc_250_seq =~ /c/ig){$c++}
        my $gc_250 = ($g+$c)/$gc_length_250;

        # 300

        my $slice_left_300 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-300,$rep_start-1);
        my $slice_right_300 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_end+1,$rep_end+300);
        my $left_300 = $slice_left_300->seq;
        my $right_300 = $slice_right_300->seq;
        my $gc_300_seq = $left_300 . $right_300;

        $g=0;
        $c=0;
        my $gc_length_300 = length($gc_300_seq);
        while($gc_300_seq =~ /g/ig){$g++}
        while($gc_300_seq =~ /c/ig){$c++}
        my $gc_300 = ($g+$c)/$gc_length_300;

        # 350

        my $slice_left_350 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-350,$rep_start-1);
        my $slice_right_350 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_end+1,$rep_end+350);
        my $left_350 = $slice_left_350->seq;
        my $right_350 = $slice_right_350->seq;
        my $gc_350_seq = $left_350 . $right_350;

        $g=0;
        $c=0;
        my $gc_length_350 = length($gc_350_seq);
        while($gc_350_seq =~ /g/ig){$g++}
        while($gc_350_seq =~ /c/ig){$c++}
        my $gc_350 = ($g+$c)/$gc_length_350;

        # 400

        my $slice_left_400 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-400,$rep_start-1);
        my $slice_right_400 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_end+1,$rep_end+400);
        my $left_400 = $slice_left_400->seq;
        my $right_400 = $slice_right_400->seq;
        my $gc_400_seq = $left_400 . $right_400;

        $g=0;
        $c=0;
        my $gc_length_400 = length($gc_400_seq);
        while($gc_400_seq =~ /g/ig){$g++}
        while($gc_400_seq =~ /c/ig){$c++}
        my $gc_400 = ($g+$c)/$gc_length_400;

        # 450

        my $slice_left_450 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-450,$rep_start-1);
        my $slice_right_450 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_end+1,$rep_end+450);
        my $left_450 = $slice_left_450->seq;
        my $right_450 = $slice_right_450->seq;
        my $gc_450_seq = $left_450 . $right_450;

        $g=0;
        $c=0;
        my $gc_length_450 = length($gc_450_seq);
        while($gc_450_seq =~ /g/ig){$g++}
        while($gc_450_seq =~ /c/ig){$c++}
        my $gc_450 = ($g+$c)/$gc_length_450;

        # 500

        my $slice_left_500 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-500,$rep_start-1);
        my $slice_right_500 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_end+1,$rep_end+500);
        my $left_500 = $slice_left_500->seq;
        my $right_500 = $slice_right_500->seq;
        my $gc_500_seq = $left_500 . $right_500;

        $g=0;
        $c=0;
        my $gc_length_500 = length($gc_500_seq);
        while($gc_500_seq =~ /g/ig){$g++}
        while($gc_500_seq =~ /c/ig){$c++}
        my $gc_500 = ($g+$c)/$gc_length_500;

        # 1000

        my $slice_left_1000 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-1000,$rep_start-1);
        my $slice_right_1000 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_end+1,$rep_end+1000);
        my $left_1000 = $slice_left_1000->seq;
        my $right_1000 = $slice_right_1000->seq;
        my $gc_1000_seq = $left_1000 . $right_1000;

        $g=0;
        $c=0;
        my $gc_length_1000 = length($gc_1000_seq);
        while($gc_1000_seq =~ /g/ig){$g++}
        while($gc_1000_seq =~ /c/ig){$c++}
        my $gc_1000 = ($g+$c)/$gc_length_1000;

        print GC "$gene_name\t$gc_50\t$gc_100\t$gc_150\t$gc_200\t$gc_250\t$gc_300\t$gc_350\t$gc_400\t$gc_450\t$gc_500\t$gc_1000\n";

        ######################
        # Insert to gc table #
        ######################

        print "\n\n###########################\n";
        print "# Inserting into gc table #\n";
        print "###########################\n\n";

        $sth = $dbh->prepare ("INSERT INTO gc VALUES('$gene_name','$gc_50','$gc_100','$gc_150','$gc_200','$gc_250','$gc_300','$gc_350','$gc_400','$gc_450','$gc_500','$gc_1000')");
        $sth->execute ();

        ####################################
        # Create new directories for files #
        ####################################

        mkdir "$gene_name" or warn "Cannot make $gene_name directory: $!";
        chdir "./$gene_name" or die "cannot chdir to ./$gene_name: $!";
        print "current directory is: ";
        system("pwd");
        print "\n\n";
        chomp(my $pwd = `pwd`);
        $pwd =~ s!/disk2/home2!//STENT!;

        ################################
        # Open files for sequence data #
        ################################

        my $outputfile1 = "$gene_name.fa";

        unless ( open(SEQ2, ">$outputfile1") ) {
            die "Cannot open file \"$outputfile1\" to write to!\n\n";
        }

        my $outputfile2 = "$gene_name.fa.masked";

        unless ( open(REP_SEQ2, ">$outputfile2") ) {
            die "Cannot open file \"$outputfile2\" to write to!\n\n";
        }

        my $outputfile3 = "$gene_name.repeat";

        unless ( open(REPEAT, ">$outputfile3") ) {
            die "Cannot open file \"$outputfile3\" to write to!\n\n";
        }

        my $outputfile4 = "exons.gff";

        unless ( open(EXONS, ">$outputfile4") ) {
            die "Cannot open file \"$outputfile4\" to write to!\n\n";
        }

        my $outputfile5 = "cpg.gff";

        unless ( open(CPG, ">$outputfile5") ) {
            die "Cannot open file \"$outputfile5\" to write to!\n\n";
        }

        my $outputfile6 = "hmm_test.fa";

        unless ( open(HMM, ">$outputfile6") ) {
            die "Cannot open file \"$outputfile6\" to write to!\n\n";
        }

        my $outputfile7 = "rep_coords.txt";

        unless ( open(REP_COORD, ">$outputfile7") ) {
            die "Cannot open file \"$outputfile7\" to write to!\n\n";
        }

        my $outputfile8 = "rep_purity.txt";

        unless ( open(REP_PURE, ">$outputfile8") ) {
            die "Cannot open file \"$outputfile8\" to write to!\n\n";
        }

        my $outputfile9 = "alu.txt";

        unless ( open(ALU, ">$outputfile9") ) {
            die "Cannot open file \"$outputfile9\" to write to!\n\n";
        }

        my $outputfile10 = "rep_elements.txt";

        unless ( open(REP_EL, ">$outputfile10") ) {
            die "Cannot open file \"$outputfile10\" to write to!\n\n";
        }

        # repeat

        my $slice = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start,$rep_end);

        my $chr_name = $slice->chr_name;
        my $chr_start = $slice->chr_start;
        my $chr_end = $slice->chr_end;
        my $strand = $slice->strand;
        my $repeat_lengther = $slice->length;
        my $strand2;

        if ($strand == 1) {
            $strand2 = "+";
        }
        elsif ($strand == -1 ) {
            $strand2 = "-";
        }
        my $rep_total_seq = $slice->seq;

        print "\n$gene_name Repeat Slice Genomic info:\n";
        print "********************************\n\n";

        print "Chromosome       : " . $chr_name . "\n";
        print "Start            : " . $chr_start . "\n";
        print "End              : " . $chr_end . "\n";
        print "Strand           : " . $strand2 . "\n";
        print "Length of repeat : " . $repeat_lengther . "\n";
        print "EnsEMBL server   : " . $host . "\n";
        print "Database         : " . $db_name . "\n";
        print "Repeat           : \n\n";

        my $repeat_length = $repeat_lengther;

        print $rep_total_seq . "\n";

        ##########################
        # R plot for repeat unit #
        ##########################

        my $i = $flanker + 1;
        my $end = $i + $repeat_length;

        while ($i < $end) {
            print REP_COORD "$i\t1\n";
            $i++;
        }
        close(REP_COORD);

        ########################
        # Print repeat to disk #
        ########################

        print REPEAT ">$gene_name | $gene_id | $db_name | repeat co-ordinates: Chr$chrom$strand2\:$rep_start\-$rep_end\ | length = $repeat_lengther bp\n";
        print REPEAT "$rep_total_seq\n\n";
        close(REPEAT);

        ###########################
        # Calculate repeat purity #
        ###########################

        my $max = 0;

        while ($rep_total_seq =~ /(($repeat_unit)+)/g) {
            my $len = $1;
            my $l_hit = (length($len))/3;
            if ($l_hit > $max) {
                $max = $l_hit;
            }
        }
        print REP_PURE "$max $repeat_unit";
        close(REP_PURE);

        #############################
        # Insert to gems_feat table #
        #############################

        print "\n\n##################################\n";
        print "# Inserting into gems_feat table #\n";
        print "##################################\n\n";

        $sth = $dbh->prepare ("INSERT INTO gems_feat VALUES('$gene_name','$chr_name','$strand2','$chr_start','$chr_end','$repeat_unit','$rep_total_seq','$repeat_lengther','$max',NULL)");
        $sth->execute ();

        #####################################
        # Get repeat plus flanking sequence #
        #####################################

        $slice = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-$flanker,$rep_end+$flanker);

        my $invert_strand = $slice->invert;
        $chr_name = $slice->chr_name;
        $chr_start = $slice->chr_start;
        $chr_end = $slice->chr_end;
        $strand = $slice->strand;
        $strand2 = 0;
        my $lengther = $slice->length;

        my $genes = $slice->get_all_Genes;

foreach my $gene (@$genes) {

#####################
# Collect all exons #
#####################

my $temp_id = $gene->stable_id;

print "$temp_id before loop \n";

if ($temp_id eq $gene_id) {

print "$temp_id in loop\n";

print "\nStrand of $temp_id is " . $gene->strand . "\n\n";

if ($gene->strand == 1) {
    $strand2 = "+";
}
elsif ($gene->strand == -1) {
    $strand2 = "-";
}

print "\n#########################\n";
print "# ensembl API #\n";
print "# Collecting all exons. #\n";
print "#########################\n\n";

my $exons = $gene->get_all_Exons;
foreach my $exon (@$exons) {
    print EXONS $exon->gffstring . "\n";
}
close(EXONS);
print "Done.\n\n";

my $total_seq = $slice->seq;
my $repeat_seq = $slice->get_repeatmasked_seq;
my $rep_seq = $repeat_seq->seq;
my $invert_seq = $invert_strand->seq;
my $invert_seq_collect = $invert_strand->get_repeatmasked_seq;
my $invert_seq_rep = $invert_seq_collect->seq;

print "\n$gene_name Slice Genomic info:\n";

print "EnsEMBL Gene ID : " . $gene_id . "\n";
print "Chromosome : " . $chr_name . "\n";
print "Start : " . $chr_start . "\n";
print "End : " . $chr_end . "\n";
print "Strand : " . $strand2 . "\n";
print "Length of repeat plus flanking sequence : " . $lengther . "\n";
print "EnsEMBL server : " . $host . "\n";
print "Database : " . $db_name . "\n\n";

# Print flanking repetitive sequence to disk

print SEQ2 ">$gene_name | $gene_id | $db_name | Chr$chr_name\+\:$chr_start\-$chr_end\ | repeat co-ordinates: Chr$chrom\+\:$rep_start\-$rep_end\ | flanking sequence up/down-stream of repeat: $flanker | length: $lengther\n";
print SEQ2 "$total_seq\n\n";

print REP_SEQ2 ">$gene_name | $gene_id | $db_name | Chr$chr_name\+\:$chr_start\-$chr_end\ | repeat co-ordinates: Chr$chrom\+\:$rep_start\-$rep_end\ | flanking sequence up/down-stream of repeat: $flanker | length: $lengther\n";
print REP_SEQ2 "$rep_seq\n\n";

print HMM ">$gene_name | $gene_id | $db_name | Chr$chr_name\+\:$chr_start\-$chr_end\ | repeat co-ordinates: Chr$chrom\+\:$rep_start\-$rep_end\ | flanking sequence up/down-stream of repeat: $flanker |\n";
print HMM "$total_seq\n\n";

print HMM ">$gene_name | $gene_id | $db_name | Chr$chr_name\-\:$chr_start\-$chr_end\ | repeat co-ordinates: Chr$chrom\-\:$rep_start\-$rep_end\ | flanking sequence up/down-stream of repeat: $flanker |\n";
print HMM "$invert_seq\n\n";

close(SEQ2);
close(REP_SEQ2);
close(HMM);

############################
# Insert to flanking table #
############################

print "#################################\n";
print "# Inserting into flanking table #\n";
print "#################################\n\n";

# $rep_seq holds the repeat-masked sequence string ($repeat_seq is the slice object)
$sth = $dbh->prepare("INSERT INTO flanking VALUES('$gene_name','$chr_name','$strand2','$chr_start','$chr_end','$total_seq','$invert_seq','$rep_seq','$invert_seq_rep')");
$sth->execute();

##################
# CpG plot phase #
##################

#####################################
# Get all CpG islands - EnsEMBL ver #
#####################################

print "###############################\n";
print "# Scanning for CpG Islands #\n";
print "# EnsEMBL plus EMBOSS running #\n";
print "###############################\n\n";

my $simple_feature_adaptor = $db->get_SimpleFeatureAdaptor;
my $cpg_islands = $simple_feature_adaptor->fetch_all_by_Slice_and_score($slice, 50, 'CpG');

foreach my $cpg (@$cpg_islands) {
    # print CPG "Label: ", $cpg->display_label, "\n";
    # print CPG "Seqname: ", $cpg->seqname, "\n";
    print CPG $cpg->start . "\t";
    print CPG $cpg->end . "\t";
    print CPG $cpg->score, "\n";

    my $cpg_start = $cpg->start;
    my $cpg_end = $cpg->end;
    my $cpg_score = $cpg->score;

    #######################
    # Insert to cpg table #
    #######################

    print "############################\n";
    print "# Inserting into cpg table #\n";
    print "############################\n\n";

    $sth = $dbh->prepare("INSERT INTO cpg VALUES('$gene_name','$cpg_start','$cpg_end','$cpg_score')");
    $sth->execute();

}

close(CPG);
print "CpG data collected!\n\n";

##########
# EMBOSS #
##########

system("cpgplot $gene_name.fa -graph cps -window 500 -shift 100 -minlen 200 -minoe 0.6 -minpc 50 -outfile $gene_name.plotfile > $gene_name\_gc.txt");
system("ps2pdf cpgplot.ps cpgplot.pdf");
system("mv cpgplot.pdf $gene_name.pdf");

#############
# Run HMMer #
#############

print "\n###################################\n";
print "# HMMer v1.84 #\n";
print "# Scanning for CTCF binding sites #\n";
print "###################################\n\n";

system("hmmfs -c ../ctcf-md.hmm $gene_name\.fa > $gene_name\_ctcf.out");

print "Done!\n\n";

#####################
# Collect all Alu's #
#####################

my $repeats = $slice->get_all_RepeatFeatures;
foreach my $repeat (@$repeats) {
    if ($repeat->repeat_consensus->name eq 'AluY') {
        print ALU $repeat->gffstring . "\t" . $repeat->repeat_consensus->name . "\t" . $repeat->repeat_consensus->repeat_class . "\n";
    }
}
close(ALU);

#######################
# Collect all Repeats #
#######################

$repeats = $slice->get_all_RepeatFeatures;
foreach my $repeat (@$repeats) {
    print REP_EL $repeat->gffstring . "\t" . $repeat->repeat_consensus->name . "\t" . $repeat->repeat_consensus->repeat_class . "\n";
}

close(REP_EL);

$input_filename = "rep_elements.txt";

unless ( open(IN_FILE, "$input_filename") ) {
    die "Cannot open file \"$input_filename\" for reading!\n\n";
}
while (<IN_FILE>) {
    chomp;
    if (/^\S+\.\S+\-\S+\s+\S+\s+\S+\s+(\d+)\s+(\d+)\s+(\d+)\s(\S+)\s+\S+\s+\S+\s+(\S+)\s+(\S+)/) {

        my $r_start = $1;
        my $r_end = $2;
        my $r_score = $3;
        my $r_strand = $4;
        my $r_name = $5;
        my $r_class = $6;
        my $r_distance;

        if ($r_start > $flanker) {
            $r_distance = $r_start - ($flanker + $repeat_lengther);
        }
        elsif ($r_start < $flanker) {
            $r_distance = $flanker - $r_end;
        }
        print "\n";
        print "################################\n";
        print "# Inserting into repeats table #\n";
        print "################################\n\n";

        $sth = $dbh->prepare("INSERT INTO repeats VALUES('$gene_name','$r_name','$r_class','$r_start','$r_end','$r_score','$r_strand','$r_distance')");
        $sth->execute();
    }
}

####################
# Create R_scripts #
####################

#########
# Exons #
#########

$input_filename = "exons.gff";
unless ( open(IN_FILE, "$input_filename") ) {
    die "Cannot open file \"$input_filename\" for reading!\n\n";
}
$outputfile = "exons_R.txt";
unless ( open(EXON_R, ">$outputfile") ) {
    die "Cannot open file \"$outputfile\" to write to!\n\n";
}
while (<IN_FILE>) {
    chomp;

    #################################
    # generate R plotfile for exons #
    #################################

    if (/^\S+\s+\S+\s+\S+\s+(\S+)\s+(\S+)\s+\S+/) {

        my $i = $1;
        my $end = $2;

        print "##############################\n";
        print "# Inserting into exons table #\n";
        print "##############################\n\n";

        $sth = $dbh->prepare("INSERT INTO exons VALUES('$gene_name','$1','$2')");
        $sth->execute();

        while ($i <= $end) {
            print EXON_R "$i\t0.5\n";
            $i++;
        }
    }
}
close(IN_FILE);
close(EXON_R);

#######
# CpG #
#######

$input_filename = "cpg.gff";

unless ( open(IN_FILE, "$input_filename") ) {
    die "Cannot open file \"$input_filename\" for reading!\n\n";
}
$outputfile = "cpg_R.txt";

unless ( open(CPG_R, ">$outputfile") ) {
    die "Cannot open file \"$outputfile\" to write to!\n\n";
}

while (<IN_FILE>) {
    chomp;

    ###############################
    # generate R plotfile for CpG #
    ###############################
    print "Creating R plots...\n";

    if (/^(\S+)\s+(\S+)\s+(\S+)/) {

        my $i = $1;
        my $end = $2;

        while ($i <= $end) {
            print CPG_R "$i\t$3\n";
            $i++;
        }
    }
}
close(IN_FILE);
close(CPG_R);

########
# CTCF #
########

$input_filename = "$gene_name\_ctcf.out";

unless ( open(IN_FILE, "$input_filename") ) {
    die "Cannot open file \"$input_filename\" for reading!\n\n";
}

$outputfile = "hmm_R.txt";

unless ( open(HMM_R, ">$outputfile") ) {
    die "Cannot open file \"$outputfile\" to write to!\n\n";
}

while (<IN_FILE>) {
    chomp;

    ######################################
    # generate R plotfile for ctcf sites #
    ######################################

    my $c_distance;

    if (/^(\d+\.\d+)\s+(\d+)\s+(\d+)/) {

        print "1 is $1 and 2 is $2 and 3 is $3\n";

        if ($2 < $3) {

            if ($2 > $flanker) {
                $c_distance = $2 - ($flanker + $repeat_lengther);
            }
            elsif ($2 < $flanker) {
                $c_distance = $flanker - $3;
            }

            print "\n###############################\n";
            print "# Inserting into ctcf table 1 #\n";
            print "###############################\n\n";

            $sth = $dbh->prepare("INSERT INTO ctcf VALUES('$gene_name','$1','$2','$3','+','$c_distance')");
            $sth->execute();

            my $i = $2;
            my $end = $3;

            while ($i <= $end) {
                print HMM_R "$i\t$1\n";
                $i++;
            }
        }
        elsif ($2 > $3) {

            if ($3 > $flanker) {
                $c_distance = $3 - ($flanker + $repeat_lengther);
            }
            elsif ($3 < $flanker) {
                $c_distance = $flanker - $2;
            }

            print "\n###############################\n";
            print "# Inserting into ctcf table 2 #\n";
            print "###############################\n\n";

            $sth = $dbh->prepare("INSERT INTO ctcf VALUES('$gene_name','$1','$3','$2','-','$c_distance')");
            $sth->execute();

            my $i = $3;
            my $end = $2;

            while ($i <= $end) {
                print HMM_R "$i\t$1\n";
                $i++;
            }
        }
    }
}

close(IN_FILE);
close(HMM_R);

print "Done!\n\n";

############
# GC Stats #
############

$input_filename = "$gene_name\_gc.txt";

unless ( open(IN_FILE, "$input_filename") ) {
    die "Cannot open file \"$input_filename\" for reading!\n\n";
}

$outputfile = "gc_R.txt";

unless ( open(GC_R, ">$outputfile") ) {
    die "Cannot open file \"$outputfile\" to write to!\n\n";
}

my @obsex;
my @gc;
$i = 1;
my $j = 1;
my $n = $lengther + 2;
$end = ($lengther*2) + 1;

while (<IN_FILE>) {
    chomp;

    ########################################
    # Parse cpgplot info for gc statistics #
    ########################################

    if (/^(\d+\.\d+)/) {

        if ($i <= $lengther) {
            $obsex[$i] = $1;
            $i++;
        }
        elsif (($n <= $end && $i > $lengther)) {
            $gc[$j] = $1;
            $n++;
            $j++;
        }
    }
}
close(IN_FILE);

#####################
# Print to gc_R.txt #
#####################

$i = 1;
while ($i <= $lengther) {
    if ($gc[$i] > 0) {
        print GC_R "$i\t$obsex[$i]\t" . ($gc[$i]/100) . "\n";

        my $obs_in = $obsex[$i];
        my $gc_in = ($gc[$i]/100);

        print "\n";
        print "###########################\n";
        print "# Inserting into gc_plots #\n";
        print "###########################\n\n";

        $sth = $dbh->prepare("INSERT INTO gc_plots VALUES('$gene_name','$i','$obs_in','$gc_in')");
        $sth->execute();

    }
    $i++;
}
close(GC_R);

########
# AluY #
########

$input_filename = "alu.txt";
unless ( open(IN_FILE, "$input_filename") ) {
    die "Cannot open file \"$input_filename\" for reading!\n\n";
}

$outputfile = "alu_R.txt";
unless ( open(ALU_R, ">$outputfile") ) {
    die "Cannot open file \"$outputfile\" to write to!\n\n";
}
while (<IN_FILE>) {
    chomp;

    if (/^\S+\.\S+\-\S+\s+\S+\s+\S+\s+(\d+)\s+(\d+)\s+(\d+)\s(\S+)/) {

        my $i = $1;
        my $end = $2;

        while ($i <= $end) {
            print ALU_R "$i\t0.5\n";
            $i++;
        }
    }
}
close(ALU_R);

####################
# Print R plotfile #
####################

$outputfile = "$gene_name\.R";
unless ( open(R_PLOT, ">$outputfile") ) {
    die "Cannot open file \"$outputfile\" to write to!\n\n";
}
print R_PLOT "\# FILE\: $gene_name.R\n";
print R_PLOT "# AUTH: Perseus Missirlis \n";
print R_PLOT "\n";
print R_PLOT "# $gene_name\n";
print R_PLOT "\n";
print R_PLOT "$gene_name\.rep = read.delim(\"$pwd/rep_coords.txt\", header\=FALSE)\;\n";
print R_PLOT "$gene_name\.exons = read.delim(\"$pwd/exons_R.txt\", header=FALSE);\n";
print R_PLOT "$gene_name\.cpg = read.delim(\"$pwd/cpg_R.txt\", header=FALSE);\n";
print R_PLOT "$gene_name\.ctcf = read.delim(\"$pwd/hmm_R.txt\", header=FALSE);\n";
print R_PLOT "$gene_name\.alu = read.delim(\"$pwd/alu_R.txt\", header=FALSE);\n";
print R_PLOT "$gene_name\.gc = read.delim(\"$pwd/gc_R.txt\", header=FALSE);\n";
print R_PLOT "$gene_name\.flank\.0 = read.delim(\"$pwd/ctcf_dint.txt\", header=FALSE);\n";
print R_PLOT "$gene_name\.flank\.100 = read.delim(\"$pwd/ctcf_dint_100.txt\", header=FALSE);\n";
print R_PLOT "$gene_name\.flank\.500 = read.delim(\"$pwd/ctcf_dint_500.txt\", header=FALSE);\n";
print R_PLOT "\n";
print R_PLOT "x <- dim($gene_name\.rep)[1];\n";
print R_PLOT "mylim = $flanker*2 + x;\n";
print R_PLOT "\n";

# Plot 1: whole slice
print R_PLOT "plot($gene_name\.cpg\$V1, (($gene_name\.cpg[,2]/max($gene_name\.cpg[,2]))/2), xlim=c(0, mylim), col=\"green\", xlab=\"$gene_name\", cex=0.5, type=\"h\", ylim=c(0,max($gene_name\.gc\$V2)), ylab=\"score\");\n";
print R_PLOT "lines($gene_name\.exons\$V1, $gene_name\.exons\$V2, col=\"black\", cex=0.5, type=\"h\")\n";
print R_PLOT "lines($gene_name\.ctcf\$V1, (($gene_name\.ctcf[,2]/max($gene_name\.ctcf[,2]))/3), col=\"red\", cex=0.5, type=\"h\")\n";
print R_PLOT "lines($gene_name\.alu\$V1, $gene_name\.alu[,2], col=\"blue\", cex=0.5, type=\"h\");\n";
print R_PLOT "lines($gene_name\.rep\$V1, $gene_name\.rep\$V2, col=\"purple\", cex=0.5, type=\"h\");\n";
print R_PLOT "lines($gene_name\.gc\$V1, $gene_name\.gc\$V2, col=\"orange\", cex=0.5, type=\"l\", lty=1);\n";
print R_PLOT "lines($gene_name\.gc\$V1, $gene_name\.gc\$V3, col=\"blue\", cex=0.5, type=\"l\", lty=2);\n";
print R_PLOT "\n";

# Plot 2: 1000 bp either side of the repeat
print R_PLOT "plot($gene_name\.cpg\$V1, (($gene_name\.cpg[,2]/max($gene_name\.cpg[,2]))/2), xlim=c(" . ($flanker-1000) . ", " . ($flanker+1000) . "+x), col=\"green\", xlab=\"$gene_name\", cex=0.5, type=\"h\", ylim=c(0,max($gene_name\.gc\$V2)), ylab=\"score\");\n";
print R_PLOT "lines($gene_name\.exons\$V1, $gene_name\.exons\$V2, col=\"black\", cex=0.5, type=\"h\")\n";
print R_PLOT "lines($gene_name\.ctcf\$V1, (($gene_name\.ctcf[,2]/max($gene_name\.ctcf[,2]))/3), col=\"red\", cex=0.5, type=\"h\");\n";
print R_PLOT "lines($gene_name\.alu\$V1, $gene_name\.alu[,2], col=\"blue\", cex=0.5, type=\"h\");\n";
print R_PLOT "lines($gene_name\.rep\$V1, $gene_name\.rep\$V2, col=\"purple\", cex=0.5, type=\"h\");\n";
print R_PLOT "lines($gene_name\.gc\$V1, $gene_name\.gc\$V2, col=\"orange\", cex=0.5, type=\"l\", lty=1);\n";
print R_PLOT "lines($gene_name\.gc\$V1, $gene_name\.gc\$V3, col=\"blue\", cex=0.5, type=\"l\", lty=2);\n";
print R_PLOT "abline(0.5,0);\n";
print R_PLOT "\n";

# Plot 3: 10000 bp either side of the repeat
print R_PLOT "plot($gene_name\.cpg\$V1, (($gene_name\.cpg[,2]/max($gene_name\.cpg[,2]))/2), xlim=c(" . ($flanker-10000) . ", " . ($flanker+10000) . "+x), col=\"green\", xlab=\"$gene_name\", cex=0.5, type=\"h\", ylim=c(0,max($gene_name\.gc\$V2)), ylab=\"score\");\n";
print R_PLOT "lines($gene_name\.exons\$V1, $gene_name\.exons\$V2, col=\"black\", cex=0.5, type=\"h\")\n";
print R_PLOT "lines($gene_name\.ctcf\$V1, (($gene_name\.ctcf[,2]/max($gene_name\.ctcf[,2]))/3), col=\"red\", cex=0.5, type=\"h\");\n";
print R_PLOT "lines($gene_name\.alu\$V1, $gene_name\.alu[,2], col=\"blue\", cex=0.5, type=\"h\");\n";
print R_PLOT "lines($gene_name\.rep\$V1, $gene_name\.rep\$V2, col=\"purple\", cex=0.5, type=\"h\");\n";
print R_PLOT "lines($gene_name\.gc\$V1, $gene_name\.gc\$V2, col=\"orange\", cex=0.5, type=\"l\", lty=1);\n";
print R_PLOT "lines($gene_name\.gc\$V1, $gene_name\.gc\$V3, col=\"blue\", cex=0.5, type=\"l\", lty=2);\n";
print R_PLOT "abline(0.5,0);\n";
close(R_PLOT);

##########################################
# Open HMMer file for CTCF binding sites #
##########################################

$aln_filename = "$gene_name\_ctcf.out";
unless ( -e $aln_filename) {
    die "File \"$aln_filename\" doesn\'t seem to exist!!\n";
}
unless ( open(CTCF_SITES, $aln_filename) ) {
    die "Cannot open file \"$aln_filename\"\n\n";
}

################################################
# Open file for CTCF binding sites di-nt stats #
################################################

my $outputfile = "ctcf_dint.txt";

unless ( open(CTCF, ">$outputfile") ) {
    die "Cannot open file \"$outputfile\" to write to!\n\n";
}
print CTCF "Score\t" . "aa\t" . "at\t" . "ag\t" . "ac\t" . "ta\t" . "tt\t" . "tg\t" . "tc\t" . "ga\t" . "gt\t" . "gg\t" . "gc\t" . "ca\t" . "ct\t" . "cg\t" . "cc\t" . "errors\n";

$outputfile = "ctcf_dint_100.txt";
unless ( open(CTCF_100, ">$outputfile") ) {
    die "Cannot open file \"$outputfile\" to write to!\n\n";
}
print CTCF_100 "Score\t" . "aa\t" . "at\t" . "ag\t" . "ac\t" . "ta\t" . "tt\t" . "tg\t" . "tc\t" . "ga\t" . "gt\t" . "gg\t" . "gc\t" . "ca\t" . "ct\t" . "cg\t" . "cc\t" . "errors\n";

$outputfile = "ctcf_dint_500.txt";

unless ( open(CTCF_500, ">$outputfile") ) {
    die "Cannot open file \"$outputfile\" to write to!\n\n";
}
print CTCF_500 "Score\t" . "aa\t" . "at\t" . "ag\t" . "ac\t" . "ta\t" . "tt\t" . "tg\t" . "tc\t" . "ga\t" . "gt\t" . "gg\t" . "gc\t" . "ca\t" . "ct\t" . "cg\t" . "cc\t" . "errors\n";

###########################################################
# Initialize variables to count di-nucleotide frequencies #
###########################################################

my $aa; my $at; my $ag; my $ac;
my $ta; my $tt; my $tg; my $tc;
my $ga; my $gt; my $gg; my $gc;
my $ca; my $ct; my $cg; my $cc;
my $e;

while (<CTCF_SITES>) {
    chomp;
    if (/^(\d+\.\d+)\s+(\d+)\s+(\d+)/) {
        print "1 is $1, 2 is $2, 3 is $3\n";
        my $start;
        my $size;
        if ($2 < $3) {

            if ($2 < $flanker) {
                my $start = $flanker-$2;
                print CTCF_SCORE "$gene_name\t$1\t" . $start . "\n";
            }
            elsif ($2 > $flanker) {
                my $start = $2-($flanker+$repeat_length);
                print CTCF_SCORE "$gene_name\t$1\t" . $start . "\n";
            }
            if ($2 < $flanker) {
                my $start = $2-$flanker;
                print CTCF_SCORE2 "$gene_name\t$1\t" . $start . "\n";
            }
            elsif ($2 > $flanker) {
                my $start = $2-($flanker+$repeat_length);
                print CTCF_SCORE2 "$gene_name\t$1\t" . $start . "\n";
            }

            $start = $2-1;
            $size = $3-$2-1;

            #####
            # 0 #
            #####
            my $for = substr($total_seq, $start, $size);

            #############################################
            # Count di-nucleotides in flanking sequence #
            #############################################

            $aa=0; $at=0; $ag=0; $ac=0;
            $ta=0; $tt=0; $tg=0; $tc=0;
            $ga=0; $gt=0; $gg=0; $gc=0;
            $ca=0; $ct=0; $cg=0; $cc=0;
            $e=0;

            while($for =~ /aa/ig){$aa++}
            while($for =~ /at/ig){$at++}
            while($for =~ /ag/ig){$ag++}
            while($for =~ /ac/ig){$ac++}
            while($for =~ /ta/ig){$ta++}
            while($for =~ /tt/ig){$tt++}
            while($for =~ /tg/ig){$tg++}
            while($for =~ /tc/ig){$tc++}
            while($for =~ /ga/ig){$ga++}
            while($for =~ /gt/ig){$gt++}
            while($for =~ /gg/ig){$gg++}
            while($for =~ /gc/ig){$gc++}
            while($for =~ /ca/ig){$ca++}
            while($for =~ /ct/ig){$ct++}
            while($for =~ /cg/ig){$cg++}
            while($for =~ /cc/ig){$cc++}
            while($for =~ /[^atgc]/ig){$e++}

            print CTCF "$gene_name\.$1\.0\t" . "$aa\t" . "$at\t" . "$ag\t" . "$ac\t" . "$ta\t" . "$tt\t" . "$tg\t" . "$tc\t" . "$ga\t" . "$gt\t" . "$gg\t" . "$gc\t" . "$ca\t" . "$ct\t" . "$cg\t" . "$cc\t" . "$e\n";

            print "\n####################################\n";
            print "# Inserting into ctcf_dint_0 table #\n";
            print "####################################\n\n";

            $sth = $dbh->prepare("INSERT INTO ctcf_dint_0 VALUES('$gene_name','$1','$aa','$at','$ag','$ac','$ta','$tt','$tg','$tc','$ga','$gt','$gg','$gc','$ca','$ct','$cg','$cc','$e')");
            $sth->execute();

            #######
            # 100 #
            #######

            my $for_left = substr($total_seq, $start-100, 100);
            my $for_right = substr($total_seq, $start+$size+100, 100);
            my $for_100 = $for_left . $for . $for_right;

            #############################################
            # Count di-nucleotides in flanking sequence #
            #############################################

            $aa=0; $at=0; $ag=0; $ac=0;
            $ta=0; $tt=0; $tg=0; $tc=0;
            $ga=0; $gt=0; $gg=0; $gc=0;
            $ca=0; $ct=0; $cg=0; $cc=0;
            $e=0;

            while($for_100 =~ /aa/ig){$aa++}
            while($for_100 =~ /at/ig){$at++}
            while($for_100 =~ /ag/ig){$ag++}
            while($for_100 =~ /ac/ig){$ac++}
            while($for_100 =~ /ta/ig){$ta++}
            while($for_100 =~ /tt/ig){$tt++}
            while($for_100 =~ /tg/ig){$tg++}
            while($for_100 =~ /tc/ig){$tc++}
            while($for_100 =~ /ga/ig){$ga++}
            while($for_100 =~ /gt/ig){$gt++}
            while($for_100 =~ /gg/ig){$gg++}
            while($for_100 =~ /gc/ig){$gc++}
            while($for_100 =~ /ca/ig){$ca++}
            while($for_100 =~ /ct/ig){$ct++}
            while($for_100 =~ /cg/ig){$cg++}
            while($for_100 =~ /cc/ig){$cc++}
            while($for_100 =~ /[^atgc]/ig){$e++}

            print CTCF_100 "$gene_name\.$1\.100\t" . "$aa\t" . "$at\t" . "$ag\t" . "$ac\t" . "$ta\t" . "$tt\t" . "$tg\t" . "$tc\t" . "$ga\t" . "$gt\t" . "$gg\t" . "$gc\t" . "$ca\t" . "$ct\t" . "$cg\t" . "$cc\t" . "$e\n";

            print "######################################\n";
            print "# Inserting into ctcf_dint_100 table #\n";
            print "######################################\n\n";

            $sth = $dbh->prepare("INSERT INTO ctcf_dint_100 VALUES('$gene_name','$1','$aa','$at','$ag','$ac','$ta','$tt','$tg','$tc','$ga','$gt','$gg','$gc','$ca','$ct','$cg','$cc','$e')");
            $sth->execute();

            #######
            # 500 #
            #######

            $for_left = substr($total_seq, $start-500, 500);
            $for_right = substr($total_seq, $start+$size+500, 500);
            my $for_500 = $for_left . $for_right;

            #############################################
            # Count di-nucleotides in flanking sequence #
            #############################################

            $aa=0; $at=0; $ag=0; $ac=0;
            $ta=0; $tt=0; $tg=0; $tc=0;
            $ga=0; $gt=0; $gg=0; $gc=0;
            $ca=0; $ct=0; $cg=0; $cc=0;
            $e=0;

            while($for_500 =~ /aa/ig){$aa++}
            while($for_500 =~ /at/ig){$at++}
            while($for_500 =~ /ag/ig){$ag++}
            while($for_500 =~ /ac/ig){$ac++}
            while($for_500 =~ /ta/ig){$ta++}
            while($for_500 =~ /tt/ig){$tt++}
            while($for_500 =~ /tg/ig){$tg++}
            while($for_500 =~ /tc/ig){$tc++}
            while($for_500 =~ /ga/ig){$ga++}
            while($for_500 =~ /gt/ig){$gt++}
            while($for_500 =~ /gg/ig){$gg++}
            while($for_500 =~ /gc/ig){$gc++}
            while($for_500 =~ /ca/ig){$ca++}
            while($for_500 =~ /ct/ig){$ct++}
            while($for_500 =~ /cg/ig){$cg++}
            while($for_500 =~ /cc/ig){$cc++}
            while($for_500 =~ /[^atgc]/ig){$e++}

            print CTCF_500 "$gene_name\.$1\.500\t" . "$aa\t" . "$at\t" . "$ag\t" . "$ac\t" . "$ta\t" . "$tt\t" . "$tg\t" . "$tc\t" . "$ga\t" . "$gt\t" . "$gg\t" . "$gc\t" . "$ca\t" . "$ct\t" . "$cg\t" . "$cc\t" . "$e\n";

            print "######################################\n";
            print "# Inserting into ctcf_dint_500 table #\n";
            print "######################################\n\n";

            $sth = $dbh->prepare("INSERT INTO ctcf_dint_500 VALUES('$gene_name','$1','$aa','$at','$ag','$ac','$ta','$tt','$tg','$tc','$ga','$gt','$gg','$gc','$ca','$ct','$cg','$cc','$e')");
            $sth->execute();

        }
        elsif ($2 > $3) {

            if ($3 < $flanker) {
                my $start = $flanker-$3;
                print CTCF_SCORE "$gene_name\t$1\t" . $start . "\n";
            }
            elsif ($3 > $flanker) {
                my $start = $3-($flanker+$repeat_length);
                print CTCF_SCORE "$gene_name\t$1\t" . $start . "\n";
            }

            if ($3 < $flanker) {
                my $start = $3-$flanker;
                print CTCF_SCORE2 "$gene_name\t$1\t" . $start . "\n";
            }
            elsif ($3 > $flanker) {
                my $start = $3-($flanker+$repeat_length);
                print CTCF_SCORE2 "$gene_name\t$1\t" . $start . "\n";
            }

            #####
            # 0 #
            #####

            $start = $3-1;
            $size = $2-$3+1;
            print "reverse complement! start $start and size $size \n";
            my $hit = substr($total_seq, $start, $size);
            $hit =~ tr/ATGCatgcn/TACGTACGN/;
            my $rev_hit = reverse($hit);

            #############################################
            # Count di-nucleotides in flanking sequence #
            #############################################

            $aa=0; $at=0; $ag=0; $ac=0;
            $ta=0; $tt=0; $tg=0; $tc=0;
            $ga=0; $gt=0; $gg=0; $gc=0;
            $ca=0; $ct=0; $cg=0; $cc=0;
            $e=0;

            while($rev_hit =~ /aa/ig){$aa++}
            while($rev_hit =~ /at/ig){$at++}
            while($rev_hit =~ /ag/ig){$ag++}
            while($rev_hit =~ /ac/ig){$ac++}
            while($rev_hit =~ /ta/ig){$ta++}
            while($rev_hit =~ /tt/ig){$tt++}
            while($rev_hit =~ /tg/ig){$tg++}
            while($rev_hit =~ /tc/ig){$tc++}
            while($rev_hit =~ /ga/ig){$ga++}
            while($rev_hit =~ /gt/ig){$gt++}
            while($rev_hit =~ /gg/ig){$gg++}
            while($rev_hit =~ /gc/ig){$gc++}
            while($rev_hit =~ /ca/ig){$ca++}
            while($rev_hit =~ /ct/ig){$ct++}
            while($rev_hit =~ /cg/ig){$cg++}
            while($rev_hit =~ /cc/ig){$cc++}
            while($rev_hit =~ /[^atgc]/ig){$e++}

            print CTCF "$gene_name\.$1\.0\t" . "$aa\t" . "$at\t" . "$ag\t" . "$ac\t" . "$ta\t" . "$tt\t" . "$tg\t" . "$tc\t" . "$ga\t" . "$gt\t" . "$gg\t" . "$gc\t" . "$ca\t" . "$ct\t" . "$cg\t" . "$cc\t" . "$e\n";

            print "\n####################################\n";
            print "# Inserting into ctcf_dint_0 table #\n";
            print "####################################\n\n";

            $sth = $dbh->prepare("INSERT INTO ctcf_dint_0 VALUES('$gene_name','$1','$aa','$at','$ag','$ac','$ta','$tt','$tg','$tc','$ga','$gt','$gg','$gc','$ca','$ct','$cg','$cc','$e')");
            $sth->execute();

            #######
            # 100 #
            #######

            my $flank1 = substr($total_seq, $start-100, 100);
            $flank1 =~ tr/ATGCatgcn/TACGTACGN/;
            my $rev_left = reverse($flank1);

            my $flank2 = substr($total_seq, $start+$size+100, 100);
            $flank2 =~ tr/ATGCatgcn/TACGTACGN/;
            my $rev_right = reverse($flank2);

            my $rev_100 = $rev_left . $rev_hit . $rev_right;

            #############################################
            # Count di-nucleotides in flanking sequence #
            #############################################

            $aa=0; $at=0; $ag=0; $ac=0;
            $ta=0; $tt=0; $tg=0; $tc=0;
            $ga=0; $gt=0; $gg=0; $gc=0;
            $ca=0; $ct=0; $cg=0; $cc=0;
            $e=0;

            while($rev_100 =~ /aa/ig){$aa++}
            while($rev_100 =~ /at/ig){$at++}
            while($rev_100 =~ /ag/ig){$ag++}
            while($rev_100 =~ /ac/ig){$ac++}
            while($rev_100 =~ /ta/ig){$ta++}
            while($rev_100 =~ /tt/ig){$tt++}
            while($rev_100 =~ /tg/ig){$tg++}
            while($rev_100 =~ /tc/ig){$tc++}
            while($rev_100 =~ /ga/ig){$ga++}
            while($rev_100 =~ /gt/ig){$gt++}
            while($rev_100 =~ /gg/ig){$gg++}
            while($rev_100 =~ /gc/ig){$gc++}
            while($rev_100 =~ /ca/ig){$ca++}
            while($rev_100 =~ /ct/ig){$ct++}
            while($rev_100 =~ /cg/ig){$cg++}
            while($rev_100 =~ /cc/ig){$cc++}
            while($rev_100 =~ /[^atgc]/ig){$e++}

            print CTCF_100 "$gene_name\.$1\.100\t" . "$aa\t" . "$at\t" . "$ag\t" . "$ac\t" . "$ta\t" . "$tt\t" . "$tg\t" . "$tc\t" . "$ga\t" . "$gt\t" . "$gg\t" . "$gc\t" . "$ca\t" . "$ct\t" . "$cg\t" . "$cc\t" . "$e\n";

            print "######################################\n";
            print "# Inserting into ctcf_dint_100 table #\n";
            print "######################################\n\n";

            $sth = $dbh->prepare("INSERT INTO ctcf_dint_100 VALUES('$gene_name','$1','$aa','$at','$ag','$ac','$ta','$tt','$tg','$tc','$ga','$gt','$gg','$gc','$ca','$ct','$cg','$cc','$e')");
            $sth->execute();

            #######
            # 500 #
            #######

            $flank1 = substr($total_seq, $start-500, 500);
            $flank1 =~ tr/ATGCatgcn/TACGTACGN/;
            $rev_left = reverse($flank1);

            $flank2 = substr($total_seq, $start+$size+500, 500);
            $flank2 =~ tr/ATGCatgcn/TACGTACGN/;
            $rev_right = reverse($flank2);
            my $rev_500 = $rev_left . $rev_hit . $rev_right;

            #############################################
            # Count di-nucleotides in flanking sequence #
            #############################################

            $aa=0; $at=0; $ag=0; $ac=0;
            $ta=0; $tt=0; $tg=0; $tc=0;
            $ga=0; $gt=0; $gg=0; $gc=0;
            $ca=0; $ct=0; $cg=0; $cc=0;
            $e=0;

            while($rev_500 =~ /aa/ig){$aa++}
            while($rev_500 =~ /at/ig){$at++}
            while($rev_500 =~ /ag/ig){$ag++}
            while($rev_500 =~ /ac/ig){$ac++}
            while($rev_500 =~ /ta/ig){$ta++}
            while($rev_500 =~ /tt/ig){$tt++}
            while($rev_500 =~ /tg/ig){$tg++}
            while($rev_500 =~ /tc/ig){$tc++}
            while($rev_500 =~ /ga/ig){$ga++}
            while($rev_500 =~ /gt/ig){$gt++}
            while($rev_500 =~ /gg/ig){$gg++}
            while($rev_500 =~ /gc/ig){$gc++}
            while($rev_500 =~ /ca/ig){$ca++}
            while($rev_500 =~ /ct/ig){$ct++}
            while($rev_500 =~ /cg/ig){$cg++}
            while($rev_500 =~ /cc/ig){$cc++}
            while($rev_500 =~ /[^atgc]/ig){$e++}

            print CTCF_500 "$gene_name\.$1\.500\t" . "$aa\t" . "$at\t" . "$ag\t" . "$ac\t" . "$ta\t" . "$tt\t" . "$tg\t" . "$tc\t" . "$ga\t" . "$gt\t" . "$gg\t" . "$gc\t" . "$ca\t" . "$ct\t" . "$cg\t" . "$cc\t" . "$e\n";

            print "######################################\n";
            print "# Inserting into ctcf_dint_500 table #\n";
            print "######################################\n\n";

            $sth = $dbh->prepare("INSERT INTO ctcf_dint_500 VALUES('$gene_name','$1','$aa','$at','$ag','$ac','$ta','$tt','$tg','$tc','$ga','$gt','$gg','$gc','$ca','$ct','$cg','$cc','$e')");
            $sth->execute();

        }
    }
}
close(CTCF);
close(CTCF_100);
close(CTCF_500);
close(CTCF_SITES);

##############################
# Print master sequence file #
##############################

# need to go down one directory to print sequence to seq.fa master file

chdir "../" or die "cannot chdir to ../: $!";

print SEQ ">$gene_name | $gene_id | $db_name | Chr$chr_name$strand2\:$chr_start\-$chr_end\ | repeat co-ordinates: Chr$chrom$strand2\:$rep_start\-$rep_end\ | flanking sequence up/down-stream of repeat: $flanker |\n";
print SEQ "$total_seq\n\n";
print REP_SEQ ">$gene_name | $gene_id | $db_name | repeat co-ordinates: Chr$chrom$strand2\:$rep_start\-$rep_end\ | flanking sequence up/down-stream of repeat: $flanker | \n";
print REP_SEQ "$rep_seq\n\n";
}
}
}
print "the value of \$n is $n \n";
$n++;
}

############################################################
# Close filehandlers for repeat tract + flanking sequences #
############################################################

close(CTCF_SCORE);
close(CTCF_SCORE2);
close(GC);
close(SEQ);
close(REP_SEQ);

exit;
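
The repeat purity step in the listing above records the longest uninterrupted run of the repeat unit. A minimal, self-contained sketch of that calculation follows. The subroutine name longest_pure_run and the example sequence are illustrative assumptions and are not part of flanker.pl. Unlike the listing, the sketch divides by the length of the unit rather than a fixed 3, and it quotes the unit so that it is matched literally.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of flanker.pl): return the number of repeat
# units in the longest uninterrupted run of $unit found in $seq.
sub longest_pure_run {
    my ($seq, $unit) = @_;
    my $max = 0;
    # \Q...\E quotes the unit so that it is matched as a literal string.
    while ($seq =~ /((?:\Q$unit\E)+)/g) {
        my $units = length($1) / length($unit);
        $max = $units if $units > $max;
    }
    return $max;
}

# Illustrative sequence: three CAG units, a CAA interruption, two CAG units.
print longest_pure_run("CAGCAGCAGCAACAGCAG", "CAG"), "\n";   # prints 3
```

A tract with no interruption returns its full unit count, and a sequence that does not contain the unit returns 0.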

APPENDIX D

Interacting with and creating the 'gems_cis' database

Re-populating the Database: MySQL Syntax for deleting the tables

WARNING: This will delete all data collected so far by the database!

If you want to re-run flanker.pl and re-populate the database you must delete all the tables currently in the database with these commands:

DROP TABLE repeats;
DROP TABLE build_info;
DROP TABLE cpg;
DROP TABLE ctcf;
DROP TABLE ctcf_dint_0;
DROP TABLE ctcf_dint_100;
DROP TABLE ctcf_dint_500;
DROP TABLE flanking;
DROP TABLE gc;
DROP TABLE gc_plots;
DROP TABLE gems;
DROP TABLE gems_feat;
DROP TABLE exons;
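
The same clean-up can be scripted instead of typed by hand. The sketch below only builds and prints the statements; IF EXISTS is added so that a table that is already missing does not stop the run. The subroutine name drop_statements, and the DSN and credentials in the commented-out connection, are placeholders for illustration and are not values used in this thesis.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tables of the gems_cis database, as listed in this appendix.
my @tables = qw(
    repeats build_info cpg ctcf ctcf_dint_0 ctcf_dint_100 ctcf_dint_500
    flanking gc gc_plots gems gems_feat exons
);

# Hypothetical helper: build one DROP statement per table name.
sub drop_statements {
    return map { "DROP TABLE IF EXISTS $_" } @_;
}

# Print the statements so they can be reviewed before anything is deleted.
print "$_;\n" for drop_statements(@tables);

# To execute rather than print, each statement could be passed to DBI's do()
# method. The DSN, user and password below are placeholders:
#   use DBI;
#   my $dbh = DBI->connect("DBI:mysql:database=gems_cis;host=localhost",
#                          $user, $password, { RaiseError => 1 });
#   $dbh->do($_) for drop_statements(@tables);
```

Printing first keeps the destructive step explicit, which matches the warning above.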

Re-populating the Database: MySQL Syntax for creating the tables

You must now re-create all the tables you just deleted so that flanker.pl can insert data into the expected fields. Execute the following commands to do this:

CREATE TABLE build_info (
    ens_db VARCHAR(50) NOT NULL,
    db_name VARCHAR(50) NOT NULL,
    date_run DATETIME
);

CREATE TABLE gems (
    name VARCHAR(15) NOT NULL PRIMARY KEY,
    ens_ID CHAR(15) NOT NULL
);

CREATE TABLE gc (
    name VARCHAR(15) NOT NULL PRIMARY KEY,
    50_bp DECIMAL(7,6) NOT NULL,
    100_bp DECIMAL(7,6) NOT NULL,
    150_bp DECIMAL(7,6) NOT NULL,
    200_bp DECIMAL(7,6) NOT NULL,
    250_bp DECIMAL(7,6) NOT NULL,
    300_bp DECIMAL(7,6) NOT NULL,
    350_bp DECIMAL(7,6) NOT NULL,
    400_bp DECIMAL(7,6) NOT NULL,
    450_bp DECIMAL(7,6) NOT NULL,
    500_bp DECIMAL(7,6) NOT NULL,
    1000_bp DECIMAL(7,6) NOT NULL
);

CREATE TABLE gems_feat (
    name VARCHAR(15) NOT NULL PRIMARY KEY,
    chr CHAR(2) NOT NULL,
    strand CHAR(1) NOT NULL,
    start INT UNSIGNED,
    end INT UNSIGNED,
    unit CHAR(3),
    seq VARCHAR(255) NOT NULL,
    length VARCHAR(4) NOT NULL,
    purity VARCHAR(3) NOT NULL,
    expandability DECIMAL(3,2) NULL
);

CREATE TABLE flanking (
    name VARCHAR(15) NOT NULL PRIMARY KEY,
    chr CHAR(2) NOT NULL,
    strand CHAR(1) NOT NULL,
    start INT UNSIGNED,
    end INT UNSIGNED,
    for_seq TEXT,
    rev_seq TEXT,
    for_rep_seq TEXT,
    rev_rep_seq TEXT
);

CREATE TABLE cpg (
    name VARCHAR(15) NOT NULL,
    start INT UNSIGNED,
    end INT UNSIGNED,
    score VARCHAR(8)
);

CREATE TABLE exons (
    name VARCHAR(15) NOT NULL,
    start INT,
    end INT
);

CREATE TABLE repeats (
    name VARCHAR(15) NOT NULL,
    rep_name VARCHAR(15),
    rep_class VARCHAR(15),
    start INT UNSIGNED,
    end INT UNSIGNED,
    score INT UNSIGNED,
    strand CHAR(1),
    distance MEDIUMINT UNSIGNED
);

CREATE TABLE gc_plots (
    name VARCHAR(15) NOT NULL,
    start INT UNSIGNED,
    obsex DECIMAL(7,6) NOT NULL,
    gc DECIMAL(7,6) NOT NULL
);

CREATE TABLE ctcf (
    name VARCHAR(15) NOT NULL,
    score DECIMAL(5,2) NOT NULL,
    start INT UNSIGNED,
    end INT UNSIGNED,
    strand CHAR(1) NOT NULL,
    distance INT UNSIGNED
);

CREATE TABLE ctcf_dint_0 (
    name VARCHAR(15) NOT NULL,
    score DECIMAL(5,2) NOT NULL,
    aa VARCHAR(4), at VARCHAR(4), ag VARCHAR(4), ac VARCHAR(4),
    ta VARCHAR(4), tt VARCHAR(4), tg VARCHAR(4), tc VARCHAR(4),
    ga VARCHAR(4), gt VARCHAR(4), gg VARCHAR(4), gc VARCHAR(4),
    ca VARCHAR(4), ct VARCHAR(4), cg VARCHAR(4), cc VARCHAR(4),
    errors VARCHAR(4)
);

CREATE TABLE ctcf_dint_100 (
    name VARCHAR(15) NOT NULL,
    score DECIMAL(5,2) NOT NULL,
    aa VARCHAR(4), at VARCHAR(4), ag VARCHAR(4), ac VARCHAR(4),
    ta VARCHAR(4), tt VARCHAR(4), tg VARCHAR(4), tc VARCHAR(4),
    ga VARCHAR(4), gt VARCHAR(4), gg VARCHAR(4), gc VARCHAR(4),
    ca VARCHAR(4), ct VARCHAR(4), cg VARCHAR(4), cc VARCHAR(4),
    errors VARCHAR(4)
);

CREATE TABLE ctcf_dint_500 (
    name VARCHAR(15) NOT NULL,
    score DECIMAL(5,2) NOT NULL,
    aa VARCHAR(4), at VARCHAR(4), ag VARCHAR(4), ac VARCHAR(4),
    ta VARCHAR(4), tt VARCHAR(4), tg VARCHAR(4), tc VARCHAR(4),
    ga VARCHAR(4), gt VARCHAR(4), gg VARCHAR(4), gc VARCHAR(4),
    ca VARCHAR(4), ct VARCHAR(4), cg VARCHAR(4), cc VARCHAR(4),
    errors VARCHAR(4)
);

Inserting the expandability data

Since the expandability data cannot be generated automatically, it must be set manually AFTER flanker.pl has been run, using these commands:

UPDATE gems_feat SET expandability = '4.81' WHERE name = 'DMPK';

UPDATE gems_feat SET expandability = '1.30' WHERE name = 'SCA7';

UPDATE gems_feat SET expandability = '0.97' WHERE name = 'SCA2';

UPDATE gems_feat SET expandability = '0.29' WHERE name = 'HD';

UPDATE gems_feat SET expandability = '0.19' WHERE name = 'DRPLA';

UPDATE gems_feat SET expandability = '0.14' WHERE name = 'SCA1';

UPDATE gems_feat SET expandability = '0.08' WHERE name = 'SBMA';

UPDATE gems_feat SET expandability = '0.07' WHERE name = 'SCA3_MJD';

APPENDIX E

nucleo.pl

#!/usr/local/bin/perl -w

use strict;
use Bio::EnsEMBL::DBSQL::DBAdaptor;
use DBI;
use Data::Dumper;

####################
# Global Variables #
####################

# ensembl API
my $host = 'ensembldb.ensembl.org';
my $user = 'anonymous';
my $db_name = 'homo_sapiens_core_16_33';
my $prog_version = '0.4';

###############################
# Connect to ensembl with API #
###############################

my $db = new Bio::EnsEMBL::DBSQL::DBAdaptor(-host => $host, -user => $user, -dbname => $db_name);

my $slice_adaptor = $db->get_SliceAdaptor;

while (<>) {
    chomp;

    #########################################
    # ensembl API sequence extraction phase #
    #########################################

    # in this regex
    # $1 collects the common gene name
    # $2 collects the ensembl gene ID
    # $3 collects the last digit from the ensembl gene ID, it's an effect of the internal brackets
    # $4 collects the chromosome number
    # $5 collects the repeat start position in chromosome $4
    # $6 collects the repeat end position in chromosome $4
    # $7 collects the final field of the input line

    if (/^(\w+)\s+(ENSG(\d){11})\s+(\S+)\s+(\d+)\s+(\d+)\s+(\w+)/) {

        print "1 is $1, 2 is $2, 3 is $3, 4 is $4, 5 is $5, 6 is $6, 7 is $7\n";

        my $genename = $1;
        my $chrom = $4;
        my $rep_start = $5;
        my $rep_end = $6;
        my $repeat_unit;

        # 100 bp on each side of the repeat, excluding the repeat itself
        my $slice_left_100 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-100,$rep_start-1);
        my $slice_right_100 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_end+1,$rep_end+100);
        my $left_100 = $slice_left_100->seq;
        my $right_100 = $slice_right_100->seq;
        my $nucleo_100_seq = $left_100 . $right_100;

        print "$genename with 100 bp flanking sequence\n\n$nucleo_100_seq\n\n";

        # 500 bp on each side of the repeat
        my $slice_left_500 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-500,$rep_start-1);
        my $slice_right_500 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_end+1,$rep_end+500);
        my $left_500 = $slice_left_500->seq;
        my $right_500 = $slice_right_500->seq;
        my $nucleo_500_seq = $left_500 . $right_500;

        print "$genename with 500 bp flanking sequence\n\n$nucleo_500_seq\n\n";

        # 1000 bp on each side of the repeat
        my $slice_left_1000 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-1000,$rep_start-1);
        my $slice_right_1000 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_end+1,$rep_end+1000);
        my $left_1000 = $slice_left_1000->seq;
        my $right_1000 = $slice_right_1000->seq;
        my $nucleo_1000_seq = $left_1000 . $right_1000;

        print "$genename with 1000 bp flanking sequence\n\n$nucleo_1000_seq\n\n";
    }
}
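The input parsing and window arithmetic used by nucleo.pl can be illustrated without the Ensembl API. The following is a minimal Python sketch, not part of the original pipeline: the function names and the example input line (gene name, ID and co-ordinates) are invented for illustration, and the regular expression simply mirrors the one in the Perl script.

```python
import re

# Mirrors the nucleo.pl pattern: name, Ensembl gene ID, chromosome,
# repeat start, repeat end, and one trailing word field.
LINE_RE = re.compile(r"^(\w+)\s+(ENSG\d{11})\s+(\S+)\s+(\d+)\s+(\d+)\s+(\w+)")


def parse_line(line):
    """Return the fields of one input line as a dict, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    name, ens_id, chrom, start, end, _last_field = match.groups()
    return {
        "name": name,
        "ens_id": ens_id,
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
    }


def flank_windows(rep_start, rep_end, width):
    """Return 1-based inclusive (start, end) pairs for the left and right flanks.

    The repeat itself is excluded, as in nucleo.pl: the left flank ends at
    rep_start - 1 and the right flank begins at rep_end + 1.
    """
    left = (rep_start - width, rep_start - 1)
    right = (rep_end + 1, rep_end + width)
    return left, right


if __name__ == "__main__":
    # Hypothetical input line; the co-ordinates are not real annotations.
    record = parse_line("GENE1 ENSG00000000001 4 1000 1059 CAG")
    for width in (100, 500, 1000):
        print(record["name"], width, flank_windows(record["start"], record["end"], width))
```

Each flank is exactly `width` bases long, so the concatenated sequence passed on for nucleosome formation analysis has length 2 × width.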

APPENDIX F

Sample queries

SELECT g.name, g.expandability, gc.50_bp, gc.100_bp, gc.150_bp, gc.200_bp, gc.250_bp, gc.300_bp, gc.350_bp, gc.400_bp, gc.450_bp, gc.500_bp, gc.1000_bp FROM gems_feat g, gc WHERE g.expandability > 0 AND g.name = gc.name;

SELECT g.name, g.expandability, r.name, r.distance FROM gems_feat g, repeats r WHERE g.name = r.name AND r.rep_name LIKE 'AluY%' AND expandability > 0 AND r.distance < 20000 AND g.name NOT LIKE 'DMPK' ORDER BY r.distance;

SELECT g.name, g.expandability, r.rep_name, r.distance FROM gems_feat g, repeats r WHERE g.name = r.name AND r.distance < 10000 AND r.distance > 0 AND r.rep_name NOT LIKE "dust" AND r.rep_name NOT LIKE "(CAG)n" AND g.expandability > 0.9 ORDER BY r.rep_name;

SELECT g.name, g.expandability, r.rep_name, r.distance FROM gems_feat g, repeats r WHERE g.name = r.name AND r.distance < 10000 AND r.distance > 0 AND r.rep_name LIKE "Alu%";

SELECT g.name, g.expandability, r.rep_name, r.distance FROM gems_feat g, repeats r WHERE g.name = r.name AND expandability > 0 AND r.distance < 1000 AND r.distance > 10 ORDER BY r.rep_name DESC;

SELECT g.name, g.expandability, c.score, c.distance FROM gems_feat g, ctcf c WHERE g.name = c.name AND c.distance < 1000 AND g.expandability > 0 ORDER BY c.score DESC;

SELECT c.name, c.score, c.distance, g.expandability, c.start, c.end FROM ctcf c, gems_feat g WHERE c.distance < 1000 AND c.score > 2 AND g.name = c.name ORDER BY c.name;

SELECT c.name, c.score, c.distance, g.expandability, c.start, c.end FROM ctcf c, gems_feat g WHERE g.name = c.name AND c.name = "DRPLA" ORDER BY c.name;

APPENDIX G

Example R scripts and selected results

#######################
# Flanking GC Content #
#######################

# Bring data into R

gc.brock = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/gc_stats.txt", sep = "")
gc.all = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/gc_all.txt", sep = "")

# Calculate ranks for flanking %GC

exp.rank = rank(gc.brock[,2])
gc.100.rank = rank(gc.brock[,4])
gc.500.rank = rank(gc.brock[,12])
gc.1000.rank = rank(gc.brock[,13])

gc.ranks = cbind(exp.rank,gc.100.rank,gc.500.rank,gc.1000.rank)

# Calculate Spearman's Ranked Correlation

cor.test(gc.brock[,2], gc.brock[,3], method="spearman")

flanking.gc = c(0, 50, 100, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000)
rho = c(NA, 0.8214286, 0.8928571, 0.9285714, 0.8928571, 0.4285714, 0.3214286, 0.3214286, 0.3214286, 0.3214286, 0.3214286, 0.3214286, 0.2142857)
p.value = c(NA, 0.03945, 0.01898, 0.01181, 0.01898, 0.349, 0.4948, 0.4948, 0.4948, 0.4948, 0.4948, 0.4948, 0.6615)

rho = cbind(flanking.gc,rho,p.value)

# Brock et al. cor co-efficients

genes = c("DM", "SCA7", "SCA2", "HD", "DRPLA", "SCA1", "SBMA", "SCA3", "ERDA1")
exp = c(4.81, 1.3, 0.97, 0.29, 0.19, 0.14, 0.08, 0.07, -0.01)
gc.100 = c(69.5, 83.5, 77, 74.5, 63.5, 66, 65, 36.5, 38.5)
gc.500 = c(66, 71.5, 79, 71, 66, 67.2, 59, 38.5, 43)

brock = cbind(genes,exp,gc.100,gc.500)

cor.test(brock[,2],brock[,3], method="spearman")

# Plot of rho vs. increasing flanking sequence

plot(x=rho[,1],y=rho[,2],xlab="Flanking sequence (bp)", ylab="Spearman's rank correlation (rho)", main="Effect of increasing flanking sequence on Rho value", sub="Figure 1d: Spearman rank correlation (rho) value of median expandability to %GC over 50 bp, 100 bp, 500 bp, and 1000 bp stretches of flanking sequence", xlim=c(0,1000), ylim=c(0,1)); lines(x=rho[,1],y=rho[,2])

################## # Figure 5 Plots # ##################

pdf("Figure_5_GC.pdf")
par(mfrow=c(2,2))

plot(x=gc.ranks[,2],y=gc.ranks[,1],xlab="%GC of 100 bases flanking repeat (rank)", ylab="Median Expandability (rank)", main="100 bp"); abline(0,1)
plot(x=gc.ranks[,3],y=gc.ranks[,1],xlab="%GC of 500 bases flanking repeat (rank)", ylab="Median Expandability (rank)", main="500 bp"); abline(0,1)
plot(x=gc.ranks[,4],y=gc.ranks[,1],xlab="%GC of 1000 bases flanking repeat (rank)", ylab="Median Expandability (rank)", main="1000 bp"); abline(0,1)
plot(x=rho[,1],y=rho[,2],xlab="Flanking sequence (bp)", ylab="Spearman's rank correlation (rho)", main="Effect of increasing flanking\nsequence on Rho value", xlim=c(0,5000), ylim=c(0,1)); lines(x=rho[,1],y=rho[,2])
dev.off()

############ # Figure 6 # ############

pdf("Figure_6_hist.pdf")
par(mfrow=c(1,1))
colour = c(0,0,0,0,0,0,0,0,0,2)
hist(gc.all[,4],xlim=c(0,1),main="Histogram of %GC\n100 bp flanking CAG/CTG repeat of Candidate CAG/CTG repeats\n", xlab="%GC of 100 bp flanking repeat", ylab="number of genes", col=colour)
dev.off()

mean(gc.all[,4])+2 *(sd(gc.all[,4]))

sum(gc.all[,4] > 0.762376)

# z score for second highest expandable locus.
(0.767326-(mean(gc.all[,4])))/(sd(gc.all[,4]))

mean(gc.all[,12])+2*(sd(gc.all[,12]))

######################################### # Expandability and Length Calculations # #########################################

len.pur = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/len_pur.txt", sep = "")
len.rank = rank(len.pur[,3])
pur.rank = rank(len.pur[,4])

len.pur.ranks = cbind(exp.rank,len.rank,pur.rank)

cor.test(len.pur.ranks[,1],len.pur.ranks[,2], method="spearman")

plot(x=len.pur.ranks[,1],y=len.pur.ranks[,2],xlab="CAG-repeat length (rank)", ylab="Median Expandability (rank)", main="Expandability vs. Repeat Length", sub="Figure 2: Ranked Expandability vs. Ranked CAG/CTG repeat length"); abline(0,1)

############ # Figure 7 # ############

pdf("Figure_7_length.pdf")
par(mfrow=c(1,1))
plot(x=len.pur.ranks[,1],y=len.pur.ranks[,2],xlab="Repeat Length (rank)", ylab="Median Expandability (rank)", main="Expandability vs. Repeat Length\nCAG/CTG repeats known to be unstable\nRepeat length derived from TRF co-ordinates")
dev.off()

# Output

Spearman's rank correlation rho

data: len.pur.ranks[, 1] and len.pur.ranks[, 2]
S = 74, p-value = 0.4948
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.3214286

# plot of exp. vs purity

cor.test(len.pur.ranks[,1],len.pur.ranks[,3], method="spearman")

############ # Figure 8 # ############

pdf("Figure_8_purity.pdf")
par(mfrow=c(1,1))
plot(x=len.pur.ranks[,1],y=len.pur.ranks[,3],xlab="CAG-repeat purity (rank)", ylab="Median Expandability (rank)", main="Expandability vs. Repeat Purity\nCAG/CTG repeats known to be unstable\nPurity defined as longest contiguous repeat unit")
dev.off()

# Output

Spearman's rank correlation rho

data: len.pur.ranks[, 1] and len.pur.ranks[, 3] S = 62, p-value = 0.843 alternative hypothesis: true rho is not equal to 0 sample estimates: rho -0.1071429

###############
# Alu repeats #
###############

alu.total = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/alu.txt", sep = "")

cor.test(alu.total[,2],alu.total[,4], method="spearman")

genes = c("SCA3_MJD","SCA7","SCA2","HD","DRPLA","SCA1","SBMA") alu.count = c(14,6,21,12,6,6,1)

alu = cbind(exp.rank, rank(alu.count))

plot(x=alu[,2],y=alu[,1],xlab="Number of AluY Repeats (rank)", ylab="Median Expandability (rank)", main="Expandability vs. Number of Flanking AluY Repeats", sub="Figure 3: Ranked expandability vs. ranked total number of AluY repeats in 50,000 bp of flanking sequence"); abline(0,1)

cor.test(alu[,1],alu[,2], method="spearman")
cor.test(gc.brock[,2],alu.count, method="spearman")

# Output

Spearman's rank correlation rho

data: gc.brock[, 2] and alu.count S = 46, p-value = 0.7207 alternative hypothesis: true rho is not equal to 0 sample estimates: rho 0.1853123

# 10, 000

alu.total.10000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/alu_10000.txt", sep = "")

cor.test(alu.total.10000[,2],alu.total.10000[,4], method="spearman")

# Output

Spearman's rank correlation rho data: alu.total.10000[, 2] and alu.total.10000[, 4] S = 156, p-value = 0.04489 alternative hypothesis: true rho is not equal to 0 sample estimates: rho 0.5718863

# 20,000

alu.total.20000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/alu_20000.txt", sep = "")
cor.test(alu.total.20000[,2],alu.total.20000[,4], method="spearman")

# Output

Spearman's rank correlation rho data: alu.total.20000 [, 2] and alu.total.20000 [, 4] S = 5785, p-value = 0.8534 alternative hypothesis: true rho is not equal to 0 sample estimates: rho 0.03312307

# 30,000

alu.total.30000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/alu_30000.txt", sep = "")
cor.test(alu.total.30000[,2],alu.total.30000[,4], method="spearman")

# Output

Spearman's rank correlation rho

data: alu.total.30000[, 2] and alu.total.30000 [, 4] S = 17424, p-value = 0.3314 alternative hypothesis: true rho is not equal to 0 sample estimates: rho -0.1478174

# 40,000

alu.total.40000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/alu_40000.txt", sep = "")
cor.test(alu.total.40000[,2],alu.total.40000[,4], method="spearman")

# Output

Spearman's rank correlation rho

data: alu.total.40000[, 2] and alu.total.40000[, 4] S = 28314, p-value = 0.5679 alternative hypothesis: true rho is not equal to 0 sample estimates: rho -0.07925101

# 50,000

alu.total.50000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/alu_50000.txt", sep = "")
cor.test(alu.total.50000[,2],alu.total.50000[,4], method="spearman")

# Output

Spearman's rank correlation rho

data: alu.total.50000[, 2] and alu.total.50000[, 4]
S = 53582, p-value = 0.3424
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.1185093

########################### # Numbers of AluY Repeats # ###########################

names = c("genes","expandability","5,000","10,000","20,000","30,000","40,000","50,000")
alu.number = matrix(data = NA, nrow=7, ncol = 8, byrow = FALSE)

alu.number[,1] = c("SCA3_MJD","SCA7","SCA2","HD","DRPLA","SCA1","SBMA")
alu.number[,2] = c(0.07,1.30,0.97,0.29,0.19,0.14,0.08)
alu.number[,3] = c(1,0,1,1,0,0,0)
alu.number[,4] = c(3,1,5,3,1,0,0)
alu.number[,5] = c(5,3,11,9,2,3,0)
alu.number[,6] = c(9,3,16,10,3,4,0)
alu.number[,7] = c(9,4,19,11,5,6,0)
alu.number[,8] = c(14,6,21,12,6,6,1)

cor.test(alu.number[,2],alu.number[,8], method="spearman")

names = c("SCA3_MJD","SCA7","SCA2","HD","DRPLA","SCA1","SBMA")
exp = c(0.07,1.30,0.97,0.29,0.19,0.14,0.08)
a.5000 = c(1,0,1,1,0,0,0)
a.10000 = c(3,1,5,3,1,0,0)
a.20000 = c(5,3,11,9,2,3,0)
a.30000 = c(9,3,16,10,3,4,0)
a.40000 = c(9,4,19,11,5,6,0)
a.50000 = c(14,6,21,12,6,6,1)

alu.number = cbind(names,exp,a.5000,a.10000,a.20000,a.30000,a.40000,a.50000)

########################### # Generate plot variables # ###########################

alu.total.10000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/alu_10000.txt", sep = "")
alu.total.20000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/alu_20000.txt", sep = "")
alu.total.30000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/alu_30000.txt", sep = "")
alu.total.40000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/alu_40000.txt", sep = "")
alu.total.50000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/alu_50000.txt", sep = "")
plot(x=a1[,2],y=a1[,1],xlab="(rank)", ylab="Median Expandability (rank)", main="1000 bp")

e1 = rank(alu.total.10000[,2])
r1 = rank(alu.total.10000[,4])
a1 = cbind(e1,r1)
plot(x=a1[,2],y=a1[,1],xlab="Distance of AluY-type repeat from CAG/CTG repeat (rank)", ylab="Median Expandability (rank)", main="Median Expandability vs.\nDistance of AluY Repeat\n(10,000 bp)")
abline(0,1)

e2 = rank(alu.total.20000[,2])
r2 = rank(alu.total.20000[,4])
a2 = cbind(e2,r2)
plot(x=a2[,2],y=a2[,1])

e3 = rank(alu.total.30000[,2])
r3 = rank(alu.total.30000[,4])
a3 = cbind(e3,r3)
plot(x=a3[,2],y=a3[,1],xlab="Distance of AluY-type repeat from CAG/CTG repeat (rank)", ylab="Median Expandability (rank)", main="Median Expandability vs.\nDistance of AluY Repeat\n(30,000 bp)")
abline(0,1)

e4 = rank(alu.total.40000[,2])
r4 = rank(alu.total.40000[,4])
a4 = cbind(e4,r4)
plot(x=a4[,2],y=a4[,1])

e5 = rank(alu.total.50000[,2])
r5 = rank(alu.total.50000[,4])
a5 = cbind(e5,r5)
plot(x=a5[,2],y=a5[,1],xlab="Distance of AluY-type repeat from CAG/CTG repeat (rank)", ylab="Median Expandability (rank)", main="Median Expandability vs.\nDistance of AluY Repeat\n(50,000 bp)")
abline(0,1)

# rho values

alu.bp = c(10000,20000,30000,40000,50000)
alu.rho = c(0.5718863,0.03312307,-0.1478174,-0.07925101,-0.1185093)

############ # Figure 9 # ############

pdf("Figure_9_AluY.pdf")
par(mfrow=c(2,2))
plot(x=a1[,2],y=a1[,1],xlab="Distance of AluY-type repeat\nfrom CAG/CTG repeat (rank)", ylab="Median Expandability (rank)", main="Median Expandability vs.\nDistance of AluY Repeats\n(10,000 bp)", xlim=c(0,14), ylim=c(0,14))
abline(0,1)
plot(x=a3[,2],y=a3[,1],xlab="Distance of AluY-type repeat\nfrom CAG/CTG repeat (rank)", ylab="Median Expandability (rank)", main="Median Expandability vs.\nDistance of AluY Repeats\n(30,000 bp)")
plot(x=a5[,2],y=a5[,1],xlab="Distance of AluY-type repeat\nfrom CAG/CTG repeat (rank)", ylab="Median Expandability (rank)", main="Median Expandability vs.\nDistance of AluY Repeats\n(50,000 bp)")
plot(x=alu.bp,y=alu.rho,xlab="Flanking sequence (bp)",ylab="Spearman's rank correlation (rho)",main="Effect of increasing\nflanking sequence on rho value")
lines(x=alu.bp,y=alu.rho)
dev.off()

######## # CTCF # ########

# 50,000

cor.test(ctcf.50000.rank[,2],ctcf.50000.rank[,3], method="spearman")

# 1,000

ctcf.1000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/ctcf_1_000.txt", sep = "")
ctcf.1000.rank = matrix(data = NA, nrow = 5, ncol = 4, byrow = FALSE)
ctcf.1000.rank[,1] = ctcf.1000[,1]
ctcf.1000.rank[,2] = rank(ctcf.1000[,2])
ctcf.1000.rank[,3] = rank(ctcf.1000[,3])
ctcf.1000.rank[,4] = rank(ctcf.1000[,4])

# Cor. between exp and score

cor.test(ctcf.1000.rank[,2],ctcf.1000.rank[,3], method="spearman")

Spearman's rank correlation rho

data: ctcf.1000.rank[, 2] and ctcf.1000.rank[, 3]

S = 4, p-value = 0.1333
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.7905694

# Cor. between exp and distance

cor.test(ctcf.1000.rank[,2],ctcf.1000.rank[,4], method="spearman")

Spearman's rank correlation rho data: ctcf.1000.rank[, 2] and ctcf.1000.rank[, 4] S = 27, p-value = 0.5167 alternative hypothesis: true rho is not equal to 0 sample estimates: rho -0.3689324

# Cor. between score and distance

cor.test(ctcf.1000.rank[,3],ctcf.1000.rank[,4], method="spearman")

Spearman's rank correlation rho

data: ctcf.1000.rank[, 3] and ctcf.1000.rank[, 4] S = 34, p-value = 0.2333 alternative hypothesis: true rho is not equal to 0 sample estimates: rho -0.7

# 5,000

ctcf.5000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/ctcf_5_000.txt", sep = "")
ctcf.5000.rank = matrix(data = NA, nrow = 9, ncol = 4, byrow = FALSE)
ctcf.5000.rank[,1] = ctcf.5000[,1]
ctcf.5000.rank[,2] = rank(ctcf.5000[,2])
ctcf.5000.rank[,3] = rank(ctcf.5000[,3])
ctcf.5000.rank[,4] = rank(ctcf.5000[,4])

# Cor. between exp and score

cor.test(ctcf.5000.rank[,2],ctcf.5000.rank[,3], method="spearman")

Spearman's rank correlation rho

data: ctcf.5000.rank[, 2] and ctcf.5000.rank[, 3]
S = 110, p-value = 0.8438
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.08257228

# Cor. between exp and distance

cor.test(ctcf.5000.rank[,2],ctcf.5000.rank[,4], method="spearman")

Spearman's rank correlation rho

data: ctcf.5000.rank[, 2] and ctcf.5000.rank[, 4] S = 131, p-value = 0.8096 alternative hypothesis: true rho is not equal to 0 sample estimates: rho -0.09174698

# Cor. between score and distance

cor.test(ctcf.5000.rank[,3],ctcf.5000.rank[,4], method="spearman")

Spearman's rank correlation rho

data: ctcf.5000.rank[, 3] and ctcf.5000.rank[, 4]
S = 210, p-value = 0.02742
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.75

# 10,000

ctcf.10000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/ctcf_10_000.txt", sep = "")
ctcf.10000.rank = matrix(data = NA, nrow = 11, ncol = 4, byrow = FALSE)
ctcf.10000.rank[,1] = ctcf.10000[,1]
ctcf.10000.rank[,2] = rank(ctcf.10000[,2])
ctcf.10000.rank[,3] = rank(ctcf.10000[,3])
ctcf.10000.rank[,4] = rank(ctcf.10000[,4])

# Cor. between exp and score

cor.test(ctcf.10000.rank[,2],ctcf.10000.rank[,3], method="spearman")

Spearman's rank correlation rho data: ctcf.10000.rank[, 2] and ctcf.10000.rank[, 3] S = 204, p-value = 0.8388 alternative hypothesis: true rho is not equal to 0 sample estimates: rho 0.0697736

# Cor. between exp and distance

cor.test(ctcf.10000.rank[,2],ctcf.10000.rank[,4], method="spearman")

Spearman's rank correlation rho

data: ctcf.10000.rank[, 2] and ctcf.10000.rank[, 4] S = 266, p-value = 0.5391 alternative hypothesis: true rho is not equal to 0 sample estimates: rho -0.2093208

# Cor. between score and distance

cor.test(ctcf.10000.rank[,3],ctcf.10000.rank[,4], method="spearman")

Spearman's rank correlation rho

data: ctcf.10000.rank[, 3] and ctcf.10000.rank[, 4] S = 376, p-value = 0.01873 alternative hypothesis: true rho is not equal to 0 sample estimates: rho -0.7090909

################ # With No DMPK # ################

# 1,000

ctcf.1000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/ctcf_1_000_noDM.txt", sep = "")
ctcf.1000.rank = matrix(data = NA, nrow = 3, ncol = 4, byrow = FALSE)
ctcf.1000.rank[,1] = ctcf.1000[,1]
ctcf.1000.rank[,2] = rank(ctcf.1000[,2])
ctcf.1000.rank[,3] = rank(ctcf.1000[,3])
ctcf.1000.rank[,4] = rank(ctcf.1000[,4])
cor.test(ctcf.1000.rank[,2],ctcf.1000.rank[,3], method="spearman")

# 5,000

ctcf.5000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/ctcf_5_000_noDM.txt", sep = "")
ctcf.5000.rank = matrix(data = NA, nrow = 4, ncol = 4, byrow = FALSE)
ctcf.5000.rank[,1] = ctcf.5000[,1]
ctcf.5000.rank[,2] = rank(ctcf.5000[,2])
ctcf.5000.rank[,3] = rank(ctcf.5000[,3])
ctcf.5000.rank[,4] = rank(ctcf.5000[,4])
cor.test(ctcf.5000.rank[,2],ctcf.5000.rank[,3], method="spearman")

# 10,000

ctcf.10000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/ctcf_10_000_noDM.txt", sep = "")
ctcf.10000.rank = matrix(data = NA, nrow = 5, ncol = 4, byrow = FALSE)
ctcf.10000.rank[,1] = ctcf.10000[,1]
ctcf.10000.rank[,2] = rank(ctcf.10000[,2])
ctcf.10000.rank[,3] = rank(ctcf.10000[,3])
ctcf.10000.rank[,4] = rank(ctcf.10000[,4])

cor.test(ctcf.10000.rank[,2],ctcf.10000.rank[,3], method="spearman")

############# # Figure 10 # #############

pdf("Figure_10_CTCF.pdf")
par(mfrow=c(2,2))
plot(x=ctcf.5000.rank[,3],y=ctcf.5000.rank[,2],xlab="CTCF Score (rank)", ylab="Median Expandability (rank)", main="Median Expandability vs.\nCTCF Score\n(5,000 bp)")
plot(x=ctcf.5000.rank[,4],y=ctcf.5000.rank[,2],xlab="Distance from CAG-repeat (rank)", ylab="Median Expandability (rank)", main="Median Expandability vs.\nDistance from Repeat\n(5,000 bp)")
plot(x=ctcf.5000.rank[,3],y=ctcf.5000.rank[,4],xlab="CTCF Score (rank)", ylab="Distance from Repeat (rank)", main="Distance of CTCF hit vs.\nCTCF Score\n(5,000 bp)")
abline(10,-1)
dev.off()

################################## # Nucleosome Formation Potential # ##################################

names = c("genes","expandability","100")
nucleo = matrix(data = NA, nrow=7, ncol = 3, byrow = FALSE)
nucleo[,1] = c("SCA3_MJD","SCA7","SCA2","HD","DRPLA","SCA1","SBMA")
nucleo[,2] = c(0.07,1.30,0.97,0.29,0.19,0.14,0.08)
nucleo[,3] = c(0.823,-0.758,NA,0.035,0.624,0.265,-0.039)

cor.test(nucleo[,2],nucleo[,3],method="spearman")

Spearman's rank correlation rho

data: nucleo [, 2] and nucleo [, 3] S = 48, p-value = 0.4972 alternative hypothesis: true rho is not equal to 0 sample estimates: rho -0.3714286

#############
# Figure 11 #
#############

pdf("Figure_11_nucleo.pdf")
par(mfrow=c(1,1))
plot(x=rank(nucleo[,3]),y=rank(nucleo[,2]),xlab="Nucleosome Formation Potential (rank)",ylab="Median Expandability (rank)", main="Median Expandability vs.\nNucleosome Formation Potential\nof 100 bp flanking the repeat")
dev.off()
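The rank correlations in this appendix all come from R's cor.test with method="spearman". As a cross-check that does not depend on R, the following Python sketch computes Spearman's rho as the Pearson correlation of average ranks, which is the form that handles ties. The function names are illustrative only, and the sketch assumes that the expandability column of gc.brock for the seven loci matches the values listed in the AluY section; under that assumption it should come out close to the rho of about 0.185 reported for expandability against AluY counts.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Loci in order: SCA3_MJD, SCA7, SCA2, HD, DRPLA, SCA1, SBMA
expandability = [0.07, 1.30, 0.97, 0.29, 0.19, 0.14, 0.08]
alu_count = [14, 6, 21, 12, 6, 6, 1]  # AluY repeats within 50,000 bp

rho = spearman_rho(expandability, alu_count)

if __name__ == "__main__":
    print(round(rho, 7))
```

The sketch returns only the coefficient; the p-values shown in the outputs above come from R and are not recomputed here.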

APPENDIX H

Satellog MySQL Database commands

########################################
# MySQL Commands for Satellog Database #
########################################

This document contains all of the commands needed to recreate the Satellog database from scratch. Broadly speaking, the tables fall into two classes: those that must be populated manually before running repeatalyzer.pl, and those populated automatically when repeatalyzer.pl is run. The distinction between the two classes is made in the comments. Comments after each CREATE TABLE command briefly explain what data the table holds. See the Satellog manuscript for further details.

####################### # DROP TABLE COMMANDS # #######################

DROP TABLE affy; DROP TABLE build_info; DROP TABLE ugcount; DROP TABLE ugstats; DROP TABLE ens_db; DROP TABLE gc; DROP TABLE go; DROP TABLE mim; DROP TABLE pdb; DROP TABLE repeats; DROP TABLE transcripts;

# Following tables are not automatically generated # Are you sure you need to drop them?

DROP TABLE linkage; DROP TABLE rep_stats; DROP TABLE rep_class; DROP TABLE GeneNote; DROP TABLE disease; DROP TABLE unigene;

######################### # CREATE TABLE COMMANDS # #########################

CREATE TABLE build_info ( ens_db VARCHAR(50) NOT NULL PRIMARY KEY, db_name VARCHAR(50) NOT NULL, date_run DATETIME ) ;

# collects the name and version of the EnsEMBL databases used # collects the date the database was populated

CREATE TABLE repeats ( rep_id INT auto_increment NOT NULL PRIMARY KEY, chr VARCHAR(4), start INT UNSIGNED, end INT UNSIGNED, period TINYINT UNSIGNED, unit VARCHAR(16), class_id INT, seq VARCHAR(255), length INT UNSIGNED, pvalue DECIMAL(8,6) NULL );

CREATE INDEX r_total ON repeats (rep_id,chr,start,end,period,unit,seq,length);
CREATE INDEX r_common ON repeats (rep_id,period,unit,length);
CREATE INDEX r_chr ON repeats (chr,start,end,period,unit,length);
CREATE INDEX r_total_period ON repeats (rep_id,period);
CREATE INDEX r_total_unit ON repeats (rep_id,unit);
CREATE INDEX r_total_length ON repeats (rep_id,length);
CREATE INDEX r_co_ords ON repeats (rep_id,chr,start,end);
CREATE INDEX rapid ON repeats (period,length);
CREATE INDEX r_length_class ON repeats (class_id,length);

# this is the primary organizing table of the database # collects raw information from the Tandem Repeats Finder (TRF) raw output files

########### # ugcount # ###########

CREATE TABLE ugcount ( count_id INT auto_increment NOT NULL PRIMARY KEY, rep_id INT NOT NULL, cluster VARCHAR(20), sequence VARCHAR(20), length INT );

CREATE INDEX ug_hits ON ugcount (rep_id,cluster,sequence,length);
CREATE INDEX length ON ugcount (rep_id,length);

CREATE TABLE ugstats ( count_id INT auto_increment NOT NULL PRIMARY KEY, rep_id INT NOT NULL, count INT NOT NULL, min INT NOT NULL, max INT NOT NULL, mean DECIMAL(8,2) NOT NULL, sd DECIMAL(8,2) NULL );

CREATE INDEX ug_stats ON ugstats (rep_id,count,min,max,mean,sd); CREATE INDEX sd ON ugstats (rep_id,sd);

CREATE TABLE transcripts ( ts_id INT auto_increment NOT NULL PRIMARY KEY, rep_id INT NOT NULL, ens_ts VARCHAR(15), gene_location VARCHAR(20), pep VARCHAR(150) NULL, ens_id INT NOT NULL ) ;

CREATE INDEX t_common ON transcripts (rep_id,gene_location,pep,ens_id); CREATE INDEX t_pep ON transcripts (rep_id,gene_location); CREATE INDEX t_gene ON transcripts (rep_id,ens_id);

CREATE TABLE gc ( rep_id INT NOT NULL PRIMARY KEY, 100_bp DECIMAL(7,6) NOT NULL, 500_bp DECIMAL(7,6) NOT NULL, 1000_bp DECIMAL(7,6) NOT NULL ) ;

CREATE INDEX gc_100 ON gc (rep_id,100_bp); CREATE INDEX gc_500 ON gc (rep_id,500_bp); CREATE INDEX gc_1000 ON gc (rep_id,1000_bp);

CREATE TABLE ens_db ( ens_id INT auto_increment NOT NULL PRIMARY KEY, ens_name VARCHAR(15) NOT NULL, name VARCHAR(15) NULL, description TEXT NULL, chr CHAR(2), start INT UNSIGNED, end INT UNSIGNED, strand CHAR(1) );

CREATE INDEX ens_common ON ens_db (ens_id,ens_name,name); CREATE INDEX ens_lookup ON ens_db (ens_name,ens_id);

CREATE TABLE go ( go_id INT auto_increment NOT NULL PRIMARY KEY, ens_id INT NOT NULL, go_value VARCHAR(50) NOT NULL ) ;

CREATE INDEX go_go_value ON go (ens_id,go_value);

CREATE TABLE pdb ( pdb_id INT auto_increment NOT NULL PRIMARY KEY, ens_id INT NOT NULL, domain VARCHAR(20) NOT NULL ) ;

CREATE INDEX pdb_domain ON pdb (ens_id,domain);

CREATE TABLE mim ( mim_id INT auto_increment NOT NULL PRIMARY KEY, ens_id INT NOT NULL, mim_value VARCHAR(20) NOT NULL ) ;

CREATE INDEX mim_mim_value ON mim (ens_id,mim_value);

CREATE TABLE affy ( affy_id INT auto_increment NOT NULL PRIMARY KEY, ens_id INT NOT NULL, g_id INT NOT NULL ) ;

CREATE INDEX affy_id_ref ON affy (ens_id,g_id);

# Following tables are not automatically generated

CREATE TABLE linkage ( link_id INT auto_increment NOT NULL PRIMARY KEY, disease VARCHAR(10) NOT NULL, band VARCHAR(25), marker VARCHAR(50), chr CHAR(2), pstart INT, start INT, end INT, qend INT, ref INT, score DECIMAL(3,2), type VARCHAR(10), p_value DECIMAL(11,8), notes TEXT ) ;

CREATE TABLE rep_stats ( class_id INT NOT NULL, chr VARCHAR(4), length INT UNSIGNED );

CREATE INDEX stats_search ON rep_stats (class_id,length); CREATE INDEX stats_search_chr ON rep_stats (chr,class_id,length);

CREATE TABLE GeneNote ( g_id INT auto_increment NOT NULL PRIMARY KEY, id_ref VARCHAR(15) NOT NULL, value DECIMAL(10,1) NOT NULL, call CHAR(1) NOT NULL, tissue VARCHAR(15) NOT NULL, array CHAR(1) NOT NULL, number CHAR(4) NOT NULL );

CREATE INDEX g_lookup ON GeneNote (id_ref,g_id);
CREATE INDEX g_all ON GeneNote (g_id,id_ref,value,call,tissue,array,number);
CREATE INDEX g_common ON GeneNote (g_id,call,tissue);

CREATE TABLE rep_class ( class_id INT auto_increment NOT NULL PRIMARY KEY, class TEXT );

CREATE INDEX rc_search ON rep_class (class (20) ASC,class_id);

CREATE TABLE disease ( disease_id INT auto_increment NOT NULL PRIMARY KEY, rep_id INT, short_name VARCHAR(15), full_name VARCHAR(100), name VARCHAR(15), ens_name VARCHAR(15) NOT NULL, norm_min INT, norm_max INT, dis_min INT, dis_max INT, locus CHAR, anticipation VARCHAR(2) );

CREATE TABLE unigene ( cluster_id INT auto_increment NOT NULL PRIMARY KEY, cluster VARCHAR(20), chr VARCHAR(4), start INT UNSIGNED, end INT UNSIGNED, blatscore INT, identity DECIMAL(4,1) );

CREATE INDEX unigene_look_up ON unigene (cluster_id,cluster,chr,start,end);

CREATE TABLE class_stats ( class INT NOT NULL, length INT, pvalue DECIMAL(9,8) ) ;

CREATE INDEX class_stats_look_up ON class_stats (class,length,pvalue);

CREATE TABLE repeats_in_linkage ( rep_id INT, disease VARCHAR(10) NOT NULL, link_id INT );

CREATE INDEX rep_link ON repeats_in_linkage (rep_id,link_id); CREATE INDEX rep_link_disease ON repeats_in_linkage (rep_id,disease,link_id);

LOAD DATA INFILE '/home/perseusm/My_Documents/Publications/Satellog/Results/repeats_in_linkage.txt' INTO TABLE repeats_in_linkage IGNORE 1 LINES;

CREATE TABLE go_terms ( go_value VARCHAR(50) NOT NULL PRIMARY KEY, go_term VARCHAR(200), go_class VARCHAR(1) ) ;

LOAD DATA INFILE '/home/perseusm/GO/go_terms.txt' INTO TABLE go_terms;

APPENDIX I

Running TRF on v.34 whole chromosome fasta files from UCSC

@@@@@@@@@@@@@@@@@@@@@
@ Human Genome v.34 @
@@@@@@@@@@@@@@@@@@@@@

The human genome goldenpath for all chromosomes (excluding random chromosome DNA data) was saved in /home/perseusm/goldenpath for subsequent analysis. These files were downloaded from UCSC.

How to download fasta files by FTP:

ftp -i hgdownload.cse.ucsc.edu

# -i turns off interactive mode, therefore no prompting during mget

login
u: anonymous
p: your@email.com

get files
cd goldenPath/hg16/chromosomes/
mget *

@@@@@@@
@ TRF @
@@@@@@@

We are interested in developing our own repeat co-ordinates, distinct from the pre-computed co-ordinates provided by UCSC, for two reasons:

1) We want to detect repeats much smaller than the smallest at UCSC
2) We are only interested in pure repeats

The following parameters were recommended to detect the purest repeats possible with TRF without running out of memory:

Shell Script for TRF

/home/perseusm/trf/trf321.linux.exe /home/perseusm/goldenpath/chr7.fa 3 4090 4090 80 10 30 16 -d; for file2 in /home/perseusm/chr7*.html; do rm -i $file2 -f; done; for file3 in /home/perseusm/chr7*.tmp; do rm -i $file3 -f; done

The shell script above also removes all of the html files spawned by TRF. Much to my annoyance, generating these files was not an option that could be disabled.

You need to do each chromosome sequentially because temporary files are created that are needed in the creation of the final .dat file.
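The one-chromosome-at-a-time procedure can be wrapped in a small driver. The following Python sketch is only an illustration of that loop: the function names and directory arguments are hypothetical, the numeric parameters are copied from the shell script above, and the clean-up step mirrors the removal of the html and tmp files that the shell script performs after each run.

```python
import subprocess
from pathlib import Path

# Parameters copied from the shell script above, in the same order.
TRF_PARAMS = ["3", "4090", "4090", "80", "10", "30", "16"]


def trf_command(trf_exe, fasta):
    """Build the argument list for one TRF run on one chromosome file."""
    return [str(trf_exe), str(fasta), *TRF_PARAMS, "-d"]


def run_all(trf_exe, fasta_dir, workdir):
    """Run TRF on every chr*.fa file, strictly one after another."""
    for fasta in sorted(Path(fasta_dir).glob("chr*.fa")):
        # subprocess.run blocks until TRF exits, so runs never overlap and
        # the temporary files of one chromosome cannot collide with the next.
        # check=False so that a non-zero exit status does not abort the loop.
        subprocess.run(trf_command(trf_exe, fasta), cwd=workdir, check=False)
        # Remove the html and tmp files left behind for this chromosome.
        for pattern in ("*.html", "*.tmp"):
            for leftover in Path(workdir).glob(fasta.stem + pattern):
                leftover.unlink()


if __name__ == "__main__":
    # Prints the command for chromosome 7 without running anything.
    print(" ".join(trf_command("/home/perseusm/trf/trf321.linux.exe",
                               "/home/perseusm/goldenpath/chr7.fa")))
```

Calling run_all requires the TRF executable and the fasta files to be present; trf_command alone can be used to inspect the command line beforehand.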

The next thing I wanted to do was to test and ensure that in fact only pure repeats were being detected by the script. The following is a quick and dirty script that extracts the largest hits from the TRF .dat files. Due to the way the scoring algorithm works, larger repeats have a higher chance of tolerating indels and substitutions. I wanted to make sure the TRF parameters I selected reported only pure hits.

# Execute purity test script

/home/perseusm/goldenpath/3.4090.4090.80.10.30.16/parse_dis.pl > purity_test.txt

# Code for purity test script

################
# parse_dis.pl #
################

#!/usr/bin/perl
use strict;

my $chrom;

while (<>) {
    chomp;
    if ($_ =~ /chr(\S+)/) {
        $chrom = $1;
    } elsif ($_ =~ /^(\d+)\s+(\d+)\s+(\d+)\s+(\d+\.\d+)\s+(\d+)\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\.\d+\s+(\S+)\s+(\S+)/) {

        # pull out coords of interest
        my $chromStart = $1;
        my $chromEnd = $2;
        my $rptPeriod = $3;
        my $rptSize = $4;
        my $rptConsensus = $5;
        my $rptUnit = $6;
        my $rpt = $7;
        my $rptLength = length $rpt;

        # report only the largest hits: period 16 with more than 10 copies
        if (($rptPeriod == 16) && ($rptSize > 10)) {
            print "$chrom\t$chromStart\t$chromEnd\t$rptUnit\t$rpt\t$rpt\t$rptPeriod\t$rptSize\n";
        }
    }
}

#################
# End of Script #
#################

The contents of /home/perseusm/goldenpath/3.4090.4090.80.10.30.16/purity_test.txt indicate that all the largest hits are pure, which implies that all smaller hits are pure as well. You should still go through this file manually and ensure each hit is pure, i.e. composed only of tandem repeat units.
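The manual purity check described above could also be approximated programmatically. The following is a minimal Python sketch and not part of the original pipeline. The function name `is_pure_repeat` is hypothetical, and it assumes that a hit counts as pure when its sequence consists only of tandem copies of the repeat unit, with a possibly partial final copy (TRF reports fractional copy numbers such as 3.3).

```python
def is_pure_repeat(sequence: str, unit: str) -> bool:
    """Return True if `sequence` is made only of tandem copies of `unit`.

    The last copy may be partial. Hypothetical helper for illustration only.
    """
    if not sequence or not unit:
        return False
    # Build an ideal tract at least as long as the hit, then trim it
    # to the hit length and compare base by base.
    copies = len(sequence) // len(unit) + 1
    ideal = (unit * copies)[:len(sequence)]
    return ideal == sequence


if __name__ == "__main__":
    print(is_pure_repeat("CAGCAGCAGCA", "CAG"))  # pure, partial last copy
    print(is_pure_repeat("CAGCAACAG", "CAG"))    # interrupted by a substitution
```

Applied to the repeat unit and repeat sequence columns of each line in purity_test.txt, a check like this would flag any hit that contains an indel or substitution.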

APPENDIX J

Generating the repeat classifier

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ Determining the distinct repeat classes @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

Mathematically, if one is looking at all 1mers to 16mers then there is a huge number of potential combinations. Not all mathematically predicted repeats exist in the genome (at least not as pure repeats). For the pure repeats that do exist in the human genome, we need a way of classifying them into their repeat families. For example:

CAG, AGC, GCA, CTG, TGC and GCT are all the same repeat. They are detected distinctly because of the way TRF works (see text). To handle this, we developed a repeat class that captures all repeats in a family. We also had to design it so that it could be searched specifically for each repeat. To do this, all detected repeats had to be collected from the directory containing the TRF output files as follows:

for file in *.dat; do cut -d " " -f 14 $file; done > rep_units_4090

This cuts the TRF output for each chromosome, separated by spaces, at the 14th column (the column containing the repeat unit) and outputs it to rep_units_4090.

One erroneous repeat unit was detected as AAA instead of A. The cause of this error is unknown, but this class of repeat is represented by A. This value was therefore changed in chr14.fa.3.4090.4090.80.10.30.16.dat, and the change is reflected in rep_units_4090.2.

Original erroneous hit in chr14.fa.3.4090.4090.80.10.30.16.dat:

chr 14 102425491 102425500 3 3.3 3 100 0 30 100 0 0 0 0.00 AAA AAAAAAAAAA

Next, we sought to determine all the distinct repeats. This was accomplished by making a temporary "staging" table:

CREATE TABLE rep_class ( class CHAR(16) PRIMARY KEY ) ;

and inserting all the repeats into this table:

LOAD DATA INFILE '/home/perseusm/goldenpath/3.4090.4090.80.10.30.16/rep_units_4090.2' IGNORE INTO TABLE rep_class;

From this table, unique repeats were selected with the following shell script:

echo " SELECT DISTINCT class FROM rep_class " | mysql --quick -h athena -u schz_rw -prepeat schz_db > repeats_4090.txt

Next we needed to generate all the distinct repeat classes possible from the distinct repeats in our dataset. /home/perseusm/progs/repeat_classer/repeat_classer2.pl does this.

# Execute

/home/perseusm/progs/repeat_classer/repeat_classer2.pl repeats_4090.txt

############################
# Begin repeat_classer2.pl #
############################

#!/usr/bin/perl
# repeat_classer2.pl
# usage: rep_classer2.pl rep_units.txt
# run this script before repeatalyzer.pl
# detects what macroclass each repeat belongs to
# rep_units.txt is a list of all distinct types of repeats in the db
# Perseus Missirlis - Mar 18, 2004
use strict;

my $rep = "rep_class.txt";
unless ( open(REP_CLASS, ">$rep") ) {
    die "Cannot open file \"$rep\" to write to!\n\n";
}

my $i = 1;
while (<>) {
    chomp;
    if ($_ =~ /(\S+)/) {

        print "line $i matched: $1\n\n";

        my @units;
        my $rep_class;
        my @repeat = split('', $1);
        my $lengther = scalar @repeat;
        my $n = 0;

        print "this is the repeat: @repeat\n";
        print "this is the length of the repeat: $lengther\n\n";

        my $x = 0;

        # rotate the repeat one base at a time and store each rotation
        # together with its reverse complement
        while ($n < $lengther) {
            my $for = shift @repeat;
            $repeat[($lengther - 1)] = $for;
            my $for_run = "@repeat";
            $for_run =~ s/\s//g;
            my $rev_run = reverse $for_run;
            $rev_run =~ tr/ATGCatgc/TACGtacg/;
            print "this is the repeat after popping it: @repeat\nthis is the variable for regex: $for_run\nthis is the variable in reverse: $rev_run\n\n";

            $units[$x] = $for_run;
            $x++;
            $units[$x] = $rev_run;
            $x++;
            $n++;
        }
        print "this is the units array: @units\n";

        # the class string is the sorted list of units, delimited by "o"
        @units = sort(@units);
        print "sorted units:\n";
        foreach my $unit (@units) {
            print $unit . "\n";
            $rep_class .= $unit . "o";
        }
        $rep_class = "o" . $rep_class;
        print "this is the rep_class: $rep_class\n\n";
        print REP_CLASS $rep_class . "\n";
        $i++;
    }
}
close(REP_CLASS);
exit;

##############
# End script #
##############

This script generates all the distinct classes in a file, repeats_classes_4090.txt, located at: /home/perseusm/progs/repeat_classer/repeats_classes_4090.txt

This is fed through the rep_class table as above to obtain all the distinct classes, which are saved in a file called repeats_classes_4090.txt, located at: /home/perseusm/progs/repeat_classer/repeats_classes_4090.txt
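The family grouping that repeat_classer2.pl performs can be illustrated with a short Python sketch. This is an illustration under stated assumptions, not the thesis code: the function names are hypothetical, and unlike the Perl script (which keeps duplicate units and joins them with an "o" delimiter) it collects the family members in a set and returns them as a sorted list.

```python
def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string (A<->T, G<->C)."""
    table = str.maketrans("ATGCatgc", "TACGtacg")
    return seq.translate(table)[::-1]


def repeat_family(unit: str) -> list:
    """Return every rotation of `unit`, plus each rotation's reverse complement."""
    members = set()
    for i in range(len(unit)):
        rotated = unit[i:] + unit[:i]  # cyclic shift by i bases
        members.add(rotated)
        members.add(reverse_complement(rotated))
    return sorted(members)


if __name__ == "__main__":
    # All six CAG/CTG-type units map to the same family
    print(repeat_family("CAG"))
    print(repeat_family("TGC") == repeat_family("CAG"))
```

Because every member of a family yields the same sorted list, that list (or any fixed element of it) can serve as a lookup key for the repeat class.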

Now create the final, usable rep_class table:

CREATE TABLE rep_class ( rep_class_id INT auto_increment NOT NULL PRIMARY KEY, class TEXT ) ;

CREATE INDEX rc_search ON rep_class (class (50) ASC,rep_class_id);

The table can be populated using LOAD DATA or the following script:

# execute:

/home/perseusm/progs/repeat_classer/reg_ex_test.pl repeats_classes_4090.txt

##################
# reg_ex_test.pl #
##################

#!/usr/bin/perl
# reg_ex_test.pl
# run this script before repeatalyzer.pl
# detects what macroclass each repeat belongs to
# rep_units.txt is a list of all distinct types of repeats in the db
# Perseus Missirlis - Mar 18, 2004
use strict;
use DBI;

# DBI
my ($dsn) = "DBI:mysql:schz_db:athena.bcgsc.ca";
my ($user_name) = "schz_rw";
my ($password) = "repeat";
my ($dbh, $sth);
my (@ary);

#######################
# Connect to Database #
#######################

$dbh = DBI->connect ($dsn, $user_name, $password, { RaiseError => 1 });

my $i = 1;
while (<>) {
    chomp;
    if ($_ =~ /(\S+)/) {
        # print "line: 1 - Inserting: $1\n\n";
        $sth = $dbh->prepare ("INSERT INTO rep_class VALUES(NULL,'$1')");
        $sth->execute ();
        $i++;
    }
}
exit;

##############
# End script #
##############

APPENDIX K

Downloading and populating the GeneNote tables in Satellog

@@@@@@@@@@@@@@@@@
@ GeneNote Data @
@@@@@@@@@@@@@@@@@

Get the GeneNote dataset from GEO: http://www.ncbi.nlm.nih.gov/geo/
Dataset ID: GSE803

# Execute

/home/perseusm/genenote/genenote_parser.pl GSE803.txt

######################
# genenote_parser.pl #
######################

#!/usr/bin/perl
# genenote_parser.pl
# usage: genenote_parser.pl GSE803.txt
# Perseus Missirlis - Jan 15, 2004
use strict;
use DBI;

####################
# Global Variables #
####################

# DBI
my ($dsn) = "DBI:mysql:schz_db:athena.bcgsc.ca";
my ($user_name) = "schz_rw";
my ($password) = "repeat";
my ($dbh, $sth);
my (@ary);

#######################
# Connect to Database #
#######################

$dbh = DBI->connect ($dsn, $user_name, $password, { RaiseError => 1 });

##############################################################################
# read infile specified on the cmd line
# read multiple coords of CAG/CTG repeat locations from STDIN one line at a time
##############################################################################

my $tissue;
my $array;
my $number;
my $last_insert_id;

while (<>) {
    chomp;
    if ($_ =~ /\!Sample_title\s+\=\s+Normal\s+((\S+\s+\S+)|\S+)\s+\S+\s+(\S+)\s+\S+\s+(\S+)/) {

        # sample header: remember tissue, array and sample number
        $tissue = $1;
        $array = $3;
        $number = $4;

    } elsif ($_ =~ /(\S+)\s+(\d+\.\d+)\s+(\S)/) {

        # data row: probe id, expression value and detection call
        my $id_ref = $1;
        my $value = $2;
        my $call = $3;

        # print "id_ref $id_ref value $value call $call tissue $tissue array $array number $number\n";

        $sth = $dbh->prepare ("INSERT INTO GeneNote VALUES('NULL','$id_ref','$value','$call','$tissue','$array','$number')");
        $sth->execute ();
        $last_insert_id = $sth->{mysql_insertid};
        print "The id of the last record inserted into the db is $last_insert_id\n";
    }
    print "outside of loop this is the last record $last_insert_id\n\n";
}

##############
# End script #
##############

This will populate the GeneNote table and have it ready for future queries.

APPENDIX L

Downloading and processing UniGene data

We were curious if there was any indication of repeat polymorphism in the UniGene clusters posted at NCBI. repeatalyzer.pl automatically evaluates each repeat for polymorphisms within UniGene clusters. To do this however, we need the clusters and all sequences:

How to download fasta files by FTP:

# open FTP connection to NCBI
$ ftp -i ftp.ncbi.nih.gov

# login
u:anonymous
p:your@email.com

# change to UniGene directory
cd /repository/UniGene

# download all human UniGene files
mget Hs*

Convert FASTA formatting of Hs.seq.uniq file

The Hs.seq.uniq file contains all sequences representing the longest, highest quality stretch of DNA for each particular UniGene cluster. We will be using the BLAT algorithm to see if each repeat plus 10 bp of upstream and downstream genomic sequence can be detected within these sequences. The FASTA files provided by NCBI have a long, somewhat cumbersome naming convention that is too big for the BLAT output.

For example the FASTA header for Hs.2 is:

>gnl|UG|Hs#S1728506 Homo sapiens N-acetyltransferase 2 (arylamine N-acetyltransferase) (NAT2), mRNA /cds=(108,980) /gb=NM_000015 /gi=4557782 /ug=Hs.2 /len=1276

From this, we only really need the cluster identifier (Hs.2) and the UniGene identifier for this sequence within Hs.2 (Hs#S1728506).

Run the following command-line perl script to format this file for subsequent BLAT analysis:

$ perl -i.bak -p -e 's/^.*(Hs\#\S+).*\/ug\=(\S+).*$/>\2\|\1/g' Hs.seq.uniq

The FASTA header for all sequences in Hs.seq.uniq is now: >Hs.2|Hs#S1728506
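For readers less familiar with Perl one-liners, the header rewrite can be sketched in Python as follows. This is an illustrative equivalent under assumptions, not the command that was run: the function name `shorten_header` is hypothetical, and the pattern simply mirrors the regular expression in the one-liner above.

```python
import re

# Capture the UniGene sequence identifier (Hs#...) and the cluster identifier
# that follows "/ug=" in the original NCBI FASTA header.
HEADER = re.compile(r"^.*(Hs#\S+).*/ug=(\S+).*$")


def shorten_header(line: str) -> str:
    """Rewrite a UniGene FASTA header as '>cluster|sequence'.

    Lines that do not look like a UniGene header (e.g. sequence lines)
    are returned unchanged.
    """
    match = HEADER.match(line)
    if match is None:
        return line
    return ">" + match.group(2) + "|" + match.group(1)


if __name__ == "__main__":
    header = (">gnl|UG|Hs#S1728506 Homo sapiens N-acetyltransferase 2 "
              "(arylamine N-acetyltransferase) (NAT2), mRNA /cds=(108,980) "
              "/gb=NM_000015 /gi=4557782 /ug=Hs.2 /len=1276")
    print(shorten_header(header))
```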

Now rename this file to Hs.seq.uniq2:
$ mv Hs.seq.uniq Hs.seq.uniq2

And rename the back-up file created by the command-line script to the original:
$ mv Hs.seq.uniq.bak Hs.seq.uniq

Make Hs.seq.uniq2 into a BLATable database

BLAT requires multiple FASTA files converted to a .2bit file format in order to process them.

$ ~/blat/faToTwoBit Hs.seq.uniq2 Hs.seq.uniq2.2bit

Remember where this file is; repeatalyzer requires it in order to work.

Split the UniGene clusters into cluster delineated multiple FASTA files

The Hs.seq.all file from UniGene is essentially one huge flat file. Within this file, UniGene clusters are delimited by # followed by a collection of sequences that make up the UniGene cluster. For repeatalyzer to work, the UniGene clusters need to be parsed into separate files, each representing one cluster with all of its associated sequences. The Hs.seq.all file was parsed by the following script:

# make a new directory (105680 files will be created!)
# make a note of the absolute location of these files
# they will be needed by repeatalyzer

$ mkdir ugc_fasta

# run the script

$ ./parse_unigene_all.pl Hs.seq.all

# Code for parsing Hs.seq.all

########################
# parse_unigene_all.pl #
########################

#!/usr/bin/perl -w
# parse_unigene_all.pl
# usage: parse_unigene_all.pl Hs.seq.all
# writes each UniGene cluster to its own <cluster>.ugc multiple FASTA file
use strict;

my $outputfile = "frig";    # sentinel: no output file has been opened yet

while (<>) {

    if (/^(.*\/ug\=(\S+).*$)/) {

        # FASTA header: $1 is the whole header line, $2 is the cluster identifier
        if ($outputfile ne $2) {
            # first sequence of a new cluster: switch to a new output file
            close(SEQ) if ($outputfile ne "frig");
            $outputfile = $2;
            unless ( open(SEQ, ">$outputfile\.ugc") ) {
                die "Cannot open file \"$outputfile\" to write to!\n\n";
            }
        }
        print SEQ "$1\n";

    } elsif (/(\S+)/) {

        # any other non-empty line belongs to the current cluster
        print SEQ "$1\n";
    }
}
close(SEQ);
exit;

APPENDIX M

Mapping the unique UniGene clusters to the human genome

Split the unique UniGene clusters into individual FASTA files

Each unique UniGene cluster was mapped to the human genome. Doing so exploited all of the sequence information within each cluster and allowed us to control false positives. Whenever a repeat was detected in a UniGene cluster that did not map within 10 kb of the repeat's chromosomal co-ordinates, it was not evaluated further. The Hs.seq.uniq file contains a single sequence representing the longest, highest-quality stretch of DNA for each UniGene cluster. This file was parsed into individual FASTA files so that they could be BLATed against the human genome.

# create a new directory

$ mkdir ugc_unique

# parse Hs.seq.uniq
# use original Hs.seq.uniq, not Hs.seq.uniq2 from Appendix L
# parse_unigene_unique.pl will format output files correctly

$ parse_unigene_unique.pl Hs.seq.uniq

###########################
# parse_unigene_unique.pl #
###########################

#!/usr/bin/perl -w
# parse_unigene_unique.pl
use strict;

my $outputfile = "frig";

while (<>) {

    if (/^.*(Hs\#\S+).*\/ug\=(\S+).*$/) {

        # FASTA header: $1 is the sequence identifier, $2 the cluster identifier
        if ($outputfile eq "frig") {

            $outputfile = "$2";
            unless ( open(SEQ, ">$outputfile") ) {
                die "Cannot open file \"$outputfile\" to write to!\n\n";
            }
            print SEQ ">$2\|$1\n";

        } elsif ($outputfile ne "frig") {

            close(SEQ);
            $outputfile = "$2";
            unless ( open(SEQ, ">$outputfile") ) {
                die "Cannot open file \"$outputfile\" to write to!\n\n";
            }
            print SEQ ">$2\|$1\n";
        }

    } elsif (/\S+/) {
        print SEQ "$&\n";
    }
}
close(SEQ);
exit;

Set-up a BLAT server of all human chromosomes

Create a BLAT server using all the human chromosomes from UCSC. Soft mask (-mask) the sequences so that repeats are not allowed to initiate an alignment but can be used to extend an alignment.

/home/perseusm/blat/gfServer -mask -canStop start OofO 8050 /home/perseusm/goldenpath/*.nib

Run mapugc.pl on each unique UniGene sequence

# use the following shell script to run perl script on each unique UniGene sequence
for file in /home/unigene/ugc_unique/Hs.*; do ./mapugc.pl $file; done;

Warning: You need to manually set the gfClient command in the following script to match your host and port settings from the gfServer command run to set-up the BLAT server.

The $cutoff variable is set to 85%, that is, only BLAT scores that are 85% of the maximum (calculated for a perfect hit) are input into the database.
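The hit filter can be summarised with a small Python sketch. It is a restatement for illustration, not the thesis code: the function name `keep_hit` and its parameter names are hypothetical, while the score and identity formulas and the 85% and 90% thresholds are taken from the mapugc.pl listing in this appendix.

```python
def keep_hit(match, mismatch, q_gap_count, t_gap_count, query_length,
             cutoff_fraction=0.85, min_identity=90.0):
    """Decide whether a BLAT hit is stored, following the rules in mapugc.pl.

    score    = match - mismatch - qGapCount - tGapCount
    identity = 100 * match / (match + mismatch + qGapCount), to one decimal
    A hit is kept when score exceeds cutoff_fraction * query_length
    (the score of a perfect hit) and identity exceeds min_identity.
    """
    aligned = match + mismatch + q_gap_count
    if aligned == 0:
        return False  # nothing aligned, so there is nothing to keep
    score = match - mismatch - q_gap_count - t_gap_count
    identity = round(100.0 * match / aligned, 1)
    return score > cutoff_fraction * query_length and identity > min_identity


if __name__ == "__main__":
    # A near-perfect alignment of a 1276 bp query is kept
    print(keep_hit(1250, 10, 2, 3, 1276))
    # A partial alignment of the same query is rejected
    print(keep_hit(900, 100, 5, 5, 1276))
```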

#############
# mapugc.pl #
#############

#!/usr/bin/perl
# mapugc.pl UniGene_Cluster.fa
# use a shell "for file in Hs.*" loop to parse each individual UniGene fa file
# blats each unique unigene sequence against the human genome
# if score is at least 85% of the theoretical max (i.e. all bases match) and
# 90% of input bases match somewhere, it qualifies as a hit
# use this script to fish out chromosome co-ordinates for unigene clusters
# make sure /home/unigene/ugc_unique/ has no *.out files in it before running this program
# find -name "*.out" -print0 | xargs -0 ls
# find -name "*.out" -print0 | xargs -0 rm -f
# make sure BLAT gfClient host and port match your gfServer settings
use DBI;
use strict;

# DBI
my ($dsn) = "DBI:mysql:schz_db:athena.bcgsc.ca";
my ($user_name) = "schz_rw";
my ($password) = "repeat";
my ($dbh, $sth);
my (@ary);

#######################
# Connect to Database #
#######################

$dbh = DBI->connect ($dsn, $user_name, $password, { RaiseError => 1 });
$sth = $dbh->prepare ("INSERT INTO unigene VALUES(NULL,?,?,?,?,?,?)");

my $blat_file = $ARGV[0];
print "file to blat is $blat_file\n\n";

# add up the length of the query sequence
my $total;
unless ( open(FILE, "$blat_file") ) {
    die "Cannot open file \"$blat_file\" for reading!\n\n";
}
while (<FILE>) {
    chomp;
    if (/^(\w+)$/) {
        my $seq = $1;
        $seq =~ s/\s+//g;
        my $length = length($seq);
        $total += $length;
    }
}
my $cutoff = 0.85*$total;
print "$total\n";
print "cutoff is $cutoff\n\n";
close(FILE);

print "BLAT query\n\n";
print `/home/perseusm/blat/gfClient oOOOl 8050 / $blat_file $blat_file.out`;

my $blat_hits = "$blat_file.out";
print "open this file: $blat_hits\n\n";

unless ( open(HITS, "$blat_hits") ) {
    die "Cannot open file \"$blat_hits\" for reading!\n\n";
}
while (<HITS>) {

    if (/^(\d+)\s+(\d+)\s+\d+\s+\d+\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+\S+\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+\S+\s+(\S+)\s+(\S+)/) {

        print "1 $1 2 $2 3 $3 4 $4 5 $5 6 $6 7 $7 8 $8 9 $9 10 $10 11 $11 12 $12 13 $13 \n\n";

        my $match = $1;
        my $mismatch = $2;
        my $qGapCount = $3;
        my $qGapBases = $4;
        my $tGapCount = $5;
        my $tGapBases = $6;
        my $query = $7;
        my $qSize = $8;
        my $qStart = $9;
        my $qEnd = $10;
        my $chr = $11;
        my $start = $12;
        my $end = $13;
        my $ug_cluster = $7;
        my $ug_sequence = $ug_cluster;
        $ug_cluster =~ s/^(Hs\.\S+)\|.*/$1/;
        $ug_sequence =~ s/^.*\|(Hs\#.*)/$1/;

        my $blatScore = $match - $mismatch - $qGapCount - $tGapCount;
        my $identity = ($match)/($match + $mismatch + $qGapCount);

        $chr =~ s/chr//g;

        $identity = $identity * 100;
        $identity = sprintf "%.1f",$identity;

        print "this is the cluster: $ug_cluster\n";
        print "this is the sequence: $ug_sequence\n";
        print "maps to $chr\:$start\-$end in the human genome\n";
        print "identity: $identity score: $blatScore\n\n";

        if (($blatScore > $cutoff) && ($identity > 90)) {
            print "it's a keeper\n";
            print "this is the cluster: $ug_cluster\n";
            print "this is the sequence: $ug_sequence\n";
            print "maps to $chr\:$start\-$end in the human genome\n";
            print "identity: $identity score: $blatScore\n\n";

            # insert hit into database
            $sth->execute ($ug_cluster,$chr,$start,$end,$blatScore,$identity);
        }
    }
}
close(HITS);

# system("rm $blat_file.out -f");

exit;

APPENDIX N

Generating the percentile rank for each repeat (p-values)

We were interested in knowing how significant a repeat length was relative to all other repeats of the same class in the human genome. In the case of CAG class repeats, longer coding repeats in the human genome are more unstable relative to shorter repeats. For each repeat class, we collected counts of repeats at each length and stored them in a table called class_stats with the following script.

#######################
# generate_pvalues.pl #
#######################

#!/usr/bin/perl
# generate_pvalues.pl
# Perseus Missirlis
# This script loops through the distinct repeat classes in Satellog and
# collects counts of all the lengths for all repeat classes
# Results are stored in the class_stats table in Satellog
use strict;
use DBI;

# DBI
my ($dsn) = "DBI:mysql:schz_db:athena.bcgsc.ca";
my ($user_name) = "schz_rw";
my ($password) = "repeat";
my ($dbh, $sth);
my (@ary);

#######################
# Connect to Database #
#######################

$dbh = DBI->connect ($dsn, $user_name, $password, { RaiseError => 1 });
my $sth1 = $dbh->prepare ("SELECT COUNT(*) AS count FROM repeats WHERE class = ?");
my $sth2 = $dbh->prepare ("SELECT length, COUNT(length) AS count FROM repeats WHERE class = ? GROUP BY length;");
my $sth3 = $dbh->prepare ("INSERT INTO class_stats VALUES(?,?,?)");

#####################
# Extract for query #
#####################

my $i = 1;
print "i is $i\n\n";

# 90700
while ($i < 90700) {

    print "i is $i\n\n";

    # total number of repeats in this class
    $sth1->execute($i);

    my $count;

    while ( my $href = $sth1->fetchrow_hashref ) {
        $count = $href->{count};
        print "$i\t$count\n";
    }

    # counts at each length; rows are assumed to arrive in ascending order of length
    $sth2->execute($i);

    my $fraction_larger = $count;
    my $pvalue;

    while ( my $href = $sth2->fetchrow_hashref ) {
        my $length = $href->{length};
        my $length_count = $href->{count};
        $pvalue = $fraction_larger / $count;
        print "$i\t$length\t$length_count\t$fraction_larger\t$pvalue\n";
        $sth3->execute($i,$length,$pvalue);
        $fraction_larger = ($fraction_larger - $length_count);
    }

    $i++;    # move on to the next repeat class
}

print "done\n\n";

Lastly, each rep_id in the repeats table had its class and length queried against the class_stats table to extract a p-value for each repeat.
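The percentile rank stored in class_stats is the fraction of repeats in a class that are at least as long as a given length. A minimal Python sketch of that calculation follows. The function name `percentile_ranks` and the toy counts are hypothetical and serve only to illustrate the arithmetic performed by generate_pvalues.pl.

```python
def percentile_ranks(length_counts):
    """Map each repeat length to the fraction of repeats of that length or longer.

    `length_counts` maps a repeat length to the number of repeats of the
    class observed at that length.
    """
    total = sum(length_counts.values())
    remaining = total  # repeats at least as long as the current length
    ranks = {}
    for length in sorted(length_counts):
        ranks[length] = remaining / total
        remaining -= length_counts[length]
    return ranks


if __name__ == "__main__":
    # Toy class: 6 repeats of length 5, 3 of length 10 and 1 of length 20
    for length, rank in percentile_ranks({5: 6, 10: 3, 20: 1}).items():
        print(length, rank)
```

In this toy class, the single repeat of length 20 receives a rank of 1/10, meaning that only 10% of the repeats in the class are that long or longer.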

APPENDIX O

Disease-associated repeats

Table 17: Summary of disease-associated repeats from Cleary and Pearson, 2003 as detected in Satellog. Each disease is associated with one or more repeat co-ordinates.

Disease | chr | start | end | unit | length
prostate cancer risk | 20 | 46953964 | 46953975 | CAG | 4
prostate cancer risk | 20 | 46965237 | 46965257 | GCA | 7
prostate cancer risk | 20 | 46965259 | 46965287 | CAG | 9
dentatorubral-pallidoluysian atrophy/Haw River Syndrome | 12 | 6916153 | 6916199 | CAG | 15
Huntington's Disease | 4 | 3108016 | 3108074 | CAG | 19
Huntington's Disease-like 2 | 16 | 87419384 | 87419431 | CTG | 16
spinal and bulbar muscular atrophy | X | 65631950 | 65632018 | GCA | 23
spinal and bulbar muscular atrophy | X | 65632034 | 65632052 | GCA | 6
spinocerebellar ataxia 1 | 6 | 16435844 | 16435887 | TGC | 14
spinocerebellar ataxia 1 | 6 | 16435895 | 16435934 | TGC | 13
spinocerebellar ataxia 2 | 12 | 110448707 | 110448734 | GCT | 9
spinocerebellar ataxia 2 | 12 | 110448736 | 110448776 | TGC | 13
spinocerebellar ataxia 3 / Machado-Joseph Disease | 14 | 90527396 | 90527419 | CTG | 8
spinocerebellar ataxia 6 | 19 | 13179673 | 13179712 | CTG | 13
spinocerebellar ataxia 7 | 3 | 63855699 | 63855730 | GCA | 10
spinocerebellar ataxia 17 | 6 | 170727556 | 170727614 | CAG | 19
infantile spasm syndrome | X | 24393203 | 24393234 | GCC | 10
cleidocranial dysplasia | 6 | 45437342 | 45437356 | GGC | 5
cleidocranial dysplasia | 6 | 45437358 | 45437374 | GCG | 5
hand-foot-genital syndrome | 7 | 26981724 | 26981734 | GCC | 3
hand-foot-genital syndrome | 7 | 26981742 | 26981753 | GCC | 4
hand-foot-genital syndrome | 7 | 26981820 | 26981830 | GCC | 3
hand-foot-genital syndrome | 7 | 26981847 | 26981858 | GCC | 4
synpolydactyly | 2 | 177160255 | 177160270 | GGC | 5
synpolydactyly | 2 | 177160330 | 177160344 | GGC | 5
synpolydactyly | 2 | 177160355 | 177160371 | GCG | 5
oculopharyngeal muscular dystrophy | 14 | 21780809 | 21780829 | GGC | 7
holoprosencephaly | 13 | 98332395 | 98332411 | GCG | 5
holoprosencephaly | 13 | 98332445 | 98332454 | GGC | 3
holoprosencephaly | 13 | 98335704 | 98335714 | GCG | 3
holoprosencephaly | 13 | 98335716 | 98335729 | GCG | 4
holoprosencephaly | 13 | 98335731 | 98335744 | GCG | 4
Myotonic Dystrophy | 19 | 50965303 | 50965364 | CAG | 20
unknown | 14 | 75483803 | 75483834 | TGC | 10
unknown | 14 | 75483836 | 75483855 | TGC | 6
possible bipolar disorder | 18 | 51402372 | 51402447 | AGC | 25
spinocerebellar ataxia 12 | 5 | 146286801 | 146286832 | GCT | 10
Fragile X (A subtype) | X | 145661208 | 145661218 | GGC | 3
Fragile X (A subtype) | X | 145661287 | 145661316 | GGC | 10
Fragile X (E subtype) | X | 146287684 | 146287698 | GCC | 5
Fragile X (E subtype) | X | 146287711 | 146287757 | GCC | 15
Fragile X (E subtype) | X | 146287806 | 146287815 | CCG | 3
Fragile X (E subtype) | X | 146287817 | 146287826 | GCC | 3
Fragile X (E subtype) | X | 146288149 | 146288162 | CCG | 4
Fragile X (F subtype) | X | 147419081 | 147419105 | CGC | 8
Jacobsen Syndrome | 11 | 118614652 | 118614685 | CGG | 11
Myotonic Dystrophy 2 | 3 | 130212329 | 130212359 | CAGG | 7
progressive myoclonic epilepsy type 1 | 21 | 44052526 | 44052562 | GCGCGGGGCGGG | 3
spinocerebellar ataxia 10 | 22 | 44467801 | 44467870 | ATTCT | 14
Friedreich's ataxia | 9 | 67109320 | 67109339 | AAG | 6
spinocerebellar ataxia 8 | 13 | 68511517 | 68511562 | CTG | 15

191 APPENDIX P

Schizophrenia and bipolar disorder linkage regions (adapted from (Sklar, 2002)).

Table 18: Summary of schizophrenia and bipolar disorder linkage regions from (Sklar, 2002). This table summarizes the linkage studies in the paper and includes the cytogenetic band, genetic marker (with co-ordinates) of each study cited in the review. The ref column refers to the PubMed ID of each linkage study. This represents a portion of the linkage table in Satellog.

Disease | band | marker | chr | start | end | ref
BP | 13q32.3 | D13S1271-D13S779 | 13 | 97566284 | 99202143 | 10374733
BP | 13q22.1 | D13S800 | 13 | 71672693 | 71672988 | 9184308
BP | 13q32.1 | D13S793 | 13 | 95549764 | 95750042 | 9184308
BP | 13q32.1 | D13S154 | 13 | 93960285 | 93960543 | 11149935
BP | 13q32.3 | D13S225-D13S796 | 13 | 99244463 | 105587128 | 10631152
BP | 13q32.3 | D13S225-D13S796 | 13 | 99244463 | 105587128 | 11673797
BP | 13q32.3 | D13S779 | 13 | 99201956 | 99202143 | 11673797
BP | 21q22.3 | D21S171 | 21 | 44848869 | 44848988 | 7647797
BP | 21q22.3 | D21S1260 | 21 | 41716438 | 41716647 | 9915960
BP | 21q22.1 | D21S1254 | 21 | 33995763 | 33996026 | 9184307
BP | 21q21.2-21q21.3 | D21S265 | 21 | 25841358 | 25841605 | 9184307
BP | 21q22.12-21q22.13 | D21S1252 | 21 | 36747281 | 36747527 | 9184307
BP | 21q22.13 | D21S1440 | 21 | 38062023 | 38062184 | 9184307
BP | 22q11.22 | D22S303 | 22 | 21599366 | 21599581 | 9129709
BP | 22q12.3 | D22S278 | 22 | 34678466 | 34678703 | 11149935
BP | 22q12.3 | D22S278 | 22 | 34678466 | 34678703 | 11149935
BP | 22q11.23-22q12.1 | D22S419 | 22 | 24267850 | 24268118 | 9184305
BP | 22q12.1-22q12.3 | D22S689-D22S685 | 22 | 27181014 | 27181237 | 10318931
BP | 18p11.22 | D18S21 | 18 | 8552482 | 8552642 | 8016089
BP | NULL | D18S32 | 18 | 0 | 0 | 9006397
BP | 18p11.22 | D18S53 | 18 | 11482737 | 11482915 | 9529343
BP | 18q21.31 | D18S41 | 18 | 52500001 | 54600000 | 9529343
BP | 18q23 | D18S554 | 18 | 73170118 | 73170331 | 8630501
BP | 18q23 | D18S70 | 18 | 75963363 | 75963476 | 8630501
BP | 18q21.33 | D18S51 | 18 | 59097813 | 59098118 | 8731454
BP | 18q22.2 | D18S61 | 18 | 65585088 | 65585244 | 8731454
BP | 18q22.3 | D18S541 | 18 | 68323159 | 68323445 | 9399888
BP | 18q22-23 | NULL | 18 | 59800001 | 76115139 | 10089014
BP | 18q12.3 | D18S1145 | 18 | 35400001 | 41700000 | 11673797
BP | 12q24.11 | ATP2A2 | 12 | 109182151 | 109251615 | 8199789
BP | 12q24.31 | D12S1639 | 12 | 124731268 | 124731488 | 9800214
BP | 12q24.21 | D12S2070 | 12 | 114494662 | 114494765 | 10318931
BP | 12q24.21 | D12S2070 | 12 | 114494662 | 114494765 | 10631152
BP | 4p16.1 | D4S394 | 4 | 7024570 | 7024766 | 8630499
BP | NULL | D4S4394 | 4 | 0 | 0 | 9774780
BP | 4p15.1 | D4S2408-D4S2632 | 4 | 31055117 | 35601445 | 10318931
SCZ | 8p21.3 | D8S258 | 8 | 20377523 | 20377672 | 7573181
SCZ | 8p21.2 | D8S1771 | 8 | 25463145 | 25463370 | 9731535
SCZ | 8p21.2 | D8S1752 | 8 | 22690067 | 22690218 | 9731535
SCZ | 8p21.2 | D8S1771 | 8 | 25463145 | 25463370 | 11126395
SCZ | 8p21.3 | D8S258 | 8 | 20377523 | 20377672 | 8942448
SCZ | 8p22 | D8S261 | 8 | 12700001 | 18700000 | 8950417
SCZ | 8p21.3 | D8S439 | 8 | 22271321 | 22271581 | 9754621
SCZ | 8p12 | D8S1791 | 8 | 38171429 | 38171667 | 9674972
SCZ | 8p21.3 | D8S136 | 8 | 22455333 | 22455402 | 10784452
SCZ | 8p21.1 | D8S1771 | 8 | 25463145 | 25463370 | 11179014
SCZ | 1q21.3-1q23.3 | D1S1653-D1S1677 | 1 | 155149566 | 155149673 | 10784452
SCZ | 1q21-22 | NULL | 1 | 40800001 | 153700000 | 9754621
SCZ | 1q21-22 | NULL | 1 | 40800001 | 153700000 | 11179014
SCZ | 1q23.3 | D1S2675 | 1 | 159397370 | 159397531 | 11126394
SCZ | 13q32.2 | D13S128 | 13 | 96558115 | 96558272 | 11126394
SCZ | 13q32.2 | D13S128 | 13 | 96558115 | 96558272 | 9050933
SCZ | 13q33.1 | D13S174 | 13 | 100652077 | 100652253 | 9731535
SCZ | 13q31.1 | D13S170 | 13 | 78907238 | 78907358 | 9754621
SCZ | 13q32.1-13q32.3 | D13S793-D13S779 | 13 | 99201956 | 99202143 | 10784452
SCZ | 6p22.3 | D6S260 | 6 | 15512626 | 15512804 | 7647789
SCZ | 6p24.3 | D6S296 | 6 | 8798698 | 8798995 | 12140777
SCZ | 6p22.3 | D6S274-D6S285 | 6 | 16854125 | 18679238 | 7581458
SCZ | 6p22.3 | D6S274 | 6 | 16854125 | 16854302 | 7581457
SCZ | 6p22.3 | D6S285 | 6 | 18679024 | 18679238 | 7581457
SCZ | 6p21.32 | HLA-DQB1 | 6 | 32673703 | 32680835 | 11920855
SCZ | 6p22.3 | D6S260 | 6 | 15512626 | 15512804 | 11920855
SCZ | 6p24.3 | D6S470 | 6 | 10133771 | 10133890 | 8950417
SCZ | 6q21 | D6S416 | 6 | 112496958 | 112497214 | 9226366
SCZ | 6q16.1 | D6S424 | 6 | 95514543 | 95514789 | 9226366
SCZ | 6q16.1 | D6S424 | 6 | 95514543 | 95514789 | 10402499
SCZ | 6q16.1-6q16.3 | D6S424-D6S301 | 6 | 95514543 | 95514789 | 10402499
SCZ | 6q15 | D6S1570 | 6 | 91189580 | 91189714 | 11096332
SCZ | 6p22.3 | D6S242 | 6 | 23689569 | 23689852 | 10924404
SCZ | 10p12.31 | D10S1423 | 10 | 19441909 | 19442134 | 9674973
SCZ | 10p12.31 | D10S1714 | 10 | 18844109 | 18844295 | 9674975
SCZ | 10p12.1 | D10S2443 | 10 | 26765475 | 26765569 | 0
SCZ | 10p12.2 | D10S245 | 10 | 23607341 | 23607508 | 11673797
SCZ | 22q12.3 | IL2RB | 22 | 35764920 | 35789001 | 8178837
SCZ | 22q11-13 | D22S84 | 22 | 11800001 | 24300000 | 7909992
SCZ | 22q13.33 | D22S55 | 22 | 47700001 | 49396972 | 7909992
SCZ | 22q11.21 | D22S446 | 22 | 20343712 | 20343913 | 9754621
SCZ | 22q12.3 | D22S283 | 22 | 35022762 | 35022895 | 9754621
SCZ | 15q14 | 300kb of a-nicotinic receptor | 15 | 29838757 | 30477178 | 8776738
SCZ | 15q14 | D15S1012 | 15 | 36723599 | 36723772 | 11001582

APPENDIX Q

Code for repeat prioritization in schizophrenia and bipolar disorder linkage regions

###############################################################################
# Shell script to extract all repeats within 50 Mb of linkage genetic markers #
###############################################################################

#!/bin/sh
#
# Automated Queries to schz_db
# Perseus Missirlis
# 040128

echo "
SELECT r.rep_id, l.disease, l.link_id
FROM repeats r, linkage l
WHERE r.chr = l.chr AND r.start >= l.pstart AND r.end <= l.qend;
" | mysql --quick -h athena -u schz_rw -prepeat schz_db > repeats_in_linkage.txt

#################################################
# Table to store all repeats in linkage regions #
#################################################

CREATE TABLE repeats_in_linkage ( rep_id INT, disease VARCHAR(10) NOT NULL, link_id INT ) ;

CREATE INDEX rep_link ON repeats_in_linkage (rep_id,link_id);
CREATE INDEX rep_link_disease ON repeats_in_linkage (rep_id, disease, link_id) ;

#############
# LOAD DATA #
#############

LOAD DATA INFILE '/home/perseusm/My_Documents/Publications/Satellog/Results/repeats_in_linkage.txt' INTO TABLE repeats_in_linkage IGNORE 1 LINES;

################################################
# GET ALL DISTINCT rep_id's of linked rep_id's #
################################################

#!/bin/sh
#
# Automated Queries to schz_db
# Perseus Missirlis
# 040128

echo " SELECT DISTINCT rep_id FROM repeats_in_linkage " | mysql --quick -h athena -u schz_rw -prepeat schz_db > distinct_repeats_in_linkage.txt

###########################
# Calculate Linkage Depth #
###########################

CREATE TABLE linkage_depth ( rep_id INT NOT NULL, disease VARCHAR(10) NOT NULL, linkage_depth INT ) ;

####################
# linkage_depth.pl #
####################

#!/usr/bin/perl
# queries.pl
# usage: queries.pl somefile.txt
# Perseus Missirlis - Jan 29, 2004
# Last updated:
use strict;
use DBI;

# DBI
my ($dsn) = "DBI:mysql:schz_db:athena.bcgsc.ca";
my ($user_name) = "schz_rw";
my ($password) = "repeat";
my ($dbh, $sth);
my (@ary);

#######################
# Connect to Database #
#######################

$dbh = DBI->connect ($dsn, $user_name, $password, { RaiseError => 1 });

# disease is carried through so that each row matches the three columns
# of the linkage_depth table (rep_id, disease, linkage_depth)
my $sth1 = $dbh->prepare ("SELECT rep_id, disease, COUNT(link_id) AS linkage_depth FROM repeats_in_linkage WHERE rep_id = ? GROUP BY rep_id, disease;");
my $sth2 = $dbh->prepare ("INSERT INTO linkage_depth VALUES(?,?,?)");

#####################
# Extract for query #
#####################

while (<>) {

    if (/^(\d+)/) {

        my $rep_id = $1;

        $sth1->execute ($rep_id);

        while ( my $href = $sth1->fetchrow_hashref ) {

            my $disease = $href->{disease};
            my $linkage_depth = $href->{linkage_depth};

            $sth2->execute($rep_id,$disease,$linkage_depth);
        }
    }
}

# Run

./linkage_depth.pl distinct_repeats_in_linkage.txt
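Linkage depth is simply the number of linkage regions for a disease that contain a given repeat. The Python sketch below restates the SQL containment test and the counting step in one place. It is an illustration only: the function name `linkage_depth`, the tuple layouts and the toy co-ordinates are hypothetical and are not taken from Satellog.

```python
from collections import Counter


def linkage_depth(repeats, regions):
    """Count, per (rep_id, disease), the linkage regions that contain a repeat.

    repeats: iterable of (rep_id, chrom, start, end)
    regions: iterable of (disease, chrom, region_start, region_end)
    A repeat counts towards a region when it lies on the same chromosome
    and falls entirely within the region.
    """
    depth = Counter()
    for rep_id, chrom, start, end in repeats:
        for disease, r_chrom, r_start, r_end in regions:
            if chrom == r_chrom and start >= r_start and end <= r_end:
                depth[(rep_id, disease)] += 1
    return dict(depth)


if __name__ == "__main__":
    # Toy data, for illustration only
    repeats = [(1, "18", 60000000, 60000030), (2, "22", 20000000, 20000020)]
    regions = [
        ("BP", "18", 59800001, 76115139),
        ("BP", "18", 59000000, 60500000),
        ("SCZ", "22", 11800001, 24300000),
    ]
    print(linkage_depth(repeats, regions))
```

A repeat that falls inside several overlapping linkage regions for the same disease therefore receives a higher depth than one supported by a single study.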

#######################################################################################
# SELECT ALL TRANSCRIBED, POLYMORPHIC CANDIDATE REPEATS FOR SCHIZOPHRENIA AND BIPOLAR #
#######################################################################################

#!/bin/sh
#
# Automated Queries to schz_db
# Perseus Missirlis
# 040128

echo "
SELECT DISTINCT r.rep_id, r.chr, r.start, r.end, r.unit, r.period, r.class, r.length, r.pvalue, ld.linkage_depth, u.count, u.min, u.max, u.mean, u.sd
FROM repeats r, linkage_depth ld, ugstats u
WHERE u.sd > 0 AND u.rep_id = ld.rep_id AND u.rep_id = r.rep_id AND ld.disease = \"SCZ\";
" | mysql --quick -h athena -u schz_rw -prepeat schz_db > schz_cand.txt

echo "
SELECT DISTINCT r.rep_id, r.chr, r.start, r.end, r.unit, r.period, r.class, r.length, r.pvalue, ld.linkage_depth, u.count, u.min, u.max, u.mean, u.sd
FROM repeats r, linkage_depth ld, ugstats u
WHERE u.sd > 0 AND u.rep_id = ld.rep_id AND u.rep_id = r.rep_id AND ld.disease = \"BP\";
" | mysql --quick -h athena -u schz_rw -prepeat schz_db > bp_cand.txt

#####################
# DROP TABLE SYNTAX #
#####################

# SCHZ

DROP TABLE schz_cand;

# BP

DROP TABLE bp_cand;

#######################
# CREATE TABLE SYNTAX #
#######################

# SCHZ

CREATE TABLE schz_cand (
    rep_id INT NOT NULL,
    chr VARCHAR(4),
    start INT UNSIGNED,
    end INT UNSIGNED,
    unit VARCHAR(16),
    period TINYINT UNSIGNED,
    class_id INT,
    length INT UNSIGNED,
    pvalue DECIMAL(8,6) NULL,
    linkage_depth INT NOT NULL,
    count INT NOT NULL,
    min INT NOT NULL,
    max INT NOT NULL,
    mean DECIMAL(8,2) NOT NULL,
    sd DECIMAL(8,2) NULL,
    gene_location VARCHAR(20),
    pep VARCHAR(150) NULL,
    name VARCHAR(15) NULL,
    description TEXT NULL,
    tissue VARCHAR(15) NOT NULL,
    call CHAR(1) NOT NULL
) ;

# BP

CREATE TABLE bp_cand (
    rep_id INT NOT NULL,
    chr VARCHAR(4),
    start INT UNSIGNED,
    end INT UNSIGNED,
    unit VARCHAR(16),
    period TINYINT UNSIGNED,
    class_id INT,
    length INT UNSIGNED,
    pvalue DECIMAL(8,6) NULL,
    linkage_depth INT NOT NULL,
    count INT NOT NULL,
    min INT NOT NULL,
    max INT NOT NULL,
    mean DECIMAL(8,2) NOT NULL,
    sd DECIMAL(8,2) NULL,
    gene_location VARCHAR(20),
    pep VARCHAR(150) NULL,
    name VARCHAR(15) NULL,
    description TEXT NULL,
    tissue VARCHAR(15) NOT NULL,
    call CHAR(1) NOT NULL
) ;

###################
# POPULATE TABLES #
###################

./expressed_in_brain_candidates.pl schz_cand.txt
./expressed_in_brain_candidates.pl bp_cand.txt

# Note: Remember to change input table in $sth3 and $sth4 below

####################################
# expressed_in_brain_candidates.pl #
####################################

#!/usr/bin/perl
# expressed_in_brain.pl
# usage:
#   expressed_in_brain.pl somefile.txt
# Perseus Missirlis - Jan 29, 2004
# Last updated:

use strict;
use DBI;

# DBI
my ($dsn) = "DBI:mysql:schz_db:athena.bcgsc.ca";
my ($user_name) = "schz_rw";
my ($password) = "repeat";
my ($dbh, $sth);
my (@ary);

#######################
# Connect to Database #
#######################

$dbh = DBI->connect ($dsn, $user_name, $password, { RaiseError => 1 });

# Gene annotation for a repeat
my $sth1 = $dbh->prepare ("SELECT DISTINCT r.rep_id AS a, t.gene_location AS b, t.pep AS c,
    e.name AS d, e.description AS e
    FROM repeats r, transcripts t, ens_db e
    WHERE r.rep_id = ? AND r.rep_id = t.rep_id AND t.ens_id = e.ens_id;");

# Same annotation, restricted to genes called present (P) in brain
my $sth2 = $dbh->prepare ("SELECT DISTINCT r.rep_id AS a, t.gene_location AS b, t.pep AS c,
    e.name AS d, e.description AS e, g.tissue AS f, g.call AS g
    FROM repeats r, transcripts t, ens_db e, affy a, GeneNote g
    WHERE r.rep_id = ? AND r.rep_id = t.rep_id AND t.ens_id = e.ens_id
    AND e.ens_id = a.ens_id AND a.g_id = g.g_id
    AND g.tissue = \"Brain\" AND g.call = \"P\";");

my $sth3 = $dbh->prepare("INSERT INTO bp_cand VALUES(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)");
my $sth4 = $dbh->prepare("UPDATE bp_cand SET tissue = ?, call = ? WHERE rep_id = ?");

#####################
# Extract for query #
#####################

my $null = "NULL";

while (<>) {

    if (/^(\d+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+\.\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)/) {

        my $rep_id = $1;
        my $chr = $2;
        my $start = $3;
        my $end = $4;
        my $unit = $5;
        my $period = $6;
        my $class = $7;
        my $length = $8;
        my $pvalue = $9;
        my $link_depth = $10;
        my $count = $11;
        my $min = $12;
        my $max = $13;
        my $mean = $14;
        my $sd = $15;

        # Insert one row per annotated transcript, tissue and call left as "NULL"
        $sth1->execute($1);

        while ( my $href = $sth1->fetchrow_hashref ) {

            my $gene_location = $href->{b};
            my $pep = $href->{c};
            my $name = $href->{d};
            my $description = $href->{e};

            $sth3->execute($rep_id,$chr,$start,$end,$unit,$period,$class,$length,$pvalue,$link_depth,$count,$min,$max,$mean,$sd,$gene_location,$pep,$name,$description,$null,$null);

        }

        # Fill in tissue and call for repeats in brain-expressed genes
        $sth2->execute($1);

        while ( my $href = $sth2->fetchrow_hashref ) {

            my $tissue = $href->{f};
            my $call = $href->{g};

            $sth4->execute($tissue,$call,$rep_id);

        }
    }
}

##############################################################
# SELECT ALL CANDIDATE REPEATS FOR SCHIZOPHRENIA AND BIPOLAR #
##############################################################

#!/bin/sh
#
# Automated Queries to schz_db
# Perseus Missirlis
# 040128

echo "
SELECT DISTINCT r.rep_id, r.chr, r.start, r.end, r.unit, r.period, r.class,
       r.length, r.pvalue, ld.linkage_depth
FROM repeats r, linkage_depth ld
WHERE r.rep_id = ld.rep_id
  AND ld.disease = \"SCZ\";
" | mysql --quick -h athena -u schz_rw -prepeat schz_db > schz_cand_global.txt

echo "
SELECT DISTINCT r.rep_id, r.chr, r.start, r.end, r.unit, r.period, r.class,
       r.length, r.pvalue, ld.linkage_depth
FROM repeats r, linkage_depth ld
WHERE r.rep_id = ld.rep_id
  AND ld.disease = \"BP\";
" | mysql --quick -h athena -u schz_rw -prepeat schz_db > bp_cand_global.txt

#####################
# DROP TABLE SYNTAX #
#####################

# SCHZ

DROP TABLE schz_cand_global;

# BP

DROP TABLE bp_cand_global;

#######################
# CREATE TABLE SYNTAX #
#######################

# SCHZ

CREATE TABLE schz_cand_global (
    schz_cand_global_id INT auto_increment PRIMARY KEY,
    rep_id INT NOT NULL,
    chr VARCHAR(4),
    start INT UNSIGNED,
    end INT UNSIGNED,
    unit VARCHAR(16),
    period TINYINT UNSIGNED,
    class_id INT,
    length INT UNSIGNED,
    pvalue DECIMAL(8,6) NULL,
    linkage_depth INT NOT NULL,
    gene_location VARCHAR(20),
    pep VARCHAR(150) NULL,
    name VARCHAR(15) NULL,
    description TEXT NULL,
    tissue VARCHAR(15) NOT NULL,
    call CHAR(1) NOT NULL
);

CREATE INDEX lookup ON schz_cand_global (rep_id, schz_cand_global_id);

# BP

CREATE TABLE bp_cand_global (
    bp_cand_global_id INT auto_increment PRIMARY KEY,
    rep_id INT NOT NULL,
    chr VARCHAR(4),
    start INT UNSIGNED,
    end INT UNSIGNED,
    unit VARCHAR(16),
    period TINYINT UNSIGNED,
    class_id INT,
    length INT UNSIGNED,
    pvalue DECIMAL(8,6) NULL,
    linkage_depth INT NOT NULL,
    gene_location VARCHAR(20),
    pep VARCHAR(150) NULL,
    name VARCHAR(15) NULL,
    description TEXT NULL,
    tissue VARCHAR(15) NOT NULL,
    call CHAR(1) NOT NULL
);

CREATE INDEX lookup ON bp_cand_global (rep_id, bp_cand_global_id);

###################
# POPULATE TABLES #
###################

./expressed_in_brain_global.pl schz_global_candidates.txt &
./expressed_in_brain_global2.pl bp_global_candidates.txt &

# Note: Remember to change input table in $sth3 and $sth4 below

################################
# expressed_in_brain_global.pl #
################################

#!/usr/bin/perl
# expressed_in_brain_global.pl
# usage: expressed_in_brain_global.pl somefile.txt
# Perseus Missirlis - Jan 29, 2004
# Last updated:

use strict;
use DBI;

# DBI
my ($dsn) = "DBI:mysql:schz_db:athena.bcgsc.ca";
my ($user_name) = "schz_rw";
my ($password) = "repeat";
my ($dbh, $sth);
my (@ary);

#######################
# Connect to Database #
#######################

$dbh = DBI->connect ($dsn, $user_name, $password, { RaiseError => 1 });

# Gene annotation for a repeat
my $sth1 = $dbh->prepare ("SELECT DISTINCT r.rep_id AS a, t.gene_location AS b, t.pep AS c,
    e.name AS d, e.description AS e
    FROM repeats r, transcripts t, ens_db e
    WHERE r.rep_id = ? AND r.rep_id = t.rep_id AND t.ens_id = e.ens_id;");

# Same annotation, restricted to genes called present (P) in brain
my $sth2 = $dbh->prepare ("SELECT DISTINCT r.rep_id AS a, t.gene_location AS b, t.pep AS c,
    e.name AS d, e.description AS e, g.tissue AS f, g.call AS g
    FROM repeats r, transcripts t, ens_db e, affy a, GeneNote g
    WHERE r.rep_id = ? AND r.rep_id = t.rep_id AND t.ens_id = e.ens_id
    AND e.ens_id = a.ens_id AND a.g_id = g.g_id
    AND g.tissue = \"Brain\" AND g.call = \"P\";");

my $sth3 = $dbh->prepare("INSERT INTO schz_cand_global VALUES('NULL',?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)");
my $sth4 = $dbh->prepare("UPDATE schz_cand_global SET tissue = ?, call = ? WHERE rep_id = ?");

#####################
# Extract for query #
#####################

my $null = "NULL";

while (<>) {

    if (/^(\d+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+\.\d+)\s+(\d+)/) {

        my $rep_id = $1;
        my $chr = $2;
        my $start = $3;
        my $end = $4;
        my $unit = $5;
        my $period = $6;
        my $class = $7;
        my $length = $8;
        my $pvalue = $9;
        my $link_depth = $10;

        # Insert one row per annotated transcript, tissue and call left as "NULL"
        $sth1->execute($1);

        while ( my $href = $sth1->fetchrow_hashref ) {

            my $gene_location = $href->{b};
            my $pep = $href->{c};
            my $name = $href->{d};
            my $description = $href->{e};

            $sth3->execute($rep_id,$chr,$start,$end,$unit,$period,$class,$length,$pvalue,$link_depth,$gene_location,$pep,$name,$description,$null,$null);

        }

        # Fill in tissue and call for repeats in brain-expressed genes
        $sth2->execute($1);

        while ( my $href = $sth2->fetchrow_hashref ) {

            my $tissue = $href->{f};
            my $call = $href->{g};

            $sth4->execute($tissue,$call,$rep_id);

        }
    }
}

#################################################################
# COLLECT ALL REPEATS IN OVERLAPPING SCZ AND BP LINKAGE REGIONS #
#################################################################

CREATE TABLE schz_bp_cand
SELECT s.rep_id, s.chr, s.start, s.end, s.unit, s.period, s.class_id, s.length,
       s.pvalue, s.linkage_depth, s.count, s.min, s.max, s.mean, s.sd,
       s.gene_location, s.pep, s.name, s.description, s.tissue, s.call
FROM schz_cand s, bp_cand b
WHERE s.rep_id = b.rep_id;
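
The join on rep_id keeps only repeats that appear in both the schizophrenia and the bipolar candidate tables, and it reports the schizophrenia-side columns. A minimal sketch of the same intersection follows, assuming Python's built-in sqlite3 module in place of the MySQL server and a reduced, hypothetical column set. The rows are invented for illustration and are not thesis data, and the function name is hypothetical.

```python
import sqlite3


def shared_candidates(schz_rows, bp_rows):
    """Return schz_cand rows whose rep_id also occurs in bp_cand."""
    con = sqlite3.connect(":memory:")
    # Reduced, hypothetical schema: only four of the original columns.
    for table in ("schz_cand", "bp_cand"):
        con.execute(
            f"CREATE TABLE {table} "
            "(rep_id INTEGER, chr TEXT, unit TEXT, linkage_depth INTEGER)"
        )
    con.executemany("INSERT INTO schz_cand VALUES (?,?,?,?)", schz_rows)
    con.executemany("INSERT INTO bp_cand VALUES (?,?,?,?)", bp_rows)
    # Same shape as the listing: build the table from a join on rep_id.
    con.execute(
        "CREATE TABLE schz_bp_cand AS "
        "SELECT s.rep_id, s.chr, s.unit, s.linkage_depth "
        "FROM schz_cand s, bp_cand b WHERE s.rep_id = b.rep_id"
    )
    rows = con.execute("SELECT * FROM schz_bp_cand ORDER BY rep_id").fetchall()
    con.close()
    return rows


# Invented example rows (rep_id, chr, unit, linkage_depth).
SCHZ = [(1, "6", "CAG", 3), (2, "13", "GT", 1), (3, "22", "CTG", 2)]
BP = [(2, "13", "GT", 2), (3, "22", "CTG", 1), (4, "18", "A", 1)]

print(shared_candidates(SCHZ, BP))
```

Because the select list draws only on the schizophrenia table, the linkage_depth carried into schz_bp_cand is the schizophrenia value. A rep_id that occurs on several rows in either input (for example, one row per transcript) yields one output row per matching pair.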

#########################################
# SHELL SCRIPT TO SELECT TOP 20 REPEATS #
#########################################

#!/bin/sh
#
# Automated Queries to schz_db
# Perseus Missirlis
# 040128

#############################################################
# SELECT TOP 20 POLYMORPHIC SCHIZOPHRENIA CANDIDATE REPEATS #
#############################################################

echo "
SELECT DISTINCT rep_id, chr, start, end, unit, length, pvalue, linkage_depth, sd,
       gene_location, pep, name, (linkage_depth*sd/pvalue) AS score
FROM schz_cand
WHERE tissue = \"Brain\" AND call = \"P\"
  AND (gene_location = \"exon\" OR gene_location = \"5utr\" OR gene_location = \"3utr\")
ORDER BY score DESC
LIMIT 50;
" | mysql --quick -h athena -u schz_rw -prepeat schz_db > poly_schz_cand.txt

######################################################################
# SELECT TOP 20 GLOBALLY PRIORITIZED SCHIZOPHRENIA CANDIDATE REPEATS #
######################################################################

echo "
SELECT DISTINCT rep_id, chr, start, end, unit, length, pvalue, linkage_depth,
       gene_location, pep, name, (linkage_depth/pvalue) AS score
FROM schz_cand_global
WHERE tissue = \"Brain\" AND call = \"P\"
ORDER BY score DESC
LIMIT 50;
" | mysql --quick -h athena -u schz_rw -prepeat schz_db > global_schz_cand.txt

#######################################################
# SELECT TOP 20 POLYMORPHIC BIPOLAR CANDIDATE REPEATS #
#######################################################

echo "
SELECT DISTINCT rep_id, chr, start, end, unit, length, pvalue, linkage_depth, sd,
       gene_location, pep, name, (linkage_depth*sd/pvalue) AS score
FROM bp_cand
WHERE tissue = \"Brain\" AND call = \"P\"
  AND (gene_location = \"exon\" OR gene_location = \"5utr\" OR gene_location = \"3utr\")
ORDER BY score DESC
LIMIT 50;
" | mysql --quick -h athena -u schz_rw -prepeat schz_db > poly_bp_cand.txt

################################################################
# SELECT TOP 20 GLOBALLY PRIORITIZED BIPOLAR CANDIDATE REPEATS #
################################################################

echo "
SELECT DISTINCT rep_id, chr, start, end, unit, length, pvalue, linkage_depth,
       gene_location, pep, name, (linkage_depth/pvalue) AS score
FROM bp_cand_global
WHERE tissue = \"Brain\" AND call = \"P\"
ORDER BY score DESC
LIMIT 50;
" | mysql --quick -h athena -u schz_rw -prepeat schz_db > global_bp_cand.txt
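
The polymorphic queries rank repeats by linkage_depth*sd/pvalue, so a repeat scores higher with a larger linkage_depth, a larger sd, or a smaller pvalue. In SQL, AND binds more tightly than OR, so the three gene_location alternatives are only subject to the Brain/P conditions when they are grouped in parentheses. The sketch below illustrates both points in Python, whose `and`/`or` precedence matches SQL's. The function names are hypothetical and the rows are invented for illustration, not thesis data.

```python
TRANSCRIBED = ("exon", "5utr", "3utr")


def polymorphic_score(row):
    """linkage_depth * sd / pvalue, as in the query (assumes pvalue > 0)."""
    return row["linkage_depth"] * row["sd"] / row["pvalue"]


def grouped_filter(row):
    """Brain/P AND (exon OR 5utr OR 3utr)."""
    return (row["tissue"] == "Brain" and row["call"] == "P"
            and row["gene_location"] in TRANSCRIBED)


def ungrouped_filter(row):
    """Same conditions without parentheses: the AND chain applies to exon only."""
    return (row["tissue"] == "Brain" and row["call"] == "P"
            and row["gene_location"] == "exon"
            or row["gene_location"] == "5utr"
            or row["gene_location"] == "3utr")


def rank_candidates(rows, keep, limit=50):
    """Filter rows with `keep`, then sort by descending score and truncate."""
    kept = [r for r in rows if keep(r)]
    return sorted(kept, key=polymorphic_score, reverse=True)[:limit]


# Invented rows; rep 3 is a 5' UTR repeat with no brain expression call.
ROWS = [
    {"rep_id": 1, "linkage_depth": 3, "sd": 2.0, "pvalue": 0.001,
     "gene_location": "exon", "tissue": "Brain", "call": "P"},
    {"rep_id": 2, "linkage_depth": 1, "sd": 0.5, "pvalue": 0.01,
     "gene_location": "3utr", "tissue": "Brain", "call": "P"},
    {"rep_id": 3, "linkage_depth": 5, "sd": 4.0, "pvalue": 0.0001,
     "gene_location": "5utr", "tissue": "NULL", "call": "NULL"},
    {"rep_id": 4, "linkage_depth": 2, "sd": 1.0, "pvalue": 0.002,
     "gene_location": "intron", "tissue": "Brain", "call": "P"},
]

print([r["rep_id"] for r in rank_candidates(ROWS, grouped_filter)])
print([r["rep_id"] for r in rank_candidates(ROWS, ungrouped_filter)])
```

With the grouped filter only repeats 1 and 2 are ranked. Without grouping, repeat 3 passes despite lacking a brain expression call and, with these invented values, takes the top rank.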

#########################################################################
# SHELL SCRIPT TO SELECT TOP 20 REPEATS FROM DISEASE-ASSOCIATED CLASSES #
#########################################################################

#!/bin/sh
#
# Automated Queries to schz_db
# Perseus Missirlis
# 040128

#############################################################
# SELECT TOP 20 POLYMORPHIC SCHIZOPHRENIA CANDIDATE REPEATS #
#############################################################

echo "
SELECT DISTINCT rep_id, chr, start, end, unit, length, pvalue, linkage_depth, sd,
       gene_location, pep, name, (linkage_depth*sd/pvalue) AS score
FROM schz_cand
WHERE tissue = \"Brain\" AND call = \"P\"
  AND (gene_location = \"exon\" OR gene_location = \"5utr\" OR gene_location = \"3utr\")
  AND (class_id = 182 OR class_id = 381 OR class_id = 51 OR class_id = 36
       OR class_id = 285 OR class_id = 42495)
ORDER BY score DESC
LIMIT 50;
" | mysql --quick -h athena -u schz_rw -prepeat schz_db > poly_schz_cand_dis.txt

######################################################################
# SELECT TOP 20 GLOBALLY PRIORITIZED SCHIZOPHRENIA CANDIDATE REPEATS #
######################################################################

echo "
SELECT DISTINCT rep_id, chr, start, end, unit, length, pvalue, linkage_depth,
       gene_location, pep, name, (linkage_depth/pvalue) AS score
FROM schz_cand_global
WHERE tissue = \"Brain\" AND call = \"P\"
  AND (class_id = 182 OR class_id = 381 OR class_id = 51 OR class_id = 36
       OR class_id = 285 OR class_id = 42495)
ORDER BY score DESC
LIMIT 50;
" | mysql --quick -h athena -u schz_rw -prepeat schz_db > global_schz_cand_dis.txt

#######################################################
# SELECT TOP 20 POLYMORPHIC BIPOLAR CANDIDATE REPEATS #
#######################################################

echo "
SELECT DISTINCT rep_id, chr, start, end, unit, length, pvalue, linkage_depth, sd,
       gene_location, pep, name, (linkage_depth*sd/pvalue) AS score
FROM bp_cand
WHERE tissue = \"Brain\" AND call = \"P\"
  AND (gene_location = \"exon\" OR gene_location = \"5utr\" OR gene_location = \"3utr\")
  AND (class_id = 182 OR class_id = 381 OR class_id = 51 OR class_id = 36
       OR class_id = 285 OR class_id = 42495)
ORDER BY score DESC
LIMIT 50;
" | mysql --quick -h athena -u schz_rw -prepeat schz_db > poly_bp_cand_dis.txt

################################################################
# SELECT TOP 20 GLOBALLY PRIORITIZED BIPOLAR CANDIDATE REPEATS #
################################################################

echo "
SELECT DISTINCT rep_id, chr, start, end, unit, length, pvalue, linkage_depth,
       gene_location, pep, name, (linkage_depth/pvalue) AS score
FROM bp_cand_global
WHERE tissue = \"Brain\" AND call = \"P\"
  AND (class_id = 182 OR class_id = 381 OR class_id = 51 OR class_id = 36
       OR class_id = 285 OR class_id = 42495)
ORDER BY score DESC
LIMIT 50;
" | mysql --quick -h athena -u schz_rw -prepeat schz_db > global_bp_cand_dis.txt

APPENDIX R

SQL code to generate results

Extract names of genes whose CAG/CTG repeats were within CpG islands:

SELECT c.name, c.start, c.end, g.expandability,
       IF(((c.start + (c.end - c.start)) > 50000)
          AND ((c.end - (c.end - c.start)) < 50000), "TRUE", "FALSE") AS islands
FROM cpg c, gems_feat g
WHERE c.name = g.name
ORDER BY islands DESC;
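
The IF expression in this query reduces algebraically: c.start + (c.end - c.start) is c.end, and c.end - (c.end - c.start) is c.start, so a row is flagged "TRUE" exactly when c.start < 50000 < c.end. In other words, the CpG island must span coordinate 50000. The listing does not say what that coordinate represents; treating it as the position of the repeat within the extracted sequence window is an assumption made here. A minimal Python sketch of the equivalence, with hypothetical function names:

```python
REPEAT_POSITION = 50000  # assumption: coordinate of the repeat in the window


def islands_flag(start, end):
    """Literal transcription of the SQL IF condition."""
    return (start + (end - start)) > 50000 and (end - (end - start)) < 50000


def spans_position(start, end, position=REPEAT_POSITION):
    """Simplified form: the interval strictly contains `position`."""
    return start < position < end


# Invented island coordinates covering each case.
EXAMPLES = [(49000, 51000), (50000, 51000), (48000, 50000),
            (10000, 12000), (60000, 61000)]

for start, end in EXAMPLES:
    # Both forms agree on every example interval.
    assert islands_flag(start, end) == spans_position(start, end)
    print(start, end, spans_position(start, end))
```

The comparison is strict on both sides, so an island that begins or ends exactly at coordinate 50000 is reported as "FALSE".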

Extract genes, their expandability metric and flanking %GC:

SELECT g.name, g.expandability, gc.50_bp, gc.100_bp, gc.500_bp, gc.1000_bp
FROM gems_feat g, gc
WHERE g.expandability > 0 AND g.name = gc.name;

Extract genes with 100 bp flanking %GC at least equal to that of HD (0.76):

SELECT g.name, g.expandability, gc.100_bp
FROM gems_feat g, gc
WHERE g.name = gc.name AND gc.100_bp >= 0.76;

Extract genes with flanking CTCF binding site scores of 1.00 or higher:

SELECT c.name, c.score, c.distance, g.expandability, c.start, c.end
FROM ctcf c, gems_feat g
WHERE c.distance < 1000 AND c.score > 1 AND g.name = c.name
ORDER BY g.expandability DESC;

Extract genes with weak flanking CTCF scores (0 < score < 1):

SELECT c.name, c.score, c.distance, g.expandability, c.start, c.end
FROM ctcf c, gems_feat g
WHERE c.distance < 1000 AND c.score > 0 AND c.score < 1 AND g.name = c.name
ORDER BY g.expandability DESC;
