Ubc 2004-0561.Pdf

CIS-FEATURES MEDIATING CAG/CTG REPEAT INSTABILITY, THE SATELLOG DATABASE, AND CANDIDATE REPEAT PRIORITIZATION IN SCHIZOPHRENIA

by

PERSEUS IOANNIS MISSIRLIS
B.Sc.H., Queen's University, 2002

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES, GENETICS GRADUATE PROGRAM

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
August 2004
© Perseus Ioannis Missirlis, 2004

THE UNIVERSITY OF BRITISH COLUMBIA
FACULTY OF GRADUATE STUDIES

Library Authorization

In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Name of Author: Perseus Ioannis Missirlis
Date (dd/mm/yyyy): 18/08/2004
Title of Thesis: CIS-FEATURES MEDIATING CAG/CTG REPEAT INSTABILITY, THE SATELLOG DATABASE, AND CANDIDATE REPEAT PRIORITIZATION IN SCHIZOPHRENIA
Degree: Master of Science
Year: 2004
Department of Genetics Graduate Program
The University of British Columbia
Vancouver, BC Canada

ABSTRACT

Polyglutamine repeat expansions in the coding regions of unrelated genes have been implicated in the neurodegenerative phenotype of nine separate diseases. However, little is known about the role of flanking cis-sequences in mediating this repeat instability. Brock et al. identified an association between flanking GC content and CAG/CTG repeat instability at many of these disease loci by using a relative measure of repeat instability called 'expandability'.
Using this measure, we have extended the analysis of Brock and colleagues and used the expandability metric to test for associations with other features theorized to contribute to CAG/CTG repeat instability, such as repeat length and purity, proximity to CCCTC-binding factor (CTCF) binding sites, and the nucleosome formation potential of the surrounding DNA. Our results confirmed earlier relationships between flanking GC content and CAG/CTG repeat instability and also suggest a novel one involving flanking CTCF binding sites. Conversely, no relationships between expandability and repeat length, purity, or nucleosome formation were detected.

Anticipation refers to the progressive worsening of a disease phenotype and earlier age of onset in successive generations. Anticipation has been reported in a number of diseases in which repeat expansion may have a role in etiology. We developed Satellog, a database that catalogs all pure 1-16 repeat unit repeats in the human genome along with supplementary data of use for the prioritization of repeats in disease association studies. For each pure repeat we calculate the percentile rank of its length relative to other repeats of the same class in the genome, its polymorphism within UniGene clusters, its location either within or adjacent to EnsEMBL-defined genes, and its expression profile in normal tissues according to the GeneNote database. By examining the global repeat polymorphism profile, we found that highly polymorphic coding repeats were mostly restricted to trinucleotide repeats, whereas a wider range of repeat unit lengths was tolerated in untranslated sequence. We also found that 3'-UTR sequence tolerates more repeat polymorphisms than 5'-UTR or exonic sequence. Lastly, we use Satellog to prioritize repeats for disease-association studies in schizophrenia. Satellog is available as a freely downloadable MySQL and web-based database.
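As a rough illustration of the percentile-rank idea described in the abstract, the following minimal Python sketch ranks each repeat's length against other repeats of the same class. It is not the thesis's implementation: the function name, the (class, length) input format, and the choice of "fraction of same-class repeats at or below this length" as the percentile definition are all assumptions made for this example.

```python
from bisect import bisect_right
from collections import defaultdict


def length_percentile_ranks(repeats):
    """Return a percentile rank (0-100] for each (repeat_class, length) pair.

    Assumed definition: the percentage of repeats in the same class whose
    length is less than or equal to this repeat's length. The input format
    and this definition are illustrative, not taken from the thesis.
    """
    # Group the observed lengths by repeat class (e.g. "CAG").
    lengths_by_class = defaultdict(list)
    for repeat_class, length in repeats:
        lengths_by_class[repeat_class].append(length)

    # Sort each class once so that every rank lookup is a binary search.
    for lengths in lengths_by_class.values():
        lengths.sort()

    ranks = []
    for repeat_class, length in repeats:
        lengths = lengths_by_class[repeat_class]
        # Count of same-class repeats with length <= this repeat's length.
        at_or_below = bisect_right(lengths, length)
        ranks.append(100.0 * at_or_below / len(lengths))
    return ranks


# Hypothetical data: four CAG repeats and one AT repeat.
example = [("CAG", 10), ("CAG", 20), ("CAG", 30), ("CAG", 40), ("AT", 5)]
example_ranks = length_percentile_ranks(example)
```

Under this definition the longest repeat in a class always receives a rank of 100, so a repeat that is unusually long for its class stands out regardless of how long repeats of other classes tend to be.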
TABLE OF CONTENTS

ABSTRACT ii
TABLE OF CONTENTS iv
LIST OF TABLES viii
LIST OF FIGURES x
LIST OF ABBREVIATIONS xii
ACKNOWLEDGEMENTS xv
DEDICATION xvi
PREFACE xvii
CHAPTER 1 INTRODUCTION 2
1.1 cis-Features of unstable CAG/CTG repeats 2
1.1.1 Unstable repeats and disease 2
1.1.2 The argument for cis mediators of instability 4
1.1.2.1 Flanking %GC and CpG islands 4
1.1.2.2 Repeat length and purity 5
1.1.2.3 The role of the CTCF insulator protein 7
1.1.2.4 The role of nucleosomes 8
1.1.3 Objectives 9
1.1.4 Specific aims and rationale 10
1.2 The unstable repeat perspective of schizophrenia 10
1.2.1 Biology of schizophrenia 10
1.2.2 Genetics of schizophrenia 12
1.2.3 Anticipation in neuropsychiatric diseases 14
1.2.4 CAG/CTG repeats in schizophrenia 15
1.2.5 Published satellite repeat analyses and databases 18
1.2.6 Objectives 20
1.2.7 Specific aims and rationale 21
CHAPTER 2 MATERIALS AND METHODS 23
2.1 cis-Features of unstable CAG/CTG repeats 23
2.1.1 Collection of candidate CAG/CTG repeats for cis sequence analysis 23
2.1.2 Software Dependencies I 23
2.1.3 Implementing the gems_cis database 25
2.1.4 Overview of the flanker.pl script 27
2.1.4.1 Collection of flanking %GC, CpG islands, length and purity and other repetitive elements 29
2.1.4.2 Detection of flanking CTCF insulator protein binding sites 29
2.1.5 Detection of nucleosome formation potential with NucleoMeter 30
2.1.6 Statistics and plots with R 32
2.2 The satellog database 32
2.2.1 Software Dependencies II 33
2.2.2 Implementing the satellog database 33
2.2.3 Preliminary set-up 37
2.2.3.1 Detecting pure repeats with Tandem Repeats Finder (TRF) 37
2.2.3.2 Identifying unique repeat classes 38
2.2.3.3 Preparing expression data from the GeneNote database 39
2.2.3.4 Detecting repeat polymorphisms within UniGene clusters 40
2.2.4 Overview of the repeatalyzer.pl script 41
2.2.5 Generating a measure of repeat length significance 42
2.2.6 Detection and input of disease-associated repeats 42
2.3 Prioritizing candidate repeats for disease-association studies in schizophrenia 43
2.3.1 Input of neuropsychiatric linkage regions into Satellog 43
2.3.2 Prioritizing candidate repeats with Satellog 43
CHAPTER 3 RESULTS 47
3.1 cis-Features of unstable CAG/CTG repeats 47
3.1.1 Correlation of flanking CAG/CTG repeat features to Brock et al. expandability data 47
3.1.1.1 Correlation of CpG islands with expandability 47
3.1.1.2 Correlation of flanking %GC with expandability 48
3.1.1.3 Correlation of repeat length and purity with expandability 52
3.1.1.4 Correlation of CTCF binding sites with expandability 52
3.1.1.5 Correlation of nucleosome formation potential with expandability 58
3.2 Genomic repeat analysis with the Satellog database 61
3.2.1 Summary statistics 61
3.2.2 Characteristics of disease-associated repeats 62
3.2.3 Characteristics of repeats polymorphic within UniGene clusters 66
3.2.4 Disease-associated repeats detected in UniGene clusters 68
3.3 Candidate repeats for typing in schizophrenia and bipolar disorder 75
3.3.1 Top 20 polymorphic schizophrenia candidate repeats 76
3.3.2 Top 20 globally prioritized schizophrenia candidate repeats 77
3.3.3 Top 20 polymorphic bipolar disorder candidate repeats 78
3.3.4 Top 20 globally prioritized bipolar candidate repeats 79
3.3.5 Top 20 polymorphic schizophrenia candidate repeats from disease-associated classes 80
3.3.6 Top 20 globally prioritized schizophrenia candidate repeats from disease-associated classes 81
3.3.7 Top 20 polymorphic bipolar disorder candidate repeats from disease-associated classes 82
3.3.8 Top 20 globally prioritized bipolar candidate repeats from disease-associated classes 83
CHAPTER 4 DISCUSSION 86
4.1 cis-Features of unstable CAG/CTG repeats 86
4.1.1 Identifying cis-mediators of instability 86
4.1.1.1 Association between flanking %GC and instability 87
4.1.1.2 Association between flanking repeat length, purity and instability 88
4.1.1.3 Association between flanking CTCF binding sites and instability 88
4.1.1.4 Association between flanking nucleosome formation and instability 90
4.1.2 Prioritizing candidate CAG/CTG repeats 91
4.2 Genomic repeat analysis with the Satellog database 93
4.3 Repeat prioritization in schizophrenia with Satellog 94
4.3.1 Top 20 polymorphic schizophrenia candidate repeats 95
4.3.2 Top 20 globally prioritized schizophrenia candidate repeats 96
4.3.3 Top 20 polymorphic schizophrenia candidate repeats from disease-associated classes 96
4.3.4 Top 20 globally prioritized schizophrenia candidate repeats from disease-associated classes 97
4.4 Conclusions 97
4.5 Problems encountered and limitations 99
4.5.1 Brock et al.'s expandability metric 99
4.5.2 Limitations of the GeneNote dataset 100
4.5.3 Mapping repeats to UniGene clusters 100
4.5.4 Prioritizing with p-values 101
4.5.5 Multiple repeats detected for known diseases 101
4.6 Future studies 102
4.6.1 Identifying cis-mediators of instability 102
4.6.2 Improvements to Satellog 103
4.6.3 Disease association studies in schizophrenia 104
4.6.3.1 Specimens for analysis 104
4.7 Significance 106
BIBLIOGRAPHY 108
APPENDIX A 120
APPENDIX B 124
APPENDIX C 126
APPENDIX D 153
APPENDIX E 157
APPENDIX F 159
APPENDIX G 160
APPENDIX H 170
APPENDIX I 175
APPENDIX J 177
APPENDIX K 180
APPENDIX L 182
APPENDIX M 184
APPENDIX N 188
APPENDIX O 190
APPENDIX P 192
APPENDIX Q 195
APPENDIX R 205

LIST OF TABLES

Table 1: Genetic anticipation in schizophrenia; summary of linkage studies from 1996-1999 (adapted from Vincent et al., 2000) 15
Table 2: All unstable and candidate CAG/CTG repeat-containing genes located within a CpG island.
