Computational Identification of Thyroid Response Elements in Genomic DNA

By Remi Gagne

A thesis submitted to The Faculty of Graduate Studies and Research in partial fulfillment of the requirements for the degree of Master of Computer Science

Ottawa-Carleton Institute for Computer Science
School of Computer Science
Carleton University
Ottawa, Ontario
April 2010

© Copyright 2010, Remi Gagne

Library and Archives Canada
Published Heritage Branch
395 Wellington Street
Ottawa ON K1A 0N4
Canada

ISBN: 978-0-494-68634-8

NOTICE: The author has granted a non-exclusive license allowing Library and Archives Canada to reproduce, publish, archive, preserve, conserve, communicate to the public by telecommunication or on the Internet, loan, distribute and sell theses worldwide, for commercial or non-commercial purposes, in microform, paper, electronic and/or any other formats.

The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
In compliance with the Canadian Privacy Act some supporting forms may have been removed from this thesis. While these forms may be included in the document page count, their removal does not represent any loss of content from the thesis.

Abstract

Due to the volume and complexity of data arising from high-throughput biological assays, computational analysis is becoming increasingly important in helping biologists form and test hypotheses. In the current study, bioinformatics is applied to the fields of microbiology and toxicogenomics to analyze ChIP-chip data from a study of the thyroid hormone receptor conducted by Health Canada. This data analysis requires normalization and signal detection. A survey of contemporary methods was performed to find the most appropriate model for each step, given our experimental platform. Proof-of-concept experiments using high-quality benchmark data revealed that normalization of ChIP-chip data did not improve the accuracy of subsequent peak-finding algorithms. Splitter was used to detect peaks, revealing 230 regions in which the thyroid hormone receptor is believed to be bound to DNA. Once signal detection was complete, the identified DNA segments were examined to model the degenerate sequence motif. Motif finding algorithms (MFAs) based on a number of underlying statistical models were also applied to find occurrences of novel motifs not previously known to be linked to the thyroid hormone receptor. In total, 105 thyroid hormone receptor binding sites (thyroid response elements) were identified with an expected false discovery rate of 20%.
MFAs found motifs that are very similar to known binding sites for proteins that could interact with the thyroid hormone receptor, such as SP-1, PAX and KROX binding sites. Wet laboratory validation of these sites is now needed to reveal their functionality, i.e. whether the identified motifs truly exhibit a gene regulation function.

Acknowledgments

I would sincerely like to thank everyone who has been involved in this project, particularly those directly involved who have provided support from the beginning: Dr. Hongyang Dong, who generated the data; Andrew Williams, the statistician at Health Canada; and all my coworkers (Byron Kuo, John Gingerich and many others) who were there for me in the best and worst times. I would also like to thank my family for their understanding in these sometimes pretty stressful moments, especially my lovely wife Paula and my little "pitchounette" Catherine. I am also grateful to the members of my committee, Prof. Dehne and Prof. Famili, who generously agreed to spend time examining this document. I am very grateful to Health Canada and my work supervisor Dr. Paul White, who supported me financially during this project. And last but not least, I thank my two supervisors, Dr. Carole Yauk and Dr. James Green, who agreed to take me under their wing to help me produce this document.

Table of Contents

ABSTRACT
ACKNOWLEDGMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
LIST OF EQUATIONS
CHAPTER 1. INTRODUCTION
  1.1. MOTIVATION
  1.2. OBJECTIVES
  1.3. THESIS OUTLINE
CHAPTER 2. BIOLOGICAL AND TECHNOLOGICAL REVIEW
  2.1. BIOLOGICAL BACKGROUND
    2.1.1. Thyroid hormones
    2.1.2. Thyroid hormone receptor
    2.1.3. THR partners for gene regulation
    2.1.4. Thyroid response elements
    2.1.5. Summary
  2.2. BIOLOGICAL LABORATORY TECHNOLOGY OVERVIEW AND EXPERIMENTAL DESIGN
    2.2.1. Chromatin immunoprecipitation
    2.2.2. Microarrays
    2.2.3. Biological experimental design
  2.3. SUMMARY
CHAPTER 3. CHIP-CHIP DATA ANALYSIS
  3.1. MAPPING PROBES TO GENOME
  3.2. NORMALIZATION OF CHIP-CHIP DATA
    3.2.1. Normalization methods developed for gene expression microarrays
    3.2.2. ChIP-chip normalization methods
    3.2.3. Summary
  3.3. PEAK FINDING ALGORITHM
    3.3.1. Splitter [20]
    3.3.2. Summary
  3.4. EVALUATION OF PERFORMANCE WITH BENCHMARK DATA
    3.4.1. Evaluation of the precision of binding site cut-off
  3.5. OPTIMISATION OF PEAK-FINDING ALGORITHM PARAMETERS TO ACTUAL THR STUDY DATA
  3.6. RESULTS OF PEAK FINDING TO ACTUAL THR STUDY DATA
  3.7. EXPERIMENTAL VALIDATION
  3.8. SUMMARY
CHAPTER 4. MOTIF IDENTIFICATION
  4.1. SEARCHING FOR THE KNOWN CONSENSUS TRE MOTIF
    4.1.1. Models for the identification of the TRE hexamer
    4.1.2. Determination of the correct DNA scanning model for TREs
    4.1.3. Relative abundance of TRE hexamers in DNA sequences
    4.1.4. Analysis of TRE ChIP-chip sequences for the THR and AP-1 binding site
    4.1.5. Summary
  4.2. IDENTIFICATION OF NOVEL MOTIFS
    4.2.1. Application of motif finding algorithms to the TRE dataset
    4.2.2. Results of MFAs on the TRE dataset
    4.2.3. Summary of the utilization of MFAs
  4.3. SUMMARY
CHAPTER 5. CONCLUSION
  5.1. SUMMARY OF RESEARCH
  5.2. MAJOR CONCLUSIONS
    5.2.1. Normalization
    5.2.2. Peak finding
    5.2.3. TRE consensus motif searching
    5.2.4. Novel TRE motif searching
  5.3. FUTURE WORK
BIBLIOGRAPHY
APPENDIX A
APPENDIX B
APPENDIX C
APPENDIX D
APPENDIX E

List of Tables

Table 2-1: TRE arrangements for hetero and homo dimer TRE configurations
Table 2-2: List of TREs in mouse genome compiled from the literature. The gene that is regulated by the TRE, its accession number, its DNA strand (GS), the location with respect to the transcription start site of the gene, the strand of the TRE (TS), the sequence which contains the TRE with binding site in bold, the type of gene regulation that the TRE performs (up or down regulates gene transcription), the literature reference for the TRE and the TRE configuration are shown
Table 3-1: Comparison of the platform used for Spike-in and our ChIP-chip data
Table 3-2: Results from the Whitehead data set with Splitter at 2.5 SD
Table 3-3: Location of peaks with respect to mRNA mapped in MM9 by the UCSC genome browser
Table 4-1: Scores of halfSites for murine TREs
Table 4-2: Scores of halfSites for rat, human, chicken TREs
Table 4-3: Sample of True Positive and False Positive rate vs. Min Score and Max Score
Table 4-4: Output of Bipad (left half site motif logo, the distribution of the length of the spacer, and the right half site motif logo)
Table 4-5: Results of the MEME analysis using all the sequences in the TRE ChIP-chip dataset ranked by E-values
Table 4-6: Results of the MEME analysis using TOP5-PND4&TOP25-PND15 sequences in the TRE ChIP-chip dataset ranked by E-values
Table 4-7: Motif targets using TOMTOM for a few example cases
Table 4-8: Motif found by MEME (Table 4-5 - 4 and Table 4-6 - 3) in the first column and the SP-1 motif in the Transfac database in column 2
Table 4-9: Top 3 motifs found by Bioprospector and percentage of motif hit in CpG islands
Table 4-10: Motif targets using TOMTOM for a few example cases
Table 4-11: Top 4 motifs found by Weeder and percentage of motifs in CpG islands
Table 4-12: Motif targets using TOMTOM for a few example cases

List of Figures

Figure 1-1: Schema of gene promoter region
Figure 1-2: Flowchart of experimental process
Figure 2-1: Illustration of T3 (left) and T4 (right) hormones
Figure 2-2: Sequence (1D) and Structure (3D) of Nuclear Receptors
Figure 2-3: Logo of TRE hexamer
Figure 2-4: Mechanism of T3 regulated gene (activation and repression) with FOS and JUN nuclear receptor interaction
Figure 2-5: AP-1 binding site sequence logo
Figure 2-6: Each step of chromatin immunoprecipitation
Figure 2-7: Example of a microarray with a zoom-in on some wells
Figure 2-8: Distribution of the resolution for the "A" (black line) and "B" (blue line) microarray (note that the blue and black lines overlap greatly)
Figure 2-9: Experimental Design of Biological Experiment
Figure 3-1: Example of raw ChIP-chip data and confirmed TRE expressed by probes circled in red.