Computational Identification of Thyroid Response Elements in Genomic DNA

By Remi Gagne

A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of the requirements for the degree of Master of Computer Science

Ottawa-Carleton Institute for Computer Science
School of Computer Science
Carleton University
Ottawa, Ontario
April 2010

© Copyright 2010, Remi Gagne

Library and Archives Canada
Published Heritage Branch
395 Wellington Street, Ottawa ON K1A 0N4, Canada
ISBN: 978-0-494-68634-8

NOTICE: The author has granted a non-exclusive license allowing Library and Archives Canada to reproduce, publish, archive, preserve, conserve, communicate to the public by telecommunication or on the Internet, loan, distribute and sell theses worldwide, for commercial or non-commercial purposes, in microform, paper, electronic and/or any other formats.

The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.

In compliance with the Canadian Privacy Act some supporting forms may have been removed from this thesis. While these forms may be included in the document page count, their removal does not represent any loss of content from the thesis.

Abstract

Due to the volume and complexity of data arising from high-throughput biological assays, computational analysis is becoming increasingly important in helping biologists form and test hypotheses. In the current study, bioinformatics is applied to the fields of microbiology and toxicogenomics to analyze ChIP-chip data from a Health Canada study of the thyroid hormone receptor. This data analysis requires normalization and signal detection. A survey of contemporary methods was performed to find the most appropriate model for each step, given our experimental platform. Proof-of-concept experiments using high-quality benchmark data revealed that normalization of ChIP-chip data did not improve the accuracy of subsequent peak-finding algorithms. Splitter was used to detect peaks, revealing 230 regions in which the thyroid hormone receptor is believed to be bound to DNA. Once signal detection was complete, the identified DNA segments were examined to model the degenerate sequence motif. Motif finding algorithms (MFAs) based on a number of underlying statistical models were also applied to find occurrences of novel motifs not previously known to be linked to the thyroid hormone receptor. In total, 105 thyroid hormone receptor binding sites (thyroid response elements) were identified with an expected false discovery rate of 20%. MFAs found motifs that are very similar to known binding sites for proteins that could interact with the thyroid hormone receptor, such as SP-1, PAX and KROX binding sites. Wet-laboratory validation of these sites is now needed to establish their functionality, i.e. whether the identified motifs truly exhibit a gene regulation function.

Acknowledgments

I would sincerely like to thank everyone who has been involved in this project, particularly the people directly involved who have provided support from the beginning: Dr. Hongyang Dong, who generated the data; Andrew Williams, the statistician at Health Canada; and all my coworkers (Byron Kuo, John Gingerich and many others) who were there for me in the best and worst times. I would also like to thank my family for their understanding in these sometimes pretty stressful moments, especially my lovely wife Paula and my little "pitchounette" Catherine. I am also grateful to the members of my committee, Prof. Dehne and Prof. Famili, who generously agreed to spend time examining this document. I am very grateful to Health Canada and my work supervisor Dr. Paul White, who supported me financially during this project. And last but not least, my two supervisors, Dr. Carole Yauk and Dr. James Green, who agreed to take me under their wing to help me produce this document.

Table of Contents

ABSTRACT
ACKNOWLEDGMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
LIST OF EQUATIONS
CHAPTER 1. INTRODUCTION
  1.1. MOTIVATION
  1.2. OBJECTIVES
  1.3. THESIS OUTLINE
CHAPTER 2. BIOLOGICAL AND TECHNOLOGICAL REVIEW
  2.1. BIOLOGICAL BACKGROUND
    2.1.1. Thyroid hormones
    2.1.2. Thyroid hormone receptor
    2.1.3. THR partners for gene regulation
    2.1.4. Thyroid response elements
    2.1.5. Summary
  2.2. BIOLOGICAL LABORATORY TECHNOLOGY OVERVIEW AND EXPERIMENTAL DESIGN
    2.2.1. Chromatin immunoprecipitation
    2.2.2. Microarrays
    2.2.3. Biological experimental design
  2.3. SUMMARY
CHAPTER 3. CHIP-CHIP DATA ANALYSIS
  3.1. MAPPING PROBES TO GENOME
  3.2. NORMALIZATION OF CHIP-CHIP DATA
    3.2.1. Normalization methods developed for gene expression microarrays
    3.2.2. ChIP-chip normalization methods
    3.2.3. Summary
  3.3. PEAK FINDING ALGORITHM
    3.3.1. Splitter [20]
    3.3.2. Summary
  3.4. EVALUATION OF PERFORMANCE WITH BENCHMARK DATA
    3.4.1. Evaluation of the precision of binding site cut-off
  3.5. OPTIMISATION OF PEAK-FINDING ALGORITHM PARAMETERS TO ACTUAL THR STUDY DATA
  3.6. RESULTS OF PEAK FINDING TO ACTUAL THR STUDY DATA
  3.7. EXPERIMENTAL VALIDATION
  3.8. SUMMARY
CHAPTER 4. MOTIF IDENTIFICATION
  4.1. SEARCHING FOR THE KNOWN CONSENSUS TRE MOTIF
    4.1.1. Models for the identification of the TRE hexamer
    4.1.2. Determination of the correct DNA scanning model for TREs
    4.1.3. Relative abundance of TRE hexamers in DNA sequences
    4.1.4. Analysis of TRE ChIP-chip sequences for the THR and AP-1 binding site
    4.1.5. Summary
  4.2. IDENTIFICATION OF NOVEL MOTIFS
    4.2.1. Application of motif finding algorithms to the TRE dataset
    4.2.2. Results of MFAs on the TRE dataset
    4.2.3. Summary of the utilization of MFAs
  4.3. SUMMARY
CHAPTER 5. CONCLUSION
  5.1. SUMMARY OF RESEARCH
  5.2. MAJOR CONCLUSIONS
    5.2.1. Normalization
    5.2.2. Peak finding
    5.2.3. TRE consensus motif searching
    5.2.4. Novel TRE motif searching
  5.3. FUTURE WORK
BIBLIOGRAPHY
APPENDIX A
APPENDIX B
APPENDIX C
APPENDIX D
APPENDIX E

List of Tables

Table 2-1: TRE arrangements for heterodimer and homodimer TRE configurations
Table 2-2: List of TREs in the mouse genome compiled from the literature. Shown are the gene regulated by the TRE, its accession number, its DNA strand (GS), the location with respect to the transcription start site of the gene, the strand of the TRE (TS), the sequence containing the TRE with the binding site in bold, the type of gene regulation that the TRE performs (up- or down-regulation of gene transcription), the literature reference for the TRE, and the TRE configuration
Table 3-1: Comparison of the platform used for Spike-in and our ChIP-chip data
Table 3-2: Results from the Whitehead data set with Splitter at 2.5 SD
Table 3-3: Location of peaks with respect to mRNA mapped in MM9 by the UCSC genome browser
Table 4-1: Scores of halfSites for murine TREs
Table 4-2: Scores of halfSites for rat, human, chicken TREs
Table 4-3: Sample of True Positive and False Positive rate vs. Min Score and Max Score
Table 4-4: Output of Bipad (left half site motif logo, the distribution of the length of the spacer, and the right half site motif logo)
Table 4-5: Results of the MEME analysis using all the sequences in the TRE ChIP-chip dataset ranked by E-values
Table 4-6: Results of the MEME analysis using TOP5-PND4&TOP25-PND15 sequences in the TRE ChIP-chip dataset ranked by E-values
Table 4-7: Motif targets using TOMTOM for a few example cases
Table 4-8: Motif found by MEME (Table 4-5 - 4 and Table 4-6 - 3) in the first column and the SP-1 motif in the Transfac database in column 2
Table 4-9: Top 3 motifs found by Bioprospector and percentage of motif hits in CpG islands
Table 4-10: Motif targets using TOMTOM for a few example cases
Table 4-11: Top 4 motifs found by Weeder and percentage of motifs in CpG islands
Table 4-12: Motif targets using TOMTOM for a few example cases

List of Figures

Figure 1-1: Schema of gene promoter region
Figure 1-2: Flowchart of experimental process
Figure 2-1: Illustration of T3 (left) and T4 (right) hormones
Figure 2-2: Sequence (1D) and Structure (3D) of Nuclear Receptors
Figure 2-3: Logo of TRE hexamer
Figure 2-4: Mechanism of T3 regulated gene (activation and repression) with FOS and JUN nuclear receptor interaction
Figure 2-5: AP-1 binding site sequence logo
Figure 2-6: Each step of chromatin immunoprecipitation
Figure 2-7: Example of a microarray with a zoom-in on some wells
Figure 2-8: Distribution of the resolution for the "A" (black line) and "B" (blue line) microarray (note that the blue and black lines overlap greatly)
Figure 2-9: Experimental Design of Biological Experiment
Figure 3-1: Example of raw ChIP-chip data and confirmed TRE expressed by probes circled in red.
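The peak-finding step mentioned in the abstract and in the title of Table 3-2 ("Splitter at 2.5 SD") can be illustrated with a generic threshold-and-merge sketch. This is not Splitter's implementation, whose details are not given in this excerpt; it only shows the general idea of flagging probes whose signal lies a chosen number of standard deviations above the mean and merging nearby flagged probes into candidate regions. The function name `call_peaks` and the parameters `max_gap` and `min_probes` are hypothetical; only the 2.5 SD default is taken from the text.

```python
from statistics import mean, stdev


def call_peaks(positions, log_ratios, n_sd=2.5, max_gap=300, min_probes=2):
    """Return (start, end) regions built from probes above mean + n_sd * SD.

    Illustrative only: max_gap (bp) and min_probes are assumed parameters,
    not values taken from the thesis or from Splitter.
    """
    # Cutoff expressed in standard deviations above the mean probe signal.
    cutoff = mean(log_ratios) + n_sd * stdev(log_ratios)

    peaks = []
    current = []  # genomic positions of the region being assembled
    for pos, value in sorted(zip(positions, log_ratios)):
        if value <= cutoff:
            continue  # sub-cutoff probes are ignored; they do not split a region
        # Close the running region when the next flagged probe is too far away.
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_probes:
                peaks.append((current[0], current[-1]))
            current = []
        current.append(pos)

    # Flush the last region.
    if len(current) >= min_probes:
        peaks.append((current[0], current[-1]))
    return peaks
```

Called with a list of probe positions and matching log-ratio signals, the function returns the start and end coordinates of each candidate region. Requiring at least two flagged probes per region is one simple way to discard isolated high probes; a real pipeline would tune such settings against benchmark data, as the thesis outline describes in Sections 3.4 and 3.5.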
Recommended publications
  • Two-Page PDF Version
    CALL FOR PAPERS IEEE Computer Society Bioinformatics Conference Stanford University, Stanford, CA August 11–14, 2003 You are invited to submit a paper to the 2003 IEEE Computer Society Bioinformatics Conference (CSB2003). The conference’s goal is to facilitate collaboration between computer scientists and biologists by presenting cutting edge computational biology research findings. While such research has an interdisciplinary character, CSB2003 emphasizes the computational aspects of bioinformatics research. Computer science papers must show biological relevance, and biology papers must stress the computational aspects of the results. CSB2003 will accept 27 papers for podium presentation, and these will be published in the IEEE conference proceedings. Topics of interest include (but are not limited to): • Machine Learning • String & Graph Algorithms • Data Mining • Genome to Life • Robotics • Stochastic Modeling • Data Visualization • Genomics and Proteomics • Regulatory Networks • Gene Expression Pathways • Comparative Genomics • Evolution and Phylogenetics • Pattern Recognition • Molecular Structures & Interactions Papers are limited to 12 pages, single spaced, in 12 point type, including title, abstract (250 words or less), figures, tables, text, and bibliography. The first page should give keywords, authors’ postal and electronic mailing addresses. Submit papers electronically to bioinfor- [email protected] in either postscript or PDF format. A select subset of accepted papers will be invited to also publish in the Journal of Bioinformatics and Computational Biology (http://www.worldscinet.com/jbcb/jbcb.shtml). The Best Paper will be selected by the Program Committee and announced at the awards ceremony. Submissions must be received no later than April 1, 2003. Authors will be notified of their submission's status by May 19, 2003, and final corrected versions must be received by June 14, 2003.
  • Robust Normalization of Next Generation Sequencing Data
    Robust Normalization of Next Generation Sequencing Data. Dissertation for the degree of Doctor of Natural Sciences (Dr. rer. nat.), submitted to the Department of Mathematics and Computer Science of Freie Universität Berlin by Johannes Helmuth, Berlin, 2017. First reviewer: Prof. Dr. Martin Vingron. Second reviewer: Prof. Dr. Uwe Ohler. Date of defense: 15.05.2017. Preface This dissertation introduces a robust normalization method to uncover signals in noisy next generation sequencing data. The genesis of the described approach is the observation that next generation sequencing resembles a sampling process that can be modeled by means of discrete statistics. The specific and sensitive detection of signals from sequencing data pushes the field of molecular biology forward towards a comprehensive understanding of the functional basic unit of life – the cell. The thesis is structured in three parts: Part I provides a background on molecular biology and statistics that is needed to understand Part II. I describe the fascinating subject of gene regulation and how diverse next generation sequencing techniques have been developed to study cellular processes at the molecular level. The data generated in these experiments are naturally modeled with statistical models. To accurately quantify sequencing data, I propose the computational program “bamsignals” which was developed in collaboration with Alessandro Mammana [1]. Part II introduces a novel sequencing data normalization method which was developed under supervision of Dr. Ho-Ryun Chung from the Epigenomics laboratory at the Max Planck Institute for Molecular Genetics in Berlin. A manuscript describing the approach is deposited on bioRxiv [2] and the method was also featured as a journal article in Springer Press BioSpektrum [3].
  • Detection of Cooperatively Bound Transcription Factor Pairs Using Chip-Seq Peak Intensities and Expectation Maximization
    bioRxiv preprint doi: https://doi.org/10.1101/120113; this version posted March 24, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license. Detection of cooperatively bound transcription factor pairs using ChIP-seq peak intensities and expectation maximization. Vishaka Datta (1), Rahul Siddharthan (2), and Sandeep Krishna (1). (1) Simons Centre for the Study of Living Machines, National Centre for Biological Sciences, TIFR, Bengaluru 560065, India. (2) The Institute of Mathematical Sciences/HBNI, Taramani, Chennai 600 113, India. March 24, 2017. Abstract Transcription factors (TFs) often work cooperatively, where the binding of one TF to DNA enhances the binding affinity of a second TF to a nearby location. Such cooperative binding is important for activating gene expression from promoters and enhancers in both prokaryotic and eukaryotic cells. Existing methods to detect cooperative binding of a TF pair rely on analyzing the sequence that is bound. We propose a method that uses, instead, only ChIP-Seq peak intensities and an expectation maximisation (CPI-EM) algorithm. We validate our method using ChIP-seq data from cells where one of a pair of TFs under consideration has been genetically knocked out. Our algorithm relies on our observation that cooperative TF-TF binding is correlated with weak binding of one of the TFs, which we demonstrate in a variety of cell types, including E. coli, S. cerevisiae, M. musculus, as well as human cancer and stem cell lines.
  • Computational Approaches to Predict Effect of Epigenetic Modifications
    Computational Approaches To Predict Effect Of Epigenetic Modifications On Transcriptional Regulation Of Gene Expression. Sharmi Banerjee. Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering. Pratap Tokekar, Co-chair; Xiaowei Wu, Co-chair; William Baumann; Inyoung Kim; Anil Vullikanti. September 05, 2019, Blacksburg, Virginia. Keywords: Epigenetic factors, gene expression, transcription factors, histone marks, DNA methylation. Copyright 2019, Sharmi Banerjee. (ABSTRACT) This dissertation presents applications of machine learning and statistical approaches to infer protein-DNA bindings in the presence of epigenetic modifications. Epigenetic modifications are alterations to the DNA resulting in gene expression regulation where the structure of the DNA remains unaltered. It is a heritable and reversible modification and often involves addition or deletion of certain chemical compounds to the DNA. Histone modification is an epigenetic change that involves alteration of the histone proteins (thus changing the chromatin, i.e. DNA wound around histone proteins, structure) or addition of methyl-groups to the Cytosine base adjacent to a Guanine base. Epigenetic factors often interfere in gene expression regulation by promoting or inhibiting protein-DNA bindings. Such proteins are known as transcription factors. Transcription is the first step of gene expression where a particular segment of DNA is copied into the messenger-RNA (mRNA). Transcription factors orchestrate gene activity and are crucial for normal cell function in any organism. For example, deletion/mutation of certain transcription factors such as MEF2 have been associated with neurological disorders such as autism and schizophrenia.
  • UCSF UC San Francisco Electronic Theses and Dissertations
    UCSF UC San Francisco Electronic Theses and Dissertations. Title: Computational approaches to cell type and interindividual variation in autoimmune disease. Permalink: https://escholarship.org/uc/item/23p6k04c. Author: Targ, Sasha. Publication Date: 2018. Peer reviewed | Thesis/dissertation. eScholarship.org, Powered by the California Digital Library, University of California. Acknowledgements The work contained in this thesis would not have been possible without the help of many individuals along the way. First and foremost, I would like to thank my advisor, Jimmie Ye, who gave me the freedom and flexibility to pursue my scientific interests during graduate school, and introduced me to computational genetics and methods development. I would also like to thank the UCSF MSTP for administrative help. Finally, I would like to thank my friends and family for their support and encouragement throughout the many phases of the MD/PhD training program. The chapter entitled “Multiplexed droplet single-cell RNA-sequencing using natural genetic variation” was published in Nature Biotechnology (PMID: 29227470, doi: 10.1038/nbt.4042). Computational approaches to cell type and interindividual variation in autoimmune disease. Sasha Kiang Targ. Abstract Computational approaches offer substantial ability to improve annotation and interpretation of a range of genomic datasets collected with the advent of next generation sequencing technologies, providing an avenue to further understand the impact of changes in genomic data which might contribute to disease. Decoding the genome using deep learning is a promising approach to identify the most important sequence motifs in predicting functional genomic outcomes. In the first part of this work, we develop a search algorithm for deep learning architectures that finds models which succeed at using only RNA expression data to predict gene regulatory structure, learn human-interpretable visualizations of key sequence motifs, and surpass state-of-the-art results on benchmark genomics challenges.
  • Probing Sequence-Level Instructions for Gene Expression May Taha
    Probing sequence-level instructions for gene expression. May Taha. To cite this version: May Taha. Probing sequence-level instructions for gene expression. General Mathematics [math.GM]. Université Montpellier, 2018. English. NNT: 2018MONTT096. tel-02073052. HAL Id: tel-02073052, https://tel.archives-ouvertes.fr/tel-02073052, submitted on 19 Mar 2019. HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. Thesis submitted for the degree of Doctor of the Université de Montpellier in applied mathematics and applications of mathematics. Doctoral school: CBS2 – Sciences chimiques et Biologiques pour la santé. Research units: Institut de Génétique Moléculaire de Montpellier (IGMM – UMR 5535), Institut Montpelliérain Alexandre Grothendieck (IMAG – UMR 5194). Probing sequence-level instructions for gene expression. Presented by May TAHA on 28 November 2018, under the supervision of Charles-Henri Lecellier and Jean Michel Marin, before a jury composed of: Dr. Julien CHIQUET, Research Scientist, AgroParisTech/INRA MIA, Paris (Reviewer); Dr. Mohamed ELATI, Professor, Université de Lille (Reviewer); Dr. Stéphane ROBIN, Research Director, AgroParisTech/INRA MIA, Paris (Examiner); Dr. Nathalie VILLA-VIALANEIX, Research Scientist, Université de Toulouse 1 (Examiner); Dr. Charles-Henri LECELLIER, Research Scientist, CNRS, Montpellier (Thesis supervisor); Dr.
  • A Twentieth Anniversary Tribute to Psb
    A TWENTIETH ANNIVERSARY TRIBUTE TO PSB DARLA HEWETT1, MICHELLE WHIRL-CARRILLO1, LAWRENCE E HUNTER2, RUSS B ALTMAN1, TERI E KLEIN1 1Stanford University, Shriram Center for Bioengineering and Chemical Engineering 443 Via Ortega, Stanford, CA 94305 Email: [email protected] 2University of Colorado School of Medicine, Computational Bioscience Program, Aurora CO 80045 PSB brings together top researchers from around the world to exchange research results and address open issues in all aspects of computational biology. PSB 2015 marks the twentieth anniversary of PSB. Reaching a milestone year is an accomplishment well worth celebrating. It is long enough to have seen big changes occur, but recent enough to be relevant for today. As PSB celebrates twenty years of service, we would like to take this opportunity to congratulate the PSB community for your success. We would also like the community to join us in a time of celebration and reflection on this accomplishment. 1. PSB’s Influence PSB is one of the world's leading conferences in computational biology. It is where top researchers present and discuss current research in the theory and application of computational methods in problems of biological significance. The following facts, computed October 2014, highlight how PSB has impacted and supported our community: • PSB has accepted and is tracking 887 papers. • 81% of the papers submitted to PSB have been cited in Google Scholar. • On the average, if a PSB paper has citations recorded in Google Scholar, it is cited 45 times. • There are currently 32,504 citations of PSB Papers recorded in Google Scholar. • The most highly cited PSB paper has 978 citations recorded in Google Scholar.
  • DISSERTATION UNCOVERING the ROLE of EPIGENETICS in ALTERNATIVE SPLICING Submitted by Fahad Ullah Department of Computer Science
    DISSERTATION: UNCOVERING THE ROLE OF EPIGENETICS IN ALTERNATIVE SPLICING. Submitted by Fahad Ullah, Department of Computer Science. In partial fulfillment of the requirements for the Degree of Doctor of Philosophy, Colorado State University, Fort Collins, Colorado, Summer 2020. Doctoral Committee: Advisor: Asa Ben-Hur; Charles Anderson; Hamidreza Chitsaz; Anireddy SN Reddy. Copyright by Fahad Ullah 2020. All Rights Reserved. ABSTRACT: UNCOVERING THE ROLE OF EPIGENETICS IN ALTERNATIVE SPLICING. Alternative Splicing (AS) is a regulated phenomenon that enables a single gene to encode structurally and functionally different biomolecules (proteins, non-coding RNAs etc.), that play important roles in an organism’s development and growth. Besides, it has been implicated in multiple diseases including cancer, thalassemia, and spinal muscular atrophy. Recent studies have shown that AS is widespread in both plants and animals. Moreover, it has been reported that splicing occurs co-transcriptionally and that chromatin state is important for understanding the regulation of AS. Most of the previous efforts made to elucidate the regulation of AS used sequence information alone. However, in this study our goal is to understand AS from an epigenetic perspective: how chromatin organization, accessibility, and modifications are involved in its regulation. Intron Retention (IR) is the most frequent form of AS in plants, however, very little is known about its regulation, particularly regarding the role of chromatin state. Therefore, as a first step, we investigate the relationship between IR and chromatin accessibility in two plant species: arabidopsis and rice. We report a strong association between chromatin accessibility and IR. Our findings suggest that chromatin is more open and accessible in IR.
  • RECAP Reveals the True Statistical Significance of Chip-Seq Peak Calls
    bioRxiv preprint doi: https://doi.org/10.1101/260687; this version posted February 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license. RECAP reveals the true statistical significance of ChIP-seq peak calls. Justin G. Chitpin (1,2), Aseel Awdeh (2,3) and Theodore J. Perkins (2,3,4,*). February 5, 2018. (1) Translational and Molecular Medicine Program, University of Ottawa, Ottawa, ON, K1H8M5, Canada. (2) Regenerative Medicine Program, Ottawa Hospital Research Institute, Ottawa, ON, K1H8L6, Canada. (3) School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON, K1N6N5, Canada. (4) Department of Biochemistry, Microbiology and Immunology, University of Ottawa, Ottawa, ON, K1H8M5, Canada. (*) Correspondence to [email protected] Abstract Motivation: ChIP-seq is used extensively to identify sites of transcription factor binding or regions of epigenetic modifications to the genome. The fundamental bioinformatics problem is to take ChIP-seq read data and data representing some kind of control, and determine genomic regions that are enriched in the ChIP-seq versus the control, also called "peak calling." While many programs have been designed to solve this task, nearly all fall into the statistical trap of using the data twice: once to determine candidate enriched regions, and a second time to assess enrichment by methods of classical statistical hypothesis testing. This double use of the data has the potential to invalidate the statistical significance assigned to enriched regions, or "peaks", and as a consequence, to invalidate false discovery rate estimates.