Searching for Compact Hierarchical Structures in DNA by means of the Smallest Grammar Problem

Matthias Gallé

To cite this version:

Matthias Gallé. Searching for Compact Hierarchical Structures in DNA by means of the Smallest Grammar Problem. Modeling and Simulation. Université Rennes 1, 2011. English. tel-00595494

HAL Id: tel-00595494 https://tel.archives-ouvertes.fr/tel-00595494 Submitted on 24 May 2011


THESIS / UNIVERSITÉ DE RENNES 1, under the seal of the Université Européenne de Bretagne

for the degree of DOCTOR OF THE UNIVERSITÉ DE RENNES 1, Specialisation: Computer Science, Doctoral school MATISSE

presented by Matthias Gallé

prepared in the Symbiose team of the INRIA Rennes - Bretagne Atlantique laboratory

Searching for Compact Hierarchical Structures in DNA by means of the Smallest Grammar Problem

Thesis defended in Rennes on 15 February 2011 before the jury composed of:
Alberto APOSTOLICO, Professor, Georgia Tech, Atlanta, USA
François COSTE, Chargé de Recherche, INRIA, France
Alexander CLARK, Professor, University of London, London, UK
Jacques NICOLAS, Directeur de Recherche, INRIA, France
Eric RIVALS, Directeur de Recherche, CNRS, LIRMM, France
Gabriel INFANTE-LOPEZ, Professor, Universidad de Córdoba, Argentina

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.
– Charles L. Dodgson

Colorless green ideas sleep furiously – Noam Chomsky

ATGGCCCGGACGAAGCAGACAGCTCGCAAGTCTACCGGC
GGCAAGGCACCGCGGAAGCAGCTGGCCACCAAGGCAGCG
CGCAAAAGCGCTCCAGCGACTGGCGGTGTGAAGAAGCCC
CACCGCTACAGGCCAGGCACCGTGGCCTTGCGTGAGATC
CGCCGTTATCAGAAGTCGACTGAGCTGCTCATCCGCAAA
CTGCCATTTCAGCGCCTGGTGCGAGAAATCGCGCAGGAT
TTCAAAACCGACCTTCGTTTCCAGAGCTCGGCGGTGATG
GCGCTGCAAGAGGCGTGCGAGGCCTATCTGGTGGGTCTC
TTTGAAGACACCAACCTCTGTGCTATTCACGCCAAGCGT
GTCACTATTATGCCTAAGGACATCCAGCTTGCGCGTCGT
ATCCGTGGCGAGCGAGCATAATCCCCTGCTCTATCTTGG
GTTTCTTAATTGCTTCCAAGCTTCCAAAGGCTCTTTTCA
GAGCCACTTA
– You (HIST1H3J, 6)

Acknowledgements

It is standard practice to start by acknowledging the advisor(s) and I believe this is a good tradition. Each of my two advisors guided and shaped me and this thesis in a different and complementary way, and it would just be wrong not to acknowledge this from the beginning. François left me enough room to diverge from the initial statement of this thesis, into the directions that fascinated me the most. With a constant optimism (which was much needed at times) he gave good remarks about science and its surroundings and always made enough time for me to discuss any idea thoroughly. Gabriel was always an ametralladora de ideas (a machine gun of ideas), and the interactions over the distance and during my visits to Argentina left me with a huge stack of possibilities to explore.

Alberto Apostolico and Alexander Clark acted as rapporteurs and read this whole thing very carefully. Their insightful remarks were helpful and very much appreciated. I would also like to thank Eric Rivals for honouring me with his role as president of the doctoral committee.

The IRISA/INRIA centre in general and the Symbiose group provided an ideal place to perform research. Dozens of people made my stay there a rich, pleasant and funny one. There are too many to enumerate them all, but I would at least write down (in what I think is order of appearance) the names of German Capdehourat, Didier Devarus, Carito Vargas, Robert Guziolowski, Alfredo Illanes, Noël Malod-Dognin, Sidney Rosario, Sagar Sen, Mario Rivero, Guillaume Rizk, Katarzyna Kucharczyk, Joaquin Zepeda, Freddy Munoz, Guillaume Collet, Andrés Burgos, Michael Döhler. Thanks for all the coffee breaks, lunches, brainstormings, campings and company. The Symbiose team in particular would not be what it is without the leadership of Jacques Nicolas. Thank you Jacques for all the good advice, for your participation on the committee, for your accurate words and ideas and of course the 21:30 chocolates. A great thanks to Guillaume Rizk, who did all the running to the École Doctorale and the University library, and all the filling out of paperwork, that I should have done. Fabio Cunial, Tania Roblot and Didier Devarus gave valuable feedback and with their opinions improved the final version of this manuscript. I must also thank the number of others who have kindly read some of these pages and commented upon them. Rayan Chikhi stayed 90 minutes behind a camera and ensured a great video tape of the defence.

From my first days at Rennes, Pierre Peterlongo was much more than the funny post-doc who helped out wherever I needed. I am happy to have co-authored my first paper with him. He and his family hosted us again during my final days as a grad student, and it was great to be able to add an almost insignificant contribution to the renovations of his house.

At the ground level of the INRIA/IRISA building, easy to miss if one does not look for it, the attentive visitor might find the Documentation Centre (IST). Pascale, Anne and Agnès tracked down each and every one of the old or obscure papers I asked them for, and did not hesitate to get out an old issue of some journal or to ask some faraway library for a book that caught my attention.

Tania Roblot and Matthieu Perrin didn't have much choice but to interact with me during their internships. I really enjoyed their collaboration and the work we did together.

A project co-funded by MINCyT, INRIA and CNRS permitted me to visit the PLN group at FaMAF. The work performed together with Rafael Carrascosa was key to some of the major ideas presented in these pages. I would also like to thank Matias “Chun” Lee, Franco Luque and Martin(cito) Dominguez for their help and their mates. Franco was also my principal port whenever NLP doubts threatened this ship. Alberto Apostolico kindly permitted me to attend a Dagstuhl Seminar where I saw a bigger picture and where I enjoyed fruitful discussions with him and Michael Clausen. At the final stage of this thesis, Michel Termier (a few weeks before his retirement), Alain Denise and Yann Ponty welcomed me in Paris to discuss a possible context-free grammar for DNA. Finally, at the end of 2010, Gonzalo Navarro invited me to stay some days in Santiago (Chile), and I was amazed to discover a bit more about the work done by his team.

My former professors and fellow students at FaMAF were present during almost all stages of this thesis. Pedro D'Argenio and Nicolas Wolovick gave me valuable advice when I was about to decide on the final place to do my PhD. Daniel Fridlender always answered clearly any question I had about computational complexity. Each of my visits to Córdoba was pleasant, not least due to the good moments spent with Carlos de la Torre and Walter Alini. Carolina Dania, Javier Valdazo and Alejandro Sanchez came all the way over from Spain to see me wear a suit, and I really appreciated it.

My parents expressed their support and love in lots of different and generous ways. In particular, they crossed an ocean to see me present my thesis during a cold European winter. Andrei taught me basic arithmetic operations when we were kids and seeking excuses not to sleep. Without them I certainly wouldn't have done this thesis. Though probably visible only to me, Cecilia appears behind every word of this document. Thank you Samuel for receiving me at the door every night when I arrived and for sharing your Legos with me.

Soli Deo Gloria

Matthias Gallé
Saint-Martin d'Hères, France
May 2011


Contents

1 Introduction
  1.1 Linguistics of DNA
    1.1.1 The Linguistic Metaphor
    1.1.2 Identifying Words
    1.1.3 Modeling with Grammars
  1.2 Grammatical Inference
    1.2.1 Learning CFG from Positive Data Only
    1.2.2 Learning CFG from Structural Descriptions
  1.3 Occam's Razor and MDL principle
  1.4 Overview of this Thesis

2 The Smallest Grammar Problem
  2.1 Definitions
    2.1.1 Sequences
    2.1.2 Grammars
    2.1.3 Straight-Line Grammars
  2.2 Origins of the Smallest Grammar Problem
  2.3 Data Compression
    2.3.1 Dictionary based
    2.3.2 Statistical methods
    2.3.3 DNA Compression
    2.3.4 The SGP in Data Compression
    2.3.5 RNA compression with SLG
    2.3.6 XML Compression with SLG
    2.3.7 Compressed Data Structures
  2.4 Kolmogorov Complexity
    2.4.1 The SGP in Kolmogorov Complexity
    2.4.2 The Similarity Metric
  2.5 Structure Discovery
  2.6 Algorithms
    2.6.1 LZ77
    2.6.2 LZ78
    2.6.3 Sequitur
    2.6.4 DNASequitur
    2.6.5 Sequential
    2.6.6 Bisection / MPM
    2.6.7 RePair
    2.6.8 Longest First
    2.6.9 Greedy
    2.6.10 MDLCompress
    2.6.11 Bounding the Worst Case
  2.7 Comparison
    2.7.1 IRR: a general offline framework
    2.7.2 Final grammar size
    2.7.3 DNA Compression

3 Efficiency
  3.1 The Suffix Array
  3.2 A Taxonomy of Repeats
    3.2.1 Bounds
    3.2.2 Computation
    3.2.3 Use in IRR
  3.3 Non-overlapping Occurrences
  3.4 In-place Update of Suffix Array
    3.4.1 Motivation
    3.4.2 Double-linked Enhanced Suffix Array
    3.4.3 Algorithm
    3.4.4 Efficiency
  3.5 Summary

4 Smaller Grammars
  4.1 The Minimal Grammar Parsing Problem
    4.1.1 Grammar Parsings and Minimal Grammar Parsings
    4.1.2 IRR with Occurrence Optimisation
    4.1.3 Removing Costly Rules
  4.2 A Search Space for the Smallest Grammar Problem
    4.2.1 The Search Space
    4.2.2 The ZZ Algorithm
    4.2.3 Non-monotonicity of the search space
  4.3 A Practical Algorithm
  4.4 Summary

5 Applications
  5.1 Structure Discovery
    5.1.1 Non-uniqueness of the Smallest Grammar: in Theory
    5.1.2 Non-uniqueness of the Smallest Grammar: in Practice
    5.1.3 Structural Comparison of Sequences: a New Tree Distance Metric
    5.1.4 Summary
  5.2 Kolmogorov Complexity
    5.2.1 Biological Classification
    5.2.2 Mammalian Phylogeny
  5.3 Compression with IRR
    5.3.1 Compressing Small Grammars
    5.3.2 An IRR Algorithm for Compression Purpose
  5.4 Lossless DNA compression with Rigid Motifs
    5.4.1 A Taxonomy of Rigid Motifs
    5.4.2 Straight-line Grammars with Don't Cares

6 Conclusions
  6.1 Summary
  6.2 Perspectives

A Corpora

List of Figures

List of Tables

List of Algorithms

Index

Bibliography

Abstract

Motivated by the goal of discovering hierarchical structures inside DNA sequences, we address the Smallest Grammar Problem, the problem of finding a smallest context-free grammar that generates exactly one sequence. This NP-hard problem has been widely studied for applications like Data Compression, Structure Discovery and Algorithmic Information Theory.

From the theoretical point of view, our contribution to this problem is a new formalisation of the Smallest Grammar Problem based on two complementary optimisation problems: the choice of the constituents of the final grammar and the choice of how to parse the sequence with these constituents. We give a polynomial-time solution for the latter problem, which we named the “Minimal Grammar Parsing” problem. This decomposition allows us to define a new complete and correct search space for the Smallest Grammar Problem. Based on this search space, we propose new algorithms able to return grammars 10% smaller than the state of the art on complete genomes. Regarding efficiency, we study different equivalence classes of repeats and introduce an efficient in-place scheme to update the suffix array data structure used to compute these words.

We conclude this thesis by analysing the applications. For Structure Discovery, we consider the impact of the non-uniqueness of smallest grammars. We prove that the number of smallest grammars can be exponential in the size of the sequence and then analyse, on real-life examples, the stability of the discovered structures across minimal grammars. With respect to Data Compression, we extend our algorithms to use rigid patterns as words and achieve compression rates up to 25% better than the previous best grammar-based DNA coder.


Résumé

Motivated by the automatic discovery of the hierarchical structure of DNA sequences, we are interested in the classical problem of finding the smallest context-free grammar generating exactly one given sequence. This NP-hard problem has been widely studied for applications such as data compression, structure discovery and algorithmic information theory.

We propose to decompose this problem into two complementary optimisation problems. The first consists in choosing the strings of the sequence that will be the constituents of the final grammar, while the second, which we call “minimal grammar parsing”, consists in finding a grammar of minimal size that allows the parsing of these constituents. We give a polynomial solution to the “minimal grammar parsing” problem and show that this decomposition makes it possible to define a complete search space for the smallest grammar problem.

We are interested in practicable algorithms that return an approximation of the problem in a time reasonable enough to be applied to large sequences such as genomic sequences. We analyse the impact of using different classes of maximality of repeats for the choice of constituents, and the trade-off between efficiency and the size of the final grammar. We present algorithmic advances for a better efficiency of existing off-line algorithms, notably the incremental update of suffix arrays during recoding. Finally, the new decomposition of the problem allows us to propose new generic algorithms that find grammars 10% smaller than the state of the art.

We then turn to the impact of these ideas on the applications. Concerning structure discovery, we study the number of minimal grammars and show that this number can be exponential in the worst case. Our experiments on sets of sequences nevertheless show a certain stability of structure among the minimal grammars obtained from a given set of constituents. Concerning data compression, we contribute to each of the three steps of grammar-based compression. We define a new algorithm that optimises the size of the final bit stream instead of the size of the grammar. Applying it to DNA sequences, our experiments show that this algorithm outperforms every other grammar-based DNA-specific compressor. We improve on this result by using inexact repeats, achieving compression rates 25% better than the best grammar-based DNA compressors. Beyond better compression rates, this approach also makes it possible to envisage interesting generalisations of these grammars.


Chapter 1

Introduction

The exponential growth of available DNA sequences in recent years is having a fruitful clash with the deep questions underlying information science. In 2003, Brooks identified recent developments in biology as the necessary pressure to finally develop a long-needed quantification of structural information [39].

This thesis is motivated by automatically learning structural models of DNA sequences. As models, we chose formal grammars. They have long been used to model the underlying structure of natural language [59] and genetic [219] sequences. Their easy interpretation and rigorous definition make them an appealing formalism. Moreover, in the Symbiose team, good results have been obtained on modelling families of proteins with non-deterministic finite automata [125]. Regular grammars, however, are of limited expressive power and notably fail at capturing long-range dependencies. Our goal was to improve expressive power, climbing to context-freeness, this time to structure DNA sequences. Confronted with the definition of what to consider a good grammar, in this first phase we followed William of Ockham's advice to seek the simplest model. This permits us to avoid introducing any other learning bias or domain knowledge and therefore to keep our approach as general as possible. Instead of considering the generative power of context-free grammars, we focus on the structure they provide over a single sequence.

These choices have led us to the formal problem of finding the smallest context-free grammar that generates exactly one given sequence. This decades-old problem [211, 225] has been theoretically studied [51] and has applications in several communities. Traditionally, these applications have been in the fields of data compression and approximation of Kolmogorov complexity. Much more scarcely studied, though promising, is the use of this problem in structure discovery and grammatical inference. In recent years, there has also been a trend to use such grammars as a backbone for compressed self-indexes. This thesis is the result of our work on using this Smallest Grammar Problem to find small, hierarchical structures over DNA sequences.

In the remainder of this introduction we expose the rationale behind this thesis and its context. First, we introduce the use of the linguistic metaphor for genetic sequences and review linguistic approaches in biology, focusing on the use of formal grammars. We then review ways of inferring these models. Finally, we shortly state the intuition and formalisation behind the well-known Occam's Razor, which leads us to the Smallest Grammar Problem.

1.1 Linguistics of DNA

1.1.1 The Linguistic Metaphor

The metaphor of language applied to genetic sequences is old, diffused and recurrent. It is particularly persistent in the media. One of the first books on molecular biology for a broad audience, in 1966, was titled “The Language of Life”1. Former US president William Clinton claimed at the announcement of the completion of the draft sequence of the human genome, on June 26th, 2000, “Today we are learning the language in which God created life”2. Linguistic terms referring to genetic sequences can be found by searching for expressions like “book of life” or “code of life”. They are used for arguments ranging from the existence of God to tools to identify extraterrestrial non-standard forms of life [240]. A Los Angeles Times article from 1993 [212] reports:

“DNA-as-language is one of modern science’s most powerful metaphors [...] But maybe language is more than just a metaphor. Maybe – just like Sanskrit, Chinese and English – DNA really is a language, with a grammar and syntax that determines how meaning is created.”

and a New York Times article said in 1991 [10]:

“The scientists are approaching genetic sequences as though they were lengthy passages written in an archaic and largely unfamiliar tongue, borrowing methods from the linguist’s tool kit to find a bit of order amid apparent biochemical babble. [...] The new theories are a subset of a larger science, computational molecular biology, which is fast becoming one of the trendiest disciplines in biological research. Scientists said they were starting to amass so much information about and genetic sequences that it was only through the use of a framework like linguistics that they could interpret the incoming rush of data.”

The DNA linguistics metaphor is as old as the discovery of DNA itself. Actually, maybe even older. Horace Judson reports [121] a letter from Friedrich Miescher, a contemporary of Gregor Mendel and the discoverer of nucleic acids — “nuclein” as he named it — although their hereditary functions remained unknown to him. In this letter in 1892 he expressed his intuition that some large molecules that are composed of repeated similar (but different) chemical atoms could be able to transmit the hereditary message, “just as the words and concepts of all languages can find expression in twenty-four to thirty letters of the alphabet”. Similar analogies have captured not only the imagination of the general public, but also that of scientists. This may be partly due to a historical coincidence: the same period that saw the discovery of the DNA double-helix structure and

1 George and Muriel Beadle. “The language of life: an introduction to the science of genetics”. Doubleday Publishing Group.
2 See http://clinton3.nara.gov/WH/New/html/genome-20000626.html for the complete speech.

advances in the understanding of how hereditary information is transmitted, also witnessed the development of modern linguistics. But considering genetic sequences as a message may be just natural. The interpretation of DNA as a universal language that is read and interpreted by living beings produced terms such as “transcription” and “translation” to refer to the two main mechanisms of transfer of sequential information in the Central Dogma [74]. Apparently it was one of the discoverers of the transcription process — French biologist François Jacob — who coined the expression “the linguistic model in biology”3. However, the question remains as to whether referring to DNA or proteins with linguistic expressions is only an image. Before applying techniques used in natural languages we have to know whether genetic sequences can be suitably modelled by linguistic formalisms or whether we should only “understand this continued appeal to DNA as language or text as a simple simplification — as an attempt to convey a complex process in familiar terms” [198].

1.1.2 Identifying Words

If DNA is not just any language, but a language similar to natural languages, it has to share characteristic features with natural languages. One of the most fundamental concepts is the definition of a word. At the beginning of the 1980s, E. N. Trifonov took an interest in identifying what the words of DNA were. He characterised recurring substrings and linked them to their biological function. Browsing through publications (more than 400), he and V. Brendel compiled 800 of such oligonucleotides, and published them — together with several indexes, descriptions and tables — in a “gnomic” dictionary [231]. Inspired by Kolmogorov Complexity, Trifonov and his co-authors define a measure of complexity of a sequence (called linguistic complexity) and use this to compare sequences to natural language text, concluding that protein sequences are more complex [189]. He reviews [230] techniques for identifying such words, using for example the frequency of their substrings, and pays special attention to the possibility that the same DNA sequence can be interpreted in several ways. Phrases such as “togethernowhere”, which can be decomposed into four different semantically correct phrases4, seem to be much more common in the genetic language than in natural language. Trifonov remarks that this multiplicity and overlapping of genetic codes has to be taken into account. Because of this multiplicity of meanings, Argos characterises the language of protein folding as “many forked tongues” [21] and Gribskov warns not to carry the language metaphor too far [104], noticing some characteristics that make genetic sequences different from natural language ones. He particularly refers to long-range interactions in the secondary structure of proteins, to evolutionary constraints that should be three-dimensional in proteins rather than linear, and to the misleading concept of consensus as the optimal sequence that carries a message, to name some examples.

Some years later, a small controversy arose when Mantegna and co-workers showed that two typical characteristics of human language were present in non-coding DNA. These were Shannon’s information measure and Zipf’s law [155]. It was the latter that was mainly analysed and that generated most criticisms.

3 “The linguistic model in biology”. In Roman Jakobson: Echoes of His Scholarship, eds. D. Armstrong and C.H. Van Schooneveld, pp. 185-192.
4 “together nowhere”, “together now here”, “to get her nowhere” and “to get her now here”.

George Zipf formulated his famous law in 1949, relating the frequency of a word to its rank in the frequency table of all words. It is a law in the empirical sense: it is considered more as an observation that holds for a vast variety of data. Mantegna et al. assume that in the coding part of a DNA sequence, a 3-gram corresponds to a word (which is called a codon, and codes for one amino acid). They therefore analyse different n-grams (for 3 ≤ n ≤ 5) for the non-coding part. Plotting their relative frequency against rank on a double logarithmic scale, they deduce that these n-grams satisfy Zipf’s law, concluding with the “possible existence of one (or more than one) structured biological language — present in non-coding DNA sequences”. Their article was advertised by a letter in Science [91] and heavily attacked in the same section [34] and elsewhere [33, 52, 117, 232]. Tsonis et al. [232] are particularly determined, concluding that “The inescapable conclusion is clear: DNA sequences show no linguistic properties”. The original Mantegna article is however still widely cited (positively). Similar experiments with the same conclusion, but targeting coding regions, were performed by another set of authors at the same time [226] and more recently [239] using repeats instead of n-grams (for uses of Zipf’s law in other applications of molecular biology see Searls [219]).

Most of the time a word in DNA is interpreted as a codon, at least in coding sequences. Wang et al. [239] propose an original definition of a word as a maximal repeat (a substring that appears at least twice in different contexts, see also Sect. 3.2) and analyse word frequencies in a wide range of coding and non-coding DNA sequences.

Trifonov’s emphasis that a DNA sequence can contain different and overlapping meanings was used in another attempt to link linguistics and DNA. Ji [118] goes as far as postulating a biological isomorphism between biological language and natural language in terms of alphabet, lexicon, sentence, grammar, phonetics and semantics. He also assigns the non-coding part of DNA a semantic function, as opposed to the lexical function of the coding part. He uses the term “cell languages” and links this with the interpretation of the cell as a computer and the possibility of DNA computing. The second of his five “(putative) laws of molecular semiotics” states [119]: “Cell language is isomorphic with human language”.
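To make the rank-frequency analysis discussed earlier in this section concrete, the following short Python sketch counts overlapping n-grams of a sequence and returns the (rank, relative frequency) pairs behind a Zipf-style log-log plot. This is only a minimal sketch of the general procedure, not the pipeline of any of the cited papers; the function name and the toy input are ours.

from collections import Counter

def ngram_rank_frequency(sequence, n):
    # Count all overlapping n-grams of the sequence.
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    total = sum(counts.values())
    # Sort frequencies in decreasing order: rank 1 is the most frequent n-gram.
    ranked = sorted(counts.values(), reverse=True)
    return [(rank + 1, c / total) for rank, c in enumerate(ranked)]

if __name__ == "__main__":
    dna = "ATGGCCCGGACGAAGCAGACAGCTCGCAAGTCTACCGGC" * 20   # toy input
    # Plotting log(frequency) against log(rank) of these pairs gives the
    # double-logarithmic curve used to check Zipf's law.
    for rank, rel_freq in ngram_rank_frequency(dna, 3)[:5]:
        print(rank, rel_freq)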

1.1.3 Modeling with Grammars

There seems to be little discussion about interpreting DNA as a formal language in its most general sense. It clearly transmits a message and is generated by a yet unknown machinery. Until now we have analysed two approaches to using the term “linguistics of genetic sequences”. The first group uses this term more as a metaphor to obtain inspiration, or as an analogy to describe their discoveries. The second one goes a step further and searches for similarities between DNA sequences and human languages. We have seen that such an approach was not always free of controversy. A third approach consists in analysing the expression capacity of genetic sequences through the lenses of formal grammars and connecting it to available tools and understanding. Such an approach is particularly advocated by David Searls. He has published some excellent overviews in 1997 [217], 2002 [219] and 2006 [57] (this last one with Chiang and Joshi). We refer to these reviews for references to applications of linguistic-inspired tools to a wide range of biological

problems. Conceptually, he argues that the main levels at which linguistics works have their counterparts in molecular biology. This is illustrated by the hierarchy used in language, consisting of the lexical, syntactic, semantic and pragmatic levels. This hierarchy can be mapped respectively to the sequence, structure, function and role of macromolecules (see Searls [218] for further details). Without making profound philosophical claims about the interpretation of DNA as a language, he also shows biological examples of RNA sequences that demonstrate the context-sensitive characteristic of interleaved dependencies. His formalisation of String Variable Grammars [215, 216] – inspired by indexed grammars and designed to model DNA – is of practical interest. A recent implementation uses modern algorithmic tools to provide an efficient implementation of a subset of the functionalities of these grammars [182]. Computational linguistics is concerned with formalising representations that are learnable (efficiently) [65]. For the same reasons, Searls “seeks formalisms that are just sufficiently elaborate and powerful to encompass the range of phenomena under study, but not more so” [57].

Showing a genetic structure that cannot be captured by a context-free language [70], Collado-Vides gives in [71] a transformational grammar that generates regulatory regions of E. coli and S. typhimurium and analyses the predictions that this grammar reveals. The approach of defining a grammar that is biologically meaningful and which permits efficient parsing is also taken by Leung et al. [144] in their definition of Basic Grammars, applied to model the promoters of Escherichia coli.

Work on the “language of proteins” is much scarcer. However, a review from 2006 [99] uses the term “protein linguistics”, reviews historical attempts and analyses some of the difficulties and the importance of such an approach. Different linguistic tools that can be directly applied to protein sequences can be found on the website of the Center for Biological Language Modeling5. Loose et al. [151] use regular grammars to describe a language for antimicrobial peptides (AmPs), small proteins used by the immune system of eukaryotes against bacterial infections. From a database of previously identified AmPs, they automatically generated about 700 regular grammars describing them. They also generated new, unnatural AmPs, which were then synthesised and successfully inhibited the growth of a bacterium. A similar strategy was followed much earlier, in 1984 [37], to model an RNA phage group, but the automata were obtained manually.

A careful overview of different uses of formal grammars in molecular biology, with explanations and pointers to other references, can be found in Simon [223, Chapters 4–6]. We point out that linguistics of DNA should not be confused with biolinguistics, the study of the evolution of languages (see Searls [220] or the journal Biolinguistics6).

We have seen that the use of linguistics on genetic sequences goes well beyond that of a simple metaphor. Using a mathematical model makes it possible to uncover meaningful features of these sequences. The pertinence of formal grammars has been studied, and they have been applied with success.

5 http://www.cs.cmu.edu/~blmt/
6 http://www.biolinguistics.eu

The question remains, however, as to how to find a correct grammar given only sequences generated by this correct grammar. This is exactly the problem that the grammatical inference community studies.

1.2 Grammatical Inference

The field of grammatical inference treats the problem of learning a grammar that generates a given language. We refer to the recent book of C. de la Higuera, which reviews the area of grammatical inference, its goals, its tools and its algorithms [75], and detail here only the points relevant to this thesis.

The roots of Grammatical Inference can be traced to the work of Noam Chomsky [59, 60]: his treatment of natural language with formal mathematical models opened the door to attempts to infer these models and to approach language acquisition by children from a computational perspective. Several concepts of learnability — how to decide whether a target language can be learnt — exist, but here we will mostly use the definition of identification in the limit, or Gold-learning [102]. It is a known result that super-finite classes of languages7 (which include regular and context-free languages) are not identifiable in the limit if the learner is presented with positive data only [102]. But if negative examples are also available, then regular languages can be identified in the limit in polynomial time [103]. Another positive result concerning regular languages was obtained by Angluin [11], who defines a model that allows queries during the learning process and proves that regular languages can be inferred if (deterministic finite) automaton equivalence queries are allowed.

For what concerns us, we are interested in learning the more expressive context-free grammars from positive data only. There exist various algorithms that use heuristics to infer such a grammar. If more informative data is available, there exists one positive result, due to Sakakibara [202]. We will first consider the heuristics, and then focus on Sakakibara's result.

1.2.1 Learning CFG from Positive Data Only

Despite the negative result concerning learnability of context-free grammars from positive data only, a rich range of algorithms have been designed and applied with success in practice on natural language. Frequency of words is an important variable in learning, and Wolff [243] uses an algorithm that takes into account the frequency of digrams (similar to RePair, presented in Sect. 2.6.7) to learn syntax and meaning over data. Another widely used idea is Z. Harris' concept of substitutability. The intuition behind this concept is that strings that appear in the same context are likely to be substitutable, which is translated for formal grammars by saying that they should be constituents of the same non-terminal. Among the algorithms that implement Harris substitutability as their main feature are ABL from van Zaanen [235], EMILE from Adriaans et al. [5]8 and ADIOS from Solan et al. [224]. A formalisation of this concept, together with other arguments (such as frequency of words or mutual information of the context), also appears in Clark [64] and Clark et al. [66] to learn subclasses of context-free grammars.

7 These are classes that contain all finite languages and at least one infinite language.
8 For a comparison between both, see van Zaanen and Adriaans [237].


These algorithms have to resolve two complementary problems: which words are going to be the constituents of the final grammar, and how these constituents will be used to parse the sentences. Consider for instance the ABL algorithm. It consists of two phases: Alignment learning and Selection learning. The goal of the first part is to extract possible constituents, called hypotheses. ABL does so by aligning the sequences and clustering the unequal parts of the sequences. The selection learning phase takes all these hypotheses and resolves conflicts between contradictory ones. A contradiction here means constituents that overlap. In the prospective part of his thesis [236], van Zaanen considers the possibility of using the equal parts of the Alignment phase as constituents. A similar division is proposed by EMILE, which consists in a clustering phase finding basic rules, and an induction phase generating rules from them. In ADIOS, the MEX procedure distills statistically significant patterns, which are put into equivalence classes in a second generalisation phase. Of course, below this high-level view, all algorithms differ significantly. But as we will see, this division reflects well the separation we propose in this thesis: first choosing the constituents, and then deciding which occurrences of each constituent to replace.

Another approach is proposed by Nevill-Manning: in his thesis [171] he proposes different ways of generalising the output of his Sequitur algorithm (see Sect. 2.6.3), an algorithm that generates a context-free grammar whose language is exactly the sequence given as input. In his first generalisation, non-terminals are merged according to an MDL principle. There are two kinds of merging. The first type consists in merging non-terminals that appear at the same positions in the right-hand side of a rule. The second kind consists in merging rule bodies whose left-hand sides are identical. He then considers how to include recursion into the Sequitur algorithm. The last approach detects symbols of the final grammar that predict another symbol in the future, with a possible gap between both occurrences. This can be interpreted as detecting repeats (over the final grammar) with possibly variable gap lengths. Each of these possible generalisations is targeted at one example of a specific application.

1.2.2 Learning CFG from Structural Descriptions

We mentioned before that positive results concerning the learnability of the whole class of context-free languages are scarce. In his thesis, Rémi Eyraud [85] enumerates seven properties that make context-free grammars difficult to learn in polynomial time. For example, a direct extension to context-free grammars of Angluin's algorithm [11] (which learns regular languages) seems improbable because the equivalence problem for context-free grammars is undecidable (property two). The last of these properties is the "structural property", the fact that the structure given by a context-free parse seems much more complicated than the structure of a regular parse. Y. Sakakibara proved in 1992 a remarkable result, which may imply that this last property captures all the difficulty of the learnability of context-free languages:

Theorem 1 (Sakakibara [202]). The class of context-free languages can be learnt in polynomial time from positive samples of structural descriptions.

A structural description is an unlabelled parse tree of the grammar. A learning algorithm could then be designed that would take as input only positive

data, infer a parse tree for each sequence and then apply Sakakibara's learning algorithm. In his thesis, Eyraud considered this approach, using the Sequitur algorithm (see Sect. 2.6.3) to infer the parse trees. Eyraud concludes his study on a negative note, but it is not clear whether the problem lies in the general approach (as the author supposes), in the use of Sequitur (which poses a problem because it greedily selects the first appearances from the left), in the (only) example used (learning the language {a^n b^n : n > 0}, a classical toy example), or a combination of them (using a left-biased algorithm to learn a centred-bracket grammar).

Our choice of modelling genetic sequences by formal grammars poses the challenge of inferring a correct grammar. We have seen two attempts for the case of context-free grammars, the class we focus on. The first one tackles the general problem and infers a context-free grammar from the given positive data through the formalisation of linguistic concepts such as substitutability. The second approach focuses on finding the correct context-free structure for each sequence, and resolves the generalisation step with Sakakibara's algorithm. In both cases, a solution to the subproblem of learning the context-free structure of a single sequence would imply major advances. In the first case, we have reviewed attempts to generalise such a structure a priori. For the second case, approaches similar to Sakakibara's could be developed. To be able to apply an algorithm inferring a context-free structure on any kind of sequence, we would like to remain as general as possible. Therefore, we follow the ancient intuition of aiming at conciseness and focus on learning a smallest context-free grammar generating a given sequence.

1.3 Occam’s Razor and MDL principle

Learning and compressing are deeply connected: when we learn, we are able to express some (possibly infinite) set of data with an explanation which is (generally) shorter than the enumeration of the items. Conversely, compression techniques try to figure out redundancies in the text, and a way of doing so is by finding a small explanation for it. For a more detailed but still easy to read tutorial on the relationship between compression and learning, we refer to [4]. For a criticism of this intuition, see Domingos [77].

This intuition has been used for centuries. Attributed to the Franciscan friar William of Ockham9 (c. 1288 – c. 1348), Occam's Razor states in its most famous version that "entities must not be multiplied beyond necessity", and its use in practice can be translated as "if an event is explained equally well by two theories, the simpler one is likely to be the correct one". Or, like the medical adage, "when you hear the sound of hoofbeats, think horses, not zebras". In molecular biology, it justifies the use of the minimal edit distance in sequence comparison, and the use of parsimony for the construction of phylogenetic trees.

Occam's Razor is neither a theorem, nor a formally defined concept. It is much more a guide, a rule of thumb, that is used intuitively and underlies several formalisations of learning and inference paradigms.

9 He was probably inspired by Aristotle, who says in the Posterior Analytics: "We may assume the superiority ceteris paribus [other things being equal] of the demonstration which derives from fewer postulates or hypotheses – in short, from fewer premises".


The Minimum Description Length principle (see Grünwald [108] for a comprehensive recent overview) advocates an inference process resulting in a model such that the sum of the description length of the model and the description length of the original data given the model is minimised. It states that the best hypothesis H for some given data D is the one that minimises

|encoding(H)| + |encoding(D|H)|,

the length of the encoding of H plus the length of the encoding of D knowing H. The main feature of MDL is that it permits model selection without falling into the pitfall of over-fitting the model to the available data. In a similar direction, the Occam's Razor theorem [31] gives a formal proof of learnability10 of a class if there exists a procedure for inferring a smallest hypothesis for this class.

One of the main advantages of an approach inspired by Occam's Razor is that it does not use any learning bias other than simplicity. This seems particularly useful if little (or no) background knowledge is available about the chosen domain. For example, while in the last decades biology has advanced a lot in its understanding of coding DNA, the function and purpose of non-coding sequences, initially called Junk DNA, are much less understood and new knowledge has to be discovered from scratch [194]. However, in the human genome the non-coding part of the sequence represents as much as 98% of the total DNA of an individual. In 1997, Rivals et al. [196] used Occam's Razor as a justification to use a specifically designed DNA compressor to detect approximate tandem repeats in yeast.
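As a toy illustration of the two-part score above, the Python sketch below measures |encoding(H)| and |encoding(D|H)| for a hypothesis consisting of a few rewriting rules, using a deliberately crude fixed-length code over the combined alphabet. The function names and the encoding are illustrative assumptions of ours, not the encoders used later in this thesis; the only point is that a rule is worth keeping when the bits spent describing it are repaid by the bits saved on the data.

import math

def code_length(n_symbols, alphabet_size):
    # Crude fixed-length code: every symbol costs log2(alphabet size) bits.
    return n_symbols * math.log2(max(alphabet_size, 2))

def two_part_mdl(data, rules):
    # |encoding(H)|: spell out every right-hand side plus an end-of-rule marker.
    alphabet = set(data) | set(rules)
    model_cost = sum(code_length(len(rhs) + 1, len(alphabet)) for rhs in rules.values())
    # |encoding(D|H)|: rewrite the data with the rules, then spell out the result.
    rewritten = data
    for name, rhs in rules.items():
        rewritten = rewritten.replace(rhs, name)
    data_cost = code_length(len(rewritten), len(alphabet))
    return model_cost + data_cost

if __name__ == "__main__":
    s = "abcabcabcabc"
    print(two_part_mdl(s, {}))              # raw data, no hypothesis: ~19 bits
    print(two_part_mdl(s, {"A": "abc"}))    # the rule A -> abc pays for itself: 16 bits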

To summarise, inferring a context-free structure over DNA sequences poses major challenges, and several options have been analysed in the literature. As in Sequitur, we use Occam's Razor to focus on a minimal model. We restrict this thesis to the search for a smallest grammar of a single sequence as a first step towards a more general inference algorithm. This general algorithm can be achieved by generalising the final grammar, or by using it as a structural description of the given data. In an attempt to be as generic as possible and to be able to apply it to sequences like the non-coding regions of DNA, we search for a smallest grammar. The main subject of this thesis is therefore formalised in the Smallest Grammar Problem, the problem of finding a grammar of smallest size generating only the given sequence.

1.4 Overview of this Thesis

This thesis presents our work on the Smallest Grammar Problem. We arrived at this problem after formalising our motivation to discover new interesting structures in DNA sequences, especially in the non-coding segments. But the Smallest Grammar Problem is of independent interest and has been studied in different areas. Most of the results we will describe here are general and apply to any kind of sequence, but even so we put special emphasis on possible applications to genetic sequences.

Our main theoretical result is a new formalisation of this problem in the form of a complete and correct search space (Theorem 5 on page 75).

10 With a definition of learnability called Probably Approximately Correct.

This search space is based on the decomposition of the problem into two complementary optimisation problems. The first one consists in choosing which substrings of the sequence will become the constituents of the final grammar. The second one is concerned with how to combine these substrings in an optimal way. Thanks to this decomposition, we are able to define new algorithms that outperform the state of the art by 10% regarding the final grammar size. We also present algorithmic improvements on existing off-line algorithms, which include a careful in-place update of an enhanced suffix array. Finally, we consider different applications to which the Smallest Grammar Problem can be applied. In particular, we analyse the different steps of a grammar-based data compression algorithm and present a DNA compressor that outperforms any other grammar-based compressor.

The outline of this thesis is as follows. In Chapter 2 we state the Smallest Grammar Problem and review the work done on it. We identify three areas of application (Data Compression, Kolmogorov Complexity and Structure Discovery) and present the different approaches to the Smallest Grammar Problem in each of these areas. A special emphasis is given to the algorithms used to obtain a small grammar representing one sequence, and we compare their performance regarding the final grammar size.

In Chapter 3 we study the choice of constituents, with a special emphasis on the trade-off between the quality of the set of constituents and the total time consumed by the algorithm. First, we consider the impact of reducing the universe of possible constituents using different notions of maximality of repeats. Second, we consider the implications of overlapping occurrences. Combining these improvements allows us to reduce the computational complexity of existing algorithms. Finally, we present a data structure (the double-linked enhanced suffix array) which we use to compute the set of constituents at each step of the main algorithm, and which can be updated efficiently after each iteration.

Chapter 4 is concerned with the second sub-problem into which we decomposed the Smallest Grammar Problem: namely, once the set of constituents is given, how to parse the sequence and the constituents in an optimal way. We formally define this Minimal Grammar Parsing problem and give a polynomial algorithm to solve it. We then use this algorithm to define new approximation algorithms for the Smallest Grammar Problem that improve over the current state of the art.

In Chapter 5 we come back to the applications we identified. Regarding Structure Discovery, our new formalisation of the Smallest Grammar Problem allows us to analyse the impact of the non-uniqueness of the smallest grammar in this application. We evaluate our algorithms for approximating Kolmogorov complexity through the use of the Normalised Compression Distance to cluster biological sequences. We pay special attention to the third application, Data Compression. Analysing different ideas for using grammars for compression, we present a DNA-focused compressor that outperforms present grammar-based DNA compressors. The use of a special kind of inexact repeats, called maximal rigid patterns, enables us to improve our compression capacity even further.

In the final Chapter 6 we summarise our contributions, discuss our approach and analyse future directions. Appendix A gives an overview of the corpora used to validate and compare the algorithms.


Chapter 2

The Smallest Grammar Problem

Formal grammars originated with the purpose of describing a language, a possibly infinite set of strings. At the same time, this description by the grammar acts not only as a generator, but also describes an underlying structure of the language. The Smallest Grammar Problem focuses on structuring a single sequence, and consists in finding the smallest context-free grammar that generates exactly this sequence. This chapter is devoted to the analysis and review of approaches tackling this problem. Before starting, we introduce our notation and give some definitions. In Sect. 2.2 we give an overview of the origins of the Smallest Grammar Problem and of the motivations of the research communities that studied it. The next three sections focus on the work on the Smallest Grammar Problem motivated by applications in Data Compression (Sect. 2.3), Kolmogorov Complexity (Sect. 2.4) and Structure Discovery (Sect. 2.5). In all these sections we will make references to different algorithms, all of which are detailed afterwards, in Sect. 2.6. Finally, in Sect. 2.7 we define a framework that generalises most of these algorithms. This framework enables us to compare the different algorithms in a uniform setting. We perform an exhaustive comparison, evaluating their ability to return small grammars on different types of sequences. In a second comparison we review grammar-based algorithms that have been used for DNA compression and compare their performance.

2.1 Definitions

We introduce the notation and definitions used in this thesis. Most of it is standard, except maybe our notation for (non-overlapping) occurrences (page 12) and the definition of straight-line grammars (Sect. 2.1.3).

2.1.1 Sequences

A sequence s is a concatenation of zero or more characters from an alphabet Σ: s ∈ Σ∗. Σ(s) denotes the alphabet set over which s is drawn. The number of characters in alphabet Σ is denoted by |Σ|. We will denote single characters or

strings by single letters, so concatenation is denoted by simply concatenating symbols. (w)^k denotes the sequence of length k|w| which is w concatenated k times. We will also use ∏_{i=1}^{n} w_i to refer to the sequence w_1 . . . w_n. For example, a^k = ∏_{i=1}^{k} a.

We start indexing sequences from 0. So, a sequence s of length n over the alphabet Σ is represented by s[0]s[1] . . . s[n − 1], where s[i] ∈ Σ for all 0 ≤ i < n. We denote by s[i, j] (i ≤ j) the sequence s[i]s[i + 1] . . . s[j] of length j − i + 1. If j < i then s[i, j] = ǫ, the empty string. Furthermore, the sequence s[0, j] (0 ≤ j < n), also denoted by s[..j], is called a prefix of s, and symmetrically, s[i, n − 1] (0 ≤ i < n), also denoted by s[i..], is called a suffix of s. We say that sequence s[i, j] occurs at position i in s and that it is a substring of s. In general we will use letters s, t for general strings and v, w for substrings.

Given a sequence s, pos_s(w) denotes all the positions where string w occurs in s (the occurrences of w). Two different occurrences of w (say, i and j, with i < j) overlap if j ≤ i + |w| − 1. An important role in this thesis is played by non-overlapping occurrences of substrings. There are several ways of selecting occurrences such that the selection does not contain overlapping ones. We will call the normalised non-overlapping occurrence list (denoted L_s(w)) the list of occurrences defined in a greedy left-to-right way as follows. First, choose the leftmost occurrence. Next, select the following leftmost occurrence that does not overlap with the previous one. Continuing until the last occurrence, the resulting selection will contain the maximal possible number of non-overlapping occurrences.

A repeat of s is a substring of s that occurs more than once. R(s) denotes the set of all repeats of s, while Rˆ(s) reduces this to the set of all non-overlapping repeats of length at least two: Rˆ(s) = {w : |L_s(w)| > 1 ∧ |w| > 1}. In some cases it will be useful to specify a separator symbol, over which no repeat can span. Therefore, we suppose that the symbol | denotes a new symbol every time it appears. For example, ab|cb|cd = ab|_1 cb|_2 cd.
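The following Python sketch makes these definitions concrete: it computes pos_s(w) and the greedy left-to-right construction of the normalised non-overlapping occurrence list L_s(w). The function names are ours, chosen to mirror the notation above.

def poss(s, w):
    # All positions where w occurs in s; occurrences may overlap.
    return [i for i in range(len(s) - len(w) + 1) if s[i:i + len(w)] == w]

def normalised_nonoverlapping(s, w):
    # L_s(w): greedily keep each occurrence that starts after the previously
    # kept one ends; this selection contains the maximal possible number of
    # non-overlapping occurrences.
    selected, last_end = [], -1
    for i in poss(s, w):
        if i > last_end:
            selected.append(i)
            last_end = i + len(w) - 1
    return selected

if __name__ == "__main__":
    s = "aaaa"
    print(poss(s, "aa"))                       # [0, 1, 2]: positions 0 and 1 overlap
    print(normalised_nonoverlapping(s, "aa"))  # [0, 2]: L_s(aa)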

2.1.2 Grammars

Our exposition here loosely follows the classical work of Hopcroft and Ullman [115]. Formal grammars are rewriting systems that make it possible to generate a set of strings starting from a single symbol. In their most general form, a grammar G is a 4-tuple ⟨N, Σ, P, S⟩. Σ and N are disjoint, non-empty sets of symbols called respectively terminals and non-terminals. To refer to a non-specified member of any of these sets we use the term symbol. S is a special non-terminal called the starting symbol or axiom. P is a subset of (N ∪ Σ)+ × (N ∪ Σ)∗. A member of P is a rule (or production) and is denoted by α → β; α is the left-hand side and β the right-hand side. In general we will denote with Greek letters strings from (Σ ∪ N)∗, with lower-case Latin letters strings from Σ∗ and with upper-case Latin letters symbols from N. We say γαδ ⇒ γβδ whenever α → β is a production, and denote by ⇒∗ the reflexive and transitive closure of the relation ⇒. The language of a non-terminal is defined as the set of terminal strings that can be produced from it: L(N) = {w ∈ Σ∗ : N ⇒∗ w} (the constituents). The language of a grammar is the language of the start symbol: L(⟨N, Σ, P, S⟩) = L(S).


Different classes of grammars are defined by restricting the allowed rules. If there is no additional restriction on the set of production rules, the class is called the class of unrestricted grammars, which are equivalent to Turing Machines [115]. A context-sensitive grammar requires each right-hand side to be at least as long as its left-hand side. Each such grammar has a normal form where each rule is of the form γNδ → γαδ, with N a single non-terminal and α ≠ ǫ; γ and δ act as “context” for this production. Some definitions permit a special rule S → ǫ to enable context-sensitive languages1 to contain the empty word. The main class we will consider here is that of context-free grammars, whose production rules have to be of the form N → α, with N a single non-terminal. A context-free grammar is in Chomsky Normal Form (CNF) if every production is of the form N → AB, N → a or N → ǫ. Traditionally, the most restrictive class is that of regular grammars, with rules of the form N → N′a or N → a, with N, N′ non-terminals and a a terminal. Regular languages are exactly the languages recognised by the class of finite-state automata. The language generated by each of these classes is strictly contained in the language generated by the previous one. This hierarchy is called the Chomsky (or Chomsky-Schützenberger) hierarchy.

2.1.3 Straight-Line Grammars

The Smallest Grammar Problem focuses on grammars that generate exactly one sequence. We define here a class of grammars with this characteristic. There should be only one production rule per non-terminal2. Also, focusing on context-free grammars, this means that no recursion should be possible: otherwise, the result would be an infinite derivation, as no choice is possible in a context-free grammar with one rule per non-terminal. We therefore define straight-line grammars:

Definition 1 (Straight-Line grammar (SLG)). A straight-line grammar is a grammar such that:

1. every non-terminal appears at the left-hand side of at most one production rule

2. Given the graph G = ⟨N, E⟩, with (N, N′) ∈ E if N′ appears in the right-hand side of the rule of N, G has to be acyclic.

The term straight-line comes from the fact that the parses of such grammars neither branch (this would violate Condition 1) nor loop (Condition 2). A grammar without branches that loops permits only one infinite derivation and therefore has an empty language. This motivates the following alternative characterisation:

Proposition 1. Let G be a grammar that satisfies Condition 1 of Def. 1. Then G is straight-line if and only if |L(G)| > 0.

1A language is context-sensitive if it can be generated by a context-sensitive grammar.
2It is possible to violate this condition and still generate a single sequence. This could be interesting to permit alternative parses of the same substring but, as we are going to focus on the final size of the grammar, these rules could be replaced by a single rule, obtaining a smaller grammar.

A straight-line grammar in Chomsky Normal Form is equivalent to a straight-line program. Because in this thesis we are mostly interested in the structure given by the grammar, we will in general not consider our grammars to be in CNF.

Proposition 2. Let G be an SLG. Then |L(G)| = 1 and, moreover, |L(N)| = 1 for every non-terminal N.

We will denote by constituent of N (cons(N)) the only string of L(N). Unless otherwise stated, throughout this thesis the term grammar will always stand for a context-free and straight-line grammar. In general, the non-terminals of our grammars are anonymous, which means that their only purpose is to differentiate them from other non-terminals. They can then be renamed, provided sufficient care is taken to modify all occurrences of a non-terminal in the same way. In particular, we will suppose that the production rules can be enumerated N1 → α1, . . . , N|P| → α|P| as follows: N1 = S, N2 is the first non-terminal that appears on the right-hand side of the S rule and, in general, Ni+1 is the i-th different non-terminal that appears in α1 . . . αi. We discard non-terminals that are not used in any production rule (the only exception is the definition of straight-line grammars with don't cares in Sect. 5.4.2). If G is an SLG, then r(G) will denote its canonical sequential representation (α1$)(α2$) · · · (α|P|$), where $ is a special end-of-rule symbol that does not appear in G. G can be recovered unambiguously from r(G), and this seems to be the most intuitive way of representing an SLG linearly. Therefore we define the size of a grammar to be the size of its canonical sequential representation:

Definition 2 (Size of an SLG). If G is an SLG, then |G| = |r(G)|.

Therefore, |G| = Σ_{N→α∈P} (|α| + 1) = Σ_{N→α∈P} |α| + |P|. Finally, note that from the set of production rules P alone the whole grammar can be recovered: N is the set of left-hand sides, Σ is composed of the remaining symbols, and S is the non-terminal that derives the longest terminal string. So we can use P = P(G) and G = G(P) interchangeably. Now we are able to state our main problem:

Definition 3 (Smallest Grammar). Given a sequence s, a straight-line grammar G∗ is a smallest grammar for s if L(G∗) = {s} and |G∗| ≤ |G| for any other straight-line grammar G such that L(G) = {s}.

The Smallest Grammar Problem (SGP) is the problem of finding a smallest grammar for a sequence s.
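To make these definitions concrete, here is a minimal Python sketch (not part of the thesis; the toy grammar, the dictionary-based representation and the function names are illustrative choices) that computes constituents, the canonical sequential representation r(G) and the size |G| of a small SLG:

# A toy SLG: one rule per non-terminal, acyclic (Python 3.7+ dicts keep
# insertion order, which we use as the canonical enumeration of the rules).
rules = {
    "S": ["A", "B", "a", "A"],
    "A": ["a", "b"],
    "B": ["A", "b"],
}

def cons(symbol):
    # constituent of a symbol: the unique terminal string it derives
    if symbol not in rules:          # terminal
        return symbol
    return "".join(cons(s) for s in rules[symbol])

def r_of_G():
    # canonical sequential representation: right-hand sides separated by '$'
    out = []
    for rhs in rules.values():
        out.extend(rhs)
        out.append("$")
    return out

def size_of_G():
    # |G| = sum over rules of (|alpha| + 1), which equals |r(G)|
    return sum(len(rhs) + 1 for rhs in rules.values())

print(cons("S"))      # ababbaab
print(r_of_G())       # ['A', 'B', 'a', 'A', '$', 'a', 'b', '$', 'A', 'b', '$']
print(size_of_G())    # 11, equal to len(r_of_G())

Here |G| = (4+1) + (2+1) + (2+1) = 11, and the grammar is indeed a straight-line grammar for the single sequence ababbaab.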

2.2 Origins of the Smallest Grammar Problem

We could trace two independent origins of the idea of representing only one sequence by a grammar. The first appearance of this concept we could find was in a seminar held by the Psychological Society of the former GDR in 1973 [211]. The idea of describing objects by a minimal set of rules was used in the 1960s by Emanuel Leeuwenberg to define Structural Information Theory, a theory similar to Algorithmic Information Theory. Coming from cognitive psychology,

he focuses on how human perception identifies and uses this minimal description of visual objects. The seminar of 1973 contains several contributions that use context-free grammars as models to describe visual objects and that study the relationship between a minimal grammar and human learnability of that object. Such a minimal context-free grammar was then used as a computable approximation of Kolmogorov complexity. Ebeling and Jiménez-Montaño [80] use this in 1980 to define the grammar complexity of a string and to study the inherent complexity of genetic sequences. The second source originates in the data compression community, inside the bigger schema called macro or dictionary-based compression. Storer and Szymanski define in 1982 [225] several such compression techniques, including one that maps to a context-free grammar, and prove that the problem of finding a smallest such grammar generating exactly one sequence is NP-Hard. Later, Nevill-Manning and Witten [172] introduce their Sequitur algorithm and praise its capacity to generate a small context-free grammar that describes well the underlying structure of the given sequence. Shortly after, Kieffer and Yang [127] analyse the compression capacity of what they called Grammar-Based Codes from an information-theory point of view. More recently, in 2002, Charikar et al. [50] state again the relationship of a minimal grammar to Kolmogorov Complexity. The thesis of Lehman [141] and the complete paper of Charikar et al. [51] build upon Storer and Szymanski's result and analyse approximations to a smallest grammar. With respect to hardness, two more insights are given: first, they show that — supposing P ≠ NP — there is no polynomial-time algorithm that can ensure an approximation ratio better than 8569/8568 in the worst case. Moreover, they unveil a relationship to addition chains, an algebraic problem studied for decades. Any algorithm that would ensure an approximation ratio of o(log n/ log log n) would constitute progress on the problem of finding the shortest addition chain that contains a given set of integers.

In what follows we present three applications of the Smallest Grammar Problem. The first is Data Compression and is based on the insight that it may be cheaper (in terms of number of bits) to send a small straight-line grammar instead of the original sequence. We pay special attention to the use of such grammars inside the general topic of compressed data structures. The second application is the approximation of Kolmogorov Complexity and reflects well the original motivation of the problem. Finally, we consider Structure Discovery. In all cases, we introduce the general research field and show how the Smallest Grammar Problem has been tackled in this field.

2.3 Data Compression

In general terms, Data Compression is concerned with finding an encoding of data that requires fewer bits than just spelling out the original data. The existence of a decoder is essential to recover the original data from the encoded bit string. If the decoded object does not correspond exactly to the original one, we talk of lossy compression. Lossy compressors are widely used in fields like image

and audio processing, where the final decoded object can be degraded without this being a problem for a human user. Here, we will focus on lossless data compression. Traditionally, lossless data compression algorithms are divided into two categories: macro-based and statistical. The first group seeks redundancies in the text by detecting repeated patterns, and compresses the sequence by replacing an occurrence of such a pattern with a pointer to a previous occurrence. They achieve good compression by replacing subwords with (shorter) references. Statistical compression algorithms are based on information theory and assign codewords to single symbols. They rely on the insight that it is better — for compression purposes — to assign shorter codewords to frequent symbols. This relation between codeword length and frequency is formalised in Shannon's noiseless coding theorem, which says that, given an i.i.d. source with distribution p over symbols ω1 . . . ωn, an optimal code would have codewords c1 . . . cn such that |ci| = − log p(ωi). We refer to the classical work of MacKay [163] for further reference.

2.3.1 Dictionary based

A good overview of different possible frameworks of macro schemes is given in the work of Storer and Szymanski from 1982 [225]. There, the authors differentiate between external and internal macro schemes. External macro schemes contain pointers to an external dictionary, while the pointers of internal macro schemes point to positions of the sequence itself. Our definition of LZ78 (see Sect. 2.6.2) presents it as an external macro scheme, while LZ77 (Sect. 2.6.1) is internal. Among the external macro schemes, we will pay special attention to a class of compression algorithms called fixed size dictionary. In this framework, the dictionary consists of a set of words {ω1, . . . , ωn}. Each dictionary word ωi has an associated codeword ci ∈ C = {c1, . . . , cn}. The set C must be uniquely decodable (prefix-free or fixed-length, for example), and it is linked to the dictionary by the function f defined as f(ci) = ωi. The goal is to find d1 . . . dm such that di ∈ C, f(d1) . . . f(dm) = s and Σ_{i=1}^{m} |di| is minimal. This problem was proposed in 1973 by Wagner [238], together with a dynamic programming algorithm that solves it optimally. The problem was called “optimal parsing” by Bell et al. [26] and “Minimal Space” by Schuegraf and Heaps [213], where it is solved by a shortest-path algorithm. There also exist faster approximate algorithms: see Katajainen and Raita [124] and Bell et al. [26, Chapter 8.1]. Interestingly, in his seminal work, R. Wagner interprets the fixed size dictionary in a grammatical way [238]:

“The set of phrases given initially acts like a partial grammar for a context-free language. The language consists of a finite set of sentences (the phrase and message strings themselves). The right- hand sides of its grammar rules contain no non-terminal symbols.”

Storer and Szymanski [225] characterise a richer schema and give NP-hardness proofs for several variants. These include variants where the pointers may be recursive (that is, allowing phrases themselves to be parsed with pointers to other phrases) or where the phrases may overlap. They show that each of the four ways of combining these restrictions results in an optimisation problem which is NP-Hard. This includes the problem of finding a context-free grammar of smallest size.
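As an illustration of the optimal parsing problem just described, the following Python sketch (not from the cited works; the toy dictionary and the uniform 4-bit codeword cost are invented for the example) solves it with Wagner-style dynamic programming, which amounts to a shortest-path computation over the positions of s:

def optimal_parse(s, codeword_bits):
    # codeword_bits: dictionary word -> length in bits of its codeword.
    # Returns (total cost in bits, list of dictionary words) of a cheapest
    # parse of s, or (inf, None) if s cannot be covered at all.
    INF = float("inf")
    n = len(s)
    cost = [0.0] + [INF] * n        # cost[i]: cheapest encoding of s[:i]
    back = [None] * (n + 1)         # back[i]: last word of that encoding
    for i in range(1, n + 1):
        for w, bits in codeword_bits.items():
            j = i - len(w)
            if j >= 0 and cost[j] + bits < cost[i] and s[j:i] == w:
                cost[i] = cost[j] + bits
                back[i] = w
    if cost[n] == INF:
        return INF, None
    parse, i = [], n
    while i > 0:
        parse.append(back[i])
        i -= len(back[i])
    return cost[n], parse[::-1]

# toy dictionary with fixed-length 4-bit codewords for every word
dictionary = {"a": 4, "b": 4, "ab": 4, "ba": 4, "aba": 4}
print(optimal_parse("ababab", dictionary))   # (12.0, ['aba', 'ba', 'b'])

Any parse using three dictionary words costs 12 bits here; which of the optimal parses is returned depends on the iteration order over the dictionary.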


Figure 2.1: Arithmetic coding of sequence ATCAG, supposing p(A) = 0.6, p(C) = 0.2, p(G) = 0.15, p(T ) = 0.05.

2.3.2 Statistical methods

Statistical compression algorithms do not replace subsequences with pointers but encode one symbol at a time. They have the advantage of dividing the coding process into two parts. On the one hand, there is the actual encoding, which takes as input a probability distribution p and encodes the current symbol c with a codeword corresponding to p(c). On the other hand, there is the modelling of the sequence and the inference of the distribution p, which does not need to consider the mapping from probabilities to codewords. This proves to be particularly useful for implementations, where a good encoder needs to be implemented only once, and then different probability models can be tested. Though Huffman Coding is probably the easiest statistical encoder to understand, we will review here Arithmetic Coding, which we will use in this thesis. An arithmetic coder encodes a string with one real number (between 0 and 1). Suppose a probability distribution p over the DNA alphabet such that p(A) = 0.6, p(C) = 0.2, p(G) = 0.15, p(T) = 0.05. The interval [0, 1] is then divided accordingly (see Fig. 2.1). Any real number between 0 and 0.6 represents the sequence s = A, while 0.73 for instance represents s = C. Suppose the first symbol of s is A. In this case, the interval [0, 0.6] is again divided according to p. Now, 0.4 stands for s = AC and 0.58 for s = AT. See the rest of Fig. 2.1 for a bigger example. The final real number is then encoded in binary. Two issues differentiate arithmetic coding algorithms. First, in the example we just presented there is no way of knowing how long the sequence is: 0.5 may stand for s = A, s = AG, s = AGA, etc. Two main solutions exist: either a special sentinel symbol (with very low probability) that occurs only once, at the end, is added, or the real length of the sequence is sent at the beginning. The second issue is the choice of which of the infinitely many real numbers in the final interval will be used to represent the sequence. There may exist more than one number with a minimal binary representation. Finally, for real implementations special care has to be taken with precision and overflow limitations. An adaptive arithmetic coder (AAC) changes the probability distribution while encoding the text. For example, a 0-order adaptive arithmetic encoder starts with some fixed distribution (p(ωi) = 1/|Σ| for example). It keeps a frequency table that counts the number of occurrences of each symbol and, after encoding each symbol, the probability distribution is updated with the empirical distribution of the symbols seen so far. Since all the information used in this update is contained in the data that was already sent, the decoder can mimic the behaviour of the encoder and decode the bit stream without error. An adaptive encoder has the advantage that no probability distribution has to be sent at the beginning, and that it can adapt to local changes of the input text. We will often use an n-context arithmetic coder (n-AAC), which models the sequence with an n-gram model.
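The interval-narrowing step of arithmetic coding can be sketched as follows for the example of Fig. 2.1 (exact fractions are used for clarity; a real coder would additionally emit bits incrementally, handle termination and deal with finite precision, as discussed above):

from fractions import Fraction

p = {"A": Fraction(60, 100), "C": Fraction(20, 100),
     "G": Fraction(15, 100), "T": Fraction(5, 100)}

def narrow(interval, symbol):
    # split 'interval' according to p and keep the sub-interval of 'symbol'
    low, high = interval
    width = high - low
    cum = Fraction(0)
    for sym in "ACGT":               # fixed symbol order, as in Fig. 2.1
        if sym == symbol:
            return (low + cum * width, low + (cum + p[sym]) * width)
        cum += p[sym]
    raise ValueError(symbol)

interval = (Fraction(0), Fraction(1))
for c in "ATCAG":
    interval = narrow(interval, c)
    print(c, float(interval[0]), float(interval[1]))
# any real number inside the final interval identifies the sequence ATCAG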

2.3.3 DNA Compression

Standard compression tools do not compress DNA sequences well. The term “not compress” has to be clarified: as DNA can be considered — without loss of information — as a sequence over an alphabet of four letters, a trivial baseline for encoding such a sequence would take two bits per symbol. If general-purpose compressors are applied to such sequences, they generally produce a bitstream longer than 2n for a sequence of n nucleotides. Since this was first observed, several research groups have tried to develop specific algorithms that take into account the peculiarities of DNA sequences. We have counted no less than 20 different algorithms designed for this purpose. Recently, two excellent papers reviewed most of these: Giancarlo et al. [98] give an overview of biological problems where compression techniques have been applied, while a complementary review [101] takes the side of the compression community and analyses how concepts developed there have been applied in biology. Arguably, the main motivation of such an effort is not so much the gain in storage space or bandwidth (until the popularisation of High Throughput Sequencing in recent years it was not clear whether this was even a problem), but the desire to find some redundancy in the “code of life” that may give insights into evolutionary pressure or the function of non-coding DNA, to cite just some examples. The first3 specific DNA compressor was presented with the algorithm Biocompress [106] and shortly after extended with an arithmetic coder of order 2 [107]. Only exact repeats are considered. It consists of an LZ77-style parse over the sequence, where biological palindromes are also considered; a bounded window size is used for efficiency. Two years later, Cfact by Rivals et al. [195] took an off-line approach consisting of two phases: in the first, a suffix tree is used to select interesting exact repeats, and in the second the occurrences of the selected repeats are evaluated: if the estimated compression gain is positive, they are replaced by pointers. The authors use this algorithm to detect tandem repeats in the chromosomes of yeast [196]. Another algorithm that considers only exact repeats is presented by Manzini and Rastero [157]. According to the authors, its major virtues are its speed and its low space requirements. To achieve this, repeats are identified with a technique called fingerprint of patterns, and special care is taken to encode the resulting pointers efficiently. Inexact repeats enter the scene with GenCompress [54]. Again, an LZ77-like

3Though it does not produce a bit stream that can be unambiguously decompressed, the method for discovering significant DNA by its minimal-length encoding described in Milosavljević and Jurka [164] dates from the same time.

phase is performed, but supporting the edit operations replace, insert and delete. It is applied to reconstruct an evolutionary phylogenetic tree, as a clear precedent of the definition of the similarity metric (see Sect. 2.4.2). The algorithm DNACompress [55] performs similarly, but interesting inexact repeats are computed with the aid of the algorithm PatternHunter [152]. The problem with using non-exact repeats is that there are many more of them than exact repeats, and special care has to be taken in selecting which choice of non-overlapping occurrences will be replaced. Chang [49] considers his DNAC algorithm to be a mix of GenCompress and Cfact; it compresses DNA sequences in four phases: first, it selects super-maximal repeats (in our notation, see Sect. 3.2) using a suffix tree, then extends them to approximate repeats (edit operations), calculates an optimal combination of non-overlapping occurrences and encodes the final result with a Fibonacci code. A similar approach is used in DNAPack [25], where dynamic programming techniques are used to ensure that the correct choice of occurrences is made. The good performance of this algorithm can also be explained by the fact that different coding schemas are used and special attention is given to the right choice between them. The Burrows-Wheeler Transform is used by Adjeroh et al. [3] to analyse the nature of the repeats in DNA sequences, interleaving it in different phases of a dictionary-based compression pipeline. Good statistical encoders arrived later on the DNA compression scene, but often performed better. In this setting, the quest is to identify the best probability model for the target sequence. CDNA [150] does so by using a variable-length context that takes inexact matches into account, and combines different models. Their final model needs several parameters, which are estimated with Expectation-Maximisation on a pre-established corpus. A simpler schema was used in ARM by Allison et al. [8], who re-implement and improve over the model of Milosavljević and Jurka [164]. Some of the authors of ARM present XM [42], another purely statistical DNA-specific compressor which introduces some new ideas. The probability distribution that predicts the next symbol is given by a combination of different expert models. Four different classes of expert models are used: a classical Markov model of order k (the authors use k = 2 for DNA and k = 1 for protein), a context Markov expert, which is a classical context model restricted to the local history (the previous 512 symbols in this case), and two kinds of repeat experts, which consider the next symbol to be part of a region copied from a particular offset. The normal copy expert considers standard copies, while the reverse expert considers reverse-complement repeats. A final class are hybrid algorithms, which combine both dictionary and statistical schemes. CTW+LZ [161] encodes repeats by one of two methods: with LZ77-like pointers (long repeats) or statistically with a context-tree weighting (CTW) model (shorter repeats). MNL [227] and its improved successor GeNML [135] perform a block-parse of the sequence and encode each block with one of three variants: a direct 2-bit-per-symbol encoding, an order-1 context model or a normalised maximum likelihood (NML) model. The NML model tries to find, for each block, an appropriate similar block in the already encoded data. Recently, S. Deusdado takes in his thesis [76, in Portuguese] the best of previous algorithms and joins them into an algorithm called DNALight. Like DNACompress, it first selects a dictionary of useful repeats (using an algorithm also explained in his thesis), which is itself parsed with the same dictionary. The resulting sequence (the part over which no selected repeat spans, plus the

index over the dictionary) is encoded with a statistical encoder using a model similar to XM. Global models (of order 10 over codons) are combined with two local models to predict the next symbol. In Table 2.1 we compare the results of these algorithms on the standard DNA compression corpus. Results are given in bits per original symbol. Missing entries and results with fewer than four digits of precision are due to the way they were reported in the respective papers (or in material available on the web). DNALight achieves the best compression on all but two sequences, but, as can be appreciated, the difference with the others is mostly less than 10^-2. Finally, we should note that there has been a recent trend (even if some work dates back to 2001 [210]) of DNA compression algorithms emphasising the compression aspect over the learning one. The focus is less on extracting all possible redundancy from the sequences than on being able to process complete databases [137, 241] or genomes [61] quickly and without large memory requirements.

2.3.4 The SGP in Data Compression

One of the driving forces in the development of work on the Smallest Grammar Problem is its application to data compression. Straight-line grammars have the attractive characteristic that they provide a neat way of combining the two groups of text compression, dictionary-based and statistical. In a first step, a grammar is inferred from the sequence, based on the repeats inside the sequence. This non-sequential structure can then be transformed into a sequential one, which itself can be compressed with a statistical encoder. The facts that non-terminals are anonymous, that their frequencies of appearance vary a lot and that rules can be presented in any order provide opportunities for the statistical compressor to take advantage of. Moreover, the hierarchical nature of grammars allows richer models than a mere dictionary of words. In this sense, a compressor that uses an SLG can be divided into three steps (as pictured in Fig. 2.2). First, a context-free grammar Gs is generated from the input sequence s. Second, this grammar is transformed into a sequential representation Rs, which is then encoded by a compressor into the bit stream Bs that can be transmitted. Note that other applications of SLGs — besides compression — generally only consider the first step, and the traditional presentation of grammar-based codes [127] unifies Steps 2 and 3. Of course, as pointed out by Charikar et al. [51], if the size of Gs is n, then the size of Bs can easily be bounded by n log n, assuming a fixed-length code of size log n. A theoretical study that only considers asymptotic behaviour can dismiss hidden constants and this logarithmic factor, but for real data compression algorithms that are supposed to work on finite strings, this may make a big difference. Some work has been done targeting Step 2. Nevill-Manning et al. [178] introduce a method that sends the right-hand side of a rule the first time a non-terminal is found. The second time, it sends a pointer to the first occurrence, and from there on it uses a unique identifier. For rules that appear only twice, this method never names the corresponding non-terminal with an absolute identifier, reducing the size of the final alphabet. In the presentation of DNASequitur [56] much care is taken over this second step, and the final result varies accordingly. In the next paragraphs we will see other methods, used by Kieffer and Yang, that take advantage of properties of a special kind of SLG.

[Table 2.1: Comparison of DNA compressors on the DNA corpus — BioCompress-2 [107], GenCompress [54], CTW+LZ [161], DNACompress [55], DNAPack [25], CDNA [150], GeNML [135], XM [42] and DNALight [76] — over the sequences chmpxx, chntxx, hehcmv, humdyst, humghcs, humhbb, humhdab, humprtb, mpomtcg, mtpacga and vaccg. Results are in bits per original symbol; the best result for each sequence is boldfaced.]

Figure 2.2: Schematic process of an encoder that uses a straight-line grammar: s ⇒ Gs ⇒ Rs ⇒ Bs (Steps 1, 2 and 3). First the grammar Gs is inferred from sequence s. It is then transformed into a sequential representation over the alphabet Σ ∪ N (plus possibly some extra symbols) and finally encoded into bitstream Bs.

The compression capacity of SLGs was studied in depth by a group led by Kieffer and Yang, who named these codes Grammar-Based Codes (GBC). They introduce the definition of irreducible grammars:

Definition 4 (Irreducible Grammar [127]). A straight-line grammar G = ⟨N, Σ, P, S⟩ is irreducible if:

1. G is admissible: it does not generate ε and it is pruned from any non-terminal that is not used in its only derivation

2. cons(N) ≠ cons(N′) for all N, N′ ∈ N with N ≠ N′

3. Each non-terminal appears at least twice: |pos_{r(G)}(N)| > 1 for all N ∈ N

4. No substring appears more than once in non-overlapping positions (excepting single symbols): |L_{r(G)}(α)| ≤ 1 for any α ∈ (N ∪ Σ)∗ such that |α| = 2
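A naive checker for Conditions 2-4 of Definition 4 can be sketched as follows (an illustration only: it assumes the grammar is already admissible, exempts the axiom from Condition 3, and counts digram occurrences without handling the overlapping-occurrence subtlety):

from collections import Counter

def is_irreducible(rules, axiom="S"):
    # rules: non-terminal -> right-hand side (list of symbols)
    # r(G): right-hand sides separated by the end-of-rule symbol '$'
    r = []
    for rhs in rules.values():
        r.extend(rhs)
        r.append("$")
    # Condition 3: every non-terminal except the axiom occurs at least twice
    occ = Counter(r)
    if any(occ[nt] < 2 for nt in rules if nt != axiom):
        return False
    # Condition 2: all constituents are distinct
    memo = {}
    def cons(sym):
        if sym not in rules:
            return sym
        if sym not in memo:
            memo[sym] = "".join(cons(t) for t in rules[sym])
        return memo[sym]
    constituents = [cons(nt) for nt in rules]
    if len(set(constituents)) != len(constituents):
        return False
    # Condition 4: no digram occurs more than once inside the right-hand sides
    digrams = Counter((r[i], r[i + 1]) for i in range(len(r) - 1)
                      if r[i] != "$" and r[i + 1] != "$")
    return all(c <= 1 for c in digrams.values())

# the toy grammar S -> A B a A, A -> a b, B -> A b used earlier is not
# irreducible: B occurs only once in the right-hand sides
print(is_irreducible({"S": ["A", "B", "a", "A"], "A": ["a", "b"], "B": ["A", "b"]}))  # False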

Kieffer, Yang and co-authors define several algorithms that generate irreducible grammars (notably Sequential, Bisection, Multilevel Pattern Matching and a variant of LongestFirst, see Sect. 2.6). But their main focus is on the final compression capacity, and they pay special attention to Steps 2 and 3 of Fig. 2.2. For the algorithm that generates the grammar itself (Step 1), they define a set of transformation rules such that successive applications of them transform any grammar into an irreducible one. Yang and Kieffer [244] define algorithms that read the sequence symbol by symbol, maintaining a grammar that generates the prefix read so far. The grammar is updated after reading each symbol with the set of transformation rules. They then define three different possibilities for encoding this grammar (Steps 2 and 3):

1. In the first — called hierarchical — Step 2 consists in inserting special symbols b and e at the beginning and end of any rule whose right-hand side is longer than two. Arranging the rules in the correct order permits an unambiguous decoding of the grammar. As most rules have length two, this produces a smaller representation than r(G). Step 3 consists of a standard 0-order adaptive arithmetic coder.

2. Another encoding is named sequential. Here, Steps 2 and 3 are interleaved: after reading each symbol, not only is the grammar updated, but it is also directly encoded. Simulation results on artificial examples show that this sequential encoding procedure performs better than the hierarchical one.


3. Finally, they use properties of irreducible grammars to improve on this. After each symbol, one bit is sent that indicates whether the updated grammar is simply the old grammar plus the new symbol concatenated at the end of the axiom rule, or whether one of the transformation rules was applied. This bit is then used as context information by the arithmetic coder.

In the second part of this series of papers [245], Yang and Kieffer consider “context models”. They use the term “context” in the sense of data compression, and not in the formal-language sense of “context-sensitive”. It is however similar in that production rules may now vary according to the context in which they appear. The context is determined by a context function, which depends only on the previous context and on terminal symbols. The authors suppose that in most applications contexts are substrings seen in the past (on the left), so that a context-dependent grammar permits overlapping parsing, with the overlapped portions treated as contexts. Mirroring the first paper of the series, they define a set of grammar transformations, one algorithm and three coding schemas. Unfortunately, an announced third part with complete results of implemented versions of these algorithms seems never to have been published, and our attempts to contact the authors remained fruitless.

2.3.5 RNA compression with SLG

We will review and compare attempts at compressing DNA sequences with SLGs below (Sect. 2.7.3). Here we consider only the compression of RNA sequences. Liu et al. [149] present RNACompress, an algorithm that uses an SLG to compress RNA sequences together with the information that defines their secondary structure. It takes as input two sequences: sRNA, over the alphabet {a, u, c, g}, contains the RNA nucleotide sequence, and sSS, over {(, ), ·}, has balanced brackets and |sSS| = |sRNA|. It differs from the other algorithms we consider because it does not infer any grammar from the sequence. Instead, it uses the generic grammar GRNA:

S → NS | ε
N → aSu | uSa | cSg | gSc | uSg | gSu | a | u | c | g

Sequence sRNA is compressed by indexing these rules from 1 to 12 and sending the indices of the rules that have to be applied in a derivation that generates the input sequence. The cleverness of the algorithm is that it uses the derivation specified by the left-most derivation of the sequence sSS with the grammar GSS:

S → NS | ε
N → (S) | ·

It is able to send both sequences inside one grammar by mapping the derivation tree of the left-most derivation of sSS with grammar GSS onto sRNA. The decoder uses the structure of the received grammar to decode sSS and the yield of the axiom to decode sRNA. The indices for the rules of GRNA are encoded with a Huffman code where the probability of each index is fixed, obtained by counting the frequencies of paired and unpaired bases in an RNA database.
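The mapping of the left-most derivation of sSS onto sRNA can be sketched as follows (the particular numbering of the twelve rules is an assumption made here for illustration; the original paper fixes one such order):

# assumed rule numbering: 1: S->NS, 2: S->eps, 3..8: N->aSu|uSa|cSg|gSc|uSg|gSu,
# 9..12: N->a|u|c|g
PAIR_RULE = {("a", "u"): 3, ("u", "a"): 4, ("c", "g"): 5,
             ("g", "c"): 6, ("u", "g"): 7, ("g", "u"): 8}
SINGLE_RULE = {"a": 9, "u": 10, "c": 11, "g": 12}

def rule_indices(s_rna, s_ss):
    # precompute the matching closing bracket of every opening bracket
    match, stack = {}, []
    for i, c in enumerate(s_ss):
        if c == "(":
            stack.append(i)
        elif c == ")":
            match[stack.pop()] = i
    out = []
    def derive_S(lo, hi):            # expand S over positions lo..hi-1
        if lo == hi:
            out.append(2)            # S -> eps
            return
        out.append(1)                # S -> N S
        derive_N(lo, hi)
    def derive_N(lo, hi):
        if s_ss[lo] == ".":
            out.append(SINGLE_RULE[s_rna[lo]])
            derive_S(lo + 1, hi)     # continuation S of the enclosing S -> N S
        else:                        # paired position
            j = match[lo]
            out.append(PAIR_RULE[(s_rna[lo], s_rna[j])])
            derive_S(lo + 1, j)      # S inside the pair
            derive_S(j + 1, hi)      # continuation S
    derive_S(0, len(s_ss))
    return out

print(rule_indices("gcau", "(..)"))   # [1, 8, 1, 11, 1, 9, 2, 2]

The recursion follows the left-most derivation order, so the decoder can replay the indices against GSS to recover sSS and against GRNA to recover sRNA.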

2.3.6 XML Compression with SLG

Extensible Markup Language (XML) is the de-facto standard for the representation of structured data on the World Wide Web. Like HTML, it is a markup language, and it permits the creation of semi-structured documents, interleaving the content with structural information. Because of its wide diffusion and its verbosity, the need to compress this data emerged, resulting in a large number of papers4. We refer to the surveys of Bordese [36] and Sakr [207] for recent overviews of their characteristics and classifications. The structure of an XML document can be represented as a tree. The similarity to an SLG is close, and several XML-specific compressors that use context-free grammars have been presented. Exalt [228] combines an XML parser with the grammar transform operation reported in the work of Kieffer and Yang to produce an irreducible grammar, which is encoded with an adaptive arithmetic coder. AXECHOP [142, 143] uses a standard strategy in XML compressors by treating the structural and the data parts of the document differently. While it encodes the data with a BWT algorithm, it uses MPM (see Sect. 2.6.6) with an adaptive arithmetic coder to compress the structure. XSeq [147] takes a similar approach, but compresses both data and structure with Sequitur (applying it to each of the streams separately). An interesting feature is its ability to process queries directly over the compressed file.

2.3.7 Compressed Data Structures

In our current information-oriented society, generated data are generally stored in databases which are accessed frequently to answer queries. Because of the exponential growth of data collections, the idea in this line of research is to keep the data compressed in storage and to be able to access it without needing to decompress it. Traditionally, data compression was used when a sender wanted to compress its message so that it could reach its receiver faster. Clearly, compression is detrimental when the time spent decompressing the whole data set, plus the time required by the query, outweighs the storage space that can be saved. The challenge is therefore to compress the database in a way that still permits efficient access. In general, the two queries that are required are access and pattern matching (or find). The access operation refers to random access over the string, or decompression of substrings, while the find operation returns the set of positions where a given pattern occurs. Here we will briefly review the main techniques that use SLGs. The literature on this subject has grown rapidly in recent years. For a recent review of advances in compressed data structures in general, see Hon et al.'s keynote at CPM 2010 [114], the invited talk by Ferragina at ESA 2010 [87] or the classical review of Navarro and Mäkinen [169] (or Chiu et al. [58] for an experimental comparison). SLGs seem to be a natural framework to achieve this task: the correspondence between non-terminals and substrings, plus the possibility of “zooming”, should make them suitable. The seminal paper of Kida et al. [126] gives a general framework (that includes SLGs) and shows how to perform pattern matching queries on it. Maruyama et al. [159] improve upon this with an approximation algorithm related to RePair, but one that does not replace all occurrences of a selected

4See http://webdocs.cs.ualberta.ca/~gleighto/research/xml-comp.html for a list of related publications.

pair. Maruyama et al. [160] take this a step further, analysing pattern matching for context-sensitive grammars. With respect to random access, first steps were made by Gąsieniec et al. [100], who showed how to visit consecutive symbols in constant time. Recently, Bille et al. [30] presented an SLG representation that permits O(log n) random access time. Claude and Navarro [68] show a self-index based on SLGs that supports both operations efficiently (in O(h log m) and O((m(m + h) + h o) log n) time, where h is the height of the parse tree, m the length of the pattern and o the number of times it appears). Claude et al. [69] present two compressed indexes based again on the RePair algorithm. There has been a particular focus by some groups on studying the edit distance problem for strings represented by straight-line grammars. See Hermelin et al. [112] for an overview and references.

2.4 Kolmogorov Complexity

The roots of Kolmogorov Complexity range over different fields, and similar concepts have been discovered independently in different places5. Informally, it can be defined as the amount of information — expressed in bits — contained in a single object. Standard probability theory, and therefore Shannon's notion of information, is useful when considering a set of events or sequences to which probability or information content has to be assigned. Supposing a uniform source over an alphabet of size two, the probability of sequence 1011001001 and of sequence 0000000000 is the same, but intuitively the second one seems to contain less information, or to be less random. One of the advantages of Kolmogorov Complexity is that it permits expressing what a random sequence is: a sequence in which the number of bits of information it contains is exactly the size of the sequence. A random sequence is a sequence from which no redundancy can be extracted. For the definition, fix a Turing machine M.

Definition 5 (Kolmogorov Complexity). The Kolmogorov Complexity of a string x is K_M(x) = min {|p| : M(p) = x}, where the minimum is taken over all programs p.

For a complete formal treatment of Kolmogorov Complexity see the classical reference of Li and Vitányi [145]. One of the main results that opens the door to further work in Kolmogorov Complexity is the Invariance Theorem, which says that for a universal Turing Machine M∗, K_{M∗}(x) ≤ K_M(x) + c_M for any M, where c_M depends only on M. As far as this thesis is concerned, it means that we can drop the subscript M and talk of the information inside an object independently of the specific Turing Machine we use to express it. Another fundamental result is the following undecidability theorem. It can be proved with a self-referencing argument, similar to the one used for the Halting Problem.

Theorem 2 (K is non-computable). There is no program p such that p(x) = K(x) for every string x.

A useful related definition is the Conditional Kolmogorov Complexity K(x|y), which is the size of the shortest program that outputs x on a universal Turing

5As is most common in the literature, we will speak of Kolmogorov Complexity, even if historically it would also be valid to speak of Solomonoff or Chaitin Complexity.

machine if y is presented on an auxiliary tape. The relationship to conditional probability is evident: if K(x|y) = K(x), then y does not add any information concerning x.

2.4.1 The SGP in Kolmogorov Complexity

One possibility for obtaining a computable approximation of Kolmogorov Complexity is to reduce the expressive power of the model from unrestricted grammars (recall from Sect. 2.1.2 that they are equivalent to Turing machines) to context-free grammars. In 1973, Klix [132] proposes, from a psychological point of view, a model that captures regularities (repetition of single elements, of pairs, and mirroring) of a structured object. This model has a sequential representation and he proposes a measure of the information content of this model. He also presents results showing that the time a human needs to learn the object by heart is linearly proportional to this measure. In the same proceedings, Scheidereiter [211] proposes to generate the sequence with a context-free grammar. He states the optimisation problem of finding the smallest grammar that generates the sequence (defining the size as the sum of the lengths of the right-hand sides of the rules used in the derivation of the sequence). Scheidereiter states that “As, under relatively simple conditions, there exists only a finite number of such grammars, one could find an optimal one by exhaustive search”6, but then proposes an algorithm that selects long and frequent repeats (or the reverse of a repeat), and uses this “local minimum” to “segment hierarchically” the sequence. Ebeling and Jiménez-Montaño [80] propose in 1980 to apply this idea — using the size of a smallest context-free grammar that generates s as a complexity measure of s — to genetic sequences. The authors give small grammars for different protein, DNA and RNA sequences and compare this grammar complexity to other information measures. In this first paper, no algorithm for finding these grammars is presented and it seems that the resulting grammars were found one by one7. In an unpublished paper from 1984 (see Ángel Jiménez-Montaño [9]), Jiménez-Montaño and Martinez seem to have defined the Non-sequential Recursive Pair Substitution algorithm, similar to RePair (see Sect. 2.6.7). Independently, the same definition of grammar complexity was also analysed by Charikar et al. [50] in 2002, where, besides an approximation algorithm being presented, other related models are considered.

2.4.2 The Similarity Metric

Li et al. [146] propose a practical application of Kolmogorov complexity for classification and clustering. They define a similarity between two sequences as:

Definition 6 (The Similarity Metric).

d(x, y) = max{K(x|y), K(y|x)} / max{K(x), K(y)}

6“Da es unter relativ einfach Bedingungen nur endlich viele solcher Grammatiken gibt, könnte man durch Probieren eine optimale finden”.
7For instance, for sequence (14) of length 126, a grammar of size 85 is given. Interestingly, one of the algorithms we are going to present (IRRMGP*, Sect. 4.3) finds a grammar of size 81.


The authors demonstrate that d is a “universal” metric (justifying the use of the definite article). Of course, as K is non-computable, approximations are used for practical applications. Any compression algorithm can be used as an approximation of Kolmogorov complexity, and this is studied by Cilibrasi and Vitányi [62, 63]. Given a compressor C, they define a distance between two sequences:

Definition 7 (Normalised Compression Distance).

NCD_C(x, y) = (|C(xy)| − min(|C(x)|, |C(y)|)) / max(|C(x)|, |C(y)|)

By reducing the power of a universal Turing machine to that of a normal compressor, the nice theoretical properties of universality and of being a proper metric are lost, but the results they report, together with the absence of parameters in such an approach, popularised this method. Other approximations of the Universal Metric are CD (Compression Dissimilarity) and UCD (Universal Compression Dissimilarity); see the general review by Ferragina et al. [89]. Because of the hierarchical expressive power of straight-line grammars, and the fact that they have been — since their re-discovery ten years ago — connected to Kolmogorov complexity and structure discovery, it seems natural to use one of the straight-line grammar algorithms as the compressor in the NCD_C metric. Useful equations, like |C(xy)| = |C(yx)|, |C(xx)| = |C(x)| and |C(x|y)| = |C(x)| if y does not share any substring with x, hold (approximately) when C(x) is a smallest grammar of x. However, to our knowledge, such an approach was only very recently proposed, by Cerra and Datcu [47]. The authors use the grammars returned by RePair (see Sect. 2.6.7) and define a complexity approximation of the sequence based on the number of rules of this grammar (each rule of a RePair grammar has a right-hand side of length two). This is then successfully applied to hierarchically cluster mitochondrial DNA of different mammals and satellite images of different surfaces and vegetation.
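Computing NCD_C is straightforward once a compressor C is fixed. The sketch below uses zlib as a stand-in for C (a straight-line-grammar-based compressor, such as RePair followed by an entropy coder, could be plugged in instead):

import zlib

def C(x: bytes) -> int:
    # compressed size in bytes; zlib here is only a placeholder for any compressor
    return len(zlib.compress(x, 9))

def ncd(x: bytes, y: bytes) -> float:
    # Normalised Compression Distance of Definition 7
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"ACGTACGTACGT" * 50
b = b"ACGTACGTACGT" * 40 + b"TTTTCCCC" * 15
c = b"GGGCCCAAATTT" * 50
print(ncd(a, b))   # relatively small: a and b share most of their structure
print(ncd(a, c))   # larger: little shared structure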

2.5 Structure Discovery

Besides their compression capacity, the ability to infer a structure over the sequence has traditionally been the main motivation for work on SLGs. The objective here is not to identify a target language, but to identify how the sequence is structured into segments, and how these segments are related to each other in a hierarchical manner. To achieve this, we have to analyse the unique parse tree given by the resulting grammar. Because the non-terminals are anonymous and have no other meaning than to identify equal subsegments, the original grammar can be completely recovered from the parse tree or from the associated bracket set. Compared to a simple segmentation of the sequence, a parse tree has the advantage that it permits zooming in and out, considering different levels of segmentation, and that it reveals how bigger bricks are composed of smaller ones.

For a generic algorithm, Occam's Razor justifies the search for small grammars: if two grammars produce the same string, then this principle suggests that the smaller of the two is more likely to be the correct one. In particular, a smallest grammar is one that manages to extract all the redundancy that can be expressed with a context-free parse tree. One of the first to use a grammatical approach to segment a sequence hierarchically was Wolff [242], who in 1975 analysed the result of a computer program that iteratively replaces the most frequent pair of symbols by a new symbol. Applications of Sequitur (see Fig. 2.3) are among the best-known examples of how the final grammar exposes some of the real structure underlying the input sequence. Even if in almost all cases the examples are more anecdotal than quantitative, they present some interesting possible applications, ranging from natural language and musical structure identification to improving rendering performance and easing keyword retrieval from a large data set. In the same line, Evans [82] develops MDLCompress through the computation of the Symbol Compression Ratio (SCR), a heuristic to evaluate a symbol's contribution to the overall sequence complexity. His OSCR algorithm — a previous version of MDLCompress — iteratively selects the words that best reduce this amount. This is then applied to detect intruders in a framework of information system security. In Evans et al. [84] the constituents of the final grammar are used to detect binding sites of miRNAs which lead to tumorigenesis. As we have seen before, most of the intuitive idea behind Kolmogorov Complexity is that the smaller the description, the better it captures the real structure of the string. The idea of measuring the complexity of life became more tangible with the discovery that all the information transmitted to an offspring is encoded in a sequence, the DNA. Therefore, there have been many interesting attempts to define a measure of complexity of sequences, without complete success until now [39]. One of these attempts using SLGs was made by Lanctot et al. [138], who took as the complexity of a DNA sequence an entropy estimate computed on the grammar resulting from their GTAC algorithm (see Sect. 2.6.8). In their results they were able to distinguish between coding (slightly higher entropy) and non-coding regions, and to observe that highly expressed genes have lower entropy than normal ones. In Structure Discovery, results are difficult to evaluate. Most of the work, like Wolff [242] or Nevill-Manning [171], shows examples of the potential, but without a rigorous quantitative analysis of the quality of the structure found. Evans et al. [84] show interesting results but, at least in the validation on genetic sequences, the evaluation is performed on the constituents of the grammar; not much is said about the hierarchical structure that is found.

2.6 Algorithms

In this section we will review existing algorithms that take a sequence as input and output a straight-line grammar that generates this sequence. We start with the LZ77 algorithm: while its output cannot be mapped directly to a context-free grammar, it holds strong links to the Smallest Grammar Problem.


2.6.1 LZ77

LZ77 is maybe the most influential compression algorithm because of its simplicity, theoretical analysis and widespread use in commercial algorithms like Deflate, used in Phil Katz's famous PKZIP tool and afterwards in gzip and the png file format. The first work that described what would become known as the LZ77 algorithm, by Ziv and Lempel [248], was rather theoretical, and several different implementations were given in the following years, usually adding the initial of the inventor after the traditional LZ. What we will describe here is not a real compression algorithm but a factorisation of the sequence [53]. LZ77 outputs a list that contains in each position either a single character or a tuple of two integers. It processes the sequence online from left to right. At each position i, it looks for the longest prefix of s[i..] that appears in s[..i−1]. If this prefix has length ℓ and appears at position m, then LZ77 outputs the tuple ⟨m, ℓ⟩. If no such substring exists, it outputs the character s[i]. It is important to note that the output cannot be trivially interpreted as a context-free grammar, because a pair ⟨m, ℓ⟩ can refer to a substring that starts inside one pair but ends after it. Moreover, it is a known result that the size of an LZ77 factorisation is a lower bound on the size of a smallest grammar. This is used by some of the approximation algorithms that ensure a worst-case approximation ratio [51, 200].
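A naive, quadratic version of this factorisation (following the description above and requiring the earlier occurrence to lie entirely inside s[..i−1], i.e. without the self-overlapping copies allowed by some LZ77 variants) can be sketched as:

def lz77_factorise(s):
    factors, i, n = [], 0, len(s)
    while i < n:
        best_len, best_pos = 0, -1
        for m in range(i):                       # candidate earlier positions
            l = 0
            while i + l < n and m + l < i and s[m + l] == s[i + l]:
                l += 1
            if l > best_len:
                best_len, best_pos = l, m
        if best_len == 0:
            factors.append(s[i])                 # new character
            i += 1
        else:
            factors.append((best_pos, best_len)) # the pair <m, l>
            i += best_len
    return factors

print(lz77_factorise("ababababa"))   # ['a', 'b', (0, 2), (0, 4), (0, 1)]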

2.6.2 LZ78

Defined by Lempel and Ziv in 1978 [249] as a successor of LZ77, LZ78 has an output that can be mapped to a context-free grammar in Chomsky Normal Form. A variant named LZW is still used in the GIF image format and in the compress utility. As in the description of LZ77, we will focus here on the most general abstract description, without entering into the small but important details that make these algorithms run fast and efficiently in practice. In particular, we will ignore any issues related to the size of the window and the size of the dictionary. LZ78 reads the sequence symbol by symbol in an online way, from left to right. It keeps a dictionary of words that is initialised with the empty string. At each position i it takes the index j of the longest word w in the dictionary such that s[i : i + |w|] = w, outputs the tuple ⟨j, s[i + |w|]⟩ and adds the word s[i : i + |w| + 1] to the dictionary. If interpreted as a context-free grammar, the final grammar is always in CNF: each right-hand side consists of a non-terminal (an index from the dictionary) and the terminal character that did not permit a longer match.
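A minimal sketch of the LZ78 parse (ignoring, as said above, any window or dictionary size limits) is:

def lz78_parse(s):
    dictionary = {"": 0}             # word -> index; 0 is the empty word
    output, i = [], 0
    while i < len(s):
        w = ""
        # extend the match while the extended word is already in the dictionary
        while i + len(w) < len(s) and w + s[i + len(w)] in dictionary:
            w += s[i + len(w)]
        c = s[i + len(w)] if i + len(w) < len(s) else ""   # '' only at the very end
        output.append((dictionary[w], c))                  # the pair <j, c>
        dictionary[w + c] = len(dictionary)
        i += len(w) + (1 if c else 0)
    return output

print(lz78_parse("ababaabb"))
# [(0, 'a'), (0, 'b'), (1, 'b'), (1, 'a'), (2, 'b')]
# dictionary words created: 'a', 'b', 'ab', 'aa', 'bb'

Read as a grammar in CNF, each tuple ⟨j, c⟩ corresponds to a rule Nk → Nj c, with N0 deriving the empty string.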

2.6.3 Sequitur Sequitur is probably the most popular of existing algorithms that infer a straight-line grammar. His popularity is due to its efficiency (linear in the size of the sequence), its timely appearance and successful application to model sequences of diverse origin. Sequitur also processes the sequences from left to right reading one charac- ter at a time. It maintains two invariants: every digram appears only once in the grammar (“digram uniqueness”) and every rule is used at least once (“rule util- ity”). After reading each symbol c, it is appended to the axiom rule (initialised

29 The bibliography over Sequitur is scattered over several publications, so what follows is a short summary in chronological order. The algo- rithm is first mentioned in Nevill-Manning et al. [178], where general characteristics of straight-line grammars are exploited for compression. This is then applied for compressing structured data in form of a ge- nealogical database [179]. The generated grammar is generalised man- ually and automatically, looking at patterns on the final grammar (see Sect. 1.2.1). The thesis of Nevill-Manning [171] dates from 1997 and is probably the most exhaustive description of Sequitur. In Nevill- Manning and Witten [174] the authors extend the conclusion of the thesis and analyses further its compression capacity. [173] analyses the time complexity of Sequitur and the size of the grammars it gener- ates with respect to the size of the original sequence. Nevill-Manning and Witten [175] discuss the possibility of complementing a straight-line learning process with a more traditional inference process that supports branching and looping. The probably most compact reference for Se- quitur is [172]. The linear memory usage is further analysed in [176] and various methods are presented to permit the algorithm to run in bounded space. Nevill-Manning et al. [180, 181] present a software that permits to browse a huge collection of library items. These papers also present some drawbacks of the way Sequitur defines lexical significant constituents. Finally, Nevill-Manning and Witten [177] compare the size of the resulting grammar of Sequitur with other offline algorithms.

Figure 2.3: A bibliographic overview of Sequitur

with the empty string). If the digram formed by its predecessor symbol C and c appears exactly as the right-hand side of another rule (N → Cc), the digram is replaced by N. If instead Cc already appeared before, a new rule N → Cc is created and both occurrences of Cc are replaced by N. Such a replacement reduces the number of occurrences of C, which may leave it appearing only once. If C is a non-terminal, this violates the second constraint; therefore the rule C → α is eliminated and the only occurrence of C is replaced by α. Thanks to a neat algorithmic design, and supposing constant-time look-up in a hash table, the run time of Sequitur is linear. See Figure 2.3 for an overview of the existing literature on Sequitur.

2.6.4 DNASequitur

Trying to compress DNA through a context-free grammar, DNASequitur [56] modifies Sequitur and analyses different transformations from an SLG to a symbol stream. The modifications to Sequitur consist in also considering reverse complements. Besides this, the main features of the algorithm (the online behaviour and the two constraints) remain unchanged.


2.6.5 Sequential

An undesirable property of Sequitur is that two non-terminals may produce the same constituent. An algorithm presented by Kieffer and Yang [127] and later called Sequential by Charikar et al. [51] addresses this issue and modifies Sequitur by adding a third constraint that ensures that this will not happen. While producing smaller grammars in the general case, this results in a non-linear algorithm.

2.6.6 Bisection / MPM

Suppose that n, the size of the sequence, is a power of 2. The Bisection algorithm, outlined in Kieffer and Yang [127], splits the sequence iteratively in two and assigns the same non-terminal to equal substrings. When n is not a power of 2, the right-hand side of the axiom consists of two or more non-terminals deriving substrings whose lengths are decreasing powers of two. This was then generalised in Kieffer et al. [128] with the Multilevel Pattern Matching (MPM) algorithm, which permits splitting the sequence into more than two parts.

2.6.7 RePair

Instead of an online treatment of the sequence like the LZ family or Sequitur, RePair by Larsson and Moffat [139] takes an off-line approach and considers all digrams of the original sequence. Each iteration consists in replacing the most frequent digram of the current sequence by a new symbol. Focusing only on digrams permits a linear-time implementation, which is given in the same paper. A theoretical analysis of its compression capacity can be found in Navarro and Russo [170]. It should be noted that RePair was not the first to implement this idea: Wolff [242] used it for pattern discovery in natural language and Ángel Jiménez-Montaño [9] for an approximation of his grammar complexity on genetic sequences. In pure data compression, the byte-pair encoding compression algorithm implements this idea [92]. See also Bell et al. [26, Chapter 8.2.1] and the corresponding bibliographical references.
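A naive, quadratic version of RePair can be sketched in a few lines (the clever linear-time bookkeeping of Larsson and Moffat is omitted, and ties between equally frequent digrams are broken arbitrarily):

from collections import Counter

def repair(s):
    seq = list(s)
    rules, next_id = {}, 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:                            # no digram repeats: stop
            break
        nt = "R%d" % next_id
        next_id += 1
        rules[nt] = [a, b]
        new_seq, i = [], 0
        while i < len(seq):                      # replace occurrences left to right
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                new_seq.append(nt)
                i += 2
            else:
                new_seq.append(seq[i])
                i += 1
        seq = new_seq
    return seq, rules

axiom, rules = repair("abracadabra_abracadabra")
print(axiom)    # e.g. ['R6', '_', 'R6'] (the exact rules depend on tie-breaking)
print(rules)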

2.6.8 Longest First

A similar — but in some sense opposite — idea to RePair is to select the longest repeat in each iteration. The core idea was introduced in 1999 [27] (in the same conference and the same session as RePair) as a general-purpose compressor. It is also an off-line algorithm and looks for interesting repeats in the original sequence. The algorithm iteratively selects the longest repeat in the sequence, extracts it and replaces all occurrences of this repeat with a pointer. Later this idea was extended in order to also take into account the right-hand sides of previously introduced rules. In Lanctot et al. [138] this heuristic is used to define the algorithm GTAC, which is used as an entropy estimator for DNA sequences. For a long time, claims of a linear implementation of this algorithm were made [116, 138, 167, 177]; finally such an algorithm was given by Nakamura et al. [168], based on sparse lazy suffix trees.
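A simplified sketch of this longest-repeat-first idea (considering repeats in the axiom only and using a cubic search instead of suffix trees) is:

def longest_first(s, min_len=2):
    axiom, rules, next_id = list(s), {}, 0

    def longest_repeat(seq):
        # longest word occurring at least twice at non-overlapping positions
        for length in range(len(seq) // 2, min_len - 1, -1):
            first = {}
            for i in range(len(seq) - length + 1):
                w = tuple(seq[i:i + length])
                if w in first and i - first[w] >= length:
                    return list(w)
                first.setdefault(w, i)
        return None

    while True:
        w = longest_repeat(axiom)
        if w is None:
            break
        nt = "L%d" % next_id
        next_id += 1
        rules[nt] = w
        out, i = [], 0
        while i < len(axiom):        # replace non-overlapping occurrences left to right
            if axiom[i:i + len(w)] == w:
                out.append(nt)
                i += len(w)
            else:
                out.append(axiom[i])
                i += 1
        axiom = out
    return axiom, rules

print(longest_first("abcabcXabcabc"))
# (['L0', 'X', 'L0'], {'L0': ['a', 'b', 'c', 'a', 'b', 'c']})

Note that, unlike the extension mentioned above, this sketch never looks for repeats inside the right-hand sides of already-created rules, so the inner repeat abc of L0 is not factored out.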

2.6.9 Greedy

Similar to LongestFirst and RePair, Greedy [13] is another off-line algorithm that selects in each iteration a repeat to be replaced by a new symbol. In this case, the repeat that would yield the best contraction of the final grammar is chosen. Apostolico and Lonardi [13] define a compression schema, and the contraction caused by a word is defined accordingly. This is applied in particular to biological sequences [14]. Nevill-Manning and Witten [177] use the final size of the grammar as the size measure, and their algorithm Compressive selects the repeat whose replacement reduces the size of the grammar the most.

2.6.10 MDLCompress

MDLCompress [82, 84] is similar in strategy to Greedy and Compressive. In each iteration it selects the repeat that most reduces the description length of the grammar. The motivation, however, is to find a correct model rather than to compress. The authors interpret an SLG as a model (the axiom rule) plus data (the rest of the rules) and derive from this a more sophisticated score function. The resulting grammar is used less for its compression capacity than for its potential to detect intruders and to discover microRNA targets (see Sect. 2.5). The original algorithm [83] only considered occurrences in the axiom rule, but MDLCompress also considers the “model” and changes the score function used to select a repeat. This score (the “symbol compression ratio”) of a word w is defined as:

(|pos_s(w)| log2 R̂ − log2(|pos_s(w)|) + |w|) / (|pos_s(w)| · |w|)

with R̂ “constant for a given partition of symbols”. Evans et al. [84] also make reference to some post-processing particular to DNA sequences, but no details are given.

2.6.11 Bounding the Worst Case

Since the re-discovery of smallest grammars as an approximation of Kolmogorov complexity in the work of Lehman, Charikar and co-workers [51, 141], several papers have defined algorithms that ensure that the ratio between the size of the resulting grammar g and the size of a smallest grammar g∗ is bounded in the worst case. Charikar et al. [51] themselves conclude by giving two algorithms that achieve better worst-case approximation ratios than the bounds they found for other algorithms. The first is based on an approximation algorithm for the shortest superstring problem and has an approximation ratio of O(log³ n). The second algorithm achieves an approximation ratio of O(log(n/g∗)); it is based on the LZ77 factorisation of the original sequence and involves a rather complicated maintenance of a balanced binary tree. This is simplified in simultaneous work by Rytter [200], which achieves the same ratio using the LZ77 factorisation and an AVL tree. After this initial approximation, Sakamoto continued the work. Based on the RePair algorithm, he presents in [204] a linear algorithm that achieves an

32 CHAPTER 2. THE SMALLEST GRAMMAR PROBLEM

O(log2(n)) approximation ratio, which was improved in a journal version [205] to O(log n/g∗). The next challenge he and his co-workers focused on was to reduce the space requirements and they present [206] a linear algorithm that only needs O(g∗ log g∗) space with an approximation ration of O(log g∗ log n). Recently they presented yet another algorithm that performs well on all three fronts: it consumes O(g∗ log g) space, needs O(n log∗ n)8 time and achieves an O((log∗ n) log n) approximation ration. Another variant is proposed by Gagie and Gawrychowski [93], where they consider a streaming model and prove that with constant memory and a logarithmic number of passes over a constant num- ber of streams, a O(min(g log g, n/ log n)) approximation algorithm is possi- ble. p It should be noted that the definition of size of a grammar in these papers differs from ours (Def. 2), because there |G| is defined as being the sum of the symbols in the right-hand side only (we will denote by m this amount). This is practical for an asymptotic behaviour because it eases the calculations and because the number of non-terminals is bounded by m/2. A similar argument can be used to justify the use of grammars in Chomsky Normal Form in most of these algorithms: as they are no unitarian nor ǫ-rules, the ratio of the size of the canonical CNF and the size of the original one is constant. However, these constants can make a huge difference in practical applications (see Sect. 2.7).

2.7 Comparison

In the presentation of the algorithms in the previous section, we observe an evolution from on-line algorithms (LZ78, Sequitur and its descendants) to linear off-line (RePair, LongestFirst) to reach finally more complex off-line algorithms (Greedy, MDLCompress). In this section we analyse the conse- quences that this increasing complexity of these algorithms has on the size of the final grammar. Previous studies performed similar comparisons, but considered the asymp- totic lower and upper bound for a worst case. As we have seen, the standard measure is the ratio between the size of the output grammar, and the size of a smallest grammar. In particular, Charikar et al. [51] analyse and compare the approximation ratio of existing algorithms, including LZ78, Bisection, Se- quential, LongestFirst, Greedy and RePair. Of these, the one with best 1 upper bound is Bisection with O((n/ log n) 2 ). However, it is not resolved if these bounds are tight. Considering the difference with respect to the known lower bound this does not seem to be the case. LongestFirst, Greedy and RePair are upper-bounded in general and the best known lower bound for the worst case for Greedy is a constant (5 log 3/(3 log 5) ≈ 1.138). Also, the use of the Big-O notation in this analysis can be misleading in practical applications. It is known [51] that the size of the LZ77 factorisation of a sequence is a lower bound on the size of a smallest grammar for this sequence, which is the reason that the best approximation algorithms are based on this decomposition. We computed the LZ77 factorisation on the Canterbury corpus (see Table 2.2(a)). For all but one file (namely, ptt5) the size of the LZ77 decomposition is bigger than n/ loge(n), which means that the trivial grammar ∗ 8log n is the iterated logarithm of n, the maximal number of logarithms in the expression log log ... log n, such that it is greater than 1.

33 hΣ, Σ∪{S}, {S → s},Si of size n+1 is already within an log n factor of a smallest grammar for all but one sequences of the Canterbury corpus. As often, the constant factor hidden in the Big-O notation can have dramatic consequences in practice. Regarding practical application, a similar — but much more reduced in scope — comparison was performed by Nevill-Manning and Witten [177].

2.7.1 IRR: a general offline framework As we have seen, most offline algorithms follow the same general scheme. First, the grammar is initialised with a unique initial rule S → s where s is the in- put sequence and then they proceed iteratively. At each iteration, a word ω occurring more than once in s is chosen according to a score function f, all the (non-overlapping) occurrences of ω in the grammar are replaced by a new non- terminal N and a new rewriting rule N → ω is added to the grammar. We give pseudo-code for this general scheme that we name Iterative Repeat Replace- ment (IRR) in Algorithm 1. Recall that Rˆ(s) is the set of non-overlapping repeats of size at least two and Lω(s) the normalised non-overlapping list of occurrences of ω in s. Gω7→N is the result of replacing each occurrence of ω in Lω(r(G)) by a new symbol N and adding the rule N → ω to the set of productions.

Algorithm 1 Iterative Repeat Replacement (IRR) IRR(s, f) Input: s is a sequence, and f is a score function 1: G ←[ G({N0 → s}) 2: while ∃ω : ω ←[ arg max f(α, G) ∧ |Gα7→N | < |G| do α∈Rˆ(r(G)) 3: G ←[ Gω7→N 4: end while 5: return G

The IRR scheme enables us to compare in a uniform framework the be- haviour of different score functions f that are used in the classical algorithms for choosing the words to replace. LongestFirst correspond to f(ω, G) = fML(ω, G) = |ω|. Choosing the most frequent repeat, like in RePair, corre- sponds to use f(ω, G) = fMF (ω, G) = |Lr(G)(ω)|. Note however the difference that IRR is more general than RePair and may select a word which is not a digram. In order to derive a score function corresponding to Compressive, note that replacing a word ω by a non-terminal results in a contraction of the grammar of (|ω|−1)∗|Lr(G)(ω)| and its inclusion in the grammar adds |ω|+1 to the grammar size. This defines f(ω, G) = fMC (ω, G) = (|ω| − 1) ∗ (|Lr(G)(ω)| − 1) − 2. We call these three algorithms IRR-ML (maximal length), IRR-MF (most frequent) and IRR-MC (maximal compression), respectively. The complexity of IRR when it uses one of these scores is O(n3): for a sequence of size n, the computation of the scores involving only |Lr(G)(ω)| and |ω| of the O(n2) possible repeats can be done in O(n2) using a suffix-tree– like structure. The number of iterations is bounded by n since the size of the grammar decreases at each step.

34 CHAPTER 2. THE SMALLEST GRAMMAR PROBLEM

2.7.2 Final grammar size We performed a comparison of the final grammar size obtained by the algorithms presented in Sect. 2.6 on the Canterbury and DNA corpora (see Appendix A). We exclude MDLCompress and Greedy whose goal is to compress the final bitstream (also, no public version of MDLCompress is available, only source code for the older OSCR algorithm [82]). We could not found any available implementation of DNASequitur9 and it is not clear how to weight reverse- complement non-terminals when comparing the grammar size. For Bisection and Sequential we used Yann Ponty’s implementation10. It presented prob- lems with the files containing non-printable symbols of the Canterbury cor- pus, so we excluded these files for these two algorithms. José Rondo from the University of Chile implemented the approximation algorithm of Rytter [200] (personal communication). As the final grammars is completely in Chomsky Normal Form, we only report the sum of the right-hand side (not the number of productions). Again, the implementation presents some problems with the file containing non-printable symbols which we thus excluded. For LongestFirst, RePair and Compressive we used our own IRR implementation For LZ78 we post-processed the output factorisations to transform them into context-free grammars, where every non-terminal appears at least twice (if not, it is elim- inated and replaced with its right-hand side). Finally, the reader should bear in mind that a tuple hm, ℓi counts as one for the computation of the size of the LZ77 factorisation. The results in Table 2.2 reveal the preeminence of the greedy strategy of IRR-MC. IRR-MF comes close (obtaining a smaller grammar in one case) and could be interesting because of the linear RePair and the additional useful in- formation that every rule length is of size two. But the huge difference in the case of sequence humghcs — a sequence with high number of repeats (see Ta- ble A.3) —, where the grammar obtained by IRR-MF is a 25% bigger than the one of IRR-MC illustrate that there are cases where the final size can vary dras- tically. Bisection and LZ78 perform poorly, but this seems obvious because it is not their goal to optimise the size of the final grammar.

2.7.3 DNA Compression Thanks to the easily interpreted structure they generate and the failure of other general purpose algorithms, it seemed natural to apply SLG to compress DNA sequences. To our knowledge, there have been four different attempts: 1. Greedy (2000, see Sect. 2.6.9) [14], 2. GTAC (2000, see Sect. 2.6.8) [138], 3. DNASequitur (2004, see Sect. 2.6.4) [56], 4. MDLCompress (2007, see Sect. 2.6.10) [84]. We summarised the resulting compression in Table 2.3. The results for the Greedy algorithm [14] were obtained with the software the authors pub- lished11. GTAC [138] only reports a theoretical entropy measure, not the real

9The original code of N. Cherniavsky got lost (personal communication) 10http://yann.ponty.free.fr/approximations.html 11http://www.cs.ucr.edu/~stelo/Offline/

35 Table 2.2: Final grammar size for straight-line grammar algorithms on Canterbury (a) and DNA corpus (b). Absolute numbers are given for LZ77 and IRR-MC only, the others are given as percentage with respect to IRR-MC. For Rytter only the sum of the length of the right-hand sides is given, as the final grammar is in CNF. The best for each row is boldfaced. (a) Canterbury Corpus Sequen Sequi Bisect IRR- IRR- IRR- sequence LZ77 Rytter LZ78 tial tur tion ML MF MC alice29.txt 22,906 250.54 14.08 19.87 236.89 105.71 36.72 3.54 41,000 asyoulik.txt 21,643 249.41 14.60 17.74 229.34 95.23 37.35 2.76 37,474 cp.html 4,587 178.88 28.73 22.20 248.46 102.40 19.43 5.36 8,048 fields.c 1,871 171.55 41.54 20.26 297.01 136.39 16.51 10.22 3,416 grammar.lsp 855 160.56 41.68 20.16 255.67 102.04 17.45 9.64 1,473 kennedy.xls 152,224 – – 4.40 – 28.67 7.69 0.09 166,924 lcet10.txt 52,611 303.28 18.97 24.54 273.66 169.90 44.74 3.12 90,099 plrabn12.txt 72,628 284.51 7.15 14.86 219.96 70.35 45.09 0.94 124,198 ptt5 25,467 – – 23.39 – 66.98 25.07 1.12 45,135 sum 7,914 – – 25.31 – 99.24 13.59 6.21 12,207 xargs.1 1,172 147.01 25.62 16.10 230.71 82.75 12.36 6.53 2,006 average – 218.22 24.05 18.98 248.96 96.33 25.09 4.50 – 36

(b) DNA Corpus Sequen Sequi Bisect IRR- IRR- IRR- sequence LZ77 Rytter LZ78 tial tur tion ML MF MC chmpxx 16,458 302.58 3.62 5.61 167.94 42.4 59.35 0.01 28,706 chntxx 21,604 306.63 2.83 5.93 174.29 41.25 58.88 0.03 37,885 hehcmv 31,085 291.36 3.63 4.67 178.94 43.25 61.09 0.09 53,696 humdyst 6,143 271.77 3.46 5.92 160.28 37.48 53.29 0.02 11,066 humghcs 9,945 264.93 46.36 20.3 250.92 91.86 36.32 25.46 12,933 humhbb 10,894 279.61 7.99 7.16 176.20 44.38 54.72 2.27 18,705 humhdab 8,928 273.46 6.42 9.77 169.64 44.99 51.74 0.27 15,327 humprtb 8,622 272.68 5.47 7.74 169.96 43.98 52.94 0.35 14,890 mpomtcg 25,774 310.05 5.08 5.62 182.07 44.12 59.01 0.90 44,178 mtpacga 14,060 300.70 4.51 6.05 169.52 42.46 57.00 0.29 24,555 vaccg 25,718 303.81 3.17 5.37 177.56 46.13 61.62 -0.05 43,701 average – 288.87 8.41 7.65 179.76 47.48 55.09 2.69 – CHAPTER 2. THE SMALLEST GRAMMAR PROBLEM

DNA IRRc- MDL IRRc- sequence Greedy AAC-2 Sequitur ML Compress ML-5 chmpxx 2.12 3.1635 1.9022 - 1.8364 1.6929 chntxx 2.12 3.0684 1.9986 1.95 1.9333 1.6306 hehcmv 2.12 3.8455 2.0158 - 1.9647 1.8765 humdyst 2.16 4.3197 2.3747 1.95 1.9235 2.2396 humghcs 1.75 2.2845 1.5994 1.49 1.9377 1.9626 humhbb 2.05 3.4902 1.9698 1.92 1.9176 1.9278 humhdab 2.12 3.4585 1.9742 1.92 1.9422 1.9913 humpr 2.14 3.5302 1.9840 1.92 1.9283 1.9682 mpomtcg 2.12 3.7140 1.9867 - 1.9654 1.9737 mtpacga - 3.4955 1.9155 - 1.8723 1.8767 vaccg 2.01 3.4782 1.9073 - 1.9040 1.7861

Table 2.3: Results of existing SLG compressors that have been applied to DNA sequences. IRRc refers to the IRR algorithm where the complimentary strand is also taken into account. All numbers refer to bits per symbol. bit string12. We used our own implementation of LongestFirst (IRR-ML), adding an option of searching also for complimentary repeats (like in the original GTAC algorithm). We encoded r(G) with a 0-order adaptive arithmetic coder, adding one bit per non-terminal to differentiate normal repeats from reverse- complement ones. Unfortunately, copyright issues prevent a public available version of MDLCompress. Our results including the published score function in our IRR schema yielded slightly different results which may be due to the post-processing steps applied by the authors or just on how ties were resolved. We preferred therefore to report in Table 2.3 only published results for this algorithm [84]. For the sake of comparison, we completed this with the results using an higher-order adaptive arithmetic coder with a context of 2 (which is reported by Grumbach and Tahi [107] to be the value that achieves best compression). The results of IRR-ML are surprisingly bad. We suppose that is due to the big number of rules this algorithm generates, and to compare we run IRR-ML for only five iterations. The result can be appreciated in the last column and should be compared with the result of state-of-the-art DNA compressor (Table 2.1). It performs well in general, but some exceptions (humdyst and humghcs) reveal that a more elaborated schema is necessary to yield a competitive DNA com- pressor. The only algorithm that outperforms (even if only slightly) AAC-2 is MDLCompress, excepting one case.

We conclude this comparative study remarking that the additional complex- ity of off-line algorithms pays out in the final size. For DNA Compression, the best results are obtained with MDLCompress, even if none of them seems to be able to compete against standard DNA compressors. The difference is even more acute if the comparison is performed regarding the (somehow more direct measure of) size of the final grammar. The off-line algorithms perform much better than any of the on-line ones. Inside those, IRR-MC itself differentiates

12Also, Nakamura et al. [168] report an error in this algorithm, which leads (in the best case) to a non-linear algorithm or to an erroneous output

37 clearly from the others and represents the current state of the art.

38 CHAPTER 3. EFFICIENCY

Chapter 3

Efficiency

We have seen a wide range of algorithms that infer a straight-line grammar from a given sequence. Some of them are on-line and strive for speed and reduced space use. They are adapted to work efficiently in on-line applications, streaming pipelines or in general for use where time or memory are very reduced. However, when compared with respect to the size of the final grammar, the metric we use to measure the quality of the final structure, they perform poorly compared to off-line algorithms. Moreover, when the final goal is to learn something about the structure, time is less constraining. Of course, the final algorithm still has to be feasible. Therefore, in this section we consider the general IRR framework. We ex- perimented with different score functions to choose the right constituents, and the MC strategy proved to obtain the best results. Here, we will focus on how to implement efficiently the off-line IRR framework in a way that permits it to scale easily. There has been considerable work to improve the efficiency of individual IRR algorithms. In particular, we already mentioned that RePair [139] – which is similar to IRR-MF – runs in provable linear time in the size of the input. The same is true for LongestFirst – equivalent to IRR-ML – based on a careful update of a sparse lazy suffix-tree [168]. These solutions however are ad-hoc and depend strongly on special characteristics of this strategy. For example, any repeat selected by IRR-MF will never appear inside a right-hand side of a non-axiom rule. Conversely, any selected repeat by IRR-ML will never contain a previously introduced non-terminal. Other choices for the function score vio- lated these invariants and it is not clear how to adapt the given solutions to the general case. The bottleneck (both asymptotically and in practical instances) in these offline algorithms lies in computing all repeats and calculating the score function for every such substring. This presents a real problem when these algorithms are being applied on DNA sequences: not only can the sequences be longer by orders of magnitude, but the presence of long repeats means that the number of repeats can grow very fast (see Fig. 3.1(a)). Most of these repeats may not be interesting, in the sense that their score is known to be smaller than that of another. This reasoning leads to a definition of equivalence classes and maximality of a repeat. In Sect. 3.2 we consider different classes of repeats and analyse how the IRR algorithms behave if the repeats they consider are

39 limited to these classes. For one of these classes we present a linear algorithm to compute all repeats of this class. In Sect. 3.3 we consider the effects on the final grammar size and execution time of not filtering out overlapping repeats, and estimate the number of non- overlapping occurrences with the total number of occurrences. Finally, we consider how to reduce the time spent in computing all repeats by using the information that was computed in the previous iterations. Most of the repeats are not modified from one iteration to the next, and at an intuitive level it seems that much computation cycles are wasted calculating the same result over and over again. A similar approach was considered by Markham et al. [158] in order to improve the efficiency of their MDLCompress algorithm. They store in a table the information necessary to compute the score for all repeats, together with pointers that permit to update this information. Though the authors report that the number of repeats apparently is linear in the sequence they tried, a theoretical upper bound is O(n2). Our solution differs from this, and is inspired by the fact that most of the time is spent in computing the possible repeats, and not in the computation of the score once the repeats are given. We present therefore in Sect. 3.4 a data structure almost equivalent to an enhanced suffix array, that permits to be updated efficiently while a repeat in the index sequence is replaced by a new symbol. A preliminary version of Sect. 3.2.3 and Sect. 3.3 was realised in collab- oration with the Natural Language Processing Group from the University of Córdoba, Argentina through a joined INRIA/MINCyT project and accepted for publication in 2010 [46]. Sect. 3.4 was realised in collaboration with Pierre Peterlongo, from the Symbiose team at the INRIA research center of Rennes, and was published in 2009 [96], based on a preliminary version presented at the Prague Stringoloy Club conference in 2008 [95]. For all these algorithmic improvements, we will use an index data structure called enhanced suffix array which is defined in the following section.

3.1 The Suffix Array

A suffix array is part of the suffix-tree data structure family. It consists in a lexicographically ordered array of all suffixes of the input sequence. The suffixes themselves are not stored but instead their starting positions.

Definition 8 (Suffix Array). Consider a sequence s of length n over an alphabet Σ with an order ≺. The lexicographical extension to Σ∗ will also be denoted by ≺. Let s˜ = s$, with a special character $ not contained in Σ, smaller than every element of Σ. The suffix array, denoted by sa, is a permutation of [0..n] such that:

∀ i, 0 < i ≤ n :s ˜[sa[i − 1]..] ≺ s˜[sa[i]..]

Recall that s[i..] denotes the suffix of s starting at position i. Usually, the suffix array is used conjointly with an array called lcp, that gives the length of the longest common prefix between two suffixes whose starting

40 CHAPTER 3. EFFICIENCY

(a)

(b)

Figure 3.1: Number of repeats (a) and maximal repeats (b) over prefixes of the Large Corpus. Largest maximal and super-maximal repeats follow a similar trend to maximal repeats

41 positions are adjacent in sa. Formally,

lcp[0] = 0, ∀ i ∈ [1, n]: lcp[i] = k such that s˜[sa[i − 1]..][..k − 1] =s ˜[sa[i]..][..k − 1] and s˜[sa[i − 1]..][k] 6=s ˜[sa[i]..][k].

Eventually, a third array called isa (for inverse suffix array) may be used conjointly with sa and lcp. This array gives, for a position p in s, the index i in sa such that sa[i] = p. Thus sa[isa[p]] = p.

Suffix arrays can be constructed in linear time [123, 129, 133] but non-linear algorithms [140, 156] are usually more efficient for practical applications [192]. The union of sa, lcp and isa arrays is called an Enhanced Suffix Array (ESA). Enhanced suffix arrays are known to be equivalent to suffix trees [2] in the sense that they can easily be used to mimic a suffix tree. There are however more space efficient than suffix trees. This space improvement is compensated in general by an extra log n time factor when some kind of exact pattern matching has to be done: while this can be done straightforward on a suffix-tree by reading the word down the tree (in time O(m), with m the length of the pattern), on a suffix array a binary search is needed (thus O(m log n)). Alternatively, but using more extra arrays, this can be done in O(m+log n) [153] or even O(m) [2]. To avoid confusion, we will use the term position when referring to the index over a sequence and index when referring to any of the arrays of an ESA.

3.2 A Taxonomy of Repeats

Maximal repeats appear in the literature (see notably the classical book of Gus- field [110]) as a compact representation of all repeats. Differently from normal repeats, the number of maximal repeats inside a sequence is linear and it is trivial to recover all repeats from the set of maximal repeats. A maximal repeat is a repeat such that if it would be extended to its left or right it would lose some of its occurrences. For the definition, we will prefer the use of a set notation to refer to occurrences of repeat. Poss(w) = {{j ∈ N : i ≤ j < i + |w|} : i ∈ poss(w)}, the set of intervals of occurrences (where each interval is expressed as the set of positions). If w has length three, and appears in position 4 and 6 on sequence s (they overlap), then Poss(w) = {{4, 5, 6}, {6, 7, 8}}. Formally:

Definition 9 (Maximal Repeats). The set of maximal repeats (MR) is the set of repeats such that: ′ ′ ′ ′ MR(s) = {w ∈ R(s):6 ∃w ∈ R(s): ∀o ∈ Poss(w): ∀o ∈ Poss(w ): o 6⊆ o }

The property of maximality is strongly related to the context of a repeat. If the symbol to the left (right) of any occurrence of w is always the same, then w is not a maximal repeat because it could be extended to its left (right) without losing any occurrence. A stronger property is super-maximality. A repeat is super-maximal if it is not a substring of any other repeat. They can then be defined by looking at

42 CHAPTER 3. EFFICIENCY the set of repeats alone, without need of referring to their occurrences. Our1 definition is equivalent, but uses the notion of occurrences to mirror Def. 9. Definition 10 (Super-maximal Repeats). The set of super-maximal repeats (SMR) is the set of repeats such that : ′ ′ ′ ′ SMR(s) = {w ∈ R(s):6 ∃w ∈ R(s): ∃o ∈ Poss(w): ∀o ∈ Poss(w ): o 6⊆ o } ′ ′ ′ ′ = {w ∈ R(s): ∀w ∈ R(s):6 ∃o ∈ Poss(w): ∀o ∈ Poss(w ): o 6⊆ o } Frequent repeats are more probable (supposing an i.i.d. source) to be maxi- mal repeat than longer one, because only one different context suffices to define them as maximal. This implies that small repeats are likely to be maximal repeats. For super-maximality the opposite is true: small repeats are less likely to be super-maximal, and only long repeat have a chance of not being contained in another repeat. Largest-maximal repeats lie in between. These are those repeats that have at least one occurrence not covered by another repeat. Definition 11 (Largest-maximal Repeats). The set of largest-maximal re- peats (LMR) is the set of repeats such that : ′ ′ ′ ′ LMR(s) = {w ∈ R(s): ∃w ∈ R :6 ∃o ∈ Poss(w): ∀o ∈ Poss(w ): o 6⊆ o }

Largest maximal repeats cover the whole sequence, which is not necessarily true for super-maximal repeats. But they do it in a less redundant way than maximal repeats. Gusfield names this set near-supermaximal repeats and uses them to facilitate the explanation of super-maximal repeats. The character- isation of the largest maximal repeat of Gusfield [110, Theorem 7.12.4, page 147] is given in terms of the leaves of the suffix tree associated to a sequence. Recently, this class of repeats were re-discovered [183] and successfully applied to the automatic detection of CRISPRs, a genomic structured found in archaea and bacteria that are expected to have a role in their adaptive immunity [199]. We will see in Sect. 5.4.1 that Definition 11 corresponds naturally to irreducible motifs when referring to rigid patterns. In Fig. 3.1 the difference between the number of simple and maximal repeats can be observed. Largest and super-maximal repeats follow a similar trend to Fig. 3.1(b).

3.2.1 Bounds One of the key features of maximal repeats is that their total number can be bounded by the size of the sequence [110]. However, the total number of oc- currences of maximal repeats can still be quadratic (see Lemma 3). Largest- maximal repeat could fill the space between the quadratic number of total occur- rences of maximal repeats and the linear number of total occurrences of super- maximal repeats, but a tight upper bound is still unknown. Table 3.1 resumes what is known about the bounds of the class of repeats we presented. We proved all non-trivial bounds. nX (k) denotes maxs:|s|=k{|X(s)|} where X stands for one of R, MR, LMR or SMR and OccsX (k) = maxs:|s|=k{ w∈X(s) |poss(w)|}.

Lemma 1. nX (k) ∈ Θ(k) for X = MR, LMR, SMR. P 1This definition is due to Jacques Nicolas.

43 (a) bible

(b) E.coli

(c) world192

Figure 3.2: Number of all four classes of repeats for successive prefixes of bible.txt (a), Ecoli (b) and world192.txt44 (c) CHAPTER 3. EFFICIENCY

X nX (k) Proof OccsX (k) Proof R Θ(k2) Lemma 2 Θ(k2) Lemma 2 MR Θ(k) Lemma 1 Θ(k2) Lemma 3 3 LMR Θ(k) Lemma 1 Ω(k 2 ) Lemma 4 SMR Θ(k) Lemma 1 Θ(k) Lemma 5

Table 3.1: Upper and lower bounds for the number of normal, maximal, largest- maximal and super-maximal repeats; and for the total number of occurrences of these classes.

Proof. It is a known fact that the number of maximal repeats is O(n). The upper bound is therefore trivial For the lower bound, we will prove that it holds for super-maximal repeat, and therefore also for maximal and largest-maximal. Consider the family of sequences sk = a1a2 . . . ak|a1a2|a2a3| ... |ak−1ak, over an alphabet of size 2k (Σ(sk) = a1, . . . , ak, |1,..., |k−1). Note that n = |sk| = 3∗(k −1)+k = 4∗k −1. There is no repeat of size bigger than two. Moreover, every pair aiai+1 is a super-maximal repeat: every such pair appears two times and because they are all different and there are no longer repeats, no other repeat is a superstring of them. So there are k − 1 such repeats, which gives the lower bound Ω(k) for the number of super-maximal repeats.

2 Lemma 2. OccsR(k) ∈ Θ(k )

Proof. The upper bound follows trivially from the number of possible substrings. For the lower bound, consider sk = a1 . . . aka1 . . . ak of size 2k over an al- 2 phabet of size k. Every substring of a1 . . . ak is a repeat and they are O(k ) such substrings.

2 Lemma 3. OccsMR(k) ∈ Θ(k )

Proof. Again, the upper bound follows trivially from the number of possible substrings. k i For the lower bound consider sk = ax b of size k + 2. Every x is a repeat and each such repeat has an occurrence with left context a and right context x and another occurrence with left context x and right context b. Thus, they are i k−1 2 maximal. x appears k − i + 1 times and i=1 k − i + 1 ∈ O(k ).

3 P Lemma 4. OccsLMR(k) ∈ Ω(k 2 ).

k Proof. Consider the family of sequences sk = x|xx|xxx| ... |x , over an alphabet k k∗(k+1) of size k. The size n of sk is k + i=1 i = k + 2 . Every xi for i < k is a largest maximal repeat: it is repeated and there P is one occurrence which no other repeat covers. This occurrence is always the first appearance of this repeat. Repeat xk−1 appears three times, xk−2 appears six times (its first occurrence thanks to which it is largest, two times in xk−1, k k−i i+1 three times in x ), and in general x appears j=1 j. The total number of occurrences of all largest maximal repeats is then: P 45 k i k i k k −1 +1 i ∗ (i + 1) 1 j = j = = i2 + O(k2) = O(k3) 2 2 i=1 j=1 i=2 j=1 i=2 i=1 X X X X X X 2 As |sk| ∈ O(k ), this gives the claimed lower bound.

Lemma 5. OccsSMR(k) ∈ Θ(k) Proof. The lower bound is a direct corollary from Lemma 1 which bounds the number of different super-maximal repeats. For the upper bound, note that no occurrence can completely be contained in another occurrence. Therefore, each occurrence has at least one position that makes it super-maximal and there exists then an injective function from the set of occurrences to the set of positions.

In Fig. 3.2 we computed the number of these repeats on successive prefixes of the Large corpus. A clear difference emerges between the DNA sequence and the others: not only is the absolute number of the different number of repeats higher, but one can see an escalated behavior of the normal repeats when entering a repeat-rich zone. The other type of repeats however, seem to behave more uniformly. Finally, an expected difference can be appreciated between the semi-structured text (world192.txt) and the natural language one (bible.txt): the total number of repeat of the former is almost doubled but the difference between the higher classes of repeat is less noticeable.

3.2.2 Computation An ESA enables computation in O(n) of maximal repeats [110, 134] and super- maximal repeats [2, 110]. Gusfield also states that near-supermaximal repeats can be computed in linear time using a suffix tree, though without giving much detail. We present here a linear algorithm that computes directly largest-maximal repeats using an ESA. Our emphasis on “directly” is because an easier linear algorithm to compute them consist in filtering maximal repeats. This algorithm is due to Jacques Nicolas and we outline it here shortly. We start marking all maximal repeat occurrences that are not covered to the right by another repeat. These correspond to internal nodes with at least one leaf-daughter in suffix trees and suffix array indexes i such that lcp[i] = |w| with w the maximal repeat. There are only a linear number of such occurrences. Like usual with suffix data structures, the treatment for the right context — which normally is straightforward — differs from the one for the left context. To see that there is no other repeat that covers each one of them on the left, we perform the following algorithm: we save all these positions in an array of size n. Then, we read this array from left to right, computing for each occurrence o of maximal repeat w its end position (o + |w|). If another occurrence o′ > o of w′ 6= w is contained in this coverage (|w′| + o′ < |w| + o), then we eliminated occurrence o′. At the end, only occurrences of largest maximal repeats remain. Our algorithm avoids to compute all maximal repeats, thus reducing memory storage. As we have already seen, a not-so-easy-computable characteristic for

46 CHAPTER 3. EFFICIENCY an occurrence is to be left-context unique. This is that the symbol to the left of this occurrence differs from all the left context of all others occurrences. On a suffix-tree, the left context of leaf v (which we denote by lc(v)) is the (i − 1)-th symbol of the sequence, where i is the position defined by leaf v. We traverse the lcp-interval tree [2] and for each index we calculate if this leaf is left-context unique. If it is, we mark the corresponding father node as a largest-maximal repeat. We based the notation of the algorithm on the one of Puglisi et al. [193] to calculate maximal repeats. A standard way of reasoning with repeats over the suffix array is to regard the decreasing or increasing of the values in the lcp array. A decreasing value correspond to the end of one or more trees. In this case, the leaf correspond to the first of those trees that ends. If the lcp value remains equals, then the leaf is the one of the current tree. Finally, if it is increasing, then a new tree started, and the leaf corresponds to this new one. Pseudo-code for our algorithm is given in Algorithm 2.

Algorithm 2 Calculation of largest-maximal repeats with an enhanced suffix array LMR-SuffixArray(lcp[ ]) Input: the LCP array Output: all the largest maximal repeat in the form hlength : start, endi where start and end defines the interval of occurrences over the suffix array 1: lb, lcp, islmr = 0,0,false 2: stack.push(hlb, lcp, islmri) 3: for i from 1 to n do 4: lb ←i[ 5: if lcp[i + 1] < stack.top().lcp then 6: stack.top().islmr ←[ stack.top().islmr ∨ isLCUnique(i) 7: while lcp[i + 1] < stack.top().lcp do 8: lb ←[ stack.top().lb 9: hp, lcp, islmri ←[ stack.pop() 10: if islmr then 11: print hlcp : p, ii 12: end if 13: end while 14: end if 15: if lcp[i + 1] > stack.top().lcp then 16: stack.push(hlb, lcp[i + 1], falsei) 17: end if 18: if i = lb then 19: stack.top().islmr ←[ stack.top().islmr ∨ isLCUnique(i) 20: end if 21: end for

The test i = lb (line 18) takes care of the case where the current leaf was already tested because it belongs to a tree that was popped from the stack Because the number of times the while loop is executed is linear (one for each right largest-maximal repeat), the execution time is Θ(n), supposing that isLCUnique is constant. This function returns true if lc(i) does not occur as left

47 Figure 3.3: Execution time (in seconds) of four different classes of repeat for the Large corpus. Time is given in seconds, represents user time on a Intel(R) Xeon(R) 2.93GHz CPU with 6GB of memory running Fedora 12 (Constantine) and is averaged over 20 executions. See the text for details on the algorithms used. context of any other leaf of the immediate super-tree of i. To achieve constant computation of this function, we extend the idea used by Puglisi et al. [193] to obtain a linear time for the computation of supermaximal repeats. Auxiliary arrays next and prev are used that keep the next (previous) occurrence of lc(i) as a left context. That is, for a position i, prev[i] = j (next[i] = j) if lc(j) = lc(i) and for all k such that j < k < i (j > k > i), lc(k) 6= lc(i). This computation can be done in linear time (linear in max(|Σ|, n)). Finally, for each tree we need to know where it does finish. We store this information in another array (endTree) where in each position the end of the current tree of this position is stored. This also can be computed in linear time, by a previous traversal of the lcp-interval tree. Finally, isLCUnique correspond to prev[i] < stack.top().lb ∧ next[i] > endT ree[i]. Algorithm 2 permits also to calculate in linear time what Gusfields [110] called the degree of super-maximality of a largest-maximal repeat: the percent- age of occurrences that are left-context unique. We implemented algorithms to recover all four classes of repeats, based on an enhanced suffix array. For the maximal repeat, we used the algorithm from Abouelhoda et al. [2], Puglisi et al. [193], for super-maximal repeats again [193], Algorithm 2 for largest-maximal repeat and the trivial one2 for computing normal repeats. In Fig. 3.3 we compare the execution time of these algorithms on the Large corpus. We notice a huge difference between the time required to compute simple repeats and the time required for all others repeat. A second difference is between maximal repeats and the two other classes, whose time is comparable (though super-maximal repeats require consistently less in all three cases).

2Every time the lcp value increases, add a repeat to the stack, and output it if the lcp values decreases

48 CHAPTER 3. EFFICIENCY

Figure 3.4: Size of the final grammar obtained with IRR-MC of the four different classes of repeats for the Large Corpus.

3.2.3 Use in IRR

In the IRR schema however, the computation of the repeats has to be done in every iteration, so the results from Fig. 3.3 cannot be extrapolated to the total execution time of IRR. Here we analyse briefly this execution time, and compare the size of the grammars obtained at the end. We focus only on IRR-MC, the algorithm we identified as generating the smallest grammars (see Sect. 2.7.2) . Instead of searching over all repeats, we replaced Rˆ(s) in Algorithm 1 (p. 34) with the other classes of repeats. On Fig. 3.4 we report the length of the final grammars obtained with IRR-MC. Except for E.coli, the final grammar ob- tained with normal repeats and with maximal repeats are exactly the same. The final size of the grammars obtained with largest-maximal repeats are slightly bigger while the difference of the grammars obtained with super-maximal re- peats is considerable. Regarding the gain in execution time, using super-maximal repeats not only results in bigger grammars, but also takes much more time to compute as can be appreciated in Fig. 3.5(a). This is probably due to the much higher number of iterations that are executed because of the use of large repeats. The difference between the use of largest-maximal repeats and maximal repeats is less than the difference in the one-time execution that we measured in Fig. 3.3 (compare with Fig. 3.5(a)). This holds also with respect to simple repeats: we suppose that in the last iterations, the difference between these three classes of repeats become less noticeable. Of course, the improvement considering only maximal repeat varies depend- ing of the sequence. On the 557 Knt (kilo-nucleotides) sequence of the maize (zhea mays) mitochondrion, known for having a large number of repeats, we reached a speed-up of 6.6 times compared to the use of IRR-MC using normal repeats. Comparing the speed-up with the final grammar size, we choose to focus on the use of maximal repeats instead of considering all simple repeats. For the IRR algorithms, this produces little change. For IRR-ML, the chosen word is always a maximal repeat and for IRR-MF, there is always a maximal repeat

49 (a)

(b)

Figure 3.5: User time (in seconds, averaged over 10 executions) for computation of IRR-MC considering four different classes of repeat for the Large Corpus. Run on a Intel(R) Xeon(R) 2.80GHz CPU with 32GB of memory running Red Hat 4.1. Fig. (b) is a detail of Fig. (a).

50 CHAPTER 3. EFFICIENCY that has maximal score3: Proposition 3.

1. If fML(ω, G) = max fML(α, G) then ω is a maximal repeat. α∈R(r(G))

2. There is always a maximal repeat ω s.t. fMF (ω, G) = max fMF (α, G) α∈R(r(G))

For the case of IRR-MC, we do not have an equivalent property. It could happen that a repeat that maximises fMC is not maximal. Consider for instance the case of a non-maximal repeat ω with two occurrences, both of them with context ha, ai. If both occurrences of aωa overlap (because they occur at position i and i + |ω| + 1 for some i), then aωa would not be considered and the best repeat becomes ω. Here we will characterise the condition when a non-maximal repeat maximizes fMC . If ω is a repeat, then there is exactly one maximal repeat that contains ω and appears the same number of times. We call this maximal repeat mr(ω). We are interested in non-maximal repeats ω such that fMC (ω, P) > fMC (mr(ω), P). Note that |posr(G)(ω)| = |posr(G)(mr(ω))| + k1 and |mr(ω)| = |ω| + k2 for some positive k1, k2, this is, mr(ω) is k2 symbols longer the ω and have k1 occurrences that must be eliminated to have a maximal non-overlapping list. Replacing in the definition of fMC : (oP (ω) − 1) ∗ (|ω| − 1) > (oP (ω) − k1 − 1) ∗ (|ω| + k2 − 1) oP ω ≡ k1 > ( )−1 k2 |ω|+k2−1| Supposing that k2 = 1, this gives

|ω| ∗ k1 > oP (ω) − 1 (3.1) At the same time, supposing that the distribution over the sequence is i.i.d., (oP (w)−1) 1 the probability that a word w is a non-maximal repeat is 2 ∗ |Σ∪N | (it must have all its left-context equal, and all its right-context equal). Let us remind that |N | increases by one in each iteration of IRR. Both equations indicate that in order to find a case where fMC is maximal for a non-maximal repeat, this repeat must have a low number of occurrences. However, in this case fMC would assign it a lower score. So, in practice, such cases should not appear too frequently. Our experiments confirmed this: in all instances but one of the DNA corpus, IRR-MC behaves as the version of IRR-MC that only looks at maximal repeats. In each iteration, both algorithms chose the same repeat and consequently at the end of the execution, both algorithms return the same grammar. File vaccg, where the two algorithms produce different grammars, presents an instance of the situation we described above, but the grammar returned by the algorithm that looks only at maximal repeat is only four symbols bigger than the one returned by IRR-MC. See Table 3.2. On top of yielding almost equivalent results in faster time, the use of maximal repeats has the nice property that the grammar (a slightly modified version of) IRR returns are irreducible, independently of the score function being used. 3The original RePair (see Sect. 2.6.7) algorithm considers only digrams. In this case, a non-maximal repeat could be selected.

51 Theorem 3. If IRR only considers maximal non-overlapping repeats (two non- overlapping occurrences must have different contexts), then the resulting gram- mar is irreducible.

Proof. Recall the definition of an irreducible grammar (Def. 4, page 22). Con- dition 1 (the grammar is admissible) is trivially true for IRR algorithms, but Condition 4 (no repeat in the final grammar) may be violated by the IRR schema if it stops when no further improvement can be made. Nevertheless, it is enough to change the condition of the while loop in order to continue until G contains no repeats. Condition 2 (all constituents are different) is harder to see. A clean demon- stration is given in Charikar et al. [51, Lemma 6 and 7]. While their notion of global algorithm is different from IRR, the demonstration in these lemmas can be applied without modification to IRR. Finally, an IRR algorithm may still violate Condition 3 (every non-terminal must appear more than once). Suppose for example that a non-maximal repeat α is chosen and replaced by N, and that every occurrence of α has as right context of a. If in a future iteration the repeat Na is chosen, then N would occur only once in the grammar. In Charikar et al. [51] a special kind of repeat is defined to avoid these cases. Instead of this, the use of maximal repeat gives a more general solution: if it is ensured that the selected word has at least two occurrences in his normalised list with different context, then the resulting grammar is irreducible.

So, any of the results of Kieffer and Yang [127] that applies for irreducible grammars applies to grammars obtained inside the IRR framework. This means in particular, that IRR grammars are universal codes.

3.3 Non-overlapping Occurrences

The total number of times a word occurs in a sequence can be easily computed in constant time using a suffix tree structure. But the exact computation of the number of non-overlapping occurrences (|Lr(G)(w)|), is more complicated. The problem of computing this number is known as the String Statistics Problem. A solution is based on the construction of the Minimal Augmented Suffix Tree (MAST) [16] which permits to compute |Lr(G)(w)| in time |w|. The best known algorithm for the construction of a MAST is in O(n log n) [38] and it builds in a first phase a suffix tree. So, even reducing the set of candidates to a linear number using maximal repeats, the total running time for a general IRR schema is still O(n2 log n) (the MAST must be created in every iteration), and requires the rather elaborate construction algorithm for a MAST. We propose a much simpler approach: we ignore overlapping occurrences and instead of |Lr(G)(w)| we estimate it by the total number of occurrences of w in G (|posr(G)(w)|). While this score could be very different from the real contraction that could be achieved by replacing this repeat, our experiments (see Table 3.2) indicate, that over the DNA Corpus there is only a small difference between both grammars, and most of the times the version ignoring overlapping occurrences is actually smaller.

52 CHAPTER 3. EFFICIENCY

Size Time sequence IRR-MC Accel. ∆ IRR-MC Accel. ∆ chmpxx 28,706 28,754 -0.17 20.61 10.02 205 chntxx 37,885 38,089 -0.54 33.92 16.8 201 hehcmv 53,696 53,545 0.28 65.48 32.21 203 humdyst 11,066 11,201 -1.22 3.99 1.73 230 humghcs 12,933 12,944 -0.09 49.34 5.5 897 humhbb 18,705 18,712 -0.04 19.62 5.01 391 humhdab 15,327 15,311 0.1 9.55 3.77 253 humprtb 14,890 14,907 -0.12 8.45 3.42 247 mpomtcg 44,178 44,178 0.0 55.44 24.6 225 mtpacga 24,555 24,604 -0.2 17.64 8.46 208 vaccg 43,701 43,491 0.48 54.95 23.12 237 average -0.13 299

Table 3.2: Comparison between IRCC-MC and its accelerated version (using maximal repeats and not considering overlapping for score computation). Time is given in seconds and differences are given in percentage.

An advantage of only computing the non-overlapping occurrences list for the selected repeat is that the resulting IRR schema, using maximal repeats, decreases the complexity from O(n3) to O(n2), for any score whose computa- tion time is constant. This requires only standard techniques (computation of maximal repeats). Special care should be taken that the chosen repeat do have more than one non-overlapping occurrence, in which case adding this produc- tion rule would actually increase grammar size. In such cases we take the next best maximal repeat. Combining the improvements from this section (ignore overlapping occur- rences) and the one of Sect. 3.2 (use of maximal repeats instead of normal) gives an accelerated version of IRR-MC. In Table 3.2 we indicate the time that it took IRR-MC to run on each of the sequence of the DNA Corpus, and the ratio of the accelerated version and the original. Speed-up varies from two (chntxx) to nine (humghcs). Except otherwise stated, from now on we will suppose both of these improvements are included in the algorithm.

3.4 In-place Update of Suffix Array

3.4.1 Motivation In this section we propose an algorithm to efficiently update a suffix array, after substituting a word by a new character in the indexed text. This is the main task in the IRR schema, and differently from what we saw before, we are not concerned with the detection of an interesting word, but with maintaining the associated index structure that will permit to find the words in the next iteration. Efficient implementation of an elaborate choice of repeat often requires the use of data structures from the suffix-tree family. These index structures are well suited for efficient computations on repeats but they have to be built at initialisation, and then updated at each step of the algorithm with respect to

53 sequence modifications. Yet, as pointed out by Apostolico and Lonardi [13], most of the published work on dynamic indexing problem [201], by updating a suffix tree [48, 88, 90, 109, 162] or a suffix array [208], focuses on localised modifications of the string. They do not seem appropriate for efficiently replac- ing more than one occurrence of a given substring, as they would require one update operation for each occurrence. Thus, index structures have usually to be built from scratch at each step of the algorithm. To our knowledge, only some implementations of LongestFirst [138, 168], updates a suffix tree data structure after the deletion of all occur- rences of a word. However, their updating scheme are specific to the longest matching substrings and seems difficult to adapt to other strategies. We propose here a solution to the problem of updating efficiently an index structure while replacing some non-overlapping occurrences of a word of the indexed text by a new symbol. The first originality of our approach relies on the use of enhanced suffix arrays instead of suffix trees. A simple way of updating suffix array (instead of enhanced suffix array, thus without the same efficiency objective) by lazy bubble sort has been used in Nevill-Manning and Witten [177]. We propose here, to take advantage of the internal order offered by enhanced suffix arrays, to simultaneously handle groups of indices.

3.4.2 Double-linked Enhanced Suffix Array The algorithm presented below consists mostly of moving and deleting lines of the ESA and keeping lcp consistent. In order to avoid shifting sets of indices, we link consecutive indices using two additional arrays called next and prev. Thus, next[i] (resp. prev[i]) gives the index of the next (resp. previous) valid entry in the ESA. Initially, next[i] = i+ 1 and prev[i+ 1] = i. So, if for example the index i must be deleted, that can be easily done by setting next[prev[i]] to next[i] and prev[next[i]] to prev[i]. We call the set ESA plus next and prev arrays the ESADL for Double-Linked Enhanced Suffix Array. It is worth noticing that an ESADL does not have the exact same properties as an ESA. Indeed, going from an index i to index i+j may be done in constant time on an ESA, while this operation in an ESADL requires O(j) time, as the next array has to be used j times. Moreover, because of ESADL lines moving, the result of indices comparison may not coincide with the order of the associated suffixes. For instance, index i may correspond to a suffix with a lexicographically order greater than a suffix corresponding to index j, even if i < j. Anyway, an ESADL still allows the detection of repeats (general repeats, maximal, largest-maximal or super-maximal repeats) in linear time, because the involved algorithms advance one by one over the arrays like most of the algorithm over ESA (a notable exception is the algorithm searching for a substring proposed in Sim [222]). Finally, we remark that the standard ESA can directly be recovered in one simple pass from ESADL. We propose an in-place solution, where we always work with the same arrays and only update the values of their fields. Moreover, during the whole process, we modify only the prev, next and lcp arrays. Arrays sa and isa remain un- changed. This approach forces to extend the in-place behavior to the sequence: we also add two arrays to imitate a double-linked list over the sequence: the jth position after position i, is denoted by i ⊕ j. We compute i ⊕ j using links between sequence positions, indicating for each position its successor. Similarly

54 CHAPTER 3. EFFICIENCY i ⊖ j points to the jth position before i. We define that, if i ⊕ j (respectively i ⊖ j) is out of range, then i ⊕ j = n + 1 (respectively i ⊖ j = −1). As we have seen, an IRR-like algorithm proceeds by steps. At each step, the alphabet grows because of the introduction of a new character: Σk will denote the alphabet at step k. At each step k the algorithm i) finds a repeat (k) wk in a sequence s˜ defined on the alphabet Σk and returns a list ok of non- (k) overlapping occurrences of wk ii) updates the sequence s˜ and its associated ESADL replacing the given occurrences of wk by a single new character ck, thus defining a new alphabet Σk+1 = Σk ∪{ck}. The modified sequence is then called (k+1) s˜ . Until now, we supposed ok = pos(wk) or ok = L(wk) but the solution we present here works for any non-overlapping list of occurrences. The whole iterative process stops either if no more repeats are found in the sequence or after a fixed number of iterations. Our contribution focuses on updating the ESADL, at each step k of this algorithm (part ii). In the next sections, we describe how to perform the three tasks needed for updating an ESADL at each step k: 1) delete indices of suffixes starting inside a wk occurrence; 2) move indices with respect to the alphabetic order of ck; and 3) update lcp array with respect to recoded occurrences of wk by one single character. Note that a few values of the lcp array are also modified during step 1 and 2, but only as a consequence of deletions and moves. Note the difference with the IRR algorithm (Algorithm 1 (p. 34)) that the wk is only replaced inside the original sequence, not in the right-hand side of previous introduce rules. To adapt the algorithm here to this general case, a fourth step would be necessary, that inserts in the enhanced suffix array the just replaced word. To better understand the different steps of the algorithms and the modifi- cations they perform over the suffix array, we will define the concept of left- context tree. It is worth noticing that we present this structure in order to help the understanding of our approach and that it is not actually implemented.

The left-context tree One of the most useful characteristics of a suffix array is that all indices cor- responding to suffixes starting with the same word (substring) correspond to an adjacent block. We define here the corresponding concept of word interval. Based on this, we will define the left context tree of a word ω where the nodes correspond to a left-context of ω. An ω-interval is the set {k : ∃ℓ, k = isa[ℓ] ∧ s˜[ℓ..ℓ + |ω| − 1] = ω}. This can also be denoted as an [i..j]-interval, where i and j are respectively the lowest and highest indices of an ω-interval. Let us note that different words can share the same interval. More precisely, any pair of words ω and ωα share the same interval if each occurrence of ω is followed by α. This definition is thus slightly more general than the definition of ω-interval given by Abouelhoda, Kurtz and Ohlebusch [2], since in our approach ω-interval are defined also for words leading to implicit nodes of a compact suffix tree, and not only to internal nodes. The left-context tree of ω (ω ∈ Σ∗) for a sequence s˜ is an implicit tree whose nodes are v-intervals (v ∈ Σ∗) such that:

55 • the root is the ω-interval • for each v-interval node corresponding to a non-empty interval, its children are all the av-intervals, for all a ∈ Σ • the leaves are empty intervals

Given the isa array, it is easy to obtain the parent of a node. Let [i..j] be an av-interval node. Given k ∈ [i..j], isa[sa[k] + 1] is an index belonging to the v-interval. Inversely, isa[sa[k] − 1] belongs to one of the child interval. The exact child depends on the symbol at s˜[sa[k] − 1]. We introduce the successor and predecessor notations: isa[sa[i] ⊕ 1] if sa[i] ⊕ 1 6= n + 1 successor(i) = n + 1 otherwise, and isa[sa[i] ⊖ 1] if sa[i] 6= 0 predecessor(i) = −1 otherwise. One may remark thatpredecessor is the equivalent of the “suffix link” in a suffix tree [233]. The problem that an ESA update algorithm must face is that the changes over the occurrences of a word ω not only affect the ω-interval, but also some of the vω-intervals (v ∈ Σ∗). The core of our algorithm is based on moving a vω-interval in constant time, using the two following properties implied by the internal order of suffix arrays:

∗ Proposition 4. Let [i..j] be an v-interval (v ∈ Σ ), and k1, k2 ∈ [i..j] with k1 > k2 and such that predecessor(k1) and predecessor(k2) belong to the same av-interval (a ∈ Σ). Then predecessor(k1) > predecessor(k2). Proposition 5. With i < j, the longest common prefix between s˜[sa[i]..] and s˜[sa[j]..] is min lcp[k]. k∈[next[i],j]

3.4.3 Algorithm

We now detail the three tasks for updating an ESADL while replacing a set of occurrences ok of a word wk by a simple character ck.

Delete indices of suffixes occurring inside wk substituted occurrences

By replacing the word wk by a single letter, the sequence is compressed and so (k) is its ESADL: consequently, any suffix of sequence s˜ starting inside an wk substituted occurrence must be deleted. Thus for i in ok and for ℓ in [1, |wk|−1], suffix s˜(k)[i ⊕ ℓ..] and the associated index in the suffix array j = isa[i ⊕ ℓ] have to be removed. We simulated this deletion by jumping over it by setting next and prev arrays to their previous and next index: next[prev[j]] ←[ next[j] and prev[next[j]] ←[ prev[j]. Furthermore, the lcp value of the index following j (lcp[next[j]]) has to be modified according to the deletion of index j. As a consequence of Proposition 5, after the deletion of index j, the longest common prefix of index next[j] is equal to the minimal longest common prefix value of indices j and next[j]. An example is shown in Fig. 3.6 where the deletion of index j affects lcp[next[j]] that now should contain the length of the longest common prefix between ATGT


Figure 3.6: Deletion of index j. Before the deletion, index j − 1 (suffix ATAC..., lcp = 4) points to j, index j (suffix ATGA..., lcp = 2) points to j + 1, and index j + 1 (suffix ATGT..., lcp = 3) points to j + 2. After the deletion, next[j − 1] = j + 1, prev[j + 1] = j − 1, and lcp[j + 1] becomes 2.

and ATAC which is 2, equal to the longest common prefix of ATGT, ATGA and ATAC. Algorithm 3 presents the procedure for deleting indices. The notation END refers to the last index of the suffix array (prev[n + 1]).

Algorithm 3 Delete indices at step k, replacing wk by ck: delete_indices(ESADL(k), wk, ok)

for i ∈ ok do
  for ℓ ∈ [1, |wk| − 1] do
    j ← isa[i ⊕ ℓ]
    if next[j] ≠ END then
      lcp[next[j]] ← min(lcp[j], lcp[next[j]])
    end if
    next[prev[j]] ← next[j]
    prev[next[j]] ← prev[j]
  end for
end for
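The following C++ fragment is a sketch of this deletion step, not the thesis prototype: the doubly-linked structure is assumed to be stored in plain arrays (next, prev, lcp, isa), positions are 0-based and the ⊕ operator is simplified to ordinary addition.

    #include <algorithm>
    #include <vector>

    // Sketch of Algorithm 3: remove the indices of suffixes starting strictly
    // inside a substituted occurrence of w_k from the doubly-linked suffix array.
    void delete_indices(std::vector<int>& next, std::vector<int>& prev,
                        std::vector<int>& lcp, const std::vector<int>& isa,
                        const std::vector<int>& occurrences, int wk_len, int END) {
        for (int i : occurrences) {
            for (int l = 1; l <= wk_len - 1; ++l) {
                int j = isa[i + l];                         // index to remove
                if (next[j] != END)                          // keep lcp consistent (Prop. 5)
                    lcp[next[j]] = std::min(lcp[j], lcp[next[j]]);
                next[prev[j]] = next[j];                     // jump over j in the list
                prev[next[j]] = prev[j];
            }
        }
    }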

Move indices with respect to the alphabetic order of ck

After replacing the word wk by the new character ck, some ESADL lines may be misplaced with respect to the chosen order of ck in Σk+1. Indices in the wk-interval are potentially misplaced. In fact, for v ∈ Σ∗k, indices inside a vwk-interval are misplaced if the substitution of wk into ck affects their lexicographical order with respect to the previous and next index over the suffix array. Thus, lines belonging to node-intervals of the left-context tree of wk may have to be moved. In our approach, we decided to give ck the largest rank in the lexicographic order of the alphabet Σk, i.e. ∀ a ∈ Σk : a ≺ ck. With respect to this arbitrary choice, the wk-interval is moved to the end of the suffix array. Furthermore, for any v ∈ Σ∗k, the vwk-interval is moved after the last index of the v-interval. If a vwk-interval is already at the end of the v-interval (it is already well ordered), then for any v′ ∈ Σ∗k, the v′vwk-interval is also at the end of the v′v-interval and does not have to be moved. Based on this property, our algorithm uses a recursive approach in order to move groups. The recursion starts on the initial wk-interval. During the recursion, if the group of a vwk-interval is moved, the recursion continues on the groups of the avwk-intervals, with a ∈ Σk.

Figure 3.7: Moves induced by substituting GA by c1. lcp[3] was 0 after the delete step and lcp[7] will be updated during the third step.

From a theoretical point of view, the algorithm starts on the root of the left- context tree of wk and if the group corresponding to the interval of the node is moved, it recursively treats its children in a breadth first traversal (a FIFO is used). In practice, the recursion on a vwk-interval works as follows:

1. detects the end position of the vwk-interval,

2. detects the end position of the v-interval,

3. if necessary:

3.a. moves the group to the end position of the v-interval,

3.b. calls the recursion on the predecessors of the indices of the group.

When the recursion is called on the predecessor of an index of a group, either it is the first time the corresponding group is visited, in which case by construction the call is made on its first element, or the group was already treated and the recursion stops. The algorithm for this step is shown in Algorithm 4. This recursive function receives three parameters besides the data structures: the starting position of the group, the current depth over the left-context and a boolean flag (see below). First, the end of the vwk-interval is found (lines 5, 6 and 8). This is done from the first element of the interval, following the next array while the visited index corresponds to a suffix starting with vwk (lcp[i] ≥ |v| + |wk|). After finding the extremes of the group, the destination index of this group according to the chosen order for the new character is found (lines 11, 12 and 15). This is done by finding the end of the v-interval in the same way (lcp[i] ≥ |v|). Moving the group to its new position is now simple and is done in constant time. Thanks to the well-ordered property of the suffix array, the whole interval is moved by changing only the delimiting positions. Let istart, iend, idest be respectively the starting and ending positions of the vwk-interval, and the last position of the v-interval. Moving the group [istart, iend] to the position after


Algorithm 4 Restore consistency of the suffix array order: update_order(ESADL(k), wk, ok, istart, depth, move)

1: if the couple (istart, depth) was already treated during another recursion call then
2:   End procedure
3: end if
4: i ← istart
5: while i ≠ END ∧ lcp[next[i]] ≥ depth + |wk| do
6:   i ← next[i]
7: end while
8: iend ← i
9: minLCP ← min_{j∈[istart, iend]} lcp[j]
10: if move then
11:   while i ≠ END ∧ lcp[next[i]] ≥ depth do
12:     i ← next[i]
13:   end while
14: end if
15: idest ← i
16: if iend ≠ idest then
17:   lcp[next[iend]] ← min(lcp[next[iend]], minLCP)
18:   lcp[istart] ← depth
19:   if istart = ifirst ∧ depth ≠ 0 then
20:     ifirst ← next[iend]
21:   end if
22:   move_group(istart, iend, idest)
23: else
24:   lcp[istart] ← min(lcp[istart], depth)
25:   move ← false
26: end if
27: i ← istart
28: while i ≠ next[iend] do
29:   newdepth ← depth + (if predecessor(i) ∈ ok then len else 1)
30:   if move ∨ (sa[prev[predecessor(i)]] > newdepth ∧ sa[prev[predecessor(i)]] ⊕ newdepth ∈ ok) then
31:     update_order(ESADL(k), wk, ok, predecessor(i), newdepth, idest ≠ iend)
32:   end if
33:   i ← next[i]
34: end while

idest is easily done by jumping over the group and inserting it between idest and next[idest]. See Algorithm 5 for details. Two longest common prefix values are modified as a consequence of the deletion of the group and its insertion:

1. lcp[next[iend]]: contains the value of the length of the longest common pre- fix between prev[istart] and next[iend], which according to Proposition 5, is the minimum of the lcp values of the group and itself

2. lcp[istart]: we assign to it the value of depth, which is the correct value over s̃k+1. This also serves to set a stop-point for future recursion calls (see below).

Algorithm 5 Move the group [istart, iend] after the position idest: move_group(ESADL(k), wk, ok, istart, iend, idest)

next[prev[istart]] ← next[iend]
prev[next[iend]] ← prev[istart]
next[iend] ← next[idest]
prev[next[idest]] ← iend
next[idest] ← istart
prev[istart] ← idest
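The splice of Algorithm 5 can be written directly over the next/prev arrays; the following C++ fragment is a sketch under the same simplifying assumptions as above (plain integer arrays, names of our own choosing). Only six links are rewritten, so the move takes constant time regardless of the size of the group; note that each link is read before it is overwritten, so the order of the assignments matters.

    #include <vector>

    // Sketch of Algorithm 5: move the group [i_start, i_end] just after i_dest
    // in the doubly-linked suffix array.
    void move_group(std::vector<int>& next, std::vector<int>& prev,
                    int i_start, int i_end, int i_dest) {
        // unlink the group from its current position
        next[prev[i_start]] = next[i_end];
        prev[next[i_end]]   = prev[i_start];
        // re-insert it between i_dest and the old next[i_dest]
        next[i_end]         = next[i_dest];
        prev[next[i_dest]]  = i_end;
        next[i_dest]        = i_start;
        prev[i_start]       = i_dest;
    }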

As ifirst points to the first line of the suffix array that contains a selected repetition, we also update ifirst (line 19) if this line is moved. Fig. 3.7 shows the ESADL of the sequence GAAGAAGC, where w1 = GA is substituted by c1. Note that the initial interval of suffixes starting with GA (indices 6 and 7) is moved, as well as the suffix starting with AGA (index 3). Note also that the suffix starting with GAAGA has to be moved with respect to the suffix GAAGC.

A special case Once an interval is treated, the recursion continues either if the current group was moved, or in the special case described in what follows. Consider for instance the following situation, where the substituted repeat is TA:

    i     CTATTTAC...
    i+1   CTATTTAG...
    i+2   CTATTA...

and suppose that the TTA-interval containing the index isa[sa[i+2] ⊕ 3] (the underlined suffix in the figure) was already at its right position and therefore does not have to be moved. Its children in the left-context tree are then not considered for future moves, and as a consequence, neither is index i + 2. If we cut the recursion here, then when treating the CTATT-interval we still have lcp[i + 2] = 5. This interval ends at index i + 1, but because we use the lcp array to detect it, we would also consider index i + 2 as part of the CTATT-interval. To resolve this special case, the recursion continues even when the current interval was not moved. In this case, it will never be necessary to move an

interval, but some lcp values may have to be updated to set stop-points for future recursion calls. This is the reason for introducing the last parameter of Algorithm 4 (the boolean flag move). It differentiates the normal case (when it is necessary to detect the destination index and move the interval) from the case in which the current interval is considered only to set a stop-point at its first index. The recursion continues in both cases.

Filtering non-substituted wk occurrences Within a vwk-interval there may be suffixes starting with vwk whose occurrence of wk is not substituted (its position does not belong to ok). The associated indices in the ESADL should not be moved with the vwk-interval. Thus, before applying the recursive procedure exposed previously, a straightforward filtering step is applied. During the recursion, each line i of each group is first checked in order to detect whether it corresponds to an index of a selected occurrence (sa[i] ⊕ depth ∈ ok). Once a non-selected occurrence is detected, we move it to the beginning of the group (before istart). As previously mentioned, this also involves modifications of the lcp array to maintain its consistency. This step is basically a simplified version of Algorithm 4. It adds an extra auxiliary array of size n to keep track, for each index, of the last depth with which it was analysed.

Update lcp values after the substitution of wk occurrences by a single character

The substitution of any occurrence of wk, of length |wk| ≥ 2, by ck, of length 1, involves the modification of the length of all common prefixes involving such an occurrence. In the previous step, it was trivial to update the lcp values of the border lines. In this step, we update the lcp values of the internal positions of the intervals. Straightforwardly subtracting |wk| − 1 from each internal lcp value misses the cases where the common prefix between two successive suffixes includes more than one occurrence of wk, or, even worse, a part of an occurrence (consider for instance the example shown in Sect. 3.4.3). So we traverse again the left-context tree of wk. Contrary to the moving step, where it was possible to move one line several times, in this step we update each lcp index only once. To do this, we recalculate all the lcp values of the root (the wk-interval) and use this information to update the lcp of the other intervals. As a consequence of Propositions 4 and 5, the lcp between two indices of the same interval-node is simply one plus the lcp between their successor indices belonging to the parent interval-node: let i, j belong to the same aw-interval and assume that i < j; then

lcp(s̃[sa[i]..], s̃[sa[j]..]) = 1 + min_{ℓ∈[next[successor(i)], successor(j)]} lcp[ℓ].

With this inductive approach, it is sufficient to re-calculate the lcp of only the first interval (the root of the left-context tree), as shown in Algorithm 6. During the iterative call, if an index that was already treated appears, it is skipped: its lcp value is then already up-to-date. The pseudo-code for this step is given in Algorithm 7. Because in each iteration we use the values of all the lines of the previous group, we traverse the left-context tree once again in breadth-first order.

Algorithm 6 Calculate the value of the lcp for index i: recalculate_lcp(ESADL, i)

lcp[i] ← 0
if prev[i] ≥ 0 then
  p ← sa[i]
  q ← sa[prev[i]]
  while p < n ∧ q < n ∧ s[p] = s[q] do
    p ← p ⊕ 1
    q ← q ⊕ 1
    lcp[i] ← lcp[i] + 1
  end while
end if

Algorithm 7 Update lcp of step k: update_lcp(ESADL(k), wk, ok)

q ← queue()
for i ∈ ok do
  recalculate_lcp(ESADL(k), isa[i])
  q.push((predecessor(isa[i]), 1))
end for
while not q.empty() do
  (i, depth) ← q.top
  q.pop
  if i ≥ 0 ∧ lcp[i] not already updated ∧ lcp[i] ≥ depth then
    lcp[i] ← (min_{j∈[next[successor(prev[i])], successor(i)]} lcp[j]) + 1
    q.push((predecessor(i), depth + 1))
  end if
end while


3.4.4 Efficiency

Time efficiency The worst-case time complexity of the update algorithm is loosely bounded by O(n2). A better bound could be obtained by considering the amortised complexity over the whole IRR schema, but it would still be unlikely to beat the O(n) complexity required for building the suffix array from scratch. Nevertheless, the suffix array construction algorithms that currently perform best in practice are not the linear ones (see Schürmann and Stoye [214] for a description of the different suffix array construction algorithms and their strengths). We propose in this section to evaluate the practical efficiency of our update algorithm, comparing it to the standard approach that builds the suffix array from scratch. A prototype implementing the proposed algorithm has been developed using the C++ language4. It has been tested on different types of text. For the sake of brevity, we only report here the results on the Canterbury and Large corpora (see Sect. A). Results on other corpora can be found on our web site. To compare the execution time with a build-from-scratch approach, we used three different suffix array construction algorithms: the linear-time one proposed by Kärkkäinen and Sanders [123], the non-linear algorithm of Larsson and Sadakane [140] and the Induced Sorting algorithm of Zhang et al. [247] (again a linear one). The source code of the first two was retrieved from the web sites specified in the associated articles. Note that Kärkkäinen and Sanders' code "strives for conciseness rather than for speed" [123]. For the Induced Sorting algorithm, we used the optimised implementation of Mori [166]. In recent years, suffix array construction has proven to be a rich field of research. New strategies and improvements are proposed each year, and for a complete taxonomy of the state of the art we refer to [192]. But some of these algorithms make assumptions about the alphabet that cannot be fulfilled by our grammar-based application, which is why we could not compare them with ours. The two assumptions that excluded some of them were:

1. the size of the alphabet. Manzini and Ferragina's algorithm [156] and Yuta Mori's libdivsufsort [165] assume an alphabet size smaller than 256. In our approach, in each iteration we introduce a new non-terminal, so this bound is too tight.

2. it is possible that, after a replacement, a letter does not occur any more in the sequence because all its occurrences were inside the selected repeat. That is why we discarded algorithms that suppose a contiguous alphabet (like [154]).

The tests were executed on a 1 GHz AMD Opteron processor with 4 GB of memory. First, to get an idea of the complexity of the algorithm, we studied how the length of the sequence influences its execution time. From the Large corpus, we extracted sequences of different lengths by considering successively bigger prefixes of the sequences (in steps of 100 kilobytes). On each extracted sequence, we performed 250 iterations of selecting a random

4available at http://www.irisa.fr/symbiose/projects/suffix_array_update

repeat, replacing it over the sequence by a new character and updating the associated suffix array. The time (user + system time) required for updating the suffix array was reported, averaged over 5 different runs corresponding to 5 different random seeds. The same experiments have been performed replacing the update algorithm by the from scratch suffix array construction algorithms of Kärkkäinen and Sanders (K&S), Larsson and Sadakane (L&S) and Zhang, Nong and Chan (ZNC). The plots, shown in Fig. 3.8, confirm that the execution time of our update algorithm is not directly correlated to the length of the sequence, and is significantly smaller than the execution time required by the construction-from-scratch algorithms, especially when the length of the sequence increases. We now present a more exhaustive evaluation and comparison on all the corpora, using different strategies for the selection of the repeated word. In each test we performed 500 iterations of selecting a repeat, replacing it over the sequence and updating (or building from scratch) the associated enhanced suffix array. The different strategies for the selection of the repeat were:

• take a random one (using the same seed for the pseudo-random number generator),

• take the longest (ML strategy),

• take the one that covers the maximal number of positions (MC strategy).

Results are given in Table 3.3 (page 67). For each selection strategy, we measured the time (user + system time) spent in updating the ESADL with our algorithm (column update), and the time spent in building the ESA from scratch at each iteration with the three construction algorithms. For easier comparison, we only report the times of the update algorithm and the ratios of the time spent by each of the three "from scratch" algorithms over the update algorithm. A ratio lower than 1 means that the from scratch algorithm was faster than the update. The time spent by a from scratch algorithm can be obtained by multiplying the time reported in the "update" column by the respective ratio. Some of the files (notably fields.c, grammar.lsp and xargs.1) are too small to draw significant conclusions, but results are shown here for the sake of completeness. On the other files, the results show that a significant speedup is usually achieved by our algorithm. The main exception is the ptt5 file from the Canterbury corpus (a fax image with very long zones of the same byte), probably because the rewriting of the selected repeats changes a major part of the sequence. One can also remark that the ratio is less favorable when the repeat to replace is chosen according to the maximal compression strategy. On the one hand, in each iteration the resulting sequence is smaller, so the suffix array creation from scratch for this sequence is faster. On the other hand, more positions are affected by the substitution, which slows down the update algorithm. These cases illustrate an intrinsic limit of the update approach when the length of the sequence is highly reduced by recoding: when the number of positions to update is larger than the number of positions in the resulting sequence, it may be worth adopting the from scratch construction algorithm (let us remark that the best algorithm to use can vary along the iterations). A solution to handle these extreme cases would be to design a criterion on the


Figure 3.8: Large corpus: bible.txt, world192.txt and E.coli. Times are given in hundredths of seconds and the size in kilobytes (1 byte = 1 character).

repeat and its coverage to automatically choose the best algorithm to use (even at each iteration).

Space efficiency The overall space complexity is O(n). In this section we analyse this bound more precisely. Storing the ESA requires three integer arrays, and the sequence is also stored in an integer array (recall that our approach is supposed to work with integer alphabets). On a 32-bit architecture, this amounts to 16n bytes (where n is the number of symbols of the input). The ESADL structure extends the ESA with two arrays of length n (next and prev). To implement the ⊕ and ⊖ operators, two extra arrays of size n were used. The recursion in Alg. 4 was implemented with a queue. Like the queue of Alg. 7, it is bounded by n. An array of length n is used to check in constant time whether a couple (i, depth) was already treated, and a last auxiliary array of size n is used as specified in Sect. 3.4.3. To sum up, the memory needed by the algorithm is 40n bytes, plus at most 4n for the queues. To assess the practical memory usage, we measured it during the execution of the first type of test (Sect. 3.4.4) on E. coli. In Fig. 3.9 we plot the results for the update and the three from scratch algorithms. Note that, differently from the time efficiency analysis (where only the time spent updating or rebuilding the suffix array was measured), the memory usage reported here corresponds to the whole program, including the search of repeats. It is worth noticing that in the from scratch approach the memory usage has a peak in the first iteration and then decreases, while in the update approach the memory occupied by the ESADL remains the same. This cannot be observed in Fig. 3.9 because we only measured the maximal memory usage. Each of the four curves shows a linear behavior. In general, both the L&S and ZNC algorithms use 26n bytes of memory. This is consistent with the three arrays used for the ESA and the sequence, plus one to store the new sequence. The other 6n can be attributed to the algorithm that recovers the repeats. The fact that there is no apparent difference between both algorithms can again be explained by the fact that we measured only the maximal memory used. K&S uses much more memory, which is consistent with other reported results [165]. We can also observe that the memory usage of our update algorithm reaches in this test the predicted 44n upper bound.

3.5 Summary

We presented in this chapter three ways of accelerating general IRR algorithms. The first consists in reducing the space of words to be considered for replacement to some subclass of all possible repeats. This is also motivated by the possibility that these more restricted classes (especially largest-maximal repeats) contain particularly meaningful constituents. Comparing execution time and final grammar size, we conclude that using maximal repeats produces

Table 3.3: Update time and speedup factor with respect to each of the from scratch algorithms (K&S, L&S and ZNC), for the random, maximal length and maximal compression selection strategies, on the files of the Canterbury and Large corpora. Times are given in hundredths of seconds. A speedup factor lower than 1 means that the from scratch algorithm was faster than the update algorithm (all these cases are underlined).

Figure 3.9: Memory consumption for E. coli. Memory and size are given in kilobytes.

almost always grammars of the same size as the original IRR algorithm, while being faster.

A second improvement consists in approximating the real number of non-overlapping occurrences with the total number of occurrences. This improves the execution time considerably, and surprisingly also produces (slightly) smaller grammars. Combining this second improvement with the use of maximal repeats permits an O(n2) implementation of the generic IRR framework, compared to the original O(n3).

The third technique is different because it is an algorithmic improvement that does not change the output. We give a solution to a special kind of update of an enhanced suffix array. Our approach uses the specific internal order of suffix arrays to simultaneously update groups of adjacent indices and ensures that only the indices to be modified are visited. This specific property of suffix arrays allows the design of an efficient update procedure, which has been implemented and tested on classical corpora. The experimentation confirms that, compared to the direct method of reconstructing the suffix array, our approach enables a significant speedup of the execution time, by a factor of up to 70 when the repeat to replace is chosen randomly. The time improvement varies, and seems to depend mainly on the size of the left-context tree, which grows with both the average lcp value of the sequence and the number of positions the chosen repeat covers. These results encourage the use of such an approach in a greedy schema like IRR, where the occurrences on the right-hand sides of previously introduced rules would also be considered. Our findings in the next chapter, however, made us realise the importance of being flexible about which occurrences to replace.

We focused here only on the final size of the grammar. Regarding the first two improvements, where different choices of words produce different final grammars, it would be interesting in the future to analyse these changes not on the size of the final grammar, but on the structure obtained.


Chapter 4

Smaller Grammars

We concluded in Sect. 2.7 that the current state-of-the-art algorithm to find small grammars is IRR-MC. It belongs to the general IRR algorithm family and selects in each iteration a repeat that greedily reduces the size of the current grammar the most. Using different notions of the size of a grammar, the same greedy approach has successfully been used in grammar-based algorithms like Greedy [13, 14] and MDLCompress [84]. Our empirical results are completed by the theoretical ones in Charikar et al. [51]. The O((n/ log n)^{2/3}) upper bound for IRR-MC given there may not look impressive compared to algorithms specifically designed to achieve a good worst-case upper bound, but it should be kept in mind that this is a loose bound, used to limit the behaviour of a much more general class of algorithms. Until now, no family of sequences could be exhibited on which IRR-MC achieves a non-constant approximation ratio (the worst case being 5 log 3/(3 log 5) ≈ 1.138). In this chapter we will present several algorithms that outperform IRR-MC. The main idea behind these algorithms is to separate the choice of which words will become constituents of the final grammar from the choice of which occurrences of these words will actually be replaced by non-terminals. The IRR framework permits flexibility in the first choice, but always handles the second choice in a greedy way (all occurrences in the normalised non-overlapping list are replaced), with preference given to the first selected word. This is best exemplified in the following example: Consider the sequence

xaxbxcx|xbxcxax|xcxaxbx|xaxcxbx|xbxaxcx|xcxbxax|xax|xbx|xcx,

This sequence exploits the fact that IRR algorithms replace all possible occur- rences of the selected word. Let us define G∗ as the following grammar:

S → AbC|BcA|CaB|AcB|BaC|CbA|A|B|C
A → xax
B → xbx
C → xcx

|G∗| = 42. Note that no IRR algorithm could generate G∗ and, moreover, by brute-force search we can show that the smallest possible grammar that can be obtained with an IRR algorithm has size 46, resulting in an approximation ratio

of 1.095. This is a global lower bound for any IRR algorithm1. We conclude therefore with the following theorem:

Theorem 4. An algorithm that solves the Smallest Grammar Problem for all sequences does not belong to the IRR framework.

Proof. We have seen that there exists a sequence s, such that |IRR(s, f)| is greater than the size of a smallest grammar for s, for all possible choices of f. This proves our claim.

Our approaches not only try to optimise the selection of constituents, but also the parsing of the sequence with these constituents. This idea is not completely new: we have been able to find two references to it. At the end of a. shelat's master thesis [221, Sect. 5.3] he sketches the notion of re-writing (a notion similar to our minimal parsing) and stable grammars. Even before, Nevill-Manning et al. [178] noted that Sequitur "will not necessarily produce the smallest grammar possible. To do this would require finding two things: the best set of productions, and the best order in which they should be applied. The latter is called 'optimal parsing', and can be implemented by a dynamic programming algorithm." But to our knowledge, we are the first to formalise it and to study its importance and consequences. First, we will formally define the problem of finding a minimal grammar given a fixed set of constituents (Sect. 4.1.1), similar to a recursive optimal parsing. In Sect. 4.2.1 we then define a search space for the Smallest Grammar Problem based on the MGP. We will introduce these two concepts along with the algorithms that use them. Part of this chapter is the fruit of an INRIA/MINCyT collaboration and was published or submitted previously [45, 46].

4.1 The Minimal Grammar Parsing Problem

4.1.1 Grammar Parsings and Minimal Grammar Parsings

Once an IRR algorithm has chosen a repeated word ω, it replaces all non-overlapping occurrences of that word in the current grammar by a new non-terminal N and then adds N → ω to the set of production rules. In this section, we propose to perform a global optimisation of the replacement of occurrences, considering not only the last non-terminal but also all the previously introduced non-terminals. The idea is to allow occurrences of words to be kept (instead of being replaced by non-terminals) if replacing other occurrences of words overlapping them results in a smaller grammar. Recall that the constituents of a grammar are the terminal strings that can be derived from the non-terminals of the grammar. We propose to separate the choice of which terminal strings will be constituents of the final grammar from the choice of how to parse the grammar with these constituents. First, let us assume that a finite set of constituents C is given and that we want to find a minimal grammar that generates the language L and whose constituent set is C. In our case the language will be a singleton, but we will first define the general case.

1Note that this lower bound uses our definition of size, and can therefore not be compared with the lower bound given in Charikar et al. [51].


Assume that C = {ω1, . . . , ωm} and L = {s1, . . . , sℓ} ⊆ C. We need to be able to generate all constituents, and for each constituent ωi the grammar must thus have a non-terminal Ni such that ωi = cons(Ni). Therefore, we define a new problem, called the Minimal Grammar Parsing (MGP) Problem. An instance of this problem is a tuple of sets of strings ⟨C, L⟩ with L ⊆ C. A Grammar Parsing of ⟨C, L⟩ is a context-free grammar G = ⟨Σ, N, P, S⟩ such that:

1. all symbols of all strings of L are in Σ: Σ = ⋃_{i=1..ℓ} Σ(si)

2. L(S) = L

3. for every other string t ∈ (C \ L) there is one different non-terminal N that derives only t.

As we said, in the case we will consider from now on, L will consist of only one string, s. The resulting grammar is therefore straight-line. A minimal grammar given ⟨C, L⟩ (or ⟨C, s⟩) is a grammar parsing G of smallest size |G|. Note that the MGP problem is similar to the Smallest Grammar Problem, except that all the constituents for the non-terminals of the grammar are also given. The MGP problem is related to the problem of static dictionary parsing [213] or optimal parsing (see Sect. 2.3), with the difference that the dictionary also has to be parsed. This recursive approach is partly what makes grammars interesting for both compression and structure discovery. For clarity, every time we use the term minimal grammar we will refer to a smallest grammar given a set of constituents, and we reserve the term smallest grammar for globally smallest grammars. As an example, consider the sequence s = ababbababbabaabbabaa and suppose the constituents are {s, abbaba, bab}. This defines the set of non-terminals {N0, N1, N2}, such that cons(N0) = s, cons(N1) = abbaba and cons(N2) = bab. A minimal grammar parsing is N0 → aN2N2N1N1a, N1 → abN2a and N2 → bab. The MGP can be solved in a classical way in polynomial time by searching for a shortest path in |C| graphs as follows. Let the set of constituents be {s, ω1, . . . , ωm} and the language be the sequence s.

1. Let N = {N0,N1,...,Nm} be the set of non-terminals. Each Nℓ will be the non-terminal whose constituent is ωℓ.

2. Define m + 1 directed acyclic graphs Γ0, Γ1, . . . , Γm, where Γℓ = ⟨Mℓ, Eℓ⟩. If |ωℓ| = k then the graph Γℓ will have k + 1 nodes: Mℓ = {1, . . . , |ωℓ| + 1}. The edges are of two types:

(a) for every node i there is an edge to node i + 1 labeled with ωℓ[i].

(b) there will be an edge from node i to j + 1 labeled by Nm if there exists a non-terminal Nm different from Nℓ such that ωℓ[i : j] = ωm.

3. For each Γℓ, find a shortest path from 1 to |ωℓ| + 1.

4. The right-hand side for non-terminal Nℓ is the concatenation of the labels of a shortest path of Γℓ.

Intuitively, an edge from node i to node j + 1 with label Nm represents a possible replacement of the occurrence ωℓ[i : j] by Nm. There may be more than one grammar parsing with minimal size. We therefore assume an arbitrary total order over grammars, and denote by mgp(C, s) the lowest minimal grammar parsing of ⟨C, s⟩. In the case we consider, each constituent will be a substring of s. Note that in practice the graph Γ0 therefore contains all the information of all the other graphs, because any Γℓ is a subgraph of Γ0. Therefore, we call Γ0 the Grammar Parsing graph (GP-graph). See Fig. 4.1 for an example. The list of occurrences of each constituent over the original sequence can be added to the graph at the moment it is chosen. The length of each constituent is bounded by n = |s|, so the complexity of finding a shortest path for one graph with a classical dynamic programming algorithm lies in O(n × m). Because there are m + 1 graphs, computing mgp(C, s) is in O(n × m2).
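As an illustration of this construction, the following C++ sketch (our own code, not the thesis prototype) computes the cost of a shortest path through one graph Γℓ with the classical dynamic programming recurrence. Substring matching is done naively here, whereas the thesis obtains the occurrences from the suffix array; names and data layout are assumptions.

    #include <limits>
    #include <string>
    #include <vector>

    // Cost of a shortest path from node 1 to node |ω|+1 of Γ_ℓ, i.e. the length
    // of a minimal right-hand side for ω given the other constituents.
    int minimal_parsing_cost(const std::string& omega,
                             const std::vector<std::string>& constituents,
                             int self /* index of ω itself in constituents, or -1 */) {
        const int k = (int)omega.size();
        const int INF = std::numeric_limits<int>::max();
        std::vector<int> cost(k + 1, INF);
        cost[0] = 0;                                           // node 1 of Γ_ℓ
        for (int i = 0; i < k; ++i) {
            if (cost[i] == INF) continue;
            cost[i + 1] = std::min(cost[i + 1], cost[i] + 1);  // terminal edge ω[i]
            for (int m = 0; m < (int)constituents.size(); ++m) {
                if (m == self) continue;                       // non-terminal edges
                const std::string& w = constituents[m];
                int len = (int)w.size();
                if (i + len <= k && omega.compare(i, len, w) == 0)
                    cost[i + len] = std::min(cost[i + len], cost[i] + 1);
            }
        }
        return cost[k];                                        // node |ω|+1
    }

Keeping back-pointers along cost[] yields the actual right-hand side, and running this for every Γℓ (including Γ0 for s itself) gives mgp(C, s); the stated O(n × m) bound per graph is reached once occurrence lookup is made constant-time instead of the naive comparison used above.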

4.1.2 IRR with Occurrence Optimisation

We can now define a variant of IRR, called Iterative Repeat Replacement with Choice of Occurrence Optimisation (IRRCOO), whose pseudo-code is given in Algorithm 8. Unlike IRR, what is maintained is a set of terminal strings, and the current grammar at each step is a minimal grammar parsing over this set of strings. Recall that cons(ω) gives the only terminal string that can be derived from ω (the "constituent"). The computation of the argmax depends only on the number of repeats, assuming that f is constant, so its complexity lies in O(n2). As for IRR, the total number of times the while loop is executed is bounded by n. The complexity of this generic scheme is thus O(n × (n2 + n × m2)), where m + 1 is the number of constituents.

Algorithm 8 Iterative Repeat Choice with Occurrences Optimisation (IRRCOO)
IRRCOO(s, f)
Input: s is a sequence, and f is a score function on words
1: C ← {s}
2: G ← G({S → s})
3: while (∃ω : ω ← arg max_{α∈R̂(r(G))} f(α, G)) ∧ |mgp(C ∪ {cons(ω)}, s)| < |G| do
4:   C ← C ∪ {cons(ω)}
5:   G ← mgp(C, s)
6: end while
7: return G(P)

As an example, consider again the sequence used in the proof of Theorem 4. After three iterations of IRRCOO-MC the words xax, xbx and xcx are chosen, and a MGP of these words plus the original sequence results in G∗.

Figure 4.1: GP-graph for s = ababbababbabaabbabaa and additional constituents α1 = abbaba and α2 = bab.

4.1.3 Removing Costly Rules

Re-arranging the non-terminals while optimising the size of the resulting parsing may produce as a side effect rules whose use costs more than the gain they provide. Every rule that is used once or never is costly, and a rule that is used twice must have a right-hand side of length at least 3.

Definition 12 (Costly Rule). Given a set of production rules P, a rule N → ω is a costly rule if (|Lr(G)(N)| − 1) · (|ω| − 1) < 2.

So, given G, we denote by clean(G) the grammar where each costly rule N → ω is eliminated and the occurrences of N are replaced by ω. Algorithm 9 presents the algorithm IRRCOOC (IRR with Choice of Occurrence Optimisation and Cleanup): it is based on IRR, where a minimal grammar parsing and a cleanup are performed after each iteration. Recall that computing mgp(C, s) is in O(n × m2) and that every execution of line 7 reduces the size of the grammar by at least one. So, the worst-case complexity of IRRCOOC is again bounded by O(n4) (m is bounded by n).

Algorithm 9 Iterative Repeat Replacement with Occurrence Optimisation and Cleanup (IRRCOOC)
IRRCOOC(s, f)
Input: s is a sequence, and f is a score function on words
1: C ← {s}
2: G ← G(C)
3: while ∃ω : ω ← arg max_{α∈R̂(r(G))} f(α, G) ∧ |Gω↦N| < |G| do
4:   C ← C ∪ {ω}
5:   repeat
6:     C ← {cons(N): N non-terminal of G}
7:     G ← clean(mgp(C, s))
8:   until G contains no costly rules
9: end while
10: return G
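As a small illustration of Definition 12 (a sketch with our own naming, not the thesis code):

    // A rule N -> ω is costly when (uses(N) - 1) * (|ω| - 1) < 2, i.e. when it is
    // used at most once, or used exactly twice with a right-hand side of length <= 2.
    bool is_costly_rule(int uses, int rhs_length) {
        return (uses - 1) * (rhs_length - 1) < 2;
    }

clean(G) removes such rules and re-expands their non-terminal wherever it occurs; the repeat loop of lines 5–8 of Algorithm 9 alternates this cleanup with a new minimal grammar parsing until no costly rule remains.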

4.2 A Search Space for the Smallest Grammar Problem

The mgp procedure permits us to solve the problem of finding a smallest grammar given a fixed set of constituents. With the IRRCOOC algorithm we introduced the idea of maintaining during its execution a set of constituents (C) from which a minimal grammar can be recovered. Here we go one step further with this idea: having solved the problem of finding a minimal grammar given one set of constituents, we focus on finding a good set of constituents.

4.2.1 The Search Space

Consider the lattice ⟨2^R(s), ⊆⟩, where every node corresponds to a set of repeats of s. We then define a score function over the nodes of the lattice as score(C) =


|mgp({s} ∪ C, s)|. A global minimum C will be a node whose score is smallest: score(C) ≤ score(θ) for all nodes θ. For a given search space, we say that it is correct with respect to an optimisation problem P if each item of the search space that is a global minimum (with respect to a fixed score function) can be mapped to a solution of P. Conversely, the search space will be called complete if any solution of P can be mapped to a global minimum of the search space.

Theorem 5. The lattice ⟨2^R(s), ⊆⟩ is a correct and complete search space for the Smallest Grammar Problem.

Proof. Let SG(s) be the set of all smallest grammars for the sequence s. Similarly, we define MGP(C, s) as the set of minimal grammars for ⟨C, s⟩. We will prove that:

SG(s) = ⋃_{C : C is a global minimum of ⟨2^R(s), ⊆⟩} MGP({s} ∪ C, s)

To see the first inclusion (⊆), take a smallest grammar G∗. All strings in cons(G∗) have to be repeats of s, so cons(G∗) \ {s} corresponds to a node C in the lattice and G∗ has to be in MGP({s} ∪ C). Conversely (⊇), all grammars of the right expression have the same size, which is minimal, so they are all smallest grammars.

Because of the NP-hardness of the problem, it is fruitless (supposing P ≠ NP) to search for an efficient algorithm to find a global minimum. We will therefore present an algorithm that looks for a local minimum on this search space. To define the algorithm, we first need some notation:

Definition 13. Given a lattice ⟨L, ⪯⟩, define:

1. ancestors(η) = {θ ≠ η : η ⪯ θ ∧ (∄κ ≠ η, θ : η ⪯ κ ⪯ θ)}

2. descendants(η) = {θ ≠ η : θ ⪯ η ∧ (∄κ ≠ η, θ : θ ⪯ κ ⪯ η)}

The ancestors of node η are the nodes exactly “over” η, while the descendants of node η are the nodes exactly “under” η. A node η is a local minimum if score(η) ≤ score(θ) for all nodes θ ∈ ancestors(η) ∪ descendants(η).

4.2.2 The ZZ Algorithm

The ZZ algorithm, introduced by Carrascosa [43], looks for a local minimum, traversing the lattice in a hill-climbing way. In each step it selects the neighbour with the best score. But instead of inspecting all neighbours, it alternates two phases in which it inspects only the ancestors or the descendants of the current node. This defines a path in the form of a zig-zag, hence the name ZZ. The two phases are bottom-up and top-down. The bottom-up phase can be started at any node; it moves upwards in the lattice and at each step it looks among the ancestors of the current node for the one with the lowest score. In order to determine which one has the lowest score, it inspects them all. It stops when no ancestor has a better score than the current node. Like bottom-up, top-down starts at any given node, but it moves downwards looking for the node with the smallest

score among its descendants. Going up or going down from the current node is equivalent to adding or removing, respectively, a substring to or from the set of substrings in the current node. ZZ starts at the bottom node, that is, the node that corresponds to the grammar S → s, and it finishes when no improvement is made in the score between two bottom-up–top-down iterations. Pseudo-code is given in Algorithm 10.

Algorithm 10 Zig-Zag algorithm
ZZ(s)
Input: s is a sequence
1: L ← ⟨2^R(s), ⊆⟩
2: C ← ∅
3: while score(C) decreases do
4:   while ∃C′ ∈ L : (C′ ← arg min_{C′∈ancestors(C)} score(C′)) ∧ score(C′) ≤ score(C) do
5:     C ← C′
6:   end while
7:   while ∃C′ : (C′ ← arg min_{C′∈descendants(C)} score(C′)) ∧ score(C′) ≤ score(C) do
8:     C ← C′
9:   end while
10: end while
11: return mgp(C, s)
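To make the two inner loops concrete: going up means trying every set C ∪ {r} and going down every set C \ {r}. A sketch of one bottom-up move is given below, under our own data layout; score() stands for |mgp({s} ∪ C, s)| and allRepeats for R(s), both assumed to be available.

    #include <functional>
    #include <set>
    #include <string>
    #include <vector>

    // One bottom-up step of ZZ: among all ancestors C ∪ {r}, pick the one with
    // the lowest score; return true if a move was made.
    bool bottom_up_step(std::set<std::string>& C,
                        const std::vector<std::string>& allRepeats,
                        const std::function<int(const std::set<std::string>&)>& score) {
        int best = score(C);
        const std::string* bestRepeat = nullptr;
        for (const std::string& r : allRepeats) {
            if (C.count(r)) continue;             // repeat already in the node
            std::set<std::string> candidate = C;
            candidate.insert(r);                  // an ancestor of C in the lattice
            int sc = score(candidate);
            if (sc <= best) { best = sc; bestRepeat = &r; }
        }
        if (!bestRepeat) return false;
        C.insert(*bestRepeat);
        return true;
    }

The top-down move is symmetric, removing one repeat from C instead of adding one.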

Computational Complexity

In the previous section we showed that the computational complexity of computing the score function for each node is O(n × m2), where n is the length of the target string and m is the number of substrings in the node. At each iteration of top-down or bottom-up, ZZ inspects up to O(n2) neighbours, so each step upwards or downwards is made in O(n3 × m2). No bottom-up (top-down) phase can do more than n steps without increasing the score (a grammar with n/2 constituents has a size of at least n). Each of the bottom-up–top-down iterations decreases the size by at least one, so the final complexity of ZZ is in O(n5 × m2).

4.2.3 Non-monotonicity of the search space

We finish this section with a remark on the search space. In order to ease the understanding of the proof, we will assume that the size of the grammar is defined as |G| = Σ_{A→α∈P} |α|. The proof extends easily if we consider our definition of size, but it is more cumbersome (basically, instead of taking blocks of size two in the proof, take them of size three). We have presented an algorithm that finds a local minimum on the search space. Locality is defined in terms of the direct neighbourhood, but we will see that the local minimality of a node does not necessarily extend further:


Proposition 6. The lattice ⟨2^R(s), ⊆⟩ is not monotonic for the function score(η) = |mgp({s} ∪ η, s)|. That is, suppose η is a local minimum. There may be a node

θ ⊇ η such that score(θ) < score(η).

Proof. Consider the following sequence:

abcd|cdef|efab|cdab|efcd|abef|bc|bc|de|de|fa|fa|da|da|fc|fc|be|be|ab|cd|ef.

The set of possible constituents is {ab, bc, cd, de, ef, fa, da, fc, be}, none of which is longer than 2. Note that the digrams that appear in the middle of the first blocks (of size four) are repeated twice, while the others appear only once. Also, the six four-size blocks are all compositions of the constituents {ab, cd, ef} (each of which is only repeated once at the end). Consider now the following grammar:

S → aBCd|cDEf|eFAb|cDAb|eFCd|aBEf|BC|BC|DE|DE|FA|FA|DA|DA|FC|FC|BE|BE|ab|cd|ef
BC → bc
DE → de
FA → fa
DA → da
FC → fc
BE → be

of size 68, which is a smallest grammar given this set of constituents. Moreover, adding any of the three remaining constituents would increase the size of the grammar by one. But adding all three of them would make it possible to parse the blocks of size 4 with only two symbols each, plus parsing the three trailing blocks with only one symbol. This means gaining 9 symbols and losing only 6 (because of the introduction of the new right-hand sides).

4.3 A Practical Algorithm

In Table 4.1 we summarise the size of the final grammars obtained with IRRCOO-MC, IRRCOOC-MC and ZZ on the Canterbury and DNA corpora, and compare them to the size of the grammars obtained with IRR-MC. In [44] we reported results using strategies other than MC with IRRCOO. As can be appreciated, each algorithm outperforms its previous version. While ZZ proves to be very powerful, its high complexity (O(n7)) makes it unfeasible even on the rather small corpora we use as benchmark. The efficiency and scalability concerns we analysed in Chapter 3 appear again. The IRRCOO and IRRCOOC frameworks are suitable to include the first two modifications we proposed to speed up execution time. These are to consider only maximal repeats and to use the total number of occurrences instead of the size of the normalised non-overlapping occurrence list. Because of the mgp algorithm, the last modification (in-place update of the underlying suffix array) seems more difficult to adapt and we did not include it here. Because the ZZ algorithm works differently, in the sense that it does not compute in each iteration the set of current repeats, we did not improve it with any of the proposed changes from Chapter 3. We therefore compared the execution time of IRRCOOC-MC to the time spent by IRR-MC. Results over the DNA corpus can be appreciated in

Table 4.1: Results of the algorithms presented in this chapter on the DNA corpus (a) and the Canterbury corpus (b). The IRR-MC column gives the absolute size of the final grammar; the other columns give the size difference, in percentage, with respect to IRR-MC (see Sect. 2.7). Cells marked with † refer to partial executions.

(a) DNA corpus

sequence      IRR-MC    IRRCOO-MC   IRRCOOC-MC   ZZ        IRRMGP*
chmpxx        28,706    -2.86       -2.87        -9.35     -4.40
chntxx        37,885    -2.74       -2.75        -10.41    -4.85
hehcmv        53,696    -2.63       -2.69        -10.07†   -5.28
humdyst       11,066    -3.61       -3.62        -8.93     -3.86
humghcs       12,933    -0.50       -0.61        -6.97     -2.34
humhbb        18,705    -2.42       -2.43        -8.99     -4.07
humhdab       15,327    -2.11       -2.10        -8.70     -3.34
humprtb       14,890    -1.35       -1.36        -8.27     -3.45
mpomtcg       44,178    -1.90       -1.99        -9.66     -4.02
mtpacga       24,555    -2.37       -2.37        -9.64     -4.47
vaccg         43,701    -1.93       -1.95        -10.08†   -5.20
average       –         -2.22       -2.25        -9.19     -4.12

(b) Canterbury corpus

sequence      IRR-MC    IRRCOO-MC   IRRCOOC-MC   ZZ        IRRMGP*
alice29.txt   41,000    -3.05       -3.24        -8.05     -2.56
asyoulik.txt  37,474    -2.34       -2.38        -6.60     -1.80
cp.html       8,048     -1.33       -1.40        -3.49     -1.12
fields.c      3,416     -1.20       -1.41        -3.07     -1.11
grammar.lsp   1,473     -0.14       -0.14        -0.54     -0.14
kennedy.xls   166,924   -0.10       -0.11        -0.13     -0.09
lcet10.txt    90,099    -1.79       -1.85        –         -1.88
plrabn12.txt  124,198   -4.65       -4.72        –         -3.44
ptt5          45,135    -0.60       -0.84        –         -2.65
sum           12,207    -0.56       -0.57        -1.47     -0.82
xargs.1       2,006     -0.75       -0.75        -1.69     -0.45
average       –         -1.50       -1.58        -3.13     -1.46



Figure 4.2: User time for consecutive prefixes of Escherichia coli for the accelerated versions of IRR-MC and IRRCOOC-MC. Time is given in seconds and size in bytes.

Table 4.2. IRRCOOC-MC finds grammars that are on average 1.96% smaller over the DNA corpus, needing five times more time, compared to the accelerated version of IRR-MC. Unfortunately, it does not seem to scale up very well on bigger sequences. In Figure 4.2 we plot the user time required to execute IRR-MC and IRRCOOC-MC on successive prefixes of the Escherichia coli genome. Both seem to grow as the square of the size (for the case of IRR-MC this can be better appreciated in Figure 4.3). The constant hidden in the complexity of IRRCOOC-MC, however, is much bigger than the one of IRR-MC, making it unfeasible when applied to sequences bigger than the test corpus. Therefore, we present here a last algorithm that can be executed on bigger sequences. Analysing the time used by IRRCOOC in each instruction reveals that the bottleneck lies in the computation of mgp(C, s). The way IRR chooses its constituents is fast and quite direct, while optimising the occurrences of the constituents is much more expensive. Several compromises are possible in order to reduce the number of times this optimisation step is performed. Here we propose to do it only at the end of an IRR execution and not at each iteration. Pseudo-code for this can be found in Algorithm 11. It consists of: run IRR, find a minimal parsing, throw away costly rules, and repeat this until no further improvement is made. By alternating IRR with a clean-up phase, it somehow reflects the bottom-up–top-down phases of ZZ. We call this algorithm IRRMGP∗ because it can be seen as several applications of IRR-MC completed by a minimal parsing and cleanup. Both the execution of IRR and the occurrence optimisation step reduce the size of the grammar by at least one. So, IRRMGP∗ is in O(n4) too. However, we measured again the time needed on successively bigger prefixes of the Escherichia

Algorithm 11 IRR plus MGP (IRRMGP∗)
IRRMGP∗(s)
Input: s is a sequence
1: G ← G({S → s})
2: while |G| ≠ |IRR(r(G), fMC)| do
3:   G ← IRR(r(G), f)
4:   repeat
5:     C ← {cons(N): N non-terminal of G}
6:     G ← clean(mgp(C, s))
7:   until G contains no costly rules
8: end while
9: return G

              IRRCOOC-MC           IRRMGP*
sequence      size      time       size      time
chmpxx        -2.53     5.62       -4.64     1.17
chntxx        -2.47     5.41       -4.74     1.14
hehcmv        -2.08     5.31       -5.16     1.09
humdyst       -2.61     3.58       -4.00     1.19
humghcs       -0.81     6.07       -2.34     1.15
humhbb        -1.66     4.59       -4.43     1.34
humhdab       -2.07     4.07       -3.41     1.12
humprtb       -1.16     4.39       -3.06     2.22
mpomtcg       -1.93     5.53       -3.85     1.13
mtpacga       -2.41     4.60       -4.36     1.20
vaccg         -1.78     6.36       -5.77     1.23
average       -1.96     5.05       -4.16     1.27

Table 4.2: Final grammar size and execution time of the accelerated versions of IRRCOOC-MC and IRRMGP∗. Grammar sizes are given as a percentage with respect to the final grammar size obtained by the accelerated version of IRR-MC, and times as a ratio with respect to the execution time of the same algorithm (see Table 3.2).

coli genome. From the results in Figure 4.3 it can be appreciated that it follows the same trend as IRR-MC and takes only slightly more time. On the DNA corpus (Table 4.2), IRRMGP∗ obtains 4.35% smaller grammars on this classical test corpus, taking 27% more time compared to IRR-MC.

Thanks to its reasonable complexity, we were able to execute IRRMGP* on bigger sequences than those of the standard corpus. We chose model organisms from different kingdoms: Phage lambda (virus), Escherichia coli (bacterium), Thalassiosira pseudonana (chromista protist), Dictyostelium discoideum (amoebozoa protist), Saccharomyces cerevisiae (fungus), Ostreococcus tauri (alga), Arabidopsis thaliana (plant) and Caenorhabditis elegans (nematode). From the two protists (T. pseudonana and D. discoideum) we only took chromosome 1, for A. thaliana we took chromosome 4 and for C. elegans chromosome 3. In all other cases the sequence corresponds to the whole genome. In each case, the analysed sequence was the flat DNA sequence, without annotations and where



Figure 4.3: User time for consecutive prefixes of Escherichia coli for the accelerated versions of IRR-MC and IRRMGP*. Time is given in seconds and size in bytes.

any "N" was deleted. Table 4.3 shows the results. We report the size of the grammar returned by IRRMGP* and the improvement over IRR-MC, which is close to 10%. In order to have a relative interpretation, we also report the size of the IRRMGP* grammar divided by the length of the sequence. In general, we can see that this number becomes smaller (more redundancy is detected) when the sequence is bigger, but that it is not necessarily correlated with the different kingdoms or classifications of the analysed organisms. The average ratio on the classical DNA corpus is 0.23, of the same order as the ratio achieved on the rather small viral genome.

4.4 Summary

In this chapter we presented several algorithms that outperform state-of-the-art algorithms in the task of finding small grammars in practice. All these algorithms rely on the separation of two optimisation problems. The first is to find a minimal grammar given a set of constituents, and defines what we named the Minimal Grammar Parsing Problem. We also presented an algorithm that solves this problem in polynomial time, the mgp algorithm. Each of the algorithms presented in this chapter uses this procedure. The other optimisation problem is to find an optimal set of constituents. The algorithms differ in the way they perform the search for this NP-hard problem, ranging from the computationally intensive ZZ to the much faster IRRMGP∗. In the next chapter we will take a closer look at some of the possible applications of the Smallest Grammar Problem. In particular, when analysing the uniqueness of

the smallest straight-line context-free structure, we will refer to the search space ⟨2^R(s), ⊆⟩ defined in this chapter.

classification         sequence             length        IRRMGP*     |G|/|s|   gain
Virus                  P. lambda            48,502        13,061      0.27      -4.25
Bacterium              E. coli              4,639,675     741,435     0.16      -8.82
Protist (Chromista)    T. pseudonana chrI   3,031,229     509,203     0.17      -8.15
Protist (Amoebozoa)    D. discoideum chrI   4,922,989     647,240     0.13      -8.49
Fungus                 S. cerevisiae        12,156,679    1,742,489   0.14      -9.68
Alga                   O. tauri             12,544,522    1,801,936   0.14      -8.78
Plant                  A. thaliana chrIV    18,582,009    2,561,906   0.14      -9.94
Nematoda               C. elegans chrIII    13,783,317    1,897,290   0.14      -9.47

Table 4.3: Resulting grammar size for IRRMGP* on some model organisms. The last column shows the gain with respect to the size of the grammar of the accelerated version of IRR-MC.


Chapter 5

Applications

In the previous chapter we have seen how to obtain smaller straight-line grammars than the state of the art. Here, we revisit the three applications we described in Chapter 2, with the insights we obtained so far. The first such application we will consider is Structure Discovery. We analyse a fundamental issue that concerns any use of the final parse tree of a smallest grammar (or an approximation of it) to discover a hidden structure. What happens if the smallest grammar is not unique? And if there is more than one of them, how similar are they? The splitting of the Smallest Grammar Problem we saw in the last chapter provides us with the tools to answer these questions. In Sect. 5.1 we analyse, both theoretically and on real sequences, the (non-)uniqueness of minimal grammars and their impact on the stability of the final structure. To measure the similarity we use standard metrics, and we conclude this section by proposing a new one that is more robust to minor changes. Concerning Kolmogorov complexity, we evaluate our IRRMGP∗ algorithm through clustering using the Normalised Compression Distance. Such an approach is parameterised by several options. We repeat two standard experiments from the literature, changing only the compressor function. We worked mostly on applications for Data Compression. We already mentioned the importance of the data compression community in the field of smallest grammars (or grammar-based codes). Most of the reported results combine in some way the three steps we described in Fig. 2.2 (page 22). Nevill-Manning and Witten [172], for instance, combine the grammar output by Sequitur with an implicit "marker" model and encode this afterwards with a 0-AAC. Results are comparable to gzip (it performs better on 6 of 14 files of the Calgary Corpus). We already saw the methods used by Yang and Kieffer [244, 245] (see Sect. 2.3.4) and their claims to compare similarly to the PPM family (though it is not clarified which version). In Sect. 5.3, we analyse each of these steps separately. We pay special attention to the case of DNA compression, a field where grammar-based codes traditionally performed poorly, and introduce a compressor that outperforms any other grammar-based DNA compressor. In Sect. 5.4 we study the family of rigid patterns and how to include them in a straight-line grammar DNA compressor. This compressor achieves compression rates up to 25% better than the previous best grammar-based compressor. The average compression ratio of this prototype on the standard test corpus is only 5% worse than the state of the art in DNA compression.
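For reference, the Normalised Compression Distance used in those clustering experiments is the standard one from the literature: writing C(x) for the size of the output of the chosen compressor on input x, and xy for the concatenation of the two sequences,

    NCD(x, y) = ( C(xy) − min(C(x), C(y)) ) / max(C(x), C(y)),

so that two sequences are considered close when compressing them together costs little more than compressing the larger one alone.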

5.1 Structure Discovery

Since its origins [211], the problem of finding the smallest context-free grammar that generates a given sequence has been motivated by the desire to discover the hidden structure of the object it describes. The results obtained with Sequitur [171, 172, 173], where known structures in natural language, structured text and musical scores were unveiled, are indicators that a small grammar may reveal interesting hierarchical patterns in real-life sequences. In particular, a smallest grammar is the one that successfully extracts all possible redundancy that can be captured with the expressive power of context-free straight-line grammars. Following Occam's Razor (see Sect. 1.3), this is a good candidate for the best explanation. However, it seems clear that some sequences may present more than one grammar whose size is minimal. In this section we first analyse the non-uniqueness of smallest grammars in theory, using the lattice-shaped search space we defined in Sect. 4.2.1. We then study whether some stable structures are observed in practice among all minimal grammars (given a set of constituents). We present the results of experiments comparing the similarity of these grammars. Part of this section was realised in collaboration through an INRIA/MINCyT project and submitted for publication in 2010 [45].

5.1.1 Non-uniqueness of the Smallest Grammar: in Theory

Recall from Theorem 5 (page 75) that the lattice we defined is a complete search space, in the sense that each smallest grammar is a minimal grammar parsing of a global minimal node. Here we consider the number of such global minima, that is, the number of nodes whose minimal grammar parsing has smallest size. The following proposition shows that there may be an exponential number of them. As in the proof of Proposition 6, and only to ease the understanding of the proof, we use |G| = Σ_{A→α∈P} |α| as the definition of grammar size.

Proposition 7. Let n(k) = max_{s:|s|=k} (number of global minima of ⟨2^{R(s)}, ⊆⟩). Then n(k) ∈ Ω(2^k).

Proof. It is sufficient to find one family of sequences for which the number of global minima is exponential. Consider the sequence

s_k = a_1a_1|a_1a_1|a_2a_2|a_2a_2| ... |a_ka_k|a_ka_k = ∏_{i=1}^{k} (a_ia_i|)^2

over an alphabet of size 3k. The a_i are single symbols. Recall that | refers to a different symbol every time it appears. The set of repeated substrings longer than one is {a_ia_i : 1 ≤ i ≤ k}. Take any subset, and compute the (unique) smallest grammar with this constituent set. Adding any of the remaining constituents to this grammar reduces the length of the axiom rule by two, does not reduce anything in the remaining rules, and adds two to the grammar size. The same happens when eliminating a constituent. So any node of the lattice is a local minimum, and therefore a global one.

86 CHAPTER 5. APPLICATIONS

Next, we suppose that the set of constituents is fixed and consider the number of minimal grammars that can be built with this set. That is, given a node η, we try to bound |MGP(η)|. As the following proposition shows, there are cases where the number of different minimal grammars grows exponentially for a given set of constituents.

Proposition 8. Let n(k) = max_{s:|s|=k, η⊆R(s)} |MGP(η ∪ {s})|. Then n(k) ∈ Ω(2^k).

Proof. Let s_k be the sequence (aba)^k and let η be {ab, ba}. Each aba can be parsed in only one of the two following ways: aA or Ba, where A and B derive ab and ba respectively. Since each occurrence of aba can be replaced by aA or Ba independently, there are 2^k different ways of rewriting s_k, and all of them have the same minimal size.

5.1.2 Non-uniqueness of the Smallest Grammar: in Practice

In this section we analyse the number of minimal grammars for some of the sequences of our corpora, and we compare minimal grammars among themselves. As finding the smallest grammar is NP-hard, we concentrate here on the node found by our best algorithm (ZZ, see Sect. 4.2.2). We would like to compute all the minimal grammars for a given node. Recall from Sect. 4.1.1 that, given a set of m constituents, the graphs Γ_i give a representation of all possible parses of constituent α_i. Here, we are interested in computing m subgraphs ∆_1 ... ∆_m with the following two properties:

1. Let Γ_i = ⟨M_i, E_i⟩. Then ∆_i = ⟨M_i, E′_i⟩, where E′_i is a subset of E_i.

2. Every path from node 0 to node |Mi| over ∆i is a shortest path for Γi.

∆_i can be computed from Γ_i with a dynamic programming algorithm. At each node k, every incoming edge that is not part of a shortest path from 0 to k is eliminated. At the end, a filtering process removes every edge belonging only to paths that do not lead to node |M_i| (see [45] for more details). Using the ∆_i's, it is possible to compute all minimal grammars for a fixed set of constituents. In Table 5.1 we report the number of minimal grammars at the final node found by the ZZ algorithm. This number is huge, and raises the question of how similar all these different grammars are.
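To make the construction concrete, the following sketch (ours, not the implementation of [45]) keeps exactly the edges of a Γ_i-like graph that lie on some shortest path from node 0 to the last node; it assumes integer nodes, forward edges and unit cost per edge (one symbol of the parse).

    # A minimal sketch: keep only the edges of a parsing graph that lie on
    # some shortest path from node 0 to node n, assuming nodes 0..n and
    # forward edges (u < v) of unit cost.
    def shortest_path_subgraph(n, edges):
        INF = float("inf")
        dist_from = [INF] * (n + 1)            # shortest 0 -> u
        dist_from[0] = 0
        for u, v in sorted(edges):             # topological order by source
            dist_from[v] = min(dist_from[v], dist_from[u] + 1)
        dist_to = [INF] * (n + 1)              # shortest u -> n
        dist_to[n] = 0
        for u, v in sorted(edges, reverse=True):
            dist_to[u] = min(dist_to[u], dist_to[v] + 1)
        best = dist_from[n]
        return [(u, v) for u, v in edges
                if dist_from[u] + 1 + dist_to[v] == best]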

sequence                  humdyst       asyoulik.txt   alice29.txt
sequence length           38,770        125,179        152,089
grammar size              10,035        35,000         37,701
number of constituents    576           2,391          2,749
number of grammars        2 × 10^497    7 × 10^968     8 × 10^936

Table 5.1: Sequence length, grammar size, number of constituents, and number of grammars for different sequences.


Figure 5.1: Two different trees with the same yield.

Similarity of Parse Trees

A straight-line grammar corresponds to one parse tree over the sequence, and the comparison of parse trees is a recurrent task in fields like natural language processing. One of the problems in this area is, given a sequence of letters, words or part-of-speech tags, to automatically find the correct syntax tree. This tree is then compared to a manually annotated, “correct” tree, the gold standard. Despite some criticism [209], the Parseval metric [1], or an unlabelled variant of it [130, Section 2.2], still seems to be the standard metric.

Before we define this metric, we introduce some notation. Given a tree, a node is called internal if it has at least one daughter, and a leaf otherwise. The yield of a tree T is the sequence composed of its leaves, read from left to right. Any node n defines a subtree T_n of which n is the root. Clearly, yield(T_n) is a substring of yield(T). We define the interval of a node n to be the interval of positions of yield(T_n) in yield(T). Finally, the bracketing of a tree T is the set of intervals given by the internal nodes (except the root) of T. Up to the labels of the nodes, the parse tree can be fully recovered from this set of intervals. Consider for example Fig. 5.1. The bracket set of tree T1 is {[0, 4], [0, 1], [2, 4], [3, 4]} and that of tree T2 is {[0, 4], [0, 3], [0, 1], [1, 2]}.

The unlabelled version of the Parseval metric used to evaluate syntax trees is the Unlabelled F-Measure (the harmonic mean of precision and recall). It is equal to the Dice coefficient of the bracketings of the evaluated trees:

Definition 14 (Dice coefficient). Given sets X, Y .

Dice(X, Y) = 2 |X ∩ Y| / (|X| + |Y|)

In Fig. 5.1, Dice(T1, T2) = 0.5. In our experiments, there is no gold standard, so talking about precision and recall may be misleading. Instead of the F1-measure, we therefore prefer to refer to it as the Dice coefficient.
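As a small sanity check (our own illustration, not part of the thesis experiments), the Dice coefficient of Definition 14 can be evaluated directly on the bracket sets given above for Fig. 5.1:

    # Dice coefficient of Definition 14 on the bracket sets of T1 and T2.
    def dice(x, y):
        return 2 * len(x & y) / (len(x) + len(y))

    brackets_T1 = {(0, 4), (0, 1), (2, 4), (3, 4)}
    brackets_T2 = {(0, 4), (0, 3), (0, 1), (1, 2)}
    print(dice(brackets_T1, brackets_T2))   # -> 0.5, as stated in the text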

Dice-coefficient Similarity of Minimal Grammars

Given Table 5.1, it is clearly unfeasible to compute the exact similarity between all minimal grammars. We therefore sample some of the grammars.

In order to have a uniform sample, we proceed as follows: starting at the end node of each ∆_i, the sampling algorithm works its way back to the start node by repeatedly choosing one of the incoming edges with a probability proportional to the number of shortest paths that go through that edge. We computed the Dice coefficient pairwise on a sample of 1,000 minimal grammars of the final node returned by ZZ. Results are reported, as percentages, in Table 5.2.

Sequence                  humdyst   asyoulik.txt   alice29.txt
Dice mean                 66.02     81.48          77.81
Dice standard deviation   1.19      1.32           1.52
Smallest value            62.26     77.94          73.44
Largest value             70.64     85.64          83.44

Table 5.2: Mean, standard deviation, smallest and largest values of similarity (Dice coefficient) given a uniform sample of 1,000 minimal grammars. All values are given in percentage.
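The backward sampling step described above can be sketched as follows (a simplified Python version, assuming ∆_i is given as a list of forward edges over nodes 0..n and that, by construction, every 0 → n path in ∆_i is a minimal parse):

    import random

    # Sample one 0 -> n path uniformly among all paths of Delta_i by walking
    # backwards and weighting each incoming edge by the number of paths that
    # reach its source node.
    def sample_uniform_path(n, delta_edges, rng=random):
        incoming = {v: [] for v in range(n + 1)}
        for u, v in delta_edges:
            incoming[v].append(u)
        paths = [0] * (n + 1)                  # paths[v] = #paths 0 -> v
        paths[0] = 1
        for v in range(1, n + 1):
            paths[v] = sum(paths[u] for u in incoming[v])
        path, v = [], n
        while v != 0:
            sources = incoming[v]
            u = rng.choices(sources, weights=[paths[s] for s in sources])[0]
            path.append((u, v))
            v = u
        return list(reversed(path))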

Tracking the Difference

Until now we have seen that the number of different grammars at a local minimum of the lattice may be huge for real-life sequences. However, comparing them with the Dice coefficient reveals that all these grammars present a rather stable structure. In the last two experiments we present here, we aim to discover where the differences lie.

First, we repeat the previous experiment, but filter out those brackets whose total size is smaller than a given k. Note that the standard Dice coefficient gives the same weight to a bracket of size two as to longer brackets. For structure discovery, longer brackets seem to be more relevant. We denote this new measure Dice_k. When k = 1, Dice_k is equal to Dice, but for larger values of k more and more brackets are ignored in the calculation. Table 5.3 reports the results for different values of k. For each sequence, the table contains two columns: one for Dice_k and one for the percentage of the total brackets that were included in the calculation. As can be seen, the Dice coefficient increases with k. This indicates that bigger brackets are found in most of the grammars, but it also shows that smaller brackets are much more numerous.

The second experiment considers the differences between the grammars at single positions. The objective is to measure in how many different ways a single position of the original sequence s can be parsed by a minimal grammar. For this we consider the partial parse tree where only the first level is retained. Doing this for each minimal grammar, we compute for each position i the number of different subtrees it belongs to. This is equivalent to the number of edges in ∆_0 that start at or before i and end after i. If this number is one for a given position, then in all minimal grammars the same occurrence of the same non-terminal expands to this position. On alice29.txt, 89% of the positions are parsed exactly the same way. A histogram for all values of different parses can be seen in Fig. 5.2 (note that the y-axis is in logarithmic scale).

        humdyst               asyoulik.txt          alice29.txt
 k      Dice_k   Brackets     Dice_k   Brackets     Dice_k   Brackets
 1       67.32   100.00        81.50   100.00        77.97   100.00
 2       71.11    45.99        88.26    50.86        83.70    53.14
 3       75.93    37.54        92.49    29.57        87.94    32.42
 4       82.17    15.69        95.21    19.85        89.60    22.01
 5       88.51     3.96        96.35    11.78        88.88    14.36
 6       95.46     1.24        97.18     8.23        89.45     9.66
 7       98.38     0.44        97.84     5.72        91.50     6.44
 8       99.87     0.19        97.82     3.83        92.78     4.30
 9      100.00     0.06        98.12     2.76        92.37     2.95
10      100.00     0.04        98.35     1.88        91.87     2.10

Table 5.3: Dice_k for different values of k. Values are given in percentage.

The number of positions decreases drastically as the number of different parses increases: only 10% of the positions are parsed in two different ways, 1% in three, and all others in less than 0.2%. Two regions presented peaks in the number of different symbols. Both correspond to parts of the text with long runs of the same character (white spaces): at the beginning, and in the middle, to indent a poem. While this experiment is restricted to the first level of the parse tree, it seems to indicate that the huge number of different minimal parses is due to a small number of positions where different parses have the same size. Most of the positions, however, are always parsed the same way.

5.1.3 Structural Comparison of Sequences: a New Tree Distance Metric

The work in this section appeared in the Dagstuhl proceedings of Seminar 10231, “Structure Discovery in Biology: Motifs, Networks & Phylogenies” [94]. We propose a similarity metric between trees that ignores the labels of the nodes. It is inspired by the unlabelled version of the Parseval metric, but our measure turns out to be much more robust to small changes of the brackets. The motivation is also different: while the goal of the Parseval metric is to evaluate how close a proposed tree comes to a gold standard, the objective of our measure is to compare different trees over possibly different sequences. This could then be used to compare sequences based on their structure rather than on their sequential composition. The fact that the resulting function is a proper distance metric is useful, and often necessary, for applications that involve clustering, for example. It tries to extract all possible similarities, even if the matches of brackets are not perfect.

In cases where (almost) nothing is known about the sequences, the Dice coefficient may be too rigid and not the right measure to rate similarity. It is useful for providing an objective score when the goal is to find the correct linguistic parse tree of short sentences, where small changes completely disturb the original meaning. But two parses (α)(abc) and (αa)(bc), for example, would be judged as having a similarity of 0, regardless of the length of α. If α is very long, it seems intuitive to assume that both parses highlight α as significant, but with an uncertain right border.


Figure 5.2: The x-axis shows the number of different symbols that expand to a given position; the y-axis (logarithmic scale) shows the number of positions with that number of expansions.

As a first step we focus on the comparison of two brackets, which are represented as intervals. In order to measure how similar they are, we use the well-known Jaccard coefficient, which gives a measure of the similarity of two sets:

Definition 15 (Jaccard Coefficient). Given sets x,y,

J(x, y) = |x ∩ y| / |x ∪ y|

Here, we suppose that the brackets are intervals over the integer line, so that each bracket defines a finite set of integers. This gives a similarity coefficient between 0 and 1, 0 meaning completely different (empty intersection) and 1 total equality (x = y). The corresponding distance measure is d(x, y) = 1 − J(x, y).

Proposition 9. d is a proper metric, that is, it satisfies: non-negativity (d(x, y) ≥ 0), identity (d(x, x) = 0), symmetry (d(x, y) = d(y, x)) and tri- angle inequality (d(x, y) + d(y,z) ≥ d(x, z))

The first three are trivially true. For the triangle inequality, see Lipkus [148]. In order to be able to compare the brackets of two trees (sets of brackets) X and Y, we suppose an assignment function f : X → Y ∪ {∅}. Moreover, we require this function to be injective, except possibly for ∅. This means that every bracket y from Y has at most one x from X such that f(x) = y. Let R(f) denote the range of the function f. Note that R(f) may be only a subset of Y. The role of the empty set in the image is to permit assigning brackets of X which otherwise would not be assigned.

If f(x) = ∅ we refer to this as a null-assignment. Note that d(x, ∅) = 1, the maximal value for d. We compare two sets of brackets as follows:

Definition 16. D_f(X, Y) = |Y \ R(f)| + Σ_{x∈X} d(x, f(x))

This gives a penalty of the maximal distance for every bracket of Y to which no bracket of X was assigned. This is the symmetric part of assigning ∅ to a bracket of X. In order to find the shortest distance between two trees, we are interested in finding the best possible assignment function, that is, the function f* = arg min_f D_f(X, Y). Then, we define:

Definition 17 (Distance measure). D(X, Y) = D_{f*}(X, Y)

Symmetry

In order to prove symmetry, we define the inverse f^{-1} of a function in a non-standard way. Let f be as defined before. Then:

Definition 18 (Inverse function). f^{-1} : Y → X ∪ {∅} with

f^{-1}(y) = x if y ∈ R(f) and f(x) = y, and f^{-1}(y) = ∅ if y ∉ R(f).

So, each bracket of Y that was not assigned by f receives a null-assignment under f^{-1}.

Proposition 10. D_f(X, Y) = D_{f^{-1}}(Y, X)

Proof.

D_{f^{-1}}(Y, X) = Σ_{y∈Y} d(y, f^{-1}(y)) + |X \ R(f^{-1})|
 = {Definition 18} Σ_{x∈X∩R(f^{-1})} d(f(x), x) + Σ_{y∈Y\R(f)} d(y, ∅) + |X \ R(f^{-1})|
 = {d(y, ∅) = 1} Σ_{x∈X∩R(f^{-1})} d(f(x), x) + |Y \ R(f)| + Σ_{x∈X\R(f^{-1})} d(x, ∅)
 = {symmetry of d} Σ_{x∈X} d(x, f(x)) + |Y \ R(f)|
 = D_f(X, Y)

Now, if f does not assign a bracket y of Y, then y is null-assigned by f^{-1}. So, if f* minimises D_f(X, Y), then (f*)^{-1} minimises D_g(Y, X). We have as corollary:

Corollary 1 (Symmetry). D(X,Y ) = D(Y,X)


Figure 5.3: In this case (g ◦ f)(x2) = z1, (g ◦ f)(x3) = z2 and the rest are null-assignments.

Triangle Inequality

For the proof of the triangle inequality, we proceed similarly to before and redefine the composition of functions.

Definition 19 (Composition). If f : X → Y ∪ {∅} and g : Y → Z ∪ {∅}, then g ◦ f denotes the function g ◦ f : X → Z ∪ {∅} with

(g ◦ f)(x) = g(f(x)) if f(x) ≠ ∅, and (g ◦ f)(x) = ∅ otherwise.

See Figure 5.3 for an illustration. The intuition behind this is that if, while going from X to Z via Y, an x gets a null-assignment, then the final assignment is also null. We will make use of the two following lemmas:

Lemma 6. |Z \ R(g ◦ f)| ≤ |Z \ R(g)| + |Y \ R(f)|

Proof. If z ∈ Z \ R(g ◦ f), then by Definition 19 either z ∉ R(g), or z = g(y) for some y ∉ R(f). Since g is injective except possibly for ∅, distinct z of the second kind correspond to distinct y ∈ Y \ R(f), which gives the bound.

Lemma 7. Σ_{x∈X} d(x, (g ◦ f)(x)) ≤ Σ_{x∈X} d(x, f(x)) + Σ_{y∈Y} d(y, g(y))

Proof.

Σ_{x∈X} d(x, (g ◦ f)(x))
 = Σ_{x∈X, f(x)≠∅} d(x, (g ◦ f)(x)) + Σ_{x∈X, f(x)=∅} d(x, ∅)
 ≤ {triangle inequality of d} Σ_{x∈X, f(x)≠∅} (d(x, f(x)) + d(f(x), (g ◦ f)(x))) + Σ_{x∈X, f(x)=∅} d(x, f(x))
 = {Definition 19} Σ_{x∈X, f(x)≠∅} d(x, f(x)) + Σ_{x∈X, f(x)=∅} d(x, f(x)) + Σ_{x∈X, f(x)≠∅} d(f(x), g(f(x)))
 = Σ_{x∈X} d(x, f(x)) + Σ_{y∈R(f)\{∅}} d(y, g(y))
 ≤ Σ_{x∈X} d(x, f(x)) + Σ_{y∈R(f)\{∅}} d(y, g(y)) + Σ_{y∈Y\R(f)} d(y, g(y))
 = Σ_{x∈X} d(x, f(x)) + Σ_{y∈Y} d(y, g(y))

Theorem 6 (Triangle Inequality). D(X,Y ) + D(Y,Z) ≥ D(X,Z)

Proof. Suppose f, g and h are, respectively, the functions that minimise D_e(X, Y), D_e(Y, Z) and D_e(X, Z). Then

D(X, Z) = D_h(X, Z)
 ≤ {h minimises D_e(X, Z), so in particular D_h(X, Z) ≤ D_{g◦f}(X, Z)} Σ_{x∈X} d(x, (g ◦ f)(x)) + |Z \ R(g ◦ f)|
 ≤ {Lemma 7 and Lemma 6} Σ_{x∈X} d(x, f(x)) + Σ_{y∈Y} d(y, g(y)) + |Z \ R(g)| + |Y \ R(f)|
 = D(X, Y) + D(Y, Z)

Computation

In this section we consider an algorithm to compute the measure D and analyse its computational complexity. The Jaccard coefficient can be computed in constant time because we restrict the sets to be intervals. Given the assignment function f, D_f(X, Y) is computable in O(n), where n = max(|X|, |Y|). Note that we only consider trees where every node has at least two daughters, so the size of the set of brackets (|X|) is linear in the number of leaves (the length of the yield). The computation of f* can be mapped to an assignment problem and can be solved, for example, by the Hungarian algorithm [136], which runs in O(|X|^3). So, D(X, Y) can be computed in O(n^3), where n = max(|X|, |Y|).
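The reduction to an assignment problem can be made explicit as in the following sketch, which is only an illustration (not the implementation used in the experiments). It assumes SciPy's linear_sum_assignment solver and models null-assignments and unmatched brackets of Y by dummy rows and columns of cost 1.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def jaccard(x, y):
        # Jaccard coefficient of two inclusive integer intervals (a, b).
        inter = max(0, min(x[1], y[1]) - max(x[0], y[0]) + 1)
        union = (x[1] - x[0] + 1) + (y[1] - y[0] + 1) - inter
        return inter / union

    def tree_distance(X, Y):
        X, Y = list(X), list(Y)
        nx, ny = len(X), len(Y)
        cost = np.ones((nx + ny, nx + ny))     # 1 = maximal distance
        for i, x in enumerate(X):
            for j, y in enumerate(Y):
                cost[i, j] = 1.0 - jaccard(x, y)
        cost[nx:, ny:] = 0.0                   # dummy-to-dummy pairs are free
        rows, cols = linear_sum_assignment(cost)
        return cost[rows, cols].sum()          # = D(X, Y) of Definition 17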

Experiments

In order to test our distance measure and compare it to existing metrics, we want to see whether it is capable of distinguishing groups of similar trees. For this, we start with a small set of radically different trees.

We then modify copies of these original trees and use a clustering algorithm based on the distance metric to regroup them. The nature of the change is always the same, but it is parametrised by a random distribution.

Starting from k sets of brackets, we copy each of them m times with some modifications. We use k = 3: a left-branching tree ({(i, n) : 1 ≤ i < n − 1}), a right-branching tree ({(0, i) : 1 < i < n}) and a centred tree ({(i, n − i) : 1 < i < ⌊n/2⌋}). In our set-up, n = 30 and we generate m = 32 modified copies of each one, resulting in 99 bracket sets (we keep the 3 original trees). The modifications are obtained by changing each bracket with probability p. A change consists in a shift of the bracket to the left or to the right (chosen randomly). Shifting a bracket [a, b] by ℓ consists in replacing it by [a + ℓ, b + ℓ]. The value ℓ is given by a geometric distribution with parameter q (plus one, to ensure that the bracket is changed). After each change, all overlapping brackets are shortened to avoid overlap. In order not to give preference to any bracket, the order in which they are considered is determined randomly. Throughout our experiments we use q = 0.5 and different values for p.

For the clustering algorithm, we compute the square distance matrix and use a k-medoid algorithm (k = 3), taking the clustering with the lowest total sum of distances to the centres after 20 runs (each run starts with a random selection of centres). Note that the Dice dissimilarity (1 − Dice(X, Y)) does not satisfy the triangle inequality and is therefore inadequate for computing a distance matrix. It is however closely related to the Tanimoto distance (Dice(X, Y) = 2J(X, Y)/(1 + J(X, Y))), so we compared our distance metric to the Tanimoto distance, this time applied to sets of brackets.

In Fig. 5.4, we plot p (the probability that a bracket is changed) against the number of hits of the final clustering. Each point is the average over 250 runs, each run consisting of a copy-modify-cluster-count step. As can be appreciated, as the probability of modification increases, the Tanimoto distance becomes less accurate at discriminating the correct clusters. This reveals the binary nature of a match in the Tanimoto distance: either a bracket matches or it does not. Our distance is more flexible: this turns out to be counterproductive if there are only few changes, but as p increases, it proves to be very well suited to recover the right groups. Both measures result in the same number of hits for p = 0.5, but from there on the number of hits using the Tanimoto distance decreases considerably. Our distance D continues to improve, becoming more or less stable at 92 correct hits.
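The copy-and-modify step can be sketched as follows (our simplification: the overlap-repair pass and the exact geometric convention of the thesis are approximated here):

    import random

    def reference_sets(n=30):
        left = {(i, n) for i in range(1, n - 1)}           # left-branching
        right = {(0, i) for i in range(2, n)}               # right-branching
        centred = {(i, n - i) for i in range(2, n // 2)}    # centred
        return left, right, centred

    def geometric(q, rng):
        k = 1            # at least 1, so a selected bracket always changes
        while rng.random() > q:
            k += 1
        return k

    def modify(brackets, p, q=0.5, rng=random):
        out = set()
        for a, b in brackets:
            if rng.random() < p:
                shift = rng.choice((-1, 1)) * geometric(q, rng)
                a, b = a + shift, b + shift
            out.add((a, b))
        return out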

We presented a new distance metric to compare trees and proved that it is a proper metric. The aim of this metric is to compare sequences based on their tree structure. The advantage over previous approaches is its flexibility in comparing trees that intuitively are highly similar, but where existing similarity metrics fail. Our experiments show that this metric permits distinguishing groups of similar parse trees. Starting with a group of radically different trees, we modified each bracket slightly. When the probability of changing a bracket is greater than 0.5, our metric considerably outperforms a classical distance metric. It would be interesting to analyse how these measures behave with respect to tree-edit distances. Unlike the Parseval measure and ours, these are not based on the similarity between the yields of the nodes, but on the number of operations necessary to transform one tree into another.

Figure 5.4: Number of true positives of the final clustering against the probability of change of a bracket.

The standard edit distance [6, 29, 229] contemplates operations of insertion, deletion and renaming of nodes. In order to have a meaningful measure, an operation on edges that permits re-branching nodes (known as prune and regraft [7]) should also be considered.

5.1.4 Summary

Structure Discovery is possibly one of the most appealing and promising applications of the smallest grammar problem. Previous studies — in particular with the Sequitur algorithm — have evidenced its potential with examples. However, a qualitative study that certifies a (semi-)automatic approach is missing.

We have analysed in this section a fundamental step toward such an approach, namely the non-uniqueness of the smallest grammar. We proved that the number of smallest grammars can be exponential in the size of the original sequence, both because there may be an exponential number of constituent sets that yield a smallest grammar and because there may be an exponential number of parses for one given constituent set. Our definition of the problem, which decomposes it into a search for a set of constituents and a parsing with these constituents, provided us with the tools (the ∆_i graphs) to analyse the number of, and similarity between, minimal grammars. While the total number of different minimal grammars is huge, they seem to be very similar to each other. Moreover, most of the differences are due to the smaller constituents (which could be argued to be less interesting from a pattern-discovery point of view).


Also, there seem to be zones which concentrate most of the differences, while other areas are much more stable. These stable constituents seem to be the most promising for unveiling a possible hidden structure.

Finally, we remark that an exact comparison, as performed by standard tree similarity (or distance) measures, may not be the right path for comparing parse trees of such big sequences. This applies particularly to DNA sequences, where the boundaries of meaningful segments are not always stable. Therefore, we proposed a new proper distance metric over bracket sets that is able to detect similarity even if the trees are slightly modified. It would be interesting to define a normalised variant of this metric, and to use it in an efficient implementation that would reduce the practical space and memory requirements of the Hungarian algorithm. This could then be used to perform experiments similar to those presented in the first part of this section.

5.2 Kolmogorov Complexity

The size of a smallest grammar provides an approximation to Kolmogorov Complexity. However, any such approximation is impossible to evaluate directly because of the non-computability of Kolmogorov Complexity (Theorem 2). Classical evaluations use an approximation of the Universal Metric (Def. 6, page 27) to perform classification or clustering tasks. Such an approach involves several choices: first, a distance metric (like the Normalised Compression Distance) and a compressor that approximates Kolmogorov Complexity. All pairwise distances are computed and the resulting distance matrix is used as input for a final algorithm that outputs, for example, a hierarchical clustering. Ferragina et al. [89] compare all possible combinations of three approximations of the Universal Metric, two phylogenetic tree construction algorithms (UPGMA and Neighbor Joining) and 25 compressors: gzip, bzip2, four variants of PPMd, Huffman, Arithmetic, Range coding, BWT+(MTF)+Run-length+[Huffman|Arithmetic|Range], BWT+Wavelet and Gencompress. For Arithmetic and Range encoding up to three variants were used, depending on how fast the model adapts to new statistics. These combinations were evaluated by the similarity of the resulting tree with a gold one, on six different data sets:

AA-15-DNA The Apostolico data set, consisting of 15 complete mitochondrial DNA genomes.

CK-36-PDB The Chew-Keden data set, consisting of 36 protein domains, in FASTA format (amino acid sequence).

CK-36-REL The same data set, but this time using their complete topological description (TOPS strings with contact map).

CK-36-SEQ The same data set, but this time in TOPS strings of secondary structure with contact map.

SP-86-PDB The Sierk-Pearson data set, consisting of 86 protein domains, in FASTA format (amino acid sequence).

SP-86-ATOM The same data set, but this time in ATOM lines from PDB entries.

The authors conclude with a recommendation of using NCD (or UCD, whose result is indistinguishable), UPGMA and PPMd, even if the performance of the different compressors varies largely with the type of data. Cilibrasi and Vitanyi [63] use a tree construction algorithm designed by themselves, called the quartet method, and implement it in their CompLearn tool¹. In the same paper they present results of a very wide range of experiments, with a particular emphasis on a phylogenetic tree reconstruction of mammals using their mitochondrial DNA. We repeat here this last experiment. In both tests, we used exactly the same conditions, but using our algorithms as compressors.
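For reference, the pairwise distance matrix of such an experiment can be sketched as below; the compressed-size function C is a placeholder (gzip here, |IRRMGP*(·)| in our experiments) and the formula is the standard NCD (as used in Def. 7).

    import gzip

    def C(data: bytes) -> int:
        # Placeholder compressed-size function; any compressor can be used.
        return len(gzip.compress(data))

    def ncd(x: bytes, y: bytes) -> float:
        cx, cy, cxy = C(x), C(y), C(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    def distance_matrix(seqs):
        return [[ncd(a, b) for b in seqs] for a in seqs]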

5.2.1 Biological Classification

We summarise in Table 5.4 the results of using C(s) = |IRRMGP*(s)| in Def. 7 (page 27) and comparing against the gold tree given by the authors on their public website. In Table 5.4 we report exactly the same measures as Ferragina et al. [89]: the F1 measure (Dice's coefficient) for the CK-36 and SP-86 corpora, and the partition distance² for the AA-15 corpus. This should be compared with the corresponding tables given by Ferragina et al. [89]. In general, our measure of compression works rather well, beating (or equalling) every other compressor on the CK-36-REL corpus. For the other two corpora of the CK-36 set, the results are in the upper third, losing mostly against the PPMd family and performing mostly better than all the others. The overall results for the SP-86 set are rather poor, with no combination scoring more than 0.6. The result of using IRRMGP* varies a lot: while it scores as badly as the worst compressor on the ATOM corpus with the NJ algorithm, it is only outperformed by two compressors on the PDB+NJ combination. On AA-15, our algorithm performs as well as the best compressor (PPMd-16).

In general, using the size of the grammar obtained by the IRRMGP* algorithm performs as well as the most advanced compressors analysed in Ferragina et al. [89]. A notable exception is the SP-86 corpus, but this seems to be a generally hard task on which all compressors fail to achieve satisfactory results. In particular, as SP-86-ATOM consists of tables with mainly numerical values, a general-purpose compressor may not be able to use all the available information.

5.2.2 Mammalian Phylogeny

Cilibrasi and Vitanyi [63] report a phylogenetic tree built using general compressors (zip, PPMZ, bzip2) and the CompLearn toolkit. They also compute all pairwise distances, but for the final binary-tree construction a new method, the quartet method, is used. This method tries to optimise a score S(T), where S(T) = 1 indicates that the tree T represents the distance matrix perfectly. We refer to their paper for details. Using |IRRMGP*(s)| as an approximation to Kolmogorov Complexity, and using the CompLearn toolkit (command maketree), we obtain the unrooted binary tree of Fig. 5.5.

¹ http://complearn.org/
² “takes in input the tree topologies of two alternative classifications of n species and returns a value ranging from 0 to 4n − 10. It is the number of clades in the two rooted trees that do not match and it is increasing with dissimilarity. When zero, it indicates isomorphic trees.” [89]


CK-36-PDB     UPGMA   0.8676
              NJ      0.8824
CK-36-REL     UPGMA   0.9031
              NJ      0.8881
CK-36-SEQ     UPGMA   0.8849
              NJ      0.7418
SP-86-ATOM    UPGMA   0.5473
              NJ      0.5265
SP-86-PDB     UPGMA   0.5381
              NJ      0.5363
AA-15         UPGMA   4
              NJ      3

Table 5.4: Biological classification on the data sets used by Ferragina et al. [89], with IRRMGP* used to compute the distances. The reported measures are the F-measure (for CK-36 and SP-86) and the partition distance (for AA-15).

The tree is very similar to the one reported by Cilibrasi and Vitanyi, with the notable exception of the wrong position of Cyprinus carpio, the common carp (which is not a mammal).

5.3 Compression with IRR

With respect to compression, recall our schematic representation of grammar-based codes in Fig. 2.2 (page 22). The inference process is completed with a transformation of the grammar into a linear sequence, and finally with an encoding into a bit stream. We have seen in Sect. 2.3 that, with respect to Step 2, Nevill-Manning et al.'s Marker method [56, 178] takes advantage of the number of non-terminals that appear only twice. Yang and Kieffer [244] use the knowledge that any rule will have a right-hand side of length at least two and then concentrate on additional information to find a good model that performs the final Step 3 through an arithmetic coder. On the other hand, RNACompress [149] (see Sect. 2.3.5) uses the parse tree of the secondary structure to encode the grammar.

The cases we mentioned are examples of three possible ways we studied of encoding a small grammar. Because the number of symbols in the grammar is likely to be very big, a dynamic alphabet could permit the same identifier to refer to different symbols, depending on the moment it appears. A second strategy is to use the extra knowledge that the sequence being encoded represents a parse tree. In this way, a more adequate probabilistic model could be defined that would produce a smaller bitstream. Finally, a step further with this idea would be to send first the topology of the parse tree, and then the individual symbols. We will see these approaches in more detail in Sect. 5.3.1. The results presented there were obtained, partly, in collaboration with Matthieu Perrin from the ENS Cachan, and we refer to his final report [186, in French] for more details.

A completely different approach is to define an inference process that from the beginning is guided towards good compression (Step 1 of Fig. 2.2).

Figure 5.5: The unrooted binary tree after applying the maketree command of the CompLearn toolkit on a distance matrix obtained with IRRMGP*.

The grammar obtained this way may be very different from a grammar obtained by searching for a smallest one [51, 158]. In Sect. 5.3.2 we do this and design an algorithm that generates a straight-line grammar which is optimised for compression. We include in this algorithm the possibility of also considering the complementary strand of DNA and obtain a compressor that outperforms any other grammar-based DNA compressor. We improve on this in Sect. 5.4 using inexact repeats and present a DNA compressor comparable to current state-of-the-art compressors.

5.3.1 Compressing Small Grammars

Given a small grammar, the alphabet size may be very big, but the actual number of symbols used in each right-hand side is much smaller. So, our main concern in this section is to reduce the impact of this big alphabet.

Dynamic Alphabet

Because of the high number of symbols, we considered using a dynamic alphabet, so that the same identifier in the symbol stream can stand for different symbols of the grammar, depending on when and where it appears. Our first approach was to maintain a dynamic set A of active symbols. A is initialised empty and then r(G) is read from left to right. At the first occurrence of a symbol s, it is assigned the identifier |A| + 1 and added to A. Immediately after the last occurrence of a symbol s, a special symbol † is sent and s is removed from A. The right-hand sides of the remaining rules have to be sent in order of appearance of the first occurrence of their left-hand side. In our tests, the maximum size of A is about half the size of the original alphabet (Σ ∪ N ∪ {|}).

M. Perrin proposes a similar approach [186, Sect. 4.2], but using Move To Front (MTF). The symbols are kept in an array — called order — and r(G) is again sent from left to right. Each time a symbol s appears, the index i such that order[i] = s is sent. Then s is moved to the front of the array so that order[0] = s, and the symbols previously at positions 1 to i − 1 are shifted one position to the right. The hope is that low indices appear frequently, so that an adaptive arithmetic coder will concentrate probability on them.

Neither of the two approaches is specific to the grammar-based code framework, and the final size of the bit stream (coding the index streams with a 0-order adaptive arithmetic coder) is worse than encoding r(G) directly. The second approach, however, can be optimised, because the order in which the right-hand sides are sent can be altered so as to minimise the use of high indices. The best approximation of this optimal solution [186, Section 6.2] is slightly better than the direct compression of r(G) with a 0-AAC.
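The MTF transform of the symbol stream can be sketched as follows (our sketch, not Perrin's implementation); the index stream it produces is what would then be fed to the adaptive arithmetic coder.

    # Move-to-front encoding of a symbol stream; `alphabet` must contain
    # every symbol that occurs in `stream`.
    def mtf_encode(stream, alphabet):
        order = list(alphabet)
        out = []
        for s in stream:
            i = order.index(s)
            out.append(i)
            order.pop(i)
            order.insert(0, s)      # move the symbol to the front
        return out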

Specific Context Model for an Adaptive Arithmetic Coder

A possible reason for the failure of the most basic schema of the approaches presented above is that they were not designed to harmonise with Step 3, the final statistical encoding. Small grammars, however, present several properties that may be exploited in order to define an ad-hoc probabilistic model. In irreducible grammars (see Def. 4, page 22), for example, no substring is repeated.

Figure 5.6: The directed acyclic graph corresponding to G0.

M. Perrin uses a so-called anti-dictionary for a model that assigns null probability to a symbol s if it has a left context α and αs is a constituent of the grammar. Also, the power of an adaptive model can be exploited by presenting the right-hand sides by level of the parse tree. First all rules consisting only of terminal symbols are sent, then those rules that also contain non-terminals of rules sent in the first group, and so on. On the DNA corpus, our tests report an improvement of up to 5% with respect to directly encoding r(G) [186].

The Used-By Graph

Until now, the knowledge that r(G) is not an arbitrary sequence, but represents a parse tree over a sequence, has not been used explicitly. It appears in the idea of ordering the rules by levels of this parse tree, but the parse tree is a much richer structure than this. Consider the following grammar G0:

S → F A F a E B D g G C C G
A → ac
B → gt
C → ag
D → AB
E → BC
F → BDE
G → CC

We define the directed acyclic graph of a straight-line grammar G as DAG(G) = ⟨N, E⟩, with (N, N′) ∈ E if N′ appears in the right-hand side of the rule of N. DAG(G0) is shown in Fig. 5.6. If the decoder had knowledge of DAG(G), then transmitting G could be done cheaply following a fixed order (breadth-first, for example). Of course, it is not clear how to send the structure of the graph without spending more bits than are saved by limiting the alphabet in each node.
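The graph DAG(G0) can be read directly off the rules, as in this small sketch (assuming rules are stored as a dictionary from non-terminal to right-hand side, with lower-case strings standing for terminals):

    G0 = {
        "S": ["F", "A", "F", "a", "E", "B", "D", "g", "G", "C", "C", "G"],
        "A": ["a", "c"], "B": ["g", "t"], "C": ["a", "g"],
        "D": ["A", "B"], "E": ["B", "C"], "F": ["B", "D", "E"], "G": ["C", "C"],
    }

    def dag(grammar):
        # Edge (N, N') whenever N' appears in the right-hand side of N.
        return {lhs: sorted({s for s in rhs if s in grammar})
                for lhs, rhs in grammar.items()}

    print(dag(G0)["F"])    # ['B', 'D', 'E'] — F uses B, D and E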


A similar approach is to consider the complete graph with nodes N, where each edge (N_i, N_j) is weighted with the cost of changing the alphabet of α_i to that of α_j (this could be, for example, |Σ(α_i) \ Σ(α_j)| + |Σ(α_j) \ Σ(α_i)|). Finding a Hamiltonian path with lowest total weight on this graph would give the best order in which to send the rules. Despite the interesting formalisation, all our preliminary tests showed that no gain (or only a very small one) could be achieved compared to the straightforward encoding of r(G).

5.3.2 An IRR Algorithm for Compression Purposes

Instead of trying to compress a given small grammar, we focus here on generating a grammar which is suitable to be compressed. This is similar to the objective of Greedy [13], an IRR-like algorithm that defines a compression schema and selects in each iteration the repeat that most reduces the size under this schema. Instead, we propose here to minimise the empirical entropy of the grammar.

Definition 20 (Empirical Entropy). Given a sequence s of size n, the empirical entropy of s is:

Ĥ(s) = −Σ_{x∈Σ(s)} |pos_s(x)| log (|pos_s(x)| / n)

Then, in each iteration the algorithm IRR-S selects the repeat such that replacing all its normalised non-overlapping occurrences by a new symbol results in a grammar G for which Ĥ(r(G)) is minimal. As with IRR-MC, the length and number of occurrences of the repeat play a fundamental role, but the score function of IRR-S also takes into account the composition of the repeat. With respect to execution time, this means that in every iteration, for every repeat ω, each symbol of the repeat has to be accessed (time |ω|) and Ĥ re-computed (time |Σ ∪ N|). This adds an extra O(n²) to the execution of IRR, giving an O(n⁴) factor. It seems possible to compute the content of each repeat in asymptotic time O(n), thus reducing the time complexity to O(|Σ ∪ N| × n²), but we did not optimise the algorithm in this direction. Our goal here was to find a straight-line grammar that compresses well, not to optimise execution time.

We compare the size of the final compressed stream (we compress r(G) with a 0-AAC) with the standard unix tools zip and PPMd. The alphabet Σ(s) and its order (we use the first |Σ(s)| + 1 symbols for representing the terminals and the sentinel) are supposed to be known by the decoder and are therefore not sent. zip relies on compress, which is an implementation of LZ78 (see Sect. 2.6.2); we use version 2.32 (June 19th 2006). Standard zip cannot be considered a state-of-the-art compressor any more, so we also compare to PPMd, an implementation (written by D. Shkarin and D. Subbotin) of dynamic Markov encoding, a process that is similar to a variable-context adaptive arithmetic coder. In Table 5.5 we report the bits-per-byte ratio of the three compressors on all sequences of the Canterbury corpus. IRR-S outperforms zip (except on the sequence sum). We also report the number of rules found by IRR-S: a low number would indicate that most of the compression is done by the 0-AAC. It should be noted that the execution time of this non-optimised version of IRR-S is much longer (up to some hours) than that of zip.
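The empirical entropy of Definition 20, which drives the selection step of IRR-S, translates directly into code; the following sketch assumes base-2 logarithms (natural when sizes are counted in bits):

    from collections import Counter
    from math import log2

    # Direct transcription of Definition 20 (a sketch; the base of the
    # logarithm is our assumption).
    def empirical_entropy(seq):
        n = len(seq)
        return -sum(c * log2(c / n) for c in Counter(seq).values())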

                 IRR-S             zip     PPMd
sequence         bpb     |N|       bpb     bpb
alice29.txt      2.57    2,340     2.87    2.10
asyoulik.txt     2.86    1,892     3.14    2.34
cp.html          2.64    453       2.67    2.18
fields.c         2.36    293       2.41    1.98
grammar.lsp      2.77    109       3.15    2.35
kennedy.xls      1.40    1,444     1.61    0.96
lcet10.txt       2.25    5,083     2.72    1.90
plrabn12.txt     2.73    5,259     3.24    2.24
ptt5             0.82    1,184     0.88    0.78
sum              2.86    683       2.75    2.51
xargs.1          3.34    135       3.73    2.89
average          2.42    –         2.65    2.02

Table 5.5: Results of IRR-S, zip and PPMd on the Canterbury corpus. bpb stands for bits (of the final compressed stream) per byte (of the original sequence).

As expected, the compression ratio is well below that of PPMd. Interestingly, the number of rules that IRR-S computes before stopping is much lower than the number of rules of IRR-MC. This supports the observation that smaller grammars do not necessarily compress better. Also, the kind of repeats identified by IRR-S differs from those identified by IRR-MC. While the repeats selected by the latter are mostly short strings that appear very frequently, IRR-S seems to prefer longer repeats. This joins the argument sketched by Bookstein and Klein [35], where different “measures of worth” for including a string in a dictionary are considered:

“[...] a string may occur often simply because its components are expected to occur frequently. If the string occurs frequently only because its components do, no earnings accrue from reducing the string to a single object”

The authors then continue with an example that shows how, in a string generated by an i.i.d. source, replacing a substring by a new symbol will not achieve compression.

For the DNA corpus, we also take the reverse complement into account and call the resulting algorithm IRRc-S. Instead of choosing different symbols for the non-terminal that represents a normal-strand occurrence and the one that represents a complement-strand occurrence, we use the same symbol and disambiguate with a separate bit. Thus, each occurrence of a non-terminal adds one bit to the size of the grammar. The function that IRRc-S then minimises is Ĥ(r(G)) + Σ_{N∈N} |pos_{r(G)}(N)|. At the end, we encode r(G) with a 2-AAC (as recommended by Grumbach and Tahi [107]), hoping that the context can capture some more redundancy that escaped the grammar. Final compressed sizes and numbers of non-terminals are reported in Table 5.6. IRRc-S produces in general fewer rules, and the final size is better than with the grammar of IRR-S. This is particularly true for the two chloroplasts (chmpxx and chntxx).


              IRR-S              IRRc-S
sequence      bpb      |N|       bpb      |N|
chmpxx        1.8545   15        1.6780   17
chntxx        1.9539   17        1.6200   12
hehcmv        1.9747   31        1.8568   25
humdyst       1.9491   2         1.9326   3
humghcs       1.4721   641       1.4808   429
humhbb        1.9136   62        1.8723   46
humhdab       1.9272   101       1.8801   78
humprtb       1.9271   83        1.8890   68
mpomtcg       1.9631   129       1.9347   111
mtpacga       1.8690   45        1.8643   26
vaccg         1.8662   27        1.7744   13

Table 5.6: Compression results of IRR-S and IRRc-S on the DNA corpus. r(G) is compressed with a 2-AAC.

The only exception is humghcs, a highly repetitive sequence. For humdyst, IRR-S finds only one additional rule (two in the case of IRRc-S). Note, however, that this is also the sequence with the worst compression ratio for any of the current DNA compressors. Comparing with the state-of-the-art and SLG DNA compressors (Tables 2.1 and 2.3) gives some interesting insights. First, this rather simple algorithm outperforms any other grammar-based DNA compressor. Furthermore, to our knowledge this is the first time a compressor using a straight-line grammar outperforms a general-purpose compressor (like 2-AAC) on the whole DNA corpus. Second, the sequences where the difference between IRRc-S and the best known DNA compressors is largest are exactly those where the biggest number of rules is found (namely humghcs, humhbb, humhdab, humprtb and mpomtcg). Therefore, a coding schema that is not penalised so much by the extra number of symbols (the issue we addressed in Sect. 5.3.1) could cover this difference.

In the next section we consider a different approach. By using inexact repeats, we try to reduce the final length of r(G) even more without introducing more rules. We aim to mitigate the cost of introducing another symbol by permitting this symbol to cover more occurrences. Of course, the other side of the coin is that the decompressor has to have enough information to decode each of these inexact occurrences losslessly.

5.4 Lossless DNA compression with Rigid Motifs

Besides the frequent occurrence of complementary repeats, another exploited property of DNA is the existence of inexact repeats. State-of-the-art DNA compressors that are not purely statistical, like DNACompress [55], look for interesting inexact repeats and replace them before encoding the resulting sequence. The most widely used definitions of similarity rely on the Hamming distance or the edit distance. Both distances require specifying the positions where the edits (or deletions/insertions) occur, which can easily become too large to be specified in bits.

Therefore, we take a different approach and look for rigid patterns. In this thesis, we will not consider any other kind of pattern, so we will use the terms rigid pattern and motif interchangeably. Mirroring Sect. 3.2, we first give a review of motifs, maximal motifs and irredundant motifs (Sect. 5.4.1). We give a new formalisation of irredundant motifs, which permits an easy implementation of an algorithm to compute them. In Sect. 5.4.2 we comment on the use of rigid motifs in a straight-line grammar and define a new algorithm that compresses DNA sequences well.

5.4.1 A Taxonomy of Rigid Motifs

The discovery of patterns in a given sequence is a major area of data mining, with important applications in a wide range of domains like bioinformatics, musical analysis and natural language processing. In order to find those that are considered interesting, it is sometimes necessary to consider all patterns. So, for such a search to be computationally feasible, the definition of a pattern must strike a balance between its expressive power and the total number of patterns that may exist in the sequence. We have already seen (Sect. 3.2) how this leads to the definition of maximal repeats in the case of exact repeats.

For several applications, however, especially those related to genetic sequences, exact repeats are not enough to capture meaningful patterns. A possibility is to include a joker or “don't care” symbol in the pattern. This don't care symbol matches any other symbol and permits capturing patterns that escape the universe of exact repeats. Unfortunately, the number of motifs can be exponential with respect to the size of the sequence, and even extending the notion of maximality does not improve this upper bound. But in 2000, Parida et al. [184] introduced the concept of a basis: a set of motifs such that all other motifs can be generated mechanically from them. Different bases have been proposed (see [187] for an overview), but in recent years a consensus seems to have been reached. The basis of irredundant or tiling motifs has the attractive property that the total number of these motifs and the total number of their occurrences are bounded by n. Very recently, several papers addressed possible applications of this type of motif to genetic sequences. For instance, irredundant motifs have been used to classify proteins [72]. Other successful applications combine the definition of maximal motif with statistical measures [20] or with a ratio that bounds the number of don't cares [105].

Definitions

We present the definitions for motifs and recall a known lemma for reference. We extend the alphabet Σ with an additional “don't care” symbol, denoted by •, which is not contained in Σ and matches any symbol. A symbol that is not • is called a stable symbol.

Definition 21 (motif). A motif is an element of Σ ∪ Σ(Σ ∪ {•})*Σ. Note that a motif cannot start or end with a don't care.

If x is a motif, inf(x) is x concatenated with infinite don’t care symbols.

106 CHAPTER 5. APPLICATIONS

Definition 22 (⪯). For a, b ∈ Σ ∪ {•}, a ⪯ b holds if a = • or a = b. This relation extends to strings: let x, y ∈ (Σ ∪ {•})*; then x ⪯ y if inf(x)[i] ⪯ inf(y)[i] for every i ≥ 0.

Definition 23 (Occurrence of a motif). x occurs in y at position d if x ⪯ y[d..]. In this case, we say that y implies x. As in the case of exact repeats, we define, for the sequence s, pos_s(x) (or just pos(x) if s is clear from the context) as the set {i_1, ..., i_k} of positions at which x occurs in s. If i ∈ pos(x), then the tuple ⟨x, i⟩ is called an occurrence (of x).

By pos(x) + d, we denote the set pos(x) shifted by d positions: {i + d : i ∈ pos(x)}.

Definition 24 (Maximal Motif). A motif x is a maximal motif if, for all motifs y that imply x, there is no d such that pos(y) = pos(x) + d.

Finally, a maximal motif is said to be redundant if its occurrences can be obtained by the union of occurrences of other maximal motifs.

Definition 25 (Irredundant). Let x be a maximal motif. x is irredundant if, whenever there are maximal motifs y_1, ..., y_k and positive integers d_1, ..., d_k such that pos(x) = ∪_{i=1}^{k} (pos(y_i) + d_i), then x = y_j for some j.

An important concept in algorithms that retrieve irredundant motifs is that of autocorrelation. We denote by s_k the k-th (for 1 ≤ k < |s|) autocorrelation of s, defined by

ŝ_k[i] = s[i] if s[i] = s[k + i], and ŝ_k[i] = • otherwise,

for all i ∈ [0, |s| − k − 1]; s_k is ŝ_k after the removal of all leading and trailing don't cares. Note that s_k may be empty. Every non-empty autocorrelation defines two occurrences of this motif: ℓ and ℓ + k, where ℓ is the position of the first solid symbol of ŝ_k. We denote by M the set {⟨s_k, ℓ⟩, ⟨s_k, ℓ + k⟩ : s_k is not empty and ℓ is the position of the first solid symbol of ŝ_k}. Pisanti et al. [188] prove the following about autocorrelations:

Proposition 11 (Autocorrelations).

1. if sk is not empty, it is a maximal motif

2. every irredundant motif of s is an autocorrelation of s

3. ⟨x, i⟩ ∈ M for every irredundant motif x and every i ∈ pos(x)

As there are n − 1 autocorrelations, the linear bound on the number of irredundant motifs and on the total number of their occurrences is a direct consequence of Proposition 11.

Alternative Characterisation

The definitions of maximal and irredundant motifs are given with respect to their sets of occurrences. Here we propose an alternative characterisation, which will provide the basis for our algorithm afterwards. Our characterisation is based on the motif itself, instead of on the positions where it occurs. This avoids computing the positions of the tiling motifs (the y_i's in Def. 25). The new characterisation also permits an intuitive parallel between maximal and irredundant motifs and their exact repeat (motifs without a don't care) counterparts.

As pointed out by Apostolico and Parida [15], a maximal motif is intuitively a motif that cannot be made more specific without losing one of its occurrences. By “made more specific”, we mean expanding it to the right or to the left, or changing a don't care into a solid symbol. Note that this is equivalent to Def. 23.

Theorem 7 (Characterisation of a Maximal Motif). A motif x is maximal iff for all motifs y that imply x, |pos(y)| < |pos(x)|.

Proof. If y implies x, then each time y occurs, x does too; so |pos(y)| ≤ |pos(x)|. The only-if part is then trivially true. For the if part, note that if x occurs in y, then it always does so with the same offset (the d from Def. 23), no matter which occurrence of y is considered. So if y implies x, then pos(y) ⊆ pos(x) + d. This proves the theorem.

A similar approach was taken by Ukkonen [234] to define maximal motifs: there, two motifs are in the same equivalence class if they have the same set of occurrences (up to an offset). Maximal motifs are then those that have the maximal number of stable symbols in their equivalence class. The characterisation of Theorem 7 is the intuitive counterpart, for motifs, of exact maximal repeats. A maximal repeat x is an exact repeat such that any other exact repeat y that contains x appears fewer times. Note that in the case of exact repeats, a repeat can be made more specific only by extending its length.

Definition 26 (Coverage). Let m and m′ be motifs. m′ covers m if pos(m′) + d ⊆ pos(m) for some d ≥ 0. Occurrences ⟨m, i⟩ for i ∈ pos(m′) + d are said to be covered by m′.

This leads directly to:

Lemma 8 (Characterisation 1 [185]). A motif m is irredundant iff there is at least one occurrence ⟨m, i⟩ not covered by any other motif.

Lemma 9 ([22, 185]). Let m and m′ be maximal motifs. m′ covers m iff m′ implies m.

Note that the if part holds only for maximal motifs. If m′ implies m, we say that the occurrence ⟨m, i⟩ is implied by ⟨m′, i′⟩ if ⟨m, i⟩ is covered by m′ and i′ ∈ pos(m′). Now it is easy to show that:


Theorem 8 (Characterisation 2). A maximal motif m is irredundant iff there is at least one occurrence ⟨m, i⟩ that is not implied by any other maximal motif.

Proof. Directly from Lemmas 8 and 9.

We call ⟨m, i⟩ a redundant occurrence if there is an occurrence ⟨m′, i′⟩ that implies it. Note that all occurrences of redundant maximal motifs are redundant, and even some of the occurrences of an irredundant motif may be redundant.

It is worth underlining the similarity of this characterisation with the definition of largest-maximal repeats (see Def. 11). While the number of maximal repeats is linear, the number of maximal motifs can be exponential. Conversely, the number of occurrences of irredundant motifs is known to be linear, while the number of occurrences of largest-maximal repeats is, in the worst case, lower-bounded by Ω(n^{3/2}) (see Proposition 4).

A Simple Algorithm to Compute Irredundant Motifs

Before introducing our algorithm, we give a short review of existing algorithms to compute the irredundant motifs. Pisanti et al. [188] propose an algorithm based on a filtering step over the autocorrelations (see Proposition 11). This filtering step consists in discarding those motifs for which the y_i's from Def. 25 can be found. But for this, it is necessary to compute the occurrence set of all autocorrelations. To achieve this, the authors use the Fischer-Paterson algorithm, based on a Fast Fourier Transform, to compute boolean products. This defines the complexity of the algorithm (n applications of this algorithm): O(log |Σ| n² log n). Pelfrêne et al. [185] propose another approach which also works for quorums different from two (the quorum of a motif is the size of its occurrence set). For the case of a quorum of 2 (the case we are considering), they also compute all autocorrelations and all their occurrences with the Fischer-Paterson algorithm. The filtering step is done using an alternative characterisation (Lemma 8). Again, it is the Fischer-Paterson algorithm that defines the complexity: O(|Σ| n² log n). Apostolico and Tagliacollo [17] improve this bound to O(|Σ| n²). The general schema is very similar to the two previous ones, but they are able to find all occurrences of all autocorrelations in time O(n²) if the alphabet is binary.

Here we propose an algorithm that is not based on the occurrence lists of the autocorrelations. Instead, it is based on the motifs themselves and finds occurrences that fulfil Theorem 8. The main advantage of this algorithm does not lie in its complexity, which can be bounded by O(n³), but in its simplicity.

Our Algorithm

As in the previous approaches, we first compute the autocorrelations of s. This can be done in O(n²), and results in the set M = {⟨x_1, o_1⟩, ..., ⟨x_m, o_m⟩} of occurrences, where m is bounded by n (see Proposition 11). All occurrences of irredundant motifs are in M (Proposition 11), so we must filter out the occurrences of those motifs that are redundant. By Theorem 8 this means filtering out occurrences ⟨x, i⟩ such that there exists ⟨y, j⟩ that implies ⟨x, i⟩. Because the relation implies is transitive, for each redundant occurrence ⟨x, i⟩ there must be at least one irredundant occurrence that implies it. So, if ⟨x, i⟩ is redundant, then there exists ⟨y, j⟩ in M that implies it.

Recall that if ⟨x, i⟩ implies ⟨y, j⟩, then there exists d such that y[k] ⪯ x[d + k] for all k. This means that at every position of s where the occurrence of y is stable, the occurrence of x must be stable too.

We therefore define the set of stable positions of an occurrence: stable(⟨x, o⟩) = {o + ℓ : x[ℓ] ≠ •}. Then ⟨x, i⟩ implies ⟨y, j⟩ iff stable(⟨x, i⟩) ⊇ stable(⟨y, j⟩).

We have thus reduced the problem of filtering the irredundant occurrences of M to the following problem: given sets p_1, ..., p_m ⊆ {1, ..., n}, find those p_i such that p_i ⊄ p_j for any j ≠ i. This problem is called the Maximal Set problem [246] and has received much attention in the past. A more general problem is to find the partial-order graph (called the subset graph) over a collection of sets. The size of this problem is generally measured by N = Σ_{i=1}^{m} |p_i|. Note that in our case m ∈ O(n) and so N ∈ O(n²). The size of the subset graph is then O(N²/log² N) [191], giving a natural lower bound for any algorithm that computes it. There exists at least one O(n·m²/log m) algorithm for this [81]. Several direct optimisations can be made by pre-calculating an inverse index that gives, for each position, the indices of the sets that contain this position [190]. The case of intersecting two sorted sequences (which is our case) has also received attention because of its application to web search engines [24].

We implemented a much simpler algorithm for this special instance of the Maximal Set problem, taking advantage of the fact that each occurrence has to be compared only to occurrences of longer motifs (otherwise it cannot be implied), and that the stable position sets can be retrieved already sorted. The code for computing the autocorrelations and filtering the redundant motifs takes approximately 150 lines of C++. We were able to compute the autocorrelations of DNA sequences of length 50,000 on an Intel 2.66 GHz machine with 2 GB of RAM in 5 minutes. For longer sequences, the explicit representation of the autocorrelations (which takes quadratic space) did not fit in main memory.
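The whole procedure — computing the autocorrelations and filtering occurrences by inclusion of their stable-position sets — can be sketched compactly; the following Python sketch (ours, quadratic space and roughly cubic time, not the C++ implementation described above) illustrates it, writing the don't care as ".".

    def irredundant_occurrences(s):
        n = len(s)
        occs = {}                       # stable-position set -> (motif, position)
        for k in range(1, n):
            chars = [s[i] if s[i] == s[i + k] else "." for i in range(n - k)]
            motif = "".join(chars).strip(".")
            if not motif:
                continue                # empty autocorrelation
            lead = next(i for i, c in enumerate(chars) if c != ".")
            for pos in (lead, lead + k):
                stable = frozenset(pos + j for j, c in enumerate(motif) if c != ".")
                occs[stable] = (motif, pos)
        # an occurrence is redundant iff its stable set is a proper subset of
        # the stable set of some other occurrence in M (Theorem 8)
        return [occ for st, occ in occs.items()
                if not any(st < other for other in occs)]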

On the Use of Irredundant Motifs and Autocorrelations for Compression

As we have seen, computing the autocorrelations of a sequence is straightforward, while filtering those that are redundant is much more difficult. A natural question concerns the nature of the redundant autocorrelations. In Fig. 5.7 we computed all autocorrelations of the prefix of size 10,000 of vaccg (our conclusions hold for the other sequences we analysed) and plotted their lengths against their percentage of don't cares (the opposite of density: a value of 0 indicates a fully dense motif). Redundant autocorrelations are very few and are mostly found among the short ones. On the other hand, irredundant motifs are not very dense: 96% are composed of more than 70% don't care symbols. For a lossless compressor, this means that special care should be taken when encoding the symbols that disambiguate each don't care. Moreover, most of the autocorrelations occur only twice, and these two occurrences overlap (note that in Fig. 5.7 (a), all autocorrelations of size bigger than 5,000 trivially overlap). In an LZ77-like parse this could be a minor problem, but for use in a context-free grammar it poses a major hurdle.
Irredundant motifs were used by Apostolico et al. [19] in an LZ78-like parse of the original sequence. The dictionary is populated with the motifs chosen by an inexact extension of Greedy, which implies that once the LZ78-like parse selects a motif, all its occurrences have to be replaced. Two kinds of results are reported: lossy compression (the don't cares are not disambiguated) and lossless compression (an extra stream is added at the end).


Figure 5.7: Redundant autocorrelations are depicted in red and with a radius three times bigger. (b) is a zoom of (a).

The Greedy preprocessing is skipped in [18], where an algorithm is presented that can be considered an extension of Ziv-Lempel-Welch (ZLW) to the case of inexact repeats. For the lossless variant, a set of resolvers that disambiguate the don't cares is added. The total number of phrases that belong to the dictionary at the end is reported, and it is significantly smaller than with traditional ZLW. This algorithm is then improved into a linear variant in Apostolico [12].

5.4.2 Straight-line Grammars with Don't Cares

In this section we consider how to extend the notion of straight-line context-free grammars to include rigid patterns. In particular, we would like to keep the uniqueness of the generated sequence, but the presence of "don't care" symbols forces us to add information specifying by which symbol each don't care will be replaced. We therefore define a straight-line grammar with don't cares as a tuple ⟨G, E⟩ where G is a straight-line grammar such that • ∈ Σ and E ∈ N, but E does not appear in the derivation of G. We suppose a total order on the non-terminals, such that S is the maximum element and E the minimum. The right-hand side of E will contain the replacements of the don't care symbols. The derivation proceeds as follows: first, the minimal non-terminal N₁ different from E is replaced everywhere with its corresponding right-hand side α₁. Suppose there are f occurrences of N₁ and that α₁ contains d don't care symbols. The first f × d symbols of the right-hand side of E are then used to replace these don't care symbols, before continuing with the replacement of N₂.
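As an illustration of this two-stream derivation (not the implementation used in our experiments), the following C++ toy decoder assumes one particular reading of the order in which E is consumed: non-terminals are expanded from the minimal one upward, occurrences are processed from left to right, and every don't care introduced by an expansion reads the next unread symbol of E's right-hand side. The single-character non-terminals and the use of '.' for the don't care symbol are simplifications of the sketch.

// Toy decoder for a straight-line grammar with don't cares. Non-terminals
// are single upper-case letters ordered alphabetically, 'S' is the axiom,
// 'E' holds the disambiguating symbols and '.' is the don't care.
#include <cstddef>
#include <iostream>
#include <map>
#include <string>

std::string decode(std::map<char, std::string> rules) {
    const std::string& e = rules.at('E');
    std::size_t next = 0;  // next unread symbol of E's right-hand side

    // std::map iterates its keys in increasing order, i.e. from the minimal
    // non-terminal upward; 'E' is never expanded, 'S' only receives expansions.
    for (const auto& [nt, rhs] : rules) {
        if (nt == 'E' || nt == 'S') continue;
        for (auto& [parent, body] : rules) {
            if (parent == nt || parent == 'E') continue;
            std::string out;
            for (char c : body) {
                if (c != nt) { out += c; continue; }
                for (char r : rhs)                      // expand one occurrence of nt
                    out += (r == '.') ? e[next++] : r;  // resolve its don't cares from E
            }
            body = out;
        }
    }
    return rules.at('S');
}

int main() {
    // S -> A g A t,  A -> a.c (one don't care),  E -> gt:
    // the first occurrence of A reads 'g' from E, the second reads 't'.
    std::map<char, std::string> g = {{'A', "a.c"}, {'E', "gt"}, {'S', "AgAt"}};
    std::cout << decode(g) << "\n";  // prints agcgatct
}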

An algorithm that generates a straight-line grammar with don't cares has to face an additional challenge because of the exponential number of rigid patterns. The use of the existing maximal classes, which in the case of exact repeats was of great help in obtaining feasible algorithms, does not overcome this difficulty: there may still be an exponential number of maximal motifs. While irredundant motifs were used successfully in lossy compression schemes, their (few) occurrences are prone to overlap and they consist mainly of don't cares. In a lossless context-free grammar-based compressor, where each don't care has to be specified for each occurrence, this may easily become too costly. In order to limit the number of motifs to consider, we restrict ourselves here to those that contain a good exact repeat: we first select a good exact maximal repeat, and then try to extend it to a maximal non-overlapping motif. E. Ukkonen [234] defines the function M(w) that takes any motif w and computes the unique maximal motif of the occurrence-equivalence class of w. Instead of taking the non-overlapping occurrence list of the motif M(w), we extend w only until it overlaps, and take the occurrence list inherited from w:

Definition 27 (Motif extension). Given a motif w over sequence s, ext(w) = ⟨m, o⟩, where w appears exactly in m, o ⊆ pos(m) is non-overlapping, o = Ls(w) + d, and m has maximal length. w will be called the seed of m.

Inspired by the good performance of IRR-S, we present here a similar algorithm based on rigid patterns. We name this algorithm IMR (for Iterative Motif Replacement) and give pseudocode for it in Algorithm 12. G⟨m,o⟩↦N is defined as the replacement of the motif m at its occurrence list o (o ⊆ pos(m)) by a new symbol N; in addition, for each occurrence taken from left to right, each symbol that is masked by a don't care is appended (from left to right) to the right-hand side of rule E. From all exact maximal repeats, we select the repeat ω that reduces Ĥ(Gω↦N) the most. In order not to conflict with the derivation, we enforce that no repeat may contain a don't care symbol. We then extend ω to m using the ext function. However, m may contain a high number of don't cares. We therefore compute a submotif of m that contains ω: we first find its best extension to the left (line 5) and then to the right (line 6).

Algorithm 12 Iterative Motif Replacement (IMR).
Input: s is a sequence
Output: G, a straight-line grammar with don't cares
1: G ← ⟨Σ(s), {S}, {S → s}, S⟩
2: while ∃ω : ω ← argmin_{α∈MR(P)} Ĥ(Gα↦N) ∧ Ĥ(Gα↦N) < Ĥ(G) do
3:   m ← ext(ω)
4:   j ← end position of ω in m
5:   iℓ ← argmin_i Ĥ(G⟨m[i:j], o+i⟩↦N)
6:   ir ← argmin_i Ĥ(G⟨m[iℓ:j+i], o+iℓ⟩↦N)
7:   G ← G⟨m[iℓ:ir], o+iℓ⟩↦N
8: end while
9: return G
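As a complement to the pseudocode, the following C++ sketch illustrates the elementary rewriting step G⟨m,o⟩↦N on a single right-hand side, under the stated assumptions that the occurrence list is sorted and non-overlapping, '.' denotes the don't care and non-terminals are single characters. It returns both the rewritten right-hand side and the symbols to append to rule E.

// Sketch of the replacement G_{<m,o> -> N}: every listed occurrence of the
// rigid pattern m is replaced by the new non-terminal N, and the symbols
// hidden behind its don't cares are appended, left to right, to rule E.
#include <iostream>
#include <string>
#include <vector>

struct Replacement {
    std::string rewritten;  // right-hand side with the occurrences replaced by N
    std::string eSuffix;    // symbols to append to E's right-hand side
};

Replacement replaceMotif(const std::string& rhs, const std::string& m,
                         const std::vector<size_t>& occs, char N) {
    Replacement r;
    size_t prev = 0;
    for (size_t o : occs) {                    // occurrences, left to right
        r.rewritten += rhs.substr(prev, o - prev);
        r.rewritten += N;
        for (size_t l = 0; l < m.size(); ++l)  // collect the masked symbols
            if (m[l] == '.') r.eSuffix += rhs[o + l];
        prev = o + m.size();
    }
    r.rewritten += rhs.substr(prev);
    return r;
}

int main() {
    Replacement r = replaceMotif("agcgatct", "a.c", {0, 4}, 'A');
    std::cout << r.rewritten << "   E += " << r.eSuffix << "\n";  // AgAt   E += gt
}

Note that this step, run on the toy grammar of the decoder sketch above, reverses it: replacing the motif a.c at positions 0 and 4 of agcgatct yields the right-hand side AgAt and appends gt to E.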

Motifs are also searched for in the right-hand side of E but, thanks to how don't cares are replaced in the unique derivation, this does not pose any problem for the decoder. The results of IMR and IMRc (the latter also searching the complementary strand) are given in Table 5.7. We achieved our goal of reducing the compressed size of those sequences that produced a lot of rules with IRR-S. On the other sequences, the final compression was pretty much the same (note that the exception rule is empty or very short in these cases). Somewhat surprising is the better performance of IMR compared to IMRc on humghcs, the most repetitive sequence. We think this may be due to the greedy nature of the parsing choice, and that the inclusion of our Minimal Grammar Parsing problem could overcome this difference. Finally, the compression achieved with the straightforward linear representation and AAC-2 reaches ratios similar to those of the best DNA compressors.

                       IMR                        IMRc
sequence     bpb       |N|    |αE|      bpb       |N|    |αE|
chmpxx       1.8427     16       0      1.6793     25       0
chntxx       1.9379     14       2      1.6196     19       2
hehcmv       1.9649     30       0      1.8542     29       8
humdyst      1.9303      4       0      1.9331      5       0
humghcs      1.1486    248   4,422      1.1820    252   3,635
humhbb       1.8457     37     558      1.8313     44     730
humhdab      1.8763     85   1,435      1.8814     97     397
humprtb      1.8969     72     279      1.8839     77     410
mpomtcg      1.9384     99     289      1.9157    119     443
mtpacga      1.8587     32      74      1.8571     40      76
vaccg        1.8660     27       4      1.7743     18       2

Table 5.7: Compression results of IMR and IMRc over the DNA corpus: bits per base (bpb), number of non-terminals (including S and E), and size of the right-hand side of E.


Chapter 6

Conclusions

Afinal, futebol é bola na rede, o resto é conversa. (After all, football is the ball in the net; the rest is just talk.)
BBC Brazil, June 29th, 2009

The real thing in life is what you are doing, why you are searching, and not actually getting the answer. It is like golf: some people think the purpose of golf is getting the ball into the hole, but really the purpose of golf is to have an excuse to be outside. It would be a shame if we actually get to the end.
Donald Knuth, March 16th, 2009

6.1 Summary

Motivated by the problem of deciphering the structure of DNA sequences, we studied the Smallest Grammar Problem. This problem is easy to define and finds numerous uses in a wide range of fields. We identified three broad groups of applications, namely Structure Discovery, Kolmogorov Complexity and Data Compression. Our approach to the general Smallest Grammar Problem was to break it down into two complementary optimisation problems: the choice of constituents, and the choice of which occurrences of these constituents to use in a minimal grammar parsing. This decomposition allowed us to present a new formalisation of the Smallest Grammar Problem in the form of a complete and correct search space. With respect to the choice of constituents, we analysed the consequences of considering different classes of repeats. Because of the NP-hardness of this problem, we were particularly interested in efficiency issues. Using maximal repeats and overlapping occurrences we reduced the computational complexity of the generic off-line framework IRR from cubic to quadratic. We furthermore accelerated this framework by providing an in-place update for the enhanced suffix array used to compute the repeats in each iteration. Regarding the Minimal Grammar Parsing problem, we solved it optimally with a polynomial-time algorithm. This enabled us to present different algorithms that outperform the previous best algorithm we had identified, by about 10% in the final grammar size.

Due to the lack of a standard "gold" structure of (non-coding) DNA sequences, we evaluated the quality of the resulting grammars with several approaches. Adopting a Kolmogorov Complexity perspective, we evaluated our algorithms using standard experiments, showing that they yield consistent results. From the Structure Discovery viewpoint, we analysed if, and how, the Smallest Grammar Problem may be useful. In a first part we lower-bounded the worst-case number of smallest grammars, presenting diverse examples that exhibit an exponential behaviour. We then analysed and compared minimal grammars on some real-life sequences. We conclude that the huge number of minimal grammars seems to originate from the presence of small constituents. The largest constituents and the highest levels of the parse tree remain very stable across minimal grammars.
We put special emphasis on the application of this problem in Data Compression. For this, we studied each step of a grammar-based encoder. In particular, we presented an inference algorithm, along the lines of the general IRR family, that outperforms any other grammar-based DNA compressor. Inspired by the presence of similar repeats in DNA, we then included rigid patterns. These are repeats that may contain a "don't care" symbol matching any other symbol. The choice of this special kind of repeat is motivated by the MDL principle: it allows a very cheap encoding of the mutations or exceptions, while with the edit distance, for instance, we would have to specify the exact position of the changes. Carefully encoding the exceptions allows a lossless recovery of the sequence represented by the grammar. We then implemented an algorithm that, in each iteration, approximates the maximal motif that compresses the resulting grammar the most. Our experiments on the standard corpus yield results that are, on average, only less than 5% worse than DNALight [76], the current state of the art. Moreover, considering these inexact repeats allows us to obtain a richer structure over the sequence. On the one hand, they make it possible to capture constituents that are not completely identical and to specify where the differences (the don't cares) lie. On the other hand, the use of rigid patterns allows us to produce a richer parse tree compared to the case of exact repeats, where the height of the resulting parse is limited by the size of the longest repeat.

6.2 Perspectives

With respect to our choice of focusing on the smallest grammar, our desire for general applicability leaves few choices other than Occam's Razor. During the main part of this thesis, we therefore considered a grammar of minimal size, where size was defined as the number of symbols necessary to represent the grammar. Such a definition does not take into account the size of the alphabet used. Two different grammars with the same size are considered equivalent, regardless of the number of different symbols each one uses. It would be worth considering an MDL-inspired definition of size that also takes the growth of the alphabet into account. The good compression performance of our IRR-S algorithm compared to IRR-MC underlines this point. Our main motivation to strive for simplicity (and therefore for smallest grammars) was our explicit requirement of not introducing any other structural knowledge as learning bias. The work presented here is just a first step, and an ad-hoc application could take advantage of domain knowledge to escape

from the limitations of using size alone as an objective. In the same direction, more information about what the searched structure looks like could be used to define IRR algorithms with more sophisticated score functions. So far, the IRR algorithms presented here consider only the length and the number of occurrences of substrings. Other interesting indicators could be the mutual information of a single word [64] and the scores used for automatic keyword detection [113]. Another way of enhancing the final structure is to allow finite recursion, using a recursion counter for instance, but an efficient algorithm for this deserves more research.

We have shown that grammar-based codes have the capacity to compete with other DNA compression algorithms. In particular, the IMR algorithm yields compression rates close to the current best compressors. A direct extension we envisage is to extend the Minimal Grammar Parsing to the case of rigid motifs. A particular characteristic of IMR is that it allows the appearance of unit rules (of the form A → B), which are too costly for compression purposes. Combining the MGP solution with a clean-up that removes too costly rules could increase the final compression capacity. It would require some more work to use such an algorithm for efficient compression purposes on large databases. But the main goal of such an algorithm is not necessarily to be a competitive compressor, but to extract (hierarchical) redundancy. It should be noted that almost none of the current DNA compressors scale up very well. Kuruppu et al. [137], for instance, report that it takes 93 hours to compress the human genome with XM [42], a rather fast algorithm. A more important characteristic of IMR is that it yields a richer structure. In particular, it is a first step towards introducing errors and gaps, which is fundamental to correctly analyse DNA sequences. But the most promising direction we see is to loosen the constraint of the uniqueness of the generated sequence. Our straight-line grammars with don't cares are already a step in this direction. If the rule E, which contains the symbols disambiguating the don't cares, is suppressed, then the final grammar could generate more than one sequence. Such a two-step encoding (the model plus the exceptions) lies midway between pure straight-line grammars and parse tree compression, two compression frameworks that use formal grammars. In parse tree compressors, both the encoder and the decoder work with the same grammar, and the encoding of the sequence consists of the indices of the successive production rules to apply. This approach is used when a known grammar is available, such as for programming languages [41] or XML documents [111]. Besides generalising the final parse tree, we mentioned a second possible approach to complete an inference process through a learning algorithm that uses the parse tree to provide additional structural information. Another future direction is thus an adaptation of Sakakibara's learning algorithm [202].

With respect to the extension toward a framework consisting of a general model plus the exceptions, Charikar et al. [50] analyse two similar extensions from a theoretical viewpoint. These are advice grammars (where a non-straight-line grammar is used, together with an advice string specifying which productions to use during the derivation) and edit grammars (where the productions are of the form A → α[e], with e a single edit operation). They prove that the size of a smallest such model is equivalent within a constant (logarithmic in the

case of edit grammars) factor to the size of a smallest straight-line grammar. Nevertheless, the conclusions of this thesis and the improvement of IMRc with respect to IRRc-S allow us to presume that such an approach could make a difference in practical applications. Similarly, Calude et al. [40] use such advice strings in the case of regular automata. They propose to approximate Kolmogorov Complexity with finite transducers. As the class of finite transducers generates exactly the regular languages, this can be considered as going yet another step down the Chomsky hierarchy. In their definition of complexity they use both the size of the topology of the transducer and an input string that determines the final string. They also implement an algorithm that computes the finite-state complexity of a string. As expected, this algorithm is unsuitable for application to larger sequences. We collaborated with Tania Roblot to approximate this complexity by first computing a straight-line grammar. Having defined a transformation from a straight-line context-free grammar to a transducer, we implemented an algorithm similar to IRRMGP∗ that optimises the expected size of the final transducer. See Coste and Roblot [73] for details. Some of the most successful applications of formal grammars in bioinformatics are based on Stochastic Context-Free Grammars [78, 197, 203]. In a straight-line grammar, the use of probabilities makes no sense, but if the grammar is to be generalised, the use of probabilities could resolve ambiguous parses.

With respect to a further validation of the discovered structure on DNA sequences, we are currently exploring two directions. Michel Termier, Alain Denise and Yann Ponty helped us to design an artificial grammar for a chromosome. Such a definition is hard to achieve because of the varying and unclear definitions of gene and alternative splicing, for example. The literature is very sparse on such grammars with non-trivial height, and we report this grammar in Fig. 6.2. This grammar is still a very high-level grammar: the terminals are not nucleotides, but represent known modules. A second direction aims at supporting structure discovery on sequences with an unknown structure through a visualisation tool. In particular, the Pygram [79] tool was developed to visualise the repeats in the form of pyramids over a DNA sequence. Tweaking the input in order to consider only the repeats and occurrences used by a straight-line grammar makes it possible to visualise the parse trees as in Fig. 6.1. The tool also provides an interactive mode to navigate over the sequence.
We now come back to our original motivation, namely learning a meaningful structure of one DNA sequence. With respect to modelling DNA with formal grammars, there seems to be a certain consensus in regarding DNA as a formal language in the most general interpretation. It clearly contains a message and is generated by a yet unknown machinery. It has long been acknowledged that DNA provides examples of non-context-free structures [28, 219]. However, the same is true for natural language¹, but this did not limit the use of context-free grammars in the first years. Several definitions of Joshi's idea of mildly context-sensitive languages [120], and of other related concepts, have been given. Joshi himself formalises Tree Adjoining Grammars².

¹ Chomsky [59] already suggested that "such grammars are too limited to give a true picture of linguistic structure".
² With a dedicated series of conferences that in 2010 had its tenth edition.


Figure 6.1: Some snapshots of the Pygram tool visualising the parse tree of humghcs obtained with IRRMGPc∗: (a) the whole sequence, (b) a zoom, (c) the interactive viewer.

Terminals: θ Telomere; γ Centromere; i Intergenic region (IGR); α Origin of Replication (ARS); ν Non-coding RNA; τ Transposon; r Regulatory region; y Spacer; u Non-coding but transcribed region; e Exon; d Donor (transferase process); ℓ Lariat; a Acceptor (transferase process).

Non-terminals: Chr Chromosome; s Sequence of genes; x Gene; g Coding region + regulators; c Coding region for a protein; o, o′ Intron; œ, œ′ Intron/Exon (alternative splicing).

Production rules:
Chr → θ s γ s θ
s → i x s | i
x → α | ν | τ | g
g → r y g | u c u
c → e o c | e œ c | e
o → d o′ ℓ o′ a
œ → d œ′ ℓ œ′ a

Figure 6.2: A context-free grammar for a chromosome.

Examples of other formalisms are linear-indexed grammars [97], well-nested multiple context-free grammars (see Kanazawa and Salvati [122] for an overview), re-writing rules [86], dependency models [131], Tree Substitution Grammars (the DOP model, see [32] for an introduction) and Binary Feature Grammars [66, 67]. It seems worthwhile to consider the inference of one derivation, instead of a generic model, in the case of these richer formalisms. Chiang et al. [57] in particular advocate the use of Tree Adjoining Grammars to model RNA. However, so far there is no consensus about which formalism is the correct one. Furthermore, most of these richer models come with an extra cost in the learning process. Given our current state of knowledge of DNA and the difficulty of defining a formal but still accurate model, the use of a context-free grammar (which does capture a lot of typical DNA structures) seems to be a good compromise in a first phase. Moreover, we underline that in this thesis we did not focus on the generative power of these grammars, but rather on the structure they give over the sequence. A context-free grammar excels in this sense thanks to the easy interpretation of its structure in the form of a parse tree.


Appendix A

Corpora

Throughout this thesis we validate the practical use of, and compare, our algorithms on some classical benchmarks. The corpora we use for this are the following:

Canterbury The standard benchmark to evaluate and compare lossless compression methods. For the principles used to select the files belonging to this corpus, see Arnold and Bell [23]. Downloaded from http://corpus.canterbury.ac.nz. See Table A.1.

Large Most of the sequences of the Canterbury corpus are rather short. We use this corpus when we want to exemplify execution on larger sequences, or to measure how the time needed by an algorithm grows in practice. Downloaded from http://corpus.canterbury.ac.nz/descriptions/#large. See Table A.2.

DNA The standard corpus traditionally used for comparing the performance of DNA compressors. This corpus contains human genes, a chloroplast, some mitochondria and a virus genome. Every sequence contains only four different symbols. It dates from 1993 [106], being the corpus used for the first DNA-specific compressor. Of course, the growth in the number of available sequences, and in their length in particular, raises the question of having an up-to-date representative corpus. Downloaded from http://people.unipmn.it/manzini/dnacorpus/historical/. See Table A.3.

sequence       length      # repeats/length   |Σ|   description
alice29.txt    152,089     1.45               74    english text (Alice in Wonderland)
asyoulik.txt   125,179     1.22               68    english text (Shakespeare, As You Like It)
cp.html        24,603      4.32               86    HTML source (a list of links)
fields.c       11,150      5.03               90    C source code
grammar.lsp    3,721       3.43               76    LISP source code
kennedy.xls    1,029,744   0.08               256   Excel spreadsheet
lcet10.txt     426,754     2.00               84    english text (technical)
plrabn12.txt   481,861     1.02               81    english text (Milton, Paradise Lost)
ptt5           513,216     194.74             159   fax b/w image
sum            38,240      17.44              255   SPARC executable
xargs.1        4,227       1.77               74    GNU manual page

Table A.1: Description of the Canterbury corpus.

sequence       length      # repeats/length   |Σ|   description
bible.txt      4,047,392   2.57               63    King James version of the Bible (english)
e.coli         4,638,690   4.76               4     Complete genome of E. Coli
world192.txt   2,473,400   4.47               94    CIA World fact book 1992

Table A.2: Description of the Large Corpus.

sequence   length    # repeats/length   |Σ|   description
chmpxx     121,024   0.82               4     marchantia polymorpha (liverwort) chloroplast
chntxx     155,844   0.77               4     tobacco chloroplast
hehcmv     229,354   1.46               4     human cytomegalovirus (strain AD169)
humdyst    38,770    0.77               4     human dystrophin gene (chr X)
humghcs    66,495    13.77              4     human growth hormone and chorionic somatomammotropin genes (chr 17)
humhbb     73,308    9.01               4     human beta globin region (chr 11)
humhdab    58,864    1.21               4     human contig sequence comprising 3 cosmids (HDAB, HDAC, HDAD)
humprtb    56,737    1.07               4     human hypoxanthine phosphoribosyltransferase (chr X)
mpomtcg    186,609   1.36               4     mitochondria of marchantia polymorpha (liverwort)
mtpacga    100,314   0.97               4     mitochondria of podospora anserina (a filamentous fungus)
vaccg      191,737   2.21               4     vaccinia virus

Table A.3: Description of the DNA corpus.

List of Figures

2.1 Example of Arithmetic coding  17
2.2 Schematic process of an encoder that uses a straight-line grammar  22
2.3 A bibliographic overview of Sequitur  30

3.1 Growth of the number of normal and maximal repeats  41
3.2 Growth of normal, maximal, largest-maximal and super-maximal repeats  44
3.3 Run time for computing normal, maximal, largest-maximal and super-maximal repeats  48
3.4 Influence on the final size of a grammar using normal, maximal, largest-maximal and super-maximal repeats in IRR-MC  49
3.5 Run time of using normal, maximal, largest-maximal and super-maximal repeats in IRR-MC  50
3.6 Deletion of an index on ESADL  57
3.7 Moving groups on indices in an ESADL  58
3.8 Execution time of the in-place update of a suffix array  65
3.9 Memory usage of the in-place update of a suffix array  68

4.1 Example of GP-Graph  73
4.2 Execution time of IRR-MC and IRRCOOC-MC  79
4.3 Execution time of IRR-MC and IRRMGP  81

5.1 Two different trees with the same yield  88
5.2 Number of different parses per position by minimal grammars  91
5.3 Example of our extension of composition of functions  93
5.4 Performance of distance metric to cluster parse trees  96
5.5 The unrooted binary tree after applying the maketree command of the CompLearn toolkit on a distance matrix obtained with IRRMGP∗  100
5.6 An example of used-by graph  102
5.7 Redundant and irredundant autocorrelations  111

6.1 Some snapshots of the Pygram tool visualising the parse tree of humghcs obtained with IRRMGPc∗  119
6.2 A context-free grammar for a chromosome  120


List of Tables

2.1 Comparison of DNA compressors  21
2.2 Comparison of final grammar size on standard corpora  36
2.3 Comparison of straight-line grammar DNA compressors  37

3.1 A Taxonomy of Exact Repeats  45
3.2 Accelerated version of IRR-MC  53
3.3 Speedup factor for in-place solution for updating a suffix array  67

4.1 Final grammar size of our algorithms compared to the state of the art  78
4.2 Execution time and final grammar size of IRRMGP∗ compared to the accelerated version of IRR-MC  80
4.3 Results of IRRMGP∗ on model organisms  82

5.1 Sequence length, grammar size, number of constituents, and number of grammars for different sequences  87
5.2 Number of minimal grammars on a local minimum  89
5.3 Dicek values on a sample of minimal grammars  90
5.4 Biological Classification on data-set used by Ferragina et al. [89] with IRRMGP∗ to compute the distances. The reported measures are F-Measure (for CK-36 and SP-86) and partition distance for AA-15  99
5.5 Compression results of IRR-S compared to zip and PPMd  104
5.6 Compression results of IRR-S and IRRc-S on the DNA corpus  105
5.7 Compression results of IMR and IMRc over DNA corpus  114

A.1 Description of the Canterbury corpus  122
A.2 Description of the Large Corpus  123
A.3 Description of the DNA corpus  124


List of Algorithms

1 Iterative Repeat Replacement (IRR)  34
2 Calculation of largest-maximal repeats with an enhanced suffix array  47
3 In-place update of a suffix array: delete indices  57
4 In-place update of a suffix array: update order (restore consistency of the suffix array order)  59
5 In-place update of a suffix array: move a block of indices  60
6 In-place update of a suffix array: recalculation of lcp  62
7 In-place update of a suffix array: update of lcp array  62
8 Iterative Repeat Choice with Occurrences Optimisation (IRRCOO)  72
9 Iterative Repeat Replacement with Occurrence Optimisation and Cleanup (IRRCOOC)  74
10 Zig-Zag algorithm  76
11 IRR plus MGP (IRRMGP∗)  80
12 Iterative Motif Replacement (IMR)  113

Index

R, 12
R̂, 12
adaptive arithmetic coder, 17, 37, 101, 103
  n-AAC, 18
Arithmetic Coding, 17
Bisection, 31
bracketing, 88
canonical sequential representation, 14
constituent, 14, 22, 70, 72
costly rule, 74
DNASequitur, 20, 30, 34, 35
Double-Linked Enhanced Suffix Array, 54
  left-context tree, 55
empirical entropy, 103
Enhanced Suffix Array, 42
fixed size dictionary, 16
Grammar Parsing, 71
Grammar Parsing graph, 72
Grammar-Based Codes, 20–23
grammars
  P, 12
  axiom, 12
  Chomsky Normal Form, 13
  context-free grammars, 13
  context-sensitive grammar, 13
  language of a grammar, 12
  language of a non-terminal, 12
  left-hand side, 12
  non-terminals, 12
  production, 12
  regular grammars, 13
  right-hand side, 12
  rule, 12
  starting symbol, 12
  symbol, 12
  terminals, 12
  unrestricted grammar, 13
Grammatical Inference, 6
  Theorem of Sakakibara, 7, 117
Greedy, 31, 34, 35, 39, 69
irreducible grammars, 20
Kolmogorov Complexity, 25, 32, 97
largest-maximal repeat, 46–48
largest-maximal repeats, 43
lattice, 74, 76
  ancestors, 75
  descendants, 75
  global minimum, 75
  local minimum, 75
LongestFirst, 22, 31, 34, 35, 39
LZ77, 28, 33
LZ78, 29
maximal repeats, 4, 42, 106
MDLCompress, 28, 32, 34, 35, 69
mgp, 72
minimal grammar, 71
Minimal Grammar Parsing, 71
Minimum Description Length, 8
motif, 106
  autocorrelation, 107
  covers, 108
  implies, 107
  occurs, 107
  redundant, 107
  seed, 112
Normalised Compression Distance, 27, 85
Occam's Razor, 8


RePair, 25, 28, 31, 32, 34, 39
repeat, 12
RNACompress, 23

Sequential, 30
Sequitur, 20, 28, 29
smallest grammar, 14
Smallest Grammar Problem, 14
stable symbol, 106
straight-line grammar with don't cares, 112
straight-line grammars, 13
string
  Ls(w), 12
  poss(w), 12
  s[..j], 12
  s[i, j], 12
  s[i..], 12
  empty string, 12
  normalised non-overlapping occurrence list, 12
  occurrences, 12
  overlap, 12
  prefix, 12
  substring, 12
  suffix, 12
String Statistics Problem, 52
super-maximal repeats, 43
yield of tree, 88


Bibliography

[1] S Abney, S Flickenger, C Gdaniec, C Grishman, P Harrison, D Hindle, R Ingria, F Jelinek, J Klavans, M Liberman, M Marcus, S Roukos, B San- torini, and T Strzalkowski. Procedure for quantitatively comparing the syntactic coverage of english grammars. In E. Black, editor, Workshop on Speech and Natural Language, pages 306–311, Morristown, NJ, USA, 1991. Association for Computational Linguistics. 5.1.2

[2] Mohamed Ibrahim Abouelhoda, Stefan Kurtz, and Enno Ohlebusch. Re- placing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms, 2:53–86, February 2004. 3.1, 3.2.2, 3.2.2, 3.4.2

[3] Don Adjeroh, Yong Zhang, Amar Mukherjee, Matt Powell, and Timothy Bell. DNA sequence compression using the Burrows-Wheeler Transform. In IEEE Computer Society Bioinformatics Conference, volume 1, pages 303–13, January 2002. 2.3.3

[4] Pieter Adriaans. Learning as data compression. In Computability in Europe, pages 11–24, Berlin, Heidelberg, 2007. Springer-Verlag. 1.3

[5] Pieter W Adriaans, Marco Vervoort, and P Muidergracht. The EMILE 4.1 grammar induction toolbox. In International Colloquium on Grammatical Inference, January 2002. 1.2.1

[6] Tatsuya Akutsu, Daiji Fukagawa, and Atsuhiro Takasu. Approximating tree edit distance through string edit distance. Algorithmica, 57(2):325– 348, January 2010. 5.1.3

[7] Benjamin Allen and Mike Steel. Subtree transfer operations and their induced metrics on evolutionary trees. Annals of combinatorics, 5(1):1– 15, 2001. 5.1.3

[8] Lloyd Allison, T Edgoose, and Trevor I Dix. Compression of strings with approximate repeats. In International Conference on Intelligent Systems for Molecular Biology, May 1998. 2.3.3

[9] Miguel Ángel Jiménez-Montaño. On the syntactic structure of protein se- quences and the concept of grammar complexity. Bulletin of Mathematical Biology, 46(4):641–659, 1984. 2.4.1, 2.6.7

[10] Natalie Angier. Biologists seek the words in DNA’s unbroken text. New York Times, July, 9th 1991. 1.1.1

[11] Dana Angluin. Learning regular sets from queries and counterexamples. Information and Computation, 75(2):87–106, 1987. 1.2, 1.2.2

[12] Alberto Apostolico. Fast gapped variants for Lempel-Ziv-Welch compres- sion. Information and Computation, 205(7):1012–1026, January 2007. 5.4.1

[13] Alberto Apostolico and Stefano Lonardi. Off-line compression by greedy textual substitution. Proceedings of the IEEE, 88:1733–1744, January 2000. 2.6.9, 3.4.1, 4, 5.3.2

[14] Alberto Apostolico and Stefano Lonardi. Compression of biological se- quences by greedy off-line textual substitution. In Data Compression Conference, pages 143–153, 2000. 2.6.9, 1, 2.7.3, 4

[15] Alberto Apostolico and Laxmi Parida. Incremental paradigms of motif discovery. Journal Computational Biology, 11(1):15–25, January 2004. 5.4.1

[16] Alberto Apostolico and Franco P Preparata. Data structures and al- gorithms for the string statistics problem. Algorithmica, 15(5):481–494, January 1996. 3.3

[17] Alberto Apostolico and Claudia Tagliacollo. Optimal offline extraction of irredundant motif bases. In International Computing and Combinatorics Conference, pages 360–371, 2007. 5.4.1

[18] Alberto Apostolico, Matteo Comin, and Laxmi Parida. Motifs in Ziv- Lempel-Welch clef. In Data Compression Conference, pages 1–10, March 2004. 5.4.1

[19] Alberto Apostolico, Matteo Comin, and Laxmi Parida. Bridging lossy and lossless compression by motif pattern discovery. Electronic Notes in Discrete Mathematics, 21:219–225, 2005. 5.4.1

[20] Alberto Apostolico, Matteo Comin, and Laxmi Parida. VARUN: Dis- covering extensible motifs under saturation constraints. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 99, 2008. 5.4.1

[21] Patrick Argos. The language of protein folding: Many forked tongues. Computers and Chemistry, 16(2):93–102, April 1992. 1.1.2

[22] Hiroki Arimura and Takeaki Uno. An efficient polynomial space and poly- nomial delay algorithm for enumeration of maximal motifs in a sequence. Journal of Combinatorial Optimization, 13(3):243–262, 04 2007. 9

[23] Ross Arnold and Tim Bell. A corpus for the evaluation of lossless com- pression algorithms. In Data Compression Conference, pages 201–211, Washington, DC, USA, 1997. IEEE Computer Society. A

[24] Ricardo A Baeza-Yates. A fast set intersection algorithm for sorted se- quences. In Combinatorial Pattern Matching, pages 400–408, 2004. 5.4.1


[25] Behshad Behzadi and Fabrice Le Fessant. DNA compression challenge revisited: A dynamic programming approach. In Combinatorial Pattern Matching, 2005. 2.3.3

[26] Timothy Bell, John Cleary, and Ian H Witten. Text Compression. Prentice Hall, 1990. 2.3.1, 2.6.7

[27] Jon Bentley and Douglas McIlroy. Data compression using long common strings. In Data Compression Conference, pages 287–295, March 1999. 2.6.8

[28] Robert C Berwick. The language of genes. In Julio Collado-Vides, Boris Magasanik, and Temple Smith, editors, Integrative approaches to molecular biology. MIT Press, 1996. 6.2

[29] Philip Bille. A survey on tree edit distance and related problems. Theoretical Computer Science, 337(1-3):217–239, January 2005. 5.1.3

[30] Philip Bille, Gad M Landau, Rajeev Raman, Kunihiko Sadakane, Srinivasa Rao Satti, and Oren Weimann. Random access to grammar compressed strings. In ACM-SIAM Symposium on Discrete Algorithms, January 2011. 2.3.7

[31] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Occam's razor. Information Processing Letters, 24:377–380, 1987. 1.3

[32] Rens Bod. The data-oriented parsing approach: Theory and application. In J. Fulcher and L. Jain, editors, The Data-Oriented Parsing Approach: Theory and Application, pages 307–342. Springer, November 2008. 6.2

[33] Sebastian Bonhoeffer, Andras V M Herz, Maarten C Boerlijst, S Nee, Martin A Nowak, and Robert M May. No signs of hidden language in noncoding DNA. Physical Review Letters, 76(11):1977–1977, January 1996. 1.1.2

[34] Sebastian Bonhoeffer, Andreas V M Herz, Maarten C Boerlijst, Sean Nee, Martin A Nowak, and Robert M May. Explaining "linguistic features" of noncoding DNA. Science, 271(5245):14–15, 1996. 1.1.2

[35] A Bookstein and ST Klein. Compression, information theory, and grammars: a unified approach. ACM Transactions on Information Systems, 8(1):27–49, 1990. 5.3.2

[36] Matías Bordese. Análisis y alternativas para la compresión de XML. Master's thesis, FaMAF, Universidad Nacional de Córdoba, Argentina, July 2009. 2.3.6

[37] V Brendel and H G Busse. Genome structure described by formal languages. Nucleic Acids Research, 12(5):2561–2568, 1984. 1.1.3

[38] Gerth Stølting Brodal, Rune Lyngsø, Anna Östlin, and Christian N S Pedersen. Solving the string statistics problem in time O(n log n). In International Colloquium on Automata, Languages, and Programming, pages 728–739, April 2002. 3.3

[39] Frederick Brooks, Jr. Three great challenges for half-century-old computer science. Journal of the ACM, 50(1), January 2003. 1, 2.5

[40] Cristian Calude, Kai Salomaa, and Tania Roblot. Finite-state complexity and the size of transducers. In International Workshop on Descriptional Complexity of Formal Systems, volume 31, pages 38–47, August 2010. 6.2

[41] Robert Cameron. Source encoding using syntactic information source models. IEEE Transactions on Information Theory, 34(4):843–850, July 1988. 6.2

[42] Minh Duc Cao, Trevor I Dix, Lloyd Allison, and Chris Mears. A sim- ple statistical algorithm for biological sequence compression. In Data Compression Conference, 2007. 2.3.3, 6.2

[43] Rafael Carrascosa. Gramáticas Mínimas y descubrimiento de patrones. Master’s thesis, Universidad Nacional de Córdoba, February 2010. 4.2.2

[44] Rafael Carrascosa, François Coste, Matthias Gallé, and Gabriel Infante- Lopez. Choosing word occurrences for the smallest grammar problem. In Language and Automata Theory and Applications, February 2010. 4.3

[45] Rafael Carrascosa, François Coste, Matthias Gallé, and Gabriel Infante- Lopez. The smallest grammar problem as constituents choice and minimal grammar parsing. submitted http://www.irisa.fr/symbiose/images/ stories/mgalle/papers/sgp_ccmpg.pdf, 2011. 4, 5.1, 5.1.2

[46] Rafael Carrascosa, François Coste, Matthias Gallé, and Gabriel Infante- Lopez. Searching for smallest grammars on large sequences and applica- tion to DNA. Journal of Discrete Algorithms, 2011. 3, 4

[47] Daniele Cerra and Mihai Datcu. A similarity measure using smallest context-free grammars. In Data Compression Conference, pages 346–355, 2010. 2.4.2

[48] Ho-Leung Chan, Wing-Kai Hon, Tak-Wah Lam, and Kunihiko Sadakane. Compressed indexes for dynamic text collections. ACM Transactions on Algorithms, 3(2), May 2007. 3.4.1

[49] CH Chang. DNAC: A compression algorithm for DNA sequences by nonoverlapping approximate repeats. Master’s thesis, National Taiwan University, Taiwan, 2004. 2.3.3

[50] Moses Charikar, Eric Lehman, Ding Liu, Rina Panigrahy, Manoj Prab- hakaran, April Rasala, Amit Sahai, and Abhi Shelat. Approximating the smallest grammar: Kolmogorov complexity in natural models. In Annual ACM Symposium on Theory of Computing, January 2002. 2.2, 2.4.1, 6.2

[51] Moses Charikar, Eric Lehman, Ding Liu, Rina Panigrahy, Manoj Prabhakaran, April Rasala, Amit Sahai, and Abhi Shelat. The smallest grammar problem. IEEE Transactions on Information Theory, 51(7):2554–2576, July 2005. 1, 2.2, 2.3.4, 2.6.1, 2.6.5, 2.6.11, 2.7, 3.2.3, 4, 1, 5.3


[52] CA Chatzidimitriou-Dreismann, RMF Streffer, and D Larhammar. Lack of biological significance in the ’linguistic features’ of noncoding DNA – a quantitative analysis. Nucleic Acids Research, 24(9):1676–1681, January 1996. 1.1.2

[53] Gang Chen, Simon Puglisi, and William F Smyth. Lempel-Ziv factoriza- tion using less time & space. Mathematics in Computer Science, January 2008. 2.6.1

[54] Xin Chen, Sam Kwong, and Ming Li. A compression algorithm for DNA sequences. Engineering in Medicine and Biology Magazine, January 2001. 2.3.3

[55] Xin Chen, Ming Li, Bin Ma, and John Tromp. DNACompress: fast and effective DNA sequence compression. Bioinformatics, January 2002. 2.3.3, 5.4

[56] Neva Cherniavsky and Richard Lander. Grammar-based compression of DNA sequences. In DIMACS Working Group on The Burrows-Wheeler Transform, page 21, 2004. 2.3.4, 2.6.4, 3, 5.3

[57] David Chiang, Aravind Joshi, and David B Searls. Grammatical represen- tations of macromolecular structure. Journal of Computational Biology, 13(5):1077–1100, January 2006. 1.1.3, 6.2

[58] Sheng-Yuan Chiu, Wing-Kai Hon, Rahul Shah, and Jeffrey Vitter. I/O- efficient compressed text indexes: From theory to practice. In Data Compression Conference, pages 426–434, 2010. 2.3.7

[59] Noam Chomsky. Three models for the description of language. IEEE Transactions on Information Theory, 2(3):113–124, January 1956. 1, 1.2, 1

[60] Noam Chomsky. Syntactic Structures. Mouton and Co., 1957. 1.2

[61] Scott Christley, Yiming Lu, Chen Li, and Xiaohui Xie. Human genomes as email attachments. Bioinformatics, 25(2):274–275, January 2009. 2.3.3

[62] Rudi Cilibrasi. Statistical Inference Through Data Compression. PhD thesis, Institute for Logic, Language and Computation, University of Am- sterdam, December 2007. 2.4.2

[63] Rudi Cilibrasi and Paul Vitanyi. Clustering by compression. IEEE Transactions on Information Theory, 51(4):1523–1545, 2005. 2.4.2, 5.2, 5.2.2

[64] Alexander Clark. Learning deterministic context free grammars: The Omphalos competition. Machine Learning, pages 93–110, January 2007. 1.2.1, 6.2

[65] Alexander Clark. Three learnable models for the description of language. In Language and Automata Theory and Applications, pages 16–31, 2010. 1.1.3

[66] Alexander Clark, Rémi Eyraud, and Amaury Habrard. A polynomial algorithm for the inference of context free languages. In International Colloquium on Grammatical Inference, July 2008. 1.2.1, 6.2

[67] Alexander Clark, Rémi Eyraud, and Amaury Habrard. A note on con- textual binary feature grammars. In EACL Workshop on Computational Linguistic Aspects of Grammatical Inference, February 2009. 6.2

[68] Francisco Claude and Gonzalo Navarro. Self-indexed grammar-based com- pression. Fundamenta Informaticae, August 2010. 2.3.7

[69] Francisco Claude, Antonio Fariña, Miguel Martínez-Prieto, and Gonzalo Navarro. Compressed q-gram indexing for highly repetitive biological sequences. In International Conference on Bioinformatics and Bioengineering, 2010. 2.3.7

[70] Julio Collado-Vides. The search for a grammatical theory of gene reg- ulation is formally justified by showing the inadequacy of context-free grammars. Bioinformatics, 7(3):321, 1991. 1.1.3

[71] Julio Collado-Vides. Grammatical model of the regulation of gene expres- sion. Proceedings of the National Academy of Sciences, 89(20):9405–9409, January 1992. 1.1.3

[72] Matteo Comin and Davide Verzotto. Classification of protein sequences by means of irredundant patterns. BMC Bioinformatics, 11, 2010. 5.4.1

[73] François Coste and Tania K Roblot. Towards evaluating the finite-state complexity of DNA. Technical report, INRIA Rennes – Bretagne Atlan- tique, 2010. 6.2

[74] Francis Crick. Central dogma of molecular biology. Nature, 227(5258): 561–563, 1970. 1.1.1

[75] Colin de la Higuera. Grammatical Inference: Learning Automata and Grammars. Cambridge University Press, 2010. 1.2

[76] Sérgio Deusdado. Análise e compressão de sequências genómicas. PhD thesis, Universidade do Minho, Portugal, July 2008. 2.3.3, 6.1

[77] Pedro Domingos. The role of Occam’s Razor in knowledge discovery. Data Mining and Knowledge Discovery, January 1999. 1.3

[78] Robin D Dowell and Sean R Eddy. Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinformatics, page 14, June 2004. 6.2

[79] Patrick Durand, F Mahé, A Valin, and Francois Nicolas. Browsing repeats in genomes: Pygram and an application to non-coding region analysis. BMC Bioinformatics, January 2006. 6.2

[80] Werner Ebeling and Miguel A Jiménez-Montaño. On grammars, complex- ity, and information measures of biological macromolecules. Mathematical Biosciences, 52(1–2):53–71, November 1980. 2.2, 2.4.1


[81] Amr Elmasry. The subset partial order: Computing and combinatorics. In Workshop on Analytic Algorithmics and Combinatorics, 2010. 5.4.1

[82] Scott Charles Evans. Kolmogorov complexity estimation and application for information system security. PhD thesis, Rensselaer Polytechnic Institute, August 2003. 2.5, 2.6.10, 2.7.2

[83] Scott Charles Evans, Bruce Barnett, Stephen Bush, and Gary J Saulnier. Minimum description length principles for detection and classification of FTP exploits. In IEEE Military Communications Conference, volume 1, pages 473–479, January 2004. 2.6.10

[84] Scott Charles Evans, Antonis Kourtidis, T Stephen Markham, Jonathan Miller, Douglas S Conklin, and Andrew S Torres. MicroRNA target detection and analysis for genes related to breast cancer using MDLcompress. EURASIP Journal on Bioinformatics and Systems Biology, 2007. 2.5, 2.6.10, 4, 2.7.3, 4

[85] Rémi Eyraud. Inférence Grammaticale de Langages Hors-Contextes. PhD thesis, Université Jean Monnet de Saint-Étienne, February 2006. 1.2.2

[86] Rémi Eyraud, Colin de la Higuera, and Jean-Christophe Janodet. LARS: A learning algorithm for rewriting systems. Machine Learning, January 2007. 6.2

[87] Paolo Ferragina. Data structures: Time, I/Os, entropy, joules! Invited talk, European Symposium on Algorithms, 2010. 2.3.7

[88] Paolo Ferragina, Roberto Grossi, and Manuela Montangero. On updating suffix tree labels. Theoretical Computer Science, 201(1-2):249–262, 1998. 3.4.1

[89] Paolo Ferragina, Raffaele Giancarlo, Valentina Greco, Giovanni Manzini, and Gabriel Valiente. Compression-based classification of biological sequences and structures via the universal similarity metric: experimental assessment. BMC Bioinformatics, 8:252, Jan 2007. 2.4.2, 5.2, 5.2.1, 2, 5.4, A

[90] Edward Fiala and Daniel H Greene. Data compression with finite windows. Communications of the ACM, 32(4):490–505, 1989. 3.4.1

[91] F Flam. Hints of a language in junk DNA. Science, 266(5189):1320, 1994. 1.1.2

[92] Philip Gage. A new algorithm for data compression. The C Users Journal, 12(2), February 1994. 2.6.7

[93] Travis Gagie and Paweł Gawrychowski. Grammar-based compression in a streaming model. In Language and Automata Theory and Applications, pages 273–284, 2010. 2.6.11

[94] Matthias Gallé. A new tree distance metric for structural comparison of sequences. In Alberto Apostolico, Andreas Dress, and Laxmi Parida, editors, Structure Discovery in Biology: Motifs, Networks & Phylogenies, number 10231 in Dagstuhl Seminar Proceedings, Dagstuhl, Germany, 2010. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, Germany. 5.4.1

[95] Matthias Gallé, Pierre Peterlongo, and François Coste. In-place update of suffix array while recoding words. In Jan Holub and Jan Žďárek, editors, Prague Stringology Conference, pages 54–67, Czech Technical University in Prague, Czech Republic, 2008. 3

[96] Matthias Gallé, Pierre Peterlongo, and François Coste. In-place update of suffix array while recoding words. International Journal of Foundations of Computer Science, 20(6):1025–1045, January 2009. 3

[97] Gerald Gazdar. Applicability of indexed grammars to natural languages. In U. Reyle and C. Rohrer, editors, Natural Language Parsing and Linguistic Theories, pages 69–94. Springer, 1988. 6.2

[98] Raffaele Giancarlo, Davide Scaturro, and Filippo Utro. Textual data com- pression in computational biology: a synopsis. Bioinformatics, 25(13): 1575–86, July 2009. 2.3.3

[99] Mario Gimona. Protein linguistics - a grammar for modular protein as- sembly? Nature Reviews Molecular Cell Biology, 7(1):68–73, January 2006. 1.1.3

[100] Leszek Gąsieniec, Roman Kolpakov, Igor Potapov, and Paul Sant. Real- time traversal in grammar-based compressed files. In Data Compression Conference, 2005. 2.3.7

[101] Özkan U Nalbantoğlu, David J Russell, and Khalid Sayood. Data compression concepts and algorithms and their applications to bioinformatics. Entropy, 12(1):34–52, January 2010. 2.3.3

[102] E Mark Gold. Language identification in the limit. Information and Control, 10(5):447–474, 1967. 1.2

[103] E Mark Gold. Complexity of automaton identification from given data. Information and Control, 37(3):302–320, 1978. 1.2

[104] Michael Gribskov. The language metaphor in sequence analysis. Computers and Chemistry, 16(2):85–88, April 1992. 1.1.2

[105] Roberto Grossi, Andrea Pietracaprina, Nadia Pisanti, Geppino Pucci, Eli Upfal, and Fabio Vandin. MADMX: A novel strategy for maximal dense motif extraction. In Workshop on Algorithms in Bioinformatics, pages 362–374, 2009. 5.4.1

[106] Stéphane Grumbach and Fariza Tahi. A new challenge for compression algorithms: Genetic sequences. In Data Compression Conference, 1993. 2.3.3, A

[107] Stéphane Grumbach and Fariza Tahi. A new challenge for compression algorithms: Genetic sequences. Information Processing and Management, 30(6):875–886, 1994. 2.3.3, 2.7.3, 5.3.2

[108] Peter Grünwald. The Minimum Description Length principle. MIT Press, 2007. 1.3


[109] Ming Gu, Martin Farach, and Richard Beigel. An efficient algorithm for dynamic text indexing. In ACM-SIAM symposium on Discrete Algorithms, pages 697–704, 1994. 3.4.1

[110] Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, January 1997. 3.2, 3.2, 3.2.1, 3.2.2, 3.2.2

[111] S Harrusi, A Averbuch, and A Yehudai. XML syntax conscious compres- sion. In Data Compression Conference, 2006. 6.2

[112] Danny Hermelin, Gad M Landau, Shir Landau, and Orenw Weimann. Uni- fied compression-based acceleration of edit-distance computation. Arxiv preprint arXiv:1004.1194, 2010. 2.3.7

[113] Juan Herrera and Pedro Pury. Statistical keyword detection in literary corpora. European Physical Journal, May 2008. 6.2

[114] Wing-Kai Hon, Rahul Shah, and Jeffrey S Vitter. Compression, indexing, and retrieval for massive string data. invited talk Combinatorial Pattern Matching, 2010. 2.3.7

[115] John E Hopcroft and Jefrey D Ullman. Introduction to Automata Theory, Languages and Computation. Adison-Wesley Publishing Company, 1979. 2.1.2

[116] Shunsuke Inenaga, Takashi Funamoto, Masayuki Takeda, and Ayumi Shi- nohara. Linear-time off-line text compression by longest-first substitution. In String Processing and Information Retrieval, volume 2857, pages 137– 152, 2003. 2.6.8

[117] Nathan E Israeloff, M Kagalenko, and K Chan. Can Zipf distinguish language from noise in noncoding DNA? Physical Review Letters, 76: 1976–1976, 1996. 1.1.2

[118] Sungchul Ji. The linguistics of DNA: words, sentences, grammar, phonet- ics, and semantics. Annals of the New York Academy of Sciences, 870: 411–417, 1999. 1.1.2

[119] Sungchul Ji. The cell as the smallest DNA-based molecular computer. Biosystems, 52(1-3):123–133, January 1999. 1.1.2

[120] Aravind Joshi. Tree adjoining grammars: How much context- sensitivity is required to provide reasonable structural descriptions? In Natural Language Parsing, Psycological, Computational and Theoretical Perspectives. Cambridge University Press, May 1985. 6.2

[121] Horace Freeland Judson. The Eighth Day of Creation: Makers of the Revolution in Biology. Simon and Schuster, 1979. 1.1.1

[122] Makoto Kanazawa and Sylvain Salvati. The copying power of well-nested multiple context-free grammars. In Language and Automata Theory and Applications, January 2010. 6.2

[123] Juha Kärkkäinen and Peter Sanders. Simple linear work suffix array construction. In International Conference on Automata, Languages and Programming, pages 943–955. Springer, 2003. 3.1, 3.4.4

[124] Jyrki Katajainen and Timo Raita. An analysis of the longest match and the greedy heuristics in text encoding. Journal of the ACM, 39(2), April 1992. 2.3.1

[125] Goulven Kerbellec. Apprentissage d’automates modélisant des familles de séquences protéiques. PhD thesis, Université de Rennes 1, April 2008. 1

[126] Takuya Kida, Tetsuya Matsumoto, Yusuke Shibata, Masayuki Takeda, Ayumi Shinohara, and Setsuo Arikawa. Collage system: a unifying frame- work for compressed pattern matching. Theoretical Computer Science, 298(1):253–272, January 2003. 2.3.7

[127] John C Kieffer and En-Hui Yang. Grammar-based codes: a new class of universal lossless source codes. IEEE Transactions on Information Theory, 46(3):737–754, May 2000. 2.2, 2.3.4, 4, 2.6.5, 2.6.6, 3.2.3

[128] John C Kieffer, En-Hui Yang, Gregory Nelson, and Pamela Cosman. Universal lossless compression via multilevel pattern matching. IEEE Transactions on Information Theory, 46:1227–1245, January 2000. 2.6.6

[129] Dong Kyue Kim, Jeong Seop Sim, Heejin Park, and Kunsoo Park. Linear time construction of suffix arrays. In Combinatorial Pattern Matching, pages 186–2003, 2003. 3.1

[130] Dan Klein. The Unsupervised Learning of Natural Language Structure. PhD thesis, University of Stanford, 2005. 5.1.2

[131] Dan Klein and Christopher D Manning. Corpus-based induction of syn- tactic structure: Models of dependency and constituency. In Association for Computational Linguistics, January 2004. 6.2

[132] Friedhart Klix. Struktur, Strukturbeschreibung und Erkennungsleis- tung. In Friedhart Klix, editor, Organismische Informationsverarbeitung: Zeichenerkennung, Begriffsbildung, Problemlösen, pages 110–130, 108 Berlin, Leipziger Str.3–4, 1973. Akademie-Verlag. 2.4.1

[133] Pang Ko and Srinivas Aluru. Space efficient linear time construction of suf- fix arrays. In Combinatorial Pattern Matching, pages 200–210. Springer, 2003. 3.1

[134] Roman Kolpakov and Gregory Kucherov. Finding maximal repetitions in a word in linear time. In Symposium on Foundations of Computer Science, pages 596–604, New York, USA, 1999. IEEE. 3.2.2

[135] Gergely Korodi and Ioan Tabus. Compression of annotated nucleotide sequences. IEEE/ACM Transactions on Computational Biology and Bioinformatics, January 2007. 2.3.3

[136] Harold W Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–87, 1955. 5.1.3


[137] Shanika Kuruppu, Bryan Beresford-Smith, Thomas Conway, and Justin Zobel. Repetition-based compression of large DNA datasets. In International Conference on Computational Molecular Biology, April 2009. 2.3.3, 6.2

[138] J Kevin Lanctot, M Li, and En hui Yang. Estimating DNA sequence entropy. In ACM-SIAM Symposium on Discrete Algorithms, pages 409– 418, January 2000. 2.5, 2.6.8, 2, 2.7.3, 3.4.1

[139] N Jesper Larsson and Alistair Moffat. Off-line dictionary-based compres- sion. Proceedings of the IEEE, 88(11):1722–1732, November 2000. 2.6.7, 3

[140] N Jesper Larsson and Kunihiko Sadakane. Faster suffix sorting. Technical report, Department of Computer Science, Lund University, Sweden, May 1999. 3.1, 3.4.4

[141] Eric Lehman. Approximation algorithms for grammar-based data com- pression. Master’s thesis, Massachusetts Institute of Technology, 2002. 2.2, 2.6.11

[142] G Leighton, J Diamond, and T Muldner. AXECHOP: a grammar-based compressor for XML. In Data Compression Conference, page 467, March 2005. uses Multilevel Pattern Matching from KY. 2.3.6

[143] G Leighton, James Diamond, and T Müldner. A grammar-based approach for compressing XML. Technical report, Acadia University, August 2005. 2.3.6

[144] Siu-Wai Leung, Chris Mellish, and Dave Robertson. Basic gene grammars and DNA-ChartParser for language processing of Escherichia coli promoter DNA sequences. Bioinformatics, 17(3):226–236, January 2001. 1.1.3

[145] Ming Li and Paul Vitányi. An Introduction to Kolmogorov Complexity and its Applications. Springer Verlag, third edition, 2008. 2.4

[146] Ming Li, Xin Chen, Xin Li, Bin Ma, and Paul Vitányi. The similarity metric. IEEE Transactions on Information Theory, 50(12):3250–3264, March 2003. 2.4.2

[147] Yongjing Lin, Youtao Zhang, Quanzhong Li, and Jun Yang. Supporting efficient query processing on compressed XML files. In ACM Symposium on Applied Computing, pages 660–665, 2005. 2.3.6

[148] Alan H Lipkus. A proof of the triangle inequality for the Tanimoto distance. Journal of Mathematical Chemistry, 26(1-3):263–265, January 1999. 5.1.3

[149] Qi Liu, Yu Yang, Chun Chen, Jiajun Bu, Yin Zhang, and Xiuzi Ye. RNACompress: Grammar-based compression and informational complexity measurement of RNA secondary structure. BMC Bioinformatics, 9:176, January 2008. 2.3.5, 5.3

[150] David Loewenstern and Peter N Yianilos. Significantly lower entropy estimates for natural DNA sequences. Journal of Computational Biology, 6, February 1999. 2.3.3

[151] Christopher Loose, Kyle Jensen, Isidore Rigoutsos, and Gregory Stephanopoulos. A linguistic model for the rational design of antimicrobial peptides. Nature, 443(7113):867–869, January 2006. 1.1.3

[152] Bin Ma, John Tromp, and Ming Li. PatternHunter: faster and more sensitive homology search. Bioinformatics, 18(3):440–5, March 2002. 2.3.3

[153] Udi Manber and Gene Myers. Suffix arrays: a new method for on-line string searches. In ACM-SIAM Symposium on Discrete Algorithms, pages 319–327, 1991. 3.1

[154] Michael A Maniscalco and Simon J Puglisi. An efficient, versatile approach to suffix sorting. ACM Journal of Experimental Algorithmics, 12:1–23, 2008. 2

[155] R N Mantegna, S V Buldyrev, A L Goldberger, S Havlin, C K Peng, M Simons, and H E Stanley. Linguistic features of noncoding DNA sequences. Physical Review Letters, 73(23):3169–3172, 1994. 1.1.2

[156] Giovanni Manzini and Paolo Ferragina. Engineering a lightweight suffix array construction algorithm. Algorithmica, 40(1):33–50, 2004. 3.1, 1

[157] Giovanni Manzini and Marcella Rastero. A simple and fast DNA compressor. Software - Practice and Experience, February 2004. 2.3.3

[158] T Stephen Markham, Scott C Evans, Jeremy Impson, and Eric Steinbrecher. Implementation of an incremental MDL-based two-part compression algorithm for model inference. In Data Compression Conference, pages 322–331, 2009. 3, 5.3

[159] Shirou Maruyama, Hiromitsu Miyagawa, and Hiroshi Sakamoto. Improving time and space complexity for compressed pattern matching. In International Symposium on Algorithms and Computation, pages 484–493, 2006. 2.3.7

[160] Shirou Maruyama, Yohei Tanaka, Hiroshi Sakamoto, and Masayuki Takeda. Context-sensitive grammar transform: Compression and pattern matching. In String Processing and Information Retrieval, January 2008. 2.3.7

[161] Toshiko Matsumoto, Kunihiko Sadakane, and Hiroshi Imai. Biological sequence compression algorithms. In Workshop on Genome Informatics, volume 11, pages 43–52, January 2000. 2.3.3

[162] Edward M McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262–272, 1976. 3.4.1

[163] David MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003. 2.3

[164] Aleksandar Milosavljević and Jerzy Jurka. Discovery by minimal length encoding: A case study in molecular evolution. Machine Learning, 12(1):69–87, 1993. 3, 2.3.3

[165] Yuta Mori. libdivsufsort project (libdivsufsort-2.0.0), August 2008. http://code.google.com/p/libdivsufsort/. 1, 3.4.4

[166] Yuta Mori. An implementation of the induced sorting algorithm, 2008. http://yuta.256.googlepages.com/sais. 3.4.4

[167] Ryosuke Nakamura, Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda. Simple linear-time off-line text compression by longest-first substitution. In Data Compression Conference, pages 123–132, Washington, DC, USA, 2007. IEEE Computer Society. 2.6.8

[168] Ryosuke Nakamura, Shunsuke Inenaga, Hideo Bannai, Takashi Funamoto, Masayuki Takeda, and Ayumi Shinohara. Linear-time text compression by longest-first substitution. Algorithms, 2(4):1429–1448, 2009. 2.6.8, 12, 3, 3.4.1

[169] Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39(1):2, 2007. 2.3.7

[170] Gonzalo Navarro and Luís Russo. Re-pair achieves high-order entropy. In Data Compression Conference, page 537, 2008. 2.6.7

[171] Craig G Nevill-Manning. Inferring Sequential Structure. PhD thesis, University of Waikato, 1996. 1.2.1, 2.5, 2.3, 5.1

[172] Craig G Nevill-Manning and Ian H Witten. Compression and explanation using hierarchical grammars. The Computer Journal, 40(2,3):103–116, February 1997. 2.2, 2.3, 5, 5.1

[173] Craig G Nevill-Manning and Ian H Witten. Identifying hierarchical structure in sequences: A linear-time algorithm. Journal of Artificial Intelligence Research, 7, January 1997. 2.3, 5.1

[174] Craig G Nevill-Manning and Ian H Witten. Linear-time, incremental hierarchy inference for compression. In Data Compression Conference, 1997. 2.3

[175] Craig G Nevill-Manning and Ian H Witten. Inferring lexical and grammatical structure from sequences. In Compression and Complexity of Sequences, 1997. 2.3

[176] Craig G Nevill-Manning and Ian H Witten. Phrase hierarchy inference and compression in bounded space. In Data Compression Conference, pages 179–188, 1998. 2.3

[177] Craig G Nevill-Manning and Ian H Witten. On-line and off-line heuristics for inferring hierarchies of repetitions in sequences. Proceedings of the IEEE, 88(11):1745–1755, November 2000. 2.3, 2.6.8, 2.6.9, 2.7, 3.4.1

[178] Craig G Nevill-Manning, Ian H Witten, and David Maulsby. Compression by induction of hierarchical grammars. In Data Compression Conference, pages 244–253, 1994. 2.3.4, 2.3, 4, 5.3

[179] Craig G Nevill-Manning, Ian H Witten, and Dan Olsen, Jr. Compressing semi-structured text using hierarchical phrase identification. In Data Compression Conference, January 1996. 2.3

[180] Craig G Nevill-Manning, Ian H Witten, and Gordon W Paynter. Browsing in digital libraries: a phrase-based approach. In International Conference on Digital Libraries, pages 230–236, 1997. 2.3

[181] Craig G Nevill-Manning, Ian H Witten, and Gordon W Paynter. Lexically-generated subject hierarchies for browsing large collections. International Journal of Digital Libraries, 2/3:111–123, 1999. 2.3

[182] Jacques Nicolas, Patrick Durand, Grégory Ranchy, Sébastien Tempel, and AS Valin. Suffix-tree analyser (STAN): looking for nucleotidic and peptidic patterns in chromosomes. Bioinformatics, 21(24):4408–4410, January 2005. 1.1.3

[183] Jacques Nicolas, Christine Rousseau, Anne Siegel, Pierre Siegel, François Coste, Patrick Durand, Sébastien Tempel, Anne-Sophie Valin, and Frédéric Mahé. Modeling local repeats on genomic sequences. Technical report, INRIA, 2008. 3.2

[184] Laxmi Parida, Isidore Rigoutsos, Aris Floratos, Dan Platt, and Yuan Gao. Pattern discovery on character sets and real-valued data: linear bound on irredundant motifs and polynomial time algorithms. In ACM-SIAM Symposium on Discrete Algorithms, pages 297–308, 2000. 5.4.1

[185] Johann Pelfrêne, Saïd Abdeddaïm, and Joël Alexandre. Extracting approximate patterns. Journal of Discrete Algorithms, 3(2-4):293–320, 2005. 8, 9, 5.4.1

[186] Matthieu Perrin. Compression de séquences d’ADN à base de grammaires minimales. Technical report, ENS Cachan/Bretagne, 2010. 5.3, 5.3.1, 5.3.1

[187] Nadia Pisanti, Maxime Crochemore, Roberto Grossi, and Marie-France Sagot. A comparative study of bases for motif inference. In String Algorithmics, pages 195–225. KCL Press, 2004. 5.4.1

[188] Nadia Pisanti, Maxime Crochemore, Roberto Grossi, and Marie-France Sagot. Bases of motifs for generating repeated patterns with wild cards. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2(1), 2005. 5.4.1, 5.4.1

[189] O Popov, DM Segal, and Edward N Trifonov. Linguistic complexity of protein sequences as compared to texts of human languages. Biosystems, 38(1):65–74, January 1996. 1.1.2

[190] Paul Pritchard. A simple sub-quadratic algorithm for computing the subset partial order. Information Processing Letters, 56(6):337–341, 1995. 5.4.1

[191] Paul Pritchard. On computing the subset graph of a collection of sets. Journal of Algorithms, 33(2):187–203, 1999. 5.4.1

[192] Simon Puglisi, William F Smyth, and Andrew Turpin. A taxonomy of suffix array construction algorithms. ACM Computing Surveys, 39(2), July 2007. 3.1, 3.4.4

[193] Simon J Puglisi, William F Smyth, and Munina Yusufu. Fast optimal algorithms for computing all the repeats in a string. In Jan Holub and Jan Zdarek, editors, Prague Stringology Conference, pages 161–169, 2008. 3.2.2, 3.2.2

[194] I Rigoutsos, T Huynh, K Miranda, A Tsirigos, A McHardy, and D Platt. Short blocks from the noncoding parts of the human genome have instances within nearly all known genes and relate to biological processes. PNAS, 2006. 1.3

[195] Eric Rivals, Jean-Paul Delahaye, Olivier Delgrange, and Max Dauchet. A guaranteed compression scheme for repetitive DNA sequences. In Data Compression Conference, January 1996. 2.3.3

[196] Eric Rivals, Olivier Delgrange, Jean-Paul Delahaye, and Max Dauchet. Detection of significant patterns by compression algorithms: the case of approximate tandem repeats in DNA sequences. Bioinformatics, January 1997. 1.3, 2.3.3

[197] E Rivas and SR Eddy. The language of RNA: a formal grammar that includes pseudoknots. Bioinformatics, 16(4):334, 2000. 6.2

[198] Judith Roof. The Poetics of DNA. University of Minnesota Press, 2007. 1.1.1

[199] Christine Rousseau, Mathieu Gonnet, Marc Le Romancer, and Jacques Nicolas. CRISPI: a CRISPR interactive database. Bioinformatics, 25(24), 2009. 3.2

[200] Wojciech Rytter. Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theoretical Computer Science, 302(1-3):211–222, 2003. 2.6.1, 2.6.11, 2.7.2

[201] Süleyman Cenk Sahinalp and U Vishkin. Efficient approximate and dynamic matching of patterns using a labeling paradigm. In Symposium on Foundations of Computer Science, page 320, Washington, DC, USA, 1996. IEEE Computer Society. 3.4.1

[202] Yasubumi Sakakibara. Efficient learning of context-free grammars from positive structural examples. Information and Computation, 97(1):23–60, 1992. 1.2, 1, 6.2

[203] Yasubumi Sakakibara. Learning context-free grammars using tabular representations. Pattern Recognition, 38(9):1372–1383, December 2005. 6.2

[204] Hiroshi Sakamoto. A fully linear-time approximation algorithm for grammar-based compression. In Combinatorial Pattern Matching, volume 2676, pages 348–360, January 2003. 2.6.11

[205] Hiroshi Sakamoto. A fully linear-time approximation algorithm for grammar-based compression. Journal of Discrete Algorithms, 3(2–4):416–430, January 2005. 2.6.11

[206] Hiroshi Sakamoto, Takuya Kida, and Shinichi Shimozono. A space-saving linear-time algorithm for grammar-based compression. In String Processing and Information Retrieval, pages 218–229, 2004. 2.6.11

[207] Sherif Sakr. XML compression techniques: A survey and comparison. Journal of Computer and System Sciences, 75(5):303–322, January 2009. 2.3.6

[208] Mikaël Salson, Thierry Lecroq, Martine Léonard, and Laurent Mouchard. Dynamic Burrows-Wheeler transform. In Jan Holub and Jan Žďárek, editors, Prague Stringology Conference, pages 13–25, Czech Technical University in Prague, Czech Republic, 2008. 3.4.1

[209] Geoffrey Sampson. A proposal for improving the measurement of parse accuracy. International Journal of Corpus Linguistics, 5:53–68, 2000. 5.1.2

[210] Hisahiko Sato, Takashi Yoshioka, Akihiko Konagaya, and Tetsuro Toyoda. DNA data compression in the post genome era. Genome Informatics Series, pages 512–514, 2001. 2.3.3

[211] Ulrich Scheidereiter. Zur Beschreibung strukturierter Objekte mit kontextfreien Grammatiken. In Friedhart Klix, editor, Organismische Informationsverarbeitung: Zeichenerkennung, Begriffsbildung, Problemlösen, pages 131–135, 108 Berlin, Leipziger Str.3–4, 1973. Akademie-Verlag. 1, 2.2, 2.4.1, 5.1

[212] Michael Schrage. Learning to speak the language of life. Los Angeles Times, April 22, 1993. 1.1.1

[213] Ernst J Schuegraf and H S Heaps. A comparison of algorithms for data base compression by use of fragments as language elements. Information Storage and Retrieval, 10:309–319, 1974. 2.3.1, 4.1.1

[214] Klaus-Bernd Schürmann and Jens Stoye. An incomplex algorithm for fast suffix array construction. Software - Practice and Experience, 37(3): 309–329, 2007. 3.4.4

[215] David B Searls. The computational linguistics of biological sequences. In Lawrence Hunter, editor, Artificial Intelligence and Molecular Biology, page 75. AAAI Press Copublications, March 1993. 1.1.3, A

[216] David B Searls. String variable grammar: A logic grammar formalism for the biological language of DNA. Journal of Logic Programming, 24(1-2): 73–102, January 1995. 1.1.3

[217] David B Searls. Linguistic approaches to biological sequences. Computer Applications in the Biosciences, 13(4):333–344, January 1997. 1.1.3

[218] David B Searls. Reading the book of life. Bioinformatics, January 2001. 1.1.3

[219] David B Searls. The language of genes. Nature, February 2002. 1, 1.1.2, 1.1.3, 6.2

[220] David B Searls. Linguistics: trees of life and of language. Nature, 426(6965):391–392, November 2003. 1.1.3

[221] Abhi Shelat. Evaluating grammar-based data compression algorithms. Master’s thesis, Massachusetts Institute of Technology, 2001. 4

[222] Jeong Seop Sim. Time and space efficient search for small alphabets with suffix arrays. In Conference on Fuzzy Systems and Knowledge Discovery, pages 1102–1107, 2005. 3.4.2

[223] Matthew Simon. Emergent Computation: Emphasizing Bioinformatics. Springer, 2005. 1.1.3

[224] Zach Solan, David Horn, Eytan Ruppin, and Shimon Edelman. Unsupervised learning of natural languages. Proceedings of the National Academy of Sciences, January 2005. 1.2.1

[225] James A Storer and Thomas G Szymanski. Data compression via textual substitution. Journal of the ACM, 29(4):928–951, April 1982. 1, 2.2, 2.3.1

[226] Bonnie J Strait and T Gregory Dewey. The Shannon information entropy of protein sequences. Biophysical Journal, 71(1):148–155, January 1996. 1.1.2

[227] Ioan Tabus, Gergely Korodi, and Jorma Rissanen. DNA sequence compression using the normalized maximum likelihood model for discrete regression. In Data Compression Conference, page 10, February 2003. 2.3.3

[228] Vojtech Toman. Compression of XML data. In International Conference on Advanced Information Systems Engineering, pages 1–12, April 2004. 2.3.6

[229] Hélène Touzet. A linear tree edit distance algorithm for similar ordered trees. In Combinatorial Pattern Matching, volume 3537, pages 334–345, January 2005. 5.1.3

[230] Edward N Trifonov. The multiple codes of nucleotide sequences. Bulletin of Mathematical Biology, 51(4):417–432, 1989. 1.1.2

[231] Edward N Trifonov and Volker Brendel. Gnomic: a dictionary of genetic codes. Balaban, 1986. 1.1.2

[232] Anastasios A Tsonis, James B Elsner, and Panagiotis A Tsonis. Is DNA a language? Journal of Theoretical Biology, 184(1):25–29, 1997. 1.1.2

[233] Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249–260, 1995. 3.4.2

[234] Esko Ukkonen. Maximal and minimal representations of gapped and non-gapped motifs of a string. Theoretical Computer Science, 410(43): 4341–4349, 2009. String Algorithmics: Dedicated to Professor Maxime Crochemore on the occasion of his 60th birthday. 5.4.1, 5.4.2

[235] Menno van Zaanen. ABL: Alignment-Based Learning. In International Conference on Computational Linguistics, 2000. 1.2.1

[236] Menno van Zaanen. Bootstrapping Structure into Language: Alignment-Based Learning. PhD thesis, University of Leeds, February 2002. 1.2.1

[237] Menno van Zaanen and Pieter W Adriaans. Comparing two unsupervised grammar induction systems: Alignment-Based Learning vs. EMILE. Technical Report TR2001.05, University of Leeds, Leeds, UK, March 2001. 8

[238] Robert Wagner. Common phrases and minimum-space text storage. Communications of the ACM, 16(3), March 1973. 2.3.1

[239] JD Wang, Hsiang-Chuan Liu, Jeffrey JP Tsai, and Ka-Lok Ng. Scaling behavior of maximal repeat distributions in genomic sequences. International Journal of Cognitive Informatics and Natural Intelligence, 2(3):12, May 2008. 1.1.2

[240] Dennis Waters. The linguistic model in biology: Implications for recognizing life and intelligence. In Astrobiology Science Conference: Evolution and Life: Surviving Catastrophes and Extremes on Earth and Beyond, page 5368, January 2010. 1.1.1

[241] W Timothy J White and Michael D Hendy. Compressing DNA sequence databases with coil. BMC Bioinformatics, 9:242, January 2008. 2.3.3

[242] J Gerard Wolff. An algorithm for the segmentation of an artificial language analogue. British Journal of Psychology, 66:79–90, 1975. 2.5, 2.6.7

[243] J Gerard Wolff. Learning syntax and meanings through optimization and distributional analysis. Categories and Processes in Language Acquisition, January 1988. 1.2.1

[244] En-Hui Yang and John C Kieffer. Efficient universal lossless data compression algorithms based on a greedy sequential grammar transform. 1: Without context models. IEEE Transactions on Information Theory, 46(3):755–777, May 2000. 2.3.4, 5, 5.3

[245] En-Hui Yang and John C Kieffer. Efficient universal lossless data compression algorithms based on a greedy sequential grammar transform. 2: With context models. IEEE Transactions on Information Theory, 49(11):2874–2894, November 2003. 2.3.4, 5

[246] Daniel M Yellin. Algorithms for subset testing and finding maximal sets. In ACM-SIAM Symposium on Discrete Algorithms, pages 386–392, 1992. 5.4.1

[247] Sen Zhang, Ge Nong, and Wai Hong Chan. Fast and space efficient linear suffix array construction. In Data Compression Conference, Washington, DC, USA, 2008. IEEE Computer Society. 3.4.4

[248] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, January 1977. 2.6.1

[249] Jacob Ziv and Abraham Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5): 530–536, January 1978. 2.6.2

Seen: The Thesis Supervisor

Seen: The Head of the Doctoral School (Surname and First Name)

Seen, for authorization to defend

Rennes, on

The President of the Université de Rennes 1

Guy CATHELINEAU

Seen after the defence, for authorization to publish:

The President of the Jury (Surname and First Name)