Annotation of the Arabidopsis Genome1

Total Page:16

File Type:pdf, Size:1020Kb

Annotation of the Arabidopsis Genome1 Annotation of the Arabidopsis Genome1 Jennifer R. Wortman, Brian J. Haas, Linda I. Hannick, Roger K. Smith, Jr., Rama Maiti, Catherine M. Ronning, Agnes P. Chan, Chunhui Yu, Mulu Ayele, Catherine A. Whitelaw, Owen R. White, and Christopher D. Town* The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850 The Arabidopsis Genome Sequencing Project was A CENTRALIZED REANNOTATION EFFORT officially completed in late 2000, leading to the pub- The stated goals of the Arabidopsis reannotation lication of a landmark paper describing, in broad effort are the comprehensive identification of protein outline, many salient features of the Arabidopsis ge- coding genes, the elucidation of accurate gene struc- nome (Arabidopsis Genome Initiative [AGI], 2000). tures, and the assignment of function to each gene However, the genome annotation, generated by the product in the predicted proteome. The approach individual sequencing centers, was heterogeneous, taken at TIGR involves both automated annotation both in terms of gene structure predictions and ter- and manual curation, including the development of minology used in their description. In response, The novel algorithms and custom software interfaces to Institute for Genomic Research (TIGR) was funded facilitate the generation of data that are complete, by the National Science Foundation to carry out a thorough, and of high quality. whole-genome reannotation that would be of a uni- The first challenge is to generate and maintain an form high standard and consistency, eliminating the integrated, nonredundant set of gene models across heterogeneity that had accumulated over time in the the Arabidopsis genome. This involves incorporating public databases. data from multiple original sources and recognizing The first phase of this process took place in the and correcting for duplicate and partial gene models latter half of 2000, as the sequencing itself was on overlapping BACs. The growing availability of drawing to a close. At that time, it was agreed that full-length cDNA (FL-cDNA) and expressed se- two centers, The Munich Information Center for quence tag (EST) data enables robust computational Protein Sequences (MIPS) and TIGR, would carry corrections to the original annotated gene structures. out the bulk of the analyses for the whole-genome Protein data and genomic reads from an evolutionary publication. As part of this effort, all publicly avail- neighbor, such as Brassica oleracea, can also inform able sequence and associated annotation was re- computational gene prediction. With a comprehensive set of gene models avail- trieved from various databases and loaded into the able, the predicted proteome can be clustered, mak- TIGR annotation database. Bacterial artificial chro- ing it possible to examine, model, and describe genes mosomes (BACs) that either lacked annotation or in the context of gene families. This can highlight required review were examined by TIGR annotators inconsistencies in previous annotations and aids in to provide as complete and nonredundant a data set achieving consistency and accuracy in the manual as possible for publication. When the whole-genome curation of both gene structure and function. Anno- analysis, carried out in close collaboration with tation updates are incorporated weekly into the TIGR MIPS, was completed, TIGR embarked upon the Arabidopsis Annotation Web site. reannotation proper, using the integrated annota- TIGR publishes its annotation data in the context tion data set as a starting point. The purpose of this of Arabidopsis chromosome sequences, generated article is to describe the goals, chronology, and ac- based on the available tiling path information. Where complishments to date of this reannotation effort tiling path data are incomplete or inconsistent, over- and to communicate the nature, quality, and basis of laps are being validated experimentally. Updated the latest TIGR data release to the plant research chromosome sequences along with annotation are community. generated and released twice a year to the Arabidop- sis Information Resource (TAIR) and the Genome Division of the National Center for Biotechnology Information (NCBI). 1 This work was supported by the National Science Foundation (Cooperative Agreement no. DBI 9813586). GENERATING A COMPREHENSIVE GENE SET * Corresponding author; e-mail [email protected]; fax 301–838– 0208. To support the reannotation effort, an in-house Article, publication date, and citation information can be found database was populated with a complete set of BAC at www.plantphysiol.org/cgi/doi/10.1104/pp.103.022251. sequences representing the published Arabidopsis Plant Physiology, June 2003, Vol. 132, pp. 461–468, www.plantphysiol.org © 2003 American Society of Plant Biologists 461 Wortman et al. tiling path, along with the genes and other biological Gene Models Supported by Arabidopsis FL-cDNA and features identified on these genomic sequences by EST Sequences the respective sequencing centers. These sequences were then processed through the TIGR annotation Early in 2001, we performed a detailed analysis pipeline, a collection of software known as Eukary- with approximately 5,000 FL-cDNA sequences pro- otic Genome Control (EGC) that serves as the central vided by Ceres Inc. (Malibu, CA) and evaluated the data management system. performance of a number of programs for their suit- EGC processes each BAC sequence through a series ability in generating gene models based on the of algorithms for predicting genes (Genscanϩ, Gen- genomic alignments of these data. We found that emark.hmm, and Glimmer; Burge and Karlin, 1997; approximately 35% of the cognate gene models with Lukashin and Borodovsky, 1998; Salzberg et al., cDNA support required some kind of modification. 1999), splice sites (Hebsgaard et al., 1996; Pertea et al., During this pilot study, 240 new genes were discov- 2001), and tRNAs (Lowe and Eddy, 1997). Homology ered and modeled based upon FL-cDNA alignments. to nucleotide and protein databases is computed us- This analysis also supported the documentation of ing the AAT package (Huang et al., 1997), which many instances of alternative splicing (Haas et al., utilizes a two-step approach consisting of a fast da- 2002) and mini-exons (N. Volfovsky, B.J. Haas, and tabase search step to identify the boundaries of the S.L. Salzberg, unpublished data) and has allowed us sequence match, followed by a rigorous alignment to develop a pipeline in which almost all new FL- step which takes into account consensus splice sig- cDNAs can be used to validate or update a gene nals. Data sets include Arabidopsis-specific cDNA model automatically. We have continued to down- and EST sequences, TIGR gene indices for Arabidop- load FL-cDNAs from GenBank. Currently, there are sis and other plants (Quackenbush et al., 2000), a 24,839 cDNA accessions supporting 12,569 gene nonredundant amino acid database filtered from models for 11,960 genes. From the end user’s per- public sources, and SwissProt (Bairoch and Ap- spective, gene models and predicted coding sequence weiler, 2000). based on FL-cDNAs can be regarded as high confi- The output of the gene prediction algorithms and dence, although a small percentage may be truncated homology-based computes were compared with the or contain unspliced introns or genomic DNA. preexisting gene models. New gene models were Arabidopsis ESTs also provide strong support for created, and existing gene models were updated components of gene structure, including exons, based on different classes of evidence. The process of splice sites, and untranslated regions. Many of the automating gene structure updates is continually be- original gene models relied, at least in part, on ESTs ing improved as new data becomes available and for their support. However, due to the distributed pipeline components become more robust. Current nature of the annotation and the gradual accumula- annotation data is based on the following data sets: tion of sequences, EST evidence was not uniformly or Arabidopsis and plant cDNAs and ESTs downloaded systematically incorporated into gene models. We from GenBank on January 28, 2003, the February 2003 recently have developed an algorithmic approach to release of TIGR gene indices, and a nonredundant incorporate all EST evidence into gene models and, protein set generated on February 1, 2003. Table I where appropriate, provide evidence for alternative describes the numbers of gene models supported by splicing. As a result, the current annotation data set different types of evidence in the current annotation contains approximately 16,000 genes whose models data set. incorporate all available EST evidence (B.J. Haas, Table I. Evidence supporting annotated Arabidopsis gene models Support was determined by BLAST-based similarity between the current, complete set of gene models/proteins, and the following datasets: non-Arabidopsis proteins parsed from the TIGR nonredundant protein set (February 1, 2003); the Arabidopsis protein set as represented in TIGR’s annotation database (February 1, 2003); and Arabidopsis cDNAs, Arabidopsis ESTs, and non-Arabidopsis plant ESTs downloaded from GenBank (January 28, 2003). Note that matches between an Arabidopsis protein and itself were excluded from the count for Arabidopsis protein support. Each gene was counted only once, in the category associated with the highest overall confidence. Gene models supported by both Arabidopsis cDNA and protein similarity to another organism are considered to be highest confidence. Genes with no EST and no protein support are based solely on gene predictions and are the lowest confidence set. Non-Arabidopsis Arabidopsis No Total Protein Protein Protein Arabidopsis cDNA 9,017 2,055 1,664 12,736 Arabidopsis EST 2,607 1,041 1,033 4,681 Other Plant EST 3,691 1,365 1,032 6,088 No cDNA/EST 247 2,280 1,352 3,879 Total 15,562 6,741 5,081 27,384 462 Plant Physiol. Vol. 132, 2003 Annotation of the Arabidopsis Genome A.L. Delcher, J.R. Wortman, R.K. Smith Jr, L.I. Extending Gene Detection and Validation by Hannick, R. Maiti, C.M. Ronning, D.B. Rusch, C.D. Comparative Genomics Town, S.L.
Recommended publications
  • Unicarbkb: Building a Standardised and Scalable Informatics Platform for Glycosciences Research
    Platforms EOI: UniCarbKB: Building a Standardised and Scalable Informatics Platform for Glycosciences Research 20 September 2019 at 13:02 Project title UniCarbKB: Building a Standardised and Scalable Informatics Platform for Glycosciences Research Field of Research code(s) 03 CHEMICAL SCIENCES 06 BIOLOGICAL SCIENCES 08 INFORMATION AND COMPUTING SCIENCES 11 MEDICAL AND HEALTH SCIENCES EOI Lead Name Matthew Campbell EOI lead Research Group Institute for Glycomics EOI lead Organisation Institute for Glycomics, Griffith University EOI lead Email Collaborator details Name Research Group Organisation Malcolm Wolski eResearch Services Griffith University Project description Over the last few years the glycomics knowledge platform UniCarbKB has contributed to the growth of international glycoinformatics resources by providing access to data collections, web services and software applications supported by biocuration activities. However, UniCarbKB has continued to develop as a single monolithic resource. This proposed project will improve the growth and scalability of UniCarbKB by breaking down the platform into smaller and composable pieces. This will be achieved through the adoption of a more decentralised approach and a microservice architecture that will allow components of the platform to scale at different rates. This architecture will allow us is to build an extendable and cross-disciplinary resource by integrating existing data with new analysis tools. The adoption of international resources (e.g. NIH GlyGen), will allow not just mapping of glycan data to genes and proteins but also pathways, gene ontology, publications and a wide variety of data from EBI, NCBI and other sources. Additionally, we propose to integrate glycan array data and develop ontologies that can serve as tools for integration and subsequently analysis of glycan-associated Existing technology Adopt The proposed project will adopt a multi-tier application architecture in which applications will be developed and deployed as microservices.
    [Show full text]
  • Hierarchical Classification of Gene Ontology Terms Using the Gostruct
    Hierarchical classification of Gene Ontology terms using the GOstruct method Artem Sokolov and Asa Ben-Hur Department of Computer Science, Colorado State University Fort Collins, CO, 80523, USA Abstract Protein function prediction is an active area of research in bioinformatics. And yet, transfer of annotation on the basis of sequence or structural similarity remains widely used as an annotation method. Most of today's machine learning approaches reduce the problem to a collection of binary classification problems: whether a protein performs a particular function, sometimes with a post-processing step to combine the binary outputs. We propose a method that directly predicts a full functional annotation of a protein by modeling the structure of the Gene Ontology hierarchy in the framework of kernel methods for structured-output spaces. Our empirical results show improved performance over a BLAST nearest-neighbor method, and over algorithms that employ a collection of binary classifiers as measured on the Mousefunc benchmark dataset. 1 Introduction Protein function prediction is an active area of research in bioinformatics [25]; and yet, transfer of annotation on the basis of sequence or structural similarity remains the standard way of assigning function to proteins in newly sequenced organisms [20]. The Gene Ontology (GO), which is the current standard for annotating gene products and proteins, provides a large set of terms arranged in a hierarchical fashion that specify a gene-product's molecular function, the biological process it is involved in, and its localization to a cellular component [12]. GO term prediction is therefore a hierarchical classification problem, made more challenging by the thousands of annotation terms available.
    [Show full text]
  • Arabidopsis Thaliana
    Downloaded from genome.cshlp.org on September 28, 2021 - Published by Cold Spring Harbor Laboratory Press RESEARCH A Physical Map of Chromosome 2 of Arabidopsis thaliana Eve Ann Zachgo, 2,4 Ming Li Wang, 1'2'4 Julia Dewdney, 1'2 David Bouchez, 3 Christine Carnilleri, 3 Stephen Belmonte, 2 Lu Huang, 2 Maureen Dolan, 2 and Howard M. Goodman 1'2'5 1Department of Genetics, Harvard Medical School and 2Department of Molecular Biology, Massachusetts General Hospital, Boston, Massachusetts 02114; 3Laboratoire de Biologie Cellulaire, Institut National de la Recherche Agronomique (INRA), 78026 Versailles CEDEX, France A yeast artificial chromosome (YAC] physical map of chromosome 2 of Arabidopsis thaliana has been constructed by hybridization of 69 DNA markers and 61 YAC end probes to gridded arrays of YAC clones. Thirty-four YACs in four contigs define the chromosome. Complete closure of the map was not attained because some regions of the chromosome were repetitive or were not represented in the YAC library. Based on the sizes of the YACs and their coverage of the chromosome, the length of chromosome 2 is estimated to be at least 18 Mb. These data provide the means for immediately identifying the YACs containing a genetic locus mapped on Arabidopsis chromosome 2. The small flowering plant Arabidopsis thaliana is ters (Maluszynska and Heslop-Harrison 1991; A1- an excellent model system for metabolic, genetic, bini 1994; Copenhaver et al. 1995). We present and developmental studies in plants. Its haploid here a YAC contig physical map of chromosome nuclear genome is small (-100 Mb), consisting of 2 of A.
    [Show full text]
  • Gene Ontology Analysis with Cytoscape
    Introduction to Systems Biology Exercise 6: Gene Ontology Analysis Overview: Gene Ontology (GO) is a useful resource in bioinformatics and systems biology. GO defines a controlled vocabulary of terms in biological process, molecular function, and cellular location, and relates the terms in a somewhat-organized fashion. This exercise outlines the resources available under Cytoscape to perform analyses for the enrichment of gene annotation (GO terms) in networks or sub-networks. In this exercise you will: • Learn how to navigate the Cytoscape Gene Ontology wizard to apply GO annotations to Cytoscape nodes • Learn how to look for enriched GO categories using the BiNGO plugin. What you will need: • The BiNGO plugin, developed by the Computational Biology Division, Dept. of Plant Systems Biology, Flanders Interuniversitary Institute for Biotechnology (VIB), described in Maere S, et al. Bioinformatics. 2005 Aug 15;21(16):3448-9. Epub 2005 Jun 21. • galFiltered.sif used in the earlier exercises and found under sampleData. Please download any missing files from http://www.cbs.dtu.dk/courses/27041/exercises/Ex3/ Download and install the BiNGO plugin, as follows: 1. Get BiNGO from the course page (see above) or go to the BiNGO page at http://www.psb.ugent.be/cbd/papers/BiNGO/. (This site also provides documentation on BiNGO). 2. Unzip the contents of BiNGO.zip to your Cytoscape plugins directory (make sure the BiNGO.jar file is in the plugins directory and not in a subdirectory). 3. If you are currently running Cytoscape, exit and restart. First, we will load the Gene Ontology data into Cytoscape. 1. Start Cytoscape, under Edit, Preferences, make sure your default species, “defaultSpeciesName”, is set to “Saccharomyces cerevisiae”.
    [Show full text]
  • To Find Information About Arabidopsis Genes Leonore Reiser1, Shabari
    UNIT 1.11 Using The Arabidopsis Information Resource (TAIR) to Find Information About Arabidopsis Genes Leonore Reiser1, Shabari Subramaniam1, Donghui Li1, and Eva Huala1 1Phoenix Bioinformatics, Redwood City, CA USA ABSTRACT The Arabidopsis Information Resource (TAIR; http://arabidopsis.org) is a comprehensive Web resource of Arabidopsis biology for plant scientists. TAIR curates and integrates information about genes, proteins, gene function, orthologs gene expression, mutant phenotypes, biological materials such as clones and seed stocks, genetic markers, genetic and physical maps, genome organization, images of mutant plants, protein sub-cellular localizations, publications, and the research community. The various data types are extensively interconnected and can be accessed through a variety of Web-based search and display tools. This unit primarily focuses on some basic methods for searching, browsing, visualizing, and analyzing information about Arabidopsis genes and genome, Additionally we describe how members of the community can share data using TAIR’s Online Annotation Submission Tool (TOAST), in order to make their published research more accessible and visible. Keywords: Arabidopsis ● databases ● bioinformatics ● data mining ● genomics INTRODUCTION The Arabidopsis Information Resource (TAIR; http://arabidopsis.org) is a comprehensive Web resource for the biology of Arabidopsis thaliana (Huala et al., 2001; Garcia-Hernandez et al., 2002; Rhee et al., 2003; Weems et al., 2004; Swarbreck et al., 2008, Lamesch, et al., 2010, Berardini et al., 2016). The TAIR database contains information about genes, proteins, gene expression, mutant phenotypes, germplasms, clones, genetic markers, genetic and physical maps, genome organization, publications, and the research community. In addition, seed and DNA stocks from the Arabidopsis Biological Resource Center (ABRC; Scholl et al., 2003) are integrated with genomic data, and can be ordered through TAIR.
    [Show full text]
  • Use and Misuse of the Gene Ontology Annotations
    Nature Reviews Genetics | AOP, published online 13 May 2008; doi:10.1038/nrg2363 REVIEWS Use and misuse of the gene ontology annotations Seung Yon Rhee*, Valerie Wood‡, Kara Dolinski§ and Sorin Draghici|| Abstract | The Gene Ontology (GO) project is a collaboration among model organism databases to describe gene products from all organisms using a consistent and computable language. GO produces sets of explicitly defined, structured vocabularies that describe biological processes, molecular functions and cellular components of gene products in both a computer- and human-readable manner. Here we describe key aspects of GO, which, when overlooked, can cause erroneous results, and address how these pitfalls can be avoided. The accumulation of data produced by genome-scale FIG. 1b). These characteristics of the GO structure enable research requires explicitly defined vocabularies to powerful grouping, searching and analysis of genes. describe the biological attributes of genes in order to allow integration, retrieval and computation of the Fundamental aspects of GO annotations data1. Arguably, the most successful example of system- A GO annotation associates a gene with terms in the atic description of biology is the Gene Ontology (GO) ontologies and is generated either by a curator or project2. GO is widely used in biological databases, automatically through predictive methods. Genes are annotation projects and computational analyses (there associated with as many terms as appropriate as well as are 2,960 citations for GO in version 3.0 of the ISI Web of with the most specific terms available to reflect what is Knowledge) for annotating newly sequenced genomes3, currently known about a gene.
    [Show full text]
  • Comprehensive Analysis of CRISPR/Cas9-Mediated Mutagenesis in Arabidopsis Thaliana by Genome-Wide Sequencing
    International Journal of Molecular Sciences Article Comprehensive Analysis of CRISPR/Cas9-Mediated Mutagenesis in Arabidopsis thaliana by Genome-Wide Sequencing Wenjie Xu 1,2 , Wei Fu 2, Pengyu Zhu 2, Zhihong Li 1, Chenguang Wang 2, Chaonan Wang 1,2, Yongjiang Zhang 2 and Shuifang Zhu 1,2,* 1 College of Plant Protection, China Agricultural University, Beijing 100193 China 2 Institute of Plant Quarantine, Chinese Academy of Inspection and Quarantine, Beijing 100176, China * Correspondence: [email protected] Received: 9 July 2019; Accepted: 21 August 2019; Published: 23 August 2019 Abstract: The clustered regularly interspaced short palindromic repeats (CRISPR)/CRISPR-associated protein (Cas) system has been widely applied in functional genomics research and plant breeding. In contrast to the off-target studies of mammalian cells, there is little evidence for the common occurrence of off-target sites in plants and a great need exists for accurate detection of editing sites. Here, we summarized the precision of CRISPR/Cas9-mediated mutations for 281 targets and found that there is a preference for single nucleotide deletions/insertions and longer deletions starting from 40 nt upstream or ending at 30 nt downstream of the cleavage site, which suggested the candidate sequences for editing sites detection by whole-genome sequencing (WGS). We analyzed the on-/off-target sites of 6 CRISPR/Cas9-mediated Arabidopsis plants by the optimized method. The results showed that the on-target editing frequency ranged from 38.1% to 100%, and one off target at a frequency of 9.8%–97.3% cannot be prevented by increasing the specificity or reducing the expression level of the Cas9 enzyme.
    [Show full text]
  • Locdb: Experimental Annotations of Localization for Homo Sapiens and Arabidopsis Thaliana Shruti Rastogi1,* and Burkhard Rost1,2,3
    D230–D234 Nucleic Acids Research, 2011, Vol. 39, Database issue Published online 11 November 2010 doi:10.1093/nar/gkq927 LocDB: experimental annotations of localization for Homo sapiens and Arabidopsis thaliana Shruti Rastogi1,* and Burkhard Rost1,2,3 1Department of Biochemistry and Molecular Biophysics, Columbia University, 701 West, 168th Street, New York, NY 10032, USA, 2Technical University Munich, Bioinformatics, Department of Computer Science and Institute of Advanced Studies (IAS) and 3New York Consortium on Membrane Protein Structure (NYCOMPS), TUM Bioinformatics, Boltzmannstr. 3, 85748 Garching, Germany Downloaded from Received August 15, 2010; Revised September 24, 2010; Accepted September 27, 2010 ABSTRACT stepped up to the challenge to increase the amount of annotations (6–15). These data sets capture aspects of LocDB is a manually curated database with protein function and, more generally, of global cellular nar.oxfordjournals.org experimental annotations for the subcellular processes. localizations of proteins in Homo sapiens (HS, UniProt (release 2010_07) (16) constitutes the most human) and Arabidopsis thaliana (AT, thale cress). comprehensive and, arguably, the most accurate Currently, it contains entries for 19 604 UniProt resource with experimental annotations of subcellular proteins (HS: 13 342; AT: 6262). Each database localization. However, even this excellent resource entry contains the experimentally derived locali- remains incomplete for the proteomes from Homo sapiens (HS) and Arabidopsis thaliana (AT): of the zation in Gene Ontology (GO) terminology, the at Medical Center Library, Duke University on January 13, 2011 experimental annotation of localization, localization 20 282 human proteins in Swiss-Prot (17), 14 502 have predictions by state-of-the-art methods and, where annotations of localization (72%), but for only 3720 (18%) these annotations are experimental.
    [Show full text]
  • Wormbase Curation Interfaces and Tools
    Wormbase Curation Interfaces And Tools Xiaodong Wang, Paul Sternberg and the WormBase Consortium Division of Biology 156-29, California Institute of Technology, Pasadena, CA 91125, USA Abstract Cellular Component GO curation form Antibody OA Gene ontology OA Curating biological information from the published literature can be time- and labor-intensive especially without automated tools. WormBase1 has adopted several curation interfaces and tools, most of which were built in-house, to help curators recognize and extract data more efficiently from the literature. These tools range from simple computer interfaces for data entry to employing scripts that take advantage of complex text extraction algorithms, which automatically identify specific objects in a paper and presents them to the curator for curation. By using these in-house tools, we Go term are also able to tailor the tool to the individual needs and preferences of the curator. For example, C.elegans protein Cellular Gene Ontology Cellular Component and gene-gene interaction curators employ the text mining Component software Textpresso2 to indentify, retrieve, and extract relevant sentences from the full text of an Category terms article. The curators then use a web-based curation form to enter the data into our local database. Wormbase has developed our own Ontology Annotator tool based on the publicly available Phenote ontology annotation curation interface (developed by the Berkeley Bioinformatics Open-Source Ontology Annotator Projects (BBOP)), which we have adapted with datatype specific configurations. Currently, we have Autocomple4on field eight data s, including antibody, GO, gene regulation, interaction, molecule, people, phenotype and with ontology Transgene OA transgene are using or will use OA for routine literature curation.
    [Show full text]
  • A Bioinformatics Analysis of the Arabidopsis Thaliana Epigenome Ikhlak Ahmed
    A bioinformatics analysis of the arabidopsis thaliana epigenome Ikhlak Ahmed To cite this version: Ikhlak Ahmed. A bioinformatics analysis of the arabidopsis thaliana epigenome. Agricultural sciences. Université Paris Sud - Paris XI, 2011. English. NNT : 2011PA112230. tel-00684391 HAL Id: tel-00684391 https://tel.archives-ouvertes.fr/tel-00684391 Submitted on 2 Apr 2012 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. 2011 PhD Thesis Ikhlak Ahmed Laboratoire: CNRS UMR8197 - INSERM U1024 Institut de Biologie de l'ENS(IBENS) Ecole Doctorale : Sciences Du Végétal A Bioinformatics analysis of the Arabidopsis thaliana Epigenome Supervisors: Dr. Chris Bowler and Dr. Vincent Colot Jury Prof. Dao-Xiu Zhou President Prof. Christian Fankhauser Rapporteur Dr. Raphaël Margueron Rapporteur Dr. Chris Bowler Examiner Dr. Vincent Colot Examiner Dr. Allison Mallory Examiner Dr. Michaël Weber Examiner Acknowledgement I express my deepest gratitude to my supervisors Dr. Chris Bowler and Dr. Vincent Colot for the support, advice and freedom that I enjoyed working with them. I am very thankful to Chris Bowler who trusted in me and gave me the opportunity to come here and work in the excellent possible conditions. I have no words to express my thankfulness to Vincent Colot for his enthusiastic supervision and the incredibly valuable sessions we spent together that shaped my understanding of DNA methylation.
    [Show full text]
  • The Biogrid Interaction Database
    D470–D478 Nucleic Acids Research, 2015, Vol. 43, Database issue Published online 26 November 2014 doi: 10.1093/nar/gku1204 The BioGRID interaction database: 2015 update Andrew Chatr-aryamontri1, Bobby-Joe Breitkreutz2, Rose Oughtred3, Lorrie Boucher2, Sven Heinicke3, Daici Chen1, Chris Stark2, Ashton Breitkreutz2, Nadine Kolas2, Lara O’Donnell2, Teresa Reguly2, Julie Nixon4, Lindsay Ramage4, Andrew Winter4, Adnane Sellam5, Christie Chang3, Jodi Hirschman3, Chandra Theesfeld3, Jennifer Rust3, Michael S. Livstone3, Kara Dolinski3 and Mike Tyers1,2,4,* 1Institute for Research in Immunology and Cancer, Universite´ de Montreal,´ Montreal,´ Quebec H3C 3J7, Canada, 2The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario M5G 1X5, Canada, 3Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA, 4School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JR, UK and 5Centre Hospitalier de l’UniversiteLaval´ (CHUL), Quebec,´ Quebec´ G1V 4G2, Canada Received September 26, 2014; Revised November 4, 2014; Accepted November 5, 2014 ABSTRACT semi-automated text-mining approaches, and to en- hance curation quality control. The Biological General Repository for Interaction Datasets (BioGRID: http://thebiogrid.org) is an open access database that houses genetic and protein in- INTRODUCTION teractions curated from the primary biomedical lit- Massive increases in high-throughput DNA sequencing erature for all major model organism species and technologies (1) have enabled an unprecedented level of humans. As of September 2014, the BioGRID con- genome annotation for many hundreds of species (2–6), tains 749 912 interactions as drawn from 43 149 pub- which has led to tremendous progress in the understand- lications that represent 30 model organisms.
    [Show full text]
  • Mutator-Like Elements in Arabidopsis Thaliana: Structure, Diversity and Evolution
    Copyright 2000 by the Genetics Society of America Mutator-like Elements in Arabidopsis thaliana: Structure, Diversity and Evolution Zhihui Yu, Stephen I. Wright and Thomas E. Bureau Department of Biology, McGill University, Montreal, Quebec, H3A 1B1 Canada Manuscript received May 18, 2000 Accepted for publication September 11, 2000 ABSTRACT While genome-wide surveys of abundance and diversity of mobile elements have been conducted for some class I transposable element families, little is known about the nature of class II transposable elements on this scale. In this report, we present the results from analysis of the sequence and structural diversity of Mutator-like elements (MULEs) in the genome of Arabidopsis thaliana (Columbia). Sequence similarity searches and subsequent characterization suggest that MULEs exhibit extreme structure, sequence, and size heterogeneity. Multiple alignments at the nucleotide and amino acid levels reveal conserved, potentially transposition-related sequence motifs. While many MULEs share common structural features to Mu ele- ments in maize, some groups lack characteristic long terminal inverted repeats. High sequence similarity and phylogenetic analyses based on nucleotide sequence alignments indicate that many of these elements with diverse structural features may remain transpositionally competent and that multiple MULE lineages may have been evolving independently over long time scales. Finally, there is evidence that MULEs are capable of the acquisition of host DNA segments, which may have implications for adaptive evolution, both at the element and host levels. HE Mutator (Mu) system is a diverse family of class ef®cient transposon-tagging tool for maize gene isola- TII transposable elements (TEs) found in maize. tion (Walbot 1992).
    [Show full text]