Automation of Functional Annotation of Genomes and Transcriptomes
Total Page:16
File Type:pdf, Size:1020Kb
Tecnura http://revistas.udistrital.edu.co/ojs/index.php/Tecnura/issue/view/687 DOI: http://doi.org/10.14483/udistrital.jour.tecnura.2014.DSE1.a08 INVESTIGACIÓN Automation of functional annotation of genomes and transcriptomes Automatización de la anotación funcional de genomas y transcriptomas Luis Fernando Cadavid Gutiérrez*, José Nelson Pérez Castillo**,Cristian Alejandro Rojas Quintero***, Nelson Enrique Vera Parra**** Fecha de recepción: June 10th, 2014 Fecha de aceptación: November 4th, 2014 Citation / Para citar este artículo: Cadavid Gutiérrez, L. F., Pérez Castillo, N. J., Rojas Quintero, C. A., & Vera Parra, N. E. (2014). Automation of functional annotation of genomes and transcriptomes. Revista Tecnura, 18 (Edición especial doctorado), 90–96. doi: 10.14483/udistrital.jour.tecnura.2014.DSE1.a08 ABSTRACT database being processed (1GB for Uniprot databa- Functional annotation represents a way to investi- se and 9GB for Non-redundant database). Aviability: gate and classify genes and transcripts according to https://github.com/BioinfUD/MAFA. their function within a given organism. Keywords: Annotator, Functional annotation, Gene This paper presents Massive Automatic Functional ontology, High Throughput Sequencing. Annotation (MAFA - Web), which is an online free bioinformatics tool that allows automation, unifi- RESUMEN cation and optimization of functional annotation La anotación funcional es un medio para investigar y processes when dealing with large volumes of se- clasificar genes y transcritos de acuerdo con la fun- quences. MAFA includes tools for categorization and ción que realizan en un organismo dado. statistical analysis of associations between sequen- Este artículo presenta Massive Automatic Functio- ces. We have evaluated the performance of MAFA nal Annotation (MAFA - Web), la cual es una herra- with a set of data taken from Diploria-Strigosa trans- mienta bioinformática libre y en línea que permite criptome (using an 8-core computer, namely E7450 la automatización, unificación y automatización de @ 2,40GHZ with 256GB RAM), processing rates of los procesos de la anotación funcional, trabajando 2,7 seconds per sequence (using Uniprot database) con grandes volúmenes de secuencias. MAFA in- and 50,0 seconds per sequence (using Non-redun- cluye herramientas para la categorización y análi- dant from NCBI database) were found together with sis estadístico de las asociaciones entre secuencias particular RAM usage patterns that depend on the y su ontología correspondiente. Se ha evaluado el * Medicine Doctor, Ecology and Evolutionary Biology PhD., IEI Research Group - Teacher / Researcher, Institute of Genetics and Department of Biology, National University, Bogotá D.C., Colombia. [email protected] ** System Engineer, Informatics PhD., GICOGE Research Group - Director of Center for Scientific Research and Development, Universidad Distrital Francisco José de Caldas, Bogotá D.C., Colombia. [email protected] *** System Engineer Student, GICOGE Research Group - Student, Universidad Distrital Francisco José de Caldas, Bogotá D.C., Colombia. caro- [email protected] ****Electronic Engineer, Information Sciences and Communication M.Sc., GICOGE Research Group - Teacher / Researcher, Universidad Distrital Francisco José de Caldas, Bogotá D.C., Colombia. [email protected] Tecnura • p-ISSN: 0123-921X • e-ISSN: 2248-7638 • Vol. 18 - Special Edition Doctorate • December 2014 • pp. 90-96 [ 90 ] Automation of functional annotation of genomes and transcriptomes Luis Fernando Cadavid Gutiérrez, José Nelson Pérez Castillo,Cristian Alejandro Rojas Quintero, Nelson Enrique Vera Parra desempeño de MAFA con un set de datos toma- un patrón particular de uso de RAM que depende do del transcriptoma de Diploria-Strigosa (usando de la base de datos que es procesada (1GB para la un computador de 8 núcleos, específicamente un base de datos Uniprot y 9GB para la base de da- E7450 @ 2,40GHZ con 256GB de memoria RAM). tos Non-redundant). Disponibilidad: https://github. Se encontraron tasas de procesamiento de 2,7 se- com/BioinfUD/MAFA. gundos por secuencia (usando la base de datos de Palabras clave: anotador, anotación funcional, on- Uniprot) y 50,0 segundos por secuencia (usando la tología génica, secuenciación de alto rendimiento. base de datos Non-redundant de NCBI), junto con INTRODUCTION controlled-term vocabulary to describe particular genes and the gene-product attributes within a par- Biological-sequence decoding plays an essential ticular organism). role in almost all research branches of Biology. For The annotation process for unknown sequen- various decades, sequencing processes were con- ces involves the use and integration of various ducted using the Sanger method (including the tools that deal with the following tasks: Local-alig- human genome project, where this method was nment search for comparing unknown sequences crucial). However, the cost of the method and its with known-sequence databases (e.g. Swissprot, limitations in terms of performance, scalability, Uniprot, Refseq, among others), association be- speed and resolution have led to a migration trend tween sequences and the ontology that describes towards using new procedures in the last 5 years, the functionality of such sequences and categori- namely the so called “next generation sequencing” zation and statistical analysis of the corresponding (Mekster, 2010; Martin & Wang, 2011) These new associations). technologies allow having lower-cost, more-effi- This paper is divided in two sections. In the first cient sequencing, which leads to an exponential section we describe the software working way. In growth in the volumes of sequenced data. the second section we have made an evaluation of Optimization of the sequencing process would MAFA using various datasets. be worthless without the development and opti- mization of suitable computing tools capable of METHODOLOGY analyzing such large sequenced-data volumes. In this context, one of the main needs of genomic- General description transcriptomic data mining is functional annota- tion. As a process, functional annotation consists MAFA is a free online bioinformatics tool that has of two stages, namely a search for known similar been optimized to carry out functional annotation sequences (through alignment) and the association processes over large numbers of nucleotide se- of such sequences to functional categories. The quences (genomes and transcriptomes). Moreover, type of tools that are commonly used to carry out MAFA includes additional tools to perform catego- functional annotation processes are the following: rization and statistical analysis of the correspon- BLAST - Basic Local Alignment Search Tool (Alts- ding sequence-ontology associations. MAFA is chul et al., 1990), (Camacho et al., 2008) (for find- intended to operate by a web interface making the ing sequences through alignment) and GO - Gene functional annotation a simple process (almost in- Ontology (Ashburner et al., 2000) (which provides tuitive) for biologist. Tecnura • p-ISSN: 0123-921X • e-ISSN: 2248-7638 • Vol. 18 - Special Edition Doctorate • December 2014 • pp. 90-96 [ 91 ] Automation of functional annotation of genomes and transcriptomes Luis Fernando Cadavid Gutiérrez, José Nelson Pérez Castillo,Cristian Alejandro Rojas Quintero, Nelson Enrique Vera Parra Architecture BlastExec.py: This script orders the system to run blastx (Sequences against Reference database) MAFA consists of 4 modules that constitute a work using various cores. flow. In order to run and integrate the modules, it is necessary to use additional tools that apply to all modules. Figure 1 shows the 4 modules together with the work flow and the cross-module applica- ble tools. Figure 2. LBS-module Diagram Source: Own work. GO associator Figure 1. Workflow for MAFA This module establishes the existing associations between the best hits, obtained from BLAST, and Source: Own work. the terms from Gene Ontology. These associations are made by means of mapping tables between se- Cross-module software components quence identifiers and GO terms. The GOAS mo- MySQL: A relational and multi-thread, multi-user dule is shown in Figure 3. data-base management system, also free software. GNU/Linux: A free operating system that is suita- ble for servers and also for running bio-informatics tools. Biopython (Cock et al., 2009): has proposed this free software project with various modules in- tended to facilitate manipulation of bioinformatics data. Pygal: Free libraries that assist the production of graphical materials for the representation of in- formation. BLAST (Basic Search Alignment Tool): A tool intended to find local regions of similarity through sequence alignment. Reportlab: A open- source Pyton-based library that facilitates the crea- tion of PDF-format files. Apache: A HTTP server with free license. Local BLAST server Figure 3. GOAS-module diagram This module is in charge of running BLAST (Nu- Source: Own work. cleotides vs Amino-acids) and also of storing the corresponding output using the XML format. Figu- This module involves the following scripts: re 2 shows the inputs and outputs of this module. BLASTXML2CSV.py: This script selects the best The script involved in this module is as follows: alignment per sequence (top hit) and also writes Tecnura • p-ISSN: 0123-921X • e-ISSN: 2248-7638 •Vol. 18 - Special Edition Doctorate • December 2014 • pp. 90-96 [ 92 ] Automation of functional annotation of genomes and transcriptomes Luis Fernando Cadavid Gutiérrez,