DOI: http://doi.org/10.14483/udistrital.jour.tecnura.2014.DSE1.a08
RESEARCH
Automation of functional annotation of genomes and transcriptomes
Automatización de la anotación funcional de genomas y transcriptomas
Luis Fernando Cadavid Gutiérrez*, José Nelson Pérez Castillo**, Cristian Alejandro Rojas Quintero***, Nelson Enrique Vera Parra****
Received: June 10th, 2014
Accepted: November 4th, 2014
** System Engineer, Informatics PhD., GICOGE Research Group - Director of Center for Scientific Research and Development, Universidad Distrital Francisco José de Caldas, Bogotá D.C., Colombia.
*** System Engineer Student, GICOGE Research Group - Student, Universidad Distrital Francisco José de Caldas, Bogotá D.C., Colombia.
**** Electronic Engineer, Information Sciences and Communication M.Sc., GICOGE Research Group - Teacher / Researcher, Universidad Distrital Francisco José de Caldas, Bogotá D.C., Colombia.
ABSTRACT
Functional annotation represents a way to investigate and classify genes and transcripts according to their function within a given organism. This paper presents Massive Automatic Functional Annotation (MAFA - Web), a free online bioinformatics tool that allows automation, unification and optimization of functional annotation processes when dealing with large volumes of sequences. MAFA includes tools for categorization and statistical analysis of the associations between sequences and their corresponding ontology. We evaluated the performance of MAFA on a data set taken from the Diploria strigosa transcriptome, using an 8-core computer (E7450 @ 2.40 GHz with 256 GB of RAM). Processing rates of 2.7 seconds per sequence (using the Uniprot database) and 50.0 seconds per sequence (using the Non-redundant database from NCBI) were found, together with particular RAM usage patterns that depend on the database being processed (1 GB for the Uniprot database and 9 GB for the Non-redundant database).
Availability: https://github.com/BioinfUD/MAFA.
Keywords: Annotator, Functional annotation, Gene ontology, High Throughput Sequencing.
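To give a feel for what the reported processing rates imply, the following minimal sketch converts them into an estimated run time. It assumes that run time scales linearly with the number of sequences, which the evaluation does not claim; the input size of 10,000 sequences and the function name `estimated_hours` are hypothetical and chosen only for illustration.

```python
# Per-sequence processing rates reported in the abstract (seconds per sequence).
SECONDS_PER_SEQUENCE = {"Uniprot": 2.7, "Non-redundant": 50.0}


def estimated_hours(num_sequences, database):
    """Estimate run time in hours, ASSUMING the reported rate scales linearly."""
    return num_sequences * SECONDS_PER_SEQUENCE[database] / 3600.0


# Hypothetical input size, not taken from the paper.
num_sequences = 10_000
for database in SECONDS_PER_SEQUENCE:
    hours = estimated_hours(num_sequences, database)
    print(f"{database}: about {hours:.1f} hours for {num_sequences} sequences")
```

Under this assumption, the choice of reference database changes the estimate by a factor of roughly 18.5 (50.0 / 2.7), which is why the database is the first thing to settle before a large run.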
INTRODUCTION
Biological-sequence decoding plays an essential role in almost all research branches of Biology. For several decades, sequencing was conducted using the Sanger method (including the Human Genome Project, where this method was crucial). However, the cost of the method and its limitations in terms of performance, scalability, speed and resolution have led, in the last 5 years, to a migration towards new procedures, namely the so-called "next generation sequencing" (Mekster, 2010; Martin & Wang, 2011). These new technologies allow lower-cost, more efficient sequencing, which leads to an exponential growth in the volumes of sequenced data.
Optimization of the sequencing process would be worthless without the development and optimization of suitable computing tools capable of analyzing such large volumes of sequenced data. In this context, one of the main needs of genomic and transcriptomic data mining is functional annotation. As a process, functional annotation consists of two stages, namely a search for known similar sequences (through alignment) and the association of such sequences to functional categories. The tools commonly used to carry out functional annotation are BLAST - Basic Local Alignment Search Tool (Altschul et al., 1990; Camacho et al., 2008), for finding sequences through alignment, and GO - Gene Ontology (Ashburner et al., 2000), which provides a controlled-term vocabulary to describe particular genes and the gene-product attributes within a particular organism.
The annotation process for unknown sequences involves the use and integration of various tools that deal with the following tasks: local-alignment search for comparing unknown sequences with known-sequence databases (e.g. Swissprot, Uniprot, Refseq, among others); association between sequences and the ontology that describes the functionality of such sequences; and categorization and statistical analysis of the corresponding associations.
This paper is divided into two sections. The first section describes how the software works. The second section presents an evaluation of MAFA using various datasets.
METHODOLOGY
General description
MAFA is a free online bioinformatics tool that has been optimized to carry out functional annotation processes over large numbers of nucleotide sequences (genomes and transcriptomes). Moreover, MAFA includes additional tools to perform categorization and statistical analysis of the corresponding sequence-ontology associations. MAFA is intended to operate through a web interface, making functional annotation a simple (almost intuitive) process for biologists.
Architecture
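As a rough illustration of the process this architecture automates, the two stages named in the Introduction (finding a best hit by alignment, then associating it with functional categories) can be sketched as a lookup followed by a tally. This is a minimal sketch under stated assumptions, not MAFA's code: the query names, hit identifiers and mapping table are made up, the GO identifiers are used only as arbitrary strings, and the function names `associate` and `categorize` are hypothetical.

```python
from collections import Counter

# Stage 1 output (hypothetical): best database hit per query sequence,
# as a local-alignment search would report it. None means no hit was found.
best_hits = {
    "query_1": "HIT_A",
    "query_2": "HIT_B",
    "query_3": None,
    "query_4": "HIT_A",
}

# Mapping table (hypothetical): database sequence identifier -> GO terms.
hit_to_go = {
    "HIT_A": ["GO:0000001", "GO:0000002"],
    "HIT_B": ["GO:0000002"],
}


def associate(best_hits, hit_to_go):
    """Stage 2: attach GO terms to every query that has a best hit."""
    return {
        query: hit_to_go.get(hit, [])
        for query, hit in best_hits.items()
        if hit is not None
    }


def categorize(annotations):
    """Count how many query sequences fall under each GO term."""
    return Counter(term for terms in annotations.values() for term in terms)


annotations = associate(best_hits, hit_to_go)
term_counts = categorize(annotations)
for term, count in term_counts.most_common():
    print(term, count)
```

Queries without a hit are dropped before the association step, so the per-term counts describe only sequences that could be annotated; a real pipeline would probably also report how many sequences were left unannotated.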
MAFA consists of 4 modules that constitute a workflow. In order to run and integrate the modules, it is necessary to use additional tools that apply to all modules. Figure 1 shows the 4 modules together with the workflow and the cross-module tools.
Figure 1. Workflow for MAFA
Source: Own work.
Cross-module software components
MySQL: A free, relational, multi-threaded, multi-user database management system.
GNU/Linux: A free operating system that is suitable for servers and also for running bioinformatics tools.
Biopython (Cock et al., 2009): A free software project with various modules intended to facilitate manipulation of bioinformatics data.
Pygal: A free library that assists in producing charts for the representation of information.
BLAST (Basic Local Alignment Search Tool): A tool intended to find local regions of similarity through sequence alignment.
Reportlab: An open-source Python-based library that facilitates the creation of PDF files.
Apache: A free-license HTTP server.
Local BLAST server
This module is in charge of running BLAST (nucleotides vs. amino acids) and also of storing the corresponding output in XML format. Figure 2 shows the inputs and outputs of this module. The script involved in this module is as follows:
BlastExec.py: This script orders the system to run blastx (sequences against the reference database) using several cores.
Figure 2. LBS-module diagram
Source: Own work.
GO associator
This module establishes the existing associations between the best hits obtained from BLAST and the terms from Gene Ontology. These associations are made by means of mapping tables between sequence identifiers and GO terms. The GOAS module is shown in Figure 3.
Figure 3. GOAS-module diagram
Source: Own work.
This module involves the following scripts:
BLASTXML2CSV.py: This script selects the best alignment per sequence (top hit) and also writes
