Aspicdb: a Database Resource for Alternative Splicing Analysis T
Total Page:16
File Type:pdf, Size:1020Kb
Vol. 24 no. 10 2008, pages 1300–1304 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btn113 Databases and ontologies ASPicDB: A database resource for alternative splicing analysis T. Castrignano` 1, M. D’Antonio1, A. Anselmo2, D. Carrabino1, A. D’Onorio De Meo1, A. M. D’Erchia3, F. Licciulli4, M. Mangiulli3, F. Mignone2, G. Pavesi2, E. Picardi3, A. Riva3,5, R. Rizzi6, P. Bonizzoni6 and G. Pesole3,4,* 1Consorzio Interuniversitario per le Applicazioni di Supercalcolo per Universita` e Ricerca, CASPUR, Rome, 2University 3 of Milan, Dipartimento di Scienze Biomolecolari e Biotecnologie, via Celoria 26, Milan 20133, University of Bari, Downloaded from https://academic.oup.com/bioinformatics/article/24/10/1300/178069 by guest on 29 September 2021 Dipartimento di Biochimica e Biologia Molecolare, via Orabona, 4, Bari 70126, 4Istituto Tecnologie Biomediche del Consiglio Nazionale delle Ricerche, via Amendola 122/D, Bari 70126, Italy, 5Department of Molecular Genetics and Microbiology, University of Florida, PO Box 103610, Gainesville, FL 32610-3610, USA and 6DISCo, University of Milan Bicocca, via Bicocca degli Arcimboldi, 8, Milan, 20135, Italy Received on November 30, 2007; revised on February 26, 2008; accepted on March 28, 2008 Advance Access publication April 3, 2008 Associate Editor: Alfonso Valencia ABSTRACT 1 INTRODUCTION Motivation: Alternative splicing has recently emerged as a key Alternative splicing and alternative initiation/termination of mechanism responsible for the expansion of transcriptome and transcription have recently emerged as the major mechanisms proteome complexity in human and other organisms. Although responsible for the expansion of the transcriptome and several online resources devoted to alternative splicing analysis proteome complexity in human and other organisms (Brett are available they may suffer from limitations related both to et al., 2002; Kopelman et al., 2005; Talavera et al., 2007). the computational methodologies adopted and to the extent of the Although the functional potential of splicing variants has not annotations they provide that prevent the full exploitation of the been widely studied so far, several examples are known where available data. Furthermore, current resources provide limited query they are involved in the creation of novel or specialized and download facilities. functions (Black, 2003; Blencowe, 2006; Lopez, 1998). In fact, Results: ASPicDB is a database designed to provide access to recent experimental studies aimed at the characterization of reliable annotations of the alternative splicing pattern of human human and mouse transcriptomes have revealed a remarkable genes and to the functional annotation of predicted splicing heterogeneity in transcription initiation, and have also shown isoforms. Splice-site detection and full-length transcript modeling that alternative splicing is a widespread phenomenon affecting have been carried out by a genome-wide application of the ASPic more than 60% of human genes (Matlin et al., 2005; Tress algorithm, based on the multiple alignments of gene-related et al., 2007), an estimate constantly increasing from the 35% of transcripts (typically a Unigene cluster) to the genomic sequence, the first genome-wide assessment (Mironov et al., 1999). a strategy that greatly improves prediction accuracy compared to Splicing regulation, resulting from a combination of cis- methods based on independent and progressive alignments. elements and trans-acting factors, is thus a key mechanism to Enhanced query and download facilities for annotations and tune gene expression to a variety of conditions, while its sequences allow users to select and extract specific sets of data dysfunction may often be at the basis of the onset of genetic related to genes, transcripts and introns fulfilling a combination of diseases and cancer (Garcia-Blanco et al., 2004). user-defined criteria. Several tabular and graphical views of the Most of the methods currently available for the investigation results are presented, providing a comprehensive assessment of the and prediction of gene splicing patterns are based on functional implication of alternative splicing in the gene set under independent and progressive alignments of transcript data investigation. (mostly ESTs) to a genomic sequence. Since these methods ASPicDB, which is regularly updated on a monthly basis, also present some limitations mostly due to the sequence errors includes information on tissue-specific splicing patterns of normal frequently occurring in ESTs and to the repetitive structure of and cancer cells, based on available EST sequences and their library the genome sequence, we previously developed a novel source annotation. methodology implemented in the ASPic algorithm and software Availability: www.caspur.it/ASPicDB that take into account these issues [see Methods and (Bonizzoni Contact: [email protected] et al., 2005; Castrignano et al., 2006)]. Supplementary information: Supplementary data are available at Here we present ASPicDB (version 1.2, January 2008), a database resource for alternative splicing analysis that collects Bioinformatics online. and provides access to the results of a genome-wide analysis of human genes carried out by the ASPic software. ASPicDB is regularly updated on a monthly basis with the latest releases of *To whom correspondence should be addressed. NCBI Entrez gene and Unigene databases. 1300 ß The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected] ASPicDB: A database resource for alternative splicing analysis A large variety of tools and databases is currently available for genomic sequence (Bonizzoni et al., 2005). The algorithm uses an genome-wide investigation and prediction of alternative splicing optimization procedure to minimize the number of detected splice sites in human and other organisms: ASD (Stamm et al., 2006), thus reducing the number of false splice predictions, a common problem ECgene (Lee et al., 2007), ASAP2 (Kim et al., 2007), Hollywood with other methods as we previously showed in (Bonizzoni et al., 2006). (Holste et al., 2006), AceView (Thierry-Mieg and Thierry-Mieg, ASPic overcomes the limitations of programs based on independent pairwise alignment of ESTs against the genome that tend to predict 2006). However, the data they present are often significantly artefactual splice sites (Bonizzoni et al., 2006) suggested by alternative discordant due to differences in the input data, in the algorithm individual EST alignments, mainly, because of the repetitive structure adopted for splice sites prediction and transcript assembly [see of genomic sequences and of sequencing errors frequently occurring in (Bonizzoni et al., 2006) for further discussion] as well as on the ESTs. Furthermore, a dynamic programming procedure is adopted for level of stringency adopted to predict a splice site (e.g. occurrence further refinement of the alignment at exon–intron boundaries, trying of canonical donor/acceptor, number of supporting ESTs, to reconcile, where possible, non-canonical splice sites to canonical ones Downloaded from https://academic.oup.com/bioinformatics/article/24/10/1300/178069 by guest on 29 September 2021 quality of the alignment at the splice boundaries). ASPicDB (Bonizzoni et al., 2005), and taking into account the scoring matrices combines the high-quality predictions produced by the ASPic for donor, acceptor and branch sites of U2 and U12 introns derived algorithm with a powerful user-interface that offers several from (Sheth et al., 2006). However, all 256 possible splice site pairs unique features. Enhanced retrieval tools allow for the definition are acceptable and used in transcript assembly under the condition they of composite queries for the selection of subsets of genes, are supported by at least two mRNA/ESTs and no mismatches are observed in a 15-bp long sequence upstream and downstream from the transcripts or introns related to specific features, for example intron. The main result of the strategy of aligning spliced ESTs to the introns belonging to the U2 or U12 class, or found in a specific 0 genome is a more reliable prediction of introns. Then, a suitable mRNA localization (e.g. 5 UTR), or supported by a minimum algorithm based on a directed acyclic graph (DAG) has been then number of ESTs, and so on. Furthermore, the database also designed to assemble spliced ESTs and predicted introns into a includes information on tissue specific splicing in normal and minimum set of non-mergeable transcripts which are also annotated cancer cells based on available expressed sequence tags and their with respect to the location of the coding sequence and to the presence source library annotations. Finally, ASPicDB provides down- of premature stop codon (Maquat, 2004), polyA signal and polyA tail load facilities for sequence extraction (e.g. sequence regions (Zhang et al., 2005). The annotation of splicing variants is done using a surrounding splice site boundaries featuring specific conditions) RefSeq mRNA as reference (Pruitt et al., 2007). that could be used to assist bioinformatics analyses aimed at the detection of splicing regulatory elements. 3 RESULTS 2 METHODS 3.1 Database content 2.1 Data resources Table 1 reports some statistics on the data contained in the ASPicDB is a relational database populated with the results obtained current version of ASPicDB (version 1.2, January 2008) which by the genome-wide application