An Automated Approach to the In-Silico Identification of Chimeric mRNAs

Abstract 1

An automated approach to the in-silico identification of chimeric mRNAs

Alberti S (1), Trerotola M (1), Emerson A (2), Rossi E (2)

(1) Unit of Cancer Pathology, Center for Excellence in Research on Aging, University "G. D'Annunzio", Via Colle dell'Ara, 66013 Chieti Scalo (Chieti), Italy.
(2) High Performance Systems Division, CINECA, via Magnanelli 6/3, 40033 Casalecchio di Reno (BO), Italy.

Motivation

Chimeric mRNAs from two different genes largely arise by mRNA trans-splicing. mRNA trans-splicing post-transcriptionally joins heterologous mRNAs at canonical exon-exon borders, essentially following the rules of canonical cis-splicing. The construction of cDNA libraries frequently causes cDNA fusion artefacts, largely because of incorrect ligation of independent cDNAs or of abnormal reverse transcription. A key issue is therefore how to distinguish bona fide chimeras from in vitro artefacts. We have developed a bioinformatics retrieval strategy, the In Silico Trans-splicing Retrieval System (ISTReS), to distinguish between the two. The ISTReS pipeline consists of the following steps:

1. Map the cDNA databank onto the human genome by Blast analysis, masking human repetitive DNA.
2. Filter the Blast output according to score, match length and percentage identity.
3. Group the query sequence segments (mRNA exons) into longer concatamers, each mapping onto only one chromosome.
4. Check for possible chimeric sequences by comparing concatamers.
5. Remove possible cDNA fusion artefacts (e.g. 'sense/antisense' sequences).
6. Analyse the structure of the remaining sequences to locate mRNA cleavage or poly-A addition signals, which provide further evidence of chimeric joins.

The procedure has been successfully validated against a set of known chimeric sequences and has also detected two novel chimeric mRNAs [1]. The authors of this work estimate that about 1% of the hybrid sequences in current mRNA databanks are canonically trans-spliced.
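Steps 2-4 of the pipeline above can be illustrated with a minimal Python sketch. It is not the authors' implementation (which the abstract describes as custom Perl scripts). It assumes Blast hits in the standard 12-column tabular format, and the thresholds, the `max_overlap` parameter, the function names and the example rows are all hypothetical. The chimera test is a simplification: a query is flagged when two of its concatamers map to different chromosomes and cover essentially distinct parts of the query. Steps 5 and 6 are not covered.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    query: str      # cDNA identifier
    chrom: str      # chromosome the segment maps to
    pident: float   # percentage identity
    length: int     # alignment length
    qstart: int     # start of the segment on the query
    qend: int       # end of the segment on the query
    bitscore: float


def parse_hits(lines):
    """Parse 12-column tabular Blast output (assumed input format)."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]),
                        int(f[6]), int(f[7]), float(f[11])))
    return hits


def filter_hits(hits, min_bitscore=50.0, min_length=30, min_pident=95.0):
    """Step 2: keep hits passing score, length and identity cut-offs.
    The default thresholds are illustrative, not taken from the abstract."""
    return [h for h in hits
            if h.bitscore >= min_bitscore
            and h.length >= min_length
            and h.pident >= min_pident]


def build_concatamers(hits):
    """Step 3: merge each query's segments per chromosome into one span."""
    grouped = defaultdict(list)
    for h in hits:
        grouped[(h.query, h.chrom)].append(h)
    concatamers = defaultdict(dict)
    for (query, chrom), segs in grouped.items():
        concatamers[query][chrom] = (min(s.qstart for s in segs),
                                     max(s.qend for s in segs))
    return concatamers


def candidate_chimeras(concatamers, max_overlap=10):
    """Step 4 (simplified): flag queries whose adjacent concatamers lie on
    different chromosomes and overlap by at most max_overlap query bases."""
    candidates = []
    for query, by_chrom in concatamers.items():
        spans = sorted(by_chrom.items(), key=lambda kv: kv[1][0])
        for (c1, (_, e1)), (c2, (s2, _)) in zip(spans, spans[1:]):
            if s2 > e1 - max_overlap:
                candidates.append((query, c1, c2))
    return candidates


# Made-up example rows: cdna1 maps to chr1 and chr7, cdna2 only to chr3
# once its weak chr9 hit has been filtered out.
EXAMPLE_ROWS = [
    "cdna1\tchr1\t99.0\t120\t1\t0\t1\t120\t5000\t5119\t1e-50\t220",
    "cdna1\tchr1\t98.5\t80\t1\t0\t121\t200\t7000\t7079\t1e-30\t150",
    "cdna1\tchr7\t99.2\t150\t1\t0\t201\t350\t900\t1049\t1e-60\t280",
    "cdna2\tchr3\t99.0\t300\t2\t0\t1\t300\t100\t399\t1e-100\t500",
    "cdna2\tchr9\t80.0\t25\t5\t0\t10\t34\t100\t124\t1e-2\t30",
]

if __name__ == "__main__":
    hits = filter_hits(parse_hits(EXAMPLE_ROWS))
    print(candidate_chimeras(build_concatamers(hits)))
```

Keeping each step as a separate function mirrors the requirement, stated in the Methods, that each component of the pipeline can be run and validated on its own.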
The aim of this study was to extend the ISTReS procedure to larger datasets.

Methods

Steps 2-6 of the trans-splicing detection system were implemented with custom Perl scripts, many of them rewritten for efficiency and to reflect changes in strategy since the previous study. Although computationally inexpensive, the algorithms are often quite complex, and most of the programs have undergone major revisions. Indeed, we have found that progress in developing the trans-splicing retrieval system for larger datasets depends not on the computationally intensive Blast analysis but on the validation of the analysis programs.

In order to validate the ISTReS procedure, the scientific experts in the team need to be able to execute each individual component of the pipeline as well as the whole pipeline itself. The situation is complicated by the requirements for supercomputing resources and large data storage, which necessitate direct logon access to the computers in question. The common technique of providing a web interface to hide the underlying computer implementation is impractical for such a complex system, which is still evolving and being tested.

To accelerate the refining of the ISTReS procedure and to provide a more convenient environment for the end-user, it was decided to create a workflow description of the pipeline and to implement the various components as web services. The workflow was constructed with the Taverna workflow editor, while the web services were created with the Soaplab environment. The latter is particularly convenient because it generates web services by "wrapping" already existing programs, thereby avoiding re-programming of the applications. Note that, due to difficulties in implementing asynchronous web services with available tools, the Blast analyses have not yet been exported as web services.

Results

We show below an image of an example Taverna workflow which implements some of the key steps of the ISTReS pipeline.
This and similar workflows are currently being used to refine some of the analysis steps in ISTReS. Candidate chimeras identified by ISTReS analysis with selected cDNA databanks will be reported in a future work.

Contact email: [email protected]

Abstract 2

TFBSs prediction by integration of genomic, evolutionary, and gene expression data

Ambesi-Impiombato A (1,2), Bansal M (1,3), Rispoli R (1), Liò P (4), di Bernardo D (1,3)

(1) Telethon Institute of Genetics and Medicine, Tigem, Napoli
(2) Department of Neuroscience, University of Naples "Federico II", Napoli
(3) SEMM, European School of Molecular Medicine, Naples, Italy
(4) Computer Laboratory, Cambridge University, Cambridge, UK

Motivation

Control of gene expression is essential to the establishment and maintenance of all cell types, and is involved in the pathogenesis of several diseases. However, the biological mechanisms underlying the regulation of gene expression are not completely understood, and predictions made with bioinformatics tools typically have poor specificity. We have developed and tested a computational workflow to predict Transcription Factor Binding Sites (TFBSs) on proximal promoters of vertebrate genes. Finally, we applied the workflow to a cluster of genes found to respond significantly to p63 overexpression. This dataset consists of microarray gene expression measurements at 15 time-points in primary murine keratinocytes.

Methods

Our approach for the prediction of regulatory elements is based on a search for known regulatory motifs, retrieved from TRANSFAC, on the DNA sequences of gene promoters. Genomic information is retrieved from the Ensembl database (www.ensembl.org), and orthology information from Ensembl Compara. Predictions are computed independently on different species and the final scores are integrated using a weighted sum calibrated on the phylogenetic distances between the species. These predictions are further refined using logistic regression to integrate data from co-regulated genes.
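The cross-species integration step can be sketched in Python as follows. The abstract does not give the actual calibration, so the weighting rule here (each species weighted by one plus its phylogenetic distance from the reference species, then normalised) is an assumption made purely for illustration, as are the function names and the distance values used in the usage example.

```python
def species_weights(distances):
    """Turn phylogenetic distances from the reference species into weights.

    Hypothetical rule: weight proportional to (1 + distance), normalised to
    sum to 1, so that a match conserved in a distant species counts more.
    The abstract does not state the calibration actually used.
    """
    raw = {species: 1.0 + d for species, d in distances.items()}
    total = sum(raw.values())
    return {species: w / total for species, w in raw.items()}


def integrate_scores(scores, weights):
    """Weighted sum of per-species prediction scores for one candidate site.

    Species without a score (e.g. no ortholog found) are skipped and the
    remaining weights are renormalised.
    """
    shared = [species for species in scores if species in weights]
    if not shared:
        raise ValueError("no species in common between scores and weights")
    norm = sum(weights[species] for species in shared)
    return sum(weights[species] * scores[species] for species in shared) / norm


if __name__ == "__main__":
    # Illustrative distances only; not real branch lengths.
    weights = species_weights({"human": 0.0, "mouse": 1.0, "chicken": 2.0})
    print(integrate_scores({"human": 0.9, "mouse": 0.8, "chicken": 0.4}, weights))
```

The integrated score would then serve as one input to the logistic regression mentioned above; that refinement step is not shown here.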
For the purpose of this analysis, each matrix was scored using a 3rd order Markov model trained on a large number of intergenic regions upstream of randomly selected genes.

Results

We show the advantages of integrating genomic data with information based on evolutionary conservation, as well as gene expression data. Consistent results were obtained on a large simulated dataset consisting of 13050 simulated promoter sequences (performance shown in figure 1), and on a set of 161 human gene promoters for which binding sites are known. Key factors of our approach include the integration of predictive scores obtained on promoters of orthologous genes from multiple species, and the possibility of including a priori information, such as that available from quantitative or qualitative gene expression data, by fitting a logistic regression. The robustness of the logistic regression was evaluated by progressively misassigning genes to the co-regulated group. Our results on simulated datasets show that integrating information from multiple data sources, such as the genomic sequence of gene promoters, conservation over multiple species, and gene expression data, indeed improves the accuracy of computational predictions.

Contact email: [email protected]

References

- Tadesse MG, Vannucci M, Lio P: Identification of DNA regulatory motifs using Bayesian variable selection. Bioinformatics 2004, 20:2553-2561.
- Hallikas O, Palin K, Sinjushina N, Rautiainen R, Partanen J, Ukkonen E, Taipale J: Genome-wide prediction of mammalian enhancers based on analysis of transcription-factor binding affinity. Cell 2006, 124:47-59.
- Basso K, Margolin AA, Stolovitzky G, Klein U, Dalla-Favera R, Califano A: Reverse engineering of regulatory networks in human B cells. Nat Genet 2005, 37:382-390.
Abstract 3

Orion: a spatial Multi Agent System framework for Computational Cellular Dynamics of metabolic pathways

Angeletti M (1), Baldoncini A (2), Cannata N (2), Corradini F (2), Culmone R (2), Forcato C (2), Mattioni M (2), Merelli E (2), Piergallini R (2)

(1) Dipartimento Biologia Molecolare, Cellulare ed Animale, Università di Camerino
(2) Dipartimento di Matematica e Informatica, Università di Camerino

Motivation

Computational models that reproduce and predict the detailed behavior of cellular systems form the Holy Grail of systems biology [1]. Molecular Dynamics represents the most accurate and fundamental approach to cell simulation, taking into account the fundamental physical rules at the atomic level. Because of the extremely high number of atoms that must be considered, it cannot be practically used to simulate whole cell systems. A plethora of other mathematical and computational approaches are therefore applied, often experimentally, in systems biology, aiming at the modeling and simulation of cellular systems and processes (e.g. Ordinary Differential Equations, Partial Differential Equations, Petri Nets, UML, PI calculus, Multi Agent Systems, Dynamic Cellular Automata). Methods can be differentiated [2] according to the resolution levels adopted in space, scale and time representation, the presence or absence of stochasticity, the level of abstraction, and many other factors. The choice of method has critical consequences for the model's engineering life cycle. Issues such as accuracy, availability of formal methods to verify properties of the systems, modularity, the questions that the model can answer, intuitiveness, scalability, practicability, usefulness for the biological community, and the existence of suitable experimental data should all be carefully weighed when choosing a modeling and simulation framework.

Methods

Multiagent systems
