Web & Grid Technologies in Bioinformatics, Computational And
Total Page:16
File Type:pdf, Size:1020Kb
Current Bioinformatics, 2008, 3, 000-000 1 Web & Grid Technologies in Bioinformatics, Computational and Systems Biology: A Review Azhar A. Shah1, Daniel Barthel1, Piotr Lukasiak2,3, Jacek Blazewicz2,3 and Natalio Krasnogor*,1 1School of Computer Science, University of Nottingham, Jubilee Campus, NG81BB, Nottingham, UK 2Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland 3Institute of Bioorganic Chemistry, Laboratory of Bioinformatics, Polish Academy of Sciences, Noskowskiego 12/14, 61- 704 Poznan, Poland Abstract: The acquisition of biological data, ranging from molecular characterization and simulations (e.g. protein fold- ing dynamics), to systems biology endeavors (e.g. whole organ simulations) all the way up to ecological observations (e.g. as to ascertain climate change’s impact on the biota) is growing at unprecedented speed. The use of computational and networking resources is thus unavoidable. As the datasets become bigger and the acquisition technology more refined, the biologist is empowered to ask deeper and more complex questions. These, in turn, drive a runoff effect where large re- search consortia emerge that span beyond organizations and national boundaries. Thus the need for reliable, robust, certi- fied, curated, accessible, secure and timely data processing and management becomes entrenched within, and crucial to, 21st century biology. Furthermore, the proliferation of biotechnologies and advances in biological sciences has produced a strong drive for new informatics solutions, both at the basic science and technological levels. The previously unknown situation of dealing with, on one hand, (potentially) exabytes of data, much of which is noisy, has large experimental er- rors or theoretical uncertainties associated with it, or on the other hand, large quantities of data that require automated computationally intense analysis and processing, have produced important innovations in web and grid technology. In this paper we present a trace of these technological changes in Web and Grid technology, including details of emerging infra- structures, standards, languages and tools, as they apply to bioinformatics, computational biology and systems biology. A major focus of this technological review is to collate up-to-date information regarding the design and implementation of various bioinformatics Webs, Grids, Web-based grids or Grid-based webs in terms of their infrastructure, standards, pro- tocols, services, applications and other tools. This review, besides surveying the current state-of-the-art, will also provide a road map for future research and open questions. Keywords: Semantic web, web services, web agents, grid computing, middleware, in silico experiments. 1. INTRODUCTION networks, instruments, software applications and other com- puter related hardware equipment. These information and ‘The impact of computing on biology can fairly be con- resource-sharing capabilities of web and grid technologies sidered a paradigm change as biology enters the 21st cen- could upgrade a single user computer into a global tury. In short, computing and information technology ap- supercomputer with vast computational power and storage plied to biological problems is likely to play a role for 21st capacity. The so called upgraded web and grid-enabled century biology that is in many ways analogous to the role global super computer would make itself a potential candi- that molecular biology has played across all fields of bio- date to be used in resource-hungry computing domains. For logical research for the last quarter century and computing example, it could be used to efficiently solve complex calcu- and information technology will become embedded within lations such as parameter sweep scenario with Monte Carlo biological research itself’ [1]. simulation and modeling techniques, which would normally As a visualization of the above referred conclusion re- require several decades of execution time on a traditional garding the embedding of computing and information tech- single desktop processor. nology (IT) in biological research, one can look at the cur- A quick look at the literature reveals that web and grid rent state-of-the-art in web and grid technologies as applied technologies are continuously being taken up by the biologi- to bioinformatics, computational biology and systems biol- cal community as an alternate to traditional monolithic high ogy. The World Wide Web or simply the web has revolu- performance computing mainly because of the inherent na- tionized the field of IT and related disciplines, by providing ture of biological resources (distributed, heterogeneous and information-sharing services on top of the internet. Simi- CPU intensive), smaller financial costs, better flexibility, larly, grid technology has revolutionized the field of comput- scalability and efficiency offered by the web and grid- ing by providing location-independent resource sharing- enabled environment. An important factor that provides the services such as computational power, storage, databases, justification behind the ever growing use of web and grid technologies in life sciences is the continuous and rapid in- *Address correspondence to this author at the ASAP Group, School of crease in biological data production. About eight years ago it Computer Science, University of Nottingham, Jubilee Campus, NG81BB, has been approximated that the amount of information pro- Nottingham, UK; Tel: +44 115 8467592; Fax: +44 115 8467066; E-mail: duced by a single gene laboratory could be as higher as 100 [email protected] 1574-8936/08 $55.00+.00 © 2008 Bentham Science Publishers Ltd. 2 Current Bioinformatics, 2008, Vol. 3, No. 1 Shah et al. terabytes, which is equivalent to about 1 million encyclope- Some other large scale bioinformatics grid projects have dias [2]. Although being extensive, these data also differ in provided a platform where a biologist can design and run terms of storage and access technologies and are dispersed complex in silico experiments by combing several distrib- throughout the world. Currently, there are no uniform stan- uted and heterogeneous resources that are wrapped as web- dards, or at least not yet been adopted properly by the bio- services. Examples of these are myGrid [10, 11], BioMOBY logical community as a whole, to deal with the diverse na- [12-14], Seqhound (http://www.blueprint.org /seqhound) and ture, type, location and storage formats of this data. On the Biomart (http://www.biomart.org) etc., which allow for the other hand, in order to obtain the most comprehensive and automatic discovery and invocation of many bioinformatics competitive results, in many situations, a biologist may need applications, tools and databases such as EMBOSS [15] suite to access several different types of data which are publicly of bioinformatics applications and some other publicly avail- available in more than 700 [3] biomolecular databases. One able services from the National Center for Biomedical In- way to handle this situation would be to convert the required formatics (NCBI) (http://www.ncbi. nlm.nih.gov) and Euro- databases into a single format and then store it on a single pean Bioinformatics Institute (EBI) (http://www.ebi. ac.uk). storage device with extremely large capacity. Considering These projects also provide some special toolkits with neces- the tremendous size and growth of data, this solution seems sary application programming interfaces (APIs), which could to be infeasible, inefficient and very costly. The application be used to transform any legacy bioinformatics application of Web and Grid technology provides an opportunity to into a web-service that can be deployed on their platforms. standardize the access to these data in an efficient, automatic The availability of these BioGrid projects brought into sharp and seamless way through the use of appropriate DataGrid focus the need for better user interfaces as to provide the middleware technologies. These technologies include grid biologist with easier access to these web/grid resources. This middleware specific Data Management Services (DMS), has led to the development of various web based interfaces, distributed storage environments such as Open Source Grid portals, workflow management systems, problem solving Services Architecture-Data Access and Integration (OGSA- environments, frameworks, application programming envi- DAI) (http://www.ogsadia.org) with Distributed Query Proc- ronments, middleware toolkits, data and resource manage- essing (OGSA-DQP), Storage Resource Broker (SRB) [4] ment approaches along with various ways of controlling grid and IBM DiscoveryLink [5] middleware etc. access and security. This review attempts to provide an up- to-date coherent and curated overview of the most recent Furthermore, it is also very common for a typical bio- advances in these technologies as applied to life sciences. logical application that involves very complex analysis of large-scale datasets and other simulation related tasks to de- The review aims at providing a complementary source of additional information to some previous reviews in this field mand for high throughput computing power in addition to such as [16, 17]. seamless access to very large biological datasets. The tradi- tional approach towards this solution was to purchase ex- The organization of the paper is as follows: section 2 tremely costly special-purpose super computers or dedicated presents an overview of the state-of-the-art