Studies on Distributed Approaches for Large Scale Multi-Criteria Protein Structure Comparison and Analysis

Azhar Ali Shah, MPhil

Thesis submitted to The University of Nottingham for the degree of Doctor of Philosophy

September 2010

Abstract

Protein Structure Comparison (PSC) is at the core of many important structural biology problems. PSC is used to infer the evolutionary history of distantly related proteins; it can also help in the identification of the biological function of a new protein by comparing it with other proteins whose function has already been annotated; PSC is also a key step in protein structure prediction, because one needs to reliably and efficiently compare tens or hundreds of thousands of decoys (predicted structures) when evaluating 'native-like' candidates (e.g. in the Critical Assessment of Techniques for Protein Structure Prediction (CASP) experiment). Each of these applications, as well as many others where molecular comparison plays an important role, requires a different notion of similarity, which naturally leads to the Multi-Criteria Protein Structure Comparison (MC-PSC) problem. ProCKSI (www.procksi.org) was the first publicly available server to provide algorithmic solutions for the MC-PSC problem by means of an enhanced structural comparison that relies on the principled application of information fusion to similarity assessments derived from multiple comparison methods (e.g. USM, FAST, MaxCMO, DaliLite, CE and TMAlign). The current MC-PSC implementation works well for moderately sized data sets, but it is time consuming as it provides a public service to multiple users. Many of the structural applications mentioned above would benefit from the ability to perform, for a dedicated user, thousands or tens of thousands of comparisons through multiple methods in real-time, a capacity beyond our current technology. This research is aimed at the investigation of Grid-styled distributed computing strategies for the solution of the enormous computational challenge inherent in MC-PSC. To this aim a novel distributed algorithm has been designed, implemented and evaluated with different load balancing strategies and with the selection and configuration of a variety of software tools, services and technologies on different levels of infrastructure, ranging from local testbeds to production-level eScience infrastructures such as the National Grid Service (NGS). Empirical results of different experiments reporting on the scalability, speedup and efficiency of the overall system are presented and discussed, along with the software engineering aspects behind the implementation of a distributed solution to the MC-PSC problem based on a local computer cluster as well as on a Grid implementation. The results lead us to conclude that the combination of better and faster parallel and distributed algorithms with more similarity comparison methods provides an unprecedented advance in protein structure comparison and analysis technology. These advances might facilitate both directed and fortuitous discovery of protein similarities, families, super-families, domains, etc., and also help pave the way to faster and better protein function inference, annotation and protein structure prediction and assessment, thus empowering the structural biologist to do science that he/she would not have done otherwise.

Acknowledgements

Doctoral studies, being a journey of discovery, would not be possible without the enabling technologies in the form of the help and support of so many individuals. The first and foremost support comes from one's guide/supervisor in the form of expert advice, interest, enthusiasm and encouragement. It is a matter of fact that not all supervisors provide all this support, and hence history shows that even as brilliant a student as Einstein had to change his supervisor¹. In this regard I consider myself fortunate to have Professor Natalio Krasnogor as my supervisor. It always leaves me wondering how Prof. Krasnogor has managed to shape his personality to have all the aspects of an ideal and perfect supervisor besides being a very kind, caring, friendly and sociable person! I know that a proper answer to this question might require me to write another thesis, so I would rather leave it a mystery and instead offer my utmost gratitude and recognition for his excellent mentorship in the form of regular constructive feedback on my verbal and written ideas and drafts while sharing and discussing various issues, outcomes and future plans of my research from the very beginning to the completion of this thesis. Besides this, I am also grateful to him for acquainting me with a great team/network of co-workers and for providing all the resources needed for the experimental setup, as well as financial support to attend several research and professional trainings and to present my research papers at various conferences across the globe. I know that this is not an exhaustive listing of what I should be thankful to Prof. Krasnogor for; indeed there are many (let me say countless) other things which I could not mention here but for which I am truly indebted to him. It has been an honor for me to have Prof. Krasnogor as my supervisor.

While writing several co-authored research papers I had the opportunity to leverage the expertise of many well-known academics and researchers associated with the ASAP group, including Professor Jacek Blazewicz and Piotr Lukasiak (Poznan University, Poland), Dr. Daniel Barthel (ex-ASAPee, now team leader in Rexroth, part of the Bosch group, Germany) and Dr. Gianluigi Folino (CNR-ICAR, University of Calabria, Italy). I am thankful for their critical comments, suggestions and help in improving the publications, which were accepted at the targeted journals and conferences. Had it not been for the much-needed technical support of the Technical Service Group (TSG), I would have lost several days and even months dealing with all the technical problems that arose from time to time while setting up my local Linux cluster with the different software packages, tools and services needed for different experiments. In this regard, a big thank you to TSG, especially Nick Reynolds, William Armitage and Viktor Huddleston. In addition to TSG, I would also like to thank Drs. Pawel Widera, German Terrazas Angulo, James Smaldon and Amr Soghier for their help with Linux, LaTeX and other software-related issues. I would also like to acknowledge the use of the CNR-ICAR cluster in Italy and the UK National Grid Service (NGS) in carrying out this work. I am also thankful to all the academic and admin staff members at the School of Computer Science in general, and the members and admin team of the ASAP group in particular, for their cooperation in various activities during the complete period of my research.

A similar token of gratitude goes to all the members of the academic staff and admin team at the Institute of Information and Communication Technology (ICT), University of Sindh (where I hold an academic position and was granted study leave and a scholarship for my PhD) for their help and support in various administrative aspects. Special thanks to Drs. Abdul Wahab Ansari, Imdad Ali Ismaili, Khalil Khombati and the Vice-Chancellor, Mr. Mazhar-ul Haq Siddiqui. I would also like to acknowledge the University of Sindh for funding my study through scholarship (SU/PLAN/F.SCH/794). Many thanks to all my friends who in some way or other provided the social fabric needed by a foreigner living far away from his family. All the friends studying at Nottingham as well as at so many other universities in England and other countries (connected through different e-groups such as sindhischolars, scholars_uk and hec_overseas @yahoogroups.com) have been of great help. Finally, thanks to those for whom 'thanks' is too small a word to say; I would rather dedicate this complete thesis, which is the outcome of my endeavors as well as their patience, prayers and support, to my late father, mother, in-laws, wife, kids, siblings (especially my elder brother Noor Muhammad Shah for his utmost love, guidance, encouragement and support) and relatives. I think, being the very first person from our family to have a PhD, I have earned something for us to be proud of!

¹It is said that in 1901 Einstein transferred his thesis supervision from H. F. Weber to Alfred Kleiner and changed his dissertation topic from thermoelectric effects to molecular kinetics.

PS: 5th July 2010: at midnight I was informed of a great loss in my life – my mother left this earth – weeping and crying I caught a flight back home but was too late to see her – a bad side of doing a PhD abroad! May God rest her soul in peace!

Azhar Ali Shah
Nottingham, England, September 2010.

For Asfar, Maryam, Maidah, Shazia, Ada N.M. Shah and my (late) parents.


Contents

Abstract

Acknowledgements

List of Figures

List of Tables

Glossary of Acronyms

1 Introduction
  1.1 Introduction
  1.2 Challenges of MC-PSC
  1.3 Research Objectives
  1.4 General Methodology
  1.5 Thesis Organization
  1.6 List of Contributions
  1.7 Publications
  1.8 Conclusions

2 Survey of Web and Grid Technologies in Life Sciences
  2.1 Introduction
  2.2 Web Technologies
    2.2.1 Semantic Web Technologies
    2.2.2 Web Service Technologies
    2.2.3 Agent-based Semantic Web Services
  2.3 Grid Technologies
    2.3.1 BioGrid Infrastructure
    2.3.2 Grid-enabled Applications and Tools
    2.3.3 Grid-based BioPortals
    2.3.4 BioGrid Application Development Toolkits
    2.3.5 Grid-based Problem Solving Environments
    2.3.6 Grid-based Workflow Management Systems
    2.3.7 Grid-based Frameworks
    2.3.8 BioGrid Data Management Approaches
    2.3.9 Computing and Service Grid Middleware
    2.3.10 Local Resource Management System (LRMS)
    2.3.11 Fault Tolerant Approaches in the Context of BioGrid
  2.4 Some Flagship BioGrid Projects
    2.4.1 Enabling Grids for E-sciencE (EGEE) Project
    2.4.2 Organic Grid: Self Organizing Computational Biology on Desktop Grid
    2.4.3 Advancing Clinico-Genomic Trials on Cancer (ACGT)
  2.5 Conclusions

3 Overview of Grid and Distributed Computing for Structural Proteomics
  3.1 Introduction
  3.2 Protein Folding
  3.3 Protein Structure Prediction
  3.4 Protein Structure Comparison
    3.4.1 ProCKSI
    3.4.2 ProCKSI's Existing Architecture and Limitations
  3.5 Conclusions

4 Materials and Methods
  4.1 Introduction
  4.2 Methodological Approach
  4.3 Programming Environment
  4.4 Testbed and Production Level eScience Infrastructure
  4.5 Datasets
  4.6 Performance Metrics
    4.6.1 Grid speedup and efficiency
  4.7 Conclusions

5 A High-throughput Distributed Framework for MC-PSC
  5.1 Introduction
  5.2 Design and Implementation
  5.3 Decomposition Strategies
  5.4 Cost Analysis
    5.4.1 Space Analysis
    5.4.2 Time Analysis
    5.4.3 Discussion of the Theoretical Cost Analysis
  5.5 Experimental Results and Discussions
    5.5.1 Datasets and Test Suite
    5.5.2 Scalability of the Even Decomposition
    5.5.3 Scalability of the Uneven Decomposition
  5.6 A Further Experiment on a Large Dataset
  5.7 Conclusions

6 Evaluation of the MC-PSC Algorithms under IRMEs
  6.1 Introduction
  6.2 Integrated Resource Management Environment (IRME) for MPI Jobs
    6.2.1 Sun Grid Engine (SGE)
  6.3 Testbed and Datasets
    6.3.1 Datasets
  6.4 Results and Discussions
  6.5 Conclusions

7 On the Scalability of the MC-PSC in the Grid
  7.1 Introduction
  7.2 Deployment on the NGS Infrastructure
  7.3 Results and Discussions
  7.4 Conclusions

8 Storage, Management and Analysis of (Multi) Similarity Data
  8.1 Introduction
  8.2 Motivations
  8.3 Protein Structure (Multi) Similarity Data
    8.3.1 Estimation of Self-Similarity Values
    8.3.2 Estimation of Non-Self-Similarity Values
  8.4 Overview of the Core Data Storage, Management and Analysis Technologies
  8.5 Design and Implementation of the Proposed Architecture
  8.6 Experimental Results and Discussions
  8.7 Conclusions

9 Consensus-based Protein Structure Similarity Clustering
  9.1 Introduction
  9.2 Clustering Algorithms
  9.3 Consensus Approaches
    9.3.1 Total evidence
    9.3.2 Total consensus
  9.4 Protein Kinase Dataset
  9.5 Results and Discussions
  9.6 Conclusions

10 Conclusions and Future Work
  10.1 Introduction
  10.2 Contributions
  10.3 Future Directions

References

Appendices
  .1 List of Protein Structure Comparison Methods
  .2 Software Related Technical Problems and Solutions

List of Figures

1.1 Stages in the derivation of a protein's classification

2.1 Major architectural components of a biological DataGrid
2.2 Computational grid architecture
2.3 Major components of a generic BioGrid infrastructure
2.4 Review of technological infrastructure for life sciences
2.5 Hierarchical organization of the state-of-the-art overview
2.6 A biologist's view of classical and semantic web
2.7 Job flow in the EGEE grid
2.8 ACGT integrated environment usage scenario

3.1 Flow diagram of ProCKSI's front-end
3.2 ProCKSI's multi-method protocol and workflow
3.3 ProCKSI's current hardware infrastructure
3.4 ProCKSI's a) request submission b) scheduling flowcharts
3.5 ProCKSI's a) pre/post-processing b) result retrieval flowcharts

4.1 3D-Cube representation of the MC-PSC problem space
4.2 PCAM technique
4.3 Amdahl's law

5.1 Software architecture of the distributed framework
5.2 Even distribution of the problem space
5.3 Uneven distribution of the problem space
5.4 Speedup of the even decomposition
5.5 Execution time vs number of proteins
5.6 Execution time vs number of residues
5.7 Execution time vs number of residues: CK34 dataset
5.8 Execution time vs number of residues: RS119 dataset
5.9 Speedup of the uneven decomposition
5.10 Speedup of the uneven decomposition using the Kinjo ...

6.1 Three main components of a Local Resource Management System
6.2 Computational time for PSC algorithm
6.3 Parallel jobs running under integrated resource management

7.1 Experimental setup
7.2 Grid speedup
7.3 Grid efficiency
7.4 Single-site and cross-site speedup
7.5 Single-site and cross-site efficiency
7.6 Cross-site overhead

8.1 Why HDF5
8.2 HDF5 file format
8.3 Architectural design of the storage, ...

9.1 Simple MC-PSC process
9.2 Total-evidence approach
9.3 Total consensus approach
9.4 Single method clustering: USM, MaxCMO/Align
9.5 Single method clustering: MaxCMO/OverLap, Dali/Z
9.6 Single method clustering: Dali/RMSD, Dali/Align
9.7 Single method clustering: CE/Align, CE/RMSD
9.8 Single method clustering: CE/Z, FAST/SN
9.9 Single method clustering: FAST/RMSD, FAST/Align
9.10 Single method clustering: TMAlign/RMSD, TMAlign/Align
9.11 Single method clustering: TMAlign/TM-Score
9.12 Total evidence (TE) vs Total consensus (TC): a) TE/All/Z, b) TC/All/Z
9.13 Total evidence (TE) vs Total consensus (TC): a) TE/All/Align, b) TC/All/Align
9.14 Total evidence (TE) vs Total consensus (TC): a) TE/All/RMSD, b) TC/All/RMSD
9.15 Total evidence (TE) vs Total consensus (TC): a) TE/All/All, b) TC/All/All

List of Tables

1.1 Building Blocks for Multi-Criteria Protein Structure Comparison

2.1 Semantic-web and ontology based resources for life science
2.2 BioGrid infrastructure projects
2.3 Some Major Grid-enabled Applications related to life sciences
2.4 Examples of commonly used Grid-based workflow management systems
2.5 Commonly used computing and service grid middleware
2.6 Commonly used Job Management Systems

3.1 Grid-enabled applications for Protein Folding
3.2 Grid-based applications for protein structure prediction
3.3 Grid-based protein structure comparison

4.1 Overview of commonly used systems for parallel programming
4.2 Description of MPI Routines
4.3 Overview of commonly used MPI implementations

5.1 Nomenclature used for the analysis
5.2 Overview of the datasets used in the experiments
5.3 Average execution times and standard deviation of the different methods

6.1 Distinct configuration parameters for Loose vs Tight integration
6.2 Hardware Configuration of the Cluster Grid
6.3 Datasets used in the experiments
6.4 Resource usage comparison for FAST algorithm
6.5 Time distribution: tight integration with SMPD daemon-based method
6.6 Time distribution: tight integration with SMPD daemon-less method

8.1 Datasets used in the experiments. PDB_SELECT30 ...
8.2 Methods used to detect (Multi) Similarity among Protein Structure Datasets
8.3 HDF5 vs Oracle Storage/Query benchmarking with different datasets
8.4 Standard Deviations Calculated on 15 Query Times for HDF5 and Oracle

9.2 Single Method vs Total consensus vs Total Evidence

Glossary of Acronyms

BIRN       Biomedical Informatics Research Network
BRIDGES    Biomedical Research Informatics Delivered by Grid Enabled Services
DAG        Directed Acyclic Graph
OGSA       Open Grid Service Architecture
OGSA-DAI   Open Grid Service Architecture-Data Access and Integrator
DMS        Data Management Service
DQP        Distributed Query Processor
EBI        European Bioinformatics Institute
EDG        European DataGrid
EGEE       Enabling Grids for E-sciencE
EMBOSS     European Molecular Biology Open Software Suite
EMBRACE    European Model for Bioinformatics Research and Community Education
GEBAF      Grid Enabled Bioinformatics Application Framework
GLAD       Grid Life Science Application Developer
GSI        Grid Security Infrastructure
GUI        Graphical User Interface
JDL        Job Description Language
LCG2       LHC Computing Grid
LRMS       Local Resource Management System
LSF        Load Sharing Facility
LSID       Life Science Identifiers
MIAME      Minimum Information About a Microarray Experiment
NCBI       National Center for Biotechnology Information
NGS        National Grid Service
OWL        Web Ontology Language
PBS        Portable Batch System

PSE        Problem Solving Environment
RDF        Resource Description Framework
RFTP       Reliable File Transfer Protocol
SGE        Sun Grid Engine
SOAP       Simple Object Access Protocol
SRB        Storage Resource Broker
SWRL       Semantic Web Rule Language
UDDI       Universal Description, Discovery and Integration
WISDOM     Wide In Silico Docking On Malaria
WSDL       Web Service Description Language
WSRF       Web Service Resource Framework
XML        eXtensible Markup Language

CHAPTER 1

INTRODUCTION

This thesis presents the outcome of the exploratory and investigative Studies on Distributed Approaches for Large Scale Multi-Criteria Protein Structure Comparison and Analysis in ten self-contained chapters. This chapter, being the first one, sets the stage by introducing the research topic and explaining why this topic was chosen for study. Besides introducing the research topic, this chapter also provides a succinct overview of the research objectives, methodology and the structure of the thesis.

1.1 Introduction

"The organic substance which is present in all constituents of the animal body, also as we shall soon see, in the plant kingdom, could be named ’protein’ from a Greek word ’proteios’ meaning ’of primary importance’" [1].

In 1839 the above words first appeared in an article authored by Gerardus Johannes Mulder (1802-1880), a Dutch organic chemist [1]. Mulder wrote these words after correspondence with Jons Jakob Berzelius (1779-1848), a Swedish chemist who originally coined the term Protein based on Mulder's observations that 'all proteins have the same empirical formula and are composed of a single type of molecule'. Since this time proteins have remained among the most actively studied molecules in biochemistry, and the understanding of their structure and function has remained an essential question. It was only in 1955 that a British biochemist, Frederick Sanger (1918), introduced an experimental method to identify the sequence (primary structure) of a protein named 'insulin' [2]. This achievement entitled Sanger to win his first Nobel Prize in 1958. This discovery was followed by another breakthrough by Sir John Cowdery Kendrew (1917-1997), an English biochemist and crystallographer, and Max Perutz (1914-2002), an Austrian-British molecular biologist, who discovered the three-dimensional structure (tertiary structure) of two proteins named 'myoglobin' [3] and 'hemoglobin' [4] respectively, and were co-awarded the 1962 Nobel Prize in chemistry for these achievements. It was around this time that Christian Anfinsen (1916-1995), an American biochemist, developed his theory of protein folding, which explains the process by which a protein sequence coils (folds) into a more stable and unique three-dimensional structure (native conformation) [5–7]. Anfinsen's theory of protein folding provided the basis for protein structure prediction and also raised the need for structure-based comparison and analysis of proteins [8–10]. He was awarded the 1972 Nobel Prize for this work. A few years later, in 1977, Sanger became successful in sequencing the complete genome of Phage Phi X 174, a virus (bacteriophage) that infects bacteria [11]. This provided the basis for the Human Genome Project (HGP) [12] and hence entitled Sanger to share his second Nobel Prize in chemistry in 1980 with two American biochemists, Walter Gilbert (1932) and Paul Berg (1926). The successful completion of the HGP in 2003 led to various worldwide structural genomic and proteomic initiatives such as the Structural Genomics Consortium (SGC) [13], the Protein Structure Initiative (PSI) [14], and the Human Proteome Organization (HUPO) [15], amongst others. These initiatives are targeted at lowering the cost and enhancing the efficiency of the experimental determination or computational prediction of novel protein 3D structures. As a consequence, there is an exponentially growing number of protein sequences and 3D structures being deposited in publicly available databases such as the Universal Protein Resource (UniProt) [16, 17] and the Protein Data Bank (PDB) [18] respectively. In order for this data to be analyzed, understood and utilized properly, the need for automated, efficient and reliable software tools and services, especially for determining the structural similarities/dissimilarities among all known structures, becomes indispensable. The knowledge of structural (dis)similarities obtained from the comparison of protein structures is mainly used in core biomedical research activities including structure-based drug design [19], protein structure prediction/modeling [20–22], classification [23,24], molecular docking algorithms [25] and other structural bioinformatics applications.

Several methods and tools have been developed to investigate the (dis)similarities among protein structures [26]. Not surprisingly, there is no agreement on how to optimally define what similarity/distance means, as different definitions focus on different biological criteria such as sequence or structural relatedness, evolutionary relationships, chemical functions or biological roles, and these are highly dependent on the task at hand. This observation calls for an explicit identification and understanding of the various stages involved in the assessment of proteins' similarities. As illustrated in Figure 1.1, the first four stages, which have dominated the research in protein structure comparison so far, are: similarity conception, model building, mathematical definition and method implementation. Interestingly, the fifth stage, where one would seek to leverage the strength of a variety of methods by using appropriate consensus and ensemble mechanisms, has barely been investigated. One such approach has recently been introduced by means of the Protein (Structure) Comparison, Knowledge, Similarity and Information (ProCKSI) web server [27]. Using a set of modern decision making techniques, ProCKSI automatically integrates the operation of a number of the most popular comparison methods (as listed in Table 1.1) and provides an integrated consensus that can be used to obtain a more reliable assessment of similarities for protein datasets. The consensus-based results obtained from ProCKSI take advantage of the 'collective wisdom' of all the individual methods (i.e., the biases and variances of a given method are compensated by the other methods' biases and variances) and minimize the chances of falsely attributing similarity to (sets of) proteins. That is, false positives are more frequent at the individual method level because most globally different proteins usually still share some common substructures.

While trying to cope with the above mentioned scientific challenge (Figure 1.1) in the field of computational molecular biology, the performance of ProCKSI becomes the bottleneck for moderately large and very large datasets (section 1.2 provides further details of this computational challenge). A thorough survey of the related literature reveals that, in order to improve the overall performance, three routes are usually followed [28]: (a) the development of new algorithms or the redesign/modification of existing ones based on faster heuristic techniques [29, 30]; (b) the development of special purpose ROM-based hardware chips [31, 32]; and (c) the use of parallel and distributed computing. Routes (a) and (b) can only be applied in very specific cases, as they require considerable in-depth knowledge of a problem or substantial economic resources respectively. The third alternative, the utilization of distributed and parallel computation, is becoming a more ubiquitous approach as, in some cases, distributed/parallel solutions to one problem can be reused (with slight modifications) in other problems. Moreover, due to ongoing advances in processor and networking technologies, the scope of parallel computing is also extending from traditional supercomputers to massively parallel computers, clusters of workstations (COW) and even clusters of clusters, i.e. grid computing [33–35]. This paradigm shift in the provision of parallel computing facilities affords scalability at very low cost. Furthermore, in terms of code maintenance and code portability, distributed computing fares better than traditional supercomputers, and several successful applications to the bio-sciences are discussed in [36–44].

Notwithstanding the above successes, grid computing has no magic applicability formula and many different distribution/parallelization solutions might exist for a given problem. Which of these strategies would be the best one to use will depend to a large extent not only on the specific problem structure but also on factors such as the choice/availability of particular hardware, software, interconnection types, security protocols and human resources. This thesis investigates several approaches that could provide a solution to the computational challenge of MC-PSC using various distributed approaches, as described in chapter 4. The theoretical and empirical results of these investigations are discussed in the successive chapters, as outlined in section 1.5.

1.2 Challenges of MC-PSC

As described above, ProCKSI is an online automated system that implements a protocol for MC-PSC. In particular, it allows the user to submit a set of protein structures and perform either all-against-all or target-against-all protein comparisons with the methods listed in Table 1.1. ProCKSI combines the results of pairwise comparisons delivered by the various available methods, normalizes them and presents a consensus form of the results through an intuitive web-based visual interface. Furthermore, it gathers information about the proteins being compared through hyperlinks to external sources of information, e.g. Information Hyperlinked Over Proteins (IHOP) [51], the Structural Classification of Proteins (SCOP) [52], and Class, Architecture, Topology and Homology (CATH) [53]. As demonstrated in [27], and previously suggested in [54] and [55], the ensemble and consensus based approach adopted by ProCKSI yields more reliable results of biological significance as compared to the results obtained with any single structure comparison method developed so far. However, the integration of multiple methods for protein structure comparison, on the one hand, coupled with a rapidly growing number of 3D structures in the Protein Data Bank (PDB), on the other hand, gives rise to a computational challenge that is far beyond the capabilities of a single standard workstation or a small group of workstations, especially if one would like to perform a multi-criteria comparison for very large datasets in real-time. That is, as the number of protein structures being compared increases, the corresponding number of pairwise comparison jobs, I/O files and directories, and the computational time and memory required for each comparison method and its associated pre-processing (e.g. data extraction and contact map preparation) and post-processing (e.g. consensus generation, clustering and result visualization) methods also increase. An estimate of some of these complexities is presented in the following sections.

FIGURE 1.1: Stages in the derivation of a protein's classification: (1) Decide what "similarity" means, which is a declarative and problem-dependent step. (2) Heuristically build a model of similarity based on (1). This new similarity/distance conception will have its own bias, variance and outliers. (3) Decide whether this idealized model will be instantiated as a distance/similarity measure or metric. (4) One or more algorithms are implemented in order to calculate (3), which can be solved exactly and in polynomial time only in the simplest of cases. The more interesting similarity definitions, however, give rise to complex problems requiring heuristic/approximate algorithms for their solution. (5) Combining many different methods with different views of similarity produces a multi-competence pareto-front, from which a consensus picture might be derived. In turn, this allows the structural biologist to (6) cluster and classify proteins reliably. Furthermore, in order to provide the most efficient (real-time) results based on the philosophy of (5), the need for the data and computation to be distributed and executed in a grid environment becomes indispensable.

TABLE 1.1: Building Blocks for Multi-Criteria Protein Structure Comparison. The name and references for each of the methods are shown in column "Method", followed by the column "Notion of similarity", where the specific notions that each method uses to determine (dis)similarity are mentioned. Columns "Computational techniques" and "Resulting measures/metrics" summarize how each similarity is computed and in what form it is returned. The last column gives an indication of the relative computational requirements (time) of the different methods. Key: AL = Number of Alignments; OL = Number of Overlaps; Z = Z-Score; TMS = TM-align Score; SN = Normalized Score. * The average CPU time for a single pair of protein structures on a standard P4 (1.86 GHz, 2GB RAM) dual-core machine. Thus the total average execution time taken by all six methods (with a total of 15 different similarity measures/metrics) for the comparison of a single pair of protein structures is 8.54 secs, plus some additional time for performing I/O.

Method | Notion of similarity | Computational techniques | Resulting measures/metrics | Time* [sec]
DaliLite [45] | intramolecular distances | distance matrices, combinatorial, simulated annealing | AL, Z, RMSD | 3.33
MaxCMO [46] | overlap between contact maps | Variable Neighborhood Search | AL, OL | 3.32
CE [47] | inter-residue distances, rigid body superposition | heuristics, dynamic programming | AL, Z, RMSD | 1.27
USM [48] | Kolmogorov complexity | compression utilities | USM-distance | 0.34
TM-align [49] | inter-atomic distances | rotation matrix, dynamic programming | AL, RMSD, TMS | 0.21
FAST [50] | inter-residue distances | heuristics, dynamic programming | RMSD, AL, SN | 0.07
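The normalization and information-fusion steps referred to above are fully specified in [27]; the fragment below is only a minimal illustrative sketch, which assumes min-max normalization of each method's similarity matrix followed by an unweighted average (both of these choices are assumptions made here for illustration, not ProCKSI's actual procedure).

```python
# Minimal sketch of multi-method similarity fusion (illustration only).
# Assumption: min-max normalization per method, then an unweighted average;
# ProCKSI's actual normalization and consensus procedures are given in [27].
import numpy as np

def normalize(matrix: np.ndarray) -> np.ndarray:
    """Rescale one method's pairwise similarity matrix to the range [0, 1]."""
    lo, hi = matrix.min(), matrix.max()
    if hi == lo:
        return np.zeros_like(matrix, dtype=float)
    return (matrix - lo) / (hi - lo)

def consensus(matrices: list[np.ndarray]) -> np.ndarray:
    """Average the normalized similarity matrices produced by all methods."""
    return np.mean([normalize(m) for m in matrices], axis=0)
```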

Job Complexity

Job complexity for protein structure comparison depends on the size (i.e. the number of structures) of the dataset/database at hand as well as on the mode of comparison. As of the writing of this dissertation, there are 64,036 protein structures in the PDB and this number grows steadily. If we compare a particular protein against all the proteins in a given dataset (e.g. the PDB), this is referred to as the target-against-all mode of comparison. While being the simplest mode, it is usually used to compare a protein of unknown function but known structure with those whose structures and functions are known. The results of the comparison would provide clues regarding the function of the query protein. The number of pairwise comparison jobs in this mode is directly related to the number of structures in the target dataset. For example, given the current holdings of the PDB, there will be 64,036 comparison jobs when using the target-against-all mode of comparison. However, in the case of multi-criteria comparison the actual number of jobs will be the number of target structures × the number of methods being used for multi-comparison.

Another mode of comparison is the one in which we compare all the elements of a particular dataset among themselves or with all the elements of another dataset. This mode is referred to as all-against-all comparison and is mostly used to cluster/classify a group of structures. The resulting clustering/classification is aimed at revealing the functional and evolutionary similarities among the proteins. The number of pairwise comparison jobs in this mode is proportional to the square of the number of protein structures involved in the comparison × the number of methods (note that some methods return different similarities for the comparison of Pi with Pj and for the reverse comparison). For example, the comparison jobs for the current holdings of the PDB using the all-against-all mode with only one method will be:

N_j = n² = 64,036² = 4,100,609,296

where N_j represents the number of pairwise comparison jobs and n is the current number of protein structures available in the PDB.

As mentioned above, the actual number of jobs will be 4,100,609,296 × the number of methods being used. Therefore, an optimal way is required to distribute all these jobs in the form of smaller subsets (working packages) that can be submitted for parallel/distributed execution. Needless to say, this complexity calls for a high performance computing solution. Please note that protein structure prediction methods, e.g. Robetta [56] and I-TASSER [57], often sample thousands of "decoys" that must be compared and clustered together at each iteration of the algorithm so as to obtain a centroid structure. Thus comparing thousands or tens of thousands of protein structures is not limited to assessing the PDB only, but actually occurs as a sub-problem in many other structural bioinformatics activities.
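As a quick sanity check of these figures, the snippet below reproduces the job-count arithmetic for the two comparison modes; the function names are hypothetical and are not part of any ProCKSI code.

```python
# Illustrative job-count estimator for MC-PSC (figures follow the text above).

def target_against_all_jobs(n_structures: int, n_methods: int = 1) -> int:
    """One query structure compared against every structure, per method."""
    return n_structures * n_methods

def all_against_all_jobs(n_structures: int, n_methods: int = 1) -> int:
    """Every structure compared against every structure (n^2 pairs), per method.
    n^2 rather than n*(n-1)/2 is used because some methods are not symmetric:
    comparing Pi with Pj can return a different value from comparing Pj with Pi."""
    return n_structures ** 2 * n_methods

if __name__ == "__main__":
    n = 64_036                                     # PDB holdings at the time of writing
    print(target_against_all_jobs(n))              # 64,036 jobs, single method
    print(all_against_all_jobs(n))                 # 4,100,609,296 jobs, single method
    print(all_against_all_jobs(n, n_methods=6))    # multi-criteria, six methods
```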

Time Complexity

Different protein structure comparison algorithms have different time complexities and run time profiles. Table 1.1 provides an indicative comparison between the times taken by the algorithms we used in our experiments for a typical protein pair. Arguably, depending on the length of the members of a protein pair, the times mentioned in the table would change. However, they can be used to give a rough estimate of the run time profile that can be expected from these algorithms (a more detailed analysis of run times is provided in later chapters):

• target-against-all: for a given protein structure compared against the 64,036 structures in the PDB (assuming only one chain per PDB file), a multi-criteria comparison with the methods available in Table 1.1, consuming the times mentioned in the fifth column, would take 37.98 days on a P4 (1.86 GHz, 2GB RAM) dual-core workstation.

• all-against-all: if one were to execute this type of comparison for the entire PDB, this would result in 2,798,939,025 pairwise comparison jobs (assuming again one chain per PDB file) and it would take about 6662.71 years for all jobs to finish on a single machine.

Space Complexity

Executing potentially millions of pairwise protein structure comparison jobs has strict requirements in terms of memory and bandwidth allocation. MC-PSC jobs generate a very large number of output data files that need to be parsed and summarized in a way that not only enables the execution of the normalization and consensus steps but also falls within the memory constraints of the available computational infrastructure. With the current number of protein structures in the PDB, and the total number of comparison measures/metrics for all six methods (Table 1.1), there may be as many data items in the resultant (dis)similarity matrix as:

n² × (N_mt + 2) = 64,036² × 17 = 69,710,358,032

where n again represents the current number of protein structures in the PDB, N_mt represents the total number of measures/metrics (see Table 1.1) and the additional 2 accounts for the two protein IDs involved in each comparison. Using a minimum of 5 digits/characters to hold each data item, it may require about 348 GB to hold the matrix. Given the size of this matrix, it becomes indispensable to compute and hold its values in a distributed environment and to use parallel I/O techniques to assemble each distributed portion directly at an appropriate storage location.
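The snippet below reproduces this storage estimate; the constants (15 measures/metrics plus 2 protein IDs, and 5 characters per item) are taken from the text above, and the function name is hypothetical.

```python
# Rough storage estimate for the distributed (dis)similarity matrix,
# reproducing the arithmetic above.

def similarity_matrix_bytes(n_structures: int,
                            n_measures: int = 15,
                            chars_per_item: int = 5) -> int:
    items = n_structures ** 2 * (n_measures + 2)   # +2 for the two protein IDs
    return items * chars_per_item

if __name__ == "__main__":
    n = 64_036
    total = similarity_matrix_bytes(n)
    # 348,551,790,160 bytes, i.e. roughly the 348 GB quoted above
    print(f"{total:,} bytes (~{total / 1e9:.1f} GB)")
```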

The above back-of-the-envelope calculations point to the need for a high-performance solution to the MC-PSC problem as described in the next section.

1.3 Research Objectives

Based on the above problem description, this dissertation seeks to provide a step change in computing capabilities through a suite of grid-styled distributed computing technologies, parallelizing the existing code to bring closer the dream of real-time multi-comparisons of very large protein structure datasets. The dissertation also aims at providing a reflection on the software engineering aspects behind the implementation of a distributed solution to the MC-PSC problem based on a local computer cluster as well as on a Grid implementation. The investigation of several computational strategies, approaches and techniques is aimed at substantially enhancing ProCKSI's functionality and interoperability with other services, thus extending its applicability to a wider audience. The combination of better and faster parallel and distributed algorithms with more similarity comparison methods is deemed to represent an unprecedented advance in protein structure comparison technology. Thus, the advances presented in this thesis might allow both directed and fortuitous discovery of protein similarities, families, super-families, domains, etc., on the one hand, and help pave the way to faster and better protein function inference, annotation and protein structure prediction and assessment, on the other.

1.4 General Methodology

The work presented in this dissertation follows several methodologies, namely:

1. Wide and deep literature survey and analysis: there is a plethora of literature available on the use of parallel and distributed approaches for a variety of applications in the field of life sciences. A comprehensive survey and analysis of these approaches provides not only state-of-the-art knowledge in terms of technological development but also helps in learning lessons from others' experiences and in selecting proven approaches for further investigation.

2. An engineering approach to distributed and Grid computing: this type of methodology provides the insight needed to set up the distributed and grid computing infrastructure based on the lessons learned from the first methodology. Starting from the infrastructure of a local cluster and testing it with a variety of options, through to working on the National Grid Service, requires all the nitty-gritty technical skills of engineering the Grid.

3. An experimental approach for evaluating the implementations: the implementations of the proposed parallel and distributed approaches for the solution of the MC-PSC problem are evaluated based on the results of the experiments, using the standard measures and metrics employed in the community.

4. A self-reflective commentary on the ease or otherwise of working with the Grid: the design, implementation, and evaluation of each approach is discussed in the relevant sections in terms of the complexity of working with the Grid.

1.5 Thesis Organization

The design, implementation and evaluation of the newly built distributed architecture, along with related case studies and background material, is presented in the form of self-contained chapters as outlined below:

Chapter 2: Survey of Web and Grid Technologies in Life Sciences

This chapter provides a broad survey of how the life science community as a whole has benefited from the proliferation of web and grid technologies. The reason that the study of the World Wide Web (www), or simply the web, is carried out along with grid technology lies in the fact that the direction of current development in both of these technologies is coalescing towards an integrated and unified Web-based grid service or Grid-based web service environment [58]. That is, a proper understanding of the current state-of-the-art in grid computing requires a thorough understanding of many concepts, standards and protocols related to web technologies. Therefore, this chapter reports on the developments in both of these technologies as they pertain to the general landscape of Life Sciences. As many of the problems pertaining to different fields of life sciences usually share a common set of characteristics in terms of their computational requirements, this broad survey was essential for investigating the most common technological solutions to the problem discussed in section 1.2. The major focus of this technological review was to collate up-to-date information regarding the design and implementation of various bioinformatics Webs, Grids, Web-based grids or Grid-based webs in terms of their infrastructure, standards, protocols, services, applications and other tools. The review, besides surveying the current state-of-the-art, also provides a road map for future research and open questions. However, due to the flood of literature prevailing under the heading of this chapter, the need arose for another chapter focusing specifically on the narrower field of structural proteomics, as described in the next section.

Chapter 3: Overview of the Grid and Distributed Computing for Structural Proteomics

This chapter aims to provide a distilled overview of some of the major projects, technologies and resources employed in the area of structural proteomics. The major emphasis is to comment on various approaches related to the gridification and parallelization of some flagship legacy applications, tools and data resources related to key structural proteomics problems such as protein structure prediction, folding and comparison. The comments are based on a theoretical analysis of some interesting parameters such as performance gain after gridification, user-level interaction environments, workload distribution and the choice of deployment infrastructure and technologies. Furthermore, this chapter also provides a detailed description of the ProCKSI server's existing architecture, which is essential for a comprehensive understanding of the major contributions of this dissertation as presented in the following chapters.

Chapter 4: Materials and Methods

This chapter provides the description of some basic principles and procedures that have been used to carry out this research. In particular it explains the representation of the problem space and various approaches used for its partitioning. It also provides the description of different infrastructures and datasets used for the experimental analysis of the newly proposed architecture.

Chapter 5: High-throughput Distributed Framework for MC-PSC

This chapter describes the design and implementation of a high-throughput distributed re-implementation of ProCKSI for very large data sets. The core of the new framework lies in the design of an innovative distributed algorithm that runs on each compute node in a cluster/grid environment to perform structure comparison of a given subset of input structures using some of the most popular PSC methods (e.g. USM, MaxCMO, FAST, DaliLite, CE and TMalign). This is followed by the procedure of distributed consensus building. Thus the new algorithm proposed here achieves ProCKSI's similarity assessment quality but in a fraction of the time required by it. Experimental results show that the proposed distributed method can be used efficiently to compare a) a particular protein against a very large protein structure data set (target-against-all comparison), and b) a particular very large scale dataset against itself or against another very large scale dataset (all-against-all comparison). The overall speedup and efficiency of the system is further optimized with different load balancing strategies that reduce the percentage of load imbalance on each node. A comparative picture of these load balancing strategies is also described in full detail along with their empirical results. Performance evaluation of the new system with different alternative Local Resource Management Systems (LRMSs) and MPI implementations was also carried out in order to choose the right enabling technologies from several different alternatives, as described in the next chapter.

Chapter 6: Performance Evaluation under Integrated Resource Management Environment

This chapter evaluates the effect on the performance of MC-PSC jobs when the MPI environment is integrated with a Local Resource Management System (LRMS) such as Sun Grid Engine (SGE) or Portable Batch System (PBS), using different implementations of the MPI standard such as MPICH and OpenMPI. Experiments with different ways of integration provide a comparative picture of the possible approaches, with a description of the resource usage information for each parallel job on each processor. Understanding the different ways of integration sheds light on the most promising routes for setting up an efficient environment for very large scale protein structure comparisons.

Chapter 7: On the Scalability of the MC-PSC in the Grid

Based on the lessons learned in previous chapters by evaluating the performance of the high-throughput distributed framework for MC-PSC, the next step is to make use of Grid computing to overcome the limitations of a single parallel computer/cluster. However, the use of Grid computing also introduces additional communication overhead which needs to be taken into consideration. This chapter describes the experiments performed on the UK National Grid Service (NGS) to evaluate the scalability of the distributed algorithm across multiple sites. The results of the cross-site scalability are compared with single-site and single-machine performance to analyze the additional communication overhead in a wide-area network.

Chapter 8: On the Storage, Management and Analysis of (Multi) Similarity Data for Large Scale Protein Structure Datasets in the Grid

This chapter briefly describes some of the techniques used for the estimation of missing/invalid values resulting from the process of multi-comparison of very large scale datasets in a distributed/grid environment. This is followed by an empirical study of the storage capacity and query processing time required to cope with the results of such comparisons. In particular, the storage/query overhead of two commonly used database technologies, the Hierarchical Data Format (HDF5) and a Relational Database Management System (Oracle/SQL), is investigated and compared. These experiments were conducted on the UK National Grid Service (NGS).

Chapter 9: Consensus based Protein Structure Similarity Clustering

This chapter compares the results of two competing paradigms for consensus development, i.e. Total evidence and Total consensus. It uses the Protein Kinase Dataset to perform the classification with both of these approaches and discusses the pros and cons of each approach.

Chapter 10: Conclusions and Future Work

This chapter builds on the aggregation of the individual conclusions from all of the chapters and provides a holistic view of the overall conclusions of this thesis. It also provides some directions for future work.

1.6 List of Contributions

This thesis is a direct contribution to the field of bio-sciences in general and Multi-Criteria Protein Structure Comparison and Analysis in particular. It focuses on the development of a novel computational framework for MC-PSC based on grid-styled distributed computing strategies. To this end, a number of contributions were made, which are listed below:

• State-of-the-art literature review: Technological evolution has brought about converging effects in computing, and hence a thorough understanding of one paradigm requires the knowledge and understanding of many other paradigms which are interrelated and are interweaving to form a single whole. The scope of distributed computing is broadening from parallel to Grid computing on the one hand and, on the other hand, the architecture of Grid computing is moving towards integration with the world wide web to form what is called the World Wide Grid. This dissertation contributes a comprehensive survey of the scholarly articles, books and other resources related to the evolution and application of Grid/distributed computing in the field of bio-sciences in general and structural proteomics in particular. Besides identifying the areas of prior scholarship, this review also points the way forward for future research.

• Design, implementation and evaluation of a novel distributed framework for MC-PSC: The solution framework for the enormous computational challenge inherent in the nature of MC-PSC is the major contribution of this dissertation. The framework is based on a novel distributed algorithm that scales to any number of available nodes in a cluster/grid environment. It uses the local storage/memory of each node to store the multi-similarity data in a distributed environment and hence tackles both the data- and compute-intensive nature of MC-PSC, making the processing of large scale datasets possible.

• Empirical analysis of different load balancing strategies for MC-PSC: Given the fact that the size/length of protein structures varies from a few tens of amino acids to several hundreds, the execution time differs hugely between different pairwise comparisons. This variation in execution time renders a simple decomposition based on an equal number of pairwise comparisons per node inefficient due to heavy load imbalance. This thesis presents an empirical analysis of this load imbalance and proposes an efficient load balancing strategy that distributes the load based on the number of residues (a minimal illustrative sketch of such a residue-weighted assignment is given after this list).

• Studies on the integration of parallel and local resource management environments: In order to reflect on the engineering aspects of the distributed environment, this dissertation provides an empirical analysis of different integrated environments for parallel and local resource management systems. The results of this analysis can significantly contribute to the design and implementation of an optimal distributed computing infrastructure needed for the execution of distributed algorithms/applications.

• Grid scalability evaluation for distributed MC-PSC: The deployment of the distributed application on the UK's National Grid Service (NGS) infrastructure and the evaluation of its scalability across multiple clusters provide a large scale analysis of the effect of gridification. This analysis provides insights on how to overcome the limitations of a single site/cluster and further enhance the performance of the system by leveraging the resources of multiple sites.

• Comparison and evaluation of different database technologies for the proposed MC-PSC science center: The proposal of a science center for MC-PSC based on the pre-computed results of multi-similarity data is another contribution of this dissertation. To this aim, a survey, analysis and evaluation of different database technologies is presented in the context of MC-PSC.

• Comparison and evaluation of different consensus-based approaches for MC-PSC: The overall goal of MC-PSC is to provide a consensus-based view of protein structure similarity. Since there are many different approaches for building the consensus, this dissertation contributes a comparative picture of the two most widely used paradigms, i.e. Total Evidence (TE) and Total Consensus (TC).
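The residue-based load balancing strategy itself is detailed in Chapter 5; the fragment below is only a minimal illustrative sketch, which assumes that the cost of a pairwise comparison grows with the product of the two chain lengths (an assumption made here for illustration) and assigns comparison pairs to compute nodes with a greedy longest-processing-time heuristic.

```python
# Minimal sketch of residue-aware load balancing (illustration only; the
# thesis's actual uneven-decomposition strategy is described in Chapter 5).
import heapq

def assign_pairs(pairs, residues, n_nodes):
    """Greedily assign comparison pairs to nodes, largest estimated cost first.

    pairs    : list of (i, j) structure indices to compare
    residues : residues[i] = number of residues in structure i
    n_nodes  : number of compute nodes
    Returns one list of pairs per node.
    """
    cost = lambda p: residues[p[0]] * residues[p[1]]   # assumed cost model
    heap = [(0, node) for node in range(n_nodes)]      # (current load, node id)
    heapq.heapify(heap)
    buckets = [[] for _ in range(n_nodes)]
    for pair in sorted(pairs, key=cost, reverse=True):
        load, node = heapq.heappop(heap)               # least-loaded node so far
        buckets[node].append(pair)
        heapq.heappush(heap, (load + cost(pair), node))
    return buckets
```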

1.7 Publications

During the course of this thesis, the following peer reviewed publications were also produced:

Peer Reviewed Conference Papers:

1. G. Folino, A. A. Shah, and N. Krasnogor: On the Scalability of Multi-Criteria Protein Structure Comparison in the Grid. In: Proceedings of the Euro-Par 2010 Workshop on High Performance Bioinformatics and Biomedicine (HiBB), August 31 - September 3, 2010, Ischia, Naples, Italy.

2. G. Folino, A. A. Shah, and N. Krasnogor: On the Storage, Management and Analysis of (Multi) Similarity for Large Scale Protein Structure Datasets in the Grid. In: Proceedings of IEEE CBMS 2009, the 22nd IEEE International Symposium on Computer-Based Medical Systems, August 3-4, 2009, Albuquerque, New Mexico, USA.

3. A. A. Shah, G. Folino, D. Barthel and N. Krasnogor: Performance Evaluation of Protein (Structure) Comparison Algorithms under Integrated Resource Environment for MPI Jobs. In: Proceedings of the International Symposium on Parallel and Distributed Processing with Applications (ISPA '08), ISBN: 978-0-7695-3471-8, pp. 817-822, IEEE Computer Society, 2008.

4. A. A. Shah, D. Barthel and N. Krasnogor: Grid and Distributed Public Computing Schemes for Structural Proteomics. In: P. Thulasiraman et al. (Eds.): Frontiers of High Performance Computing and Networking, ISPA 2007 Workshops, Lecture Notes in Computer Science, Vol. 4743, pp. 424-434, Springer-Verlag Berlin Heidelberg, 2007.

5. A. A. Shah, D. Barthel and N. Krasnogor: Protein Structure Comparison, Clustering and Analysis: An Overview of the ProCKSI Decision Support System. In: Proceedings of the International Symposium on Biotechnology (ISB2007), University of Sindh, Pakistan, November 4-8, 2007.

Peer Reviewed Journal Papers:

1. A. A. Shah, G. Folino, and N. Krasnogor: Towards a High-Throughput, Multi-Criteria Protein Structure Comparison. IEEE Transactions on NanoBioscience, Vol. 9(2), pp. 144-155, 2010. [doi:10.1109/TNB.2010.2043851]

2. A. A. Shah, D. Barthel, P. Lukasiak, J. Blazewicz and N. Krasnogor: Web and Grid Technologies in Bioinformatics, Computational Biology and Systems Biology: A Review. Current Bioinformatics, Vol. 3(1), pp. 10-31, Bentham Publishers, 2008.

1.8 Conclusions

This chapter provided a general introduction to the thesis. It started with the background of MC-PSC and described in detail the dimensions of the problem (the computational challenge) that it faces and the main objectives of this research. The general methodology for the proposed solution has been described, along with the outline and scope of the material presented in the rest of the thesis. The chapter also mentioned the key contributions of this dissertation along with a list of publications. The next couple of chapters provide a comprehensive survey/review of the literature, which forms the foundation for the subsequent chapters.

CHAPTER 2

SURVEY OF WEB AND GRID TECHNOLOGIES IN LIFE

SCIENCES

Chapter 1 provided a succinct description of the research topic and also explained the objectives and structure of the thesis in general. This chapter and the one that follows review the relevant literature. The review presented in these two chapters starts from a wide perspective of the field and then converges and focuses on the specific topic of this research, summarizing the findings that could help in conducting it.

This chapter was published as a peer reviewed journal article in Current Bioinformatics,

Vol. 3(1), pp.10-31, 2008. [doi:10.2174/157489308783329850].

2.1 Introduction

"The impact of computing on biology can fairly be considered a paradigm change as bi- ology enters the 21st century. In short, computing and information technology applied to biological problems is likely to play a role for 21st century biology that is in many ways analogous to the role that molecular biology has played across all fields of bio- logical research for the last quarter century and computing and information technology will become embedded within biological research itself" [59].

As an example of the above-referred conclusion regarding the embedding of computing and information technology (IT) in biological research, one can look at the state-of-the-art in web and grid technologies as applied to bioinformatics, computational biology and systems biology.

The World Wide Web, or simply the web, has revolutionized the field of IT and related disciplines by providing information-sharing services on top of the internet. Similarly, grid technology has revolutionized the field of computing by providing location-independent resource-sharing services such as computational power, storage, databases, networks, instruments, software applications and other computer-related hardware equipment. These information- and resource-sharing capabilities of web and grid technologies could upgrade a single-user computer into a global supercomputer with vast computational and communication power and storage capacity; access to very large-scale datasets, application programs and tools is a further benefit. Such an upgraded, web- and grid-enabled global supercomputer is a natural candidate for resource-hungry computing domains. For example, it could be used to efficiently carry out complex calculations, such as a parameter sweep over Monte Carlo simulation and modelling runs, which would normally require several days, months, years or even decades of execution time on a traditional single desktop processor.
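To make the parameter-sweep idea concrete, the following minimal Python sketch (an illustration only, with made-up parameter values rather than anything from the cited systems) runs independent Monte Carlo trials for each sweep point over a pool of local worker processes; in a grid setting each tuple would instead become an independently scheduled job.

```python
import random
from multiprocessing import Pool

def monte_carlo_trial(args):
    """One sweep point: estimate the mean of a noisy (made-up) model at parameter p."""
    p, n_samples, seed = args
    rng = random.Random(seed)
    total = sum(p * rng.gauss(1.0, 0.1) for _ in range(n_samples))
    return p, total / n_samples

if __name__ == "__main__":
    # Each tuple is an independent work unit; on a grid each one would be a separate job.
    sweep = [(p / 10.0, 100_000, p) for p in range(1, 21)]
    with Pool(processes=4) as pool:
        for p, estimate in pool.map(monte_carlo_trial, sweep):
            print(f"parameter={p:.1f}  estimate={estimate:.4f}")
```

Because the trials share no state, the same decomposition scales from a local pool to thousands of grid jobs without changing the per-task code.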

A quick look at the literature reveals that web and grid technologies are continuously being taken up by the biological community as an alternative to traditional monolithic high performance computing, mainly because of the inherent nature of biological resources (distributed, heterogeneous and CPU intensive) and the smaller financial costs, better flexibility, scalability and efficiency offered by a web- and grid-enabled environment. An important factor that provides the justification behind the ever growing use of web and grid technologies in life sciences is the continuous and rapid increase in biological data production. It is believed that a typical laboratory can generate approximately 100 terabytes of information in a year [60], which is considered to be equivalent to about 1 million encyclopedias. The problem of managing these rapidly growing large datasets is further compounded by their heterogeneous and distributed nature (varying in terms of storage and access technologies). Currently there are no uniform standards (or at least none that have been adopted properly by the biological community as a whole) to deal with the diverse nature, type, location and storage formats of these data. On the other hand, in order to obtain the most comprehensive and competitive results, in many situations a biologist may need to access several different types of data which are publicly available in more than 700 [61] biomolecular databases. One way to handle this situation would be to convert the required databases into a single format and then store them on a single storage device with extremely large capacity. Considering the tremendous size and growth of the data, this solution is infeasible, inefficient and very costly. The application of web and grid technology provides an opportunity to standardize access to these data in an efficient, automatic and seamless way by providing an intermediary bridge architecture, as shown in Figure 2.1. The so-called intermediary bridge makes use of appropriate web and grid technologies and standards such as the grid middleware specific Data Management Service (DMS), distributed storage environments such as the Open Grid Services Architecture - Data Access and Integration (OGSA-DAI) framework (http://www.ogsadai.org) with its Distributed Query Processor (DQP), the Storage Resource Broker (SRB) [62] and the IBM DiscoveryLink [63] middleware, etc.

Furthermore, it is also very common for a typical biological application that involves very complex analysis of large-scale datasets and other simulation-related tasks to demand high throughput computing power in addition to seamless access to very large biological datasets. The traditional approach was to purchase extremely costly special-purpose supercomputers or dedicated clusters. This type of approach is both costly and somewhat limited, as it locks in the type of computing resources. Another problem associated with this approach is the poor utilization of very costly resources, i.e. if a particular application finishes its execution then the resources could remain idle.

FIGURE 2.1: Major architectural components of a biological DataGrid environment (reproduced from [64] and annotated). The middle layer, which consists of grid and web technologies, provides interoperability between biological applications and various types of data.

Grid technology, on the other hand, provides a more dynamic, scalable and economical way of achieving as much computing power as needed through computational grid infrastructures connected to a scientist's desktop machine. There are many institutional, organizational, national and international Data/Computational/Service Grid testbeds and well established production grid environments which provide these facilities free of charge to their respective scientific communities. Some of these projects include Biomedical Research Informatics Delivered by Grid Enabled Services (BRIDGES) (http://www.brc.dcs.gla.ac.uk/projects/bridges), EGEE (http://public.eu-egee.org) [65], the Biomedical Informatics Research Network (BIRN) (http://www.nbirn.net), the National Grid Service UK [66], OpenBioGrid Japan [64], SwissBioGrid [67], the Asia Pacific BioGrid (http://www.apgrid.org), the North Carolina BioGrid (http://www.ncbiotech.org), etc. All these projects consist of an internet-based interconnection of a large number of pre-existing individual computers or dedicated clusters located at the various distributed institutional and organizational sites that are part of the consortium. Figure 2.2 illustrates some major architectural components of such a computational setup. Each grid site in the network usually has a pool of compute elements managed by some local resource management software such as Sun Grid Engine (SGE: http://gridengine.sunsource.net), the Portable Batch System (PBS: http://www.openpbs.org), the Load Sharing Facility (LSF: www.platform.com/Products/Platform.LSF) and Condor (www.cs.wisc.edu/condor). Similarly, grid storage elements are accessed via data management services and protocols such as GridFTP, the Reliable File Transfer Protocol (RFTP) and OGSA-DAI.

FIGURE 2.2: Computational grid architecture: internet-based interconnection of heterogeneous and dis- tributed individual workstations, dedicated clusters, high performance computing (HPC) servers and clusters of distributed PCs (Desktop PC Grid)
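To give a flavour of how a compute element behind such a local resource manager is driven in practice, the following simplified Python sketch writes a PBS-style batch script and submits it with qsub via subprocess; the job name, resource request and commented-out BLAST payload are hypothetical and would differ between sites and schedulers.

```python
import subprocess
import tempfile

# A minimal batch script; the resource request syntax varies between PBS, SGE and LSF.
JOB_SCRIPT = """#!/bin/bash
#PBS -l nodes=1:ppn=1,walltime=00:10:00
#PBS -N demo_job
cd "$PBS_O_WORKDIR"
echo "running on $(hostname)"
# blastall -p blastp -d swissprot -i query.fasta -o query.out   # hypothetical payload
"""

def submit(script_text: str) -> str:
    """Write the script to a temporary file and submit it with qsub, returning the job id."""
    with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as fh:
        fh.write(script_text)
        path = fh.name
    result = subprocess.run(["qsub", path], capture_output=True, text=True, check=True)
    return result.stdout.strip()

if __name__ == "__main__":
    print("submitted job:", submit(JOB_SCRIPT))
```

In a grid deployment this per-site submission step is normally hidden behind the middleware's resource broker, which decides to which compute element the job is forwarded.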

Some other large scale bioinformatics grid projects have provided a platform where a biologist can design and run complex in-silico experiments by combining several distributed and heterogeneous resources that are wrapped as web services. Examples of these are myGrid [68, 69], BioMOBY [70–72], Seqhound (http://www.blueprint.org/seqhound) and Biomart (http://www.biomart.org), which allow for the automatic discovery and invocation of many bioinformatics applications, tools and databases such as the European Molecular Biology Open Software Suite (EMBOSS) [73] of bioinformatics applications and some other publicly available services from the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov) and the European Bioinformatics Institute (EBI) (www.ebi.ac.uk). These projects also provide special toolkits with the necessary application programming interfaces (APIs), which can be used to transform any legacy bioinformatics application into a web service that can be deployed on their platforms.

The availability of these BioGrid projects brought into sharp focus the need for better user interfaces so as to provide the biologist with easier access to these web/grid resources. This has led to the development of various web based interfaces, portals, workflow management systems, problem solving environments, frameworks, application programming environments, middleware toolkits, and data and resource management approaches, along with various ways of controlling grid access and security. Figure 2.3 provides a pictorial view of these technologies in the context of the BioGrid architecture. This review attempts to provide an up-to-date, coherent and curated overview of the most recent advances in web and grid technologies as they pertain to life sciences. The review aims at providing a complementary source of additional information to some previous reviews in this field, such as [74, 75].

FIGURE 2.3: Major components of a generic BioGrid infrastructure: there are three main layers; the application, the middleware and the physical layer. The application layer focuses on APIs, toolkits, portals, workflows etc., the middleware layer focuses on application, data and resource management and services, while the physical layer provides the actual compute and data resources.

Among the many advances that the computational sciences have provided to the life sciences, the proliferation of web and grid technologies is one of the most conspicuous. Driven by the demands of biological research, these technologies have moved from their classical and somewhat static architecture to a more dynamic and service-oriented architecture. The direction of current development in these technologies is coalescing towards an integrated and unified web-based grid service [58] or grid-based web service environment, as shown in Figure 2.4. Accompanying this rapid growth, a huge diversity of approaches to implementation and deployment routes has been investigated in relation to the use of various innovative web and grid technologies for the solution of problems related to the life sciences. The following sections provide an overview of some of these works as per the organization of Figure 2.5; sections 2.2 and 2.3 present a comprehensive review of web and grid technologies respectively; section 2.4 describes the architecture, implementation and services provided by a selection of flagship projects. Finally, section 2.5 presents concluding remarks on the reviewed literature with a clear indication of certain key open problems with the existing technological approaches and provides a road-map and open questions for the future.

FIGURE 2.4: Review of the technological infrastructure for life sciences: the classical HTML based web started in 1991 and the traditional Globus based grid was introduced by Ian Foster in 1997. With the introduction and development of the semantic web, web services and web agents in and after 2001, the new web and grid technologies are being converged into a single uniform platform, termed the 'service-oriented autonomous semantic grid', that could satisfy the needs of HT (high throughput) experimentation in diverse fields of life sciences, as depicted above.

2.2 Web Technologies

2.2.1 Semantic Web Technologies

FIGURE 2.5: Hierarchical organization of the state-of-the-art overview of web and grid technologies. The numbers in brackets point to the related references that have been reviewed.

One of the most important limitations of the information shared through classical web technology is that it is only interpretable by humans, and hence it limits the automation required for more advanced and complex life science applications that may need the cascaded execution of several analytical tools with access to distributed and heterogeneous databases. The basic purpose of semantic web technology is to eliminate this limitation by enabling the machine (computer) to interpret/understand the meaning (semantics) of the information and hence allow artificial intelligence based applications to make decisions autonomously. It does so by adding some important features to the basic information-sharing service provided by classical web technology. These features provide a common format for the interchange of data through standard languages and data models such as the eXtensible Markup Language (XML) and the Resource Description Framework (RDF), along with several variants of schema and semantics based markup languages such as the Web Ontology Language (OWL) and the Semantic Web Rule Language (SWRL). Wang et al. [76] argue that although XML was initially used as a data standard for platform-independent exchange and sharing of data, because of its basic syntactic and document-centric nature it was found limited, especially for the representation of rapidly increasing and diverse 'omic' data. Therefore, RDF together with some newer variants of OWL such as OWL-Lite, OWL-DL and OWL-Full are currently being adopted, as illustrated in Figure 2.6. Various attempts towards the use of the semantic web for life sciences have been reported in the literature, mainly focusing on data/application integration, data provenance, knowledge discovery, machine learning and mining etc. For example, Satya et al. [77] discuss the development of a semantic framework based on publicly available ontologies such as GlycO and ProPreO that could be used for modeling the structure and function of enzymes, glycans and pathways. The framework uses a sublanguage of OWL called OWL-DL [78] to integrate extremely large (~500 MB) and structurally diverse collections of biomolecules. One of the most important problems associated with the integration of biological databases is that of their varying degree of inconsistency. Because of this, there have also been efforts to provide external semantics-based tools for measuring the degree of inconsistency between different databases. One such effort is discussed in [79], which describes an ontology-based method that uses a mathematical function to determine the compatibility between two databases based on the results of semantically matching them against a reference.
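As a minimal illustration of the kind of machine-readable annotation these RDF/OWL technologies enable (the namespace and property names below are invented for the example and are not taken from GlycO, ProPreO or any other published ontology), the following Python sketch uses the rdflib library to describe a protein record as RDF triples and serialize it in Turtle:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

# Hypothetical namespace for the example; a real application would reuse a published ontology.
BIO = Namespace("http://example.org/bio#")

g = Graph()
g.bind("bio", BIO)

protein = BIO["P12345"]
g.add((protein, RDF.type, BIO.Protein))                       # typed resource
g.add((protein, RDFS.label, Literal("Hypothetical kinase")))  # human-readable label
g.add((protein, BIO.hasFunction, BIO.PhosphorylationActivity))
g.add((protein, BIO.foundInOrganism, Literal("Homo sapiens")))

# Serialize as Turtle; recent rdflib versions return a string, older ones return bytes.
print(g.serialize(format="turtle"))
```

Because every statement is a subject-predicate-object triple identified by URIs, graphs produced by different groups can be merged and queried uniformly, which is precisely the integration property the surveyed projects rely on.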

FIGURE 2.6: A biologist's view of the classical and semantic web. The classical web does not provide automated (computational) integration and interoperability between the various sources of data/applications/services, and the biologist needs to do all of this manually. The semantic web, on the other hand, frees the biologist from a lot of this manual work. (Reproduced from [80]).

The autonomous and uniform integration, invocation and access to biological data and resources provided by the semantic web have also created an environment that supports the use of in-silico experiments. Proper and effective use of in-silico experiments requires the maintenance of user-specific provenance data such as a record of the goals, hypotheses, materials, methods, results and conclusions of an experiment. For example, Zhao et al. [81] showcase the design of an RDF-based provenance log for a typical in-silico experiment that performs DNA sequence analysis as a part of the myGrid [68, 69] middleware services. The authors report the use of Life Science Identifiers (LSID) for achieving location-independent access to distributed data and meta-data resources; RDF and OWL have been used for associating uniform semantic information and relationships between resources, while Haystack [82], a semantic web browser, has been used for delivering the provenance-based web pages to the end-user. Compared to XML, the RDF model provides a more flexible, graph-based resource description with location-independent resource identification through URIs (Uniform Resource Identifiers).

There are various other significant contributions that illustrate the use of semantic web technology for the proper integration and management of data in the context of bioinformatics, computational biology and systems biology. For example, the Gene Ontology Consortium [83, 84] makes use of semantic web technologies to provide a central gene ontology resource for the unification of biological information; [85, 86] use OWL-DL to develop a data exchange format that facilitates the integration of biological pathway knowledge; and [87, 88] use the semantic web to provide a resource for the development of tools for microarray data acquisition and query according to the concepts specified in the Minimum Information About a Microarray Experiment (MIAME) standard [89]. Further information about some semantic web and ontology based biological applications and tools for life sciences is provided in Table 2.1.

TABLE 2.1: Semantic-web and ontology based resources for life sciences

• Biological Pathway Exchange (BioPAX [82,83]): data exchange format for biological pathway data. http://www.biopax.org/
• Microarray for Gene Expression Data (MGED [84,85]): data standard for systems biology. http://www.mged.org/
• Transparent Access to Multiple Bioinformatics Information Sources (TAMBIS [86]): biological data integration. http://img.cs.man.ac.uk/tambis
• Software Development Kit for cancer informatics (caCORE SDK [87]): semantically integrated bioinformatics software system. http://ncicb.nci.nih.gov/infrastructure/cacoresdk
• AutoMatic Generation of Mediator Tools for Data Integration (AutoMed Toolkit [88]): tools for assisting transformation and integration of distributed data. http://www.doc.ic.ac.uk/automed/
• Gaggle [90]: an integrated environment for systems biology. http://gaggle.systemsbiology.org/docs/
• Encyclopedia of Escherichia coli K-12 Genes and Metabolism (EcoCyc [91]): molecular catalog of the E. coli cell. http://ecocyc.org
• Systems Biology Markup Language (SBML [92]): computer-readable models of biochemical reaction networks. http://sbml.org/index.psp
• Cell Markup Language (CellML [93]): storage and exchange of computer-based mathematical models for biomolecular simulations. http://www.cellml.org/
• Open Biomedical Ontology (OBO [90]): open source controlled vocabularies for different biomedical domains. http://obo.sourceforge.net
• Gene Ontology (GO [68]): controlled vocabulary for genes. http://www.geneontology.org/
• Generic Model Organism Database (GMODS [94]): an integrated organism database. http://www.gmod.org/home
• Proteomics Standards Initiative: Molecular Interactions (PSI-MI [95]): data standard for proteomics. http://psidev.sourceforge.net

2.2.2 Web Service Technologies

Web-service technology further extends the capabilities of the classical web and the semantic web by allowing information and resources to be shared among machines even in a distributed, heterogeneous environment (such as a grid environment). Therefore, applications developed as web services can interoperate with peer applications regardless of the particular language, file system, operating system, processor or network. Web services are defined through the Web Service Description Language (WSDL) and deployed and discovered through the Universal Description, Discovery and Integration (UDDI) protocol. They can exchange XML-based messages through the Simple Object Access Protocol (SOAP) across different computer platforms. Furthermore, with the introduction of the Web Service Resource Framework (WSRF), web services have become capable of storing state information during the execution of a particular transaction. These features have made web services extremely attractive for the life science domain.
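The client-side pattern for consuming such a service is illustrated by the short Python sketch below, which uses the zeep library to read a WSDL document and invoke an operation as an ordinary method call; the WSDL URL and the getSequence operation are placeholders rather than a real published service.

```python
from zeep import Client

# Placeholder WSDL location; a real client would point at a published service description,
# e.g. one of the EBI or BioMOBY WSDL documents mentioned in the text.
WSDL_URL = "http://example.org/sequence-service?wsdl"

def fetch_sequence(accession: str) -> str:
    """Build a client from the WSDL and call a (hypothetical) getSequence operation."""
    client = Client(WSDL_URL)
    # zeep exposes each WSDL operation as a method on client.service;
    # SOAP envelope construction and response parsing happen behind the scenes.
    return client.service.getSequence(accession=accession)

if __name__ == "__main__":
    print(fetch_sequence("P12345"))
```

The point of the pattern is that the WSDL, not the client code, carries the interface description, so the same few lines work against any conforming service endpoint.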

Today many life science applications are being developed as web services. For example, the National Center for Biotechnology Information (NCBI) provides a wide range of biological databases and analytical tools as web services, such as all the Entrez e-utilities, including EInfo, ESearch, EPost, ESummary, EFetch, ELink, MedLine and PubMed. Similarly, the EBI provides many biological resources as web services, such as SoapLab, WSDbfetch, WSFasta, WSBlast, WSInterProScan and EMBOSS, amongst others. A comprehensive list of all publicly available and accessible biological web services developed by different organizations, institutions and groups can be found at the myGrid website (http://taverna.sourceforge.net).
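As a small illustration of how the NCBI services named above are typically consumed from a script, the sketch below uses Biopython's Entrez module, which wraps the E-utilities over their URL-based interface (the e-mail address and query term are placeholders):

```python
from Bio import Entrez

# NCBI asks clients of the E-utilities to identify themselves; placeholder address.
Entrez.email = "someone@example.org"

# ESearch: find protein records matching a (hypothetical) query term.
handle = Entrez.esearch(db="protein", term="human kinase", retmax=5)
search = Entrez.read(handle)
handle.close()
print("Matching IDs:", search["IdList"])

# ESummary: retrieve short summaries for the records found above.
if search["IdList"]:
    handle = Entrez.esummary(db="protein", id=",".join(search["IdList"]))
    for summary in Entrez.read(handle):
        print(summary["Id"], summary.get("Title", ""))
    handle.close()
```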

All these web services can be used as parts of complex application specific programs. IBM provides the WebSphere Information Integrator (WS II) as an easy way for developers to integrate individual web service components into larger programs. As an example, the North Carolina BioGrid [96], in collaboration with IBM, uses web services to integrate several bioinformatics applications into a high performance grid computing environment. This BioGrid also provides a tool (WSDL2Perl) to facilitate the wrapping of legacy bioinformatics applications as web services. Other open source projects that provide the registry, discovery and use of web services for bioinformatics, computational biology and systems biology include myGrid [68, 69], BioMOBY [70–72] and caBIG (https://cabig.nci.nih.gov/). The integration and interoperability of distributed and heterogeneous biological resources through web services has opened an important niche for data mining and knowledge discovery. For example, Hahn et al. [97] introduce web based reusable text mining middleware services for biomedical knowledge discovery. The middleware provides a Java based API for clients to call searching and mining services. Similarly, Hong et al. [98] use web services for the implementation of a microarray data mining system for drug discovery.

Due to the success of web services in providing flexible, evolvable and scalable architectures with interoperability between heterogeneous applications and platforms, grid middleware is also being transformed from its pre-web-service versions to new ones based on web services. There are several initiatives in this direction: Globus [34, 35], an open source grid middleware, has adopted the web service architecture in its current version, Globus Toolkit 4; the EGEE project is moving from its pre-web-service middleware, the LHC Computing Grid (LCG2), to a new web service based middleware, gLite (http://glite.web.cern.ch/glite/) [65]; and similarly ICENI [99] and myGrid are also adopting web services through the use of Jini, OGSI, JXTA and other technologies [100].

2.2.3 Agent-based Semantic Web Services

Agents are described as software components that exhibit autonomous behavior and are able to communicate with their peers in a semantically defined high-level language such as FIPA-ACL (Foundation for Intelligent Physical Agents - Agent Communication Language). Since the main focus of agent technology is to enable software components to perform certain tasks on behalf of the user, this closely relates to and supports the goals of web service technology, and hence the two technologies have started converging towards the development of more autonomous web services that exhibit the behavior of both web services and agents.

There have been many attempts regarding the use of agents in bioinformatics, computational biology and systems biology. Merelli et al. [101] report the use of agents for the automation of bioinformatics tasks and processes, phylogenetic analysis of diseases, protein secondary structure prediction, stem cell analysis, and simulation, among others. In their paper, the authors also highlight key open challenges in agent research: analysis of mutant proteins, laboratory information management systems (LIMS), cellular process modeling, and formal and semi-formal methods in bioinformatics. Similarly, the use of mobile agents for the development of a decentralized, self-organizing peer-to-peer grid computing architecture for computational biology has been demonstrated in [97].

Another important use of agents in combination with the semantic web and web services lies in the provision of service-oriented semantic grid middleware. For example, Luc et al. [102] suggest the use of agents in the myGrid [68, 69] middleware in order to best fit the ever-dynamic and open nature of biological resources. In particular they propose the use of agents for 'personalization, negotiation and communication'. The personalization agent can act on behalf of the user to automatically provide certain preferences, such as the selection of preferred resources for a workflow based in-silico experiment, and other user related information. The user-agent can store these preferences and other user related information from previously conducted user activities, thus freeing the user from tedious repetitive interactions. The user-agent could also provide a point of contact for notification and other services requiring user interaction during the execution of a particular experiment. Other experiences related to the use of agents for biological data management and annotation have been discussed in [103, 104].

2.3 Grid Technologies

Right from its inception, the main focus of grid technology has been to provide a platform-independent, global and dynamic resource-sharing service in addition to co-ordination, manageability and high performance. In order to best satisfy these goals, its basic architecture has undergone substantial changes to accommodate other emergent technologies. As shown in Figure 2.4, the grid has moved from its initial static, pre-web-service architecture to a more dynamic Web Service Resource Framework (WSRF) based Open Grid Service Architecture (OGSA) [105] that combines existing grid standards with emerging Service Oriented Architectures (SOAs) and innovative web technologies such as the semantic web, web services and web agents. This organic maturation of the grid seeks a unified technological platform that is known as the service-oriented semantic grid.

The main characteristic of this service-oriented semantic grid would be to maintain intelligent agents that act as software services (grid services), capable of performing well-defined operations and communicating with peer services through the uniform standard protocols used for web services (XML, SOAP, WSDL, UDDI, etc.). This paradigm shift in the grid's architecture is gaining relevance due to its impact on the usability of bioinformatics, computational biology and systems biology. We find various successful demonstrations of the gridification of biological resources, indicating the effect and power of grid-based execution of various life science applications with different technological approaches. For example, Jacq et al. [106] report the deployment of various bioinformatics applications on the European Data Grid (EDG) testbed project. One of the deployed applications was PhyloJava, a Graphical User Interface (GUI) based application that calculates the phylogenetic trees of a given genomic sequence using the fastDNAml [107] algorithm with bootstrapping, a reliable albeit computationally intensive technique that calculates a consensus from a large number of repeated individual tree calculations (about 500-1000 repeats). The gridification of this application was carried out at a granularity of 50 for a total of 1000 independent sequence comparison jobs (20 independent packets of 50 jobs each), with the individual job results then merged to get the final bootstrapped tree. The selection of an appropriate value of granularity depends upon proper consideration of the overall performance, because highly parallelized jobs can be hampered by resource brokering and scheduling times whereas poorly parallelized jobs would not give a significant CPU time gain [106]. The execution of this gridified application on the EDG testbed required the installation of a Globus [34, 35] based EDG user interface on a Linux RedHat machine, the use of the Job Description Language (JDL) and the actual submission of the parallel jobs through the Java Jobs Submission Interface (JJSI). It is reported that the gridified execution of this application provided a 14 times speedup compared against a non-grid based standalone execution. The deviation from the ideal gain (a speedup of 20) is considered to be the effect of network and communication overhead (latencies). Similarly, the gridification of other applications such as a grid-enabled bioinformatics portal for protein sequence analysis, a grid-enabled method for securely finding unique sequences for PCR primers, and grid-enabled BLAST for orthology rules determination has also been discussed in [106] with successful and encouraging results.
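For reference, and using the standard textbook definitions rather than anything specific to [106], these figures correspond to the usual notions of speedup and parallel efficiency:

$$S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{S_p}{p}$$

where $T_1$ is the sequential execution time and $T_p$ the execution time over $p$ concurrent work units. With $p = 20$ packets and a measured speedup of $S_{20} \approx 14$, the parallel efficiency is $E_{20} \approx 14/20 = 0.7$; the missing 30% is accounted for by the brokering, scheduling and communication overheads mentioned above.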

The gridification of biological databases and applications is also motivated by the fact that the number, size and diversity of these resources are continuously (and rapidly) growing. This makes it impossible for an individual biologist to store a local copy of any major database and execute either data- or compute-intensive applications in a local environment, even if supported by a dedicated cluster or high performance computing resources. This lack of locality demands the grid-enablement of the resources. However, an important factor that hinders the deployment of existing biological applications, analytical tools and databases on grid-based environments is their inherent pre-grid design (legacy interface). This is so because the design suits the requirements of a local workstation environment in terms of input/output capabilities, which makes it very difficult for these applications to be gridified. The gridification of such applications requires a transparent mechanism to connect local input/output with grid-based distributed input/output through some intermediary tools such as grid middleware specific DMS and distributed storage environments.

One such mechanism, discussed in [108], provides a transparent interface for legacy bioinformatics applications, tools and databases to be connected to computational grid infrastructures such as EGEE [65] without incurring any change in the code of these applications. The authors report the use of a modified Parrot [109] as a tool to connect a legacy bioinformatics application to the EGEE database management system. The EGEE database management system enables the location and replication of databases needed for the management of very large distributed data repositories. With a Parrot-based connection, the user is freed from the overhead of performing file staging and specifying an application's data needs in advance. Rather, an automated agent launched by Parrot takes care of replicating the required data from the remote site and supplying it to the legacy application as if it were accessing data with local input/output capabilities. The agent resolves the logical file name to the storage file name, selects the best location for replication, and launches the program for execution on the downloaded data. For the purpose of demonstration, the authors report the deployment (virtualization) of some biological databases such as Swiss-Prot and TrEMBL [110] by registering these databases with the replica management service (RMS) of EGEE. Similarly, programs for protein sequence comparison such as BLAST [111], FASTA [112], ClustalW [113] and SSearch [114] have been deployed by registering them with the experiment software management service (ESM). The deployed programs were run in a grid environment and their access to the registered databases was evaluated by two methods: replication (copying the required database directly to the local disk) and remote input/output (attaching the local input/output stream of the program to the copied data in cache or on-the-fly mode). The evaluation shows that both methods perform similarly in terms of efficiency, e.g. on a database of about 500,000 protein sequences (205 MB) each method takes about 60 seconds for downloading from any grid node, and about four times less than this if the data node is near the worker node.

It is important to note, however, that the replication method creates an overhead in terms of free storage capacity on the worker node. This problem may be particularly acute if the size of the database to be replicated is too high or if the worker node has many CPUs sharing the same storage, each accessing a different set of databases. This is the reason why the remote input/output method outweighs the replication method for accessing large distributed biological databases. Again, the real selection depends on the nature of the program (algorithm). For compute-intensive programs such as SSearch, remote input/output is always better (as it works by copying progressive file blocks), whereas for data-intensive programs such as BLAST and FASTA the replication method may work better [108].

In fact there are a very large number of contributions that report experiences of using grid technology for bioinformatics, computational biology and systems biology. We have tried to summarize the findings of some of these contributions under the relevant categories of large projects, including BioGrid infrastructure, middleware, local resource management systems, data management, application programming environments, toolkits, problem solving environments and workflow management systems; these are presented in the following sections.

2.3.1 BioGrid Infrastructure

Most BioGrid infrastructures are based on the simple idea of cluster computing and are leading towards the creation of a globally networked, massively parallel supercomputing infrastructure that connects not only the computing units along with their potential hardware, software and data resources, but also expensive laboratory and industrial equipment and ubiquitous sensor devices, in order to provide the vast computing power and experimental setup required for modern-day biological experiments. Moreover, this infrastructure is also being customized in such a way that it becomes easily accessible from an ordinary general purpose desktop/laptop machine or any type of handheld device. Some of the major components of a generic BioGrid infrastructure have been illustrated in Figure 2.3. The overall architectural components are organized at three major levels (layers) of services. The focus of the application layer services is to provide user-friendly interfaces to a biologist for carrying out the desired grid-based tasks with a minimum of interaction steps (enhanced automation and intelligence). Similarly, the focus of the grid middleware services is to provide seamless access to and usability of distributed and heterogeneous physical layer resources for the application layer services. In the following sections we discuss various contributions related to the development and use of some of these services at both the application and middleware level.

The design and implementation of a typical BioGrid infrastructure varies mainly in terms of the availability of resources and the demands of the biological applications that are supposed to use that particular grid. There are many infrastructures, ranging from institutional/organizational grids consisting of simple PC based clusters or combinations of clusters [115, 116] to national and international BioGrid projects with different architectural models, such as the Computing Grid architecture (providing basic services for task scheduling, resource discovery, allocation and management etc.), the Data Grid architecture (providing services for locating, accessing, integrating and managing data), the Service Grid architecture (services for advertising, registering and invoking resources) and the Knowledge Grid architecture (services for sharing collaborative scientific published or unpublished data). The infrastructure details of some major BioGrid projects are presented in Table 2.2.

It may be observed that the same infrastructure may be used to serve more than one application model, depending on the availability of additional services and resources. For example, a compute grid, with the help of some additional services and resources, can be set up to work as a data grid.

TABLE 2.2: BioGrid infrastructure projects

• Asia Pacific BioGrid (http://www.apgrid.org). Grid infrastructure: Globus 1.1.4, Nimrod/G, LSF, SGE, Virtual Lab; 5 nodes, 25+ CPUs, 5 sites. Main applications: FASTA, BLAST, SSEARCH, MFOLD, DOCK, EMBASSY, PHYLIB and EMBOSS.
• Open BioGrid Japan (OBIGRID [64], http://www.obigrid.org). Grid infrastructure: Globus 3.2; IPv6 for secure communication; VPN over the internet for connecting multiple sites; 363 nodes, 619 CPUs, 27 sites. Main applications: workflow based distributed bioinformatics environment, BLAST search service, genome annotation system, biochemical network simulator.
• Swiss BioGrid [67] (http://www.swissbiogrid.org). Grid infrastructure: NorduGrid's ARC and GridMP; heterogeneous hardware platforms including both clusters and Desktop-PC grids. Main applications: high throughput compound docking into protein structure binding sites and analysis of proteomics data.
• Enabling Grids for E-sciencE (EGEE) [65] (http://www.eu-egee.org). Grid infrastructure: gLite middleware; 30,000 CPUs and 20 Petabytes of storage; 20,000 concurrent jobs on average; 90 institutions in 32 countries. Main applications: WISDOM (drug discovery), GATE (radio therapy planning), SiMRI3D (parallel MRI simulator), GPS@ (Grid Protein Sequence Analysis) and other applications.
• North Carolina BioGrid (http://www.ncbiotech.org). Grid infrastructure: Avaki data grid middleware and Virtual File System across grid nodes. Main applications: bioinformatics datasets and applications installed on the native file system and shared across the grid.

2.3.2 Grid-enabled Applications and Tools

As discussed in the following sections, there have been some efforts towards the development of bioinformatics-specific grid programming and problem solving environments, toolkits, frameworks and workflows that can help to develop grid-enabled applications easily and efficiently; however, they are still, to a degree, in a prototype or demonstration state. The actual process of developing (writing the code), deploying (registering, linking and compiling), testing (checking the results and performing debugging if necessary) and executing (scheduling, coordinating and controlling) an application in a grid-based environment is far from trivial. Mainly, the difficulties faced by developers arise from the inability of traditional software development tools and techniques to support the development of some sort of virtual application or workflow whose components can run on multiple machines within a heterogeneous and distributed environment. Despite these difficulties there are several grid-enabled applications for life sciences [75], mostly developed using either standard languages such as Java along with message passing interfaces (e.g. MPICH/G) or web services.
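As a sketch of the message-passing style just mentioned (written with Python's mpi4py rather than Java and MPICH/G, and with a dummy similarity function standing in for a real comparison method), a root process can scatter batches of pairwise comparisons to workers and gather the partial results:

```python
from mpi4py import MPI

def compare(pair):
    """Placeholder for a real structure-comparison call (e.g. invoking an external method)."""
    a, b = pair
    return (a, b, abs(hash(a) - hash(b)) % 100)  # dummy similarity score

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    proteins = [f"prot{i:03d}" for i in range(40)]
    pairs = [(a, b) for i, a in enumerate(proteins) for b in proteins[i + 1:]]
    # Split the all-against-all comparison list into one batch per process.
    batches = [pairs[i::size] for i in range(size)]
else:
    batches = None

my_batch = comm.scatter(batches, root=0)           # distribute work
my_results = [compare(p) for p in my_batch]        # compute local similarities
all_results = comm.gather(my_results, root=0)      # collect partial results

if rank == 0:
    flat = [r for chunk in all_results for r in chunk]
    print(f"computed {len(flat)} pairwise similarities on {size} processes")
```

Run with, for example, mpirun -np 4 python compare_mpi.py; the round-robin split keeps the batches roughly balanced without any central scheduler.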

A brief description of some major grid-enabled applications is presented in Table 2.3. Still, there might be many legacy applications that could take advantage of grid based resources; however, the migration of these applications to a grid environment requires more sophisticated tools than are currently available [117].

TABLE 2.3: Some major grid-enabled applications related to life sciences with the effect of gridification

• GADU/GNARE [118] (http://compbio.mcs.anl.gov). Task: Genome Analysis and Database Update. Middleware, tools, services and languages: Globus Toolkit and Condor/G for distributing DAG based workflows; GriPhyN Virtual Data System for workflow management; user interface to standard databases (NCBI, JGI etc.) and analysis tools (BLAST, PFAM etc.). Effect of gridification: the analysis of 2,314,886 sequences on a single 2 GHz CPU can take 1061 days; a grid with an average of 200 nodes took only 8 days and 16 hours for the above task.
• MCell [119] (http://www.mcell.cnl.salk.edu/). Task: computational biology simulation framework based on the Monte Carlo algorithm. Middleware, tools, services and languages: Globus GRAM, SSH, NetSolve and PBS for remote job starting/monitoring; GridFTP and scp for moving application data to the grid; Java based GUI, relational database (Oracle), adaptive scheduling. Effect of gridification: a typical r_disk MCell simulation on a single 1.5 GHz CPU can take 329 days; a grid with an average of 113 dual-CPU nodes took only 6 days and 6 hours for the above task.
• Grid Cellware [120] (http://www.cellware.org). Task: modeling and simulation for systems biology. Middleware, tools, services and languages: Globus, Apache Axis, GridX-Meta Scheduler; GUI based job creation editor; jobs mapped and submitted as web services. Effect of gridification: different stochastic (Gillespie, Gibson etc.), deterministic (Euler Forward, Runge-Kutta) and MPI based swarm algorithms have been successfully implemented in a way that distributes their execution on the grid.

2.3.3 Grid-based BioPortals

As mentioned earlier, the actual process of developing and deploying an individual application on the grid requires a significant level of expertise and a considerable period of time. This issue hinders the usage of available grid infrastructures. Therefore, in order to enhance the use of different grid infrastructures, some individuals, groups, institutions and organizations have started to provide the most frequently used and standard domain specific resources as grid-enabled services which can be accessed by any (or any authenticated) researcher through a common browser based single point of access, without the need to install any additional software. In this context a grid portal is considered to be an extended web-based application server with the necessary software capabilities to communicate with the backend grid resources and services [121]. This type of environment provides a full level of abstraction and makes it easy to exploit the potential of the grid seamlessly. Grid-based portals are normally developed using publicly available grid portal construction toolkits such as the GridPort Toolkit [122, 123], the NinF Portal Toolkit [121], GridSphere (http://www.gridsphere.org), IBM WebSphere (www-306.ibm.com/software/websphere) etc. Most of these toolkits follow the Java portlet specification (JSR 168) standard and thus make it easy for the developer to design the portal front-end and connect it to the backend resources through middleware services.

For example, [121] enables the developer to specify the requirements of the portal front-end (e.g. authentication, user interaction fields, job management, resources etc.) in terms of an XML based file, which automatically generates a JSP file (through a Java based XML parser) that provides an HTML based web page for the front-end and general-purpose Java Servlets that can communicate with grid-enabled backend applications and resources through a Globus [34, 35] based GridRPC mechanism. The toolkit also helps the developer with the gridification of the applications and resources needed at the backend. It is because of this level of ease in the creation of grid-based portals that in [124] it is claimed that portal technology has become critical for the future implementation of bioinformatics grids.

Another example is that of the BRIDGES (http://www.brc.dcs.gla.ac.uk/projects/bridges) project, which provides portal-based access to many biological resources (federated databases, analytical and visualization tools etc.) distributed across all the major UK centers with an appropriate level of authorization, convenience and privacy. It uses IBM WebSphere based portal technology because of its versatility and robustness. The portal provides a separate workspace for each user that can be configured by the user as per requirements, and the configuration settings are stored using session management techniques. This type of environment can help in many important fields of life sciences, such as the field of exploratory genetics, which leads towards the understanding of complex disease phenotypes such as heart disease, addiction and cancer on the basis of the analysis of data from multiple sources (e.g. model organisms, clinical drug trials and research studies). Similarly, [125] presents another instance of a system that is easy to use, scalable and extensible, providing, among others, secure and authenticated access to standard bioinformatics databases and analysis tools such as nucleotide and protein databases, BLAST, CLUSTAL etc.

A common portal engine was developed with the reusable components and services from the Open Grid Computing Environment Toolkit (OGCE) [126], which combines the components of three individual grid portal toolkits, namely the CompreHensive CollaborativeE Framework (CHCEF) (http://adsabs.harvard.edu/abs/2002AGUFMOS61C..13K), the Velocity Toolkit (http://jakarta.apache.org/velocity) and the JetSpeed Toolkit (http://portals.apache.org/jetspeed-1). This common portal engine was integrated with a biological application framework by using PISE [127] (a web interface generator for molecular biology). The portal provides access to around 200 applications related to molecular biology and also provides a way to add any other application through the description of a simple XML based file.

2.3.4 BioGrid Application Development Toolkits

Although some general purpose grid toolkits such as Globus [34, 35], COSM (http://www.mithral.com/projects/cosm) and GridLab (http://www.gridlab.org) provide certain tools (APIs and run time environments) for the development of grid-enabled applications, they are primarily aimed at the provision of the low level core services needed for the implementation of a grid infrastructure. Therefore, it seems to be difficult and time consuming for an ordinary programmer to go through the actual process of developing and testing a grid enabled application using these toolkits; instead there are some simulation based environments, such as EDGSim (http://www.hep.ucl.ac.uk/~pac/EDGSim), the extensible grid simulation environment [128], GridSim [129] and GridNet [130], that could be used at the initial design and verification stage.

Different application domains require specific sets of tools that can make the grid-enabled application development life-cycle (development, deployment, testing and execution) more convenient and efficient. One such proposal, for a Grid Life Science Application Developer (GLAD), was presented in [131]. This publicly available toolkit works on top of ALiCE (Adaptive scaLable Internet-based Computing Engine), a lightweight grid middleware, and provides a Java based grid application programming environment for life sciences. It provides a list of commonly used bioinformatics algorithms and programs as reusable library components, along with other software components needed for interacting (fetching, parsing etc.) with remote, distributed and heterogeneous biological databases. The toolkit also assists in the implementation of task level parallelism (by providing an effective parallel execution control system) for algorithms and applications ranging from those having regular computational structures (such as database searching applications) to those with irregular patterns (such as phylogenetic trees) [28]. Certain limitations of GLAD include the non-conformance of ALiCE with the OGSA standard and the use of socket based data communication, which might not be good for performance critical applications.

Another grid application development toolkit for bioinformatics, providing a high level user interface with a problem solving environment for biomedical data analysis, has been presented in [132]. The toolkit provides a Java based GUI that enables the user to design a Directed Acyclic Graph (DAG) based workflow by selecting a variety of bioinformatics tools and data (wrapped as Java based JAX-RPC web services) with appropriate dependencies and relationships.

2.3.5 Grid-based Problem Solving Environments

The grid-based Problem Solving Environment (PSE) is another way of providing a higher level interface, such as a graphical user interface or a web interface, to an ordinary user so that he/she can design, deploy and execute any grid-enabled application related to a particular class of domain and visualize the results without knowing the underlying architectural and functional details of the backend resources and services. In fact a grid-based PSE brings grid application programming to the level of drawing; that is, instead of writing the code and worrying about compiling and execution, the user can just use the appropriate GUI components provided by the PSE to compose, compile and run the application in a grid environment. PSEs are developed using high level languages such as Java and are designed to transform the user designed/modeled application into an appropriate script (distributed application or web service) that can be submitted to a grid resource allocation and management service for execution; on completion of the execution the results are displayed through an appropriate visualization mechanism.

There are several different grid-based PSEs available for bioinformatics applications; e.g. [133] describes the design and architecture of a PSE (Proteus) that provides an integrated environment for biomedical researchers to search, build and deploy distributed bioinformatics applications on computational grids. The PSE uses a semantics-based ontology (developed in the DAML+OIL language (http://www.daml.org)) to associate essential meta-data, such as goals and requirements, with three main classes of bioinformatics resources: data sources (e.g. the SwissProt and PDB databases), software components (e.g. BLAST, SRS, Entrez and EMBOSS, an open source suite of bioinformatics applications for sequence analysis) and tasks/processes (e.g. secondary structure prediction and similarity comparison), and stores this information in a meta-data repository. The data sources are specified on the basis of the kind of biological data, its storage format and the type of the data source. Similarly, the components and tasks are modeled on the basis of the nature of the tasks, the steps and order in which tasks are to be performed, the algorithm used, the data source and the type of the output etc. On the basis of this ontology the PSE provides a dictionary (knowledge-base) of data and tool locations, allowing the users to compose their applications as workflows by making use of all the necessary resources without worrying about their underlying distributed and heterogeneous nature. The modeled applications are automatically translated into grid execution scripts corresponding to GRSL (Globus Resource Specification Language) and are then submitted for execution on the grid through GRAM (Globus Resource Allocation Manager).

The performance of this PSE was checked with a simple application that used TribeMCL (http://www.ebi.ac.uk/research/cgg/tribe) for clustering human protein sequences, which were extracted from the SwissProt database by using the seqret program of the EMBOSS suite and compared all against all for similarity through the BLAST program. In order to take advantage of the grid resources and enhance the performance of the similarity search process, the output of the seqret program was split into three separate files in order to run three instances of BLAST in parallel. The individual BLAST outputs were concatenated and transformed into the Markov matrix required as input for TribeMCL. Finally the PSE displayed the results of the clustering in an opportune visualization format. It was observed that the total clustering process on the grid took 11h50'53" as compared to 26h48'26" on a standalone machine.

It was also noted, on the basis of another experimental case (taking just 30 protein sequences for clustering), that the data extraction and result visualization steps in the clustering process are nearly independent of the number of protein sequences (i.e. approximately the same time was observed in the case of all proteins vs 30 protein sequences). Another PSE for bioinformatics has been proposed in [134]. It uses Condor/G for the implementation of a PSE that provides an integrated environment for developing component based workflows through commonly used bioinformatics applications and tools such as Grid-BLAST, Grid-FASTA, Grid-SWSearch, Grid-SWAlign and Ortholog-Picker. Condor/G is an extension to the grid via Globus, and it combines the inter-domain resource management protocols of the Globus Toolkit with the intra-domain resource management methods of Condor to provide computation management for multi-institutional grids. The choice of Condor/G is justified on the basis of its low implementation overhead as compared to other grid technologies. The implementation of a workflow based PSE is made simple by the special functionality of the Condor meta-scheduler DAGMan (Directed Acyclic Graph Manager), which supports the cascaded execution of programs in a grid environment. The developed prototype model was tested by the integrated (cascaded) execution of the above mentioned sequence search and alignment tools in a grid environment. In order to enhance efficiency, the sequence databases and queries were split into as many parts as the number of available nodes, and the independent tasks were executed in parallel.
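The query-splitting step described above is straightforward to reproduce; the sketch below (an illustration under the stated assumptions, not code from [134]) partitions a FASTA file into one chunk per available node so that each chunk can be submitted as an independent search job:

```python
def split_fasta(path: str, n_chunks: int, prefix: str = "chunk") -> list:
    """Read a FASTA file and write its records round-robin into n_chunks smaller files."""
    # Parse records: each starts with a '>' header line.
    records, current = [], []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">") and current:
                records.append("".join(current))
                current = []
            current.append(line)
    if current:
        records.append("".join(current))

    out_paths = [f"{prefix}_{i:02d}.fasta" for i in range(n_chunks)]
    handles = [open(p, "w") for p in out_paths]
    try:
        for i, rec in enumerate(records):
            handles[i % n_chunks].write(rec)  # round-robin assignment keeps chunks balanced
    finally:
        for h in handles:
            h.close()
    return out_paths

# Example: split the queries for three parallel BLAST jobs, as in the Proteus experiment above.
# for chunk in split_fasta("queries.fasta", 3):
#     print("submit BLAST job for", chunk)
```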

2.3.6 Grid-based Workflow Management Systems

As already discussed in the context of PSEs, a workflow is a process of composing an application by specifying the tasks and their order of execution. A grid-based workflow management system provides all the necessary services for the creation, execution and visualization of the status and results of the workflow in a seamless manner. These features make workflows ideal for the design and implementation of life science applications that consist of multiple steps and require the integrated access and execution of various data and application resources. Therefore one can find various domain specific efforts for the development of proper workflow management systems for life sciences (Table 2.4). There have been several important demonstrations of different types of life science applications on grid-based workflow management systems. For example, the design and execution of a tissue-specific gene expression analysis experiment for humans has been demonstrated in a grid-based workflow environment called 'Wildfire' [135]. The workflow takes as input 24 compressed GenBank files corresponding to the 24 human chromosomes and, after decompression, performs exon extraction (through the exonx program) from each file in parallel, resulting in 24 FASTA files. In order to further increase the level of granularity, each FASTA file is split into five sub-files (through a dice script developed in Perl), making a total of 120 small files ready for parallel processing with BLAST against a database of transcripts ('16,385 transcripts obtained from the Mammalian Gene Collection'). The execution of this experiment on a cluster of 128 Pentium III nodes took about 1 hour and 40 minutes, which is reported to be 9 times less than the time required for the execution of its sequential version. The iteration and dynamic capabilities of Wildfire have also been demonstrated through the implementation of a swarm algorithm for a parameter estimation problem related to a biochemical pathway model based on 36 unknowns and 8 differential equations.

Similarly, the effectiveness of the Taverna [136] workflow system has been demonstrated by constructing a workflow for the genetic analysis of Graves' disease. The demonstrated workflow makes use of the Sequence Retrieval System (SRS), a mapping database service and other programs deployed as SoapLab services to obtain information about candidate genes which have been identified, through Affymetrix U95 microarray chips, as being involved in Graves' disease. The main functionality of the workflow was to map a candidate gene to an appropriate identifier in biological databases such as Swiss-Prot and EMBL in order to retrieve the sequence and published literature information about that gene through the SRS and MedLine services. The result of a tBLAST search against the PDB provided the identification of some related genes, whereas the information about the molecular weight and isoelectric point of the candidate gene was provided by the Pepstats program of the EMBOSS suite. Taverna has also been demonstrated with the successful execution of several other workflows for a diversity of in-silico experiments, such as pathway map retrieval and the tracking of data provenance.

TABLE 2.4: Examples of commonly used Grid-based workflow management systems for life sciences

Workflow management system: Wildfire [135] (http://wildfire.bii.a-star.edu.sg/)
Supported grid middleware, technologies and platforms: Condor/G, SGE, PBS, LSF; workflows are mapped into a Grid Execution Language (GEL) script; runs on Windows and Linux.
Main features: GUI-based drag-and-drop environment; workflow construction with EMBOSS; complex operations (iteration and dynamic parallelism); open source and extensible.

Workflow management system: Taverna [136] (http://taverna.sourceforge.net/)
Supported grid middleware, technologies and platforms: myGrid middleware; workflows mapped to web services (EMBOSS, NCBI, EBI, DDBJ, SoapLab, BioMoby and other web services); cross platform.
Main features: GUI-based workbench; SCUFL language for workflow description; in-silico experiments; open source and extensible.

Workflow management system: ProGenGrid [137] (http://www.cact.unile.it/projects/)
Supported grid middleware, technologies and platforms: Globus Toolkit 4.1; GridFTP and DIME for data; iGrid information service for resource and web service discovery; Java Axis and gSOAP Toolkit.
Main features: UML-based editor; RasMol for visualization; AutoDock for drug design.

2.3.7 Grid-based Frameworks

In software development, a framework specifies the structure of the environment needed for the development, deployment, execution and organization of a software application or project in a particular domain in an easy, efficient, standardized, collaborative, future-proof and seamless manner. Once fully successful, widely accepted and used, most of these frameworks are also made available as toolkits. There are some general frameworks for grid-based application development, such as the Grid Application Development Software (GrADS) [138], Cactus [139] and the IBM Grid Application Development Framework for Java (GAF4J, http://www.alphaworks.ibm.com/tech/GAF4J). Similarly, some specific grid-based frameworks for the life sciences have also been proposed and demonstrated, such as the Grid Enabled Bioinformatics Application Framework (GEBAF) [140], which proposes an integrated environment for grid-enabled bioinformatics applications built from a set of open source tools such as the BioPerl Toolkit [141], the Globus Toolkit [34,35], Nimrod/G [142] and Citrina (a database management tool, http://www.gmod.org/citrina). The framework provides a portal-based interface that allows the user to submit a query of any number of sequences to be processed with BLAST against publicly available sequence databases. The user options are stored in a hash data structure, a new directory is created for each experiment, and a script using the BioPerl SeqIO module divides the user query into sub-queries, each consisting of a single sequence. The distributed query is then submitted through a Nimrod/G plan file for parallel execution on the grid. Each grid node maintains an updated and formatted version of the sequence database through Citrina. The individual output of each sequence query is parsed and concatenated by another script that generates a summary of the experiment in the form of XML and Comma Separated Value (CSV) files containing the number of most significant hits for each query. The contents of these files are then displayed through the result interface. A particular demonstration of BLAST was carried out with 55,000 sequences against the SwissProt database.

With 55,000 parallel jobs the grid was fully exploited within the limits of its free nodes, and it was observed that the job management overhead was low compared to the actual search time for BLASTing each sequence. Although summarizing thousands of results is somewhat slow and nontrivial, its execution time remains insignificant when compared with the experiment time itself. The developed scripts were also tested for reusability with other similar applications, such as ClustalW and HMMER, with little modification. A web service interface has been proposed for the future development of GEBAF in order to make use of other bioinformatics services such as Ensembl. Similarly, the Storage Resource Broker (SRB) middleware has been proposed as a future addition for better data management.

GEBAF is not the only grid-enabling framework available. For example, Asim et al. [143] describe the benefits of using the Grid Application Development Software (GrADS) framework for the gridification of bioinformatics applications. An MPI-based master-slave version of FASTA already existed, but it used a different approach: it kept the reference database at the master side and made the master responsible for the equal distribution of the database to the slaves and the subsequent collection and concatenation of the results. In contrast, the GrADS-based implementation makes the reference databases (in whole or in part) available at some or all of the worker nodes through database replication. Thus the master first sends a message to the workers asking them to load their databases into memory, and then distributes the search query and collects the results back. This data-locality approach eliminates the communication overhead associated with the distribution of large-scale databases. Furthermore, through the integration of various software development and grid middleware technologies (such as the Configurable Object Program, the Globus Monitoring and Discovery Service (MDS) and the Network Weather Service (NWS)), the GrADS framework provides all the necessary user, application and middleware services for the composition, compilation, scheduling, execution and real-time monitoring of applications on a grid infrastructure in a seamless manner.
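A minimal mpi4py sketch of the data-locality scheme described above follows: workers first load their locally replicated database into memory, then the master scatters query batches and gathers results. The tags, file names and scoring routine are illustrative assumptions, not the GrADS implementation; mpi4py and at least one worker process are assumed.

```python
# Sketch of a data-locality master/worker scheme (assumed set-up, not GrADS itself).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

def load_local_database():
    # each worker reads its locally replicated database portion
    with open("local_db.fasta") as fh:
        return fh.read()

def search(db, queries):
    # placeholder for the actual FASTA/BLAST-style search
    return [(q, len(db)) for q in queries]

if rank == 0:
    queries = [f"QUERY_{i}" for i in range(100)]
    # step 1: tell workers to load their databases (no database data is shipped)
    for w in range(1, size):
        comm.send("LOAD_DB", dest=w, tag=1)
    # step 2: distribute query chunks round-robin
    chunks = [queries[i::size - 1] for i in range(size - 1)]
    for w in range(1, size):
        comm.send(chunks[w - 1], dest=w, tag=2)
    # step 3: collect and concatenate results
    results = []
    for w in range(1, size):
        results.extend(comm.recv(source=w, tag=3))
    print(len(results), "hits collected")
else:
    assert comm.recv(source=0, tag=1) == "LOAD_DB"
    db = load_local_database()
    my_queries = comm.recv(source=0, tag=2)
    comm.send(search(db, my_queries), dest=0, tag=3)
```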

2.3.8 BioGrid Data Management Approaches

Most of the publicly available biological data originates from different sources of information, i.e. it is heterogeneous, and it is acquired, stored and accessed in different ways at different locations around the world, i.e. it is distributed. The heterogeneity of data may be syntactic (differences in file formats, query languages, access protocols, etc.), semantic (e.g. genomic versus proteomic data) or schematic (differences in the names of database tables and fields). In order to make integrative use of these highly heterogeneous and distributed data sources in an easy and efficient way, an end-user biologist can take advantage of specific Data Grid infrastructure and middleware services such as BRIDGES (http://www.brc.dcs.gla.ac.uk/projects/bridges), BIRN (http://www.nbirn.net) and various other European Union Data Grid projects, e.g. EU-DataGrid (http://www.edg.org/) and the EU-DataGrid for Italy (http://web.datagrid.cnr.it/Tutorial_Rome). These Data Grid projects make use of standard middleware technologies such as the Storage Resource Broker (SRB), OGSA-DAI and IBM Discovery Link.

2.3.9 Computing and Service Grid Middleware

In the same way that a computer operating system provides a user-friendly interface between the user and the computer hardware, grid middleware provides the important services needed for the easy, convenient and proper operation of a grid infrastructure. These services include access, authentication, information, security and monitoring services as well as data and resource description, discovery and management services. In order to further reduce the difficulties involved in installing, configuring and setting up grid middleware, there have also been proposals for the development of Grid Virtual Machines (GVM) and Grid Operating Systems [144–148]. The development of specific grid operating systems, or even the embedding of grid middleware as part of existing operating systems, would greatly boost the use of grid computing in all computer-related domains, but this remains to be seen. The important features of currently used computing and service grid middleware are listed in Table 2.5.

TABLE 2.5: Commonly used computing and service grid middleware

Grid middleware: Globus Toolkit GT4 [34,35] (http://www.globus.org/)
Goal: to provide a suite of services for job, data and resource management. Developer: Argonne National Laboratory, University of Chicago.
Architecture and services: OGSA/WSRF-based architecture; credential management services (MyProxy, Delegation, SimpleCA); data management services (GridFTP, RFT, OGSA-DAI, RLS, DRS); resource management services (RSL and GRAM); information and monitoring services (Index, Trigger and WebMDS); instrument management services (GTCP).

Grid middleware: LCG-2 / gLite (http://glite.web.cern.ch/glite/)
Goal: to provide a large-scale data handling and compute power infrastructure. Developer: the EGEE project in collaboration with VDT (US) and other partners. Platform: Linux and Windows.
Architecture and services: LCG-2 is a pre-web-service middleware based on Globus 2; gLite is an advanced version of LCG-2 based on a web services architecture; authentication and security services (GSI, X.509, SSL, CA); information and monitoring services (Globus-MDS, R-GMA, GIIS, BDII); workload/resource management services (WMS); data management services (GUID, SURL, SLI).

Grid middleware: UNICORE (http://www.unicore.eu/)
Goal: lightweight grid middleware. Developer: Fujitsu Lab EU and UniGrid. Platform: Unix/Linux.
Architecture and services: UNICORE is a pre-web-service grid middleware based on the OGSA standard; UNICORE 6 is based on a web service and OGSA architecture; security services (X.509, CA, SSL/TLS); execution management engine (XNJS).

2.3.10 Local Resource Management System (LRMS)

The grid middleware interacts with different clusters of computers through a Local Resource Management System (LRMS), also known as a Job Management System (JMS). The LRMS (such as Sun Grid Engine, Condor/G or Nimrod/G) is responsible for the submission, scheduling and monitoring of jobs in a local area network environment and for providing the results and status information to the grid middleware through appropriate wrapper interfaces. Some of the important features [149] of commonly used LRMS software are listed in Table 2.6.

TABLE 2.6: Commonly used Job Management Systems (Local Resource Management Systems (LRMS))

Local resource manager: Sun Grid Engine 6 (Sun Microsystems, gridengine.sunsource.net)
General features: Platform: Solaris, Apple Macintosh, Linux and Windows; user-friendly GUI, portal and DRMAA API; open source and extensible; integration with Globus through the GE-GT adapter.
Job support: shell scripts for job description; standard and complex job types; integration with MPI; 5 million jobs on 10,000 hosts.

Local resource manager: Condor-G (University of Wisconsin, www.cs.wisc.edu/condor)
General features: Platform: Solaris, Apple Macintosh, Linux and Windows; DRMAA and web-service interface; open source and extensible; Globus-enabled.
Job support: job description via Classified Advertisements (ClassAds); standard and complex job types; integration with MPI.

Local resource manager: Nimrod-G 3.0.1 (Monash University, www.csse.monash.edu.au/)
General features: Platform: Solaris, Linux, and Mac with x86 and SPARC architectures; web portal and API; open source and extensible; Globus-enabled.
Job support: job description via the Nimrod Agent Language; GRAM interfaces to dispatch jobs to computers.

2.3.11 Fault Tolerant Approaches in the Context of BioGrid

Like any other grid computing infrastructure, the BioGrid environment is considered to be dynamic [150]: the availability of resources and the constraints on them keep changing over time. The capability of an application, and of the system as a whole, to withstand changes in the state of resources and continue its normal functionality and execution is known as fault tolerance. Initially, fault tolerance was not addressed much in the context of BioGrid, as the very idea of the Grid itself was still at the 'proof of concept' stage. However, now that the Grid has moved towards an era of robust standardization (the post-2005 web-service-based architecture), one can find many approaches for the implementation of fault tolerance at different levels of a BioGrid infrastructure. For example, Ying Sun et al. [151] make use of a backup-task mechanism to add fault tolerance to their application for bioinformatics computing grid (ABCGrid). The backup-task approach takes its inspiration from Google's MapReduce model [152] and monitors all tasks in progress. It uses the results of this monitoring to determine whether any task has not finished within its normal (expected) time on a node (due to failure or poor performance of that node or resource) and assigns the same task, as a backup task, to another node. When the task returns its results (either from the primary execution or from the backup execution), the task's status is changed to 'completed' and all remaining backup executions are terminated.
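As an illustration of the backup-task idea just described (not ABCGrid's actual code), a minimal sketch in Python follows: if a task exceeds its expected runtime, a backup copy is launched on another node and the first result to arrive wins. Node names, sleep times and the expected runtime are assumptions for demonstration only.

```python
# Illustrative sketch of backup-task fault tolerance (assumed parameters).
import time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

EXPECTED_RUNTIME = 2.0   # seconds; assumed to be known in advance (see caveat below)

def run_on_node(task_id, node):
    # stand-in for dispatching the task to a (possibly slow or failed) node
    time.sleep(0.5 if node != "slow-node" else 8.0)
    return f"result of task {task_id} from {node}"

def run_with_backup(task_id, primary, backup, pool):
    futures = {pool.submit(run_on_node, task_id, primary)}
    done, pending = wait(futures, timeout=EXPECTED_RUNTIME, return_when=FIRST_COMPLETED)
    if not done:
        # primary is late: schedule a backup execution on another node
        futures.add(pool.submit(run_on_node, task_id, backup))
        done, pending = wait(futures, return_when=FIRST_COMPLETED)
    for f in pending:
        f.cancel()   # best-effort cancellation; a real system would kill the remote job
    return next(iter(done)).result()

with ThreadPoolExecutor(max_workers=4) as pool:
    print(run_with_backup(1, primary="slow-node", backup="fast-node", pool=pool))
```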

The authors evaluate their application (using bioinformatics tools such as NCBI BLAST, Hmmpfam and CE [47]) on a testbed consisting of 30 workstations connected in a local environment. They mainly report on the speedup of the application and do not present or discuss any results in terms of fault tolerance. It should also be noted that although the authors label their application with the term 'Grid', they do not report any actual use of the Grid. Moreover, the method they use for fault tolerance cannot be applied in general, for two reasons. First, this approach requires that the normal execution time of the tasks be known in advance, which may not be the case for applications that use heuristic-based computational techniques or process data of varying size. Second, although the approach may be appropriate for fine-grained tasks, for heavily coarse-grained tasks whose normal execution may take very long such an approach would result in very poor performance. Nevertheless, this approach could be tested in a real grid environment with different applications. Stockinger et al. [153] also use a similar approach, but no statistics on the quality of fault tolerance are reported. For workflow-based applications, the myGrid [69] project provides checkpoint/rollback interfaces which can be implemented to develop applications and services with inherent fault-tolerance support. Similarly, for applications based on the parallel programming model, various MPI implementations provide mechanisms for the implementation of fault tolerance, e.g. MPICH-V [154], MPI-FT [155] and Berkeley Lab Checkpoint/Restart (BLCR) [156]. All of these implementations provide some form of checkpointing and message logging mechanism which can be used by an application to migrate or restart any failed process or task. The message logging enables the application to restart the computation from the previous fault-safe state, hence incurring little overhead in the event of failure.

2.4 Some Flagship BioGrid Projects

We present here some selected flagship case studies which have elicited a positive response from bio-scientists for their special role and contribution to the life science domain. A description of the most important implementation strategies, along with their main services, is provided with the help of appropriate illustrations.

2.4.1 EGEE Project

The EGEE project was initially named Enabling Grids for E-science in Europe and was later renamed Enabling Grids for E-sciencE in order to broaden its scope from the European to the international level [65]. The project builds on its predecessor, the European DataGrid (EDG) project, and provides an international grid infrastructure for a multi-science and industry community ranging from high-energy physics to the life sciences and nanotechnology. The overall infrastructure consists of more than 30,000 CPUs with 20 petabytes of storage capacity, provided by various academic institutes, organizations and industries around the world in the form of high-speed and high-throughput compute clusters, which are updated and interoperated through its web-service-based, lightweight, dynamic and interdisciplinary grid middleware named gLite.

Like EGEE, gLite also builds on a combination of various other grid middleware projects such as LCG-2 (http://cern.ch/LCG), DataGrid (http://www.edg.org), DataTag (http://cern.ch/datatag), the Globus Alliance (http://www.globus.org), GriPhyN (http://www.griphyn.org) and iVDGL (http://www.ivdgl.org). Several bio-applications have been implemented on top of the EGEE platform [108], and various other resource-hungry biological projects (such as BioInfoGrid (http://www.bioinfogrid.eu/), Wide In Silico Docking On Malaria (WISDOM) (http://wisdom.eu-egee.fr) and the European Model for Bioinformatics Research and Community Education (EMBRACE) (http://www.embracegrid.info)) continuously make use of the EGEE infrastructure. In order to illustrate how the EGEE grid infrastructure and services can be used for the life sciences, we provide here an overview of the latest version of its grid middleware.

Authentication and Security Services:

EGEE uses the Grid Security Infrastructure (GSI) for authentication (through digital X.509 certificates) and secure communication (through the Secure Socket Layer (SSL) protocol with enhancements for single sign-on and delegation). Therefore, in order to use the EGEE grid resources, the user must first register and obtain a digital certificate from an appropriate Certificate Authority (CA). When the user signs in with the original digital certificate, which is protected with a private key and a password, the system creates another passwordless temporary certificate, called a proxy certificate, that is then associated with every user request and activity. In order to keep user security at a high level, proxy certificates are kept valid only for short intervals (12 hours by default); however, if the user's jobs require more time, an appropriate lifetime for the proxy certificate can be set through the MyProxy server.

Information and Monitoring Services:

gLite 3 uses the Globus MDS (Monitoring and Discovery Service) for resource discovery and status information. Additionally, it uses the Relational Grid Monitoring Architecture (R-GMA) for accounting and monitoring. In order to provide more stable information services, the gLite Grid Information Indexing Server (GIIS) uses the BDII (Berkeley Database Information Index) server, which stores data in a more stable manner than the original Globus-based GIIS.

Data Management Services:

As in traditional computing, the primary unit of data management in the EGEE grid is the file. gLite provides a location-independent way of accessing files on the EGEE grid through a Unix-like hierarchical logical file naming mechanism. When a file is registered for the first time on the grid it is assigned a GUID (Grid Unique Identifier, created as a universally unique identifier from the MAC address and a time stamp) and is bound to an actual physical location represented by a SURL (Storage URL). Once a file is registered on the EGEE grid it cannot be modified or updated, because the data management system creates several replicas of the file in order to enhance the efficiency of subsequent data access. Updating any single file would therefore create a data inconsistency problem, which has not yet been solved in EGEE.
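The file-registration scheme described above can be sketched in a few lines of Python: a GUID derived from the MAC address and a time stamp (uuid1) is bound to one or more storage URLs (SURLs). The in-memory dictionary, logical file name and SURLs are illustrative stand-ins for the real replica catalogue.

```python
# Sketch of GUID-based file registration with replicas (illustrative only).
import uuid

replica_catalogue = {}   # GUID -> logical file name and list of SURLs

def register_file(lfn, surl):
    """Register a logical file name for the first time and return its GUID."""
    guid = str(uuid.uuid1())          # uuid1 combines the MAC address and a time stamp
    replica_catalogue[guid] = {"lfn": lfn, "surls": [surl]}
    return guid

def add_replica(guid, surl):
    """Replicas may be added, but the file contents are never updated in place."""
    replica_catalogue[guid]["surls"].append(surl)

guid = register_file("/grid/bio/structures/1abc.pdb",
                     "srm://se1.example.org/pnfs/data/1abc.pdb")
add_replica(guid, "srm://se2.example.org/pnfs/data/1abc.pdb")
print(guid, replica_catalogue[guid]["surls"])
```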

Workload Management System (Resource Broker):

The new gLite-based workload management system (WMS, or resource broker) is capable of receiving multiple, even inter-dependent, jobs described in the Job Description Language (JDL); it dispatches these jobs to the most appropriate grid sites (the selection of a site is based on a dynamic match-making process), keeps track of the status of the jobs and retrieves the results when the jobs are finished. While dispatching the jobs, the resource broker uses the Data Location Interface (DLI) service to supply the input files along with the job to the worker node. The flow of a job is illustrated in Figure 2.7.
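A toy sketch in the spirit of the match-making step described above: candidate sites are filtered on the job's requirements and ranked, here by data locality and spare capacity. The site attributes and ranking rule are assumptions for illustration only, not the WMS algorithm.

```python
# Toy match-making sketch (assumed site attributes and ranking rule).
sites = [
    {"name": "site-A", "free_cpus": 120, "has_input_data": False, "os": "linux"},
    {"name": "site-B", "free_cpus": 40,  "has_input_data": True,  "os": "linux"},
    {"name": "site-C", "free_cpus": 300, "has_input_data": False, "os": "windows"},
]

job = {"required_os": "linux", "min_cpus": 16}

def match(job, sites):
    candidates = [s for s in sites
                  if s["os"] == job["required_os"] and s["free_cpus"] >= job["min_cpus"]]
    # rank: favour data locality first, then spare capacity
    return max(candidates, key=lambda s: (s["has_input_data"], s["free_cpus"]))

print("job dispatched to", match(job, sites)["name"])   # -> site-B
```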

FIGURE 2.7: Job flow in the EGEE grid. The user request goes to the resource broker node (RB node), which assigns the job to a particular node based on the results of the match maker, the workload manager and other status information (reproduced from http://glite.web.cern.ch/).

2.4.2 Organic Grid: Self Organizing Computational Biology on Desktop Grid

The idea of the Organic Grid [157] is based on the decentralized functionality and behavior of self-organizing, autonomous and adaptive organisms (entities) in natural complex systems; examples include the functioning of biological systems and the behavior of social insects such as ants and bees. The idea of the organic grid leads towards a novel grid infrastructure that could eliminate the limitations of traditional grid computing, the main one being its centralized approach. For example, a Globus-based computational grid may use a centralized meta-scheduler and would thus be limited to a smaller number of machines.

Similarly, desktop grid computing based on distributed computing infrastructures such as BOINC may use a centralized master/slave approach and would thus be suitable only for coarse-grained independent jobs. The idea of the Organic Grid is to provide a ubiquitous peer-to-peer grid computing model capable of executing arbitrary computing tasks on a very large number of machines over networks of any quality, by redesigning the existing desktop computing model so that it supports distributed adaptive scheduling through the use of mobile agents. In essence this means that a user application submitted to such an architecture is encapsulated in a mobile agent containing the application code along with the scheduling code. After encapsulation, the mobile agent can itself decide (based on its scheduling code and the network information) to move to any machine that has the resources needed for the proper execution of the application. This mechanism provides the same kind of user-level abstraction as a traditional Globus-based grid, but additionally builds on a decentralized scheduling approach that enables the grid to span a very large number of machines in a more dynamic peer-to-peer computing model. The use of mobile agents (which are based on the RMI mechanism, built on top of a client/server architecture), as compared to the alternative service-based architecture, provides a higher level of ease and abstraction for the validation (experimentation) of different scheduling, monitoring and migration schemes. Although the project uses a scheduling scheme that builds on a tree-structured overlay network, the scheme is made adaptive on the basis of an application-specific performance metric, as sketched below. For example, the performance metric for a data-intensive application such as BLAST gives high weight to the bandwidth of the communication link before scheduling the job on a particular node, whereas a high-speed node is selected for an application that falls in the class of compute-intensive applications. Furthermore, in order to provide uninterrupted execution with dynamic and transparent migration features, the project makes use of strongly mobile agents instead of traditional weakly mobile agents (Java-based mobile agents that cannot access their state information). Following common practice in grid computing research, a proof of concept was demonstrated with the execution of NCBI BLAST (which falls in the class of independent-task applications) on a cluster of 18 machines with heterogeneous platforms, ranked into the categories fast, medium and slow through the introduction of appropriate delays in the application code. The overall task required the comparison of a 256 KB sequence against a set of 320 data chunks, each of size 512 KB. This gave rise to 320 subtasks, each responsible for matching the candidate 256 KB sequence against one specific 512 KB data chunk. The project successfully carried out the execution of these tasks, and it was observed that adapting the scheduling to the dynamics of the architecture greatly improves performance and the quality of results. The project is being further extended to provide support for different categories of applications and to enable the user to configure different scheduling schemes for different applications through easy-to-use APIs.
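The following sketch illustrates the application-specific performance metric mentioned above: a data-intensive task ranks nodes by link bandwidth, whereas a compute-intensive task ranks them by CPU speed. The node attributes and weights are illustrative assumptions, not values from the Organic Grid experiments.

```python
# Sketch of application-specific node ranking (assumed node attributes).
nodes = [
    {"name": "fast",   "cpu_ghz": 3.0, "bandwidth_mbps": 10},
    {"name": "medium", "cpu_ghz": 2.0, "bandwidth_mbps": 100},
    {"name": "slow",   "cpu_ghz": 1.0, "bandwidth_mbps": 1000},
]

def rank_nodes(nodes, app_class):
    if app_class == "data-intensive":      # e.g. BLAST over large data chunks
        key = lambda n: n["bandwidth_mbps"]
    else:                                  # compute-intensive
        key = lambda n: n["cpu_ghz"]
    return sorted(nodes, key=key, reverse=True)

print([n["name"] for n in rank_nodes(nodes, "data-intensive")])     # link-rich nodes first
print([n["name"] for n in rank_nodes(nodes, "compute-intensive")])  # fast CPUs first
```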

2.4.3 Advancing Clinico-Genomic Trials on Cancer (ACGT)

ACGT is a Europe-wide integrated biomedical grid for post-genomic research on cancer (http://www.eu-acgt.org) [158]. It builds on the results of other biomedical grid projects such as caBIG, BIRN, MEDIGRID and myGrid. The project is based on an open source and open access architecture and provides the basic tools and services required for medical knowledge discovery, analysis and visualization. The overall grid infrastructure and services aim to provide an environment that helps scientists to: a) reveal the effect of genetic variations on oncogenesis; b) promote the molecular classification of cancer and the development of individual therapies; and c) model in-silico tumor growth and therapy response. In order to create the environment required to support these objectives, ACGT focuses on the development of a virtual web that interconnects various cancer-related centres, organizations and individual investigators across Europe through appropriate web and grid technologies. Mainly, it uses the semantic web and ontologies for data integration and knowledge discovery, and the Globus Toolkit with its WS-GRAM, MDS and GSI services for cross-organization resource sharing, job execution, monitoring and result visualization. Additionally, ACGT uses some higher-level grid services from the Gridge framework developed at the Poznan Supercomputing and Networking Centre (PSNC) [159]. These services include GRMS (Grid Resource Management System), GAS (Grid Authorization System) and DMS. These additional services provide the required level of dynamic and policy-based resource management; efficient and reliable data handling; and monitoring and visualization of results. Figure 2.8 provides a usage scenario of these services in the context of the ACGT environment.

FIGURE 2.8: ACGT integrated environment usage scenario. The system goes from step 1 to step 13 in order to run a particular job (reproduced from [158]).

2.5 Conclusions

This part of the review has presented a thorough analysis of the state of the art in web and grid technology for bioinformatics, computational biology and systems biology, in a manner that provides a clear picture of currently available technological solutions for a very wide range of problems.

While surveying the literature, it has been observed that there are many grid-based PSEs, workflow management systems, portals and toolkits under the banner of bioinformatics, but not as many for systems or computational biology. However, in each case a mix of projects and applications has been found that overlaps from bioinformatics into computational and systems biology. Based on the analysis of the state of the art, we identify below some key open problems:

• The use of semantic web technologies, such as domain ontologies for the life sciences, is still not at its full level of maturity, perhaps because of the semi-structured nature of XML and the limited expressiveness of ontology languages [68].

• Biological data analysis and management is still quite a difficult job because of the lack of development and adoption of optimized and unified data models and query engines.

• Some of the existing bioinformatics ontologies and workflow management systems are simply in the form of Directed Acyclic Graphs (DAGs), and their descriptions lack expressiveness in terms of formal logic [135].

• There is a lack of open-source standards and tools for the development of thesaurus and meta-thesaurus services [77].

• Appropriate query, visualization and authorization mechanisms are needed for the management of provenance data and meta-data in in-silico experiments [68, 135].

• Some of the BioGrid projects seem to have been discontinued in terms of information updating. This might arise from funding problems or from difficulties associated with their implementation.

• There is a lack of domain-specific, mature application programming models, toolkits and APIs for grid-enabled application development, deployment, debugging and testing.

• There still seems to be a gap between the application layer and the middleware layer of a typical BioGrid infrastructure, because existing middleware services do not fully meet the demands of applications; for example, no grid middleware provides proper support for automatic application deployment on all grid nodes.

• It is not trivial to deploy existing bioinformatics applications on available grid testbeds (such as NGS, EGEE, etc.), as this requires the installation and configuration of specific operating systems and grid middleware toolkits, which is not easy from a biologist end-user's point of view.

• There are still many issues with grid-based workflow management systems in terms of their support for complex operations (such as loops), legacy bioinformatics applications and tools, and the use of proper ontologies and web services [135].

• The job submission process on existing grid infrastructures is still quite complex because of the immaturity of resource broker services.

• There is a lack of implementation initiatives regarding a knowledge grid infrastructure for the life sciences.

Some of the key lessons and findings from the review presented in this chapter that are applicable in the context of MC-PSC are:

• Identification of the availability of resources at national and international level. For example, the successful deployment of several bioinformatics-related applications on the UK National Grid Service (NGS) and on Enabling Grids for E-sciencE (EGEE) provided an impetus to use these resources for the case of MC-PSC as well. This also helped in getting access to the NGS through its various training programmes targeted at grid-based application development.

• Identification of several methods and tools for setting up a local, national or international grid infrastructure. The review provided a comprehensive picture of the available technological avenues which could be selected for further analysis with MC-PSC. In this regard, different resource management and parallel environments were taken as case studies for evaluation with MC-PSC, as reported in the following chapters.

• Identification of the key features and characteristics for the development, deployment and evaluation of grid-enabled applications. For example, the literature presented in this chapter explains the pros and cons of the level of granularity in terms of workload distribution and the corresponding communication overhead. It also helps in identifying the key performance measures which could be used for the evaluation of distributed approaches to MC-PSC.

• Identification of appropriate web technologies which could be used in the future to develop user-friendly web interfaces for the analysis and visualization of the MC-PSC similarity results.

Because of the wide scope of this review, the literature reported in this chapter is by no means exhaustive; therefore, the next chapter provides a more focused survey of the literature that is closely related to the field of structural proteomics and hence bears more directly on the subject of this thesis.

CHAPTER 3

OVERVIEW OF GRID AND DISTRIBUTED PUBLIC COMPUTING SCHEMES FOR STRUCTURAL PROTEOMICS

As described in the previous chapter, grid and distributed computing (including public computing schemes) have become essential tools for many scientific fields, including bioinformatics, computational biology and systems biology. The adoption of these technologies has given rise to a wide range of projects and contributions that provide various ways of setting up such environments and exploiting their resources and services for different application domains. This chapter aims to extend the survey presented in the previous chapter by focusing specifically on some of the major projects, technologies and resources employed in the area of structural proteomics.

The major emphasis is on briefly commenting on various approaches to the gridification and parallelization of some flagship legacy applications, tools and data resources related to key structural proteomics problems such as protein structure prediction, folding and comparison. The comments are based on a theoretical analysis of parameters such as the performance gain after gridification, user-level interaction environments, workload distribution, and the choice of deployment infrastructure and technologies. The study of these parameters provides the motivating justification needed for further research and development on the subject of this thesis, i.e., the case of Protein (Structure) Comparison, Knowledge, Similarity and Information

(ProCKSI).

Parts of this chapter were published as a peer-reviewed conference paper in the Proceedings of the Frontiers of High Performance Computing and Networking ISPA 2007 Workshops, LNCS Vol. 4743, pp. 424-434, 2007 [doi:10.1007/978-3-540-74767-3_44].

3.1 Introduction

It is believed that the rapidly evolving field of structural proteomics has played a role in the first decade of the 21st century, in terms of protein 3D structure determination and analysis, that is quite equivalent to the role played by the Human Genome Project (HGP) in the last decade of the 20th century in terms of sequence determination and analysis [160]. This is mainly because a very large number of protein primary structures (sequences) are known, but the number of their corresponding 3D structures (secondary or tertiary structures) lags far behind. For example, as of the writing of this dissertation there are 12,347,303 known protein sequences (UniProtKB/TrEMBL entries) as compared to just 64,036 protein structures (PDB holdings). This sequence-structure gap is due to the difficulties associated with experimental structure determination methods such as X-ray crystallography and NMR spectroscopy. As secondary and tertiary structures are more helpful in tracing the evolution and function of a protein, as well as in rational drug design, computational approaches have been proposed for predicting these structures from a given protein sequence in order to reduce the gap between known sequences and known structures. As all these approaches are based on some form of modeling (such as ab-initio or de-novo protein modeling, or comparative protein modeling techniques such as homology modeling and protein threading) and rely on multi-scale optimization techniques to optimize various model parameters (e.g. energy minimization), the availability of powerful computing facilities is essential. That is, structural proteomic methodologies require huge computational power and reliable access to various distributed and (often) heterogeneous biological databases and analytical tools in order to properly and accurately predict the structure from a given sequence or compare thousands of models against a target structure. Therefore, many research groups in this field, such as the Baker Laboratory at the University of Washington (http://depts.washington.edu/bakerpg/), The Scripps Research Institute (TSRI) in California (http://www.scripps.edu/e_index.html), the Pande Group at Stanford University (http://folding.stanford.edu/Pande/Main) and the High Throughput Computing Group at Osaka University [64], among others, have started to make use of grid and distributed computing environments. While the grid-based projects make use of standard middleware services and institutionally-owned resources, the projects based on distributed public computing schemes build their infrastructure around publicly-owned unused computing resources which are voluntarily provided throughout the world, such as the World Community Grid (http://www.worldcommunitygrid.org/), which supports the Human Proteome Folding Project, Folding@Home [161], Predictor@Home [162] and Rosetta@Home (http://boinc.bakerlab.org/rosetta/).

This chapter focuses on some of these projects in order to identify and compare various approaches to the gridification and parallelization of some flagship legacy applications, tools and data resources, by analyzing key parameters such as job and data distribution and management, user-level interaction environments, deployment technologies and infrastructures, and the effect of gridification on the overall performance of the system.

The organization of this chapter is as follows: sections 3.2 and 3.3 provide detailed overviews of the fields of protein folding and protein structure prediction, respectively; section 3.4, besides reviewing the literature, discusses in detail the main topic of this dissertation, i.e., protein structure comparison, its current state and future directions in terms of "The ProCKSI Server: a decision support system for Protein (Structure) Comparison, Knowledge, Similarity and Information"; finally, section 3.5 concludes the chapter.

3.2 Protein Folding

The process of folding, first described by Anfinsen [5–7], is a thermodynamically driven process, taking a few microseconds, in which a protein adopts its native state. Failure of this process results in several lethal diseases in humans and animals [163–165]. A proper understanding of this process sheds light on many issues at the core of biotechnology, such as the design of new proteins with a desired functionality, the understanding of some incurable diseases such as cancer or neurodegenerative diseases (e.g. Alzheimer's, Creutzfeldt-Jakob disease (CJD), cystic fibrosis (CF) and Huntington disease (HD)), and many other practical implementations of nanotechnology. To this aim, several models have been established. These models make use of simulation-based computational techniques that require extremely high computational power, far beyond the limits of any single traditional supercomputer or local cluster. It has been demonstrated in the Folding@Home project [161] that this requirement can be met with a worldwide distributed public-resource computing network that interconnects thousands of loosely coupled, heterogeneous, publicly-owned and voluntarily devoted PCs. Folding@Home uses an 'ensemble dynamics' algorithm that performs M independent simulations with the same amino acid coordinates but with different velocities on M distributed processors, such that each simulation starts with a slightly different initial condition and pushes the system through a free energy minimization process. This algorithm gives an M-times speedup for the simulation of folding dynamics and thus avoids the overall waiting in free energy minima. Similarly, the process can be repeated in order to effectively handle multiple free energy barriers (multiple transitions for complex folding dynamics). Using a modified version of the Tinker molecular dynamics code, the β-hairpin and villin were simulated and their folds successfully determined. Based on the diversity of the simulation results for a variety of molecules (from the non-biological PPA helices to the 36-residue villin headpiece [166]), it has been observed that there is no single universal folding process, and even sequences which fold to the same structure may have different folding processes. Further details on a selection of grid-enabled protein folding applications are presented in Table 3.1.
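The ensemble idea described above can be illustrated with a minimal Python sketch: M independent replicas start from the same coordinates but with different random seeds (standing in for different velocities), and whichever replica crosses the barrier first reports the event. The "simulation" is a biased random walk used as a placeholder, not real molecular dynamics.

```python
# Minimal sketch of the ensemble-of-simulations idea (placeholder dynamics).
import random
from multiprocessing import Pool

M = 8               # number of independent replicas (one per processor/volunteer)
BARRIER = 50.0      # arbitrary threshold standing in for the folding barrier

def simulate(seed):
    rng = random.Random(seed)                # different seed = different initial velocities
    progress, steps = 0.0, 0
    while progress < BARRIER:
        progress += rng.uniform(-0.5, 1.0)   # biased random walk towards the folded state
        steps += 1
    return seed, steps

if __name__ == "__main__":
    with Pool(M) as pool:
        results = pool.map(simulate, range(M))
    best = min(results, key=lambda r: r[1])  # the replica that crossed the barrier fastest
    print(f"replica {best[0]} folded first after {best[1]} steps")
```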

3.3 Protein Structure Prediction

Based on Anfinsen's theory of protein folding [7], the aim of protein structure prediction is to predict the native conformation (tertiary structure) of a protein from its primary structure (sequence). It has remained an open problem in the field of structural proteomics [169]. New methods are being explored and investigated at various research institutes and groups throughout the world, and the quality and performance of these methods are evaluated every two years through the Critical Assessment of Techniques for Protein Structure Prediction (CASP) competition. In order to provide the best predicted results for a target protein, the use of grid and distributed public computing schemes has been successfully demonstrated through various projects.

TABLE 3.1: Grid-enabled applications for protein folding

Project/Application: CHARMM (Chemistry at HARvard Molecular Mechanics) [167]
Grid technologies: Legion grid operating system, which provides process, distributed file system, security and resource management services; simple command-line interface with basic commands for job submission, monitoring and result visualization.
Distribution techniques and speedup: 400 CHARMM jobs distributed with different initial conditions over 1020 grid nodes; 15% speedup in computational time.

Project/Application: CHARM [168]
Grid technologies: United Devices (UD) MetaProcessor (MP) platform for desktop grids; a master (MP Server) controls and manages all the tasks and uses IBM DB2 for storage; each worker runs a UD Agent with a task API to run the task module and communicate with the server.
Distribution techniques and speedup: the task was the folding of the src-SH3 protein with different algorithms (best-first, depth-first and breadth-first); the job was distributed into 50 work units, each consisting of 100,000 simulation steps; experiments were performed on a heterogeneous platform of 45 desktop machines.

For example, researchers at TSRI (The Scripps Research Institute) have developed a protein structure prediction supercomputer based on distributed public computing (Predictor@Home) [162] using the Berkeley Open Infrastructure for Network Computing (BOINC) software. The predictor itself consists of a set of complex protocols with increasingly sophisticated models that rely on standard software tools such as BLAST, SAM-T02, PSIPRED, MFOLD simulation (for conformational sampling) and CHARMM (for molecular simulations). It is reported that during the 6th Critical Assessment of Protein Structure Prediction Methods (CASP) competition, 6786 users participated in the Predictor@Home project and contributed a total compute time of about 12 billion seconds, the equivalent of about 380 years of computation on a single desktop machine, within just 3 months. This computing power was exploited for appropriate conformational sampling and refinement of the predicted structures of 58 CASP6 targets.

The quality of the structures predicted using the public computing infrastructure was compared with results obtained using a dedicated local cluster (64 nodes, 2.4 GHz Pentium Xeon processors, 1 GB RAM, Gigabit Ethernet network). The results of the comparison indicate that the vastly larger distributed computing power afforded by the BOINC implementation resulted in far better predictions than using the dedicated cluster. A similar grid-based approach that enhances the quality and performance of structure prediction has been demonstrated in [64]. It builds on the standalone web server named ROKKY (designed at Kobe University), which was ranked the 2nd best prediction web server in the fold recognition category of the CASP6 experiment. ROKKY uses a combination of standard analysis tools (PSI-BLAST and 3D-Jury) and the Fragment Assembly Simulated Annealing (FASA) technique implemented in the SimFold [170] software package. In order to further enhance the quality of prediction and the performance, a grid-based workflow design and control tool was added that allows the end-user to create a structure prediction experiment and submit it for execution on the grid. That is, the user can modify input parameters and/or methods based on real-time inspection and monitoring of the current predicted results. It has been reported [64] that for target T0198, the workflow-based prediction gave a faster result that was closer to the target structure than a non-workflow-based prediction, which uses simple batch files for job submission. This illustrates the importance of allowing the user to dynamically interact with the 'production pipeline' even when the software is distributed across the grid.

Another ambitious project, the Encyclopedia of Life (EoL), attempts to predict structural information for all the proteins in all known organisms. The computation time required for the annotation of about 1.5 million sequences (as of 2003), using a pipeline of computational tools (such as TMHMM, PSORT, SignalP, WU-BLAST, PSI-BLAST and 123D), has been estimated at 1.8 million CPU hours (more than 300 years!) on a single 1.8 GHz CPU. In order to facilitate this task, a grid-based workflow management system has been proposed and demonstrated [171], which builds on the AppLeS Parameter Sweep Template (APST) technology, providing an appropriate application deployment logistic and an adaptive scheduling and execution environment. The workflow was tested by running more than 54,000 proteome annotation jobs requiring 13,670.5 CPU hours during the four days of the Super Computing Conference (SC03) on a grid testbed consisting of 215 CPU nodes managed at ten different sites with different operating systems and local resource management software. Further details of some grid-based protein structure prediction applications are presented in Table 3.2.

TABLE 3.2: Grid-based applications for protein structure prediction

Project/Application: ProtFinder [172,173]
Grid technologies: Globus-based GridWay framework, which uses adaptive scheduling for dynamic grids; Condor/G-based GRAM; GIIS server for resource discovery; GASS and GridFTP for data handling; user interaction with the job submission agent through an API or the command line.
Distribution techniques and speedup: the prediction of 88 sequences was carried out in parallel by submitting an array job with 88 parallel tasks specified in a job template file; the entire experiment took about 43 minutes on 64 heterogeneous nodes.

Project/Application: PSA/GAc (Parallel Simulated Annealing using Genetic Crossover) [174]
Grid technologies: NetSolve-based client-server application model through the GridRPC API; API for user interaction.
Distribution techniques and speedup: simulated annealing runs distributed on NetSolve servers; GA crossover performed at the client side to reduce communication delays.

3.4 Protein Structure Comparison

The comparison of protein three-dimensional structures based on a variety of similarity measures is a key component of the most challenging structural proteomic tasks, such as understanding the evolution of protein networks, protein function determination and, of course, protein folding and protein structure prediction. These structures are determined through experimental techniques such as X-ray crystallography and NMR spectroscopy, as well as through the various computational techniques involved in protein structure prediction. The atomic coordinates of each structure are stored in publicly available database repositories such as the Protein Data Bank (PDB) (www.pdb.org).

The PDB stores the information about each structure in a separate file, using different file formats such as PDB, PDBML/XML, mmCIF and FASTA, among others. Each file is named with a four-letter alphanumeric identifier and its contents consist of a number of records and fields; each record may span a single line or multiple lines. With the current number of structures, it takes about 22 hours to download the complete database (www.pdb.org/pdb/statistics/holdings.do) to a local machine, and more than 36 GB of free disk space are required for its storage. The size (length) of a protein structure can range from as few as 40-50 residues to several thousand residues (e.g. in multi-functional proteins); polymers with fewer than 40 residues are referred to as peptides rather than proteins. The average protein structure length is estimated to be around 300 amino acids (residues) per structure. In order to be comparable, protein tertiary structures as available in the PDB are usually further processed to be represented in some coordinate-independent space. The choice of a suitable representation plays an important role in the development of an efficient and reliable protein structure comparison algorithm. The most commonly used coordinate-independent representations include the Distance Matrix (DM), Contact Map (CM), Absolute Contact (AC), Relative Contact (RC) and Contact Number (CN). There are several different methods (e.g., see Appendix .1) that use one of these representations to provide measures of similarity or divergence between pairs of protein structures.
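The step from PDB coordinates to the coordinate-independent representations just mentioned can be sketched as follows: C-alpha coordinates are read from ATOM records, a distance matrix (DM) is computed, and a contact map (CM) is obtained by thresholding it. The 8 Angstrom cut-off and the input file name are illustrative choices, not values prescribed by any particular method.

```python
# Sketch: PDB file -> C-alpha coordinates -> distance matrix -> contact map.
import math

def read_ca_coordinates(pdb_file):
    """Return the (x, y, z) coordinates of C-alpha atoms from a PDB file."""
    coords = []
    with open(pdb_file) as fh:
        for line in fh:
            if line.startswith("ATOM") and line[12:16].strip() == "CA":
                coords.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return coords

def distance_matrix(coords):
    n = len(coords)
    dm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dm[i][j] = dm[j][i] = math.dist(coords[i], coords[j])
    return dm

def contact_map(dm, cutoff=8.0):
    """Binary contact map: residues i and j are in contact if 0 < d(i, j) <= cutoff."""
    n = len(dm)
    return [[1 if 0 < dm[i][j] <= cutoff else 0 for j in range(n)] for i in range(n)]

coords = read_ca_coordinates("1abc.pdb")        # hypothetical input file
cm = contact_map(distance_matrix(coords))
```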

Although the comparison of a single pair of protein structures has been found to be solvable in polynomial time [175], every individual method takes a different time depending on its complexity. Furthermore, as the number of known protein structures grows, the size of the corresponding databases (such as the PDB) also increases; hence the process of structure comparison requires more efficient algorithms, which could exploit the power of web and grid computing technologies to provide accurate and optimal results with enhanced reliability and fault tolerance. One such approach has been demonstrated in [176], which employs a distributed grid-aware algorithm with indexing techniques based on geometric properties. It used a Globus and MPICH based grid testbed consisting of four nodes (each with a 300 MHz CPU); experiments were performed comparing a target against 19,500 PDB structures in about 19 seconds. Another related approach is presented in [27], describing a meta-server for Protein Comparison, Knowledge, Similarity and Information (ProCKSI) that integrates multiple protein structure comparison methods such as the Universal Similarity Metric (USM), the Maximum Contact Map Overlap (MaxCMO) and an algorithm for the alignment of distance matrices (DALI), amongst others. Additionally, it produces a consensus similarity profile of all the similarity measures employed. The application runs on a mini-cluster and provides a web-based interface (http://www.procksi.net/) for job submission and result visualization. As the study in this dissertation is based on the philosophy of ProCKSI, further details are provided in section 3.4.1. Some other grid-enabled applications for protein structure comparison are presented in Table 3.3.

TABLE 3.3: Grid-based protein structure comparison

Project/Application: FROG (Fitted Rotation and Orientation of protein structure by means of a real-coded Genetic algorithm) [177,178]
Grid technologies: Ninf GridRPC-based master/slave model; asynchronous parallel programming in C; web-based GUI through the NinfCalc tool.
Distribution techniques and speedup: the master generates the initial population and then repeatedly copies three non-redundant parents to each node in the grid; a speedup of 5.70 was achieved for the comparison of a single pair of proteins on a grid testbed of 16 nodes.

Project/Application: PROuST [179] (ontology- and workflow-based grid-enablement of the PROuST application for protein structure comparison)
Grid technologies: PROTEUS problem solving environment with a Globus-based grid infrastructure; UML-based GUI for workflow composition, browsing, selection and result visualization.
Distribution techniques and speedup: the PROuST application is divided into three independent phases (pre-processing, similarity search and structural alignment), each implemented as an independent sub-workflow/component.

3.4.1 ProCKSI

ProCKSI is an online automated expert system that aims at helping an end-user biologist to compare any given set of protein structures using an ensemble of structure comparison methods (e.g. those listed in Table 1.1) and to visually analyze the consensus-based relationships among the compared structures with a variety of heuristic and statistical clustering methods, such as the Unweighted Pair Group Method with Arithmetic mean (UPGMA) [180] and Ward's Minimum Variance (WMV) method [181], all through an easy, intuitive and unified web interface, as shown in Figure 3.1.
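The clustering step can be sketched with standard tools, assuming a pairwise dissimilarity matrix has already been produced from the comparison methods. In scipy, the 'average' linkage corresponds to UPGMA and 'ward' to Ward's minimum variance method; the tiny matrix below is purely illustrative and not ProCKSI's actual pipeline.

```python
# Sketch of hierarchical clustering of structures from a dissimilarity matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# pairwise dissimilarities (e.g. 1 - normalised similarity) for four structures
dist = np.array([[0.0, 0.1, 0.8, 0.9],
                 [0.1, 0.0, 0.7, 0.8],
                 [0.8, 0.7, 0.0, 0.2],
                 [0.9, 0.8, 0.2, 0.0]])

condensed = squareform(dist)                       # condensed form required by linkage()
upgma_tree = linkage(condensed, method="average")  # 'average' linkage = UPGMA
clusters = fcluster(upgma_tree, t=0.5, criterion="distance")
print(clusters)                                    # e.g. [1 1 2 2]: two groups of structures
```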

ProCKSI's web interface also works as a gateway and knowledge base for the protein universe, as it provides hyperlinks to further important and frequently used sources of information needed for exploring the specific details of any individual structure. These sources include links to iHOP (Information Hyperlinked Over Proteins) [51], SCOP (Structural Classification of Proteins) [52] and CATH (Class, Architecture, Topology and Homologous superfamily) [53].

FIGURE 3.1: Flow diagram of ProCKSI's front-end. Initially the user types the URL (www.procksi.net) and ProCKSI responds with a welcome page prompting the user to start a new experiment. The experiment requires the user to select the comparison mode and the methods to be used for comparison, to provide the data, and to select the models and chains from the list of files extracted by ProCKSI. Finally, ProCKSI asks for notification options and the email address of the user for sending notification about the results. The experiment is performed at ProCKSI's back-end [27] and web-based results linked with clustering and visualization tools are made available at the front-end.

As demonstrated in [27], and previously suggested in [54] and [55], the ensemble and consensus based approach adopted by ProCKSI yields more reliable results of biological significance than the results obtained with any single structure comparison method developed so far. This is mainly because previously developed methods tend to deal well with either very divergent structures or very similar structures; the integrated approach of ProCKSI, however, enables it to deal well with both types of structures simultaneously. For example, to deal with divergent structures ProCKSI uses the top level of its protocol (Figure 3.2), namely, the Universal

Similarity Metric (USM) [48]. This method uses the contact map representation of two protein

structures, say, S1 and S2, to heuristically approximate the Kolmogorov complexity by means of any compression algorithm (such as compress, gzip, bzip2, etc.). It then uses the Normalized Compression Distance (NCD) (see equation 3.1) to express the pairwise similarities among the compared proteins. The NCD, being an effective and problem-domain independent similarity metric, works particularly well for divergent protein structures [48] and sequences [182].

NCD(s1, s2) = max{K(s1|s2), K(s2|s1)} / max{K(s1), K(s2)},    (3.1)

where K(si) is the Kolmogorov complexity of object si and K(si|sj) is the conditional complexity.
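To make the compression-based approximation concrete, the following minimal C sketch (an illustration only, not ProCKSI's code; the function names are ours and zlib merely stands in for any compressor) estimates the NCD of equation 3.1 by taking the compressed size C(x) as a stand-in for K(x) and C(yx) − C(y) as a stand-in for the conditional complexity K(x|y).

/* ncd.c: rough NCD estimate via zlib; compile with: gcc ncd.c -lz */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

/* Compressed size (bytes) of a buffer, used as a stand-in for K(). */
static size_t csize(const unsigned char *buf, size_t len)
{
    uLongf dlen = compressBound(len);
    unsigned char *dst = malloc(dlen);
    compress(dst, &dlen, buf, len);              /* zlib, default level */
    free(dst);
    return (size_t)dlen;
}

/* Compressed size of the concatenation y.x, needed for K(x|y) ~ C(yx) - C(y). */
static size_t csize_cat(const unsigned char *y, size_t ly,
                        const unsigned char *x, size_t lx)
{
    unsigned char *cat = malloc(ly + lx);
    memcpy(cat, y, ly);
    memcpy(cat + ly, x, lx);
    size_t c = csize(cat, ly + lx);
    free(cat);
    return c;
}

/* NCD(s1, s2) = max{K(s1|s2), K(s2|s1)} / max{K(s1), K(s2)}   (equation 3.1) */
double ncd(const unsigned char *s1, size_t l1, const unsigned char *s2, size_t l2)
{
    double c1 = (double)csize(s1, l1);
    double c2 = (double)csize(s2, l2);
    double k1g2 = (double)csize_cat(s2, l2, s1, l1) - c2;   /* ~ K(s1|s2) */
    double k2g1 = (double)csize_cat(s1, l1, s2, l2) - c1;   /* ~ K(s2|s1) */
    double num = k1g2 > k2g1 ? k1g2 : k2g1;
    double den = c1 > c2 ? c1 : c2;
    return num / den;          /* near 0: very similar, near 1: unrelated */
}

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s string1 string2\n", argv[0]); return 1; }
    printf("NCD = %f\n", ncd((const unsigned char *)argv[1], strlen(argv[1]),
                             (const unsigned char *)argv[2], strlen(argv[2])));
    return 0;
}

In ProCKSI's setting the two inputs would be the serialized contact maps of the structures being compared, and any off-the-shelf compressor (compress, gzip, bzip2) can be substituted for zlib.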

On the other hand, for more similar structures, ProCKSI uses a more refined method, namely the Maximum Contact Map Overlap (MaxCMO) method [46], which is able to detect the topological similarities among the compared proteins. This method employs a metaheuristic to count the number of equivalent residues (alignment) and contacts (overlap) in the contact map

FIGURE 3.2: ProCKSI’s multi-method protocol and workflow: ProCKSI with its multiple similarity comparison methods: the Universal Similarity Metric (USM), the Maximum Contact Map Overlap (MaxCMO), and other local and external methods. Currently, these are the DaliLite and TM-align methods, the Combinatorial Extension (CE) of the optimal path, and the FAST Align and Search Tool (FAST). Extracted from [27].

representation of a given pair of structures in the following way:

"An amino acid residue a1 from one protein is aligned to an amino acid residue a2 from a second protein if a contact of a1 in the first protein (C(a1)) can also be aligned to a contact of a2 in the second protein (C(a2)) closing a cycle of size 4 in the graph representation of the contact map. A further restriction for the overlaps is that they should not produce crossing edges. That is, if a1 is aligned to a2, C(a1) is aligned to C(a2) and, without loss of generality, a1 < C(a1) (i.e. the atom or residue a1 appears before than C(a1) in the sequence) then a2 < C(a2). Thus, an overlap in this model is a strong indication of topological similarity between the pair of proteins as it takes into consideration the local environment of each of the aligned residues" [27].

A Fuzzy Sets based generalization of contact maps has been proposed in [183], and methods of comparing proteins based on them in [184,185]. However, this work is based only on the discrete version. As each additional method complements the results of the other methods, ProCKSI uses some other external methods, such as the Distance Alignment (DaliLite) method [45], the Combinatorial

Extension (CE) method [47], the TM-align method [49], and the FAST method [50], in order to develop a more robust and reliable consensus.

The following subsection provides an overview of the major functional activities (tasks) carried out by ProCKSI for each user request in terms of its existing architectural design and infrastructure resources. This description is mainly aimed at conveying the overall functional complexity of the system and hence at a better understanding of the major challenges whose solution is explored throughout the rest of this dissertation.

3.4.2 ProCKSI’s Existing Architecture and Limitations

Currently ProCKSI runs on a PBS (Portable Batch System) based mini-cluster consisting of 5 nodes

(one head and 4 compute) with 14 CPU slots in total (Figure 3.3(a)) and many software services, as illustrated in Figure 3.3(b). When a new request is submitted through the simple steps illustrated in Figure 3.4(a), the request is registered in the database and is assigned a unique ID. The registration of the request also involves the registration of the tasks and structures specified by the user in his/her request (see Figure 3.4(b)). With the current setup a user can specify as many as 8 different tasks (methods, as listed in Table 1.1) and as many as 250 protein structures (with a maximum size of 100 MB) to be compared and analyzed. The specified structures need to be either uploaded from the user's local machine or downloaded from the PDB repository (which further limits the number of structures to 50, so as to avoid the prolonged latency involved in Internet-based communication with the PDB server). These structures are further extracted into models and chains as per the user's choice.

Once the request is registered it is further processed by a software component called Task

Manager that runs on the master node of the cluster. The main role of the Task Manager is to

extract the information from the request and prepare individual tasks for submission to the queuing system (PBS). In its current architecture, the Task Manager of ProCKSI prepares a separate task for each comparison method, to be executed on a single processor using the complete dataset as specified by the user. Depending on the load of the queuing system, the task (or job, in queuing-system terms) might have to wait for some time before getting started on a slave node. Once the job gets started on the slave node, it might go through a long execution time depending on the size of the dataset and the speed of the particular structure comparison method (executable), as it runs only on a single processor.

Therefore, the current distribution mechanism of the Task Manager needs to be scaled to some optimal fine-grained level, so as to allow more efficient comparison beyond the current limitation of 250 protein structures. The change in the distribution mechanism of the Task Manager would also have to take into account the computational load of pre/post-processing and result visualization, which is currently carried out only on the master node (see Figures 3.5(a) and (b)). The post-processing involves the complex and time consuming process of standardization (conversion of all results into a single format and normalization of all the values) and the preparation of clusters to be visualized with different software for understanding the relationship among the compared set of proteins.

(a)

(b)

FIGURE 3.3: ProCKSI’s current a) hardware infrastructure, consisting of 2 dual-processor and 3 dual-processor dual-core nodes (i.e., a total of 12 cores), of which one node serves as the cluster head and the remaining 4 work as compute nodes; b) list of software services in terms of user, master and slave (compute) nodes. (Diagram courtesy of Daniel Barthel)

(a)

(b)

FIGURE 3.4: ProCKSI’s a) request submission, which goes through 5 main stages for its completion; b) scheduling flowchart illustrating the main steps the data (application and status related) goes through in terms of request, task and job scheduling. (Diagram courtesy of Daniel Barthel)

(a)

(b)

FIGURE 3.5: ProCKSI’s a) pre/post-processing (PPP), which currently runs on the head node while the data gets transferred from the storage (compute) node; b) result retrieval flowchart illustrating the node interaction for sharing the request status and data. (Diagram courtesy of Daniel Barthel)

3.5 Conclusions

It has been observed from the reviewed literature that both Grid and distributed public computing schemes have been used successfully in the field of structural proteomics for both compute and data intensive applications. The former is powered by standard grid middleware technologies such as

Globus, Legion, NetSolve, Ninf, myGrid, Condor/G, etc., whereas the latter is powered by BOINC,

the UD MetaProcessor, etc. In fact, the diversity of enabling technologies for grid and distributed computing makes it difficult for a developer to select the most appropriate technological infrastructure with proven standards and tools. The various demonstrations reviewed in this chapter are aimed at providing a road map through this dilemma.

It has been observed that the selection of an appropriate grid/distributed computing approach depends mainly on the nature of the application. Applications whose jobs are independent and parallel in nature are more suitable for distributed computing based on publicly-owned resources; for example, the majority of the projects investigating the protein folding process and protein structure prediction use this type of infrastructure, as it provides a huge number of resources free of cost. The availability of such resources contributes to better prediction results from the simulations compared with the results obtained on the limited resources available in a dedicated cluster or grid environment.

On the other hand, if the application involves some form of interprocess communication along with a huge amount of I/O data, then an organizational or cross-organizational grid infrastructure with the above-mentioned standard middleware serves better. This is mainly because the interprocess communication and I/O overheads in distributed public computing schemes would be very large, owing to the significant latencies of loosely coupled networks. In the case of MC-PSC, there is no interprocess communication during the initial phase of comparison, but the subsequent phase of standardization and normalization of results requires some data to be shared among all the processes. In addition to the requirement for interprocess communication, MC-PSC also involves the distribution and collection of a significantly large amount of I/O data. Therefore, in the light of the material presented in this chapter, it has been identified that the latter approach (i.e., an organizational or cross-organizational grid infrastructure with standard grid middleware) would be more appropriate in the case of MC-PSC. Interestingly, though not required, almost all of the single-method protein structure comparison approaches reviewed in this chapter are also based on the use of standard grid and parallel programming environments. Building on this knowledge, the next chapter provides further details of the actual technology used for the solution of the MC-PSC problem.

CHAPTER 4

MATERIALS AND METHODS

Chapters 2 and 3 provided a comprehensive review of the literature, starting from the wide perspective of the field and moving to the more specific perspective of the research topic. The purpose of this chapter is to address questions such as "how is this research designed?" and "which methodology will be used for the implementation and evaluation of the proposed solution?".

4.1 Introduction

The problem of large scale multi-criteria protein structure comparison (MC-PSC) and analysis can be represented as a 3D cube (Figure 4.1). The x and y axes of the cube represent the different proteins being compared, while the z axis represents the different comparison methods being used.

Once processed, each cell of this 3D cube holds the output of a comparison method in terms of different measures and metrics. That is, each cell of the 3D cube represents both the processing as well as the storage perspective of the problem space, while cell boundaries specify the communication overhead. Given the ever growing number of protein structure comparison methods as well as the number of protein structures being deposited in the PDB, the dimensions of this cube keep on increasing, making its computation, in our opinion, one of the Grand Challenge Applications (GCAs) in the field of structural biology. GCAs are defined as "fundamental problems in science and engineering with great economic and scientific impact, whose solution is intractable without the use of state-of-the-art parallel/distributed systems" [186]. Many examples of the use of parallel/distributed systems for the solution of GCAs in the field of life sciences in general, and structural proteomics in particular, have been showcased in the previous two chapters. Based on the lessons learned from these examples, we propose a distributed framework as a solution to the grand challenge of MC-PSC. This chapter provides a description of the methodological approach that we used for the design of the proposed distributed framework (section 4.2), along with a description of the programming environment used for its implementation (section 4.3), the testbed for experimentation (section 4.4), the datasets (section 4.5) and the performance measures/metrics used for the evaluation of the proposed system (section 4.6).

FIGURE 4.1: 3D-cube representation of the MC-PSC problem space. Each differently colored sub-cube illustrates the comparison of a set of structures with itself (or with another set of structures) using a particular algorithm (method).
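As a concrete illustration of this representation (a sketch using our own naming conventions, not the data structures actually implemented in Chapter 5), the 3D cube can be held as a flat array indexed by the two protein indices and the method index:

#include <stddef.h>
#include <stdlib.h>

typedef struct {
    int p;          /* number of proteins (x and y axes of the cube)  */
    int m;          /* number of comparison methods (z axis)          */
    double *cell;   /* p * p * m comparison results                   */
} mcpsc_cube_t;

/* Flat offset of cell (i, j, k): the result of comparing protein i with
 * protein j using method k. */
static size_t cube_index(const mcpsc_cube_t *c, int i, int j, int k)
{
    return ((size_t)i * c->p + j) * (size_t)c->m + k;
}

static int cube_init(mcpsc_cube_t *c, int p, int m)
{
    c->p = p;
    c->m = m;
    c->cell = calloc((size_t)p * p * m, sizeof *c->cell);
    return c->cell != NULL;
}

static void cube_set(mcpsc_cube_t *c, int i, int j, int k, double v)
{
    c->cell[cube_index(c, i, j, k)] = v;
}

static double cube_get(const mcpsc_cube_t *c, int i, int j, int k)
{
    return c->cell[cube_index(c, i, j, k)];
}

For the whole PDB such a monolithic layout does not fit in the memory of a single node, which is precisely why Chapter 5 decomposes the cube into per-node sub-matrices.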

4.2 Methodological Approach

It is believed that most GCAs may have several parallel solutions; therefore, a methodological approach of an exploratory nature will help in finding the best available solution [187]. An example of such an approach, widely used for the design of parallel and distributed algorithms, is the PCAM (Partitioning, Communication, Agglomeration, and Mapping) distributed problem solving strategy, as illustrated in Figure 4.2.

FIGURE 4.2: "PCAM: a design methodology for parallel programs. Starting with a problem specification, we develop a partition, determine communication requirements, agglomerate tasks, and finally map tasks to processors" [187]

Introduced by Ian Foster in his book "Designing and Building Parallel Programs" [187], the beauty of this approach is that it enables the designer to consider the machine-independent issues (e.g. concurrency, scalability and communication) first and the machine-specific issues (e.g. granularity and load-balancing) later in the design process. This strategy consists of four main stages, which are summarized below:

Partitioning: The focus of this stage lies in exposing the opportunities for parallel execution

in order to decompose the problem into a large number of fine-grained tasks. The partitioning could be applied to decompose the computation (i.e. functional decomposition) and/or the data (i.e. domain decomposition). Different options were considered for both the domain and

functional decomposition strategies to partition the 3D cube and analyze the pros and cons of

each partition in terms of efficiency and scalability. The outcome of this stage as applied to

our problem of MC-PSC is explained in section 5.3.

Communication: This stage determines the need for information to be exchanged among the parallel processes resulting from the partitioning. It also specifies whether there are any dependencies among the processes and whether synchronization and dynamic communication strategies are needed. An appropriate communication structure is selected and a theoretical cost analysis is performed in order to obtain the optimal solution. The analysis of the communication involved

in sharing the input/output and other local/global data needed in the process of normalization

as applied to the MC-PSC problem is discussed in section 5.3.

Agglomeration: The theoretical cost evaluation of the partitioning and communication stages suggests whether there is any way of grouping the fine-grained tasks in order to make the system more optimal. This stage focuses on the different options for grouping and on finding the best working model. The different methods of agglomeration for MC-PSC are explained in section 5.3.

Mapping: This stage considers the assignment of processes/tasks for execution on each processor in a way that enhances the overall processor utilization and reduces the communication overhead. Depending upon the number of available processors, this stage could suggest adjusting the agglomeration and introducing load balancing approaches so as to assign the same unit of work to each processor/node.

4.3 Programming Environment

Table 4.1 provides an overview of some commonly used systems for the implementation of parallel algorithms. Each of these tools serves a different purpose and hence is used for a particular class of applications. We selected the Message Passing Interface (MPI) model of parallel programming as it is particularly suited to applications having the structure of either Single Program Multiple Data (SPMD) or Multiple Program Multiple Data (MPMD). MPI itself is a library of standard functions for exchanging messages and performing collective operations (i.e. operations which send/receive data from many nodes simultaneously, e.g. broadcast (to send the same data to all nodes), gather (to collect data from all nodes), and scatter (to divide the data into pieces and send a different piece to each node)); see Table 4.2 for a list of commonly used MPI functions. The standards for

MPI are defined and maintained by MPI Forum, which is an open group consisting of representa-

tives from many organizations (http://www.mpi-forum.org/). MPI Forum introduced the very

first standard (MPI 1.0 / MPI-1) on May 5, 1994 and the second (enhanced) standard (MPI-2) on

July 18, 1997. Some of the major features introduced in MPI-2 include one-sided communication operations (e.g. Put, Get, and Accumulate), collective extensions (e.g. MPI_Alltoallw and MPI_Exscan), and dynamic process management (e.g. MPI_Comm_spawn, MPI_Comm_join, and MPI_Comm_accept/MPI_Comm_connect). Table 4.3 provides a list of both freely available and vendor-supplied implementations of the MPI standards. We tested the implementation of our application using three freely available implementations, namely MPICH2 [188,189], Open MPI [190] and MPIg [191,192].

TABLE 4.1: Overview of commonly used systems for parallel programming. Note: * indicates that this variant has become a de-facto standard in the community.

System: Parallel C++. Available variants: Compositional C++ (CC++)* [193], parallel C++ (pC++) [194], Concurrent Object-Oriented Language (COOL) [195], Mentat [196], High-Performance C++ (HPC++) [197]. Purpose: all these languages and tools are extensions to C++; they provide basic parallel programming features and abstractions and are used for algorithms requiring dynamic task creation and having irregular computation/communication patterns.

System: Fortran. Available variants: Fortran M (FM)* [198,199], Fx [200]. Purpose: these languages provide all the programming constructs for 'task parallelism, deterministic execution, and modularity'.

System: HPF. Available variants: High Performance Fortran (HPF)* [201], Connection Machine Fortran (CM Fortran) [202], Data Parallel Fortran (Fortran D) [203]. Purpose: these are data parallel languages; they provide array operations to express parallelism and are mostly used for numeric algorithms.

System: Message Passing. Available variants: Message Passing Interface (MPI)* [204], p4 [205], Portable Instrumented Communication Library [206], Parallel Virtual Machine (PVM) [207]. Purpose: these systems provide standard functions for sending/receiving messages and are particularly used for algorithms having regular SPMD/MPMD structures.

4.4 Testbed and Production Level eScience Infrastructure

Experiments were run on three different levels of infrastructure: the first one being a simple testbed comprising a single Beowulf cluster of 6 nodes (with a total of 9 slots) connected through Gigabit Ethernet. As this testbed was self-built and under our full administrative control, we were able to configure and install different LRMSs, e.g. Sun Grid Engine (SGE), the Portable Batch System (PBS) and the Load Sharing Facility (LSF), along with different implementations of MPI such

MPI_Init: Initializes the MPI environment.
MPI_Comm_rank: Determines the rank of the calling process within a group.
MPI_Comm_size: Determines the size of the group.
MPI_Bcast: Broadcasts a message from the "root" process to all other processes.
MPI_Send: Sends messages.
MPI_Recv: Receives messages.
MPI_Status: Provides information about a received message in terms of error codes.
MPI_Barrier: Blocks until all processes have reached this routine.
MPI_Allreduce: Combines values from all processes and distributes the result back to all.
MPI_Finalize: Terminates the MPI environment.

TABLE 4.2: Description of MPI Routines
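The fragment below is a minimal, illustrative C program (it is not part of the thesis software) exercising several of the routines listed in Table 4.2: the root process broadcasts a work-unit size and each worker acknowledges it with a point-to-point message.

/* table42.c: compile with mpicc table42.c -o table42; run with mpirun -np 4 ./table42 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                      /* set up the MPI environment  */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);        /* rank of the calling process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);        /* total number of processes   */

    int work = (rank == 0) ? 250 : 0;            /* e.g. number of structures   */
    MPI_Bcast(&work, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* root -> everyone      */

    if (rank != 0) {
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);          /* acknowledge */
    } else {
        for (int i = 1; i < size; i++) {
            int who;
            MPI_Status st;
            MPI_Recv(&who, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
            printf("worker %d acknowledged a work unit of size %d\n", who, work);
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);                 /* everyone synchronizes here  */
    MPI_Finalize();                              /* tear down the environment   */
    return 0;
}

The same source compiles and runs unchanged under MPICH2, Open MPI or MPIg, which is one of the practical advantages of writing to the MPI standard rather than to a particular implementation.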

TABLE 4.3: Overview of commonly used MPI implementations

MPICH/MPICH2 [208,209]: Developed by Argonne National Laboratory and Mississippi State University, MPICH/MPICH2 is a freely available and portable implementation of the MPI-1/MPI-2 standard for most flavors of Unix and MS Windows.
SCore MPI [210]: Originally developed by Real World Computing (RWC); SCore is now maintained by the PC Cluster Consortium and is freely available for many different platforms.
MPI.NET [211]: Developed at Indiana University; MPI.NET is an open source implementation that allows parallel programming in .NET technologies such as C# and the Common Language Infrastructure (CLI).
Platform MPI [212]: Platform MPI (aka Scali MPI and HP-MPI) is a commercial implementation of MPI developed by Platform Computing Inc. It claims to provide better performance than its open source or other commercial counterparts.
Open MPI [213]: Open MPI, as the name reflects, is an open source MPI developed by taking the best ideas from several other MPI implementations, e.g. FT-MPI (University of Tennessee), LA-MPI (Los Alamos National Laboratory), LAM/MPI (Indiana University), and PACX-MPI (University of Stuttgart). It is widely used by many TOP500 supercomputers across the globe.
MacMPI [214]: MacMPI is the implementation of MPI by Dauger Research, Inc. that provides a parallel programming environment for the Macintosh.
MPJ Express [215]: MPJ Express is developed at the Centre for Advanced Computing and Emerging Technologies (ACET). It provides an environment to develop parallel programs in Java.

as MPICH, MPICH-G2, MPIg and OpenMPI in order to evaluate different possible alternatives.

Furthermore, we also used this testbed to configure and install the Globus Toolkit [34,35], in order to confirm the operation of our application before deploying it on a production level infrastructure. The second level of infrastructure comprised a production level High Performance Computing (HPC)

Linux cluster, named spaci and located at the ICAR-CNR institute in Italy, with 64 dual-processor

Itanium2 1.4GHz nodes each having 4GB of main memory and being connected by a Qsnet high performance network. The main purpose for using this infrastructure was to foster the collaborative relationship between Nottingham and ICAR-CNR in order to mutually share the skills and expertise needed at the interface of two complicated and inter-disciplinary subjects of Bioinformatics and Grid

Computing. The third level of infrastructure consisted of the eScience infrastructure provided free of cost to all United Kingdom (UK) scientists by the National Grid Service (NGS), UK [66]. In this case we used the Globus-based MPIg [191,192] (a grid-enabled implementation of MPI) to spawn jobs across two NGS sites, one at Leeds and the other at Manchester. Each of these sites has 256 cores (AMD Opterons) running at 2.6GHz with 8GB of main memory. A succinct description of the software tools and services used for each of these infrastructures, along with schematic diagrams, is presented in the corresponding chapters that report on the empirical results for the different cases.

4.5 Datasets

Different datasets were used to perform the experiments. These datasets were chosen because they have previously been used in the literature [27,48,185,216–220], e.g. the Chew-Kedem (CK34) dataset [221], the Rost and Sander dataset (RS119) [222], and the Kinjo et al. dataset [223], and because our group has experience using them.

The first two of these datasets, i.e. CK34 and RS119, were used as examples of small datasets, consisting of 34 and 119 protein structures respectively, while the third dataset, i.e. Kinjo et al., was used as an example of a large dataset consisting of 1012 protein structures. Some other datasets were also prepared for the sake of different case studies. These include three datasets consisting of a regularly increasing number of proteins, i.e. the SFBK-250, SFBK-500 and SFBK-1000 datasets having

250, 500 and 1000 protein structures which were randomly selected and downloaded from the PDB.

The main purpose of the preparation of these datasets was to investigate the effect of an increasing number of proteins on the overall execution time. Lastly, the PDB_SELECT30 dataset was prepared and used as an example of a biologically significant dataset. It is a representative dataset (a subset of the PDB) consisting of all non-redundant protein structures having a chain length greater than 50 and a sequence identity of less than 30%. This dataset has been prepared by using the PDB_SELECT algorithm designed by Uwe Hobohm and Chris Sander [224]. All the PDB structures of these datasets were parsed into simple files containing single PDB chains and their contact map (CM) counterparts.

4.6 Performance Metrics

Two metrics are usually used for testing the computational scalability of a parallel/distributed system: the speedup S and the efficiency E.

Speedup:

The speedup of a parallel algorithm is the ratio between the time Ts taken by the best sequential implementation of an application measured on one processor and the execution time Tp taken by the same application running on p processors.

S = Ts / Tp    (4.1)

The optimal case is given by a linear speedup: if we run the same application on p processors, then we can expect at best a reduction in time by a factor of p, and therefore the speedup will be at most p. In fact, this is only a theoretical condition, because the parallel algorithm introduces an overhead, mainly due to the communication times among the different processors. If the problem is not sufficiently complex, and the communication times are not negligible with respect to the computational time, then the speedup might be noticeably smaller. Another factor that limits the speedup lies in the proportion of a particular piece of the program that cannot be further parallelized. This fact is known as Amdahl's law, which states that if the proportion of a program that can be parallelized is P and the proportion that cannot be parallelized (i.e., remains serial) is (1 − P), then the speedup on N processors is limited as per the following expression, no matter how much N is further increased:

S(N) = 1 / ((1 − P) + P/N)    (4.2)

Equation 4.2 shows that, as the value of N tends towards infinity, the maximum speedup will be limited to 1/(1 − P), and hence the benefit of further parallelization shows a saturation effect. For example, if the proportion of the program that can be parallelized is 95%, then the proportion that cannot be parallelized (1 − P) will be 5%, and hence the maximum speedup will be limited to a factor of 20, irrespective of how much N is further increased. This is depicted in Figure 4.3.

Efficiency:

Efficiency is given by the ratio between the speedup S and the number of processors p:

FIGURE 4.3: Illustration of Amdahl’s law: when the proportion of the program that can be parallelized is 95% and the proportion that cannot be parallelized (1 − P) is 5%, the maximum speedup is limited to a factor of 20.

E = S / p    (4.3)

and it represents an index of the fraction of time usefully spent by each processor. In this case, the highest value of efficiency (equal to 1) is attained when all the processors are utilized to the maximum (communication times and other overheads equal to zero).
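The following short C program (the timings in it are made-up example values, used purely for illustration) computes the quantities of equations 4.1-4.3 together with Amdahl's bound of equation 4.2, reproducing the saturation at 1/(1 − P) = 20 for P = 95%.

#include <stdio.h>

static double speedup(double ts, double tp)   { return ts / tp; }                      /* eq. 4.1 */
static double efficiency(double s, int p)     { return s / p; }                        /* eq. 4.3 */
static double amdahl_bound(double P, int N)   { return 1.0 / ((1.0 - P) + P / N); }    /* eq. 4.2 */

int main(void)
{
    double ts = 3600.0, tp = 150.0;   /* hypothetical serial and parallel times (s) */
    int p = 32;
    double s = speedup(ts, tp);
    printf("S = %.1f, E = %.2f\n", s, efficiency(s, p));

    /* With 95 percent of the program parallelizable the bound saturates at 1/(1-P) = 20 */
    for (int n = 16; n <= 4096; n *= 4)
        printf("Amdahl bound for P = 0.95, N = %4d: %.2f\n", n, amdahl_bound(0.95, n));
    return 0;
}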

4.6.1 Grid speedup and efficiency

The theoretical analysis of the scalability of the ’computation-centric’ parallel applications on the

grid appears in [225] with a prompt to the Grid community for the demonstration of this idea in

terms of real Grid computing environments. This theoretical analysis is based on the idea of a ’Homogeneous Computational Grid’ (HCG) and fits well with the real Grid computing infrastructure provided by the UK National Grid Service (NGS) [66] that we use for MC-PSC. The HCG model is based on the concept of a ’Hierarchical Resource Manager’ [226] and assumes that the Grid consists of C identical Computing Elements (CEs) and that each CE (being an HPC system) has p identical processors along with identical means of interconnect. The workload decomposition on such a system consists of a two-level hierarchy: at first the un-decomposed work

(W, expressed e.g. in Mflops) is equally distributed over the C CEs (i.e. a W/C decomposition) and then, within each CE, the portion of the work is assigned to each of the p processors (i.e. a (W/C)/p decomposition). Consequently, this two-level hierarchy gives rise to two sources of communication overhead: the communication overhead between the C CEs, Q1(W,C), and the communication overhead between the p processors of each CE, Q2(W/C, p). With this formalism, the execution time on the HCG can be defined as:

TC,p(W) = W/(pC∆) + Q2(W/C, p) + Q1(W,C)    (4.4)

where ∆ indicates the computing capacity of a processor, e.g. in Mflops/s. Note that if C = 1 and Q1(W,1) = 0 then equation 4.4 reduces to the standard parallel case, i.e. Q2(W, p) = Q(W, p).

Equation 4.4 makes it clear that running the parallel application on more than one CE introduces an additional communication overhead in terms of Q1(W,C), which increases the execution time. However, this increase in the execution time can be masked by the value of C, which decreases the execution time by increasing the number of processors and also by reducing the communication overhead, in terms of Q2(W/C, p) as compared to Q(W, p) on one CE.

In order to analyze the added value of parallelism we normally compare the parallel execution time on p processors with the sequential execution time on 1 processor. However, as suggested by [225], in a Grid environment we need to compare the parallel execution time on C CEs with the parallel execution time on 1 CE. This comparison is named the Grid Speedup and is mathematically defined as:

Γ^C_p = T1,p(W) / TC,p(W)    (4.5)

where Γ^C_p is the ’Grid Speedup’ (with p processors and C CEs), T1,p is the execution time on a single CE and TC,p is the execution time on C CEs.

The Grid Speedup (equation 4.5) is one of the scalability metrics for parallel applications on the Grid. Its value indicates how much better a parallel application performs, in terms of execution time, when decomposed over C CEs as compared to its performance on a single CE. From equation 4.5 we can also derive the expression for the Grid efficiency as:

γ^C_p = T1,p(W) / (C · TC,p(W))    (4.6)

where γ^C_p is the ’Grid efficiency’ and p, C, T1,p and TC,p represent the same parameters as described in eq. 4.5.

The description of the ’Grid Efficiency’ in eq. 4.6 follows Amdahl’s popular statement that "for a given instance of a particular problem, the system efficiency decreases when the number of available processors is increased" [227]. In the case of the Grid efficiency, in addition to the number of processors, it is the value of C (the number of CEs) that affects the system efficiency.

Based on these concepts of scalability, this dissertation performs an empirical analysis of our parallel algorithm for MC-PSC, as described in the following chapters. The above formalism can also be extended to heterogeneous systems, as proposed in [228]. The study of scalability in a heterogeneous environment is based on the assumption that each node in the system gets a workload related to its computational power and that the overhead time of the different processors is also known.

Equations 4.7 and 4.8 present the expressions for speedup and efficiency in a heterogeneous system.

Shet = Ts / TR    (4.7)

where Ts is the sequential time of the algorithm and TR = max_{i=1..N} Ti is the response time of the last node (among a pool of N nodes) in the cluster to finish the application. That is, TR represents the total time elapsed between the launching and the termination of the application, and it depends solely on the time of the node that finishes last. This description of TR can also be applied to eq. 4.5 for obtaining the Grid Speedup in a heterogeneous system.

Ehet = W / (TR × Σ_{i=1}^{N} Pi) = W / (TR × PT(N))    (4.8)

where W represents the total workload, TR represents the response time of the slowest node (i.e. the node that finishes last), PT represents the total power of the heterogeneous system (the sum of the individual powers of the nodes) and N represents the total number of processors in the system. The power in this case means the amount of work that a node/system can perform in unit time while executing a particular algorithm.
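As a worked illustration of equations 4.5-4.8 (the timings and node powers below are invented example values rather than measurements from our experiments), the grid-level and heterogeneous metrics can be computed as follows:

#include <stdio.h>

int main(void)
{
    /* Grid speedup and efficiency (eqs. 4.5 and 4.6): parallel time on 1 CE vs C CEs */
    double t1p = 500.0, tcp = 150.0;     /* hypothetical T_{1,p}(W) and T_{C,p}(W)    */
    int C = 4;
    printf("grid speedup = %.2f, grid efficiency = %.2f\n",
           t1p / tcp, t1p / (C * tcp));

    /* Heterogeneous speedup and efficiency (eqs. 4.7 and 4.8) */
    double Ts = 4000.0;                  /* sequential time                           */
    double W  = 4000.0;                  /* total workload, in work units             */
    double T[4] = {900.0, 950.0, 1100.0, 1000.0};  /* per-node response times         */
    double P[4] = {1.0, 1.0, 0.8, 0.9};  /* node powers, in work units per second     */
    double TR = 0.0, PT = 0.0;
    for (int i = 0; i < 4; i++) {        /* T_R = max_i T_i and P_T = sum_i P_i       */
        if (T[i] > TR) TR = T[i];
        PT += P[i];
    }
    printf("S_het = %.2f, E_het = %.2f\n", Ts / TR, W / (TR * PT));
    return 0;
}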

4.7 Conclusions

This chapter has described the specific research methodologies which have been used for the design, implementation and evaluation of various parallel and distributed approaches for the solution of MC-PSC's computational challenge. Based on this methodology, the next chapter presents the design, implementation and evaluation of the distributed framework for MC-PSC, and the subsequent chapters further build upon it by addressing several other issues related to this framework.

CHAPTER 5

A HIGH-THROUGHPUT DISTRIBUTED FRAMEWORK FOR

MC-PSC

This chapter presents a novel distributed framework for the efficient computation of large scale MC-PSC; its design, implementation and evaluation are presented. Based on the succinct description of multi-criteria protein structure comparison (MC-PSC) provided in the previous chapter through an overview of ProCKSI, and on the in-depth description of the computational challenge at the core of real-time MC-PSC explained in the first chapter, this chapter describes the high-throughput implementation of the entire protocol shown in Figure 1.1, whereby very large protein structure dataset comparisons are done in parallel using several methods and exploiting the intrinsic MIMD (Multiple Instructions Multiple Data) structure of the problem.

Thus, the work presented in this chapter takes a step forward towards the ultimate goal of real-time multi-criteria similarity assessment of very large protein datasets.

This chapter was published as a peer reviewed journal paper in IEEE Transactions on

NanoBioscience, Vol. 9(2), pp. 144-155, 2010 [doi:10.1109/TNB.2010.2043851].

5.1 Introduction

Recent advances in high-throughput techniques have led to a data deluge in terms of the availability of biological and biomedical data such as 1D sequences (flat files), 3D structures, microscopic images, videos and motifs, etc. [28]. This has put considerable strain on the computational resources that are routinely used to store, manage, process and analyze the vast amounts of data being generated. To cope with the increase in computational demands instigated by very large data sets, many existing applications are being ported to distributed/grid environments. For example, the BLAST (Basic Local Alignment Search Tool [229]) algorithm has been parallelized/distributed in a variety of ways [38,42,230–236]. Some of these approaches use combinations of MPI

(Message Passing Interface) [237], Grid and Public Computing based architectures to distribute either the query sequence (which could be as long as 80 billion base pairs [238]), or the target dataset/database (which could have up to 76 million records [238]), or both. All these approaches use a simple master/slave task scheduling strategy with coarse-grained task distribution to minimize communication overheads [28]. Coarse-grained approaches are not always suitable: given the variable length of the sequences to be compared and the different processing power of individual nodes in a heterogeneous cluster/grid environment, deciding the actual unit of work to be assigned to a particular node is a non-trivial matter for which efficient dynamic load-balancing strategies are needed. Martino et al. [239] describe a simple, inexpensive and effective strategy that divides the target dataset/database into n buckets of fixed size (where n represents the number of available processors). The load-balancing in this case is achieved by ordering the sequences by their length

(number of bases or residues) and assigning them to the buckets in such a way that the longest sequence is assigned to the segment having the smallest sum of sequence lengths, continuing this process in a round-robin fashion until all sequences are assigned to buckets. This type of load-balancing strategy reduces the percentage of workload imbalance within homogeneous computing architectures but does not take into account the heterogeneity of cluster and grid environments. Trelles et al. [240] present another load-balancing approach based on a variable size of blocks (buckets). This strategy initially distributes blocks of small size so as to reduce the latency time for each node to receive its first unit of work. It then increases the size of the blocks (in the same way as classical Self Guided

Scheduling (SGS) reduces their size) until the first half of the dataset/database is processed and then starts decreasing their size again. The small size of the final blocks guarantees that all n processors will terminate either at the same time (ideal case) or with a maximum time difference that depends on the size (i.e. execution time) of the final block. This strategy has been tested on a cluster of 15 nodes with a significant enhancement in performance.

Protein 3D structure comparison algorithms (e.g. those listed in Table 1.1) present a similar structure to algorithms for sequence comparison (e.g. BLAST, FASTA and ClustalW) and hence sometimes similar parallel/distributed strategies can be used [28]. However, as compared to their sequence counterparts, there are very few instances of the application of parallel computing to 3D structure comparison methods (e.g., Ferrari et al. [176] and Park et al. [177]). None of these methods, however, deals with the much more complex issue of efficient and scalable distributed implementations for Multi-Criteria Protein Structure Comparison. This chapter, therefore, presents a novel distributed framework for the efficient computation of large scale MC-PSC. The design, implementation and evaluation of this distributed framework are presented as follows: sections 5.2, 5.3 and 5.4 provide the architectural design and analysis of the newly proposed framework; experimental results and their analysis are presented and discussed in section 5.5; and finally section 5.7 summarizes the findings of this chapter.

5.2 Design and Implementation

In this section we present the algorithmic framework we use to compute, in a distributed environment, solutions to the MC-PSC problem. Figure 5.1 illustrates the overall architecture of the proposed system. The top module performs the distribution (through two different decomposition approaches, as explained in the following sections) of the pairwise comparisons and allocates them over the available nodes. Then, using the assigned bag of proteins, each node performs, in parallel and without the need for synchronization, the pairwise comparisons required by its associated protein bag using each of the available PSC methods. That is, each compute node computes a sub-matrix of the all-against-all similarity matrices associated with each method. Afterwards, a phase of normalization and estimation of missing/invalid values is executed. This phase exchanges information among nodes, as it needs the global minimum and maximum similarities for the normalization as well as for the estimation of missing/invalid cells. All the results concerning the current node are stored in a local matrix. Note that no global and centralized matrix is maintained by the system and that all communication among the nodes is performed using the MPI (Message Passing Interface) libraries for a cluster of computers and using the MPIg libraries [191,192] in the case of a grid-based implementation.

The pseudo-code shown in Algorithm 1 illustrates the main steps performed by each node in the distributed framework. Lines 1-7 perform the pairwise comparisons with all the methods for all of the proteins assigned to a particular node. Because the system does not maintain a global and centralized matrix, the process of finding the extrema (the maximum and minimum similarity values needed for the subsequent step of normalization) takes place in two steps. First, the local extrema are found (lines 8-11) for all the methods. These are then shared among all the nodes to find the global extrema (line 12). Once the extrema are found, the next step (line 13) calls a subroutine that replaces all the invalid and missing values with their corresponding estimated values. Finally, line 21 calls the subroutine normalize_diagonal, which performs the normalization of the self-similarity values (across the diagonal of the matrix), and line 22 calls the subroutine normalize_extrema, which uses the previously calculated extrema to perform the normalization of all the values.
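A hedged C/MPI sketch of the extrema-exchange and normalization steps of Algorithm 1 (lines 8-12 and 18-22) is given below. It is an illustration of the approach rather than the actual source code of the framework: it omits the diagonal normalization and the replacement of invalid/missing values, and all the names are ours.

#include <mpi.h>
#include <stdlib.h>
#include <float.h>

/* Each node owns a sub-matrix matrix[i][j][k] of similarity values (rows x cols
 * proteins, m methods). Local per-method extrema are combined into global ones
 * with MPI_Allreduce and the node then rescales its own cells. */
void normalize_submatrix(double ***matrix, int rows, int cols, int m)
{
    double *lmin = malloc(m * sizeof *lmin), *lmax = malloc(m * sizeof *lmax);
    double *gmin = malloc(m * sizeof *gmin), *gmax = malloc(m * sizeof *gmax);

    for (int k = 0; k < m; k++) { lmin[k] = DBL_MAX; lmax[k] = -DBL_MAX; }
    for (int k = 0; k < m; k++)                        /* local extrema (lines 8-11) */
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++) {
                if (matrix[i][j][k] < lmin[k]) lmin[k] = matrix[i][j][k];
                if (matrix[i][j][k] > lmax[k]) lmax[k] = matrix[i][j][k];
            }

    /* exchange and find the global extrema (line 12) */
    MPI_Allreduce(lmin, gmin, m, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);
    MPI_Allreduce(lmax, gmax, m, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);

    /* normalization against the global extrema (lines 18-22) */
    for (int k = 0; k < m; k++)
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                matrix[i][j][k] = (matrix[i][j][k] - gmin[k]) / (gmax[k] - gmin[k]);

    free(lmin); free(lmax); free(gmin); free(gmax);
}

Because MPI_Allreduce is a collective operation, every node obtains the same global extrema without any central coordinator, which is precisely what allows the framework to avoid maintaining a global, centralized matrix.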

5.3 Decomposition Strategies

The efficiency of the distributed framework strongly depends on the way in which proteins are assigned to compute nodes.

A good load balancing strategy should considerably reduce the execution time and the memory necessary to store the main matrix and the other data structures required for the overall computation of MC-PSC.

Consider a set of resources (nodes of the cluster or machines on the grid) N1, N2, ..., Nn and the main matrix (proteins × proteins × methods) storing the results of the computation and of the normalization (and estimation of invalid/missing values) phases. Let p be the total number of proteins and m the total number of methods computed. Note that M indicates the total number of indices computed by the different m methods; in fact, M = Σ_{k=1}^{m} Mk, where Mk is the number of indices computed by method k (see Table 5.1 for the complete nomenclature).

FIGURE 5.1: Software architecture of the distributed framework. The top module tries to estimate and distribute the workload equally to each node available in the pool of resources. Each node runs the distributed part of the algorithm to perform the comparison with all methods, followed by data post-processing (standardization & normalization), and produces a sub-matrix of the results.

Algorithm 1 Pseudo-code executed from each node x concerning the multi-comparison part and the normalization/replacement of invalid and missing values. Line 1 iterates for each method, with m representing the total number of methods. Lines 14-16 scan through the results and replace the missing self-similarity (SS) and non-self-similarity (NSS) values (for more details please see section 8.3). Lines 18-22 use the extrema values to compute the normalization of the MC-PSC similarity values stored in the distributed 'matrix'.

1: for all method k such that 1 ≤ k ≤ m do
2:   for all protein i in row(x) do
3:     for all protein j in column(x) do
4:       compute_method k on the couple of proteins i and j {on node x}
5:     end for
6:   end for
7: end for
8: for all k such that 1 ≤ k ≤ m do
9:   find_local_min
10:  find_local_max
11: end for
12: exchange and find all global min and max
13: replace_invalid_missing_values:
14: for all k such that 1 ≤ k ≤ m do
15:   missing_SS: max value in the cross-section
16:   missing_NSS: min value in the matrix
17: end for
18: for all method k such that 1 ≤ k ≤ m do
19:   for all protein i in row(x) do
20:     for all protein j in column(x) do
21:       normalize_diagonal: matrix[i][j][k] = matrix[i][j][k] / max
22:       normalize_extrema: matrix[i][j][k] = (matrix[i][j][k] - min) / (max - min)
23:     end for
24:   end for
25: end for

In order to distribute the overall M among the nodes, there are four possible partitioning schemes. Each of these schemes uses a different level of granularity for assigning the jobs:

1. Comparison of one pair of proteins with one method. This will create p × p × m jobs.

2. Comparison of one pair of proteins with all methods. This will create p × p jobs.

3. Comparison of all pairs of proteins with one method. This will create m jobs.

4. Comparison of a subset of pairs of proteins with a set/subset of methods. This will create as many jobs as the number of available processors (or as desired).

The suitability of any of the above listed partitioning schemes can be analyzed in terms of its workload distribution. Considering the case of an all-against-all comparison of the PDB (64,036 structures) using six comparison methods, partitioning 1 will generate ≈24.6 billion jobs. Obviously, all of these jobs cannot be assigned to processors with a one-to-one mapping and hence most of the jobs will have to remain in a queue, which needs to be managed properly. Also, the major problem with this scheme would be that the processors remain idle while reporting the output of each finished job, getting the details of the next job and fetching its related input files. This means that partitioning 1 would be too fine-grained for the distributed/grid environment. The case of partitioning 2 is similar, because it has only a slightly higher level of granularity than partitioning 1 and would result in ≈4.1 billion jobs. Partitioning 3, on the other hand, would generate only 6 jobs and hence would be too coarse-grained for the distributed/grid environment. The issues of too fine-grained or too coarse-grained granularity associated with these 3 partitioning schemes can be well balanced with partitioning 4. This scheme suggests creating a job package consisting of many pairwise comparisons, with either one or all methods, to be dispatched as a single job for execution on a node. The number of pairwise comparisons in a package can be determined in an intelligent way so that an equal amount of work gets assigned to each processor.
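The job counts quoted above can be verified with a back-of-the-envelope calculation (the number of processors assumed for partitioning 4 is an arbitrary example):

#include <stdio.h>

int main(void)
{
    double p = 64036.0;     /* protein structures in the PDB snapshot used above */
    double m = 6.0;         /* comparison methods                                */
    double nodes = 64.0;    /* arbitrary example pool of processors              */

    printf("partitioning 1: %.2e jobs\n", p * p * m);   /* ~2.5e10 (24.6 billion) */
    printf("partitioning 2: %.2e jobs\n", p * p);       /* ~4.1e9                 */
    printf("partitioning 3: %.0f jobs\n", m);           /* 6                      */
    printf("partitioning 4: %.0f jobs (one package per processor)\n", nodes);
    return 0;
}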

We investigate the 4th partitioning scheme by applying two different approaches. The

first decomposition adopted is shown in Figure 5.2. The main matrix that stores the results is decomposed among the available nodes along the two protein axes, so that each comparison between two proteins, for all the methods, is performed on the same node, better balancing the different methods. This decomposition is the more efficient in terms of inter-job communication overhead, as it minimizes the number of information exchanges amongst compute nodes. Furthermore, the matrix is perfectly partitioned, as each node is responsible for the computation and storage of the same portion of the matrix, i.e. p²m/n cells. In the next subsection these results will be analyzed in more detail. However, initial experiments suggested that the execution times for different couples of proteins can fluctuate largely (see Table 5.3), making the load among the different nodes not really balanced.

A second strategy is to balance the total execution time per compute node rather than the number of pairwise comparisons. Thus, this strategy takes into account the inhomogeneities in the size of the proteins being compared and is shown in Figure 5.3. In order to set up a bag of proteins having the same overall number of residues on each node, the following widely used strategy was adopted. Consider the case of the proteins to be assigned to the √n row processors (the procedure is analogous for the column processors). First of all, the proteins are sorted by their number of residues. Then, they are assigned, from the longest to the shortest one, to the node having the current lowest sum of residues. This procedure is not really time consuming, as it requires p log p operations for sorting the proteins and p(√n)² = pn operations for assigning them to the correct node. The same distribution obtained for the rows is also chosen for the columns, so that the order of the rows does not differ from that of the columns and the operations of normalization and removal of invalid/missing values can be performed without further overheads.
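The following C sketch (our own illustrative code, not the implementation evaluated later in this chapter) captures this residue-balancing assignment for the row processors; as stated above, the resulting assignment is simply reused for the column processors.

#include <stdlib.h>

typedef struct { int id; int residues; } protein_t;   /* id assumed to be 0..p-1 */

static int by_residues_desc(const void *a, const void *b)
{
    return ((const protein_t *)b)->residues - ((const protein_t *)a)->residues;
}

/* owner[i] receives the index (0..rows-1) of the row processor that gets protein i */
void assign_rows(protein_t *prot, int p, int rows, int *owner)
{
    long *load = calloc(rows, sizeof *load);           /* residues per row processor */

    qsort(prot, p, sizeof *prot, by_residues_desc);    /* sort: p log p              */
    for (int i = 0; i < p; i++) {                      /* longest protein first      */
        int best = 0;
        for (int r = 1; r < rows; r++)                 /* row with the lowest sum    */
            if (load[r] < load[best]) best = r;
        owner[prot[i].id] = best;
        load[best] += prot[i].residues;
    }
    free(load);
}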

Each of the two proposed load balancing approaches results in a different CPU and memory usage. In what follows we analyze the benefits and drawbacks of each of them. Unless otherwise stated, a given analysis/argument applies to both strategies. Henceforth, the first decomposition will be referred to as even and the second one as uneven.

FIGURE 5.2: Even distribution of the problem space (proteins × proteins × methods). Each node is responsible for the same computation, i.e. the same portion of the matrix.

5.4 Cost Analysis

5.4.1 Space Analysis

In what follows we do not take into account transient memory requirements by the different methods

(e.g. internal data structures) as these have, on the one hand, been already analyzed in the original

FIGURE 5.3: Uneven distribution of the problem space (proteins × proteins × methods). Note that the different sizes take different protein sizes into account (e.g. one node only for a few big proteins, which take quite long to calculate, and one node for many smaller proteins, which are quicker to calculate).

article where each method was originally introduced and, on the other hand, these transient space

requirements are released as soon as a particular pairwise comparison is done. The nomenclature

used in our analysis is summarized in Table 5.1.

The entire matrix, storing the comparison/normalization results, is decomposed along each of the two protein axes among the n nodes. So, in the case of the even distribution, each node handles a matrix of size p²m/n, and of size max(row_protx × col_protx) × m for the uneven distribution, where p is the number of proteins, n the number of nodes, m the total number of computed methods, and row_protx and col_protx are, respectively, the proteins stored on the row and on the column of the matrix assigned to node x. In this case the space tends to p²m/n if max(row_protx) → p/√n and max(col_protx) → p/√n, i.e. almost the same number of proteins is stored on each node.

p: Number of proteins.
n: Number of nodes (processors).
m: Number of methods used (i.e. MaxCMO, FAST, etc.).
Mk: Number of indices computed by method k.
M: Total number of indices computed by all the methods.
row_protx: Number of row proteins present on node x.
col_protx: Number of column proteins present on node x.
Evalx: Number of evaluations conducted on node x (row_protx × col_protx).
Sp: Average size of the proteins.
Tmx: Average execution time of all the methods over all the couples of proteins on node x.
Tm: Average execution time of all the methods over all the couples of proteins.

TABLE 5.1: Nomenclature used for the analysis.

The space necessary to store the proteins is p²Sp/n for the even distribution and max(row_protx × col_protx)Sp for the uneven distribution, where Sp is the average size of the proteins. This is the worst case, as in many cases the row proteins and the column proteins overlap.

Obviously, in the case of the uneven distribution, the memory space is balanced only if the number of proteins stored on a node is not much different from that stored on the others.

5.4.2 Time Analysis

Let Tm be the average execution time of all the methods over one pair of proteins and Tmx be the average execution time of all the methods over one protein pair stored on node x.

So, considering only the computation part of the algorithm, in a single-node execution the total execution time will be Ts = p² · Tm. As for the distributed case, the formulation is not so simple as, depending on the distribution of the proteins, the average execution times can be really different from node to node. In such a case, in the even distribution the parallel execution time will be Tp = p² · Tmx / n, and Tp = max(row_protx × col_protx × Tmx) ≤ max(row_protx × col_protx) × max(Tmx) for the uneven distribution. So, one has to balance this product so as to obtain a fair computation; in the case of the even distribution a balanced load is achieved only if Tmx → Tm for each node.

In addition to considering the different execution times on different nodes, the communication overheads must also be taken into consideration. This overhead occurs in the first phase, when the proteins are distributed over the nodes (using the MPI_Bcast routine), and in the latter phase, when the normalization and invalid/missing value replacement must be conducted (using the MPI_Allreduce routine).

Moving the proteins to the different nodes does not require an excessive time in comparison with the large computation time of the computing phase. Naturally, the even decomposition needs slightly less overhead than the uneven one, as almost the same number of proteins must be sent to each node. The amount of data exchanged is, as discussed before, p²Sp/n for the even distribution and max(row_protx × col_protx)Sp for the uneven one.

As for the normalization phase, we need to compute the global minimum and maximum for all the methods, for a total of 2m values exchanged. For the correction of invalid or missing values we need the minimum or maximum for each row, column and method in which we got an invalid value. Thus, in the worst case, we have to exchange 2n²m values, but typically invalid values are found only for a few methods and not for many cells.

Although the same amount of information is exchanged by the two decomposition strategies, the communication overhead is higher for the uneven strategy. This is mainly due to the worse efficiency of collective communication in an environment in which there is a different number of rows and columns for each processor.

5.4.3 Discussion of the Theoretical Cost Analysis

The two decomposition strategies adopted present different pros and cons. Although the even decomposition better utilizes memory, both in terms of the cells of the matrix (p²m/n) and of the proteins (p²Sp/n), it does not balance well the execution time on the different nodes, especially if, as is usual, the proteins have very different structures (or numbers of residues). On the contrary, the uneven distribution, paying the cost of larger memory requirements (max(row_protx × col_protx) × m for the matrix and max(row_protx × col_protx)Sp for the proteins), is the only approach usable for obtaining an appreciable reduction in execution times for small-to-medium and not well balanced protein datasets.

5.5 Experimental Results and Discussions

Different experiments were conducted to validate the quality of the two decomposition strategies.

Two metrics are usually used for testing the computational scalability of a parallel/distributed system: the speedup S and the efficiency E. The "speedup" is a measure that indicates the improvement in the execution time of a parallel algorithm as compared to its sequential counterpart, whereas the "efficiency" indicates the utilization of each processor in a parallel system.

5.5.1 Datasets and Test Suite

All the experiments were performed on a Linux cluster, named spaci and located at the ICAR-CNR institute in Italy, with 64 dual-processor Itanium2 1.4GHz nodes, each having 4GB of main memory and being connected by a Qsnet high performance network.

In our experiments, we used the first chain of the first model both for the Rost and Sander dataset (RS119) and for the Chew-Kedem dataset (CK34) (see Table 5.2 for the characteristics of these datasets). As an example of a large dataset, we used the one proposed by Kinjo et al. [223]. This dataset has been prepared using the PDB-REPRDB [223] algorithm to select 1012 non-redundant protein chains. The length of each chain in this dataset is greater than 50, with a sequence identity of less than 30%. Furthermore, the dataset does not contain any chain with non-standard residues or chain breaks, and all of its chains have a resolution better than 2 Å and an R factor better than 20%.

TABLE 5.2: Overview of the datasets used in the experiments. The hash symbol (#) is an abbreviation for "Number of".

Dataset              # Chains per Dataset   # Comparisons per Dataset   # Residues per Dataset
CK34 [221]           34                     1,156                       6,102
RS119 [222]          119                    14,161                      23,053
Kinjo et al. [223]   1012                   1,024,144                   252,569

5.5.2 Scalability of the Even Decomposition

To evaluate the quality of the even decomposition, the previously introduced metrics of scalability and efficiency were used, together with the execution time on different numbers of processors. The speed-up values obtained for the two medium datasets CK34 and RS119 are shown in figure 5.4.

For both datasets, the speed-up remains good using up to 16 processors, but using more processors does not help to speed up the total execution time to the same degree. This is due to the structural differences of the proteins, as each protein is composed of a different number of residues.

Indeed, in spite of having the same number of proteins on each node, some proteins could have a large number of residues on one node and only a few on another. This consideration is confirmed by the large variance in the execution times of the different methods (Table 5.3). As for the execution time, for the RS119 (CK34) dataset, the entire execution time was reduced from about 6 days (6.2 hours), using the sequential implementation on one machine, to 4.8 hours (14.15 minutes) on 64 processors.

FIGURE 5.4: Speedup of the even decomposition using the CK34 and RS119 datasets on the spaci cluster. The graph shows that with the CK34 dataset the speedup of the distributed algorithm (based on even decomposition) is around 26.2× with an efficiency of 41%, while the values of speedup and efficiency increase to around 30× and 46% respectively for the larger dataset, i.e., RS119.

This means that the new distributed algorithm performs 25× (with the CK34 dataset) and 30× (with the RS119 dataset) better than its current sequential counterpart. However, these improvements are still far from the ideal improvement of 64×, and hence on 64 processors the efficiency degrades to values of 41% and 46% respectively for CK34 and RS119. The following sections provide a detailed analysis of this effect and introduce another approach that further enhances the speedup as well as the efficiency of the system.

It is important to understand whether the execution times of the different methods described in the previous sections depend on the number of proteins, on the number of residues, or on both of them. To this end we randomly divided the proteins composing the two datasets CK34 and RS119 among 64 nodes (an 8x8 grid of processors), ran all the available methods and measured the execution time, the overall number of residues and the number of proteins present on each node. This procedure was repeated 20 times, for a total of 120 different measures of time.

TABLE 5.3: Average execution times and standard deviations (minutes) of the different methods for the CK34 and RS119 datasets, averaged over 60 tries.

Dataset   USM           FAST          TM-Align      Dali            CE              MaxCMO
CK34      0.52 ± 0.28   0.14 ± 0.07   0.28 ± 0.11   3.49 ± 1.53     3.20 ± 0.66     0.99 ± 0.34
RS119     3.68 ± 0.31   2.16 ± 1.05   5.78 ± 2.86   44.59 ± 20.51   41.05 ± 20.41   20.13 ± 9.69

Then, we plotted the execution time vs the number of proteins (figures 5.5 a and b) and the execution time vs the overall number of residues (figures 5.6 a and b). Observing the figures, it is clear that the execution time depends mainly on the overall number of residues present on a node, i.e., the dependence of time as a function of the number of residues is nearly linear, while it does not exhibit a linear dependence on the number of proteins.

The widely used Pearson product-moment correlation coefficient (PMCC) was computed to better assess the dependency of time on residues versus that of time on proteins. In the first case, we obtained coefficients of 0.992 and 0.995 respectively for the CK34 and RS119 datasets, while in the latter case we obtained only 0.582 and 0.585.
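For completeness, the correlation coefficient used here can be computed with a few lines of C; the helper below is an illustrative routine (not taken from the thesis code) that would be applied to the per-node execution times paired with either the per-node residue or protein counts:

    #include <math.h>

    /* Pearson product-moment correlation coefficient between x[] and y[]. */
    double pmcc(const double *x, const double *y, int n)
    {
        double sx = 0.0, sy = 0.0, sxx = 0.0, syy = 0.0, sxy = 0.0;
        for (int i = 0; i < n; i++) {
            sx  += x[i];
            sy  += y[i];
            sxx += x[i] * x[i];
            syy += y[i] * y[i];
            sxy += x[i] * y[i];
        }
        double num = n * sxy - sx * sy;
        double den = (n * sxx - sx * sx) * (n * syy - sy * sy);
        return den > 0.0 ? num / sqrt(den) : 0.0;   /* 0 for degenerate input */
    }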

Further analysis aimed to explore whether this linear dependence was driven by one or more of the slowest methods or holds for all the methods. Figures 5.7 and 5.8 show the execution time vs the number of residues for each method for CK34 and RS119. Although FAST, USM and MaxCMO are faster than Dali, CE and TM-Align, the dependence is quite evident for each of them.

FIGURE 5.5: Execution time (s) vs number of proteins present on the node for the (a) CK34 and (b) RS119 datasets. The graph shows that the execution time exhibits a non-linear pattern with respect to the change in the number of proteins.

FIGURE 5.6: Execution time (s) vs number of residues present on the node for the (a) CK34 and (b) RS119 datasets. The graph shows that the execution time exhibits a linear relationship with the change in the number of residues.

FIGURE 5.7: Execution time (s) vs number of residues present on the node for the different methods with the CK34 dataset. Each method has a different execution time but exhibits a linear relationship with the number of residues.

FIGURE 5.8: Execution time (s) vs number of residues present on the node for the different methods with the RS119 dataset. Each method has a different execution time but exhibits a linear relationship with the number of residues; with this slightly larger dataset (RS119) the linear dependence becomes even more evident.

5.5.3 Scalability of the Uneven Decomposition

From the previous section, it is clear that the execution time strongly depends on the number of residues per node. Thus, scalability experiments on the same datasets used for the even distribution were also conducted with the uneven decomposition; the results are reported in figure 5.9. For the RS119 (CK34) dataset, the entire execution time was reduced from about 6 days (6.2 hours), using the sequential implementation on one machine, to 3.4 hours (11.65 minutes) on 64 processors. In comparison with the even strategy, we obtained an improvement on 64 processors of about 18% for the CK34 dataset and of about 29% for the RS119 dataset. Furthermore, on 64 processors, the efficiency is maintained at a quite good value of 64% for RS119. For CK34, we obtained a value of 50%, which is not a bad result given the small grain of the dataset.

FIGURE 5.9: Speedup of the uneven decomposition using the CK34 and RS119 datasets on the spaci cluster. The graph shows that with the CK34 dataset the speedup of the distributed algorithm (based on uneven decomposition) is around 29.9× with an efficiency of 50%, while the values of speedup and efficiency increase to 42.3× and 64% respectively for the larger dataset, i.e., RS119.

5.6 A Further Experiment on a Large Dataset

The last experiment was performed using the uneven strategy running on 4, 16, 25 and 64 processors, applied to the Kinjo dataset comprising 1012 non-redundant protein chains. Using this dataset, the algorithm performed about 1 million comparisons for each of the methods.

As the execution time on a single processor is extremely large, this case was not considered; instead, scalability was measured against an estimated baseline on 4 processors running the fastest of all the methods, namely the FAST algorithm. For reference, note that FAST takes approximately 11 days to execute on a single processor for such a large number of proteins. Table 5.3 shows that the USM method will take approximately 2.7× the execution time of the FAST method (the factor 2.7 for USM, and subsequently for the other methods, is based on the average of the times for the CK34 and RS119 datasets as compared to the FAST method). Likewise, the estimated execution time for TM-Align is 2.33×, Dali 22.78×, CE 22.93×, and for MaxCMO 8.19×, giving an aggregate factor of 58.93. As FAST takes approximately 11 days on a single processor, the estimated time for all the methods on 1 processor would be 11 × 58.93 ≈ 648 days, or ≈ 162 days on 4 processors. The execution time of the algorithm applied to this huge dataset was reduced from 162 days on 4 processors, to 39.7 days on 16 and finally to 10.7 days on 64 processors. The scalability obtained is very close to the ideal case, as can be seen in figure 5.10. In fact, on 64 processors, a scalability value of 57× and an efficiency value of 89% were measured.

FIGURE 5.10: Speedup of the uneven decomposition using the Kinjo dataset on the spaci cluster. The graph shows that with this large dataset the distributed algorithm (based on uneven decomposition) achieves an increased speedup of 57× and an efficiency value of 89%.

5.7 Conclusions

A high-throughput, grid-aware distributed protein structure comparison framework for very large datasets has been proposed, based on an innovative distributed algorithm running both in a cluster and in a grid environment. This framework is able to perform structure comparison using all or a selection of the available methods. The design of this algorithm has been analyzed in terms of space, time, and communication overhead. Based on this analysis two different load balancing approaches have been used to improve the overall performance: the even and uneven strategies. The former obtains the best distribution in terms of memory, while the latter performs better in terms of execution time and scalability on cluster computers. Experiments conducted on medium and large real datasets show that the algorithm reduces the execution time (e.g., for the RS119 dataset it was reduced from 6 days on a single processor to about 5 hours on 64 processors) and copes with problems otherwise not tractable on a single machine, such as the Kinjo dataset, which took about 11 days on a 64-processor cluster.

The next chapter provides further evaluation of this distributed framework in terms of different integrated environments for MPI jobs.

CHAPTER 6

PERFORMANCE EVALUATION OF THE MC-PSC

ALGORITHMS UNDER IRMEs FOR MPI JOBS

Chapter 5 presented the design, implementation and evaluation of the distributed framework for

MC-PSC. This chapter evaluates the effect of different Integrated Resource Management Environ-

ments (IRMEs) on the performance of this framework in order to find the environment that could provide optimal results.

This chapter was published as a peer reviewed conference paper in Proceedings of the IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA '08), ISBN: 978-0-7695-3471-8, pp. 817-822, 2008. [doi:10.1109/ISPA.2008.41]

6.1 Introduction

As explained in chapter 5, in order to achieve the high-end computational power needed for compute-intensive applications such as Protein Structure Comparison, the operation of these applications needs to be decomposed using parallel programming libraries that are based on implementations of the Message Passing Interface (MPI) model. Commonly used MPI libraries include MPICH/MPICH2 [188] [189] and LAM/MPI [241] [242], the latter of which has recently been merged into a

global project under the name of OpenMPI [190]. Given that the scope of parallelism is extending

from supercomputers to personal computers, workstations and networks [187]; a converging effect on the fields of parallel and distributed computing is being observed. It is in this perspective that the

operation (submission, execution and monitoring) of MPI based jobs needs to be automated through

some suitable resource management middleware (see Figure 6.1) such as Sun Grid Engine (SGE),

Portable Batch System (PBS), Load Sharing Facility (LSF) etc. The specialized literature refers to

the functionality provided by these software packages using various names, for instance Local Resource Man-

agement System (LRMS), Queuing System, Batching System, Workload Manager, Job Management

System and Distributed Resource Manager, etc., as described in chapters 2 and 3.

Normally, when MPI based parallel jobs are submitted to LRMS, it calls some external

framework to launch the MPI environment for the execution of jobs [243]. If there is no appropriate

coordination between the external framework and LRMS, then the launched jobs may get dispatched

on somewhat different resources than those allocated by the LRMS, creating, on the one hand, resource overloading and conflict problems and, on the other hand, providing inaccurate or no

resource usage and monitoring information. Both of these problems could be eliminated if certain

means are used to integrate the operation of LRMS and MPI environment [243].

If the integration is achieved in such a way that it removes only the first problem, namely that the launching of parallel jobs does not create overloading and resource conflicts, then the integration

is said to be Loose Integration. In Loose Integration, the LRMS only sends information to MPI

environment regarding which resources a particular job should run on but the job runs in isolation

from the LRMS and hence no resource usage and monitoring information is obtained by the LRMS.

FIGURE 6.1: Three main components of a Local Resource Management System (LRMS). Each newly submitted request for the execution of a user job is taken care of by the Queue Manager (e.g. the sge_qmaster daemon of SGE), which stores these jobs in a queue and requests the Scheduler (e.g. the sge_schedd daemon of SGE) to execute them. The Scheduler gets job information from the Queue Manager and optimally decides when and where to execute these jobs based on the information obtained from the Resource Manager (e.g. the sge_execd/sge_shepherd daemons of SGE) and the system administrator's policies. This type of automated system frees the user from manually checking whether the resources required for the execution of a job are available and from manually starting the job on each node.

However, if the integration is achieved in such a way that it removes both of the problems, then it is said to be a Tight Integration, which enables the LRMS to run the MPI jobs under the control of its Job Controlling Agent, so as to avoid the problem of resource overloading while obtaining full resource usage information. This chapter aims to evaluate the performance of PSC jobs running under such an integrated architecture configured with different approaches, in order to provide a scientific basis for decision making when building a large scale e-Science infrastructure that meets the requirements of today's high-impact scientific applications.

The remainder of this chapter is organized as follows: section 6.2 provides an overview of the parallel libraries and the LRMS that have been used in this study; it also provides the details of the implementation and results of the integration; section 6.3 provides the details of the testbed and datasets used for the experiments; finally, section 6.5 concludes the chapter.

6.2 Integrated Resource Management Environment (IRME) for MPI Jobs

This study is based on the use of two different implementations of MPI libraries namely MPICH2

[188] [189] and Open MPI [190]. The description of these libraries is presented in Table 4.3. The

local resource management system is described in the following section.

6.2.1 Sun Grid Engine (SGE)

SGE1 provides management solutions for cluster grids. SGE uses Unix as its native environment and implements most of its services and other entities as Unix daemons. The use of Unix as a native environment allows SGE to make use of already developed standard and stable services and protocols for communication and computation such as Secure Shell (SSH), Remote Shell (RSH),

Network File System (NFS), and Remote Procedure Call (RPC). Normally, an SGE-based cluster grid consists of four types of hosts, and different daemons run on each host. A brief description of these hosts is given below:

Job Server (execution name/command ’sge_qmaster’):

This daemon controls the overall operation of SGE. It interacts with other daemons to provide

job management and monitoring services and information. In case of any errors it records the

diagnostic messages in the directory SGE_ROOT//messages.

Job Executor (execution name/command ’sge_execd’):

1 SGE, the Sun Grid Engine, is an open source and cross-platform resource management system developed and maintained by Sun Microsystems. It also has a commercial version named ’Sun N1 Grid Engine (SN1GE)’.

This daemon reads the local queue and places the job for execution in conjunction with the Job

Controller daemon. In case of errors, the daemon records the diagnostic messages in the

directory SGE_ROOT///messages .

Job Controller (execution name/command ’sge_shepherd’):

This daemon provides parent process functionality for a single job, and hence enables the SGE

to get the resource usage information after the job finishes. It gets the information from Job

Executor regarding which job to execute and starts up to 5 child processes for each job. These

child processes include the prologue (if enabled) , startup of the parallel environment (if the

job is parallel), startup of the job itself, shutdown of the parallel environment and finally one

process for epilogue (if enabled). The prologue, epilogue, and parallel environment startup

and shutdown scripts are site specific and are provided by the administrator while configuring

the cluster. In case of any errors it records the messages in the Job Executor’s directory.

Job Scheduler (execution name/command ’sge_schedd’):

For each new job the Job Server contacts the Job Scheduler daemon. This daemon reads the

job’s profile, gets the site’s policy information as configured by the administrator, requests the

resource information from Job Executor and finally communicates the decisions to Job Server

regarding which queue this job should be submitted to for execution. In case of any errors the daemon

records diagnostic messages in the directory SGE_ROOT//schedd/

messages.

In addition to different daemons, the SGE infrastructure consists of another entity called

’queue’. A queue is used as a container for jobs. There may be several types of queues characterizing

different types of resources and environments. The SGE scheduler daemon performs assignment

of jobs to appropriate queues based on the information from job’s profile as well as the system

load distribution. In order to accommodate those jobs which need to be scheduled at the same

time and which need to communicate among themselves (i.e., the parallel jobs), SGE provides an

interface to define parallel environments (PE). Different ways of integration of the SGE and parallel

environments are further described in the following section.

According to [244], though MPICH2 provides different process management methods

such as gforker, MPD, and SMPD, only the latter could be used for tight integration, as it satisfies

the requirement of a single process group on each node of the cluster. The SMPD method by itself

could be used in two different ways i.e. SMPD daemon-based (SMPD-DB) and SMPD daemon-less

(SMPD-DL). With SMPD-DB a single daemon per node is used to manage all the processes running on that node. In the case of a multi-processor node, the resource usage information of all the processors is aggregated by one and the same daemon, and hence SGE records the overall resource usage information per node through a single Queue Remote Shell (qrsh) invocation when the job

finishes its execution. With SMPD-DL, on the other hand, each process starts on its own on each node and

SGE needs to record each qrsh invocation for each process separately. For the sake of comparison

we used both SMPD-DB and SMPD-DL to perform Loose and Tight Integration using the distinct

configuration parameters as listed in Table 6.1.

6.3 Testbed and Datasets

In order to evaluate the performance of PSC jobs under both (i.e. Loose and Tight) ways of inte-

gration, a testbed consisting of a heterogeneous Linux cluster (resources shown in Table 6.2) has been used. The cluster consists of 6 nodes with a total of 9 slots, giving an aggregate CPU power of 25.82 GHz and a main memory of 7 GB.

TABLE 6.1: Distinct configuration parameters for Loose vs Tight integration of SGE and MPICH2.

Parameter           Loose Integration                                    Tight Integration
start_proc_arg      /sge/mpich2_smpd/startmpich2.sh $pe_hostfile         /sge/mpich2_smpd/startmpich2.sh -catch_rsh $pe_hostfile
                    /local/mpich2_smpd                                   /local/mpich2_smpd
stop_proc_arg       /sge/mpich2_smpd/stopmpich2.sh /local/mpich2_smpd    /sge/mpich2_smpd/stopmpich2.sh -catch_rsh /local/mpich2_smpd
control_slaves      FALSE                                                TRUE
job_is_first_task   TRUE                                                 FALSE

TABLE 6.2: Hardware configuration of the cluster grid. All hosts communicate through 100 Mbps Ethernet cards.

Host    CPU (GHz)    Cache (KB)   RAM (GB)   HD (GB)
comp1   2x P4-1.86   4096         2.0        250
comp2   2x P4-3.0    512          1.0        80
comp3   P4-2.4       512          0.5        20
comp4   P4-1.7       256          1.0        40
comp5   2x P4-3.6    1024         2.0        80
comp6   P4-2.4       512          0.5        40

6.3.1 Datasets

Chapter 5 worked with datasets which have already been used in the literature. Developed by different authors, these datasets do not exhibit any regularity in terms of the number of structures. Such regularity would, however, be useful in observing the continuous relationship between the execution time of PSC jobs and the size of the dataset under different software environments and configurations. To this aim, three subsets with a regularly increasing number of structures have been randomly selected from the dataset of Kinjo et al. [223]. These datasets were prepared with the same algorithm and characteristics as described in section 5.2. Table 6.3 provides further details of these datasets, and Figure 6.2 shows the effect of the increasing number of protein chains on the computational time taken by FAST [50], one of the structure comparison methods used in this study.

TABLE 6.3: Datasets used in the experiments. The hash sign (#) represents the words 'Number of'.

Dataset     # of Chains   # of Comparisons   Size in MB
SFBK-250    250           62,500             33.1
SFBK-500    500           250,000            66.5
SFBK-1000   1000          1,000,000          137.5

6.4 Results and Discussions

In order to evaluate the strengths and weaknesses of the integrated environment described in the previous section, we ran the FAST algorithm using the three datasets listed in Table 6.3. The experiments were performed on 6 heterogeneous nodes (9 processors) of a Linux cluster with the characteristics described in Table 6.2. Figure 6.3A gives an illustration of a PSC job running the FAST algorithm under loose integration, whereas Figure 6.3B illustrates the operation of the same MPI job running under the tightly integrated environment.

Table 6.4 shows the computational time in terms of wall clock time, user time, system time and CPU time required for the execution of one of the protein structure comparison algorithms, namely FAST [50], with different datasets under the integrated resource management environment for MPI jobs configured with the different approaches. Wall clock time is the real time taken by the algorithm as perceived by the user and includes CPU time, I/O time and communication time (e.g., between different nodes); user time is the part of the CPU time taken by the user application (i.e., the FAST algorithm) itself; system time is the part of the CPU time taken by the system software; and CPU time is the sum of the user and system times.

FIGURE 6.2: Computational time for the PSC algorithm (FAST) on a P4-2.4GHz uni-processor machine with an increasing number of proteins (single-chain based PDB files) in the dataset. The graph shows that the execution time increases with the number of proteins; however, the change is not linear because of the different numbers of residues, i.e., the lengths of the structures.

It can be seen in Table 6.4 that, in terms of wall clock time, the algorithm finishes earlier under Loose integration as compared to Tight integration. The reason behind this large variance is the fact that with loose integration we only get the accounting information for the master process, which runs on only one processor, and hence it appears to be a fraction of the total wall clock time as reported under tight integration. Also, with

Loose integration the user, system and CPU times (accounting information) cannot be recorded, as the MPI job does not run under the control of the job controlling agent sge_shepherd. Furthermore, with Loose integration the SGE is unable to get proper monitoring (state) information in case the MPI job runs into any problems. An example of such a job that SGE was unable to get information about is one suffering from an MPI-related error:

"unable to read the cmd header on the pmi context, socket connection closed, error stack: MPIDU_Socki_handle_read(607): connection closed by peer ..."

This job took far beyond its expected execution time and the SGE did not notify the user

and the job was killed manually afterward. Being able to keep track of which pairs of structures

have been (un)successfully compared and by which method is an essential feature of modern robust

comparison servers such as [27]. Hence, this is a potential weakness of Loose integration.

Referring to Table 6.4 again, it can also be observed that for a small number of proteins (e.g. the SFBK-250 dataset) the performance of the SMPD-DB and SMPD-DL types of tight integration is almost the same; however, for larger datasets (e.g. SFBK-500 and SFBK-1000), the former outperforms the latter. This is because in the case of SMPD-DL, SGE needs to record each qrsh invocation and hence incurs an extra overhead. Tables 6.5 and 6.6 illustrate some interesting differences between the SMPD-DB and SMPD-DL types of tight integration. Table 6.5 shows that SMPD-DB integration does not expose the dual-processor nature of some of the nodes (e.g. comp1, comp2, and comp5; Table 6.2). However, it utilizes both processors of all dual-processor nodes implicitly, as can be seen from the values of the CPU times for comp2 and comp5 in Table 6.5, which are even greater than their corresponding wall clock times. Table 6.6 shows that the overall performance of the SMPD-DL tight integration gets degraded because of the longer CPU time taken by the slowest node (comp4; Table 6.2). This longer CPU time could be due to the slower speed of the node in question as well as to the longer lengths of the protein chains being compared by the FAST algorithm on that node. It is because of this long-running node that the faster

CPUs also go through long idling cycles.

TABLE 6.4: Resource usage comparison for the FAST algorithm [50] with different datasets using Loose and Tight integration of SGE and MPICH2, with the SMPD daemon-based (SMPD-DB) and daemon-less (SMPD-DL) startup methods. All times are in [hh:mm:ss].

Dataset     Time        Loose (SMPD-DB)  Loose (SMPD-DL)  Tight (SMPD-DB)  Tight (SMPD-DL)
SFBK-250    Wall clock  00:10:43         00:13:51         00:59:18         00:52:27
            User        -                -                00:50:18         00:45:46
            System      -                -                00:08:05         00:06:25
            CPU         -                -                00:58:23         00:52:11
SFBK-500    Wall clock  00:51:12         01:01:29         01:38:28         02:11:48
            User        -                -                01:22:11         01:50:25
            System      -                -                00:13:18         00:18:51
            CPU         -                -                01:35:39         02:09:16
SFBK-1000   Wall clock  02:22:25         02:22:36         05:39:36         07:44:04
            User        -                -                04:34:06         06:28:56
            System      -                -                00:49:15         01:08:34
            CPU         -                -                05:23:22         07:37:30

TABLE 6.5: Time distribution for the FAST algorithm with the SFBK-1000 dataset over nine processors using tight integration with the SMPD daemon-based (SMPD-DB) method.

                        comp1      comp2      comp3      comp4      comp5      comp6
Wall Clock [hh:mm:ss]   05:39:37   05:39:37   05:39:37   05:39:37   05:39:37   05:39:37
User [hh:mm:ss]         03:43:17   05:05:48   02:56:23   03:40:22   07:21:00   04:37:37
System [hh:mm:ss]       00:33:37   01:02:10   00:32:29   00:41:13   01:09:55   00:56:36
CPU [hh:mm:ss]          04:16:28   06:08:07   03:28:52   04:21:35   08:30:55   05:34:13
Mem [MB]                31.06      66.08      60.87      43.53      95.83      104.53

TABLE 6.6: Time distribution for the FAST algorithm with the SFBK-1000 dataset over nine processors using tight integration with the SMPD daemon-less (SMPD-DL) method.

                        comp1      comp1      comp2      comp2      comp3      comp4      comp5      comp5      comp6
                        (cpu-1)    (cpu-2)    (cpu-1)    (cpu-2)                          (cpu-1)    (cpu-2)
Wall Clock [hh:mm:ss]   07:44:05   07:44:05   07:44:05   07:44:05   07:44:05   07:44:05   07:44:05   07:44:05   07:44:05
User [hh:mm:ss]         01:29:22   02:01:50   03:55:15   03:00:35   02:31:28   06:28:56   02:57:52   02:14:24   03:20:42
System [hh:mm:ss]       00:14:33   00:17:54   00:40:19   00:29:51   00:29:50   01:08:34   00:30:41   00:24:15   00:45:00
CPU [hh:mm:ss]          01:43:56   02:19:44   04:35:34   03:30:26   03:01:18   07:37:30   03:28:33   02:38:39   04:05:42
Mem [MB]                10.66      29.42      63.66      41.75      22.69      161.77     24.65      30.56      48.44

FIGURE 6.3: Parallel jobs running under the integrated resource management environment for SGE and MPICH2. A: Loose Integration; SGE launches the parallel job through its job controlling agent sge_shepherd using the mpiexec command, specifying the resources to be used (through a machine file) and hence eliminating the problem of resource conflicts. However, the actual MPI job (the FAST algorithm) does not run under the control of sge_shepherd, and hence no resource usage (accounting) information can be obtained. B: Tight Integration; SGE uses two instances of its job controlling agent sge_shepherd, one for launching the parallel job by specifying the resources to be used and the other for controlling (monitoring) the operation of the running MPI job (the FAST algorithm), hence providing full resource usage (accounting) information.

6.5 Conclusions

The performance of the distributed framework for Protein (Structure) Comparison, Knowledge, Similarity and Information (ProCKSI) has been evaluated (using the FAST algorithm as an example) under an integrated resource management environment for MPI jobs. The results of the evaluation indicate that the Loose Integration method is not very reliable in terms of the accounting and monitoring information needed for PSC jobs. Furthermore, for larger datasets, Tight Integration with the SMPD daemon-based method outperforms its counterpart, i.e., the SMPD daemon-less method. It has also been learned that in a heterogeneous cluster where some nodes have double the performance of others, the slowest node can become a bottleneck for the overall performance of the system. This performance degradation can deteriorate further when the slowest node is assigned PSC jobs with longer protein chains. This type of problem could be solved by developing intelligent heuristics to perform better load balancing among the nodes.

Furthering the process of evaluation, the next chapter presents the results of an evaluation under the Grid environment, which consists of more than one cluster and uses a grid-enabled version of the MPI library to spawn the jobs across the sites.

CHAPTER 7

ON THE SCALABILITY OF MULTI-CRITERIA PROTEIN

STRUCTURE COMPARISON IN THE GRID

The effect of different integrated environments on the performance of the distributed framework was discussed in chapter 6. This chapter considers the evaluation of the distributed framework under a Grid environment consisting of more than one cluster, in order to analyze the effect of inter-cluster communication overhead and other Grid-related issues.

Parts of this chapter were published as a peer reviewed conference paper in the Pro- ceedings of The Euro-Par 2010 Workshop on High Performance Bioinformatics and Biomedicine

(HiBB), August 31 - September 3, 2010, Ischia, Naples, Italy.

7.1 Introduction

In order to achieve consensus-driven decision making, the MC-PSC requires the execution of a given set of methods on the set of protein structures to be compared. Given the computational requirements of each method and the ever growing number of entries in the PDB, the realistic computation of the MC-PSC becomes a grand challenge and hence requires the use of Grid computing to overcome the limitations of a single parallel computer/cluster. The use of Grid computing, on the other hand, also introduces additional communication overhead, and hence the standard parallel computing measures and metrics such as Speedup and Efficiency need to be redefined [225]. This chapter uses the grid-based definitions of speedup (Grid Speedup) and efficiency (Grid Efficiency) introduced by Hoekstra et al. [225] (as described in chapter four, section 4.6.1) to measure the scalability of our distributed algorithm on the UK National Grid Service (NGS) [66] architecture. The code of

the distributed algorithm is the same as that used in chapter 5. The results of the cross-site scalability are compared with single-site and single-machine performance to analyze the additional communication overhead in a wide-area network.

The remainder of this chapter is organized as follows: section 7.2 describes the experimental setup; section 7.3 presents the results and discussions; and finally section 7.4 presents the conclusions.

7.2 Deployment on the NGS Infrastructure

The National Grid Service (NGS) provides the eScience infrastructure to all UK-based scientists

free of cost [66]. For our case we used the Globus-based MPIg [191, 192] (grid-based implemen-

tation of MPI) to spawn the jobs across two NGS sites; one at Leeds and the other at Manchester.

Like its predecessors (e.g MPICH-G [245] and MPICH-G2 [191]), the MPIg library extends the Ar-

gonne MPICH implementation of MPI to use services provided by the Globus Toolkit (GT) [34] for

cross-site job execution using IP-based communication for inter-cluster messaging. However, being the latest implementation, MPIg includes several performance enhancements; for instance, for inter-cluster communication it uses multiple threads as compared to the single-threaded communication of the previous implementations. Furthermore, besides being backward compatible with the pre-web-service Globus, MPIg also makes use of the new web services provided by Globus version 4 [35]. By making use of the new web services, MPIg provides much enhanced functionality, usability and performance. The use of MPIg for cross-site runs requires advance resource reservation so that jobs (processes) can run simultaneously across all the sites. To facilitate this, the NGS provides the Highly-Available Resource Co-allocation (HARC) [246] command line utility to perform automatic reservation. Each of the two NGS sites (Leeds and Manchester) consists of 256 cores (AMD Opteron at 2.6GHz with 8GB of main memory) interconnected with

Myrinet M3F-PCIXD-2. However, the NGS policies allow the advance reservation of a maximum of 128 cores at each site, for a maximum duration of 48 hours. Once the reservation is done, Globus-based job submission can be achieved with Resource Specification Language (RSL) scripts, and other Globus services can be used for job monitoring and control. For MPI-based jobs to run on different sites, the source code of the application needs to be compiled with the MPIg libraries at each site and the executable placed in the appropriate working directory under the respective local file system. The compilation of the MPI-based application with MPIg does not require any change in the source code, and hence from the user’s perspective the deployment is as straightforward as running the parallel application on a single site/cluster, with the exception that the RSL script specifies the resources of the additional site to be used. Figure 7.1 shows the overall architecture and setup for deploying the MC-PSC application on the Grid.

The dataset used in these experiments is the one introduced by Kinjo et al. [223], consisting of 1012 non-redundant protein chains having a total of 252,569 residues. The 1012 chains result in as many as 1,024,144 pairwise comparisons for each method/algorithm. When using all six methods (i.e., the Universal Similarity Metric (USM) [48], Maximum Contact Map Overlap (MaxCMO) [46], Distance Alignment Matrix (DaliLite) [45], Combinatorial Extension (CE) [47], Fast Alignment And Search Tool (FAST) [50] and TM-Align [49]), the total number of pairwise comparisons becomes 1,024,144 × 6 = 6,144,864. Given that the average time for a single pairwise comparison with one method on a single-processor machine is about 8 seconds, this computation requires about 569 days to complete on a single processor, and it took about 10.7 days to complete on a 64-node cluster [247]. The results achieved for this dataset on the Grid infrastructure are reported in the next section.

FIGURE 7.1: Deployment of the MC-PSC application on the Grid: the application takes protein 3-D structures as input and prepares the balanced workload W to be distributed on the Grid. Half of the total workload (W/2) is assigned to each site (CE). Each site further distributes its W/2 among its p cores.
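The coarse workload split of Figure 7.1 can be pictured with a few lines of arithmetic; the small C program below is purely illustrative (it ignores the residue-aware balancing and simply halves the 6,144,864 comparisons between the two sites and then among the cores of each site):

    #include <stdio.h>

    int main(void)
    {
        const long chains  = 1012;
        const long methods = 6;
        const long W = chains * chains * methods;   /* 1,024,144 x 6 pairwise comparisons */

        const int sites = 2;
        const int cores_per_site = 128;             /* as reserved on each NGS site */

        long per_site = W / sites;                  /* W/2 assigned to each site (CE)     */
        long per_core = per_site / cores_per_site;  /* each site spreads W/2 over p cores */

        printf("total=%ld per_site=%ld per_core=%ld\n", W, per_site, per_core);
        return 0;
    }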

7.3 Results and Discussions

Both the single-site and cross-site experiments for MC-PSC were conducted with varying numbers of processors using the Kinjo et al. [223] dataset. The Grid speedup and efficiency (section 4.6.1) were calculated based on the results of these experiments and are shown in figures 7.2 and 7.3. Figure 7.2 shows that initially (for a small number of processors), running the MC-PSC experiments across two sites almost doubles the performance relative to that of a single site. However, as the number of processors increases (thereby decreasing the level of granularity and increasing the communication overhead), the speedup decreases slightly and finally reaches about 1.65. The same trend appears in the Grid efficiency, as shown in figure 7.3.

Figure 7.4 provides a comparison of the algorithmic speedup on a single site (S1, having 128 processors) and the speedup obtained while running the experiments on the two sites (S2, having a total of 256 processors). The speedup in this case is taken as the ratio of the execution time on a single machine (single processor) (T1) to the execution time on p processors (Tp), i.e., S1 = S2 = T1/Tp. As indicated by Figure 7.4, though initially the cross-site speedup is slightly lower than the single-site speedup, given the larger number of processors available in the cross-site configuration the overall speedup increases by almost a factor of 2. The total time for the computation of the given dataset on 256 cores (2.4GHz each) was reduced to 38.6 hours. Comparing this with the 569 days on a single machine and the 10.7 days required on a 64-node cluster (though one with less powerful processors, i.e., 1.4GHz each), we observe good scalability and performance of our algorithm on the Grid. The boost in speedup and performance is two-fold, i.e., the large number of processors

(physical speedup) coupled with the high speed of each individual processor (power scalability). Figure 7.5 shows the corresponding efficiency of the algorithm on the single-site and cross-site architectures. The efficiency in this case measures the effective use of the hardware and is equal to the ratio of the speedup on p processors to p (i.e., E = Sp/p). Figure 7.6 shows the cross-site communication overhead incurred when running the MC-PSC application in the Grid. The cross-site communication overhead is measured as the difference in execution time when the MC-PSC is run on one and on two sites with the same number of processors. For example, if T1 is the execution time on one site with 4 processors and T2 is the execution time on two sites with 4 processors (i.e., 2 processors per site), then the cross-site communication overhead would be T1 − T2. Figure 7.6 shows that when only a few processors are used, the load on each processor, and consequently the amount of data to be exchanged, is high, and hence there is considerable communication overhead. However, when we use a larger number of processors, the overhead is negligible in comparison with the computation time.

FIGURE 7.2: Performance of the MC-PSC on the Grid: grid speedup (eq. 4.5). Initially the speedup is almost ideal for a small number of nodes, but as the number of nodes on each site increases, the corresponding level of granularity decreases while the level of communication overhead increases, causing the speedup to degrade slightly. Nevertheless, the overall speedup is much greater (≈1.6) than the speedup on a single site (<1).

FIGURE 7.3: Performance of the MC-PSC on the Grid: grid efficiency (eq. 4.6); as expected the slight degradation of speedup causes the degradation in the efficiency of the system.

FIGURE 7.4: Single-site and cross-site speedup; the graph shows that though initially the cross-site speedup (S2) is slightly lower than the single-site speedup (S1), given the larger number of processors available in the cross-site configuration, the overall speedup (S2) increases by almost a factor of 2.

FIGURE 7.5: Single-site and cross-site efficiency; as expected, the cross-site efficiency (E2) is slightly lower than the single-site efficiency (E1) due to the extra communication overhead.

FIGURE 7.6: Cross-site communication overhead. The graph shows that when only a few processors are used, the load on each processor, and consequently the amount of data to be exchanged, is high, and hence there is considerable communication overhead. However, when a larger number of processors is used, the overhead is negligible in comparison with the computation time.

7.4 Conclusions

The quality of our parallel algorithm for MC-PSC has been measured in terms of Grid Speedup and Grid Efficiency. The results of the single-site and cross-site experiments indicate that by making use of Grid resources the algorithm scales well and that the cross-site communication overhead is not very significant. The current cross-site experiments were conducted on only two sites based on the HCG model of the National Grid Service (NGS), UK. As the NGS is still in the process of adding more sites, in the future we would like to extend this study by increasing the number of sites as well as incorporating the heterogeneous architecture of the Grid. Because at present the maximum time allocated for the continuous execution of a job/process on the NGS is limited to 48 hours, which does not allow evaluating the performance of the application with very large datasets, the software developed so far could be upgraded by adding a fault tolerance mechanism in the form of checkpoint/restart. The checkpoint/restart mechanism could be added without changing the code of the application by using libraries such as the Berkeley Lab Checkpoint/Restart (BLCR) [156]. With these improvements, it would be possible for the MC-PSC to perform real-time computation with even larger datasets and to develop a database of pre-computed results.

The next chapter discusses the storage, management and analysis of the (multi) similarity data resulting from the comparison of large scale protein structure datasets in the Grid.

CHAPTER 8

ON THE STORAGE, MANAGEMENT AND ANALYSIS OF

(MULTI)SIMILARITY DATA FOR LARGE SCALE

PROTEIN STRUCTURE DATASETS IN THE GRID

So far, in the previous chapters, the design, implementation and evaluation of the distributed framework for MC-PSC have been described in terms of its execution time, communication overhead and load balancing strategies in different cluster and Grid environments. This chapter and the one that follows present a discussion of the storage, management and analysis of the (multi) similarity data resulting from the comparison of large scale protein structure datasets in the Grid. This chapter, in particular, evaluates two of the most commonly used data technologies in the scientific domain and recommends the one most suitable to the requirements of the MC-PSC.

This chapter was published as a peer reviewed conference paper in Proceedings of the 22nd IEEE International Symposium on Computer-Based Medical Systems (CBMS-09), ISBN: 978-1-4244-4879-1, pp. 1-8, 2009. [doi:10.1109/CBMS.2009.5255328]

8.1 Introduction

As the size of commonly used scientific datasets grows beyond tera- and peta-scale bytes, the corresponding algorithmic complexity of the application programs used for their analysis is also increasing very fast. This has made it very difficult for a typical scientist to use local and ordinary resources to perform deep, systematic analysis of these very large datasets. To ameliorate this situation, many scientific domains have established science data centers (service stations) that provide easy and efficient access to both the data and the related application programs needed for the analysis [248]. Furthermore, most of these science data centers also provide a personal workspace to each scientist for storing the results of their analysis. Though this paradigm shift frees the end-user scientist from the burden of managing the data and applications, it increases the complexity for service providers many-fold as the size of the data and the number of users increase. In order to cope with this situation, many institutional, national and international distributed and grid computing infrastructures have been established, e.g. the BIRN, the National Grid Service (NGS) in the UK, the TeraGrid in the US, and Enabling Grids for E-sciencE (EGEE) in Europe and across the world. These high-end infrastructures provide most of the computing, storage, data and software resources for a wide variety of scientific disciplines. These resources are available to both categories of scientists: end-users, who use the already deployed applications and data to perform certain analyses, and application developers (who also become service providers at the end of their application development phase, in order to make their newly developed application usable by the respective community), who use the computing resources to develop and test novel applications requiring the support of that infrastructure.

From the perspective of a scientific computing engineer, the provision of the above-mentioned infrastructures facilitates the development of scalable applications that can perform large scale data- and compute-intensive in-silico calculations. This chapter reports on our findings on the storage, management and analysis of the data resulting from the newly developed distributed algorithm for MC-PSC (see chapter 5) for large scale protein structure datasets. Given the large volume and complexity of the data involved in the experimentation, the selection and use of an appropriate suite of technologies becomes quite important. We report on our experiences of using core database technologies such as the Hierarchical Data Format (HDF5) and a Relational Database Management

System (RDBMS) (Oracle/SQL) on the UK National Grid Service (NGS) infrastructure.

The organization of the rest of this chapter is as follows. A brief survey of the related work is presented in section 8.2. This is followed by the description of the data involved in the process of multi-comparison and the techniques used for the estimation of missing/invalid values in section 8.3. Section 8.4 introduces the main technologies used for data storage, management and analysis and compares their features with those of other related technologies such as traditional RDBMS and XML. Details of the experimental design and implementation are described in section 8.5, while section 8.6 provides the results and discussions.

8.2 Motivations

Due to the exponential growth in the number of protein structures in the PDB, most of the online servers for protein structure comparison have started to build a database of pre-computed results for their applications and use it as the quickest way to respond to user queries which would otherwise have taken many hours or even days of processing time. Some examples of servers which use pre-computed databases include [249–252]. These servers provide both options to users, i.e., the user can either select a query structure that is already available in the PDB, in which case the list of similar structures will be displayed within the time of a click using the information from the pre-computed database, or the user can select a novel query structure to be compared to all structures in the PDB. In the latter case the pre-computed knowledge base is used to prune the search space once a strong match is found for the query structure, by limiting the comparison to only the neighbors of the strong match, whose list is maintained in the knowledge base. The design and implementation of such knowledge bases is quite easy and straightforward as they only deal with a single structure comparison method. However, as described in the subsequent sections of this chapter, there are many issues and complexities in the case of multi-comparison, and to the best of our knowledge, there exists no other solution in the literature yet.

8.3 Protein Structure (Multi) Similarity Data

When applying a similarity comparison algorithm to a set of proteins, comparing its members in an all-against-all fashion, one obtains a matrix that describes the similarity/dissimilarity of each pair of proteins. This matrix is called a similarity matrix (SM) when the similarity measure is a type of scoring function (such as the number of alignments) that starts with zero (no similarity at all) and that gives higher values for better agreement between any two proteins considered. Mathematically,

    S_{a,b} = \begin{cases} 0, & \text{no similarity at all} \\ > 0, & \text{no upper bound} \end{cases}    (8.1)

As different algorithms apply different scoring schemes that do not necessarily correlate with any physical property (e.g. protein size), it is not possible to give an upper bound for SM values (equation 8.1). On the other hand, there are algorithms that do not produce a scoring function but express similarity in terms of distances. In this case the resulting matrix for an all-against-all comparison is called a dissimilarity matrix or distance matrix (DM). In terms of a distance matrix, a value of zero means that two proteins are identical, whereas higher values indicate a higher degree of dissimilarity. Mathematically,

    D_{a,b} = \begin{cases} 0, & \text{identical proteins} \\ > 0, & \text{no upper bound} \end{cases}    (8.2)

The root mean square deviation (RMSD) value is a typical example of a (poor) distance measure for protein structure comparison. More recent and more sophisticated protein structure comparison methods rely on a variety of similarity/distance definitions. Some, but not all, of them are:

• NA: Number of Alignments (DaliLite, CE, MaxCMO, TM-Align, FAST): indicates how many elements/residues of a query protein structure are aligned to the elements/residues of the target structure. A higher value indicates more similarity.

• Z-score (DaliLite, CE): indicates the statistical significance of the similarity result with respect to the random comparison of structures. Its value should be 3.5 or higher for two proteins to be similar, i.e., to have the same fold.

• RMSD: Root Mean Square Distance (DaliLite, CE, TM-Align, FAST): indicates the divergence between two protein structures. Its value should be less than 5 Å for two structures to be similar, i.e., to belong to the same family.

• TM-score (TM-Align): this score is based on an improvement of RMSD, i.e., it is independent of protein length. Its value lies between 1 (identical) and 0 (no similarity), with 0.5 being used as a threshold to identify fold-level relationships.

• SN: Normalized Score (FAST): indicates the significance of an alignment. A value of 1.5 or greater indicates significant structural similarity.

• USM distance (USM): this measure is based on the calculation of the Kolmogorov complexity between two structures. The lower the distance, the more similar the structures, and vice versa.

Usually the similarity/dissimilarity values are used for further processing steps, e.g. clustering the distance matrix in order to obtain family relationships between different proteins. Such methods require a complete and valid matrix as input and usually do not cope with missing or ill-defined data. Though similarity comparison algorithms usually produce numerical values to describe the degree of similarity between two protein structures, they also return non-numerical characters in some cases, as defined below:

• N: indicates that no significant similarity was found between the given pair of proteins

• E: indicates that an error occurred during the comparison

• #: indicates that the input data was missing or not appropriate

When dealing with big datasets with just a few missing data points, it is often more convenient to account for them by estimation instead of submitting them for recalculation. Additionally, we must correct for invalid self-similarity (SS) values that may occur if heuristic similarity methods are used (e.g., MaxCMO [46]) that do not guarantee to find the optimal result. Hence, the SS value of a protein can be worse than any of its non-self-similarity (NSS) values obtained from comparing the protein with any other protein in the dataset. The techniques used for the estimation of missing/invalid SS and NSS values are briefly described in the following subsections.
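As a purely illustrative data-structure sketch, the (multi) similarity data described in this section can be pictured as a p × p × m array of cells, each holding either a numerical score or one of the non-numerical status codes listed above. The C types below are assumptions made for illustration only; they are not the representation used by ProCKSI nor the HDF5/relational schemas evaluated later in this chapter:

    #include <stddef.h>

    /* Status of one pairwise comparison under one method. */
    typedef enum {
        CELL_OK,        /* numerical similarity/distance value present   */
        CELL_NO_SIM,    /* 'N' -- no significant similarity reported     */
        CELL_ERROR,     /* 'E' -- the comparison method failed           */
        CELL_MISSING    /* '#' -- input data missing or not appropriate  */
    } cell_status;

    typedef struct {
        cell_status status;
        double      value;     /* meaningful only when status == CELL_OK */
    } sim_cell;

    /* p proteins and m methods: cell (i, j, k) holds the comparison of
     * proteins i and j under method k, stored as one flat array.        */
    typedef struct {
        int       p, m;
        sim_cell *cells;       /* p * p * m entries                      */
    } sim_tensor;

    static sim_cell *tensor_at(sim_tensor *t, int i, int j, int k)
    {
        return &t->cells[((size_t)i * t->p + j) * t->m + k];
    }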

8.3.1 Estimation of Self-Similarity Values

Estimation of SS values could exploit the fact that these values should always be better than any other NSS values for the protein considered. However, there is a slight variation of the special requirement when dealing with normalized data: we know that the best value (lower bound) for any distance measure must be zero, but we cannot give an upper bound for similarity measures (scoring functions) as it depends on the length of proteins. In the latter case, we therefore estimate any missing or invalid SS value by the best (highest) NSS value of any comparison with this protein. If no value can be found at all due to irrecoverable problems during the calculation, we adopt a worst case approach and take the worst (smallest) value of the entire similarity matrix. This makes sure that this value does not interfere with any better (higher) values in any further analysis step, but is not as drastic as setting the SS value to zero as it would make the standardization step impossible

(resulting in a division by zero). The pseudocode for this procedure is shown in Algorithm 2.

Algorithm 2 Pseudo-code for the estimation of missing/invalid self-similarity (SS) values. Since a self-similarity value should always reflect high similarity, in the case of a similarity matrix (SM) the maximum value is taken either from the cross-section of the matrix containing the comparisons of the current protein with any other protein x (S_ix | S_xj) or, if no such value is available, from the entire matrix. Similarly, if the value being estimated belongs to a distance matrix (DM), the minimum value is taken with the same approach.

1: for all methods k such that 1 ≤ k ≤ m do
2:   for all proteins i (rows) do
3:     for all proteins j (columns) do
4:       if matrix[i][j][k] = MISSING/INVALID and i = j then
5:         if matrix_type = SM then
6:           S_ii = max(S_ix | S_xj), or max(matrix) if no such value exists
7:         else if matrix_type = DM then
8:           S_ii = min(S_ix | S_xj), or min(matrix) if no such value exists
9:         end if
10:      end if
11:    end for
12:  end for
13: end for
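As an illustration of the rule in Algorithm 2, the following C sketch (a hypothetical helper, not the ProCKSI implementation) estimates missing diagonal entries of a single N x N similarity matrix whose missing/invalid entries are encoded as NaN; for a distance matrix the two maximum computations would become minimum computations.

/* Sketch of Algorithm 2 for one similarity matrix S (N x N), with
 * missing/invalid entries encoded as NaN. Illustrative only. */
#include <math.h>

void estimate_self_similarities(double **S, int N)
{
    int i, j, x;

    /* Fallback value: the maximum over the entire matrix. */
    double matrix_max = -INFINITY;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            if (!isnan(S[i][j]) && S[i][j] > matrix_max)
                matrix_max = S[i][j];

    for (i = 0; i < N; i++) {
        if (!isnan(S[i][i]))
            continue;                        /* SS value present and valid */

        /* Best (highest) NSS value in the cross-section of protein i. */
        double cross_max = -INFINITY;
        for (x = 0; x < N; x++) {
            if (x == i) continue;
            if (!isnan(S[i][x]) && S[i][x] > cross_max) cross_max = S[i][x];
            if (!isnan(S[x][i]) && S[x][i] > cross_max) cross_max = S[x][i];
        }
        S[i][i] = isinf(cross_max) ? matrix_max : cross_max;
    }
}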

Furthermore, similarity values can depend on the size of the proteins involved. Consider, for instance, the comparison of a protein with itself using the number of aligned residues as the similarity measure. When we compare identical proteins, the number of aligned residues will be the number of residues in this protein, but this number will be lower for a smaller protein than for a larger one. We therefore have to take the size of the proteins into account and normalize a similarity value S_ij comparing proteins P_i and P_j by dividing it by the higher self-similarity value of the two proteins [185]:

S_{ij,norm} = S_{ij} / max{ S_{ii}, S_{jj} }          (8.3)

When applying Equation 8.3 to self-similarity values S_ii, one obtains normalized values S_{ii,norm} = 1, as max{S_ii, S_ii} = S_ii. Although this satisfies the first requirement for similarity matrices, the range of values will not necessarily start at zero. We require this for all matrices so that they can be compared and combined in order to produce a consensus similarity; thus, another simple normalisation (rescaling) step has to be performed for all similarity/dissimilarity matrices. As a result, all self-similarity values remain normalized, and all values of the entire matrix lie in the range [0, 1].
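The following C sketch (a hypothetical helper, not the ProCKSI code; the min-max rescaling in the second step is one possible reading of the simple normalisation step mentioned above) applies Equation 8.3 to a similarity matrix with valid, already estimated entries and then rescales the whole matrix to [0, 1].

/* Step 1: Equation 8.3, S_ij,norm = S_ij / max(S_ii, S_jj).
 * Step 2: simple rescaling of the whole matrix to the range [0, 1]. */
void normalise_similarity_matrix(double **S, int N)
{
    int i, j;

    /* Equation 8.3: normalize off-diagonal entries first so that the
       original self-similarity values are still available. */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            if (i != j) {
                double denom = (S[i][i] > S[j][j]) ? S[i][i] : S[j][j];
                S[i][j] = S[i][j] / denom;
            }
    for (i = 0; i < N; i++)
        S[i][i] = 1.0;                 /* S_ii / max(S_ii, S_ii) = 1 */

    /* Rescale all values to [0, 1]. */
    double lo = S[0][0], hi = S[0][0];
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            if (S[i][j] < lo) lo = S[i][j];
            if (S[i][j] > hi) hi = S[i][j];
        }
    if (hi > lo)
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                S[i][j] = (S[i][j] - lo) / (hi - lo);
}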

8.3.2 Estimation of Non-Self-Similarity Values

When estimating non-self-similarity values, we first try to exploit the symmetric nature of any similarity/dissimilarity matrix. Usually, the comparison of protein P_i with protein P_j gives the same result as comparing P_j with P_i, so we could substitute S_ij by S_ji, as this will be closer to the true value than any other estimate we can produce. However, the symmetric nature of the similarity/dissimilarity matrix might already have been exploited during the calculation procedure, saving time by only calculating the NSS values in one triangle of the matrix (plus the SS values). In this case, if S_ij contains an error and its "backup" S_ji was not calculated, we need to estimate S_ij with the worst value that can be found in the cross-section S_im/S_jn of the matrix. This is the highest value in the cross-section for dissimilarity matrices, and the lowest one for similarity matrices. Additionally, there is one special case when the comparison method claims not to have found significant similarities between a given pair of proteins. In this case, we know that the similarity value will be very small and therefore substitute it with zero, the worst value for similarity matrices. The pseudocode for this procedure is shown in Algorithm 3.

Algorithm 3 Pseudo-code for the estimation of missing/invalid non-self-similarity (NSS) values. For a similarity matrix (SM) the minimum (i.e., worst) value is taken, either from the cross-section or from the entire matrix. Similarly, for a distance matrix (DM) the maximum (worst) value is taken.

1: for all methods k such that 1 ≤ k ≤ m do
2:   for all proteins i (rows) do
3:     for all proteins j (columns) do
4:       if matrix[i][j][k] = MISSING/INVALID and i ≠ j then
5:         if matrix_type = SM then
6:           S_ij = min(S_ix | S_xj), or min(matrix) if no such value exists
7:         else if matrix_type = DM then
8:           S_ij = max(S_ix | S_xj), or max(matrix) if no such value exists
9:         end if
10:      end if
11:    end for
12:  end for
13: end for
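A corresponding C sketch of the NSS estimation (again a hypothetical helper, not the ProCKSI code) is given below for a similarity matrix with missing/invalid entries encoded as NaN; it first tries the symmetric "backup" entry and otherwise falls back to the worst value of the cross-section and, as a last resort, of the whole matrix. The special case where a method reports no significant similarity would be substituted with zero before this step.

/* Estimate a missing NSS entry S[i][j] of a similarity matrix S (N x N).
 * For a distance matrix, the minima below would become maxima. */
#include <math.h>

void estimate_non_self_similarity(double **S, int N, int i, int j)
{
    int x, y;

    if (!isnan(S[j][i])) {               /* symmetry: use the "backup" value */
        S[i][j] = S[j][i];
        return;
    }

    /* Worst (lowest) valid value in the cross-section of proteins i and j. */
    double cross_min = INFINITY;
    for (x = 0; x < N; x++) {
        if (!isnan(S[i][x]) && S[i][x] < cross_min) cross_min = S[i][x];
        if (!isnan(S[x][j]) && S[x][j] < cross_min) cross_min = S[x][j];
    }
    if (!isinf(cross_min)) {
        S[i][j] = cross_min;
        return;
    }

    /* Fallback: worst (lowest) valid value of the entire matrix. */
    double matrix_min = INFINITY;
    for (x = 0; x < N; x++)
        for (y = 0; y < N; y++)
            if (!isnan(S[x][y]) && S[x][y] < matrix_min)
                matrix_min = S[x][y];
    S[i][j] = matrix_min;
}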

8.4 Overview of the Core Data Storage, Management and Analysis Technologies

Scientific disciplines often use simple, convenient and self-describing data models such as the Hierarchical Data Format (HDF, HDF4 or HDF5) [253], the Flexible Image Transport System (FITS) and NetCDF [254]. This is because most scientific data goes beyond the limits of traditional relational databases and XML documents in terms of its size, complexity and heterogeneity, and is typically in the form of numeric arrays and images requiring more programming support (in terms of libraries, APIs and tools) for statistical analysis, manipulation, processing and visualization. All of these requirements are easily accommodated by scientific data models, along with the additional benefits of being open source, supporting a multi-object data format (i.e., each data model can support different primitive data types as well as multi-dimensional arrays, tables, groups, etc.), data definition (in terms of metadata), efficient access (in terms of random/parallel/partial/fast I/O), optimal storage (in terms of a compressed, binary file format), and ease of sharing by means of their platform-independent nature (Figure 8.1).

Although many scientific data models exist, HDF5 is commonly used across a wide variety of scientific domains, projects and applications (http://www.hdfgroup.org/HDF5/users5.html). HDF5 refers to a suite of open source technologies including a data model, APIs, libraries, utilities and tools used for the storage, management and analysis of complex and large scale scientific datasets. Originally created by the National Center for Supercomputing Applications (NCSA), it is now maintained by The HDF Group [253]. The ease of dealing with HDF5 lies in its multi-object file format, which allows data of different type, nature and size to be stored in the same file with a suitable data structure, ranging from simple variables, multi-dimensional arrays and tables (the table object of HDF5 is also stored as a multidimensional array and hence provides much quicker access than the rows of an SQL database) to images, groups and palettes (Figure 8.2). Working with all of these different formats becomes easy when using high-level interfaces such as the HDF5 Table (H5TB) interface, which can be used to work with tables (Listing 8.1 shows a code snippet of the HDF5 table programming model).

LISTING 8.1: Code snippet for the HDF5 Table (H5TB) high-level interface

/* Define HDF5 table dimensions */
#define NFIELDS    (hsize_t) 18
#define NRECORDS   (hsize_t) 4000000
#define TABLE_NAME "COMPARISON_RESULTS"

/* Define field information (dst_size, dst_offset, field_names and
   comparison_results are defined elsewhere) */
hid_t   field_type[NFIELDS];
hid_t   string_type;
hid_t   file_id;
hsize_t chunk_size = 100;
int     *fill_data = NULL;
int     compress   = 0;
herr_t  status;

/* Initialize field_type */
string_type = H5Tcopy( H5T_C_S1 );
H5Tset_size( string_type, 32 );
field_type[0] = H5T_NATIVE_INT;
. . .

/* Create a new file using default properties */
file_id = H5Fcreate( "pdb_select30_2000.h5", H5F_ACC_TRUNC,
                     H5P_DEFAULT, H5P_DEFAULT );

/* Create the HDF5 table and write the data */
status = H5TBmake_table( "Table Title", file_id, TABLE_NAME, NFIELDS,
                         NRECORDS, dst_size, field_names, dst_offset,
                         field_type, chunk_size, fill_data, compress,
                         comparison_results );

/* Close the file */
H5Fclose( file_id );

It is important to note that, unlike HDF5, which provides support for all data formats including images, arrays, tables, etc., FITS and NetCDF only support images and array-oriented data types, respectively. Each of these data items can additionally be described with related metadata, allowing further ease in terms of data discovery. Furthermore, the chunking and compression facilities, together with the binary file format of HDF5, provide high performance access, occupy less space and take less time to be transmitted over the network. In order to provide interoperability with XML (to leverage its additional benefits when working with web/grid services), HDF5 also provides tools to translate data from the HDF5 format to XML and vice versa.

In the following sections we present the design and implementation of a grid-enabled HDF5-based architecture for the storage, management and analysis of protein structure multi-comparison results and compare its performance with a traditional relational database using Oracle/SQL.

FIGURE 8.1: This figure illustrates the important reasons behind the selection and use of HDF, i.e., to deal with the complexity, heterogeneity and size of the data in an efficient, reliable and low-cost manner. [extracted from [253]]

FIGURE 8.2: The HDF5 file format (.hdf) can store different types of data (ranging from 3D arrays to raster images and tables) in a single file. [extracted from [253]]

8.5 Design and Implementation of the Proposed Architecture

Figure 8.3 shows the architectural design for the storage, management and analysis of (multi) similarity data for large scale protein structure datasets using the infrastructure resources provided by the NGS. The overall computational load of pair-wise similarity computations is uniformly distributed through a grid-enabled Message Passing Interface (MPI) [191] based algorithm. The use of a grid-enabled MPI-based model makes it easy to exchange the global information needed for the computation of missing/invalid values, thereby allowing the standardization and normalization steps to be performed on-the-fly in a distributed environment.
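As a simplified illustration of such a distribution (covering only the even strategy described in Chapter 5; compare_pair() is a hypothetical stand-in for a call to one of the comparison methods), each MPI rank can take a contiguous block of the N x N comparison indices:

/* Simplified sketch: each MPI rank processes a contiguous block of the
 * N*N pairwise comparisons (even distribution by number of pairs). */
#include <mpi.h>
#include <stdio.h>

#define N 2000                                   /* number of structures */

/* Placeholder for a call to a real comparison method (hypothetical). */
static double compare_pair(int i, int j) { return (double)(i + j); }

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long total = (long)N * N;                    /* all-against-all pairs */
    long chunk = (total + size - 1) / size;      /* ceiling division      */
    long begin = (long)rank * chunk;
    long end   = (begin + chunk < total) ? begin + chunk : total;

    for (long p = begin; p < end; p++) {
        int i = (int)(p / N), j = (int)(p % N);
        double s = compare_pair(i, j);           /* one pairwise comparison  */
        (void)s;                                 /* ... store result locally */
    }

    /* Global quantities needed for the estimation/normalization steps
       (e.g. matrix-wide max/min) can then be exchanged, for example with
       MPI_Allreduce, before the results are written out. */
    MPI_Finalize();
    return 0;
}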

The MPI-based job packages are submitted for execution on a cross-site grid infrastructure provided by the UK National Grid Service (NGS) [255]. A brief description of the NGS tools and services used for our experimentation is given below:

GSI-SSHTerm: an applet as well as a standalone application that provides seamless access to NGS resources, i.e., it offers a web/command-line interface to securely connect to NGS sites from any machine on which a UK e-Science digital certificate is installed. Digital certificates (or X.509 certificates) are used as an alternative to username and password to authenticate the user in a wide range of eScience projects. The NGS runs its own Certification Authority (CA), which issues certificates to the UK eScience community. The NGS's CA authenticates users through its representative Registration Authorities (RAs) situated at various universities across the UK. These RAs in turn authenticate the users based on photo IDs and other personal details.

GT4: Globus Toolkit version 4 (GT4), a grid middleware that, on the one hand, enables service providers to bind grid applications together for ease of access and management and, on the other hand, provides many services for job submission, monitoring and management. We used Globus to submit RSL scripts and to monitor jobs.

MPIg: Grid-enabled Message Passing Interface (MPIg) is the latest version of MPICH-G2 that works with the web-services-based version of Globus and enables an MPI-based application to spawn jobs across multiple clusters in a WAN environment. We used this library to distribute our jobs across two NGS clusters, as illustrated in Figure 8.3.

SRB: the Storage Resource Broker is a Data Grid middleware developed by the Data Intensive Cyber Environments research group at the San Diego Supercomputer Center (SDSC). It enables users to access files by logical names or attributes from any location in a LAN or WAN environment without worrying about the physical location of each file. It achieves this through the use of a metadata catalog (MCat) that holds information about the physical location of each file, its logical name and its attributes. As we used MPIg to run our jobs across different NGS clusters, which have different file system hierarchies, we had to use this facility to provide uniform naming of files.

HARC: the Highly-Available Resource Co-allocator is a framework that provides scheduler-based advance resource reservation. We used this module to make advance reservations of the number of CPUs we wanted to be used solely for our application.

The (multi) similarity values produced by the parallel jobs on each processor were stored both in HDF5 and in Oracle using the schemas shown in Listings 8.2 and 8.3, respectively.

LISTING 8.2: HDF5 DDL Schema

HDF5 "Protein_Multiverse .h5" { GROUP " / " { DATASET "COMPARISON_RESULTS" { DATATYPE H5T_COMPOUND { H5T_STRING { STRSIZE 32; STRPAD H5T_STR_NULLTERM ; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1 ; } "Structure1"; H5T_STRING { STRSIZE 32; STRPAD H5T_STR_NULLTERM ; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1 ; } "Structure2"; H5T_IEEE_F32LE "F_sn" ; H5T_IEEE_F32LE "F_zscore"; H5T_IEEE_F32LE "F_rmsd" ; H5T_IEEE_F32LE "C_zscore"; H5T_STD_I32LE "C_align"; H5T_IEEE_F32LE "C_rmsd" ; H5T_IEEE_F32LE "D_zscore" ; H5T_STD_I32LE "D_align"; H5T_IEEE_F32LE "D_rmsd" ; H5T_STD_I32LE "T_align"; H5T_IEEE_F32LE "T_tmscore" ; H5T_IEEE_F32LE "T_rmsd" ; H5T_STD_I32LE "M_align" ; H5T_STD_I32LE "M_overlap" ; H5T_IEEE_F32LE "U_usmd" ; }

LISTING 8.3: Oracle table description

Name          Type
------------  --------------
PAIR_ID       NUMBER(8)
STR1          VARCHAR2(32)
STR2          VARCHAR2(32)
F_SN          NUMBER(5,2)
F_ZSCORE      NUMBER(5,2)
F_RMSD        NUMBER(5,2)
C_ZSCORE      NUMBER(5,2)
C_ALIGN       NUMBER(5)
C_RMSD        NUMBER(5,2)
D_ZSCORE      NUMBER(5,2)
D_ALIGN       NUMBER(5)
D_RMSD        NUMBER(5,2)
T_ALIGN       NUMBER(5)
T_TMSCORE     NUMBER(5,2)
T_RMSD        NUMBER(5,2)
M_ALIGN       NUMBER(5)
M_OVERLAP     NUMBER(5)
U_USMD        NUMBER(5,2)

The integration of MPI-IO with parallel HDF5 also makes the process of result synthesis more efficient, as each process writes concurrently to the same file. The collated results can then easily be queried through HDF5-based technologies such as HDFView (a Java-based application used to visualize and analyze the contents of HDF5 files) and HDF5-FastQuery. HDF5-FastQuery is an API that integrates fast bitmap indexing technology with HDF5, enabling extremely large HDF5 datasets to be analyzed interactively. According to [256], approximate and fixed queries performed through HDF5-FastQuery are up to 10 and 2 times faster, respectively, than simple HDF5 queries.
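For completeness, the following C sketch (illustrative only; the file name is the one used in Listing 8.1 and "F_zscore" is one of the float fields of the compound type in Listing 8.2) mimics the kind of simple HDF5 query used in the benchmarks below: reading a single field from the last record of the comparison-results table.

/* Read one named field from the last record of the results table. */
#include <stdio.h>
#include "hdf5.h"
#include "hdf5_hl.h"

int main(void)
{
    hid_t   file_id;
    hsize_t nfields, nrecords;
    float   value;
    size_t  field_offset = 0;             /* offset within the read buffer */
    size_t  field_size   = sizeof(float);

    file_id = H5Fopen("pdb_select30_2000.h5", H5F_ACC_RDONLY, H5P_DEFAULT);

    /* How many records does the table hold? */
    H5TBget_table_info(file_id, "COMPARISON_RESULTS", &nfields, &nrecords);

    /* Read the "F_zscore" field of the last record only. */
    H5TBread_fields_name(file_id, "COMPARISON_RESULTS", "F_zscore",
                         nrecords - 1, 1, sizeof(float),
                         &field_offset, &field_size, &value);

    printf("F_zscore of record %llu = %f\n",
           (unsigned long long)(nrecords - 1), value);

    H5Fclose(file_id);
    return 0;
}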

In the following section we discuss the results of our experiments with the HDF5-based architecture and provide its comparison with Oracle/SQL.

8.6 Experimental Results and Discussions

Several experiments were designed to compare a varying number of protein structures with multiple methods. The datasets and methods used in our experiments are described in Table 8.1 and Table 8.2, respectively. The main objective behind the selection of these datasets was to find out the effect of a growing number of proteins on the storage requirements and the query execution time of both the HDF5 and the Oracle/SQL database technologies. The results of this experimentation are presented in Table 8.3.

All the results were stored directly on the NGS infrastructure. The storage values for HDF5 represent the overall overhead of the file structure and related metadata for the results of each dataset, whereas the storage values given for Oracle only include the size of the respective tables for the results of each dataset; the additional storage overhead needed for the base installation (an empty new Oracle database, which takes about 1-2 GB) is not included. Nevertheless, for these datasets, the storage/query overhead of Oracle is quite significant compared to that of HDF5. Furthermore, as we double the number of protein structures, the storage requirements of both HDF5 and Oracle grow at approximately the same rate, i.e., by a factor of 4. In terms of query execution, however, HDF5 maintains an approximately constant query time irrespective of the size of the dataset, whereas the time of an SQL query to the Oracle database increases drastically with larger datasets.

FIGURE 8.3: Architectural design of the storage, management and analysis of MC-PSC data in the grid environment. The multi-similarity results are stored by each node in a single HDF5 file and in the Oracle database. The results in the HDF5 file can be accessed and analyzed through HDFView and HDF5-FastQuery (HDF5-FQ), as explained in the text.


TABLE 8.1: Datasets used in the experiments. PDB_SELECT30 is a representative dataset (subset of PDB) consisting of all non-redundant protein structures having chain length greater than 50 and sequence identity less than 30%. This dataset has been prepared by using the PDB_SELECT algorithm designed by Uwe Hobohm and Chris Sander [224]. The overall dataset consists of 7183 structures and was partitioned into selected groups as listed in the first column of this table. Hash sign (#) represents the word ’Number of’

Dataset              # of Chains    # of Comparisons
PDB_SELECT30_250     250            62,500
PDB_SELECT30_500     500            250,000
PDB_SELECT30_1000    1000           1,000,000
PDB_SELECT30_2000    2000           4,000,000
PDB_SELECT30_4000    4000           16,000,000

TABLE 8.2: Methods used to detect (Multi) Similarity among Protein Structure Datasets. Note: For full measure/metric names please refer to section 8.3

Method           Measures/Metrics
MaxCMO [46]      NA, NO
DaliLite [250]   NA, Z-score, RMSD
CE [47]          NA, Z-score, RMSD
TM-align [49]    NA, TM-score, RMSD
USM [48]         USM distance
FAST [50]        NA, SN, RMSD

The query execution time in each case (HDF5 and Oracle) is the average time taken by a simple query to retrieve the contents of a particular field from the last record (i.e., the record with the highest index value in the given table). The average was taken over 15 repeated queries, and the standard deviation in each case is not significant (Table 8.4).

TABLE 8.3: HDF5 vs Oracle storage/query benchmarking with different datasets. The total number of records (rows) in each case is equal to the number of pairwise comparisons for each dataset (Table 8.1), and the total number of fields is equal to the number of measures/metrics for all the methods (i.e., 15; Table 8.2) plus 3 additional fields, of which 2 store the protein IDs and the remaining 1 serves as the record identifier.

Dataset             HDF5 Storage (MB)   Oracle Storage (MB)   HDF5 Avg. Query (sec)   Oracle (SQL) Avg. Query (sec)
pdb_select30_250    0.26                6.84                  0.00070                 0.01467
pdb_select30_500    1.00                26.37                 0.00071                 0.04300
pdb_select30_1000   4.01                109.37                0.00072                 1.13560
pdb_select30_2000   16.04               421.87                0.00072                 10.45620
pdb_select30_4000   67.23               1687.50               0.00075                 32.89250

TABLE 8.4: Standard Deviations Calculated on 15 Query Times for HDF5 and Oracle (SQL)

Dataset             HDF5 Query (STDEV)   Oracle (SQL) Query (STDEV)
pdb_select30_250    3.23E-05             0.0019
pdb_select30_500    2.45E-05             0.0057
pdb_select30_1000   2.30E-05             0.464
pdb_select30_2000   3.33E-05             1.332
pdb_select30_4000   4.25E-05             3.929

8.7 Conclusions

HDF5 has been used as a scientific data format for the storage, management and analysis of (multi) similarity data for large scale protein structure datasets in a distributed/grid environment provided by the UK National Grid Service (NGS). The mathematical techniques used for the estimation of missing/invalid values have been described. The storage/query overhead of HDF5 was compared with that of Oracle/SQL. The results show a significant performance benefit of HDF5 over Oracle/SQL.

It should be noted that the performance evaluation of the Oracle database is based on non-indexed queries. It is known from the Oracle documentation that indexed keys introduce a performance penalty for INSERTs, UPDATEs and DELETEs. Therefore, in the future, it would be interesting to compare the query performance gained with indexed keys against the results we have currently obtained (without indexing).

Based on this initial benchmarking, we plan to compute the (multi) similarities of all available PDB structures, thereby creating an efficient scientific knowledge base of pre-computed results. The development of such a knowledge base is aimed at providing an easy-to-use interface, so that biologists can perform scientific queries using a wide variety of criteria and options, leading to a far better and more reliable understanding of the protein structural universe.

The next chapter provides an overview of consensus-based Protein Structure Similarity Clustering and compares the results obtained by using two different consensus clustering approaches.

CHAPTER 9

CONSENSUS-BASED PROTEIN STRUCTURE SIMILARITY

CLUSTERING

The previous chapter discussed the storage, management and analysis of protein structure (multi) similarity data in terms of the comparison of two data technologies, i.e., Oracle and HDF5. The analysis part discussed there was based on the query/visualization of the results obtained from individual methods. This chapter discusses in detail the process of consensus and compares the results of single methods with those of the consensus obtained from two different approaches, i.e., total evidence and total consensus.

9.1 Introduction

Protein Structure Similarity Clustering (PSSC) is the process of analyzing the results obtained from the comparison of protein structures [257–259]. Because MC-PSC involves multiple methods, the results from each method need to be combined/unified by means of a consensus process. As illustrated in Figure 9.1, there are different approaches for deriving the consensus [260]. Depending on whether we combine the data (i.e., character congruence) or the trees (i.e., taxonomic congruence), these approaches are termed Total evidence and Total consensus, respectively [261]. It had already been demonstrated in [27] that the TE-based results obtained from MC-PSC are more robust and accurate than the individual results from each single method. Based on the initial results of [27], this chapter presents an extended study on the case of 'Structural Comparison and Clustering of Protein Kinases'. The extension involves the incorporation of more comparison methods in the consensus process as well as deriving the consensus with a new approach (i.e., 'Total consensus'). In what follows, a succinct description of the clustering algorithms, the approaches used for the consensus and the characteristics of the Protein Kinase Dataset is presented, along with the results, discussions and conclusions.

FIGURE 9.1: Simple MC-PSC process: takes protein 3-D structures as input; applies popularly used algorithms for the comparison of these structures; makes the results homogeneous and finally develops consensus-based clustering. Since there are also different methods for consensus [260], the MC-PSC process also needs to investigate which of these is the most reliable.

9.2 Clustering Algorithms

Protein structure similarity results (usually a square similarity matrix or a square distance matrix) are given as input to a clustering algorithm, which groups the proteins into a set of disjoint classes so that proteins within a class have high similarity, while proteins in separate classes are more dissimilar [262]. Different algorithms are available in different software packages. The results presented in this chapter are based on a local installation of the Clustering Calculator [263]. The Clustering Calculator provides a variety of hierarchical clustering algorithms, including, e.g., the Unweighted Pair Group Method with Arithmetic mean (UPGMA) [180] and Ward's Minimum Variance (WMV) method [181]. The results of the clustering are produced as a tree in the PHYLIP format [264], which can be visualized in a variety of ways using tree visualization software such as HyperTree [265] and Dendroscope [266].
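For reference, the UPGMA linkage rule used by such tools can be illustrated with the following compact, naive C sketch (not the Clustering Calculator's implementation): at each step the two closest clusters are merged and the distances to the merged cluster are the size-weighted mean of the distances to its members.

/* Naive UPGMA on an n x n distance matrix d (n <= MAXN); prints the
 * merge order and the ultrametric merge heights. Illustrative only. */
#include <stdio.h>

#define MAXN 64

void upgma(double d[MAXN][MAXN], int n)
{
    int size[MAXN], active[MAXN];
    for (int i = 0; i < n; i++) { size[i] = 1; active[i] = 1; }

    for (int step = 0; step < n - 1; step++) {
        /* Find the closest pair of active clusters. */
        int a = -1, b = -1;
        double best = 1e300;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (active[i] && active[j] && d[i][j] < best) {
                    best = d[i][j]; a = i; b = j;
                }

        printf("merge clusters %d and %d at height %f\n", a, b, best / 2.0);

        /* Average linkage: distance to the merged cluster is the
           size-weighted mean of the distances to its two members. */
        for (int k = 0; k < n; k++)
            if (active[k] && k != a && k != b)
                d[a][k] = d[k][a] =
                    (size[a] * d[a][k] + size[b] * d[b][k]) / (size[a] + size[b]);

        size[a] += size[b];
        active[b] = 0;
    }
}

int main(void)
{
    double d[MAXN][MAXN] = {
        { 0,  2,  6, 10 },
        { 2,  0,  6, 10 },
        { 6,  6,  0,  8 },
        {10, 10,  8,  0 }
    };
    upgma(d, 4);
    return 0;
}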

9.3 Consensus Approaches

Two competing paradigms in the field of phylogenetic inference (systematics) are 'total evidence' (character congruence) and 'total consensus' (taxonomic congruence) [267]. A brief description of these paradigms is given in the following sections.

9.3.1 Total evidence

The term total evidence was originally introduced by Rudolf Carnap in the context of inductive logic [268]. Based on this concept, in 1989 Arnold G. Kluge introduced the term in the field of systematics [269]. As per Kluge's definition, the idea behind 'total evidence' is to consider all evidence (e.g. distances) simultaneously and combine it in order to achieve the best phylogenetic estimate. Figure 9.2 illustrates the steps involved in the process of total evidence. At the top there are several different distance matrices (produced by different methods in the case of MC-PSC), which are combined (by taking the average/mean or the median) into a single matrix called the 'total evidence matrix'. This matrix is then used to produce the tree for the final analysis.
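A minimal C sketch of this combination step (a hypothetical helper, not the ProCKSI code) is given below; it builds the total evidence matrix as the element-wise mean of k normalised distance matrices (the median mentioned above would be an alternative combination rule).

/* Combine k normalised n x n distance matrices D[0..k-1] into the
 * total evidence matrix TE by element-wise averaging. */
void total_evidence_matrix(double ***D, int k, int n, double **TE)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int m = 0; m < k; m++)
                sum += D[m][i][j];
            TE[i][j] = sum / k;          /* element-wise mean */
        }
}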

FIGURE 9.2: Total evidence: the distance matrices (produced by different methods in the case of MC-PSC) are combined (by taking the average/mean or the median) into a single matrix (the total evidence matrix). The resultant matrix produces the total-evidence-based tree depicting the phylogenetic relationships among the elements of the dataset.

9.3.2 Total consensus

The idea of 'total consensus' is based on Karl Popper's Logik der Forschung (The Logic of Scientific Discovery) [270], which says that the mutual confirmation of independent lines of evidence gives the strongest support for conclusions. Hence, this approach focuses on the separate analysis of the data and then on combining the analyzed data (trees) to form a consensus tree (supertree) [267]. Figure 9.3 illustrates the steps involved in the process of total consensus. At the top there are several different distance matrices (produced by different methods in the case of MC-PSC); each of these matrices is analyzed separately (by producing a tree) and the individual trees are then combined to form the consensus tree for the final analysis. There are several approaches and software packages for the combination of individual trees to form the consensus [271]. This study uses the Total consensus (TC) method of the Clann (the Irish word for "family") software [271].

FIGURE 9.3: Total consensus: the distance matrices (produced by different methods in the case of MC-PSC) are given to a clustering algorithm to generate trees for separate analysis. The individual trees are then combined to produce a total-consensus-based supertree, which depicts the phylogenetic relationships among the elements of the dataset.

9.4 Protein Kinase Dataset

The Protein Kinase Dataset consists of 45 structures from 9 different groups (super-families), namely 1) cAMP Dependent Kinases, 2) Protein Kinases C, 3) Phosphorylase Kinases, 4) Calmodulin Kinases, 5) Casein Kinases, 6) Cyclin Dependent Kinases, 7) Tyrosine Kinases, 8) Mitogen Activated Kinases, and 9) Twitchin Kinases [272–275], as shown in Table 9.1.

Table 9.1 - Protein Kinase Dataset: detailed SCOP classification for each protein domain in the Protein Kinase (PK) dataset, grouped according to Hanks' and Hunter's (HH) original classification scheme. IG = Immunoglobulin, PhK = Tyrosine Phosphorylase Kinase, S/TK = Serine/Threonine Kinase, TK = Tyrosine Kinase, c.s. = catalytic subunit, c.-r. = cysteine-rich, dep. = dependent. Reproduced from [27]

HH Sub Protein SCOP Classification Level Cluster Cluster Domain Class Fold Superfamily Family Domain Species d1apme_ α+β PK-like PK-like PK c.s. (S/TK) cAMP-dep. PK, c.s. mouse d1atpe_ α+β PK-like PK-like PK c.s. (S/TK) cAMP-dep. PK, c.s. mouse A d1bkxa_ α+β PK-like PK-like PK c.s. (S/TK) cAMP-dep. PK, c.s. mouse d1fmoe_ α+β PK-like PK-like PK c.s. (S/TK) cAMP-dep. PK, c.s. mouse d2cpke_ α+β PK-like PK-like PK c.s. (S/TK) cAMP-dep. PK, c.s. mouse d1cdka_ α+β PK-like PK-like PK c.s. (S/TK) cAMP-dep. PK, c.s. pig 1 B d1cmke_ α+β PK-like PK-like PK c.s. (S/TK) cAMP-dep. PK, c.s. pig d1ctpe_ α+β PK-like PK-like PK c.s. (S/TK) cAMP-dep. PK, c.s. pig d1stce_ α+β PK-like PK-like PK c.s. (S/TK) cAMP-dep. PK, c.s. cow d1ydre_ α+β PK-like PK-like PK c.s. (S/TK) cAMP-dep. PK, c.s. cow C d1ydse_ α+β PK-like PK-like PK c.s. (S/TK) cAMP-dep. PK, c.s. cow d1ydte_ α+β PK-like PK-like PK c.s. (S/TK) cAMP-dep. PK, c.s. cow d1ptq__ small PK c.-r. domain PK c.-r. domain PK c.-r. domain PK C-delta (PKCdelta) mouse 2 - d1ptr__ small PK c.-r. domain PK c.-r. domain PK c.-r. domain PK C-delta (PKCdelta) mouse 3 - d1phk__ α+β PK-like PK-like PK c.s. (S/TK) γ-subunit glycogen Phk rabbit A d1a06__ α+β PK-like PK-like PK c.s. (S/TK) Calmodulin-dep. PK rat d1cdma_ α EF Hand-like EF-hand Calmodulin-like Calmodulin cow 4 B d1cm1a_ α EF Hand-like EF-hand Calmodulin-like Calmodulin cow d1cm4a_ α EF Hand-like EF-hand Calmodulin-like Calmodulin cow A d1lr4a_ α+β PK-like PK-like PK c.s. (S/TK) Casein kinase-2, CK2 maize 5 d1csn__ α+β PK-like PK-like PK c.s. (S/TK) Casein kinase-1, CK1 fission yeast B d2csn__ α+β PK-like PK-like PK c.s. (S/TK) Casein kinase-1, CK1 fission yeast d1aq1__ α+β PK-like PK-like PK c.s. (S/TK) Cyclin-dep. PK, CDK2 human d1fina_ α+β PK-like PK-like PK c.s. (S/TK) Cyclin-dep. PK, CDK2 human 6 - d1hck__ α+β PK-like PK-like PK c.s. (S/TK) Cyclin-dep. PK, CDK2 human d1hcl__ α+β PK-like PK-like PK c.s. (S/TK) Cyclin-dep. PK, CDK2 human d1jsua_ α+β PK-like PK-like PK c.s. (S/TK) Cyclin-dep. PK, CDK2 human d1ad5a1 β SH3-like barrel SH3-domain SH3-domain Hemapoetic cell kinase Hck human d1ad5a2 α+β SH2-like SH2 domain SH2 domain Hemopoetic cell kinase Hck human d1ad5a3 α+β PK-like PK-like PK c.s. (TK) Hemopoetic cell kinase Hck human d1fmk_1 β SH3-like barrel SH3-domain SH3-domain c-src protein TK human d1fmk_2 α+β SH2-like SH2 domain SH2 domain c-src TK human d1fmk_3 α+β PK-like PK-like PK c.s. (TK) c-src TK human A d2hcka1 β SH3-like barrel SH3-domain SH3-domain Hemapoetic cell kinase Hck human d2hcka2 α+β SH2-like SH2 domain SH2 domain Hemopoetic cell kinase Hck human d2hcka3 α+β PK-like PK-like PK c.s. (TK) Haemopoetic cell kinase Hck human d2ptk_1 b SH3-like barrel SH3-domain SH3-domain c-src protein TK chicken 7 d2ptk_2 α+β SH2-like SH2 domain SH2 domain c-src TK chicken d2ptk_3 α+β PK-like PK-like PK c.s. (TK) c-src TK chicken d1aotf_ α+β SH2-like SH2 domain SH2 domain TK Fyn human d1blj__ α+β SH2-like SH2 domain SH2 domain P55 Blk protein TK mouse B d1csya_ α+β SH2-like SH2 domain SH2 domain Syk TK human d1cwea_ α+β SH2-like SH2 domain SH2 domain p56-lck TK human d1fgka_ α+β PK-like PK-like PK c.s. (TK) Fibroblast growth factor receptor 1 human d1ir3a_ α+β PK-like PK-like PK c.s. (TK) Insulin receptor human C d1irk__ α+β PK-like PK-like PK c.s. (TK) Insulin receptor human d3lck__ α+β PK-like PK-like PK c.s. (TK) Lymphocyte kinase (lck) human A d1erk__ α+β PK-like PK-like PK c.s. (S/TK) MAP kinase Erk2 rat d1ian__ α+β PK-like PK-like PK c.s. 
(S/TK) MAP kinase p38 human 8 B d1p38__ α+β PK-like PK-like PK c.s. (S/TK) MAP kinase p38 mouse d1wfc__ α+β PK-like PK-like PK c.s. (S/TK) MAP kinase p38 human d1koa_1 β IG-like β-sandwich IG I set domains Twitchin nematode A 9 d1koa_2 B α+β PK-like PK-like PK c.s. (S/TK) Twitchin, kinase domain Caenorhabditis elegans d1koba_ α+β PK-like PK-like PK c.s. (S/TK) Twitchin, kinase domain California sea hare

9.5 Results and Discussions

Figure 9.4(a) shows the classification of the protein kinase dataset with the USM method. This method distinguishes the kinases at the class/fold level and groups them into two main clusters, i.e., alpha+beta/PK-like proteins and others (i.e., all small proteins (2), all alpha proteins with an EF Hand-like fold (4B), and those from the alpha+beta class but with an SH2-like fold (7B)). All of the kinases are correctly identified and clustered (36 in cluster 1 and 9 in cluster 2 (green box)). The blue box in cluster 1 indicates that the classification with USM also detects similarities down to the species level, i.e., similar kinases from HH cluster 1 (pigs, cows and mice) are clustered appropriately. The exceptions (errors/wrong clustering of kinases at the species level) are indicated in blue (inside the blue box), green (inside the green box) and red (for the rest of the kinases). Figures 9.4(b) to 9.11 show the analogous clusterings (based on a single method/measure) for the rest of the methods. Whether a particular method/measure is able to group the kinases properly or not is discussed in the caption of each figure. It seems that the poorest performing measure of all is the RMSD of each method (Figures 9.6(a), 9.7(b), 9.9(a) and 9.10(a)). Hence, studies such as [257–259], which are based on the clustering of RMSDs, might not have produced proper results, and, therefore, the use of a consensus mechanism to combine the results from different methods/measures becomes indispensable.

Figure 9.12 shows the classification of the protein kinase dataset with Total evidence (a) and Total consensus (b) when only the Z-scores are combined from each method. As discussed in the caption of the figure, the total evidence based consensus produces a better classification than the total consensus. The total consensus fails to separate the kinases into two main clusters, and the total number of wrongly clustered kinases at the species level (i.e., Red, Green and Blue kinases) is much larger (11) than for the total evidence (7). This trend can also be observed in Figures 9.13 and 9.14, which are based on the consensus of the Number of Alignments and of the RMSDs, respectively.

Finally, Figure 9.15 shows the results of the classification when the consensus is derived from the combination of all the methods/measures. Here, again, the total evidence approach provides a much better classification than the total consensus. The total consensus fails to separate the kinases into two main clusters, and the total number of wrongly clustered kinases at the species level (i.e., Red, Green and Blue kinases) is much larger (10) than for the total evidence (3).

Table 9.2 summarizes the performance of all the methods individually, with total consensus and with total evidence.

(a) (b)

FIGURE 9.4: Single method clustering (a) USM (b) MaxCMO/Align. Note: Red, Green and Blue kinases indicate errors at species level.

(a) (b)

FIGURE 9.5: Single method clustering (a) MaxCMO/OverLap (b) Dali/Z: Kinases are not grouped into two main clusters, i.e., kinases from 1st cluster are mixed with 2nd; kinases in the 2nd cluster do not group into the green box. Note: Red, Green and Blue kinases indicate errors at species level.

(a) (b)

FIGURE 9.6: Single method clustering (a) Dali/RMSD: Kinases are not grouped into two main clusters; kinases in the 2nd cluster do not group into the green box. (b) Dali/Align: Kinases are not separated into two main clusters, i.e., kinases from 1st cluster are mixed with 2nd cluster; the green box lacks its two kinases (1ptq,1ptr). Note: Red, Green and Blue kinases indicate errors at species level.

(a) (b)

FIGURE 9.7: Single method clustering (a) CE/Align: Kinases are not separated into two main clusters properly, i.e., kinases belonging to 1st cluster are mixed with the 2nd; also the 2nd cluster (green box) lacks two of its kinases (1ptq,1ptr). (b) CE/RMSD: Kinases are not grouped into two main clusters properly, i.e., some kinases from 1st cluster are mixed with 2nd cluster; on the other hand, the blue box lacks two kinases (1cmke,1ctpe) and the green box lacks one kinase (1aotf). Note: Red, Green and Blue kinases indicate errors at species level.

(a) (b)

FIGURE 9.8: Single method clustering (a) CE/Z: kinases are not separated into two main clusters properly, i.e., kinases belonging to 1st cluster are mixed with the 2nd. Also the 2nd cluster (green box) lacks two of its kinases (1ptq,1ptr). (b) FAST/SN: kinases are not grouped into two main clusters properly, i.e., kinases belonging to 1st cluster are mixed with 2nd. Note: Red, Green and Blue kinases indicate errors at species level.

(a) (b)

FIGURE 9.9: Single method clustering (a) FAST/RMSD: kinases are not grouped into two main clusters properly. The 2nd cluster (green box; missing here) gets mixed with kinases from 1st. (b) FAST/Align. Note: Red, Green and Blue kinases indicate errors at species level.

(a) (b)

FIGURE 9.10: Single method clustering (a) TMAlign/RMSD: The 2nd cluster (green box) lacks its 2 kinases (1ptq, 1ptr) and contains two extra/wrong kinases (1fmk,2ptk). (b) TMAlign/Align. Note: Red, Green and Blue kinases indicate errors at species level.

FIGURE 9.11: Single method clustering: TMAlign/TM-Score. Note: Red, Green and Blue kinases indicate errors at species level.

(a) (b)

FIGURE 9.12: Total evidence (TE) vs Total consensus (TC) (a) TE/All/Z: kinases are not grouped into two main clusters, i.e., kinases from the 1st cluster are mixed with the 2nd. The blue box lacks its three kinases (1apme,1atpe,1cdka) and contains one extra/wrong kinase (1phk). The green box lacks its two kinases (1ptq, 1ptr) and contains two extra/wrong kinases (1a6o,2ptk). (b) TC/All/Z: kinases are not separated into two main clusters; rather they follow a complicated hierarchy. Note: Red, Green and Blue kinases indicate errors at species level.

(a) (b)

FIGURE 9.13: Total evidence (TE) vs Total consensus (TC) (a) TE/All/Align. (b) TC/All/Align: kinases are not grouped into two main clusters. Note: Red, Green and Blue kinases indicate errors at species level.

(a) (b)

FIGURE 9.14: Total evidence (TE) vs Total consensus (TC) (a) TE/All/RMSD: kinases are not grouped into two main clusters, i.e., kinases from the 1st cluster are mixed with the 2nd. The blue box lacks its three kinases (1apme,1atpe,1cdka) and contains one extra/wrong kinase (1phk). The green box lacks its two kinases (1ptq, 1ptr) and contains two extra/wrong kinases (1a6o,2ptk). (b) TC/All/RMSD: kinases are not grouped into two main clusters; rather they follow a complicated hierarchy. The blue box lacks its two kinases (1cmke,1ctpe). The green box is missing (all kinases dispersed). Note: Red, Green and Blue kinases indicate errors at species level.

(a) (b)

FIGURE 9.15: Total evidence (TE) vs Total consensus (TC) (a) TE/All/All. (b) TC/All/All: kinases are not separated into two main clusters; rather they follow a complicated hierarchy. Note: Red, Green and Blue kinases indicate errors at species level.

TABLE 9.2: Single Method vs Total Consensus (TC) vs Total Evidence (TE)

Method/Measure    Kinases grouped in    Number of wrongly clustered kinases
                  2 main clusters?      Red   Green   Blue   Total
USM               Yes                   3     -       2      5
MaxCMO/Align      Yes                   4     -       1      5
MaxCMO/OL         Yes                   4     -       2      6
Dali/Align        No                    9     2       1      12
Dali/RMSD         No                    13    3       2      18
Dali/Z            No                    10    -       2      12
CE/Align          No                    6     2       2      10
CE/RMSD           No                    3     1       3      7
CE/Z              No                    3     2       2      7
TMAlign/Align     Yes                   2     -       1      3
TMAlign/RMSD      No                    4     4       2      10
TMAlign/Z         Yes                   2     -       1      3
FAST/SN           No                    2     -       2      4
FAST/Align        Yes                   2     -       1      3
FAST/RMSD         No                    4     -       3      7
TC/RMSD           No                    9     1       3      13
TC/Align          No                    9     1       -      10
TC/Z              No                    7     -       3      10
TC/All            No                    6     -       5      11
TE/RMSD           No                    5     4       4      13
TE/Align          Yes                   6     -       1      7
TE/Z              Yes                   1     -       3      4
TE/All            Yes                   6     -       1      7

9.6 Conclusions

Protein Structure Similarity Clustering has been performed for the Protein Kinase Dataset [272–275]. It has been observed that the clustering results of the individual methods differ significantly from each other. Some methods (e.g., CE/All, Dali/All, TMAlign/RMSD, FAST/SN/RMSD) even fail to classify the proteins into two main clusters, which is indeed a significant error in terms of classification accuracy because they mix proteins belonging to different classes/folds. The total consensus approach (using all methods/measures, i.e., TC/ALL) exhibits the same mistake; however, the total evidence approach (TE/ALL) rightly classifies the proteins into two main clusters and hence performs a more accurate and reliable classification than the individual methods or the total consensus approach. Furthermore, the total number of less significant errors (misclassification errors at the species level (Red, Green, Blue) in Table 9.2) is worst for some single-method results (e.g., Dali/RMSD has 18 errors, TE/RMSD has 13 errors, Dali/Align/Z have 12 errors) as well as for the total consensus method (i.e., TC/ALL has 11 errors), but it is significantly lower for the total evidence results (i.e., TE/ALL has only 7 errors). This shows that, in the case of protein structure similarity clustering, the total evidence based consensus approach performs much better than many single methods as well as the total consensus based approach.

The next chapter summarizes the overall work of this dissertation and provides some directions for possible future work.

CHAPTER 10

CONCLUSIONS AND FUTURE WORK

This chapter summarizes the overall work of this dissertation and provides some directions for possible future work.

10.1 Introduction

This thesis is based on studies of the use of grid-styled parallel/distributed approaches for one of the Grand Challenge Applications (GCAs) in the field of structural proteomics, namely 'Multi-criteria Protein Structure Comparison and Analysis (MC-PSC)'. The thesis builds on a critical review of the related literature and uses the knowledge gained from this review to design and implement an optimal distributed framework to achieve the best solution for the MC-PSC problem. The efficiency and performance of the proposed framework have been evaluated, and different strategies have been used for the optimization of the overall system. The contributions and findings of this work have been presented coherently in separate chapters and are summarized in the following section. Section 10.3 also provides some guidelines for possible future enhancements.

10.2 Contributions

Chapter 1 sets the stage by providing the historical background on protein structure comparison, the need for multi-criteria protein structure comparison (MC-PSC) and the computational challenge that it poses. This chapter also discusses the proposed solution along with the general methodology and the organization of the dissertation.

The material presented in chapter 2 contributes in two ways: firstly, it provides a synthetic overview of several major contributions in a way that could help and serve a wide range of individuals and organizations interested in:

1. Setting up a local, enterprise or global IT infrastructure for life sciences,

2. Solving a particular life science related problem by selecting the most appropriate technological options that have been successfully demonstrated and reported herein,

3. Migrating already existing bioinformatics legacy applications, tools and services to a grid-enabled environment in a way that requires less effort and is motivated by the previous related studies provided herein,

4. Comparing the features of a newly developed application/service with those currently available, and

5. Starting a research and development career related to the use of web and grid technology for biosciences.

Secondly, it identifies some key open problems such as:

1. Biological data analysis and management is still quite a difficult job because of the lack of development and adaptation of optimized and unified data models and query engines,

2. Some of the existing bioinformatics ontologies and workflow management systems are simply in the form of Directed Acyclic Graphs (DAGs) and their descriptions lack expressiveness in terms of formal logic [135],

3. Lack of open-source standards and tools required for the development of thesaurus and meta-thesaurus services [77],

4. Need for appropriate query, visualization and authorization mechanisms for the management of provenance data and meta-data in in-silico experiments [68, 135],

5. Some of the BioGrid projects seem to be discontinued in terms of information updating. This might arise from funding problems or difficulties associated with their implementation,

6. There is a lack of domain-specific mature application programming models, toolkits and APIs for grid-enabled application development, deployment, debugging and testing,

7. There still seems to be a gap between the application layer and the middleware layer of a typical BioGrid infrastructure, because existing middleware services do not fully meet the demands of applications; for example, there is no proper support in any grid middleware for automatic application deployment on all grid nodes,

8. It is not trivial to deploy existing bioinformatics applications on available grid testbeds (such as NGS, EGEE, etc.), as this requires the installation and configuration of specific operating systems and grid middleware toolkits, which is not easy, at least from a biologist end-user's point of view,

9. It has been observed that there are still many issues with grid-based workflow management systems in terms of their support for complex operations (such as loops), legacy bioinformatics applications and tools, the use of proper ontologies and web services, etc. [135],

10. The job submission process on existing grid infrastructures seems to be quite complex because of the insufficient maturity of resource broker services, and

11. Lack of an appropriate implementation initiative regarding a knowledge grid infrastructure for life sciences.

This chapter was published as a peer reviewed journal article in Current Bioinformatics,

Vol. 3(1), pp.10-31, 2008. [doi:10.2174/157489308783329850].

Chapter 3 builds on the assumption that the diversity of enabling technologies for grid and distributed computing makes it difficult for the developer to select the most appropriate technological infrastructure with proven standards and tools. It therefore performs a more focused review of various demonstrations related to the use of grid and distributed public computing schemes for structural proteomics in order to provide a road-map for this dilemma. This chapter identifies that the selection of an appropriate grid/distributed computing approach mainly depends on the nature of the application. For example, applications with an independent and parallel nature of jobs are more suitable for distributed computing based on publicly-owned resources using middleware such as BOINC, the UD MetaProcessor, etc. On the other hand, if the application requires some pre-determined and controlled quality of service in terms of data and process management with enhanced reliability and security, then an organizational or cross-organizational grid infrastructure with standard middleware (such as Globus, Legion, NetSolve, Ninf, myGrid, Condor/G, etc.) would serve better. Additionally, this chapter also provides the background of multi-criteria protein structure comparison in terms of ProCKSI's architecture and functionality.

Parts of this chapter were published as a peer reviewed conference paper in the Proceedings of the Frontiers of High Performance Computing and Networking ISPA 2007 Workshops, LNCS Vol. 4743, pp.424-434, 2007. [doi:10.1007/978-3-540-74767-3_44]

Based on the comprehensive understanding of the problem at hand, as described in the earlier chapters, the proposed methodology for the solution has been presented in chapter 4. This chapter also explains the programming environment that has been used for the implementation of the solution, as well as the description of the test datasets along with the performance measures and metrics that have been used for the evaluation of the implemented solutions.

The design and implementation of an innovative distributed algorithm for MC-PSC has been introduced in Chapter 5. The design of this algorithm has been analyzed in terms of space, time, and communication overhead. Based on this analysis, two different load balancing approaches have been used to improve the overall performance: the even and the uneven strategies. The even strategy considers the total number of protein structures/pairs to be compared and distributes them equally among the available processors, while the uneven strategy takes the length of each protein (in terms of number of residues) into account and assigns the protein structures/pairs to the available processors in such a way that each processor gets a different number of protein structures/pairs but a nearly equal number of overall residues. The former permits the best distribution in terms of memory, while the latter performs better in terms of execution time and scalability on cluster/grid computers. Experiments conducted on medium and large real datasets show that the algorithm allows the execution time to be reduced (e.g., for the RS119 dataset it was reduced from 6 days on a single processor to about 5 hours on 64 processors) and allows problems otherwise not tractable on a single machine to be tackled, such as the Kinjo dataset, which took about 11 days on a 64-processor cluster. In the future, we intend to investigate in more depth the use of grid computing environments in order to cope with very large proteomics datasets.

This chapter was published as a peer reviewed journal paper in IEEE Transactions on

NanoBioscience, Vol. 9(2), pp.144-155, 2010. [doi:10.1109/TNB.2010.2043851]

Chapter 6 studied the effect of different integrated environments on the performance of the newly developed algorithm. The aim of this study was to provide a scientific basis for decision making while building a large scale e-Science infrastructure that meets the requirements of today's high-impact scientific applications. The study makes use of the Sun Grid Engine (SGE) as a distributed local resource management system and integrates support for parallel environments using different MPI implementations. The integration has been achieved in two ways, i.e., Loose Integration and Tight Integration, along with MPI's different methods for starting the jobs, i.e., SMPD daemon-based and SMPD daemon-less. The results of the evaluation indicate that the Loose Integration method is not very reliable in terms of the accounting and monitoring information to be used for PSC jobs. Furthermore, for larger datasets, Tight Integration with the SMPD daemon-based method outperforms its counterpart, the SMPD daemon-less method. It has also been learned that in a heterogeneous cluster where some nodes have double the performance of others, the slowest node can become a bottleneck for the overall performance of the system.

This chapter was published as a peer reviewed conference paper in the Proceedings of the IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA '08), ISBN: 978-0-7695-3471-8, pp.817-822, 2008. [doi:10.1109/ISPA.2008.41]

The scalability of the high-throughput distributed framework for MC-PSC in the grid environment was presented in chapter 7. Although the grand challenge nature of the MC-PSC problem requires the use of Grid computing to overcome the limitations of a single parallel computer/cluster, the use of Grid computing also introduces additional communication overhead, and hence the standard parallel computing measures and metrics such as speedup and efficiency need to be redefined [225]. This chapter introduced grid-based definitions of speedup (Grid Speedup) and efficiency (Grid Efficiency) by taking ideas from Hoekstra et al. [225]. These Grid metrics were then used to measure the scalability of our distributed algorithm on the UK National Grid Service (NGS) [66] architecture using the latest implementation of grid-enabled MPI, i.e., MPIg [276]. The results of cross-site scalability were compared with single-site and single-machine performance to analyze the additional communication overhead in a wide-area network. It has been observed that the algorithm performs well on the combined resources of the two sites.

This chapter was published as a peer reviewed conference paper in the Proceedings of the Euro-Par 2010 Workshop on High Performance Bioinformatics and Biomedicine (HiBB), August 31-September 3, 2010, Ischia, Naples, Italy.

Chapter 8 performed a study to identify an appropriate technology for the storage, management and analysis of the (multi) similarity data resulting from the comparison of very large datasets. For this purpose, different experiments were performed to evaluate the performance/efficiency of HDF5 and Oracle. In particular, the storage/query overhead of HDF5 was compared with that of Oracle/SQL. The results showed a significant performance benefit of HDF5 over Oracle/SQL. This initial benchmarking provides a road-map for computing the (multi) similarities of all available PDB structures, thereby creating an efficient scientific knowledge base of pre-computed results. This knowledge base could be used to provide an easy-to-use interface, so that biologists can perform scientific queries using a wide variety of criteria and options, leading to a far better and more reliable understanding of the protein structural universe.

This chapter was published as a peer reviewed conference paper in the Proceedings of the 22nd IEEE International Symposium on Computer-Based Medical Systems (CBMS-09), ISBN: 978-1-4244-4879-1, pp.1-8, 2009. [doi:10.1109/CBMS.2009.5255328]

Finally, chapter 9 discusses the concept of consensus-based protein structure similarity clustering. It uses the Protein Kinase Dataset to compare the results of two competing paradigms in the field of systematics, i.e., total evidence and total consensus. The results of this work indicate that, for the case of protein structure similarity clustering, the total evidence approach gives better results than total consensus. Hence, this finding strengthens the current protocol being used in ProCKSI.

Needless to say, in carrying out all this work there were tremendous problems involved in the configuration, installation and operation of the grid computing related hardware and software, which consumed a great deal of time to fix. Some of these problems are reported in Appendix .2. Nevertheless, this thesis resulted in 5 peer reviewed conference papers (which were presented in countries as far away as Canada, Australia and the USA), 2 peer reviewed journal papers, 2340 lines of code and several gigabytes of data.

10.3 Future Directions

The work presented in this dissertation sets the foundation for real-time multi-criteria protein structure comparison and analysis. The work opens many directions for future research, including:

• The developed software could be deployed on high-end resources such as the UK's High End Computing Terascale Resource (HECToR) in order to compute the all-against-all similarities for all the structures available in the PDB. However, access to HECToR is not free and requires the applicant to prepare a full research grant proposal for submission to EPSRC or BBSRC. Furthermore, in order to make such a deployment, the software developed so far could be upgraded by adding a fault-tolerance mechanism in the form of checkpoint/restart. Checkpoint/restart support could be added without changing the code of the application by using libraries such as Berkeley Lab Checkpoint/Restart (BLCR) [156]; a small application-level sketch of the underlying resume idea is given after this list.

• Another important direction would be to investigate the proper visualization of the clusters resulting from the multi-verse of similarities. This large-scale clustering and visualization could be achieved by making use of the newly introduced open-source memory-constrained UPGMA (MC-UPGMA) algorithm and its related visualization interface, which is currently used for single similarities in the domain of sequence comparison/alignment.

• It would be interesting to exhaustively and systematically explore the utilization of fuzzy decision making for MC-PSC. In [183–185] the authors took first steps in that direction, but more work remains to be done.
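BLCR operates at the system level and, as noted above, requires no changes to the application itself. For completeness, the Python sketch below illustrates the complementary application-level idea behind checkpoint/restart: completed pairwise comparisons are periodically saved so that an interrupted all-against-all run can resume where it stopped. The file name, the pickle-based format and the stand-in comparison function are assumptions for illustration only.

    # Minimal application-level checkpointing sketch (illustration only).
    import os
    import pickle
    from itertools import combinations

    CHECKPOINT = "mcpsc_checkpoint.pkl"   # assumed checkpoint file name

    def load_done():
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT, "rb") as f:
                return pickle.load(f)
        return {}

    def save_done(done):
        with open(CHECKPOINT, "wb") as f:
            pickle.dump(done, f)

    def compare(a, b):
        return (len(a) * 31 + len(b)) % 100 / 100.0   # stand-in for a real PSC method

    def all_against_all(structures, checkpoint_every=1000):
        done = load_done()                     # resume from a previous run, if any
        for i, pair in enumerate(combinations(structures, 2)):
            if pair in done:
                continue                       # already computed before the interruption
            done[pair] = compare(*pair)
            if i % checkpoint_every == 0:
                save_done(done)
        save_done(done)
        return done

    results = all_against_all(["structure_%04d" % i for i in range(50)])
    print(len(results), "pairwise comparisons completed")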

References

[1] G. J. Mulder, “On the composition of some animal substances,” Journal für praktische Chemie,

vol. 16, pp. 129–, 1839.

[2] F. Sanger, E. Thompson, and R. Kitai, “ The amide groups of insulin ,” Biochem J., vol. 59,

no. 3, pp. 509–518, 1955.

[3] J. Kendrew, G. Bodo, H. Dintzis, R. Parrish, H. Wyckoff, and D. Phillips, “ A three-

dimensional model of the myoglobin molecule obtained by x-ray analysis ,” Nature, vol.

181, no. 4610, pp. 662–666, 1958.

[4] H. Muirhead and M. Perutz, “ Structure of hemoglobin. A three-dimensional fourier synthesis

of reduced human hemoglobin at 5.5 Angstrom resolution ,” Nature, vol. 199, no. 4894, pp.

633–638, 1963.

[5] C. B. Anfinsen, R. R. Redfield, W. L. Choate, J. Page, and W. R. Carroll, “Studies on the gross

structure, cross-linkages, and terminal sequences in ribonuclease.” The Journal of biological

chemistry, vol. 207, no. 1, pp. 201–210, March 1954.

[6] C. B. Anfinsen, E. Haber, M. Sela, and F. H. White, “The kinetics of formation of native

ribonuclease during oxidation of the reduced polypeptide chain.” Proceedings of the National

Academy of Sciences of the United States of America, vol. 47, pp. 1309–1314, September

1961.

[7] C. B. Anfinsen, “Principles that govern the folding of protein chains.” Science, vol. 181,

no. 96, pp. 223–230, July 1973.

[8] J. C. Wooley and Y. Ye, A Historical Perspective and Overview of Protein Structure Predic-

tion. Springer New York, 2007, pp. 1–43.

[9] B. Rost, “Twilight zone of protein sequence alignments,” Protein Eng, vol. 12, no. 2, pp.

85–94, February 1999. [Online]. Available: http://dx.doi.org/10.1093/protein/12.2.85

[10] J. M. Berg, J. L. Tymoczko, and L. Stryer, Protein Structure and Function. San Francisco:

W.H. Freeman, 2002.

[11] F. Sanger, G. Air, B. Barrell, N. Brown, A. Coulson, C. Fiddes, C. Hutchison, P. Slocombe,

and M. Smith, “Nucleotide sequence of bacteriophage phi X174 DNA,” Nature, vol. 265, no.

5596, pp. 687–695, 1977.

[12] E. S. Lander, L. M. Linton, B. Birren, and et al., “Initial sequencing and analysis of the

human genome ,” Nature, vol. 409, pp. 860–921, 2001.

[13] A. Williamson, “Creating a structural genomics consortium,” Nat Struct Biol, vol. 7, 2000.

[14] B. Matthews, “Protein structure initiative: getting into gear,” Nat Struct Mol Biol, vol. 14, pp.

459–60, 2007.

[15] Editorial, “Proteomics’ new order,” Nature, vol. 437, pp. 169–70, 2005.

[16] The-UniProt-Consortium, “The Universal Protein Resource (UniProt),” Nucl. Acids Res.,

vol. 36, no. 1, pp. D190–195, 2008. [Online]. Available: http://nar.oxfordjournals.org/cgi/

content/abstract/36/suppl_1/D190

[17] E. Jain, A. Bairoch, S. Duvaud, I. Phan, N. Redaschi, B. Suzek, M. Martin, P. McGarvey,

and E. Gasteiger, “Infrastructure for the life sciences: design and implementation of the

uniprot website,” BMC Bioinformatics, vol. 10, no. 1, p. 136, 2009. [Online]. Available:

http://www.biomedcentral.com/1471-2105/10/136

[18] H. Berman, K. Henrick, and H. Nakamura, “Announcing the worldwide protein data bank,”

Nat Struct Biol, vol. 10, p. 980, 2003.

[19] T.-S. Mayuko, T. Daisuke, C. Chieko, T. Hirokazu, and U. Hideaki, “Protein structure pre-

diction in structure based drug design,” Current Medicinal Chemistry, vol. 11, pp. 551–558,

2004.

[20] A. Zemla, C. Venclovas, J. Moult, and K. Fidelis, “Processing and analysis of casp3 protein

structure predictions,” Proteins Struct Funct Genet, vol. 3, pp. 22–29, 1999.

[21] D. Kihara, Y. Zhang, H. Lu, A. Kolinski, and J. Skolnick, “Ab initio protein structure predic-

tion on a genomic scale: Application to the mycoplasma genitalium genome,” PNAS, vol. 99,

pp. 5993–5998, 2002.

[22] Y. Zhang and J. Skolnick, “Automated structure prediction of weakly homologous proteins

on a genomic scale,” PNAS, vol. 101, pp. 7594–7599, 2004.

[23] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “Scop: a structural classification

of proteins database for the investigation of sequences and structures,” J Mol Biol, vol. 247,

pp. 536–540, 1995.

[24] C. A. Orengo, A. D. Michie, S. Jones, D. T. Jones, M. B. Swindells, and J. M. Thornton,

“Cath - a hierarchic classification of protein domain structures,” Structure, vol. 5, pp. 1093–

1108, 1997.

[25] S.-Y. Huang and X. Zou, “Efficient molecular docking of nmr structures: Application to hiv-1

protease,” Protein Sci., vol. 16, pp. 43–51, 2007.

[26] R. Kolodny, P. Koehl, and M. Levitt, “Comprehensive evaluation of protein structure align-

ment methods: Scoring by geometric measures,” J Mol Biol, vol. 346, pp. 1173–1188, 2005.

[27] D. Barthel, J. Hirst, J. Blazewicz, E. K. Burke, and N. Krasnogor, “The ProCKSI server: a

decision support system for protein (structure) comparison, knowledge, similarity and infor-

mation,” BMC Bioinformatics, vol. 8, p. 416, 2007.

[28] O. Trelles, “On the parallelisation of bioinformatics applications,” Briefings in Bioinformat-

ics, vol. 2, pp. 181–194, 2001.

[29] S. F. Altschul, T. L. Madden, and A. A. Schaffer, “Gapped blast and psi-blast: A new gener-

ation of protein db search programs,” Nucleic Acids Res., vol. 25, pp. 3389–3402, 1997.

[30] W. R. Pearson and D. J. Lipman, “Improved tools for biological sequence comparison,” Proc.

Natl Acad. Sci., vol. 85, pp. 2444–2448, 1988.

[31] “The bio-accelerator.” [Online]. Available: http://sgbcd//weizmann.ac.il/

[32] R. K. Singh, W. D. Dettloff, V. L. Chi, D. L. Hoffman, S. G. Tell, C. T. White, S. F. Altschul,

and B. W. Erickson, “Bioscan: A dynamically reconfigurable systolic array for biose-

quence analysis,” in Proc. of CERCS96, National Science Foundation, 1996.

[33] F. Berman, G. Fox, and A. J. G. Hey, Grid Computing: Making the Global Infrastructure a

Reality, T. Hey, Ed. New York, NY, USA: John Wiley & Sons, Inc., 2003.

[34] I. Foster and C. Kesselman, “Globus: A metacomputing infrastructure toolkit,” International

Journal of Supercomputer Applications, vol. 11, pp. 115–128, 1996.

[35] I. Foster, “Globus toolkit version 4: Software for service-oriented systems,” in IFIP Interna-

tional Conference on Network and Parallel Computing, ser. LNCS 3779, 2005, pp. 2–13.

[36] A. Shah, D. Barthel, P. Lukasiak, J. Blazewicz, and N. Krasnogor, “Web and grid tech-

nologies in bioinformatics, computational biology and systems biology: A review,” Current

Bioinformatics, vol. 3, no. 1, pp. 10–31, 2008.

[37] A. Shah, D. Barthel, and N. Krasnogor, “Grid and distributed public computing schemes for

structural proteomics,” in Frontiers of High Performance Computing and Networking ISPA

2007 Workshops, ser. LNCS 4743. Berlin:Springer, 2007, pp. 424–434.

[38] S. Pellicer, G. Chen, K. C. C. Chan, and Y. Pan, “Distributed sequence alignment applications

for the public computing architecture,” IEEE Transactions on NanoBioscience, vol. 7, pp.

35–43, 2008.

[39] W.-L. Chang, “Fast parallel dna-based algorithms for molecular computation: The set-

partition problem,” IEEE Transactions on NanoBioscience, vol. 6, pp. 346–353, 2007.

[40] I. Merelli, G. Morra, and L. Milanesi, “Evaluation of a grid based molecular dynamics ap-

proach for polypeptide simulations,” IEEE Transactions on NanoBioscience, vol. 6, pp. 229–

234, 2007.

[41] M. Mirto, M. Cafaro, S. L. Fiore, D. Tartarini, and G. Aloisio, “Evaluation of a grid based

molecular dynamics approach for polypeptide simulations,” IEEE Transactions on NanoBio-

science, vol. 6, pp. 124–130, 2007.

[42] A. Arbona, S. Benkner, G. Engelbrecht, J. Fingberg, M. Hofmann, K. Kumpf, G. Lonsdale,

and A. Woehrer, “A service-oriented grid infrastructure for biomedical data and compute

services,” IEEE Transactions on NanoBioscience, vol. 6, pp. 136–141, 2007.

[43] M. Cannataro, A. Barla, R. Flor, G. Jurman, S. Merler, S. Paoli, G. Tradigo, P. Veltri, and

C. Furlanello, “A grid environment for high-throughput proteomics,” IEEE Transactions on

NanoBioscience, vol. 6, pp. 117–123, 2007.

[44] A. Boccia, G. Busiello, L. Milanesi, and G. Paolella, “A fast job scheduling system for a

wide range of bioinformatic applications,” IEEE Transactions on NanoBioscience, vol. 6, pp.

149–154, 2007.

[45] L. Holm and J. Park, “Dalilite workbench for protein structure comparison,” Bioinformatics,

vol. 16, pp. 566–567, 2000.

[46] D. A. Pelta, J. R. Gonzalez, and M. V. M., “A simple and fast heuristic for protein structure

comparison,” BMC Bioinformatics, vol. 9, p. 161, 2008.

[47] I. Shindyalov and P. Bourne, “Protein structure alignment by incremental combinatorial ex-

tension (ce) of the optimal path,” Protein Eng, vol. 11, pp. 739–747, 1998.

[48] N. Krasnogor and D. A. Pelta, “Measuring the similarity of protein structures by means of

the universal similarity metric,” Bioinformatics, vol. 20, pp. 1015–1021, 2004.

[49] Y. Zhang and J. Skolnick, “Tm-align: A protein structure alignment algorithm based on tm-

score,” Nucleic Acids Res, vol. 33, pp. 2302–2309, 2005.

[50] J. Zhu and Z. Weng, “Fast: A novel protein structure alignment algorithm,” Proteins Struct

Funct Bioinf, vol. 58, pp. 618–627, 2005.

[51] “iHOP: Information hyperlinked over proteins.” [Online]. Available: "http://www.ihop-net.

org"

[52] “SCOP: Structural classification of proteins.” [Online]. Available: http://scop.mrc-lmb.cam.

ac.uk/scop

[53] F. M. G. Pearl, C. F. Bennett, J. E. Bray, A. P. Harrison, N. Martin, A. Shepherd, I. Sillitoe,

J. Thornton, and C. A. Orengo, “The cath database: an extended protein family resource for

structural and functional genomics,” Nucleic Acids Res, vol. 31, pp. 452–455, 2003.

[54] O. Camoglu, T. Can, and A. Singh, “Integrating multi-attribute similarity networks for robust

representation of the protein space,” Bioinformatics, vol. 22, pp. 1585–1592, 2006.

[55] V. Filkov and S. Skiena, “Heterogeneous data integration with the consensus clustering for-

malism,” in 1st International Workshop on Data Integration in the Life Science (DILS), ser.

LNCS 2994. Berlin:Springer, 2004, pp. 110–123.

[56] C. A. Rohl, C. E. M. Strauss, K. M. S. Misura, and D. Baker, “Protein Structure Predic-

tion Using Rosetta,” in Numerical Computer Methods, Part D, ser. Methods in Enzymology,

L. Brand and M. L. Johnson, Eds. Academic Press, Jan. 2004, vol. 383, pp. 66–93.

[57] S. Wu, J. Skolnick, and Y. Zhang, “Ab initio modeling of small proteins by iterative TASSER

simulations.” BMC Biol, vol. 5, no. 1, p. 17, May 2007.

[58] A. Naseer and L. K. Stergioulas, “Integrating grid and web services: A critical assessment

of methods and implications to resource discovery,” in Proceedings of the 2nd Workshop on

Innovations in Web Infrastructure Edinburgh, UK, 2005.

[59] J. C. Wooley and H. S. Lin, Catalyzing Inquiry at the Interface of Computing and Biology,

1st ed. The National Academies Press, Washington, DC, 2005.

[60] D. Malakoff, “Nih urged to fund centers to merge computing and biology,” Science, vol. 284,

no. 5421, pp. 1742–1743, 1999.

[61] M. Galperin, “The molecular biology database collection: 2006 update,” Nucleic Acids Res,

vol. 34, pp. D3–D5, 2006.

[62] C. Baru, R. Moore, A. Rajasekar, and M. Wan, “The sdsc storage resource broker,” in In

Proceedings of CASCON, 1998.

[63] L. M. Haas, P. M. Schwarz, P. Kodali, E. Kotlar, J. E. Rice, and W. C. Swope, “Discoverylink:

A system for integrated access to life sciences data sources,” IBM Systems Journal, vol. 40,

no. 2, 2001.

[64] S. Date, K. Fujikawa, H. Matsuda, H. Nakamura, and S. Shimojo, “An empirical study of

grid applications in life science - lesson learnt from biogrid project in japan -,” International

Journal of Information Technology ., vol. 11, no. 3, pp. 16–28, 2005.

[65] F. Gagliardi, B. Jones, F. Grey, M.-E. Begn, and M. Heikkurinen, “Building an infrastructure

for scientific Grid computing: status and goals of the EGEE project,” Philosophical Trans-

actions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 363,

no. 1833, pp. 1729–1742, 2005.

[66] A. Richards and G. M. Sinclair, UK National Grid Service. CRC Press, 2009.

[67] M. Podvinec, T. Schwede, L. Cerutti, P. Kunszt, B. Nyffeler, H. Stockinger, S. Maffioletti,

A. Thomas, and C. Turker, “The swissbiogrid project: Objectives, preliminary results and

lessons learned,” in Proceedings of the 2nd IEEE International Conference on e-Science and

Grid Computing, Amsterdam, Netherlands, 2007.

[68] R. D. Stevens, A. J. Robinson, and C. A. Goble, “myGrid: personalised bioinformatics on

the information grid,” Bioinformatics, vol. 19, no. 1, pp. i302–i304, 2003.

[69] C. G. Chris, C. Wroe, and R. Stevens, “myGrid Project: Services, architecture and demon-

strator,” in Proceedings of UK e-Science All Hands Meeting Nottingham, 2003.

[70] M. D. Wilkinson and M. Links, “myGrid: personalised bioinformatics on the information

grid,” Briefings in Bioinformatics, vol. 3, no. 4, pp. 331–341, 2002.

[71] M. Wilkinson, H. Schoof, R. Ernst, and D. Haase, “Biomoby successfully integrates dis-

tributed heterogeneous bioinformatics web services. the planet exemplar case,” Plant Physi-

ology, vol. 138, pp. 5–17, 2005.

[72] A. Kerhornou and R. Guigo, “Biomoby web services to support clustering of co-regulated

genes based on similarity of promoter configurations,” Bioinformatics, vol. 23, no. 14, pp.

1831–1833, 2007.

[73] P. Rice, I. Longden, and A. Bleasby, “Emboss: The european molecular biology open soft-

ware suite,” Trends in Genetics, vol. 16, no. 6, pp. 276–277, 2000.

[74] G. P. Trabado, A Survey on Grid Architectures for Bioinformatics. American Scientific

Publishers, USA, 2006, pp. 109–112.

[75] A. Krishnan, “A survey of life sciences applications on the grid,” New Generation

Computing, vol. 22, no. 2, pp. 111–125, 2004. [Online]. Available: http://dx.doi.org/10.

1007/BF03040950

[76] X. Wang, R. Gorlitsky, and J. S. Almeida, “From xml to rdf: how semantic web technologies

will change the design of ’omic’ standards,” Nature Biotechnology, vol. 23, no. 9, pp.

1099–1103, September 2005. [Online]. Available: http://dx.doi.org/10.1038/nbt1139

[77] S. S. Sahoo, C. Thomas, A. Sheth, W. S. York, and S. Tartir, “Knowledge modeling and

its application in life sciences: a tale of two ontologies,” in WWW ’06: Proceedings of the

15th international conference on World Wide Web. New York, NY, USA: ACM, 2006, pp.

317–326.

[78] I. Horrocks, P. F. Patel-Schneider, and F. V. Harmelen, “From shiq and rdf to owl: The making

of a web ontology language,” Journal of Web Semantics, vol. 1, p. 2003, 2003.

[79] Y.-P. P. Chen and Q. Chen, “Analyzing inconsistency toward enhancing integration of biolog-

ical molecular databases,” in Proceedings of the 4th Asia-Pacific Bioinformatics Conference.

London, England: Imperial College Press, 2006, pp. 197–206.

[80] K. Cheung, A. Smith, K. Yip, C. Baker, and M. Gerstein, Semantic Web Approach to

Database Integration in the Life Sciences. Springer, NY, 2007, pp. 11–30.

[81] J. Zhao, C. Wroe, C. Goble, R. Stevens, D. Quan, and M. Greenwood, Using Semantic

Web Technologies for Representing E-science Provenance, ser. Lecture Notes in Computer

Science. Springer Berlin / Heidelberg, 2004, vol. 3298/2004, pp. 92–106.

[82] D. A. Quan and R. Karger, “How to make a semantic web browser,” in WWW ’04:

Proceedings of the 13th international conference on World Wide Web. New York, NY,

USA: ACM Press, 2004, pp. 255–265. [Online]. Available: http://dx.doi.org/10.1145/

988672.988707

[83] C. Geneontology, “Creating the gene ontology resource: design and implementation.”

Genome Res, vol. 11, no. 8, pp. 1425–1433, August 2001. [Online]. Available: http://dx.doi.

org/10.1101/gr.180801

[84] G. O. Consortium, “The Gene Ontology (GO) project in 2006,” Nucl. Acids Res., vol. 34,

no. 1, pp. 322–326, 2006.

[85] A. Ruttenberg, J. A. Rees, and J. S. Luciano, “Experience using OWL-DL for the exchange

of biological pathway information,” in Workshop on OWL Experiences and Directions, 2005.

[86] A. Ruttenberg, J. Rees, and J. Zucker, “What biopax communicates and how to extend owl

to help it,” in OWL: Experiences and Directions Workshop Series, 2006.

[87] P. L. Whetzel, H. Parkinson, H. C. Causton, L. Fan, J. Fostel, G. Fragoso, L. Game,

M. Heiskanen, N. Morrison, P. Rocca-Serra, S.-A. Sansone, C. Taylor, J. White, and

J. Stoeckert, Christian J., “The MGED Ontology: a resource for semantics-based description

of microarray experiments,” Bioinformatics, vol. 22, no. 7, pp. 866–873, 2006. [Online].

Available: http://bioinformatics.oxfordjournals.org/cgi/content/abstract/22/7/866

[88] M. Navarange, L. Game, D. Fowler, V. Wadekar, H. Banks, N. Cooley, F. Rahman,

J. Hinshelwood, P. Broderick, and H. Causton, “Mimir: a comprehensive solution for

storage, annotation and exchange of microarray data,” BMC Bioinformatics, vol. 6, no. 1, p.

268, 2005. [Online]. Available: http://www.biomedcentral.com/1471-2105/6/268

[89] A. Brazma, P. Hingamp, J. Quackenbush, G. Sherlock, P. Spellman, C. Stoeckert,

J. Aach, W. Ansorge, C. A. Ball, H. C. Causton, T. Gaasterland, P. Glenisson, F. C.

Holstege, I. F. Kim, V. Markowitz, J. C. Matese, H. Parkinson, A. Robinson, U. Sarkans,

S. Schulze-Kremer, J. Stewart, R. Taylor, J. Vilo, and M. Vingron, “Minimum information

about a microarray experiment (miame)-toward standards for microarray data.” Nat Genet,

vol. 29, no. 4, pp. 365–371, December 2001. [Online]. Available: http://dx.doi.org/10.1038/

ng1201-365

[90] A. Brazma, “On the importance of standardisation in life sciences.” Bioinformatics, vol. 17,

no. 2, pp. 113–114, February 2001. [Online]. Available: http://view.ncbi.nlm.nih.gov/

pubmed/11238066

[91] J. S. Luciano, “Pax of mind for pathway researchers.” Drug Discov Today, vol. 10, no. 13, pp.

937–942, July 2005. [Online]. Available: http://dx.doi.org/10.1016/S1359-6446(05)03501-4

[92] G. Joshi-Tope, M. Gillespie, I. Vastrik, P. D’Eustachio, E. Schmidt, B. de Bono, B. Jassal,

G. Gopinath, G. Wu, L. Matthews, S. Lewis, E. Birney, and L. Stein, “Reactome: a

knowledgebase of biological pathways,” Nucl. Acids Res., vol. 33, no. 1, pp. 428–432, 2005.

[Online]. Available: http://nar.oxfordjournals.org/cgi/content/abstract/33/suppl_1/D428

[93] A. Brazma, A. Robinson, G. Cameron, and M. Ashburner, “One-stop shop for microarray

data.” Nature, vol. 403, no. 6771, pp. 699–700, February 2000. [Online]. Available: http://

dx.doi.org/10.1038/35001676

[94] R. Stevens, P. Baker, S. Bechhofer, G. Ng, A. Jacoby, N. W. Paton, C. A. Goble, and

A. Brass, “TAMBIS: Transparent Access to Multiple Bioinformatics Information Sources ,”

Bioinformatics, vol. 16, no. 2, pp. 184–186, 2000. [Online]. Available: http://bioinformatics.

oxfordjournals.org/cgi/content/abstract/16/2/184

[95] G. A. Komatsoulis, D. B. Warzel, F. W. Hartel, K. Shanbhag, R. Chilukuri, G. Fragoso,

S. de Coronado, D. M. Reeves, J. B. Hadfield, C. Ludet, and P. A. Covitz, “cacore version 3:

Implementation of a model driven, service-oriented architecture for semantic interoperabil-

ity,” Journal of Biomedical Informatics, vol. 41, no. 1, pp. 106 – 123, 2008.

[96] M. Altunay, D. Colonnese, and C. Warade, “High throughput web services for life sciences,”

in ITCC ’05: Proceedings of the International Conference on Information Technology: Cod-

ing and Computing (ITCC’05) - Volume I. Washington, DC, USA: IEEE Computer Society,

2005, pp. 329–334.

[97] U. Hahn, J. Wermter, R. Blasczyk, and P. A. Horn, “Text mining: powering the database

revolution,” Nature, vol. 448, no. 7150, p. 130, July 2007. [Online]. Available: http://dx.doi.

org/10.1038/448130b

[98] H. T. Gao, J. H. Hayes, and H. Cai, “Integrating biological research through web services,”

Computer, vol. 38, no. 3, pp. 26–31, 2005.

[99] N. Furmento, J. Hau, W. Lee, S. Newhouse, and J. Darlington, “Implementations of a service-

oriented architecture on top of jini, jxta and ogsi,” in AxGrids 2004, LNCS 3165, 2004, pp.

90–99.

[100] M. Radenkovic and B. Wietrzyk, “Life science grid middleware in a more dynamic environ-

ment,” in On the Move to Meaningful Internet Systems 2005: OTM Workshops, LNCS 3762,

2005, pp. 264–273.

[101] E. Merelli, G. Armano, N. Cannata, F. Corradini, M. d’Inverno, A. Doms, P. Lord,

A. Martin, L. Milanesi, S. Moller, M. Schroeder, and M. Luck, “Agents in bioinformatics,

computational and systems biology,” Brief Bioinform, vol. 8, no. 1, pp. 45–59, 2007.

[Online]. Available: http://bib.oxfordjournals.org/cgi/content/abstract/8/1/45

[102] L. Moreau, S. Miles, C. Goble, M. Greenwood, V. Dialani, M. Addis, N. Alpdemir, R. Caw-

ley, D. D. Roure, J. Ferris, R. Gaizauskas, K. Glover, Chris, C. Greenhalgh, P. Li, X. Liu,

P. Lord, M. Luck, D. Marvin, T. Oinn, N. Paton, S. Pettifer, V. Radenkovic, A. Roberts,

A. Robinson, T. Rodden, M. Senger, N. Sharman, R. Stevens, B. Warboys, A. Wipat, and

C. Wroe, “On the use of agents in a bioinformatics grid,” in Proc. of the 3rd IEEE/ACM

CCGRID-2003 Workshop on Agent Based Cluster and Grid Computing, 2003, pp. 653–661.

[103] K. Bryson, M. Luck, M. Joy, and D. T. Jones, “Agent interaction for bioinformatics data

management,” Applied Artificial Intelligence, vol. 15, pp. 200–1, 2002.

[104] K. Decker, S. Khan, C. Schmidt, and D. Michaud, “Extending a multi-agent system for ge-

nomic annotation,” in Cooperative Information Agents IV. Springer-Verlag, 2001, pp. 106–

117.

[105] I. Foster, “Service-Oriented Science,” Science, vol. 308, no. 5723, pp. 814–817, 2005.

[Online]. Available: http://www.sciencemag.org/cgi/content/abstract/308/5723/814

[106] N. Jacq, C. Blanchet, C. Combet, E. Cornillot, L. Duret, K. Kurata, H. Nakamura, T. Sil-

vestre, and V. Breton, “Grid as a bioinformatic tool,” Parallel Comput., vol. 30, no. 9-10, pp.

1093–1107, 2004.

[107] J. Felsenstein, “Evolutionary trees from dna sequences: a maximum likelihood approach.” J

Mol Evol, vol. 17, no. 6, pp. 368–376, 1981. [Online]. Available: http://view.ncbi.nlm.nih.

gov/pubmed/7288891

[108] C. Blanchet, R. Mollon, D. Thain, and G. Deleage, “Grid deployment of legacy bioinfor-

matics applications with transparent data access,” in GRID ’06: Proceedings of the 7th

IEEE/ACM International Conference on Grid Computing. Washington, DC, USA: IEEE

Computer Society, 2006, pp. 120–127.

[109] D. Thain and M. Livny, “Parrot: Transparent user-level middleware for data-intensive com-

puting,” in In Workshop on Adaptive Grid Middleware, 2003.

[110] A. Bairoch and R. Apweiler, “The SWISS-PROT protein sequence database and its

supplement TrEMBL in 2000,” Nucl. Acids Res., vol. 28, no. 1, pp. 45–48, 2000. [Online].

Available: http://nar.oxfordjournals.org/cgi/content/abstract/28/1/45

[111] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment

search tool.” Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, October 1990.

[Online]. Available: http://dx.doi.org/10.1006/jmbi.1990.9999

[112] T. F. Smith and M. S. Waterman, “Identification of common molecular subsequences.” J Mol

Biol, vol. 147, no. 1, pp. 195–197, March 1981. [Online]. Available: http://view.ncbi.nlm.

nih.gov/pubmed/7265238

[113] J. D. Thompson, D. G. Higgins, and T. J. Gibson, “CLUSTAL W: improving the sensitivity

of progressive multiple sequence alignment through sequence weighting, position-specific

gap penalties and weight matrix choice,” Nucleic Acids Res., vol. 22, pp. 4673–4680, 1994.

[114] W. R. Pearson, “Searching protein sequence libraries: Comparison of the sensitivity and

selectivity of the smith-waterman and fasta algorithms,” Genomics, vol. 11, no. 3, pp. 635–

650, 1991.

[115] C.-T. Yang, Y.-L. Kuo, K.-C. Li, and J.-L. Gaudiot, “On design of cluster and grid comput-

ing environment toolkit for bioinformatics applications,” in Distributed Computing - IWDC

2004, LNCS 3326/2005. Springer Berlin / Heidelberg, 2006, pp. 82–87.

[116] C.-T. Yang, Y.-C. Hsiung, and H.-C. Kan, “Implementation and evaluation of a java based

computational grid for bioinformatics applications,” Advanced Information Networking and

Applications, International Conference on, vol. 1, pp. 298–303, 2005.

[117] J. Kommineni and D. Abramson, “Griddles enhancements and building virtual applications

for the grid with legacy components,” in Advances in Grid Computing - EGC 2005, LNCS

3470/2005. Springer Berlin / Heidelberg, 2005, pp. 961–971.

[118] D. Sulakhe, A. Rodriguez, M. DSouza, M. Wilde, V. Nefedova, I. Foster, and

N. Maltsev, “Gnare: Automated system for high-throughput genome analysis with grid

computational backend,” The Journal of Clinical Monitoring and Computing, vol. 19, pp.

361–369(9), 2005. [Online]. Available: http://www.ingentaconnect.com/content/klu/jocm/

2005/00000019/F0020004/00003463

[119] H. Casanova, F. Berman, T. Bartol, E. Gokcay, T. Sejnowski, A. Birnbaum, J. Dongarra,

M. Miller, M. Ellisman, M. Faerman, G. Obertelli, R. Wolski, S. Pomerantz, and J. Stiles,

“The virtual instrument: Support for grid-enabled mcell simulations,” Int. J. High Perform.

Comput. Appl., vol. 18, no. 1, pp. 3–17, 2004.

[120] P. K. Dhar, T. C. Meng, S. Somani, L. Ye, K. Sakharkar, A. Krishnan, A. B. Ridwan, S. H. K.

Wah, M. Chitre, and Z. Hao, “Grid Cellware: the first grid-enabled tool for modelling

and simulating cellular processes,” Bioinformatics, vol. 21, no. 7, pp. 1284–1287, 2005.

[Online]. Available: http://bioinformatics.oxfordjournals.org/cgi/content/abstract/21/7/1284

[121] H. Nakada, M. Sato, and S. Sekiguchi, “Design and implementations of ninf: towards a

global computing infrastructure,” Future Gener. Comput. Syst., vol. 15, no. 5-6, pp. 649–658,

1999.

[122] M. Dahan and J. R. Boisseau, “The gridport toolkit: A system for building grid portals,” in

HPDC ’01: Proceedings of the 10th IEEE International Symposium on High Performance

Distributed Computing. Washington, DC, USA: IEEE Computer Society, 2001, p. 216.

[123] J. Novotny, “The grid portal development kit,” in Grid Computing: Making the Global In-

frastructure a Reality, 2001.

[124] Z. Chongjie, “Grid portal toolkits,” in Portals Portlets and GridSphere Workshop, 2005.

[125] L. Ramakrishnan, S. Reed, J. L. Tilson, and D. A. Reed, “Grid portals for bioinformatics,” in

Second International Workshop on Grid Computing Environments, 2006.

[126] J. Alameda, M. Christie, G. Fox, J. Futrelle, D. Gannon, M. Hategan, G. Kandaswamy, G. von

Laszewski, M. A. Nacar, M. Pierce, E. Roberts, C. Severance, and M. Thomas, “The open

grid computing environments collaboration: portlets and services for science gateways: Re-

search articles,” Concurr. Comput. : Pract. Exper., vol. 19, no. 6, pp. 921–942, 2007.

[127] C. Letondal, “A Web interface generator for molecular biology programs in Unix ,”

Bioinformatics, vol. 17, no. 1, pp. 73–82, 2001. [Online]. Available: http://bioinformatics.

oxfordjournals.org/cgi/content/abstract/17/1/73

[128] E. Lu, Z. Xu, and J. Sun, “An extendable grid simulation environment based on gridsim,” in

Grid and Cooperative Computing, LNCS 3032/2004. Springer Berlin / Heidelberg, 2004,

pp. 205–208.

[129] R. Buyya, M. Murshed, and D. Abramson, “Gridsim: A toolkit for the modeling and simu-

lation of distributed resource management and scheduling for grid computing,” in Journal of

Concurrency and Computation: Practice and Experience (CCPE). Wiley Press, 2002, pp.

1175–1220.

[130] H. Lamehamedi, Z. Shentu, B. Szymanski, and E. Deelman, “Simulation of dynamic data

replication strategies in data grids,” in In Proc. 12th Heterogeneous Computing Workshop

(HCW2003). IEEE CS Press, 2003.

[131] Y.-M. Teo, X. Wang, and Y.-K. Ng, “GLAD: a system for developing and deploying

large-scale bioinformatics grid,” Bioinformatics, vol. 21, no. 6, pp. 794–802, 2005. [Online].

Available: http://bioinformatics.oxfordjournals.org/cgi/content/abstract/21/6/794

[132] R. Nobrega, J. Barbosa, and A. Monteiro, “Biogrid application toolkit: a grid-based prob-

lem solving environment tool for biomedical data analysis,” in VECPAR-7th International

Meeting on High Performance Computing for Computational Science, 2006.

[133] M. Cannataro, C. Comito, F. Schiavo, and P. Veltri, “PROTEUS: a Grid Based Problem

Solving Environment for Bioinformatics,” 2004. [Online]. Available: citeseer.ist.psu.edu/

660273.html

[134] C.-H. Sun, B.-J. Kim, G.-S. Yi, and H. Park, “A model of problem solving environment

for integrated bioinformatics solution on grid by using condor,” in Grid and Cooperative

Computing – GCC 2004, LNCS 3251/2004. Springer Berlin / Heidelberg, 2004, pp.

935–938.

[135] F. Tang, C. L. Chua, L.-Y. Ho, Y. P. Lim, P. Issac, and A. Krishnan, “Wildfire: distributed,

grid-enabled workflow construction and execution,” BMC Bioinformatics, vol. 6, no. 1,

p. 69, 2005. [Online]. Available: http://www.biomedcentral.com/1471-2105/6/69

[136] D. Hull, K. Wolstencroft, R. Stevens, C. Goble, M. R. Pocock, P. Li, and T. Oinn, “Taverna:

a tool for building and running workflows of services,” Nucl. Acids Res., vol. 34, no. 2,

pp. 729–732, 2006. [Online]. Available: http://nar.oxfordjournals.org/cgi/content/abstract/

34/suppl_2/W729

[137] G. Aloisio, M. Cafaro, S. Fiore, and M. Mirto, “A workflow management system for bioin-

formatics grid,” in Proceedings of the Network Tools and Applications in Biology (NETTAB)

Workshops, 2005.

[138] H. Dail, H. Casanova, and F. Berman, “A decoupled scheduling approach for the grads

program development environment,” in Supercomputing ’02: Proceedings of the 2002

ACM/IEEE conference on Supercomputing. Los Alamitos, CA, USA: IEEE Computer So-

ciety Press, 2002, pp. 1–14.

[139] T. Goodale, G. Allen, G. Lanfermann, J. Massó, T. Radke, E. Seidel, and J. Shalf, “The

cactus framework and toolkit: Design and applications,” in High Performance Computing for

Computational Science - VECPAR 2002, LNCS 2565/2003. Springer Berlin / Heidelberg,

2003, pp. 15–36.

[140] S. Moskwa, “A Grid-enabled Bioinformatics Applications Framework (GEBAF),” Master’s

thesis, School of Computer Science, The University of Adelaide, Australia, November 2005.

[141] J. E. Stajich, D. Block, K. Boulez, S. E. Brenner, S. A. Chervitz, C. Dagdigian, G. Fuellen,

J. G. Gilbert, I. Korf, H. Lapp, H. Lehvaslaiho, C. Matsalla, C. J. Mungall, B. I. Osborne,

M. R. Pocock, P. Schattner, M. Senger, L. D. Stein, E. Stupka, M. D. Wilkinson, and

E. Birney, “The bioperl toolkit: Perl modules for the life sciences,” Genome Res., vol. 12,

no. 10, pp. 1611–1618, October 2002. [Online]. Available: http://dx.doi.org/10.1101/gr.

361602

[142] D. Abramson, R. Buyya, and J. Giddy, “A computational economy for grid computing and

its implementation in the nimrod-g resource broker,” Future Gener. Comput. Syst., vol. 18,

no. 8, pp. 1061–1074, 2002.

[143] A. YarKhan and J. J. Dongarra, “Biological sequence alignment on the computational grid

using the grads framework,” Future Gener. Comput. Syst., vol. 21, no. 6, pp. 980–986, 2005.

[144] K. Krauter and M. Maheswaran, “Architecture for a grid operating system,” in GRID ’00:

Proceedings of the First IEEE/ACM International Workshop on Grid Computing. London,

UK: Springer-Verlag, 2000, pp. 65–76.

[145] P. Padala and J. N. Wilson, “Gridos: Operating system services for grid architectures,”

in In Proceedings of International Conference On High Performance Computing, LNCS

2913/2003, 2003, pp. 353–362.

[146] E.-N. Huh and Y. Mun, “Performance analysis for real-time grid systems on cots operating

systems,” in Computational Science - ICCS 2003, LNCS 2660/2003. Springer-Verlag, 2003,

pp. 715–.

[147] D. Talia, “Towards grid operating systems: from glinux to a gvm,” in Proc. Workshop on

Network Centric Operating Systems, Brussels, 2005, pp. 16–17.

[148] A. Grimshaw and A. Natrajan, “Legion: Lessons Learned Building a Grid Operating Sys-

tem,” Proceedings of the IEEE, vol. 93, no. 3, pp. 589–603, 2005.

[149] I. E, R. B, and D. D, “Job management systems analysis,” in 6th CARNET Users Conference:

Hungary, 2004.

[150] E.-G. Talbi and A. Y. Zomaya, Eds. Wiley, 2008.

[151] Y. Sun, S. Zhao, H. Yu, G. Gao, and J. Luo, “ABCGrid: Application for Bioinformatics

Computing Grid,” Bioinformatics, vol. 23, no. 9, pp. 1175–1177, 2007.

[152] J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large clusters,” Com-

mun. ACM, vol. 51, pp. 107–113, January 2008.

[153] H. Stockinger, M. Pagni, L. Cerutti, and L. Falquet, “Grid approach to embarrassingly par-

allel cpu-intensive bioinformatics problems,” in e-Science and Grid Computing, 2006. e-

Science ’06. Second IEEE International Conference on, 2006, pp. 58 –58.

[154] G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, T. Herault,

P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri, and A. Selikhov, “Mpich-v: Toward

a scalable fault tolerant mpi for volatile nodes,” in Supercomputing, ACM/IEEE 2002 Con-

ference, 2002, p. 29.

[155] R. Batchu, Y. S. Dandass, A. Skjellum, and M. Beddhu, “Mpi/ft: A model-based approach

to low-overhead fault tolerant message-passing middleware,” Cluster Computing, vol. 7, pp.

303–315, 2004, 10.1023/B:CLUS.0000039491.64560.8a. [Online]. Available: http://dx.doi.

org/10.1023/B:CLUS.0000039491.64560.8a

[156] S. Sankaran, J. M. Squyres, B. Barrett, A. Lumsdaine, J. Duell, P. Hargrove, and E. Roman, “The lam/mpi

checkpoint/restart framework: System-initiated checkpointing,” in LACSI Symposium, 2003.

[157] A. J. Chakravarti, The Organic Grid: Self-Organizing Computational Biology on Desktop

Grids. John Wiley & Sons, Inc., 2006.

[158] M. Tsiknakis, M. Brochhausen, J. Nabrzyski, J. Pucacki, S. Sfakianakis, G. Potamias,

C. Desmedt, and D. Kafetzopoulos, “A Semantic Grid Infrastructure Enabling Integrated

Access and Analysis of Multilevel Biomedical Data in Support of Postgenomic Clinical Tri-

als on Cancer,” Information Technology in Biomedicine, IEEE Transactions on, vol. 12, no. 2,

pp. 205–217, 2008.

[159] J. Pukacki, M. Kosiedowski, R. Mikolajczak, M. Adamski, P. Grabowski, M. Jankowski,

M. Kupczyk, C. Mazurek, N. Meyer, J. Nabrzyski, T. Piontek, M. Russell, M. Stroinski, and

M. Wolski, “Programming Grid Applications with Gridge,” Computational Methods in Life

Sciences and Technology, vol. 12, no. 1, pp. 10–20, 2006.

[160] I. Silman, Structural Proteomics and Its Impact on the Life Sciences, 1st ed. World Scientific,

2008.

[161] V. Pande, I. Baker, J. Chapman, S. Elmer, S. Khaliq, S. Larson, Y. Rhee, M. Shirts, C. Snow,

E. Sorin, and B. Zagrovic, “Atomistic protein folding simulations on the submillisecond time

scale using worldwide distributed computing.” Biopolymers, vol. 68, no. 1, pp. 91–109, 2003.

[162] M. Taufer, C. An, A. Kerstens, and C. L. Brooks III, “Predictor@home: A ‘protein structure prediction supercomputer’ based on global computing,” IEEE Trans. Parallel Distrib. Syst., vol. 17, no. 8, pp. 786–796, 2006.

[163] C. Masters and K. Beyreuther, “Spongiform encephalopathies: tracking turncoat prion pro-

teins,” Nature, vol. 388, pp. 228–229, 1997.

[164] R. Carrell and B. Gooptu, “Conformational changes and disease - serpins, prions and

alzheimer’s,” Current Opinion in Structural Biology, vol. 8, pp. 799–809(11), December

1998.

[165] S. Radford and C. Dobson, “From computer simulations to human disease: emerging themes

in protein folding,” Cell, vol. 97, pp. 291–298, 1999.

[166] S. Murdock, K. T. K, and M. Ng, “Biosimgrid: A distributed environment for archiving and

the analysis of biomolecular simulations,” Abstracts of Papers of the American Chemical

Society, vol. 230, pp. U1309–U1310, 2005.

[167] A. Natrajan, M. Crowley, N. Wilkins-Diehr, M. A. Humphrey, A. D. Fox, A. S. Grimshaw,

and C. L. Brooks, III, “Studying protein folding on the grid: experiences using CHARMM on

NPACI resources under legion: Research articles,” Concurr. Comput. : Pract. Exper., vol. 16,

no. 4, pp. 385–397, 2004.

[168] B. Uk, M. Taufer, T. Stricker, G. Settanni, and A. Cavalli, “Implementation and characteri-

zation of protein folding on a desktop computational grid – is charmm a suitable candidate

for the united devices metaprocessor?” in IPDPS ’03: Proceedings of the 17th International

Symposium on Parallel and Distributed Processing. Washington, DC, USA: IEEE Computer

Society, 2003, p. 50.2.

[169] Y. Zhang and J. Skolnick, “The protein structure prediction problem could be solved using

the current PDB library,” Proceedings of the National Academy of Sciences of the United

States of America, vol. 102, no. 4, pp. 1029–1034, 2005.

[170] C. George, F. Yoshimi, and T. Shoji, “ A reversible fragment assembly method for de novo

protein structure prediction ,” The Journal of chemical physics, vol. 119, no. 13, pp. 6895–

6903, 2003.

[171] A. Birnbaum, J. Hayes, W. W. Li, M. A. Miller, P. W. Arzberger, P. E. Bourne, and

H. Casanova, “Grid workflow software for high-throughput proteome annotation pipeline,”

in Grid Computing in Life Science, LNCS 3370/2005. Springer Berlin / Heidelberg, 2005,

pp. 68–81.

[172] J. Herrera, E. Huedo, R. Montero, and I. Llorente, “Benchmarking of a joint irisgrid/egee

testbed with a bioinformatics application,” Scalable Computing: Practice and Experience,

vol. 7, no. 2, pp. 25–32, 2006.

[173] E. Huedo, R. S. Montero, and I. M. Llorente, “Adaptive grid scheduling of a high-

throughput bioinformatics application,” in Parallel Processing and Applied Mathematics,

LNCS 3019/2004. Springer Berlin / Heidelberg, 2004, pp. 840–847.

[174] Y. Tanimura, K. Aoi, T. Hiroyasu, M. Miki, Y. Okamoto, and J. Dongarra, “Implementation

of protein tertiary structure prediction system with netsolve,” in HPCASIA ’04: Proceedings

of the High Performance Computing and Grid in Asia Pacific Region, Seventh International

Conference. Washington, DC, USA: IEEE Computer Society, 2004, pp. 320–327.

[175] R. Kolodny and N. Linial, “Approximate protein structural alignment in polynomial time,”

Protein Sci., vol. 101, pp. 12 201–12 206, 2004.

[176] C. Ferrari, C. Guerra, and G. Zanottib, “A grid-aware approach to protein structure compari-

son,” J. Parallel Distrib. Comput., vol. 63, pp. 728–737, 2003.

[177] S.-J. Park and M. Yamamura, “Frog (fitted rotation and orientation of protein structure by

means of real-coded genetic algorithm) : Asynchronous parallelizing for protein structure-

based comparison on the basis of geometrical similarity,” Genome Informatics, vol. 13, pp.

344–345, 2002.

[178] S. Park and M. Yamamura, “Two-layer protein structure comparison,” in ICTAI ’03: Pro-

ceedings of the 15th IEEE International Conference on Tools with Artificial Intelligence.

Washington, DC, USA: IEEE Computer Society, 2003, p. 435.

[179] M. Cannataro, M. Comin, C. Ferrari, C. Guerra, A. Guzzo, and P. Veltri, “Modeling a protein

structure comparison application on the grid using proteus,” in Scientific Applications of Grid

computing (SAG2004), ser. LNCS 3458. Berlin:Springer, 2004, pp. 75–85.

[180] R. R. Sokal and C. D. Michener, “A statistical method for evaluating systematic relation-

ships,” Univ Kansas Sci Bull, vol. 38, pp. 1409–1438, 1958.

[181] J. Ward, Joe H., “Hierarchical grouping to optimize an objective function,” Journal of the

American Statistical Association, vol. 58, no. 301, pp. 236–244, 1963. [Online]. Available:

http://www.jstor.org/stable/2282967

[182] A. Kocsor, A. Kertesz-Farkas, L. Kajan, and S. Pongor, “Application of compression-based

distance measures to protein sequence classification: a methodological study,” Bioinformat-

ics, vol. 22, pp. 407–412, 2006.

[183] D. Pelta, N. Krasnogor, C. Bousono-Calzon, J. Verdegay, J. Hirst, and E. Burke, “A fuzzy

sets based generalization of contact maps for the overlap of protein structures,” Journal of

Fuzzy Sets and Systems, vol. 152, no. 2, pp. 103–123, 2005. [Online]. Available: http://dx.

doi.org/10.1016/j.fss.2004.10.017

[184] C. Bousono-Calzon, A. Oropesa-Garcia, and N. Krasnogor, “Selecting protein fuzzy contact

maps through information and structure measures,” in Fourth Conference of the European

Society for Fuzzy Logic and Technology and 11 Recontres Francophones sur la Logique

Floue et ses Applications. Barcelona, Spain: EUSFLAT-LFA, September 2005, pp.

1106–1111. [Online]. Available: http://www.cs.nott.ac.uk/~nxk/PAPERS/s112-01.pdf

[185] D. Pelta, J. Gonzales, and N. Krasnogor, “Protein structure comparison through fuzzy

contact maps and the universal similarity metric,” in Fourth Conference of the European

Society for Fuzzy Logic and Technology and 11 Recontres Francophones sur la Logique

Floue et ses Applications. Barcelona, Spain: EUSFLAT-LFA, September 2005, pp.

1124–1129. [Online]. Available: http://www.cs.nott.ac.uk/~nxk/PAPERS/s112-04.pdf

[186] L. Silva and R. Buyya, Parallel programming models and paradigms. NJ, USA: Prentice

Hall PTR, 1999, pp. 27–42.

[187] I. Foster, “Parallel computers and computation,” in Designing and Building Parallel Pro-

grams: Concepts and Tools for Parallel Software Engineering, 1995.

[188] W. D. Gropp and E. Lusk, User’s Guide for mpich, a Portable Implementation of MPI, Math-

ematics and Computer Science Division, Argonne National Laboratory, 1996, aNL-96/6.

[189] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, “A high-performance, portable implementation

of the MPI message passing interface standard,” Parallel Computing, vol. 22, no. 6, pp. 789–

828, 1996.

[190] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay,

P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S.

Woodall, “Open MPI: Goals, concept, and design of a next generation MPI implementation,”

in Proceedings, 11th European PVM/MPI Users’ Group Meeting, Budapest, Hungary, 2004,

pp. 97–104.

[191] N. T. Karonis, B. Toonen, and I. Foster, “Mpich-g2: a grid-enabled implementation of the

message passing interface,” Journal of Parallel and Distributed Computing, vol. 63, no. 5,

pp. 551–563, 2003.

[192] S. Manos, M. Mazzeo, O. Kenway, P. V. Coveney, N. T. Karonis, and B. Toonen, “Distributed

mpi cross-site run performance using mpig,” in HPDC ’08: Proceedings of the 17th inter-

national symposium on High performance distributed computing. New York, NY, USA:

ACM, 2008, pp. 229–230.

[193] K. M. Chandy and C. Kesselman, “Cc++: A declarative concurrent ob-

ject oriented programming notation,” Caltech Computer Science Technical Reports

[http://caltechcstr.library.caltech.edu/perl/oai2] (United States), Tech. Rep., 1993. [Online].

Available: http://resolver.caltech.edu/CaltechCSTR:1993.cs-tr-92-01

[194] F. Bodin, P. Beckman, D. Gannon, S. Narayana, and S. X. Yang, “Distributed pc++ basic

ideas for an object parallel language,” Scientific Programming, vol. 2, no. 3, pp. 7–22, 1993.

[195] R. Chandra, A. Gupta, and J. Hennessy, “Cool: An object-based language for parallel pro-

gramming,” Computer, vol. 27, no. 8, pp. 14–26, 1994.

[196] A. S. Grimshaw, “An introduction to parallel object-oriented programming with mentat,”

University of Virginia, Charlottesville, VA, USA, Tech. Rep., 1991.

[197] E. Johnson, D. Gannon, and P. Beckman, “Hpc++: Experiments with the parallel standard

template library,” in In Proceedings of the 11th International Conference on Supercomputing

(ICS-97. ACM Press, 1997, pp. 124–131.

[198] I. T. Foster and K. M. Chandy, “Fortran m: A language for modular parallel programming,”

Journal of Parallel and Distributed Computing, vol. 26, 1992.

[199] I. Foster, R. Olson, and S. Tuecke, “Programming in fortran m,” Technical Report ANL-

93/26, Mathematics and Computer Science Division, Argonne National Laboratory, Ar-

gonne, Ill., USA, Tech. Rep., 1993.

[200] D. C. DiNucci, Alliant FX/8. Boston, MA, USA: Addison-Wesley Longman Publishing

Co., Inc., 1987, pp. 27–42.

[201] C. Koelbel, D. Loveman, R. Schreiber, G. Steele, and M. Zosel, The High Performance

Fortran Handbook. MIT Press, 1994.

[202] G. W. Sabot, “Optimized cm fortran compiler for the connection machine computer,” in

In Proceedings of Hawaii International Conference on System Sciences. IEEE Computer

Society, 1992, pp. 161–172.

[203] C. Koelbel, U. Kremer, C.-W. Tseng, M.-Y. Wu, G. Fox, S. Hiranandani, and K. Kennedy, “Fortran d language specification,” Technical Re-

port TR90-141, Dept. of Computer Science, Rice University, Tech. Rep., 1990.

[204] M. P. I. Forum, “MPI: A message passing interface,” in In Proc. Supercomputing ’93. IEEE

Computer Society, 1993, pp. 878–883.

[205] R. Butler and E. Lusk, “Monitors, message, and clusters: The p4 parallel programming sys-

tem,” Parallel Computing, vol. 20, pp. 547–564, 1994.

[206] A. Grant and R. Dickens, “An implementation of a portable instrumented communication

library using cs tools,” Future Gener. Comput. Syst., vol. 9, no. 1, pp. 53–61, 1993.

[207] V. S. Sunderam, “PVM: A framework for parallel distributed computing,” Concurrency:

Practice and Experience, vol. 2, pp. 315–339, 1990.

[208] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the

Message-Passing Interface. MIT Press, 1999.

[209] W. Gropp, E. Lusk, and R. Thakur, Using MPI-2: Advanced Features of the Message-Passing

Interface. MIT Press, 1999.

[210] A. Hori, “SCore: An integrated cluster system software package for high performance cluster

computing,” Cluster Computing, IEEE International Conference on, vol. 0, p. 449, 2001.

[211] D. Gregor and A. Lumsdaine, “Design and implementation of a high-performance mpi for

c# and the common language infrastructure,” in PPoPP ’08: Proceedings of the 13th ACM

SIGPLAN Symposium on Principles and practice of parallel programming. New York, NY,

USA: ACM, 2008, pp. 133–142.

[212] “Platform mpi, http://www.platform.com/products/platform-mpi.” [Online]. Available:

http://www.platform.com/Products/platform-mpi

[213] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay,

P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S.

Woodall, “Open MPI: Goals, concept, and design of a next generation MPI implementa-

tion,” in Proceedings, 11th European PVM/MPI Users’ Group Meeting, Budapest, Hungary,

September 2004, pp. 97–104.

[214] “MacMPI, http://exodus.physics.ucla.edu/appleseed/dev/developer.html.” [Online]. Avail-

able: http://exodus.physics.ucla.edu/appleseed/dev/developer.html

[215] M. Baker, B. Carpenter, and A. Shafi, “MPJ Express: Towards thread safe java hpc,” in IEEE

International Conference on Cluster Computing (Cluster 2006), 2006.

[216] M. Stout, J. Bacardit, J. Hirst, R. Smith, and N. Krasnogor, “Prediction of topological

contacts in proteins using learning classifier systems,” Journal Soft Computing - A Fusion of

Foundations, Methodologies and Applications, vol. 13, no. 3, pp. 245–258, 2008, (for the

latest version of this paper please refer to the journal website). [Online]. Available: http://

www.springerlink.com/content/f437410653r73vj8/fulltext.pdf

[217] M. Stout, J. Bacardit, J. Hirst, and N. Krasnogor, “Prediction of recursive convex hull class

assignment for protein residues,” Bioinformatics, vol. 24(7), pp. 916–923, 2008. [Online].

Available: http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btn050v1?ct=ct

[218] J. Bacardit, M. Stout, J. Hirst, A. Valencia, R. Smith, and N. Krasnogor, “Automated

alphabet reduction for protein datasets,” BMC Bioinformatics, vol. 10(6), 2009, (for the

latest version of this paper please refer to the journal website). [Online]. Available: http://

www.biomedcentral.com/1471-2105/10/6

[219] N. Krasnogor and S. Gustafson, “A study on the use of “self-generation” in memetic

algorithms,” Natural Computing, vol. 3, no. 1, pp. 53 – 76, 2004. [Online]. Available: http://

www.cs.nott.ac.uk/~nxk/PAPERS/KrasnogorGustafson.pdf

[220] N. Krasnogor, “Self-generating metaheuristics in bioinformatics: The protein structure com-

parison case,” Genetic Programming and Evolvable Machines, vol. 5, pp. 181–201, 2004.

[221] L. P. Chew and K. Kedem, “Finding the consensus shape for a protein family,” in Proceedings

of the 18th Annual Symposium on Computational Geometry (SCG). New York: Springer,

2002, pp. 64–73.

[222] B. Rost and C. Sander, “Prediction of protein secondary structure at better than 70% accu-

racy,” J Mol Biol, vol. 232, pp. 584–599, 1993.

[223] A. R. Kinjo, K. Horimoto, and K. Nishikawa, “Predicting absolute contact numbers of native

protein structure from amino acid sequence,” Proteins Struct Funct Bioinf, vol. 58, pp. 158–

165, 2005.

[224] U. Hobohm and C. Sander, “Enlarged representative set of protein,” Protein Science, vol. 3,

pp. 522–524, 1994.

[225] A. G. Hoekstra and P. M. A. Sloot, “Introducing grid speedup g: A scalability metric for

parallel applications on the grid,” in EGC, 2005, pp. 245–254.

[226] A. W. van Halderen, B. J. Overeinder, P. M. A. Sloot, R. van Dantzig, D. H. J.

Epema, and M. Livny, “Hierarchical resource management in the polder metacomputing

initiative,” Parallel Computing, vol. 24, no. 12-13, pp. 1807 – 1825, 1998.

[Online]. Available: http://www.sciencedirect.com/science/article/B6V12-3WN74RX-T/2/

d72d9233c0310c7cd71df5b3c0409d7e

[227] G. Amdahl, “Validity of the single processor approach to achieving large-scale computing

capabilities,” in Proceedings of AFIPS Conference (30), 1967, pp. 483–485.

[228] J. Bosque and L. Perez, “Theoretical scalability analysis for heterogeneous clusters,” Cluster

Computing and the Grid, IEEE International Symposium on, vol. 0, pp. 285–292, 2004.

[229] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman, “Basic local alignment search

tool,” Journal of Molecular Biology, vol. 215, pp. 403–410, 1990.

[230] H. Lin, P. Balaji, R. Poole, C. Sosa, X. Ma, and W. chun Feng, “Massively parallel ge-

nomic sequence search on the blue gene/p architecture,” in SC ’08: Proceedings of the 2008

ACM/IEEE conference on Supercomputing. Piscataway, NJ, USA: IEEE Press, 2008, pp.

1–11.

[231] J. Archuleta, F. Wu-chun, and E. Tilevich, “A pluggable framework for parallel pairwise

sequence search,” in 29th Annual International Conference of the IEEE, 2007, pp. 127–130.

[232] C. Oehmen and J. Nieplocha, “ScalaBLAST: A scalable implementation of blast for high-

performance data-intensive bioinformatics analysis,” IEEE Transactions on Parallel and Dis-

tributed Systems, vol. 17, pp. 740– 749, 2006.

[233] A. Krishnan, “GridBLAST: a globus-based high-throughput implementation of BLAST in

a grid computing framework,” Concurrency and Computation: Practice and Experience,

vol. 17, pp. 1607–1623, 2005.

[234] A. E. Darling, L. Carey, and W. chun Feng, “The design, implementation, and evaluation of

mpiBLAST,” in In Proceedings of ClusterWorld 2003, 2003.

[235] R. L. D. C. Costa and S. Lifschitz, “Database allocation strategies for parallel blast evaluation

on clusters,” Distrib. Parallel Databases, vol. 13, no. 1, pp. 99–127, 2003.

[236] R. Braun, K. Pedretti, T. Casavant, T. Scheetz, C. Birkett, and C. Roberts, “Parallelization of

local BLAST service on workstation clusters,” Future Generation Computer Systems, vol. 17,

pp. 745–754, 2001.

[237] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, “A High-Performance, Portable Implementa-

tion of the MPI Message Passing Interface Standard,” Parallel Computing, vol. 22, no. 6, pp.

789–828, Sep. 1996.

[238] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler, “ GenBank ,”

Nucleic Acids Res., vol. 17, pp. D25–D30, 2008.

[239] R. Martino, C. Johnson, E. Suh, B. Trus, and T. Yap, “Parallel computing in biomedical

research,” Science, vol. 265, pp. 902–908, 1994.

[240] O. Trelles-Salazar, E. Zapata, and J. Carazo, “On an efficient parallelization of exhaustive

sequence comparison algorithms on message passing architectures,” Bioinformatics, vol. 10,

pp. 509–511, 1994.

[241] G. Burns, R. Daoud, and J. Vaigl, “LAM: An Open Cluster Environment for MPI,” in Pro-

ceedings of Supercomputing Symposium, 1994, pp. 379–386.

[242] J. M. Squyres and A. Lumsdaine, “A Component Architecture for LAM/MPI,” in Proceed-

ings, 10th European PVM/MPI Users’ Group Meeting, ser. Lecture Notes in Computer Sci-

ence, no. 2840. Venice, Italy: Springer-Verlag, 2003, pp. 379–387.

[243] S. Sistare, J. Test, and D. Plauger, “An architecture for integrated resource management

of mpi jobs,” in Proceedings of the IEEE International Conference on Cluster Computing

(CLUSTER’02). IEEE Computer Society, 2002, pp. 370–377.

[244] “Tight integration of the mpich2 library into sge.” [Online]. Available: http://gridengine.

sunsource.net/howto/mpich2-integration/mpich2-integration.html

[245] I. Foster, J. Geisler, W. Gropp, N. Karonis, E. Lusk, G. Thiruvathukal, and S. Tuecke, “Wide-area implementation of the message passing interface,” Parallel Computing,

vol. 24, pp. 1735–1749, 1998.

[246] J. MacLaren, M. M. Keown, and S. Pickles, “Co-allocation fault tolerance and grid comput-

ing,” in UK e-Science AHM, 2006.

[247] A. A. Shah, G. Folino, and N. Krasnogor, “Towards high-throughput, multi-criteria protein

structure comparison and analysis,” IEEE Transactions on NanoBioscience, vol. 9, pp. 1–12,

2010.

[248] J. Gray, D. T. Liu, M. Nieto-Santisteban, A. Szalay, D. J. DeWitt, and G. Heber, “Scientific

data management in the coming decade,” SIGMOD Rec., vol. 34, no. 4, pp. 34–41, 2005.

[249] Y. Sato, A. Nakaya, K. Shiraishi, S. Kawashima, S. Goto, and M. Kanehisa, “Ssdb: Sequence

similarity database in kegg,” Genome Informatics, vol. 12, pp. 230–231, 2001.

[250] L. Holm, S. Kaariainen, P. Rosenstrom, and A. Schenkel, “Searching protein structure

databases with dalilite v.3,” Bioinformatics, vol. 24, pp. 2780–2781, 2008.

[251] G. Chittibabu, R. P. Lipika, and N. S. Ilya, “Dmaps: a database of multiple alignments for

protein structures,” Nucleic Acids Res., vol. 34, pp. D273–D276, 2006.

[252] I. Shindyalov and P. Bourne, “A database and tool for 3-d protein structure comparison and

alignment,” Nucleic Acids Res., vol. 29, pp. 228–229, 2001.

[253] “Hierarchical data format.” [Online]. Available: http://www.hdfgroup.org/HDF5/

[254] D. Wells, E. Greisen, and R. Harten, “FITS: A flexible image transport system,” Astronomy

and Astrophysics Supplement Series, vol. 44, pp. 363–370, 1981.

[255] N. Geddes, “The National Grid Service of the UK,” in Proceedings of the Second IEEE

International Conference on e-Science and Grid Computing (e-Science’06), 2006.

[256] L. Gosink, J. Shalf, K. Stockinger, K. Wu, and W. Bethel, “HDF5-FastQuery: Accelerating

complex queries on hdf datasets using fast bitmap indices.” in Proceedings of the 18th Inter-

national Conference on Scientific and Statistical Database Management. IEEE Computer

Society Press, 2006.

[257] M. A. Koch and H. Waldmann, “Protein structure similarity clustering and natural product

structure as guiding principles in drug discovery,” Drug Discovery Today, vol. 10, no. 7,

pp. 471 – 483, 2005. [Online]. Available: http://www.sciencedirect.com/science/article/

B6T64-4FVHYS2-9/2/5ab89893bf679be7430b04309872f612

[258] F. J. Dekker, M. A. Koch, and H. Waldmann, “Protein structure similarity clustering (PSSC)

and natural product structure as inspiration sources for drug development and chemical

genomics,” Current Opinion in Chemical Biology, vol. 9, no. 3, pp. 232 – 239, 2005,

combinatorial chemistry. [Online]. Available: http://www.sciencedirect.com/science/article/

B6VRX-4FWT431-1/2/3ef564daf2fc9a09a67dfca3c75423a4

[259] B. D. Charette, R. G. MacDonald, S. Wetzel, D. B. Berkowitz, and H. Waldmann, “Protein

Structure Similarity Clustering: dynamic treatment of PDB structures facilitates clustering,”

Angewandte Chemie International Edition, vol. 45, no. 46, pp. 7766–7770, 2006.

[260] C. Levasseur and F.-J. Lapointe, “Total evidence, average consensus and matrix represen-

tation with parsimony: What a difference distances make,” Evol Bioinf Online, vol. 2, pp.

249–253, 2006.

[261] F.-J. Lapointe, J. Kirsch, and J. Hutcheon, “Total evidence, consensus, and bat phylogeny: A

distance-based approach,” Mol Phylogenet Evol, vol. 11, no. 1, pp. 55–66, 1999.

[262] V. Di Gesù, “Data analysis and bioinformatics,” in PReMI’07: Proceedings of the 2nd inter-

national conference on Pattern recognition and machine intelligence. Berlin, Heidelberg:

Springer-Verlag, 2007, pp. 373–388.

[263] “Clustering calculator.” [Online]. Available: http://www2.biology.ualberta.ca/jbrzusto/

cluster.php

[264] J. Felsenstein, “Phylip - phylogeny inference package (version 3.2),” Cladistics, vol. 5, pp.

164–166, 1989.

[265] J. Bingham and S. Sudarsanam, “Visualizing large hierarchical clusters in hyperbolic space,”

Bioinformatics, vol. 16, pp. 660–661, 2000.

[266] D. Huson, D. Richter, C. Rausch, T. Dezulian, M. Franz, and R. Rupp, “Dendroscope: An

interactive viewer for large phylogenetic trees,” BMC Bioinformatics, vol. 8, no. 1, p. 460,

2007. [Online]. Available: http://www.biomedcentral.com/1471-2105/8/460

[267] D. Eernisse and A. Kluge, “Taxonomic congruence versus total evidence, and amniote phy-

logeny inferred from fossils, molecules, and morphology,” Mol Biol Evol, vol. 10, pp. 1170–

1195, 1993.

[268] R. Carnap, Logical foundations of probability. University of Chicago Press, 1950.

[269] A. G. Kluge, “A concern for evidence and a phylogenetic hypothesis of relationships among

epicrates (boidae, serpentes),” Systematic Zoology, vol. 38, no. 1, pp. 7–25, 1989. [Online].

Available: http://www.jstor.org/stable/2992432

[270] K. R. Popper, Logik der Forschung. Mohr Siebeck, 1934.

[271] C. J. Creevey and J. O. McInerney, “Trees from Trees: Construction of phylogenetic su-

pertrees using Clann,” in Bioinformatics for DNA Sequence Analysis, ser. Methods in Molec-

ular Biology, D. Posada, Ed. Humana Press, 2009, vol. 537, pp. 139–161.

[272] “Mirror of the Protein Kinase Resource (PKR).”

[273] C. Smith, I. Shindyalov, S. Veretnik, M. Gribskov, S. Taylor, L. Ten Eyck, and P. Bourne, “The protein

kinase resource,” Trends Biochem Sci, vol. 22, no. 11, pp. 444–446, 1997.

[274] C. Smith, “The protein kinase resource and other bioinformation resources,” Prog Biophys

Mol Biol, vol. 71, pp. 525–533, 1999.

[275] C. Petretti and C. Prigent, “The protein kinase resource: everything you always wanted to

know about protein kinases but were afraid to ask,” Biol Cell, vol. 97, pp. 113–118, 2005.

[276] S. Manos, M. Mazzeo, O. Kenway, P. V. Coveney, N. T. Karonis, and B. Toonen, “Distributed

mpi cross-site run performance using mpig.” in HPDC, M. Parashar, K. Schwan, J. B.

Weissman, and D. Laforenza, Eds. ACM, 2008, pp. 229–230. [Online]. Available: http://

dblp.uni-trier.de/db/conf/hpdc/hpdc2008.html#ManosMKCKT08

Appendices

.1 List of Protein Structure Comparison Methods

Taken from: http://en.wikipedia.org/wiki/Structural_alignment_software

NAME Description

MAMMOTH MAtching Molecular Models Obtained from Theory

CE/CE-MC Combinatorial Extension – Monte Carlo

DaliLite Distance Matrix Alignment

VAST Vector Alignment Search Tool

PrISM Protein Informatics Systems for Modeling

SSAP Sequential Structure Alignment Program

SARF2 Spatial ARrangements of Backbone Fragments

STAMP STructural Alignment of Multiple Proteins

MASS Multiple Alignment by Secondary Structure

SCALI Structural Core ALIgnment of proteins

SSM Secondary Structure Matching

SHEBA Structural Homology by Environment-Based Alignment

LGA Local-Global Alignment

POSA Partial Order Structure Alignment

PyMOL "super" command does sequence-independent 3D alignment

FATCAT Flexible Structure AlignmenT by Chaining Aligned Fragment Pairs Allowing Twists

Matras MArkovian TRAnsition of protein Structure

MAMMOTH-mult MAMMOTH-based multiple structure alignment

PRIDE PRobability of IDEntity

FAST FAST Alignment and Search Tool

C-BOP Coordinate-Based Organization of Proteins

ProFit Protein least-squares Fitting

TOPOFIT Alignment as a superimposition of common volumes ...

MUSTANG MUltiple STructural AligNment AlGorithm

URMS Unit-vector RMSD

LOCK Hierarchical protein structure superposition

LOCK 2 Improvements over LOCK

CBA Consistency Based Alignment

TetraDA Tetrahedral Decomposition Alignment

STRAP STRucture based Alignment Program

LOVOALIGN Low Order Value Optimization methods for Structural Alignment

GANGSTA Genetic Algorithm for Non-sequential, Gapped protein STructure Alignment

GANGSTA+ Combinatorial algorithm for nonsequential and gapped ...

TM-align TM-score based protein structure alignment

MatAlign Protein Structure Comparison by Matrix Alignment

Vorolign Fast structure alignment using Voronoi contacts

EXPRESSO Fast Multiple Structural Alignment using T-Coffee and Sap

YAKUSA Internal Coordinates and BLAST type algorithm

BLOMAPS Conformation-based alphabet alignments

CLEPAPS Conformation-based alphabet alignments

TALI Torsion Angle ALIgnment

FlexProt Flexible Alignment of Protein Structures

MultiProt Multiple Alignment of Protein Structures

CTSS Protein Structure Alignment Using Local Geometrical Features

Matt Multiple Alignment with Translations and Twists

TopMatch Protein structure alignment and visualization of structural similarities

SSGS Secondary Structure Guided Superimposition

Matchprot Comparison of protein structures by growing neighborhood

UCSF Chimera see MatchMaker tool and "matchmaker" command

FLASH Fast aLignment Algorithm for finding Structural Homology

RAPIDO Rapid Alignment of Protein structures with Domain

ComSubstruct Structural Alignment based on Differential Geometrical Encoding

ProCKSI Protein (Structure) Comparison, Knowledge, Similarity and Info

SARST Structure similarity search Aided by Ramachandran Sequential Trans.

Fr-TM-align Fragment-TM-score based protein structure alignment

TOPS+ COMPARISON Comparing topological models of protein structures

TOPS++FATCAT Flexible Structure AlignmenT by Chaining Aligned Fragment...

MolLoc Molecular Local Surface Alignment

FASE Flexible Alignment of Secondary Structure Elements

SABERTOOTH Protein Structural Alignment based on a vectorial Structure ...

SALIGN Sequence-Structure Hybrid Method

.2 Software Related Technical Problems and Solutions

Sun Grid Engine 6.1u2 related issues

Problem 1: "Unsupported local hostname"

Run 'ifconfig -a' to obtain the IP address of the interface that is

used to connect to the execution hosts. Then edit the file '/etc/hosts' and add entries of the form:

127.0.0.1 localhost

128.243.18.20 taramel taramel.cs.nott.ac.uk
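
A quick sanity check (illustrative only; the host names are those of the local testbed used above) that the new entry is picked up:

hostname -f              # should print the fully qualified name, e.g. taramel.cs.nott.ac.uk
getent hosts taramel     # should return the line just added to /etc/hosts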

Problem 2:

"mount: mount to NFS server ’taramel’ failed: System Error: No route to host"

This indicates that something is wrong with the firewall.

Add the following rules to /etc/sysconfig/iptables on the master node:

-A RH-Firewall-1-INPUT -s 128.243.24.xx -j ACCEPT

-A RH-Firewall-1-INPUT -s 128.243.24.xx -j ACCEPT

-A RH-Firewall-1-INPUT -s 128.243.24.xxx -j ACCEPT

Restart the firewall (/etc/init.d/iptables restart).
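
To verify the fix (a sketch; the subnet prefix is the one used in the rules above), the first command can be run on the master node and the second from an execution host:

/sbin/iptables -L RH-Firewall-1-INPUT -n | grep 128.243.24    # the new ACCEPT rules should be listed
showmount -e taramel                                          # the NFS exports should now be reachable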

Problem 3: Cannot contact qmaster. The command failed:

Edit /etc/services on each execution host so that it contains the entries sge_qmaster 536/tcp

and sge_execd 537/tcp, and it should work.
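
For example, the lines appended to /etc/services would look like the following (using the port numbers given above):

sge_qmaster     536/tcp
sge_execd       537/tcp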

Problem 4: "-bash: qconf: command not found"

Source the SGE settings file with '. /usr/SGE6/default/common/settings.sh', or copy it into the shell start-up directory: cp /usr/SGE6/default/common/settings.sh /etc/profile.d/gridengine.sh
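
As an illustrative check that the environment is now set up correctly, qconf should be found on the PATH and be able to contact the qmaster:

which qconf       # should resolve to the SGE bin directory
qconf -sconf      # prints the global cluster configuration if the qmaster is reachable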

MPICH2-1.0.7rc2 + SGE 6.1u2 on CentOS-5

Problem 5: "Connect to address 128.243.: Connection refused"

Check whether the rsh server is installed. If not, install it (yum install rsh-server) and restart xinetd.

Also, the integration of MPICH2 with SGE requires passwordless SSH/RSH on all nodes:

1. local$ ssh-keygen -t dsa

2. local$ scp ~/.ssh/id_dsa.pub username@remote:~/

3. local$ ssh username@remote

4. remote$ cat ~/id_dsa.pub >> ~/.ssh/authorized_keys

5. remote$ chmod 644 ~/.ssh/authorized_keys

6. remote$ exit

7. local$ ssh username@remote

To troubleshoot rsh, use: grep rsh /var/log/messages

In /etc/xinetd.d/rsh change "disable" to disable = no and restart xinetd with
/etc/init.d/xinetd reload. Also look carefully at the contents of
~/.rhosts, /root/.rhosts and /etc/hosts.equiv
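
Where the ssh-copy-id helper is available on the local machine, steps 2-6 above can be replaced by a single command (username and remote are placeholders, as above):

ssh-copy-id -i ~/.ssh/id_dsa.pub username@remote
ssh username@remote true     # should now log in without prompting for a password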

Problem 6: smpd -s: Floating point exception

Check whether all machines have the same version of the OS. Otherwise, install MPICH2 locally on each machine.
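
A simple way to spot such mismatches (a sketch; node01 and node02 are placeholder host names, and passwordless SSH from Problem 5 is assumed):

for host in node01 node02; do
    ssh $host 'cat /etc/redhat-release; which mpicc'    # OS release and MPICH2 compiler path per node
done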

Problem 7: "export: Command not found"

Specify an interpreting shell in the job script with syntax such as:

#$ -S /bin/bash
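
A minimal SGE job script illustrating this (the job name and the commands below are placeholders, not taken from the actual experiments):

#!/bin/bash
#$ -S /bin/bash      # interpreting shell; avoids "export: Command not found"
#$ -cwd              # run the job from the submission directory
#$ -N test_job       # illustrative job name
export MY_VAR=1      # export now works because bash interprets the script
echo "Running on $(hostname)"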

Globus related issues

Problem 8:

GRAM Job submission failed because data transfer to the server failed (error code 10)

Install the advisory patch from: www.mcs.anl.gov/~bester/patches/globus_gram_protocol-7.5.tar.gz

Run: gpt-build globus_gram_protocol-7.5.tar.gz gcc32dbg gcc32dbgpthr

Problems on National Grid Service

Problem 9: "mpicc with hdf5 fails"

Use: module load 64-bit gnu-compilers mpi hdf5 and then use the proper wrappers for mpicc, gcc and h5cc:

# MPICC wrapping H5CC wrapping GCC
mpicc -cc=h5cc parallel.c procksi_Alone.c -o par_procksi_Alone -DDEBUG -DTIMING -DPARALLEL

OR

# HDF5 wrapping MPICC wrapping GCC

env HDF5_CC=mpicc HDF5_CLINKER=mpicc h5cc parallel.c procksi_Alone.c -o par_procksi_Alone -DDEBUG -DTIMING -DPARALLEL
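
Once compiled, the binary can be launched under MPI in the usual way; the process count below is only an example, and whether mpirun or mpiexec is used depends on the local MPICH2 installation:

mpirun -np 8 ./par_procksi_Alone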

Problem 10: "Pro*C code fails to compile"

Use the proper modules and compilation commands, for example:

[ngs0933@ngs ]$ module load oracle

[ngs0933@ngs ]$ gcc -I/apps/oracle/10.2.0.3/client/precomp/public/ -L/apps/oracle/10.2.0.3/client/lib -lsqlplus -o sample1.new sample1_pro.c
[ngs0933@ngs ]$ readelf -h sample1.new
[ngs0933@ngs ]$ ./sample1.new

Connected to the database as user: ngsdb0053