Protein Models Comparator Scalable Bioinformatics Computing on the Google App Engine Platform
Total Page:16
File Type:pdf, Size:1020Kb
Protein Models Comparator Scalable Bioinformatics Computing on the Google App Engine Platform Paweł Widera and Natalio Krasnogor School of Computer Science, University of Nottingham, UK Contact: @a nrg to ik gnot htnam.io co.an uala k.rs Background: The comparison of computer generated protein structural models is an important element of protein structure prediction. It has many uses including model quality evaluation, selection of the final models from a large set of candidates or optimisation of parameters of energy functions used in template-free modelling and refinement. Although many protein comparison methods are available online on numerous web servers, they are not well suited for large scale model comparison: (1) they operate with methods designed to compare actual proteins, not the models of the same protein, (2) majority of them offer only a single pairwise structural comparison and are unable to scale up to a required order of thousands of comparisons. To bridge the gap between the protein and model structure comparison we have developed the Protein Models Comparator (pm-cmp). To be able to deliver the scalability on demand and handle large comparison experiments the pm-cmp was implemented \in the cloud". Results: Protein Models Comparator is a scalable web application for a fast distributed comparison of protein models with RMSD, GDT TS, TM-score and Q-score measures. It runs on the Google App Engine cloud platform and is a showcase of how the emerging PaaS (Platform as a Service) technology could be used to simplify the development of scalable bioinformatics services. The functionality of pm-cmp is accessible through API which allows a full automation of the experiment submission and results retrieval. Protein Models Comparator is a free software released under the Affero GNU Public Licence and is available with its source code at: http://www.infobiotics.org/pm-cmp Conclusions: This article presents a new web application addressing the need for a large-scale model-specific protein structure comparison and provides an insight into the GAE (Google App Engine) platform and its usefulness in scientific computing. rotein structure comparison seems to be most suc- ferent prediction groups are ranked by the similarity to the cessfully applied to the functional classification of target structure. Depending on the target category, which arXiv:1102.4293v2 [cs.CE] 26 Jul 2011 Pnewly discovered proteins. As the evolutionary con- could be either a template-based modelling (TBM) target tinuity between the structure and the function of proteins or a free modelling (FM) target, the comparison emphasis is strong, it is possible to infer the function of a new pro- is put either on local similarity and identification of well tein based on its structural similarity to known protein predicted regions or global distance between the model and structures. This is, however, not the only application of the native structure [1{3]. structural comparison. There are several aspects of pro- The CASP evaluation is done only for the final models tein structure prediction (PSP) where robust structural submitted by each group. These models have to be se- comparison is very important. lected from a large set of computer generated candidate The most common application is the evaluation of mod- structures of unknown quality. The most promising models els. To measure the quality of a model, the predicted are commonly chosen with the use of clustering techniques. structure is compared against the target native structure. First, all models are compared against each other and then, This type of evaluation is performed on a large scale dur- split into several groups of close similarity (clusters). The ing the CASP experiment (Critical Assessment of protein most representative elements of each cluster (e.g. cluster Structure Prediction), when all models submitted by dif- centroids) are selected as final models for submission [4, 5]. P. Widera and N. Krasnogor • Protein Models Comparator page 1 of 8 The generation of models in the free modelling category, Comparison servers as well as the process of model refinement in both FM and Although many protein structure comparison web services TBM categories, requires a well designed protein energy are already available online, they are not well suited for function. As it is believed that the native structure is in a models comparison. Firstly, they do not operate on a scale state of thermodynamic equilibrium and low free energy, needed for such a task. Commonly these methods offer a the energy function is used to guide the structural search simple comparison between two structures (1:1) or in the towards more native-like structures. Ideally, the energy best case, a comparison between a single structure and a function should have low values for models within small set of known structures extracted from the Protein Data structural distance to the native structure, and high values Bank (1:PDB). While what is really needed is the ability for the most distinct and non-protein-like models. To to compare a large number of structures either against a ensure such properties, the parameters of energy functions known native structure (1:N) or against each other (N:N). are carefully optimised on a training set of models for which Secondly, the comparison itself is done using just a single the real distances to the native structures are precomputed comparison method, which may not be reliable enough for [6{9]. all the cases (types of proteins, sizes etc.). An exception to this is the ProCKSI server [10] that Model comparison vs. protein alignment uses several different comparison methods and provides 1:N and N:N comparison modes. However, it operates All these three aspects of prediction: evaluation of models with methods designed to compare real proteins, not the quality, selection of the best models from a set of candi- models generated in the process of PSP, and therefore it dates and the optimisation of energy functions, require a lacks the ability to use a fixed alignment while scoring the significant number of structural comparisons to be made. structural similarity. Also the high computational cost of However, this comparisons are not made between two pro- these methods makes large-scale comparison experiments teins, but between two protein models that are structural difficult without a support of grid computing facilities (see variants of the same protein and are composed of the same our previous work on this topic [11, 12]). set of atoms. Because of that, the alignment between the The only server able to perform a large-scale model- atoms is known a priori and is fixed, in contrast to compar- specific structural comparison we are aware of, is the in- ison between two different proteins where the alignment of frastructure implemented to support the CASP experi- atoms usually has to be found before scoring the structural ment [13]. This service, however, is only available to a similarity. small group of CASP assessors for the purpose of evalua- Even though searching for optimal alignment is not nec- tion of the predictions submitted for a current edition of essary in model comparison, assessing their similarity is CASP. It is a closed and proprietary system that is not still not straightforward. Additional complexity is caused publicly available neither as an online server nor in a form in practice by the incompleteness of models. For example, of a source code. Due to that, it cannot be freely used, many CASP submitted models contain the atomic coordi- replicated or adapted to the specific needs of the users. We nates for just a subset of the protein sequence. Often even have created the Protein Models Comparator (pm-cmp) to the native structures have several residues missing as the address these issues. X-ray crystallography experiments not always locate all of them. As the model comparison measures operate only on Google App Engine the structures of equal length, a common set of residues We implemented pm-cmp using the Google App Engine have to be determined for each pair of models before the (GAE) [14], a recently introduced web application platform comparison is performed (see Figure 1). It should be noted designed for scalability. GAE operates as a cloud comput- that this is not an alignment in the traditional sense but ing environment providing Platform as a Service (PaaS), just a matching procedure that selects the residues present and removes the need to consider physical resources as they in both structures. are automatically scaled up as and when required. Any individual or a small team with enough programming skills can build a distributed and scalable application on GAE without the need to spend any resources on the setup and maintenance of the hardware infrastructure. This way, sci- entist freed from tedious configuration and administration tasks can focus on what they do best, the science itself. Figure 1. Matching common residues between two GAE offers two runtime environments based on Python structures. There are two common cases when number of or Java. Both environments offer almost identical set of residues differs between the structures: (A) some residues at the beginning/end of a protein sequence were not located in platform services, they only differ in maturity as Java the crystallography experiment and (B) structure was derived environment has been introduced 12 months after first from templates that did not cover the entire protein sequence. preview of the Python one. The environments are well In both cases pm-cmp performs a comparison using the documented and frequently updated with new features. maximum common subset of residues. A limited amount of GAE resources is provided for free and is enough to run a small application. This limits are consequently decreased with each release of the platform SDK (Software Development Kit) as the stability and P.