J Grid Computing DOI 10.1007/s10723-015-9353-8

Scaling Ab Initio Predictions of 3D Protein Structures in Microsoft Azure Cloud

Dariusz Mrozek · Paweł Gosk · Bożena Małysiak-Mrozek

Received: 7 October 2014 / Accepted: 9 October 2015 © The Author(s) 2015. This article is published with open access at Springerlink.com

Abstract Computational methods for protein structure prediction allow us to determine a three-dimensional structure of a protein based on its pure amino acid sequence. These methods are a very important alternative to costly and slow experimental methods, like X-ray crystallography or Nuclear Magnetic Resonance. However, conventional calculations of protein structure are time-consuming and require ample computational resources, especially when carried out with the use of ab initio methods that rely on physical forces and interactions between atoms in a protein. Fortunately, at the present stage of the development of computer science, such huge computational resources are available from public cloud providers on a pay-as-you-go basis. We have designed and developed a scalable and extensible system, called Cloud4PSP, which enables predictions of 3D protein structures in the Microsoft Azure commercial cloud. The system makes use of the Warecki-Znamirowski method as a sample procedure for protein structure prediction, and this prediction method was used to test the scalability of the system. The results of the efficiency tests performed proved good acceleration of predictions when scaling the system vertically and horizontally. In the paper, we show the system architecture that allowed us to achieve such good results, the Cloud4PSP processing model, and the results of the scalability tests. At the end of the paper, we try to answer which of the scaling techniques, scaling out or scaling up, is better for solving such computational problems with the use of Cloud computing.

Keywords Bioinformatics · Proteins · 3D protein structure · Protein structure prediction · Tertiary structure prediction · Ab initio · Protein structure modeling · Cloud computing · Distributed computing · Scalability · Microsoft Azure

This work was supported by Microsoft Research within the Microsoft Azure for Research Award. A part of the work was performed using the infrastructure supported by the POIG.02.03.01-24-099/13 grant: "GeCONiI - Upper Silesian Center for Computational Science and Engineering".

D. Mrozek (✉) · P. Gosk · B. Małysiak-Mrozek, Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland; e-mail: [email protected]

1 Introduction

Protein structure prediction is one of the most important and yet difficult processes for modern computational biology and structural bioinformatics [21]. The practical role of protein structure prediction is becoming even more important in the face of the dynamically growing number of protein sequences obtained through the translation of DNA sequences coming from large-scale sequencing projects. Needless to say, experimental methods for the determination of protein structures, such as X-ray crystallography or Nuclear Magnetic Resonance (NMR), are lagging behind the number of protein sequences. As a consequence, the number of protein structures in repositories, such as the world-wide Protein Data Bank (PDB) [3], is only a small percentage of the number of all known sequences. Therefore, computational procedures that allow for the determination of a protein structure from its amino acid sequence are great alternatives to experimental methods for protein structure determination, like X-ray crystallography and NMR.

Protein structure prediction refers to the computational procedure that delivers a three-dimensional structure of a protein based on its amino acid sequence (Fig. 1). There are various approaches to the problem and many algorithms have been developed. These methods generally fall into two groups: (1) physical and (2) comparative [35, 72]. Physical methods rely on physical forces and interactions between atoms in a protein. Most of them try to reproduce nature's algorithm and implement it as a computational procedure in order to give proteins their unique 3D native conformations [43]. Following the rules of thermodynamics, it is assumed that the native conformation of a protein is the one that possesses the minimum potential energy. Therefore, physical methods try to find the global minimum of the potential energy function [41]. The functional form of the energy is described by empirical force fields [10] that define the mathematical function of the potential energy, its components describing various interactions between atoms, and the parameters needed for computations of each of the components (see Section 2.1 for details). The potential energy of an atomic system is composed of several components, depending on the force field type. Methods that rely on such a physical approach belong to so-called ab initio protein structure prediction methods. Representatives of the approach include I-TASSER [68], Rosetta@home [42], Quark [69], and many others, e.g. [66] and [73]. In the paper, we will focus on this approach.

Fig. 1 3D protein structure prediction: from amino acid sequence (Input) to 3D structure (Output). Modeling software uses different methods for prediction purposes: (1) physical (ab initio), (2) comparative (homology modeling, fold recognition, secondary structure prediction)

On the other hand, comparative methods rely on already known structures that are deposited in macromolecular data repositories, such as the Protein Data Bank (PDB). Comparative methods try to predict the structure of the target protein by finding homologs among sequences of proteins of already determined structures and by "dressing" the target sequence in one of the known structures. Comparative methods can be split into several groups, including: (1) homology modeling methods, e.g. Swiss-Model [2], Modeller [14], RaptorX [32], Robetta [36], HHpred [63], (2) fold recognition methods, like Phyre [34], Raptor [70], and Sparks-X [71], and (3) secondary structure prediction methods, e.g. PREDATOR [18], GOR [19], and PredictProtein [56].

Both physical and comparative approaches are still being developed and their effectiveness is assessed every two years in the CASP experiment (Critical Assessment of protein Structure Prediction). It must also be admitted that ab initio prediction methods require significant computational resources, either powerful supercomputers [53, 60] or distributed computing. Proof of the latter is given by projects such as Folding@home [62] and Rosetta@Home [9] that make use of the Grid computing architecture.

A promising alternative also seems to be provided by Cloud computing, which allows hardware infrastructure to be leased in the pay-as-you-go model. Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction [47]. In practice, Cloud computing allows applications and services to be run on a distributed network using a virtualized system and its resources, and at the same time, allows us to abstract away from the implementation details of the system itself. Cloud computing emerged as a result of requirements for the public availability of computing power, new technologies for data processing and the need for their global standardization, becoming a mechanism allowing control of the development of hardware and software resources by introducing the idea of virtualization. The use of cloud platforms can be particularly beneficial for companies and institutions that need to quickly gain access to a computer system with higher than average computing power. In this case, the use of Cloud computing services can be faster to implement than using already owned resources (servers and computing clusters) or buying new ones. For this reason, Cloud computing is widely used in business and, according to Forbes, the market value of such services will significantly increase in the coming years [46].

1.1 Related Works

Cloud computing evolved from various forms of distributed computing, including Grid computing. Both Clouds and Grids provide plenty of computational and storage resources, which is very attractive for scientific computations performed in the domain of Life sciences. Therefore, both computing models became popular in scientific applications for which a large pool of computational resources allows various computationally intensive problems to be solved. In the domain of Life sciences there are many dedicated cloud-based and grid-based tools that allow us to solve problems such as whole genome and metagenome sequence analysis [1], gene detection [45], identification of peptide sequences from spectra in mass spectrometry [44], analysis of data from DNA microarray experiments [33], mapping next-generation sequence data to the human genome and other reference genomes for use in a variety of biological analyses including SNP discovery, genotyping and personal genomics [13, 57], gene sequence analysis and protein characterization [28], 3D ligand binding site comparison and similarity searching of a structural proteome [24], molecular docking [4, 8], protein structure similarity searching and structural alignment [25, 50–52], and many others.

In the domain of protein structure prediction, it is worth noting two cloud-based solutions. The first one is the open-source Cloud BioLinux [38], which is a publicly accessible Virtual Machine (VM) that enables scientists to quickly provision on-demand infrastructures for high-performance bioinformatics computing using cloud platforms. Cloud BioLinux provides a range of pre-configured command line and graphical software applications, including a full-featured desktop interface, documentation and over 135 bioinformatics packages for applications including sequence alignment, clustering, assembly, display, editing, and phylogeny. The release of Cloud BioLinux reported in [31] includes the already mentioned PredictProtein suite, which contains methods for the prediction of secondary structure and solvent accessibility (profphd), nuclear localization signals (predictnls), and intrinsically disordered regions (norsnet).

The second solution is Rosetta@Cloud [27]. Rosetta@Cloud is a commercial, cloud-based pay-as-you-go server for predicting protein 3D structures using ab initio methods. Services provided by Rosetta@Cloud are analogous to the well-established Rosetta computational software suite, with a variety of tools developed for macromolecular modeling, structure prediction and functional design. Rosetta@Cloud was originally designed to work on the Cloud provided by Amazon Web Services (AWS). It delivers a scalable environment, a fully functional graphical user interface and a dedicated pricing model for using computing units (AWS Instances) in the prediction process.

Great progress in scaling molecular simulations has also been brought by scientific initiatives that make use of the Grid computing architecture. Laganà et al. [39] presented the foundations and structures of the building blocks of the ab initio Grid Empowered Molecular Simulator (GEMS) implemented on the EGEE computing Grid. In GEMS the computational problem is split into a sequence of three computational blocks, namely INTERACTION, DYNAMICS, and OBSERVABLES. Each of the blocks has a different purpose. Ab initio calculations determining the structure of a molecular system are performed in the first block. The main efforts to implement GEMS on the EGEE Grid were devoted to the second block, which is responsible for the calculation of the dynamics of the molecular system. The authors show that for various reasons the parallelization of molecular simulations is possible through the application of a simple parameter sweeping distribution scheme, which is also used in our system.

WeNMR [67] is an example of a successful initiative for moving the analysis of Nuclear Magnetic Resonance (NMR) and Small Angle X-Ray Scattering (SAXS) imaging data for atomic and near-atomic resolution molecular structures to the Grid. WeNMR makes use of an operational Grid that was initially based on the gLite middleware, developed in the context of the EGEE project series [16]. Although WeNMR focuses on NMR and SAXS, its web portal provides access to various services and tools, including those used for the prediction of biological complexes (HADDOCK [11]), calculating structures from NMR data (Xplor-NIH [58], CYANA [22] and CS-ROSETTA [61]), and molecular structure refinement and molecular dynamics simulations (AMBER [6] and GROMACS [65]). A typical workflow is based on Grid job pooling. A specialized process is listening for Grid job packages that should be executed in the Grid, another process is periodically checking running jobs for their status, retrieving the results when ready, and finally, another process performs post-processing of results before they are presented to the user. In the paper, we will present a system that uses a similar workflow scheme, with prediction jobs that are divided into many prediction tasks. These tasks are enqueued in the system for further execution by processing units that work in a dedicated role-based and queue-based architecture that we designed.

Gesing et al. [20] report on a very important security issue while carrying out computations in structural bioinformatics, molecular modeling, and quantum chemistry with the use of the Molecular Simulation Grid (MoSGrid). MoSGrid is a science gateway that makes use of WS-PGRADE [15] as a user interface, gUSE [30] as a high-level middleware service layer for workload storage, workload execution, monitoring and others, UNICORE [64] as a Grid resources middleware layer, and XtreemFS [26] as a file system for storage purposes. Various security elements are employed in particular layers of the MoSGrid security infrastructure with the aim of protecting the intellectual property of companies performing scientific calculations.

These projects prove that computations performed in bioinformatics require ample computational resources that are available through the use of Grid or Cloud computing. In the paper, we present the newly developed Cloud4PSP (Cloud for Protein Structure Prediction) system. Cloud4PSP is a cloud-based computational framework for 3D protein structure prediction. Cloud4PSP was designed and developed for the Microsoft Azure commercial Cloud. In the paper, we show the architecture of the Cloud4PSP, its unique features, and the advantages of using the queue-based architecture instead of direct communication. We also present the processing model used by the system for those who would like to extend the system with new prediction methods. Finally, we show the scalability of the system based on the designed architecture on the example of sample prediction processes performed with the use of the ab initio prediction method designed by S. Warecki and L. Znamirowski [66].

The paper presents this system, which complements the collection of available Cloud-based and Grid-based tools by:
– sharing a dedicated architecture of a Cloud-based, scalable and extensible system focused on protein structure prediction, oriented to the specificity of the process,
– showing the benefits of building the system with the use of roles and queues,
– providing a processing model that involves the creation of prediction jobs with accompanying collections of prediction tasks,
– delivering a ready-to-deploy Azure solution, whose good scalability in the Azure cloud was proved by a series of tests.

1.2 Microsoft Azure

Microsoft Azure is a commercial cloud platform that delivers services for building scalable web-based applications. Microsoft Azure allows the development, deployment and management of applications and services through a network of data centers located in various countries throughout the world. Microsoft Azure is a public Cloud, which means that the infrastructure of the Cloud is available for public use and is owned by Microsoft, which sells the cloud services. Microsoft Azure provides computing resources in a virtualized form, including processing power, RAM, storage space and appropriate bandwidth for transferring data over the network, within the Infrastructure as a Service (IaaS) service model [47]. Moreover, within the Platform as a Service model [47], Azure also delivers a platform and a dedicated cloud service programming model for developing applications that should work on the Cloud.

In Fig. 2 we show the types of services and resources provided to a developer of an application deployed to the Microsoft Azure cloud. Below we briefly describe the Microsoft Azure cloud resources used in our project:

– Compute: Compute/Cloud Services represent applications that are designed to run on the Cloud and XML configuration files that define how the cloud service should run. The Microsoft Azure programming model provides an abstraction for building cloud applications. Each cloud application is defined in terms of component roles that implement the logic of the application. Configuration files define the roles and resources for an application. There are two types of roles that can be used to implement the logic of the application:
  – Web role - a virtual machine (VM) instance used for providing a web-based front-end for the cloud service;
  – Worker role - a virtual machine (VM) instance used for generalized development that performs background processing and scalable computations, accepts and responds to requests, and performs long-running or intermittent tasks.

Fig. 2 Application deployed to the Microsoft Azure cloud, which serves the user as a virtualized infrastructure, a platform for developers, and a gateway for hosting applications. The figure shows the Application layer on top of Compute, Data services, Networking, and App Services, which run on the Fabric (virtualization layer) and virtual machines

– Storage: Microsoft Azure also provides a rich set of data services for various storage scenarios. These data services enable the storage, modification and reporting of data in Microsoft Azure. The following components of Data Services can be used to store data: BLOBs, which allow the storage of unstructured text or binary data (video, audio and images), Tables, which can store large amounts of unstructured non-relational (NoSQL) data, the Azure SQL Database for storing large amounts of relational data, and HDInsight, which is a distribution of Apache Hadoop based on the Hortonworks Data Platform.
– Messaging: Messaging mechanisms allow for effective communication between components and processes running in the whole cloud service. Queues are a general-purpose technology that can be used for messaging in a wide variety of scenarios, including: communication between web and worker roles in multi-tier Azure applications (as in our project), communication between on-premises applications and Azure-hosted applications in hybrid solutions, and communication between components of distributed applications running on-premises in different organizations or departments of an organization. Microsoft Azure provides two types of queue mechanisms: Azure Queues and Service Bus Queues. Azure Queues, which are part of the Azure Storage infrastructure, enable reliable, persistent messaging within and between services. Service Bus queues are part of a broader Azure messaging infrastructure that supports queuing as well as other communication patterns.
– Fabric: the entire compute, storage (e.g. hard drives), and network infrastructure, usually implemented as virtualized clusters, constitutes a resource pool that consists of one or more servers (called scale units).

The basic tier of the Microsoft Azure platform allows us to create five classes of virtual machines with different parameters and computational power (number of cores, CPU/core speed, amount of memory, efficiency of the I/O channel). Table 1 shows a list of features for the available computing units.

Microsoft Azure is a commercial Cloud. Costs associated with the use of the Azure platform depend on the number of resources used (including compute units, storage, network traffic and bandwidth), the time for which these resources are used, the number of deployed applications, and the usage of other services that are enabled (e.g., HDInsight, Stream Analytics, Active Directory). The most popular and flexible payment plan is based on pay-as-you-go subscriptions.

2 Methods

Cloud4PSP is an extensible service that allows for ab initio prediction of protein structures from amino acid sequences. It was intentionally designed to work in the Azure cloud. At the moment, Cloud4PSP uses only the Warecki-Znamirowski (WZ) method [66] as a sample procedure for protein structure prediction. However, the collection of prediction methods can be extended by using the available programming interfaces. Information on possible extensions is provided in Section 2.4.

In terms of efficiency, the mentioned WZ method is slower than popular optimization methods like gradient-based steepest descent [7], Newton-Raphson [12], Fletcher-Powell [17], or Broyden-Fletcher-Goldfarb-Shanno (BFGS) [59]. However, it finds the global minimum of the potential energy with higher probability and has a lower tendency to converge on, and get stuck in, the nearest local minimum, which is not necessarily the global minimum of the energy function [66]. The method is based on a modified Monte Carlo approach, which makes it a better solution for this purpose. In this section, we give details of the prediction method, the architecture and the processing model used by Cloud4PSP, and details of how the system is scaled in order to handle a growing amount of work related to the prediction process.

Table 1 Available sizes of Microsoft Azure virtual machines (VMs) for Web and Worker role instances [48]

VM/server type | Number of CPU cores | CPU core speed (GHz) | Memory (GB) | Disk space (GB) | Cost ($)/hr
ExtraSmall     | Shared core         | 1.0                  | 0.768       | 19              | 0.02
Small          | 1                   | 1.6                  | 1.75        | 224             | 0.08
Medium         | 2                   | 1.6                  | 3.5         | 489             | 0.16
Large          | 4                   | 1.6                  | 7           | 999             | 0.32
ExtraLarge     | 8                   | 1.6                  | 14          | 2039            | 0.64

2.1 Prediction Method

The Warecki-Znamirowski (WZ) method for protein structure prediction models a protein backbone by (1) sampling numerous protein conformations determined by a series of φ and ψ torsion angles of the protein backbone (Fig. 3), and (2) finding the conformation with the lowest potential energy, which minimizes the expression:

\min_{\phi_0,\psi_0,\phi_1,\psi_1,\ldots,\phi_{n-1},\psi_{n-1}} E(\phi_0,\psi_0,\phi_1,\psi_1,\ldots,\phi_{n-1},\psi_{n-1}),   (1)

where E is a function describing the potential energy of the current conformation, n is the number of amino acids in the predicted protein structure, and φi, ψi are the torsion angles of the i-th amino acid.

The basic assumption of the method is that the peptide bond is planar (see Fig. 3), which restricts the rotation around the C−N bond, and the ω angle is essentially fixed at 180 degrees due to the partial double-bond character of the peptide bond. Therefore, the main changes of the protein conformation are possible through chain rotations around the φ and ψ angles, and these angles provide the flexibility required for folding the protein backbone.

The following potential energy function E is used in the calculations of energy for the current conformation (a configuration A^N of N atoms, determined by the series of φ and ψ torsion angles in (1)):

E(A^N) = \sum_{j=1}^{\text{bonds}} \frac{k_j^b}{2}\left(d_j - d_j^0\right)^2 + \sum_{j=1}^{\text{angles}} \frac{k_j^a}{2}\left(\theta_j - \theta_j^0\right)^2 + \sum_{j=1}^{\text{torsions}} \frac{V}{2}\left(1 + \cos(p\omega - \gamma)\right) + \sum_{k=1}^{N}\sum_{j=k+1}^{N} 4\varepsilon_{kj}\left[\left(\frac{\sigma_{kj}}{r_{kj}}\right)^{12} - \left(\frac{\sigma_{kj}}{r_{kj}}\right)^{6}\right] + \sum_{k=1}^{N}\sum_{j=k+1}^{N} \frac{q_k q_j}{4\pi\varepsilon_0 r_{kj}},   (2)

where: the first term represents the bond stretching component energy and k_j^b is a bond stretching force constant, d_j is the distance between two atoms (the real bond length), and d_j^0 is the optimal bond length; the second term represents the angle bending component energy and k_j^a is a bending force constant, θ_j is the actual value of the valence angle, and θ_j^0 is the optimal valence angle; the third term represents the torsional angle component energy and V denotes the height of the torsional barrier, p is the periodicity, ω is the torsion angle, and γ is a phase factor; the fourth term represents the van der Waals component energy described by the Lennard-Jones potential and r_kj denotes the distance between atoms k and j, σ_kj is the collision diameter, ε_kj is the well depth, and N is the number of atoms in the structure A^N; the fifth term represents the electrostatic component energy and q_k, q_j are atomic charges, r_kj denotes the distance between atoms k and j, ε_0 is the dielectric constant, and N is the number of atoms in the structure A^N.

Fig. 3 Overview of protein construction. Visible atoms form the main chain of the polypeptide; side chains are marked as R0, R1, R2
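To make the nonbonded part of Eq. (2) more tangible, the sketch below shows how the last two terms (Lennard-Jones and electrostatic energy) could be computed for a set of atoms. This is only an illustration in Python, not the Cloud4PSP implementation; the pairwise parameters sigma and epsilon and the atomic charges are assumed to come from a force field parameter set such as [37], and the constant 332.06 kcal·Å/(mol·e²) stands in for 1/(4πε0) in those units.

```python
import math

def nonbonded_energy(coords, sigma, epsilon, charges, coulomb_k=332.06):
    """Illustrative sketch of the van der Waals (Lennard-Jones) and
    electrostatic terms of Eq. (2), summed over all atom pairs k < j.
    coords: list of (x, y, z) atom positions; sigma/epsilon: pairwise
    force field parameters; charges: atomic charges; coulomb_k stands
    in for 1/(4*pi*eps0) in kcal*A/(mol*e^2) (assumed units)."""
    n = len(coords)
    e_vdw = e_elec = 0.0
    for k in range(n):
        for j in range(k + 1, n):
            r_kj = math.dist(coords[k], coords[j])   # distance between atoms k and j
            s_over_r = sigma[k][j] / r_kj
            e_vdw += 4.0 * epsilon[k][j] * (s_over_r ** 12 - s_over_r ** 6)
            e_elec += coulomb_k * charges[k] * charges[j] / r_kj
    return e_vdw + e_elec
```

The bonded terms (bond stretching, angle bending, torsions) would be accumulated analogously over the lists of bonds, valence angles and torsions of the conformation.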

The Warecki-Znamirowski (WZ) algorithm minimizes expression (2) for the current values of the variables defining the degrees of freedom, with referential values of the variables taken from force field parameter sets [37],

After each modification of a torsion angle, the conformational energy E(A^N) is calculated once again, and if it is lower than the current minimum E′(A^N)

Fig. 4 Architecture of the Cloud4PSP - a cloud-based solution for ab initio protein structure prediction. The figure shows the Web role (the user's entry point), the PredictionManager role, the Input and Output queues, the Prediction Input and Prediction Output queues, the PredictionWorker roles, the SQL Azure database with descriptions of prediction results, and the Azure BLOBs storing predicted 3D protein structures (PDB files), all hosted in the Microsoft Azure public cloud

appropriate queues. Roles have access to various storage resources, including Azure BLOBs (for PDB files with predicted 3D structures) and the Azure SQL Database (for descriptions of the results).

The Web role provides a Web site which allows users to interact with the entire system. Through the Web role users input the amino acid sequence of the protein whose structure is to be predicted. They also choose the prediction method (only one is available at the moment, but the system provides programming interfaces allowing us to bind new methods), specify its parameters, e.g. the number of iterations (random tries of the modified Monte Carlo method), and provide a short description of the input molecule.

The PredictionManager role distributes the prediction process across many instances of the PredictionWorker role. In other words, the PredictionManager implements the entire logic of how the prediction process is parallelized. This may depend on the prediction method used. Algorithm 1 describes how calculations are scheduled and parallelized by the PredictionManager role, with part of the code adapted to the specificity of the WZ prediction algorithm used. When the prediction is performed according to the WZ algorithm, the entire prediction process (in the system also called a prediction job) consists of a vast number of iterations (random selections of torsion angles) that, in Cloud4PSP, are performed in parallel by many instances of the PredictionWorker role. The number of iterations is specified by the user through the Web role. In the configuration of the prediction job, the user specifies the number of tasks (#tasks) and the number of tries performed in each task (#iterations). This gives flexibility in choosing the stopping criterion and flexibility in configuring prediction jobs (and, in consequence, the number of candidate structures after the prediction, since in this implementation of the parallel prediction only the best conformations generated in Phase I by each task are adjusted in Phase II and these tuned conformations are returned). The total number of iterations performed within the prediction job (iTotal) is the product of the number of tasks (#tasks) and the number of tries performed in each task (#iterations):

iTotal = #tasks × #iterations.   (5)

The PredictionManager role creates a pool of #tasks tasks (lines 6–7). Descriptions of these tasks, which contain the information on the prediction algorithm, its parameters, the random seed used to initialize a pseudorandom number generator for the modified Monte Carlo method, and the number of tries #iterations, are stored in the Prediction Input queue (lines 11–12). Prediction tasks are then consumed by instances of the PredictionWorker role.

Instances of the PredictionWorker role act according to Algorithm 2. They consume messages from the Prediction Input queue (lines 2–4), execute the chosen prediction method, generate successive protein structures, calculate potential energies for them (lines 6–7), store the protein structures in PDB files, and then save these files in Azure BLOBs (line 8). The set of PredictionWorker role instances R_PW is defined as follows:

R_PW = {r_PWi | i = 1, ..., m},   (6)

where r_PWi is a single instance of the PredictionWorker role, and m is the number of PredictionWorker role instances working in the system. Usually:

#tasks ≫ m.   (7)

Steering instructions that control the entire system and the course of action inside the Cloud4PSP are transferred as messages through the queueing system. There are four queues present in the architecture of the Cloud4PSP:
– Input queue - collects structure prediction requests (Prediction Job descriptions) generated by users through the Web site.
– Output queue - collects notifications of prediction completion for particular users (based on a generated token number).
– Prediction Input queue - transfers parameters of the prediction process for a single instance of the PredictionWorker role, among others: the amino acid sequence provided by the user, the number of iterations #iterations to be performed, and parameters of the prediction method.
– Prediction Output queue - transfers descriptions of results, e.g. the PDB file name with the protein structure and the potential energy value for the structure, from each instance of the PredictionWorker role back to the PredictionManager role.

The Input queue and Prediction Input queue are very important for buffering prediction requests and steering the prediction process. The prediction process may take several hours and is performed asynchronously. A user's requests are placed in the Input queue with the typical FIFO discipline and they are processed when there are idle instances of the PredictionWorker role that can realize the request. Many users can generate many prediction requests, so there must be a buffering mechanism for these requests. Queues fulfill this task perfectly. One prediction request enqueued in the Input queue causes the creation of many prediction orders (tasks) for instances of the PredictionWorker role.
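A rough sketch of this request-to-tasks expansion is shown below. It is our Python illustration of Eq. (5) and of the message flow, with a generic FIFO queue standing in for the Azure Prediction Input queue; the field names of the task description are hypothetical and do not reflect the actual Cloud4PSP message format.

```python
import json
import random
from queue import Queue

def expand_prediction_job(sequence, n_tasks, iterations_per_task,
                          prediction_input_queue, method="WZ"):
    """Create #tasks task descriptions for one prediction job and
    enqueue them for PredictionWorker instances.
    iTotal = n_tasks * iterations_per_task, as in Eq. (5)."""
    for task_id in range(n_tasks):
        task = {
            "task_id": task_id,
            "method": method,                    # prediction algorithm to execute
            "sequence": sequence,                # amino acid sequence from the user
            "iterations": iterations_per_task,   # Monte Carlo tries in Phase I
            "seed": random.randrange(2 ** 31),   # seed for the pseudorandom generator
        }
        prediction_input_queue.put(json.dumps(task))
    return n_tasks * iterations_per_task         # iTotal

# Example: a job of 40 tasks x 100 iterations gives iTotal = 4000 conformations
# (an arbitrary 20-residue example sequence is used here).
queue = Queue()
i_total = expand_prediction_job("ACDEFGHIKLMNPQRSTVWY", 40, 100, queue)
```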

Fig. 5 Cloud4PSP processing model showing the components and objects involved in the prediction process, and the flow of messages and data between components of the system (workflow steps 1–12, from configuring a Prediction Job in the Web role, through Prediction Task execution on PredictionWorker nodes and storage in Azure BLOBs and the Azure SQL Database, to retrieving the results)

The PredictionManager role consumes prediction requests from the Input queue, generates many prediction tasks, each for performing the specified number of iterations by the PredictionWorker roles available in the system, and sends task description messages to the Prediction Input queue. The Prediction Output queue is used by instances of the PredictionWorker role to return descriptions of results to the PredictionManager role. Such feedback allows the PredictionManager role to control whether all the prediction tasks have been completed. In the case of failure to realize a prediction task, its completion can be delegated to another PredictionWorker instance. The Output queue allows the Web role to be notified that the whole prediction process is already completed.

2.3 Cloud4PSP Processing Model

While building the Cloud4PSP, we have designed and made available a dedicated processing model (Fig. 5) that is used by the Cloud4PSP and can be used by developers extending the system in the future. In the Cloud4PSP processing model, every prediction request generated by a user causes the creation of a Prediction Job (step 1). This is a special object with the information on the entire prediction that must be performed. This Prediction Job is submitted for execution (step 2), i.e., it is serialized to a message that is placed in the Input queue. The PredictionManager role consumes Prediction Jobs, initializes them (step 3) and, based on their description and the available resources, creates so-called Prediction Tasks (step 4). These tasks are serialized to messages that are sent to the Prediction Input queue (step 5) and are then consumed by successive PredictionWorker role instances. PredictionWorker role instances execute tasks according to the description contained in the retrieved message and using an appropriate prediction method (step 6), and finally save the predicted protein structures in the Azure BLOB storage service (step 7).

Fig. 6 Interfaces, their implementations for PredictionManager and PredictionWorker in the form of base classes, and inheritance for classes representing a particular prediction method. DummyPredictionManager and DummyPrediction represent the sample test manager and prediction method; MonteCarloPredictionManager and MonteCarloPrediction reflect implementations of the scheduling/parallelization and prediction procedure for the described WZ method

After completing the Prediction Task, the PredictionWorker confirms completion of the task and sends a description of the generated protein structures to the PredictionManager (step 8). The PredictionManager saves the results in the database managed by the Azure SQL Database (step 9). By using an additional register stored in the database, the PredictionManager maintains awareness of the progress of the Prediction Job execution. This emphasizes the importance of the confirmations obtained in step 8. When the PredictionManager states that all tasks have been completed, it finalizes the job by setting an appropriate property in the job's register in the database (step 10) and sends a notification to the Web role through the Output queue (step 11). The Output queue stores notifications together with the token number generated for the Prediction Job at the beginning of the prediction process, i.e., when submitting the job for execution. The Web role periodically checks the Output queue for messages reporting the completion of the Prediction Job and the statuses of prediction tasks in the database (step 12). Partial results (results of already finished tasks) are retrieved from the database. In this way, the user knows whether the prediction process is completed or is still in progress.

Such a processing model provides a kind of abstraction layer for developers of the system, hides some implementation details, e.g., the implementation of communication and the serialization of jobs and tasks, and finally, provides a framework for future extensions of the system.
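For developers who want a compact mental model of the worker side of this workflow (steps 5–8), the following Python pseudo-implementation summarizes one PredictionWorker instance. The helper functions run_prediction and save_pdb_to_blob are assumed for illustration and are not part of the real API; the actual roles are implemented as Azure worker roles communicating through Azure Queues and BLOB storage.

```python
import json

def prediction_worker_loop(prediction_input_queue, prediction_output_queue,
                           run_prediction, save_pdb_to_blob):
    """Sketch of a single PredictionWorker instance: consume task
    messages, execute the prediction method, store structures in BLOB
    storage, and report result descriptions back to the manager."""
    while True:
        message = prediction_input_queue.get()       # step 5: receive a task description
        task = json.loads(message)
        structure, energy = run_prediction(          # step 6: Phase I sampling + Phase II tuning
            task["sequence"], task["iterations"], task["seed"])
        pdb_name = f"prediction_{task['task_id']}.pdb"
        save_pdb_to_blob(pdb_name, structure)        # step 7: save the PDB file to Azure BLOBs
        prediction_output_queue.put(json.dumps({     # step 8: confirm the task and describe results
            "task_id": task["task_id"],
            "pdb_file": pdb_name,
            "energy": energy,
        }))
```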
2.4 Extending Cloud4PSP

Although Cloud4PSP has been designed to work on the Microsoft Azure Cloud, its architecture and workflow have a general character and can also be implemented on other Clouds. This, however, would require adoption of the programming model specific to the Cloud provider and the tailoring of elements of the architecture to the available compute resources. The architecture of Cloud4PSP could also be adapted and extended to hybrid clouds by combining on-premises compute resources and Microsoft Azure cloud resources. This would require some additional effort, related to moving the PredictionManager to the on-premises cluster or private cloud and parceling out prediction tasks across the available resources. However, this would also allow for the use of public cloud resources to satisfy potentially increased demand, especially for compute units. If the system needs some extra compute power, e.g., due to an excessively long and computationally demanding prediction process, part of the prediction job (some prediction tasks) could be scheduled for execution on the Azure cloud. Then, after the prediction process is completed, these commercial compute resources could be released.

Cloud4PSP has been developed in such a way that it allows developers to extend its functionality by adding new modules. This is important, for example, in situations when new prediction algorithms have to be added to the system. This determines the openness of the system. Cloud4PSP has been developed mainly as software, which places it in the SaaS layer of the cloud stack [47]. However, extensions are possible not only by using the processing model presented in Section 2.3, but also through a set of programming interfaces provided for almost all components of the system. In Fig. 6 we show the interfaces, their implementations for PredictionManager (for the parallelization logic) and PredictionWorker (for the prediction method) in the form of base classes, and the inheritance for classes representing particular prediction methods. By using these base classes, advanced users and programmers can plug in their own prediction methods (by inheritance from the PredictionBase class) and their own computation scheduling algorithms (by inheritance from the PredictionManagerBase class).
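As a loose, language-neutral analogy of that extension mechanism (the real interfaces are .NET base classes, as shown in Fig. 6), a developer-supplied method would be plugged in roughly as follows; the class and method names below are illustrative stand-ins rather than the actual Cloud4PSP API.

```python
class PredictionBase:
    """Stand-in for the PredictionBase class: a prediction method
    consumes a task description and returns candidate structures."""
    def predict(self, task):
        raise NotImplementedError

class PredictionManagerBase:
    """Stand-in for the PredictionManagerBase class: the scheduling
    logic that splits a prediction job into prediction tasks."""
    def create_tasks(self, job):
        raise NotImplementedError

class MyNewPrediction(PredictionBase):
    """A hypothetical prediction method added by a developer; it only
    has to fulfil the predict() contract to be usable by the workers."""
    def predict(self, task):
        structures = []  # ...generate and score conformations for task["sequence"]...
        return structures
```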

2.5 Scaling the Cloud4PSP

The scalability of a system means that it is able to handle a growing amount of work in a capable manner or is able to be enlarged to accommodate that growth [5]. Cloud4PSP has this property. Cloud4PSP is a cloud-based, high performance and highly scalable system for 3D protein structure prediction. The scalability of the system is provided by the Microsoft Azure cloud platform. The system can be scaled up (vertical scaling) or scaled out (horizontal scaling). Microsoft Azure allows us to combine both types of scaling.

Cloud4PSP can be scaled out and scaled up during the protein structure prediction process. Scaling out means changing the value of m in (6). Scaling up implies changing the size of the PredictionWorker role:

size(r_PWi) ∈ {XS, S, M, L, XL},   (8)

where XS denotes the ExtraSmall size, S - Small, M - Medium, L - Large, and XL - ExtraLarge. The sizes of the virtual machines for PredictionWorker role instances are described in Table 1 and are consistent with the parameters of the computation units provided.

In the Cloud4PSP we also made the following assumption regarding all instances of the PredictionWorker role:

\forall r_{PW_i}, r_{PW_j} \in R_{PW},\ i, j \in \{1, \ldots, m\},\ i \neq j:\ size(r_{PW_i}) = size(r_{PW_j}).   (9)

The assumption of the same size for all instances of the PredictionWorker role is motivated by the ease of configuration and scaling of the system and by the ease of analysis of its scalability.

The number of instances m of the PredictionWorker role depends on the configuration of the Cloud4PSP. It can be measured in thousands. The value of m is limited only by the user's subscription for Microsoft Azure cloud resources and the number of compute units available in the Azure cloud.

3 Results

Prediction of 3D protein structures from scratch with the use of the WZ method requires many Monte Carlo iterations, which takes some time. The number of iterations, and therefore the total execution time, depend on the size of the protein modeled, i.e., the number of residues in the input sequence. During our tests we successfully modeled a 20 amino acid long part of the nonstructural protein NSP1 molecule from the Semliki Forest virus with an RMSD of 0.99 Å. The NMR structure of the molecule (PDB ID code: 1FW5) [40] from the Protein Data Bank is shown in Fig. 7 (top). The structural alignment and superposition of the modeled molecule and the NMR molecule are presented in Fig. 7 (bottom). Modeling the 20 amino acid long part of the NSP1 protein was a sample computational procedure used for testing the efficiency of the designed Cloud4PSP architecture in the real (Microsoft Azure) cloud environment. Tests were performed with the use of twenty CPU cores.

Fig. 7 Crystal structure of the part of the NSP1 molecule (PDB ID code: 1FW5) [40] from the Semliki Forest virus that was modeled during the experiments: (top left) protein skeletal structure made by covalent bonds between atoms (sticks display mode in Jmol [29]), (top right) secondary structure showing the mainly α-helical character of this fragment (ribbon display mode in Jmol), (bottom) structural alignment and superposition of the modeled molecule and the NMR molecule

Two of these twenty cores, i.e., two compute units of the Small size, were allocated to allow the activity of the Web role and the PredictionManager role. The remaining eighteen cores were used for the activity of the PredictionWorker role instances and to test the scalability of the system.

In particular, we have tested:
– the vertical scalability of the system - the efficiency of the prediction process depending on the size of instances of the PredictionWorker role,
– the horizontal scalability of the system - the efficiency of the prediction process depending on the number of instances of the PredictionWorker role,
– the efficiency of the prediction process depending on the number of prediction tasks and the number of iterations performed by each task.

In all cases, the scalability has been determined on the basis of execution time measurements. We carried out at least two replicas of each measurement. The obtained results were then averaged, and in this form they are presented in the following sections. Averaged values of the measurements were also used to determine n-fold speedups for both scaling techniques.

3.1 Vertical Scalability

In the first phase of our tests, we wanted to check which of the sizes for the compute units offered by Microsoft Azure is most efficient. The basic tier for the Web and Worker roles consists of the following five sizes: ExtraSmall (XS), Small (S), Medium (M), Large (L), and ExtraLarge (XL). Their capabilities were described in Sect. 1.2 and briefly compared in Table 1. It is worth noting in Table 1 that, beginning from the Small size up to the ExtraLarge size, the amount of available memory and the number of cores double gradually. The ExtraSmall size provides one shared core and is generally used when testing applications that should work on the Cloud, so as not to generate unnecessary costs for the testing process.

While testing vertical scalability, we changed the size of the PredictionWorker role from ExtraSmall (one, shared CPU core) to ExtraLarge (eight CPU cores).

Fig. 8 Results of vertical scaling. Acceleration (n-fold speedup) of the 3D protein structure prediction (prediction job) as a function of the size of instances of the PredictionWorker role (XS: 1.00, S: 1.00, M: 1.99, L: 3.90, XL: 7.30)

Fig. 9 Results of vertical scaling. Average execution time for a single prediction task as a function of the size of the PredictionWorker role

Fig. 10 Results of horizontal scaling. Total execution time for a prediction job as a function of the number of instances of the PredictionWorker role (Small size)

Only one PredictionWorker was used in this test. The Prediction Job was configured to explore 80 000 random conformations (iTotal) that were performed in 16 prediction tasks. Multiple prediction tasks were assigned to those instances of the PredictionWorker that possessed more than one CPU core, proportionally to the number of cores (according to the rule: one core - one task). The input sequence contained 20 amino acids of the NSP1 molecule. Each task was configured to generate 5 000 random protein conformations (performing 5 000 iterations per task) in Phase I of the WZ method, and to tune only one best conformation in Phase II of the method (Δφ = Δψ = 8 degrees for k = 3, 2, 1 in successive adjustment iterations).

In Fig. 8 we show the n-fold speedup when scaling the system vertically. We observed that increasing the size of the PredictionWorker role from Small to Medium accelerated the prediction process almost two-fold. Above the Medium size (2 cores), the dynamics of the acceleration slowed down a bit - for the Large size (4 cores) the acceleration reached 3.90 and for the ExtraLarge size (8 cores) it reached 7.30.

When analyzing the results of the tests, we deduced that the slowdown in the acceleration dynamics was an effect of two factors. The first factor was an overhead caused by the necessity of handling multiple threads. For example, on ExtraLarge-sized PredictionWorkers we ran eight prediction tasks concurrently (8 CPU cores = 8 threads = 8 prediction tasks). The second factor was the cumulative I/O operations performed on a local hard drive by many threads of the PredictionWorker role. During computations, the PredictionWorker role makes use of various executable programs and additional files that must be read from the local hard drive. It also stores some intermediate results on the hard drive. Multiple threads of the role that run on the compute unit multiply these read/write operations and interfere with each other. Therefore, the more threads were running on the compute unit during our experiments, the more I/O operations were performed, which influenced the n-fold speedup. This is also visible when analyzing the average execution times for prediction tasks, as shown in Fig. 9.

In Fig. 9 we can clearly see that execution of a single prediction task was fastest when performed on Small-sized instances and slowest on ExtraLarge-sized instances of the PredictionWorker role. However, ExtraLarge instances could host the execution of eight prediction tasks at the same time, while Small-sized instances could host only one. Therefore, the total execution time for the whole prediction job performed on ExtraLarge instances was 7.30 times shorter than the same job performed on a single Small instance.

3.2 Horizontal Scalability

In the second series of our performance tests, we checked the horizontal scalability of the Cloud4PSP. Tests were performed in the Microsoft Azure cloud. During this series of tests we increased the number of instances of the PredictionWorker role from one to 18. Instances of the PredictionWorker role were hosted on compute units of the Small size (one CPU core). The Small-sized virtual machine represents the cheapest computational unit providing one unshared core, and is the smallest unit recommended for production workloads [49]. Moreover, as proved by the experimental results shown in Fig. 9, this size guaranteed the fastest processing of prediction tasks.

While testing the horizontal scalability, we used the same input sequence as in the vertical scalability tests. The sequence contained 20 amino acids of the NSP1 enzyme. The prediction was configured to explore 4 000 random conformations (iTotal) that were performed in 40 tasks. Each task was configured to generate 100 random protein conformations (performing 100 iterations per task) in Phase I of the WZ method, and to tune only one best conformation in Phase II of the method (Δφ = Δψ = 8 degrees for k = 3, 2, 1 in successive angle adjustment iterations).

In Fig. 10 we show the total execution time for the whole prediction as a function of the number of instances of the PredictionWorker role. The results obtained proved that the total prediction time can be significantly reduced by increasing the number of compute units. Upon increasing the number of instances from one to 18, we reduced the prediction time from 9 h and 31 min to only 39 min.

Based on the execution time measurements that we obtained during the performance tests, we calculated the n-fold speedup for various configurations of the Cloud4PSP with respect to a single-instance configuration. Figure 11 shows how the n-fold speedup changes as a function of the number of instances of the PredictionWorker role.

Fig. 11 Results of horizontal scaling. Acceleration (n-fold speedup) of the 3D protein structure prediction as a function of the number of instances of the PredictionWorker role (Small size): 1.00, 2.04, 3.98, 7.45, 13.08, and 14.52 for 1, 2, 4, 8, 16, and 18 instances, respectively

We noticed that employing two instances of the PredictionWorker role increased the speed of the prediction process more than two-fold. Adding more instances of the role proportionally accelerated the process. Finally, the acceleration ratio (n-fold speedup) reached the level of 14.52 when the number of instances of the PredictionWorker role was increased from one to 18. A slowdown of the acceleration dynamics was observed when using more than four compute units. This was caused by the uneven processing times of individual tasks - the average execution time per task was 840 s, with a minimum of 340 s and a maximum of 1722 s. Moreover, we have to remember that the execution times and n-fold speedup were measured for the whole system. Therefore, the execution times covered not only the execution of particular prediction tasks, but also the processing and storing of the results of these tasks by the PredictionManager role. Consequently, by increasing the degree of parallelism we raised the pressure on the PredictionManager, which serially processed the incoming prediction results.

An interesting question is also why the processing times of tasks are so different. The answer lies in the prediction method itself, and specifically in its second phase. Phase I is fairly predictable in terms of the speed of generating random protein conformations and calculating energies for them (Fig. 12). The problem lies in the second phase, which can take much longer, depending on the length of the input sequences, the tuning parameters, and the course of the tuning itself, which depends on the conformation generated in the first phase. We observed that for the tested input sequence Phase I took 5 % and Phase II took 95 % of the whole execution time.

3.3 Influence of the Task Size

Since in Phase II of the WZ method angle adjustment is only performed for the structure with the lowest energy, while performing our experiments we expected that the total execution time of the prediction process might depend on the configuration of the prediction tasks. We decided to verify how the task size (i.e. the number of iterations) and the entire job configuration influence the whole prediction time and how strong that influence is. For this purpose, we decided to explore 4 000 random conformations (iTotal) in the prediction process, for the same input sequence as in the vertical and horizontal scalability tests. The sequence contained 20 amino acids of the NSP1 molecule.

Fig. 12 Efficiency of a single compute unit in the first phase of the prediction process. Dependency between the execution time and the number of randomly generated structures (Monte Carlo trials, from 1 000 to 1 000 000) for various sizes of compute units (XS & S, M, L, XL)

We tested nine different configurations of the prediction job. Leaving the number of random conformations (iTotal) constant, we appropriately selected the number of iterations per task and the number of tasks. Phase II of the WZ method was set up to tune only one best conformation for Δφ = Δψ = 8 degrees for k = 3, 2, 1 in successive angle adjustment iterations. Tests were performed using sixteen Small-sized instances of the PredictionWorker role.

The total execution times for various configurations of the prediction job are shown in Fig. 13. As can be observed, the most efficient is the configuration with the lowest number of tasks and the highest number of iterations (16t × 250i). The prediction process for this configuration took 19 min and 25 s. The configuration with the largest number of small tasks (800t × 5i) proved to be extremely inefficient. The prediction process for this configuration took 11 h and 42 min. The reasons for such behavior of the system are the same as those already discussed in the previous section. The most time-consuming phase of the WZ method is Phase II, which usually takes 90–98 % of the task execution time. By configuring the system to use many small prediction tasks, we multiplied the execution of the longest phase, which resulted in a long execution time for the whole prediction job.

For a small number of tasks it is advisable to choose the number of tasks as a multiple of the number of instances of the PredictionWorker role. For a larger number of tasks (if #tasks > 2m) the rule does not apply, since the execution times for particular tasks are different.
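The rule of thumb above can be written down as a small helper. The function below is our illustration only (it is not part of Cloud4PSP): it picks a (#tasks, #iterations) pair so that #tasks is a multiple of the number of PredictionWorker instances while preserving iTotal from Eq. (5).

```python
def job_configuration(i_total, n_instances, tasks_per_instance=1):
    """Choose #tasks as a multiple of the number of worker instances
    and derive #iterations so that i_total = #tasks * #iterations."""
    n_tasks = n_instances * tasks_per_instance
    if i_total % n_tasks != 0:
        raise ValueError("iTotal must be divisible by the chosen number of tasks")
    return n_tasks, i_total // n_tasks

# Example: 4000 conformations on sixteen Small instances -> the 16t x 250i
# configuration, which was the most efficient one in Fig. 13.
print(job_configuration(4000, 16))   # (16, 250)
```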

Fig. 13 Dependency between execution time and configuration of the prediction job (#tasks × #iterations) for a constant number of randomly generated structures (Monte Carlo trials, iTotal = 4000) for sixteen instances of the PredictionWorker role. Tested configurations: 16t × 250i, 32t × 125i, 40t × 100i, 50t × 80i, 80t × 50i, 100t × 40i, 200t × 20i, 400t × 10i, and 800t × 5i

It is difficult to explicitly and clearly define the best configuration of the prediction job. It depends on many factors, including the purpose of the prediction process, the size of the input molecule, and others. However, choosing a low number of tasks and a large number of iterations has the following consequences:
– the efficiency of the prediction process is higher,
– users obtain fewer candidate structures after the prediction is finished, and some good candidates can be missed,
– fewer prediction results are stored in the database and fewer 3D structures are stored in BLOBs, which lowers the long-term costs of storage in the Cloud,
– the cost of the computation process is lower due to the higher efficiency.

On the other hand, choosing a large number of tasks and a low number of iterations has the following consequences:
– the efficiency of the prediction process is lower,
– users obtain more candidate protein structures after the prediction process, which increases the chance of getting the global minimum of the energy function,
– more prediction results are stored in the database and more candidate 3D structures are stored in BLOBs, which increases the long-term costs of storage in the Cloud,
– the cost of the computation process is higher due to the lower efficiency.

3.4 Scale Up, Scale Out or Combine?

Microsoft Azure allows us to scale out and scale up applications and systems deployed in the Cloud. The choice is left to the developer and administrator of the system.
3.4 Scale Up, Scale Out or Combine?

Microsoft Azure allows us to scale out and scale up applications and systems deployed in the Cloud. The choice is left to the developer and administrator of the system. In both cases, the system must be properly implemented to utilize the resources of the multi-core compute unit (when scaling up) and to distribute, and then process, parts of the main job by many instances of the Worker role (when scaling out). The workload associated with developing the system towards each of the scaling types is difficult to express in numbers. However, there are some technical reasons that speak in favor of particular scaling techniques. Moreover, we have conducted a series of tests in order to compare similar configurations of the Cloud4PSP architecture and answer which of the configurations is more efficient.

We tested various configurations of the Cloud4PSP by changing the size of the PredictionWorker role from ExtraSmall (one, shared CPU core) to ExtraLarge (eight CPU cores) and by changing the number of instances of the role from one to eighteen (if possible, depending on the size of the compute unit). The prediction was configured to explore 100 000 random conformations (iTotal), performed in 18 tasks. Multiple prediction tasks were assigned to instances that possessed more than one CPU core, proportionally to the number of cores (according to the rule: one core – one task). The input sequence contained 20 amino acids of the NSP1 enzyme. Each task was configured to generate 5 000 random protein conformations (5 000 iterations per task) in Phase I of the WZ method, and to tune only the single best conformation in Phase II of the method (φ = ψ = 8 degrees for k = 3, 2, 1 in successive adjustment iterations).
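The sketch below illustrates, in schematic form, what a single prediction task of this kind does: Phase I draws random backbone conformations and keeps the lowest-energy one, and Phase II locally adjusts the torsion angles of that conformation over a few rounds with a shrinking step. The energy function is a placeholder, the step schedule is only our simplified reading of the Phase II parameters quoted above, and the code is an illustration rather than the implementation used in Cloud4PSP.

import random

def run_prediction_task(n_residues, n_iterations, energy,
                        start_step_deg=8.0, tuning_rounds=3):
    """Schematic single task: Phase I Monte Carlo sampling, Phase II local tuning.
    'energy' is any callable that scores a list of (phi, psi) torsion angles."""
    # Phase I: generate n_iterations random conformations and keep the best one.
    best, best_energy = None, float("inf")
    for _ in range(n_iterations):
        conformation = [(random.uniform(-180.0, 180.0), random.uniform(-180.0, 180.0))
                        for _ in range(n_residues)]
        e = energy(conformation)
        if e < best_energy:
            best, best_energy = conformation, e
    # Phase II: tune only the best conformation; here the angle step simply
    # halves in each successive adjustment round.
    step = start_step_deg
    for _ in range(tuning_rounds):
        for i in range(n_residues):
            for d_phi in (-step, 0.0, step):
                for d_psi in (-step, 0.0, step):
                    candidate = list(best)
                    phi, psi = candidate[i]
                    candidate[i] = (phi + d_phi, psi + d_psi)
                    e = energy(candidate)
                    if e < best_energy:
                        best, best_energy = candidate, e
        step /= 2.0
    return best, best_energy

Any real energy function (e.g., a molecular mechanics force field) can be plugged in through the energy callable; with n_iterations = 5 000 the sketch mirrors the per-task workload described above.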
We tested the following configurations of the Cloud4PSP with respect to resource consumption by the PredictionWorker role:

– one to eighteen ExtraSmall and Small compute units,
– one to nine Medium compute units,
– one to four Large compute units,
– one to two ExtraLarge compute units.

In Fig. 14 we show the number of prediction tasks completed within one hour as a function of the number of instances of the PredictionWorker role for various sizes of the instances. The results of the tests indicate a slight advantage of horizontal scaling over vertical scaling. Comparing both scaling techniques, we noted that the number of prediction tasks completed within one hour by eight Small (1-core) instances was higher than the number completed by one ExtraLarge (8-core) instance. Similar behavior was observed when comparing four Small instances vs. one Large (4-core) instance, and two Small instances vs. one Medium (2-core) instance, but the differences were not as significant. We could draw the same conclusion from the charts showing n-fold speedups for vertical scaling (presented in Fig. 8) and horizontal scaling (presented in Fig. 11). For example, the acceleration ratio was slightly better for eight Small instances of the PredictionWorker role (7.45) than for a single ExtraLarge (8-core) instance (7.30).

When combining both scaling techniques, our results again speak in favor of horizontal scaling with the use of Small-sized instances, e.g., eighteen Small instances vs. nine Medium instances, sixteen Small instances vs. two ExtraLarge instances, eight Small instances vs. two Large instances, etc. The differences are not significant, but they increase with the number of simulation hours. We must also remember that, although in Fig. 14 we present measurements for particular parameters of the prediction job, the entire experiment reveals the relationships between particular configurations of the Cloud4PSP. The actual execution times depend on the size of the input sequence, the number of Monte Carlo trials, the number of tasks, and the parameters of the WZ method.

It should also be noted that horizontal scaling is easier to implement. Once the Cloud4PSP system is designed to be scaled horizontally, scaling can be done at any moment after the system is published in the Cloud, and only if the system requires more resources (more compute units). Moreover, it can be done not only manually by the administrator of the system, but also automatically, based on statistics of system performance and resource usage indicators.
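The n-fold speedup quoted above is simply the ratio of the single-unit execution time to the execution time of the scaled configuration, and the tasks-per-hour metric of Fig. 14 follows directly from the per-task time and the number of tasks running in parallel. The helper below restates both quantities; the names and the example values are illustrative, not additional measurements.

def speedup(time_on_one_core_unit, time_on_scaled_setup):
    """n-fold speedup of a scaled configuration (scale-out or scale-up)."""
    return time_on_one_core_unit / time_on_scaled_setup

def tasks_per_hour(task_time_hours, parallel_tasks):
    """Throughput metric used in Fig. 14: tasks completed within one hour."""
    return int(parallel_tasks / task_time_hours)

# Example (illustrative numbers): a configuration running 8 tasks in parallel,
# each taking half an hour, completes 16 tasks per hour.
print(tasks_per_hour(task_time_hours=0.5, parallel_tasks=8))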

Fig. 14 Comparison of the efficiency of horizontal and vertical scaling, and the combination of both scaling techniques. The number of prediction tasks completed within one hour as a function of the number of instances (1–18) of the PredictionWorker role for various sizes of the instances: ExtraSmall (XS), Small (S), Medium (M), Large (L), and ExtraLarge (XL). Measurements for ExtraSmall (XS) and Small (S) sizes are similar.

4 Discussion

Building Cloud computing services for the prediction of 3D protein structures, such as the presented Cloud4PSP, responds to the current demand for widely available and scalable platforms solving these types of difficult problems. This is very important not only for scientists and research laboratories trying to predict a single 3D protein structure, but also for the biotechnology and pharmaceutical industry producing new drug solutions for humanity.

Cloud4PSP is such a platform, allowing a highly scalable prediction process in the Cloud, with all its advantages and drawbacks. Cloud4PSP was originally designed to be deployed in Microsoft's commercial cloud, where computing resources can be dynamically allocated as needed. As we have shown, the prediction process can be scaled up by adding more powerful computing instances, or scaled out by allocating more computing units of the same types. The results of our experiments have shown that scaling out by adding more Small-sized computing units turned out to be more effective and to give more development flexibility than scaling up (the acceleration rate for 8 cores was 7.45 when scaling out and 7.30 when scaling up). However, in the case of big molecules being modeled, we may be forced to use larger compute units, e.g. Medium, Large, or ExtraLarge, and to combine horizontal and vertical scaling. Delivering Cloud4PSP as a service relieves bio-oriented companies of maintaining their own IT infrastructures, which can now be outsourced from the cloud provider. This has another positive consequence - a low barrier to entry. Even a laboratory or a company that cannot afford to possess a high-performance computational infrastructure can outsource it in the Cloud and perform a prediction process starting with several computation units, then grow, if needed, and scale the system at any time.

On the other hand, important aspects of processing data in the Cloud are privacy and data protection. As rightly pointed out by Gesing et al. [20], both the input and output data of molecular simulations are sensitive and may constitute valuable intellectual property. Therefore, they need to be stored securely. Cloud security is still an evolving domain, and defensive mechanisms are implemented on various levels of the functioning of a cloud-based system. In order to protect the systems deployed in the Cloud, Microsoft Azure provides a rich set of security elements, including SSL mutual authentication mechanisms while accessing various components of the system, encryption of data stored in Azure Storage and encryption of data transmission, firewalled and partitioned networks to help protect against unwanted traffic from the Internet, and many others. Many of these security elements are used by Cloud4PSP explicitly, and many of them implicitly, e.g., secure access to Azure BLOBs using HTTPS, secure authentication when accessing SQL Azure, where prediction results are stored, and SSL-based communication between internal components. In terms of data privacy, Microsoft is committed to safeguarding the privacy of data and follows the provisions of the E.U. Data Protection Directive. Users of the Azure cloud, i.e., cloud application developers, may specify the geographic areas where the data are stored and replicated for redundancy.

Also, the distribution and deployment model of the Cloud4PSP supports privacy and data protection. On the deployment level, various users and companies are expected to establish their private Cloud4PSP-based clusters in the Azure cloud for their own simulations. This ensures separation of data and computations performed by two companies, for which the predicted 3D structures of proteins may be highly protected intellectual property. Finally, on the application level, Cloud4PSP provides an additional security mechanism of 128-bit GUID (Globally Unique Identifier) based tokens for protecting access to the results of prediction processes.
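A minimal sketch of such token-based protection is given below: a random 128-bit GUID is issued when a prediction job is accepted, and results are returned only to callers presenting the same token. It uses only Python's standard uuid module; the class and method names are ours, and the actual token handling inside Cloud4PSP may differ.

import uuid

class ResultAccessGuard:
    """Illustrative 128-bit GUID token guard for prediction results."""

    def __init__(self):
        self._tokens = {}                      # prediction job id -> issued token

    def issue_token(self, job_id):
        token = uuid.uuid4()                   # random 128-bit GUID
        self._tokens[job_id] = token
        return token

    def fetch_results(self, job_id, token, results_store):
        """Return stored results only if the presented token matches the issued one."""
        if self._tokens.get(job_id) != token:
            raise PermissionError("invalid or missing access token")
        return results_store[job_id]

Because the token is generated server-side and is never derived from user data, guessing it is practically infeasible, which is the point of using a full 128-bit GUID.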

Comparing Cloud4PSP to other solutions mentioned in the Introduction section, we can draw the following conclusions. Firstly, Cloud4PSP, in terms of what it offers to the end user, is a typical Software as a Service (SaaS) solution, and also an extensible computational framework, where advanced users may add their own prediction methods. This is the fundamental feature that differentiates our cloud-based software from Cloud BioLinux. Cloud BioLinux provides a virtual machine and, in the context of the presented service stack, functions in the Infrastructure as a Service (IaaS) model, providing pre-configured software on a pre-configured Ubuntu operating system and leaving the users the responsibility for maintaining the operating systems and configuring the best way to scale the available applications. The second fundamental difference between the two solutions is the software used for structure prediction. Cloud BioLinux is rich in various software packages, allowing us to perform many processes in the domain of bioinformatics. However, in terms of protein structure prediction, it represents a different approach to the problem. It relies on PredictProtein and provides tools for the prediction of secondary structures of macromolecules, not 3D structures. Unlike Cloud BioLinux, Cloud4PSP is focused on ab initio prediction methods for tertiary structure.

Considering these two mentioned features, i.e. the service model and the prediction methods, Cloud4PSP is more similar to Rosetta@Cloud and its AbinitioRelax application. In both systems prediction services are provided in the SaaS model. Although the two systems use different prediction methods, the implemented methods belong to the same ab initio class. Both systems are prepared to be published to the Cloud, giving users the ability to establish their private cloud-based clusters for 3D protein structure prediction, which is also important from the viewpoint of data protection. Rosetta@Cloud builds a cluster of compute units on Amazon Web Services (AWS), while Cloud4PSP works on the Microsoft Azure cloud. There are also some similarities and differences in the architectures of the two systems. Rosetta@Cloud uses a so-called Master Node that controls the distribution of workload among Worker Nodes. The Master Node in Rosetta@Cloud is a counterpart of the PredictionManager role in Cloud4PSP, and Worker Nodes are counterparts of instances of the PredictionWorker role. However, Rosetta's GUI for launching predictions is installed locally on the user's computer as the Rosetta@Cloud Launcher, i.e., it works outside the cloud infrastructure, while the Cloud4PSP GUI is designed to be available through a web site (provided by the Web role working inside the cloud), and therefore, users do not have to install anything locally.

Moreover, Cloud4PSP uses queues to buffer prediction requests. Applying queues has several advantages. Since queues provide an asynchronous messaging model, users of Cloud4PSP need not be online at the same time. Queues reliably store prediction requests as messages until the Cloud4PSP is ready to process them. Then, Cloud4PSP can be adjusted and scaled out according to the current needs. As the depth of the prediction request queue grows, more computing resources can be provisioned. Therefore, such an approach allows money to be saved, taking into account the amount of infrastructure required to service the application load. Moreover, the queue-based approach allows load balancing - as the number of requests in a queue increases, more instances of the PredictionWorker role can be added to accelerate the prediction. In both systems the prediction process can be measured and end-users can be charged based on metrics such as the number of computing instances, storage requirements, and data transfers.
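The queue-based design makes such elasticity easy to express: a simple controller only needs to inspect the depth of the prediction request queue and adjust the number of PredictionWorker instances accordingly. The sketch below shows this rule in a cloud-agnostic form; the thresholds, limits, and function names are our assumptions, and a real deployment would apply the chosen instance count through the cloud provider's management API.

def desired_worker_count(queue_depth, tasks_per_worker_per_hour,
                         min_workers=1, max_workers=18):
    """Choose a PredictionWorker instance count that can drain the queued
    prediction tasks in roughly one hour (illustrative heuristic; the upper
    limit echoes the instance counts used in our tests)."""
    needed = -(-queue_depth // tasks_per_worker_per_hour)    # ceiling division
    return max(min_workers, min(max_workers, needed))

def autoscale_step(queue_depth, current_workers, throughput_per_worker, scale_to):
    """One control step: compute the target instance count and request it."""
    target = desired_worker_count(queue_depth, throughput_per_worker)
    if target != current_workers:
        scale_to(target)        # e.g., a call into the cloud management API
    return target

The same rule can be evaluated periodically either by an operator or by an automatic monitor, which corresponds to the manual and automatic scaling mentioned in Section 3.4.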
On a critical note, we have to admit that, although Cloud4PSP implements a relatively new algorithm for 3D protein structure prediction, which is based on Monte Carlo simulations and structure refinement, at the current stage of development it is unable to catch up with Rosetta@Cloud in terms of the wealth of deployed software for predicting protein structures. However, the advantage of the WZ method used in the Cloud4PSP is that it eliminates the drawbacks of widely used gradient-based methods, and this brings it closer to finding the global minimum of the energy function.

Implementation of other prediction methods remains a perspective for future development of the system. Advanced users may also extend the system by adding their own, new prediction methods. We also have to mention that our tests were performed for the prediction of small molecular structures. For large proteins, the number of iterations needed to model the structure increases exponentially, and the process requires more computational resources than those which we had during our tests.

5 Summary

Predicting protein structures from scratch using ab initio methods is one of the most challenging tasks for modern computational biology. It requires huge computing resources that allow us to perform calculations in parallel in order to reduce the prediction time. Cloud computing provides theoretically unlimited computing resources in a pay-as-you-go model. This is a very important feature of the cloud architecture. Clouds provide substantial computing power that can be provisioned on demand and according to particular needs, which perfectly fits the character of the prediction process.

6 Availability

Cloud4PSP software is available at the project home page http://zti.polsl.pl/w3/dmrozek/science/cloud4psp.htm. The system can be used by the academic community for free, i.e., the software is free, but users have to lease the Cloud hardware infrastructure at their own cost in order to run the system. Users will have to configure and publish the system on the Azure cloud. Configuration details are provided at the Cloud4PSP project home page.

Further development of the system will be carried out by the Cloud4Proteins non-profit, scientific group (http://www.zti.aei.polsl.pl/w3/dmrozek/science/cloud4proteins.htm).

Acknowledgments We would like to thank Microsoft Research for providing us with free access to the computational resources of the Microsoft Azure cloud under the Microsoft Azure for Research Award program.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References

1. Angiuoli, S., Matalka, M., Gussman, A., Galens, K., et al.: CloVR: a virtual machine for automated and portable sequence analysis from the desktop using Cloud computing. BMC Bioinf. 12, 356 (2011)
2. Arnold, K., Bordoli, L., Kopp, J., Schwede, T.: The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling. Bioinformatics 22(2), 195–201 (2006)
3. Berman, H., et al.: The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000)
4. Bertis, V., Bolze, R., Desprez, F., Reed, K.: From dedicated Grid to Volunteer Grid: large scale execution of a bioinformatics application. J. Grid Comput. 7(4), 463–478 (2009)
5. Bondi, A.: Characteristics of scalability and their impact on performance. In: 2nd International Workshop on Software and Performance, WOSP 2000, pp. 195–203 (2000)
6. Case, D., Cheatham, T., Darden, T., Gohlke, H., Luo, R., Merz, K.J., Onufriev, A., Simmerling, C., Wang, B., Woods, R.: The Amber biomolecular simulation programs. J. Comput. Chem. 26, 1668–1688 (2005)
7. Chen, C., Huang, Y., Ji, X., Xiao, Y.: Efficiently finding the minimum free energy path from steepest descent path. J. Chem. Phys. 138(16), 164122 (2013)
8. Chen, H.Y., Hsiung, M., Lee, H.C., Yen, E., Lin, S., Wu, Y.T.: GVSS: a high throughput drug discovery service of Avian Flu and Dengue Fever for EGEE and EUAsiaGrid. J. Grid Comput. 8(4), 529–541 (2010)
9. Chivian, D., Kim, D.E., Malmström, L., Bradley, P., Robertson, T., Murphy, P., Strauss, C.E., Bonneau, R., Rohl, C.A., Baker, D.: Automated prediction of CASP-5 structures using the Robetta server. Proteins: Struct., Funct., Bioinf. 53(S6), 524–533 (2003)
10. Cornell, W.D., Cieplak, P., Bayly, C.I., Gould, I.R., Merz, K.M., Ferguson, D.M., Spellmeyer, D.C., Fox, T., Caldwell, J.W., Kollman, P.A.: A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. J. Am. Chem. Soc. 117(19), 5179–5197 (1995)
11. De Vries, S., van Dijk, A., Krzeminski, M., van Dijk, M., Thureau, A., Hsu, V., Wassenaar, T., Bonvin, A.: HADDOCK versus HADDOCK: new features and performance of HADDOCK2.0 on the CAPRI targets. Proteins 69, 726–733 (2007)
12. Edic, P., Isaacson, D., Saulnier, G., Jain, H., Newell, J.: An iterative Newton-Raphson method to solve the inverse admittivity problem. IEEE Trans. Biomed. Eng. 45(7), 899–908 (1998)
13. Emeakaroha, V.C., Maurer, M., Stern, P., Łabaj, P.P., Brandic, I., Kreil, D.P.: Managing and optimizing bioinformatics workflows for data analysis in clouds. J. Grid Comput. 11(3), 407–428 (2013)
14. Eswar, N., Webb, B., Marti-Renom, M.A., Madhusudhan, M., Eramian, D., Shen, M., Pieper, U., Sali, A.: Comparative Protein Structure Modeling Using MODELLER. Wiley, New York (2007)
15. Farkas, Z., Kacsuk, P.: P-GRADE portal: a generic workflow system to support user communities. Future Gener. Comput. Syst. 27(5), 454–465 (2011)
16. Ferrari, T., Gaido, L.: Resources and services of the EGEE production infrastructure. J. Grid Comput. 9, 119–133 (2011)
17. Fletcher, R., Powell, M.: A rapidly convergent descent method for minimization. Comput. J. 6(2), 163–168 (1963)
18. Frishman, D., Argos, P.: Seventy-five percent accuracy in protein secondary structure prediction. Proteins 27, 329–335 (1997)
19. Garnier, J., Gibrat, J., Robson, B.: GOR method for predicting protein secondary structure from amino acid sequence. Methods Enzymol. 266, 540–53 (1996)
20. Gesing, S., Grunzke, R., Krüger, J., Birkenheuer, G., Wewior, M., Schäfer, P., et al.: A single sign-on infrastructure for science gateways on a use case for structural bioinformatics. J. Grid Comput. 10, 769–790 (2012)
21. Gu, J., Bourne, P.: Structural Bioinformatics (Methods of Biochemical Analysis), 2nd edn. Wiley-Blackwell, Hoboken (2009)
22. Herrmann, T., Güntert, P., Wüthrich, K.: Protein NMR structure determination with automated NOE assignment using the new software CANDID and the torsion angle dynamics algorithm DYANA. J. Mol. Biol. 319, 209–227 (2002)
23. Hovmöller, S., Zhou, T., Ohlson, T.: Conformations of amino acids in proteins. Acta Cryst. D58, 768–776 (2002)
24. Hung, C.L., Hua, G.J.: Cloud Computing for protein-ligand binding site comparison. Biomed. Res. Int., 170356 (2013)
25. Hung, C.L., Lin, Y.L.: Implementation of a parallel protein structure alignment service on Cloud. Int. J. Genomics 439681, 1–8 (2008)
26. Hupfeld, F., Cortes, T., Kolbeck, B., Stender, J., Focht, E., Hess, M., et al.: The XtreemFS architecture - a case for object-based file systems in Grids. Concurrency Computat.: Pract. Exper. 20(17), 2049–2060 (2008)
27. Insilicos: Rosetta@Cloud: macromolecular modeling in the Cloud. Fact Sheet. https://rosettacloud.files.wordpress.com/2012/08/rc-fact-sheet bp5-en2a.pdf (2012). Accessed 9 March 2015
28. Jithesh, P., Donachy, P., Harmer, T., Kelly, N., Perrott, R., Wasnik, S., Johnston, J., McCurley, M., Townsley, M., McKee, S.: GeneGrid: architecture, implementation and application. J. Grid Comput. 4(2), 209–222 (2006)
29. Jmol Homepage: Jmol: an open-source Java viewer for chemical structures in 3D. http://www.jmol.org. Accessed 7 Sept 2015
30. Kacsuk, P., Farkas, Z., Kozlovszky, M., Hermann, G., Balasko, A., Karóczkai, K., Márton, I.: WS-PGRADE/gUSE generic DCI gateway framework for a large variety of user communities. J. Grid Comput. 10(4), 601–630 (2012)
31. Kaján, L., Yachdav, G., Vicedo, E., Steinegger, M., Mirdita, M., Angermüller, C., Böhm, A., Domke, S., Ertl, J., Mertes, C., Reisinger, E., Staniewski, C., Rost, B.: Cloud prediction of protein structure and function with PredictProtein for Debian. BioMed Res. Int. 2013(398968), 1–6 (2013)
32. Källberg, M., Wang, H., Wang, S., Peng, J., Wang, Z., Lu, H., Xu, J.: Template-based protein structure modeling using the RaptorX web server. Nat. Protoc. 7, 1511–1522 (2012)
33. Kanaris, I., Mylonakis, V., Chatziioannou, A., Maglogiannis, I., Soldatos, J.: HECTOR: enabling microarray experiments over the hellenic Grid infrastructure. J. Grid Comput. 7(3), 395–416 (2009)
34. Kelley, L., Sternberg, M.: Protein structure prediction on the Web: a case study using the Phyre server. Nat. Protoc. 4(3), 363–371 (2009)
35. Kessel, A., Ben-Tal, N.: Introduction to Proteins: Structure, Function, and Motion. Chapman & Hall/CRC Mathematical & Computational Biology, CRC Press, Boca Raton (2010)
36. Kim, D., Chivian, D., Baker, D.: Protein structure prediction and analysis using the Robetta server. Nucleic Acids Res. 32(Suppl 2), W526–31 (2004)
37. Kollman, P.: Advances and continuing challenges in achieving realistic and predictive simulations of the properties of organic and biological molecules. Acc. Chem. Res. 29, 461–469 (1996)
38. Krampis, K., Booth, T., Chapman, B., Tiwari, B., et al.: Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community. BMC Bioinf. 13, 42 (2012)
39. Laganà, A., Costantini, A., Gervasi, O., Lago, N.F., Manuali, C., Rampino, S.: COMPCHEM: progress towards GEMS a grid empowered molecular simulator and beyond. J. Grid Comput. 8(4), 571–586 (2010)
40. Lampio, A., Kilpeläinen, I., Pesonen, S., Karhi, K., Auvinen, P., Somerharju, P., Kääriäinen, L.: Membrane binding mechanism of an RNA virus-capping enzyme. J. Biol. Chem. 275(48), 37853–9 (2000)
41. Leach, A.: Molecular Modelling: Principles and Applications, 2nd edn. Pearson Education EMA, Essex (2001)
42. Leaver-Fay, A., Tyka, M., Lewis, S., Lange, O., Thompson, J., Jacak, R., et al.: ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules. Methods Enzymol. 487, 545–74 (2011)
43. Lesk, A.: Introduction to Protein Science: Architecture, Function, and Genomics, 2nd edn. Oxford University Press, NY (2010)
44. Lewis, S., Csordas, A., Killcoyne, S., Hermjakob, H., et al.: Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework. BMC Bioinf. 13, 324 (2012)
45. Lordan, F., Tejedor, E., Ejarque, J., Rafanell, R., Álvarez, J., Marozzo, F., Lezzi, D., Sirvent, R., Talia, D., Badia, R.M.: ServiceSs: an interoperable programming framework for the cloud. J. Grid Comput. 12(1), 67–91 (2014)
46. McKendrick, J.: Cloud computing market hot, but how hot? estimates are all over the map. http://www.forbes.com/sites/joemckendrick/2012/02/13/cloud-computing-market-hot-but-how-hot-estimates-are-all-over-the-map/ (2012). Accessed 24 Aug 2015
47. Mell, P., Grance, T.: The NIST definition of cloud computing. Special Publication 800-145. http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf (2011). Accessed 7 May 2015
48. Microsoft Azure Cloud Services Specification: Sizes for Cloud Services. https://azure.microsoft.com/pl-pl/documentation/articles/cloud-services-sizes-specs/. Accessed 7 May 2015
49. Microsoft Azure Cloud Services Specification: Sizes for virtual machines. https://azure.microsoft.com/pl-pl/documentation/articles/virtual-machines-size-specs/. Accessed 7 Sept 2015
50. Mrozek, D.: High-Performance Computational Solutions in Protein Bioinformatics. SpringerBriefs in Computer Science. Springer International Publishing (2014)
51. Mrozek, D., Kutyła, T., Małysiak-Mrozek, B.: Accelerating 3D protein structure similarity searching on Microsoft Azure Cloud with local replicas of macromolecular data. Parallel Processing and Applied Mathematics - PPAM 2015, Lecture Notes in Computer Science. Springer, Berlin Heidelberg (2015)
52. Mrozek, D., Małysiak-Mrozek, B., Kłapciński, A.: Cloud4Psi: cloud computing for 3D protein structure similarity searching. Bioinformatics 30(19), 2822–2825 (2014)
53. Pierce, L., Salomon-Ferrer, R., de Oliveira, C., McCammon, J., Walker, R.: Routine access to millisecond time scale events with accelerated molecular dynamics. J. Chem. Theory Comput. 8(9), 2997–3002 (2012)
54. Ponder, J.: TINKER - software tools for molecular design. Dept. of Biochemistry & Molecular Biophysics, Washington University, School of Medicine, St. Louis (2001)
55. Ramachandran, G., Ramakrishnan, C., Sasisekaran, V.: Stereochemistry of polypeptide chain configurations. J. Mol. Biol. 7, 95–9 (1963)
56. Rost, B., Liu, J.: The PredictProtein server. Nucleic Acids Res. 31(13), 3300–3304 (2003)
57. Schatz, M.C.: CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics 25(11), 1363–1369 (2009)
58. Schwieters, C., Kuszewski, J., Tjandra, N., Clore, G.: The Xplor-NIH NMR molecular structure determination package. J. Magn. Reson. 160, 65–73 (2003)
59. Shanno, D.: On Broyden-Fletcher-Goldfarb-Shanno method. J. Optimiz. Theory Appl., 46 (1985)
60. Shaw, D.E., Dror, R.O., Salmon, J.K., Grossman, J.P., Mackenzie, K.M., Bank, J.A., Young, C., Deneroff, M.M., Batson, B., Bowers, K.J., Chow, E., Eastwood, M.P., Ierardi, D.J., Klepeis, J.L., Kuskin, J.S., Larson, R.H., Lindorff-Larsen, K., Maragakis, P., Moraes, M.A., Piana, S., Shan, Y., Towles, B.: Millisecond-scale molecular dynamics simulations on Anton. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, pp. 39:1–39:11. ACM, New York (2009)
61. Shen, Y., Vernon, R., Baker, D., Bax, A.: De novo protein structure generation from incomplete chemical shift assignments. J. Biomol. NMR 43, 63–78 (2009)
62. Shirts, M., Pande, V.: COMPUTING: screen savers of the world unite! Science 290(5498), 1903–4 (2000)
63. Söding, J.: Protein homology detection by HMM-HMM comparison. Bioinformatics 21(7), 951–960 (2005)
64. Streit, A., Bala, P., Beck-Ratzka, A., Benedyczak, K., Bergmann, S., Breu, R., et al.: Unicore 6 - recent and future advancements. JUEL, 4319 (2010)
65. Van Der Spoel, D., Lindahl, E., Hess, B., Groenhof, G., Mark, A., Berendsen, H.: GROMACS: fast, flexible, and free. J. Comput. Chem. 26, 1701–1718 (2005)
66. Warecki, S., Znamirowski, L.: Random simulation of the nanostructures conformations. In: Proceedings of International Conference on Computing, Communication and Control Technology, The International Institute of Informatics and Systemics, Austin, Texas, vol. 1, pp. 388–393 (2004)
67. Wassenaar, T.A., van Dijk, M., Loureiro-Ferreira, N., van der Schot, G., de Vries, S.J., Schmitz, C., van der Zwan, J., Boelens, R., Giachetti, A., Ferella, L., Rosato, A., Bertini, I., Herrmann, T., Jonker, H.R., Bagaria, A., Jaravine, V., Güntert, P., Schwalbe, H., Vranken, W.F., Doreleijers, J.F., Vriend, G., Vuister, G., Franke, D., Kikhney, A., Svergun, D.I., Fogh, R.H., Ionides, J., Laue, E.D., Spronk, C., Jurkša, S., Verlato, M., Badoer, S., Dal Pra, S., Mazzucato, M., Frizziero, E., Bonvin, A.M.: WeNMR: structural biology on the Grid. J. Grid Comput. 10(4), 743–767 (2012)
68. Wu, S., Skolnick, J., Zhang, Y.: Ab initio modeling of small proteins by iterative TASSER simulations. BMC Biol. 5(17) (2007)
69. Xu, D., Zhang, Y.: Ab initio protein structure assembly using continuous structure fragments and optimized knowledge-based force field. Proteins 80(7), 1715–35 (2012)
70. Xu, J., Li, M., Kim, D., Xu, Y.: RAPTOR: optimal protein threading by linear programming, the inaugural issue. J. Bioinform. Comput. Biol. 1(1), 95–117 (2003)
71. Yang, Y., Faraggi, E., Zhao, H., Zhou, Y.: Improving protein fold recognition and template-based modeling by employing probabilistic-based matching between predicted one-dimensional structural properties of query and corresponding native properties of templates. Bioinformatics 27(15), 2076–2082 (2011)
72. Zhang, Y.: Progress and challenges in protein structure prediction. Curr. Opin. Struct. Biol. 18(3), 342–348 (2008)
73. Znamirowski, L.: Non-gradient, sequential algorithm for simulation of nascent polypeptide folding. In: Sunderam, V.S., van Albada, G.D., Sloot, P.M., Dongarra, J.J. (eds.) Computational Science - ICCS 2005, Lecture Notes in Computer Science, vol. 3514, pp. 766–774. Springer, Berlin Heidelberg (2005)