The PSIPRED Protein Analysis Workbench

Nucleic Acids Research, 2019 1 doi: 10.1093/nar/gkz297 The PSIPRED Protein Analysis Workbench: 20 years Downloaded from https://academic.oup.com/nar/advance-article-abstract/doi/10.1093/nar/gkz297/5480136 by Medical Research Council, [email protected] on 22 May 2019 on Daniel W.A. Buchan* and David T. Jones UCL Bioinformatics Group, Department of Computer Science, University College London, Gower Street, London WC1E 6BT, UK Received February 01, 2019; Revised April 02, 2019; Editorial Decision April 11, 2019; Accepted April 15, 2019 ABSTRACT of the previous step or steps. The complete data analysis task can then be defined by the input data, the required The PSIPRED Workbench is a web server offering a outputs and the dependency graph of data transformations range of predictive methods to the bioscience com- that take the data from input to output. Often, data analy- munity for 20 years. Here, we present the work we sis tasks decomposed into steps in this manner are regarded have completed to update the PSIPRED Protein Anal- as pipelines or workflows. A range of workflow manage- ysis Workbench and make it ready for the next 20 ment tools, both proprietary and open source, are readily years. The main focus of our recent website upgrade available. Such as Microsoft Azure, Spotify’s Luigi, Twitter work has been the acceleration of analyses in the Storm, HADOOP and so forth. In the Open Source space face of increasing protein sequence database size. Apache Taverna (https://taverna.incubator.apache.org/)isa We additionally discuss any new software, the new user-centric tool for specify workflows which can access a hardware infrastructure, our webservices and web heterogenous set of data processing web services. Common Workflow Language (CWL) (8) is a specification for de- site. Lastly we survey updates to some of the key scribing workflows to ensure they are portable across hard- predictive algorithms available through our website. ware and execution platforms. Several implementations ca- pable of executing CWL workflows are available. INTRODUCTION Data analysis webservices in the biosciences face many Today there are a great number of web services available similar challenges to one another. High throughput data covering all aspects of research in the Life Sciences. In bio- generation technologies continue to increase the pace that chemistry, important resources hosting primary data are new data is deposited in public databases. Genbank and hosted by the NCBI, EBI and RCSB (1–3). Additionally a the PDB have shown near exponential growth over the pre- number of labs around the world have provided invaluable vious 20–40 years with no signs of slowing down (1,2). second tier analysis services to the research community for Researchers have also pioneered the use of increasingly a number of years. Examples include PHYRE, MPI Bioin- computationally intensive algorithms or predictors ranging formatics Toolkit and the Expasy: SIB Bioinformatics Por- from dynamic programming (9) to deep neural networks tal (4–7). in the present day. Without algorithmic or hardware ad- The PSIPRED Protein Analysis Workbench is a world vancements database search and algorithmic run times will renowned web service providing a diverse suite of protein lengthen. These provide challenges to web services which prediction and annotation tools focussed principally on wish to ensure the services offered run in a timely fashion. structural annotations of proteins. The service has been in near-continuous operation for 20 years. The service runs around 250 000 predictions for users each year. In the fol- Web site developments lowing paper we describe work we have completed to bring our web services in to the modern web era. Most impor- Since 2015, we have undertaken to completely rewrite all the tantly the implementation of a novel workflow management code which serves and runs the PSIPRED Workbench site. platform which allows us to easily deploy new predictive Our principle goal was to greatly reduce prediction runtimes tools to our website. for users. The expanding size of sequence data banks such Workflow Management Systems solve the generic prob- as UniRef and GenBank has greatly increased the runtime lem of turning chains of data processing steps into repeat- of software such as PSIBLAST in the intervening years. We able or automated processes. Typically, a given data anal- also wished to ensure the website will be able to cope with ysis can be decomposed into a series of data transforma- the increasing resource and computational demands of run- tion steps, where any given step depends on the outputs ning a contemporary bioscience web site. *To whom correspondence should be addressed. Tel: +44 20 7679 7214; Email: [email protected] C The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. 2 Nucleic Acids Research, 2019 New web design. A major part of the work we completed Social media & support. To make it easier for users to Downloaded from https://academic.oup.com/nar/advance-article-abstract/doi/10.1093/nar/gkz297/5480136 by Medical Research Council, [email protected] on 22 May 2019 is the web site’s redesign (see Figure 1). This was com- contact us and get real time updates about the PSIPRED pleted to ensure all our predictive methods were more eas- webserver, we have also recently launched our new Twitter ily accessible in a single location and that the results pro- account, https://twitter.com/psipred. This account dissemi- duced were more intuitively laid out. The new site makes nates up to date information about the current state of the use of modern web frameworks including bootstrap (https: server and any issues, updates, downtime notices and rele- //getbootstrap.com), Ractive (https://ractive.js.org) and d3 vant news. Users are welcome to request help, ask questions (https://d3js.org/). We have also moved all processing of or give feedback via this service. This service now sits along- server results and diagram rendering to the users’s web side our contact email, [email protected]. browser, this reduces the processing load on our backend servers allowing us to calculate a greater number of concur- Software & algorithm improvements rent users’ analyses. The input form remains much as it was previously. The The PSIPRED Protein Analysis Workbench offers a total principle difference is that users can now select whether of 16 protein analysis methods (see Table 1). Since the pub- their input data is protein sequence or PDB data (see Figure lication or our previous server paper the majority of these 1, part A) and the form will present the appropriate predic- remain as previously published (10). In this section, we give tive methods (see Figure 1B and C) there are 12 sequence a brief overview of only those methods that have seen sig- analysis methods and four protein structure analysis meth- nificant improvements since our 2013 publication, namely: ods. PSIPRED, MetaPSICOV, FFPred, DISOPRED, Bioserf & Upon protein sequence submission the results page has Domserf. Readers interested in our other methods and their been updated (Figure 1DandFigure2). On the right-hand current performance should consult the references listed in side we show three panels: an estimate of the runtime for Table 1. the analysis, a panel that holds links to the results file down- PSIPRED 4. PSIPRED is a popular and cutting edge loads (Figure 2B) and a panel which holds the resubmission protein secondary structure method. This remains the most widget (Figure 2C). For the results files users can down- popular method on our server. Since the previous publica- load the files of interest or download a zip file of all the tion of the webserver substantial work on the PSIPRED results files. The resubmission panel allows users to select method has been completed. Our webserver now offers some sub-region of their query sequence, using the linear PSIPRED version 4. PSIPRED 4 has implemented the first start-stop coordinates of the sequence. This sub-region can architectural change to PSIPRED since its first release in then be submitted to the server for further analysis. Users 1999. The main change is to a deeper neural network archi- can also follow the ‘Help & Tutorials’ link on the page for tecture with two hidden layers rather than just one, and with further details about the methods and using the server. rectifier activations rather than sigmoid. The other change The principle change to the website is the presentation of is that the input window has been extended from 15 residues the results (Figure 2). The results page is now divided in to to 33 thanks to using sparse connections between the in- five main sections. The first section (Figure 2A) is a title ban- put layer and first hidden layer. The latest version has aQ3 ner which holds the name of the job the user selected and secondary structure prediction accuracy of 84.2%. A full a small widget allowing the user to easily copy the results benchmark and comparison with equivalent methods will page URL to the desktop clipboard. In the Downloads sec- be the topic of a future publication. tion (Figure 2B) users can access the individual results files The new PSIPRED web presentation updates the used to prepare the results page and visualizations. We also PSIPRED cartoon (Supplementary Information Figure 1). provide a link to a ‘Job Configuration’ file which details the As previously the results are presented on four rows repre- analysis steps performed to make the prediction, this con- senting the confidence score (Conf), the assignment in car- tains software and dataset versioning information.

The PSIPRED Protein Analysis Workbench

An Effective Computational Method Incorporating Multiple Secondary

Homology Modeling and Analysis of Structure Predictions of the Bovine Rhinitis B Virus RNA Dependent RNA Polymerase (Rdrp)

Chapter 13 Protein Structure Learning Objectives

Increasing the Accuracy of Single Sequence Prediction Methods Using a Deep Semi-Supervised Learning Framework Lewis Moffat 1,∗ and David T

Bi-Allelic Novel Variants in CLIC5 Identified in a Cameroonian

Dnpro: a Deep Learning Network Approach to Predicting Protein Stability Changes Induced by Single-Site Mutations Xiao Zhou and Jianlin Cheng*

Further Confirmation of the Association of SLC12A2 with Non-Syndromic

Bayesian Model of Protein Primary Sequence for Secondary Structure Prediction

The Structure and Dynamics of Bmr1 Protein from Brugia Malayi: in Silico Approaches

11: Catchup II Machine Learning and Real-World Data (MLRD)

The PSIPRED Protein Structure Prediction Server Liam J

Advances in Rosetta Protein Structure Prediction on Massively Parallel Systems