Computation resources for molecular biology: A special issue

Michael J.E. Sternberg, Marina I. Ostankovitch

PII: S0022-2836(16)00084-X DOI: doi: 10.1016/j.jmb.2016.02.001 Reference: YJMBI 64983

To appear in: Journal of Molecular Biology

Please cite this article as: Sternberg, M.J.E. & Ostankovitch, M.I., Computation resources for molecular biology: A special issue, Journal of Molecular Biology (2016), doi: 10.1016/j.jmb.2016.02.001

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Computation resources for molecular biology: a special issue

Michael J E Sternberg (1*) & Marina I Ostankovitch (2)

(1) Structural Group, Department of Life Sciences, Imperial College London, South Kensington, London SW7 2AZ, England. [email protected] (*) To whom communications should be sent.

(2) Journal of Molecular Biology, 50 Hampshire St, Cambridge, MA 02139, USA

Increasingly, computational approaches play a central role across many areas of research that tackle the complexity of biological systems. A resource such as BLAST (Basic Local Alignment Search Tool) (1), which was published in this journal in 1990, has transformed sequence searching because of its speed and its power in detecting distant but biologically significant relationships. Since then, high-throughput molecular biology technologies have led to a rapid expansion in the available sequence, structural and ‘omics data for many of the systems being studied. Computational biologists, statisticians and mathematicians have been motivated by this exponential growth of data to develop enhanced and novel computational tools. Challenges include storing and cataloguing the primary data, searching for relationships between the data (in particular the identification of homology), performing comparative analyses to derive fundamental principles, and integrating information from different modalities. Via the web, computational resources are readily being made available to researchers in the community, many of whom are not necessarily skilled in computing.

Accordingly, in recognition of the crucial importance of methods, databases, software and algorithms, the Journal of Molecular Biology has devoted a Special Issue to a series of eight important computational resources, which can help researchers to gain novel molecular and functional insights into important biological systems and to address challenging unanswered questions relevant to health and disease.

We have three contributions in which the authors present an integrated database for an important biological system. Sousa et al (2) have developed the AlloRep database, which details sequence, structural and mutagenesis data for the LacI/GalR transcriptional regulators. This well-studied family of transcriptional regulators binds a diverse set of DNA sequences. The authors provide manually curated sequence alignments for over 3,000 sequences and in vivo phenotypic and biochemical data for over 5,750 mutant variants. They provide proof of principle that AlloRep can be used to predict residues that alter allosteric regulation. This resource allows users to generate novel hypotheses and to test the robustness of computational approaches for engineering synthetic transcription repressors.

Harrison et al (3) have developed UbSRD, a relational SQL database reporting structural features for all 509 ubiquitin-like folds in the Protein Data Bank. The resource quantifies the structures of ubiquitins and SUMOs (small ubiquitin-like modifier proteins) and their different modes of protein-protein interaction. The database allowed the authors to show that the ubiquitin tail is flexible and adopts a range of conformations on binding. Users can browse the database by phylogeny, by structural properties, and by residue interactions.

The third database (4) in this Special Issue, authored by Keerthikumar et al, is ExoCarta, a manually curated compendium of exosomal proteins, RNAs and lipids. The current version details more than 41,000 protein, 7,000 RNA and 1,000 lipid molecules. Users can browse the database by organism, content type or gene. Data can also be downloaded for further bioinformatics analysis. In addition, users can submit their own data via a spreadsheet, to be added to the database after review.

With the explosion in the number of known protein sequences for a diverse range of organisms and the expansion in the number of experimentally determined protein structures, template-based protein structure prediction is now widely used by the community. Estimating the accuracy of a predicted structure is crucial in guiding the biologist's interpretation of the model. Yang et al (5) report a new server, ResQ, which estimates the accuracy at the residue level together with a value for the thermal mobility (B-value) for a predicted model. The approach uses information about local structural variation within a series of structural templates. The authors also demonstrate how the ResQ values can help in structure determination by molecular replacement.

When a new protein structure has been solved, it is common practice to search the Protein Data Bank (PDB) for related structures; indeed, authors reporting structures in JMB are expected to have performed such a search. The identification of related structures can provide insights into the specificity, function and evolution of the protein being studied. Similarly, when a biologist obtains a predicted protein structure, they often wish to undertake such structural searches. Mezulis et al (6) report a new web server, PhyreStorm, that can search the entire PDB in typically less than one minute. This speed should allow users to undertake several iterative searches and so explore multiple hypotheses.

The identification of small-molecule binding sites in proteins is of widespread interest in understanding function and in the development of novel drugs and other regulators of activity. Cryptic sites are those that are not readily detectable from the ligand-free structure but become identifiable as a result of conformational change. Typically these sites are identified by time-consuming molecular dynamics simulations. Here, Cimermancic et al (7) present the CryptoSite server, which predicts cryptic sites from sequence, structural and evolutionary features together with a rapid simulation of protein mobility. Users input their protein coordinates and obtain predicted sites.

The determination of the three-dimensional structure of a biomolecular complex is generally harder than solving the structures of its components. Here, van Zundert et al (8) report an update of the widely used HADDOCK2 web server for the prediction of biomolecular complexes. The HADDOCK approach uses a series of diverse distance restraints to identify the predicted complex. This version, HADDOCK2.2, provides facilities to dock mixed molecule types and to incorporate additional experimental restraints, including a restraint based on an experimental radius of gyration, such as that obtained from a SAXS experiment, and several additional restraints identified from NMR studies.

Kanehisa et al (9) report two new automatic servers, BlastKOALA and GhostKOALA, which annotate genome and metagenome sequences. These servers use the information in the widely used KEGG database, which provides a biological interpretation of proteins, including their location in pathways. The two servers identify orthologs in the KEGG database and thereby construct a KEGG pathway. BlastKOALA is designed for genome sequences and uses a version of BLAST for sequence searching. GhostKOALA employs a far more rapid sequence search approach and is therefore appropriate for metagenome sequences.

This Special Issue reports specialist databases about molecules and vesicles together with web servers that provide biological insight about individual proteins, biomolecular complexes and pathways. We trust that the resources described will assist many in their research to characterise biological structure and function at the molecular level. We would like to thank all the contributors to this Special Issue.

References

1-Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, David J. Lipman, Basic local alignment search tool, Journal of Molecular Biology 1990, http://dx.doi.org/10.1016/S0022-2836(05)80360-2.

2-Filipa L. Sousa, Daniel J. Parente, David L. Shis, Jacob A. Hessman, Allen Chazelle, Matthew R. Bennett, Sarah A. Teichmann, Liskin Swint-Kruse, AlloRep: A Repository of Sequence, Structural and Mutagenesis Data for the LacI/GalR Transcription Regulators, Journal of Molecular Biology, http://dx.doi.org/10.1016/j.jmb.2015.09.015.

3-Joseph S. Harrison, Tim M. Jacobs, Kevin Houlihan, Koenraad Van Doorslaer, Brian Kuhlman, UbSRD: The Ubiquitin Structural Relational Database, Journal of Molecular Biology, http://dx.doi.org/10.1016/j.jmb.2015.09.011.

4-Shivakumar Keerthikumar, David Chisanga, Dinuka Ariyaratne, Haidar Al Saffar, Sushma Anand, Kening Zhao, Monisha Samuel, Mohashin Pathan, Markandeya Jois, Naveen Chilamkurti, Lahiru Gangoda, Suresh Mathivanan, ExoCarta: A Web-Based Compendium of Exosomal Cargo, Journal of Molecular Biology, http://dx.doi.org/10.1016/j.jmb.2015.09.019.

5-Jianyi Yang, Yan Wang, Yang Zhang, ResQ: An Approach to Unified Estimation of B-Factor and Residue-Specific Error in Protein Structure Prediction, Journal of Molecular Biology, http://dx.doi.org/10.1016/j.jmb.2015.09.024.

6-Stefans Mezulis, Michael J.E. Sternberg, Lawrence A. Kelley, PhyreStorm: A Web Server for Fast Structural Searches Against the PDB, Journal of Molecular Biology, http://dx.doi.org/10.1016/j.jmb.2015.10.017.

7-Peter Cimermancic, Andrej Sali, CryptoSite: Expanding the druggable proteome by characterization and prediction of cryptic binding sites, Journal of Molecular Biology, doi?

8-G.C.P. van Zundert, J.P.G.L.M. Rodrigues, M. Trellet, C. Schmitz, P.L. Kastritis, E. Karaca, A.S.J. Melquiond, M. van Dijk, S.J. de Vries, A.M.J.J. Bonvin, The HADDOCK2.2 Web Server: User-Friendly Integrative Modeling of Biomolecular Complexes, Journal of Molecular Biology, http://dx.doi.org/10.1016/j.jmb.2015.09.014.

9-Minoru Kanehisa, Yoko Sato, Kanae Morishima, BlastKOALA and GhostKOALA: KEGG Tools for Functional Characterization of Genome and Metagenome Sequences, Journal of Molecular Biology, http://dx.doi.org/10.1016/j.jmb.2015.11.006.
