SAHG, a Comprehensive Database of Predicted Structures of All Human
Total Page:16
File Type:pdf, Size:1020Kb
Published online 3 November 2010 Nucleic Acids Research, 2011, Vol. 39, Database issue D487–D493 doi:10.1093/nar/gkq1057 SAHG, a comprehensive database of predicted structures of all human proteins Chie Motono1,2,*, Junichi Nakata1,2, Ryotaro Koike2,3, Kana Shimizu1,2, Matsuyuki Shirota2,4, Takayuki Amemiya2,5, Kentaro Tomii1,2, Nozomi Nagano1,2, Naofumi Sakaya1,2,6, Kiyotaka Misoo1,2,6, Miwa Sato1,2,5,7, Akinori Kidera2,5,8, Hidekazu Hiroaki2,9, Tsuyoshi Shirai2,10, Kengo Kinoshita2,4, Tamotsu Noguchi1,2 and Motonori Ota2,3,* 1Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 135-0064, 2Institute for Bioinformatics Research and Development (BIRD), Japan Science and Technology Agency (JST), Tokyo 102-0081, 3Graduate School of Information Science, Nagoya University, Nagoya 464-8601, 4Graduate School of Information Science, Tohoku University, Sendai 980-8579, 5Department of Supramolecular Biology, Yokohama City University, Yokohama 230-0045, 6Information and Mathematical Science Laboratory Inc., Tokyo 112-0012, 7Mitsui Knowledge Industry Co., Ltd, Tokyo 105-6215, 8Department of Computational Science Research Program, RIKEN, Wako 351-0198, 9Graduate School of Medicine, Kobe University, Kobe 650-0017 and 10Department of Bioscience, Nagahama Institute of Bioscience and Technology, Nagahama 526-0829, Japan Received August 15, 2010; Revised October 2, 2010; Accepted October 13, 2010 ABSTRACT Most proteins from higher organisms are known to animated images. As of July 2010, SAHG contains be multi-domain proteins and contain substantial 42 581 protein-domain models in approximately numbers of intrinsically disordered (ID) regions. To 24 900 unique human protein sequences from the analyse such protein sequences, those from human RefSeq database. Annotation of models with func- for instance, we developed a special protein- tional information and links to other databases such structure-prediction pipeline and accumulated the as EzCatDB, InterPro or HPRD are also provided products in the Structure Atlas of Human Genome to facilitate understanding the protein structure- (SAHG) database at http://bird.cbrc.jp/sahg. With function relationships. the pipeline, human proteins were examined by local alignment methods (BLAST, PSI-BLAST and INTRODUCTION Smith–Waterman profile–profile alignment), global– local alignment methods (FORTE) and prediction Nowadays, genome sequencing projects are producing tools for ID regions (POODLE-S) and homology complete genome sequences at an extremely high rate modeling (MODELLER). Conformational changes of (1,2). With the rise of next-gen sequencers (3–5), this is the continuous trend for the future without a doubt. protein models upon ligand-binding were predicted Consequently, the number of known protein sequences by simultaneous modeling using templates of apo (6) grows more rapidly than the number of known and holo forms. When there were no suitable tem- protein structures experimentally determined (7). plates for holo forms and the apo models were However, to make full use of genome sequences, accurate, we prepared holo models using prediction proteins encoded in genomes should be analysed and for methods for ligand-binding (eF-seek) and conform- this purpose, protein three-dimensional (3D) structures ational change (the elastic network model and the provide much information (8,9). Computational methods linear response theory). Models are displayed as for protein 3D structure prediction are anticipated to *To whom correspondence should be addressed. Tel: +81 3 3599 8067; Fax: +81 3 3599 8081; Email: [email protected] Correspondence may also be addressed to Motonori Ota. Tel: +81 52 789 4782; Fax: +81 52 789 4782; Email: [email protected] The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors. ß The Author(s) 2010. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. D488 Nucleic Acids Research, 2011, Vol. 39, Database issue bridge the gap between the number of known protein se- pink regions in Figure 1) in which several homology quences and the number of known protein structures. search and protein structure prediction tools, conducting According to assessments of the accuracy of those sequence–sequence, sequence–profile and profile–profile methods, e.g. recent Critical Assessment of Techniques alignments, are combined sequentially, and it processes for Protein Structure Prediction (CASP) experiments protein sequences, assigns them with 3D templates and (10,11), template-based protein structure prediction often finally produces 3D models. If available, 3D models of produced 3D models accurate enough for functional an- apo and holo forms were generated. The other compo- notations, modification of protein functions or even for nents are ‘Other structure and function predictors’ structure-based drug design (12,13). In addition, in the (bottom light blue regions in Figure 1). They are an CASP7 and 8 experiments, fully automated structure pre- ensemble of independent prediction tools, which analyse diction methods had reached a comparable level to the protein sequences. All the results from these systems were best prediction performance by methods with human accumulated in SAHG in XML formats. intervention (14). In the CASP experiments, target protein sequences are Structure prediction pipeline ones whose 3D structures will be determined. It means that such protein structures are expected to be single Construction of 3D models. Protein structure prediction domains or a couple of domains and suitable for the ex- consists of the following procedures: template searches perimental structure determination. Therefore, sometimes and selection, alignment of target sequence and protein sequences are truncated from their full-length template, building 3D models and evaluation of model forms. On the other hand, most protein sequences coded quality. in genomes from higher organisms are known to be long The template searches and their assignments to a target and should be multi-domain proteins (15), and contain a protein are the ‘step-wise-multi-methods’ approach. In the significant portion of intrinsically disordered (ID) regions first step, a BLAST (23) search against all the latest (16–19). Clearly, these proteins are unsuitable for experi- Protein Data Bank (PDB) (7) and Structure Classification of Proteins (SCOP) (24,25) sequences is per- mental structure determination in the full-length form and À5 distinct from the target protein sequences of CASPs. To formed with 10 E-value cut-off. We selected templates, analyse such proteins, we have developed a special at least 90% of whose sequence could be aligned with the protein-structure-prediction pipeline, by integrating and target, to ensure that the 3D models corresponded to arranging various computational tools, either developed stable domains or proteins. The resulting target by us or widely used as global standards. This pipeline sequence-template alignments were ranked based on was applied to all proteins coded in the human genome. their E-values. The best combination of templates for The resulting 3D models as well as other annotations for each domain was determined using an original algorithm protein functions were accumulated in the Structural Atlas to maximize the coverage of the target sequence (label I in of Human Genome (SAHG) database and presented Figure 1). In the second step, a PSI-BLAST (23) search through the web interface at http://bird.cbrc.jp/sahg. with the same parameters was conducted for the remain- There are other databases of protein structure models, ing regions of the target sequence, where no models had e.g. SWISS-MODEL Repository (20) or ModBase (21). been assigned and the best templates were assigned onto Both databases contain annotated protein structure the target sequence (II in Figure 1). Protein sequence models generated by original automated modeling pipe- profiles were prepared using the latest NCBI-nr lines. They also allow the users to build models on database. In the third step, a Smith–Waterman profile– demand. Compared with them, the SAHG database is profile alignment method (SWPPA) (26) was applied to distinct mainly in the following points: (i) The 3D the remaining regions against restricted templates (SCOP models in SAHG were generated by an original pipeline, and PDB subsets with less than 40% sequence identity) with a cut-off of Z-score > 10, the comparable threshold specific for multi-domain proteins with substantial ID À5 regions; (ii) Conformational changes of proteins upon to E-value < 10 in PSI-BLAST (III in Figure 1). Finally, ligand-binding are predicted by simultaneous modeling the FORTE (27) search, a profile–profile comparison using templates of the ligand-bound state (holo form) method, was performed for the remaining regions, with and the unbound state (apo form) and displayed as a strict cut-off of Z-score > 20, to detect distantly related animated images; and (iii) Functional annotations for templates (V in Figure 1). FORTE is based on the global– protein interactions, e.g. ligand-binding and protein– local alignment method and was adjusted to perform best protein interactions, are available.