Raptorx: Exploiting Structure Information for Protein Alignment by Statistical Inference Jian Peng and Jinbo Xu*

proteins STRUCTURE O FUNCTION O BIOINFORMATICS Prediction Methods and Reports RaptorX: Exploiting structure information for protein alignment by statistical inference Jian Peng and Jinbo Xu* Toyota Technological Institute at Chicago, Chicago, Illinois ABSTRACT INTRODUCTION This work presents RaptorX, a sta- RaptorX is totally different from our previous threading program RAPTOR, tistical method for template-based which aligns a sequence to a template by using linear programming to minimize a protein modeling that improves given threading scoring function.1,2 By contrast, RaptorX uses a statistical learning alignment accuracy by exploiting method to design a new threading scoring function, aiming at better measuring the structural information in a single compatibility between a target sequence and a template structure. In addition to or multiple templates. RaptorX single-template threading, RaptorX also has a multiple-template threading compo- consists of three major compo- nent and contains a new module for alignment quality prediction. Our results show nents: single-template threading, 3,4 alignment quality prediction, and that RaptorX indeed has much better alignment accuracy than RAPTOR. multiple-template threading. This RaptorX is designed to address two ‘‘alignment’’ challenges facing template- work summarizes the methods used based protein modeling. One is how to align a target to its template when they by RaptorX and presents its CASP9 have a sparse sequence profile (i.e., no sufficient amount of information in result analysis, aiming to identify homologs). In this case, a profile-based alignment method may not work well. major bottlenecks with RaptorX The other is how to improve sequence-template alignment accuracy using more and template-based modeling and reliable template structural alignments as bridge when at least two similar tem- hopefully directions for further plates are available for a target. RaptorX addresses these two challenges by study. Our results show that tem- exploiting template structure information in several unique ways. plate structural information helps a Many homology modeling and protein threading methods have been developed lot with both single-template and for sequence-template alignment.5–20 These methods have two major issues in multiple-template protein threading especially when closely-related tem- dealing with distantly related sequence and template. One is that these methods plates are unavailable, and there is use a linear scoring function to guide the alignment of a sequence to its template. still large room for improvement in A linear function cannot deal well with correlation among protein features, both alignment and template selec- although many features are indeed correlated (e.g., secondary structure vs. solvent tion. The RaptorX web server is accessibility). The other issue is that these methods heavily depend on sequence available at http://raptorx.uchica- profile. Sequence profile has proved to be very powerful in detecting remote go.edu. homologs and generating accurate alignments, as demonstrated by the excellent 7 Proteins 2011; 79(Suppl 10):161–171. HHpred program. However, information in a sequence profile may not be suffi- VC 2011 Wiley-Liss, Inc. cient especially when the profile is sparse. To address these two issues, RaptorX Key words: single-template thread- The authors state no conflict of interest. ing; multiple-template threading; Grant sponsor: National Institutes of Health; Grant number: R01GM089753; Grant sponsor: National Science Foun- alignment quality prediction; prob- dation; Grant number: DBI-0960390 abilistic alignment; multiple pro- *Correspondence to: Jinbo Xu, Toyota Technological Institute at Chicago, 6045 S. Kenwood Avenue, Chicago, IL 60637. E-mail: [email protected]. tein alignment; CASP. Received 6 April 2011; Revised 25 July 2011; Accepted 19 August 2011 Published online 12 September 2011 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/prot.23175 VC 2011 WILEY-LISS, INC. PROTEINS 161 J. Peng and J. Xu uses a nonlinear scoring function to combine homolo- (250 < mutation score < 210) and (secondary structure gous information (i.e., sequence profile), and structure score > 0.9) and (solvent accessibility score > 0.6), then information in a very flexible way. When proteins under the log-likelihood of two residues being aligned is ln 0.7". consideration have high-quality sequence profile, Rap- Our scoring function is much more sensitive than the torX counts more on profile information, otherwise on widely used linear function, because our function can structure information to improve alignment accuracy. model protein feature correlation. Our method also ena- When multiple similar templates are available for a bles us to use different criteria to align different regions of single target, RaptorX improves pairwise target-template the sequence and template. This is analogous to the posi- alignment accuracy through exploiting template struc- tion specific scoring matrix, which has different mutation tural similarity (i.e., geometrical information), in addi- potentials for the same amino acid at different positions. tion to increase alignment coverage for the target (by Our method differs from others in that we use NEFF copying the most similar regions from different tem- to adjust the relative importance of homologous and plates). Nearly all existing multiple-template methods structure information so that the former will not domi- generate an alignment between the target and its multiple nate the latter. When proteins have large NEFF, RaptorX templates by either simply assembling pairwise align- counts more on sequence profile information; otherwise, ments into a single multiple alignment using the target structure information. Our method uses both context- as an anchor21,22 or using some multiple sequence specific and position-specific gap penalty and then use alignment tools such as T-Coffee23, MUSCLE,24 and NEFF to determine their relative importance. If NEFF is ProbCons.25 Neither way effectively uses template struc- large, we will rely more on position-specific gap penalty tural alignments, which are more reliable than sequence- derived from the alignment of sequence homologs; other- template alignments, as bridge to build sequence-to-mul- wise, context-specific gap penalty. Our context-specific tiple-template alignments, so they usually do not fare gap penalty depends on (predicted) secondary structure much better than when only a single template is available type, (predicted) solvent accessibility, amino acid identity, for the target. By contrast, RaptorX takes advantage of hydropathy count, and if a residue is in the core region template structural alignments to align a target sequence or not. simultaneously to multiple templates through a novel statistical inference method. Two multiple protein alignment programs 3DCoffee26 and PROMAL3D,27 which can Alignment quality prediction also be used to align a single sequence to multiple templates, indeed take advantage of template structural align- We predict the absolute quality of a pairwise sequence- ments in building a multiple alignment, but our experi- template alignment using neural network and then use mental results show that they do not fare well when the the predicted quality to rank all the templates for a spe- target and templates are not closely related.28 cific target. The quality of a pairwise sequence-template alignment is defined as the TMscore30 of the 3D model built from this alignment by MODELLER (with default METHODS parameters). However, our method does not need to Following PSI-BLAST and HHpred, we use NEFF to actually build a 3D model in order to predict alignment measure the amount of information in the sequence pro- quality, because only information in an alignment is used file of a protein. NEFF ranges from 1 to 20 and can be for quality prediction. This saves time for 3D model interpreted as the expected number of amino acid substi- building, and, thus, we can predict alignment quality tutions at each sequence position. A sparse sequence pro- very quickly. Our old RAPTOR program uses an SVM file (i.e., a profile with a small NEFF value) usually leads method to predict the number of correctly aligned posi- 31 to less accurate secondary structure prediction and less tions in an alignment. RaptorX differs from RAPTOR accurate alignment.29 in that RaptorX predicts TMscore, a better quality measure, and also uses a better set of alignment features. Let A(i) denote the target residue aligned to the ith Single-template protein threading template residue. A(i) is empty if the template residue is We use a probabilistic model to formulate the pairwise not aligned to any target residue. Let PSSM and PSFM sequence-template alignment problem. See Peng and denote the position-specific scoring matrix and position Xu3,4 for the technical details. Our probabilistic model specific frequency matrix for the template and the uses a regression-tree-based nonlinear scoring function to sequence, respectively. PSSM and PSFM are two slightly measure the similarity between two proteins. A regression different representations of sequence profile. Let PSSMi tree consists of a collection of rules to calculate the proba- and PSFMi denote sequence profiles at template position bility of an alignment. One rule can be as simple as ‘‘if i and sequence position i, respectively. Both PSSMi and (mutation score < 250), then the log-likelihood of PSFMi are a vector of 20 real values encoding

Raptorx: Exploiting Structure Information for Protein Alignment by Statistical Inference Jian Peng and Jinbo Xu*

Development of Novel Strategies for Template-Based Protein Structure Prediction

The Phyre2 Web Portal for Protein Modeling, Prediction and Analysis

Statistical Inference for Template-Based Protein Structure

Deep Learning-Based Advances in Protein Structure Prediction

Distance-Based Protein Folding Powered by Deep Learning Jinbo Xu Toyota Technological Institute at Chicago 6045 S Kenwood, IL, 60637, USA [email protected]

Template-Based Protein Modeling Using the Raptorx Web Server

And Contact-Based Protein Structure Prediction

MULTICOM2 Open-Source Protein Structure Prediction System

An Open-Source Protein Structure Prediction System Powered by Deep Learning and Distance Prediction

Probabilistic Multi-Template Protein Homology Modeling

Use of Bioinformatics Tools in Different Spheres of Life Sciences

Structural Insights and Characterization of Human Npas4 Protein