Improved Protein Structure Refinement Guided by Deep Learning Based

ARTICLE https://doi.org/10.1038/s41467-021-21511-x OPEN Improved protein structure refinement guided by deep learning based accuracy estimation Naozumi Hiranuma1,2,4, Hahnbeom Park1,4, Minkyung Baek 1, Ivan Anishchenko 1, Justas Dauparas1 & ✉ David Baker 1,3 We develop a deep learning framework (DeepAccNet) that estimates per-residue accuracy and residue-residue distance signed error in protein models and uses these predictions to 1234567890():,; guide Rosetta protein structure refinement. The network uses 3D convolutions to evaluate local atomic environments followed by 2D convolutions to provide their global contexts and outperforms other methods that similarly predict the accuracy of protein structure models. Overall accuracy predictions for X-ray and cryoEM structures in the PDB correlate with their resolution, and the network should be broadly useful for assessing the accuracy of both predicted structure models and experimentally determined structures and identifying specific regions likely to be in error. Incorporation of the accuracy predictions at multiple stages in the Rosetta refinement protocol considerably increased the accuracy of the resulting protein structure models, illustrating how deep learning can improve search for global energy minima of biomolecules. 1 Department of Biochemistry and Institute for Protein Design, University of Washington, Washington, WA, USA. 2 Paul G. Allen School of Computer Science & Engineering, University of Washington, Washington, WA, USA. 3 Howard Hughes Medical Institute, University of Washington, Washington, WA, USA. ✉ 4These authors contributed equally: Naozumi Hiranuma, Hahnbeom Park. email: [email protected] NATURE COMMUNICATIONS | (2021) 12:1340 | https://doi.org/10.1038/s41467-021-21511-x | www.nature.com/naturecommunications 1 ARTICLE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-021-21511-x istance prediction through deep learning on amino acid resolution better than 2.5 Å lacking extensive crystal contacts and co-evolution data has considerably advanced protein having sequence identity less than 40% to any of 73 refinement D 1–3 structure prediction . However, in most cases, the pre- benchmark set proteins (see below). Of the approximately one dicted structures still deviate considerably from the actual struc- million decoys, those for 280 and 278 of the 7307 proteins were ture4. The protein structure refinement challenge is to increase held-out for validation and testing, respectively. More details of the accuracy of such starting models. To date, the most successful the training/test set and decoy structure generation can be found approaches have been with physically based methods that involve in Methods. a large-scale search for low energy structures, for example with The predictions are based on 1D, 2D, and 3D features that Rosetta5 and/or molecular dynamics6. This is because any avail- reflect accuracy at different levels. Defects in high-resolution able homology and coevolutionary information is typically atomic packing are captured by 3D-convolution operations already used in the generation of the starting models. performed on 3D atomic grids around each residue defined in a The major challenge in refinement is sampling; the space of rotationally invariant local frame, similar to the Ornate method9. possible structures that must be searched through even in the 2D features are defined for all residue pairs, and they include vicinity of a starting model is extremely large5,7. If it were possible Rosetta inter-residue interaction terms, which further report on to accurately identify what parts of an input protein model were the details of the interatomic interactions, while residue–residue most likely to be in error, and how these regions should be altered, distance and angular orientation features provide lower resolution it should be possible to considerably improve the search through structural information. Multiple sequence alignment (MSA) structure space and hence the overall refinement process. Many information in the form of inter-residue distance prediction by methods for estimation of model accuracy (EMA) have been the trRosetta1 network and sequence embeddings from the described, including approaches based on deep learning such as ProtBert-BFD100 model16 (or Bert, in short) are also optionally ProQ3D (based on per-residue Rosetta energy terms and multiple provided as 2D features. At the 1D per-residue level, the features sequence alignments with multilayer perceptrons8), and Ornate are the amino acid sequence, backbone torsion angles, and the (based on 3D voxel atomic representations with 3D convolutional Rosetta intraresidue energy terms (see Methods for details). networks9). Non-deep learning methods such as VoroMQA We implemented a deep neural network, DeepAccNet, that compare a Voronoi tessellation representation of atomic interac- incorporates these 1D, 2D, and 3D features (Fig. 1a). The tions against precollected statistics10. These methods focus on networks first perform a series of 3D-convolution operations on predicting per-residue accuracy. Few studies have sought to guide local atomic grids in coordinate frames centered on each residue. refinement using deep learning based accuracy predictions11;the These convolutions generate features describing the local 3D most successful refinement protocols in the recent blind 13th environments of each of the N residues in the protein. These, Critical Assessment of Structure Prediction (CASP13) test either together with additional residue level 1D input features (e.g., local utilized very simple ensemble-based error estimations5 or none at torsional angles and individual residue energies), are combined all12. This is likely because of the low specificity of most current with the 2D residue–residue input features by tiling (so that accuracy prediction methods, which only predict which residues associated with each pair of residues there are both the input 2D are likely to be inaccurately modeled, but not how they should be features for that pair and the 1D features for both individual moved, and hence are less useful for guiding search. residues), and the resulting combined 2D feature description is In this work, we develop a deep learning based framework input to a series of 2D convolutional layers using the ResNet (DeepAccNet) that estimates the signed error in every architecture17. A notable advantage of our approach of tying residue–residue distance along with the local residue contact together local 3D residue based atomic coordinate frames through error, and we use this estimation to guide Rosetta based protein a 2D distance map is the ability to integrate full atomic coordinate structure refinement. Our approach is schematically outlined in information in a rotationally invariant way; in contrast, a Fig. 1. Cartesian representation of the full atomic coordinates would change upon rotation, substantially complicating network for both training and its use. Details of the network architecture, Results feature generation, and training processes are found in Methods. Development of improved model accuracy predictors.Wefirst Figure 2 shows examples of the predictions of DeepAccNet sought to develop model accuracy predictors that provide both without MSA or Bert embeddings (referred to as “DeepAccNet- global and local information for guiding structure refinement. We Standard”) on two randomly selected decoy structures for each of developed network architectures that make the following three three target proteins (3lhnA, 4gmqA, and 3hixA) not included in types of predictions given a protein structure model: per-residue training. In each case, the network generates different signed Cβ local distance difference test (Cβ l-DDT) scores which report residue–residue distance error maps for the two decoys that on local structure accuracy13, a native Cβ contact map thre- qualitatively resemble the actual patterns of the structural errors sholded at 15 Å (referred to as “mask”), and per-residue-pair (rows of Fig. 2). The network also accurately predicts the distributions of signed Cβ–Cβ distance error from the corre- variations in per-residue model accuracy (Cβ l-DDT scores) for sponding native structures (referred to as “estograms”; histogram the different decoys. The left sample from 4gmqA (second row) is of errors); Cα is taken for GLY. Rather than predicting single closer to the native structure than the other samples are, and the error values for each pair of positions, we instead predict histo- network correctly predicts the location of the smaller distance grams of errors (analogous to the distance histograms employed errors and Cβ l-DDT scores closer to 1. Overall, while the detailed in the structure prediction networks of refs. 1–3), which provide predictions are not pixel-perfect, they provide considerable more detailed information about the distributions of possible information on what parts of the structure need to move and structures and better represent the uncertainties inherent to error in what ways to guide refinement. Predictions from the variants prediction. Networks were trained on alternative structures with the MSA (referred to as “DeepAccNet-MSA”) and Bert (“decoys”) with model quality ranging from 50 to 90% in GDT- features (referred to as “DeepAccNet-Bert”) are visualized in TS (global distance test—tertiary structure)14 generated by Supplementary Fig. 1. homology modeling15, trRosetta1, and native structure pertur- We compared the performance of the DeepAccNet networks bation (see Methods). Approximately

Load more