A Novel Domain Assembly Routine for Creating Full-Length Models of Membrane Proteins from Known Domain Structures

Supplemental Information for:
A novel domain assembly routine for creating full-length models of membrane proteins from known domain structures

Julia Koehler Leman (1,2,*), Richard Bonneau (2,3)

(1) Department of Biology and Center for Genomics and Systems Biology, New York University, New York, NY 10003, United States
(2) Center for Computational Biology, Flatiron Institute, Simons Foundation, 162 Fifth Avenue, New York, NY 10010, United States
(3) Center for Data Science, New York University, New York, NY 10011, United States
(*) corresponding author: [email protected]

Supplemental Figure 1. Additional example used for benchmarking our method. The left panel shows the score vs. CA-RMSD plot, with the relaxed native structures as blue dots. The right panel shows the native structure in gray, superimposed with the top RMSD model for 5hk2, which has an RMSD of 3.2 Å.

Protocol capture for mp_domain_assembly
author: JKLeman: [email protected]
updated: 09/26/2017

Prerequisites:

1. get the protein sequence of the full-length construct in fasta format from the NCBI database (https://www.ncbi.nlm.nih.gov/)

2. find domain structures in the PDB (http://www.rcsb.org/pdb/home/home.do) and generate the fasta file from the ATOM lines in the PDB file using

       /Rosetta/tools/protein_tools/scripts/clean_pdb.py mypdb.pdb A

3. create a multiple sequence alignment (MSA) to identify the best possible domain templates, i.e. those with either the largest coverage or the fewest mutations
   a. we recommend using ClustalOmega for the MSA (http://www.ebi.ac.uk/Tools/msa/clustalo/)

4. run secondary structure prediction and transmembrane span prediction on the full-length sequence to identify whether there are other folded domains that might need to be modeled via homology modeling or de novo structure prediction
   a. for secondary structure prediction, we suggest PsiPred (http://bioinf.cs.ucl.ac.uk/psipred/) and BCL::Jufo9D (http://meilerlab.org/index.php/servers/show?s_id=5)
   b. for transmembrane helix prediction, we suggest Octopus (http://octopus.cbr.su.se/) and BCL::Jufo9D (http://meilerlab.org/index.php/servers/show?s_id=5)
   c. for transmembrane β-strand prediction, we suggest Boctopus (http://boctopus.cbr.su.se/) and BCL::Jufo9D (http://meilerlab.org/index.php/servers/show?s_id=5)
   d. for homology modeling, we suggest multi-template modeling via RosettaCM [1]
   e. for de novo structure prediction, we suggest either BCL::Fold (http://meilerlab.org/index.php/servers/show?s_id=12), BCL::MPFold (http://meilerlab.org/index.php/bclcommons/show/b_apps_id/1), EVfold (http://evfold.org/evfold-web/newprediction.do?), or Rosetta, depending on the number of sequences and the size of the domain
   f. if the transmembrane domain (TMD) has no known structure, single transmembrane span helices can be modeled either with the helix_from_sequence application [2] in RosettaMP [3] or through de novo modeling in RosettaMembrane [4,5]
   g. for larger TMD structures we suggest homology modeling via RosettaCM [1], if possible

5. once the best domain structures have been identified, re-create the MSA and also do a pairwise sequence alignment with the full-length sequence to identify missing density (like missing loops) or mutations
   a. required: residues in missing density need to be built via loop modeling [6–8] (https://www.rosettacommons.org/docs/latest/application_documentation/Application%20Documentation#Structure-Prediction)
   b. required: residues that differ from the full-length sequence of interest need to be mutated back; the mutate app can be used for this; the mutation is given as (wild-type amino acid, Rosetta pose number, mutated amino acid):

       ~/Rosetta/main/source/bin/mutate.macosclangrelease \
           -database ~/Rosetta/main/database \
           -in:file:s mypdb.pdb \
           -in:file:native mypdb.pdb \
           -mutate_relax:mutation E11C

6. required: trim the domain structures so that at least one linker residue separates consecutive domains; the domains must not overlap, because Rosetta would not know which coordinates to pick
   a. also remove the floppy N-terminus of the N-terminal domain and the floppy C-terminus of the C-terminal domain; keeping them could have an unfavorable effect on the score

7. recreate the sequence alignment to check that all parts are modeled, loops are closed, and mutations are removed, and to identify linker residues
   a. when trimming the N-terminus of the N-terminal model or the C-terminus of the C-terminal model, make sure you also trim the fasta sequence of the full-length construct; we will be creating fragments from the full-length fasta sequence and this has to match the ATOM line 'sequence' in the models!

8. optional: refine individual domains to remove clashes; this can also be done after the domain assembly step
   a. this can be accomplished using the FastRelax [9,10] (https://www.rosettacommons.org/docs/latest/application_documentation/structure_prediction/relax), mp_relax [3] (https://www.rosettacommons.org/docs/latest/application_documentation/membrane_proteins/RosettaMP-App-MPFastRelax), or range_relax/mp_range_relax applications (manuscript in preparation)

Run domain assembly:

1. create fragments (for instance using Robetta [11] (http://www.robetta.org/fragmentsubmit.jsp)); make sure the full-length fasta sequence matches the ATOM line 'sequence' in the models from step 7a of the Prerequisites!

2. required: orient the TMD in the membrane using the PPM server (http://opm.phar.umich.edu/server.php)
   a. clean the PDB file again to remove all the dummy atoms created by PPM

3. required: create a Rosetta spanfile for the TMD via

       ~/Rosetta/main/source/bin/mp_span_from_pdb.macosclangrelease \
           -database ~/Rosetta/main/database \
           -in:file:s tmd.pdb \
           -ignore_unrecognized_res true \
           -ignore_zero_occupancy false

4. optional: refine the membrane embedding of the TMD in Rosetta; this step makes sure that Rosetta builds reasonable linker conformations and keeps the soluble domains from dipping into the membrane (i.e. it scores these conformations worse)

       ~/Rosetta/main/source/bin/mp_transform.macosclangrelease \
           -database ~/Rosetta/main/database \
           -in:file:s tmd.pdb \
           -mp:setup:spanfiles tmd.span \
           -mp:transform:optimize_embedding true

5. run domain assembly; make sure you provide the PDB files of the domains in the correct order from the N- to the C-terminus:

       # -database: Rosetta database, if not set via environment variable
       # -in:file:fasta: fasta file of the full-length sequence, might be trimmed at the termini
       # -in:file:frag3, -in:file:frag9: fragment files for 3- and 9-residue fragments
       # -nstruct: number of models to build; depends on the linker length,
       #     longer linkers require more sampling, i.e. more models;
       #     we suggest between 5 000 and 100 000 models
       # -mp:assembly:poses: PDB files of the domains,
       #     ordered from the N-terminus to the C-terminus
       # -mp:assembly:TM_pose_number: number of the pose above that corresponds
       #     to the transmembrane domain
       # -relax::range::angle_max: maximal backbone dihedral angle perturbation
       #     for the refinement step
       ~/Rosetta/main/source/bin/mp_domain_assembly.macosclangrelease \
           -database ~/Rosetta/main/database \
           -in:file:fasta 5HK1_tr_A.fasta \
           -in:file:frag3 5hk1.frags.3.200_v1_3 \
           -in:file:frag9 5hk1.frags.9.200_v1_3 \
           -nstruct 5000 \
           -mp:assembly:poses 5HK1_tr_A_opt_tm.pdb 5HK1_tr_A_opt_sol.pdb \
           -mp:assembly:TM_pose_number 1 \
           -relax::range::angle_max 0.5

6. models can then be sorted by score, filtered against available experimental data, or clustered to identify reasonable models

7. the final model can be subjected to one or more iterative high-resolution refinement steps (see step 8 in Prerequisites for details)

References:

1. Song, Y., DiMaio, F., Wang, R. Y. R., Kim, D., Miles, C., Brunette, T., Thompson, J. & Baker, D. High-resolution comparative modeling with RosettaCM. Structure 21, 1735–1742 (2013).
2. Koehler Leman, J., Mueller, B. K. & Gray, J. J. Expanding the toolkit for membrane protein modeling in Rosetta. Bioinformatics 11, 1–3 (2016).
3. Alford, R. F., Koehler Leman, J., Weitzner, B. D., Duran, A. M., Tilley, D. C., Elazar, A. & Gray, J. J. An Integrated Framework Advancing Membrane Protein Modeling and Design. PLoS Comput. Biol. 11, e1004398 (2015).
4. Yarov-Yarovoy, V., Schonbrun, J. & Baker, D. Multipass membrane protein structure prediction using Rosetta. Proteins 62, 1010–25 (2006).
5. Barth, P., Schonbrun, J. & Baker, D. Toward high-resolution prediction and design of transmembrane helical protein structures. Proc. Natl. Acad. Sci. U. S. A. 104, 15682–15687 (2007).
6. Canutescu, A. A. & Dunbrack, R. L. Cyclic coordinate descent: A robotics algorithm for protein loop closure. Protein Sci. 12, 963–72 (2003).
7. Mandell, D. J., Coutsias, E. A. & Kortemme, T. Sub-angstrom accuracy in protein loop reconstruction by robotics-inspired conformational sampling. Nat. Methods 6, 551–2 (2009).
8. Stein, A. & Kortemme, T. Improvements to robotics-inspired conformational sampling in Rosetta. PLoS One 8, e63090 (2013).
9. Tyka, M. D., Keedy, D. A., André, I., DiMaio, F., Song, Y., Richardson, D. C., Richardson, J. S. & Baker, D. Alternate states of proteins revealed by detailed energy landscape mapping. J. Mol. Biol. 405, 607–18 (2011).
10. Khatib, F., Cooper, S., Tyka, M. D., Xu, K., Makedon, I., Popovic, Z., Baker, D. & Players, F. Algorithm discovery by protein folding game players. Proc. Natl. Acad. Sci. U. S. A. 108, 18949–53 (2011).
11. Kim, D. E., Chivian, D. & Baker, D. Protein structure prediction and analysis using the Robetta server. Nucleic Acids Res. 32, 526–531 (2004).
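The consistency requirement in Prerequisites steps 6 and 7a (the domains must appear in the full-length fasta sequence in N-to-C order, without overlap, and with at least one linker residue between them) can be checked before fragments are generated. The following is a minimal Python sketch, not part of the Rosetta distribution; the function names `read_fasta`, `pdb_sequence`, and `locate_domains` are hypothetical, and the sketch assumes standard fixed-column PDB ATOM records with one CA atom per residue.

```python
# Hypothetical helper, not a Rosetta tool: checks that the domain models
# match the (trimmed) full-length sequence and are separated by a linker.

THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def read_fasta(text):
    """Return the sequence of a single-entry fasta file as one string."""
    return "".join(
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.startswith(">")
    )


def pdb_sequence(pdb_text):
    """Return the one-letter sequence taken from CA atoms in ATOM records."""
    sequence = []
    seen = set()
    for line in pdb_text.splitlines():
        # PDB fixed columns: atom name 13-16, residue name 18-20,
        # chain 22, residue number plus insertion code 23-27.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            key = (line[21], line[22:27])
            if key not in seen:  # skip alternate locations of a residue
                seen.add(key)
                sequence.append(THREE_TO_ONE.get(line[17:20], "X"))
    return "".join(sequence)


def locate_domains(full_seq, domain_seqs, min_linker=1):
    """Locate each domain in full_seq, in order, as 0-based (start, end).

    Raises ValueError if a domain is missing, out of order, or separated
    from the previous domain by fewer than min_linker residues.
    """
    spans = []
    search_from = 0
    for index, domain in enumerate(domain_seqs, start=1):
        start = full_seq.find(domain, search_from)
        if start < 0:
            raise ValueError(
                "domain %d not found in order with a linker of at least "
                "%d residue(s)" % (index, min_linker)
            )
        end = start + len(domain)
        spans.append((start, end))
        search_from = end + min_linker
    return spans
```

Called with the full-length sequence and the sequences extracted from the domain PDB files in N-to-C order, `locate_domains` returns the residue ranges of the domains; the residues between consecutive ranges are the linkers that the fragments will need to sample.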

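Sorting the assembled models by score, as described in the final steps of the domain assembly run, can be sketched in Python as follows. This is an illustration under stated assumptions rather than a Rosetta utility: it assumes a plain-text Rosetta score file in which every data line starts with `SCORE:`, the first such line is the column header, and the columns `total_score` and `description` are present. The function names `parse_scorefile` and `rank_models` are hypothetical. Lower Rosetta scores are more favorable, so the list is sorted in ascending order.

```python
# Hypothetical helper, not a Rosetta tool: ranks models from a score file.

def parse_scorefile(text):
    """Parse 'SCORE:' lines into a list of dicts keyed by column name."""
    header = None
    rows = []
    for line in text.splitlines():
        if not line.startswith("SCORE:"):
            continue  # skips e.g. the leading 'SEQUENCE:' line
        fields = line.split()[1:]
        if header is None:
            header = fields  # first SCORE: line holds the column names
            continue
        if fields == header or len(fields) != len(header):
            continue  # repeated header or truncated line
        rows.append(dict(zip(header, fields)))
    return rows


def rank_models(text, column="total_score", top=5):
    """Return the 'top' best models as (description, score) tuples."""
    rows = parse_scorefile(text)
    ranked = sorted(rows, key=lambda row: float(row[column]))
    return [(row["description"], float(row[column])) for row in ranked[:top]]
```

Passing a different `column` (for example an RMSD column, if one was written to the score file) ranks the models by that quantity instead; filtering against experimental data or clustering would need additional code that is not shown here.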