A Need for Rosetta

Rosetta Steven Bitner

A need exists for the ability to compute the complex workings of amino acid chains. These chains, known as proteins, have been found to fold in many different ways, and the forces that drive this folding are not yet fully understood. A protein gains its functionality within the organism only in its folded state, and the final shape of the folded protein determines its function. Predicting these folding patterns has therefore become very important. In the future it may also be possible to know what shape of protein is needed to produce a certain function. This could enable very specific medication that uses synthetic proteins to prevent, reduce, or remove diseased protein interactions.

Enter Rosetta. Rosetta software is capable of both of the tasks above, predicting how a protein folds and finding a protein with a desired shape, with a great deal of accuracy. Rosetta has won the Critical Assessment of Structure Prediction (CASP) competition held at the Lawrence Livermore Lab in Livermore, California, which is an important competition for determining the best available de novo protein predictor. De novo protein prediction is prediction done without any classification information for the given protein. Many software packages for protein folding prediction require knowledge of the protein family to which the protein belongs, or of which portions of the protein strand are rigid or flexible, obtained using techniques such as the pebble game. The CASP competition gives an unpublished protein sequence to each team in the competition. Each team then uses its own software to predict the folding of the given protein. A metric such as Root Mean Square Deviation (RMSD) is then used to determine which prediction is closest to the actual protein. Rosetta has also been used to create synthetic proteins such as Top-7 (fig.1). This is done with a combinatorial approach that applies Rosetta’s folding prediction to many different amino acid sequences and calculates the fold of each. When a candidate gets close to the desired shape (approximately 2-3 Å RMSD), the protein can be synthesized using the amino acid sequence found.
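The RMSD metric mentioned above can be illustrated with a minimal sketch. It assumes the two structures are given as equal-length lists of (x, y, z) coordinates that have already been superimposed; the optimal alignment step normally performed before comparing structures is left out, and the function name and sample coordinates are made up for illustration.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equal-length lists of (x, y, z) points.

    Assumes the two structures are already superimposed; no alignment is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and the same length")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(coords_a))

# Made-up coordinates: every atom of the "actual" structure is shifted 2 Å in z.
predicted = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
actual = [(0.0, 0.0, 2.0), (1.5, 0.0, 2.0), (3.0, 0.0, 2.0)]
print(rmsd(predicted, actual))  # 2.0
```

A lower value means the predicted atoms lie closer, on average, to their counterparts in the reference structure.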

Fig.1: Top-7 Protein, the first synthetic protein, synthesized in November 2003.

The Top-7 protein shown above was, as already described, created using the Rosetta software. This protein has not yet been found in nature, and may never be. It was designed to have the shape shown above. The initial, or target, shape was chosen (almost) arbitrarily, and the computation then searched for an amino acid sequence that folds into it. The synthesis was completed in November of 2003, and the Rosetta team became the first to synthesize a protein. For this synthesis, the Baker Lab won the coveted AAAS Newcomb Cleveland Prize. As stated earlier, this ability to synthesize proteins with given shapes may one day help scientists create specific medicines based on a desired reaction that may reduce or remove symptoms of various diseases.

The Baker Lab

Rosetta was created by a team of students and faculty at the University of Washington in the United States. David Baker, the lab’s figurehead, is the University of Washington faculty member who gathered the available talent and has led the lab since the early 1990s. Information on the Baker Lab can be found at http://www.bakerlab.org/.

The Many Faces of Rosetta

Although this paper covers only the protein prediction abilities of Rosetta, Rosetta is capable of more. Rosetta can be used to determine protein-to-protein interactions, known as docking, as well as the synthesis already mentioned. The Rosetta team is also working on computing predictions for all known amino acid sequences, because of the high cost of determining structures with NMR and X-ray crystallography techniques. These computations require a great deal of processing power, much more than is present on any supercomputer. So, Rosetta has joined forces with the World Community Grid in Seattle, Washington to create something called the Human Proteome Folding Project under the name Rosetta@Home.

How Rosetta Works

Rosetta, like many other software packages for protein folding, uses a series of energy minimization functions [2]. These functions help to determine whether a configuration of the protein is plausible (through weights assigned according to probabilities), or even possible (through penalties for steric collisions). Software could be written to determine the exact best conformation of a protein by simply examining all possible conformations and taking the one with the lowest energy. This would take far too much time to ever be practical. So, Rosetta takes several steps to speed up the algorithm and make it useful in practice.
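To get a feel for why examining every conformation is impractical, the following back-of-the-envelope sketch counts conformations under purely illustrative assumptions that do not come from Rosetta: each residue can take only three discrete states, the protein has 100 residues, and one billion conformations can be scored per second.

```python
# All three figures below are assumptions chosen only for illustration.
STATES_PER_RESIDUE = 3            # assumed number of discrete backbone states
RESIDUES = 100                    # assumed protein length
SCORED_PER_SECOND = 10 ** 9       # assumed scoring rate
SECONDS_PER_YEAR = 365 * 24 * 3600

# Every residue chooses its state independently, so the counts multiply.
conformations = STATES_PER_RESIDUE ** RESIDUES
years = conformations // (SCORED_PER_SECOND * SECONDS_PER_YEAR)

print(f"{conformations:.2e} conformations, roughly {years:.1e} years to enumerate")
```

Even with such a coarse discretization the count grows exponentially with chain length, which is why a search heuristic is needed.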

Rosetta uses a concept of global folding that helps to distinguish it from some other protein folding software packages. The ideal folding of a subset of the backbone does not necessarily occur. Instead, the ideal folding of the subset simply provides a local preference value to the overall folding algorithm. This local preference may help to determine the local fold, but it is not the only part of the equation, because an ideal local fold can cause a higher energy conformation in other parts of the protein by way of steric collisions and other energy function penalties. Global preferences also include things such as β-strand pairings and hydrophobic preferences at the surface of the protein.

The first step Rosetta uses to speed up computation is side chain reduction. Side chains make the folding of a protein much more complicated because they greatly increase the number of atoms in the protein. So, Rosetta sets them aside and deals with them after the backbone has been determined. This is done through the use of side chain centroids. A side chain centroid is an approximation of an entire side chain by a single vertex. This vertex is treated as an atom placed at the probabilistic center of mass for that specific side chain. So, if a given side chain appears in 1,000 configurations in the Rosetta library, then those 1,000 configurations are averaged together to find the predicted center of mass. The location of this center of mass relative to the Cα atom on the backbone chain is used in the energy computations during Rosetta’s run.
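The averaging described above can be sketched as follows. This is only an illustration under stated assumptions, not Rosetta's code: every observed configuration is assumed to be expressed in the same local frame, atoms are weighted equally rather than by mass, and the function names and coordinates are hypothetical.

```python
def centroid(points):
    """Unweighted geometric center of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def side_chain_centroid(observations):
    """Average side chain center, expressed relative to the C-alpha atom.

    observations: list of (ca_xyz, side_chain_atom_xyz_list) pairs, one per
    library configuration (assumed to share a common local frame).
    """
    offsets = []
    for ca, atoms in observations:
        cx, cy, cz = centroid(atoms)
        # Store the center as an offset from this configuration's C-alpha.
        offsets.append((cx - ca[0], cy - ca[1], cz - ca[2]))
    return centroid(offsets)

# Two made-up configurations of the same side chain type.
observations = [
    ((0.0, 0.0, 0.0), [(1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]),
    ((1.0, 1.0, 1.0), [(1.0, 3.0, 1.0), (1.0, 5.0, 1.0)]),
]
print(side_chain_centroid(observations))  # (1.0, 1.5, 0.0)
```

The single resulting offset stands in for the whole side chain during backbone folding, so only one extra point per residue needs to be scored.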

The major part of the Rosetta folding algorithm involves two steps: fragment insertion and fragment assembly. These steps involve replacing sections of the protein with segments from a library stored in the Rosetta database. Based on empirical testing, Rosetta uses nine residue and three residue substitution segments. The segments in the library are gathered from ‘known’ proteins in the PDB. A library segment replaces a section of the protein by imposing its torsion angles as fixed values. So, in effect, the eight torsion angles connecting nine residues become fixed, and the segment is treated as one rigid body. This increases the speed of the algorithm for two reasons. First, since the torsion angles used in the folding are obtained from a finite library, they cease being continuous and become discrete variables, which makes much faster computation possible. Second, since windows are used to create rigid segments with fixed torsion angles, very few individual torsion angles need to be computed.
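The idea of replacing a window of the chain with a library segment's fixed torsion angles can be sketched as below. The representation is an assumption made for illustration: the backbone is a list with one (phi, psi) pair per residue, and the helper name and angle values are hypothetical rather than taken from Rosetta.

```python
def insert_fragment(torsions, fragment, start):
    """Return a copy of `torsions` with the fragment's angles written over
    the window beginning at residue index `start`."""
    if start < 0 or start + len(fragment) > len(torsions):
        raise ValueError("fragment does not fit inside the chain")
    updated = list(torsions)  # leave the caller's list untouched
    updated[start:start + len(fragment)] = fragment
    return updated

# Illustrative angles in degrees; these are not values from the Rosetta library.
EXTENDED = (-150.0, 150.0)
HELICAL = (-60.0, -45.0)

chain = [EXTENDED] * 12
nine_residue_fragment = [HELICAL] * 9
folded = insert_fragment(chain, nine_residue_fragment, start=2)
print(folded)
```

Because every candidate move is a lookup and a slice assignment, the search works over a finite set of choices instead of continuous angle values.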

For fragment insertion, we check the library and place the segments that are a good fit into a list. More specifically, nine residue and three residue windows are checked against the library, and a list of the best 200 fits is obtained for each of the two window sizes. The best fits are based on a subset of the energy functions given in [2].
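Building the per-window lists of best fits can be sketched as follows. The scoring used here is a deliberately simple stand-in (the number of sequence mismatches between the window and the fragment), not the subset of energy functions from [2]; the library layout and all names are assumptions for illustration.

```python
import heapq

def mismatch_score(window_seq, fragment_seq):
    """Stand-in fit score: number of positions where the sequences differ."""
    return sum(a != b for a, b in zip(window_seq, fragment_seq))

def build_fragment_lists(target_seq, library, size, keep=200):
    """Map each window start index to its `keep` best-fitting fragments.

    library: list of (sequence, torsion_angles) tuples (hypothetical layout).
    """
    candidates = [frag for frag in library if len(frag[0]) == size]
    lists = {}
    for start in range(len(target_seq) - size + 1):
        window = target_seq[start:start + size]
        lists[start] = heapq.nsmallest(
            keep, candidates, key=lambda frag: mismatch_score(window, frag[0])
        )
    return lists

# Tiny made-up library of three residue fragments (angles omitted for brevity).
library = [("ACD", []), ("ACE", []), ("GGG", []), ("CDE", [])]
lists = build_fragment_lists("ACDE", library, size=3, keep=2)
print({start: [seq for seq, _ in frags] for start, frags in lists.items()})
```

The same function would be called once with `size=9` and once with `size=3` to obtain the two sets of lists described above.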

Fragment assembly is the assembly of the protein as a whole. First, a nine residue segment is randomly chosen from the top 25 in the nine residue list created during fragment insertion. This segment is then used to replace the corresponding nine residues in the protein. The scoring function is then used to calculate the change in energy. If the energy goes down, the replacement is kept. If not, it is set aside along with its score. This repeats until either a negative score is obtained, or the program has determined that the likelihood of a negative score is low, in which case the lowest positive score is kept. In each folding simulation, Rosetta chooses a random start segment from the list and attempts 28,000 nine residue replacements. Next, a similar process is performed for three residue segments, but only 8,000 replacements are attempted.
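A much simplified sketch of this replacement loop is shown below. It keeps a replacement only when the score decreases and omits the fallback of keeping the lowest positive score, so it should be read as an illustration of the idea rather than Rosetta's actual acceptance rule. The toy energy function, the fragment lists, and all names are assumptions.

```python
import random

HELIX = (-60.0, -45.0)     # illustrative (phi, psi) values in degrees
STRAND = (-120.0, 120.0)

def toy_energy(torsions):
    """Stand-in score: squared distance of every (phi, psi) pair from HELIX."""
    return sum((phi - HELIX[0]) ** 2 + (psi - HELIX[1]) ** 2
               for phi, psi in torsions)

def assemble(torsions, fragment_lists, energy, attempts, top=25, seed=0):
    """Repeatedly try random fragment replacements, keeping those that lower the score."""
    rng = random.Random(seed)
    current = list(torsions)
    current_energy = energy(current)
    starts = sorted(fragment_lists)
    for _ in range(attempts):
        start = rng.choice(starts)
        fragment = rng.choice(fragment_lists[start][:top])  # pick among the best fits
        trial = list(current)
        trial[start:start + len(fragment)] = fragment
        trial_energy = energy(trial)
        if trial_energy < current_energy:  # energy went down: keep the replacement
            current, current_energy = trial, trial_energy
    return current, current_energy

start_chain = [STRAND] * 5
# One list of candidate three residue fragments per window start (made up).
fragment_lists = {s: [[HELIX] * 3, [STRAND] * 3] for s in range(3)}
final, final_energy = assemble(start_chain, fragment_lists, toy_energy, attempts=200)
print(toy_energy(start_chain), final_energy)
```

Running the loop first with nine residue lists and then with three residue lists would mirror the coarse-then-fine ordering described above; fixing `seed` makes a run repeatable, while varying it gives different trajectories.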

Side chains are added last. During backbone folding, Rosetta accounts for the influence of side chains on the overall protein fold only through the centroid. This probabilistic approach does seem to obtain a high level of accuracy, but can most likely be improved. We discuss potential improvements later in the section titled “Nobody’s Perfect”. A randomized Monte Carlo approach is used to check the known rotamers of the residue being analyzed. A replacement is made and Rosetta verifies whether or not steric clashes have occurred. If they have, the algorithm continues with another rotamer; if not, the algorithm moves along to the next residue.
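The clash-checked rotamer search can be sketched as follows. This is a minimal illustration under assumptions that are not from the Rosetta documentation: a clash is declared whenever two atoms are closer than a fixed 3.0 Å cutoff, rotamers are plain lists of atom coordinates, and the function names are hypothetical.

```python
import math
import random

def has_clash(rotamer_atoms, placed_atoms, min_dist=3.0):
    """True if any rotamer atom is closer than `min_dist` to a placed atom.

    The 3.0 Å cutoff is an assumed value used only for this illustration.
    """
    return any(math.dist(a, b) < min_dist
               for a in rotamer_atoms for b in placed_atoms)

def place_side_chain(rotamers, placed_atoms, rng):
    """Try the known rotamers in random order; return the first clash-free one,
    or None if every rotamer clashes."""
    order = list(rotamers)
    rng.shuffle(order)
    for rotamer in order:
        if not has_clash(rotamer, placed_atoms):
            return rotamer
    return None

# Made-up geometry: one atom already placed at the origin.
placed = [(0.0, 0.0, 0.0)]
too_close = [(0.0, 0.0, 1.0)]
far_enough = [(0.0, 0.0, 5.0)]
chosen = place_side_chain([too_close, far_enough], placed, random.Random(0))
print(chosen)  # [(0.0, 0.0, 5.0)]
```

Once a residue's rotamer is accepted, its atoms would be appended to `placed` before moving on to the next residue.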

Obtaining a Copy of Rosetta

Getting a copy is quite easy, but not entirely necessary. To obtain a copy, go to [5] and click the link for ‘Rosetta Licensing Information’. Follow the appropriate links and then the directions in the email that will be sent to you. Installation help is provided below in ‘Using Rosetta’. Getting a copy of the software is not entirely necessary because online servers are available. One such server, called ROBETTA, is available via a link from the Baker Lab website. The server is down as of the writing of this report, but is intended to come back online sometime in 2006. Another server is available through the University of North Carolina by going to [1]. This server is for academic use only and limits the user to prediction using 200 residues per trial. Another limitation is that only protein prediction can be done via this server; the docking capabilities of Rosetta cannot be used through it. I am unaware of the capabilities and limitations of the ROBETTA server, since it has been out of commission for the duration of this project term.

Using Rosetta

Since there are two different ways to run Rosetta (three if you include the ROBETTA server), the web server and the downloaded software, we discuss how to use both.

Using the web server

First, let’s discuss the web server available at [1]. This web server uses a standard GUI, which makes its use quite straightforward. There is a link for documentation on the persistent navigation sidebar on the left side of the page. This sidebar also contains links for registration, logging in, submitting jobs and checking the status of jobs. Registration is not necessary for submitting jobs or using any other features of the site. Registration does, however, send you email notification of job completion and store jobs under the username that you select, making them easier to locate in the sometimes lengthy queue.

To submit a job, you must first download a PDB file to your computer. Upload this file to the web server by clicking on the browse button. The main submit screen (see figure 2) gives you a few options. After you have uploaded the desired PDB file, you can opt to use a resfile that you have created, or create one using a simple web form, by selecting ‘upload your own list’ or ‘create a list’ respectively. Rosetta can also compute packing for all residues in the PDB file if you select the ‘all residues’ option. The exception is a protein containing more than 200 residues. In this case, the web server will force you to create a resfile if one is not uploaded. If it is absolutely necessary to repack all residues of a protein that contains more than 200 residues, then you must follow the instructions above for obtaining the software and then the instructions below for running the downloaded software.

It is also possible when submitting a job to place a smaller emphasis on the importance of the repulsive energy functions. Another important option available via the submit job screen is the option to run the job multiple times. As described above in the section entitled ‘How Rosetta Works’, Rosetta uses a random start point during simulation. Therefore, different conformations of the same protein may be obtained. Running multiple simulations allows you to see multiple potential conformations of the protein.

Fig.2: The submit job screen for the UNC online Rosetta server

After you have submitted a job, you must check the queue for your results by selecting the queue link from the navigation bar on the left. Your job has completed when “Complete” appears next to it. If “Unsubmitted” appears next to your job, then you must resubmit it. There seems to be a glitch in the server that sometimes causes a job not to be submitted. This glitch is sometimes triggered by clicking on the ‘view resfile’ link after you have submitted a job. Even though you have selected submit job, the job may not actually have been submitted. Be sure to check the queue for a status of “Processing” or “Complete”.

Using the downloaded software

The downloaded software is much more difficult to use and does not have a user-friendly GUI. Whenever possible, a novice user should opt for the online version as described above. However, if you intend to repack proteins quite regularly, or need to repack in excess of 200 residues, then you have no other choice. The downloaded version is also necessary if you intend to use the docking portions of the Rosetta software. The uses of the software, as well as sample command lines, are given in the various README documents included in the Rosetta package.

There are a few important things worth noting to make use of the software run more smoothly. First, Rosetta is not supported on all operating system platforms. See the README_platforms text file in the Rosetta package for a list of supported platforms. If using this software at UTDallas, you must use a machine in the UNIX lab, since the school’s Apache server runs Sun Solaris 9 (an unsupported platform).

Unpack the Rosetta package using the command “tar -zxvf filename”, where filename is the name under which you have previously stored the package. Change directories into the rosetta++ directory. In this directory, you can consult the README file for compiling instructions. If you don’t intend to change the code in any way, then the simplest way to compile is to type “make gcc”. The system that you are using must have the GNU compiler and make installed. If the correct software is installed and you are using a supported platform, compiling will take about twenty minutes for the optimized version (this is the version created by the command “make gcc”).

After compiling, one change must be made before you can use the software. In the folder rosetta++, you must alter the file “paths.txt” by changing the line that begins “data files” to contain “../rosetta_database/” instead of the default location. This is the only change that is necessary. Now you are ready to run jobs as you please by using the commands and options in the README file located in the rosetta++ directory.

Interpreting Results

The documentation at http://rosettadesign.med.unc.edu/documentation.html#II gives a field by field breakdown of the output file format. The most important fields for basic use are the output coordinates. These coordinates are given in the same format as the PDB files needed as input. The output file can therefore be opened in visualization software such as PyMOL to see the final conformation without making any changes to the file. The most important lines after the coordinates are the overall score that was assigned to the conformation by the Rosetta scoring functions and the rms_to_start, which shows the RMSD between the start and finish conformations. In both cases, the lower the better. The fields that remain after the total are the individual function values used to compute the total energy. It is worth noting that in the files created by the C++ version of the software, the energy abbreviations are generalized from LJ (Lennard-Jones) and LK (Lazaridis-Karplus) to E (Energy). For example, the output file format given above shows LJatr for the Lennard-Jones attractive function, but in the C++ version, the output file contains Eatr for the Energy attractive formula. The figure below shows the input (before) and output (after) of a sample run, visualized using PyMOL. The input file was the accepted conformation for the protein 1ubq, and the output is slightly different. This is because Rosetta does not assume the input file to be the known conformation. The score for the conformation below is 113, and the RMSD is 1.6 Å.

Fig.3: Input and output conformations for sample run of Rosetta using the protein 1ubq
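Because the output coordinates use the PDB format, they can also be read programmatically. The sketch below pulls the Cα coordinates out of ATOM records using the fixed column positions of the PDB format (atom name in columns 13-16, x, y and z in columns 31-54). The sample records are synthetic, built by a hypothetical helper so that the columns line up exactly; they are not Rosetta output, and score lines such as rms_to_start are not parsed here.

```python
def ca_coordinates(pdb_text):
    """Return (x, y, z) for every C-alpha ATOM record in PDB-formatted text."""
    coords = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append((float(line[30:38]),
                           float(line[38:46]),
                           float(line[46:54])))
    return coords

def atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Hypothetical helper that formats one ATOM record with PDB column widths."""
    return "ATOM  {:>5d} {:<4s} {:>3s} {:1s}{:>4d}    {:8.3f}{:8.3f}{:8.3f}".format(
        serial, name, res_name, chain, res_seq, x, y, z)

# Synthetic records with made-up coordinates.
sample = "\n".join([
    atom_line(1, " N  ", "MET", "A", 1, 0.0, 0.0, 0.0),
    atom_line(2, " CA ", "MET", "A", 1, 1.0, 2.0, 3.0),
    atom_line(3, " CA ", "GLN", "A", 2, 4.5, 5.5, 6.5),
])
print(ca_coordinates(sample))  # [(1.0, 2.0, 3.0), (4.5, 5.5, 6.5)]
```

Coordinates extracted this way from an input file and an output file could then be compared residue by residue, for example to reproduce an RMSD between the start and finish conformations.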

Nobody’s Perfect

Rosetta is no Superman, and would not claim to be. All software has its imperfections, and Rosetta is no different. One problem with Rosetta is that it tends to group together atoms with similar chemical properties. As a result, hydrophilic residues may form hydrogen bonds with each other and allow hydrophobic portions of the protein to appear on or near the surface of the protein. If you notice this problem occurring, you can steer Rosetta away from it by creating a resfile that omits folding in these regions of the protein. Rosetta does not yet have a filter installed to prevent this placement of hydrophobic atoms near the surface, so it is the user’s responsibility to watch out for and correct these situations if they occur.

The side chain simplification portion of Rosetta is also an area that may need some improvement. Currently, Rosetta uses the centroid replacement scheme described earlier. The problem with this centroid replacement is that when the side chains are later added, we must use the Monte Carlo method of replacement, and may find many conflicts. It could be better to use a different replacement structure. Using spheres that contain the common intersection of the known side chain rotamers may be a better method. A better fit still could potentially be obtained by using the 3D convex hull that circumscribes the intersection of the known rotamers. This would account for rotamers that have a longer major axis and thus aren’t well represented by a sphere. Documentation is lacking in this area, but it seems as though Rosetta forces rotamer configurations around the backbone chain, and as a result the developers of Rosetta wanted a small structure to represent the side chain in order to minimize conflicts during backbone folding. The idea is that the backbone should be free to do as it pleases, and the side chains will later be forced into the best rotamer fit available. The importance of side chains in backbone folding is still disputed, so this matter will not be a focus unless it is found that side chains do play a large part in the folding of the entire protein.

Another problem comes in one of the smallest parts of the algorithm. After the appropriate nine and three residue fragments have been assembled, Rosetta iteratively calculates the individual unknown torsion angles for the rest of the protein. The problem with the way that this is done is that a bad torsion angle in one bond may be offset by a torsion angle in another bond, making the bad local change a good global choice. Rosetta concentrates mostly on the coarse (nine residue) and fine (three residue) adjustments, and overlooks the finishing touches needed for the remainder of the protein.

The biggest problem of all is the lack of guidance. The documentation for Rosetta is thin and poor at best. There is no clear user’s manual other than a single page of text with sample command lines, and a sample output file. The output file has little description of what each item means, and although [2] contains the scoring functions, there is no description about how these functions are combined to attain a final score. There is also no useful description of the functions, just the equations themselves. Basically, if you want to learn a lot about Rosetta, you had better already know it.

Citations

[1] Rosetta Design Web Server http://rosettadesign.med.unc.edu/documentation.html
[2] Protein Structure Prediction Using Rosetta, Numerical Computer Methods, C.A. Rohl, C.E. Strauss, K.M. Misura, D. Baker, pp. 66-93, 2004
[3] README documentation included with rosetta2.0.1
[4] Rosetta Website https://www.rosettacommons.org/
[5] David Baker Lab Homepage http://www.bakerlab.org/