
Rosetta @ Home and the game

Observations of a distributed computing project, for the Interconnect & Neuroscience course by prof.dr.ir. R.H.J.M. Otten

By Frank Razenberg, student id. 0636007

Contents
1 Introduction
2 Distributed Computing
2.1 Concept
2.2 Applications
3 The BOINC Platform
3.1 Design
3.2 Processing power
3.3 Interesting projects
3.3.1 SETI@home
3.3.2 Chess960@home
3.3.3 PrimeGrid
3.3.4 SHA-1 Collision Search Graz
4 Rosetta@home
4.1 The Rosetta algorithm
4.1.1 Overview
4.1.2 Secondary structure prediction
4.1.3 Decoy generation
4.1.4 Structure ranking
4.1.5 Fragment insertion
5 Foldit game
5.1 Origin
5.2 Elements of the game
5.3 Results
6 Bibliography

1 Introduction

In this essay we will observe the distributed computing project named Rosetta@home. Rosetta@home is a non-profit project that attempts to determine the 3-dimensional shapes of proteins from their amino acid sequences. Success of Rosetta's work would have broad implications for human health, ranging from the development of a vaccine for HIV to the eradication of malaria.

Many human diseases are caused by mutations in proteins that affect their 3-dimensional structures and functions. If we can reliably predict structures, we could understand how mutations cause disease and from there perhaps go on to develop therapies. An example is one of Rosetta@home's goals: designing immunogens that will elicit antibodies against HIV, which would be a critical part of a vaccine (1).

Until recently it was thought to be practically impossible to reliably predict the structure of proteins from their sequence, even though it is known that the 3-dimensional structure is determined solely by the amino acid sequence. Protein structures are currently determined by time-consuming, expensive experiments, which are only applicable to a small subset of proteins. If we could instead predict protein structures in a reliable and accurate way, it would revolutionize much of biology. Structure prediction is typically treated as an energy minimization problem. Proteins tend to form structures that keep hydrophobic parts buried internally, away from the water they are dissolved in. They also form bridges between neighboring sections through hydrogen bonds and charge interactions. Maximizing these sorts of interactions minimizes the energy involved.

The Rosetta algorithm was developed by the Baker Laboratory under the guidance of biochemist David Baker. The algorithm was then implemented in the distributed computing application Rosetta@home, and later the computer game Foldit arose from the Rosetta@home project. The University of Washington manages Rosetta@home, which runs on the Berkeley Open Infrastructure for Network Computing (hereafter BOINC).

The first part of this essay is devoted to an observation of the concept of distributed computing. We then look at the BOINC platform, which is the distributed computing platform on which Rosetta@home and several similar projects run. Next we focus on Rosetta@home by looking at the problem it tries to solve and the project's significance so far. Finally, we take an in-depth look at the Foldit game.

2 Distributed Computing

The following sections briefly describe the concept and advantages of using distributed computing for solving large computable problems.

2.1 Concept

Many problems can be solved by computation, for which our personal computers are of course well suited. Large problems, such as weather prediction, are often too complex for a consumer-grade personal computer to take on. Instead, supercomputers with orders of magnitude more processing power than an average consumer PC are used to work on these problems. These supercomputers are generally built by placing multiple processing units in a cluster or grid.

If the problem to solve can be divided into subproblems, then each of these subproblems can be solved by a different CPU. The power of such a supercomputer thus stems from the fact that it is possible to perform many calculations simultaneously. With the advent of the internet and broadband connections, the possibility of creating a gigantic computing cluster emerged. Millions of users can be persuaded to take part in a project. The project manager assigns a job to each participant, and the participant works on this job. When the calculations are finished, the participant returns the results to the project manager and may accept a new job.
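The job cycle described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the "problem" is a sum of squares, the participants are threads on one machine rather than volunteers on the internet, and all function names are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_work_units(start, stop, unit_size):
    """Cut the range [start, stop) into independent sub-ranges (work units)."""
    return [(lo, min(lo + unit_size, stop)) for lo in range(start, stop, unit_size)]


def process_work_unit(unit):
    """What one participant computes: here, a partial sum of squares."""
    lo, hi = unit
    return sum(n * n for n in range(lo, hi))


def run_project(start, stop, unit_size, participants=4):
    """Hand out work units to participants and combine the returned results."""
    units = split_into_work_units(start, stop, unit_size)
    with ThreadPoolExecutor(max_workers=participants) as pool:
        partial_results = list(pool.map(process_work_unit, units))
    return sum(partial_results)


if __name__ == "__main__":
    print(run_project(0, 1000, unit_size=100))
```

Because no work unit depends on another, the participants never need to communicate with each other, which is the property that makes this kind of problem suitable for volunteer computing.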

Typically, a distributed system can tolerate failures of individual nodes. Nodes only need to know part of the total input and may not be aware of other nodes in the system. If the system is designed with this in mind, it scales well, meaning that having more participants results in (near) linearly more work getting done.

Although no clear definitions exist for parallel and distributed computing, the difference is generally considered to be that in parallel computation different processes share the same memory, while in distributed computing each processor has its own memory. Parallel computation might thus be considered a more tightly coupled form of distributed computing.

2.2 Applications

Algorithms have been designed to tackle various problems through distributed computing. Mathematical applications include searching for unknown prime numbers and testing cryptographic techniques. Other major applications are medical research into curing diseases, the study of global warming, the discovery of pulsars, and many other types of scientific research. A few such projects are discussed in Section 3.3.

3 The BOINC Platform

The Berkeley Open Infrastructure for Network Computing (BOINC) is an open source platform for distributed applications, developed at the University of California, Berkeley. It serves as a platform for various scientific research projects that require grid computing. The research fields include Biology and Medicine, Earth sciences, Physics and Astronomy, Mathematics, Artificial Intelligence and many others.

3.1 Design

The BOINC platform emerged from a rewrite of the distributed computing client of SETI@home, the Search for Extra-Terrestrial Intelligence project. SETI@home's purpose was twofold: to do useful scientific work by supporting an observational analysis, based on radio signals, to detect intelligent life outside Earth, and to prove the viability and practicality of the 'volunteer computing' concept. Thus far the first goal has not been met. SETI@home, initiated in 1999, also at Berkeley, was only the second large distributed computing project. Its client was not designed with a high level of security in mind. In order to gain the participation of a large number of volunteers, a credit system was implemented (more on this in Section 3.2). When volunteers exploited the low level of security to falsely claim credits or submit invalid work, a client rewrite was started in February 2002 to prevent such misuse. In June 2004, the first BOINC-based project, Predictor@home, was launched.

BOINC is an open platform. Anyone is allowed to start a distributed computing project using BOINC as a platform. At the time of writing, there are 39 different projects powered by BOINC.

3.2 Processing power

In June 2011, the BOINC platform was estimated to have over 450,000 active hosts worldwide (2). This number is achieved by recruiting volunteers to donate their unspent computer processing time to research projects. People are encouraged to participate in two ways: by informing them of the scientific progress their participation will help to achieve, and by creating an element of competition. Users are requested to donate their CPU time, while being promised that their computer will not ‘become slow or unresponsive’. The most common tasks performed on a PC do not consume a lot of CPU power. While the user performs such non-intensive tasks, for example web browsing and typing documents, the remaining processing power is dedicated to the distributed computing project. The claim is that this scheduling works so well that the end user will notice no slowdowns at all. In my testing, this appears to be mostly true. When playing High Definition video content, which is a computationally intensive task, the entire CPU capability is constantly divided over the video player and the running BOINC projects, resulting in a 100% load. Yet no stuttering in video playback is noticeable. In this regard, the claim of no negative performance impact when using BOINC seems to be true.

What is commonly left out when advocating BOINC’s use, however, is the impact on power consumption. The Thermal Design Power (TDP) of CPUs has dropped considerably over the last few years, but high-clocked older CPUs in particular, such as models from the Intel Celeron, Intel Pentium, Intel Core2Duo, AMD Athlon, AMD Sempron, AMD Duron and AMD Phenom series, can have TDPs of around 100 W or more. With the advent of the AMD Fusion, AMD Brazos, Intel Core and Intel Atom platforms, power consumption has dropped further, but mainly in the idle state. Because a distributed computing client keeps the CPU under full load, its use can be very noticeable on the electricity bill. Additionally, extra heat is produced. The computer will become louder because the fans need to move more air, and components, hard disks especially, might fail earlier because of the additional heat inside the case.
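To get a feeling for the size of this effect, the yearly cost can be estimated with simple arithmetic. The sketch below is only a back-of-the-envelope calculation; the 60 W of extra draw under load and the price of 0.22 per kWh are hypothetical figures chosen for illustration, not measurements.

```python
def yearly_extra_cost(extra_watts, hours_per_day, price_per_kwh, days=365):
    """Cost of the additional power drawn while the client keeps the CPU busy."""
    extra_kwh = extra_watts / 1000 * hours_per_day * days
    return extra_kwh * price_per_kwh


if __name__ == "__main__":
    # Hypothetical example: 60 W extra, running around the clock, 0.22 per kWh.
    print(round(yearly_extra_cost(60, 24, 0.22), 2))
```

Under these assumed figures the machine uses roughly 526 kWh more per year, so the real cost depends strongly on how many hours per day the computer would otherwise have been idle or switched off.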

Specially optimized versions of distributed clients are able to use the GPU for computation as well. These make use of NVIDIA’s CUDA technology or of OpenCL, an open standard originally developed by Apple. GPUs are a lot faster at Fourier transformations and matrix multiplications, and the optimized clients can at times process work up to 10 times as fast as their CPU-based counterparts. Power consumption of GPUs under load is typically a lot higher than that of CPUs, so this is certainly a consideration when opting to participate.

3.3 Interesting projects As BOINC is an open platform, several projects have emerged that make use of BOINC. A few interesting ones are discussed in the following subsections.

3.3.1 SETI@home

SETI@home, the Search for Extra-Terrestrial Intelligence at home, is one of the first large-scale distributed computing projects. As the name suggests, it attempts to find signs of extra-terrestrial intelligence. The search is performed on waveforms of radio signals gathered at around 1420 MHz, using algorithms based on the Fourier transform.
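The reason a Fourier transform is useful here is that a narrow-band signal that is invisible in the raw samples concentrates its power in a single frequency bin. The following sketch illustrates the idea on synthetic data; it is not SETI@home's actual pipeline, and the signal generator, its parameters and the naive transform are assumptions made for this example.

```python
import cmath
import math
import random


def make_test_signal(n=128, tone_bin=5, noise_sigma=0.5, seed=1):
    """Return n samples of a unit-amplitude tone buried in Gaussian noise."""
    rng = random.Random(seed)
    return [math.sin(2 * math.pi * tone_bin * t / n) + rng.gauss(0.0, noise_sigma)
            for t in range(n)]


def power_spectrum(samples):
    """Naive O(n^2) discrete Fourier transform; returns power per frequency bin."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        coefficient = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                          for t, x in enumerate(samples))
        spectrum.append(abs(coefficient) ** 2)
    return spectrum


def strongest_bin(samples):
    """Index of the frequency bin with the most power, ignoring the DC bin 0."""
    spectrum = power_spectrum(samples)
    return max(range(1, len(spectrum)), key=lambda k: spectrum[k])


if __name__ == "__main__":
    print(strongest_bin(make_test_signal()))
```

A real client would use a fast Fourier transform instead of this quadratic-time version, which is exactly the kind of operation that GPUs accelerate well.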

3.3.2 Chess960@home

Chess960 is a variant of classical chess in which only the starting position of some of the pieces is changed. In short, all the non-pawns are shuffled over the player’s home rank, with the restrictions that the King is placed between the Rooks and that the two Bishops stand on squares of opposite color. As in normal chess, the pieces of both players are mirrored. This results in 960 possible configurations from which the game can start. The idea behind this randomization is that players are encouraged to obtain an early advantage not by memorizing strong opening moves, but through creativity and talent. Chess960@home aims to build a database of Chess960 games that is to be made publicly available. For every possible starting configuration about 1000 matches are to be recorded. The results are visible on http://www.chess-960.org.
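The number 960 can be checked by brute force. The sketch below enumerates all distinct arrangements of the eight home-rank pieces and keeps those that satisfy the two standard Chess960 placement rules (King between the Rooks, Bishops on squares of opposite color); the function names are chosen for this example only.

```python
from itertools import permutations


def is_valid_chess960(rank):
    """Check the two placement rules for one home-rank arrangement."""
    bishops = [i for i, piece in enumerate(rank) if piece == "B"]
    rooks = [i for i, piece in enumerate(rank) if piece == "R"]
    king = rank.index("K")
    # Squares of opposite color have file indices of different parity.
    bishops_on_opposite_colors = (bishops[0] + bishops[1]) % 2 == 1
    king_between_rooks = rooks[0] < king < rooks[1]
    return bishops_on_opposite_colors and king_between_rooks


def chess960_start_positions():
    """All valid home ranks, as strings such as 'RNBQKBNR'."""
    distinct_ranks = set(permutations("RNBQKBNR"))
    return sorted("".join(rank) for rank in distinct_ranks if is_valid_chess960(rank))


if __name__ == "__main__":
    print(len(chess960_start_positions()))
```

The classical starting position RNBQKBNR is one of the resulting arrangements, so normal chess is a special case of Chess960.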

3.3.3 PrimeGrid

PrimeGrid aims to find special kinds of prime numbers, such as Cullen and Woodall primes (of the forms n*2^n+1 and n*2^n-1, respectively), twin primes (pairs of the form k*2^n-1 and k*2^n+1) and Proth primes (k*2^n+1 with k odd and 2^n>k). The largest known Woodall and twin primes at the time of writing were found as a result of this project.

3.3.4 SHA-1 Collision Search Graz

This project, run by the Graz University of Technology in Austria, aimed to find collisions in the SHA-1 hashing function. A SHA-1 hash consists of 160 bits, so there are 2^160 possible outputs but infinitely many possible inputs; different inputs must therefore sometimes result in the same hash. After two years the project was shut down due to lack of progress. Not much later a further weakness in the SHA-1 algorithm was reported: using a mathematically crafted attack, the number of operations required to find a collision was claimed to be reduced to the order of 2^52.
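The counting argument behind hash collisions can be made tangible by shrinking the problem. The sketch below first confirms the SHA-1 output size using Python's standard hashlib module, and then searches for two inputs that agree on only the first 16 bits of their hash. This is a toy brute-force search, not the attack used by the Graz project, and the helper names are made up for this example.

```python
import hashlib
from itertools import count


def sha1_prefix(message, prefix_bytes):
    """First prefix_bytes bytes of the SHA-1 digest of message."""
    return hashlib.sha1(message).digest()[:prefix_bytes]


def find_truncated_collision(prefix_bytes=2):
    """Try the messages b'0', b'1', ... until two share the same digest prefix."""
    seen = {}
    for i in count():
        message = str(i).encode("ascii")
        prefix = sha1_prefix(message, prefix_bytes)
        if prefix in seen:
            return seen[prefix], message
        seen[prefix] = message


if __name__ == "__main__":
    print(hashlib.sha1().digest_size * 8)  # digest size in bits
    print(find_truncated_collision())
```

With only 2^16 possible prefixes a collision is guaranteed within 65,537 attempts and usually appears after a few hundred, which hints at why generic collision searches cost far fewer operations than the number of possible outputs, and why a full 160-bit collision was still out of reach for a volunteer project.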

4 Rosetta@home

The function of a protein is largely defined by its structure. The number of proteins whose structure has been experimentally determined, although rapidly increasing, is still small. The primary reason for this is the long, expensive and difficult process required to find these structures. Refolding experiments have shown that the structure is determined only by the amino acid sequence, and that this structure nearly always corresponds to a minimum free energy configuration. In theory, this free energy minimum can be computed from quantum mechanics, so that the structure could be predicted from the sequence. In practice, however, ab initio and molecular dynamics (MD) computation methods are either not fast enough or too inaccurate to obtain usable results (3).

Several methods have been designed to tackle the problem of protein structure prediction. Every two years the Critical Assessment of Techniques for Protein Structure Prediction (CASP) competition compares the performance of these ever-evolving protein structure prediction methods. Rosetta@home participates in the CASP competitions and so far has performed quite well. In the following sections we will look at the algorithm Rosetta uses, the results so far and future plans.

4.1 The Rosetta algorithm

Rosetta uses an ab initio structure prediction method based on the assumption that the distribution of conformations sampled by a local segment of the polypeptide chain is reasonably well approximated by the distribution of structures adopted by that sequence, and by closely related sequences, in known low-energy protein structures (4). These fragments are illustrated in Figure 1. Ab initio means that the fold is built from scratch: no template is used and no empirical structural information about the target is considered. Other prediction methods are homology modeling and threading. In CASP, the three categories are tested separately.

4.1.1 Overview

The general assumption behind Rosetta is that a short sequence of amino acids has a small number of low energy conformations. These conformations (fragments) are mainly a result of local interactions. For each sliding window of 9 amino acids along the sequence (i.e., residues 1-9, 2-10, 3-11, and so on), Rosetta extracts fragments from known protein structures. Rosetta then predicts the unknown protein structure by assembling the fragments. After each fragment insertion, Rosetta minimizes the structure's energy. The potential used in Rosetta tries to capture multiple features seen in experimentally determined protein structures. The potential is derived from a Bayesian treatment of residue distributions in known protein structures.
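The sliding window itself is easy to picture in code. The sketch below only enumerates the 9-residue windows of a sequence; looking up matching fragments in known structures is not shown. The example sequence is an arbitrary string of one-letter amino acid codes, not a real target.

```python
def sliding_windows(sequence, size=9):
    """Return (first_residue, last_residue, segment) for every window.

    Residue numbers are 1-based, matching the 1-9, 2-10, 3-11 notation.
    """
    return [(start + 1, start + size, sequence[start:start + size])
            for start in range(len(sequence) - size + 1)]


if __name__ == "__main__":
    example_sequence = "MKTAYIAKQRQISFVK"  # hypothetical 16-residue sequence
    for first, last, segment in sliding_windows(example_sequence):
        print(f"{first}-{last}: {segment}")
```

A sequence of N residues yields N-8 overlapping windows, so every position except those near the ends is covered by nine different windows.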

Figure 1: optimal structure for local segments

4.1.2 Secondary structure prediction

The first step in predicting a protein structure is determining the secondary structure from the primary structure. The primary structure of a protein refers to the exact sequence of amino acids that make up the whole molecule. The secondary structure refers to the formation of simple local structures like alpha helices, beta strands (or sheets) and turns (or coils). Rosetta is said to be about 80% accurate at predicting the secondary structure. This is measured by comparing its results to the results of the DSSP algorithm applied to the crystal structure of the protein. The DSSP program was designed by Wolfgang Kabsch and Chris Sander to standardize secondary structure assignment. The name DSSP also refers to the database of secondary structure assignments for all protein entries in the Protein Data Bank.
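A figure such as 80% is typically obtained by comparing the prediction with the reference assignment residue by residue. The sketch below assumes the common three-state encoding (H for helix, E for strand, C for coil) and simply counts agreeing positions; whether the quoted figure was computed in exactly this way is an assumption, and both strings in the example are invented.

```python
def per_residue_accuracy(predicted, reference):
    """Fraction of residues whose predicted state equals the reference state."""
    if len(predicted) != len(reference):
        raise ValueError("prediction and reference must have the same length")
    if not reference:
        raise ValueError("sequences must not be empty")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


if __name__ == "__main__":
    predicted = "HHHHCCEEEE"   # hypothetical prediction
    reference = "HHHCCCEEEC"   # hypothetical DSSP-style assignment
    print(per_residue_accuracy(predicted, reference))
```

In this invented example two of the ten positions disagree, both at the boundary of a helix or strand, which is where secondary structure predictions most often go wrong.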

4.1.3 Decoy generation

From the secondary structure, between 1,000 and 100,000 simulations are performed, each of which results in a probable protein structure. These structures are called decoys. For this, Rosetta makes use of pair-wise interactions, beta-strand pairing, compactness, steric overlap and solvation. I could not find more in-depth information on this.

Using cluster analysis, the decoys are then grouped, and the center of the cluster with the broadest energy minimum is selected. Using an algorithm called Mammoth, this largest cluster is compared to known structures. If Mammoth can find a significant similarity between the decoy and an experimentally determined structure, it is assumed that the decoy and the matched structure belong to the same SCOP superfamily.
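One simple way to find such a cluster center is to pick the decoy that has the most other decoys within a given distance. The sketch below does this for toy data: each decoy is represented by a 2-dimensional point and the Euclidean distance stands in for a structural distance such as RMSD. Both simplifications, as well as the threshold value, are assumptions for illustration and do not describe Rosetta's actual clustering code.

```python
import math


def distance(a, b):
    """Euclidean distance between two points (stand-in for structural distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def largest_cluster_center(decoys, radius):
    """Return (index, member_indices) of the decoy with the most neighbours.

    A decoy counts as its own neighbour; ties are resolved in favour of the
    decoy that appears first in the list.
    """
    best_index, best_members = None, []
    for i, candidate in enumerate(decoys):
        members = [j for j, other in enumerate(decoys)
                   if distance(candidate, other) <= radius]
        if len(members) > len(best_members):
            best_index, best_members = i, members
    return best_index, best_members


if __name__ == "__main__":
    toy_decoys = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
                  (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
    print(largest_cluster_center(toy_decoys, radius=0.2))
```

The intuition is that many independent simulations ending up close together indicates a broad energy minimum, whereas an isolated decoy with low energy may just sit in a narrow well.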

Figure 2: The broadest minimum across the decoys is determined by cluster analysis

4.1.4 Structure ranking

The Rosetta scoring function employs Bayes’ theorem. The probability of a structure being correct is taken to be zero if there are overlapping atoms, and proportional to exp(-(radius of gyration)^2) for all other configurations. The radius of gyration describes how ‘compact’ a model is.
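As a small illustration of this compactness term, the sketch below computes the radius of gyration of a set of 3-dimensional coordinates and the resulting unnormalized probability. It assumes equal masses for all points, and the minimum distance used to detect overlap is a hypothetical parameter; the example coordinates are invented.

```python
import math
from itertools import combinations


def radius_of_gyration(coords):
    """Root-mean-square distance of the points from their centroid."""
    n = len(coords)
    centroid = [sum(point[axis] for point in coords) / n for axis in range(3)]
    mean_square = sum(
        sum((point[axis] - centroid[axis]) ** 2 for axis in range(3))
        for point in coords
    ) / n
    return math.sqrt(mean_square)


def has_overlap(coords, min_distance):
    """True if any two points are closer together than min_distance."""
    return any(math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q))) < min_distance
               for p, q in combinations(coords, 2))


def unnormalized_probability(coords, min_distance=1.0):
    """Zero for overlapping points, exp(-Rg^2) otherwise."""
    if has_overlap(coords, min_distance):
        return 0.0
    return math.exp(-radius_of_gyration(coords) ** 2)


if __name__ == "__main__":
    compact = [(0, 0, 0), (1.5, 0, 0), (0, 1.5, 0), (1.5, 1.5, 0)]
    extended = [(0, 0, 0), (1.5, 0, 0), (3.0, 0, 0), (4.5, 0, 0)]
    print(unnormalized_probability(compact), unnormalized_probability(extended))
```

The compact arrangement receives a higher value than the extended one, which is how this term steers the search toward globular, protein-like shapes.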

4.1.5 Fragment insertion

Fragment insertion is done using a form of simulated annealing based on a Monte Carlo procedure. Such a procedure can in principle minimize any function whose value can be evaluated for every state, provided that the initial state and the possible moves from each state are known. It randomly chooses a possible move and accepts it with the Metropolis-Hastings acceptance probability. What this comes down to is that every move that decreases the energy is accepted, and some moves that increase the energy may be accepted as well, which is necessary to escape local minima.
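The acceptance rule can be sketched as follows. This is a generic simulated annealing loop on a toy one-dimensional energy function, not Rosetta's fragment insertion code: the energy function, the move generator, the cooling schedule and all parameter values are assumptions for illustration. Because the toy moves are symmetric, the Metropolis-Hastings probability reduces to the plain Metropolis criterion used here.

```python
import math
import random


def acceptance_probability(delta_energy, temperature):
    """Metropolis criterion: always accept downhill, sometimes accept uphill."""
    if delta_energy <= 0:
        return 1.0
    return math.exp(-delta_energy / temperature)


def anneal(energy, start, propose, steps=5000, t_start=2.0, t_end=0.01, seed=0):
    """Minimize energy by random moves while the temperature decreases."""
    rng = random.Random(seed)
    state, state_energy = start, energy(start)
    best, best_energy = state, state_energy
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        candidate = propose(state, rng)
        candidate_energy = energy(candidate)
        delta = candidate_energy - state_energy
        if rng.random() < acceptance_probability(delta, temperature):
            state, state_energy = candidate, candidate_energy
            if state_energy < best_energy:
                best, best_energy = state, state_energy
    return best, best_energy


def toy_energy(x):
    """Double-well function: local minimum near x = 1.1, deeper one near x = -1.3."""
    return x ** 4 - 3 * x ** 2 + x


def toy_move(x, rng):
    """Symmetric random step, standing in for a fragment insertion."""
    return x + rng.uniform(-0.5, 0.5)


if __name__ == "__main__":
    print(anneal(toy_energy, start=1.5, propose=toy_move))
```

At high temperature almost every move is accepted, so the search can climb out of the shallow well; as the temperature drops, uphill moves become rare and the search settles into whichever minimum it is in at that point.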

5 Foldit game

Foldit is an interactive computer game that challenges users to fold proteins for the Rosetta project. Foldit is freely available for Windows, Mac and Linux users. In the following sections we will take a closer look at the game.

5.1 Origin

In 2007 multiple Rosetta@home users sent in emails saying they saw possible improvements in the previews of the proposed protein structures that the algorithms had constructed. Subsequently a suggestion was made to start an interactive version of the program. Several members of the Baker group at the University of Washington listened, and in 2008 they released the first version of the crowdsourcing ‘Game with a Purpose’ (GWAP) named ‘Foldit’. Foldit is a hybrid approach to protein structure prediction (5). The user is challenged to optimize a decoy using parts of the Rosetta algorithm together with the human brain’s natural three-dimensional pattern matching abilities. Foldit differs from most other GWAPs in that, instead of getting humans to do work for the project, the ulterior motive is learning from the players so that the algorithm can be improved.

Figure 3: in-game screenshot of Foldit

5.2 Elements of the game

When Foldit is started, some puzzles are downloaded from the central server. Players are challenged to find the best possible protein structure given the decoy. A deadline is set for when the final results will be collected. This is usually a few weeks per protein. Players can attempt to fold a protein either individually or in a team effort. Multiple proteins are available simultaneously, so players can choose which protein to work on.

Far from all players have a background in biochemical engineering. It is therefore important that the game is introduced in a way that requires minimal background knowledge. In the game’s tutorial, the basics are explained very thoroughly and the available tools are introduced one by one. With the help of plenty of hints, the player is guided through the first few levels. Here we learn about the ‘shake’ tool, which optimizes sidechains by ‘shaking’ them so that no collisions are present, as these would result in zero probability for the protein structure. A little later, the ‘wiggle’ tool is introduced. This tool optimizes the backbone by compacting the protein. When two backbone sheets are sufficiently close, hydrogen bonds form between them, which further reduces the energy level of the protein. This of course results in a better score.

A more advanced tool is the ‘tweak’. The interface presents hydrophobic sidechains in an orange color. It is important that these are buried inside the protein. Blue (hydrophilic) sidechains should point outward, because they need space around them. Sometimes it is necessary to rotate a backbone in order to bury orange sidechains or to get blue sidechains to point outward. This can be done using the tweak tool. The tweak tool can also flip sheets.

Sometimes the actual structure of a fold is known. It is then presented in Foldit, and the player can try to match the native protein using the available tools. The primary objective of the project here is to learn how human abilities are used to quickly match a three-dimensional structure onto a given one.

After the tutorial has been successfully completed, the player is invited to take part in an actual scientific puzzle. During such a puzzle, scores of competitors are publicly available and players can chat with each other; asking questions is encouraged. People may also organize teams so they can collaborate on the puzzle.

5.3 Results

Foldit has certainly shown that the idea of having humans involved in folding proteins can be successful. In the September 2011 issue of the journal Nature Structural & Molecular Biology, researchers show how gamers provided the crucial insights to solve the structure of a protein-snipping enzyme, a retroviral protease, critical for reproduction of an AIDS-like retrovirus (4). With help from the game players' strategies, researchers revealed the enzyme's structure within three weeks and identified targets for drugs to neutralize it, a problem which had remained unsolved for more than a decade. The progress and final protein are shown in Figure 4.

Figure 4: M-PMV retroviral structure improvement by the Foldit Contenders Group.

Foldit also took part in CASP9, the most recent CASP experiment, held in 2010, where it entered the Free Modeling category.

Unfortunately, Foldit’s main weakness became very clear here. When given some of the best models Rosetta produced for CASP9, the players had a very hard time moving away from the initial configuration. Because these starting models had already been minimized with the same Rosetta energy function that Foldit uses, it appears that nearly every change made by players resulted in a higher energy level. As such, the explored conformation space remained too small. The only way for the players to improve their Foldit scores was to make very small changes to the starting model. Foldit play did, however, lead to one of the most spectacularly well performing models produced throughout the entire CASP9 experiment.

Sadly though, many energy minima exist for any starting configuration, and even the very first steps taken have a crucial impact on the states that end up being reachable. To reach a completely different family of protein structures, even ones with better energies than the current local minimum, you may have to pass through a zone where it seems as if the entire protein is being ruined. Most of the time the protein really is being ruined, and there is no way of knowing in advance whether any improvement will follow later on.

We can conclude that Foldit appears to serve its purpose, in the sense that the human brain is a great tool for learning how to make computers optimize folds further than currently existing algorithms can. It has also been shown in CASP9 that, through crowdsourcing, better local minima can be found than existing algorithms presently produce.

6 Bibliography

1. Baker, David. Prediction and Design of Macromolecular Structures and Interactions. bakerlab.org. [Online] 2011. [Cited: August 14, 2011.] http://boinc.bakerlab.org/rah_research.php.

2. California, University of. Research projects involving BOINC. BOINC. [Online] September 14, 2011. [Cited: September 22, 2011.] http://boinc.berkeley.edu/trac/wiki/ResearchProjects.

3. Rosetta in CASP4: progress in ab initio protein structure prediction. Bonneau, R. and Tsai, J. and Ruczinski, I. and Chivian, D. and Rohl, C. and Strauss, C.E.M. and Baker, D. 2001, Proteins: Structure, Function, and Genetics, Vol. 45, pp. 119-126. S5.

4. Crystal structure of a monomeric retroviral protease solved by protein folding game players. Khatib, F. and DiMaio, F. and Cooper, S. and Kazmierczyk, M. and Gilski, M. and Krzywda, S. and Zabranska, H. and Pichova, I. and Thompson, J. and Popović, Z. and others. s.l. : Nature Publishing Group, 2011, Nature Structural & Molecular Biology.

5. Predicting protein structures with a multiplayer online game. Cooper, S. and Khatib, F. and Treuille, A. and Barbero, J. and Lee, J. and Beenen, M. and Leaver-Fay, A. and Baker, D. and Popović, Z. 7307, s.l. : Nature, 2010, Vol. 466.

6. Foundation, Wikimedia. Rosetta@home. Wikipedia - The online encyclopedia. [Online] [Cited: July 14, 2011.] http://en.wikipedia.org/wiki/Rosetta@home.

7. Das, R. and Qian, B. and Raman, S. and Vernon, R. and Thompson, J. and Bradley, P. and Khare, S. and Tyka, M.D. and Bhat, D. and Chivian, D. and others. Structure prediction for CASP7 targets using extensive all-atom refinement with Rosetta@ home. Proteins: Structure, Function, and Bioinformatics. 2007, Vol. 69, S8.

8. Using the Rosetta algorithm and selected inter-residue distances to predict protein structure. Crecca, C. and Roitberg, A.E. 15, s.l. : Wiley Online Library, 2008, International Journal of Quantum Chemistry, Vol. 108, pp. 2793-2802.