Rosetta @ Home and the Foldit Game
Observations of a distributed computing project, for the Interconnect & Neuroscience course by prof.dr.ir. R.H.J.M. Otten
By Frank Razenberg – [email protected] – student id. 0636007

Contents
1 Introduction
2 Distributed Computing
2.1 Concept
2.2 Applications
3 The BOINC Platform
3.1 Design
3.2 Processing power
3.3 Interesting projects
3.3.1 SETI@home
3.3.2 Chess960@home
3.3.3 PrimeGrid
3.3.4 SHA-1 Collision Search Graz
4 Rosetta@home
4.1 The Rosetta algorithm
4.1.1 Overview
4.1.2 Secondary structure prediction
4.1.3 Decoy generation
4.1.4 Structure ranking
4.1.5 Fragment insertion
5 Foldit game
5.1 Origin
5.2 Elements of the game
5.3 Results
6 Bibliography

1 Introduction

In this essay we will observe the distributed computing project named Rosetta@home. Rosetta@home is a non-profit project which attempts to determine the 3-dimensional shapes of proteins from their amino acid sequences.
Success of Rosetta's work would have broad implications for human health, ranging from the development of a vaccine for HIV to the eradication of malaria. Many human diseases are caused by mutations in proteins that affect their 3-dimensional structures and functions, so if we can reliably predict protein structures, we could understand how mutations cause disease and from there perhaps go on to develop therapies. One of Rosetta@home's goals is an example: designing immunogens that elicit antibodies against HIV, which would be a critical part of a vaccine (1).

Until recently it was thought to be nearly impossible to reliably predict the structure of a protein from its sequence, even though the 3D structure is known to be determined by the amino acid sequence. Protein structures are currently determined by time-consuming, expensive experiments, which are only applicable to a small subset of proteins. If we could instead predict protein structures reliably and accurately, it would revolutionize much of molecular biology.

Structure prediction is typically treated as an energy minimization problem. Proteins tend to form structures that keep hydrophobic parts buried internally, away from the water they are dissolved in. They also form bridges between neighboring sections through hydrogen bonds and charge interactions. Maximizing these favorable interactions minimizes the energy of the structure.

The Rosetta algorithm was developed by the Baker Laboratory under the guidance of biochemist David Baker. The algorithm was then implemented in the distributed computing application Rosetta@home, and the computer game Foldit later arose from the Rosetta@home project. The University of Washington manages Rosetta@home, which runs on the Berkeley Open Infrastructure for Network Computing (hereafter BOINC).

The first part of this essay is devoted to an observation of the concept of distributed computing.
We then look at the BOINC platform, which is the distributed computing platform on which Rosetta@home and several similar projects run. Next we focus on Rosetta@home by looking at the problem it tries to solve and the project's significance so far. Finally, we take an in-depth look at the Foldit game.

2 Distributed Computing

The following sections briefly describe the concept and advantages of using distributed computing for solving large computational problems.

2.1 Concept

Many problems can be solved by computation, for which personal computers are well suited. Large problems, such as weather prediction, are often too complex for a consumer-grade personal computer to take on. Instead, large supercomputers with possibly well over 100 times the processing power of an average consumer PC are used to work on these problems. These supercomputers are generally built by placing multiple processing units in a cluster or grid. If the problem can be divided into subproblems, each subproblem can be solved by a different CPU. The power of such a supercomputer thus stems from its ability to perform many calculations simultaneously.

With the advent of the internet and broadband connections, it became possible to create gigantic computing clusters. Millions of users can be persuaded to take part in a project. The project manager assigns a job to each participant, and the participant's computer works on it. When the calculations are finished, the participant returns the results to the project manager and may accept a new job.

Typically, a distributed system can tolerate failures of individual nodes. Nodes only need to know part of the total input and may not be aware of other nodes in the system. If the system is designed with this in mind, it scales well, meaning that having more participants results in (near) linearly more work getting done.
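
The cycle of dividing a problem, handing out jobs and combining the returned results can be illustrated with a small sketch. It is not taken from BOINC or Rosetta@home: the problem (counting primes in a range), the function names and the use of local threads as stand-ins for remote participants are all assumptions made for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_work_units(start, stop, unit_size):
    """Divide the range [start, stop) into independent sub-ranges (jobs)."""
    return [(lo, min(lo + unit_size, stop)) for lo in range(start, stop, unit_size)]


def is_prime(n):
    """Trial division; slow, but enough for a small demonstration."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def process_work_unit(unit):
    """What one participant does: it only ever sees its own sub-range."""
    lo, hi = unit
    return sum(1 for n in range(lo, hi) if is_prime(n))


def run_project(start, stop, unit_size, participants):
    """Hypothetical project manager: hand out jobs, collect and combine results."""
    units = split_into_work_units(start, stop, unit_size)
    # Threads stand in for remote participants; a real project would send
    # each work unit over the network instead.
    with ThreadPoolExecutor(max_workers=participants) as pool:
        partial_counts = list(pool.map(process_work_unit, units))
    return sum(partial_counts)


if __name__ == "__main__":
    print(run_project(0, 10_000, unit_size=1_000, participants=4))
```

Because no work unit depends on another, a job that is lost with a failing node could simply be handed out again, and adding participants shortens the total run as long as unassigned work units remain.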

Although no clear definitions exist for parallel and distributed computing, the difference is generally considered to be that in parallel computing different processes share the same memory, while in distributed computing each processor has its own memory. Parallel computing might thus be considered a more tightly coupled form of distributed computing.

2.2 Applications

Algorithms have been designed to tackle various problems through distributed computing. Mathematical applications include searching for unknown prime numbers and testing cryptographic techniques. Other major applications include medical research to cure diseases, the study of global warming, the discovery of pulsars, and many other types of scientific research. A few such projects are discussed in Section 3.3.

3 The BOINC Platform

The Berkeley Open Infrastructure for Network Computing (BOINC) is an open-source platform for distributed applications, developed at the University of California, Berkeley. It serves as a platform for various scientific research projects that require grid computing. The research fields include biology and medicine, earth sciences, physics and astronomy, mathematics, artificial intelligence, and many others.

3.1 Design

The BOINC platform emerged from a rewrite of the distributed computing client for the Search for Extra-Terrestrial Intelligence (SETI). SETI's purpose was twofold: to do useful scientific work by supporting an observational analysis to detect intelligent life outside Earth, which is done by analyzing radio signals, and to prove the viability and practicality of the 'volunteer computing' concept. Thus far the first goal has not been met. The SETI client was only the second large distributed computing project, initiated in 1999, also at Berkeley.