freeCycles - Efficient Data Distribution for Volunteer Computing
Rodrigo Bruno
INESC-ID, Instituto Superior Técnico, Universidade de Lisboa
Lisboa, Portugal
[email protected]

ABSTRACT
Volunteer Computing (VC) is a very interesting solution to harness large amounts of computational power, network bandwidth, and storage which would otherwise be left unused. In addition, recent developments of new programming paradigms, namely MapReduce, have raised the interest of using VC solutions to run MapReduce applications on the large scale Internet. However, the data distribution techniques currently used in VC to distribute the high volumes of information needed to run MapReduce jobs are naive and therefore need to be re-thought.

We present a VC solution called freeCycles that supports MapReduce jobs and provides two new main contributions: i) it improves data distribution (among the data server, mappers, and reducers) by using the BitTorrent protocol to distribute input, intermediate, and output data, and ii) it improves intermediate data availability by replicating it through volunteers, in order to avoid losing intermediate data and consequently to prevent big delays on the overall MapReduce execution time.

We present the design and implementation of freeCycles along with an extensive set of performance results which confirm the usefulness of the above mentioned contributions, improved data distribution and availability, thus making VC a feasible approach to run MapReduce jobs.

Categories and Subject Descriptors
C.2.0 [Computer-Communication Networks]: General - Data communications; C.2.4 [Computer-Communication Networks]: Distributed Systems - Distributed applications; C.4 [Performance of Systems]: Reliability, availability, and serviceability

General Terms
Algorithms, Performance, Reliability

Keywords
Volunteer Computing, BitTorrent, BOINC, MapReduce

1. INTRODUCTION
With the ever growing demand for computational power, scientists and companies all over the world strive to harvest more computational resources in order to solve more and bigger problems in less time, while spending as little money as possible. With these two objectives in mind, we think of volunteer computing (VC) as a viable solution to access huge amounts of computational resources, namely CPU cycles, network bandwidth, and storage, while reducing the cost of the computation (although some commercial solutions give incentives, pure VC projects can produce computation at zero cost).

With time, more computing devices (PCs, gaming consoles, tablets, mobile phones, and any other kind of device capable of sharing its resources) join the network. By aggregating all these resources in a global volunteer pool, it is possible to obtain huge amounts of computational resources that would be impossible, or impractical, to obtain with most grids, supercomputers, and clusters. For example, recent data from BOINC [3, 4] projects shows that, currently, there are 50 supported projects sustained by an average computational power of over 7 PetaFLOPS^1.

The creation and development of large scale VC systems enables large projects, which could not be run on grids, supercomputers, or clusters due to their size and/or associated costs, to run with minimized costs. VC also provides a platform to explore and create new projects without losing time building and setting up execution environments such as clusters.

In addition, recent developments of new programming paradigms, namely MapReduce^2 [9], have raised the interest of using VC solutions to run MapReduce applications on large scale networks, such as the Internet. Although this increases the attractiveness of VC, it also brings the need to rethink and evolve the architectures and protocols of current VC platforms (in particular, their data distribution techniques and protocols) to adapt to these new programming paradigms.

^1 Statistics from boincstats.com
^2 MapReduce is a popular programming paradigm composed of two operations: Map (applied to a range of values) and Reduce (an operation that aggregates the values generated by the Map operation).
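As a purely illustrative example (not code from freeCycles, BOINC, or any MapReduce runtime), the following minimal Python sketch shows the two operations described in footnote 2 applied to word counting; all names in it are invented for this illustration.

from collections import defaultdict

# Map: applied independently to each input value (here, one line of text),
# emitting intermediate (key, value) pairs.
def map_fn(line):
    return [(word, 1) for word in line.split()]

# Reduce: aggregates all intermediate values emitted for a given key.
def reduce_fn(word, counts):
    return word, sum(counts)

def run_mapreduce(lines):
    # Map phase: in a VC setting, each line (or input split) could be
    # processed by a different volunteer mapper, possibly replicated.
    intermediate = defaultdict(list)
    for line in lines:
        for word, count in map_fn(line):
            intermediate[word].append(count)
    # Reduce phase: each key (or partition of keys) is aggregated by one reducer.
    return dict(reduce_fn(word, counts) for word, counts in intermediate.items())

if __name__ == "__main__":
    print(run_mapreduce(["the quick brown fox", "the lazy dog"]))
    # -> {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}

In a volunteer setting, the interesting part is not this logic but how the input splits, the intermediate (key, value) pairs, and the final output travel between the data server, mappers, and reducers; that data movement is precisely what freeCycles targets.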
We present a system called freeCycles, a VC solution whose ultimate goal is to aggregate as many volunteer resources as possible in order to run MapReduce jobs in a scalable and efficient way. The output of this project is therefore a middleware platform that enables upper software abstraction layers (programs developed by scientists, for example) to use volunteer resources all over the Internet as they would use a supercomputer or a cluster. Thus, using freeCycles, applications can be developed to solve problems that require more computational resources than what is available in private clusters, grids, or even supercomputers.

To be successful, freeCycles must fulfill several key objectives. It must be capable of scaling up with the number of volunteers; it must be capable of collecting volunteers' resources, such as CPU cycles, network bandwidth, and storage, in an efficient way; it must be able to tolerate byzantine and fail-stop failures (which can be caused by unreliable network connections, very common in large scale networks such as the Internet). Finally, we require our solution to be capable of supporting new parallel programming paradigms, in particular MapReduce, given its relevance for a large number of applications.

MapReduce applications commonly have two key features: i) they depend on large amounts of information to run (for both input and output), and ii) they may run for several cycles (as does the original page-ranking algorithm). Therefore, in order to take full advantage of the MapReduce paradigm, freeCycles must be capable of distributing large amounts of data (namely input, intermediate, and output data) very efficiently, while allowing applications to perform several MapReduce cycles without compromising scalability.

In order to understand the challenges inherent to building a solution like freeCycles, it is important to note the differences between the three main computing environments: clusters, grids, and volunteer pools. Clusters are composed of dedicated computers, so every node has a fast connection to the other nodes, the node failure rate is low, nodes are very similar in hardware and software, and all their resources are devoted to cluster jobs. The second computing environment is the grid. Grids may be created by aggregating desktop computers from universities, research labs, or even companies. Computers have moderate to fast connections with each other, and grids may be inter-connected to create larger grids; the node failure rate is moderate, and nodes are also similar in hardware and software, but their resources are shared between user tasks and grid jobs. At the end of the spectrum there are volunteer pools. Volunteer pools are made of regular desktop computers, owned by volunteers around the world. Volunteer nodes have variable Internet connection bandwidth, and node churn is very high. Coping with this churn, variable node performance and connection, and being able to do it in a scalable way, is the challenge that we address with freeCycles.

Considering all the available solutions, it is important to note that solutions based on clusters and/or grids do not fit our needs. Such solutions are designed for controlled environments where node churn is expected to be low, where nodes are typically well connected with each other, and where nodes can be trusted and are very similar in terms of software and hardware. Therefore, solutions like HTCondor [24], Hadoop [25], XtremWeb [11], and other similar computing platforms are of no use to attain our goals.

When we turn to VC platforms, we observe that most existing solutions are built and optimized to run Bag-of-Tasks applications. Therefore, solutions such as BOINC, GridBot [22], Bayanihan [19], and many others [2, 20, 26, 5, 6, 17, 16, 21, 15] do not support the execution of MapReduce jobs, which is one of our main requirements.

With respect to the few solutions [14, 23, 10, 8] that do support MapReduce, we are able to point out some frequent problems. First, data distribution (of input, intermediate, and final output data) is done using naive techniques. Current solutions do not take advantage of task replication (which is essential to avoid stragglers and to tolerate byzantine faults): even though multiple mappers produce the same output, reducers still download intermediate data from one mapper only. The second problem is related to intermediate data availability. It was shown [13] that if some mappers fail and intermediate data becomes unavailable, a MapReduce job can be delayed by 30% of the overall computing time; current solutions fail to prevent intermediate data from becoming unavailable. The third problem comes from the lack of support for applications with more than one MapReduce cycle. Current solutions typically use a central data server to store all input and output data, so this central server becomes a bottleneck between every cycle (since the previous cycle needs to upload all output data and the next cycle needs to download all the input data before starting).
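To make the first two problems concrete, the toy Python sketch below contrasts a reducer that is bound to a single designated mapper with one that can fetch the same intermediate partition from any replica of that map task. It is a simplified illustration of the replication-aware, multi-source distribution that freeCycles obtains with BitTorrent (as summarized in the abstract); the class and function names are assumptions made for this sketch, not part of any real API.

import random

class MapperReplica:
    # A volunteer that finished one replica of a map task and can serve its output.
    def __init__(self, mapper_id, partitions, online=True):
        self.mapper_id = mapper_id
        self.partitions = partitions   # {partition_id: intermediate (key, value) pairs}
        self.online = online           # volunteers churn: they may disappear at any time

    def serve(self, partition_id):
        if not self.online:
            raise ConnectionError(f"mapper {self.mapper_id} left the pool")
        return self.partitions[partition_id]

def fetch_from_single_mapper(replicas, partition_id):
    # Naive scheme: the reducer only ever contacts one designated mapper,
    # even if other replicas computed exactly the same output.
    return replicas[0].serve(partition_id)

def fetch_from_any_replica(replicas, partition_id):
    # Replication-aware scheme: any live replica can serve the partition,
    # so the data stays available as long as one replica remains online.
    for replica in random.sample(replicas, len(replicas)):
        if replica.online:
            return replica.serve(partition_id)
    raise RuntimeError(f"no live replica holds partition {partition_id}")

if __name__ == "__main__":
    # Two volunteers ran the same map task; the first has since gone offline.
    replicas = [
        MapperReplica("m1", {0: [("fox", 1)]}, online=False),
        MapperReplica("m2", {0: [("fox", 1)]}, online=True),
    ]
    print(fetch_from_any_replica(replicas, 0))    # succeeds: [('fox', 1)]
    # fetch_from_single_mapper(replicas, 0)       # would raise ConnectionError

This sketch only captures the availability aspect; a BitTorrent swarm additionally lets a reducer download different pieces of the same data from several sources in parallel, which is the kind of distribution improvement freeCycles aims for.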
In order to support the execution of programming paradigms like MapReduce, which need large amounts of data to process, new distribution techniques need to be employed. In conclusion, current VC solutions raise several issues that must be addressed before new parallel programming paradigms, MapReduce in particular, can run efficiently on large volunteer pools.

freeCycles is a BOINC-compatible VC platform that en-