Celebrating Diversity in Volunteer Computing
David P. Anderson, University of California, Berkeley
Kevin Reed, IBM

Abstract

The computing resources in a volunteer computing system are highly diverse in terms of software and hardware type, speed, availability, reliability, network connectivity, and other properties. Similarly, the jobs to be performed may vary widely in terms of their hardware and completion time requirements. To maximize system performance, the system’s job selection policy must accommodate both types of diversity.

In this paper we discuss diversity in the context of World Community Grid (a large volunteer computing project sponsored by IBM) and BOINC, the middleware system on which it is based. We then discuss the techniques used in the BOINC scheduler to efficiently match diverse jobs to diverse hosts.

1. Introduction

Volunteer computing is a form of distributed computing in which the general public volunteers processing and storage resources to computing projects. BOINC is a software platform for volunteer computing [2]. BOINC is being used by projects in physics, molecular biology, medicine, chemistry, astronomy, climate dynamics, mathematics, and the study of games. There are currently 50 projects and 580,000 volunteer computers supplying an average of 1.2 PetaFLOPS.

Compared to other types of high-performance computing, volunteer computing has a high degree of diversity. The volunteered computers vary widely in terms of software and hardware type, speed, availability, reliability, and network connectivity. Similarly, the applications and jobs vary widely in terms of their resource requirements and completion time constraints.

These sources of diversity place many demands on BOINC. Foremost among these is the job selection problem: when a client contacts a BOINC scheduling server, the server must choose, from a database of perhaps a million jobs, those which are “best” for that client according to a complex set of criteria. Furthermore, the server must handle hundreds of such requests per second.

In this paper we discuss this problem in the context of the IBM-sponsored World Community Grid, a large BOINC-based volunteer computing project. Section 2 describes the BOINC architecture. Section 3 summarizes the population of computers participating in World Community Grid, and the applications it supports. In Section 4 we discuss the techniques used in the BOINC scheduling server to efficiently match diverse jobs to diverse hosts.

This work was supported by National Science Foundation award OCI-0721124.

2. The BOINC model and architecture

The BOINC model involves projects and volunteers. Projects are organizations (typically academic research groups) that need computing power. Projects are independent; each operates its own BOINC server.

Volunteers participate by running BOINC client software on their computers (hosts). Volunteers can attach each host to any set of projects, and can specify the quota of bottleneck resources allocated to each project.

A BOINC server is centered around a relational database, whose tables correspond to the abstractions of BOINC’s computing model:

Platform: an execution environment, typically the combination of an operating system and processor type (Windows/x86), or a virtual environment (VMWare/x86 or Java).

Application: the abstraction of a program, independent of platforms or versions.

Application version: an executable program. Each is associated with an application, a platform, a version number, and one or more files (main program, libraries, and data).

Job: a computation to be done. Each is associated with an application (not an application version or platform) and a set of input files.

Job instance: the execution of a job on a particular host. Each job instance is associated with an application version (chosen when the job is issued) and a set of output files.
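The abstractions above can be pictured as records that reference one another. The following is a minimal Python sketch, not BOINC’s actual schema: the class and field names, the `issue_instance` helper, the sample identifiers, and the "pick the highest version number" rule are all assumptions made for illustration. It shows the one structural point made above: a job refers only to an application, and an application version (and hence a platform) is bound only when a job instance is issued to a host.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical record types mirroring the abstractions described above.
# Names and fields are illustrative assumptions, not BOINC's real tables.

@dataclass(frozen=True)
class Platform:
    name: str  # made-up identifier such as "windows_x86"

@dataclass(frozen=True)
class Application:
    name: str

@dataclass(frozen=True)
class AppVersion:
    app: Application
    platform: Platform
    version_num: int
    files: Tuple[str, ...]  # main program, libraries, data

@dataclass
class Job:
    app: Application  # an application only: no platform, no version
    input_files: List[str]

@dataclass
class JobInstance:
    job: Job
    app_version: AppVersion  # chosen when the instance is issued
    host_id: int
    output_files: List[str] = field(default_factory=list)

def issue_instance(job: Job, host_id: int, host_platform: Platform,
                   versions: List[AppVersion]) -> Optional[JobInstance]:
    """Bind a job to an app version for the host's platform, if one exists.

    Assumption: among matching versions, the highest version number wins.
    """
    candidates = [v for v in versions
                  if v.app == job.app and v.platform == host_platform]
    if not candidates:
        return None  # no executable for this platform
    best = max(candidates, key=lambda v: v.version_num)
    return JobInstance(job=job, app_version=best, host_id=host_id)

# Usage with made-up data.
win = Platform("windows_x86")
linux = Platform("linux_x86")
app = Application("example_app")
versions = [
    AppVersion(app, win, 100, ("main.exe",)),
    AppVersion(app, win, 101, ("main.exe", "lib.dll")),
    AppVersion(app, linux, 100, ("main",)),
]
job = Job(app, ["input.dat"])
inst = issue_instance(job, host_id=7, host_platform=win, versions=versions)
print(inst.app_version.version_num)
print(issue_instance(job, 8, Platform("mac_ppc"), versions))
```

Keeping the platform out of the job record means the same job can be issued to hosts of different platforms, each instance carrying its own application version.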
The results returned by volunteers are not always correct, due to hardware malfunction (particularly on over-clocked computers) or to malicious volunteers attempting to undermine the project or to get credit for computing not actually performed. As a result, result validation is typically required. BOINC supports several techniques for result validation; the most basic is replicated computing, in which instances of each job are sent to two different hosts. If the results agree, they are assumed to be correct. Otherwise further instances are issued until a consensus of agreeing results is obtained or a limit on the number of instances is reached.

3. Host and application diversity in World Community Grid

World Community Grid is a large-scale volunteer computing project sponsored by IBM with over 390,000 volunteers, who have contributed over 167,000 years of processing time since its launch in November 2004. It is currently producing an average power of 170 TeraFLOPS from volunteered computers.

The computers attached to World Community Grid vary widely in terms of processor, operating system, memory, disk, network bandwidth, availability, error rate, and turnaround time. Figure 1 shows the distributions of these properties.

[Figure 1, panels a-f: histograms of number of computers by a) processor type, b) number of cores, c) operating system, d) physical memory (RAM in MB), e) available disk space (GB), f) measured download bandwidth (kB)]

At the time of this writing, World Community Grid has six active applications, which have varying requirements for disk space, computational power, memory and bandwidth, as shown in Figure 2.

Figure 2: Resource requirements of World Community Grid applications

Different result validation techniques are used, in accordance with the properties of the applications. The Human Proteome Folding and Nutritious Rice applications do not use replication.
The statistical distribution of results is known in advance, and invalid results appear as outliers. To detect submission of duplicate results, each job includes a small unique computation that can be checked on the server. The Help Conquer Cancer, FightAIDS@Home and Discovering Dengue Drugs applications use replicated computing.

[Figure 1, panels g-i: histograms of number of computers by g) availability (percent available), h) job error rate (percent error), i) average turnaround time (hours)]

Figure 1: Properties of hosts attached to World Community Grid

4. Job selection in BOINC

When a BOINC client is attached to a project, it periodically issues a scheduler request to the project’s server. The request message includes a description of the host and its current workload (jobs queued and in progress), descriptions of newly-completed jobs, and a request for new jobs. The reply message may contain a set of new jobs. Multiple jobs may be returned; this reduces the rate of scheduler requests and accommodates clients that are disconnected from the Internet for long periods.

A BOINC project’s database may contain millions of jobs, and its server may handle tens or hundreds of scheduler requests per second. For a given request we’d ideally scan the entire job table and send the jobs that are “best” for the client according to criteria to be discussed later. In practice this is infeasible; the database overhead would be prohibitive.

Instead, BOINC uses the following architecture (see Figure 3):

[Figure 3 diagram: client, scheduler, job cache (shared memory), feeder, job queue (DB)]

Figure 3: BOINC server architecture

A cache of roughly 1000 jobs is maintained in a shared memory segment.

This model broke down with the release of Intel-based Macintosh computers, which were able to execute PowerPC binaries via emulation. Should such a host advertise itself as Intel/Mac? If so it would be given no work by projects that had only PowerPC/Mac executables. Similarly, Linux/x64 and Windows/x64 hosts are generally able to execute Linux/x86 and Windows/x86 binaries as well; and some BSD-based systems can execute Linux binaries.

To deal with these situations we allow a host to support multiple platforms; the BOINC client generates this list by examining the host at runtime. Scheduler request messages contain a list of platforms in order of decreasing expected execution speed, and the scheduler sends applications for the fastest