Harnessing Grid Resources with Data-Centric Task Farms

Ioan Raicu, Distributed Systems Laboratory, Computer Science Department, The University of Chicago

Collaborators: Ian Foster (UC/CI/ANL), Yong Zhao (MS), Mike Wilde (CI/ANL), Zhao Zhang (CI), Rick Stevens (UC/CI/ANL), Alex Szalay (JHU), Jerry Yan (NASA/AMES), Catalin Dumitrescu (FANL)

Hyde Park Global Investments LLC Interview Talk, April 18th, 2008

What will we do with 1+ Exaflops and 1M+ cores?

1) Tackle Bigger and Bigger Problems

Computational Scientist as Hero

2) Tackle Increasingly Complex Problems

Computational Scientist as Logistics Officer

“More Complex Problems”

• Use ensemble runs to quantify climate model uncertainty
• Identify potential drug targets by screening a database of ligand structures against target proteins
• Study economic model sensitivity to key parameters
• Analyze turbulence dataset from multiple perspectives
• Perform numerical optimization to determine optimal resource assignment in energy problems

Programming Model Issues

• Multicore processors
• Massive task parallelism
• Massive data parallelism
• Integrating black box applications
• Complex task dependencies (task graphs)
• Failure and other execution management issues
• Data management: input, intermediate, output
• Dynamic task graphs
• Dynamic data access involving large amounts of data
• Documenting provenance of data products

Problem Types

(Figure: problem types plotted by input data size (Lo/Med/Hi) vs. number of tasks (1 to 1M): heroic MPI tasks, many loosely coupled tasks, and data analysis/mining.)

Motivating Example: AstroPortal Stacking Service

• Purpose
  – On-demand “stacks” of random locations within ~10TB dataset (Sloan survey data, accessed via a Web page or Web Service)
• Challenge
  – Rapid access to 10-10K “random” files
  – Time-varying load
• Solution
  – Dynamic acquisition of compute and storage resources

Challenge #1: Long Queue Times

• Wait queue times are typically longer than the job duration times

(Figure: wait queue times vs. job durations on the SDSC DataStar 1024-processor cluster, 2004.)

Challenge #2: Slow Job Dispatch Rates

• Production LRMs → ~1 job/sec dispatch rates
• What job durations are needed for 90% efficiency on a medium-size grid site (1K processors)? (a rough check is sketched below)
  – Production LRMs: 900 sec
  – Development LRMs: 100 sec
  – Experimental LRMs: 50 sec
  – 1~10 sec should be possible
(Figure: efficiency vs. task length (sec) for dispatch rates of 1 task/sec (e.g. PBS, Condor 6.8), 10 tasks/sec (e.g. Condor 6.9.2), 100 tasks/sec, 500 tasks/sec (e.g. Falkon), 1K, 10K, 100K, and 1M tasks/sec.)

Measured dispatch throughput:
  System                         Comments                Throughput (tasks/sec)
  Condor (v6.7.2) - Production   Dual Xeon 2.4GHz, 4GB   0.49
  PBS (v2.1.8) - Production      Dual Xeon 2.4GHz, 4GB   0.45
  Condor (v6.7.2) - Production   Quad Xeon 3GHz, 4GB     2
  Condor (v6.8.2) - Production                           0.42
  Condor (v6.9.3) - Development                          11
  Condor-J2 - Experimental       Quad Xeon 3GHz, 4GB     22
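One simple way to reproduce the 90%-efficiency figures above is to assume that a dispatcher delivering r tasks/sec can keep at most r × (task length) processors busy, so efficiency ≈ min(1, r × task_length / processors). This saturation model is my reading of the chart, not a formula stated in the talk; the sketch below (in Java, the language of the Falkon dispatcher) plugs in the dispatch rates quoted above.

```java
// Rough sketch: estimate the task length needed to keep a site of P processors
// ~90% busy when the LRM can only dispatch `rate` tasks per second.
// Assumes efficiency ~= min(1, rate * taskLength / processors) -- an assumption,
// not a formula taken from the talk.
public class DispatchEfficiency {
    static double efficiency(double rate, double taskLength, int processors) {
        return Math.min(1.0, rate * taskLength / processors);
    }

    static double taskLengthForEfficiency(double rate, int processors, double target) {
        return target * processors / rate;   // invert the saturation model above
    }

    public static void main(String[] args) {
        int processors = 1000;               // medium-size grid site
        double[] rates = {1, 10, 22, 500};   // PBS/Condor, Condor 6.9, Condor-J2, Falkon
        for (double r : rates) {
            System.out.printf("rate=%5.0f tasks/s -> ~%.0f sec tasks for 90%% efficiency%n",
                              r, taskLengthForEfficiency(r, processors, 0.9));
        }
    }
}
```

With 1000 processors this gives roughly 900 sec at 1 task/sec, 90 sec at 10 tasks/sec, about 41 sec at 22 tasks/sec, and under 2 sec at Falkon-like rates of 500 tasks/sec, which is in line with the numbers quoted on the slide.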

Challenge #3: Poor Scalability of Shared File Systems

• GPFS vs. LOCAL

  – Read Throughput
    • 1 node: 0.48 Gb/s vs. 1.03 Gb/s → 2.15x
    • 160 nodes: 3.4 Gb/s vs. 165 Gb/s → 48x
  – Read+Write Throughput
    • 1 node: 0.2 Gb/s vs. 0.39 Gb/s → 1.95x
    • 160 nodes: 1.1 Gb/s vs. 62 Gb/s → 55x
  – Metadata (mkdir / rm -rf)
    • 1 node: 151/sec vs. 199/sec → 1.3x
    • 160 nodes: 21/sec vs. 31840/sec → 1516x
(Figure: GPFS vs. LOCAL read and read+write throughput (Mb/s) vs. number of nodes, 1 to 1000.)

Hypothesis

“Significant performance improvements can be obtained in the analysis of large datasets by leveraging information about workloads rather than individual data analysis tasks.”

• Important concepts related to the hypothesis
  – Workload: a complex query (or set of queries) decomposable into simpler tasks to answer broader analysis questions
  – Data locality is crucial to the efficient use of large-scale distributed systems for scientific and data-intensive applications
  – Allocate computational and caching storage resources, co-scheduled to optimize workload performance

Proposed Solution: Part 1 – Abstract Model and Validation

• AMDASK: An Abstract Model for DAta-centric taSK farms
  – Task Farm: a common parallel pattern that drives independent computational tasks
  – Models the efficiency of data analysis workloads for the split/merge class of applications
  – Captures the following data diffusion properties:
    • Resources are acquired in response to demand
    • Data and applications diffuse from archival storage to new resources
    • Resource “caching” allows faster responses to subsequent requests
    • Resources are released when demand drops
    • Considers both data and computations to optimize performance
• Model Validation
  – Implement the abstract model in a discrete event simulation
  – Validate the model with statistical methods (R² statistic, residual analysis)

Proposed Solution: Part 2 – Practical Realization

• Falkon: a Fast and Light-weight tasK executiON framework
  – Light-weight task dispatch mechanism
  – Dynamic resource provisioning to acquire and release resources
  – Data management capabilities including data-aware scheduling
  – Integration into Swift to leverage many Swift-based applications
    • Applications cover many domains: astronomy, astrophysics, medicine, chemistry, and economics

AMDASK: Performance Efficiency Model

• B: average task execution time, over a stream of tasks K, where µ(k) is the execution time of task k:
    B = \frac{1}{|K|} \sum_{k \in K} \mu(k)
• Y: average task execution time with overheads, where o(k) is the dispatch overhead of task k, ς(δ,τ) is the time to transfer data object δ to transient resource τ, φ(τ) is the cache content of τ, and Ω is the persistent store:
    Y = \frac{1}{|K|} \sum_{k \in K} \left[ \mu(k) + o(k) \right], \quad \text{if } \delta \in \varphi(\tau), \ \delta \in \Omega
    Y = \frac{1}{|K|} \sum_{k \in K} \left[ \mu(k) + o(k) + \varsigma(\delta,\tau) \right], \quad \text{if } \delta \notin \varphi(\tau), \ \delta \in \Omega
• V: workload execution time, where A is the arrival rate of tasks and T is the set of transient resources:
    V = \max\left(\frac{B}{|T|}, \frac{1}{A}\right) \cdot |K|
• W: workload execution time with overheads:
    W = \max\left(\frac{Y}{|T|}, \frac{1}{A}\right) \cdot |K|

AMDASK: Performance Efficiency Model

• Efficiency:
    E = \frac{V}{W} = \begin{cases} 1, & \text{if } \frac{Y}{|T|} \le \frac{1}{A} \\ \max\left(\frac{B}{Y}, \frac{|T|}{A \cdot Y}\right), & \text{if } \frac{Y}{|T|} > \frac{1}{A} \end{cases}
• Speedup:
    S = E \cdot |T|

• Optimizing Efficiency
  – Easy to maximize either efficiency or speedup independently
  – Harder to maximize both at the same time
    • Find the smallest number of transient resources |T| while maximizing speedup*efficiency

Performance Efficiency Model Example: 1K CPU Cluster

• Application: Angle, a distributed data mining application
• Testbed characteristics:
  – Computational resources: 1024 processors
  – Transient resource bandwidth: 10 MB/sec
  – Persistent store bandwidth: 426 MB/sec
• Workload:
  – Number of tasks: 128K
  – Arrival rate: 1000/sec
  – Average task execution time: 60 sec
  – Data object size: 40 MB

A worked calculation with these numbers is sketched below.
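The following minimal sketch (in Java, matching the Falkon implementation language) plugs the example parameters above into the AMDASK formulas for V, W, efficiency, and speedup. The per-task dispatch overhead and the cache-hit fraction are illustrative assumptions; they are not specified on the slide.

```java
// Minimal sketch: evaluate the AMDASK efficiency model for the example workload.
// Assumed (not from the slide): dispatch overhead o = 0.05 sec/task, and a
// cache-hit fraction `hit` so that a miss pays the 40MB / 10MB/s = 4 sec transfer.
public class AmdaskExample {
    static final double B = 60.0;        // average task execution time (sec)
    static final double A = 1000.0;      // task arrival rate (tasks/sec)
    static final long   K = 128 * 1024;  // number of tasks
    static final double OVERHEAD = 0.05; // assumed dispatch overhead per task (sec)
    static final double TRANSFER = 40.0 / 10.0; // data object size / transient bandwidth

    // Y: average per-task time including overheads, for a given cache-hit fraction.
    static double Y(double hit) {
        return B + OVERHEAD + (1.0 - hit) * TRANSFER;
    }

    public static void main(String[] args) {
        double hit = 0.9;                // assumed fraction of tasks finding data cached
        for (int T = 64; T <= 1024; T *= 2) {
            double V = Math.max(B / T, 1.0 / A) * K;      // ideal workload time
            double W = Math.max(Y(hit) / T, 1.0 / A) * K; // workload time with overheads
            double E = V / W;                              // efficiency
            double S = E * T;                              // speedup
            System.out.printf("|T|=%4d  V=%8.0fs  W=%8.0fs  E=%5.3f  S=%7.1f%n",
                              T, V, W, E, S);
        }
    }
}
```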

Performance Efficiency Model Example: 1K CPU Cluster

• Falkon on the ANL/UC TG Site: peak dispatch throughput 500/sec; scalability 50~500 CPUs; peak speedup 623x
• PBS on the ANL/UC TG Site: peak dispatch throughput 1/sec; scalability <50 CPUs; peak speedup 54x

(Charts: predicted efficiency, speedup, and speedup*efficiency vs. number of processors, 1 to 1024, for Falkon and for PBS on the ANL/UC TG site.)

Model Validation: Simulations

• Implement the abstract model in a discrete event simulation (a toy simulation is sketched below)
• Simulation parameters
  – Number of storage and computational resources
  – Communication costs
  – Management overhead
  – Workloads (inter-arrival rates, query complexity, data set properties, and data locality)
• Model Validation
  – R² statistic
  – Residual analysis
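The slides do not show the simulator itself; what follows is a minimal sketch of what such a discrete event simulation could look like: tasks arrive at rate A, are served by |T| transient resources, and each task pays an execution time plus overheads, so the simulated makespan can be compared against the analytical W. All parameter values here are illustrative assumptions.

```java
import java.util.PriorityQueue;

// Toy discrete event simulation of a data-centric task farm: |T| transient
// resources serve K tasks arriving at rate A; each task costs Y seconds
// (execution + overheads). The simulated makespan can be compared with the
// analytical W = max(Y/|T|, 1/A) * K. All numbers here are illustrative.
public class TaskFarmSim {
    public static void main(String[] args) {
        int    T = 256;          // transient resources
        long   K = 128 * 1024;   // tasks
        double A = 1000.0;       // arrival rate (tasks/sec)
        double Y = 60.5;         // per-task time including overheads (sec)

        // Each resource is represented by the time at which it next becomes free.
        PriorityQueue<Double> freeAt = new PriorityQueue<>();
        for (int i = 0; i < T; i++) freeAt.add(0.0);

        double makespan = 0.0;
        for (long k = 0; k < K; k++) {
            double arrival = k / A;                             // deterministic arrivals
            double start   = Math.max(arrival, freeAt.poll());  // wait for a free resource
            double end     = start + Y;
            freeAt.add(end);
            makespan = Math.max(makespan, end);
        }

        double W = Math.max(Y / T, 1.0 / A) * K;                // analytical prediction
        System.out.printf("simulated makespan = %.1f s, analytical W = %.1f s%n",
                          makespan, W);
    }
}
```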

Falkon: a Fast and Light-weight tasK executiON framework

• Goal: enable the rapid and efficient execution of many independent jobs on large compute clusters
• Combines three components:
  – A streamlined task dispatcher able to achieve order-of-magnitude higher task dispatch rates than conventional schedulers → Challenge #1
  – Resource provisioning through multi-level scheduling techniques → Challenge #2
  – Data diffusion and data-aware scheduling to leverage the co-located computational and storage resources → Challenge #3

Falkon: The Streamlined Task Dispatcher

• Tier 1: Dispatcher
  – GT4 Web Service accepting WS task submissions from clients and sending them to available executors
• Tier 2: Executor
  – Runs tasks on local compute resources
• Provisioner
  – Static and dynamic resource provisioning
(Diagram: clients submit to the dispatcher Web Service; the provisioner acquires compute resources 1..m, each running an executor.)

Falkon: The Streamlined Task Dispatcher

• Falkon Message Exchanges
  – Description:
    {1}: task(s) submit
    {2}: task(s) submit confirmation
    {3}: notification for work
    {4}: request for task(s)
    {5 or 7}: dispatch task(s)
    {6}: deliver task(s) results to service
    {8}: notification for task result(s)
    {9}: request for task result(s)
    {10}: deliver task(s) results to client
  – Worst case (process tasks individually, no optimizations):
    • 4 WS messages ({1,2}, {4,5}, {6,7}, {9,10}) and 2 notifications ({3}, {8}) per task

Falkon: The Streamlined Task Dispatcher

• Falkon Message Exchange Enhancements (sketched below)
  – Bundling
    • Include multiple tasks per communication message
  – Piggy-backing
    • Attach the next task to the acknowledgement of the previous task
    • Include data management information in the task description and acknowledgement messages
  – Message reduction:
    • General lower bound: 10 → 2+c, where c is a small positive value
    • Application-specific lower bound: 10 → 0+c, where c is a small positive value
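The message classes below are hypothetical (Falkon's actual WS message schema is not shown in the talk); they are a minimal sketch of how bundling and piggy-backing cut per-task messages: an executor returns a batch of results, and the acknowledgement carries the next batch of tasks plus data management hints.

```java
import java.util.List;

// Hypothetical message types illustrating bundling and piggy-backing; the real
// Falkon WS types are not shown in the talk.
class Task {
    String id;
    String command;
    List<String> dataObjects;   // data management info piggy-backed in the description
}

class Result {
    String taskId;
    int exitCode;
    List<String> cachedObjects; // executor advertises what it now has cached
}

class ResultBundle {
    List<Result> results;       // bundling: many results per message
}

class TaskBundle {
    List<Task> tasks;           // bundling: many tasks per message
    boolean ackPreviousBundle;  // piggy-backing: the ack rides along with new work
}

interface DispatcherEndpoint {
    // Executor delivers a bundle of results and receives the next bundle of
    // tasks in the same exchange, amortizing the ~10 per-task messages of the
    // unoptimized protocol down to a small constant per bundle.
    TaskBundle deliverResultsAndFetchWork(String executorId, ResultBundle done);
}
```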

Falkon: Resource Provisioning

0. Provisioner registration
1. Task(s) submit
2. Resource allocation to GRAM
3. Resource allocation to LRM
4. Executor registration
5. Notification for work
6. Pick up task(s)
7. Deliver task(s) results
8. Notification for task(s) result
9. Pick up task(s) results

A sketch of a simple provisioning policy follows.
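Falkon's provisioner submits executor "pilot" jobs through GRAM and the LRM and releases them when demand drops; the policy below is an illustrative sketch only. The Allocator interface, the threshold, and the idle-timeout handling are assumptions, not Falkon's actual code.

```java
// Illustrative dynamic provisioning policy: grow the executor pool when tasks
// queue up, shrink it after resources sit idle. The Allocator interface stands
// in for GRAM/LRM submission; names and thresholds are assumptions.
interface Allocator {
    void allocateExecutors(int n);   // e.g., submit n pilot jobs via GRAM -> LRM
    void releaseIdleExecutors(int n);
}

class Provisioner {
    private final Allocator allocator;
    private final int maxExecutors;
    private final long idleTimeoutMs;
    private int allocated = 0;
    private long idleSince = -1;

    Provisioner(Allocator allocator, int maxExecutors, long idleTimeoutMs) {
        this.allocator = allocator;
        this.maxExecutors = maxExecutors;
        this.idleTimeoutMs = idleTimeoutMs;
    }

    // Called periodically with the current queue length and number of idle executors.
    void adjust(int queuedTasks, int idleExecutors, long nowMs) {
        if (queuedTasks > 0 && allocated < maxExecutors) {
            // Demand exceeds capacity: acquire enough executors to cover the backlog.
            int want = Math.min(queuedTasks, maxExecutors - allocated);
            allocator.allocateExecutors(want);
            allocated += want;
            idleSince = -1;
        } else if (queuedTasks == 0 && idleExecutors > 0) {
            // No demand: start (or continue) the idle clock, release after the timeout.
            if (idleSince < 0) idleSince = nowMs;
            if (nowMs - idleSince >= idleTimeoutMs) {
                allocator.releaseIdleExecutors(idleExecutors);
                allocated -= idleExecutors;
                idleSince = -1;
            }
        } else {
            idleSince = -1;
        }
    }
}
```

In the provisioning results later in the talk, the Falkon-15/60/120/180 labels appear to correspond to different idle timeouts (in seconds), with Falkon-∞ never releasing resources.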

Falkon: Data Diffusion

• Resources are acquired in response to demand
• Data and applications diffuse from archival storage to newly acquired resources
• Resource “caching” allows faster responses to subsequent requests
  – Cache eviction strategies: RANDOM, FIFO, LRU, LFU (see the sketch below)
• Resources are released when demand drops
(Diagram: the task dispatcher with a data-aware scheduler pulls data from persistent storage / the shared file system onto provisioned resources; idle resources are released.)
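A minimal sketch of the two mechanisms named above: a per-executor LRU cache for diffused data objects, and a dispatcher-side choice that prefers executors already caching a task's data (in the spirit of the first-cache-available policy evaluated later in the talk). This is illustrative code, not Falkon's implementation; class and method names are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Illustrative data diffusion pieces (not Falkon's actual code):
// (1) a per-executor LRU cache of data objects, and
// (2) a data-aware choice of executor for an incoming task.
class ExecutorCache {
    private final long capacityBytes;
    private long usedBytes = 0;
    // access-order LinkedHashMap gives LRU iteration order for free
    private final LinkedHashMap<String, Long> objects =
            new LinkedHashMap<>(16, 0.75f, true);

    ExecutorCache(long capacityBytes) { this.capacityBytes = capacityBytes; }

    boolean contains(String objectId) { return objects.containsKey(objectId); }

    void put(String objectId, long sizeBytes) {
        // Evict least-recently-used objects until the new one fits.
        while (usedBytes + sizeBytes > capacityBytes && !objects.isEmpty()) {
            Map.Entry<String, Long> lru = objects.entrySet().iterator().next();
            usedBytes -= lru.getValue();
            objects.remove(lru.getKey());
        }
        objects.put(objectId, sizeBytes);
        usedBytes += sizeBytes;
    }
}

class DataAwareScheduler {
    // Prefer an idle executor that already caches the task's data object;
    // otherwise fall back to any idle executor (the data will diffuse to it).
    static String pickExecutor(String dataObject,
                               Map<String, ExecutorCache> caches,
                               Set<String> idleExecutors) {
        for (String exec : idleExecutors) {
            if (caches.get(exec).contains(dataObject)) return exec;
        }
        return idleExecutors.isEmpty() ? null : idleExecutors.iterator().next();
    }
}
```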

Falkon: Data Diffusion

• Considers both data and computations to optimize performance
• Decreases dependency on a shared file system
  – Theoretical linear scalability with compute resources
  – Significantly increases metadata creation and/or modification performance
• Completes the “data-centric task farm” realization

Related Work: Task Farms

• [Casanova99]: Adaptive Scheduling for Task Farming with Grid Middleware
• [Heymann00]: Adaptive Scheduling for Master-Worker Applications on the Computational Grid
• [Danelutto04]: Adaptive Task Farm Implementation Strategies
• [González-Vélez05]: An Adaptive Skeletal Task Farm for Grids
• [Petrou05]: Scheduling Speculative Tasks in a Compute Farm
• [Reid06]: Task Farming on Blue Gene

Conclusion: none addressed the proposed “data-centric” part of task farms

Related Work: Task Dispatch

• [Zhou92]: LSF – Load Sharing Cluster Management
• [Bode00]: PBS – Portable Batch Scheduler and Maui Scheduler
• [Anderson04]: BOINC – Task Distribution for Volunteer Computing
• [Thain05]: Condor
• [Robinson07]: Condor-J2 – Turning Cluster Management into Data Management

Conclusion: related work is several orders of magnitude slower

Related Work: Resource Provisioning

• [Appleby01]: Oceano – SLA-Based Management of a Computing Utility
• [Frey02, Mehta06]: Condor glide-ins
• [Walker06]: MyCluster (based on Condor glide-ins)
• [Ramakrishnan06]: Grid Hosting with Adaptive Resource Control
• [Bresnahan06]: Provisioning of bandwidth
• [Singh06]: Simulations

Conclusion: our approach allows dynamic resizing of the resource pool (independent of application logic) based on system load, and makes use of light-weight task dispatch

Related Work: Data Management

• [Beynon01]: DataCutter
• [Ranganathan03]: Simulations
• [Ghemawat03, Dean04, Chang06]: GFS, MapReduce, BigTable
• [Liu04]: GridDB
• [Chervenak04, Chervenak06]: RLS (Replica Location Service), DRS (Data Replication Service)
• [Tatebe04, Xiaohui05]: GFarm
• [Branco04, Adams06]: DIAL/ATLAS

Conclusion: our work focuses on the co-location of storage and computation (i.e., on the same physical resource) while operating in a dynamic environment.

Results

• Abstract task farm model [Dissertation Proposal 2007]
• Practical Realization: Falkon
  – Task Dispatcher [Globus Incubator 2007, SC07, SC08]
  – Resource Provisioning [SC07, TG07]
  – Data Diffusion [NSF06, MSES07, DADC08]
  – Swift Integration [SWF07, NOVA08, SWF08, GW08]

• Applications [NASA06, TG06, SC06, NASA07, SWF07, NOVA08, SC08]
  – Astronomy, medical imaging, molecular dynamics (chemistry and pharmaceuticals), economic modeling

Dispatcher Throughput

• Fast:
  – Up to 3700 tasks/sec
• Scalable:
  – 54,000 processors
  – 1,500,000 tasks queued
• Efficient:
  – High efficiency with second-long tasks on 1000s of processors
(Charts: dispatch throughput of 604 to 3773 tasks/sec across executor implementations (Java and C, with and without bundling) on ANL/UC (200 CPUs), SiCortex (5760 CPUs), and BlueGene/P; efficiency vs. task length for each system.)

Dispatcher Performance Profiling

• GT: Java WS-Core 4.0.4
• Java: Sun JDK 1.6
• Machine Hardware: dual Xeon 3GHz CPUs with HT
• Machine OS: Linux 2.6.13-15.16-smp
• Executors Location: ANL/UC TG Site, 100 dual-CPU Xeon/Itanium nodes, ~2ms latency
• Workload: 10000 tasks, “/bin/sleep 0”

(Chart: CPU time per task (ms) for the Java and C executors, broken down into task submit (client -> service), notification for task availability (service -> executor), task dispatch (service -> executor), task results (executor -> service), notifications for task results (service -> client), and WS communication / TCPCore.)

Resource Provisioning

• Workload: 18 stages, 1,000 tasks, 17,820 CPU seconds; 1,260 sec total time on 32 machines in the ideal case
(Charts: allocated, registered, and active executors over time for Ideal, Falkon-15, and Falkon-180; number of tasks per stage across the 18 stages.)

• End-to-end execution time:
  – 1260 sec in the ideal case
  – 4904 sec (GRAM+PBS) → 1276 sec (Falkon-∞)
• Average task queue time:
  – 42.2 sec in the ideal case
  – 611 sec (GRAM+PBS) → 43.5 sec (Falkon-∞)
• Trade-off:
  – Resource utilization for execution efficiency

Per-task times:
                        GRAM+PBS  Falkon-15  Falkon-60  Falkon-120  Falkon-180  Falkon-∞  Ideal (32 nodes)
  Queue time (sec)         611.1       87.3       83.9        74.7        44.4      43.5       42.2
  Execution time (sec)      56.5       17.9       17.9        17.9        17.9      17.9       17.8
  Execution time %           8.5%      17.0%      17.6%       19.3%       28.7%     29.2%      29.7%

Workload totals:
                        GRAM+PBS  Falkon-15  Falkon-60  Falkon-120  Falkon-180  Falkon-∞  Ideal (32 nodes)
  Time to complete (sec)    4904       1754       1680        1507        1484      1276       1260
  Resource utilization       30%        89%        75%         65%         59%       44%       100%
  Execution efficiency       26%        72%        75%         84%         85%       99%       100%
  Resource allocations      1000         11          9           7           6         0          0

Data Diffusion

• No locality
  – Modest loss of read performance for small # of nodes (<8)
  – Comparable performance with large # of nodes
  – Modest gains in read+write performance
• Locality
  – Significant gains in performance beyond 8 nodes
  – Data-aware scheduler achieves near-optimal performance and scalability
(Charts: read and read+write throughput (Mb/s) vs. number of nodes, 1 to 64, for: 1. model (local disk), 2. model (shared file system), 3. Falkon (first-available policy), 5. Falkon (first-cache-available policy, 0% locality), 6. Falkon (first-cache-available policy, 100% locality), 7. Falkon (max-compute-util policy, 0% locality), 8. Falkon (max-compute-util policy, 100% locality).)

Falkon Integration with Swift

  Application                                      #Tasks/workflow   #Stages
  ATLAS: High Energy Physics Event Simulation      500K              1
  fMRI DBIC: AIRSN Image Processing                100s              12
  FOAM: Ocean/Atmosphere Model                     2000              3
  GADU: Genomics                                   40K               4
  HNL: fMRI Aphasia Study                          500               4
  NVO/NASA: Photorealistic Montage/Morphology      1000s             16
  QuarkNet/I2U2: Physics Science Education         10s               3 ~ 6
  RadCAD: Radiology Classifier Training            1000s             5
  SIDGrid: EEG Wavelet Processing, Gaze Analysis   100s              20
  SDSS: Coadd, Cluster Search                      40K, 500K         2, 8
  SDSS: Stacking, AstroPortal                      10Ks ~ 100Ks      2 ~ 4
  MolDyn: Molecular Dynamics                       1Ks ~ 20Ks        8

(Diagram: Swift architecture: SwiftScript programs are compiled and scheduled by the abstract execution engine (Karajan with the Swift runtime), dispatched to virtual nodes through Falkon and its resource provisioner (including Amazon EC2), with provenance data collected.)

Functional MRI (fMRI)

• Wide range of analyses
  – Testing, interactive analysis, production runs
  – Data mining

  – Parameter studies

fMRI Application

• GRAM vs. Falkon: 85%~90% lower run time
• GRAM/Clustering vs. Falkon: 40%~74% lower run time

(Chart: fMRI workflow execution time (s) vs. input data size (120, 240, 360, and 480 volumes) for GRAM, GRAM/Clustering, and Falkon.)

(Montage credit: B. Berriman, J. Good (Caltech); J. Jacob, D. Katz (JPL))

Montage Application

• GRAM/Clustering vs. Falkon: 57% lower application run time
• MPI* vs. Falkon: 4% higher application run time
• *MPI should be the lower bound

(Chart: Montage run time (s) by component, including mProject, mDiff/mFit, mBackground, mAdd(sub), mAdd, and total, for GRAM/Clustering, MPI, and Falkon.)

MolDyn Application

• 244 molecules → 20497 jobs
• 15091 seconds on 216 CPUs → 867.1 CPU hours
• Efficiency: 99.8%
• Speedup: 206.9x (8.2x faster than GRAM/PBS)
• 50 molecules w/ GRAM (4201 jobs) → 25.3x speedup

(Chart: per-task timeline, showing waitQueueTime, execTime, and resultsQueueTime, for all ~20K MolDyn tasks.)

MARS Economic Modeling on IBM BG/P

• CPU Cores: 2048
• Tasks: 49152
• Micro-tasks: 7077888
• Elapsed time: 1601 secs
• CPU Hours: 894
• Speedup: 1993x (ideal 2048)
• Efficiency: 97.3%

Many Many Tasks: Identifying Potential Drug Targets

200+ protein target(s) x 5M+ ligands

(Mike Kubal, Benoit Roux, and others)

DOCK on SiCortex

• CPU cores: 5760
• Tasks: 92160
• Elapsed time: 12821 sec
• Compute time: 1.94 CPU years
• Average task time: 660.3 sec
• Speedup: 5650x (ideal 5760)
• Efficiency: 98.2%

AstroPortal Stacking Service

• Purpose
  – On-demand “stacks” of random locations within ~10TB dataset (Sloan data, accessed via a Web page or Web Service)
• Challenge
  – Rapid access to 10-10K “random” files
  – Time-varying load
• Sample Workloads:
  Locality   Number of Objects   Number of Files
  1          111700              111700
  1.38       154345              111699
  2          97999               49000
  3          88857               29620
  4          76575               19145
  5          60590               12120
  10         46480               4650
  20         40460               2025
  30         23695               790

AstroPortal Stacking Service with Data Diffusion

(Charts: time (ms) per stack per CPU vs. number of CPUs, 2 to 128, for Data Diffusion (GZ), Data Diffusion (FIT), GPFS (GZ), and GPFS (FIT); one chart for low data locality and one for high data locality, where data diffusion shows near-perfect scalability.)

AstroPortal Stacking Service with Data Diffusion

• Big performance gains as locality increases
• 90%+ cache hit ratios
(Charts: time (ms) per stack per CPU and local-disk cache hit percentage vs. locality, 1 to 30 and ideal, for the max-compute-util policy, compared with the ideal cache hit ratio.)

AstroPortal Stacking Service with Data Diffusion

• Aggregate throughput:
  – 39 Gb/s
  – 10X higher than GPFS
• Reduced load on GPFS:
  – 0.49 Gb/s
  – 1/10 of the original load
(Charts: aggregate throughput (Gb/s) and data movement (MB) per stack vs. locality, broken down into data diffusion local, cache-to-cache, and GPFS traffic, compared with GPFS (FIT) and GPFS (GZ).)

Hadoop vs. Swift

• Classic benchmarks for MapReduce
  – Word Count
  – Sort
• Swift performs similarly to or better than Hadoop (on 32 processors)
(Charts: Word Count run times (s) for 75MB, 350MB, and 703MB inputs, and Sort run times for 10MB, 100MB, and 1000MB inputs, comparing Swift+PBS, Hadoop, and Swift+Falkon.)

Mythbusting

• “Embarrassingly” (happily) parallel apps are trivial to run
  – Logistical problems can be tremendous
• Loosely coupled apps do not require “supercomputers”
  – Total computational requirements can be enormous
  – Individual tasks may be tightly coupled
  – Workloads frequently involve large amounts of I/O
• Loosely coupled apps do not require specialized system software
• Shared file systems are good all-around solutions
  – They don’t scale proportionally with the compute resources

Conclusions & Contributions

• Defined an abstract model for the performance efficiency of data analysis workloads using data-centric task farms
• Provided a reference implementation (Falkon)
  – Uses a streamlined dispatcher to increase task throughput by several orders of magnitude over traditional LRMs
  – Uses multi-level scheduling to reduce the perceived wait queue time for tasks to execute on remote resources
  – Addresses data diffusion through co-scheduling of storage and computational resources to improve performance and scalability
  – Provides the benefits of dedicated hardware without the associated high cost
  – Shows flexibility/effectiveness on real-world applications
    • Astronomy, medical imaging, molecular dynamics (chemistry and pharmaceuticals), economic modeling
  – Runs on real systems with 1000s of processors:
    • TeraGrid, IBM BlueGene/P, SiCortex

More Information

• More information: http://people.cs.uchicago.edu/~iraicu/
• Related Projects:
  – Falkon: http://dev.globus.org/wiki/Incubator/Falkon
  – AstroPortal: http://people.cs.uchicago.edu/~iraicu/projects/Falkon/astro_portal.htm
  – Swift: http://www.ci.uchicago.edu/swift/index.php
• Collaborators (relevant to this proposal):
  – Ian Foster, The University of Chicago & Argonne National Laboratory
  – Alex Szalay, The Johns Hopkins University
  – Rick Stevens, The University of Chicago & Argonne National Laboratory
  – Yong Zhao, Microsoft
  – Mike Wilde, Computation Institute, University of Chicago & Argonne National Laboratory
  – Catalin Dumitrescu, Fermi National Laboratory
  – Zhao Zhang, The University of Chicago
  – Jerry C. Yan, NASA, Ames Research Center
• Funding:
  – NASA: Ames Research Center, Graduate Student Research Program (GSRP)
  – DOE: Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Dept. of Energy
  – NSF: TeraGrid

Proposals (selected)

1. Ioan Raicu. “Harnessing Grid Resources with Data-Centric Task Farms”, University of Chicago, Computer Science Department, PhD Proposal, December 2007, Chicago, Illinois.
2. Ioan Raicu, Yong Zhao, Catalin Dumitrescu, Ian Foster, Mike Wilde. “Falkon: A Proposal for Project Globus Incubation”, Globus Incubation Management Project, 2007 – Proposal accepted 11/10/07.
3. Ioan Raicu, Ian Foster. “Harnessing Grid Resources to Enable the Dynamic Analysis of Large Astronomy Datasets: Year 1 Status and Year 2 Proposal”, NASA GSRP Proposal, Ames Research Center, NASA, February 2007 – Award funded 10/1/07 - 9/30/08.
4. Ioan Raicu, Ian Foster. “Harnessing Grid Resources to Enable the Dynamic Analysis of Large Astronomy Datasets”, NASA GSRP Proposal, Ames Research Center, NASA, February 2006 – Award funded 10/1/06 - 9/30/07.
5. Alexandru Iosup, Catalin L. Dumitrescu, Dick Epema, Ioan Raicu, Ian Foster, Matei Ripeanu. “ServMark (DiPerF+GrenchMark): A Proposal for Project Globus Incubation”, 2006 – Proposal accepted 06/22/06.

Journal Articles & Book Chapters

1. Yong Zhao, Ioan Raicu, Ian Foster, Mihael Hategan, Veronika Nefedova, Mike Wilde. “Realizing Fast, Scalable and Reliable Scientific Computations in Grid Environments”, to appear as a book chapter in Research Progress, ISBN: 978-1-60456-404-4, Nova Publisher 2008.
2. Catalin Dumitrescu, Jan Dünnweber, Philipp Lüdeking, Sergei Gorlatch, Ioan Raicu, Ian Foster. “Simplifying Grid Application Programming Using Web-Enabled Code Transfer Tools”, Toward Next Generation Grids, Chapter 6, Springer Verlag, 2007.
3. Catalin Dumitrescu, Ioan Raicu, Ian Foster. “The Design, Usage, and Performance of GRUBER: A Grid uSLA-based Brokering Infrastructure”, to appear in International Journal of Grid Computing, 2007.
4. Ioan Raicu, Catalin Dumitrescu, Matei Ripeanu, Ian Foster. “The Design, Performance, and Use of DiPerF: An automated DIstributed PERformance testing Framework”, International Journal of Grid Computing, Special Issue on Global and Peer-to-Peer Computing, 2006; 25% acceptance rate.
5. Catalin Dumitrescu, Ioan Raicu, Ian Foster. “Usage SLA-based Scheduling in Grids”, Journal on Concurrency and Computation: Practice and Experience, to appear in 2006.
6. Ioan Raicu, Loren Schwiebert, Scott Fowler, Sandeep K.S. Gupta. “Local Load Balancing for Globally Efficient Routing in Wireless Sensor Networks”, International Journal of Distributed Sensor Networks, 1: 163-185, 2005.
7. Sherali Zeadally, R. Wasseem, Ioan Raicu. “Comparison of End-System IPv6 Protocol Stacks”, IEE Proceedings Communications, Special issue on Internet Protocols, Technology and Applications (VoIP), Vol. 151, No. 3, June 2004.
8. Sherali Zeadally, Ioan Raicu. “Evaluating IPv6 on Windows and Solaris”, IEEE Internet Computing, Volume 7, Issue 3, May/June 2003, pp 51-57.

Workshop/Conference Articles (selected)

1. Yong Zhao, Ioan Raicu, Ian Foster. “Scientific Workflow Systems for 21st Century e-Science, New Bottle or New Wine?”, Invited Paper, to appear at IEEE Workshop on Scientific Workflows 2008.
2. Ioan Raicu, Yong Zhao, Ian Foster, Alex Szalay. “Accelerating Large-scale Data Exploration through Data Diffusion”, to appear at International Workshop on Data-Aware Distributed Computing 2008, co-located with ACM/IEEE Int. Symposium on High Performance Distributed Computing (HPDC) 2008.
3. Ioan Raicu, Yong Zhao, Ian Foster, Alex Szalay. “A Data Diffusion Approach to Large Scale Scientific Exploration”, to appear in the Microsoft Research eScience Workshop 2007.
4. Catalin Dumitrescu, Alexandru Iosup, H. Mohamed, Dick H.J. Epema, Matei Ripeanu, Nicolae Tapus, Ioan Raicu, Ian Foster. “ServMark: A Framework for Testing Grids Services”, IEEE Grid 2007.
5. Ioan Raicu, Yong Zhao, Catalin Dumitrescu, Ian Foster, Mike Wilde. “Falkon: a Fast and Light-weight tasK executiON framework”, IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SuperComputing/SC), 2007.
6. Ioan Raicu, Catalin Dumitrescu, Ian Foster. “Dynamic Resource Provisioning in Grid Environments”, TeraGrid Conference 2007.
7. Yong Zhao, Mihael Hategan, Ben Clifford, Ian Foster, Gregor von Laszewski, Ioan Raicu, Tiberiu Stef-Praun, Mike Wilde. “Swift: Fast, Reliable, Loosely Coupled Parallel Computation”, IEEE Workshop on Scientific Workflows 2007.
8. Ioan Raicu, Ian Foster, Alex Szalay. “Harnessing Grid Resources to Enable the Dynamic Analysis of Large Astronomy Datasets”, poster presentation, IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SuperComputing/SC), 2006.
9. Ioan Raicu, Ian Foster, Alex Szalay, Gabriela Turcu. “AstroPortal: A Science Gateway for Large-scale Astronomy Data Analysis”, TeraGrid Conference 2006, June 2006.
10. Alex Szalay, Julian Bunn, Jim Gray, Ian Foster, Ioan Raicu. “The Importance of Data Locality in Distributed Computing Applications”, NSF Workflow Workshop 2006.
11. Catalin Dumitrescu, Ioan Raicu, Ian Foster. “Performance Measurements in Running Workloads over a Grid”, The 4th International Conference on Grid and Cooperative Computing (GCC 2005); 11% acceptance rate.
12. Catalin Dumitrescu, Ioan Raicu, Ian Foster. “DI-GRUBER: A Distributed Approach for Grid Resource Brokering”, IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SuperComputing/SC), 2005; 22% acceptance rate.
13. William Allcock, John Bresnahan, Rajkumar Kettimuthu, Michael Link, Catalin Dumitrescu, Ioan Raicu, Ian Foster. “The Globus Striped GridFTP Framework and Server”, IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SuperComputing/SC), 2005; 22% acceptance rate.
14. Ioan Raicu, Loren Schwiebert, Scott Fowler, Sandeep K.S. Gupta. “e3D: An Energy-Efficient Routing Algorithm for Wireless Sensor Networks”, IEEE ISSNIP 2004 (The International Conference on Intelligent Sensors, Sensor Networks and Information Processing), Melbourne, Australia, December 2004; top 10% of conference papers, extended version published in International Journal of Distributed Sensor Networks 2005.
15. Catalin Dumitrescu, Ioan Raicu, Matei Ripeanu, Ian Foster. “DiPerF: an automated DIstributed PERformance testing Framework”, IEEE/ACM GRID2004, Pittsburgh, PA, November 2004, pp 289-296; 22% acceptance rate.
16. Ioan Raicu, Sherali Zeadally. “Impact of IPv6 on End-User Applications”, IEEE International Conference on Telecommunications 2003, ICT'2003, Volume 2, Feb 2003, pp 973-980, Tahiti Papeete, French Polynesia; 35% acceptance rate.
17. Ioan Raicu, Sherali Zeadally. “Evaluating IPv4 to IPv6 Transition Mechanisms”, IEEE International Conference on Telecommunications 2003, ICT'2003, Volume 2, Feb 2003, pp 1091-1098, Tahiti Papeete, French Polynesia; 35% acceptance rate.