
The IBM PERCS Project: Hardware-Software Co-design for High Programmer Productivity

Kemal Ebcioglu
Co-leader, Programming Model and Tools Area, IBM PERCS Project
IBM Research
Email: [email protected]

This work has been supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract No. NBCH30390004.

Disclaimer

• The research material described in this presentation implies no commitment regarding future IBM software or hardware products.

PERCS Programming Model & Tools Team

• X10
  – Philippe Charles
  – Chris Donawa
  – Kemal Ebcioglu
  – Christian Grothoff
  – Allan Kielstra
  – Christoph von Praun
  – Vijay Saraswat
  – Vivek Sarkar
• PERCS Tools
  – Marina Biberstein
  – Bill Chung
  – Robert Fuhrer
  – Matthias Hauswirth
  – Eugen Nistor
  – Peter Sweeney
  – Beth Tibbitts
  – Frank Tip
  – Mandana Vaziri
  – Justin Xue
• PERCS Productivity
  – Catalina Danis
  – Christine Halverson
  – Wendy Kellogg
• University partners
  – MIT
  – Purdue University
  – UC Berkeley
  – U. Delaware
  – U. Illinois
  – U. Pittsburgh
  – UT Austin
  – Vanderbilt University
• Leads
  – Mootaz Elnozahy (PERCS Principal Investigator)
  – Rama Govindaraju (IBM STG software lead)
  – Vivek Sarkar, Kemal Ebcioglu (IBM Research PERCS Programming Model & Tools leads)

Talk outline

• Programming productivity trends in HPC
• Overview of PERCS project and X10 language
• Productivity experiments with X10
• PERCS hardware-software co-design research agenda
  – Programmer productivity features
  – Scalability/performance features
  – Virtualization features
• Summary and future challenges

Programming productivity trends in High Performance Computing

• Step-function breakthroughs in productivity have historically occurred rarely, with fierce natural selection
  – Fortran; "structured programming" in early days
  – Integrated Environments
  – Safe OO programming
  – Re-usable components and model-based design
  – Separation of concerns – aspect-oriented programming (jury still out)
• The HPC community has lagged behind advances in commercial computing
  – C/C++/Fortran/MPI
  – Command-line tools
  – Very performance driven: functionality and performance concerns often tangled
  – Modern HPC machine complexities lead to an expertise gap:
    • Only a small percentage of employees are able to produce HPC software on a deadline
    • [Sarkar, Williams, Ebcioglu 2004]

Productivity crisis in future scalable computing systems

Memory wall: severe non-uniformities in bandwidth & latency in the memory hierarchy.
Frequency wall: multiple layers of hierarchical, heterogeneous parallelism to compensate for the slowdown in frequency scaling.

[Figure: hardware hierarchy, from clusters (scale-out) and SMPs down to multiple cores on a chip, coprocessors (SPUs), SMTs, SIMD, and ILP, with L1/L2/L3 caches and memory.]

Attempting to overcome these obstacles in a 10^5-processor future system reduces productivity:
- Lengthens the SW life cycle
- Increases the expertise gap

High Productivity Computing Systems (slide from DARPA)

Goal:
• Provide a new generation of economically viable high productivity computing systems for the national security and industrial user community (2010)

Impact:
• Performance (time-to-solution): speed up critical national security applications by a factor of 10X to 40X
• Programmability (idea-to-first-solution): reduce cost and time of developing application solutions
• Portability (transparency): insulate research and operational application software from system
• Robustness (reliability): apply all known techniques to protect against outside attacks, hardware faults, & programming errors

[Figure: HPCS Program Focus Areas]

Applications:
• Intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling and biotechnology

Fill the Critical Technology and Capability Gap: Today (late 80's HPC technology) ... to ... Future (Quantum/Bio Computing)

Overview of IBM PERCS Project

• PERCS: DARPA-sponsored hardware-software project
  – 10X productivity improvement – grand challenge
  – Multi-petaflop performance – more than 100K processors
  – Commercial viability as an IBM product
  – Project addresses all levels of the hardware-software stack: from circuits to the programming model
• The PERCS productivity strategy
  – New programming model for scalability and productivity, with embodiment in the X10 language
  – Integrated tools for reduced time-to-solution, built on the open-source Eclipse framework
  – Productivity model and measurement tools, with a focus on addressing the expertise gap

X10 design goals (productivity)

• By design, X10 aims to rule out large classes of bugs
  – Building on proven baseline OO productivity features:
    • Type safety, memory safety, pointer safety, portability, ...
  – New X10 language features help avoid concurrency errors: e.g., eliminating deadlock with X10 clocks (generalized barriers)
• Concise specification of distributed aggregate operations
  – For rapid prototyping
• Language features for productivity (see the sketch after this list), e.g.:
  – Atomic sections
    • Free the programmer from the complexity of lock management
  – Rooted exception model:
    • Handling errors from deeply nested parallel asynchronous activities
  – Integrated fine-grain parallelism inside a place and across places
    • Going beyond the SPMD model
• Ability to re-use legacy software components
• Eclipse-based tool chain for race detection, refactoring, performance optimization, ...
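As a rough illustration of the atomic sections and clocks named above, the following is a minimal sketch in the 2005-era draft X10 syntax, not a definitive implementation; the identifiers (hits, localHitsA, localHitsB, phase2Work) are hypothetical.

  // Hedged sketch: two activities update a shared counter without explicit
  // locks, and a clock provides a deadlock-free barrier between phases.
  // (Draft X10 syntax; all identifiers are illustrative.)
  final clock c = clock.factory.clock();   // the creating activity is registered on c
  async clocked(c) {
      atomic hits += localHitsA;           // atomic section: no lock management
      next;                                // wait for all activities registered on c
      atomic phase2Work();                 // runs in the next clock phase
  }
  atomic hits += localHitsB;
  next;                                    // the parent advances the same clock
  atomic phase2Work();

Because both activities must reach next before either proceeds, the clock plays the role of a barrier, and the language rules for clocks are what allow the deadlock-freedom claim above.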

X10 Design Goals (scalability)

• Aiming to scale X10 programs to O(10^5) threads
  – Language constructs for explicitly programming non-uniform data accesses (see the sketch below)
    • Performance transparency for remote accesses
  – Ability to specify high degrees of asynchronous parallelism
  – Scalable memory consistency and synchronization primitives
  – Automatic and semi-automatic performance optimization based on dynamic feedback
    • Continuous Program Optimization
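To make "explicitly programming non-uniform data accesses" concrete, here is a small hedged sketch in the same draft syntax as the RandomAccess example below; A, N, and j are hypothetical names, and the library calls (dist.factory.block, future/force) follow the early X10 draft rather than any final API.

  // Hedged sketch: remote accesses are explicit in the program text.
  final double[.] A = new double[dist.factory.block([0:N-1])] (point [i]) { return 0.0; };

  // Asynchronously read a remote element at the place that owns A[j];
  // the communication cost is visible, not hidden behind a shared-memory load.
  future<double> f = future (A.distribution[j]) { A[j] };
  double v = f.force();                    // block only when the value is needed

  // Update a remote element by shipping a small async to the owning place.
  async (A.distribution[j]) atomic A[j] += v;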

An X10 example program: RandomAccess (HPC Challenge)

  public boolean run() {
      // Allocate and initialize Table as a block-distributed array.
      dist D = dist.factory.block([0:TABLE_SIZE-1]);
      final long[.] Table = new long[D] (point [i]) { return i; };

      // Allocate and initialize RanStarts with one random number seed for each place.
      final long[.] RanStarts = new long[dist.factory.unique()]
          (point [i]) { return starts(i*N_UPDATES_PER_PLACE); };

      // Allocate a small immutable table that can be copied to all places.
      final long value[.] SmallTable = new long value[[0:TABLE_SIZE-1]]
          (point [i]) { return i*S_TABLE_INIT; };

      // Everywhere in parallel, repeatedly generate random Table indices
      // and atomically read/modify/write the Table element.
      finish ateach (point [i] : RanStarts) {
          long ran = nextRandom(RanStarts[i]);
          for (point [count] : [1:N_UPDATES_PER_PLACE]) {
              final int J = f(ran);
              final long K = SmallTable[g(ran)];
              async (D[J]) atomic Table[J] ^= K;
              ran = nextRandom(ran);
          }
      }
      return Table.sum() == EXPECTED_RESULT;
  }

Current X10 Status and Schedule

• 6/2003  PERCS programming model concept
• 2/2004  Kickoff of X10 as concrete embodiment of the PERCS Programming Model
• 7/2004  First draft of X10 language specification
• 2/2005  First (unoptimized) X10 prototype – reference implementation
• 7/2005  X10 application and productivity studies
• 3Q2005  Start participation in High Productivity Language "consortium"
• 1/2006  Second (optimized) X10 prototype
• 6/2006  Open source release of X10 reference implementation

Structure of X10 reference implementation

[Figure: compilation flow – X10 source is parsed (X10 grammar) into an AST, analysis passes produce an annotated AST, and a code emitter uses code templates to generate target Java code that runs on a JVM together with the multithreaded X10 runtime (native code), emitting PEM events.]

Productivity experiments with X10

• Major study involving 27 subjects, 05/23/2005 – 05/27/2005
  – Mostly CS and Science students, at the Pittsburgh Supercomputing Center
• Sequence alignment problem in bio-informatics (SSCA#1)
  – Suggested by David Bader, and refined by PSC bio-informatics experts
  – Given the sequential version of the code, parallelize it.
• Three conditions studied on a 3000-processor Alpha supercomputer (lemieux) at PSC
  – C+MPI
  – X10 (parallel execution through emulation only)
  – UPC
• 4.5-day experiment
  – Two days of tutorials taught by experts, hands-on exercises
  – Two days of coding under observation (both human observation and automated recording of activity)
  – ½ day exit interview
• Experiment professionally run by the IBM Research Social Computing Group and the PSC team
  – Subjects anonymous to the technical team, known as X1, X2, ...
  – All interactions were recorded. ...
• Feedback from the study is now influencing X10 language and tools design
  – Unique approach to validating and improving language and tools productivity

Development Time (slide from PSC team)

• MPI and UPC both exhibited larger maximum and median times to correct parallel outputs.
• Variability of development times for all 3 languages was high. (Work is ongoing to relate development time on the primary task to proficiencies determined from introductory exercises.)

Development time is measured from the first serial run (if performed) or first parallel construct to the first correct parallel output (where obtained), or to the end of the session (where no parallel output was obtained).

Development Time (minutes)    min    max    median
MPI                           117    968    558
  finished                    117    594    182
  out of time                 418    968    733
UPC                           125    898    500
  finished                    125    590    399
  out of time                 500    898    874
X10                            10    648    309
  finished                     10    562    289
  out of time                 648    648    648

("finished" = obtained correct parallel output; "out of time" = did not complete the study)

PERCS Hardware-Software co-design

• Productivity goals of hardware-software features
  – Time to first deterministically correct parallel solution
  – Time to optimized scalable parallel solution
  – Effort for field maintenance of software
  – Time for responding to administrative requests in HPC centers:
    • Create a securely isolated, virtual graph of processors for a customer
    • Increase or decrease resources for a customer – meet performance objectives based on service-level agreements
    • React to hardware failures based on SLA
• Overcoming impediments to scalability to 10^5 processors
  – Message aggregation pressure – performance falls off a cliff when a chip boundary is crossed
  – HPC applications can violate standard cache heuristics that work well on commercial code:
    • Little spatial or temporal locality
  – Amdahl's law – exacerbated with more parallelism

Research agenda for hardware features to improve productivity

• HPC programmers will happily trade off safety for performance
  – Make this less attractive with HW support
• Array bounds safety (illustrated in the sketch after this list)
  – Experimental virtual memory techniques for supporting array bounds checks with less overhead
• Initialization safety
  – Variables must be set before being used. Final variables must be set at most once.
• Pointer safety & memory leaks
  – Hardware means for detecting memory leaks, e.g., similar to SafeMem, iWatcher, ...
• Deterministic replay support to pinpoint errors
  – Non-determinism considered harmful: finding un-repeatable non-deterministic errors and system freezes can take months
  – X10 has a deadlock-free, determinate subset, but not all parallel programs are determinate
  – Deterministic replay: roll back from assertion failures or deadlock, then replay precisely to pinpoint the error
  – To be combined with checkpointing
  – Recent research – e.g., "flight data recorder" (Bodik et al.), Meiosys (company acquired by IBM)
• Advanced hardware performance monitoring, for dynamic optimizations
  – E.g., identify the load/store instructions where the most time is being spent due to L2 cache misses
  – Feedback can be used by the performance programmer, or by a dynamic compiler, to do, e.g., tiling
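The per-access checks that this hardware support aims to make inexpensive can be sketched as follows; this is an illustrative fragment in the draft X10 syntax, and B, N, sum, and computeLimit are hypothetical names.

  // Hedged sketch of the checks implied by a safe language.
  final double[.] B = new double[[0:N-1]] (point [i]) { return 0.0; };
  double sum = 0;
  for (point [i] : [0:N-1]) {
      sum += B[i];            // each access carries an implicit array-bounds check
  }
  final int limit;            // initialization safety: must be assigned before use,
  limit = computeLimit();     // and a final variable may be assigned at most once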

Research agenda for features to support parallelism, scalability

• Highly efficient spawning and termination of threads
  – Scalable hardware thread scheduling / active messages
  – Anchored data: send operations to where the data is
  – Non-anchored data: send data to where the operations are, then load-balance operations (both styles are sketched after this list)
• Support for atomic blocks
  – Optimistic concurrency – detect data races and roll back, or
  – Lightweight hardware locking
  – ILP research (Falsafi, Asanovic, Olukotun, ...) highly relevant
• Efficient collective and fine-grain synchronization
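A hedged sketch of the two data-placement styles above, in the same draft X10 syntax as the earlier RandomAccess example; T, work, process, and pickLeastLoadedPlace are hypothetical names, not part of any defined API.

  // Anchored data: ship a small operation (an active message) to the place
  // that owns T[i]; the data itself never moves.
  async (T.distribution[i]) atomic T[i] ^= work;

  // Non-anchored data: ship the operand to a lightly loaded place and let
  // that place perform the operation, so work can be load-balanced.
  final long v = work;
  async (pickLeastLoadedPlace()) { process(v); }   // pickLeastLoadedPlace() is hypothetical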

Virtualization features

• Consistent with IBM's autonomic computing vision
• Ability to emulate virtual graphs of computers
  – Virtual IT Shop: map M virtual nodes onto N physical nodes
• Ability to dynamically increase resources of a virtual graph on demand
• Secure isolation between different Virtual IT Shops, even if they share physical nodes
• Service-level agreements, performance guarantees in IT
  – Dynamic cache sizing, share of processor

Summary and Future Challenges

• Summarized hardware-software co-design issues for productivity and scalability in the PERCS processor and X10
• Human programmer productivity will become increasingly important in HPC
• Legacy code will be hard to displace
  – Only a very smooth transition will be successful
• Global performance optimization/parallelization of re-usable components is key to productivity/performance
• Traditional processor design must face up to scalability features
  – Direct hardware support for thread spawning, remote access, fine-grain synchronization, collective synchronization, atomic blocks
  – Graceful slowdown as communication distance increases
    • Must not 'fall off a cliff' when crossing a chip boundary
    • Must not force high message aggregation requirements
  – Resilience to applications that violate cache heuristics: temporal and spatial locality, lack of false sharing
• Scalability features can greatly improve performance of commercial applications as well
  – Essential for success and acceptance of new architectural features
