The Looming Software Crisis due to the Multicore Menace

Saman Amarasinghe

Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology

The “Software Crisis”

“To put it quite bluntly: as long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem.”
-- E. Dijkstra, 1972 Turing Award Lecture

The First Software Crisis

• Time Frame: ’60s and ’70s

• Problem: Assembly Language Programming – Computers could handle larger, more complex programs

• Needed to get Abstraction and Portability without losing Performance

How Did We Solve the First Software Crisis?

• High-level languages for von-Neumann machines – FORTRAN and

• Provided a “common machine language” for uniprocessors – Only the common properties are exposed – Hidden properties are backed up by good technology

Common properties: single flow of control; single memory image.

Differences, and the technologies that hide them: register file → register allocation; ISA → instruction selection; functional units → instruction scheduling.

The Second Software Crisis

• Time Frame: ’80s and ’90s

• Problem: Inability to build and maintain complex and robust applications requiring multi-million lines of code developed by hundreds of programmers – Computers could handle larger, more complex programs

• Needed to get Composability, Malleability and Maintainability – High performance was not an issue → left to Moore’s Law

How Did We Solve the Second Software Crisis?

• Object Oriented Programming – C++ – Now C# and

• Better tools – Component libraries, Purify

• Better methodology – Design patterns, specification, testing, code reviews

Today: Programmers are Oblivious to the Processors

• Solid boundary between Hardware and Software

• Programmers don’t have to know anything about the processor – High level languages abstract away the processors – Ex: Java bytecode is machine independent – Moore’s law does not require the programmers to know anything about the processors to get good speedups

• Programs are oblivious to the processor → they work on all processors – A program written in the ’70s using C still works, and is much faster, today

• This abstraction provides a lot of freedom for the programmers

The Origins of a Third Crisis

• Time Frame: 2010 to ??

• Problem: Sequential performance is left behind by Moore’s law – All software developers have abstracted away the processor assuming that Moore’s law will always provide performance gains!

The March to Multicore: Moore’s Law

[Chart: number of transistors and uniprocessor performance (SPECint, relative to a VAX-11/780) from 1978 to 2016, for processors from the 8086 through the 286, 386, 486, Pentium, P2, P3, P4, Itanium, and Itanium 2. Transistor counts keep tracking Moore’s law; performance grows at 25%/year until the mid-’80s, then 52%/year, then at an uncertain “??%/year” after about 2002. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006; slide from David Patterson.]

The March to Multicore: Uniprocessor Performance (SPECint)

[Same chart, repeated to highlight the flattening of single-processor performance while transistor counts continue to climb.]

• General-purpose unicores have stopped historic performance scaling – Power consumption – Wire delays – DRAM access latency – Diminishing returns of more instruction-level parallelism

The Origins of a Third Crisis

How to program multicores?

[Chart: cores per chip (log scale, 2 to 512) vs. year, 1970 to 20??. Uniprocessors from the 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, and Itanium 2 sit at one core; multicores — Power4, PExtreme, Power6, Yonah, Opteron, Xeon MP, PA-8800, Xbox360, Tanglewood, Broadcom 1480, Opteron 4P, Niagara, Cell, Raw, Raza XLR, Cavium Octeon, Cisco CSR-1, Intel Tflops, Ambric AM2045, and Picochip PC102 — climb toward hundreds of cores. A program written in the ’70s not only works today, but also runs faster, tracking Moore’s law.]

The Origins of a Third Crisis

• Time Frame: 2010 to ??

• Problem: Sequential performance is left behind by Moore’s law

• Needed continuous and reasonable performance improvements – to support new features – to support larger datasets

• While sustaining portability, malleability and maintainability without unduly increasing complexity faced by the programmer

→ critical to keep up with the current rate of evolution in software

Why Parallelism is Hard

• A huge increase in complexity and work for the programmer – Programmer has to think about performance! – Parallelism has to be designed in at every level

• Humans are sequential beings – Deconstructing problems into parallel tasks is hard for many of us

• Parallelism is not easy to implement – Parallelism cannot be abstracted or layered away – Code and data have to be restructured in very different (non-intuitive) ways

• Parallel programs are very hard to debug – Combinatorial explosion of possible execution orderings – Race conditions and bugs are non-deterministic and elusive – Non-deterministic bugs go away in the lab environment and with instrumentation
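The combinatorial explosion can be made concrete with a toy scheduler. The sketch below (all names invented for illustration, not from any real tool) enumerates every interleaving of two threads that each perform a non-atomic increment — a load followed by a store — on a shared counter, and shows that even this four-instruction program has six orderings, four of which lose an update.

```python
# Toy model of the combinatorial explosion: enumerate every interleaving of
# two threads that each do a non-atomic increment (a load, then a store)
# on a shared counter.

def interleavings(a, b):
    """Yield every merge of a and b that preserves each sequence's order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def run(schedule):
    """Simulate one schedule of (op, thread-id) pairs; return the final counter."""
    shared = 0
    local = {}
    for op, tid in schedule:
        if op == "load":
            local[tid] = shared        # read shared value into a "register"
        else:
            shared = local[tid] + 1    # "store": write back local + 1
    return shared

thread0 = [("load", 0), ("store", 0)]
thread1 = [("load", 1), ("store", 1)]

schedules = list(interleavings(thread0, thread1))
outcomes = [run(s) for s in schedules]
print(len(schedules))         # 6 distinct orderings for just 2 ops per thread
print(sorted(set(outcomes)))  # [1, 2] -- a result of 1 means an update was lost
```

A real scheduler picks one of these orderings per run, which is why the bug appears and disappears non-deterministically.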

Outline: Ideas on Solving the Third Software Crisis

1. Advances in Computer Architecture

2. Novel Programming Models and Languages

3. Aggressive Compilers

4. Tools to support parallelization, debugging and migration

Computer Architecture

• Current generation of multicores – How can we cobble together something with existing parts/investments? – The impact of multicores hasn’t hit us yet

• The move to multicore will be a disruptive shift – An inversion of the cost model – A forced shift in the programming model

• A chance to redesign the microprocessor from scratch.

• What are the innovations that will reduce/eliminate the extra burden placed on the programmer?

Novel Opportunities in Multicores

• Don’t have to contend with uniprocessors • Not your same old multiprocessor problem – How does going from Multiprocessors to Multicores impact programs? – What changed? – Where is the Impact? – Communication Bandwidth – Communication Latency

Communication Bandwidth

• How much data can be communicated between two cores?

• What changed? (~10,000×: from ~32 Gigabits/sec between chips to ~300 Terabits/sec on chip) – Number of wires – Clock rate – Multiplexing

• Impact on programming model? – Massive data exchange is possible – Data movement is not the bottleneck → processor affinity not that important

Communication Latency

• How long does it take for a round trip communication?

• What changed? (~50×: from ~200 cycles between chips to ~4 cycles on chip) – Length of wire – Pipeline stages

• Impact on programming model? – Ultra-fast synchronization – Can run real-time apps on multiple cores

Architectural Innovations: The Raw Experience

Supported by a CISE Experimental Partnership

The MIT Raw Processor

• Raw project started in 1997; prototype operational in 2003
• The problem: how to keep Moore’s law going with – Increasing processor complexity – Longer wire delays – Higher power consumption
• Raw philosophy – Build a tightly integrated multicore – Off-load most functions to compilers and software
• Raw design – 16 single-issue cores – 4 register-mapped networks – Huge IO bandwidth
• Raw power – 16 FLOPs/ops per cycle – 16 memory accesses per cycle – 208 operand routes per cycle – 12 IO operations per cycle

[Die photo: 180 nm ASIC (IBM SA-27E), 18.2 mm × 18.2 mm, ~100 million transistors; designed for 225 MHz, tested at 425 MHz.]

Raw’s networks are tightly coupled into the bypass paths

[Pipeline diagram: the network input and output FIFOs are register-mapped (r24–r27) straight into the bypass paths of a MIPS-style pipeline (IF, D, RF, E, M1, M2, A, TL, TV, F, P, U, F4, WB), so an instruction such as lb r25, 0x341(r26) can consume an operand directly off the network. Compute instructions (fmul r24, r3, r4; fadd r5, r3, r24) pair with route instructions (route P->E; route W->P) executed by a software-controlled crossbar.]

Raw’s Networks Are Rarely the Bottleneck

(225 Gb/s @ 225 MHz)

• Raw has 4 bidirectional, point-to-point mesh networks (8 32-bit buses) – Two of them statically routed – Two of them dynamically routed
• A single-issue core may read from or write to one network in a given cycle
• The cores cannot saturate the network!

[Bar chart: average network bandwidth utilization vs. average instruction utilization for BitonicSort, FFT, DCT, DES, TDE, and Serpent; bandwidth utilization stays far below instruction utilization.]

Outline: Ideas on Solving the Third Software Crisis

1. Advances in Computer Architecture

2. Novel Programming Models and Languages

3. Aggressive Compilers

4. Tools to support parallelization, debugging and migration

Programming Models and Languages

• Critical to solving the third software crisis

• Novel languages were the central solution in the last two crises

Lessons from the Last Crisis: The OO Revolution

• Object Oriented revolution did not come out of a vacuum

• Hundreds of small experimental languages

• Rely on lessons learned from lesser-known languages – C++ grew out of C, Simula, and other languages – Java grew out of C++, Eiffel, SmallTalk, Objective C, and Cedar/Mesa [1]

• Depend on results from research community

[1] J. Gosling, H. McGilton, The Java Language Environment

Object Oriented Languages

Ada95, ABCL, BETA, Boo, C++, C#, Clarion, ColdFusion, Common Lisp, COOL (Object Oriented COBOL), CorbaScript, Corn, D, Dylan, Eiffel, F-Script, Fortran 2003, Gambas, Graphtalk, IDLscript, incrTcl, J, JADE, Java, Lasso, Lava, Lexico, Lingo, Modula-2, Modula-3, Moto, Nemerle, NetRexx, Nuva, Oberon (Oberon-1), Object Pascal (Delphi), Objective-C, Objective Caml, Oz, Perl 5, PHP, Pliant, PowerBuilder, PRM, Python, REALbasic, Revolution, Ruby, Scala, Self, Simula, Smalltalk, Squeak, Squirrel, STOOP (Tcl extension), Superx++, TADS, Tcl, Ubercode, Visual Basic, Visual FoxPro, Visual Prolog, ZZT-oop

Source: Wikipedia

Language Evolution: From FORTRAN to a few present-day languages

Source: Eric Levenez

Origins of C++

[Lineage diagram, 1960–1990s, distinguishing structural from feature influence: Fortran and Algol 60 feed CPL and Simula 67; CPL leads to BCPL and then C; C and Simula 67 combine into C with Classes, which — with feature influences from ML, Algol 68, Clu, and Ada — becomes C++, then ANSI C, C++arm, and C++std.]

Source: B. Stroustrup, The Design and Evolution of C++

Academic Influence on C++

“Exceptions were considered in the original design of C++, but were postponed because there wasn't time to do a thorough job of exploring the design and implementation issues. … In retrospect, the greatest influence on the C++ design was the work on fault-tolerant systems started at the University of Newcastle in England by and his colleagues and continued in many places since.”

-- B. Stroustrup, A History of C++

Origins of Java

• Java grew out of C++, Eiffel, SmallTalk, Objective C, and Cedar/Mesa
• Example lessons learned:
– Stumbling blocks of C++ removed (preprocessing, operator overloading, automatic coercion, etc.)
– Pointers removed based on studies of bug injection
– GOTO removed based on studies of usage patterns
– Objects based on Eiffel, SmallTalk
– Java interfaces based on Objective C protocols
– Synchronization follows the monitor and condition variable paradigm (introduced by Hoare, implemented in Cedar/Mesa)
– Bytecode approach validated as early as the UCSD P-System (’70s)
→ Lesser-known precursors essential to Java’s success
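The monitor-and-condition-variable paradigm can be sketched with Python’s threading module (the BoundedBuffer class and its capacity are illustrative choices of this sketch, not Java’s API):

```python
# A minimal sketch of a monitor with condition variables: a bounded buffer
# whose put/take block until there is room or an item, respectively.
import threading

class BoundedBuffer:
    def __init__(self, capacity):
        self.items = []
        self.capacity = capacity
        self.lock = threading.Lock()                   # the monitor's mutex
        self.not_full = threading.Condition(self.lock)
        self.not_empty = threading.Condition(self.lock)

    def put(self, item):
        with self.lock:
            while len(self.items) >= self.capacity:
                self.not_full.wait()      # releases the lock while blocked
            self.items.append(item)
            self.not_empty.notify()       # wake one waiting consumer

    def take(self):
        with self.lock:
            while not self.items:
                self.not_empty.wait()     # block until a producer adds an item
            item = self.items.pop(0)
            self.not_full.notify()        # wake one waiting producer
            return item

buf = BoundedBuffer(capacity=2)
results = []

def consumer():
    for _ in range(5):
        results.append(buf.take())

t = threading.Thread(target=consumer)
t.start()
for i in range(5):
    buf.put(i)                # blocks whenever the buffer is full
t.join()
print(results)                # [0, 1, 2, 3, 4] -- FIFO order preserved
```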

Source: J. Gosling, H. McGilton, The Java Language Environment

Why New Programming Models and Languages?

• Paradigm shift in architecture – From sequential to multicore – Need a new “common machine language”

• New application domains – Streaming – Scripting – Event-driven (real-time)

• New hardware features – Transactions – Introspection – Scalar Operand Networks or Core-to-core DMA

• New customers – Mobile devices – The average programmer!

• Can we achieve parallelism without burdening the programmer?

Domain Specific Languages

• There is no single programming domain! – Many programs don’t fit the OO model (ex: scripting and streaming)

• Need to identify new programming models/domains – Develop domain specific end-to-end systems – Develop languages, tools, applications ⇒ a body of knowledge

• Stitching multiple domains together is a hard problem – A central concept in one domain may not exist in another – Shared memory is critical for transactions, but not available in streaming – Need conceptually simple and formally rigorous interfaces – Need integrated tools – But critical for many DOD and other applications

Programming Languages and Architectures

[Diagram: C pairs naturally with the von Neumann machine; modern parallel machine architectures have no such matching language.]

• Two choices: – Bend over backwards to support old languages like C/C++ – Develop parallel architectures that are hard to program

Compiler-Aware Language Design: The StreamIt Experience

Supported by ITR and NGS Awards

[Stream graph: FMDemod → Scatter → (LPF1, LPF2, LPF3) → Gather → Speaker]

Is there a win-win situation?

[Diagram: the win-win triangle —
• programmability: boost productivity, enable faster development and rapid prototyping;
• domain specific optimizations: simple and effective optimizations for domain specific abstractions;
• parallel execution: target tiled architectures, clusters, DSPs, multicores, graphics processors, …]

• Some programming models are inherently concurrent – Coding them with a sequential language is harder than using the right parallel abstraction – All information on inherent parallelism is lost
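To make the point concrete, here is a hedged sketch (illustrative filter names, not StreamIt syntax) of the same computation written as explicit stream filters and as a fused sequential loop — both give the same answer, but only the first keeps the producer/consumer structure, and hence the parallelism, visible:

```python
# The same computation as a pipeline of stream filters and as a fused loop.

def source(n):
    for i in range(n):
        yield i

def scale(stream, k):        # stateless filter: iterations are independent
    for x in stream:
        yield x * k

def running_sum(stream):     # stateful filter: must stay sequential
    total = 0
    for x in stream:
        total += x
        yield total

print(list(running_sum(scale(source(5), 10))))   # [0, 10, 30, 60, 100]

# Sequentialized version: same answer, but the filter boundaries are gone,
# so a compiler must rediscover which parts were independent.
def fused(n, k):
    out, total = [], 0
    for i in range(n):
        total += i * k
        out.append(total)
    return out

print(fused(5, 10))                              # [0, 10, 30, 60, 100]
```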

• There are win-win situations – Increasing programmer productivity while extracting parallel performance

The StreamIt Project

• Applications – DES and Serpent [PLDI 05] – MPEG-2 [IPDPS 06] – SAR, DSP benchmarks, JPEG, …
• Programmability – StreamIt Language (CC 02) – Teleport Messaging (PPOPP 05) – Programming Environment in (P-PHEC 05)
• Domain Specific Optimizations – Linear Analysis and Optimization (PLDI 03) – Optimizations for bit streaming (PLDI 05) – Linear State Space Analysis (CASES 05)
• Architecture Specific Optimizations – Compiling for Communication-Exposed Architectures (ASPLOS 02) – Phased Scheduling (LCTES 03) – Cache Aware Optimization (LCTES 05) – Load-Balanced Rendering (Graphics Hardware 05) – Task, Data and Pipeline Parallelism (ASPLOS 06)

[Compiler flow: StreamIt Program → Front-end → Annotated Java → Stream-Aware Optimizations (with a Java-library simulator alongside) → Uniprocessor, Cluster, Raw, and IBM X10 backends, emitting C/C++, MPI-like C/C++, C per tile + msg code, and Streaming X10 runtime code respectively.]

Streaming Application Abstraction

• A structured block-level diagram describes the computation and the flow of data
• Conceptually easy to understand – Clean abstraction of functionality
• Mapping to C (sequentialization) destroys this simple view

[Block diagram of the MPEG-2 decoder: the MPEG bit stream enters the VLD; macroblocks and motion vectors are split round-robin into a frequency-decoding pipeline (ZigZag → IQuantization → IDCT → Saturation) and a motion-vector pipeline (Motion Vector Decode → Repeat); after a join, Motion Compensation runs per channel (Y, Cb, Cr) against reference pictures, Cb and Cr are channel-upsampled, and a final join feeds Picture Reorder and Color Space Conversion to produce the recovered picture.]

StreamIt Improves Productivity

The StreamIt source for the MPEG-2 decoder maps one-to-one onto the block diagram:

    add VLD(QC, PT1, PT2);
    add splitjoin {
        split roundrobin(N∗B, V);
        add pipeline {
            add ZigZag(B);
            add IQuantization(B) to QC;
            add IDCT(B);
            add Saturation(B);
        }
        add pipeline {
            add MotionVectorDecode();
            add Repeat(V, N);
        }
        join roundrobin(B, V);
    }
    add splitjoin {
        split roundrobin(4∗(B+V), B+V, B+V);
        add MotionCompensation(4∗(B+V)) to PT1;
        for (int i = 0; i < 2; i++) {
            add pipeline {
                add MotionCompensation(B+V) to PT1;
                add ChannelUpsample(B);
            }
        }
        join roundrobin(1, 1, 1);
    }
    add PictureReorder(3∗W∗H) to PT2;
    add ColorSpaceConversion(3∗W∗H);

StreamIt Compiler Extracts Parallelism

• Task Parallelism – Thread (fork/join) parallelism – Parallelism explicit in the graph – Between filters without a producer/consumer relationship
• Data Parallelism – Data parallel loop (forall) – Between iterations of a stateless filter – Can’t parallelize filters with state
• Pipeline Parallelism – Usually exploited in hardware – Between producers and consumers – Stateful filters can be parallelized

[The MPEG-2 decoder diagram, annotated with where each kind of parallelism appears.]

StreamIt Compiler: Parallelism → Processor Resources

• The StreamIt compiler finds the inherent parallelism – Graph structure is architecture independent – Abundance of parallelism in the StreamIt domain
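The three flavors of parallelism named above can be illustrated in miniature (a sketch only — the filters and worker counts are made up, and CPython threads are used purely to show the structure of each kind):

```python
from concurrent.futures import ThreadPoolExecutor

def stateless_filter(x):
    return x * x                  # carries no state between iterations

# Data parallelism: iterations of a stateless filter are independent,
# so they can be farmed out to a worker pool in any order.
data = list(range(8))
with ThreadPoolExecutor(max_workers=4) as pool:
    squared = list(pool.map(stateless_filter, data))
print(squared)                    # [0, 1, 4, 9, 16, 25, 36, 49]

# Task parallelism: two filters with no producer/consumer relationship
# can run at the same time as independent tasks.
with ThreadPoolExecutor(max_workers=2) as pool:
    total = pool.submit(sum, squared)
    biggest = pool.submit(max, squared)
    print(total.result(), biggest.result())      # 140 49

# Pipeline parallelism: a stateful stage (here a running maximum) cannot be
# data-parallelized, but it can overlap in time with its producer stage.
def running_max(stream):
    best = None
    for x in stream:
        best = x if best is None or x > best else best
        yield best

print(list(running_max(iter([3, 1, 4, 1, 5]))))  # [3, 3, 4, 4, 5]
```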

• Too much parallelism is as bad as too little parallelism – (remember dataflow!)
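Coarsening abundant parallelism down to a fixed number of cores is a partitioning problem. A toy sketch (the filter names and work costs are invented): greedily assign the most expensive remaining filter to the least-loaded core:

```python
# Toy mapper: longest-processing-time greedy assignment of filters to cores.
import heapq

def map_to_cores(filters, num_cores):
    """Assign each (name, cost) filter to the least-loaded core; return core -> filters."""
    heap = [(0, core) for core in range(num_cores)]    # (load, core id)
    heapq.heapify(heap)
    assignment = {core: [] for core in range(num_cores)}
    for name, cost in sorted(filters, key=lambda f: -f[1]):
        load, core = heapq.heappop(heap)               # least-loaded core
        assignment[core].append(name)
        heapq.heappush(heap, (load + cost, core))
    return assignment

filters = [("IDCT", 90), ("VLD", 60), ("MotionComp", 50),
           ("ZigZag", 10), ("Saturation", 10), ("IQuant", 30)]
cores = map_to_cores(filters, 2)
loads = {c: sum(dict(filters)[f] for f in fs) for c, fs in cores.items()}
print(loads)       # e.g. {0: 130, 1: 120} -- close to the ideal 125/125 split
```

A real mapper must also weigh communication between filters placed on different cores, which this sketch ignores.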

• Map the parallelism onto the available resources in a given multicore – Use all available parallelism – Maximize load balance – Minimize communication

StreamIt Performance on Raw

[Bar chart: throughput normalized to a single StreamIt core, up to roughly 19× on 16 cores, for benchmarks including BitonicSort, FFT, DCT, DES, TDE, Serpent, MPEG2Decoder, Vocoder, ChannelVocoder, FilterBank, FMRadio, Radar, and the geometric mean.]

Outline: Ideas on Solving the Third Software Crisis

1. Advances in Computer Architecture

2. Novel Programming Models and Languages

3. Aggressive Compilers

4. Tools to support parallelization, debugging and migration

Compilers

• Compilers are critical in reducing the burden on programmers – Identification of data parallel loops can be easily automated, but many current systems (Brook, PeakStream) require the programmer to do it.

• Need to revive the push for automatic parallelization – Best case: totally automated parallelization hidden from the user – Worst case: simplify the task of the programmer
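The transformation an auto-parallelizer applies once it proves a loop’s iterations independent can be sketched as follows (the chunking scheme and worker count are illustrative assumptions of this sketch):

```python
# Split an independent-iteration loop's index space into chunks and run
# the chunks on a pool; the result must match the sequential loop.
from concurrent.futures import ThreadPoolExecutor

def sequential(a, b):
    return [a[i] + b[i] for i in range(len(a))]    # no cross-iteration deps

def parallelized(a, b, workers=4):
    n = len(a)
    bounds = [(w * n // workers, (w + 1) * n // workers) for w in range(workers)]
    def chunk(lo_hi):
        lo, hi = lo_hi
        return [a[i] + b[i] for i in range(lo, hi)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pieces = list(pool.map(chunk, bounds))     # chunks run concurrently
    return [x for piece in pieces for x in piece]

a, b = list(range(10)), list(range(10, 20))
print(parallelized(a, b) == sequential(a, b))      # True
```

Proving that no iteration reads what another writes is the hard part; the rewrite itself is mechanical.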

Parallelizing Compilers

The SUIF Experience

The SUIF Parallelizing Compiler

• The SUIF Project at Stanford in the ’90s – Mainly FORTRAN – Aggressive transformations to undo “human optimizations” – Interprocedural analysis framework – Scalar and array data-flow, reduction recognition, and a host of other analyses and transformations
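Reduction recognition, one of the analyses listed above, deserves a sketch: a loop like s += a[i] carries a dependence through s, but because addition is associative the compiler can split it into per-chunk partial sums and combine them at the end (the chunk count here is an arbitrary illustrative choice):

```python
# Reduction recognition in miniature: break the carried dependence on s
# by computing partial sums per chunk, then combining.

def reduce_sequential(a):
    s = 0
    for x in a:              # every iteration depends on the previous via s
        s += x
    return s

def reduce_parallel(a, chunks=4):
    n = len(a)
    partial = []
    for c in range(chunks):  # each chunk could run on its own core
        lo, hi = c * n // chunks, (c + 1) * n // chunks
        s = 0
        for x in a[lo:hi]:
            s += x
        partial.append(s)
    return sum(partial)      # the final combine step

data = list(range(100))
print(reduce_sequential(data), reduce_parallel(data))   # 4950 4950
```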

• The SUIF compiler had the best SPEC results by automatic parallelization

SPECFP92 performance:

[Bar chart: MFLOPS and number of processors (1–8) per SPECFP92 benchmark, including tomcatv, swim256, su2cor, hydro2d, mdljdp2, mdljsp2, wave5, alvinn, ora, ear, doduc, spice, fpppp, and nasa7.]

• Vector processor, Cray C90: 540
• Uniprocessor, Digital 21164: 508
• SUIF on 8 processors, Digital 8400: 1,016

Automatic Parallelization “Almost” Worked

• Why didn’t this reach the mainstream? – The compilers were not robust – Clients were impossible (performance at any cost) – Multiprocessor communication was expensive – Had to compete with improvements in sequential performance – The dogfooding problem

• Today: Programs are even harder to analyze – Complex data structures – Complex control flow – Complex build process – Aliasing problem (type-unsafe languages)

Outline: Ideas on Solving the Third Software Crisis

1. Advances in Computer Architecture

2. Novel Programming Models and Languages

3. Aggressive Compilers

4. Tools to support parallelization, debugging and migration

Tools

• A lot of progress in tools to improve programmer productivity

• Need tools to – Identify parallelism – Debug parallel code – Update and maintain parallel code – Stitch multiple domains together

• Need an “Eclipse platform for multicores”

Migrating the Dusty Deck

• Impossible to bring them to the new era automatically – Badly mangled, hand-optimized, impossible to analyze code – Automatic compilation, even with a heroic effort, cannot do anything

• Help rewrite the huge stack of dusty decks – Application in use – Source code available – Programmer long gone

• Getting the new program to have the same behavior is hard – “Word pagination problem”
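Checking that a rewrite preserves behavior can borrow from dynamic invariant detection: run the old and new programs on the same tests and compare the simple properties that always hold. A toy sketch (the candidate properties and function names are my own illustrative choices):

```python
# Observe a function over test inputs and keep only the candidate
# properties that never fail; compare the surviving sets across versions.

def detect_invariants(fn, test_inputs):
    candidates = {
        "result >= 0":     lambda x, r: r >= 0,
        "result >= input": lambda x, r: r >= x,
        "result is even":  lambda x, r: r % 2 == 0,
    }
    for x in test_inputs:
        r = fn(x)
        candidates = {name: check for name, check in candidates.items()
                      if check(x, r)}     # drop any property that ever fails
    return set(candidates)

def legacy(x):       # the dusty-deck original
    return x * x

def rewrite(x):      # the proposed replacement
    return x ** 2

tests = [0, 1, 2, 3, 7]
print(detect_invariants(legacy, tests) == detect_invariants(rewrite, tests))  # True
```

A mismatch in the surviving invariants is a cheap early warning that the rewrite changed behavior on some path the tests exercise.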

• Can take advantage of many recent advances – Creating test cases – Extracting invariants – Failure-oblivious computing

Facilitate Evaluation and Feedback for Rapid Evolution

[Feedback loop: Idea → Language/Compiler/Tools Implementation → Develop a Program → Functional Debugging → Performance Debugging → Evaluate → back to Idea.]

Rapid Evaluation

• Extremely hard to get – Real users have no interest in flaky tools – Hard to quantify – Superficial users vs. Deep users will give different feedback – Fatal flaws as well as amazing uses may not come out immediately

• Need a huge, sophisticated (and expensive) infrastructure – How do you get a lot of application experts to use the system? – How do you get them to become experts? – How do you get them to use it for a long time? – How do you scientifically evaluate? – How do you get actionable feedback?

• A “Center for Evaluating Multicore Programming Environments”??

Build and Mobilize the Community

• Bring the High Performance Languages/Compilers/Tools folks out of the woodwork! – Then: A few customers with the goal of performance at any cost. – Then: Had to compete with Moore’s law – Now: Reasonable performance improvements for the masses

• Bring in the folks who won the second crisis – Then: the focus was improving programmer productivity – Now: how to maintain performance in a multicore world – Now: if not solved, all the productivity gains will be lost!

• Get architects to listen to language/compiler people – Then: We don’t need you, we can do everything in hardware – Then: Or here is a cool hardware feature, you figure out how to use it. – Now: Superscalars crashed and burned; cannot keep the status quo! – Now: Need to create a useable programming model

Conclusions

• Language research is a critical long-term investment – In the 1950s, the early background for the Simula language was funded by the Norwegian Defense Research Establishment – In 2002, the designers received the ACM Turing Award “for ideas fundamental to the emergence of object oriented programming.”

• Compilers and Tools are also essential components

• Computer Architecture is at a crossroads – A once-in-a-lifetime opportunity to redesign from scratch – How to use the Moore’s law gains to improve programmability?

• Switching to multicores without losing the gains in programmer productivity may be the Grandest of the Grand Challenges – Half a century of work ⇒ still no winning solution – Will affect everyone!

• Need a Grand Partnership between the Government, Industry and Academia to solve this crisis!