The future of (super) Computer Programming

Bent Thomsen
[email protected]
Department of Computer Science, Aalborg University

eScience: Simulation - The Third Pillar of Science

• Traditional scientific and engineering paradigm:
  1) Do theory or paper design.
  2) Perform experiments or build system.
• Limitations:
  – Too difficult -- build large wind tunnels.
  – Too expensive -- build a throw-away passenger jet.
  – Too slow -- wait for climate or galactic evolution.
  – Too dangerous -- weapons, drug design, climate experimentation.
• Computational science paradigm:
  3) Use high-performance computer systems to simulate the phenomenon
  – Based on known physical laws and efficient numerical methods.

Exascale computing

The United States has set aside $126 million for exascale computing beginning in 2012, in an attempt to overtake China's Tianhe-1A supercomputer as the fastest computing platform in the world. (February 21, 2011)

How to spend a billion dollars

The exascale programme builds on the HPCS programme:
• In Phase I (June 2002 - June 2003), Cray, IBM, SUN, HP, SGI and MITRE spent $250 million.
• In Phase II (July 2003 - June 2006):
  – Cray was awarded $43.1 million
  – IBM was awarded $53.3 million
  – SUN was awarded $49.7 million
• In Phase III (July 2006 - December 2010):
  – Cray was awarded $250 million
  – IBM was awarded $244 million

High Productivity Computing Systems - Program Overview

Create a new generation of economically viable computing systems (2010) and a procurement methodology (2007-2010) for the security/industrial community.

[Diagram: HPCS programme timeline - Phase 1 concept study, Phase 2 (2003-2005) technology assessment and advanced prototypes, Phase 3 (2006-2010) full-scale development and a validated procurement evaluation methodology, with petaflop/s systems at the half-way point.]

Petascale Computers

• Roadrunner, built by IBM
  – first computer to go petascale, May 25, 2008; performance of 1.026 petaflops.
• XT5 "Jaguar", built by Cray
  – later in 2008; after an update in 2009, its performance reached 1.759 petaflops.
• Nebulae, built by Dawning
  – third petascale computer and the first built by China; performance of 1.271 petaflops in 2010.
• Tianhe-1A, built by NUDT
  – the fastest supercomputer in the world, at 2.566 petaflops in 2010.

High Productivity Computing Systems

Create a new generation of economically viable computing systems (2010) and a procurement methodology (2007-2010) for the security/industrial community.

Impact:
• Performance (time-to-solution): speed up critical national security applications by a factor of 10X to 40X
• Programmability (idea-to-first-solution): reduce cost and time of developing application solutions
• Portability (transparency): insulate research and operational application software from the system
• Robustness (reliability): apply all known techniques to protect against outside attacks, hardware faults and programming errors

HPCS program focus areas - Applications: intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling and biotechnology.

Fill the critical technology and capability gap from today (late 80's HPC technology) to the future (Quantum/Bio Computing).

HPCS Program Goals

• HPCS overall productivity goals:
  – Execution (sustained performance)
    . 1 Petaflop/s
    . scalable to greater than 4 Petaflop/s
  – Development
    . 10X over today's systems
[Diagram: lone-researcher and enterprise development/execution workflows - theory, experiment, simulation, design, visualisation and porting of legacy software - with a 10x improvement in time to first solution.]

How to increase Programmer Productivity?

3 ways of increasing programmer productivity:
1. Process (software engineering)
   – Controlling programmers
   – Good process can yield up to 20% increase
2. Tools (verification, static analysis, program generation)
   – Good tools can yield up to 10% increase
3. Language design --- the center of the universe!
   – Core abstractions, mechanisms, services, guarantees
   – Affect how programmers approach a task (C vs. SML)
   – New languages can yield 700% increase

High Productivity Computing Systems

A large part of the HPCS program focused on programming language development:
• X10 from IBM
  – Extended subset of Java based on Non-Uniform Computing Clusters (NUCCs), where different memory locations incur different cost
• Chapel from CRAY
  – Built on HPF and ZPL (based on Modula-2, Pascal, Algol)
• Fortress from SUN
  – Based on the "Growing a Language" philosophy

New Programming Languages - Why should I bother?

• Fortran has been with us since 1954
• C has been with us since 1971
• C++ has been with us since 1983
• Java has been with us since 1995
• C# has been with us since 2000

New Programming Languages - Why should I bother?

• Every generation improves:
  – Programmer productivity
    . Higher level of abstraction
    . Thus reducing time-to-market
  – Program reuse
    . Libraries, components, patterns
  – Program reliability
    . Thus fewer bugs make it through to product
  – But usually not performance
    . Usually lagging five years behind, but it will catch up

Programming Language Genealogy

[Diagram: programming language genealogy (Lang History.htm), by Peter Sestoft.]

But why do we need new (HPCS) languages now?

• Until about 20 years ago there was a neat correspondence between the Fortran/C/C++/Java/C# programming model and the underlying machines
• The only thing that (apparently) changed was that the processors got faster
• Moore's Law (misinterpreted):
  – The processor speed doubles every 18 months
  – Almost every measure of the capabilities of digital electronic devices is linked to Moore's Law: processing speed, memory capacity, ... (source: Wikipedia)

The Hardware world is changing!

Moore's Law

• Popular belief:
  – Moore's Law stopped working in 2005!
• Moore's Law (misinterpreted):
  – The processor speed doubles every 18 months
• Moore's Law is still going strong:
  – the number of transistors per unit area on a chip doubles every 18 months
• Instead of spending more and more hardware real estate on cache memory, it is now used for multiple cores

The IT industry wakeup call

• The supercomputing community discovered the change in hardware first
• The rest of the computing industry has started to worry:

"Multicore: This is the one which will have the biggest impact on us. We have never had a problem to solve like this. A breakthrough is needed in how applications are done on multicore devices." – Bill Gates

What is the most expensive operation in this line of C code?

• int x = (3.14 * r) + (x * y);

A programmer's view of memory

This model was pretty accurate in 1985. Processors (386, ARM, MIPS, SPARC) all ran at 1-10 MHz clock speed and could access external memory in 1 cycle, and most instructions took 1 cycle. Indeed, the C language was as expressively time-accurate as a language could be: almost all C operators took one or two cycles. But this model is no longer accurate!
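
The answer the question above is driving at is that the memory accesses, not the arithmetic, are the expensive part on a modern machine. As a rough illustration, here is a hedged micro-benchmark sketch (mine, not from the slides): it compares a loop of pure floating-point arithmetic with a loop of dependent loads over a working set assumed to be much larger than the last-level cache. The array size and the use of Sattolo's algorithm to build a single permutation cycle are illustrative choices.

    /* Hedged sketch: contrast arithmetic cost with cache-missing loads.
     * The permutation cycle forces every load to depend on the previous
     * one, so the hardware prefetcher cannot hide the memory latency. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 22)                    /* 4M entries, ~32 MB (assumption) */

    int main(void) {
        size_t *next = malloc(N * sizeof *next);
        for (size_t i = 0; i < N; i++) next[i] = i;
        for (size_t i = N - 1; i > 0; i--) {   /* Sattolo: one single cycle */
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }

        clock_t t0 = clock();
        double acc = 0.0;
        for (size_t i = 0; i < N; i++)         /* arithmetic only */
            acc += 3.14 * (double)i + 2.0;

        clock_t t1 = clock();
        size_t p = 0;
        for (size_t i = 0; i < N; i++)         /* dependent loads */
            p = next[p];
        clock_t t2 = clock();

        printf("arithmetic: %.3f s, pointer chasing: %.3f s (acc=%f, p=%zu)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC, acc, p);
        free(next);
        return 0;
    }

On typical current hardware the pointer-chasing loop should run an order of magnitude or more slower than the arithmetic loop, which is exactly the gap the next slides explain.
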
A modern view of memory timings

So what happened? On-chip computation (clock speed) sped up faster (1985-2005) than off-chip communication (with memory) as feature sizes shrank. The gap was filled by spending transistor budget on caches, which (statistically) hid the mismatch until 2005 or so. Techniques like caches, deep pipelining with bypasses and superscalar instruction issue burned power to preserve our illusions. 2005 or so was the crunch point, as faster, hotter, single-CPU Pentiums were scrapped. These techniques had merely delayed the inevitable.

The Current Mainstream Processor

Will scale to 2, 4, maybe 8 processors. But ultimately shared memory becomes the bottleneck (1024 processors?!?).

[Figures: Angela C. Sodan, Jacob Machina, Arash Deshmeh, Kevin Macnaughton, Bryan Esbaugh, "Parallelism via Multithreaded and Multicore CPUs," Computer, pp. 24-32, March 2010.]

Hardware will change

• Cell
  – Multi-core with 1 PPC + 8(6) SPEs (SIMD)
  – 3-level memory hierarchy
  – broadcast communication
• GPU
  – 256 SIMD HW threads
  – Data-parallel memory
• FPGA ... (build your own hardware)

Super Computer Organisation

Locality and Parallelism

[Diagram: two processors, each with its own cache and L2 cache, connected through (potentially L3) caches and interconnects to their memories.]

• Large memories are slow, fast memories are small.
• Storage hierarchies are large and fast on average.
• Parallel processors, collectively, have large, fast memories -- the slow accesses to "remote" data we call "communication".
• Algorithms should do most work on local data.

Memory Hierarchy

• Most programs have a high degree of locality in their accesses
  – spatial locality: accessing things nearby previous accesses
  – temporal locality: reusing an item that was previously accessed
• The memory hierarchy tries to exploit locality:

  Level                           Speed    Size
  registers (on-chip)             1 ns     100s of bytes
  second-level cache (SRAM)       10 ns    KB
  main memory (DRAM)              100 ns   MB
  secondary storage (disk)        10 ms    GB
  tertiary storage (disk/tape)    10 sec   TB

Programming model(s) reflecting the new world are called for

• Algorithms should do most work on local data!
• Programmers need to
  – make decisions on parallel execution
  – know what is local and what is not
  – deal with communication
• But how can the poor programmer ensure this?
• She/he has to exploit:
  – Data parallelism and memory parallelism
  – Task parallelism and instruction parallelism
• She/he needs programming language constructs to help her/him (a sketch follows below)

Domain decomposition
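
The extraction breaks off at this slide, so what follows is only a hedged sketch of what domain decomposition typically means in this setting: split the data domain into contiguous chunks so each worker does most of its work on local data, exactly as the previous slide demands. The sketch uses OpenMP's parallel for (compile with -fopenmp); the Jacobi-style 1-D stencil, grid size and step count are illustrative assumptions, not content from the deck.

    /* Hedged sketch: 1-D domain decomposition with OpenMP. A static
     * schedule hands each thread one contiguous slice of the grid, so
     * only the cells at slice boundaries are read by a neighbouring
     * thread -- the shared-memory analogue of halo exchange. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N     1000000   /* grid points (assumption) */
    #define STEPS 100       /* time steps  (assumption) */

    int main(void) {
        double *u = calloc(N, sizeof *u);
        double *v = calloc(N, sizeof *v);
        u[0] = u[N-1] = v[0] = v[N-1] = 100.0;    /* fixed boundaries */

        for (int s = 0; s < STEPS; s++) {
            #pragma omp parallel for schedule(static)
            for (int i = 1; i < N - 1; i++)       /* local stencil work */
                v[i] = 0.5 * (u[i-1] + u[i+1]);
            double *tmp = u; u = v; v = tmp;      /* swap grids */
        }

        printf("u[1] = %f\n", u[1]);
        free(u); free(v);
        return 0;
    }

On a distributed-memory supercomputer the same decomposition becomes explicit: each node owns its chunk plus a halo of ghost cells, and boundary values are communicated between neighbours after every step.
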

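In the same spirit, the spatial-locality bullet on the Memory Hierarchy slide can be made concrete. A hedged sketch, with an arbitrarily chosen matrix size: C stores matrices row-major, so summing row by row streams through consecutive cache lines, while summing column by column jumps a whole row ahead on every access and defeats the cache.

    /* Hedged sketch: same arithmetic, very different memory behaviour.
     * The i-then-j loop walks consecutive addresses (cache-friendly);
     * the j-then-i loop strides by N doubles per access (cache-hostile). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 4096

    int main(void) {
        double *a = calloc((size_t)N * N, sizeof *a);   /* ~128 MB */
        double sum = 0.0;

        clock_t t0 = clock();
        for (int i = 0; i < N; i++)          /* row-major traversal */
            for (int j = 0; j < N; j++)
                sum += a[(size_t)i * N + j];

        clock_t t1 = clock();
        for (int j = 0; j < N; j++)          /* column traversal */
            for (int i = 0; i < N; i++)
                sum += a[(size_t)i * N + j];
        clock_t t2 = clock();

        printf("rows: %.3f s, columns: %.3f s (sum=%f)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
        free(a);
        return 0;
    }

Both loops perform exactly the same additions; on typical hardware the column-order loop is nevertheless several times slower, which is the whole argument for language constructs that make locality visible to the programmer.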