Introduction: problems, models, performance

Jesper Larsson Träff [email protected] TU Wien Parallel Computing

Practical Parallel Computing

Parallelism everywhere! How to use all these resources? Limits?

June 2016: 10,649,600 cores, 93 PFLOPS (125 PFLOPS peak)

Mobile phones: „dual core“, „quad core“ (2012?), …

…octa-core (2016, Samsung Galaxy 7)

“Never mind that there's little-to-no software that can take advantage of four processing cores, Xtreme Notebooks has released the first quad-core laptop in the U.S.” (2007)

June 2012: IBM BlueGene/Q, 1572864 cores, 16PF

June 2011: Fujitsu K, 705024 cores, 11PF

As of ca. 2010: Why? There are (almost) no purely sequential computer systems anymore: multi-core, GPU/accelerator enhanced, …



Challenge: How do I speed up windows…?

As of 2010: either a) all applications must be parallelized, or b) there must be enough independent applications that can run at the same time


“I spoke to five experts in the course of preparing this article, and they all emphasized the need for the developers who actually program the apps and games to code with multithreaded execution in mind”

7 myths about quad-core phones (Smartphones Unlocked) by Jessica Dolcourt April 8, 2012 12:00 PM PDT …many similar quotations can (still) be found

The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software, by Herb Sutter. “The biggest sea change in software development since the OO revolution is knocking at the door, and its name is concurrency.” Dr. Dobb's Journal, 30(3), March 2005


“Free lunch” (as in “There is no such thing as a free lunch”):

Exponential increase in single-core performance (18-24 month doubling rate, Moore’s “Law”), no software changes needed to exploit the faster processors

Kunle Olukotun (Stanford), ca. 2010: the “free lunch” was over… ca. 2005

•Clock speed limit around 2003 •Power consumption limit from 2000 •Instructions/clock limit in the late 90s

But: the number of transistors per chip can continue to increase (>1 billion)

Solution(?): Put more cores on chip „multi-core revolution“



Parallelism challenge: Solve problem p times faster on p (slower, more energy efficient) cores

[Figure credits: Kunle Olukotun, 2010; Karl Rupp (TU Wien), 2015]

Single-core performance does still increase, somewhat, but much slower… (Henk Poley, 2014, www.preshing.com)

Average, normalized, SPEC benchmark numbers, see www.spec.org

From Hennessy/Patterson, Computer Architecture

2006: factor 10³-10⁴ increase in integer performance over a 1978 high-performance processor

Single-core processor development (the “free lunch”) made life very difficult for parallel computing in the 90s.

Architectural ideas driving the sequential performance increase

Increase in clock frequency (“technological advances”, factor 20-200?) alone does not explain the performance increase:

•Deep pipelining
•Superscalar execution (>1 instruction/clock) through multiple functional units…
•… through SIMD units (old HPC idea: vector processing): same instruction on multiple data per clock
•Out-of-order execution, speculative execution
•Branch prediction
•Caches (pre-fetching)
•Simplified/better instruction sets (better for the compiler)
•SMT/Hyperthreading

Mostly fully transparent; at most the compiler needs to care: the “free lunch”

•Very diverse parallel architectures
•No single, commonly agreed upon abstract model (for designing and analyzing algorithms)
•Has been so
•Is still so (largely)

Many different programming paradigms (models), different programming frameworks (interfaces, languages)

Theoretical Parallel Computing

Parallelism was always there! What can be done? Limits?

Assume p processors instead of just one, reasonably connected (memory, network, …)

•How much faster can some given problem be solved? Can some problems be solved better?
•How? New algorithms? New techniques?
•Can all problems be solved faster? Are there problems that cannot be solved faster with more processors?
•Which assumptions are reasonable?
•Does parallelism give new insights into the nature of computation?

Sequential computing vs. Parallel computing

[Diagram — sequential computing: algorithm in a model (RAM) → concrete program (C, C++, Java, Haskell, Fortran,…) → concrete architecture. Parallel computing: algorithm in model A, B, …, Z → concrete program, different paradigms (MPI, OpenMP, Cilk, OpenCL, …) → concrete architectures A … Z.]

…is difficult. Analysis in the model (often) has some relation to the concrete execution.

Parallel computing

[Diagram: algorithm in model A, B, …, Z → concrete program, different paradigms (MPI, OpenMP, Cilk, OpenCL, …) → concrete architectures A … Z.]

Huge gap: no „standard“ abstract model (e.g., RAM). …is extremely difficult. Analysis in a model may have little relation to the concrete execution.


Challenges:

•Algorithmic: Not all problems seem to be easily parallelizable
•Portability: Support for the same language/interface on different architectures (e.g. MPI in HPC)
•Performance portability (?)

Elements of parallel computing:
•Algorithmic: Find the parallelism
•Linguistic: Express the parallelism
•Practical: Validate the parallelism (correctness, performance)
•Technical challenge: run-time/compiler support for the paradigm

Parallel computing: Accomplish something with a coordinated set of processors under the control of a (parallel) program

Why study parallel computing?

•It is inevitable: multi-core revolution, GPGPU paradigm, …

•It‘s interesting, challenging, highly non-trivial – full of surprises

•Key discipline of computer science (von Neumann; golden theory decade: 1980 to early 90s)
•It‘s ubiquitous (gates, architecture: pipelines, ILP, TLP, systems: operating systems, software), not always transparent
•It‘s useful: large, extremely computationally intensive problems, Scientific Computing, HPC
•…

Parallel computing: The discipline of efficiently utilizing dedicated parallel resources (processors, memories, …) to solve a single, given computational problem.

Specifically: Parallel resources with significant inter-communication capabilities, for problems with non-trivial communication and computational demands

Focus on properties of solution (time, size, energy, …) to given, individual problem

Buzz words: tightly coupled, dedicated parallel system; multi-core processor, GPGPU, High-Performance Computing (HPC), …

Distributed computing: The discipline of making independent, non-dedicated resources available and cooperative toward solving specified problem complexes.

Typical concerns: correctness, availability, progress, security, integrity, privacy, robustness, fault tolerance, …

Buzz words: internet, grid, cloud, agents, autonomous computing, mobile computing, …

Concurrent computing: The discipline of managing and reasoning about interacting processes that may (or may not) progress simultaneously

Typical concerns: correctness (often formal), e.g. deadlock-freedom, starvation-freedom, mutual exclusion, fairness

Buzz words: operating systems, synchronization, interprocess communication, locks, semaphores, autonomous computing, calculi, CSP, CCS, pi-calculus, …

Parallel vs. Concurrent computing (adopted from Madan Musuvathi)

Given problem: specification, algorithm, data

[Diagram: many processes, via their processors, accessing a shared resource: memory (locks, semaphores, data structures), a device, …]

Concurrent computing: Focus on coordination of access to/usage of shared resources (to solve given, computational problem)

[Diagram: a problem (specification, algorithm, data) divided into subproblems, one per processor; coordination: synchronization, communication]

Parallel computing: Focus on dividing given problem (specification, algorithm, data) into subproblems that can be solved by dedicated processors (in coordination)

The “problem” of parallelization

How to divide given problem into subproblems that can be solved in parallel?

Problem: •Specification •Algorithm? •Data?

[Diagram: the problem divided into subproblems]

•How is the computation divided? Coordination necessary? Does the sequential algorithm help/suffice? •Where are the data? Is communication necessary?

Aspects of parallelization

•Algorithmic: How to divide computation into independent parts that can be executed in parallel? What kinds of shared resources are necessary? Which kinds of coordination? How can overheads be minimized (redundancy, coordination, synchronization)?

•Scheduling/Mapping: How are independent parts of the computation assigned to processors?

•Load balancing: How can independent parts be assigned to processors such that all resources are utilized efficiently?

•Communication: When must processors communicate? How? •Synchronization: When must processors agree/wait?

Linguistic: How are the algorithmic aspects expressed? Concepts (programming model) and concrete expression (programming language, interface, library)

Pragmatic/practical: How does the actual, parallel machine look? What is a reasonable, abstract model?

Architectural: which kinds of parallel machines can be realized? How do they look?

Levels of parallelization

Architecture: gates, logical and functional units — not covered here

Computational: Instruction Level Parallelism (ILP), SIMD

Functional, explicit: threads (possibly concurrent, sharing architecture resources), cores, processors

“Parallel computing”

Large-scale: coarse-grained tasks, multi-level, coupled application parallelism — not covered here

Automatic parallelization(?)

Can’t we just leave it all to the compiler/hardware?

•“high-level” problem specification •Sequential program

Efficient code that can be executed on parallel, multi-core processor

Successful only to a limited extent:
•Compilers cannot invent a different algorithm
•Hardware parallelism is not likely to go much further

Samuel P. Midkiff: Automatic Parallelization: An Overview of Fundamental Compiler Techniques. Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, 2012

Explicit, parallel programming (today):

•Explicitly parallel code in some parallel language •Support from parallel libraries •(Domain specific languages)

•Compiler does as much as compiler can do

Lots of interesting problems and tradeoffs; an active area of research

Some „given, individual problems“ for this lecture

Problem 1:

Matrix-vector multiplication: given an n×m matrix A and an m-element (column) vector v, compute the n-element (column) vector u = Av, with u[i] = ∑_j A[i,j]·v[j].

Dimensions n, m large: obviously some parallelism. How to compute the ∑ in parallel? Access to the vector?
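As a concrete sequential baseline for Problem 1 (a sketch only; the function name and the row-major layout are choices made here, not from the slides), with Tseq(n,m) = O(nm):

#include <stddef.h>

/* u = A*v for an n x m matrix A stored row-major, A[i,j] = a[i*m+j] */
void matvec(size_t n, size_t m, const double *a, const double *v, double *u)
{
  for (size_t i = 0; i < n; i++) {
    double sum = 0.0;                     /* u[i] = sum_j A[i,j]*v[j] */
    for (size_t j = 0; j < m; j++)
      sum += a[i*m + j] * v[j];
    u[i] = sum;
  }
}

The n rows are independent (the obvious parallelism); each inner ∑ is a reduction, which is exactly where the question “how to ∑ in parallel?” comes in.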

Problem 1a:

Matrix-matrix multiplication: given an n×k matrix A and a k×m matrix B, compute the n×m matrix product C = AB, with C[i,j] = ∑_l A[i,l]·B[l,j] (summing over the inner dimension).

Problem 1b:

Solving sets of linear equations. Given matrix A and vector b, find x such that Ax = b

Preprocess A such that the solution to Ax = b can easily be found for any b (LU factorization, …)

Problem 2:

Use: discretization and solution of certain partial differential equations (PDEs); image processing; …

Stencil computation: given an n×m matrix A, update it as follows and iterate until some convergence criterion is fulfilled (5-point stencil in 2d, with suitable handling of the matrix border):

iterate {
  for all (i,j) {
    A[i,j] <- (A[i-1,j]+A[i+1,j]+A[i,j-1]+A[i,j+1])/4
  }
} until (convergence)

Looks well-behaved, „embarrassingly parallel“? Data distribution? Conflicts on updates?
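A runnable sequential sketch of this iteration, kept as a baseline; the assumptions here (not from the slides) are a Jacobi-style sweep into a second array, which sidesteps the update conflicts asked about above, fixed boundary values, and a simple max-change convergence test:

#include <math.h>
#include <stddef.h>

/* One Jacobi sweep over the interior of an n x m grid; returns the max change. */
static double sweep(size_t n, size_t m, const double *a, double *b)
{
  double maxdiff = 0.0;
  for (size_t i = 1; i + 1 < n; i++)
    for (size_t j = 1; j + 1 < m; j++) {
      double v = (a[(i-1)*m+j] + a[(i+1)*m+j] + a[i*m+(j-1)] + a[i*m+(j+1)]) / 4.0;
      double d = fabs(v - a[i*m+j]);
      if (d > maxdiff) maxdiff = d;
      b[i*m+j] = v;
    }
  return maxdiff;
}

void stencil(size_t n, size_t m, double *a, double *tmp, double eps)
{
  /* copy the boundary once so both grids share the same fixed border */
  for (size_t j = 0; j < m; j++) { tmp[j] = a[j]; tmp[(n-1)*m+j] = a[(n-1)*m+j]; }
  for (size_t i = 0; i < n; i++) { tmp[i*m] = a[i*m]; tmp[i*m+m-1] = a[i*m+m-1]; }
  double *cur = a, *next = tmp;
  while (sweep(n, m, cur, next) > eps) {   /* iterate until convergence */
    double *t = cur; cur = next; next = t; /* swap the two grids */
  }
  /* on exit the latest values are in next (≈ cur by the convergence test) */
}

All (i,j) updates within one sweep are independent, which is why the computation looks „embarrassingly parallel“; the open questions are how to distribute the grid and how to handle the values at the partition borders.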

WS16/17 ©Jesper Larsson Träff A: 1 2 3 4 5 6 7 … All prefix-sums B: x 1 3 6 10 15 21 28 …

Problem 3:

Merging two sorted arrays of size n and m into a sorted array of size n+m. Easy to do sequentially, but the sequential algorithm looks… sequential.

Problem 4:

Computation of all prefix sums: given an array A of size n with elements of some type S with an associative operation +, compute for all indices 0≤i<n the prefix sum ∑_{0≤j≤i} A[j].

Implies a solution to the problem of computing B[i] = ∑_{0≤j<i} A[j] (as in the example above).
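A sequential baseline for Problem 4 with Tseq(n) = O(n), taking + to be integer addition for concreteness (an inclusive in-place version, plus the shifted B-variant from the example above):

#include <stddef.h>

/* Inclusive prefix sums in place: A[i] <- A[0]+...+A[i] */
void prefix_sums(long *a, size_t n)
{
  for (size_t i = 1; i < n; i++) a[i] += a[i-1];
}

/* Exclusive variant: B[i] = A[0]+...+A[i-1]; B[0] set to 0 here */
void exclusive_prefix_sums(const long *a, long *b, size_t n)
{
  long sum = 0;
  for (size_t i = 0; i < n; i++) { b[i] = sum; sum += a[i]; }
}

The loop-carried dependence a[i] += a[i-1] is what makes the sequential algorithm “look sequential”; a parallel algorithm needs a different idea (e.g., a tree-structured scan).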

Problem 5:

Sorting a sequence of objects („reals“, integers, objects with an order relation) stored in an array. Hopefully the parallel merge solution can be of help. Other approaches? Quicksort? Integers (counting, bucket, radix sort)?

Parallel computing as a (theoretical) CS discipline

(Traditional) objective: Solve a given computational problem faster

•Than what?

•What is a reasonable „model“ for parallel computing? •How fast can a given problem be solved? How many resources can be productively exploited?

•Are there problems that cannot be solved in parallel? Fast? At all? •…

Architecture model: Abstraction of the important modules of a computational system (processors, memories), their interconnection and interaction.

Computational/execution model: (formal) framework for the design and analysis of algorithms for the computational system; cost model.

Example: RAM (Random-Access Machine): a processor P (ALU, PC, registers) capable of executing instructions stored in memory M, on data in memory.

Execution of an instruction, access to memory: first assumption, unit cost. Realistic? Useful?

Traditional RAM cost model: the same unit cost for all operations:
•Memory: load, store
•Arithmetic (integer, real numbers)
•Logic: and, or, xor, shift
•Branch, compare, procedure call

The RAM is also known as the von Neumann architecture or stored-program computer.

John von Neumann (1903-57), Report on EDVAC, 1945; also Eckert & Mauchly, ENIAC

„von Neumann bottleneck“: Program and data are separate from the CPU; the processing rate is limited by the memory rate.

John W. Backus: Can Programming Be Liberated From the von Neumann Style? A Functional Style and its Algebra of Programs. Commun. ACM 21(8): 613-641 (1978); Turing Award Lecture, 1977

Example: RAM with cache/memory hierarchy: a different (NOT unit) cost model, some (cache) memory accesses are cheaper than others.

Vector computer: increased memory rate; the ALU operates on vectors instead of scalars. [Diagram: one processor P, several memory banks M]

Shared-memory model, bus based. [Diagram: processors P P P P connected to a single memory M via a bus]

Shared-memory model (network based, emulated). [Diagram: processors P P P P connected to memory M through a network]

Synchronous, shared-memory model: processors operate in lock-step, time = instruction time: the Parallel RAM (PRAM). Realistic? Useful?

The PRAM was the main theoretical model from the late 70s throughout the 80s; interest was lost ca. 1993. It remains a very important analysis/idea tool.

Shared-memory model, banked, network based, not synchronous. [Diagram: processors P P P P, memory banks M M M … M, connected by a network]

UMA (Uniform Memory Access): Access time to a memory location is independent of the location and of the accessing processor, e.g., O(1), O(log M), …

NUMA (Non-Uniform Memory Access): Access time depends on processor and location. Locality: some locations can be accessed faster by a processor than others.

Architecture model defines resources, describes
•Composition of processor, functional units: ALU, FPU, registers, w-bit words vs. unlimited, vector unit (MMX, SSE, AVX)
•Types of instructions
•Memory system, caches
•…

Execution model/cost model specifies •How instructions are executed •(relative) Cost of instructions, memory accesses •…

Level of detail/formality dependent on purpose: What is to be studied (complexity theory, algorithms design, …), what counts (instructions, memory accesses, …)

Distributed memory model: processor-memory pairs (P, M) connected by a communication network.

Parallel architecture model defines

•Synchronization between processors •Synchronization operations •Atomic operations, shared resources (memory, registers) •Communication mechanisms: network topology, properties •Memory: shared, distributed, hierarchical, … •…

Cost model defines •Cost of synchronization, atomic operations •Cost of communication (latency, bandwidth, …) •…

A different parallel architecture model: cellular automaton, systolic array, …: simple processors without memory (finite state automata, FSA) operate in lock step on a (potentially infinite) grid, with local communication only

The state of cell (i,j) in the next step is determined by •its own state •the states of the neighbors in some neighborhood, e.g., (i,j-1), (i+1,j), (i,j+1), (i-1,j)

[John von Neumann, Arthur W. Burks: Theory of Self-Reproducing Automata, 1966] [H. T. Kung: Why systolic architectures? IEEE Computer 15(1): 37-46, 1982]

Flynn‘s taxonomy: Orthogonal classification of (parallel) architectures/models

Classification by instruction stream × data stream:
•SISD: Single Instruction, Single Data stream
•SIMD: Single Instruction, Multiple Data streams
•MISD: Multiple Instruction, Single Data stream
•MIMD: Multiple Instruction, Multiple Data streams

[M. J. Flynn: Some computer organizations and their effectiveness. IEEE Trans. Comp. C-21(9):948-960, 1972]

SISD: Single processor, single stream of instructions, operates on a single stream of data. Sequential architecture (e.g. RAM)

SIMD: Single processor, single stream of operations, operates on multiple data per instruction. Example: traditional vector computer, PRAM (some variants)

MISD: Multiple instructions operate on a single data stream. Example: pipelined architectures, streaming architectures(?), systolic arrays (a 70s architectural idea). Some say: empty

MIMD: Multiple instruction streams, multiple data streams

Typical instances

[Diagrams of typical instances: SISD (one P, one M), SIMD (one P, several M), MIMD (several P with memories M and a communication network)]

Programming model: Abstraction, close to the programming language, defining parallel resources, management of parallel resources, parallelization paradigms, memory structure, memory model, synchronization and communication features, and their semantics

Parallel programming language, or library („interface“) is the concrete implementation of one (or more: multi-modal, hybrid) parallel programming model(s)

Cost of operations: May be specified with programming model; often by architecture/computational model

Execution model: when and how parallelism in programming model is effected

Parallel programming model defines, e.g.,

•Parallel resources, entities, units: processes, threads, tasks, … •Expression of parallelism: explicit or implicit •Level and granularity of parallelism

•Memory model: shared, distributed, hybrid •Memory semantics („when operations take effect/become visible“) •Data structures, data distributions

•Methods of synchronization (implicit/explicit) •Methods and modes of communication

Examples:

1. Threads, shared memory, block-distributed arrays, fork-join parallelism
2. Processes, explicit message passing, collective communication, one-sided communication („RDMA“), PGAS(**)
3. Data parallel SIMD, SPMD(*)
4. …

Concrete libraries/languages: pthreads, OpenMP, MPI, UPC, TBB, …

(*)SPMD: Single Program, Multiple Data (**)PGAS: Partitioned Global Address Space – not in this lecture

Same Program Multiple Data (SPMD)

Restricted MIMD model: All processors execute same program

•May do so asynchronously: different processors may be in different parts of the program at any given time
•The same objects (procedures, variables) exist for all processors; “remote procedure call”, “active messages”, “remote-memory access” make sense

Programming model concepts (not this lecture): active messages, remote procedure call, … (MPI: RMA/one-sided communication)

[F. Darema et al.: A single-program-multiple-data computational model for EPEX/FORTRAN, 1988]

[Layers: programming language/library/interface/paradigm (OpenMP, MPI, Cilk, …) — programming model — algorithmic support, „run-time“ — architecture model — „real“ hardware]

•Different architecture models can realize a given programming model; a closer fit allows more efficient use of the architecture
•Challenge: a programming model that is useful and close to „realistic“ architecture models, to enable realistic analysis/prediction
•Challenge: a language that conveniently realizes the programming model

Examples: OpenMP programming interface/language for the shared-memory model, intended for shared-memory architectures; „data parallel“

Can be implemented with DSM (Distributed Shared Memory) on distributed memory architectures – but performance has usually not been good. Requires DSM implementation/algorithms

MPI interface/library for distributed memory model, can be used on shared-memory architectures, too. Needs algorithmic support (e.g., „collective operations“)

Cilk language (extended C) for the shared-memory model, for shared-memory architectures; „task parallel“. Needs run-time support (e.g., „work-stealing“)

Performance: Basic observations and goals

Model: p dedicated parallel processors collaborate to solve given problem of input size n. We will be interested in worst-case complexities (execution time)

•Processors can work independently (local memory, program)

•Communication and coordination incurs some overhead

Challenge and one main goal of parallel processing

… not to make life easy

Be faster than commonly used (good, best known, best possible…) sequential algorithm and implementation, utilizing the p processors efficiently

Main goal: Speeding up computations by parallel processing

Tseq(n): time for 1 processor to solve problem of size n

Tpar(p,n): time for p processors to solve problem of size n

Sp(n) = Tseq(n)/Tpar(p,n)

Speedup measures the gain in moving from sequential to parallel computation (note: parameters p and n)

Goal: Achieve as large speed-up as possible

What is „time“ (number of instructions)?

What exactly is Tseq(n), Tpar(p,n)?

-Time for some algorithm for solving problem? -Time for a specific algorithm for solving problem?

-Time for best known algorithm for problem? -Time for best possible algorithm for problem?

-Time for specific input of size n, average case, worst case, …? -Asymptotic time, large n, large p?

-Do constants matter, e.g. O(f(p,n)) or 25n/p+3ln (4 (p/n))… ?

Choose a sequential algorithm (theory), choose an implementation of this algorithm (practice)

Tseq(n):

•Theory: Number of instructions (or other critical cost measure) to be executed in the worst case for inputs of size n. •The number of instructions carried out is often termed WORK

•Practice: Measured time (or other parameter) of execution over some inputs (experiment design)

Theory and practice: Always state baseline sequential algorithm/implementation

Examples (theory):

Tseq(n) = O(n): finding the maximum of n numbers in an unsorted array; prefix sums
Tseq(n,m) = O(n+m): merging of two sequences; BFS/DFS in a graph
Tseq(n) = O(n log n): comparison-based sorting
Tseq(n,m) = O(n log n + m): Single-Source Shortest Path (SSSP)
Tseq(n) = O(n³): matrix multiplication, input two n×n matrices; can be solved in o(n³) (Strassen etc.)

Standard, worst-case, asymptotic complexities

Cormen, Leiserson, Rivest, Stein: Introduction to Algorithms. 3rd ed., MIT Press, 2009

Practice:
•Construct good input examples to measure the running time Tseq(n); experimental methodology
•Worst case not always possible, not always interesting; best case?
•Experimental methods to get a stable, accurate Tseq(n): repeat measurements many times (rule of thumb: average over at least 30 repetitions)

New issue with modern processors: Is time always the same thing? Clock frequency may not be constant, e.g., can depend on load of system, energy cap, “turbomode”, etc.. Such factors difficult to control

Experimental science: Always some assumptions about repeatability, regularity, determinism, …

Parallel performance in theory

Definition: Let Tseq(n) be the (worst-case) time of the best possible/best known sequential algorithm, and Tpar(p,n) the (worst-case) time of the parallel algorithm. The absolute speed-up of Tpar(p,n) with p processors over Tseq(n) is

Sp(n) = Tseq(n)/Tpar(p,n)

Observation (proof follows): Best-possible, absolute speed-up is linear in p

Goal: Obtain (linear) speed-up for as large p as possible (as a function of the problem size n), for as many n as possible

WS16/17 ©Jesper Larsson Träff Definition: T∞(n): the smallest possible running time of parallel algorithm given arbitrarily many processors. Per definition T∞(n) ≤ Tpar(p,n) for all p. Speedup is limited by

Sp(n) = Tseq(n)/Tpar(p,n) ≤ Tseq(n)/T∞(n)

Definition: Tseq(n)/T∞(n) is called the parallelism of the algorithm.

The parallelism is the largest number of processors that can be employed and still give linear speedup (i.e., for p up to about Tseq(n)/T∞(n)).
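A small worked illustration (an example added here, not from the slides): a work-optimal parallel prefix-sums algorithm can reach $T_\infty(n) = O(\log n)$ with $T_{\mathrm{seq}}(n) = O(n)$, so

\[
S_p(n) \;\le\; \frac{T_{\mathrm{seq}}(n)}{T_\infty(n)} \;=\; O\!\left(\frac{n}{\log n}\right),
\]

i.e., its parallelism is about $n/\log n$: up to that many processors can be employed with linear speedup, and more processors cannot help.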

For speedup (and other complexity measures), distinguish:

•Problem G to be solved (mathematical specification)

•Some algorithm A to solve G •Best possible (lower bound) algorithm A* for G, best known algorithm A+ for G: The complexity of G

•Implementation of A on some machine M

Example: “data parallel” (SIMD) computation

Algorithm/program: for (i=0; i<n; i++) x[i] = a[i]+b[i];

Problem: sum of two n-element vectors

[Diagram: the problem split into p subproblems, each the sum of n/p elements]

Best possible parallelization: sequential work divided evenly across the p processors: Tpar(p,n) = Tseq(n)/p, Speedup(p) = p

Tpar(p,n) = c·(n/p) for a constant c≥1:
•Perfect speedup
•“Embarrassingly parallel”
•“Pleasantly parallel”
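A hedged sketch of this data-parallel computation in C with OpenMP (one of the interfaces used later in the lecture); the static schedule gives each of the p threads a contiguous block of about n/p iterations:

/* compile with e.g. -fopenmp */
#include <stddef.h>

/* x = a + b element-wise; the iterations are independent ("embarrassingly parallel") */
void vector_sum(size_t n, const double *a, const double *b, double *x)
{
  #pragma omp parallel for schedule(static)
  for (long i = 0; i < (long)n; i++)
    x[i] = a[i] + b[i];
}

Ideally Tpar(p,n) ≈ Tseq(n)/p here; in practice such a memory-bound loop is limited by memory bandwidth well before all cores are saturated.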

Tseq(n): the work, measured in instructions and/or time, that has to be carried out for a problem of size n (from start to stop).

Perfect parallelization: sequential work evenly divided between the p processors, no overhead, so Tpar(p,n) = Tseq(n)/p

Perfect speedup

Sp(n) = Tseq(n)/(Tseq(n)/p) = p

… and very rare in practice

Sequential work may be unevenly divided between the p processors: load imbalance, Tpar(p,n) > Tseq(n)/p, even though ∑Ti(n) = Tseq(n)

Define Tpar(p,n) = max Ti(n) over all processors

Tpar is the time for the slowest processor to complete; all processors are assumed to start at the same time

Wpar(n) = ∑Ti(n) is called the work of the parallel algorithm = total number of instructions performed by the p processors

The product C(n) = p*Tpar(p,n) is called the cost of the parallel algorithm: total time in which the p processors are reserved (and has to be paid for)

Area C(n) = p*Tpar(p,n)


“Theorem:” Perfect speedup Sp(n) = p is best possible and cannot be exceeded

“Proof”: Assume Sp(n) > p for some n. Tseq(n)/Tpar(p,n) > p implies Tseq(n) > p*Tpar(p,n). A better sequential algorithm could be constructed by simulating the parallel algorithm on a single processor. The instructions of the p processors are carried out in some correct order, one after another on the sequential processor. This contradicts that Tseq(n) was best possible/known time.

Reminder: Speedup is calculated (measured) relative to “best” sequential implementation/algorithm

By assumption C(n) = p*Tpar(p,n) < Tseq(n)

Simulation A: one step of P1, one step of P2, …, one step of P(p-1), one step of P1, …, for C(n) iterations

Simulation B: steps of P1 until communication/synchronization, steps of P2 until communication/synchronization, …

Both simulations yield a new, sequential algorithm with time Tsim(n) ≤ C(n) < Tseq(n).

This contradicts that Tseq(n) was the time of the best possible/best known sequential algorithm.

The construction shows that the total parallel work must be at least as large as the sequential work Tseq; otherwise, a better sequential algorithm could be constructed.

Crucial assumptions: Sequential simulation possible (enough memory to hold problem and state of parallel processors), sequential memory behaves as parallel memory, … NOT TRUE for real systems and real problems

Lesson: Parallelism offers only „modest potential“, speed-up cannot be more than p on p processors

Lawrence Snyder: Type architecture, shared memory and the corollary of modest potential. Annual Review of Computer Science, 1986

Aside: Such simulations are actually sometimes done, and can be very useful to understand (model) and debug parallel algorithms.

Some such simulation tools:
•SimGrid (INRIA)
•LogGOPSim (Hoefler et al.)
•…

Possible bachelor thesis subject

The product C(n) = p*Tpar(p,n) is called the cost of the parallel algorithm: the total time in which the p processors are occupied

Definition: Parallel algorithm is called cost-optimal if C(n) = O(Tseq(n)). A cost-optimal algorithm has linear (perhaps perfect) speedup

Wpar(n) = ∑Ti(n) is called the parallel work of the parallel algorithm = total number of instructions performed by some number of processors

Definition: Parallel algorithm is called work-optimal if Wpar(n) = O(Tseq(n)). A work-optimal algorithm has potential for linear speedup (for some number of processors)

Proof (linear speed-up of a cost-optimal algorithm): Given a cost-optimal parallel algorithm with p*Tpar(p,n) = c*Tseq(n) = O(Tseq(n))

This implies Tpar(p,n) = c*Tseq(n)/p, so

Sp(n) = Tseq(n)/Tpar(p,n) = p/c

The constant factor c captures the overheads and load imbalance (see later) of the parallel algorithm relative to best sequential algorithm. The smaller c, the closer the speedup to perfect

Given a work-optimal parallel algorithm, ∑Ti(n) = Tseq(n), with Tpar(n) = max Ti(n):

execute it on a smaller number of processors, such that ∑Ti(n) = p*Tpar(p,n) = O(Tseq(n)).

Proof idea (a work-optimal algorithm can have linear speed-up):

1. Work-optimal algorithm
2. Schedule the work items Ti(n) on p processors, such that p*Tpar(p,n) = O(Tseq(n))
3. With this number of processors, the algorithm is cost-optimal
4. Cost-optimal algorithms have linear speed-up

The scheduling in step 2 is possible in principle, but may not be trivial

Parallel algorithms’ design goal: Work-optimal parallel algorithm with as small Tpar(n) as possible (and therefore large parallelism: many processors can be utilized)

Example: Non work-optimal algorithm

DumbSort with T(n) = O(n2) that can be perfectly parallelized, Tpar(p,n) = O(n2/p)

Well-known that Tseq(n) = O(n log n), with many algorithms and good implementations

Sp(n) = (n log n)/(n²/p) = p·(log n)/n: (small) linear speedup for fixed n (but not independent of n)

Non work-optimal algorithm: Speed-up decreases with n

Break-even: When is parallel algorithm faster than sequential?

Tpar(p,n) < Tseq(n) ⟺ n²/p < n log n ⟺ n/p < log n ⟺ p > n/log n
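One concrete instance of such a “DumbSort” (a choice made here; the slides do not name one) is rank/enumeration sort, T(n) = O(n²), whose outer iterations are completely independent; ties are broken by index so that equal keys get distinct positions:

#include <stddef.h>

/* O(n^2) rank sort: each element's final position is computed independently */
void dumb_sort(const double *a, double *out, size_t n)
{
  for (size_t i = 0; i < n; i++) {     /* each i could be done in parallel */
    size_t rank = 0;
    for (size_t j = 0; j < n; j++)
      if (a[j] < a[i] || (a[j] == a[i] && j < i)) rank++;
    out[rank] = a[i];
  }
}

Since the n outer iterations are independent, Tpar(p,n) = O(n²/p) as in the example above, yet the algorithm is not work-optimal compared to O(n log n) sorting.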

Lesson: It usually does not make sense to parallelize an inferior algorithm (although it is sometimes much easier). The best known/best possible sequential algorithm is often difficult to parallelize:
- no redundant work (that could have been done in parallel)
- tight dependencies (that force things to be done one after another)

Lesson from much hard work in theory and practice: Parallel solution of a given problem often requires a new algorithmic idea!

But: Many algorithms often have a lot of potential for easy parallelization (loops, independent functions, …), so why not?

Also: a non-work-optimal algorithm can sometimes be useful, as a subroutine.

Parallel performance in practice

Speedup is an empirical quantity, “measured time”, based on experiment (benchmark)

Tseq(n): Running time for “reasonable”, good, best available, sequential implementation, on “reasonable” inputs

Tpar(p,n): parallel running time measured for a number of experiments with different typical (worst-case?) inputs

Sp(n) = Tseq(n)/Tpar(p,n)

Speed-up is typically not independent of problem size n, and problem instance

Parallelization most often incurs overheads:
•Algorithmic: the parallel algorithm may do more work
•Coordination: communication and synchronization
•…

Note: Tpar(1,n) ≥ Tseq(n).

The ratio Tpar(1,n)/Tpar(p,n) is often termed relative speedup, and may express how well the parallel algorithm utilizes the p processors.

Relative speedup NOT to be confused with absolute speedup. Absolute speedup expresses how much can be gained over the best (known/possible) sequential implementation by parallelization. Absolute speed-up is what eventually matters

Note: Literature is not always clear about this distinction. It is easier to achieve and document good relative speedup. Reporting speed-up relative to an inferior, sequential implementation is incorrect!

Absolute vs. relative speedup and scalability

Good scalability and relative speedup Tpar(1,n)/Tpar(p,n) = Θ(p)

Example: 0.1p ≤ Tpar(1,n)/Tpar(p,n) ≤ 0.5p

…is sometimes reported.

But what if Tpar(1,n) = 100Tseq(n)? Or Tseq(n) = O(n) but Tpar(p,n) = O(n log n/p + log n)?

Even when Tpar(1,n) = 100Tseq(n) = O(Tseq(n)), it would take at least 200 processors to be as fast as the sequential algorithm

WS16/17 ©Jesper Larsson Träff Empirical, relative speedup without absolute performance baseline (and comparison to reasonable, sequential algorithm and implementation) is misleading

David H. Bailey, "Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers", Supercomputing Review, Aug. 1991, pp. 54-55

If the algorithm is cost-optimal, p*Tpar(p,n) = k*Tseq(n), the speedup becomes imperfect, but still linear: Sp(n) = p/k

NB: This denotes cumulated time (a “profile”) over the whole execution, not a trace. Computation, overhead, and idle time are spread over the whole execution.

Typical overhead: communication and coordination.

The (smallest) time between coordination periods is called the granularity of the parallel computation.

Granularity:

•“Coarse grained” parallel computation/algorithm: the time/number of instructions between coordination intervals (synchronization operations, communication operations) is large (relative to total time or work)
•“Fine grained” parallel computation/algorithm: the time/number of instructions between… is small

Definition: The difference between max Ti(n) and min Ti(n) is the load imbalance. Achieving Ti(n) ≈ Tj(n) for all processors i,j is called load balancing.

Best parallelization has no load imbalance (and no overhead), so Tpar(p,n) = Tseq(n)/p.

This is the best possible parallel time: Tpar(p,n) close to Tseq(n)/p.

Tseq(n) = (s+r)·Tseq(n): sequential fraction s, parallelizable fraction r.

Maximum speedup becomes severely limited, e.g., if overheads are sequential and a constant fraction:

Tpar(p,n) ≥ s*Tseq(n)+r*Tseq(n)/p

Amdahl's Law (parallel version): Let a program A contain a fraction r that can be “perfectly” parallelized, and a fraction s = (1-r) that is “purely sequential”, i.e., cannot be parallelized at all. The maximum achievable speedup is 1/s, independently of n.

Proof:

Tseq(n) = (s+r)*Tseq(n) Tpar(p,n) = s*Tseq(n) + r*Tseq(n)/p

Sp(n) = Tseq(n)/(s*Tseq(n)+r*Tseq(n)/p) = 1/(s+r/p) -> 1/s, for p -> ∞

G. Amdahl: Validity of the single processor approach to achieving large scale computing capabilities. AFIPS 1967

Typical victims of Amdahl‘s law:

•Sequential input/output could be a constant fraction •Sequential initialization of global data structures •Sequential processing of „hard-to-parallelize“ parts of algorithm, e.g., shared data structures

Amdahl‘s law limits speed-up in such cases, if they are a constant fraction of total time, independent of problem size

Example:

1. Processor 0: read input, some precomputation
2. Split the problem into n/p parts, send part i to processor i
3. All processors i: solve part i
4. All processors i: send partial solution back to processor 0

(Work: 10n in total, of which 9n is parallelizable.)

Amdahl: s=0.1, SU at most 10

Typical Amdahl, sequential bottleneck: Constant sequential fraction (2 out of 4 steps, limits speedup)

When interested in parallel aspects, input-output and problem splitting is often explicitly not measured (see projects)

Example: K iterations before convergence; the (parallel) convergence check is cheap; f(i) is fast, O(1)…

// Sequential initialization
x = (int*)calloc(n, sizeof(int));
…
// Parallelizable part
do {
  for (i=0; i<n; i++) x[i] = f(i);
} while (!converged);

Tseq(n) = n+K+Kn (the calloc zero-initialization of the n elements is an O(n) sequential part)

Sp(n) -> 1+K: Amdahl's law limits the speedup

With malloc instead of calloc:

// Sequential initialization
x = (int*)malloc(n*sizeof(int));
…
// Parallelizable part as above

Tseq(n) = 1+K+Kn

Sp(n) -> p when n>p, n->∞: Amdahl's law does not limit the speedup

Note the very different costs of the seemingly similar initialization functions (calloc, malloc).
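For concreteness, a hedged OpenMP sketch of how the parallelizable part of this example could look; f and converged are placeholders for the slide's unspecified per-element function and convergence test, and the loop body is an assumption:

#include <stdlib.h>

extern int f(int i);                        /* assumed O(1) per call, as on the slide */
extern int converged(const int *x, int n);  /* assumed (cheap) convergence check */

/* compile with e.g. -fopenmp */
void example(int n)
{
  int *x = (int*)calloc(n, sizeof(int));    /* sequential O(n) initialization */
  do {
    #pragma omp parallel for                /* the Kn work is spread over the threads */
    for (int i = 0; i < n; i++)
      x[i] = f(i);
  } while (!converged(x, n));
  free(x);
}

With calloc the O(n) zero-initialization stays sequential and caps the speedup near 1+K; with malloc (when zeroing is not needed) this Amdahl term disappears.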

Avoiding Amdahl: Scaled speedup

Sequential, strictly non-parallelizable part is often not a constant fraction of the total execution time (number of instructions)

The sequential part may be constant, or grow only slowly with problem size n. Thus, to maintain good speedup, problem size should be increased with p

Assume Tseq(n) = t(n)+T(n) with sequential part t(n) and perfectly parallelizable part T(n).

Assume t(n)/T(n) -> 0 for n-> ∞

Tpar(p,n) = t(n)+T(n)/p

Speed-up as a function of p and n:

Sp(n) = (t(n)+T(n))/(t(n)+T(n)/p) = (t(n)/T(n)+1)/(t(n)/T(n)+1/p) -> 1/(1/p) = p for n -> ∞

Definition: Speedup as function of p and n, with sequential and parallelizable times t(n) and T(n) is called scaled speed-up

Lesson: Depending on how fast t(n)/T(n) converges, linear speed-up can be achieved by increasing problem size n

Definition: The efficiency of a parallel algorithm is the ratio of the best possible parallel time to the actual parallel time for given p and n:

E(p,n) = (Tseq(n)/p)/Tpar(p,n) = Sp(n)/p = Tseq(n)/(p*Tpar(p,n)) (the denominator p*Tpar(p,n) is the cost)

Remarks:

•E(p,n) ≤ 1, since Sp(n) = Tseq(n)/Tpar(p,n) ≤ p
•E(p,n) = constant: linear speedup
•Cost-optimal algorithms have constant efficiency

Scalability

Definition: A parallel algorithm/implementation is strongly scaling if

Sp(n) = Θ(p) (linear, independent of (sufficiently large) n)

Definition: A parallel algorithm/implementation is weakly scaling if there is a slow-growing function f(p), such that for n = Ω(f(p)), E(p,n) remains constant. The function f is called the iso-efficiency function

„Maintain efficiency by increasing problem size as f(p) or more“

J. Gustafson: Reevaluating Amdahl's Law. CACM, 1988
Ananth Grama, Anshul Gupta, Vipin Kumar: Isoefficiency: measuring the scalability of parallel algorithms and architectures. IEEE Transactions Par. Dist. Computing, 1(3): 12-21 (1993)

Example:

Assume the convergence check takes O(log p) time.

// Sequential initialization
x = (int*)malloc(n*sizeof(int));
…
// Parallelizable part
do {
  for (i=0; i<n; i++) x[i] = f(i);
} while (!converged);

Tpar(p,n) = Kn/p + K log p

Weakly scalable, n has to increase as O(p log p) to maintain constant efficiency, O(log p) per processor (if balanced)

Examples (Tpar, Speedup, Optimality, Efficiency):

Linear time computation, Tseq(n) = n (constants ignored)

Typical, good, work-optimal parallelizations

1. Tpar0(p,n) = n/p + 1: embarrassingly “data parallel” computation, constant overhead
2. Tpar1(p,n) = n/p + log p: logarithmic overhead, e.g. convergence check
3. Tpar2(p,n) = n/p + log² p
4. Tpar3(p,n) = n/p + √p
5. Tpar4(p,n) = n/p + p: linear overhead, e.g. data exchange
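A small sketch (added here, not from the slides) that tabulates Sp(n) = Tseq(n)/Tpar(p,n) = n/(n/p + o(p)) for the five overhead terms above (taking log to be log₂), e.g. to reproduce curves like the plots that follow:

#include <math.h>
#include <stdio.h>

int main(void)
{
  const double n = 128.0;                       /* problem size, as in the first plot */
  for (int p = 1; p <= 128; p *= 2) {
    double lg = log2((double)p);
    double o[5] = { 1.0, lg, lg*lg, sqrt((double)p), (double)p };
    printf("p=%4d", p);
    for (int k = 0; k < 5; k++)
      printf("  S%d=%6.2f", k, n / (n/p + o[k]));  /* Sp(n) = n / (n/p + overhead) */
    printf("\n");
  }
  return 0;
}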

WS16/17 ©Jesper Larsson Träff n=128

Cost-optimality:

1. Tpar0(p,n) = n/p+1: p*Tpar(p,n) = n+p = O(n) for p=O(n)

2. Tpar1(p,n) = n/p+log p: p*Tpar(p,n) = n+p log p = O(n) for p log p = O(n)

3. Tpar2(p,n) = n/p+log² p: p*Tpar(p,n) = n+p log² p = O(n) for p log² p = O(n)

4. Tpar3(p,n) = n/p+√p: p*Tpar(p,n) = n+p√p = O(n) for p√p=O(n)

5. Tpar4(p,n) = n/p+p: p*Tpar(p,n) = n+p² = O(n) for p² = O(n)

[Plots: n=128; n=128, but p up to 256; n=16384 (=16K=128²); n=2097152 (=2M=128³); n=128; n=16384; n=2097152]

To maintain constant efficiency e=Tseq(n)/(p*Tpar(p,n)), n has to increase as

1. Tpar0(p,n) = n/p+1: f0(p) = e/(1-e)*p

2. Tpar1(p,n) = n/p+log p: f1(p) = e/(1-e)*(p log p)

3. Tpar2(p,n) = n/p+log² p: f2(p) = e/(1-e)*(p log² p)

4. Tpar3(p,n) = n/p+√p: f3(p) = e/(1-e)*(p√p)

5. Tpar4(p,n) = n/p+p: f4(p) = e/(1-e)*p²
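These expressions all come from the same one-line calculation; for case 2, with Tseq(n) = n and cost p·Tpar1(p,n) = n + p log p:

\[
e \;=\; \frac{T_{\mathrm{seq}}(n)}{p\,T_{\mathrm{par1}}(p,n)} \;=\; \frac{n}{n + p\log p}
\quad\Longleftrightarrow\quad
n \;=\; \frac{e}{1-e}\,p\log p \;=\; f_1(p),
\]

and analogously with the other per-processor overhead terms in place of log p.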

[Plot: maintained efficiency = 0.9 (90%)]

Matrix-vector multiplication parallelizations

Tseq(n) = n²

Tpar0(p,n) = n²/p + n
Tpar1(p,n) = n²/p + n + log p
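One plausible reading of these terms (an interpretation, not stated on the slides): with a row-block distribution each processor does n²/p of the multiplication work, the +n term accounts for distributing/collecting an n-element vector, and the +log p term for, e.g., tree-structured coordination:

\[
T_{\mathrm{par0}}(p,n) \;=\; \underbrace{\tfrac{n^2}{p}}_{n/p \text{ rows}} \;+\; \underbrace{n}_{\text{vector handling}},
\qquad
T_{\mathrm{par1}}(p,n) \;=\; \tfrac{n^2}{p} + n + \underbrace{\log p}_{\text{coordination}}.
\]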

[Plot: n=100]

[Plot: n=1000]

Some non work-optimal parallel algorithms

1. Tseq(n) = n log n, Tpar(p,n) = n²/p + 1

2. Tseq(n) = n, Tpar(p,n) = (n log n)/p + 1

Amdahl case, linear sequential running time:

TparA(p,n) = 0.9n/p+0.1n

[Plots: n1 = 128, n2 = 16384]

Limitations of the empirical speedup measure

Empirical speedup assumes that Tseq(n) can be measured.

For very large n and p, this may not be the case: a large HPC system has much more (distributed) main memory than any single-processor system

Scalability measured by other means: •Stepwise speedup (1-1000 processors, 1000-100,000 processors) •Other notions of efficiency

Lecture summary, checklist

•Models of parallel computation: Architecture, programming, cost

•Flynn’s taxonomy: MIMD, SIMD; SPMD

•Sequential baseline

•Speedup (in theory and practice)

•Work, Cost optimality

•Amdahl’s law

•Scaled speed-up, strong and weak scaling
