
Technische Universität München

Parallel Programming and High-Performance Computing

Part 1: Introduction

Dr. Ralf-Peter Mundani, CeSIM / IGSSE, Technische Universität München

1 Introduction – General Remarks

• materials: http://www5.in.tum.de/lehre/vorlesungen/parhpp/SS08/

• Ralf-Peter Mundani – email [email protected], phone 289–25057, room 3181 (city centre) – consultation hours: Tuesday, 4:00–6:00 pm (room 02.05.058)
• Ioan Lucian Muntean – email [email protected], phone 289–18692, room 02.05.059

• lecture (2 SWS) – weekly – Tuesday, start at 12:15 pm, room 02.07.023

• exercises (1 SWS) – fortnightly – Wednesday, start at 4:45 pm, room 02.07.023


• content
– part 1: introduction
– part 2: high-performance networks
– part 3: foundations
– part 4: programming memory-coupled systems
– part 5: programming message-coupled systems
– part 6: dynamic load balancing
– part 7: examples of parallel

1 Introduction – Overview

• motivation
• classification of parallel computers
• levels of parallelism
• quantitative performance evaluation

“I think there is a world market for maybe five computers.” – Thomas Watson, chairman of IBM, 1943

1 Introduction – Motivation

• numerical simulation: from physical phenomena to predictions (the steps below involve the physical phenomenon, the technical discipline, mathematics, and the application)
1. modelling – determination of parameters, expression of relations
2. numerical treatment – model discretisation, development
3. implementation – software development, parallelisation
4. visualisation – illustration of abstract simulation results
5. validation – comparison of results with reality
6. embedding – insertion into the working process


• why parallel programming and HPC?
– complex problems (especially the so-called “grand challenges”) demand more computing power
• climate or geophysics simulations (e.g. tsunami)
• structure or flow simulations (e.g. crash tests)
• development systems (e.g. CAD)
• analysis of large data sets (e.g. the Large Hadron Collider at CERN)
• military applications (e.g. cryptanalysis)
• …
– performance increase due to
• faster hardware, more memory (“work harder”)
• more efficient algorithms, optimisation (“work smarter”)
• parallel processing on several processors (“get some help”)


• objectives (in case all resources were available N times)
– throughput: compute N problems simultaneously
• running N instances of a sequential program with different data sets (“embarrassing parallelism”); SETI@home, e.g. (see the sketch below)
• drawback: limited resources of single nodes
– response time: compute one problem in a fraction (1/N) of the time
• running one instance (i.e. N processes) of a parallel program for jointly solving a problem; finding prime numbers, e.g.
• drawback: writing a parallel program; communication
– problem size: compute one problem with N-times larger data
• running one instance (i.e. N processes) of a parallel program, using the sum of all local memories for computing larger problem sizes; iterative solution of a system of linear equations (SLE), e.g.
• drawback: writing a parallel program; communication
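The throughput objective above amounts to running independent instances of one sequential job side by side. A minimal sketch with Python's multiprocessing module; the job function and the data sets are invented for illustration:

```python
from multiprocessing import Pool

def sequential_job(data_set):
    # placeholder for any sequential program working on one data set
    return sum(x * x for x in data_set)

if __name__ == "__main__":
    # N independent data sets -> N independent runs ("embarrassing parallelism")
    data_sets = [range(i, i + 1000) for i in range(8)]
    with Pool(processes=4) as pool:      # 4 worker processes handle 8 jobs
        results = pool.map(sequential_job, data_sets)
    print(results)
```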



1 Introduction – Classification of Parallel Computers

• definition: “A collection of processing elements that communicate and cooperate to solve large problems” (ALMASI and GOTTLIEB, 1989)
• possible appearances of such processing elements
– specialised units (steps of a vector pipeline, e.g.)
– parallel features in modern monoprocessors (superscalar architectures, VLIW, multithreading, multicore, …)
– several uniform arithmetical units (processing elements of array computers, e.g.)
– processors of a multiprocessor computer (i.e. the actual parallel computers)
– complete stand-alone computers connected via LAN (workstation or PC clusters, so-called virtual parallel computers)
– parallel computers or clusters connected via WAN (so-called metacomputers)


• reminder: dual core, quad core, manycore, and multicore
– observation: increasing frequency (and thus core voltage) over the past years
– problem: thermal power dissipation increases linearly with the frequency and with the square of the core voltage


• reminder: dual core, quad core, manycore, and multicore (cont’d)
– a 25% reduction in frequency (and thus core voltage) leads to a 50% reduction in dissipation
– [chart: dissipation and performance of a normal CPU vs. a frequency-reduced CPU]


• reminder: dual core, quad core, manycore, and multicore (cont’d)
– idea: installation of two cores per die with the same dissipation as the single-core system
– [chart: dissipation and performance of a single-core vs. a dual-core system]
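A rough numerical check of this idea, assuming the model stated above (dissipation proportional to frequency times voltage squared) and assuming the core voltage is lowered by the same 25% as the frequency; under these assumptions a reduced core needs roughly half the original power, so two such cores stay roughly within the original power budget:

```python
def relative_dissipation(freq_factor, volt_factor):
    # stated model: dissipation grows linearly in frequency, quadratically in voltage
    return freq_factor * volt_factor ** 2

single_reduced = relative_dissipation(0.75, 0.75)   # one core at 75% frequency/voltage
dual_reduced = 2 * single_reduced                    # two such cores on one die
print(f"reduced core: {single_reduced:.2f} of the original dissipation")
print(f"dual core:    {dual_reduced:.2f} of the original dissipation, ~1.5x the throughput")
```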


• commercial parallel computers
– manufacturers: starting from 1983, big players and small start-ups (see the table below; “out of business” here means no longer in the parallel computer business)
– names have been coming and going rapidly
– in addition: several manufacturers of vector computers and non-standard architectures

company    country  year  status in 2003
Sequent    U.S.     1984  acquired by IBM
Intel      U.S.     1984  out of business
Meiko      U.K.     1985  bankrupt
nCUBE      U.S.     1985  out of business
Parsytec   Germany  1985  out of business
Alliant    U.S.     1985  bankrupt


• commercial parallel computers (cont’d)

company                  country  year  status in 2003
Encore                   U.S.     1986  out of business
Floating Point Systems   U.S.     1986  acquired by SUN
Myrias                   Canada   1987  out of business
Ametek                   U.S.     1987  out of business
                         U.S.     1988  active
C-DAC                    India    1991  active
Kendall Square Research  U.S.     1992  bankrupt
IBM                      U.S.     1993  active
NEC                      Japan    1993  active
SUN Microsystems         U.S.     1993  active
Cray Research            U.S.     1993  active


• arrival of clusters
– in the late eighties, PCs became a commodity market with rapidly increasing performance, mass production, and decreasing prices
– growing attractiveness of such commodity parts for building parallel computers
– 1994: Beowulf, the first parallel computer built completely out of commodity hardware
• NASA Goddard Space Flight Center
• 16 Intel DX4 processors
• multiple 10 Mbit Ethernet links
• Linux with GNU compilers
• MPI library
– 1996: a Beowulf cluster performing more than 1 GFlops
– 1997: a 140-node cluster performing more than 10 GFlops


• arrival of clusters (cont’d)
– 2005: InfiniBand cluster at TUM
• 36 Opteron nodes (quad boards)
• 4 nodes (quad boards)
• 4 Xeon nodes (dual boards) for interactive tasks
• InfiniBand 4× switch, 96 ports
• Linux (SuSE and Red Hat)


• supercomputers
– supercomputing or high-performance scientific computing as the most important application of the big number crunchers
– national initiatives due to huge budget requirements
• Accelerated Strategic Computing Initiative (ASCI) in the U.S.
– in the wake of the nuclear testing moratorium in 1992/93
– decision: develop, build, and install a series of five supercomputers of up to $100 million each in the U.S.
– start: ASCI Red (1997, Intel-based, Sandia National Laboratory, the world’s first TFlops computer)
– then: ASCI Blue Pacific (1998, LLNL), ASCI Blue Mountain, ASCI White, …
• meanwhile a new high-end computing memorandum (2004)


• supercomputers (cont’d)
– federal “Bundeshöchstleistungsrechner” initiative in Germany
• decision in the mid-nineties
• three federal supercomputing centres in Germany (Munich, Stuttgart, and Jülich)
• one new installation every second year (i.e. a six-year upgrade cycle for each centre)
• the newest one is supposed to be among the top 10 in the world
– overview and state of the art: the Top500 list (updated every six months), see http://www.top500.org


• MOORE’s law
– observation by Intel co-founder Gordon E. MOORE in 1965, describing an important trend in the history of computer hardware:
the number of transistors that can be placed on an integrated circuit increases exponentially, doubling approximately every two years


• some numbers: Top500 (statistics charts from the Top500 list; not reproduced here)

– cluster: #nodes > #processors/node; constellation: #nodes < #processors/node


• The Earth Simulator
– world’s #1 from 2002 to 2004
– installed in 2002 in Yokohama, Japan
– ES building (approx. 50 m × 65 m × 17 m)
– based on the NEC SX-6 architecture
– developed by three governmental agencies
– highly parallel vector computer
– consists of 640 nodes (plus 2 control and 128 data-switching nodes), each with
• 8 vector processors (8 GFlops each)
• 16 GB memory
⇒ 5120 processors (40.96 TFlops peak performance) and 10 TB memory in total; 35.86 TFlops sustained performance (Linpack)
– nodes connected by a 640×640 single-stage crossbar (83,200 cables with a total length of 2,400 km; 8 TB/s total bandwidth)
– further 700 TB disc space and 1.6 PB mass storage


• BlueGene/L
– world’s #1 since 2004
– installed in 2005 at LLNL, CA, USA (beta system in 2004 at IBM)
– cooperation of DoE, LLNL, and IBM
– massively parallel supercomputer
– consists of 65,536 nodes (plus 12 front-end and 1,024 I/O nodes), each with
• 2 PowerPC 440d processors (2.8 GFlops each)
• 512 MB memory
⇒ 131,072 processors (367 TFlops peak performance) and 33.5 TB memory in total; 280.6 TFlops sustained performance (Linpack)
– nodes configured as a 3D torus (32 × 32 × 64); global reduction tree for fast operations (global max / sum) within a few microseconds
– 1,024 Gbps link to the global parallel file system
– further 806 TB disc space; operating system SuSE SLES 9


• HLRB II (world’s #6 as of 04/2006)
– installed in 2006 at LRZ, Garching
– installation costs 38 M€
– monthly costs approx. 400,000 €
– upgrade in 2007 (finished)
– one of Germany’s three federal supercomputers
– SGI Altix 4700
– consists of 19 nodes (SGI NUMAlink 2D torus), each with
• 256 blades (ccNUMA link with partition fat tree)
– Intel Itanium2 Montecito dual core (12.8 GFlops per processor)
– 4 GB memory per core
⇒ 9,728 processor cores (62.3 TFlops peak performance) and 39 TB memory; 56.5 TFlops sustained performance (Linpack)
– footprint 24 m × 12 m; total weight 103 metric tons
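A quick arithmetic check of the peak-performance figures quoted for the three systems above (number of processing elements times per-element peak); only the small helper function is mine, the numbers are taken from the slides:

```python
def peak_tflops(num_processors, gflops_each):
    # peak performance = number of processing elements x per-element peak, in TFlops
    return num_processors * gflops_each / 1000.0

print(peak_tflops(5120, 8.0))     # Earth Simulator: ~40.96 TFlops
print(peak_tflops(131072, 2.8))   # BlueGene/L:      ~367 TFlops
print(peak_tflops(9728, 6.4))     # HLRB II: 9728 cores x 6.4 GFlops/core ~ 62.3 TFlops
```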


• standard classification according to FLYNN
– global data and instruction streams as criterion
• instruction stream: sequence of commands to be executed
• data stream: sequence of data subject to instruction streams
– two-dimensional subdivision according to
• the number of instructions a computer can execute per time unit
• the number of data elements a computer can process per time unit
– hence, FLYNN distinguishes four classes of architectures
• SISD: single instruction, single data
• SIMD: single instruction, multiple data
• MISD: multiple instruction, single data
• MIMD: multiple instruction, multiple data
– drawback: very different computers may belong to the same class


• standard classification according to FLYNN (cont’d)
– SISD
• one processing unit that has access to one data memory and one program memory
• classical monoprocessor following VON NEUMANN’s principle



• standard classification according to FLYNN (cont’d)
– SIMD
• several processing units, each with separate access to a (shared or distributed) data memory; one program memory
• synchronous execution of instructions
• examples: array computers, vector computers
• advantage: easy programming model due to a control flow with strictly synchronous-parallel execution of all instructions
• drawbacks: specialised hardware necessary, easily becomes outdated due to rapid developments on the commodity market



• standard classification according to FLYNN (cont’d)
– MISD
• several processing units that have access to one data memory; several program memories
• not a very popular class (mainly for special applications such as digital signal processing)
• operating on a single stream of data, forwarding results from one processing unit to the next
• example: systolic arrays (networks of primitive processing elements that “pump” data)



• standard classification according to FLYNN (cont’d)
– MIMD
• several processing units, each with separate access to a (shared or distributed) data memory; several program memories
• classification according to the (physical) memory organisation
– shared memory ⇒ shared (global) address space
– distributed memory ⇒ distributed (local) address space
• examples: multiprocessor systems, networks of computers



• processor coupling
– cooperation of processors / computers as well as their shared use of various resources requires communication and synchronisation
– the following types of processor coupling can be distinguished
• memory-coupled multiprocessor systems (MemMS)
• message-coupled multiprocessor systems (MesMS)

                            global memory    distributed memory
shared address space        MemMS, SMP       Mem-MesMS (hybrid)
distributed address space   ∅                MesMS


• processor coupling (cont’d)
– central issues
• scalability: costs for adding new nodes / processors
• programming model: costs for writing parallel programs
• portability: costs for porting (migration), i.e. the transfer from one system to another while preserving executability and flexibility
• load distribution: costs for obtaining a uniform load distribution among all nodes / processors
– MesMS are advantageous concerning scalability, MemMS are typically better concerning the rest
– hence, a combination of MemMS and MesMS for exploiting all advantages ⇒ distributed / virtual shared memory (DSM / VSM) – physically distributed memory with a global shared address space


• processor coupling (cont’d)
– uniform memory access (UMA)
• each processor P has direct access via the network to each memory module M, with the same access times to all data
• a standard programming model can be used (i.e. no explicit send / receive of messages necessary)
• communication and synchronisation via shared variables (inconsistencies (write conflicts, e.g.) have to be prevented by the programmer)


• processor coupling (cont’d)
– symmetric multiprocessor (SMP)
• only a small number of processors, in most cases a central bus, one address space (UMA), but bad scalability
• cache coherence implemented in hardware (i.e. a read always provides the value of a variable’s last write)
• examples: dual or quad boards, SGI Challenge
– [diagram: shared memory M, accessed by processors P through their caches C]


• processor coupling (cont’d)
– non-uniform memory access (NUMA)
• memory modules physically distributed among the processors
• shared address space, but access times depend on the location of the data (i.e. local addresses are faster than remote addresses)
• differences in access times are visible in the program
• examples: DSM / VSM, Cray T3E


• processor coupling (cont’d)
– cache-coherent non-uniform memory access (ccNUMA)
• caches for local and remote addresses; cache coherence implemented in hardware for the entire address space
• scalability problems due to frequent cache updates
• example: SGI Origin 2000


• processor coupling (cont’d)
– cache-only memory access (COMA)
• each processor has only cache memory
• the entirety of all cache memories forms the global shared memory
• cache coherence implemented in hardware
• example: Kendall Square Research KSR-1


• processor coupling (cont’d)
– no remote memory access (NORMA)
• each processor has direct access to its local memory only
• access to remote memory is possible only via explicit message exchange (due to the distributed address space), as sketched below
• synchronisation implicitly via the exchange of messages
• performance improvement between memory and I/O possible due to parallel data transfer (direct memory access (DMA), e.g.)
• examples: IBM SP2, ASCI Red / Blue / White
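A minimal sketch of the explicit message exchange that NORMA-style systems require, written with the mpi4py binding of MPI (mpi4py is only one possible choice and is not prescribed by the slide):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    local_data = [1, 2, 3]
    comm.send(local_data, dest=1, tag=0)    # remote memory is reached only via messages
elif rank == 1:
    received = comm.recv(source=0, tag=0)   # the blocking receive also synchronises
    print("process 1 received", received)
```

It would be started with two processes, e.g. via mpirun -np 2.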



1 Introduction – Levels of Parallelism

• the suitability of a parallel architecture for a given parallel program strongly depends on the granularity of parallelism
• some remarks on granularity
– quantitative meaning: ratio of computational effort to communication / synchronisation effort (≈ number of instructions between two necessary communication / synchronisation steps)
– qualitative meaning: the level on which work is done in parallel, from coarse-grain to fine-grain parallelism:
• program level (coarse-grain)
• process level
• block level
• instruction level
• sub-instruction level (fine-grain)


• program level
– parallel processing of different programs
– independent units without any shared data
– no or only a small amount of communication / synchronisation
– organised by the operating system
• process level
– a program is subdivided into processes to be executed in parallel
– each process consists of a larger number of sequential instructions and has a private address space
– synchronisation necessary (in case all processes belong to one program)
– communication in most cases necessary (data exchange, e.g.)
– support by the OS via routines for process management, process communication, and process synchronisation
– such processes are often referred to as heavy-weight processes


• block level
– blocks of instructions are executed in parallel
– each block consists of a smaller number of instructions and shares the address space with other blocks
– communication via shared variables; synchronisation mechanisms (see the sketch below)
– such blocks are often referred to as light-weight processes (threads)
• instruction level
– parallel execution of machine instructions
– optimising compilers can increase this potential by modifying the order of commands (better exploitation of superscalar architectures and pipelining mechanisms)
• sub-instruction level
– instructions are further subdivided into units to be executed in parallel or via overlapping (vector operations, e.g.)
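A minimal sketch of block-level parallelism as described above: two light-weight processes (threads) in one address space, communicating via a shared variable and synchronising with a lock; the counter example is invented:

```python
import threading

counter = 0                      # shared variable in the common address space
lock = threading.Lock()          # synchronisation mechanism

def block(n):
    global counter
    for _ in range(n):
        with lock:               # prevent write conflicts on the shared variable
            counter += 1

threads = [threading.Thread(target=block, args=(10000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                   # 20000: both blocks updated the same memory
```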



1 Introduction – Quantitative Performance Evaluation

• execution time
– time T of a parallel program between the start of the execution on one processor and the end of all computations on the last processor
– during execution, all processors are in one of the following states
• compute: computation time T_COMP – time spent for computations
• communicate: communication time T_COMM – time spent for send and receive operations
• idle: idle time T_IDLE – time spent for waiting (sending / receiving messages)
– hence T = T_COMP + T_COMM + T_IDLE


• parallel profile
– measures the amount of parallelism of a parallel program
– graphical representation
• x-axis shows time, y-axis shows the number of parallel activities
• identification of computation, communication, and idle periods
– example: [chart: activity profile of three processes (proc. A, B, C) over time, distinguishing compute, communicate, and idle phases; y-axis: number of processes (0–3)]


• parallel profile (cont’d)
– degree of parallelism
• P(t) indicates the number of processes (of one application) that can be executed in parallel at any point in time (i.e. the y-values of the previous example for any time t)
– average parallelism (often referred to as the parallel index)
• A(p) indicates the average number of processes that can be executed in parallel, hence

A(p) = ( Σ_{i=1..p} i · t_i ) / ( Σ_{i=1..p} t_i )   or   A(p) = 1/(t_2 − t_1) · ∫_{t_1}^{t_2} P(t) dt

where p is the number of processes and t_i is the time during which exactly i processes are busy


• parallel profile (cont’d)
– previous example: A(p) = (1⋅18 + 2⋅4 + 3⋅13) / 35 = 65/35 ≈ 1.86
– [plot: P(t) over time for the example, P(t) ∈ {1, 2, 3}, time axis from 5 to 45]
– for A(p), several theoretical (typically quite pessimistic) estimates exist, often used as arguments against parallel systems
– example: estimate of MINSKY (1971)
• problem: the number of used processors is halved in every step
• parallel summation of 2p numbers on p processors, e.g.
• result?
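The average parallelism of the example above, recomputed in a few lines (t1 = 18, t2 = 4, t3 = 13 are the time shares from the profile):

```python
def average_parallelism(busy_times):
    # busy_times[i-1] = time during which exactly i processes are busy
    total_time = sum(busy_times)
    weighted = sum(i * t for i, t in enumerate(busy_times, start=1))
    return weighted / total_time

print(average_parallelism([18, 4, 13]))   # (1*18 + 2*4 + 3*13) / 35 = 65/35 ≈ 1.86
```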


• comparison multiprocessor / monoprocessor
– correlation of the multi- and the monoprocessor system’s performance
– important: a program that can be executed on both systems
– definitions
• P(1): number of unit operations of the program on the monoprocessor system
• P(p): number of unit operations of the program on the multiprocessor system with p processors
• T(1): execution time of the program on the monoprocessor system (measured in steps or clock cycles)
• T(p): execution time of the program on the multiprocessor system with p processors (measured in steps or clock cycles)


• comparison multiprocessor / monoprocessor (cont’d)
– simplifying preconditions
• T(1) = P(1) – one operation to be executed in one step on the monoprocessor system
• T(p) ≤ P(p) – more than one operation to be executed in one step (for p ≥ 2) on the multiprocessor system with p processors


• comparison multiprocessor / monoprocessor (cont’d)
– speed-up
• S(p) indicates the improvement in processing speed:
S(p) = T(1) / T(p)
• in general, 1 ≤ S(p) ≤ p
– efficiency
• E(p) indicates the relative improvement in processing speed:
E(p) = S(p) / p
• the improvement is normalised by the number of processors p
• in general, 1/p ≤ E(p) ≤ 1


• comparison multiprocessor / monoprocessor (cont’d)
– speed-up and efficiency can be seen in two different ways
• algorithm-independent: the best known sequential algorithm for the monoprocessor system is compared to the respective parallel algorithm for the multiprocessor system
⇒ absolute speed-up, absolute efficiency
• algorithm-dependent: the parallel algorithm is treated as a sequential one to measure the execution time on the monoprocessor system; “unfair” due to communication and synchronisation overhead
⇒ relative speed-up, relative efficiency


• comparison multiprocessor / monoprocessor (cont’d)
– overhead
• O(p) indicates the necessary overhead of the multiprocessor system for organisation, communication, and synchronisation:
O(p) = P(p) / P(1)
• in general, 1 ≤ O(p)
– parallel index
• I(p) indicates the number of operations executed on average per time unit:
I(p) = P(p) / T(p)
• I(p) ≈ relative speed-up (taking the overhead into account)


• comparison multiprocessor / monoprocessor (cont’d)
– utilisation
• U(p) indicates the number of operations each processor executes on average per time unit:
U(p) = I(p) / p
• corresponds to the normalised parallel index
– conclusions
• all defined quantities have the value 1 for p = 1
• the parallel index is an upper bound for the speed-up: 1 ≤ S(p) ≤ I(p) ≤ p
• the utilisation is an upper bound for the efficiency: 1/p ≤ E(p) ≤ U(p) ≤ 1


• comparison multiprocessor / monoprocessor (cont’d)
– example (1)
• a monoprocessor system needs 6000 steps for the execution of 6000 operations to compute some result
• a multiprocessor system with five processors needs 6750 operations for the computation of the same result, but only 1500 steps for the execution
• thus P(1) = T(1) = 6000, P(5) = 6750, and T(5) = 1500
• speed-up and efficiency can be computed as

S(5) = 6000/1500 = 4 and E(5) = 4/5 = 0.8

⇒ there is an acceleration by a factor of 4 compared to the monoprocessor system, i.e. on average each processor of the multiprocessor system contributes 80% of its potential to this acceleration


• comparison multiprocessor / monoprocessor (cont’d)
– example (2)
• parallel index and utilisation can be computed as

I(5) = 6750/1500 = 4.5 and U(5) = 4.5/5 = 0.9

⇒ on average 4.5 processors are simultaneously busy, i.e. each processor is working for only 90% of the execution time
• the overhead can be computed as

O(5) = 6750/6000 = 1.125

⇒ there is an overhead of 12.5% on the multiprocessor system compared to the monoprocessor system
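The two examples above, recomputed in a few lines; P(1), T(1), P(5), T(5) are exactly the values from the slides:

```python
P1, T1 = 6000, 6000        # operations and steps on the monoprocessor
Pp, Tp, p = 6750, 1500, 5  # operations, steps, and processor count on the multiprocessor

S = T1 / Tp                # speed-up        -> 4.0
E = S / p                  # efficiency      -> 0.8
I = Pp / Tp                # parallel index  -> 4.5
U = I / p                  # utilisation     -> 0.9
O = Pp / P1                # overhead        -> 1.125
print(S, E, I, U, O)
```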


• scalability
– objective: adding further processing elements to the system shall reduce the execution time without any program modifications
– i.e. a linear performance increase with an efficiency close to 1
– a sufficient problem size is important for scalability
• one porter may carry one suitcase in a minute
• 60 porters won’t do it in a second
• but 60 porters may carry 60 suitcases in a minute
– in case of a fixed problem size and an increasing number of processors, saturation will occur for a certain value of p, hence scalability is limited
– when scaling the number of processors together with the problem size (so-called scaled problem analysis), this effect will not appear for well scalable hard- and software systems


• AMDAHL’s law
– probably the most important and most famous estimate for the speed-up (even if quite pessimistic)
– underlying model
• each program consists of a sequential part s, 0 ≤ s ≤ 1, that can only be executed sequentially; synchronisation or data I/O, e.g.
• furthermore, each program consists of a parallelisable part 1−s that can be executed in parallel by several processes; finding the maximum value within a set of numbers, e.g.
– hence, the execution time of the parallel program executed on p processors can be written as

T(p) = s · T(1) + ((1 − s) / p) · T(1)


• AMDAHL’s law (cont’d)
– the speed-up can thus be computed as

S(p) = T(1) / T(p) = T(1) / (s·T(1) + ((1−s)/p)·T(1)) = 1 / (s + (1−s)/p)

– letting p → ∞ we finally get AMDAHL’s law:

lim_{p→∞} S(p) = lim_{p→∞} 1 / (s + (1−s)/p) = 1/s

⇒ the speed-up is bounded: S(p) ≤ 1/s
– the sequential part can have a dramatic impact on the speed-up
– therefore a central effort of all (parallel) algorithms: keep s small
– many parallel programs have a small sequential part (s < 0.1)


• AMDAHL’s law (cont’d)
– example
• s = 0.1 and, thus, S(p) ≤ 10
• independent of p, the speed-up is bounded by this limit
• where’s the error?
– [plot: S(p) for s = 0.1 over p = 1…25, approaching the bound 10]
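A short sketch of AMDAHL's bound for this example: with s = 0.1 the speed-up approaches, but never exceeds, 1/s = 10:

```python
def amdahl_speedup(s, p):
    # S(p) = 1 / (s + (1 - s) / p)
    return 1.0 / (s + (1.0 - s) / p)

s = 0.1
for p in (1, 5, 10, 25, 100, 1000):
    print(p, round(amdahl_speedup(s, p), 2))   # tends towards the bound 1/s = 10
```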


• GUSTAFSON’s law
– addresses the shortcomings of AMDAHL’s law, as it states that any sufficiently large problem can be efficiently parallelised
– instead of a fixed problem size it supposes a fixed-time concept
– underlying model
• the execution time on the parallel machine is normalised to 1
• this contains a non-parallelisable part σ, 0 ≤ σ ≤ 1
– hence, the execution time of the sequential program on the monoprocessor can be written as

T(1) = σ + p · (1 − σ)

– the speed-up can thus be computed as

S(p) = σ + p · (1 − σ) = p + σ · (1 − p)


• GUSTAFSON’s law (cont’d)
– difference to AMDAHL
• the sequential part s(p) is not constant, but gets smaller with increasing p:

s(p) = σ / (σ + p · (1 − σ)),   s(p) ∈ ]0, 1[

• often more realistic, because more processors are used for a larger problem size, and there the parallelisable parts typically increase (more computations, fewer declarations, …)
• the speed-up is not bounded for increasing p
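A sketch of GUSTAFSON's scaled speed-up and of the shrinking sequential share s(p), using the two formulas above; σ = 0.1 is chosen only for illustration:

```python
def gustafson_speedup(sigma, p):
    # S(p) = p + sigma * (1 - p)
    return p + sigma * (1 - p)

def sequential_share(sigma, p):
    # s(p) = sigma / (sigma + p * (1 - sigma))
    return sigma / (sigma + p * (1 - sigma))

sigma = 0.1
for p in (1, 5, 10, 100, 1000):
    # speed-up grows without bound, while the sequential share shrinks towards 0
    print(p, gustafson_speedup(sigma, p), round(sequential_share(sigma, p), 4))
```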


• GUSTAFSON’s law (cont’d)
– some more thoughts about speed-up
• theory says: a superlinear speed-up does not exist
– each parallel algorithm can be simulated on a monoprocessor system by emulating, in a loop, always the next step of a processor of the multiprocessor system
• but a superlinear speed-up can be observed in practice
– when improving an inferior sequential algorithm
– when a parallel program (that does not fit into the main memory of the monoprocessor system) completely runs in the caches and main memories of the nodes of the multiprocessor system


• communication–computation ratio (CCR)
– an important quantity for measuring the success of a parallelisation
• gives the ratio of pure communication time to pure computing time
• a small CCR is favourable
• typically: the CCR decreases with increasing problem size
– example
• an N×N matrix is distributed among p processors (N/p rows each)
• iterative method: in each step, each matrix element is replaced by the average of its eight neighbour values
• hence, the two neighbouring rows are always necessary
• computation time: 8·N·N/p
• communication time: 2·N
• CCR: 2N / (8N²/p) = p/(4N)
– what does this mean?
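A quick check of the ratio derived above, CCR = 2N / (8N²/p) = p/(4N); the concrete values of N and p below are invented for illustration:

```python
def ccr(n, p):
    computation = 8 * n * n / p          # 8 neighbour operations per element, N*N/p elements
    communication = 2 * n                # two neighbouring rows of length N per step
    return communication / computation   # equals p / (4 * n)

for n, p in ((1000, 4), (1000, 64), (10000, 64)):
    print(n, p, ccr(n, p), p / (4 * n))  # the CCR shrinks as the problem size N grows
```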
