Parallel Computing Concepts
But … why are we interested in parallel computing? Because we do supercomputing? But … what is supercomputing?

John M. Levesque
CTO Office – Applications Director, Cray's Supercomputing Center of Excellence

Seymour Cray, September 28, 1925 – October 5, 1996

First – a supercomputer is:
● A computer whose peak performance is capable of solving the largest complex science/engineering problems
● Peak performance is measured by:
  ● The total number of processors in the system multiplied by the peak of each processor
  ● The peak of each processor is measured in floating point operations per second (FLOPS)
  ● A scalar processor can usually generate 2-4 64-bit results (multiply/add) each clock cycle
  ● The Intel Skylake and Intel Knights Landing have vector units that can generate 16-32 64-bit results each clock cycle
  ● Skylake's clock rate varies; the fastest is 3-4 GHz, for 64-128 GFLOPS
  ● Big benefit from vectorizing applications
● Summit has 4,356 nodes, each equipped with two 22-core Power9 CPUs and six NVIDIA Tesla V100 GPUs, for a whopping 187.6 petaflops, and it is the fastest computer in the world (at the time of this talk)

However, the supercomputer is just one piece of supercomputing. We need three things to be successful at supercomputing:
● The supercomputer – Summit is a great example
● A large scientific problem to solve – we have many of those
● Programming expertise to port/optimize existing applications and/or develop new applications that effectively utilize the target supercomputing architecture
  ● This is an area where we are lacking – too many application developers are not interested in how to write optimized applications for the target system
  ● When you achieve less than one percent of peak performance on a 200-petaflop computer, you are achieving less than 2 petaflops
  ● There are a few sites that are exceptions to this

Ways to make a computer faster
● Lower the clock cycle – increase the clock rate
  ● As the clock cycle becomes shorter, the computer can perform more computations each second. Remember, we are determining peak performance by counting the number of floating point operations per second.
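The peak arithmetic above can be written out explicitly. This is a worked restatement using only the figures quoted on the Skylake bullet (results per cycle and clock rate), not independently measured numbers:

```latex
% peak of one processor = results per clock cycle * clock rate
\mathrm{Peak}_{\mathrm{processor}}
  = \frac{\mathrm{results}}{\mathrm{cycle}} \times \frac{\mathrm{cycles}}{\mathrm{second}}
% e.g. with the Skylake figures quoted above (32 results/cycle at 4 GHz):
32 \times \left(4 \times 10^{9}\right)
  = 1.28 \times 10^{11}\ \mathrm{FLOPS} = 128\ \mathrm{GFLOPS}
% and the system peak scales by the processor count:
\mathrm{Peak}_{\mathrm{system}}
  = N_{\mathrm{processors}} \times \mathrm{Peak}_{\mathrm{processor}}
```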
For the last ten years the clock cycle has stayed the same or increased

[Figure: Supercomputer CPU clock cycle times (seconds, log scale from 1.00E-07 down to 1.00E-10) versus year, 1960-2020. Annotated: "Developers became lazy during this period" while clock cycles were shrinking, and "Developers became worried during this period" once they flattened out.]

Ways to make a computer faster
● Lower the clock cycle – increase the clock rate
● Generate more than one result each clock
  ● Single Instruction, Multiple Data – SIMD
    ● Vector instructions utilizing multiple segmented functional units
    ● Parallel instructions
  ● Multiple Instructions, Multiple Data – MIMD
  ● Data flow
● Utilize more than one processor
  ● Shared memory parallelism
    ● All processors share memory
  ● Distributed memory parallelism
    ● All nodes have separate memories, but:
    ● A node can have several processors sharing memory within the node

Vector Unit is an Assembly Line

[Figure sequence: the Cray 1 vector unit as a pipeline. On each clock cycle a new operand pair A(i), B(i) enters the segmented multiply unit, so by clock 6 the operands A(1)…A(6) and B(1)…B(6) occupy successive pipeline stages; at clock 7 the first result A(1)*B(1) emerges, and one result emerges every clock thereafter. Time to compute = startup + number of operands.]

SIMD Parallel Unit
Result rate = number of operands / number of processing elements

What are today's vector processors?
● They are not like the traditional Cray vector processors
● They are actually like the Illiac (see the sketch below)
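The two performance formulas on the slides above can be captured in a toy model. Here is a minimal C sketch comparing the pipelined vector unit (one result per clock once the pipeline fills) with a SIMD array of processing elements; the startup of 6 cycles and the operand count are illustrative stand-ins, not measured Cray 1 figures, though the Illiac IV really did have 64 PEs:

```c
#include <stdio.h>

/* Pipelined vector unit (Cray 1 style): after a fixed startup,
   one result emerges every clock cycle.
   Time (cycles) = startup + number of operands                 */
static long vector_pipeline_cycles(long n_operands, long startup)
{
    return startup + n_operands;
}

/* SIMD array (Illiac IV style): all processing elements fire
   together, so n operands take ceil(n / n_pe) steps.          */
static long simd_array_cycles(long n_operands, long n_pe)
{
    return (n_operands + n_pe - 1) / n_pe;  /* ceiling division */
}

int main(void)
{
    long n = 1000;       /* illustrative operand count          */
    long startup = 6;    /* illustrative pipeline depth         */
    long pe = 64;        /* the Illiac IV had 64 PEs            */

    printf("vector pipeline: %ld cycles\n", vector_pipeline_cycles(n, startup));
    printf("SIMD array:      %ld cycles\n", simd_array_cycles(n, pe));
    return 0;
}
```

The model makes the slides' point visible: once n dwarfs the startup, the pipeline approaches one result per clock, while the SIMD array's rate is fixed by its processing-element count.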
Outline
● What was a supercomputer in the 1960s, 70s, 80s, 90s and the 2000s?
  ● Single Instruction, Multiple Data (SIMD)
    ● Vector
    ● Parallel
  ● Multiple Instructions, Multiple Data (MIMD)
    ● Distributed memory systems
      ● Beowulf, Paragon, nCUBE, Cray T3D/T3E, IBM SP
    ● Shared memory systems
      ● SGI Origin
● What is a supercomputer today?
  ● Déjà vu – a MIMD collection of distributed nodes, each a MIMD collection of shared memory processors, with SIMD instructions
● So it is important to understand the history of supercomputing

Who is the historian – John Levesque?
● 1964 – 1965: Delivered Meads Fine Bread (student at University of New Mexico, 1962-72)
● 3/1966 – 1969: Sandia Laboratories – Underground Physics Department (CDC 3600)
● 1969 – 1972: Air Force Weapons Laboratory (CDC 6600)
● 1972 – 1978: R & D Associates (CDC 7600, Illiac IV, Cray 1, Cyber 203/205)
● 1978 – 1979: Massachusetts Computer Associates (Illiac IV, Cray 1) – Parallelizer
● 1979 – 1991: Pacific Sierra Research (Intel Paragon, nCUBE, Univac APS) – VAST
● 1991 – 1998: Applied Parallel Research (MPPs of all sorts, CM5, NEC SX) – FORGE
● 1998 – 2001: IBM Research – Director, Advanced Computing Technology Center
● 1/2001 – 9/2001: Times N Systems – Director of Software Development
● 9/2001 – present (as of 9/2021): Cray Inc – Director, Cray's Supercomputing Center of Excellence

1960s

First system I drove

Then got a job at Sandia Laboratories in Albuquerque

1970s

CDC 6600 – Multiple Functional Units
Could generate an ADD and a MULTIPLY at the same time

Computer input devices

Why was the Fortran line length 72?

File organization

Computer listing

Computational steering in the 70s
Set a sense switch, dump the velocity vector field to tape, take the tape to the Calcomp plotter.

Then in the early 70s I worked on the Illiac IV
Anyone know why the door is open?
Programming its 64 SIMD units is exactly the same as programming the 32 SIMT threads in an NVIDIA warp (see the sketch below)

Actually accessed the Illiac IV remotely
Contrary to popular belief, Al Gore did not invent the Internet – Larry Roberts of ARPA did.
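The Illiac IV/warp comparison on the slide above can be sketched in code. The following C loop is a minimal sketch of the lockstep, one-element-per-lane pattern shared by a SIMD array, a modern vector unit, and (per the slide's analogy, which the code itself does not prove) the 32 SIMT threads of an NVIDIA warp; the pragma is a standard OpenMP 4 directive:

```c
/* Elementwise product c(i) = a(i) * b(i): every lane performs the
   same operation on its own element, in lockstep -- the programming
   model shared by the Illiac IV's 64 PEs, a modern CPU vector unit,
   and (per the slide's analogy) the 32 SIMT threads of a warp.    */
void vmul(const double *a, const double *b, double *c, int n)
{
    #pragma omp simd            /* ask the compiler to vectorize */
    for (int i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}
```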
Then started working with Seymour Cray's first masterpiece – the CDC 7600
This system had a large slow memory (256K words) and a small fast memory (65K words). A similar memory architecture is used today in Intel's Knights Landing and in Summit's nodes.

Seymour's innovations in the CDC 7600
● Segmented functional units
  ● The first hint of vector hardware
● Multiple functional units capable of parallel execution
● Small main memory with a larger extended core memory

Any idea what this is?
My first laptop

Cyber 203/205

[Figure: from a paper by Neil Lincoln of CDC – the architectures of the era: the memory-to-memory vector processor (STAR-100), the scalar processor, and the SIMD parallel processor (the Illiac of the time).]

First Real Vector Machines
● Memory to memory vector processors
  ● Texas Instruments ASC
  ● STAR-100
  ● Cyber 203/205
● Register to register vector
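The list above breaks off as it introduces register-to-register vector machines. As a hedged sketch of the distinction being set up: a memory-to-memory machine like the STAR-100 streamed whole vectors from memory through the pipeline in one instruction, while a register-to-register machine like the Cray 1 strip-mines a loop into fixed-length pieces that fit its vector registers (64 elements on the Cray 1). The C below only mimics that structure on a scalar CPU; the strip length and function names are illustrative:

```c
#define VL 64   /* Cray 1 vector registers hold 64 elements */

/* Register-to-register style: process the vectors in strips that fit
   the vector registers (strip-mining); each strip corresponds to one
   vector load / multiply / store sequence on the real machine.      */
void vmul_strip_mined(const double *a, const double *b, double *c, int n)
{
    for (int strip = 0; strip < n; strip += VL) {
        int len = (n - strip < VL) ? (n - strip) : VL;  /* final partial strip */
        for (int i = 0; i < len; i++)                   /* one "vector op"     */
            c[strip + i] = a[strip + i] * b[strip + i];
    }
}

/* Memory-to-memory style (STAR-100): conceptually a single instruction
   over the entire vector, streaming operands straight from memory.    */
void vmul_memory_to_memory(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i++)   /* whole vector in one pass */
        c[i] = a[i] * b[i];
}
```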