High Performance Computing: Concepts, Methods, & Means
An Introduction

Prof. Thomas Sterling
Department of Computer Science
Louisiana State University
January 16, 2007

The Hammer of the Mind

• The Hammer
  – Mankind's 1st tool
  – In the most general case: applies a directed force to a concentrated point in our physical world to effect a desired change of state
  – Many implements of the physical world
    • Conventional means of inserting nails into wood
    • Includes knives, spears, arrows, screwdrivers, sledgehammers, axes, clubs, etc.
• Understanding
  – The "force" that drives our abstract world
  – Historically, two means by which the mind applies understanding
    • Empiricism – acquiring knowledge through experience
    • Theory – projecting beyond immediate experience to new knowledge
• Supercomputing
  – The third hammer of the mind for applying understanding
  – Explain the past
  – Predict the future
  – Control the present

Topics

• Supercomputing – the big picture
• What is a supercomputer?
• Supercomputing as a multidisciplinary field
• Challenges and Opportunities
• A brief history of supercomputing
• Overview of Course
• Segment 1 knowledge factors & skills
• Resources and rules of engagement


Applying the Force of Understanding through the Power of Supercomputing

Addressing the Big Questions

• How to integrate technology into computing engines?
• How to push the performance to extremes?
  – What are the enabling conditions?
  – What are the inhibiting factors?
• How to manage supercomputer resources to deliver useful computing capabilities?
  – What are the hardware mechanisms?
  – What are the software policies?
• How do users program such systems?
  – What languages and in what environments?
  – What are the semantics and strategies?
• What grand challenge applications demand these capabilities?
• What are the computational models and algorithms that can map the innate application properties to the physical medium of the machine?

Challenges in the Physical World Command Our Abilities in the Abstract

• Physical Sciences
• Technology
• Biology and Medical Science
• Energy
• Meteorology and Climate
• Materials and Nanotechnology
• National Security

A Growth-Factor of a Billion in Performance in a Single Lifetime

[Figure: a log-scale timeline of performance from one OPS (10^0) through KiloOPS, MegaOPS, GigaOPS, and TeraOPS to PetaOPS (10^15). Milestones include the Babbage Difference Engine (1823), Harvard Mark 1 (1943), Edsac (1949), Univac 1 (1951), IBM 7094 (1959), CDC 6600 (1964), Cray 1 (1976), Cray XMP (1982), Cray YMP (1988), Intel Delta (1991), T3E (1996), ASCI Red (1997), and the Earth Simulator (early 2000s).]

Performance: a cross-cutting issue – the Top-500 list of supercomputers

[Figure: Top-500 performance trends, 1993–2005, on a log scale from 100 Mflop/s to above 1 Pflop/s. Three curves: the sum of all 500 systems (SUM, reaching 2.3 PF/s), the #1 system (N=1, from 59.7 GF/s to 280.6 TF/s for IBM BlueGene/L), and the #500 system (N=500, starting at 0.4 GF/s and reaching 1.167 TF/s). Labeled #1 systems include the Fujitsu 'NWT' (NAL), Intel ASCI Red (Sandia), IBM ASCI White (LLNL), the NEC Earth Simulator, and IBM BlueGene/L.]

Topics

• Supercomputing – the big picture
• What is a supercomputer?
• Supercomputing as a multidisciplinary field
• Challenges and Opportunities
• A brief history of supercomputing
• Overview of Course
• Segment 1 knowledge factors & skills
• Resources and rules of engagement

Definitions: "supercomputer"

Supercomputer: a computing system exhibiting high-end performance capabilities and resource capacities within practical constraints of technology, cost, power, and reliability. (Thomas Sterling, 2007)

Supercomputer: a large very fast mainframe used especially for scientific computations. (Merriam-Webster Online)

Supercomputer: any of a class of extremely powerful computers. The term is commonly applied to the fastest high-performance systems available at any given time. Such computers are used primarily for scientific and engineering work requiring exceedingly high-speed computations. (Encyclopedia Britannica Online)

Performance

• Performance:
  – A quantifiable measure of the rate of doing (computational) work
  – Multiple such measures of performance exist:
    • Delineated at the level of the basic operation
      – ops – operations per second
      – ips – instructions per second
      – flops – floating-point operations per second
    • Rate at which a benchmark program executes
      – A carefully crafted and controlled code used to compare systems
      – Linpack Rmax (Linpack flops)
      – gups (billion updates per second)
      – others
• Two perspectives on performance (contrasted in the sketch below):
  – Peak performance
    • Maximum theoretical performance possible for a system
  – Sustained performance
    • Observed performance for a particular workload and run
    • Varies across workloads and possibly between runs
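To make the peak-versus-sustained distinction concrete, here is a minimal sketch in Python; all figures are hypothetical, not taken from any particular machine:

```python
# Hypothetical figures: a system with a 10 Tflops theoretical peak that
# sustains 6.2 Tflops on one particular benchmark run. Real values come
# from hardware specifications (peak) and measurement (sustained).
peak_flops = 10.0e12       # maximum theoretical performance
sustained_flops = 6.2e12   # observed for this workload and run

efficiency = sustained_flops / peak_flops
print(f"Efficiency: {efficiency:.0%}")  # 62%; varies by workload and run
```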

Key Parameters

• Peak floating-point performance
• Main memory capacity
• Bisection bandwidth
• I/O bandwidth
• Secondary storage capacity
• Organization
  – Class of system
  – # nodes
  – # processors per node
  – Accelerators
  – Network topology
• Control strategy
  – MIMD
  – Vector, PVP
  – SIMD
  – SPMD

Scalability

• The ability to deliver proportionally greater sustained performance through increased system resources (the two regimes are contrasted in the sketch below)
• Strict scaling
  – Fixed-size application problem
  – Application size remains constant with increase in system size
• Weak scaling
  – Variable-size application problem
  – Application size scales proportionally with system size
• Capability computing
  – In its purest form: strict scaling
  – Marketing claims tend toward this class
• Capacity computing
  – Throughput computing
  – Includes job-stream workloads
  – In its simplest form: weak scaling
• Cooperative computing
  – Interacting and coordinating concurrent processes
  – Not a widely used term
  – Also: coordinated computing
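A minimal sketch contrasting the two scaling regimes; the problem sizes are invented for illustration:

```python
# Strict scaling: total application size fixed; work per processor shrinks.
# Weak scaling: size per processor fixed; total application size grows.
base_size = 1_000_000  # hypothetical application size on one processor

for procs in (1, 4, 16, 64):
    strict_per_proc = base_size // procs  # strict: constant total
    weak_total = base_size * procs        # weak: grows with the system
    print(f"{procs:3d} procs | strict: {strict_per_proc:8d} per proc"
          f" | weak: {weak_total:10d} total")
```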

Practical Constraints and Limitations

• Cost
  – Deployment
  – Operational support
• Power
  – Energy required to run the computer
  – Energy for support facilities
  – Energy for cooling (removing heat from the machine)
• Size
  – Floor space
  – Access ways for power and signal cabling
• Reliability
  – One factor of availability
• Generality
  – How good is the system across a range of problems?
• Usability
  – How hard is the system to program and manage?

Productivity – a computing metric of merit

• A rich measure of merit for computing
  – Captures key factors that determine overall impact
  – Exposes the relationship between program development and program execution
  – Supersedes mere scalar parameters of assumed performance
• Focuses attention on all (most) important contributors to overall effectiveness
• Permits cogent comparative assessment of alternative system classes
• Devised as part of the DARPA HPCS Program Phase 1
  – T. Sterling
  – M. Snir
  – B. Smith
  – and others

Productivity Factors Directed Graph

[Figure: directed graph of productivity factors. Nodes include peak performance ($S_P$, $C_M$), performance efficiency (E), availability (A) fed by reliability and accessibility, application construction ($C_S$) fed by programmability, portability, and maintainability, and overall productivity (Ψ).]

General Model of Productivity

• $R_i$ ≡ the $i$th result product
• $T_i$ ≡ time to compute result $R_i$; working time of the machine: $T_R = \sum_i T_i$
• $T_V$ ≡ total overhead time of the machine
• $T_Q$ ≡ quiescent time of the machine
• Total lifetime of the machine: $T_L = T_R + T_V + T_Q$
• $N_R$ ≡ total number of result products during $T_L$; total result product: $R_L = \sum_{i=1}^{N_R} R_i$
• $C_{Si}$ ≡ cost of application software for result $R_i$; application software costs during $T_L$: $C_{LS} = \sum_{i=1}^{N_R} C_{Si}$
• $C_M$ ≡ cost of procurement and initial installation
• $C_{LO}$ ≡ costs of ownership during $T_L$
• All costs associated with the machine during $T_L$: $C_L = C_{LS} + C_M + C_{LO}$
• Productivity: $\Psi = \dfrac{R_L}{C_L \times T_L}$ (a toy worked example follows below)
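A toy worked example, plugging invented numbers into the definitions above; nothing here comes from the lecture:

```python
# All numbers invented for illustration.
T = [100.0, 200.0, 300.0]    # T_i: hours to compute each result R_i
R = [1.0, 1.0, 1.0]          # R_i: value of each result product
C_S = [10e3, 20e3, 15e3]     # C_Si: software cost per result ($)

T_R = sum(T)                 # working time of the machine
T_V, T_Q = 50.0, 150.0       # overhead and quiescent time (hours)
T_L = T_R + T_V + T_Q        # total lifetime considered

C_L = sum(C_S) + 500e3 + 100e3   # C_LS + C_M + C_LO ($)
R_L = sum(R)

psi = R_L / (C_L * T_L)      # productivity
print(f"Psi = {psi:.2e} results per dollar-hour")
```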

Topics

• Supercomputing – the big picture
• What is a supercomputer?
• Supercomputing: a multidisciplinary field
• Challenges and Opportunities
• A brief history of supercomputing
• Overview of Course
• Segment 1 knowledge factors & skills
• Resources and rules of engagement

Related Fields

• Hardware
  – Device technologies
  – Logic circuit designs
  – Architecture
• Software
  – System software
  – Programming methodologies
• End-user application problems
  – Problem area disciplines
  – Computational algorithms
• Cross-cutting issues
  – Performance
  – Products and market drivers
  – People
  – Packaging: cost, space, power, reliability

Supercomputing: A Discipline of Disciplines (in this course)

• Device technologies
  – Enabling technologies for logic, memory, & communication
  – Circuit design
• Computer architecture
  – Semantics and structures
• Programming
  – Languages, tools, & environments
• Models of computation
  – Governing principles
• Compilers and runtime software
  – Map the application program to system resources, mechanisms, and semantics
• Operating systems
  – Manage resources and provide a virtual machine
• Performance
  – Modeling, measurement, benchmarking, and debugging
• Algorithms
  – Numerical techniques
  – Means of exposing parallelism
• Applications
  – End-user problems, often in the sciences and technology

Topics

• Supercomputing – the big picture
• What is a supercomputer?
• Supercomputing as a multidisciplinary field
• Challenges and Opportunities
• A brief history of supercomputing
• Overview of Course
• Segment 1 knowledge factors & skills
• Resources and rules of engagement

Where Does Performance Come From?

• Device Technology
  – Logic switching speed and device density
  – Memory capacity and access time
  – Communications bandwidth and latency
• Computer Architecture
  – Instruction issue rate
    • Execution pipelining
    • Reservation stations
    • Branch prediction
    • Cache management
  – Parallelism (the levels below multiply together; see the sketch after this list)
    • Parallelism – number of operations per cycle per processor
      – Instruction-level parallelism (ILP)
      – Vector processing
    • Parallelism – number of processors per node
    • Parallelism – number of nodes in a system
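A minimal sketch of how the parallelism levels and the clock rate multiply into a theoretical peak; every parameter value is invented:

```python
# All parameter values invented for illustration.
clock_hz = 2.0e9       # logic switching speed (device technology)
ops_per_cycle = 4      # ILP and vector parallelism within one processor
procs_per_node = 2     # processors per node
nodes = 1024           # nodes in the system

peak = clock_hz * ops_per_cycle * procs_per_node * nodes
print(f"Theoretical peak: {peak / 1e12:.1f} Tflops")  # 16.4 Tflops
```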

Moore's Law

Microprocessor Clock Speed

Classes of Architecture for High Performance Computers

• Parallel Vector Processors (PVP)
  – NEC Earth Simulator, SX-6
  – Cray 1, 2, XMP, YMP, C90, T90, X1
  – Fujitsu 5000 series
• Massively Parallel Processors (MPP)
  – Intel Touchstone Delta & Paragon
  – TMC CM-5
  – IBM SP-2 & 3, Blue Gene/L
  – Cray T3D, T3E, Strider
• Distributed Shared Memory (DSM)
  – SGI Origin
  – HP Superdome
• Single Instruction stream, Multiple Data stream (SIMD)
  – Goodyear MPP, MasPar 1 & 2, TMC CM-2
• Commodity Clusters
  – Beowulf-class PC/Linux clusters
  – Constellations
  – HP Compaq SC, Linux NetworX MCR

Why Fast Machines Run Slow

• Latency
  – Waiting for access to memory or other parts of the system
• Overhead
  – Extra work that has to be done to manage program concurrency and parallel resources beyond the real work you want to perform
• Starvation
  – Not enough work to do, due to insufficient parallelism or poor load balancing among distributed resources
• Contention
  – Delays due to fighting over which task gets to use a shared resource next; network bandwidth is a major constraint (a toy model combining all four factors follows below)
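The sketch below, my own illustration with invented numbers rather than a model from the lecture, treats the four factors as additive time penalties on an otherwise ideal run:

```python
# All values invented for illustration.
t_ideal = 10.0        # seconds if the hardware ran at peak with no waste

t_latency = 2.0       # waiting on memory and remote accesses
t_overhead = 1.5      # managing concurrency, not the real work
t_starvation = 3.0    # idle time from insufficient parallelism / imbalance
t_contention = 1.0    # waiting for shared resources (e.g., the network)

t_actual = t_ideal + t_latency + t_overhead + t_starvation + t_contention
print(f"Sustained fraction of peak: {t_ideal / t_actual:.0%}")  # ~57%
```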

The SIA ITRS Roadmap

[Figure: the SIA ITRS roadmap, 1997–2012: MB per DRAM chip, logic transistors per chip (millions), and microprocessor clock (MHz), plotted on a log scale against year of technology availability.]

Latency in a Single System

[Figure: memory access time and CPU clock period (ns), 1997–2009, plotted with the ratio of memory access time to CPU clock period on a log scale; the growing ratio is labeled "THE WALL".]

Microprocessors no longer realize the full potential of VLSI technology

[Figure: processor performance in ps per instruction, 1980–2020, on a log scale. The historical 52%/year improvement trend diverges from slower trends (19%/year, 4%/year), opening gaps of 30:1, 1,000:1, and 30,000:1 between the linear extrapolation and realized performance.]

Driving Issues/Trends

• Multicore
  – Now 2 cores, possibly 100's soon; will become million-way parallelism
• Heterogeneity
  – GPU
  – Clearspeed
  – Cell SPE
• Component I/O pins
  – Off-chip bandwidth not increasing with demand
    • Limited number of pins
    • Limited bandwidth per pin (pair)
  – Cache size per core may decline
  – Shared cache fragmentation
• System interconnect
  – Node bandwidth not increasing proportionally to core demand
• Power
  – Megawatts at the high end = millions of dollars per year

Topics

• Supercomputing – the big picture
• What is a supercomputer?
• Supercomputing as a multidisciplinary field
• Challenges and Opportunities
• A brief history of supercomputing
• Overview of Course
• Segment 1 knowledge factors & skills
• Resources and rules of engagement

A Myth of Precedence

• It is assumed by most that:
  – Computers preceded supercomputers
  – Supercomputers emerged as a special-purpose case
• Definition:
  – A supercomputer is a machine that greatly accelerates the rate of calculation with respect to the alternative conventional means of its time
• Contrary to popular belief:
  – The 1st computers were supercomputers
  – Supercomputers are general purpose
  – Mainstream computers were special-purpose data processing systems

5. A Brief History of Supercomputing

• Mechanical Computing (5.1)
  – Babbage, Hollerith, Aiken
• Electronic Digital Calculating (5.2)
  – Atanasoff, Eckert, Mauchly
• von Neumann Architecture (5.3)
  – Turing, von Neumann, Eckert, Mauchly, Foster, Wilkes
• Semiconductor Technologies (5.4)
• Birth of the Supercomputer (5.5)
  – Cray, Watanabe
• The Golden Age (5.6)
  – Batcher, Dennis, S. Chen, Hillis, Dally, Blank, B. Smith
• Common Era of Killer Micros (5.7)
  – Scott, Culler, Sterling/Becker, Goodhue, A. Chen, Tomkins
• Petaflops (5.8)
  – Messina, Sterling, Stevens, P. Smith

Synergy Drives Supercomputing Evolution

• Technology
  – Enables digital computing
  – Defines the balance of capabilities
  – Establishes the relationship of relative costs
• Architecture
  – Creates the interface between computation and technology
  – Determines the structures of technology-based components
  – Establishes low-level semantics of operation
  – Provides low-cost mechanisms
• Model of Computation
  – The paradigm by which computation is manifest
  – Provides the governing principles of architecture operation
  – Implies programming models and languages

Historical Trends are a Consequence of this Interplay

• Technology evolves as new fabrication methods, processes, and materials emerge through industrial research
• New components replace old ones, but with different operational properties and support requirements
• Innovations in system structure are developed to exploit the strengths of new components, compensate for their relative weaknesses, and meet their requirements
• When old architecture classes fail to fully exploit advancing technology, a new model of computation is adopted to provide a better conceptual framework

Major Technology Generations (dates approximate)

• Electromechanical – 19th century through the 1st half of the 20th century
• Digital electronic with vacuum tubes – 1940s
• Core memory – 1950
• Transistors – 1947
• SSI & MSI RTL/DTL/TTL semiconductor – 1970
• DRAM – 1970s
• CMOS VLSI – 1990

Supercomputer Points of Transition

• Automated calculating – 17th century
• Stored-program digital electronic – 1948
• Vector – 1975
• SIMD – 1980s
• MPPs – 1991
• Commodity clusters – 1993/4

Historical Machines

• Leibniz Stepped Reckoner
• Babbage Difference Engine
• Hollerith Tabulator
• Harvard Mark 1
• University of Pennsylvania ENIAC
• Cambridge EDSAC
• MIT Whirlwind
• Cray 1
• TMC CM-2
• Intel Touchstone Delta
• Beowulf
• IBM Blue Gene/L

ENIAC (Electronic Numerical Integrator and Computer)

• Eckert and Mauchly, 1946.
• Vacuum tubes.
• Numerical solutions to problems in fields such as atomic energy and ballistic trajectories.

EDSAC (Electronic Delay Storage Automatic Calculator)

• Maurice Wilkes, 1949.
• Mercury delay lines for memory; vacuum tubes for logic.
• Used one of the first assemblers, called Initial Orders.
• Calculation of prime numbers, solutions of algebraic equations, etc.

MIT Whirlwind

• Jay Forrester, 1949.
• The fastest computer of its day.
• First computer to use magnetic core memory.
• Displayed real-time text and graphics on a large oscilloscope screen.

CRAY-1

• Cray Research, 1976.
• Pipelined vector arithmetic units.
• Unique C-shape keeps wire lengths short, increasing signal speeds from one end to the other.

CM-2

• Thinking Machines Corporation, 1987.
• Hypercube architecture with 65,536 processors.
• SIMD.
• Performance in the range of GFLOPS.

Intel Touchstone Delta

• Intel, 1990.
• MIMD hypercube.
• LINPACK rating of 13.9 GFLOPS.
• Enough computing power for applications like real-time processing of satellite images and molecular models for AIDS research.

Beowulf

• Thomas Sterling and Donald Becker, 1994.
• Cluster formed of one head node and one or more compute nodes.
• Nodes and network dedicated to the Beowulf.
• Compute nodes are mass-produced commodities.
• Uses open-source software, including Linux.

Beowulf Project

• Wiglaf (1994)
  – 16 Intel 80486, 100 MHz
  – VESA Local bus
  – 256 Mbytes memory
  – 6.4 Gbytes of disk
  – Dual 10 base-T Ethernet
  – 72 Mflops sustained
  – $40K
• Hrothgar (1995)
  – 16 Intel Pentium, 100 MHz
  – PCI
  – 1 Gbyte memory
  – 6.4 Gbytes of disk
  – 100 base-T Fast Ethernet (hub)
  – 240 Mflops sustained
  – $46K
• Hyglac (1996, Caltech)
  – 16 Pentium Pro, 200 MHz
  – PCI
  – 2 Gbytes memory
  – 49.6 Gbytes of disk
  – 100 base-T Fast Ethernet (switch)
  – 1.25 Gflops sustained
  – $50K

Earth Simulator

• Japan; project begun 1997, operational 2002.
• Fastest supercomputer from 2002 to 2004: 35.86 TFLOPS.
• 640 nodes, each with eight vector processors and 16 gigabytes of memory.

BlueGene/L

• IBM, 2004.
• Fastest supercomputer at the time of this lecture: 207.3 TFLOPS.
• First supercomputer ever to sustain over 100 TFLOPS on a real-world application, a three-dimensional molecular dynamics code (ddcMD).

Events in Supercomputing

• Fortran compiler
  – Greatly simplified the creation of complex application programs
• Parallel processing
  – Enables more than one action to occur at a time
• Pipeline structures
  – Increase clock rate and efficient use of resources
• MPI
  – Universally adopted parallel programming model
• Ethernet
  – Low-cost interconnection network
• Linpack
  – Most widely recognized benchmark for comparative study
• Visualization
  – Facilitates interpretation of large data sets
• Weak scaling
  – Dramatic increase in the scalability of systems and achievable performance
• Beowulf-class commodity clusters
  – Exploitation of economy of scale for significant improvement of performance to cost
  – Also: NOW – networks of workstations

Topics

• Supercomputing – the big picture
• Supercomputing as a multidisciplinary field
• What is a supercomputer?
• Challenges and Opportunities
• A brief history of supercomputing
• Overview of Course
• Segment 1 knowledge factors & skills
• Resources and rules of engagement

A New HPC Course

• An introduction to all aspects of supercomputing
• In collaboration:
  – Louisiana State University
  – University of Arkansas
  – Louisiana Technical University
  – Masaryk University, Czech Republic
  – MCNC, North Carolina
• Greatly expands student accessibility
  – Easier to learn
  – Available to students outside the mainstream
• Multimedia
  – Hands-on, interactive
  – Easily accessible for review/study
• High-definition video over the Internet
  – Specialized expertise available to the general community
  – Precision presentation for an enhanced learning experience

Goals of the Course

• A first overview of the entire field of HPC
• Basic concepts that govern the capability and effectiveness of supercomputers
• Techniques and methods for applying HPC systems
• Tools and environments that facilitate effective application of supercomputers
• Hands-on experience with widely used systems and software
• Performance measurement methods, benchmarks, and metrics
• Practical real-world knowledge about the HPC community
• Access by students outside the HPC mainstream

A Precursor to Future Pursuits

• Understand concepts and challenges
  – For possible future research
  – For advanced graduate studies
• Basic methods of using and programming a supercomputer
  – For future computational scientists
• Managing HPC systems
  – For future systems administrators
• HPC system structures and engineering
  – For future system designers and developers

Technology Strategy

• Interdisciplinary
  – Device technology and parallel computer architecture
  – Parallel programming models, languages, and tools
  – System software for resource management
  – Applications and algorithms
• Website-managed
  – Lecture notes and source material
  – Problem sets
• Video
  – On-demand streaming of class lectures
  – Additional sidebar material for expanded understanding
  – Subtitles for hearing-impaired and non-native speakers
• Hands-on examples
• Performance sensitivity and measurement as cross-cutting themes that interrelate the disciplines

Course Overview: Divided into 7 Segments

• S1: Introduction & Clusters
• S2: Architecture and Nodes
• S3: MPI
• S4: Enabling Technologies
• S5: System Software
• S6: Advanced Techniques
• S7: Conclusions

Course Overview: in 7 Segments

• Introduction
  – An overview
  – Commodity clusters
  – Benchmarking
  – Throughput computing
• Architecture and Nodes
  – Parallel computer architecture
  – Single node architecture
  – Parallel thread computing
  – OpenMP programming
  – Performance factors and measurement (1)
• MPI
  – Communicating sequential processes (CSP)
  – MPI programming
  – Performance measurement (2)
  – Parallel algorithms
• Enabling Technologies
  – Device technologies
  – System area networks
• System Software
  – Operating systems
  – Schedulers and middleware
  – Parallel file I/O
• Advanced Techniques
  – Libraries
  – Visualization
  – Domain-specific environments and frameworks
  – Applications
• Conclusions
  – What's beyond the scope of this course
  – What form will the future of HPC take

Topics

• Supercomputing – the big picture
• Supercomputing as a multidisciplinary field
• What is a supercomputer?
• Challenges and Opportunities
• A brief history of supercomputing
• Overview of Course
• Segment 1 knowledge factors & skills
• Resources and rules of engagement

Segment 1: Clusters Skill Set

• Log in and establish control of cluster resources
• Determine the state of system resources and manipulate them
• Acquire, run, and measure benchmark performance
• Launch and run user application codes
• Collect ensemble result data using OS tools
• Start up and apply Condor for performing concurrent jobs (see the sketch below)
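As a preview of the Condor skill above, here is a minimal sketch of a Condor submit description file; the executable and file names are hypothetical, and a vanilla-universe Condor pool is assumed:

```
# bench.sub -- a minimal Condor submit description file
# (executable and file names are hypothetical)
universe   = vanilla
executable = my_benchmark
output     = bench.$(Process).out
error      = bench.$(Process).err
log        = bench.log
# submit four instances of the job
queue 4
```

It would be submitted with condor_submit bench.sub, and condor_q then shows the jobs waiting in the queue.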

Segment 1: Clusters Knowledge Factors

• Overview of the multidisciplinary field of HPC
• Commodity cluster components and hardware/software architecture
• Performance factors
• Benchmarking and metrics
• Throughput computing and Condor programming
• History driven by the interplay among technology, architecture, and programming models
• Top 500 List

Topics

• Supercomputing – the big picture
• Supercomputing as a multidisciplinary field
• What is a supercomputer?
• Challenges to supercomputing
• A brief history of supercomputing
• Overview of Course
• Segment 1 knowledge factors & skills
• Resources and rules of engagement

Course Website

• The HPC course website can be accessed at: http://www.cct.lsu.edu/csc7600
• Course info:
  – Syllabus
  – Schedule
• Contact information (People section): email, IM, phone, etc.
• All course announcements will be made via email and the website.
• Lecture slides will be made available on the course website (Course Material section).
• Videos of lectures will be made available on the course website (Course Material section) after every lecture.

Contact Information

Prof. Thomas Sterling
[email protected]
(225) 578-8982 (CCT Office)
Coates Hall Room 284, (225) 578-3320
Office Hours: Tu – Th, 12:30 – 3:00 PM

Teaching Assistant: Chirag Dekate
[email protected]
(225) 578-8930
Office Hours: Tu – Th, 12:30 – 3:00 PM (Coates 284)
AIM / Yahoo / gTalk: cdekate

Course Secretary: Ms. Terri Bordelon
[email protected]
302 Johnston Hall
(225) 578-5979

Grading Policy

Assignments

• Segments 1–6 (inclusive) will have prescribed problem sets.
• Students are required to turn in the problem sets no later than their due dates.
• Cumulatively, these problem sets account for 20% of the overall grade for graduate students (30% for undergraduates).
• IMPORTANT:
  – Most of the assignments will need to be run on local supercomputing resources that are shared among several users.
  – Jobs that you submit WILL get stuck in a queue.
  – "The queue ate my homework" is not an acceptable excuse for late work.
  – You are strongly encouraged to start work on assignments as soon as they are assigned, to avoid the inevitable queue wait times.

Schedule


Reference Material

• No required textbook.
• Lecture notes (slides), required reading lists (URLs) provided at the end of lectures, additional notes on the website, and assignments will be the primary sources of material for exams.
• Students are strongly encouraged to pursue additional reading material available on the Internet (and as part of projects).

Compute Resources

Plagiarism

• The LSU Code of Student Conduct defines plagiarism in Section 5.1.16:
  – "Plagiarism is defined as the unacknowledged inclusion of someone else's words, structure, ideas, or data. When a student submits work as his/her own that includes the words, structure, ideas, or data of others, the source of this information must be acknowledged through complete, accurate, and specific references, and, if verbatim statements are included, through quotation marks as well. Failure to identify any source (including interviews, surveys, etc.), published in any medium (including on the internet) or unpublished, from which words, structure, ideas, or data have been taken, constitutes plagiarism."
• Plagiarism will not be tolerated and will be dealt with in accordance with, and as outlined by, the LSU Code of Student Conduct:

http://appl003.lsu.edu/slas/dos.nsf/$Content/Code+of+Conduct?OpenDocument

Summary

• The history of supercomputing spans a performance gain of more than a billion within a single lifetime
• Performance is achieved through:
  – Technology: clock rate (logic switching speed) and density
  – Parallelism, through architecture and computing models
  – Algorithms, programming languages, and tools
• Performance is degraded by:
  – Latency, overhead, contention, and starvation
  – Cost, power consumption, and size
  – Programming difficulties
• Productivity considers all aspects of user goals
