High Performance Computing Technologies

Jack Dongarra
University of Tennessee
Oak Ridge National Laboratory
http://www.netlib.org/utk/people/JackDongarra/


My Group in Tennessee

- Numerical Linear Algebra
  - Basic algorithms for HPC
  - EISPACK, LINPACK, BLAS, LAPACK, ScaLAPACK
- Heterogeneous Network Computing
  - PVM
  - MPI
- Software Repositories
  - Netlib
  - High-Performance Software Exchange
- Performance Evaluation
  - Linpack Benchmark, Top500
  - ParkBench

Computational Science

- HPC offered a new way to do science:
  - Experiment
  - Theory
  - Computation
- Computation used to approximate physical systems
- Advantages include:
  - playing with simulation parameters to study emergent trends
  - possible replay of a particular simulation event
  - study of systems where no exact theories exist


Why Turn to Simulation? ... Too Large

- Climate/Weather Modelling
- Data intensive problems: data-mining, oil reservoir simulation
- Problems with large length and time scales: cosmology

Automotive Industry

- Huge users of HPC technology: Ford US is the 25th largest user of HPC in the world
- Main uses of simulation:
  - aerodynamics (similar to aerospace industry)
  - crash simulation
  - metal sheet forming
  - noise/vibrational optimization
  - traffic simulation
- Main gains:
  - reduced time to market of new cars;
  - increased quality;
  - reduced need to build expensive prototypes;
  - more efficient & integrated manufacturing processes


Why Parallel Computers?

- Desire to solve bigger, more realistic application problems.
- Fundamental limits are being approached.
- More cost-effective solution.

Example: Weather Prediction (Navier-Stokes with a 3D grid around the Earth)

- 6 variables per cell: temperature, pressure, humidity, 3 wind velocity components
- 1 kilometer cells
- 10 slices -> 5 x 10^9 cells
- 8 bytes per variable per cell: about 2 x 10^11 Bytes = 200 GBytes
- 100 flops performed per cell per time step
- 1 minute time step

  (100 flops/cell x 5 x 10^9 cells) / (1 min x 60 sec/min) = 8 GFlop/s

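The back-of-the-envelope numbers above are easy to reproduce. The following is a minimal C sketch of the same calculation; the figures (5 x 10^9 cells, 6 variables of 8 bytes each, 100 flops per cell, a 60-second time step) come from the slide, while the variable names and the reading of "8 bytes" as 8 bytes per variable are assumptions made here for illustration.

```c
#include <stdio.h>

/* Back-of-the-envelope sizing for the weather example above.
 * The inputs are the slide's figures; the names are illustrative. */
int main(void)
{
    double cells          = 5.0e9;   /* 1 km cells, 10 vertical slices        */
    double variables      = 6.0;     /* temperature, pressure, humidity, wind */
    double bytes_per_var  = 8.0;     /* assumed: 8 bytes per variable         */
    double flops_per_cell = 100.0;   /* work per cell per time step           */
    double step_seconds   = 60.0;    /* 1 minute time step                    */

    /* ~2.4e11 bytes, which the slide rounds to 2 x 10^11 bytes = 200 GBytes */
    double memory_bytes = cells * variables * bytes_per_var;

    /* ~8.3e9 flop/s, i.e. roughly the 8 GFlop/s quoted on the slide */
    double flop_rate = cells * flops_per_cell / step_seconds;

    printf("memory        ~ %.0f GBytes\n", memory_bytes / 1.0e9);
    printf("required rate ~ %.1f GFlop/s\n", flop_rate / 1.0e9);
    return 0;
}
```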

GC Computing Requirements


Grand Challenge Science

- US Office of Science and Technology Policy
- Some definitions: A Grand Challenge is a fundamental problem in science or engineering, with potentially broad economic, political and/or scientific impact, that could be advanced by applying High Performance Computing resources
- The Grand Challenges of High Performance Computing are those projects which are almost too difficult to investigate using current supercomputers!

GC Summary

- Computational science is a relatively new method of investigating the world
- The current generation of high performance computers is making an impact in many areas of science
- New Grand Challenges appearing, e.g., global modeling, computational geography
- Users still want more power!
- ... and all this applies to HPC in business
- Maybe the problems in computational science are not so different from those in business ...?


High-Performance Computing Today

- In the past decade, the world has experienced one of the most exciting periods in computer development
- Computer performance improvements have been dramatic - a trend that promises to continue for the next several years.
- One reason for the improved performance is the rapid advance in microprocessor technology.
- Microprocessors have become smaller, denser, and more powerful.
- If cars had made equal progress, you could buy a car for a few dollars, drive it across the country in a few minutes, and "park" the car in your pocket!
- The result is that microprocessor-based supercomputing is rapidly becoming the technology of preference in attacking some of the most important problems of science and engineering.

Growth in Performance in the 1990's

[Figure: performance in Mflop/s versus year, 1980-1994, for floating-point hardware from the 8087, 80287, 80387, and 6881 coprocessors through the R2000, i860, RS 6000/540, RS 6000/590, R8000, and Alpha microprocessors, alongside the Cray 1S, X-MP, Y-MP, Cray 2, C-90, and T-90.]


TOP500 - CPU Technology

[Figure: number of TOP500 systems per list, 6/93 to 11/95, by processor technology: ECL, proprietary CMOS, and off-the-shelf CMOS. Source: Universität Mannheim.]

Scalable Multiprocessors

What is Required?
- Must scale the local memory bandwidth linearly.
- Must scale the global interprocessor communication bandwidth.
- Scaling memory bandwidth cost-effectively requires separate, distributed memories.
- Cost-effectiveness also requires the best price-performance in individual processors.

What we get
- Compelling Price/Performance
- Tremendous scalability
- Tolerable entry price
- Tackle intractable problems


The Maturation of Highly Parallel Technology

- Affordable parallel systems now out-perform the best conventional supercomputers.
- Performance per dollar is particularly favorable.
- The field is thinning to a few very capable systems.
- Reliability is greatly improved.
- Third-party scientific and engineering applications are appearing.
- Business applications are appearing.
- Commercial customers, not just research labs, are acquiring systems.

Cray v Cray

- Cray Research Inc. v Cray Computer Company
- CRI: founded by Seymour Cray in 1972, the father of the supercomputer
- Business based on vector supercomputers & later MPP
  - Cray 1 '76, X-MP '82, Y-MP '87, C90 '92, J90 '93, T90 '95, ...
  - Cray 1 '76, Cray 2 '85, Cray 3?
  - T3D '94, T3E '96, ...
- Seymour Cray left to form CCC in 1989 to develop exotic processor technology (Cray 3)
- 1994: CCC went bust
- 1995: CRI returned to profit + huge order backlog


Silicon Graphics Inc. (SGI)

- The new kids on the block ...
- Founded in 1981 as a Stanford University spin-out
- Sales originally based on graphics workstations
  - graphics done in hardware
  - exception to the rule of custom-built chips being less cost-effective than general-purpose processors running software
- All machines use mass-produced processors from MIPS Computer Systems (now an SGI subsidiary)
- Aggressively marketed

SGI Today

- No longer just biding their time
- New markets: move away from graphics workstations to general-purpose HPC: introduction of parallelism
- Current: POWER CHALLENGE
- Aim: sell affordable / accessible / entry-level / scalable HPC
- Market position: 23% of machines in the "Top 500" list
- Interesting asides:
  - MIPS announce a deal to supply processors for the next generation of Nintendo machines: HPC feeding into the mainstream
  - Feb. 26, 1996: SGI buy 75% of CRI stock: low-end HPC having a strong influence on high-end HPC


The Giants

- IBM: released the SP2 in 1994, based on workstation chips
  - Market position: 21% of machines in the "Top 500" list
- DEC: Memory Channel architecture released in 1994, building on networking and workstation processor experience
  - Market position: 3% of machines in the "Top 500" list
- Intel: early experiences with hypercube machines 1982-90; 1995: won the contract for the US Government "Teraflops machine"
  - Market position: 5% of machines in the "Top 500" list
- HP Convex: HP bought Convex in 1994, to bring together workstation knowledge & HPC
  - Market position: 4% of machines in the "Top 500" list
- Others: Fujitsu 7%, NEC 8%, Hitachi 3%, Tera, Meiko 2%
- ... but how many of them are making a profit in MPP systems?

Scientific Computing: 1986 vs. 1996

- 1986:
  1. Minisupercomputers (1 - 20 Mflop/s): Alliant, Convex, DEC.
  2. Parallel vector processors (PVP) (20 - 2000 Mflop/s): CRI, CDC, IBM.
- 1996:
  1. PCs (200 Mflop/s): Intel Pentium Pro
  2. RISC workstations (10 - 1000 Mflop/s): DEC, HP, IBM, SGI, Sun.
  3. RISC-based symmetric multiprocessors (SMP) (0.5 - 15 Gflop/s): HP-Convex, DEC, and SGI-CRI.
  4. Parallel vector processors (1 - 250 Gflop/s): SGI-CRI, Fujitsu, and NEC.
  5. Highly parallel processors (1 - 250 Gflop/s): HP-Convex, SGI-CRI, Fujitsu, IBM, NEC, Hitachi


[Figure: peak performance in Flops, 1950-2000, from ENIAC, Mark I, UNIVAC, LARC, IBM 704, Stretch, and the CDC 1604/6600/7600 through ILLIAC IV, the Cray-1, X-MP, Y-MP, Cray-2, and C-90, CM-2, and the Intel Delta and Paragon, approaching a Teraflop; scalar, vector, multiprocessor, and massively parallel eras built from relays, vacuum tubes, transistors, and integrated circuits.]

[Figure: Linpack-HPC Gflop/s, solving a system of dense linear equations (Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory), 1980-1995: from the Cray 1, Cray X-MP, and Cray Y-MP through the NEC SX-2 and SX-3, Fujitsu VP-2600 and VPP-500, TMC CM-2 and CM-5, and Intel Delta and Paragon (6788 proc), up to the Hitachi CP-PACS (2048 proc, 360 Gflop/s).]

[Figure: Performance Improvement for Scientific Computing Problems, 1970-2000: speed-up factor derived from hardware (vector supercomputers) and speed-up factor derived from computational methods (sparse Gaussian elimination, Gauss-Seidel, successive over-relaxation, conjugate gradient, multi-grid).]

Department of Energy's Accelerated Strategic Computing Initiative

- 5-year, $1B program designed to deliver tera-scale computing capability.
- "Stockpile Stewardship" - safe and reliable maintenance of the nation's nuclear arsenal in the absence of nuclear testing.
- Advanced computations, specifically 3-D modeling and simulation capability, are viewed as the backbone of "stockpile stewardship".
- 5 generations of HPC will be delivered over the lifetime of the program.
- First machine is a single massively parallel 1.8 Tflop/s computer: an Intel Paragon based on 9000 200 Mflop/s Pentium processors, delivered to Sandia Labs by the end of 1996.
- Second machine is a $93M system from IBM consisting of clusters of shared-memory processors; the 3 Tflop/s system is scheduled for demonstration in December 1998 at LLNL.
- Third machine is a NUMA system from SGI-CRI, scheduled for LANL.
- Remaining two machines will deliver capability in the 30- and 100-Tflop/s range.


Virtual Environments

- When the number crunchers finish crunching, the user is faced with the mammoth task of making sense of the data. As visualization and computation become ever more closely coupled, new environments for scientific discovery emerge: virtual environments.

  [A screenful of raw simulation output follows on the slide - row after row of numbers such as
   0.32E-08 0.00E+00 0.00E+00 0.00E+00 0.38E-06 0.13E-05 0.22E-05 0.33E-05 0.59E-05 0.11E-04 ...]

  Do they make any sense?

Alternative Supercomputing Resources

- Vast numbers of under-utilized workstations available to use.
- Huge numbers of unused processor cycles and resources that could be put to good use in a wide variety of application areas.
- Reluctance to buy supercomputers due to their cost and short life span.
- Distributed computer resources "fit" better into today's funding model.

MIMD, multicomputer: networked workstations

- Enabling software technology: PVM (Parallel Virtual Machine), available from [email protected]
- Enabling software technology: MPI (Message Passing Interface), available from [email protected] (see the sketch below)
- Very active research area; about 150 software products; catalog available (NHSE)
- Enabling hardware technology: high-bandwidth interconnect is not here yet;
  - Ethernet: msec latencies and 100's of Kbyte/sec bandwidth are insufficient
  - other technology is on the verge of becoming available: HIPPI products, Fibre Channel, ATM.


THE METACOMPUTER: ONE FROM MANY

- Birth of a Concept
- The term "metacomputing" was coined around 1987 by NCSA Director Larry Smarr. But the genesis of metacomputing took place years earlier.
- The goal for the research community was to provide a "Seamless Web" linking the user interface on the workstation and supercomputers.

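Since MPI is named above as an enabling software technology for a networked-workstation metacomputer, here is a minimal sketch of what an MPI-1 message-passing program looks like. It assumes only a standard MPI installation; the echoed value and the process layout are made up for illustration.

```c
/* Minimal MPI-1 sketch: rank 0 sends one double to every other rank,
 * and each rank echoes it back. Compile with an MPI C compiler (e.g. mpicc)
 * and launch across workstations with the installation's mpirun. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, p;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        for (p = 1; p < size; p++) {
            double out = 3.14, in = 0.0;
            MPI_Send(&out, 1, MPI_DOUBLE, p, 0, MPI_COMM_WORLD);
            MPI_Recv(&in, 1, MPI_DOUBLE, p, 1, MPI_COMM_WORLD, &status);
            printf("echo from rank %d: %g\n", p, in);
        }
    } else {
        double buf = 0.0;
        MPI_Recv(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Send(&buf, 1, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```

Even this toy exchange shows the flavor of the model: every transfer, tag, and partner rank is spelled out by hand, which is why higher-level libraries and repositories matter so much here.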

Java

- Java likely to be a dominant language.
  - C++ like language
  - Taking the web/world by storm
  - No pointers or memory deallocation
  - Portability achieved via an abstract machine
- Java is a convenient user interface builder which allows one to develop customized interfaces quickly.
- The Internet is slow and getting slower; many activities focus on intranets.


MetaComputer Summary

- Many parts and functions of a metacomputer are being tested on a small scale today.
- Much research remains to create a balanced system of computational power and mass storage connected by high-speed networks.
- The ultimate goal is to have a Scalable Distributed Operating System.

Open Universal WebWindows: A Revolution in the Software Industry

- In future, one will not write software for Windows95/NT, UNIX, Digital VMS, etc.
- Rather, one will write software for WebWindows, defined as the operating environment for the World Wide Web
- WebWindows builds on top of Web Server and Web Client open interfaces, as in
  - the CGI interface for servers
  - Java or equivalent applet technology for clients
- Applications written for WebWindows will be portable to all computers running Web Servers or Clients, which hide hardware and native OS specifics.


Java Linpack Benchmark

- Should Java be taken seriously for numerical computations?
- 3 months ago the fastest Java performance was 1 Mflop/s on a 600 Mflop/s processor.
- Top performer today is 13.7 Mflop/s for a P6 using Netscape 3.0 (JIT).
- URL: http://www.netlib.org/benchmark/linpackjava/

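For context on how a figure like 13.7 Mflop/s is produced: the Linpack benchmark times the solution of a dense n-by-n linear system and converts the elapsed time into a rate using the nominal operation count 2n^3/3 + 2n^2. The sketch below shows only that conversion (the solver itself is omitted); the function name and the sample problem size and timing are illustrative, not the applet's actual values.

```c
#include <stdio.h>

/* Linpack-style rating: the benchmark charges 2n^3/3 + 2n^2 floating-point
 * operations for factoring and solving a dense n x n system and divides
 * by the measured wall-clock time. */
double linpack_mflops(int n, double seconds)
{
    double dn  = (double)n;
    double ops = (2.0 * dn * dn * dn) / 3.0 + 2.0 * dn * dn;
    return ops / (seconds * 1.0e6);
}

int main(void)
{
    /* Illustrative numbers only: a 1000 x 1000 solve timed at 49 seconds
     * would be reported as roughly 13.6 Mflop/s. */
    printf("%.1f Mflop/s\n", linpack_mflops(1000, 49.0));
    return 0;
}
```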

Metacomputing in the Future: The Future Trends...

- Long term is hard to predict - see changes over the last 5 years!!
- Can see trends, however...
- Ranging from supercomputing to Personal Digital Assistants.


Metacomputing in the Future: Hardware Trends (5-10 Years) - Computers

- Millions (100-300) of "settop" boxes
- One in every US household
- More worldwide

Metacomputing in the Future: Hardware Trends (5-10 Years) - Networks

- Networks (1-20 MByte/s) fulfill the needs of the "home" entertainment industry.
- Technologies ranging from high-bandwidth fibre to electromagnetic types such as microwave.


Metacomputing in the Future: Software Trends (5-10 Years)

- Very hard to predict in a relatively short term - JAVA has been a product for about a year!!
- Ubiquitous and pervasive WWW/JAVA-like.
- Can forget about the underlying h/w and OS.
- Metacomputing "plug-ins"
- Micro-kernel-like JAVA-based servers with add-on services that can support Metacomputing (load balancing, migration, checkpointing, etc...)

Highly Parallel Supercomputing: Where Are We?

1. Performance:
   - Sustained performance has dramatically increased during the last year.
   - On most applications, sustained performance per dollar now exceeds that of conventional supercomputers.
   But
   - Conventional systems are still faster on some applications.

2. Languages and compilers:
   - Standardized, portable, high-level languages such as HPF, PVM and MPI are available.
   But
   - Initial HPF releases are not very efficient.
   - Message passing programming is tedious and hard to debug (see the sketch after this section).
   - Programming difficulty remains a major obstacle to usage by mainstream scientists.


Highly Parallel Supercomputing: Where Are We?

1. Operating systems:
   - Robustness and reliability are improving.
   - New system management tools improve system utilization.
   But
   - Reliability is still not as good as on conventional systems.

2. I/O subsystems:
   - New RAID disks, HiPPI interfaces, etc. provide substantially improved I/O performance.
   But
   - I/O remains a bottleneck on some systems.

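To make the "tedious" point concrete, here is a minimal MPI sketch of a global dot product under an assumed block decomposition: the data distribution and the communication both have to be written out explicitly by the programmer (the array sizes and values are arbitrary), whereas a data-parallel language such as HPF expresses the same computation as an ordinary loop plus distribution directives.

```c
#include <stdio.h>
#include <mpi.h>

#define N_LOCAL 1000   /* assumed: each process owns a slice of this length */

int main(int argc, char **argv)
{
    double x[N_LOCAL], y[N_LOCAL], local = 0.0, global = 0.0;
    int i, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* The decomposition is explicit: each rank initializes and
     * reduces only its own slice of the distributed vectors. */
    for (i = 0; i < N_LOCAL; i++) {
        x[i] = 1.0;
        y[i] = 2.0;
        local += x[i] * y[i];
    }

    /* The communication is explicit too: combine the partial sums. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("dot product = %g\n", global);

    MPI_Finalize();
    return 0;
}
```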

The Importance of Standards I

- An ongoing thread of research in scientific computing is the efficient solution of large problems.
- Various mechanisms have been developed to perform computations across diverse platforms. The most common mechanism involves software libraries.
- Some software libraries are highly optimized for only certain platforms and do not provide a convenient interface to other computer systems.
- Other libraries demand considerable programming effort from the user, who may not have the time to learn the required programming techniques.
- While a limited number of tools have been developed to alleviate these difficulties, such tools themselves are usually available only on a limited number of computer systems.


Current Situation... Software

- Writing programs for MPP is hard ...
- But ... it is a one-off effort if written in a standard language
- Past lack of parallel programming standards ...
  - ... has restricted uptake of the technology to "enthusiasts"
  - ... has reduced portability over a range of current architectures and between future generations
- Now standards exist: PVM, MPI & HPF, which ...
  - ... allow users & manufacturers to protect their software investment
  - ... encourage the growth of a "third party" parallel software industry & parallel versions of widely used codes

The Future of HPC

- The expense of being different is being replaced by the economics of being the same
- HPC needs to lose its "special purpose" tag
- Still has to bring about the promise of scalable general-purpose computing ...
- ... but it is dangerous to ignore this technology
- Final success when MPP technology is embedded in desktop computing
- Yesterday's HPC is today's mainframe is tomorrow's workstation
- HPC systems containing all the programming tools / environments / languages / libraries / applications packages found on desktops


The Importance of Standards II

Hardware
- processors
  - commodity RISC processors
- interconnects
  - high-bandwidth, low-latency communications protocol
  - no de-facto standard yet (ATM, Fibre Channel, HPPI, FDDI)
- growing demand for a total solution:
  - robust hardware + usable software