High Performance Computing Technologies

Jack Dongarra
University of Tennessee
Oak Ridge National Laboratory
http://www.netlib.org/utk/people/JackDongarra/

My Group in Tennessee

Numerical Linear Algebra
- Basic algorithms for HPC
- EISPACK, LINPACK, BLAS, LAPACK, ScaLAPACK
Heterogeneous Network Computing
- PVM
- MPI
Software Repositories
- Netlib
- High-Performance Software Exchange
Performance Evaluation
- Linpack Benchmark, Top500
- ParkBench
Computational Science

HPC offered a new way to do science:
- Experiment
- Theory
- Computation
Computation is used to approximate physical systems.
Advantages include:
- playing with simulation parameters to study emergent trends
- possible replay of a particular simulation event
- study of systems where no exact theories exist

Why Turn to Simulation? ... Too Large

Climate/Weather Modelling
Data intensive problems: data-mining, oil reservoir simulation
Problems with large length and time scales: cosmology
Why Parallel Computers?

Desire to solve bigger, more realistic applications problems.
Fundamental limits are being approached.
More cost effective solution.
Example: Weather Prediction - Navier-Stokes with a 3D Grid around the Earth
- 6 variables per cell: temperature, pressure, humidity, 3 wind velocity components
- 1 Kilometer Cells
- 10 slices -> 5 x 10^9 cells
- 8 bytes per variable, roughly 2 x 10^11 Bytes = 200 GBytes
- about 100 flops performed at each cell
- 1 minute time step
- (100 flops/cell x 5 x 10^9 cells) / (1 min x 60 sec/min) = about 8 Gflop/s sustained

Automotive Industry

Huge users of HPC technology: Ford US is the 25th largest user of HPC in the world.
Main uses of simulation:
- aerodynamics (similar to aerospace industry)
- crash simulation
- metal sheet forming
- noise/vibrational optimization
- traffic simulation
Main gains:
- reduced time to market of new cars
- increased quality
- reduced need to build expensive prototypes
- more efficient & integrated manufacturing processes
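A minimal C sketch (not from the original slides; the constants are those quoted above, the variable names are illustrative) that reproduces this back-of-the-envelope estimate:

#include <stdio.h>

int main(void)
{
    double cells          = 5.0e9;   /* 1 km cells, 10 vertical slices */
    double vars_per_cell  = 6.0;     /* temperature, pressure, humidity, 3 wind components */
    double bytes_per_var  = 8.0;     /* one double-precision value per variable */
    double flops_per_cell = 100.0;   /* work per cell per time step */
    double step_seconds   = 60.0;    /* one-minute time step */

    double bytes = cells * vars_per_cell * bytes_per_var;   /* ~2 x 10^11 bytes */
    double rate  = cells * flops_per_cell / step_seconds;   /* flop/s to keep up in real time */

    printf("storage needed : %.0f GBytes\n", bytes / 1e9);  /* ~240, the slide's "200 GBytes" */
    printf("sustained rate : %.1f Gflop/s\n", rate / 1e9);  /* ~8.3, the slide's 8 Gflop/s */
    return 0;
}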
Grand Challenge Science

US Office of Science and Technology Policy
Some Definitions: A Grand Challenge is a fundamental problem in science or engineering, with potentially broad economic, political and/or scientific impact, that could be advanced by applying High Performance Computing resources.
The Grand Challenges of High Performance Computing are those projects which are almost too difficult to investigate using current supercomputers!

GC Computing Requirements
GC Summary

Computational science is a relatively new method of investigating the world.
Current generation of high performance computers are making an impact in many areas of science.
New Grand Challenges appearing - e.g., global modeling, computational geography.
Users still want more power!
... and all this applies to HPC in business.
Maybe the problems in computational science are not so different from those in business ...?

High-Performance Computing Today

In the past decade, the world has experienced one of the most exciting periods in computer development.
Computer performance improvements have been dramatic - a trend that promises to continue for the next several years.
One reason for the improved performance is the rapid advance in microprocessor technology.
Microprocessors have become smaller, denser, and more powerful.
If cars had made equal progress, you could buy a car for a few dollars, drive it across the country in a few minutes, and "park" the car in your pocket!
The result is that microprocessor-based supercomputing is rapidly becoming the technology of preference in attacking some of the most important problems of science and engineering.
Growth in Microprocessor Performance in 1990's

[Figure: performance in Mflop/s (log scale), 1980-1994, for Cray vector machines (Cray 1S, X-MP, Y-MP, Cray 2, C-90, T-90) versus microprocessors (8087, 80287, 80387, 6881, R2000, i860, RS 6000/540, RS 6000/590, Alpha, R8000)]

TOP500 - CPU Technology

[Figure: number of Top500 systems by CPU technology (ECL, proprietary CMOS, off-the-shelf CMOS) for the lists 6/93-11/95; source: Universität Mannheim]
The Maturation of Highly Parallel Technology

Affordable parallel systems now out-perform the best conventional supercomputers.
Performance per dollar is particularly favorable.
The field is thinning to a few very capable systems.
Reliability is greatly improved.
Third-party scientific and engineering applications are appearing.
Business applications are appearing.
Commercial customers, not just research labs, are acquiring systems.

Scalable Multiprocessors

What is required?
- Must scale the local memory bandwidth linearly.
- Must scale the global interprocessor communication bandwidth.
- Scaling memory bandwidth cost-effectively requires separate, distributed memories.
- Cost-effectiveness also requires best price-performance in individual processors.
What we get:
- Compelling price/performance
- Tremendous scalability
- Tolerable entry price
- Tackle intractable problems
Cray v Cray

Cray Research Inc. v Cray Computer Company
CRI: Founded by Seymour Cray in 1972, the father of the supercomputer.
Business based on vector supercomputers & later MPP:
- Cray1 '76, XMP '82, YMP '87, C90 '92, J90 '93, T90 '95, ...
- Cray1 '76, Cray2 '85, Cray3?
- T3D '94, T3E '96, ...
Seymour Cray left to form CCC in 1989 to develop exotic processor technology (Cray3).
1994: CCC went bust.
1995: CRI returned to profit + huge order backlog.

Silicon Graphics Inc. (SGI)

The new kids on the block ...
Founded in 1981 as a Stanford University spin-out.
Sales originally based on graphics workstations:
- Graphics done in hardware
- exception to the rule of custom built chips being less cost effective than general-purpose processors running software
All machines use mass produced processors from MIPS Computer Systems, now an SGI subsidiary.
Aggressively marketed.
SGI Today

New markets: move away from graphics workstations to general purpose HPC; introduction of parallelism.
Current: POWER CHALLENGE
Aim: sell affordable / accessible / entry-level / scalable HPC
Market position: 23% of machines in "Top 500" list
Interesting asides:
- MIPS announce deal to supply processors for the next generation of Nintendo machines: HPC feeding into the mainstream
- Feb. 26, 1996: SGI buy 75% of CRI stock: low end HPC having strong influence on high end HPC

The Giants

No longer just biding their time
IBM: released SP2 in 1994, based on workstation chips
- Market position: 21% of machines in "Top 500" list
DEC: Memory Channel architecture released 1994, from networking and workstation processor experience
- Market position: 3% of machines in "Top 500" list
Intel: early experiences with hypercube machines 1982-90; 1995: won contract for US Government "Teraflops machine"
- Market position: 5% of machines in "Top 500" list
HP Convex: HP bought Convex in 1994, to bring together workstation knowledge & HPC
- Market position: 4% of machines in "Top 500" list
Others: Fujitsu 7%, NEC 8%, Hitachi 3%, Tera, Meiko 2%
... but how many of them are making a profit in MPP systems?
Scientific Computing: 1986 vs. 1996

1986:
1. Minisupercomputers (1 - 20 Mflop/s): Alliant, Convex, DEC.
2. Parallel vector processors (PVP) (20 - 2000 Mflop/s): CRI, CDC, IBM.

1996:
1. PCs (200 Mflop/s): Intel Pentium Pro.
2. RISC workstations (10 - 1000 Mflop/s): DEC, HP, IBM, SGI, Sun.
3. RISC based symmetric multiprocessors (SMP) (0.5 - 15 Gflop/s): HP-Convex, DEC, and SGI-CRI.
4. Parallel vector processors (1 - 250 Gflop/s): SGI-CRI, Fujitsu, and NEC.
5. Highly parallel processors (1 - 250 Gflop/s): HP-Convex, SGI-CRI, Fujitsu, IBM, NEC, Hitachi.

[Figure: peak performance (0.1 flop/s to teraflop/s, log scale) versus year, 1950-2000, from ENIAC and Mark I through scalar machines (UNIVAC, IBM 704, CDC 1604, LARC, Stretch, CDC 6600, CDC 7600), vector machines (Cray-1, Cray X-MP, Cray Y-MP, Cray-2, Cray C-90) and massively parallel machines (CM-2, Delta, Intel Paragon), with the underlying technology eras: relays, vacuum tubes, transistors, integrated circuits, microprocessors]
Linpack-HPC: Solving a System of Dense Linear Equations
Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory

[Figure: Linpack-HPC performance in Gflop/s, 1980-1995: Cray 1 (1 proc), Cray X-MP (4 proc), NEC SX-2 (1 proc), Cray Y-MP (8 proc), TMC CM-2 (2048 proc), Fujitsu VP-2600 (1 proc), NEC SX-3 (4 proc), Intel Delta (512 proc), TMC CM-5 (1024 proc), Fujitsu VPP-500 (100 and 140 proc), Intel Paragon (3680 and 6788 proc), Hitachi CP-PACS (2048 proc)]

Performance Improvement for Scientific Computing Problems

[Figure: speed-up factor derived from supercomputer hardware (vector supercomputers, 1970-2000, log scale) alongside speed-up factor derived from computational methods (sparse Gaussian elimination, Gauss-Seidel, successive over-relaxation, conjugate gradient, multi-grid), 1970-2000, log scale]
Department of Energy's Accelerated Strategic Computing Initiative

5-year, $1B program designed to deliver tera-scale computing capability.
"Stockpile Stewardship" - safe and reliable maintenance of the nation's nuclear arsenal in the absence of nuclear testing.
Advanced computations, specifically 3-D modeling and simulation capability, are viewed as the backbone of "stockpile stewardship".
5 generations of HPC will be delivered over the lifetime of the program.
First machine is a single massively parallel 1.8 teraflop computer: an Intel Paragon based on 9000 200-Mflop/s Pentium processors, delivered to Sandia Labs by the end of 1996.
Second machine is a $93M system from IBM consisting of clusters of shared-memory processors; the 3 Tflop/s system is scheduled for demonstration in December 1998 at LLNL.
Third machine is a NUMA system from SGI-CRI, scheduled for LANL.
Remaining two machines will deliver capability in the 30- and 100-Tflop/s range.

Virtual Environments

When the number crunchers finish crunching, the user is faced with the mammoth task of making sense of the data. As visualization and computation become ever more closely coupled, new environments for scientific discovery emerge: virtual environments.

[Slide shows a page of raw simulation output values, beginning 0.32E-08 0.00E+00 0.00E+00 0.00E+00 0.38E-06 0.13E-05 0.22E-05 0.33E-05 0.59E-05 0.11E-04 ...]

Do they make any sense?
Alternative Supercomputing Resources

Vast numbers of under-utilized workstations available to use.
Huge numbers of unused processor cycles and resources that could be put to good use in a wide variety of applications areas.
Reluctance to buy supercomputers due to their cost and short life span.
Distributed computer resources "fit" better into today's funding model.
THE METACOMPUTER: ONE FROM MANY

Birth of a Concept
The term "metacomputing" was coined around 1987 by NCSA Director, Larry Smarr. But the genesis of metacomputing took place years earlier.
The goal for the research community was to provide a "Seamless Web" linking the user interface on the workstation and supercomputers.

MIMD, multicomputer: networked workstations

Enabling software technology: PVM (Parallel Virtual Machine), available from [email protected]
Enabling software technology: MPI (Message Passing Interface), available from [email protected]
Very active research area; about 150 software products; catalog available (NHSE)
Enabling hardware technology: high bandwidth interconnect is not here yet
- Ethernet: msec latencies and 100's of Kbyte/sec bandwidth are insufficient
- Other technology is on the verge of becoming available: HIPPI products, Fibre Channel, ATM.
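PVM and MPI are the message-passing libraries referred to throughout these slides. As a minimal illustration (this sketch is not from the slides; it uses only core MPI-1 calls), the C program below passes one value between two processes:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double msg = 3.14;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* process 0 sends one double to process 1 */
        MPI_Send(&msg, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* process 1 receives the value and prints it */
        MPI_Recv(&msg, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        printf("process 1 received %f from process 0\n", msg);
    }

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with mpirun, the same source can run unchanged on an MPP or on a network of workstations - which is the portability argument made for these standards.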
MetaComputer Summary

Many parts and functions of a metacomputer are being tested on a small scale today.
Much research remains to create a balanced system of computational power and mass storage connected by high-speed networks.
The ultimate goal is to have a Scalable Distributed Operating System.

Java

Java likely to be a dominant language.
- C++ like language
- Taking the web/world by storm
- No pointers or memory deallocation
- Portability achieved via abstract machine
Java is a convenient user interface builder which allows one to quickly develop customized interfaces.
Internet is slow and getting slower; many activities focus on intranets.
Open Universal WebWindows

A Revolution in the Software Industry
In future one will not write software for either
- Windows95/NT, UNIX, Digital VMS, etc.
Rather, one will write software for WebWindows, defined as the operating environment for the World Wide Web.
WebWindows builds on top of Web Server and Web Client open interfaces, as in
- CGI interface for servers
- Java or equivalent applet technology for clients
Applications written for WebWindows will be portable to all computers running Web Servers or Clients, which hide hardware and native OS specifics.

Java Linpack Benchmark

Should Java be taken seriously for numerical computations?
3 months ago the fastest Java performance was 1 Mflop/s on a 600 Mflop/s processor.
Top performer today is 13.7 Mflop/s for a P6 using the Netscape 3.0 JIT.
URL: http://www.netlib.org/benchmark/linpackjava/
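The Mflop/s figures quoted here, like those in the Linpack charts earlier, come from timing dense linear-algebra kernels. As a rough illustration only (this is not the benchmark code itself), the C sketch below times a daxpy-style loop, the kernel at the heart of the Linpack factorization, and reports an Mflop/s rate:

#include <stdio.h>
#include <time.h>

#define N 1000000                 /* vector length (illustrative) */

static double x[N], y[N];

int main(void)
{
    double alpha = 2.5;
    int i, rep, reps = 100;

    for (i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    clock_t start = clock();
    for (rep = 0; rep < reps; rep++)
        for (i = 0; i < N; i++)
            y[i] += alpha * x[i];                  /* 2 flops per element */
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;

    double mflops = 2.0 * N * reps / (seconds * 1.0e6);
    printf("%.1f Mflop/s (y[0] = %f)\n", mflops, y[0]);
    return 0;
}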
Metacomputing in the Future: The Future Trends ...

Long term is hard to predict - see the changes over the last 5 years!!
Can see trends, however ...

Metacomputing in the Future: Hardware Trends (5-10 Years) - Computers

Ranging from Supercomputing to Personal Digital Assistants.
Millions (100-300) of "settop" boxes:
- One in every US household
- More worldwide
Metacomputing in the Future: Hardware Trends (5-10 Years) - Networks

Networks of 1-20 MByte/s fulfill needs of the "home" entertainment industry.
Technologies ranging from high band-width fibre to electromagnetic types such as microwave.

Metacomputing in the Future: Hardware Trends (5-10 Years) - Software

Very hard to predict in a relatively short term - JAVA has been a product for about a year!!
Ubiquitous and pervasive, WWW/JAVA-like.
Can forget about underlying h/w and OS.
Metacomputing "plug-ins":
- Micro-kernel-like JAVA based servers with add-on services that can support metacomputing (load balancing, migration, checkpointing, etc...)
Highly Parallel Supercomputing: Where Are We?

1. Performance:
Sustained performance has dramatically increased during the last year.
On most applications, sustained performance per dollar now exceeds that of conventional supercomputers.
But
Conventional systems are still faster on some applications.
2. Languages and compilers:
Standardized, portable, high-level languages such as HPF, PVM and MPI are available.
But
Initial HPF releases are not very efficient.
Message passing programming is tedious and hard to debug.
Programming difficulty remains a major obstacle to usage by mainstream scientists.

Highly Parallel Supercomputing: Where Are We?

1. Operating systems:
Robustness and reliability are improving.
New system management tools improve system utilization.
But
Reliability still not as good as conventional systems.
2. I/O subsystems:
New RAID disks, HiPPI interfaces, etc. provide substantially improved I/O performance.
But
I/O remains a bottleneck on some systems.
Current Situation ...

Software
Writing programs for MPP is hard ...
But ... one-off effort if written in a standard language.
Past lack of parallel programming standards ...
- ... has restricted uptake of the technology to "enthusiasts"
- ... reduced portability over a range of current architectures and between future generations
Now standards exist: PVM, MPI & HPF, which ...
- ... allow users & manufacturers to protect software investment
- ... encourage growth of a "third party" parallel software industry & parallel versions of widely used codes

The Importance of Standards I

An ongoing thread of research in scientific computing is the efficient solution of large problems.
Various mechanisms have been developed to perform computations across diverse platforms. The most common mechanism involves software libraries.
Some software libraries are highly optimized for only certain platforms and do not provide a convenient interface to other computer systems.
Other libraries demand considerable programming effort from the user, who may not have the time to learn the required programming techniques.
While a limited number of tools have been developed to alleviate these difficulties, such tools themselves are usually available only on a limited number of computer systems.
The Importance of Standards II

Hardware
- processors
  - commodity RISC processors
- interconnects
  - high bandwidth, low latency communications protocol
  - no de-facto standard yet (ATM, Fibre Channel, HPPI, FDDI)
- growing demand for a total solution:
  - robust hardware + usable software
  - HPC systems containing all the programming tools / environments / languages / libraries / applications packages found on desktops

The Future of HPC

The expense of being different is being replaced by the economics of being the same.
HPC needs to lose its "special purpose" tag.
Still has to bring about the promise of scalable general purpose computing ...
... but it is dangerous to ignore this technology.
Final success when MPP technology is embedded in desktop computing.
Yesterday's HPC is today's mainframe is tomorrow's workstation.