MOORE'S LAW, AMDAHL'S LAW, WHAT FUTURE? INTEL'S VISION

Philippe Thierry, Sr. Staff Engineer, Intel Corp.

Marc Dollfus, HPC Manager for Public Sector and Research, Intel France

IDRIS, 8 January 2009


Legal Disclaimers

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/resources/limits.htm or call (U.S.) 1-800-628-8686 or 1-916-356-3104.

Results were derived using simulations run on an architecture simulator or model. Any difference in system hardware or software design or configuration may affect actual performance*

All dates and products specified are for planning purposes only and are subject to change without notice

Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that correlates with the performance improvements reported.

SPEC, SPECint2000, SPECfp2000, SPECint2006, SPECfp2006, SPECjbb, are trademarks of the Standard Performance Evaluation Corporation. See http://www.spec.org for more information.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor series, not across different processor sequences. See http://www.intel.com/products/processor_number for details.

Intel products are not intended for use in medical, life saving, life sustaining, critical control or safety systems, or in nuclear facility applications. All dates and products specified are for planning purposes only and are subject to change without notice

Intel, Intel microarchitecture, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

* Other names and brands may be claimed as the property of others.

Copyright © 2008 Intel Corporation.


Intel in High Performance Computing: the big picture

• Leading process manufacturing
• Leading performance, performance/watt
• Advanced HPC R&D
• Dedicated, renowned expertise
• Broad SW tools portfolio
• Large scale clusters for test & optimization
• Defined HPC application platform
• Platform building blocks

A long term commitment to HPC

Agenda: from the micro-arch to the end users


What G. Moore predicted?

• 1971 Intel® 4004: 4/8 bits, 108 kHz, 2,300 transistors
• 1972 Intel® 8008: 8 bits, 200 kHz, 3,500 transistors
• 1974 Intel® 8080: 8 bits, 2 MHz, 4,500 transistors
• 1978 Intel® 8086-8088: 16 bits, 5 MHz
• 1982 Intel® 80286: 16/32 bits, more than 100,000 transistors
• 1985 Intel® 80386: 32 bits, 275,000 transistors
• 1989 Intel® 80486: 32 bits, 25 to 66 MHz, 1.6 million transistors
• 1993 Intel® Pentium®: 32 bits, 75 to 133 MHz, 3.1 million transistors
• 1995 Intel® Pentium® Pro: 32 bits, 150 to 200 MHz, 5.5 million transistors
• 1997 Intel® Pentium® II: 32 bits, 7.5 million transistors
• 1999 Intel® Pentium® III: 32 bits, 450 to 600 MHz, 9.5 million transistors
• 1999 Intel® …: 32 bits, 9.5 million transistors
• 2000 Intel® Pentium® 4: 32 bits, 1.5 GHz, 42 million transistors
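A quick sanity check (ours, not the slides'): a two-year doubling cadence reproduces the order of magnitude of this list,

$$N(t) \approx N_{1971} \cdot 2^{(t-1971)/2}, \qquad N(2000) \approx 2300 \cdot 2^{14.5} \approx 5 \times 10^{7},$$

the same order as the Pentium® 4's 42 million transistors.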

7

Copyright © 2008, Intel Corporation

Not a simple linear function!

8

Copyright © 2008, Intel Corporation

For simple circuits, the cost per component is nearly inversely proportional to the number of components, the result of the equivalent piece of semiconductor in the equivalent package containing more components. But as components are added, decreased yields more than compensate for the increased complexity, tending to raise the cost per component.

Thus, there is a minimum cost at any given time in the evolution of the technology. … If we look ahead five years, a plot of costs suggests that the minimum cost per component might be expected in circuits with about 1,000 components per circuit

In 1970, the manufacturing cost per component can be expected to be only a tenth of the present cost.

The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000.

I believe that such a large circuit can be built on a single wafer.

Electronics, Volume 38, Number 8, April 19, 1965

Intermediate conclusion 1

• Moore's law never meant – more performance – more HPC, more //ism, more scalability

Just: "there is always a good price per number of transistors on a circuit!"

Implicitly, the number of transistors grows, and so do the capabilities of said circuit ;)

But the only (?) law that dominates in HPC remains Amdahl's law and its kin


However …


Latest steps

[Chart: transistor count and frequency (kHz) per year, 1971-2009, log scale]


Latest steps

[Same chart, annotated with process/wafer milestones: 130 nm / 200 mm (2000), 90 nm / 300 mm dual core (2003), 65 nm / 300 mm (2005), 45 nm / 300 mm (2007)]


45 nm Hi-k Intel Processor

Quad core Intel® Xeon® 53xx (Clovertown), 65 nm: 2 x 143 mm² die, 582 M transistors, 8 MB cache
Quad core Intel® Xeon® 54xx (Harpertown), 45 nm Hi-k: 2 x 107 mm² die, 820 M transistors, 12 MB cache

*Source: Intel. Note: die picture sizes are approximate.

Industry's First 45 nm High-K + Metal Gate Transistor

• Improved transistor density: ~2x
• Improved transistor switching speed: >20%
• Reduced transistor switching power: ~30%
• Reduced gate oxide leakage power: >10x
(65 nm vs. 45 nm)


High-k + metal gate transistors


High-k + metal gate transistors

"The implementation of high-k and metal gate materials marks the biggest change in transistor technology since the introduction of polysilicon gate MOS transistors in the late 1960s" – G. Moore, 2007


Intel Driving Moore’s Law into the Future

[Roadmap: 90 nm node (2003) → 65 nm (2005) → 45 nm (2007) → 32 nm (2009) → 22 nm (2011) → 16 nm (2013) → 11 nm (2015) → 8 nm (2017), with feature sizes shown shrinking through 25 nm, 15 nm, 7 nm, 5 nm, 3 nm]


Tick-Tock

IA cadence: Intel® Core™ (NEW microarchitecture, 65 nm, Tock) → Penryn (compaction/derivative, 45 nm, Tick) → Nehalem (NEW microarchitecture, 45 nm, Tock) → Westmere (compaction/derivative, 32 nm, Tick) → Sandy Bridge (NEW microarchitecture, 32 nm, Tock)

Itanium: Montecito/Montvale (90 nm) → NEW microarchitecture → Poulson (NEW microarchitecture)

Forecast

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. All products, dates, and figures are preliminary and are subject to change without notice.

Intel® HPC Roadmap

(2008 → 2011, by half-year)

Mission Critical: Montvale (2C/4T) → Tukwila (new u-arch)

EXpandable: Tigerton (4C) → Dunnington (6C) → Nehalem-EX; platform: Caneland w/ Clarksboro chipset

Efficient Performance: Harpertown (4C) / Wolfdale-DP (2C) → Nehalem-EP (new u-arch) → derivative; platforms: Bensley w/ Blackford, Stoakley w/ Seaburg → Tylersburg-EP

ENtry: Harpertown (4C) / Wolfdale-DP (2C) → Nehalem-EP derivative → new u-arch; platform: Tylersburg-EN


Nehalem-EP

• 4 cores
• 8M on-chip shared cache
• 3-level cache hierarchy
– 32k I-cache + 32k D-cache per core
– New 256k L2 cache per core
– New shared last level cache, inclusive cache policy
• Simultaneous MultiThreading capability (SMT)
• QuickPath Interconnect
– Point-to-point link; 2 links per CPU socket (1 to the other socket, 1 to the chipset)
– Up to 6.4 GT/sec (12.8 GB/sec) in each direction per link (fully duplex)
• Integrated memory controller (DDR3)
– 3 DDR3 channels, up to 18 DIMMs
– 800/1066/1333 MHz DDR3
• New instructions
• Power: 130W, 95W, 80W, 60W
• New LGA 1366 socket


Nehalem Processor Microarchitecture

• Derived from Core microarchitecture
• Similar pipeline length
• Processor performance features

– Greater parallelism

» Example: 33% more instructions in flight

– More efficient algorithms

» Example: Faster handling of unaligned cache accesses

– Better branch prediction
• New 3-level cache hierarchy
• New 2-level TLB hierarchy


Intel Smart Cache – Core Caches

• 1st level caches
– 32 kB instruction cache
– 32 kB data cache
– Support more L1 misses in parallel than Core 2
• 2nd level cache
– New cache introduced in Nehalem
– Unified (holds code and data)
– 256 kB per core
– Performance: very low latency
– Scalability: as core count increases, reduces pressure on the shared cache


Intel Smart Cache – 3rd Level Cache

• New 3rd level cache
– Shared across all cores
– Size depends on # of cores (quad-core: up to 8 MB)
– Scalability: built to vary size with varied core counts, and to easily increase L3 size in future parts
• Inclusive cache policy for best performance
– An address residing in L1/L2 must be present in the 3rd level cache


Designed For Modularity: 2008-2009 Servers & Desktops

Core: cores + L3 cache. Uncore: integrated memory controller (IMC), QPI links, power & clock.

Differentiation in the "Uncore": # cores, size of cache, # memory channels, type of memory, # QPI links, power management


Expandable (EX) Platform Roadmap

(2008 → 2011) EXpandable: Tigerton (4C) → Dunnington (6C); platform: Caneland w/ Clarksboro chipset

Dunnington (6 cores):
• 45 nm high-k technology
• 1.9 B transistors
• 16 MB L3 cache
• Caneland socket compatible
• Latest Intel virtualization technologies
• Launched


Intel innovations

Evolution of the Instruction Set Architecture (ISA) – Intel® Xeon®, Intel® Core™ Duo:
• 1979 – numeric data extensions
• 1997 – MMX™ 64b SIMD ISA extension
• 1999-2007 – SSEx 128b SIMD ISA extensions
• Starting 2010 – Intel® AVX, 256b SIMD ISA extensions

Benefits of Intel's ISA:
• Available across a wide range of platforms – no additional hw required – investment protection for your development costs and effort
• Backward and forward compatible to future ISAs
• A suite of tools and compilers to make development easy

(Sidebar: the 2-year tick-tock cadence – new microarchitecture, then shrink/derivative: Intel® Core™ 65 nm → Penryn family 45 nm → Nehalem 45 nm → Westmere 32 nm)


Key Intel® Advanced Vector Extensions: Intel® AVX

KEY FEATURES → BENEFITS

• Wider vectors, increased from 128 bits to 256 bits: up to 2x peak Flops/s with good power efficiency
• Enhanced data rearrangement (new 256-bit primitives to broadcast, mask loads and permute): organize, access and pull only necessary data, more quickly and efficiently
• Three and four operands, non-destructive syntax, designed for efficiency and extensibility: fewer register copies, better register use
• Flexible unaligned memory access support: more opportunities to combine load and compute operations

Intel® AVX is a general purpose architecture, expected to supplant SSE in all applications used today
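As an illustration of the wider vectors and the non-destructive three-operand syntax, here is a minimal C sketch using the published AVX intrinsics (the function and array names are ours, not from the slides):

#include <immintrin.h>

/* Add two float arrays 8 elements at a time in 256-bit AVX registers.
   _mm256_add_ps writes its result to a third register: a and b are
   left intact (non-destructive), unlike the two-operand SSE style. */
void vadd(const float *a, const float *b, float *c, int n)
{
    int i;
    for (i = 0; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i); /* unaligned load of 8 floats */
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_add_ps(va, vb);  /* vc = va + vb */
        _mm256_storeu_ps(c + i, vc);
    }
    for (; i < n; i++)                      /* scalar remainder */
        c[i] = a[i] + b[i];
}

Note how the unaligned loads lean on the flexible unaligned memory access support listed above.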


Intel Instruction Set Innovation Continues

The old trend: ~15% performance gain per core per year (due to frequency and uArch).
A new trajectory: vector FP offers scalable FLOPS and impressive Flops/W.

• 2008-2009 – Intel® microarchitecture (Nehalem): superior memory latency + BW, fast unaligned support
• 2010 – next generation Intel® microarchitecture (Westmere): cryptography acceleration instructions
• >2010 – Intel® AVX*: 2x throughput vector FP, 2x load throughput, 3-operand instructions

All timeframes, dates and products are subject to change without further notification

* Intel® AVX = Intel® Advanced Vector Extensions

Agenda: from the micro-arch to the end users


QuickPath Interconnect

• Nehalem introduces the new QuickPath Interconnect (QPI)
• High bandwidth, low latency point-to-point interconnect
• Up to 6.4 GT/sec initially
– 6.4 GT/sec → 12.8 GB/sec
– Bidirectional link → 25.6 GB/sec per link
• Highly scalable for systems with varying # of sockets

[Diagram: 2-socket and 4-socket Nehalem-EP topologies, each CPU with local memory, linked by QPI to the other sockets and to the IOH]
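The slide's bandwidth figures follow directly from the link width; assuming QPI's 16-bit (2-byte) data payload per transfer:

$$6.4\ \mathrm{GT/s} \times 2\ \mathrm{B} = 12.8\ \mathrm{GB/s\ per\ direction}, \qquad 2 \times 12.8 = 25.6\ \mathrm{GB/s\ per\ link}.$$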


Memory Latency Comparison

• Low memory latency is critical to high performance
• The integrated memory controller is designed for low latency
• Need to optimize both local and remote memory latency
• Effective memory latency depends on the application/OS
– Percentage of local vs. remote accesses
– NHM has lower latency regardless of the mix

[Chart: relative memory latency, lower is better – Harpertown (FSB 1600) vs. Nehalem (DDR3-1333) local and remote]


Intel QuickPath Architecture

Intel® QuickPath interconnect plus integrated Intel® QuickPath memory controllers

Stream Bandwidth – MBytes/sec (Triad), higher is better:

HTN 3.16 / BF1333 / 667 MHz mem:    6,102
HTN 3.00 / SB1600 / 800 MHz mem:    9,776
NHM 2.8 / 6.4 QPI / 1066 MHz mem:  34,904

Source: Intel. All future products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.


Intel QuickPath Architecture

Intel® QuickPath interconnect plus integrated Intel® QuickPath memory controllers

Stream Bandwidth – MBytes/sec (Triad):

1066 MHz / 6 DIMMs:  34,612
1333 MHz / 6 DIMMs:  38,486

All results using NHM 2.8 GHz / 6.4 QPI – May 2008 – bandwidth in MBytes/sec.


Agenda: from the micro-arch to the end users


Where are the limitations?

[Diagram: two sockets, each CPU with L1/L2/L3 caches and local memory; I/O; intra- and inter-socket communications; inter-node interconnect]

• Number, size and performance of the cores?
• Size and BW of the cache memories?
• Memory BW?
• Interconnect: inter-node communications?


I/O Trends

Storage hierarchy:
• Resident in memory – TeraScale memory research
• SSD – disk replacement, disk cache, memory displacement
• Local disk
• Parallel file systems
• Tape


Intel® High Performance SATA Solid State Drives – the new trend

• SATA 3.0 Gb/s interface
• 1.8" & 2.5" form factors
• 32 GB to 80 GB+ capacities
• Power: active (avg) 2.0 W, standby 0.1 W
• Weight: 2.5" SSD 87 g (+/- 2 g), 1.8" SSD 40 g (+/- 2 g)

1 IOmeter results vs. Seagate Momentus® 7200.2 SATA 3 Gb/s hard drive. Tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.


Intel® High Performance Solid-State Drives

High performance SATA drive – "X25-E" (SLC, enterprise class SSD):
• 2.5" form factor
• 32 GB / 64 GB SLC
• 240 / 170 R/W (MB/s)
• Architected for write-intensive, highly random workloads

Performance SFF mobile SATA drive – "X18-M" (MLC, mobile client):
• 1.8" form factor
• 80 GB / 160 GB MLC
• 240 / 70 R/W (MB/s)

Performance mobile SATA drive – "X25-M" (MLC, mobile client):
• 2.5" form factor
• 80 GB / 160 GB MLC
• 240 / 70 R/W (MB/s)

MLC – client/consumer class SSD. SLC (Single-Level Cell) & MLC (Multi-Level Cell) NAND.

Experiment - Finite Difference Code

(Simulate RTM Snapshots)

• 2D elastic wave propagation
• Small homogeneous 2-layer model
• Disk replacement test
– HDD: 4 x 150 GB SATA RAID0
– SSD: 4 x 80 GB MLC samples RAID0

[Chart: speedup relative to HDD (higher is better) for 48 GB, 144 GB and 240 GB snapshot volumes]

Potential: 2-3x gain with this example


Experiment - Finite Difference Code

• 2D elastic wave propagation
• Small homogeneous 2-layer model
• Disk replacement test
– HDD: 2 x 250 GB SATA RAID0
– SSD: 2 x 32 GB SLC samples RAID0

[Chart: elapsed time (sec), HDD vs. SSD: ~1.6x faster on SSD]

1 shot gather: 12 GB output


Where are the limitations?

[Same diagram as before: two sockets, each CPU with L1/L2/L3 caches and local memory; I/O; intra- and inter-socket communications; inter-node interconnect]

• Number, size and performance of the cores?
• Size and BW of the cache memories?
• Memory BW?
• Interconnect: inter-node communications?


Interconnect?


11/2008 Top500 statistics

[Charts: interconnect family vs. number of systems, and interconnect family vs. performance]


Agenda: from the micro-arch to the end users


Multi-core motivation

Relative single-core frequency and Vcc:

Over-clocked (+20%):   1.73x power, 1.13x performance
Design frequency:      1.00x power, 1.00x performance
Under-clocked (-20%):  0.51x power, 0.87x performance
Dual-core (-20%):      1.02x power, 1.73x performance


Multi-core advantage

Integer performance with SPECint_rate_base2000* (2001 to 2007), higher is better – 19.8x overall

Single core:
Xeon 1.26, 512 KB L2 (2001):  12.2
Xeon 2.20, 512 KB L2 (2002):  15.7
Xeon 3.06, 1M L3 (2003):      26.0
Xeon 3.20, 2M L3 (2004):      30.7
Xeon 3.60, 1M L2 (2004):      32.1
Xeon 3.60, 2M L2 (2005):      40.5
Dual core:
Xeon 3.00, 4M L2 (2006):      123.0
Quad core:
Xeon 3.00, 8M L2 (2007):      214.0
Xeon 3.16, 12M L2 (2007):     242.0


Petaflop/s with today's CPUs?

[Chart: peak Flops/s vs. year (log scale), from Giga (386, 486, Pentium®) through Pentium® II, III and 4 to Tera with the Intel® Core™ uArch, extrapolating toward Peta by 2010]


Global warming of the datacenter

[Diagram: chassis of dual-node Atoka blades plus switch blades]

1U (dual-node blade): 170 GF, 2 x 450 W, N+1 power supply
1 module:   1.36 TF, ~5.6 kW
1 rack:     5.44 TF, ~22.4 kW
288 racks:  1.6 PF, 147 k cores, 6.45 MW
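The chain composes as follows (assuming 8 blades per module and 4 modules per rack, which reproduces the slide's figures):

$$170\ \mathrm{GF} \times 8 = 1.36\ \mathrm{TF}, \quad \times 4 = 5.44\ \mathrm{TF}, \quad \times 288 \approx 1.6\ \mathrm{PF}; \qquad 22.4\ \mathrm{kW} \times 288 \approx 6.45\ \mathrm{MW}.$$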


Petaflops: decreasing the power consumption?

10 MW at 0.05 €/kWh → more than 5 M€ per year (500 € per hour)

2007: Core, 65 nm       ~6.45 MW
2008: Penryn, 45 nm     < 5 MW
2009: new uArch, 45 nm  < 4 MW
(x1.5 for cooling)
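The yearly figure checks out:

$$10\ \mathrm{MW} \times 0.05\ \mathrm{EUR/kWh} = 500\ \mathrm{EUR/h}, \qquad 500 \times 8760\ \mathrm{h} \approx 4.4\ \mathrm{M\,EUR/year},$$

roughly the 5 M€ per year quoted above once overheads are included.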


25 French power consumption

http://www.rte-france.com/

Agenda: from the micro-arch to the end users


HPC enabling @ Software & Services Group (SSG)

SSG works with HPC end users (who develop their own codes or use ISV codes), ISVs, universities / research institutes and OEMs on:
• tuning / optimization
• training
• benchmarking
• architecture (roadmap)
• topology (interconnect, i/o)
• software
• early CPU testing
• collaboration / co-design


Intel point of view: Emerging application research


End users' point of view: physics and computing needs increase

Increase the problem size (at constant time and physics)

Decrease the elapsed time (at constant size and physics)

Increase the physics complexity (at constant elapsed time and size)


HPC or Scientific Computing landscape

Universities / research institutes: workstations (< 500 Gflops), labs, meso centers, national centers; ISVs

R&D in the industry: < 10 Tflops
Production in the industry: 20 Tf to < 1 Pflops


Available computing power … in 2008

1.1 P:    N°1 Top500
< 1 P:    oil contractors
< 500 T:  national computing centers
< 200 T:  industries, oil companies
< 50 T:   regional computing centers
12 T:     N°500 Top500
< 1 T:    laboratories
< 50 G:   workstations
> 20 G:   Core 2 Duo


The Picket Fence (Amdahl revisited)

# People         1         10        100
Prepare          1         1.1       1.2
Paint            10        1         0.1
Clean-up         1         1.1       1.2
Total            12 hours  3.2 hours 2.5 hours
Speed-up         1         3.75      4.8
Efficiency       100%      37.5%     4.8%
Brush            1 €       10 €      100 €
Beer (cooling)   0.12 €    0.32 €    2.5 €
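A minimal C sketch reproducing the table: painting parallelizes perfectly, while prepare and clean-up grow slightly with crew size (coordination overhead); the numbers are taken from the slide.

#include <stdio.h>

int main(void)
{
    const int    people[]  = {1, 10, 100};
    const double prepare[] = {1.0, 1.1, 1.2};   /* hours */
    const double cleanup[] = {1.0, 1.1, 1.2};   /* hours */
    const double paint1    = 10.0;              /* hours for one painter */

    for (int i = 0; i < 3; i++) {
        double total   = prepare[i] + paint1 / people[i] + cleanup[i];
        double speedup = (prepare[0] + paint1 + cleanup[0]) / total;
        printf("%3d people: %4.1f h, speedup %.2f, efficiency %5.1f%%\n",
               people[i], total, speedup, 100.0 * speedup / people[i]);
    }
    return 0;
}

Running it prints 12.0 h / 3.2 h / 2.5 h with speedups 1.00 / 3.75 / 4.80, matching the table.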


Parallel computing

[Chart: speedup and elapsed time vs. number of cores (1 c to 4 c): super linear, linear (theory), ideal, saturated and "try again" curves]


First step: do it serial

[Chart: elapsed time on 1 core, before and after serial optimization]

• Profiling + HDW counters
• Compiler options
• Help the compiler
• SIMD (SSE)
• Check the memory management
• Are all the Flops useful?


Apps classification and performance analysis

Memory bound ("Stream") vs. CPU bound ("HPL"); real-world applications sit in between, with part of the code doing ops and part doing memory accesses. Assuming 1 CPU, no i/o, no comms:

$$\mathrm{CPI} \approx \%_{\mathrm{cpu\,ops}} \cdot \frac{1}{\mathrm{IPC}_{\max}} + \%_{\mathrm{mem\,ops}} \cdot \left[\, \%_{L_x\,\mathrm{hit}} \cdot c_{L_x} + \%_{L_x\,\mathrm{miss}} \cdot c_{\mathrm{mem}} \,\right]$$

where $1/\mathrm{IPC}_{\max}$ is the best-case cycles per instruction and $c_{\mathrm{mem}}$ the number of cycles to access memory. The levers are cache size and memory BW on the memory-bound side, higher frequency and the new SSE4 on the CPU-bound side.
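With purely illustrative numbers (not from the slides): 70% compute ops at IPC_max = 4, 30% memory ops with a 95% cache hit rate at 4 cycles and 5% misses at 200 cycles give

$$\mathrm{CPI} \approx 0.7 \cdot \tfrac{1}{4} + 0.3 \cdot (0.95 \cdot 4 + 0.05 \cdot 200) \approx 0.18 + 4.14 \approx 4.3,$$

i.e. the memory term dominates even at a 95% hit rate.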


Intel® Software Development Products

• Intel® Compilers – the best way to get application performance on Intel® multicore processors, both for C/C++ and Fortran
• Intel® VTune™ Performance Analyzers – identify bottlenecks and optimize multicore performance
• Intel® Performance Libraries – highly optimized, thread-safe, multimedia and HPC math functions


C++ and Fortran Compilers 11.0

OS support: Windows* (IA32, IA64, Intel® 64), Linux* (IA32, IA64, Intel® 64), Mac OS* X (IA32, Intel® 64)

• OpenMP 3.0 support
• C++ lambda functions
• Windows* products: Microsoft Visual Studio* 2005 and 2008 support
• Linux* products: Eclipse* CDT 4.0 support
• Windows Server* 2008 support
• Parallel debugger for IA32/Intel®64 Linux
• Option to emit FE diagnostics related to threading
• /MP option for generating compilation sessions in parallel
• Implementation of valarray using IPP for array notation
• Improved install framework based on Intel® PSET 2.0
• Default to be Intel® Pentium® 4 (QxW) compatible
• Increased F2003 feature support beyond 10.1: ENUMERATOR, MIN/MAX and friends, IEEE FP, ASSOCIATE, PROCEDURE POINTER, ABSTRACT INTERFACE, TYPE EXTENDS, struct ctor change, array ctor changes, ALLOCATE scalar
• Native Intel®64 compiler (Win/Linux)
• Compilers available only in "Pro" packaging (no more "Standard")


Intel® Math Kernel Library 10.0

The flagship math processing library – Simplify multithreaded application development • Extensively threaded math functions with excellent scaling • New Threading in Vector Math Functions • New OpenMP compatibility library supports Microsoft and GNU OpenMP implementations – Maximize application performance • Automatic runtime processor detection ensures great performance on whatever processor your application is running on. • New optimizations for latest Intel processors • New cluster functionality is now standard

Windows* Linux* Mac* IA32 Intel64 IA64 Multicore √ √ √ √ √ √


MKL 10.0


Math Functions Using SVML Library

Short vector math library (SVML) provides efficient software implementations: – sin/cos/tan – asin/acos/atan – sinh/cosh/tanh – asinh/acosh/atanh – log10/ln – exp/pow

– and many more
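SVML is not called directly from user code; the Intel compiler substitutes packed SVML calls when it vectorizes a loop over these functions. A minimal C sketch of such a loop (the names are ours):

#include <math.h>

/* An ordinary C loop: when the Intel compiler vectorizes it
   (e.g. with -O3), the group of expf() calls in each vector-width
   chunk is replaced by a single packed SVML vector-exp call. */
void vector_exp(const float *x, float *y, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = expf(x[i]);
}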


Intel® VTune™ Analyzer 9.0

Quickly find performance bottlenecks
– Simplify multithreaded application development
• Tune process- or thread-parallel code
• Works on standard debug builds without recompiling
– Maximize application performance
• Stall cycle accounting for Core™2 Duo and Core™2 Quad processors
• Low overhead sampling
• Graphical call graph
• View results on source or assembly
– Includes Intel® Thread Profiler

Windows* Linux* Mac* IA32 Intel64 IA64 Multicore √ √ √ √ √ √


Whatif.intel.com

Opportunity to kick the tires, give feedback, experiment – put you “in the loop early” – before we fold into final products.

Also: a direct connection with the developers and researchers at Intel.

• Intel® Adaptive Spike-Based Solver
• Cluster OpenMP* for Intel® Compilers
• Intel® Platform Modeling with Machine Learning
• Intel® Decimal Floating-Point Math Library
• Intel® Location Technologies Software Development Kit 1.0 (LTSDK)
• Intel® C++ Parallelism Exploration Compiler, Prototype Edition
• Integrated Debugger for Java*/JNI Environments
• Intel® Performance Tuning Utility 3.0


Intel® Performance Tuning Utility 3.0

• NEW in v3.0
– Result difference extension (semi-automated compiler comparison)
– Events per CPU
– Data access profiling updates
– Module aliases
– CPU frequency changing
– Heap profiling GUI
– Instrumentation-based call graph and call count
• EXISTING from the previous version
– Statistical call graph
– Basic block analysis
– Events over IP graph
– Loop analysis

Learn more at: http://whatif.intel.com


Intel® Performance Tuning Utility 3.0


35 Agenda (for next time)

• Compiler
• Reports
• Vectorization
• Inlining
• PGO
• Blocking
• Unrolling
• Floating points


References

• "The Software Vectorization Handbook" by Aart Bik
• "The Software Optimization Cookbook" by Richard Gerber
• Intel® C++ Compiler User's Guide
• IA32 Optimization Reference Manual
• High Performance Computing, 2nd Edition, by Kevin Dowd and Charles Severance
• "Floating Point Calculations and the ANSI C, C++ and Fortran Standard": http://cache-www.intel.com/cd/00/00/33/01/330130_330130.pdf
• Details on the SSE4 instruction set: http://cache-www.intel.com/cd/00/00/32/26/322663_322663.pdf
• Web-based and classroom training: www.intel.com/software/college
• White papers and technical notes: www.intel.com/ids, www.intel.com/software/products
• And have a look at http://whatif.intel.com
• Product support resources: www.intel.com/software/products/support


Parallel computing: Amdahl law

[Charts: elapsed time vs. number of cores (1 c, 2 c, 8 c), split into serial and parallel parts. In theory the parallel part shrinks with the core count; in practice an overhead term grows with it.]


Speedup using Amdahl: with an infinite number of processors, lim (speedup) = 1 / serial part

serial %       5 %    10 %   15 %   20 %   25 %
parallel %     95 %   90 %   85 %   80 %   75 %
max speedup    20     10     6.7    5      4
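Written out, with s the serial fraction and N the number of processors:

$$S(N) = \frac{1}{s + \frac{1-s}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{s},$$

which gives the caps in the last row of the table: 1/0.05 = 20, 1/0.10 = 10, and so on.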


(Zoom) With an infinite number of processors, lim (speedup) = 1 / serial part

10 % serial / 90 % parallel: max speedup 10

serial %       2 %    5 %    10 %
parallel %     98 %   95 %   90 %
max speedup    50     20     10


Speedup curves from Amdahl law

[Chart: speedup vs. number of processors, up to 1024, for parallel fractions of 100 %, 99.999 % and 99.992 %]
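A minimal C sketch of these curves, tabulating Amdahl speedup against processor count for the parallel fractions shown (100 % reproduces the linear "up to 1024 p" line):

#include <stdio.h>

/* Amdahl speedup for parallel fraction p on n processors. */
static double amdahl(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    const double frac[] = {1.0, 0.99999, 0.99992};
    for (int n = 1; n <= 1024; n *= 2) {
        printf("%5d procs:", n);
        for (int j = 0; j < 3; j++)
            printf(" %8.1f", amdahl(frac[j], n));
        printf("\n");
    }
    return 0;
}

At 1024 processors the three fractions give 1024, ~1014 and ~946 respectively.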


Amdahl's Hypotheses (among others)

Amdahl's law assumes (hypothesis → reality):
• The code is alone on the machine → just a dream
• Not affected by the OS → OS tuning
• No overhead due to //ism (comms) → hdw / sftw
• No overhead due to i/o's → hdw / sftw
• No overhead due to memory access (bw, latency) → hdw
• Load balancing is perfect → sftw
• Serial optimizations are not affected by //ism → sftw


Real world

But when the number of processes increases:
• The amount of work per core decreases (which can increase data locality) → workload / sftw
• Communications (size / nb) and sync do not decrease with the work per core → hdw / sftw
• Load balancing becomes important → sftw


Intermediate conclusion

Amdahl's law is not sufficient to simulate the behavior of an application precisely. But it remains simple and didactic: it clearly shows the "asymptote". Other approximations exist; one could spend a lifetime deriving better ones.

It allows estimating a //ism rate (combined with a serial profiling of the flops/s). Its didactic nature helps quantify the effort worth (or not worth) spending to optimize the code.


Parallel computing

0 – Serial optim. done (we reached X % of the peak)
1 – Work on the overhead to become "super linear"
2 – Take advantage of //ism to run 4x bigger workloads?

[Charts: elapsed time on 1 core vs. 4 cores at each step]


Application classification vs. X-core

Memory bound ("Stream") vs. CPU bound ("HPL"), with real-world applications in between.

[Diagram: for each class, a main task spawning threads across the 4 cores of a computing node]

Try to find a trade-off between languages / apps / platform to limit stalls and overhead


Most used

Coarse and fine level parallelism (a minimal skeleton follows below):
• Domain decomposition using MPI: 1 to 10 k cores
• A secondary MPI level if coarse enough *
• OpenMP / pthreads at the finest level: 1 to 8 cores (within an SMP node)
• Asynchronous communications
• Asynchronous and parallel i/o

~ max of 10 to 100 k cores
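A minimal hybrid skeleton of this scheme (illustrative only): MPI for the coarse domain decomposition, OpenMP inside each SMP node.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Ask for an MPI library that tolerates threaded ranks
       (the master thread makes the MPI calls). */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Coarse level: one MPI rank per subdomain / node. */
    #pragma omp parallel
    {
        /* Fine level: one OpenMP thread per core of the SMP node. */
        printf("rank %d/%d, thread %d/%d\n", rank, nranks,
               omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}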


Levels of Parallelism

Serial | Core level | Multi-core / SMP node level | Multi-node cluster level

• Programming model, implementation: C/C++, FORTRAN95 (serial); auto-parallelization (core); OpenMP*, TBB (SMP node); Cluster OpenMP, Intel MPI (cluster)
• Correctness & debugging: IDB (serial); Thread Checker, IDB (node); MPI Correctness Checker, IDB-MPP (cluster)
• Performance libraries: MKL, IPP (serial and node); Cluster MKL (cluster)
• Performance analysis: VTune™ (serial); VTune, Thread Profiler (node); Trace Analyzer (cluster)

*Other names and brands may be claimed as the property of others.


Intel® Software Development Products

• Intel® Threading Analysis Tools – find threading errors and optimize threaded applications for maximum performance
• Intel® Threading Building Blocks – C++ template-based runtime library that simplifies writing multithreaded applications for performance and scalability
• Intel® Cluster Tools – create, analyze, optimize and deploy cluster-based applications


Intel® Thread Checker v3.1

– Simplify multithreaded application development • Detects challenging data races and deadlocks before they occur • Pinpoints errors to the source code line • Works on standard debug builds without recompiling – Full compatibility with 32 & 64 bit Windows* and Linux* development environments • Includes command line interface for Windows and Linux • Batch scripts integration for regression test runs – Support OpenMP*, WIN32 and PThreads, TBB

Windows* Linux* Mac* IA32 Intel64 IA64 Multicore √ √ √ √ √


Graphical User Interface (Windows)


43 Intel® Thread Profiler v3.1

• Pinpoint threading inefficiencies to maximize threaded performance
– Simplify multithreaded application development
• Easier to use: recall custom configuration settings
• Faster to use: user-selectable stack walking
• Support for the Threading Building Blocks API
– Maximize application performance
• View the application concurrency level to ensure full core utilization
• Identify where thread-related overhead impacts performance
• Find out which created threads are active and which are inactive
– Utilize one threading profiler for 32 & 64 bit Windows*
– Included with VTune™ Analyzer for Windows*

Windows* Linux* Mac* IA32 Intel64 IA64 Multicore √ √ √ √ √


Intel® Trace Analyzer and Collector 7.1

• Increase productivity and cluster application performance
• Very low impact; lightweight statistics mode
• Excellent scalability in time and processors
• Compiler and binary instrumentation
• MPI correctness checking library
• GUI on Linux* and Windows*
• Comparison of multiple trace files
• Timeline display for performance counters
• Powerful new aggregation and filtering functions
• Better and faster GUI; embedded help

[Diagram: application → Intel Trace Collector → tracefile → Intel Trace Analyzer]


44 Intel® Trace Analyzer


Technical Information about Parallelism

• Watch the Intel Software Development Products innovative online series produced by CNET Networks International:
– ZDNet Asia: http://www.zdnetasia.com/whiteboard/0,39069063,62032144,00.htm
– ZDNet Australia: http://www.builderau.com.au/whiteboards/intel/video1.htm
– ZDNet China: coming soon
– ZDNet France: http://www.zdnet.fr/special/whiteboardintel/
– ZDNet Germany: http://www.zdnet.de/specials/whiteboardseries/
– ZDNet Japan: coming soon
– ZDNet UK: http://resources.zdnet.co.uk/articles/video/0,1000002009,39289127,00.htm
• Go-parallel.com: a resource for technical information about parallelism, optimization, threading and HPC
• "Multithreading Tools and Techniques for Software Developers": a technical webinar series – sign up today!


Intermediate conclusion

Sustained Petaflops seems difficult to reach …
• Amdahl's law works against the growing number of cores
• Plus the cost, the power consumption, the floor space


Intermediate conclusion

Petaflops is possible if …

• platforms become:
– much more "parallel", or more specialized
– much more powerful
– much more reliable?

• applications become:
– less sensitive to hardware problems
– 100x more parallel
– … after a deep rework


Intermediate conclusion: problems to solve

• Hardware (electronics)
– Size, power consumption, type of "transistors"
– Communications: latency and BW
• CPU ↔ memories ↔ i/o
• Sockets ↔ sockets
• Cores ↔ cores

[Diagram: two sockets, each CPU with L1/L2/L3 caches and memory; I/O; interconnect]


Agenda: from the micro-arch to the end users


Petaflops: too many cores to handle?

[Chart: peak Flops/s vs. year (log scale), from Giga (386, 486, Pentium®) through Pentium® II, III and 4 to Tera with the Intel® Core™ uArch, extrapolated toward Peta]

How to get to Petaflops with less powerful cores?


Accelerators – Observations

– Accelerators may work well at an algorithm level but typically do not meet speedup goals from an application standpoint.
– Multicore performance has not been fully exploited
• Projected accelerator speedups are typically biased, comparing against a single CPU or single core instead of complete systems (2S, 4S, clusters)
– Software agility is lacking
• Development, debug, optimization
• Support, maintainability across generations of devices
– IT deployment criteria are difficult to meet


Throughput Computing

Intel Many Core Breakthroughs:
• Array of fully programmable IA cores
• Innovative hardware caching architecture
• Scales to TeraFLOPS
• Single development environment
• Single software executable

[Diagram: vector IA cores with coherent caches on an interprocessor network, plus fixed-function logic and memory/I/O interfaces]

Larrabee is Intel's first many-core processor under development. Each core is a complete Intel processor, with efficient inter-block communication.


Larrabee Block Diagram

[Block diagram: multi-threaded wide-SIMD cores, each with I$ and D$, on a wide ring with a partitioned L2 cache, memory controllers, fixed function logic, texture logic, display and system interfaces]

• Cores communicate on a wide ring
– Fast access to memory and fixed function blocks
– Fast access for cache coherency
• L2 cache is partitioned among the cores
– Provides high aggregate bandwidth
– Allows data replication & sharing


Braided Parallelism Support

[Diagram: multi-core CPUs vs. Larrabee]

• Intel many-core combines the best of GPU and CPU
– An IA many-core can submit work to itself
– It creates "braided parallelism" that intermixes data- and task-parallelism with sequential code
• SW compatibility, same data formats
• High performance math libraries


Then what about Exaflops?

1 EFlop ~ 2019


ExaFlops peak with today's CPUs?

Peak perf:     48 GF  (x22) → 1 TF  (x1000) → 1 PF  (x1000) → 1 Exa Flops
Power (board): 225 W        → 5000 W        → 5 MW          → 5 GW  (x1.5 for cooling)
Nb. cores:     4 cores      → 88 cores      → 88 k cores    → 88 M cores
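The same multipliers run down all three rows:

$$48\ \mathrm{GF} \times 22 \approx 1\ \mathrm{TF}, \qquad 225\ \mathrm{W} \times 22 \approx 5\ \mathrm{kW}, \qquad 4 \times 22 = 88\ \mathrm{cores},$$

and each subsequent x1000 step scales the three quantities together to 1 EF, 5 GW and 88 M cores.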


Low power for the ExaFlops?

Peak perf:  1 TF  (x1000) → 1 PF  (x1000) → 1 Exa Flops
Power:      100 W         → 0.1 MW        → 100 MW  (including cooling)
Nb. cores:  1 "CPU"       → 1 k "CPU"     → 1 M "CPU"


51 Teraflops Research Chip

100 million transistors ● 80 cores ● 275 mm² ● 62 W ● 1 TFlops

Frequency   Voltage   Power   Bisection bandwidth   Performance
3.16 GHz    0.95 V    62 W    1.62 Terabits/s       1.01 Teraflops
5.1 GHz     1.2 V     175 W   2.61 Terabits/s       1.63 Teraflops
5.7 GHz     1.35 V    265 W   2.92 Terabits/s       1.81 Teraflops


Teraflops Research Chip


Core size (at constant 100-300 Watts/die)?

• Many (~1000-2000) small cores:
– Think of using low power (say mobility) processors and putting many of them on a die
– In-order, SSEn with 2, 4 or perhaps 8 ops per clock, 1-2 GHz clock
– Many threads per core

• A bunch of really big cores (512):
– Very wide (32 ops) vector units
– 4 / 8 threads per core
– 2 / 4 GHz

Fewer large cores? Many small cores? Small and large cores?

Note: the above pictures don't represent any current or future Intel products.

Memory solutions

• We cannot match memory bandwidth to off-package memories with the increase in processing speed.
• We are hitting a wall in terms of number of pins, signaling area, and signaling power.
• Memory speeds are advancing more slowly than CPU speeds.

When we have many tera-ops processors, we'll need many-TB/s memory systems:

              Actual values   "Tera value"   Gain
Performance   20 Gflops / p   1 Tflops / p   ~50 x
BW            12 GB/s         ~1.2 TB/s      ~100 x
Latency       400 cycles      20 cycles      ~20 x
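Why the bandwidth must scale faster than the flops: merely holding the bytes-per-flop ratio,

$$\frac{12\ \mathrm{GB/s}}{20\ \mathrm{Gflops}} = 0.6\ \mathrm{B/flop}, \qquad \frac{1.2\ \mathrm{TB/s}}{1\ \mathrm{Tflops}} = 1.2\ \mathrm{B/flop},$$

a ~50x performance step already demands a ~100x bandwidth step, as in the table.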


Memory bandwidth futures: 3D die stacking

[Diagram: heat-sink on top, thin DRAM die stacked on the CPU, package below]

• Power and IO signals go through the DRAM to the CPU
• Thin DRAM die
• Through-DRAM vias

DRAM, voltage regulators, and high voltage I/O all on the 3D integrated die


Memory Bandwidth options: DRAM on Pkg

[Diagram: CPU and DRAM side by side on the package]

DRAM, CPU integrated on package


54 Memory Bandwidth options

[Diagram: CPU package connected by flex links to a separate memory controller + buffers package]

Replace the on-pkg MC with very fast flex links to an on-board MC


Possible Technology insertion timeline

• 2011: terascale processors
• 2015: terascale Si photonics
• 2015: aggressive memory stacking – fault tolerant networks
• 2015: 2-10 teraops processors – optical memory interconnects
• 2018: 10's of teraops processors – "all" memory is on package
• 2018: 10's of terabits/sec/fiber


55 Summary


Questions?
