MOORE'S LAW, AMDAHL'S LAW: WHAT FUTURE? INTEL'S VISION
Philippe Thierry, Senior Staff Engineer, Intel Corp.
Marc Dollfus, HPC Manager for Public Sector and Research, Intel France
IDRIS, 8 January 2009
Copyright © 2008, Intel Corporation
Legal Disclaimers
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/resources/limits.htm or call (U.S.) 1 800 628 8686 or 1 916 356 3104.
Results were derived using simulations run on an architecture simulator or model. Any difference in system hardware or software design or configuration may affect actual performance*
All dates and products specified are for planning purposes only and are subject to change without notice
Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that correlates with the performance improvements reported.
SPEC, SPECint2000, SPECfp2000, SPECint2006, SPECfp2006, SPECjbb, are trademarks of the Standard Performance Evaluation Corporation. See http://www.spec.org for more information.
Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor series, not across different processor sequences. See http://www.intel.com/products/processor_number for details.
Intel products are not intended for use in medical, life saving, life sustaining, critical control or safety systems, or in nuclear facility applications.
Intel, Intel Xeon, Intel Core microarchitecture, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
* Other names and brands may be claimed as the property of others.
Copyright © 2008 Intel Corporation.
Intel in High Performance Computing: the big picture
• Leading process manufacturing
• Leading performance, performance/watt
• Advanced HPC R&D
• Dedicated, renowned SW tools expertise; broad portfolio
• Large-scale clusters for test & optimization
• Defined HPC platform building blocks
A long-term commitment to HPC
Agenda: from the micro-arch to the end users
What G. Moore predicted?
• 1971 Intel® 4004: 4/8 bits, 108 kHz, 2,300 transistors
• 1972 Intel® 8008: 8 bits, 200 kHz, 3,500 transistors
• 1974 Intel® 8080: 8 bits, 2 MHz, 4,500 transistors
• 1978 Intel® 8086/8088: 16 bits, 5 MHz
• 1982 Intel® 80286: 16/32 bits, more than 100,000 transistors
• 1985 Intel® 80386: 32 bits, 275,000 transistors
• 1989 Intel® 80486: 32 bits, 25 to 66 MHz, 1.6 million transistors
• 1993 Intel® Pentium®: 32 bits, 75 to 133 MHz, 3.1 million transistors
• 1995 Intel® Pentium® Pro: 32 bits, 150 to 200 MHz, 5.5 million transistors
• 1997 Intel® Pentium® II: 32 bits, 7.5 million transistors
• 1999 Intel® Pentium® III: 32 bits, 450 to 600 MHz, 9.5 million transistors
• 1999 Intel® Celeron®: 32 bits, 9.5 million transistors
• 2000 Intel® Pentium® 4: 32 bits, 1.5 GHz, 42 million transistors
Not a simple linear function!
For simple circuits, the cost per component is nearly inversely proportional to the number of components, the result of the equivalent piece of semiconductor in the equivalent package containing more components. But as components are added, decreased yields more than compensate for the increased complexity, tending to raise the cost per component.
Thus, there is a minimum cost at any given time in the evolution of the technology. … If we look ahead five years, a plot of costs suggests that the minimum cost per component might be expected in circuits with about 1,000 components per circuit
In 1970, the manufacturing cost per component can be expected to be only a tenth of the present cost.
The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000.
I believe that such a large circuit can be built on a single wafer.
Electronics, Volume 38, Number 8, April 19, 1965
Interim conclusion
• Moore's law never meant
 – more performance
 – more HPC, more parallelism, more scalability
Just: "there is always a good price / number-of-transistors ratio on a circuit!"
Implicitly, the number of transistors grows, and so do the capabilities of the circuit ;)
But the only (?) law that dominates in HPC remains Amdahl's law and its variants
However …
Latest steps
[Chart: transistor count and frequency (kHz), log scale, vs. year, 1971-2009]
Latest steps
[Chart: same transistor count and frequency data, annotated with process/wafer transitions: 130 nm / 200 mm (2000), 90 nm dual-core / 300 mm (2003), 65 nm / 300 mm (2005), 45 nm / 300 mm (2007)]
45 nm Hi-k Intel Processor
Quad-core Intel® Xeon® 53xx (Clovertown), 65 nm: 2* x 143 mm², 582 M transistors, 8 MB cache
Quad-core Intel® Xeon® 54xx (Harpertown), 45 nm Hi-k: 2* x 107 mm², 820 M transistors, 12 MB cache
*Source: Intel. Note: die picture sizes are approximate.
Industry's First 45 nm High-k + Metal Gate Transistor
• Improved transistor density: ~2x
• Improved transistor switching speed: >20%
• Reduced transistor switching power: ~30%
• Reduction in gate-oxide leakage power: >10x
(65 nm → 45 nm)
High-k + metal gate transistors
High-k + metal gate transistors
"The implementation of high-k and metal gate materials marks the biggest change in transistor technology since the introduction of polysilicon gate MOS transistors in the late 1960s" (G. Moore, 2007)
Intel Driving Moore’s Law into the Future
90 nm node (2003) → 65 nm (2005) → 45 nm (2007) → 32 nm (2009) → 22 nm (2011) → 16 nm (2013) → 11 nm (2015) → 8 nm (2017), and beyond: 7 nm, 5 nm, 3 nm
Tick-Tock
• Intel® Core™: NEW microarchitecture, 65 nm (Tock)
• Penryn: compaction/derivative, 45 nm (Tick)
• Nehalem: NEW microarchitecture, 45 nm (Tock)
• Westmere: compaction/derivative, 32 nm (Tick)
• Sandy Bridge: NEW microarchitecture, 32 nm (Tock)
Itanium line: Montecito/Montvale (90 nm) → Tukwila (NEW microarchitecture) → Poulson (NEW microarchitecture)
Forecast
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. All products, dates, and figures are preliminary and are subject to change without notice.
Intel® HPC Roadmap
2008 → 2011 (by half-year):
• Mission Critical: Montvale (2C/4T) → Tukwila → new u-arch
• EXpandable: Tigerton (4C) → Dunnington (6C) → Nehalem-EX, on Caneland w/ Clarksboro chipset
• Efficient Performance: Harpertown (4C) / Wolfdale-DP (2C) → Nehalem-EP → derivative → new u-arch, on Bensley w/ Blackford and Stoakley w/ Seaburg → Tylersburg-EP
• ENtry: Harpertown (4C) / Wolfdale-DP (2C) → Nehalem-EP derivative → new u-arch, on Tylersburg-EN
Nehalem-EP
• 4 cores
• 8 MB on-chip shared cache
 – 3-level cache hierarchy
 – 32 kB I-cache + 32 kB D-cache per core
 – New 256 kB L2 cache per core
 – New shared last-level cache, inclusive cache policy
• Simultaneous Multi-Threading (SMT) capability
• QuickPath Interconnect
 – Point-to-point link; 2 links per CPU socket: 1 to the other socket, 1 to the chipset
 – Up to 6.4 GT/sec (12.8 GB/sec) in each direction per link (fully duplex)
• Integrated memory controller: 3 DDR3 channels, up to 18 DIMMs, 800/1066/1333 MHz DDR3
• New instructions
• Power: 130 W, 95 W, 80 W, 60 W
• New LGA 1366 socket
Nehalem Processor Microarchitecture
• Derived from the Core microarchitecture, similar pipeline length
• Processor performance features:
 – Greater parallelism (e.g., 33% more instructions in flight)
 – More efficient algorithms (e.g., faster handling of unaligned cache accesses)
 – Better branch prediction
• New 3-level cache hierarchy
• New 2-level TLB hierarchy
Intel Smart Cache: Core Caches
• 1st-level caches (per core): 32 kB instruction cache + 32 kB data cache
 – Support more L1 misses in parallel than Core 2
• 2nd-level cache: new cache introduced in Nehalem
 – Unified (holds code and data), 256 kB per core
 – Performance: very low latency
 – Scalability: as core count increases, reduces pressure on the shared cache
Intel Smart Cache: 3rd-Level Cache
• New 3rd-level cache, shared across all cores
• Size depends on the number of cores (quad core: up to 8 MB)
• Scalability: built to vary size with core count and to easily increase L3 size in future parts
• Inclusive cache policy for best performance
 – An address residing in L1/L2 must be present in the 3rd-level cache
Designed For Modularity: 2008-2009 Servers & Desktops
[Diagram: modular die - several cores with their caches, a shared L3 cache, and an "uncore" with integrated memory controller (DRAM channels), QPI links, and power & clock]
Differentiation in the “Uncore”:
# cores, # memory channels, # QPI links, cache size, memory type, power management
Expandable (EX) Platform Roadmap
2008-2011: Tigerton (4C) → Dunnington (6C), on the Caneland platform w/ Clarksboro chipset (EXpandable)
• 45 nm high-k technology
• 1.9 B transistors (cores, 16 MB L3 cache, system interface)
• Caneland socket compatible
• Latest Intel virtualization technologies
• Launched
Intel innovations
Evolution of the Instruction Set Architecture (ISA):
• 1979: x87 numeric data extensions
• 1997: MMX™ 64-bit SIMD ISA extension
• 1999-2007: SSEx 128-bit SIMD ISA extensions
• Starting 2010: Intel® AVX, 256-bit SIMD ISA extensions
Benefits of Intel's ISA:
• Available across a wide range of platforms: no additional hardware required, investment protection for your development costs and effort
• Backward and forward compatible with future ISAs
• A suite of tools and compilers to make development easy
Key Intel® Advanced Vector Extensions: Intel® AVX
KEY FEATURES → BENEFITS
• Wider vectors, increased from 128 bits to 256 bits → up to 2x peak Flop/s with good power efficiency
• Enhanced data rearrangement: new 256-bit primitives to broadcast, mask loads and permute → organize, access and pull only necessary data, more quickly and efficiently
• Three- and four-operand, non-destructive syntax, designed for efficiency and extensibility → fewer register copies, better register use
• Flexible unaligned memory access support → more opportunities to combine load and compute operations
Intel® AVX is a general purpose architecture, expected to supplant SSE in all applications used today
Intel Instruction Set Innovation Continues
The old trend: ~15% gain per year in per-core performance (from frequency and uArch).
A new trajectory: vector FP offers scalable FLOPS and impressive Flops/W.
• 2008-2009 - Intel® microarchitecture (Nehalem): superior memory latency + BW, fast unaligned support
• 2010 - next-generation Intel® microarchitecture (Westmere): cryptography acceleration instructions
• >2010 - Intel® AVX*: 2x throughput vector FP, 2x load throughput, 3-operand instructions
All timeframes, dates and products are subject to change without notice.
* Intel® AVX = Intel® Advanced Vector Extensions
Agenda: from the micro-arch to the end users
QuickPath Interconnect
• Nehalem introduces the new QuickPath Interconnect (QPI)
• High-bandwidth, low-latency point-to-point interconnect
• Up to 6.4 GT/sec initially
 – 6.4 GT/sec → 12.8 GB/sec per direction
 – Bidirectional link → 25.6 GB/sec per link
• Highly scalable for systems with a varying number of sockets
[Diagram: 2-socket and 4-socket Nehalem-EP topologies, each CPU with local memory, linked by QPI to each other and to the IOH]
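These bandwidth figures follow from simple arithmetic; a sketch, under the assumption (mine, consistent with the slide) that a QPI link carries 16 data bits, i.e. 2 bytes, per direction per transfer:

```python
# Back-of-the-envelope for the QPI numbers quoted above.
# Assumption: 16-bit (2-byte) data payload per direction per transfer.
transfers_per_s = 6.4e9        # 6.4 GT/s
bytes_per_transfer = 2
one_way_gb_s = transfers_per_s * bytes_per_transfer / 1e9
both_ways_gb_s = 2 * one_way_gb_s
print(one_way_gb_s, both_ways_gb_s)  # 12.8 GB/s per direction, 25.6 GB/s per link
```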
Memory Latency Comparison
• Low memory latency is critical to high performance
• The integrated memory controller is designed for low latency
• Both local and remote memory latency need to be optimized
• Effective memory latency depends on the application/OS (percentage of local vs. remote accesses); NHM has lower latency regardless of the mix
[Chart: relative memory latency, lower is better - Harpertown (FSB 1600) = 1.00; Nehalem (DDR3-1333) local and remote accesses are both lower]
Intel QuickPath Architecture
Intel® QuickPath interconnect plus integrated Intel® QuickPath memory controllers
Stream Bandwidth - MB/s (Triad)
Higher is better. NHM 2.8 GHz / 6.4 QPI / 1066 MHz mem: 34,904 - vs. HTN 3.16 / BF1333 / 667 MHz mem and HTN 3.00 / SB1600 / 800 MHz mem: 6,102 and 9,776
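The Triad kernel behind these numbers is tiny: a[i] = b[i] + s*c[i], with 24 bytes counted per element (read b, read c, write a). A minimal sketch (not the official STREAM benchmark, which is compiled C/Fortran; sizes and names here are illustrative):

```python
import time

# Pure-Python sketch of the STREAM "triad" kernel and its bandwidth accounting.
def triad(n=200_000, s=3.0):
    b = [1.0] * n
    c = [2.0] * n
    t0 = time.perf_counter()
    a = [bi + s * ci for bi, ci in zip(b, c)]  # a[i] = b[i] + s*c[i]
    dt = time.perf_counter() - t0
    return a, (24 * n) / dt / 1e6  # MB/s: 3 arrays x 8 bytes per element

a, mb_per_s = triad()
print(f"triad bandwidth estimate: {mb_per_s:.0f} MB/s")
```

An interpreted loop measures mostly interpreter overhead, so the estimate lands far below the figures above; the point is the accounting, not the number.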
Source: Intel. All future products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
Intel QuickPath Architecture
Intel® QuickPath interconnect plus integrated Intel® QuickPath memory controllers
Stream Bandwidth - MB/s (Triad)
1066 MHz / 6 DIMMs: 34,612 - 1333 MHz / 6 DIMMs: 38,486
All results using NHM 2.8 GHz / 6.4 QPI - May 2008 - bandwidth in MB/s. Source: Intel. All future products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
Agenda: from the micro-arch to the end users
Where are the limitations?
[Diagram: compute node - CPUs with L1/L2/L3 caches and local memory, intra- and inter-socket communications, I/O]
Number, size and performance of the cores? Size and BW of the cache memories? Memory BW? Inter-node interconnect and communications? I/O?
I/O Trends
• Resident in memory - TeraScale memory research
• Local disk - SSD: disk replacement, disk cache, memory displacement
• Parallel file systems
• Tape
Intel® High Performance: SATA Solid State Drives - The new trend
• SATA 3.0 Gb/s interface
• 1.8" & 2.5" form factors
• 32 GB to 80 GB+ capacities
• Power: active (avg) 2.0 W, standby 0.1 W
• Weight: 2.5" SSD 87 g (+/- 2 g); 1.8" SSD 40 g (+/- 2 g)
1 IOmeter results vs. Seagate Momentus® 7200.2 SATA 3 Gb/s hard drive. Tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
Intel® High Performance Solid-State Drives
SLC - Enterprise-class SSD:
• High-performance SATA drive "X25-E": 2.5" form factor, 32 GB / 64 GB SLC
• 240 / 170 MB/s read/write
• Architected for write-intensive, highly random workloads
MLC - Client/consumer-class SSD:
• Performance SFF mobile SATA drive "X18-M": 1.8" form factor, 80 GB / 160 GB MLC, 240 / 70 MB/s read/write
• Performance mobile SATA drive "X25-M": 2.5" form factor, 80 GB / 160 GB MLC, 240 / 70 MB/s read/write
SLC & MLC (Multi-Level Cell) NAND
Experiment - Finite Difference Code
(Simulate RTM Snapshots)
• 2D elastic wave propagation
• Small two-layer homogeneous model
• Disk-replacement test:
 – HDD: 4 x 150 GB SATA, RAID 0
 – SSD: 4 x 80 GB MLC samples, RAID 0
[Chart: speedup relative to HDD (higher is better) for 48 GB, 144 GB and 240 GB snapshot sets]
Potential: 2-3x Gain with this Example
Experiment - Finite Difference Code
• 2D elastic wave propagation
• Small two-layer homogeneous model
• Disk-replacement test:
 – HDD: 2 x 250 GB SATA, RAID 0
 – SSD: 2 x 32 GB SLC samples, RAID 0
[Chart: elapsed time (sec) - SSD is 1.6x faster than HDD]
1 shot gather: 12 GB output
Where are the limitations?
[Diagram: same compute node as before - CPUs with L1/L2/L3 caches and local memory, intra- and inter-socket communications, I/O]
Number, size and performance of the cores? Size and BW of the cache memories? Memory BW? Inter-node interconnect and communications? I/O?
Interconnect?
11/2008 Top500 statistics
[Charts: interconnect family vs. number of systems, and interconnect family vs. performance]
Agenda: from the micro-arch to the end users
Multi-core motivation
[Chart: relative single-core frequency and Vcc vs. performance and power]
• Over-clocked (+20%): 1.13x performance, 1.73x power
• Design frequency: 1.00x performance, 1.00x power
• Under-clocked (-20%): 0.87x performance, 0.51x power
• Dual-core built from two under-clocked cores: 1.73x performance, 1.02x power
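The power column follows from the usual dynamic-power model P ∝ C·V²·f with voltage scaling roughly with frequency, so power scales roughly with f³. A sketch (the cubic model is my assumption; it reproduces the slide's figures):

```python
# Why two slow cores beat one fast core on power efficiency.
def rel_power(f):
    # dynamic power ~ C * V^2 * f; with V assumed proportional to f: power ~ f^3
    return f ** 3

over = round(rel_power(1.2), 2)      # over-clocked +20%  -> ~1.73x power
under = round(rel_power(0.8), 2)     # under-clocked -20% -> ~0.51x power
dual = round(2 * rel_power(0.8), 2)  # two under-clocked cores -> ~1.02x power
print(over, under, dual)
```

Performance, by contrast, is sub-linear in frequency (0.87x at -20% on the slide), so two under-clocked cores can deliver up to 2 x 0.87 = 1.73x the throughput at roughly the original power budget.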
Multi-core advantage
Integer performance with SPECint_rate_base2000* (2001 to 2007), higher is better: 19.8x overall
[Chart: single core → dual core → quad core]
Xeon 1.26 / 512 KB L2 (2001): 12.2; Xeon 2.20 / 512 KB L2 (2002): 15.7; Xeon 3.06 / 1M L3 (2003): 26.0; Xeon 3.20 / 2M L3 (2004): 30.7; Xeon 3.60 / 1M L2 (2004): 32.1; Xeon 3.60 / 2M L2 (2005): 40.5; Xeon 3.00 / 4M L2 DC (2006): 123.0; Xeon 3.00 / 8M L2 QC (2007): 214.0; Xeon 3.16 / 12M L2 QC (2007): 242.0
Petaflop/s with today's CPUs?
[Chart: log-scale Flop/s vs. year (1985-2010), from the 386 and 486 (~1E+6 to 1E+7) through Pentium®, Pentium® II, Pentium® III and Pentium® 4, up to the Intel® Core™ uArch, through Giga and Tera and extrapolating toward Peta (1E+15)]
Global warming of the datacenter
[Diagram: rack built from Atoka dual-node blades and blade switches]
1U: 170 GF, 2 x 450 W, N+1 power supply
1 module: 1.36 TF, ~5.6 kW
1 rack: 5.44 TF, ~22.4 kW
288 racks: 1.6 PF, 147k cores, 6.45 MW
Petaflops: decreasing the power consumption?
10 MW at 0.05 € / kWh → more than 5 M€ / year (500 € / hour)
With a x1.5 cooling overhead:
• 2007 - Core, 65 nm: ~6.45 MW
• 2008 - Penryn, 45 nm: < 5 MW
• 2009 - new µArch, 45 nm: < 4 MW
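A quick check of this arithmetic, assuming a flat tariff and 24/7 load (the x1.5 cooling overhead is what pushes the annual bill past 5 M€):

```python
# Energy-cost arithmetic for a 10 MW machine at 0.05 EUR/kWh.
power_kw = 10_000               # 10 MW IT load
tariff_eur_kwh = 0.05
per_hour_eur = power_kw * tariff_eur_kwh        # ~500 EUR/hour
per_year_meur = per_hour_eur * 24 * 365 / 1e6   # ~4.4 MEUR/year before cooling
print(round(per_hour_eur), round(per_year_meur, 2))
```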
French power consumption
http://www.rte-france.com/
Agenda: from the micro-arch to the end users
HPC enabling @ Software & Services Group (SSG)
[Diagram: SSG at the center of the HPC ecosystem]
• HPC end users (develop their own codes or use ISV codes): tuning/optimization, training, benchmarking, architecture (roadmap), topology (interconnect, I/O), software, early CPU testing, collaboration
• ISVs: benchmarking, co-design
• Universities / research institutes, OEMs
Intel point of view: Emerging application research
End users' point of view: physics and computing needs increase
• Increase problem size (at constant time and physics)
• Decrease elapsed time (at constant size and physics)
• Increase physics complexity (at constant elapsed time and size)
HPC or Scientific Computing landscape
[Diagram: pyramid of the HPC landscape]
• Workstations, labs: < 500 Gflops
• Universities / research institutes, meso-centers, ISVs, R&D in industry: < 10 Tflops
• National centers, production in industry: 20 TF to < 1 Pflops
Available computing power … in 2008
[Chart: log-scale Flop/s ladder]
• 1.1 P: #1 in the Top500; < 1 P: oil contractors
• < 500 T: national computing centers; < 200 T: industries, oil companies
• < 50 T: regional computing centers; 12 T: #500 in the Top500
• < 1 T: laboratories
• < 50 G: workstations; > 20 G: Core 2 Duo
The Picket Fence (Amdahl revisited)

# People       1        10        100
Prepare        1 h      1.1 h     1.2 h
Paint          10 h     1 h       0.1 h
Clean-up       1 h      1.1 h     1.2 h
Total          12 h     3.2 h     2.5 h
Speed-up       1        3.75      4.8
Efficiency     100%     37.5%     4.8%
Brush          1 €      10 €      100 €
Beer cooling   0.12 €   0.32 €    2.5 €
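The table's numbers can be recomputed directly: prepare and clean-up are the serial part (and even grow slightly with headcount, from coordination), while only painting divides among the workers.

```python
# Picket-fence arithmetic: serial phases bound the achievable speedup.
def total(prepare, paint, clean):
    return prepare + paint + clean

t1 = total(1, 10, 1)          # 1 person: 12 h
t10 = total(1.1, 1.0, 1.1)    # 10 people: 3.2 h
t100 = total(1.2, 0.1, 1.2)   # 100 people: 2.5 h

for people, t in [(1, t1), (10, t10), (100, t100)]:
    speedup = t1 / t
    print(f"{people:>3} people: {t:.1f} h, speedup {speedup:.2f}, "
          f"efficiency {100 * speedup / people:.1f}%")
```

Going from 10 to 100 painters buys only 3.75x → 4.8x, while efficiency collapses from 37.5% to 4.8%: the serial 2+ hours dominate.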
Parallel computing
[Chart: elapsed time and speedup vs. number of cores (1 c to 4 c), showing super-linear, linear (theory), ideal, saturated and "try again" (slowdown) curves]
First step: do it serial
• Profiling + hardware counters
• Compiler options
• Help the compiler
• SIMD (SSE)
• Check memory management
• Are all the Flops useful?
[Chart: elapsed time on 1 core, before and after serial tuning]
Apps classification and performance analysis
Applications span a spectrum from memory-bound ("Stream") to CPU-bound ("HPL"); real-world applications mix both, with part of the code doing ops and part doing memory accesses. A simple time model:

CPU_ops% x (cycles/instr) + Mem_ops% x [ (Lx_hit% x cycles/access) + (Lx_miss% x cycles/access) ]

• The CPU term is bounded by 1 / IPC_max; it improves with higher frequency and the new SSE4
• The memory term is the number of cycles needed to access memory; it improves with cache size and memory BW
Assumptions: 1 CPU, no I/O, no comms
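The model above is easy to play with numerically; a sketch in which all names and example numbers are mine, not the slide's:

```python
# Sketch of the slide's time model: average cycles per instruction as a
# compute part (bounded by 1/IPC_max) plus a memory part weighted by
# cache hit/miss rates and their latencies.
def cycles_per_instr(cpu_frac, ipc_max, mem_frac, hit_frac, hit_cyc, miss_cyc):
    compute = cpu_frac * (1.0 / ipc_max)
    memory = mem_frac * (hit_frac * hit_cyc + (1.0 - hit_frac) * miss_cyc)
    return compute + memory

# Illustrative: 70% compute at IPC 4; 30% memory ops with 95% hits at
# 4 cycles and misses at 200 cycles.
cpi = cycles_per_instr(0.7, 4, 0.3, 0.95, 4, 200)
print(round(cpi, 3))
```

Even with 95% hits, the miss term dominates the total: this application is memory bound, and bigger caches or more memory BW help more than frequency does.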
Intel® Software Development Products
• Intel® Compilers - the best way to get application performance on Intel® multi-core processors, both for C/C++ and Fortran
• Intel® VTune™ Performance Analyzers - identify bottlenecks and optimize multi-core performance
• Intel® Performance Libraries - highly optimized, thread-safe, multimedia and HPC math functions
C++ and Fortran Compilers 11.0
OS support: Windows* (IA-32, IA-64, Intel® 64), Linux* (IA-32, IA-64, Intel® 64), Mac OS* X (IA-32, Intel® 64)
• OpenMP 3.0 support
• C++ lambda functions
• Windows* products: Microsoft Visual Studio* 2005 and 2008 support
• Linux* products: Eclipse* CDT 4.0 support
• Windows Server* 2008 support
• Parallel debugger for IA-32 / Intel® 64 Linux
• Option to emit front-end diagnostics related to threading
• /MP option for generating compilation sessions in parallel
• Implementation of valarray using IPP for array notation
• Improved install framework based on Intel® PSET 2.0
• Default to be Intel® Pentium® 4 (QxW) compatible
• Increased F2003 feature support beyond 10.1: ENUMERATOR, MIN/MAX and friends, IEEE FP, ASSOCIATE, PROCEDURE POINTER, ABSTRACT INTERFACE, TYPE EXTENDS, struct ctor change, array ctor changes, ALLOCATE scalar
• Native Intel® 64 compiler (Win/Linux)
• Compilers available only in "Pro" packaging (no more "Standard")
Intel® Math Kernel Library 10.0
The flagship math processing library
• Simplify multi-threaded application development
 – Extensively threaded math functions with excellent scaling
 – New threading in vector math functions
 – New OpenMP compatibility library supports Microsoft and GNU OpenMP implementations
• Maximize application performance
 – Automatic runtime processor detection ensures great performance on whatever processor the application runs on
 – New optimizations for the latest Intel processors
 – New cluster functionality is now standard
Supported: Windows*, Linux*, Mac*; IA-32, Intel 64, IA-64; multicore
MKL 10.0
Math Functions Using SVML Library
Short vector math library (SVML) provides efficient software implementations: – sin/cos/tan – asin/acos/atan – sinh/cosh/tanh – asinh/acosh/atanh – log10/ln – exp/pow
– and many more
Intel® VTune™ Analyzer 9.0
Quickly find performance bottlenecks
• Simplify multi-threaded application development
 – Tune process- or thread-parallel code
 – Works on standard debug builds without recompiling
• Maximize application performance
 – Stall cycle accounting for Core™ 2 Duo and Core™ 2 Quad processors
 – Low-overhead sampling
 – Graphical call graph
 – View results on source or assembly
• Includes Intel® Thread Profiler
Supported: Windows*, Linux*, Mac*; IA-32, Intel 64, IA-64; multicore
Whatif.intel.com
An opportunity to kick the tires, give feedback and experiment: we put you "in the loop early", before we fold features into final products.
Also: a direct connection with the developers and researchers at Intel.
• Intel® Adaptive Spike-Based Solver
• Cluster OpenMP* for Intel® Compilers
• Intel® Platform Modeling with Machine Learning
• Intel® Decimal Floating Point Math Library
• Intel® Location Technologies Software Development Kit 1.0 (LTSDK)
• Intel® C++ Parallelism Exploration Compiler, Prototype Edition
• Integrated Debugger for Java*/JNI Environments
• Intel® Performance Tuning Utility 3.0
Intel® Performance Tuning Utility 3.0
• NEW in v3.0
 – Result-difference extension
 – Semi-automated compiler comparison
 – Events per CPU
 – Data-access profiling updates
 – Module aliases
 – CPU frequency changing
 – Heap profiling GUI
 – Instrumentation-based call graph and call count
• EXISTING from the previous version
 – Statistical call graph
 – Basic block analysis
 – Events-over-IP graph
 – Loop analysis
Learn more at: http://whatif.intel.com
Intel® Performance Tuning Utility 3.0
Agenda (for next time)
• Compiler
• Reports
• Vectorization
• Inlining
• PGO
• Blocking
• Unrolling
• Floating point
References
• "The Software Vectorization Handbook" by Aart Bik
• "The Software Optimization Cookbook" by Richard Gerber
• Intel® C++ Compiler User's Guide
• IA-32 Optimization Reference Manual
• "High Performance Computing", 2nd Edition, by Kevin Dowd and Charles Severance
• "Floating Point Calculations and the ANSI C, C++ and Fortran Standard": http://cache-www.intel.com/cd/00/00/33/01/330130_330130.pdf
• Details on the SSE4 instruction set: http://cache-www.intel.com/cd/00/00/32/26/322663_322663.pdf
• Web-based and classroom training: www.intel.com/software/college
• White papers and technical notes: www.intel.com/ids, www.intel.com/software/products
• And have a look at http://whatif.intel.com
• Product support resources: www.intel.com/software/products/support
Parallel computing: Amdahl's law
[Chart: elapsed time vs. number of cores (1 c, 2 c, 8 c) - in theory the parallel part shrinks as cores are added; in practice the serial part remains and overhead grows]
Speedup using Amdahl's law
With an infinite number of processors: lim(Speedup) = 1 / (serial fraction)
serial %: 5% | 10% | 15% | 20% | 25%
parallel %: 95% | 90% | 85% | 80% | 75%
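The quoted limit follows directly from Amdahl's law; a small sketch tabulating the serial fractions above:

```python
# Amdahl's law: with serial fraction s, S(n) = 1 / (s + (1 - s)/n),
# and S(n) -> 1/s as n -> infinity.
def amdahl_speedup(s, n):
    return 1.0 / (s + (1.0 - s) / n)

for s in (0.05, 0.10, 0.15, 0.20, 0.25):
    print(f"serial {s:.0%}: S(1024) = {amdahl_speedup(s, 1024):.2f}, "
          f"limit = {1 / s:.1f}")
```

Even at 1024 processors the speedup is already essentially pinned at the 1/s ceiling: 20x for 5% serial, only 4x for 25% serial.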
(Zoom) With an infinite number of processors: lim(Speedup) = 1 / (serial fraction)
With an infinite number of processors, 10% serial / 90% parallel gives a maximum speedup of 10.
serial %: 2% | 5% | 10%
parallel %: 98% | 95% | 90%
Speedup curves from Amdahl's law
[Chart: speedup vs. number of processors (up to 1024 p) for parallel fractions of 100%, 99.999% and 99.992%]
Amdahl's Hypotheses (among others)
Amdahl's law hypotheses: