Computer architecture in the past and next quarter century
CNNA 2010, Berkeley, Feb 3, 2010
H. Peter Hofstee, Cell/B.E. Chief Scientist, IBM Systems and Technology Group

Page 2: CMOS Microprocessor Trends, The First ~25 Years (Good old days)
[Figure: Log(Performance) of a single thread versus time, rising through 2005 on more active transistors and higher frequency.]

Page 3: Single-thread performance history
[Figure: SPECint performance relative to the VAX-11/780, 1978-2006, from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]
- VAX: 25%/year, 1978 to 1986
- RISC + x86: 52%/year, 1986 to 2002
- ??%/year since 2002, with a 3X gap marked against the earlier trend

Page 4: CMOS devices hit a scaling wall (a first-order power sketch follows Page 11 below)
[Figure: Power density (W/cm^2) versus gate length (microns), 1994-2004. Both active power and passive (leakage) power rise steeply as gate length shrinks, approaching the air-cooling limit. Isaac et al., IBM.]

Page 5: Microprocessor Trends
[Figure: Log(Performance) versus time; the single-thread curve (more active transistors, higher frequency) flattens around 2005.]

Page 6: Microprocessor Trends
[Figure: Same chart with a multi-core curve added, continuing the growth from 2005 toward 2015(?).]

Page 7: Multicore Power Server Processors
- Power 4 (2001): introduces dual core
- Power 5 (2004): dual core, 4 threads
- Power 7 (2009): 8 cores, 32 threads

Page 8: Why are (shared memory) CMPs dominant?
- A new system delivers nearly twice the throughput performance of the previous one without application-level changes.
- Applications do not degrade in performance when ported to a next-generation processor. This is an important factor in markets where it is not possible to rewrite all applications for a new system, a common case.
- Applications benefit from more memory capacity and more memory bandwidth when ported, even if they do not (optimally) use all the available cores.
- Even when a single application must be accelerated, large portions of code can be reused.
- Design cost is reduced, at least relative to the scenario where all available transistors are used to build a single processor.

Page 9: PDSOI optimization results (D. Frank, C. Tyberg)
[Figure: Estimated clock frequency (GHz) versus date, 90 nm through 11 nm. Optimizing each core for maximum performance drives power density from 50 W/cm^2 at 65 nm to 65 W/cm^2 at 45 and 32 nm, 110 W/cm^2 at 22 nm, 260 W/cm^2 at 15 nm, and 750 W/cm^2 at 11 nm. Holding power density constant at 25 W/cm^2 still allows a constant performance improvement of 20% per generation.]

Page 10: Microprocessor Trends
[Figure: Single-thread and multi-core curves, with the transition around 2005 and multi-core growth toward 2015(?).]

Page 11: Microprocessor Trends
[Figure: A hybrid/heterogeneous curve is added beyond the multi-core curve, carrying growth from roughly 2015(?) toward 2025(??).]
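To make the power wall of Pages 4 and 9 concrete, here is a small, hedged C sketch of the usual first-order CMOS switching-power model, P = a * C * V^2 * f. The function name and the numeric inputs are illustrative assumptions, not data from the slides; they only show why raising frequency without voltage headroom pushes power density past the air-cooling limit.

    #include <stdio.h>

    /* First-order CMOS switching power per core: P = a * C * V^2 * f.
     * Under classical (Dennard) scaling, C and V shrank with each node and
     * power density stayed roughly flat; once V stopped scaling, the V^2 * f
     * term drives power density toward the air-cooling limit on Page 4. */
    static double switching_power(double activity, double cap_farads,
                                  double vdd_volts, double freq_hz)
    {
        return activity * cap_farads * vdd_volts * vdd_volts * freq_hz;
    }

    int main(void)
    {
        /* Illustrative, made-up per-core numbers (not from the slides). */
        double p_old  = switching_power(0.1, 1.0e-9, 1.3, 2.0e9); /* older node */
        double p_next = switching_power(0.1, 0.7e-9, 1.2, 4.0e9); /* next node:
                                          2x clock, little voltage headroom   */
        printf("older node: %.2f W   next node at 2x clock: %.2f W\n",
               p_old, p_next);
        /* The next node dissipates more power in roughly half the silicon
         * area, so its power density more than doubles: the scaling wall. */
        return 0;
    }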
Page 12: Heterogeneous Power Architecture Processors
- Xilinx Virtex II-Pro (2002): 1-2 Power cores + FPGA
- Cell Broadband Engine (2005): 1 Power core + 8 accelerator cores
- Rapport Kilocore (2006): 1 Power core + 256 accelerator cores

Page 13: Cell Broadband Engine, a heterogeneous multiprocessor (a minimal SPU-side code sketch follows the QPACE figure below)
- Power Processor Element (PPE): general purpose, 64-bit Power Architecture with VMX, runs full-fledged operating systems, two levels (L1, L2) of globally coherent cache.
- Synergistic Processor Elements (SPE0-SPE7): SPU optimized for computation density, 128-bit-wide SIMD, fast local memory (local store), globally coherent DMA.
[Figure: block diagram. Each SPE contains an SPU, local store (LS), AUC, and MFC; the eight SPEs and the PPE (PPU with L1 and L2) attach to the Element Interconnect Bus (EIB, up to 96 B/cycle) over 16 B/cycle ports, which also connects the memory interface controller (MIC, dual XDR) and the bus interface controller (BIC, I/O).]

Page 14: Memory Managing Processor vs. Traditional General Purpose Processor
[Figure: comparison of the Cell/B.E. (IBM) with contemporary AMD and Intel general-purpose processors.]

Page 15: [Figure-only slide; no text recovered.]

Page 16: Current Cell: Integer Workloads
- Breadth-first search: Villa, Scarpazza, Petrini, Peinador, IPDPS 2007
- Sort: Gedik, Bordawekar, Yu (IBM)
- MapReduce: Sangkaralingam, De Kruijf, Oct. 2007

Page 17: Three Generations of Cell/B.E. (Takahashi et al.)
[Die photos: 90 nm SOI, 236 mm^2; 65 nm SOI, 175 mm^2; 45 nm SOI, 116 mm^2.]

Cell/B.E. die
  Generation   W (mm)   H (mm)   Area (mm^2)   Scaling from 90nm   Scaling from 65nm
  90nm         19.17    12.29    235.48        100.0%              -
  65nm         15.59    11.20    174.61        74.2%               100.0%
  45nm         12.75     9.06    115.46        49.0%               66.1%

Synergistic Processor Element (SPE)
  Generation   W (mm)   H (mm)   Area (mm^2)   Scaling from 90nm   Scaling from 65nm
  90nm          2.54     5.81     14.76        100.0%              -
  65nm          2.09     5.30     11.08        75.0%               100.0%
  45nm          1.59     4.09      6.47        43.9%               58.5%

Power Processor Element (PPE)
  Generation   W (mm)   H (mm)   Area (mm^2)   Scaling from 90nm   Scaling from 65nm
  90nm          4.44     6.05     26.86        100.0%              -
  65nm          3.50     5.60     19.60        73.0%               100.0%
  45nm          2.66     4.26     11.32        42.1%               57.7%

Page 18: Cell Broadband Engine-based CE Products
- Sony PlayStation 3 and PS3
- Toshiba Regza Cell

Page 19: IBM and its Partners are Active Users of Cell Technology
- Three generations of server blades, accompanied by three SDK releases: IBM QS20, QS21, QS22
- Two generations of PCIe Cell accelerator boards: CAB (Mercury), PXCAB (Mercury/Fixstars/Matrix Vision)
- 1U form factor: Mercury Computer, TPlatforms
- Custom boards: Hitachi Medical (ultrasound), other medical and defense
- World's first 1 PFlop computer: LANL Roadrunner
- Top 7 green systems on the Green 500 list

[Figure: PlayStation 3 high-level organization and PS3 cluster. The Cell Broadband Engine connects to XDR memory, the RSX GPU, and an I/O bridge with 1 Gb Ethernet.]

[Figure: Roadrunner accelerated node and system. Each node pairs AMD Opteron processors (with DDR2 memory) with IBM PowerXCell 8i processors (with DDR2 memory) through PCIe I/O bridges, plus an InfiniBand adapter to the cluster fabric.]

[Figure: QPACE PowerXCell 8i node card and system. QPACE processor node card (Rialto): IBM PowerXCell 8i, DDR2 memory, Xilinx Virtex 5 FPGA.]
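To make the SPE model of Page 13 concrete, here is a minimal SPU-side sketch in the style of the Cell SDK (spu_mfcio.h, spu_intrinsics.h): it DMAs a block from system memory into the local store, scales it with the 128-bit SIMD unit, and DMAs the result back. The buffer size, tag number, and the assumption that argp carries the effective address of the data are illustrative; exact type and macro names may differ across SDK versions.

    #include <spu_intrinsics.h>
    #include <spu_mfcio.h>

    /* Hypothetical 4 KB work buffer in the 256 KB local store,
     * 128-byte aligned for efficient MFC transfers. */
    static float buf[1024] __attribute__((aligned(128)));

    int main(unsigned long long speid, unsigned long long argp,
             unsigned long long envp)
    {
        (void)speid; (void)envp;
        unsigned int tag = 1;

        /* DMA the data in from the effective address passed by the PPE. */
        mfc_get(buf, argp, sizeof(buf), tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();            /* wait for the DMA to finish */

        /* 128-bit SIMD: four single-precision floats per operation. */
        vec_float4 scale = spu_splats(2.0f);
        vec_float4 *v = (vec_float4 *)buf;
        for (int i = 0; i < 1024 / 4; i++)
            v[i] = spu_mul(v[i], scale);

        /* DMA the result back to system memory and wait again. */
        mfc_put(buf, argp, sizeof(buf), tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();
        return 0;
    }

The PPE side (not shown) would typically create an SPE context with libspe2 and pass the buffer's effective address as argp; overlapping computation with double-buffered DMA is the usual next refinement.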
[Figure: Green 500 list (Green500.org).]

Page 24: Performance and Productivity Challenges require a Multi-Dimensional Approach
- Highly scalable systems
- Highly productive systems
- Multi-core systems
- Hybrid systems
- POWER
Comprehensive (holistic) system innovation and optimization.

Page 25: Next Era of Innovation - Hybrid Computing
[Figure: Roadmap from the Symmetric Multiprocessing Era (p6, p7; traditional and throughput-oriented designs; "technology out"; driven by cores/threads) to the Hybrid Computing Era (Today, pNext 1.0, pNext 2.0; Cell, computational, and BlueGene technology; "market in"; driven by workload consolidation).]

All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only. Any reliance on these Statements of General Direction is at the relying party's sole risk and will not create liability or obligation for IBM.

Page 26: HPC Cluster Directions
[Figure: Roadmap of performance capability versus time, 2007 through 2018-19. Capacity clusters (Linux clusters; Power, x86-64; less demanding communication) and capability machines (Power, AIX/Linux) with extended configurability progress, via BG/P, Roadrunner, PERCS, and BG/Q with accelerators, toward petaflop (PF) and exascale (ExaF) systems with accelerators and targeted configurability.]

Page 27: Converging Software (Much Harder!)
Software-hardware efficiency is driven by:
- Number of operations (we all learn this in school)
- Degree of thread parallelism
- Degree of data parallelism
- Degree of locality (code and data; data more important)
- Degree of predictability (code and data; data more important)
We need a new portable framework:
- Allow the portable format to retain enough information for run-time optimization.
- Allow run-time optimization to heterogeneous hardware for each of the parameters above.

Page 28: Two Standards for Programming the Node
Two standards are evolving from different sides of the market (a minimal sketch contrasting the two follows at the end of this section):

  OpenMP                        OpenCL
  CPUs                          GPUs
  MIMD                          SIMD
  Scalar code bases             Scalar/vector code bases
  Parallel for loop             Data parallel
  Shared memory model           Distributed shared memory model

Concurrency and locality span the CPU, SPE, and GPU.

Page 29: Hybrid/Accelerator Node/Server Model
Two address spaces:
- Host cores (base cores 1..N) with host memory (system memory)
- Compute cores (user-mode cores, UMC 1..M) with device global memory (device memory)

Page 30: Heterogeneous Node/Server Model
Single memory system, one address space:
- Host cores (base cores 1..N) and compute cores (user-mode cores, UMC 1..M) share one system memory that serves as both host memory and device global memory.

Page 31: Cell Broadband Engine
Unified host and device memory; zero copies between them.
- The PPE is the host; system memory serves as both host memory and device global memory.
- Each SPE (1..N) plays the role of a compute unit running a work-item, with its local store serving as local/private memory.
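As a concrete companion to the two programming standards of Page 28, here is a hedged SAXPY sketch written both ways. The function and kernel names are illustrative assumptions: the OpenMP version relies on the shared-memory model and a parallel for loop over an existing scalar code base, while the OpenCL kernel expresses the same computation as data-parallel work-items and leaves data placement to the host runtime.

    /* OpenMP: shared memory, parallel loop over an existing scalar code base. */
    void saxpy_omp(int n, float a, const float *x, float *y)
    {
        /* The runtime splits iterations across CPU threads; x and y live in
         * the single shared address space, so no explicit data movement. */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    /* OpenCL: one work-item per element. In practice this kernel lives in a
     * separate .cl source string compiled by the OpenCL runtime; the host
     * enqueues it over an n-element global range and decides how x and y
     * reach device memory. */
    __kernel void saxpy_cl(const float a,
                           __global const float *x,
                           __global float *y)
    {
        size_t i = get_global_id(0);
        y[i] = a * x[i] + y[i];
    }

On a unified-memory design such as the Cell/B.E. node model of Page 31, an implementation can in principle service such a kernel without copying buffers between host and device memory, since both names refer to the same system memory.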