POWER5, Federation and Linux: System Architecture


Charles Grassl, IBM
August 2004
© 2004 IBM Corporation

Agenda
– Architecture
  – Processors: POWER4, POWER5
  – System: nodes, systems
– High Performance Switch

Processor Design
POWER4:
– Two processors per chip
– L1: 32 Kbyte data, 64 Kbyte instruction
– L2: 1.45 Mbyte per processor pair
– L3: 32 Mbyte per processor pair, shared by all processors

POWER5:
– Two processors per chip
– L1: 32 Kbyte data, 64 Kbyte instruction
– L2: 1.9 Mbyte per processor pair
– L3: 36 Mbyte per processor pair, shared by the processor pair; an extension of the L2 cache
– Simultaneous multithreading
– Enhanced memory loads and stores
– Additional rename registers

POWER4: Multi-processor Chip
Each processor has its own:
– L1 caches
– Registers
– Functional units
Each chip shares:
– L2 cache
– L3 cache, shared across the node
– Path to memory

POWER5: Multi-processor Chip
Each processor has its own:
– L1 caches
– Registers
– Functional units
Each chip shares:
– L2 cache
– L3 cache, NOT shared across the node
– Path to memory
– Memory controller

POWER4: Features
– Multi-processor chip
– High clock rate: high computation burst rate
– Three cache levels: bandwidth, latency hiding
– Shared memory: large memory size

POWER5: Features
– Performance
  – Additional rename registers
  – Cache improvements: L3 runs at 1/2 processor frequency (POWER4: 1/3)
  – Simultaneous Multi-Threading (SMT)
  – Memory bandwidth
– Virtualization
  – Partitioning
– Dynamic power management

POWER4, POWER5 Differences

Feature                POWER4 design          POWER5 design          Benefit
L1 cache               2-way associative,     4-way associative,     Improved L1 cache
                       FIFO replacement       LRU replacement        performance
L2 cache               8-way associative,     10-way associative,    Fewer L2 cache misses;
                       1.44 Mbyte             1.9 Mbyte              better performance
L3 cache               8-way associative,     12-way associative,    Better cache
                       32 Mbyte,              36 Mbyte,              performance
                       118 clock cycles       ~80 clock cycles
Memory bandwidth       4 Gbyte/s per chip     ~16 Gbyte/s per chip   4X improvement;
                                                                     faster memory access
Simultaneous           No                     Yes                    Better processor
multi-threading                                                      utilization
Processor addressing   1 processor            1/10 of processor      Better usage of
                                                                     processor resources
Die size               412 mm²                389 mm²                50% more transistors
                                                                     in the same space

Modifications to POWER4 System Structure
[Diagram: two dual-processor chips with L2, L3 and fabric controllers; the memory controllers and memory are repositioned relative to the POWER4 design]

POWER5 Challenges
– Scaling from 32-way to logical 128-way
  – Large system effects, 2 Tbyte of memory
– 1.9 Mbyte L2 shared across 4 threads (SMT)
  – L2 miss rates
– 36 Mbyte private L3 versus 128 Mbyte shared L3
  – Will have to watch the multi-programming level
  – Read-only data will get replicated
– Local/remote memory latencies
  – Worst-case remote memory latency is 60% higher than local

Effect of Features: Registers
– Additional rename registers
  – POWER4: 72 total registers
  – POWER5: 110 total registers
– Matrix multiply:
  – POWER4: 65% of burst rate
  – POWER5: 90% of burst rate
– Enhances performance of computationally intensive kernels

Effect of Rename Registers
[Chart: Mflop/s for polynomial evaluation and MATMUL on a 1.5 GHz POWER4 versus a 1.65 GHz POWER5]
– "Tight" algorithms were previously register bound
– Increased floating point efficiency
– Examples:
  – Matrix multiply (LINPACK)
  – Polynomial calculation (Horner's rule)
  – FFTs

Memory Bandwidth
[Chart: Mbyte/s versus array size for the kernel s = s + A(i), 1.65 GHz POWER5 versus 1.5 GHz POWER4]

Memory Bandwidth: L1 – L2 Cache Regime
[Chart: the same kernel for small arrays; the curve's plateaus correspond to the L1 and L2 cache regimes]

Memory Bandwidth: L2 Cache Regime
[Chart: the same kernel for larger arrays; the plotted region lies within the L2 cache]

L2 and L3 Cache Improvement
– L3 cache enhances bandwidth
  – Previously it only hid latency
  – The new L3 cache behaves like a "2.5 cache"
– Larger blocking sizes

Simultaneous Multi-Threading
– Two threads per processor
  – Threads execute concurrently
– Threads share:
  – Caches
  – Registers
  – Functional units

POWER5 Simultaneous Multi-Threading (SMT)
[Diagram: execution pipes FX0, FX1, LS0, LS1, FP0, FP1, BRX, CRL across processor cycles; POWER4 (single threaded) versus POWER5 (simultaneous multi-threaded), showing which thread occupies each pipe in each cycle]
– Presents an SMP programming model to software
– Natural fit with the superscalar out-of-order execution core

Conventional Multi-Threading
[Diagram: functional units FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL over time, with thread 0 and thread 1 occupying alternating time slices]
– Threads alternate
  – Nothing shared

Simultaneous Multi-Threading
[Diagram: the same functional units, with thread 0 and thread 1 issuing in the same cycle]
– Simultaneous execution
  – Shared registers
  – Shared functional units

Manipulating SMT
– Administrator function
– Dynamic
– System-wide

Function   Command
View       /usr/sbin/smtctl
Off        /usr/sbin/smtctl -m off now
On         /usr/sbin/smtctl -m on now

Dynamic Power Management
[Photos taken with a thermal-sensitive camera while a prototype POWER5 chip was under test: no power management versus dynamic power management, for single-threaded and simultaneous multi-threaded operation]

Multi-Chip Modules (MCMs)
– Four processor chips are mounted on a module
  – Unit of manufacture
  – Smallest freestanding system
– Multiple modules form a node
– Also available: Single Chip Modules
  – Lower cost
  – Lower bandwidth

POWER4 Multi-Chip Module (MCM)
– 4 chips: 8 processors, 2 processors per chip
– >35 Gbyte/s in each chip-to-chip connection
– 2:1 MHz chip-to-chip busses
– 6 connections per MCM
– Logically shared L2s and L3s
– Data bus (memory, inter-module): 3:1 MHz, MCM to MCM

POWER5 Multi-Chip Module (MCM)
– 95 mm × 95 mm
– Four POWER5 chips
– Four cache chips
– 4,491 signal I/Os
– 89 layers of metal

Multi-Chip Module (MCM) Architecture
POWER4:
– 4 processor chips, 2 processors per chip
– 8 off-module L3 chips; the L3 cache is controlled by the MCM and logically shared across the node
– 4 memory controller chips
– 16 chips in total
POWER5:
– 4 processor chips, 2 processors per chip
– 4 L3 cache chips; each L3 is used by a processor pair as an "extension" of its L2
– 8 chips in total

System Design with POWER4 and with POWER5
– Multi-Chip Module (MCM)
  – Multiple MCMs per system node
– Other issues:
  – L3 cache sharing
  – Memory controller

POWER4 Processor Chip
[Diagram: two cores with store caches; 1.4 Mbyte shared L2 cache; 32 Mbyte L3 cache (a portion of the 128 Mbyte node-shared L3); memory controller (MC), GX bus, fabric bus controller (FBC), NCU; chip busses A, X, Y, Z]

POWER4 Multi-Chip Module
[Diagram: four chips (CPUs 0–7) with 1.5 Mbyte shared L2s and L2/L3 directories surrounding a shared 128 Mbyte L3; chip-to-chip communication links; memory on two sides; GX busses and MCM-to-MCM links at the corners]

POWER4 32-way Plane Topology
[Diagram: MCMs 0–3 connected by S, T, U, V links]
– Uni-directional links
– 2-beat address @ 850 MHz = 425 M addresses/s
– 8-byte data @ 850 MHz (2:1) = 7 Gbyte/s

POWER5 Processor Chip
[Diagram: two cores, each running two threads, with store caches; 1.9 Mbyte L2 cache; 36 Mbyte L3 cache (victim); on-chip memory controller and GX (MC, GX), FBC, NCU; chip busses A, B, X, Y, Z]

POWER5 Multi-Chip Module
[Diagram: four chips (CPUs 0–15 in pairs) with shared L2s, per-chip memory controllers, and 36 Mbyte L3 chips with L3 directories; chip-to-chip communication links; GX busses; MCM-to-MCM and book-to-book links; RIO-2, the High Performance Switch, and InfiniBand attach via the GX busses]

POWER5 64-way Plane Topology
[Diagram: MCMs 0–7 connected by S, T, U, V links plus vertical node links]
– Vertical node links
  – Used to reduce latency and increase bandwidth for data transfers
  – Used as an alternate route for address operations during dynamic ring re-configuration (e.g. concurrent repair/upgrade)
– Uni-directional links
– 2-beat address @ 1 GHz = 500 M addresses/s
– 8-byte data @ 1 GHz (2:1) = 8 Gbyte/s

Upcoming POWER5 Systems
– Early introductions are iSeries
  – Commercial
– Next: small nodes
  – 2, 4, 8 processors
  – SF, ML
– Next, next: iH, H
  – Up to 64 processors

Planned POWER5 iH
– 2/4/6/8-way POWER5
– 2U rack chassis
– Rack: 24" wide × 43" deep
[Diagram: rack drawing with ten 2U chassis]

Auxiliary

Switch Technology
[Table, truncated: Generation | Processors | Internal network | HPS Switch; first row: POWER2 – in lieu of, e.g.]