Programming Model

CSE4351/5351 Parallel Processing
Instructor: Dr. Song Jiang, The CSE Department
[email protected]
http://ranger.uta.edu/~sjiang/CSE4351-5351-summer/index.htm
Lecture: MoTuWeTh 10:30AM - 12:30PM, LS 101
Office hours: Monday 2-3pm at ERB 101

Outline
▪ Introduction
  ➢ What is parallel computing?
  ➢ Why should you care?
▪ Course administration
  ➢ Course coverage
  ➢ Workload and grading
▪ Inevitability of parallel computing
  ➢ Application demands
  ➢ Technology and architecture trends
  ➢ Economics
▪ Convergence of parallel architecture
  ➢ Shared address space, message passing, data parallel, data flow
  ➢ A generic parallel architecture

What is a Parallel Computer?
“A parallel computer is a collection of processing elements that can communicate and cooperate to solve large problems fast” ------ Almasi/Gottlieb
▪ “communicate and cooperate”
  ➢ Node and interconnect architecture
  ➢ Problem partitioning and orchestration
▪ “large problems fast”
  ➢ Programming model
  ➢ Match of model and architecture
▪ Focus of this course:
  ➢ Parallel architecture
  ➢ Parallel programming models
  ➢ Interaction between models and architecture

What is a Parallel Computer? (cont’d)
Some broad issues:
• Resource allocation:
  – How large a collection?
  – How powerful are the elements?
• Data access, communication, and synchronization:
  – How are data transmitted between processors?
  – How do the elements cooperate and communicate?
  – What are the abstractions and primitives for cooperation?
• Performance and scalability:
  – How does it all translate into performance?
  – How does it scale?

Why Study Parallel Computing?
▪ Inevitability of parallel computing
  ➢ Fueled by application demand for performance
    • Scientific: weather forecasting, pharmaceutical design, and genomics
    • Commercial: OLTP, search engines, decision support, data mining
    • Scalable web servers
  ➢ Enabled by technology and architecture trends
    • Limits to sequential CPU, memory, and storage performance
      o Parallelism is an effective way of utilizing the growing number of transistors
    • Low incremental cost of supporting parallelism
▪ Convergence of parallel computer organizations
  ➢ Driven by technology constraints and economies of scale
    • Laptops and supercomputers share the same building blocks
  ➢ Growing consensus on fundamental principles and design tradeoffs

Why Study Parallel Computing? (cont’d)
• Parallel computing is ubiquitous:
  ➢ Multithreading
  ➢ Simultaneous multithreading (SMT), a.k.a. hyper-threading
    • e.g., Intel® Pentium 4 Xeon
  ➢ Chip multiprocessor (CMP), a.k.a. multi-core processor
    • Intel® Core™ Duo; Xbox 360 (three cores, each with SMT); AMD quad-core Opteron
    • IBM Cell processor, with as many as 9 cores, used in the Sony PlayStation 3, Toshiba HDTV sets, and the IBM Roadrunner HPC system
  ➢ Symmetric multiprocessor (SMP), a.k.a. shared-memory multiprocessor
    • e.g., Intel® Pentium Pro Quad; motherboards with multiple sockets
  ➢ Cluster-based supercomputers
    • IBM Blue Gene/L (65,536 modified PowerPC 440 chips, each with two cores)
    • IBM Roadrunner (6,562 dual-core AMD Opteron® chips and 12,240 Cell chips)

Course Coverage
• Parallel architectures
  Q: Which are the dominant architectures?
  A: Small-scale shared memory (SMPs) and large-scale distributed memory.
• Programming models
  Q: How do we program these architectures?
  A: Message passing and shared memory models.
• Programming for performance
  Q: How are programming models mapped to the underlying architecture, and how can this mapping be exploited for performance?
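The two models named above can be previewed concretely. Below are two minimal sketches (illustrative examples written for these notes, not taken from the course materials): the first uses MPI, a standard message-passing library, in which each process owns a private address space and data moves only through explicit sends and receives; the second uses POSIX threads, in which all threads share one address space and coordinate through synchronization primitives such as mutexes.

    /* Message passing: compile with mpicc and run with "mpirun -np 2".
       Each rank has its own private copy of every variable. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, token = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size >= 2) {
            if (rank == 0) {
                token = 42;
                MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* explicit communication */
            } else if (rank == 1) {
                MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("rank 1 received %d\n", token);
            }
        }
        MPI_Finalize();
        return 0;
    }

    /* Shared address space: compile with "cc ... -lpthread".
       All threads read and write the same counter variable. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static long counter = 0;                         /* visible to every thread */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);                   /* synchronize instead of sending messages */
        counter++;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);          /* prints 4: all updated one variable */
        return 0;
    }

The contrast is the point: in the first sketch, data movement is visible in the program text; in the second, communication happens implicitly through loads and stores, and the programmer's effort shifts to synchronization.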
Course Administration
• Course prerequisites
• Course textbooks
• Class attendance
• Required work and grading policy
• Late policy
• Academic honesty (see details on the syllabus)

Outline
▪ Introduction
  ➢ What is parallel computing?
  ➢ Why should you care?
▪ Course administration
  ➢ Course coverage
  ➢ Workload and grading
▪ Inevitability of parallel computing
  ➢ Application demands
  ➢ Technology and architecture trends
  ➢ Economics
▪ Convergence of parallel architecture
  ➢ Shared address space, message passing, data parallel, data flow, systolic
  ➢ A generic parallel architecture

Inevitability of Parallel Computing
• Application demands:
  ➢ Our insatiable need for computing cycles in challenge applications
• Technology trends:
  ➢ The number of transistors on a chip is growing rapidly
  ➢ Clock rates are expected to go up only slowly
• Architecture trends:
  ➢ Instruction-level parallelism is valuable but limited
  ➢ Coarser-level parallelism, as in multiprocessors, is the most viable approach
• Economics:
  ➢ Low incremental cost of supporting parallelism

Application Demands: Scientific Computing
Large parallel machines are a mainstay in many industries:
➢ Petroleum: reservoir analysis
➢ Automotive: crash simulation, combustion efficiency
➢ Aeronautics: airflow analysis, structural mechanics, electromagnetism
➢ Computer-aided design
➢ Pharmaceuticals: molecular modeling
➢ Visualization: entertainment, architecture
➢ Financial modeling: yield and derivative analysis
(Figure note: 2,300 CPU years of 2.8 GHz Intel Xeon time, at a rate of approximately one hour per frame.)

Simulation: The Third Pillar of Science
Traditional scientific and engineering paradigm:
1) Do theory or paper design.
2) Perform experiments or build the system.
Limitations:
– Too difficult: e.g., building large wind tunnels.
– Too expensive: e.g., building a throw-away passenger jet.
– Too slow: e.g., waiting for climate or galactic evolution.
– Too dangerous: e.g., weapons, drug design, climate experimentation.
Computational science paradigm:
3) Use high-performance computer systems to simulate the phenomenon, based on known physical laws and efficient numerical methods.
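The computational-science paradigm is easy to demonstrate at toy scale. The sketch below (an illustration written for these notes, not part of the slides) integrates the 1D heat-diffusion equation with an explicit finite-difference stencil: space is discretized into grid points, and each time step computes a point's next value from its current neighbors. Within one time step the grid-point updates are independent of each other, which is exactly the kind of work the parallel machines discussed later can divide among processors.

    /* 1D heat diffusion: unew[i] = u[i] + alpha * (u[i-1] - 2u[i] + u[i+1]).
       alpha = dt*k/dx^2 must stay below 0.5 for numerical stability. */
    #include <stdio.h>

    #define N     100                      /* grid points */
    #define STEPS 1000                     /* time steps */

    int main(void)
    {
        double u[N] = {0}, unew[N] = {0};
        const double alpha = 0.25;

        u[N / 2] = 100.0;                  /* initial heat spike in the middle */
        for (int t = 0; t < STEPS; t++) {
            for (int i = 1; i < N - 1; i++)          /* independent updates */
                unew[i] = u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[i + 1]);
            unew[0] = unew[N - 1] = 0.0;             /* fixed-temperature boundaries */
            for (int i = 0; i < N; i++)
                u[i] = unew[i];
        }
        printf("center temperature after %d steps: %f\n", STEPS, u[N / 2]);
        return 0;
    }

Real climate and crash codes differ enormously in physics and scale, but they share this shape: a discretized domain, a regular update rule, and a time loop, which is why their demand for cycles grows with resolution.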
Challenge Computation Examples
Science:
• Global climate modeling
• Astrophysical modeling
• Biology: genomics, protein folding, drug design
• Computational chemistry
• Computational material sciences and nanosciences
Engineering:
• Crash simulation
• Semiconductor design
• Earthquake and structural modeling
• Computational fluid dynamics (airplane design)
Business:
• Financial and economic modeling
Defense:
• Nuclear weapons: test by simulation
• Cryptography

Units of Measure in HPC
High-performance computing (HPC) units are:
• Flop/s: floating-point operations per second
• Bytes: size of data
Typical sizes are millions, billions, trillions...
  Mega  Mflop/s = 10^6 flop/sec   Mbyte = 10^6 bytes (also 2^20 = 1,048,576)
  Giga  Gflop/s = 10^9 flop/sec   Gbyte = 10^9 bytes (also 2^30 = 1,073,741,824)
  Tera  Tflop/s = 10^12 flop/sec  Tbyte = 10^12 bytes (also 2^40 = 1,099,511,627,776)
  Peta  Pflop/s = 10^15 flop/sec  Pbyte = 10^15 bytes (also 2^50 = 1,125,899,906,842,624)
  Exa   Eflop/s = 10^18 flop/sec  Ebyte = 10^18 bytes

Global Climate Modeling Problem
The problem is to compute:
f(latitude, longitude, elevation, time) → temperature, pressure, humidity, wind velocity
Approach:
• Discretize the domain, e.g., with a measurement point every 1 km
• Devise an algorithm to predict the weather at time t+1 given time t
Source: http://www.epm.ornl.gov/chammp/chammp.html

Example: Numerical Climate Modeling at NASA
• Weather forecasting over the US landmass: 3000 x 3000 x 11 miles
• Assuming 0.1-mile cubic elements → 10^11 cells
• Assuming a 2-day prediction in 30-minute steps → about 100 time steps
• Computation: partial differential equations with a finite-element approach
• A single element's computation takes 100 flops
• Total flops: 10^11 x 100 x 100 = 10^15 (i.e., one petaflop)
• Suppose uniprocessor power is 10^9 flop/sec (a gigaflop per second)
• The run takes 10^6 seconds, or about 280 hours. (The forecast is nine days late!)
• 1000 processors at 10% efficiency → around 3 hours (this arithmetic is reproduced in the code sketch below)
• IBM Roadrunner → 1 second?!
• State-of-the-art models require integration of atmosphere, ocean, sea-ice, and land models, and more; models demanding even more computational resources will be applied.

(Figure: high-resolution climate modeling on NERSC-3 – P. Duffy, et al., LLNL.)

Commercial Computing
• Parallelism benefits many applications:
  ➢ Database and web servers for online transaction processing
  ➢ Decision support
  ➢ Data mining and data warehousing
  ➢ Financial modeling
• The scale is not necessarily as large, but parallelism is more widely used.
• Computational power determines the scale of business that can be handled.

Outline
▪ Introduction
  ➢ What is parallel computing?
  ➢ Why should you care?
▪ Course administration
  ➢ Course coverage
  ➢ Workload and grading
▪ Inevitability of parallel computing
  ➢ Application demands
  ➢ Technology and architecture trends
  ➢ Economics
▪ Convergence of parallel architecture
  ➢ Shared address space, message passing, data parallel, data flow, systolic
  ➢ A generic parallel architecture

Tunnel Vision by Experts
“I think there is a world market for maybe five computers.” – Thomas Watson, chairman of IBM, 1943
“There is no reason for any individual to have a computer in their home.” – Ken Olson, president and founder of Digital Equipment Corporation, 1977
“640K [of memory] ought to be enough for anybody.” – Bill Gates, chairman of Microsoft, 1981

Technology Trends: Microprocessor Capacity
The number of transistors on a chip doubles every 18 months (while the costs are halved).
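Returning to the NASA climate example above: the estimate is pure arithmetic and can be checked mechanically. The sketch below (illustrative; it hard-codes the slide's assumptions of 100 flops per element per step, a 1 Gflop/s uniprocessor, and 10% parallel efficiency) prints the unrounded values behind the slide's round numbers.

    /* Back-of-the-envelope check of the climate-modeling estimate. */
    #include <stdio.h>

    int main(void)
    {
        double cells = (3000 / 0.1) * (3000 / 0.1) * (11 / 0.1); /* ~1e11 elements */
        double steps = 2.0 * 24 * 60 / 30;                        /* 2 days at 30-min steps: 96 */
        double flops = cells * 100.0 * steps;                     /* ~1e15 total flops */
        double uni   = flops / 1e9;                               /* seconds on a 1 Gflop/s CPU */
        double par   = flops / (1000 * 0.10 * 1e9);               /* 1000 CPUs at 10% efficiency */

        printf("cells = %.3e, total flops = %.3e\n", cells, flops);
        printf("uniprocessor: %.0f s (about %.0f hours)\n", uni, uni / 3600);
        printf("1000 CPUs at 10%%: %.0f s (about %.1f hours)\n", par, par / 3600);
        return 0;
    }

The unrounded figures come out to roughly 264 hours and 2.6 hours; the slide rounds the step count to 100 and the cell count to 10^11, giving 280 hours and around 3 hours.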
Technology Trends: Transistor Count
(Figure: transistor counts of microprocessors over time.)

Technology Trends
(Figure: relative performance of supercomputers, mainframes, minicomputers, and microprocessors, 1965-1995, on a log scale from 0.1 to 100.)
• The microprocessor exhibits astonishing progress!
• The natural building blocks for parallel computers are also state-of-the-art microprocessors.

Architecture Trends: The Role of Architecture
Clock rates increase about 30% per year, while overall CPU performance increases 50% to 100% per year. Where does the rest come from?
➢ Parallelism is likely to contribute more to performance improvements.

Architectural Trends
The greatest trend in VLSI is an increase in the exploited parallelism:
• Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit –
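The bit-level parallelism mentioned in the last slide can be made concrete. The classic "parallel popcount" below (a standard bit-twiddling technique, included here only as an illustration) counts the 1-bits of a 64-bit word by treating one register as many small accumulators: successive lines update 32, then 16, then 8 partial sums at once. Widening the datapath from 4 to 64 bits multiplied how much of this kind of work a single instruction can do.

    #include <stdio.h>
    #include <stdint.h>

    /* Count set bits by using one 64-bit register as many small
       accumulators, all updated by the same instructions. */
    static uint64_t popcount64(uint64_t x)
    {
        x = x - ((x >> 1) & 0x5555555555555555ULL);               /* 32 2-bit sums */
        x = (x & 0x3333333333333333ULL)
          + ((x >> 2) & 0x3333333333333333ULL);                   /* 16 4-bit sums */
        x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;               /* 8 byte sums */
        return (x * 0x0101010101010101ULL) >> 56;                 /* total in top byte */
    }

    int main(void)
    {
        /* 0xFFFF0000FFFF has 32 set bits, so this prints 32. */
        printf("%llu\n", (unsigned long long)popcount64(0xFFFF0000FFFFULL));
        return 0;
    }

The same idea applied across registers rather than within one leads toward the SIMD and data-parallel organizations listed in the outline.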