The Untold Story of Marvell's Processor Development

By Linley Gwennap, Principal Analyst, The Linley Group
August 2008
www.linleygroup.com

This paper discloses the eight-year effort that preceded the recent launch of Marvell's Sheeva processors, explaining how the company became a leading CPU supplier without announcing a single processor product. We also examine these new processors and their applicability to communications, printers, storage, consumer, and mobile applications, and provide a peek at some next-generation CPUs. This paper is sponsored by Marvell, but all opinions and analysis are those of the author.

Introduction

Marvell is a leading vendor in several markets, including hard-drive controllers, Ethernet chips, and mobile Wi-Fi chips. Despite this success, few people think of Marvell as a processor company, much less a leader in that field. Yet Marvell shipped more than 300 million CPUs last year, most of its own design. We are aware of no company that shipped more chips based on 32-bit CPUs of its own design.

Because Marvell's CPUs are embedded in many of the products that the company ships, the company has not publicized its design efforts, but these CPUs are vital in enabling the feature set and power efficiency of its products. For this reason, the company has quietly maintained its own CPU design team since 2003.

Recently, Marvell has expanded its product line to include general-purpose processor chips. Although it appears to most outsiders that Marvell is a new entrant in the embedded-processor market, in fact these products are based on the same CPU technology that the company has been shipping for years. By combining its proven CPU technology with a set of common system peripherals, Marvell's new Sheeva products offer a compelling alternative to traditional embedded RISC processors. With these products in hand, Marvell is now ready to disclose its CPU history.

The Road to Sheeva

Since 2000, Marvell has included CPUs in its chips, for example, to control the flow of data in its storage and Ethernet controllers. These early products used licensed CPU cores, but as Marvell began to ship greater numbers of CPU-based products, company founder and CEO Sehat Sutardja decided that Marvell needed to design its own CPU cores in order to create innovative and differentiated products.

To implement this vision, Marvell in 2003 acquired a small company called ASICA that was designing ARM-compatible CPUs. Much of the ASICA team had previously worked at Picoturbo, an earlier startup that had also designed ARM-compatible CPUs, giving the team extensive experience with the instruction set. After the ASICA acquisition, Marvell negotiated an architecture license from ARM Ltd., making it one of the few companies in the world legally able to design and sell ARM-compliant CPUs.

Figure 1. Timeline of Marvell CPU development. (Source: Marvell)

As Figure 1 shows, the first Marvell products using its in-house CPU design entered production in 2004. The company quickly began converting its other CPU-based products to use its own CPU. To meet the needs of these products, the company designed several CPUs, each with different cost and performance characteristics.
In May 2005, Sutardja gave a presentation at Microprocessor Forum disclosing that Marvell had developed a CPU known as Feroceon that would operate at 600MHz in 150nm technology. This CPU would later be used in Orion, Marvell's first customer-programmable processors. Although the company never announced the Orion products nor officially disclosed details about them, they quickly became successful in SOHO storage (NAS) products.

In December 2006, Marvell acquired Intel's XScale processors and CPU design team. The XScale chips are also ARM-compatible but are designed for mobile applications. Since that time, Marvell has integrated the XScale and Feroceon design teams into a single group, led by veteran CPU designer Hongyi Chen, that will produce CPU cores for both mobile and embedded applications. Before joining Marvell, Chen was a cofounder of Picoturbo and ASICA after stints in CPU design at AMD and Sun.

In June 2008, the company introduced its first processors based on its new Sheeva CPUs. Sheeva is a family of Marvell-designed CPU cores that span a range of price/performance points but share a focus on maximizing performance per watt. Even while keeping power dissipation for the entire processor chip below 2.0W (typical), these CPUs operate as fast as 2.0GHz in 65nm CMOS.

Marvell's CPUs are fully compatible with the ARM architecture and thus support all ARM development tools and software. Marvell also offers a complete tool chain that is optimized for its CPU designs. By offering the fastest available ARM-compatible processors, Marvell is expanding the ARM ecosystem into new applications. Marvell has extended its license to cover ARM v6 and v7, the latter being the most recent version of the architecture. The company expects to sample its first ARM v7 CPU in late 2008.

Communications Applications

The Sheeva-based processors can be used in a variety of networking and communications applications. For example, the MV78000 family uses an advanced Sheeva CPU that is superscalar (executing up to two instructions per cycle) and can reorder instructions to avoid pipeline stalls. This CPU operates at up to 1.2GHz in 65nm CMOS. The MV78000 is available in single-CPU and dual-CPU models, providing enough performance for enterprise-class control-plane designs or SMB-class equipment that combines the control and data planes on a single processor chip.

To further boost performance, the CPU includes cache optimizations such as fetching the critical word first, reading from the cache while a miss is being processed (hit-under-miss), and reading from the cache while a store is being processed (nonblocking store). Important code can be locked into the cache on a per-way (but not per-line) basis. To speed context switches, each cache line has two dirty bits, so only the dirty half of the line needs to be flushed; a sketch of this idea appears at the end of this section.

Reliability is critical in enterprise and infrastructure applications. For these applications, Marvell's CPUs implement ECC protection on the level-two cache, protecting against errors in this data structure. The MV78000 processors also implement ECC on the memory controller. The small level-one caches are not protected.

The MV78000 processors also include common networking functions, such as Gigabit Ethernet MACs and PCI Express ports, to reduce system cost. Most of these functions have already been proven in Marvell's popular Discovery system-logic chips. Yet the processors dissipate less than 5W (typical), even with two 1.2GHz CPUs and a complete set of peripherals. This efficiency enables the Sheeva processors to fit into systems that have tight power budgets.
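Returning to the cache design described above, the dual-dirty-bit optimization is easy to illustrate. The C sketch below models a single cache line with one dirty bit per half-line; the 32-byte line size and all names are illustrative assumptions for this paper, not disclosed details of Marvell's design. The point is simply that a flush writes back only the halves that were actually modified, halving write-back traffic in the common case where a context switch touches half a line.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LINE_BYTES 32                  /* assumed line size, for illustration */
#define HALF_BYTES (LINE_BYTES / 2)

/* One cache line with a dirty bit per half-line. */
typedef struct {
    uint8_t data[LINE_BYTES];
    bool dirty_lo;                     /* bytes 0..15 modified since fill  */
    bool dirty_hi;                     /* bytes 16..31 modified since fill */
} cache_line;

/* Record a write, marking only the affected half dirty. */
static void line_write(cache_line *line, unsigned offset, uint8_t value) {
    line->data[offset] = value;
    if (offset < HALF_BYTES)
        line->dirty_lo = true;
    else
        line->dirty_hi = true;
}

/* Flush the line, writing back only dirty halves; returns bytes written. */
static unsigned line_flush(cache_line *line, uint8_t *mem) {
    unsigned written = 0;
    if (line->dirty_lo) {
        memcpy(mem, line->data, HALF_BYTES);
        written += HALF_BYTES;
    }
    if (line->dirty_hi) {
        memcpy(mem + HALF_BYTES, line->data + HALF_BYTES, HALF_BYTES);
        written += HALF_BYTES;
    }
    line->dirty_lo = line->dirty_hi = false;
    return written;
}

int main(void) {
    uint8_t mem[LINE_BYTES] = {0};
    cache_line line = {{0}, false, false};
    line_write(&line, 3, 0xAB);        /* touches only the low half */
    /* With one dirty bit per half, only 16 of 32 bytes are written back. */
    printf("flushed %u of %d bytes\n", line_flush(&line, mem), LINE_BYTES);
    return 0;
}
```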
Mobile Devices

Because of their low power consumption, Marvell's CPUs are a leading choice for mobile devices such as smartphones and PDAs. The company currently offers standalone application processors, which can be used in cell phones and other mobile devices, as well as products that combine an application processor and a 3G cellular baseband on a single chip.

These processors are compatible with all leading mobile software, which is developed for the ARM instruction set. In addition, they implement the WMMX2 multimedia extensions, which accelerate audio and video functions. WMMX2 was developed by Intel as a counterpart to the MMX and SSE extensions implemented in its PC processors. This compatibility simplifies the task of developers moving software applications from the PC to a mobile Internet device (MID), for example. Marvell's software tools support the WMMX2 extensions, and the company offers a suite of multimedia subroutines that make use of these extensions, so customers need only access these subroutines to accelerate their software.

Marvell uses many techniques to reduce CPU power. As noted above, the CPU die size is minimized, resulting in fewer transistors to consume power, whether through switching or leakage. The design uses fine-grained clock gating to turn off portions of the CPU that are not needed on a cycle-by-cycle basis, reducing operating power. When the CPU goes into standby mode, the supply voltage is removed from most of the circuitry, eliminating leakage power. Operating at a reduced voltage, the caches retain their state in this mode, even though the control circuitry is turned off.

Marvell's efficient CPU pipeline also reduces power. CPUs with long pipelines often waste power on pipeline stalls and branch-misprediction penalties. Marvell's next-generation mobile CPU uses a variable-length pipeline that is 7 stages for basic integer instructions and up to 10 stages for load instructions. To minimize time- and power-wasting branch penalties, the CPU implements a complex prediction methodology, including a Gshare-based branch history table (BHT), a branch target buffer (BTB), a branch return stack, and, when all else fails, static prediction. Branches that hit in the BTB execute immediately; other correctly predicted branches require one cycle to load the target instructions. The sketch below shows the core Gshare mechanism.
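A Gshare predictor XORs the branch address with a register holding the outcomes of recent branches, using the result to index a table of 2-bit saturating counters; correlating with global history lets it learn patterns, such as a loop's exit branch, that a per-branch counter misses. The following is a minimal C sketch of that mechanism; the table size, history length, and PC hashing are generic textbook assumptions, not Marvell's disclosed parameters.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BHT_BITS 12
#define BHT_SIZE (1u << BHT_BITS)

static uint8_t  bht[BHT_SIZE];  /* 2-bit saturating counters, start at 0 (strongly not-taken) */
static uint32_t ghist;          /* global history of recent branch outcomes */

/* Predict: fold the branch PC and global history into a BHT index. */
static bool gshare_predict(uint32_t pc) {
    uint32_t idx = ((pc >> 2) ^ ghist) & (BHT_SIZE - 1);
    return bht[idx] >= 2;       /* counter values 2 and 3 mean "predict taken" */
}

/* Update: train the counter toward the actual outcome, then shift history. */
static void gshare_update(uint32_t pc, bool taken) {
    uint32_t idx = ((pc >> 2) ^ ghist) & (BHT_SIZE - 1);
    if (taken && bht[idx] < 3) bht[idx]++;
    if (!taken && bht[idx] > 0) bht[idx]--;
    ghist = ((ghist << 1) | (taken ? 1u : 0u)) & (BHT_SIZE - 1);
}

int main(void) {
    /* A loop branch at one PC, taken 9 of every 10 times: after warm-up,
       the history register distinguishes the exit iteration. */
    unsigned correct = 0, total = 1000;
    for (unsigned i = 0; i < total; i++) {
        bool taken = (i % 10) != 9;
        if (gshare_predict(0x8000) == taken) correct++;
        gshare_update(0x8000, taken);
    }
    printf("predicted %u of %u branches correctly\n", correct, total);
    return 0;
}
```

In hardware, the BHT lookup proceeds in parallel with the BTB access, and the return stack and static prediction serve as fallbacks when neither structure has useful state.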
Printers

Marvell is a leading supplier of ASICs for laser printers, due to its acquisition of Avago's printer-ASIC business in April 2006. These printers have historically included a fast general-purpose processor, which performs the image processing, and an ASIC that controls the print engine. More recently, the trend is to combine the processor and the print controller into a single chip.