Eindhoven University of Technology

MASTER

Modeling and analysis of a cache coherent interconnect

Wiener, U.

Award date: 2012

Disclaimer
This document contains a student thesis (bachelor's or master's), as authored by a student at Eindhoven University of Technology. Student theses are made available in the TU/e repository upon obtaining the required degree. The grade received is not published on the document as presented in the repository. The required complexity or quality of research of student theses may vary by program, and the required minimum study period may vary in duration.

General rights
Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.
• Users may download and print one copy of any publication from the public portal for the purpose of private study or research.
• You may not further distribute the material or use it for any profit-making activity or commercial gain.

Eindhoven University of Technology
Faculty of Electrical Engineering
Electronic Systems Group

Modeling and Analysis of a Cache Coherent Interconnect

Thesis Report submitted in partial fulfillment of the requirements for the degree of Master of Science in Embedded Systems

by: Uri Wiener
Supervisors: Prof. Kees Goossens, Dr. Andreas Hansson

v2.2, August 2, 2012

Abstract

System on Chip (SoC) designs integrate heterogeneous components, increasing in number, performance, and memory-related requirements. Such components share data using distributed shared memory, and rely on caches for improving performance and reducing power consumption.
Complex interconnect designs function as the heart of these SoCs, gluing together components using a common interface protocol such as AMBA, with hardware-managed coherency a key feature for improving system performance. One such interconnect is ARM's CCI-400, the first to implement the AMBA 4 AXI Coherence Extensions (ACE). Capturing the behavior of such a SoC-oriented interconnect and its impact on the performance of a system requires accurate modeling, simulation using adequate workloads, and precise analysis. In this work we model ACE and a CCI-like interconnect using gem5, a Transaction-Level-Modeling (TLM) simulation framework. We extend gem5's memory system to support ACE-like transactions. In addition, we improve the temporal behavior of gem5's interconnect model by better representing resource contention. As such, this work intertwines two challenging computer architecture topics: coherence protocols and on-chip interconnects. Performance analysis is demonstrated on a multi-cluster mobile-device-like compute subsystem model, using both PARSEC benchmarks and BBench, a web-page rendering benchmark. We show how the system-level coherency extensions introduced to gem5's memory system provide insights about modeling coherent single- and multi-hop SoC interconnects. Our simulation results demonstrate that the impact of snoop response latency on overall system performance is highly dependent on the amount of inter-cluster sharing. The changes introduced significantly improve the interconnect's observability, enabling better characterization of a workload's memory requirements and sharing patterns.

אתה מתחיל הכי מהר שלך, ולאט לאט אתה מגביר.
- אלון 'קרמבו' שגיב, "מבצע סבתא"

You start off as fast as you can, and very slowly increase the pace.
- Alon 'Krembo' Sagiv, "Operation Grandma"

Acknowledgments

Chronologically, those who made it happen should be credited first. Had it not been for Dr. Benny Akesson's and Prof. Kees Goossens's match-making efforts, I would never have crossed the English Channel. For that I give them my full appreciation. While Kees bravely became my supervisor, Andreas had the questionable pleasure of directly supervising my work. For their help and cooperation throughout the past year, I take off my hat.

While supervising is done from a bird's-eye view, the man who was affected the most by my presence must be Sascha Bischoff. Thank you for all your help, knowledge sharing, patience and for translating Dr. Taylor's English when times were tough. Many, many thanks to my friendly colleagues at ARM - Akash, Charlie, Djordje, Hugo, Matt E., Matt H., Radhika, Peter, Rene, Robert and William (alphabetical order, guys, nothing else) for their companionship and support. My appreciation also goes to the gifted people behind ACE and CCI in Cambridge and Sheffield who cooperated, spared time and shared insights whenever I asked. In addition, I'd like to thank all gem5 developers, and especially Ali, Dam and Chris, for providing tips, patches, feedback and plenty of the time required for making it all work.

From a personal perspective, I'd like to thank my family and friends in Israel, the Netherlands and Germany for all their support during my stay in the UK. But above all, my love and endless appreciation to Tali, who supported me in making this move and bore having the English Channel between us for half a year.

Uri Wiener
Cambridge, The United Kingdom
August 2nd, 2012

Contents

List of Figures
List of Tables
List of Abbreviations

1 Introduction
  1.1 Motivation
  1.2 Hardware-based system coherency
  1.3 Goals
    1.3.1 Problem Statement
    1.3.2 Thesis Goals
  1.4 Contribution of this work
  1.5 Thesis organization

2 Related Work
  2.1 System coherence protocols
  2.2 Coherent interconnects
  2.3 Memory system performance

3 Background
  3.1 System coherency and AMBA 4 ACE
    3.1.1 Memory Consistency
    3.1.2 Cache Coherence
    3.1.3 System-level coherency
    3.1.4 AMBA 4 AXI Coherence Extensions
    3.1.5 ACE transactions
  3.2 Interconnect modeled: CCI++
    3.2.1 CoreLink CCI-400
    3.2.2 CCI++: extending ACE and CCI to multi-hop memory systems
  3.3 The Modeling Challenge
  3.4 Simulation and modeling framework: gem5
    3.4.1 gem5
    3.4.2 Simulation flow
    3.4.3 Test-System Example
  3.5 The gem5 memory system
    3.5.1 The basic elements: SimObjects, MemObjects, Ports, Requests and Packets
    3.5.2 Request's lifecycle example
    3.5.3 Intermezzo: Events and the Event Queue
    3.5.4 The bus model
    3.5.5 A simple request from a master
    3.5.6 A simple response from a slave
    3.5.7 Receiving a snoop request
    3.5.8 Receiving a snoop response
    3.5.9 The bridge model
    3.5.10 Caches and coherency in gem5
    3.5.11 gem5's cache line states
    3.5.12 Main cache scenarios
  3.6 Target platform
    3.6.1 gem5 building block for performance analysis
    3.6.2 Simulator performance
  3.7 Differences between ACE and gem5's memory system
    3.7.1 System-coherency modeling

4 Interconnect model
  4.1 Temporal transaction transport modeling
  4.2 Resource contention modeling
  4.3 ACE transaction modeling
    4.3.1 ReadOnce
    4.3.2 ReadNoSnoop
    4.3.3 WriteNoSnoop
    4.3.4 MakeInvalid
    4.3.5 WriteLineUnique
  4.4 Modeling inaccuracies, optimizations and places for improvement
  4.5 Bus performance observability
  4.6 Conclusions

5 Implementation and Verification Framework
  5.1 Memtest
  5.2 ACE-transactions development
  5.3 ACE transaction verification
    5.3.1 ReadOnce
    5.3.2 ReadNoSnoop
    5.3.3 WriteNoSnoop