Eindhoven University of Technology

MASTER

Modeling and analysis of a cache coherent interconnect

Wiener, U.

Award date: 2012

Eindhoven University of Technology
Faculty of Electrical Engineering
Electronic Systems Group

Modeling and Analysis of a Cache Coherent Interconnect

Thesis Report submitted in partial fulfillment of the requirements for the degree of Master of Science in Embedded Systems

by: Uri Wiener
Supervisors: Prof. Kees Goossens, Dr. Andreas Hansson

v2.2, August 2, 2012

Abstract

System on Chip (SoC) designs integrate heterogeneous components, increasing in number, performance, and memory-related requirements. Such components share data using distributed shared memory, and rely on caches for improving performance and reducing power consumption. Complex interconnect designs function as the heart of these SoCs, gluing together components using a common interface protocol such as AMBA, with hardware-managed coherency a key feature for improving system performance.
Such an interconnect is ARM's CCI-400, the first to implement the AMBA 4 AXI Coherence Extensions (ACE). Capturing the behavior of such a SoC-oriented interconnect and its impact on the performance of a system requires accurate modeling, simulation using adequate workloads, and precise analysis. In this work we model ACE and a CCI-like interconnect using gem5, a Transaction-Level Modeling (TLM) simulation framework. We extend gem5's memory system to support ACE-like transactions. In addition, we improve the temporal behavior of gem5's interconnect model by better representing resource contention. As such, this work intertwines two challenging computer architecture topics: coherence protocols and on-chip interconnects. Performance analysis is demonstrated on a multi-cluster mobile-device-like compute subsystem model, using both PARSEC benchmarks and BBench, a web-page rendering benchmark. We show how the system-level coherency extensions introduced to gem5's memory system provide insights about modeling coherent single- and multi-hop SoC interconnects. Our simulation results demonstrate that the impact of snoop response latency on overall system performance is highly dependent on the amount of inter-cluster sharing. The changes introduced significantly improve the interconnect's observability, enabling better characterization of a workload's memory requirements and sharing patterns.

אתה מתחיל הכי מהר שלך, ולאט לאט אתה מגביר.
- אלון 'קרמבו' שגיב, "מבצע סבתא"

You start off as fast as you can, and very slowly increase the pace.
- Alon 'Krembo' Sagiv, "Operation Grandma"

Acknowledgments

Chronologically, those who made it happen should be credited first. Had it not been for the matchmaking efforts of Dr. Benny Akesson and Prof. Kees Goossens, I would never have crossed the English Channel. For that I give them my full appreciation. While Kees bravely became my supervisor, Andreas had the questionable pleasure of directly supervising my work.
For their help and cooperation throughout the past year, I take off my hat. While supervising is done from a bird's-eye view, the man who was affected the most by my presence must be Sascha Bischoff. Thank you for all your help, knowledge sharing, patience and for translating Dr. Taylor's English when times were tough. Many, many thanks to my friendly colleagues at ARM - Akash, Charlie, Djordje, Hugo, Matt E., Matt H., Radhika, Peter, Rene, Robert and William (alphabetical order, guys, nothing else) for their companionship and support. My appreciation also goes to the gifted people behind ACE and CCI in Cambridge and Sheffield who cooperated, spared time and shared insights whenever I asked. In addition, I'd like to thank all gem5 developers, and especially Ali, Dam and Chris, for providing tips, patches, feedback and plenty of time required for making it all work.

From a personal perspective, I'd like to thank my family and friends in Israel, the Netherlands and Germany for all their support during my stay in the UK. But above all, my love and endless appreciation to Tali, who supported me in making this move and bore having the English Channel between us for half a year.
Uri Wiener
Cambridge, The United Kingdom
August 2nd, 2012

Contents

List of Figures
List of Tables
List of Abbreviations

1 Introduction
  1.1 Motivation
  1.2 Hardware-based system coherency
  1.3 Goals
    1.3.1 Problem Statement
    1.3.2 Thesis Goals
  1.4 Contribution of this work
  1.5 Thesis organization

2 Related Work
  2.1 System coherence protocols
  2.2 Coherent interconnects
  2.3 Memory system performance

3 Background
  3.1 System coherency and AMBA 4 ACE
    3.1.1 Memory Consistency
    3.1.2 Cache Coherence
    3.1.3 System-level coherency
    3.1.4 AMBA 4 AXI Coherence Extensions
    3.1.5 ACE transactions
  3.2 Interconnect modeled: CCI++
    3.2.1 CoreLink CCI-400
    3.2.2 CCI++: extending ACE and CCI to multi-hop memory systems
  3.3 The Modeling Challenge
  3.4 Simulation and modeling framework: gem5
    3.4.1 gem5
    3.4.2 Simulation flow
    3.4.3 Test-System Example
  3.5 The gem5 memory system
    3.5.1 The basic elements: SimObjects, MemObjects, Ports, Requests and Packets
    3.5.2 Request's lifecycle example
    3.5.3 Intermezzo: Events and the Event Queue
    3.5.4 The bus model
    3.5.5 A simple request from a master
    3.5.6 A simple response from a slave
    3.5.7 Receiving a snoop request
    3.5.8 Receiving a snoop response
    3.5.9 The bridge model
    3.5.10 Caches and coherency in gem5
    3.5.11 gem5's cache line states
    3.5.12 Main cache scenarios
  3.6 Target platform
    3.6.1 gem5 building block for performance analysis
    3.6.2 Simulator performance
  3.7 Differences between ACE and gem5's memory system
    3.7.1 System-coherency modeling

4 Interconnect model
  4.1 Temporal transaction transport modeling
  4.2 Resource contention modeling
  4.3 ACE transaction modeling
    4.3.1 ReadOnce
    4.3.2 ReadNoSnoop
    4.3.3 WriteNoSnoop
    4.3.4 MakeInvalid
    4.3.5 WriteLineUnique
  4.4 Modeling inaccuracies, optimizations and places for improvement
  4.5 Bus performance observability
  4.6 Conclusions

5 Implementation and Verification Framework
  5.1 Memtest
  5.2 ACE-transactions development
  5.3 ACE transaction verification
    5.3.1 ReadOnce
    5.3.2 ReadNoSnoop
    5.3.3 WriteNoSnoop