Fireperf: FPGA-Accelerated Full-System Hardware/Software Performance Profiling and Co-Design
Total Page:16
File Type:pdf, Size:1020Kb
FirePerf: FPGA-Accelerated Full-System Hardware/Software Performance Profiling and Co-Design Sagar Karandikar Albert Ou Alon Amid University of California, Berkeley University of California, Berkeley University of California, Berkeley [email protected] [email protected] [email protected] Howard Mao Randy Katz Borivoje Nikolić University of California, Berkeley University of California, Berkeley University of California, Berkeley [email protected] [email protected] [email protected] Krste Asanović University of California, Berkeley [email protected] Abstract CCS Concepts. • General and reference → Performance; Achieving high-performance when developing specialized • Hardware → Simulation and emulation; Hardware- hardware/software systems requires understanding and im- software codesign; • Computing methodologies → Sim- proving not only core compute kernels, but also intricate and ulation environments; • Networks → Network perfor- elusive system-level bottlenecks. Profiling these bottlenecks mance analysis; • Software and its engineering → Op- requires both high-fidelity introspection and the ability to erating systems. run sufficiently many cycles to execute complex software Keywords. performance profiling; hardware/software co- stacks, a challenging combination. In this work, we enable design; FPGA-accelerated simulation; network performance agile full-system performance optimization for hardware/ optimization; agile hardware software systems with FirePerf, a set of novel out-of-band ACM Reference Format: system-level performance profiling capabilities integrated Sagar Karandikar, Albert Ou, Alon Amid, Howard Mao, Randy into the open-source FireSim FPGA-accelerated hardware Katz, Borivoje Nikolić, and Krste Asanović. 2020. FirePerf: FPGA- simulation platform. Using out-of-band call stack reconstruc- Accelerated Full-System Hardware/Software Performance Profiling tion and automatic performance counter insertion, FirePerf and Co-Design. In Proceedings of the Twenty-Fifth International enables introspecting into hardware and software at appro- Conference on Architectural Support for Programming Languages priate abstraction levels to rapidly identify opportunities for and Operating Systems (ASPLOS ’20), March 16–20, 2020, Lausanne, software optimization and hardware specialization, without Switzerland. ACM, New York, NY, USA, 17 pages. https://doi.org/ disrupting end-to-end system behavior like traditional pro- 10.1145/3373376.3378455 filing tools. We demonstrate the capabilities of FirePerf with a case study that optimizes the hardware/software stack of 1 Introduction an open-source RISC-V SoC with an Ethernet NIC to achieve As hardware specialization improves the performance of × 8 end-to-end improvement in achievable bandwidth for core computational kernels, system-level effects that used to networking applications running on Linux. We also deploy a lurk in the shadows (e.g. getting input/output data to/from RISC-V Linux kernel optimization discovered with FirePerf an accelerated system) come to dominate workload runtime. × on commercial RISC-V silicon, resulting in up to 1.72 im- Similarly, as traditionally “slow” hardware components like provement in network performance. networks and bulk storage continue to improve in perfor- Permission to make digital or hard copies of part or all of this work for mance relative to general purpose computation, it becomes personal or classroom use is granted without fee provided that copies are easy to waste this improved hardware performance with not made or distributed for profit or commercial advantage and that copies poorly optimized systems software [9]. Due to scale and bear this notice and the full citation on the first page. Copyrights for third- complexity, these system-level effects are considerably more party components of this work must be honored. For all other uses, contact difficult to identify and optimize than core compute kernels. the owner/author(s). To understand systems assembled from complex compo- ASPLOS ’20, March 16–20, 2020, Lausanne, Switzerland © 2020 Copyright held by the owner/author(s). nents, hardware architects and systems software developers ACM ISBN 978-1-4503-7102-5/20/03. use a variety of profiling, simulation, and debugging tools. https://doi.org/10.1145/3373376.3378455 Software profiling tools like perf [18] and strace [4] are common tools of the trade for software systems developers, study that uses FirePerf to systematically identify and im- but have the potential to perturb the system being evaluated plement optimizations that yield an 8× speedup in Ethernet (as demonstrated in Section 4.6). Furthermore, because these network performance on a commodity open-source RISC-V tools are generally used post-silicon, they can only introspect SoC design. Optimizing this stack requires comprehensive on parts of the hardware design that the hardware devel- profiling of the operating system, application software, SoC oper exposed in advance, such as performance counters. In and NIC microarchitectures and RTL implementations, and many cases, limited resources mean that a small set of these network link and switch characteristics. In addition to dis- counters must be shared across many hardware events, re- covering and improving several components of this system quiring complex approaches to extrapolate information from in FPGA-accelerated simulation, we deploy one particular them [34]. Other post-silicon tools like Intel Processor Trace optimization in the Linux kernel on a commercially available or ARM CoreSight trace debuggers can pull short sequences RISC-V SoC. This optimization enables the SoC to saturate of instruction traces from running systems, but these se- its onboard Gigabit Ethernet link, which it could not do with quences are often too short to observe and profile system be- the default kernel. Overall, with the FirePerf profiling tools, havior. While these tools can sometimes backpressure/single- a developer building a specialized system can improve not step the cores when buffers fill up, they cannot backpressure only the core compute kernel of their application, but also off-chip I/O and thus are incapable of reproducing system- analyze the end-to-end behavior of the system, including level performance phenomena such as network events. running complicated software stacks like Linux with com- Pre-silicon, hardware developers use software tools like ab- plete workloads. This allows developers to ensure that no stract architectural simulators and software RTL simulation new system-level bottlenecks arise during the integration to understand the behavior of hardware at the architectural process that prevent them from achieving an ideal speedup. or waveform levels. While abstract architectural simulators are useful in exploring high-level microarchitectural trade- 2 Background offs in a design, new sets of optimization challenges and In this section, we introduce the tools we use to demonstrate bottlenecks arise when a design is implemented as realizable FirePerf as well as the networked RISC-V SoC-based system RTL, since no high-level simulator can capture all details that we will optimize using FirePerf in our case study. with full fidelity. In both cases, software simulators aretoo slow to observe end-to-end system-level behavior (e.g. a clus- 2.1 Target System Design: FireChip SoC ter of multiple nodes running Linux) when trying to rapidly iterate on a design. Furthermore, when debugging system In the case study in Section 4, we will use FirePerf to opti- integration issues, waveforms and architectural events are mize network performance on a simulated networked cluster often the wrong abstraction level for conveniently diagnos- of nodes where each simulated node is an instantiation of ing performance anomalies, as they provide far too much the open-source FireChip SoC, the default SoC design in- detail. FPGA-accelerated RTL simulation tools (e.g. [29]) and cluded with FireSim. The FireChip SoC is derived from the FPGA prototypes address the simulation performance bot- open-source Rocket Chip generator [6], which is written in tleneck but offer poor introspection capabilities, especially parameterized Chisel RTL [7] and provides infrastructure to at the abstraction level needed for system-level optimization. generate Verilog RTL for a complete SoC design, including In essence, there exists a gap between the detailed hardware RISC-V cores, caches, and a TileLink on-chip interconnect. simulators and traces used by hardware architects and the Designs produced with the Rocket Chip generator have been high-level profiling tools used by systems software develop- taped out over ten times in academia, and the generator has ers. But extracting the last bit of performance out of complete been used as the basis for shipping commercial silicon, like hardware-software systems requires understanding the in- the SiFive HiFive Unleashed [46] and the Kendryte K210 teraction of hardware and software across this boundary. SoC [1]. To this base SoC, FireChip adds a 200 Gbit/s Eth- Without useful profiling tools or with noisy data from ex- ernet NIC and a block device controller, both implemented isting tools, developers must blindly make decisions about in Chisel RTL. The FireChip SoC is also capable of booting what to optimize. Mistakes in this process can be especially Linux and running full-featured networked applications. costly for small agile development teams. To bridge this gap and enable agile system-level