Full-System Simulation of Java Workloads with RISC-V and the Jikes Research Virtual Machine
Martin Maas, Krste Asanović, John Kubiatowicz
University of California, Berkeley

1st Workshop on Computer Architecture Research with RISC-V, 10/14/2017, Boston, MA

ABSTRACT

Managed languages such as Java, JavaScript or Python account for a large portion of workloads, both in cloud data centers and on mobile devices. It is therefore unsurprising that there is an interest in hardware-software co-design for these languages. However, existing research infrastructure is often unsuitable for this kind of research: managed languages are sensitive to fine-grained interactions that are not captured by high-level architectural models, yet are also too long-running and irregular to be simulated using cycle-accurate software simulators.

Open-source hardware based on the RISC-V ISA provides an opportunity to solve this problem, by running managed workloads on RISC-V systems in FPGA-based full-system simulation. This approach achieves both the accuracy and simulation speeds required for managed workloads, while enabling modification and design-space exploration for the underlying hardware.

A crucial requirement for this hardware-software research is a managed runtime that can be easily modified. The Jikes Research Virtual Machine (JikesRVM) is a Java Virtual Machine that was developed specifically for this purpose, and has become the gold standard in managed-language research. In this paper, we describe our experience of porting JikesRVM to the RISC-V infrastructure. We discuss why this combined setup is necessary, and how it enables hardware-software research for managed languages that was infeasible with previous infrastructure.

1 INTRODUCTION

Managed languages such as Java, JavaScript and Python account for a large portion of workloads [16]. A substantial body of work suggests that managed-language runtimes can significantly benefit from hardware support and hardware-software co-design [10, 13, 21, 22]. However, despite their pervasiveness, these types of workloads are often underrepresented in computer architecture research, and most papers in premier conferences use native workloads such as SPEC CPU to evaluate architectural ideas.

While native workloads represent an important subset of applications, they are not representative of a large fraction of workloads in some of the most important spaces, including cloud and mobile. This disconnect between real-world workloads and evaluation was pointed out in a prominent Communications of the ACM article almost 10 years ago [7], but not much has changed since then. A part of the problem is arguably that there is currently no good way to evaluate managed languages in the context of computer architecture research. Specifically, all of the major approaches fall short when applied to managed-language applications:

• High-level full-system simulators do not provide the fidelity to fully capture managed-language workloads. These workloads often interact at very small time-scales. For example, garbage collectors may introduce small delays of ≈ 10 cycles each, scattered through the application [10]. Cumulatively, these delays add up to substantial overheads, but individually they can only be captured with a high-fidelity model.
• Software-based cycle-accurate simulators are too slow for managed workloads. These simulators typically achieve on the order of 400 KIPS [17], or 1 s of simulated time per 1.5 h of simulation (per core). Managed-language workloads are typically long-running (i.e., a minute and more) and run across a large number of cores, which means that simulating an 8-core workload for 1 minute takes around a month (8 cores × 60 s × 1.5 h = 720 h, or 30 days).
• Native workloads often take advantage of sampling-based approaches, or use solutions such as Simpoints [20] to determine regions of interest in workloads and then only simulate those regions. This does not work for managed workloads, as they consist of several components running in parallel and affecting each other, including the garbage collector, JIT compiler and features with dynamically changing state (such as biased locks, inline caching for dynamic dispatch, etc.). In addition, managed application performance is often not dominated by specific kernels or regions of interest, which makes approaches that change between high-level and detailed simulation modes (e.g., MARSSx86 [17], Sniper [9]) unsuitable for many of these workloads.

For these reasons, a large fraction of managed-language research relies on stock hardware for experimentation. While this has enabled a large amount of research on improving garbage collectors, JIT compilers and runtime system abstractions, there has been relatively little research on hardware-software co-design for managed languages. Further, the research that does exist in this area typically explores a single design point, often in the context of a released chip or product, such as Azul's Vega appliance [10]. Architectural design-space exploration is rare, especially in academia.

We believe that easy-to-modify open-source hardware based on the RISC-V ISA, combined with an easy-to-modify managed-language runtime system, can provide an opportunity to address this problem and perform hardware-software research that was infeasible before. Both pieces of infrastructure already exist:

On one hand, the RocketChip SoC generator [5] provides the infrastructure to generate full SoCs that are realistic (i.e., used in products), and can target both ASIC and FPGA flows. Using an FPGA-based simulation framework such as Strober [14] enables simulating the performance of real RocketChip SoCs at high fidelity, with FPGA frequencies of 30-100 MHz. This means that this infrastructure can achieve the realism, fidelity and simulation speed required to simulate managed-language workloads.

On the other hand, infrastructure exists for managed-language research. Specifically, the Jikes Research Virtual Machine (JikesRVM) is a Java VM geared towards experimentation. JikesRVM is easy to modify, thanks to being written in a high-level language (Java) and using a modular software design that facilitates changing components such as the object layout, GC or JIT passes.

We believe that bringing these two projects together will enable novel hardware-software research. In this paper, we present one important step towards this vision, by porting JikesRVM to RISC-V. We first discuss why such a port is necessary. We then describe the porting effort in detail, in the hope that it will be helpful for others porting managed runtime systems to RISC-V. Finally, we demonstrate the running system, and show the research it enables.

[Figure 1: Building the JikesRVM. Diagram labels: Existing "Bootstrap" JVM; Step 1: Load JikesRVM into itself; Step 2: JIT compiler produces code and stores it to memory; Copy compiled code and state; Step 3: Store image to disk.]

[Figure 2: Running the JikesRVM. Diagram labels: Step 4: Load boot image into memory (bootloader, written in C, implements native calls, fault handlers, etc.); Step 5: Resume at "boot" function; In JIT-generated code, primitives (Address, ObjectReference, Word) map to actual operations.]

2 BACKGROUND

The shortcomings of existing infrastructure for managed-language research are well established. For example, Yang et al. demonstrated that sampling Java applications at 100 kHz or less misses important performance characteristics [23].

Another example is a 2005 paper by Hertz and Berger [11]: In order to investigate trade-offs between manual and automatic memory management, the authors had to instrument an existing runtime system to extract allocated memory addresses and, in a second pass, inject addresses produced by an oracle. The authors found that this was difficult to achieve in software, as the software instrumentation led to a 2-33% perturbation in execution time, which was larger than the effect they were trying to measure. They therefore decided to use a software simulator (Dynamic SimpleScalar [12]) for these experiments. While appropriate in this setting, this approach is often problematic in terms of simulation speed and the reliability of the resulting performance numbers.

To facilitate this type of research, several projects have tried to enable simulation of managed workloads. Zsim [18] enables long-running multi-core workloads by using dynamic instrumentation, but this approach sacrifices accuracy and cannot account for fine-grained interactions such as write barriers in garbage collectors. Other examples are MARSSx86 [17] and Sniper [9], which are full-system emulators that can fast-forward to regions of interest and then simulate those regions at high fidelity. Both simulators have been used to simulate Java workloads [8, 19]. However, this approach is only appropriate if short, representative regions can be found, and architectural state does not build up slowly.

timing models for off-chip components such as DRAM (running either on the FPGA or on a host machine). This approach can realistically simulate the performance of an ASIC implementation, and provides a combination of accuracy, simulation speed and modifiability that makes hardware and software co-design feasible.

3 THE JIKES RVM

To experiment with managed runtimes, we require a runtime system that can be easily modified. We picked the Jikes Research VM [4], which is the de facto standard in managed-language research. Jikes is a VM for Java, and is highly representative of other managed runtime systems. We ported JikesRVM and its non-optimizing Baseline JIT compiler to RISC-V. To our knowledge, this results in the first full-system platform for hardware-software research on Java applications, allowing modification of the entire hardware and software stack. In the following section, we describe our port. We particularly focus on aspects that will be useful for authors of