A JVM for the Barrelfish Operating System
2nd Workshop on Systems for Future Multi-core Architectures (SFMA'12)

Martin Maas (University of California, Berkeley)
Ross McIlroy (Google Inc.)

10 April 2012, Bern, Switzerland

Introduction

- Future multi-core architectures will presumably...

  - ...have a larger number of cores
  - ...exhibit a higher degree of diversity
  - ...be increasingly heterogeneous
  - ...have no cache coherence/shared memory

- These changes (arguably) require new approaches for Operating Systems: e.g. Barrelfish, fos, Tessellation,...

- Barrelfish's approach: treat the machine's cores as nodes in a distributed system, communicating via message-passing (illustrated by the sketch after this list).

- But: how to program such a system uniformly?
- How to achieve good performance on all configurations?
- How to structure executables for these systems?
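The following toy sketch (plain Java with invented names, not Barrelfish code or APIs) illustrates the programming model these questions are about: two "cores" that share no mutable state and interact only through explicit message channels.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Toy illustration of the multikernel programming model: two "cores"
    // that share no mutable state and communicate only via messages.
    public class MessagePassingSketch {
        record Message(String op, int arg) {}

        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<Message> requests = new ArrayBlockingQueue<>(16);
            BlockingQueue<Integer> replies  = new ArrayBlockingQueue<>(16);

            // "Core 1" owns a private counter; other cores can only read or
            // update it by sending messages, never by touching shared memory.
            Thread core1 = new Thread(() -> {
                int counter = 0;
                try {
                    while (true) {
                        Message m = requests.take();
                        switch (m.op()) {
                            case "add"  -> counter += m.arg();
                            case "read" -> replies.put(counter);
                            case "stop" -> { return; }
                        }
                    }
                } catch (InterruptedException e) { /* shut down */ }
            });
            core1.start();

            // "Core 0" manipulates core 1's state purely via request/response.
            requests.put(new Message("add", 42));
            requests.put(new Message("read", 0));
            System.out.println("counter = " + replies.take()); // prints 42
            requests.put(new Message("stop", 0));
            core1.join();
        }
    }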

Introduction

- Answer: Managed Language Runtime Environments (e.g. Java Virtual Machine, Common Language Runtime)
- Advantages over a native programming environment:

  - Single-system image
  - Transparent migration of threads
  - Dynamic optimisation and compilation
  - Language extensibility

- Investigate challenges of bringing up a JVM on Barrelfish.
- Comparing two different approaches:

  - Conventional shared-memory approach
  - Distributed approach in the style of Barrelfish

Outline

1. The Barrelfish Operating System
2. Implementation Strategy
   - Shared-memory approach
   - Distributed approach
3. Performance Evaluation
4. Discussion & Conclusions
5. Future Work

The Barrelfish Operating System

- Barrelfish is based on the Multikernel Model: it treats the multi-core machine as a distributed system.

- Communication through a lightweight message-passing mechanism.

- Global state is replicated rather than shared.

[Figure 2.3: Interactions between Barrelfish's core components — application domains with per-core dispatchers and monitors in user mode, one CPU driver per core in kernel mode, running directly on the CPUs.]

CPU Driver: Each core runs its own CPU driver in kernel mode, isolated from the rest of the system. This allows Barrelfish to support heterogeneous systems, since the CPU driver exposes an ABI that is (mostly) independent from the underlying architecture of the core.

Monitor: The monitor runs in user mode. Together, the monitors across all cores coordinate to provide most traditional OS functionality, such as memory management, spanning domains between cores and managing timers. Monitors communicate with each other via inter-core communication. Global OS state (such as memory mappings) is replicated between the monitors and kept consistent using agreement protocols.

Dispatchers: Each core runs one or more dispatchers. These are user-level thread schedulers that are up-called by the CPU driver to perform the scheduling for one particular process. Since processes in Barrelfish can span multiple cores, they may have multiple dispatchers associated with them, one per core on which the process is running. Together, these dispatchers form the "process domain". Dispatchers are responsible for spawning threads on the different cores of a domain and performing user-level scheduling.

Implementation

- Running real-world Java applications would require bringing up a full JVM (e.g. the Jikes RVM) on Barrelfish.

- This stresses the memory system (virtual memory is fully managed by the JVM), and Barrelfish lacked necessary features (e.g. page-fault handling, a file system).

- Would have distracted from understanding the core challenges.

- Approach: implementation of a rudimentary Java VM that provides just enough functionality to run standard Java benchmarks (Java Grande Benchmark Suite).

- Supports 198 out of 201 bytecode instructions (all except wide, goto_w and jsr_w), Inheritance, Strings, Arrays, Threads,... (see the interpreter sketch after this list)

- No Garbage Collection, JIT, Exception Handling, Dynamic Linking or Class Loading, Reflection,...
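To make the scope concrete, the core of such an interpreter is a bytecode dispatch loop of roughly the following shape. This is a generic sketch using a handful of real JVM opcodes, not the code of the JVM described here:

    // Generic shape of a bytecode interpreter's dispatch loop. Each
    // iteration fetches one opcode and manipulates an explicit operand
    // stack, mirroring the JVM's stack-machine semantics.
    static int interpret(byte[] code, int maxStack) {
        int pc = 0;                       // index of the next bytecode
        int sp = 0;                       // operand-stack pointer
        int[] stack = new int[maxStack];  // operand stack (ints only here)
        while (true) {
            int opcode = code[pc++] & 0xFF;
            switch (opcode) {
                case 0x10:                // bipush: push sign-extended byte
                    stack[sp++] = code[pc++];
                    break;
                case 0x60: {              // iadd: pop two ints, push the sum
                    int b = stack[--sp], a = stack[--sp];
                    stack[sp++] = a + b;
                    break;
                }
                case 0xAC:                // ireturn: return the top of stack
                    return stack[--sp];
                // ... one case per supported instruction (198 of 201 here) ...
                default:
                    throw new UnsupportedOperationException("opcode " + opcode);
            }
        }
    }

For example, interpret(new byte[]{0x10, 2, 0x10, 3, 0x60, (byte) 0xAC}, 4) evaluates 2 + 3 and returns 5.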

Shared memory vs. Distributed approach

[Diagram: In the shared-memory approach, a single domain spans all cores and every thread operates directly on one shared heap (obj A ... obj D). In the distributed approach, each core runs its own JVM instance (JVM0-JVM3) holding a partition of the objects; operations on remote objects become message pairs, e.g. putfield/putfield ack, move object/move object ack, invoke/return.]

The distributed approach

[Sequence diagram across jvm-node0 ... jvm-node3: for an object lookup, a node broadcasts obj request messages to the other nodes and blocks until the owning node answers with an obj request response. A method call on the remote object (invoke virtual) is then an invoke message answered by a return, and each field access is a getfield message answered by a getfield response.]
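A minimal sketch of what this protocol means for a single bytecode: in the distributed design, a getfield on a remote object becomes a blocking request/response pair rather than a memory load. The Channel and Message types below are invented stand-ins for the actual inter-core messaging, not Barrelfish APIs.

    // Invented stand-in for an inter-core message channel.
    interface Channel {
        void send(Message m);
        Message receive();                 // blocks until a reply arrives
    }

    record Message(String kind, int objectId, String field, int value) {}

    class RemoteHeap {
        private final Channel toOwner;     // channel to the object's home node

        RemoteHeap(Channel toOwner) { this.toOwner = toOwner; }

        // Executed when the interpreter hits a getfield whose target object
        // lives on another jvm-node: one request plus one response message.
        int getfield(int objectId, String field) {
            toOwner.send(new Message("getfield", objectId, field, 0));
            Message reply = toOwner.receive();  // interpreter blocks here
            return reply.value();
        }
    }

The owning node's handler performs the actual load and sends the value back, so every remote field access pays a full message round-trip; this is the cost the evaluation below quantifies.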

Performance Evaluation

- Performance evaluation using the sequential and parallel Java Grande Benchmarks (mostly Section 2 - compute kernels).

- Performed on a 48-core AMD Magny-Cours (Opteron 6168).
  - Four 2x6-core processors, 8 NUMA nodes (8GB RAM each).

- Evaluation of the shared-memory version on Linux (using numactl to pin cores) and on Barrelfish.

- Evaluation of the distributed version only on Barrelfish.

- Compared performance to an industry-standard JVM (OpenJDK 1.6.0) with and without JIT compilation.

Single-core (sequential) performance

- Consistently within a factor of 2-3 of OpenJDK without JIT.

Benchmark       OpenJDK (JIT)   OpenJDK (No JIT)   JVM (Linux)   JVM (Barrelfish)
SparseMatmult   0.77            6.24               12.85         14.42
SOR             1.02            20.01              43.02         40.47
Series          11.78           17.89              18.22         21.08
LUFact          0.13            3.45               9.74          9.85
HeapSort        0.39            5.43               16.19         16.34
FFT             4.4             27                 54.89         67.29
Crypt           0.28            4.83               8.55          8.75

Figure 4.9: Application benchmarks on a single core (SizeA); execution time in s

While these results do not represent new findings, they are important for evaluating the success of the project. They show that the developed JVM provides the performance necessary to run real-world software, both on Linux and Barrelfish.

4.5 Multi-core performance

Performance on multiple cores was evaluated using the parallel SparseMatmult benchmark from the JGF benchmark suite. The benchmark was chosen since it stresses inter-core communication and does not use Math.sqrt(), which exhibits different performance on Barrelfish and Linux (Figure 4.8).

Performance of the shared-memory approach

- Using the parallel sparse matrix multiplication Java Grande benchmark JGFSparseMatmultBenchSizeB.

- Scales to 48 cores as expected (relative to OpenJDK).

[Plot: JGFSparseMatmultBenchSizeB — execution time in s (log scale, 0.1 to 100) against number of cores (0 to 50), for the Barrelfish JVM (Barrelfish, shared memory), the Barrelfish JVM (Linux), OpenJDK (No JIT) and OpenJDK.]

Performance of the shared-memory approach

- Quasi-linear speed-up implies a large interpreter overhead (the benchmark stays compute-bound instead of saturating memory bandwidth).

- Barrelfish overhead presumably from agreement protocols.

[Figure 4.13: Average speed-up of the shared-memory approach — speed-up (0 to 50) against number of cores (0 to 50), for Linux and Barrelfish (shared).]

The number of iterations was reduced by a factor of 10 to limit the run-time of the experiments (which would have been up to 25h otherwise). Since the benchmark only measures the execution of the kernel, this gives a very close approximation for the run-time of JGFSparseMatmultBenchSizeA, divided by 10. I call this benchmark JGFSparseMatmultBenchSizeA*.

Cores   Run-time in s   σ (Standard deviation)⁴
1       2.7010          0.0017
2       457.893         7.8906
3       395.681         3.5453
4       402.342         7.6161
5       444.382         2.1277
6       514.251         36.769
7       1764.32         247.74
8       2631.27         335.90
16      9333.87         (only executed once)

Table 4.1: Results of JGFSparseMatmultBenchSizeA*

⁴ Since some executions exhibited a high variance, the standard deviation is given for this experiment.

Performance of the distributed approach

- Distributed approach is orders of magnitude slower than the shared-memory approach.

- Sparse Matrix Multiplication is a difficult benchmark for this implementation: 7 pairs of messages for each iteration of the kernel (almost no communication for shared-memory).

- Overhead arguably caused by inter-core communication (150-600 cycles) and message handling in Barrelfish.

Cores   Run-time in s   σ (Standard deviation)
1       2.70            0.002
2       458             7.891
3       396             3.545
4       402             7.616
5       444             2.128
6       514             36.77
7       1764            247.7
8       2631            335.9
16      9334            (only executed once)

Performance of the distributed approach

- Measuring completion time of threads on different cores shows the performance limitation due to inter-core communication.
- All data "lives" on the same home node (Core #0).
- Cores 0-5 are within a single processor; 6 & 7 are off-chip.

The results show that without optimisation, the distributed approach is too slow to be feasible, at least for this benchmark. Measuring the run-time of each individual thread gives evidence that this is caused by the overhead of message passing: while a thread running on the home node of the working dataset (jvm-node0) completes very quickly, threads on other cores take orders of magnitude longer (Figure 4.14). The diagram also confirms that communication with cores on other chips (#6 and #7) is significantly more expensive than on-chip communication (Figure 4.3).
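Per-thread run-times of this kind can be collected with a small wrapper of roughly the following shape (a generic sketch with invented names, not the instrumentation actually used):

    import java.util.function.IntConsumer;

    // Run `work` on n threads and record each thread's own completion time,
    // the kind of measurement behind Figure 4.14.
    static long[] timePerThread(int n, IntConsumer work) throws InterruptedException {
        long[] elapsedNanos = new long[n];
        Thread[] workers = new Thread[n];
        for (int i = 0; i < n; i++) {
            final int id = i;
            workers[i] = new Thread(() -> {
                long start = System.nanoTime();
                work.accept(id);           // this thread's share of the kernel
                elapsedNanos[id] = System.nanoTime() - start;
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        return elapsedNanos;               // one entry per worker thread
    }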

Thread (running on jvm-node*)   Execution time in s
#0                              0.36
#1                              453.24
#2                              452.06
#3                              453.52
#4                              452.58
#5                              470.05
#6                              2,386.94
#7                              2,478.17

Figure 4.14: Run-times of individual threads

For this particular benchmark, the distributed JVM has to exchange 7 pairs of messages for each iteration of the loop in Listing 4.1 (1 getfield, 1 astore, 5 aload), while the shared-memory approach requires almost no inter-core communication (all arrays reside in the local cache most of the time and there is little contention, since different threads write to different parts of the output array). There are two basic aspects that add to the overhead of the message passing:

- Inter-core communication: Each message transfer has to invoke the cache-coherence protocol, causing a delay of up to 150-600 cycles, depending on the architecture and the number of hops [12].
- Message handling: The client has to yield the interpreter thread, poll for messages, execute the message handler code and unblock the interpreter thread. This involves two context switches.
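For reference, the kernel loop in question has essentially the following shape (variable names assumed from the JGF sources, not quoted from Listing 4.1); the comments count the remote operations that become message pairs when all arrays live on jvm-node0 and the executing thread does not:

    // Sparse matrix-vector kernel. Remote operations per iteration:
    //   5 aload:    row[i], col[i], val[i], x[col[i]], y[row[i]]
    //   1 astore:   the write back into y[row[i]]
    //   1 getfield: fetching an array reference field of the benchmark object
    // = 7 message pairs per iteration.
    static void kernel(double[] val, int[] row, int[] col,
                       double[] x, double[] y, int nz) {
        for (int i = 0; i < nz; i++) {
            y[row[i]] += val[i] * x[col[i]];
        }
    }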

Discussion & Future Work

- Preliminary results show that future work should focus on reducing message-passing overhead and the number of messages.
- How can these overheads be alleviated?

  - Reduce inter-core communication: caching of objects and arrays, in the style of a directory-based MSI cache-coherence protocol (see the sketch after this list).
  - Reduce message-passing latency: hardware support for message-passing (e.g. running on the Intel SCC).
  - Additional areas of interest:

    - Garbage Collection on such a system.
    - Relocation of objects at run-time.
    - Logical partitioning of objects.
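As a rough illustration of the first idea above, each cached replica could carry a per-object state machine in the style of a directory-based MSI protocol. The sketch below is entirely hypothetical (nothing like it is implemented in this work); the Directory interface stands in for the role of the object's home node.

    enum ReplicaState { MODIFIED, SHARED, INVALID }

    // Invented stand-in for the home node's directory role.
    interface Directory {
        int[] fetchShared(int objectId);       // join the sharer set, get data
        void requestExclusive(int objectId);   // invalidate all other sharers
    }

    class CachedObject {
        final int objectId;
        ReplicaState state = ReplicaState.INVALID;
        int[] fields;                          // local copy of the fields

        CachedObject(int objectId) { this.objectId = objectId; }

        int read(int slot, Directory dir) {
            if (state == ReplicaState.INVALID) {
                fields = dir.fetchShared(objectId);  // one message round-trip
                state = ReplicaState.SHARED;
            }
            return fields[slot];               // later reads are purely local
        }

        void write(int slot, int value, Directory dir) {
            if (state == ReplicaState.INVALID) {
                fields = dir.fetchShared(objectId);  // need the data first
            }
            if (state != ReplicaState.MODIFIED) {
                dir.requestExclusive(objectId);      // invalidate other sharers
                state = ReplicaState.MODIFIED;
            }
            fields[slot] = value;              // later writes are purely local
        }
    }

Reads in SHARED or MODIFIED and writes in MODIFIED then cost no messages at all, which directly attacks the per-iteration round-trips counted above.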

- Future work should investigate bringing up the Jikes RVM on Barrelfish, focussing on these aspects.

Questions?