Falcon a new JIT compiler in Zing JVM Agenda
• What is Falcon? Why do we need a new compiler? • Why did we decide to use LLVM? • What does it take to make a Java JIT from LLVM? • How does it fit with exiting technology, like ReadyNow?
2 Zing JVM Zing: A better JVM
• Commercial server JVM • Key features • C4 GC, ReadyNow!, Falcon
5 What is Falcon?
• Top tier JIT compiler in Zing JVM • Replacement for С2 compiler • Based on LLVM
6 Development Timeline
PoC GA on by default
Apr Dec Apr 2014 2016 2017
Estimated resource investment ~20 man-years (team of 4-6 people)
7 Why do we need a new compiler?
• Zing used to have C2 like OpenJDK • C2 is aging poorly — difficult to maintain and evolve • Looking for competitive advantage over competition
8 Alternatives
• Writing compiler from scratch? Too hard • Open-source compilers • GCC, Open64, Graal, LLVM…
9 Alternatives
• Writing compiler from scratch? Too hard • Open-source compilers short list • Graal vs LLVM
10 “The LLVM Project is a collection of modular and reusable compiler and toolchain technologies”
– llvm.org Where LLVM is used?
• C/C++/Objective C • Swift • Haskell • Rust • …
12 Who makes LLVM?
More than 500 developers
13 LLVM
• Started in 2000 • Stable and mature • Proven performance for C/C++ code
14 LLVM IR General-purpose high-level assembler int mul_add(int x, int y, int z) { return x * y + z; } define i32 @mul_add(i32 %x, i32 %y, i32 %z) { entry: %tmp = mul i32 %x, %y %tmp2 = add i32 %tmp, %z ret i32 %tmp2 }
15 Infrastructure provides
• Analysis, transformations • LLVM IR => LLVM IR • Backends for all major CPU architectures • LLVM IR => Native • Tools
16 How to write a compiler?
Clang C/C++/ LLVM X86 C ObjectiveC X86 Frontend Backend
llvm-gcc LLVM PowerPC Power Fortran LLVM Optimizer Frontend Backend PC
LLVM IR LLVM IR LLVM ARM Haskell GHC Frontend ARM Backend How to write a compiler?
LLVM X86 X86 Backend
Your LLVM PowerPC Power Frontend LLVM Optimizer language Backend PC
LLVM IR LLVM IR LLVM ARM ARM Backend Falcon JVM LLVM
Java Bytecode LLVM Bytecode Frontend IR LLVM Optimizer
LLVM IR .obj Install Method LLVM X86 Backend
19 Orca
• Our fork of LLVM • LLVM-based Java JIT compiler • Can be used in other VM • Or outside of any VM
20 Falcon JVM LLVM aka Orca
Custom Optimizer Java Bytecode LLVM PassManager PM; Bytecode Frontend IR PM.addPass(createLLVMPassA()); PM.addPass(createLLVMPassB()); PM.addPass(createAzulPassA()); PM.addPass(createLLVMPassC()); PM.addPass(createAzulPassB()); …
LLVM IR .obj Install Method LLVM X86 Backend
21 VM Callbacks
static final Object X = new Integer(42); bool foo() { return X instanceof Number; }
22 VM Callbacks static final Object X = new Integer(42); bool foo() { return X instanceof Number; }
define i1 @foo() { %X = load i8*, i8**
23 VM Callbacks static final Object X = new Integer(42); bool foo() { return X instanceof Number; } VM, what is at
23 VM Callbacks static final Object X = new Integer(42); bool foo() { return X instanceof Number; } VM, what is the type of this object? define i1 @foo() { %X = load i8*, i8**
24 VM Callbacks static final Object X = new Integer(42); bool foo() { return X instanceof Number; } java.lang.Integer define i1 @foo() { %X = load i8*, i8**
25 VM Callbacks static final Object X = new Integer(42); bool foo() { return X instanceof Number; }
define i1 @foo() { %res = call i1 @is_subtype_of(
26 VM Callbacks static final Object X = new Integer(42); bool foo() { return X instanceof Number; } VM, is Integer a subtype of Number? define i1 @foo() { %res = call i1 @is_subtype_of(
27 VM Callbacks static final Object X = new Integer(42); bool foo() { return X instanceof Number; }
define i1 @foo() { ret i1 true }
28 VM Callbacks
• Is class A a subclass of class B? • Can you devirtualize a call for the given receiver type?
• Generate LLVM IR for the given method • …
29 Falcon JVM LLVM aka Orca
Custom Optimizer Java Bytecode LLVM PassManager PM; Bytecode Frontend IR PM.addPass(createLLVMPassA()); PM.addPass(createLLVMPassB()); PM.addPass(createAzulPassA()); PM.addPass(createLLVMPassC()); VM PM.addPass(createAzulPassB()); Callbacks … LLVM IR .obj Install Method LLVM X86 Backend
30 What does it take to make a Java JIT from LLVM? LLVM as a JIT
• LLVM is mostly used for static compilation • Clang — the main user • A lot of experiments to use LLVM as a JIT for a managed language • Shark — (was) an experimental LLVM-based JIT compiler in Open JDK
32 What was missing?
• Precise Garbage Collection • Speculative optimizations and deoptimizations • Java-specific optimizations
34 GC-Compiler interaction
• How does the GC find the references? • Stack, registers
35 GC-Compiler interaction
• How does the GC find the references? • Stack, registers
• GC safepoints — the GC can parse the stack at safepoints
• GC can ask a thread to pause at a safepoint
35 Thread is at a Safepoint
Object foo() { void bar() { Object o = ...; ... bar(); <
GC should be able to parse the stack
36 GC Relocations
Object foo() { Object o = ...; bar(); return o; }
The compiler should expect o to be relocated
37 GC Relocations
Object foo() { Object o = ...; _o = safepoint(bar(), o); return _o; }
The compiler should expect o to be relocated
38 Statepoint function args… live pointers…
call gc.statepoint(
return value relocated pointers…
39 Statepoint Intrinsic http://llvm.org/docs/Statepoints.html declare token @llvm.experimental.gc.statepoint(func_type
40 Statepoint Intrinsic http://llvm.org/docs/Statepoints.html
declare token @llvm.experimental.gc.statepoint(func_type
Lowered to a call to
GC parses the stack using the Stack Map
40 Safepoints and optimizations void foo(Object o) { if (o instanceof String) The optimizer return; _o = safepoint(bar(), o); doesn’t know if (_o instanceof String) anything about _o return; ... }
41 Safepoints and optimizations void foo(Object o) { void foo(Object o) { if (o instanceof String) if (o instanceof String) return; return; bar(); bar(); if (o instanceof String) ... return; } ... } We don’t use explicit relocations for optimizations
42 RS4GC Implicit relocations, strong gc pointers
RewriteStatepointsForGC Pass
Explicit relocations, integral pointers
43 Optimization Pipeline
Most of the optimizations work on implicit relocations
RewriteStatepointsForGC Pass
Low-level optimizations over explicit relocations
44 What was missing?
✓Precise Garbage Collection • Speculative optimizations and deoptimization • Java-specific optimizations
46 Uncommon Traps
Speculation example
if (cond) foo(); else bar();
47 Uncommon Traps
Profile…
if (cond) foo(); // 0 else bar(); // 10000
foo was never called!
48 Uncommon Traps
Will not compile the foo call!
if (cond) foo(); else bar();
49 Uncommon Traps
Will not compile the foo call!
if (cond) if (cond) foo();
49 Uncommon Traps
How to express an uncommon trap in LLVM IR?
declare type @llvm.experimental.deoptimize(...)
50 Deoptimization final int f = 5; boolean foo() { Assume that int f_ = f; final f never bar(); changes return f == _f; }
51 Deoptimization final int f = 5; boolean foo() { Assume that int f_ = f; final f never bar(); changes return f == _f; }
52 Deoptimization final int f = 5; boolean foo() { Assume that int f_ = f; final f never bar(); changes return true; }
53 Deoptimization final int f = 5; void bar() { boolean foo() { // abuse field int f_ = f; // finality! bar(); // f = 1 return true; } }
Where do we return?
54 Deoptimization final int f = 5; void bar() { boolean foo() { // abuse field int f_ = f; // finality! bar(); // f = 1 return true; } }
To the interpreter!
55 Return from a call
Any call has two execution continuations
Return to the call foo() interpreter
Normal return
56 Return from a call
Any call has two execution continuations
call foo() if (deoptimize) Return to the
Normal return
57 Interpreter State
0: iconst_1 1: istore_1 boolean foo() { 2: aload_0 int f_ = f; 3: invokevirtual bar() bar(); 6: iconst_1 if (deoptimize) 7: iload_1
58 Interpreter State
0: iconst_1 1: istore_1 boolean foo() { 2: aload_0 int f_ = f; 3: invokevirtual bar() bar(); 6: iconst_1 if (deoptimize) 7: iload_1
60 Interpreter State
0: iconst_1 1: istore_1 boolean foo() { 2: aload_0 int f_ = f; 3: invokevirtual bar() bar(); 6: iconst_1 if (deoptimize) 7: iload_1
61 Interpreter State
• Interpreter state consists of: • BCI, locals, stack, monitors • Runtime should know where to get these values
62 Interpreter State
How to express this in LLVM IR?
call void @f(i32 %arg) [ “deopt”(i32 0, i8* %a, i32* null) ]
63 Operand Bundles
http://llvm.org/docs/LangRef.html#operand-bundles
• Generic mechanism to add a set of values to a call • Use Deopt Bundle to describe the interpreter state
64 Deopt Bundle Lowering
How does the runtime find out these values?
call void @f(i32 %arg) [ “deopt”(i32 0, i8* %a, i32* null) ]
65 Deopt Bundle Lowering
call void @f(i32 %arg) [ “deopt”(i32 0, i8* %a, i32* null) ]
RewriteStatepointsForGC call token @llvm.experimental.gc.statepoint(void @f(i32), i64 1, i32 %arg, i64 3, i32 0, i8* %a, i32* null, ... (gc parameters))
66 Deopt Bundle Lowering
call token @llvm.experimental.gc.statepoint(void @f(i32), i64 1, i32 %arg, i64 3, i32 0, i8* %a, i32* null, ... (gc parameters))
Stack Map is generated
Runtime uses the Stack Map to find the values
67 What was missing?
✓Precise Garbage Collection ✓Speculative optimizations and deoptimization • Java-specific optimizations
69 High-level semantics
• High-level semantics should be taken into account for optimizations
• E.g. devirtualization • LLVM IR — low level representation • LLVM doesn’t know anything about Java semantics
71 High Level Embedded IR
• High level IR embedded in LLVM IR • Attributes and metadata for type annotations • Abstractions — high level intrinsics • Functions with the semantics known by the optimizer
72 High Level Embedded IR Example
int foo(Object o) { if (o instanceof Integer) if (o instanceof Number) return o.hashCode(); return 0; }
73 High Level Embedded IR Example
int foo(i8* “java-type-class”=“Object” o) { o_class = @get_class(o); if (@is_subtype_of(Integer, o_class)) Attributes if (@is_subtype_of(Number, o_class)) { target = @resolve_virtual(o, hashCode); return target(o); } return 0; }
75 High Level Embedded IR Example
int foo(i8* “java-type-class”=“Object” o) { o_class = @get_class(o); if (@is_subtype_of(Integer, o_class)) Abstractions if (@is_subtype_of(Number, o_class)) { target = @resolve_virtual(o, hashCode); return target(o); } return 0; }
76 High Level Embedded IR Optimizations
int foo(i8* “java-type-class”=“Object” o) { o_class = @get_class(o); if (@is_subtype_of(Integer, o_class)) if (@is_subtype_of(Number, o_class)) { target = @resolve_virtual(o, hashCode); After the check return target(o); o_class is Integer! } return 0; }
77 High Level Embedded IR Optimizations
int foo(i8* “java-type-class”=“Object” o) { o_class = @get_class(o); if (@is_subtype_of(Integer, o_class)) if (@is_subtype_of(Number, o_class)) { target = Integer is a @resolve_virtual(o, hashCode); subtype of return target(o); Number } return 0; }
78 High Level Embedded IR Optimizations
int foo(i8* “java-type-class”=“Object” o) { o_class = @get_class(o); if (@is_subtype_of(Integer, o_class)) { resolve_virtual for target = Integer is @resolve_virtual(o, hashCode); Integer::hashCode return target(o); } return 0; }
80 High Level Embedded IR Optimizations
int foo(i8* “java-type-class”=“Object” o) { o_class = @get_class(o); if (@is_subtype_of(Integer, o_class)) return Integer::hashCode(o); return 0; }
81 Abstraction Lowering
• VM provides implementations for the abstractions • Abstractions’ bodies are inlined after high-level optimizations
• After that low-level optimizations can be applied
82 What was missing?
✓Precise Garbage Collection ✓Speculative optimizations and deoptimization ✓Java-specific optimizations
83 Our interaction with LLVM Upstream
• Most of the code is upstreamed • GC infrastructure • Speculative optimizations infrastructure • Optimizations improvements
84 How do we stand against competition? Zing 18.02 vs Oracle 8u152
200%
175%
150%
125%
100%
75%
50%
25%
0% SPECjvm + Dacapo + application benchmarks skylake + haswell
86 Zing 18.02 vs Oracle 8u152
200%
175%
150%
125%
100%
75%
50%
25%
0% 63% win rate
87 Falcon vs C2 Robustness
• Fuzzing testing, in 100 000 tests • Zing Falcon — 1-2 crashes • OpenJDK C2 — ~10 crashes
88 Any downsides?
89 A vector arithmetic microbenchmark
double eachIteration(double[] input, int phase) {
if (phase == 1) { return Case1.case1(input); static double case1(double[] input) { } else { double result = 0; for (int i = 0; i < input.length; i++) { if (phase == 2) { result += Math.sin(input[i]) return Case2.case2(input); / Math.cos(input[i]); } else { } return result; if (phase == 3) { } return Case3.case3(input); } else { return Case4.case4(input); } } }
}
90 Iteration times
OpenJDK Zing C2 Falcon 8u152 Score (lower is better)
712 us/op 438 us/op 223 us/op
91 Iteration times / compile times
Zing C2 Falcon
Score (lower is better)
458 us/op 228 us/op Total compile time
49 ms 467 ms Wait time in a compiler queue
0 methods over 10 ms 8 methods over 10 ms
92 Reducing compile time or combat higher CPU usage
• Reduce number of compilations • ReadyNow • Reuse previous compilations • Compile Stashing
93 ReadyNow
• Persistent profile driven compilation • Similar to PGO in C++ and other compilers • Reduces warmup times and number of deoptimizations
94 Recording a profile
CT1 CT2 CT3 profile method1 method3 field x really final? was obj a ever null?
method2 Uniq. impl-tor for interface I?
95 Serialized codeprofile
parseInt(java.lang.String, int) throws java.lang.NumberFormatException; Code: 0: aload_0 1: ifnonnull 14 Method 468152524917 109 parseInt (Ljava/lang/String;I)I 261 4: new #25 // class java/lang/NumberFormatException Branch 468152524917 2365 1bci 18720taken 0untaken 7: dup Branch 468152524917 2365 16bci 18720taken 0untaken 8: ldc #26 // String null ... 10: invokespecial #27 13: athrow Jump 468152524917 2365 239bci 31824count 14: iload_1 Jump 468152524917 2365 242bci 18720count 15: iconst_2 ... 16: if_icmpge 51 C1Counters 468152524917 2365 18720invok. 31824backedge 0 0 0 0 19: new #25 // class java/lang/NumberFormatException ... 22: dup 23: new #28 // class java/lang/StringBuilder MethodData 468152524917 2365 4426 7124 0 { } 26: dup 1514905536667692158 | C2Compilation 468152524917 2365 {inlined 27: invokespecial #29 // Method java/lang/StringBuilder."
96 Inputs to compile
Bytecodes Codeprofile Runtime data
Tier 2 compiler
97 Using a profile
method1 method2 method3
CT1 CT2 CT3
98 Deoptimizations without/with profile
3 total 0 total
unloaded 3 0
# of top tier compiles 9 6
99 Same test with ReadyNow
Falcon Falcon + ReadyNow
Score (lower is better)
228 us/op 226 us/op
Time the top tier compiler
467 ms 284 ms Worse op time
2111 us 514 us
100 Reducing compile time
✓Reduce number of compilations caused by late deoptimizations
✓ReadyNow •Reuse previous compilations •Compile Stashing
101 Compile Stashing
• Reuse top-tier compilation • Applied to JDK and user code • Different from JAOT • Challenges are the same as for JOAT
102 Compile Stashing - Write JVM LLVM aka Orca
Custom Optimizer Java Bytecode LLVM Bytecode Frontend IR
… VM Callbacks LLVM IR .obj Install Method LLVM X86 Backend
103 Compile Stashing - Write JVM LLVM aka Orca
Custom Optimizer Java Bytecode LLVM Bytecode Frontend IR
… VM Callbacks LLVM IR .obj Install Method LLVM X86 Backend
104 Compile Stashing - Use JVM
Java Bytecode LLVM Have you seen this IR for this method? Bytecode Frontend IR Yes. These were the VM callbacks Do they still hold? VM Callbacks Callbacks verified. Give me the binary
.obj Install Method
105 Compile Stashing - Use and write JVM LLVM aka Orca
Custom Optimizer Java Bytecode LLVM Bytecode Frontend IR Got something ?? … I have/don’t have VM Callbacks LLVM IR .obj Install Method LLVM X86 Backend
106 A vector arithmetic microbenchmark
Falcon + RN Falcon + RN + Stashing
Time in the compiler
284 ms 53 ms/op
Score (lower is better)
226 us/op 228 us/op
107 Problem soved? ..
• Real-world match rate is 10% - 50 %
• (But we are improving every month) • Some due to well-known problems • Un-identifiable classloaders
• Instrumenting and other transforming agents
• Generated classes
108 Can I try it?
• ReadyNow • In the product since 2015 • Falcon • Since December 2016 • Compile Stashing • Experimental feature • Since 17.12
109 Conclusion
• Falcon - an LLVM-based JIT • in production, now the default compiler • delivers well performing binaries • easier to maintain and advance • Made a lot of work building infrastructure in LLVM upstream • Technologies to re-use compilations from previous runs
110 Questions?
How to try? http://www.azul.com/zingtrial/
111