Understanding How High-Performance Java Virtual Machines Work, through Maxine VM
Jim Huang ( 黃敬群 )
OpenJDK vs. Dalvik/ART
• How a dynamic compiler like HotSpot and Dalvik/ART works
• The common optimization techniques in virtual machines
• Performance-specific issues

What We Won't
• JVM tuning
• JNI, GC, invokedynamic
• Production tweaking
• Android programming, sorry

Heritage of Languages
[Figure: heritage of JavaScript. Scheme: function closure; Self: prototype-based OO; Java: C-like syntax, built-in objects; …]

Heritage of Virtual Machine
CLDC-HI (Java)
HotSpot VM (Java)
Strongtalk VM (Smalltalk)
Self VM (Self)
V8 (JavaScript)

JIT
• Just-In-Time compilation
• Compiled when needed
  – Maybe immediately before execution
  – ...or when we decide it’s important
  – ...or never?

Mixed-Mode
• Interpreted
  – Bytecode-walking
  – Artificial stack machine
• Compiled
  – Direct native operations
  – Native register machine

Profiling
• Gather data about code while interpreting
  – Invariants (types, constants, nulls)
  – Statistics (branches, calls)
• Use that information to optimize
  – Educated guess
  – Guess can be wrong...

Runtime Statistics

Golden Rule of Optimization
• Don’t do unnecessary work.

Optimizations
• Method inlining
• Loop unrolling
• Lock coarsening/eliding
• Dead code elimination
• Duplicate code elimination
• Escape analysis

Inlining
• Combine caller and callee into one unit
  – e.g. based on profile
  – Perhaps with a guard/test
• Optimize as a whole
  – More code means better visibility

Inlining

    int addAll(int max) {
        int accum = 0;
        for (int i = 0; i < max; i++) {
            accum = add(accum, i);
        }
        return accum;
    }

    int add(int a, int b) { return a + b; }

Inlining

    int addAll(int max) {
        int accum = 0;
        for (int i = 0; i < max; i++) {
            accum = add(accum, i);   // Only one target is ever seen
        }
        return accum;
    }

    int add(int a, int b) { return a + b; }

Inlining

    int addAll(int max) {
        int accum = 0;
        for (int i = 0; i < max; i++) {
            accum = accum + i;       // Don’t bother making the call
        }
        return accum;
    }

Loop unrolling
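As an illustration of the "guard/test" mentioned for inlining, here is a minimal hand-written sketch of what guarded inlining could look like if expressed in Java source. The names `Adder`, `PlainAdder`, `ClampingAdder` and `addAllGuarded` are hypothetical and only serve the example. A real JIT works on its internal representation, would usually hoist the guard out of the loop, and would typically deoptimize rather than fall back to a call.

```java
/** Sketch of guarded inlining; all names here are illustrative, not JVM internals. */
final class GuardedInliningSketch {
    interface Adder { int add(int a, int b); }

    /** The only implementation the (imaginary) profile has ever seen. */
    static final class PlainAdder implements Adder {
        @Override public int add(int a, int b) { return a + b; }
    }

    /** A second implementation, so the guard has something to reject. */
    static final class ClampingAdder implements Adder {
        @Override public int add(int a, int b) {
            long sum = (long) a + b;
            return (int) Math.max(Integer.MIN_VALUE, Math.min(Integer.MAX_VALUE, sum));
        }
    }

    /** What the programmer wrote: a virtual call on every iteration. */
    static int addAll(Adder adder, int max) {
        int accum = 0;
        for (int i = 0; i < max; i++) {
            accum = adder.add(accum, i);
        }
        return accum;
    }

    /** Hand-written picture of compiled code after profiling saw only PlainAdder. */
    static int addAllGuarded(Adder adder, int max) {
        int accum = 0;
        for (int i = 0; i < max; i++) {
            if (adder.getClass() == PlainAdder.class) {
                accum = accum + i;            // guard passed: inlined body, no call
            } else {
                accum = adder.add(accum, i);  // guard failed: fall back to the real call
            }
        }
        return accum;
    }

    public static void main(String[] args) {
        System.out.println(addAll(new PlainAdder(), 10));
        System.out.println(addAllGuarded(new PlainAdder(), 10));
        System.out.println(addAllGuarded(new ClampingAdder(), 10));
    }
}
```

Both versions compute the same sum for any receiver; the guarded one simply skips the call when the receiver is the expected class.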
• Works for small, constant loops
• Avoids tests and branching
• Allows a single call to be inlined as many copies

Loop unrolling
    private static final String[] options = { "yes", "no", "maybe" };

    public void looper() {
        for (String option : options) {   // Small loop, constant stride, constant size
            process(option);
        }
    }

Loop unrolling

    private static final String[] options = { "yes", "no", "maybe" };

    public void looper() {
        process(options[0]);   // Unrolled!
        process(options[1]);
        process(options[2]);
    }

Lock Coarsening
    public void needsLocks() {
        for (String option : options) {
            process(option);   // Repeatedly locking
        }
    }

    private synchronized String process(String option) {
        // some wacky thread-unsafe code
    }

Lock Coarsening

    public void needsLocks() {
        synchronized (this) {   // Lock once
            for (String option : options) {
                // some wacky thread-unsafe code
            }
        }
    }

Lock Eliding
    public void overCautious() {
        List<String> l = new ArrayList<>();
        synchronized (l) {   // Synchronize on new Object
            for (String option : options) {
                l.add(process(option));
            }
        }
    }

But we know it never escapes this thread...

Lock Eliding

    public void overCautious() {
        List<String> l = new ArrayList<>();
        for (String option : options) {   // No need to lock
            l.add(/* process()’s code */);
        }
    }

Escape Analysis

    private static class Foo {
        public final String a;
        public final String b;

        Foo(String a, String b) {
            this.a = a;
            this.b = b;
        }
    }

Escape Analysis

    public void bar() {
        Foo f = new Foo("Hello", "JVM");
        baz(f);
    }

    // Same object all the way through; never “escapes” these methods
    public void baz(Foo f) {
        System.out.print(f.a);
        System.out.print(", ");
        quux(f);
    }

    public void quux(Foo f) {
        System.out.print(f.b);
        System.out.println('!');
    }

Escape Analysis

    // "secret awesome" is not real Java: this is the JIT's own fully inlined version
    public secret awesome inlinedBarBazQuux() {
        System.out.print("Hello");
        System.out.print(", ");
        System.out.print("JVM");
        System.out.println('!');
    }

Don’t bother allocating the Foo object

Escape Analysis
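The Foo example can be turned into a small runnable sketch that compares the allocating version with a hand-written "scalar replaced" version. This is only an assumption about the shape of the optimized code: the method `barScalarReplaced` is hypothetical, and the methods return strings instead of printing so that the two results can be compared directly.

```java
/** Sketch: the same computation with and without the short-lived Foo object. */
final class EscapeAnalysisSketch {
    private static final class Foo {
        final String a;
        final String b;

        Foo(String a, String b) {
            this.a = a;
            this.b = b;
        }
    }

    /** Original shape: allocates a Foo and passes it through baz and quux. */
    static String bar() {
        Foo f = new Foo("Hello", "JVM");
        return baz(f);
    }

    static String baz(Foo f) {
        return f.a + ", " + quux(f);
    }

    static String quux(Foo f) {
        return f.b + "!";
    }

    /**
     * Hand-written equivalent of what a JIT might produce once bar, baz and quux
     * are all inlined and Foo never escapes: the fields become plain locals.
     */
    static String barScalarReplaced() {
        String a = "Hello";
        String b = "JVM";
        return a + ", " + b + "!";
    }

    public static void main(String[] args) {
        System.out.println(bar());
        System.out.println(barScalarReplaced());
    }
}
```

Both methods return the same string; the second one never allocates a Foo, which is the effect the slides describe.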
• A bit tweaky on HotSpot
  – All paths must inline
  – No external view of object

Performance Pitfall
• Memory accesses
  – By far the biggest expense
• Calls
  – Memory reference + branch kills pipeline
  – Call stack, register juggling costs
• Locks

Performance Pitfall (again)
• Each CPU maintains a memory cache
• Caches may be out of sync
  – If it doesn’t matter, no problem
  – If it does matter, threads disagree!
• Volatile forces synchronization of cache
  – Across cores and to main memory

HotSpot
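The volatile point from the cache slide can be sketched with a flag shared between two threads. This is a minimal example under the Java memory model, not a description of any particular CPU; the class and method names are hypothetical. Without `volatile` on `done`, the reader thread would be allowed to keep spinning on a stale value.

```java
/** Sketch: a volatile flag publishes a plain field from one thread to another. */
final class VolatileFlagSketch {
    private static volatile boolean done;  // volatile: writes become visible to other threads
    private static int payload;            // plain field, published via the volatile write

    static int publishAndRead() {
        done = false;
        payload = 0;
        final int[] seen = new int[1];
        Thread reader = new Thread(() -> {
            while (!done) {
                // spin: each iteration re-reads the volatile flag
            }
            seen[0] = payload;  // ordered after the volatile read, so it observes 42
        });
        reader.start();
        payload = 42;   // ordinary write...
        done = true;    // ...made visible by the following volatile write
        try {
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for reader", e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(publishAndRead());
    }
}
```

The cost hinted at in the slide is that every volatile access limits what the compiler and CPU may reorder or keep in registers, so it should be used only where threads really need to agree.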
• Client mode (C1) inlines, less aggressively
  – Fewer opportunities to optimize
• Server mode (C2) inlines aggressively
  – Based on richer runtime profiling

Tiered
• Increasing tiers of interpreter, C1, and C2
• Level 0 = Interpreter
• Level 1-3 = C1
• Level 4 = C2

HotSpot Client Compiler

C2 Compiler
• Profile to find “hot spots”
  – Call sites
  – Branch statistics
  – Profile until 10k calls

From Interpreter to Compiler
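The "profile until 10k calls" rule can be pictured as a per-method invocation counter. The sketch below is a toy model with hypothetical names (`HotMethodCounterSketch`, `recordCall`); real HotSpot also counts loop back-edges, and the actual thresholds depend on the compiler mode and tier.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy invocation profiler: flags a method as hot once a call threshold is reached. */
final class HotMethodCounterSketch {
    static final int COMPILE_THRESHOLD = 10_000;  // the "10k calls" from the slide

    private final Map<String, Integer> invocationCounts = new HashMap<>();

    /** Records one interpreted call; returns true exactly when the method becomes hot. */
    boolean recordCall(String methodName) {
        int count = invocationCounts.merge(methodName, 1, Integer::sum);
        return count == COMPILE_THRESHOLD;
    }

    public static void main(String[] args) {
        HotMethodCounterSketch profiler = new HotMethodCounterSketch();
        int hotAt = -1;
        for (int call = 1; call <= 20_000; call++) {
            if (profiler.recordCall("addAll")) {
                hotAt = call;  // here a real VM would queue the method for compilation
            }
        }
        System.out.println("addAll handed to the compiler at call " + hotAt);
    }
}
```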
Pure interpreter
• Source-level interpreter
• Tree-traversal interpreter
• Bytecode interpreter
  – switch-threading
  – indirect-threading
  – token-threading
  – direct-threading
  – subroutine-threading
  – inline-threading
  – context-threading
  – …

Simple compiler to optimizing compiler
• Baseline compiler
• Static optimizing compiler
  – based on code patterns
• Dynamic optimizing compiler
  – based on hardware and operating system
  – based on execution frequency
  – based on type feedback
  – based on whole-program analysis
  – …

Runtime overhead: Interpreter
fetch execute dispatch
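The fetch, dispatch, execute cycle can be sketched as a switch-based bytecode interpreter. The opcode values below are the real JVM encodings for these instructions, but everything else is a simplification made for this example: the code lives in an `int[]`, there are only three local slots, and the class name is hypothetical. The sample program adds 31 and 42 using a stack, the way javac would compile a small method.

```java
/** Sketch of a switch-dispatched interpreter for a tiny subset of JVM bytecode. */
final class SwitchInterpreterSketch {
    static final int BIPUSH = 0x10, ILOAD_0 = 0x1a, ILOAD_1 = 0x1b, ILOAD_2 = 0x1c,
            ISTORE_0 = 0x3b, ISTORE_1 = 0x3c, ISTORE_2 = 0x3d, IADD = 0x60, IRETURN = 0xac;

    /** Number of instructions dispatched by the most recent run. */
    static int dispatchCount;

    static int run(int[] code) {
        int[] locals = new int[3];
        int[] stack = new int[4];
        int sp = 0;
        int pc = 0;
        dispatchCount = 0;
        while (true) {
            int opcode = code[pc++];           // fetch
            dispatchCount++;
            switch (opcode) {                  // dispatch, then execute one case
                case BIPUSH:   stack[sp++] = code[pc++]; break;
                case ILOAD_0:  stack[sp++] = locals[0]; break;
                case ILOAD_1:  stack[sp++] = locals[1]; break;
                case ILOAD_2:  stack[sp++] = locals[2]; break;
                case ISTORE_0: locals[0] = stack[--sp]; break;
                case ISTORE_1: locals[1] = stack[--sp]; break;
                case ISTORE_2: locals[2] = stack[--sp]; break;
                case IADD: {
                    int right = stack[--sp];
                    int left = stack[--sp];
                    stack[sp++] = left + right;
                    break;
                }
                case IRETURN:  return stack[--sp];
                default: throw new IllegalArgumentException("unknown opcode " + opcode);
            }
        }
    }

    /** int i = 31; int j = 42; int x = i + j; return x; */
    static final int[] ADD_PROGRAM = {
        BIPUSH, 31, ISTORE_0,
        BIPUSH, 42, ISTORE_1,
        ILOAD_0, ILOAD_1, IADD, ISTORE_2,
        ILOAD_2, IRETURN
    };

    public static void main(String[] args) {
        System.out.println(run(ADD_PROGRAM) + " after " + dispatchCount + " dispatches");
    }
}
```

Every instruction pays for one trip around the loop and one switch, which is the overhead the threading techniques listed earlier try to reduce.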
Java source

    public class TOSDemo {
        public static int test() {
            int i = 31;
            int j = 42;
            int x = i + j;
            return x;
        }
    }

JVM bytecode (javac)

    iload_0
    iload_1
    iadd
    istore_2

How is it executed?

Interpreted by the Sun JDK 1.0.2 JVM [x86 instruction sequence]

    ;; ------ iload_0 ------
    mov ecx,dword ptr ss:[esp+1C]
    mov edx,dword ptr ss:[esp+14]
    add dword ptr ss:[esp+14],4
    mov eax,dword ptr ds:[ecx]
    inc dword ptr ss:[esp+10]
    mov dword ptr ds:[edx+10],eax
    jmp javai.1000E182
    mov eax,dword ptr ss:[esp+10]
    mov bl,byte ptr ds:[eax]
    xor eax,eax
    mov al,bl
    cmp eax,0FF
    ja short javai.1000E130
    jmp dword ptr ds:[eax*4+10011B54]
    ;; ------ iload_1 ------
    mov ecx,dword ptr ss:[esp+1C]
    mov edx,dword ptr ss:[esp+14]
    add dword ptr ss:[esp+14],4
    mov eax,dword ptr ds:[ecx+4]
    inc dword ptr ss:[esp+10]
    mov dword ptr ds:[edx+10],eax
    jmp javai.1000E182
    mov eax,dword ptr ss:[esp+10]
    mov bl,byte ptr ds:[eax]
    xor eax,eax
    mov al,bl
    cmp eax,0FF
    ja short javai.1000E130
    jmp dword ptr ds:[eax*4+10011B54]
    ;; ------ iadd ------
    mov ecx,dword ptr ss:[esp+14]
    inc dword ptr ss:[esp+10]
    sub dword ptr ss:[esp+14],4
    mov edx,dword ptr ds:[ecx+C]
    add dword ptr ds:[ecx+8],edx
    jmp javai.1000E182
    mov eax,dword ptr ss:[esp+10]
    mov bl,byte ptr ds:[eax]
    xor eax,eax
    mov al,bl
    cmp eax,0FF
    ja short javai.1000E130
    jmp dword ptr ds:[eax*4+10011B54]
    ;; ------ istore_2 ------
    mov eax,dword ptr ss:[esp+14]
    mov ecx,dword ptr ss:[esp+1C]
    sub dword ptr ss:[esp+14],4
    mov edx,dword ptr ds:[eax+C]
    inc dword ptr ss:[esp+10]
    mov dword ptr ds:[ecx+8],edx
    jmp javai.1000E182
    mov eax,dword ptr ss:[esp+10]
    mov bl,byte ptr ds:[eax]
    xor eax,eax
    mov al,bl
    cmp eax,0FF
    ja short javai.1000E130
    jmp dword ptr ds:[eax*4+10011B54]

Interpreted by the Sun JDK 1.1.8 JVM [x86 instruction sequence]

    ;; ------ iload_0 ------
    movzx eax,byte ptr ds:[esi+1]
    mov ebx,dword ptr ss:[ebp]
    inc esi
    jmp dword ptr ds:[eax*4+1003FBD4]
    ;; ------ iload_1 ------
    movzx eax,byte ptr ds:[esi+1]
    mov ecx,dword ptr ss:[ebp+4]
    inc esi
    jmp dword ptr ds:[eax*4+1003FFD4]
    ;; ------ iadd ------
    add ebx,ecx
    movzx eax,byte ptr ds:[esi+1]
    inc esi
    jmp dword ptr ds:[eax*4+1003FBD4]
    ;; ------ istore_2 ------
    movzx eax,byte ptr ds:[esi+1]
    mov dword ptr ss:[ebp+8],ebx
    inc esi
    jmp dword ptr ds:[eax*4+1003F7D4]

Interpreted by Sun JDK 6 HotSpot [x86 instruction sequence]

    ;; ------ iload_0 ------
    mov eax,dword ptr ds:[edi]
    movzx ebx,byte ptr ds:[esi+1]
    inc esi
    jmp dword ptr ds:[ebx*4+6DB188C8]
    ;; ------ iload_1 ------
    push eax
    mov eax,dword ptr ds:[edi-4]
    movzx ebx,byte ptr ds:[esi+1]
    inc esi
    jmp dword ptr ds:[ebx*4+6DB188C8]
    ;; ------ iadd ------
    pop edx
    add eax,edx
    movzx ebx,byte ptr ds:[esi+1]
    inc esi
    jmp dword ptr ds:[ebx*4+6DB188C8]
    ;; ------ istore_2 ------
    mov dword ptr ds:[edi-8],eax
    movzx ebx,byte ptr ds:[esi+1]
    inc esi
    jmp dword ptr ds:[ebx*4+6DB19CC8]

instruction traces

Summary: OpenJDK

Introduction to Dalvik VM
K. Yaghmour, Embedded Android, 1st edition, Chapter 2, Figure 2-1

Dalvik VM in a nutshell
• The core of Android applications
  – All fancy Android applications are run by it
• Register-based process virtual machine
  – Think of running a Java application
• Intermediate language = Dalvik bytecode
• Executable file = Dalvik Executable (DEX)
  – Java classes converted by the “dx” tool
  – Reduces redundancy in variables

Dalvik VM: Bytecode
• Register-based, 32-bit
• Instruction fetch unit: 16 bits
• Bytecode is stored as binary
• Constant pools
  – String, Type, Field, Method, Class
• Human-readable syntax and mnemonics

Instruction Suffix
-wide (64-bit opcodes)   -char
-boolean                 -short
-byte                    -int
-long                    -float
-object                  -string
-class                   -void

Dalvik is Register-based
• const/4 to store 1 into register 0
• add-int/lit8 to sum the value in register 0 (1) with the literal 2 and store the result into register 1, namely “foo”
• Fewer dispatches generally mean the interpreter spends less time reading code and more time running it

Dalvik Bytecode: Human syntax
• Example: move-wide/from16 vAA, vBBBB
  – "move" is the base opcode: move a register's value
  – "wide" is the name suffix
    • it operates on wide (64-bit) data
  – "from16" is the opcode suffix
    • a 16-bit register reference as the source
  – "vAA" is the destination register
    • v0 – v255
  – "vBBBB" is the source register
    • v0 – v65535

Dalvik Registers
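The move-wide/from16 vAA, vBBBB layout described above can be made concrete with a small decoder. This sketch assumes the published Dalvik instruction format for this opcode (two 16-bit code units: opcode 0x05 in the low byte and AA in the high byte of the first unit, BBBB in the second); the class and method names are hypothetical.

```java
/** Sketch: decoding one move-wide/from16 instruction from two 16-bit code units. */
final class MoveWideFrom16Decoder {
    static final int OP_MOVE_WIDE_FROM16 = 0x05;

    static String decode(int unit0, int unit1) {
        int opcode = unit0 & 0xff;               // low 8 bits: opcode
        if (opcode != OP_MOVE_WIDE_FROM16) {
            throw new IllegalArgumentException("not move-wide/from16: opcode " + opcode);
        }
        int destination = (unit0 >>> 8) & 0xff;  // high 8 bits: vAA, so v0 to v255
        int source = unit1 & 0xffff;             // second unit: vBBBB, so v0 to v65535
        return "move-wide/from16 v" + destination + ", v" + source;
    }

    public static void main(String[] args) {
        System.out.println(decode(0x0305, 0x0123));
    }
}
```

The 8-bit destination and 16-bit source fields are why the slide gives different register ranges for vAA and vBBBB.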
• Consider the for loop shown here: in Java bytecode it is not legal to simply push a number onto the stack inside a loop.
  – To be able to map stack slots to hardware registers, the stack height must be the same at the start and end of a loop, unlike true stack-based languages such as Forth.
• The irony is that, in the end, normal JVMs convert to the same form as Dalvik anyway.
  – For instance, the Java HotSpot 6 client JIT.

DEX Translation Example
SymDroid: Symbolic Execution for Dalvik Bytecode, Technical Report CS-TR-5022, July 2012
Jinseong Jeon, Kristopher K. Micinski, Jeffrey S. Foster
Department of Computer Science, University of Maryland, College Park

Hello World in Dalvik bytecode
    .class public LHelloWorld;
    .super Ljava/lang/Object;
    .method public static main([Ljava/lang/String;)V
        .registers 2
        sget-object v0, Ljava/lang/System;->out:Ljava/io/PrintStream;
        const-string v1, "Hello World!"
        invoke-virtual {v0, v1}, Ljava/io/PrintStream;->println(Ljava/lang/String;)V
        return-void
    .end method

Hello World in Dalvik bytecode

    .class public LHelloWorld;
    .super Ljava/lang/Object;
    .method public static main([Ljava/lang/String;)V
        .registers 2
        # Assign variables
        sget-object v0, Ljava/lang/System;->out:Ljava/io/PrintStream;
        const-string v1, "Hello World!"
        # invoke-* { parameters }, method-to-call
        #   static: static method
        #   virtual: virtual method
        invoke-virtual {v0, v1}, Ljava/io/PrintStream;->println(Ljava/lang/String;)V
        return-void
    .end method

Unless the method is declared void, it returns a value.
Return types: return-void, return-object, return v0

Optimizing Dispatch
• Refine the dispatch
• Make dispatch more efficient, or reduce the number of dispatches
  – Super instructions
    • e.g. add a constant integer to a variable
    • JVM case: load the variable onto the stack, load the constant, add, and store back
    • Dalvik VM case: all in one super instruction
  – Larger code size

Optimizing Dispatch
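The "add a constant to a variable" case can be counted out in a toy interpreter that understands both styles. This is a sketch with a made-up encoding: the opcode numbers and the `int[]` code format are assumptions for illustration, and only `ADD_INT_LIT8` is modelled on a real Dalvik instruction (add-int/lit8 vAA, vBB, #+CC).

```java
/** Sketch: the same addition as four stack instructions or one register instruction. */
final class DispatchCountSketch {
    // Stack-style opcodes (toy encoding, not real JVM opcode values)
    static final int ILOAD = 0, ICONST = 1, IADD = 2, ISTORE = 3, HALT = 4;
    // Register-style opcode, modelled on Dalvik's add-int/lit8
    static final int ADD_INT_LIT8 = 5;

    /** Runs code against vars and returns how many instructions were dispatched (HALT excluded). */
    static int run(int[] code, int[] vars) {
        int[] stack = new int[4];
        int sp = 0;
        int pc = 0;
        int dispatches = 0;
        while (true) {
            int opcode = code[pc++];
            if (opcode == HALT) {
                return dispatches;
            }
            dispatches++;
            switch (opcode) {
                case ILOAD:  stack[sp++] = vars[code[pc++]]; break;
                case ICONST: stack[sp++] = code[pc++]; break;
                case IADD: {
                    int right = stack[--sp];
                    int left = stack[--sp];
                    stack[sp++] = left + right;
                    break;
                }
                case ISTORE: {
                    int slot = code[pc++];
                    vars[slot] = stack[--sp];
                    break;
                }
                case ADD_INT_LIT8: {
                    int destination = code[pc++];
                    int source = code[pc++];
                    int literal = code[pc++];
                    vars[destination] = vars[source] + literal;
                    break;
                }
                default: throw new IllegalArgumentException("unknown opcode " + opcode);
            }
        }
    }

    /** v1 = v0 + 2, stack style: load, load constant, add, store. */
    static final int[] STACK_CODE = { ILOAD, 0, ICONST, 2, IADD, ISTORE, 1, HALT };
    /** v1 = v0 + 2, register style: a single instruction. */
    static final int[] REGISTER_CODE = { ADD_INT_LIT8, 1, 0, 2, HALT };

    public static void main(String[] args) {
        int[] stackVars = { 1, 0 };
        int[] registerVars = { 1, 0 };
        System.out.println("stack: " + run(STACK_CODE, stackVars) + " dispatches, v1 = " + stackVars[1]);
        System.out.println("register: " + run(REGISTER_CODE, registerVars) + " dispatches, v1 = " + registerVars[1]);
    }
}
```

The register form needs one dispatch instead of four, at the price of a wider instruction that carries three operands.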
● Selective inlining
  – The idea is to construct (new) super instruction bodies by concatenating the bodies of the virtual instructions that make them up.
  – This avoids the overhead of the multiple dispatches that are implied by the original bytecodes.

Ian Piumarta and Fabio Riccardi, “Optimizing direct threaded code by selective inlining,” Proceedings of the 1998 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI)

Interpreter Optimization
Why do we still look into interpreter optimization?
• JIT compilers use an interpreter for the first executions
• The interpreter produces profile data for determining where and how to expend the optimization effort

A faster interpreter means:
• You can wait longer before native-code compilation (boost time)
• As a result, less code is compiled (cost down)
• Better profile information, giving statistically sounder optimization decisions

M. Anton Ertl and David Gregg, "The Structure and Performance of Efficient Interpreters," Journal of Instruction-Level Parallelism 5 (2003) 1-25.

From Source Code to Execution
Java Source Code
  ↓ javac
Java Class
  ↓ dx
DEX file
  ↓ dexopt
Optimized DEX
  ↓ dalvikvm
Forays in software development – Stack based vs Register based Virtual Machine Architecture, and the Dalvik VM (cited Nov 26, 2013)

DEX Optimizations
• Before being executed by Dalvik, DEX files are optimized.
  – Normally this happens before the first execution of code from the DEX file
  – Combined with bytecode verification
  – For DEX files from APKs, it happens when the application is launched for the first time.
• Process
  – The dexopt process (which is actually a backdoor of Dalvik) loads the DEX and replaces certain instructions with their optimized counterparts
  – It then writes the resulting optimized DEX (ODEX) file into the /data/dalvik-cache directory
  – It is assumed that the optimized DEX file will be executed on the same VM that optimized it. ODEX files are NOT portable across VMs.

dexopt: Instruction Rewritten
• Virtual (non-private, non-constructor, non-static methods) invoke-virtual