
Understanding How High-Performance Java Virtual Machines Work Through Maxine VM
Jim Huang (黃敬群), Aug 2, 2013 / Taipei International Convention Center (台北國際會議中心)

OpenJDK vs. Dalvik/ART
Jim Huang (黃敬群), Nov 15, 2014 / Academia Sinica (中央研究院)

What We Will Learn
• How dynamic compilers such as those in HotSpot and Dalvik/ART work
• The common optimization techniques in virtual machines
• Performance-specific issues

What We Won't
• JVM tuning
• JNI, GC, invokedynamic
• Production tweaking
• Android programming, sorry

Heritage of Languages
JavaScript draws on:
• Scheme: function closure
• Self: prototype-based OO
• Java: C-like syntax, built-in objects
• …

Heritage of Virtual Machine
• Self VM (Self)
• Strongtalk VM (Smalltalk)
• HotSpot VM (Java)
• CLDC-HI (Java)
• V8 (JavaScript)

JIT
• Just-In-Time compilation
• Compiled when needed
  – Maybe immediately before execution
  – ...or when we decide it's important
  – ...or never?

Mixed-Mode
• Interpreted
  – Bytecode-walking
  – Artificial stack machine
• Compiled
  – Direct native operations
  – Native register machine

Profiling
• Gather data about code while interpreting
  – Invariants (types, constants, nulls)
  – Statistics (branches, calls)
• Use that information to optimize
  – Educated guess
  – Guess can be wrong...

Runtime Statistics

Golden Rule of Optimization
• Don't do unnecessary work.

Optimizations
• Method inlining
• Loop unrolling
• Lock coarsening/eliding
• Dead code elimination
• Duplicate code elimination
• Escape analysis

Inlining
• Combine caller and callee into one unit
  – e.g. based on profile
  – Perhaps with a guard/test
• Optimize as a whole
  – More code means better visibility

Inlining
    int addAll(int max) {
        int accum = 0;
        for (int i = 0; i < max; i++) {
            accum = add(accum, i);   // only one target is ever seen
        }
        return accum;
    }

    int add(int a, int b) {
        return a + b;
    }

Inlining
    int addAll(int max) {
        int accum = 0;
        for (int i = 0; i < max; i++) {
            accum = accum + i;       // don't bother making the call
        }
        return accum;
    }

Loop unrolling
• Works for small, constant loops
• Avoids tests and branching
• Allows a single call to be inlined as many times as the loop iterates

Loop unrolling
    private static final String[] options = { "yes", "no", "maybe" };

    public void looper() {
        // small loop, constant stride, constant size
        for (String option : options) {
            process(option);
        }
    }

Loop unrolling
    private static final String[] options = { "yes", "no", "maybe" };

    public void looper() {
        // unrolled!
        process(options[0]);
        process(options[1]);
        process(options[2]);
    }

Lock Coarsening
    public void needsLocks() {
        for (String option : options) {
            process(option);         // repeatedly locking
        }
    }

    private synchronized String process(String option) {
        // some wacky thread-unsafe code
    }

Lock Coarsening
    public void needsLocks() {
        synchronized (this) {        // lock once
            for (String option : options) {
                // some wacky thread-unsafe code
            }
        }
    }

Lock Eliding
    public void overCautious() {
        List<String> l = new ArrayList<>();
        synchronized (l) {           // synchronize on new Object
            for (String option : options) {
                l.add(process(option));
            }
        }
    }
    // But we know it never escapes this thread...
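The source-level transformations on the preceding slides (inlining, loop unrolling, lock coarsening) can be written out by hand to check that they preserve behavior. The sketch below is only an illustration: the class name `TransformSketch`, the `log` field, the `snapshot()` helper, and the body of `process()` are assumptions that stand in for the slides' "wacky thread-unsafe code". A real JIT applies these rewrites to compiled code guided by profiles, not to Java source.

```java
/** Hand-applied versions of the slide transformations (illustrative only). */
class TransformSketch {
    static final String[] OPTIONS = { "yes", "no", "maybe" };

    // Hypothetical shared state, guarded by the lock on "this".
    private final StringBuilder log = new StringBuilder();

    static int add(int a, int b) { return a + b; }

    // Original: one call to add() per iteration.
    static int addAll(int max) {
        int accum = 0;
        for (int i = 0; i < max; i++) accum = add(accum, i);
        return accum;
    }

    // After inlining add(): the call is replaced by its body.
    static int addAllInlined(int max) {
        int accum = 0;
        for (int i = 0; i < max; i++) accum = accum + i;
        return accum;
    }

    // Hypothetical stand-in for the slides' process(): appends under the lock.
    private synchronized void process(String option) {
        log.append(option).append(';');
    }

    private synchronized String snapshot() { return log.toString(); }

    // Original: acquires and releases the lock once per iteration.
    String needsLocks() {
        for (String option : OPTIONS) process(option);
        return snapshot();
    }

    // After inlining process() and coarsening: one lock around the whole loop.
    String needsLocksCoarsened() {
        synchronized (this) {
            for (String option : OPTIONS) log.append(option).append(';');
            return log.toString();
        }
    }

    // After also unrolling the three-element loop.
    String needsLocksUnrolled() {
        synchronized (this) {
            log.append(OPTIONS[0]).append(';');
            log.append(OPTIONS[1]).append(';');
            log.append(OPTIONS[2]).append(';');
            return log.toString();
        }
    }

    public static void main(String[] args) {
        System.out.println(addAll(10) + " " + addAllInlined(10));       // 45 45
        System.out.println(new TransformSketch().needsLocks());          // yes;no;maybe;
        System.out.println(new TransformSketch().needsLocksCoarsened()); // yes;no;maybe;
        System.out.println(new TransformSketch().needsLocksUnrolled());  // yes;no;maybe;
    }
}
```

Each variant runs on a fresh instance so the results can be compared directly. The hand-written versions only show that the rewrites are semantically equivalent; they say nothing about when HotSpot actually chooses to apply them.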

Lock Eliding
    public void overCautious() {
        List<String> l = new ArrayList<>();
        for (String option : options) {
            l.add( /* process()'s code */ );
        }
    }
    // No need to lock

Escape Analysis
    private static class Foo {
        public final String a;
        public final String b;

        Foo(String a, String b) {
            this.a = a;
            this.b = b;
        }
    }

Escape Analysis
    public void bar() {
        Foo f = new Foo("Hello", "JVM");
        baz(f);
    }

    public void baz(Foo f) {
        System.out.print(f.a);       // same object all the way through
        System.out.print(", ");
        quux(f);
    }

    public void quux(Foo f) {
        System.out.print(f.b);       // never "escapes" these methods
        System.out.println('!');
    }

Escape Analysis
    // Conceptual result after inlining bar(), baz() and quux(); not real source
    public void inlinedBarBazQuux() {
        System.out.print("Hello");
        System.out.print(", ");
        System.out.print("JVM");
        System.out.println('!');
    }
    // Don't bother allocating the Foo object

Escape Analysis
• A bit finicky on HotSpot
  – All paths must inline
  – No external view of object

Performance Pitfall
• Memory accesses
  – By far the biggest expense
• Calls
  – Memory reference + branch kills pipeline
  – Call stack, register juggling costs
• Locks

Performance Pitfall (again)
• Each CPU maintains a memory cache
• Caches may be out of sync
  – If it doesn't matter, no problem
  – If it does matter, threads disagree!
• volatile forces synchronization of cache
  – Across cores and to main memory

HotSpot
• Client mode (C1) inlines less aggressively
  – Fewer opportunities to optimize
• Server mode (C2) inlines aggressively
  – Based on richer runtime profiling

Tiered
• Increasing tiers of interpreter, C1, and C2
• Level 0 = Interpreter
• Level 1-3 = C1
• Level 4 = C2

HotSpot Client Compiler

C2 Compiler
• Profile to find "hot spots"
  – Call sites
  – Branch statistics
  – Profile until 10k calls

From Interpreter to Compiler
Pure interpreter
• Source-level interpreter
• Tree-traversal interpreter
• Bytecode interpreter
  – switch-threading
  – indirect-threading
  – token-threading
  – direct-threading
  – subroutine-threading
  – inline-threading
  – context-threading
  – …
Simple compiler to optimizing compiler
• Base compiler
• Static optimizing compiler
  – based on code patterns
  – …
• Dynamic optimizing compiler
  – based on hardware and operating system
  – based on execution frequency
  – based on type reflection
  – based on whole-program analysis
  – …

Runtime Overhead: Interpreter
• fetch
• execute
• dispatch

Java source
    public class TOSDemo {
        public static int test() {
            int i = 31;
            int j = 42;
            int x = i + j;
            return x;
        }
    }

JVM bytecode (produced by javac)
    iload_0
    iload_1
    iadd
    istore_2

How is it executed?

Interpreted by the JVM in Sun JDK 1.0.2 [x86 instruction sequence]
    ;;------------iload_0------------
    mov ecx,dword ptr ss:[esp+1C]
    mov edx,dword ptr ss:[esp+14]
    add dword ptr ss:[esp+14],4
    mov eax,dword ptr ds:[ecx]
    inc dword ptr ss:[esp+10]
    mov dword ptr ds:[edx+10],eax
    jmp javai.1000E182
    mov eax,dword ptr ss:[esp+10]
    mov bl,byte ptr ds:[eax]
    xor eax,eax
    mov al,bl
    cmp eax,0FF
    ja short javai.1000E130
    jmp dword ptr ds:[eax*4+10011B54]
    ;;------------iload_1------------
    mov ecx,dword ptr ss:[esp+1C]
    mov edx,dword ptr ss:[esp+14]
    add dword ptr ss:[esp+14],4
    mov eax,dword ptr ds:[ecx+4]
    inc dword ptr ss:[esp+10]
    mov dword ptr ds:[edx+10],eax
    jmp javai.1000E182
    mov eax,dword ptr ss:[esp+10]
    mov bl,byte ptr ds:[eax]
    xor eax,eax
    mov al,bl
    cmp eax,0FF
    ja short javai.1000E130
    jmp dword ptr ds:[eax*4+10011B54]
    ;;-------------iadd--------------
    mov ecx,dword ptr ss:[esp+14]
    inc dword ptr ss:[esp+10]
    sub dword ptr ss:[esp+14],4
    mov edx,dword ptr ds:[ecx+C]
    add dword ptr ds:[ecx+8],edx
    jmp javai.1000E182
    mov eax,dword ptr ss:[esp+10]
    mov bl,byte ptr ds:[eax]
    xor eax,eax
    mov al,bl
    cmp eax,0FF
    ja short javai.1000E130
    jmp dword ptr ds:[eax*4+10011B54]
    ;;-----------istore_2------------
    mov eax,dword ptr ss:[esp+14]
    mov ecx,dword ptr ss:[esp+1C]
    sub dword ptr ss:[esp+14],4
    mov edx,dword ptr ds:[eax+C]
    inc dword ptr ss:[esp+10]
    mov dword ptr ds:[ecx+8],edx
    jmp javai.1000E182
    mov eax,dword ptr ss:[esp+10]
    mov bl,byte ptr ds:[eax]
    xor eax,eax
    mov al,bl
    cmp eax,0FF
    ja short javai.1000E130
    jmp dword ptr ds:[eax*4+10011B54]

Interpreted by the JVM in Sun JDK 1.1.8 [x86 instruction sequence]
    ;;-------------iload_0-------------
    movzx eax,byte ptr ds:[esi+1]
    mov ebx,dword ptr ss:[ebp]
    inc esi
    jmp dword ptr ds:[eax*4+1003FBD4]
    ;;-------------iload_1-------------
    movzx eax,byte ptr ds:[esi+1]
    mov ecx,dword ptr ss:[ebp+4]
    inc esi
    jmp dword ptr ds:[eax*4+1003FFD4]
    ;;--------------iadd---------------
    add ebx,ecx
    movzx eax,byte ptr ds:[esi+1]
    inc esi
    jmp dword ptr ds:[eax*4+1003FBD4]
    ;;------------istore_2-------------
    movzx eax,byte ptr ds:[esi+1]
    mov dword ptr ss:[ebp+8],ebx
    inc esi
    jmp dword ptr ds:[eax*4+1003F7D4]

Interpreted by HotSpot in Sun JDK 6 [x86 instruction sequence]
    ;;-------------iload_0-------------
    mov eax,dword ptr ds:[edi]
    movzx ebx,byte ptr ds:[esi+1]
    inc esi
    jmp dword ptr ds:[ebx*4+6DB188C8]
    ;;-------------iload_1-------------
    push eax
    mov eax,dword ptr ds:[edi-4]
    movzx ebx,byte ptr ds:[esi+1]
    inc esi
    jmp dword ptr ds:[ebx*4+6DB188C8]
    ;;--------------iadd---------------
    pop edx
    add eax,edx
    movzx ebx,byte ptr ds:[esi+1]
    inc esi
    jmp dword ptr ds:[ebx*4+6DB188C8]
    ;;------------istore_2-------------
    mov dword ptr ds:[edi-8],eax
    movzx ebx,byte ptr ds:[esi+1]
    inc esi
    jmp dword ptr ds:[ebx*4+6DB19CC8]

Instruction traces

Summary: OpenJDK

Introduction to Dalvik VM
Figure source: K. Yaghmour, Embedded Android, 1st edition, Chapter 2, Figure 2-1

Dalvik VM in a nutshell
• The core of Android applications
  – All fancy Android applications are run by it
• Register-based process virtual machine
  – Think of running a Java application
• Intermediate language = Dalvik bytecode
• Executable file = Dalvik Executable (DEX)
  – A Java class converted by the "dx" tool
  – Reduces redundancy in variables

Dalvik VM: Bytecode
• Register-based, 32 bits
• Instruction fetch unit: 16 bits
• Bytecode stored as binary
• Constant pools
  – String, Type, Field, Method, Class
• Human-readable syntax and mnemonics
• Instruction suffixes: -wide (64-bit opcodes), -char, -boolean, -short, -byte, -int, -long, -float, -object, -string, -class, -void

Dalvik is Register-based
• const/4 stores 1 into register 0
• add-int/lit8 sums the value in register 0 (1) with the literal 2 and stores the result into register 1, namely "foo"
• Fewer dispatches generally mean the interpreter spends less time reading code and more time running it

Dalvik Bytecode: Human syntax
• Example: move-wide/from16 vAA, vBBBB
  – Opcode "move": move a register's value
  – "wide" is the name suffix
    • It operates on wide (64-bit) data.
  – "from16" is the opcode suffix
    • It takes a 16-bit register reference as a source.
  – "vAA" is the destination register
    • v0 – v255
  – "vBBBB" is the source register
    • v0 – v65535

Dalvik Registers
• Consider the for loop shown here: in Java bytecode it is not legal for a loop body to simply push a number onto the stack.
  – To be able to map stack slots to hardware registers, the stack height must be the same at the start and end of a loop, unlike true stack-based languages such as Forth.
• The irony is that, in the end, normal JVMs convert to the same form as Dalvik anyway.
  – For instance, the Java HotSpot 6 client JIT.
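
The "fewer dispatches" point from the "Dalvik is Register-based" slide can be made concrete with two toy interpreters that compute foo = 1 + 2. This is a minimal sketch under stated assumptions: the class `DispatchSketch` and the opcodes `PUSH`, `ADD`, `STORE`, `CONST` and `ADD_LIT` are hypothetical stand-ins, loosely modeled on a stack machine and on const/4 plus add-int/lit8. They are not the real JVM or Dalvik encodings.

```java
/** Toy stack and register interpreters; opcodes are hypothetical, not real encodings. */
class DispatchSketch {
    // Hypothetical stack-machine opcodes.
    static final int PUSH = 0, ADD = 1, STORE = 2;
    // Hypothetical register-machine opcodes.
    static final int CONST = 0, ADD_LIT = 1;

    // foo = 1 + 2 on the stack machine: push 1, push 2, add, store.
    static final int[][] STACK_CODE = { {PUSH, 1}, {PUSH, 2}, {ADD}, {STORE} };
    // The same on the register machine: v0 = 1; v1 = v0 + 2.
    static final int[][] REGISTER_CODE = { {CONST, 0, 1}, {ADD_LIT, 1, 0, 2} };

    /** Returns { stored result, number of dispatched instructions }. */
    static int[] runStack(int[][] code) {
        int[] stack = new int[8];
        int sp = 0, local0 = 0, dispatches = 0;
        for (int[] insn : code) {
            dispatches++;                       // one fetch + dispatch per instruction
            switch (insn[0]) {
                case PUSH:  stack[sp++] = insn[1]; break;
                case ADD: {
                    int b = stack[--sp];
                    int a = stack[--sp];
                    stack[sp++] = a + b;
                    break;
                }
                case STORE: local0 = stack[--sp]; break;
                default: throw new IllegalStateException("bad opcode " + insn[0]);
            }
        }
        return new int[] { local0, dispatches };
    }

    /** Returns { value of register v1, number of dispatched instructions }. */
    static int[] runRegister(int[][] code) {
        int[] regs = new int[4];
        int dispatches = 0;
        for (int[] insn : code) {
            dispatches++;
            switch (insn[0]) {
                case CONST:   regs[insn[1]] = insn[2]; break;                 // vA = literal
                case ADD_LIT: regs[insn[1]] = regs[insn[2]] + insn[3]; break; // vA = vB + literal
                default: throw new IllegalStateException("bad opcode " + insn[0]);
            }
        }
        return new int[] { regs[1], dispatches };
    }

    public static void main(String[] args) {
        int[] s = runStack(STACK_CODE);
        int[] r = runRegister(REGISTER_CODE);
        System.out.println("stack:    result=" + s[0] + " dispatches=" + s[1]); // result=3 dispatches=4
        System.out.println("register: result=" + r[0] + " dispatches=" + r[1]); // result=3 dispatches=2
    }
}
```

Both programs compute 3, but the register form needs two dispatches where the stack form needs four. The trade-off, which the toy hides, is that each register instruction carries more operands, so individual instructions are wider.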