Understanding How High-Performance Virtual Machines Work Through Maxine VM

Jim Huang (黃敬群), Aug 2, 2013 / Taipei International Convention Center

OpenJDK vs. Dalvik/ART

Jim Huang (黃敬群), Nov 15, 2014 / Academia Sinica

What We Will Learn

• How a dynamic compiler such as HotSpot or Dalvik/ART works
• The common optimization techniques in virtual machines
• Performance-specific issues

What We Won't

• JVM tuning
• JNI, GC, invokedynamic
• Production tweaking
• Android programming, sorry

Heritage of Languages

JavaScript draws on:
• Scheme: function closures
• Self: prototype-based OO
• Java: C-like syntax, built-in objects
• …

Heritage of Virtual Machine

CLDC-HI (Java)

HotSpot VM (Java)

Strongtalk VM (Smalltalk)

Self VM (Self)

V8 (JavaScript)

JIT

• Just-In-Time compilation
• Compiled when needed
  – Maybe immediately before execution
  – ...or when we decide it's important
  – ...or never?

Mixed-Mode

• Interpreted
  – Bytecode-walking
  – Artificial stack machine
• Compiled
  – Direct native operations
  – Native register machine

Profiling

• Gather data about code while interpreting
  – Invariants (types, constants, nulls)
  – Statistics (branches, calls)
• Use that information to optimize
  – Educated guess
  – Guess can be wrong...

Runtime Statistics

Golden Rule of Optimization

• Don't do unnecessary work.

Optimizations

• Method inlining
• Loop unrolling
• Lock coarsening/eliding
• Dead code elimination
• Duplicate code elimination
• Escape analysis

Inlining

• Combine caller and callee into one unit
  – e.g. based on profile
  – Perhaps with a guard/test
• Optimize as a whole
  – More code means better visibility

Inlining

Before inlining:

    int addAll(int max) {
        int accum = 0;
        for (int i = 0; i < max; i++) {
            accum = add(accum, i);   // only one target is ever seen
        }
        return accum;
    }

    int add(int a, int b) {
        return a + b;
    }

Inlining

After inlining:

    int addAll(int max) {
        int accum = 0;
        for (int i = 0; i < max; i++) {
            accum = accum + i;       // don't bother making the call
        }
        return accum;
    }

Loop unrolling

• Works for small, constant loops
• Avoid tests, branching
• Allow inlining a single call as many

Loop unrolling
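To make the idea concrete, here is a hand-written sketch of unrolling applied to a summation loop. It is only an illustration: the class and method names (`UnrollSketch`, `sumRolled`, `sumUnrolled`) and the unroll factor of 4 are assumptions made for this example, and a JIT performs the equivalent transformation on its internal representation rather than on Java source.

```java
// Hypothetical illustration of loop unrolling, written by hand.
// A JIT would do this on compiled code, not on source.
final class UnrollSketch {

    // Straightforward loop: one bounds test and one branch per element.
    static int sumRolled(int[] data) {
        int sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Unrolled by 4: one test and one branch per four elements,
    // followed by a short cleanup loop for the leftover elements.
    static int sumUnrolled(int[] data) {
        int sum = 0;
        int i = 0;
        int limit = data.length - (data.length % 4);
        for (; i < limit; i += 4) {
            sum += data[i] + data[i + 1] + data[i + 2] + data[i + 3];
        }
        for (; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }
}
```

Both methods return the same result for any input; the unrolled version simply executes fewer loop tests and branches per element.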

    private static final String[] options = { "yes", "no", "maybe" };

    public void looper() {
        for (String option : options) {
            process(option);
        }
    }

Small loop, constant stride, constant size

Loop unrolling

    private static final String[] options = { "yes", "no", "maybe" };

    public void looper() {
        process(options[0]);
        process(options[1]);   // Unrolled!
        process(options[2]);
    }

Lock Coarsening

    public void needsLocks() {
        for (String option : options) {
            process(option);             // repeatedly locking
        }
    }

    private synchronized String process(String option) {
        // some wacky thread-unsafe code
    }

Lock Coarsening

    public void needsLocks() {
        synchronized (this) {            // lock once
            for (String option : options) {
                // some wacky thread-unsafe code
            }
        }
    }

Lock Eliding

    public void overCautious() {
        List l = new ArrayList();
        synchronized (l) {               // synchronizes on a new object
            for (String option : options) {
                l.add(process(option));
            }
        }
    }

But we know it never escapes this thread...

Lock Eliding

    public void overCautious() {
        List l = new ArrayList();
        for (String option : options) {
            l.add(/* process()'s code */);
        }
    }

No need to lock

Escape Analysis

    private static class Foo {
        public final String a;
        public final String b;

        Foo(String a, String b) {
            this.a = a;
            this.b = b;
        }
    }

Escape Analysis

    public void bar() {
        Foo f = new Foo("Hello", "JVM");
        baz(f);
    }

    public void baz(Foo f) {
        System.out.print(f.a);
        System.out.print(", ");
        quux(f);
    }

    public void quux(Foo f) {
        System.out.print(f.b);
        System.out.println('!');
    }

Same object all the way through; it never "escapes" these methods

Escape Analysis

    // the "secret, awesome" method the JIT effectively ends up with
    public void inlinedBarBazQuux() {
        System.out.print("Hello");
        System.out.print(", ");
        System.out.print("JVM");
        System.out.println('!');
    }

Don't bother allocating the Foo object

Escape Analysis

• A bit finicky on HotSpot
  – All paths must inline
  – No external view of object

Performance Pitfall

• Memory accesses
  – By far the biggest expense
• Calls
  – Memory reference + branch kills pipeline
  – Call stack, register juggling costs
• Locks

Performance Pitfall (again)

• Each CPU maintains a memory cache
• Caches may be out of sync
  – If it doesn't matter, no problem
  – If it does matter, threads disagree!
• Volatile forces synchronization of cache
  – Across cores and to main memory

HotSpot

• Client mode (C1) inlines less aggressively
  – Fewer opportunities to optimize
• Server mode (C2) inlines aggressively
  – Based on richer runtime profiling

Tiered

• Increasing tiers of interpreter, C1, and C2
• Level 0 = Interpreter
• Levels 1-3 = C1
• Level 4 = C2

HotSpot Client Compiler

C2 Compiler

• Profile to find "hot spots"
  – Call sites
  – Branch statistics
  – Profile until 10k calls

From Interpreter to Compiler
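The "profile until 10k calls" idea can be sketched as a per-method invocation counter that switches from an interpreted path to a compiled path once a threshold is crossed. This is a toy model, not HotSpot's implementation: the class `HotCounterSketch`, its fields, and the use of two `IntUnaryOperator` values to stand in for "interpreted" and "compiled" code are all assumptions made for illustration. Only the 10,000 figure comes from the slide.

```java
import java.util.function.IntUnaryOperator;

// Toy model of counter-based compilation triggering.
// All names here are hypothetical; real VMs also count loop back-edges,
// compile in background threads, and can deoptimize.
final class HotCounterSketch {
    static final int COMPILE_THRESHOLD = 10_000; // "10k calls" from the slide

    private final IntUnaryOperator interpreted; // stands in for the interpreter path
    private final IntUnaryOperator compiled;    // stands in for JIT-compiled code
    private int invocations = 0;
    private boolean hot = false;

    HotCounterSketch(IntUnaryOperator interpreted, IntUnaryOperator compiled) {
        this.interpreted = interpreted;
        this.compiled = compiled;
    }

    int invoke(int arg) {
        if (hot) {
            return compiled.applyAsInt(arg);    // fast path after "compilation"
        }
        if (++invocations >= COMPILE_THRESHOLD) {
            hot = true;                         // later calls use the compiled path
        }
        return interpreted.applyAsInt(arg);     // still interpreted for this call
    }

    boolean isHot() {
        return hot;
    }
}
```

In this sketch the first 10,000 calls run on the interpreted path and every call after that runs on the compiled one; a method that is called only a few times never pays the compilation cost.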

Pure interpreter
• Source-level interpreter
• Tree-traversal
• Bytecode interpreter
  – switch-threading
  – indirect-threading
  – token-threading
  – direct-threading
  – subroutine-threading
  – inline-threading
  – context-threading
  – …

Simple compiler
• Base compiler

Optimizing compiler
• Static optimizing compiler
  – based on code patterns
• Dynamic optimizing compiler
  – based on the hardware and operating system
  – based on execution frequency
  – based on type feedback
  – based on whole-program analysis
  – …

Runtime overhead: Interpreter

fetch → dispatch → execute (repeated for every instruction)
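The fetch/dispatch/execute cycle can be sketched as a switch-based bytecode loop. The code below is a minimal toy, not any real JVM's interpreter: the class `TinyInterpreter`, the fixed 8-slot operand stack, and the tiny instruction subset are assumptions for illustration. The opcode values are the standard JVM encodings for these instructions.

```java
// Minimal switch-dispatch interpreter for a handful of JVM-style opcodes.
// Hypothetical sketch: no frames, no verification, int values only.
final class TinyInterpreter {
    static final int ILOAD_0  = 0x1a;
    static final int ILOAD_1  = 0x1b;
    static final int ILOAD_2  = 0x1c;
    static final int ISTORE_2 = 0x3d;
    static final int IADD     = 0x60;
    static final int IRETURN  = 0xac;

    static int run(byte[] code, int[] locals) {
        int[] stack = new int[8]; // operand stack (fixed size for the sketch)
        int sp = 0;               // next free stack slot
        int pc = 0;               // index of the next opcode
        while (true) {
            int opcode = code[pc++] & 0xff;   // fetch
            switch (opcode) {                 // dispatch
                case ILOAD_0:  stack[sp++] = locals[0]; break;   // execute
                case ILOAD_1:  stack[sp++] = locals[1]; break;
                case ILOAD_2:  stack[sp++] = locals[2]; break;
                case ISTORE_2: locals[2] = stack[--sp]; break;
                case IADD: {
                    int b = stack[--sp];
                    int a = stack[--sp];
                    stack[sp++] = a + b;
                    break;
                }
                case IRETURN:
                    return stack[--sp];
                default:
                    throw new IllegalStateException("unknown opcode " + opcode);
            }
        }
    }
}
```

Running the sequence `iload_0, iload_1, iadd, istore_2, iload_2, ireturn` with locals `{31, 42, 0}` returns 73. Every instruction pays for one fetch and one switch dispatch on top of its real work, which is the per-instruction overhead the interpreter slides are concerned with.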

Java source

    public class TOSDemo {
        public static int test() {
            int i = 31;
            int j = 42;
            int x = i + j;
            return x;
        }
    }

JVM bytecode (for int x = i + j)

    iload_0
    iload_1
    iadd
    istore_2

How is this executed?

Instruction traces

x86 instruction sequence when interpreted by the Sun JDK 1.0.2 JVM:

    ;; ------ iload_0 ------
    mov ecx,dword ptr ss:[esp+1C]
    mov edx,dword ptr ss:[esp+14]
    add dword ptr ss:[esp+14],4
    mov eax,dword ptr ds:[ecx]
    inc dword ptr ss:[esp+10]
    mov dword ptr ds:[edx+10],eax
    jmp javai.1000E182
    mov eax,dword ptr ss:[esp+10]
    mov bl,byte ptr ds:[eax]
    xor eax,eax
    mov al,bl
    cmp eax,0FF
    ja short javai.1000E130
    jmp dword ptr ds:[eax*4+10011B54]
    ;; ------ iload_1 ------
    mov ecx,dword ptr ss:[esp+1C]
    mov edx,dword ptr ss:[esp+14]
    add dword ptr ss:[esp+14],4
    mov eax,dword ptr ds:[ecx+4]
    inc dword ptr ss:[esp+10]
    mov dword ptr ds:[edx+10],eax
    jmp javai.1000E182
    ; followed by the same seven-instruction dispatch sequence as under iload_0
    ;; ------ iadd ------
    mov ecx,dword ptr ss:[esp+14]
    inc dword ptr ss:[esp+10]
    sub dword ptr ss:[esp+14],4
    mov edx,dword ptr ds:[ecx+C]
    add dword ptr ds:[ecx+8],edx
    jmp javai.1000E182
    ; followed by the same seven-instruction dispatch sequence as under iload_0
    ;; ------ istore_2 ------
    mov eax,dword ptr ss:[esp+14]
    mov ecx,dword ptr ss:[esp+1C]
    sub dword ptr ss:[esp+14],4
    mov edx,dword ptr ds:[eax+C]
    inc dword ptr ss:[esp+10]
    mov dword ptr ds:[ecx+8],edx
    jmp javai.1000E182
    ; followed by the same seven-instruction dispatch sequence as under iload_0

x86 instruction sequence when interpreted by the Sun JDK 1.1.8 JVM:

    ;; ------ iload_0 ------
    movzx eax,byte ptr ds:[esi+1]
    mov ebx,dword ptr ss:[ebp]
    inc esi
    jmp dword ptr ds:[eax*4+1003FBD4]
    ;; ------ iload_1 ------
    movzx eax,byte ptr ds:[esi+1]
    mov ecx,dword ptr ss:[ebp+4]
    inc esi
    jmp dword ptr ds:[eax*4+1003FFD4]
    ;; ------ iadd ------
    add ebx,ecx
    movzx eax,byte ptr ds:[esi+1]
    inc esi
    jmp dword ptr ds:[eax*4+1003FBD4]
    ;; ------ istore_2 ------
    movzx eax,byte ptr ds:[esi+1]
    mov dword ptr ss:[ebp+8],ebx
    inc esi
    jmp dword ptr ds:[eax*4+1003F7D4]

x86 instruction sequence when interpreted by HotSpot in Sun JDK 6:

    ;; ------ iload_0 ------
    mov eax,dword ptr ds:[edi]
    movzx ebx,byte ptr ds:[esi+1]
    inc esi
    jmp dword ptr ds:[ebx*4+6DB188C8]
    ;; ------ iload_1 ------
    push eax
    mov eax,dword ptr ds:[edi-4]
    movzx ebx,byte ptr ds:[esi+1]
    inc esi
    jmp dword ptr ds:[ebx*4+6DB188C8]
    ;; ------ iadd ------
    pop edx
    add eax,edx
    movzx ebx,byte ptr ds:[esi+1]
    inc esi
    jmp dword ptr ds:[ebx*4+6DB188C8]
    ;; ------ istore_2 ------
    mov dword ptr ds:[edi-8],eax
    movzx ebx,byte ptr ds:[esi+1]
    inc esi
    jmp dword ptr ds:[ebx*4+6DB19CC8]

Summary: OpenJDK

Introduction to Dalvik VM

K. Yaghmour, Embedded Android, 1st edition, Chapter 2, Figure 2-1

Dalvik VM in a nutshell

• The core of Android applications
  – All fancy Android applications are run by it
• Register-based process virtual machine
  – Think of running a Java application
• Intermediate language = Dalvik bytecode
• Executable file = Dalvik Executable (DEX)
  – Java classes converted by the "dx" tool
  – Reduce redundancy in variables

Dalvik VM: Bytecode

• Register-based, 32-bit
• Instruction fetch unit: 16 bits
• Bytecode is stored in binary form
• Constant pools
  – String, Type, Field, Method, Class
• Human-readable syntax and mnemonics

Instruction Suffix

• -wide (64-bit opcodes)
• -boolean
• -byte
• -long
• -object
• -class
• -char
• -short
• -int
• -float
• -string
• -void

Dalvik is Register-based

• const/4 stores 1 into register 0
• add-int/lit8 sums the value in register 0 (1) with the literal 2 and stores the result into register 1, namely "foo"
• Fewer dispatches generally mean the interpreter spends less time reading code and more time running it

Dalvik Bytecode: Human syntax

• Example: move-wide/from16 vAA, vBBBB
  – Opcode "move": move a register's value
  – "wide" is the name suffix: it operates on wide (64-bit) data
  – "from16" is the opcode suffix: the source is a 16-bit register reference
  – "vAA" is the destination register: v0 – v255
  – "vBBBB" is the source register: v0 – v65535

Dalvik Registers

• Consider the for loop shown here: in Java bytecode it is not legal to simply push a number onto the stack inside a loop.
  – To be able to map stack slots to hardware registers, the stack height must be the same at the start and end of a loop, unlike in true stack-based languages such as Forth.
• The irony is that, in the end, normal JVMs convert to the same form as Dalvik anyway.
  – For instance, the Java HotSpot 6 client JIT.

DEX Translation Example

SymDroid: Symbolic Execution for Dalvik Bytecode, Technical Report CS-TR-5022, July 2012

Jinseong Jeon, Kristopher K. Micinski, Jeffrey S. Foster

Department of Computer Science, University of Maryland, College Park

Hello World in Dalvik bytecode

    .class public LHelloWorld;
    .super Ljava/lang/Object;
    .method public static main([Ljava/lang/String;)V
        .registers 2
        sget-object v0, Ljava/lang/System;->out:Ljava/io/PrintStream;
        const-string v1, "Hello World!"
        invoke-virtual {v0, v1}, Ljava/io/PrintStream;->println(Ljava/lang/String;)V
        return-void
    .end method

Hello World in Dalvik bytecode

Notes on the listing:

• sget-object and const-string assign variables: they load System.out into v0 and the string "Hello World!" into v1.
• invoke-* { parameters }, method-to-call
  – static: static method
  – virtual: virtual method
• Unless the method is declared void, it returns a value.
• Return instructions: return-void, return-object, return (e.g. return v0)

Optimizing Dispatch

• Refine the dispatch
• Make dispatches more efficient, or reduce their number
  – Super instructions
    • e.g. add a constant integer to a variable
    • JVM case: load the variable onto the stack, load the constant, add, and store back
    • Dalvik VM case: all in one super instruction
  – Larger code size

Optimizing Dispatch

● Selective inlining
  – The idea is to construct (new) super instruction bodies by concatenating the bodies of the virtual instructions that make them up.

Avoid 1. the overhead of the multiple dispatches that are implied by the original bytecodes
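The saving from a super instruction can be sketched with a toy stack machine that counts dispatches. Everything here is hypothetical: the class `SuperInstructionSketch`, its opcode numbering, and the fused `ADD_CONST` instruction are invented for illustration and are not real JVM or Dalvik encodings. The example mirrors the "add a constant to a variable" case from the slides.

```java
// Toy interpreter comparing four separate instructions with one fused
// super instruction. Opcode values are made up for this sketch.
final class SuperInstructionSketch {
    static final int LOAD = 0;      // LOAD slot       : push locals[slot]
    static final int CONST = 1;     // CONST value     : push value
    static final int ADD = 2;       // ADD             : pop two, push sum
    static final int STORE = 3;     // STORE slot      : pop into locals[slot]
    static final int ADD_CONST = 4; // ADD_CONST slot value : locals[slot] += value
    static final int HALT = 5;

    static int dispatches; // number of opcodes dispatched by the last run()

    static void run(int[] code, int[] locals) {
        int[] stack = new int[8];
        int sp = 0;
        int pc = 0;
        dispatches = 0;
        while (true) {
            int op = code[pc++];
            dispatches++;
            switch (op) {
                case LOAD:  stack[sp++] = locals[code[pc++]]; break;
                case CONST: stack[sp++] = code[pc++]; break;
                case ADD: {
                    int b = stack[--sp];
                    int a = stack[--sp];
                    stack[sp++] = a + b;
                    break;
                }
                case STORE: locals[code[pc++]] = stack[--sp]; break;
                case ADD_CONST: {
                    // Bodies of LOAD, CONST, ADD, STORE fused into one dispatch.
                    int slot = code[pc++];
                    locals[slot] += code[pc++];
                    break;
                }
                case HALT:
                    return;
                default:
                    throw new IllegalStateException("unknown opcode " + op);
            }
        }
    }
}
```

With `locals[0] = 1`, the sequence `LOAD 0, CONST 2, ADD, STORE 0, HALT` takes five dispatches, while `ADD_CONST 0 2, HALT` takes two, and both leave 3 in `locals[0]`. The trade-off named on the slide still applies: each fused instruction adds another handler body to the interpreter.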

Ian Piumarta and Fabio Riccardi, "Optimizing direct threaded code by selective inlining," Proceedings of the 1998 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI)

Interpreter Optimization

Why do we still look at interpreter optimization?
• JIT compilers use an interpreter for the first executions
• The interpreter produces the profile data that determines where and how to spend the optimization effort

A faster interpreter means:
• You can wait longer before native-code compilation (boost time)
• Less code ends up being compiled (cost down)
• Better profile information, giving statistically sounder optimization decisions

M. Anton Ertl and David Gregg, "The Structure and Performance of Efficient Interpreters," Journal of Instruction-Level Parallelism 5 (2003) 1-25.

From to Execution

Java Source Code
  → javac →
Java Class
  → dx →
DEX file
  → dexopt →
Optimized DEX
  → dalvikvm

Forays in software development – Stack based vs Register based Virtual Machine Architecture, and the Dalvik VM (cited Nov 26, 2013)

DEX Optimizations

• Before being executed by Dalvik, DEX files are optimized.
  – Normally this happens before the first execution of code from the DEX file
  – Combined with the bytecode verification
  – For DEX files from APKs, it happens when the application is launched for the first time
• Process
  – The dexopt process (which is actually a backdoor of Dalvik) loads the DEX and replaces certain instructions with their optimized counterparts
  – It then writes the resulting optimized DEX (ODEX) file into the /data/dalvik-cache directory
  – It is assumed that the optimized DEX file will be executed on the same VM that optimized it. ODEX files are NOT portable across VMs.

dexopt: Instruction Rewritten

• Virtual (non-private, non-constructor, non-static) methods: invoke-virtual → invoke-virtual-quick
  – Before: invoke-virtual {v1,v2},java/lang/StringBuilder/append;append(Ljava/lang/String;)Ljava/lang/StringBuilder;
  – After: invoke-virtual-quick {v1,v2},vtable #0x3b
• Frequently used methods: invoke-virtual/direct/static → execute-inline
  – Before: invoke-virtual {v2},java/lang/String/length
  – After: execute-inline {v2},inline #0x4
• Instance fields: iget/iput → iget/iput-quick
  – Before: iget-object v3,v5,android/app/Activity.mComponent
  – After: iget-object-quick v3,v5,[obj+0x28]

Meaning of DEX Optimizations

• Sets byte ordering and structure alignment
• Aligns the member variables to 32-bit / 64-bit boundaries (the structures in the DEX/ODEX file itself are 32-bit aligned)
• Significant optimizations because symbolic field/method lookup at runtime is eliminated
• Aid of Just-In-Time compiler

libART (ART Library)

• Android 4.4's experimental virtual machine library
• Disabled by default
  – Turn it on by checking this option in Developer options
• Changing to libART from libDVM
  – Requires all apps to be recompiled = restart + super long first boot
• Uses an Ahead-Of-Time (AOT) scheme instead of JIT
  – Precompiles Dalvik bytecode into machine language during installation
  – Takes longer to install the first time
• Applications run faster
  – Apps are in native form, so they are ready to be executed
• Saves more energy
  – Apps finish faster = more PE idle time
• Occupies more storage space
  – Decompressed to machine code