Eliminating Global Interpreter Locks in Ruby through Hardware Transactional Memory
Rei Odaira (IBM Research – Tokyo), Jose G. Castanos (IBM Research – Watson Research Center), Hisanobu Tomari (University of Tokyo)
© 2014 IBM Corporation IBM Research – Tokyo
Global Interpreter Locks in Scripting Languages
Scripting languages (Ruby, Python, etc.) are everywhere, and the demand for speed keeps increasing.
Multi-thread speed is restricted by global interpreter locks.
– Only one thread can execute the interpreter at one time.
– Not required by the language specification.
+ Simplified implementation in the interpreter and libraries.
− No scalability on multi-core machines.
– Used by CRuby, CPython, and PyPy.
Hardware Transactional Memory (HTM) Coming into the Market
Can we improve performance by simply replacing locks with TM? Lower overhead than software TM, thanks to hardware support.
– Blue Gene/Q (2012)
– zEC12 (2012)
– Intel 4th Generation Core Processor (2013)
– POWER8
Our Goal
How will realistic applications perform if we replace the Global Interpreter Lock (GIL) with HTM? What modifications and new techniques are needed?
Eliminate the GIL in Ruby through HTM on zEC12 and Xeon E3-1275 v3 (aka Haswell). Evaluate the Ruby NAS Parallel Benchmarks, the WEBrick HTTP server, and Ruby on Rails.
Related Work
Eliminating Python's GIL with HTM:
− Micro-benchmarks on a non-cycle-accurate simulator [Riley et al., 2006]
− Micro-benchmarks on a cycle-accurate simulator [Blundell et al., 2010]
− Micro-benchmarks on Rock's restricted HTM [Tabba, 2010]
Eliminating Ruby's and Python's GILs with fine-grain locks (JRuby, IronRuby, Jython, IronPython, etc.):
− Huge implementation effort
− Remaining locks and incompatibility in the class libraries
Our work: eliminate the GIL in Ruby through less restricted HTM on real machines and measure non-micro-benchmarks. Concurrency with no incompatibility in the class libraries.
Outline
Motivation Transactional Memory GIL in Ruby Eliminating GIL through HTM Experimental Results Conclusion
Transactional Memory

At programming time:
– Enclose critical sections with transaction begin/end operations.

  lock();          tbegin();
  a->count++;  →   a->count++;
  unlock();        tend();

At execution time:
– Memory operations within a transaction are observed as one step by other threads.
– Multiple transactions are executed in parallel as long as their memory operations do not conflict: two transactions that both increment a->count are serialized, while one incrementing a->count and another incrementing b->count run in parallel. Higher parallelism than locks.
HTM
Instruction set (zEC12):
– TBEGIN: begin a transaction
– TEND: end a transaction
– TABORT, etc.

  TBEGIN
  if (cc != 0)
      goto abort_handler;
  ......
  TEND

Micro-architecture:
– Read and write sets held in caches
– Conflict detection using the cache coherence protocol
Aborts occur in four cases:
– Read-set and write-set conflicts
– Read-set and write-set overflows
– Restricted instructions (e.g., system calls)
– External interruptions, etc.
Ruby Multi-thread Programming and the GIL (based on 1.9.3-p194)
Ruby language
– Program with the Thread, Mutex, and ConditionVariable classes.
Ruby virtual machine
– Ruby threads are assigned to native threads (Pthreads).
– Only one thread can execute the interpreter at any given time due to the GIL.
The GIL is acquired/released when a thread starts/finishes. The GIL is yielded during a blocking operation, such as I/O, and also at pre-defined yield-point bytecode operations.
– Conditional/unconditional jumps, method/block exits, etc.
How the GIL is Yielded in Ruby
It is too costly to yield the GIL at every yield point. Instead, yield the GIL every 250 msec using a timer thread.
Timer thread: wakes up every 250 msec and sets a flag.
Application threads: check the flag at every yield point and, if it is set, release and reacquire the GIL.

  if (flag is true) {
      gil_release();
      sched_yield();
      gil_acquire();
  }

The actual implementation is more complex to ensure fairness.
Outline
Motivation Transactional Memory GIL in Ruby Eliminating GIL through HTM – Basic Algorithm – Dynamic Adjustment of Transaction Lengths – Conflict Removal Experimental Results Conclusion
Eliminating GIL through HTM
Execute as a transaction first.
– Transactions begin/end at the same points as the GIL's yield points.
Acquire the GIL after consecutive aborts in a transaction.
[Diagram: each thread begins a transaction, retries on an abort or conflict, and acquires the GIL after consecutive aborts; the other threads then wait for the GIL release.]
No timer thread is needed.

Tradeoff in Transaction Lengths
No need to end and begin transactions at every yield point.
= Variable transaction lengths at the granularity of yield points.
Longer transactions = smaller relative overhead to begin/end a transaction.
Shorter transactions = smaller abort overhead:
– smaller amount of wasteful work rolled back at aborts
– smaller probability of transaction overflows

[Diagram: a longer transaction spanning several yield points wastes more work when it aborts than a sequence of shorter transactions does.]
Dynamic Adjustment of Transaction Lengths
Transaction length = how many yield points to skip.
Adjust transaction lengths on a per-yield-point basis:
1. Initialize with a long length (255).
2. Calculate the abort ratio at each yield point: number of aborts / number of transactions.
3. If the abort ratio exceeds a threshold (1% for zEC12, 6% for Xeon), shorten the transaction length (× 0.75) and return to Step 2.
4. If a pre-defined number (300) of transactions start before the abort ratio exceeds the threshold, stop adjusting that yield point.
5 Sources of Conflicts and How to Remove Them
The source code needs to be changed for higher performance, but each change is limited to only a few dozen lines of the interpreter's source code.
Refer to our paper for the details. – Global variables pointing to the current running thread – Manipulation of global free list when allocating objects – Conflicts and overflows in garbage collection – Updates to inline caches at the time of misses – False sharing in thread structures
Thread-Local Free Lists to Avoid Conflicts
With a single global free list, every object allocation manipulates the same list, so transactions of different threads conflict.
With thread-local free lists, each thread allocates from its own list and refills it by moving free objects from the global list in bulk.
Outline
Motivation Transactional Memory GIL in Ruby Eliminating GIL through HTM Experimental Results Conclusion
Experimental Environment
Implemented in Ruby 1.9.3-p194; ~400-MB Ruby heap.
zEC12:
– 12-core 5.5-GHz zEC12 (1 hardware thread per core)
– 1-MB read set and 8-KB write set
– Ported to z/OS 1.13 UNIX System Services
Xeon E3-1275 v3 (aka Haswell):
– 4-core 2-SMT 3.5-GHz Xeon E3-1275 v3
– 6-MB read set and 19-KB write set
– Linux 3.10.5
Benchmark Programs
Ruby NAS Parallel Benchmarks (NPB) [Nose et al., 2012]
– Class S for BT, CG, FT, LU, and SP; Class W for IS and MG
WEBrick: the HTTP server included in the Ruby distribution
– Spawns 1 thread to handle 1 request.
Ruby on Rails running on WEBrick
– Executed a Web application that fetches a book list from SQLite3.
– Measured only on the Xeon E3-1275 v3.
GIL and HTM Configurations
GIL: original Ruby.
HTM-n (n = 1, 16, 256): fixed transaction length (skip n−1 yield points).
HTM-dynamic: dynamically adjusted transaction length.
Throughput of FT in Ruby NPB

[Charts: throughput vs. number of threads, normalized to the 1-thread GIL, on zEC12 (up to 12 threads) and on Xeon E3-1275 v3 (up to 8 threads, 4 cores × 2 SMT); series: GIL, HTM-1, HTM-16, HTM-256, HTM-dynamic.]

– HTM-dynamic was the best.
– HTM-1 suffered high overhead; HTM-256 incurred high abort ratios.
– zEC12 and Xeon showed similar speed-ups up to 4 threads.
– SMT did not help.
Throughput of WEBrick

[Charts: throughput vs. number of clients, normalized to the 1-thread GIL, on zEC12 and on Xeon E3-1275 v3; series: GIL, HTM-1, HTM-16, HTM-256, HTM-dynamic.]

– HTM was faster than GIL by 57% on Xeon E3-1275 v3.
– Scalability on zEC12 was limited by conflicts at malloc().
– HTM-dynamic was among the best.
– HTM-1 was better than HTM-16 and HTM-256, due to frequent complex operations in WEBrick such as method invocations.

Ruby on Rails on Xeon E3-1275 v3 and Abort Ratios

[Charts: Ruby on Rails throughput vs. number of clients (GIL, HTM-1, HTM-16, HTM-256, HTM-dynamic), and abort ratios (%) of HTM-dynamic for WEBrick/zEC12, WEBrick/Xeon, and Rails/Xeon.]

– HTM improved the throughput by 24% over GIL.
– High abort ratios occurred due to transaction overflows, in regular-expression libraries and method invocations. Need to split them into multiple transactions?
Single-thread Overhead
Single-thread speed is important too.
18-35% single-thread overhead.
– Optimization: with 1 thread, use the GIL instead of HTM.
– Even with this optimization, 5-14% overhead remains in micro-benchmarks.
Sources of the overhead (with possible remedies):
– Checks at the yield points → adaptive insertion of yield points
– Disabled method-invocation inline caches → concurrent inline caches
Conclusion
What was the performance of realistic applications when the GIL was replaced with HTM?
– Up to 4.4-fold speed-up with 12 threads in Ruby NPB.
– 57% speed-up in WEBrick and 24% in Ruby on Rails.
What was required for good scalability?
– Proposed dynamic transaction-length adjustment.
– Removed 5 sources of conflicts.
Using HTM is an effective way to outperform the GIL without fine-grain locking.
Backups
Beginning a Transaction

  if (TBEGIN()) {
      /* Transaction */
      if (GIL.acquired)
          TABORT();
  } else {
      /* Abort */
      if (GIL.acquired) {
          if (Retried 16 times)
              Acquire GIL;
          else
              Retry after GIL release;
      } else if (Persistent abort) {
          Acquire GIL;
      } else {
          /* Transient abort */
          if (Retried 3 times)
              Acquire GIL;
          else
              Retry;
      }
  }
  Execute Ruby code;

Persistent aborts: overflows and restricted instructions. Otherwise, transient aborts: read-set and write-set conflicts, etc. The abort reason is reported by the CPU in a specified memory area.

Where to Begin and End Transactions?
Transactions should begin and end at the same points as the GIL's acquisition/release/yield points.
– These are guaranteed to be critical-section boundaries.
– However, the original yield points are too coarse-grained and cause many transaction overflows.
Bytecode boundaries are supposed to be safe critical-section boundaries.
– Bytecode can be generated in arbitrary orders; therefore, an interpreter is not supposed to have a critical section that straddles a bytecode boundary.
– Ruby programs that are not correctly synchronized can change behavior.
Added the following bytecode instructions as transaction yield points:
– getinstancevariable, getclassvariable, getlocal, send, opt_plus, opt_minus, opt_mult, opt_aref
– Criteria: they appear frequently or are complex.
Ending and Yielding a Transaction

Ending a transaction:

  if (GIL.acquired)
      Release GIL;
  else
      TEND();

Yielding a transaction (at a transaction boundary):

  if (--yield_counter == 0) {
      End a transaction;
      Begin a transaction;
  }

The yield counter realizes the dynamic adjustment of the transaction length.
Abort Ratios on zEC12
Transaction lengths were well adjusted by HTM-dynamic with 1% as the target abort ratio. No correlation to the scalabilities.

[Chart: abort ratios (%) of HTM-dynamic vs. number of threads (1-12) on zEC12 for BT, CG, FT, IS, LU, MG, and SP.]
Cycle Breakdowns on zEC12
No correlation to the scalabilities.
– The result of IS is not reliable because 79% of its execution was spent in initialization, outside of the measurement period.

[Chart: cycle breakdowns of Ruby NPB (BT, CG, FT, IS, LU, MG, SP) into transaction begin/end, successful transactions, aborted transactions, GIL acquired, and waiting for GIL release.]
Categorization by Functions Aborted by Fetch Conflicts
Half of the aborts occurred in manipulation of the global free list (rb_newobj) and in lazy sweep in the GC (gc_lazy_sweep).
– A lot of Float objects are allocated. To be fixed in Ruby 2.0 with Flonum?

[Chart: abort categorization by function (HTM-dynamic, 12 threads, cache fetch-related aborts) for BT, CG, FT, IS, LU, MG, and SP; categories: vm_getivar, vm_exec_core, tend, rb_newobj, rb_mutex_sleep..., rb_mutex_lock, libc, gc_lazy_sweep, and others.]
Comparing Scalabilities of CRuby, JRuby, and Java

[Charts: throughput normalized to 1 thread vs. number of threads (1-12) for HTM-dynamic / CRuby, for JRuby (12x Intel Xeon), and for Java NPB (12x Xeon), each plotting BT, CG, FT, IS, LU, MG, and SP.]

– CRuby (HTM) was similar to Java (almost no VM-internal bottleneck), so the scalability saturation in CRuby (HTM) is inherent in the applications.
– JRuby (fine-grain locking) behaved differently from CRuby (HTM).
– On average, HTM achieved the same scalability as fine-grain locking.