Eliminating Global Locks in Ruby through Hardware Transactional Memory
Rei Odaira (IBM Research – Tokyo), Jose G. Castanos (IBM Research – Watson Research Center), Hisanobu Tomari (University of Tokyo)

© 2014 IBM Corporation IBM Research – Tokyo

Global Interpreter Locks in Scripting Languages

 Scripting languages (Ruby, Python, etc.) are everywhere.
 Increasing demands for speed.

 Multi-thread speed is restricted by global interpreter locks (GILs).
– Only one thread can execute the interpreter at a time.
– Not required by the language specification.
☺ Simplified implementation in the interpreter and libraries.
☹ No scalability on multi-core processors.
– Found in CRuby, CPython, and PyPy.

Hardware Transactional Memory (HTM) Coming into the Market
 Improve performance by simply replacing locks with TM?
 Lower overhead than software TM, thanks to hardware support.
[Timeline figure: Blue Gene/Q and zEC12 (2012), Intel 4th Generation Core processors (2013), POWER8]

Our Goal

 How will realistic applications perform if we replace the GIL with HTM?
– GIL: Global Interpreter Lock
 What modifications and new techniques are needed?

 Eliminate the GIL in Ruby through HTM on zEC12 and Xeon E3-1275 v3 (aka Haswell).
 Evaluate Ruby NAS Parallel Benchmarks, the WEBrick HTTP server, and Ruby on Rails.

[Figure: zEC12 and Intel Xeon E3-1275 v3 machines, with an atomic { } block]

Related Work
 Eliminate Python's GIL with HTM
☹ Micro-benchmarks on a non-cycle-accurate simulator [Riley et al., 2006]
☹ Micro-benchmarks on a cycle-accurate simulator [Blundell et al., 2010]
☹ Micro-benchmarks on Rock's restricted HTM [Tabba, 2010]
 Eliminate Ruby's and Python's GILs with fine-grain locks
– JRuby, IronRuby, IronPython, etc.
☹ Huge implementation effort
☹ Remaining locks and incompatibility in class libraries

 Our approach: eliminate the GIL in Ruby through less restricted HTM on real machines, and measure non-micro-benchmarks.
☺ Concurrency and no incompatibility in the class libraries


Outline

 Motivation
 Transactional Memory
 GIL in Ruby
 Eliminating GIL through HTM
 Experimental Results
 Conclusion


Transactional Memory
 At programming time
– Enclose critical sections with transaction begin/end operations.

    lock();            tbegin();
    a->count++;   →    a->count++;
    unlock();          tend();

 At execution time
– Memory operations within a transaction are observed as one step by other threads.
– Multiple transactions are executed in parallel as long as their memory operations do not conflict.

    Thread X           Thread Y
    tbegin();          tbegin();
    a->count++;        a->count++;    (same location: serialized)
    tend();            tend();

    tbegin();          tbegin();
    a->count++;        b->count++;    (no conflict: parallel)
    tend();            tend();

 Higher parallelism than locks.
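The conflict condition in the last bullet can be stated precisely: two transactions conflict when either one's write set intersects the other's read set or write set. A minimal sketch of that rule — the `conflict?` helper and the set representation are made up for illustration, not part of any HTM API:

```ruby
require 'set'

# Two transactions conflict when one's write set intersects
# the other's read or write set.
def conflict?(t1, t2)
  !(t1[:write] & (t2[:read] + t2[:write])).empty? ||
    !(t2[:write] & t1[:read]).empty?
end

tx_a = { read: Set['a.count'], write: Set['a.count'] }
tx_b = { read: Set['b.count'], write: Set['b.count'] }
tx_c = { read: Set['a.count'], write: Set['a.count'] }

conflict?(tx_a, tx_b)  # false: disjoint locations, can run in parallel
conflict?(tx_a, tx_c)  # true: both write a.count
```

This is exactly why the `a->count++` / `b->count++` pair on the slide can commit in parallel while the pair touching the same counter cannot.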


HTM
 Instruction set (zEC12)
– TBEGIN: Begin a transaction
– TEND: End a transaction
– TABORT, etc.

    TBEGIN
    if (cc != 0)
        goto abort_handler;
    ......
    TEND

 Micro-architecture
– Read and write sets are held in the caches.
– Conflict detection uses the cache coherence protocol.
 Abort in four cases:
– Read-set and write-set conflicts
– Read-set and write-set overflows
– Restricted instructions (e.g. system calls)
– External interruptions, etc.


Ruby Multi-thread Programming and GIL (based on 1.9.3-p194)

 Ruby language
– Programs use the Thread, Mutex, and ConditionVariable classes.
 Ruby virtual machine
– Ruby threads are assigned to native threads (Pthreads).
– Only one thread can execute the interpreter at any given time, due to the GIL.

 The GIL is acquired/released when a thread starts/finishes.
 The GIL is yielded during a blocking operation, such as I/O.
 The GIL is also yielded at pre-defined yield-point bytecode operations.
– Conditional/unconditional jumps, method/block exits, etc.
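As a reminder of the user-level API the slide refers to, here is a minimal, illustrative Ruby program using the Thread and Mutex classes. Under the GIL only one thread executes interpreter code at a time, but the Mutex is still what makes the compound read-modify-write safe, since the GIL can be yielded between the bytecodes of `counter += 1`:

```ruby
counter = 0
lock = Mutex.new

threads = 4.times.map do
  Thread.new do
    # Without the Mutex, increments could be lost at GIL yield points.
    1000.times { lock.synchronize { counter += 1 } }
  end
end
threads.each(&:join)
counter  # 4000: the Mutex makes the read-modify-write atomic
```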

How GIL is Yielded in Ruby
 It is too costly to yield the GIL at every yield point.
 Instead, yield the GIL every 250 msec using a timer thread.

[Figure: the timer thread wakes up every 250 msec and sets a flag; the application thread holding the GIL releases it, yields the CPU, and re-acquires it.]

The application threads check the flag at every yield point:

    if (flag is true) {
        gil_release();
        sched_yield();
        gil_acquire();
    }

The actual implementation is more complex, to ensure fairness.
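The flag-checking scheme can be simulated in plain Ruby. This is an illustrative sketch, not CRuby's implementation: the `$gil` / `$yield_flag` names are made up, and the interval is shortened from 250 ms to 10 ms so the example finishes quickly:

```ruby
$gil = Mutex.new
$yield_flag = false

# Timer thread: wakes up periodically and asks the GIL holder to yield.
timer = Thread.new do
  10.times do
    sleep 0.01          # CRuby uses 250 msec
    $yield_flag = true
  end
end

counts = Hash.new(0)
workers = 2.times.map do |id|
  Thread.new do
    $gil.lock
    1000.times do
      counts[id] += 1
      next unless $yield_flag          # check the flag at every yield point
      $yield_flag = false
      $gil.unlock
      Thread.pass                      # like sched_yield()
      $gil.lock
    end
    $gil.unlock
  end
end

workers.each(&:join)
timer.join
```

Each worker does all of its interpreter-like work while holding `$gil`, and only hands the lock over when the timer thread has set the flag — the same division of labor as the diagram above.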


Outline

 Motivation
 Transactional Memory
 GIL in Ruby
 Eliminating GIL through HTM
– Basic Algorithm
– Dynamic Adjustment of Transaction Lengths
– Conflict Removal
 Experimental Results
 Conclusion

Eliminating GIL through HTM
 Execute each critical section as a transaction first.
– Transactions begin/end at the same points as the GIL's yield points.
 Acquire the GIL only after consecutive aborts of a transaction.

[Figure: state machine — begin transaction → end transaction; on conflict, retry if aborted; after consecutive aborts, acquire the GIL (other threads wait for its release), then release the GIL.]
 No timer thread is needed.

Tradeoff in Transaction Lengths
 No need to end and begin transactions at every yield point.
= Variable transaction lengths at the granularity of yield points.

 Longer transactions = smaller relative overhead of beginning/ending transactions.
 Shorter transactions = smaller abort overhead.
– Smaller amount of wasteful work rolled back at aborts.
– Smaller probability of transaction overflows.
[Figure: with longer transactions, an abort rolls back more wasteful work than with shorter transactions.]

Dynamic Adjustment of Transaction Lengths

Transaction length = how many yield points to skip.

Adjust transaction lengths on a per-yield-point basis:
1. Initialize with a long length (255).
2. Calculate the abort ratio at each yield point: number of aborts / number of transactions.
3. If the abort ratio exceeds a threshold (1% for zEC12, 6% for Xeon), shorten the transaction length (× 0.75) and return to Step 2.
4. If a pre-defined number (300) of transactions start before the abort ratio exceeds the threshold, stop adjusting that yield point.
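The four steps can be simulated with a short Ruby sketch. The abort model (`abort_prob_of`, under which only long transactions abort) is invented for the illustration; the constants 255, 0.75, and 300 come from the slide:

```ruby
# Per-yield-point adjustment: start long, multiply the length by 0.75
# whenever the observed abort ratio exceeds the threshold, and stop
# adjusting once 300 transactions complete under the threshold.
def adjust_length(threshold, abort_prob_of)
  length = 255                          # Step 1: initialize long
  txs = aborts = 0
  loop do
    txs += 1
    aborts += 1 if rand < abort_prob_of.call(length)
    if aborts.to_f / txs > threshold    # Step 3: too many aborts
      length = [(length * 0.75).to_i, 1].max
      txs = aborts = 0                  # restart the measurement (Step 2)
    elsif txs >= 300                    # Step 4: converged
      return length
    end
  end
end

srand(42)
model = ->(len) { len > 64 ? 0.5 : 0.0 }  # made-up: long transactions abort often
adjust_length(0.01, model)
```

With this model the length shrinks 255 → 191 → 143 → 107 → 80 → 60, and the loop then freezes at 60 because transactions of that length no longer abort.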

5 Sources of Conflicts and How to Remove Them

The source code needs to be changed for higher performance, but each change is limited to only a few dozen lines of the interpreter's source code.

 Refer to our paper for the details.
– Global variables pointing to the currently running thread
– Manipulation of the global free list when allocating objects
– Conflicts and overflows in garbage collection
– Updates to inline caches at the time of misses
– False sharing in thread structures


Thread-Local Free Lists to Avoid Conflicts

[Figure: with a single global free list, object allocation by Threads 1 and 2 contends on the same list of free objects; with thread-local free lists, each thread allocates from its own list, refilled from the global list by bulk movement of free objects.]
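The idea in the figure can be sketched as follows. This is not CRuby's allocator: the class name, the batch size, and the use of a Mutex (standing in for the conflict-prone shared access that HTM would detect) are all illustrative:

```ruby
class ThreadLocalAllocator
  BATCH = 16  # bulk-refill size, illustrative

  def initialize(free_objects)
    @global = free_objects   # single global free list
    @lock = Mutex.new        # stands in for the shared conflict point
  end

  def allocate
    local = (Thread.current[:free_list] ||= [])  # thread-local free list
    if local.empty?
      @lock.synchronize do   # bulk movement: one shared access per BATCH
        BATCH.times { local << @global.pop unless @global.empty? }
      end
    end
    local.pop
  end
end

heap = ThreadLocalAllocator.new((1..100).to_a)
objs = 4.times.map { Thread.new { 10.times.map { heap.allocate } } }
              .flat_map(&:value)
```

Each thread touches the shared list once per 16 allocations instead of on every allocation, which is what shrinks the transaction conflict rate.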

Outline

 Motivation
 Transactional Memory
 GIL in Ruby
 Eliminating GIL through HTM
 Experimental Results
 Conclusion

Experimental Environment
 Implemented in Ruby 1.9.3-p194.
 ~400-MB Ruby heap.
 zEC12
– 12-core 5.5-GHz zEC12 (1 hardware thread per core)
– 1-MB read set and 8-KB write set
– Ported to z/OS 1.13 UNIX System Services.
 Xeon E3-1275 v3 (aka Haswell)
– 4-core 2-SMT 3.5-GHz Xeon E3-1275 v3
– 6-MB read set and 19-KB write set
– Linux 3.10.5


Benchmark Programs

 Ruby NAS Parallel Benchmarks (NPB) [Nose et al., 2012]
– Class S for BT, CG, FT, LU, and SP; Class W for IS and MG
 WEBrick: HTTP server included in the Ruby distribution
– Spawns 1 thread to handle 1 request.
 Ruby on Rails running on WEBrick
– Executed a Web application that fetches a book list from SQLite3.
– Measured only on Xeon E3-1275 v3.


GIL and HTM Configurations

 GIL: Original Ruby
 HTM-n (n = 1, 16, 256): Fixed transaction length (skip n−1 yield points)
 HTM-dynamic: Dynamically adjusted transaction length


Throughput of FT in Ruby NPB
[Charts: FT throughput vs. number of threads (1 = 1-thread GIL) on zEC12 (up to 12 threads) and Xeon E3-1275 v3 (up to 8 threads, last 4 via SMT), for GIL, HTM-1, HTM-16, HTM-256, and HTM-dynamic.]
 HTM-dynamic was the best.
 HTM-1 suffered high overhead.
 HTM-256 incurred high abort ratios.
 zEC12 and Xeon showed similar speed-ups up to 4 threads.
 SMT did not help.

Throughput of WEBrick
[Charts: WEBrick throughput vs. number of clients (1 = 1-thread GIL) on zEC12 and Xeon E3-1275 v3, for GIL, HTM-1, HTM-16, HTM-256, and HTM-dynamic.]
 HTM was faster than GIL by 57% on Xeon E3-1275 v3.
 Scalability on zEC12 was limited by conflicts at malloc().
 HTM-dynamic was among the best.
 HTM-1 was better than HTM-16 and HTM-256.
– WEBrick frequently executes complex operations, such as method invocations.

Ruby on Rails on Xeon E3-1275 v3 and Abort Ratios
[Charts: Ruby on Rails throughput vs. number of clients for GIL, HTM-1, HTM-16, HTM-256, and HTM-dynamic; abort ratios of HTM-dynamic (0–35% range) for WEBrick on zEC12, WEBrick on Xeon, and Rails on Xeon.]
 HTM improved the throughput by 24% over GIL.
 High abort ratios due to transaction overflows.
– Regular-expression libraries and method invocations.
 Need to split into multiple transactions?

Single-thread Overhead

 Single-thread speed is important too.

 18–35% single-thread overhead.
– Optimization: with 1 thread, use the GIL instead of HTM.
 5–14% overhead remains in micro-benchmarks even with this optimization.
 Sources of the overhead:
– Checks at the yield points → adaptive insertion of yield points
– Disabled method-invocation inline caches → concurrent inline caches


Conclusion

 What was the performance of realistic applications when the GIL was replaced with HTM?
– Up to 4.4-fold speed-up with 12 threads in Ruby NPB.
– 57% speed-up in WEBrick and 24% in Ruby on Rails.
 What was required for good scalability?
– Proposed dynamic transaction-length adjustment.
– Removed 5 sources of conflicts.

 Using HTM is an effective way to outperform the GIL without fine-grain locking.

Backups

Beginning a Transaction

    if (TBEGIN()) {
        /* Transaction */
        if (GIL.acquired)
            TABORT();
    } else {
        /* Abort */
        if (GIL.acquired) {
            if (Retried 16 times)
                Acquire GIL;
            else
                Retry after GIL release;
        } else if (Persistent abort) {
            Acquire GIL;
        } else {
            /* Transient abort */
            if (Retried 3 times)
                Acquire GIL;
            else
                Retry;
        }
    }
    Execute Ruby code;

 Persistent aborts
– Overflow
– Restricted instructions
 Otherwise, transient aborts
– Read-set and write-set conflicts, etc.
 The abort reason is reported by the CPU.
– Using a specified memory area

Where to Begin and End Transactions?
 Should be the same as the GIL's acquisition/release/yield points.
– Guaranteed to be critical-section boundaries.
☹ However, the original yield points are too coarse-grained.
– They cause many transaction overflows.
 Bytecode boundaries are supposed to be safe critical-section boundaries.
– Bytecode can be generated in arbitrary orders.
– Therefore, an interpreter is not supposed to have a critical section that straddles a bytecode boundary.
☹ Ruby programs that are not correctly synchronized can change behavior.
 Added the following bytecode instructions as transaction yield points:
– getinstancevariable, getclassvariable, getlocal, send, opt_plus, opt_minus, opt_mult, opt_aref
– Criteria: they appear frequently or are complex.
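The abort-handling policy on the "Beginning a Transaction" backup slide can be written as a pure decision function. This is a sketch: the method name and symbolic return values stand in for the abort reason the CPU reports in the specified memory area; the retry limits (16 and 3) are the ones on the slide:

```ruby
# Decide what the abort handler does, given whether the GIL is held,
# whether the abort is persistent (overflow / restricted instruction),
# and how many times we have already retried.
def abort_action(gil_acquired, persistent, retries)
  if gil_acquired
    retries >= 16 ? :acquire_gil : :retry_after_gil_release
  elsif persistent              # overflow or restricted instruction
    :acquire_gil
  else                          # transient abort (e.g. read/write-set conflict)
    retries >= 3 ? :acquire_gil : :retry
  end
end

abort_action(true,  false, 16)  # => :acquire_gil
abort_action(false, true,  0)   # => :acquire_gil  (no point retrying)
abort_action(false, false, 1)   # => :retry
```

Persistent aborts skip straight to the GIL because retrying cannot help; transient aborts get a few optimistic retries first.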


Ending and Yielding a Transaction

Ending a transaction:

    if (GIL.acquired)
        Release GIL;
    else
        TEND();

Yielding a transaction (at a transaction boundary):

    if (--yield_counter == 0) {
        End a transaction;
        Begin a transaction;
    }

The yield_counter is set by the dynamic adjustment of the transaction length.
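The counter logic above can be checked with a tiny simulation. The function name is illustrative; `skip` plays the role of the per-yield-point transaction length (HTM-n skips n−1 yield points):

```ruby
# Count how many transaction boundaries occur while passing n_points
# yield points, ending/beginning a transaction every `skip` points.
def run_yield_points(n_points, skip)
  boundaries = 0
  counter = skip
  n_points.times do
    counter -= 1
    if counter.zero?            # transaction boundary
      boundaries += 1           # TEND(); TBEGIN();
      counter = skip            # reset from the per-yield-point length
    end
  end
  boundaries
end

run_yield_points(1000, 1)    # 1000 boundaries: HTM-1
run_yield_points(1000, 16)   # 62 boundaries: HTM-16 skips 15 yield points
```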

Abort Ratios on zEC12
 Transaction lengths were well adjusted by HTM-dynamic, with 1% as the target abort ratio.
 No correlation with the scalabilities.

[Chart: abort ratios of HTM-dynamic (0–3% range) for BT, CG, FT, IS, LU, MG, and SP vs. number of threads, up to 12.]

Cycle Breakdowns on zEC12
 No correlation with the scalabilities.
– The result for IS is not reliable, because 79% of its execution was spent in initialization, outside of the measurement period.
[Chart: cycle breakdowns of Ruby NPB (BT, CG, FT, IS, LU, MG, SP) into transaction begin/end, successful transactions, GIL acquired, aborted transactions, and waiting for GIL release.]

Categorization by Functions Aborted by Fetch Conflicts

 Half of the aborts occurred while manipulating the global free list (rb_newobj) and in lazy sweep in the GC (gc_lazy_sweep).
– A lot of Float objects were allocated.
 To be fixed in Ruby 2.0 with Flonum?

[Chart: abort categorization by function for HTM-dynamic with 12 threads (cache-fetch-related aborts), across BT, CG, FT, IS, LU, MG, and SP; contributors include rb_newobj, gc_lazy_sweep, vm_getivar, vm_exec_core, tend, rb_mutex_lock, rb_mutex_sleep, libc, and others.]

Comparing Scalabilities of CRuby, JRuby, and Java
[Charts: throughput normalized to 1 thread vs. number of threads (up to 12) for HTM-dynamic CRuby, for JRuby on a 12-core Intel Xeon, and for Java NPB (BT, CG, FT, IS, LU, MG, SP) on a 12-core Xeon.]
 CRuby (HTM) was similar to Java (almost no VM-internal bottleneck).
 The scalability saturation in CRuby (HTM) is inherent in the applications.
 JRuby (fine-grain locking) behaved differently from CRuby (HTM).
 On average, HTM achieved the same scalability as fine-grain locking.