Eliminating Global Locks in Ruby through Hardware Transactional Memory

Rei Odaira, Jose G. Castanos and Hisanobu Tomari

IBM Research and University of Tokyo

April 8, 2014

Rei Odaira, Jose G. Castanos and Hisanobu Tomari HTM in CRuby 1 / 26 The Problem:

Script languages interpreters often use a Global Interpreter (GIL)

Allow easy interpreter design But serialize all execution

Rei Odaira, Jose G. Castanos and Hisanobu Tomari HTM in CRuby 2 / 26 Observation:

Work for single- performance JIT compilers (Rubinius, PyPy, etc.) HPC Ruby

No scalability on multicores

Rei Odaira, Jose G. Castanos and Hisanobu Tomari HTM in CRuby 3 / 26 Observation:

Work for multicore performance Fine-grain synchronization in interpreters (eg. JRuby)

Replacing GIL with fine-grain locking leads to incompatible extension librairies support

Rei Odaira, Jose G. Castanos and Hisanobu Tomari HTM in CRuby 4 / 26 Approach:

Replace GIL with HTM to get scalability

How/Where to put transactions ? Dynamic-length transactions Evaluation on real applications

Rei Odaira, Jose G. Castanos and Hisanobu Tomari HTM in CRuby 5 / 26 Outline

1 Background

2 Eliminating GIL with HTM

3 Dynamic-length transactions

4 Evaluation

5 Conclusions

Rei Odaira, Jose G. Castanos and Hisanobu Tomari HTM in CRuby 6 / 26 Hardware Transactionnal Memory Goals: Best-effort mechanism to improve fast-paths Simple interface to define critical sections Intel RTM API (?):

I XBEGIN I XEND I XABORT I XTEST Programmer needs to provide fallback handler

Lifecycle: Critical sections are executed optimistically Check for conflicts eagerly using cache coherency protocol Aborted and Rollbacked if conflict or capacity overflow Commited to global memory only upon validation

Rei Odaira, Jose G. Castanos and Hisanobu Tomari HTM in CRuby 7 / 26 Background on Ruby

Dynamic language Dynamic typing meta-programming closures Just like Python or Perl

CRuby reference implementation Bytecode interpretation Use native threads No JIT

Rei Odaira, Jose G. Castanos and Hisanobu Tomari HTM in CRuby 8 / 26 Background on GIL

Not a normal lock: Always held when executing Aquired when thread starts, released when it stops Released for blocking operations (I/O) To enable ”concurrency”: Yield points

Yield points: Release GIL; sched yield(); Aqcuire GIL; Set at loop-back edges and when exiting a method/block Heavy ⇒ does not happen at every yield point Timer thread set a yield flag every 250ms No yield when only one thread

Rei Odaira, Jose G. Castanos and Hisanobu Tomari HTM in CRuby 9 / 26 Outline

1 Background

2 Eliminating GIL with HTM

3 Dynamic-length transactions

4 Evaluation

5 Conclusions

Rei Odaira, Jose G. Castanos and Hisanobu Tomari HTM in CRuby 10 / 26 Approach

Based on Transactional Lock Elision

Replace aquiring/release GIL with transaction begin/end Use GIL as a fallback Retry on abort (depends on the cause) Transactions span 2 yield points Remove timer thread

Rei Odaira, Jose G. Castanos and Hisanobu Tomari HTM in CRuby 11 / 26 Aborts

Yield points are too coarse-grain Causes a lot of capacity aborts ⇒ Add more yield points:

I getlocal, getinstancevariable, getclassvariable (read variable) I send (call method or block) I opt plus, opt aref, opt minus, opt mult (classic operations)

New yield points Located at bytecode boundaries Are safe, unless application is not synchronized correctly Applications should not rely on GIL for synchronization in NAS Parallel Benchmarks, half of the instructions are yield points

Rei Odaira, Jose G. Castanos and Hisanobu Tomari HTM in CRuby 12 / 26 Outline

1 Background

2 Eliminating GIL with HTM

3 Dynamic-length transactions

4 Evaluation

5 Conclusions

Rei Odaira, Jose G. Castanos and Hisanobu Tomari HTM in CRuby 13 / 26 Optimization - dynamic-length transactions

Transaction length: Tradeoff between XBEGIN/XEND cost, and abort rate Best choice depends on the yield point

I imagine a syscall just after a yield point

Transactions can now span more than 2 yield point The size of a transaction (number of yield points it crosses):

I Count best length for transaction at each yield point I Count number of transactions start at each yield point (for abort rate) I Start high (long transaction) I Shorten if abort rate above a threshold I Stop profiling when steady state reached

Rei Odaira, Jose G. Castanos and Hisanobu Tomari HTM in CRuby 14 / 26 Optimization - conflict removal

Sources of conflicts 1 Global variables pointing to the running thread 2 Single global linked list for free objects 3 Mark&Sweep GC

I Not parallel (causes conflicts) I Too long (overflows capacity) ⇒ has to be from GIL 4 Shared inline caches (cache hashtable lookup to access methods/variables) 5 False sharing in thread structures

Rei Odaira, Jose G. Castanos and Hisanobu Tomari HTM in CRuby 15 / 26 Optimization - conflict removal

Solutions 1 Put those global variables in thread-local storage 2 Thread-private free objects lists 3 increase heap size 1000x (400MB) to GC less often 4 Change caching logic (less cache updates, less caches misses) 5 Dedicate cache line to thread structures

Rei Odaira, Jose G. Castanos and Hisanobu Tomari HTM in CRuby 16 / 26 Outline

1 Background

2 Eliminating GIL with HTM

3 Dynamic-length transactions

4 Evaluation

5 Conclusions

Rei Odaira, Jose G. Castanos and Hisanobu Tomari HTM in CRuby 17 / 26 Evaluation

Machines Intel Haswell with RTM IBM zEC12 with HTM support

2 microbenchmarks Ruby NPB (NAS) WEBrick HTTP server Ruby on Rails

Rei Odaira, Jose G. Castanos and Hisanobu Tomari HTM in CRuby 18 / 26 NPB on IBM machine

BT CG 3.5 2 1.8 3 1.6 2.5 1.4 2 1.2 1 1.5 0.8 1 0.6 0.4 0.5 Throughput (1 =(1 GIL) 1 thread Throughput Throughput (1 = (1 GIL) 1 thread Throughput 0.2 0 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 NumberFT of threads Number of threads 5 Achieved up to 4.4-fold 4.5 4 GIL speed-up in FT. 3.5 HTM-dynamic was the best 3 HTM-1 2.5 HTM-16 in 6 of 7 benchmarks. 2 1.5 HTM-256 HTM-1 suffered high 1 HTM-dynamic overhead.

Throughput (1 = (1 GIL) 1 thread Throughput 0.5 0 HTM-256 incurred high 0 1 2 3 4 5 6 7 8 9 10 11 12 13 abort ratios. Number of threads Rei Odaira, Jose G. Castanos and Hisanobu Tomari HTM in CRuby 19 / 26 NPB on Haswell machine

BT CG 2 1.6 1.4 1.5 1.2 1 1 0.8 0.6 0.5 0.4

Throughput (1 = 1 thread GIL) thread 1 = (1 Throughput 0.2 0 GIL) thread 1 = (1 Throughput 0 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 Number of threads FT Number of threads 2.5

2 GIL

1.5 HTM-1 HTM-16 1 HTM-256 0.5 HTM-dynamic Throughput (1 = 1 thread GIL) thread 1 = (1 Throughput 0 0 1 2 3 4 5 6 7 8 9 Number of threads

Rei Odaira, Jose G. Castanos and Hisanobu Tomari HTM in CRuby 20 / 26 Differences with Intel’s RTM and IBM’s HTM

RTM has a learning algorithm Written size (left axis) Micro-benchmark write Success ratio (right axis) different sizes 28672 100 24576 Success ratio increase 80 20480 gradually, not sharply 16384 60 This conflicts with 12288 40 Size (byte) Size 8192 HTM-dynamic, making 20 4096 adjustment phase longer 0 0 (%) ratio Success A longer benchmark run 0 10000 20000 30000 40000 achieves a steady-state, Iteration with HTM-dynamic performing best

Rei Odaira, Jose G. Castanos and Hisanobu Tomari HTM in CRuby 21 / 26 Optimizations impact

New yield points Without new yield points, all NPB benchmarks have 20% slowdown over GIL

Remove conflicts Without conflict removal optims, HTM provides no acceleration in any benchmark

Rei Odaira, Jose G. Castanos and Hisanobu Tomari HTM in CRuby 22 / 26 Further Measurements

Abort rate HTM-dynamic targets abort ratio of 1% Observed abort ratio is between 0.5% and 2.5% No correlation between observed abort ratio and scalability

Single thread overhead When only 1 thread, use GIL instead of HTM ⇒ 5-14% overhead in microbenchmarks, due to:

I checks at yield points I newly added yield points I slow access to Pthread’s thread local storage (z/OS specific)

Rei Odaira, Jose G. Castanos and Hisanobu Tomari HTM in CRuby 23 / 26 Outline

1 Background

2 Eliminating GIL with HTM

3 Dynamic-length transactions

4 Evaluation

5 Conclusions

Rei Odaira, Jose G. Castanos and Hisanobu Tomari HTM in CRuby 24 / 26 Conclusions

Summary Use HTM to replace GIL in Ruby’s interpreter Keep compatibility with extension librairies Get up to 4.4 speedup with 12 threads

I Required small code modifications I Small changes to remove conflicts I Introduce dynamicly-sized transactions

Future work: Remove GIL in Python Work on PyPy as GC in CPython is refcount

Rei Odaira, Jose G. Castanos and Hisanobu Tomari HTM in CRuby 25 / 26 Thanks for your attention !

Rei Odaira, Jose G. Castanos and Hisanobu Tomari HTM in CRuby 26 / 26 Backup slides

Rei Odaira, Jose G. Castanos and Hisanobu Tomari HTM in CRuby