Eliminating Global Interpreter Locks in Ruby Through Hardware Transactional Memory

Rei Odaira, Jose G. Castanos and Hisanobu Tomari
IBM Research and University of Tokyo
April 8, 2014

The Problem
- Scripting language interpreters often use a Global Interpreter Lock (GIL)
- Allows easy interpreter design
- But serializes all execution

Observation: Work for single-thread performance
- JIT compilers (Rubinius, PyPy, etc.)
- HPC Ruby
- No scalability on multicores

Observation: Work for multicore performance
- Fine-grain synchronization in interpreters (e.g. JRuby)
- Replacing the GIL with fine-grain locking breaks compatibility with extension libraries

Approach
- Replace GIL with HTM to get scalability
- How/where to put transactions?
- Dynamic-length transactions
- Evaluation on real applications

Outline
1 Background
2 Eliminating GIL with HTM
3 Dynamic-length transactions
4 Evaluation
5 Conclusions

Hardware Transactional Memory
- Goals:
  - Best-effort mechanism to improve fast paths
  - Simple interface to define critical sections
- Intel RTM API:
  - XBEGIN
  - XEND
  - XABORT
  - XTEST
- Programmer needs to provide a fallback handler
- Lifecycle:
  - Critical sections are executed optimistically
  - Conflicts are checked eagerly using the cache coherency protocol
  - Aborted and rolled back on conflict or capacity overflow
  - Committed to global memory only upon validation
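The lifecycle on the Hardware Transactional Memory slide can be illustrated with a small software model. This is only a sketch: real HTM does this in the cache hierarchy, and the names `Memory`, `Transaction` and `Abort`, the 64-byte line size, the tiny capacity of 4 lines, and the "requester wins" conflict policy are all assumptions chosen for illustration, not details taken from the slides.

```python
CACHE_LINE = 64       # bytes per cache line (assumed)
CAPACITY_LINES = 4    # toy write-set capacity, far smaller than real hardware


class Abort(Exception):
    """The transaction was rolled back; the caller must retry or fall back."""


class Memory:
    def __init__(self):
        self.cells = {}    # address -> committed value
        self.active = []   # transactions currently in flight


class Transaction:
    def __init__(self, mem):
        self.mem = mem
        self.read_lines = set()   # cache lines read speculatively
        self.write_buf = {}       # address -> speculative value
        self.doomed = False       # set by a conflicting access from another transaction
        mem.active.append(self)   # plays the role of XBEGIN

    def _written_lines(self):
        return {addr // CACHE_LINE for addr in self.write_buf}

    def _rollback(self, reason):
        self.write_buf.clear()    # speculative stores are discarded
        self.read_lines.clear()
        if self in self.mem.active:
            self.mem.active.remove(self)
        raise Abort(reason)

    def load(self, addr):
        if self.doomed:
            self._rollback("conflict")
        line = addr // CACHE_LINE
        for other in self.mem.active:
            if other is not self and line in other._written_lines():
                other.doomed = True   # eager detection: the earlier writer loses
        self.read_lines.add(line)
        return self.write_buf.get(addr, self.mem.cells.get(addr, 0))

    def store(self, addr, value):
        if self.doomed:
            self._rollback("conflict")
        line = addr // CACHE_LINE
        for other in self.mem.active:
            if other is self:
                continue
            if line in other.read_lines or line in other._written_lines():
                other.doomed = True   # eager detection: the earlier accessor loses
        self.write_buf[addr] = value
        if len(self._written_lines()) > CAPACITY_LINES:
            self._rollback("capacity")

    def commit(self):                 # plays the role of XEND
        if self.doomed:
            self._rollback("conflict")
        self.mem.cells.update(self.write_buf)   # stores become globally visible
        self.mem.active.remove(self)
```

In this model a transaction that read address 0 is aborted at commit time if another transaction stored to address 8 in the meantime, because both addresses share one cache line; a transaction that writes to more than `CAPACITY_LINES` distinct lines aborts with a capacity overflow and leaves memory untouched.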

Background on Ruby
- Dynamic language
  - Dynamic typing
  - Meta-programming
  - Closures
  - Just like Python or Perl
- CRuby reference implementation
  - Bytecode interpretation
  - Uses native threads
  - No JIT

Background on GIL
- Not a normal lock:
  - Always held when executing
  - Acquired when a thread starts, released when it stops
  - Released for blocking operations (I/O)
- To enable "concurrency": yield points
- Yield points:
  - Release GIL; sched_yield(); acquire GIL
  - Set at loop back-edges and when exiting a method/block
  - Yielding is heavy => it does not happen at every yield point
  - Timer thread sets a yield flag every 250 ms
  - No yield when there is only one thread

Approach
- Based on Transactional Lock Elision
- Replace acquiring/releasing the GIL with transaction begin/end
- Use the GIL as a fallback
- Retry on abort (depends on the cause)
- Transactions span 2 yield points
- Remove the timer thread

Aborts
- Yield points are too coarse-grained
- Causes a lot of capacity aborts => add more yield points:
  - getlocal, getinstancevariable, getclassvariable (read variable)
  - send (call method or block)
  - opt_plus, opt_aref, opt_minus, opt_mult (classic operations)
- New yield points
  - Located at bytecode boundaries
  - Are safe, unless the application is not synchronized correctly
  - Applications should not rely on the GIL for synchronization
  - In the NAS Parallel Benchmarks, half of the instructions are yield points
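The retry-then-fallback structure of the Approach slide can be sketched as follows. The slides only say that retrying "depends on the cause", so the concrete policy here (retry conflicts up to `MAX_RETRIES` times, fall back to the GIL immediately on a capacity abort) is an assumption. `try_transaction` is a hypothetical stand-in for the hardware: it either runs the body transactionally or raises `TransactionAborted`.

```python
import threading
from enum import Enum


class AbortCause(Enum):
    CONFLICT = "conflict"   # transient: another thread touched the same data
    CAPACITY = "capacity"   # persistent: the same region would overflow again


class TransactionAborted(Exception):
    def __init__(self, cause):
        super().__init__(cause.value)
        self.cause = cause


MAX_RETRIES = 3             # hypothetical retry budget
gil = threading.Lock()      # the GIL is kept, but only as a fallback path


def run_between_yield_points(body, try_transaction, max_retries=MAX_RETRIES):
    """Run `body` speculatively; return "htm" or "gil" depending on the path taken.

    Real lock elision also reads the lock word inside the transaction so that
    a thread taking the GIL aborts running transactions; that is not modeled.
    """
    for _ in range(max_retries):
        try:
            try_transaction(body)          # stands in for XBEGIN ... body ... XEND
            return "htm"
        except TransactionAborted as abort:
            if abort.cause is AbortCause.CAPACITY:
                break                      # retrying cannot help (assumed policy)
    with gil:                              # fallback: serialize under the GIL
        body()
    return "gil"
```

A caller would wrap the interpreter work between two yield points in `body`; the return value is only there to make the chosen path observable in this sketch.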

Optimization - dynamic-length transactions
- Transaction length:
  - Tradeoff between XBEGIN/XEND cost and abort rate
  - Best choice depends on the yield point
    - e.g. imagine a syscall just after a yield point
- Transactions can now span more than 2 yield points
- The size of a transaction (number of yield points it crosses):
  - Keep the best transaction length for each yield point
  - Count the number of transactions started at each yield point (for the abort rate)
  - Start high (long transactions)
  - Shorten if the abort rate is above a threshold
  - Stop profiling when a steady state is reached

Optimization - conflict removal
Sources of conflicts
1 Global variables pointing to the running thread
2 Single global linked list of free objects
3 Mark & sweep GC
  - Not parallel (causes conflicts)
  - Too long (overflows capacity) => has to run under the GIL
4 Shared inline caches (which cache hashtable lookups for method/variable access)
5 False sharing in thread structures

Optimization - conflict removal
Solutions
1 Put those global variables in thread-local storage
2 Thread-private free object lists
3 Increase heap size 1000x (400 MB) to GC less often
4 Change caching logic (fewer cache updates, fewer cache misses)
5 Dedicate cache lines to thread structures

Evaluation
- Machines
  - Intel Haswell with RTM
  - IBM zEC12 with HTM support
- 2 microbenchmarks
- Ruby NPB (NAS)
- WEBrick HTTP server
- Ruby on Rails
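The per-yield-point adjustment from the dynamic-length transactions slide can be sketched as below. The slides give the overall scheme (start long, shorten when the abort rate is too high, stop profiling at steady state) and, later, a 1% target abort ratio. Everything else is assumed for illustration: the starting length of 255, the window of 100 transactions, the 3/4 shrink factor, and the "three calm windows" steady-state rule.

```python
from dataclasses import dataclass

INITIAL_LENGTH = 255        # assumed starting length ("start high")
TARGET_ABORT_RATIO = 0.01   # 1% target abort ratio
WINDOW = 100                # hypothetical: transactions observed per decision
STEADY_WINDOWS = 3          # hypothetical: calm windows before profiling stops


@dataclass
class YieldPointProfile:
    """Profiling state kept for one yield point."""
    length: int = INITIAL_LENGTH   # yield points a transaction started here may cross
    started: int = 0               # transactions started in the current window
    aborted: int = 0               # of those, how many aborted
    calm_windows: int = 0
    profiling: bool = True

    def record(self, aborted: bool) -> None:
        """Account for one transaction that started at this yield point."""
        if not self.profiling:
            return
        self.started += 1
        self.aborted += int(aborted)
        if self.started < WINDOW:
            return
        ratio = self.aborted / self.started
        if ratio > TARGET_ABORT_RATIO and self.length > 1:
            self.length = max(1, self.length * 3 // 4)   # shorten (assumed factor)
            self.calm_windows = 0
        else:
            self.calm_windows += 1
            if self.calm_windows >= STEADY_WINDOWS:
                self.profiling = False                   # steady state reached
        self.started = self.aborted = 0                  # open a new window


if __name__ == "__main__":
    # Toy workload: any transaction crossing more than 40 yield points aborts.
    profile = YieldPointProfile()
    for _ in range(20 * WINDOW):
        profile.record(aborted=profile.length > 40)
    print(profile.length, profile.profiling)
```

Keeping one profile per yield point reflects the slide's remark that the best length depends on where the transaction starts: a yield point followed by a syscall would settle on a short length, while one inside a tight arithmetic loop could keep a long one.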

NPB on IBM machine
[Figure: three panels (BT, CG, FT). x-axis: number of threads (0 to 13); y-axis: throughput (1 = 1 thread GIL). Series: GIL, HTM-1, HTM-16, HTM-256, HTM-dynamic.]
- Achieved up to 4.4-fold speed-up in FT.
- HTM-dynamic was the best in 6 of 7 benchmarks.
- HTM-1 suffered high overhead.
- HTM-256 incurred high abort ratios.

NPB on Haswell machine
[Figure: three panels (BT, CG, FT). x-axis: number of threads (0 to 9); y-axis: throughput (1 = 1 thread GIL). Series: GIL, HTM-1, HTM-16, HTM-256, HTM-dynamic.]

Differences between Intel's RTM and IBM's HTM
[Figure: written size in bytes (left axis) and success ratio in % (right axis) over iterations 0 to 40000.]
- RTM has a learning algorithm
- Micro-benchmark: write different sizes
- Success ratio increases gradually, not sharply
- This conflicts with HTM-dynamic, making the adjustment phase longer
- A longer benchmark run achieves a steady state, with HTM-dynamic performing best

Optimizations impact
- New yield points
  - Without new yield points, all NPB benchmarks have a 20% slowdown over GIL
- Remove conflicts
  - Without the conflict removal optimizations, HTM provides no acceleration in any benchmark

Further Measurements
- Abort rate
  - HTM-dynamic targets an abort ratio of 1%
  - Observed abort ratio is between 0.5% and 2.5%
  - No correlation between observed abort ratio and scalability
- Single-thread overhead
  - When there is only 1 thread, use the GIL instead of HTM
  - => 5-14% overhead in microbenchmarks, due to:
    - checks at yield points
    - newly added yield points
    - slow access to Pthread's thread-local storage (z/OS specific)

Conclusions
- Summary
  - Use HTM to replace the GIL in Ruby's interpreter
  - Keep compatibility with extension libraries
  - Get up to 4.4x speedup with 12 threads
    - Required small code modifications
    - Small changes to remove conflicts
    - Introduced dynamically-sized transactions
- Future work: remove the GIL in Python
  - Work on PyPy, as GC in CPython is refcount-based

Thanks for your attention!
