Towards an Extreme Scale Multithreaded MPI

Abdelhalim Amer (Halim), Postdoctoral Researcher, Argonne National Laboratory, IL, USA

Japan-Korea HPC Winter School, University of Tsukuba, Feb 17, 2016

Programming Models and Runtime Systems Group

Myself
– Affiliation: Mathematics and Computer Science Division, ANL

Group Lead
– Pavan Balaji (computer scientist)

Current Staff Members
– Sangmin Seo (assistant scientist)
– Abdelhalim Amer (Halim) (postdoc)
– Yanfei Guo (postdoc)
– Rob Latham (developer)
– Lena Oden (postdoc)
– Ken Raffenetti (developer)
– Min Si (postdoc)

The PMRS Group Research Areas

Research areas: Communication Runtimes (MPICH); Threading Models (OS-level and user-level: Argobots); Data Movement in Heterogeneous and Deep Memory Hierarchies.
My focus: communication optimization in threading environments.

The MPICH Project
§ MPICH and its derivatives are the world's most widely used MPI implementations
§ Funded by DOE for 23 years (turned 23 this month)
§ Has been a key influencer in the adoption of MPI
§ Award-winning project
– DOE R&D100 award in 2005

MPICH and its derivatives in the Top 10:
1. Tianhe-2 (China): TH-MPI
2. Titan (US): Cray MPI
3. Sequoia (US): IBM PE MPI
4. K Computer (Japan): Fujitsu MPI
5. Mira (US): IBM PE MPI
6. Trinity (US): Cray MPI
7. Piz Daint (Switzerland): Cray MPI
8. Hazel Hen (Germany): Cray MPI
9. Shaheen II (Saudi Arabia): Cray MPI
10. Stampede (US): Intel MPI and MVAPICH

MPICH derivatives include Intel MPI, Tianhe MPI, IBM MPI, ParaStation MPI, Cray MPI, FG-MPI, Microsoft MPI, and MVAPICH.

Threading Models
§ Argobots
– Lightweight low-level threading and tasking framework
– Targets massive on-node parallelism
– Fine-grained control over execution, scheduling, synchronization, and data movement
– Exposes a rich set of features to higher-level programming systems and DSLs for efficient implementations
§ BOLT (https://press3.mcs.anl.gov/bolt)
– Based on the LLVM OpenMP runtime and Clang

Sangmin Seo, Assistant Computer Scientist

The Argobots Ecosystem

[Diagram: the Argobots ecosystem. The Argobots runtime manages Execution Streams (ES), User-Level Threads (ULTs), fused ULTs, tasklets, and schedulers. Systems built on top include MPI+Argobots, Charm++, OmpSs, CilkBots, PaRSEC, OpenMP (BOLT), communication libraries, and Mercury RPC (origin/target RPC processing). External connections: GridFTP, RAJA, ROSE, TASCEL, XMP, Kokkos, etc.]

Heterogeneous and Deep Memory Hierarchies

§ Prefetching for deep memory hierarchies
§ Currently focusing on MIC accelerator environments
§ Compiler support
– LLVM + user hints through pragma directives
§ Runtime support
– Prefetching between In-Package Memory (IPM) and NVRAM
– Software-managed cache

Lena Oden, Postdoc
Yanfei Guo, Postdoc

The PMRS Group Research Areas

Research areas: Communication Runtimes (MPICH); Threading Models (OS-level and user-level: Argobots); Data Movement in Heterogeneous/Hierarchical Memories.

Today's talk: my focus is communication optimization in threading environments.

Outline

§ Introduction to hybrid MPI+threads programming
– Hybrid MPI programming
– Multithreaded-driven MPI communication
§ MPI + OS-threading optimization space
– Optimizing the coarse-grained global locking model
– Moving towards a fine-grained model
§ MPI + user-level threading optimization space
– Optimization opportunities over OS-threading
– Advanced coarse-grained locking in MPI
§ Summary

Introduction to Hybrid MPI Programming

Why Go Hybrid? MPI + Shared-Memory (X) Programming

[Diagram: a multicore node with 16 cores, illustrating the growth of node resources in Top500 systems. Peter Kogge: "Reading the Tea-Leaves: How Architecture Has Evolved at the High End," IPDPS 2014 keynote.]

Main Forms of Hybrid MPI Programming
§ Example: Quantum Monte Carlo simulations

§ Two different models
– MPI + shared memory (X = MPI)
– MPI + threads (e.g., X = OpenMP)
§ Both models use direct load/store operations

MPI + Shared Memory (MPI 3.0 and later): everything is private by default; shared data must be exposed explicitly.
MPI + Threads: everything is shared by default; data must be privatized when necessary.
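To make the MPI + shared-memory model concrete, the sketch below allocates a large read-only table once per node in an MPI-3 shared-memory window so that every rank on the node can read it with direct loads. This is a minimal sketch under assumptions not in the slide (the node communicator and table size come from the caller, error handling is omitted, and the function name is illustrative), not code from the QMC application discussed here.

    #include <mpi.h>
    #include <stddef.h>

    /* Allocate one copy of a read-only table per node in a shared-memory
     * window; all ranks of node_comm get a direct load/store pointer to it. */
    double *alloc_node_shared_table(MPI_Comm node_comm, size_t table_bytes,
                                    MPI_Win *win)
    {
        int rank;
        MPI_Comm_rank(node_comm, &rank);

        double *base = NULL;
        /* Rank 0 on the node contributes the whole table; others contribute 0 bytes. */
        MPI_Aint local_size = (rank == 0) ? (MPI_Aint)table_bytes : 0;
        MPI_Win_allocate_shared(local_size, (int)sizeof(double), MPI_INFO_NULL,
                                node_comm, &base, win);

        if (rank != 0) {
            /* Query rank 0's segment to obtain a pointer into the shared table. */
            MPI_Aint qsize;
            int disp_unit;
            MPI_Win_shared_query(*win, 0, &qsize, &disp_unit, &base);
        }
        return base;
    }

A node-spanning communicator for this purpose is typically obtained with MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm). In the MPI + threads model no window is needed: the table is an ordinary allocation shared by all threads of a rank, and only per-thread data (e.g., walker data) is privatized.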

[Diagram: in the MPI + shared-memory model, MPI tasks on a node place the large B-spline table in a shared-memory window, with per-core walker data; in the MPI + threads model, the threads of a task share the large B-spline table directly, with per-thread walker data.]

Multithreaded MPI Communication: Current Situation
§ It is becoming more difficult for a single core to saturate the network
§ Network fabrics are going parallel
– E.g., BG/Q has 16 injection/reception FIFOs
§ Core frequencies are decreasing
– Local message-processing time increases
§ Driving MPI communication through threads backfires
[Plot: node-to-node message rate for 1-byte messages vs. number of cores per node on Haswell + Mellanox FDR fabric, with the single-core rate marked.]

[Plot: node-to-node message rate vs. message size on Haswell + Mellanox FDR fabric, comparing 1, 2, 4, 8, and 18 processes with 1, 2, 4, 9, and 18 threads: processes sustain much higher message rates than threads.]

Applications to Exploit Multithreaded MPI Communication
§ Several applications are moving to MPI+threads models
§ Most of them rely on funneled communication!

§ Certain types of applications fit the multithreaded communication model better
– E.g., task-parallel applications
– E.g., Quantum Monte Carlo simulations with a distributed B-spline table when it cannot fit in memory [1]
§ Several applications can scale well with current MPI runtimes if:
– Data movements are not too fine-grained
– MPI is not called concurrently too frequently
§ Scalable multithreaded support is necessary for the other applications
[Diagram: MPI tasks communicating walker and B-spline data over a distributed B-spline table. Plot: message rate vs. message size for 1 to 18 threads.]
[1] Kim, Jeongnim, et al. "Hybrid algorithms in quantum Monte Carlo." Journal of Physics: Conference Series, 2012.

OS-Thread-Level Optimization Space: Challenges and Aspects Being Considered

§ Granularity
– The current coarse-grained lock/work/unlock model is not scalable
– Fully lock-free is not practical
• Rank-wise progress constraints
• Ordering constraints
• Overuse of atomics and memory barriers can backfire
§ Arbitration
– Traditional Pthread mutex locking is biased by the hardware (NUCA) and the OS (slow wakeups)
– This causes waste because of the lack of correlation between resource acquisition and work
§ Hand-off latency
– Slow hand-off wastes the advantages of fine granularity and smart arbitration
– The trade-off must be carefully addressed

The coarse-grained model, annotated by the state each region touches (local, shared, global read-only), looks roughly like:

    MPI_Call(...)
    {
        CS_ENTER;
        ...                          /* local and shared state */
        while (!req_complete) {
            Progress(); poll();
            CS_EXIT;
            CS_YIELD;
            CS_ENTER;
        }
        ...                          /* global read-only state */
        CS_EXIT;
    }

The tuning knobs are the critical-section length, the arbitration among threads, and the hand-off latency.

Production Thread-Safe MPI Implementations
§ OS-thread-level synchronization
§ Most implementations rely on coarse-grained critical sections
– Only MPICH on BG/Q is known to implement fine-grained locking
§ Most implementations rely on the Pthread synchronization API (mutex, spinlock, …)
§ Mutex in the Native POSIX Thread Library (NPTL)
– Ships with glibc
– Fast user-space locking attempt, usually with a Compare-And-Swap (CAS)
– Futex wait/wake in contended cases
[Diagram and plot: under contention, pthread_mutex_lock bounces between user space (CAS) and kernel space (FUTEX_WAIT, sleep, FUTEX_WAKE) before the lock is acquired; measured acquisition latency (cycles) grows with the number of threads, reaching roughly 10x main-memory latency. Lock acquisition is hardware-biased (e.g., CAS favors nearby cores) and OS-biased (slow futex wakeups).]

Analysis of MPICH on a Linux Cluster
§ Thread safety in MPICH
– Full MPI_THREAD_MULTIPLE support
– OS-thread-level synchronization
§ The benchmark runs an OpenMP parallel loop in which every thread issues MPI operations (the loop on the slide is truncated in this extraction; a reconstruction sketch follows below)
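As a hedged reconstruction of what such a multithreaded message-rate benchmark typically looks like (not the exact code behind these measurements), every OpenMP thread drives its own stream of point-to-point operations under MPI_THREAD_MULTIPLE. NITER, WINDOW, and MSG_SIZE are placeholder parameters, timing code is omitted, and both ranks are assumed to run the same number of threads.

    #include <mpi.h>
    #include <omp.h>

    #define NITER    100000   /* iterations per thread (placeholder) */
    #define WINDOW   64       /* requests kept in flight at a time (placeholder) */
    #define MSG_SIZE 1        /* 1-byte messages, as in the message-rate plots */

    int main(int argc, char **argv)
    {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int peer = (rank == 0) ? 1 : 0;          /* two ranks on two nodes */

        #pragma omp parallel
        {
            char buf[MSG_SIZE];
            int tag = omp_get_thread_num();      /* give each thread its own tag */
            MPI_Request reqs[WINDOW];

            for (int i = 0; i < NITER; i += WINDOW) {
                /* Post a window of nonblocking operations, then wait for all. */
                for (int w = 0; w < WINDOW; w++) {
                    if (rank == 0)
                        MPI_Isend(buf, MSG_SIZE, MPI_CHAR, peer, tag,
                                  MPI_COMM_WORLD, &reqs[w]);
                    else
                        MPI_Irecv(buf, MSG_SIZE, MPI_CHAR, peer, tag,
                                  MPI_COMM_WORLD, &reqs[w]);
                }
                MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            }
        }
        MPI_Finalize();
        return 0;
    }

In a benchmark of this kind, every thread contends for the same global critical section inside the MPI runtime, which is what the fairness and wasted-progress measurements below expose.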

[Plots: lock-acquisition bias factor and wasted progress polls (%) vs. number of threads per rank. Caption: fairness and progress analysis with a message-rate benchmark on Haswell + Mellanox FDR fabric.]

Smart Progress and Fast Hand-off

Adapt arbitration to maximize work:
§ FIFO locks overcome the shortcomings of mutexes
– Fairness (FIFO) reduces wasted resource acquisitions
§ Polling for progress can be wasteful (waiting does not generate work!)
§ Prioritize issuing operations over waiting for internal progress
• Feed the communication
• Reduce the chances of wasteful internal progress (e.g., more requests in flight mean higher chances of making progress)
[Flow diagram: MPI_CALL_ENTER, CS_ENTER, operation; if incomplete, CS_EXIT / YIELD / CS_ENTER; then MPI_CALL_EXIT. The hand-off time penalty is large with a Pthread mutex and small with a FIFO lock.]

Hand-off latency must be kept low:
§ The same arbitration can be achieved with different locks (e.g., ticket, MCS, and CLH are all FIFO); a minimal ticket-lock sketch appears below, after the hierarchical-lock reference

§ There is a trade-off between the desired arbitration and hand-off latency
• E.g., for FIFO arbitration, CLH is more scalable than a ticket lock
§ Going through the kernel (e.g., using futex) is expensive and should not be done frequently
§ A more scalable hierarchical lock is being integrated
[Plot: Pt2Pt message rate vs. message size with 36 Haswell cores and Mellanox FDR, comparing Mutex, Ticket, Priority, and CLH locks.]

Fairness and Cohort Locking
§ Unfairness itself IS NOT EVIL
§ Unbounded unfairness IS EVIL
§ Unfairness can improve locality of reference
§ Exploit locality of reference through cohorting
– Threads sharing some level of the memory hierarchy form a cohort
– Local lock passing within a threshold
[Plot: message rate vs. message size comparing Mutex, Ticket, Priority, CLH, and HCLH<2> locks.]

Hierarchical MCS lock: Chabbi, Milind, Michael Fagan, and John Mellor-Crummey. "High performance locks for multi-level NUMA systems." Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 2015.
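To make the FIFO-arbitration idea concrete, here is a minimal ticket lock written with C11 atomics. It is a generic sketch of the lock family named on the slide (ticket/MCS/CLH), not the lock actually used inside MPICH; a production lock would back off or yield instead of spinning.

    #include <stdatomic.h>

    typedef struct {
        atomic_uint next_ticket;   /* ticket dispenser */
        atomic_uint now_serving;   /* ticket currently allowed to enter */
    } ticket_lock_t;

    static void ticket_lock_init(ticket_lock_t *l) {
        atomic_init(&l->next_ticket, 0);
        atomic_init(&l->now_serving, 0);
    }

    static void ticket_lock_acquire(ticket_lock_t *l) {
        /* Take a ticket; threads are served strictly in arrival (FIFO) order. */
        unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);
        while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != my_ticket)
            ;  /* spin; a real lock would back off or yield here */
    }

    static void ticket_lock_release(ticket_lock_t *l) {
        /* Hand the lock to the next ticket holder. */
        unsigned next = atomic_load_explicit(&l->now_serving, memory_order_relaxed) + 1;
        atomic_store_explicit(&l->now_serving, next, memory_order_release);
    }

CLH and MCS provide the same FIFO order but spin on thread-local flags rather than a shared counter, which is why their hand-off scales better than the ticket lock's, matching the CLH-versus-ticket observation above.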

Moving towards a lightweight queue/dequeue model Local state § Reduce locking requirements and move to a } mostly lock-free queue/dequeue model § Reduce unnecessary progress polling for MPI_Call(...) MPI_Icall(...) { { nonblocking operations ...... § Enqueue operations all atomic Atomic_ § Dequeue operations enqueue Atomic_ enqueue CS_TRY_ENTER; • Atomic for unordered queues (RMA) .... • Fine-grained locking for ordered queues Wait_Progress(); CS_TRY_ENTER; /*Return*/ § The result is a mostly lock-free model for Locked or } nonblocking operations Atomic Dequeue CS_EXIT; } User-Level Threads Optimization Space MPI+User-Level Threads Optimization Opportunities

User-Level Threads Optimization Space

MPI + User-Level Threads: Optimization Opportunities
[Diagram: two software stacks. Left: applications, then OpenMP, OmpSs, XMP, MPI, then POSIX threads (glibc). Right: applications, then OpenMP, OmpSs, XMP, MPI, then a user-level thread library, then POSIX threads (glibc). With plain Pthreads, a thread that blocks in MPI_Wait while holding or contending the critical section stalls and forces OS context switches; with user-level threads (ULT0, ULT1, ULT2 multiplexed on one Pthread), the runtime can yield to another ULT instead of blocking and can leave the critical section with a lockless yield. A sketch of this yield-instead-of-block pattern follows below.]
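A hedged sketch of the yield-instead-of-block idea in the diagram above: when a ULT cannot complete its request, it polls progress once and then yields to another ready ULT on the same execution stream rather than blocking in the kernel. The request-test and progress helpers are hypothetical placeholders, not MPICH functions; only ABT_thread_yield() is the real Argobots call.

    #include <abt.h>

    /* Hypothetical helpers standing in for the MPI runtime's internals. */
    extern int  request_is_complete(void *req);
    extern void make_communication_progress(void);

    void wait_on_request(void *req)
    {
        while (!request_is_complete(req)) {
            make_communication_progress();   /* poll the network once */
            /* Hand the core to another ready ULT instead of blocking on a
             * futex: no kernel round trip, no forced context switch. */
            ABT_thread_yield();
        }
    }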

Advanced Argobots Locking Support

Argobots Mutex API and Data Structures
[Diagram: the mutex keeps a LOCKED/UNLOCKED status, a local status flag, and per-execution-stream (ES 0 … ES N) queues of waiting ULTs, protected by a table lock. ABT_mutex_lock(): high-priority acquisition; the caller is pushed conditionally, or unconditionally if a prior ULT is already blocked. ABT_mutex_lock_low(): low-priority acquisition (conditional push). ABT_mutex_unlock(): normal release. ABT_mutex_unlock_se(): on release, if ULTs are waiting on my ES, pop one without atomics and hand the lock over locally.]

Mapping Argobots Advanced Locking to MPI
§ The MPI side maps onto the Argobots API as follows (a C sketch follows below):
– MPI_CALL_ENTER, CS_ENTER: ABT_mutex_lock()
– Operation not complete: CS_EXIT via ABT_mutex_unlock_se(), yield, then CS_ENTER via ABT_mutex_lock_low()
– Operation complete: CS_EXIT via ABT_mutex_unlock(), then MPI_CALL_EXIT
§ Result: an average of 4x improvement for fine-grained communication!
[Plot: Pt2Pt message rate vs. message size with 36 Haswell cores and Mellanox FDR, comparing single-threaded, Pthread mutex, and Argobots mutex runs.]
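The flow above, rendered as a hedged C sketch using the Argobots mutex calls named on the slide (ABT_mutex_lock, ABT_mutex_unlock_se, ABT_mutex_lock_low, ABT_mutex_unlock). The request-test and progress helpers are placeholders, and this is an illustration of the mapping, not MPICH source.

    #include <abt.h>

    /* Hypothetical helpers standing in for the MPI runtime's internals. */
    extern int  request_is_complete(void *req);
    extern void make_communication_progress(void);

    static ABT_mutex mpi_cs;   /* global MPI critical section,
                                  created once with ABT_mutex_create(&mpi_cs) */

    void mpi_blocking_call(void *req)
    {
        ABT_mutex_lock(mpi_cs);                /* MPI_CALL_ENTER: CS_ENTER */
        /* ... issue the operation ... */
        while (!request_is_complete(req)) {
            make_communication_progress();
            /* CS_EXIT: release and preferably hand the lock to a ULT waiting
             * on the same execution stream (cheap local pass, per the slide). */
            ABT_mutex_unlock_se(mpi_cs);
            /* CS_ENTER at low priority: ULTs issuing new operations (which
             * generate work) get the lock before ULTs merely waiting. */
            ABT_mutex_lock_low(mpi_cs);
        }
        ABT_mutex_unlock(mpi_cs);              /* CS_EXIT, then MPI_CALL_EXIT */
    }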

Summary: To Take Out
§ Why would you care about MPI+threads communication?
– Applications, libraries, and languages are moving towards driving both computation and communication with threads because of
• Hardware constraints
• Programmability
§ The current situation
– Multithreaded MPI communication is still not at exascale level
– We improved upon existing runtimes, but there is still more to do
§ Insight into the future
– Hopefully the fine-grained model will be enough for both OS-level and user-level threading
– Otherwise, a combination of advanced thread synchronization and threading runtimes will be necessary
