Towards an Extreme Scale Multithreaded MPI
Abdelhalim Amer (Halim) Postdoctoral Researcher Argonne National Laboratory, IL, USA
Japan-Korea HPC Winter School, University of Tsukuba, Feb 17, 2016

Programming Models and Runtime Systems Group

Myself
Affiliation
– Mathematics and Computer Science Division, ANL
Group Lead
– Pavan Balaji (computer scientist)
Current Staff Members
– Sangmin Seo (assistant scientist)
– Abdelhalim Amer (Halim) (postdoc)
– Yanfei Guo (postdoc)
– Rob Latham (developer)
– Lena Oden (postdoc)
– Ken Raffenetti (developer)
– Min Si (postdoc)
The PMRS Group Research Areas
– Communication Runtimes (MPICH)
– Threading Models (OS-level and user-level: Argobots)
– Data Movement in Heterogeneous and Deep Memory Hierarchies
My focus: communication optimization in threading environments
The MPICH Project
§ MPICH and its derivatives are the world's most widely used MPI implementations
§ Funded by DOE for 23 years (turned 23 this month)
§ Has been a key influencer in the adoption of MPI
§ Award-winning project
– DOE R&D100 award in 2005
MPICH and its derivatives in the Top 10:
1. Tianhe-2 (China): TH-MPI
2. Titan (US): Cray MPI
3. Sequoia (US): IBM PE MPI
4. K Computer (Japan): Fujitsu MPI
5. Mira (US): IBM PE MPI
6. Trinity (US): Cray MPI
7. Piz Daint (Switzerland): Cray MPI
8. Hazel Hen (Germany): Cray MPI
9. Shaheen II (Saudi Arabia): Cray MPI
10. Stampede (US): Intel MPI and MVAPICH
MPICH derivatives include Intel MPI, Tianhe MPI, IBM MPI, ParaStation MPI, Cray MPI, FG-MPI, Microsoft MPI, and MVAPICH.
Threading Models
§ Argobots
– Lightweight low-level threading and tasking framework
– Targets massive on-node parallelism
– Fine-grained control over execution, scheduling, synchronization, and data movement
– Exposes a rich set of features to higher-level programming systems and DSLs for efficient implementations
§ BOLT (https://press3.mcs.anl.gov/bolt)
– Based on the LLVM OpenMP runtime and the Clang compiler
(Sangmin Seo, Assistant Computer Scientist)
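As a rough illustration of the level Argobots operates at, the sketch below creates a few user-level threads on the primary execution stream's main pool. It is not from the talk; it assumes the Argobots C API (ABT_init, ABT_xstream_self, ABT_thread_create, and friends) as publicly documented, so exact names and signatures should be checked against abt.h.

    #include <abt.h>
    #include <stdio.h>

    #define NUM_ULTS 4   /* arbitrary, for illustration */

    /* Work executed by each user-level thread (ULT) */
    static void hello(void *arg)
    {
        printf("hello from ULT %d\n", (int)(size_t)arg);
    }

    int main(int argc, char **argv)
    {
        ABT_xstream xstream;   /* execution stream, bound to an OS thread */
        ABT_pool pool;
        ABT_thread ults[NUM_ULTS];

        ABT_init(argc, argv);

        /* Use the primary execution stream and its main pool */
        ABT_xstream_self(&xstream);
        ABT_xstream_get_main_pools(xstream, 1, &pool);

        /* Create lightweight ULTs; scheduling stays in user space */
        for (int i = 0; i < NUM_ULTS; i++)
            ABT_thread_create(pool, hello, (void *)(size_t)i,
                              ABT_THREAD_ATTR_NULL, &ults[i]);

        /* Wait for each ULT and release it */
        for (int i = 0; i < NUM_ULTS; i++) {
            ABT_thread_join(ults[i]);
            ABT_thread_free(&ults[i]);
        }

        ABT_finalize();
        return 0;
    }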
The Argobots Ecosystem
[Figure: the Argobots runtime provides user-level threads (ULTs), fused ULTs, tasklets, schedulers, pools, and execution streams (ES). Systems built on top of it include MPI+Argobots communication libraries, Charm++, OmpSs, CilkBots, PaRSEC, OpenMP (BOLT), and Mercury RPC; external connections include GridFTP, RAJA, ROSE, TASCEL, XMP, and Kokkos.]
Heterogeneous and Deep Memory Hierarchies
§ Prefetching for deep memory hierarchies
§ Currently focusing on MIC accelerator environments
§ Compiler support
– LLVM + user hints through pragma directives
§ Runtime support
– Prefetching between In-Package Memory (IPM) and NVRAM
– Software-managed cache
(Lena Oden, Postdoc; Yanfei Guo, Postdoc)
The PMRS Group Research Areas
– Communication Runtimes (MPICH)
– Threading Models (OS-level and user-level: Argobots)
– Data Movement in Heterogeneous/Hierarchical Memories
My focus, and today's talk: communication optimization in threading environments
Outline
§ Introduction to hybrid MPI+threads programming
– Hybrid MPI programming
– Multithreaded-driven MPI communication
§ MPI + OS-Threading Optimization Space
– Optimizing the coarse-grained global locking model
– Moving towards a fine-grained model
§ MPI + User-Level Threading Optimization Space
– Optimization opportunities over OS-threading
– Advanced coarse-grained locking in MPI
§ Summary
Introduction to Hybrid MPI Programming

Why Go Hybrid: MPI + Shared-Memory (X) Programming
[Figure: a many-core node; growth of node resources in the Top 500 systems. Peter Kogge: "Reading the Tea-Leaves: How Architecture Has Evolved at the High End", IPDPS 2014 Keynote.]
Main Forms of Hybrid MPI Programming
§ E.g. Quantum Monte Carlo simulations
§ Two different models
– MPI + shared memory (X = MPI)
– MPI + threads (e.g. X = OpenMP)
§ Both models use direct load/store operations
MPI + Shared Memory (MPI 3.0~):
• Everything is private by default
• Shared data is exposed explicitly
MPI + Threads:
• Everything is shared by default
• Data is privatized when necessary
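A minimal sketch of the MPI + shared-memory (X = MPI) variant, loosely modeled on the B-spline table example that follows: one rank per node allocates the table in an MPI 3.0 shared-memory window, and the other ranks on the node map it for direct load/store access. The table size and variable names are made up for illustration.

    #include <mpi.h>
    #include <stddef.h>

    #define TABLE_ELEMS 1024   /* hypothetical table size */

    int main(int argc, char **argv)
    {
        MPI_Comm node_comm;
        MPI_Win  win;
        double  *table;
        int      node_rank;

        MPI_Init(&argc, &argv);

        /* Ranks that share physical memory form one communicator per node */
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);
        MPI_Comm_rank(node_comm, &node_rank);

        /* Rank 0 on each node allocates the table; others allocate 0 bytes */
        MPI_Aint size = (node_rank == 0) ? TABLE_ELEMS * sizeof(double) : 0;
        MPI_Win_allocate_shared(size, sizeof(double), MPI_INFO_NULL,
                                node_comm, &table, &win);

        /* Non-zero ranks query rank 0's segment and then use direct loads */
        if (node_rank != 0) {
            MPI_Aint qsize;
            int disp_unit;
            MPI_Win_shared_query(win, 0, &qsize, &disp_unit, &table);
        }

        /* ... fill/read the shared table (synchronize with barriers or
               MPI_Win_sync before reading data written by other ranks) ... */

        MPI_Win_free(&win);
        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }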
[Figure: with MPI + shared memory, the large B-spline table lives in a shared-memory window accessed by all MPI tasks on the node; with MPI + threads, a single MPI task shares the table among its threads; per-walker data (W) stays private to each core/thread.]

Multithreaded MPI Communication: Current Situation
§ Becoming more difficult for a single core to saturate the network
§ Network fabrics are going parallel
– E.g. BG/Q has 16 injection/reception FIFOs
§ Core frequencies are decreasing
– Local message processing time increases
§ Driving MPI communication through threads backfires
[Figure: node-to-node message rate (1-byte messages) vs. number of cores per node on Haswell + Mellanox FDR fabric; a single core does not saturate the network.]
[Figure: node-to-node message rate vs. message size on Haswell + Mellanox FDR fabric, processes vs. threads (1, 2, 4, 8, and 18 processes; 1, 2, 4, 9, and 18 threads).]

Applications to Exploit Multithreaded MPI Communication
§ Several applications are moving to MPI+threads models
§ Most of them rely on funneled communication!
§ Certain types of applications fit the multithreaded communication model better
– E.g. task-parallel applications
§ Several applications can scale well with current MPI runtimes if:
– Data movements are not too fine-grained
– The frequency of concurrent MPI calls is not too high
§ Scalable multithreaded support is necessary for other applications
[Figure: distributed B-spline table (when it cannot fit in memory) in Quantum Monte Carlo simulations [1]; threads of each MPI task communicate walker and B-spline data; node-to-node message rate vs. message size with 1 to 18 threads.]
[1] Kim, Jeongnim, et al. "Hybrid algorithms in quantum Monte Carlo." Journal of Physics, 2012.
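To make the multithreaded model concrete, here is a minimal sketch (not taken from any of the applications above) in which OpenMP threads drive MPI communication concurrently. It requires MPI_THREAD_MULTIPLE, in contrast with the funneled model where only the master thread calls MPI; the ring-exchange pattern and names are illustrative only, and all ranks are assumed to run the same number of threads.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, size;

        /* Ask for full multithreading support: any thread may call MPI */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE) {
            fprintf(stderr, "MPI_THREAD_MULTIPLE not supported\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int next = (rank + 1) % size;          /* ring neighbors */
        int prev = (rank - 1 + size) % size;

        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            int sendbuf = rank, recvbuf = -1;

            /* Each thread issues its own exchange; messages from different
               threads are distinguished by using the thread id as the tag. */
            MPI_Sendrecv(&sendbuf, 1, MPI_INT, next, tid,
                         &recvbuf, 1, MPI_INT, prev, tid,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }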
OS-Thread-Level Optimization Space

Challenges and Aspects Being Considered
§ Granularity
– The current coarse-grained lock/work/unlock model is not scalable
– Fully lock-free is not practical
• Rank-wise progress constraints
• Ordering constraints
• Overuse of atomics and memory barriers can backfire
§ Arbitration
– Traditional Pthread mutex locking is biased by the hardware (NUCA) and the OS (slow wakeups)
– Causes waste because of the lack of correlation between resource acquisition and work
§ Hand-off latency
– Slow hand-off latencies waste the advantages of fine granularity and smart arbitration
– The trade-offs between critical-section length, arbitration, and hand-off latency must be carefully addressed
Typical shape of a blocking call in the runtime (the critical section protects local state, shared state, and global read-only data):
    MPI_Call(...)
    {
        CS_ENTER;
        ...
        while (!req_complete) {
            Progress(); poll();
            CS_EXIT; CS_YIELD;
            CS_ENTER;
        }
        ...
        CS_EXIT;
    }
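Expanding the pseudocode above into something compilable: the sketch below is not MPICH's implementation, only a simplified model of the coarse-grained lock/work/unlock pattern built on a single global Pthread mutex; global_cs, request_complete, and make_progress are placeholder names standing in for the runtime's internals.

    #include <pthread.h>
    #include <sched.h>
    #include <stdbool.h>

    /* One global critical section protecting all shared runtime state */
    static pthread_mutex_t global_cs = PTHREAD_MUTEX_INITIALIZER;

    #define CS_ENTER() pthread_mutex_lock(&global_cs)
    #define CS_EXIT()  pthread_mutex_unlock(&global_cs)
    #define CS_YIELD() sched_yield()

    /* Placeholders for the runtime's internals */
    extern bool request_complete(void *req);
    extern void make_progress(void);   /* poll the network, advance requests */

    /* Shape of a blocking call under the coarse-grained model: every thread
       serializes on global_cs and releases it while waiting, so that other
       threads can enter the runtime and drive progress. */
    void blocking_call(void *req)
    {
        CS_ENTER();
        /* ... update local and shared state, post the operation ... */
        while (!request_complete(req)) {
            make_progress();
            CS_EXIT();    /* let other threads in */
            CS_YIELD();
            CS_ENTER();
        }
        /* ... complete the operation ... */
        CS_EXIT();
    }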
Production Thread-Safe MPI Implementations on Linux
§ OS-thread-level synchronization
§ Most implementations rely on coarse-grained critical sections
– Only MPICH on BG/Q is known to implement fine-grained locking
§ Most implementations rely on the Pthread synchronization API (mutex, spinlock, ...)
§ Mutex in the Native POSIX Thread Library (NPTL)
– Ships with glibc
– Fast user-space locking attempt, usually with a Compare-And-Swap (CAS)
– Futex wait/wake (kernel-space sleep and signal wakeup) in contended cases
[Figure: a contended pthread_mutex_lock alternates user-space CAS attempts with kernel-space FUTEX_WAIT sleeps and FUTEX_WAKE signals until the lock is acquired.]
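For reference, the behavior described above can be sketched with the classic futex-based lock (essentially the pattern from Ulrich Drepper's "Futexes Are Tricky", not glibc's actual NPTL source): a user-space CAS fast path, and FUTEX_WAIT/FUTEX_WAKE in the contended slow path. Linux-specific and deliberately simplified.

    #include <stdatomic.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Lock states: 0 = unlocked, 1 = locked, 2 = locked with waiters */
    static atomic_int lock_word;

    static long futex(atomic_int *uaddr, int op, int val)
    {
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    static void lock(void)
    {
        int c = 0;
        /* Fast path: uncontended acquisition with one user-space CAS */
        if (atomic_compare_exchange_strong(&lock_word, &c, 1))
            return;
        /* Slow path: mark the lock contended and sleep in the kernel
           until an exchange observes the unlocked state. */
        if (c != 2)
            c = atomic_exchange(&lock_word, 2);
        while (c != 0) {
            futex(&lock_word, FUTEX_WAIT, 2);  /* sleep while word == 2 */
            c = atomic_exchange(&lock_word, 2);
        }
    }

    static void unlock(void)
    {
        /* 1 -> 0: nobody waiting; 2 -> 0: wake one sleeping waiter */
        if (atomic_exchange(&lock_word, 0) == 2)
            futex(&lock_word, FUTEX_WAKE, 1);
    }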
[Figure: lock acquisition latency (cycles) vs. number of threads, compared against L1, LLC, and main-memory latencies; the mutex climbs to roughly 10x main-memory latency, reflecting hardware bias in lock acquisition (e.g. CAS, NUCA) and OS bias from slow futex wakeups.]

Analysis of MPICH on a Linux Cluster
§ Thread-safety in MPICH
– Full THREAD_MULTIPLE support
– OS-thread-level synchronization
    MPI_CALL_ENTER / CS_ENTER
    #pragma omp parallel
    {
        for (i=0; i