Towards an Extreme Scale Multithreaded MPI

Towards an Extreme Scale Multithreaded MPI Abdelhalim Amer (Halim) Postdoctoral Researcher Argonne National Laboratory, IL, USA Japan-Korea HPC Winter School University of Tsukuba, Feb 17, 2016 Programming Models and Runtime Systems Group Myself Affilia&on – Mathemacs and Computer Science Division, ANL Group Lead – Pavan Balaji (computer scienGst) Current Staff Members – Sangmin Seo (Assistant ScienGsts) – Abdelhalim Amer (Halim) (postdoc) – Yanfei Guo (postdoc) – Rob Latham (developer) – Lena Oden (postdoc) – Ken RaffeneW (developer) – Min Si (postdoc) 2 The PMRS Group Research Areas Data-Movement Threading in Communicaon Models Heterogeneous RunGmes (OS-level, user- and Deep (MPICH) level: Argobots) Memory Hierarchies My focus: Communicaon opGmizaon in threading environments 3 The MPICH Project MPICH and it deriva&ves in the Top 10 § MPICH and its derivaves are the 1. Tianhe-2 (China): TH-MPI world’s most widely used MPI 2. Titan (US): Cray MPI implementaons 3. Sequoia (US): IBM PE MPI § Funded by DOE for 23 years (turned 23 4. K Computer (Japan): Fujitsu MPI this month) 5. Mira (US): IBM PE MPI § Has been a key influencer in the 6. Trinity (US): Cray MPI adopGon of MPI 7. Piz Daint (Germany): Cray MPI § Award winning project 8. Hazel Hen (Germany): Cray MPI – DOE R&D100 award in 2005 9. Shaheen II (Saudi Arabia): Cray MPI 10.Stampede (US): Intel MPI and MVAPICH Intel Tianhe MPI MPI IBM MPI ParaStaon MPI MPICH Cray MPI FG-MPI MicrosoV MPI MVAPICH 4 Threading Models § Argobots – Lightweight low-level threading and tasking framework – Targets massive on-node parallelism – Fine-grained control over the execuGon, scheduling, synchronizaon, and data- movements – Exposes a rich set of features to higher programming systems and DSLs for efficient implementaons § BOLT (hfps://press3.mcs.anl.gov/bolt) – Based on the LLVM OpenMP runGme and the Clang Sangmin Seo Assistant Computer Scientist 5 The Argobots Ecosystem PO TR Applications Fused ULT TR SY TR N … GE PO GE GE Charm++ Fused ULT SY TR TR SY 2 Fused ULT SY GE SY Argobots runtime 1 ULT ULT RWS PO ES ES ULT TR Communication Argobots SY libraries ES MPI Cilk “Worker” PO MPI+Argobots Charm++ OmpSs CilkBots PaRSEC ES ES 1 Sched n Origin RPC proc U S E T U E .. RPC proc T S E T . Target U T E OpenMP T Mercury RPC (BOLT) Argobots External GridFTP, RAJA, ROSE, TASCEL, XMP, etc. Connections Kokkos, 6 Heterogeneous and Deep Memory Hierarchies § Prefetching for deep memory hierarchies Lena Oden Postdoc § Currently focusing on MIC accelerator environments § Compiler support – LLVM + user hints through pragma NVRAM direcves § RunGme support – Prefetching between In-Package Memory IPM (IPM) and NVRAM – Sojware-managed cache Yanfei Guo Postdoc 7 The PMRS Group Research Areas Data-Movement Threading Communicaon in Models RunGmes Heterogeneous/ (OS-level, user- (MPICH) Hierarchical level: Argobots) Memories My focus: Communicaon opGmizaon in threading environments Today’s Talk 8 Outline § Introduc>on to hybrid MPI+threads programming – Hybrid MPI programming – Mul>threaded-Driven MPI communica>on § MPI + OS-Threading Op>miza>on Space – Op>mizing the coarse-grained global locking model – Moving towards a fine-grained model § MPI + User-Level Threading Op>miza>on Space – Opmizaon Opportunies over OS-threading – Advanced coarse-grained locking in MPI § Summary 9 Introduction to Hybrid MPI Programming Why Going Hybrid MPI + Shared-Memory (X) Programming Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Growth of node resources in the Top500 systems. Peter Kogge: “Reading the Tea- Leaves: How Architecture Has Evolved at the High End”. IPDPS 2014 Keynote Main Forms of Hybrid MPI Programming § E.g. Quantum Monte Carlo simulaons § Two different models – MPI + shared-memory (X = MPI) – MPI + threads (e.g. X = OpenMP) § Both models use direct load/store operaons MPI + Shared-Memory (MPI 3.0~) MPI + Theads • Everything private by default • Share everything by default • Expose shared data explicitly • Privatize data when necessary MPI Task 0 MPI Task 1 MPI Task 1 Large B-spline table in a Share-Memory Window Large B-spline table W W W W W W W W Core Core Core Core W Thread 0 Thread 1 Walker data Multithreaded MPI Communication: Current Situation 4.E+07 § Becoming more difficult for a single-core to 4.E+07 saturate the network 3.E+07 3.E+07 § Network fabrics are going parallel 2.E+07 – E.g. BG/Q has 16 injecGon/recepGon FIFOs 2.E+07 Message Rate 1.E+07 1 Core § Core frequencies are reducing 5.E+06 0.E+00 – Local message processing Gme increases 0 10 20 30 40 § Driving MPI communicaon through threads Number of Cores per Node Node-to-node message rate (1B) on backfires Haswell + Mellanox FDR Fabric 1 Process 2 Processes 4 Processes 1 Thread 2 Threads 4 Threads 8 Processes 18 Processes 9 Threads 18 Threads 3.E+07 3.E+07 4.E+06 4.E+06 5.E+05 5.E+05 6.E+04 6.E+04 8.E+03 8.E+03 Message Rate (Messages/s) Message Rate (Messages/s) 1.E+03 1.E+03 1 16 256 4096 65536 1048576 1 16 256 4096 65536 1048576 Message Size (Bytes) Message Size (Bytes) Node-to-node message rate on Haswell + Mellanox FDR Fabric: processes vs. threads 13 Applications to Exploit Multithreaded MPI Communication MPI Task 0 MPI Task 1 § Several applicaons are moving to MPI Distributed B- Communicate Distributed large +Threads models Walker and B- spline table spline Data B-spline table § Most of them rely on funneled communication ! W W W W W W W W § Certain type of applicaons fit befer Core Core Core Core mulGthreaded communicaon model Thread 0 Thread 1 Thread 0 Thread 1 – E.g. Task-parallel applicaons Distributed B-spline table if it cannot fit in memory fin Quantum Monte Carlo Simulations [1] § Several applicaons can scale well with current MPI runGmes if: – Data movements are not too fine-grained 1 Thread 2 Threads 4 Threads 9 Threads 18 Threads – Frequency of calling concurrently MPI is not too high 3.E+07 § A scalable mulGthreaded support is 4.E+06 necessary for other applicaons 5.E+05 6.E+04 8.E+03 Message Rate (Messages/s) 1.E+03 1 16 256 4096 65536 1048576 14 Message Size (Bytes) [1] Kim, Jeongnim, et al. "Hybrid algorithms in quantum Monte Carlo." Journal of Physics, 2012. OS-Thread-Level Optimization Space Challenges and Aspects Being Considered MPI_Call(...) § Granularity { CS_ENTER; – The current coarse-grained lock/work/unlock ... Local state model is not scalable – Fully lock-free is not pracGcal Shared state while (! req_complete) { • Rank-wise progress constrains Progress(); poll(); CS_EXIT; CS_YIELD; • Ordering constrains Global Read- CS_ENTER; } • Overuse of atomics and memory barriers can backfire only ... CS_EXIT; § Arbitra>on } – TradiGonal Pthread mutex locking is biased by the Threads hardware (NUCA) and the OS (slow wakeups) Critical Section – Causes waste because of the lack of correlaon Length between resource acquisiGon and work § Hand-off latency Arbitration – Slow hand-off latencies waste the advantages of fine-granularity and smart arbitraon Hand-off – Trade-off must be carefully addressed Latency Production Thread-Safe MPI Implementations on Linux § OS-thread-level synchronizaon pthread_mutex_lock § Most implementaons rely on coarse-grained CAS FUTEX_WAIT Kernel-Space criGcal secGons Sleep – Only MPICH on BG/Q is known to implement fine- FUTEX_WAKE grained locking CAS § Most implementaons rely on Pthread FUTEX_WAIT Sleep synchronizaon API (mutex, spinlock, …) User-Space CAS FUTEX_WAKE § Mutex in Nave Posix Thread Library (NPTL) – Ships with glibc Lock Acquired – Fast user-space locking aempt; usually with a Compare And Swap (CAS) signal wakeup 1.00E+05 – FuteX wait/wake in contended cases ~10x Main Memory 1.00E+04 Mutex 1.00E+03 LLC LLC Latency (cycles) L1 L1 L1 L1 1.00E+02 Core Core Core Core 0 10 20 30 40 Number of Threads Lock acquisition hardware bias (e.g. CAS) OS-bias from slow futex wakeups 17 Analysis of MPICH on a Linux Cluster MPI_CALL_ENTER § Thread-safety in MPICH CS_ENTER – Full THREAD_MULTIPLE support #pragma omp parallel { – OS-thread-level synchronizaon 1 for (i=0; i<nreq; i++) 2 NO – Single coarse-grained criGcal secGon MPI_Irecv(&reqs[i]); OPERATION CS_ENTER COMPLETE? – Uses by default Pthread mutex MPI_Waitall(nreq,reqs); YIELD § Fairness analysis P(same_ domain _ acquisition) YES } – Bias factor (BF): P(random _ uniform) CS_EXIT CS_EXIT – BF <=1 à fair arbitraon § Progress analysis MPI_CALL_EXIT – Wasted progress polls = unsuccessful progress polls while other threads are waiGng at entry 20 100 Mutex-Core-Level 18 90 Mutex Mutex-NUMA-Nodel-Level 16 80 Ticket Ticket-Core-Level 14 70 Ticket-NUMA-Node-Level 12 60 10 50 Bias Factor 8 40 6 30 4 Wasted Progress Polls (%) 20 2 10 0 0 0 5 10 15 20 25 30 35 40 0 5 10 15 20 25 30 35 40 Number of Threads per Rank Number of Threads per Rank Fairness and progress analysis with a message rate benchmark on Haswell + Mellanox FDR fabric 18 Smart Progress and Fast Hand-off MPI_CALL_ENTER Adapt arbitration to maximize work Pthread Mutex § Time CS_ENTER FIFO locks overcome the Penalty shortcomings of mutexes 1 2 § Polling for progress can be wasteful NO OPERATION CS_EXIT COMPLETE? (waing does not generate work!) FIFO Lock Time Penalty § PrioriGzing issuing operaons YIELD YES • Feed the communicaon pipeline CS_EXIT • Reduce chances of wasteful internal CS_ENTER process (e.g. more requests on the fly MPI_CALL_EXIT Fairness (FIFO) reduces wasted Prioritizing issuing ops over è higher chances of making resource acquisitions waiting for internal progress progress) Hand-off latency must be kept low § Same arbitraon cab be achieve with different locks 262144 (e.g.

Load more