Towards an Extreme Scale Multithreaded MPI

Abdelhalim Amer (Halim), Postdoctoral Researcher
Argonne National Laboratory, IL, USA
Japan-Korea HPC Winter School, University of Tsukuba, Feb 17, 2016
Programming Models and Runtime Systems Group

Myself
§ Affiliation
– Mathematics and Computer Science Division, ANL
§ Group Lead
– Pavan Balaji (computer scientist)
§ Current Staff Members
– Sangmin Seo (assistant scientist)
– Abdelhalim Amer (Halim) (postdoc)
– Yanfei Guo (postdoc)
– Rob Latham (developer)
– Lena Oden (postdoc)
– Ken Raffenetti (developer)
– Min Si (postdoc)

The PMRS Group Research Areas
§ Communication Runtimes (MPICH)
§ Threading Models (OS-level; user-level: Argobots)
§ Data Movement in Heterogeneous and Deep Memory Hierarchies
My focus: communication optimization in threading environments

The MPICH Project
§ MPICH and its derivatives are the world's most widely used MPI implementations
§ Funded by DOE for 23 years (turned 23 this month)
§ Has been a key influencer in the adoption of MPI
§ Award-winning project
– DOE R&D100 award in 2005
MPICH and its derivatives in the Top 10:
1. Tianhe-2 (China): TH-MPI
2. Titan (US): Cray MPI
3. Sequoia (US): IBM PE MPI
4. K Computer (Japan): Fujitsu MPI
5. Mira (US): IBM PE MPI
6. Trinity (US): Cray MPI
7. Piz Daint (Switzerland): Cray MPI
8. Hazel Hen (Germany): Cray MPI
9. Shaheen II (Saudi Arabia): Cray MPI
10. Stampede (US): Intel MPI and MVAPICH
[Diagram: MPICH derivatives — Intel MPI, Tianhe MPI, IBM MPI, ParaStation MPI, Cray MPI, FG-MPI, Microsoft MPI, MVAPICH]

Threading Models
§ Argobots
– Lightweight low-level threading and tasking framework
– Targets massive on-node parallelism
– Fine-grained control over execution, scheduling, synchronization, and data movement
– Exposes a rich set of features to higher-level programming systems and DSLs for efficient implementations
§ BOLT (https://press3.mcs.anl.gov/bolt)
– Based on the LLVM OpenMP runtime and the Clang compiler
(Sangmin Seo, Assistant Computer Scientist)

The Argobots Ecosystem
[Diagram: the Argobots runtime (execution streams, ULTs, tasklets, schedulers) underneath higher-level systems such as MPI+Argobots, Charm++, OmpSs, CilkBots, PaRSEC, OpenMP (BOLT), and Mercury RPC, with external connections including GridFTP, Kokkos, RAJA, ROSE, TASCEL, XMP, etc.]
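To make the user-level threading model concrete, the following is a minimal sketch of creating and joining Argobots user-level threads (ULTs) on the primary execution stream. It assumes the basic Argobots C API (ABT_init, ABT_xstream_self, ABT_xstream_get_main_pools, ABT_thread_create, ABT_thread_join, ABT_thread_free); the ULT count and the hello() routine are illustrative and not taken from the slides.

    /* Minimal Argobots sketch: spawn a few ULTs on the primary
     * execution stream's main pool, then join and free them. */
    #include <abt.h>
    #include <stdio.h>

    #define NUM_ULTS 8   /* illustrative count */

    static void hello(void *arg)
    {
        printf("ULT %d says hello\n", (int)(size_t)arg);
    }

    int main(int argc, char **argv)
    {
        ABT_init(argc, argv);

        /* The calling OS thread becomes the primary execution stream (ES);
         * ULTs pushed to its main pool run on that ES. */
        ABT_xstream xstream;
        ABT_pool pool;
        ABT_xstream_self(&xstream);
        ABT_xstream_get_main_pools(xstream, 1, &pool);

        ABT_thread ults[NUM_ULTS];
        for (int i = 0; i < NUM_ULTS; i++)
            ABT_thread_create(pool, hello, (void *)(size_t)i,
                              ABT_THREAD_ATTR_NULL, &ults[i]);

        /* ULTs are scheduled cooperatively in user space, so creation,
         * switching, and joining avoid kernel involvement. */
        for (int i = 0; i < NUM_ULTS; i++) {
            ABT_thread_join(ults[i]);
            ABT_thread_free(&ults[i]);
        }

        ABT_finalize();
        return 0;
    }

Because ULTs are created and scheduled entirely in user space, a single execution stream can host very large numbers of them cheaply, which is the property the ecosystem above builds on.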
Heterogeneous and Deep Memory Hierarchies
§ Prefetching for deep memory hierarchies
§ Currently focusing on MIC accelerator environments
§ Compiler support
– LLVM + user hints through pragma directives
§ Runtime support
– Prefetching between In-Package Memory (IPM) and NVRAM
– Software-managed cache
(Lena Oden, Postdoc; Yanfei Guo, Postdoc)

The PMRS Group Research Areas
§ Communication Runtimes (MPICH)
§ Threading Models (OS-level; user-level: Argobots)
§ Data Movement in Heterogeneous/Hierarchical Memories
My focus (today's talk): communication optimization in threading environments

Outline
§ Introduction to hybrid MPI+threads programming
– Hybrid MPI programming
– Multithreaded-driven MPI communication
§ MPI + OS-threading optimization space
– Optimizing the coarse-grained global locking model
– Moving towards a fine-grained model
§ MPI + user-level threading optimization space
– Optimization opportunities over OS-threading
– Advanced coarse-grained locking in MPI
§ Summary

Introduction to Hybrid MPI Programming

Why Going Hybrid: MPI + Shared-Memory (X) Programming
[Figure: growth of node resources (cores per node) in Top500 systems. Source: Peter Kogge, "Reading the Tea-Leaves: How Architecture Has Evolved at the High End", IPDPS 2014 keynote]

Main Forms of Hybrid MPI Programming
§ Example: Quantum Monte Carlo simulations
§ Two different models
– MPI + shared-memory (X = MPI)
– MPI + threads (e.g., X = OpenMP)
§ Both models use direct load/store operations
§ MPI + shared-memory (MPI 3.0~): everything is private by default; shared data is exposed explicitly
– E.g., the large B-spline table is placed in a shared-memory window visible to all MPI tasks on the node (see the sketch after this list)
§ MPI + threads: everything is shared by default; data is privatized when necessary
– E.g., each thread keeps private walker data while reading the shared B-spline table
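As a concrete illustration of the MPI + shared-memory model, here is a minimal sketch using the MPI 3.0 shared-memory window interface (MPI_Comm_split_type, MPI_Win_allocate_shared, MPI_Win_shared_query). The table size and the fill loop are placeholders standing in for the real B-spline data; error checking is omitted.

    /* Sketch of the MPI + shared-memory model: one copy of a large
     * read-only table per node, exposed through a shared-memory window. */
    #include <mpi.h>
    #include <stdio.h>

    #define TABLE_ELEMS (1 << 20)   /* placeholder size for the shared table */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Group the ranks that can share memory (i.e., on the same node). */
        MPI_Comm shmcomm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &shmcomm);
        int shmrank;
        MPI_Comm_rank(shmcomm, &shmrank);

        /* Only rank 0 of the node allocates the table; the others attach. */
        MPI_Aint size = (shmrank == 0) ? TABLE_ELEMS * sizeof(double) : 0;
        double *table;
        MPI_Win win;
        MPI_Win_allocate_shared(size, sizeof(double), MPI_INFO_NULL, shmcomm,
                                &table, &win);
        if (shmrank != 0) {
            MPI_Aint qsize;
            int disp_unit;
            MPI_Win_shared_query(win, 0, &qsize, &disp_unit, &table);
        }

        /* Rank 0 fills the table; barrier + window sync make the data
         * visible to the other ranks, which then read it with plain loads. */
        MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
        if (shmrank == 0)
            for (int i = 0; i < TABLE_ELEMS; i++)
                table[i] = (double)i;   /* stand-in for real B-spline data */
        MPI_Win_sync(win);
        MPI_Barrier(shmcomm);
        MPI_Win_sync(win);

        printf("rank %d reads table[42] = %.1f\n", shmrank, table[42]);

        MPI_Win_unlock_all(win);
        MPI_Win_free(&win);
        MPI_Comm_free(&shmcomm);
        MPI_Finalize();
        return 0;
    }

The MPI + threads variant needs none of this window machinery because the table is implicitly shared, but it pushes the burden onto the MPI runtime once those threads also drive communication, which is the subject of the rest of the talk.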
Multithreaded MPI Communication: Current Situation
§ It is becoming more difficult for a single core to saturate the network
§ Network fabrics are going parallel
– E.g., BG/Q has 16 injection/reception FIFOs
§ Core frequencies are decreasing
– Local message processing time increases
§ Yet driving MPI communication through threads backfires
[Figure: node-to-node message rate (1 B messages) vs. cores per node on Haswell + Mellanox FDR fabric; a single core does not saturate the network]
[Figure: node-to-node message rate vs. message size on Haswell + Mellanox FDR fabric, processes vs. threads (1-18); thread counts scale far worse than process counts]

Applications to Exploit Multithreaded MPI Communication
§ Several applications are moving to MPI+threads models
§ Most of them rely on funneled communication!
§ Certain types of applications fit the multithreaded communication model better
– E.g., task-parallel applications
– E.g., a B-spline table distributed across nodes when it cannot fit in memory, as in quantum Monte Carlo simulations [1]
[Figure: MPI Task 0 and MPI Task 1, each holding part of a distributed B-spline table; threads communicate walker and B-spline data across tasks]
§ Several applications can scale well with current MPI runtimes if:
– Data movements are not too fine-grained
– Concurrent MPI calls are not too frequent
§ A scalable multithreaded support is necessary for the other applications
[Figure: node-to-node message rate vs. message size with 1-18 threads]
[1] Kim, Jeongnim, et al. "Hybrid algorithms in quantum Monte Carlo." Journal of Physics, 2012.

OS-Thread-Level Optimization Space

Challenges and Aspects Being Considered
§ Granularity
– The current coarse-grained lock/work/unlock model is not scalable
– Fully lock-free is not practical
• Rank-wise progress constraints
• Ordering constraints
• Overuse of atomics and memory barriers can backfire
§ Arbitration
– Traditional Pthread mutex locking is biased by the hardware (NUCA) and the OS (slow wakeups)
– This causes waste because of the lack of correlation between resource acquisition and work
§ Hand-off latency
– Slow hand-offs waste the advantages of fine granularity and smart arbitration
– The trade-off must be carefully addressed
Coarse-grained critical-section model around an MPI call (per-thread view):
    MPI_Call(...)
    {
        CS_ENTER;
        ...                   /* local state, shared state, global read-only data */
        while (!req_complete) {
            Progress();       /* poll */
            CS_EXIT;
            CS_YIELD;
            CS_ENTER;
        }
        ...
        CS_EXIT;
    }

Production Thread-Safe MPI Implementations on Linux
§ OS-thread-level synchronization
§ Most implementations rely on coarse-grained critical sections
– Only MPICH on BG/Q is known to implement fine-grained locking
§ Most implementations rely on the Pthread synchronization API (mutex, spinlock, ...)
§ Mutex in the Native POSIX Thread Library (NPTL)
– Ships with glibc
– Fast user-space locking attempt, usually with a compare-and-swap (CAS)
– Futex wait/wake (kernel-space sleep and signal wakeup) in contended cases
[Figure: pthread_mutex_lock path — user-space CAS attempt, then FUTEX_WAIT sleep and FUTEX_WAKE until the lock is acquired]
[Figure: mutex acquisition latency (cycles, log scale) vs. number of threads, compared with L1/LLC/main-memory latencies (~10x gap); lock acquisition is hardware-biased (e.g., CAS) and OS-biased by slow futex wakeups]

Analysis of MPICH on a Linux Cluster
§ Thread safety in MPICH
– Full THREAD_MULTIPLE support
– OS-thread-level synchronization
– Single coarse-grained critical section
– Uses a Pthread mutex by default
§ Fairness analysis
– Bias factor (BF) = P(same_domain_acquisition) / P(random_uniform)
– BF <= 1 → fair arbitration
§ Progress analysis
– Wasted progress polls = unsuccessful progress polls made while other threads are waiting at entry
Benchmark kernel used for the analysis (per rank; a runnable version is sketched below):
    #pragma omp parallel
    {
        for (i = 0; i < nreq; i++)
            MPI_Irecv(&reqs[i]);
        MPI_Waitall(nreq, reqs);
    }
[Figure: bias factor and wasted progress polls (%) vs. threads per rank, mutex vs. ticket lock at core and NUMA-node level; fairness and progress analysis with a message-rate benchmark on Haswell + Mellanox FDR fabric]
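A runnable version of the kernel above might look as follows. It is a hedged sketch, not the exact benchmark from the slides: it assumes exactly two ranks with the same number of OpenMP threads each, 1-byte messages, an arbitrary window of NREQ requests per thread, and MPI_THREAD_MULTIPLE support.

    /* Sketch of a multithreaded message-rate kernel: rank 1's threads send,
     * rank 0's threads receive, all driving MPI concurrently. */
    #include <mpi.h>
    #include <omp.h>

    #define NREQ 64   /* requests per thread per window (assumed) */

    int main(int argc, char **argv)
    {
        int provided, rank, size;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE)
            MPI_Abort(MPI_COMM_WORLD, 1);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size != 2)                     /* node-to-node pair assumed */
            MPI_Abort(MPI_COMM_WORLD, 1);
        int peer = 1 - rank;

        #pragma omp parallel
        {
            char buf[NREQ] = {0};          /* 1-byte payload per request */
            MPI_Request reqs[NREQ];
            /* Every thread posts its own window of nonblocking operations
             * and then waits on it, so many threads sit inside MPI at once. */
            for (int i = 0; i < NREQ; i++) {
                if (rank == 0)
                    MPI_Irecv(&buf[i], 1, MPI_CHAR, peer, 0,
                              MPI_COMM_WORLD, &reqs[i]);
                else
                    MPI_Isend(&buf[i], 1, MPI_CHAR, peer, 0,
                              MPI_COMM_WORLD, &reqs[i]);
            }
            MPI_Waitall(NREQ, reqs, MPI_STATUSES_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }

Under the default coarse-grained critical section, all of these threads serialize inside MPI, so both the fairness of the lock and the usefulness of each progress poll determine the measured message rate.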
Smart Progress and Fast Hand-off
Adapt arbitration to maximize work.
§ FIFO locks overcome the shortcomings of mutexes
§ Polling for progress can be wasteful (waiting does not generate work!)
§ Prioritizing issuing operations
• Feeds the communication pipeline
• Reduces the chances of wasteful internal progress (e.g., more requests in flight → higher chances of making progress)
§ Fairness (FIFO) reduces wasted resource acquisitions; prioritizing issuing operations beats waiting for internal progress
§ Hand-off latency must be kept low
§ The same arbitration can be achieved with different locks (e.g., …)
[Diagram: the MPI_CALL_ENTER → CS_ENTER → "operation complete?" → CS_EXIT/YIELD → CS_ENTER loop → CS_EXIT → MPI_CALL_EXIT flow, annotated with the time penalty of a Pthread mutex vs. a FIFO lock]
A minimal FIFO (ticket) lock is sketched below.
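As an illustration of the FIFO arbitration contrasted with the Pthread mutex above, here is a minimal ticket-lock sketch in C11. It is an illustrative stand-in, not the lock used inside any particular MPI implementation; the sched_yield() back-off is an assumption.

    /* Minimal FIFO (ticket) lock: threads take a ticket and are served
     * strictly in arrival order, unlike a hardware/OS-biased mutex. */
    #include <stdatomic.h>
    #include <sched.h>

    typedef struct {
        atomic_uint next;     /* next ticket to hand out */
        atomic_uint serving;  /* ticket currently admitted to the critical section */
    } ticket_lock_t;

    static void ticket_lock_init(ticket_lock_t *l)
    {
        atomic_init(&l->next, 0);
        atomic_init(&l->serving, 0);
    }

    static void ticket_lock(ticket_lock_t *l)
    {
        unsigned int me = atomic_fetch_add(&l->next, 1);  /* take a ticket */
        while (atomic_load(&l->serving) != me)
            sched_yield();    /* back off; a real lock might spin or park */
    }

    static void ticket_unlock(ticket_lock_t *l)
    {
        /* Hand off to the next waiter in line; the hand-off latency is the
         * time between this increment and the waiter observing it. */
        atomic_fetch_add(&l->serving, 1);
    }

Because tickets are served in order, acquisitions are decoupled from NUMA placement and futex wakeup timing, which removes the bias measured earlier; what remains critical is keeping the hand-off from one ticket holder to the next as short as possible.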
