Parastack: Efficient Hang Detection for MPI Programs at Large Scale

ParaStack: Eicient Hang Detection for MPI Programs at Large Scale Hongbo Li, Zizhong Chen, Rajiv Gupta University of California, Riverside CSE Department, 900 University Ave Riverside, CA 92521 fhli035,chen,[email protected] ABSTRACT It is widely accepted that some errors manifest more frequently While program hangs on large parallel systems can be detected via at large scale both in terms of the number of parallel processes and the widely used timeout mechanism, it is dicult for the users to problem size as testing is usually performed at small scale to manage set the timeout – too small a timeout leads to high false alarm rates cost and some errors are scale and input dependent [12, 29, 36, 39, and too large a timeout wastes a vast amount of valuable computing 40]. Due to communication, an error triggered in one process (faulty resources. To address the above problems with hang detection, this process) gradually spreads to others, nally leading to a global hang. paper presents ParaStack, an extremely lightweight tool to detect Although a hang may be caused by a single faulty process, this hangs in a timely manner with high accuracy, negligible overhead process is hard to locate as it is not easily distinguishable from with great scalability, and without requiring the user to select a other processes whose execution has stalled. us, the problem of timeout value. For a detected hang, it provides direction for fur- hang diagnosing, i.e. locating faulty processes, has received much ther analysis by telling users whether the hang is the result of an aention [11, 27, 28, 31]. error in the computation phase or the communication phase. For Typically hang diagnosing is preceded by the hang detection step a computation-error induced hang, our tool pinpoints the faulty and this problem has not been adequately addressed. Much work process by excluding hundreds and thousands of other processes. has been done on communication-deadlock — a special case of We have adapted ParaStack to work with the Torque and Slurm hang — using methods like time-out [17, 25, 32], communication parallel batch schedulers and validated its functionality and perfor- dependency analysis [20, 38], and formal verication [37]. ese mance on Tianhe-2 and Stampede that are respectively the world’s tools either use an imprecise timeout mechanism or precise but current 2nd and 12th fastest supercomputers. Experimental results centralized technique that limits scalability. MUST [22, 23] claims demonstrate that ParaStack detects hangs in a timely manner at to be a scalable tool for detecting MPI deadlocks at large scale, negligible overhead with over 99% accuracy. No false alarm is ob- but its overhead is still non-trivial as it ultimately checks MPI se- served in correct runs taking 66 hours at scale of 256 processes and mantics across all processes. ese non-timeout methods do not 39.7 hours at scale of 1024 processes. ParaStack accurately reports address the full scope of hang detection, as they do not consider the faulty process for computation-error induced hangs. computation-error induced hangs. In terms of hang detection, ad hoc timeout mechanism [2, 27, 28, 31] is the mainstream; however, it is dicult to set an appropriate threshold even for users that 1 INTRODUCTION have good knowledge of an application. is is because the optimal timeout not only varies across applications, but also with input Program hang , the phenomenon of unresponsiveness [34], is a characteristics and the underlying computing platform. Choosing common yet dicult type of bug in parallel programs. In large a timeout that is too small leads to high false alarm rates and too scale MPI programs, errors causing a program hang can arise in large timeouts lead to long detection delays. e user may favor either the computation phase or the MPI communication phase. selecting a very large timeout to achieve high accuracy while sac- Hang causing errors in the computation phase include innite ricing delay in detecting a hang. For example, IO-Watchdog [2] loop [31] within an MPI process, local deadlock within a process monitors writing activities and detects hangs based on a user spec- due to incorrect thread-level synchronization [28], so error in one ied timeout with 1 hour as the default. Up to 1 hour on every MPI process that causes the process to hang, and unknown errors processing core will be wasted if the user uses the default timeout in either soware or hardware that cause a single computing node seing. us, a lightweight hang detection tool with high accuracy to freeze. Errors in MPI communication phase that can give rise to is urgently needed for programs encountering non-deterministic a program hang include MPI communication deadlocks/failures. hangs or sporadically triggered hangs (e.g., hangs that manifest rarely and on certain inputs). It can be deployed to automatically Permission to make digital or hard copies of all or part of this work for personal or terminate erroneous runs to avoid wasting computing resources classroom use is granted without fee provided that copies are not made or distributed for prot or commercial advantage and that copies bear this notice and the full citation without adversely eecting the performance of correct runs. on the rst page. Copyrights for components of this work owned by others than ACM To address the above need for hang detection, this paper presents must be honored. Abstracting with credit is permied. To copy otherwise, or republish, ParaStack, an extremely lightweight tool to detect hangs in a timely to post on servers or to redistribute to lists, requires prior specic permission and/or a fee. Request permissions from [email protected]. manner with high accuracy, negligible overhead with great scala- SC17, Denver, CO, USA bility, and without requiring the user to select a timeout value. Due © 2017 ACM. 978-1-4503-5114-0/17/11...$15.00 to its lightweight nature, ParaStack can be deployed in production DOI: 10.1145/3126908.3126938 SC17, November 12–17, 2017, Denver, CO, USA Hongbo Li, Zizhong Chen, Rajiv Gupta runs without adversely aecting application performance when no hang arises. It handles communication-error-induced hangs and hangs brought about by a minority of processes encountering a computation error. For a detected hang, ParaStack provides direction for further analysis by telling whether the hang is the result of an error in the computation phase or the communication phase. For a computation-error induced hang, it pinpoints faulty processes. ParaStack is a parallel tool based on stack trace that judges a hang by detecting dynamic manifestation of following paern of behavior – persistent existence of very few processes outside of MPI calls. is simple, yet novel, approach is based upon the following observation. Since processes iterate between computation and communication phases, a persistent dynamic variation of the count of processes outside of MPI calls indicates a healthy running state while a continuous small count of processes outside MPI calls strongly indicates the onset of a hang. Based on execution history, ParaStack builds a runtime model of count that is robust even with limited history information and uses it to evaluate the likelihood Figure 1: ParaStack workow – steps with solid border are of continuously observing a small count. A hang is veried if the performed by ParaStack and those shown with dashed bor- likelihood of persistent small count is signicantly high. Upon der require a complimentary tool. detecting a hang, ParaStack reports the process in computation phase, if any, as faulty. e above execution behavior based model is capable of detecting that ParaStack detects hangs in a timely manner at negligible hangs for dierent target programs, with dierent input charac- overhead with over 99% accuracy. No false alarm was observed teristics and sizes, and running on dierent computing platforms in correct runs taking about 66 hours in total at the scale of without any assistance from the programmer alike. ParaStack re- 256 processes and 39.7 hours at the scale of 1024 processes. In ports hang very accurately and in a timely manner. By monitoring addition, ParaStack accurately identies the faulty process for only a constant number of processes, ParaStack introduces negligi- computation-error induced hangs. ble overhead and thus provides good scalability. Finally, it helps in identifying the cause of the hang. If a hang is caused by a faulty process with an error, all the other concurrent processes get stuck 2 THE CASE FOR PARASTACK inside MPI communication calls. If the error is inside communi- Hang detection is of value to application users and developers alike. cation phase, the faulty process will also stay in communication; Application users usually do not have the knowledge to debug the otherwise, it will stay in computation phase. Simply checking application. In batch mode, when a hang is encountered, the ap- whether there are processes outside of communication can tell the plication simply wastes the remainder of the allocated computing type of hang, communication-error or computation-error induced, time. is problem is further exacerbated by the fact that users as well as the faulty processes for a computation-error induced commonly request a bigger time slot than what is really needed to hang. e main contributions of ParaStack are: ensure their job can complete. If users are unaware of a hang occur- rence, they may rerun the application with even a much bigger time • ParaStack introduces highly ecient non-timeout mechanism allocation, which will lead to even more waste. By aaching a hang to detect hangs in a timely manner with high accuracy, negligi- detection capability to a batch job scheduler with negligible over- ble overhead, and great scalability. us it avoids the diculty head, ParaStack can help by terminating the jobs and reporting the of seing the timeout value.

Parastack: Efficient Hang Detection for MPI Programs at Large Scale

Hang Analysis: Fighting Responsiveness Bugs

What Is an Operating System III 2.1 Compnents II an Operating System

Mac OS X: an Introduction for Support Providers

Recovering from Operating System Crashes

How to Use Microsoft Adplus to Capture Hang and Crash Logs for EFT Server

3.01.0000.00 Release Notes

How to Start Using Princeton's High Performance Computing Systems

Some Computers May Hang Or Present a Blue Screen of Death

Automating Problem Analysis and Triage Sasha Goldshtein @Goldshtn Production Debugging

Dell EMC Powerscale Onefs: Using Macos Clients

Why Software Hangs and What Can Be Done with It ∗

Mapping Guis to Auditory Interfaces