ParaStack: Eicient Hang Detection for MPI Programs at Large Scale Hongbo Li, Zizhong Chen, Rajiv Gupta University of California, Riverside CSE Department, 900 University Ave Riverside, CA 92521 fhli035,chen,
[email protected] ABSTRACT It is widely accepted that some errors manifest more frequently While program hangs on large parallel systems can be detected via at large scale both in terms of the number of parallel processes and the widely used timeout mechanism, it is dicult for the users to problem size as testing is usually performed at small scale to manage set the timeout – too small a timeout leads to high false alarm rates cost and some errors are scale and input dependent [12, 29, 36, 39, and too large a timeout wastes a vast amount of valuable computing 40]. Due to communication, an error triggered in one process (faulty resources. To address the above problems with hang detection, this process) gradually spreads to others, nally leading to a global hang. paper presents ParaStack, an extremely lightweight tool to detect Although a hang may be caused by a single faulty process, this hangs in a timely manner with high accuracy, negligible overhead process is hard to locate as it is not easily distinguishable from with great scalability, and without requiring the user to select a other processes whose execution has stalled. us, the problem of timeout value. For a detected hang, it provides direction for fur- hang diagnosing, i.e. locating faulty processes, has received much ther analysis by telling users whether the hang is the result of an aention [11, 27, 28, 31].