
Combining Symbolic Execution and Model Checking to Verify MPI Programs

Hengbiao Yu1∗, Zhenbang Chen1∗, Xianjin Fu1,2, Ji Wang1,2∗, Zhendong Su3, Jun Sun4, Chun Huang1, Wei Dong1
1College of Computer, National University of Defense Technology, Changsha, China
2State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha, China
3Department of Computer Science, ETH Zurich, Switzerland
4School of Information Systems, Singapore Management University, Singapore
{hengbiaoyu,zbchen,wj}@nudt.edu.cn, [email protected], [email protected], [email protected]

ABSTRACT

Message passing is the standard paradigm of programming in high-performance computing. However, verifying Message Passing Interface (MPI) programs is challenging, due to the complex program features (such as non-determinism and non-blocking operations). In this work, we present MPI symbolic verifier (MPI-SV), the first symbolic execution based tool for automatically verifying MPI programs with non-blocking operations. MPI-SV combines symbolic execution and model checking in a synergistic way to tackle the challenges in MPI program verification. The synergy improves the scalability and enlarges the scope of verifiable properties. We have implemented MPI-SV¹ and evaluated it with 111 real-world MPI verification tasks. The pure symbolic execution-based technique successfully verifies 61 out of the 111 tasks (55%) within one hour, while in comparison, MPI-SV verifies 100 tasks (90%). On average, compared with pure symbolic execution, MPI-SV achieves 19x speedups on verifying the satisfaction of the critical property and 5x speedups on finding violations.

CCS CONCEPTS

• Software and its engineering → Software verification and validation;

KEYWORDS

Symbolic Verification; Symbolic Execution; Model Checking; Message Passing Interface; Synergy

ACM Reference Format:
Hengbiao Yu, Zhenbang Chen, Xianjin Fu, Ji Wang, Zhendong Su, Jun Sun, Chun Huang, and Wei Dong. 2020. Combining Symbolic Execution and Model Checking to Verify MPI Programs. In ICSE '20: 42nd International Conference on Software Engineering, May 23-29, 2020, Seoul, South Korea. ACM, New York, NY, USA, 17 pages. https://doi.org/XX.XXXX/XXXXXXX.XXXXXXX

arXiv:1803.06300v2 [cs.PL] 17 Jan 2020

∗The first two authors contributed equally to this work and are co-first authors. Zhenbang Chen and Ji Wang are the corresponding authors.
¹MPI-SV is available at https://mpi-sv.github.io.

1 INTRODUCTION

Nowadays, an increasing number of high-performance computing (HPC) applications have been developed to solve large-scale problems [11]. The Message Passing Interface (MPI) [78] is the current de facto standard programming paradigm for developing HPC applications. Many MPI programs are developed with significant human effort. One of the reasons is that MPI programs are error-prone because of complex program features (such as non-determinism and asynchrony) and their scale. Improving the reliability of MPI programs is challenging [29, 30].

Program analysis [64] is an effective technique for improving program reliability. Existing methods for analyzing MPI programs can be categorized into dynamic and static approaches. Most existing methods are dynamic, such as debugging [51], correctness checking [71] and dynamic verification [83]. These methods need concrete inputs to run MPI programs and perform analysis based on runtime information. Hence, dynamic approaches may miss input-related program errors. Static approaches [5, 9, 55, 74] analyze abstract models of MPI programs and suffer from false alarms, manual effort, and poor scalability. To the best of our knowledge, existing automated verification approaches for MPI programs either do not support input-related analysis or fail to support the analysis of the MPI programs with non-blocking operations, the invocations of which do not block the program execution. Non-blocking operations are ubiquitous in real-world MPI programs for improving the performance but introduce more complexity to programming.

Symbolic execution [27, 48] supports input-related analysis by systematically exploring a program's path space. In principle, symbolic execution provides a balance between concrete execution and static abstraction with improved input coverage or more precise program abstraction. However, symbolic execution based analyses suffer from path explosion due to the exponential increase of program paths w.r.t. the number of conditional statements. The problem is particularly severe when analyzing MPI programs because of parallel execution and non-deterministic operations. Existing symbolic execution based verification approaches [77][25] do not support non-blocking MPI operations.

In this work, we present MPI-SV, a novel verifier for MPI programs by smartly integrating symbolic execution and model checking. As far as we know, MPI-SV is the first automated verifier that supports non-blocking MPI programs and LTL [58] property verification. MPI-SV uses symbolic execution to extract path-level models from MPI programs and verifies the models w.r.t. the expected properties by model checking [17]. The two techniques complement each other: (1) symbolic execution abstracts the control and data dependencies to generate verifiable models for model checking, and (2) model checking improves the scalability of symbolic execution by leveraging the verification results to prune redundant paths, and enlarges the scope of verifiable properties of symbolic execution.

In particular, MPI-SV combines two algorithms: (1) symbolic execution of non-blocking MPI programs with non-deterministic operations, and (2) modeling and checking the behaviors of an MPI program path precisely. To safely handle non-deterministic operations, the first algorithm delays the message matchings of non-deterministic operations as much as possible. The second algorithm extracts a model from an MPI program path. The model represents all the path's equivalent behaviors, i.e., the paths generated by changing the interleavings and matchings of the communication operations in the path. We have proved that our modeling algorithm is precise and consistent with the MPI standard [24]. We feed the generated models from the second algorithm into a model checker to perform verification w.r.t. the expected properties, i.e., safety and liveness properties in linear temporal logic (LTL) [58]. If the extracted model from a path p satisfies the property φ, p's equivalent paths can be safely pruned; otherwise, if the model checker reports a counterexample, a violation of φ is found. This way, we significantly boost the performance of symbolic execution by pruning a large set of paths which are equivalent to certain paths that have already been model checked.

We have implemented MPI-SV for MPI C programs based on Cloud9 [10] and PAT [80]. We have used MPI-SV to analyze 12 real-world MPI programs, totaling 47K lines of code (LOC) (three are beyond the scale that the state-of-the-art MPI verification tools can handle), w.r.t. the deadlock freedom property and non-reachability properties. For the 111 deadlock freedom verification tasks, when we set the time threshold to be an hour, MPI-SV can complete 100 tasks, i.e., deadlock reported or deadlock freedom verified, while pure symbolic execution can complete 61 tasks. For the 100 completed tasks, MPI-SV achieves, on average, 19x speedups on verifying deadlock freedom and 5x speedups on finding a deadlock.

The main contributions of this work are:
• A synergistic framework combining symbolic execution and model checking for verifying MPI programs.
• A method for symbolic execution of non-blocking MPI programs with non-deterministic operations. The method is formally proven to preserve the correctness of verifying reachability properties.
• A precise method for modeling the equivalent behaviors of an MPI path, which enlarges the scope of the verifiable properties and improves the scalability.
• A tool for symbolic verification of MPI C programs and an extensive evaluation on real-world MPI programs.

2 ILLUSTRATION

In this section, we first introduce MPI programs and use an example to illustrate the problem that this work targets. Then, we overview MPI-SV informally by the example.

2.1 MPI Syntax and Motivating Example

MPI implementations, such as MPICH [31] and OpenMPI [26], provide the programming interfaces of message passing to support the development of parallel applications. An MPI program can be implemented in different languages, such as C and C++. Without loss of generality, we focus on MPI programs written in C. Let T be a set of types, N a set of names, and E a set of expressions. For simplifying the discussion, we define a core language for MPI processes in Figure 1, where T ∈ T, r ∈ N, and e ∈ E. An MPI program MP is defined by a finite set of processes {Proc_i | 0 ≤ i ≤ n}. For brevity, we omit complex language features (such as the messages in the communication operations and pointer operations) although MPI-SV does support real-world MPI C programs.

Proc ::= var r : T | r := e | Comm | Proc ; Proc |
         if e Proc else Proc | while e do Proc
Comm ::= Ssend(e) | Send(e) | Recv(e) | Recv(*) | Barrier |
         ISend(e,r) | IRecv(e,r) | IRecv(*,r) | Wait(r)

Figure 1: Syntax of a core MPI language.

The statement var r : T declares a variable r with type T. The statement r := e assigns the value of expression e to variable r. A process can be constructed from basic statements by using the composition operations including sequence, condition and loop. For brevity, we incorporate the key message passing operations in the syntax, where e indicates the destination process's identifier. These message passing operations can be blocking or non-blocking. First, we introduce blocking operations:
• Ssend(e): sends a message to the eth process, and the sending process blocks until the message is received by the destination process.
• Send(e): sends a message to the eth process, and the sending process blocks until the message is copied into the system buffer.
• Recv(e): receives a message from the eth process, and the receiving process blocks until the message from the eth process is received.
• Recv(*): receives a message from any process, and the receiving process blocks until a message is received, regardless of which process sends the message.
• Barrier: blocks the process until all the processes have called Barrier.
• Wait(r): the process blocks until the operation indicated by r is completed.

A Recv(*) operation, called wildcard receive, may receive a message from different processes under different runs, resulting in non-determinism. The blocking of a Send(e) operation depends on the size of the system buffer, which may differ under different MPI implementations. For simplicity, we assume that the size of the system buffer is infinite. Hence, each Send(e) operation returns immediately after being issued. Note that our implementation allows users to configure the buffer size.

P0:        P1:               P2:        P3:
Send(1)    if (x != 'a')     Send(1)    Send(1)
             Recv(0)
           else
             IRecv(*,req);
           Recv(3)

Figure 2: An illustrative example of MPI programs.

Figure 3: The framework of MPI-SV. (An MPI program and a property enter the symbolic executor; a violating program path yields a test case, while a violation-free path yields a CSP model that the CSP model checker either refutes, reporting the violation, or verifies, feeding the state pruner.)

To improve the performance, the MPI standard provides non-blocking operations to overlap computations and communications.
• ISend(e,r): sends a message to the eth process, and the operation returns immediately after being issued. The parameter r is the handle of the operation.
• IRecv(e,r): receives a message from the eth process, and the operation returns immediately after being issued. IRecv(*,r) is the non-blocking wildcard receive.

The operations above are key MPI operations. Complex operations, such as MPI_Bcast and MPI_Gather, can be implemented by composing these key operations. The formal semantics of the core language is defined based on communicating state machines (CSM) [8]. We define each process as a CSM with an unbounded receiving FIFO queue. For the sake of the space limit, the formal semantics can be referred to in [91].

An MPI program runs in many processes spanned across multiple machines. These processes communicate by message passing to accomplish a parallel task. Besides parallel execution, the non-determinism in MPI programs mainly comes from two sources: (1) inputs, which may influence the communication through control flow, and (2) wildcard receives, which lead to highly non-deterministic executions.

Consider the MPI program in Figure 2. Processes P0, P2 and P3 only send a message to P1 and then terminate. For process P1, if input x is not equal to 'a', P1 receives a message from P0 in a blocking manner; otherwise, P1 uses a non-blocking wildcard receive to receive a message. Then, P1 receives a message from P3. When x is 'a' and IRecv(*,req) receives the message from P3, a deadlock occurs, i.e., P1 blocks at Recv(3), and all the other processes terminate. Hence, to detect the deadlock, we need to handle the non-determinism caused by the input x and the wildcard receive IRecv(*,req).
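To make the example concrete, here is a sketch of Figure 2 in real MPI C (run with 4 processes; buffers, tags, and the way x is obtained are illustrative assumptions). As in the figure, the pending wildcard request is never waited on.

#include <mpi.h>

/* Sketch of Figure 2. When x == 'a' and the wildcard IRecv happens
 * to match P3's send, the final Recv(3) has no sender left and P1
 * deadlocks. */
int main(int argc, char **argv) {
    int rank, buf = 0, tmp = 0;
    char x = (argc > 1) ? argv[1][0] : 'a';   /* the program input x */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0 || rank == 2 || rank == 3) {            /* P0, P2, P3 */
        MPI_Send(&rank, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* Send(1) */
    } else {                                              /* P1 */
        if (x != 'a') {
            MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                     /* Recv(0) */
        } else {
            MPI_Request req;
            MPI_Irecv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                      MPI_COMM_WORLD, &req);                 /* IRecv(*,req) */
        }
        MPI_Recv(&tmp, 1, MPI_INT, 3, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                         /* Recv(3) */
    }
    MPI_Finalize();
    return 0;
}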
To handle the non-determinism due to the input, a standard remedy is symbolic execution [48]. However, there are two challenges. The first one is to systematically explore the paths of an MPI program with non-blocking and wildcard operations, which significantly increase the complexity of MPI programs. A non-blocking operation does not block but returns immediately, causing out-of-order completion. The difficulty in handling wildcard operations is to get all the possibly matched messages. The second one is to improve the scalability of the symbolic execution. Symbolic execution struggles with path explosion. MPI processes run concurrently, resulting in an exponential number of program paths w.r.t. the number of processes. Furthermore, the path space increases exponentially with the number of wildcard operations.

2.2 Our Approach

MPI-SV leverages dynamic verification [83] and model checking [17] to tackle these challenges. Figure 3 shows MPI-SV's basic framework. The inputs of MPI-SV are an MPI program and an expected property, e.g., deadlock freedom expressed in LTL. MPI-SV uses the built-in symbolic executor to explore the path space automatically and checks the property along with path exploration. For a path that violates the property, called a violation path, MPI-SV generates a test case for replaying, which includes the program inputs, the interleaving sequence of MPI operations, and the matchings of wildcard receives. In contrast, for a violation-free path p, MPI-SV builds a communicating sequential processes (CSP) model Γ, which represents the paths that can be obtained from p by changing the interleavings and matchings of the communication operations in p. Then, MPI-SV utilizes a CSP model checker to verify Γ w.r.t. the property. If the model checker reports a counterexample, a violation is found; otherwise, if Γ satisfies the property, MPI-SV prunes all behaviors captured by the model so that they are avoided by symbolic execution.

Since MPI processes are memory independent, MPI-SV selects a process to execute in a round-robin manner to avoid exploring all interleavings of the processes. A process keeps running until it blocks or terminates. When encountering an MPI operation, MPI-SV records the operation instead of executing it and doing the message matching. When every process blocks or terminates and at least one blocked process exists, MPI-SV matches the recorded MPI operations of the processes w.r.t. the MPI standard [24]. The intuition behind this strategy is to collect the message exchanges as thoroughly as possible, which helps find the possible matchings for the wildcard receive operations.

Consider the MPI program in Figure 2 and the deadlock freedom property. Figure 4 shows the symbolic execution tree, where the node labels indicate process communications, e.g., (3,1) means that P1 receives a message from P3. MPI-SV first symbolically executes P0, which only sends a message to P1. The Send(1) operation returns immediately with the assumption of infinite system buffers. Hence, P0 terminates, and the operation Send(1) is recorded. Then, MPI-SV executes P1 and explores both branches of the conditional statement as follows.

(1) True branch (x ≠ 'a'). In this case, P1 blocks at Recv(0). MPI-SV records the receive operation for P1, and starts executing P2. Like P0, P2 executes operation Send(1) and terminates, after which P3 is selected and behaves the same as P2. After P3 terminates, the global execution blocks, i.e., P1 blocks and all the other processes terminate. When this happens, MPI-SV matches the recorded operations, performs the message exchanges and continues to execute the matched processes. The Recv(0) in P1 should be matched with the Send(1) in P0. After executing the send and receive operations, MPI-SV selects P1 to execute, because P0 terminates. Then, P1 blocks at Recv(3). Same as earlier, the global execution blocks and operation matching needs to be done. Recv(3) is matched with the Send(1) in P3. After executing the Recv(3) and Send(1) operations, all the processes terminate successfully. Path p1 in Figure 4 is explored.

(2) False branch (x = 'a'). The execution of P1 proceeds until reaching the blocking receive Recv(3). Additionally, the two issued receive operations, i.e., IRecv(*,req) and Recv(3), are recorded. Similar to the true branch, when every process blocks or terminates, we handle operation matching. Here P0, P2 and P3 terminate, and P1 blocks at Recv(3). IRecv(*,req) should be matched first because of the non-overtaken policy in the MPI standard [24]. There are three Send operation candidates, from P0, P2 and P3, respectively. MPI-SV forks a state for each candidate. Suppose MPI-SV first explores the state where IRecv(*,req) is matched with P0's Send(1). After matching and executing P1's Recv(3) and P3's Send(1), the path terminates successfully, which generates path p2 in Figure 4.

Violation detection. MPI-SV continues to explore the remaining two cases. Without CSP-based boosting, the deadlock would be found in the last case (i.e., p4 in Figure 4), where IRecv(*,req) is matched with P3's Send(1) and P1 blocks because Recv(3) has no matched operation. MPI-SV generates a CSP model Γ based on the deadlock-free path p2 where P1's IRecv(*,req) is matched with P0's Send(1). Each MPI process is modeled as a CSP process, and all the CSP processes are composed in parallel to form Γ. Notably, in Γ, we collect the possible matchings of a wildcard receive through statically matching the arguments of the operations in the path. Additionally, the requirements in the MPI standard, i.e., completes-before relations [83], are also modeled. A CSP model checker then verifies deadlock freedom for Γ. The model checker reports a counterexample where IRecv(*,req) is matched with the Send(1) in P3. MPI-SV only explores two paths for detecting the deadlock and avoids the exploration of p3 and p4 (indicated by dashed lines in Figure 4).

Figure 4: The example program's symbolic execution tree. (The root branches on x ≠ 'a' and x = 'a'; the leaves are the paths p1-p4, whose edges are labeled with matchings such as (0,1), (2,1) and (3,1), and p4 ends in the deadlock.)

Pruning. Because the CSP modeling is precise (cf. Section 4), in addition to finding violations earlier, MPI-SV can also perform path pruning when the model satisfies the property. Suppose we change the program in Figure 2 to be the one where the last statement of P1 is a Recv(*) operation. Then, the program is deadlock free. The true branch (x ≠ 'a') has 2 paths, because the last wildcard receive in P1 has two matchings (i.e., P2's send and P3's send; P0's send has been matched by P1's Recv(0)). The false branch (x = 'a') has 6 paths, because the first wildcard receive has 3 matchings (send operations from P0, P2 and P3) and the last wildcard receive has 2 matchings (because the first wildcard receive has matched one send operation). Hence, in total, there are 8 paths (i.e., 2 + 3 ∗ 2 = 8) if we use pure symbolic execution. In contrast, with model checking, MPI-SV only needs 2 paths to verify that the program is deadlock-free. For each branch, the generated model is verified to be deadlock-free, so MPI-SV prunes the candidate states forked for the matchings of the wildcard receives.

Properties. Because our CSP modeling encodes the interleavings of the MPI operations in the MPI processes, the scope of the verifiable properties is enlarged, i.e., MPI-SV can verify safety and liveness properties in LTL. Suppose we change the property to be the one that requires that the Send(1) operation in P0 be completed before the Send(1) operation in P2. Actually, the send operation in P2 can be completed before the send operation in P0, due to the nature of parallel execution. However, pure symbolic execution fails to detect the property violation. In contrast, with the help of CSP modeling, when we verify the model generated from the first path w.r.t. the property, the model checker gives a counterexample, indicating that a violation of the property exists.

3 SYMBOLIC VERIFICATION METHOD

In this section, we present our symbolic verification framework and then describe MPI-SV's symbolic execution method.

3.1 Framework

Given an MPI program MP = {Proc_i | 0 ≤ i ≤ n}, a state Sc in MP's symbolic execution is composed of the states of the processes, i.e., (s0, ..., sn), and each MPI process's state is a 6-tuple (M, Stat, PC, F, B, R), where M maps each variable to a concrete value or a symbolic value, Stat is the next program statement to execute, PC is the process's path constraint [48], F is the flag of the process status belonging to {active, blocked, terminated}, and B and R are infinite buffers for storing the issued MPI operations not yet matched and the matched MPI operations, respectively. We use si ∈ Sc to denote that si is a process state in the global state Sc. An element elem of si can be accessed by si.elem, e.g., si.F is the ith process's status flag. In principle, a statement execution in any process advances the global state, making MP's state space exponential in the number of processes. We use the variable Seq_i defined in M to record the sequence of the issued MPI operations in Proc_i, and Seq(Sc) to denote the set {Seq_i | 0 ≤ i ≤ n} of the global state Sc. Global state Sc's path condition (denoted by Sc.PC) is the conjunction of the path conditions of Sc's processes, i.e., ⋀_{si ∈ Sc} si.PC.

Algorithm 1 shows the details of MPI-SV. We use worklist to store the global states to be explored. Initially, worklist only contains S_init, composed of the initial states of all the processes, and each process's status is active. At Line 4, Select picks a state from worklist as the one to advance. Hence, Select can be customized with different search heuristics, e.g., depth-first search (DFS). Then, Scheduler selects an active process Proc_i to execute. Next, Execute (cf. Algorithm 2) symbolically executes the statement Stat_i in Proc_i, and may add new states into worklist. This procedure continues until worklist is empty (i.e., all the paths have been explored), a violation is detected, or a timeout occurs (omitted for brevity).
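As a reading aid, the per-process state (M, Stat, PC, F, B, R) and the global state Sc defined above can be pictured as the following C sketch. The types are hypothetical placeholders, not MPI-SV's actual data structures.

/* Hypothetical sketch of the symbolic execution state of Section 3.1. */
typedef enum { ACTIVE, BLOCKED, TERMINATED } Flag;

struct Expr;    /* a concrete or symbolic value / constraint */
struct Stmt;    /* a program statement */
struct MPIOp;   /* an issued MPI operation */

typedef struct {
    struct Expr  **M;     /* store: maps each variable to its value */
    struct Stmt   *Stat;  /* next statement to execute */
    struct Expr   *PC;    /* the process's path constraint */
    Flag           F;     /* active / blocked / terminated */
    struct MPIOp **B;     /* issued, not yet matched operations */
    struct MPIOp **R;     /* matched operations */
} ProcState;

typedef struct {
    ProcState *procs;     /* (s0, ..., sn), one per MPI process */
    int        nprocs;    /* Sc.PC is the conjunction of all si.PC */
} GlobalState;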

Algorithm 1: Symbolic Verification Framework

MPI-SV(MP, φ, Sym)
  Data: MP is {Proc_i | 0 ≤ i ≤ n}, φ is a property, and Sym is a set of symbolic variables
   1 begin
   2     worklist ← {S_init}
   3     while worklist ≠ ∅ do
   4         Sc ← Select(worklist)
   5         (M_i, Stat_i, PC_i, F_i, B_i, R_i) ← Scheduler(Sc)
   6         Execute(Sc, Proc_i, Stat_i, Sym, worklist)
   7         if ∀ si ∈ Sc, si.F = terminated then
   8             Γ ← GenerateCSP(Sc)
   9             ModelCheck(Γ, φ)
  10             if Γ |= φ then
  11                 worklist ← worklist \ {Sp ∈ worklist | Sp.PC ⇒ Sc.PC}
  12             end
  13             else if Γ ̸|= φ then
  14                 reportViolation and Exit
  15             end
  16         end
  17     end
  18 end

Algorithm 2: Blocking-driven Symbolic Execution

Execute(Sc, Proc_i, Stat_i, Sym, worklist)
  Data: global state Sc, MPI process Proc_i, statement Stat_i, symbolic variable set Sym, worklist of global states
   1 begin
   2     switch (Stat_i) do
   3         case Send or ISend or IRecv do
   4             Seq_i ← Seq_i · ⟨Stat_i⟩
   5             si.B ← si.B · ⟨Stat_i⟩
   6         end
   7         case Barrier or Wait or Ssend or Recv do
   8             Seq_i ← Seq_i · ⟨Stat_i⟩
   9             si.B ← si.B · ⟨Stat_i⟩
  10             si.F ← blocked
  11             if GlobalBlocking then    // ∀ si ∈ Sc, (si.F = blocked ∨ si.F = terminated)
  12                 Matching(Sc, worklist)
  13             end
  14         end
  15         default:
  16             Execute(Sc, Proc_i, Stat_i, Sym, worklist) as normal
  17     end
  18 end
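The GlobalBlocking condition that gates Matching in Algorithm 2 can be sketched in C as follows, reusing the hypothetical GlobalState from the earlier sketch. Per Section 2.2, matching fires only when no process is active and at least one process is blocked; if nothing can then be matched, Algorithm 3 reports a deadlock.

/* Sketch of the GlobalBlocking test (Algorithm 2, Lines 11-12). */
static int global_blocking(const GlobalState *g) {
    int has_blocked = 0;
    for (int i = 0; i < g->nprocs; i++) {
        if (g->procs[i].F == ACTIVE)
            return 0;            /* some process can still run */
        if (g->procs[i].F == BLOCKED)
            has_blocked = 1;
    }
    return has_blocked;          /* all blocked/terminated, one blocked */
}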

After executing Stat_i, if all the processes in the current global state Sc terminate, i.e., a violation-free path terminates, we use Algorithm 4 to generate a CSP model Γ from the current state (Line 8). Then, we use a CSP model checker to verify Γ w.r.t. φ. If Γ satisfies φ (denoted by Γ |= φ), we prune the global states forked by the wildcard operations along the current path (Line 11), i.e., the states in worklist whose path conditions imply Sc's path condition; otherwise, if the model checker gives a counterexample, we report the violation and exit (Line 14).

Since MPI processes are memory independent, we employ partial order reduction (POR) [17] to reduce the search space. Scheduler selects a process in a round-robin fashion from the current global state. In principle, Scheduler starts from the active MPI process with the smallest identifier, e.g., Proc0 at the beginning, and an MPI process keeps running until it is blocked or terminated. Then, the next active process will be selected to execute. Such a strategy significantly reduces the path space of symbolic execution. Then, with the help of CSP modeling and model checking, MPI-SV can verify more properties, i.e., safety and liveness properties in LTL. The details of such technical improvements will be given in Section 4.

3.2 Blocking-driven Symbolic Execution

Algorithm 2 shows the symbolic execution of a statement. Common statements such as conditional statements are handled in the standard way [48] (omitted for brevity), and here we focus on MPI operations. The main idea is to delay the executions of MPI operations as much as possible, i.e., trying to get all the message matchings. Instead of execution, Algorithm 2 records each MPI operation for each MPI process (Lines 4&8). We also need to update buffer B after issuing an MPI operation (Lines 5&9). Then, if Stat_i is a non-blocking operation, the execution returns immediately; otherwise, we block Proc_i (Line 10, excepting the Wait of an ISend operation). When reaching GlobalBlocking (Lines 11&12), i.e., every process is terminated or blocked, we use Matching (cf. Algorithm 3) to match the recorded but not yet matched MPI operations and execute the matched operations. Since the opportunity for matching messages is GlobalBlocking, we call it blocking-driven symbolic execution.

Algorithm 3: Blocking-driven Matching

Matching(Sc, worklist)
  Data: global state Sc, worklist of global states
   1 begin
   2     MSW ← ∅                     // Matching set of wildcard operations
   3     pair_n ← matchN(Sc)         // Match non-wildcard operations
   4     if pair_n ≠ empty pair then
   5         Fire(Sc, pair_n)
   6     end
   7     else
   8         MSW ← matchW(Sc)        // Match wildcard operations
   9         for pair_w ∈ MSW do
  10             S′c ← fork(Sc, pair_w)
  11             worklist ← worklist ∪ {S′c}
  12         end
  13         if MSW ≠ ∅ then
  14             worklist ← worklist \ {Sc}
  15         end
  16     end
  17     if pair_n = empty pair ∧ MSW = ∅ then
  18         reportDeadlock and Exit
  19     end
  20 end

Matching matches the recorded MPI operations in different processes. To obtain all the possible matchings, we delay the matching of a wildcard operation as much as possible. We use matchN to match the non-wildcard operations first (Line 3) w.r.t. the rules in the MPI standard [24], especially the non-overtaken ones: (1) if two sends of a process send messages to the same destination, and both can match the same receive, the receive should match the first one; and (2) if a process has two receives, and both can match a send, the first receive should match the send.
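Rule (1) is the standard MPI non-overtaking guarantee. In MPI C it plays out as in this illustrative two-process fragment (payloads and tags are hypothetical):

#include <mpi.h>

/* Non-overtaking, rule (1): both sends of P0 could match P1's wildcard
 * receive, but the standard forces the first-issued send to match first. */
void nonovertaking(int rank) {
    if (rank == 0) {
        int a = 1, b = 2;
        MPI_Send(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* issued first  */
        MPI_Send(&b, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* issued second */
    } else if (rank == 1) {
        int x, y;
        MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);  /* must match the first send: x == 1 */
        MPI_Recv(&y, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);  /* then the second: y == 2 */
    }
}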

The matched send and receive operations will be executed, and the statuses of the involved processes will be updated to active, denoted by Fire(Sc, pair_n) (Line 5). If there is no matching for non-wildcard operations, we use matchW to match the wildcard operations (Line 8). For each possible matching of a wildcard receive, we fork a new state (denoted by fork(Sc, pair_w) at Line 10) to analyze each matching case. If no operations can be matched, but there exist blocked processes, a deadlock happens (Line 17). Besides, for the LTL properties other than deadlock freedom (such as temporal properties), we also check them during symbolic execution (omitted for brevity).

P0:               P1:               P2:
ISend(1,req1);    IRecv(*,req2);    Barrier;
Barrier;          Barrier;          ISend(1,req3);
Wait(req1)        Wait(req2)        Wait(req3)

Figure 5: An example of operation matching.

Take the program in Figure 5 for example. When all the processes block at Barrier, MPI-SV matches the recorded operations in the buffers of the processes, i.e., s0.B=⟨ISend(1,req1), Barrier⟩, s1.B=⟨IRecv(*,req2), Barrier⟩, and s2.B=⟨Barrier⟩. According to the MPI standard, each operation in the buffers is ready to be matched. Hence, Matching first matches the non-wildcard operations, i.e., the Barrier operations; then the status of each process becomes active. After that, MPI-SV continues to execute the active processes and record the issued MPI operations. The next GlobalBlocking point is: P0 and P2 terminate, and P1 blocks at Wait(req2). The buffers are ⟨ISend(1,req1), Wait(req1)⟩, ⟨IRecv(*,req2), Wait(req2)⟩, and ⟨ISend(1,req3), Wait(req3)⟩, respectively. All the issued Wait operations are not ready to match, because the corresponding non-blocking operations are not matched. So Matching needs to match the wildcard operation, i.e., IRecv(*,req2), which can be matched with ISend(1,req1) or ISend(1,req3). Then, a new state is forked for each case and added to the worklist.

Correctness. Blocking-driven symbolic execution is an instance of model checking with POR. We have proved that the symbolic execution method is correct for reachability properties [58]. Due to the space limit, the proof can be referred to in [91].

4 CSP BASED PATH MODELING

In this section, we first introduce the CSP [70] language. Then, we present the modeling algorithm of an MPI program's terminated path using a subset of CSP. Finally, we prove the soundness and the completeness of our modeling.

4.1 CSP Subset

Let Σ be a finite set of events, C a set of channels, and X a set of variables. Figure 6 shows the syntax of the CSP subset, where P denotes a CSP process, a∈Σ, c∈C, X⊆Σ and x∈X.

P := a | P # P | P □ P | P ∥_X P | c?x→P | c!x→P | skip

Figure 6: The syntax of a CSP subset.

The single event process a performs the event a and terminates. There are three operators: sequential composition (#), external choice (□) and parallel composition with synchronization (∥_X). P□Q performs as P or Q, and the choice is made by the environment. Let PS be a finite set of processes; □PS denotes the external choice of all the processes in PS. P ∥_X Q performs P and Q in an interleaving manner, but P and Q synchronize on the events in X. The process c?x → P performs as P after reading a value from channel c and writing the value to variable x. The process c!x → P writes the value of x to channel c and then behaves as P. Process skip terminates immediately.

4.2 CSP Modeling

For each violation-free program path, Algorithm 4 builds a precise CSP model of the possible communication behaviors obtained by changing the matchings and interleavings of the communication operations along the path. The basic idea is to model the communication operations in each process as a CSP process, and then compose all the CSP processes in parallel to form the model. To model Proc_i, we scan its operation sequence Seq_i in reverse. For each operation, we generate its CSP model and compose the model with that of the remaining operations in Seq_i w.r.t. the semantics of the operation and the MPI standard [24]. The modeling algorithm is efficient, and has a polynomial time complexity w.r.t. the total length of the recorded MPI operation sequences.

Algorithm 4: CSP Modeling for a Terminated State

GenerateCSP(S)
  Data: a terminated global state S, and Seq(S) = {Seq_i | 0 ≤ i ≤ n}
   1 begin
   2     PS ← ∅
   3     for i ← 0 ... n do
   4         Pi ← skip
   5         Req ← {r | IRecv(*,r) ∈ Seq_i ∨ IRecv(i,r) ∈ Seq_i}
   6         for j ← length(Seq_i) − 1 ... 0 do
   7             switch op_j do
   8                 case Ssend(i) do
   9                     c1 ← Chan(op_j)          // c1's size is 0
  10                     Pi ← c1!x → Pi
  11                 end
  12                 case Send(i) or ISend(i,r) do
  13                     c2 ← Chan(op_j)          // c2's size is 1
  14                     Pi ← c2!x → Pi
  15                 end
  16                 case Barrier do
  17                     Pi ← B # Pi
  18                 end
  19                 case Recv(i) or Recv(*) do
  20                     C ← StaticMatchedChannel(op_j, S)
  21                     Q ← Refine(□{c?x → skip | c ∈ C}, S)
  22                     Pi ← Q # Pi
  23                 end
  24                 case IRecv(*,r) or IRecv(i,r) do
  25                     C ← StaticMatchedChannel(op_j, S)
  26                     Q ← Refine(□{c?x → skip | c ∈ C}, S)
  27                     ew ← WaitEvent(op_j)      // op_j's wait event
  28                     Pi ← (Q # ew) ∥_{ew} Pi
  29                 end
  30                 case Wait(r) and r ∈ Req do
  31                     ew ← GenerateEvent(op_j)
  32                     Pi ← ew # Pi
  33                 end
  34             end
  35         end
  36         PS ← PS ∪ {Pi}
  37     end
  38     P ← ∥_{B} PS
  39     return P
  40 end

We use channel operations in CSP to model send and receive operations. Each send operation op has its own channel, denoted by Chan(op). We use a zero-sized channel to model an Ssend operation (Line 10), because Ssend blocks until the message is received. In contrast, considering that a Send or ISend operation is completed immediately, we use one-sized channels for them (Line 14), so the channel writing returns immediately. The modeling of Barrier (Line 17) is to generate a synchronization event that requires all the parallel CSP processes to synchronize on it (Lines 17&38). The modeling of receive operations consists of three steps. The first step calculates the possibly matched channels written by the send operations (Lines 20&25). The second uses the external choice of the reading actions of the matched channels (Lines 21&26), so as to model the different cases of the receive operation. Finally, the refined external choice process is composed with the remaining model. If the operation is blocking, the composition is sequential (Line 22); otherwise, it is a parallel composition (Line 28).

StaticMatchedChannel(op_j, S) (Lines 20&25) returns the set of the channels written by the possibly matched send operations of the receive operation op_j. We scan Seq(S) to obtain the possibly matched send operations of op_j. Given a receive operation recv in process Proc_i, SMO(recv, S) calculated as follows denotes the set of the matched send operations of recv.
• If recv is Recv(j) or IRecv(j,r), SMO(recv, S) contains Proc_j's send operations with Proc_i as the destination process.
• If recv is Recv(∗) or IRecv(∗,r), SMO(recv, S) contains any process's send operations with Proc_i as the destination process.

SMO(op, S) over-approximates op's precisely matched operations, and can be optimized by removing the send operations that are definitely executed after op's completion, and the ones whose messages are definitely received before op's issue. For example, let Proc0 be Send(1);Barrier;Send(1), and Proc1 be Recv(*);Barrier. SMO will add the two send operations in Proc0 to the matching set of the Recv(*) in Proc1. Since Recv(*) must complete before Barrier, we can remove the second send operation in Proc0. Such optimization reduces the complexity of the CSP model. For brevity, we use SMO(op, S) to denote the optimized matching set. Then, StaticMatchedChannel(op_j, S) is {Chan(op) | op ∈ SMO(op_j, S)}.

To satisfy the MPI requirements, Refine(P, S) (Lines 21&26) refines the models of receive operations by imposing the completes-before requirements [83] as follows:
• If a receive operation has multiple matched send operations from the same process, it should match the earlier issued one. This is ensured by checking the emptiness of the dependent channels.
• The receive operations in the same process should be matched w.r.t. their issue order if they receive messages from the same process, except in the conditional completes-before pattern [83]. We use one-sized channel actions to model these requirements.

We model a Wait operation if it corresponds to an IRecv operation (Line 30), because ISend operations complete immediately under the assumption of an infinite system buffer. Wait operations are modeled by synchronization in parallel processes. GenerateEvent generates a new synchronization event ew for each Wait operation (Line 31). Then, ew is produced after the corresponding non-blocking operation is completed (Line 28). The synchronization on ew ensures that a Wait operation blocks until the corresponding non-blocking operation is completed.

We use the example in Figure 5 for a demonstration. After exploring a violation-free path, the recorded operation sequences are Seq0=⟨ISend(1,req1), Barrier, Wait(req1)⟩, Seq1=⟨IRecv(*,req2), Barrier, Wait(req2)⟩, and Seq2=⟨Barrier, ISend(1,req3), Wait(req3)⟩. We first scan Seq0 in reverse. Note that we don't model Wait(req1), because it corresponds to an ISend. We create a synchronization event B for modeling Barrier (Lines 16&17). For the ISend(1,req1), we model it by writing an element a to a one-sized channel chan1, and use the prefix operation to compose its model with B (Lines 12-14). In this way, we generate the CSP process chan1!a→B # skip (denoted by CP0) for Proc0. Similarly, we model Proc2 by B # chan2!b→skip (denoted by CP2), where chan2 is also a one-sized channel and b is a channel element. For Proc1, we generate a single event process ew to model Wait(req2), because it corresponds to an IRecv (Lines 30-32). For IRecv(*,req2), we first compute the matched channels using SMO (Line 25), and StaticMatchedChannel(op_j, S) contains both chan1 and chan2. Then, we generate the following CSP process (denoted by CP1) for Proc1:

((chan1?a→skip □ chan2?b→skip) # ew) ∥_{ew} (B # ew # skip)

Finally, we compose the CSP processes using the parallel operator to form the CSP model (Line 38), i.e., CP0 ∥_{B} CP1 ∥_{B} CP2.
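For illustration only, the Figure 5 model could be written for the PAT model checker roughly as follows. This is a hand-translated sketch in PAT's CSP# dialect, not MPI-SV's generated output; the channel names, the barrier event, and the wait event ew are exactly the assumptions named in the text, and we rely on CSP#'s alphabetized parallel (||) synchronizing processes on their common events.

// One-slot channels: Send/ISend complete once the message is buffered.
channel chan1 1;
channel chan2 1;

// CP0: ISend(1,req1); Barrier   (the Wait of an ISend is not modeled)
P0() = chan1!0 -> barrier -> Skip;
// CP1: the wildcard IRecv runs in parallel with the rest of P1,
// joined on the wait event ew that models Wait(req2).
Recv1() = ((chan1?x -> Skip) [] (chan2?x -> Skip)); ew -> Skip;
Rest1() = barrier -> ew -> Skip;
P1()    = Recv1() || Rest1();          // synchronize on ew
// CP2: Barrier; ISend(1,req3)
P2() = barrier -> chan2!1 -> Skip;

System() = P0() || P1() || P2();       // synchronize on barrier
#assert System() deadlockfree;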
CSP modeling supports the case where communications depend on message contents. MPI-SV tracks the influence of a message during symbolic execution. When detecting that the message content influences the communications, MPI-SV symbolizes the content on-the-fly. We specially handle the widely used master-slave pattern for dynamic load balancing [32]. The basic idea is to use a recursive CSP process to model each slave process and a conditional statement for the master process to model the communication behaviors of different matchings. We verified five dynamic load balancing MPI programs in our experiments (cf. Section 5.4). The details for supporting the master-slave pattern are in Appendix A.3.

4.3 Soundness and Completeness

In the following, we show that the CSP modeling is sound and complete. Suppose GenerateCSP(S) generates the CSP process CSP_s. Here, soundness means that CSP_s models all the possible behaviors obtained by changing the matchings or interleavings of the communication operations along the path to S, and completeness means that each trace in CSP_s represents a real behavior that can be derived from S by changing the matchings or interleavings of the communications.

Since we compute SMO(op, S) by statically matching the arguments of the recorded operations, SMO(op, S) may contain some false matchings. Calculating the precisely matched operations of op is NP-complete [23], and we suppose such an ideal method exists. We use CSP_static and CSP_ideal to denote the generated models using SMO(op, S) and the ideal method, respectively. The following theorems ensure the equivalence of the two models under the stable-failures semantics [70] of CSP and CSP_static's consistency with the MPI semantics, which imply the soundness and completeness of our CSP modeling method. Let T(P) denote the trace set [70] of CSP process P, and F(P) denote the failure set of CSP process P. Each element in F(P) is (s, X), where s ∈ T(P) is a trace, and X is the set of events P refuses to perform after s.

Theorem 4.1. F(CSP_static) = F(CSP_ideal).

Proof. We only give the skeleton of the proof. We first prove T(CSP_static) = T(CSP_ideal), based on which we can prove F(CSP_static) = F(CSP_ideal). The main idea of proving these two equivalence relations is to use contradiction for proving the subset relations. We only give the proof of T(CSP_static) ⊆ T(CSP_ideal); the other subset relations can be proved in a similar way. Suppose there is a trace t = ⟨e1, ..., en⟩ such that t ∈ T(CSP_static) but t ∉ T(CSP_ideal); the supposition is then refuted by deriving a contradiction. We have also proved that CSP_static is consistent with the MPI semantics. Please refer to [91] for the detailed proofs of these two theorems.

5 EXPERIMENTAL EVALUATION

In this section, we first introduce the implementation of MPI-SV, then describe the research questions and the experimental setup. Finally, we give the experimental results.

5.1 Implementation

We have implemented MPI-SV based on Cloud9 [10], which is built upon KLEE [12] and enhances KLEE with better support for the POSIX environment and parallel symbolic execution. We leverage Cloud9's support for multi-threaded programs. We use a multi-threaded library for MPI, called AzequiaMPI [69], as the MPI environment model for symbolic execution. MPI-SV contains three main modules: program preprocessing, symbolic execution, and model checking. The program preprocessing module generates the input for symbolic execution. We use Clang to compile an MPI program to LLVM bytecode, which is then linked with the pre-compiled MPI library AzequiaMPI. The symbolic execution module is in charge of path exploration and property checking. The third module utilizes the state-of-the-art CSP model checker PAT [80] to verify CSP models, and uses the output of PAT to boost the symbolic executor.

5.2 Research Questions

We conducted experiments to answer the following questions:
• Effectiveness: Can MPI-SV verify real-world MPI programs effectively? How effective is MPI-SV when compared to the existing state-of-the-art tools?
• Efficiency: How efficient is MPI-SV when verifying real-world MPI programs? How efficient is MPI-SV when compared to pure symbolic execution?
• Verifiable properties: Can MPI-SV verify properties other than deadlock freedom?

5.3 Setup

Table 1 lists the programs analyzed in our experiments. All the programs are real-world open source MPI programs. DTG is a testing program. To introduce non-determinism and input dependence, we mutate the receive operations of the programs with two rules: (1) replace Recv(i) with if (x > a){Recv(i)} else {Recv(*)}; (2) replace Recv(*) with if (x > a){Recv(*)} else {Recv(j)}. Here x is an input variable, a is a random value, and j is generated randomly from the scope of the process identifiers. The mutations for IRecv(i,r) and IRecv(*,r) are similar. Rule 1 is to improve program performance and simplify programming, while rule 2 is to make the communication more deterministic. Since communications tend to depend on inputs in complex applications, such as the last three programs in Table 1, we also introduce input-related conditions. For each program, we generate five mutants if possible, or generate as many as the number of receives. We don't mutate the programs using the master-slave pattern [32], i.e., Matmat and Sorting, and only mutate the static scheduling versions of the programs Integrate, Mandelbrot, and Kfray.
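In MPI C, rule (1) amounts to guarding a deterministic receive with an input-dependent branch, for example (the buffer, tag, and concrete source i = 2 are illustrative):

#include <mpi.h>

/* Mutation rule (1) applied to Recv(2): x is a program input and a is
 * a randomly chosen constant; the else-branch introduces a wildcard. */
void mutated_recv(int x, int a, int *buf) {
    if (x > a) {
        MPI_Recv(buf, 1, MPI_INT, 2, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);              /* Recv(2) */
    } else {
        MPI_Recv(buf, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);              /* Recv(*) */
    }
}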

Table 1: The programs in the experiments.

Program       LOC     Brief Description
DTG           90      Dependence transition group
Matmat        105     Matrix multiplication
Integrate     181     Integral computing
Diffusion2d   197     Simulation of diffusion equation
Gauss_elim    341     Gaussian elimination
Heat          613     Heat equation solver
Mandelbrot    268     Mandelbrot set drawing
Sorting       218     Array sorting
Image_manip   360     Image manipulation
DepSolver     8988    Multimaterial electrostatic solver
Kfray         12728   KF-Ray parallel raytracer
ClustalW      23265   Multiple sequence alignment
Total         47354   12 open source programs

Baselines. We use pure symbolic execution as the first baseline because: (1) none of the state-of-the-art symbolic execution based verification tools can analyze non-blocking MPI programs, e.g., CIVL [57, 75]; (2) MPI-SPIN [74] can support input coverage and non-blocking operations, but it requires building models of the programs manually; and (3) other automated tools that support non-blocking operations, such as MOPPER [23] and ISP [83], can only verify programs under given inputs. MPI-SV aims at covering both the input space and non-determinism automatically. To compare with pure symbolic execution, we run MPI-SV under two configurations: (1) Symbolic execution, i.e., applying only symbolic execution for path exploration, and (2) Our approach, i.e., using model checking based boosting. Most of the programs run with 6, 8, and 10 processes, respectively. DTG and Matmat can only be run with 5 and 4 processes, respectively. For Diffusion2d and the programs using the master-slave pattern, we only run them with 4 and 6 processes due to the huge path space. We use MPI-SV to verify deadlock freedom of the MPI programs and also evaluate two non-reachability properties for Integrate and Mandelbrot. The timeout is one hour. There are three possible verification results: finding a violation, no violation, or timeout. We carry out all the tasks on an Intel Xeon-based server with 64G memory and 8 2.5GHz cores running Ubuntu 14.04. We ran each verification task three times and use the average results to alleviate experimental errors. To evaluate MPI-SV's effectiveness further, we also directly compare MPI-SV with CIVL [57, 75] and MPI-SPIN [74]. Note that, since MPI-SPIN needs manual modeling, we only use MPI-SV to verify MPI-SPIN's C benchmarks w.r.t. deadlock freedom.

5.4 Experimental Results

Table 2 lists the results for evaluating MPI-SV against pure symbolic execution. The first column shows program names, and #Procs is the number of running processes. T specifies whether the analyzed program is mutated, where o denotes the original program, and mi represents a mutant. A task comprises a program and the number of running processes. We label the programs using the master-slave pattern with superscript "∗". Column Deadlock indicates whether a task is deadlock free, where 0, 1, and -1 denote no deadlock, deadlock, and unknown, respectively. We use unknown for the case that both configurations fail to complete the task. Columns Time(s) and #Iterations show the verification time and the number of explored paths, respectively, where to stands for timeout. The results where Our approach performs better are shown in gray background in the original table.

Table 2: Experimental results.

Program (#Procs)     T    Deadlock    Time(s) SymEx | Ours                          #Iterations SymEx | Ours
DTG(5)               o    0           10.12 | 9.02                                  3 | 1
                     m1   0           13.69 | 9.50                                  10 | 2
                     m2   1           10.02 | 8.93                                  4 | 2
                     m3   1           10.21 | 9.49                                  4 | 2
                     m4   1           10.08 | 9.19                                  4 | 2
                     m5   1           9.04 | 9.29                                   2 | 2
Matmat∗(4)           o    0           36.94 | 10.43                                 54 | 1
Integrate(6/8/10)    o    0/0/0       78.17/to/to | 8.87/10.45/44.00                120/3912/3162 | 1/1/1
                     m1   0/0/-1      to/to/to | 49.94/to/to                        4773/3712/3206 | 32/128/79
                     m2   1/1/1       9.35/9.83/9.94 | 9.39/10.76/44.09             2/2/2 | 2/2/2
Integrate∗(4/6)      o    0/0         24.18/123.55 | 9.39/32.03                     27/125 | 1/1
Diffusion2d(4/6)     o    0/0         106.86/to | 9.84/13.19                        90/2041 | 1/1
                     m1   0/1         110.25/11.95 | 10.18/13.81                    90/2 | 1/2
                     m2   0/1         3236.02/12.66 | 17.05/14.38                   5850/3 | 16/2
                     m3   0/0         to/to | 19.26/199.95                          5590/4923 | 16/64
                     m4   1/1         11.35/11.52 | 11.14/14.22                     3/2 | 2/2
                     m5   1/0         10.98/to | 10.85/13.44                        2/1991 | 2/1
Gauss_elim(6/8/10)   o    0/0/0       to/to/to | 13.47/15.12/87.45                  2756/2055/1662 | 1/1/1
                     m1   1/1/1       155.40/to/to | 14.31/16.99/88.79              121/2131/559 | 2/2/2
Heat(6/8/10)         o    1/1/1       17.31/17.99/20.51 | 16.75/19.27/22.75         2/2/2 | 1/1/1
                     m1   1/1/1       17.33/18.21/20.78 | 17.03/19.75/23.16         2/2/2 | 1/1/1
                     m2   1/1/1       18.35/18.19/20.74 | 16.36/19.53/23.07         2/2/2 | 1/1/1
                     m3   1/1/1       19.64/20.21/23.08 | 16.36/19.72/22.95         3/3/3 | 1/1/1
                     m4   1/1/1       22.9/24.73/27.78 | 16.4/19.69/22.90           9/9/9 | 1/1/1
                     m5   1/1/1       24.28/28.57/32.67 | 16.61/19.59/22.42         7/7/7 | 1/1/1
Mandelbrot(6/8/10)   o    0/0/-1      to/to/to | 117.68/831.87/to                   500/491/447 | 9/9/9
                     m1   -1/-1/-1    to/to/to | to/to/to                           1037/1621/1459 | 173/227/246
                     m2   -1/-1/-1    to/to/to | to/to/to                           1093/1032/916 | 178/136/90
                     m3   1/1/1       10.71/11.17/11.92 | 10.84/11.68/13.5          2/2/2 | 2/2/2
Mandelbrot∗(4/6)     o    0/0         68.09/270.65 | 12.65/13.21                    72/240 | 2/2
Sorting∗(4/6)        o    0/0         to/to | 19.18/46.19                           584/519 | 1/1
Image_manip(6/8/10)  o    0/0/0       97.69/118.72/141.87 | 18.68/23.84/27.89       96/96/96 | 4/4/4
                     m1   1/1/1       12.92/15.80/15.59 | 14.15/14.53/16.86         2/2/2 | 2/2/2
DepSolver(6/8/10)    o    0/0/0       94.17/116.5/148.38 | 97.19/123.36/151.83      4/4/4 | 4/4/4
Kfray(6/8/10)        o    0/0/0       to/to/to | 51.59/68.25/226.96                 1054/981/1146 | 1/1/1
                     m1   1/1/1       52.15/53.50/46.83 | 53.14/69.58/229.97        2/2/2 | 2/2/2
                     m2   -1/-1/-1    to/to/to | to/to/to                           1603/1583/1374 | 239/137/21
                     m3   1/1/1       51.31/43.34/48.33 | 50.40/71.15/230.18        2/2/2 | 2/2/2
Kfray∗(4/6)          o    0/0         to/to | 53.44/282.46                          1301/1575 | 1/1
ClustalW(6/8/10)     o    0/0/0       to/to/to | 47.28/79.38/238.37                 1234/1105/1162 | 1/1/1
                     m1   0/0/0       to/to/to | 47.94/80.10/266.16                 1365/1127/982 | 1/1/1
                     m2   0/0/0       to/to/to | 47.71/90.32/266.08                 1241/1223/915 | 1/1/1
                     m3   1/1/1       895.63/to/to | 149.71/1083.95/301.99          175/1342/866 | 5/17/2
                     m4   0/0/0       to/to/to | 47.49/79.94/234.99                 1347/1452/993 | 1/1/1
                     m5   0/0/0       to/to/to | 47.75/80.33/223.77                 1353/1289/1153 | 1/1/1

For the 111 verification tasks, MPI-SV completes 100 tasks (90%) within one hour, whereas Symbolic execution completes 61 tasks (55%). Our approach detects deadlocks in 48 tasks, while the number for Symbolic execution is 44. We manually confirmed that the detected deadlocks are real. For the 48 tasks having deadlocks, MPI-SV on average offers a 5x speedup for detecting deadlocks. On the other hand, Our approach can verify deadlock freedom for 52 tasks, while Symbolic execution can do so for only 17 tasks. MPI-SV achieves an average 19x speedup. Besides, compared with Symbolic execution, Our approach requires fewer paths to detect the deadlocks (1/55 on average) and to complete the path exploration (1/205 on average). These results demonstrate MPI-SV's effectiveness and efficiency.

Figure 7: Completed tasks under a time threshold.

Figure 7 shows the efficiency of verification for the two configurations. The X-axis varies the time threshold from 5 minutes to one hour, while the Y-axis is the number of completed verification tasks. Our approach can complete more tasks than Symbolic execution under the same time threshold, demonstrating MPI-SV's efficiency. In addition, Our approach can complete 96 (96%) tasks in 5 minutes, which also demonstrates MPI-SV's effectiveness.

For some tasks, e.g., Kfray, MPI-SV does not outperform Symbolic execution. The reasons include: (a) the paths contain hundreds of non-wildcard operations, and the corresponding CSP models are huge, and thus time-consuming to model check; and (b) the number of wildcard receives or their possible matchings is very small, and as a result, only few paths are pruned.

Comparison with CIVL. CIVL uses symbolic execution to build a model for the whole program and performs model checking on the model. In contrast, MPI-SV adopts symbolic execution to generate path-level verifiable models. CIVL does not support non-blocking operations. We applied CIVL to our evaluation subjects. It only successfully analyzed DTG. Diffusion2d could be analyzed after removing unsupported external calls. MPI-SV and CIVL had similar performance on these two programs. CIVL failed on all the remaining programs due to compilation failures or lack of support for non-blocking operations. In contrast, MPI-SV successfully analyzed 99 of the 140 programs in CIVL's latest benchmarks. The failed ones are small API test programs for the APIs that MPI-SV does not support. For the real-world program floyd that both MPI-SV and CIVL can analyze, MPI-SV verified its deadlock freedom under 4 processes in 3 minutes, while CIVL timed out after 30 minutes. The results indicate the benefits of MPI-SV's path-level modeling.

Comparison with MPI-SPIN. MPI-SPIN relies on manual modeling of MPI programs. Inconsistencies may happen between an MPI program and its model. Although prototypes exist for translating C to Promela [45], they are impractical for real-world MPI programs. MPI-SPIN's state space reduction treats communication channels as rendezvous ones; thus, the reduction cannot handle the programs with wildcard receives. MPI-SV leverages model checking to prune redundant paths caused by wildcard receives. We applied MPI-SV to MPI-SPIN's 17 C benchmarks to verify deadlock freedom, and MPI-SV successfully analyzed 15 automatically, indicating the effectiveness. For the remaining two programs, i.e., BlobFlow and Monte, MPI-SV cannot analyze them due to the lack of support for some APIs. For the real-world program gausselim, MPI-SPIN needs 171s to verify that the model is deadlock-free under 5 processes, while MPI-SV only needs 27s to verify the program automatically. If the number of processes is 8, MPI-SPIN timed out in 30 minutes, but MPI-SV used 66s to complete the verification.

Temporal properties. We specify two temporal safety properties φ1 and φ2 for Integrate and Mandelbrot, respectively, where φ1 requires that process one cannot receive a message before process two, and φ2 requires that process one cannot send a message before process two. Both φ1 and φ2 can be represented by an LTL formula !a U b, which requires that event a cannot happen before event b. We verify Integrate and Mandelbrot under 6 processes. The verification results show that MPI-SV detects the violations of φ1 and φ2, while pure symbolic execution fails to detect the violations.

Runtime bugs. MPI-SV can also detect local runtime bugs. During the experiments, MPI-SV found 5 unknown memory access out-of-bound bugs: 4 in DepSolver and 1 in ClustalW.
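Returning to the temporal properties above: in PAT's input language, a property of the shape !a U b can be stated against the generated path model as an event-based LTL assertion. The following is only a sketch, assuming events send2 and send0 mark the completion of P2's and P0's sends in a hypothetical model process PathModel():

// "P2's send must not complete before P0's send": !send2 U send0
#assert PathModel() |= (!send2) U send0;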
Inconsistencies may happen between an MPI-SV on MPI-SPIN’s 17 C benchmarks to verify deadlock free- dom, and MPI-SV successfully analyzed 15 automatically, indicating Combining Symbolic Execution and Model Checking to Verify MPI Programs the effectiveness. For the remaining two programs, i.e., BlobFlow a static analysis tool built on Clang Static Analyzer [15], and only and Monte, MPI-SV cannot analyze them due to the lack of support supports intraprocedural analysis of local properties such as double for APIs. For the real-world program gausselim, MPI-SPIN needs non-blocking and missing wait. Botbol et al. [5] abstract an MPI 171s to verify that the model is deadlock-free under 5 processes, program to symbolic transducers, and obtain the reachability set while MPI-SV only needs 27s to verify the program automatically. If based on abstract interpretation [18], which only supports blocking the number of the processes is 8, MPI-SPIN timed out in 30 minutes, MPI programs and may generate false positives. COMPI [53, 54] but MPI-SV used 66s to complete verification. uses concolic testing [27, 72] to detect assertion or runtime errors in Temporal properties. We specify two temporal safety properties MPI applications. Ye et al. [89] employs partial symbolic execution φ1 and φ2 for Integrate and Mandelbrot, respectively, where φ1 [68] to detect MPI usage anomalies. However, these two symbolic requires process one cannot receive a message before process two, execution-based bug detection methods do not support the non- and φ2 requires process one cannot send a message before process determinism caused by wildcard operations. and Siegel [56] two. Both φ1 and φ2 can be represented by an LTL formula !a U b, propose a preliminary deductive method for verifying the numeric which requires event a cannot happen before event b. We verify properties of MPI programs in an unbounded number of processes. Integrate and Mandelbrot under 6 processes. The verification However, this method still needs manually provided verification results show that MPI-SV detects the violations of φ1 and φ2, while conditions to prove MPI programs. pure symbolic execution fails to detect violations. MPI-SV is related to the existing work on symbolic execution [48], Runtime bugs. MPI-SV can also detect local runtime bugs. Dur- which has been advanced significantly during the last decade10 [ , ing the experiments, MPI-SV finds 5 unknown memory access out- 12, 27, 28, 66, 72, 81, 86, 93]. Many methods have been proposed of-bound bugs: 4 in DepSolver and 1 in ClustalW. to prune paths during symbolic execution [4, 19, 34, 43, 92]. The basic idea is to use the techniques such as slicing [44] and interpo- 6 RELATED WORK lation [59] to safely prune the paths. Compared with them, MPI-SV only prunes the paths of the same path constraint but different Dynamic analyses are widely used for analyzing MPI programs. message matchings or operation interleavings. MPI-SV is also re- Debugging or testing tools [1, 36, 50, 51, 60, 71, 87] have better lated to the work of automatically extracting session types [63] or feasibility and scalability but depend on specific inputs and run- behavioral types [52] for Go programs and verifying the extracted ning schedules. Dynamic verification techniques, e.g., ISP [83] and type models. These methods extract over-approximation models DAMPI [84], run MPI programs multiple times to cover the sched- from Go programs, and hence are sound but incomplete. Compared ules under the same inputs. Böhm et al. 
7 CONCLUSION

We have presented MPI-SV for verifying MPI programs with both non-blocking and non-deterministic operations. By synergistically combining symbolic execution and model checking, MPI-SV provides a general framework for verifying MPI programs. We have implemented MPI-SV and extensively evaluated it on real-world MPI programs. The experimental results are promising and demonstrate MPI-SV's effectiveness and efficiency. The future work lies in several directions: (1) enhancing MPI-SV to support more MPI operations, (2) investigating the automated performance tuning of MPI programs based on MPI-SV, and (3) applying our synergistic framework to other message-passing programs.

REFERENCES
[1] Allinea. 2002. Allinea DDT. http://www.allinea.com/products/ddt/.
[2] Alexander Bakst, Klaus von Gleissenthall, Rami Gökhan Kici, and Ranjit Jhala. 2017. Verifying distributed programs via canonical sequentialization. PACMPL 1, OOPSLA (2017), 110:1–110:27.
[3] Stanislav Böhm, Ondrej Meca, and Petr Jancar. 2016. State-Space Reduction of Non-deterministically Synchronizing Systems Applicable to Deadlock Detection in MPI. In FM. 102–118.
[4] Peter Boonstoppel, Cristian Cadar, and Dawson Engler. 2008. RWset: attacking path explosion in constraint-based test generation. In TACAS. 351–366.
[5] Vincent Botbol, Emmanuel Chailloux, and Tristan Le Gall. 2017. Static Analysis of Communicating Processes Using Symbolic Transducers. In VMCAI. 73–90.
[6] Ahmed Bouajjani and Michael Emmi. 2012. Analysis of recursively parallel programs. In POPL. 203–214.
[7] Ahmed Bouajjani, Constantin Enea, Kailiang Ji, and Shaz Qadeer. 2018. On the Completeness of Verifying Message Passing Programs Under Bounded Asynchrony. In CAV. 372–391.
[8] Daniel Brand and Pitro Zafiropulo. 1983. On communicating finite-state machines. J. ACM (1983), 323–342.
[9] Greg Bronevetsky. 2009. Communication-sensitive static dataflow for parallel message passing applications. In CGO. 1–12.
[10] Stefan Bucur, Vlad Ureche, Cristian Zamfir, and George Candea. 2011. Parallel symbolic execution for automated real-world software testing. In EuroSys. 183–198.
[11] Rajkumar Buyya et al. 1999. High Performance Cluster Computing: Architectures and Systems. Prentice Hall.
[12] Cristian Cadar, Daniel Dunbar, and Dawson Engler. 2008. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In OSDI. 209–224.
[13] Sagar Chaki, Edmund M. Clarke, Alex Groce, Joël Ouaknine, Ofer Strichman, and Karen Yorav. 2004. Efficient Verification of Sequential and Concurrent C Programs. Formal Methods in System Design 25, 2-3 (2004), 129–166.
[14] Alessandro Cimatti, Iman Narasamdya, and Marco Roveri. 2011. Boosting Lazy Abstraction for SystemC with Partial Order Reduction. In TACAS. 341–356.
[15] Clang. 2016. Clang Static Analyzer. http://clang-analyzer.llvm.org.
[16] Edmund Clarke, Orna Grumberg, Somesh Jha, Yuan Lu, and Helmut Veith. 2000. Counterexample-guided abstraction refinement. In CAV. 154–169.
[17] Edmund M. Clarke, Orna Grumberg, and Doron Peled. 1999. Model Checking. MIT Press.
[18] Patrick Cousot and Radhia Cousot. 1977. Abstract Interpretation: A Unified Lattice Model for Static Analysis of Programs by Construction or Approximation of Fixpoints. In POPL. 238–252.
[19] Heming Cui, Gang Hu, Jingyue Wu, and Junfeng Yang. 2013. Verifying systems rules using rule-directed symbolic execution. In ASPLOS. 329–342.
[20] Przemysław Daca, Ashutosh Gupta, and Thomas A. Henzinger. 2016. Abstraction-driven Concolic Testing. In VMCAI. 328–347.
[21] Brian Demsky and Patrick Lam. 2015. SATCheck: SAT-directed stateless model checking for SC and TSO. In OOPSLA. 20–36.
[22] Alexander Droste, Michael Kuhn, and Thomas Ludwig. 2015. MPI-checker: static analysis for MPI. In LLVM-HPC. 3:1–3:10.
[23] Vojtěch Forejt, Daniel Kroening, Ganesh Narayanaswamy, and Subodh Sharma. 2014. Precise predictive analysis for discovering communication deadlocks in MPI programs. In FM. 263–278.
[24] MPI Forum. 2012. MPI: A Message-Passing Interface Standard Version 3.0. http://mpi-forum.org.
[25] Xianjin Fu, Zhenbang Chen, Yufeng Zhang, Chun Huang, Wei Dong, and Ji Wang. 2015. MPISE: Symbolic Execution of MPI Programs. In HASE. 181–188.
[26] Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, et al. 2004. Open MPI: Goals, concept, and design of a next generation MPI implementation. In EuroMPI. 97–104.
[27] Patrice Godefroid, Nils Klarlund, and Koushik Sen. 2005. DART: directed automated random testing. In PLDI. 213–223.
[28] Patrice Godefroid, Michael Y. Levin, and David A. Molnar. 2008. Automated Whitebox Fuzz Testing. In NDSS.
[29] Ganesh Gopalakrishnan, Paul D. Hovland, Costin Iancu, Sriram Krishnamoorthy, Ignacio Laguna, Richard A. Lethin, Koushik Sen, Stephen F. Siegel, and Armando Solar-Lezama. 2017. Report of the HPC Correctness Summit, Jan 25-26, 2017, Washington, DC. https://science.energy.gov/~/media/ascr/pdf/programdocuments/docs/2017/HPC_Correctness_Report.pdf.
[30] Ganesh Gopalakrishnan, Robert M. Kirby, Stephen F. Siegel, Rajeev Thakur, William Gropp, Ewing L. Lusk, Bronis R. de Supinski, Martin Schulz, and Greg Bronevetsky. 2011. Formal analysis of MPI-based parallel programs. Commun. ACM (2011), 82–91.
[31] William Gropp. 2002. MPICH2: A new start for MPI implementations. In EuroMPI. 7.
[32] William Gropp, Ewing Lusk, and Anthony Skjellum. 2014. Using MPI: Portable Parallel Programming with the Message-Passing Interface. The MIT Press.
[33] William Gropp, Ewing Lusk, and Rajeev Thakur. 1999. Using MPI-2: Advanced Features of the Message-Passing Interface. MIT Press.
[34] Shengjian Guo, Markus Kusano, Chao Wang, Zijiang Yang, and Aarti Gupta. 2015. Assertion guided symbolic execution of multithreaded programs. In FSE. 854–865.
[35] Shengjian Guo, Meng Wu, and Chao Wang. 2018. Adversarial symbolic execution for detecting concurrency-related cache timing leaks. In ESEC/FSE. 377–388.
[36] Tobias Hilbrich, Joachim Protze, Martin Schulz, Bronis R. de Supinski, and Matthias S. Müller. 2012. MPI runtime error detection with MUST: advances in deadlock detection. In SC. 30.
[37] Gerard J. Holzmann. 1997. The model checker SPIN. IEEE Transactions on Software Engineering (1997), 279–295.
[38] Gerard J. Holzmann. 2012. Promela manual pages. http://spinroot.com/spin/Man/promela.html.
[39] Jeff Huang, Charles Zhang, and Julian Dolby. 2013. CLAP: recording local executions to reproduce concurrency failures. In PLDI. 141–152.
[40] Shiyou Huang and Jeff Huang. 2016. Maximal causality reduction for TSO and PSO. In OOPSLA. 447–461.
[41] Yu Huang and Eric Mercer. 2015. Detecting MPI Zero Buffer Incompatibility by SMT Encoding. In NFM. 219–233.
[42] Omar Inverso, Truc L. Nguyen, Bernd Fischer, Salvatore La Torre, and Gennaro Parlato. 2015. Lazy-CSeq: A Context-Bounded Model Checking Tool for Multi-threaded C Programs. In ASE. 807–812.
[43] Joxan Jaffar, Vijayaraghavan Murali, and Jorge A. Navas. 2013. Boosting concolic testing via interpolation. In FSE. 48–58.
[44] Ranjit Jhala and Rupak Majumdar. 2005. Path slicing. In PLDI. 38–47.
[45] Ke Jiang and Bengt Jonsson. 2009. Using SPIN to model check concurrent algorithms, using a translation from C to Promela. In MCC. 67–69.
[46] René Just, Darioush Jalali, Laura Inozemtseva, Michael D. Ernst, Reid Holmes, and Gordon Fraser. 2014. Are mutants a valid substitute for real faults in software testing? In FSE. 654–665.
[47] Dhriti Khanna, Subodh Sharma, César Rodríguez, and Rahul Purandare. 2018. Dynamic Symbolic Verification of MPI Programs. In FM.
[48] J. C. King. 1976. Symbolic execution and program testing. Commun. ACM (1976), 385–394.
[49] Bernhard Kragl, Shaz Qadeer, and Thomas A. Henzinger. 2018. Synchronizing the Asynchronous. In CONCUR. 21:1–21:17.
[50] Bettina Krammer, Katrin Bidmon, Matthias S. Müller, and Michael M. Resch. 2004. MARMOT: An MPI analysis and checking tool. Advances in Parallel Computing (2004), 493–500.
[51] Ignacio Laguna, Dong H. Ahn, Bronis R. de Supinski, Todd Gamblin, Gregory L. Lee, Martin Schulz, Saurabh Bagchi, Milind Kulkarni, Bowen Zhou, Zhezhe Chen, and Feng Qin. 2015. Debugging high-performance computing applications at massive scales. Commun. ACM 58, 9 (2015), 72–81.
[52] Julien Lange, Nicholas Ng, Bernardo Toninho, and Nobuko Yoshida. 2018. A static verification framework for message passing in Go using behavioural types. In ICSE. 1137–1148.
[53] Hongbo Li, Zizhong Chen, and Rajiv Gupta. 2019. Efficient Concolic Testing of MPI Applications. In CC. 193–204.
[54] Hongbo Li, Sihuan Li, Zachary Benavides, Zizhong Chen, and Rajiv Gupta. 2018. COMPI: Concolic Testing for MPI Applications. In IPDPS. 865–874.
[55] Hugo A. López, Eduardo R. B. Marques, Francisco Martins, Nicholas Ng, César Santos, Vasco Thudichum Vasconcelos, and Nobuko Yoshida. 2015. Protocol-based verification of message-passing parallel programs. In OOPSLA. 280–298.
[56] Ziqing Luo and Stephen F. Siegel. 2018. Towards Deductive Verification of Message-Passing Parallel Programs. In CORRECTNESS@SC. 59–68.
[57] Ziqing Luo, Manchun Zheng, and Stephen F. Siegel. 2017. Verification of MPI programs using CIVL. In EuroMPI. 6:1–6:11.
[58] Zohar Manna and Amir Pnueli. 1992. The Temporal Logic of Reactive and Concurrent Systems: Specification. Springer.
[59] Kenneth L. McMillan. 2005. Applications of Craig Interpolants in Model Checking. In TACAS. 1–12.
[60] Subrata Mitra, Ignacio Laguna, Dong H. Ahn, Saurabh Bagchi, Martin Schulz, and Todd Gamblin. 2014. Accurate application progress analysis for large-scale parallel debugging. In PLDI. 193–203.
[61] Matthias Müller, Bronis de Supinski, Ganesh Gopalakrishnan, Tobias Hilbrich, and David Lecomber. 2011. Dealing with MPI bugs at scale: Best practices, automatic detection, debugging, and formal verification. http://sc11.supercomputing.org/schedule/event_detail.php?evid=tut131.
[62] Madanlal Musuvathi, Shaz Qadeer, Thomas Ball, Gérard Basler, Piramanayagam Arumuga Nainar, and Iulian Neamtiu. 2008. Finding and Reproducing Heisenbugs in Concurrent Programs. In OSDI. 267–280.
[63] Nicholas Ng and Nobuko Yoshida. 2016. Static deadlock detection for concurrent Go by global session graph synthesis. In CC. 174–184.
[64] Flemming Nielson, Hanne R. Nielson, and Chris Hankin. 2015. Principles of Program Analysis. Springer.
[65] Aditya V. Nori, Sriram K. Rajamani, SaiDeep Tetali, and Aditya V. Thakur. 2009. The YOGI Project: Software property checking via static analysis and testing. In TACAS. 178–181.
[66] Corina S. Pasareanu, Peter C. Mehlitz, David H. Bushnell, Karen Gundy-Burlet, Michael R. Lowry, Suzette Person, and Mark Pape. 2008. Combining unit-level symbolic execution and system-level concrete execution for testing NASA software. In ISSTA. 15–26.
[67] Wojciech Penczek, Maciej Szreter, Rob Gerth, and Ruurd Kuiper. 2000. Improving Partial Order Reductions for Universal Branching Time Properties. Fundam. Inform. (2000), 245–267.
[68] David A. Ramos and Dawson R. Engler. 2015. Under-Constrained Symbolic Execution: Correctness Checking for Real Code. In USENIX Security. 49–64.
[69] Juan A. Rico-Gallego and Juan Carlos Díaz Martín. 2011. Performance Evaluation of Thread-Based MPI in Shared Memory. In EuroMPI. 337–338.
[70] Bill Roscoe. 2005. The Theory and Practice of Concurrency. Prentice-Hall.
[71] Victor Samofalov, V. Krukov, B. Kuhn, S. Zheltov, Alexander V. Konovalov, and J. DeSouza. 2005. Automated Correctness Analysis of MPI Programs with Intel(R) Message Checker. In PARCO. 901–908.
[72] Koushik Sen, Darko Marinov, and Gul Agha. 2005. CUTE: a concolic unit testing engine for C. In ESEC/FSE. 263–272.
[73] Stephen F. Siegel. 2007. Model Checking Nonblocking MPI Programs. In VMCAI.
[74] Stephen F. Siegel. 2007. Verifying Parallel Programs with MPI-Spin. In PVM/MPI. 13–14.
[75] Stephen F. Siegel, Manchun Zheng, Ziqing Luo, Timothy K. Zirkel, Andre V. Marianiello, John G. Edenhofner, Matthew B. Dwyer, and Michael S. Rogers. 2015. CIVL: the concurrency intermediate verification language. In SC. 61:1–61:12.
[76] Stephen F. Siegel and Timothy K. Zirkel. 2011. FEVS: A functional equivalence verification suite for high-performance scientific computing. Mathematics in Computer Science (2011), 427–435.
[77] Stephen F. Siegel and Timothy K. Zirkel. 2011. TASS: The Toolkit for Accurate Scientific Software. Mathematics in Computer Science (2011), 395–426.
[78] Marc Snir. 1998. MPI: The Complete Reference. Vol. 1: The MPI Core. MIT Press.
[79] Ting Su, Zhoulai Fu, Geguang Pu, Jifeng He, and Zhendong Su. 2015. Combining symbolic execution and model checking for data flow testing. In ICSE. 654–665.
[80] Jun Sun, Yang Liu, Jin Song Dong, and Jun Pang. 2009. PAT: Towards flexible verification under fairness. In CAV. 709–714.
[81] Nikolai Tillmann and Jonathan de Halleux. 2008. Pex: White Box Test Generation for .NET. In TAP. 134–153.
[82] Sarvani Vakkalanka. 2010. Efficient Dynamic Verification Algorithms for MPI Applications. Ph.D. Dissertation. The University of Utah.
[83] Sarvani S. Vakkalanka, Ganesh Gopalakrishnan, and Robert M. Kirby. 2008. Dynamic Verification of MPI Programs with Reductions in Presence of Split Operations and Relaxed Orderings. In CAV. 66–79.
[84] Anh Vo, Sriram Aananthakrishnan, Ganesh Gopalakrishnan, Bronis R. de Supinski, Martin Schulz, and Greg Bronevetsky. 2010. A scalable and distributed dynamic formal verifier for MPI programs. In SC. 1–10.
[85] Klaus von Gleissenthall, Rami Gökhan Kici, Alexander Bakst, Deian Stefan, and Ranjit Jhala. 2019. Pretend synchrony: synchronous verification of asynchronous distributed programs. PACMPL 3, POPL (2019), 59:1–59:30.
[86] Xinyu Wang, Jun Sun, Zhenbang Chen, Peixin Zhang, Jingyi Wang, and Yun Lin. 2018. Towards optimal concolic testing. In ICSE. 291–302.
[87] Rogue Wave. 2009. TotalView Software. http://www.roguewave.com/products/totalview.
[88] Ruini Xue, Xuezheng Liu, Ming Wu, Zhenyu Guo, Wenguang Chen, Weimin Zheng, Zheng Zhang, and Geoffrey Voelker. 2009. MPIWiz: subgroup reproducible replay of MPI applications. ACM SIGPLAN Notices (2009), 251–260.
[89] Fangke Ye, Jisheng Zhao, and Vivek Sarkar. 2018. Detecting MPI usage anomalies via partial program symbolic execution. In SC. 63:1–63:5.
[90] Liangze Yin, Wei Dong, Wanwei Liu, Yunchou Li, and Ji Wang. 2018. YOGAR-CBMC: CBMC with Scheduling Constraint Based Abstraction Refinement (Competition Contribution). In TACAS. 422–426.
[91] Hengbiao Yu, Zhenbang Chen, Xianjin Fu, Ji Wang, Zhendong Su, Jun Sun, Chun Huang, and Wei Dong. 2020. Combining Symbolic Execution and Model Checking to Verify MPI Programs. CoRR abs/1803.06300 (2020). http://arxiv.org/abs/1803.06300
[92] Hengbiao Yu, Zhenbang Chen, Ji Wang, Zhendong Su, and Wei Dong. 2018. Symbolic verification of regular properties. In ICSE. 871–881.
[93] Yufeng Zhang, Zhenbang Chen, Ji Wang, Wei Dong, and Zhiming Liu. 2015. Regular property guided dynamic symbolic execution. In ICSE. 643–653.

A APPENDIX

A.1 Semantics of the Core MPI Language

Auxiliary Definitions. Before giving the MPI language's semantics, we give some auxiliary definitions. Given an MPI program MP = {Proci | 0 ≤ i ≤ n}, send(dst) and recv(src) denote MP's send and receive operations, respectively, where dst ∈ {0,...,n} and src ∈ {0,...,n} ∪ {∗}. (Here send(dst) and recv(src) can denote both blocking and non-blocking operations; we omit the req parameter of non-blocking ones for the sake of simplicity.) op(MP) represents the set of all the MPI operations in MP, rank(α) is the process identifier of operation α, and isBlocking(α) indicates whether α is a blocking operation.

Definition A.1. MPI Process State. An MPI process's state is a tuple (M, Stat, F, B, R), where M maps a variable to its value, Stat is the next program statement to execute, F is the flag of the process status and belongs to the set {active, blocked, terminated}, and B and R are infinite buffers that store the issued MPI operations not yet matched and the matched MPI operations, respectively.

An element elem of a process state s can be accessed by s.elem, e.g., s.F is the status of s. The behavior of a process can be regarded as a sequence of statements, and we use index(α) to denote the index of operation α in the sequence. An MPI program's global state S is composed of the states of the MPI processes, i.e., S = (s0,...,sn). An MPI program's semantics is a labeled transition system defined below.
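Before the transition system, Definition A.1 can be pictured as plain C data structures. The following is a minimal illustrative sketch, not MPI-SV's implementation: all type and field names are ours, and the "infinite" buffers B and R are approximated by bounded arrays.

```c
#include <stddef.h>

#define MAX_OPS 1024                /* stand-in for the infinite buffers */

typedef enum { OP_SEND, OP_RECV, OP_WAIT, OP_BARRIER } op_kind;

/* One issued MPI operation; target is dst/src, with -1 encoding '*';
   req is the request handle linking non-blocking operations to Wait. */
typedef struct {
    op_kind kind;
    int     target;
    int     req;
    int     index;                  /* issue index within its process */
} mpi_op;

typedef enum { ACTIVE, BLOCKED, TERMINATED } proc_flag;   /* the flag F */

typedef struct {
    /* M: the variable valuation is elided (a map from names to values). */
    int       stat;                 /* Stat: next statement to execute    */
    proc_flag F;                    /* process status flag                */
    mpi_op    B[MAX_OPS];           /* issued, not-yet-matched operations */
    size_t    n_B;
    mpi_op    R[MAX_OPS];           /* matched operations                 */
    size_t    n_R;
} proc_state;

/* A global state S is just the tuple (s0, ..., sn) of process states. */
typedef struct { proc_state *procs; size_t n; } global_state;
```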

⟨ISSUE⟩
  si.F = active
  ────────────────────────────────────────────────────────────
  S −issue(si.Stat)→ (..., si[update(F, si.Stat), B.push(si.Stat)], ...)

⟨B⟩
  ∀i ∈ [0,n], si.F = blocked ∧ (∃α ∈ si.B, α = Barrier)
  ────────────────────────────────────────────────────────────
  S −B→ (s0[update(F,α), B.pull(α)], ..., sn[update(F,α), B.pull(α)])

⟨SR⟩/⟨SR∗⟩
  ∃α ∈ si.B, ∃β ∈ sj.B, ready(α,si) ∧ ready(β,sj) ∧ match(α,β) ∧ C(α,si,β,sj)
  ────────────────────────────────────────────────────────────
  S −SR/SR∗→ (..., si[update(F,α), B.pull(α)], ..., sj[update(F,β), B.pull(β)], ...)

⟨W⟩
  si.F = blocked ∧ ∃α ∈ si.B, (α = Wait ∧ ready(α,si))
  ────────────────────────────────────────────────────────────
  S −W→ (..., si[update(F,α), B.pull(α)], ...)

Figure 8: Transition Rules of MPI operations

Definition A.2. Labeled Transition System. A labeled transition system (LTS) of an MPI program MP is a quadruple (G, Σ, →, G0), where G is the set of global states, Σ denotes the set of actions defined below, → ⊆ G × Σ × G represents the set of transitions, and G0 is the set of initial states.

Actions. The action set Σ is {B, W, SR, SR∗} ∪ {issue(o) | o ∈ op(MP)}, where B represents the synchronization of Barrier, W is the execution of a Wait operation, SR denotes the matching of a message send and a deterministic receive, SR∗ represents the matching of a message send and a wildcard receive, and issue(o) stands for the issue of operation o.

Transition Rules. We first give some definitions used by the transition rules. We use ready(α,si) ∈ {True, False}, defined as follows, to indicate whether operation α is ready to be matched in state si w.r.t. the MPI standard [24], where β ∈ si.B represents that operation β is in the buffer B of process state si and k ∈ {0,...,n}.

• If α is Wait(r), α can be matched if the waited operation has been matched, i.e., ∃β ∈ R, (β = ISend(k,r) ∨ β = IRecv(k,r) ∨ β = IRecv(*,r)).
• If α is send(k), α can be matched if there is no previously issued send(k) not yet matched, i.e., ¬(∃β ∈ si.B, index(β) < index(α) ∧ β = send(k)).
• If α is recv(k), α can be matched if every previously issued recv(k) or recv(∗) has been matched, which can be formalized as ¬(∃β ∈ si.B, index(β) < index(α) ∧ (β = recv(k) ∨ β = recv(∗))).
• If α is recv(∗), α can be matched if every previously issued recv(∗) has been matched, i.e., ¬(∃β ∈ si.B, index(β) < index(α) ∧ β = recv(∗)).

It is worth noting the conditional completes-before pattern [83]: when an operation IRecv(k,r) is followed by a recv(∗), the recv(∗) can complete first when the matched message is not from k. We will give a condition later to ensure such a relation.

Suppose α and β are MPI operations. match(α,β) represents whether α and β can be matched statically, and can be defined as match′(α,β) ∨ match′(β,α), where match′(α,β) is ((α,β) = (send(dst), recv(src)) ∧ (dst = src ∨ src = ∗)). We use si[ops] to denote the updates of the process state si with an update operation sequence ops. The operation update(F,α) updates the process status w.r.t. α as follows:

update(F,α) =
  F := active,   if F = blocked ∧ isBlocking(α)
  F := blocked,  if F = active ∧ isBlocking(α)
  F := F,        otherwise

B.push(α) represents adding MPI operation α to buffer B, while B.pull(α) represents removing α from B and adding it to R. We use Stat′ to denote the statement next to Stat. We use C(α,si,β,sj) = C1(α,si,β,sj) ∧ C1(β,sj,α,si) to define the conditional completes-before relation requirement, where C1(α,si,β,sj) is ¬(∃β′ ∈ sj.B, (β = IRecv(*,r) ∧ β′ = IRecv(i,r′) ∧ ready(β′,sj) ∧ match(α,β′))).
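The ready and match predicates can be read operationally. The following sketch, building on the proc_state/mpi_op types sketched in A.1, is our own illustrative transcription of the four cases (the wildcard '*' is encoded as target == -1), not MPI-SV's code:

```c
#include <stdbool.h>
#include <stddef.h>

/* ready(alpha, s): is alpha ready to be matched in process state s? */
bool ready(const mpi_op *alpha, const proc_state *s) {
    if (alpha->kind == OP_WAIT) {
        /* Wait(r): ready once the waited operation appears in R. */
        for (size_t i = 0; i < s->n_R; i++)
            if (s->R[i].req == alpha->req)
                return true;
        return false;
    }
    for (size_t i = 0; i < s->n_B; i++) {
        const mpi_op *beta = &s->B[i];
        if (beta->index >= alpha->index)
            continue;               /* only earlier-issued operations matter */
        if (alpha->kind == OP_SEND &&
            beta->kind == OP_SEND && beta->target == alpha->target)
            return false;           /* earlier unmatched send(k) */
        if (alpha->kind == OP_RECV && beta->kind == OP_RECV) {
            if (alpha->target != -1 &&
                (beta->target == alpha->target || beta->target == -1))
                return false;       /* earlier unmatched recv(k) or recv(*) */
            if (alpha->target == -1 && beta->target == -1)
                return false;       /* earlier unmatched recv(*) */
        }
    }
    return true;
}

/* match(alpha, beta), simplified: a send and a receive match statically
   when the receive expects the sender's rank or is a wildcard. */
bool match_ops(const mpi_op *snd, int snd_rank, const mpi_op *rcv) {
    return snd->kind == OP_SEND && rcv->kind == OP_RECV &&
           (rcv->target == snd_rank || rcv->target == -1);
}
```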
Figure 8 shows four transition rules for MPI operations. For the sake of brevity, we omit the transition rules of local statements, which only update the mapping M and the next statement Stat to execute. Rule ⟨ISSUE⟩ describes the transition of issuing an MPI operation, which requires the issuing process to be active. After issuing the operation, the process status is updated, the next statement to execute becomes Stat′, and the issued operation is added to the buffer B. Rule ⟨SR⟩ is about the matching of a message send and a receive. Three conditions are required to match a send to a receive: (1) both of them have been issued to the buffer B and are ready to be matched; (2) the operation arguments match, i.e., match(α,β); and (3) they comply with the conditional completes-before relation, i.e., C(α,si,β,sj). After matching, the matched operations are removed from buffer B and added to buffer R, and the process status is updated. Rule ⟨B⟩ is for barrier synchronization, which requires that all the processes have been blocked at the Barrier. After barrier synchronization, the Barrier operation is moved from buffer B to buffer R and all the processes become active. Rule ⟨W⟩ is for the Wait operation, which requires that the corresponding non-blocking operation has finished. After executing the Wait operation, the process becomes active and the Wait operation is removed from buffer B and added to buffer R.

A.2 Correctness of Symbolic Execution for MPI Programs

Round-robin scheduling with blocking-driven symbolic execution is an instance of model checking with partial-order reduction (POR) that preserves reachability properties. Next, we prove the correctness of the symbolic execution method for verifying reachability properties.

Definition A.3. Reachability Property. A reachability property φ of an MPI program MP can be defined as follows, where assertion(S) represents an assertion of global state S, e.g., deadlock and assertions of variables:

γ ::= true | γ ∨ γ | ¬γ | assertion(S)
φ ::= EF γ | ¬φ

EF γ returns true iff there exists a state of MP that satisfies the formula γ.
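As an instance of Definition A.3, the deadlock-freedom property checked throughout the evaluation can be phrased as the negation of a reachability formula; here deadlock(S) is the assertion (our paraphrase) that some process of S is not terminated but no action is enabled:

  φdf = ¬ EF deadlock(S)

A violation is witnessed by a path reaching a state that satisfies deadlock(S).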

We use S −a→ S′ to represent the transition (S, a, S′) in the transition set, and enabled(S) to denote the set of enabled actions at a global state S, i.e., enabled(S) = {a | ∃S′ ∈ G, S −a→ S′}. If S −a→ S′, we use a(S) to represent S′. Instead of exploring all the possible states, MPI-SV only executes a subset of enabled(S) (denoted as E(S)) when reaching a state S. According to the workflow of MPI symbolic execution, we define E(S) below, where

• minIssue(S) = min{rank(o) | issue(o) ∈ enabled(S)} is the minimum process identifier that can issue an MPI operation.
• minRank(S) = min{ranka(b) | b ∈ enabled(S) ∧ b ≠ SR∗} is the minimum process identifier of enabled non-wildcard actions. When b is SR, ranka(b) is the process identifier of the send; when b is W, ranka(b) is the process identifier of the Wait operation.

E(S) =
  {issue(o)}  if issue(o) ∈ enabled(S) ∧ rank(o) = minIssue(S)
  {B}         if B ∈ enabled(S)
  {W}         if W ∈ enabled(S) ∧ ranka(W) = minRank(S)
  {SR}        if SR ∈ enabled(S) ∧ ranka(SR) = minRank(S)
  enabled(S)  otherwise

When issue(o) is enabled in state S, we select the enabled issue operation having the smallest process identifier as E(S), in accordance with the round-robin schedule. Recall that blocking-driven matching delays the matching of a wildcard receive (SR∗) as late as possible. If B is enabled, we use {B} as E(S); else, we use the deterministic matching (W or SR) having the smallest process identifier (we select W in case ranka(W) = ranka(SR)); otherwise, we use the enabled set as E(S), in which case ∀a ∈ enabled(S), a = SR∗.
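The case analysis defining E(S) is essentially a priority scheme, which a short C sketch may make concrete. The action encoding and function name below are our own illustration, not MPI-SV's implementation:

```c
#include <stddef.h>

/* An enabled action: its kind plus the process rank used for selection
   (rank(o) for issues, rank_a(b) for W and SR). */
typedef enum { A_ISSUE, A_BARRIER, A_WAIT, A_SR, A_SR_STAR } action_kind;
typedef struct { action_kind kind; int rank; } action;

/* select_E: compute E(S) from enabled(S); returns how many actions are
   written to out (1 in the deterministic cases, n when only wildcard
   matches SR* remain and all of them must be explored). */
size_t select_E(const action *enabled, size_t n, action *out) {
    const action *best = NULL;

    /* 1. The enabled issue with the smallest rank: round-robin schedule. */
    for (size_t i = 0; i < n; i++)
        if (enabled[i].kind == A_ISSUE &&
            (best == NULL || enabled[i].rank < best->rank))
            best = &enabled[i];
    if (best) { out[0] = *best; return 1; }

    /* 2. Barrier synchronization has the next priority. */
    for (size_t i = 0; i < n; i++)
        if (enabled[i].kind == A_BARRIER) { out[0] = enabled[i]; return 1; }

    /* 3. The deterministic match (W preferred over SR on a rank tie) with
          the smallest rank; wildcard matches are delayed as late as possible. */
    for (size_t i = 0; i < n; i++) {
        if (enabled[i].kind != A_WAIT && enabled[i].kind != A_SR) continue;
        if (best == NULL || enabled[i].rank < best->rank ||
            (enabled[i].rank == best->rank && enabled[i].kind == A_WAIT))
            best = &enabled[i];
    }
    if (best) { out[0] = *best; return 1; }

    /* 4. Otherwise every enabled action is an SR*: return them all. */
    for (size_t i = 0; i < n; i++) out[i] = enabled[i];
    return n;
}
```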
Definition A.4. Independence Relation (I ⊆ Σ × Σ). I is a binary relation such that for every S ∈ G and (a,b) ∈ I, if a,b ∈ enabled(S), then a ∈ enabled(b(S)), b ∈ enabled(a(S)), and a(b(S)) = b(a(S)).

The dependence relation D ⊆ Σ × Σ is the complement of I, i.e., (a,b) ∈ D if (a,b) ∉ I. Given an MPI program MP whose LTS model is M = (G, Σ, →, G0), an execution trace T = ⟨a0,...,an⟩ ∈ Σ∗ of MP is a sequence of actions such that ∃Si, Si+1 ∈ G, Si −ai→ Si+1 for each i ∈ [0,n], and S0 ∈ G0. We use r(T) to represent the result state of T, i.e., Sn+1, and T ∈ M to represent that T is an execution trace of MP.

Definition A.5. Execution Equivalence. ≡ ⊆ Σ∗ × Σ∗ is a reflexive and symmetric binary relation such that (1) ∀a,b ∈ Σ, (a,b) ∈ I ⇒ ⟨a,b⟩ ≡ ⟨b,a⟩, and (2) ∀T,T′ ∈ Σ∗, T ≡ T′ if there exists a sequence ⟨T0,...,Tk⟩ such that T0 = T, Tk = T′, and each Ti+1 is obtained from Ti by swapping two adjacent independent actions.

(1) b = issue(o′). Suppose rank(o) = i and rank(o′) = j; then i ≠ j. Since an issued operation can only block its own process, we have b ∈ enabled(a(S)) and a ∈ enabled(b(S)). In addition, a(b(S)) = b(a(S)) = (..., si[update(F,o), Stat := Stat′, B.push(o)], ..., sj[update(F,o′), Stat := Stat′, B.push(o′)], ...). Hence, (a,b) ∈ I.

(2) b = W. Suppose rank(o) = i and ranka(W) = j; then i ≠ j. Since issue(o) can only block process i and W can only make process j active, b ∈ enabled(a(S)) and a ∈ enabled(b(S)). In addition, a(b(S)) = b(a(S)) = (..., si[update(F,o), Stat := Stat′, B.push(o)], ..., sj[update(F, Wait), B.pull(Wait)], ...). Hence, (a,b) ∈ I.

(3) b = SR ∨ b = SR∗. Suppose b = (p,q), rank(o) = i, rank(p) = j1, and rank(q) = j2. For i ≠ j1 ∧ i ≠ j2: issue(o) only updates the state si, and ready(p,sj1) and ready(q,sj2) are not affected. On the other hand, b cannot make process i blocked. Hence, b ∈ enabled(a(S)) and a ∈ enabled(b(S)). Then, a(b(S)) = b(a(S)) = (..., si[update(F,o), Stat := Stat′, B.push(o)], ..., sj1[update(F,p), B.pull(p)], ..., sj2[update(F,q), B.pull(q)], ...). For i = j1: since issue(o) and b are co-enabled, index(o) > index(p) and p is non-blocking. Due to the condition index(o) > index(p), issue(o) has no effect on ready(p,sj1). On the other hand, since p is non-blocking, b cannot make process i blocked. Hence, b ∈ enabled(a(S)) and a ∈ enabled(b(S)). Additionally, since p is non-blocking, a(b(S)) = b(a(S)) = (..., si[update(F,o), Stat := Stat′, B.pull(p), B.push(o)], ..., sj2[update(F,q), B.pull(q)], ...). For i = j2, the proof is similar. Hence, (a,b) ∈ I. □

Theorem A.2. E(S) preserves the satisfaction of global reachability properties.

Proof. We first prove that E(S) satisfies conditions C1 and C2, respectively.

C1: ∀a ∈ E(S), if (a,b) ∈ D, then for every trace S −a0→ S1 −a1→ ... −ak→ Sk −b→ Sk+1, there exists ai ∈ E(S), where 0 ≤ i ≤ k.

Case 1: E(S) = enabled(S). C1 holds because a0 ∈ E(S).

Case 2: E(S) ≠ enabled(S). In this case, E(S) contains only one element, which can be issue(o), B, W, or SR. Assume C1 does not hold, i.e., ai ≠ a for every 0 ≤ i ≤ k. According to Theorem A.1, (a, ai) ∈ I holds for each 0 ≤ i ≤ k, so a stays enabled and commutes along the trace, and (a,b) ∈ I also holds, which contradicts (a,b) ∈ D.

A.3 Modeling the Master-Slave Pattern

Communications may depend on the message contents, e.g., "if (x > 10) Send(0); else Send(1)". The general way to handle the situation where communications depend on the message contents is to make the contents symbolic. MPI-SV makes x symbolic once it detects that there exist communications depending on x, so that MPI-SV does not miss a branch. The master-slave pattern is a representative situation and has been widely adopted to achieve dynamic load balancing [33]. The verification of parallel programs employing dynamic load balancing is a common challenge. In the pattern, the master process is in charge of management, i.e., dispatching jobs to slave processes and collecting results. On the other hand, the slave processes repeatedly work on the jobs and send results back to the master process until receiving the message of termination. Figure 9 shows an example program.

  P0                     P1                      P2
  Send(1);               while (true) do {       while (true) do {
  Send(2);                 Recv(0);                Recv(0);
  while (...) do {         if (...) break;         if (...) break;
    Recv(*, r);            Send(0);                Send(0);
    Send(r.src);         }                       }
  }
  ...

Figure 9: An example program using the master-slave pattern.

P0 is the master process, and the remaining processes are slaves. P0 first dispatches one job to each slave. Then, P0 iteratively receives a job result from any slave (Recv(*,r)) and dispatches a new job to the slave process whose result was just received, i.e., Send(r.src), where r is the status of the receive and r.src denotes the process identifier of the received message. Each slave process iteratively receives a job and sends the result back until receiving the termination message.

Modeling the master process. Suppose Chan1 and Chan2 are the matched channels of a schedule receive; we model it by Chan1?label → Skip □ Chan2?label → Skip. Considering that a send operation in a slave process is modeled by writing the slave's process identifier to the channel, we can use the value of label to decide the destination of the next job.

Modeling the slave process. We use a recursive CSP process to model a slave's dynamic feature, i.e., repeatedly receiving a job and sending the result back until receiving the termination message. To model the job receive operation in a slave process, we use a guard expression [label == i] before the channel reading of the receive operation, where label is the global variable of the corresponding schedule receive and i is the slave process's identifier. Notably, the guard expression disables the channel reading until the inner condition becomes true, indicating that the slave process cannot receive a new job unless its result has just been received by the master.
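Putting the two modeling rules together, the core of a CSP model of Figure 9 with two slaves might be sketched as follows. The process names Sched, Dispatch, and Slave(i) and the job channels Job1 and Job2 are our own illustrative names; the initial job dispatch, the unguarded first receive of each slave, and the termination messages are elided:

  Sched    = (Chan1?label → Skip □ Chan2?label → Skip); Dispatch
  Dispatch = ([label == 1] Job1!job → Sched) □ ([label == 2] Job2!job → Sched)
  Slave(i) = [label == i] Jobi?job → Chani!i → Slave(i)

Here Chani!i is slave i's result send, which writes the slave's identifier to its channel; once the master's schedule receive stores that identifier in label, the guard [label == i] re-enables exactly the slave whose result was just consumed, matching the informal description above.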

We need to refine the algorithms of symbolic execution and CSP modeling to support the master-slave pattern. We have already implemented the refinement in MPI-SV. The support of the master-slave pattern demonstrates that MPI-SV outperforms the single-path reasoning work [23, 41].

A.4 Proof of CSP Modeling's Soundness and Completeness

Theorem 4.1. F(CSPstatic) = F(CSPideal).

Proof. We first prove T(CSPstatic) = T(CSPideal), based on which we can prove F(CSPstatic) = F(CSPideal). First, we prove T(CSPstatic) ⊆ T(CSPideal) by contradiction. Suppose there exists a trace t = ⟨e1,...,en⟩ such that t ∈ T(CSPstatic) but t ∉ T(CSPideal).

For the failures, suppose (s, X) ∈ F(CSPstatic) but (s, X) ∉ F(CSPideal). It means there exists an event e in X that is refused by CSPstatic at s, but enabled by CSPideal at s. Because there is no internal choice in the CSP models, we have s·⟨e⟩ ∉ T(CSPstatic) but s·⟨e⟩ ∈ T(CSPideal), which contradicts T(CSPstatic) = T(CSPideal). The other direction is symmetric. Hence, F(CSPstatic) = F(CSPideal). □

Theorem 4.2. CSPstatic is consistent with the MPI semantics.

Proof. If the global state when generating CSPstatic is Sc, then we can get an MPI program MPp from the sequence set Seq(Sc), where each process Proci of MPp is the sequential composition of the operations in Seqi. Suppose the LTS model of MPp is Mp, and the LTS after hiding all the issue(o) actions in Mp is Mˆp. Then, CSPstatic is consistent with the MPI semantics iff {(Mt(s), Ms(X)) | (s, X) ∈ F(CSPstatic)} is equal to {(T, X) | T ∈ Mˆp ∧ X ⊆ Ms(Σ) \ enabled(r(T))}, where Σ is the event set of CSPstatic, and Mt(s) and Ms(X) map the events in the sequence s and the set X, respectively, to the corresponding actions in the MPI semantics. This can be shown by proving that Algorithm 4, with a precise SMO, ensures all the completes-before relations of the MPI semantics (cf. the semantic rules in Figure 8). The relations between send operations and those between receive operations (including the conditional completes-before relation) are ensured by Refine(P, S). The communications of send and recv operations are modeled by CSP channel operations and process compositions. The requirements of Wait and Barrier operations are modeled by the process compositions defined in Algorithm 4. Hence, we can conclude that CSPideal is consistent with the MPI semantics. Then, by Theorem 4.1, we can prove that CSPstatic is consistent with the MPI semantics. □