Fast Graph Simplification for Interleaved Dyck-Reachability

Yuanbo Li Qirun Zhang Thomas Reps Georgia Institute of Technology Georgia Institute of Technology University of Wisconsin-Madison USA USA USA [email protected] [email protected] [email protected]

Abstract CCS Concepts: • Mathematics of computing → Graph → Many program-analysis problems can be formulated as graph- algorithms; • Theory of computation Program anal- reachability problems. Interleaved Dyck language reachabil- ysis. ity (InterDyck-reachability) is a fundamental framework to Keywords: CFL-Reachability, Static Analysis express a wide variety of program-analysis problems over edge-labeled graphs. The InterDyck language represents ACM Reference Format: an intersection of multiple matched-parenthesis languages Yuanbo Li, Qirun Zhang, and Thomas Reps. 2020. Fast Graph Simpli- fication for Interleaved Dyck-Reachability. In Proceedings of the 41st (i.e., Dyck languages). In practice, program analyses typically ACM SIGPLAN International Conference on Programming Language leverage one Dyck language to achieve context-sensitivity, Design and Implementation (PLDI ’20), June 15–20, 2020, London, and other Dyck languages to model data dependences, such UK. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/ as field-sensitivity and pointer references/dereferences. In 3385412.3386021 the ideal case, an InterDyck-reachability framework should model multiple Dyck languages simultaneously. Unfortunately, precise InterDyck-reachability is undecid- 1 Introduction able. Any practical solution must over-approximate the exact The L language-reachability (L-reachability) framework is answer. In the literature, a lot of work has been proposed to a popular model to formulate many program-analysis prob- over-approximate the InterDyck-reachability formulation. lems [14]. An L-reachability instance Reach⟨L,G⟩ contains This paper offers a new perspective on improving both the (1) a L that formalizes the analysis problem, precision and the scalability of InterDyck-reachability: we and (2) an edge-labeled graph G that represents the program aim to simplify the underlying input graphG. Our key insight under analysis. Two nodes are L-reachable in G iff there ex- is based on the observation that if an edge is not contributing ists a path joining them, and the path string belongs to L. In to any InterDyck-path, we can safely eliminate it from G. the literature, the most popular L-reachability formulation Our technique is orthogonal to the InterDyck-reachability is Dyck-reachability [11, 25]. A Dyck language essentially formulation, and can serve as a pre-processing step with any generates well-balanced parentheses, which can be used to over-approximating approaches for InterDyck-reachability. capture well-paired program properties, such as function call- We have applied our graph simplification algorithm to pre- s/returns [15, 16, 22], pointer references/dereferences [27, processing the graphs from a recent InterDyck-reachability- 28], locks/unlocks [10, 13], and field reads/writes [9, 24, 25]. based taint analysis for Android. Our evaluation on three A natural generalization of Dyck-reachability is Inter- popular InterDyck-reachability algorithms yields promis- leaved Dyck-reachability (InterDyck-reachability) [9, 15, ing results. In particular, our graph-simplification method 26]. The Interleaved Dyck language denotes the intersec- improves both the scalability and precision of all three In- tion of multiple Dyck languages based on an interleaving terDyck-reachability algorithms, sometimes dramatically. operator ⊙. For instance, let D1 and D2 be two Dyck lan- guages that generate matched parentheses and matched brackets, respectively. The string “ ” belongs to the Permission to make digital or hard copies of all or part of this work for language InterDyck = D1 ⊙ D2, becauseLJLKMM both parentheses personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear and brackets are properly matched. InterDyck-reachability this notice and the full citation on the first page. Copyrights for components is much more expressive than Dyck-reachability, and in prac- of this work owned by others than ACM must be honored. Abstracting with tice, brings tremendous precision improvements in client credit is permitted. To copy otherwise, or republish, to post on servers or to analyses. In particular, almost all recent work on context- redistribute to lists, requires prior specific permission and/or a fee. Request sensitive, field-sensitive analysis has adopted the InterDyck- permissions from [email protected]. reachability formulation to achieve both the context- and PLDI ’20, June 15–20, 2020, London, UK © 2020 Association for Computing Machinery. field-sensitivities simultaneously [9, 19, 26]. ACM ISBN 978-1-4503-7613-6/20/06...$15.00 Unfortunately, solving InterDyck-reachability is compu- https://doi.org/10.1145/3385412.3386021 tationally hard because the InterDyck-reachability problem

780 PLDI ’20, June 15–20, 2020, London, UK Yuanbo Li, Qirun Zhang, and Thomas Reps is, in general, undecidable [15]. Therefore, any practical anal- with n nodes and m edges, we give an efficient algorithm that ysis must approximate the exact answer. In practice, it is quite simplifies G in O(m log m) time with O(m) space. This graph- challenging to develop a suitable over-approximative Inter- simplification algorithm is asymptotically faster than the Dyck-reachability framework that offers a sweet spot in the fastest O(mn)-time InterDyck-reachability algorithm [26]. trade-off between precision and scalability. InterDyck is a The technique is general, and can be used as a preprocessing prototypical example of a non-context-free language [8]. Tra- step for any existing InterDyck-reachability algorithms. ditional approaches employ less expressive but polynomial- We have implemented the graph-simplification algorithm, time decidable language-reachability frameworks, such as and evaluated it on a recent InterDyck-reachability-based context-free-language reachability (CFL-reachability) to over- taint analysis for Android [9]. In particular, we tested graph approximate InterDyck-reachability [9, 21, 24]. For exam- simplification with three popular InterDyck-reachability ple, the recent work by Späth et al. proposed synchronized algorithms, based on context-free language reachability (CFL- pushdown systems to compute a sound solution for Inter- reachability) [14], synchronized pushdown systems reacha- Dyck-reachability [19]. The work by Zhang and Su proposed bility (SPDS) [19], and linear conjunctive language reacha- linear-conjunctive-language reachability (LCL-reachability) bility (LCL-reachability) [26]. The empirical results are en- to precisely describe the InterDyck-reachability formula- couraging: graph simplification significantly improves both tion [26]. However, the LCL-reachability algorithm is inher- the performance and the precision of the client analyses. ently an over-approximation. To the best of our knowledge, • We found that, on average, it is 2.18× faster to (i) run all previous efforts on the InterDyck-reachability problem the simplification algorithm on digraph G—thereby attempt to improve either on the L-reachability formulation creating simplified digraph Gf —and then (ii) run an or the L-reachability algorithm. InterDyck-reachability algorithm A on Gf , compared In this paper, we attack the InterDyck-reachability prob- to running A directly on the original graph G. lem from a new angle. Consider an InterDyck-reachability • In the experiments with LCL-reachability, we found instance Reach⟨L,G⟩. Unlike existing approaches that im- that the cost of running the simplification algorithm prove either the L-reachability formulation or the algorithm, is recouped for all examples that require more than our approach focuses on simplifying the input graph G in seven seconds to run in the original graphs. Reach⟨L,G⟩. Specifically, we give an efficient algorithm to • The number of reachable pairs returned by the anal- simplify the input graphs by eliminating “useless” graph ysis based on the simplified graph Gf is reduced to edges. The benefits of graph simplification are two-fold. First, 64.92% compared to the number obtained by running working with smaller graphs improves the scalability of all the analysis on G. Moreover, the analysis run on Gf existing approaches for InterDyck-reachability. Second, be- uses 57.37% memory for the analysis running on G. cause all InterDyck-reachability algorithms are inherently Our work makes the following contributions: over-approximative, they could achieve better precision by working with graphs that contain fewer edges. The techni- • We propose a novel graph-simplification framework cal challenge, however, is to design a graph-simplification for InterDyck-reachability. Our technique reduces algorithm that is both effective (i.e., it should remove as input-graph size, and is compatible with all existing many “useless” edges as possible) and efficient (i.e., as a sound InterDyck-reachability algorithms. pre-processing step, it should run much faster than the In- • Given a graph with n nodes and m edges, we give a terDyck-reachability algorithm itself). fast simplification algorithm that runs in O(m log m) time with O(m) space. In practice, our algorithm scales Consider an InterDyck language L1 ⊙ L2 ... ⊙ Lk , where linearly with the graph size. for each i ∈ [1, k], Li is a Dyck language. One enabling insight is to decompose the InterDyck-reachability problem • We evaluate our technique based on a variety of Inter- in the input graph G into k Dyck-reachability problems in a Dyck-reachability algorithms for taint analysis. Our graph G′, which is a relaxation of G that makes G′ bidirected. empirical results show that graph simplification is ben- It turns out that if an edge contributes to an InterDyck-path eficial: running an analysis on a simplified graph (plus in G, the corresponding edge must contribute to a Dyck-path graph simplification) is faster than running the analy- in G′. G′ contains more edges than G, and hence more paths sis on the original graph. With simplified graphs, all than G. Therefore, we can safely delete all non-contributing evaluated algorithms yield more precise results and edges in G′, as well as in G. The problem then becomes one use less memory as well. of identifying non-contributing edges in G′. The remainder of the paper is organized as follows: Sec- One natural question to ask is why we identify the non- tion2 motivates graph simplification. Section3 gives defi- contributing edges in G′ and not in G itself. The reason is nitions and the problem formulation. Section4 presents the that Dyck-reachability in G′ has special properties that allow idea of eliminating non-contributing edges. Section5 gives us to identify non-contributing edges much faster than if it the simplification algorithm. Section6 describes our evalua- were attempted in G. In particular, given an input graph G tion. Section7 discusses related work. Section8 concludes.

781 Fast Graph Simplification for Interleaved Dyck-Reachability PLDI ’20, June 15–20, 2020, London, UK

2 Motivating Example Table 1. Precision improvement by graph simplification. We motivate our graph-simplification method using a formu- Graph InterDyck-Reachable Node Pairs lation of taint analysis as an InterDyck-reachability prob- CFL LCL SPDS  (v , v ),   (v , v ),  lem [9]. Consider the simple Java-like program in Figure1. Original x z {(v , v )} x z ( , ) a c ( , ) For every pair of variables, the taint analysis checks whether va vc va vc Simplified {(va , vc )} {(va , vc )} {(va , vc )} a tainted value can potentially flow between them.

InterDyck-reachability for taint analysis. Figure 1b gives the graph G that encodes the taint-analysis problem Benefits of graph simplification. The graph simplifica- for the program in Figure 1a as an InterDyck-reachability tion is iterative. Figures 2a and 2b give two intermediate problem. In particular, nodes in G represent the variables steps based on the first and second applications, respectively, in the program, and edges represent the assignments and of our graph simplification algorithm (Algorithm2 in Sec- calls/returns. Each edge is labeled with either a bracket or tion 5.2). Figure 2c shows the final graph Gf . Compared with a parenthesis. Specifically, the bracketsi.e. ( , and ) repre- the original graph in Figure 1b, the number of edges in Gf sent field reads/writes, and parentheses (i.e.,J, ) representK L M has been reduced from 11 to 4, and the number of nodes has calls and returns. For a path in G to represent the flow of a been reduced from 11 to 5. It is immediate that any Inter- tainted value, both brackets and parenthesis must be prop- Dyck-reachability algorithm runs faster on Gf because Gf erly matched. Let Db and Dp be two Dyck languages. Due is only half the size of the original graph G. Table1 gives the to the work of Huang et al. [9], the taint analysis can be InterDyck-reachable node pairs. We can see that graph sim- formulated as an InterDyck-reachability problem over G, plification improves the precision of the results from various ⊙ where InterDyck = Db Dp . InterDyck-reachability algorithms. InterDyck-reachability algorithms. The problem of We now briefly discuss the impact of graph simplification precise InterDyck-reachability is undecidable [15]. We briefly on different InterDyck-reachability algorithms. mention three popular over-approximation algorithms for • CFL-reachability algorithm. The CFL-reachability algo- the InterDyck-reachability problem. rithm in Figure 1b computes a false-positive reachable ( ) • CFL-reachability algorithm [14]. The intersection of a pair vx ,vz . This pair is introduced by the path p1 = д f 8 f 8 h and a context-free language is still vx −−→J va −−→J vb −→L vt −−→K ret1 −→M vc −−→K vz . In the context-free. Therefore, we can over-approximate one simplified graph Gf (Figure 2c), the CFL-reachability Dyck language in InterDyck using a regular language. algorithm gives an exact solution. For instance, let Rb be the regular language that over- • SPDS-reachability algorithm. In Figure 1b, the SPDS- approximates Db . The reachability problem could be reachability algorithm computes a non-InterDyck- solved by a CFL-reachability algorithm based on the reachable pair (vx ,vz ). In particular, the field-insensitive CFL = Rb ∩ Dp . д f −−→J −−→J • SPDS-reachability algorithm [19]. The synchronized pushdown system accepts the path p1 = vx va 8 f 8 h pushdown systems (SPDS) also over-approximates the vb −→L vt −−→K ret1 −→M vc −−→K vz and the context- InterDyck-reachability. SPDS encodes calls/returns insensitive pushdown system accepts the path p2 = and field reads/writes as separate CFL-reachability д 11 д 12 −−→J −−→L −−→K −−→M problems, and intersects the results. vx vy vs ret2 vz . After synchro- nization, the SPDS system concludes that v is Inter- • LCL-reachability algorithm [26]. The InterDyck lan- z guages belong to the class of linear conjunctive lan- Dyck-reachable from vx . In Gf (Figure 2c), neither guages (LCLs). Therefore, an LCL-reachability formu- path exists, and consequently the SPDS-reachability lation can precisely encode a InterDyck-reachability algorithm produces a precise solution. • problem. However, the LCL-reachability algorithm LCL-reachability algorithm. The LCL-reachability al- only computes an over-approximating solution. gorithm computes the exact solution in this example because the graph is acyclic. In practice, graph sim- Graph simplification. Recall that our key idea for graph plification allows the LCL-reachability algorithm to simplification is to eliminate “useless” graph edges that are run faster and consume less memory. It also eliminates not contributing to any InterDyck-paths. Our simplifica- some cycles in the graph, and improves the precision tion algorithm is iterative. Intuitively, the eliminated edges of LCL-reachability. Moreover, the cost is not prohibi- identified by a previous iteration can be used to identify addi- tive: in the experiments with LCL-reachability, the cost tional “useless” edges in later iterations. Figure2 provides an of running the simplification algorithm is recouped— overview of the results for the taint-analysis example after often dramatically—for all examples that require more selected iterations of the edge-elimination algorithm. than seven seconds to run in the original graphs.

782 PLDI ’20, June 15–20, 2020, London, UK Yuanbo Li, Qirun Zhang, and Thomas Reps

1 class T { T f; T g; T h; } f 8 f 8 va J vb L vt K ret1 M vc 2 T getF (T vt ){ return vt .f; } 3 T getG (T vs ){ return vs .g; } д h 4 J K д д 5 T v , v , v , v , v , v ;... 11 12 a b c y z w vx J vy L vs K ret2 M vz 6 va .g = vx ; 7 .f = ; vb va 12 8 vc = getF (vb ); L 9 vz = vc .h vw 10 vy .g = vx ; 11 getG (vy ) (b) InterDyck-reachability graph for taint analysis. Bracket edges 12 vz = getG (vw ) (i.e., f and f ) represent field reads and writes w.r.t. a field name f . J K Parenthesis edges (i.e., l and l ) represent calls and returns w.r.t. the (a) Example Java code. line number l of the call-sites.L M Figure 1. Motivating taint-analysis example.

f 8 f 8 f 8 f 8 f 8 f 8 va J vb L vt K ret1 M vc va J vb L vt K ret1 M vc va J vb L vt K ret1 M vc

д д 12 12 vx J vy vs K ret2 M vz vs ret2 M vz

12 12 L L vw vw

(a) Graph G1 after the first iteration. The (b) Graph G2 after the second iteration. (c) Final graph Gf . This iteration elimi- д д 12 field-write edge vx −−→J va, the field-read The field-write edge vx −−→J vy , the field- nates the call edge vw −−→L vs and the re- h 11 д 12 edge vc −−→K vz , the call edge vy −−→L vs read edge vs −−→K ret2, and the correspond- turn edge ret2 −−→M vz . There are no more have been eliminated. ing nodes have been eliminated. edges to eliminate from Gf . Figure 2. Overview of the graph-simplification procedure on the taint-analysis example.

3 Preliminaries S-path from u to v in G. Because S denotes the start symbol This section introduces definitions used in the paper. Sec- of language L, we also say that node v is L-reachable from ⟨ ⟩ tion 3.1 reviews Dyck languages and the graph-reachability u. The L-reachability problem Reach L,G is to compute all framework. Section 3.2 describes InterDyck-reachability. L-reachable node pairs in graph G. Section 3.3 defines the graph-simplification problem. 3.2 InterDyck-Reachability 3.1 Dyck Language and L-Reachability A Dyck language is a context-free language that describes the This paper focuses on the reachability problem related to set of well-balanced-parenthesis strings. Let CFG = (Σ, N, P, S) the interleaved Dyck language (InterDyck language). The be a context-free grammar for the Dyck language with k InterDyck language is a prototypical example of a non- context-free language. Informally, the InterDyck language kinds of parentheses. The CFG has the alphabet Σ = { i, i | L M describes the intersection of multiple Dyck languages, where i ∈ [1..k]}, the nonterminal symbol set N = {Dk }, the start the parentheses in each Dyck language can be arbitrarily symbol set S = {Dk }, and the following productions P: interleaved. For example, consider two Dyck strings “ ” ∈ D → D D | D | ... | D | ε. (1) JK k k k 1 k 1 k k k Db and “ ” ∈ Dp . All of “ ”, “ ”, and “ ” belong L M L M LM JLKM LJMK JKLM Given a context-free grammar CFG = (Σ, N, P, S) and to the InterDyck language based on Db and Dp . t We formally define the class of InterDyck languages a directed graph G = (V, E) with each edge u −→ v in E ⊙ ⊙ ∗ × ∗ → t0 based on an interleaving operation . Formally, : Σ Σ labeled by a terminal t ∈ Σ, we say that a path p = v0 −→ P(Σ∗) is a binary operator that takes two strings and returns t1 t2 tm−1 v1 −→ v2 −→ ... −−−→ vm in G realizes a string R(p) over a set of strings, where P(·) denotes the power-set operator. the alphabet by concatenating the edge labels in the path in The operator ⊙ is inductively defined as follows: for every ∗ order, i.e., R(p) = t0t1t2 ... tm−1. A path in G is an S-path if u ∈ Σ , we have u ⊙ ϵ = ϵ ⊙ u = {u}. Moreover, for every ∗ the realized string can be derived from the start symbol S in α1, α2,u1,u2 ∈ Σ , α1u1 ⊙ α2u2 = {α1w | w ∈ (u1 ⊙ α2u2)} ∪ CFG. Node v is S-reachable from node u iff there exists an {α2w | w ∈ (α1u1 ⊙ u2)}. The interleaving operator can be

783 Fast Graph Simplification for Interleaved Dyck-Reachability PLDI ’20, June 15–20, 2020, London, UK extended to languages with graph G′ by inserting two additional edges s ′ −→L s and t −→M t ′. Ø L1 ⊙ L2 = u1 ⊙ u2. Based on the reduction, we can see that the edge s ′ −→L s is u1 ∈L1,u2 ∈L2 an InterDyck′-contributing edge in G′ iff t is InterDyck- Note that ⊙ is associative—i.e., (L1 ⊙L2)⊙L3 = L1 ⊙(L2 ⊙L3)— reachable from s in G. It is straightforward to verify that the and hence can be extended to k Dyck languages with disjoint reduction is in polynomial time. □ alphabets. If L , L ,..., L are k Dyck languages with dis- 1 2 k To side-step the undecidability of graph simplification, joint alphabets, we define InterDyck := L ⊙ L ⊙ ... ⊙ L . 1 2 k we describe two novel relaxations in Section4. Here we de- The InterDyck-reachability problem is an L-reachability fine the notion of of graph simplification, which problem by restricting L to InterDyck. In particular, correctness is similar to the concept of soundness in static analysis. Definition 3.1 (InterDyck-Reachability). Given an edge- Let ϕ be the set of all InterDyck-contributing edges in G. labeled digraph G = (V, E) and an InterDyck language, Intuitively, a graph-simplification algorithm computes an compute all InterDyck-reachable node pairs in G. over-approximating solution ϕ′ (of “apparently contribut- 3.3 Problem Formulation ing” edges). Therefore, if it determines an edge to be non- InterDyck-contributing, the edge can be safely eliminated Our technique eliminates graph edges to improve solving graph G. To sum up, InterDyck-reachability. To determine the set of edges to eliminate, we formally define the “usefulness” of each edge. Definition 3.5 (Correctness). A graph-simplification algo- rithm is correct if and only if it computes a solution ϕ′ to Definition 3.2 (L-Contributing Edges). Given an instance the contributing-edge problem such that ϕ′ ⊇ ϕ. Reach⟨L,G⟩ of L-reachability, an edge u → v ∈ G is con- tributing to L-reachability iff it is in an L-path in G, i.e., there 4 Identifying Contributing Edges exists a path “p = ... → u → v → ...” in G and R(p) ∈ L. Central to our graph-simplification approach is the idea of Example 3.3. In the motivating example from Section2 eliminating non-InterDyck-contributing edges in G. Due f 8 to Theorem 3.4, identifying non-L-contributing edges is as J (Figure 1b), the contributing edges are va −−→ vb , vb −→L vt , hard as computing the L-reachability problem, and solving f 8 vt −−→K ret1, and ret1 −→M vc that appear in the simplified InterDyck-reachability in general is undecidable [15]. graph Gf in Figure 2c. Our key idea is to cast the undecidable problem (i.e., iden- tifying InterDyck-contributing edges in a digraph G) to an In this paper, we consider the following graph-simplification easier problem (i.e., identifying Dyck-contributing edges in a problem for InterDyck-reachability: bidirected graph G′) that admits an efficient polynomial-time Given an InterDyck-reachability problem instance solution. In particular, we give two forms of relaxation: Reach⟨InterDyck,G⟩, simplify graph G by eliminating • Graph Relaxation. We first relax the general directed non-InterDyck-contributing edges. graph G to a bidirected graph G′ by introducing in- It is interesting to note that there is a correspondence verse edges (Section 4.1); and between the reachability problem in Definition 3.1 and the • Formulation Relaxation. We then relax the InterDyck- ′ graph-simplification problem stated above. Intuitively, based reachability problem in the bidirected graph G to the on Definition 3.2, the problem of deciding all InterDyck- Dyck-reachability problem in a contracted graph (Lx - ′ contributing edges should be as hard as computing the In- graph) derived from G , where Lx represents a Dyck terDyck-reachability. We now establish the undecidabil- language in InterDyck (Section 4.2). ity of computing all InterDyck-contribution edges via a The benefit of our relaxations is that Dyck-reachability can reduction from InterDyck-reachability. Note that Inter- be efficiently solved in O(m logm) time on a bidirected Lx - Dyck-reachability is undecidable even when restricted to graph with m edges and n nodes [25]. The Dyck-reachability the single-source-single-sink variant [15]. algorithm also identifies an anchor-node property in Lx - graph. We utilize the anchor-node property to identify the Theorem 3.4. It is undecidable to compute all InterDyck- non-Dyck-contributing edges in the L -graph (Section 4.3). G x contributing edges in a graph . Finally, if an edge is not a Dyck-contributing edge in the Proof. We show a reduction from the single-source-single- Lx -graph, its corresponding edge in G is a not an InterDyck- sink variant of InterDyck-reachability. Given any single- contributing edge. Graph simplification can be performed source-single-sink InterDyck-reachability problem instance safely by eliminating those edges in G. Reach⟨InterDyck,G⟩, we first introduce a new Dyck lan- Figure3 provides a roadmap to this section: it summa- guage Dp with an alphabet ΣDp = { , } and ΣDp ∩ΣInterDyck = rizes the relations among various lemmas. Combining these ′ ∅. Define InterDyck = InterDyckL M⊙ Dp . Let s and t be the lemmas together, it provides a criterion for identifying—and source and sink in graph G, respectively. We construct a new removing—non-contributing edges in G.

784 PLDI ’20, June 15–20, 2020, London, UK Yuanbo Li, Qirun Zhang, and Thomas Reps

t t u −→ v is an InterDyck- Corollary 4.2 u −→ v is an InterDyck- We extend the discussion about InterDyck strings to contributing edge in contributing edge in the InterDyck-reachability problem on graphs. Consider G G′ an InterDyck-reachability instance Reach⟨InterDyck,G′⟩. Rather than “extracting” an Lx substring from an InterDyck Lemma 4.4 string, we build a contracted graph called the Lx -graph from G′. Intuitively, an L -graph is derived from G by maintaining If t ∈ Σo , v is an anchor t x Lemma 4.7 u −→ v is a Dyck- ′ node in Lx -graph; only L -edges in G , merging the nodes joined by any t-edge, contributing edge in x If t ∈ Σc , u is an anchor node L -graph and deleting any t-edges where t L . in Lx -graph x < x

Definition 4.3 (Lx -Graph). Let Lx be a Dyck language. Given Figure 3. Summary of lemmas used in Section4. Let Lx be a ′ an input graph G , we construct the Lx -graph by contracting Dyck language in InterDyck. The alphabet ΣL = Σo ∪ Σc x t ′ of Lx can be partitioned into Σo and Σc representing open all u −→ v edges in G where t < Lx . and close parentheses, respectively. Let t ∈ ΣLx . Lemma 4.4. Let Lx ∈ InterDyck and t ∈ ΣLx . If an edge t ′ 4.1 Graph Relaxation: From G to G′ u −→ v is InterDyck-contributing inG , it is a Dyck-contributing edge in the L -graph. Given an edge-labeled input graph G = (V, E), we construct a x relaxed graph G′ = (V, E′) by introducing additional inverse 4.3 Identifying Dyck-Contributing Edges ∈ edges. In particular, the node set V G remains unmodified. According to Definition 3.2, identifying Dyck-contributing Let i and i be two matched open and close parentheses in edges requires computing Dyck-reachability. The L -graph L M ′ ∈ ′ x InterDyck. The edge set E G is constructed as follows: is essentially a bidirected graph with each edge labeled by a i i • For each edge u −→L v ∈ E, we insert both edges u −→L v terminal t in a Dyck language Lx . Dyck-reachability on bidi- i rected graphs can be solved in linear-logarithmic time [25]. and v −→M u into E′; i i 4.3.1 Computing Dyck-Reachability in Lx -Graphs In • For each edge u −→M v ∈ E, we insert both edges u −→M v general, Dyck-reachable node pairs (u,v) in a graph G = i and v −→L u into E′. (V, E) can be described as a binary relation Dyck over V ×V . Each edge e ∈ G is mapped to two corresponding edges Specifically, a pair (u,v) ∈ Dyck iff node v is Dyck-reachable ′ ′ in G , denoted as set h(e). Based on the construction of G , from u in G. The relaxed Lx -graph is a bidirected graph. One it follows immediately that InterDyck-reachability in G′ property that is special for bidirected Dyck-reachability is over-approximates InterDyck-reachability in G. that the Dyck relation on a bidirected graph is an equivalence ′ relation [25]: (i) it is reflexive and transitive based on the Lemma 4.1 (Relaxed Reachability in G ). Given two nodes u Dyck grammar given in Eqn. (1) (see Section 3.1); and (ii) it is and v, if v is InterDyck-reachable from u in G, node v must ′ 1 ′ symmetric based on the G construction given in Section 4.1. be InterDyck-reachable from u in G . Due to the equivalence property, we can collapse all nodes Corollary 4.2. If an edge e = u → v is an InterDyck- that belong to the Dyck relation into a single representative ′ contributing edge in G, the corresponding edges in h(e) are node, i.e., node v is Dyck-reachable from u in G iff u and ′ InterDyck-contributing in G . v belong to the same representative node in the Lx -graph. However, we have only the Lx -graph rather than the Dyck 4.2 Formulation Relaxation: From relation itself, so we are not in a position to find and collapse InterDyck-Reachability to Dyck-Reachability all Dyck-reachable nodes. Instead, the collapsing can be done We now describe how to relax the problem of determining In- on-the-fly as Dyck-reachability is computed. terDyck-contributing edges to the problem of determining Following the work of Zhang et al. [25], we summarize the ′ Dyck-contributing edges in the bidirected graph G . algorithm for solving Dyck-reachability in Lx -graphs. The Let InterDyck be InterDyck = L1 ⊙ L2 ⊙ ... ⊙ Ln. principal idea is to collapse the nodes in the Dyck relation. Note that each Li in InterDyck represents a Dyck lan- There are two major steps. guage for all i ∈ [1, n]. Let Lx be a Dyck language and i i • : Identifying two edges u −→L w and v −→L Σ = { , ,..., , }. Given a valid InterDyck string s, Edge merging Lx 1 1 k k w, and merging one edge with the other based on the we couldL indeedM “extract”L M a substring s ′ by concatenating all degrees of u and v; L terminals in s. The resulting string s ′ is always a valid x • : Collapsing two nodes u and v into Dyck string. For example, let s be a valid InterDyck string Node collapsing a single representative node n {u,v } and updating all “ 1 2 2 2 2 1”. The “extracted” substring is “ 1 2 2 1”, which L J L K M M L L M M adjacency edges of node n {u,v } based on u and v. is a valid Dyck string. In general, let be a terminal in ΣLx . L It is straightforward to see that if is in a valid InterDyck 1The Dyck relation in a general digraph is not symmetric. Therefore, it is string, must belong to a valid DyckL (Lx ) string as well. not an in the general case. L

785 Fast Graph Simplification for Interleaved Dyck-Reachability PLDI ’20, June 15–20, 2020, London, UK

Repeating node collapsing and edge merging introduces exist nodes v ′,w in the same Dyck-path such that the additional Dyck-reachable node pairs in the graph. Therefore, subpath between v and v ′ is also a Dyck-path, and we continue the process until there are no newly introduced k v ′ −−→M w ∈ E. By Lemma 4.5, we have rep_node[v] = Dyck-reachable nodes. We refer to such an algorithm as rep_node[v ′]. Based on the bidirectedness, we have procedure Fast-Dyck(). k k u −−→L v,w −−→L v ′ ∈ E. By the definition of anchor-k Lemma 4.5 (Correctness of Fast-Dyck [25]). In a bidirected nodes, we conclude that v is an anchor-k node. Lx -graph, node v is Dyck-reachable from node u iff u and v • If the Dyck-path is generated by the second rule, edge are in the same representative node after running Fast-Dyck() e is involved in a Dyck-path with length less than or on the Lx -graph. equal to 2p, thus v is an anchor-k node. To facilitate further discussion, let Gf denote the resulting Now we prove the backward direction. Suppose thatv is an [·] graph after running Fast-Dyck() onG. We define rep_node anchor-k node in the Lx -graph. According to the definition, as a mapping from a node in G to its representative node in there exists a node v ′ such that rep_node[v] = rep_node[v ′], Gf . For example, if u ∈ V (G) is merged to the representative k and there exists another node w with w , u and edge w −−→L node uf ∈ V (Gf ), we write rep_node[u] = uf . ′ ′ v ∈ E(Lx ). Because rep_node[v] = rep_node[v ], i.e., v and 4.3.2 Anchor Nodes Lemma 4.5 indicates that every Dyck- v ′ are merged by Fast-Dyck(), there exists a Dyck-path ′ path in the Lx -graph is obtained via edge merging in Fast- p = v → v . By utilizing the bidirectedness of the Lx -graph, Dyck(). To identify Dyck-contributing edges in the Lx -graph, k k we have v ′ −−→M w ∈ E(L ), as well. Then, u −−→L v, p, and k x we leverage the anchor node w of the two edges u −−→L w and k k v ′ −−→M w ∈ E(L ) form a new Dyck-path. Therefore, u −−→L v k x v −−→L w merged by Fast-Dyck(). Intuitively, every Dyck- is a contributing edge. □ path computed by Fast-Dyck() is associated with at least one such anchor node. Formally, we have To obtain the main theorem, we revisit Figure3. In gen- eral, if an edge u → v is InterDyck-contributing in G, it Definition 4.6 (Anchor Node). Node w is an anchor-k node must be an InterDyck-contributing edge in relaxed graph ′ in an Lx -graph iff there exist nodes u,v,w in the Lx -graph, G′ (Corollary 4.2). Any InterDyck-contributing edge in G′ such that rep_node[w] = rep_node[w ′] after running Fast- must be a Dyck-contributing edge in an Lx -graph derived k k ′ Dyck(), and the two edges u −−→L w and v −−→L w ′ also exist from G (Lemma 4.4). Finally, the problem of deciding Dyck- in the Lx -graph. contributing edges is equivalent to deciding the correspond- ing anchor-k nodes in the Lx -graph (Lemma 4.7). Putting k Lemma 4.7. An edge u −−→L v is a contributing edge for Dyck- everything together, we have reachability in a bidirected Lx -graph iff v is an anchor-k node Theorem 4.8. Let Lx be a Dyck language in InterDyck and k in the L -graph. Similarly, an edge u −−→M v is a contributing k k x , ∈ Σ . If either an edge u −−→L v or an edge v −−→M u is edge for Dyck-reachability in an L -graph iff u is an anchor-k k k Lx x contributingL M to InterDyck-reachability in G, the node v is an node in the Lx -graph. anchor-k node in the Lx -graph. k L Proof. Without loss of generality, we consider the u −−→ v Corollary 4.9. If a node v is not an anchor-k node in the Lx - case. We prove the forward direction by induction on the k k u −−→L v v −−→M u k graph, both edges and are non-contributing length of the Dyck-path that involves the edge u −−→L v. edges for InterDyck-reachability in G.

k Base case. The contributing edge u −−→L v is involved in Thus, the graph-simplification algorithm can remove from a Dyck-path of length 2. There must exist another node w G all edges that meet the criterion given in Corollary 4.9. k such that v −−→M w. Because Lx -graph is bidirected, we have 5 Graph-Simplification Algorithm k w −−→L v ∈ E. Therefore, v is an anchor-k node. This section discusses the graph-simplification algorithm. Section 5.1 describes the key steps in the algorithms. Sec- Inductive step. Assume that the lemma holds for con- tion 5.2 presents the main algorithm. Section 5.3 discusses the tributing edges involved in a Dyck-path with length less than correctness and complexity of the simplification algorithm. k or equal to 2p. Suppose that a contributing edge e = u −−→L v 5.1 Key Steps is involved in a Dyck-path of length 2p + 2 and not involved in any Dyck-path with length less or equal to 2p. Consider There are two key steps in the graph simplification: con- the Dyck grammar rule S → iS i | SS. structing the Lx -graphs and identifying anchor-k nodes. L M • If the Dyck-path is generated based on the first rule, 5.1.1 Lx -Graph Construction Consider an interleaved edge e is the first edge in the Dyck-path. There must Dyck language InterDyck = L1 ⊙ ... ⊙ Ln. Given a relaxed

786 PLDI ’20, June 15–20, 2020, London, UK Yuanbo Li, Qirun Zhang, and Thomas Reps

[ ] [ ] Procedure 1: GetLxGraph(G, Lx ) rep_node u and rep_node v , preserving only the Lx -edges. Input : Edge-labeled relaxed bidirected graph G = (V , E), a Dyck language Lines 18-23 move all its adjacent edges into the other node for Lx the smaller-degree node of rep_node[u] and rep_node[v]. Output: An Lx -graph Gx 1 rep_node ← a disjoint-set of size |V |. l 5.1.2 Anchor-k Node Identification The second step in 2 foreach u −→ v ∈ E do 3 if l < ΣLx then graph simplification is anchor-k node identification. We mod- l 4 E ← E \{u −→ v } ify the Fast-Dyck() algorithm by Zhang et al. [25] to collect 5 if rep_node[u] == rep_node[v] then continue; the anchor-k node information. We denote the modified 6 rep_node.union(u, v) 7 if degree(rep_node[u]) > degree(rep_node[v]) then version as Fast-Dyck-Modified(). Recall that Fast-Dyck() ← [ ] ← [ ] 8 wb rep_node u , ws rep_node v tracks the number of incoming edges with the same edge 9 else 10 wb ← rep_node[v], ws ← rep_node[u] label for each node in the graph. If there are two incom- 11 V ← V \{ws }. k k 12 Let internal_edges be the edges between ws and wb ing edges u −→ v and w −→ v with the same edge label k, 13 foreach e ∈ internal_edges do 14 E ← E \{e } and k is an open-parenthesis, then the Fast-Dyck algorithm 15 l ← getlabel(e) performs a node-collapsing between node u and w. 16 if l ∈ ΣLx then l Fast-Dyck-Modified() leverages the node-collapsing pro- 17 E ← E ∪ {w −→ w } b b cess in Fast-Dyck() to mark anchor-k nodes. In particular, 18 foreach x ∈ In[ws ] with x , wb do k l −→ 19 foreach edge e = x −→ ws do when Fast-Dyck() detects two incoming edges u v and l k 20 E ← E \{e } ∪ {x −→ wb } w −→ v with an open-parenthesis edge label k. Fast-Dyck-

21 foreach x ∈ Out[ws ] with x , wb do Modified() marks v as an anchor-k node according to the l 22 foreach edge e = ws −→ x do anchor-k node definition. Through the marking process, we l 23 E ← E \{e } ∪ {wb −→ x } collect all information to recover anchor-k nodes for the Lx -graph. For any node v ∈ V (Lx ), it is an anchor-k node iff 24 return Gx = (V , E) rep_node(v) is marked as an anchor-k node by Fast-Dyck- Modified(). Finally, Fast-Dyck-Modified() returns the set graphG′, to identify the anchor-k nodes, our algorithm needs of collected anchor-k nodes in the Lx graph. Notice that the original Fast-Dyck algorithm runs in to construct an Li -graph for each i ∈ {1, ··· , n}. An Li -graph is essentially a contracted graph of G′. In particular, let t ′ be O(m logm) time. After the modification, the extra running time for each node-merging is O(1), and thus the complexity a letter in ΣInterDyck \ ΣLi . The Li -graph is constructed by t ′ of Fast-Dyck-Modified is still O(m logm) contracting all u −→ v edges in G′. We describe the Lx -graph construction of the motivat- Example 5.1. We continue our example using the graph ing example in Figure 4a. Recall that we have InterDyck = shown in Figure 4b. We apply Fast-Dyck-Modified() on Db ⊙ Dp . The procedure iterates through non-Lp edges. In this Lp -graph. Figure 4c gives the resulting graph. There д exist two 12-edges pointing to node {vs, ret2} and two 8- Figure 4a, the first non-Lp edge is vx −−→J va because д ∈ edges pointingL to the node {vt , ret1}. Therefore, Fast-Dyck-L д J J Modified() collects the information that nodes v , are ΣInterDyck \ ΣLp . The edge vx −−→ va is contracted by col- t ret1 lapsing nodes vx and va. Nodes vx and va form a represen- anchor-8 nodes and vs, ret2 are anchor-12 nodes. д tative node {vx ,va } and the edge vx −−→J va is removed. We 5.2 The Simplification Algorithm continue contracting non-Lp edges until there are no more Algorithm2 gives the graph-simplification algorithm. In lines non-Lp edges. Figure 4b depicts the final Lp -graph. 1-2, contrib_edges is initialized to an empty set. It contains To facilitate node collapsing, we adopt the standard disjoint- the set of potential InterDyck-contributing edges in the orig- set data structure. Procedure1 gives the Lx -graph-construction inal graph G when the algorithm terminates. We then obtain algorithm. Line 1 initializes the disjoint-set rep_node. Given the relaxed graph G′ defined in Section 4.1. The loop iterates a node u, rep_node[u] returns the representative node of u over each Dyck language Li in InterDyck = L1 ⊙ · · · ⊙ Ln in the graph. The union method always joins the smaller- (lines 3-14). It first builds the Li -graph based on Procedure1. degree vertex to the larger-degree vertex. Lines 2-3 show that After we construct the Li -graph, the algorithm invokes Fast- the procedure iterates over all non-Lx edges in the graph. Dyck-Modified() described in Section 5.1.2 to collect anchor- Lines 4-11 perform node collapsing. To collapse the edge k nodes in the Li -graph. The variable anchor_nodes stores l e = u −→ v, we merge nodes u,v to a representative set. Line a set of anchor nodes of the form v_l, where v is the node 5 shows that if u,v have been merged together, we skip the in the Li -graph, l is the open-parenthesis edge label for the rest of the loop body. Lines 6-11 merge the smaller-degree corresponding anchor. In lines 6-14, for each anchor node, node between rep_node[u] and rep_node[v] into the larger we add its corresponding contributing edges to the set con- one. Lines 12-17 collapse all the non-Lx edges between nodes trib_edges. After collecting the contributing edges for each

787 Fast Graph Simplification for Interleaved Dyck-Reachability PLDI ’20, June 15–20, 2020, London, UK

f 8 f 8 vt , ret1 va J vb L vt K ret1 M vc vt , ret1 8 8 8 L L L va,b,x,y , v vc,z,vw 11 д h a,b,x,y vc,z 11 12 L J K L L 12 v , д 11 д 12 L s ret2 vx J vy L vs K ret2 M vz vs , ret2

12 12 L L

vw vw (c) Identifying anchor-k nodes based on The graph to be simplified. (b) Constructed L -graph. (a) p Fast-Dyck-Modified. Figure 4. Collection of anchor-k node information. Figure 4a repeats the graph of our motivating example from Figure 1b. Graph simplification involves repeated application of Algorithm2. Figure 4b illustrates the Lp -graph construction result based on Section 5.1.1 during the first application of the algorithm. Figure 4c gives the corresponding graph after running Fast-Dyck-Modified described in Section 5.1.2. In Figures 4b and 4c, each , edge has a corresponding reverse , edge. We omit reverse edges for brevity. L J M K

f 8 f 8 Example 5.2. We illustrate how Algorithm2 eliminates va J vb L vt K ret1 M vc non-contributing edges in the original graph of our moti- anchor-12: v , ret2 s vating example, i.e., Figure 1b and Figure 4a. After running anchor-8: vt , ret1 д д 12 vx J vy vs K ret2 M vz the L -graph construction (Procedure1) and Fast-Dyck- anchor-f : vb , vt x

anchor-д: vy , vs , vw 12 Modified for both the parenthesis language Lp and the L bracket language Lb , Figure 5a gives the anchor-k nodes vw identified by the first application of Algorithm2. It identifies Resulting graph after remov- (a) Collected anchor-k (b) the non-contributing edges based on Lemma 4.7. For instance, node information. ing non-contributing edges. provided that vs is an anchor-12 nodes, all the incoming 12 edges to vs and outgoing 12 from vs edges are contributingL Figure 5. Elimination of non-contributing edges. Figure 5a edges. By removing the non-contributingM ones, we get the lists all anchor-k nodes identified in the first application of resulting graph Figure 5b. By applying Algorithm2 two more Algorithm2. Figure 5b gives the simplified graph after the times, we obtain the final graph, shown in Figure 2c. first application. It is the same as Figure 2a. 5.3 Correctness and Complexity We establish the correctness of Algorithm2 and analyze its Algorithm 2: The graph simplification iteration. complexity. In lines 6-10, the algorithm collects all the in- Input : Edge-labeled directed graph G = (V , E), an InterDyck language L = L1 ⊙ · · · ⊙ Ln ; coming open-parenthesis anchor-labeled edges and outgoing Output: A new edge-labeled directed graph Gf close-parenthesis anchor-labeled edges. Due to Theorem 4.8, 1 contrib_edges ← œ all contributing edges are in contrib_edges. Then it suffices ′ 2 G ← RelaxedGraph(G) to show that the derived Lx -graph in line 4 is correct and 3 for i ← 1 to n do ′′ ′ 4 Gi ← GetLxGraph(G , Li ) the Fast-Dyck-Modified() collects all anchor-k nodes. ′′ 5 anchor_nodes ← Fast-Dyck-Modified(Gi ) 6 foreach v_l ∈ anchor_nodes do Lemma 5.3. Fast-Dyck-Modified() collects all anchor-k nodes 7 foreach x ∈ In[v] do t for each Li -graph. 8 foreach edge e = x −→ v do 9 if t == l then Proof. Fast-Dyck-Modified() identifies two incoming edges 10 contrib_edges ← contrib_edges ∪{e } incident on the same node with the same open-parenthesis

11 foreach x ∈ Out[v] do label, performs a node collapsing, and generates an anchor-k t 12 foreach edge e = v −→ x do node. For each anchor-k node generated, it only removes one 13 if t == l then edge in the graph. Thus, the previous two incoming edges ← ∪{ } 14 contrib_edges contrib_edges e incident to the same node become one. Suppose an anchor-k node has not been collected by Fast-Dyck-Modified(), there 15 Gf ← (V ,contrib_edges ) will be two incoming edges to the anchor-k node with the 16 return Gf same open-parenthesis label. It contradicts the fact that Fast- Dyck-Modified() guarantees to find any pair of incoming edges with same open-parenthesis label [25]. □ Li -graph, contrib_edges, the union is returned as the new edge set of the graph. It serves as the input for the next Procedure1 also correctly derives the Lx -graph. Specifi- iteration (lines 15-16). The graph-simplification algorithm cally, lines 2 and 3 guarantee that we iterate over all edges l terminates if there are no edges removed in one iteration. with non-Lx labels. For each u −→ v, line 4 contracts this

788 PLDI ’20, June 15–20, 2020, London, UK Yuanbo Li, Qirun Zhang, and Thomas Reps non-Lx edge based on Definition 4.3. Lines 6-11 merge the an access on field f . Therefore, the analysis is based on In- smaller-degree node to the larger-degree one. Lines 13-23 terDyck-reachability where InterDyck = Db ⊙ Dp . move edges incident to smaller-degree node to the other. We performed taint analysis on both the original and sim- Based on Section 5.1.1 the disjoint-set rep_node also joins plified graphs. The set of subject Android applications in- the smaller-degree node to the larger-degree one. All infor- cludes the top 30 free apps, as well as some popular apps mation is updated correctly. in the Editor’s Choice list as of January 2015. We extracted the taint-analysis graphs using the tools from the work of Theorem 5.4. For an InterDyck language InterDyck = Huang et al. [9]. Note that the original taint analysis [9] is L1 ⊙ · · · ⊙ Ln and a graph G, Algorithm2 computes an over- demand-driven, while ours is exhaustive. The tool success- ′ approximation ϕ of all InterDyck-contributing edges in G, fully generates graphs from 95 out of the 150 Google store ′ i.e., ϕ ⊇ ϕ where ϕ denotes the exact solution. apps provided in the implementation.2 The 95 obtained taint-analysis graphs have various sizes, Next, we analyze the complexity of each iteration of the ranging from a few hundred nodes to more than 100,000 simplification. In Algorithm2, the loop body in line 3-14 nodes. On average, each graph consists of 40,129 nodes and contains two procedure calls: GetLxGraph and Fast-Dyck- 147,009 edges. These taint-analysis graphs also contain more Modified(). Given a graph with m edges, the time complex- call/return edges than field read/write edges. On average, ity of Fast-Dyck-Modified() is O(m logm) [25]. When the each taint-analysis graph has 21,559 different calls/returns GetLxGraph procedure contracts an edge, it always moves all and 2,250 different field accesses. smaller-degree node’s incident edges to the other. Therefore, for any single edge u → v during the edge moving, either the degree of u is doubled or the degree of v is doubled. Thus InterDyck-Reachability Algorithms. We used the fol- the total number of edge moving is at most 2 logm. The time lowing three InterDyck-reachability algorithms as the graph- complexity for GetLxGraph is O(m log m). The complexity reachability engine for variants of the taint analysis: of the loop body in lines 3-14 in Algorithm2 is O(m log m). The overall algorithm iterates over all k Dyck languages in InterDyck. k is usually considered as a constant. Therefore, the time complexity of Algorithm2 is O(m logm). • CFL-reachability algorithm [14]. This method is the tra- 6 Evaluation ditional over-approximation for InterDyck-reachability. To approximate Db ⊙ Dp , we used a regular language We implemented the graph-simplification algorithm, and— Rp to over-approximate Dp . The language Db ⊙ Rp is using three different InterDyck-reachability solvers—applied still context-free, so one can apply the CFL-reachability it to the problem of a context- and field-sensitive taint anal- algorithm to solve the (Db ⊙ Rp )-reachability problem. ysis for Android applications [9]. The experiments were • SPDS-reachability algorithm [19]. In our client analysis, performed on a 16GB memory machine with an Intel Xeon the synchronized pushdown system (SPDS) separates 2.10GHz CPU, running Ubuntu 18.04. the analysis into a context-insensitive, field-sensitive We compared three InterDyck-reachability algorithms analysis and a context-sensitive, field-insensitive anal- on both the original and simplified graphs. Our evaluation ysis. Each problem can be effectively formulated as a focused on addressing the following two research questions: CFL-reachability problem. The SPDS algorithm solves • RQ1: How does graph size influence the size reduction them independently and intersects the results. and the efficiency of the simplification algorithm? • LCL-reachability algorithm [26]. The linear-conjunctive- • RQ2: How much can graph simplification improve languages (LCLs) properly contain the InterDyck the performance and precision of various InterDyck- languages. Unlike CFL- and SPDS-reachability, LCL- reachability algorithms? reachability precisely models InterDyck-reachability. The LCL-reachability algorithm, in contrast, is an over- 6.1 Experimental Setup approximating algorithm, which means that it may The Client Analysis. The experiment was conducted return a superset of the exact result, i.e., there may be with a context- and field-sensitive taint analysis for Android pairs of nodes that are connected by an accepting-state applications [9], applied to 95 Google App-store applica- summary edge that are not InterDyck-reachable. tions. Context-sensitivity is captured by a Dyck language Dp , where each open parenthesis i represents a method call, and a matching close parenthesis Li represents a correspond- M 3 ing return. The analysis uses another Dyck language Db to We implemented all algorithms in C++. All experiments encode field sensitivity, where an open bracket f represents were repeated three times, and we report the average of the J an assignment to field f and a close bracket f represents three trials to improve the reliability of the collected results. K

789 Fast Graph Simplification for Interleaved Dyck-Reachability PLDI ’20, June 15–20, 2020, London, UK

| 50

) 1 G ( E | / | 0.8 40 ) f G ( E | 0.6 30

0.4 20

0.2 Simplification time (s) 10

·105 5 Edge reduction ratio 0 0 ·10 0 1 2 3 4 0 1 2 3 4 Number of edges in the graph Number of edges in the graph (a) Plot of the graph-reduction ratio. In general, the reduction (b) The relation between graph size and graph-simplification ratio is lower than 0.8. A lower ratio indicates there are fewer time. The data shows that, in practice, the running time of the edges remaining in the simplified graph. graph-simplification algorithm is linear in the size of the graph. Figure 6. Amount of graph reduction, and running time of the graph-simplification algorithm as a function of graph size.

6.2 RQ1: Graph-Simplification Efficiency Table 2. Timeout statistics in the experiments. Our graph-simplification algorithm reduces an original graph Finished Finished Timed out ′ ′ ′ |E(Gf )| on G,G only on G on G,G G to Gf . We define the graph-reduction ratio as r = | ( )| . E G LCL 41 13 41 Figure 6a presents the simplification results w.r.t. ratio r. On SPDS 9 5 81 average, r = 0.743, indicating that the other 25.7% edges have CFL 5 2 88 been removed from the original graph G by simplification. As graph size increases, Figure 6a indicates that there is a very slight trend for ratio r to increase. However, for 6.3 RQ2: Performance and Precision Improvement most graphs, the reduction ratio is below 0.8, even for large of InterDyck-Reachability Algorithms graphs with around 400K edges. Thus, simplification can Performance. We set a time budget of 300 seconds for consistently remove a significant number of edges. all InterDyck-reachability algorithms. Table2 presents the In terms of the running time, the graph simplification is timeout information of different algorithms. Typically, the much faster than the InterDyck-reachability algorithms CFL-reachability algorithm runs out of time for graphs with in most cases. The only exception is when the graph size more than 1K edges. The SPDS-reachability algorithm runs is very small (i.e., when the number of edges is less than out of time for graphs with more than 5K edges. The LCL- 200), the simplification procedure can take time comparable reachability algorithm usually finishes processing graphs to the InterDyck-reachability algorithms. Figure 6b gives with fewer than 70K edges within the time budget. the relationship between graph size and running time of We use the following metric to the measure performance the graph-simplification algorithm. It demostrates that the improvement: given the running time T on original graph G, asymptotic running time of the algorithm is close to linear and the running time Ts on the simplified graph, the perfor- in the size of the (original) graph. Ts mance ratio is defined as T . If an InterDyck-reachability Summary. On average, after the graph-simplification al- algorithm finishes on both the original and the simplified gorithm there are 74.3% of the edges remains in the simplified graphs, we collect the performance ratio; the data is plot- graphs. The algorithm can consistently remove more than ted in Figure 7a. The plot shows that for the majority of 20% of the edges in large graphs. The running time of the sim- graphs, graph simplification reduces the running time of plification algorithm is almost linear in practice. The linear- InterDyck-reachability algorithms to less than 40% of the time performance allows it to serve as a pre-processing step original running time. On all graphs the running time is for an InterDyck-reachability algorithm. reduced by more than 20%.

2 In practice, it is also necessary to take graph-simplification Both the implementation and the subject apps are publicly available at ′ https://github.com/proganalysis/type-inference. time into account. We define T as the time needed for both 3The implementation is available at https://www.cc.gatech.edu/~qrzhang/ graph simplification and running the InterDyck-reachability projects/interdyck/interdyck.html. algorithm on the simplified graphs. Figure 8a presents the

790 PLDI ’20, June 15–20, 2020, London, UK Yuanbo Li, Qirun Zhang, and Thomas Reps

LCL 1 SPDS LCL CFL LCL 1 SPDS 0.8 SPDS CFL CFL /M

0.8 s M

/P 0.8 s

0.6 P /T s

T 0.6 0.6

0.4 0.4

Precision ratio 0.4

0.2 0.2 Performance ratio

Memory consumption ratio 0.2

0 0 101 102 103 104 105 101 102 103 104 105 1 2 3 4 5 Number of edges in the original graph 10 10 10 10 10 Number of edges in the original graph Number of edges in the original graph (b) Precision improvements. For about two- (c) Memory-usage improvements. For most (a) Performance improvements. The running thirds of the benchmark programs, the In- of the graphs, the simplification process re- time of all benchmark programs is reduced to terDyck algorithms find less than 80% of duces the memory consumption of the In- less than 80% after the graph simplification. the reachable pairs in simplified graphs: the terDyck-reachability algorithms by half. discarded pairs are false positives. Figure 7. The effect of the graph simplification on InterDyck-reachability algorithms. Running on a simplified graph helps improve the performance, precision, and memory usage of the algorithm. The x-axis represents the number of edges in the original graphs, and the y-axis indicates the improvement ratio. Each plot shows the improvement on one benchmark program. A lower y-axis value indicates a better degree of improvement from graph simplification.

5 5 5

4 4 4 0 0 0 T/T T/T T/T 3 3 3

2 2 2 Time Ratio Time Ratio Time Ratio 1 1 1

0 0 0 0 20 40 60 80 10−2 10−1 100 101 102 10−1 100 101 102 LCL running time in original graph (s) SPDS running time in original graph (s) CFL running time in original graph (s) (a) Improvement on LCL-reachability. (b) Improvement on SPDS-reachability. (c) Improvement on CFL-reachability. Figure 8. Performance comparison. In these comparisons, we compare two approaches to solve the InterDyck-reachability problem: directly running an InterDyck algorithm on the original graph (with time T ), versus performing graph simplification and then running the same InterDyck algorithm on the simplified graph (with time T ′). The value on the y-axis indicates the speedup due to graph simplification. y = 1 means that the time needed to run InterDyck algorithm on the original graph is the same as first performing graph simplification and then running the algorithm on the simplified graph.

y LCL-reachability result. From the plot, we see that, as the Precision. We define the precision ratio as x where x running time of the LCL-reachability algorithm increases, and y denote the number of InterDyck-reachable pairs ob- the cost of graph simplification becomes less and less sig- tained from running the InterDyck-reachability algorithm nificant. If the LCL-reachability algorithm completes onthe on the original and simplified graphs, respectively. Figure 7b original graph within 7 seconds, it is not worth performing gives information about the precision improvements. It is graph simplification. However, for larger graphs, the time interesting to note that the precision improvement is quite graph simplification is recouped. The observation is consis- significant. On average, graph simplification helps Inter- tent for both SPDS- and CFL-reachability algorithm w.r.t. Dyck-reachability to generate a solution with only 64.92% Figure 8b and Figure 8c. Overall, running graph simplifica- pairs of the solution for the original graph (discarded ones tion and performing an InterDyck-reachability algorithm are false positives). For the LCL-reachability algorithm, there on the simplified graph is 2.18× faster than running the same are three graphs where graph simplification helps to detect algorithm on the original graph. more than 80% of pairs as false positives in the original solu- tion. Moreover, there is a trend that with increasing number

791 Fast Graph Simplification for Interleaved Dyck-Reachability PLDI ’20, June 15–20, 2020, London, UK of edges in the graph, the precision improvement from graph over-approximates InterDyck-reachability [4, 12]. The re- simplification is likely to be more significant. cent work by Späth et al. [19] over-approximates InterDyck- reachability using synchronized pushdown systems. Zhang Memory Consumption. As mentioned in Section 6.2, the average amount of edges in the simplified graph is 74.3% and Su [26] propose linear-conjunctive-language reacha- of the original graphs. However, Figure 7c shows that, in bility to precisely formulate InterDyck-reachability, and most cases, the InterDyck-reachability algorithms consume provide an over-approximating algorithm for solving the around half the memory when running on the simplified LCL-reachability problem. graph. On average, running InterDyck-reachability algo- The proposed graph-simplification algorithm is based on rithms on simplified graphs uses only 57.37% memory. the Fast-Dyck algorithm proposed by Zhang et al. [25]. Chat- terjee et al. [1] give an O(m)-time algorithm for solving bidi- 6.4 Discussion rected Dyck-reachability, which improves the O(m log m) Our graph-simplification algorithm is based on the bidirected bound by Zhang et al. [25]. In practice, many techniques Fast-Dyck algorithm on G′. A natural extension is to utilize have been proposed to improve CFL-reachability-based anal- a general Dyck-reachability algorithm on G to identify the yses [2, 23, 27]. Our work focuses on simplifying the in- InterDyck-contributing edges in G. However, the Dyck- put graphs for InterDyck-reachability, and is applicable to relation on general digraphs is not an equivalence relation. any existing sound InterDyck-reachability-based analysis. We have to resort to a more expensive (sub)cubic-time Dyck- Graph simplification techniques are also studied in other reachability algorithm to identify Dyck-contributing edges program-analysis applications. In pointer analysis, various in G. In Table2, we have seen that the SPDS-reachability techniques [5, 6, 17] are applied to reduce the size of the algorithm based-on Dyck-reachability does not scale well in constraint graphs for inclusion-based analysis. For example, practice. Moreover, a recent result by Chatterjee et al. [1] has the work by Hardekopf and Calvin [6] focuses on deriving improved Fast-Dyck from O(m logm) time to O(m) time. It pointer-equivalence and location-equivalence relationships is interesting to apply their algorithm to further improve the between variables. They simplify the graphs by collapsing running time of graph simplification. On the other hand, our the equivalent nodes. Our graph simplification focuses on evaluation (Figure 6b) shows that the Fast-Dyck adopted eliminating irrelevant edges. in our current algorithm, though being in O(m logm) time, scales almost linearly in practice. 8 Conclusion In the work of Späth et al. [19], it has been observed that over-approximation for InterDyck reachability almost This paper has proposed a new graph-simplification algo- never happens in practice. Their conclusion is supported rithm for InterDyck-reachability. Our key insight is to re- by the empirical study of a typestate analysis for relatively duce the graph by eliminating graph edges that do not con- small graphs. In our experiments, we observed significant tribute to any InterDyck-paths. We have applied the sim- over-approximation of taint analysis. In general, the degree plification algorithm to context- and field-sensitive taint of over-approximation depends on what kind of information analysis for Android. The experimental results demonstrate the client analysis is computing. that graph simplification can significantly speed up existing We implemented the LCL- and CFL-reachability algorithms InterDyck-reachability algorithms. Moreover, graph simpli- given in the original references. The original SPDS paper fication improves both the precision and the memory-usage presents a demand-driven reachability algorithm which also of the client analysis. accepts the prefix of the InterDyck languages, i.e., the algo- rithm accepts words with unmatched open parentheses/brack- Acknowledgments ets, such as “ 1 1 1”. Our SPDS implementation is restricted to only the InterDyckL J M language and always computes the We would like to thank our shepherd and the anonymous all-pairs InterDyck-reachability results. PLDI reviewers for valuable feedback on earlier drafts of this paper, which helped improve its presentation. We thank 7 Related Work Gabriel Eiseman for helpful comments. This work was sup- Many program-analysis problems can be formulated as an ported in part by the United States National Science Foun- InterDyck-reachability problem [3, 18, 20–22, 24]. However, dation (NSF) under Grant No. 1917924; by a gift from Ra- solving InterDyck-reachability is undecidable [15]. Existing jiv and Ritu Batra; by ONR under grants N00014-17-1-2889 approaches use different techniques to over-approximate the and N00014-19-1-2318. The U.S. Government is authorized to exact solution for InterDyck-reachability problems. Tradi- reproduce and distribute reprints for Governmental purposes tional approaches include over-approximating some Dyck notwithstanding any copyright notation thereon. Opinions, languages in InterDyck using regular languages [7, 8]. Access- findings, conclusions, or recommendations expressed in this path-based analysis approximates field-sensitivity by restrict- publication are those of the authors, and do not necessarily ing the access-paths with a bounded length, and thus also reflect the views of the sponsoring agencies.

792 PLDI ’20, June 15–20, 2020, London, UK Yuanbo Li, Qirun Zhang, and Thomas Reps

References 416–430. [14] Thomas W. Reps. 1998. Program analysis via graph reachability. Infor- [1] Krishnendu Chatterjee, Bhavya Choudhary, and Andreas Pavlogian- mation & Software Technology 40, 11-12 (1998), 701–726. nis. 2018. Optimal Dyck reachability for data-dependence and alias [15] Thomas W. Reps. 2000. Undecidability of context-sensitive data- analysis. PACMPL 2, POPL (2018), 30:1–30:30. dependence analysis. ACM Trans. Program. Lang. Syst. 22, 1 (2000), [2] Swarat Chaudhuri. 2008. Subcubic algorithms for recursive state 162–186. machines. In POPL. 159–169. [16] Thomas W. Reps, Susan Horwitz, and Shmuel Sagiv. 1995. Precise [3] Ben-Chung Cheng and Wen-mei W. Hwu. 2000. Modular interproce- Interprocedural Dataflow Analysis via Graph Reachability. In POPL. dural pointer analysis using access paths: design, implementation, and 49–61. evaluation. In Proceedings of the 2000 ACM SIGPLAN Conference on [17] Atanas Rountev and Satish Chandra. 2000. Off-line variable substi- Programming Language Design and Implementation (PLDI), Vancouver, tution for scaling points-to analysis. In Proceedings of the 2000 ACM Britith Columbia, Canada, June 18-21, 2000. 57–69. SIGPLAN Conference on Programming Language Design and Implemen- [4] Arnab De and Deepak D’Souza. 2012. Scalable Flow-Sensitive Pointer tation (PLDI), Vancouver, Britith Columbia, Canada, June 18-21, 2000, Analysis for Java with Strong Updates. In ECOOP 2012 - Object-Oriented Monica S. Lam (Ed.). ACM, 47–56. Programming - 26th European Conference, Beijing, China, June 11-16, [18] Micha Sharir and Amir Pnueli. 1981. Two approaches to interpro- 2012. Proceedings. 665–687. cedural data flow analysis. In Program Flow Analysis: Theory and [5] Manuel Fähndrich, Jeffrey S. Foster, Zhendong Su, and Alexander Applications, Steven S. Muchnick and Neil D. Jones (Eds.). Prentice- Aiken. 1998. Partial Online Cycle Elimination in Inclusion Constraint Hall, 189–234. Graphs. In Proceedings of the ACM SIGPLAN ’98 Conference on Program- [19] Johannes Späth, Karim Ali, and Eric Bodden. 2019. Context-, flow-, ming Language Design and Implementation (PLDI), Montreal, Canada, and field-sensitive data-flow analysis using synchronized Pushdown June 17-19, 1998. 85–96. systems. PACMPL 3, POPL (2019), 48:1–48:29. [6] Ben Hardekopf and Calvin Lin. 2007. Exploiting Pointer and Location [20] Manu Sridharan and Rastislav Bodík. 2006. Refinement-based context- Equivalence to Optimize Pointer Analysis. In Static Analysis, 14th sensitive points-to analysis for Java. In PLDI. 387–400. International Symposium, SAS 2007, Kongens Lyngby, Denmark, August [21] Manu Sridharan, Denis Gopan, Lexin Shan, and Rastislav Bodík. 2005. 22-24, 2007, Proceedings. 265–280. Demand-driven points-to analysis for Java. In OOPSLA. 59–76. [7] M. A. Harrison. 1978. Introduction to Formal Language Theory. Addison- [22] Hao Tang, Xiaoyin Wang, Lingming Zhang, Bing Xie, Lu Zhang, and Wesley Longman Publishing Co., Inc., Boston, MA, USA. Hong Mei. 2015. Summary-Based Context-Sensitive Data-Dependence [8] John E. Hopcroft and Jeffrey D. Ullman. 1979. Introduction to Automata Analysis in Presence of Callbacks. In POPL. 83–95. Theory, Languages and Computation. Addison-Wesley. [23] Guoqing Xu, Atanas Rountev, and Manu Sridharan. 2009. Scaling [9] Wei Huang, Yao Dong, Ana Milanova, and Julian Dolby. 2015. Scalable CFL-Reachability-Based Points-To Analysis Using Context-Sensitive and precise taint analysis for Android. In ISSTA. 106–117. Must-Not-Alias Analysis. In ECOOP. 98–122. [10] Vineet Kahlon. 2009. Boundedness vs. Unboundedness of Lock Chains: [24] Dacong Yan, Guoqing (Harry) Xu, and Atanas Rountev. 2011. Demand- Characterizing Decidability of Pairwise CFL-Reachability for Threads driven context-sensitive alias analysis for Java. In ISSTA. 155–165. Communicating via Locks. In LICS. 27–36. [25] Qirun Zhang, Michael R. Lyu, Hao Yuan, and Zhendong Su. 2013. [11] John Kodumal and Alexander Aiken. 2004. The set constraint/CFL Fast algorithms for Dyck-CFL-reachability with applications to alias reachability connection in practice. In PLDI. 207–218. analysis. In PLDI. 435–446. [12] Johannes Lerch, Johannes Späth, Eric Bodden, and Mira Mezini. 2015. [26] Qirun Zhang and Zhendong Su. 2017. Context-sensitive data- Access-Path Abstraction: Scaling Field-Sensitive Data-Flow Analysis dependence analysis via linear conjunctive language reachability. In with Unbounded Access Paths (T). In 30th IEEE/ACM International POPL. 344–358. Conference on Automated Software Engineering, ASE 2015, Lincoln, NE, [27] Qirun Zhang, Xiao Xiao, Charles Zhang, Hao Yuan, and Zhendong Su. USA, November 9-13, 2015. 619–629. 2014. Efficient subcubic alias analysis for C.In OOPSLA. 829–845. [13] G. Ramalingam. 2000. Context-sensitive synchronization-sensitive [28] Xin Zheng and Radu Rugina. 2008. Demand-driven alias analysis for analysis is undecidable. ACM Trans. Program. Lang. Syst. 22, 2 (2000), C. In POPL. 197–208.

793