Fast Graph Simplification for Interleaved Dyck-Reachability

Fast Graph Simplification for Interleaved Dyck-Reachability Yuanbo Li Qirun Zhang Thomas Reps Georgia Institute of Technology Georgia Institute of Technology University of Wisconsin-Madison USA USA USA [email protected] [email protected] [email protected] Abstract CCS Concepts: • Mathematics of computing ! Graph ! Many program-analysis problems can be formulated as graph- algorithms; • Theory of computation Program anal- reachability problems. Interleaved Dyck language reachabil- ysis. ity (InterDyck-reachability) is a fundamental framework to Keywords: CFL-Reachability, Static Analysis express a wide variety of program-analysis problems over edge-labeled graphs. The InterDyck language represents ACM Reference Format: an intersection of multiple matched-parenthesis languages Yuanbo Li, Qirun Zhang, and Thomas Reps. 2020. Fast Graph Simpli- fication for Interleaved Dyck-Reachability. In Proceedings of the 41st (i.e., Dyck languages). In practice, program analyses typically ACM SIGPLAN International Conference on Programming Language leverage one Dyck language to achieve context-sensitivity, Design and Implementation (PLDI ’20), June 15–20, 2020, London, and other Dyck languages to model data dependences, such UK. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/ as field-sensitivity and pointer references/dereferences. In 3385412.3386021 the ideal case, an InterDyck-reachability framework should model multiple Dyck languages simultaneously. Unfortunately, precise InterDyck-reachability is undecid- 1 Introduction able. Any practical solution must over-approximate the exact The L language-reachability (L-reachability) framework is answer. In the literature, a lot of work has been proposed to a popular model to formulate many program-analysis prob- over-approximate the InterDyck-reachability formulation. lems [14]. An L-reachability instance ReachhL;Gi contains This paper offers a new perspective on improving both the (1) a formal language L that formalizes the analysis problem, precision and the scalability of InterDyck-reachability: we and (2) an edge-labeled graph G that represents the program aim to simplify the underlying input graphG. Our key insight under analysis. Two nodes are L-reachable in G iff there ex- is based on the observation that if an edge is not contributing ists a path joining them, and the path string belongs to L. In to any InterDyck-path, we can safely eliminate it from G. the literature, the most popular L-reachability formulation Our technique is orthogonal to the InterDyck-reachability is Dyck-reachability [11, 25]. A Dyck language essentially formulation, and can serve as a pre-processing step with any generates well-balanced parentheses, which can be used to over-approximating approaches for InterDyck-reachability. capture well-paired program properties, such as function call- We have applied our graph simplification algorithm to pre- s/returns [15, 16, 22], pointer references/dereferences [27, processing the graphs from a recent InterDyck-reachability- 28], locks/unlocks [10, 13], and field reads/writes [9, 24, 25]. based taint analysis for Android. Our evaluation on three A natural generalization of Dyck-reachability is Inter- popular InterDyck-reachability algorithms yields promis- leaved Dyck-reachability (InterDyck-reachability) [9, 15, ing results. In particular, our graph-simplification method 26]. The Interleaved Dyck language denotes the intersec- improves both the scalability and precision of all three In- tion of multiple Dyck languages based on an interleaving terDyck-reachability algorithms, sometimes dramatically. operator ⊙. For instance, let D1 and D2 be two Dyck languages that generate matched parentheses and matched brackets, respectively. The string “ ” belongs to the Permission to make digital or hard copies of all or part of this work for language InterDyck = D1 ⊙ D2, becauseLJLKMM both parentheses personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear and brackets are properly matched. InterDyck-reachability this notice and the full citation on the first page. Copyrights for components is much more expressive than Dyck-reachability, and in prac- of this work owned by others than ACM must be honored. Abstracting with tice, brings tremendous precision improvements in client credit is permitted. To copy otherwise, or republish, to post on servers or to analyses. In particular, almost all recent work on context- redistribute to lists, requires prior specific permission and/or a fee. Request sensitive, field-sensitive analysis has adopted the InterDyck- permissions from [email protected]. reachability formulation to achieve both the context- and PLDI ’20, June 15–20, 2020, London, UK © 2020 Association for Computing Machinery. field-sensitivities simultaneously [9, 19, 26]. ACM ISBN 978-1-4503-7613-6/20/06...$15.00 Unfortunately, solving InterDyck-reachability is compu- https://doi.org/10.1145/3385412.3386021 tationally hard because the InterDyck-reachability problem 780 PLDI ’20, June 15–20, 2020, London, UK Yuanbo Li, Qirun Zhang, and Thomas Reps is, in general, undecidable [15]. Therefore, any practical anal- with n nodes and m edges, we give an efficient algorithm that ysis must approximate the exact answer. In practice, it is quite simplifies G in O¹m log mº time with O¹mº space. This graph- challenging to develop a suitable over-approximative Inter- simplification algorithm is asymptotically faster than the Dyck-reachability framework that offers a sweet spot in the fastest O¹mnº-time InterDyck-reachability algorithm [26]. trade-off between precision and scalability. InterDyck is a The technique is general, and can be used as a preprocessing prototypical example of a non-context-free language [8]. Tra- step for any existing InterDyck-reachability algorithms. ditional approaches employ less expressive but polynomial- We have implemented the graph-simplification algorithm, time decidable language-reachability frameworks, such as and evaluated it on a recent InterDyck-reachability-based context-free-language reachability (CFL-reachability) to over- taint analysis for Android [9]. In particular, we tested graph approximate InterDyck-reachability [9, 21, 24]. For exam- simplification with three popular InterDyck-reachability ple, the recent work by Späth et al. proposed synchronized algorithms, based on context-free language reachability (CFL- pushdown systems to compute a sound solution for Inter- reachability) [14], synchronized pushdown systems reacha- Dyck-reachability [19]. The work by Zhang and Su proposed bility (SPDS) [19], and linear conjunctive language reacha- linear-conjunctive-language reachability (LCL-reachability) bility (LCL-reachability) [26]. The empirical results are en- to precisely describe the InterDyck-reachability formula- couraging: graph simplification significantly improves both tion [26]. However, the LCL-reachability algorithm is inher- the performance and the precision of the client analyses. ently an over-approximation. To the best of our knowledge, • We found that, on average, it is 2.18× faster to (i) run all previous efforts on the InterDyck-reachability problem the simplification algorithm on digraph G—thereby attempt to improve either on the L-reachability formulation creating simplified digraph Gf —and then (ii) run an or the L-reachability algorithm. InterDyck-reachability algorithm A on Gf , compared In this paper, we attack the InterDyck-reachability prob- to running A directly on the original graph G. lem from a new angle. Consider an InterDyck-reachability • In the experiments with LCL-reachability, we found instance ReachhL;Gi. Unlike existing approaches that im- that the cost of running the simplification algorithm prove either the L-reachability formulation or the algorithm, is recouped for all examples that require more than our approach focuses on simplifying the input graph G in seven seconds to run in the original graphs. ReachhL;Gi. Specifically, we give an efficient algorithm to • The number of reachable pairs returned by the anal- simplify the input graphs by eliminating “useless” graph ysis based on the simplified graph Gf is reduced to edges. The benefits of graph simplification are two-fold. First, 64.92% compared to the number obtained by running working with smaller graphs improves the scalability of all the analysis on G. Moreover, the analysis run on Gf existing approaches for InterDyck-reachability. Second, be- uses 57.37% memory for the analysis running on G. cause all InterDyck-reachability algorithms are inherently Our work makes the following contributions: over-approximative, they could achieve better precision by working with graphs that contain fewer edges. The techni- • We propose a novel graph-simplification framework cal challenge, however, is to design a graph-simplification for InterDyck-reachability. Our technique reduces algorithm that is both effectivei.e. ( , it should remove as input-graph size, and is compatible with all existing many “useless” edges as possible) and efficienti.e. ( , as a sound InterDyck-reachability algorithms. pre-processing step, it should run much faster than the In- • Given a graph with n nodes and m edges, we give a terDyck-reachability algorithm itself). fast simplification algorithm that runs in O¹m log mº time with O¹mº space. In practice, our algorithm scales Consider an InterDyck language L1 ⊙ L2 ::: ⊙ Lk , where linearly with the graph size. for each i 2

Fast Graph Simplification for Interleaved Dyck-Reachability

Grammatical Compression: Compressed Equivalence and Other Problems Alberto Bertoni, Roberto Radicioni

Visibly Counter Languages and Constant Depth Circuits

The Inclusion Problem of Context-Free Languages: Some Tractable Cases?

CDM [1Ex]Context-Free Grammars

Inverse Monoids, Trees, and Context-Free Languages

The Dyck Language Edit Distance Problem in Near-Linear Time

Quantum Lower and Upper Bounds for 2D-Grid and Dyck Language

Improved Bounds for Testing Dyck Languages

Rational Subsets of Groups, When Good Closure and Decidability Properties of These Subsets Are Satisﬁed

Rational Monoid and Semigroup Automata

Recursive Prime Factorizations: Dyck Words As Representations Of

How to Prove That a Language Is Regular Or Star-Free? Jean-Eric Pin