Sound Regular Expression Semantics for Dynamic Symbolic Execution Of
Total Page:16
File Type:pdf, Size:1020Kb
Sound Regular Expression Semantics for Dynamic Symbolic Execution of JavaScript Blake Loring Duncan Mitchell Johannes Kinder Information Security Group Department of Computer Science Research Institute CODE Royal Holloway, University of London Royal Holloway, University of London Bundeswehr University Munich United Kingdom United Kingdom Germany [email protected] [email protected] [email protected] Abstract 1 Introduction Support for regular expressions in symbolic execution-based Regular expressions are popular with developers for match- tools for test generation and bug finding is insufficient. Com- ing and substituting strings and are supported by many pro- mon aspects of mainstream regular expression engines, such gramming languages. For instance, in JavaScript, one can as backreferences or greedy matching, are ignored or impre- write /goo+d/.test(s) to test whether the string value of cisely approximated, leading to poor test coverage or missed s contains "go",followedbyoneormoreoccurrencesof "o" bugs. In this paper, we present a model for the complete reg- and a final "d". Similarly, s.replace(/goo+d/,"better") ular expression language of ECMAScript 2015 (ES6), which evaluates to a new string where the first such occurrence in is sound for dynamic symbolic execution of the test and s is replaced with the string "better". exec functions. We model regular expression operations us- Several testing and verification tools include some degree ing string constraints and classical regular expressions and of support for regular expressions because they are so com- use a refinement scheme to address the problem of matching mon[24, 27, 29, 34, 37]. In particular, SMT (satisfiability mod- precedence and greediness. We implemented our model in ulo theory) solvers now often support theories for strings ExpoSE, a dynamic symbolic execution engine for JavaScript, and classical regular expressions [1, 2, 6, 15, 25, 26, 34, 38–40], and evaluated it on over 1,000 Node.js packages containing which allow expressing constraints such as s ∈ L(goo+d) regular expressions, demonstrating that the strategy is effec- for the test example above. Although any general theory tive and can significantly increase the number of successful of strings is undecidable [7], many string constraints are ef- regular expression queries and therefore boost coverage. ficiently solved by modern SMT solvers. SMT solvers support regular expressions in the language- CCS Concepts • Software and its engineering → Soft- theoretical sense, but “regular expressions” in programming ware verification and validation; Dynamic analysis; • languages like Perl or JavaScript—often called regex, a term Theory of computation → Regular languages. we also adopt in the remainder of this paper—are not lim- ited to representing regular languages [3]. For instance, the Keywords Dynamic symbolic execution, JavaScript, regu- expression /<(\w+)>.*?<\/\1>/ parses any pair of match- lar expressions, SMT ing XML tags, which is a context-sensitive language (be- cause the tag is an arbitrary string that must appear twice). ACM Reference Format: Problematic features that prevent a translation of regexes Blake Loring, Duncan Mitchell, and Johannes Kinder. 2019. Sound to the word problem in regular languages include capture arXiv:1810.05661v4 [cs.PL] 13 Mar 2020 Regular Expression Semantics for Dynamic Symbolic Execution of groups (the parentheses around \w+ in the example above), JavaScript. In Proceedings of the 40th ACM SIGPLAN Conference on backreferences (the \1 referring to the capture group), and ProgrammingLanguage Designand Implementation(PLDI ’19), June greedy/non-greedy matching precedence of subexpressions 22–26, 2019, Phoenix, AZ, USA. ACM, New York, NY, USA, 15 pages. (the .*? is non-greedy). In addition, any such expression hps://doi.org/10.1145/3314221.3314645 could also be included in a lookahead (?=), which effec- tively encodes intersection of context sensitive languages. In tools reasoning about string-manipulating programs, these features are usually ignored or imprecisely approximated. PLDI ’19, June 22–26, 2019, Phoenix, AZ, USA © 2019 Copyright held by the owner/author(s). Publication rights licensed This is a problem, because they are widely used, as we demon- to ACM. strate in §7.1. This is the author’s version of the work. It is posted here for your personal In the context of dynamic symbolic execution (DSE) for use. Not for redistribution. The definitive Version of Record was published test generation, this lack of support can lead to loss of cover- in Proceedings of the 40th ACM SIGPLAN Conference on Programming Lan- age or missed bugs where constraints would have to include guage Design and Implementation (PLDI ’19), June 22–26, 2019, Phoenix, AZ, USA, hps://doi.org/10.1145/3314221.3314645. membership in non-regular languages. The difficulty arises 1 PLDI’19,June22–26,2019,Phoenix,AZ,USA BlakeLoring,Duncan Mitchell, and Johannes Kinder from the typical mixing of constraints in path conditions— offer the match, split, search and replace methods that simply generating a matching word for a standalone regex is expect a RegExp argument. easy (without lookaheads). To date, there has been only lim- A regex accepts a string if any portion of the string matches ited progress on this problem, mostly addressing immediate the expression, i.e., it is implicitly surrounded by wildcards. needs of implementations with approximate solutions, e.g., The relative position in the string can be controlled with an- for capture groups [29] and backreferences [27, 30]. How- chors, with ^ and $ matching the start and end, respectively. ever, neither matching precedence nor lookaheads have been Flags in regexes can modify the behavior of matching op- addressed before. erations. The ignore case flag i ignores character cases when In this paper, we propose a novel scheme for support- matching. The multiline flag m redefines anchor characters ing ECMAScript regex in dynamic symbolic execution and to match either the start and end of input or newline char- show that it is effective in practice. We rely on the specifica- acters. The unicode flag u changes how unicode literals are tion of regexes and their associated methods in ECMAScript escaped within an expression. The sticky flag y forces match- 2015 (ES6). However, our methods and findings should be ing to start at RegExp.lastIndex, which is updated with easily transferable to most other existing implementations. the index of the previous match. Therefore, RegExp objects In particular, we make the following contributions: become stateful as seen in the following example: • We fully model ES6 regex in terms of classical regular r = /goo+d/y; languages and string constraints (§4) and cover sev- r.test("goood"); // true; r.lastIndex = 6 eral aspects missing from previous work [27, 29, 30]. r.test("goood"); // false; r.lastIndex = 0 We introduce the notion of a capturing language to The meaning of the global flag g varies. It extends the effects make the problem of matching and capture group as- of match and replace to include all matches on the string signment self-contained. and it is equivalent to the sticky flag for the test and exec • We introduce a counterexample-guided abstraction re- methods of RegExp. finement (CEGAR) scheme to address the effect of greed- iness on capture groups (§5), which allows us to de- 2.2 Capture Groups ploy our model in DSE without sacrificing soundness Parentheses in regexes not only change operator precedence for under-approximation. (e.g., (ab)* matches any number of repetitions of the string • We present the first systematic study of JavaScript "ab" while ab* matches the character "a" followed by any regexes, examining feature usage across 415,487 pack- number of repetitions of the character "b") but also cre- ages from the NPM software repository. We show that ate capture groups. Capture groups are implicitly numbered non-regular features are widely used (§7.1). from left to right by order of the opening parenthesis. For 1 2 In the remainder of the paper we review ES6 regexes (§2) example, /a|((b)*c)*d/ is numbered as /a|( ( b)*c)*d/. and present an overview of our approach by example (§3). Where only bracketing is required, a non-capturing group We then detail our regex model using a novel formulation (§4), can be created by using the syntax (?: ... ). and we propose a CEGAR scheme to address matching prece- For regexes, capture groups are important because the dence (§5). We discuss an implementation of the model as regex engine will record the most recent substring matched part of the ExpoSE symbolic execution engine for JavaScript against each capture group. Capture groups can be referred (§6) and evaluate its practical impact on DSE (§7). Finally, we to from within the expression using backreferences (see §2.3). review related work (§8) and conclude (§9). The last matched substring for each capture group is also returned by some of the API methods. In JavaScript, the re- turn values of match and exec are arrays, with the whole 2 ECMAScript Regex match at index 0 (the implicit capture group 0), and the last We review the ES6 regex specification, focusing on differ- matched instance of the ith capture group at index i. In the ences to classical regular expressions. We begin with the example above, "bbbbcbcd".match(/a|((b)*c)*d/) will regex API and its matching behavior (§2.1) and then explain evaluate to the array ["bbbbcbcd", "bc", "b"]. capture groups (§2.2), backreferences (§2.3), and operator precedence (§2.4). ES6 regexes are comparable to those of 2.3 Backreferences other languages but lack Perl’s recursion and lookbehind A backreference in a regex refers to a numbered capture group and do not require POSIX-like longest matches. and will match whatever the engine last matched the cap- ture group against. In general, the addition of backreferences 2.1 Methods, Anchors, Flags to regexes makes the accepted languages non-regular [3].