Backreferences in Practical Regular Expressions

ASSIGNMENT OF MASTER’S THESIS

Title: Backreferences in practical regular expressions
Student: Martin Hron
Supervisor: Ing. Ondřej Guth, Ph.D.
Study Programme: Informatics
Study Branch: Computer Science
Department: Department of Theoretical Computer Science
Validity: Until the end of summer semester 2020/21

Instructions

Make a thorough research on existing algorithms for regular expression matching. Focus on regular expressions with backreferences. Upon agreement with the supervisor, implement a selected method based on memory automata defined in [1]. Compare this approach with existing regex matching tools.

References

[1] Schmid, M. Characterising REGEX languages by regular languages equipped with factor-referencing. In: Information and Computation. Volume 249, August 2016. ISSN 0890-5401. DOI 10.1016/j.ic.2016.02.003.

doc. Ing. Jan Janoušek, Ph.D., Head of Department
doc. RNDr. Ing. Marcel Jiřina, Ph.D., Dean
Prague, January 5, 2020

Master’s thesis
Backreferences in practical regular expressions
Bc. Martin Hron
Department of Theoretical Computer Science
Supervisor: Ing. Ondřej Guth, Ph.D.
May 27, 2020

Acknowledgements

I would like to thank my supervisor Ing. Ondřej Guth, Ph.D. for guiding this thesis and for providing all the valuable feedback and helpful advice. I would also like to thank my family for their support.

Declaration

I hereby declare that the presented thesis is my own work and that I have cited all sources of information in accordance with the Guideline for adhering to ethical principles when elaborating an academic final thesis. I acknowledge that my thesis is subject to the rights and obligations stipulated by the Act No. 121/2000 Coll., the Copyright Act, as amended.
In accordance with Article 46 (6) of the Act, I hereby grant a nonexclusive authorization (license) to utilize this thesis, including any and all computer programs incorporated therein or attached thereto and all corresponding documentation (hereinafter collectively referred to as the “Work”), to any and all persons that wish to utilize the Work. Such persons are entitled to use the Work in any way (including for-profit purposes) that does not detract from its value. This authorization is not limited in terms of time, location and quantity. However, all persons that make use of the above license shall be obliged to grant a license at least in the same scope as defined above with respect to each and every work that is created (wholly or in part) based on the Work, by modifying the Work, by combining the Work with another work, by including the Work in a collection of works or by adapting the Work (including translation), and at the same time make available the source code of such work at least in a way and scope that are comparable to the way and scope in which the source code of the Work is made available.

In Prague on May 27, 2020

Czech Technical University in Prague
Faculty of Information Technology
© 2020 Martin Hron. All rights reserved.

This thesis is school work as defined by Copyright Act of the Czech Republic. It has been submitted at Czech Technical University in Prague, Faculty of Information Technology. The thesis is protected by the Copyright Act and its usage without author’s permission is prohibited (with exceptions defined by the Copyright Act).

Citation of this thesis

Hron, Martin. Backreferences in practical regular expressions. Master’s thesis. Czech Technical University in Prague, Faculty of Information Technology, 2020.

Abstrakt (English translation of the Czech abstract)

Backreferences are an extension of regular expressions commonly supported in today's tools. Regular expressions with backreferences have increased expressive power, but matching them is NP-complete. This thesis provides an overview of existing approaches to matching regular expressions with backreferences, as well as of theoretical research on the topic. As part of this thesis, a regular expression matching tool based on the memory automata model was implemented. Matching regular expressions whose number of backreferences to different groups is bounded by a constant has polynomial time complexity when memory automata are used. Another recently published method based on memory automata was also implemented; it provides polynomial complexity for expressions with an unbounded number of backreferences that satisfy a certain property. An alternative algorithm for computing this property was designed and implemented as part of this thesis. The memory automaton model was further extended to support bounded repetition quantifiers, and other extensions were implemented. Experimental evaluation showed that the implemented tool is much more resistant to catastrophic backtracking than existing implementations that support backreferences. None of the tested algorithmic complexity attacks caused a noticeable slowdown.

Keywords (translated from "Klíčová slova"): Backreferences, Regular expressions, Regex, Memory automata, Regular expression implementation

Abstract

Backreferences are an extension of regular expressions commonly supported in modern tools. Regular expressions with backreferences have an increased expressive power but their matching problem is NP-complete. This work researches existing approaches for regular expression matching with focus on backreferences, and also provides an overview of theoretical work on the topic. As a part of this thesis, a matching tool based on a computational model called memory automata was implemented.
Matching patterns in which the number of backreferences to different groups is bounded by a constant has polynomial time complexity when memory automata are used. An additional recently published technique based on memory automata was also implemented, which provides polynomial complexity even for a subset of patterns with an unbounded number of backreferences restricted by a certain property. As a part of this thesis, an alternative algorithm to compute this property was proposed and implemented. The memory automaton model was also extended to support counting constraints, and other common extensions were implemented. Experimental evaluation showed that the implemented tool is much more resistant to catastrophic backtracking than existing implementations with support for backreferences. No tested algorithmic complexity attack triggered a significant slowdown.

Keywords: Backreferences, Regular expressions, Regex, Memory automata, Regular expressions implementation

Contents

Introduction
1 Preliminaries
  1.1 Basic notation
  1.2 Alphabet, string and language
  1.3 Finite automata
  1.4 Other automata models
  1.5 Language hierarchy
  1.6 Regular expressions
  1.7 Other preliminaries
2 Regular expressions with backreferences
  2.1 Introduction
  2.2 Formalizing backreferences
    2.2.1 Early formalization using variables
    2.2.2 Formalization of rewbr with numbered backreferences
    2.2.3 Formalizing rewbr via factor-referencing
  2.3 Properties of the regex language class
    2.3.1 Relation to other language classes
    2.3.2 Closure under operations
  2.4 The regex matching problem
  2.5 Submatch addressing and searching in text
  2.6 Other extensions in practical regular expressions
    2.6.1 Subpattern recursion
  2.7 The POSIX standard
3 Regular expression matching algorithms and approaches
  3.1 Thompson’s algorithm
  3.2 Basic approaches
    3.2.1 DFA based
    3.2.2 NFA based with backtracking
    3.2.3 NFA based without backtracking
  3.3 Recursive backtracking
    3.3.1 Preventing infinite loops
    3.3.2 Backreference support
    3.3.3 Practical implementations
    3.3.4 Complexity and catastrophic backtracking
  3.4 Approaches overview
    3.4.1 Extending Finite Automata
      3.4.1.1 Method description
      3.4.1.2 Multiple capturing groups and backreferences
    3.4.2 Polynomial time matching for rewbr subclasses
    3.4.3 Tagged automata
    3.4.4 Tagged NFA with constraints
    3.4.5 Symbolic register automata
    3.4.6 Symbolic regex matcher
    3.4.7 Pattern optimizations
    3.4.8 Virtual machine approach and JIT compilation
  3.5 Overview of major implementations
    3.5.1 Perl Compatible Regular Expressions
    3.5.2 Java
    3.5.3 JavaScript
    3.5.4 Oniguruma
    3.5.5 Boost regex
    3.5.6 ICU Regular Expressions
    3.5.7 GNU C library and POSIX
    3.5.8 Hyperscan
    3.5.9 TRE
    3.5.10 RE2
4 Memory automata
  4.1 Memory automaton model
  4.2 Memory automata for regex patterns
    4.2.1 Matching regex using memory automata
  4.3 Properties of memory automata and relation to regex languages
  4.4 Active variable degree
    4.4.1 Matching regex with bounded active variable degree
    4.4.2 Strong active variable degree
    4.4.3 Relation to variable distance
  4.5 Deterministic regex
  4.6 Memory determinism
5 Implementation
  5.1 Memory automata for practical regular expressions
    5.1.1 Matching algorithms
    5.1.2 Configurations representation and implementing memories
    5.1.3 Computing active variable sets
    5.1.4 Matching patterns with bounded active variable degree
    5.1.5 Extending memory automata for counting constraints
  5.2 Solution architecture and implementation details
    5.2.1 Supported features
    5.2.2 Testing
    5.2.3 Source code publication and licensing
6 Experimental evaluation
  6.1 Methodology
    6.1.1 Datasets
      6.1.1.1 Inputs
    6.1.2 Platform specifications
  6.2 Comparison of algorithms and options provided by mfa-regex
    6.2.1 Comparison of matching algorithms
