Accessible Formal Methods for Verified Parser Development
2021 IEEE Symposium on Security and Privacy Workshops

Letitia W. Li, Greg Eakman
FAST Labs, BAE Systems, Burlington, Massachusetts, USA

Elias J. M. Garcia, Sam Atman
Special Circumstances, Brussels, Belgium

© 2021, Letitia W. Li. Under license to IEEE. DOI 10.1109/SPW53761.2021.00028

Abstract: Security flaws in Portable Document Format (PDF) readers may allow PDFs to conceal malware, exfiltrate information, and execute malicious code. To identify these flawed PDFs, a PDF reader must first parse the document syntactically and then analyze the semantic properties or structure of the parse result. This paper presents a language-theoretic and developer-accessible approach to PDF parsing and validation using ACL2, a formal methods language and theorem prover. We develop two related components, a PDF syntax parser and a semantic validator. The ACL2-based parser, verified using the theorem prover, supports verification of a high-performance equivalent of itself or an existing PDF parser. The validator then extends the existing parser by checking the semantic properties in the parse result. The structure of the parser and semantic rules are defined by the PDF grammar and fall into certain patterns, and therefore can be automatically composed from a set of verified base functions. Instead of requiring a developer to be familiar with formal methods and to manually write ACL2 code, we use Tower, our modular metalanguage, to generate verified ACL2 functions and proofs and the equivalent C code to analyze semantic properties, which can then be integrated into our verified parser or an existing parser to support PDF validation.

Keywords: LangSec, PDF, Formal Methods, ACL2

I. INTRODUCTION

Portable Document Format (PDF) readers are a known target for attacks, with multiple known vulnerabilities [1]. Malware in executable format can masquerade as PDFs, or PDFs can execute malicious code [2]. Circular references can cause certain PDF readers to crash [3] or display infinite recursion [2]. PDFs also include actions such as code execution, providing additional avenues of attack to run malicious scripts or use actions to exfiltrate data from an encrypted PDF [1], [4], [5].

A PDF reader’s parser should detect if the input is malicious or invalid [6], which can occur via syntax (form of the input language) or semantics (meaning of terms within the input). For example, an invalid sequence of characters is a syntax error (such as parsing a PDF without the “%PDF-<version>” header), while structures violating the PDF standard are semantic errors [7].

Semantic properties from the PDF standard determine if a document is readable, safe and acceptable to all readers (e.g. there are no cycles in certain fields). Syntax-level parsing, however, cannot check certain semantic properties in the process of parsing. During syntax parsing, objects referenced in the field of another object may not yet have been parsed, and therefore we cannot check any of the attributes of the referenced object (such as its type, or even its existence). Instead, these semantic properties are checked after the first parsing of the document, whether within the parser or separately.

In the DARPA SafeDocs project, we address the insecurity of PDF readers/parsers with a Language-theoretic Security approach by first defining a PDF grammar, and then developing verified parsers to accept only well-formed input according to that grammar [8]. In this paper, we propose an accessible formal methods approach in ACL2 for safe, secure, and efficient parser development covering both syntax and semantics, demonstrated on the PDF standard. While other parsers focus mainly on syntax parsing, we add semantic validation intended to fully cover the PDF standard.

Formal methods offer mathematical proofs of correctness of software and more rigorous and complete coverage than manual testing [9]. These techniques help detect and remove errors by providing unambiguous statements of correctness to verify on the system [10]. Theorem-proving languages, like ACL2, involve the verifier writing functions and then corresponding theorems to prove guarantees on the output and correctness of those functions [11]. ACL2 uses first-order logic, and is based on a subset of Common Lisp [11]. It has been used for the verification of real-world systems such as microprocessors, algorithms, and virtual machines [12], [13], [14].

However, formal languages like ACL2 are not commonly used for final product development nor designed for performance. Other formal languages, such as Coq, can sometimes extract performant code automatically, while we write a verified equivalent to an existing syntax parser written in a language more accessible to developers, such as the Hammer-based parser in C [15]. If we can demonstrate the equivalence of the operational and verified parsers, guarantees of our verified parser can then extend to the operational parser. Next, as the operational parser is not written in a formal language, integrating the semantic checking of the verified parser requires expressing the same, equivalent functions in the language of the operational parser. We apply the known technique of generating equivalent proven and unproven (but easier to use and generally faster) functions to assure correctness of functions written in a non-formal, non-theorem-proving language [16].

To apply verified semantic checks, we generate verified functions based on the language specification to analyze the parsed objects. However, the specification expert or developer may not be experienced in formal methods, which are widely accepted to be difficult to learn, comprehend, or use [17]. Not all developers are trained in the mathematics of formal methods, which is another barrier to their widespread usage [10]. To make formal methods more accessible, we developed an automated generator for ACL2 and C which allows users to easily write rules using a graphical interface, and an S-expression metalanguage named Tower. The metalanguage separates the user from the underlying formal proofs, so that users are not required to understand the details of the theorem prover.

Figure 1 shows the overview of our approach, where we first parse a document with equivalent verified and operational syntax parsers, and then use the graphical interface to generate semantic validation rules, which form the “Format validator”. Currently, we use a Hammer-based PDF parser for syntax parsing, which provides us with the parse result in the form of an Abstract Syntax Tree (AST) [15]. We then transform and cast the AST into a list of PDF objects (as described in the next section), which is then passed into the format validator.

Fig. 1. Document Validation Overview

The grammar of the format determines the syntax parser, semantic structure of parsed objects, and the layout of the graphical interface, as we will explain in this paper. The current ACL2-based verified parser is a proof-of-concept verified parser that can be extended into the equivalent of the Hammer-based PDF parser or another existing parser. Once we’ve demonstrated their equivalence, the verified parser assures the correctness of the operational parser, while we mainly use the operational C parser for actual document analysis.

The advantage of our approach is that we only require that the reviewer check the translation for helper functions and single expression code translations, instead of longer functions with multiple inner expressions. The proofs of correctness and other properties assured on the ACL2 functions then apply to the generated C function.

We begin by introducing the PDF structure in Section 2, which explains the source and structure of a PDF. Section 3 discusses the PDF grammar on which we base our syntax parsing and semantic validation, and Section 4 applies the grammar to developing a verified syntax parser in ACL2. Section 5 discusses the types of rules used for semantic validation of PDFs, and Section 6 describes how we generate ACL2 functions and theorems of correctness to apply those rules on the parsed PDF. Section 7 discusses related work in PDF parsing and validation, as well as other parsers. Finally, Section 8 concludes the paper and offers directions for future work.

II. PDF STRUCTURE

PDF documents are composed of a header with the PDF version, a body of objects (pages, fonts, etc.), a cross-reference table listing objects, and the trailer [18]. The header nominally dictates which version the PDF document may conform to, while the objects in the body of the PDF specify the actual design elements and text. The cross-reference table, called an xref table, lists the location of both used and unused objects by their offsets from the start of the file. The trailer includes the location of the last xref table and other special objects, such as a reference to the PDF’s root object. The root object, known as the Catalog, contains references to child objects in the document hierarchy.

Listing 1
ACL2 TYPE DEFINITION FOR THE PDF OBJECT

    (fty::defmap attrdict
      :key-type stringp
      :val-type any-p)

    (fty::defprod obj
      ((type stringp :default "")
       (attrs attrdict-p :default (list nil))
      ))

Figure 2 shows the structure of a simple PDF, where the Catalog is the root object, and it references Outlines and PageTreeNodeRoot objects. The Outlines object in turn references individual OutlineItems, while the PageTreeNodeRoot contains the page tree of either PageTreeNodes or individual pages. The text content on a page is then contained in a stream object. Every major structure in PDF, including the xref table and trailer, is constructed from the eight built-in types of objects: boolean values, integer and real numbers, strings, names, arrays, dictionaries, streams, and the null object [7].

Fig. 2. Structure of the PDF document with object structure

cannot express the jump to offsets required in PDFs, which can instead be described with Parsing Expression Grammar
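As a rough illustration of the object model in Section II, the C sketch below mirrors the shape of the ACL2 `obj` type: a type string plus an attribute dictionary, with values tagged by one of the eight built-in kinds. It is only a sketch under stated assumptions. Every identifier in it (`pdf_kind`, `pdf_value`, `pdf_attr`, `pdf_obj`, `pdf_obj_get`, `make_example_catalog`) is hypothetical and is not taken from the authors' generated C code or from the Hammer library. Payloads for arrays, dictionaries, and streams are omitted to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* The eight built-in kinds named in the text (integer and real
 * numbers are folded into one numeric kind). Hypothetical names. */
typedef enum {
    PDF_NULL, PDF_BOOLEAN, PDF_NUMBER, PDF_STRING,
    PDF_NAME, PDF_ARRAY, PDF_DICTIONARY, PDF_STREAM
} pdf_kind;

/* A tagged value. Only scalar payloads are modeled in this sketch. */
typedef struct {
    pdf_kind kind;
    union {
        bool boolean;
        double number;
        const char *text;   /* payload for PDF_STRING and PDF_NAME */
    } as;
} pdf_value;

/* One entry of the attribute dictionary (string key, any value). */
typedef struct {
    const char *key;
    pdf_value value;
} pdf_attr;

/* Counterpart of the ACL2 obj product: a type string plus attributes. */
typedef struct {
    const char *type;
    const pdf_attr *attrs;
    size_t attr_count;
} pdf_obj;

/* Linear lookup by key; returns NULL when the attribute is absent. */
const pdf_value *pdf_obj_get(const pdf_obj *obj, const char *key) {
    assert(obj != NULL && key != NULL);
    for (size_t i = 0; i < obj->attr_count; i++) {
        if (strcmp(obj->attrs[i].key, key) == 0) {
            return &obj->attrs[i].value;
        }
    }
    return NULL;
}

/* Builds a small Catalog-like object; the entries are illustrative only. */
pdf_obj make_example_catalog(void) {
    static const pdf_attr attrs[] = {
        { "Type",           { PDF_NAME,    { .text = "Catalog" } } },
        { "NeedsRendering", { PDF_BOOLEAN, { .boolean = false } } },
    };
    pdf_obj obj = { "Catalog", attrs, sizeof attrs / sizeof attrs[0] };
    return obj;
}
```

A semantic check in this style would call `pdf_obj_get` on a parsed object and inspect the `kind` tag before reading the payload. This is the sort of type-and-existence test that the text says cannot be performed while referenced objects are still unparsed.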