A Minimalistic Verified Bootstrapped Compiler (Proof Pearl)
Total Page:16
File Type:pdf, Size:1020Kb
A Minimalistic Verified Bootstrapped Compiler (Proof Pearl) Magnus O. Myreen Chalmers University of Technology Gothenburg, Sweden Abstract compiler on itself within the ITP (i.e. bootstrapping it), one This paper shows how a small verified bootstrapped compiler can arrive at an implementation of the compiler inside the can be developed inside an interactive theorem prover (ITP). ITP and get a proof about the correctness of each step [13]. Throughout, emphasis is put on clarity and minimalism. From there, one can export the low-level implementation of the compiler for use outside the ITP, without involving any CCS Concepts: • Theory of computation ! Program ver- complicated unverified code generation path. ification; Higher order logic; Automated reasoning. The concept of applying a compiler to itself inside an ITP Keywords: compiler verification, compiler bootstrapping, can be baffling at first, but don’t worry: the point of this paper interactive theorem proving is to clearly explain the ideas of compiler bootstrapping on a simple and minimalistic example. ACM Reference Format: To the best of our knowledge, compiler bootstrapping Magnus O. Myreen. 2021. A Minimalistic Verified Bootstrapped inside an ITP has previously only been done as part of the Proceedings of the 10th ACM SIGPLAN Compiler (Proof Pearl). In CakeML project [29]. The CakeML project has as one of International Conference on Certified Programs and Proofs (CPP ’21), its aims to produce a realistic verified ML compiler and, January 18–19, 2021, Virtual, Denmark. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3437992.3439915 as a result, some of its contributions, including compiler bootstrapping, are not as clearly explained as they ought to 1 Introduction be: important theorems are cluttered with real-world details that obscure some key ideas. Bootstrapping is a milestone in any compiler development. The contribution of this paper is a new verified boot- We say that a compiler bootstraps itself when it can generate strapped compiler that is designed to clearly explain the its own low-level implementation by applying itself to its concept of compiler bootstrapping inside an ITP. This paper own source code [8, 14]. is not aiming for generality, realism or good performance, In traditional compiler development, the bootstrapping but instead strives for clarity and minimalism. milestone means that the compiler can, from then on, be The result of this effort is a mechanised proof development expressed in its own source language and no longer needs that produces verified x86-64 assembly code implementing to rely on another compiler for development. Due to this our new verified compiler. Crucially, the assembly imple- independence, bootstrapped compilers are called self hosting. mentation of the compiler is produced via bootstrapping In the context of verified compilation, compiler bootstrap- inside the logic of an ITP. That is, the compiler is run on ping also means that one can arrive at a low-level executable itself within the logic to produce its own implementation implementation of the compiler without the use of another in assembly. This is all done with proof, and, as a result, we code-generation path. Verified compilers live within the logic arrive at a theorem which guarantees correct behaviour of of interactive theorem provers (ITPs), and, even though com- the assembly implementation of the compiler. pilers can be run in this setting, it is often more convenient This work has been carried out in the HOL4 theorem to have a way to use them outside of ITPs. By evaluating the prover [25] but the ideas ought to translate to other provers Permission to make digital or hard copies of all or part of this work for such as Coq [17], Isabelle [30] and ACL2 [18]. Our proof personal or classroom use is granted without fee provided that copies scripts are under examples/bootstrap in the sources of HOL4. are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights 2 Idea of Bootstrapping in the Logic for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or In this section, we start with a look at how the idea of com- republish, to post on servers or to redistribute to lists, requires prior specific piler bootstrapping works for our new verified compiler. This permission and/or a fee. Request permissions from [email protected]. section focuses on how all the different parts fit together, CPP ’21, January 18–19, 2021, Virtual, Denmark while subsequent sections explain the definitions and theo- © 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM. rems that we build on in this section. ACM ISBN 978-1-4503-8299-1/21/01...$15.00 The text below describes the top-level compiler definition https://doi.org/10.1145/3437992.3439915 for our new little compiler; the relevant correctness theorem CPP ’21, January 18–19, 2021, Virtual, Denmark Magnus O. Myreen for the code generator that the compiler contains; how one The operational semantics mentioned above, i.e. +prog and can apply the compiler to itself; and what theorems come +asm, are explained in Section3 and5. The proof of the out at the end of bootstrapping inside an ITP. correctness theorem above is the topic of Section6. While the correctness theorem shown above is sufficient Compiler definition. Our new compiler is defined as for compiler bootstrapping, it is not quite satisfactory in functions in logic. The top-level function is the following: general. For example, it does not say anything about non- compiler input def= terminating executions. Section7 shows how preservation asm2str ¹codegen ¹parser ¹lexer inputººº of non-terminating behaviour can be proved for codegen. This compiler function has type string ! string, and the The compiler in AST form. In order to apply the com- internal functions have the following types: piler to itself, we need to somehow get the compiler into a lexer : string ! token list form that fits with what the compiler function takes as input. parser : token list ! prog The compiler function has a function type: string ! string. codegen : prog ! asm It would be type incorrect to attempt to apply compiler di- asm2str : asm ! string rectly to compiler, since the input type, string, does not match the type of the argument, string ! string. Here prog is a datatype for the abstract syntax tree (AST) of The key to this puzzle is to define a new constant, which source programs (which we will see in Section3), and asm we call compiler_prog, that is the compiler represented as is the AST for x86-64 assembly (to be seen in Section5). source AST, i.e. a value of type prog: In addition to the compiler functions, we also define a pretty printing function that converts a source program to a compiler_prog : prog string. The second argument (of type string list) is a list of We define compiler_prog in such a way that we can prove comments to inject into the generated string. that it implements the compiler function: prog2str : prog ! string list ! string ` ¹input,compiler_progº +prog compiler input We have proved that the lexer followed by the parser inverts prog2str, regardless of the comments coms: One should read the theorem above as saying: for any input, execution of the compiler_prog program will always termi- ` parser ¹lexer ¹prog2str p comsºº = p nate and the output on stdout is the string produced by Correctness of code generation. The in-logic compiler applying the compiler function to input. bootstrapping that we do requires a correctness theorem for Section4 describes how we produce compiler_prog and the code generator, codegen. The required correctness theo- prove the correctness theorem above for it. rem relates terminating executions of the source language Applying the compiler to itself. compiler_prog with the terminating executions of the target language. Using , compiler_str Before we can show the codegen correctness theorem, we we can define a concrete-syntax version of it, ; a compiler_asm need to introduce how we make statements about program version expressed in assembly, ; and the string compiler_asm_str execution in the source and target languages. For source- representation of the assembly version, : level programs (of type prog), we write: compiler_str def= prog2str compiler_prog coms ¹input,pº +prog output def compiler_asm = codegen compiler_prog input to say that, with available on stdin, source program def p terminates and produces output on stdout. Similarly, for compiler_asm_str = asm2str compiler_asm target-level assembly programs (of type asm), we write: Note that compiler_asm is applying part of the compiler, ¹input,aº +asm output namely codegen, to the compiler itself. From the definition of compiler, the definitions above, to say that, with input to be read on stdin, target program a and the correctness of prog2str, we can prove an equation successfully terminates and produces output on stdout. describing the result of applying the entire compiler to itself: We use the following correctness theorem for codegen in our compiler bootstrapping. This theorem states that, if ` compiler compiler_str = compiler_asm_str execution of source program p terminates and target-level execution of assembly program codegen p terminates, then This result is reassuring, but it is not on the critical path to the outputs produced by the two executions must be equal. the main bootstrap theorem, which we will explain next. ` ¹input,pº +prog output 1 ^ The bootstrap theorem. The result of bootstrapping a ¹input,codegen pº +asm output 2 ) compiler inside an ITP is a theorem stating that the low- output 1 = output 2 level implementation of the compiler correctly implements A Minimalistic Verified Bootstrapped Compiler (Proof Pearl) CPP ’21, January 18–19, 2021, Virtual, Denmark the compiler algorithm.