
arXiv:2102.01644v2 [cs.PL] 9 Jun 2021

Functional Pearl: Zero-Cost, Meta-Programmed, Dependently-Typed Stateful Functors in F★

JONATHAN PROTZENKO, Microsoft Research, USA
SON HO, Inria, France

Writing code is hard; proving it correct is even harder. As the scale of verified software projects reaches new heights, the problem of efficiently verifying large amounts of software becomes more and more salient. Nowhere is this issue more evident than in the context of verified cryptographic libraries. To achieve feature-parity and be competitive with unverified cryptographic libraries, a very large number of algorithms and APIs need to be verified. However, the task is oftentimes repetitive, and factoring out commonality between algorithms is fraught with difficulties, requiring until now a significant amount of manual effort.

This paper shows how a judicious combination of known functional programming techniques leads to an order-of-magnitude improvement in the amount of verified code produced by the popular HACL★ cryptographic library, without compromising performance. We review three techniques that build upon each other, in order of increasing sophistication. First, we use dependent types to crisply capture the specification and state machine of a block algorithm, a cryptographic notion that was until now only informally and imprecisely specified. Next, we rely on partial evaluation to author a higher-order, stateful functor that transforms any unsafe block API into a safe counterpart. Finally, we rely on elaborator reflection to automate the very process of authoring a functor, using a code-rewriting tactic. This culminates in a style akin to templatized C++ code, but relying on a userland tactic and partial evaluation, rather than built-in compiler support.

We demonstrate our techniques in the F★ programming language, and show that they require no change in the F★ compiler whatsoever. Our techniques require no user intervention either; the proofs are carried out once and for all, and users can instantiate our higher-order functors without incurring any proof obligations. A distinguishing feature of our approach is that rather than introducing a novel extraction pipeline or incorporating program synthesis techniques, we instead rely on the established, Letouzey-style erasure and extraction pipeline of F★, with an extra early meta-programming stage added. Our encoding of higher-order functors thus provides the programmer with a very high level of abstraction while incurring no run-time costs thanks to careful partial evaluation.

Our approach is in no way specific to cryptography, and as such, this paper requires no cryptographic knowledge. It simply turns out that the existing HACL★ library provides a grand challenge in terms of complexity; an ideal playground to evaluate the effectiveness of our techniques; and a concrete setting to identify patterns of verifying in the large. We have contributed six higher-order functors to the HACL★ library, for a total of nearly 40 specialized functor applications. Individually verifying each one of these 40 algorithms would have been impossible; our method thus significantly improves programmer productivity while providing a much higher level of assurance.

1 CRYPTOGRAPHIC VERIFICATION, IN THE LARGE

Recent years have witnessed a flurry of activity in the field of formally verifying cryptography [11]. There are multiple reasons for this enthusiasm: by virtue of being input-output functions, many core cryptographic primitives lend themselves well to verification; implementors can hardly afford to ship incorrect cryptography anymore, leading to an above-average level of enthusiasm for verified artifacts in industry; and practitioners no longer need to fear functional languages: recent verification pipelines directly generate C or assembly, which are known, familiar languages that can be readily integrated into the existing software ecosystem.

Projects such as FiatCrypto [24], Jasmin [4, 5], CryptoLine [26], Vale-Crypto [16, 25] or HACL★ [37, 46] embody this new reality. They have all demonstrated that it is now feasible to verify cryptographic algorithms whose performance matches or exceeds state-of-the-art, unverified code from projects like OpenSSL [35] or libsodium [1]. These projects all produce verified C or assembly code. Building upon such efforts, projects like EverCrypt [38] offer entire verified cryptographic providers, which expose a wide range of modern algorithms, offering an agile and multiplexing API via CPU auto-detection.

But beyond the initial successes and flagship verified algorithms lies a long tail of oftentimes tedious work required to meet the feature set of existing cryptographic libraries.
For instance, authors of a verified library might be required to provide several nearly-identical, specialized variants of the same algorithm to deal with different processor instruction sets. As another example, an industrial-grade cryptographic library might be required to provide several copies of a high-level API for algorithms that only differ in minor ways. Such tasks are repetitive; provide few novel insights each time a new piece of verified code is added; and make maintenance of the verified code base harder as it grows bigger.

In our view, this highlights the stark need for language-based techniques to perform verification in the large [22]. That is, we envision the programmer operating at a very high level of abstraction, composing high-level building blocks with a minimal amount of effort. We wish for all commonality to be factored out as much as possible, resulting in a minimal amount of verified code for a maximal amount of generated code. But more importantly, we want to make verification engineers' lives better, while still producing the same optimized, low-level, first-order C code that practitioners have come to embrace.

This paper presents our answer to the challenge of scaling up cryptographic verification, in the context of the HACL★ cryptographic library. With dependent types, we characterize previously fuzzy notions such as that of a block algorithm. With meta-programming, we encode higher-order stateful functors in F★. With elaborator reflection [19], we automate the process of writing a functor, thanks to a tactic that automatically generates a functor via syntax inspection and term injection. We require no modifications to the F★ compiler, meaning that the trusted computing base remains unchanged. The technical ingredients are not new; the key contribution of this paper is to show how their judicious combination allows scaling up verification, while bringing greater modularity and clarity to the verified codebase.
This work was performed using the F★ programming language, and its Low★ toolchain that compiles a subset of F★ to C. Our techniques were contributed to and integrated into HACL★. They form the language-based foundation of the most recent iteration of HACL★ [37], which represents a complete overhaul of the previous version [46]. Until now, these techniques had never been described in detail.

Throughout the paper, we use cryptography as the main driving force. There are two reasons. First, cryptographic APIs offer the highest level of complexity, combining mutable state, state machines, abstract representations and abstract predicates. While our techniques have been successfully applied to simpler use-cases, e.g. parameterized data structures and associative lists, we wish to showcase the full power of our approach. Second, we benefit from the large existing codebase of cryptographic code in HACL★, meaning we can perform a real-world quantitative evaluation of how our techniques make verification more scalable.


Our work is motivated by two very concrete issues previously faced by the HACL★ codebase. We now succinctly describe both problems, and highlight how (automated) higher-order stateful functors provide a technical answer.

Unsafe block APIs. Verified cryptography has been successfully integrated in flagship software projects, such as Google Chrome, LibreSSL, Firefox or Linux. Unfortunately, most of the integrations so far have been piecewise, wherein a fragment of an algorithm, or a single algorithm at a time, gets replaced with a verified version. We have yet to see an entire cryptographic library, such as Mozilla's NSS, being swapped out for a verified library.

One of the reasons is that, when authoring verified code, it does not suffice to merely implement an algorithm as specified by the RFC; the author of verified code must strive for elegant, intuitive, easy-to-use APIs that minimize the risk of programmer error. Designing such a safe API requires more layers of abstraction, which increases the verification burden.

As an example, consider hash algorithms like SHA-2 [2] or Blake2 [10]. Such algorithms are block-based, which requires clients to obey a precise state machine, feeding the data to be hashed block by block, and invalidating the state when processing the final data. This is error-prone and almost certain to result in programmer mistakes. There exists a safer API with better usability, which various libraries aim to provide. The safe API maintains an internal buffer, and eliminates almost all possible mistakes when using it from C. Yet, even Blake's own safe reference API implementation [9] contained a buffer management bug that went undetected for seven years [34]!

The natural conclusion is that verified cryptographic libraries need to expose only safe, verified APIs. Unfortunately, none of them currently do. One reason is that the amount of effort required is tremendous: verifying a non-trivial, safe API for SHA2-256 is certainly feasible [7], but doing it over and over for a dozen similar algorithms is out of the question. Verifying safe APIs is tedious, repetitive, and until now did not lend itself well to automation.
A reason for the lack of automation is that the commonality between algorithms has not yet been formally captured. Consider HACL★, which at this time offers 17 algorithms, spread across about 40 implementations. We observe that at least a dozen of those implementations obey almost identical state machines and behave in a fashion almost identical to Blake2. Yet, a formal argument was never made to assert that these algorithms really do behave in the same way.

All of these shortcomings show up in existing work. For instance, EverCrypt offers only a single safe API for hashing, briefly mentioned but not explained in [38, XIII]. Lacking any better ways of automating those proofs, Protzenko et al. were not able to verify more than a single safe API (personal communication).

With the higher-order functor style we present in this paper, we are able to precisely express a transformation from any block algorithm to its corresponding specialized, low-level safe API. The functor, once applied, meta-evaluates away, meaning that the resulting API bears no trace of the original higher-order abstraction. The technique is generic, and the toolkit we present can be applied to e.g. a stream cipher parameterized over a block function; to an algorithm parameterized over a container obeying a particular interface; and to any situation in which multiple implementations of a core building block are interchangeable. With this technique, the user can "verify in the large", and apply the higher-order transformation at will. As a case in point, we apply our sample functor to all 14 instances of a block API in HACL★, and get a corresponding number of safe APIs "for free".

Swappable implementations. Cryptographic algorithms oftentimes come with multiple implementations, each optimized for a different instruction set, such as ARM NEON or AVX2. Such implementations are almost identical, and differ only in the choice of their meta-parameters, such as: buffer size; amount of data processed in each loop iteration; choice of vectorized instructions for implementing the core of the algorithm; etc.

It is only natural that the programmer should attempt to share as much as possible in terms of code and proofs, seeing that implementations exhibit a high degree of similarity. Once a generic implementation is devised, it ought to be simple to generate multiple copies of it, each one specialized for a different flavor of vectorization. But in the context of HACL★, two key constraints need to be obeyed. First, the structure of the call-graph needs to remain the same, that is, once specialized, functions should still exhibit calls to auxiliary definitions and helpers. Second, we wish to write the code once and for all; that is, the generic implementation should be generic enough that adding a new flavor of vectorization requires no change in the original implementation.

This problem can be solved with a code specialization logic that resembles what a C++ compiler does: when specializing a class template, an entire copy of the class is generated, optimized for the given template argument. In our case, we need a way to generate a copy of a given algorithm, with the same structure, except specialized for a choice of vectorization. Given our two constraints, and our stated goal of not modifying the F★ compiler, this turns out to be a technical challenge.

We find that the technical answer is the same as before, and consists in having a functor that can be applied to its meta-parameters so as to reduce into a specialized call-graph. This time, rather than requiring the user to manually author a functor for each algorithm that admits multiple specializations, we devise a tactic that automatically takes care of generating the functor by means of reflection and inspection of syntax, at meta-time.
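The C++-template analogy can be approximated in plain C to see what "a full copy of the call-graph per meta-parameter" means. The sketch below is only an illustration and not the mechanism used in HACL★, which relies on an F★ tactic and partial evaluation rather than the C preprocessor. The macro DEFINE_SUM, the toy byte-sum "algorithm", and the names sum_scalar and sum_wide4 are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical generic "algorithm": a byte sum that consumes LANES bytes per
   loop iteration. Each expansion stamps out one complete copy of the
   call-graph: a helper (sum_chunk_*) and a top-level function (sum_*) that
   still calls its own helper. */
#define DEFINE_SUM(SUFFIX, LANES)                                      \
  static uint32_t sum_chunk_##SUFFIX(const uint8_t *p) {               \
    uint32_t acc = 0;                                                  \
    for (size_t j = 0; j < (size_t)(LANES); j++) acc += p[j];          \
    return acc;                                                        \
  }                                                                    \
  uint32_t sum_##SUFFIX(const uint8_t *data, size_t len) {             \
    assert(data != NULL || len == 0);                                  \
    uint32_t acc = 0;                                                  \
    size_t i = 0;                                                      \
    for (; i + (size_t)(LANES) <= len; i += (size_t)(LANES))           \
      acc += sum_chunk_##SUFFIX(data + i);                             \
    for (; i < len; i++) acc += data[i]; /* leftover bytes */          \
    return acc;                                                        \
  }

/* Two specializations of the same generic code; adding a third flavor
   requires no change to DEFINE_SUM itself. */
DEFINE_SUM(scalar, 1)
DEFINE_SUM(wide4, 4)
```

Both specializations compute the same function and keep the same helper/top-level structure, which mirrors the two constraints stated above. Unlike the functor-based approach, nothing here is verified, and the macro body is not type-checked until it is expanded.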
This second, general technique enjoys a high degree of automation and can be applied to numerous situations that would call for, say, a templatized C++ class. In our case, we once again find a grand challenge in HACL★, and show that our technique can successfully express the Curve25519 algorithm in HACL★ as two nested functor applications for a maximum degree of genericity.

Overview of the paper. Our paper aims to introduce our techniques gradually, first starting with the concrete example of the hand-written block algorithm functor, before automating the functor creation via a meta-programming tactic. We thus adopt the following progression:
• we provide some detailed background (§2) on the F★-to-C toolchain that HACL★ uses;
• we explore the commonality problem (§3), also known as "agility" in cryptographic lingo, using type classes and dependent types as key technical devices; doing so, we capture what exactly a "block algorithm" means;
• equipped with a precise definition of a block algorithm, we show how to manually write a functor that generates a safe API for any block algorithm (§4);
• getting such higher-order code to compile to C without runtime overhead is a challenge; we list a constellation of techniques (§5) that, put together, completely eliminate the cost of high-level programming thanks to partial evaluation;
• manually writing functors can prove quite tedious; we use the full power of Meta-F★ (§6) to automate the functor encoding, allowing programmers to write generic code without any syntactic penalty;
• we evaluate the benefits of our techniques (§7), quantifying the verification effort for the programmer, as well as the performance impact on verification times.
Therefore, our contributions are: the precise specification of what "block algorithm" means; the principles and techniques that govern the encoding and extraction of higher-order stateful functors in F★; the design of the functor-generating tactic; and all of the technical implementation work that materializes the claims made in the paper. We provide an anonymous supplement that contains all of the technical development, along with pointers to definitions, lemmas and code referenced throughout the paper. We encourage readers to browse the source files; in particular, the implementation of the functor-generating tactic is not only the second largest tactic written in F★ to date, but also written in a near style. All of our contributions have been merged into HACL★ and are now used by flagship software projects such as the Tezos blockchain and the Election Guard toolkit for verifiable elections.

2 BACKGROUND: F★, LOW★, META-F★

HACL★ [37, 46] is a cryptographic library written in F★, which we use as a baseline to author, evaluate and integrate our proof techniques. HACL★ compiles to C, and offers vectorized versions of many algorithms via C compiler intrinsics, e.g. for targets that support AVX, AVX2 or ARM Neon. EverCrypt is a high-level API that multiplexes between HACL★ and Vale-Crypto [25] and supports dynamic selection of algorithms and implementations based on the target CPU's feature set. Combined with EverCrypt, HACL★ features 105k lines of F★ code for 72k lines of generated C code (this excludes comments and whitespace, and also excludes the Vale assembly DSL).

F★ is a state-of-the-art verification-oriented programming language. Hailing from the tradition of ML [33], F★ features dependent types and a user-extensible effect system, which allows reasoning about IO, concurrency, divergence, various flavors of mutability, or any combination thereof. For verification, F★ uses a weakest precondition calculus based on Dijkstra Monads [3, 43], which synthesizes verification conditions that are then discharged to the Z3 [20] SMT solver. Proofs in F★ typically are a mixture of manual reasoning (calls to lemmas), semi-automated reasoning (via tactics) and fully automated reasoning (via SMT).

Low★ is a subset of F★ that exposes a carefully curated set of features from the C language. Using F★'s effect system, Low★ models the C stack and heap, and allocations in those regions of the memory. Low★ also models data-oriented features of C, such as arrays, pointer arithmetic, machine integers with modulo semantics, const pointers, and many others via a set of distinguished libraries. Programming in Low★ guarantees spatial safety (no out-of-bounds accesses), temporal safety (no double frees, no use-after-free) and a form of side-channel resistance [39, 46]. All of these guarantees are enforced statically and incur no run-time checks.
To provide a flavor of programming in Low★, we present the mk_update_blake2 function below, taken from HACL★. Suffice it to say, for now, that Blake2 is a block algorithm, and comes in two algorithmic variants, captured by the first argument, either Blake2B or Blake2S. Furthermore, implementations of Blake2 admit multiple degrees of vectorization, captured by the second argument, either M32 (scalar, pure C code), M128 (using AVX or NEON instruction sets), or M256 (using AVX2). Anticipating slightly on §3.3, mk_update_blake2 implements the "update block" transition from Figure 1, for any choice of algorithmic variant and vectorization. The mk_ prefix is a hint that this function is intended to be applied to its two meta-parameters, so as to "make" a specialized update_blake2 variant. But for now, we focus on the various features of this function signature, which are representative of what one typically encounters in Low★.

  val mk_update_blake2 (a:blake_alg) (v:vectorization) (s:blake2_state a v)
    (totlen:U64.t) (data:block a): Stack U64.t
    (requires λh → live h s ∧ live h data ∧ disjoint s data)
    (ensures λh0 totlen' h1 → modifies1 s h0 h1 ∧
      (as_seq h1 s, U64.v totlen') ==
        Blake2.Spec.update a (as_seq h0 s) (U64.v totlen) (as_seq h0 data))

Functions in Low★ are annotated with their return effect, in this case Stack, which indicates the function only performs stack allocations, and therefore is guaranteed to have no memory leaks. For all other stateful cases, programmers may use the Low★ ST effect, which has the same signature, but permits heap allocation. (Functions without a return effect are understood to be total.) The return type of the function is U64.t, the type of 64-bit unsigned machine integers with modulo semantics.

Functions are specified using pre- and two-state post-conditions. Such specifications typically cover both memory and functional reasoning. For memory, Low★ relies on a modifies-clause theory [29]: the mk_update_blake2 pre-condition demands that the arrays data and s be disjoint, while the post-condition ensures that the only memory location affected by a call to mk_update_blake2 is s. If a client holds an array a, then the combination of disjoint a s and modifies1 s h0 h1 allows them to derive automatically that a is unchanged in h1. The liveness clauses ensure accesses to s and data are valid in the body of mk_update_blake2; that is, that both s and data point to allocated addresses that have not been de-allocated yet.

Importantly, the functional behavior of mk_update_blake2 is specified via Blake2.Spec.update, a pure function that forms our specification. In this case, we learn that after calling mk_update_blake2, s contains the updated state, and the return value totlen' contains the total number of bytes processed after the call has finished. This style is referred to as intrinsic reasoning, or implementation refinement: we prove that the low-level behavior of mk_update_blake2 is characterised by a pure, trusted and carefully audited specification. Various functions allow reflecting Low★ objects as their pure counterpart: in this case, as_seq reflects the contents of the array s in memory h1 as a pure sequence, and U64.v reflects a machine integer as a pure unbounded mathematical number. Functions such as as_seq live in the Ghost effect, meaning that they may only be used in proofs and refinements, and cannot appear at run-time.

Partial evaluation is a powerful, trusted mechanism in F★.
With it, the programmer can trigger various steps of reduction using F★'s trusted normalizer without having to perform any proof. F★ exposes keywords, attributes and functions to control this mechanism. To illustrate this, we revisit the example of mk_update_blake2. We now reveal that the function does not fit within the Low★ shallow embedding, owing to the definition of the blake2_state type.

  inline_for_extraction
  let element_t (a:blake_alg) (v:vectorization) =
    match a, v with
    | Blake2B, M256 → vec_t U64.t 4
    | ...
    | Blake2S, M32 → U32.t
    | Blake2B, M32 → U64.t

  inline_for_extraction
  let blake2_state a v = lbuffer (element_t a v) (4ul ∗. row_len a v)

The state type is indexed over two arguments, capturing the choice of algorithm a and implementation v. Here, lbuffer t l is a distinguished Low★ type that stands for an array of type t and length l (we elide the definition of row_len, which is similar in structure to element_t). Each element is in turn a value-dependent type: for scalar implementations (M32), elements are either 32- or 64-bit integers; for vectorized implementations, such as the 256-bit one, elements are vectors (we also elide the definition of vec_t) so as to enable SIMD processing.

Such a definition cannot be trivially extracted to C: C does not admit the kind of value-dependency that appears in the definition, and while some contrived, indirect compilation schemes are possible (such as untagged unions), they would be rejected by practitioners: in the case of scalar Blake2S, a C union would waste 50% of the space.

Fortunately, we can obtain a definition that fits within Low★ thanks to partial evaluation. The special "inline for extraction" attribute indicates to F★ that right before extraction, occurrences of blake2_state should be replaced with their definition. This in turn triggers more reduction steps, where matches are reduced, unreachable branches eliminated, and beta-redexes evaluated away, meaning that if we apply mk_update_blake2 to constant values for its first two arguments, we actually get Low★ code after partial evaluation. We make extensive use of the "inline for extraction" keyword throughout this paper, as it provides a convenient way to drastically limit code duplication.

This "agility" style is popular in Low★; as an example, HACL★ truly has a single implementation of Blake2, the one we just saw; in the actual codebase, the mk_update_blake2 function currently partially evaluates to four different implementations depending on the variant and the degree of vectorization. To obtain a specialized variant, it suffices to apply mk_update_blake2 to two constant arguments.

  let update_blake2s_32 = mk_update_blake2 Blake2S M32 // regular C
  let update_blake2s_128 = mk_update_blake2 Blake2S M128 // vectorized
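The space argument against a union-based encoding, and what specialization buys instead, can be made concrete in C. The sketch below is not HACL★ output: the type names, the helper wasted_bytes_32, and the 16-element state length are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STATE_WORDS 16 /* assumed state length, for illustration only */

/* Indirect encoding: one element type covering every scalar variant.
   A union is as large as its largest member. */
typedef union { uint32_t w32; uint64_t w64; } any_word;
typedef struct { any_word words[STATE_WORDS]; } generic_state;

/* Specialized encoding: one monomorphic array type per variant, which is
   the kind of first-order type that partial evaluation leaves behind. */
typedef struct { uint32_t words[STATE_WORDS]; } blake2s_scalar_state;
typedef struct { uint64_t words[STATE_WORDS]; } blake2b_scalar_state;

/* The 32-bit variant only ever uses half of the union-based state. */
_Static_assert(sizeof(generic_state) == 2 * sizeof(blake2s_scalar_state),
               "union-based state doubles the footprint of the 32-bit variant");

/* Bytes left unused when a 32-bit variant lives in the union-based state. */
size_t wasted_bytes_32(void) {
  assert(sizeof(generic_state) >= sizeof(blake2s_scalar_state));
  return sizeof(generic_state) - sizeof(blake2s_scalar_state);
}
```

With the specialized types, each variant pays only for the element width it actually uses, at the price of having one type (and one copy of the code) per variant.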

After partial evaluation, all that remains is a version of mk_update_blake2 that has been specialized on its first two arguments; the type blake2_state is replaced by a first-order, monomorphic array type, which maps naturally onto a C type. The code is now Low★ after partial evaluation.

We remark that the "inline for extraction" mechanism applies to functions, including stateful ones: F★ introduces an A-normal form [41], meaning a stateful call f e becomes let x = e in f x, which then allows β-reduction since x is a value.

Beyond this keyword, other mechanisms exist for partial evaluation. The [@inline_let] attribute allows reducing pure let-bindings inside function definitions (including stateful ones), and the normalize call allows very fine-grained control of the reduction flags for a sub-term. The definition of mk_update_blake2 uses both.

Meta-F★ is a recent extension of F★ [31] that allows the programmer to script the F★ compiler using user-written F★ programs, an approach known as elaborator reflection and pioneered by Lean [21] and Idris [17]. Meta-F★ offers, by design, a safe API for term manipulation, meaning it re-checks the results of meta-program execution: if a meta-program attempts to synthesize an ill-typed term, F★ aborts. Therefore, tactics do not need to be statically proven correct and enjoy a great deal of flexibility. Tactics are written in the Tac effect, and in this paper we use the terms "tactic" and "meta-program" interchangeably.

Erasure and extraction in F★ follow Letouzey's extraction principles for Coq [30]. After type-checking and performing partial evaluation, F★ erases computationally-irrelevant code and performs extraction to an intermediary representation dubbed the "ML AST".

For erasure, F★ eliminates refinements, pre- and post-conditions, and generally replaces computationally irrelevant terms with units, i.e. any subexpression of type erased becomes ().
F★ also removes calls to unit-returning functions, which means that calls to lemmas are also eliminated.

For extraction, F★ ensures that the "ML AST" features only prenex polymorphism (i.e. type schemes), and that it is annotated with classic ML types. Naturally, not all F★ programs are typeable per the ML rules: using a bidirectional approach, extraction inserts necessary casts, and replaces types that are invalid in ML with ⊤, the uninformative type.

The ML AST can then be further compiled to three different targets. Going to OCaml, ⊤ becomes Obj.t and casts become calls to Obj.magic; owing to OCaml's uniform value representation, any F★ program can be compiled to OCaml. Going to F#, only a subset of casts are admissible, since the .NET bytecode is typed. Finally, going to C only works for programs that are free of casts, as C does not admit a ⊤ type.

KreMLin [39] compiles the "ML AST" to readable, auditable C by using a series of small, composable passes. The KreMLin preservation theorem [39] states that the safety guarantees in Low★ carry over to the generated C code. We present a few transformations that are relevant to this work.


Fig. 1. State machine of an error-prone block-based API (transitions: alloc, init, update block, update last, finish, free).

The ML AST supports parameterized data types, such as pairs, tuples, and user-defined inductives, e.g. type ('a, 'b) pair = Pair of 'a ∗ 'b. First, KreMLin removes unused type parameters. This is important for deeply dependent inductive indices: as long as they only appear in refinements, the resulting ⊤ type parameter in the ML AST is eliminated by KreMLin. Next, KreMLin performs a whole-program analysis and monomorphizes types, functions and polymorphic equalities based on usage at call-sites. This process is similar to MLton [45].

KreMLin also performs unit elimination: unit arguments are removed from functions and unit fields are removed from data types. This, in particular, means that a data type that features a Ghost field will incur no penalty in C, since Ghost is erased by F★ to unit. Furthermore, KreMLin features five different compilation schemes for inductive types. The default uses a tagged-union scheme, but for single-case inductives, a clean C struct is used rather than wasting space for the (useless) tag. Furthermore, if the constructor itself takes a single argument, the struct is altogether eliminated.

The combination of these features is of particular importance for this paper, as we rely on these optimized compilation schemes to make sure that, by the time we get to C, no trace is left of the various ghost fields of our inductive types. For a concrete example of how the combination of these features operates, consider the following F★ type t, parameterized over a type a at universe 0, and a value p.

  type t (a: Type0) (p: ... value type ...) =
  | Foo: bar:a → baz: erased (... p ... x ...) → t a p

  let t' = t U32.t (...)

Erasure replaces the baz field of constructor Foo with unit, since erased denotes computationally-irrelevant subterms. Extraction then generates an ML-style polymorphic data type with two type parameters, one for the value type a, and another one for p. In OCaml syntax, this would be type ('a, 'p) t = Foo of 'a ∗ unit. KreMLin then takes over, and performs a whole-program monomorphization, meaning that t' becomes a specialized instance of t with a = U32.t. Furthermore, KreMLin eliminates the uninformative baz field of type unit. Finally, seeing that t' is an inductive with a single constructor that takes a single argument, KreMLin compiles t' to C's uint32_t without introducing any C struct or tag.
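For readers more at home in C, three of the compilation schemes mentioned above can be pictured as follows. This is a hand-written sketch and not actual KreMLin output: every type and function name below (shape, pair_u32, t_prime, mk_foo, foo_bar, shape_value) is hypothetical, and the generated code in practice follows KreMLin's own naming and layout conventions.

```c
#include <assert.h>
#include <stdint.h>

/* (1) Several constructors: the default scheme, a tag plus a union. */
typedef enum { SHAPE_SMALL, SHAPE_LARGE } shape_tag;
typedef struct {
  shape_tag tag;
  union { uint32_t small; uint64_t large; } payload;
} shape;

/* Reading a tagged union requires inspecting the tag first. */
static inline uint64_t shape_value(shape s) {
  assert(s.tag == SHAPE_SMALL || s.tag == SHAPE_LARGE);
  return s.tag == SHAPE_SMALL ? s.payload.small : s.payload.large;
}

/* (2) One constructor, several fields: a plain struct, no tag needed. */
typedef struct { uint32_t fst; uint32_t snd; } pair_u32;

/* (3) One constructor, one remaining field: no struct at all. This is the
   situation described for t' once the erased baz field has been removed. */
typedef uint32_t t_prime;

/* Constructing and projecting are then both the identity. */
static inline t_prime mk_foo(uint32_t bar) { return bar; }
static inline uint32_t foo_bar(t_prime x) { return x; }
```

In the third case the inductive type has no run-time footprint beyond its payload, which is what allows ghost fields to be used freely in F★ without any cost in the generated C.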

3 THE ESSENCE OF AGILITY

We posited earlier that many algorithms exhibit large amounts of commonality, and essentially behave the same way. We provide some basic cryptographic context, then show how dependent types allow us to precisely describe what a block API is in F★.


[Figure 2 diagram: a state machine with a single state, with transitions alloc, reinit, update, finish and free.]

Fig. 2. State machine of a safe, streaming API

3.1 Crafting safe APIs for C clients

Many cryptographic algorithms offer identical or similar functionalities. For example, SHA2 [2], SHA3 [23] and Blake2 [10, 40] (in no-key mode) all implement the hash functionality, taking an input text to compute a resulting digest. As another example, HMAC [13], Poly1305 [15], GCM [32] and Blake2 implement the message authentication code (MAC) functionality, taking an input text and a key to compute a digest. At a high level, these functionalities are simply black boxes with one or two inputs, and a single output. Taking HACL★’s SHA2-256 implementation as an example, this results in a natural, self-explanatory C API:

    void sha2_256(uint8_t ∗input, uint32_t input_len, uint8_t ∗dst);

This “one-shot” API, however, places unrealistic expectations on clients of this library. For instance, the TLS protocol, in order to authenticate messages, computes repeated intermediary hashes of the handshake data transmitted so far. Using the one-shot API would be grossly inefficient, as it would require re-hashing the entire handshake data every single time. In other situations, such as Noise protocols, merely hashing the concatenation of two non-contiguous arrays with this API requires a full copy into a contiguous array.

Cryptographic libraries thus need to provide a different API that allows clients to perform incremental hash computations. A natural candidate for this is the block API: all of the algorithms we mentioned above are block-based, meaning that under the hood, they follow the state machine from Figure 1: after allocating internal state (alloc), they initialize it (init), process the data (update_block) block-by-block (for an algorithm-specific block size), perform some special treatment for the leftover data (update_last), then extract the internal state (finish) onto a user-provided destination buffer, which holds the final digest. Revealing this API would allow clients to feed their data into the hash incrementally.

The issue with this block API is that it is wildly unsafe to call from unverified C code. First, it requires clients to maintain a block-sized buffer, that once full must be emptied via a call to update_block. This entails non-trivial modulo-arithmetic computations and pointer manipulations, which are error-prone [34]. Second, clients can very easily violate the state machine. For instance, when extracting an intermediary hash, clients must remember to copy the internal hash state, call the sequence update_last / finish on the copy, free that copy, and only then resume feeding more data into the original hash state. Third, algorithms exhibit subtle differences: for instance, Blake2 must not receive empty data for update_last, while SHA2 is absolutely fine. In short, the block API is error-prone, confusing, and is likely to result in programmer mistakes.

We thus wish to take all of the block-based algorithms, and devise a way to wrap their respective block APIs into a uniform, safe API that eliminates all of the pitfalls above. We dub this safe API the streaming API (Figure 2): it has a degenerate state machine with a single state; it performs buffer management under the hood; it hides the differences between algorithms; and performs necessary copies as-needed when a digest needs to be extracted.


type stateful = | Stateful:
  (∗ Low-level type ∗)
  s: Type0 →

  footprint: (h:mem → s:s → Ghost loc) →
  invariant: (h:mem → s:s → Type) →

  (∗ A pure representation of an s ∗)
  t: Type0 →
  v: (h:mem → s:s → Ghost t) →

  (∗ Adequate framing lemmas ∗)
  invariant_loc_in_footprint: (h:mem → s:s → Lemma
    (requires (invariant h s))
    (ensures (loc_in (footprint h s) h))) →

  frame_invariant: (l:loc → s:s → h0:mem → h1:mem → Lemma
    (requires (invariant h0 s ∧ loc_disjoint l (footprint h0 s) ∧
      modifies l h0 h1))
    (ensures (invariant h1 s ∧ v h0 s == v h1 s ∧
      footprint h1 s == footprint h0 s))) →

  (∗ Stateful operations ∗)
  alloca: (... → Stack ...) →
  malloc: (... → ST ...) →
  free: (... → Stack ...) →
  copy: (... → Stack ...) →

  stateful

Fig. 3. The stateful API

Writing and verifying a copy of the streaming API for each one of the eligible algorithms would be tedious, not very much fun, and bad proof engineering. We thus aim for a generic description of what a block API is, using a multi-field inductive that describes in one go a block algorithm’s stateful API and intended specification. This step is a prerequisite for us to be able to state the type of the argument of our upcoming functor.

3.2 The essence of stateful data

Before we get to the block API itself, we need to capture a more basic notion, that of an abstract piece of data that lives in memory, composes with the Low★ memory model and modifies-clause theory, and supports basic operations such as allocation, de-allocation and copy. This is the stateful type class, presented in Figure 3. It captures many of the ideas presented in §2, except now in an abstract fashion. The low-level type s (e.g. lbuffer U8.t 64ul) comes with an abstract footprint (e.g. the extent of that array), and an abstract invariant (e.g. the array is live). The low-level type can be reflected as a pure value of type t (e.g. a sequence) using a v function (e.g. as_seq, see §2). The administrative lemmas allow harmonious interaction with Low★’s modifies-clause theory; the first captures via loc_in an abstract notion of liveness which statefuls must observe, and which allows clients to automatically derive disjointness of a fresh allocation. The second lemma allows automatic framing of the invariant thanks to a suitable SMT pattern (elided here). The stateful operations allow, respectively, allocating a fresh state s on the stack and on the heap; freeing a heap-allocated state; and copying the state.

Writing instances of stateful is easy, the most complex one being the internal state of Blake2 which occupies 46 lines of code, with all proofs going through automatically.

A procedural note: stateful is, for all intents and purposes, a type class [42], but we do not use the type class syntax of F★ which did not work with Low★ code when we started this work. We directly use a single-constructor inductive which is what the type class syntax desugars to. We plan to switch to the class keyword once all the bugs are fixed in F★.

3.3 The essence of block algorithms

We now capture the essence of a block algorithm by authoring a type class that encapsulates a block algorithm’s types, representations, specifications, lemmas and stateful implementations in one go. We need the block type class to capture four broad traits of a block algorithm, namely i) explain the runtime representation and spatial characteristics of the block algorithm, ii) specify as pure functions the transitions in the state machine, iii) reveal the block algorithm’s central lemma, i.e. processing the input data block by block is the same as processing all of the data in one go, and iv) expose the low-level run-time functions that realize the transitions in the state machine. The result appears in Figure 4; for conciseness, we omit the full statement of the fold lemma, as well as the stateful type of the remaining transitions of the state machine. The actual definition is about 150 lines of F★, and appears in the anonymous supplement.

Run-time characteristics. A block algorithm revolves around its state, of type stateful. A block algorithm may need to keep a key at run-time (km = Runtime, e.g. Poly1305), or keep a ghost key for specification purposes (km = Erased, e.g. keyed Blake2), or may need no key at all, in which case the key field is a degenerate instance of stateful where key.s = unit.

Specification. Using state.t, i.e. the algorithm’s state reflected as pure value, we specify each transition of the state machine at lines 18-21. Importantly, rather than specify an “update block” function, we use an “update multi” function that can process multiple blocks at a time. We don’t impose any constraints on how update_multi is authored: we just request that it obeys the fold law via the lemma at line 23, i.e.

    update_multi_s (update_multi_s s l1 b1) (l1 + length b1) b2 == update_multi_s s l1 (concat b1 b2).

This style has several advantages. First, this leaves the possibility for optimized algorithms that process multiple blocks at a time to provide their own update_multi function, rather than being forced to inefficiently process a single block. For unoptimized algorithms that are authored with a stateful update_block, we provide a higher-order combinator that derives an update_multi function and its correctness lemma automatically. Second, by being very abstract about how the blocks are processed, we capture a wide range of behaviors for block algorithms. For instance, Poly1305 has immutable internal state for storing precomputations, along with an accumulator that changes with each call to update_block: we simply pick state.t to be a pair, where the fold only operates on the second component.
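The fold law can be checked on concrete inputs with a small C sketch. The toy transition below is not a cryptographic function, the prevlen argument of the specification is left out for brevity, and every name (BLOCK_LEN, toy_update_block, toy_update_multi, fold_law_holds) is hypothetical. The sketch also shows, informally, how an update_multi can be derived from an update_block by a simple loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_LEN 4u /* hypothetical block size of the toy algorithm */

/* Toy single-block transition: fold one 4-byte block into the state. */
uint32_t toy_update_block(uint32_t st, const uint8_t *block) {
  uint32_t w = (uint32_t)block[0] | ((uint32_t)block[1] << 8) |
               ((uint32_t)block[2] << 16) | ((uint32_t)block[3] << 24);
  return (st ^ w) * 16777619u;
}

/* update_multi derived from update_block: process every full block in turn. */
uint32_t toy_update_multi(uint32_t st, const uint8_t *blocks, size_t len) {
  assert(len % BLOCK_LEN == 0);
  for (size_t i = 0; i < len; i += BLOCK_LEN)
    st = toy_update_block(st, blocks + i);
  return st;
}

/* Fold law on one input: processing data[0..cut) and then data[cut..len)
   must give the same state as processing data[0..len) in one call. */
int fold_law_holds(uint32_t st, const uint8_t *data, size_t len, size_t cut) {
  assert(len % BLOCK_LEN == 0 && cut % BLOCK_LEN == 0 && cut <= len);
  uint32_t two_calls =
      toy_update_multi(toy_update_multi(st, data, cut), data + cut, len - cut);
  uint32_t one_call = toy_update_multi(st, data, len);
  return two_calls == one_call;
}
```

A test over a handful of inputs is of course much weaker than the lemma, which holds for all states and all block-aligned inputs; the sketch only illustrates what the law says.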

The block lemma. The spec_is_incremental lemma captures the key correctness condition and ties all of the specification functions together: for a given piece of data, the result hash1, obtained via the incremental state machine from Figure 1, is the same as calling the one-shot specification spec_s.


 1  type block = | Block:
 2    km: key_management (∗ km = Runtime ∨ km = Erased ∗) →
 3
 4    (∗ Low-level types ∗)
 5    state: stateful →
 6    key: stateful →
 7
 8    (∗ Some lengths used in the types below ∗)
 9    max_input_length: x:nat { 0 < x ∧ x < pow2 64 } →
10    output_len: x:U32.t { U32.v x > 0 } →
11    block_len: x:U32.t { U32.v x > 0 } →
12
13    (∗ The one-shot specification ∗)
14    spec_s: (key.t → input:seq U8.t { length input ≤ max_input_length } →
15      output:seq U8.t { length output == U32.v output_len }) →
16
17    (∗ The block specification ∗)
18    init_s: (key.t → state.t) →
19    update_multi_s: (state.t → prevlen:nat → s:seq U8.t { length s % U32.v block_len = 0 } → state.t) →
20    update_last_s: (state.t → prevlen:nat → s:seq U8.t { length s ≤ U32.v block_len } → state.t) →
21    finish_s: (key.t → state.t → s:seq U8.t { length s = U32.v output_len }) →
22
23    update_multi_is_a_fold: ... →
24
25    (∗ Central correctness lemma of a block algorithm ∗)
26    spec_is_incremental: (key: key.t →
27      input:seq U8.t { length input ≤ max_input_length } → Lemma (
28        let bs, l = split_at_last (U32.v block_len) input in
29        let hash0 = update_multi_s (init_s key) 0 bs in
30        let hash1 = finish_s key (update_last_s hash0 (length bs) l) in
31        hash1 == spec_s key input)) →
32
33    update_multi: (s:state.s → prevlen:U64.t →
34      blocks:buffer U8.t { length blocks % U32.v block_len = 0 } →
35      len: U32.t { U32.v len = length blocks ∧ ... (∗ omitted ∗) } →
36      Stack unit
37        (requires ... (∗ omitted ∗))
38        (ensures (λ h0 _ h1 →
39          modifies (state.footprint h0 s) h0 h1 ∧
40          state.footprint h0 s == state.footprint h1 s ∧
41          state.invariant h1 s ∧
42          state.v h1 s == update_multi_s
43            (state.v h0 s) (U64.v prevlen) (as_seq h0 blocks) ∧ ...)))
44    → ... (∗ rest of the block typeclass, e.g. init, finish... ∗) → block

Fig. 4. The block API

This lemma relies on a helper, split_at_last, which was carefully crafted to subsume the different behaviors between Blake2 and other block algorithms.

    let split_at_last (block_len: U32.t) (b:seq U8.t) =
      let n = length b / block_len in
      let rem = length b % block_len in
      let n = if rem = 0 && n > 0 then n − 1 else n in
      let blocks, rest = split b (n ∗ block_len) in
      blocks, rest
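The arithmetic of split_at_last is easy to get wrong by one block, so a worked version helps. The C sketch below computes only the two lengths (how many bytes go to the full blocks, how many are left over); the split_t type and the function's C signature are hypothetical. The point to notice is that, when the input length is a non-zero multiple of the block size, the last full block is kept as leftover data; a plausible reading of the text is that this is what keeps update_last from receiving empty data for non-empty inputs, which Blake2 requires.

```c
#include <assert.h>
#include <stddef.h>

/* Lengths of the two halves returned by split_at_last. */
typedef struct {
  size_t blocks_len; /* bytes handed to update_multi (a multiple of block_len) */
  size_t rest_len;   /* bytes handed to update_last (at most block_len) */
} split_t;

split_t split_at_last(size_t block_len, size_t len) {
  assert(block_len > 0);
  size_t n = len / block_len;
  size_t rem = len % block_len;
  /* Exact multiple: hold back one full block for update_last. */
  if (rem == 0 && n > 0)
    n -= 1;
  split_t r = { n * block_len, len - n * block_len };
  return r;
}
```

For a 64-byte block size: 0 bytes split as (0, 0), 10 bytes as (0, 10), 64 bytes as (0, 64), 100 bytes as (64, 36), and 128 bytes as (64, 64).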

Stateful implementations. We now zoom in on the update_multi low-level signature, which describes a block algorithm’s run-time processing of multiple blocks in one go (Figure 4). This function is characterized by the spec-level update_multi_s; it preserves the invariant as well as the footprint; and only affects the block algorithm’s state when called.

The combination of spec_is_incremental along with the Low★ signatures of update_multi and others restricts the API in a way that the only valid usage is dictated by Figure 1. Designing this type class while looking at a wide range of algorithms has forced us to come up with a precise, yet general enough description of what a block algorithm is: we have been able to author instances of this type class for SHA2 (4 variants), Blake2 (4 variants), Poly1305 (3 variants), and legacy algorithms MD5 and SHA1. This includes the vectorized variants of these algorithms, when available.

4 A STREAMING FUNCTOR

Equipped with an accurate and precise description of what a block algorithm is, we are now ready to write an API transformer that takes an instance of block, implementing the state machine from Figure 1, and returns the safe API from Figure 2. We call this API transformer a functor, since once applied to a block it generates type definitions for the internal state, specifications, correctness lemmas and of course the five low-level, runtime functions that implement the transitions from Figure 2. Since F★ has no native support for functors, we describe a somewhat manual encoding; §6 shows how to automate this encoding with Meta-F★.

The streaming functor’s state is naturally parameterized over a block and wraps the block algorithm’s state with several other fields.

    [@ CAbstractStruct ]
    type state_s (c: block) = | State:
      block_state: c.state.s →
      buf: B.buffer U8.t { B.len buf = c.block_len } →
      total_len: U64.t →
      seen: erased (S.seq U8.t) →
      p_key: optional_key c.km c.key →
      state_s c

    let state (c: block) = pointer (state_s c)

The CAbstractStruct attribute ensures that the following C code will appear in the header. This pattern is known as “C abstract structs”, i.e. the client cannot allocate structs or inspect private state, since the definition of the type is not known; it can only hold pointers to that state, which forces them to go through the API and provides a modicum of abstraction.

    struct state_s;
    typedef struct state_s ∗state;
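For readers less familiar with the idiom, here is a minimal, single-file C sketch of an abstract struct. The counter type and its four functions are hypothetical and unrelated to the generated HACL★ code; in a real project the first half would live in a header and the second half in a .c file.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* --- What the header would expose --- */
struct counter_s;                 /* incomplete type: the layout stays hidden */
typedef struct counter_s *counter;

counter counter_create(void);
void counter_add(counter c, uint64_t n);
uint64_t counter_total(counter c);
void counter_free(counter c);

/* --- What only the implementation file would contain --- */
struct counter_s {
  uint64_t total_len;
};

counter counter_create(void) {
  counter c = malloc(sizeof *c);
  if (c != NULL)
    c->total_len = 0;
  return c;
}

void counter_add(counter c, uint64_t n) {
  assert(c != NULL);
  c->total_len += n;
}

uint64_t counter_total(counter c) {
  assert(c != NULL);
  return c->total_len;
}

void counter_free(counter c) { free(c); }
```

A client that only sees the declarations cannot write `sizeof(struct counter_s)` or `c->total_len`: both are rejected by the compiler because the type is incomplete, so every access goes through the four functions.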

First, buf is a block-sized internal buffer, which relieves the client of having to perform modulo computations and buffer management. Once the buffer is full, the streaming functor calls the underlying block algorithm’s update_multi function, which effectively folds the blocks into the block_state.


The total_len field keeps track of how much data has been fed so far. (We explain the optional_∗ helpers in §4.1.) The most subtle point is the use of a ghost (erased, i.e. computationally-irrelevant) sequence of bytes, which keeps track of the past, i.e. the bytes we have fed so far into the hash. This is reflected in the functor’s invariant, which states that if we split the input data into blocks, then the current block algorithm state is the result of accumulating all of the blocks into the block state; the rest of the data that doesn’t form a full block is stored in buf.

    let state_invariant (c: block) (h:mem) (s:state c) =
      let s = deref h s in
      let State block_state buffer total_len seen key = s in
      let blocks, rest = split_at_last (U32.v c.block_len) seen in
      (∗ omitted ∗) ... ∧
      c.state.v h block_state ==
        c.update_multi_s (c.init_s (optional_reveal h key)) 0 blocks ∧
      slice (as_seq h buffer) 0 (length rest) == rest
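The invariant can be exercised with a hand-written C sketch of a streaming wrapper around a toy block algorithm. This is a simplification made under several assumptions: the toy transitions are not cryptographic, there is no key, update copies byte ranges one buffer-fill at a time rather than handing large inputs directly to update_multi, and all names (toy_*, stream_*, BLOCK_LEN, TOY_INIT) are hypothetical. What it does preserve is the shape of the invariant: block_state has absorbed exactly the blocks half of split_at_last, and buf holds the rest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_LEN 4u          /* hypothetical block size */
#define TOY_INIT 0x811c9dc5u  /* hypothetical initial state */

/* Toy block algorithm: one transition per full block. */
static uint32_t toy_update_block(uint32_t st, const uint8_t *b) {
  uint32_t w = (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
               ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
  return (st ^ w) * 16777619u;
}

/* Toy treatment of the leftover data, fused with finalization. */
static uint32_t toy_update_last(uint32_t st, const uint8_t *rest,
                                size_t rest_len, uint64_t total_len) {
  for (size_t i = 0; i < rest_len; i++)
    st = (st ^ rest[i]) * 16777619u;
  return st ^ (uint32_t)total_len;
}

/* One-shot reference, playing the role of spec_s. */
uint32_t toy_hash(const uint8_t *data, size_t len) {
  size_t n = len / BLOCK_LEN;
  if (len % BLOCK_LEN == 0 && n > 0)
    n--; /* same split as split_at_last */
  uint32_t st = TOY_INIT;
  for (size_t i = 0; i < n; i++)
    st = toy_update_block(st, data + i * BLOCK_LEN);
  return toy_update_last(st, data + n * BLOCK_LEN, len - n * BLOCK_LEN, len);
}

typedef struct {
  uint32_t block_state;   /* state after folding the "blocks" half */
  uint8_t buf[BLOCK_LEN]; /* the "rest" half */
  uint64_t total_len;     /* number of bytes fed so far */
} stream_t;

void stream_init(stream_t *s) {
  s->block_state = TOY_INIT;
  memset(s->buf, 0, sizeof s->buf);
  s->total_len = 0;
}

/* Number of bytes currently held in buf; a full buffer is kept as-is. */
static size_t stream_buffered(const stream_t *s) {
  size_t r = (size_t)(s->total_len % BLOCK_LEN);
  return (r == 0 && s->total_len > 0) ? BLOCK_LEN : r;
}

void stream_update(stream_t *s, const uint8_t *data, size_t len) {
  while (len > 0) {
    size_t used = stream_buffered(s);
    if (used == BLOCK_LEN) {
      /* More data is coming, so the buffered block is no longer the last. */
      s->block_state = toy_update_block(s->block_state, s->buf);
      used = 0;
    }
    size_t n = BLOCK_LEN - used;
    if (n > len)
      n = len;
    memcpy(s->buf + used, data, n);
    s->total_len += n;
    data += n;
    len -= n;
  }
}

/* Works on a copy of block_state, so the stream remains usable afterwards. */
uint32_t stream_finish(const stream_t *s) {
  uint32_t tmp = s->block_state;
  return toy_update_last(tmp, s->buf, stream_buffered(s), s->total_len);
}
```

With this sketch, a client can call stream_finish after any prefix of the data and then keep feeding more bytes; each intermediate digest agrees with toy_hash on the bytes seen so far, which is the property the functor proves once and for all.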

The mk_finish function takes a block algorithm c and returns a suitable finish function usable with a state c. Under the hood, it calls c.state.copy to avoid invalidating the block_state; then c.update_last followed by c.finish, the last two transitions of Figure 1. Thanks to the correctness lemma in c along with the invariant, mk_finish states that the digest written in dst is the result of applying the full block algorithm to the data that was fed into the streaming state so far.

    val mk_finish :
      c:block →
      s:state c →
      dst:B.buffer U8.t { B.len dst == c.output_len } →
      Stack unit
        (requires λ h0 → ... (∗ omitted ∗))
        (ensures λ h0 s' h1 → ... ∧ (∗ some omitted ∗)
          as_seq h1 dst == c.spec_s (get_key c h0 s) (get_seen c h0 s))

One point of interest is the usage of a ghost selector get_seen, which in any heap returns the bytes seen so far. We have found this style the easiest to work with, as opposed to a previous iteration of our design where the user was required to materialize the previously-seen bytes as a ghost argument to the stateful functions, such as mk_finish above. The previous iteration placed a heavy burden on clients, who were required to perform some syntactically heavy book-keeping; the present style is much more lightweight. (The anonymous supplement contains an in-depth explanation of the respective merits of the three styles we considered, in file Hacl.Streaming.Functor.fsti.)

This streaming API has two limitations. First, we cannot prove the absence of memory leaks. This is a fundamental limitation of Low★, which cannot show that a malloc followed by a free is morally equivalent to being in the Stack effect. However, this can easily be addressed with manual code review or off-the-shelf tools, such as clang’s LeakSanitizer (−fsanitize=leak). The second is that there is still a source of unsafety for C clients: they may exceed the maximum amount of data that can be fed into the block algorithm. This is purely a design decision: since the limit is never less than 2^61 bytes (that’s two million terabytes), we chose to not penalize the vast majority of C clients who will never exceed that limit, and leave it up to clients who may encounter such extreme cases to perform length-checking themselves. (For that purpose, we could of course write another functor that transforms the streaming API into a “very safe” streaming API with the additional check.)
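The length check that such a “very safe” wrapper would add is short, but it has to be written so that the comparison itself cannot overflow. The C sketch below is one way to do it; the constant uses the 2^61 lower bound quoted above as a stand-in for an algorithm-specific maximum, and the names MAX_INPUT_LEN, stream_status and check_update_len are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the algorithm's maximum input length (assumption: 2^61 bytes). */
#define MAX_INPUT_LEN ((uint64_t)1 << 61)

typedef enum { STREAM_OK = 0, STREAM_TOO_LONG = 1 } stream_status;

/* Decide whether `len` more bytes may be fed after `total_len` bytes.
   The subtraction cannot wrap because total_len never exceeds the maximum,
   whereas computing total_len + len first could overflow for larger limits. */
stream_status check_update_len(uint64_t total_len, uint32_t len) {
  assert(total_len <= MAX_INPUT_LEN);
  if ((uint64_t)len > MAX_INPUT_LEN - total_len)
    return STREAM_TOO_LONG;
  return STREAM_OK;
}
```

A checked update would call this first and return the status to the caller instead of proceeding, at the cost of one comparison per call and a non-void return type.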

4.1 A note on meta-arguments

We now focus on the usage of meta-level arguments, which act as tweaking knobs to control the shape of the streaming API. Clients can of course choose a suitable block size and suitable types for the block state and key representation, which influences the result of the functor application. But the key management is of particular interest.

    type key_management = | Runtime | Erased

    inline_for_extraction
    let optional_key (km: key_management) (key: stateful): Type =
      match km with
      | Runtime → key.s
      | Erased → Ghost.erased key.t

The km parameter of the block type class is purely meta, and will never be examined at run- time. It allows the block algorithm to indicate whether it needs a key. In the streaming code, every reference to key goes through a wrapper like the one above. After partial application, the optional_∗ wrappers reduce to either a proper key type, or to a ghost value, which then gets erased to unit, which means the key field of the state entirely disappears thanks to KreMLin’s unit field elimination. The streaming API’s init function unconditionally takes a key at run-time; but for algorithms like hashes, it suffices to pick c.key.s = unit and the superfluous argument to init gets eliminated too.

4.2 Runtime agility

In reality, both the block type class and the streaming functor are parametric over an argument dyn: Type0. Within the block type class, types such as s or t receive a parameter of type dyn; and so do functions such as v or spec_s. All usages of the dynamic argument in the block type class are ghost, except for the init function. This allows the user to implement run-time agility, as follows. For the block type class, the user may pick:

    type dyn = | SHA2_256 | Blake2B | ...

    type s (d: dyn) =
    | SHA2: sha256_state → s SHA2_256
    | Blake2B: blake2b_state → s Blake2B
    | ...

The block’s init function receives a dyn at run-time and can dynamically record in the state which variant of a hash algorithm is being used; later calls to block functions, such as update_multi, inspect the state and dynamically dispatch a call to the right function.

Applying the streaming functor to such an instance of the block type class yields an API for incremental hashing that accepts any algorithm at run-time. Using this, we trivially re-implement EverCrypt’s old incremental hashing module [38, XIII], making it another application of the streaming functor.
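In C terms, this kind of run-time agility amounts to a tagged union plus a switch. The sketch below uses two toy “algorithms” (a XOR and a sum over the input bytes) in place of real hash functions; the names toy_alg, toy_state, toy_init, toy_update and toy_finish are hypothetical and do not correspond to EverCrypt's API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The run-time counterpart of the dyn index. */
typedef enum { TOY_XOR, TOY_SUM } toy_alg;

/* The state records which algorithm it belongs to. */
typedef struct {
  toy_alg tag;
  union {
    uint8_t xor_acc;
    uint32_t sum_acc;
  } st;
} toy_state;

/* init is the only function that receives the algorithm explicitly. */
void toy_init(toy_state *s, toy_alg a) {
  s->tag = a;
  if (a == TOY_XOR)
    s->st.xor_acc = 0;
  else
    s->st.sum_acc = 0;
}

static void xor_update(uint8_t *acc, const uint8_t *d, size_t n) {
  for (size_t i = 0; i < n; i++)
    *acc ^= d[i];
}

static void sum_update(uint32_t *acc, const uint8_t *d, size_t n) {
  for (size_t i = 0; i < n; i++)
    *acc += d[i];
}

/* Later calls inspect the tag and dispatch to the matching implementation. */
void toy_update(toy_state *s, const uint8_t *data, size_t len) {
  assert(s != NULL);
  switch (s->tag) {
  case TOY_XOR:
    xor_update(&s->st.xor_acc, data, len);
    break;
  case TOY_SUM:
    sum_update(&s->st.sum_acc, data, len);
    break;
  }
}

uint32_t toy_finish(const toy_state *s) {
  return s->tag == TOY_XOR ? (uint32_t)s->st.xor_acc : s->st.sum_acc;
}
```

The client chooses the algorithm once, at initialization, and every subsequent call has the same signature regardless of that choice.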

4.3 Support for vectorization

In reality, the streaming functor holds an internal buffer whose length is not just the block size, but a multiple of the block size. The multiple is a meta-level parameter, or tweaking knob; the client chooses how big the internal buffer ought to be. Typically, for vectorized algorithms, we choose the internal buffer to have the optimal size that guarantees that the fast, SIMD vectorized path is always taken. This is another instance in which we relieve unverified clients from tedious and error-prone book-keeping.


 1  let mk_finish (c: block) (p: state) dst =
 2    let _ = c.update_multi_is_a_fold in
 3
 4    let h0 = ST.get () in
 5    let State block_state buf_ total_len seen k' = !∗p in
 6
 7    push_frame ();
 8    let h1 = ST.get () in
 9    c.state.frame_invariant B.loc_none block_state h0 h1;
10
11    let r = rest c total_len in
12    let buf_ = B.sub buf_ 0ul r in
13    assert ((U64.v total_len − U32.v r) % U32.v c.block_len = 0);
14
15    let tmp_block_state = c.state.alloca () in
16    c.state.copy block_state tmp_block_state;
17    ...

Fig. 5. The functor’s mk_finish function

5 FROM HIGH-LEVEL FUNCTORS TO LOW★ CODE

We now turn our attention to extraction, and explain how to carefully tweak the streaming functor so that, once applied to a given block algorithm, it yields first-order, specialized code that fits in the Low★ subset that can compile to C. In this section, we use F★ flags and attributes without resorting to Meta-F★ tactics. While we use the streaming functor as our running example, the techniques are systematic and can be applied to any hand-written functor in F★.

5.1 Specialization via partial evaluation

We now show how to use the inline_for_extraction keyword (§2) to ensure that, after F★ has performed its extraction-specific normalization run, no traces are left of the functor argument, and the resulting code only contains first-order Low★ code. In other words, we completely eliminate accesses to the type class dictionary via partial evaluation, meaning no run-time overhead.

We focus on mk_finish, which after partial application generates the streaming API’s finish function. We saw the signature of the function earlier, and now show its definition in Figure 5. The function contains numerous patterns that need to be inlined away, which makes it representative of the streaming functor as a whole.

The let-binding at line 2 serves only to bring the fold law (§3.3) lemma from the type class into the scope of the SMT context, so that its associated pattern can trigger and save the programmer from having to call the lemma manually. Built-in constructs of Low★ such as lines 4 and 7 receive special treatment in the toolchain and are eliminated. Calls to lemmas and assertions at lines 9 and 13 have type Ghost unit. The KreMLin compiler eliminates superfluous units, so these disappear too.

At lines 15-16, we need to copy the block state into a temporary, in order to obtain the state machine mentioned earlier (Figure 2). The syntax hides nested calls to projectors of the block and stateful type classes respectively. To make sure these reduce, we mark both type class definitions as “inline for extraction”, which in turn makes their projectors reduce. At extraction-time, provided mk_finish is applied to an instance, lines 15-16 reduce into direct calls to the original alloca and copy functions found in the type class.

At line 11, we call rest, a helper shared across multiple functions that returns the amount of data currently in the internal buffer. As such, rest needs access to the type class, if only to know the block algorithm’s block size. To avoid generating a run-time access to c in the call to rest, we mark the definition of rest as “inline for extraction”; provided rest undergoes the same treatment we described above, all references to c are now eliminated at extraction-time.

One missing feature from this example is [@inline_let], an attribute that can be added to pure, inner let-bindings to force them to be inlined away even within stateful functions. We use it pervasively, and the anonymous supplement contains numerous instances.
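A loose C analogy may help picture what partial evaluation buys for a helper like rest. In the sketch below, rest_generic takes the block size as an ordinary parameter, and two specialized wrappers fix it to a constant; the body of rest_generic is an assumption chosen to agree with split_at_last (a non-empty, block-aligned total leaves one full block in the buffer), and all names are hypothetical. The analogy is imperfect: a C compiler inlines at its discretion, whereas F★ normalization is guaranteed to remove the index before any C is produced.

```c
#include <assert.h>
#include <stdint.h>

/* Generic helper: depends on a "meta" parameter, the block size. */
static inline uint32_t rest_generic(uint32_t block_len, uint64_t total_len) {
  assert(block_len > 0);
  uint32_t r = (uint32_t)(total_len % block_len);
  return (r == 0 && total_len > 0) ? block_len : r;
}

/* Specialized instances: the block size is a constant at each call site,
   so no run-time lookup of the block size remains after inlining. */
uint32_t rest_64(uint64_t total_len) { return rest_generic(64u, total_len); }
uint32_t rest_128(uint64_t total_len) { return rest_generic(128u, total_len); }
```

In both instances, subtracting the returned value from total_len yields a multiple of the block size, which is the shape of the assertion at line 13 of Figure 5.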

5.2 Recovering ML polymorphism

Playing with normalization attributes and keywords guarantees that any reference to the type class argument disappears; but this is not enough to make the code valid Low★. The issue lies with our earlier type definition for state_s, at the beginning of §4. Specifically, the issue lies in how type abbreviations and inductive types behave differently at extraction-time.

Type abbreviations, such as element_t we saw earlier, can be inlined away via inline_for_extraction; by the time the program reaches the erasure phase, element_t has reduced away to, say, U32.t, a specialized type that can be successfully extracted.

Inductive types behave differently. When state_s is partially applied to a concrete argument foo, no partial evaluation happens: state_s is not a structural type. Therefore, when F★ extracts state_s, it does so in a standalone manner, by looking only at the definition of state_s, not its uses.

We now describe precisely how the definition of state_s extracts. The projection c.block_len is innocuous, and disappears: refinements are erased. The type of field block_state, that is c.state.s, is a nested application of two functions, namely the projectors for state and s. This type does not reduce: c is a formal parameter and has no concrete value. Upon seeing such a type, F★’s extraction simply inserts ⊤, since neither the type system of ML nor that of C can express this flavor of value dependency. This means that a partial application of state_s to a class foo will generate:

    type 'c state_s = { block_state: ⊤; ... }
    type foo_state = unit state_s

All seems lost, for a field block_state: ⊤ cannot be compiled to C. We can, however, regain ML-like polymorphism with this “one simple trick”:

    type state_s (c: block) (t: Type { t == c.state.s }) = | State:
      block_state: t →
      buf: B.buffer U8.t { B.len buf = c.block_len } →
      ... →
      state_s c t

This second version is curiously convoluted; from the point of view of F★, however, this is a perfectly valid ML type whose definition in the ML AST becomes:

    type ('c, 't) state_s = { block_state: 't; buf: ... }

We apply a similar trick to functions, where for instance the prototype of mk_finish becomes: val mk_finish: c:block → t:Type0 { t == c.state.s } → s:state c t → ...

This extracts to an ML AST that is free of casts, since all types within the body of mk_finish are now of type t (a type parameter of the function) instead of c.state.s (a non-extractable function call at the Type0 level). We obtain an ML-polymorphic definition, along with monomorphic uses:

    let mk_finish (type c t) (s: (c, t) state) ... = ...
    let finish_sha2_256 = mk_finish


Finally, we rely on KreMLin to perform whole-program monomorphization (§2) in order to specialize type definitions and functions based on their usage at type parameter U32.t buffer. Combined with unit-field elimination and unused type parameter elimination, we obtain a fully specialized copy of finish for SHA2-256, along with a clean definition for the type, just like a C programmer would have written:

    typedef struct {
      uint32_t ∗block_state;
      uint8_t ∗buf;
      uint64_t total_len;
    } Hacl_Streaming_Functor_state_sha2_256;

6 TACTIC-BASED

The process of hand-writing a functor (§4) gives the programmer fine-grained control over reduction and type monomorphization. However, this manual encoding requires the programmer to mark the entire call-graph (that is, all functions reachable at run-time, such as mk_finish, rest, etc.), as “inline for extraction”, in order to properly eliminate occurrences of the meta-argument c in the generated code. In many situations, this is not acceptable: a huge blob of code would not pass muster with software engineers who want to use verified libraries, and as such we need to retain the structure of the call-graph in the generated C code.

6.1 General-purpose meta-level indices

We now abstract over, and generalize the setting of §4. We consider call-graphs of arbitrary depth, where the only restriction is the absence of recursion. This is a safe assumption: most Low★ code uses loop combinators, as recursion results in unpredictable performance, owing to the uneven support for tail-call optimizations in C compilers. We assume every function in the call-graph is parametric over a meta-parameter, which we call from here on an index. In order for the functions in our call-graph to be valid Low★, they must be applied to a concrete index in order to trigger enough partial evaluation.

In §4, the meta-parameter was the type class c, which contained type definitions followed by specifications, lemmas, low-level implementations, helper definitions, etc. This style is burdensome, as a large algorithm will typically incur a type class with dozens of fields, which makes authoring instances tedious and non-modular.

We now present a different style, which we have found minimizes syntactic overhead, and is easier to work with in day-to-day proof engineering. For the rest of this paper, we choose for the index a finite enumeration, accompanied with a set of helper definitions over that index. To illustrate that second style, we use HPKE [12] (Hybrid Public-Key Encryption), a cryptographic construction that combines AEAD (Authenticated Encryption with Additional Data), DH (Diffie-Hellman) and hashing. We structure the code as follows, using alg as our index and parameterizing all relevant functions over that index.

    // Spec.HPKE.fsti
    let alg = DH.alg ∗ AEAD.alg ∗ Hash.alg

    // Spec.AEAD.fsti
    type alg = AES128_GCM | AES256_GCM | CHACHA20_POLY1305

[Figure 6 diagram: call-graph with nodes hpke, helper, sign and enc; hpke calls helper and enc, and helper calls sign.]

Fig. 6. Simplified call-graph of HPKE

    // Impl.HPKE.fsti
    inline_for_extraction
    let key_aead (alg: Spec.HPKE.alg) = lbuffer U8.t (AEAD.key_len (snd3 alg))

    // Impl.HPKE.fst
    let sign_t (alg: Spec.HPKE.alg): Type = ... → ...
    let sign (alg: Spec.HPKE.alg): sign_t alg = λ ... → ...

The index Spec.HPKE.alg is a triple that captures all possible algorithm choices prescribed by the HPKE RFC. We thus write specifications, lemmas, helpers and types parametrically over the index as standalone definitions. The key_aead type, for example, is parametric over triplets of algorithms, and defines a low-level key to be an array of bytes whose length is the key length for the chosen AEAD. The same systematic parametrization over alg can be carried to functions and their types, shown with sign as an example. An important point is that for a given sign_t, there may be multiple implementations of the given algorithm. So, the finite combinations for Spec.HPKE.alg admit an infinite number of implementations.

6.2 Call-graph rewriting

Figure 6 describes the call-graph of HPKE: nodes are F★ top-level functions and arrows indicate function calls. The helper node makes verification modular but would pollute the generated C code, and therefore must be evaluated away. All other circled nodes must appear in the generated C code. This call-graph is representative of a typical large-scale verification effort: in order to make verification robust in the presence of an SMT solver; to ensure modularity of the proofs; and to increase the likelihood that a future programmer can understand and maintain a given proof, we encourage a proliferation of small helpers with crisp pre- and post-conditions. However, these helpers would typically amount, if they were extracted, to a mere line or two of C code.

Our goal is now to transform the user-written code for HPKE into a form that lends itself well to specialization. That is, we want to obtain a functor that, once applied to a triplet of implementations for DH, AEAD and Hash, returns a specialized variant of HPKE for that triplet. We now describe this transformation by example. The first step consists in adding extra parameters to both helper and hpke; these parameters shadow references to other nodes in the call-graph.

    val sign: alg → sign_t alg
    val enc: alg → enc_t alg

(* before: *)
inline_for_extraction
let helper (alg: Spec.HPKE.alg): helper_t alg =
  λ... → ... sign alg ...

(* after: *)
inline_for_extraction
let helper (alg: Spec.HPKE.alg) (sign: sign_t alg): helper_t alg =
  λ... → ... sign ...

(* before: *)
inline_for_extraction
let hpke (alg: Spec.HPKE.alg): hpke_t alg =
  λ... → ... helper alg ... ... enc alg ...

(* after: *)
inline_for_extraction
let hpke (alg: Spec.HPKE.alg) (sign: sign_t alg) (enc: enc_t alg): hpke_t alg =
  λ... → ... helper alg sign ... ... enc ...

After rewriting, instead of calling the generic variant sign in the global scope, helper now calls its parameter sign, which is a specialized variant of sign for the specific algorithm alg. Similarly, hpke, after rewriting, calls a specialized variant of enc. We remark that helper, which we want to eliminate from the call-graph, remains inline for extraction, and does not appear as an extra specialized parameter; hpke simply calls helper, passing it a copy of a specialized sign.

What we have done here is ensure that every node in the call-graph (e.g. hpke) receives as parameters specialized variants of the functions it calls, either directly (enc) or transitively (sign). This form is more convoluted than the original version, but the user can now instantiate each function to obtain a specialized variant. We instantiate functions in the topological order of the call-graph.

let sign_p256: sign_t Spec.DH.P256 = P256.sign
let enc_chachapoly: enc_t Spec.AEAD.ChachaPoly = ChachaPoly256.encrypt
let hpke_chachapoly_p256: hpke_t (Spec.DH.P256, Spec.AEAD.ChachaPoly, ...) =
  hpke (Spec.DH.P256, Spec.AEAD.ChachaPoly, ...) sign_p256 enc_chachapoly

This results in a specialized version of hpke, which calls a specialized version of encrypt. The index is gone and there is no run-time overhead; we have in effect specialized the entire call-graph for a specific value of the index. Note that we can provide many more specializations of HPKE for the same value of the index, e.g. with the AVX512 version of ChachaPoly. Back to our earlier C++ template analogy, we have in effect written: HPKE Hpke_ChachaPoly.
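To make the C++ template analogy concrete, here is a minimal sketch of the rewritten call-graph. It is not the paper's F★ code nor any HACL★ API: the names `SignP256Toy`, `EncChachaPolyToy`, `EncChachaPolyWideToy`, `helper` and `Hpke` are hypothetical, and the arithmetic is a toy stand-in for real primitives. The point is only the shape: `helper` receives the specialized `sign` as a parameter, and `Hpke` receives everything it calls directly (`Enc`) or transitively (`Sign`).

```cpp
#include <cstdint>

// Hypothetical stand-ins for specialized primitives. The arithmetic is a toy,
// not real cryptography; only the shape of the call-graph matters here.
struct SignP256Toy {
    static constexpr std::uint32_t sign(std::uint32_t m) { return m + 256u; }
};
struct EncChachaPolyToy {
    static constexpr std::uint32_t enc(std::uint32_t m) { return m ^ 0x20u; }
};
// A second implementation of the same toy algorithm (same results, other code),
// mirroring "many specializations for the same value of the index".
struct EncChachaPolyWideToy {
    static constexpr std::uint32_t enc(std::uint32_t m) {
        return (m | 0x20u) - (m & 0x20u);  // a ^ b == (a | b) - (a & b)
    }
};

// Analog of the eliminated `helper`: it takes the specialized `sign` as a
// parameter and does not need to survive as a separate entry point.
template <class Sign>
constexpr std::uint32_t helper(std::uint32_t m) {
    return Sign::sign(m) * 2u;
}

// Analog of the rewritten `hpke`: parameterized over the functions it calls
// directly (Enc) or transitively through helper (Sign).
template <class Sign, class Enc>
struct Hpke {
    static constexpr std::uint32_t run(std::uint32_t m) {
        return Enc::enc(helper<Sign>(m));
    }
};

// Two specializations; no dictionary or function pointer remains at run time.
using HpkeChachaPolyP256 = Hpke<SignP256Toy, EncChachaPolyToy>;
using HpkeChachaPolyWideP256 = Hpke<SignP256Toy, EncChachaPolyWideToy>;
```

Both aliases resolve every call statically, which is the property the F★ rewriting aims for; unlike the F★ version, nothing here is verified.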

6.3 Call-graph rewriting by tactic

We now generalize the rewritings performed on the HPKE example, and describe a general procedure that rewrites a call-graph, for any algorithm, into a form that lends itself to specialization over the choice of index. Rather than having the user manually rewrite their code to take a type class, we let the user author their algorithm in a natural style, as long as the functions and types are parametric over an index type. We then build the functor automatically, by inspecting the user-provided definitions, and automatically performing the rewriting described on the HPKE example, in an index-generic fashion. The user obtains a set of parameterized functions that are now ready to be specialized for a choice of index.

Our rewriting is best thought of as a call-graph traversal, where each node in the call-graph is either specialized (e.g. hpke) or eliminated (e.g. helper). To that end, in order to benefit from the automatic rewriting of their call-graph into the specializable form, users need to annotate their functions with an attribute, either [@ Eliminate ] or [@ Specialize ].

We proceed as follows. Each function node f_i in the call-graph is assumed to be parameterized over an implicit argument #idx : t_idx that represents the specialization index. We expect this argument to come in first position.


• At definition site, every function definition let f #idx = x is replaced by let mkf #idx (g_1 : t_{g_1} idx) ... (g_n : t_{g_n} idx) = x. The g_i represent all the [@ Specialize ] nodes reachable via a (possibly-empty) path of [@ Eliminate ] nodes through the body of f. The t_{g_i} are the types of the original g_i, minus the index idx; that is, if g_i had type #idx : t_idx → t, then t_{g_i} = λ(idx : t_idx) → t.
• At call-site, when encountering a call f #idx e, we distinguish two cases. If f is a [@ Specialize ] node, the call becomes g_i e and references the bound variable instead of the global name. If f is an [@ Eliminate ] node, the call becomes mkf #idx g_{i_1} ··· g_{i_n}, where g_{i_k} is the k-th function needed by f, found amongst one of the current parameters g_i.

This follows the example outlined earlier on HPKE. We emphasize that the arguments g_i contain the [@ Specialize ] nodes transitively reachable from the body of the current function; this means that in the sub-call mkf #idx g_{i_1} ··· g_{i_n}, the g_{i_k} always form a subset of the current arguments g_i.
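The rule for choosing the extra parameters of a node can be sketched as a small graph computation. The sketch below is an illustration under assumptions, not the Meta-F★ tactic: the `Node` encoding, the bitmask representation and the function `params` are hypothetical, and the graph is assumed to be acyclic and to have fewer nodes than an `unsigned` has bits. For each node it collects the [@ Specialize ] nodes reachable through a possibly-empty path of [@ Eliminate ] nodes.

```cpp
#include <cstddef>

enum class Kind { Specialize, Eliminate };

struct Node {
    const char* name;
    Kind kind;
    unsigned callees;  // bitmask: bit j set means this node calls node j
};

constexpr unsigned bit(std::size_t i) { return 1u << i; }

// Hypothetical encoding of the HPKE example: hpke calls helper and enc;
// helper calls sign.
constexpr std::size_t HPKE = 0, HELPER = 1, SIGN = 2, ENC = 3, COUNT = 4;
constexpr Node graph[COUNT] = {
    {"hpke",   Kind::Specialize, bit(HELPER) | bit(ENC)},
    {"helper", Kind::Eliminate,  bit(SIGN)},
    {"sign",   Kind::Specialize, 0u},
    {"enc",    Kind::Specialize, 0u},
};

// Set of parameters g_i that node f must receive after rewriting: every
// Specialize node reachable from f through zero or more Eliminate nodes.
constexpr unsigned params(std::size_t f) {
    unsigned result = 0u;
    for (std::size_t g = 0; g < COUNT; ++g) {
        if ((graph[f].callees & bit(g)) == 0u) continue;  // f does not call g
        if (graph[g].kind == Kind::Specialize) {
            result |= bit(g);     // g becomes a bound parameter; stop here
        } else {
            result |= params(g);  // look through the eliminated node
        }
    }
    return result;
}
```

On this graph, `params(HPKE)` contains `sign` and `enc`, and `params(HELPER)` contains only `sign`, so the parameters of the eliminated callee form a subset of those of its caller, as stated above.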

6.4 Implementation in Meta-F★

We have implemented this call-graph rewriting using syntax inspection, term generation and definition splicing in Meta-F★. Our tactic, at 620 lines (including whitespace and comments), is the second largest Meta-F★ program written to date. We now briefly give an overview of the implementation; the anonymous supplement contains a generously commented version in Meta.Interface.fst.

As mentioned earlier (§2), the design of Meta-F★ means any fresh term generated by a tactic must be re-checked for soundness; we therefore do not prove any results about our tactic and let F★ validate the terms it produces.

In order to call our tactic, the user passes the roots of the call-graph traversal, along with the type of the index. The tactic traverses the call-graph and generates rewritten variants of all the definitions; the %splice directive then inserts the freshly rewritten definitions at the current program point. A minor caveat: F★ requires the user to lexically insert between square brackets the names of the top-level definitions generated by a %splice; this allows resolving further calls to the tactic-generated names in purely lexical fashion, as scope resolution runs before meta-program evaluation.

%splice [ hpke_setupBaseI'; hpke_setupBaseR'; hpke_sealBase'; hpke_openBase' ]
  (Meta.Interface.specialize (`Spec.HPKE.alg) [
    `Impl.HPKE.setupBaseI;
    `Impl.HPKE.setupBaseR;
    `Impl.HPKE.sealBase;
    `Impl.HPKE.openBase
  ])

One possible specialization, out of more than a hundred possible options, is P256 for elliptic curve DH, AVX 128-bit ChachaPoly for AEAD and SHA256 for hashing. Note how the user first generates a specialized version of setupBaseI, then a specialized version of sealBase that calls it.

let alg = DH_P256, CHACHA20_POLY1305, SHA2_256
let setupBaseI = hpke_setupBaseI' #alg hkdf_expand256 hkdf_extract256 sha256 secret_to_public_p256 dh_p256
let setupBaseR = hpke_setupBaseR' #alg hkdf_expand256 hkdf_extract256 sha256 secret_to_public_p256 dh_p256
let sealBase = hpke_sealBase' #alg setupBaseI encrypt_cp128
let openBase = hpke_openBase' #alg setupBaseR decrypt_cp128
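The instantiation order above (leaves first, then the functions that call them) can be mimicked with C++ templates. This is only a sketch under assumptions: `DhP256Toy`, `EncryptCp128Toy`, `SetupBaseIGen` and `SealBaseGen` are hypothetical names, the bodies are toy arithmetic rather than HPKE, and the real F★ functions take more parameters than shown here.

```cpp
#include <cstdint>

// Hypothetical leaf primitives (toy arithmetic, not real cryptography).
struct DhP256Toy {
    static constexpr std::uint32_t dh(std::uint32_t x) { return x * 3u; }
};
struct EncryptCp128Toy {
    static constexpr std::uint32_t encrypt(std::uint32_t key, std::uint32_t msg) {
        return key + msg;
    }
};

// Analog of hpke_setupBaseI': generic over the primitive it needs.
template <class Dh>
struct SetupBaseIGen {
    static constexpr std::uint32_t setup(std::uint32_t secret) {
        return Dh::dh(secret) + 1u;
    }
};

// Analog of hpke_sealBase': generic over an already specialized setup
// function and an encryption primitive.
template <class Setup, class Encrypt>
struct SealBaseGen {
    static constexpr std::uint32_t seal(std::uint32_t secret, std::uint32_t msg) {
        return Encrypt::encrypt(Setup::setup(secret), msg);
    }
};

// Instantiate in topological order: first the callee, then its caller.
using SetupBaseI = SetupBaseIGen<DhP256Toy>;
using SealBase = SealBaseGen<SetupBaseI, EncryptCp128Toy>;
```

Here `SealBase` is built from the specialized `SetupBaseI`, in the same way that sealBase above is applied to setupBaseI rather than to the generic hpke_setupBaseI'.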

Our tactic has a few more bells and whistles. Notably, we allow leaf [@ Specialize ] nodes to be abstract definitions (assume val in F★ lingo) without an implementation. This does not affect soundness: we still have to provide concrete definitions when instantiating the functor to obtain the specialized definitions. The reason for this feature is clarity and modularity. We adopt a systematic style where we write an entire algorithm in file MyAlg.fst, against an abstract interface that contains the entire functor argument, in MyInterface.fsti. Furthermore, by having every definition in MyInterface.fsti of the form:

let foo_t (i: index) = ...
val foo: #i:index → foo_t i

we can write multiple implementations of MyInterface in separate files, such as MyImplementation1:

let foo1: foo_t Index1 = ...

[Figure 7: Curve25519.fst is proved against Field.fsti, which Field51.fst and Field64.fst implement; Field64.fst is proved against Core64.fsti, which HaclCore.fst and ValeCore.fst implement.]

Fig. 7. Abstract, large-scale call-graph for Curve25519

6.5 Application to Curve25519

We finish this presentation with the application of our automated tactic to a particularly gnarly example. Curve25519 is an elliptic curve algorithm [14]; suffice it to say that it relies on a mathematical field, which admits two efficient implementations; furthermore, one of these two implementations relies on a core set of primitives (e.g. multiplication) which themselves admit two different implementations, one in Low★ from the HACL★ library, and one in Vale assembly [25].

Figure 7 describes the call-graph of Curve25519 at a coarse level of granularity. We adopt the systematic style we described right above, and insert interface files (.fsti files) to stand for the abstract interfaces that we want to specialize against. Each abstract interface in turn admits multiple implementations.

The tactic successfully handles the two nested levels of abstraction, and generates functorized code accordingly. The user can then pick one of three specialization choices; using OCaml syntax, these are: module Curve51 = Curve25519(Field51), module Curve64Vale = Curve25519(Field64(CoreVale)) and module Curve64Hacl = Curve25519(Field64(CoreHacl)). The full details are provided in the anonymous supplement.
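The two nested levels of abstraction can be mirrored with nested C++ templates. The sketch below is purely illustrative and makes several assumptions: the modulus is a small toy prime rather than 2^255 - 19, `toy_step` is a made-up operation rather than any real curve arithmetic, inputs are assumed to be smaller than the modulus, and the two `Core` structs merely stand in for two implementations of the same multiplication primitive.

```cpp
#include <cstdint>

constexpr std::uint64_t P = 65521;  // toy prime modulus, NOT 2^255 - 19

// Two interchangeable implementations of the core primitive (hypothetical
// stand-ins for a Low* version and a Vale assembly version).
struct CoreHacl {
    static constexpr std::uint64_t mul(std::uint64_t a, std::uint64_t b) { return a * b; }
};
struct CoreVale {
    static constexpr std::uint64_t mul(std::uint64_t a, std::uint64_t b) {
        std::uint64_t r = 0;  // shift-and-add multiplication
        while (b != 0) {
            if (b & 1u) r += a;
            a <<= 1;
            b >>= 1;
        }
        return r;
    }
};

// A field implementation that needs no core primitives.
struct Field51 {
    static constexpr std::uint64_t add(std::uint64_t a, std::uint64_t b) { return (a + b) % P; }
    static constexpr std::uint64_t mul(std::uint64_t a, std::uint64_t b) { return (a * b) % P; }
};

// A field implementation that is itself a functor over the core primitives.
template <class Core>
struct Field64 {
    static constexpr std::uint64_t add(std::uint64_t a, std::uint64_t b) { return (a + b) % P; }
    static constexpr std::uint64_t mul(std::uint64_t a, std::uint64_t b) { return Core::mul(a, b) % P; }
};

// The top-level algorithm, written once against the field interface.
template <class Field>
struct Curve25519 {
    // Toy operation x^2 + x in the field; expects x < P.
    static constexpr std::uint64_t toy_step(std::uint64_t x) {
        return Field::add(Field::mul(x, x), x);
    }
};

using Curve51 = Curve25519<Field51>;
using Curve64Vale = Curve25519<Field64<CoreVale>>;
using Curve64Hacl = Curve25519<Field64<CoreHacl>>;
```

The three aliases correspond to the three specialization choices listed above; each one is a distinct, fully resolved instance of the same generic code.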

7 EVALUATION

We now proceed to evaluate the efficiency of our approach. In this section, we are solely concerned with proof engineering and programmer productivity metrics. The run-time performance of the code, by virtue of being in C and re-using the HACL★ algorithms, exceeds or approaches the state of the art; we therefore leave a crypto-oriented performance discussion to the original HACL★ paper [37].


                               F★ LoC   C LoC   verif.   extract.
EverCrypt hashing (old)           848     798
Functor and interfaces           1667       0   103.5s   0
EverCrypt hashing (new)           128     929    20.3s   8.3s
MD5, SHA1, SHA2 (×4)              231    1577    39.4s   13.6s
Poly1305 (×3)                     452     998    55.7s   6.4s
Blake2s (×2), Blake2b (×2)        874    4761   115.1s   9.2s
Total                            4200    8265     334s   47.6s

Table 1. Quantifying the impact of the streaming functor

7.1 Streaming functor

To evaluate the applicability of the streaming functor, we compare lines of code (LoC) for the F★ source code and the final C code as a proxy for programmer effort. While not ideal, this metric has been used by several other papers [37, 38, 46] and provides a coarse estimate of the proof engineering effort. Our point of reference is a previous, non-generic streaming API that operated atop the EverCrypt agile hash layer.

Table 1 presents the evaluation. For the old, non-generic streaming API, the proof-to-code ratio was 1.11, meaning we had to write more than one line of code in F★ for every line of generated C code. Capturing the block API and implementing the functor uses 1667 lines of F★ code. The extra verification effort is quickly amortized across the 14 applications of the streaming functor, which each require a modest amount of proofs to implement the exact signature of the block API. Poly1305 and Blake2 were originally authored without bringing out the functional, fold-like nature of the algorithms, which led to some glue code and proofs to meet the block API. Altogether, we obtain a final proof-to-code ratio of 0.51, which we interpret to coarsely mean a 2x improvement in programmer productivity. We expect this number to further decrease, as more applications of the streaming functor follow.

For execution times, we present the verification time of the functor itself, and the verification time of each of the instances, including glue proofs. Compared to fully verifying Blake2 (7.5 minutes), or Poly1305 (~14 minutes), the verification cost is modest. Applying the streaming functor to a type class argument incurs no verification cost, so the extraction column measures the cost of partial evaluation and extraction to the ML AST, which is negligible.

7.2 Tactic

From a qualitative point of view, the usage of the call-graph rewriting tactic significantly improved programmer experience and addressed many fundamental engineering roadblocks in one go. First, programmers would tweak F★'s include path to switch between implementations (e.g. Field51 vs Field64, §6.5), effectively making it impossible to build verified applications on top of HACL★. Second, the lack of modularity and call-graph specialization in old versions of HACL★ made C and vectorized implementations appear in the same file; since we had to use -mavx -mavx2 for compiling intrinsics, the C compiler would use AVX2 instructions for our non-vectorized, regular C version, causing illegal instruction errors later on [44]. We now put one tactic instantiation per file, which solves the problem definitively. Finally, previous versions of HACL★ did not distinguish between algorithmic agility and choice of implementation for a given algorithm. This made a modular and specializable HPKE just impossible to author.

From a quantitative point of view, Table 2 measures the overhead incurred by re-checking the tactic-generated call-graph, relative to the total verification time for a given algorithm. In most cases, the overhead is < 100%, because we don't rewrite lemmas and proofs. We need to investigate why Curve25519 is an outlier; we suspect the Z3 solver might be overly sensitive to the shape of the proof obligations it receives: since we rewrite the call-graph, the resulting proof obligations are slightly different from the ones generated by the original call-graph. In practice, however, build time matters little in the face of improved programmer productivity.

Algorithm            Verification time
Curve25519           1379s (+127.0%)
Chacha20             174s (+41.0%)
Poly1305             429s (+17.2%)
Chacha20Poly1305     421s (+92.2%)

Table 2. Cost of verifying the tactic-rewritten call graphs

7.3 Usability of tactics

Tactics are not part of the trusted computing base (§2); unlike, say, MTac2 [27], Meta-F★ [31] does not allow the user to prove properties about tactics, trading provable correctness for ease-of-use and programmer productivity. The debugging experience for tactics is thus pleasant. If the tactic itself fails, F★ points to the faulty line in the meta-program. If the generated code is ill-typed, we examine it like any other F★ program. We debugged this tactic on Curve25519, our most complex example; once debugged, the tactic never generated ill-typed code and was used successfully by other co-authors.

A technical detail relates to lexical scoping (§6.3): the user must somewhat materialize in the source code the names of all the g's that are generated by the tactic, in order to establish proper lexical scope. For that, regular users can observe the standard output, where the tactic prints a summary of the functions it generated, their types, and their names. Equipped with the names of the generated g's, users then edit their source code to fill in the first argument to %splice. Alternatively, power users just look up the mangling scheme and directly write arguments to %splice in one go. We have plans to automate this part as well, but so far, none of the users have complained about the overhead of writing manual instantiations.

8 RELATED WORK

Automating the generation of low-level code is a common theme among several software verification projects, and is by no means specific to the HACL★ universe. We now review several related efforts.

FiatCrypto [24] relies on a combination of partial evaluation and certified compilation phases to compile a generic description of a bignum arithmetic routine to an efficient, low-level imperative language. The imperative language can then be printed as either C or assembly. In the FiatCrypto approach, the specifications are declarative, and do not impose any choice of representation. Conversely, in HACL★, the decision is made by the programmer, as they refine a high-level mathematical specification into a functional implementation that talks about word sizes and representation choices. While the approach of FiatCrypto is automated, it relies on fine-grained control of the compilation toolchain; for instance, a key compilation step is bounds inference, which picks integer widths to be used by the rest of the compilation phases. By contrast, we do not customize the extraction procedure of F★, nor extend KreMLin with dedicated phases. Another difference to highlight is that FiatCrypto, to the best of our knowledge, focuses on the core bignum subset of operations, and offers neither a high-level Curve25519 API, nor other families of algorithms. In our work, we operate at higher levels up the stack, tackling high-level API transformers and complete algorithms, “in the large”.


Recent work by Pit-Claudel et al. [36] proposes a correct-by-construction pipeline to generate efficient low-level implementations from non-deterministic high-level specifications. The process is end-to-end verified since it relies on Bedrock [18]. In Pit-Claudel’s work, compilation and extraction are framed as a backwards proof search and synthesis goal. Handling non-determinism has not been done at scale with F★; however, the algorithms we study are fully deterministic. Pit-Claudel’s approach is DSL-centric: the user is expected to augment the compiler with new synthesis rules whenever considering a new flavor of specifications. In our work, we reuse the existing extraction facility of F★, which we treat as a black box we have no control over. We rely on many whole-program compilation phases, such as the various compilation schemes for data types, monomorphization and unused argument elimination; it is unclear to us if Pit-Claudel’s toolchain supports such whole-program transformations that require making non-local decisions.

Appel [8] verifies the equivalent of our streaming functor applied to the SHA256 block algorithm; specifically, the version of it found within OpenSSL. This work is end-to-end verified, by virtue of using VST [6]. The effort remains directed towards one single instance of the problem, and it is unclear to us whether the development could be made generic to automatically show the correctness of all of the other block algorithms in OpenSSL in one go.

All three of these works, by virtue of using the Coq proof assistant, enjoy a small trusted computing base; in our work, we trust F★, Z3 and the KreMLin compiler.

Lammich [28] uses an approach very similar to Pit-Claudel et al., and refines Isabelle/HOL specifications down to efficient, Imperative/HOL code. The code is then extracted to a functional programming language, e.g. OCaml or SML, and compiled by an off-the-shelf compiler. This means that unlike Pit-Claudel, Lammich still relies on a built-in extraction facility. This is a promising approach, and seems applicable to the original HACL★ code: it would be worthwhile trying to refine HACL★ specifications automatically to the Low★ code. In the case of the various functors we describe, however, we suspect “explaining” how to refine the specification into the exact API we want would require the same amount of work as writing the functors directly.

In Haskell, the cryptonite library offers an abstract hash interface using type classes, along with several instances of this type class. The high-level idea is the same: unifying various hash algorithms under a single interface. One natural advantage of our work is that it comes with proofs, meaning that clients can be verified and shown to not misuse our APIs. Perhaps more to the point, our approach also has unique constraints: no matter how efficient functional code with type classes may be, we insist on generating idiomatic C code: adoption by Firefox, Linux and others comes at that cost. The code produced by our toolchain cannot afford to have run-time dictionaries or function pointers. We therefore must partially evaluate away the abstractions before C code generation.

9 CONCLUSION

Based on our experience performing very large-scale verification in F★, we have shown an array of techniques to make program proof a productive endeavor. Establishing clear, crisp abstractions that highlight commonality between related pieces of code sets a foundation for higher-level APIs. With partial evaluation and meta-programming, programmers can think at the highest levels of abstractions, while minimizing effort and increasing . Quantitative evaluation provides evidence that our techniques improve programmer experience.

The techniques we describe in this paper form the foundation of the current iteration of HACL★, and have been used pervasively throughout the entire codebase of the project. HACL★ currently features over 105,000 lines of code (excluding whitespace and comments); we believe verification at that scale would not have been achievable without the techniques we describe in this paper. While verification projects of this size remain rare, we believe the improvements in theorem proving will soon lead to more and more verification efforts crossing the 100,000 line threshold. We thus feel that our experience is worth sharing, and hope that it will inspire similar gains of productivity in other projects.

We intend to continue our efforts on the HACL★ codebase. In particular, we intend to author a new type class to capture the commonality between Chacha20, AES, SHA3 and Blake2 in PRF mode and the matching flavors of AEAD, along with a corresponding functor to automatically generate safer APIs for those.


REFERENCES

[1] The sodium crypto library (libsodium). https://github.com/jedisct1/libsodium.
[2] Federal Information Processing Standards Publication 180-4: Secure hash standard (SHS), 2012. NIST.
[3] Danel Ahman, Cătălin Hriţcu, Kenji Maillard, Guido Martínez, Gordon Plotkin, Jonathan Protzenko, Aseem Rastogi, and Nikhil Swamy. Dijkstra monads for free. In ACM Symposium on Principles of Programming Languages (POPL), January 2017.
[4] José Bacelar Almeida, Manuel Barbosa, Gilles Barthe, Arthur Blot, Benjamin Grégoire, Vincent Laporte, Tiago Oliveira, Hugo Pacheco, Benedikt Schmidt, and Pierre-Yves Strub. Jasmin: High-assurance and high-speed cryptography. 2017.
[5] José Bacelar Almeida, Manuel Barbosa, Gilles Barthe, Benjamin Grégoire, Adrien Koutsos, Vincent Laporte, Tiago Oliveira, and Pierre-Yves Strub. The last mile: High-assurance and high-speed cryptographic implementations. arXiv:1904.04606 https://arxiv.org/abs/1904.04606.
[6] Andrew W. Appel. Verified software toolchain. In Proceedings of the European Conference on Programming Languages and Systems (ESOP/ETAPS), 2011.
[7] Andrew W Appel. Verification of a cryptographic primitive: SHA-256. ACM Transactions on Programming Languages and Systems (TOPLAS), 37(2):7, 2015.
[8] Andrew W. Appel. Verification of a cryptographic primitive: SHA-256. ACM Trans. Program. Lang. Syst., April 2015.
[9] Jean-Philippe Aumasson, Willi Meier, Raphael C-W Phan, and Luca Henzen. The hash function BLAKE. Springer, 2014.
[10] Jean-Philippe Aumasson, Samuel Neves, Zooko Wilcox-O’Hearn, and Christian Winnerlein. BLAKE2: Simpler, smaller, fast as MD5. In Applied Cryptography and Network Security, pages 119–135, 2013.
[11] Manuel Barbosa, Gilles Barthe, Karthikeyan Bhargavan, Bruno Blanchet, Cas Cremers, Kevin Liao, and Bryan Parno. SoK: Computer-aided cryptography. IACR Cryptol. ePrint Arch., 2019:1393, 2019.
[12] R. Barnes and K. Bhargavan. Hybrid public key encryption. IRTF Internet-Draft draft-irtf-cfrg-hpke-02, 2019.
[13] Lennart Beringer, Adam Petcher, Katherine Q. Ye, and Andrew W. Appel. Verified correctness and security of OpenSSL HMAC. 2015.
[14] D. J. Bernstein. Curve25519: New Diffie-Hellman speed records. In Proceedings of the IACR Conference on Practice and Theory of Public Key Cryptography (PKC), 2006.
[15] Daniel J. Bernstein. The Poly1305-AES message-authentication code. In Proceedings of Fast Software Encryption, March 2005.
[16] Barry Bond, Chris Hawblitzel, Manos Kapritsos, K. Rustan M. Leino, Jacob R. Lorch, Bryan Parno, Ashay Rane, Srinath Setty, and Laure Thompson. Vale: Verifying high-performance cryptographic assembly code. In Proceedings of the USENIX Security Symposium, August 2017.
[17] Edwin Brady. Idris, a general-purpose dependently typed programming language: Design and implementation. Journal of Functional Programming, 23(5):552–593, 2013.
[18] Adam Chlipala. The Bedrock system: Combining generative metaprogramming and Hoare logic in an extensible program verifier. In Proceedings of the 18th ACM SIGPLAN International Conference on Functional Programming, pages 391–402, 2013.
[19] David Christiansen and Edwin Brady. Elaborator reflection: extending Idris in Idris. In Proceedings of the 21st ACM SIGPLAN International Conference on Functional Programming, pages 284–297, 2016.
[20] L. de Moura and N. Bjørner. Z3: An efficient SMT solver. 2008.
[21] Leonardo de Moura, Soonho Kong, Jeremy Avigad, Floris van Doorn, and Jakob von Raumer. The Lean theorem prover. In Proc. of the Conference on Automated Deduction (CADE), 2015.
[22] Frank DeRemer and Hans Kron. Programming-in-the large versus programming-in-the-small. In Proceedings of the International Conference on Reliable Software, pages 114–121, New York, NY, USA, 1975. Association for Computing Machinery.
[23] Morris J Dworkin. SHA-3 standard: Permutation-based hash and extendable-output functions. Technical report, 2015.
[24] A. Erbsen, J. Philipoom, J. Gross, R. Sloan, and A. Chlipala. Simple high-level code for cryptographic arithmetic - with proofs, without compromises. 2019.
[25] Aymeric Fromherz, Nick Giannarakis, Chris Hawblitzel, Bryan Parno, Aseem Rastogi, and Nikhil Swamy. A verified, efficient embedding of a verifiable assembly language. Proc. ACM Program. Lang., 3(POPL), January 2019.
[26] Yu-Fu Fu, Jiaxiang Liu, Xiaomu Shi, Ming-Hsien Tsai, Bow-Yaw Wang, and Bo-Yin Yang. Signed cryptographic program verification with typed CryptoLine. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, CCS ’19, pages 1591–1606, New York, NY, USA, 2019. Association for Computing Machinery.
[27] Jan-Oliver Kaiser, Beta Ziliani, Robbert Krebbers, Yann Régis-Gianas, and Derek Dreyer. Mtac2: typed tactics for backward reasoning in Coq. Proceedings of the ACM on Programming Languages, 2(ICFP):1–31, 2018.
[28] Peter Lammich. Refinement to Imperative HOL. Journal of Automated Reasoning, 62(4):481–503, 2019.
[29] K. Rustan M. Leino. Dafny: An automatic program verifier for functional correctness. In Proceedings of the Conference on Logic for Programming, Artificial Intelligence, and Reasoning (LPAR), 2010.
[30] Pierre Letouzey. A new extraction for Coq. In International Workshop on Types for Proofs and Programs, pages 200–219. Springer, 2002.
[31] Guido Martínez, Danel Ahman, Victor Dumitrescu, Nick Giannarakis, Chris Hawblitzel, Catalin Hritcu, Monal Narasimhamurthy, Zoe Paraskevopoulou, Clément Pit-Claudel, Jonathan Protzenko, Tahina Ramananandro, Aseem Rastogi, and Nikhil Swamy. Meta-F*: Proof automation with SMT, tactics, and metaprograms. In 28th European Symposium on Programming (ESOP), pages 30–59. Springer, 2019.
[32] David A. McGrew and John Viega. The security and performance of the Galois/counter mode of operation. In Proceedings of the International Conference on Cryptology in India (INDOCRYPT), 2004.
[33] Robin Milner. A theory of type polymorphism in programming. Journal of Computer and System Sciences, 17(3):348–375, 1978.
[34] Nicky Mouha, Mohammad S Raunak, D Richard Kuhn, and Raghu Kacker. Finding bugs in cryptographic hash function implementations. IEEE Transactions on Reliability, 67(3):870–884, 2018.
[35] OpenSSL Team. OpenSSL. http://www.openssl.org/, May 2005.
[36] Clément Pit-Claudel, Peng Wang, Benjamin Delaware, Jason Gross, and Adam Chlipala. Extensible extraction of efficient imperative programs with foreign functions, manually managed memory, and proofs. In Nicolas Peltier and Viorica Sofronie-Stokkermans, editors, Automated Reasoning: 10th International Joint Conference, IJCAR 2020, Paris, France, July 1–4, 2020, Proceedings, Part II, volume 12167 of Lecture Notes in Computer Science, pages 119–137. Springer International Publishing, July 2020.
[37] Marina Polubelova, Karthikeyan Bhargavan, Jonathan Protzenko, Benjamin Beurdouche, Aymeric Fromherz, Natalia Kulatova, and Santiago Zanella-Béguelin. HACL×N: Verified generic SIMD crypto (for all your favorite platforms). Cryptology ePrint Archive, Report 2020/572, 2020. https://eprint.iacr.org/2020/572.
[38] Jonathan Protzenko, Bryan Parno, Aymeric Fromherz, Chris Hawblitzel, Marina Polubelova, Karthikeyan Bhargavan, Benjamin Beurdouche, Joonwon Choi, Antoine Delignat-Lavaud, Cédric Fournet, et al. EverCrypt: A fast, verified, cross-platform cryptographic provider. In 2020 IEEE Symposium on Security and Privacy (SP), pages 634–653, 2019.
[39] Jonathan Protzenko, Jean-Karim Zinzindohoué, Aseem Rastogi, Tahina Ramananandro, Peng Wang, Santiago Zanella-Béguelin, Antoine Delignat-Lavaud, Catalin Hritcu, Karthikeyan Bhargavan, Cédric Fournet, and Nikhil Swamy. Verified low-level programming embedded in F*. PACMPL, (ICFP), September 2017.
[40] M-J. Saarinen and J-P. Aumasson. The BLAKE2 cryptographic hash and message authentication code (MAC). IETF RFC 7693, 2015.
[41] Amr Sabry and Matthias Felleisen. Reasoning about programs in continuation-passing style. Lisp and Symbolic Computation, 6(3-4):289–360, 1993.
[42] Tim Sheard and Simon Peyton Jones. Template meta-programming for Haskell. In Proceedings of the 2002 ACM SIGPLAN Workshop on Haskell, Haskell ’02, pages 1–16, New York, NY, USA, 2002. Association for Computing Machinery.
[43] Nikhil Swamy, Cătălin Hriţcu, Chantal Keller, Aseem Rastogi, Antoine Delignat-Lavaud, Simon Forest, Karthikeyan Bhargavan, Cédric Fournet, Pierre-Yves Strub, Markulf Kohlweiss, Jean-Karim Zinzindohoué, and Santiago Zanella-Béguelin. Dependent types and multi-monadic effects in F*. In Proceedings of the ACM Conference on Principles of Programming Languages (POPL), 2016.
[44] Guido Vranken. Cryptofuzz. https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=14822#c3, May 2019.
[45] Stephen Weeks. Whole-program compilation in MLton. ML, 6:1–1, 2006.
[46] Jean Karim Zinzindohoué, Karthikeyan Bhargavan, Jonathan Protzenko, and Benjamin Beurdouche. HACL*: A verified modern cryptographic library. 2017.
