Ludwig–Maximilians–Universität München

Promotionsarbeit Christoph-Simon Senjak

Fach:

Informatik

Thema: An Implementation of Deflate in Coq

Abgabetermin …2018

Betreuer Prof. Dr. Martin Hofmann, PhD

Eidesstattliche Versicherung (Siehe Promotionsordnung vom 12.07.11, § 8, Abs. 2, Pkt. 5)

Hiermit erkläre ich an Eides statt, dass die Dissertation von mir selbstständig, ohne unerlaubte Beihilfe angefertigt ist.

Name, Vorname

Ort, Datum Unterschrift Doktorand/in


In Memoriam Prof. Dr. Martin Hofmann

Abstract

The compression format “Deflate” is defined in RFC 1951. It is a container format that can utilize prefix-free codings (“Huffman codings”), backreferences for deduplication, and run length encoding. Its unverified reference implementation is the ZLib. It is widely used, for example in several network protocols like HTTP or SSH; modern file systems like ZFS and BTRFS even support online compression. Since Deflate is only a container format, there are many possibilities to compress a given data stream, with different tradeoffs. Hence, alternative implementations exist, like the popular Zopfli library from Google.

The standard is quite hard to read, and there are several sources of confusion. We try to remedy this problem by giving a rigorous mathematical specification, which we formalized in Coq. The formalization is axiomatic and can itself not be verified, which is why we test it empirically against the Canterbury Corpus. Our formalization is furthermore very modular, which makes it easier to test single parts of the implementation and put them together later. This formalism should be applicable to other data formats in verified implementations in the future.

We produced a verified implementation of a decompression algorithm in Coq which achieves reasonable performance on inputs of several megabytes. To achieve performance, we investigated several efficient functional data structures, and “semi-functional” data structures like DiffArrays, the latter being the ones to perform best.

In this work we present the various parts of our implementation. Our main contribution is a well-tested formalization of the standard. We furthermore produced a fully verified implementation of canonical prefix-free codings, which can be used for other compression formats as well. We also programmed a compression algorithm in Coq which we formally prove to be inverse to the decompression algorithm – the first time this has been achieved to our knowledge. We will also talk about the difficulties, specifically regarding memory and runtime requirements, and our approaches to overcome them.

German Abstract

Das Kompressionsformat „Deflate“ ist in RFC 1951 spezifiziert. Es ist ein Containerformat, welches präfixfreie Kodierungen („Huffmankodierungen“), Rückreferenzen zur Deduplikation und Lauflängenkodierung unterstützt. Seine unverifizierte Referenzimplementierung ist die ZLib. Das Format ist weit verbreitet, zum Beispiel in diversen Netzwerkprotokollen wie HTTP oder SSH; moderne Dateisysteme wie ZFS und BTRFS unterstützen sogar transparente Kompression. Da es sich nur um ein Containerformat handelt, gibt es viele Möglichkeiten, einen gegebenen Datenstrom zu komprimieren, mit unterschiedlichen Vor- und Nachteilen. Entsprechend gibt es auch alternative Implementierungen, wie das bekannte, von Google implementierte Zopfli.

Der Standard ist schwierig zu lesen, und es gibt etliche Unklarheiten. Wir versuchen, dieses Problem zu beheben, indem wir eine streng mathematische, in Coq formalisierte Spezifikation angeben. Da die Formalisierung axiomatisch ist, und somit selbst nicht verifiziert werden kann, testen wir sie empirisch mit dem Canterbury Corpus. Unsere Formalisierung ist außerdem sehr modular, sodass es einfach ist, einzelne Teile der Implementierung zu testen, bevor sie zusammengefügt werden. Dieser Formalismus sollte sich auch auf andere Datenformate in künftigen verifizierten Implementierungen anwenden lassen.

Wir entwickelten eine verifizierte Implementierung eines Dekompressionsalgorithmus in Coq, welcher bei Eingabedaten von mehreren Megabyte eine angemessene Performanz erzielt. Um Performance zu erreichen, haben wir diverse effiziente funktionale und „semi-funktionale“ Datenstrukturen wie zum Beispiel DiffArrays untersucht; letztere zeigten das beste Verhalten.

In dieser Arbeit stellen wir die verschiedenen Teile unserer Implementierung vor. Das wichtigste Ergebnis ist eine gut getestete Formalisierung des Standards. Wir haben außerdem eine vollverifizierte Implementierung kanonischer präfixfreier Kodierungen, die auch für andere Kompressionsformate benutzt werden kann. Des Weiteren programmierten wir einen Kompressionsalgorithmus in Coq, von dem wir formal beweisen, dass er invers zum Dekompressionsalgorithmus ist – unseres besten Wissens nach das erste Mal, dass so etwas gemacht wurde. Wir werden außerdem über die Schwierigkeiten reden, speziell bezüglich Speicher- und Laufzeitverhalten, und über unsere Ansätze, um sie zu lösen.

Contents

1 Introduction
  1.1 Historical Overview
    1.1.1 Formal Verification
    1.1.2 Data Compression
  1.2 Notation
  1.3 Reasons for Deflate
  1.4 Related Work
    1.4.1 Similar Goals
    1.4.2 Formal Methods in General
    1.4.3 Similar Methodology

2 Technical Overview
  2.1 Design Decisions
  2.2 Trusted Codebase
  2.3 Module Overview

3 Program Extraction
  3.1 Motivation
  3.2 Formalizing
  3.3 Classical Reasoning
  3.4 Phases of Extraction
  3.5 Moravec's Paradox
  3.6 Practical Applications

4 An Introduction To Coq
  4.1 Set and Prop
  4.2 Gauss formula
  4.3 Square Pyramidal Numbers

5 Deflate Codings

6 Parsers from Constructive Proofs
  6.1 Strong Uniqueness
  6.2 Strong Decidability
  6.3 Relational Combinators
  6.4 Streamable Strong Decidability

7 The Encoding Relation
  7.1 Overview
  7.2 The Toplevel Relation
  7.3 Uncompressed Blocks
  7.4 Backreferences
  7.5 Compressed Blocks
    7.5.1 Compressed Code with Extra Bits
    7.5.2 Compressed Data
    7.5.3 Statically Compressed Blocks
    7.5.4 Dynamically Compressed Blocks
  7.6 Refactoring

8 Efficiency
  8.1 Natural Numbers
  8.2 Singly-linked Lists
  8.3 Backreferences
  8.4 Using DiffArrays
  8.5 A Purely Functional, Efficient Backreference-resolver
    8.5.1 Pairing Heaps
    8.5.2 General Idea
    8.5.3 A Formal Proof

9 Extraction and Testing
  9.1 Extraction
    9.1.1 Compatibility
    9.1.2 Makefile
  9.2 Testing Unverified Algorithms
  9.3 Benchmarks
    9.3.1 No Backreferences
    9.3.2 With ExpLists
    9.3.3 With DiffArrays
    9.3.4 Unverified Functional Resolver
    9.3.5 Compression
  9.4 Building and Running

10 Conclusion
  10.1 Further Work
    10.1.1 Streamable Strong Decidability
    10.1.2 Fast Compression
    10.1.3 Trusted Codebase
    10.1.4 Imperative Implementation
    10.1.5 Usage In Other Projects
    10.1.6 Other System Components
  10.2 Lessons Learned

Chapter 1

Introduction

Program testing can be used to show the presence of bugs, but never to show their absence!

Edsger Wybe Dijkstra

In this work, we will describe our implementation of the Deflate compression standard in the Coq proof assistant, and the various aspects of the proof and optimization techniques we used. In this chapter, we will give a short introduction to the history of formal methods and compression, as well as our motivation for choosing Deflate and working on this project.

It is increasingly recognized that traditional methods for maintaining software security are reaching their limits, and different approaches become inevitable. At the same time, formal program verification has reached a state where it becomes realistic to prove correctness of low-level system components and combine them to prove the correctness of larger systems. This is an important step towards fully verified software, but it is also desirable to verify the low-level middleware. While for these components adherence to access restrictions would be assured by an underlying sandbox, functional correctness becomes the main concern.

It is desirable to have some guarantees on data integrity, in the sense that the implementation itself will not produce corrupted output. The possibility of faking output could lead to leaks of information like – in a browser setting – passwords or session IDs, which can be hidden inside injected parts of websites, or worse, in the case of software packages like JAR and APK, it could directly inject binary code. Considering the “Raising Lazarus” bug of LZ4 [28], and several ZLIB vulnerabilities [50, 51], such a scenario is realistic. In Section 1.4.1 we will show some projects that work at this level. We propose to add to this list an implementation of one such middleware, the widely-used compression format Deflate, and analyze the difficulties. In addition to the aforementioned data corruption bugs, there might also

be situations in which sandboxing is not possible, like embedded devices. Our implementation of Deflate so far may not be suitable for embedded devices, but it can be used as a specification for other algorithms. For example, one could think of an implementation in the Cminor language which can be compiled by the CompCert compiler [71] and is verified against our specification. To go even deeper, there is the Bedrock system [36], which allows for specification of low-level algorithms in assembly language, and therefore could be used to make a verified implementation suitable for embedded devices. We will have a closer look at these possibilities in Section 10.1.4.

A common complaint at this point is that one can get this guarantee by just re-defining one's unverified implementations of compression, say c, and decompression, say d, by

    c′x = (⊤, cx)   if d(cx) = x
    c′x = (⊥, x)    otherwise

    d′x = dy        if x = (⊤, y)
    d′x = y         if x = (⊥, y)
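For concreteness, here is a minimal Haskell sketch of this wrapper construction; the names compress and decompress stand for the hypothetical unverified functions, and Either plays the role of the tagged pair above.

import Data.ByteString (ByteString)

-- Hypothetical unverified primitives, assumed to exist elsewhere.
compress, decompress :: ByteString -> ByteString
compress = undefined
decompress = undefined

-- compress' and decompress' correspond to c' and d' above: the compressed
-- form is only used if decompressing it yields back the original input.
compress' :: ByteString -> Either ByteString ByteString
compress' x
  | decompress cx == x = Right cx   -- (⊤, c x): the round trip succeeded
  | otherwise          = Left x     -- (⊥, x): store the input verbatim
  where cx = compress x

decompress' :: Either ByteString ByteString -> ByteString
decompress' (Right y) = decompress y  -- was really compressed
decompress' (Left y)  = y             -- was stored verbatim

By construction, decompress' (compress' x) = x holds for every input, independently of what compress and decompress actually do.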

This works well as long as only one computer architecture is involved, and as long as no bugs are fixed and no new features are added. However, for secure long-term archiving of important data, this is not sufficient: it is not clear that there will be working processors able to run our d implementation in, say, 50 years; but a formal, mathematical, human-readable specification of the actual data format being used can mitigate such digital obsolescence: the language of mathematics is universal.

One limitation of all of these approaches is that, of course, one has to rely on some specification. Besides having to rely on some hardware specification, as pointed out in [69], finding the right formal specification is not trivial. A rigorous formal specification of an informally stated standard must be carefully crafted, and we consider our mathematical specification of Deflate a contribution in this direction.

Parts of this work, specifically Sections 5, 6.1-6.3 and 8.3, have already been published in [89, 90], which was joint work with Martin Hofmann, of which I was the leading author with more than 50% contribution. Small parts of the source code of this work have been discussed in the IRC channel #coq on the irc.freenode.net network.

1.1 Historical Overview

1.1.1 Formal Verification

Formal systems of computability have been of interest at least since it became clear that all of these systems are limited: Wilhelm Ackermann showed in 1928 that there are effectively computable functions which cannot be expressed as primitive recursive functions [23]. This paper already talks about “types” of functions, but only in a very rudimentary way, which would today probably be called the rank of the type of a function: functions N → N are called “type 1”, functions with type 1 arguments or results, like (N → N) → N, are called “type 2”, etc. In 1931, Kurt Gödel proved that the decision problem for sufficiently strong formal systems is not generally solvable [53]. Alan Turing famously proved in 1937 that the halting problem is not solvable by a recursive function [95]. In a further famous paper from 1949, he gives a correctness proof of a procedure that computes factorials, which he does to point out that it is easier to prove correctness of programs by splitting them into smaller pieces and pointing out their invariants, which is essentially proving correctness using pre- and postconditions [96].

The notion of recursive functions was usually given in terms of primitive recursion and the µ-operator. While being Turing complete, this widened the gap to real machines, which is why John McCarthy introduced a formalism with conditionals [74], being closer to the if-then-else instructions of contemporary programming languages, and closer to structural recursion operators.

An early version of a computational interpretation of mathematics was given in 1958 by Kurt Gödel's Dialectica interpretation [54], see Section 3.4. In 1969, William Howard presented a theory which would later become known as the Curry-Howard correspondence [60], interpreting mathematical formulae as types, and derivations in intuitionistic natural deduction as lambda terms. Related to this development is Dana Scott's logic of computable functionals from the same year, which was first presented in an unpublished manuscript in 1969 [87]. Later, in 1982, Per Martin-Löf introduced type theory and pointed out the parallels between mathematics and programming in [73]. These papers form the basis of program extraction, and much of Section 3.2 is about this topic.

In 1967, Robert Floyd introduced the notion of a “flowchart”, which is a directed graph with instructions on its nodes, and of an “interpretation” of a flowchart, which adds assertions to its edges [48]. He defines a “verification” of such an interpretation as a proof of the postconditions from the preconditions and the instructions. This approach was extended by Charles Hoare in 1969 [57] to a fully axiomatic theory of imperative programming. The idea of pre- and postconditions is still the basis of most systems for formal verification of imperative programs, a more modern example being the Ynot library [22], which adds pre- and postconditions to state monads.

1.1.2 Data Compression

Mechanical communication using signals dates back to at least 1837 [15], when the Morse code was invented. The lengths of the letters are chosen according to the frequency with which they occur in the English language, which is the basic idea of entropy coding. Probably the first binary coding was given by Louis Braille in 1825 [3], who invented a system for encoding letters into patterns of up to 6 dots, which enables blind people to read by feeling them. In the context of Braille, there is also an entropy coding, the German Kurzschrift (“short script”) from 1904 [2], which replaces several letters with syllables. For example, there is a special sign for “ch” (⠹), and as “C” (⠉) almost never stands alone in German, the meaning of the letter “C” is changed to mean “en” – a syllable which is frequent in the German language – while to write a single “C”, one has to apply a prefix (⠠⠉).

In 1949, Robert Fano presented Shannon-Fano codings in [46], which are a predecessor of Huffman codings, but not always optimal. He also introduced a notion of “bits”, which he calls a binary “unit of information”. In 1952, David Huffman presented an improved method of generating such minimum-redundancy codes [61]: the Huffman codings, which are used in Deflate and with which we will deal.

Besides improving the coding of information, data often contains repetitions. Abraham Lempel and Jacob Ziv invented a simple algorithm in 1977 [104], the famous LZ77, which is able to deduplicate texts using a “sliding window”. This algorithm has seen several refinements since then, for example the Lempel-Ziv-Storer-Szymanski algorithm from 1982, see [91], which can be used in Deflate. The newer BZip2 format, which came after Deflate, can utilize the Burrows-Wheeler transformation [34].

Deflate as such was introduced by Phil Katz in his implementation of PKZip in 1993, and standardized as an RFC in 1999. Today, even though there are better compression algorithms, the Deflate format is widely used, and many newer formats and protocols, for example PNG and HTTP, have support for it. Its reference implementation is the ZLib [84], but there are several other implementations with different tradeoffs, for example the Zopfli [56] library from Google.

1.2 Notation

Regarding the exposition of this work, a challenge is that we are working with more than one language. Besides the mathematical notation, there are at least Coq, in which the project is written, and Haskell, to which the code is compiled, which have to be considered. It is not always possible to stick with one notation, especially when we have to switch between abstraction levels. Therefore, we at least try to maintain continuity inside each abstraction level. We furthermore use different monospace fonts for Coq and Haskell. Programming language examples which are imperative will be in Java, and we will use this font for them.

Type annotations in Coq are made with a colon a : t; in Haskell, they are made with two colons a :: t. For consing and pattern matching on lists, it is the other way around, x :: xs and x : xs. In mathematical notation, we will usually use the superscript notation aᵗ, but in certain typographic situations, the Coq-like notation a : t will be used. In all of these languages, type annotations are mostly optional when it is clear which type a term has. For list consing, we use :: in mathematical notation.

Coq uses the distinction between Set and Prop to realize computational irrelevance. We will use Prop to denote the Prop type in mathematical notation. Coq therefore has multiple concepts of existence. The usual exists x, A is computationally irrelevant. Mathematically, we will denote it as usual as ∃xA. However, there are also two existential quantifications with computationally relevant eigenvariables, namely { x : A & Q }, where Q may itself be computationally relevant, and { x : A | Q }, where Q is computationally irrelevant. Though the notation is obviously inspired by the mathematical notion of set comprehension, we will not use this notation. We will use the more common notation ΣxA Q for both, as it will always be clear whether Q should be computationally relevant.

In Haskell, the type of lists with elements of type a is denoted by [a]. Similarly, in mathematical notation, we will write [a]. In Coq, it is denoted by list a. [], [] and [] denote the empty list, respectively, and explicit lists are denoted by [1, 2,...], [1,2,...] and [1;2;...]. We may call the empty list “nil”. Notice that the two notations are overloaded, that is, [a] can mean a type or a list with one element, depending on a, but this will always be clear. To talk about the n-th element of a list, Haskell has the (!!) operator, and we will use a !! n in mathematical notation as well. There is no directly corresponding function in Coq, though there are nth : (forall A : Type, nat -> list A -> A -> A), which returns a default value if out of bounds, and nth_error : (forall A : Type, list A -> nat -> option A). In mathematical notation, we will write a⁺ for the lists of elements of type a which are not []. Furthermore, we will write [t]ⁿ for the list [t, …, t] consisting of n copies of t.

Standard functions like map, drop, find, etc., will be denoted as map, drop, find, etc., in mathematical notation, and we will put the non-list argument in the index, so map_f translates to map f, as we think this improves readability.

In Coq, the unit type is denoted by unit, and its only inhabitant is called tt. In Haskell, the unit type is called (), and its inhabitant is also called (). In mathematical notation, we will call the unit type 1, and its inhabitant ().

For product types, Coq has several notions, depending on the computational content.
Inside Prop, there is A /\ B, which we will mathematically denote as A ∧ B. For the actual pair type, Coq has the notation A*B, Haskell denotes it as (A, B), and mathematically, we will denote it as A × B. For sum types, analogously, there is A \/ B, which we will mathematically denote as A ∨ B, and A + B for the computationally relevant sum type.

Mathematically, we will also denote it by A + B, with constructors inl and inr. In Haskell, we will usually use the Either type for it, which has the constructors Left and Right. There is the special case of the type 1 + E. It is called option in Coq, and its constructors are called Some and None. In Haskell, it is called Maybe, and has the constructors Just and Nothing. Mathematically, we will stick with 1 + E.

We will not distinguish sets and types. In most cases, the sets we talk about are finite or inductively defined. For example, we will call the natural numbers N. In our examples, we might use the DiffArray [10] library. In our implementation, we use CpdtTactics [37].
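To summarize the Haskell side of these conventions, the following small sketch (with hypothetical example values) uses the types just mentioned: the unit type, pairs, Either as the computationally relevant sum, Maybe as the special case 1 + E, and list indexing with (!!).

-- Hypothetical example values illustrating the Haskell notation used above.
unitValue :: ()
unitValue = ()

pairExample :: (Integer, Bool)        -- corresponds to A × B
pairExample = (42, True)

sumExample :: Either Integer Bool     -- corresponds to A + B
sumExample = Left 42

optionExample :: Maybe Integer        -- corresponds to 1 + E
optionExample = Just 23

thirdElement :: Integer               -- list indexing, a !! n
thirdElement = [10, 20, 30, 40] !! 2  -- yields 30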

1.3 Reasons for Deflate

We started our work with the observation that dependently typed languages fit perfectly for verifying purely functional, high-level algorithms like the one given in our next example below, while low-level algorithms, like the ones used for programming embedded devices or in high-performance computing, are usually not machine-verified, even though these algorithms are usually more complicated to understand.

A very simple but beautiful example of such a high-level algorithm is list reversal. There is a canonical implementation of list reversal, which intuitively does the right thing (though formally, it depends on the ++ function, which makes it more complex):

Fixpoint lrev {A} (l : list A) :=
  match l with
  | [] => []
  | (x :: l_) => (lrev l_) ++ [x]
  end.

This algorithm is so simple that it needs no explanation and no verification at all. It essentially is a specification of list reversal. However, as concatenation takes linear time, this algorithm takes quadratic time. The following algorithm only uses linear time:

Fixpoint lrev2_ {A} (l y : list A) :=
  match l with
  | [] => y
  | (x :: l_) => lrev2_ l_ (x :: y)
  end.

Function lrev2 {A} (l : list A) := lrev2_ l [].

However, this algorithm is less obvious. We can, nevertheless, easily prove that it is equivalent. The first thing we prove is a lemma stating that lrev2_ l m = lrev l ++ m, by a simple inductive argument. From this, our claim follows directly by setting m = [].

Lemma lrev_lrev_ : forall {A} (l m : list A),
    lrev2_ l m = lrev l ++ m.
Proof.
  intros A.
  induction l as [|a l IHl].
  + auto.
  + intro m.
    simpl.
    rewrite -> IHl.
    rewrite <- app_assoc.
    reflexivity.
Qed.

Corollary lrev_lrev : forall {A} (l : list A), lrev2 l = lrev l.
Proof.
  intros.
  unfold lrev2.
  rewrite -> lrev_lrev_.
  apply app_nil_r.
Qed.

The great advantage of functional purity is that it helps writing algorithms which are “correct by design”, so the work of actually verifying them is usually not needed. However, in the presence of tight space and time requirements, it might be inevitable to write low-level code, which is not trivially correct. This holds even in the purely functional realm: structures like real-time catenable deques [66] are not “obviously” correct anymore.

In search of something from the “real world”, we decided in favor of the Deflate compression standard [41], since it is a widely used standard for lossless general-purpose compression, and since so many other formats refer to it: HTTP can make use of it through an HTTP Content-Encoding [47], and so do ZIP and derived formats (APK, JAR). Source code tarballs are often compressed with GZip [42], which is a container around a Deflate stream [42]. Zlib [43] is another such container format. Both GZip and Zlib are often confused with Deflate. TLS supports Deflate compression [59], even though it is considered deprecated due to the BREACH family of exploits [67]. The filesystem ZFS allows selecting GZip as its compression format. GZip allows programs to “squeeze out” every bit of memory. Besides that, it uses some nice properties from coding theory, which, to the best of our knowledge, have not been formalized yet, and which are used by other compression formats like BZip2, too. It was therefore an ideal candidate for a case study on low-level formal verification.

1.4 Related Work

In this section we want to give an overview of the current state of the art of formal program verification, and relate several projects to ours. We can roughly separate these into projects that use a similar methodology, and projects that have a similar goal.

1.4.1 Similar Goals

Our original goal was to verify higher-level middleware, and we considered compression to be an important part of it, but could not find any implementation of a realistic compression and decompression format yet. Therefore, we chose the Deflate compression standard, which we considered worthwhile.

From the general topic of data compression, there is a formalization of Shannon's theorems in Coq [24]. While being interesting in general, it is not useful for our specific project. A formalization of Huffman's algorithm can be found in [33] in Isabelle, and in [94] in Coq. As we focussed on decompression rather than compression, we had no use for this in our project so far. Our proof-of-concept compression algorithm only uses backreferences, not codings. Furthermore, as we will point out in Section 5, the codings Deflate requires do not need to be Huffman codings, but they need to satisfy a canonicity condition, and the code lengths are bounded.

An important example of middleware is compiler infrastructure. The CompCert compiler is a realistic compiler infrastructure for a large subset of the C programming language. [71] points out that a compiler bug can invalidate all guarantees obtained by formal methods, which is similar to our concerns about data integrity when the data is compressed. This compiler uses several layers of differently complex intermediate languages, with operational semantics specified on the abstract syntax trees. A correctness property is then derived that guarantees that the final code – the assembly code – is equivalent to the original source code. From this project evolved the Verified Software Toolchain [26], which uses it to verify C algorithms, using a semantics based on separation logic. We think that it is possible to use our specification to verify some highly optimized C source for Deflate. This is, however, further work.

Furthermore, there is the CakeML [4] compiler, which aims, similarly to CompCert, to become a realistically usable and formally verified subset of ML. It also uses a formal semantics of the underlying processor architecture and a formal language semantics, and proves that compilation preserves the semantics. Furthermore, there are efforts to create a verified extraction mechanism to CakeML [62]. Hence, CakeML would also be a candidate for a more optimized implementation of Deflate to be verified against our specification.

In [72], a relational database management system is presented, which is implemented in Coq. It specifies a semantics on abstract syntax trees of SQL commands, and has a verified query optimizer that recompiles the queries. It is therefore related to compiler infrastructures, but also to the general topic of data integrity.

The Vellvm project [20, 103] aims to build “a framework for reasoning about programs expressed in LLVM's intermediate representation and transformations that operate on it”. Similar to our project, it formalizes an informal specification, implements it, and tests it with real-life examples.

The RockSalt software fault isolator [79] is a sandboxing mechanism that checks executable memory pages before allowing them to be executed, enforcing certain policies. RockSalt is implemented in Coq, and, like CompCert and CakeML, uses a semantics for the x86 instruction set. This kind of security enforcement became popular with Google's Native Client [21].
It can be seen as a special form of virtual machine which realizes security not by just-in-time compilation or processor-level privilege separation, but by checking.

The Quark Browser [63] is a web browser which uses “shim verification”, that is, a capability-based sandboxing mechanism. With this approach, it can provide security guarantees for unverified components. This makes it safe against attacks on implementations of web standards. On the other hand, of course, there are no guarantees on data integrity. The method they use is generalized in [83], where the Reflex DSL is introduced, with which several components can be isolated, and only the kernel of the browser has to be verified to actually satisfy permissions. We think that this is a reasonable approach to get formal guarantees from unverified code, and a similar approach could be used to reimplement some larger software project one part at a time. However, directly verifying data integrity is the better way in the long run, in our opinion. Quark uses kernel-level privilege separation. Therefore, the sandboxing mechanism is still part of the trusted codebase. It might be interesting to use something similar to the aforementioned RockSalt project for separation instead.

The ConCon conference management system [64] has a large trusted codebase around a small kernel which assures that no information is leaked, except according to specified capabilities. It generates a RESTful web service adhering to a specification in Isabelle, and a web interface around this service. This is a further example of verified middleware: the underlying operating system kernel and even parts of the underlying middleware are part of the trusted codebase.

The FSCQ Filesystem [35] is a filesystem that uses a specification which also considers system crashes, and is hence crash-tolerant – a data integrity property. It is verified in Coq and uses the FUSE virtual filesystem API [17], so it is not part of the kernel, but also middleware. This project is also interesting with respect to the fact that it uses a specification that was not a widely used standard before, and therefore shows that it is possible to start a new protocol using a formal specification from the start.

The MiTLS [32] project implements TLS. Besides correctness and data integrity, cryptographic security properties become the main concern. In [32], an example about “alert fragmentation” is given, which might give an attacker the possibility to change error codes by injection of a byte. This is standard-compliant, but obviously not intended. In this case, as this is a cryptographic protocol, eliminating such specification bugs is more important than adhering to a formal specification.

In the context of cryptography, one should also name CertiCrypt, a framework for cryptographic proofs, which “takes a language-based approach to cryptography: security goals and assumptions are expressed by means of probabilistic programs. In a similar way, adversarial models are specified in terms of complexity classes, e.g. probabilistic polynomial-time programs.” [6]

1.4.2 Formal Methods in General

At the time of writing, the most well-known project that is fully verified is probably the seL4 [69] kernel. It is an implementation of a microkernel which is – except for small parts – formally verified. It uses Haskell code, and after refinement, it uses a formal semantics on the binary level, to assure that the compiler is not part of the trusted codebase. Its permission system is based on capabilities through a capability distribution language, capDL. As it is a microkernel, drivers, being processes, do not belong to the verified core but run in sandboxes. Thus, the amount of work to get a fully verified system is smaller. Besides being much larger than our project, it focuses on the kernel level rather than on user-level middleware, and it uses an entirely different approach than we do.

Another approach to verifying software components is automatic software analysis. The Linux kernel has been analyzed by several model checkers [102], and several bugs have been found. Components of Windows have been analyzed, and in [52], it is argued that formal verification is costly and only gives minor advantages over modern program analysis: the SAGE system made it virtually impossible to find buffer overflows, and it is argued that “if nobody can find bugs in P, P is observationally equiv to ‘verified’”. While we also think that program analysis is absolutely worthwhile, we do not really agree with this standpoint. We consider dynamic analysis more a way of software testing than a way of software verification, and, as Dijkstra said in [101], “Program testing can be used to show the presence of bugs, but never to show their absence!”.

Furthermore, there is more to software correctness than security against explicit attacks. While static and dynamic analysis is a good way of ensuring that old code does not contain certain types of vulnerabilities, for a compression algorithm like in our case, it is desirable to get the formal property that decompression after compression yields the original data, no matter what that data was. Also notice that software which has been fully formally verified is usually also fully specified, and therefore documented.

1.4.3 Similar Methodology

We are trying to use program extraction as our main programming technique, in the sense that we prove properties and existences mathematically, and then extract a program which can compute them. It is somewhat unusual to use this approach to directly produce application software, rather than to verify other tools.

For example, in [70], one such tool, a DPLL-based SAT solver, is extracted by “[proving] a theorem that just states that each formula in CNF is either unsatisfiable or has a model, and [synthesising] the program from the proof”: it proves that a certain given calculus can derive pairs of valuations and formulae in conjunctive normal form

(Γ : Variables → Maybe Bool, ∆), where ∆ is a set of subsets of the set Variables ∪ {¬v | v ∈ Variables}, to be read as the logical formula ⋀_{C∈∆} ⋁_{g∈C} g, which is false after being assigned the truth values from an arbitrary function Γ̃ : Variables → Bool that agrees with Γ, meaning Γ(g) = Just b implies Γ̃(g) = b. From the completeness of this calculus follows an algorithm which decides such pairs, and from such a decision procedure, we get a SAT solver, by checking whether (λx. Nothing, ∆) is derivable for some formula ∆. In our diagram in Section 3, this would be reconstruction, translation, pruning and compilation; Minlog was used in this project. They also give some information on efficiency considerations, which are, however, not applicable to our case.

The case study [81], albeit on a different topic (Myhill-Nerode), is an interesting source of inspiration in that it distills general principles for improving efficiency of extracted programs, which we have integrated where applicable. We will discuss this in further detail in Section 3.

Our theory of getting parsers from constructive proofs from Section 6 follows an idea similar to [30], trying to produce parsers directly from proofs, as opposed to other approaches, for example [40], which defines a formal semantics on parser combinators. Most of the algorithms involved in parsing are short, and therefore, we directly use program extraction for the largest part, and we think it is a feasible way of producing parsers for complicated grammars.

In [45], a verified LTL solver is shown, which was verified internally inside a theory in Isabelle, and then extracted to ML. In our diagram in Section 3, the extraction part would be “compilation”, but notice that the reason for this is that this project focuses on refinement of programs, and keeps the formal, computationally irrelevant parts apart from the program in advance, and therefore avoids the “pruning” part. While we preferred the former method of program extraction, for the other parts, this method is comparable to ours in the sense that we also sometimes implement code in the Coq language and verify properties separately; however, while at some positions we use refinement, we do not use a sophisticated framework for it, as this is not the focus of our work.

Notice that DPLL and LTL solvers are tools for automatic verification rather than actual middleware for application software like our Deflate implementation.

Chapter 2

Technical Overview

2.1 Design Decisions

Our first approach was to implement Deflate in Haskell, and then try to use the information we gained to give a verified implementation in Agda [1]. However, at that time, Agda was missing a working library of rational numbers with addition (which is useful in the realm of canonical Huffman codings), and it did not have many proof strategies. While programming this would have been possible, it was not our chosen topic. Similarly, Idris [13] would have been a language of choice, since it aims for practical usability, but it was not ready for our project. Therefore, we decided to switch to Coq.

Coq is an LCF-style proof checker, meaning that it has a very small kernel written in OCaml, through which all proofs are checked, and an extensive library of proof strategies. It furthermore allows for program extraction, which we used to test our specifications empirically: while it was clear that we specify some compression format, it was not clear that it actually is what the world understands by Deflate. The Deflate standard [41] is informally specified, and it is not impossible that our formal specification has errors: in fact, during our work, some mistakes occurred, partially because of the strange nature of the standard in some places; for examples, see Section 7.1. However, this assured us even more that a fully formal specification of the standard, against which new implementations can be checked, is useful. There has been a lot of development in program verification and type theory since we started with our project, meaning that some parts might not be up to date with respect to the techniques we used; however, our main contribution is not the implementation itself, but the specification, which should be easily portable to other languages.

Of course, the ZLib [84] is a reference implementation of Deflate, and we could just have taken its source and verified it against our specification. This way, we would probably have been able to find bugs in both our relation as well as the ZLib. However, the ZLib itself is written in highly optimized C, and verification of existing and highly optimized C code is still a field of research on its own (see [26]). While we certainly think that formally verifying the original ZLib is an interesting and worthwhile topic, we decided to create our own new implementation of the standard, the consequence being that we have to test it against real-world data. Program extraction from simple proofs yields algorithms which can be of bad performance, but are usually sufficient for testing against small datasets as a plausibility check.

This was one reason for separating the specification from the implementation: in theory, just programming a decompression algorithm and verifying that it is inverse to some compression algorithm would be sufficient, but we wanted to be able to give different implementations of the standard, combine them, test them and maintain them. Another reason was that the Deflate format is itself not entirely strict: it leaves the details of how the data is compressed to the implementation (though the standard gives suggestions). This way, it is possible to tune an implementation for speed or saved memory. The Zopfli library [56], for example, tunes for better compression at the cost of runtime. Having a relational specification which is sufficiently easy to understand (and should also be comparably easy to “port” to other proof checkers) seemed to be the better alternative.

We decided to give high-level proofs that involve lots of tactics, rather than giving proof terms, whenever this was possible. However, we structured and annotated our proofs, and in the later code we used Coq's tactic combination language. One reason is that Coq itself is under active development, and tactics might change their behavior. Another reason lies in the improbable case of a specification bug. A clear presentation of the proofs lowers the impact of such changes, and makes it easier to adapt proofs when necessary.

Our concepts of strong decidability and strong uniqueness, which we will introduce in Section 6, allow us to replace parts of the implementation, and make our specification and implementation very modular. This way, we could test parts of the specification without needing to use all the other parts. For optimization, we follow a top-down approach, replacing the current bottlenecks first. Coq's Extract Constant mechanism furthermore allows us to replace parts of the algorithm with unverified code, therefore allowing us to test the efficiency of complicated algorithms prior to verifying them. We do this, for example, in our benchmark for our still unverified backreference algorithm in Section 9.3.

However, to this point, we are not trying to compete with the speed of the ZLib. While we give some digressions on how to improve efficiency, our goal is an implementation with reasonable space and runtime requirements. While our first implementations took weeks for several KiB, we are now at the magnitude of seconds for several MiB, and we believe that it is possible to continue this trend. However, memory consumption is a more recent problem, which we address, for example, in Section 6.4.

2.2 Trusted Codebase

Coq allows for extraction to OCaml, Haskell and Scheme. The extraction mechanism itself is not verified. For example, we found a bug in the extraction mechanism during our studies, see [9]. The workaround required rewriting the array_set function in Combi.v. There are efforts to create a verified extraction mechanism [62], but currently, no such mechanism is available, and therefore the extraction mechanism has to be part of our trusted codebase. We began using extraction to Haskell, because it allows doing lazy I/O transparently, adding the GHC to our trusted codebase.

We use singly-linked lists of bytes and even bits as intermediate values, as these are well-understood and easy. However, in the absence of lazy evaluation, this results in having all the intermediate values fully loaded; lists are usually represented as cons cells of two machine words, and boxed booleans are represented as machine words – this means up to 1536 times the size of the original file. It would have been possible to use coinductive lists (“streams”), but strong decidability (which we define in Section 6) would not hold anymore on these, which we prove in Section 6.2. Our code relies on lazy evaluation, and it does lazy I/O, even though this is discouraged by the Haskell community [18]. Changing this would require a lot of effort, which, for now, we leave as future work.

It is important to know the trusted codebase of verified software, since it is axiomatic, and a bug in it can invalidate all proven guarantees. In [49], such a case, where work was done with wrong assumptions, is analyzed. We do not make any assumptions about the underlying operating system itself, but work at a higher level of abstraction, which seems more suitable for an implementation of a data format. We will have a closer look at the trusted codebase and how it can be made smaller in Section 10.1.3.
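To make this intermediate representation concrete, the following small Haskell sketch (not taken from our extracted code) turns a strict ByteString into the kind of singly-linked list of bits, least significant bit first, that the decompressor consumes; the comments indicate why this form is convenient but memory-hungry once fully forced.

import qualified Data.ByteString as BS
import Data.Bits (testBit)
import Data.Word (Word8)

-- One byte as a list of 8 booleans, least significant bit first.
byteToBits :: Word8 -> [Bool]
byteToBits w = [testBit w i | i <- [0 .. 7]]

-- A whole input as a flat list of bits. Every byte becomes eight cons
-- cells (plus shared Bool constructors), so forcing the entire list
-- multiplies the memory footprint of the input considerably.
toBits :: BS.ByteString -> [Bool]
toBits = concatMap byteToBits . BS.unpack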

2.3 Module Overview

This is a brief overview of the modules our project consists of. Some concepts will be introduced later, but we mention them here for later reference.

• Backreferences.v: Lemmata and definitions regarding the resolution of backreferences.

• Combi.v: Several lemmata and functions that are mostly combinatorial and are used multiple times throughout the project.

• Compress.v: An implementation of a simple compression algorithm.

• DecompressWithPheap.v: An unfinished implementation of our algorithm from Section 8.5. Finishing it is future work.

• DeflateCoding.v: Definition of deflate codings, proof of uniqueness and existence of codings for a sequence of code lengths.

• DiffStack.v: An implementation of the algorithm described in Section 8.4.

• EncodingRelation.v: Most of the encoding relation is specified here.

• EncodingRelationProperties.v: Proofs of strong decidability and strong uniqueness of the relations from EncodingRelation.v.

• EncodingRelationProperties2.v: Definition of streamable strong decidability, and some proofs.

• ExpList.v: Decompression using a queue of doom of two ExpLists for backreference resolution, see Section 8.3.

• Extraction.v: Definitions of constants for extraction.

• HashTable.v: An implementation of a hash table, using an FMapAVL. Mainly used in Compress.v.

• Intervals.v: Helper functions for the compression.

• KraftList.v: Kraft’s inequality for sets (duplicate-free lists).

• KraftVec.v: Kraft’s inequality for vectors and codings.

• Lex.v: Lemmata and functions about the lexicographical ordering.

• LSB.v: Several lemmata for converting numbers in least-significant-bit-first format into natural numbers, converting bytes to Fin 256 and vice versa, and proving several lemmata about them (uniqueness, etc.).

• NoRBR.v: Extraction of the algorithm that decompresses, but does not resolve backreferences.

• Pheap.v: Verified implementation of a pairing heap.

• Prefix.v: Lemmata and functions about the prefix relation.

• Quicksort.v: An implementation of quicksort.

• Repeat.v: Lemmata and functions about repetitive lists.

• Shorthand.v: Lots of small definitions that make writing Coq code easier. For example, Notation Vnth := Vector.nth.

• StrongDec.v: Definitions of strong uniqueness and decidability, several lemmata.

• Transports.v: In many cases, it is necessary to transport types into other types. For example, Definition vec_id A a b (eq : a = b) : vec A a -> vec A b. Such definitions and lemmata are in this module.

Chapter 3

Program Extraction

We dedicate this section to a general introduction to and an overview of program extraction from proofs. For the largest part, it is not Coq-specific, and we will not provide a full formalization. For a complete specification of the calculus of inductive constructions, on which the Coq proof assistant is based, we refer to [93].

3.1 Motivation

As a motivation, let us look at a very simple proof:

Lemma 1. For every n ∈ N, there either exists an m ∈ N such that n = m + m, or an m ∈ N such that n = m + m + 1.

Proof. We now prove this by induction.

• For n = 0 we have m = 0.

• For n = n′ + 1, there are two cases:

  – Assume n′ = m′ + m′ for some m′. Then n = n′ + 1 = m′ + m′ + 1, so m = m′.

  – Assume n′ = m′ + m′ + 1. Then n = m′ + m′ + 1 + 1 = (m′ + 1) + (m′ + 1), and so m = m′ + 1.

This proof intuitively translates into the following algorithm:

div2 :: Integer -> Either Integer Integer
div2 0 = Left 0
div2 n = case div2 (n - 1) of
  Left m'  -> Right m'
  Right m' -> Left $ m' + 1

The objective is to give this “intuition” a formal background. We will use this as a motivating example in the following sections.


3.2 Formalizing

The calculus of inductive constructions, with which Coq works, is a sophisticated calculus with scalability in mind. As the goal of this section is rather to give an idea of what program extraction is about, it is out of scope here, and we will only look at a few simple rules, which resemble the calculus of natural deduction. A detailed introduction to the calculus of inductive constructions can be found in [93].

One central rule in mathematics is usually called modus ponens: if we have a proof for A → B, and a proof for A, we also know B. In some calculi, this can be derived, but mostly, this is a basic rule itself, called implication elimination. Similarly, when having a term a : A → B and a term b : A, by application, we get a term (a b) : B.

    A → B     A
    ------------      from a : A → B and b : A we form (a b) : B
         B

The opposite rule is implication introduction: if we have a proof of B which uses A as an assumption (to which we give a name, say u), this is a proof of A → B. The computational counterpart is abstraction: if we have a term M : B containing a free variable u : A, we may create (λu. M) : A → B.

    [u : A]
       ⋮
       B
    -------           (λu. M) : A → B
    A → B

A rule similar to modus ponens is forall-elimination: if we have a proof of ∀x:A. B, and an object o of type A, then we get B[x := o]. There is no computational counterpart in the simply typed λ-calculus anymore; the computational counterpart is dependently typed: we have a term M : ∀x:A. B, and applying it to an object o : A, we get (M o) : B[x := o].

    ∀x:A. B     o : A
    -----------------     from M : ∀x:A. B and o : A we form (M o) : B[x := o]
        B[x := o]

Finally, forall-introduction says that if we can prove B without any assumption regarding x : A, then we can prove ∀x:A. B. Computationally, this is again abstraction, but with a dependent type: (λx. M) : ∀x:A. B.

       ⋮
       B
    ---------          (λx. M) : ∀x:A. B
    ∀x:A. B

These rules are already the basic rules of natural deduction. Type theory adds more intricate rules, but these rules are enough to convey the basic idea.

We could add rules for ∧, ∨ and Σ, but we can as well give axioms for their introduction and elimination.

• ∨⁺_B : A → A ∨ B,   ∨⁺_A : B → A ∨ B
  ∨⁻ : A ∨ B → (A → C) → (B → C) → C

• ∧⁺ : A → B → A ∧ B
  ∧⁻ : A ∧ B → (A → B → C) → C

• Σ⁺ : ∀x (A → Σx A)
  Σ⁻ : Σx A → (∀x (A → B)) → B   (where x does not occur freely in B)

The terms for ∧ and ∨ can be interpreted as the constructors and recursion operators of the product and sum types (pairs and tagged unions), respectively. The interpretation of Σ is a dependent product, where the first component is the object, and the second component is a proof of the proposition about this object. This duality of mathematical proofs and algorithms is often called the Curry-Howard isomorphism.

Finally, to formalize our proof, we need rules for the natural numbers N. We have two axiomatic terms 0 : N and S : N → N, and for induction we have the term

    ind_P : P 0 → (∀m (P m → P (S m))) → ∀n P n

The induction term can be interpreted as a recursion operator on natural numbers. We can now formalize the proof. We leave out annotations where they are clear. The proposition we want to prove is

    ∀n. Σm (n = m + m) ∨ Σm (n = m + m + 1)

and our proof will look like

    ind_{λn. Q(n)} c₁ c₂

where Q(n) = Σm (n = m + m) ∨ Σm (n = m + m + 1), c₁ : Q(0) and c₂ : ∀n (Q(n) → Q(S n)). We have

    c₁ = ∨⁺ (Σ⁺ 0 ω^{0 = 0 + 0})

In the following, the term ω(t₁, . . . , t_k) proves an equality that can be calculated directly from the arguments t₁, . . . , t_k. We will not elaborate on this simple kind of proof here, and just accept that it is provable – in Coq, the tactic for solving such simple equations is called omega, and uses Presburger Arithmetic. For c₂, we have a case distinction on Q(n):

    c₂ = λn:N. λq:Q(n). ∨⁻ q d₁ d₂

where d₁ : Σm (n = m + m) → Q(n + 1) and d₂ : Σm (n = m + m + 1) → Q(n + 1) are given by

    d₁ = Σ⁻ (λm:N. λt:(n = m + m). ∨⁺ (Σ⁺ m ω(t)^{n + 1 = m + m + 1}))

and similarly

    d₂ = Σ⁻ (λm:N. λt:(n = m + m + 1). ∨⁺ (Σ⁺ (m + 1) ω(t)^{n + 1 = (m + 1) + (m + 1)}))

So we have

    ind (∨⁺ (Σ⁺ 0 (ω())))
        (λn. λq. ∨⁻ q (Σ⁻ (λm. λt. ∨⁺ (Σ⁺ m ω(t))))
                      (Σ⁻ (λm. λt. ∨⁺ (Σ⁺ (m + 1) ω(t)))))

which resembles the above algorithm. We can formalize the proof in Coq, too:

Lemma div2 : forall (n : nat),
    {m | n = m + m} + {m | n = m + m + 1}.
Proof.
  induction n as [|n' IHn].
  + apply inl. exists 0. reflexivity.
  + destruct IHn as [[m' M] | [m' M]].
    - apply inr. exists m'. omega.
    - apply inl. exists (m' + 1). omega.
Qed.

The extracted Haskell code then looks like

div2 :: Nat -> Sum Nat Nat
div2 n =
  nat_rec (Inl O)
    (\_ iHn -> case iHn of {
       Inl s -> Inr s;
       Inr s -> Inl (add s (S O)) })
    n

Coq defines natural numbers in terms of S and O, if we do not override this manually. Coq furthermore removes the dependent pairs: the result of the extracted term is a Sum Nat Nat; the second component of the dependent pairs lies in the Prop universe, which is computationally irrelevant: it encodes invariants that we prove, but for the computation itself, it is not necessary, and hence erased. Instead of leaving it out, we can interpret it as the unit type, and we get the following algorithm, which resembles the proof term more directly:

div2 :: Integer -> Either (Integer, ()) (Integer, ())
div2 n_ =
  case n_ of
    0 -> Left (0, ())
    n -> case div2 (n - 1) of
      Left q  -> case q of (m', ()) -> Right (m', ())
      Right q -> case q of (m', ()) -> Left (m' + 1, ())

However, it is clear that this algorithm can be automatically optimized to the above one.

3.3 Classical Reasoning

As soon as we introduce classical reasoning, things get more complicated. A standard example which can be found in many places throughout the related literature is the following classical proof.

Lemma 2. There exist two irrational numbers p, q ∈ R \ Q such that p^q ∈ Q.

Proof. If √2^√2 ∈ Q, then p = q = √2 is the solution. Otherwise, we have (√2^√2)^√2 = √2² = 2 ∈ Q, so p = √2^√2 and q = √2 is the solution.

We are using tertium non datur, t : √2^√2 ∈ Q ∨ √2^√2 ∉ Q, in this proof. However, we have no realizer for this axiom: in general, it is not decidable whether a real number is rational, since no non-trivial subset of R is decidable. However, in many cases, one can use tricks to extract algorithms from classical proofs. We analyzed a few examples in [86].

On the other hand, as long as only the computationally irrelevant parts require tertium non datur, we can in theory use it. In Coq, computationally irrelevant terms are of type Prop, and it is possible to add tertium non datur for these terms. However, for the largest part, our own code is computationally relevant, and whenever classical reasoning would really make things easier, it would also break program extraction. Therefore, we do not use it.

3.4 Phases of Extraction

In our personal experience, there is some confusion about the concepts revolving around program extraction, starting with the simple question about what program extraction actually is. The problem is that every proof resembles some sort of algorithm: a witness of the theorem can be computed, as soon as there are realizers of all axioms that were used – it is just that in the case of non-constructive proofs, realizers of axioms might not exist.

A well-known example of an inherently non-constructive theorem is

Theorem 1. Every infinite sequence a : N → N of natural numbers has an infinite non-descending subsequence.

Proof. Assume ∀n ∃m ∀k≥m. a(k) ≠ n. Then let q be such that ∀n ∀k≥q(n). a(k) ≠ n. We define r(n) := max{q(0), . . . , q(n)}. Then, ∀n ∀k≥r(n). a(k) > n. In particular, a(r(n)) > n. Now let x(0) = a(0) and x(n + 1) = a(r(x(n))). Then x(n + 1) = a(r(x(n))) > x(n), and hence, x is a strictly increasing subsequence of a.

Now assume ¬∀n ∃m ∀k≥m. a(k) ≠ n. That means, classically, ∃n ∀m ∃k≥m. a(k) = n, and hence, the constant sequence λ_.n is a non-decreasing subsequence of a.
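Read computationally, the first case of this proof constructs the subsequence relative to an oracle. The following hypothetical Haskell sketch transcribes that reading; the parameter q is the assumed oracle (with the promise that a k ≠ n for all k ≥ q n) and is not, in general, a computable function.

-- a : the sequence, q : the assumed oracle, result : the subsequence x.
-- Indices are natural numbers, so only non-negative arguments are meaningful.
ascending :: (Integer -> Integer)    -- the sequence a
          -> (Integer -> Integer)    -- oracle q with: k >= q n implies a k /= n
          -> (Integer -> Integer)    -- the strictly increasing subsequence x
ascending a q = x
  where
    r n = maximum [ q m | m <- [0 .. n] ]  -- r(n) = max{q(0), ..., q(n)}
    x 0 = a 0
    x n = a (r (x (n - 1)))
    -- x is strictly increasing whenever q satisfies its promise.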

This proof is strictly non-constructive: obviously, we cannot compute this subsequence, at least not without further knowledge about the sequence a. However, if we have an oracle that tells us that every n ∈ N occurs at most up to an index k, we can construct an infinitely ascending sequence. If the oracle tells us the opposite, and if it is also kind enough to tell us a counterexample, we can also construct a subsequence. Hence, it describes an algorithm. Therefore, from a certain point of view, talking about “program extraction from proofs” sounds wrong in the sense that every proof is a program. However, the usual mathematician will not regard proofs as algorithms, and it is arguable whether an algorithm that requires oracles should be regarded as a “program”. We will distinguish programs and proofs in a more intuitive way: a proof is something that validates some lemma or theorem, and a program is something that can compute some desired output from given inputs. Both concepts are two sides of the same coin.

Starting from a proof in natural language, the first thing that usually exists is a formalization of that proof in some language that is designed to be easy to cope with. For example, Mizar's goal is to make machine proofs readable and writable in a manner that is close to natural language. It uses a very strong theory, such that Mizar proofs are not the easiest starting point for extracting programs, which is not a goal of Mizar [80]. In Coq and Minlog [76], you usually start with a sequence of tactic applications, which try to automatically break down the problem into smaller subproblems. This sequence of tactics can be converted into a proof tree. After such a proof is finished, the generated proof tree is checked by a kernel. Therefore, tactics can be written in an unverified manner, and might yield wrong proofs, but these proofs would be rejected by the kernel. This kind of proof checker is called LCF-style, because the LCF proof checker used this model. We say the proof term is reconstructed from this sequence of tactics. An example is our code from Section 3.2. If it is a classical proof, some Glivenko-style theorem is usually applied. Such theorems are in some form generalizations of Glivenko's theorem:

Theorem 2. Let Γ be a set of propositional axioms, and γ be some propositional formula following classically from Γ. Then ¬¬γ follows intuitionistically from {¬¬ϑ | ϑ ∈ Γ}.

The proof does structural induction on proof terms. If we can classically prove some fact ∀xA(x), then we can intuitionistically prove ¬¬∀xA(x), which is equivalent to ¬∃x¬A(x), which relates it to the no counterexample interpretation. Some systems, specifically Minlog [76], distinguish between proofs and programs, and therefore define realizers for terms. An early version which does this specifically for Peano arithmetic was Gödel’s Dialectica Interpretation [27]. One usually uses + and × instead of ∨ and ∧ on the type level in this context, and defines, as interpretation τ:

τ(⊥) := ⊥
τ(A ∧ B) := τ(A) × τ(B)
τ(A ∨ B) := τ(A) + τ(B)
τ(A → B) := τ(A) → τ(B)
τ(∀x^δ Q) := δ → τ(Q)
τ(∃x^δ Q) := δ × τ(Q)

Specifically, τ(¬Q) = τ(Q) → ⊥. In dependent type theory, proofs and programs need no distinction, and the type of a term is the same as the proved formula of a proof, because types are formulas and terms are proofs. However, for the moment, the distinction between proofs and terms makes things easier. In all systems that are suitable for program extraction, the existence of a closed term t : τ(A) is equivalent to the existence of a proof of A. Now, we have τ(¬∃x^δ ¬A) = τ(∃x^δ ¬A) → ⊥, meaning that a proof of ∃x^δ ¬A would yield an element of the empty type, which is impossible. But τ(∃x^δ ¬A) = δ × τ(¬A) is the type of counterexamples for A. Hence, the mixture of the negative translation and the Dialectica interpretation is called no counterexample interpretation. In [31], a classical consequence of our Theorem 1 for pairs of numbers is given:

Theorem 3. ∀f,g : N→N. ∃i,j : N. i < j ∧ f(j) ≮ f(i) ∧ g(j) ≮ g(i)

It uses the following minimum principle:

Lemma 3. Given a type A, a property P : A → Prop, and a measure function m : A → N, then ∃x_A P(x) → ∃x_A (P(x) ∧ ∀y. m(y) < m(x) → ¬P(y)), that is, there exists an m-minimal x_A which satisfies P(x).

This minimum principle can be proven by classical reasoning and strong induction:

Proof. By eliminating the first existential quantifier, we get some x_A. We now do strong induction on m(x_A). If m(x_A) = 0, we are done, since nothing is smaller than 0. If m(x_A) = n + 1, then x_A might already be m-minimal (this case distinction is strictly classical). Otherwise, there must be some y_A and q such that m(y_A) = q ≤ n, and by the induction hypothesis with n := q, we get a minimal element.

The rest of the proof will be constructive:

Proof of Theorem 3. Let M(x) := ∀y≥x.f(y) ≮ f(x). Using m = f and P (x) = ⊤ in the minimum principle, we get a global minimum δ of f, meaning M(δ). Therefore, ∃xM(x). Therefore, by the minimum principle with M and g, we get a g-minimal element i with M(i), that is, M(i) ∧ ∀y(g(y) < g(i) → ¬M(y)). Now, again, we can apply the minimum principle with λx.x > i and f, and we get a j > i which is f-minimal, that is, ∀y>i.f(y) ≮ f(j). Since M(i), we now have f(j) ≮ f(i). On the other hand, since j is the minimal value of f such that j > i, we have ∀y≥jf(y) ≮ f(j), meaning M(j). Since i < j, and by the minimality of i, it follows that g(j) ≮ g(i). Hence, i and j are as desired.
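The computational content of Theorem 3 can also be illustrated by a naive brute-force search; the following Haskell sketch is our own illustration (the name naivePair is hypothetical), and it is not the program extracted in [31]. It simply enumerates pairs i < j until one with f(j) ≮ f(i) and g(j) ≮ g(i) is found; Theorem 3 guarantees that this search terminates.

naivePair :: (Int -> Int) -> (Int -> Int) -> (Int, Int)
naivePair f g =
  head [ (i, j) | j <- [1 ..], i <- [0 .. j - 1], f j >= f i, g j >= g i ]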

Even though this proof is classical, and we cannot extract from the minimum principle in general, [31] uses negative translation and Dragalin-Friedman translation to obtain a constructive proof (we discussed this procedure in [88]), and extracts an algorithm which searches for such a pair (i, j). However, the resulting algorithm is somewhat hard to grasp; a simplified version is given as

modifiedSolution f g = mod [] 0 1
  where mod s i j = if g j mod [] j $ j + 1
          (x : xs) -> mod xs x j
          else (i, j)

However, this shows that with this kind of program extraction, the resulting algorithms might be complicated. It is also difficult to estimate their runtime behavior without directly looking at them, because the correspondence between the existence proof and the resulting extracted term is much weaker. We do not use this kind of program extraction in our implementation. As in this case, we will get a constructive proof of some refinement of the original theorem which still reflects its computational content. We call this phase translation of the proof. In some cases, not all classical applications are removed; for example, Coq has the type universe Prop which may contain classical reasoning, but otherwise Coq proofs are usually constructive from the start, so no negative translation is needed. Notice that it is not always possible to convert proofs into constructive proofs with the desired properties. Our above example shows this. In Coq, usually all proofs are constructive. It is possible to use classical reasoning, but only in the non-computational parts of the proof, which will be removed in the next phase.

After gaining a constructive proof term, it is pruned: The parts which are computationally irrelevant are removed. This is the part where we really “extract” something from it; it is the first time we do not care about the proof anymore and concentrate on the algorithm it resembles. Simple dead-code elimination can be applied. This is an a posteriori approach. Coq, on the other hand, has an a priori approach, by distinguishing between computationally relevant types Set and computationally irrelevant types Prop. Minlog uses a somewhat intermediate approach where proofs can be automatically decorated: Every → and ∀ gets an annotation c or nc, which annotates whether this implication or quantification should be regarded as computationally relevant. The computational relevance of axioms must be given beforehand. After pruning the proof, we usually have a term which has another type than the original proof. Ideally, it is provable that the pruned term realizes the original proof term with respect to some carefully crafted realizability relation. The property we usually want is that if t : ∀x^A ∃y^B. R x y is proved, then the corresponding extracted program t′ : A → B satisfies ∀x^A. R x (t′ x). Getting a term inside the theory itself is called internal extraction.

In the case of Coq, the extraction mechanism computes a term in some stock programming language, in our case this will usually be Haskell. This is called external extraction. Still, at least indirectly, there will be an internal term before external extraction. We call the transition from internal extraction to some stock programming language compilation, conforming with the common notion of programming languages being “compiled” into other programming languages, as is done in compiler backends like Emscripten [11], Chicken Scheme [7] or Stalin [14]. Sometimes, the readability of the code compiled from extracted algorithms is criticized. However, the aforementioned compilers do not explicitly have the goal of producing readable code, and we think that a compiler does not need to produce readable code, but rather correct and efficient code. However, besides the use of recursion operators instead of pattern matching, the extracted code is quite readable.
Often, in Coq, people start by writing a function t′ : A → B and proving its correctness by a proof about t′, rather than by extracting t′ from an existence proof, but still call the process of compiling it to some stock programming language “program extraction”. We sometimes used this technique when it was more feasible; sometimes, especially when we had a complicated induction scheme, we even mixed both styles, using the refine statement, for example in DeflateCoding.v:

Lemma existence_lex_lemma : forall a m p,
  lex p m -> prefix a m -> ~ prefix a p -> lex p a.
Proof.
  refine (fix f a m p :=
    match (a, m, p) as R return ((a, m, p) = R -> _) with
    | (nil, _, _) => fun eq => _
    | (xa :: a', nil, _) => fun eq => _
    | (xa :: a', xm :: m', nil) => fun eq => _
    | (xa :: a', xm :: m', xp :: p') => fun eq => _
    end eq_refl).

How far these techniques should be regarded as program extraction is a discussion that is besides the point of our work. In general, we can factorize “program extraction” in the following way:

Informal Proof
    | formalization (usually manual; some projects like Naproche try to narrow this step)
    v
Formal Proof
    | reconstruction (usually manual for Agda, Idris; 'Qed', 'Defined' in Coq; 'cdp' in Minlog)
    v
Classical Proof Term
    | translation
    v
Constructive Proof Term
    | pruning ('proof-to-extracted-term' in Minlog)
    v
Realizing Term ('Extract' in Coq)
    | compilation
    v
Program in stock programming language

Notice that this diagram only has the purpose of clarifying terminology, and only the terminology we use in this work. Neither is it normative, nor is it a complete discussion of all details. It can sometimes be hard to estimate the efficiency of an extracted program. The case study [81] points out some techniques to improve efficiency, which we mostly followed. In particular, these were

• to use expensive statements non-computationally, which we have done in large parts of the code.

• to use existential quantifiers as memory, which we did, for example, in our proofs regarding strong decidability (see Section 6).

• to calculate values in advance, which we did, for example, for the value fixed_lit_code.

• to turn loop invariants into induction statements, which is not directly applicable because Coq does not have imperative loops, but corresponds to Coq’s induction measures, which give a clue about the computational complexity.

3.5 Moravec’s Paradox

In the realm of artificial intelligence, there is a paradoxical fact, named after Hans Moravec, which states that it is often the case that things that are trivial for humans are complicated or even impossible for computers, but also the other way around. In [78], he states “In hindsight, this dichotomy is not surprising. Since the first multicelled animals appeared about a billion years ago, survival in the fierce competition over such limited resources as space, food or mates has often been awarded to the animal that could most quickly produce a correct action from inconclusive perceptions. [...] Abstract thought, though, is a new trick, perhaps less than 100 thousand years old. We have not yet mastered it. It is not all that intrinsically difficult; it just seems so when we do it.” In formal methods, there is a similar dichotomy. If you are trying to formalize higher mathematics, it is often clear, or at least very plausible, that a theorem is formalizable, although actually doing so may require a significant amount of work. For example, proving that π₁(S¹) ≅ Z uses an argument that uses a covering of S¹ and a lifting and several properties of the resulting curve which are intuitive but hard to formalize. In formal mathematics, the concrete formalization of a theorem can be important for the difficulty of its proof. To be able to formalize such theorems, Homotopy Type Theory uses another formalization of mathematics which makes it easy to prove this. On the other hand, when making sure that in a large number of case distinctions all cases are properly handled, like in the four-color theorem [55], computers have a clear advantage.

In the same way, it is not always clear which parts of our project are difficult. A major peak in difficulty was Deflate codings, see Section 5. We had to deal with binary sequences, which were interpreted both as numbers and as sequences. Still, the informal proofs are not hard to understand. Furthermore, understanding simple algorithms, like the backreference algorithm with DiffArrays in Section 8.4, is easy for humans, but finding the actual invariants and proving them formally is a lot harder. Of course, there are situations where computers and humans have about the same level of difficulty: The omega tactic solves many simple equations which are also “clear” to humans. On the other hand, the purely functional algorithm for the resolution of backreferences, which we presented in Section 8.5, is comparably difficult for both humans and computers. It is, in any case, not always easy to estimate the difficulty of a formal proof in advance. One should keep this in mind when reading this work. We will try to point out the difficulties we found and how we coped with them.

3.6 Practical Applications

Every sophisticated algorithm should come with a proof. Therefore, in theory, program extraction should make it easier to produce sophisticated algorithms, because you only need to formalize the proof, and get the algorithm for free. Unfortunately, reality looks different. While formal methods are becoming more and more widespread, program extraction from proofs is less popular, even though formally, there is no need to distinguish between dependently typed programs and proofs. The main difference is how you write these programs. The following is a dependently typed proof of

Theorem 4. 0 + n = n

where addition is defined recursively by

_+_ : Nat -> Nat -> Nat
a + O = a
a + (S n) = S (a + n)

and hence this equality must be proved. The informal proof is easy:

Proof. We prove this by induction: For n = 0, it holds by definition. Now consider Sn. We have 0 + Sn = S(0 + n), and by the induction hypothesis 0 + n = n, we have S(0 + n) = Sn, and hence, 0 + Sn = Sn.

The proof in Agda is

n0 : (n : Nat) -> O + n ≡ n
n0 O = refl
n0 (S n) = cong S (n0 n)

which is an explicit proof term, with some implicit arguments. In Coq, one would rather prove this by

Goal forall n, O + n = n.
Proof.
  induction n.
  reflexivity.
  simpl. rewrite -> IHn. reflexivity.
Qed.

The second code feels more like a list of instructions, while the first one feels more like a program. The first one is much shorter in this case. However, this is not the case in general; for example, we can prove simple number-theoretic equations using the omega tactic, a one-liner. In such cases, the second style seems more readable. But this is also not always the case: Using complicated tactics and tactic combinators can make proofs hard to read. Generally speaking, in dependently typed programming, you build up proofs, while in tactic-based program extraction, you break down propositions. However, Agda also has some tactics, for example, the RingSolver tactic [19]. We started the project with Agda. Our first specification of Deflate codings (see Section 5) was motivated by trying to define code trees:

data DeflateTree {alpha : N} : (range : Subset alpha) → (shortHeight : N)
    → (shortBranchChar : Fin alpha) → (longHeight : N)
    → (longBranchChar : Fin alpha) → (notNonFork : Bool) → Set where
  leaf : (b : Fin alpha) → DeflateTree ⁅ b ⁆ 0 b 0 b true
  forkEq : {ld rd md : N} → {f g : Subset alpha}
    → {lc mc1 mc2 rc : Fin alpha} → {tl : Bool}
    → DeflateTree f ld lc md mc1 true → DeflateTree g md mc2 rd rc tl
    → (disjoint : Empty (f ∩ g)) → (toN mc1) < (toN mc2)
    → DeflateTree (f ∪ g) ld lc rd rc tl
  forkNeq : {ld md1 md2 rd : N} → {f g : Subset alpha}
    → {lc mc1 mc2 rc : Fin alpha} → {tl : Bool}
    → DeflateTree f ld lc md1 mc1 true → DeflateTree g md2 mc2 rd rc tl
    → (disjoint : Empty (f ∩ g)) → md1 < md2
    → DeflateTree (f ∪ g) ld lc rd rc tl
  nonFork : {b : Bool} → {ld rd : N} → {f : Subset alpha} → {lc rc : Fin alpha}
    → DeflateTree f ld lc rd rc b → DeflateTree f (suc ld) lc (suc rd) rc false

This directly and locally specifies the structure one can produce, and has a directly visible correspondence to what will be saved in memory when the program runs. However, this also means that it is not possible to temporarily break these invariants, which is sometimes necessary. At the time we began our project, Agda’s standard library was not mature enough, which was the reason we switched to Coq. While Coq’s proof mechanism focuses on tactic-based proving, Coq allows for both styles of programming, and we often mix both styles, depending on which is more appropriate for the specific problem. While in Agda, we directly see the terms we programmed, in Coq, we have to trust the extraction algorithm, which enlarges the trusted codebase. There is, however, work on verified extraction algorithms [62]. We think that especially for complicated type signatures, the break-down approach of tactic-based proving and program extraction is more beneficial, since breaking down large propositions into smaller ones has to be done in any case. Then, after profiling the resulting algorithms, one can replace bottlenecks with explicitly crafted proof terms. In [31], some examples of realistic program extraction are given, but they are still no more than a case study. A more practical example is the DPLL solver we discussed in Section 1.4.3.

Chapter 4

An Introduction To Coq

As we use Coq for our project, we give a short introduction to Coq by examples. This introduction is by no means complete, and mainly for readers familiar with some other proof checker, to get an overview. For an example of program extraction, refer to Section 3.

4.1 Set and Prop

The Coq type system is documented in the Coq reference manual. A rule of thumb is that there is a Type universe (actually, there are infinitely many, but they are hidden by the language), which contains Prop and Set. Objects of type Prop are computationally irrelevant, while objects of type Set are relevant and can be extracted. For example, if we wanted to define our own type of natural numbers (though there is a type nat in the standard library with built-in decimal notation), we would define it as a Set, like

Inductive Nat : Set :=
| O : Nat
| S : Nat -> Nat.

because we want to be able to use it in programs. The definition

Inductive NatEven : Nat -> Prop :=
| OEven : NatEven O
| SSEven : forall n, NatEven n -> NatEven (S (S n)).

is a Prop: We are usually not interested in the structure of the proofs of evenness of a natural number, we just need the fact that such proofs exist. We can then, of course, not eliminate an evenness proof inside a computationally relevant context.


4.2 Gauss formula

As a more sophisticated example, we now want to prove the Gauss formula

2 · ∑_{i=0}^{n} i = n(n + 1)

Firstly, we need to import the packages we need; in our case, we only need the omega tactic:

Require Import Omega.

It is commonly defined that

∑_{i=0}^{k} w_i := { w_0                          for k = 0
                   { w_{n+1} + ∑_{i=0}^{n} w_i    for k = n + 1

In Coq, we do this using a Fixpoint:

Fixpoint sum (to : nat) (what : nat -> nat) :=
  match to with
  | 0 => what 0
  | S n => what (S n) + sum n what
  end.

The Gaussian sum is now the case w_i = i, or what = Id:

Definition Id (x : nat) := x.
Definition GaussSum n := sum n Id.

We can now compute a Gaussian sum by

Eval compute in GaussSum 10.

which correctly yields 55. The proof itself starts with the statement and declaration:

Lemma GaussSumFormula : forall n, 2 * GaussSum n = n * (n + 1).
Proof.

We prove this by induction on n:

induction n as [|n].

The as clause tells Coq the name of the destructured parameter: A natural number is either 0, in which case we need no additional name, or the successor of something, which we call n, overloading the original meaning of n. We now have two cases; therefore, we can prefix our instructions with the bullet +. The first case is 2 * GaussSum 0 = 0 * (0 + 1), which we can just prove by reflexivity, since evaluating it results in the same term.

+ reflexivity.

The second goal is 2 * sum (S n) Id = S n * (S n + 1), but we can use the induction hypothesis, which Coq automatically called IHn, which is 2 * GaussSum n = n * (n + 1). We need to help Coq a little to find the right way of expanding a term: We need the equality sum (S n) Id = (S n) + GaussSum n, which is trivial, but we need it in this form to replace a subterm in our goal.

+ unfold GaussSum.
  assert (X : sum (S n) Id = (S n) + GaussSum n).
  - reflexivity.

We can now use this equality to replace the part in our term with the equal term:

- rewrite -> X.

The goal becomes 2 * (S n + GaussSum n) = S n * (S n + 1), which we rewrite to 2 * S n + 2 * GaussSum n = S n * (S n + 1) by distributivity:

rewrite -> mult_plus_distr_l.

We finally can apply our induction hypothesis

rewrite -> IHn.

yielding the new goal 2 * S n + n * (n + 1) = S n * (S n + 1). Now we do

replace (S n) with (n + 1); [|omega].

This tells Coq to replace the given subterm with the new term, and prove the equality using the omega tactic. We apply commutativity and distributivity once again

rewrite <- mult_plus_distr_r.
rewrite -> mult_comm.

and gain the goal (n + 1) * (2 + n) = (n + 1) * (n + 1 + 1). This is a goal of the form f(a) = f(b), which can be reduced to the goal a = b by the tactic

f_equal.

yielding the goal 2 + n = n + 1 + 1, which can finally be solved by the tactic

omega.
Qed.

This concludes the proof.

4.3 Square Pyramidal Numbers

A similar formula holds for the sum of squares, namely the formula for the n-th square pyramidal number,

∑_{i=0}^{n} 6 · i² = n(n + 1)(2n + 1)

However, the equality we need in the induction is more complicated, and can be automated. For this, there is the ring tactic, which allows us to define rings and semirings, and automatically solve equalities in them. We need to import two additional libraries, the first one for the ring theory itself, and the second one for the tactic.

Require Import Omega.
Require Import Coq.setoid_ring.Ring_theory.
Require Import Ring.

We now want to prove that the natural numbers form a semiring. The ring theory library declares a record that contains all the axioms, parametrized on the objects and operations for 0, 1, +, ·, =:

Record semi_ring_theory : Prop := mk_srt {
  SRadd_0_l : forall n, 0 + n == n;
  SRadd_comm : forall n m, n + m == m + n;
  SRadd_assoc : forall n m p, n + (m + p) == (n + m) + p;
  SRmul_1_l : forall n, 1*n == n;
  SRmul_0_l : forall n, 0*n == 0;
  SRmul_comm : forall n m, n*m == m*n;
  SRmul_assoc : forall n m p, n*(m*p) == (n*m)*p;
  SRdistr_l : forall n m p, (n + m)*p == n*p + m*p
}.

To realize this record, we prove a lemma:

Lemma SRNat : semi_ring_theory 0 1 plus mult eq.
Proof.
  constructor; intros;
    (omega || apply mult_comm || apply mult_assoc || apply mult_plus_distr_r).
Qed.

This proof utilizes the tactic language of Coq. The call to constructor produces the goals, which are the fields of the semi_ring_theory record. Everything after the semicolon is applied to all goals. intros eliminates leading universal quantifiers and puts them into the antecedent. The || operator tells Coq to use the first tactic that solves the goal. For three of the goals, this can just be done by the omega tactic; the rest are the respective laws proved directly in the standard library. To use the semiring, we do

Add Ring RNat : SRNat.

Now we can prove our formula.

Lemma PyramidFormula : forall n,
  6 * sum n (fun x => x * x) = n * (n + 1) * (2 * n + 1).
Proof.

The ring tactic has no notion of numbers other than 0 and 1, hence we replace these numbers, using the tactic omega:

replace 6 with ((1+1)*(1+1+1)) by omega.
replace 2 with (1+1) by omega.

We then need to do a bit of work, similar to our work for the Gaussian formula:

induction n as [|n].
- reflexivity.
- replace (sum (S n) (fun x : nat => x * x))
    with (S n * S n + sum n (fun x : nat => x * x)) by reflexivity.
  replace (S n) with (n + 1) by omega.
  rewrite -> mult_plus_distr_l.
  rewrite -> IHn.

After that, the goal will be (1 + 1) * (1 + 1 + 1) * ((n + 1) * (n + 1)) + n * (n + 1) * ((1 + 1) * n + 1) = (n + 1) * (n + 1 + 1) * ((1 + 1) * (n + 1) + 1), which can be solved by the semiring solver

ring.
Qed.

Chapter 5

Deflate Codings

Even though this section is about what we call “Deflate Codings”, it is not specific to Deflate at all: The same mechanism is used, for example, in BZip2, too. It is more about coding theory in general, with an emphasis on Deflate. It is a well-known result from [61] that for every string w ∈ [A] over a finite alphabet A, there is a Huffman coding h : A → [{0, 1}], which is a prefix-free coding such that fold (++) [] (map h w) has minimal length. In fact, this has already been formally proved [33]. While this is useful for compression, Deflate does not require codings to be optimal. Furthermore, there is more than one optimal coding; for example, if two symbols are encoded by equally long codes, we can swap their encodings without losing minimality. Deflate codings are special prefix-free codings which are uniquely determined by their code lengths. Therefore, instead of expensively saving a tree structure, it is sufficient to save the sequence of code lengths for the encoded characters. Other standards, like BZip2, have a similar way of saving codings, and therefore, our implementation might be usable for implementations of them, too. Optimal Deflate codings are also known as canonical Huffman codings. To the best of our knowledge, there is no verified formalization of Deflate codings so far. In any practical case, there will be a canonical ordering on A, so from now on, let us silently assume A = {0, . . . , n − 1}.

Definition 1. The lexicographical order ⊑ on [{0, 1}] is defined by

∀a.[] ⊑ a

∀a,b.0 :: a ⊑ 1 :: b

∀a,b,j.a ⊑ b → j :: a ⊑ j :: b

It is easy to show that the lexicographical order is in fact a decidable total ordering relation. This, and some additional lemmata that we need later, are proved in Lex.v.


Definition 2. Let a, b ∈ [{0, 1}]. We say a is a prefix of b and write a ≼ b, if there is a code c ∈ [{0, 1}] such that a ++ c = b. Notice that a ≼ a.

Decidability of the prefix relation on Boolean lists, transitivity and some other small lemmata are proved in Prefix.v. In practice, the code [] is used to denote that the corresponding codepoint does not occur, and we will therefore exclude this special case from our definition of prefix-freeness.

Definition 3. A Deflate coding is a coding ⌈·⌉ : A → [{0, 1}] which satisfies the following conditions:

1. ⌈·⌉ is prefix-free, except that there may be codes of length zero:

∀a,b.(a ≠ b ∧ ⌈a⌉ ̸= [ ]) → ⌈a⌉ ̸≼ ⌈b⌉

2. Shorter codes lexicographically precede longer codes:

∀a,b. len⌈a⌉ < len⌈b⌉ → ⌈a⌉ ⊑ ⌈b⌉

3. Codes of the same length are ordered lexicographically according to the order of the characters they encode:

∀a,b.(len⌈a⌉ = len⌈b⌉ ∧ a ≤ b) → ⌈a⌉ ⊑ ⌈b⌉

4. For every code, all lexicographically smaller bit sequences of the same length are prefixed by some code:

∀a∈A,l∈{0,1}+ .(l ⊑ ⌈a⌉ ∧ len l = len⌈a⌉) → ∃b.⌈b⌉ ̸= [ ] ∧ ⌈b⌉ ≼ l

The corresponding Coq definition is the record deflateCoding which can be found in DeflateCoding.v. These axioms are our proposed formalization of the informal specification in [41], which states: The Huffman codes used for each alphabet in the “deflate” format have two additional rules:

• All codes of a given bit length have lexicographically consecutive values, in the same order as the symbols they represent;

• Shorter codes lexicographically precede longer codes.

[41] claims that with these two rules, Huffman codings are uniquely determined by their code lengths. Even though this might be true, Huffman codings are defined for strings – from a given coding, you cannot tell whether it is a Huffman coding. However, of course, some codings are clearly not Huffman codings; for example, the coding

0 → [0], 1 → [1, 0, 0], 2 → [1, 0, 1], 3 → [1, 1, 0]

is clearly not a Huffman coding, since for every case it would apply to, we could also use

which will always be better. We tried to introduce the concept of “optimal codings”, but as it appeared to be harder to use and less general than the definition we actually used, and all the reference algorithms in [41] work for non-optimal codings too, we decided to follow Postel’s Robustness Principle [82] and accept them. Axiom 3 is weaker than the first axiom from [41], as it does not postulate the consecutivity of the values, which is ensured by axiom 4: Assuming you have characters a < b such that len⌈a⌉ = len⌈b⌉, and there is an l ∈ {0, 1}^{len⌈a⌉} such that ⌈a⌉ ⊑ l ⊑ ⌈b⌉, then there is a d such that ⌈d⌉ ≼ l, and therefore ⌈d⌉ = l and a ≤ d ≤ b, so the values are consecutive. Furthermore, consider our non-optimal coding from above: It has the lengths 0 → 1, 1 → 3, 2 → 3, 3 → 3, and satisfies our axioms 1–3, and additionally, the codes of the same length have lexicographically consecutive values. But the same holds for the coding

0 → [0], 1 → [1, 0, 1], 2 → [1, 1, 0], 3 → [1, 1, 1]

However, in this coding, there is a “gap” between the codes of different lengths, namely between [0] and [1, 0, 1], and that is why it violates our axiom 4: The list [1, 0, 0] is lexicographically smaller than [1, 0, 1], but it is not prefixed by any code. We can show that Deflate codings are uniquely determined by their code lengths:

Theorem 5 (uniqueness). Let ⌈·⌉, ⌊·⌋ : A → [{0, 1}] be two Deflate codings, such that len⌈·⌉ = len⌊·⌋. Then ⌈·⌉ = ⌊·⌋.

Proof. Besides the fact that equality is computationally irrelevant, equality of codings is obviously decidable, therefore we can do a proof by contradiction. So assume there were two distinct Deflate codings ⌈·⌉ and ⌊·⌋ with len⌈·⌉ = len⌊·⌋. Then there must exist n, m such that ⌈n⌉ = min_⊑{⌈x⌉ | ⌈x⌉ ≠ ⌊x⌋} and ⌊m⌋ = min_⊑{⌊x⌋ | ⌈x⌉ ≠ ⌊x⌋}. If len⌈n⌉ > len⌊m⌋, then also len⌈n⌉ > len⌈m⌉, and by our axiom 2, ⌈m⌉ ⊑ ⌈n⌉. But n was chosen minimally. Symmetric for len⌈n⌉ < len⌊m⌋. Therefore, len⌈n⌉ = len⌊m⌋. Assuming n < m, by axiom 3, ⌊n⌋ ⊑ ⌊m⌋, but m was chosen minimally. Symmetric for n > m. Hence, n = m. ⌊m⌋ ≠ [], because otherwise 0 = len⌊m⌋ = len⌈n⌉, so ⌈n⌉ = [], and so ⌈m⌉ = ⌊m⌋, which contradicts our assumption on the choice of m. Analogously, ⌈n⌉ ≠ [].

By totality of ⊑, we know that ⌈n⌉ ⊑ ⌊m⌋ ∨ ⌊m⌋ ⊑ ⌈n⌉. Both cases are symmetric, so without loss of generality assume ⌈n⌉ ⊑ ⌊m⌋. Now, by axiom 4, we know that some b exists such that ⌊b⌋ ≼ ⌈n⌉, therefore by axiom 2, ⌊b⌋ ⊑ ⌊m⌋, and thus, by the minimality of m, either b = m or ⌊b⌋ = ⌈b⌉. But since n = m, b = m would imply b = n, and hence, ⌊m⌋ ≼ ⌈n⌉, and therefore, since the lengths are equal, ⌊m⌋ = ⌈m⌉, which contradicts our choice of m. But ⌊b⌋ = ⌈b⌉ would imply ⌈b⌉ ≼ ⌈n⌉, which contradicts our axiom 1.

This theorem is proved as Lemma uniqueness in DeflateCoding.v. While uniqueness is a desirable property, it does not give us the guarantee that, for every configuration of lengths, there actually is a Deflate coding. And in fact, there is not always one: Trivially, for the sequence of code lengths 1, 1, 1, there is no Deflate coding. It is desirable to have a simple criterion on the list of code lengths that can be efficiently checked before creating the actual coding.
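For instance, by our own arithmetic, the lengths 1, 1, 1 would require 2⁻¹ + 2⁻¹ + 2⁻¹ = 3/2 ≤ 1, which is false, so the criterion below (Theorem 6) already rules out a prefix-free set of codes with these lengths.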

Theorem 6 (Kraft’s inequality for sets of codes). Let k_1, . . . , k_N be non-negative integers. Then the following propositions are equivalent:

1. There is a prefix-free S = {s1, . . . , sN } ⊆ [{0, 1}] such that ki = len si for all i ∈ {1,...,N}

2. ∑_{i=1}^{N} 2^{−k_i} ≤ 1

Equality in 2. holds if and only if there is no a ∈ [{0, 1}]\S such that S ∪ {a} is prefix-free. This is a well-known result, see e.g. [75]. We only prove 1. ⇒ 2. directly, since we need the stronger proposition for the other direction from Theorem 7. Notice that we are talking about sets of codes, and not about codings, now. In our implementation, we use duplicate- and prefix-free lists. The implementation is mainly in KraftList.v. The reason why we do not directly use the codings is that we allow our codings’ images to contain [], as this is closer to the standard [41]. However, Kraft’s inequality also holds for the set {[]}, and here it is even sharp, because [] is a prefix of all lists, and this is the base case for our induction. The theorem is proved in Lemma kraft_pflist, the sharp version in Lemma kraft_pflist_sharp, in KraftList.v. To then apply it to our Deflate coding data structures, there are several lemmata in KraftVec.v. In the following, let K_S := ∑_{i∈S} 2^{−len i}, i :: S := {i :: x | x ∈ S} and S_i := {x | i :: x ∈ S}. Notice that S = 0 :: S₀ ∪ 1 :: S₁.
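As a small worked instance of this notation (our own illustration): for S = {[0], [1, 0], [1, 1]} we have S₀ = {[]} and S₁ = {[0], [1]}, hence K_S = 2⁻¹ + 2⁻² + 2⁻² = 1, K_{S₀} = 2⁰ = 1 and K_{S₁} = 2⁻¹ + 2⁻¹ = 1; in particular 2·K_S = K_{S₀} + K_{S₁}, as Lemma 6 below states.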

Lemma 4 (pflist_splittable). If S is prefix-free, then so is S_i for i ∈ {0, 1}.

Proof. As prefix-freedom is a negative property, we can use a proof by contradiction. Assume S_i contained codes a and b with a ≼ b. Then i :: a ≼ i :: b. Contradiction.

Lemma 5 (maxlen_split). Let maxlen S := max{len x | x ∈ S} and maxlen ∅ := 0. If maxlen S ≠ 0, then maxlen S_i < maxlen S for i ∈ {0, 1}.

Proof. We do this proof in a more complicated way than necessary, to be closer to the formal proof in Coq (which is about lists rather than sets). We do an induction on the size of S. If S = ∅, we are done. Now assume S = {a} ∪̇ S′, and the proposition holds for S′ already. If a = (1 − i) :: a′, then S_i = S′_i, and since trivially maxlen S ≥ maxlen S′, the proposition holds. If a = i :: a′, then len a ≤ maxlen S. If len a = maxlen S, then trivially, len a′ = maxlen S_i, because it must be greater than or equal to the length of any other element of S_i. But len a′ = len a − 1 < len a. Now assume the case len a < maxlen S. If len a′ = maxlen S_i, we are done, since len a′ = len a − 1. If len a′ < maxlen S_i, then maxlen S_i = maxlen S′_i, and also maxlen S = maxlen S′, and the claim follows by induction.

Lemma 6 (kraft_pflist_split). Let S = {s_1, . . . , s_N} ⊆ [{0, 1}] be prefix-free. Then [] ∈ S or 2·K_S = K_{S₀} + K_{S₁}.

Proof. Let {s1, . . . , sN } be prefix-free, and [] ≠ sk for all k. Trivially,

KS = K(S ∩ {0 :: l | l ∈ [{0, 1}]}) + K(S ∩ {1 :: l | l ∈ [{0, 1}]})

But

K(S ∩ {j :: l | l ∈ [{0, 1}]}) = ∑_{(j::i)∈S} 2^{−len(j::i)}
                               = ∑_{(j::i)∈S} 2^{−(1+len i)}
                               = ½ ∑_{(j::i)∈S} 2^{−len i}
                               = ½ ∑_{i∈{l | j::l∈S}} 2^{−len i}
                               = ½ K{l | j :: l ∈ S}

for j ∈ {0, 1}. The claim immediately follows.

Proof of Theorem 6. We do induction on the maximum code length of S. If it is 0, the claim can be calculated directly. If it is n + 1, then the maximum code lengths of {x | i :: x ∈ S} for i ∈ {0, 1} are less than or equal to n, so by induction, we know that ∑_{i::x∈S} 2^{−len x} ≤ 1 for i ∈ {0, 1}, and thus

∑_{x∈S} 2^{−len x} = ½ ∑_{i∈{0,1}} ∑_{i::x∈S} 2^{−len x} ≤ 1

Assume ∑_{x∈S} 2^{−len x} = 1 and an a existed such that S ∪ {a} is prefix-free. Then clearly ∑_{x∈S∪{a}} 2^{−len x} > 1, but we just proved ∑_{x∈S∪{a}} 2^{−len x} ≤ 1. Contradiction.

Assume there was no a such that S ∪ {a} is prefix-free. If the maximum code length of S is 0, the claim can be calculated directly. Assume it is n + 1. If there existed an a_i such that {x | i :: x ∈ S} ∪ {a_i} was prefix-free, then S ∪ {i :: a_i} would also be prefix-free. Thus, such an a_i does not exist for i ∈ {0, 1}, and thus, by induction,

∑_{x∈S} 2^{−len x} = ½ ∑_{i∈{0,1}} ∑_{i::x∈S} 2^{−len x} = ½ ∑_{i∈{0,1}} 1 = 1

We furthermore show that a Deflate coding exists if and only if its length sequence satisfies Kraft’s inequality. (From this, the direction 2. ⇒ 1. follows immediately.)

Theorem 7 (extended_kraft_ineq). Let ⌈·⌉ : A → [{0, 1}] be a Deflate coding. Then

∑_{i∈A, ⌈i⌉≠[]} 2^{−len⌈i⌉} ≤ 1

Equality holds if and only if there is some k ∈ A such that ⌈k⌉ ∈ {1}⁺.

Proof. The first claim follows directly from Theorem 6, since (img⌈·⌉)\{[]} is a prefix-free set. For the second claim, it is sufficient to show that such a k exists if and only if no a ∉ img⌈·⌉ exists such that ((img⌈·⌉)\{[]}) ∪ {a} is prefix-free. So assume such a k exists, but also such an a existed. If ⌈k⌉ ⊑ a, then ⌈k⌉ ≼ a, since ⌈k⌉ is a sequence of 1’s. If a ⊑ ⌈k⌉, then by axiom 4, a prefix of a must already be in img⌈·⌉. This is a contradiction. For the other direction, assume equality holds, but no sequence of 1’s is contained in img⌈·⌉. Let m be the maximum code length of img⌈·⌉. Then all prefixes of [1]^{m+1} are sequences of 1’s, so it is not prefixed by any of the codes in img⌈·⌉. This contradicts Theorem 6.

This is proved in theorem kraft_ineq and theorem extended_kraft_ineq in the file DeflateCoding.v.

Lemma 7. Let d ∈ [{0, 1}] and d ≠ [1]^{len d}. Then there is a ⊑-minimal d̄ ∈ [{0, 1}] with len d̄ = len d, d ⊑ d̄ and d ≠ d̄.

Regarding d as a binary number, d̄ = d + 1 does the job. Formally, this is proved as Lemma lex_inc in Combi.v. The most important theorem regarding Deflate codings is:

Theorem 8. Let l : A → N be such that

∑_{i∈A, l(i)≠0} 2^{−l(i)} ≤ 1

Then there is a Deflate coding ⌈·⌉ : A → [{0, 1}] such that l = λx. len⌈x⌉.

Proof. Let ≲ be a binary relation on N², defined by

∀m,q,o. q < o → (q, m) ≲ (o, m)
∀m1,m2,n1,n2. m1 < m2 → (n1, m1) ≲ (n2, m2)

It is easy to show that ≲ is a transitive decidable antisymmetric ordering relation. Now let R = L := sort_≲ (map (λk. (k, l k)) [0, . . . , n − 1]), S = [], c = λx. []. We will do a recursion on tuples (S, c, R), maintaining the invariants

∀q. (q, len(c(q))) ∉ S → (c(q) = [] ∧ (q, l(q)) ∈ R)   (5.1)

(rev S) ++ R = L   (5.2)

S = [] ∨ ∀q. c(q) ⊑ c(π₁(car S))   (5.3)

Furthermore, c will be a Deflate coding at every step. The decreasing element will be R, which will become shorter at every step. We first handle the simple cases:

• For the initial values ([], λx. [], L), the invariants are easy to prove.

• For R = [], we have rev S = L by 5.2 and therefore, either c = λx. [] if L = [], or ∀q. (q, len(c(q))) ∈ L by 5.1, and therefore, c is the desired coding.

• For R = (q, 0) :: R′, S can only contain elements of the form (_, 0). We proceed with ((q, 0) :: S, λx. [], R′). All invariants are trivially preserved.

• For R = (q, 1 + l) :: R′ and S = [] or S = (r, 0) :: S′, we set c′(x) = [0]^{1+l} for x = q, and c′(x) = [] otherwise. We proceed with ((q, 1 + l) :: S, c′, R′). The invariants are easy to show. It is easy to show that c′ is a Deflate coding.

Now the most general case is R = (q, 1 + l) :: R′ and S = (r, 1 + m) :: S′, and let the intermediate Deflate coding c be given. We have

∑_{i∈A, c(i)≠[]} 2^{−len(c(i))} < 2^{−l−1} + ∑_{i∈A, c(i)≠[]} 2^{−len(c(i))} ≤ ∑_{i∈A, l(i)≠0} 2^{−l(i)} ≤ 1

By Theorem 7, [1]^{1+m} ∉ img c, and therefore, by Lemma 7, we can find a fresh code d′ of length 1 + m. Let d = d′ ++ [0]^{l−m} and set

c′(x) := { d      for x = q
         { c(x)   otherwise

We have to show that c′ is a Deflate coding. Axioms 2 and 3 are easy. For axiom 4, assume x ≠ [] and x ⊑ c′(q). If x ⊑ c′(r), the claim follows by axiom 4 for r. Otherwise, by totality, c′(r) ⊑ x. If x ⊑ d′, by the minimality of d′ it follows that x = c′(r). If d′ ⊑ x, trivially, d′ ≼ c′(q). Axiom 4 holds. For axiom 1, it is sufficient to show that no other non-[] code prefixes d. Consider a code e ≼ d. As all codes are shorter than or of equal length to d′, we get e ≼ d′. But then, either e ≼ c(r), or c(r) ⊑ e. Contradiction. Therefore, we can proceed with ((q, 1 + l) :: S, c′, R′).

The formal proof can be found as Theorem existence in DeflateCoding.v. For a better understanding of the algorithm proposed here, we consider the following length function as an example:

l : 0 → 2; 1 → 1; 2 → 3; 3 → 3; 4 → 0

We first have to sort the graph of this function according to the ≲ ordering.

[(4, 0), (1, 1), (0, 2), (2, 3), (3, 3)]
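Before running the construction, we can check Kraft’s inequality for these lengths (our own arithmetic): 2⁻² + 2⁻¹ + 2⁻³ + 2⁻³ = 1/4 + 1/2 + 1/8 + 1/8 = 1, where the length-0 entry for character 4 is excluded. By Theorem 8 a Deflate coding with these lengths exists, and since equality holds, Theorem 7 tells us that the resulting coding will contain a code consisting only of 1’s; indeed, c(3) = [1, 1, 1] in the last step below.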

Then, the following six steps are necessary to generate the coding (for lack of space, we leave out the column stating that c(4) is always []).

Step | R                               | S                               | c(0)  | c(1) | c(2)    | c(3)
-----|---------------------------------|---------------------------------|-------|------|---------|--------
 0   | [(4,0),(1,1),(0,2),(2,3),(3,3)] | []                              | []    | []   | []      | []
 1   | [(1,1),(0,2),(2,3),(3,3)]       | [(4,0)]                         | []    | []   | []      | []
 2   | [(0,2),(2,3),(3,3)]             | [(1,1),(4,0)]                   | []    | [0]  | []      | []
 3   | [(2,3),(3,3)]                   | [(0,2),(1,1),(4,0)]             | [1,0] | [0]  | []      | []
 4   | [(3,3)]                         | [(2,3),(0,2),(1,1),(4,0)]       | [1,0] | [0]  | [1,1,0] | []
 5   | []                              | [(3,3),(2,3),(0,2),(1,1),(4,0)] | [1,0] | [0]  | [1,1,0] | [1,1,1]

The final values of c are, in fact, a Deflate coding. The main difference to the algorithm in the standard [41] is that we sort the character/length pairs and then incrementally generate the coding, while the proposed algorithm counts the occurrences of every non-zero code length first, determines their lexicographically smallest code, and then increases these codes by one for each occurring character. In our case, that means that it would first generate the function a : 1 → 1; 2 → 1; 3 → 2 and 0 otherwise, which counts the lengths, and then define

b(m) = ∑_{j=0}^{m−1} 2^{m−j} · a(j)

which gives the numerical value of the lexicographically smallest code of every length, when viewed as a binary number with the most significant bit being the leftmost bit. In our case, this is 1 → 0; 2 → 2; 3 → 6. Then

c(n) = b(l(n)) + |{r < n | l(r) = l(n)}|

meaning c(0) = b(2) = 10₂, c(1) = b(1) = 0₂, c(2) = b(3) = 110₂, c(3) = b(3) + 1 = 111₂, which is consistent with the algorithm presented here. The algorithm described in the standard [41] is more desirable for practical purposes, as it can make use of low-level machine instructions like bit shifting:

static int[] get_canonical_code (int[] lengths) {
    // check for kraft's inequality
    double kraft = 0;
    for (int i = 0; i < lengths.length; ++i)
        kraft += (1.0d / (1 << lengths[i]));
    if (kraft > 1) return null;

    // get the maximum code length
    int MAX_BITS = 0;
    for (int i = 0; i < lengths.length; ++i)
        if (MAX_BITS < lengths[i]) MAX_BITS = lengths[i];

    // Count the number of codes for each code length except 0
    int[] bl_count = new int[MAX_BITS+1];
    for (int i = 0; i < lengths.length; ++i)
        bl_count[lengths[i]]++;
    bl_count[0] = 0;

    // Find the numerical value of the smallest code for each code length
    int[] next_code = new int[MAX_BITS+1];
    int code = 0;
    for (int bits = 1; bits <= MAX_BITS; ++bits) {
        code = (code + bl_count[bits-1]) << 1;
        next_code[bits] = code;
    }

    // Assign numerical values to all codes
    int[] tree_code = new int[lengths.length];
    for (int n = 0; n < lengths.length; ++n) {
        int len = lengths[n];
        if (len != 0) {
            tree_code[n] = next_code[len];
            next_code[len]++;
        }
    }
    return tree_code;
}

Chapter 6

Parsers from Constructive Proofs

Before we get into the actual Deflate-specific parts, we introduce several concepts we created to be able to get a nice, usable and modular framework for specifying and using grammars, which we think will also be useful for further work on other data formats. We specify a part of the format through a relation that relates the output to the input. Usually, the output will be a list of bytes, and the input will be a list of bits, but this is no requirement. The relations can have other parameters, but these parameters will always come before the output and input parameters.

6.1 Strong Uniqueness

In Deflate – and in most other formats – there is more than one way to represent the same thing. More specifically, in [41], only recommendations for compression algorithms are given, but it is not required to use any specific algorithm – by design, Deflate is flexible, and a byte sequence can be compressed in several ways. However, for the other direction, it is different: We want, for every dataset in our format, a unique original dataset, meaning the guarantee that for any given data d, decompress(compress d) = d. While most container formats have some checksum or error correction code, Deflate itself does not have mechanisms to cope with data corruption due to hardware failures and transcription errors; therefore, a formal discussion of these is outside the scope of this work, and we will focus on the correctness of the algorithms themselves in the absence of external memory corruption. However, the concepts defined here should be applicable to such other formats, too. In summary, we want left-uniqueness. Left-uniqueness (“injectivity”) can be formalized as ∀a,b,l. R a l → R b l → a = b. However, when reading from a stream, it must always be clear when to “stop” reading, which essentially means that given an input l such that


R a l, it cannot be extended: ∀a,b,l,l′. R a l → R b (l ++ l′) → l′ = []. In StrongUniqueLemma in the file StrongDec.v, we proved that these two properties are equivalent to the following property, which we call strong uniqueness:

StrongUnique R :⇔ ∀a,b,la,la′,lb,lb′. la ++ la′ = lb ++ lb′ → R a la → R b lb → a = b ∧ la = lb

Lemma 8. ((∀a,b,l,l′. R a l → R b (l ++ l′) → l′ = []) ∧ (∀a,b,l. R a l → R b l → a = b)) ↔ StrongUnique R.

Proof. “→”: We have la ++ la′ = lb ++ lb′, R a la and R b lb. Without loss of generality assume len la ≤ len lb. Then lb = la ++ lb′′ for some lb′′. Therefore, lb′′ = []. Thus, la = lb and by uniqueness a = b. “←”: We have R a l and R b (l ++ l′). Setting la = l, la′ = l′, lb = l ++ l′, lb′ = [], strong uniqueness yields a = b and l = l ++ l′, therefore l′ = [].

6.2 Strong Decidability

While strong uniqueness gives us uniqueness of a prefix, provided that it exists, we need an additional property that states that it is actually decidable whether such a prefix exists, which we call strong decidability:

StrongDecE R :⇔ ∀l. (Σa,x,y. l = x ++ y ∧ R a x) + (E × ¬(∃a,x,y. l = x ++ y ∧ R a x))

This is a decidability predicate with the negated part tagged with some type E. This tag is formally not necessary, and is there for error messages; it is a string in our implementation, to get information about errors in the given compressed data, and as additional documentation. The existential quantifiers in our implementation are strong: If a prefix exists, then we can compute it. This is the obvious type of a verified decoder. It resembles Haskell’s Exception monad. The extracted terms from strong decidability proofs are parsers.
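To make the connection to parsers concrete, the following Haskell sketch is our own rough rendering of the type that a strong decidability proof extracts to (the name StrongDecParser is hypothetical): either we obtain a value together with the consumed prefix and the remaining input, or an error of type e witnessing that no prefix of the input is related to any value.

type StrongDecParser e a b = [b] -> Either e (a, [b], [b])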

Lemma 9. Let R : A → [B] → Prop. If StrongDec R and StrongUnique R and A has decidable equality, then R is decidable.

Proof. Let a : A and b : [B] be given. If ¬(∃c,x,y. b = x ++ y ∧ R c x), then by uniqueness we know ¬R a b. Otherwise, we can check whether y = []. If not, then by uniqueness we also know ¬R a b. Otherwise, we have to decide whether a = c.

However, strong decidability is actually stronger than decidability: Let the inputs [B] encode Turing machines, and let R : N → [B] → Prop be the relation such that R n b denotes “The Turing machine encoded in b stops after n steps”. We clearly have ∀n,b. R n b ∨ ¬R n b. However, strong decidability would imply solving the halting problem. One drawback of strong decidability is that combining such proofs consumes stack space, the same way Exception monads do: You usually have to recursively call the next sub-parser to know whether it succeeds, before you can return anything. Furthermore, it prevents lazy evaluation to some extent. A possible solution is proposed in Section 6.4. A further drawback is that we cannot use cototal lists. This is for the simple reason that parsing potentially infinite lists is not strongly decidable anymore. For example, the relation R : 1 → [bool] → Prop defined by

R tt [true]

∀a,b,c. R a b → R a (c :: b)

holds if and only if the list contains true. This is strongly decidable for total lists, but clearly not for cototal lists, since we can encode the halting problem: For an arbitrary machine M, define the list which is false in the n-th place if M does not halt after n steps, and true otherwise. Despite these drawbacks, strong decidability proved useful for writing our implementation of Deflate.

6.3 Relational Combinators

If a relation satisfies both properties, it is well-suited for parsing. We can combine such relations in a way similar to parser monads. When some of the relations we defined became very complicated and hard to read, we decided to use such monadic combinators. The advantage, besides readability, is that we could prove lemmata about those combinators which greatly simplified proving properties of our relations. We provide the functions Combine and nTimes and the corresponding lemmata in StrongDec.v. The main difference is that we do not have a disjunctive connective: There must be at most one possibility to decode a bit sequence; therefore, branches of a grammar need to be mutually exclusive. Proving mutual exclusivity is the main issue. Otherwise, we tried to use combinations of other relations as far as possible.

Inductive Combine {A BQ BR BS} (Q : BQ -> list A -> Prop)
  (R : BQ -> BR -> list A -> Prop) (c : BQ -> BR -> BS)
  : BS -> list A -> Prop :=
| doCombine : forall bq br aq ar,
    Q bq aq -> R bq br ar -> Combine Q R c (c bq br) (aq ++ ar).

Notation "A >>[ B ]= C" := (Combine A C B) (at level 0).
Notation "A >>= B" := (Combine A B pi2) (at level 0).

We try to make the intuition behind this relation clearer. Assume you already have functions q : list A -> option (list A * BQ) and r : BQ -> list A -> option (list A * BR), which realize strong decidability for Q and R, respectively. Then the following function realizes strong decidability in the same way for Q >>[ c ]= R:

Function qr (l : list A) : option (list A * BS) :=
  match q l with
  | None => None
  | Some (lq, bq) =>
    match r bq lq with
    | None => None
    | Some (lr, br) => Some (lr, c bq br)
    end
  end.

The function pi2 is the projection on the second argument. The Q argument is a relation that takes some input of type list A, and relates it to some output of type BQ, much like the relations we defined in Section 7.2. R is a function that maps some BQ to a relation between BR and list A – it will take the output of Q as a parameter, and produce itself some output of type BR. Finally, c combines these intermediate outputs to some common output type BS. Usually, c = pi2 and BR = BS, that is, the intermediate result of type BQ is just dropped. The resulting relation first consumes some input using the first relation Q, then passes its output to the function R, which in turn consumes more of the input and produces some output BR. The final relation will be between the concatenation of the consumed lists and the intermediate results combined by c. Admittedly, this relation is not trivial. But the pattern to read a small portion of the input, relate it to something, and make the rest dependent on that, frequently occurs, especially when reading format headers. So it seemed reasonable to abstract it out, so at least it only has to be understood once. In many cases, it will occur that one just has to read something n times, and combine the results in some way, that is, iteratively use the Combine function. In StrongDec.v we therefore define

Fixpoint nTimes {A B C} (n : nat) (null : C) (comb : A -> C -> C)
  (rel : A -> list B -> Prop) : C -> list B -> Prop :=
  match n with
  | 0 => fun c L => c = null /\ L = nil
  | (S n') => rel >>[ comb ]= fun _ => (nTimes n' null comb rel)
  end.

We usually do not need the full generality of this function, but defining it once in a most general manner means we only have to prove its properties once. Special cases we use are appending or consing the outputs.

Definition nTimesApp {A B} (n : nat) (rel : list A -> list B -> Prop) :=
  nTimes n nil (@app A) rel.

Definition nTimesCons {A B} (n : nat) (rel : A -> list B -> Prop) :=
  nTimes n nil (@cons A) rel.
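As a usage example (our own; we read OneBit b l as relating a single bit b to the one-element input [b], a hypothetical reading based on its name): nTimesCons 8 OneBit then relates a list of eight bits to the eight bits consumed from the input, and by the lemmata about nTimes it inherits strong uniqueness and strong decidability from OneBit.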

For definitions like nTimes, we need a base case, a “null” relation. While the output would not really matter, and we could just require L = @nil B, the resulting relation would not be unique anymore. To make it strongly unique, we must fix a default value (fun n L => n = null /\ L = @nil B). We also need a relation that applies a function to the first argument of the relation, prior to relating it:

Definition AppCombine {A BQ BR : Set} (Q : BQ -> list A -> Prop)
  (f : BQ -> BR) : BR -> list A -> Prop :=
  Combine Q (fun n (m : unit) L => m = () /\ L = @nil A) (fun a b => f a).

Since we can prove several properties about strong uniqueness and strong decidability, and since these combinators reflect common patterns that frequently occur, they are universal and should be usable for formally specifying other data formats, too. It should be noted that we do not define a disjunctive connective. In unverified parser grammars, one can just interpret such a disjunctive connective by using the first branch that matches. However, in a formally verified implementation, we need the property that there can be at most one branch that matches. This must be proved in every single case, and therefore we cannot lift it to this abstraction level. Usually, such branches correspond directly to the several constructors of a relation.

6.4 Streamable Strong Decidability

It is inevitable to consume stack if we want to have a function that returns an error if the input stream is malformed, but do not allow side effects. For example, it is also possible to write a program in continuation-passing style, as is done in [77], but this is just a transformation from the exception monad to the continuation monad, with the same problems. Most parser combinator libraries do it this way, since the consumed stack is mostly not relevant for the size of the data which is parsed. For a format like Deflate, you would normally not use a parser combinator. A general solution to stack overflows of such libraries, for example in Java, is to use an own stack-like structure to track the state. For example, the following is a recursive implementation of Flood Fill.

static void floodFill1 (int[][] image, int x, int y, int color) {
    int orig = image[x][y];
    if (orig == color) return;
    floodFill1_(image, x, y, color, orig);
}

static void floodFill1_ (int[][] image, int x, int y, int color, int orig) {
    if ((x >= 0) && (y >= 0) && (x < image.length) && (y < image[0].length)
            && image[x][y] == orig) {
        image[x][y] = color;
        floodFill1_(image, x-1, y, color, orig);
        floodFill1_(image, x+1, y, color, orig);
        floodFill1_(image, x, y-1, color, orig);
        floodFill1_(image, x, y+1, color, orig);
    }
}

This will result in a stack overflow for larger images. However, it is not trivially possible to not save the return path. In this case, the iterative way will use a switch statement:

static class XY {
    public int x, y, state;
    public XY(int _x, int _y, int _state) { x = _x; y = _y; state = _state; }
}

static void floodFill2 (int[][] image, int x, int y, int color) {
    int orig = image[x][y];
    if (orig == color) return;

    Stack<XY> st = new Stack<XY>();
    st.push(new XY(x, y, 0));

    while (true) {
        if (st.empty()) return;
        XY c = st.peek();
        switch (c.state) {
        case 0:
            if ((c.x >= 0) && (c.y >= 0) && (c.x < image.length)
                    && (c.y < image[0].length) && image[c.x][c.y] == orig) {
                image[c.x][c.y] = color;
                c.state = 1;
            } else {
                st.pop();
            }
            break;
        case 1: st.push(new XY(c.x-1, c.y, 0)); c.state = 2; break;
        case 2: st.push(new XY(c.x+1, c.y, 0)); c.state = 3; break;
        case 3: st.push(new XY(c.x, c.y-1, 0)); c.state = 4; break;
        case 4: st.push(new XY(c.x, c.y+1, 0)); c.state = 5; break;
        case 5: st.pop(); break;
        }
    }
}

Clearly, this is just a way to put the stack on a higher abstraction level, which is slower but does not impose the limitations of the machine-level stack on us. It cannot really overflow in the same sense as the machine-level stack does, but it can overflow in the sense that the machine runs out of memory. Therefore, this is essentially the same as increasing the stack size. Another possibility to avoid stack overflows is using exceptions. In Haskell, you could do the following:

data IncorrectDigit = NewIncorrectDigit deriving Show
instance Exception IncorrectDigit

ternary :: [Bool] -> Integer
ternary [] = 0
ternary (False : False : r) = 3 * ternary r
ternary (False : True : r) = 1 + 3 * ternary r
ternary (True : False : r) = 2 + 3 * ternary r
ternary (True : True : _) = throw NewIncorrectDigit

However, exceptions are impure: The result of a computation depends on the order of evaluation. Assume we defined an unsafe catch function using unsafePerformIO:

unsafeCatch :: Exception b => a -> (b -> a) -> a
unsafeCatch d q = unsafePerformIO $ do
  z <- catch (do x <- evaluate d
                 return x)
             (\e -> do return (q e))
  return z

The seq function was made to enforce a certain evaluation order in cases where it is necessary due to limitations; specifically, the foldl' function, which enforces immediate folding of the given list instead of lazy folding which would potentially waste memory, can be implemented using seq:

foldl' f z [] = z
foldl' f z (x:xs) = let z_ = f z x
                    in z_ `seq` (foldl' f z_ xs)

However, in pure code, the order of evaluation must not matter, and therefore, seq is not allowed to change the outcome of any function. Now, if we define

e_x = let x = True
          y = error "Fail"
      in unsafeCatch (x `seq` Just (x || y))
                     ((\b -> Nothing) :: ErrorCall -> Maybe Bool)

it should be equivalent to

e_y = let x = True
          y = error "Fail"
      in unsafeCatch (y `seq` Just (x || y))
                     ((\b -> Nothing) :: ErrorCall -> Maybe Bool)

However, it is not: e_x returns Just True, while e_y returns Nothing. This shows that catching exceptions is impure. Exceptions in dependently typed programming languages are still under research. In [85], the following example is given to point out the problem:

$$P\,x^{\mathbb{N}_\varepsilon} : \begin{cases} \mathbb{N} \to \mathbb{N} & \text{if } (\mathtt{try}\ x\ \mathtt{catch}\ \lambda\_.\,S\,0) = 0 \\ \mathbb{N} & \text{otherwise} \end{cases}$$

Now we have the reduction

$$P\,0\,0 \to 0,$$

however, we also have the reduction

$$P\,(\mathtt{throw}\ \varepsilon)\,0 \to 0\,0,$$

which is not well-typed. Until these problems are resolved, we mainly see two possibilities. One possibility would be to allow calling a given function if an exception occurs. This function would formally return a list which would be the tail of the returned stream, so the type checker is satisfied. On the other hand, we will mostly embed our verified functions into less strict languages, so we could make the passed function throw an exception. This would be possible, but it is not really satisfactory. Another obvious solution would be to specify "erroneous" results that a parser can produce. For example, as we produce lists, we could define an alternative list type which has an additional constructor indicating an error:

Inductive EList (A E : Set) : Set :=
| NilOK : EList A E
| NilError : E -> EList A E
| ECons : A -> EList A E -> EList A E.

This allows us to write an algorithm that can produce an error at any point, but it makes it necessary to read the list entirely before we know whether there was an error. The following will consume a lot of stack:

Fixpoint EListToExc (A E : Set) (l : EList A E) : (list A + E) :=
  match l with
  | NilOK _ _ => inl []
  | NilError _ _ x => inr x
  | ECons _ _ a r =>
    match EListToExc A E r with
    | inl l_ => inl (a :: l_)
    | inr e => inr e
    end
  end.

This is essentially what we did with strong decidability. What we actually want to do is to postpone the evaluation of the error as far as possible, to preserve laziness. We notice that EList A E is isomorphic to list A * option E:

Fixpoint EListToPair (A E : Set) (e : EList A E) : (list A * option E) :=
  match e with
  | NilOK _ _ => ([], None)
  | NilError _ _ x => ([], Some x)
  | ECons _ _ a r => let (l, x) := EListToPair A E r in (a :: l, x)
  end.

To convince ourselves that this code will not break laziness in Haskell, we wrote the following Haskell code:

import System.Environment

lrec :: [Maybe a] -> ([a], Int)
lrec [] = ([], 0)
lrec (Nothing : l) = ([], -1)
lrec (Just a : l) = let (q, r) = lrec l in (a : q, r + 1)

evald = lrec (map Just [1..])

lazytest :: [Int]
lazytest = let (r, _) = evald in take 10 r

lazytest2 :: Int
lazytest2 = let (r, b) = evald in b

main :: IO ()
main = do
  args <- getArgs

  putStr $ show $ [Left lazytest, Right lazytest2] !! (read $ args !! 0)

Calling this program with 0 as argument shows a finite list, while calling it with 1 as argument results in an endless recursion, meaning that for the computation of the list, the error itself is not evaluated. We therefore define R : A → [B] → Prop to be streamably strongly decidable by

$$\mathrm{SSD}_E\,R := \forall\, l^{[B]}.\ \Sigma\, a^A\, l_1^{[B]}\, l_2^{[B]}\, e^{1+E}.\ l = l_1 \mathbin{+\!\!+} l_2 \;\wedge\; (e = \mathrm{inl}\,() \to R\,a\,l_1) \;\wedge\; (e \neq \mathrm{inl}\,() \to \forall\, b\,k_1\,k_2.\ l = k_1 \mathbin{+\!\!+} k_2 \to R\,b\,k_1 \to \bot)$$

Lemma 10. Formally, strong decidability and streamable strong decidability are equivalent.

Proof. Let R be streamably strongly decidable and an input l^[B] be given. By eliminating SSD_E R l, we get a^A, l_1^[B], l_2^[B] and e^{1+E}. We eliminate e. Assuming e = inl (), we know that R a l_1 and l = l_1 ++ l_2. Thus, by re-introducing the quantifiers, we get StrongDec R. Assuming e = inr f, we know that ∀ b, k_1, k_2. l = k_1 ++ k_2 → R b k_1 → ⊥, and thus StrongDecE R.

Now let R be strongly decidable, and an input l^[B] be given. We eliminate StrongDecE R l. If the left side is satisfied, we get a^A, l_1^[B], l_2^[B] and l = l_1 ++ l_2 ∧ R a l_1, so we can re-introduce the quantifiers and get SSD_E R with e = inl (). If the right side is satisfied, we get the error f^E, and so we set e = inr f. We can furthermore set l_1 = [] and l_2 = l to get the guarantee that l = l_1 ++ l_2. We can then re-introduce the quantifiers and get SSD_E R.

As we already proved strong decidability for all relations, for the simple cases we will just apply Lemma 10 to our old proofs. This shows how our flexible and modular approach pays off when changing definitions later. We proved the directions of this lemma separately in the file EncodingRelationProperties2.v as SD_imp_SSD and SSD_imp_SD. For example, for OneBitSSD, the definition is

Lemma OneBitSSD : StreamableStrongDec string OneBit.
Proof.
  apply (SD_imp_SSD false).
  apply OneBitStrongDec.
Defined.

Parsing a single bit requires only a negligible amount of stack, which is why it is sufficient to use this lemma instead of proving streamable strong decidability directly. Due to the lack of time, we did not finish proving all relevant streamable strong decidabilities. Our implementation therefore still consumes lots of memory. We regard this as further work.

Chapter 7

The Encoding Relation

In this chapter, we present the relation we defined in Coq, which is the core of our implementation. Our implementation is verified against it, and it is part of the trusted codebase: it is impossible to verify that our relation is "correct" – it is not even clear what that would mean, since the standard [41] is stated informally and is therefore subject to interpretation. However, we tested our implementation, and it is unlikely that there is a problem.

7.1 Overview

We give a short informal overview of correct Deflate streams, to convey the complexity of the format, and in the hope that it will make it easier to follow our definitions and relations. Notice that we are describing an existing and widespread standard here; in particular, this standard was not designed by us. We give this overview so that you do not have to read the actual standard. Many parts may seem overcomplicated, but that is probably due to the fact that this is a grown standard. For some clarity, we start with a small informal pseudogrammar for Deflate streams:

Deflate            ::= ('0' Block)* '1' Block ( '0' | '1' )*
Block              ::= '00' UncompressedBlock
                     | '01' DynamicallyCompBl
                     | '10' StaticallyCompBl
UncompressedBlock  ::= length ~length bytes
StaticallyCompBl   ::= CompressedBlock(standard coding)
DynamicallyCompBl  ::= header coding CompressedBlock(coding)
CompressedBlock(c) ::= [^256]* 256 (encoded by c)

A first thing to notice is that a compressed block ends with the number 256 – which means that the alphabet of compressed blocks does not only contain bytes. This is possible because it is encoded by a prefix-free coding in which characters may be encoded by more than 8 bits. Notice furthermore that

this grammar does not in any way give information on how the data is to be interpreted. More specifically, the whole machinery of resolving backreferences and encoding codings is not reflected in this informal grammar. To clarify our terminology, we say a character is an element from an alphabet, a codepoint is a number that is encoded in some dataset and may stand for either a character or some instructional control structure, a coding is a function that assigns bit sequences to codepoints, and a code is a bit sequence which is associated with some codepoint through a coding.

Deflate streams can make use of three techniques of compression: prefix-free coding (as in Huffman codes), run length encoding, and backreferences as found in Lempel-Ziv compression. The latter two use the same mechanism, as described in Section 7.4. Furthermore, Deflate streams are byte streams, that is, streams of values from 0 to 255. With such byte streams, one associates bit streams by concatenating the bytes LSB-first (least significant bit first), regardless of how they are actually sent. This is necessary because most Deflate modes conceptually operate on the bit level. On top of this bit stream, the data is sliced into blocks which may be compressed. A block starts with a bit that indicates whether it is the last block, followed by two further bits indicating whether the block is "statically" compressed, that is, with fixed codings defined in the standard, or "dynamically" compressed, where the codings must be saved, or uncompressed.

For an uncompressed block, the bits up to the next byte boundary are ignored; then a 16 bit integer followed by its bitwise complement is saved byte-aligned. It denotes the number of bytes the block contains. Uncompressed blocks cannot contain backreferences. The advantage of the byte-aligned layout of uncompressed blocks is that it allows for byte-wise access, e.g. sendfile(2). On the formal level this brings the extra difficulty that Deflate streams cannot be described as a formal grammar on a bit sequence without knowing the byte boundaries.

Compressed blocks start immediately after the three header bits. Statically compressed blocks have predefined codings, and therefore the compressed data immediately follows the header bits. Even when the actual compression does not utilize Huffman codings to save memory directly (which will usually be the case for statically compressed blocks), two prefix-free codings are needed to encode backreferences: a coding does not only encode the 256 byte values, but up to 286 (288 with 2 forbidden) characters, of which one, 256, is used as the end mark, and the values above 256 are used to denote backreferences. If the decoder encounters a code for such a character, a certain number of additional bits is read, from which the length of this backreference is calculated. Then, using another coding, a value from 0 to 29 is read, and additional bits, which determine the distance of that backreference. The numbers of additional bits can be looked up in a table specified in the standard [41].

Dynamically compressed blocks get another header defining the two Deflate codings. The codings are saved as sequences of numbers, as formalized in Section 5. This is similar to other compression standards that utilize prefix-free codings, like BZip2. These sequences are themselves compressed, and another header is needed to save their coding.
For clarity, let us consider a small example. As we have to deal with three layers of compression, it is not always clear what a code, a coding and a character is. For this example, we add indices to the words to denote which layer they are from. A codeₙ is a sequence of bits for a codepointₙ. A codepointₙ is a number assigned to either a characterₙ or some special instruction on that level. A codingₙ is a Deflate coding for codepointsₙ. Raw bytes are characters₀. We want to compress the string

ananas_banana_batata

Firstly, as we want to compress, we need an end sign (which gets the codepoint₀ 256), which we will denote as ∅. Since this string has a lot of repetitions, we can use backreferences. A backreference is a pair ⟨l, d⟩ of length and distance, which can be seen as an instruction to copy l bytes from a backbuffer of decompressed data, beginning at the d-th last position, to the front, in ascending order, such that the copied bytes may be bytes that are written during the resolution of this backreference, hence allowing for both deduplication and run length encoding. In our case, we can add the following backreferences.

an ⟨3, 2⟩ s_b ⟨5, 8⟩⟨3, 7⟩ t ⟨3, 2⟩ ∅

The codepoint₀ for length 3 is 257, and for 5 it is 259. They do not have suffixes. The codepoint₀ for the distance 2 is 1 with no suffix; for 7 and 8 it is 5, and it has a single bit as suffix, which indicates whether it stands for 7 or 8. We write aˡ to denote that a is a literal/length codepoint₀, with an index denoting the corresponding character₀ if any, and aᵈ to denote that it is a distance codepoint₀. We furthermore put suffix bit sequences in brackets. Then we get

$$97^l_{\texttt{a}}\; 110^l_{\texttt{n}}\; 257^l\; 1^d\; 115^l_{\texttt{s}}\; 95^l_{\_}\; 98^l_{\texttt{b}}\; 259^l\; 5^d(1)\; 257^l\; 5^d(0)\; 116^l_{\texttt{t}}\; 257^l\; 1^d\; 256^l_{\varnothing}$$

The frequencies of literal/length codepoints₀ are

95 × 1; 97 × 1; 98 × 1; 110 × 1; 115 × 1; 116 × 1; 256 × 1; 257 × 3; 259 × 1

The frequencies of distance codepoints₀ are

1 × 2; 5 × 2

The optimal Deflate codings₀ (as defined in Section 5) are

95 → 1100; 97 → 010; 98 → 011; 110 → 100; 115 → 1101

116 → 101; 256 → 1110; 257 → 00; 259 → 1111 and 1 → 0; 5 → 1

To clarify the terminology, note that e.g. the character₀ a has the code₀ 010 under the given coding₀. The reason for introducing the concept of "codepoints₀" is that the alphabets for lengths and characters₀ are merged: every character₀ has an assigned codepoint₀, but not every codepoint₀ has a character₀; e.g. the codepoint₀ 257 indicates a backreference, but still has the code₀ 00. Our message can therefore be encoded by the following sequence of bits (spaces are included for clarity):

010 100 00 0 1101 1100 011 1111 1 1 00 1 0 101 00 0 1110

As proved in Section 5, it is sufficient to save the lengths, which is done as a run length encoded list, where length 0 means that the corresponding codepoint₀ does not occur. We use a semicolon to separate the literal/length coding₀ from the distance coding₀. The two lists are not separated in the actual file; it is even allowed that run-length-encoding instructions spread across their border. What part of the unfolded list belongs to which coding is specified in another header defined later.

$$\underbrace{0,\ldots,0}_{95\times}, 4, 0, 3, 3, \underbrace{0,\ldots,0}_{11\times}, 3, 0, 0, 0, 0, 4, 3, \underbrace{0,\ldots,0}_{138\times}, 0, 4, 2, 0, 4;\ 0, 1, 0, 0, 0, 1$$

This list will itself be compressed; thus, the lengths of codes₀ become characters₁. Notice that due to a header described later, we can cut off all characters₁ after the last nonzero character₁ of both sequences. The maximum length that is allowed for a code₀ in Deflate is 15. Deflate uses the codepoints₁ 16, 17 and 18 for its run length encoding. Specifically, 17 and 18 are for repeated zeroes: 17 gets a 3 bit suffix ranging from 3 to 10, and 18 gets a 7 bit suffix ranging from 11 to 138. These suffixes are least-significant-bit first. The former sequence therefore becomes

18(0010101), 4, 0, 3, 3, 18(0000000), 3, 17(100),

4, 3, 18(1111111), 0, 4, 2, 0, 4; 0, 1, 17(000), 1
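As a sanity check on the suffix decoding (our own worked computation, not part of the standard): the instruction 18(0010101) carries the suffix bits 0010101, read least-significant-bit first as 0·1 + 0·2 + 1·4 + 0·8 + 1·16 + 0·32 + 1·64 = 84; since the 7 bit suffix of 18 starts at 11, it stands for 11 + 84 = 95 zeroes – exactly the 95 leading zeroes above. Likewise, 17(100) stands for 3 + 1 = 4 zeroes.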

Now, the frequencies of codepoints₁ are

0 × 4; 1 × 2; 2 × 1; 3 × 4; 4 × 4; 17 × 2; 18 × 3

Therefore, the optimal coding₁ is

0 → 100; 1 → 1110; 2 → 1111; 3 → 00; 4 → 01; 17 → 101; 18 → 110

The sequence of code₀ lengths can therefore be saved as

110 0010101 01 100 00 00 110 0000000 00 101 100 01 00 110 1111111 100 01 1111 100 01 100 1110 101 000 1110

We now have to save the coding₁, and again, it is sufficient to save the code₁ lengths. These code₁ lengths for the 19 codepoints₁ are saved as 3 bit least-significant-bit first numbers, but in the following order: 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15. Again, the codepoint₂ 0 denotes that the corresponding codepoint₁ does not occur. We can furthermore cut off the codepoint₂ for the last code₁ length (in the given order), 15, which is 0 in our example, due to a header described later. The sequence of codepoints₂ therefore becomes

000 110 110 110 000 000 000 000 000 000 000 010 000 010 000 001 000 001

We now come to the aforementioned header that, in particular, allows us to omit trailing zeroes. We need the number of literal/length codepoints₀ and distance codepoints₀ saved in the former sequence, and the number of saved codepoints₁. These are 260, 6 and 18, respectively. The first one is saved as a 5 bit number ranging from 257 to 286 (the values 287 and 288 are forbidden), the second one is saved as a 5 bit number ranging from 1 to 32, and the third one is saved as a 4 bit number ranging from 4 to 19. Therefore, this header becomes

11000 10100 0111

With three additional header bits, denoting that what follows is the last block and that it is a dynamically compressed block (and with 7 additional bits to fill up the byte in the last line), we get

1 0 1 11000 10100 0111

000 110 110 110 000 000 000 000 000 000 000 010 000 010 000 001 000 001

110 0010101 01 100 00 00 110 0000000 00 101 100 01 00 110 1111111 100 01 1111 100 01 100 1110 101 000 1110

010 100 00 0 1101 1100 011 1111 1 1 00 1 0 101 00 0 1110

0000000

Of course, this example is constructed for instructional purposes, and the compressed message is longer than the original text. However, Deflate also supports statically compressed blocks, which are good for repetitive files. Those use a fixed coding₀ which is completely described in the standard [41].

Its relevant part for our example is the following:

95ˡ → 10001111; 97ˡ → 10010001; 98ˡ → 10010010; 110ˡ → 10011110; 115ˡ → 10100011; 116ˡ → 10100100; 256ˡ → 0000000; 257ˡ → 0000001; 259ˡ → 0000011; 1ᵈ → 00001; 5ᵈ → 00101

With the three header bits, and 4 additional padding bits to fill the byte, the compressed file is 1 1 0

10010001 10011110 0000001 00001 10100011 10001111 10010010 0000011 00101 1 0000001 00101 0 10100100 0000001 00001 0000000

0000

which is, in fact, slightly shorter than the original string: 3 header bits, 105 code bits and 4 padding bits add up to 112 bits, that is, 14 bytes instead of the original 20 bytes.

7.2 The Toplevel Relation

The toplevel relation is

Inductive DeflateEncodes (out : LByte) (inp : LB) : Prop :=
| deflateEncodes : forall swbr,
    ManyBlocks 0 swbr inp ->
    ResolveBackReferences swbr out ->
    DeflateEncodes out inp.

where inp is a list of bits and out is a list of output bytes. This relation relates a compressed data stream to its cleartext. Firstly, it refers to ManyBlocks, which has a bit list as its input and a SequenceWithBackRefs as its output: all other block relations produce such sequences with backreferences, because the resolution of backreferences is not bound to block boundaries. These backreferences must be resolved, and then one gets the correct output.

Inductive ManyBlocks : nat -> SequenceWithBackRefs Byte -> LB -> Prop :=
| lastBlock : forall n inp out,
    OneBlockWithPadding out (n + 1) inp ->
    ManyBlocks n out (true :: inp)
| middleBlock : forall n inp1 inp2 out1 out2,
    OneBlockWithPadding out1 (n + 1) inp1 ->
    ManyBlocks (n + 1 + ll inp1) out2 inp2 ->
    ManyBlocks n (out1 ++ out2) (false :: inp1 ++ inp2).

This relation has an additional first argument, a natural number, which tracks the number of bits that have been read from the compressed input. It is increased at every recursive call by the respective number of bits that have been consumed. The reason is that while the two kinds of compressed blocks can start at any bit position, uncompressed blocks cannot: they have to start at a byte boundary. Our way of reflecting this is to track the number of consumed bits and to use this information in the respective relation to only allow uncompressed blocks at multiples of 8. This relation is called by DeflateEncodes with the argument 0, obviously. Besides that, this relation distinguishes between lastBlock and middleBlock and adds a header bit accordingly. This is the bit called BFINAL in [41].

Inductive OneBlockWithPadding (out : SequenceWithBackRefs Byte) : nat -> LB -> Prop :=
| obwpDCB : forall dcb n,
    DynamicallyCompressedBlock out dcb ->
    OneBlockWithPadding out n (false :: true :: dcb)
| obwpSCB : forall scb n,
    StaticallyCompressedBlock out scb ->
    OneBlockWithPadding out n (true :: false :: scb)
| obwpUCB : forall ucb n m pad,
    UncompressedBlockDirect out ucb ->
    n + ll (false :: false :: pad) = 8 * m ->
    ll pad < 8 ->
    OneBlockWithPadding out n (false :: false :: pad ++ ucb).

All branches of this relation add the two header bits that indicate the type of the following block, according to [41]. Furthermore, like ManyBlocks, this relation has a natural number argument, which is the number of bits that have been consumed. Only the branch obwpUCB, which is for uncompressed blocks, actually uses this argument; it takes care that the data begins at a byte boundary by adding an arbitrary padding string (which will usually consist of zeroes). For example, if n = 13 bits have been consumed so far, the padding must have length 1, so that 13 + 2 + 1 = 16 is a multiple of 8.

7.3 Uncompressed Blocks

It might seem trivial to read uncompressed blocks, but the input argument of our relations is – for practical reasons – always a bit list, while the output is a sequence of backreferences and bytes. That is, we disassemble the bytes into a bit list first, and for the case of uncompressed blocks, we have to combine the bits into bytes again. We originally tried to solve this problem differently, by having a byte list or some more sophisticated structure as input. For example, one such idea was to utilize laziness and graph reduction and define a structure

data BitsAndBytes = Bit Bool Word8 BitsAndBytes BitsAndBytes | End
  deriving Show

toBitsAndBytes :: [Word8] -> BitsAndBytes
toBitsAndBytes [] = End
toBitsAndBytes (byte : rest) =
  let toBits :: Word8 -> [Bool]
      toBits w = map (testBit w) [0..7]
      toBB :: [Bool] -> Word8 -> BitsAndBytes -> BitsAndBytes
      toBB [] _ bb = bb
      toBB (b : r) w bb = Bit b w (toBB r w bb) bb
  in toBB (toBits byte) byte (toBitsAndBytes rest)

therefore saving a pointer to the next byte at every cell. This would probably have been an elegant way for a Haskell implementation doing lazy I/O. However, it is formally hard to capture the semantics of this structure, and it would result in a massive blowup in the absence of graph reduction. In the end, we decided to have a bit list as input, and to count the number of bits that have been consumed, as shown in Section 7.2. This leaves us with some extra work for uncompressed blocks, but it greatly simplifies the relations that work on the bit level.

The problem when reading a byte is that we defined bytes to be bit vectors. We need to assure Coq that these vectors are indeed of length 8, which is a bit tedious. For reading a single bit and returning it, we have the relation

Inductive OneBit : bool -> LB -> Prop :=
| oneBit : forall b, OneBit b [b].

We then simply use our combinator function to read n bits into a list:

Definition nBits (n : nat) : LB -> LB -> Prop := nTimesCons n OneBit.

Using this relation, we can define a relation that reads n bits into a vector:

Definition nBitsVec (n : nat) (vb : vec bool n) (l : LB) : Prop :=
  nBits n (to_list vb) l.

We use vec bool 8 as the representation of bytes:

Notation Byte := (vec bool 8).

and can then define

Definition OneByte : (Byte + nat*nat)%type -> LB -> Prop :=
  AppCombine (nBitsVec 8) inl.

We can now specify reading multiple bytes:

Definition nBytesDirect (n : nat) : SequenceWithBackRefs Byte -> LB -> Prop :=
  nTimesCons n OneByte.

Uncompressed blocks start with two 16 bit numbers, where the second one is the complement of the first one, and the first one denotes the length. We can therefore define:

Definition UncompressedBlockDirect : SequenceWithBackRefs Byte -> LB -> Prop :=
  (readBitsLSB 16) >>= fun len =>
  (readBitsLSB 16) >>= fun nlen =>
  (fun swbr lb => len + nlen = 2 ^ 16 - 1 /\ nBytesDirect len swbr lb).
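To illustrate what this relation demands operationally, here is a minimal Haskell sketch of the header check (the names lsbToNat and uncompressedHeader are ours, for illustration only; the verified parser is instead extracted from the constructive proofs): read LEN and NLEN as 16 bit least-significant-bit-first numbers and require LEN + NLEN = 2^16 − 1.

lsbToNat :: [Bool] -> Int
lsbToNat = foldr (\b n -> (if b then 1 else 0) + 2 * n) 0

-- returns the declared length and the remaining (byte-aligned) input
uncompressedHeader :: [Bool] -> Maybe (Int, [Bool])
uncompressedHeader bs =
  case splitAt 16 bs of
    (lenBits, rest) | length lenBits == 16 ->
      case splitAt 16 rest of
        (nlenBits, body)
          | length nlenBits == 16
          , lsbToNat lenBits + lsbToNat nlenBits == 2 ^ 16 - 1
          -> Just (lsbToNat lenBits, body)   -- LEN bytes follow, byte-aligned
        _ -> Nothing
    _ -> Nothing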

7.4 Backreferences

Compressed blocks can make use of run length encoding and backreferences as found in Lempel-Ziv compression; both are merged into one mechanism, which we will refer to as backreferences. In Backreferences.v we made the definition

Function SequenceWithBackRefs A := (list (A+(nat*nat))%type).

An element is therefore either an element of some abstract alphabet A, or a backreference, which is a pair ⟨l, d⟩ of a length l and a distance d. The length is the number of bytes to be copied; the distance is the number of bytes in the backbuffer that have to be skipped before copying. Assuming we wanted to compress the string

ananas_banana_batata (7.1) we could shorten it with backreferences to

ananas_b ⟨5, 8⟩⟨3, 7⟩ tata (7.2)

An intuitive algorithm to resolve such a backreference would be a loop that decreases the length and copies one byte from the backbuffer to the front each time (the example is written in Java):

int resolve (int len, int dist, int pointer, byte[] output) {
    while (len > 0) {
        output[pointer] = output[pointer-dist] ;
        pointer = pointer + 1 ;
        len = len - 1 ;}
    return pointer ;}

This intuitive algorithm would also work when l > d, and it would result in a repetition of already written bytes – which is what run length encoding does. Therefore, Deflate explicitly allows l > d, allowing us to shorten (7.2) even further:

an ⟨3, 2⟩ s_b ⟨5, 8⟩⟨3, 7⟩ t ⟨3, 2⟩    (7.3)

More directly, the string aaaaaaaargh! can be compressed as a ⟨7, 1⟩ rgh!, which essentially is run length encoding. For resolution, we inductively define

Inductive nthLast {A : Set} : forall (n : nat) (L : list A) (a : A), Prop :=
| makeNthLast : forall L M a, nthLast (ll (a :: M)) (L ++ a :: M) a.

It is easy to see that this relation formalizes a step of our above algorithm (ll stands for list length). The relation ResolveBackReference uses it to resolve a single backreference, and is in turn used by ResolveBackReferences for the complete resolution of backreferences. As a lemma, we proved an example from [41] in Backreferences.v:

(** Example from RFC Page 10 with X = 1 and Y = 2*) Goal ResolveBackReference 5 2 [1; 2] [1; 2; 1; 2; 1; 2; 1] (** = [1; 2] ++ [1; 2; 1; 2; 1] *).

7.5 Compressed Blocks

There are two types of compressed blocks: one with fixed codings, and one where the coding is given in an additional header. They share a lot of definitions, and we factored some frequently occurring patterns into common definitions.

7.5.1 Compressed Code with Extra Bits

A code of some coding followed by a sequence of bits of a defined length is a pattern that occurs in multiple places. This pattern is hard to formalize in a readable fashion. So we decided to define one relation that can handle this pattern in every case in which it occurs, which made it even less readable, but it is only one relation – we decided that one less readable definition is better than four of them. We give the relation first, and then explain its various parts:

1  Inductive CompressedWithExtraBits
2    {m : nat} (coding : deflateCoding m)
3    (mincode : nat) (xbitnums bases maxs : list nat)
4    : nat -> LB -> Prop :=
5  | complength
6    : forall (base extra code max xbitnum : nat) (bbits xbits : LB),
7      dc_enc coding (mincode + code) bbits ->     (* code >= mincode *)
8      nth_error xbitnums code = Some xbitnum ->   (* # of addit. bits *)
9      nth_error bases code = Some base ->         (* base *)
10     nth_error maxs code = Some max ->           (* maximum *)
11     ll xbits = xbitnum ->                       (* addit. bits have specified length *)
12     LSBnat xbits extra ->                       (* binary number made by xbits *)
13     base + extra <= max ->
14     CompressedWithExtraBits coding mincode xbitnums
15       bases maxs (base + extra) (bbits ++ xbits).

In compressed blocks, the alphabets of bytes and backreference instructions are merged. Our relation is only meant to handle the codepoints that can have a suffix and do not just encode an alphabet. That is, there is a minimal codepoint that we want to accept; it is the mincode parameter of our relation. In line 7, you can see the dc_enc relation, which says that the given coding encodes the given codepoint with the given bits. It uses mincode + code, which makes it impossible to match a codepoint less than mincode. This also explains the code variable: it is the difference between the actual codepoint and the minimal codepoint.

It is used as an index into the lists xbitnums, bases and maxs. The (n - mincode)-th element of the list xbitnums contains the number of extra bits that the codepoint n gets. In line 8, this number is read into xbitnum, which then contains the number of extra bits for the codepoint that was encoded by bbits. The actual bits of the suffix are in the list xbits, and line 11 assures that it has the appropriate length.

The relation LSBnat relates a bit list to a natural number by interpreting it as a binary least-significant-bit-first number. In line 12, it is used to define extra as the number encoded by the additional bits. This number is usually not the actual number that one wants to encode: there is a base for every codepoint, defined in the list bases, which is read in line 9. The actual number that is encoded is then base + extra, as specified in line 15. As an additional gotcha, it is not always allowed to use the entire range of values: the codepoint 284 in the usual literal alphabet, see [41], page 11, gets 5 additional bits, which would make it possible to encode 32 values; however, it can only encode 30 values. As we need to be able to forbid the other values, we have the maxs list, which saves the upper bounds of encoded numbers. The maximum is looked up in line 10 and checked in line 13.

This relation, though complicated, reflects all the needs of this pattern in every place it occurs. We will use it multiple times in the following, so make sure you understand it.
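As an informal illustration of the pattern (our own sketch with assumed names, not the extracted code), the functional reading of the relation is: decode a codepoint that is at least mincode, read the number of extra bits given by xbitnums least-significant-bit first, add the result to the base, and reject anything above the maximum:

lsbToNat :: [Bool] -> Int
lsbToNat = foldr (\b n -> (if b then 1 else 0) + 2 * n) 0

withExtraBits :: Int -> [Int] -> [Int] -> [Int]   -- mincode, xbitnums, bases, maxs
              -> Int -> [Bool]                    -- decoded codepoint, remaining bits
              -> Maybe (Int, [Bool])              -- (base + extra, rest of the input)
withExtraBits mincode xbitnums bases maxs codepoint bits = do
  let code = codepoint - mincode
  xbitnum <- nthMay code xbitnums
  base    <- nthMay code bases
  maxv    <- nthMay code maxs
  let (xbits, rest) = splitAt xbitnum bits
  if length xbits == xbitnum && base + lsbToNat xbits <= maxv
    then Just (base + lsbToNat xbits, rest)
    else Nothing
  where nthMay i xs = if 0 <= i && i < length xs then Just (xs !! i) else Nothing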

7.5.2 Compressed Data

If you look at the RFC [41], Page 11, you will find the following table:

Code  Extra  Length(s)   Code  Extra  Length(s)   Code  Extra  Length(s)
      Bits                     Bits                     Bits
257     0    3           267     1    15,16       277     4    67-82
258     0    4           268     1    17,18       278     4    83-98
259     0    5           269     2    19-22       279     4    99-114
260     0    6           270     2    23-26       280     4    115-130
261     0    7           271     2    27-30       281     5    131-162
262     0    8           272     2    31-34       282     5    163-194
263     0    9           273     3    35-42       283     5    195-226
264     0    10          274     3    43-50       284     5    227-257
265     1    11,12       275     3    51-58       285     0    258
266     1    13,14       276     3    59-66

After a brief look, one could think that the range of encoded lengths is always 2 to the power of the number of extra bits. Therefore, one might have the idea to use some clever function to calculate these bounds. However, a closer look at code 284 shows that here, for some reason we do not know, only 30 encoded lengths are allowed. We therefore refrained from using any formula to directly calculate the values, and just copied the min- and max-values from the tables. We initially did not have such a max-value limiting the allowed encoded lengths, because we did not notice this discontinuity. This is one example of a possible source of bugs and of the strangeness of the standard. We now look at the data part of the compressed blocks, that is, everything besides the headers. This part is common to all compressed blocks.

Function CompressedLength (litlen : deflateCoding 288) :=
  CompressedWithExtraBits litlen 257
    repeatCodeExtraBits repeatCodeBase repeatCodeMax.

Function CompressedDist (dist : deflateCoding 32) :=
  CompressedWithExtraBits dist 0
    distCodeExtraBits distCodeBase distCodeMax.

These are helper relations that define when a length and a distance are compressed. The lists of extra bit numbers, bases and maximally allowed encoded numbers are defined in EncodingRelation.v; they were taken directly from the tables in [41].
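For concreteness, the length-code table above can be transcribed into the three lists the relation is parameterized with. This Haskell transcription is ours, written out directly from the table; index 0 corresponds to codepoint 257, and the Coq lists repeatCodeExtraBits, repeatCodeBase and repeatCodeMax presumably play this role:

repeatCodeExtraBits, repeatCodeBase, repeatCodeMax :: [Int]
repeatCodeExtraBits =
  [ 0,0,0,0,0,0,0,0,  1,1,1,1,  2,2,2,2,  3,3,3,3,  4,4,4,4,  5,5,5,5,  0 ]
repeatCodeBase =
  [ 3,4,5,6,7,8,9,10,  11,13,15,17,  19,23,27,31,  35,43,51,59,
    67,83,99,115,  131,163,195,227,  258 ]
repeatCodeMax =
  [ 3,4,5,6,7,8,9,10,  12,14,16,18,  22,26,30,34,  42,50,58,66,
    82,98,114,130,  162,194,226,257,  258 ]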

Inductive CompressedSWBR (litlen : deflateCoding 288) (dist : deflateCoding 32)
  : SequenceWithBackRefs Byte -> LB -> Prop :=
| cswbr_end : forall l,
    dc_enc litlen 256 l ->
    CompressedSWBR litlen dist [] l
| cswbr_direct : forall prev_swbr prev_lb l n,
    dc_enc litlen (ByteToNat n) l ->
    CompressedSWBR litlen dist prev_swbr prev_lb ->
    CompressedSWBR litlen dist ((inl n) :: prev_swbr) (l ++ prev_lb)
| cswbr_backref : forall prev_swbr prev_lb l d lbits dbits,
    CompressedSWBR litlen dist prev_swbr prev_lb ->
    CompressedLength litlen l lbits ->
    CompressedDist dist d dbits ->
    CompressedSWBR litlen dist ((inr (l, d)) :: prev_swbr)
                   (lbits ++ dbits ++ prev_lb).

Two codings, a literal/length coding and a distance coding, are associated with this relation. The first constructor cswbr_end just expresses that if the input encodes 256 according to the literal/length coding, the encoded list is the empty list, because the codepoint 256 means that the block ends here. The second constructor cswbr_direct says that if the literal/length coding encodes a number below 256, that is, it encodes a byte n, we have no backreference at this point, and we can add inl n to the list. The third constructor cswbr_backref uses our above helper relations to express a length/distance pair. This relation may be long, but it should be straightforward.

7.5.3 Statically Compressed Blocks

Statically compressed blocks have fixed literal/length and distance codings. We define these according to [41]:

(** See RFC 1951, section 3.2.6. *)
Definition vector_for_fixed_lit_code : vec nat 288 :=
  of_list ((repeat 144 8) ++ (repeat (255 - 143) 9) ++
           (repeat (279 - 255) 7) ++ (repeat (287 - 279) 8)).

(** See RFC 1951, section 3.2.6. *)
Definition vector_for_fixed_dist_code : vec nat 32 :=
  of_list (repeat 32 5).

Using our constructive existence proof from Section 5, we get codings from them, fixed_lit_code and fixed_dist_code; see EncodingRelation.v for details. With these, defining the parsing relation for statically compressed blocks is straightforward:

Inductive StaticallyCompressedBlock (output : SequenceWithBackRefs Byte) : LB -> Prop :=
| makeSCB : forall input,
    CompressedSWBR fixed_lit_code fixed_dist_code output input ->
    StaticallyCompressedBlock output input.

7.5.4 Dynamically Compressed Blocks

Among the most complicated parts of the standard are the headers of dynamically compressed blocks. The codings of the compressed data are themselves compressed by a third coding, and can additionally contain instructions for a crude form of run length encoding. The third coding is a (possibly truncated) list of bit-triples in a strange order specified in [41]. An additional header of three numbers is used to indicate how many elements each coding has. This is the problem when trying to give a readable specification: everything is interconnected.

The Code-length-code Header

We start with the "raw" form of the first coding, which is just a relation that gets a numeric parameter hclen and reads that number of bit triples into a list of natural numbers.

Inductive CLCHeaderRaw : forall (hclen : nat) (input : LB) (output : list nat), Prop :=
| zeroCLCHeaderRaw : CLCHeaderRaw 0 nil nil
| succCLCHeaderRaw : forall n i o j m,
    CLCHeaderRaw n i o ->
    ll m = 3 ->
    LSBnat m j ->
    CLCHeaderRaw (S n) (m ++ i) (j :: o).

This is the header that is actually read from the file. It can be at most of length 19, but if hclen < 19, it is truncated, and the remaining codepoints are regarded as 0. We therefore add a relation that pads the list we get from the former relation:

Inductive CLCHeaderPadded (hclen : nat) (input : LB) (output : list nat) : Prop :=
| makeCLCHeaderPadded : forall m output1,
    CLCHeaderRaw hclen input output1 ->
    output = output1 ++ repeat m 0 ->
    ll output = 19 ->
    CLCHeaderPadded hclen input output.

For the next relation, we need the permutation that is specified in [41]:

(** See RFC 1951, section 3.2.7. *)
Definition HCLensNat :=
  [16; 17; 18; 0; 8; 7; 9; 6; 10; 5; 11; 4; 12; 3; 13; 2; 14; 1; 15].

With it, we can define a relation that permutes the sequence into the right order:

Inductive CLCHeaderPermuted (hclen : nat) (input : LB) (output : list nat) : Prop :=
| makeCLCHeaderPermuted : forall output1,
    CLCHeaderPadded hclen input output1 ->
    (forall m, nth_error output (nth m HCLensNat 19) = nth_error output1 m) ->
    CLCHeaderPermuted hclen input output.

Finally, we can give a relation between the input and a Deflate coding, the code-length-coding, in which the two other codings are then encoded:

Inductive CLCHeader (hclen : nat) (output : deflateCoding 19) (input : LB) : Prop :=
| makeCLCHeader : forall cooked,
    CLCHeaderPermuted hclen input cooked ->
    CodingOfSequence cooked output ->
    CLCHeader hclen output input.

where the relation CodingOfSequence relates a list of code lengths to a coding with these lengths:

Inductive CodingOfSequence {n : nat} (l : list nat) (dc : deflateCoding n) :=
| makeCodingOfSequence : forall (eq : ll l = n),
    Vmap lb (C _ dc) = vec_id eq (of_list l) ->
    CodingOfSequence l dc.

Of course, this relation is only satisfiable under the conditions elaborated in Section 5, which is enforced by the dependent record deflateCoding. Our CLCHeader relation has a parameter hclen, which comes from another header that we describe later. Having read that coding, we use it to read the combined code lengths of the literal/length and distance coding, which are encoded by the code length coding. The problem is that the codepoints 16, 17 and 18 have a special meaning. Firstly, they come with suffix bits, which we realize with the relation from Section 7.5.1. Secondly, they stand for repetitions of either zeroes (for 17 and 18) or the last given code length (for 16). The codepoints 17 and 18 can therefore occur at the beginning of the sequence, but 16 cannot, as it depends on a previously given code length. Instead of controlling this directly, we generate a sequence with backreferences and then use our already given mechanism of backreference resolution from Section 7.4, where we insert a literal zero in front of the backreferences produced for 17 and 18 and therefore decrease their length by 1. Furthermore, a parameter of the relation counts the number of code length codes.

Inductive CommonCodeLengthsSWBR (clc : deflateCoding 19)
  : nat -> SequenceWithBackRefs nat -> LB -> Prop :=
| cswbr0 : CommonCodeLengthsSWBR clc 0 [] []
| cswbrc : forall m n brs lb1 input,
    CommonCodeLengthsSWBR clc n brs lb1 ->
    m < 16 ->
    dc_enc clc m input ->
    CommonCodeLengthsSWBR clc (n + 1) (inl m :: brs) (input ++ lb1)
| cswbr16 : forall m n brs lb1 input,
    CommonCodeLengthsSWBR clc n brs lb1 ->
    CompressedWithExtraBits clc 16 [2] [3] [6] m input ->
    CommonCodeLengthsSWBR clc (n + m) (inr (m, 1) :: brs) (input ++ lb1)
| cswbr17 : forall m n brs lb1 input,
    CommonCodeLengthsSWBR clc n brs lb1 ->
    CompressedWithExtraBits clc 17 [3] [3 - 1] [10 - 1] m input ->
    CommonCodeLengthsSWBR clc (n + m + 1) (inl 0 :: inr (m, 1) :: brs) (input ++ lb1)
| cswbr18 : forall m n brs lb1 input,
    CommonCodeLengthsSWBR clc n brs lb1 ->
    CompressedWithExtraBits clc 18 [7] [11 - 1] [138 - 1] m input ->
    CommonCodeLengthsSWBR clc (n + m + 1) (inl 0 :: inr (m, 1) :: brs) (input ++ lb1).

This might be seen as controversial, because it is less direct than other relations. On the other hand, we decided to do it this way so we can rely on the correctness of the already given relations, rather than rewriting them.

Inductive CommonCodeLengthsN (clc : deflateCoding 19) (n : nat) (B : list nat) (A : LB) : Prop :=
| ccl : forall C,
    CommonCodeLengthsSWBR clc n C A ->
    ResolveBackReferences C B ->
    CommonCodeLengthsN clc n B A.
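For intuition, the run length decoding that these two relations express can be sketched functionally as follows (our own illustration with assumed names; the repeat counts 3-6, 3-10 and 11-138 are the base + extra-bit values from above):

data CLInstr = Len Int      -- codepoints 0-15: a literal code length
             | RepPrev Int  -- codepoint 16: repeat the previous length 3-6 times
             | Zeroes Int   -- codepoints 17/18: emit 3-10 resp. 11-138 zeroes

expandCodeLengths :: [CLInstr] -> Maybe [Int]
expandCodeLengths = go []
  where
    go acc []              = Just (reverse acc)
    go acc (Len l : r)     = go (l : acc) r
    go acc (RepPrev k : r) = case acc of
                               (prev : _) -> go (replicate k prev ++ acc) r
                               []         -> Nothing  -- 16 must not come first
    go acc (Zeroes k : r)  = go (replicate k 0 ++ acc) r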

Now, the given code lengths must be split into the code lengths of the literal/length coding and the code lengths of the distance coding. Additional headers, which we will introduce later, define the numbers of code lengths belonging to each of them. The rest must be padded with zeroes. Hence the following relation:

Inductive SplitCodeLengths (clc : deflateCoding 19) (hlit hdist : nat)
  (litlen : vec nat 288) (dist : vec nat 32) (input : LB) : Prop :=
| makeSplitCodeLengths : forall litlenL distL lm ld,
    ll litlenL = hlit ->
    ll distL = hdist ->
    to_list litlen = litlenL ++ repeat lm 0 ->
    to_list dist = distL ++ repeat ld 0 ->
    CommonCodeLengthsN clc (hlit + hdist) (litlenL ++ distL) input ->
    SplitCodeLengths clc hlit hdist litlen dist input.

This part is an example of a source of confusion: though the standard states it, it is easy to overlook that the repetition codes may define repetitions that go over the boundary between the literal/length coding and the distance coding. Finally, we relate the inputs to their respective codings:

Inductive LitLenDist (clc : deflateCoding 19) (hlit hdist : nat)
  (litlen : deflateCoding 288) (dist : deflateCoding 32) (input : LB) : Prop :=
| makeLitLenDist :
    SplitCodeLengths clc hlit hdist
      (Vmap lb (C 288 litlen)) (Vmap lb (C 32 dist)) input ->
    LitLenDist clc hlit hdist litlen dist input.

Now, we add the three headers determining the respective lengths of the particular codings, using our monadic combinator from Section 6.3:

Definition DynamicallyCompressedHeader
  : (deflateCoding 288 * deflateCoding 32) -> LB -> Prop :=
  (readBitsLSB 5) >>= fun hlit =>
  (readBitsLSB 5) >>= fun hdist =>
  (readBitsLSB 4) >>= fun hclen =>
  (CLCHeader (hclen + 4)) >>= fun clc lld =>
  LitLenDist clc (hlit + 257) (hdist + 1) (fst lld) (snd lld).

As a final step, we can use our CompressedSWBR relation, which we already used for the statically compressed blocks:

Definition DynamicallyCompressedBlock : SequenceWithBackRefs Byte -> LB -> Prop :=
  DynamicallyCompressedHeader >>= fun lld =>
  CompressedSWBR (fst lld) (snd lld).

7.6 Refactoring

By adding new abstractions, we could later change some of the definitions we had already made. The problem with changing the specification is, of course, that it might introduce new bugs. That is why we keep the old definitions (suffixed with _old) and prove their equivalence with the new definitions. This way, we cannot introduce new bugs. Some examples can be found in the file EncodingRelationProperties.v. This is not just theory: we accidentally defined the OneBit relation in the following way:

Inductive OneBit : bool -> LB -> Prop :=
| oneBit : forall l b, OneBit b (b :: l).

This definition is wrong. We noticed this when we tried to prove the newer definition of nBytesDirect to be equivalent to the older definition, and it was not possible.

Chapter 8

Efficiency

Our primary goal was to make the specification as simple as possible. We use very simple data structures in our definitions. More sophisticated structures would have to be proven correct first.

8.1 Natural Numbers

We mostly used the nat type for numbers, with recursively defined arithmetic functions. They are natively supported by Coq and are easy to work with. Unfortunately, Coq does not provide builtins to map the nat type to bigints internally, as for example Agda does:

data Nat : Set where
  zero : Nat
  suc : (n : Nat) -> Nat

_+_ : Nat -> Nat -> Nat
zero + m = m
suc n + m = suc (n + m)

_*_ : Nat -> Nat -> Nat
zero * m = zero
suc n * m = m + (n * m)

{-# BUILTIN NATURAL Nat #-}
{-# BUILTIN NATPLUS _+_ #-}
{-# BUILTIN NATTIMES _*_ #-}

Of course, such optimizations enlarge the trusted codebase. Other types like N, which defines binary numbers, are more efficient, but still not as efficient as Haskell's bigints. And while tactics like omega can handle them, they can be tedious to work with. A generalization that might make something similar to builtins possible would be to allow substitution of "compatible" subexpressions in proof trees and to recheck them, for some suitable concept of "compatibility". One possibility would be to use the typeclass mechanism of Coq: instead of using an algebraic datatype nat, we could define our own typeclass that defines methods zero, succ and recur and some axioms, in which we could then embed any definition of natural numbers. But then again, the standard library contains theorems and functions for raw lists and natural numbers only, and one would have to rewrite large parts of it. Another approach might be to use Voevodsky's univalence axiom [100], as considered in homotopy type theory [97], which makes it possible to replace a type by an isomorphic one. To this point, however, there is no elegant way of doing this.

Two practical problems arise from using nat. The first problem is that when Coq encounters a large natural number literal, it crashes. The following line will crash Coq:

Definition A := 65535%nat.

However, the following line will not:

Definition A := (2*2*2*2*2*2*2*2*2*2*2*2*2*2*2*2)%nat.

Our solution uses a technique described in [25]: we define a function that parses a decimal string into a natural number if possible, and raises a type error otherwise:

Definition forceOption A Err (o : option A) (err : Err)
  : match o with | Some _ => A | None => Err end :=
  match o with
  | Some a => a
  | None => err
  end.

Inductive parseError := ParseError.

Fixpoint strToNat (strn : string) : option nat :=
  (fix rc str n :=
     match str with
     | EmptyString => Some n
     | String "0" str => rc str ((10 * n) + 0)
     | String "1" str => rc str ((10 * n) + 1)
     | String "2" str => rc str ((10 * n) + 2)
     | String "3" str => rc str ((10 * n) + 3)
     | String "4" str => rc str ((10 * n) + 4)
     | String "5" str => rc str ((10 * n) + 5)
     | String "6" str => rc str ((10 * n) + 6)
     | String "7" str => rc str ((10 * n) + 7)
     | String "8" str => rc str ((10 * n) + 8)
     | String "9" str => rc str ((10 * n) + 9)
     | _ => None
     end) strn 0.

Definition d (s : string) :=
  forceOption nat parseError (strToNat s) ParseError.

Example e1 : (d”22” : nat) = 22. Proof. reflexivity. Qed.

Example e3 : (d”1x” : parseError) = ParseError. Proof. reflexivity. Qed.

The second problem is that in extracted code, the inefficient representation is reflected. We can, however, give our own interpretations for extracted types via the Extract Inductive command. For example:

Extract Inductive nat => "Prelude.Int" [ "0" "(1 Prelude.+)" ]
  "(let r z s n = case n of { 0 -> z 0 ; q -> s (q Prelude.- 1)} in r)".

However, the arithmetic functions will still be extracted to their recursive definitions, and there does not appear to be a designated way of overloading them. It is possible to abuse another mechanism, which is meant for realizing axioms, to overload our definitions: the Extract Constant command:

Extract Constant plus => ”(Prelude.+)”. Extract Constant mult => ”(Prelude.*)”.

We use several such commands to optimize our extracted code. In most cases, it is just a simple optimization; however, it enlarges the trusted codebase. Newer versions of Coq have a native int31 type. However, it is only "native" in Coq: it is defined as a binary number with 31 bits and some rules, and we would still have to manually add Extract Constant instructions. We think that extracting from nat and trusting the efficient big-integer implementations of the target language is a better design choice.

8.2 Singly-linked Lists

Lists are well understood and simple; therefore we use them in our specification. Our first extracted algorithm used concatenation a lot, which is particularly slow with singly-linked lists. Since we thought that concatenation was a major bottleneck, we tried to optimize this using fast catenable deques [65]. In a purely functional environment like Coq, it is very convenient to use lazy lists for streamed file I/O. With Haskell, it is no problem to load a file into a bit list, because it will lazily build this list, consume this list, consume the output list and write the output list to the output file. When we overloaded lists with catenable deques using Extract Constant, it broke laziness, resulting in the whole file being loaded bitwise into a deque, consuming a lot of memory, while concatenation turned out not to be the major bottleneck.

8.3 Backreferences

The resolution of backreferences turned out to be the major bottleneck in our implementation. Our benchmarks in Section 9.3 show that even though we did not really optimize our implementation much, most of the decompression algorithm performs well, and gets the job done in seconds. In an imperative language, one possible intuitive algorithm uses a 32 KiB array ring buffer, in which it saves the last 32 KiB that have been decompressed:

static class BackRef {
    public int length, distance ;
    public BackRef() {}}

public static ArrayDeque resolve (Deque input, int buflen) {
    ArrayDeque ret = new ArrayDeque() ;
    Character[] buffer = new Character[buflen] ;
    int cptr = 0 ;
    Object c ;
    while ((c = input.pollFirst()) != null) {
        if (c instanceof BackRef) {
            int length = ((BackRef)c).length ;
            int distance = ((BackRef)c).distance ;
            while (length-- > 0) {
                Character q = buffer[ (buflen + cptr - distance) % buflen ] ;
                buffer[cptr] = q ;
                cptr = (cptr + 1) % buflen ;
                ret.addLast(q) ;}}
        else if (c instanceof Character) {
            buffer[cptr] = (Character) c ;
            cptr = (cptr + 1) % buflen ;
            ret.addLast((Character)c) ;}}
    return ret ;}

This algorithm is fast. However, its invariants are rather complicated. Furthermore, verifying imperative algorithms with Coq is still an active research topic. We tried to formalize simple state monads with arrays, but it turned out that this quickly results in a reproduction of separation logic, making it not easier than using some implementation of it directly [26]. In the presence of uniqueness types, one could reproduce the algorithm recursively, using a unique array. Neither Haskell nor Coq has uniqueness types. However, something similar can be done with DiffArrays [10]. We implemented a function that uses DiffArrays and tested it, see Section 8.4. Of course, this uses a trick, and it increases the trusted codebase. Thoughts on how to make this kind of trick typesafe can be found in [98]. Still, having a really purely functional backreference resolver, without using tricks, seemed like a desirable goal.

The naïve way of resolving backreferences in a purely functional way is to save a reverse list of the decompressed data and access it:

resolve :: [Either a (Int, Int)] -> [a] -> [a]
resolve [] _ = []
resolve ((Left b) : r) x = b : resolve r (b : x)
resolve (Right (0, _) : r) x = resolve r x
resolve (Right (l, d) : r) x =
  (x !! (d - 1)) : resolve (Right (l-1, d) : r) ((x !! (d - 1)) : x)

This algorithm has two disadvantages. On the one hand, accessing the n-th element of a list, which it frequently does, takes O(n) time. We can solve this problem using ExpLists instead of lists:

Inductive ExpList (A : Set) : Set :=
| Enil : ExpList A
| Econs1 : A -> ExpList (A * A) -> ExpList A
| Econs2 : A -> A -> ExpList (A * A) -> ExpList A.

The access time for the n-th element is O(log n); moreover, elements with a small distance can be accessed faster, which fits the fact that – in our experience – most backreferences tend to be "near", that is, have small distances.
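To see where the O(log n) bound comes from, here is an unverified Haskell sketch of indexing into such a structure (our own illustration; it assumes that Econs1 x r represents x followed by the pairwise flattening of r, first components before second components). Each recursive call halves the index, so the recursion depth is logarithmic:

data ExpList a = Enil
               | Econs1 a (ExpList (a, a))
               | Econs2 a a (ExpList (a, a))

nthExp :: ExpList a -> Int -> Maybe a
nthExp Enil _ = Nothing
nthExp (Econs1 x _) 0 = Just x
nthExp (Econs1 _ r) n =
  let m = n - 1
  in fmap (\(p, q) -> if even m then p else q) (nthExp r (m `div` 2))
nthExp (Econs2 x _ _) 0 = Just x
nthExp (Econs2 _ y _) 1 = Just y
nthExp (Econs2 _ _ r) n =
  let m = n - 2
  in fmap (\(p, q) -> if even m then p else q) (nthExp r (m `div` 2))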

On the other hand, this algorithm will save a reversed version of all of the decompressed data, so it wastes memory, because it only needs to save 32 KiB. We use another technique which we call Queue of Doom: we save two ExpLists and memorize how many elements are in them. The front ExpList is filled until it contains 32 KiB. If a backreference is resolved and its distance is larger than the amount of bytes saved in the front ExpList, it is looked up in the back ExpList. Now, if the front ExpList is 32 KiB large, the front ExpList becomes the new back ExpList, a new empty front ExpList is allocated, and the former back ExpList is doomed to be eaten by the garbage collector. The following is an illustration of filling such a queue of doom; the ExpLists are denoted as lists, and their size is

– for illustration – only 3:

start   [ ] [ ]
push 1  [1] [ ]
push 2  [2; 1] [ ]
push 3  [3; 2; 1] [ ]
push 4  [4] [3; 2; 1]        [ ] → dropped
push 5  [5; 4] [3; 2; 1]
push 6  [6; 5; 4] [3; 2; 1]
push 7  [7] [6; 5; 4]        [3; 2; 1] → dropped

The advantage of this algorithm is that we have a fully verified implementation in ExpList.v. The disadvantage is that while it does not perform badly, it still does not have satisfactory performance, taking several minutes.
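The push and lookup operations can be sketched as follows (a simplification of ours using plain Haskell lists instead of ExpLists; the window size is the limit field, 3 in the illustration above, 32 KiB in practice):

data QueueOfDoom a = QoD { limit :: Int, front :: [a], back :: [a] }

push :: a -> QueueOfDoom a -> QueueOfDoom a
push x q
  | length (front q) < limit q = q { front = x : front q }
  | otherwise                  = q { front = [x], back = front q }
      -- the old back list is dropped here and becomes garbage

-- distance 1 refers to the most recently pushed element
lookupBack :: Int -> QueueOfDoom a -> Maybe a
lookupBack d q
  | d < 1        = Nothing
  | d <= lf      = Just (front q !! (d - 1))
  | d <= lf + lb = Just (back q !! (d - 1 - lf))
  | otherwise    = Nothing
  where lf = length (front q)
        lb = length (back q)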

8.4 Using DiffArrays

Our main bottleneck is the resolution of backreferences. While in Section 8.5, we show how to do it efficiently in a purely functional manner, this specific problem appears to be inherently imperative: it is extremely easy and fast using a mutable array as a ring buffer. Mutable arrays have the advantage of O(1) access and modification. There are several possibilities to utilize arrays in a purely functional environment, the most popular currently being state monads. A way of easily getting stateful operations is to use adjustable references [99]. In some sense, this appears to be the most natural thing to do, since in the end, every runtime just abstracts away the stateful operations on the actual hardware. We use our own definition of adjustable arrays in Coq. We adapted the code from our purely functional implementation with ExpLists from Section 8.3, rewrote some functions to use the structure in a more linear fashion, and defined the following axioms the structure needs to satisfy (we call the structure DiffStack for reasons that will become apparent later in this chapter):

Axiom DiffStack : Set −> Set.

The structure has a nil value:

Axiom DSNil : forall (A : Set), DiffStack A. Arguments DSNil [_].

The structure can be converted into a list:

Axiom DStoL : forall (A : Set) (ds : DiffStack A), list A * DiffStack A. Arguments DStoL [_].

The structure has a push-operation:

Axiom DSPush : forall (A : Set) (a : A) (ds : DiffStack A), DiffStack A. Arguments DSPush [_].

The structure has an n-th operation:

Axiom DSNth : forall (A : Set) (n : nat) (ds : DiffStack A) (default : A), A * DiffStack A. Arguments DSNth [_].

The nil-, push- and n-th-operations are compatible with the conversion to a list:

Axiom NilNil : forall (A : Set), DStoL (@DSNil A) = ([], DSNil).

Axiom PushLst : forall (A : Set) (a : A) b, fst (DStoL (DSPush a b)) = a :: fst (DStoL b).

Axiom DSNth_nth : forall {A : Set} (x : DiffStack A) (a : A) (n : nat), fst (DSNth n x a) = nth n (fst (DStoL x)) a.

These are the most important operations. They can be realized using a DiffArray and a few simple operations on top. Using these axioms looks like using a linear type, but the operations are just "faked" linear operations, so we add FakeLinear axioms, which state that the returned structure is unchanged:

Axiom DSNthFakeLinear : forall {A : Set} n ds d, snd (@DSNth A n ds d) = ds.

For efficiency, we add an operation that returns a reversed list, and a reset operation. This is optional, but it should save memory. To extract an algorithm, we need to add extraction directives:

Extract Constant DiffStack "q" => "DiffStackT.DiffStack q".
Extract Constant DSPush => "DiffStackT.push".
Extract Constant DSNth => "DiffStackT.nth".
Extract Constant DSNil => "DiffStackT.newDiffStack".
Extract Constant DStoL => "DiffStackT.toList".
Extract Constant DStoR => "DiffStackT.toReverseList".
Extract Constant ResetDS => "DiffStackT.reset".

On the Haskell side, the structure is defined as

data DiffStack a = DiffStack
  { size  :: Int
  , sp    :: Int
  , array :: D.DiffArray Int (Maybe a) }

As an example, we give the definition of the conversion to a list, which is the most complicated function we need in our trusted codebase:

readArray arr index = (arr D.! index, arr)

toList' :: D.DiffArray Int (Maybe a) -> Int -> Int -> [a]
        -> ([a], D.DiffArray Int (Maybe a))
toList' ds n sp l =
  if n < sp
  then case DiffStackT.readArray ds n of
         (Nothing, ds_) -> toList' ds_ (n+1) sp l
         (Just x, ds_) -> toList' ds_ (n+1) sp (x : l)
  else (l, ds)

toList :: DiffStack a -> ([a], DiffStack a)
toList ds =
  let (l, narr) = toList' (array ds) 0 (sp ds) []
  in (l, DiffStack { size = size ds
                   , sp = sp ds
                   , array = narr })
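For completeness, push and nth of the DiffStackT module could look roughly as follows; this is our own sketch under the assumptions that Data.Array.Diff is imported qualified as D and that the array is grown by doubling – the actual module may differ in these details. It matches the axioms above: index 0 of DSNth is the most recently pushed element.

newDiffStack :: DiffStack a
newDiffStack = DiffStack
  { size = 16, sp = 0
  , array = D.listArray (0, 15) (replicate 16 Nothing) }

push :: a -> DiffStack a -> DiffStack a
push x ds
  | sp ds < size ds =
      ds { sp = sp ds + 1, array = array ds D.// [(sp ds, Just x)] }
  | otherwise =
      let newSize = 2 * size ds
          grown   = D.listArray (0, newSize - 1)
                      (D.elems (array ds) ++ replicate (size ds) Nothing)
      in push x ds { size = newSize, array = grown }

-- nth 0 returns the most recently pushed element, as demanded by DSNth_nth
nth :: Int -> DiffStack a -> a -> (a, DiffStack a)
nth n ds dflt
  | n >= 0 && n < sp ds = case array ds D.! (sp ds - 1 - n) of
                            Just x  -> (x, ds)
                            Nothing -> (dflt, ds)
  | otherwise           = (dflt, ds)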

8.5 A Purely Functional, Efficient Backreference Resolver

All approaches so far save the actually produced output in some structure which they read afterwards. However, we can use the fact that our window of backreferences is limited to 32 KiB. The algorithm we present now uses a "look ahead" approach and reads the sequence in advance, memorizing the possible backreferences. It uses priority queues to memorize them in the proper order. We will show this algorithm in multiple steps, where we refine the intermediate algorithms – which might not be efficient at all – until we get a purely functional and efficient algorithm. Due to the lack of time, we did not produce a verified implementation of this algorithm. However, we tested an unverified implementation in our benchmarks, see Section 9.3. The part of the implementation we have so far can be found in DecompressWithPheap.v. However, we will give an informal proof with a Coq implementation in mind. It should be possible to translate it to Coq.

The input of the algorithm is a sequence with backreferences, that is, of type [A + N × N], where the pair of natural numbers encodes a length and a distance. A trivial transformation we can do is to replace a subsequence [(S l, d)] by [(1, d); (l, d)]. Repeating this, and removing possible backreferences of length 0, we can reduce such a sequence to a sequence in which the length of every backreference is 1. We can then just save it in a sequence of type [A + N], only saving the actual distance of each backreference.

Fixpoint BackRefsLengthOne {A : Set} (swbr : SequenceWithBackRefs A) :=
  match swbr with
  | [] => []
  | (inl x :: r) => inl x :: BackRefsLengthOne r
  | (inr (l, d) :: r) => repeat l (inr d) ++ (BackRefsLengthOne r)
  end.

This can be done lazily, thus not requiring additional memory. Notice that after transforming our sequence this way entirely, the input sequence and the output sequence have the same length. In the following, we will regard this transformed sequence as our input, and a sequence with backreferences will be a sequence of type [A + N] rather than [A + N × N] as before. For the sake of simplicity, we will call the elements of A "bytes". Our motivating example becomes (writing ←d for a length-1 backreference with distance d)

an ←2 ←2 ←2 s_b ←8 ←8 ←8 ←8 ←8 ←7 ←7 ←7 t ←2 ←2 ←2
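A lazy Haskell counterpart of BackRefsLengthOne (our own illustration of the "can be done lazily" remark, not extracted code) would be:

backRefsLengthOne :: [Either a (Int, Int)] -> [Either a Int]
backRefsLengthOne []                 = []
backRefsLengthOne (Left x      : r)  = Left x : backRefsLengthOne r
backRefsLengthOne (Right (l, d) : r) = replicate l (Right d) ++ backRefsLengthOne r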

8.5.1 Pairing Heaps

Pairing heaps [16] are purely functional priority queues which are – except for decrease-key – easy to implement, and have good amortized runtime behavior. We do not need the decrease-key operation, so we do not implement it. Our implementation can be found in Pheap.v.

Since we want to be able to temporarily violate invariants, we decided not to define pairing heaps as a dependent data structure, but define external properties about it. The definition of the structure is, besides the declaration of implicit arguments, the same as in Wikipedia [16].

Inductive pheap A : Type :=
| Empty : pheap A
| Heap : A -> list (pheap A) -> pheap A.

Arguments Empty [_]. Arguments Heap [_] _ _.

We then define some set-theoretic predicates pheap_in, pheap_subseteq, pheap_subsetneq. Since both Heap 1 [Heap 2 [], Heap 3 []] and Heap 1 [Heap 2 [Heap 3 []]] contain the same elements {1, 2, 3}, but are clearly not equal, we define an extensional equality

Inductive pheap_in {A} : A -> pheap A -> Prop :=
| H_in : forall a l, pheap_in a (Heap a l)
| L_in : forall a b h l, In h l -> pheap_in a h -> pheap_in a (Heap b l).

Definition pheap_subseteq {A} a b := forall (x : A), pheap_in x a −> pheap_in x b.

Definition pheap_subsetneq {A} a b := @pheap_subseteq A a b /\ exists x, pheap_in x b /\ ~ pheap_in x a.

Definition pheap_ext_eq {A} a b := @pheap_subseteq A a b /\ @pheap_subseteq A b a.

which states that both heaps contain the same elements. The main invariant for pairing heaps is defined as

Inductive pheap_correct {A} (cmp : A -> A -> bool) : pheap A -> Prop :=
| E_correct : pheap_correct cmp Empty
| H_correct : forall b l,
    Forall (pheap_correct cmp) l ->
    (forall c, pheap_in c (Heap b l) -> cmp b c = true) ->
    pheap_correct cmp (Heap b l).

which states that all subheaps are correct, and that the first element is smaller, with respect to cmp, than the later elements. Of course, at this point, we do not have any constraints on cmp. We define this externally, and only use it when we need it:

Definition cmp_ordering {A} (cmp : A -> A -> bool) :=
  (forall a, cmp a a = true) /\
  (forall a b c, cmp a b = true -> cmp b c = true -> cmp a c = true) /\
  (forall a b, cmp a b = true -> cmp b a = true -> a = b) /\
  (forall a b, cmp a b = true \/ cmp b a = true).

We implement the find_min function and prove its specification:

Lemma find_min_spec : forall {A} (b : A) cmp h,
    cmp_ordering cmp ->
    pheap_correct cmp h ->
    pheap_in b h ->
    exists a, Some a = find_min h /\ cmp a b = true.

Similarly for merge

Lemma merge_spec : forall {A} cmp (b : A) g h,
    (pheap_in b g \/ pheap_in b h) <-> pheap_in b (merge cmp g h).

and for insert

Lemma insert_spec : forall {A} (cmp : A -> A -> bool) (a b : A) (h : pheap A),
    cmp_ordering cmp ->
    (pheap_in a (insert cmp b h) <-> (a = b \/ pheap_in a h)).

For delete_min, the specifications are a little more complicated, as we need an additional helper function merge_pairs. We will not discuss it here. We then define a function pheap_num which counts the number of elements of a heap, and prove several properties about it, like subadditivity for merging. We need this to apply well-founded recursion over the number of elements in a heap, which we use in our algorithms and lemmata.
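For readers more familiar with Haskell, the following is a sketch of the same pairing-heap operations, following the textbook definition [16]. The names are ours and the code is only illustrative; it does not correspond line by line to the Coq development in Pheap.v.

data PHeap a = Empty | Heap a [PHeap a]

merge :: Ord a => PHeap a -> PHeap a -> PHeap a
merge Empty h = h
merge h Empty = h
merge g@(Heap x gs) h@(Heap y hs)
  | x <= y    = Heap x (h : gs)   -- the smaller root stays on top
  | otherwise = Heap y (g : hs)

insert :: Ord a => a -> PHeap a -> PHeap a
insert x = merge (Heap x [])

findMin :: PHeap a -> Maybe a
findMin Empty      = Nothing
findMin (Heap x _) = Just x

-- Pairwise merging of the subheaps; this is the helper (merge_pairs in
-- the Coq code) that makes delete-min efficient in the amortized sense.
mergePairs :: Ord a => [PHeap a] -> PHeap a
mergePairs []           = Empty
mergePairs [h]          = h
mergePairs (g : h : hs) = merge (merge g h) (mergePairs hs)

deleteMin :: Ord a => PHeap a -> Maybe (PHeap a)
deleteMin Empty       = Nothing
deleteMin (Heap _ hs) = Just (mergePairs hs)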

8.5.2 General Idea

We will first introduce a simpler algorithm to motivate the ideas. The most important idea of the algorithm is to work with absolute positions of characters in the list, rather than relative positions, that is, regarding the input- and output-list as functions N → A. This way, we can talk about a specific backreference which has a unique absolute source and destination. Obviously, if a backreference ←d is at position n, then n is its destination, and n − d is its source position.

As a first step of our simpler algorithm, we collect all the backreferences in our input stream into a list of source-destination-pairs, and sort this list lexicographically:

collect :: SWBR1 a -> [(Int, Int)]
collect s = sort [ (n - d, n) | (Backref_ d, n) <- zip s [0..] ]

Notice that now, the backreferences are sorted according to their source position. We can resolve the backreferences from such a list with the following algorithm: Let m be some generic map structure, initially empty. The current absolute position in the input list is saved in a variable n, initially 0.

1. If the sorted backreference list is not empty, remove its first element and store it as (s, d). Otherwise, proceed at step 4.

2. If s ≠ n, proceed at step 4.

3. If s = n, there is a backreference to the current position n. Peek an element from the input.

3a. If it is Char_ c, then set m[d] = c, and recur at step 1.
3b. If it is Backref_ _, then set m[d] = m[n], and recur at step 1.

4. Read an element from the input.

4a. If we are at the end of the input, end.
4b. If we read a character Char_ c, write c to the output.
4c. If we read a backreference Backref_ _, write m[n] to the output.

5. Set n = n + 1 and recur at step 1.

In Haskell, we could implement it in the following way:

resolve_ :: [BR_ a] -> [(Int, Int)] -> Int -> Map Int a -> [a]
resolve_ l r n m =
  let res l r n m = case l of
        []                -> []
        (Char_ c : l')    -> c : resolve_ l' r (n + 1) m
        (Backref_ _ : l') -> (m ! n) : resolve_ l' r (n + 1) m
  in case r of
       [] -> res l r n m
       ((s, d) : r') ->
         if s == n
         then case l of
                (Char_ c : l')    -> resolve_ l r' n (insert d c m)
                (Backref_ _ : l') -> resolve_ l r' n (insert d (m ! n) m)
         else res l r n m

As can be seen in the highlighted parts of the code, we only ever use the table to look up the current position. We never look at anything smaller than the current position afterwards. Therefore, the structure of choice is a priority queue of destination-character-pairs sorted according to their destination:

resolve_ :: Ord a => [BR_ a] -> [(Int, Int)] -> Int -> MinQueue (Int, a) -> [a]
resolve_ l r n m =
  let res l r n m = case l of
        []                -> []
        (Char_ c : l')    -> c : resolve_ l' r (n + 1) m
        (Backref_ _ : l') -> let (_, nm) = findMin m
                             in nm : resolve_ l' r (n + 1) (deleteMin m)
  in case r of
       [] -> res l r n m
       ((s, d) : r') ->
         if s == n
         then case l of
                (Char_ c : l')    -> resolve_ l r' n (insert (d, c) m)
                (Backref_ _ : l') -> let (_, nm) = findMin m
                                     in resolve_ l r' n (insert (d, nm) m)
         else res l r n m

We sorted the list of pairs in advance. We could as well replace it by a priority queue which returns the pairs in the right order:

resolve_ l r n m =
  let res l r n m = case l of
        []                -> []
        (Char_ c : l')    -> c : resolve_ l' r (n + 1) m
        (Backref_ _ : l') -> let (_, nm) = findMin m
                             in nm : resolve_ l' r (n + 1) (deleteMin m)
  in case minView r of
       Nothing -> res l r n m
       Just ((s, d), r') ->
         if s == n
         then case l of
                (Char_ c : l')    -> resolve_ l r' n (insert (d, c) m)
                (Backref_ _ : l') -> let (_, nm) = findMin m
                                     in resolve_ l r' n (insert (d, nm) m)
         else res l r n m

In the highlighted pair, notice that d − s ≤ 32768. Since only pairs with the current source position are needed, it is sufficient to read 32768 input characters in advance and make sure that all references are in.

8.5.3 A Formal Proof

This is a proof which is given with the actual verified algorithm in mind, that is, a proof that is given in a way that should be reproducible in Coq. Due to the lack of time, we did not manage to implement it in Coq entirely. Let us call the front position m, and the back position n. We call the two queues b and c, and, abusing notation, we also write b and c for their sets of elements (so it will be easier to talk about them informally). Let i be the input sequence with backreferences of length 1, and r the result with the resolved backreferences. As we proved strong uniqueness of our relation, r is unique, so this notion is well-defined. We write o2 ↦ o1 to express that i !! o2 is a backreference to o1, where o1, o2 are absolute positions in i. Firstly, we have the following two invariants for our priority queues:

b = {(o2, r !! o1) | o1 < n ≤ o2 ∧ o2 ↦ o1} (8.1)

c = {(o1, o2) | n ≤ o1 < o2 < m ∧ o2 ↦ o1} (8.2)

We furthermore want that n and m have at least distance D if possible:

m − n ≥ D ∨ n = 0 ∨ m = len i (8.3)

The algorithm has essentially three phases:

• The start phase, where n = 0 but m − n < D and m ≠ len i.

• The interim phase, where m ≠ len i, and m − n ≥ D.

• The end phase, where m = len i.

It is a further invariant that:

we will always be in one of these phases (8.4)

Notice that if the data size is ≤ D, then there will be no interim phase. We have intermediate lists l_m and l_n, which are the respective truncations of i, with the invariants

l_m = drop_m i (8.5)

l_n = drop_n i (8.6)

where

drop_k l = l              for k = 0
drop_k l = []             for l = []
drop_k l = drop_{k'} l'   for k = k' + 1 and l = _ :: l'
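For intuition, drop_k is just the usual drop function on lists; a small Haskell sketch (the name dropK is ours):

dropK :: Int -> [a] -> [a]
dropK 0 l        = l
dropK _ []       = []
dropK k (_ : l') = dropK (k - 1) l'
-- dropK behaves like Prelude's drop; in particular it is total even
-- when k exceeds the length of the list.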

Therefore, i !! n = l_n !! 0 and i !! m = l_m !! 0, which means that we can easily destructure l_m and l_n at every step. The state is entirely described by the tuple (n, m, l_n, l_m, b, c). In the beginning, n = 0 and m = 0, l_m = l_n = i, and the priority queues are empty. The invariants are trivially satisfied at this point, and we are in the start phase.

As a first subprocedure m_inc, we show that given a tuple (n, m, l_n, l_m, b, c), if l_m = x :: l_{m+1}, and all the invariants hold, we can increase m consistently, and modify the other values appropriately, such that the resulting tuple (n, m + 1, l_n, l_{m+1}, b, c′) still satisfies all the invariants.

1. If l_m = inl _ :: l_{m+1}, we can just return the tuple (n, m + 1, l_n, l_{m+1}, b, c). Invariant (8.2) is preserved: all former elements are still in c, and the only possible additional element is not a backreference. Invariant (8.3) is also preserved, since m − n ≥ D implies m + 1 − n ≥ D, and m = len i cannot hold if l_m = inl _ :: l_{m+1}. The other invariants do not talk about m and can therefore not be violated.

2. If l_m = inr d :: l_{m+1}, we first check whether d > m.
(a) If so, this is a format error: a backreference that goes too far back.
(b) Otherwise, we have to add (m − d, m) to c. Invariant (8.2) is preserved: all former elements are still in c, and there can only be one additional element, which we just added. As in the first case, replacing m by m + 1 will not violate any invariant.

We proved this part in DecompressWithPheap.v. Notice that if initially m − n ≥ D, then after m_inc we have the strict inequality m − n > D.
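The following Haskell sketch shows what m_inc could look like, with the priority queues instantiated by a MinQueue (as provided, e.g., by the pqueue package) and the length-one backreference type BR_ from the sketch above. All names are ours; the verified version in DecompressWithPheap.v differs in its details.

import qualified Data.PQueue.Min as Q

-- The state (n, m, l_n, l_m, b, c) described above.
type State a = ( Int, Int, [BR_ a], [BR_ a]
               , Q.MinQueue (Int, a)      -- b: (destination, resolved byte)
               , Q.MinQueue (Int, Int) )  -- c: (source, destination)

-- m_inc: consume the head of l_m and increase m; a backreference is
-- recorded in c.  Nothing signals a format error or an exhausted l_m.
mInc :: State a -> Maybe (State a)
mInc (n, m, ln, Char_ _    : lm', b, c) = Just (n, m + 1, ln, lm', b, c)
mInc (n, m, ln, Backref_ d : lm', b, c)
  | d > m     = Nothing                            -- reference before the start of the data
  | otherwise = Just (n, m + 1, ln, lm', b, Q.insert (m - d, m) c)
mInc _ = Nothing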

We need this for the second subprocedure n_inc: we show that if (m − n > D or m = len i) and n ≠ len i, and all the invariants hold, we can increase n consistently, and the resulting tuple will also satisfy all invariants. Trivially, l_n ≠ [ ], therefore l_n = x :: l_{n+1}.

1. If x = inl c, we know that r !! n = c.

2. If x = inr d, by (8.1), we know that (n, r !! n) ∈ b, and furthermore, this must be the lexicographically smallest element in b, because there can only be one element with n in the first component, and there can be no pair with first element < n due to (8.1). Therefore, we can use find-min and delete-min on b to find this element. We now know r !! n. Invariant (8.1) may be violated for b′ and n + 1, but we restore it in the next step.

We now need to determine all elements of c which have n as their first component. In this intermediate step, invariant (8.2) will temporarily be violated. Instead, the following invariant holds:

o2 ↦ n → (o2, r !! n) ∈ b ∨ (n, o2) ∈ c (8.7)

In the beginning, the right side of the disjunction will always hold. We check with find-min for the minimal element of c and whether it has n as its first component. As long as this is true, we use delete-min, and add (o2, r !! n) to b. At every step, (8.7) will hold. At every step, c gets smaller; therefore, at some point, this algorithm terminates. Since then (n, o2) ∈ c cannot hold anymore for any o2, we know that for every o2 with o2 ↦ n, (o2, r !! n) ∈ b. Therefore, our new b satisfies the invariant (8.1) for n + 1. Our new c also trivially satisfies the invariant (8.2) for n + 1. Also trivially, after applying this, either m = len i, or m − n ≥ D, since before we had m − n > D and n increased by one, so invariant (8.3) holds. Now that we have m_inc and n_inc, we can write the actual algorithm:

• If m = len i

– If n < m, we are in the end phase, and apply n_inc until n = m.
– If n = m, we are done.

• If m ≠ len i

– If m − n < D, then we are in the start phase. Hence, we know that n = 0 by (8.4). We apply m_inc repeatedly, until m − n ≥ D or m = len i.

– If m − n ≥ D, we are in the interim phase. We want to increase both m and n. Since m ≥ n, we know that both l_n = x_n :: l_{n+1} and l_m = x_m :: l_{m+1}. We first apply m_inc, and then n_inc. As (m + 1) − (n + 1) = m − n ≥ D, this is allowed. We do this until m = len i.
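Continuing the sketch from above, n_inc and the driving loop over the three phases could look as follows in Haskell. Again this is only an illustration under the same assumptions (dWin stands for the window size D, lenI for len i; all names are ours), not the Coq development.

-- n_inc: determine r !! n, move all backreferences pointing to n from c
-- into b (restoring invariant (8.1) for n + 1), and advance n.
nInc :: Ord a => State a -> Maybe (a, State a)
nInc (n, m, x : ln', lm, b, c) = do
  (v, b1) <- case x of
    Char_ ch   -> Just (ch, b)
    Backref_ _ -> do ((_, v), b') <- Q.minView b   -- the minimum must be (n, r !! n)
                     Just (v, b')
  let (b2, c2) = promote v b1 c
  Just (v, (n + 1, m, ln', lm, b2, c2))
  where
    -- move every pair (n, o2) from c into b as (o2, v)
    promote v bq cq = case Q.minView cq of
      Just ((s, o2), cq') | s == n -> promote v (Q.insert (o2, v) bq) cq'
      _                            -> (bq, cq)
nInc _ = Nothing

-- The driver: start phase, interim phase and end phase.  For simplicity
-- this sketch builds the whole output in the Maybe monad instead of
-- streaming it.
resolve :: Ord a => Int -> Int -> State a -> Maybe [a]
resolve dWin lenI st@(n, m, _, _, _, _)
  | m == lenI && n == m = Just []                          -- done
  | m == lenI           = step st                          -- end phase
  | m - n < dWin        = mInc st >>= resolve dWin lenI    -- start phase
  | otherwise           = mInc st >>= step                 -- interim phase
  where
    step s = do (x, s') <- nInc s
                rest    <- resolve dWin lenI s'
                Just (x : rest)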

Chapter 9

Extraction and Testing

We depend on some specific Haskell libraries and on the specific Coq version 8.6. To make our results reproducible in the future, we decided to provide a Dockerfile which only depends on Debian Stretch packages. We expect the Debian Stretch repositories to be archived in the future, as current oldstable versions are, and therefore our work should remain runnable even when the software is deprecated.

9.1 Extraction

We chose to name our project “DampFnudeL”. You can download the source from the GitHub repository https://github.com/dasuxullebt/DampFnudeL. This work is about revision 6b04 96d0 ed5a f231 99bf 0a03 e04d bb9b b037 3873, which you can download via

git clone https://github.com/dasuxullebt/DampFnudeL
cd DampFnudeL
git reset --hard 6b0496d0ed5af23199bf0a03e04dbb9bb0373873

Notice that it depends on CpdtTactics [38]. In Extraction.v, we define some constants for extraction, like mapping Coq strings to Haskell strings. We furthermore overload arithmetic operations with Haskell's native operations, which are part of the trusted codebase. Instead of inlining, we wrote a Haskell module Extraction, which can be found in Extraction.hs and defines functions that we use. We define a function DeflateTest that essentially uses our strong decidability proof to decide whether a given dataset can be decompressed. It is then called with the given files and command line arguments by the main method of the respective Haskell module. To make it easier to reproduce our benchmarks, as well as to install the several dependencies, we define a Docker container which is based on Debian Stretch.


9.1.1 Compatibility

Originally, we aimed to be compatible with a range of Coq versions. However, Coq changes its library names and syntax throughout the versions we would like to support. For example, version 8.7 requires a package FunInd to be imported to use functional induction, but this package is not recognized by earlier versions. One way of coping with this is given in [39]: using the eRuby template engine, it is possible to match on the current Coq version, and insert code depending on it. We tried this, but as the tool support is not yet there, we decided to only support Coq version 8.6.1 for now.

9.1.2 Makefile

We require GNU Make for building. We define tests with names of the form NoRBR-alice29.txt, which contain the test name and the name of the file. The build process involves compiling the Coq files, for which we wrote a simple Ruby script that detects the dependencies of our Coq files and generates a file coqfiles.mak, which we include in our Makefile. We had to manually delete the entries that actually create extracted Haskell files, and put them in the main makefile. Then, the extracted Haskell files are patched with AWK, so they include our Extraction package and several dependencies. This is necessary, since the program extraction facility of Coq does not allow adding dependencies directly.

9.2 Testing Unverified Algorithms

Since writing a verified algorithm is usually a lot harder than writing an algorithm without verification, it is desirable to know whether the approach of that algorithm is worthwhile. Using our modular approach, it was possible to test different algorithms before actually verifying them. The bottleneck of our current implementation, namely the resolution of backreferences, was a natural candidate for applying this technique. In Section 9.3.4, we will discuss an example.

9.3 Benchmarks

While we verified the correctness of our program, we did not verify its space and time requirements. To get exact boundaries, we would require a clear order of evaluation, and the speed of all elementary evaluations, while the asymptotic behavior of the algorithms is clear in most cases. There are approaches to get such boundaries automatically, like Resource Aware ML [58]. We chose to test the requirements directly.

Since decompression usually means doing the same thing very often, there is lots of potential for optimization at almost every level. We followed the usual top-down approach and profiled our implementation, and it turned out that the resolution of backreferences (see Section 7.4) was the major bottleneck, lifting the time of execution from seconds to minutes (see Section 9.3). We did not optimize for space yet, and our implementation uses lots of memory, but in Section 6.4, we suggest how to lower the space requirements (we mainly use stack space because of exception propagation). We wrote a tool that uses time(1) to measure the runtime and memory consumption and save it to an SQLite database. We use the Canterbury Corpus [5]. As most of our examples are about decompression, we compress it with gzip beforehand, except for our compression algorithm. Some common functions, like converting the internal representation of bytes to Haskell's bytes, can be found in DecompressHelper.hs. The benchmarks have been made on an Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz with 2 cores and 2 threads per core.

9.3.1 No Backreferences

To test the bottleneck of our implementation, we removed several parts and tested how fast the rest is. Leaving out the resolution of backreferences shows that this is actually the part that takes most time; while the rest is still several orders of magnitude slower than the ZLib, it only takes seconds, which is satisfactory at this point.

Memory (KiB)   Time (s)   File
62900          0.96       alice29.txt
53224          0.96       asyoulik.txt
17540          0.17       cp.html
11332          0.1        fields.c
9268           0.06       grammar.lsp
273524         5.93       kennedy.xls
168032         2.94       lcet10.txt
210048         4.09       plrabn12.txt
81048          1.66       ptt5
21560          0.4        sum
10220          0.05       xargs.1

9.3.2 With ExpLists

These are the benchmarks for the implementation we described in Section 8.3, which uses exponential lists. It is comparably slow. The memory footprint is, however, comparable to the other implementations. Still, this is a purely functional and fully verified decompression mechanism that does not use any additional axioms whatsoever.

Memory (KiB)   Time (s)   File
93276          727.25     alice29.txt
76952          628.5      asyoulik.txt
18488          86.73      cp.html
11368          4.22       fields.c
9268           0.27       grammar.lsp
590924         3540.27    kennedy.xls
251992         1958.08    lcet10.txt
284760         2177.95    plrabn12.txt
296036         2241.7     ptt5
24672          177.54     sum
10352          0.42       xargs.1

9.3.3 With DiffArrays

Our implementation with DiffArrays, as described in Section 8.4, takes more memory than expected. We assume that some optimization process destroys linearity, and makes it necessary to copy the array at some point. However, the runtime is satisfactory.

Memory (KiB)   Time (s)   File
62936          1.69       alice29.txt
57812          1.52       asyoulik.txt
17868          0.31       cp.html
12756          0.15       fields.c
9232           0.08       grammar.lsp
422320         9.96       kennedy.xls
166312         4.65       lcet10.txt
210352         6.12       plrabn12.txt
186848         3.96       ptt5
24912          0.63       sum
9644           0.09       xargs.1

9.3.4 Unverified Functional Resolver

Here, we used the algorithm described in Section 8.5, but did not verify it. This implementation can compete with the former implementation with DiffArrays regarding runtime, but the memory footprint is much larger.

Memory (KiB)   Time (s)   File
149652         4.05       alice29.txt
122012         3.43       asyoulik.txt
28796          0.56       cp.html
14448          0.25       fields.c
9348           0.11       grammar.lsp
949392         17.42      kennedy.xls
423100         10.72      lcet10.txt
474220         12.78      plrabn12.txt
476308         7.5        ptt5
42136          0.85       sum
10272          0.14       xargs.1

9.3.5 Compression

For compression, we wrote a helper function gzclad which adds a GZip header and the required CRC32 checksum to the compressed deflate stream. Our compression algorithm is purely functional, and only finds repetitions for possible backreferences, but does not utilize dynamic Huffman codings. In the following table, the memory sizes are given in KiB:

File           Original   Compressed   Time (s)   Memory
alice29.txt    152        116          458.91     187640
asyoulik.txt   124        96           311.93     150748
cp.html        28         16           9.1        40124
fields.c       12         8            1.76       21592
grammar.lsp    4          4            0.31       10316
kennedy.xls    1008       432          4483.02    1004784
lcet10.txt     420        320          2478.75    479324
plrabn12.txt   472        384          3045.61    568528
ptt5           504        264          3444.06    612544
sum            40         24           21.71      51384
xargs.1        8          4            0.28       10404

The best compression ratio is achieved for kennedy.xls, since this file contains lots of repetitions. However, this comes at the cost of memory and runtime. Though the runtime and memory costs of our compression program are not good, they are acceptable, since we did not optimize the compression algorithm yet, and it only uses purely functional structures. In this thesis, we focused on decompression; writing a better compression algorithm is further work.

9.4 Building and Running

In this section we describe how to build the Docker container which contains our project, and how to run the benchmarks. Another thing we had to test was whether our specification is conformant with what other implementations understand as “Deflate” – the specification is axiomatic, and while it definitely describes a data compression format, there is no way of formally verifying that it really conforms to the informal specification. We therefore also describe how to test our algorithms with real-life data, using bind mounts. We hope that in the future, it will still be possible to build the Docker container this way. However, adaptations might become necessary which we cannot foresee at the time of writing. To build the Docker container, cd into the software directory, and run

$ docker build .

This should result in a prompt indicating a successfully built container, which has an ID that might be different from the ID we use here, so just substitute the ID from your output prompt:

Successfully built 66ef0ea29d8d

We assume that we have a directory /tmp/gztest in which there is a file decomp.gz and a file comp.txt which we want to decompress and compress, respectively. We bind-mount this directory to /mnt, and start the container with the argument -it to get a command prompt:

$ docker run -v /tmp/gztest:/mnt -it 66ef0ea29d8d

The contents of our project are in the directory /var/deflate, and the benchmark process can be run using make Benchmarks; you can pass -j $(nproc) to utilize multiple cores.

# cd /var/deflate
# make -j $(nproc) Benchmarks

The benchmarks will take a while. The command builds the executables from our code, runs the benchmarks on the Canterbury Corpus, and generates an SQLite file benchs.db which can be accessed using SQLite3:

# sqlite3 benchs.db
sqlite> .mode column
sqlite> .header on
sqlite> select max_mem_ki_bs,runtime_secs,arguments from benchmark
   ...> where command = './CompressMain';

This gives an overview of the benchmarks for the compression, for example. Notice that the argument strings start with “s”, which is a type indicator for the Haskell database framework.

To decompress the file we bind-mounted using the DiffArray method, use

# ./DiffStackMain /mnt/decomp.gz /mnt/decomp

To compress, use

# ./CompressMain /mnt/comp.txt /mnt/comp.txt.gz

Chapter 10

Conclusion

Our contribution is a complete mathematical formalization of Deflate. We formalized the proofs in Coq, such that an implementation of a decompression algorithm in Haskell can be extracted. We tested this implementation against some datasets, and observed that it is compatible with other implementations of Deflate.

10.1 Further Work

10.1.1 Streamable Strong Decidability

As we showed in Section 6.4, it should be possible to significantly decrease memory consumption by switching from strong decidability to streamable strong decidability. This seems to be a worthwhile next step for making the project more useful.

10.1.2 Fast Compression

The compression algorithm is still slow. We hope to be able to make it faster with the knowledge we gained by optimizing the other algorithms. Replacing the AVL-based hash table that our implementation uses as a heuristic for finding backreferences with a DiffArray-based hash table will probably bring a considerable speedup.

10.1.3 Trusted Codebase

The trusted codebase of our project is still large. In this section, we will discuss to what extent we could reduce it, and how this could be achieved. Of course, at this point, it is inevitable to trust the underlying hardware and operating system, so we will focus on things above that level.


Coq We have to trust Coq itself. Coq is an LCF-style theorem prover, meaning that we do not have to trust its tactics, but only the kernel which checks proofs. This kernel is a short Ocaml program, but it might still contain bugs. In one case, a bug was introduced by an optimization, and it was possible to derive False [44]. However, this bug required explicit crafting. Of the proofs we give, only trivial parts are computer generated, and the main ideas are thought through. Hence, it seems extremely unlikely that such a bug would invalidate the derived guarantees.

Another danger comes from the theory itself, which might be inconsistent. However, we only use a small fragment of the actual type universe, and our relations and proofs should be portable into any other commonly used theory. Hence, this part of the codebase is probably the most trustworthy we will work with. It is not possible to remove this part of the trusted codebase. There are efforts like the Coq in Coq project [29] that try to verify Coq inside itself. However, this is not entirely possible, which follows from Gödel's incompleteness theorem.

Extraction Besides Coq itself, we trust the program extraction mechanism, which is itself not verified. It might be possible that extracted programs do not reflect the given proofs. To prove that they do, we would need a semantics of the language we extract to. There are efforts for a verified extraction mechanism [62], but these are outside the scope of this work. Program extraction is – for the largest part – straightforward, and we do not believe that there are major bugs that could invalidate our results.

Ocaml As Coq is written in Ocaml, we have to trust the Ocaml compiler and runtime. Ocaml is a widely used, industrial-quality programming language, and the Coq proof checker's code only uses simple constructs that every other sophisticated program uses; hence, it is trustworthy in the sense that all relevant bugs have probably been found. There are projects like CakeML [4] which try to cope with this problem, but at the time we wrote our project, these were not developed far enough to be used.

GHC, big integers Finally, we trust the Glasgow Haskell Compiler. This is probably the most critical part, as GHC does a lot of optimization and has a sophisticated runtime. We rely on its big integer implementation. This dependency could be removed by rewriting parts of the implementation using Coq's binary integers or native integers. However, we do not think that this is worthwhile, since big integers and the parts we use – mostly addition and comparison – are not complicated or hard to implement.

We could at least replace these parts with Ocaml, so we would not have to trust two programming languages. However, we also rely on lazy evaluation.

Lazy Evaluation As we explicitly rely on lazy evaluation for I/O, we cannot just switch to Ocaml, which uses eager evaluation. One suitable alternative are iteratees and enumeratees as defined in [68], which would also work in Ocaml. However, inside Coq, there is no ready-to-use library for them. At this point, we do not have time to implement one ourselves.

Memory We trust the automatic memory management of GHC. For example, our queues of doom from Section 8.3 require it. In a purely functional setting, it seems impossible not to rely on it.

10.1.4 Imperative Implementation

In Section 1 we already pointed out an easy way to get a verified compression algorithm from an unverified compression algorithm. While we also pointed out why this is not the way to go, it is a benchmark that an implementation should always keep in mind. Though our implementation behaves well and we could get the runtime behavior down to a useful level, it is still slower than this approach applied to the Zlib would be, and it is probably possible to optimize the code even further in a purely functional fashion, which would be interesting for its own sake.

However, for improvements in that direction, one should note that the Deflate standard was probably designed with imperative low-level programming techniques rather than functional programming in mind. To give an imperative algorithm and verify it in Coq, there are several possibilities. One is the technique of using a Hoare state monad [92], and axiomatizing Haskell's ST monads or using the stateful operations from Ocaml. This is probably the least intrusive technique, but the trusted codebase will remain large. The same goes for Ynot [22], which is a framework for simple imperative programs, which are then extracted to Ocaml.

The usual tool being used for the desired level of efficiency is the C programming language. And in fact, it has a complete toolchain which is already verified, the CompCert compiler, as well as the Verified Software Toolchain [26], everything already formalized in Coq. It should therefore be possible to use our specification to verify a C implementation of Deflate. Another system is Frama-C [12] with its Jessie plugin that allows the use of the ANSI/ISO C Specification Language (ACSL). It can utilize many automatic provers and proof assistants, including Coq. However, it is not formalized in Coq itself.

While these tools are still quite complicated, they make the implementation of a fast verified ZLib replacement realistic, especially since it has to be programmed only once, and could then be used everywhere. Such an implementation would be both of practical use, and it would show the strengths and weaknesses of the said tools, and could lead to an improvement of those. As a more general approach, one could use the Verified Software Toolchain to write a formally verified program extraction mechanism to C.

10.1.5 Usage In Other Projects

After the aforementioned natural next steps, which would yield a complete implementation of a verified compression and decompression tool with competitive runtime and memory requirements, we could furthermore think of embedding it into other projects, as Deflate is used in many other standards; examples are the Quark browser [63] or MiTLS [32].

10.1.6 Other System Components

Of course, besides optimizing this specific project, there are other middleware system components that can be verified in a similar manner. Firstly, there are the many other compression formats like BZip2, Zip and XZ. Then, there are archive formats like tar and zip. Several formats that use Deflate internally could be verified too; for example, the PNG format encodes graphics. Formats like PNG and FLAC provide lossless compression of graphics and sound, which means that one should be able to give a relation between a pixel matrix or sample sequence and the data stream. An interesting further problem would be lossy compression schemes, where one would have to find appropriate relaxations of this strict scheme. Furthermore, one could think of a verified parser for more high-level formats like XML, which allows for validation, and on which many other formats are based. Then, similarly to lossy compression, tag soup parsers, which are not exact, but relaxed, are an interesting problem for verification.

Then, there are several protocols that can be verified, like SMTP, HTTP, XMPP. Also, a verified content management system would be useful, which guarantees that only authenticated persons can access the respective content. In that context, a verified implementation of the CommonMark standard [8] sounds interesting. A verified revision control system which guarantees not to lose old versions and gives some guarantees for merging also sounds like a reasonable goal.

10.2 Lessons Learned

While we tried to use low-level direct proving in the beginning, with as few tactics as possible, we quickly learned that the use of tactics improves the readability of proofs and often makes them self-documenting. Documentation of the parts that are less clear helps, especially in cases where Coq version changes break compatibility. While program extraction was a very good way of prototyping, this special kind of problem appears to be better solved imperatively, especially the resolution of backreferences. Our successful use of DiffArrays shows that linear types would be a nice feature that would enable us to do such a thing directly and without tricks.

Bibliography

[1] Agda, http://wiki.portal.chalmers.se/agda/pmwiki., accessed: 2017-11-17

[2] Alte und neue Blindenschriften, http://www.fakoo.de/blindenschriften.html#zeitleiste, accessed: 2017-11-15

[3] Brailleschrift, https://de.wikipedia.org/wiki/Brailleschrift, accessed: 2017-11-15

[4] Cakeml – a verified implementation of ml, https://cakeml.org, accessed: 2017-11-17

[5] The canterbury corpus, http://corpus.canterbury.ac.nz/g, accessed: 2017-11-17

[6] Certicrypt: Computer-aided cryptographic proofs in coq, http://certicrypt.gforge.inria.fr/

[7] Chicken scheme, http://www.call-cc.org/, accessed: 2017-11-17

[8] Commonmark – a strongly defined, highly compatible specification of markdown, http://commonmark.org/, accessed: 2017-11-27

[9] Data structure is extracted twice, https://coq.inria.fr/bugs/show_bug.cgi?id=5754, accessed: 2017-11-17

[10] The diffarray package, https://hackage.haskell.org/package/diffarray, accessed: 2017-11-17

[11] Emscripten, https://en.wikipedia.org/wiki/Emscripten, accessed: 2017-11-17

[12] Frama C, http://frama-c.com/, accessed: 2017-11-17

[13] Idris | a language with dependent types, http://www.idris-lang.org/, accessed: 2017-11-17

[14] Jeffrey Mark Siskind's software, https://engineering.purdue.edu/~qobi/software.html, accessed: 2017-11-17


[15] Morse code, https://en.wikipedia.org/wiki/Morse_code, accessed: 2017-11-15

[16] Pairing heap, https://en.wikipedia.org/wiki/Pairing_heap, accessed: 2017-11-17

[17] The reference implementation of the linux fuse (filesystem in userspace) interface, https://github.com/libfuse/libfuse, accessed: 2017-11-17

[18] Three examples of problems with lazy i/o, http://newartisans.com/2013/05/three-examples-of-problems-with-lazy-io/, accessed: 2017-11-17

[19] Using the ring solver, http://wiki.portal.chalmers.se/agda/?n=Libraries.UsingTheRingSolver, accessed: 2017-11-17

[20] Vellvm: Verifying the llvm, http://www.cis.upenn.edu/~stevez/vellvm/, accessed: 2017-11-17

[21] Welcome to native client, https://developer.chrome.com/native-client, accessed: 2017-11-17

[22] The Ynot project, http://ynot.cs.harvard.edu/, accessed: 2017-11-17

[23] Ackermann, W.: Zum Hilbertschen Aufbau der reellen Zahlen. Mathematische Annalen 99(1), 118–133 (1928)

[24] Affeldt, R., Hagiwara, M., Sénizergues, J.: Formalization of Shannon's theorems. Journal of Automated Reasoning 53(1), 63–103 (2014), http://dx.doi.org/10.1007/s10817-013-9298-1

[25] de Amorim, A.A.: Parse errors as type errors, http://poleiro.info/posts/2013-04-03-parse-errors-as-type-errors.html, accessed: 2017-11-17

[26] Appel, A.W.: Program Logics for Certified Compilers. Cambridge University Press (April 2014)

[27] Avigad, J., Feferman, S.: Gödel's functional (“dialectica”) interpretation. Handbook of proof theory 137, 337–405 (1998)

[28] Bailey, D.: Raising Lazarus - the 20 year old bug that went to Mars, http://blog.securitymouse.com/2014/06/raising-lazarus-20-year-old-bug-that.html, accessed: 2017-11-17

[29] Barras, B.: A formalisation of the calculus of constructions, https://github.com/coq-contribs/coq-in-coq, accessed: 2017-11-17

[30] Berger, U., Jones, A., Seisenberger, M.: Program extraction applied to monadic parsing. Journal of Logic and Computation p. exv078 (2015)

[31] Berger, U., Schwichtenberg, H., Seisenberger, M.: The Warshall algorithm and Dickson's lemma: Two examples of realistic program extraction. Journal of Automated Reasoning 26(2), 205–221 (2001)

[32] Bhargavan, K., Fournet, C., Kohlweiss, M., Pironti, A., Strub, P.Y.: Implementing TLS with verified cryptographic security (2013), http://www.mitls.org/downloads/miTLS-report.pdf, accessed: 2017-11-17

[33] Blanchette, J.C.: Proof pearl: Mechanizing the textbook proof of Huffman's algorithm. J. Autom. Reason. 43(1), 1–18 (Jun 2009), http://dx.doi.org/10.1007/s10817-009-9116-y

[34] Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm (1994)

[35] Chen, H., Ziegler, D., Chajed, T., Chlipala, A., Kaashoek, M.F., Zeldovich, N.: Using Crash Hoare Logic for certifying the FSCQ file system. In: Proceedings of the 25th Symposium on Operating Systems Principles. pp. 18–37. ACM (2015)

[36] Chlipala, A.: The Bedrock tutorial, http://plv.csail.mit.edu/bedrock/tutorial.pdf, accessed: 2017-11-17

[37] Chlipala, A.: Certified programming with dependent types (2011)

[38] Chlipala, A.: Certified Programming with Dependent Types: A Pragmatic Introduction to the Coq Proof Assistant. MIT Press (2013)

[39] Claret, G.: Why and how to write code compatible with many Coq versions, http://coq-blog.clarus.me/why-and-how-to-write-code-compatible-with-many-coq-versions.html, accessed: 2017-11-17

[40] Danielsson, N.A.: Total parser combinators. In: ACM Sigplan Notices. vol. 45, pp. 285–296. ACM (2010)

[41] Deutsch, P.: DEFLATE Compressed Data Format Specification version 1.3. RFC 1951 (Informational) (May 1996), http://www.ietf.org/rfc/rfc1951.txt

[42] Deutsch, P.: GZIP file format specification version 4.3. RFC 1952 (Informational) (May 1996), http://www.ietf.org/rfc/rfc1952.txt

[43] Deutsch, P., Gailly, J.L.: ZLIB Compressed Data Format Specification version 3.3. RFC 1950 (Informational) (May 1996), http://www.ietf.org/rfc/rfc1950.txt

[44] Dénès, M., Pédrot, P.M.: A proof of false, https://github.com/clarus/falso, accessed: 2017-11-17

[45] Esparza, J., Lammich, P., Neumann, R., Nipkow, T., Schimpf, A., Smaus, J.G.: A Fully Verified Executable LTL Model Checker, pp. 463–478. Springer Berlin Heidelberg, Berlin, Heidelberg (2013), http://dx.doi.org/10.1007/978-3-642-39799-8_31

[46] Fano, R.M.: The transmission of information. Massachusetts Institute of Technology, Research Laboratory of Electronics Cambridge, Mass, USA (1949)

[47] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., Berners-Lee, T.: Hypertext Transfer Protocol – HTTP/1.1. RFC 2616 (Draft Standard) (Jun 1999), http://www.ietf.org/rfc/rfc2616.txt, obsoleted by RFCs 7230, 7231, 7232, 7233, 7234, 7235, updated by RFCs 2817, 5785, 6266, 6585

[48] Floyd, R.: Assigning meanings to programs. In: Proceedings of the American Mathematical Society Symposia on Applied Mathematics. vol. 19, pp. 19–31 (1967)

[49] Fonseca, P., Zhang, K., Wang, X., Krishnamurthy, A.: An empirical study on the correctness of formally verified distributed systems. In: Proceedings of the Twelfth European Conference on Computer Systems. pp. 328–343. EuroSys '17, ACM, New York, NY, USA (2017), http://doi.acm.org/10.1145/3064176.3064183

[50] Gennari, J.: The zlib compression library is vulnerable to a denial-of-service condition, http://www.kb.cert.org/vuls/id/238678, accessed: 2017-11-17

[51] Gennari, J.: zlib inflate() routine vulnerable to buffer overflow, http://www.kb.cert.org/vuls/id/680620, accessed: 2017-11-17

[52] Godefroid, P.: Dynamic software model checking (Sep 2014), https://patricegodefroid.github.io/public_psfiles/talk-emc2014.pdf, accessed: 2017-11-17

[53] Gödel, K.: Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I. Monatshefte für Mathematik und Physik 38(1), 173–198 (1931)

[54] Gödel, K.: Über eine bisher noch nicht benützte Erweiterung des finiten Standpunktes. dialectica 12(3-4), 280–287 (1958)

[55] Gonthier, G.: Formal proof – the four-color theorem. Notices of the AMS 55(11), 1382–1393 (2008)

[56] Google Inc.: Zopfli compression algorithm, https://github.com/google/zopfli, accessed: 2017-11-17

[57] Hoare, C.A.R.: An axiomatic basis for computer programming. Communications of the ACM 12(10), 576–580 (1969)

[58] Hoffmann, J.: Resource aware ml, http://www.raml.co/, accessed: 2017-11-17

[59] Hollenbeck, S.: Transport Layer Security Protocol Compression Methods. RFC 3749 (Proposed Standard) (May 2004), http://www.ietf.org/rfc/rfc3749.txt

[60] Howard, W.A.: The formulae-as-types notion of construction (1969)

[61] Huffman, D.: A method for the construction of minimum-redundancy codes. Proceedings of the IRE 40(9), 1098–1101 (Sept 1952)

[62] Hupel, L.: private communication (2016)

[63] Jang, D., Tatlock, Z., Lerner, S.: Establishing browser security guar- antees through formal shim verification. In: Proceedings of the 21st USENIX conference on Security symposium. pp. 8–8. USENIX Asso- ciation (2012)

[64] Kanav, S., Lammich, P., Popescu, A.: A conference management system with verified document confidentiality. In: International Conference on Computer Aided Verification. pp. 167–183. Springer (2014)

[65] Kaplan, H., Tarjan, R.E.: Purely functional representations of catenable sorted lists. In: Proceedings of the twenty-eighth annual ACM symposium on Theory of computing. pp. 202–211. ACM (1996)

[66] Kaplan, H., Tarjan, R.E.: Purely functional, real-time deques with catenation. J. ACM 46(5), 577–603 (Sep 1999), http://doi.acm.org/10.1145/324133.324139

[67] Kelsey, J.: Compression and information leakage of plaintext. In: Fast Software Encryption, 9th International Workshop, FSE 2002, Leuven, Belgium, February 4-6, 2002, Revised Papers. Lecture Notes in Computer Science, vol. 2365, pp. 263–276. Springer (2002), http://www.iacr.org/cryptodb/archive/2002/FSE/3091/3091.pdf

[68] Kiselyov, O.: Iteratees. Functional and Logic Programming pp. 166–181 (2012)

[69] Klein, G., Andronick, J., Elphinstone, K., Murray, T., Sewell, T., Kolanski, R., Heiser, G.: Comprehensive formal verification of an OS

microkernel. ACM Transactions on Computer Systems 32(1), 2:1–2:70 (feb 2014)

[70] Lawrence, A., Berger, U., Seisenberger, M.: Extracting a DPLL algorithm (2012)

[71] Leroy, X.: Formal verification of a realistic compiler. Communications of the ACM 52(7), 107–115 (2009), http://gallium.inria.fr/~xleroy/publi/compcert-CACM.pdf

[72] Malecha, G., Morrisett, G., Shinnar, A., Wisnesky, R.: Toward a verified relational database management system. In: Proceedings of the 37th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. pp. 237–248. POPL ’10, ACM, New York, NY, USA (2010), http://doi.acm.org/10.1145/1706299.1706329

[73] Martin-Löf, P.: Constructive mathematics and computer programming. In: L. Jonathan Cohen, Jerzy Łoś, H.P., Podewski, K.P. (eds.) Logic, Methodology and Philosophy of Science VI: Proceedings of the Sixth International Congress of Logic, Methodology and Philosophy of Science, Studies in Logic and the Foundations of Mathematics, vol. 104, pp. 153–175. Elsevier (1982), https://www.sciencedirect.com/science/article/pii/S0049237X09701892

[74] McCarthy, J.: A basis for a mathematical theory of computation. Studies in Logic and the Foundations of Mathematics 35, 33–70 (1963)

[75] McMillan, B.: Two inequalities implied by unique decipherability. Information Theory, IRE Transactions on 2(4), 115–116 (December 1956)

[76] Miyamoto, K.: The Minlog system, http://minlog-system.de/, accessed: 2017-11-17

[77] Monin, J.F.: A toolkit to reason with programs raising exceptions, https://github.com/coq-contribs/continuations, accessed: 2017-11-17

[78] Moravec, H.: Mind children: The future of robot and human intelligence. Harvard University Press (1988)

[79] Morrisett, G., Tan, G., Tassarotti, J., Tristan, J.B., Gan, E.: Rocksalt: better, faster, stronger sfi for the x86. In: ACM SIGPLAN Notices. vol. 47, pp. 395–404. ACM (2012)

[80] Naumowicz, A., Korniłowicz, A.: A Brief Overview of Mizar, pp. 67–72. Springer Berlin Heidelberg, Berlin, Heidelberg (2009), http://dx.doi.org/10.1007/978-3-642-03359-9_5

[81] Nogin, A.: Writing constructive proofs yielding efficient extracted programs. Electronic Notes in Theoretical Computer Science 37, 1–17 (2000)

[82] Postel, J.: DoD standard Transmission Control Protocol. RFC 761 (Jan 1980), http://www.ietf.org/rfc/rfc761.txt, obsoleted by RFC 793

[83] Ricketts, D., Robert, V., Jang, D., Tatlock, Z., Lerner, S.: Automating formal proofs for reactive systems. In: O'Boyle, M.F.P., Pingali, K. (eds.) ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '14, Edinburgh, United Kingdom - June 09-11, 2014. p. 47. ACM (2014), http://doi.acm.org/10.1145/2594291.2594338

[84] Roelofs, G., Gailly, J., Adler, M.: A massively spiffy yet delicately unobtrusive compression library, http://www.zlib.net/, accessed: 2017-11-17

[85] Sacchini, J.L.: Exceptions in dependent type theory, https://www.irif.fr/~letouzey/types2014/abstract-18.pdf, accessed: 2017-11-17

[86] Schwichtenberg, H., Senjak, C.: Minimal from classical proofs. Annals of Pure and Applied Logic 164(6), 740–748 (2013)

[87] Scott, D.S.: A type-theoretical alternative to ISWIM, CUCH, OWHY. Theoretical Computer Science 121, 411–440 (1993), annotated version of the 1969 manuscript

[88] Senjak, C.S.: Minimal from classical proofs, https://uxul.de/akademisches/diplom.pdf, accessed: 2017-11-17

[89] Senjak, C.S., Hofmann, M.: An implementation of deflate in coq. In: FM 2016: Formal Methods: 21st International Symposium, Limassol, Cyprus, November 9-11, 2016, Proceedings 21. pp. 612–627. Springer (2016)

[90] Senjak, C., Hofmann, M.: An implementation of deflate in coq. CoRR abs/1609.01220 (2016), http://arxiv.org/abs/1609.01220

[91] Storer, J.A., Szymanski, T.G.: Data compression via textual substitu- tion. J. ACM 29(4), 928–951 (Oct 1982), http://doi.acm.org/10.1145/ 322344.322346

[92] Swierstra, W.: The Hoare state monad, http://www.cs.ru.nl/~wouters/Publications/HoareLogicStateMonad.pdf, accessed: 2017-11-17

[93] The Coq Development Team: The Coq proof assistant reference manual, https://coq.inria.fr/distrib/current/refman/, accessed: 2017-11-17

[94] Thery, L.: Formalising Huffman's algorithm. Tech. report TRCS 034, Dept. of Informatics, Univ. of L'Aquila (2004)

[95] Turing, A.M.: On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London mathematical society 2(1), 230–265 (1937)

[96] Turing, A.: Checking a large routine. EDSAC Inaugural Conference (1949)

[97] Univalent Foundations Program, T.: Homotopy Type Theory: Univalent Foundations of Mathematics. https://homotopytypetheory.org/book, Institute for Advanced Study (2013)

[98] Vafeiadis, V.: Adjustable references. In: Proceedings of the 4th International Conference on Interactive Theorem Proving. pp. 328–337. ITP'13, Springer-Verlag, Berlin, Heidelberg (2013), http://dx.doi.org/10.1007/978-3-642-39634-2_24

[99] Vafeiadis, V.: Adjustable references. In: International Conference on Interactive Theorem Proving. pp. 328–337. Springer (2013)

[100] Voevodsky, V.: Notes on homotopy lambda calculus (2006), https://github.com/vladimirias/2006_03_Homotopy_lambda_calculus, accessed: 2017-11-17

[101] der Wiskunde, T.H.E.O., Dijkstra, E.W.: Notes on structured programming (1969)

[102] Zakharov, I.S., Mandrykin, M.U., Mutilin, V.S., Novikov, E., Petrenko, A.K., Khoroshilov, A.V.: Configurable toolset for static verification of operating systems kernel modules. Programming and Computer Software 41(1), 49–64 (2015)

[103] Zhao, J., Nagarakatte, S., Martin, M.M., Zdancewic, S.: Formalizing the LLVM intermediate representation for verified program transformations. In: ACM SIGPLAN Notices. vol. 47, pp. 427–440. ACM (2012)

[104] Ziv, J., Lempel, A.: A universal algorithm for sequential data compres- sion. IEEE Transactions on information theory 23(3), 337–343 (1977)