Functional Data Structures and Algorithms
Charles University in Prague
Faculty of Mathematics and Physics

DOCTORAL THESIS

Milan Straka

Functional Data Structures and Algorithms

Computer Science Institute of Charles University

Supervisor of the thesis: doc. Mgr. Zdeněk Dvořák, Ph.D.
Study programme: Computer Science
Specialization: Discrete Models and Algorithms (4I4)

Prague 2013

I am grateful to Zdeněk Dvořák for his support. He was very accommodating during my studies. He quickly discovered any errors in my early conjectures and suggested areas of interest that proved rewarding.

I would also like to express my sincere gratitude to Simon Peyton Jones for his supervision and guidance during my internship at Microsoft Research Labs, and also for helping me with one of my papers. Our discussions were always very intriguing and motivating.

Johan Tibell supported my work on data structures in Haskell. His enthusiasm encouraged me to overcome initial hardships and continues to make the work really enjoyable.

Furthermore, I would like to thank Michal Koucký for comments and discussions that improved the thesis presentation considerably.

Finally, all this would not be possible without my beloved wife and my parents. You make me very happy, Jana.

I declare that I carried out this doctoral thesis independently, and only with the cited sources, literature and other professional sources.

I understand that my work relates to the rights and obligations under the Act No. 121/2000 Coll., the Copyright Act, as amended, in particular the fact that the Charles University in Prague has the right to conclude a license agreement on the use of this work as a school work pursuant to Section 60 paragraph 1 of the Copyright Act.

In Prague, date 12th August 2013
signature of the author

Title: Funkcionální datové struktury a algoritmy (Functional Data Structures and Algorithms)
Author: Milan Straka
Institute: Computer Science Institute of Charles University
Supervisor of the doctoral thesis: doc. Mgr. Zdeněk Dvořák, Ph.D., Computer Science Institute of Charles University

Abstract: Functional programming is a widespread and increasingly popular programming paradigm that is also finding use in industrial applications. Data structures used in functional languages are predominantly persistent, which means that when they are modified, they preserve their previous versions. The goal of this work is to extend the theory of persistent data structures and to design efficient implementations of these data structures for functional languages.

The array is undoubtedly the most frequently used data structure. Although it is a very simple structure, no persistent counterpart with constant-time element access exists. In this work we describe a simplified implementation of a persistent array with asymptotically optimal amortized time complexity Θ(log log n) and, above all, a nearly optimal implementation with worst-case complexity. We also show how to efficiently recognize and free unused versions of a persistent array.

The best-performing data structures need not always be those based on the asymptotically best structures. For that reason, we also focus on the implementation of data structures in the purely functional programming language Haskell and substantially improve the standard Haskell data structure library.

Keywords: persistent data structures, persistent arrays, algorithms with worst-case complexity, purely functional data structures, Haskell

Title: Functional Data Structures and Algorithms
Author: Milan Straka
Institute: Computer Science Institute of Charles University
Supervisor of the doctoral thesis: doc. Mgr. Zdeněk Dvořák, Ph.D., Computer Science Institute of Charles University

Abstract: Functional programming is a well-established programming paradigm and is becoming increasingly popular, even in industrial and commercial applications.
Data structures used in functional languages are principally persistent, that is, they preserve previous versions of themselves when modified. The goal of this work is to broaden the theory of persistent data structures and devise efficient implementations of data structures to be used in functional languages.

Arrays are without question the most frequently used data structure. Although arrays are conceptually very simple, no persistent array with a constant-time access operation exists. We describe a simplified implementation of a fully persistent array with asymptotically optimal amortized complexity Θ(log log n) and especially a nearly optimal worst-case implementation. Additionally, we show how to effectively perform garbage collection on a persistent array.

The most efficient data structures are not necessarily based on the asymptotically best structures. On that account, we also focus on data structure implementations in the purely functional language Haskell and improve the standard Haskell data structure library considerably.

Keywords: persistent data structures, persistent arrays, worst-case algorithms, purely functional data structures, Haskell

Contents

1 Introduction
  1.1 Functional programming
  1.2 Persistent Data Structures
  1.3 Structure of the Thesis

I Persistent Data Structures

2 Making Data Structures Persistent
  2.1 Path Copying Method
  2.2 Making Linked Structures Persistent
  2.3 Making Linked Structures Persistent in the Worst Case
  2.4 Making Amortized Structures Persistent
3 Navigating the Version Tree
  3.1 Linearizing the Version Tree
  3.2 List Labelling
  3.3 List Order Problem
4 Dynamic Integer Sets
  4.1 Van Emde Boas Trees
  4.2 Exponential Trees
5 Persistent Arrays
  5.1 Related Work
  5.2 Lower Bound on Persistent Array Lookup
  5.3 Amortized Persistent Array
  5.4 Worst-Case Persistent Array
  5.5 Improving Complexity of Persistent Array Operations
  5.6 Garbage Collection of a Persistent Array

II Purely Functional Data Structures

6 Persistent Array Implementation
  6.1 Fully Persistent Array Implementation
  6.2 Choosing the Best Branching Factor
7 BB-ω Trees
  7.1 BB-ω Trees
  7.2 Rebalancing BB-ω Trees
  7.3 Choosing the Parameters ω, α and δ
  7.4 BB-ω Trees Height
  7.5 Performance of BB-ω Trees
  7.6 Reducing Memory by Utilizing Additional Data Constructor
8 The Haskell containers Package
  8.1 The containers Package
  8.2 Benchmarks
  8.3 Improving the containers Performance
  8.4 New Hashing-Based Container
9 Conclusion

Bibliography
List of Terms and Abbreviations
List of Figures
Attachments
  A.1 Generating Figure 7.3
  A.2 Packages Used in Chapter 8

Chapter 1 Introduction

Computer programming has developed enormously since the first high-level languages were created,¹ and several fundamental approaches to computer programming, i.e., several programming paradigms, have been designed. The prevalent approach is the imperative programming paradigm, represented for example by the widespread C language.

The imperative paradigm considers a computer program to be a sequence of statements that change the program state. In other words, serial orders (imperatives) are given to the computer.

Declarative programming represents a paradigm contrasting with imperative programming.
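
The contrast between the two paradigms can be illustrated with a minimal Haskell sketch. It is not taken from the thesis, and the names `sumImperative` and `sumDeclarative` are hypothetical, chosen only for illustration. The first function issues a sequence of statements that update a mutable cell, while the second only states what the result is and leaves the control flow implicit.

```haskell
import Data.Foldable (for_)
import Data.IORef (modifyIORef', newIORef, readIORef)

-- Imperative style (illustrative): a sequence of statements that
-- repeatedly change the program state held in a mutable reference.
sumImperative :: [Int] -> IO Int
sumImperative xs = do
  acc <- newIORef (0 :: Int)              -- mutable accumulator
  for_ xs $ \x -> modifyIORef' acc (+ x)  -- one state change per element
  readIORef acc

-- Declarative style (illustrative): the result is described as a fold
-- over the list; no state is changed and no control flow is spelled out.
sumDeclarative :: [Int] -> Int
sumDeclarative = foldr (+) 0
```

Both functions compute the same sum for any list; only the first has to live in `IO`, because it relies on observable changes of state.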
The fundamental principle of the declarative approach is describing the problem instead of defining the solution, allowing the program to express what should be accomplished instead of how it should be accomplished. The logic of the computation is described without dependence on control flow, as opposed to imperative programming, where the control flow is a fundamental part of any program.

One of the well-established ways of realizing the declarative paradigm is functional programming.

¹ FORTRAN, “Formula Translator”, released in 1954, is considered the first high-level language with a working implementation.

1.1 Functional programming

Functional programming treats computations as evaluations of mathematical functions, and the process of program execution is viewed as the application of functions instead of changes in state.

Referential Transparency

A major difference between functional and imperative programming is the absence of side effects that change global program state. A function has a side effect if, in addition to returning a value, it irreversibly modifies some global state or has an observable effect on the outside world, like displaying a message on a screen. Side effects are common in imperative programming, while in functional programming, the output of a function depends solely on the input arguments and not on an internal state of the program. Therefore, calling a function twice with the same arguments produces the same result. Functional programs are referentially transparent, meaning that a function call can be replaced by its resulting value without changing the behaviour of the program.

Functional languages that completely lack side effects are usually called purely functional and are considered declarative. Because a purely functional language does not define a specific evaluation order, various evaluation strategies are possible. One of the most theoretically and practically interesting strategies is lazy evaluation. Under