Collected Works of Philippe Flajolet
Editorial Committee:

HSIEN-KUEI HWANG
Institute of Statistical Science
Academia Sinica
Taipei 115
Taiwan

BRUNO SALVY
Algorithms Project
INRIA Rocquencourt
F-78153 Le Chesnay
France

ROBERT SEDGEWICK
Department of Computer Science
Princeton University
Princeton, NJ 08540
USA

MICHÈLE SORIA
Laboratoire d’Informatique
Université Pierre et Marie Curie
F-75252 Paris cedex 05
France

WOJCIECH SZPANKOWSKI
Department of Computer Science
Purdue University
West Lafayette, Indiana 47907
USA

BRIGITTE VALLÉE
Département d’Informatique
Université de Caen
F-14032 Caen Cedex
France

MARK DANIEL WARD (General Editor)
Department of Statistics
Purdue University
West Lafayette, Indiana 47907
USA

ISBN TBA
©Cambridge University Press 2012+ (print version)
©TBA 2012+ (e-version)

There will be several types of introductions, including an introduction to the entire series of books (written by Donald E. Knuth), and also introductions to each specific volume (written by the editors of that volume).

Contents

Chapter I. STRING ALGORITHMS
  Introduction 1. TEXT ANALYSIS
  Paper 2. PAPER 74
  Paper 3. PAPER 76
  Paper 4. PAPER 191

Chapter II. INFORMATION THEORY
  ANALYTIC INFORMATION THEORY
    Analytic Information Theory
    Preliminary Discussion
    Minimax Redundancy for a Class of Sources
    Minimax Redundancy for Memoryless Sources
    Minimax Redundancy for Renewal Sources
  Paper 5. PAPER 158
  Paper 6. PAPER 173
  Paper 7. PAPER SEMINAR
  Paper 8. PAPER SEMINAR

Chapter III. DIGITAL TREES
  THE DIGITAL TREE PROCESS
    1. A central role in computer science
    2. Digital trees in Philippe Flajolet’s works
    3. Conclusion
  Paper 9. PAPER 34

Chapter IV. MELLIN TRANSFORM
  DR FLAJOLET’S ELIXIR OR MELLIN TRANSFORM AND ASYMPTOTICS
    Mellin transform and fundamental strip
    Symbolic analysis
    Fundamental result
    Harmonic sums
    Zigzag method
    Average-case analysis of algorithms and harmonic sums
    Exponentials in harmonic sums
    Technical point
    Oscillations
    Related topics
  Paper 10. PAPER 58

Chapter V. DIVIDE AND CONQUER
  DIVIDE-AND-CONQUER RECURRENCES AND THE MELLIN-PERRON FORMULA
    1. Introduction
    2. The basic technique
    3. Concluding Remarks
  Paper 11. PAPER 115

Chapter VI. COMMUNICATION PROTOCOLS
  FLAJOLET’S WORK ON TELECOMMUNICATION PROTOCOLS AND COLLISION RESOLUTION ALGORITHMS
    1. Introduction
    2. Telecommunication Protocols, Aloha protocol
    3. The tree collision resolution algorithm
    4. The free access tree algorithm
    5. Q-ary free access tree algorithm
    BIBLIOGRAPHY
  Paper 12. PAPER 49

BIBLIOGRAPHY
INDEX

Chapter I
STRING ALGORITHMS

INTRODUCTION 1
Text Analysis
Pierre Nicodème

List of articles.
– (#74)[66] Deviations from Uniformity in Random Strings (1988), P. Flajolet, P. Kirschenhofer and R.F. Tichy
– (#76)[67] Discrepancy of Sequences in Discrete Spaces (1989), P. Flajolet, P. Kirschenhofer and R.F. Tichy
– (#151,#174)[166, 167] Motif statistics (1999)-(2002), P. Nicodème, B. Salvy and P. Flajolet
– (#164)[63] Hidden Pattern Statistics (2001), P. Flajolet, Y. Guivarc’h, W. Szpankowski and B. Vallée
– (#191)[121] Hidden Word Statistics (2006), P. Flajolet, W. Szpankowski and B. Vallée

Since computing power developed in the sixties and the seventies, text analysis has been an active field of study, both for search tools that find the positions where a motif matches in a given text and for counting occurrences of motifs in random texts by combinatorial or probabilistic methods.
Counting methods and statistics often provide limit laws, under various probabilistic source models for the texts, which allow the detection of exceptional behaviours. As typical objects of computer science, finite automata have been used both for searching and for statistical analysis. Many questions about word statistics have been solved by three different methods: combinatorial analysis, automata, and probabilistic analysis. Such statistical research involves two objects: the source that generates the text, and the type of motif considered. The latter may be a single word; a finite set of words, which is reduced if no word is a factor of another word of the set and not reduced in the contrary, more difficult, case; an infinite set defined by a regular expression with stars; or a hidden word.

Deviations from Uniformity in Random Strings. The important contributions of Philippe Flajolet to text analysis have to be situated historically with respect to the previously mentioned developments; his work, however, also explores intrinsic properties of texts. In the article [66] “Deviations from Uniformity in Random Strings” (1988)¹, coauthored with P. Kirschenhofer and R.F. Tichy, he goes from (intrinsic) properties of normality of infinite strings, a problem set up by E. Borel in 1908 during his research on measure theory, to the speed of convergence to uniformity for large sequences, a computer science problem.

¹ The article “Discrepancy of Sequences in Discrete Spaces”, although published in 1989, after the 1988 article, is obviously a preliminary and unfinished version of it; we will therefore not discuss it.

Normal numbers (E. Borel, 1908) are numbers such that any block of bits of a given size occurs with its natural probability ($1/2^k$ for blocks of length $k$) in their infinite binary representation. P.
Flajolet and his coauthors [66] deal with the asymptotic number of occurrences of blocks when a random binary sequence produced by a uniform Bernoulli source is large but not infinite, a totally unexplored subject at the time. They first build a de Bruijn graph [25] that counts the occurrences of all words simultaneously, and deduce from it a universal Markov chain. The latter possesses strong convergence properties when the size $k$ of the blocks remains fixed while the size $n$ of the sequences tends to infinity, but these properties do not allow one to conclude when $k$ tends to infinity and approaches $\log_2 n$. Next comes an analysis based on word counting, where Philippe Flajolet’s influence is clear; the proofs are based on combinatorics of words “à la Guibas-Odlyzko”, very delicate asymptotic manipulations and a saddle-point-like integral. Guibas and Odlyzko [130, 131] (1981) introduced the autocorrelation polynomial of a word, the correlation polynomial of two words, and the language parsing of a sequence with respect to occurrences of a pattern. A key lemma of the proof in the “Deviations from Uniformity” article [66] extends a result of Guibas and Odlyzko [129] (1978) and indicates that the relevant counting generating function has no poles inside a circle of integration $|z| = 1 + \epsilon$ for a suitably small $\epsilon$. It is worth noting that the results of this article are optimal, proving that all words of size $(1 - \epsilon) \log_2 n$ occur with probability one in a random binary sequence of length $n$. Considering Proposition IV.4, p. 274, in the book of Flajolet and Sedgewick [114] and using bootstrapping as in Fayolle [40] (2004) should open the way to a generalization of the result to alphabets of any size. As a consequence, the fill-up level of a suffix tree built from an unbiased source upon a sequence of length $n$ is likely to be $(1 - \epsilon) \log_\alpha n$ for an alphabet of cardinality $\alpha$.
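As a rough empirical illustration of the statement on block occurrences, the following Python sketch counts every block of length $k \approx (1 - \epsilon) \log_2 n$ in a pseudo-random binary sequence and compares the counts with their mean. It is only a simulation under stated assumptions, not the analysis of [66]: the function names `block_counts` and `uniformity_report`, the seed, and the parameter values are hypothetical choices made for this example.

```python
import random
from math import floor, log2


def block_counts(bits, k):
    """Count occurrences of every k-bit block with a sliding window.

    The window is kept as an integer in [0, 2**k); shifting in one bit
    at a time amounts to a walk on the de Bruijn graph of order k.
    """
    counts = [0] * (1 << k)
    mask = (1 << k) - 1
    window = 0
    for i, b in enumerate(bits):
        window = ((window << 1) | b) & mask
        if i >= k - 1:  # the window holds a full block from here on
            counts[window] += 1
    return counts


def uniformity_report(n, eps, seed=0):
    """Sample n fair bits and compare block counts with their mean.

    Assumption: Python's pseudo-random generator stands in for the
    uniform Bernoulli source; k = floor((1 - eps) * log2(n)).
    """
    rng = random.Random(seed)
    bits = [rng.getrandbits(1) for _ in range(n)]
    k = max(1, floor((1 - eps) * log2(n)))
    counts = block_counts(bits, k)
    expected = (n - k + 1) / 2 ** k
    return {
        "k": k,
        "all_blocks_occur": min(counts) > 0,
        "max_relative_deviation": max(abs(c - expected) for c in counts) / expected,
    }


if __name__ == "__main__":
    print(uniformity_report(n=1 << 16, eps=0.25))
```

Keeping the current block as an integer makes each step constant-time, so all $2^k$ counters are updated in a single pass over the sequence.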
Future work could study more general sources; in particular, a general notion of discrepancy for biased sources should be compared with the study by Knessl and Szpankowski [148] (2004) of the fill-up level in tries generated by a biased binary source. The analysis of the fill-up level of a suffix tree also remains an open problem.

Motif Statistics. The articles [166, 167] (1999-2002) entitled “Motif statistics”, coauthored with P. Nicodème and B. Salvy, build upon important previous developments of theoretical computer science. It is worth recalling some cornerstones of automata theory, a major topic in these articles, which deal with regular expressions. Kleene [147] (1956) and Rabin and Scott [171] (1959) provided constructions of DFAs for regular expressions. Aho and Corasick [2] (1975) devised an efficient algorithm to construct an automaton that searches for a finite set of words, while Knuth, Morris and Pratt [149] (1977) gave a fast algorithm that is also realized by an automaton and searches for occurrences of a single word. In a fundamental article about context-free languages, Chomsky and Schützenberger [17] (1963) gave an algorithm computing the generating function of the words recognized by a Deterministic Finite Automaton on a finite alphabet. This generating function is always a solution of a system of linear equations whose homogeneous part has coefficients that are monomials of degree one with respect to the alphabet; it follows that the resulting generating functions (those of regular languages, by the classical automata constructions [147, 171] previously mentioned) are rational. Note that the results of P. Flajolet and his coauthors have wide generality, providing an algorithmic construction for regular patterns, a class which contains all finite patterns. From the resulting bivariate generating function follow the computation of the moments and access to the normal limit law.
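To make the link between a DFA and its rational generating function concrete, here is a minimal Python sketch that computes the first coefficients of that generating function, that is, the number of accepted words of each length, by propagating counts along the transitions. The helper `count_accepted` and the example automaton `NO_BB` are hypothetical names introduced for this illustration; they are not taken from [166, 167]. For this automaton (words over {a, b} with no two consecutive b), the linear system reads $L_0 = 1 + z L_0 + z L_1$, $L_1 = 1 + z L_0$, which gives $L_0(z) = (1 + z)/(1 - z - z^2)$.

```python
def count_accepted(transitions, start, accepting, max_len):
    """Return [a_0, ..., a_max_len], where a_n counts accepted words of length n.

    transitions[state][letter] gives the next state of a complete DFA.
    """
    # weights[q] = number of words of the current length leading from start to q
    weights = {q: 0 for q in transitions}
    weights[start] = 1
    result = []
    for _ in range(max_len + 1):
        result.append(sum(weights[q] for q in accepting))
        nxt = {q: 0 for q in transitions}
        for q, w in weights.items():
            for target in transitions[q].values():
                nxt[target] += w
        weights = nxt
    return result


# Hypothetical example: words over {a, b} with no two consecutive 'b'.
# State 0: last letter is not 'b'; state 1: last letter is 'b'; state 2: sink.
NO_BB = {
    0: {"a": 0, "b": 1},
    1: {"a": 0, "b": 2},
    2: {"a": 2, "b": 2},
}

if __name__ == "__main__":
    print(count_accepted(NO_BB, start=0, accepting={0, 1}, max_len=8))
    # → [1, 2, 3, 5, 8, 13, 21, 34, 55]
```

The printed values agree with the series expansion of $(1 + z)/(1 - z - z^2)$, as expected from the rationality argument.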
The automata constructions, however, hide the structural properties of finite patterns. These are mostly provided by language analysis, and once again we have to mention the pioneering work of Guibas and Odlyzko [130, 131] (1981). They followed the idea of parsing a text with respect to the occurrences of the pattern, and defined the languages Right, of words finishing with the first occurrence of a word of the pattern; Minimal, of words separating two occurrences; and Ultimate, of words following the last occurrence. Guibas and Odlyzko provided the generating functions of these languages by recurrence; later, Régnier and Szpankowski [175, 176] (1997, 1998) and Régnier [172] (2000) provided a set of formal equations for these languages and proved, for single-word patterns, Gaussian or Poisson limits, depending on whether the number of occurrences is Θ(n) or O(1).
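The autocorrelation polynomial introduced by Guibas and Odlyzko can be computed directly from its definition. The following Python sketch does so for a single word and uses it in the classical formula for the expected position of the first occurrence under a uniform memoryless source, $\alpha^k c(1/\alpha)$ for a word of length $k$ over an alphabet of cardinality $\alpha$. The function names are hypothetical, and the sketch covers only the single-word, uniform-source case.

```python
from fractions import Fraction


def autocorrelation(word):
    """Return coefficients c_0..c_{k-1}; c_i = 1 iff word[i:] is a prefix of word."""
    k = len(word)
    return [1 if word[i:] == word[: k - i] else 0 for i in range(k)]


def expected_first_occurrence(word, alphabet_size=2):
    """Expected position of the end of the first occurrence of word.

    Assumption: letters are independent and uniform over the alphabet,
    so the expectation equals c(1/alphabet_size) * alphabet_size**k.
    """
    k = len(word)
    p = Fraction(1, alphabet_size)
    c_at_p = sum(ci * p ** i for i, ci in enumerate(autocorrelation(word)))
    return c_at_p / p ** k


if __name__ == "__main__":
    for w in ("aa", "ab", "aba"):
        print(w, autocorrelation(w), expected_first_occurrence(w))
    # → aa [1, 1] 6
    # → ab [1] 4   (printed as [1, 0])
```

For instance, "aa" overlaps itself and is expected to appear first at position 6, whereas "ab" has no self-overlap and is expected at position 4; exact rational arithmetic avoids rounding issues.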