Learning Finite-State Machines: Statistical and Algorithmic Aspects

Borja de Balle Pigem

Thesis Supervisors: Jorge Castro and Ricard Gavaldà
Department of Software — PhD in Computing

Thesis submitted to obtain the qualification of Doctor from the Universitat Politècnica de Catalunya

Copyright © 2013 by Borja de Balle Pigem

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/.

Abstract

The present thesis addresses several machine learning problems on generative and predictive models for sequential data. All the models considered have in common that they can be defined in terms of finite-state machines.

In one line of work we study algorithms for learning the probabilistic analog of Deterministic Finite Automata (DFA). These provide a fairly expressive generative model for sequences with very interesting algorithmic properties. State-merging algorithms for learning these models can be interpreted as a divisive clustering scheme where the “dependency graph” between clusters is not necessarily a tree. We characterize these algorithms in terms of statistical queries and use this characterization to prove a lower bound with an explicit dependency on the distinguishability of the target machine. In a more realistic setting, we give an adaptive state-merging algorithm satisfying the stringent algorithmic constraints of the data streams computing paradigm. Our algorithms come with strict PAC learning guarantees. At the heart of state-merging algorithms lies a statistical test for distribution similarity. In the streaming version this is replaced with a bootstrap-based test which yields faster convergence in many situations. We also study a wider class of models for which the state-merging paradigm yields PAC learning algorithms. Applications of this method are given to continuous-time Markovian models and stochastic transducers on pairs of aligned sequences. The main tools used for obtaining these results include a variety of concentration inequalities and sketching algorithms.

In another line of work we contribute to the rapidly growing body of spectral learning algorithms. The main virtues of this type of algorithm include the possibility of proving finite-sample error bounds in the realizable case and enormous savings in computing time over iterative methods like Expectation–Maximization. In this thesis we give the first application of this method to learning conditional distributions over pairs of aligned sequences defined by probabilistic finite-state transducers. We also prove that the method can learn the whole class of probabilistic automata, thus extending the class of models previously known to be learnable with this approach. In the last two chapters we present works combining spectral learning with methods from convex optimization and matrix completion. Respectively, these yield an alternative interpretation of spectral learning and an extension to cases with missing data. In the latter case we use a novel joint stability analysis of matrix completion and spectral learning to prove the first generalization bound for this type of algorithm that holds in the non-realizable case. Work in this area has been motivated by connections between spectral learning, classic automata theory, and statistical learning; tools from these three areas have been used.


Preface

From Heisenberg’s uncertainty principle to the law of diminishing returns, essential trade-offs can be found all around science and engineering. The Merriam–Webster on-line dictionary defines trade-off as a noun describing “a balancing of factors all of which are not attainable at the same time.” In many cases where a human actor is involved in the balancing, such balancing is the result of some conscious reasoning and acting plan, pursued with the intent of achieving a particular effect in a system of interest. In general it is difficult to imagine that this reasoning and acting can take place without the existence of some specific items of knowledge. In particular, no reasoning whatsoever is possible without knowledge of the factors involved in the trade-off and some measurement of the individual effect each of those factors has on the ultimate goal of the balancing.

The knowledge required to reach a trade-off can be conceptually separated into two parts. First there is a qualitative part, which involves the identification of the factors that need to be balanced. Then, there is a quantitative one: the influence exerted by these factors on the system of interest needs to be measured. These two pieces are clearly complementary, and each of them brings with it some actionable information. The qualitative knowledge brings with it intuitions about the inner workings of the system under study, which can be useful in deriving analogies and generalizations. On the other hand, quantitative measurements provide the material needed for principled mathematical reasoning and fine-tuning of the trade-off.

In computer science there are plenty of trade-offs that perfectly exemplify this dichotomy. Take, for example, the field of optimization algorithms. Let k denote the number of iterations executed by an iterative optimization algorithm. Obviously, there exists a relation between k and the accuracy of the solution found by the algorithm. This is a general intuition that holds across all optimization algorithms, and one that can be further extended to other families of approximation algorithms. On the other hand, it is well known that for some classes of optimization problems the distance to the optimal solution decreases like O(1/k), while for some others the rate O(1/√k) is not improvable. These particular bits of knowledge exemplify two different quantitative realizations of the iterations/accuracy trade-off. But even more importantly, this kind of quantitative information can be used to approach meta-trade-offs like the following. Suppose you are given two models for the same problem, one that is closer to reality but can only be solved at a rate of O(1/√k), and another that is less realistic but can be solved at a rate of O(1/k). Then, given a fixed computational budget, should one compute a more accurate solution to an imprecise model, or a less accurate solution in a more precise model? Being able to answer such questions is what leads to informed decisions in science and engineering, which is, in some sense, the ultimate goal for studying such trade-offs in the first place.
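As a toy illustration of this meta-trade-off, the following sketch compares the two hypothetical models under increasing iteration budgets; all constants and error models here are made up for illustration only.

```python
# Illustrative only: compare two hypothetical models under a fixed budget.
# Model A is more faithful but optimizes at rate O(1/sqrt(k)); model B adds a
# fixed modeling bias but optimizes at the faster rate O(1/k).
import math

def error_A(k, c=1.0):
    return c / math.sqrt(k)          # optimization error of the faithful model

def error_B(k, c=1.0, bias=0.01):
    return bias + c / k              # modeling bias + faster optimization error

for k in (10, 100, 1000, 10000, 100000):
    better = "A" if error_A(k) < error_B(k) else "B"
    print(f"k={k:>6}  A={error_A(k):.4f}  B={error_B(k):.4f}  -> model {better}")
```

With these (made-up) constants the less realistic model B wins for small budgets, while the faithful model A eventually overtakes it; the crossover point is exactly the kind of quantitative answer such trade-off analyses provide.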
Learning theory is a field of computer science that abounds in trade-offs. Those arise not only from the computational nature of the field, but also from a desire to encompass, in a principled way, many aspects of the statistical process that takes data from some real-world problem and produces a meaningful model for it. The many aspects involved in such trade-offs can be broadly classified as either computational (time and space complexity), statistical (sample complexity and identifiability), or of a modeling nature (realizable vs. non-realizable, proper vs. improper modeling). This diversity induces a rich variety of situations, most of which, unfortunately, turn out to be computationally intractable as soon as the models of interest become moderately complex. The understanding and identification of all these trade-offs has led in recent years to important advances in the theory and practice of machine learning, data mining, information retrieval, and other data-driven computational disciplines.

In this dissertation we address a particular problem in learning theory: the design and analysis of algorithms for learning finite-state machines. In particular, we look into the problem of designing efficient learning algorithms for several classes of finite-state machines under different sets of modelling assumptions. Special attention is devoted to the computational and statistical complexity of the proposed methods. The leitmotif of the thesis is to try to answer the question: what is the largest class of finite-state machines that is efficiently learnable from a computational and statistical perspective? Of course, a short answer would be: there is a trade-off. The results presented in this thesis provide several longer answers to this question, all of which illuminate some aspect of the trade-offs implied by the different modelling assumptions.

Acknowledgements

Needless to say, this is the last page one writes in a thesis. It is thus time to thank everyone whose help, in one way or another, has made it possible to reach this final page. Help has come to me in many forms: mentoring, friendship, love, and mixtures of those. To me, these are the things which make life worth living. And so, it is with great pride and pleasure that I thank the following people for their help.

First and foremost, I am eternally thankful to my advisors Jorge and Ricard for their encouragement, patience, and guidance throughout these years. My long-time co-authors Ariadna and Xavier, for their kindness, and for sharing their ideas and listening to mine, deserve little less than my eternal gratitude. During my stay in New York I had the great pleasure of working with Mehryar Mohri; his ideas and points of view still inspire me to this day.

Besides my closer collaborators, I have interacted with a long list of people during my research. All of them have devoted their time to share and discuss with me their views on many interesting problems. Hoping I do not forget any unforgettable name, I extend my most sincere gratitude to the following people: Franco M. Luque, Raphaël Bailly, Albert Bifet, Daniel Hsu, Dana Angluin, Alex Clark, Colin de la Higuera, Sham Kakade, Anima Anandkumar, David Sontag, Jeff Sorensen, Marta Arias, Ramon Ferrer, Doina Precup, Prakash Panangaden, Denis Thérien, Gabor Lugosi, Geoff Holmes, Bernhard Pfahringer, and Osamu Watanabe.

I would never have gotten into machine learning research if it were not for Pep Fuertes and Enric Ventura, who introduced me to academic research; José L. Balcázar, whose lectures on computability sparked my interest in computer science; and Ignasi Belda, who mentored my first steps into the fascinating world of artificial intelligence. I thank all of them for giving some purpose to my sleepless nights.

Sometimes it happens that one mixes duty with pleasure and work with friendship. It is my duty to acknowledge that working alongside my good friends Hugo, Joan, Masha, and Pablo has been a tremendous pleasure. Also, these years at UPC would not have been the same without the fun and tasteful company of the omegacix crew and satellites: Leo, David, Oriol, Dani, Joan, Marc, Cristina, Lídia, Edgar, Pere, Adrià, Miquel, Montse, Sol, and many others. A special thanks goes to all the people who work at the LSI secretariat, for their well-above-the-mean efficiency and friendliness.

Now I must confess that, on top of all the work, these have been some fun years. But it is not entirely my fault. Some of these people must take part of the blame: my long-time friends Carles, Laura, Ricard, Esteve, Roger, Gemma, Txema, Agnès, Marcel, and the rest of the Weke crew; my roomies Marc, Neus, Lluís, Milena, Laura, Laura, and Hoyt; my newly found friends Eulàlia, Rosa, Marçal, and Esther; my tatami partners Sergi, Miquel, and Marina; and my aikido sensei Joan.

Never would I have finished this thesis without the constant and loving support of my family, especially my mother Laura and my sister Júlia. To them goes my most heartfelt gratitude.


Contents

Abstract

Preface

Acknowledgements

Introduction
    Overview of Contributions
    Related Work

I State-merging Algorithms

1 State-Merging with Statistical Queries
    1.1 Strings, Free Monoids, and Finite Automata
    1.2 Statistical Query Model for Learning Distributions

    1.3 The l∞ Query for Split/Merge Testing
    1.4 Learning PDFA with Statistical Queries
    1.5 A Lower Bound in Terms of Distinguishability
    1.6 Comments

2 Implementations of Similarity Oracles
    2.1 Testing Similarity with Confidence Intervals
    2.2 Confidence Intervals from Uniform Convergence
    2.3 Bootstrapped Confidence Intervals
    2.4 Confidence Intervals from Sketched Samples
    2.5 Proof of Theorem 2.4.3
    2.6 Experimental Results

3 Learning PDFA from Data Streams Adaptively
    3.1 The Data Stream Framework
    3.2 A System for Continuous Learning
    3.3 Sketching Distributions over Strings
    3.4 State-merging in Data Streams
    3.5 A Strategy for Searching Parameters
    3.6 Detecting Changes in the Target

4 State-Merging on Generalized Alphabets
    4.1 PDFA over Generalized Alphabets
    4.2 Generic Learning Algorithm for GPDFA
    4.3 PAC Analysis
    4.4 Distinguishing Distributions over Generalized Alphabets
    4.5 Learning Local Distributions
    4.6 Applications of Generalized PDFA

    4.7 Proof of Theorem 4.3.2

II Spectral Methods

5 A Spectral Learning Algorithm for Weighted Automata
    5.1 Weighted Automata and Hankel Matrices
    5.2 Duality and Minimal Weighted Automata
    5.3 The Spectral Method
    5.4 Recipes for Error Bounds

6 Sample Bounds for Learning Stochastic Automata
    6.1 Stochastic and Probabilistic Automata
    6.2 Sample Bounds for Hankel Matrix Estimation
    6.3 PAC Learning PNFA
    6.4 Finding a Complete Basis

7 Learning Transductions under Benign Input Distributions
    7.1 Probabilistic Finite-State Transducers
    7.2 A Learning Model for Conditional Distributions
    7.3 A Spectral Learning Algorithm
    7.4 Sample Complexity Bounds

8 The Spectral Method from an Optimization Viewpoint
    8.1 Maximum Likelihood and Moment Matching
    8.2 Consistent Learning via Local Loss Optimization
    8.3 A Convex Relaxation of the Local Loss
    8.4 Practical Considerations

9 Learning Weighted Automata via Matrix Completion
    9.1 Agnostic Learning of Weighted Automata
    9.2 Completing Hankel Matrices via Convex Optimization
    9.3 Generalization Bound

Conclusion and Open Problems
    Generalized and Alternative Algorithms
    Choice or Reduction of Input Parameters
    Learning Distributions in Non-Realizable Settings

Bibliography

A Mathematical Preliminaries
    A.1 Probability
    A.2 Linear Algebra
    A.3 Calculus
    A.4 Convex Analysis
    A.5 Properties of l∞ and l∞ᵖ

Introduction

Finite-state machines (FSM) are a well-established computational abstraction dating back to the early years of computer science. Roughly speaking, a FSM sequentially reads a string of symbols from some input tape, writes, perhaps, another string on an output tape, and either accepts or rejects the computation. The main constraint is that at any intermediate step of this process the machine can only be in one of finitely many states. From a theoretical perspective, the most appealing property of FSM is that they provide a remarkably simple way to define uniform computations, in the sense that the same machine expresses the computation to be performed on input strings of all possible lengths. This is in sharp contrast with other simple computational devices like boolean formulae (CNF and DNF) and circuits, which can only represent computations over strings of some fixed length.

Formal language theory is a field where FSM have been extensively used, especially due to their ability to represent uniform computations. Since a formal language is just a set of finite strings, being able to compute membership in such a set with a simple and compact computational device is a great aid both for theoretical reasoning and practical implementation. Already in the 1950s several important classes of formal languages had been identified, with one of them, regular languages, being defined as all languages that can be recognized by the class of FSM known as acceptors. This class of machines is characterized by the fact that it does not write any output during the computation; its only output is a binary decision – acceptance or rejection – that is observed after all the symbols from the input tape have been read. Despite being equivalent from the point of view of the class of languages they can recognize, finite acceptors are usually classified into either deterministic finite automata (DFA) or non-deterministic finite automata (NFA). These two classes have different algorithmic properties which are transferred, and even amplified, to some of their generalizations. Such differences will play a central role in the development of this thesis.
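To make the notion of uniform computation concrete, here is a minimal sketch of a finite acceptor; the particular automaton (strings over {a, b} with an even number of a's) is an arbitrary toy example, not one used later in the thesis.

```python
# A toy DFA over {a, b} accepting strings with an even number of a's.
# The same finite table handles inputs of every length: a uniform computation.
TRANSITIONS = {("even", "a"): "odd", ("even", "b"): "even",
               ("odd", "a"): "even", ("odd", "b"): "odd"}
START, ACCEPTING = "even", {"even"}

def accepts(string):
    state = START
    for symbol in string:              # read the input tape one symbol at a time
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING          # accept/reject observed only at the end

assert accepts("abba") and not accepts("ab")
```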
Finite-state machines are also an appealing abstraction from the point of view of dynamical systems. In particular, FSM are useful devices for representing the behavior of a system which takes a sequence of inputs and produces a sequence of outputs by following a set of rules determined by the internal state of the system. From this point of view the emphasis sometimes moves away from the input-output relation defined by the machine and focuses more on the particular state transition rules that define the system. The relation between input and output sequences can then be considered as observable evidence of the inner workings of the system, especially in the case where the internal state of the system cannot be observed but the input and output sequences are.

A key ingredient in the modelling of many dynamical systems is stochasticity. Either because exact modelling is beyond current understanding for a particular system or because the system is inherently probabilistic, it is natural and useful to consider dynamical systems exhibiting a (partially) probabilistic behavior. In the particular case of systems modelled by FSM, this leads to the natural definition of probabilistic finite-state machines. The stochastic component of probabilistic FSM usually involves the transitions between states and the generation of observed outputs, both potentially conditioned by the given input stimulus. A special case with intrinsic interest is that of autonomous systems, where the machine evolves following a set of probabilistic rules for transitioning between states and producing output symbols, without the need for an input control signal.

All these probabilistic formulations of FSM give rise to interesting generative models for finite strings and transductions, i.e. pairs of strings. Taking them back to the world of computation, one can use them as devices that assign probabilities to (pairs of) strings, yielding simple models for computing probability distributions over (pairs of) strings. Furthermore, in the domain of language theory, probabilistic FSM define natural probability distributions over regular languages which can be used in applications, e.g. modelling stochastic phenomena in natural language processing. Besides the representation of computational devices, probabilistic FSM can also be used as probabilistic generators, that is, models that specify how to randomly generate a finite string according to some particular distribution. Both as generators and as devices for computing probability distributions over strings, probabilistic FSM yield a rich class of parametric statistical models over strings and transductions.

Synthesizing a finite-state machine with certain prescribed properties is, in most cases, a difficult yet interesting problem. Either for the sake of obtaining an interpretable model, or with the intent of using it for predicting and planning future actions, algorithms for obtaining FSM that conform to some behavior observed in a real-world system provide countless opportunities for applications in fields like computational biology, natural language processing, and robotics. The complexity of the synthesis problem varies a lot depending on several particular details, like what class of FSM the algorithm can produce as hypothesis, and in what form the constraints are given to the algorithm. Despite this variety, most of these problems can be naturally cast into one of the frameworks considered in learning theory, a field at the intersection of computer science and statistics that studies the computational and statistical aspects of algorithms that learn models from data. Broadly speaking, the goal of the statistical side of learning theory is to argue that the model resulting from a learning process will not only represent the training data accurately, but that it will also be a good model for future data. The computational side of learning theory deals with the algorithmic problem of choosing, given a very large, possibly infinite set of hypotheses, a (quasi-)optimal model that agrees, in a certain sense, with the training data.

Designing and analyzing learning algorithms for several classes of FSM under different learning frameworks is the object of this thesis. The probabilistic analogs of DFA and NFA – known as PDFA and PNFA – are the central objects of study. The algorithmic principles used to learn those classes are also generalized to other classes, including probabilistic transducers and generalized acceptors whose output is real-valued instead of binary-valued. Our focus is on efficient algorithms, both from the computational and statistical point of view, and on proving formal guarantees of efficiency in the form of polynomial bounds on the different complexity measures involved.

The first part of the thesis discusses algorithms for learning Probabilistic Deterministic Finite Automata (PDFA) and generalizations thereof. These objects define probability distributions over sets of strings by means of a Markovian process with finite state space and partial observability. The statement of our learning tasks will generally assume that the training sample was actually generated by some unknown PDFA. In this setting, usually referred to as the realizable case, the goal of a learning algorithm is to recover a distribution close to the one that generated the sample using as few examples and as little running time as possible.
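For concreteness, here is a toy sketch of how a PDFA assigns probabilities to strings; all numbers are made up for illustration, and the formal definitions appear in Chapter 1.

```python
# Toy PDFA over {a, b}: each state has a stopping probability, and for every
# (state, symbol) pair there is at most one successor with an emission
# probability. The probability of a string is the product of the
# probabilities along its unique path, times the stopping probability of
# the state where the path ends. (At each state, the stopping probability
# plus all outgoing probabilities sum to 1.)
STOP = {0: 0.2, 1: 0.5}
TRANS = {(0, "a"): (0.5, 1), (0, "b"): (0.3, 0),
         (1, "a"): (0.1, 0), (1, "b"): (0.4, 1)}

def probability(string, state=0):
    p = 1.0
    for symbol in string:
        prob, state = TRANS[(state, symbol)]   # deterministic: one successor
        p *= prob
    return p * STOP[state]

print(probability("ab"))   # 0.5 * 0.4 * 0.5 = 0.1
```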
If we insist, and that will be the case here, that the learned distribution is represented by a PDFA as well, then learning is said to be proper. In machine learning, the problem of using randomly sampled examples to extract information about the probabilistic process that generates them falls into the category of unsupervised learning problems. Here, the term unsupervised refers to the fact that no extra information about the examples is provided to the learning process; in particular, we assume that the sequence of states that generated each string is not observed by the algorithm. Thus, we will be focusing on the computational and statistical aspects of realizable unsupervised proper learning of PDFA.

Despite the fact that PDFA are not among the best-known models in machine learning, there are several arguments in favor of choosing distributions generated by PDFA as a target and hypothesis class. The main argument is that PDFA offer a nice compromise between expressiveness and learnability, as will be seen in the chapters to come. The fact that the parameters characterizing learnability in PDFA have been identified means that we also understand which of them are easy to learn, and which ones are hard. Furthermore, PDFA offer two other advantages over other probabilistic models on sequences: inference problems in PDFA are easy due to their deterministic transition structure, and they are much easier for humans to interpret than non-deterministic models.

The most natural approach to learning a probability distribution from a sample is, perhaps, the maximum likelihood principle. This principle states that among a set of possible models, one should choose the one that maximizes the likelihood of the data.
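In symbols, given an i.i.d. sample $x_1, \dots, x_m$ and a parametric family of distributions $\{p_\theta\}$ (notation introduced here only for this formula), the principle selects

$$
\hat{\theta} \;=\; \operatorname*{argmax}_{\theta} \; \prod_{i=1}^{m} p_{\theta}(x_i)
\;=\; \operatorname*{argmax}_{\theta} \; \sum_{i=1}^{m} \log p_{\theta}(x_i).
$$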

[Figure 1: Examples of probabilistic automata: (a) a PDFA over Σ = {a, b}; (b) a PNFA over Σ = {a, b}. In each case we have states with stopping probabilities and transitions between them labeled with symbols from Σ and transition probabilities. In the case of PDFA there is only one initial state and for each state and symbol there is at most one outgoing transition. In contrast, PNFA have a distribution over initial states, and states can have several outgoing transitions labeled with the same symbol.]

Grounded in the statistical literature, this method is proved to be consistent in numerous situations, a very desirable property when learning probability distributions. However, maximizing the likelihood over hypothesis classes of moderate size can be computationally intractable. Two alternatives exist in such situations: one can resort to efficient heuristics that try to maximize (an approximation of) the likelihood, or one can seek alternative learning methods that do not rely on likelihood maximization. The Expectation–Maximization (EM) algorithm is a heuristic that falls into the former class: it is an iterative algorithm for approximately maximizing the likelihood. Though it is not guaranteed to converge to the true optimum, EM is widely used because in practice it performs quite well and is easy to adapt to new probabilistic models. The second class of alternatives is usually more model-specific, since in general such methods rely on particular properties of the model class and formulate learning goals that do not involve the likelihood of the training data directly.

The algorithms we will give for learning PDFA fall into this second class. In particular, they rely heavily on the fact that the target is defined by a DFA with transition and stopping probabilities. The general principle will be to learn an approximation of this DFA and then estimate the corresponding probabilities. Though this does not involve likelihood in any way, it is not a straightforward task either. The general principle on which these algorithms are based is called state-merging, and though it resembles an agglomerative clustering scheme, it has its own peculiarities.

An essential obstruction is that learning PDFA is known to be a computationally hard problem in the worst case. Abe and Warmuth showed in [AW92] that learning probabilistic automata in time polynomial in the size of the alphabet is not possible unless NP = RP. Nonetheless, they also showed that learning with exponential time is possible given only a sample of size polynomial in the number of states and the alphabet size – the barrier is computational, not information-theoretic. Later on, Kearns et al. [KMRRSS94] gave evidence that even for alphabets of size two learning PDFA may be difficult: it is shown there that noisy parities, a class presumed to be hard to PAC learn, can be represented as PDFA. With these restrictions in mind, the key to obtaining polynomial time bounds on PDFA learning is to realize that the size of the alphabet and the number of states do not accurately represent the hardness of learning a particular automaton. A third parameter called distinguishability was introduced in [RST98] for obtaining polynomial bounds for learning acyclic PDFA.

In the specific hard cases identified by previous lower bounds this distinguishability is exponential in the number of states, while in many other cases it is just constant or polynomial in the number of states and alphabet size. Thus, introducing this third parameter yields upper bounds that scale with the complexity of learning any particular PDFA – and in the worst cases, these upper bounds usually match the previous lower bounds. It should be noted here that this is neither a trick nor an artifact. In fact, it responds to a well-established paradigm in parametrized complexity that seeks to identify parametrizations of computationally hard problems yielding insights into how the complexity scales between the different sub-classes of that problem.
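For concreteness, the $L_\infty$ flavor of this parameter used in the first part of the thesis can be written as follows, where $D_q$ denotes the distribution over strings generated starting from state $q$ (precise definitions appear in Chapter 1):

$$
\mu \;=\; \min_{q \neq q'} \; \lVert D_q - D_{q'} \rVert_\infty
\;=\; \min_{q \neq q'} \; \sup_{x \in \Sigma^\star} \bigl| D_q(x) - D_{q'}(x) \bigr| .
$$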

A surprising fact about PNFA is that they can represent probability distributions that cannot be represented by any PDFA. This is in sharp contrast with the acceptor point of view, in which DFA and NFA can represent exactly the same class of languages. Though the same lower bounds for learning PDFA apply to learning PNFA as well, the panorama with the upper bounds is more complicated. In particular, since PNFA are strictly more general and have an underlying non-deterministic transition system, state-merging algorithms for learning PDFA do not provide a practical solution for learning PNFA. This is because, although one can approximate every PNFA by a large enough PDFA, the obvious learning algorithm based on this reduction ends up not being very efficient. Said differently, the clustering-like approach which lies at the heart of state-merging algorithms is not useful for PNFA because clustering states in the latter yields a number of clusters much larger than the number of states of the original FSM. In fact, almost all existing efficient algorithms for learning non-deterministic FSM rely on algebraic and spectral principles very distant from the state-merging paradigm used in learning algorithms for PDFA.

This striking difference between the techniques employed for learning PDFA and PNFA will establish a central division in the contributions of this thesis. While state-merging algorithms for learning PDFA are in essence greedy and local, spectral methods for learning PNFA perform global operations to infer a state-space all at once. Though these two approaches share some intuitions at different levels, the content and organization of this thesis will be mostly shaped by the differences between them. In particular, the following three key factors will drive our explorations and determine their limits.

The first factor is the difference in the types of FSM that can be learned with these two techniques. While state-merging algorithms can be extended to any machine with an underlying deterministic structure that can somehow be inferred from statistical evidence, spectral methods are able to construct a non-deterministic state space defining a machine that matches a set of concrete observations with a particular algebraic structure.

The second factor arises from the first one by observing that both methods differ not only in the type of machines they can learn, but also in the kind of information they require about the target. The state-merging paradigm is based upon telling states apart using a sufficiently large amount of information generated from each state. On the other hand, the spectral approach requires a particular set of observations to be (approximately) known. This difference will set a distinction between the kinds of learning frameworks where each of the methods can be applied.

The last factor is a formal one: different learning algorithms require different analysis techniques. In this respect, the difference between the two methods is abysmal. Statistical tests and inductive greedy arguments are the main tools used to reason about state-merging algorithms. In contrast, the analysis of spectral methods and its variations relies on perturbation results from linear algebra and convex optimization. There is, however, a common tool used in both types of analyses: concentration inequalities for probabilities on product spaces. In essence, these technical disparities can be summarized by saying that state-merging algorithms are combinatorial in nature, while spectral methods are heavily based on linear algebra primitives. This is also reflected in the way both methods are usually implemented: while state-merging algorithms tend to require specialized libraries for graph data structures, spectral methods are easy to implement with common mathematical software packages for matrix computations.

Using the problem of learning FSM as an underlying plot, we present results along these two very different lines of work in a single thesis, with the hope that it will help researchers in each of these separate areas become acquainted with techniques used in the other area for solving similar problems.

In the next two sections we provide a summary of the contributions presented in this thesis and a detailed account of previous work directly related to the topics addressed here.

Overview of Contributions

In the first four chapters of this dissertation we study the problem of learning PDFA defining probability distributions over some free monoid Σ* by using an algorithmic paradigm called state-merging. The main idea behind state-merging algorithms is that, since the underlying transition graph of the target is a DFA, one can recover this DFA by starting from the initial state, greedily conjecturing new states reached from previous states by a single transition, and testing whether these are really new or correspond to already discovered states – in which case a merge of states occurs.

Chapter 1 begins by describing how to implement a simple state-merging algorithm in a learning framework based on statistical queries. In such a framework, instead of a sample containing i.i.d. examples from the target distribution, the learning algorithm is given access to an oracle that can give approximate answers to simple queries about the target distribution. Among other things, this has the advantage of separating algorithmic and statistical issues in learning proofs. In particular, using this method we are able to give a full proof of the learnability of PDFA which is much shorter than other existing proofs and can actually be taught in a single session of a graduate course on learning theory or grammatical inference. This simplicity comes at the price of our proof being slightly less general than previous PDFA learning results, in that our learning guarantees are only in terms of total variation distance instead of relative entropy, and that our algorithm requires one more input parameter than previous ones. Though these two weaknesses can be suppressed by using much longer proofs, we chose to present a simple, short, and self-contained proof instead. In the second part of Chapter 1 we take the characterization of state-merging algorithms in terms of statistical queries and use it to prove a lower bound on the total number and accuracy of the queries required by an algorithm capable of learning the class of all PDFA. This is the first lower bound on learning PDFA in which the distinguishability of the target appears explicitly. Furthermore, the fact that a key ingredient in constructing our lower bound are PDFA realizing distributions over parities sheds light on an interesting fact: though previous lower bounds have suggested that learning PDFA is hard because they can realize noisy parities, an obstruction to the efficiency of state-merging methods is actually realized by noiseless parities. The main conclusion one can draw from this result is that any general approach for learning the underlying structure of a PDFA will essentially face these same limitations. Results presented in this chapter have been published in the conference paper [BCG10a] and the journal paper [BCG13].
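The following schematic may help visualize the state-merging loop described at the beginning of this overview. All names here are placeholders rather than the thesis's actual procedures; the `similar` test plays precisely the role assigned to statistical queries in Chapter 1.

```python
# Schematic state-merging loop (placeholder names, not the exact algorithm).
# `expand(state, symbol)` conjectures the state reached by one transition;
# `similar(s, t)` is the statistical test comparing the distributions
# generated from two candidate states; `merge(t, s)` identifies them.

def learn_pdfa_structure(alphabet, initial, expand, similar, merge):
    safe = {initial}                           # states confirmed distinct
    frontier = [(initial, a) for a in alphabet]
    while frontier:
        state, symbol = frontier.pop()
        candidate = expand(state, symbol)      # conjecture a new state
        if candidate is None:
            continue
        for s in safe:
            if similar(s, candidate):          # same target state: merge
                merge(candidate, s)
                break
        else:                                  # genuinely new: promote it
            safe.add(candidate)
            frontier.extend((candidate, a) for a in alphabet)
    return safe
```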
The main role of statistical queries in the process of learning a PDFA via a state-merging approach is to test whether two states in the hypothesis correspond to the same state in the target. Answering this question using examples drawn from the target distribution involves testing whether two samples of strings were generated by the same distribution. Chapter 2 addresses the problem of designing statistically and computationally efficient tests for answering this type of question, and of proving formal guarantees on the success of such tests. We derive different tests for solving this task, all of them using a variant of the distinguishability called prefix-distinguishability which yields more statistically efficient distinction in most cases. The simplest versions of our tests are based on the Vapnik–Chervonenkis (VC) theory of uniform convergence of empirical frequencies to their expected values. These tests are not new, but we give a simple and unified treatment. Motivated by a weakness of VC-based tests at certifying equality, we propose to use tests based on Efron's bootstrap to improve the statistical efficiency of this task. We demonstrate the advantages of our approach with an extensive set of experiments, and provide formal proofs showing that in the worst case bootstrap-based tests will perform similarly to VC-based tests. Our analyses provide the first finite-sample bounds for confidence intervals of bootstrap estimators from distributions over strings.

In the second part of this chapter we address an essential algorithmic question of distribution similarity testing: namely, whether whole samples need to be stored in memory in order to conduct statistically correct tests. We show that accurate testing can be performed by storing just a summary of both samples, which can be constructed on-line by observing each example only once, storing its relevant information in the summary, and forgetting it. Data structures performing these operations are called sketches and have been extensively studied in the data streams computing paradigm. Assuming the existence of such data structures, we show how to achieve a reduction by a square-root factor in the total memory needed to test distribution similarity. A by-product of this reduction in memory usage is also a reduction of the time needed for testing, due to a reduction of the number of comparisons in testing routines. We give similarity tests with formal guarantees based on sample summaries using both VC bounds and the bootstrap principle.
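To fix ideas, here is a minimal sketch of a pooled-resampling (bootstrap/permutation style) two-sample test on an l∞-type statistic over string frequencies. It only illustrates the general flavor of such tests; it is not the specific procedure analyzed in Chapter 2.

```python
# Illustrative resampling two-sample test on the maximum difference of
# empirical string frequencies (an l-infinity-style statistic). The
# resampling scheme and threshold are simplifications for illustration.
import random
from collections import Counter

def linf_stat(xs, ys):
    cx, cy = Counter(xs), Counter(ys)
    keys = set(cx) | set(cy)
    return max(abs(cx[k] / len(xs) - cy[k] / len(ys)) for k in keys)

def resampling_test(xs, ys, B=1000, alpha=0.05):
    observed = linf_stat(xs, ys)
    pooled = list(xs) + list(ys)
    count = 0
    for _ in range(B):                     # resample under the null hypothesis
        bx = random.choices(pooled, k=len(xs))
        by = random.choices(pooled, k=len(ys))
        if linf_stat(bx, by) >= observed:
            count += 1
    return count / (B + 1) < alpha         # True = reject "same distribution"
```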

Our bootstrap-based tests using summarized samples are the first of their kind and could be useful in many other tasks as well. Some of these results have been published without proofs in a conference paper [BCG12b].

Similarity tests that use information stored in sketches built from an on-line stream of examples are the first step towards a full on-line implementation of the state-merging paradigm for learning PDFA from data streams. Building such an implementation on top of the tests from Chapter 2 is the goal of Chapter 3. The main result of that chapter is a learning algorithm for PDFA from examples arriving in a data stream format. This requires the algorithm to process each new item in constant time and use an amount of memory that is only allowed to grow slowly with the number of processed items. We show that our algorithm satisfies these stringent requirements; one of the key contributions needed to achieve this result is the design of new sketching data structures to find frequent prefixes in a stream of strings. In addition, we show that if the stream was generated from a PDFA then our algorithm is a proper PAC learner. Since in a real streaming environment the data-generating process might not be stationary, we also give a change detector for discovering changes in the stream, so that our learning algorithm can be adapted to react whenever such a change occurs. We also provide an algorithm for efficiently searching for the correct parameters – number of states and distinguishability – required by the state-merging procedure. A preliminary version of these results appeared in a conference paper [BCG12b]. An extended journal version of this work is currently under review [BCG12a].

Chapter 4 is the last one of the first part of the thesis. In it we extend the state-merging paradigm to Generalized PDFA (GPDFA). Roughly speaking, these are defined as any probabilistic finite-state machine defining a probability distribution over a free monoid of the form (Σ × X)*, where Σ is a finite alphabet and X is an arbitrary measure space, and whose state transition behavior is required to be deterministic when projected on Σ*. Examples include machines that model the time spent in each transition with some real random variable, in which case X = R, and a sub-class of sequential transductions, in which case X = ∆ is a finite output alphabet. We begin by taking an axiomatic approach, assuming that we can test whether two states in a hypothesis GPDFA are equal and that we can estimate the probability distributions associated with X in each transition. Under such assumptions, we show a full PAC learning result for GPDFA in terms of relative entropy. Finally, we give specific implementations of this algorithm for the two examples discussed above. This chapter is based on some preliminary results presented in [BCG10b].

In the second part of this dissertation we study learning algorithms based on spectral decompositions of Hankel matrices. A Hankel matrix encodes a function f : Σ* → R in such a way that the algebraic structure of the matrix is intimately related to the existence of automata computing f. Since this is true for non-deterministic automata and even more general classes of machines, these methods can be used to derive learning algorithms for problems where state-merging algorithms are not useful. In total, this part comprises five chapters. In the first of those, Chapter 5, the notion of Hankel matrix is introduced.
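In its standard form, which Chapter 5 makes precise, the Hankel matrix of a function f is the bi-infinite matrix indexed by prefixes and suffixes

$$
H_f \in \mathbb{R}^{\Sigma^\star \times \Sigma^\star}, \qquad
H_f(u, v) \;=\; f(uv) \quad \text{for all } u, v \in \Sigma^\star,
$$

and a classical result states that the rank of $H_f$ is finite if and only if f is computed by some WFA, in which case the rank equals the number of states of a minimal such automaton.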
We then use this concept to establish a duality principle between factorizations of Hankel matrices and minimal Weighted Finite Automata (WFA), from which a derivation of the spectral method follows immediately. The algorithmic key to spectral learning – and the key to its naming as well – is the use of the singular value decomposition (SVD) for factorizing Hankel matrices. Perturbation bounds known for SVD can be used to obtain finite-sample error analyses for this type of learning algorithm. In the last part of Chapter 5 we give several technical tools that will be useful for obtaining such bounds in subsequent chapters. The purpose of collecting these tools and their proofs here is twofold. The first is to provide a comprehensive source of information on how these technical analyses work. This is useful because all the published analyses so far are for very specific situations, and it is sometimes difficult to grasp the common underlying principles. The second reason for re-stating all these results here is to give them in a general form that can be used for deriving all the bounds proved in subsequent chapters. Some of this material has appeared in the conference papers [BQC12; BM12].

The first application of the spectral method is given in Chapter 6. In it we derive a PAC learning algorithm for PNFA that can also be applied to other classes of stochastic weighted automata. A full proof of the result is given. This involves obtaining concentration bounds on the estimation accuracy of arbitrary Hankel matrices for probability distributions over Σ*, and applying bounds derived in the previous chapter for analyzing the total approximation error on the output of the spectral method.
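As an aside, the following sketch implements one standard variant of the SVD-based recipe, assuming estimated Hankel blocks over a fixed basis of prefixes P and suffixes S are given. The names and the exact formulas follow one common presentation of the method and are not lifted verbatim from Chapter 5.

```python
# Minimal sketch of the spectral method for WFA (one standard variant).
# Assumed inputs: H[u, v] ~ f(uv) over a prefix/suffix basis, Hs[sigma] with
# Hs[sigma][u, v] ~ f(u sigma v), hP[u] ~ f(u) (empty suffix), hS[v] ~ f(v)
# (empty prefix), and a target number of states n.
import numpy as np

def spectral_wfa(H, Hs, hP, hS, n):
    U, _, Vt = np.linalg.svd(H, full_matrices=False)
    V = Vt[:n].T                          # top-n right singular vectors
    pinv = np.linalg.pinv(H @ V)          # (H V)^+ acts as a pseudo-inverse
    alpha0 = hS @ V                       # initial weight vector
    alphaInf = pinv @ hP                  # final weight vector
    A = {s: pinv @ Hsig @ V for s, Hsig in Hs.items()}   # transition operators
    return alpha0, A, alphaInf

def evaluate(alpha0, A, alphaInf, string):
    v = alpha0
    for symbol in string:                 # f(x) = alpha0^T A_x1 ... A_xk alphaInf
        v = v @ A[symbol]
    return v @ alphaInf
```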

Our analysis gives a learning algorithm for the full class of PNFA, and in this sense it is the most general result to date on PAC learning of probabilistic automata. We also show how to obtain reductions for learning distributions over prefixes and substrings using the same algorithm. Though the full analysis is novel and unpublished, some technical results have been published in a conference paper [LQBC12] and some others are currently under review for publication in a journal [BCLQ13].

Chapter 7 gives a less straightforward application of spectral learning to probabilistic transducers. There we show how to use the spectral method for learning a conditional distribution on input-output pairs of strings of the same length over (Σ × ∆)*. Our main working assumption is that these conditional probabilities can be computed by a probabilistic FSM with input and output tapes. We use most of the tools from the two previous chapters while at the same time having to deal with a new actor: the probability distribution DX that generates the input sequences in the training sample. In a traditional distribution-free PAC learning setting we would let DX be an arbitrary distribution. We were not able to prove such general results here; and in fact, the general problem can become insurmountably difficult without making any assumptions on the input distribution. Therefore, our approach consists of imposing some mild conditions on the input distribution in order to make the problem tractable. In particular, we identify two non-parametric assumptions which guarantee that the spectral method will succeed with a sample of polynomial size. A preliminary version of these results was published in [BQC11] as a conference paper.

The algebraic formulation of the spectral method is in sharp contrast with the analytic formulation of most machine learning algorithms, where some optimization problem is solved. Despite its simplicity, the algebraic formulation has one big drawback: it is not clear how to incorporate any type of prior knowledge into the algorithm besides the number of states in the hypothesis. On the other hand, a very intuitive way to enforce prior knowledge in optimization-based learning algorithms is by adding regularization terms to the objective function. These facts motivate Chapter 8, where we design an optimization-based learning algorithm for WFA founded upon the same modeling principles as the spectral method. In particular, we give a non-convex optimization algorithm that shares many statistical properties with the spectral method, and then show how to obtain a convex relaxation of this problem. These results are mostly based on those presented in [BQC12].

Chapter 9 explores the possibility of using the spectral method for learning non-stochastic WFA. We formalize this approach in a regression-like setting where the training dataset contains pairs of strings and real labels. The goal is to learn a WFA that approximates the process assigning real labels to strings. Our main result along these lines is a family of learning algorithms that combine spectral learning with a matrix completion technique based on convex optimization. We prove the first generalization bound for an algorithm in this family. Furthermore, our bound holds in the agnostic setting, in contrast with all previous analyses of the spectral method. This chapter is entirely based on the conference paper [BM12].
Though the main concerns of this thesis are its theoretical aspects, most of the algorithms presented here have been implemented and tested on real and synthetic data. The tests from Chapter 2 in the data streams context and the state-merging algorithm for learning PDFA from data streams of Chapter 3 have all been implemented by myself and will be published as open source shortly. The spectral learning algorithms from Chapters 6, 7, and 8 have been extensively used in experiments reported in the papers [BQC11; LQBC12; BQC12; BCLQ13]. These experiments have not been included in the present thesis because the implementations and analyses involved were carried out almost entirely by my co-authors.

Related Work

Countless definitions and classes of finite-state machines have been identified, sometimes with differences of a very subtle nature. A general survey of the different classes of probabilistic automata that will be used in this thesis can be found in [VTHCC05a; VTHCC05b; DDE05]. These papers also include references to many formal and heuristic learning algorithms for these types of machines. Good references on weighted automata and their algorithmic properties are [BR88; Moh09]. Depending on the context, weighted automata can also be found in the literature under other names; the most notable examples are Observable Operator Models (OOM) [Jae00], Predictive State Representations (PSR) [LSS01], and rational series [BR88].

For definitions, examples, and properties of transducers one should consult the standard book [Ber79].

The problem of learning formal languages and automata under different algorithmic paradigms has been extensively studied since Gold's seminal paper [Gol67]. Essentially, three different paradigms have been considered. The query learning paradigm is used to model the situation where an algorithm learns by asking questions to a teacher [Ang87]. An algorithm learns in the limit if, given access to an infinite stream of examples, it converges to a correct hypothesis after making a finite number of mistakes [Gol67]; this model makes no assumption on the computational cost of processing each element in the stream. In the Probably Approximately Correct (PAC) learning model an algorithm has access to a finite random sample and has to produce a hypothesis which is correct on most of the inputs [Val84]. Since the literature on these results is far too large to survey here, we will just give some pointers to the most relevant papers from the point of view of the thesis; the interested reader can dwell upon this field starting with [Hig10] and references therein. In addition, the synthesis of FSM with some prescribed properties was also studied from a less algorithmic point of view even before the formalization of learning theory; see [TB73] for a detailed account of results along these lines.
It was proved in [PW93] that finding a DFA or NFA consistent with a sample of strings labeled according to their membership to a , and whose size approximates the size of a minimal automaton satisfying such constrain, is an NP-hard problem. In addition, the problem of PAC learning regular languages using an arbitrary representation is as hard as some presumably hard cryptographic problems according to [KV94]. These results deal with learning arbitrary regular languages in a distribution independent setting like the PAC framework. On the other hand, [CT04b] showed that it is possible to learn a DFA under a probability distribution generated by a PDFA whose structure is “compatible” with that of the target. Together with the mentioned lower bounds this suggests that focusing on particular distributions may be the only way to obtain positive learning results for DFA. It is not surprising that with some extra effort Clark and Thollard were able to extend their result to an algorithm that actually learns PDFA [CT04a]. Actually, since state-merging algorithms for learning PDFA output a DFA together with transition probabilities, one motivation for studying such algorithms is because they provide tools for learning DFA under certain distributions. Deciding whether to merge two states or not is the basic operation performed in state-merging algo- rithms. ALERGIA and successive state-merging algorithms for PAC learning PDFA relied on statistical tests based on samples of strings. The basic question is whether two samples were drawn from the same distribution; this is sometimes known as the two-sample problem in the literature. Although this prob-

8 lem has been extensively studied in the statistics literature, almost all existing approaches rely on very restrictive modelling assumptions and provide asymptotic results which are not useful for obtaining PAC bounds. Thus, all mentioned PAC learning algorithms relied on ad-hoc tests built around concentration bounds like Hoeffding’s inequality. In this thesis we give a more systematic treatment of this problem. A few similar studies have recently appeared in the machine learning literature, though none of them have been directly applied to probability distributions over strings. From a computational perspective, the paper [BFRSW13] and references therein study the problem of distinguishing two distributions in terms of total variation distance and mean square distance. In [GBRSS12] the authors provide three different statistical tests for the two-sample problem with distributions over reproducing kernel Hilbert spaces; two of them come with finite sample bounds derived from concentration inequalities. They also show how to apply a sub-sampling strategy to reduce the running time of their testing algorithms, an idea that could be applied to data streams as well. See references therein for previous results on kernel- based approaches to the two-sample testing problem. Efron’s bootstrap [Efr79] is a generic technique for statistical inference that, among many other things, can be used for constructing two-sample tests; we refer to the excellent book [Hal92] for an in-depth statistical treatment of the many uses and theoretical properties of the bootstrap. Learning algorithms have found uses of the bootstrap in many contexts. For example, to define the bagging method for ensemble learning [Bre96], and to construct penalty terms for model selection with guarantees [Fro07]. In the context of testing similarity distributions we have only found the paper [FLLRB12]; in it the authors combine kernel-methods with the bootstrap for obtaining two-sample tests with finite sample bounds. The data streams computational model provides an interesting framework for algorithmic problems dealing with huge amounts of data that cannot be stored for later processing [Agg07]. In this model each data item in an infinite stream is presented to the algorithm once, which has to process it in almost constant time and discard it; the total memory used by the algorithm is constrained to grow very slowly with the number of processed items. In recent years the data mining community has embraced the model to represent many big data applications related to industry, social networks, and bioinformatics. The list of mining and learning algorithms that have been ported to this new algorithmic paradigm is too long to be surveyed here; instead we refer to [Gam10; Bif10] for extensive treatments on this subject. Most algorithmic advances on the efficiency of data streams algorithms have been driven forth by the design of sketching data structures; see [Mut05] for a comprehensive list of examples. The problem of learning PDFA has not been addressed before in the data streams framework. However, the problem of testing whether two streams are generated from the same distribution, for which we give testing algorithms, has been considered in [AB12]. The test given by the authors of this manuscript comes without finite sample guarantees, and is not based on the bootstrap like ours. 
The only usage of the bootstrap in a big data scenario that we have found in the literature is the recent paper [KTSJ12]; in it the authors derive bootstrap-like resampling strategies for large amounts of data.

Deterministic classes of transducers and automata with timed transitions are all susceptible to state-merging learning algorithms. A classical algorithm for learning subsequential transducers based on the state-merging paradigm is OSTIA [OGV93]; see [Hig10] for a detailed account of several variations of this algorithm in different contexts. Several models of automata with timing information have been considered in the literature, including acceptors where transitions can specify clock guards imposing restrictions on the time spent in each state [AD94], and stochastic generators modelling the time spent in each state with an exponential random variable [Par87]. The thesis [Ver10] contains a comprehensive list of learning results for these and other related models, most of which are based on state-merging algorithms of some sort; though most are just heuristics, some of them come with learning-in-the-limit guarantees. We have not been able to find PAC learning results in the literature for any of these models.

Learning algorithms based on linear-algebraic principles like the spectral method have been around for quite a long time. The most similar approach in the literature on learning theory of finite automata is the algorithm in [BBBKV00] for learning multiplicity automata using membership and equivalence queries. In the field of control theory, subspace methods are a popular family of algorithms for identification of linear dynamical systems [Nor09]. The basic problem in this setting is to identify a matrix governing the evolution of a discrete-time dynamical system from observations which are linearly related to the state vector of the system. From an automata point of view, this problem corresponds to learning a FSM with a single output symbol and many states, from which at each step we can observe a linear transformation of its state vector.

FSM with a single output symbol and many states, from which at each step we can observe a linear transformation of its state vector. A landmark on this topic that shares several ideas with our spectral method is the N4SID algorithm [VODM94], which is implemented in many software toolkits for system identification. The literature on OOM has also produced some learning algorithms with features common to SVD-based methods [ZJT09]. These models are almost identical to the weighted automata we will consider in this thesis, though the original motivation for studying them was to understand the possible dynamics of linear systems with partial observability. Statisticians addressing the identifiability problem of HMM and phylogenetic trees have been using linear algebra in a similar way for quite a while [Cha96]; an algorithmic implementation of these ideas was presented in [MR05]. Building on these results, the spectral method as we understand it in this thesis was described and analyzed for the first time in [HKZ09] in the context of HMM and in [BDR09] in the context of stochastic weighted automata. The derivation based on a duality principle that we give here is original but essentially yields the same algorithm. A significant difference with [HKZ09] is that, in deriving their algorithm, they make a set of assumptions on the target automaton to be learned which we do not need. With respect to [BDR09], our analysis yields much stronger learning guarantees. Since these two seminal papers were published in 2009, the field of spectral learning algorithms for probabilistic models has exploded. Variations on the original algorithm for learning probabilistic models on sequences include: [SBG10] on learning Reduced Rank HMM where, though still assuming a factorization between transition and emission, the rank restriction from [HKZ09] was relaxed; [SBSGS10] on learning Kernel HMM that model FSM with continuous observations; [BSG11], where a consistent spectral learning algorithm for PSR was given; [Bai11b], which gave an algorithm for learning stochastic quadratic automata in order to force the output of the learning algorithm to be a proper probability distribution; and [FRU12], whose authors gave a new analysis of spectral learning for HMM with improved guarantees by using a tensorized approach and making some extra assumptions on the target machine. Besides variations of HMM, spectral learning algorithms for several classes of graphical models have been given following a very similar approach. Notable examples include: [PSX11; ACHKSZ11] on learning the parameters and structure of tree-shaped latent variable models; algorithms for Latent Dirichlet Allocation [AFHKL12b; AFHKL12a] and other topic models [DRFU12]; and several classes of mixture models [AHK12a; AHHK12; AHK12b]. A recent line of work tries to replace decompositions of matrices with decompositions of tensors in spectral algorithms [AGHKT12]. From a computational linguistics point of view, a natural idea is to generalize algorithms for learning automata to algorithms for learning context-free formalisms. Works along these lines include: [BHD10] on learning stochastic tree automata; the results in [LQBC12] on learning split-head automata grammars; an application of spectral learning to dependency parsing [DRCFU12]; and a learning algorithm for probabilistic context-free grammars [CSCFU12]. Models arising from spectral learning have been used in [CC12] for improving the computational efficiency of parsing.

Part I

State-merging Algorithms


Chapter 1

State-Merging with Statistical Queries

In this first chapter we study the problem of learning PDFA in the Statistical Query model, where learning algorithms are given access to oracles that provide answers to questions related to the target distribution. Studying learning problems in this computational model has certain advantages over the usual PAC model, where the learner is presented with a random sample generated by the target distribution. The most interesting of these advantages is, perhaps, the fact that learning proofs in this model are simpler and much more conceptual than in the PAC model, where learning proofs usually contain a great deal of probability concentration arguments that tend to blur the main ideas behind why and how the algorithm works. Learning results in the PAC model can be easily obtained from learning results in query models once it is shown that those queries can be successfully simulated using randomly sampled examples. One contribution of this chapter is introducing a Statistical Query framework for learning probability distributions and giving an algorithm for learning PDFA in this model. Despite the fact that the result is already known in the literature, we believe that our learning proof – which is just three pages long – illuminates the most important and subtle points in previous learning proofs in the PAC model, and at the same time condenses the facts that make the class of PDFA learnable via state-merging algorithms. Our proof may be taught in a single lecture in a graduate course on learning theory or grammatical inference, a feature that will hopefully make the result more accessible to students and researchers in the field. On the other hand, we give efficient simulations of the queries used in our model, which yield yet another proof of the PAC-learnability of PDFA. Query models are also useful for proving unconditional lower bounds on the complexity of learning problems. The idea is that when calls to an oracle are the only source of information a learning algorithm has about its target, it is possible to control how much information the algorithm is able to gather after a fixed number of queries. In contrast, when learning from examples it is extremely difficult to control how much information an algorithm can extract from a sample in a given amount of computational time. Thus, another contribution of this chapter is a lower bound on the complexity of learning PDFA in the statistical query model which confirms the intuition that the difficulty of the problem depends on the distinguishability of the target distribution. Further consequences and interpretations of our lower bound are discussed at the end of the chapter.

1.1 Strings, Free Monoids, and Finite Automata

The set Σ* of all strings over a finite alphabet Σ is a free monoid with the concatenation operation. Elements of Σ* will be called strings or words. Given x, y ∈ Σ* we will write either x · y or simply xy to denote the concatenation of both strings. Concatenation is an associative operation. We use λ to denote the empty string, which satisfies λ · x = x · λ = x for all x ∈ Σ*. The length of a string x ∈ Σ* is denoted

by |x|. The empty string is the only string with |λ| = 0. For any k ≥ 0 we denote by Σ^k and Σ^{≤k} the sets of all strings of length k and of length at most k, respectively. We use Σ⁺ to denote Σ* \ {λ}. A prefix of a string x ∈ Σ* is a string u such that there exists another string v satisfying x = uv; string v is then a suffix of x. We will write u ⊑ x and v ⊒ x to denote that u and v are respectively a prefix and a suffix of x. A prefix u (suffix v) is called proper if |u| < |x| (|v| < |x|). If x = uv, the pair (u, v) is a partition of x. The concatenation operation is extended to sets of strings as follows: given W₁, W₂ ⊆ Σ*, we define their concatenation as W₁W₂ = {w₁w₂ | w₁ ∈ W₁ ∧ w₂ ∈ W₂}. In the case of a singleton {x} and a set W, we write xW instead of {x}W. A set W ⊆ Σ* is prefix-free if for every x ∈ W and any proper prefix u ⊑ x we have u ∉ W. Another way to say the same is that for every x ∈ W we have W ∩ xΣ* = {x}. An important property is that if W is a prefix-free set of strings and W′ is an arbitrary set of strings, then for each w′ ∈ W′ there is at most one w ∈ W such that w ⊑ w′. A string w is a substring of x if there exist a prefix u ⊑ x and a suffix v ⊒ x such that x = uwv. Note that, contrary to other formulations, in our definition the symbols of w must be contiguous in x. The notation |x|_w will be used to denote the number of times that w appears as a substring of x, meaning

|x|_w = |{(u, v) | u ⊑ x ∧ v ⊒ x ∧ x = uwv}| .

Given a string x = x₁ ··· x_t we write x_{i:j} to denote the substring x_i ··· x_j. A Deterministic Finite Automaton (DFA) over Σ is given by a tuple A = ⟨Σ, Q, q0, τ, φ⟩, where Q is a finite set of states, q0 ∈ Q is a distinguished initial state, τ : Q × Σ → Q is a partial transition function, and φ : Q → {0, 1} is the termination function. The number of states |Q| of A is sometimes denoted by |A| or simply n. Note that for some pairs (q, σ) the transition τ(q, σ) might not be defined. When τ is a total function we say that A is complete. Giving φ is equivalent to giving the set F = φ⁻¹(1) of accepting states. The transition function can be inductively extended to a partial function τ : Q × Σ* → Q by setting τ(q, λ) = q for all q ∈ Q and τ(q, σx) = τ(τ(q, σ), x), where the value is undefined whenever τ(q, σ) is not defined. The characteristic function of A is fA : Σ* → {0, 1} with fA(x) = φ(τ(q0, x)), where we assume that φ returns 0 on an undefined output of τ. The language of A is the set L(A) = fA⁻¹(1) ⊆ Σ*. It is easy to modify A into a complete DFA with the same language by adding a rejecting state, with every transition from it going to itself, and all previously undefined transitions going to this new state. Given q ∈ Q we define the set of all strings that go from q0 to q, and the set of all strings that go from q0 to q without visiting q in any intermediate step:

A⟦q⟧ = {x | τ(q0, x) = q}
A[q] = {x | τ(q0, x) = q ∧ ∀y, z : x = yz → (|z| = 0 ∨ τ(q0, y) ≠ q)}

Note that by definition A[q] is always a prefix-free set. A Probabilistic Deterministic Finite Automaton (PDFA) over Σ is a tuple A = ⟨Σ, Q, q0, τ, γ, φ⟩, where Q, q0, and τ are defined as in a DFA. In addition we have the transition probability function γ : Q × Σ → [0, 1] and the stopping probability function φ : Q → [0, 1]. These functions must satisfy the following constraints: if τ(q, σ) is undefined, then γ(q, σ) = 0; and for all q ∈ Q, we have φ(q) + ∑_{σ∈Σ} γ(q, σ) = 1. In a similar way as we did for a DFA, we can define an extension γ : Q × Σ* → [0, 1] as follows: γ(q, λ) = 1, and γ(q, σx) = γ(q, σ) · γ(τ(q, σ), x). We note that if τ(q, x) is undefined, then we will get γ(q, x) = 0. The function computed by A is fA : Σ* → [0, 1] defined as fA(x) = γ(q0, x) · φ(τ(q0, x)), where φ(τ(q0, x)) = 0 for an undefined τ(q0, x). It is well known that if for all q ∈ Q there exists some x ∈ Σ* such that φ(τ(q, x)) > 0, then fA is a probability distribution over Σ*; that is, ∑_{x∈Σ*} fA(x) = 1. Given any q ∈ Q we write Aq = ⟨Σ, Q, q, τ, γ, φ⟩ to denote the PDFA obtained from A by taking q to be the starting state. If D is a distribution over Σ* that can be realized by a PDFA – that is, there exists A such that fA = D – then we use A_D to denote any minimal PDFA for D. Here, minimal means a PDFA with the smallest number of states. The smallest non-zero stopping probability of a PDFA A will be denoted by πA = min_{q∈φ⁻¹((0,1])} φ(q). The expected length of strings generated by a PDFA A is LA = ∑_{x∈Σ*} |x| fA(x). If D is an arbitrary distribution over strings, we shall write LD to denote its expected length.
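To make these definitions concrete, here is a minimal sketch (ours, not from the thesis) of a PDFA represented with Python dictionaries; all names are hypothetical. It evaluates fA(x) = γ(q0, x) · φ(τ(q0, x)), returning 0 whenever an undefined transition is reached.

    class PDFA:
        def __init__(self, q0):
            self.q0 = q0
            self.tau = {}     # partial transition function: (state, symbol) -> state
            self.gamma = {}   # transition probabilities: (state, symbol) -> [0, 1]
            self.phi = {}     # stopping probabilities: state -> [0, 1]

        def f(self, x):
            """Probability fA(x) assigned to the string x."""
            q, p = self.q0, 1.0
            for sigma in x:
                if (q, sigma) not in self.tau:
                    return 0.0            # undefined transition: gamma(q, x) = 0
                p *= self.gamma[(q, sigma)]
                q = self.tau[(q, sigma)]
            return p * self.phi.get(q, 0.0)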

An important structural result about distributions generated by PDFA is that the length of strings they generate always follows a sub-exponential distribution. See Appendix A.1 for more information on sub-exponential distributions.

Lemma 1.1.1. For any PDFA A there exists a constant cA > 0 such that P_{x∼A}[|x| ≥ t] ≤ exp(−cA t) holds for all t ≥ 0.

Proof. Let n be the number of states of A. Starting from any state q, the probability of stopping before generating n new alphabet symbols is positive; let π > 0 denote the minimum of this probability over all states. Thus,

P_{x∼A}[|x| ≥ t] ≤ P_{x∼A}[|x| ≥ ⌊t/n⌋ n] ≤ (1 − π)^{⌊t/n⌋} ≤ exp(−π ⌊t/n⌋) .

When D is a distribution realized by a PDFA, we may write cD instead of c_{A_D} because being sub-exponential is clearly a property that depends only on the distribution. If D is any probability distribution over Σ* and W ⊆ Σ* is a set of strings, the probability of W is defined as D(W) = ∑_{x∈W} D(x). The probability under D of generating a word with x ∈ Σ* as a prefix is D(xΣ*). Given distributions D and D′ over Σ*, their supremum distance is defined as

L∞(D, D′) = max_{x∈Σ*} |D(x) − D′(x)| .

The total variation distance between D and D′ is given by

L1(D, D′) = ∑_{x∈Σ*} |D(x) − D′(x)| .

Another commonly used measure of discrepancy between probability distributions is the relative entropy, also known as the Kullback–Leibler divergence, which is not an actual distance because it is not symmetric and does not satisfy the triangle inequality. It is given by the following expression:

KL(D‖D′) = ∑_{x∈Σ*} D(x) log₂( D(x) / D′(x) ) ,

where log₂ means logarithm with base two. Note that KL(D‖D′) can be infinite if D′(x) = 0 for some x such that D(x) > 0.
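For distributions with finite support these three quantities can be computed directly from the definitions; the sketch below is our illustration (not from the thesis), with distributions represented as hypothetical Python dictionaries mapping strings to probabilities.

    import math

    def l_inf(D1, D2):
        support = set(D1) | set(D2)
        return max(abs(D1.get(x, 0.0) - D2.get(x, 0.0)) for x in support)

    def l_1(D1, D2):
        support = set(D1) | set(D2)
        return sum(abs(D1.get(x, 0.0) - D2.get(x, 0.0)) for x in support)

    def kl(D1, D2):
        # Infinite when D2(x) = 0 for some x with D1(x) > 0.
        total = 0.0
        for x, p in D1.items():
            if p > 0.0:
                q = D2.get(x, 0.0)
                if q == 0.0:
                    return math.inf
                total += p * math.log2(p / q)
        return total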

The L∞-distinguishability of a PDFA A is the minimum supremum distance between the distributions generated from each of its states; that is,

µA = min_{q,q′∈Q} L∞(f_{Aq}, f_{Aq′}) ,

where the minimum is taken over all pairs such that q ≠ q′. Without loss of generality we shall always assume that µA > 0, since otherwise there are two identical states in A which can be merged to obtain a smaller PDFA defining the same distribution; iterating this process as many times as required, we always obtain a PDFA with positive L∞-distinguishability. In most cases we shall just write µ when A is clear from the context. For the rest of this chapter we shall use the simpler term distinguishability to refer to this quantity, since we will not consider other distances to compare states in a PDFA. An important class of PDFA are those defining uniform distributions over parity concepts. Let us fix Σ = {0, 1} and equip it with addition and multiplication via the identification Σ ≃ Z/(2Z). Given h ∈ Σ^n, the parity function Ph : Σ^n → Σ is defined as Ph(u) = ⊕_{i∈[n]} h_i u_i, where ⊕ denotes addition modulo 2. Each h defines a distribution Dh over Σ^{n+1} whose support depends on Ph as follows: for u ∈ Σ^n and c ∈ Σ let

Dh(uc) = 2^{−n} if Ph(u) = c, and Dh(uc) = 0 otherwise.

It was shown in [KMRRSS94] that for every h ∈ Σ^n the distribution Dh can be realized by a PDFA with Θ(n) states. See Figure 1.1 for an illustrative example where each state but the last one has stopping probability 0. It is not hard to see that each of these PDFA has distinguishability Θ(2^{−n}).

15 1 1 1 1 (0, 2 ) (0, 2 )(1, 2 ) (0, 2 )

(0, 1 )(1, 1 ) 2 2 (0,1) 1 (1, 2 )

1 (1, 2 )

1 (1, 2 ) (1,1)

1 1 1 (0, 2 )(1, 2 ) (0, 2 )

Figure 1.1: PDFA for the distribution Dh with h = 0101.
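As an illustration (ours, not from the thesis), the distribution Dh can be tabulated directly from its definition; the sketch below builds Dh as a Python dictionary over bit-tuples and checks that it is a probability distribution.

    from itertools import product

    def parity_distribution(h):
        """The distribution D_h over Sigma^{n+1} defined by the parity h."""
        n = len(h)
        D = {}
        for u in product((0, 1), repeat=n):
            c = sum(hi * ui for hi, ui in zip(h, u)) % 2   # P_h(u)
            D[u + (c,)] = 2.0 ** (-n)   # support: strings uc with P_h(u) = c
        return D

    D = parity_distribution((0, 1, 0, 1))   # the example from Figure 1.1
    assert abs(sum(D.values()) - 1.0) < 1e-12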

We will need some notation to refer to subclasses of all the distributions that can be realized by PDFA. To start with, we use PDFA to denote the whole class of these distributions. Now we consider several parameters: Σ, n, µ, L, π, and c, and define the class of distributions PDFA(Σ, n, µ, L, π, c) that can be realized by some PDFA satisfying the constraints given by these parameters. That is, any D ∈ PDFA(Σ, n, µ, L, π, c) satisfies D = fA for some A over Σ such that: |A| ≤ n, µA ≥ µ, LA ≤ L, πA ≥ π, and cA ≥ c. We may also consider superclasses where only some of the parameters are restricted. For example, PDFA(Σ, n) is the class of distributions over Σ* that can be realized by PDFA with at most n states.

1.2 Statistical Query Model for Learning Distributions

The original Statistical Query learning model was introduced in [Kea98]. Algorithms in this framework are given access to an oracle that can provide reasonable approximations to the probabilities of certain events. The model has been used to study trade-offs between computational and information-theoretic limitations of learning algorithms. In particular, the model proved useful as a means of studying resistance against classification noise in the training sample, as well as a means of proving unconditional lower bounds for learning some concept classes. In this section we introduce the statistical query model for learning probability distributions. Our model corresponds to a natural extension of Kearns' original SQ model. Furthermore, in the same way Kearns' model abstracts the behavior of many algorithms in Valiant's well-known PAC framework for learning from labeled examples, our model can be considered an abstraction of the PAC model for learning probability distributions [KMRRSS94]. We begin with a brief recall of the classical SQ framework. In this learning model an algorithm that wants to learn a concept over some set X can ask queries of the form (χ, α), where χ is (an encoding of) a predicate X × {0, 1} → {0, 1} and 0 < α < 1 is some tolerance. Given a distribution D over X and a concept f : X → {0, 1}, a query SQ_f^D(χ, α) to the oracle answers with an α-approximation p̂_χ of p_χ = P_{x∼D}[χ(x, f(x)) = 1]; that is, |p_χ − p̂_χ| ≤ α. Kearns interprets this oracle as a proxy to the usual PAC oracle EX_f^D that returns pairs (x, f(x)) where x ∼ D is drawn independently in each successive call. According to him, the oracle SQ_f^D abstracts the fact that learners usually use examples only to measure statistical properties about f under D. The obvious adaptation of statistical queries for learning distributions over X is to do precisely the same, but without labels. Let D be a distribution over some set X. A statistical query for D is a pair (χ, α) where χ : X → {0, 1} is (an encoding of) a predicate and 0 < α < 1 is some tolerance parameter. The query SQ^D(χ, α) returns an α-approximation p̂_χ of p_χ = P_{x∼D}[χ(x) = 1]. Since χ is the characteristic function of some subset of X, learners can in principle ask the oracle for an approximation to the probability of any event; we will usually identify χ and its characteristic set. However, we will impose a restriction on the possible

characteristic sets that can be queried: we require that χ(x) can be evaluated efficiently for all x ∈ X. We will say that an SQ algorithm is γ-bounded if all the queries it makes have tolerance Ω(γ). This γ may be a function of the inputs of the algorithm or of parameters of the target distribution. The following result gives a simple simulation of statistical queries using examples drawn i.i.d. from D.

Proposition 1.2.1. For any distribution D over X, a statistical query SQ^D(χ, α) can be simulated with error probability smaller than δ using O(α⁻² log(1/δ)) examples drawn i.i.d. from D.

Proof. By Hoeffding's inequality, if p̂_χ is the empirical average of χ over m ≥ (1/2α²) ln(2/δ) examples, then |p_χ − p̂_χ| ≤ α holds with probability at least 1 − δ.

As in Kearns' SQ model, our model can also be used to measure any statistical property of D that can be approximated from a random sample. However, from an algorithmic perspective these queries sometimes fall short on the side of simplicity, in the sense that learning algorithms may need to combine the results of several queries in order to approximate a relevant parameter or make a correct decision. This has led in the past to the definition of more elaborate and informative queries that can themselves be simulated using statistical queries; for example, the change of distribution queries used by [AD98] in their implementation of boosting in the SQ model. The next section introduces a new family of queries that will later prove useful for implementing state-merging methods for learning PDFA.
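Before moving on, here is a minimal sketch (ours; the sampler name is a hypothetical stand-in) of the simulation from Proposition 1.2.1: answer a query by averaging χ over m = O(α⁻² log(1/δ)) fresh samples.

    import math

    def simulate_sq(sample_from_D, chi, alpha, delta):
        """Answer SQ^D(chi, alpha) with failure probability at most delta."""
        m = math.ceil(math.log(2.0 / delta) / (2.0 * alpha ** 2))
        return sum(chi(sample_from_D()) for _ in range(m)) / m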

1.3 The L∞ Query for Split/Merge Testing

This section introduces a new kind of statistical query called the L∞ query. These queries are only defined in the cases where the domain of the target distribution has a particular structure; namely, when the support of D is contained in a free monoid X = Σ*. In the next section it will be shown how state-merging algorithms for learning PDFA can be described rather naturally using such queries. After introducing L∞ queries, we show how these queries can be simulated either using examples or standard statistical queries. Let D be a distribution over Σ* and A ⊆ Σ* a prefix-free subset. We use D^A to denote the distribution over suffixes conditioned on having a prefix in A:

D^A(y) = P_{x∼D}[x = zy | z ∈ A] = D(Ay) / D(AΣ*) .

An L∞ query is a tuple of the form (A, B, α, β), where A, B ⊆ Σ* are (encodings of) disjoint and prefix-free sets, and 0 < α, β < 1 are, respectively, the tolerance and threshold parameters of the query. Let µ = L∞(D^A, D^B) denote the supremum distance between the distributions D^A and D^B. The oracle DIFF_∞^D answers queries DIFF_∞^D(A, B, α, β) according to the following rules:

1. If either D(AΣ*) < β or D(BΣ*) < β, it answers '?'.
2. If both D(AΣ*) > 3β and D(BΣ*) > 3β, it answers with some α-approximation µ̂ of µ.
3. Otherwise, the oracle may either answer '?' or an α-approximation of µ, arbitrarily.

To be precise, the algorithm asking a query will provide A and B in the form of oracles deciding the membership problems for AΣ* and BΣ*. We will say that an L∞ query algorithm is (α, β)-bounded if all the queries it makes have tolerance Ω(α) and threshold Ω(β). A couple of remarks regarding the definition of L∞ queries are necessary. The first one is about the role of β, which will also be clarified by the simulation given in Proposition 1.3.1. The second observation addresses computational issues about the encoding of A and B.

Remark 1.1 (The role of β). An algorithm asking an L∞ query may not know a priori the probability under D of having a prefix in A. It could happen that the region AΣ* had very low probability, which might indicate that a good approximation of D in this region is not necessary in order to obtain a good estimate of D. Furthermore, a large number of examples might be required to obtain a good absolute

approximation to probabilities of the form D^A(x). Thus, β allows a query to fail – i.e. return '?' – when at least one of the regions being compared has low probability.

Remark 1.2 (Representation of A and B). From now on we will concentrate on the information-theoretic aspects of L∞ queries. Our focus will be on the number of queries needed to learn certain classes of distributions, as well as the required tolerances and thresholds. We are not concerned with how A and B are encoded or how membership to them is tested from the code: the representation could be a finite automaton, a hash table, a logical formula, etc. In practice, the only requirement one would impose is that membership testing can be done efficiently.

1.3.1 Simulation from Examples

Similar to what happens with standard Statistical Queries, any L∞ query can be easily simulated with high probability using examples as shown by the following result.

Proposition 1.3.1. For any distribution D over Σ*, a call DIFF_∞^D(A, B, α, β) can be simulated with error probability smaller than δ using Õ(α⁻²β⁻² log(1/δ)) examples drawn i.i.d. from D.

Proof. The simulation begins by taking a sample S of m0 i.i.d. examples from D, where

m0 = max{ (48/α²β) ln(528/α²βδ) , (1/2β²) ln(8/δ) } .

Sample S is then used to obtain a sample S^A from D^A as follows. For each word x ∈ S, check whether x = yz with y ∈ A. If this is the case, add z to S^A. The multiset obtained, S^A = {z : yz ∈ S and y ∈ A}, is a sample from D^A. Note that since A is prefix-free, each word in S contributes at most one word to S^A, and thus all examples in S^A are mutually independent. Similarly, a sample S^B from D^B is obtained. We let m_A and m_B denote the respective sizes of S^A and S^B and write p̂_A = m_A/m0 and p̂_B = m_B/m0. Note that since m0 ≥ (1/2β²) ln(8/δ), by Chernoff p̂_A and p̂_B are β-approximations of D(AΣ*) and D(BΣ*) respectively with probability at least 1 − δ/2. Based on these approximations, our simulation answers '?' if either p̂_A < 2β or p̂_B < 2β. If this is not the case, then we must have m_A, m_B ≥ 2βm0. In this situation the simulation uses S^A and S^B to compute µ̂ = L∞(D̂^A, D̂^B) and returns this approximation. The probability that µ̂ is not an α-approximation of µ is bounded as follows:

P[|µ̂ − µ| > α]
  ≤ P[ ∃x : |D̂^A(x) − D^A(x)| > α/2 ∨ |D̂^B(x) − D^B(x)| > α/2 ]
  ≤ P[ sup_{x∈Σ*} |D̂^A(x) − D^A(x)| > α/2 ] + P[ sup_{x∈Σ*} |D̂^B(x) − D^B(x)| > α/2 ]
  ≤ 4(2m_A + 1) exp(−m_A α²/32) + 4(2m_B + 1) exp(−m_B α²/32)
  ≤ 8(2m0 + 1) exp(−m0 α²β/16) ≤ δ/2 ,

where the bounds follow respectively from: the triangle inequality, the union bound, the Vapnik–Chervonenkis inequality (A.2), the bounds 2βm0 ≤ m_A and m_B ≤ m0, and m0 ≥ (48/α²β) ln(528/α²βδ) together with Lemma A.3.2.
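The following sketch (ours; the sampler and the membership tests are hypothetical stand-ins, and strings are Python strings) mirrors the simulation in the proof above, returning None to represent the '?' answer.

    import math

    def simulate_diff_inf(sample_from_D, in_A, in_B, alpha, beta, delta):
        m0 = math.ceil(max(
            (48 / (alpha**2 * beta)) * math.log(528 / (alpha**2 * beta * delta)),
            (1 / (2 * beta**2)) * math.log(8 / delta)))
        S = [sample_from_D() for _ in range(m0)]
        # Since A and B are prefix-free, each x yields at most one suffix.
        SA = [x[k:] for x in S for k in range(len(x) + 1) if in_A(x[:k])]
        SB = [x[k:] for x in S for k in range(len(x) + 1) if in_B(x[:k])]
        if len(SA) < 2 * beta * m0 or len(SB) < 2 * beta * m0:
            return None   # the '?' answer
        DA = {y: SA.count(y) / len(SA) for y in set(SA)}
        DB = {y: SB.count(y) / len(SB) for y in set(SB)}
        return max(abs(DA.get(y, 0.0) - DB.get(y, 0.0)) for y in set(DA) | set(DB))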

1.3.2 Simulation from Statistical Queries

Now we show that L∞ queries can also be simulated using Statistical Queries. To bound the running time of the simulation we will use the following simple isoperimetric inequality giving a geometric relation between surface and volume of connected sets in trees; see e.g. [HLW06]. Think of Σ* as a tree and let Γ ⊆ Σ* be a finite connected subset. We denote by ∂Γ the boundary of Γ; that is, the set of all vertices in Σ* \ Γ that are adjacent to some vertex in Γ.

Lemma 1.3.2. Any finite connected Γ ⊆ Σ* satisfies |Γ| ≤ |∂Γ|/(|Σ| − 1).

Let D be a distribution over Σ* and A, B ⊆ Σ* disjoint and prefix-free. Furthermore, let α, β ∈ (0, 1). The following theorem shows how to find an approximation of L∞(D^A, D^B) using statistical queries. Its strategy is similar to that of the algorithm used by [KM93] to find the heavy Fourier coefficients of a decision tree using membership queries.

Theorem 1.3.3. An L∞ query DIFF_∞^D(A, B, α, β) can be simulated with O(|Σ|L/α²) calls to SQ^D with tolerance Ω(αβ), where L is the maximum of the expected lengths of D^A and D^B.

Proof. The simulation begins by checking whether SQ^D(AΣ*, β) < 2β or SQ^D(BΣ*, β) < 2β. If this is the case, then either having a prefix in A or having a prefix in B has probability at most 3β under D and the simulation returns '?'. Otherwise, the simulation will compute an approximation µ̂ of µ = L∞(D^A, D^B) such that |µ̂ − µ| ≤ α. Note that if µ ≤ α, then µ̂ = 0 is a safe approximation. In contrast, if µ > α, then it must be the case that µ = |D^A(x) − D^B(x)| for some x such that either D^A(x) or D^B(x) is larger than α. Therefore, it is enough to find a set of strings T that contains every x such that either D^A(x) > α or D^B(x) > α and return the maximum of |D^A(x) − D^B(x)| over T. Now we show how to implement this strategy using statistical queries. Note that hereafter we can assume D(AΣ*) ≥ β and D(BΣ*) ≥ β, and for any string x let us write

f(x) = SQ^D(AxΣ*, αβ/9) / SQ^D(AΣ*, αβ/9)   and   g(x) = SQ^D(Ax, αβ/9) / SQ^D(AΣ*, αβ/9) .

Note that f and g are not deterministic functions but rather "derived" oracles. By Lemma A.3.1 we know that f(x) and g(x) return α/3-approximations of D^A(xΣ*) and D^A(x) respectively. In the first place, our simulation builds a set T^A as follows. Starting with an empty T^A and beginning with the root λ, recursively explore the tree Σ*, and at each node x do: stop exploring if f(x) ≤ 2α/3; otherwise, add x to the current T^A if g(x) > 2α/3, and recursively explore the nodes xσ for each σ ∈ Σ. Now we prove the following two claims about the output of this procedure: (i) for all x ∈ T^A we have D^A(x) > α/3, and (ii) if D^A(x) > α, then x ∈ T^A. We will also bound the size of T^A. For (i) observe that the inclusion condition g(x) > 2α/3 implies D^A(x) > α/3; this implies |T^A| < 3/α. On the other hand, let x be a string with D^A(x) > α. For any prefix y of x we have D^A(yΣ*) > α and f(y) > 2α/3. Thus, our stopping condition guarantees that for each σ ∈ Σ the nodes yσ will be explored. In addition, D^A(x) > α implies g(x) > 2α/3. Hence, when x is reached it will be added to T^A; this is claim (ii). Another set T^B for D^B with similar guarantees is built in the same way. The next step is to show how to use T = T^A ∪ T^B to answer the L∞ query. Let us define a new query

h(x) = | SQ^D(Ax, αβ/6) / SQ^D(AΣ*, αβ/6) − SQ^D(Bx, αβ/6) / SQ^D(BΣ*, αβ/6) | ,

and note that, by Lemma A.3.1, a call to h(x) returns an α-approximation to |D^A(x) − D^B(x)|. The simulation returns µ̂ = max_{x∈T} h(x), where µ̂ = 0 if T is empty. This can be computed using at most O(1/α) queries with tolerance at least Ω(αβ). Finally, we bound the number of statistical queries used by the simulation, and their tolerances. Let R^A be the set of all vertices of Σ* explored when constructing T^A; note that R^A is connected. The number of statistical queries used to build T^A is obviously O(|R^A|). Now let Ŝ^A denote the set of stopping vertices of the process that builds T^A:

Ŝ^A = {xσ | f(x) > 2α/3 ∧ f(xσ) ≤ 2α/3} .

It is clear that Ŝ^A ⊆ R^A. Furthermore, any vertex in ∂R^A is adjacent to some vertex in Ŝ^A, and each vertex in Ŝ^A is adjacent to at most |Σ| vertices in ∂R^A. Thus, by Lemma 1.3.2, we have |R^A| ≤ |Ŝ^A||Σ|/(|Σ| − 1) = O(|Ŝ^A|). We claim that |Ŝ^A| = O(|Σ|L/α²). To prove the claim, first note that by definition Ŝ^A ⊆ S^A, where

S^A = {xσ | D^A(xΣ*) > α/3 ∧ D^A(xσΣ*) ≤ α} .

Now consider the following set

Q^A = {y | D^A(yΣ*) > α/3 ∧ ∀σ D^A(yσΣ*) ≤ α} ,

and observe that it is prefix-free; this implies |Q^A| < 3/α. Now we establish the following correspondence between S^A and Q^A: each x ∈ S^A can be uniquely mapped to some y ∈ Q^A, and via this mapping each y ∈ Q^A is assigned at most |Σ| + (|Σ| − 1)|y| elements of S^A. Indeed, given xσ ∈ S^A one can see that either x ∈ Q^A or xw ∈ Q^A for some suffix w; any ambiguity is resolved by taking the first such w in lexicographical order. Furthermore, given y ∈ Q^A, any element of S^A that is assigned to y by this mapping is of the form yσ or zσ for some (possibly empty) proper prefix z of y = zσ′w with σ ≠ σ′. The claim follows from the observation that Markov's inequality implies |y| ≤ 3L/α for all y ∈ Q^A. Therefore, to build T we explore at most O(|Σ|L/α²) nodes, making at most O(|Σ|L/α²) statistical queries with tolerance Ω(αβ). This concludes the proof.

The following corollary gives a specialization of the previous simulation for the case when a bound on the length of the words generated by D is known.

Corollary 1.3.4. If supp(D) ⊆ Σ^{≤n}, then a call to DIFF_∞^D(A, B, α, β) can be simulated using O(|Σ|n/α) calls to SQ^D with tolerance Ω(αβ).

Proof. Just inspect the previous proof and use the fact that |x| > n implies D^A(xΣ*) = 0 and D^B(xΣ*) = 0.
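The exploration that builds T^A in the proof of Theorem 1.3.3 is easy to sketch in code (ours; the oracle interface sq(event, tol) and the event constructors are hypothetical stand-ins for SQ^D calls on the sets AxΣ* and Ax):

    def build_T(sq, alphabet, A_prefix_event, A_string_event, alpha, beta):
        """Collect the strings x with (approximately) D^A(x) > alpha/3."""
        tol = alpha * beta / 9
        pA = sq(A_prefix_event(""), tol)                  # ~ D(A Sigma*)
        f = lambda x: sq(A_prefix_event(x), tol) / pA     # ~ D^A(x Sigma*)
        g = lambda x: sq(A_string_event(x), tol) / pA     # ~ D^A(x)
        T, stack = [], [""]
        while stack:
            x = stack.pop()
            if f(x) <= 2 * alpha / 3:
                continue                  # stop exploring below x
            if g(x) > 2 * alpha / 3:
                T.append(x)               # then D^A(x) > alpha/3
            stack.extend(x + s for s in alphabet)
        return T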

1.4 Learning PDFA with Statistical Queries

In this section we describe an algorithm for learning PDFA using statistical and L∞ queries. We provide a full proof of this result, as well as bounds on the number of queries used by the algorithm and their respective tolerance and threshold parameters. The accuracy between the target PDFA and the hypothesis PDFA produced by the algorithm is measured in terms of L1, the total variation distance. Two key ideas in the proof are: first, that states in the target which have very low probability of being visited while generating a random string can be safely ignored by a learning algorithm; and second, that having a hypothesis with small relative error on all short strings is enough, because the probability assigned by any PDFA to long strings decreases exponentially with their length. Our algorithm falls into the well-known category of state-merging algorithms. The idea is that at any given time the algorithm maintains a transition graph – which can be regarded as a DFA – with two types of nodes: safes and candidates. The former, and any transitions between them, are assumed to be correctly inferred. The latter are recursively explored to decide whether they represent new safes, ways to reach already existing safes, or are nonexistent or infrequent enough to be safely ignored. Information obtained from queries is used in order to make such decisions. Once the transition structure is learned, the DFA is converted into a PDFA by estimating the transition and stopping probabilities corresponding to each state. In this step, a sink state is introduced to represent all non-existing transitions in the inferred DFA. As input the algorithm receives statistical query and L∞ query oracles, as well as the desired learning accuracy ε and several parameters describing the target distribution: number of states n, alphabet Σ, distinguishability µ, smallest positive stopping probability π, and expected length L. A discussion of why these parameters are needed is deferred to Section 1.6. Full pseudo-code for the algorithm is provided in Algorithm 1.

Algorithm 1: Algorithm for Learning PDFA
Input: n, µ, π, L, Σ, ε, oracles DIFF_∞^D and SQ^D
Output: A PDFA H

Find the smallest k such that SQ^D(1_{|x| > ⌈2^k ln(12/ε)⌉}, ε/24) ≤ ε/8;
Let α ← µ/2, β ← ε/24n|Σ|, and ℓ ← ⌈2^k ln(12/ε)⌉;
Let θ ← (7εβ/72(|Σ| + 1)(ℓ + 1)) · min{β/(L + 1), π};

Initialize the graph of H with a safe qλ and candidates qσ for σ ∈ Σ;
while H contains candidates do
    Choose a candidate qx maximizing SQ^D(H[qx]Σ*, β);
    foreach safe q do
        Make a call to the oracle DIFF_∞^D(H[qx]Σ*, H[q]Σ*, α, β);
        if the answer is '?' then remove qx from the list of candidates and break;
        if the answer µ̂ < µ/2 then merge qx and q and break;
    if qx is still a candidate then promote it to safe and add candidates qxσ for each σ ∈ Σ;

Add a sink state q∞ with τ_H(q∞, σ) ← q∞ and uniform probabilities;
foreach safe q do
    foreach σ ∈ Σ do let γ̂_(q,σ) ← SQ^D(H⟦q⟧σΣ*, θ) / SQ^D(H⟦q⟧Σ*, θ);
    Let φ̂_q ← SQ^D(H⟦q⟧, θ) / SQ^D(H⟦q⟧Σ*, θ); if φ̂_q < π − 3θ/β then let φ̂_q ← 0;
    Let N_q ← φ̂_q + ∑_σ γ̂_(q,σ), and φ_H(q) ← φ̂_q / N_q;
    foreach σ ∈ Σ do
        if τ_H(q, σ) is defined then γ_H(q, σ) ← γ̂_(q,σ) / N_q;
        else let τ_H(q, σ) ← q∞ and γ_H(q, σ) ← γ̂_(q,σ) / N_q;
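A compact sketch (ours) of the structure-building loop of Algorithm 1, with hypothetical oracle interfaces: sq(x, tol) approximates D(H[qx]Σ*) for the candidate reached by the access string x, and diff(x, q, alpha, beta) plays DIFF_∞^D, returning None for '?'. States are named by access strings, which glosses over the fact that H[q] in general contains more than one string.

    def learn_structure(alphabet, sq, diff, mu, beta):
        safes = [""]                              # the initial safe q_lambda
        trans = {}                                # (safe, symbol) -> safe
        candidates = set(alphabet)                # one candidate per transition of q_lambda
        while candidates:
            x = max(candidates, key=lambda c: sq(c, beta))   # heaviest candidate
            candidates.remove(x)
            resolved = False
            for q in safes:
                ans = diff(x, q, mu / 2, beta)
                if ans is None:                   # '?': too infrequent, ignore x
                    resolved = True
                    break
                if ans < mu / 2:                  # indistinguishable: merge into q
                    trans[(x[:-1], x[-1])] = q
                    resolved = True
                    break
            if not resolved:                      # distinguishable from all safes
                safes.append(x)                   # promote to safe ...
                trans[(x[:-1], x[-1])] = x
                candidates.update(x + s for s in alphabet)   # ... and expand
        return safes, trans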

Theorem 1.4.1. If the oracles DIFF_∞^D and SQ^D given to Algorithm 1 correspond to a distribution D ∈ PDFA(Σ, n, µ, L, π), then the execution of the algorithm:

1. runs in time poly(n, |Σ|, log(1/cD)),

2. makes O(n²|Σ|² log(1/cD)) SQ queries with tolerance Ω̃(ε³πcD/n²|Σ|³L),

3. makes O(n²|Σ|) L∞ queries with tolerance Ω(µ) and threshold Ω(ε/n|Σ|), and

4. returns a hypothesis PDFA H such that L1(D, fH ) ≤ ε.

1.4.1 Analysis

The first step in our analysis is identifying sufficient conditions for learning a distribution in terms of the total variation distance. A set of these is given in the following lemma, which basically states that a relative approximation on a massive set is enough.

Lemma 1.4.2. Let D and D′ be distributions over some set X. If there exists a set T ⊆ X such that D(T) ≥ 1 − ε1 and |D(x) − D′(x)| ≤ ε2 D(x) for every x ∈ T, then L1(D, D′) ≤ 2(ε1 + ε2).

Proof. First note that ∑_{x∈T} |D(x) − D′(x)| ≤ ε2 ∑_{x∈T} D(x) ≤ ε2. Then, we have D′(T) ≥ 1 − (ε1 + ε2) because

D(T) − D′(T) ≤ |D(T) − D′(T)| ≤ ∑_{x∈T} |D(x) − D′(x)| ≤ ε2 .

Finally we see that L1(D, D′) ≤ ε2 + (2ε1 + ε2) since

∑_{x∈T̄} |D(x) − D′(x)| ≤ D(T̄) + D′(T̄) ≤ ε1 + (ε1 + ε2) .

The next step is to analyze the structure of the DFA H produced by Algorithm 1. We begin with a technical lemma that will be used in the analysis.

Lemma 1.4.3. Let D be a PDFA with expected length LD. For any DFA A we have ∑_{q∈Q_A} D(A⟦q⟧Σ*) ≤ LD + 1. In particular, ∑_{q∈Q_{A_D}} D(A_D⟦q⟧Σ*) ≤ LD + 1.

Proof. If we define the support of A as the set

supp(A) = {x ∈ Σ* | τ_A(q0, x) is defined} ,

then we see that the collection of sets {A⟦q⟧}_{q∈Q_A} forms a partition of supp(A). Furthermore, since for any x ∈ Σ* \ supp(A_D) we have D(x) = 0, then D(Σ* \ supp(A_D)) = D(supp(A) \ supp(A_D)) = 0, which in turn implies that

∑_{q∈Q_A} D(A⟦q⟧Σ*) ≤ ∑_{q∈Q_{A_D}} D(A_D⟦q⟧Σ*) .

Now note that for any x ∈ supp(A_D) we have x ∈ A_D⟦τ_{A_D}(q0, u)⟧Σ* for any u ⊑ x. Therefore, any such x appears at most |x| + 1 times in the sum ∑_{q∈Q_{A_D}} D(A_D⟦q⟧Σ*). Using the definition of expected length we finally get

∑_{q∈Q_{A_D}} D(A_D⟦q⟧Σ*) ≤ ∑_{x∈Σ*} (|x| + 1) D(x) = LD + 1 .

Now let H′ = H \ q∞ be the transition graph of H where the sink state q∞ and all its incoming transitions are removed; that is, H′ is the graph obtained right after the while loop ends in Algorithm 1. We will show that this transition structure is identical to a substructure of the target PDFA that contains all states that are traversed with at least some probability when generating a random string.

Lemma 1.4.4. Under the same hypotheses of Theorem 1.4.1, if H′ is the DFA defined above, then the following hold:

1. H′ is isomorphic to a subgraph of A_D,
2. D(supp(H′)) ≥ 1 − ε/6,
3. no transition from A_D with γ_{A_D}(q, σ) < β/(LD + 1) is present in H′, and
4. for all states q in H′ we have D(H′⟦q⟧Σ*) ≥ β.

Proof. For convenience we write A = A_D. We will use Hk to denote the graph built by the algorithm after the kth iteration of the while loop, where H0 denotes the initial graph. The notation H′k will be used to denote the subgraph of Hk obtained by removing all candidate nodes and all edges from safe to candidate nodes.

Fact 1: We begin by proving that H′k is a graph isomorphic to a subgraph of A for all k ≥ 0; we proceed by induction on the number of edges in this graph. Before the first iteration, the graph H′0 contains a single safe node and no edges, and is clearly isomorphic to the subgraph of A containing only the initial state. At the beginning of the (k + 1)th iteration the graph H′k contains at most k edges between safe nodes which, by induction, are assumed to yield a graph isomorphic to a subgraph of A. Let qyσ be the candidate chosen during the (k + 1)th iteration. We must consider three different situations:

1. If the candidate has been discarded, the graph remains unchanged: H′_{k+1} = H′k.

2. If candidate qyσ is merged to some safe q, then a new transition from qy to q labeled with σ is added to the graph; since the µ̂ returned by the oracle certifies L∞(D^{Hk[qyσ]}, D^{Hk[q]}) < µ, we conclude that D^{Hk[qyσ]} = D^{Hk[q]}, thus the new edge must be present in A as well, because qy and q have representatives in A by the inductive hypothesis and γ_A(qy, σ) > 0 because the L∞ query did not return '?'.

3. If the candidate is promoted to be a new safe node, we know from the oracle's answers that D(Hk[qyσ]Σ*) > β > 0 and for each safe q already in the graph L∞(D^{Hk[qyσ]}, D^{Hk[q]}) > 0; thus qyσ is a state present in A which is different from the rest of the safes, and by induction there must be a transition in A from qy to the new safe labeled by σ.

Therefore, the graph of safes H′_{k+1} after the (k + 1)th iteration is isomorphic to a subgraph of A, and so is H′ because it is the graph obtained at the end of the iteration of the while loop that removes the last candidate.

Fact 2: In order to show that D(supp(H′)) ≥ 1 − ε/6, we consider two cases. If no candidate is discarded during the while loop, then the graph H′ is complete, and thus D(supp(H′)) = D(Σ*) = 1. Now we show that if some candidate is discarded in the while loop, then D(supp(H′)) ≥ 1 − ε/6. This will follow from the following claim: for any k ≥ 0, if D(supp(H′k)) < 1 − ε/6 then the candidate qx chosen during the (k + 1)th iteration satisfies D(Hk[qx]Σ*) > 3β and will not be discarded. We prove the claim by induction on k. For k = 0 we have D(supp(H′0)) = D(λ) < 1 − ε/6, and since D(λ) + ∑_σ D(σΣ*) = 1, we must have D(σΣ*) > ε/6|Σ| ≥ 4β for some σ. Therefore, the candidate selection rule will select a candidate qσ with D(H0[qσ]Σ*) = D(σΣ*) > 3β during the first iteration, which will not be discarded because D(H0[qλ]Σ*) = 1 > 3β. For the general case, suppose the claim is true for k′ < k and D(supp(H′k)) < 1 − ε/6. Since we have D(supp(H′_{k′})) ≤ D(supp(H′k)) because supp(H′_{k′}) ⊆ supp(H′k), by induction no candidate has been discarded so far and every safe q in H′k satisfies D(Hk[q]Σ*) > 3β. Then, taking into account that there will be at most n|Σ| candidates in the graph, the same argument used in the base case shows that the candidate qx chosen in the (k + 1)th iteration will satisfy D(Hk[qx]Σ*) > 3β and will not be discarded.

Fact 3: Suppose that qyσ is the candidate considered in the kth iteration of the while loop. We know that safe qy must correspond to some state in A. Then we have that

D(Hk[qyσ]Σ*) = D(Hk[qy]Σ*) γ_A(qy, σ) ≤ D(Hk⟦qy⟧Σ*) γ_A(qy, σ) ,

and note that Lemma 1.4.3 implies D(Hk⟦qy⟧Σ*) ≤ LD + 1. Therefore, if γ_A(qy, σ) < β/(LD + 1), then D(Hk[qyσ]Σ*) < β, the candidate will be discarded, and transition (qy, σ) will not be present in H′.

Fact 4: It is obvious that any q ≠ qλ in H′ must have been a candidate which was considered at some iteration k and was not discarded. Therefore, it must be the case that D(H′⟦q⟧Σ*) ≥ D(Hk[q]Σ*) ≥ β.

From now on we will identify H′ with the corresponding subgraph of A_D. The next result gives some information about the parameter ℓ computed by our algorithm: in particular, how many queries are needed to compute it and what is the probability under D of strings longer than ℓ.

Lemma 1.4.5. ℓ is such that D(Σ^{>ℓ}) ≤ ε/6, the number of queries needed to find ℓ is O(log(1/cD)), and ℓ is at most O((1/cD) ln(1/ε)).

Proof. By definition we have D(Σ^{>ℓ}) ≤ ε/8 + ε/24 = ε/6. Because D is realized by a PDFA, Lemma 1.1.1 implies that for all i we have

D(Σ^{>⌈2^i ln(12/ε)⌉}) ≤ exp(−cD ⌈2^i ln(12/ε)⌉) ≤ (ε/12)^{cD 2^i} .

The number k of queries needed to compute ℓ is upper bounded by the largest i such that D(Σ^{>⌈2^i ln(12/ε)⌉}) > ε/12, because for all such i the oracle can return SQ^D(1_{|x| > ⌈2^i ln(12/ε)⌉}, ε/24) = D(Σ^{>⌈2^i ln(12/ε)⌉}) + ε/24 > ε/8. From the equation above we deduce that k = O(log(1/cD)), which implies ℓ = O((1/cD) ln(1/ε)).
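As an aside (ours), the doubling search performed by the first line of Algorithm 1 can be sketched as follows; sq_tail(ell, tol) is a hypothetical SQ^D oracle for the event {|x| > ell}.

    import math

    def find_ell(sq_tail, eps):
        """Doubling search for the length bound ell used by Algorithm 1."""
        k = 0
        while True:
            ell = math.ceil(2 ** k * math.log(12 / eps))
            if sq_tail(ell, eps / 24) <= eps / 8:
                return ell
            k += 1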

The next two lemmas show that the probability assigned to short strings under H has small relative error with respect to D. This will allow us to apply Lemma 1.4.2 to prove the desired approximation bound.

Lemma 1.4.6. Let A = A_D. For every transition (q, σ) ∈ H′ we have |γ_A(q, σ) − γ_H(q, σ)| ≤ 7ε γ_A(q, σ)/8(ℓ + 1). Furthermore, for all q ∈ H′ we have |φ_A(q) − φ_H(q)| ≤ 7ε φ_A(q)/8(ℓ + 1), and in particular if φ_A(q) = 0 then φ_H(q) = 0.

Proof. First note that for all q ∈ H′ and all σ ∈ Σ we have |γ_A(q, σ) − γ̂_(q,σ)| ≤ 3θ/β by Lemmas A.3.1 and 1.4.4. For stopping probabilities |φ_A(q) − φ̂_q| ≤ 3θ/β as well; note that φ̂_q is set to zero if and only if φ_A(q) < π, which by the definition of π implies φ_A(q) = 0. The bounds derived so far imply that |N_q − 1| ≤ 3(|Σ| + 1)θ/β holds for each q ∈ H′. Now suppose (q, σ) ∈ H′. By Lemma 1.4.4 we know that γ_A(q, σ) ≥ β/(LD + 1). Therefore, Lemma A.3.1 yields |γ_A(q, σ) − γ_H(q, σ)| ≤ 9(|Σ| + 1)(LD + 1)θ/β² ≤ 7ε γ_A(q, σ)/8(ℓ + 1). In addition, if φ̂_q > 0 then φ_A(q) ≥ π and we have |φ_A(q) − φ_H(q)| ≤ 7ε φ_A(q)/8(ℓ + 1) as well.

Lemma 1.4.7. For any x ∈ Σ^{≤ℓ} ∩ supp(H′) one has |1 − f_H(x)/D(x)| ≤ ε/6.

Proof. Let x ∈ Σ^{≤ℓ} ∩ supp(H′) and A = A_D. By Lemma 1.4.6 we have |ln(f_H(x)/f_A(x))| ≤ (ℓ + 1) ln(1 + 7ε/8(ℓ + 1)) ≤ 7ε/8, since the computation of f_H(x) uses only transitions in H′ and ln(1 + x) ≤ x for x ≥ 0. Therefore, Lemma A.3.3 with c = 1/3 implies |1 − f_H(x)/f_A(x)| ≤ ε/6.

Now we can finally combine all the results above in order to prove the main theorem of this section.

Proof of Theorem 1.4.1. Let S = supp(H′) and T = S ∩ Σ^{≤ℓ}. Note that T̄ ⊆ S̄ ∪ Σ^{>ℓ}. By Lemmas 1.4.4 and 1.4.5 we have D(T) ≥ 1 − D(S̄) − D(Σ^{>ℓ}) ≥ 1 − ε/3. Thus, by Lemmas 1.4.7 and 1.4.2 we get L1(D, f_H) ≤ 2(ε/3 + ε/6) = ε. The rest of the bounds follow trivially by inspection and Lemma 1.4.5.

1.5 A Lower Bound in Terms of Distinguishability

In this section we prove a lower bound on the number of L∞ queries that any (α, β)-bounded algorithm learning the class of all PDFA must make. In order to do so, we show such a lower bound for a class of distributions defined using parities that can be modeled by PDFA. Our lower bound is unconditional, and applies to proper and improper learning algorithms that try to learn either with respect to relative entropy (KL) or total variation distance (L1). From the main result we deduce another lower bound in terms of the distinguishability that describes a trade-off between the number of queries and the tolerance and threshold parameters. In this section we fix Σ = {0, 1} and let Dn denote the class of all distributions Dh constructed using parities h ∈ Σ^n as defined in Section 1.1. We already know that every distribution in Dn has support in Σ^{n+1} and can be represented by a PDFA with Θ(n) states. Kearns [Kea98] proved that learning parity concepts is a difficult problem in the statistical query model. More precisely, he showed the following.

Theorem 1.5.1 ([Kea98]). Any γ-bounded statistical query algorithm that learns the class of all parity concepts must make at least Ω(2^n γ²) queries.

It is not difficult to translate this result into a lower bound for algorithms that try to learn the class Dn as distributions using our version of statistical queries. On the other hand, if one is interested in learning the class Dn using a state-merging algorithm it is enough to learn the structure of the target parity, since by definition all transition probabilities are uniform. In principle this can be done with an algorithm using only L∞ queries, by a simple adaptation of Algorithm 1. For this task, a reduction argument combining Kearns’s lower bound with the simulation given by Corollary 1.3.4 yields the following lower bound.

Corollary 1.5.2. Any (α, β)-bounded L∞ query algorithm that learns the class Dn with accuracy ε ≤ 1/9 must make at least Ω(2^n α³β²/n) queries.

In this section we prove an improvement on this bound using a direct proof. In particular, we are able to show the following lower bound, which improves the dependence on n and α in the previous one.

Theorem 1.5.3. Any (α, β)-bounded L∞ query algorithm that learns the class Dn with accuracy ε ≤ 1/9 must make at least Ω(2^n α²β²) queries.

An immediate consequence of this theorem is that the same lower bound applies to algorithms that learn the class of all PDFA.

Corollary 1.5.4. Any (α, β)-bounded L∞ query algorithm that learns the class PDFA(n) with accuracy ε ≤ 1/9 must make at least Ω(2^n α²β²) queries.

Theorem 1.5.3 is proved by devising an adversary that answers in a malicious way each query asked by the learning algorithm. The argument is similar in nature to that used in [Kea98] to prove that parities cannot be learned in the Statistical Query model. Basically, we use a probabilistic argument to show that the number of distributions in Dn that are inconsistent with each answer given by the adversary is at most O(1/(α²β²)). Since there are 2^n distributions in Dn, any learning algorithm is unable to distinguish the correct target with o(2^n α²β²) queries. Since all distributions in Dn are far apart from each other, the adversary can always find a distribution which is consistent with every answer given to the learner but still has large error with respect to the hypothesis provided by the learner. The details of the proof are deferred to the end of this section. Now we discuss some corollaries of Theorem 1.5.3. An interesting consequence of our lower bound in terms of α and β is that, by a simple padding argument, it allows us to derive a lower bound in terms of the distinguishability. The dependence on ε in the statement is ignored.

Corollary 1.5.5. Suppose 2 log(1/µ) + 3 ≤ n ≤ (1/µ)^{o(1)}. Let a, b, c ≥ 0 be constants and take α = µ^a · n^{O(1)} and β = µ^b · n^{O(1)}. If an (α, β)-bounded L∞ query algorithm learns the class PDFA(n, µ) using at most n^{O(1)}/µ^c queries, then necessarily 2a + 2b + c ≥ 1.

Proof. Recall the class of distributions Dk from Theorem 1.5.3. For every pair of positive integers m and k, define the class of distributions C_{m,k} as follows: for every distribution D in Dk, there is a distribution in C_{m,k} that gives probability D(x) to each string of the form 0^m x, and 0 to strings not of this form. Every distribution in Dk is generated by a PDFA with at most 2k + 2 states and distinguishability 2^{1−k}. It follows that every distribution in C_{m,k} is generated by a PDFA with at most m + 2k + 2 states and distinguishability 2^{−k}. Assuming without loss of generality that k = log(1/µ) is an integer, for m = n − 2k − 2 we have C_{m,k} ⊆ PDFA(n, µ). Now note that, by an immediate reduction, any (α, β)-bounded L∞ query algorithm that learns PDFA(n, µ) using at most n^{O(1)}/µ^c queries can learn the class Dk with the same number of queries. Therefore, Theorem 1.5.3 implies that necessarily n^{O(1)}/µ^c = Ω(α²β²/µ). Substituting for α and β, and using that 1/n = µ^{o(1)}, the bound yields

1/µ^{c+o(1)} = Ω( µ^{2a+2b} / µ^{1+o(1)} ) .

Therefore, since 1/µ = ω(1), we conclude that 2a + 2b + c ≥ 1.

The next corollary collects the extreme cases of the previous one. These results illustrate the trade-off that algorithms for learning PDFA must face: either ask a lot of queries with moderate tolerance and threshold, or ask a few queries with very small values of tolerance and threshold.

Corollary 1.5.6. Suppose 2 log(1/µ) + 3 ≤ n ≤ (1/µ)^{o(1)}. Then the following hold:

1. If α, β = µ^{o(1)} · n^{O(1)}, then any (α, β)-bounded L∞ query algorithm that learns PDFA(n, µ) must make at least Ω(1/µ^{1−ν}) queries for any constant ν > 0.

2. If an (α, β)-bounded L∞ query algorithm learns PDFA(n, µ) using n^{O(1)} queries, then necessarily α²β² = O(µ · n^{O(1)}).

Proof. Both statements follow by inspection from the previous proof.

1.5.1 Technical Lemmata

Before getting to the proof of Theorem 1.5.3 we must go over a chain of lemmas. We begin with two simple bounds, which are stated without proof.

Lemma 1.5.7. Let X be a real random variable and τ, γ ∈ R. The following hold:

1. if τ > |γ|, then P[|X| > τ] ≤ P[|X − γ| > τ − |γ|], and
2. P[|X| < τ] ≤ P[|X − γ| > γ − τ].

The following technical lemma bounds the probability that the maximum of a finite set of random variables deviates from some quantity in terms of the individual variances. Let {X_i}_{i∈[m]} be a finite collection of random variables with γ_i = E[X_i] and γ_1 ≥ … ≥ γ_m. Let Z = max_{i∈[m]} |X_i| and γ = max_{i∈[m]} |γ_i|.

Lemma 1.5.8. If γ = γ_1, then for all t > 0

P[|Z − γ| > t] ≤ V[X_1]/t² + ∑_{i∈[m]} V[X_i]/t² .

Proof. In the first place, note that by (1) in Lemma 1.5.7 and the definition of γ we have, for any i ∈ [m], P[|X_i| > γ + t] ≤ P[|X_i − γ_i| > t]. Thus, by the union bound and Chebyshev's inequality,

P[Z > γ + t] ≤ ∑_{i∈[m]} V[X_i]/t² .   (1.1)

On the other side, for any i ∈ [m] we have P[Z < γ − t] ≤ P[|X_i| < γ − t] ≤ P[|X_i − γ_i| > γ_i − γ + t], where the last inequality follows from (2) in Lemma 1.5.7. Choosing i = 1 we obtain

P[Z < γ − t] ≤ V[X_1]/t² .   (1.2)

The result follows from (1.1) and (1.2).

For any A ⊆ Σ*, c ∈ Σ and h ∈ Σ^n, we write A_n = A ∩ Σ^n and A_h^c = {x ∈ A_n | P_h(x) = c}. Let A, B ⊆ Σ^{1:n} = Σ^{≤n} ∩ Σ⁺ be disjoint and prefix-free. Our next lemma shows that, for any Dh ∈ Dn, one can write L∞(D_h^A, D_h^B) as the maximum of 2n easily computable quantities. Let p_A = Dh(AΣ*) and p_B = Dh(BΣ*), and note that we have p_A = ∑_{k=1}^n |A_k|/2^k and p_B = ∑_{k=1}^n |B_k|/2^k; that is, these probabilities are independent of h. Now define the following quantities:

X_k^c = |A^c_{h_{1:k}}| / p_A − |B^c_{h_{1:k}}| / p_B .   (1.3)

These quantities can be used to easily compute the supremum distance between D_h^A and D_h^B.

Lemma 1.5.9. With the above notation, the following holds:

L∞(D_h^A, D_h^B) = max_{k∈[n], c∈Σ} |X_k^c| / 2^n .

Proof. First we write, for 1 ≤ k ≤ n, Y_k = max_{y∈Σ^{n+1−k}} |D_h^A(y) − D_h^B(y)|, and note that L∞(D_h^A, D_h^B) = max_{1≤k≤n} Y_k. Now we show that, for k ∈ [n], Y_k = max{|X_k^0|, |X_k^1|}/2^n. First observe that for y ∈ Σ^{n−k} and z ∈ Σ we have D_h^A(yz) = ∑_{x∈A∩Σ^k} Dh(xyz)/p_A. Furthermore, Dh(xyz) = 2^{−n} if and only if P_{h_{1:k}}(x) ⊕ P_{h_{k+1:n}}(y) = z. Thus, for fixed y and z, we have

∑_{x∈A∩Σ^k} Dh(xyz) = |A^0_{h_{1:k}}|/2^n if P_{h_{k+1:n}}(y) = z, and |A^1_{h_{1:k}}|/2^n if P_{h_{k+1:n}}(y) ≠ z .

This concludes the proof.
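As a quick numerical illustration (ours), the quantities X_k^c and the maximum in Lemma 1.5.9 can be computed directly; here A and B are sets of bit-tuples, assumed disjoint and prefix-free.

    def X(A, B, h, k, c, pA, pB):
        """The quantity X_k^c from (1.3)."""
        def count(W):
            return sum(1 for x in W if len(x) == k and
                       sum(hi * xi for hi, xi in zip(h, x)) % 2 == c)
        return count(A) / pA - count(B) / pB

    def linf_via_X(A, B, h):
        """L_inf(D_h^A, D_h^B) as max_{k,c} |X_k^c| / 2^n (Lemma 1.5.9)."""
        n = len(h)
        pA = sum(2.0 ** -len(x) for x in A)
        pB = sum(2.0 ** -len(x) for x in B)
        return max(abs(X(A, B, h, k, c, pA, pB))
                   for k in range(1, n + 1) for c in (0, 1)) / 2 ** n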

Now we will consider the quantities X_k^c as random variables when h ∈ Σ^n is taken uniformly at random. We are interested in the expectation and variance of these quantities.

Lemma 1.5.10. When h is taken uniformly at random, the following hold:

1. E[X_k^c] = |A_k|/(2 p_A) − |B_k|/(2 p_B) + O(1), and
2. V[X_k^c] = O( |A_k|/p_A² + |B_k|/p_B² + (|A_k| + |B_k|)/(p_A p_B) ).

Proof. To begin, note that taking g ∈ Σ^k uniformly at random, the random variables |A^c_{h_{1:k}}| and |A_g^c| follow the same distribution. Thus, we can use g instead of h_{1:k} in the definition of X_k^c. We start by computing E[|A_g^c|], E[|A_g^c|²] and E[|A_g^c||B_g^c|]. Using indicator variables one has

E[|A_g^c|] = ∑_{x∈A_k} P[P_g(x) = c] = |A′_k|/2 + 1_{0̄∈A_k ∧ c=0} = |A_k|/2 + O(1) ,   (1.4)

where A′_k = A_k \ {0̄}. The second term models the fact that P_g(0̄) = 0 for all g. This implies (1). To compute E[|A_g^c|²] we will use the fact that for different x, y ∈ Σ^k \ {0̄} a standard linear algebra argument shows P[P_g(x) = c ∧ P_g(y) = c] = 1/4. Furthermore, note that if either x = 0̄ or y = 0̄, but not both, then P[P_g(x) = c ∧ P_g(y) = c] = 1/2 if c = 0, and it is zero if c = 1. Hence, we have

∑_{x∈A_k} ∑_{y∈A_k\{x}} P[P_g(x) = c ∧ P_g(y) = c] = |A′_k|²/4 − |A′_k|/4 + 1_{0̄∈A_k ∧ c=0} (|A_k| − 1) .   (1.5)

Now, combining (1.4) and (1.5) we obtain

E[|A_g^c|²] = ∑_{x∈A_k} ∑_{y∈A_k\{x}} P[P_g(x) = c ∧ P_g(y) = c] + ∑_{x∈A_k} P[P_g(x) = c] = |A_k|²/4 + O(|A_k|) .   (1.6)

Furthermore, since 1_{0̄∈A_k} 1_{0̄∈B_k} = 0 because A and B are disjoint, similar arguments as before yield

E[|A_g^c||B_g^c|] = ∑_{x∈A_k} ∑_{y∈B_k} P[P_g(x) = c ∧ P_g(y) = c] = |A_k||B_k|/4 + 1_{c=0} ( 1_{0̄∈B_k} |A′_k|/2 + 1_{0̄∈A_k} |B′_k|/2 ) = |A_k||B_k|/4 + O(|A_k| + |B_k|) .   (1.7)

Finally, (1.4), (1.6) and (1.7) can be used to compute V[X_k^c]. The bound in (2) follows from the observation that all terms having |A_k|², |B_k|² or |A_k||B_k| are cancelled.

1.5.2 Proof of the Lower Bound

Finally, we combine the preceding lemmas to prove Theorem 1.5.3.

Proof of Theorem 1.5.3. Suppose L is an (α, β)-bounded L∞ query algorithm that learns the class Dn. Now we describe and analyze an adversary answering any L∞ query asked by L. Let DIFF_∞^D(A, B, α, β) be a query asked by L, where we assume without loss of generality that it uses the smallest tolerance and threshold values allowed by the bounding restriction. Suppose that h ∈ Σ^n is taken uniformly at random and let p_A = ∑_{k=1}^n |A_k|/2^k, p_B = ∑_{k=1}^n |B_k|/2^k, and γ_k^c = E[X_k^c]/2^n, where we use the quantities defined in (1.3). The adversary uses the expressions given by Lemma 1.5.10 to compute these quantities in terms of A and B. It then answers as follows: if either p_A ≤ 2β or p_B ≤ 2β, then answer '?'; otherwise, answer µ̂ = max_{k∈[n],c∈Σ} |γ_k^c|. We claim that in each case the number of distributions in Dn inconsistent with these answers is at most O(1/(α²β²)). If the adversary answers '?', then every distribution in Dn is consistent with this answer because p_A and p_B are independent of h. In the other case, we use a probabilistic argument to prove the bound. Suppose the adversary answers µ̂ = max_{k∈[n],c∈Σ} |γ_k^c|, and note that, interchanging A and B if necessary, we can assume µ̂ = max_{k∈[n],c∈Σ} γ_k^c. In this case, the number of inconsistent distributions is 2^n P[|L∞(D_h^A, D_h^B) − µ̂| ≥ α]. We can combine Lemmas 1.5.8 and 1.5.9 to get

P[|L∞(D_h^A, D_h^B) − µ̂| ≥ α] ≤ (1/2^{2n}α²) ( V[X_M] + ∑_{k,c} V[X_k^c] ) ,

where X_M ∈ {X_1^0, …, X_n^1} is such that µ̂ = E[X_M]/2^n. Since A and B are disjoint and thus ∑_{k=1}^n (|A_k| + |B_k|) ≤ 2^n, using the bound (2) from Lemma 1.5.10 we obtain

V[X_M] + ∑_{k,c} V[X_k^c] = O( 2^n/p_A² + 2^n/p_B² + 2^n/(p_A p_B) ) = O( 2^n/β² ) ,

where the last inequality uses that p_A, p_B > 2β. Therefore, we can conclude that the number of distributions inconsistent with the answer µ̂ is at most O(1/(α²β²)). Let I denote the maximum number of distributions inconsistent with each query issued by L and Q the number of queries it makes. Now we show that if I · Q ≤ 2^n − 2, the algorithm necessarily produces a hypothesis with large error. Note that the relative entropy between any two distributions in Dn is infinite because they have different supports. Since I · Q ≤ 2^n − 2 implies that there are at least two distributions in Dn consistent with every answer given by the adversary, if L outputs a hypothesis in Dn, the adversary can still choose a consistent distribution with infinite error with respect to the hypothesis produced by L. Recalling that for each pair of distributions in Dn we have L1(D_f, D_g) = 1, we also get a lower bound for learning Dn with respect to the total variation distance. Furthermore, the following argument shows that L still has large error even if its output is not restricted to a distribution in Dn. Assume L outputs some distribution D̂, not necessarily in Dn, such that KL(Dh‖D̂) ≤ ε for some Dh ∈ Dn. Then it follows from Pinsker's inequality [CBL06] that KL(Dg‖D̂) ≥ (1/2 ln 2)(1 − √(2 ln 2 · ε))² for any other distribution Dg different from Dh. Since ε ≤ 1/9, we then have KL(Dg‖D̂) > 2/9. Therefore, if I · Q ≤ 2^n − 2 the hypothesis produced by the learner will have large error. We can conclude that if L learns Dn with Q queries, then necessarily Q > (2^n − 2)/I = Ω(2^n α²β²).

1.6 Comments

We devote this section to some remarks about the meaning of the results presented in this chapter.

1.6.1 Consequences of SQ Learnability

We begin by discussing the implications of showing that state-merging algorithms can be cast in a statistical query framework. It is well known that statistical query algorithms for learning concepts can

28 be easily converted to PAC algorithms capable of learning from noisy examples. In the case where the goal is to learn a probability distribution, noise takes the form of outliers. The following result shows that state-merging algorithms can learn PDFA even in the presence of a small fraction of outliers. Let D be a distribution over some set X . For 0 < η < 1, an η-perturbation of D is a distribution over X of the form Dη = (1 − η)D + ηD˜, where D˜ is some distribution of outliers over X . We say that D can be learned under a proportion of outliers η if there exists an algorithm that can learn D when it is only given access to some η-perturbation of D. Theorem 1.6.1. Suppose D can be learned by an algorithm that makes statistical queries with tolerance at least γ, and let η < γ/2. Then D can be learned under a proportion of outliers η using a (γ − 2η)- bounded SQ algorithm. Proof. In the algorithm that learns D we will replace every statistical query with another query with a smaller tolerance to obtain an algorithm that learns D under a proportion of outliers η. We can assume without loss of generality that all queries in the original algorithm are of the form SQD(χ, γ). In the new algorithm each of these queries is replaced by a query of the form SQDη (χ, γ −2η). We only need to show −1 that any answerp ˆχ to any of the new queries is a γ-approximation to pχ = D(Aχ), where Aχ = χ (1). Indeed, since by definition |pˆχ − Dη(Aχ)| ≤ γ − 2η, we have

|pˆχ − pχ| ≤ |pˆχ − Dη(Aχ)| + |Dη(Aχ) − D(Aχ)|

≤ (γ − 2η) + |(1 − η)D(Aχ) + ηD˜(Aχ) − D(Aχ)|

≤ (γ − 2η) + η(D(Aχ) + D˜(Aχ)) ≤ γ . Combining the learning result of Theorem 1.4.1 with the simulation of Theorem 1.3.3 we can show that PDFA can be learned under some proportion of outliers. 2 2 Corollary 1.6.2. Let D ∈ PDFA(Σ, n, µ, L, π) and γ = min{µ, ε πcD/n|Σ| L} for some ε > 0. Then D can be learned with accuracy ε with respect to total variation distance under a proportion of outliers O(εγ/n|Σ|) using statistical queries. The same arguments we have just used above can be given a different, equally interesting interpre- tation. Fix some η > 0. If D is some class of probability distributions, we let Dη denote the class of all distributions that are at L1 distance at most η of some distribution in D:

0 0 Dη = {D | L1(D,D ) ≤ η for some D ∈ D} . Theorem 1.6.3. Suppose that every distribution in a class D can be learned with accuracy ε with respect to L1 distance using statistical queries of tolerance at least γ. Then, for any η < γ, the class Dη can be learned with accuracy ε + η with respect to L1 distance using statistical queries of tolerance at least γ − η. D Proof. Suppose we want to learn a distribution D ∈ Dη given access to oracle SQ . We run the algorithm for learning D up to accuracy ε and replace every query SQ(χ, γ) by a query of the form SQD(χ, γ − η). 0 Denote by Dˆ the output of this algorithm. Then the same argument used above shows that L1(D,Dˆ ) ≤ ε 0 0 for some D ∈ D such that L1(D,D ) ≤ η, and the result follows from the triangle inequality. Therefore, it turns out that our state-merging algorithm can learn distributions that are close to being generated by PDFA.

Corollary 1.6.4. For any ε > 0, the class of distributions PDFAη(Σ, n, µ, L, π, c) for some η = O(εγ/n|Σ|) can be learned with accuracy O(ε(1 + γ/n|Σ|)) with respect to L1 distance using statisti- cal queries, where γ = min{µ, ε2πc/n|Σ|2L}. In conclusion, when a class of distributions can be learned using statistical queries, then distributions in that class can also be learned in the presence of outliers, and distributions that are “almost” in the class can be learned using the same learning algorithm. Of course, these results combined with the simulation in Proposition 1.2.1 imply that similar results hold when learning algorithms are provided with a large enough sample from the target distribution (plus maybe some outliers). It seems that in the particular case of PDFA these results were not previously known.
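As a quick numerical sanity check of the argument in the proof of Theorem 1.6.1, the following Python snippet (our own illustration with made-up probabilities; not part of the original argument) verifies that an answer within tolerance $\gamma - 2\eta$ of the perturbed probability $D_\eta(A_\chi)$ is always within $\gamma$ of the clean probability $D(A_\chi)$.

```python
def worst_case_sq_error(p_clean, p_outlier, eta, gamma):
    """Worst-case deviation from p_clean = D(A) of an answer that is
    within gamma - 2*eta of the perturbed probability D_eta(A)."""
    p_perturbed = (1 - eta) * p_clean + eta * p_outlier  # D_eta(A)
    return (gamma - 2 * eta) + abs(p_perturbed - p_clean)

# The bound never exceeds gamma (here gamma = 0.1, eta = 0.02):
assert all(
    worst_case_sq_error(p, q, eta=0.02, gamma=0.1) <= 0.1
    for p in (0.0, 0.3, 0.7, 1.0)
    for q in (0.0, 0.5, 1.0)
)
```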

1.6.2 Input Parameters to Algorithm 1

We will now devote a few lines to discuss the input parameters required by our algorithm for learning PDFA using statistical queries. Recall that these are: desired accuracy $\varepsilon$, alphabet $\Sigma$, number of states $n$, distinguishability $\mu$, expected length $L$, and smallest non-zero stopping probability $\pi$. Of these, $\varepsilon$ can be arbitrarily chosen by the user. The rest are assumed to correctly represent the target distribution. That is, the oracles $\mathrm{SQ}^D$ and $\mathrm{DIFF}_\infty^D$ provided to the algorithm correspond to a distribution $D \in \mathrm{PDFA}(n, \Sigma, \mu, L, \pi)$.

Parameters $n$, $\Sigma$, and $\mu$ are usually required by most algorithms in the literature for PAC learning PDFA, the only exception being [CG08], where $\mu$ appears in the sample bounds but is not an input to their algorithm. Of course, $\Sigma$ can in principle be confidently estimated given a large enough sample; in the case of SQ algorithms it needs to be given as input because the algorithm cannot see examples drawn from $D$. Our lower bounds show that $\mu$ is indispensable, though it may be possible to devise a cross-validation-like scheme to infer a lower bound on the true $\mu$. Something of this sort will be done in Chapter 3 in a different context. Designing such a strategy in the SQ framework is left as an open problem. This same reasoning can, in principle, be applied to the number of states $n$.

Asking for the expected length $L$ of $D$ is very convenient for two reasons. First, estimating $L$ using only probabilities of events may be possible but not really practical. If the SQ framework is understood as a proxy to information extracted from samples, then it is reasonable to ask as input something that can be easily estimated from a sample but that can be cumbersome to estimate using only statistical queries. Using that the length of strings generated by a PDFA is sub-exponential (Lemma 1.1.1) and the concentration inequality for sub-exponential random variables given in Lemma A.1.5, it is not difficult to give a bound on the number of examples needed to obtain an accurate estimate of $L$. The second reason is that using $L$ in the learning proof greatly simplifies some of the bounds. Palmer and Goldberg in [PG07] show that PDFA can be learned in terms of total variation distance without the need to know $L$. Theirs is a technically interesting result, but given that $L$ can be easily estimated from a sample and its use greatly simplifies the learning bounds, we have decided to use it for the sake of simplicity. It is worth mentioning here that it was shown in [CT04a] that a bound on the expected length of strings generated from any state in a PDFA is required for PAC learning PDFA in terms of relative entropy. Though unfortunately they also call their parameter $L$, our $L$ is just the expected length of strings generated from the initial state.

The role of (a lower bound to) the smallest non-zero stopping probability $\pi$ may appear less clear at first sight. This parameter has not been used in any previous analysis, and may in principle be difficult to estimate. However, it turns out that it allows for a simpler analysis which stresses the main point of Lemma 1.4.2: a relative approximation in a large enough set is enough for learning in terms of total variation distance. The need for $\pi$ as an input can be relaxed at the expense of a slightly more complex algorithm and an increase in the length of our learning proof. Since this results only in a technical improvement bearing no new insights into the problem, we have opted for using $\pi$ in our presentation.

1.6.3 Meaning and Tightness of the Bounds

The main message of this chapter is that all state-merging algorithms in the literature for PAC learning PDFA can be interpreted as SQ algorithms. Together with the lower bounds in Section 1.5 we see that, despite the observation in [KMRRSS94] that PDFA contain the hard-to-learn class of noisy parities, it is parities themselves that are hard to learn with state-merging methods. That is, state-merging methods have great difficulty in learning PDFA that contain deep highly symmetric structures; precisely those which yield exponentially small distinguishabilities.

On the other hand, the results are somewhat reassuring, in the sense that $\mu$ provides a rather fine-grained measure of complexity for the whole class of PDFA. Examples with very small $\mu$ seem to be pathological cases which are hard to learn – and remember that such cases must exist due to the lower bounds – or simply cases where some other method different from state-merging might do better. In contrast, using state-merging methods in cases where the distinguishability is large seems the best option,

and yields a fast, statistically efficient method, which has the advantage of producing as output a PDFA guaranteed to contain the relevant substructures from the target. It will become clear in the second part of this dissertation that proper learning of complex probability distributions is not always possible. This means that sometimes it is easier to find a function $f : \Sigma^\star \to \mathbb{R}$ that approximates the target distribution when we do not impose that $f$ is given by a probabilistic automaton. However, whenever it is possible, proper learning is always a desirable property.

Of course, it is not difficult to construct examples with exponentially small distinguishability which are easy to learn using methods based on statistical queries other than state-merging. For every $h \in \{-1, 0, +1\}^n$ define the conjunction $C_h : \{0,1\}^n \to \{0,1\}$ as $C_h(x) = \bigwedge_{i \in [n]} x_i^{h_i}$, with the conventions $x_i^0 = 1$, $0^{-1} = 1$ and $1^{-1} = 0$. We construct the class of distributions $\mathcal{C}_n$ following the same schema used for defining $\mathcal{D}_n$, but using conjunctions instead of parities. Note that each $D_h \in \mathcal{C}_n$ can be represented with a PDFA of $\Theta(n)$ states and distinguishability $\Theta(1/2^n)$. However, it is well known that the class of all conjunctions over $n$ variables can be efficiently learned using statistical queries, and a trivial reduction shows that $\mathcal{C}_n$ can also be learned using $\mathrm{poly}(n)$ statistical queries with tolerance $\mathrm{poly}(1/n)$. That is, our lower bound of $\Omega(1/\mu^{1-c})$ for every $c > 0$ applies when trying to learn the class $\mathrm{PDFA}(n, \mu)$ of all PDFA with $n$ states and distinguishability $\mu$, but not when trying to learn an arbitrary subclass of $\mathrm{PDFA}(n, \mu)$.

The conclusion is that $\mu$ is a measure of complexity well suited to measure hardness of learning with state-merging methods, but it does not characterize hardness of learning in general. In this sense $\mu$ resembles the VC dimension used in statistical learning, though with some reserves. For example, VC dimension characterizes the statistical complexity of learning with any algorithm, while $\mu$ measures the computational complexity of finding the underlying DFA of a PDFA by greedily distinguishing states. We note here that this problem is known to be easy in statistical terms, since it was shown in [AW92] that a sample of polynomial size suffices to determine the best transition structure with a fixed number of states.

Ideally, however, one would like a measure of complexity, say $\tilde{\mu}$, that provably characterizes the complexity of learning any subclass of PDFA. Namely, if $\mathcal{D} \subseteq \mathrm{PDFA}$ is any large enough subclass, we would be interested in a result saying that any algorithm that learns $\mathcal{D}$ uses at least $\Omega(\tilde{\mu}_{\mathcal{D}})$ examples (or queries), where $\tilde{\mu}_{\mathcal{D}}$ measures the complexity of the class $\mathcal{D}$. A possible line of work in this direction would be to adapt one of the several notions of dimension used in concept learning to distribution learning. Furthermore, finding learning algorithms that provably match this hypothetical lower bound in terms of $\tilde{\mu}$ would be another interesting, practically relevant problem. A (small) step in this direction will be described in the following chapter with the introduction of prefix-distinguishability.

At a more quantitative level, we want to remark that there is still a gap between our upper and lower bounds on the complexity of learning PDFA with statistical queries. In particular, we can compare Theorem 1.4.1 with Corollary 1.5.6. Note that our algorithm is $(\mu, 1/n)$-bounded – we ignore the dependence on $|\Sigma|$ and $\varepsilon$ here – and makes $O(n^2)$ $\mathrm{L}_\infty$ queries. According to the lower bound it must have $\alpha^2\beta^2 = O(\mu n^{O(1)})$, but it turns out that our algorithm has $\alpha^2\beta^2 = O(\mu^2 n^{O(1)})$, which is smaller by a factor of $\mu$. It is an open question whether this gap is an artifact in our lower bound or whether more efficient algorithms in terms of $\mu$ can be designed for learning PDFA.

Chapter 2

Implementations of Similarity Oracles

One of the key observations from the last chapter is that being able to compare two distributions over $\Sigma^\star$ and determine whether they are equal or not is the cornerstone of state-merging algorithms. Though we showed how to simulate $\mathrm{L}_\infty$ queries using samples in Proposition 1.3.1, broadly speaking the last chapter's approach was mostly axiomatic with respect to the testing procedure, in the sense that its existence was assumed and given in the form of an oracle. In this chapter we dwell upon the particular problem of testing similarity between two distributions $D$ and $D'$ given i.i.d. samples $S$ and $S'$ from both distributions.

We begin with a general description of the problem in the context of property testing. Instead of making comparisons based on the $\mathrm{L}_\infty$ distance we used in the previous chapter, we introduce a less coarse distance measure denoted $\mathrm{L}_\infty^p$ which in many cases will lead to more sample-efficient tests. The problem of testing similarity between distributions is reduced in Section 2.1 to that of constructing confidence intervals for the true distance $\mathrm{L}_\infty^p(D, D')$.

The rest of the chapter gives constructions of such confidence intervals and provides an experimental evaluation of the tests obtained using different constructions. In particular, we provide a simple construction based on uniform convergence bounds. After observing an asymmetry in this construction which makes decisions on equality harder to make, we propose a test based on bootstrapped confidence intervals that improves on this particular aspect. We finish the chapter with constructions of tests that work only with summaries of the samples $S$ and $S'$ instead of the samples themselves. This yields tests whose memory requirements drop from $O(1/\mu^2)$ to $O(1/\mu)$ and that can process samples on-line. These tests will be useful in the next chapter when we design state-merging algorithms for the data stream scenario.

2.1 Testing Similarity with Confidence Intervals

The main problem that we face in this chapter is to decide whether two distributions $D$ and $D'$ are equal or not based on observing a sample drawn from each distribution. We will begin this first section by fixing some notation that will be used throughout the rest of the chapter. Then we introduce a notion of distance between probability distributions over $\Sigma^\star$ which is less coarse than the $\mathrm{L}_\infty$ used in the previous chapter. In general this will allow us to design tests that in some cases can make accurate decisions with smaller samples. Finally we will describe a general strategy for constructing the desired tests based on confidence intervals. The rest of the chapter will then be devoted to different methods for obtaining such confidence intervals.

Suppose $D$ and $D'$ are two arbitrary distributions over $\Sigma^\star$. Let $S$ be a sample of $m$ i.i.d. examples drawn from $D$ and $S'$ a sample of $m'$ i.i.d. examples drawn from $D'$. For any event $A \subseteq \Sigma^\star$ we will use $S(A)$ to denote the empirical probability of $A$ under sample $S$, which should in principle be an approximation to the probability $D(A)$. More specifically, if for any $x \in \Sigma^\star$ we let $S[x]$ denote the number of times that $x$ appears in $S = (x_1, \ldots, x_m)$, then
\[
S(A) = \frac{1}{m} \sum_{x \in A} S[x] = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}_{x_i \in A} \; .
\]
The sizes of $S$ and $S'$ will come into play in the statement of several results. The two following related quantities will be useful:
\[
M = m + m', \quad \text{and} \quad M' = \frac{m m'}{(\sqrt{m} + \sqrt{m'})^2} \; .
\]
In order to make quantitative statements about the distinguishing power of a certain statistical test we need a measure of distance between probability distributions. Instead of the supremum distance $\mathrm{L}_\infty$ used in Chapter 1, here we will use the so-called prefix-$\mathrm{L}_\infty$ distance, which we denote by $\mathrm{L}_\infty^p$. It is defined as follows:
\[
\mathrm{L}_\infty^p(D, D') = \sup_{x \in \Sigma^\star} |D(x\Sigma^\star) - D'(x\Sigma^\star)| \; .
\]
It is shown in Appendix A.5 that $\mathrm{L}_\infty^p$ is indeed a metric. Proposition A.5.1 in that same section shows that in general $\mathrm{L}_\infty(D, D') \leq (2|\Sigma| + 1)\, \mathrm{L}_\infty^p(D, D')$. This shows that using $\mathrm{L}_\infty^p$ instead of $\mathrm{L}_\infty$ as a measure of distinguishability one may, in the worst case, lose a constant factor in the sample bounds that will be given in the rest of the chapter. Furthermore, we also provide in Appendix A.5 a class of distributions generated by PDFA where $\mathrm{L}_\infty^p$ is exponentially smaller than $\mathrm{L}_\infty$. Altogether, these two bounds justify the use of $\mathrm{L}_\infty^p$ as a metric for designing similarity tests on distributions over $\Sigma^\star$. This particular measure was used in [RST98] for learning acyclic PDFA and then in [CG08] for learning general PDFA.

The goal of our tests will be to decide whether $S$ and $S'$ come from the same distribution – $D = D'$ – or from different distributions. This problem can be arbitrarily difficult if $D$ and $D'$ are allowed to be different but arbitrarily close to each other. In order to formalize this intuition, one can parametrize the complexity of the problem in terms of the smallest distance between $D$ and $D'$ that the test must be able to distinguish. A natural framework for such formalization which is used frequently in property testing is the notion of a promise problem [Gol06].

Let $\mu > 0$ be a distinguishability parameter. Then given $S$ and $S'$ an $\mathrm{L}_\infty^p$ similarity test has to decide whether $D = D'$ or $D \neq D'$ under the promise that if $D \neq D'$ then $\mathrm{L}_\infty^p(D, D') \geq \mu$. Introducing the promise on $\mu$ makes the design and analysis of correct testing algorithms much easier. In practice one assumes that the algorithm can answer arbitrarily when the promise is not satisfied. Sometimes it is also useful to introduce a third possible output representing the fact that the test is unable to decide which is the case given the current samples. This may happen, for example, when $S$ or $S'$ are too small. Of course, it may also happen sometimes that by chance the samples $S$ and $S'$ are misleading. Thus, testing algorithms usually accept a confidence parameter $\delta$ and certify that their answer (if any) is correct with probability at least $1 - \delta$ over the sampling procedure and any internal randomization in the test.

However, this setup requires a bit of caution. Asking for an algorithm that does not fail very often and that is allowed to answer "I don't know" may result in a very conservative test that does not make a decision until it is absolutely certain about the correct answer. Though perfectly correct, such a test might not be very useful for practical purposes, where one is interested in deciding confidently as soon as possible. In the literature on statistical hypothesis testing this leads to the notion of the power of a test. Informally, we say that a test is more powerful than another if the first test is always able to make a correct decision using fewer examples than the second test.
In general, bounding the power of a test may be very difficult, and even when it is possible the bounds may not be tight enough for practical purposes. Since power comparisons are usually motivated by the need to choose a particular test for practical applications, a set of experiments providing such comparisons in an empirical setting will be presented in Section 2.6. In most cases these experiments are much more informative than theoretical bounds. That is because in tasks where uniformly most powerful tests do not exist, some tests may be better than others depending on the distributions $D$ and $D'$. And this can be observed better through experimentation.

The main tool we will use to build similarity tests are confidence intervals. Suppose $\mu_\star = \mathrm{L}_\infty^p(D, D')$ and let $0 < \delta < 1$ be a confidence parameter. An upper confidence limit for $\mu_\star$ at confidence level $\delta$ computed from $S$ and $S'$ is a number $\hat{\mu}_U = \hat{\mu}_U(S, S', \delta)$ which for any two distributions $D$ and $D'$ satisfies $\mu_\star \leq \hat{\mu}_U$ with probability at least $1 - \delta$. Similarly, we define a lower confidence limit $\hat{\mu}_L$ satisfying $\hat{\mu}_L \leq \mu_\star$ with the same guarantees. Using these two quantities one can define a confidence interval $[\hat{\mu}_L, \hat{\mu}_U]$ which will contain $\mu_\star$ with probability at least $1 - 2\delta$.

Given a confidence interval $[\hat{\mu}_L, \hat{\mu}_U]$ for $\mu_\star$ and a distinguishability parameter $\mu$ it is simple to construct a similarity test as follows. If $\hat{\mu}_U < \mu$ then decide that $D = D'$, since we know that (with high probability) $\mu_\star < \mu$ which, by the promise on $\mu$, implies $\mu_\star = 0$. If $\hat{\mu}_L > 0$ then decide that $D \neq D'$, since we then know that (with high probability) $\mu_\star > 0$. If none of the conditions above hold, then answer "I don't know". The problem of similarity testing has thus been reduced to that of obtaining confidence intervals for $\mu_\star$ with a desired level of confidence. We will give several constructions of those in the rest of the chapter.

2.2 Confidence Intervals from Uniform Convergence

In this section we give confidence limits for empirical estimations of $\mathrm{L}_\infty^p(D, D')$ using the Vapnik–Chervonenkis bound on the uniform rate of convergence of empirical probabilities to their expectations over a fixed (possibly infinite) set of events. From our perspective, VC bounds provide a combinatorial framework for applying a union bound argument to bound the probability that none of an infinite family of events occur, provided that the family of interest satisfies certain properties. See Appendix A.1 for more details and Appendix A.5 for relevant calculations regarding shattering coefficients. VC theory has numerous other applications in statistical learning theory, see [Vap99] for example. Bounds in the same spirit as the ones given here, but proved using different techniques, have appeared in the context of PDFA learning in the works [RST98; CT04a; PG07; CG08].

Let $D$ and $D'$ be two distributions over $\Sigma^\star$ with $\mu_\star = \mathrm{L}_\infty^p(D, D')$. Suppose we have access to two i.i.d. samples $S$ and $S'$ drawn from $D$ and $D'$ respectively, where $m = |S|$ and $m' = |S'|$. We will use the empirical estimate $\hat{\mu} = \mathrm{L}_\infty^p(S, S')$ to determine whether $D$ and $D'$ are equal or different. In particular, we will give a confidence interval for $\mu_\star$ centered around $\hat{\mu}$. For any $0 < \delta < 1$ define
\[
\Delta(\delta) = \sqrt{\frac{8}{M'} \ln\left(\frac{16M}{\delta}\right)} \; . \tag{2.1}
\]

Then we can prove the following two propositions giving confidence limits for $\mu_\star$ of the form $\hat{\mu} \pm \Delta(\delta)$.

Proposition 2.2.1. With probability at least $1 - \delta$ we have $\mu_\star \leq \hat{\mu} + \Delta(\delta)$.

Proof. Let us write $\Delta = \Delta(\delta)$ for some fixed $\delta$. The result follows from a standard application of the Vapnik–Chervonenkis inequality showing that $\mathbb{P}[\hat{\mu} < \mu_\star - \Delta] \leq \delta$. First note that by the triangle inequality we have $\mathrm{L}_\infty^p(S, S') \geq \mathrm{L}_\infty^p(D, D') - \mathrm{L}_\infty^p(D, S) - \mathrm{L}_\infty^p(D', S')$. Therefore, $\hat{\mu} < \mu_\star - \Delta$ implies $\mathrm{L}_\infty^p(D, S) + \mathrm{L}_\infty^p(D', S') > \Delta$. Now for any $0 < \gamma < 1$ we have
\begin{align*}
\mathbb{P}[\hat{\mu} < \mu_\star - \Delta] &\leq \mathbb{P}[\mathrm{L}_\infty^p(D, S) + \mathrm{L}_\infty^p(D', S') > \Delta] \\
&\leq \mathbb{P}[\mathrm{L}_\infty^p(D, S) > \gamma\Delta] + \mathbb{P}[\mathrm{L}_\infty^p(D', S') > (1 - \gamma)\Delta] \\
&\leq 16m e^{-m\gamma^2\Delta^2/8} + 16m' e^{-m'(1-\gamma)^2\Delta^2/8} \\
&= 16M e^{-M'\Delta^2/8} = \delta \; ,
\end{align*}
where we choose $\gamma$ such that $m\gamma^2 = m'(1 - \gamma)^2$.

Proposition 2.2.2. With probability at least $1 - \delta$ we have $\mu_\star \geq \hat{\mu} - \Delta(\delta)$.

Proof. The argument is very similar to the one used in Proposition 2.2.1. Write $\Delta = \Delta(\delta)$ for some fixed $\delta$ as well. We need to see that $\mathbb{P}[\mu_\star < \hat{\mu} - \Delta] \leq \delta$. Since $\mathrm{L}_\infty^p(S, S') \leq \mathrm{L}_\infty^p(D, D') + \mathrm{L}_\infty^p(D, S) + \mathrm{L}_\infty^p(D', S')$, then $\mu_\star < \hat{\mu} - \Delta$ implies $\mathrm{L}_\infty^p(D, S) + \mathrm{L}_\infty^p(D', S') > \Delta$. Thus, the conclusion follows from the same bound we used before.

The conclusion is that $[\hat{\mu} - \Delta(\delta/2), \hat{\mu} + \Delta(\delta/2)]$ is a confidence interval for $\mu_\star$ at level $\delta$. It is interesting to note that $\hat{\mu}$ depends on the contents of $S$ and $S'$, but the size of the interval $2\Delta(\delta/2)$ only depends on the sizes $m$ and $m'$. When $m = m'$, the size of this interval behaves like $O(\sqrt{\ln(m)/m})$, which is known to be optimal up to the $O(\sqrt{\ln(m)})$ factor. Bounds with optimal asymptotic behavior can be obtained using different techniques, e.g. chaining and covering numbers [DL01]. However such bounds are rarely of practical use for testing algorithms because the asymptotic advantage is only observed at astronomical sample sizes due to the constants involved.

Propositions 2.2.1 and 2.2.2 can also be used to quantify how many examples are needed at most to make a decision with a test based on these bounds. Let $m = m'$ again. Assume first that $\mu_\star = 0$. Then with probability at least $1 - \delta/2$ we will have $\hat{\mu} \leq \Delta(\delta/2)$. Therefore, $\hat{\mu}_U = \hat{\mu} + \Delta(\delta/2) \leq 2\Delta(\delta/2)$. Thus, with high probability a decision will be made when $2\Delta(\delta/2) < \mu$ or sooner, which implies $m = \tilde{O}(1/\mu^2)$. On the other hand, assume that $\mu_\star \geq \mu$. Similar arguments show that with high probability $\hat{\mu}_L = \hat{\mu} - \Delta(\delta/2) \geq \mu_\star - 2\Delta(\delta/2)$. Therefore, a decision will be made at the latest when $2\Delta(\delta/2) < \mu_\star$, which implies $m = \tilde{O}(1/\mu_\star^2)$.

These bounds expose a particular feature of the test which is often encountered in practice. Namely, that deciding similarity is usually harder than deciding dissimilarity. This is because in the former case one is always playing against the worst case $\mu_\star = \mu$, while in the latter larger values for $\mu_\star$ require smaller sample sizes. In some sense, the test is adaptive to the true (unknown) value of $\mu_\star$ when it comes to deciding dissimilarity. This will motivate the alternative upper confidence limit presented in the next section. The key idea is to obtain an alternative to $\hat{\mu} + \Delta$ with a greater dependence on the contents of $S$ and $S'$, with the hope that this will help us make similarity decisions with smaller samples when enough evidence of similarity is available.
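For concreteness, here is a small Python sketch (our own illustration, not code from the thesis) of the quantities involved in this test: the empirical estimate $\hat{\mu} = \mathrm{L}_\infty^p(S, S')$ computed from two samples of strings, and the half-width $\Delta(\delta)$ of Equation (2.1).

```python
import math
from collections import Counter

def prefix_linf(sample1, sample2):
    """Empirical prefix-L-infinity distance between two samples of strings:
    the largest gap between the empirical probabilities of events of the
    form "the string starts with prefix x"."""
    def prefix_probs(sample):
        counts = Counter()
        for s in sample:
            for i in range(len(s) + 1):   # every prefix of s, including s itself
                counts[s[:i]] += 1
        return {p: c / len(sample) for p, c in counts.items()}
    p1, p2 = prefix_probs(sample1), prefix_probs(sample2)
    return max(abs(p1.get(x, 0.0) - p2.get(x, 0.0)) for x in set(p1) | set(p2))

def vc_half_width(m1, m2, delta):
    """Delta(delta) from Equation (2.1)."""
    M = m1 + m2
    Mp = m1 * m2 / (math.sqrt(m1) + math.sqrt(m2)) ** 2
    return math.sqrt(8.0 / Mp * math.log(16.0 * M / delta))
```

The VC-based confidence interval at level $\delta$ is then $[\hat{\mu} - \Delta(\delta/2), \hat{\mu} + \Delta(\delta/2)]$, which can be fed directly to the decision rule sketched in Section 2.1.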

2.3 Bootstrapped Confidence Intervals

In this section we describe an alternative upper confidence bound for $\mu_\star$. This will be based on the use of a common technique in computational statistics: the bootstrap. The basic idea is to take a sample $S$ from some distribution $D$ and simulate the process of generating new examples from $D$ by sampling with replacement from $S$. Via this process new bootstrapped samples are obtained, and these are then used to obtain statistical information about a certain empirical estimate. This technique was introduced by Efron [Efr79] and is very popular in non-parametric statistics. The asymptotic theory of the bootstrap has been well studied in the statistical literature, see e.g. Hall's book [Hal92]. However, general finite sample analyses of Efron's bootstrap seem to be nonexistent.

We begin by introducing some notation and describing Efron's construction of bootstrapped upper confidence limits. This is basically the approach one would use in practice. Here, however, we deviate a little from the classic approach and propose a modification to the method for which we can give strict formal justifications and finite sample analyses. Both methods are compared in the experiments of Section 2.6.

In the bootstrap setting, given an i.i.d. sample $S$ from $D$ with $m = |S|$, the first step is to construct several bootstrapped samples by resampling $S$ uniformly with replacement. In particular, we let $B_1, \ldots, B_r$ be independent resamples from $S$, each of size $m$. The same is done for a sample $S'$ of size $m'$ from $D'$, yielding resamples $B'_1, \ldots, B'_r$ of size $m'$. From these resamples we construct $r$ bootstrapped estimates for $\mu_\star$: let $\hat{\mu}_i = \mathrm{L}_\infty^p(B_i, B'_i)$ for $1 \leq i \leq r$.

For the purpose of testing it is useful to consider these estimates sorted increasingly. Thus we will write $\omega$ to denote a permutation of $[r]$ satisfying $\hat{\mu}_{\omega(1)} \leq \cdots \leq \hat{\mu}_{\omega(r)}$. Efron's idea is then to use $\hat{\mu}_{\omega(\lceil(1-\delta)r\rceil)}$ as an upper confidence limit for $\mu_\star$ at level $\delta$. This is based on the observation that if $B_i$ and $B'_i$ were independent samples from $D$ and $D'$ respectively, the estimates $\hat{\mu}_i$ would be a sample from the distribution of the estimator $\hat{\mu} = \mathrm{L}_\infty^p(S, S')$. In which case, enough estimates $\hat{\mu}_i$ would yield an accurate histogram of this particular distribution, from which confidence upper limits for $\mu_\star$ can be obtained because the estimator $\hat{\mu}$ is asymptotically unbiased; that is, $\mathbb{E}_{S,S'}[\hat{\mu}] \to \mu_\star$ when $m$ and $m'$ grow to infinity.

Of course, the justification we have just given for the bootstrap is an asymptotic one. But since we are interested in provably correct tests that work with finite samples, we need to quantify the error probability of such tests for given sample sizes. In the following we show that it is possible to conduct a finite sample analysis of an alternative test which is also based on the bootstrap estimates $\hat{\mu}_i$. The test relies on an upper confidence limit for $\mu_\star$ constructed using bootstrap estimates. The main technical tool is Theorem 2.3.1, whose direct consequence is Corollary 2.3.2, which gives the desired upper confidence limit.

The setup is as we just described. Let $0 < \delta < 1$ be a confidence parameter and $1 \leq k \leq r$ the index of an estimate in the series $\hat{\mu}_{\omega(1)} \leq \cdots \leq \hat{\mu}_{\omega(r)}$. Let us define the following two quantities:
\[
\Theta_k(\delta) = \max_{0 < x < \delta} \min\left\{ x, \; \frac{k}{r} - \sqrt{\frac{1}{2r}\ln\frac{1}{\delta - x}} \right\} \; ,
\]
\[
\Delta_k(\delta) = \sqrt{\frac{32}{M'} \ln\left(\frac{16M}{\Theta_k(\delta)}\right)} \; .
\]

Theorem 2.3.1. Let $\sqrt{(r/2)\ln(1/\delta)} < k \leq r$. With probability at least $1 - \delta$ we have $\mu_\star \leq \hat{\mu}_{\omega(k)} + \Delta_k(\delta)$.

The proof is given in Section 2.3.1. The following corollary follows from a simple union bound.

Corollary 2.3.2. With probability at least $1 - \delta$ it holds that
\[
\mu_\star \leq \min\left\{ \hat{\mu}_{\omega(k)} + \Delta_k(\delta/k) \;\middle|\; \sqrt{(r/2)\ln(r/\delta)} < k \leq r \right\} \; .
\]

This corollary provides an alternative with finite-sample guarantees to Efron's upper bound $\mu_\star \leq \hat{\mu}_{\omega(\lceil(1-\delta)r\rceil)}$.
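As an illustration of Efron's construction (the variant that our finite-sample analysis replaces), a minimal Python sketch is given below; it assumes a distance function on samples, such as the prefix_linf helper from the previous section.

```python
import math
import random

def efron_bootstrap_upper(S1, S2, r, delta, dist):
    """Efron-style bootstrap upper confidence limit: resample both samples
    with replacement r times, sort the r bootstrapped distance estimates
    increasingly, and return the ceil((1 - delta) * r)-th one."""
    estimates = sorted(
        dist(random.choices(S1, k=len(S1)), random.choices(S2, k=len(S2)))
        for _ in range(r)
    )
    return estimates[math.ceil((1 - delta) * r) - 1]  # 1-based index in the text
```

The finite-sample alternative of Corollary 2.3.2 would instead return the minimum of $\hat{\mu}_{\omega(k)} + \Delta_k(\delta/k)$ over all indices $k$ in the admissible range.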

2.3.1 Proof of Theorem 2.3.1

Write $\Delta = \Delta_k(\delta)$ and for $1 \leq i \leq r$ define an indicator random variable $Y_i = \mathbb{1}_{\hat{\mu}_i \leq \mu_\star - \Delta}$. It is clear that if $\hat{\mu}_{\omega(k)} \leq \mu_\star - \Delta$, then $Z = Y_1 + \cdots + Y_r \geq k$. The key observation for analyzing the bootstrap test described in this section is that once the sample $S$ is given the resamples $B_1, \ldots, B_r$ are mutually independent and identically distributed. Therefore, conditioned on $S$ and $S'$, the estimates $\hat{\mu}_1, \ldots, \hat{\mu}_r$ are independent and $Z$ is a sum of i.i.d. random variables.

The first step in the proof is to analyze the expectation of these random variables given samples satisfying a particular property. We will say that an i.i.d. sample $S$ from $D$ is $t$-good if $\mathrm{L}_\infty^p(D, S) \leq t$. Then we write $G(t, t')$ to denote the event "$S$ is $t$-good and $S'$ is $t'$-good".

Lemma 2.3.3. For any $1 \leq i \leq r$ and $t, t' > 0$ such that $t + t' < \Delta$ it holds that
\[
\mathbb{E}[Y_i \mid G(t, t')] \leq 16M e^{-M'(\Delta - t - t')^2/8} \; .
\]

Proof. Since all $Y_i$ are i.i.d. given $S$ and $S'$, we consider only the case $i = 1$. Let $B = B_1$, $B' = B'_1$, and $Y = Y_1$. First we observe that the events $G(t, t')$ and $\hat{\mu}_i \leq \mu_\star - \Delta$ imply that necessarily $\mathrm{L}_\infty^p(S, B) + \mathrm{L}_\infty^p(S', B') \geq \Delta - t - t'$. Considering the samples $S$ and $S'$ fixed and applying the Vapnik–Chervonenkis inequality to the resampling processes from which $B$ and $B'$ are obtained we can see that, for any $0 < \gamma < 1$,
\begin{align*}
\mathbb{P}[\mathrm{L}_\infty^p(S, B) + \mathrm{L}_\infty^p(S', B') \geq \Delta - t - t']
&\leq \mathbb{P}[\mathrm{L}_\infty^p(S, B) \geq \gamma\Delta - t] + \mathbb{P}[\mathrm{L}_\infty^p(S', B') \geq (1 - \gamma)\Delta - t'] \\
&\leq 16m e^{-m(\gamma\Delta - t)^2/8} + 16m' e^{-m'((1-\gamma)\Delta - t')^2/8} \; .
\end{align*}
Thus, choosing $\gamma$ such that $m(\gamma\Delta - t)^2 = m'((1 - \gamma)\Delta - t')^2$ yields
\[
\mathbb{E}[Y_i \mid G(t, t')] = \mathbb{P}[\hat{\mu}_i \leq \mu_\star - \Delta \mid G(t, t')] \leq 16M e^{-M'(\Delta - t - t')^2/8} \; .
\]

Proof of Theorem 2.3.1. For any $t > 0$ let $t' = t\sqrt{m/m'}$. We denote by $\bar{G}(t, t')$ the complementary event of $G(t, t')$: "$S$ is not $t$-good or $S'$ is not $t'$-good". Then, by the Vapnik–Chervonenkis inequality,
\[
\mathbb{P}[\bar{G}(t, t')] \leq 16m e^{-mt^2/8} + 16m' e^{-m't'^2/8} = 16M e^{-mt^2/8} \; .
\]
Now let $0 < x_\star < \delta - e^{-2k^2/r}$ be the solution to the equation $x_\star = k/r - \sqrt{(1/2r)\ln(1/(\delta - x_\star))}$, which maximizes the expression in $\Theta_k(\delta)$. Taking $t = \sqrt{(8/m)\ln(16M/x_\star)}$ we get $\mathbb{P}[\bar{G}(t, t')] \leq x_\star$. On the other hand, writing $k = \eta r$, Hoeffding's inequality combined with Lemma 2.3.3 yields
\[
\mathbb{P}[Z \geq \eta r \mid G(t, t')] \leq e^{-2r(\eta - 16Me^{-M'(\Delta - t - t')^2/8})^2} \; .
\]
Now note that our choice for $t'$ implies $t + t' = t\sqrt{m/M'}$. Together with our choice for $t$ this implies that
\[
\Delta - (t + t') = \Delta/2 = \sqrt{\frac{8}{M'} \ln\left(\frac{16M}{\eta - \sqrt{\frac{1}{2r}\ln\frac{1}{\delta - x_\star}}}\right)} \; .
\]
Thus, we see that $\mathbb{P}[Z \geq \eta r \mid G(t, t')] \leq \delta - x_\star$, which yields
\[
\mathbb{P}[Z \geq k] \leq \mathbb{P}[\bar{G}(t, t')] + \mathbb{P}[Z \geq k \mid G(t, t')] \leq \delta \; .
\]

2.4 Confidence Intervals from Sketched Samples

A common drawback to all tests described so far is their need to store the whole sample in memory in order to compute some statistic. Since in the case $D = D'$ samples of size $\Omega(1/\mu^2)$ are needed before a decision can be made confidently, this implies a lower bound of the same order on the amount of memory used by such tests. This raises the question of whether using this much memory is absolutely necessary for confidently testing similarity. And the answer is no. It turns out that tests using just $O(1/\mu)$ memory can be designed. The key to such constructions is using statistics that can be computed from sketches of the sample instead of the sample itself.

Sketches are basically data structures that receive items in the sample one at a time and process them sequentially. After processing the whole sample, a sketch keeps a summary that contains only its most relevant features. This summary can then be used to efficiently compute some statistics of interest. Interestingly, the sketch can process every item in the sample in constant time, and the memory used by the sketch only depends on the accuracy required to compute the desired statistics, i.e. it is independent of the sample size. Concrete implementations of sketches will be discussed in the next chapter. In the present section we shall adopt an axiomatic approach by just assuming that sketches exist and satisfy a certain property. Then we will concentrate on how to implement similarity testing using information contained in a sketch.

Let $D$ be some distribution over $\Sigma^\star$ and $S$ a sample of i.i.d. examples drawn from $D$. The empirical distribution defined by $S$ can be interpreted as a function $S : 2^{\Sigma^\star} \to [0, 1]$, where for $A \subseteq \Sigma^\star$ the value $S(A)$ is the empirical probability of event $A$ under sample $S$. A sketch is basically an approximation of this function over a particular family of subsets of $\Sigma^\star$. Let $0 < \nu < 1$. A $\nu$-sketch for $S$ with respect to $\mathrm{L}_\infty^p$ is a function $\hat{S} : 2^{\Sigma^\star} \to [0, 1]$ such that $\mathrm{L}_\infty^p(S, \hat{S}) \leq \nu$.

In some cases, using $\nu$-sketches for testing similarity of probability distributions is rather trivial. Suppose $S$ and $S'$ are samples from distributions $D$ and $D'$ respectively. If instead of $S$ and $S'$ we only have access to their $\nu$-sketches $\hat{S}$ and $\hat{S}'$, then we can compute a statistic $\hat{\mu}_\nu = \mathrm{L}_\infty^p(\hat{S}, \hat{S}')$. By definition of $\nu$-sketch we have $|\hat{\mu}_\nu - \mathrm{L}_\infty^p(S, S')| \leq 2\nu$. Therefore, trivial modifications of the proofs of Propositions 2.2.1 and 2.2.2 yield the two following results for testing with sketches using this statistic. The definition of $\Delta(\delta)$ is the same as in Equation (2.1).

Corollary 2.4.1. With probability at least $1 - \delta$ we have $\mu_\star \leq \hat{\mu}_\nu + 2\nu + \Delta(\delta)$.

Corollary 2.4.2. With probability at least $1 - \delta$ we have $\mu_\star \geq \hat{\mu}_\nu - 2\nu - \Delta(\delta)$.

Based on these results, it is easy to obtain a provably correct similarity test for $\mu_\star$ using memory $O(1/\mu)$ by choosing, for example, $\nu = \mu/4$; that is because there exist $\nu$-sketches using $O(1/\nu)$ memory. Details will be given in Section 3.3. Now we concentrate on the possibility of adapting the bootstrap approach from Section 2.3 to a sketching setting.

Deriving a test based on the bootstrap using sketches is much less straightforward than the case based on uniform convergence bounds. The main obstruction is the need to sample with replacement from a sample $S$ to obtain bootstrapped samples $B_1, \ldots, B_r$. Obviously, the whole sample $S$ needs to be stored in order to perform (exact) resampling. The rest of this section describes a method to obtain bootstrapped sketches $\hat{B}_1, \ldots, \hat{B}_r$ and derives a similarity test based on those sketches. We also state a theorem about confidence intervals built from bootstrapped sketches; full proofs are given in the next section.

Let $r$ be some fixed integer. Suppose that an i.i.d. sample $S = (x_1, \ldots, x_m)$ from some distribution $D$ is presented to an algorithm one example at a time. The algorithm will construct $r$ sketch-bootstrapped samples $\hat{B}_1, \ldots, \hat{B}_r$ as follows. For each element $x_t \in S$ we draw $r$ indices $i_1, \ldots, i_r \in [r]$ independently at random from the uniform distribution over $[r]$. Then a copy of $x_t$ is added to the sketch $\hat{B}_{i_j}$ for $1 \leq j \leq r$ – repetitions are allowed. Note that this sampling process matches several first order moments of the original bootstrap using sampling with replacement. In particular, the expected number of elements introduced in each sketch is $m$, and the expected number of occurrences of each string $x$ in a particular sketch is $S[x]$. The same process is repeated with a sample $S'$ from distribution $D'$ and sketch-bootstrapped samples $\hat{B}'_1, \ldots, \hat{B}'_r$ are obtained.

Instead of $r$ statistics like in the original bootstrap setting, in the sketched bootstrap we are going to compute $r^2$ statistics. Since the main motivation for considering the use of sketches is to save memory, and it turns out that memory usage grows linearly with $r$, this approach allows us to obtain more statistics with less memory, at the price of increasing the dependencies between them. Thus, for any $i, j \in [r]$ we compute the statistic $\hat{\mu}_{i,j} = \mathrm{L}_\infty^p(\hat{B}_i, \hat{B}'_j)$. As before, we will sort these statistics increasingly and obtain $\hat{\mu}_{\omega(1)} \leq \cdots \leq \hat{\mu}_{\omega(r^2)}$, where now $\omega$ is a bijection between $[r^2]$ and $[r] \times [r]$. In the same way we did before, a heuristic based on Efron's bootstrap can be applied to this collection of statistics to say that $\hat{\mu}_{\omega(\lceil(1-\delta)r^2\rceil)}$ is an upper confidence limit for $\mu_\star$ at level $\delta$. However, in order to obtain formally justified bounds based on finite sample analyses, we proceed in a different direction. Let $1 \leq k \leq r^2$, $0 < \delta < 1$, and define the following:
\[
\Delta_k(\delta) = \sqrt{\frac{8}{M'} \ln\left(\frac{32M}{\delta}\right)} + \sqrt{\frac{192}{M'} \ln\left(\frac{800 r^3 M^2}{\delta k^2}\right)} \; .
\]

Then the following theorem gives an upper confidence limit for $\mu_\star$ based on the sketch-bootstrapped statistic $\hat{\mu}_{\omega(k)}$.

Theorem 2.4.3. Let $1 \leq k \leq \min\{r^2, 30\, r^{10/7} \delta^{-4/7} M\}$. With probability at least $1 - \delta$ we have $\mu_\star \leq \hat{\mu}_{\omega(k)} + 2\nu + \Delta_k(\delta)$.

Using this result, a union bound argument automatically yields the following upper confidence limit based on the whole set of statistics.

Corollary 2.4.4. With probability at least 1 − δ it holds that

\[
\mu_\star \leq \min\left\{ \hat{\mu}_{\omega(k)} + 2\nu + \Delta_k(\delta) \;\middle|\; 1 \leq k \leq r^2 \right\} \; .
\]
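A minimal Python sketch of the on-line resampling scheme just described is shown below; the sketch objects and their insert and distance operations are assumed interfaces (concrete implementations based on Space-Saving are the subject of Chapter 3).

```python
import random

def sketch_bootstrap_feed(x, sketches):
    """Process one arriving example x: draw r uniform indices from [r]
    and add a copy of x to the corresponding sketch for each draw
    (so some sketches may receive several copies, others none)."""
    r = len(sketches)
    for _ in range(r):
        sketches[random.randrange(r)].insert(x)

def sketched_bootstrap_statistics(sketches1, sketches2, sketch_dist):
    """The r^2 statistics mu_hat_{i,j}, sorted increasingly."""
    return sorted(sketch_dist(b1, b2) for b1 in sketches1 for b2 in sketches2)
```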

2.5 Proof of Theorem 2.4.3

Our goal is to prove that for any $1 \leq k \leq \min\{r^2, 30\, r^{10/7} \delta^{-4/7} M\}$ it holds that $\mathbb{P}[\mu_\star > \hat{\mu}_{\omega(k)} + \Delta'] \leq \delta$, where $\Delta' = 2\nu + \Delta$ and $\Delta = \Delta_k(\delta)$. For $1 \leq i, j \leq r$ we write $Y_{i,j} = \mathbb{1}_{\hat{\mu}_{i,j} \leq \mu_\star - \Delta'}$ and $Z = \sum_{i,j} Y_{i,j}$. Since $\hat{\mu}_{\omega(1)} \leq \ldots \leq \hat{\mu}_{\omega(r^2)}$ are a reordering of $\hat{\mu}_{1,1}, \ldots, \hat{\mu}_{r,r}$ we have that by definition $\hat{\mu}_{\omega(k)} \leq \mu_\star - \Delta'$ implies $Z \geq k$.

Recall the definition $\hat{\mu}_{i,j} = \mathrm{L}_\infty^p(\hat{B}_i, \hat{B}'_j)$ and suppose $t \geq \mathrm{L}_\infty^p(D, S)$ and $t' \geq \mathrm{L}_\infty^p(D', S')$. Then by the triangle inequality and the definition of $\nu$-sketch we have, for any $1 \leq i, j \leq r$:
\[
\mu_\star \leq \hat{\mu}_{i,j} + \mathrm{L}_\infty^p(S, B_i) + \mathrm{L}_\infty^p(S', B'_j) + 2\nu + t + t' \; .
\]
Therefore, if $\hat{\mu}_{i,j} \leq \mu_\star - \Delta'$, then $\mathrm{L}_\infty^p(S, B_i) + \mathrm{L}_\infty^p(S', B'_j) \geq \Delta - (t + t')$, which in turn implies that at least one of $\mathrm{L}_\infty^p(S, B_i) \geq \gamma\Delta - t$ or $\mathrm{L}_\infty^p(S', B'_j) \geq (1 - \gamma)\Delta - t'$ holds, where $0 < \gamma < 1$ is arbitrary. For fixed $\gamma$, let us define indicator variables $X_i = \mathbb{1}_{\mathrm{L}_\infty^p(S, B_i) \geq \gamma\Delta - t}$ and $X'_j = \mathbb{1}_{\mathrm{L}_\infty^p(S', B'_j) \geq (1-\gamma)\Delta - t'}$. Then, the chain of implications we just derived can be rewritten as $Y_{i,j} \leq X_i + X'_j$. Furthermore, we get $Z \leq r(\sum_i X_i + \sum_j X'_j) = r(X + X')$.

Like we did in the proof of Theorem 2.3.1, for any $t$ and $t'$ we let $G(t, t')$ denote the event "$\mathrm{L}_\infty^p(D, S) \leq t$ and $\mathrm{L}_\infty^p(D', S') \leq t'$". Note that this event depends only on the sampling from $D$ and $D'$ used to obtain samples $S$ and $S'$ and is independent of the process generating the sketched bootstrapped samples once $S$ and $S'$ are given. Thus, we have shown that for any $t$, $t'$, $0 < \gamma < 1$, and $0 < \beta < 1$, we have
\begin{align*}
\mathbb{P}[\mu_\star > \hat{\mu}_{\omega(k)} + \Delta' \mid G(t, t')] &\leq \mathbb{P}[Z \geq k \mid G(t, t')] \\
&\leq \mathbb{P}[r(X + X') \geq k \mid G(t, t')] \\
&\leq \mathbb{P}[X \geq \beta k/r \mid G(t, t')] + \mathbb{P}[X' \geq (1 - \beta)k/r \mid G(t, t')] \; . \tag{2.2}
\end{align*}
In order to bound the terms in this last expression we will need the two following lemmas. Suppose $S$ is a fixed sample of size $m$ over $\Sigma^\star$ and that $B = B_i$ is a stream-bootstrapped sample.

Lemma 2.5.1. The following holds for any $\rho \geq \sqrt{(48\ln 2)/m}$:

\[
\mathbb{P}[\mathrm{L}_\infty^p(S, B) > \rho] \leq 12m \exp\left(-\frac{m\rho^2}{48}\right) \; .
\]

Proof. Note in the first place that since $S$ contains at most $m$ different strings, a shattering argument shows that the function $f(x) = |S(x\Sigma^\star) - B(x\Sigma^\star)|$ can take at most $2m$ different values when $x$ runs over all possible strings in $\Sigma^\star$. Therefore by a union bound we have
\[
\mathbb{P}[\mathrm{L}_\infty^p(S, B) > \rho] \leq 2m \max_{x \in \Sigma^\star} \mathbb{P}[|S(x\Sigma^\star) - B(x\Sigma^\star)| > \rho] \; .
\]
Next we bound the probability in the right hand side for an arbitrary string $x$ such that $S[x\Sigma^\star] > 0$ – since otherwise the probability is zero. We begin by writing
\begin{align*}
\mathbb{P}[|B(x\Sigma^\star) - S(x\Sigma^\star)| > \rho] = \mathbb{P}[B(x\Sigma^\star) > S(x\Sigma^\star) + \rho] + \mathbb{P}[B(x\Sigma^\star) < S(x\Sigma^\star) - \rho] \; . \tag{2.3}
\end{align*}
Let $\gamma > 0$. We bound the first term as
\[
\mathbb{P}[B(x\Sigma^\star) > S(x\Sigma^\star) + \rho] \leq \mathbb{P}[B(x\Sigma^\star) > S(x\Sigma^\star) + \rho \mid |B| > (1 - \gamma)m] + \mathbb{P}[|B| < (1 - \gamma)m] \; ,
\]
where the second term is at most $\exp(-\gamma^2 m/2)$ by the Chernoff bounds. The first term can be bounded by
\begin{align*}
\mathbb{P}[B(x\Sigma^\star) > S(x\Sigma^\star) + \rho \mid |B| > (1 - \gamma)m]
&\leq \mathbb{P}[B[x\Sigma^\star] > (1 - \gamma)S[x\Sigma^\star] + \rho(1 - \gamma)m \mid |B| > (1 - \gamma)m] \\
&\leq \frac{\mathbb{P}[B[x\Sigma^\star] > (1 - \gamma)S[x\Sigma^\star] + \rho(1 - \gamma)m]}{\mathbb{P}[|B| > (1 - \gamma)m]} \; .
\end{align*}
By Chernoff the denominator is at least $1 - \exp(-\gamma^2 m/2)$. Now, assuming that $\gamma \leq (1/2)(1 + S[x\Sigma^\star]/\rho m)^{-1}$, we have $\gamma(1 + \rho m/S[x\Sigma^\star]) \leq \rho m/2S[x\Sigma^\star]$. Therefore, using $S[x\Sigma^\star] \leq m$, Chernoff bounds yield
\[
\mathbb{P}[B[x\Sigma^\star] > (1 - \gamma)S[x\Sigma^\star] + \rho(1 - \gamma)m] \leq e^{-m\rho^2/12} \; .
\]
Thus, choosing $\gamma = \rho/4$ we obtain that for $\rho^2 \geq (32\ln 2)/m$ it holds
\[
\mathbb{P}[B(x\Sigma^\star) > S(x\Sigma^\star) + \rho] \leq 3e^{-m\rho^2/32} \; .
\]
A similar argument applied to the other term in (2.3) shows that when $\rho^2 \geq (48\ln 2)/m$ one has
\[
\mathbb{P}[B(x\Sigma^\star) < S(x\Sigma^\star) - \rho] \leq 3e^{-m\rho^2/48} \; .
\]

Suppose once again that $S$ is a fixed arbitrary sample of size $m$. For $1 \leq i \leq r$ let $B_i$ be stream-bootstrapped samples of $S$. Pick $\theta > 0$ and write $X_{i,\theta} = \mathbb{1}_{\mathrm{L}_\infty^p(S, B_i) \geq \theta}$. Then define $X_\theta = \sum_i X_{i,\theta}$.

Lemma 2.5.2. For any $\kappa > 0$ and $\theta \geq \sqrt{(48/m)\ln(12\sqrt{2}m/\kappa)}$ one has
\[
\mathbb{P}[X_\theta > \kappa r] \leq \frac{200m^2}{\kappa^2 r} \exp\left(-\frac{m\theta^2}{192}\right) \; .
\]

Proof. The proof proceeds by bounding the variance of $X_\theta$ using the Efron–Stein inequality, and then plugging the bound into the Chebyshev–Cantelli inequality.

For $s \in [r]$ and $t \in [m]$ let $U_{s,t}$ be i.i.d. uniform random variables on $[r]$. Then we can think of $X_\theta = X_\theta(U_{1,1}, \ldots, U_{r,m})$ as a function of these random variables representing the tosses of strings in $S$ into the bootstrapped sketches. Furthermore, for $s \in [r]$ and $t \in [m]$ let $U'_{s,t}$ denote an independent copy of each $U_{s,t}$. We write $X'_{\theta,s,t} = X_\theta(U_{1,1}, \ldots, U'_{s,t}, \ldots, U_{r,m})$ to denote the value of the function where $U_{s,t}$ has been replaced by $U'_{s,t}$. We will use the notation $B'_{i,s,t}$ to denote the stream-bootstrapped samples obtained by these alternative tossings.

Note that $X'_{\theta,s,t}$ is the number of sketches such that $\mathrm{L}_\infty^p(S, B) > \theta$, where only one of the tosses has changed with respect to $X_\theta$. If we write $U_{s,t} = i$ and $U'_{s,t} = j$, then only the values in $X_{i,\theta}$ and $X_{j,\theta}$ may have changed. Therefore, by the Efron–Stein inequality [BLB04] and symmetry with respect to the choice of $s$ and $t$, we have
\begin{align*}
\mathbb{V}[X_\theta] &\leq \frac{1}{2} \sum_{s,t} \mathbb{E}[(X_\theta - X'_{\theta,s,t})^2] \\
&= \frac{mr}{2} \mathbb{E}[(X_{i,\theta} + X_{j,\theta} - X'_{i,\theta} - X'_{j,\theta})^2] \\
&\leq 2mr(\mathbb{P}[X_{i,\theta} \neq X'_{i,\theta}] + \mathbb{P}[X_{j,\theta} \neq X'_{j,\theta}]) \; .
\end{align*}
To bound the quantity above observe that if $X_{i,\theta} \neq X'_{i,\theta}$ then $\mathrm{L}_\infty^p(S, B_i)$ must be close enough to the threshold $\theta$ in order to cross it by removing only one example. To see how close to the threshold $\mathrm{L}_\infty^p(S, B_i)$ needs to be, note that by the triangle inequality we have
\[
|\mathrm{L}_\infty^p(S, B_i) - \mathrm{L}_\infty^p(S, B'_i)| \leq \mathrm{L}_\infty^p(B_i, B'_i) \; ,
\]
where now $B'_i$ denotes the $i$th stream-bootstrapped sample obtained by a tossing that differs only in a coordinate that was originally assigned to $B_i$. Thus, since $B'_i$ is obtained from $B_i$ by removing a copy of $x_t$, it is easy to see that $\mathrm{L}_\infty^p(B_i, B'_i) \leq 1/(|B_i| - 1)$. Therefore, $X_{i,\theta} \neq X'_{i,\theta}$ necessarily implies $|\mathrm{L}_\infty^p(S, B_i) - \theta| \leq 1/(|B_i| - 1)$, and in particular
\[
\mathbb{P}[X_{i,\theta} \neq X'_{i,\theta}] \leq \mathbb{P}[\mathrm{L}_\infty^p(S, B_i) \geq \theta - 1/(|B_i| - 1)] \; .
\]
Let $\gamma > 0$ and $\tau = ((1 - \gamma)m - 1)^{-1}$. We can bound the probability above as
\[
\mathbb{P}[\mathrm{L}_\infty^p(S, B_i) \geq \theta - 1/(|B_i| - 1)] \leq \mathbb{P}[|B_i| < (1 - \gamma)m] + \frac{\mathbb{P}[\mathrm{L}_\infty^p(S, B_i) \geq \theta - \tau]}{\mathbb{P}[|B_i| \geq (1 - \gamma)m]} \; .
\]
Now, by Chernoff bounds the first term is at most $\exp(-m\gamma^2/2)$ and the denominator in the second term is at least $1 - \exp(-m\gamma^2/2)$. Furthermore, assuming $\gamma \leq (m - 1 - 2\theta^{-1})/m$, we have $\tau \leq \theta/2$ and Lemma 2.5.1 yields
\[
\mathbb{P}[\mathrm{L}_\infty^p(S, B_i) \geq \theta - \tau] \leq 12m e^{-m(\theta/2)^2/48} \; ,
\]
when $\theta^2/4 \geq (48\ln 2)/m$. Since this last restriction implies $m \geq 4/\theta + 2$, we can choose $\gamma = 1/2$ and obtain
\[
\mathbb{P}[X_{i,\theta} \neq X'_{i,\theta}] \leq 25m e^{-m(\theta/2)^2/48} \; .
\]
Altogether, the same arguments yield an identical bound for $\mathbb{P}[X_{j,\theta} \neq X'_{j,\theta}]$, from which it follows that
\[
\mathbb{V}[X_\theta] \leq 100m^2 r e^{-m(\theta/2)^2/48} \; ,
\]
for $\theta^2/4 \geq (48\ln 2)/m$. Now, by linearity of expectation and Lemma 2.5.1 we have
\[
\mathbb{E}[X_\theta] = r\,\mathbb{P}[\mathrm{L}_\infty^p(S, B) > \theta] \leq 12mr e^{-m\theta^2/48} \; ,
\]
when $\theta \geq \sqrt{(48\ln 2)/m}$. Furthermore, if $\theta \geq \sqrt{(48/m)\ln(12\sqrt{2}m/\kappa)}$, then $\mathbb{E}[X_\theta] \leq \kappa r/\sqrt{2}$. Thus, under this condition, the Chebyshev–Cantelli inequality yields
\[
\mathbb{P}[X_\theta > \kappa r] \leq \frac{\mathbb{V}[X_\theta]}{(\kappa r - \mathbb{E}[X_\theta])^2} \leq \frac{200m^2}{\kappa^2 r} e^{-m(\theta/2)^2/48} \; .
\]

Note that the events considered in Lemmas 2.5.1 and 2.5.2 depend only on the process that samples the stream-bootstrapped samples $B_i$ and $B'_j$. Thus, we can proceed to bound the terms in (2.2) as follows:
\[
\mathbb{P}[X \geq \beta k/r \mid G(t, t')] \leq \mathbb{P}[X_{\gamma\Delta - t} \geq (\beta k/r^2)r] \leq \frac{200m^2 r^3}{\beta^2 k^2} e^{-m(\gamma\Delta - t)^2/192} \; .
\]
Obviously, we also have
\[
\mathbb{P}[X' \geq (1 - \beta)k/r \mid G(t, t')] \leq \frac{200m'^2 r^3}{(1 - \beta)^2 k^2} e^{-m'((1-\gamma)\Delta - t')^2/192} \; .
\]
Now, by setting $t' = t\sqrt{m/m'}$, $\gamma = \sqrt{m'}/(\sqrt{m} + \sqrt{m'})$, and $\beta = m/(m + m')$, we obtain
\[
\mathbb{P}[\mu_\star > \hat{\mu}_{\omega(k)} + \Delta' \mid G(t, t')] \leq \frac{400 r^3 M^2}{k^2} e^{-M'(\Delta - \tau)^2/192} \; ,
\]
where $\tau = t/\gamma$ and we require that $\Delta \geq \tau + \sqrt{(48/M')\ln(12\sqrt{2}r^2 M/k)}$. Finally, recalling the bound on $\mathbb{P}[\bar{G}(t, t')]$ from the proof of Theorem 2.3.1 and choosing $\tau = \sqrt{(8/M')\ln(32M/\delta)}$, the definition of $\Delta$ yields
\begin{align*}
\mathbb{P}[\mu_\star > \hat{\mu}_{\omega(k)} + \Delta'] &\leq \mathbb{P}[\bar{G}(t, t')] + \mathbb{P}[\mu_\star > \hat{\mu}_{\omega(k)} + \Delta' \mid G(t, t')] \\
&\leq 16M e^{-M'\tau^2/8} + \frac{400 r^3 M^2}{k^2} e^{-M'(\Delta - \tau)^2/192} = \delta \; .
\end{align*}
The condition on $k$ follows from the constraint
\[
\sqrt{\frac{192}{M'} \ln\left(\frac{800 r^3 M^2}{\delta k^2}\right)} \geq \sqrt{\frac{48}{M'} \ln\left(\frac{12\sqrt{2} r^2 M}{k}\right)} \; .
\]

2.6 Experimental Results

In this section we report some experimental results showing that bootstrap-based tests for similarity testing behave very well in practice. In particular we compare a test based on upper confidence bounds obtained with Efron's bootstrap with the tests derived from Propositions 2.2.1 and 2.2.2.

We begin by describing the experimental setup. First we fixed an alphabet $\Sigma$ with $|\Sigma| = 4$ and defined four PDFA with two states realizing four different distributions over $\Sigma^\star$: $D_0$, $D_{0.05}$, $D_{0.1}$, and $D_{0.25}$. Those were designed in such a way that $\mathrm{L}_\infty^p(D_0, D_{0.05}) = 0.05$, $\mathrm{L}_\infty^p(D_0, D_{0.1}) = 0.1$, and $\mathrm{L}_\infty^p(D_0, D_{0.25}) = 0.25$. Using samples generated from these distributions we tested the power of bootstrap tests under different distinguishability conditions. Specifically, for different values of $m$, we obtained samples $S_0$ from $D_0$ and $S_\mu$ from $D_\mu$ with $|S_0| = |S_\mu| = m$ for all $\mu \in \{0.25, 0.1, 0.05, 0\}$. Then we computed the empirical estimate $\hat{\mu} = \mathrm{L}_\infty^p(S_0, S_\mu)$. Furthermore, for some fixed $r$ we also obtained bootstrapped samples $B_1, \ldots, B_r$ from $S_0$ and $B'_1, \ldots, B'_r$ from $S_\mu$. Then we used those to obtain bootstrapped estimates $\hat{\mu}_i = \mathrm{L}_\infty^p(B_i, B'_i)$ for $i \in [r]$; we assume without loss of generality that $\hat{\mu}_1 \leq \cdots \leq \hat{\mu}_r$. Using all this information, we then picked some confidence parameter $0 < \delta < 1$ and computed upper bounds for $\mathrm{L}_\infty^p(D_0, D_\mu)$ of the form $\hat{\mu} + \Delta(\delta)$ and $\hat{\mu}_{\lceil(1-\delta)r\rceil}$. We also computed the lower bound $\hat{\mu} - \Delta(\delta)$. All this was done for all values of $r \in \{20, 50, 100\}$ and $\delta \in \{0.25, 0.1, 0.05\}$. The values of $m$ depended on each choice of $\mu$. In particular, we picked $m = t \cdot m_\mu$ for $t \in \{1, 2, \ldots, 25\}$, where $m_{0.25} = 800$, $m_{0.1} = 5600$, $m_{0.05} = 10000$, and $m_0 = 4000$. These $m_\mu$ were chosen to ensure that a test based on VC bounds is able to make the correct decision for $m = 25 m_\mu$. For each of these experiments, we made $K = 20$ independent repetitions. Results are given in Table 2.1 and Figure 2.1.

For the case $\mu > 0$ we wanted to study the rate of false positives of the following test based on bootstrap estimates: decide that two distributions are similar when $\hat{\mu}_{\lceil(1-\delta)r\rceil} < \mu$. That is, we are interested in seeing how often the bootstrap upper bound falls below the true $\mu$. In particular, this test has to compete against the following test based on VC bounds: decide that two distributions are different when $\hat{\mu} - \Delta(\delta) > 0$. In this scheme, we see a false positive whenever $\hat{\mu}_{\lceil(1-\delta)r\rceil} < \mu$ happens before $\hat{\mu} - \Delta(\delta) > 0$. In particular, for a sequence of estimates $\hat{\mu} - \Delta(\delta)$ with growing $m$, we found $m_{\mathrm{cut}} = t_{\mathrm{cut}} \cdot m_\mu$ such that the average of $\hat{\mu} - \Delta(\delta)$ over the $K$ experiments was positive. Then we counted how many times, for $1 \leq t \leq t_{\mathrm{cut}}$ and $1 \leq k \leq K$, the upper bound $\hat{\mu}_{\lceil(1-\delta)r\rceil}$ fell below $\mu$; we denote this count of "fails" by $F_\mu$. Furthermore, we wanted to know how much smaller than $\mu$ the bound $\hat{\mu}_{\lceil(1-\delta)r\rceil}$ can become. Thus, we computed the minimum $\hat{\mu}_{\min}$ of this quantity over the same range of $t$ and $k$. These results are detailed in Table 2.1. Also, Figure 2.1a shows the evolution of the average of $\hat{\mu}_{\lceil(1-\delta)r\rceil}$ over the $K$ experiments for different values of $m$.

Overall we observe that the behavior of the bootstrap tests is very good, even for small values of $r$. In particular, we observe in almost all cases a difference of one order of magnitude between the confidence parameter and the estimated false positive rate. We also see, as expected, that for smaller values of $\mu$ the test tends to commit more errors. However, such errors are never very significant for small values of $\delta$, since we see in the table that $\hat{\mu}_{\min}$ is never much smaller than $\mu$ in those cases.

When $\mu = 0$ we were interested in comparing the upper bounds provided by bootstrapped estimates with the ones obtained via VC bounds. We can see in Figure 2.1b that bootstrap bounds are much tighter and converge faster to zero by a factor of about 10. Thus, they provide a tool for deciding that $\mu$ is below a certain threshold with fewer examples than required by VC bounds. In Figures 2.1c and 2.1d we compare the effect of $r$ and $\delta$ on the behavior of these upper bounds. We observe that all cases have similar variances which decrease when the sample size grows. In terms of the mean we see that $r = 20$ yields slightly less stable estimates than $r = 100$.

Overall, we can conclude that tests based on Efron's bootstrap are a promising tool for testing distribution similarity in state-merging algorithms. We also note that such tests behave very well for moderate parameter choices, which allows for efficient testing with moderate sample sizes.

                   µ      m_cut    µ̂_min   F_µ (K·t_cut)   F_µ/(K·t_cut)
          δ = 0.25 0.25   6400     0.235    7 (160)         0.044
                   0.10   44800    0.097   13 (160)         0.081
                   0.05   210000   0.045   25 (420)         0.060
 r = 20   δ = 0.10 0.25   7200     0.242    4 (180)         0.022
                   0.10   50400    0.098    5 (180)         0.028
                   0.05   220000   0.047   11 (440)         0.025
          δ = 0.05 0.25   7200     0.243    4 (180)         0.022
                   0.10   50400    0.098    3 (180)         0.017
                   0.05   230000   0.047    8 (460)         0.017
          δ = 0.25 0.25   6400     0.239    4 (160)         0.025
                   0.10   44800    0.097    8 (160)         0.050
                   0.05   210000   0.044   23 (420)         0.055
 r = 50   δ = 0.10 0.25   7200     0.246    5 (180)         0.028
                   0.10   50400    0.099    3 (180)         0.017
                   0.05   220000   0.047    6 (440)         0.014
          δ = 0.05 0.25   7200     0.250    0 (180)         0.000
                   0.10   50400    0.099    1 (180)         0.006
                   0.05   230000   0.048    3 (460)         0.007
          δ = 0.25 0.25   6400     0.235    6 (160)         0.037
                   0.10   44800    0.096   11 (160)         0.069
                   0.05   210000   0.046   24 (420)         0.057
 r = 100  δ = 0.10 0.25   7200     0.247    5 (180)         0.028
                   0.10   50400    0.099    3 (180)         0.017
                   0.05   220000   0.047   10 (440)         0.023
          δ = 0.05 0.25   7200     0.250    1 (180)         0.006
                   0.10   50400    0.100    0 (180)         0.000
                   0.05   230000   0.049    5 (460)         0.011

Table 2.1: Experimental results comparing different parameter choices for bootstrapped upper bounds.

[Figure 2.1 appears here: four panels plotting bound values against sample size. Panels: (a) r = 20, 100, µ = 0.05, δ = 0.05; (b) r = 100, µ = 0, δ = 0.05; (c) r = 20, µ = 0; (d) r = 100, µ = 0.]

Figure 2.1: Experimental results comparing VC and bootstrap bounds. Averages and standard deviations over 20 repetitions. a. Lower bound with VC bounds and upper bounds with bootstrap r = 20, 100, for µ = 0.05 and δ = 0.05. b. Upper bounds with VC and bootstrap with r = 100, µ = 0, and δ = 0.05. c. Bootstrap upper bounds with r = 20 and µ = 0. d. Bootstrap upper bounds with r = 100 and µ = 0.

Chapter 3

Learning PDFA from Data Streams Adaptively

In a data streams scenario, algorithms can only access data in the form of a sequence of examples that must be processed on-line using very little time and memory per example. It turns out that this computational model describes a rather natural framework in which several real-world problems related to network traffic analysis, social web mining, and industrial monitoring can be cast.

Most algorithms in the streaming model fall into one of the following two classes: a class containing primitive building blocks, like change detectors and sketching algorithms for computing statistical moments and finding frequent items; and a class containing full-featured data mining algorithms, like frequent itemset miners, decision tree learners, and clustering algorithms. A generally valid rule is that primitives from the former class can be combined for building algorithms in the latter class.

In this chapter we design a new state-merging algorithm for learning PDFA in this demanding algorithmic paradigm. In particular, we will describe algorithms in both of the categories mentioned above, and use them to design a complete learning system that, in addition to learning PDFA using very little memory and processing time per example, will be able to detect changes in the distribution that generates the stream and adapt the learning process accordingly. Regarding the state-merging component, the two main contributions of this chapter are the design of an efficient and adaptive scheduling policy to perform similarity tests as soon as enough information is available, and the use of sketching methods to find frequent prefixes in streams of strings which yield a PAC learning algorithm for PDFA using $O(1/\mu)$ memory, in contrast with the usual $O(1/\mu^2)$ required by batch methods. Other novel contributions include a parameter search strategy that discovers the number of states and distinguishability of an unknown target PDFA, and a change detector module that can determine when the distribution generating the stream has changed.

The chapter begins by recalling the basic facts about the data stream model, and then proceeds to give a high-level description of our system for learning PDFA in this model. Details on the individual building blocks of this system come next. In particular, we present a sketch for finding frequent prefixes in a stream of strings, an adaptive state-merging algorithm, a parameter selection strategy fulfilling the restrictions of the data stream paradigm, and a change detector with guarantees on the number of false negatives and positives.

3.1 The Data Stream Framework

The data stream computation model has established itself over the last fifteen years as a framework for the design and analysis of algorithms on high-speed sequential data [Agg07]. It is characterized by the following assumptions:

1. The input is a potentially infinite sequence of items $x_1, x_2, \ldots, x_t, \ldots$ from some (large) universe $X$.

2. Item $x_t$ is available only at the $t$th time step and the algorithm has only that chance to process it, probably by incorporating it into some summary or sketch; that is, only one pass over the data is allowed.

3. Items arrive at high speed, so the processing time per item must be very low – ideally, constant time, but most likely, logarithmic in $t$ and $|X|$.

4. The amount of memory used by algorithms (the size of the sketches alluded to above) must be sublinear in the data seen so far at every moment; ideally, at time $t$ memory must be polylogarithmic in $t$ – for many computations, this is impossible and memory of the form $t^c$ for constant $c < 1$ may be required. Logarithmic dependence on $|X|$ is also desirable.

5. Approximate, probabilistic answers are often acceptable.

A large fraction of the data stream literature discusses algorithms working under worst-case assumptions on the input stream, e.g. computing the required (approximate) answer at all times $t$ for all possible values of $x_1, \ldots, x_t$ [LZ08; Mut05]. For example, several sketches have been proposed for computing an $\varepsilon$-approximation to the number of distinct items seen in the stream using memory $O(\log(t|X|)/\varepsilon)$. In machine learning and data mining, this is often not the problem of interest: one is interested in modeling the current "state of the world" at all times, so the current items matter much more than those from the far past [Bif10; Gam10]. An approach is to assume that each item $x_t$ is generated by some underlying distribution $D_t$ over $X$, that varies over time, and the task is to track the distributions $D_t$ (or their relevant information) from the observed items. Of course, this is only possible if these distributions do not change too wildly, e.g. if they remain unchanged for fairly long periods of time ("distribution shifts", "abrupt change"), or if they change only very slightly from $t$ to $t + 1$ ("distribution drift"). A common simplifying assumption (which, though questionable, we adopt here) is that successive items are generated independently, i.e. that $x_t$ depends only on $D_t$ and not on the outcomes $x_{t-1}$, $x_{t-2}$, etc.

In our case, the universe $X$ will be the infinite set $\Sigma^\star$ of all strings over a finite alphabet $\Sigma$. Intuitively, the role of $\log(|X|)$ will be replaced by a quantity such as the expected length of strings under the current distribution.

3.2 A System for Continuous Learning

Before getting into the technicalities of our algorithm for learning PDFA from data streams, we begin with a general description of a complete learning system capable of adapting to changes and making predictions about future observations. In the following sections we will describe the components involved in this system in detail and prove rigorous time and memory bounds.

The goal of the system is to keep at all times a hypothesis – a PDFA in this case – that models as closely as possible the distribution of the strings currently being observed in the stream. For convenience, the hypothesis PDFA will be represented as two separate parts: the DFA containing the inferred transition structure, and tables of estimations for transition and stopping probabilities. Using this representation, the system is able to decouple the state-merging process that learns the transition structure of the hypothesis from the estimation procedure that computes transition and stopping probabilities. This decomposition is also useful in terms of change detection. Changes in transition or stopping probabilities can be easily tracked with a simple sliding window technique. On the other hand, changes in the structure of a PDFA are much harder to detect, and modifying an existing structure to adapt to this sort of change is a very challenging problem. Therefore, our system contains a block that continually estimates transition and stopping probabilities, and another block that detects changes in the underlying structure and triggers a procedure that learns a new transition structure from scratch. A final block that uses the current hypothesis to make predictions about the observations in the stream can also be integrated into the system. Since this block will depend on the particular application, it will not be discussed further here. The structure we just described is depicted in Figure 3.1.

[Figure 3.1: System for Continuous Learning from Data Streams. The block diagram shows the Stream feeding a structure learning block (Parameter Search and Learning Algorithm) that produces a Hypothesis, which is shared with a Change Detector, an Estimator, and a Predictor emitting Predictions.]

The information flow in the system works as follows. The structure learning block – containing a state-merging algorithm and a parameter search block in charge of finding the correct n and µ for the target – is started, and the system waits until it produces a first hypothesis DFA. This DFA is fed to the probability estimation block and the change detector. From then on, these two blocks run in parallel, as does the learning block, which keeps learning new structures with more refined parameters. If at some point a change in the target structure is found, a signal is emitted and the learning block restarts the learning process.¹ In parallel, the estimator block keeps updating transition and stopping probabilities at all times. It may be the case that this adaptation procedure is enough to track the current distribution. However, if the structure learning procedure produces a new DFA, transition probabilities are estimated for this new structure, which then takes the place of the current hypothesis. Thus, the system will recover much faster from a change that only affects transition probabilities than from a change in the structure of the target.

3.3 Sketching Distributions over Strings

In this section we describe sketches that can be used by our state-merging algorithm in data streams. The basic building block of these sketches is the Space-Saving algorithm [MAA05]. We will use this sketch to keep information about frequent prefixes in a stream of strings over Σ*. This information will then be used to compute the statistics required by the similarity tests described in Section 2.4.

We begin by recalling the basic properties of the Space-Saving sketch introduced in [MAA05]. Given a number of counters K, the Space-Saving sketch SpSv(K) is a data structure that uses memory O(K) at all times and has two basic operations. The first is an insertion operation that receives an element and adds it to the sketch in time O(1). We use f_x to denote the number of times an element x has been added to the sketch. The number of elements added to the sketch up to some point is m = Σ_x f_x. The second is a retrieval operation that, given some ε ≥ 1/K, takes time O(1/ε) and returns a set of at most K pairs of the form (x, f̂_x) such that 0 ≤ f̂_x − f_x ≤ m/K, and which is guaranteed to contain every x whose f_x ≥ mε. Using these operations an algorithm can maintain a summary of the most frequent elements seen in a stream together with an approximation to their current absolute or relative frequency.

In order to be able to retrieve frequent prefixes from a stream of strings, some modifications have to be made to this sketch. A first observation is that whenever we observe a string x of length |x| in the stream, we must insert |x| + 1 prefixes into the sketch. This is another way of saying that under a distribution D over Σ* events of the form xΣ* and yΣ* are not independent if x is a prefix of y. In fact, it can be shown that Σ_x D(xΣ*) = L + 1, where L = Σ_x |x|D(x) is the expected length of D [CT04a]. In practice, a good estimate for L can be easily obtained from an initial fraction of the stream.

¹Here one has a choice of keeping the current parameters or resetting them to some initial values; prior knowledge about the changes the algorithm will be facing can help make an informed decision on this point.
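To make the two operations above concrete, here is a minimal Python rendering of a Space-Saving sketch together with the prefix-insertion behaviour described next. Class and method names are ours; the real data structure of [MAA05] uses a linked “stream-summary” structure to make every insertion O(1), while this sketch favours brevity over that guarantee.

```python
class SpaceSaving:
    """Space-Saving sketch with K counters: the stored count overestimates
    the true count f_x by at most m/K, where m is the total number of insertions."""

    def __init__(self, K):
        self.K = K
        self.counts = {}    # element -> approximate count
        self.m = 0          # total insertions so far

    def insert(self, x):
        self.m += 1
        if x in self.counts:
            self.counts[x] += 1
        elif len(self.counts) < self.K:
            self.counts[x] = 1
        else:
            # Evict the element with the smallest counter and start the
            # counter of x from that (pessimistic) value plus one.
            x_min = min(self.counts, key=self.counts.get)
            self.counts[x] = self.counts.pop(x_min) + 1

    def retrieve(self, eps):
        # Contains every x with f_x >= eps * m (and possibly a few more).
        return [(x, c) for x, c in self.counts.items() if c >= eps * self.m]


class PrefixSpaceSaving:
    """SpSv^p(K): inserting a string inserts every one of its |x|+1 prefixes,
    so that frequent prefixes can be retrieved later."""

    def __init__(self, K):
        self.sketch = SpaceSaving(K)
        self.num_strings = 0    # relative frequencies divide by this

    def insert(self, s):
        self.num_strings += 1
        for i in range(len(s) + 1):    # prefixes s[:0] = λ, ..., s[:len(s)] = s
            self.sketch.insert(s[:i])

    def relative_frequencies(self, eps):
        return {x: c / self.num_strings for x, c in self.sketch.retrieve(eps)}
```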

It is easy to see that a Space-Saving sketch with O(KL) counters can be used to retrieve prefixes with relative frequencies larger than some ε ≥ 1/K, approximating these frequencies with error at most O(1/K). When computing relative frequencies, the absolute frequency obtained via a retrieval operation needs to be divided by the number of strings added to the sketch so far (instead of the number of prefixes). We encapsulate all this behavior into a Prefix-Space-Saving sketch SpSv^p(K), which is basically a Space-Saving sketch with K counters where, when one string is inserted, each proper prefix of the string is inserted into the sketch as well. A string x is processed in time O(|x|). Such a sketch can be used to keep information about the frequent prefixes in a stream of strings, and the information in two Prefix-Space-Saving sketches corresponding to streams generated by different distributions can be used to approximate their L∞^p distance.

We now analyze the error introduced by the sketch on the empirical L∞^p distance between the “exact” empirical distribution corresponding to a sample S and its sketched version, which we denote by Ŝ. Fix K > 0. Given a sequence S = (x₁, …, x_m) of strings from Σ*, for each prefix x ∈ Σ* we denote by Ŝ[xΣ*] the absolute frequency returned for prefix x by a Prefix-Space-Saving sketch SpSv^p(K) that received S as input; that is, Ŝ[xΣ*] = f̂_x if the pair (x, f̂_x) was returned by a retrieval query with ε = 1/K, and Ŝ[xΣ*] = 0 otherwise. Furthermore, Ŝ(xΣ*) denotes the relative frequency of the prefix x in Ŝ: Ŝ(xΣ*) = Ŝ[xΣ*]/m. The following result analyzes the maximum of the differences |Ŝ(xΣ*) − S(xΣ*)|.

Lemma 3.3.1. Let S = (x₁, …, x_m) be a sequence of i.i.d. examples from a PDFA D and Ŝ a SpSv^p(K) sketch where each element of S has been inserted. Then with probability at least 1 − δ the following holds:

L∞^p(S, Ŝ) ≤ (L + 1)/K + √( (32e² / (mK²c_D²)) ln(1/δ) ) .

Proof. In the first place note that the number of elements inserted into the underlying Space-Saving sketch is M = Σ_{i=1}^m (|x_i| + 1). This is a random variable whose expectation is E[M] = m(L + 1). We claim that for any x ∈ Σ* we have |S(xΣ*) − Ŝ(xΣ*)| ≤ M/Km. If Ŝ[xΣ*] = 0, that means that S[xΣ*] < M/K, and therefore the bound holds. On the other hand, the guarantee on the sketch gives us |S[xΣ*] − Ŝ[xΣ*]|/m ≤ M/Km. Now, since Z = M − m is the sum of m i.i.d. subexponential random variables by Lemma 1.1.1, we can apply Lemma A.1.5 and obtain

P[Z − E[Z] ≥ mt] ≤ exp( −(m/(8e²)) · min{ t²c_D²/4 , tc_D/2 } ) .

The bound follows from choosing t = √( (32e²/(mc_D²)) ln(1/δ) ).

Thus, with the notation of Section 2.4, a Prefix-Space-Saving sketch with K counters that receives as input strings generated by a PDFA D is a ν-sketch, where

ν = (L + 1)/K + √( (32e² / (mK²c_D²)) ln(1/δ) ) .

Plugging this value into the results from Section 2.4 one obtains confidence intervals for µ⋆ = L∞^p(D, D′) using Prefix-Space-Saving sketches.

In the state-merging algorithm described in the next section, a sketch will be associated with each state in the hypothesis in order to keep an approximation to a sample of the distribution generated from that state. The particular form of the sketch will depend on the similarity test used by the algorithm, but all sketches will use the Prefix-Space-Saving sketch as a basic building block. In particular, if a test based on VC bounds is used (cf. Corollaries 2.4.1 and 2.4.2), then to each state q we will associate a sketch Ŝ_q = SpSv^p(⌈1/α⌉) for some parameter α that will depend on the distinguishability and expected length of the target PDFA. In this case, each state will use memory O(1/α) and each insertion operation will take time O(|x|). Furthermore, given two sketches Ŝ and Ŝ′, the statistic L∞^p(Ŝ, Ŝ′) can be computed in time O(1/α) because it only requires two retrieval operations and computing a maximum among O(1/α) items.
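Computing the statistic L∞^p(Ŝ, Ŝ′) from two sketches then amounts to two retrieval operations and one maximum, as in this short sketch (which assumes the hypothetical PrefixSpaceSaving class from the earlier listing):

```python
def prefix_linf_distance(sk1, sk2, eps):
    """Approximate L∞^p between the empirical distributions summarized by two
    PrefixSpaceSaving sketches: the largest difference, over the prefixes
    retrieved from either sketch, between their relative frequencies."""
    f1 = sk1.relative_frequencies(eps)
    f2 = sk2.relative_frequencies(eps)
    prefixes = set(f1) | set(f2)    # a prefix missing from a sketch counts as 0
    return max((abs(f1.get(p, 0.0) - f2.get(p, 0.0)) for p in prefixes),
               default=0.0)
```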

α₀ = 128        α = 2
β₀ = (64n|Σ|/ε) ln(2/δ′)        β = (64n|Σ|/ε) ln 2
θ = 3ε/(8n|Σ|)
δ′ = δ/(2|Σ|n(n + 2))
δ_i = 6δ′/(π²i²)

Table 3.1: Definitions used in Algorithm 2

If a bootstrap similarity test is used, then each state will have a composite sketch of the form Ŝ_q = (SpSv^p(⌈1/α₁⌉), B̂₁, …, B̂_r). Items will be assigned to each B̂_i = SpSv^p(⌈1/α₂⌉) following the bootstrapping scheme described in Section 2.4. The sketch with parameter α₁ will receive every item from the multiset S_q and will be used to test dissimilarity using Corollary 2.4.2. The bootstrapped sketches B̂₁, …, B̂_r will be used to test similarity according to Corollary 2.4.4. Now each state will use memory O(1/α₁ + r/α₂) and each insertion operation will take time O(r|x|). Given two such sketches, upper and lower confidence bounds will be computed in time O(1/α₁ + r²/α₂).

3.4 State-merging in Data Streams

In this section we present an algorithm for learning distributions over strings from a data stream. The algorithm learns a PDFA by adaptively performing statistical tests in order to discover new states in the automaton and merge similar states. We focus on the state-merging aspect of the algorithm, which is in charge of obtaining the transition structure between states. This stage requires the use of sketches to store samples of the distribution generated from each state in the PDFA, and a testing sub-routine to determine whether two states are equal or distinct based on the information contained in these sketches. Since both of these components have already been discussed in detail elsewhere, here we will just assume that components satisfying certain assumptions are used in the algorithm, without giving further implementation details. After finding the transition structure between states, a second stage takes place in which transition and stopping probabilities are estimated. The implementation of this stage is routine and will not be discussed here.

The algorithm will read successive strings over Σ from a data stream and, after some time, output a PDFA. Assuming the distribution generating the stream is stationary and generated from a PDFA, we will show that the output will then be accurate with high probability. The procedure requires as input a guess on the number of states in the target. We begin with an informal description of the algorithm, which is complemented by the pseudo-code in Algorithm 2. The algorithm follows a structure similar to other state-merging algorithms, like the one described in Section 1.4, though here tests to determine similarity between states are performed adaptively as examples arrive. Our algorithm requires some parameters as input: the usual accuracy ε and confidence δ, a finite alphabet Σ, and a number of states n. Some quantities defined in terms of these parameters that are used in the algorithm are given in Table 3.1. The algorithm, which is called StreamLearner, reads data from a stream of strings over Σ. At all times it keeps a hypothesis represented by a DFA. States in the DFA are divided into three kinds: safes, candidates, and insignificants, with a distinguished initial safe state denoted by q_λ. Candidate and insignificant states have no outgoing transitions. To each string w ∈ Σ* we may be able to associate a state by starting at q_λ and successively traversing the transitions labeled by the symbols of w in order. If all transitions are defined, the last state reached is denoted by

q_w; otherwise q_w is undefined – note that by this procedure different strings w and w′ may yield q_w = q_{w′}. For each state q_w, StreamLearner keeps a multiset S_w of strings. These multisets grow with the number of strings processed by the algorithm and are used to keep statistical information about a distribution D_{q_w}. In fact, since the algorithm only needs information about frequent prefixes in the multiset, it does not need to keep the full multiset in memory. Instead, it uses sketches to keep the relevant information for each state. We use Ŝ_w to denote the information contained in the sketch associated with state q_w, and |Ŝ_w| to denote the number of strings inserted into that sketch.

Execution starts from a DFA consisting of a single safe state q_λ and several candidates q_σ, one for each σ ∈ Σ. All states start with an empty sketch. Each element x_t in the stream is then processed in turn: for each prefix w of x_t = wz that leads to a state q_w in the DFA, the corresponding suffix z is added to that state's sketch Ŝ_{q_w}. During this process, similarity and insignificance tests are performed on candidate states following a certain schedule; the former are triggered by the sketch's size reaching a certain threshold, while the latter occur at fixed intervals after the state's creation. In particular, t⁰_w denotes the time state q_w was created, tˢ_w is a threshold on the size |Ŝ_w| that will trigger the next round of similarity tests for q_w, and tᵘ_w is the time at which the next insignificance test will occur. The parameter i_w keeps track of the number of similarity tests performed for state q_w; this is used to adjust the confidence parameter in those tests. Insignificance tests are used to check whether the probability that a string traverses the arc reaching a particular candidate is below a certain threshold; it is known that these transitions can be safely ignored when learning a PDFA [CT04a; PG07]. Similarity tests use statistical information provided by the candidate's sketch to determine whether it equals some already existing safe or is different from all safes in the DFA. These tests can return three values: equal, distinct, and unknown. These answers are used to decide what to do with the candidate currently being examined. A candidate state will exist until it is promoted to safe, merged into another safe, or declared insignificant. When a candidate is merged into a safe, the sketches associated with that candidate are discarded. The algorithm ends when there are no candidates left, or when the number of safe states surpasses the limit n given by the user.

3.4.1 Analysis

Now we proceed to analyze the StreamLearner algorithm. We will consider memory and computing time used by the algorithm, as well as accuracy of the hypothesis produced in the case when the stream is generated by a PDFA. Our analysis will be independent of the particular sketching methodology and similarity test used in the algorithm. In this respect, we will only require that the particular components used in StreamLearner to that effect satisfy the following assumptions.

Assumption 1. The algorithms Sketch and Test used in StreamLearner satisfy the following:

1. each instance of Sketch uses memory at most M_s,

2. a Sketch.insert(x) operation takes time O(|x|T_s),

3. any call Test(Ŝ, Ŝ′, δ) takes time at most T_t,

4. there exists an N_u such that if |Ŝ|, |Ŝ′| ≥ N_u, then a call Test(Ŝ, Ŝ′, δ) will never return unknown,

5. there exists an N_e such that if a call Test(Ŝ, Ŝ′, δ) returns equal, then necessarily |Ŝ| ≥ N_e or |Ŝ′| ≥ N_e,

6. when a call Test(Ŝ, Ŝ′, δ) returns either equal or distinct, then the answer is correct with probability at least 1 − δ.

Our first result is about the memory and number of examples used by the algorithm.

Theorem 3.4.1. The following hold for any call to StreamLearner(n, Σ, ε, δ):

1. The algorithm uses memory O(n|Σ|M_s)

Algorithm 2: StreamLearner procedure
Input: Parameters n, Σ, ε, δ, Sketch, Test
Data: A stream of strings x₁, x₂, … ∈ Σ*
Output: A hypothesis H

initialize H with safe q_λ and let Ŝ_λ ← Sketch;
foreach σ ∈ Σ do
    add a candidate q_σ to H and let Ŝ_σ ← Sketch;
    t⁰_σ ← 0, tˢ_σ ← α₀, tᵘ_σ ← β₀, i_σ ← 1;

foreach string x_t in the stream do
    foreach decomposition x_t = wz, with w, z ∈ Σ* do
        if q_w is defined then Ŝ_w.insert(z);
        if q_w is a candidate and |Ŝ_w| ≥ tˢ_w then
            foreach safe q_{w′} not marked as distinct from q_w do
                call Test(Ŝ_w, Ŝ_{w′}, δ_{i_w});
                if Test returned equal then merge q_w to q_{w′};
                else if Test returned distinct then mark q_{w′} as distinct from q_w;
                else tˢ_w ← α · tˢ_w, i_w ← i_w + 1;
            if q_w is marked distinct from all safes then
                promote q_w to safe;
                foreach σ ∈ Σ do
                    add a candidate q_{wσ} to H and let Ŝ_{wσ} ← Sketch;
                    t⁰_{wσ} ← t, tˢ_{wσ} ← α₀, tᵘ_{wσ} ← t + β₀, i_{wσ} ← 1;
    foreach candidate q_w do
        if t ≥ tᵘ_w then
            if |Ŝ_w| < θ · (t − t⁰_w) then declare q_w insignificant;
            else tᵘ_w ← tᵘ_w + β;
    if H has more than n safes or there are no candidates left then return H;

2. The expected number of elements read from the stream is at most O(n²|Σ|²N_u)

3. Each item x_t in the stream is processed in O(|x_t|²T_s) amortized time

4. If a merge occurred, then at least N_e elements were read from the stream

Proof. The memory bound follows from the fact that at any time there will be at most n safe states in the DFA, each with at most |Σ| candidates attached, yielding a total of n(|Σ| + 1) states, each with an associated sketch using memory M_s. By assumption, a candidate will be either promoted or merged after collecting N_u examples, provided that every safe in the DFA has also collected N_u examples. Since this only matters for states with probability at least ε/4n|Σ|, because the rest will be marked as insignificant (see Lemma 3.4.4), in expectation the algorithm will terminate after reading Õ(n²|Σ|²N_u/ε) examples from the stream. The time for processing each string depends only on its length: suffixes of string x_t will be inserted into at most |x_t| + 1 states, at a cost O(|x_t|T_s) per insertion, yielding a total processing time of O(|x_t|²T_s). It remains, though, to amortize the time used by the tests among all the examples processed. Any call to Test will take time at most T_t. Furthermore, for any candidate, the time between successive tests grows exponentially; that is, if t strings have been added to some Ŝ_w, at most O(log t) tests on Ŝ_w have taken place due to the scheduling used. Thus, taking into account that each possible candidate may need to be compared to every safe during each testing round, we obtain an expected amortized processing time per string of order O(|x_t|²T_s + n²|Σ|T_t log(t)/t). Finally, note that for a merge to occur necessarily some call to Test must return equal, which means that at least N_e elements have been read from the stream.
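The effect of this geometric schedule is easy to check numerically. A small, purely illustrative snippet (function name and defaults ours, with α₀ and α taken from Table 3.1) counts how many testing rounds a candidate triggers while t strings arrive:

```python
def testing_rounds(t, alpha0=128, alpha=2):
    """Number of similarity-test rounds triggered for a candidate that has
    received t strings, under the geometric schedule ts <- alpha * ts."""
    rounds, threshold = 0, alpha0
    while threshold <= t:
        rounds += 1
        threshold *= alpha
    return rounds

for t in (10**3, 10**4, 10**5, 10**6):
    print(t, testing_rounds(t))   # 3, 7, 10, 13 rounds: O(log t) growth
```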

We want to remark here that Item 3 above is a direct consequence of the scheduling policy used by StreamLearner to perform similarity tests adaptively. The relevant point is that the ratio between executed tests and processed examples is O(log(t)/t) = o(1). In fact, by performing tests more often while keeping the tests-to-examples ratio o(1), one could obtain an algorithm that converges slightly faster, but has a larger (though still constant with respect to t) amortized processing time per item.

Our next theorem is a PAC-learning result. It says that if the stream is generated by a PDFA then the resulting hypothesis will have small error with high probability when transition probabilities are estimated with enough accuracy. Procedures to perform this estimation have been analyzed in detail in the literature. Furthermore, the adaptation to the streaming setting is straightforward. We use an analysis from [Pal08] in order to prove our theorem.

Theorem 3.4.2. Suppose we give StreamLearner(n′, Σ, ε, δ) a stream generated from a PDFA D with n ≤ n′ states. Let H denote the DFA returned by StreamLearner and D̂_H a PDFA obtained from H by estimating its transition probabilities using Õ(n⁴|Σ|⁴/ε³) examples. Then with probability at least 1 − δ we have L₁(D, D̂_H) ≤ ε.

The proof of Theorem 3.4.2 is similar in spirit to that of Theorem 1.4.1, and to other learning proofs in [CT04a; PG07; CG08; BCG13]. Therefore, we only discuss in detail those lemmas involved in the proof which are significantly different from the batch and SQ settings. In particular, we focus on the effect of the adaptive test scheduling policy. The rest of the proof is quite standard: first show that the algorithm recovers a transition graph isomorphic to a subgraph of the target containing all relevant states and transitions, and then bound the overall error in terms of the error in transition probabilities (see Section 1.4). We note that by using a slightly different notion of insignificant state and applying a smoothing operation after learning a PDFA, our algorithm could also learn PDFA under the stricter KL divergence.

The next two lemmas establish the correctness of the structure recovered: with high probability, merges and promotions are correct, and no significant candidate state is marked as insignificant.

Lemma 3.4.3. With probability at least 1 − n(n + 1)|Σ|δ′, all transitions between safe states are correct.

Proof. We will inductively bound the error probability of a merge or promotion by assuming that all the previous ones were correct. If all merges and promotions performed so far are correct, there is a transition-preserving bijection between the safe states in H and a subset of states of the target A_D;

therefore, for each safe state q_w the distribution of the strings added to Ŝ_w is the same as the one in the corresponding state of the target. Note that this also holds before the very first merge or promotion.

First we bound the probability that the next merge is incorrect. Suppose StreamLearner is testing a candidate q_w and a safe q_{w′} such that D_w ≠ D_{w′} and decides to merge them. This will only happen if, for some i ≥ 1, a call Test(Ŝ_w, Ŝ_{w′}, δ_i) returns equal. For fixed i this happens with probability at most δ_i; hence, the probability of this happening for some i is at most Σ_i δ_i ≤ δ′. Since there are at most n safes, the probability of the next merge being incorrect is at most nδ′.

Next we bound the probability that the next promotion is incorrect. Suppose that we promote a candidate q_w to safe but there exists a safe q_{w′} with D_w = D_{w′}. In order to promote q_w to a new safe the algorithm needs to certify that q_w is distinct from q_{w′}. This will happen if a call Test(Ŝ_w, Ŝ_{w′}, δ_i) returns distinct for some i. But again, this will happen with probability at most Σ_i δ_i ≤ δ′.

Since a maximum of n|Σ| candidates will be processed by the algorithm, the probability of an error in the structure is at most n(n + 1)|Σ|δ′.

Following Palmer and Goldberg [PG07], we say a state in a PDFA is insignificant if a random string passes through that state with probability less than ε/2n|Σ|; the same applies to transitions. It can be proved that a subgraph from a PDFA that contains all its non-insignificant states and transitions fails to accept a set of strings accepted by the original PDFA of total probability at most ε/4.

Lemma 3.4.4. With probability at least 1 − n|Σ|δ′, no significant candidate will be marked insignificant, and every insignificant candidate with probability less than ε/4n|Σ| will be marked insignificant during its first insignificance test.

Proof. First note that when the j-th insignificance test for q_w is performed, it means that T_j = (64n|Σ|/ε) ln(2^j/δ′) examples have been processed since its creation, for some j ≥ 1. Now suppose q_w is a non-insignificant candidate, i.e. it has probability more than ε/2n|Σ|. Then, by the Chernoff bounds, we have |Ŝ_w|/T_j < 3ε/8n|Σ| with probability at most δ′/2^j. Thus, q_w will be marked insignificant with probability at most δ′. On the other hand, if q_w has probability less than ε/4n|Σ|, then |Ŝ_w|/T₁ > 3ε/8n|Σ| happens with probability at most δ′/2. Since there will be at most n|Σ| candidates, the statement follows by a union bound.
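In code, the j-th insignificance test amounts to comparing the fraction of examples that reached the candidate against the threshold θ from Table 3.1 — a hedged sketch with our own argument names:

```python
from math import log

def is_insignificant(j, reached, n, alphabet_size, eps, delta_p):
    """j-th insignificance test for a candidate: after T_j examples since its
    creation, declare it insignificant if fewer than theta * T_j reached it."""
    T_j = (64 * n * alphabet_size / eps) * log(2**j / delta_p)  # = beta0 + (j-1)*beta
    theta = 3 * eps / (8 * n * alphabet_size)
    return reached < theta * T_j
```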

Though the algorithm would be equally correct if only a single insignificance test were performed for each candidate state, the scheme followed here ensures the algorithm will terminate even when the distribution generating the stream changes during the execution and some candidate that was significant w.r.t. the previous target is insignificant w.r.t. the new one.

With the results proved so far we can see that, with probability at least 1 − δ/2, the set of strings in the support of D not accepted by H has probability at most ε/4 w.r.t. the target distribution D. Together with the guarantees on the probability estimations of D̂_H provided by Palmer [Pal08], we can see that with probability at least 1 − δ we have L₁(D, D̂_H) ≤ ε.

Structure inference and probability estimation are presented here as two different phases of the learning process for clarity and ease of exposition. However, probabilities could be incrementally estimated during the structure inference phase by counting the number of times each arc is used by the examples we observe in the stream, provided a final probability estimation phase is run to ensure that probabilities estimated for the last added transitions are also correct.

3.5 A Strategy for Searching Parameters

Besides other parameters, a full implementation of StreamLearner, Sketch, and Test as used in the previous section requires a user to guess the number of states n and distinguishability µ of the target PDFA in order to learn it properly. These parameters are a priori hard to estimate from a sample of strings. And though in the batch setting a cross-validation-like strategy can be used to select these parameters in a principled way, the panorama in a data streams setting is far more complicated. This is not only because storing a sample to cross-validate the parameters clashes with the data streams paradigm, but also because when the target changes over time the algorithm needs to detect these changes and react

accordingly. Here we focus on a fundamental part of this adaptive behavior: choosing the right n and µ for the current target. We will give an algorithm capable of finding these parameters by just examining the output of previous calls to StreamLearner. The algorithm has to deal with a trade-off between memory growth and the time taken to find the correct number of states and distinguishability. This compromise is expressed by a pair of parameters given to the algorithm: ρ > 1 and φ > 0. Here we assume that StreamLearner receives as input just the current estimations for n and µ. Furthermore, we assume that there exists an unknown fixed PDFA with n⋆ states and distinguishability µ⋆ which generates the strings in the stream. The rest of the input parameters to StreamLearner – Σ, ε, and δ – are considered fixed and ignored hereafter. Our goal is to identify as fast as possible (satisfying some memory constraints) parameters n ≥ n⋆ and µ ≤ µ⋆, which will allow StreamLearner to learn the target accurately with high probability. For the sake of concreteness, and because they are satisfied by all the implementations of Sketch and Test we have considered, in addition to Assumption 1 we make the following assumptions.

Assumption 2. Given a distinguishability parameter µ for the target PDFA, algorithms Sketch and Test satisfy Assumption 1 with M_s = Θ(1/µ) and N_e = Θ(1/µ²).

Our algorithm is called ParameterSearch and is described in Algorithm 3. It consists of an infinite loop where successive calls to StreamLearner are performed, each with different parameters n and µ. ParameterSearch tries to find the correct target parameters using properties of the successive hypotheses produced by StreamLearner as a guide. Roughly, the algorithm increments the number of states n if more than n states were discovered in the last run, and decreases the distinguishability µ otherwise. However, in order to control the amount of memory used by ParameterSearch, the distinguishability sometimes needs to be decreased even if the last hypothesis' size exceeded n. This is done by imposing an invariant that is maintained throughout the whole execution: n ≤ (1/µ)^{2φ}. This invariant is key to proving the following results.

Theorem 3.5.1. After each call to StreamLearner where at least one merge happened, the memory used by ParameterSearch is O(t^{1/2+φ}), where t denotes the number of examples read from the stream so far.

Proof. First note that, by the choice of ρ′, the invariant n ≤ (1/µ)^{2φ} is maintained throughout the execution of ParameterSearch. Therefore, at all times n/µ ≤ (1/µ)^{1+2φ} ≤ (1/µ² + c)^{1/2+φ} for any c ≥ 0. Suppose that a sequence of k ≥ 1 calls to StreamLearner are made with parameters n_i, µ_i for i ∈ [k]. Write t_i for the number of elements read from the stream during the i-th call, and t = Σ_{i≤k} t_i for the total number of elements read from the stream after the k-th call. Now assume a merge occurred in the process of learning the k-th hypothesis, so that t_k = Ω(1/µ_k²) by Theorem 3.4.1 and Assumption 2. Therefore we have t^{1/2+φ} = (Σ_{i<k} t_i + t_k)^{1/2+φ} = Ω((1/µ_k² + Σ_{i<k} t_i)^{1/2+φ}) = Ω(n_k/µ_k), and since by Assumptions 1 and 2 the memory in use at the end of the k-th call is O(n_k|Σ|M_s) = O(n_k/µ_k), the bound follows.

Note the memory bound does not apply when StreamLearner produces tree-shaped hypotheses, because in that case the algorithm makes no merges. However, if the target is not a tree-like PDFA, then merges will always occur. In particular, the bound holds even when the distribution generating the stream does not correspond to a PDFA. On the other hand, if the target happens to be tree-like, our algorithm will learn it quickly (because no merges are needed). A stopping condition for this situation could easily be implemented, thus restricting the amount of memory used by the algorithm in this case.

The next theorem quantifies the overhead ParameterSearch pays for not knowing a priori the parameters of the target. This overhead depends on ρ and φ, and introduces a trade-off between memory usage and time until learning.

Theorem 3.5.2. Assume φ < 1/2. When the stream is generated by a PDFA with n⋆ states and distinguishability µ⋆, ParameterSearch will find a structurally correct hypothesis after making at most O(log_ρ(n⋆/µ⋆^{2φ})) calls to StreamLearner and reading in expectation at most Õ(n⋆^{1/φ} ρ^{2+1/φ}/µ⋆²) elements from the stream.

Algorithm 3: ParameterSearch algorithm
Input: Parameters ρ, φ
Data: A stream of strings x₁, x₂, … ∈ Σ*

ρ′ ← ρ^{1/(2φ)};
n ← ρ, µ ← 1/ρ′;
while true do
    H ← StreamLearner(n, µ);
    if |H| < n then µ ← µ/ρ′;
    else
        n ← n · ρ;
        if n > (1/µ)^{2φ} then µ ← µ/ρ′;

Proof. For convenience we assume that all n⋆ states in the target are important – the same argument works with minor modifications when n⋆ denotes the number of important states in the target. By Theorem 3.4.2 the hypothesis will be correct whenever the parameters supplied to StreamLearner satisfy n ≥ n⋆ and µ ≤ µ⋆. If n_{k+1} and µ_{k+1} denote the parameters computed after the k-th call to StreamLearner, we will show that n_{k+1} ≥ n⋆ and µ_{k+1} ≤ µ⋆ for some k = O(log_ρ(n⋆/µ⋆^{2φ})).

Given a set of k calls to StreamLearner let us write k = k₁ + k₂ + k₃, where k₁, k₂ and k₃ respectively count the number of times n, µ, or both n and µ are modified after a call to StreamLearner. Now let k_n = k₁ + k₃ and k_µ = k₂ + k₃. Considering the first k calls to StreamLearner in ParameterSearch one observes that n_{k+1} = ρ^{1+k_n} and 1/µ_{k+1} = ρ^{(1+k_µ)/2φ}. Thus, from the invariant n_{k+1} ≤ (1/µ_{k+1})^{2φ} maintained by ParameterSearch (see the proof of Theorem 3.5.1), we see that k₁ ≤ k₂ must necessarily hold.

Now assume k is the smallest integer such that n_{k+1} ≥ n⋆ and µ_{k+1} ≤ µ⋆. Then write k = k′ + k″, where k′ ≥ 0 is the first call to StreamLearner after which µ_{k′+1} ≤ µ⋆. By definition of k′ we must have µ_{k′+1} > µ⋆/ρ′, and 1/µ_{k′+1} = ρ^{(1+k′_µ)/2φ}, hence k′_µ < log_ρ(1/µ⋆^{2φ}). Therefore we see that k′ = k′₁ + k′₂ + k′₃ ≤ 2k′₂ + k′₃ ≤ 2k′_µ < 2 log_ρ(1/µ⋆^{2φ}).

Next we observe that if µ ≤ µ⋆ and n are fed to StreamLearner, then it will return a hypothesis with |H| ≥ n whenever n < n⋆. Thus we have k″₂ = 0 and k″ = k″_n; that is, after the first k′ calls, n is incremented at each iteration. Therefore we must have k″ ≤ log_ρ n⋆, and together with the bound above, we get k = O(log_ρ(n⋆/µ⋆^{2φ})).

It remains to bound the number of examples used in these calls. Note that once the correct µ is found, it will only be further decreased in order to maintain the invariant; hence, 1/µ_{k+1} ≤ ρ^{1/2φ} · max{1/µ⋆, n⋆^{1/2φ}}. Furthermore, if the correct µ is found before the correct n, the latter will never surpass n⋆ by more than a factor ρ. However, it could happen that n grows more than really needed while µ > µ⋆; in this case the invariant will keep µ decreasing. Therefore, in the end n_{k+1} ≤ ρ · max{n⋆, 1/µ⋆^{2φ}}. Note that since φ < 1/2 we have max{1/µ⋆, n⋆^{1/2φ}} · max{n⋆, 1/µ⋆^{2φ}} = O(n⋆^{1/2φ}/µ⋆). Thus, by Theorem 3.4.1 we see that in expectation ParameterSearch will read at most O(n_{k+1}² N_u log_ρ(n⋆/µ⋆^{2φ})) = Õ(n⋆^{1/φ} ρ^{2+1/φ}/µ⋆²) elements from the stream.

Note the trade-off in the choice of φ: small values guarantee little memory usage while potentially increasing the time until learning.
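For concreteness, Algorithm 3 translates almost line by line into Python; stream_learner below is a stand-in for a call to StreamLearner, and we assume the returned hypothesis exposes a num_states attribute (both names are ours):

```python
def parameter_search(stream_learner, rho, phi):
    """Search for n and mu while maintaining the invariant n <= (1/mu)^(2*phi),
    which is what bounds the memory usage (Theorem 3.5.1)."""
    rho_p = rho ** (1.0 / (2 * phi))
    n, mu = rho, 1.0 / rho_p
    while True:
        H = stream_learner(n, mu)
        if H.num_states < n:
            mu /= rho_p          # enough states: refine the distinguishability
        else:
            n *= rho             # hit the state limit: allow more states
            if n > (1.0 / mu) ** (2 * phi):
                mu /= rho_p      # re-establish the invariant
        # in the full system each H is published to the estimator and
        # the change detector before the next iteration starts
```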

3.6 Detecting Changes in the Target

Now we describe a change detector, ChangeDetector, for the task of learning distributions from streams of strings using PDFA. Our detector receives as input a DFA H, a change threshold γ, and a confidence parameter δ. It then runs until a change relevant w.r.t. the structure of H is observed in the stream distribution. This DFA is not required to model the structure of the current distribution, though the more

accurate it is, the more sensitive to changes the detector will be. In the application we have in mind, the DFA will be a hypothesis produced by StreamLearner.

We will use notation from Section 1.1 and introduce some new notation as well. Given a DFA H and a distribution D on Σ*, D(H[q]Σ*) is the probability of all words visiting state q at least once. Given a sample from D, we denote by D̂(H[q]Σ*) the relative frequency of words passing through q in the sample. Furthermore, we denote by D the vector containing D(H[q]Σ*) for all states q in H, and by D̂ the vector of empirical estimations. Our detector will make successive estimations D̂₀, D̂₁, … and decide that a change happened if some of these estimations differ too much. The rationale behind this approach to change detection is justified by the next lemma, showing that a non-negligible difference between D(H[q]Σ*) and D′(H[q]Σ*) implies a non-negligible distance between D and D′. A converse to this lemma would mean that our detector can detect all possible kinds of change, though in general we believe that to be false; that is, one can find distributions which are indistinguishable with respect to some DFA though they are far apart in the total variation distance.

Lemma 3.6.1. If for some state q ∈ H we have |D(H[q]Σ*) − D′(H[q]Σ*)| > γ, then L₁(D, D′) > γ.

Proof. First note that L₁(D, D′) ≥ Σ_{x∈H[q], y∈Σ*} |D(xy) − D′(xy)|. By the triangle inequality, this is at least |Σ_{x∈H[q], y∈Σ*} (D(xy) − D′(xy))| = |D(H[q]Σ*) − D′(H[q]Σ*)| > γ.

More precisely, our change detector ChangeDetector works as follows. It begins by reading the first m = (8/γ²) ln(4|H|/δ) examples from the stream and uses them to estimate D̂₀ in H. Then, for i > 0, it makes successive estimations D̂_i using m_i = (8/γ²) ln(2π²i²|H|/3δ) examples, until ‖D̂₀ − D̂_i‖_∞ > γ/2, at which point a change is detected. We will bound the probability of false positives and false negatives in ChangeDetector. For simplicity, we assume the elements in the stream are generated by a succession of distributions D₀, D₁, …, with changes taking place only between successive estimations D̂_i of state probabilities. The first lemma is about false positives.

Lemma 3.6.2. If D = D_i for all i ≥ 0, with probability at least 1 − δ no change will be detected.

Proof. We will consider the case with a single state q; the general case follows from a simple union bound. Let p = D(H[q]Σ*). We denote by p̂₀ an estimation of p obtained with (8/γ²) ln(4/δ) examples, and by p̂_i an estimation from (8/γ²) ln(2π²i²/3δ) examples. Recall that p = E[p̂₀] = E[p̂_i]. Now, a change will be detected if for some i > 0 one gets |p̂₀ − p̂_i| > γ/2. If this happens, then necessarily either |p̂₀ − p| > γ/4 or |p̂_i − p| > γ/4 for that particular i. Thus, by Hoeffding's inequality, the probability of a false positive is at most P[|p̂₀ − p| > γ/4] + Σ_{i>0} P[|p̂_i − p| > γ/4] ≤ δ.

Next we consider the possibility that a change occurs but is not detected.

Lemma 3.6.3. If D = D_i for all i < k and |D_{k−1}(H[q]Σ*) − D_k(H[q]Σ*)| > γ for some q ∈ H, then with probability at least 1 − δ a change will be detected. Furthermore, if the change occurs at time t, then it is detected after reading at most O((1/γ²) ln(γt/δ)) additional examples.

Proof. As in Lemma 3.6.2, we prove it for the case with a single state. The same notation is also used, with p_k = D_k(H[q]Σ*). The statement will fail to hold if a change is detected before the k-th estimation or no change is detected immediately after it. Let us assume that |p̂₀ − p| ≤ γ/4. Then, a false positive will occur if |p̂₀ − p̂_i| > γ/2 for some i < k, which by our assumption implies |p̂_i − p| > γ/4. On the other hand, a false negative will occur if |p̂₀ − p̂_k| ≤ γ/2. Since |p − p_k| > γ, our assumption implies that necessarily |p̂_k − p_k| > γ/4. Therefore, the probability that the statement fails can be bounded by

P[|p̂₀ − p| > γ/4] + Σ_{0<i<k} P[|p̂_i − p| > γ/4] + P[|p̂_k − p_k| > γ/4] ,

which, as in Lemma 3.6.2, is at most δ by Hoeffding's inequality and the choice of sample sizes.

Now assume the change happened at time t. By Stirling's approximation we have

t = Θ( Σ_{i≤k} (1/γ²) ln(i²/δ) ) = Θ( (k/γ²) ln(k/δ) ) .

Therefore k = O(γ²t). If the change is detected at the end of the following window (which will happen with probability at least 1 − δ), then the response time is at most O((1/γ²) ln(γt/δ)).
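Putting the pieces of this section together, ChangeDetector admits a short sketch. Here the estimation of the vector D̂_i of state probabilities is abstracted into a hypothetical visits predicate on the DFA object; all names in this listing are ours:

```python
from math import log, pi

def change_detector(stream, dfa, gamma, delta):
    """Run until some state probability D(H[q]Σ*) appears to have changed
    by more than gamma/2 with respect to the initial estimation."""

    def read_estimate(num_examples):
        sample = [next(stream) for _ in range(num_examples)]
        # Relative frequency of strings whose path in the DFA visits q.
        return {q: sum(dfa.visits(q, x) for x in sample) / num_examples
                for q in dfa.states}

    H_size = len(dfa.states)
    m0 = int((8 / gamma**2) * log(4 * H_size / delta))
    D0 = read_estimate(m0)
    i = 0
    while True:
        i += 1
        mi = int((8 / gamma**2) * log(2 * pi**2 * i**2 * H_size / (3 * delta)))
        Di = read_estimate(mi)
        if max(abs(D0[q] - Di[q]) for q in dfa.states) > gamma / 2:
            return i    # change detected after the i-th estimation window
```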

Chapter 4

State-Merging on Generalized Alphabets

In this chapter we generalize previous state-merging algorithms to learning distributions over (Σ × X)*, where Σ is again a finite alphabet, but X is an arbitrary measure space. To model such distributions we introduce a new type of automata which we call Generalized PDFA (GPDFA). In such an automaton, the transition between states is deterministic in terms of Σ, but does not depend on the particular elements from X generated alongside the elements of Σ. It is this particular restriction that helps us generalize state-merging algorithms to this new model. For example, taking X = R, GPDFA can model continuous-time Markov processes with discrete observations from Σ which are observed for some random period of time. Thus, we get a string of observations of the form (σ, t), where t is the period of time during which symbol σ was observed.

Like in previous chapters, we analyze the sample and time complexity of state-merging algorithms for learning GPDFA in the PAC learning model. In particular, we show what conditions are required on X in order for such learning to be possible and efficient. Another difference with previous analyses given in this thesis is that here we give learning results in terms of relative entropy instead of total variation distance. These are much stronger than what we have considered before. Thus, the proofs follow a slightly different approach, closer in some sense to the original PAC analysis in [CT04a].

An interesting feature of GPDFA is that they seamlessly encompass several useful models which may seem a priori unrelated. In the last section of this chapter we shall give two important applications of GPDFA: one to Markov models with timed transitions as mentioned above, and another to a particular class of transducers from Σ* to Δ*. To the best of our knowledge, these are the first PAC results for learning such models in terms of relative entropy.

4.1 PDFA over Generalized Alphabets

Our model for generalized PDFA is an FSM for defining probability distributions over the free monoid (Σ × X)*, where Σ is a finite alphabet and X is some arbitrary measure space. We will discuss the learnability of such objects in a rather general setting. However, we will ignore most measure-theoretic issues that may arise in our definitions. We do so on purpose, and for two reasons: first, because we do not want such issues to blur the general ideas behind the model and the learning algorithm; and second, because despite the general treatment we give here, the two particular examples we give in Section 4.6 do not suffer from these measure-theoretic complications.

That said, we can proceed with our main definition. A Generalized PDFA (GPDFA) consists of a tuple A = ⟨Q, Σ, τ, γ, q₀, ξ, D⟩, where: Q is a finite set of states; Σ is a finite alphabet; q₀ ∈ Q is the initial state; ξ is a symbol not in Σ used to denote the end of a string, and we will write Σ′ = Σ ∪ {ξ} for convenience; τ : Q × Σ → Q ∪ {⊥} is the (partial) transition function; γ : Q × Σ′ → [0, 1] is a transition probability function such that Σ_{σ∈Σ′} γ(q, σ) = 1 for all

q ∈ Q, satisfying τ(q, σ) = ⊥ → γ(q, σ) = 0 for all σ ∈ Σ; and D : Q × Σ′ → M₁⁺(X) ∪ {⊥} is a partial function that assigns a probability distribution D(q, σ) = D_{q,σ} over X to every defined transition event. Sometimes we shall abuse this notation and write D for the set D(Q × Σ′) of all distributions over X appearing in transitions of A. It is immediate to note that the tuple ⟨Q, Σ, τ, γ, q₀, ξ⟩ actually defines a PDFA, where instead of stopping probabilities like in Chapter 1 we now have an end-of-string symbol ξ which is observed whenever a string is terminated. Extensions τ : Q × Σ* → Q and γ : Q × (Σ′)* → [0, 1] are defined in the usual way.

The process by which a GPDFA generates an observation is very similar to the corresponding process for standard PDFA. Starting from state q₀, at each step in the generation process the GPDFA is in some state q ∈ Q. At every step, a symbol σ ∈ Σ′ is drawn according to the probability distribution γ(q, •) and then an element from X is drawn from the distribution D_{q,σ}. The process stops if σ = ξ; otherwise the next state is set to τ(q, σ). The output of the process is a string z ∈ (Σ × X)* × ({ξ} × X), which we write as z = (xξ, y) with x = x₁ ··· x_t ∈ Σ^t and y = y₁ ··· y_{t+1} ∈ X^{t+1}. Recall that PDFA define a probability distribution if and only if from any state q ∈ Q there is a positive probability of reaching some state q′ such that γ(q′, ξ) > 0; we shall assume that this always holds for the GPDFA we consider. For convenience we may sometimes say that GPDFA define probability distributions over (Σ′ × X)*, though it will always be implicitly assumed that any string outside (Σ × X)* × ({ξ} × X) has probability zero.

This formulation contrasts a little with the PDFA defined in Chapter 1, which did not use a symbol to denote the end of a string. In the present formulation we allow the GPDFA to generate a last element from X during the “stop” event. As such, this formulation is slightly more general than one which only defines probability distributions over (Σ × X)*. Furthermore, the introduction of an end-of-string marker will be convenient to bound the relative entropy between a target and a hypothesis GPDFA, as will become clear in the following sections. We shall write f_A to denote the probability distribution defined by a GPDFA A. Like we did for PDFA, we let A_q denote the GPDFA obtained from A by taking q ∈ Q as the starting state.

When a GPDFA generates a string starting from some state q, we can obtain a marginal distribution over Σ′ × X by just looking at the first symbol in the sequence. We call such a distribution a local distribution. In particular, for q ∈ Q we use D_q to denote the associated local distribution. Thus, if E ⊆ Σ′ × X is an event whose fibers on X with respect to Σ′ are denoted by E_σ, then we have

D_q(E) = Σ_{σ∈Σ′} γ(q, σ) D_{q,σ}(E_σ) .

We note here that we can think of D_{q,•} as a function from Σ′ to the space M₁⁺(X) of probability distributions over X; this object is sometimes called a stochastic kernel in measure-theoretic probability [Pol03]. When the goal is to learn GPDFA it may be convenient to restrict the possible kernels that may arise in D. In general we may require that all distributions in D are absolutely continuous with respect to some measure m on X. Sometimes we may impose more stringent constraints, like requiring that all distributions over X belong to a particular parametric family. For example, when X = R and m is the Lebesgue measure, we may restrict the distributions in D to belong to the exponential family, or even to be exponential or normal distributions. These and other examples will be examined in detail in Section 4.5.

Any GPDFA has an underlying deterministic automaton the same way a PDFA does. Since this deterministic component is the key to the state-merging paradigm for learning such objects, we want to study to what extent this paradigm can be extended to learn families of GPDFA. Intuitions built up through the study of state-merging methods in previous chapters steadily point to two sub-problems that need to be addressed by these hypothetical extensions: first, distinguishing distributions over (Σ × X)*; and second, learning the local distributions of A accurately. In the following sections we will study these two problems in detail. As before, our focus will be on provably correct solutions that require only samples of polynomial size.
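Before turning to the learning algorithm, it may help to see the generative process in code. The following sketch simulates one draw from a GPDFA given as plain dictionaries; the encoding (gamma, tau, emit and the sentinel XI) is ours and purely illustrative:

```python
import random

XI = "ξ"    # end-of-string symbol

def sample_gpdfa(gamma, tau, emit, q0):
    """Generate one string z = ((x1,y1), ..., (xt,yt), (ξ,y)) from a GPDFA.
    gamma[q] maps symbols in Σ' to probabilities, tau[(q, σ)] gives the next
    state, and emit[(q, σ)] is a sampler for the distribution D_{q,σ}."""
    q, z = q0, []
    while True:
        symbols = list(gamma[q])
        sigma = random.choices(symbols, weights=[gamma[q][s] for s in symbols])[0]
        y = emit[(q, sigma)]()          # element of X drawn from D_{q,σ}
        z.append((sigma, y))
        if sigma == XI:                 # the stop event also emits from X
            return z
        q = tau[(q, sigma)]

# A one-state example over Σ = {'a'} with exponentially distributed durations,
# in the spirit of the continuous-time Markov models mentioned above:
gamma = {0: {"a": 0.7, XI: 0.3}}
tau = {(0, "a"): 0}
emit = {(0, "a"): lambda: random.expovariate(1.0),
        (0, XI): lambda: random.expovariate(2.0)}
print(sample_gpdfa(gamma, tau, emit, 0))
```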

62 4.2 Generic Learning Algorithm for GPDFA

In this section we describe a state-merging algorithm for learning GPDFA over an arbitrary X . The algorithm will make use of two subroutines that depend on the particular choice of X ; these will be described in detail later. For now it will suffice to assume that we have at our disposal the following two procedures:

• A procedure Test that, given two samples from (Σ′ × X)*, determines whether they certainly come from two distinct distributions – in which case it returns distinct – or not – in which case it returns not clear.

• A procedure Estimate that, given a sample from Σ′ × X, returns an estimate D̂ of its distribution.

Our main working assumption, which we shall formalize later, is that when given large enough samples, Test and Estimate will return correct answers with high probability.

Now we describe the algorithm in detail. Pseudocode can be found in Algorithm 4. As input the algorithm takes a finite alphabet Σ, an upper bound on the number of states n, and a confidence parameter δ. The algorithm also receives as input a sample S containing m strings from (Σ × X)* × ({ξ} × X). Basically, the algorithm works in two stages: in the first stage it builds a DFA, where each state q has an associated sample S_q from (Σ × X)* × ({ξ} × X); in the second stage the information from the samples S_q is used to estimate local distributions for each state in the DFA and convert it into a GPDFA. While building the DFA, states are separated into two categories: safe and candidate. Safe states correspond to states the algorithm confidently believes belong to the target GPDFA. Candidates can become new safes at some point or be merged into already existing safes. In addition, candidates have no outgoing transitions. When a state is declared safe, its associated sample is fixed and never modified again.

The process begins with a simple DFA with a single safe q_λ with sample S_{q_λ} = S, and candidates q_σ for each σ ∈ Σ; we also add transitions τ(q_λ, σ) = q_σ. Then the process executes a while-loop, each iteration of which is guaranteed to either promote one candidate to safe or merge one candidate into an existing safe. This goes on until the DFA is complete or it contains more than n safes. The first step in each iteration is to clear the samples associated with each candidate and refill them again; this ensures the samples associated with each candidate adequately reflect the current structure of the DFA. To do so we parse each string in the sample S until we hit a candidate, and then add the remaining suffix to the sample of that candidate. More formally, if (x, y) is an example in S, we seek, if it exists, a prefix u of x such that q_u = τ(q_λ, u) is a candidate. In this case, we add (x_{|u|+1:|x|}, y_{|u|+1:|y|}) to the sample S_{q_u}; otherwise the example is discarded. Once all examples from S have been parsed, we select a candidate q maximizing |S_q|. For this candidate, we perform a series of tests to determine if the sample S_q comes from the same distribution as any of the samples associated with the existing safes. If we find no matches, q is promoted to be a safe and new candidates τ(q, σ) are added for each σ ∈ Σ. Otherwise, q is merged into a matching safe q′; this means that q is erased from the DFA, and the transition incoming into q now enters q′ instead. After the while-loop terminates, a postprocessing stage is conducted in case the DFA is not complete. In this stage we add a new state q_∞ with an empty multiset and make sure all missing transitions in the DFA are directed towards q_∞.

In the second part of the algorithm we use the sample S_q for each state q to estimate a local distribution over Σ′ × X. We do so by applying the procedure Estimate to local(S_q), where local(·) produces a sample over Σ′ × X from a sample over (Σ′ × X)* by keeping only the first symbol z₁ = (x₁, y₁) of every string z = (x, y) ∈ S_q. This distribution is then used in the obvious way to define a GPDFA using the DFA constructed by the previous part as its transition structure.

In the following section we will analyze the statistical aspects of Algorithm 4. Its running time is given in the next result, where the running times of Test and Estimate are considered variables. In particular, we suppose a call Test(S, S′, δ) takes time T_t = T_t(|S|, |S′|, δ), and a call to Estimate(S) takes time T_e = T_e(|S|). Furthermore, given a sample S = (z¹, …, z^m) from (Σ′ × X)*, we use ‖S‖ to denote the sum of the lengths of the strings in S; that is, ‖S‖ = Σ_{i=1}^m |x^i|, where z^i = (x^i, y^i).

Theorem 4.2.1. The running time of Algorithm 4 is bounded by O(n|Σ|‖S‖ + n²|Σ|T_t + nT_e).

Algorithm 4: Learn GPDFA
Input: parameters n, Σ, δ
Data: sample S = (z¹, …, z^m)
Output: hypothesis GPDFA H

initialize H with safe q_λ and let S_{q_λ} ← S;
foreach σ ∈ Σ do add a candidate to H reached from q_λ with σ;
while H is not complete and H has n or fewer safes do
    clear the multisets of all candidates in H;
    foreach z = (x, y) ∈ S do
        if there exists a prefix u of x such that q_u is a candidate of H then
            add (x_{|u|+1} ··· x_{|x|}, y_{|u|+1} ··· y_{|y|}) to S_{q_u};
    let q be the candidate in H with largest |S_q|;
    foreach safe q′ in H do run Test(S_q, S_{q′}, δ);
    if all calls to Test returned distinct then
        promote q to safe with multiset S_q;
        foreach σ ∈ Σ do add a candidate to H reached from q with σ;
    else
        select a safe q′ for which Test returned not clear;
        merge q to q′ in H;
if H still contains candidate states then
    create a ground state q_∞ and let S_{q_∞} ← ∅;
    merge all candidates in H to q_∞;
    foreach σ ∈ Σ do add a transition from q_∞ to itself labeled with σ;
foreach state q in H do
    compute D̂_q ← Estimate(local(S_q));
    let D̂_q be the local distribution of q in H;

Proof. As a first observation, note that a GPDFA with n states over Σ can have at most n|Σ| transitions between its states. Since each iteration of the while-loop adds one transition between two safe states to the DFA, the algorithm runs for at most O(n|Σ|) iterations. Each of these iterations parses the full sample S, which takes time O(‖S‖), and performs as many calls to Test as there are safes in the current DFA, of which there are at most n. Thus, the first part of the algorithm takes time O(n|Σ|(‖S‖ + nT_t)). In the second part we make a call to Estimate for each state in the DFA, which gives an additional running time of order O(nT_e).

4.3 PAC Analysis

Now we proceed to analyze Algorithm 4. We will give PAC learning guarantees under some assumptions on the Test and Estimate subroutines. Our analysis generalizes that of [CT04a] and follows a similar approach. First we give an expression for the relative entropy between two GPDFA in terms of their local distributions. Based on this formula, we show that identifying the frequent states and transitions in a target GPDFA is enough to correctly approximate its probability distribution. Finally we show that if the input sample is large enough, then Algorithm 4 will construct a hypothesis whose transition graph contains these frequent states and transitions.

4.3.1 Relative Entropy between GPDFA

We begin by defining some concepts related to the frequency with which states are visited in a GPDFA. Most of these are analogous to those defined in [CT04a] for PDFA. Let A = ⟨Q, Σ, τ, γ, q₀, ξ, D⟩ be a GPDFA over Σ. Recall from Chapter 1 the notations

A⟦q⟧ = {x | τ(q₀, x) = q} ,
A[q] = {x | τ(q₀, x) = q ∧ ∀y, z : x = yz → (|z| = 0 ∨ τ(q₀, y) ≠ q)} ,

for the sets of words that reach a state q and that reach a state q for the first time, respectively. With this notation we can define the weight of a state q ∈ Q as

W(q) = Σ_{x∈A⟦q⟧} γ(q₀, x) ,

which corresponds to the expected number of times state q will be visited during the generation of a string by A. We also define the probability of a state q ∈ Q as

P(q) = Σ_{x∈A[q]} γ(q₀, x) ,

which corresponds to the probability that A will be in state q at some point during the generation of a string. Recall from [CT04a] that for any q ∈ Q, the expected length of strings generated from q is given by

L(q) = Σ_{x∈Σ*} |x| γ(q, xξ) = ( Σ_{x∈Σ*} γ(q, x) ) − 1 .

The following relations between W(q), P(q), and L(q) were established in [CT04a] for PDFA. Since the definitions of weight and probability of states rely only on the underlying PDFA of a GPDFA, they hold for GPDFA as well.

Lemma 4.3.1. For any q ∈ Q we have W(q) ≤ (L(q) + 1)P(q). Furthermore, Σ_{q∈Q} W(q) ≤ L(q₀) + 1.

Let A′ = ⟨Q′, Σ, τ′, γ′, q′₀, ξ, D′⟩ be another GPDFA over Σ. We can define joint weights for pairs of states q ∈ Q and q′ ∈ Q′ by coupling the paths taken by both automata:

The following relations between W (q), P (q), and L(q) were established in [CT04a] for PDFA. Since the definitions of weight and probability of states rely only on the underlying PDFA of a GPDFA, they hold as well for GPDFA. P Lemma 4.3.1. For any q ∈ Q we have W (q) ≤ (L(q) + 1)P (q). Furthermore, q∈Q W (q) ≤ L(q0) + 1. 0 0 0 0 0 0 Let A = hQ , Σ, τ , γ , q0, ξ, D i be another GPDFA over Σ. We can define joint weights for pairs of states q ∈ Q and q0 ∈ Q0 by coupling the path taken by both automata:

0 X W (q, q ) = γ(q0, x) . x∈A q ∩A0 q0 J K J K 65 This weight computes the expected number of times we will find A in state q and A0 in state q0 in the process where strings are generated by A and parsed by A0. This quantity will play a central role in the expression for the relative entropy between A and A0. The intuition behind this fact is that if the transition structure of A and A0 is very similar, for each state q in A will exist a state q0 of A0 such that W (q, q0) is close to W (q) and W (q, q00) is very small for any other q00 =6 q0. In fact, it is immediate to see that we have X W (q, q0) ≤ W (q) , (4.1) q0∈Q0 with equality whenever the transition structure of A0 is complete; i.e. the transition function τ 0 is total. Now we are ready to state the main result of this section. The following theorem gives an expression for the relative entropy – also known as Kullback–Leibler divergence – between two GPDFA which generalizes the one given in [Car97] for PDFA. The proof relies on several measure-theoretic observations and is given in Section 4.7.

Theorem 4.3.2. Let A and A′ be GPDFA over Σ. The relative entropy between the distributions f_A and f_{A′} is given by the following expression:

KL(f_A ‖ f_{A′}) = Σ_{q∈Q} Σ_{q′∈Q′} W(q, q′) KL(D_q ‖ D_{q′}) ,

where

KL(D_q ‖ D_{q′}) = Σ_{σ∈Σ′} γ(q, σ) ( log( γ(q, σ) / γ′(q′, σ) ) + KL(D_{q,σ} ‖ D′_{q′,σ}) ) .

We note that in this chapter we use the relative entropy defined in terms of the natural logarithm for convenience. The same learning results can be obtained with the logarithm in base 2 by adjusting a constant term in our bounds.
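Theorem 4.3.2 reduces the relative entropy to finite sums once the joint weights are known, and the joint weights themselves solve a linear system over pairs of states: the coupled generate-and-parse process is a Markov chain on Q × Q′ with substochastic transitions. A numpy sketch under our own encoding of the automata — not part of the thesis' algorithms — is:

```python
import numpy as np

def joint_weights(gamma, tau, tau2, n, n2, q0=0, q02=0):
    """W(q,q') for a GPDFA A (gamma[q][sigma] over Σ, excluding ξ; tau)
    generating strings parsed by A' (tau2): expected visits of each pair."""
    N = n * n2
    M = np.zeros((N, N))    # transition matrix of the coupled process
    for p in range(n):
        for p2 in range(n2):
            for sigma, prob in gamma[p].items():
                if (p, sigma) in tau and (p2, sigma) in tau2:
                    q, q2 = tau[(p, sigma)], tau2[(p2, sigma)]
                    M[p * n2 + p2, q * n2 + q2] += prob
    e = np.zeros(N)
    e[q0 * n2 + q02] = 1.0
    # Expected visit counts satisfy w = e + M^T w, i.e. (I - M^T) w = e;
    # the stopping mass makes M substochastic, so the system is solvable.
    w = np.linalg.solve(np.eye(N) - M.T, e)
    return w.reshape(n, n2)

def kl_gpdfa(W, local_kl):
    """Theorem 4.3.2: KL(f_A || f_A') = sum_{q,q'} W(q,q') * KL(D_q || D_{q'}),
    with local_kl[q, q'] the relative entropies between local distributions."""
    return float((W * local_kl).sum())
```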

4.3.2 Frequent Core of a GPDFA

Now we prove the main technical result that we will use to show that Algorithm 4 is a PAC learner for GPDFA with respect to the Kullback–Leibler divergence. Roughly speaking, the result says that if we can produce a hypothesis containing a subgraph isomorphic to the one formed by all frequent states and transitions of the target GPDFA, then the joint weights W(q, q′) in the formula from Theorem 4.3.2 will be small whenever q is not frequent, or q is frequent but q′ is not the image of q in the hypothesis. The particular form of our result yields a slight improvement over Lemma 13 from [CT04a]. Since the proof published there is not correct and no erratum has been published so far [Cla09], we include a new full proof for completeness.

Let A = ⟨Q, Σ, τ, γ, q₀, ξ, D⟩ be a GPDFA over Σ. Given some ε_s > 0 we define the set of frequent states in A as

F_s = {q ∈ Q | W(q) > ε_s} .

Furthermore, given εt > 0 we define the set of frequent transitions in A as

Ft = {(q, σ) ∈ Q × Σ | q ∈ Fs ∧ γ(q, σ) > εt} .

The frequent core of A is the subgraph of the underlying DFA of A containing all frequent states and transitions. The set of frequent words of A is defined as

Fw = {x ∈ Σ* | ∀ y ⊑ x : τ(q0, y) ∈ Fs ∧ ∀ yσ ⊑ x : (τ(q0, y), σ) ∈ Ft} .

These are all the strings that can be recognized by the frequent core of A. Now we consider F̄s, the complement of Fw, formed by all strings x = x1 ··· xt such that the state path q0, τ(q0, x1), . . . , τ(q0, x1 ··· xt) leaves the frequent core of A at some point. By definition, we can write

F̄s = {x ∈ Σ* | ∃ y ⊑ x : τ(q0, y) ∉ Fs ∨ ∃ yσ ⊑ x : (τ(q0, y), σ) ∉ Ft} .

It follows immediately from this definition that if x ∈ F̄s, then xΣ* ⊆ F̄s; that is, if x leaves the frequent core, any extension of x leaves the frequent core as well. Using this property, we can characterize the structure of F̄s as follows.

Lemma 4.3.3. There exists a prefix-free set U such that F̄s = ∪_{x∈U} xΣ*.

Proof. Let U0 = ∅ and V0 = F̄s. For i ≥ 0 we define inductively Ui+1 = Ui ∪ {xi} and Vi+1 = Vi \ xiΣ*, where xi is the smallest word in Vi in lexicographical order. Then we take U = ∪_{i≥0} Ui. By construction and our previous observation we have U ⊂ F̄s and ∪_{x∈U} xΣ* ⊆ F̄s. To see that F̄s ⊆ ∪_{x∈U} xΣ*, take any y ∈ F̄s and observe that F̄s contains at most k = ∑_{i=0}^{|y|} |Σ|^i strings less than or equal to y in lexicographical order, which implies that y ∈ ∪_{x∈Uk} xΣ*. Finally, observe that U0 is prefix-free, and suppose Ui is too. Then, since no word in Vi has a prefix in Ui by construction, Ui+1 is also prefix-free, and so is U.

Given the set U from the previous lemma, it is easy to see that for any x ∈ U all of its proper prefixes belong to Fw. Thus, for any x ∈ U, it must be the case that either τ(q0, x) ∉ Fs, or x = yσ and (τ(q0, y), σ) ∉ Ft. Accordingly, we define the following two sets:

Us = {x ∈ U | τ(q0, x) ∉ Fs} ,    (4.2)

Ut = {x ∈ U | x = yσ ∧ (τ(q0, y), σ) ∉ Ft} .    (4.3)

Note that by construction we have U = Us ∪ Ut.

Now let A′ = ⟨Q′, Σ, τ′, γ′, q′0, ξ, D′⟩ be another GPDFA over the same alphabet Σ. We assume that A′ has a subgraph isomorphic to a subgraph of A containing its frequent core. More formally, suppose there exist Qf ⊆ Q, Q′f ⊆ Q′, and a bijection Φ : Qf → Q′f satisfying the following:

1. q0 ∈ Qf, q′0 ∈ Q′f, and Φ(q0) = q′0,

2. if q′ ∈ Q′f and τ′(q′, σ) ∈ Q′f, then τ(Φ⁻¹(q′), σ) = Φ⁻¹(τ′(q′, σ)),

3. Fs ⊆ Qf,

4. if (q, σ) ∈ Ft, then τ(q, σ) ∈ Qf and τ′(Φ(q), σ) = Φ(τ(q, σ)).

Note that the first two conditions enforce that Φ is a proper isomorphism on the graph structure, and the other two that the isomorphism is defined over the frequent core of A. Whenever A′ satisfies these conditions we will say that A′ captures the frequent core of A.

Theorem 4.3.4. Let A be a GPDFA with n states and L(q) ≤ L for all q ∈ Q. Suppose A′ captures the frequent core of A for some εs and εt. Then the following holds:

∑_{q∈Fs} ∑_{q′∈Q′\Φ(q)} W(q, q′) ≤ n(L + 1)εs + (L + 1)² εt .

Proof. We start by observing that if x ∈ Fw then τ(q0, x) = q ∈ Fs and τ′(q′0, x) = Φ(q), where the second fact holds because A′ captures the frequent core of A. Considering the contrapositive, one can show the following inequality:

∑_{q∈Fs} ∑_{q′∈Q′\Φ(q)} W(q, q′) = ∑_{q∈Fs} ∑_{q′∈Q′\Φ(q)} ∑_{x ∈ A⟦q⟧ ∩ A′⟦q′⟧} γ(q0, x) ≤ ∑_{x∈F̄s} γ(q0, x) .

Now, by Lemma 4.3.3 and the definition of L(q), we have

∑_{x∈F̄s} γ(q0, x) = ∑_{x∈U} ∑_{y∈Σ*} γ(q0, x) γ(τ(q0, x), y)
  = ∑_{x∈U} γ(q0, x) (L(τ(q0, x)) + 1)
  ≤ (L + 1) ∑_{x∈U} γ(q0, x) .

Next we split the sum over U into two terms using U = Us ∪ Ut and bound the resulting terms as follows. First, using (4.2) and the definition of Fs we have

∑_{x∈Us} γ(q0, x) = ∑_{q∈Q} ∑_{x∈Us∩A⟦q⟧} γ(q0, x)
  = ∑_{q∉Fs} ∑_{x∈Us∩A⟦q⟧} γ(q0, x)
  ≤ ∑_{q∉Fs} W(q)
  ≤ nεs .

Then, using (4.3), we get

∑_{yσ∈Ut} γ(q0, yσ) = ∑_{yσ∈Ut} γ(q0, y) γ(τ(q0, y), σ)
  ≤ εt ∑_{yσ∈Ut} γ(q0, y)
  ≤ εt (L(q0) + 1) .

Applying L(q0) ≤ L to this last bound yields the desired inequality.

4.3.3 Identifying the Frequent Core

Now we will prove that when the sample S given to Algorithm 4 is large enough, with high probability the hypothesis GPDFA H captures the frequent core of the target A. To prove this result we will need to make the following assumption about the function Test.

Assumption 1. The function Test(Sq, Sq′, δ) satisfies the following:

1. For any δ > 0 and any q ∈ Q, if Sq and S′q are two i.i.d. samples from the distribution fAq, then Test(Sq, S′q, δ) returns not clear with probability at least 1 − δ.

2. There exists Nt = Nt(δ) such that for any δ > 0 and any distinct q, q′ ∈ Q, if Sq and Sq′ are i.i.d. samples from fAq and fAq′, and min{|Sq|, |Sq′|} ≥ Nt, then Test(Sq, Sq′, δ) returns distinct with probability at least 1 − δ.

In words, the first condition implies that Test rarely returns distinct when given two samples from the same distribution. The second condition makes sure that Test can distinguish two distinct distributions when large enough samples are available.

Now we proceed to analyze the graph constructed by the main while loop in Algorithm 4. We will call each iteration of that loop a stage. We denote by Hk the hypothesis graph obtained by the algorithm at the end of the kth stage. To distinguish these Hk from the output of Algorithm 4 we shall call them intermediate hypotheses. A stage will be called good if the selected candidate q satisfies |Sq| ≥ Nt. The following result shows that an initial run of good stages will yield a subgraph of H isomorphic to a subgraph of A.

Lemma 4.3.5. Suppose the first k stages are all good. Then, with probability at least 1 − knδ, the subgraph of Hk formed by all its safe states is isomorphic to a subgraph of A.

Proof. We proceed by induction on k. The base case corresponds to k = 0, where no iteration of the while loop has taken place yet. In this case there is only the safe qλ, and the corresponding subgraph is clearly isomorphic to the subgraph of A containing only q0. Now suppose that the statement is true for some positive k and that stage k + 1 is good. Let q be the candidate selected in the (k + 1)th stage and Sq the associated sample. Conditioned on the subgraph of safes at the beginning of stage k + 1 being isomorphic to a subgraph of A – which happens with probability at least 1 − knδ by hypothesis

– we have that Sq is an i.i.d. sample from some fAq′ with q′ ∈ Q. And since the stage is good we have |Sq| ≥ Nt. Furthermore, since all previous stages were good, the samples associated with all safes in the current hypothesis also have size at least Nt, and each can be associated with some state in A because the subgraph of safes is isomorphic to a subgraph of A. To bound the probability of an error at this stage we consider two cases. First suppose that q is promoted to be a new safe. In this case an error occurs if the distribution of Sq is the same as that of one of the safes and the corresponding test returned distinct. By Assumption 1 this happens with probability at most δ. Now suppose that q is merged with some already existing safe. In this case an error means that for some safe whose distribution is different from that of Sq the test returned not clear. By Assumption 1 and the hypothesis on the size of the sample, this happens with probability at most δ for any fixed safe. Since there are at most n safes in the graph, the probability of an error in the (k + 1)th stage, conditioned on no previous error having occurred, is at most nδ. Thus, the probability that the subgraph of safes at the end of the (k + 1)th stage is isomorphic to a subgraph of A is at least 1 − (k + 1)nδ.

The next step is to ensure that as long as some frequent state or transition is not part of the hypothesis, the next stage will be good. To define this formally, let Hk be some intermediate hypothesis. We say that a frequent state q ∈ Fs of A has no representative in Hk if for all x ∈ A[q] the state qx in Hk is either a candidate or undefined. Similarly, we say that a frequent transition (q, σ) ∈ Ft of A has no representative in Hk if for all x ∈ A[q] the state qxσ is either a candidate or undefined.
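For concreteness, the following is a schematic sketch – not the literal pseudocode of Algorithm 4 – of the promote-or-merge decision made at each stage, assuming a Test procedure with the interface of Assumption 1; `safes`, a map from safe states to their associated samples, is a hypothetical representation of the algorithm's bookkeeping.

```python
def run_stage(candidate, sample, safes, test, delta):
    """One stage: merge the candidate with an existing safe, or promote it."""
    for safe, safe_sample in safes.items():
        # "not clear": we cannot certify that the two distributions differ
        if test(sample, safe_sample, delta) == "not clear":
            return ("merge", safe)
    # distinct from every existing safe: promote candidate to a new safe state
    return ("promote", candidate)
```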

Lemma 4.3.6. Suppose there exists a frequent state in A that has no representative in Hk. For any δ > 0, if |S| ≥ (8(L + 1)/εs) ln(1/δ) then the selected candidate q in the current stage will have a sample of size |Sq| ≥ |S|εs/(2(L + 1)n|Σ|) with probability at least 1 − δ.

Proof. Let q′ ∈ Q denote the frequent state not represented in Hk so far. Note that the probability of a string in S passing through q′ is at least εs/(L + 1). Thus, by the Chernoff bound, with probability at least 1 − δ there will be at least |S|εs/(2(L + 1)) strings in S passing through q′, each of which will be parsed to a candidate in Hk. Therefore, since there are at most n|Σ| candidates, with probability at least 1 − δ there will be some candidate q with |Sq| ≥ |S|εs/(2(L + 1)n|Σ|).

Mimicking this last proof, one can also establish the following result for frequent transitions.

Lemma 4.3.7. Suppose there exists a frequent transition in A that has no representative in Hk. For any δ > 0, if |S| ≥ (8(L + 1)/εsεt) ln(1/δ) then the selected candidate q in the current stage will have a sample of size |Sq| ≥ |S|εsεt/(2(L + 1)n|Σ|) with probability at least 1 − δ.

Now we can combine the previous lemmas with a union bound to show that if S is large enough, then with high probability Algorithm 4 will find a hypothesis H that captures the frequent core of A.

Theorem 4.3.8. For any δ > 0, if |S| ≥ (2(L + 1)/εsεt) · max{Nt(δ)n|Σ|, 4 ln(1/δ)}, then with probability at least 1 − n(n|Σ|²/2 + 3|Σ|/2 + 1)δ the hypothesis H produced by Algorithm 4 captures the frequent core of A.

Proof. Let k denote the number of good stages from the first one until the first non-good stage. By Lemma 4.3.5 the subgraph of safes of Hk is isomorphic to a subgraph of A with probability at least 1 − k(k + 1)nδ/2. Next, by Lemmas 4.3.6 and 4.3.7 and the assumption on |S| we see that each stage in which a frequent state or transition is missing from H will be good with probability at least 1 − δ. Hence, a union bound shows that H will capture the frequent core of A with probability at least 1−(k(k+1)n/2+|Fs|+|Ft|)δ. Now note that each stage of Algorithm 4 creates a new transition between safe states in H. Since a GPDFA with n states over Σ can have at most n|Σ| transitions, the algorithm will run for at most n|Σ| stages. Thus, we have k ≤ n|Σ|. Combining this observation with the trivial bounds |Fs| ≤ n and |Ft| ≤ n|Σ| yields the desired result.

4.3.4 Bounding the Error

So far we have seen how to compute the relative entropy between two GPDFA, the effect the frequent core of the target GPDFA has on this quantity, and that Algorithm 4 can identify with high probability

a hypothesis that contains an isomorphic image of that frequent core. The next and last step is to see how these pieces fit together in order to produce an accurate hypothesis. Doing so requires an analysis of the local distributions produced by the Estimate function used in Algorithm 4. As we did in the case of the Test function, we will make an axiomatic assumption about Estimate here and defer the details to later sections. In particular, we will see that the properties of the Estimate function will influence the choice of the parameters εs and εt defining the frequent core of A.

Assumption 2. Let S be a sample of examples from (Σ′ × X), and D̂ the local distribution returned by Estimate(S). Then the following hold for every local distribution Dq of A and every ε, δ ∈ (0, 1):

1. There exists Ne = Ne(ε, δ) such that if S is drawn i.i.d. from Dq and |S| ≥ Ne, then KL(Dq ‖ D̂) ≤ ε holds with probability at least 1 − δ.

2. There exists some εe = εe(|S|) > 0 such that KL(Dq ‖ D̂) ≤ εe holds for any S.

In words, the first condition above says that any local distribution in A can be accurately estimated in terms of relative entropy provided a large enough sample is available. The second condition guarantees that the distributions returned by Estimate have been smoothed in such a way that they will never be too far from any local distribution in A. We note that this is important because the relative entropy between two distributions is a priori unbounded. For this reason we call εe the smoothing coefficient of Estimate.

We begin our analysis with a simple consequence of Theorem 4.3.8 that follows by using the same proof with a different sample bound.

Corollary 4.3.9. Fix some δ > 0 and suppose the following holds:

|S| ≥ (2(L + 1)/εsεt) · max{Ne(ε, δ)n|Σ|, Nt(δ)n|Σ|, 4 ln(1/δ)} .

Then, with probability at least 1 − n(n|Σ|²/2 + 3|Σ|/2 + 1)δ, the hypothesis H produced by Algorithm 4 captures the frequent core of A, and every state q in H which corresponds to a frequent state in A has a sample of size |Sq| ≥ Ne(ε, δ).

Now we are in a position to state a PAC learning result for GPDFA under Assumptions 1 and 2 on the behavior of Test and Estimate. The result follows from combining Theorem 4.3.2 and Corollary 4.3.9 with an adequate choice of εs and εt.

Theorem 4.3.10. Suppose A is a GPDFA over Σ with at most n states and that Assumptions 1 and 2 hold. For any δ > 0 and ε > 0, let δ′ = δ/(n(n|Σ|²/2 + 3|Σ|/2 + 2)), εs = ε/(3n(L + 2)εe), εt = ε/(3(L + 1)²εe), and εg = ε/(3(L + 1)). If |S| ≥ (2(L + 1)/εsεt) · max{Ne(εg, δ′)n|Σ|, Nt(δ′)n|Σ|, 4 ln(1/δ′)}, then with probability at least 1 − δ the hypothesis H returned by Algorithm 4 with input parameters Σ, n, and δ′ satisfies KL(fA ‖ fH) ≤ ε.

Proof. First note that, by Corollary 4.3.9 and Assumption 2, with probability at least 1 − δ we have that H captures the frequent core of A, each frequent state q in A has a representative Φ(q) in H with an associated sample of size at least Ne(εg, δ′), and KL(Dq ‖ D̂Φ(q)) ≤ εg. Now take the formula from Theorem 4.3.2 and decompose the sum into three terms as follows:

KL(fA ‖ fH) = ∑_{q∈QA} ∑_{q′∈QH} W(q, q′) KL(Dq ‖ D̂q′)
  = ∑_{q∈Fs} W(q, Φ(q)) KL(Dq ‖ D̂Φ(q))
  + ∑_{q∈Fs} ∑_{q′∈QH\Φ(q)} W(q, q′) KL(Dq ‖ D̂q′)
  + ∑_{q∈QA\Fs} ∑_{q′∈QH} W(q, q′) KL(Dq ‖ D̂q′) .

Using (4.1) and Lemma 4.3.1 we can bound the first term as

∑_{q∈Fs} W(q, Φ(q)) KL(Dq ‖ D̂Φ(q)) ≤ (L(q0) + 1)εg ≤ ε/3 .

The second term can be bounded using Theorem 4.3.4 and Assumption 2, yielding

∑_{q∈Fs} ∑_{q′∈QH\Φ(q)} W(q, q′) KL(Dq ‖ D̂q′) ≤ (n(L + 1)εs + (L + 1)² εt) εe = (L + 1)ε/(3(L + 2)) + ε/3 .

Finally, the third term is bounded by (4.1) and Assumption 2 as

∑_{q∈QA\Fs} ∑_{q′∈QH} W(q, q′) KL(Dq ‖ D̂q′) ≤ nεsεe = ε/(3(L + 2)) .

Combining the three bounds above yields KL(fA ‖ fH) ≤ ε.

4.4 Distinguishing Distributions over Generalized Alphabets

In this section we present a generic approach for implementing the Test function used in Algorithm 4, together with some examples. Our approach generalizes the use of the L∞ and L∞^p distances for distinguishing states in algorithms for learning PDFA (see Chapter 2 for details). Basically, we define a measure of distinguishability by taking the supremum, over some collection of events, of the difference between the probabilities assigned by two distributions to these events. In particular, we will choose the set of events in a way that lets us use VC bounds to obtain finite-sample guarantees like the ones required by Assumption 1.

We begin by recalling the definitions of the L∞ and L∞^p distances between distributions on Σ*. This will easily hint at the correct generalization to the case of distributions over (Σ × X)*. Given distributions D and D′ over Σ*, the supremum distance L∞ is defined as

L∞(D, D′) = sup_{x∈Σ*} |D(x) − D′(x)| .

The prefix supremum distance L∞^p is defined as

L∞^p(D, D′) = sup_{x∈Σ*} |D(xΣ*) − D′(xΣ*)| .

Looking at these definitions it is immediate to see that both these distances are particular instances of a general measure of discrepancy between distributions given by

L∞^𝓔(D, D′) = sup_{E∈𝓔} |D(E) − D′(E)| ,

where 𝓔 is a collection of events over Σ*. In particular, 𝓔 = {{x} : x ∈ Σ*} in the case of L∞, and 𝓔 = {xΣ* : x ∈ Σ*} in the case of L∞^p. Now, since there is nothing special about Σ* in the definition of L∞^𝓔, we can easily use this general form to define discrepancy measures between distributions on (Σ × X)*.

Note that in general L∞^𝓔 is not necessarily a distance: if 𝓔 is not large enough, there may exist distributions D ≠ D′ such that L∞^𝓔(D, D′) = 0. On the other hand, we will see that very large sets of events 𝓔 yield worse statistical performance when estimating L∞^𝓔(D, D′) from samples, in addition to the larger computational effort required to compute such estimates. We will thus need to balance this trade-off if we want to obtain efficient learning algorithms for large classes of distributions on (Σ × X)* using tests based on this principle. In the most general case, when defining 𝓔 one should also take into account measurability issues that may arise if X is a continuous space, for example X = R equipped

with the Lebesgue measure. We will, however, ignore such considerations because measurability problems will not appear in the applications we have in mind.

An important and desirable property of L∞^𝓔 is that we can estimate its true value using random samples. We will see that this is a statistical property that is independent of whether L∞^𝓔 defines a proper distance or not. Also note that the statistical possibility of estimating L∞^𝓔 from a finite set of random examples does not a priori imply that such estimation can be performed efficiently. Let D be a distribution over (Σ × X)* and let S = (z^1, . . . , z^m) be a sample of m i.i.d. examples from D. Given an event E ∈ 𝓔, we define the empirical probability of E with respect to the sample S as

S(E) = (1/m) ∑_{i=1}^{m} 1_{z^i ∈ E} .

If S′ is an i.i.d. sample of m′ examples from another distribution D′, we define the empirical discrepancy between S and S′ as

L∞^𝓔(S, S′) = sup_{E∈𝓔} |S(E) − S′(E)| .

We note that even though S and S′ are finite, computing this quantity requires evaluating a maximum over 𝓔. This computation, and the statistical accuracy with which we can estimate L∞^𝓔(D, D′) using L∞^𝓔(S, S′), will depend on the shattering function of 𝓔, defined in Appendix A.1 and denoted by Π𝓔. As in Chapter 2, we use this function to define, for any 0 < δ < 1, the following quantity:

∆(δ) = √( (8/M) ln( 4(Π𝓔(2m) + Π𝓔(2m′)) / δ ) ) ,

where M = mm′/(√m + √m′)². Now let us write µ⋆ = L∞^𝓔(D, D′) and µ̂ = L∞^𝓔(S, S′). With these definitions, the following two results follow from the same proofs used in Chapter 2. Note that the probabilities in these results are with respect to the sampling of S and S′.

Proposition 4.4.1. With probability at least 1 − δ we have µ⋆ ≤ µ̂ + ∆(δ).

Proposition 4.4.2. With probability at least 1 − δ we have µ⋆ ≥ µ̂ − ∆(δ).

These confidence intervals can be used to build a procedure Test with finite-sample guarantees; see Chapter 2 for details. In particular, one can show the following useful result, which quantifies the number of examples needed to make confident decisions using this Test in a particular case of interest.

Corollary 4.4.3. Let S and S′ be i.i.d. samples from distributions D and D′ respectively, and write m = |S| and m′ = |S′|. Suppose that L∞^𝓔 is such that Π𝓔(m) = Θ(m). If µ⋆ = L∞^𝓔(D, D′) > 0, then with probability at least 1 − δ Test certifies this fact when min{m, m′} ≥ N = Õ((1/µ⋆²) ln(1/δ)). Furthermore, if µ⋆ = L∞^𝓔(D, D′) = 0, then with probability at least 1 − δ Test certifies µ⋆ < µ when min{m, m′} ≥ N′ = Õ((1/µ²) ln(1/δ)).

Now we give a general construction for a set of events 𝓔 which guarantees that L∞^𝓔 is a distance under very mild conditions. Let 𝔛 be a set of events on X, i.e. 𝔛 ⊆ 2^X, such that L∞^𝔛 is a distance between distributions over X. Besides the usual L∞, this setting includes other well-known distances. For example, if X = R and 𝔛 = {(−∞, x] : x ∈ R}, then L∞^𝔛 is the Kolmogorov–Smirnov distance. Another example is the total variation distance, which corresponds to taking 𝔛 to be the full σ-algebra of X as a measure space; in particular 𝔛 = 2^X if X is finite or countable. Given x ∈ Σ⁺ with |x| = t and X ∈ 𝔛, we write E_{x,X} to denote the subset xΣ* × X^{t−1}XX* of (Σ × X)* that contains all z = (w, y) such that x ⊑ w and y_t ∈ X. With this notation we define the following set of events on (Σ × X)*:

𝓔 = {E_{x,X} : x ∈ Σ⁺, X ∈ 𝔛} .

It is easy to see that in this case L∞^𝓔 is in fact a distance. Since the proof involves the construction of a GPDFA with an infinite number of states, we give it only in the case of distributions over (Σ × X)* × ({ξ} × X), because this fits better into the model of GPDFA we use in this chapter.

72 0 ? X Proposition 4.4.4. Let D and D be distributions over (Σ× X ) ×({ξ} ×X ). Suppose L∞ is a distance 0 E 0 between distribution on X . If D =6 D then L∞(D,D ) > 0. Proof. We begin by noting that D and D0 can be easily realized by two GPDFA with an infinite number ? of states. Indeed, let A = hQ, Σ, τ, γ, q0, ξ, Di be defined as follows: Q = Σ ; q0 = λ; τ(x, σ) = xσ for all x ∈ Q; for all x ∈ Q with |x| = t and σ ∈ Σ0 let1

γ(x, σ) = D((x1, 𝒳) ··· (xt, 𝒳)(σ, 𝒳)(Σ′ × 𝒳)*) / D((x1, 𝒳) ··· (xt, 𝒳)(Σ′ × 𝒳)*) ;

and for x ∈ Q with |x| = t and any measurable X ⊆ X , define

D_{x,σ}(X) = D((x1, 𝒳) ··· (xt, 𝒳)(σ, X)(Σ′ × 𝒳)*) / D((x1, 𝒳) ··· (xt, 𝒳)(σ, 𝒳)(Σ′ × 𝒳)*) ,

where 𝒳 denotes the whole observation space.

It is immediate to check that A realizes D. Similarly, we define an infinite GPDFA A′ realizing D′. Now, if D ≠ D′, then A and A′ are not equal, and this means there exists some string x ∈ Σ* for which the corresponding local distributions of A and A′ are different: Dx ≠ D′x. If there are many such x, we take the shortest one, resolving ties using the lexicographical order on Σ′. For such an x we must either have γ(x, σ) ≠ γ′(x, σ) for some σ ∈ Σ′, or D_{x,σ}(X) ≠ D′_{x,σ}(X) for some σ ∈ Σ′ and X ∈ 𝔛, where this last claim follows from L∞^𝔛 being a distance. This certifies the existence of an event E_{x,X} ∈ 𝓔 such that |D(E_{x,X}) − D′(E_{x,X})| > 0.
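Before continuing, here is a minimal sketch of how a Test procedure with the guarantees promised by Corollary 4.4.3 can be assembled from Propositions 4.4.1 and 4.4.2. The oracle `empirical_discrepancy`, computing L∞^𝓔(S, S′), and the shattering function `pi_E` are assumed to be supplied by the caller.

```python
import math

def delta_bound(m, mp, delta, pi_E):
    """The quantity Delta(delta) defined above."""
    M = m * mp / (math.sqrt(m) + math.sqrt(mp)) ** 2
    return math.sqrt((8.0 / M)
                     * math.log(4 * (pi_E(2 * m) + pi_E(2 * mp)) / delta))

def test(S, Sp, delta, empirical_discrepancy, pi_E):
    """Return 'distinct' only when the lower confidence bound is positive."""
    mu_hat = empirical_discrepancy(S, Sp)
    gap = delta_bound(len(S), len(Sp), delta, pi_E)
    # By Proposition 4.4.2, mu_hat - gap lower-bounds the true discrepancy
    # with probability at least 1 - delta.
    return "distinct" if mu_hat - gap > 0 else "not clear"
```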

Observe that even if L∞^𝓔 is a distance, the set of events 𝓔 does not necessarily have a slowly growing shattering function. In particular, it is not hard to see that if we allow samples with arbitrarily long strings, then Π𝓔(m) = Ω(2^m) for any non-trivial 𝔛. Fortunately, since the distributions we will need to distinguish are not arbitrary, there is a workaround to this problem. In particular, if D and D′ are distinct distributions over (Σ × X)* realized by GPDFA A_D and A_{D′} with at most n states, then we must have |D(E_{x,X}) − D′(E_{x,X})| > 0 for some x ∈ Σ^{≤n} and X ∈ 𝔛. Indeed, note that since A_D has at most n states, any transition in A_D can be reached with a string of length at most n; thus, if D and D′ assign the same probability to all events E_{x,X} with |x| ≤ n, then A_D and A_{D′} must define the same probability distribution. Hence, given n and 𝔛 we define

𝓔n = {E_{x,X} : x ∈ Σ^{1:n}, X ∈ 𝔛} .

Now we can bound Π𝓔n(m) as a function of n and Π𝔛(m).

Lemma 4.4.5. The following holds: Π𝓔n(m) ≤ 2mn Π𝔛(m).

Proof. The bound follows from well-known properties of shattering coefficients; see Appendix A.1 for details. Let us define, for any k ≥ 1, the following collection of events on X*:

𝔛^(k) = {X^{k−1}XX* : X ∈ 𝔛} .

Writing 𝓟 = {xΣ* : x ∈ Σ*}, it is immediate to see that we have 𝓔n ⊆ 𝓟 ⊗ 𝔛^(1:n), where 𝔛^(1:n) = ∪_{k=1}^{n} 𝔛^(k). Thus, the following holds for any m:

Π𝓔n(m) ≤ Π_{𝓟⊗𝔛^(1:n)}(m) ≤ Π𝓟(m) Π_{𝔛^(1:n)}(m) ≤ 2mn Π𝔛(m) ,

where we used Lemma A.5.2 and Π_{𝔛^(k)}(m) = Π𝔛(m).

Now we turn our attention to the issue of how to actually compute the empirical L∞^𝓔n distance between two samples drawn from distributions D and D′ over (Σ × X)*. First we introduce some notation. Recall that if SΣ is a multiset of strings from Σ*, we use SΣ(xΣ*) to denote the empirical frequency of the prefix x in SΣ. We use pref(SΣ) to denote the set of all prefixes of strings in SΣ. Furthermore, recall that if SX is a multiset of elements from X, we use SX(X) to denote the empirical frequency of X in SX. Let x ∈ Σ*

¹We adopt the convention 0/0 = 0 in all these definitions.

with x = x1 ··· xt. For any n ≥ 1 we write x_{1:n} to denote x1 ··· xn if n ≤ t, and x1 ··· xt otherwise. Next, let S = (z^1, . . . , z^m) be a multiset from (Σ × X)* with z^i = (x^i, y^i), x^i ∈ Σ*, and y^i ∈ X^{|x^i|}. We define S_{Σ*} = (x^1, . . . , x^m) and S_{Σ≤n} = (x^1_{1:n}, . . . , x^m_{1:n}) for any n ≥ 1. Furthermore, given x ∈ Σ* with |x| = t, we write S_{x,X} to denote the multiset from X containing all the y^i_t for those i such that x ⊑ x^i.

Now suppose we are given two samples S and S′ from distributions over (Σ × X)*, and let U = pref(S_{Σ≤n} ∪ S′_{Σ≤n}). Then we can write L∞^𝓔n(S, S′) as the following maximum:

L∞^𝓔n(S, S′) = max_{x∈U} max_{X∈𝔛} |S_{Σ*}(xΣ*) S_{x,X}(X) − S′_{Σ*}(xΣ*) S′_{x,X}(X)| .

It is obvious that the maximum over U can be computed in time O(|U|) if we know

max_{X∈𝔛} |S_{Σ*}(xΣ*) S_{x,X}(X) − S′_{Σ*}(xΣ*) S′_{x,X}(X)|

for each x ∈ U. In particular, it is easy to see that we have |U| = O(n(m + m′)) by bounding the number of prefixes in S_{Σ≤n} ∪ S′_{Σ≤n}. Thus, we are left with the problem of computing

max_{X∈𝔛} |p SX(X) − p′ S′X(X)| ,

where p, p′ ∈ [0, 1] are arbitrary and SX and S′X are multisets over X. Note that, by the definition of shattering coefficients, the quantity inside the maximum cannot take more than Π𝔛(|SX| + |S′X|) different values. However, how the actual computation of this maximum can be done in each case depends on the particular structure of X and 𝔛. We now give three illustrative examples of how to perform this computation.
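The outer maximisation over prefixes is the same in all cases, so we sketch it once here. The inner routine `max_over_events`, which solves the weighted one-prefix problem just described, is assumed and is instantiated in the examples below; samples are represented as sequences of (symbol, observation) pairs, and both samples are assumed non-empty.

```python
def empirical_discrepancy(S, Sp, n, max_over_events):
    """Sketch of the empirical L_inf over E_n between samples S and S'."""
    def prefixes(sample):
        out = set()
        for z in sample:
            x = tuple(sym for sym, _ in z)[:n]
            out.update(x[:i] for i in range(1, len(x) + 1))
        return out

    best = 0.0
    for x in prefixes(S) | prefixes(Sp):
        t = len(x)
        # observations at position t of the strings having x as a prefix
        Sx = [z[t - 1][1] for z in S if tuple(s for s, _ in z[:t]) == x]
        Spx = [z[t - 1][1] for z in Sp if tuple(s for s, _ in z[:t]) == x]
        p, pp = len(Sx) / len(S), len(Spx) / len(Sp)
        best = max(best, max_over_events(p, Sx, pp, Spx))
    return best
```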

4.4.1 Examples with X = ∆

We begin by taking X = ∆, a finite alphabet, and L∞^𝔛 to be the usual L∞ distance; that is, 𝔛 = {{δ} : δ ∈ ∆}. Hence, if S and S′ are samples from ∆, then we can compute

max_{δ∈∆} |p S(δ) − p′ S′(δ)|

by reading each sample once, storing the frequencies of each symbol in a table, and then taking the maximum over the |∆| possible differences. Assuming constant-time read–write data structures, this computation takes time O(|S| + |S′| + |∆|). Note also that in this case we have Π𝔛(m) ≤ |∆| + 1.

In our second example we take X = ∆ again, but now let L∞^𝔛 be the total variation distance; that is, 𝔛 = {X : X ⊆ ∆}. We note that in this particular example we have Π𝔛(m) ≤ 2^{|∆|}. In this case, given samples S and S′, the naive approach to computing

max_{X⊆∆} |p S(X) − p′ S′(X)|

would take time Θ(2^{|∆|}). However, we can take advantage of a property that this maximum shares with the usual total variation distance. In particular, using the following result we see that this computation can also be done in time O(|S| + |S′| + |∆|).

Lemma 4.4.6. For any p, p′ ∈ [0, 1] and any samples S and S′ on ∆, the following holds:

max_{X⊆∆} |p S(X) − p′ S′(X)| = |p − p′|/2 + (1/2) ∑_{δ∈∆} |p S(δ) − p′ S′(δ)| .

Proof. Let us assume without loss of generality that p ≥ p′. Then it is easy to see, using a standard argument, that the maximum over X ⊆ ∆ is achieved at

X = {δ : p S(δ) ≥ p′ S′(δ)} .

On the other hand, using p = ∑_δ p S(δ) and p′ = ∑_δ p′ S′(δ), we have

∑_{δ∈∆} |p S(δ) − p′ S′(δ)| = ∑_{δ∈X} (p S(δ) − p′ S′(δ)) + ∑_{δ∈X̄} (p′ S′(δ) − p S(δ))
  = (p S(X) − p′ S′(X)) + (p′ − p′ S′(X)) − (p − p S(X))
  = p′ − p + 2(p S(X) − p′ S′(X)) .
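As a sketch, Lemma 4.4.6 turns the exponential-size maximisation into a single linear scan. The (p, sample) convention is the one used in the outer-loop sketch above; treating an empty sub-sample as assigning zero frequency to every symbol is an assumption of the sketch.

```python
from collections import Counter

def max_total_variation(p, S, pp, Sp):
    """Evaluate |p - p'|/2 + (1/2) * sum_d |p S(d) - p' S'(d)| (Lemma 4.4.6)."""
    cS, cSp = Counter(S), Counter(Sp)
    mS, mSp = max(len(S), 1), max(len(Sp), 1)   # guard: empty sub-sample
    support = set(cS) | set(cSp)                # off-support terms vanish
    l1 = sum(abs(p * cS[d] / mS - pp * cSp[d] / mSp) for d in support)
    return abs(p - pp) / 2.0 + l1 / 2.0
```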

4.4.2 Example with X = R

In the last example we let X = R and L∞^𝔛 be the Kolmogorov–Smirnov distance; that is, 𝔛 = {(−∞, x] : x ∈ R}. Now note that if S is a sample from R, then in order to compute S((−∞, x]) we only need to count how many points in S are less than or equal to x. Furthermore, it is obvious that while x can range over R, S((−∞, x]) can only take values of the form k/|S| for 0 ≤ k ≤ |S|. Thus, given two samples S and S′, we can compute

max_{x∈R} |p S((−∞, x]) − p′ S′((−∞, x])|

by sorting the values in S and S′ and then taking the maximum over the points x ∈ S ∪ S′. Overall, this computation takes time O(|S| log |S| + |S′| log |S′|). Furthermore, it is easy to see that we have Π𝔛(m) = m + 1.
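A sketch of this computation, in the same (p, sample) convention as before: since the weighted empirical CDFs only jump at sample points, it suffices to evaluate them there.

```python
import bisect

def max_kolmogorov_smirnov(p, S, pp, Sp):
    """max over x of |p S((-inf, x]) - p' S'((-inf, x])| via sorting."""
    xs, xsp = sorted(S), sorted(Sp)
    mS, mSp = max(len(xs), 1), max(len(xsp), 1)  # guard: empty sub-sample
    best = 0.0
    for x in xs + xsp:
        FS = p * bisect.bisect_right(xs, x) / mS     # p  * S((-inf, x])
        FSp = pp * bisect.bisect_right(xsp, x) / mSp  # p' * S'((-inf, x])
        best = max(best, abs(FS - FSp))
    return best
```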

4.5 Learning Local Distributions

In this section we show how to obtain procedures for estimating distributions over Σ × X that satisfy the requirements of Assumption 2. We write Σ × X instead of Σ′ × X, though there is no difference from the point of view of the results in this section. We begin by showing how to estimate multinomial distributions with smoothing. We will use this estimate for transition and stopping probabilities. Using this multinomial estimation we show how to estimate the entire local distribution, provided we have a procedure for smoothed estimation of distributions over X. Finally, we give two examples of such estimation for different choices of X.

4.5.1 Smoothed Estimation of Multinomial Distributions

Let Σ be a finite set and D a probability distribution over Σ. Suppose S = (x^1, . . . , x^m) is a sample of m i.i.d. examples from D. Given a smoothing parameter 0 ≤ α ≤ 1/|Σ|, we define the smoothed empirical distribution D̂ obtained from S as follows:

D̂(σ) = S(σ)(1 − α|Σ|) + α ,

where S(σ) = S[σ]/m = (1/m) ∑_{i=1}^{m} 1_{x^i=σ} is the empirical frequency of σ in S. In this smoothing scheme we shall take α = |Σ|⁻¹ m^{−1/3}. Now, given accuracy and confidence parameters ε, δ ∈ (0, 1), we define

Nd = Nd(ε, δ) = max{ (18^{3/2}|Σ|^{3/2} / (ε³(1 + ε)^{3/2})) ln(|Σ|^{3/2}/δ) , 8(1 + ε)³/ε³ } .

With this notation, the following result shows that when m is large enough, D̂ is close to D in terms of relative entropy. We also show that KL(D ‖ D̂) is always bounded and grows very slowly with m.

Theorem 4.5.1. For any m we have KL(D ‖ D̂) ≤ ln(|Σ| m^{1/3}). Furthermore, for any ε, δ ∈ (0, 1), if m ≥ Nd(ε, δ), then with probability at least 1 − δ we have KL(D ‖ D̂) ≤ ε.

Proof. Note first that D(σ) ≤ 1 and D̂(σ) ≥ α imply

KL(D ‖ D̂) ≤ ∑_{σ∈Σ} D(σ) ln(1/α) = ln(|Σ| m^{1/3}) .

Now, assuming m ≥ Nd, we analyze the quotient D(σ)/D̂(σ) in two different situations. First suppose that D(σ) ≤ α(1 + ε). Since by definition we have D̂(σ) ≥ α, we get D(σ)/D̂(σ) ≤ 1 + ε. Now note that, by a Chernoff bound and a union bound, with probability at least 1 − δ the following holds for all σ ∈ Σ:

S(σ) ≥ D(σ) ( 1 − √( (2/(D(σ)m)) ln(|Σ|/δ) ) ) .

Thus, if D(σ) > α(1 + ε), our assumption on m implies S(σ) ≥ D(σ)(1 − ε/3). On the other hand, our assumption on m also implies (1 − α|Σ|) ≥ (1 + ε/2)/(1 + ε). Therefore, we get

D̂(σ) ≥ D(σ)(1 − ε/3)(1 − α|Σ|) ≥ D(σ)(1 − ε/3)(1 + ε/2)/(1 + ε) ≥ D(σ)/(1 + ε) .

Combining the bounds obtained in the two situations, we see that with probability at least 1 − δ we have

KL(D ‖ D̂) = ∑_{σ∈Σ} D(σ) ln( D(σ)/D̂(σ) ) ≤ ∑_{σ∈Σ} D(σ) ln(1 + ε) ≤ ∑_{σ∈Σ} ε D(σ) = ε .
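The estimator just analyzed is straightforward to realize; a minimal sketch, with the smoothing parameter α = |Σ|⁻¹m^{−1/3} used above and assuming a non-empty sample:

```python
from collections import Counter

def smoothed_multinomial(sample, sigma):
    """Smoothed empirical distribution over alphabet sigma (Theorem 4.5.1)."""
    m = len(sample)                                  # assumed m >= 1
    alpha = 1.0 / (len(sigma) * m ** (1.0 / 3.0))
    counts = Counter(sample)
    return {s: (counts[s] / m) * (1 - alpha * len(sigma)) + alpha
            for s in sigma}
```

Note that the returned values sum to one and each entry is at least α, which is exactly the property used in the proof above.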

4.5.2 Estimating Local Distributions in Terms of Relative Entropy

The estimate described in the last section can be used to approximate the transition and stopping probabilities of a local distribution. Now we introduce a requirement on a procedure for estimating a class of distributions over X that will help us build an estimator for local distributions. In particular, let D denote an arbitrary family of probability distributions over X. Given a sample S of examples from X, we denote by EstimateX an algorithm that on input S returns a distribution D̂X. We shall assume that this algorithm satisfies the following.

Assumption 3. Suppose S is a sample from X and let D̂X = EstimateX(S). Then the following hold for any ε, δ ∈ (0, 1):

1. There exists Nx = Nx(ε, δ) such that if S is i.i.d. from some DX ∈ D with |S| ≥ Nx, then with probability at least 1 − δ we have KL(DX ‖ D̂X) ≤ ε,

2. There exists some εx > 0 such that for any S and any DX ∈ D we have KL(DX ‖ D̂X) ≤ εx.

Now we show how to approximate a local distribution D over Σ × X using EstimateX. First note that D can be written as D = DΣ ⊗ DX, where DX is a stochastic kernel from Σ to X. We will use Dσ,X to denote the distribution given by DX(σ, •). Thus, we can estimate DΣ and the Dσ,X separately. In particular, given a sample S = ((x^1, y^1), . . . , (x^m, y^m)) from Σ × X we define the following sub-samples: SΣ = (x^1, . . . , x^m) is the corresponding sample from Σ, and for each σ ∈ Σ we let Sσ,X be the sample containing all y^i such that x^i = σ. With this notation, our estimate for the local distribution D is defined by taking D̂Σ(σ) = SΣ(σ)(1 − α|Σ|) + α as in the previous section, and D̂σ,X = EstimateX(Sσ,X) for each σ ∈ Σ.

Our next result bounds the error of D̂. Given accuracy and confidence parameters ε, δ ∈ (0, 1), let us define

Nl(ε, δ) = max{ Nd(ε/2, δ/2) , (16|Σ|εx/ε) ln(4|Σ|/δ) , (4|Σ|εx/ε) Nx(ε/2|Σ|, δ/4|Σ|) } .

Theorem 4.5.2. Suppose D is a family of distributions over X satisfying Assumption 3. Let D be a distribution on Σ × X such that Dσ,X ∈ D for every σ ∈ Σ. For any S we have KL(D ‖ D̂) ≤ ln(|Σ| m^{1/3}) + εx. Furthermore, for any ε, δ ∈ (0, 1), if m ≥ Nl(ε, δ), then with probability at least 1 − δ we have KL(D ‖ D̂) ≤ ε.

Proof. First observe that, by the product rule of relative entropy, for any pair of distributions D and D̂ on Σ × X we have:

KL(D ‖ D̂) = ∑_{σ∈Σ} DΣ(σ) ( ln( DΣ(σ)/D̂Σ(σ) ) + KL(Dσ,X ‖ D̂σ,X) )
  = KL(DΣ ‖ D̂Σ) + ∑_{σ∈Σ} DΣ(σ) KL(Dσ,X ‖ D̂σ,X)
  ≤ ln(|Σ| m^{1/3}) + εx .

On the other hand, using our assumption on m, we can use Theorem 4.5.1 to show that KL(DΣ ‖ D̂Σ) ≤ ε/2 holds with probability at least 1 − δ/2. To bound the second term, containing a sum over Σ, we consider two cases. If DΣ(σ) ≤ ε/(2|Σ|εx), then Assumption 3 yields

DΣ(σ) KL(Dσ,X ‖ D̂σ,X) ≤ ε/(2|Σ|) .

On the other hand, let us write mσ = |Sσ,X|. By a Chernoff bound, with probability at least 1 − δ/4 the following holds for all σ ∈ Σ:

mσ ≥ m DΣ(σ) ( 1 − √( (2/(m DΣ(σ))) ln(4|Σ|/δ) ) ) .

In this case, by our assumption on m, we get mσ ≥ Nx(ε/2|Σ|, δ/4|Σ|) for each σ ∈ Σ such that DΣ(σ) > ε/(2|Σ|εx). This means that, for all those σ, we get

DΣ(σ) KL(Dσ,X ‖ D̂σ,X) ≤ ε/(2|Σ|)

with probability at least 1 − δ/4. Overall, these bounds imply that with probability at least 1 − δ/2 we get

∑_{σ∈Σ} DΣ(σ) KL(Dσ,X ‖ D̂σ,X) ≤ ε/2 .

Hence, KL(D ‖ D̂) ≤ ε with probability at least 1 − δ.

4.5.3 Example with X = ∆

In this example we assume that X = ∆ is another finite alphabet, different from Σ. In this case we can readily use the multinomial estimation from Theorem 4.5.1 to estimate the distributions Dσ,∆ for each σ ∈ Σ. We take as D̂ the following distribution obtained from a sample S of size m from Σ × ∆:

D̂(σ, δ) = (SΣ(σ)(1 − αΣ|Σ|) + αΣ) · (Sσ,∆(δ)(1 − ασ,∆|∆|) + ασ,∆) ,

where αΣ = |Σ|⁻¹ m^{−1/3} and ασ,∆ = |∆|⁻¹ mσ^{−1/3}. Now, a simple calculation yields the following result.

Corollary 4.5.3. Let D be a distribution over Σ × ∆ and S a sample from Σ × ∆ of size m. For any S we have KL(D ‖ D̂) ≤ ln(|Σ| |∆| m^{2/3}). Furthermore, for any ε, δ ∈ (0, 1), if m ≥ N = Õ((|Σ|³|∆|^{3/2}/ε⁴) ln(1/δ)^{3/2}), then with probability at least 1 − δ we have KL(D ‖ D̂) ≤ ε.
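A sketch of this product estimator, built on the `smoothed_multinomial` routine sketched earlier. The corollary does not specify how to smooth when a sub-sample Sσ,∆ is empty, so the uniform fallback below is an assumption of the sketch.

```python
def estimate_local(sample, sigma, delta_set):
    """Estimate a distribution over Sigma x Delta as a product of smoothed
    multinomials; `sample` is a list of (sigma, delta) pairs."""
    d_sigma = smoothed_multinomial([s for s, _ in sample], sigma)
    est = {}
    for s in sigma:
        sub = [d for sp, d in sample if sp == s]
        d_cond = (smoothed_multinomial(sub, delta_set) if sub else
                  {d: 1.0 / len(delta_set) for d in delta_set})  # assumed fallback
        for d in delta_set:
            est[s, d] = d_sigma[s] * d_cond[d]
    return est
```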

4.5.4 Example with X = R≥0

Now we give an example of estimation of local distributions with X = R≥0. In particular, we show how to estimate the distribution of exponential random variables in terms of relative entropy when given upper and lower bounds on the possible rate parameters. Then we combine this result with Theorem 4.5.2 to obtain an estimation procedure for local distributions over Σ × R≥0 when the marginal distributions over R≥0 follow an exponential law.

We begin with some definitions. Given any two real numbers a < b, we define the following piecewise-linear function:

t_{a,b}(x) = a if x < a ;  x if a ≤ x ≤ b ;  b if x > b .

Now let S = (x^1, . . . , x^m) be a sample containing m non-negative examples x^i ∈ R≥0. Furthermore, assume we are given 0 < λmin < λmax. Using these quantities we define the following estimate:

λ̂ = t_{λmin,λmax}( m / ∑_{i=1}^{m} x^i ) .

Our goal is to show that under some conditions λ̂ is a good estimate of the rate parameter λ of an exponential distribution. To begin, let us assume that S is a sample of i.i.d. examples drawn from some exponential distribution D = Exp(λ) with λ ∈ [λmin, λmax]. Using λ̂ we define the smoothed estimate of D as D̂ = Exp(λ̂). The next result shows that D̂ satisfies the conditions required by Assumption 3. Given accuracy and confidence parameters ε, δ ∈ (0, 1), we define NE = NE(ε, δ) = (32/ε²) ln(2/δ).

Theorem 4.5.4. For any S we have KL(D ‖ D̂) ≤ λmax/λmin − 1. Furthermore, for any ε, δ ∈ (0, 1), if m ≥ NE(ε, δ), then with probability at least 1 − δ we have KL(D ‖ D̂) ≤ ε.

Proof. We begin by recalling that for D = Exp(λ) and D̂ = Exp(λ̂), one has

KL(D ‖ D̂) = λ̂/λ + ln(λ/λ̂) − 1 .

Then, since by construction we have λ, λ̂ ∈ [λmin, λmax], we see that

KL(D ‖ D̂) ≤ max{ λmax/λmin − 1 , ln(λmax/λmin) } = λmax/λmin − 1 .

Now recall that if X1, . . . , Xm are i.i.d. random variables with distribution Exp(λ), then Y = X1 + ··· + Xm follows a Gamma(m, 1/λ) distribution. Hence, using Theorem A.1.6 we see that the following holds with probability at least 1 − δ:

λ / ( 1 + √((2/m) ln(2/δ)) + (1/m) ln(2/δ) ) ≤ λ̂ ≤ λ / ( 1 − √((2/m) ln(2/δ)) − (1/m) ln(2/δ) ) .

In particular, for m ≥ (1/2) ln(2/δ) we get:

λ̂/λ ≤ 1 / ( 1 − √((8/m) ln(2/δ)) ) ,    (4.4)

λ/λ̂ ≤ 1 + √((8/m) ln(2/δ)) .    (4.5)

So, assuming that S is i.i.d. from Exp(λ) and m ≥ NE, we consider two separate cases: λ̂ ≥ λ and λ̂ < λ. In the first case, using (4.4) and our assumption on m we get

KL(D ‖ D̂) ≤ λ̂/λ − 1 ≤ √((8/m) ln(2/δ)) / ( 1 − √((8/m) ln(2/δ)) ) ≤ ε .

In the second case, since ln(1 + x) ≤ x for all x ≥ 0, from (4.5) we get

KL(D ‖ D̂) ≤ ln(λ/λ̂) ≤ ln( 1 + √((8/m) ln(2/δ)) ) ≤ ε .
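A sketch of the truncated rate estimator just analyzed: the maximum-likelihood rate m/∑x clamped to [λmin, λmax] by the map t_{λmin,λmax}; a sample with positive sum is assumed.

```python
def estimate_rate(sample, lam_min, lam_max):
    """Return t_{lam_min, lam_max}(m / sum(sample)), the clamped ML rate."""
    lam_hat = len(sample) / sum(sample)     # assumes sum(sample) > 0
    return min(max(lam_hat, lam_min), lam_max)
```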

Finally, plugging the estimation procedure EstimateR≥0(S) = Exp(λ̂) for a given sample over R≥0 into Theorem 4.5.2, we can show the following result.

Corollary 4.5.5. Let D be a distribution over Σ × R≥0 such that every marginal Dσ,R = Exp(λσ) with λmin ≤ λσ ≤ λmax, and write λ̃ = (λmax − λmin)/λmin. Let S be a sample from Σ × R≥0 of size m, and denote by D̂ the distribution obtained from it using the procedure described in Theorem 4.5.2. For any sample S we have KL(D ‖ D̂) ≤ ln(|Σ| m^{1/3}) + λ̃. Furthermore, for any ε, δ ∈ (0, 1), if m ≥ N = Õ((|Σ|³λ̃/ε³) ln(1/δ)^{3/2}), then with probability at least 1 − δ we have KL(D ‖ D̂) ≤ ε.

4.6 Applications of Generalized PDFA

In this section we collect all the results we have proved so far about the learnability of GPDFA, the problem of distinguishing their states, and the estimation of local distributions. We do so by providing two PAC learning results for families of FSM which can be defined in terms of GPDFA. In particular, we choose these two examples because they represent (sub-)classes of distributions which have been used in practical applications and for which no PAC learning results were known.

We begin with an application to Continuous Time Markov Chains (CTMC) and Continuous Time Hidden Markovian Models (CTHMM). These types of models have been used in the model-checking community to represent machines that need to be tested for a set of specifications [BHHK03]. When one wants to conduct a large number of tests on a given machine, it is sometimes more efficient to learn a model of that machine and then run the tests on the model instead of the real machine. In this case it is important to ensure that the model accurately represents the machine being checked. This is why PAC learning algorithms for this type of model may be useful. A usual modelling assumption in CTMC and CTHMM is that the times between observations follow exponential distributions. This is due to the memoryless property of exponential distributions, which is well known for combining analytic simplicity with some degree of realism [GH98]. In our case, by taking X = R≥0 and GPDFA with exponential distributions over R≥0 on their transitions, we obtain a class of models related to CTMC and CTHMM. In particular, we can model every CTMC and some sub-classes of CTHMM. Hence, the following result gives a learning algorithm for these classes of distributions over (Σ × R)*.

Let Σ be a finite alphabet and n a number of states. Let L∞^𝓔n denote the discrepancy on (Σ × R≥0)* built from the Kolmogorov–Smirnov distance L∞^𝔛 on R as described in Section 4.4.2. We define the class of all GPDFA A = ⟨Q, Σ, τ, γ, q0, ξ, D⟩ satisfying the following conditions, and denote it by GPDFAExp(Σ, n, L, µ, λmin, λmax):

1. |Q| ≤ n,

2. L(q) ≤ L for every q ∈ Q,

3. Dq,σ = Exp(λ) with λmin ≤ λ ≤ λmax for every q ∈ Q and σ ∈ Σ′,

4. min_{q≠q′} L∞^𝓔n(fAq, fAq′) ≥ µ.

The following PAC learning result holds for this class of GPDFA, where we use the notation λ̃ = (λmax − λmin)/λmin. The proof is a straightforward calculation combining Theorem 4.3.10, Corollary 4.4.3, and Corollary 4.5.5.

Corollary 4.6.1. There exists an algorithm such that, for every ε, δ ∈ (0, 1), on input Σ, n, δ and an i.i.d. sample S of size m from some distribution fA with A ∈ GPDFAExp(Σ, n, L, µ, λmin, λmax), with probability at least 1 − δ, if

m ≥ N = Õ( (|Σ|n²L⁴λ̃²/ε³) ln(1/δ)^{3/2} · max{ |Σ|³L³λ̃/ε³ , 1/µ² } ) ,

then the algorithm returns a GPDFA Â with KL(fA ‖ fÂ) ≤ ε.

Our second application is to the problem of learning (stochastic) transductions; that is, functions of type Σ* → ∆* and (conditional) probability distributions on Σ* × ∆*, where Σ is an input alphabet and ∆ an output alphabet. In particular, our results on learning GPDFA can be used to prove the PAC learnability of some classes of distributions over (Σ × ∆)* which define aligned transductions; that is, transductions where input and output strings must have the same length. We note that our transductions are incomparable to most of the standard classes of transductions considered in the literature. For example, sequential transductions allow input and output strings of different lengths, but constrain the transitions inside the transducer to be deterministic in terms of both input and output [Hig10]. Our transducers, however, are only required to be deterministic with respect to the input symbols. In particular, the marginal distribution over ∆* defined by a GPDFA with X = ∆ cannot, in general, be realized by any PDFA.

Let Σ and ∆ be finite alphabets and let n be a number of states. Let L∞^𝓔n denote the discrepancy on (Σ × ∆)* built from either the L∞ or the L1 distance on ∆ as described in Section 4.4.1. We write GPDFA∆(Σ, n, L, µ) to denote the class of all GPDFA A = ⟨Q, Σ, τ, γ, q0, ξ, D⟩ such that the following hold:

1. |Q| ≤ n,

2. L(q) ≤ L for every q ∈ Q,

3. Dq,σ is a distribution over ∆ for every q ∈ Q and σ ∈ Σ′,

4. min_{q≠q′} L∞^𝓔n(fAq, fAq′) ≥ µ.

With this notation, we have the following result, which gives a PAC learning algorithm for GPDFA∆(Σ, n, L, µ). The proof is a straightforward calculation combining Theorem 4.3.10, Corollary 4.4.3, and Corollary 4.5.3.

Corollary 4.6.2. There exists an algorithm such that, for every ε, δ ∈ (0, 1), on input Σ, n, δ and an i.i.d. sample S of size m from some distribution fA with A ∈ GPDFA∆(Σ, n, L, µ), with probability at least 1 − δ, if

m ≥ N = Õ( (|Σ|n²L⁴/ε³) ln(1/δ)^{3/2} · max{ |Σ|³L³|∆|^{3/2}/ε³ , 1/µ² } ) ,

then the algorithm returns a GPDFA Â with KL(fA ‖ fÂ) ≤ ε.

4.7 Proof of Theorem 4.3.2

We will make extensive use of notation from measure theory in this section; see [Pol03] for details. Let (X, A) and (Y, B) be two measure spaces. We assume that all the measures we consider are sigma-finite. A family of measures Λ = {λx | x ∈ X} is a kernel from (X, A) to (Y, B) if the map x ↦ λx(B) is A-measurable for each B in B and λx(Y) < ∞ for each x in X. If µ is a measure on A, then the functional on M⁺(X × Y, A ⊗ B) given by

f ↦ ∬ f(x, y) λx(dy) µ(dx)

defines a measure on A ⊗ B which we denote by µ ⊗ Λ. Let ν be a measure on A such that µ ≪ ν, and λ a measure on B such that λx ≪ λ for each x. Then, by the Radon–Nikodym theorem, the following densities are well defined: dµ/dν = f and dλx/dλ = gx. In this setting one can prove the following.

Lemma 4.7.1. If gx(y) is A ⊗ B-measurable, then µ ⊗ Λ ≪ ν ⊗ λ. Furthermore, µ ⊗ Λ has density (x, y) ↦ f(x)gx(y) with respect to ν ⊗ λ.

Proof. Let C be a ν ⊗ λ-negligible set and denote by χC its characteristic function, which, by assumption, is zero ν ⊗ λ-almost everywhere. We want to see that

(µ ⊗ Λ)(C) = ∬ χC(x, y) λx(dy) µ(dx) = 0 .

By the measurability of gx(y), we can use Tonelli's theorem to show that

(µ ⊗ Λ)(C) = ∬ χC(x, y) f(x) gx(y) λ(dy) ν(dx) ,

which is equal to zero because χC (x, y)f(x)gx(y) is zero ν ⊗λ-almost everywhere. A similar computation with χC replaced by an arbitrary non-negative product measurable function shows that the density is f(x)gx(y).

Finally, recall that given measures µ1, µ2, ν on a measure space (X, A), with µ1, µ2 ≪ ν, dµ1/dν = p1, and dµ2/dν = p2, the relative entropy or Kullback–Leibler divergence between µ1 and µ2 is given by

KL(µ1 ‖ µ2) = ∫_X p1(x) log( p1(x)/p2(x) ) ν(dx) .

4.7.1 Measures Defined by GPDFA

Recall that a GPDFA A = ⟨Q, Σ, τ, γ, q0, ξ, D⟩ induces a probability distribution over the space Ω = (Σ × X)* × ({ξ} × X). Now we will describe the sigma-field, the measure, and the density associated with this distribution. To compute the density we will need to assume that every distribution D ∈ D over X is absolutely continuous with respect to some fixed measure m. In particular, for any Dq,σ ∈ D we will denote its density with respect to m by dDq,σ/dm = gq,σ.

We start by decomposing Ω into the disjoint union of an infinite number of subspaces, each of them having its own sigma-field and measure. Let Ω0 = {λ} × ({ξ} × X), and for n ≥ 1, Ωn = (Σ × X)ⁿ × ({ξ} × X). All these spaces are product spaces that can be endowed with the product sigma-field generated by the sigma-fields of each component. In this case, we will take the sigma-field associated with m on X, and the power-set sigma-field 2^Y for each finite set Y. We will use c to denote the counting measure on any discrete set. With this notation we can equip each Ωn with a standard measure νn = (c ⊗ m)^{n+1} for n ≥ 0.

Now, for each Ωn we will define a measure µn coming from the GPDFA. These µn will be defined in terms of measures and kernels from three families that we now describe. From now on, we will identify (Σ × X)ⁿ with Σⁿ × Xⁿ for convenience, and use the notation (x, y) to denote an element of the former set.

The family Θⁿ, where Θ⁰ is a kernel from Σ′ to X, and Θⁿ, for n > 0, is a kernel from (Σ × X)ⁿ × Σ′ to X. The measures contained in these kernels are defined by their densities with respect to m as follows:

dΘ⁰σ/dm = g_{q0,σ} ,   dΘⁿ_{(x,y),σ}/dm = g_{τ(q0,x),σ} .

0 n dΓ dΓ(x,y) = γ(q0, •) , = γ (τ(q0, x), •) . dc dc

The family Ξⁿ, where Ξ⁰ is a measure on {ξ}, and Ξⁿ, for n > 0, is a kernel from (Σ × X)ⁿ to {ξ}. Their densities with respect to the counting measure are:

dΞ⁰/dc = γ(q0, •) ,   dΞⁿ_{(x,y)}/dc = γ(τ(q0, x), •) .

Note that the only difference between the densities of the family Γⁿ and those of the family Ξⁿ is the domain on which they are defined. Furthermore, the measures in these families may not be probability measures – although they are all finite. However, the measures in the family Θⁿ are, indeed, probability measures.

Taking products of these measures and kernels we define the following new measures. On Ω0 we define µ0 = Ξ⁰ ⊗ Θ⁰, and on Ωn for n > 0 we define

µn = Γ⁰ ⊗ Θ⁰ ⊗ ··· ⊗ Γ^{n−1} ⊗ Θ^{n−1} ⊗ Ξⁿ ⊗ Θⁿ = ( ⊗_{i=0}^{n−1} Γ^i ⊗ Θ^i ) ⊗ (Ξⁿ ⊗ Θⁿ) .

It is easy to check that all the densities defined above for the families Θⁿ, Γⁿ, and Ξⁿ satisfy the measurability conditions required by Lemma 4.7.1. Therefore, we have µn ≪ νn for n ≥ 0 and the respective densities can be computed. For n = 0 one has

(dµ0/dν0)(λ, ξ, y) = γ(q0, ξ) g_{q0,ξ}(y) .

And for n > 0,

(dµn/dνn)(x, y) = ∏_{i=0}^{n} γ(qi, xi) g_{qi,xi}(yi) ,

where x = x0 ··· xn with xn = ξ, y = (y0, . . . , yn) and qi = τ(q0, x0 ··· xi−1) for 1 ≤ i ≤ n.

4.7.2 Relative Entropy Between GPDFA

Suppose we are given two GPDFA, A and Â, over the same alphabet Σ with the same terminal symbol ξ. These automata induce measures µ and µ̂ over the same space Ω, both dominated by the standard measure ν on Ω. Therefore we can compute the Kullback–Leibler divergence between these two measures. By the definition of the divergence, and remembering that Ω is the disjoint union of the Ωn, one obtains:

KL(µ ‖ µ̂) = ∫_Ω (dµ/dν) log( (dµ/dν) / (dµ̂/dν) ) dν = ∑_{n≥0} KL(µn ‖ µ̂n) .

Now we proceed to evaluate the terms in the sum,

KL(µn ‖ µ̂n) = ∫_{Ωn} (dµn/dνn) log( (dµn/dνn) / (dµ̂n/dνn) ) dνn .

Substituting the densities computed in the previous section, and recalling that integration over finite spaces with respect to the counting measure is equivalent to summation, one obtains

∑_{x0} ··· ∑_{xn} ∫ ··· ∫ ( ∏_{i=0}^{n} γ(qi, xi) g_{qi,xi}(yi) ) log( ∏_i γ(qi, xi) g_{qi,xi}(yi) / ∏_i γ̂(q̂i, xi) ĝ_{q̂i,xi}(yi) ) dyn ··· dy0 ,

where the sums over xi run over Σ for 0 ≤ i ≤ n − 1 and xn = ξ. Now we can permute the integral and sum signs, reorder the products, turn the logarithm of a product into a sum of logarithms, and use the recursive definition of γ : Q × (Σ′)* → [0, 1]. This yields

∑_x ∫ ··· ∫ γ(q0, x) ( ∏_{i=0}^{n} gi(yi) ) ( log( γ(q0, x)/γ̂(q̂0, x) ) + ∑_{i=0}^{n} log( gi(yi)/ĝi(yi) ) ) dyn ··· dy0 ,

where x = x0 ··· xn runs over all the words in Σⁿξ and gi = g_{qi,xi}. Recall that the gq,σ are densities of probability measures over X and therefore ∫ gq,σ(y) dy = 1. Hence,

∫ ··· ∫ ∏_{i=0}^{n} gi(yi) dyn ··· dy0 = 1 ,

and, for 0 ≤ i ≤ n,

∫ ··· ∫ ( ∏_{j=0}^{n} gj(yj) ) log( gi(yi)/ĝi(yi) ) dyn ··· dy0 = ∫ gi(yi) log( gi(yi)/ĝi(yi) ) dyi = KL(gi ‖ ĝi) .

This allows us to write

KL(µn ‖ µ̂n) = ∑_x γ(q0, x) ( log( γ(q0, x)/γ̂(q̂0, x) ) + ∑_{i=0}^{n} KL(gi ‖ ĝi) ) .

And we finally obtain

KL(µ ‖ µ̂) = ∑_{x∈Σ*ξ} γ(q0, x) ( log( γ(q0, x)/γ̂(q̂0, x) ) + ∑_{i=0}^{|x|−1} KL(gi ‖ ĝi) ) ,

where x = x0 ··· x_{|x|−1}, gi = g_{qi,xi} for 0 ≤ i ≤ |x| − 1, and qi = τ(q0, x_{0:i−1}) for 1 ≤ i ≤ |x| − 1. Now, using the procedure described in [Car97], one can manipulate this expression to obtain a new expression for KL(µ ‖ µ̂) in terms of the divergences between the local distributions at each pair of states of A and Â. This yields

KL(µ ‖ µ̂) = ∑_{q∈Q} ∑_{q̂∈Q̂} W(q, q̂) ∑_{σ∈Σ′} γ(q, σ) ( log( γ(q, σ)/γ̂(q̂, σ) ) + KL(g_{q,σ} ‖ ĝ_{q̂,σ}) ) .

Part II

Spectral Methods


Chapter 5

A Spectral Learning Algorithm for Weighted Automata

This chapter begins the second part of the present dissertation. In it we introduce weighted automata and discuss their learnability in a very abstract setting. Weighted automata are a natural generalization of the classes of automata we have seen so far. In particular, they can realize probability distributions generated by HMM and PNFA, as well as other types of distributions which do not admit probabilistic parametrizations (see [DE08] for examples). The learnability results that we present in this chapter require knowledge of the output produced by the target automaton on a particular set of strings. In this respect, the results resemble those in the query learning framework. The starting point for our derivations is to study Hankel matrices of functions from strings to real numbers and their relation to weighted automata computing such functions. Using tools from linear algebra, we uncover a duality principle between minimal weighted automata and factorizations of their Hankel matrices. This duality will yield a derivation of the so-called spectral learning method, which will be the basis for all the learning algorithms presented in subsequent chapters. Although the spectral learning algorithm is derived assuming that some Hankel matrices are known exactly, it turns out to be quite robust against approximation noise. Thus, we end the chapter by proving a series of technical results that will be used later for bounding the error of learning algorithms which apply the spectral principle to empirical Hankel matrices estimated from a sample. Here we present the analytical and algebraic machinery needed to obtain such bounds, and defer the statistical aspects of such analyses to the next chapters. Some notation and linear algebra results used in our proofs can be found in Appendix A.2.

5.1 Weighted Automata and Hankel Matrices

Let f : Σ* → R be a function over strings. The Hankel matrix of f is a bi-infinite matrix Hf ∈ R^{Σ*×Σ*} whose entries are defined as Hf(u, v) = f(uv) for any u, v ∈ Σ*. That is, rows are indexed by prefixes and columns by suffixes. Note that the Hankel matrix of a function f is a very redundant way to represent f. In particular, the value f(x) appears |x| + 1 times in Hf, and we have f(x) = Hf(x, λ) = Hf(λ, x). An obvious observation is that any matrix M ∈ R^{Σ*×Σ*} satisfying M(u1, v1) = M(u2, v2) whenever u1v1 = u2v2 is the Hankel matrix of some function f : Σ* → R.

We will be considering (finite) sub-blocks of a bi-infinite Hankel matrix Hf. An easy way to define such sub-blocks is using a basis B = (P, S), where P ⊆ Σ* is a set of prefixes and S ⊆ Σ* a set of suffixes. We shall write p = |P| and s = |S|. The sub-block of Hf defined by B is the p × s matrix HB ∈ R^{P×S} with HB(u, v) = Hf(u, v) = f(uv) for any u ∈ P and v ∈ S. We may just write H if the basis B is arbitrary or obvious from the context.

Not all bases will be equally useful for our purposes. In particular, we will be interested in two properties of bases: closure and completeness. Let B = (P, S) be a basis, and write Σ′ = Σ ∪ {λ} for convenience. The p-closure¹ of B is the basis B′ = (P′, S), where P′ = PΣ′. Equivalently, a basis

B = (P, S) is said to be p-closed if P = P0Σ′ for some P0 called the root of P. It turns out that a Hankel matrix over a p-closed basis can be partitioned into |Σ| + 1 blocks of the same size. This partition will be useful in many of our results. Let Hf be a Hankel matrix and B = (P, S) a basis. For any σ ∈ Σ′ we write Hσ to denote the sub-block of Hf over the basis (Pσ, S). That is, the sub-block Hσ ∈ R^{Pσ×S} of Hf is the p × s matrix defined by Hσ(u, v) = Hf(uσ, v). Thus, if B′ = (P′, S) is the p-closure of B, then for a particular ordering of the strings in P′ we have

H_{B′}^⊤ = [ H_λ^⊤  H_{σ1}^⊤  ···  H_{σ|Σ|}^⊤ ] .
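As a sketch, given black-box access to f, the blocks Hσ over a p-closed basis can be filled in directly from the definition Hσ(u, v) = f(uσv); strings are represented as plain Python strings and "" plays the role of λ.

```python
def hankel_blocks(f, prefixes, suffixes, alphabet):
    """Return one |P| x |S| block per symbol in {lambda} + alphabet."""
    return {sigma: [[f(u + sigma + v) for v in suffixes] for u in prefixes]
            for sigma in [""] + list(alphabet)}
```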

The rank of a function f : Σ* → R is defined as the rank of its Hankel matrix: rank(f) = rank(Hf). The rank of a sub-block of Hf cannot exceed rank(f), and we will be especially interested in sub-blocks with full rank. We say that a basis B = (P, S) is complete for f if the sub-block HB has full rank: rank(HB) = rank(Hf). In this case we say that HB is a complete sub-block of Hf. It turns out that the rank of f is related to the number of states needed to compute f with a weighted automaton, and that the p-closure of a complete sub-block of Hf contains enough information to compute this automaton. These two results provide the foundations for the learning algorithm presented in Section 5.3.

A widely used class of functions mapping strings to real numbers is that of functions defined by weighted finite automata (WFA), or weighted automata for short [Moh09]. These functions are also known as rational power series [SS78; BR88]. A WFA over Σ with n states is given by a tuple A = ⟨α0, α∞, {Aσ}⟩, where α0, α∞ ∈ Rⁿ are the initial and final weight vectors, and Aσ ∈ R^{n×n} is the transition matrix associated with each alphabet symbol σ ∈ Σ. The function fA realized by a WFA A is defined by

fA(x) = α0⊤ A_{x1} ··· A_{xt} α∞ = α0⊤ A_x α∞ ,

for any string x = x1 ··· xt ∈ Σ* with t = |x| and xi ∈ Σ for all 1 ≤ i ≤ t. We will write |A| to denote the number of states of a WFA. Since α0 will mostly be used as a row vector, we shall omit the transpose sign whenever there is no risk of confusion. The following characterization of the set of functions f : Σ* → R realized by WFA, in terms of the rank of their Hankel matrix, was given by Carlyle and Paz [CP71] and Fliess [Fli74]. We also note that the construction of an equivalent WFA with the minimal number of states from a given WFA was first given by Schützenberger [Sch61].

Theorem 5.1.1 ([CP71; Fli74]). A function f : Σ* → R can be realized by a WFA if and only if rank(Hf) is finite, and in that case rank(Hf) is the minimal number of states of any WFA A such that f = fA.

In view of this result, we will say that A is minimal for f if fA = f and |A| = rank(f). Another useful fact about WFA is their invariance under change of basis. It follows from the definition of fA that if M ∈ R^{n×n} is an invertible matrix, then the WFA B = ⟨α0⊤M, M⁻¹α∞, {M⁻¹AσM}⟩ satisfies fB = fA. Sometimes B is denoted by M⁻¹AM. This fact will prove very useful when we consider the problem of learning a WFA realizing a certain function.

Weighted automata are related to other finite-state computational models. In particular, WFA can also be defined more generally over an arbitrary semiring instead of the field of real numbers, in which case they are sometimes called multiplicity automata (MA) [BBBKV00]. It is well known that using weights over an arbitrary semiring yields more computational power. However, in this thesis we will only consider WFA with real weights. It is easy to see that several models considered so far (DFA, PDFA, PNFA) can easily be cast as special cases of WFA.

5.1.1 Example

Figure 5.2 shows an example of a weighted automaton A = ⟨α0, α∞, {Aσ}⟩ with two states defined over the alphabet Σ = {a, b}, with both its algebraic representation (Figure 5.2b) in terms of vectors and matrices and the equivalent graph representation (Figure 5.2a) useful for a variety of WFA algorithms

¹ The notation p-closure stands for prefix-closure. A similar notion can be defined for suffixes as well.

[Figure 5.2a: graph representation of A (a two-state transition diagram carrying the weights below).]

(b) Algebraic representation:

$$\alpha_0^\top = \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} \end{bmatrix}, \quad \alpha_\infty^\top = \begin{bmatrix} 1 & -1 \end{bmatrix}, \quad A_a = \begin{bmatrix} \tfrac{3}{4} & 0 \\ 0 & \tfrac{1}{3} \end{bmatrix}, \quad A_b = \begin{bmatrix} \tfrac{6}{5} & \tfrac{2}{3} \\ \tfrac{3}{4} & 1 \end{bmatrix}$$

Figure 5.2: Example of a weighted automaton over Σ = {a, b} with 2 states.

[Moh09]. Letting W = {λ, a, b}, then B = (WΣ′, W) is a p-closed basis. The following is the Hankel matrix of A on this basis, with entries truncated to two decimal digits:

$$H_B^\top = \begin{array}{c|ccccccc}
 & \lambda & a & b & aa & ab & ba & bb \\ \hline
\lambda & 0.00 & 0.20 & 0.14 & 0.22 & 0.15 & 0.45 & 0.31 \\
a & 0.20 & 0.22 & 0.45 & 0.19 & 0.29 & 0.45 & 0.85 \\
b & 0.14 & 0.15 & 0.31 & 0.13 & 0.20 & 0.32 & 0.58
\end{array}$$
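Materializing such a Hankel sub-block given query access to f is straightforward. A hedged sketch (hankel_block is an illustrative name; the automaton of Figure 5.2 is inlined so the snippet is self-contained) that reproduces the displayed matrix:

```python
import numpy as np

def hankel_block(f, prefixes, suffixes):
    """Sub-block H_B of H_f over B = (prefixes, suffixes): H_B(u, v) = f(uv)."""
    return np.array([[f(u + v) for v in suffixes] for u in prefixes])

a0, ainf = np.array([0.5, 0.5]), np.array([1.0, -1.0])
ops = {"a": np.array([[0.75, 0.0], [0.0, 1/3]]),
       "b": np.array([[1.2, 2/3], [0.75, 1.0]])}

def f(w):                      # f_A(w) for the automaton of Figure 5.2
    v = a0
    for c in w:
        v = v @ ops[c]
    return float(v @ ainf)

P = ["", "a", "b", "aa", "ab", "ba", "bb"]   # the p-closure WΣ'
S = ["", "a", "b"]                           # W
H = hankel_block(f, P, S)
print(np.floor(H.T * 100) / 100)             # matches the truncated entries above
```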

5.2 Duality and Minimal Weighted Automata

Let f be a real function on strings with rank r and Hf its Hankel matrix. In this section we consider factorizations of Hf and minimal WFA for f. We will show that there exists an interesting relation between these two concepts. This relation will motivate the algorithm presented in the next section, which factorizes a (sub-block of a) Hankel matrix in order to learn a WFA for some unknown function.

Our initial observation is that a WFA A = ⟨α0, α∞, {Aσ}⟩ for f with n states induces a factorization of Hf. Let P ∈ R^{Σ*×n} be a matrix whose uth row equals α0^⊤Au for any u ∈ Σ*. Furthermore, let S ∈ R^{n×Σ*} be a matrix whose columns are of the form Avα∞ for all v ∈ Σ*. It is trivial to check that one has Hf = PS. The same happens for sub-blocks: if HB is a sub-block of Hf defined over an arbitrary basis B = (P, S), then the corresponding sub-matrices PB ∈ R^{P×n} and SB ∈ R^{n×S} of P and S induce the factorization HB = PBSB. Furthermore, if Hσ is a sub-block of the matrix HB′ corresponding to the p-closure of HB, then we have the factorization Hσ = PB Aσ SB.

An interesting consequence of this construction is that if A is minimal for f, that is n = r, then the factorization Hf = PS is in fact a rank factorization: rank(P) = rank(S) = rank(Hf). Since in general rank(HB) ≤ r, in this case the factorization HB = PBSB is a rank factorization if and only if HB is a complete sub-block. Thus, we see that a minimal WFA realizing a function f induces a rank factorization on any complete sub-block of Hf. The converse is even more interesting: given a rank factorization of a complete sub-block of Hf, one can compute a minimal WFA for f.

Let H be a complete sub-block of Hf defined by the basis B = (P, S) and let Hσ denote the sub-block of the p-closure of H corresponding to the basis (Pσ, S). Let hP,λ ∈ R^P denote the p-dimensional vector with coordinates hP,λ(u) = f(u), and hλ,S ∈ R^S the s-dimensional vector with coordinates hλ,S(v) = f(v). Now we can state our result.

Lemma 5.2.1. If H = PS is a rank factorization, then the WFA A = ⟨α0, α∞, {Aσ}⟩ with α0^⊤ = hλ,S^⊤ S^+, α∞ = P^+ hP,λ, and Aσ = P^+ Hσ S^+, is minimal for f. Furthermore, A induces the factorizations H = PS and Hσ = P Aσ S.

Proof. Let A′ = ⟨α0′, α∞′, {Aσ′}⟩ be a minimal WFA for f that induces a rank factorization H = P′S′. It suffices to show that there exists an invertible M such that M^{-1}A′M = A. Define M = S′S^+ and note that P^+P′S′S^+ = P^+HS^+ = I implies that M is invertible with M^{-1} = P^+P′. Now we check that the operators of A correspond to the operators of A′ under this change of basis. First we see that Aσ = P^+HσS^+ = P^+P′Aσ′S′S^+ = M^{-1}Aσ′M. Now observe that by the construction of S′ and P′ we have α0′^⊤S′ = hλ,S^⊤ and P′α∞′ = hP,λ. Thus, it follows that α0^⊤ = α0′^⊤M and α∞ = M^{-1}α∞′.

Now let H = PA SA be the factorization induced by A. Observe that for any u ∈ P the uth row of PA satisfies

$$\alpha_0^\top A_u = \alpha_0'^\top A'_u M = P'_u S' S^+ = h_{u,S}^\top S^+ = P_u S S^+ = P_u,$$

where Pu and P′u denote the uth rows of P and P′ respectively, and hu,S is the uth row of H. Thus we get PA = P, and a symmetric argument shows SA = S. The factorization of Hσ follows immediately.

This result shows that there exists a duality between rank factorizations of complete sub-blocks of Hf and minimal WFA for f. A consequence of this duality is that all minimal WFA for a function f are related via some change of basis. In other words, modulo change of basis, there exists a unique minimal WFA for any function f of finite rank.

Corollary 5.2.2. Let A = ⟨α0, α∞, {Aσ}⟩ and A′ = ⟨α0′, α∞′, {Aσ′}⟩ be minimal WFA for some f of rank r. Then there exists an invertible matrix M ∈ R^{r×r} such that A = M^{-1}A′M.

Proof. Let B be any complete basis for f and let HB = PS = P′S′ be the rank factorizations induced by A and A′ respectively. Then, by the same arguments used in Lemma 5.2.1, the matrix M = S′S^+ is invertible and satisfies the equation A = M^{-1}A′M.

The following technical lemma is a duality-like result for models with different numbers of states. It will be useful in later chapters for proving some of our results.

Lemma 5.2.3. Let A = ⟨α0, α∞, {Aσ}⟩ be a WFA with n states. Suppose that (P, S) is a complete basis for fA and write H = PS for the factorization induced by A on this Hankel sub-block. For any k and any pair of matrices N ∈ R^{k×n} and M ∈ R^{n×k} such that PMN = P, the WFA B = NAM = ⟨α0^⊤M, Nα∞, {NAσM}⟩ satisfies fB = fA.

Proof. Write B = ⟨β0, β∞, {Bσ}⟩. It is enough to show that for any x ∈ Σ* one has β0^⊤Bx = βx^⊤ = αx^⊤M = α0^⊤AxM, for then fB(x) = βx^⊤β∞ = αx^⊤MNα∞ = fA(x), where this last equality follows from PMN = P and the observation that the rows of P span the set {αx^⊤}x∈Σ*. Note that βλ^⊤ = β0^⊤ = α0^⊤M by definition. Thus, by induction on the length of x, we have βxσ^⊤ = βx^⊤Bσ = αx^⊤MNAσM = αxσ^⊤M.

5.3 The Spectral Method

The spectral method is basically an efficient algorithm that implements the ideas in the proof of Lemma 5.2.1 to find a rank factorization of a complete sub-block H of Hf and obtain from it a minimal WFA for f. A particularly interesting feature of the method is its robustness to noise. Since it uses a factorization based on the singular value decomposition (SVD) of H, one can show that if only an approximate sub-block Ĥ is available, then the resulting WFA will be close to the one that would be obtained using the exact matrix. The term spectral comes from the fact that SVD is a type of spectral decomposition. We give the basic algorithm in this section. The techniques needed to prove error bounds will be outlined in the next section. Specific applications and variations of this algorithm will be the subject of subsequent chapters.

Suppose f : Σ* → R is an unknown function of finite rank r and we want to compute a minimal WFA for it. Let us assume that we know that B = (P, S) is a complete basis for f. Our algorithm receives as input the basis B and the values of f on a set of strings W. In particular, we assume that PΣ′S ∪ P ∪ S ⊆ W. It is clear that using these values of f the algorithm can compute the sub-blocks Hσ of Hf for σ ∈ Σ′. Furthermore, it can compute the vectors hλ,S and hP,λ. Thus, the algorithm only needs a rank factorization of Hλ to be able to apply the formulas given in Lemma 5.2.1.

Recall that the compact SVD of a p × s matrix Hλ of rank r is given by the expression Hλ = UΛV^⊤, where U ∈ R^{p×r} and V ∈ R^{s×r} are orthogonal matrices and Λ ∈ R^{r×r} is a diagonal matrix containing the singular values of Hλ. The most interesting property of the compact SVD for our purposes is that Hλ = (UΛ)V^⊤ is a rank factorization. We will use this factorization in the algorithm, but written in a different way. Note that since V is orthogonal we have V^⊤V = I, and in particular V^+ = V^⊤.

Thus, the factorization above is equivalent to Hλ = (HλV)V^⊤. With this factorization, the equations from Lemma 5.2.1 read as follows:

$$\begin{aligned}
\alpha_0^\top &= h_{\lambda,S}^\top V, \\
\alpha_\infty &= (H_\lambda V)^+ h_{P,\lambda}, \\
A_\sigma &= (H_\lambda V)^+ H_\sigma V.
\end{aligned}$$

These equations are in essence the spectral learning algorithm for WFA. The underlying algebraic ideas of the algorithm are not much different from the ones used in the algorithm given in [BBBKV00] for learning multiplicity automata from membership and equivalence queries. However, there are substantial differences between the two algorithms in terms of learning frameworks and capabilities. The first is that in their case the basis is initially unknown and is constructed along the way by trial-and-error via equivalence queries. Our algorithm requires the basis to be known, but in exchange the use of SVD makes it robust to noise. This is what will permit us to learn classes of WFA for which approximations of their Hankel matrices can be computed from a random sample.
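In code, the three equations above amount to one SVD and one pseudo-inverse. The following Python/NumPy sketch implements them under the assumption that exact (or estimated) Hankel blocks are already available; all names are illustrative.

```python
import numpy as np

def spectral_wfa(H_lam, H_sig, h_lS, h_Pl, rank):
    """Recover a WFA <alpha0, alpha_inf, {A_sigma}> from Hankel blocks.

    H_lam: (p, s) block H_lambda over a complete basis B = (P, S).
    H_sig: dict sigma -> (p, s) block with entries H_f(u sigma, v).
    h_lS:  (s,) vector with entries f(v) for v in S.
    h_Pl:  (p,) vector with entries f(u) for u in P.
    """
    _, _, Vt = np.linalg.svd(H_lam, full_matrices=False)
    V = Vt[:rank].T                     # top right singular vectors, shape (s, rank)
    F = np.linalg.pinv(H_lam @ V)       # (H_lambda V)^+
    alpha0 = h_lS @ V                   # alpha0^T = h_{lambda,S}^T V
    alpha_inf = F @ h_Pl                # alpha_inf = (H_lambda V)^+ h_{P,lambda}
    ops = {s: F @ Hs @ V for s, Hs in H_sig.items()}  # A_sigma = (H_lambda V)^+ H_sigma V
    return alpha0, alpha_inf, ops
```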

5.4 Recipes for Error Bounds

Being able to bound the error incurred by using an approximate Hankel matrix instead of an exact one is one of the major feats of the spectral learning algorithm. In this section we present the main technical tools needed to prove such bounds. Generally, this involves two steps. The first one analyzes the error in the individual operators, and the second one analyzes how these errors aggregate when evaluating the automaton on a particular string. We will give several variants of each of these steps that will prove useful in later chapters.

5.4.1 Pointwise Difference Between WFA

Let A = ⟨α0, α∞, {Aσ}⟩ and A′ = ⟨α0′, α∞′, {Aσ′}⟩ be two WFA over Σ with the same number of states n. In this section we are interested in bounding the difference |fA(x) − fA′(x)| for an arbitrary string x ∈ Σ*. The bounds we give will be in terms of the differences between the operators of A and A′, and of the norms of these operators. We begin with a simple lemma.

Lemma 5.4.1. Let ‖·‖ be an arbitrary submultiplicative matrix norm such that for any σ ∈ Σ we have ‖Aσ‖, ‖Aσ′‖ ≤ γ for some γ > 0. Then, for any x ∈ Σ⁺ we have

$$\|A_x - A'_x\| \le \gamma^{|x|-1} \sum_{i=1}^{|x|} \|A_{x_i} - A'_{x_i}\|.$$

Proof. By induction on the length of x. For |x| = 1 the result is trivial. Now suppose x = yσ. Then, by the triangle inequality, submultiplicativity of the norm, and the γ-boundedness hypothesis, we have

$$\begin{aligned}
\|A_{y\sigma} - A'_{y\sigma}\| &\le \|A_y - A'_y\|\|A_\sigma\| + \|A'_y\|\|A_\sigma - A'_\sigma\| \\
&\le \gamma \left( \gamma^{|y|-1} \sum_{i=1}^{|y|} \|A_{y_i} - A'_{y_i}\| \right) + \gamma^{|y|}\|A_\sigma - A'_\sigma\| \\
&= \gamma^{|x|-1} \sum_{i=1}^{|x|} \|A_{x_i} - A'_{x_i}\|.
\end{aligned}$$

Using this lemma we will be able to prove the desired bound. The bound will be parametrized by two Hölder-conjugate real numbers: that is, p, q ∈ [1, ∞] such that 1/p + 1/q = 1. We will be using

ℓp and ℓq vector norms and the corresponding induced norms for matrices. In particular, we define the following quantities:

$$\rho_0 = \|\alpha_0 - \alpha'_0\|_p, \qquad \rho_\infty = \|\alpha_\infty - \alpha'_\infty\|_q, \qquad \rho_\sigma = \|A_\sigma - A'_\sigma\|_q.$$

With these definitions we have the following result.

Lemma 5.4.2. Let γ > 0 be such that ‖α0‖p, ‖α∞‖q, ‖Aσ‖q ≤ γ, and similarly for the vectors and operators of A′. Then the following holds for any x ∈ Σ*:

$$|f_A(x) - f_{A'}(x)| \le \gamma^{|x|+1} \left( \rho_0 + \rho_\infty + \sum_{i=1}^{|x|} \rho_{x_i} \right).$$

Proof. Applying the triangle inequality, Hölder's inequality, a property of induced norms, norm submultiplicativity, and the γ-boundedness assumption, we have

$$\begin{aligned}
|\alpha_0^\top A_x \alpha_\infty - \alpha_0'^\top A'_x \alpha'_\infty|
&\le |\alpha_0^\top (A_x \alpha_\infty - A'_x \alpha'_\infty)| + |(\alpha_0 - \alpha'_0)^\top A'_x \alpha'_\infty| \\
&\le \|\alpha_0\|_p \|A_x \alpha_\infty - A'_x \alpha'_\infty\|_q + \|\alpha_0 - \alpha'_0\|_p \|A'_x \alpha'_\infty\|_q \\
&\le \|\alpha_0\|_p \|A_x\|_q \|\alpha_\infty - \alpha'_\infty\|_q
 + \|\alpha_0\|_p \|A_x - A'_x\|_q \|\alpha'_\infty\|_q
 + \|\alpha_0 - \alpha'_0\|_p \|A'_x\|_q \|\alpha'_\infty\|_q \\
&\le \gamma^{|x|+1} \|\alpha_\infty - \alpha'_\infty\|_q + \gamma^2 \|A_x - A'_x\|_q + \gamma^{|x|+1} \|\alpha_0 - \alpha'_0\|_p.
\end{aligned}$$

The desired result now follows from Lemma 5.4.1.
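As a numerical sanity check of Lemma 5.4.2, the following hedged Python/NumPy snippet builds a small random WFA, perturbs it, and verifies the bound for p = 1 and q = ∞, taking γ to bound the relevant norms of both automata; all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, x = 3, 1e-3, "abba"
a0, ainf = rng.random(n), rng.random(n)
ops = {s: 0.3 * rng.random((n, n)) for s in "ab"}
a0p = a0 + d * rng.standard_normal(n)                 # perturbed copy A'
ainfp = ainf + d * rng.standard_normal(n)
opsp = {s: M + d * rng.standard_normal((n, n)) for s, M in ops.items()}

def f(a, M, b):
    v = a
    for s in x:
        v = v @ M[s]
    return v @ b

norms = [np.linalg.norm(a0, 1), np.linalg.norm(a0p, 1),
         np.linalg.norm(ainf, np.inf), np.linalg.norm(ainfp, np.inf)]
norms += [np.linalg.norm(M, np.inf) for M in list(ops.values()) + list(opsp.values())]
gamma = max(norms)
rho0 = np.linalg.norm(a0 - a0p, 1)
rhoinf = np.linalg.norm(ainf - ainfp, np.inf)
rho = {s: np.linalg.norm(ops[s] - opsp[s], np.inf) for s in ops}

lhs = abs(f(a0, ops, ainf) - f(a0p, opsp, ainfp))
rhs = gamma ** (len(x) + 1) * (rho0 + rhoinf + sum(rho[s] for s in x))
assert lhs <= rhs   # the pointwise bound of Lemma 5.4.2
```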

5.4.2 Total Variation Between WFA

If one needs to control the error between A and A′ over a whole "slice" of Σ*, the previous bound turns out to be quite loose in many cases. Indeed, using Lemma 5.4.2 we easily get

$$\sum_{x \in \Sigma^t} |f_A(x) - f_{A'}(x)| \le \gamma^{t+1} \cdot \left( |\Sigma|^t \rho_0 + |\Sigma|^t \rho_\infty + t|\Sigma|^{t-1} \sum_{\sigma \in \Sigma} \rho_\sigma \right).$$

In this case, there are only two ways to avoid an exponential blowup with t in the error: either have the error terms ρ0, ρ∞, ρσ exponentially small in t, or have a bound γ = O(1/|Σ|). Both of these are quite restrictive. Thus, in this section we seek a finer control on the accumulation of the error when summing over Σ^t. This may not be possible in all cases; here we will concentrate only on the case p = 1 and q = ∞.

Besides the notation used in the previous section, here we also define γ∞ = ‖α∞‖∞, ρΣ = Σ_{σ∈Σ} ρσ, and, for any k ≥ 0, γk = Σ_{x∈Σ^k} ‖α0^⊤Ax‖1. Furthermore, if |M| denotes the matrix obtained by taking the absolute value of all the entries of M, we assume there exists a γΣ > 0 such that Σ_{σ∈Σ} |Aσ|1 ≤ γΣ1, where we mean that the inequality must be satisfied by each entry of the vectors. The following technical lemma will be the key to proving the main result of this section.

Lemma 5.4.3. The following inequality holds for any t ≥ 0:

$$\sum_{x \in \Sigma^t} \|\alpha_0^\top A_x - \alpha_0'^\top A'_x\|_1 \;\le\; \rho_0 (\gamma_\Sigma + \rho_\Sigma)^t + \rho_\Sigma \sum_{i=0}^{t-1} (\gamma_\Sigma + \rho_\Sigma)^i \, \gamma_{t-1-i}.$$

Proof. In the first place, note that for any t ≥ 0 the following holds:

$$\begin{aligned}
\sum_{y \in \Sigma^t} \sum_{\sigma \in \Sigma} \|(\alpha_0^\top A_y - \alpha_0'^\top A'_y) A_\sigma\|_1
&\le \sum_{y \in \Sigma^t} \sum_{\sigma \in \Sigma} |\alpha_0^\top A_y - \alpha_0'^\top A'_y|\,|A_\sigma|\,\mathbf{1} \\
&\le \gamma_\Sigma \sum_{y \in \Sigma^t} |\alpha_0^\top A_y - \alpha_0'^\top A'_y|\,\mathbf{1} \\
&= \gamma_\Sigma \sum_{y \in \Sigma^t} \|\alpha_0^\top A_y - \alpha_0'^\top A'_y\|_1. \qquad (5.1)
\end{aligned}$$

Now we proceed by induction on t. Let us define φt = Σ_{x∈Σ^t} ‖α0^⊤Ax − α0′^⊤Ax′‖1. For t = 0 we have φ0 = ρ0 = ‖α0 − α0′‖1, which satisfies the desired inequality. Now suppose that

$$\varphi_t \le (\rho_\Sigma + \gamma_\Sigma)^t \rho_0 + \rho_\Sigma \sum_{i=0}^{t-1} (\rho_\Sigma + \gamma_\Sigma)^i \, \gamma_{t-1-i}.$$

Recall the duality between the induced norms ‖·‖1 and ‖·‖∞, which implies that for any vector v and matrix M one has ‖v^⊤M‖1 = ‖M^⊤v‖1 ≤ ‖M^⊤‖1‖v‖1 = ‖M‖∞‖v‖1. Then, using this property, the triangle inequality, and submultiplicativity of the norm, we have:

$$\begin{aligned}
\varphi_{t+1} &= \sum_{x \in \Sigma^{t+1}} \|\alpha_0^\top A_x - \alpha_0'^\top A'_x\|_1 \\
&\le \sum_{y \in \Sigma^t} \sum_{\sigma \in \Sigma} \|\alpha_0^\top A_y (A_\sigma - A'_\sigma)\|_1
 + \sum_{y \in \Sigma^t} \sum_{\sigma \in \Sigma} \|(\alpha_0^\top A_y - \alpha_0'^\top A'_y) A_\sigma\|_1
 + \sum_{y \in \Sigma^t} \sum_{\sigma \in \Sigma} \|(\alpha_0^\top A_y - \alpha_0'^\top A'_y)(A_\sigma - A'_\sigma)\|_1 \\
&\le \sum_{y \in \Sigma^t} \|\alpha_0^\top A_y\|_1 \sum_{\sigma \in \Sigma} \|A_\sigma - A'_\sigma\|_\infty
 + \gamma_\Sigma \sum_{y \in \Sigma^t} \|\alpha_0^\top A_y - \alpha_0'^\top A'_y\|_1
 + \sum_{y \in \Sigma^t} \|\alpha_0^\top A_y - \alpha_0'^\top A'_y\|_1 \sum_{\sigma \in \Sigma} \|A_\sigma - A'_\sigma\|_\infty \\
&= \rho_\Sigma \gamma_t + \varphi_t (\rho_\Sigma + \gamma_\Sigma) \\
&\le (\rho_\Sigma + \gamma_\Sigma)^{t+1} \rho_0 + \rho_\Sigma \sum_{i=0}^{t} (\rho_\Sigma + \gamma_\Sigma)^i \, \gamma_{t-i}, \qquad (5.2)
\end{aligned}$$

where the last bound follows from the induction hypothesis.

The following result gives the bound we will use when bounding the errors between two WFA A and A′ over the set Σ^t.

Lemma 5.4.4. For any t ≥ 0, the sum Σ_{x∈Σ^t} |fA(x) − fA′(x)| is at most

$$(\gamma_\infty + \rho_\infty)\left( (\gamma_\Sigma + \rho_\Sigma)^t \rho_0 + \rho_\Sigma \sum_{i=0}^{t-1} (\gamma_\Sigma + \rho_\Sigma)^i \, \gamma_{t-i-1} \right) + \gamma_t \rho_\infty.$$

Proof. Using the triangle and Hölder's inequalities we obtain the following:

$$\begin{aligned}
\sum_{x \in \Sigma^t} |f_A(x) - f_{A'}(x)| &= \sum_{x \in \Sigma^t} |\alpha_0^\top A_x \alpha_\infty - \alpha_0'^\top A'_x \alpha'_\infty| \\
&\le \sum_{x \in \Sigma^t} |\alpha_0^\top A_x (\alpha_\infty - \alpha'_\infty)|
 + \sum_{x \in \Sigma^t} |(\alpha_0^\top A_x - \alpha_0'^\top A'_x) \alpha_\infty|
 + \sum_{x \in \Sigma^t} |(\alpha_0^\top A_x - \alpha_0'^\top A'_x)(\alpha_\infty - \alpha'_\infty)| \\
&\le \sum_{x \in \Sigma^t} \|\alpha_0^\top A_x\|_1 \|\alpha_\infty - \alpha'_\infty\|_\infty
 + \sum_{x \in \Sigma^t} \|\alpha_0^\top A_x - \alpha_0'^\top A'_x\|_1 \|\alpha_\infty\|_\infty
 + \sum_{x \in \Sigma^t} \|\alpha_0^\top A_x - \alpha_0'^\top A'_x\|_1 \|\alpha_\infty - \alpha'_\infty\|_\infty \\
&= \gamma_t \rho_\infty + (\gamma_\infty + \rho_\infty) \sum_{x \in \Sigma^t} \|\alpha_0^\top A_x - \alpha_0'^\top A'_x\|_1. \qquad (5.3)
\end{aligned}$$

Applying Lemma 5.4.3 we get the desired inequality.

To see that this is tighter than what we obtained from Lemma 5.4.2, suppose we have γ∞ = γΣ = γk = 1. Then the bound from the previous lemma yields

$$\sum_{x \in \Sigma^t} |f_A(x) - f_{A'}(x)| \le (1 + \rho_0)(1 + \rho_\Sigma)^t (1 + \rho_\infty) - 1.$$

In this case we can avoid an exponential blowup in t by just having ρΣ = O(1/t), which is a clear improvement.

5.4.3 Difference in Operators

From the previous sections we know that we can compare the values assigned to strings by two WFA in terms of the differences between their individual operators. Since our main tool for obtaining a WFA from a Hankel matrix will be the spectral method, in this section we study the difference between two WFA obtained from two different sets of Hankel matrices via the equations from Section 5.3. We begin by introducing some notation.

Let B = (P, S) be a fixed basis with p = |P| and s = |S|. Suppose that Hλ, Hσ, Hλ′, Hσ′ ∈ R^{P×S} are arbitrary Hankel matrices over B. Also, let us take arbitrary vectors h0, h0′ ∈ R^S and h∞, h∞′ ∈ R^P. Furthermore, assume that we are given two matrices V, V′ ∈ R^{s×n} such that V^⊤V = V′^⊤V′ = I. Here n is some fixed integer such that n ≤ min{p, s}. In this section we will use ‖·‖ to denote the Euclidean norm for vectors and the operator norm for matrices, which is the corresponding induced norm. Let us define the following quantities in terms of the above vectors and matrices:

$$\varepsilon_\lambda = \|H_\lambda - H'_\lambda\|, \quad \varepsilon_\sigma = \|H_\sigma - H'_\sigma\|, \quad \varepsilon_V = \|V - V'\|, \quad \varepsilon_0 = \|h_0 - h'_0\|, \quad \varepsilon_\infty = \|h_\infty - h'_\infty\|.$$

We also use the notation sn(M) to denote the nth largest singular value of a matrix M. With these definitions we can now state the main result of this section, which is a purely algebraic result about the following differences between vectors and matrices:

$$\begin{aligned}
\Delta_\sigma &= \|(H_\lambda V)^+ H_\sigma V - (H'_\lambda V')^+ H'_\sigma V'\|, \\
\Delta_0 &= \|V^\top h_0 - V'^\top h'_0\|, \\
\Delta_\infty &= \|(H_\lambda V)^+ h_\infty - (H'_\lambda V')^+ h'_\infty\|.
\end{aligned}$$

Lemma 5.4.5. With the notation given above, the following three bounds hold:

$$\begin{aligned}
\Delta_\sigma &\le \frac{\varepsilon_\sigma + \varepsilon_V \|H'_\sigma\|}{s_n(H_\lambda V)} + \frac{1 + \sqrt{5}}{2} \cdot \frac{\|H'_\sigma\|(\varepsilon_\lambda + \varepsilon_V \|H'_\lambda\|)}{\min\{s_n(H_\lambda V)^2, s_n(H'_\lambda V')^2\}}, \\
\Delta_0 &\le \varepsilon_0 + \varepsilon_V \|h_0\|, \\
\Delta_\infty &\le \frac{\varepsilon_\infty}{s_n(H_\lambda V)} + \frac{1 + \sqrt{5}}{2} \cdot \frac{\|h'_\infty\|(\varepsilon_\lambda + \varepsilon_V \|H'_\lambda\|)}{\min\{s_n(H_\lambda V)^2, s_n(H'_\lambda V')^2\}}.
\end{aligned}$$

Proof. Using the triangle inequality, the submultiplicativity of the operator norm, and the properties of the pseudo-inverse, we can write

$$\begin{aligned}
\|(H_\lambda V)^+ H_\sigma V - (H'_\lambda V')^+ H'_\sigma V'\|
&= \|(H_\lambda V)^+ (H_\sigma V - H'_\sigma V') + ((H_\lambda V)^+ - (H'_\lambda V')^+) H'_\sigma V'\| \\
&\le \|(H_\lambda V)^+\| \|H_\sigma V - H'_\sigma V'\| + \|(H_\lambda V)^+ - (H'_\lambda V')^+\| \|H'_\sigma V'\| \\
&\le \frac{\|H_\sigma V - H'_\sigma V'\|}{s_n(H_\lambda V)} + \|H'_\sigma\| \|(H_\lambda V)^+ - (H'_\lambda V')^+\|,
\end{aligned}$$

where we used that ‖(HλV)^+‖ = 1/sn(HλV) by the properties of the pseudo-inverse and operator norm, and ‖Hσ′V′‖ ≤ ‖Hσ′‖ by submultiplicativity and ‖V′‖ = 1. Now note that we also have

$$\|H_\sigma V - H'_\sigma V'\| \le \|H_\sigma - H'_\sigma\|\|V\| + \|H'_\sigma\|\|V - V'\| \le \varepsilon_\sigma + \varepsilon_V \|H'_\sigma\|.$$

Furthermore, using Lemma A.2.3 we obtain

$$\begin{aligned}
\|(H_\lambda V)^+ - (H'_\lambda V')^+\|
&\le \frac{1 + \sqrt{5}}{2} \, \|H_\lambda V - H'_\lambda V'\| \max\{\|(H_\lambda V)^+\|^2, \|(H'_\lambda V')^+\|^2\} \\
&\le \frac{1 + \sqrt{5}}{2} \cdot \frac{\|H_\lambda - H'_\lambda\|\|V\| + \|H'_\lambda\|\|V - V'\|}{\min\{s_n(H_\lambda V)^2, s_n(H'_\lambda V')^2\}} \\
&= \frac{1 + \sqrt{5}}{2} \cdot \frac{\varepsilon_\lambda + \varepsilon_V \|H'_\lambda\|}{\min\{s_n(H_\lambda V)^2, s_n(H'_\lambda V')^2\}}.
\end{aligned}$$

Thus we get the first of the bounds. The second bound follows straightforwardly from

$$\|V^\top h_0 - V'^\top h'_0\| \le \|V - V'\|\|h_0\| + \|V'\|\|h_0 - h'_0\| = \varepsilon_0 + \varepsilon_V \|h_0\|,$$

which uses that ‖M^⊤‖ = ‖M‖ because the operator norm is self-dual. Finally, the last bound follows from the following inequalities, where we use Lemma A.2.3 again:

$$\begin{aligned}
\|(H_\lambda V)^+ h_\infty - (H'_\lambda V')^+ h'_\infty\|
&\le \|(H_\lambda V)^+\| \|h_\infty - h'_\infty\| + \|h'_\infty\| \|(H_\lambda V)^+ - (H'_\lambda V')^+\| \\
&\le \frac{\|h_\infty - h'_\infty\|}{s_n(H_\lambda V)} + \frac{1 + \sqrt{5}}{2} \cdot \frac{\|h'_\infty\| \|H_\lambda V - H'_\lambda V'\|}{\min\{s_n(H_\lambda V)^2, s_n(H'_\lambda V')^2\}} \\
&\le \frac{\varepsilon_\infty}{s_n(H_\lambda V)} + \frac{1 + \sqrt{5}}{2} \cdot \frac{\|h'_\infty\|(\varepsilon_\lambda + \varepsilon_V \|H'_\lambda\|)}{\min\{s_n(H_\lambda V)^2, s_n(H'_\lambda V')^2\}}.
\end{aligned}$$

In the following chapters we will see several use cases for the bounds from Lemma 5.4.5. In some cases we will have V = V′ and εV = 0, which greatly simplifies the bounds. In some other cases V will contain singular vectors of Hλ, in which case sn(HλV) = sn(Hλ). However, the common feature of all cases where the Hankel matrices come from some sampling process will be that ελ, εσ, εV, ε0, ε∞ → 0 with growing sample size. Combining this behavior with the bounds from Sections 5.4.1 and 5.4.2, we will get pointwise or total variation convergence to the target WFA.

Chapter 6

Sample Bounds for Learning Stochastic Automata

The spectral method described in the previous chapter can serve as the basis for deriving learning algorithms in any setting where the Hankel matrix of the target function f : Σ* → R can be efficiently estimated. This chapter analyzes one such setting in detail: the case where f is a probability distribution over Σ* and an i.i.d. sample drawn from f is given to the learning algorithm. Using ideas similar to the ones we present here, spectral learning algorithms for several classes of probabilistic FSM have been obtained recently. Examples include Hidden Markov Models (HMM) [HKZ09], rational stochastic languages [BDR09], reduced-rank HMM [SBG10], Predictive State Representations (PSR) [BSG11], and stochastic Quadratic Weighted Automata (QWA) [Bai11b].

This chapter gives a result along these lines for the class of all Probabilistic Non-deterministic Finite Automata (PNFA) with strong PAC-style learning guarantees. Our result extends the powerful techniques of [HKZ09] to a wider class of distributions and provides much stronger bounds than those proved in [BDR09] for a slightly more general class. Furthermore, we show how learning algorithms based on different statistics over strings – namely, complete strings, prefixes, and substrings – are all reducible to each other. Like previous results, our learning algorithm assumes that a complete basis for f is given as input. In practice this is a rather strong assumption which is hard to verify. The last section of this chapter gives a randomized algorithm for finding a complete basis for f under very mild conditions. Though not strictly practical, our approach justifies some of the heuristics used in empirical studies that take the most frequent prefixes and suffixes to build a basis [LQBC12; BQC12].

6.1 Stochastic and Probabilistic Automata

We say that a WFA A is stochastic if the function f = fA is a probability distribution over Σ*; that is, if f(x) ≥ 0 for all x ∈ Σ* and Σ_{x∈Σ*} f(x) = 1. To make it clear that f represents a probability distribution we may sometimes write it as f(x) = P[x]. An interesting fact about distributions over Σ* is that, given an i.i.d. sample generated from the distribution, one can compute an estimation Ĥf of its Hankel matrix, or of any finite sub-block ĤB. When the sample is large enough, these estimates will converge to the true Hankel matrices and the spectral method will yield a WFA Â computing a function f̂ = f_Â ≈ f.

When f realizes a distribution over Σ*, one can think of computing other probabilistic quantities besides probabilities of strings P[x]. For example, one can define the function fp that computes probabilities of prefixes; that is, fp(x) = P[xΣ*]. Another probabilistic function that can be computed given a distribution over Σ* is the expected number of times a particular string appears as a substring of random strings. We use fs to denote this function, which in probabilistic notation can be written as fs(x) = E[|w|x], where the expectation is with respect to w sampled from f: E[|w|x] = Σ_{w∈Σ*} |w|x P[w].

In general the class of stochastic WFA may include some pathological examples with states that are not connected to any terminating state. In order to avoid such cases we introduce the following technical condition. Given a stochastic WFA A = ⟨α0, α∞, {Aσ}⟩, let A = Σ_{σ∈Σ} Aσ. We say that A is irredundant if ‖A‖ < 1 for some submultiplicative matrix norm ‖·‖. Note that a necessary condition for this to happen is that the spectral radius of A is less than one: ρ(A) < 1. In particular, irredundancy implies that the sum Σ_{k≥0} A^k converges to (I − A)^{-1}. An interesting property of irredundant stochastic WFA is that both fp and fs can also be computed by WFA, as shown by the following result.

Lemma 6.1.1. Let ⟨α0, α∞, {Aσ}⟩ be an irredundant stochastic WFA and write A = Σ_{σ∈Σ} Aσ, α̃0^⊤ = α0^⊤(I − A)^{-1}, and α̃∞ = (I − A)^{-1}α∞. Suppose f : Σ* → R is a probability distribution such that f(x) = P[x], and define the functions fp(x) = P[xΣ*] and fs(x) = E[|w|x]. Then the following are equivalent:

1. A = ⟨α0, α∞, {Aσ}⟩ realizes f,

2. Ap = ⟨α0, α̃∞, {Aσ}⟩ realizes fp,

3. As = ⟨α̃0, α̃∞, {Aσ}⟩ realizes fs.

Proof. In the first place we note that, because A is irredundant, we have

$$\tilde{\alpha}_0^\top = \alpha_0^\top \sum_{k \ge 0} A^k = \sum_{x \in \Sigma^*} \alpha_0^\top A_x,$$

where the second equality follows from a term reordering. Similarly, we have α̃∞ = Σ_{x∈Σ*} Ax α∞. The rest of the proof follows from checking several implications.

(1 ⇒ 2) Using f(x) = α0^⊤Axα∞ and the definition of α̃∞ we have:

$$P[x\Sigma^*] = \sum_{y \in \Sigma^*} P[xy] = \sum_{y \in \Sigma^*} \alpha_0^\top A_x A_y \alpha_\infty = \alpha_0^\top A_x \tilde{\alpha}_\infty.$$

(2 ⇒ 1) It follows from P[xΣ⁺] = Σ_{σ∈Σ} P[xσΣ*] that

$$P[x] = P[x\Sigma^*] - P[x\Sigma^+] = \alpha_0^\top A_x \tilde{\alpha}_\infty - \alpha_0^\top A_x A \tilde{\alpha}_\infty = \alpha_0^\top A_x (I - A) \tilde{\alpha}_\infty.$$

(1 ⇒ 3) Since we can write Σ_{w∈Σ*} P[w]|w|x = P[Σ*xΣ*], it follows that

$$E[|w|_x] = \sum_{w \in \Sigma^*} P[w] |w|_x = \sum_{u,v \in \Sigma^*} P[uxv] = \sum_{u,v \in \Sigma^*} \alpha_0^\top A_u A_x A_v \alpha_\infty = \tilde{\alpha}_0^\top A_x \tilde{\alpha}_\infty.$$

(3 ⇒ 1) Using similar arguments as before we observe that

$$\begin{aligned}
P[x] &= P[\Sigma^* x \Sigma^*] + P[\Sigma^+ x \Sigma^+] - P[\Sigma^+ x \Sigma^*] - P[\Sigma^* x \Sigma^+] \\
&= \tilde{\alpha}_0^\top A_x \tilde{\alpha}_\infty + \tilde{\alpha}_0^\top A A_x A \tilde{\alpha}_\infty - \tilde{\alpha}_0^\top A A_x \tilde{\alpha}_\infty - \tilde{\alpha}_0^\top A_x A \tilde{\alpha}_\infty \\
&= \tilde{\alpha}_0^\top (I - A) A_x (I - A) \tilde{\alpha}_\infty.
\end{aligned}$$

A direct consequence of this constructive result is that given a WFA realizing a probability distribution P[x] we can easily compute WFA realizing the functions fp and fs; and the converse holds as well. Lemma 6.1.1 also implies the following result, which characterizes the rank of fp and fs.

Corollary 6.1.2. Suppose f : Σ* → R is stochastic and admits a minimal irredundant WFA. Then rank(f) = rank(fp) = rank(fs).

Proof. Since all the constructions of Lemma 6.1.1 preserve the number of states, the result follows from considering minimal WFA for f, fp, and fs.
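The constructions of Lemma 6.1.1 are directly implementable. A minimal Python/NumPy sketch, assuming the input WFA is irredundant so that (I − A)^{-1} exists; the function name is illustrative:

```python
import numpy as np

def prefix_substring_wfa(a0, ainf, ops):
    """From a WFA realizing f(x) = P[x], build WFA realizing
    f_p(x) = P[x Sigma^*] and f_s(x) = E[|w|_x]  (Lemma 6.1.1)."""
    A = sum(ops.values())
    inv = np.linalg.inv(np.eye(A.shape[0]) - A)  # (I - A)^{-1}, exists by irredundancy
    a0t = a0 @ inv                               # alpha0-tilde
    ainft = inv @ ainf                           # alpha_inf-tilde
    A_p = (a0, ainft, ops)                       # realizes f_p
    A_s = (a0t, ainft, ops)                      # realizes f_s
    return A_p, A_s
```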

From the point of view of learning, Lemma 6.1.1 provides us with tools for proving two-sided reductions between the problems of learning f, fp, and fs. Thus, in the following sections we will give a PAC learning algorithm for stochastic WFA which, modulo translations in accuracy parameters, will yield learning algorithms for probabilities of prefixes and expected counts of substrings. We will not state the resulting theorems, but note that this turns out to be a simple exercise in view of the following lemma and the tools from Section 5.4.

Lemma 6.1.3. Let A = ⟨α0, α∞, {Aσ}⟩ be an irredundant stochastic WFA and A′ = ⟨α0′, α∞′, {Aσ′}⟩ an arbitrary WFA, both with n states. Write ρ0 = ‖α0 − α0′‖, ρ∞ = ‖α∞ − α∞′‖, and ρΣ = ‖A − A′‖. Suppose that ρΣ ≤ sn(I − A) and define α̃0 = α0(I − A)^{-1}, α̃∞ = (I − A)^{-1}α∞, α̃0′ = α0′(I − A′)^{-1}, and α̃∞′ = (I − A′)^{-1}α∞′. Then the two following inequalities hold:

$$\|\tilde{\alpha}_0 - \tilde{\alpha}'_0\| \le \frac{\rho_\Sigma \|\alpha_0\|}{s_n(I - A)(s_n(I - A) - \rho_\Sigma)} + \frac{\rho_0}{s_n(I - A)},$$

$$\|\tilde{\alpha}_\infty - \tilde{\alpha}'_\infty\| \le \frac{\rho_\Sigma \|\alpha_\infty\|}{s_n(I - A)(s_n(I - A) - \rho_\Sigma)} + \frac{\rho_\infty}{s_n(I - A)}.$$

Proof. Both bounds follow from the triangle inequality and Lemma A.2.2.

The family of stochastic WFA contains a particular sub-family of intrinsic interest. We say that a WFA A = ⟨α0, α∞, {Aσ}⟩ with n states is probabilistic if the following conditions are satisfied:

1. all weights are non-negative: for all σ ∈ Σ and all i, j ∈ [n], Aσ(i, j) ≥ 0, α0(i) ≥ 0, and α∞(i) ≥ 0;

2. initial weights add up to one: Σ_{i∈[n]} α0(i) = 1;

3. transition and final weights from each state add up to one: for all i ∈ [n], α∞(i) + Σ_{σ∈Σ} Σ_{j∈[n]} Aσ(i, j) = 1.

This model is also called in the literature a probabilistic finite automaton (PFA) or a probabilistic non-deterministic finite automaton (PNFA). It is obvious that probabilistic WFA are also stochastic, since fA(x) is the probability of generating x using the given automaton.

It turns out that when a probabilistic WFA A = ⟨α0, α∞, {Aσ}⟩ is considered, the factorization induced on H has a nice probabilistic interpretation. Analyzing the spectral algorithm from this perspective yields additional insights which are useful to keep in mind. Let Hf = PS be the factorization induced by a probabilistic WFA with n states on the Hankel matrix of fA(x) = f(x) = P[x]. Then, for any prefix u ∈ Σ*, the uth row of P is given by the following n-dimensional vector:

$$p_u(i) = P[u\Sigma^*, \, s_{|u|} = i], \quad i \in [n],$$

where st denotes the state of the automaton after t steps. That is, pu(i) is the probability that the probabilistic transition system given by A generates the prefix u and ends up in state i. The coordinates of these vectors are usually called forward probabilities. Similarly, the column of S given by suffix v ∈ Σ* is the n-dimensional vector given by:

$$s_v(i) = P[v \mid s = i], \quad i \in [n].$$

This is the probability of generating the suffix v and terminating when A is started from state i. These are usually called backward probabilities. The same interpretation applies to the factorization induced on a sub-block HB = PBSB.

Therefore, assuming there exists a minimal WFA for f(x) = P[x] which is probabilistic, Lemma 5.2.1 says that a WFA for f can be learned from information about the forward and backward probabilities over a small set of prefixes and suffixes. Teaming this basic observation with the spectral method and invariance under change of basis, one can show an interesting fact: forward and backward (empirical) probabilities for a probabilistic WFA can be recovered (modulo a change of basis) by computing an SVD on (empirical) string probabilities. In other words, though state probabilities are non-observable, they can be recovered

(modulo a linear transformation) from observable quantities. In the following sections we will prove bounds for learning probability distributions generated by (possibly non-minimal) probabilistic automata. This turns out to require a bit more work than the previous results in [HKZ09] for learning a sub-family of HMM.

6.2 Sample Bounds for Hankel Matrix Estimation

Let f(x) = P[x] be a probability distribution over Σ*. Let B = (P, S) be a basis and denote by H the sub-block of Hf over this basis; the entries in H correspond to probabilities of certain events. Suppose S = (x1, ..., xm) is a sample of m i.i.d. strings sampled from the distribution computed by f. Let Ĥ be the empirical estimation of H computed from S. Our goal is to give a bound for the error ‖H − Ĥ‖F that holds with high probability over the sampling of S.

An important observation about Ĥ is that it can be written as a sum of independent random Hankel matrices. First, observe that for any u ∈ P and v ∈ S the entry Ĥ(u, v) corresponds to the empirical frequency in S of the string uv: Ĥ(u, v) = (1/m) Σ_{i=1}^m 1_{xi=uv}. Thus, letting Hxi ∈ R^{P×S} be a Hankel matrix with entries Hxi(u, v) = 1_{xi=uv}, we get Ĥ = (1/m) Σ_{i=1}^m Hxi. Now, since for any fixed (u, v) we have E[1_{x=uv}] = P[uv], it is clear that E[Hx] = H and E[Ĥ] = H. Furthermore, because Ĥ(u, v) is the average of m i.i.d. Bernoulli random variables, we have for any (u, v) ∈ P × S:

$$V[\hat{H}(u, v)] = E[(H(u, v) - \hat{H}(u, v))^2] = \frac{1}{m} H(u, v)(1 - H(u, v)). \qquad (6.1)$$
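Computing Ĥ from a sample is a simple counting exercise. A hedged sketch (strings assumed to be Python str; the function name is illustrative):

```python
from collections import Counter
import numpy as np

def empirical_hankel(sample, prefixes, suffixes):
    """Estimate H(u, v) = P[uv] by empirical frequencies over an i.i.d. sample;
    this equals (1/m) * sum_i H_{x_i} in the notation of the text."""
    m, counts = len(sample), Counter(sample)
    return np.array([[counts[u + v] / m for v in suffixes] for u in prefixes])
```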

The next result proves the desired bound on ‖H − Ĥ‖F using McDiarmid's inequality. The following quantity, called the redundancy coefficient cB of the basis B, will appear in the bound: cB = max_{x∈Σ*} ‖Hx‖F². Note that cB is precisely the maximum number of pairs (u, v) that can be simultaneously "hit" by any x ∈ Σ*.

Theorem 6.2.1. With probability at least 1 − δ, the following holds:

$$\|H - \hat{H}\|_F \le \sqrt{\frac{c_B}{m}} \left(1 + \sqrt{\ln(1/\delta)}\right).$$

Proof. We apply McDiarmid's inequality (A.1) to the random variable ‖H − Ĥ‖F. The first step is to bound the expectation of this random variable as follows:

$$\begin{aligned}
E\left[\|H - \hat{H}\|_F\right] &= E\left[\sqrt{\sum_{(u,v) \in P \times S} (H(u,v) - \hat{H}(u,v))^2}\right] \\
&\le \sqrt{\sum_{(u,v) \in P \times S} E[(H(u,v) - \hat{H}(u,v))^2]} \\
&= \frac{1}{\sqrt{m}} \sqrt{\sum_{(u,v) \in P \times S} H(u,v)(1 - H(u,v))} \\
&\le \frac{1}{\sqrt{m}} \sqrt{c_B - \|H\|_F^2} \\
&\le \sqrt{\frac{c_B}{m}},
\end{aligned}$$

where we used Jensen's inequality, Equation (6.1), and the bound

$$\sum_{(u,v) \in P \times S} H(u,v) \le \sum_{x \in \Sigma^*} f(x)\, \mathbf{1}^\top H_x \mathbf{1} = \sum_{x \in \Sigma^*} f(x) \|H_x\|_F^2 \le c_B \sum_{x \in \Sigma^*} f(x) \le c_B.$$

Now we bound the stability coefficients. Let S′ be a sample that differs from S in exactly one string, and denote by Ĥ′ the estimation of H computed from S′. Since all examples in S are identically distributed, we can assume the difference is in the last string: x′m ≠ xm. Thus, we have

$$\left| \|H - \hat{H}\|_F - \|H - \hat{H}'\|_F \right| \le \|\hat{H} - \hat{H}'\|_F = \frac{1}{m} \|H_{x^m} - H_{x'^m}\|_F.$$

Algorithm 5: Algorithm for Learning Stochastic WFA
Input: r, Σ, B = (P, S), S = (x1, ..., xm)
Output: a WFA Â = ⟨α̂0, α̂∞, {Âσ}⟩
Using S, compute the estimations Ĥσ for all σ ∈ Σ′, ĥλ,S, and ĥP,λ;
Compute the r top right singular vectors V̂ of Ĥλ;
Let α̂0^⊤ ← ĥλ,S^⊤ V̂ and α̂∞ ← (Ĥλ V̂)^+ ĥP,λ;
foreach σ ∈ Σ do let Âσ ← (Ĥλ V̂)^+ Ĥσ V̂;

Now we observe that from the definition of cB we get

$$\|H_{x^m} - H_{x'^m}\|_F \le \sqrt{\sum_{(u,v) \in P \times S} \left(H_{x^m}(u,v)^2 + H_{x'^m}(u,v)^2\right)} = \sqrt{\|H_{x^m}\|_F^2 + \|H_{x'^m}\|_F^2} \le \sqrt{2} \max_{x \in \Sigma^*} \|H_x\|_F = \sqrt{2 c_B}.$$

Applying McDiarmid’s inequality we now get

$$P\left[\|H - \hat{H}\|_F \ge E\left[\|H - \hat{H}\|_F\right] + t\right] \le \exp\left(-\frac{mt^2}{c_B}\right),$$

which implies that with probability at least 1 − δ one has

$$\|H - \hat{H}\|_F \le E\left[\|H - \hat{H}\|_F\right] + \sqrt{\frac{c_B}{m} \ln\frac{1}{\delta}} \le \sqrt{\frac{c_B}{m}} + \sqrt{\frac{c_B}{m} \ln\frac{1}{\delta}}.$$

6.3 PAC Learning PNFA

In this section we present and analyze a PAC learning algorithm for stochastic WFA. We will not analyze the algorithm in full generality, but only for those probability distributions over Σ* that can be parametrized by a PNFA. A similar analysis could be performed in the general case using techniques similar to those from [Bai11b]. However, in that case the sample bound one obtains is of type exp(log(|Σ|) log(1/ε)) ∉ poly(1/ε, |Σ|). On the other hand, by refining the techniques from [HKZ09] – in particular Lemma 5.4.4 – we are able to give here PAC-style bounds for all PNFA which are polynomial in all of their parameters.

We now start by introducing some notation and describing the algorithm. Let f : Σ* → R compute a probability distribution f(x) = P[x] with rank(f) = r. We also suppose that B = (P, S) is a complete basis for f. Finally, given a sample size m, suppose S = (x1, ..., xm) is a tuple of i.i.d. strings sampled from the distribution computed by f. In this setup, given r, Σ, B, and S, the pseudo-code in Algorithm 5 implements the spectral method described in Section 5.3 and returns a WFA Â such that f̂ = f_Â approximates f. Noting that the approximate Hankel matrices can be computed from S in linear time O(m), that the cost of computing the truncated SVD and the pseudo-inverse is O(|P| |S| r), and that the cost of computing the operators is O(|Σ| |P| r²), we get a total running time for this spectral algorithm of O(m + r|P||S| + r²|P||Σ|).

We will now proceed to make this statement formal by proving a PAC learning bound whose sample size will depend polynomially on some parameters of the target f.
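For concreteness, the following is a hedged end-to-end sketch of Algorithm 5 in Python/NumPy, combining empirical Hankel estimation with the spectral equations of Section 5.3; it assumes strings are given as Python str, and all names are illustrative.

```python
from collections import Counter
import numpy as np

def learn_stochastic_wfa(sample, prefixes, suffixes, alphabet, r):
    """Sketch of Algorithm 5: spectral learning of a stochastic WFA."""
    m, counts = len(sample), Counter(sample)
    freq = lambda w: counts[w] / m

    def block(mid):  # Hankel block with entries f-hat(u . mid . v)
        return np.array([[freq(u + mid + v) for v in suffixes] for u in prefixes])

    H_lam = block("")
    H_sig = {s: block(s) for s in alphabet}
    h_lS = np.array([freq(v) for v in suffixes])
    h_Pl = np.array([freq(u) for u in prefixes])

    _, _, Vt = np.linalg.svd(H_lam, full_matrices=False)
    V = Vt[:r].T                       # top r right singular vectors of H-hat_lambda
    F = np.linalg.pinv(H_lam @ V)
    a0 = h_lS @ V
    ainf = F @ h_Pl
    ops = {s: F @ Hs @ V for s, Hs in H_sig.items()}
    return a0, ainf, ops
```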

Let V denote the top r right singular vectors of Hλ. For further reference we shall define the WFA A = ⟨α0, α∞, {Aσ}⟩ as

$$\alpha_0^\top = h_{\lambda,S}^\top V, \qquad \alpha_\infty = (H_\lambda V)^+ h_{P,\lambda}, \qquad A_\sigma = (H_\lambda V)^+ H_\sigma V.$$

Note that A depends only on f and the basis B used to define Hλ. To begin with, we prove that if Ĥλ is close enough to the true Hankel matrix Hλ, then the approximate singular vectors V̂ satisfy an interesting property: the WFA Ã = ⟨α̃0, α̃∞, {Ãσ}⟩ given by

$$\tilde{\alpha}_0^\top = h_{\lambda,S}^\top \hat{V}, \qquad \tilde{\alpha}_\infty = (H_\lambda \hat{V})^+ h_{P,\lambda}, \qquad \tilde{A}_\sigma = (H_\lambda \hat{V})^+ H_\sigma \hat{V},$$

satisfies fÃ = f. This will prove useful when bounding the error between f and f̂, because it will enable us to compare fÃ with fÂ in our error bounds and thus use Lemma 5.4.5.

Lemma 6.3.1. Suppose that ‖Hλ − Ĥλ‖2 ≤ (ϵ/(1 + ϵ)) sr(Hλ) for some ϵ < 1. Then the following inequality holds:

$$s_r(H_\lambda \hat{V}) \ge s_r(H_\lambda) \sqrt{1 - \epsilon^2}.$$

Furthermore, this implies fÃ = f.

Proof. Let us write Hλ = UΛV^⊤ for the compact SVD decomposition of Hλ. If we denote by ur the rth left singular vector of the matrix V^⊤V̂, then using Lemma A.2.7 we get

$$\begin{aligned}
s_r(V^\top \hat{V}) &= \|u_r^\top V^\top \hat{V}\|_2 \\
&\ge \|u_r\|_2 \sqrt{1 - \frac{\|H_\lambda - \hat{H}_\lambda\|_2^2}{(s_r(H_\lambda) - \|H_\lambda - \hat{H}_\lambda\|_2)^2}} \\
&= \sqrt{1 - \frac{\|H_\lambda - \hat{H}_\lambda\|_2^2}{(s_r(H_\lambda) - \|H_\lambda - \hat{H}_\lambda\|_2)^2}} \\
&\ge \sqrt{1 - \epsilon^2}.
\end{aligned}$$

Thus, using Lemma A.2.6 we get

$$s_r(H_\lambda \hat{V}) = s_r(U \Lambda V^\top \hat{V}) \ge s_r(U \Lambda)\, s_r(V^\top \hat{V}) \ge s_r(H_\lambda) \sqrt{1 - \epsilon^2}.$$

To prove the last claim we show that Ã = M^{-1}AM with M^{-1} = (HλV̂)^+ HλV and M = V^⊤V̂. The first step is to check that these two matrices are actually inverses of each other:

$$(H_\lambda \hat{V})^+ H_\lambda V V^\top \hat{V} = (H_\lambda \hat{V})^+ H_\lambda \hat{V} = I_r,$$

where the last equality uses that sr(HλV̂) > 0. Now, since Lemma 5.2.1 implies that A induces the factorization Hσ = HλV Aσ V^⊤, by substitution we get Ãσ = M^{-1}AσM. Furthermore, since the basis B is complete and VV^⊤ acts as an identity on the span of the rows of H, we have α0^⊤M = hλ,S^⊤VV^⊤V̂ = hλ,S^⊤V̂ = α̃0^⊤. Finally, using HλV = UΛ we see that

$$M^{-1}\alpha_\infty = (H_\lambda \hat{V})^+ H_\lambda V (H_\lambda V)^+ h_{P,\lambda} = (H_\lambda \hat{V})^+ U U^\top h_{P,\lambda} = (H_\lambda \hat{V})^+ h_{P,\lambda} = \tilde{\alpha}_\infty.$$

Now suppose that B = ⟨β0, β∞, {Bσ}⟩ is a probabilistic WFA with n states such that f = fB; that is, f admits a probabilistic realization. Note that in general this may not be possible (see [DE08] for examples). In addition, we note that even when such a realization exists it is not necessarily minimal: it may happen that n ≥ r. Now, assuming the existence of such a B, we can define some more notation. Let Hf = PB SB be the factorization induced by B. Note that if n > r this is not a rank factorization, but PB ∈ R^{P×n} and SB ∈ R^{n×S} with rank(PB) = rank(SB) = r. Let us write VB ∈ R^{n×r} for the right singular vectors of the forward matrix PB. Since PB VB VB^⊤ = PB, by Lemma 5.2.3 we have f = fB′ where B′ = VB^⊤ B VB. It is immediate to note that B′ is a minimal WFA for f. Thus, under the conditions of Lemma 6.3.1 we can apply Corollary 5.2.2 to find a matrix M̂ ∈ R^{r×r} such that B′ = M̂^{-1}ÃM̂. Using the change of basis M̂ and the projection VB we define the following two WFA with n states:

$$\tilde{B} = V_B \hat{M}^{-1} \tilde{A} \hat{M} V_B^\top, \qquad \hat{B} = V_B \hat{M}^{-1} \hat{A} \hat{M} V_B^\top.$$

Since VB^⊤VB = Ir, one can immediately check that f = fB̃ and f̂ = fB̂. Note that by the definition of Ã, the matrix M̂ depends on V̂. We can make analogous definitions which depend only on f, B, and B by using a matrix M such that B′ = M^{-1}AM. The next result characterizes a useful relation between such an M and our M̂.

Lemma 6.3.2. Suppose that ‖Hλ − Ĥλ‖2 ≤ (ϵ/(1 + ϵ)) sr(Hλ) for some ϵ < 1. Then for any 1 ≤ i ≤ r we have si(M)√(1 − ϵ²) ≤ si(M̂) ≤ si(M).

Proof. Note that by Lemma 6.3.1 we have f = fA = fÃ. Thus, using Corollary 5.2.2 it is easy to see that A = (V̂^⊤V)^{-1}Ã(V̂^⊤V). A simple composition then yields M̂ = V̂^⊤VM. Hence, by applying Lemma A.2.6 and the same bound used in the proof of Lemma 6.3.1 we get

$$s_i(\hat{M}) = s_i(\hat{V}^\top V M) \ge s_i(M)\, s_r(\hat{V}^\top V) \ge s_i(M) \sqrt{1 - \epsilon^2}.$$

The second inequality follows from Lemma A.2.6 as well, since:

$$s_i(\hat{M}) = s_i(\hat{V}^\top V M) \le s_i(M)\, s_1(\hat{V}^\top V) = s_i(M) \|\hat{V}^\top V\|_2 \le s_i(M).$$

Before we state the main lemma of this section, we will define yet another piece of notation which will greatly simplify our calculations. We let B̃ be the basis obtained as the p-closure of (P ∪ {λ}, S ∪ {λ}). If H denotes the sub-block of Hf corresponding to B̃, it is immediate to see that Hλ, the Hσ for all σ ∈ Σ, hP,λ, and hλ,S are all contained as sub-blocks in H. In this case we get the following bounds, which follow from a simple property of the Frobenius norm:

$$\|H_\lambda - \hat{H}_\lambda\|_F, \; \|H_\sigma - \hat{H}_\sigma\|_F, \; \|h_{P,\lambda} - \hat{h}_{P,\lambda}\|_2, \; \|h_{\lambda,S} - \hat{h}_{\lambda,S}\|_2 \;\le\; \|H - \hat{H}\|_F. \qquad (6.3)$$

Lemma 6.3.3. There exists a universal constant C such that for any t ≥ 0 and ε > 0,

$$\|H - \hat{H}\|_F \le C \, \frac{\varepsilon \, s_r(M) \, s_r(H_\lambda)^2}{(t+1)^2 \, |\Sigma| \, s_1(M) \sqrt{n \, c_{\tilde{B}}}}$$

implies Σ_{x∈Σ^t} |f(x) − f̂(x)| ≤ ε/(t + 1).

Proof. The main idea of the proof is to use the representations f = fB̃ and f̂ = fB̂ and apply Lemmas 5.4.4 and 5.4.5. The first step is to see that we can take γΣ = γk = γ∞ = 1 in Lemma 5.4.4.

We begin with γk. Note that we have by construction B̃ = VB VB^⊤ B VB VB^⊤. Since VB VB^⊤ acts as an identity on the rows of PB, we have

$$\tilde{\beta}_0 \tilde{B}_x = \beta_0 B_x V_B V_B^\top = \beta_0 B_x$$

for any x ∈ Σ*. For probabilistic B it is easy to see that ‖β0 Bx‖1 = β0 Bx 1 = P[xΣ*]. Thus, for any k we have

$$\gamma_k = \sum_{x \in \Sigma^k} \|\tilde{\beta}_0 \tilde{B}_x\|_1 = P[|X| \ge k] \le 1.$$

Next we turn our attention to γΣ. Since VB^⊤ VB VB^⊤ = VB^⊤ by definition, and for any v ∈ R^r we have v^⊤VB^⊤Bσ VB VB^⊤ = v^⊤VB^⊤Bσ because v^⊤VB^⊤Bσ is in the span of the rows of PB, we get

$$(\tilde{\beta}_0 \tilde{B}_x - \hat{\beta}_0 \hat{B}_x) \tilde{B}_\sigma = (\tilde{\alpha}_0 \tilde{A}_x \hat{M} V_B^\top - \hat{\alpha}_0 \hat{A}_x \hat{M} V_B^\top) V_B V_B^\top B_\sigma V_B V_B^\top = (\tilde{\alpha}_0 \tilde{A}_x \hat{M} V_B^\top - \hat{\alpha}_0 \hat{A}_x \hat{M} V_B^\top) B_\sigma.$$

Using this in (5.1) from the proof of Lemma 5.4.3, we see that we can take γΣ to be such that Σ_{σ∈Σ} |Bσ|1 ≤ γΣ1, which is satisfied by γΣ = 1 because B is probabilistic. Using similar arguments, we can see that

$$(\tilde{\beta}_0 \tilde{B}_x - \hat{\beta}_0 \hat{B}_x) \tilde{\beta}_\infty = (\tilde{\alpha}_0 \tilde{A}_x \hat{M} V_B^\top - \hat{\alpha}_0 \hat{A}_x \hat{M} V_B^\top) \beta_\infty.$$

Plugging this into (5.3) from the proof of Lemma 5.4.4, we see that we can take γ∞ = ‖β∞‖∞ ≤ 1.

The next step is to bound ρ0, ρ∞, and ρΣ using Lemma 5.4.5. This is done in the following series of inequalities, which follow from properties of vector norms:

$$\rho_0 = \|\tilde{\beta}_0 - \hat{\beta}_0\|_1 = \|(\tilde{\alpha}_0 - \hat{\alpha}_0)\hat{M} V_B^\top\|_1 \le \sqrt{n}\,\|(\tilde{\alpha}_0 - \hat{\alpha}_0)\hat{M} V_B^\top\|_2 \le \sqrt{n}\,\|\hat{M}\|_2 \|\tilde{\alpha}_0 - \hat{\alpha}_0\|_2 \le \sqrt{n}\, s_1(\hat{M})\, \|h_{\lambda,S} - \hat{h}_{\lambda,S}\|_2,$$

$$\begin{aligned}
\rho_\infty = \|\tilde{\beta}_\infty - \hat{\beta}_\infty\|_\infty &= \|V_B \hat{M}^{-1}(\tilde{\alpha}_\infty - \hat{\alpha}_\infty)\|_\infty \le \|V_B \hat{M}^{-1}(\tilde{\alpha}_\infty - \hat{\alpha}_\infty)\|_2 \le \|\hat{M}^{-1}\|_2 \|\tilde{\alpha}_\infty - \hat{\alpha}_\infty\|_2 \\
&\le \frac{1}{s_r(\hat{M})} \left( \frac{\|h_{P,\lambda} - \hat{h}_{P,\lambda}\|_2}{s_r(H_\lambda \hat{V})} + \frac{1 + \sqrt{5}}{2} \cdot \frac{\|\hat{h}_{P,\lambda}\|_2 \|H_\lambda - \hat{H}_\lambda\|_2}{\min\{s_r(H_\lambda \hat{V})^2, s_r(\hat{H}_\lambda \hat{V})^2\}} \right),
\end{aligned}$$

$$\begin{aligned}
\rho_\sigma = \|\tilde{B}_\sigma - \hat{B}_\sigma\|_\infty &= \|V_B \hat{M}^{-1}(\tilde{A}_\sigma - \hat{A}_\sigma)\hat{M} V_B^\top\|_\infty \le \sqrt{n}\,\|V_B \hat{M}^{-1}(\tilde{A}_\sigma - \hat{A}_\sigma)\hat{M} V_B^\top\|_2 \le \sqrt{n}\,\|\hat{M}^{-1}\|_2 \|\hat{M}\|_2 \|\tilde{A}_\sigma - \hat{A}_\sigma\|_2 \\
&\le \frac{\sqrt{n}\, s_1(\hat{M})}{s_r(\hat{M})} \left( \frac{\|H_\sigma - \hat{H}_\sigma\|_2}{s_r(H_\lambda \hat{V})} + \frac{1 + \sqrt{5}}{2} \cdot \frac{\|\hat{H}_\sigma\|_2 \|H_\lambda - \hat{H}_\lambda\|_2}{\min\{s_r(H_\lambda \hat{V})^2, s_r(\hat{H}_\lambda \hat{V})^2\}} \right).
\end{aligned}$$

We now simplify these expressions as follows. First write ρ = ‖H − Ĥ‖F and note that, by the observation in equation (6.3) and Lemmas 6.3.1 and 6.3.2, choosing ϵ = √3/2 we see that ρ ≤ (√3/(2 + √3)) min{sr(Hλ), sr(M)} implies sr(HλV̂) ≥ sr(Hλ)/2 and sr(M̂) ≥ sr(M)/2. Furthermore, using Lemmas 6.3.1 and A.2.1 we get

$$\begin{aligned}
s_r(\hat{H}_\lambda \hat{V}) &\ge s_r(H_\lambda \hat{V}) - \|H_\lambda \hat{V} - \hat{H}_\lambda \hat{V}\|_2 \\
&\ge \frac{s_r(H_\lambda)}{2} - \rho \\
&\ge \frac{s_r(H_\lambda)}{2} - \frac{\sqrt{3}\, s_r(H_\lambda)}{2 + \sqrt{3}} \\
&= \frac{(2 - \sqrt{3})^2}{2}\, s_r(H_\lambda).
\end{aligned}$$

Now observe that since ĥP,λ is a vector of empirical probabilities of different strings, we have ‖ĥP,λ‖2 ≤ ‖ĥP,λ‖1 ≤ 1. In addition, we have ‖Ĥσ‖2 ≤ ‖Ĥσ‖F ≤ ‖Ĥ‖F ≤ √cB̃ by the same argument used in the proof of Theorem 6.2.1. Using all these observations, we see that the following hold for some constants C0, C∞, and CΣ:

$$\rho_0 \le C_0 \sqrt{n}\, s_1(M)\, \rho, \qquad \rho_\infty \le C_\infty \frac{\rho}{s_r(M)\, s_r(H_\lambda)^2}, \qquad \rho_\Sigma \le C_\Sigma \frac{\sqrt{n}\sqrt{c_{\tilde{B}}}\, s_1(M)\, |\Sigma|\, \rho}{s_r(M)\, s_r(H_\lambda)^2}.$$

Finally, note that if for some suitable constants C0′, C∞′, and CΣ′ we have

$$\rho \le \frac{\varepsilon}{(t+1)^2} \min\left\{ \frac{C_0'}{\sqrt{n}\, s_1(M)}, \; C_\infty' s_r(M) s_r(H_\lambda)^2, \; \frac{C_\Sigma' s_r(M) s_r(H_\lambda)^2}{\sqrt{n\, c_{\tilde{B}}}\, |\Sigma|\, s_1(M)} \right\},$$

then by Lemma 5.4.4 we get the following bound:

$$\sum_{x \in \Sigma^t} |f(x) - \hat{f}(x)| \le (1 + \rho_0)(1 + \rho_\Sigma)^t (1 + \rho_\infty) - 1 \le \left(1 + \frac{\varepsilon}{2(t+1)(t+2)}\right)^{t+2} - 1 \le \frac{\varepsilon}{t+1},$$

where we used that (1 + x/t)^t ≤ 1 + 2x for x ≤ 1/2 and any t ≥ 0. The result follows by choosing an adequate C.

Combining the above lemma with the result from Section 6.2 about the estimation error in an em- pirical Hankel matrix Hˆ , we immediately get the following result.

Theorem 6.3.4. There exists a universal constant C such that for any t ≥ 0, ε > 0, and δ > 0, if Algorithm 5 is given a sample of size

$$m \ge C \, \frac{t^4 \, n \, c_{\tilde{B}}^2 \, |\Sigma|^2 \, s_1(M)^2}{\varepsilon^2 \, s_r(M)^2 \, s_r(H_\lambda)^4} \log\frac{1}{\delta},$$

then with probability at least 1 − δ the hypothesis Â returned by the algorithm satisfies Σ_{|x|≤t} |f(x) − f̂(x)| ≤ ε.

At this point, we must make three remarks about the above theorem. The first is that since f is computed by a probabilistic automaton, there exists a constant cf such that P[|x| ≥ t] ≤ exp(−cf t) for all t. Thus, in order to cover all but an η fraction of the mass of f with a good approximation f̂, one can choose t = (1/cf) log(1/η) in Theorem 6.3.4.

The second observation is that the quantity n s1(M)²/sr(M)² appearing in the bound depends on our choice of a probabilistic automaton B realizing f. Since this is only used in the analysis, we can restate the bounds in terms of an intrinsic quantity that depends only on f, defined as follows:

$$\theta_f = \min_B \frac{|B| \, s_1(M_B)^2}{s_r(M_B)^2},$$

where the minimum is over all probabilistic automata B such that fB = f. With this intrinsic definition, we get a sample bound of the form

$$m \ge C \, \frac{t^4 \, \theta_f \, c_{\tilde{B}}^2 \, |\Sigma|^2}{\varepsilon^2 \, s_r(H_\lambda)^4} \log\frac{1}{\delta}.$$

Note that the quantity θf cB̃²/sr(Hλ)⁴ in this bound still depends on the user's choice for B. The problem of choosing a basis given a sample will be addressed in the next section.

The last remark is that, unfortunately, the bound on the error given in Theorem 6.3.4 does not specify what happens in the tail of the distribution. A lack of control on the behavior of Σ_{|x|>t} |f̂(x)| is the main obstruction to obtaining a bound on Σ_{x∈Σ*} |f(x) − f̂(x)|. We know that Σ_{x∈Σ*} f̂(x) will converge to α̂0(I − Â)^{-1}α̂∞ if ‖Â‖2 < 1. And since ‖A‖2 < 1 because f is a probability distribution, it is easy to see that f̂ will be convergent if ρΣ ≤ (1 − ‖A‖2)/2. However, to use a trick like the one used in Lemma 1.4.2 we need to control the absolute convergence of this series, which turns out to be a difficult problem. In particular, if we could guarantee that f̂(x) ≥ 0 for all x, then we would immediately have absolute convergence at an exponential rate. However, it turns out that, given a WFA, deciding whether fA(x) ≥ 0 for all x is in general an undecidable problem [SS78; DE08], and it is only semi-decidable whether fA is absolutely convergent [BD11]. An alternative approach relying on the notion of joint spectral radius [Jun09] was used in [Bai11a] to control the asymptotic convergence rate of the tail in a similar analysis; this technique, however, does not provide control over the constants involved in the bounds.

6.4 Finding a Complete Basis

So far our results have assumed that we know a complete basis for the target WFA to be learned. In practice the user needs to specify a basis to the algorithm, either based on domain knowledge, some indications in the available data, or trial-and-error heuristics. A common approach taken in some works is to use a basis that contains all possible strings up to a certain length. However, unless this length equals the rank of the target, one cannot guarantee in general that the basis will be complete. This brute-force approach yields bases of size O(|Σ|^{rank(f)}), which may be very large. In contrast, it is not difficult to show that bases of size rank(f) always exist, though they may be difficult to find without exact knowledge of f.

In this section we give a procedure to find a data-dependent basis which is guaranteed to succeed under some assumptions. In particular, we show that a simple randomized strategy for choosing a basis succeeds with high probability. Furthermore, our result gives bounds on the number of examples required for finding a basis, which depend polynomially on some parameters of the target function f : Σ* → R and the sampling distribution D; these two may in general be different.

We begin with a well-known folklore result about the existence of minimal bases. It implies that, in principle, all methods for learning WFA from sub-blocks of the Hankel matrix can work with a block whose size is only quadratic in the number of states of the target.

Proposition 6.4.1. For any f : Σ* → R of rank r there exists a basis (P, S) of f with |P| = |S| = r.

Proof. Take any rank factorization Hf = Pf Sf. Since rank(Pf) = r, there exist r linearly independent rows in Pf. Let P ∈ R^{r×r} be the sub-matrix of Pf containing these rows and P the prefixes defining them. Similarly, build a matrix S ∈ R^{r×r} with r linearly independent columns of Sf, indexed by a set of suffixes S. Then (P, S) forms a basis, since H = PS and rank(H) = rank(S) = r because P has full row rank.

A WFA A = ⟨α0^⊤, α∞, {Aσ}⟩ is called strongly bounded if ‖Ax‖ ≤ 1 for all x ∈ Σ*. Note that this implies the boundedness of fA, since

$$|f_A(x)| = |\alpha_0^\top A_x \alpha_\infty| \le \|\alpha_0\|\|\alpha_\infty\|.$$

A function over strings f of finite rank is called strongly bounded if there exists a strongly bounded minimal WFA for f. Note that, in particular, all models of probabilistic automata are strongly bounded. The following result states that, under some simple hypotheses, with high probability Algorithm 6 will return a correct basis. Our proof identifies a parameter depending on f and D that quantifies the sample size required to get a complete basis. The proof relies on a result from random matrix theory that can be found in Appendix A.1; it basically gives conditions for the empirical covariance matrix of a multidimensional real random variable to be full rank.

Algorithm 6: Random Basis
Input: strings S = (x1, ..., xm)
Output: basis candidate (P, S)
Initialize P ← ∅, S ← ∅;
for i = 1 to m do
  Choose 0 ≤ t ≤ |xi| uniformly at random;
  Split xi = ui vi with |ui| = t and |vi| = |xi| − t;
  Add ui to P and vi to S;
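Algorithm 6 is only a few lines in practice. A minimal sketch (illustrative names; strings assumed to be Python str):

```python
import random

def random_basis(sample, seed=None):
    """Sketch of Algorithm 6: split each string at a uniformly random position."""
    rng = random.Random(seed)
    P, S = set(), set()
    for x in sample:
        t = rng.randint(0, len(x))     # 0 <= t <= |x|, both endpoints included
        P.add(x[:t])                   # prefix u_i
        S.add(x[t:])                   # suffix v_i
    return P, S
```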

Theorem 6.4.2. Let f : Σ* → R be a strongly bounded function of rank r, and let D be a distribution over Σ* with full support.¹ Suppose that m strings sampled i.i.d. from D are given to Algorithm 6. Then, if m ≥ Cη log(1/δ) for some universal constant C and a parameter η that depends on f and D, the output (P, S) is a basis for f with probability at least 1 − δ.

Proof. We begin with some notation. Consider the prefixes produced by Algorithm 6 on input an i.i.d. random sample S = (x1, ..., xm) drawn from D. We write P = (u1, ..., um) for the tuple of prefixes produced by the algorithm and use P′ to denote the set defined by these prefixes. We define S and S′ similarly. Let p′ = |P′| and s′ = |S′|. Our goal is to show that the random sub-block H′ ∈ R^{p′×s′} of Hf defined by the output of Algorithm 6 has rank r with high probability with respect to the choice of input sample and splitting points. Our strategy will be to show that one always has H′ = P′S′, where P′ ∈ R^{p′×r} and S′ ∈ R^{r×s′} are such that with high probability rank(P′) = rank(S′) = r. The arguments are identical for P′ and S′.

Fix a strongly bounded minimal WFA A = ⟨α0, α∞, {Aσ}⟩ for f, and let Hf = Pf Sf denote the rank factorization induced by A. We write pu^⊤ for the uth row of Pf. Note that since A is strongly bounded we have ‖pu^⊤‖ = ‖α0^⊤Au‖ ≤ ‖α0‖. The desired P′ will be the sub-block of Pf corresponding to the prefixes in P′. In what follows we bound the probability that this matrix is rank deficient.

The first step is to characterize the distribution of the elements of P. Since the prefixes ui are all i.i.d., we write Dp to denote the distribution from which these prefixes are drawn, and observe that for any u ∈ Σ* and any 1 ≤ i ≤ m we have Dp(u) = P[u^i = u] = P[∃v : x^i = uv ∧ t^i = |u|], where x^i is drawn from D and t^i is uniform in [0, |x^i|]. Thus we see that Dp(u) = Σ_{v∈Σ*} (1 + |uv|)^{-1} D(uv).

Now we overload our notation and let Dp also denote the following distribution over R^r, supported on the set of all rows of Pf:

$$D_p(q^\top) = \sum_{u : \, p_u^\top = q^\top} D_p(u).$$

It follows from this definition that the covariance matrix of Dp satisfies Cp = E[qq^⊤] = Σ_u Dp(u) pu pu^⊤. Observe that this expression can be written in matrix form as Cp = Pf^⊤ Dp Pf, where Dp is a bi-infinite diagonal matrix with entries Dp(u, u) = Dp(u). We say that the distribution D is pref-adversarial for A if rank(Cp) < r. Note that if D(x) > 0 for all x ∈ Σ*, then Dp has full rank and consequently rank(Cp) = r. This shows that distributions with full support are never pref-adversarial, and thus we can assume that Cp has full rank.²

Next we use the prefixes in P to build a matrix P ∈ R^{m×r} whose ith row is the ui-th row pui^⊤ of Pf. It is immediate to see that P′ can be obtained from P by possibly removing some repeated rows and reordering the remaining ones. Thus we have rank(P) = rank(P′). Furthermore, by construction we have that Ĉp = P^⊤P is the sample covariance matrix of m vectors in R^r drawn i.i.d. from Dp. Therefore, a straightforward application of Theorem A.1.7 shows that if m ≥ C(κ(Cp)/sr(Cp))‖α0^⊤‖² log(1/δ), then rank(P′) = r with probability at least 1 − δ. Here C is a universal constant, κ(Cp) is the condition

¹ The theorem also holds under a weaker assumption on D; see the proof for details.
² This justifies the assumption in the statement of Theorem 6.4.2. Note however that our arguments also work under the weaker assumption that there exists a minimal strongly bounded WFA A for f such that D is neither pref-adversarial nor suff-adversarial for A (with the obvious definitions).

number of Cp, and sr(Cp) is the smallest singular value of Cp; these last two quantities depend on A and D. The result follows by symmetry and a union bound. Furthermore, we can take

$$\eta = \eta(f, D) = \inf_A \max\left\{ \frac{\kappa(C_p) \|\alpha_0\|^2}{s_r(C_p)}, \; \frac{\kappa(C_s) \|\alpha_\infty\|^2}{s_r(C_s)} \right\},$$

where the infimum is taken over all minimal strongly bounded WFA for f.

Now we show how to compute the η(f, D) appearing in the bound in a particular case. Suppose f is stochastic and let D = f, so that the sample comes from the distribution we are trying to learn. Furthermore, suppose A = ⟨α0, α∞, {Aσ}⟩ is a minimal WFA for f which is also irredundant. We begin by computing the probability Dp(u) that a particular prefix u ∈ Σ* with |u| = t will appear in P.

$$\begin{aligned}
D_p(u) &= \sum_{v \in \Sigma^*} \frac{1}{1 + |u| + |v|} D(uv) \\
&= \sum_{k \ge 0} \frac{1}{1 + t + k} D(u\Sigma^k) \\
&= \alpha_0^\top A_u \left( \sum_{k \ge 0} \frac{1}{1 + t + k} A^k \right) \alpha_\infty \\
&= \frac{1}{1 + t} \, \alpha_0^\top A_u \, {}_2F_1(1, 1 + t; 2 + t; A) \, \alpha_\infty,
\end{aligned}$$

where ₂F₁(1, 1 + t; 2 + t; A) is a Gaussian hypergeometric function evaluated at A, which is guaranteed to converge because ‖A‖ < 1 for some submultiplicative norm. Then, writing pu^⊤ = α0^⊤Au, we have

$$C_p(i, j) = \sum_{u \in \Sigma^*} p_u(i)\, p_u(j)\, D_p(u).$$

Now note that pu(i)pu(j) is the (i, j) coordinate of the matrix Cp,u = pu pu^⊤ = Au^⊤ α0 α0^⊤ Au. Hence, we get the following identity:

$$C_p = \sum_{u \in \Sigma^*} C_{p,u} \, \frac{\alpha_0^\top A_u \, {}_2F_1(1, 1 + |u|; 2 + |u|; A) \, \alpha_\infty}{1 + |u|}.$$

Similarly for suffixes, we obtain the expression

$$C_s = \sum_{v \in \Sigma^*} C_{s,v} \, \frac{\alpha_0^\top \, {}_2F_1(1, 1 + |v|; 2 + |v|; A) \, A_v \alpha_\infty}{1 + |v|},$$

where Cs,v = Av α∞ α∞^⊤ Av^⊤. Therefore, given a WFA A for f we can compute the following upper bound for η(f, f):

$$\eta(f, f) \le \max\left\{ \frac{\|C_p\|\|\alpha_0\|^2}{s_r(C_p)^2}, \; \frac{\|C_s\|\|\alpha_\infty\|^2}{s_r(C_s)^2} \right\}.$$

We note here that the proof of Theorem 6.4.2 shows that a sufficient condition for sr(Cp), sr(Cs) > 0 is that f has full support. Identifying further sufficient conditions and other bounds on η is an open problem.

Chapter 7

Learning Transductions under Benign Input Distributions

Besides the straightforward application of the spectral method to PNFA given in the previous chapter, learning algorithms for more complex models over sequences can be derived using the same principles. In this chapter we give a spectral learning algorithm for probabilistic transductions defining conditional distributions over pairs of strings. The focus will be on obtaining finite-sample bounds while making as few assumptions as possible on the probability distribution that generates the input strings in the training sample. We will show that a spectral technique can be used to learn a class of probabilistic finite-state transducers defining such conditional distributions.

The most general PAC learning result would be in a distribution-free setting where the distribution on inputs is not constrained in any way. That would mean that transducers can be learned under any input distribution, which in general is a hard problem because it contains as a subproblem that of learning all DFA under an arbitrary input distribution. In order to overcome this difficulty we will be forced to make some assumptions on the distribution of inputs. If, for example, we assume that it can be realized by a stochastic WFA, then the whole transducer would become a stochastic WFA and we could apply the results from the previous chapter. However, since we want to be as distribution-independent as possible, we will not take this path. Instead, we will identify a couple of non-parametric assumptions on the input distribution which make it feasible to learn with a spectral algorithm.

7.1 Probabilistic Finite-State Transducers

We begin by introducing a model for probabilistic transductions over pairs of strings of the same length. This corresponds to a function f : (Σ × ∆)* → R whose values f(x, y) represent the conditional probability of generating output y given x as input; that is, f(x, y) ≥ 0 for all (x, y) ∈ (Σ × ∆)*, and Σ_{y∈∆^{|x|}} f(x, y) = 1 for all x ∈ Σ*. Note that by definition f(x, y) is not defined when |x| ≠ |y|. Our main working hypothesis here will be that f can be computed by a weighted transducer, which is nothing else than a weighted automaton whose transition operators are indexed by pairs of symbols over two finite alphabets Σ and ∆.¹ This means that we can write

$$f(x, y) = \alpha_0^\top A_{x_1}^{y_1} \cdots A_{x_t}^{y_t} \alpha_\infty = \alpha_0^\top A_x^y \alpha_\infty,$$

where α0, α∞ ∈ R^n and Aσ^δ ∈ R^{n×n}, with n being the number of states in the transducer. We will thus write f = fA, where A = ⟨α0, α∞, {Aσ^δ}⟩. As we did in Section 6.3, we shall assume throughout this chapter that the target transduction f admits a probabilistic realization. From the point of view of a probabilistic model, this formulation implies that the conditional probability of the output random sequence Y given as input the random sequence X can be defined through a sequence of hidden states

¹ Sometimes in the literature the term weighted transducer is used to denote a more general machine where transitions between states may occur without consuming an input symbol or emitting an output symbol; this is not the case here.

H by

$$f(x, y) = P[Y = y \mid X = x] = \sum_{h \in [n]^t} P[Y = y, H = h \mid X = x],$$

where we assume that, for a fixed sequence of states h, the probability of output y given input x factorizes as follows:

$$P[Y = y, H = h \mid X = x] = P[H_1 = h_1] \times \prod_{s=1}^{t-1} P[Y_s = y_s, H_{s+1} = h_{s+1} \mid H_s = h_s, X_s = x_s] \times P[Y_t = y_t \mid H_t = h_t, X_t = x_t].$$

In particular, we can write the operators in terms of this factorization as

$$\alpha_0(i) = P[H_1 = i], \qquad \alpha_\infty = \mathbf{1}, \qquad A_\sigma^\delta(i, j) = P[Y_s = \delta, H_{s+1} = j \mid H_s = i, X_s = \sigma].$$

We note that the definition of Aσ^δ is independent of the time step s. Observe that, writing v = Aσ^δ 1, it is easy to check that v(i) = P[Ys = δ | Hs = i, Xs = σ]. In general this model may be very complicated to learn, especially when no assumptions are made on the distribution of input examples X. Thus, to simplify our analysis we will make further assumptions on our model. In particular, we will assume that given the current state, the current output is independent of the current input, and that given the current state and input symbol, the next state is independent of the current output. In terms of the model, this means that the following factorization holds:

$$P[Y_s, H_{s+1} \mid H_s, X_s] = P[Y_s \mid H_s] \cdot P[H_{s+1} \mid H_s, X_s].$$

Thus, the computation performed by the model reduces P[Y = y | X = x] to summing the following over all h ∈ [n]^t:

P[H1 = h1] · P[Yt = yt | Ht = ht] t−1 Y × P[Ys = ys | Hs = hs] · P[Hs+1 = hs+1 | Hs = hs,Xs = xs] . s=1 Observe that this implies the output sequence depends only on the first t − 1 symbols of x. In addition to simplifying the learning task, these independence assumptions have a neat interpre- tation in terms of the model parametrization as a weighted transducer. To see this, note first that the entries in the transition operators now satisfy

δ Aσ(i, j) = P[Ys = δ | Hs = i] · P[Hs+1 = j | Hs = i, Xs = σ] .

δ n×n n×n When written in matrix form, this factorization yields Aσ = OδTσ where Tσ ∈ R and Oδ ∈ R is a diagonal matrix, with entries defined by

Tσ(i, j) = P[Hs+1 = j | Hs = i, Xs = σ] , Oδ(i, i) = P[Ys = δ | Hs = i] . This means that each operator in the model is a product of two matrices, and that operators that share either the input or the output symbols share a piece of this decomposition as well. In this case, the δ n vector Aσ1 takes an interesting form. Let oδ ∈ R be a vector given by oδ(i) = Oδ(i, i). Also, note Pn δ that j=1 P[Hs+1 = j | Hs = i, Xs = σ] = 1 implies Tσ1 = 1. Therefore, we have Aσ1 = OδTσ1 = Oδ1 = oδ. This is just another way of seeing that under our independence assumptions the value of

110 f(x, y) depends only on the first t − 1 symbols of x. This fact will play a crucial role in the definition of the Hankel matrices that we give in Section 7.3. For convenience we shall define an observation matrix n×∆ O ∈ R containing all the columns oδ; that is, O(i, δ) = P[Ys = δ | Hs = i]. The model of probabilistic transducer with independence between transition and emission will hence- forth be called Factorized Probabilistic Weighted Transducer (FPWT). For these models, the following fact will prove useful.

k P y Lemma 7.1.1. Given k ≥ 0, for any x ∈ Σ one has y∈∆k Ax1 = 1. Proof. First note that for any 1 ≤ i ≤ n one has X X oδ(i) = P[Ys = δ | Hs = i] = 1 . δ∈∆ δ∈∆

P δ P Thus, δ∈∆ Aσ1 = δ∈∆ oδ = 1. Hence, the claim follows from a simple induction argument on k.

7.2 A Learning Model for Conditional Distributions

If we are to learn a transducer we will need a sample of input-output example pairs. Modeling how those will be sampled and how the error incurred by the learned transducer will be assessed is the subject of this section. Note that here the standard PAC learning model for probability distributions cannot be used because the goal is to learn a conditional distribution when input instances are being generated by an unknown distribution which we do not want to learn; we are only interested in the conditional behavior. The standard PAC model for learning deterministic functions does not fit in here either because given an input we are interested in modelling a distribution over outputs instead of choosing a single fixed output. That said, we will be using a learning model resembling the one used in [ATW01] for learning stochastic rules, but with a different loss function. Let D be a distribution over (Σ × ∆)? and suppose that (x, y) is a pair of strings drawn from this ? distribution. We will write D = DX ⊗ DY |X , where DX represents the marginal distribution over Σ ? and DY |X the conditional distribution of y given some x ∈ Σ . With the notation from previous section we can write DX (x) = P[X = x], and DY |X (x, y) = P[Y = y | X = x]. Thinking the conditional ? distribution as a kernel, for any x ∈ Σ we will denote the function y 7→ DY |X (x, y) by DY |x. It will ? also be useful to define the marginal distribution over ∆ which we denote by DY , and which satisfies

DY (y) = P[Y = y] = Ex∼DX [DY |x(y)]. With this notation we are ready to introduce the error measure used to determine the accuracy of a learned transducer. Suppose we are given m input-output example pairs (xi, yi) generated according to D which are used for learning a transducer Dˆ Y |X . We define the error between DY |X and Dˆ Y |X to be the L1 distance between D and DX ⊗ Dˆ Y |X : X X L1(DX ⊗ DY |X ,DX ⊗ Dˆ Y |X ) = |DY |x(y) − Dˆ Y |x(y)|DX (x) x∈Σ? y∈∆?   X ˆ = Ex∼DX  |DY |x(y) − DY |x(y)| y∈∆? h i ˆ = Ex∼DX L1(DY |x, DY |x) .

This deserves some comments. The first is to observe that this error measure has the dependence on DX that one expects: it ponders errors on Dˆ Y |x according to the likelihood of observing x as input. The second interesting feature is that when the conditional distributions are computed by weighted transducers, then the L1 distance in the last expression can be bounded using the tools introduced in Section 5.4. Like we did for the bounds in Section 6.3, we will need to consider the error incurred by a hypothesis Dˆ Y |X only on input strings whose length is bounded by some t. Since input and output

111 strings are constrained to have the same length by the model, this means that we will be bounding the following: X X |DY |x(y) − Dˆ Y |x(y)|DX (x) . x∈Σ≤t y∈∆|x| The most general PAC learning results are in a distribution-independent setting, in the sense that they provide algorithms that can learn under any input distribution. In our case that would amount to being able to learn weighted transducers under any possible distribution DX , which turns out to be an extremely hard problem. In contrast, if for example DX could be realized by a stochastic WFA, then D would itself be realized by a WFA. Then, by a simple reduction we could learn DY |X by using results from Section 6.3 to learn D and DX . Our goal in this chapter will be to move as closer as possible to the distribution-independent setting, and thus we will not be willing to assume that DX has a particular form. We will, however, need to make some non-parametric assumptions on DX as will be seen in the following sections.

7.3 A Spectral Learning Algorithm

In this section we will derive a spectral algorithm for learning conditional distributions which can be realized by weighted transducers satisfying certain assumptions. As in previous spectral algorithms, the key will be to define adequate Hankel matrices from whose factorizations one can recover operators for a weighted transducer realizing (an approximation to) the same conditional distribution as the original transducer. The fact that the observations are sampled according to an input distribution over which we have no control will make the derivation slightly more involved than in previous methods. This complication will be the main reason why we must make assumptions about the weighted transducers we will be learning. δ ? Let A = hα0, 1, {Aσ}(σ,δ)∈Σ×∆i be a FPWT with n states. This means that fA : (Σ × ∆) → R computes a conditional distribution: fA(x, y) = DY |x(y) = P[Y = y | X = x] over pairs of aligned sequences. It also means that operators of A can be obtained from a set of probabilities given by an n×∆ n×n observation matrix O ∈ R and a set of transition matrices Tσ ∈ R . The goal is then to learn a weighted transducer that approximates fA under the metric defined in the previous section for a given ? input distribution DX over Σ . We begin by defining a set of quantities derived from DX and A and stating some assumptions which we need in order to derive our algorithm. For any symbols σ, σ0 ∈ Σ we define

0 + 0 P[X ∈ σ σΣ ] P[X1 = σ | X2 = σ] = . P[X ∈ ΣσΣ+] Now, for any input symbol σ ∈ Σ we define an averaged transition matrix

¯ X 0 Tσ = Tσ0 P[X1 = σ | X2 = σ] . σ0∈Σ

¯ P ¯ + Averaging over σ we also define T = σ∈Σ TσP[X ∈ ΣσΣ ]. We will use Dα0 to denote a diagonal n × n matrix that contains α0 on its diagonal and zero in every off-diagonal entry. Finally, we will make the following assumptions.

Assumption 3. Transducer A and distribution DX satisfy the following:

1. n ≤ |∆|,

> ¯ > ¯ 2. rank(O) = rank(O Dα0 Tσ) = rank(O Dα0 T) = n,

3. P[|X| ≤ 2] = 0,

4. P[X ∈ ΣσΣ+] > 0 for all σ ∈ Σ.

112 A first observation about these assumptions is that without (1) the rank assumptions in (2) would not be possible due to dimensional restrictions. Assumptions (3) and (4) are what we will require on our input distribution, which turns out not to be very restrictive. It is also worth noting that the last two rank assumptions in (2) also depend on the input distribution through the definition of T¯ σ and T¯ . One can interpret these two assumptions as saying that the behavior of the input distribution is such that it will not act “maliciously” to hide some of the transitions in the target transducer. Also note that (3) and Tσ0 1 = 1 imply that T¯ σ1 = 1 for any σ, and T1¯ = 1. We note here that these assumptions resemble those made in [HKZ09] to prove the learnability of HMM with a spectral method. Several Hankel matrices can be defined in this setting. We now introduce the ones that we will be using in our algorithm. It is important to note that unlike in previous cases, here we will work with a fixed basis P = S = ∆. This will work because of the assumptions we just introduced, and is necessary because we are interested on marginalizing as much as possible the effect of DX . We begin by defining ∆×∆ a matrix Hσ ∈ R for each σ ∈ Σ with entries given by

+ + Hσ(δi, δj) = P[Y ∈ δiδj∆ | X ∈ ΣσΣ ] .

P + Averaging these matrices we get H = σ∈Σ HσP[X ∈ ΣσΣ ], whose entries correspond to

+ H(δi, δj) = P[Y ∈ δiδj∆ ] .

δ ∆×∆ For each pair of symbols (σ, δ) ∈ Σ × ∆ we define a matrix Hσ ∈ R with entries

δ ? + Hσ(δi, δj) = P[Y ∈ δiδδj∆ | X ∈ ΣσΣ ] .

We also define a vector h ∈ R∆ with entries h(δ) = P[Y ∈ δ∆?]. It is clear from these definitions that given a sample of pairs of strings one can easily compute empirical estimates of all these matrices and vectors. The learning algorithm is based on applying the following result to these estimates.

Theorem 7.3.1. Suppose Assumption 3 holds and let H = UΛV> be a compact SVD with V ∈ R∆×n. 0 0 0 0δ 0 > > 0 + Define the weighted transducer A = hα0, α∞, {A σ}i given by α0 = h V, α∞ = (HV) h, and 0δ + δ A σ = (HσV) HσV. Then we have fA = fA0 . Like for other spectral methods we have presented, the key to proving this theorem lies on the existence of certain factorizations for the Hankel matrices we just defined. We now establish the existence of these decompositions. Lemma 7.3.2. The following decompositions hold:

> 1. Hσ = O DαT¯ σO,

> 2. H = O DαTO¯ , δ > ¯ δ 3. Hσ = O DαTσAσO, > > > ¯ 4. h = (α0 O) = O Dα0 T1. Proof. Applying the definitions and the fact that the conditional distribution only has positive probability for strings of the same length we get

+ X X X 0 Hσ(δi, δj)P[X ∈ ΣσΣ ] = P[Y = δiδjy, X = σ σx] σ0∈Σ x∈Σ+ y∈∆+ X X X 0 0 = P[Y = δiδjy | X = σ σx]P[X = σ σx] σ0∈Σ x∈Σ+ y∈∆+

X X X > δi δj y 0 = α0 Aσ0 Aσ Ax1P[X = σ σx] . σ0∈Σ x∈Σ+ y∈∆|x|

113 δj P δi We now sum first over y and apply Lemma 7.1.1. Then, using that Aσ 1 = Oδj 1 and σ0∈Σ Aσ0 P[X1 = 0 ¯ σ | X2 = σ] = Oδi Tσ, we obtain

X > δi δj 0 Hσ(δi, δj) = α0 Aσ0 Aσ 1 P[X1 = σ | X2 = σ] σ0∈Σ > ¯ = α0 Oδi TσOδj 1 o> D T¯ o , = δi α0 σ δj where the last equation follows from simple matrix algebra. Thus, since the columns of O are the vectors > ¯ oδ, we get Hσ = O Dα0 TσO. The decomposition of H follows now immediately from the definition of T¯ . Now we proceed to the third decomposition, for which we have

δ + Hσ(δi, δj) P[X ∈ ΣσΣ ] X X X 0 00 = P[Y = δiδδjy, X = σ σσ x] σ0,σ00∈Σ x∈Σ? y∈∆? X X X 0 00 0 00 = P[Y = δiδδjy | X = σ σσ x] P[X = σ σσ x] σ0,σ00∈Σ x∈Σ? y∈∆?

X X X > δi δ δj y 0 00 = α0 Aσ0 AσAσ00 Ax1 P[X = σ σσ x] . σ0,σ00∈Σ x∈Σ? y∈∆|x|

From here we use the same arguments as in the previous case to get

0 00 ? δ X > δi δ δj P[X ∈ σ σσ Σ ] H (δi, δj) = α A 0 A A 00 1 σ 0 σ σ σ [X ∈ ΣσΣ+] σ0,σ00∈Σ P

X > δi δ 0 = α0 Aσ0 AσOδj 1P[X1 = σ | X2 = σ] σ0∈Σ > ¯ δ = α0 Oδi TσAσOδj 1 o> D T¯ Aδ o . = δi α0 σ σ δj

δ > ¯ δ Thus, we get Hσ = O Dα0 TσAσO. Similar arguments now yield the following derivation: X h(δ) = P[Y = δy] y∈∆? X X X = P[Y = δy, X = σx] σ∈Σ x∈Σ? y∈∆? X X X = P[Y = δy | X = σx]P[X = σx] σ∈Σ x∈Σ? y∈∆|x| X X X > δ y = α0 AσAx1P[X = σx] σ∈Σ x∈Σ? y∈∆|x| X > δ ? = α0 Aσ1P[X = σΣ ] σ∈Σ > = α0 Oδ1P[|X| ≥ 1] > = α0 oδ > = oδ Dα0 1 > ¯ = oδ Dα0 T1 , where last inequality follows from T1¯ = 1. Thus, the desired decompositions for h follow from h(δ) = > > ¯ α0 oδ and h(δ) = oδ Dα0 T1.

114 Algorithm 7: Algorithm for Learning FPWT Input: n, Σ, ∆, S = (z1, . . . , zm) ˆ ˆ δ Output: A WFT A = hαˆ 0, αˆ ∞, {Aσ}i ˆ ˆ ˆ 0 ˆ δ Using S, compute estimations h, H, Hσ for all σ ∈ Σ , and Hσ for all (σ, δ) ∈ Σ × ∆; Compute the n top right singular vectors Vˆ of H; > ˆ> ˆ ˆ ˆ + ˆ Let αˆ 0 ← h V and αˆ ∞ ← (HV) h; ˆ δ ˆ ˆ + ˆ δ ˆ foreach (σ, δ) ∈ Σ × ∆ do let Aσ ← (HσV) HσV;

Based on these factorizations and Assumptions 3, we are ready to prove the main theorem of this section.

Proof of Theorem 7.3.1. We will show that A0 = M−1AM for some invertible M ∈ Rn×n and the result will follow from the invariance under change of basis of the function computed by a weighted automaton. > ¯ > ¯ ∆×n Let P = O Dα0 T and Pσ = O Dα0 Tσ. Note that by Assumption 3 matrices P, Pσ ∈ R have rank n. Since O also has rank n, this means that the factorizations H = PO and Hσ = PσO are in fact rank factorizations. Thus, since H has rank n, using the compact SVD we get the rank factorization H = (HV)V>. Therefore, the matrix M = OV ∈ Rn×n is invertible by Corollary 5.2.2. Now we check that M satisfies the equation. In the first place we have, by Lemma 7.3.2 and properties of Moore–Penrose pseudo-inverse:

0δ + δ A σ = (HσV) HσV + δ = (PσM) PσAσM −1 + δ = M Pσ PσAσM −1 δ = M AσM . Using the first decomposition for h we get

0 > > > α0 = h V = α0 M . Finally, by the second decomposition we get

0 + + −1 α∞ = (HV) h = (PM) P1 = M 1 . In view of this result, we can now give a spectral learning algorithm for FPWT which follows the same principles used in Algorithm 5. This new algorithm receives as input the number of states n of the target FPWT, alphabets Σ and ∆, and a sample S = (z1, . . . , zm) of pairs zi = (xi, yi) drawn from ? DX ⊗ DY |X , where DX is an arbitrary distribution over Σ and DY |X is a conditional distribution over aligned sequences realized by a FPWT. The details can be found in Algorithm 7.

7.4 Sample Complexity Bounds

In this section we give sample complexity bounds for Algorithm 7. We will assume throughout all the analysis that the target distribution DX and FPWT DY |X = f satisfy Assumptions 3. The overall structure of the analysis will be the same as in Section 6.3. We begin by defining a transducer A˜ obtained from the singular vectors Vˆ of Hˆ and the true Hankel matrices. We will show that fA˜ = fA when some conditions are met. Begin by defining ˜ ˜ δ A = hα˜ 0, α˜ ∞, {Aσ}i as follows: > > ˆ α˜ 0 = h V , + α˜ ∞ = (HVˆ ) h , ˜ δ ˆ + δ ˆ Aσ = (HσV) HσV .

115 The following result is analogous to Lemma 6.3.1 and gives conditions under which fA˜ = f.

Lemma 7.4.1. Suppose that kH−Hˆ k2 ≤ /(1+)sn(H) for some  < 1. Then the following inequalities hold for all σ ∈ Σ:

p 2 sn(HVˆ ) ≥ sn(H) 1 −  p 2 sn(HσVˆ ) ≥ sn(HσV) 1 −  .

Furthermore, we have fA˜ = f. ˆ Proof. The same argument used in√ the proof of Lemma 6.3.1 shows that the assumption kH − Hk2 ≤ > 2 > /(1 + )sn(H) implies sn(V Vˆ ) ≥ 1 −  . Thus, since H = UΛV , by Lemma A.2.6 we get

> p 2 sn(HVˆ ) ≥ sn(UΛ)sn(V Vˆ ) ≥ sn(H) 1 −  .

In addition, since Lemma 7.3.2 gives us the rank factorizations H = PO and Hσ = PσO, we immediately > have HσVV = Hσ. Thus, the following holds:

> sn(HσVˆ ) = sn(HσVV Vˆ ) > ≥ sn(HσV)sn(V Vˆ ) p 2 ≥ sn(HσV) 1 −  .

The claim on fA˜ follows from the same argument used in Lemma 6.3.1. Note that by Theorem 7.3.1 we have A = MA0M−1 with M = OV. Furthermore, if the conditions of Lemma 7.4.1 are met, then we also have A = Mˆ A˜Mˆ −1 for some Mˆ . In this case it is easy to see, using that (HVˆ )Vˆ > is a rank factorization and Corollary 5.2.2, that one has Mˆ = OVˆ = OVV>Vˆ . Thus, using the same proof as in Lemma 6.3.2 we get the following. ˆ Lemma 7.4.2.√ Suppose that kH − Hk2 ≤ /(1 + )sn(H) for some  < 1. Then for any 1 ≤ i ≤ n we 2 have si(M) 1 −  ≤ si(Mˆ ) ≤ si(M). ˆ ˆ0 ˆ ˆ ˆ −1 0 0 ˆ 0δ We can use the change of basis M to define a new weighted transducer A = MAM = hαˆ 0, αˆ ∞, {Aσ }i ˆ ˆ using the hypothesis A returned by Algorithm 7. It is obvious by construction that fAˆ0 = f. Using this representation for fˆ we can prove a variant of Lemma 5.4.4 which will be used to bound the error of our algorithm. Let us define the following quantities:

0 ρ0 = kα0 − αˆ 0k1 , 0 ρ∞ = kα∞ − αˆ ∞k∞ , X δ ˆ 0δ ρσ = kAσ − Aσ k∞ . δ∈∆

Lemma 7.4.3. The following holds for any t ≥ 0 and x ∈ Σt:

t X ˆ Y |f(x, y) − f(x, y)| ≤ (1 + ρ0)(1 + ρ∞) (1 + ρxs ) − 1 . y∈∆t s=1

Proof. The proof is very similar to those of Lemmas 5.4.3 and 5.4.4. Thus, we just highlight the main differences. To begin with, note that since f = fA and A is a FPWT, then we have γ∞ = kα∞k∞ = ∆ P δ P y k1k∞ = 1. Furthermore, γσ = δ∈∆ |Aσ|1 = 1 and γx = y∈∆t kα0Axk1 = 1 by Lemma 7.1.1. P y 0 ˆ 0y Writing ϕx = y∈∆t kα0Ax − αˆ 0Ax k1 and using the same bounding scheme from (5.2) one gets

ϕxσ ≤ ρσ + ϕx + ϕxρσ = (1 + ρσ)(1 + ϕx) − 1 .

116 Thus, writing ρ0 = ϕλ one can show by induction on the length of x that t Y ϕx ≤ (1 + ρ0) (1 + ρxs ) − 1 . x=1 Finally, the argument used in (5.3) yields

t X ˆ Y |f(x, y) − f(x, y)| ≤ ρ∞ + (1 + ρ∞)ϕx ≤ (1 + ρ0)(1 + ρ∞) (1 + ρxs ) − 1 . y∈∆t s=1 Before stating the next result we shall define several quantities related to the accuracy with which the Hankel matrices used in Algorithm 7 are approximated. In particular, we write ε0 = kh − hˆk2, ˆ ˆ δ δ ˆ δ ελ = kH − Hk2, εσ = kHσ − Hσk2 and εσ = kHσ − Hσk2. Lemma 7.4.4. For any ε > 0 and t ≥ 0, if the following hold for every σ ∈ Σ: εs (M)s (H)2 n √ n ε0 ≤ C0 2 , (t + 1) ns1(M) 2 εsn(H) ελ ≤ Cλ 2 , (t + 1) s1(M) εs (M)s (H V)2 n √ n σ εσ ≤ CΣ 2 , (t + 1) n|∆|s1(M) 2 X εsn(M)sn(HσV) δ ∆ √ εσ ≤ CΣ 2 , (t + 1) ns1(M) δ∈∆ t P ˆ ∆ then for any x ∈ Σ we have y∈∆t |f(x, y) − f(x, y)| ≤ ε/(t + 1). Here C0, Cλ, CΣ, and CΣ are universal constants. Proof. We begin by applying Lemma 5.4.5 to obtain the following bounds: √ √ 0 ˆ −1 n ˆ ρ0 = kα0 − αˆ 0k1 ≤ nkM k2kα˜ 0 − αˆ 0k2 ≤ kh − hk2 , sn(Mˆ )

0 ˆ ρ∞ = kα∞ − αˆ ∞k∞ ≤ kMk2kα˜ ∞ − αˆ ∞k2 √ ! kh − hˆk2 1 + 5 khˆk2kH − Hˆ k2 ≤ s1(Mˆ ) + , 2 2 sn(HVˆ ) 2 min{sn(HVˆ ) , sn(Hˆ Vˆ ) } √ δ δ ˆ 0δ ˆ ˆ −1 ˜ δ ˆ δ ρσ = kAσ − Aσ k∞ ≤ nkMk2kM k2kAσ − Aσk2 √ √ ! ns (Mˆ ) kHδ − Hˆ δ k 1 + 5 kHˆ δ k kH − Hˆ k ≤ 1 σ σ 2 + σ 2 σ σ 2 . 2 2 sn(Mˆ ) sn(HσVˆ ) 2 min{sn(HσVˆ ) , sn(Hˆ σVˆ ) } ˆ ˆ δ Now√ note that we have khk2 ≤ 1 and kHσk2 ≤ 1. Furthermore, using Lemmas 7.4.1 and 7.4.2 with ˆ ˆ ˆ  = 3/2 we can see that√sn(HV) ≥ sn(H)/2, sn(HσV) ≥ sn(HσV)/2√ and sn(M) ≥ sn(M)/2 hold ˆ 2 ˆ 2 when kH − Hk2 ≤ sn(H) 1 −  . If in addition k√Hσ − Hσk ≤ sn(HσV) 1 −  , then combining the 2 above with Lemma A.2.1 we get sn(Hˆ σVˆ ) ≥ ((2 − 3) /2)sn(HσV). Thus, for some choice of constants we have √ 0 n ρ0 ≤ C0 ε0 , sn(M)

0 s1(M) ρ∞ ≤ C (ε0 + ελ) , ∞ s (H)2 n √ δ ∆0 ns1(M) δ ρσ ≤ CΣ 2 (εσ + εσ) . sn(M)sn(HσV)

117 Now we can finally apply Lemma 7.4.3 to see that if for some suitable constants the following hold:

ε s (M) s (H)2  00 n√ n ε0 ≤ C0 2 min , , (t + 1) n s1(M) 2 ε sn(H) ελ ≤ Cλ 2 , (t + 1) s1(M) ε ε s (M) σ √ n max 2 ≤ CΣ 2 , σ∈Σ sn(HσV) (t + 1) n|∆|s1(M) δ X ε ε sn(M) σ ∆ √ max 2 ≤ CΣ 2 , σ∈Σ sn(HσV) (t + 1) ns1(M) δ∈∆

then for any x ∈ Σt we get

t+2 X  ε  ε |f(x, y) − fˆ(x, y)| ≤ 1 + − 1 ≤ . 2(t + 1)(t + 2) t + 1 y∈∆t

Now is time to bound the estimation error on the approximate Hankel matrices used in Algorithm 7. In this case this turns out to be a bit more cumbersome than in the analysis of Section 6.3 because some of the probabilities are conditioned on the input distribution DX . This will make it necessary that our + bounds depend on pσ = P[X ∈ ΣσΣ ] for each σ ∈ Σ. Lemma 7.4.5. Suppose Algorithm 7 receives a sample of size m as input. Then, for any δ > 0, the following hold simultaneously for every σ ∈ Σ with probability at least 1 − δ: r s ! 1 5 kh − hˆk ≤ 1 + ln , 2 m δ r s ! 1 5 kH − Hˆ k ≤ 1 + ln , F m δ s s  ! ˆ 1 5|Σ| kHσ − HσkF ≤ p 1 + 2 ln , mpσ − 2mpσ ln(5|Σ|/δ) δ s s ! X |∆| 5|Σ| kHδ − Hˆ δ k ≤ 1 + ln . σ σ F p δ δ∈∆ mpσ − 2mpσ ln(5|Σ|/δ)

Proof. Each of the first two bounds hold with probability at least 1−δ/5 by Theorem 6.2.1, where cB = 1 i i by definition. For each input symbol σ ∈ Σ we define mσ as the number of examples (x , y ) ∈ S such i P that x2 = σ. Note that since P[|X| ≤ 2] = 0 by Assumption 3, we have m = σ∈Σ mσ. Furthermore we have E[mσ] = mpσ. Thus, by the Chernoff bound we have that for each σ ∈ Σ we have s 5|Σ| m ≥ mp − 2mp ln σ σ σ δ

with probability at least 1 − δ/5|Σ|. Since cB = 1 for each Hσ, to obtain the third bound we combine this lower bound for mσ with the following direct consequence of Theorem 6.2.1: s ! r 1 5|Σ| kHσ − Hˆ σkF ≤ 1 + ln , mσ δ

∆ which holds with probability at least 1 − δ/5|Σ| for each σ ∈ Σ. Now consider the matrix Hσ = δ δ1 |∆| ∆×∆×∆ ? + [Hσ ... Hσ ] ∈ R whose entries are HΣ(δi, δj, δk) = P[Y ∈ δiδjδk∆ | X ∈ ΣσΣ ]. It is easy

118 to check that this is a Hankel matrix with cB = 1. Then, by the Cauchy–Schwarz inequality and Theorem 6.2.1 we have s X δ ˆ δ p X δ ˆ δ 2 kHσ − HσkF ≤ |∆| kHσ − HσkF δ∈∆ δ∈∆ p ∆ ˆ ∆ = |∆|kHσ − Hσ kF s s ! |∆| 5|Σ| ≤ 1 + ln , mσ δ

with probability at least 1 − δ/5|Σ| for each σ ∈ Σ. The result now follows from a union bound. We are finally in position to state the main result of this section that gives a sample bound for Algorithm 7. To simplify the notation of the bound we will define quantities πX = minσ∈Σ pσ and sf = minσ∈Σ sn(HσV), which depend on both f and DX . The proof of the theorem is a simple calculation in view of Lemmas 7.4.4 and 7.4.5. In particular, we only need to recall that, since the rows of O all add up to 1, we have √ √ s1(M) = s1(OV) = kOVk2 ≤ kOk2 ≤ nkOk∞ = n . Theorem 7.4.6. Suppose Algorithm 7 receives as input a sample of size m generated from some dis- tribution DX ⊗ DY |X satisfying Assumptions 3. There exists a universal constant C such that for any t ≥ 0, ε > 0, δ > 0, if (t + 1)4n3/2|∆|2 |Σ| m ≥ C 2 4 4 2 ln ε πX sf sn(H) sn(OV) δ then with probability at least 1 − δ the hypothesis Aˆ returned by the algorithm satisfies X X |f(x, y) − fˆ(x, y)|P[X = x] ≤ ε . x∈Σ≤t y∈∆|x|

119 120 Chapter 8

The Spectral Method from an Optimization Viewpoint

The algebraic formulation of the spectral method is in sharp contrast with the analytic formulation of most machine learning problems using some form of optimization problem. Despite its simplicity, the algebraic formulation has one big drawback: it is not clear how to enforce any type of prior knowledge into the algorithm besides the number of states in the hypothesis. On the other hand, a very intuitive way to enforce prior knowledge in optimization-based learning algorithms is by adding regularization terms to the objective function. This and some other facts motivate our study of an optimization-based learning algorithm for WFA founded upon the same principles of the spectral method. In particular, we give a non-convex optimization algorithm that shares some statistical properties with the spectral method, and then show how to obtain a convex relaxation of this problem. The last section of the chapter discusses some issues that need to be taken into account when implementing our optimization-based algorithms in practice.

8.1 Maximum Likelihood and Moment Matching

We start by describing two well-known statistical principles for model estimation which provide inspira- tions for the spectral method and its optimization counterpart.

8.1.1 Maximum Likelihood Estimation

In statistics, the maximum likelihood principle is a general method for fitting a statistical model to a given set of examples. The rationale behind the method is that for given a sample and a fixed class of hypotheses, the best model in the class is the one that assigns maximum probability to the observed sample; that is, the model that makes it more likely to observe the given data. Two interesting properties of this method are that it is motivated by a simple intuition, and that it is easy to formulate for any given class of models. In general, if p is the density function of some probability distribution and S = (x1, . . . , xm) is a sample of i.i.d. examples drawn from some distribution p∗, the likelihood of S with respect to p is defined as m Y i L(S; p) = p(x ) . i=1

Sometimes it is also convenient to work with the log-likelihood, which is the logarithm of L:

m X i LL(S; p) = log L(S; p) = log p(x ) . i=1

121 With these definitions, if D is a family of statistical models, then the maximum likelihood (ML) estimate of p∗ in D given S is pˆML = argmax L(S; p) = argmax LL(S; p) . (ML) p∈D p∈D A very appealing statistical property of ML estimation is that in most cases, under mild assumptions ∗ ∗ on the hypothesis class D, one can show that if p ∈ D, then for m → ∞ one hasp ˆML → p under a suitable notion of convergence. In those cases it is said that ML estimation is consistent. The maximum likelihood principle is known to be consistent in the case of learning probabilistic automata with a fixed number of states. It was already shown in [BP66; Pet69] that ML estimation is consistent for HMM with discrete state and observation spaces. Furthermore, generalizations to more general state and observation spaces where given in [Ler92]. These results imply the consistency of PFA learning via ML because the class of distributions generated by HMM is the same as the one generated by PFA [DDE05]. Abe and Warmuth proved in [AW92] that something stronger than consistency holds for the ML principle applied to learning the structure and transition probabilities of probabilistic automata without stopping probabilities. In particular, they showed that given a sample S of strings of length t with size m ≥ O˜((t2n2|Σ|/ε2) log(1/δ)) sampled from some PFA p∗ with n states over Σ, then with probability at least 1 − δ the hypothesisp ˆML obtained via ML estimation with S on the class of PFA with n states over ∗ t Σ satisfies KL(p kpˆML) ≤ ε, where the models are compared over Σ . Note that this implies that for any ∗ √ fixed t one has KL(p kpˆML) → 0 with m → ∞, with convergence occurring at a rate of type O˜(1/ m). This is much stronger that simple consistency, whose definition does not require any particular rate of convergence and can, in general, converge much slowly. Unfortunately, despite the intrinsic statistical attractiveness of ML estimation, solving the optimiza- tion problem (ML) is in general a difficult problem from the algorithmic point of view. Thus, obtaining efficient learning algorithms by solving ML estimation is a challenging problem. The main reason for this is that optimization (ML) is rarely a convex problem. For example, in the case of PNFA with n states we see that the problem one needs to solve is: m X AˆML = argmax log(α0Axi α∞) . α0,α∞,{Aσ } i=1 The objective function of this optimization depends on the parameters of the target PNFA in a highly non-convex way. In fact, it turns out that even approximating the optimal solution of this optimization problem is hard [AW92]. In spite of not being able to solve optimization (ML) exactly in a reasonable time, the ML principle is commonly used by practitioners. Most of the times, this involves the use of heuristic methods to find a good local optimum of (ML). Among such heuristics, the most well-know is perhaps the Expectation- Maximization (EM) method, which in the case of HMM is sometimes also referred as the Baum–Welch algorithm [BPSW70]. This is an iterative heuristic algorithm for approximating ML parameter estimation on probabilistic models with latent variables. There are at least three good reasons why an iterative method like EM with no convergence guarantees is so widely used in practice. 1. Given enough running time, this type of heuristics do find good models in most applications. 
This suggests that either the applications are not big enough to reach the regime where the asymptotic complexity becomes impractical, or that the worst case lower bounds do not apply to the type of models one encounters in practice. 2. Heuristics based on ML estimation are usually as flexible as the ML method itself, making them easier to adapt to newly defined models. This allows practitioners to try different models easily in applications, and quickly design learning algorithm for newly defined models. 3. Incorporating prior knowledge on optimization problems via regularization terms is a standard practice in machine learning. Since ML estimation is defined in terms of an optimization problem it is susceptible to such practices. In particular, it is usual to combine ML estimation with model selection criteria like BIC or AIC [Sch78; Aka77] to express the prior knowledge conveyed by Occam’s razor; namely, that at similar levels of fitness simpler models are to be preferred.

122 We have already seen that the spectral method for learning stochastic WFA comes with polynomial guarantees on its running time, is consistent, and it is guaranteed to find solutions close to the optimum when given a finite sample. Thus, in terms of (1) above, the spectral method is superior to ML estimation, at least from a theoretical point of view. However, though the spectral method has been adapted to a large number of probabilistic models, there seems to be no general recipe so far for applying the spectral method to generic probabilistic models. In this respect, according to (2), the ML principle is still more appealing to practitioners. Point (3) above is the main concern of the present chapter. As we will see below, the spectral method for probabilistic models is based on the method of moments. This basically means that it is based on solving a set of equations relating moments and parameters in the target distribution. Since it is not clear how to enforce some prior knowledge in the solution of these equations, this makes ML estimation more appealing from the point of view of the flexibility provided by a wide choice of regularization methods. In the following sections we will describe optimization problems that behave like the spectral method and allow for several variations in regularization strategies.

8.1.2 The Method of Moments Roughly speaking, the methods of moments is a statistical inference technique that estimates the param- eters of a probability distribution by relating them to a set of moments defined in terms of the outcomes of the distribution. A paradigmatic example is the normal distribution, where its parameters, the mean µ and variance σ2, are in fact moments of the distribution. In this case the method of moments yields the estimate N (ˆµ, σˆ2), whereµ ˆ andσ ˆ2 are the empirical estimations of the mean and variance from a given sample. Its conception dates back to Pearson [Pea94] and is, thus, older than the ML principle. Furthermore, the fact that it is based on estimation of moments yields immediate consistence by the law of large numbers. However, for complex distributions (e.g. mixtures) deriving equations that relate mo- ments and methods parameters can be a painful exercise. Furthermore, such equations may not always be easy to solve algebraically. Therefore, despite being attractive from a theoretical point of view, the method of moments is not often used as the basis for deriving unsupervised learning algorithms. Nonetheless, it has been recently observed in [AHK12a] that the spectral method for learning HMM is in fact an instance of the method of moments. The nice computational and statistical properties of the spectral method, interpreted under this light, come from the fact that – at least for all HMM where the basis (Σ, Σ) is complete – the algorithm is an instance of the moment method where only moments of order at most three are used. This contrasts with previous moment algorithms that relied on the estimation of high-order moments, which are usually hard to estimate accurately due to their large variances. Furthermore, since the equations relating parameters and moments can be solved via SVD decomposition and other linear algebra operations, the error incurred by using moment estimations in such equations can be bounded with standard perturbation tools. Thus, we see that the spectral method for probabilistic models is an instance of a well-known method in statistics. Furthermore, since the particularities of the models to which the method is applied make it possible to learn the parameters using robust linear algebra techniques applied to low-order moments, it turns out to be a computationally and statistically appealing method. The main problem with the method, which is not shared with ML estimation, is the difficulty on generalizing it to new models and deriving variations that incorporate previous knowledge. Our goal in this chapter is to tackle this problem by taking the particular method of moments described by the spectral method and framing it as an optimization problem.

8.2 Consistent Learning via Local Loss Optimization

In this section we present a learning algorithm for WFA based on solving an optimization problem. In particular, the algorithm will minimize a certain loss function defined in terms of observed data about the target WFA. Casting learning algorithms as optimization problems is a common approach in machine learning, which has led to very successful classes of algorithms based on principles like empirical and structural risk minimization. In a sense that will soon become clear, our algorithm can be thought as a reinterpretation of the spectral method in terms of loss minimization. Though not entirely practical

123 due to the quadratic nature of the loss and constraints present in the optimization problem, this point of view on the spectral method yields interesting insights which are valuable for designing new classes of algorithms. This will be the subject of the following sections. In spirit, our algorithm is similar to the spectral method in the sense that in order to learn a function f :Σ? → R of finite rank, the algorithm infers a WFA using (approximate) information from a sub-block of Hf . The sub-block used by the algorithm is defined in terms of a set of prefixes P and suffixes S. Throughout this section we assume that f is fixed and has rank r, and that a basis (P, S) of f is given. We will describe our algorithm under the hypothesis that sub-blocks H and {Hσ}σ∈Σ of Hf are known exactly. It is trivial to modify the algorithm to work in the case when only approximations Hˆ and {Hˆ σ}σ∈Σ of the Hankel sub-blocks are known. s×n Given 1 ≤ n ≤ s = |S| we define the local loss function `n(X, β∞, {Bσ}) on variables X ∈ R , n n×n β∞ ∈ R and Bσ ∈ R for σ ∈ Σ as:

2 X 2 `n = kHXβ∞ − hP,λk2 + kHXBσ − HσXkF (8.1) σ∈Σ

Our learning algorithm is a constrained minimization of this local loss, which we call spectral optimization (SO): > min `n(X, β∞, {Bσ}) s.t. X X = I (SO) X,β∞,{Bσ }

Intuitively, this optimization tries to jointly solve the optimizations solved by SVD and pseudo-inverse in the spectral method based on Lemma 5.2.1. In particular, likewise for the SVD-based method, it can be shown that (SO) is consistent whenever n ≥ r and B = (P, S) is complete. This shows that the algorithms are in some sense equivalent, and thus provides a novel interpretation of spectral learning algorithms as minimizing a loss function on local data – when contrasted to any algorithm based on maximum likelihood – in the sense that only examples contained in the basis are considered in the loss function.

∗ ∗ ∗ Theorem 8.2.1. Suppose n ≥ r and B is a complete basis. Then, for any solution (X , β∞, {Bσ}) to ∗ > ∗ ∗ ∗ optimization problem (SO), the WFA B = hhλ,S X , β∞, {Bσ}i satisfies f = fB∗

The proof of this theorem is given in Section 8.2.1. Though the proof is relatively simple in the case n = r, it turns out that the case n > r is much more delicate – unlike in the SVD-based method, where the same proof applies to all n ≥ r. Of course, if H and {Hσ} are not fully known, but approximations Hˆ and {Hˆ σ} are given to the algorithm, we can still minimize the empirical local loss `ˆn and build a WFA from the solution using the same method of Theorem 8.2.1. Despite its consistency, in general the optimization (SO) is not algorithmically tractable because its objective function is quadratic non-positive semidefinite and the constraint on X is not convex. Nonetheless, the proof of Theorem 8.2.1 shows that when H and {Hσ} are known exactly, the SVD method can be used to efficiently compute an optimal solution of (SO). Furthermore, the SVD method can be regarded as an approximate solver for (SO) with an empirical loss function `ˆn as follows. Find ˆ ˆ ˆ ˆ first an X satisfying the constraints using the SVD of H, and then compute β∞ and {Bσ} by minimizing the loss (8.1) with fixed Xˆ – note that in this case, the optimization turns out to be convex. This is just one iteration of a general heuristic method known as alternate minimization that can be applied to quadratic objective functions on two variables where the objective is convex when one of the two variables is considered fixed. In this case the SVD provides a clever, though costly, initialization to the variable that will be fixed during the first iteration. From the perspective of an optimization algorithm, the bounds for the distance between operators recovered with full and approximate data given in Section 5.4 for the spectral method, can be restated as a sensitivity analysis of the optimization solved by the algorithm given here. In fact, a similar analysis can be carried out for (SO), though we shall not pursue this direction here.

124 8.2.1 Proof of Theorem 8.2.1 The following technical result will be used in the proof.

Lemma 8.2.2. Let f :Σ? → R be a function of finite rank r and suppose that (P, S) is a complete basis

for f. Then the matrix HΣ = [Hσ1 ... Hσ|Σ| ] has rank r.

Proof. Let A = hα0, α∞, {Aσ}i be a minimal automaton for f and denote by H = PS the rank

factorization induced by A. Since by construction we have HΣ = P [Aσ1 S,..., Aσ|Σ| S], suffices to

show that rank([Aσ1 S ... Aσ|Σ| S]) = r. Since each column of S is of the form sv = Avα∞ for some r suffix v ∈ S, and rank(S) = r, we have R = span({sv}v∈S ) ⊆ ∪σ∈Σ range(Aσ). Now, since for all

σ ∈ Σ one has span({Aσsv}v∈S ) = range(Aσ), we see that the columns of [Aσ1 S,..., Aσ|Σ| S] span r ∪σ∈Σ range(Aσ) = R . From which it follows that rank(HΣ) = r. The proof of Theorem 8.2.1 is divided into the following four claims. Claim 8.1. The optimal value of problem (SO) is zero. > s×n Let H = UΛV be a full SVD of H and write Vn ∈ R for the n right singular vectors corre- sponding to the n largest singular values (some of which might be zero since n ≥ r). Note that we have > > Vn Vn = I and HVnVn = H. We will see that using these matrices we can construct a solution to (SO) with zero loss: + + `n(Vn, (HVn) hP,λ, {(HVn) HσVn}) = 0 . + Recall that HVn(HVn) acts as an identity in the span of the columns of HVn. Thus, writing hP,λ = > HVnVn eλ w can see that + (HVn)(HVn) hP,λ = hP,λ .

Furthermore, since rank([H, Hσ]) = rank(H), we have

+ HVn(HVn) HσVn = HσVn .

This verifies the claim. > + + Claim 8.2. For any n ≥ r, An = hhλ,S Vn, (HVn) hP,λ, {(HVn) HσVn}i satisfies fAn = f.

We will show that for any n > r one has fAn = fAr . Then the claim will follow from Lemma 5.2.1, > since Ar is the WFA corresponding to the rank factorization H = (HVr)(Vr ). Write Πn = [Ir, 0] ∈ r×n R . Since any singular vector in Vn which is not in Vr is orthogonal to the rowspace of H and Hσ, by > construction we have ΠnAnΠn = Ar. Now we consider the factorization H = PnSn induced by An. If > we show that the rows of Pn lie in the span of the rows of HVn, then PnΠn Πn = Pn and Lemma 5.2.3 > > tells us that fAn = fAr . Write An = hα0, α∞, {Aσ}i. We prove by induction on |u| that αu = α0 Au lie in the span of the rows of HVn, which we denote by H. This will imply the claim about the rows of > > Pn. For |u| = 0 we trivially have α0 = hλ,S Vn ∈ H. Furthermore, by the induction hypothesis we have > > p αu = γu HVn for some γu ∈ R . Thus we get

> > > + > αuσ = αu Aσ = γu HVn(HVn) HσVn = γu HσVn ∈ H .

∗ ∗ ∗ 0 > ∗ 0 0 Claim 8.3. Let (X , β∞, {Bσ}) be an optimal solution to (SO) and define B = hhλ,S X , β∞, {Bσ}i, 0 ∗ + ∗ 0 ∗ + where Bσ = (HX ) HσX and β∞ = (HX ) hP,λ. Then fB∗ = fB0 . ∗ ∗ ∗ ∗ ∗ Since the optimal value of the objective is zero, we must have HX Bσ = HσX and HX β∞ = ∗ 0 0 hP,λ. Thus, by properties of the Moore–Penrose pseudo-inverse we must have that Bσ = Bσ + Cσ and ∗ 0 0 β∞ = β∞ + γ , where

0 ∗ + ∗ Cσ = (I − (HX ) HX )Cσ , γ0 = (I − (HX∗)+HX∗)γ ,

n×n n for some arbitrary Cσ ∈ R and γ ∈ R . Now the claim follows from similar arguments as the ones ∗ > > ∗ ∗ used above, showing by induction on the length of u that (βu) = hλ,S X Bu lies in the span of the rows of HX∗.

125 0 Claim 8.4. With notation from previous claims, we have fB = fAn In the first place, note that Lemma 8.2.2 and the equation h i HX∗ B0 ,..., B0 H X∗,..., H X∗ σ1 σ|Σ| = [ σ1 σm ]

∗ ∗ > 0 necessarily imply HX (X ) = H. Using this property, we can show that B = NAnM with N = ∗ > > ∗ (X ) Vn and M = Vn X . Thus, the claim follows from Lemma 5.2.3, where the condition PnMN = Pn can be verified via an induction argument on the length of prefixes. Summarizing, the consistency of the algorithm given by (SO) follows from the chain of equalities ∗ 0 fB = fB = fAn = fAr = f proved in the previous claims.

8.3 A Convex Relaxation of the Local Loss

We shall now present a convex relaxation of the optimization on (SO). Besides the obvious advantage represented by having a convex problem, our relaxation addresses another practical issue: that the only parameter a user can adjust in (SO) in order to trade accuracy and model complexity is the number of states n. Though the discreteness of this parameter allows for a fast model selection scheme through a full exploration of the parameter space, in some applications one may be willing to invest some time in exploring a larger, more fine-grained space of parameters, with the hope of reaching a better trade-off between accuracy and model complexity. The algorithm presented here does this by incorporating a continuous regularization parameter. The main idea in order to obtain a convex optimization problem similar to (SO) will be to remove the projection X, since we have already seen that it is the only source of non-convexity in the optimization. However, the new convex objective will need to incorporate a term that enforces the optimization to behave in a similar way as (SO). In particular we need to enforce that the operators of the model we learn have their dynamics inside the space spanned by the rows of H and Hσ. First note that the choice of n effectively restricts the maximum rank of the operators Bσ. Once this maximal rank is set, X can be interpreted as enforcing a common “semantic space” between the different operators Bσ by making sure each of them works on a state space defined by the same projection of H. Furthermore, the constraint on X tightly controls its norm and thus ensures that the operators Bσ will also have its norm tightly controlled to be in the order of kHak/kHk – at least when n = r, see the proof of Theorem 8.2.1. Thus, in order to obtain a convex optimization similar to (SO) we do the following. First, take n = s and fix X = Is, thus unrestricting the model class and removing the source of non-convexity. Then

penalize the resulting objective with a convex relaxation of the term rank([Bσ1 ,..., Bσ|Σ| ]), which makes sure the operators have low rank individually, and enforces them to work on a common low-dimensional state space. As a relaxation of rank(M) we us the nuclear norm kMk? (also known as trace norm), which is known to provide a good convex proxy for the rank function in rank minimization problems P [Faz02]. The intuition behind this fact in the expression kMk = i si(M) showing the nuclear norm is in fact a Schatten norm which corresponds to the L1 norm of the singular values of M. More formally, for any regularization parameter τ > 0, the relaxed local loss `˜τ (BΣ) on a matrix s×|Σ|s variable BΣ ∈ R is defined as: ˜ 2 `τ = kBΣk? + τkHBΣ − HΣkF ,

where BΣ = [Bσ1 ,..., Bσ|Σ| ] is a block-concatenation of the operators, and HΣ = [Hσ1 ,..., Hσ|Σ| ]. Since `˜ is clearly convex on BΣ, we can learn a set of operators by solving the convex optimization problem

min `˜(BΣ) . (CO) BΣ

∗ ∗ > ∗ s Given an optimal solution BΣ of (CO), we define a WFA B = hhλ,S , eλ, BΣi, where eλ ∈ R is the coordinate vector with eλ(λ) = 1. Some useful facts about this optimization are collected in the following proposition.

126 Proposition 8.3.1. The following hold:

1. If H has full column rank, then (CO) has a unique solution

∗ ˜∗ 2. For n = s and τ ≥ 1, the optimum value `s of (SO) and the optimum value `τ of (CO) satisfy ∗ ˜∗ `s ≤ `τ

> > 3. Suppose rank(H) = rank([HΣ, H]) and let [HΣ, H] = UΛ[VΣ V ] be a compact SVD. Then, BΣ = > + > (V ) VΣ is a closed form solution for (CO) when τ → ∞

Proof. Fact (1) follows from the observation that when H has full rank the loss `˜τ is strictly convex. To 0 2 see this note that kBΣk? is strictly convex by definition and let g(B) = kHB − H kF for some fixed H and H0. Then, for any B and B0, and any 0 ≤ t ≤ 1 we have

g(tB + (t − 1)B0) 0 0 2 = ktHB + (1 − t)HB − H kF 0 0 0 2 = kt(HB − H ) + (1 − t)(HB − H )kF 2 0 2 2 0 0 2 = t kHB − H kF + (1 − t) kHB − H kF + 2t(1 − t)hHB − H0, HB0 − H0 i p p ≤ t2g(B) + (1 − t)2g(B0) + 2t(1 − t) g(B) g(B0) (8.2) ≤ tg(B) + (1 − t)g(B0) , (8.3)

where we have used the Cauchy–Schwarz inequality and that for any x, y ≥ 0 the following holds

0 ≥ t(t − 1)(x − y)2 = t2x2 + (1 − t)2y2 + 2t(1 − t)xy − tx2 − (1 − t)y2 .

Note that (8.2) is an equality if and only if either HB − H0 = α(HB0 − H0) for some α > 0, or one of HB − H0 = 0 or HB0 − H0 = 0 holds. Furthermore (8.3) is an equality if and only if g(B) = g(B0). Thus, we see that we have an overall equality if and only if HB = HB0, which implies B = B0 when ∗ H has full column rank. For fact (2), suppose BΣ achieves the optimal value in (CO) and check that ∗ ∗ ∗ 2 ˜∗ `s ≤ `s(I, eλ, BΣ) = kHBΣ − HΣkF ≤ `τ . Fact (3) follows from Theorem 2.1 in [LSY10] and the observation that when τ → ∞ optimization (CO) is equivalent to minBΣ kBΣk? s.t. HBΣ = HΣ.

Note that in general approximations Hˆ of H computed from samples will have full rank with high probability. Thus, fact (1) tells us that either in this case, or when p = n, optimization (CO) has a unique optimum. Furthermore, by fact (2) we see that minimizing the convex loss is also, in a relaxed sense, minimizing the non-convex loss which is known to be consistent. In addition, fact (3) implies that when H has full rank and τ is very large, we recover the spectral method with n = s.

8.4 Practical Considerations

Convex optimization problems can in general be (approximately) solved in polynomial time by generic algorithms, like the ellipsoid method. However, these general approach rarely yields practically efficient algorithms. For problems with specific forms, different optimizations algorithms that explicitly exploit the structure in the problem have been designed. In this section we provide two approaches for solving optimization (CO) in practice. Framing the problem of WFA learning as a convex optimization problem has a very interesting conse- quence: now we can enforce different forms of prior knowledge on the hypothesis of our learning algorithm by adding convex constraints and regularizers to (CO). We will mention two particular examples here, but note that there are many other possibilities.

127 8.4.1 A Semidefinite Programming Approach Optimization (CO) can be restated in several ways. In particular, by standard techniques, it can be shown that it is equivalent to a Conic Program on the intersection of a semi-definite cone (given by the nuclear norm), and a quadratic cone (given by the Frobenius norm). Similarly, the problem can also be fully expressed as a semi-definite program. Though in general this conversion is believed to be inefficient for algorithmic purposes, it has the advantage of giving the problem in a form that can be solved by many general-purpose optimization packages for semidefinite programming. In this section we show how to obtain a semidefinite program for the following variant of (CO) where we have removed the square on the Frobenius norm:

min kBΣk? + τkHBΣ − HΣkF . (CO') BΣ

We begin by introducing some notation. For any matrix M ∈ Rd1×d2 we use tr(M) to denote its trace and vec(M) to denote a column vector in Rd1d2 containing all the elements of M. Recall that a symmetric matrix M ∈ Rd×d is positive semidefinite if all its eigenvalues are non-negative, a fact we denote by M  0. Now we state without proof two well-known facts about semidefinite programming that provide representations for the Frobenius and nuclear norm.

d1×d2 Lemma 8.4.1 ([BV04]). For any M ∈ R and t ≥ 0 one has kMkF ≤ t if and only if

 t · I vec(M)  0 . vec(M)> t

d1×d2 Lemma 8.4.2 ([Faz02]). For any M ∈ R and t ≥ 0 one has kMk? ≤ t if and only if there exist d1×d1 d2×d2 symmetric matrices X1 ∈ R and X2 ∈ R such that tr(X1) + tr(X2) ≤ 2t and   X1 M >  0 . M X2

Combining these two results with a standard reduction one can prove the following.

Theorem 8.4.3. Optimization problem (CO') is equivalent to the following semidefinite program: 1 min (tr(X1) + tr(X2)) + τt BΣ,t,X1,X2 2   X1 BΣ 0 0 > B X2 0 0  subject to  Σ   0 .  0 0 t · I vec(HBΣ − HΣ) > 0 0 vec(HBΣ − HΣ) t

The above problem has O(ps|Σ| + p2 + s2|Σ|2) variables and a constraint space of dimension (p + s|Σ| + ps|Σ| + 1)2. Though general solvers may not be able to deal with this problem for large basis and alphabets, perhaps a special-purpose interior-point method could be obtained to solve this problem efficiently (e.g. see [ADLV]).

8.4.2 A Proximal Gradient Optimization Algorithm Besides reducing an optimization problem to a standard form like in previous section, another way to obtain efficient algorithms for solving this type of problems is to exploit their particular structure. In our case it turns out that the objective function in (CO) falls into the category of so-called separable convex objectives. These can be basically decomposed as the sum of a convex smooth term – the Frobenius norm in our case – and a convex non-smooth term – the nuclear norm. In this section we briefly discuss how to fit (CO) into the setting of proximal gradient optimization; for more details, convergence proofs, and stopping conditions we direct the reader to [Nes07; BT09].

128 Let us assume that one needs to solve the following optimization problem of the form

min f(x) + g(x) , (8.4) d x∈R where the loss function `(x) = f(x) + g(x) is expressed as the sum of a smooth convex function f(x) and a convex and possibly non-smooth g(x). We also assume that the first derivative of f is Lipschitz continuous; that is, there exists a constant L > 0 such that for all x, y ∈ Rd we have

k∇f(x) − ∇f(y)k ≤ Lkx − yk .

d d The proximal operator for g parametrized by a positive number η is the non-linear map Pg,η : R → R defined as   1 2 Pg,η(y) = argmin g(x) + kx − yk . d η x∈R 2

Assuming the gradient ∇f and the proximal operator Pg,η can be computed exactly, and (a bound for) L is known, the following proximal-gradient iterative descent scheme provides a fast converging algorithm d for solving (8.4). Choose x0 ∈ R , let y0 = x0, and take γ0 = 1. Then for k ≥ 0 iterate the following:

 ∇f(y ) x = P y − k , k+1 g,1/L k L p 2 1 + 1 + 4γk γk+1 = , (PGD) 2 γk − 1 yk+1 = xk+1 + (xk+1 − xk) . γk+1

∗ This particular scheme is called FISTA and was introduced in [BT09]. It was shown there that xk → x ∗ ∗ for some optimal x point of (8.4), and that this convergence occurs at a rate such that |`(xk) − `(x )| ≤ O(1/k2). 2 Note that in the particular case of (CO) we have f(BΣ) = τkHBΣ − HΣkF and g(BΣ) = kBΣk?. In order to apply the proximal-gradient optimization to problem (CO) we first observe that

> ∇f(BΣ) = 2τH (HBΣ − HΣ) ,

> which is Lipschitz continuous with L = 2τkH HkF . Now, the proximal operator associated to g has a closed-form expression in terms of a singular value shrinkage operator. Let M ∈ Rd1×d2 be a matrix of > r×r rank r with SVD given by M = UΛV where Λ ∈ R is a diagonal matrix with entries (s1,..., sr). The shrinkage operator with parameter η > 0 applied to matrix M is defined as

> shrη(M) = U shrη(Λ)V ,

r×r where shrη(Λ) ∈ R is a diagonal matrix with entries shrη(Λ)(i, i) = max{si − η, 0}. With these definitions, it was shown in [CCS10] that for any M and η on has

Pk·k?,η(M) = shrη(M) .

Thus we can easily implement (PGD) to solve (CO) using these expressions for ∇f and Pg,η. Note that the naive implementation of this scheme requires an SVD computation at each iteration. Clearly, this yields a method which is computationally worse than the standard spectral method which only requires the computation of a single SVD. However, schemes like (PGD) are known to converge even if only an approximate computation of Pg,η is available [SLRB11]. Thus, using fast approximations of SVD to compute the proximal operator of the nuclear norm in a proximal-gradient descent method can yield efficient alternatives to the spectral method with the advantages of being stated as an optimization problem; a good survey paper on approximate matrix factorization is [HMT11].

129 8.4.3 Convex Constraints for Learning Probabilistic Automaton We already mentioned in Chapter 6 that in general the output of spectral algorithms for learning stochas- tic WFA is a WFA that does not necessarily realize a probability distribution. We know it will approxi- mate a probability distribution by our PAC results, but in general it can assign negative “probabilities” to some strings. This is not only a theoretical problem, but it has been reported in several experimen- tal studies [JZKOPK05; BQC12; CSCFU13]. Sometimes this makes it impossible to directly evaluate models learned with spectral methods using entropy-based measures like the perplexity. Besides the use of several heuristics in the experimental studies we just mentioned, some alternative spectral methods have been derived to address this problem [ZJ10; Bai11b]. These solutions are based on restricting the class of models used as a hypothesis class by imposing a particular structure on the possible hypothesis which ensures that they assign non-negative outputs to every string. Here we present another solution along these lines based on our convex optimization problem (CO). The hypothesis class of our algorithm is that of all PNFA. Thus, we believe that our solution to this problem is much more general than all previous ones. The main idea behind our approach is to take optimization (CO) and impose a set of convex con- straints on the operator matrix AΣ that ensures the output represents a probabilistic automaton. In order to do that, it is convenient to re-derive our optimization algorithm starting from a different variant of the spectral method than the one we gave in Section 5.3. In particular, we note that taking the rank > factorization Hλ = U(ΛV ) of a Hankel matrix of rank r given by a compact SVD decomposition, Lemma 5.2.1 yields the following spectral method:

> > > + α0 = hλ,S (U Hλ) , > α∞ = U hP,λ , > > + Aσ = U Hσ(U Hλ) . Now, if one goes through all the derivations in Sections 8.2 and 8.3 using this spectral method as a starting point, one ends up with the following optimization problem:

2 min kBΣk? + τkBΣH − HΣkF , (CO-U) BΣ B ∈ p|Σ|×p B> B> ,..., B> H> H> ,..., H> where now Σ R is the matrix of operators Σ = [ σ1 σ|Σ| ], and Σ = [ σ1 σ|Σ| ]. We note that the term inside the Frobenius norm is not the transpose of the term inside the Frobenius ∗ norm in (CO). In this case, if BΣ is a solution to (CO-U), then the WFA computed by the method ∗ > ∗ is given by B = heλ , hP,λ, BΣi. The interesting point here is that now β0 = eλ can be regarded – without the need to perform any normalization – as a probability distribution over initial states that assigns probability 1 to a single state. In addition, β∞ = hP,λ can be regarded as a vector of stopping probabilities. That is, the method automatically yields initial and final weights that meet the requirements of a probabilistic automaton; see Section 6.1. Now we can modify optimization problem (CO-U) to impose that the operators BΣ also satisfy the restrictions of a probabilistic automaton. In particular, recall that this entails the following:

1. Bσ(i, j) ≥ 0 for all σ ∈ Σ and i, j ∈ [p], P P 2. β∞(i) + σ∈Σ j∈[p] Bσ(i, j) = 1 for all i ∈ [p]. It is immediate to see that these two conditions define a set of convex constraints on the operator matrix ? BΣ. So if f :Σ → R is a probability distribution and B = (P, S) is a basis, given Hankel sub-blocks P×S PΣ×S H ∈ R and HΣ ∈ R from Hf , we can try to find a probabilistic automaton for f by solving the following convex optimization problem:

$$\begin{aligned} \min_{B_\Sigma} \;\; & \|B_\Sigma\|_* + \tau \|B_\Sigma H - H_\Sigma\|_F^2 \\ \text{subject to} \;\; & B_\Sigma \ge 0 , \\ & h_{P,\lambda}(u) + \sum_{\sigma \in \Sigma,\, v \in S} B_\Sigma(u\sigma, v) = 1 \quad \forall u \in P . \end{aligned} \tag{CO-P}$$
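As a rough illustration of how (CO-P) can be posed with an off-the-shelf convex modeling tool, the following sketch uses CVXPY. It assumes a square setting with p = |P| = |S|, so that BΣ has shape p|Σ| × p as in the text; all names are illustrative, and this is a sketch rather than the implementation used in the thesis.

```python
import numpy as np
import cvxpy as cp

def solve_co_p(H, H_blocks, h_P_lambda, tau):
    """H: p x p Hankel block; H_blocks: list of p x p blocks H_sigma in a
    fixed alphabet order; h_P_lambda: length-p vector of values f(u), u in P."""
    p = H.shape[0]
    k = len(H_blocks)
    H_Sigma = np.vstack(H_blocks)            # (k*p) x p stacked blocks
    B = cp.Variable((k * p, p))
    objective = cp.Minimize(cp.norm(B, "nuc")
                            + tau * cp.sum_squares(B @ H - H_Sigma))
    constraints = [B >= 0]                   # entrywise non-negativity
    for i in range(p):                       # one stochasticity constraint per u
        row_sum = sum(cp.sum(B[s * p + i, :]) for s in range(k))
        constraints.append(h_P_lambda[i] + row_sum == 1)
    cp.Problem(objective, constraints).solve()
    return B.value
```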

We have already seen that trying to use generic convex optimization methods for solving large scale problems is not always a good idea. Thus, in the rest of this section we will give a specialization of the alternating direction method of multipliers (ADMM) for solving optimization (CO-P). This method is described at length in [BPCPE11], which also gives indications on how to parallelize it for large scale computations. For a given convex set C let I_C denote the convex indicator function for which I_C(x) = 0 when x ∈ C and I_C(x) = ∞ when x ∉ C. The key observation is that we can write problem (CO-P) as

$$\min_{B_\Sigma} \; f_1(B_\Sigma) + f_2(B_\Sigma) + I_{C_1}(B_\Sigma) + I_{C_2}(B_\Sigma) ,$$

where f1(BΣ) = ‖BΣ‖⋆ and f2(BΣ) = τ‖BΣH − HΣ‖F² are convex functions, and

$$C_1 = \left\{ M \in \mathbb{R}^{p|\Sigma|\times p} : M \ge 0 \right\} , \qquad C_2 = \left\{ M \in \mathbb{R}^{p|\Sigma|\times p} : \forall u \in P ,\; h_{P,\lambda}(u) + \sum_{\sigma\in\Sigma,\, v\in S} M(u\sigma, v) = 1 \right\}$$

are convex sets. Using this formulation, ADMM's insight is that one can separate this problem into four parts and then impose a consensus on the solutions of each part. Writing f3 = I_{C1} and f4 = I_{C2}, this translates into the following:

$$\min_{B_i, Z} \; f_1(B_1) + f_2(B_2) + f_3(B_3) + f_4(B_4) ,$$

$$\text{subject to} \quad B_1 - Z = 0, \; B_2 - Z = 0, \; B_3 - Z = 0, \; B_4 - Z = 0 .$$

Introducing dual variables Yi for i ∈ {1, 2, 3, 4}, this problem can be solved by an iterative scheme as follows. First, choose starting points Bi^0, Yi^0, Z^0 for i ∈ {1, 2, 3, 4} and a rate parameter ρ > 0. Then, for k ≥ 0 iterate the following:

$$\begin{aligned} B_i^{k+1} &= \operatorname*{argmin}_{B_i} \left( f_i(B_i) + \frac{\rho}{2} \left\| B_i - Z^k + Y_i^k \right\|_F^2 \right) , \\ Z^{k+1} &= \frac{1}{4} \sum_{i=1}^4 \left( B_i^{k+1} + Y_i^k \right) , \\ Y_i^{k+1} &= Y_i^k + B_i^{k+1} - Z^{k+1} . \end{aligned} \tag{ADMM}$$

We note that the B-updates in this scheme are in fact proximal operators. Indeed, we have

$$\operatorname*{argmin}_{B_i} \left( f_i(B_i) + \frac{\rho}{2} \left\| B_i - Z^k + Y_i^k \right\|_F^2 \right) = P_{f_i, 1/\rho}\!\left( Z^k - Y_i^k \right) .$$

We have already seen that for i = 1 this operator can be computed via an SVD decomposition. In addition, since f2 is smooth, the operator P_{f2,1/ρ} can be computed analytically. Furthermore, since f3 and f4 are indicator functions, in this case the proximal operator reduces to an orthogonal projection onto the corresponding convex set. Hence, we get an easy-to-implement iterative method for solving (CO-P) and obtaining a proper PNFA via a spectral-like method. Convergence results for this optimization scheme and guidelines on how to choose ρ can be found in [BPCPE11].
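The following numpy sketch summarizes the whole scheme under the same square-basis assumption as before (p = |P| = |S|). All names are illustrative; the prox of f1 is singular value thresholding, the prox of f2 solves its regularized least-squares subproblem in closed form, the prox of f3 and f4 are the two projections just mentioned, and a fixed iteration count stands in for a proper stopping criterion.

```python
import numpy as np

def svt(M, t):
    """Proximal operator of the nuclear norm: soft-threshold singular values."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt

def admm_co_p(H, H_Sigma, h, tau, rho=1.0, iters=500):
    p = H.shape[0]
    k = H_Sigma.shape[0] // p
    Z = np.zeros_like(H_Sigma)
    B = [Z.copy() for _ in range(4)]
    Y = [Z.copy() for _ in range(4)]
    # Precompute the linear system solved by the prox of f2.
    G = 2 * tau * H @ H.T + rho * np.eye(p)
    for _ in range(iters):
        V = [Z - Y[i] for i in range(4)]          # prox arguments Z^k - Y_i^k
        B[0] = svt(V[0], 1.0 / rho)               # prox of the nuclear norm
        B[1] = (2 * tau * H_Sigma @ H.T + rho * V[1]) @ np.linalg.inv(G)
        B[2] = np.maximum(V[2], 0.0)              # projection onto C1
        B[3] = V[3].copy()                        # projection onto C2: shift
        for i in range(p):                        # each row-group to sum 1 - h(u)
            rows = [s * p + i for s in range(k)]
            total = h[i] + B[3][rows, :].sum()
            B[3][rows, :] += (1.0 - total) / (k * p)
        Z = sum(B[i] + Y[i] for i in range(4)) / 4.0
        for i in range(4):
            Y[i] = Y[i] + B[i] - Z
    return Z
```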

Chapter 9

Learning Weighted Automata via Matrix Completion

In this chapter we use the spectral method to tackle the problem of learning general WFA from labeled examples. This means that we are given a sample of string–label pairs z = (x, y), where x ∈ Σ⋆ is a string and y ∈ R is a real label, and are asked to produce a WFA A that is a good model of this sample, in the sense that fA(x) ≈ y for all the examples in the sample. This is a typical problem in supervised learning known as regression. The special difficulty here is that strings are a type of data for which it is not obvious how to find a finite-dimensional vector representation. Thus, typical regression methods for real vector spaces cannot be applied directly. Though a kernel method could be used in this case, we choose to take an alternative approach by giving an algorithm for finding a hypothesis in the class of WFA directly. Our algorithm is based on combining the spectral method with a matrix completion algorithm applied to Hankel matrices. Since the assumptions in the learning framework used in this chapter are different from previous PAC analyses, we begin in Section 9.1 by describing in detail the learning framework we will be using. We also give an idea of the type of results we will be able to prove for our algorithm. Then, in Section 9.2 we describe a family of algorithms for learning WFA from labeled samples. We analyze one particular algorithm from this family in Section 9.3, and give a generalization bound based on algorithmic stability.

9.1 Agnostic Learning of Weighted Automata

The main assumption in all PAC learning results we have given so far is that the data given to the learning algorithm was generated by a model from some class which is known a priori. We then asked the algorithm to choose a hypothesis from a class containing this unknown model. Though very convenient from a theoretical point of view, this learning framework might sound a bit unrealistic from the perspective of a practitioner who needs an accurate model for her data but is not willing to assume that such data was generated from some particular model. In this chapter we take a paradigm shift by moving away from the realizability assumption of PAC learning and into the agnostic setting that characterizes most of the literature in statistical learning. We now introduce the learning framework that we will be using in this chapter. Then we will discuss what kind of guarantees one might ask of a learning algorithm in this framework.

Let D be a probability distribution over Σ⋆ × R. That is, if we draw an example from D we get a pair z = (x, y), where x is a finite string over Σ⋆ and y is a real label assigned to x. The case where x is generated from some distribution D′ over Σ⋆ and y = f(x) for some real function f : Σ⋆ → R is included in this setting as a particular case. Note however that the above setting also includes more general distributions. For example, the agnostic framework can also model the noisy labeling setting where y = f(x) + N for some real random variable N that models the presence of random noise in the labelling function.

Learning a WFA in this setting means the following. Suppose we are given a sample S = (z1, …, zm) of i.i.d. examples drawn from some distribution D over Σ⋆ × R. Then, given some number of states n, we want to take S as input and produce a WFA A = ⟨α0, α∞, {Aσ}⟩ with at most n states such that fA is, hopefully, a “good model for D.” The generalization error (or risk) of fA is the usual way of assessing how good the model A is. In particular, let ℓ : R × R → R+ be a loss function that quantifies the error ℓ(y, y′) of predicting label y when y′ is the correct one. Then, the generalization error of a function f : Σ⋆ → R with respect to D is defined as

$$R_D(f) = \mathbb{E}_{(x,y) \sim D}[\ell(f(x), y)] .$$

That is, R_D(f) is the expected error, measured by ℓ, of using f to predict the labels generated by D. When D is clear from the context we may just write R(f). Another tightly related quantity is the empirical error (or empirical risk) incurred by f on the sample S, defined as:

$$\hat{R}_S(f) = \hat{\mathbb{E}}_S[\ell(f(x), y)] = \frac{1}{m} \sum_{i=1}^m \ell(f(x_i), y_i) .$$

That is, the empirical estimate of R(f) based on the sample S from D. The most natural thing to ask from an algorithm in this setting is, perhaps, that it returns a hypothesis that performs almost as well as the best one in a given class. In particular, let WFAn be the class of all WFA our algorithm can produce as output, e.g. WFA with n or fewer states. Define AOPT to be the best hypothesis in WFAn in terms of generalization error:

$$A_{\mathrm{OPT}} = \operatorname*{argmin}_{A \in \mathrm{WFA}_n} R(f_A) .$$

Now, denote by AS the output of a learning algorithm on input a sample S of size m from D. Suppose there exists some ε = ε(m) = o(1) such that with high probability over the choice of S it holds that

$$R(f_{A_S}) \le R(f_{A_{\mathrm{OPT}}}) + \varepsilon .$$

This means that as the number of examples given to the algorithm grows, its output converges to the best hypothesis in the class. In the particular case where ε = O(1/m^c) for some c > 0 and AS can be computed in time poly(m), one says that the algorithm is an efficient agnostic PAC learner for D [Hau92; KSS94]. It is well known that the statistical requirements of agnostic PAC learning can be achieved for many hypothesis classes via empirical risk minimization [Vap99]. However, in general this involves non-tractable optimization problems, and there are very few classes of functions known to be efficiently agnostic PAC learnable; see [Maa94; LBW96; GKS01; KKMS05; KMV08] and references therein. In the particular case of WFA, the hardness results in [PW93] stating that it is NP-hard to approximate the minimal DFA consistent with a given sample stand in the way of any efficient algorithm for performing empirical risk minimization for this class of hypotheses.

Nonetheless, there is a simpler and more basic property that one can ask for in algorithms in the agnostic setting. This is that the hypothesis produced has good generalization performance, or in other words, that the algorithm does not suffer from overfitting. In cases where the hypothesis class is very large, overfitting means that an algorithm produces a hypothesis that models the available data very accurately but fails to capture the behavior of future examples drawn from the same distribution as the training sample. In particular, if f is a hypothesis learned from some sample S, its generalization error with respect to D is defined as the difference R(f) − R̂S(f). Note that this quantity is large whenever f has small empirical error on S but large expected error on examples freshly drawn from D. Thus, we can say that an algorithm avoids overfitting if there exists some ε = ε(m) = o(1) such that with high probability over the choice of S one has

$$R(f_S) \le \hat{R}_S(f_S) + \varepsilon ,$$

where fS is the hypothesis produced by the algorithm on input S. A bound of this form is known as a generalization bound, and it states that, for large samples, the predictive behavior of fS on S is a good approximation to the predictive behavior of fS on future examples.
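As a small illustration of the quantities R(f) and R̂S(f) for WFA hypotheses, the following toy sketch assumes the standard WFA evaluation fA(x) = α0^⊤ Ax1 ⋯ Axt α∞ and the absolute loss used later in Section 9.3; all names are illustrative.

```python
import numpy as np

def wfa_value(alpha0, alpha_inf, A, x):
    # f_A(x) = alpha0^T A_{x_1} ... A_{x_t} alpha_inf
    v = alpha0
    for symbol in x:
        v = v @ A[symbol]
    return float(v @ alpha_inf)

def empirical_risk(alpha0, alpha_inf, A, sample):
    # \hat{R}_S(f_A) with the absolute loss l(y, y') = |y - y'|
    return float(np.mean([abs(wfa_value(alpha0, alpha_inf, A, x) - y)
                          for x, y in sample]))
```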

Some standard ways of proving generalization bounds rely on measuring the complexity of the hypothesis class used by the learning algorithm. Such measures include the Vapnik–Chervonenkis dimension, or the Rademacher complexity [VC71; Kol01]. Interestingly, such methods yield generalization bounds that hold uniformly over the whole class of hypotheses and hold independently of the particular learning algorithm being used. Unfortunately, there are hypothesis classes for which these bounds become vacuous, or for which we do not know how to compute the complexity of the hypothesis class. In these cases – WFA being one of them – generalization bounds can still be obtained by alternative methods. In particular, we will use here the stability method [KR99], which can be used to analyze the generalization error of hypotheses produced by a fixed learning algorithm. In summary, in the following sections we will present and analyze an algorithm for learning general WFA from labeled examples in the agnostic setting. The analysis that we give proves a generalization bound for the hypotheses produced by our algorithm. This means the algorithm does not overfit the training data.

9.2 Completing Hankel Matrices via Convex Optimization

The main obstruction in leveraging the spectral method for learning general weighted automata from labeled examples is that of the missing entries in the Hankel matrix. When learning stochastic WFA, the statistics used by the spectral method are essentially the probabilities assigned by the target distribution to each string in PS. As we have already seen, increasingly large samples yield uniformly convergent estimates for these probabilities. Thus, it can be safely assumed that the probability of any string from PS not present in the sample is zero. When learning arbitrary WFA, however, the value assigned by the target to an unseen string is unknown. Furthermore, one cannot expect that a sample will contain the values of the target function for all the strings in PS. This simple observation raises the question of whether it is possible at all to apply the spectral method in a setting with missing data, or, alternatively, whether there is a principled way to “estimate” this missing information and then apply the spectral method. It turns out that the second of these approaches can be naturally formulated as a constrained matrix completion problem. When applying the spectral method, the (approximate) values of the target on B = (P, S) are arranged in the matrix H ∈ R^{P×S}. Thus, the main difference between the stochastic and non-stochastic setting can be restated as follows: when learning stochastic WFA, unknown entries of H can be filled in with zeros, while in the general setting there is a priori no straightforward method to fill in the missing values.

In this section we describe a matrix completion algorithm for solving this last problem. In particular, since H is a Hankel matrix whose entries must satisfy some equality constraints, it turns out that the problem of learning general WFA from labeled examples leads to what we call the Hankel matrix completion problem. This is essentially a matrix completion problem where entries of valid hypotheses need to satisfy the set of equalities imposed by the basis B. We will show how the problem of Hankel matrix completion can be solved with a convex optimization algorithm. In fact, many existing approaches to matrix completion are also based on convex optimization algorithms [CT10; CP10; Rec11; FSSS11]. Furthermore, since the set of valid hypotheses for our constrained matrix completion problem is convex, many of these algorithms could also be modified to solve the Hankel matrix completion problem. We choose to propose our own algorithm here because it is the one we will use in the analysis given in Section 9.3.

We begin by introducing some notation. Throughout this chapter we shall assume that B = (P, S) is a p-closed basis with λ ∈ P ∪ S. For any basis B, we denote by HB the vector space of functions R^{PS}, whose dimension is the dimension of B, defined as |PS|. We will simply write H instead of HB when the basis B is clear from the context. The Hankel matrix H ∈ R^{P×S} associated to a function h ∈ H is the matrix whose entries are defined by H(u, v) = h(uv) for all u ∈ P and v ∈ S. Note that the mapping h ↦ H is linear. In fact, H is isomorphic to the vector space formed by all |P| × |S| real Hankel matrices, and we can thus write by identification that H contains all H ∈ R^{P×S} such that

$$\forall u_1, u_2 \in P, \; \forall v_1, v_2 \in S, \quad u_1 v_1 = u_2 v_2 \implies H(u_1, v_1) = H(u_2, v_2) .$$

It is clear from this characterization that H is a convex set because it is a subset of a convex space defined

by equality constraints. In particular, a matrix in H contains |P||S| coefficients with |PS| degrees of freedom, and the dependencies can be specified as a set of equalities of the form H(u1, v1) = H(u2, v2) when u1v1 = u2v2. We will use both characterizations of H indistinctly from now on. Also, note that different orderings of P and S may result in different sets of matrices. For convenience, we will assume for all that follows an arbitrary fixed ordering, since the choice of that order has no effect on our results. Matrix norms extend naturally to norms in H. For any 1 ≤ p ≤ ∞, the Hankel–Schatten p-norm on H is defined as ‖h‖p = ‖H‖p. It is straightforward to verify that ‖h‖p is a norm by the linearity of h ↦ H. In particular, this implies that the function ‖·‖p : H → R is convex. In the case p = 2, it can be seen that ‖h‖₂² = ⟨h, h⟩H, with the inner product on H defined by

$$\langle h, h' \rangle_H = \sum_{x \in PS} c_x \, h(x) h'(x) ,$$

where cx = |{(u, v) ∈ P × S : x = uv}| is the number of possible decompositions of x into a prefix in P and a suffix in S.

We now outline our algorithm HMC+SM for learning general WFA from labeled examples based on combining matrix completion with the spectral method. As input, the algorithm takes a sample S = (z1, …, zm) containing m examples zi = (xi, yi) ∈ Σ⋆ × R, 1 ≤ i ≤ m, drawn i.i.d. from some distribution D over Σ⋆ × R. There are three parameters a user can specify to control the behavior of the algorithm: a basis B = (P, S) of Σ⋆, a regularization parameter τ > 0, and the desired number of states n in the hypothesis. The output returned by HMC+SM is a WFA AS with n states that computes a function fAS : Σ⋆ → R. The algorithm works in two stages. In the first stage, a constrained matrix completion algorithm with input S and regularization parameter τ is used to compute a Hankel matrix HS ∈ HB. In the second stage, the spectral method is applied to HS to compute a WFA AS with n states. The spectral method is just a direct application of the equations given in Section 5.3 by observing that we can write HS^⊤ = [Hλ^⊤, Hσ1^⊤, …, Hσ|Σ|^⊤], with hλ,S and hP,λ contained in Hλ because we assumed λ ∈ P ∪ S. In the rest of this section we describe the Hankel matrix completion algorithm in detail.

It is easy to see that in fact HMC+SM defines a whole family of algorithms. In particular, by combining the spectral method with any algorithm for solving the Hankel matrix completion problem, one can derive a new algorithm for learning WFA. We shall now describe a particular family of Hankel matrix completion algorithms. This family is parametrized by a number 1 ≤ p ≤ ∞ and a convex loss ℓ : R × R → R+. Given a basis B = (P, S) of Σ⋆ and a sample S over Σ⋆ × R, our algorithm solves a convex optimization problem and returns a matrix HS ∈ HB. We give two equivalent descriptions of this optimization, one in terms of functions h : PS → R, and another in terms of Hankel matrices H ∈ R^{P×S}. While the former is perhaps conceptually simpler, the latter is easier to implement within the existing frameworks of convex optimization. We will denote by S̃ the subsample of S containing all the examples z = (x, y) ∈ S such that x ∈ PS. We write m̃ to denote the size of S̃. For any 1 ≤ p ≤ ∞ and any convex loss function ℓ : R × R → R+, we define the objective function FS over H as follows:

$$F_S(h) = \tau N(h) + \hat{R}_{\tilde{S}}(h) = \tau \|h\|_p^2 + \frac{1}{\tilde{m}} \sum_{(x,y) \in \tilde{S}} \ell(h(x), y) ,$$

where τ > 0 is a regularization parameter. FS is clearly a convex function by the convexity of ‖·‖p and ℓ. Our algorithm seeks to minimize this loss function over the finite-dimensional vector space H and returns a function hS satisfying

$$h_S \in \operatorname*{argmin}_{h \in H} F_S(h) . \tag{HMC-h}$$

To define an equivalent optimization over the matrix version of H, we introduce the following notation. For each string x ∈ PS, fix a pair of coordinate vectors (ux, vx) ∈ R^P × R^S such that ux^⊤ H vx = H(x) for any H ∈ H. That is, ux and vx are coordinate vectors corresponding respectively to a prefix u ∈ P and a suffix v ∈ S such that uv = x. Now, abusing our previous notation, we define the following

loss function over matrices:

$$F_S(H) = \tau N(H) + \hat{R}_{\tilde{S}}(H) = \tau \|H\|_p^2 + \frac{1}{\tilde{m}} \sum_{(x,y) \in \tilde{S}} \ell(u_x^\top H v_x, y) .$$

This is a convex function defined over the space of all |P| × |S| matrices. Optimizing FS over the convex set of Hankel matrices H leads to an algorithm equivalent to (HMC-h):

$$H_S \in \operatorname*{argmin}_{H \in H} F_S(H) . \tag{HMC-H}$$

We note here that our approach shares some common aspects with previous work in matrix completion. The fact that there may not be a true underlying Hankel matrix makes it somewhat close to the agnostic setting in [FSSS11], where matrix completion is also applied under arbitrary distributions. Nonetheless, it is also possible to consider other learning frameworks for WFA where algorithms for exact matrix completion [CT10; Rec11] or noisy matrix completion [CP10] may be useful. Furthermore, since most algorithms in the literature on matrix completion are based on convex optimization problems, it is likely that most of them can be adapted to solve constrained matrix completion problems such as the one we discuss here.
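To illustrate both formulations at once, here is a sketch of (HMC-h) for the instance analyzed in Section 9.3 (p = 2 with the absolute loss), written with CVXPY over the function h : PS → R. It exploits the identity ‖h‖₂² = Σx cx h(x)² that follows from the inner product defined above; names are illustrative, and the WFA itself would then be obtained by arranging the returned values into HS and applying the spectral step from Section 5.3.

```python
import numpy as np
import cvxpy as cp
from collections import Counter

def hmc_complete(sample, prefixes, suffixes, tau):
    """sample: list of (string, label) pairs; returns the completed h on PS."""
    c = Counter(u + v for u in prefixes for v in suffixes)   # coefficients c_x
    strings = sorted(c)
    idx = {x: i for i, x in enumerate(strings)}
    h = cp.Variable(len(strings))
    # tau * ||h||_2^2 = tau * sum_x c_x h(x)^2  (Hankel-Schatten norm, p = 2)
    weights = np.array([c[x] for x in strings], dtype=float)
    reg = tau * cp.sum(cp.multiply(weights, cp.square(h)))
    inside = [(x, y) for x, y in sample if x in idx]         # the subsample S-tilde
    loss = sum(cp.abs(h[idx[x]] - y) for x, y in inside) / max(len(inside), 1)
    cp.Problem(cp.Minimize(reg + loss)).solve()
    return {x: float(h.value[idx[x]]) for x in strings}
```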

9.3 Generalization Bound

In this section, we study the generalization properties of HMCp,ℓ+SM. We give a stability analysis for a special instance of this family of algorithms and use it to derive a generalization bound. We study the specific case where p = 2 and ℓ(y, y′) = |y − y′| for all (y, y′). We note that in this case the term N(h) in the loss function FS(h) is differentiable and can be expressed as an inner product. Though this property is important in an early step of our analysis, we believe that similar approaches can be used to prove generalization bounds for other algorithms in this class.

We first introduce some notation needed for the presentation of our main result. For any ν > 0, let tν be the function defined by tν(x) = x for |x| ≤ ν and tν(x) = ν sign(x) for |x| > ν. For any distribution D over Σ⋆ × R, we denote by DΣ its marginal distribution over Σ⋆. The probability that a string x ∼ DΣ belongs to PS is denoted by π = DΣ(PS). We assume that the parameters B, n, and τ are fixed. Two parameters that depend on D will appear in our bound. In order to define these parameters, we need to consider the output HS of (HMC-H) as a random variable that depends on the sample S. Writing HS^⊤ = [Hλ^⊤, Hσ1^⊤, …, Hσ|Σ|^⊤] we define:

$$s = \mathbb{E}_{S \sim D^m}[s_n(H_\lambda)] , \qquad \rho = \mathbb{E}_{S \sim D^m}\!\left[ s_n(H_\lambda)^2 - s_{n+1}(H_\lambda)^2 \right] ,$$

where sn(M) denotes the nth singular value of matrix M. Note that these parameters may vary with m, n, τ and B. In contrast to previous learning results based on the spectral method, our bound holds in an agnostic setting. That is, we do not require that the data was generated from some (probabilistic) unknown WFA. However, in order to prove our results we do need to make two assumptions about the tails of the distribution. First, we need to assume that there exists a bound on the magnitude of the labels generated by the distribution.

Assumption 4. There exists a constant ν > 0 such that if (x, y) ∼ D, then |y| ≤ ν almost surely.

Second, we assume that the strings generated by the distribution will not be too long. In particular, we assume that the length of the strings generated by DΣ follows a distribution whose tail is slightly lighter than sub-exponential.

Assumption 5. There exist constants c, η > 0 such that the following holds for all t ≥ 0: ℙ_{x∼DΣ}[|x| ≥ t] ≤ exp(−c t^{1+η}).

We note that in the present context both assumptions are quite reasonable. Assumption 4 is equivalent to assumptions made in other contexts where a stability analysis is pursued, e.g., in the analysis of support vector regression in [BE02]. Furthermore, in our context, this assumption can be relaxed to require only that the distribution over labels be sub-Gaussian, at the expense of a more complex proof. Assumption 5 is required by the fact, already pointed out in [HKZ09], that errors in the estimation of operator models accumulate exponentially with the length of the string; this can be readily observed in the bounds given in Section 5.4. Moreover, it is well known that the tail of any probability distribution generated by a WFA is sub-exponential; see [DE08]. Thus, though we do not require DΣ to be generated by a WFA, we do need its distribution over lengths to have a tail behavior similar to that of a distribution generated by a WFA.

We can now state our main result, which is a bound on the risk R(f) = E_{z∼D}[ℓ(f(x), y)] in terms of the empirical risk R̂S(f) = Σ_{z∈S} ℓ(f(x), y)/m. Let us define the quantities ν̃ = max{ν², ν⁻¹}, τ̃ = min{τ², τ⁻¹}, ρ̃ = min{1, ρ²}, and s̃ = min{s², s⁻²}.

Theorem 9.3.1. Let S be a sample of m i.i.d. examples drawn from some distribution D satisfying Assumptions 4 and 5. Let AS be the WFA returned by algorithm HMCp,ℓ+SM with p = 2 and loss function ℓ(y, y′) = |y − y′|. There exist universal constants C, C′ such that for any δ > 0, if

$$m \ge C \max\left\{ \frac{\tilde{\nu}|P||S|}{\tilde{\tau}\tilde{\rho}\tilde{s}\pi^2} ,\; \exp\left( \left(\frac{6c}{5}\right)^{1/\eta} \left( \ln \frac{2\nu\sqrt{|P||S|}}{s} \right)^{1+1/\eta} \right) \right\} ,$$

then the following holds with probability at least 1 − δ for fS = tν ◦ fAS :

$$R(f_S) \le \hat{R}_S(f_S) + C' \, \frac{\nu^4 |P|^2 |S|^{3/2} \ln m}{\tau \rho s^3 \pi \, m^{1/3}} \sqrt{\ln \frac{1}{\delta}} .$$

The proof of this theorem is based on an algorithmic stability analysis. Thus, we will consider two samples of size m: S ∼ D^m consisting of i.i.d. examples drawn from D, and S′ differing from S by just one point. Thus, we shall write S = (z1, …, zm) and S′ = (z1, …, zm−1, z′m), where the new example z′m is an arbitrary point in the support of D. Throughout the analysis we use the shorter notation H = HS and H′ = HS′ for the Hankel matrices obtained from (HMC-H) based on samples S and S′ respectively. The first step in the analysis is to bound the stability of the matrix completion algorithm. This is done in the following lemmas, which give a sample-dependent and a sample-independent bound for the stability of H. We begin with a technical lemma that is useful for studying the algorithmic stability of Hankel matrix completion. The symbols h = hS and h′ = hS′ will denote the functions in H obtained by solving (HMC-h) with training samples S and S′ respectively. The proof uses an argument similar to the one in [MRT12] for bounding the stability of kernel ridge regression. See Appendix A.4 for a brief review of Bregman divergences.

Lemma 9.3.2. The following inequality holds for all samples S and S′ differing by only one point:

$$2\tau \|h - h'\|_2^2 \le \hat{R}_{\tilde{S}}(h') - \hat{R}_{\tilde{S}}(h) + \hat{R}_{\tilde{S}'}(h) - \hat{R}_{\tilde{S}'}(h') .$$

Proof. Recall that the Bregman divergence is non-negative, and that by an adequate choice of subgradients for R̂S̃ and R̂S̃′ we have B_{FS} = τB_N + B_{R̂S̃} and B_{FS′} = τB_N + B_{R̂S̃′}. Thus, we have the following inequality:

$$\tau B_N(h' \| h) + \tau B_N(h \| h') \le B_{F_S}(h' \| h) + B_{F_{S'}}(h \| h') .$$

Furthermore, note that by the optimality of h and h′ with respect to FS and FS′ we can choose subgradients in B_{FS} and B_{FS′} such that δFS(h) = δFS′(h′) = 0. Hence, from the definition of Bregman divergence with these choices we get

$$B_{F_S}(h' \| h) + B_{F_{S'}}(h \| h') = F_S(h') - F_S(h) + F_{S'}(h) - F_{S'}(h') = \hat{R}_{\tilde{S}}(h') - \hat{R}_{\tilde{S}}(h) + \hat{R}_{\tilde{S}'}(h) - \hat{R}_{\tilde{S}'}(h') .$$

Now, if we write hc for the function x ↦ cx h(x), then it is immediate to see that ∇N(h) = ∇⟨h, h⟩H = 2hc. Thus, we have the following:

$$B_N(h' \| h) + B_N(h \| h') = -\langle h' - h, 2h_c \rangle_H - \langle h - h', 2h'_c \rangle_H = 2 \langle h - h', h_c - h'_c \rangle_H .$$

Finally, since cx ≥ 1 for all x ∈ PS, we have that

$$\begin{aligned} \|h - h'\|_2^2 &= \langle h - h', h - h' \rangle_H \\ &= \sum_{x \in PS} c_x (h(x) - h'(x))^2 \\ &\le \sum_{x \in PS} c_x^2 (h(x) - h'(x))^2 \\ &= \sum_{x \in PS} c_x (h(x) - h'(x)) (h_c(x) - h'_c(x)) \\ &= \langle h - h', h_c - h'_c \rangle_H . \end{aligned}$$

Now we bound the stability of the optimization algorithm (HMC-H) using Lemma 9.3.2.

Lemma 9.3.3. Assume that D satisfies Assumption 4. Then, the following holds:

$$\|H - H'\|_F \le \min\left\{ 2\nu\sqrt{|P||S|} ,\; \frac{1}{\tau \min\{\tilde{m}, \tilde{m}'\}} \right\} .$$

Proof. Note that by Assumption 4, for all (x, y) in S̃ or S̃′ we have |y| ≤ ν. Therefore, we must have |H(u, v)| ≤ ν for all u ∈ P and v ∈ S, since otherwise the value of FS(H) would not be minimal: decreasing the absolute value of an entry |H(u, v)| > ν decreases the value of FS(H). The same holds for H′. Thus, the first bound follows from ‖H − H′‖F ≤ ‖H‖F + ‖H′‖F ≤ 2ν√(|P||S|). Now we proceed to show the second bound. Since by definition we have ‖H − H′‖F = ‖h − h′‖₂, it is sufficient to bound this second quantity. By Lemma 9.3.2, we have

$$2\tau \|h - h'\|_2^2 \le \hat{R}_{\tilde{S}}(h') - \hat{R}_{\tilde{S}}(h) + \hat{R}_{\tilde{S}'}(h) - \hat{R}_{\tilde{S}'}(h') . \tag{9.1}$$

We can consider four different situations for the right-hand side of this expression, depending on the membership of x^m and x′^m in the set PS. If x^m, x′^m ∉ PS, then S̃ = S̃′. Therefore, R̂S̃(h) = R̂S̃′(h), R̂S̃(h′) = R̂S̃′(h′), and ‖h − h′‖₂ = 0. If x^m, x′^m ∈ PS, then m̃ = m̃′, and the following equalities hold:

$$\hat{R}_{\tilde{S}'}(h) - \hat{R}_{\tilde{S}}(h) = \frac{|h(x'^m) - y'^m| - |h(x^m) - y^m|}{\tilde{m}} , \qquad \hat{R}_{\tilde{S}}(h') - \hat{R}_{\tilde{S}'}(h') = \frac{|h'(x^m) - y^m| - |h'(x'^m) - y'^m|}{\tilde{m}} .$$

Thus, in view of (9.1), we can write

$$2\tau \|h - h'\|_2^2 \le \frac{|h(x^m) - h'(x^m)| + |h(x'^m) - h'(x'^m)|}{\tilde{m}} \le \frac{2}{\tilde{m}} \|h - h'\|_2 ,$$

where we used the inequalities

$$\big| \, |h(x) - y| - |h'(x) - y| \, \big| \le |h(x) - h'(x)| \le \|h - h'\|_2 .$$

If x^m ∈ PS and x′^m ∉ PS, the right-hand side of (9.1) equals

$$\sum_{z \in \tilde{S}'} \left( \frac{|h'(x) - y|}{\tilde{m}} - \frac{|h'(x) - y|}{\tilde{m}'} + \frac{|h(x) - y|}{\tilde{m}'} - \frac{|h(x) - y|}{\tilde{m}} \right) + \frac{|h'(x^m) - y^m|}{\tilde{m}} - \frac{|h(x^m) - y^m|}{\tilde{m}} .$$

Now, since m̃ = m̃′ + 1 we can write

$$2\tau \|h - h'\|_2^2 \le \sum_{z \in \tilde{S}'} \frac{|h(x) - h'(x)|}{\tilde{m} \tilde{m}'} + \frac{|h(x^m) - h'(x^m)|}{\tilde{m}} \le \frac{2}{\tilde{m}} \|h - h'\|_2 .$$

By symmetry, a similar bound holds in the case where x^m ∉ PS and x′^m ∈ PS. Combining these four bounds yields the desired inequality.

The standard method for deriving generalization bounds from algorithmic stability results could be applied here to obtain a generalization bound for our Hankel matrix completion algorithm. However, our goal is to give a generalization bound for the full HMC+SM algorithm. To bound the stability of HMC+SM we

will need to study the point-wise stability |fAS(x) − fAS′(x)| in our agnostic setting. The next lemmas give the main technical tools we will need to bound this difference by using results from Section 5.4.1.

Lemma 9.3.4. Let γ = ν√(|P||S|)/sn(Hλ). The weighted automaton AS satisfies ‖α0‖, ‖α∞‖, ‖Aσ‖ ≤ γ.

Proof. Since ‖Hσ‖ ≤ ν√(|P||S|), simple calculations show ‖α0^⊤‖ ≤ ν√|S|, ‖α∞‖ ≤ ν√|P|/sn(Hλ), and ‖Aσ‖ ≤ ν√(|P||S|)/sn(Hλ).

Now we give an application of Lemma 5.4.5 to our stability problem. Using the bound on the Frobenius norm ‖H − H′‖F, we can analyze the stability of sn(Hλ), sn(Hλ)² − sn+1(Hλ)², and Vn using well-known results on the stability of singular values and singular vectors. These results are used to bound the difference between the operators of the WFA AS and AS′ in terms of the following quantities:

$$\varepsilon = \|H - H'\|_F , \qquad \hat{s} = \min\{s_n(H_\lambda), s_n(H'_\lambda)\} , \qquad \hat{\rho} = s_n(H_\lambda)^2 - s_{n+1}(H_\lambda)^2 .$$

Lemma 9.3.5. Suppose ε ≤ √ρ̂/4. There exists a universal constant c1 > 0 such that the following inequalities hold for all σ ∈ Σ:

$$\begin{aligned} \|A_\sigma - A'_\sigma\| &\le c_1 \, \frac{\varepsilon \nu^3 |P|^{3/2} |S|^{1/2}}{\hat{\rho} \hat{s}^2} , \\ \|\alpha_0 - \alpha'_0\| &\le c_1 \, \frac{\varepsilon \nu^2 |P|^{1/2} |S|}{\hat{\rho}} , \\ \|\alpha_\infty - \alpha'_\infty\| &\le c_1 \, \frac{\varepsilon \nu^3 |P|^{3/2} |S|^{1/2}}{\hat{\rho} \hat{s}^2} . \end{aligned}$$

Proof. We begin with a few observations that will help us apply Lemma 5.4.5. First note that by definition of ε the following hold for every σ ∈ Σ′:

$$\|H_\sigma - H'_\sigma\| \le \varepsilon , \qquad \|h_{P,\lambda} - h'_{P,\lambda}\| \le \varepsilon , \qquad \|h_{\lambda,S} - h'_{\lambda,S}\| \le \varepsilon .$$

Furthermore, ‖Hσ‖ ≤ ‖Hσ‖F ≤ ν√(|P||S|) and ‖H′σ‖ ≤ ν√(|P||S|) for all σ ∈ Σ′. In addition, we have ‖hλ,S‖ ≤ ν√|S| and ‖hP,λ‖ ≤ ν√|P|. Finally, by construction we also have sn(HλV) = sn(Hλ) and sn(H′λV′) = sn(H′λ). Therefore, it only remains to bound ‖V − V′‖, which by Corollary A.2.5 is

$$\|V - V'\| \le \frac{4\varepsilon}{\hat{\rho}} \left( 2\nu\sqrt{|P||S|} + \varepsilon \right) \le \frac{16 \varepsilon \nu \sqrt{|P||S|}}{\hat{\rho}} ,$$

where the last inequality follows from Lemma 9.3.3. Plugging all the bounds above into Lemma 5.4.5 yields the following inequalities:

$$\|A_\sigma - A'_\sigma\| \le \frac{\varepsilon}{\hat{s}} \left( 1 + \frac{16\nu |P|^{1/2} |S|^{1/2}}{\hat{\rho}} \right) + \frac{1 + \sqrt{5}}{2} \cdot \frac{\varepsilon \nu |P|^{1/2} |S|^{1/2}}{\hat{s}^2} \left( 1 + \frac{16\nu^2 |P||S|}{\hat{\rho}} \right) ,$$

$$\|\alpha_0 - \alpha'_0\| \le \varepsilon \left( 1 + \frac{16\nu^2 |P|^{1/2} |S|}{\hat{\rho}} \right) , \qquad \|\alpha_\infty - \alpha'_\infty\| \le \frac{\varepsilon}{\hat{s}} + \frac{1 + \sqrt{5}}{2} \cdot \frac{\varepsilon \nu |P|^{1/2}}{\hat{s}^2} \left( 1 + \frac{16\nu^2 |P||S|}{\hat{\rho}} \right) .$$

The result now follows from an adequate choice of c1.

The other half of the proof results from combining Lemmas 9.3.3 and 9.3.5 with Lemma 5.4.2 to obtain a bound for |fS(x) − fS′(x)|. This is a delicate step, because some of the bounds given above involve quantities that are defined in terms of S. Therefore, all these parameters need to be controlled in order to ensure that the bounds do not grow too large. This will be done by using a variant of McDiarmid's inequality that accounts for the occurrence of rare bad events. In particular, we define the properties that make S a good sample and show that for large enough m they are satisfied with high probability.

Definition 9.3.6. We say that a sample S of m i.i.d. examples from D is good if the following conditions are satisfied for any z′m = (x′m, y′m) ∈ supp(D):

1. |x^i| ≤ ((1/c) ln(4m⁴))^{1/(1+η)} for all 1 ≤ i ≤ m,

2. ‖H − H′‖F ≤ 4/(τπm),

3. min{sn(Hλ), sn(H′λ)} ≥ s/2,

4. sn(Hλ)² − sn+1(Hλ)² ≥ ρ/2.

Lemma 9.3.7. Suppose D satisfies Assumptions 4 and 5. There exists a quantity M = poly(ν, π, s, ρ, τ, |P|, |S|) such that if m ≥ M, then S is good with probability at least 1 − 1/m³.

Proof. First note that by Assumption 5, writing L = ((1/c) ln(4m⁴))^{1/(1+η)}, a union bound yields

" m # _ i 1+η 1 P |x | > L ≤ m exp(−cL ) = . 4m3 i=1

Now let m̄ denote the number of strings among (x¹, …, x^{m−1}) that belong to PS. Note that we have min{m̃, m̃′} ≥ m̄ and E_S[m̄] = π(m − 1). Thus, for any ∆ ∈ (0, 1) the Chernoff bound gives

$$\mathbb{P}[\bar{m} < \pi(m-1)(1-\Delta)] \le \exp\left( -\frac{(m-1)\pi\Delta^2}{2} \right) \le \exp\left( -\frac{m\pi\Delta^2}{4} \right) ,$$

where we have used that (m − 1)/m ≥ 1/2 for m ≥ 2. Therefore, taking ∆ = √((4/(mπ)) ln(4m³)) we see that with probability at least 1 − 1/(4m³) we have

$$\min\{\tilde{m}, \tilde{m}'\} \ge (m-1)\pi(1-\Delta) \ge \frac{m\pi(1-\Delta)}{2} .$$

Now note that m ≥ (16/π) ln(4m³) implies ∆ ≤ 1/2. Therefore, applying Lemma 9.3.3 we can see that ‖H − H′‖F ≤ 4/(τπm) holds with probability at least 1 − 1/(4m³) whenever

$$m \ge \max\left\{ 2 ,\; \frac{16}{\pi} \ln(4m^3) ,\; \frac{2}{\tau\pi\nu\sqrt{|P||S|}} \right\} .$$

For the third claim note that by Lemma A.2.1 we have

$$|s_n(H_\lambda) - s_n(H'_\lambda)| \le \|H_\lambda - H'_\lambda\|_F \le \|H - H'\|_F .$$

Thus, from the argument we just used in the previous bound we can see that when m ≥ 2 the function Φ(S) = sn(Hλ) is strongly difference-bounded by (bs, θs/m, exp(−Ksm)) with bs = 2ν√(|P||S|), θs = 2/(τπ(1 − ∆)), and Ks = π∆²/4 for any ∆ ∈ (0, 1). Now note that by Lemma A.2.1 and the previous goodness condition on ‖H − H′‖F we have

$$\min\{s_n(H_\lambda), s_n(H'_\lambda)\} \ge s_n(H_\lambda) - \|H - H'\|_F \ge s_n(H_\lambda) - \frac{4}{\tau\pi m} .$$

Furthermore, taking ∆ = 1/2 and assuming that

$$m \ge \max\left\{ \frac{\nu\tau\pi\sqrt{|P||S|}}{2} ,\; \frac{288}{\pi}\left( 9 + \ln\left( 3 + \frac{96}{\pi} \right) \right) ,\; \frac{32}{\pi}\ln(8m^3) \right\} ,$$

we can apply Corollary A.1.4 with δ = 1/(4m³) to see that

$$s_n(H_\lambda) - \frac{4}{\tau\pi m} \ge s - \sqrt{\frac{128}{\tau^2\pi^2 m} \ln(8m^3)} - \frac{4}{\tau\pi m}$$

holds with probability at least 1 − 1/(4m³). Hence, we get

$$\min\{s_n(H_\lambda), s_n(H'_\lambda)\} \ge s - \sqrt{\frac{128}{\tau^2\pi^2 m} \ln(8m^3)} - \frac{4}{\tau\pi m} \ge s - \frac{s}{4} - \frac{s}{4} = \frac{s}{2}$$

for any sample size such that m ≥ max{16/(τπs), (2048/(τ²π²s²)) ln(8m³)}. To prove the fourth bound we shall study the stability of the function Φ(S) = sn(Hλ)² − sn+1(Hλ)². We begin with the following chain of inequalities, which follows from Lemma A.2.1 and sn(Hλ) ≥ sn+1(Hλ):

$$\begin{aligned} |\Phi(S) - \Phi(S')| &= \left| \left( s_n(H_\lambda)^2 - s_{n+1}(H_\lambda)^2 \right) - \left( s_n(H'_\lambda)^2 - s_{n+1}(H'_\lambda)^2 \right) \right| \\ &\le \left| s_n(H_\lambda)^2 - s_n(H'_\lambda)^2 \right| + \left| s_{n+1}(H_\lambda)^2 - s_{n+1}(H'_\lambda)^2 \right| \\ &= \left| s_n(H_\lambda) + s_n(H'_\lambda) \right| \left| s_n(H_\lambda) - s_n(H'_\lambda) \right| + \left| s_{n+1}(H_\lambda) + s_{n+1}(H'_\lambda) \right| \left| s_{n+1}(H_\lambda) - s_{n+1}(H'_\lambda) \right| \\ &\le \left( 2 s_n(H_\lambda) + \|H_\lambda - H'_\lambda\| \right) \|H_\lambda - H'_\lambda\| + \left( 2 s_{n+1}(H_\lambda) + \|H_\lambda - H'_\lambda\| \right) \|H_\lambda - H'_\lambda\| \\ &\le 4 s_n(H_\lambda) \|H - H'\|_F + 2 \|H - H'\|_F^2 . \end{aligned}$$

Now we can use this last bound to show that Φ(S) is strongly difference-bounded by (bρ, θρ/m, exp(−Kρm)) with the definitions bρ = 16ν²|P||S|, θρ = 64s/(τπ), and Kρ = min{s²τ²π²/256, π/64}. For bρ just observe that from Lemma 9.3.3 and sn(Hλ) ≤ ‖Hλ‖F ≤ ν√(|P||S|) we get

$$4 s_n(H_\lambda) \|H - H'\|_F + 2 \|H - H'\|_F^2 \le 16 \nu^2 |P||S| .$$

By the same arguments used above, if m is large enough we have ‖H − H′‖F ≤ 4/(τπm) with probability at least 1 − exp(−mπ/16). Furthermore, by taking ∆ = 1/2 in the stability argument given above for sn(Hλ), and invoking Corollary A.1.4 with δ = 2 exp(−Km) for some 0 < K ≤ Ks/2 = π/32, we get

$$s_n(H_\lambda) \le s + \sqrt{\frac{128 K}{\tau^2 \pi^2}} ,$$

with probability at least 1 − 2 exp(−Km). Thus, for K = min{π/32, s²τ²π²/128} we get sn(Hλ) ≤ 2s. If we now combine the bounds for ‖H − H′‖F and sn(Hλ), we get

$$4 s_n(H_\lambda) \|H - H'\|_F + 2 \|H - H'\|_F^2 \le \frac{32 s}{\tau\pi m} + \frac{32}{\tau^2\pi^2 m^2} \le \frac{64 s}{\tau\pi m} = \frac{\theta_\rho}{m} ,$$

where we have assumed that m ≥ 1/(τπs). To get Kρ note that the above bound holds with probability at least

$$1 - e^{-m\pi/16} - 2e^{-Km} \ge 1 - 3e^{-Km} \ge 1 - e^{-Km/2} = 1 - e^{-K_\rho m} ,$$

where we have used that K ≤ π/16 and assumed that m ≥ 2 ln(3)/K. Finally, applying Corollary A.1.4 to Φ(S) we see that with probability at least 1 − 1/(4m³) one has

$$s_n(H_\lambda)^2 - s_{n+1}(H_\lambda)^2 \ge \rho - \sqrt{\frac{2^{15} s^2}{\tau^2 \pi^2 m} \ln(8m^3)} \ge \frac{\rho}{2} ,$$

whenever the sample size m is at least

$$\max\left\{ \frac{2^{17} s^2}{\tau^2\pi^2\rho^2} \ln(8m^3) ,\; \frac{\nu^2\tau\pi|P||S|}{4s} ,\; \frac{18}{K_\rho}\left( 9 + \ln\left( 3 + \frac{6}{K_\rho} \right) \right) ,\; \frac{2}{K_\rho} \ln(8m^3) \right\} .$$

Now we can give two upper bounds for |fS(x) − fS′(x)|: a tighter bound that holds for “good” samples S and S′, and another one that holds for all samples. These bounds can be combined by using the variant of McDiarmid's inequality for dealing with functions that do not satisfy the bounded differences assumption almost surely, given in [Kut02]. The rest of the proof then follows the same scheme as the standard one for deriving generalization bounds for stable algorithms [BE02; MRT12].

Lemma 9.3.8. Let γ1 = 64ν⁴|P|²|S|^{3/2}/(τs³ρπ) and γ2 = 2ν|P|^{1/2}|S|^{1/2}/s. There exists a constant c2 > 0 such that the function Φ(S) = R(fS) − R̂S(fS) is strongly difference-bounded by (4ν + 2ν/m, c2γ1m^{−5/6} ln m, 1/m³) for every sample size

$$m \ge \max\left\{ M ,\; \frac{16\sqrt{2}}{\tau\pi\sqrt{\rho}} ,\; \exp\left( 6 \ln \gamma_2 \, (1.2\, c \ln \gamma_2)^{1/\eta} \right) \right\} .$$

Proof. We will write for short f = fS and f′ = fS′. We start by defining

$$\beta_1 = \mathbb{E}_{x \sim D_\Sigma}[|f(x) - f'(x)|] , \qquad \beta_2 = \max_{1 \le i \le m-1} |f(x^i) - f'(x^i)| .$$

We first show that |Φ(S) − Φ(S′)| ≤ β1 + β2 + 2ν/m. By the definition of Φ we can write

$$|\Phi(S) - \Phi(S')| \le |R(f) - R(f')| + |\hat{R}_S(f) - \hat{R}_{S'}(f')| .$$

By Jensen’s inequality, the first term can be upper bounded as

$$|R(f) - R(f')| \le \mathbb{E}_{(x,y) \sim D}\left[ \big| \, |f(x) - y| - |f'(x) - y| \, \big| \right] \le \beta_1 .$$

Now, using the triangle inequality and |f(x^m) − y^m|, |f′(x′^m) − y′^m| ≤ 2ν, the second term can be bounded as follows:

$$|\hat{R}_S(f) - \hat{R}_{S'}(f')| \le \frac{2\nu}{m} + \frac{1}{m} \sum_{i=1}^{m-1} |f(x^i) - f'(x^i)| \le \frac{2\nu}{m} + \frac{m-1}{m} \beta_2 .$$

Observe that for any samples S and S′ we have β1, β2 ≤ 2ν. This provides the almost-sure upper bound needed in the definition of strong difference-boundedness. We use this bound when the sample S is

not good. When m is large enough, Lemma 9.3.7 guarantees that this event will occur with probability at most 1/m³. It remains to bound β1 and β2 assuming that S is good. In the first place note that by Lemma 9.3.7, m ≥ max{M, 16√2/(τπ√ρ)} implies ‖H − H′‖F ≤ √ρ̂/4. Thus, by combining Lemmas 5.4.2, 9.3.4, 9.3.5, and 9.3.7, we see that the following holds for any x ∈ Σ⋆:

$$|f(x) - f'(x)| \le \left( \frac{2\nu|P|^{1/2}|S|^{1/2}}{s} \right)^{|x|+1} \frac{32\, c_1 (|x|+2)\, \nu^3 |P|^{3/2} |S|}{m\, \tau\pi s^2 \rho} = \frac{c_1 \gamma_1}{m} \exp\left( |x| \ln \gamma_2 + \ln(|x|+2) \right) .$$

Now let L = ((1/c) ln(4m⁴))^{1/(1+η)}. The bound we just showed guarantees that there exists a constant C such that when m ≥ exp(6 ln γ2 (1.2c ln γ2)^{1/η}), we get |f(x) − f′(x)| ≤ Cγ1m^{−5/6} ln m for every |x| ≤ L. Thus, we can write

$$\beta_1 \le \mathbb{E}_{x \sim D_\Sigma}\!\left[ \, |f(x) - f'(x)| \;\middle|\; |x| \le L \, \right] + 2\nu \, \mathbb{P}_{x \sim D_\Sigma}[\, |x| \ge L \,] \le C\gamma_1 m^{-5/6} \ln m + \frac{\nu}{2m^3} ,$$

and β2 ≤ Cγ1m^{−5/6} ln m, where the last bound follows from the goodness of S. Combining these bounds yields the desired result.

The following is the proof of our main result.

Proof of Theorem 9.3.1. The result follows from an application of Theorem A.1.2 to Φ(S), defined as in Lemma 9.3.8. In particular, for large enough m, the following holds with probability at least 1 − δ:

$$R(f_S) \le \hat{R}_S(f_S) + \mathbb{E}_{S \sim D^m}[\Phi(S)] + \sqrt{\frac{2 (C\gamma_1 \ln m)^2}{m^{2/3}} \ln\!\left( \frac{1}{\delta - \frac{6\nu}{C'\gamma_1 m^{7/6} \ln m}} \right)}$$

for some constants C, C′ and γ1 = ν⁴|P|²|S|^{3/2}/(τs³ρπ). Thus, it remains to bound E_{S∼D^m}[Φ(S)]. First note that we have E_{S∼D^m}[R(fS)] = E_{S,z∼D^{m+1}}[|fS(x) − y|]. On the other hand, we can also write E_{S∼D^m}[R̂S(fS)] = E_{S,z∼D^{m+1}}[|fS′(x) − y|], where S′ is a sample of size m containing z and m − 1 other points in S chosen at random. Thus, by Jensen's inequality we can write

$$\left| \mathbb{E}_{S \sim D^m}[\Phi(S)] \right| \le \mathbb{E}_{S, z \sim D^{m+1}}\left[ |f_S(x) - f_{S'}(x)| \right] .$$

Now an argument similar to the one used in Lemma 9.3.8 for bounding β1 can be used to show that, for large enough m, the following inequality holds:

$$\mathbb{E}_{S \sim D^m}[\Phi(S)] \le C\gamma_1 \frac{\ln m}{m^{5/6}} + \frac{2\nu}{m^3} ,$$

which completes the proof.

Conclusion and Open Problems

The previous chapters have covered a large part of what was known about learning automata with state-merging and spectral methods. They also provide new results and describe novel points of view on results that were already known. Now it is time to delve a little into the unknown. To conclude this thesis we look back at the material described in previous chapters to point out possible extensions and pose interesting open questions that can give rise to further research. What follows is an account that mixes concrete open problems with broad speculations, all centered around the areas covered by this thesis. For the sake of organization, the discussion has been split into three parts. We would like to note that this account is by no means exhaustive; it is just a sample biased by the personal interests of the author.

Generalized and Alternative Algorithms

We begin by discussing several more or less obvious extensions and generalizations of the learning algorithms given in this thesis. We have seen in Chapter 2 that statistical tests based on Efron's bootstrap provide a very powerful method for similarity testing between distributions over Σ⋆. In practice these tests behave much better than tests based on VC bounds. However, lower confidence intervals based on bootstrap estimates cannot be used to certify dissimilarity between probability distributions. Thus, we used VC bounds to implement this side of the test. Trying to derive dissimilarity tests based on bootstrap estimates with similarly good behavior would be very interesting. In particular, they could yield practical state-merging algorithms that can learn confidently from smaller samples. One possible direction is to observe that the histograms of the bootstrap estimates μ̂i tend to be symmetric when μ⋆ > 0 and asymmetric when μ⋆ = 0. Designing, implementing, and analyzing tests based on this idea is a promising line of work.

A useful modification to the adaptive on-line state-merging algorithm given in Chapter 3 would be the ability to recycle parts of the current hypothesis when a change in the target distribution is detected. That entails keeping track of the merge history, being able to detect whether the change affects a particular state or transition, and backtracking the state-merging process exactly to the point where this transition or state was added to the hypothesis. In practice this would yield a great advantage when trying to track targets that change often but very locally. There are two main challenges here: how to implement the backtracking efficiently, and how to decide when making local changes to the current hypothesis will be enough to learn the new target.

An interesting follow-up on the results from Chapter 4 would be to implement those methods and see how they behave in practice. In particular, we would like to compare our algorithm for learning sequential transductions with other state-merging algorithms that have been proposed for this problem. Of course, it would also be interesting to identify more FSM models that can be represented using GPDFA and give PAC learning results for them using these techniques.

We mentioned in Chapter 8 that one of the reasons for introducing an optimization-based method for learning WFA is that it makes it possible to add different regularizing penalties to the spectral learning algorithm. We showed how this approach can be used to learn proper PNFA by imposing that the output must be a probabilistic automaton. Another interesting regularization term that one may want to add is a group matrix norm that promotes sparsity in the transition structure of hypothesis automata. This is interesting because in practice one observes that good solutions found via Expectation-Maximization tend to have sparse transition structures. We also note that many other regularization techniques for

learning with matrices exist – some of them may also have interesting interpretations when applied to the operators of a WFA; see [KSST12] for details.

In Chapter 9 we were able to prove a generalization bound for an algorithm combining matrix completion with spectral learning. The particular matrix completion algorithm we analyzed was based on a Frobenius norm regularization. However, it is well known in matrix completion theory that, if one intends to minimize the rank of the completed matrix, it is best to use a regularizer based on the nuclear norm [CT10; CP10; Rec11; FSSS11]. In fact, in some preliminary experiments we observed that in practice the algorithm does behave better when nuclear norm regularization is used. Thus, an open problem is to see whether our techniques can be extended to this other algorithm. However, we note that this may not be possible to achieve using stability analyses. It was recently proved in [XCM12] that some sparsity-inducing algorithms can never be stable. Since the nuclear norm tries to induce sparsity on the spectrum of hypothesis matrices, we may regard matrix completion with nuclear norm regularization as a sparsity-inducing algorithm. An alternative approach would be to compute some measure of complexity of WFA like the VC dimension or the Rademacher complexity.

Choice or Reduction of Input Parameters

All learning algorithms discussed in this thesis take some parameters as input: number of states n, distinguishability µ, basis B = (P, S), etc. How to choose these parameters in practice is usually a matter of trial-and-error mixed with cross-validation and domain knowledge. This is not very satisfactory, and although some fine-tuning will always be necessary, one would like to have some theory to guide the choice of parameters. Or even better, one could hope to design learning algorithms that do not need any input parameters at all.

In the case of state-merging algorithms, the data stream implementation given in Chapter 3 includes a search strategy to find n and µ when they are not known. On the other hand, the batch implementation from Chapter 4 for learning GPDFA – and its immediate predecessor [CG08] – do not need µ as an input parameter. Instead, the algorithm makes confident merges in view of the available sample, and guarantees that these will be correct whenever the sample is large enough, where the bound does depend on the true distinguishability. This is not possible in the data stream setting because a potentially infinite sample is available there; thus we need a µ to bound the time the algorithm is allowed to wait until it makes a decision about a candidate state. However, requiring the number of states in these two algorithms does feel a little artificial. Indeed, if the sample is large enough – in the batch case – or the input µ is correct – in the on-line setting – then with high probability all tests will be correct and a DFA with finitely many states will be identified. But what happens if some incorrect merge decision is made by the algorithm? Then, after that point we cannot guarantee that distributions of examples in the remaining candidate states correspond to distributions generated by states in the target. Thus, the rest of the process could go wrong, and in principle the number of states in the hypothesis could grow like the number of prefixes in the sample, yielding a very large automaton. Using the input n as a stopping condition for the algorithm precludes this from happening. Analyzing whether this happens in practice, or designing a different split/merge policy for which this does not happen, is necessary in order to remove n from the input of state-merging algorithms.

In the case of spectral methods, the number of states seems less critical. For one thing, using concentration properties of the singular values of empirical Hankel matrices Ĥ one can easily compute a lower bound on the number of states for any given sample size. Furthermore, with large enough samples this bound converges to the true value; see [Bai11a] for details. A more challenging problem in this case is how to choose a good basis for a given sample. We showed in Chapter 6 that a random selection strategy will succeed with high probability. A heuristic based on this approach is to choose the most frequent prefixes and suffixes in a sample because they are more likely to appear in a randomly selected basis. However, one could arguably do a better job trying to find a basis by looking at the spectrum of the resulting Hankel matrix. For example, a greedy algorithm that adds prefixes (and suffixes) one at a time while trying to maximize the smallest singular value of the corresponding Hankel matrix is a good candidate for finding a basis which captures all states and minimizes the noise in the tail of the singular values.
How to analyze and efficiently implement this type of approach is an interesting open problem

that could potentially improve the performance of spectral methods in practice. Another similar question is to study the effect of parameter mismatch in algorithms for learning FSM. For example, what happens if we give the spectral method too small a number of states: can we quantify the error in terms of, say, the singular values that will be missing from our hypothesis? Note that this is close to analyzing how state-merging and spectral algorithms behave in non-realizable settings. This is the subject of the next section.

Learning Distributions in Non-Realizable Settings

A corollary from the results in Chapter 1 states that state-merging algorithms can accurately learn distributions on Σ⋆ which are close to being realized by PDFA. We saw that this is a consequence of formulating state-merging algorithms using statistical queries. In fact, a similar conclusion can be drawn for the spectral algorithm that learns PNFA. Indeed, noting that any finite sub-block of the Hankel matrix HD of a probability distribution D over Σ⋆ can be estimated with statistical queries, the reasoning used in Chapter 1 can be applied here as well; such a possibility was already sketched in [HKZ09]. Thus, we see that both state-merging and spectral algorithms for learning distributions over Σ⋆ are resilient to outliers and can learn distributions “close” to their original target class. These observations yield timid – yet encouraging – indications that these methods can learn some distributions beyond the classes they use for representing hypotheses. That is, we get mild learning results in some non-realizable settings. This raises the natural question of whether these hypothesis classes and algorithmic paradigms can be used for learning good models under more relaxed conditions. Such results would be interesting for two reasons. First, because these algorithms can already be run on any sample from Σ⋆. Hence, we could use the same algorithms that learn in the realizable case for learning these wider classes of distributions. And second, because these algorithms find hypotheses from classes which are well understood (PDFA and WFA) in terms of how they are used for inference and prediction.

So, what kind of results can one hope to prove along these lines? One possibility is to seek structural results stating that distributions over Σ⋆ that satisfy certain conditions can be well approximated by, say, PDFA or stochastic WFA. We are not aware of many results of this type. One example is the proof in [GKPP06] that every PNFA can be approximated by a large enough PDFA; the bound, however, seems too large for any practical purpose. Thus, besides their immediate applications to learning theory, these results would have intrinsic mathematical value in themselves; i.e. they would shed some light on which classes of distributions can be approximated by probabilistic automata. Speculatively speaking, it seems reasonable to try to generalize the parameters that measure the complexity of learning PDFA and WFA – the distinguishability µ and the smallest singular value sr(H) – to distributions not realizable by FSM, and then see what is the effect of these parameters on the approximability of general distributions by probabilistic FSM.

An alternative path would be, instead of analyzing how known algorithms perform in non-realizable settings, to use these algorithms as building blocks for deriving new learning algorithms for more general problems using the same hypothesis classes. A promising idea in this direction is to use the output of state-merging or spectral algorithms as the initial point for some type of local maximum likelihood optimization. For example, these can be based on heuristics like Expectation-Maximization or standard gradient ascent on the log-likelihood. Promising experimental results with this type of methods have been recently reported in [Bai11b; KNW13; CL13]. These experiments suggest that an algorithm based on this principle is in general faster than standard EM and statistically better than spectral methods by themselves, both in realizable and non-realizable settings.
Developing new theory to explain these findings is an interesting open problem. In particular, combined with other recent spectral learning methods for many classes of probabilistic models, it could lead to truly practical learning algorithms with formal guarantees for these classes. Of course, it could be that probabilistic FSM can only provide very coarse approximations for general classes of distributions over Σ⋆. In this case, one could try to design a boosting-like approach [SF12] that uses a state-merging or spectral algorithm as a weak learner to obtain a hypothesis which is a mixture of automata. Though it is less clear how to apply the boosting paradigm to a learning problem on probability distributions, some attempts have been made to derive algorithms of this type [Pav04].

In fact, we tried to apply these ideas directly to boost the spectral method and entered the PAutomaC competition on learning PFA and HMM [VEH12]. However, we found out that our algorithm always ended up getting stuck trying to model the longest strings in the sample. Clearly, more work is needed on this approach, especially on designing a reweighting scheme for the boosting iterations that avoids putting too much weight on long strings. It would also be interesting to characterize for which distributions over Σ⋆ state-merging and spectral methods can be regarded as provable weak learners.

A question related to both non-realizable learning and input parameters is how one should choose the parameters given to state-merging or spectral algorithms when an arbitrary sample from a non-realizable target is available. That is: how can one make the most of these algorithms in a practical problem without making any assumption on the target distribution? This is basically a statistical model selection problem. In particular, given a large enough number of states, these algorithms will learn the empirical distribution defined by the sample, thus yielding a grossly overfitted model. On the other hand, if we choose a model with too few states we may be losing generalization power. In classification problems this can be solved by providing generalization bounds on the predictive performance of a classifier and then minimizing those; sometimes this is referred to as structural risk minimization [Vap99]. When the problem is to learn a probability distribution, one approach is to give generalization bounds for the relative entropy between the target and a hypothesis in terms of the empirical relative entropy – which is directly related to the sample likelihood – and a penalty term that depends on the complexity of the model. For example, bounds of this sort exist for density estimation with maximum entropy models [DPS04; MMMSW09]. This makes it possible to select a model that minimizes the sum of the empirical relative entropy plus a penalty term that, in the case of stochastic FSM, would depend on the number of states and possibly other quantities. To prove this type of bounds one can resort to stability arguments like the ones in Chapter 9, or to VC-like quantities. Computing the latter in the case of PDFA and WFA is an open problem that could also yield improvements in the analysis of other learning algorithms based on these hypothesis classes, as we have already mentioned above.

Bibliography

[ATW01] N. Abe, J. Takeuchi, and M. K. Warmuth. “Polynomial learnability of stochastic rules with respect to the KL-divergence and quadratic distance”. In: IEICE Transactions on Information and Systems (2001).
[AW92] N. Abe and M. K. Warmuth. “On the computational complexity of approximating distributions by probabilistic automata”. In: Machine Learning (1992).
[Agg07] C. Aggarwal, ed. Data streams: Models and algorithms. Springer, 2007.
[Aka77] H. Akaike. “On entropy maximization principle”. In: Applications of Statistics. North-Holland, 1977.
[AD94] R. Alur and D. L. Dill. “A theory of timed automata”. In: Theoretical Computer Science (1994).
[ACHKSZ11] A. Anandkumar, K. Chaudhuri, D. Hsu, S. M. Kakade, L. Song, and T. Zhang. “Spectral methods for learning multivariate latent tree structure”. In: Neural Information Processing Systems Conference (NIPS) (2011).
[AFHKL12a] A. Anandkumar, D. P. Foster, D. Hsu, S. M. Kakade, and Y-K. Liu. “A spectral algorithm for latent Dirichlet allocation”. In: Neural Information Processing Systems Conference (NIPS) (2012).
[AFHKL12b] A. Anandkumar, D. P. Foster, D. Hsu, S. M. Kakade, and Y-K. Liu. “Two SVDs suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation”. In: CoRR abs/1204.6703 (2012).
[AGHKT12] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. “Tensor decompositions for learning latent variable models”. In: CoRR abs/1210.7559 (2012).
[AHHK12] A. Anandkumar, D. Hsu, F. Huang, and S. M. Kakade. “Learning mixtures of tree graphical models”. In: Neural Information Processing Systems Conference (NIPS) (2012).
[AHK12a] A. Anandkumar, D. Hsu, and S. M. Kakade. “A method of moments for mixture models and hidden Markov models”. In: Conference on Learning Theory (COLT) (2012).
[AHK12b] A. Anandkumar, D. Hsu, and S. M. Kakade. “Learning high-dimensional mixtures of graphical models”. In: CoRR abs/1203.0697 (2012).
[AB12] E. Anceaume and Y. Busnel. “Sketch ⋆-metric: Comparing data streams via sketching”. In: HAL hal-00764772 (2012).
[ADLV] M. Andersen, J. Dahl, Z. Liu, and L. Vandenberghe. “Interior-point methods for large-scale cone programming”. In: Optimization for Machine Learning. The MIT Press.
[Ang82] D. Angluin. “Inference of reversible languages”. In: Journal of the ACM (1982).
[Ang87] D. Angluin. “Queries and concept learning”. In: Machine Learning (1987).
[AD98] J. A. Aslam and S. E. Decatur. “Specification and simulation of statistical query algorithms for efficiency and noise tolerance”. In: Journal of Computing Systems Science (1998).

[BHHK03] C. Baier, B. Haverkort, H. Hermanns, and J-P. Katoen. "Model-checking algorithms for continuous-time Markov chains". In: IEEE Transactions on Software Engineering (2003).
[Bai11a] R. Bailly. "Méthodes spectrales pour l'inférence grammaticale probabiliste de langages stochastiques rationnels". PhD thesis. Aix-Marseille Université, 2011.
[Bai11b] R. Bailly. "Quadratic weighted automata: Spectral algorithm and likelihood maximization". In: Asian Conference on Machine Learning (ACML) (2011).
[BD11] R. Bailly and F. Denis. "Absolute convergence of rational series is semi-decidable". In: Information and Computation (2011).
[BDR09] R. Bailly, F. Denis, and L. Ralaivola. "Grammatical inference as a principal component analysis problem". In: International Conference on Machine Learning (ICML) (2009).
[BHD10] R. Bailly, A. Habrard, and F. Denis. "A spectral approach for probabilistic grammatical inference on trees". In: Conference on Algorithmic Learning Theory (ALT) (2010).
[BCLQ13] B. Balle, X. Carreras, F.M. Luque, and A. Quattoni. "Spectral learning of weighted automata: A forward-backward perspective". Submitted. 2013.
[BCG10a] B. Balle, J. Castro, and R. Gavaldà. "A lower bound for learning distributions generated by probabilistic automata". In: Conference on Algorithmic Learning Theory (ALT) (2010).
[BCG10b] B. Balle, J. Castro, and R. Gavaldà. "Learning PDFA with asynchronous transitions". In: International Colloquium on Grammatical Inference (ICGI) (2010).
[BCG12a] B. Balle, J. Castro, and R. Gavaldà. "Adaptive learning of probabilistic deterministic automata from data streams". Submitted. 2012.
[BCG12b] B. Balle, J. Castro, and R. Gavaldà. "Bootstrapping and learning PDFA in data streams". In: International Colloquium on Grammatical Inference (ICGI) (2012).
[BCG13] B. Balle, J. Castro, and R. Gavaldà. "Learning probabilistic automata: A study in state distinguishability". In: Theoretical Computer Science (2013).
[BM12] B. Balle and M. Mohri. "Spectral learning of general weighted automata via constrained matrix completion". In: Neural Information Processing Systems Conference (NIPS) (2012).
[BQC11] B. Balle, A. Quattoni, and X. Carreras. "A spectral learning algorithm for finite state transducers". In: European Conference on Machine Learning (ECML) (2011).
[BQC12] B. Balle, A. Quattoni, and X. Carreras. "Local loss optimization in operator models: A new insight into spectral learning". In: International Conference on Machine Learning (ICML) (2012).
[BFRSW13] T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White. "Testing closeness of discrete distributions". In: Journal of the ACM (2013).
[BP66] L. E. Baum and T. Petrie. "Statistical inference for probabilistic functions of finite state Markov chains". In: The Annals of Mathematical Statistics (1966).
[BPSW70] L. E. Baum, T. Petrie, G. Soules, and N. Weiss. "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains". In: The Annals of Mathematical Statistics (1970).
[BT09] A. Beck and M. Teboulle. "A fast iterative shrinkage-thresholding algorithm for linear inverse problems". In: SIAM Journal on Imaging Sciences (2009).
[BBBKV00] A. Beimel, F. Bergadano, N.H. Bshouty, E. Kushilevitz, and S. Varricchio. "Learning functions represented as multiplicity automata". In: Journal of the ACM (2000).
[Ber79] J. Berstel. Transductions and context-free languages. Teubner, 1979.
[BR88] J. Berstel and C. Reutenauer. Rational series and their languages. Springer, 1988.

[Bif10] A. Bifet. Adaptive stream mining: Pattern learning and mining from evolving data streams. IOS Press, 2010.
[BSG11] B. Boots, S. Siddiqi, and G. Gordon. "Closing the learning planning loop with predictive state representations". In: International Journal of Robotic Research (2011).
[BLB04] S. Boucheron, G. Lugosi, and O. Bousquet. "Concentration inequalities". In: Advanced Lectures on Machine Learning (2004).
[BLM13] S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
[BBL04] O. Bousquet, S. Boucheron, and G. Lugosi. "Introduction to statistical learning theory". In: Advanced Lectures on Machine Learning (2004).
[BE02] O. Bousquet and A. Elisseeff. "Stability and generalization". In: Journal of Machine Learning Research (2002).
[BPCPE11] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. "Distributed optimization and statistical learning via the alternating direction method of multipliers". In: Foundations and Trends in Machine Learning (2011).
[BV04] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.
[Bre96] L. Breiman. "Bagging predictors". In: Machine Learning (1996).
[CCS10] J-F. Cai, E. J. Candès, and Z. Shen. "A singular value thresholding algorithm for matrix completion". In: SIAM Journal on Optimization (2010).
[CP10] E. J. Candès and Y. Plan. "Matrix completion with noise". In: Proceedings of the IEEE (2010).
[CT10] E. J. Candès and T. Tao. "The power of convex relaxation: Near-optimal matrix completion". In: IEEE Transactions on Information Theory (2010).
[CP71] J. W. Carlyle and A. Paz. "Realizations by stochastic finite automata". In: Journal of Computer and System Sciences (1971).
[Car97] R. C. Carrasco. "Accurate computation of the relative entropy between stochastic regular grammars". In: RAIRO (Theoretical Informatics and Applications) (1997).
[CO94] R. C. Carrasco and J. Oncina. "Learning stochastic regular grammars by means of a state merging method". In: International Colloquium on Grammatical Inference (ICGI) (1994).
[CG08] J. Castro and R. Gavaldà. "Towards feasible PAC-learning of probabilistic deterministic finite automata". In: International Colloquium on Grammatical Inference (ICGI) (2008).
[CBL06] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
[CL13] A. Chaganty and P. Liang. "Spectral experts for estimating mixtures of linear regressions". In: International Conference on Machine Learning (ICML) (2013).
[Cha96] J. T. Chang. "Full reconstruction of Markov models on evolutionary trees: Identifiability and consistency". In: Mathematical Biosciences (1996).
[Cla09] A. Clark. Private communication. 2009.
[CT04a] A. Clark and F. Thollard. "PAC-learnability of probabilistic deterministic finite state automata". In: Journal of Machine Learning Research (2004).
[CT04b] A. Clark and F. Thollard. "Partially distribution-free learning of regular languages from positive samples". In: Conference on Computational Linguistics (COLING) (2004).
[CC12] S. B. Cohen and M. Collins. "Tensor decomposition for fast parsing with latent-variable PCFGs". In: Neural Information Processing Systems Conference (NIPS) (2012).

[CSCFU12] S. B. Cohen, K. Stratos, M. Collins, D. P. Foster, and L. Ungar. "Spectral learning of latent-variable PCFGs". In: Annual Meeting of the Association for Computational Linguistics (ACL) (2012).
[CSCFU13] S. B. Cohen, K. Stratos, M. Collins, D. P. Foster, and L. Ungar. "Experiments with spectral learning of latent-variable PCFGs". In: Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (HLT-NAACL) (2013).
[DE08] F. Denis and Y. Esposito. "On rational stochastic languages". In: Fundamenta Informaticae (2008).
[DL01] L. Devroye and G. Lugosi. Combinatorial methods in density estimation. Springer, 2001.
[DRCFU12] P. S. Dhillon, J. Rodu, M. Collins, D. P. Foster, and L. H. Ungar. "Spectral dependency parsing with latent variables". In: EMNLP-CoNLL (2012).
[DRFU12] P. S. Dhillon, J. Rodu, D. P. Foster, and L. H. Ungar. "Using CCA to improve CCA: A new spectral method for estimating vector models of words". In: International Conference on Machine Learning (ICML) (2012).
[DP12] D.P. Dubhashi and A. Panconesi. Concentration of measure for the analysis of randomized algorithms. Cambridge University Press, 2012.
[DPS04] M. Dudik, S. Phillips, and R. Schapire. "Performance guarantees for regularized maximum entropy density estimation". In: Conference on Learning Theory (COLT) (2004).
[DDE05] P. Dupont, F. Denis, and Y. Esposito. "Links between probabilistic automata and hidden Markov models: probability distributions, learning models and induction algorithms". In: Pattern Recognition (2005).
[Efr79] B. Efron. "Bootstrap methods: Another look at the jackknife". In: The Annals of Statistics (1979).
[EG02] L. El Ghaoui. "Inversion error, condition number, and approximate inverses of uncertain matrices". In: Linear Algebra and its Applications (2002).
[Faz02] M. Fazel. "Matrix rank minimization with applications". PhD thesis. Stanford University, 2002.
[Fli74] M. Fliess. "Matrices de Hankel". In: Journal de Mathématiques Pures et Appliquées (1974).
[FRU12] D. P. Foster, J. Rodu, and L. H. Ungar. "Spectral dimensionality reduction for HMMs". In: CoRR abs/1203.6130 (2012).
[FSSS11] R. Foygel, R. Salakhutdinov, O. Shamir, and N. Srebro. "Learning with the weighted trace-norm under arbitrary sampling distributions". In: Neural Information Processing Systems Conference (NIPS) (2011).
[Fro07] M. Fromont. "Model selection by bootstrap penalization for classification". In: Machine Learning (2007).
[FLLRB12] M. Fromont, B. Laurent, M. Lerasle, and P. Reynaud-Bouret. "Kernels based tests with non-asymptotic bootstrap approaches for two-sample problems". In: Conference on Learning Theory (COLT) (2012).
[Gam10] J. Gama. Knowledge discovery from data streams. Taylor and Francis, 2010.
[GV90] P. García and E. Vidal. "Inference of k-testable languages in the strict sense and application to syntactic pattern recognition". In: IEEE Transactions on Pattern Analysis and Machine Intelligence (1990).
[GKPP06] R. Gavaldà, P. W. Keller, J. Pineau, and D. Precup. "PAC-learning of Markov models with hidden state". In: European Conference on Machine Learning (ECML) (2006).
[Gol67] E.M. Gold. "Language identification in the limit". In: Information and Control (1967).

[GKS01] S. A. Goldman, S. S. Kwek, and S. D. Scott. "Agnostic learning of geometric patterns". In: Journal of Computer and System Sciences (2001).
[Gol06] O. Goldreich. "On promise problems: A survey". In: Essays in Memory of Shimon Even (2006).
[GBRSS12] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola. "A kernel two-sample test". In: Journal of Machine Learning Research (2012).
[GH98] D. Gross and C. M. Harris. Fundamentals of queuing theory. Wiley, 1998.
[GVW05] O. Guttman, S. V. N. Vishwanathan, and R. C. Williamson. "Learnability of probabilistic automata via oracles". In: Conference on Algorithmic Learning Theory (ALT) (2005).
[HMT11] N. Halko, P-G. Martinsson, and J. A. Tropp. "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions". In: SIAM Review (2011).
[Hal92] P. Hall. The bootstrap and Edgeworth expansion. Springer, 1992.
[Hau92] D. Haussler. "Decision theoretic generalizations of the PAC model for neural net and other learning applications". In: Information and Computation (1992).
[Hig10] C. de la Higuera. Grammatical inference: Learning automata and grammars. Cambridge University Press, 2010.
[HLW06] S. Hoory, N. Linial, and A. Wigderson. "Expander graphs and their applications". In: Bulletin of the AMS (2006).
[HKZ09] D. Hsu, S. M. Kakade, and T. Zhang. "A spectral algorithm for learning hidden Markov models". In: Conference on Learning Theory (COLT) (2009).
[Jae00] H. Jaeger. "Observable operator models for discrete stochastic time series". In: Neural Computation (2000).
[JZKOPK05] H. Jaeger, M. Zhao, K. Kretzschmar, T. Oberstein, D. Popovici, and A. Kolling. "Learning observable operator models via the ES algorithm". In: New Directions in Statistical Signal Processing: From Systems to Brains. The MIT Press, 2005.
[Jun09] R. Jungers. The joint spectral radius: Theory and applications. Springer, 2009.
[KSST12] S. M. Kakade, S. Shalev-Shwartz, and A. Tewari. "Regularization techniques for learning with matrices". In: Journal of Machine Learning Research (2012).
[KKMS05] A. T. Kalai, A. R. Klivans, Y. Mansour, and R. A. Servedio. "Agnostically learning halfspaces". In: IEEE Symposium on Foundations of Computer Science (FOCS) (2005).
[KMV08] A. T. Kalai, Y. Mansour, and E. Verbin. "On agnostic boosting and parity learning". In: Symposium on Theory of Computing (STOC) (2008).
[KR99] M. Kearns and D. Ron. "Algorithmic stability and sanity-check bounds for leave-one-out cross-validation". In: Neural Computation (1999).
[Kea98] M. J. Kearns. "Efficient noise-tolerant learning from statistical queries". In: Journal of the ACM (1998).
[KMRRSS94] M. J. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, R. E. Schapire, and L. Sellie. "On the learnability of discrete distributions". In: Symposium on Theory of Computing (STOC) (1994).
[KSS94] M. J. Kearns, R. E. Schapire, and L. M. Sellie. "Toward efficient agnostic learning". In: Machine Learning (1994).
[KV94] M. J. Kearns and L. G. Valiant. "Cryptographic limitations on learning boolean formulae and finite automata". In: Journal of the ACM (1994).
[KTSJ12] A. Kleiner, A. Talwalkar, P. Sarkar, and M.I. Jordan. "The big data bootstrap". In: International Conference on Machine Learning (ICML) (2012).

[Kol01] V. Koltchinskii. "Rademacher penalties and structural risk minimization". In: IEEE Transactions on Information Theory (2001).
[KNW13] A. Kontorovich, B. Nadler, and R. Weiss. "On learning parametric-output HMMs". In: International Conference on Machine Learning (ICML) (2013).
[KM93] E. Kushilevitz and Y. Mansour. "Learning decision trees using the Fourier spectrum". In: SIAM Journal on Computing (1993).
[Kut02] S. Kutin. Extensions to McDiarmid's inequality when differences are bounded with high probability. Tech. rep. TR-2002-04. University of Chicago, 2002.
[LPP98] K. Lang, B. Pearlmutter, and R. Price. "Results of the Abbadingo one DFA learning competition and a new evidence-driven state merging algorithm". In: International Colloquium on Grammatical Inference (ICGI) (1998).
[LBW96] W. S. Lee, P. L. Bartlett, and R. C. Williamson. "Efficient agnostic learning of neural networks with bounded fan-in". In: IEEE Transactions on Information Theory (1996).
[Ler92] B. G. Leroux. "Maximum-likelihood estimation for hidden Markov models". In: Stochastic Processes and Their Applications (1992).
[LM99] C.K. Li and R. Mathias. "The Lidskii–Mirsky–Wielandt theorem: Additive and multiplicative versions". In: Numerische Mathematik (1999).
[LZ08] X. Lin and Y. Zhang. "Aggregate computation over data streams". In: Asian-Pacific Web Conference (APWeb) (2008).
[LSS01] M. L. Littman, R. S. Sutton, and S. Singh. "Predictive representations of state". In: Neural Information Processing Systems Conference (NIPS) (2001).
[LSY10] G. Liu, J. Sun, and S. Yan. "Closed-form solutions to a category of nuclear norm minimization problems". In: NIPS Workshop on Low-Rank Methods for Large-Scale Machine Learning (2010).
[LQBC12] F.M. Luque, A. Quattoni, B. Balle, and X. Carreras. "Spectral learning in non-deterministic dependency parsing". In: Conference of the European Chapter of the Association for Computational Linguistics (EACL) (2012).
[Maa94] W. Maass. "Efficient agnostic PAC-learning with simple hypothesis". In: Conference on Learning Theory (COLT) (1994).
[MMMSW09] G. Mann, R. McDonald, M. Mohri, N. Silberman, and D. Walker. "Efficient large-scale distributed training of conditional maximum entropy models". In: Neural Information Processing Systems Conference (NIPS) (2009).
[MAA05] A. Metwally, D. Agrawal, and A. Abbadi. "Efficient computation of frequent and top-k elements in data streams". In: International Conference on Database Theory (ICDT) (2005).
[Moh09] M. Mohri. "Weighted automata algorithms". In: Handbook of Weighted Automata. Springer, 2009.
[MRT12] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of machine learning. The MIT Press, 2012.
[MR05] E. Mossel and S. Roch. "Learning nonsingular phylogenies and hidden Markov models". In: Symposium on Theory of Computing (STOC) (2005).
[Mut05] S. Muthukrishnan. "Data streams: algorithms and applications". In: Foundations and Trends in Theoretical Computer Science (2005).
[Nes07] Y. Nesterov. Gradient methods for minimizing composite objective function. Tech. rep. Université catholique de Louvain, Center for Operations Research and Econometrics (CORE), 2007.
[Nor09] J. P. Norton. An introduction to identification. Dover, 2009.

[OGV93] J. Oncina, P. García, and E. Vidal. "Learning subsequential transducers for pattern recognition interpretation tasks". In: IEEE Transactions on Pattern Analysis and Machine Intelligence (1993).
[OG92] J. Oncina and P. García. "Identifying regular languages in polynomial time". In: Advances in Structural and Syntactic Pattern Recognition (1992).
[PG07] N. Palmer and P. W. Goldberg. "PAC-learnability of probabilistic deterministic finite state automata in terms of variation distance". In: Theoretical Computer Science (2007).
[Pal08] N. J. Palmer. "Pattern classification via unsupervised learners". PhD thesis. University of Warwick, 2008.
[PSX11] A. P. Parikh, L. Song, and E.P. Xing. "A spectral algorithm for latent tree graphical models". In: International Conference on Machine Learning (ICML) (2011).
[Par87] E. Parzen. Stochastic processes. Society for Industrial Mathematics, 1987.
[Pav04] V. Pavlovic. "Model-based motion clustering using boosted mixture modeling". In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2004).
[Pea94] K. Pearson. "Contributions to the mathematical theory of evolution". In: Philosophical Transactions of the Royal Society of London (1894).
[Pet69] T. Petrie. "Probabilistic functions of finite state Markov chains". In: The Annals of Mathematical Statistics (1969).
[PW93] L. Pitt and M. K. Warmuth. "The minimum consistent DFA problem cannot be approximated within any polynomial". In: Journal of the ACM (1993).
[Pol03] D. Pollard. A user's guide to measure theoretic probability. Cambridge University Press, 2003.
[Rec11] B. Recht. "A simpler approach to matrix completion". In: Journal of Machine Learning Research (2011).
[RST98] D. Ron, Y. Singer, and N. Tishby. "On the learnability and usage of acyclic probabilistic finite automata". In: Journal of Computer and System Sciences (1998).
[SS78] A. Salomaa and M. Soittola. Automata-theoretic aspects of formal power series. Springer, 1978.
[SF12] R. E. Schapire and Y. Freund. Boosting: Foundations and algorithms. The MIT Press, 2012.
[SLRB11] M. Schmidt, N. Le Roux, and F. Bach. "Convergence rates of inexact proximal-gradient methods for convex optimization". In: Neural Information Processing Systems Conference (NIPS) (2011).
[Sch61] M. P. Schützenberger. "On the definition of a family of automata". In: Information and Control (1961).
[Sch78] G. Schwarz. "Estimating the dimension of a model". In: The Annals of Statistics (1978).
[SBG10] S.M. Siddiqi, B. Boots, and G.J. Gordon. "Reduced-rank hidden Markov models". In: International Conference on Artificial Intelligence and Statistics (AISTATS) (2010).
[SBSGS10] L. Song, B. Boots, S. Siddiqi, G. Gordon, and A. Smola. "Hilbert space embeddings of hidden Markov models". In: International Conference on Machine Learning (ICML) (2010).
[SS90] G. W. Stewart and J. Sun. Matrix perturbation theory. Academic Press, 1990.
[TB73] B. A. Trakhtenbrot and Ya. M. Barzdin. Finite automata: Behavior and synthesis. North-Holland, 1973.
[Val84] L. G. Valiant. "A theory of the learnable". In: Communications of the ACM (1984).
[VODM94] P. Van Overschee and B. De Moor. "N4SID: Subspace algorithms for the identification of combined deterministic-stochastic systems". In: Automatica (1994).

[Vap99] V. Vapnik. The nature of statistical learning theory. Springer, 1999.
[VC71] V. N. Vapnik and A. Y. Chervonenkis. "On the uniform convergence of relative frequencies of events to their probabilities". In: Theory of Probability and Its Applications (1971).
[Ver12] R. Vershynin. "Introduction to the non-asymptotic analysis of random matrices". In: Compressed Sensing, Theory and Applications. Cambridge University Press, 2012.
[VEH12] S. Verwer, R. Eyraud, and C. de la Higuera. "PAutomaC: a PFA/HMM learning competition". In: International Colloquium on Grammatical Inference (ICGI) (2012).
[Ver10] S.E. Verwer. "Efficient identification of timed automata: theory and practice". PhD thesis. Delft University of Technology, 2010.
[VTHCC05a] E. Vidal, F. Thollard, C. de la Higuera, F. Casacuberta, and R. C. Carrasco. "Probabilistic finite-state machines – Part I". In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2005).
[VTHCC05b] E. Vidal, F. Thollard, C. de la Higuera, F. Casacuberta, and R. C. Carrasco. "Probabilistic finite-state machines – Part II". In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2005).
[XCM12] H. Xu, C. Caramanis, and S. Mannor. "Sparse algorithms are not stable: A no-free-lunch theorem". In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2012).
[ZJ10] M-J. Zhao and H. Jaeger. "Norm-observable operator models". In: Neural Computation (2010).
[ZJT09] M-J. Zhao, H. Jaeger, and M. Thon. "A bound on modeling error in observable operator models and an associated learning algorithm". In: Neural Computation (2009).
[ZB06] L. Zwald and G. Blanchard. "On the convergence of eigenspaces in kernel principal component analysis". In: Neural Information Processing Systems Conference (NIPS) (2006).

Appendix A

Mathematical Preliminaries

A.1 Probability

Concentration of measure is a well-known phenomenon in high-dimensional probability spaces that has many applications in learning theory and the analysis of randomized algorithms. In this section we recall several concentration inequalities that are used extensively in this dissertation. Most of these results can be found in the excellent books [DP12; BLM13].

The most basic concentration bound is Markov's inequality, which states that for any random variable X and any t > 0,

    P[|X| ≥ t] ≤ E[|X|] / t .

If information about the variance V[X] is available, one can use Chebyshev's inequality: for any t > 0,

    P[|X − E[X]| ≥ t] ≤ V[X] / t² .

Sometimes the following one-sided version, known as the Chebyshev–Cantelli inequality, is also useful: for any t > 0,

    P[X − E[X] ≥ t] ≤ V[X] / (V[X] + t²) .

When the variable of interest X can be expressed as a sum of independent variables, one can show that concentration around the mean happens exponentially fast. For i ∈ [m] let X_i be independent random variables with 0 ≤ X_i ≤ 1, and define X = Σ_{i∈[m]} X_i. Then, for any 0 < γ < 1, the following Chernoff bounds hold:

    P[X > (1 + γ)E[X]] ≤ exp(−γ²E[X]/3) ,
    P[X < (1 − γ)E[X]] ≤ exp(−γ²E[X]/2) .

Suppose now that for each i ∈ [m] we have a_i ≤ X_i ≤ b_i. Then we can use the Hoeffding bounds: for any t > 0,

    P[X ≥ E[X] + t] , P[X ≤ E[X] − t] ≤ exp( −2t² / Σ_{i∈[m]} (b_i − a_i)² ) .

The following is another useful form of these bounds:

    P[|X − E[X]| ≥ t] ≤ 2 exp( −2t² / Σ_{i∈[m]} (b_i − a_i)² ) .
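As a quick numerical illustration of the two-sided Hoeffding bound above, the following snippet compares the empirical tail of a sum of uniform random variables with the bound; the specific values of m, t, and the number of trials are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
m, t, trials = 100, 8.0, 50_000

# X is a sum of m independent Uniform[0,1] variables, so a_i = 0 and b_i = 1.
X = rng.random((trials, m)).sum(axis=1)

empirical = np.mean(np.abs(X - m / 2) >= t)
bound = 2 * np.exp(-2 * t**2 / m)       # sum_i (b_i - a_i)^2 = m
print(f"empirical tail {empirical:.2e} <= Hoeffding bound {bound:.2e}")
```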

See [DP12] for proofs and further results along these lines.

Chernoff and Hoeffding bounds are useful to bound the deviation of a sum of scalar independent random variables from its expectation. Often we need to deal with more general functions of more general random variables. In particular, let X_1, ..., X_m be Ω-valued independent random variables and write Y = (X_1, ..., X_m) ∈ Ω^m. Suppose that Φ : Ω^m → ℝ is an arbitrary measurable function. We want to see how the random variable Φ(Y) concentrates around its expectation E[Φ(Y)].

One way is to use the Efron–Stein inequality to bound the variance V[Φ(Y)] as follows, and then apply Chebyshev's inequality. For each i ∈ [m] let X'_i be an independent copy of X_i, and write Y_i = (X_1, ..., X_{i−1}, X'_i, X_{i+1}, ..., X_m) for the random vector in Ω^m obtained by replacing X_i with its copy X'_i. Then we have the following:

    V[Φ(Y)] ≤ (1/2) Σ_{i∈[m]} E[ (Φ(Y) − Φ(Y_i))² ] .

If the change in the value of Φ can be controlled for any arbitrary change in one of its coordinates, then exponential concentration can be proved. So suppose that there exist constants c_i such that for each i ∈ [m],

    sup_{x_1,...,x_m,x'_i ∈ Ω} |Φ(x_1, ..., x_i, ..., x_m) − Φ(x_1, ..., x'_i, ..., x_m)| ≤ c_i .

Such a function Φ is said to satisfy the bounded differences assumption. In this case, McDiarmid's inequality says that for any t > 0 the following hold:

    P[Φ(Y) ≥ E[Φ(Y)] + t] ≤ exp( −2t² / Σ_{i∈[m]} c_i² ) ,
    P[Φ(Y) ≤ E[Φ(Y)] − t] ≤ exp( −2t² / Σ_{i∈[m]} c_i² ) .        (A.1)
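The next snippet is a small numerical illustration of McDiarmid's inequality for a function close in spirit to the distances used in this thesis: the L∞ distance between an empirical distribution and its target over a finite support. This function satisfies the bounded differences assumption with c_i = 1/m; the distribution and parameters are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
m, trials, t = 500, 10_000, 0.03
p = np.array([0.5, 0.3, 0.2])   # a fixed distribution over three symbols

# Phi(Y) = max_x |phat(x) - p(x)| has bounded differences with c_i = 1/m:
# changing one sample point moves each empirical frequency by at most 1/m.
samples = rng.choice(3, size=(trials, m), p=p)
phat = np.stack([(samples == x).mean(axis=1) for x in range(3)], axis=1)
phi = np.abs(phat - p).max(axis=1)

empirical = np.mean(phi >= phi.mean() + t)
bound = np.exp(-2 * t**2 * m)   # sum_i c_i^2 = m * (1/m)^2 = 1/m
print(f"empirical {empirical:.2e} <= McDiarmid bound {bound:.2e}")
```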

The next results give useful variations on McDiarmid's inequality for dealing with functions that do not satisfy the bounded differences assumption almost surely. These and other similar results can be found in [Kut02]. This kind of inequality has proved useful in many stability analyses.

Definition A.1.1. Let Y = (X_1, ..., X_m) be a random variable on a probability space Ω^m. We say that a function Φ : Ω^m → ℝ is strongly difference-bounded by (b, c, δ) if the following holds:

1. there exists a subset E ⊆ Ω^m with P[E] ≤ δ, such that
2. if Y, Y′ differ only on one coordinate and Y ∉ E, then |Φ(Y) − Φ(Y′)| ≤ c, and
3. for all Y, Y′ that differ only on one coordinate, |Φ(Y) − Φ(Y′)| ≤ b.

Theorem A.1.2. Let Φ be a function over a probability space Ω^m that is strongly difference-bounded by (b, c, δ) with b ≥ c > 0. Then, for any t > 0,

    P[Φ(Y) − E[Φ(Y)] ≥ t] ≤ exp( −t² / (8mc²) ) + mbδ/c .

Furthermore, the same bound holds for P[E[Φ(Y)] − Φ(Y) ≥ t].

Corollary A.1.3. Let Φ be a function over a probability space Ω^m that is strongly difference-bounded by (b, θ/m, exp(−Km)). Then, for any 0 < t ≤ 2θ√K and m ≥ max{ b/θ , (9 + 18/K) ln(3 + 6/K) },

    P[Φ(Y) − E[Φ(Y)] ≥ t] ≤ 2 exp( −t²m / (8θ²) ) .

Furthermore, the same bound holds for P[E[Φ(Y)] − Φ(Y) ≥ t].

The following is another useful form of the previous corollary, stated as deviation bounds instead of bounds on tail probabilities.

Corollary A.1.4. Let Φ be a function over a probability space Ω^m that is strongly difference-bounded by (b, θ/m, exp(−Km)). Then, for any δ > 0 and any sample size

    m ≥ max{ b/θ , (9 + 18/K) ln(3 + 6/K) , (2/K) ln(2/δ) } ,

each of the following holds with probability at least 1 − δ:

    Φ(Y) ≥ E[Φ(Y)] − √( (8θ²/m) ln(2/δ) ) ,
    Φ(Y) ≤ E[Φ(Y)] + √( (8θ²/m) ln(2/δ) ) .

A real random variable X is sub-exponential if there exists a constant c > 0 such that P[|X| ≥ t] ≤ exp(−ct) holds for all t ≥ 0. The following theorem gives an exponential concentration bound for the sum of sub-exponential random variables.

Theorem A.1.5 ([Ver12]). Let X_1, ..., X_m be i.i.d. sub-exponential random variables and write Z = Σ_{i=1}^m X_i. Then, for every t ≥ 0, we have

    P[Z − E[Z] ≥ mt] ≤ exp( −(m/(8e²)) min{ t²/(4c²) , t/(2c) } ) ,

where c is the sub-exponential constant of the variables X_i.

A centered real random variable X is sub-gamma if there exist constants ν and c such that the following inequality holds for every 0 < s < 1/c:

    log max{ E[e^{sX}] , E[e^{−sX}] } ≤ s²ν / (2(1 − cs)) .

Then ν is called the variance factor of X and c is called the scale parameter of X. A usual example is that of gamma random variables: if Y ∼ Gamma(k, θ), then X = Y − E[Y] is sub-gamma with ν = kθ² and c = θ. The following result gives two-sided concentration bounds for sub-gamma random variables; see [BLM13] for further details.

Theorem A.1.6 ([BLM13]). Let X be a centered sub-gamma random variable with variance factor ν and scale parameter c. Then the two following inequalities hold for every t > 0:

    P[X > √(2νt) + ct] ≤ exp(−t) ,
    P[−X > √(2νt) + ct] ≤ exp(−t) .
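The sub-gamma bound of Theorem A.1.6 can be checked numerically on the gamma example given above. The following snippet is a simple simulation with arbitrarily chosen parameters k and θ.

```python
import numpy as np

rng = np.random.default_rng(2)
k, theta, trials = 3.0, 2.0, 500_000

# Y ~ Gamma(k, theta); X = Y - E[Y] is sub-gamma with nu = k*theta^2, c = theta.
X = rng.gamma(k, theta, size=trials) - k * theta
nu, c = k * theta**2, theta

for t in (1.0, 2.0, 4.0):
    threshold = np.sqrt(2 * nu * t) + c * t
    empirical = np.mean(X > threshold)
    print(f"t={t}: empirical {empirical:.2e} <= exp(-t) = {np.exp(-t):.2e}")
```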

Concentration bounds can also be obtained for sums of i.i.d. random matrices. In particular, the following result on the rank of empirical covariance matrices will prove useful.

Theorem A.1.7 ([Ver12]). Consider a probability distribution in ℝ^d with full-rank covariance matrix C and supported in a centered Euclidean ball of radius R. Take m i.i.d. examples from the distribution and let Ĉ denote its sample covariance matrix. Then, if m ≥ C₀ (s₁(C)/s_d(C)²) R² log(1/δ), the matrix Ĉ has full rank with probability at least 1 − δ. Here C₀ is a universal constant.
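As a small illustration of Theorem A.1.7 (not a verification of the unspecified constant), the following snippet draws points from a bounded distribution with full-rank covariance and checks that the sample covariance matrix has full rank; dimensions and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
d, m = 5, 200

# Points drawn from a full-rank-covariance distribution supported in a ball.
X = rng.uniform(-1, 1, size=(m, d))
Xc = X - X.mean(axis=0)
C_hat = Xc.T @ Xc / m               # sample covariance matrix
print("sample covariance rank:", np.linalg.matrix_rank(C_hat), "of", d)
```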

A well-known application of concentration bounds is to prove uniform convergence bounds for large classes of functions [BBL04]. This is basically a "trick" to obtain non-vacuous union bounds over infinite classes of events by exploiting certain combinatorial dependences between the possible events. This is also related to research in probability theory about the supremum of empirical processes. To formalize these bounds, we need the following definitions.

Let F be a class of zero-one valued functions over some domain Ω. That is, f ∈ F is a function f : Ω → {0, 1} that can be interpreted as the indicator function of some event over Ω. Given a vector z = (z_1, ..., z_m) ∈ Ω^m, for any f ∈ F we define the vector f(z) = (f(z_1), ..., f(z_m)) ∈ {0, 1}^m. Then the set

    F_z = {f(z) | f ∈ F}

represents all the different possible ways in which the points of z can be classified by the functions in F. In particular, it is immediate to see that |F_z| ≤ 2^m, and that if for example z_i = z_j, then the inequality is strict because we always get f(z_i) = f(z_j). The growth function of F is defined as Π_F(m) = sup_{z∈Ω^m} |F_z|. Sometimes the values Π_F(m) are also called the shattering coefficients of F. The idea is that if F is not too complicated, then for m large enough the function Π_F(m) grows slower than exponentially. This is formalized by Sauer's lemma; see [BBL04] for example. The following result collects some useful properties of shattering coefficients. Recall that if E and F are two collections of events, their tensor product is defined as

E ⊗ F = {E × F : E ∈ E,F ∈ F} .

Theorem A.1.8. Let E and F be two collections of events. The following hold for all m ≥ 1:

1. ΠE∪F(m) ≤ ΠE(m) + ΠF(m),

2. ΠE⊗F(m) ≤ ΠE(m) · ΠF(m),

3. if E ⊆ F, then ΠE(m) ≤ ΠF(m).

Using these definitions we can now state the Vapnik–Chervonenkis (VC) inequality. Let D be a probability distribution over Ω. Given f ∈ F we write E_D[f] = E_{x∼D}[f(x)]. Furthermore, given a sample S of m i.i.d. examples from D, S ∼ D^m, we write Ê_S[f] = Σ_{x∈S} f(x)/m for the empirical estimate of E_D[f] based on S. Then the following holds for any t > 0:

    P_{S∼D^m}[ sup_{f∈F} |Ê_S[f] − E_D[f]| > t ] ≤ 4 Π_F(2m) exp(−mt²/8) .        (A.2)

Note that if F is finite, a similar bound can be obtained by combining Hoeffding's inequality with the union bound. However, the VC inequality works for infinite classes F as well.
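The following simulation illustrates the VC inequality (A.2) on a class with a linear growth function, namely thresholds on the real line, for which Π_F(m) = m + 1 and the supremum deviation is the classical Kolmogorov–Smirnov statistic. The parameter choices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
m, trials, t = 2000, 2000, 0.2

# F = {1[x <= s] : s real} over D = Uniform[0,1]; here Pi_F(m) = m + 1.
sup_dev = np.empty(trials)
for i in range(trials):
    S = np.sort(rng.random(m))
    upper = np.arange(1, m + 1) / m - S   # the sup is attained at sample points
    lower = S - np.arange(0, m) / m
    sup_dev[i] = max(upper.max(), lower.max())

empirical = np.mean(sup_dev > t)
vc_bound = 4 * (2 * m + 1) * np.exp(-m * t**2 / 8)
print(f"empirical {empirical:.2e} <= VC bound {vc_bound:.2e}")
```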

A.2 Linear Algebra

This appendix collects a set of perturbation results from linear algebra that are useful in the analysis of spectral learning algorithms. Bold letters are used for vectors v and matrices M. For vectors, ‖v‖ denotes the standard Euclidean norm. For matrices, ‖M‖ denotes the operator norm. For p ∈ [1, +∞], ‖M‖_p denotes the Schatten p-norm: ‖M‖_p = (Σ_{n≥1} s_n(M)^p)^{1/p}, where s_n(M) is the nth singular value of M. The special case p = 2 coincides with the Frobenius norm, which will sometimes also be written as ‖M‖_F. The Moore–Penrose pseudo-inverse of a matrix M is denoted by M⁺.

Lemma A.2.1 ([SS90]). Let A, B ∈ ℝ^{d₁×d₂}. Then, for any 1 ≤ n ≤ min{d₁, d₂}, one has |s_n(A) − s_n(B)| ≤ ‖A − B‖.
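A quick numerical sanity check of Lemma A.2.1 with NumPy; matrix sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 7))
B = rng.standard_normal((5, 7))

# Lemma A.2.1: singular values are 1-Lipschitz w.r.t. the operator norm.
gap = np.abs(np.linalg.svd(A, compute_uv=False)
             - np.linalg.svd(B, compute_uv=False)).max()
op_norm = np.linalg.norm(A - B, 2)      # operator norm of the difference
assert gap <= op_norm + 1e-12
print(f"max singular-value gap {gap:.3f} <= ||A - B|| = {op_norm:.3f}")
```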

Lemma A.2.2 ([EG02]). Let A ∈ ℝ^{d×d} be an invertible matrix and suppose B = A + E with ‖E‖ < s_d(A). Then B is invertible and one has

    ‖A⁻¹ − B⁻¹‖ ≤ ‖E‖ / ( s_d(A) (s_d(A) − ‖E‖) ) .

Lemma A.2.3 ([SS90]). Let A, B ∈ ℝ^{d₁×d₂}. The following holds:

    ‖A⁺ − B⁺‖ ≤ ((1 + √5)/2) · max{ ‖A⁺‖² , ‖B⁺‖² } · ‖A − B‖ .

Lemma A.2.4 ([ZB06]). Let A ∈ ℝ^{d×d} be a symmetric positive semidefinite matrix and E ∈ ℝ^{d×d} a symmetric matrix such that B = A + E is positive semidefinite. Fix n ≤ rank(A) and suppose that ‖E‖_F ≤ (λ_n(A) − λ_{n+1}(A))/4. Then, writing V_n for the top n eigenvectors of A and W_n for the top n eigenvectors of B, we have

    ‖V_n − W_n‖_F ≤ 4‖E‖_F / (λ_n(A) − λ_{n+1}(A)) .

Corollary A.2.5. Let A, E ∈ ℝ^{d₁×d₂} and write B = A + E. Suppose n ≤ rank(A) and ‖E‖_F ≤ √(s_n(A)² − s_{n+1}(A)²)/4. If V_n, W_n contain the first n right singular vectors of A and B respectively, then

    ‖V_n − W_n‖_F ≤ (8‖A‖_F‖E‖_F + 4‖E‖_F²) / (s_n(A)² − s_{n+1}(A)²) .

Proof. Using that ‖A⊤A − B⊤B‖_F ≤ 2‖A‖_F‖E‖_F + ‖E‖_F² and λ_n(A⊤A) = s_n(A)², we can apply Lemma A.2.4 to get the bound on ‖V_n − W_n‖_F under the condition that ‖A⊤A − B⊤B‖_F ≤ (s_n(A)² − s_{n+1}(A)²)/4. To see that this last condition is satisfied, observe that using Lemma A.3.4 we have

    ‖A − B‖_F ≤ √(s_n(A)² − s_{n+1}(A)²)/4
             ≤ ( √(s_n(A)² − s_{n+1}(A)² + 4‖A‖_F²) − 2‖A‖_F ) / (1 + √2)
             ≤ ( √(4‖A‖_F² + s_n(A)² − s_{n+1}(A)²) − 2‖A‖_F ) / 2 ,

and this last inequality implies 2‖A‖_F‖E‖_F + ‖E‖_F² ≤ (s_n(A)² − s_{n+1}(A)²)/4.

Lemma A.2.6 (Corollary 2.4 in [LM99]). Let A ∈ ℝ^{d₁×d₂}, N ∈ ℝ^{d₁×d₁}, and M ∈ ℝ^{d₂×d₂}. Then, for any 1 ≤ i ≤ rank(A) one has

    s_{d₁}(N) s_{d₂}(M) ≤ s_i(NAM) / s_i(A) ≤ s₁(N) s₁(M) .

The following lemma was implicitly used in [HKZ09]; a full proof can be found in [SBG10].

Lemma A.2.7. Let A ∈ ℝ^{d₁×d₂} be a rank r matrix and write V ∈ ℝ^{d₂×r} for its top r right singular vectors. Let Ã = A + E with ‖E‖ < s_r(A), and write Ṽ for the top r right singular vectors of Ã. Then, for any x ∈ ℝ^r we have

    ‖x⊤V⊤Ṽ‖ ≥ ‖x‖ √( 1 − ‖E‖² / (s_r(A) − ‖E‖)² ) .
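The following NumPy snippet checks Lemma A.2.7 on a random low-rank matrix with a small perturbation; the rank, dimensions, and perturbation size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(5)
d1, d2, r = 8, 6, 3
A = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2))  # rank-r matrix
s_r = np.linalg.svd(A, compute_uv=False)[r - 1]
E = rng.standard_normal((d1, d2))
E *= 0.1 * s_r / np.linalg.norm(E, 2)     # scale so that ||E|| < s_r(A)

V = np.linalg.svd(A)[2][:r].T             # top-r right singular vectors of A
Vt = np.linalg.svd(A + E)[2][:r].T        # top-r right singular vectors of A+E
x = rng.standard_normal(r)

lhs = np.linalg.norm(x @ V.T @ Vt)
eps = np.linalg.norm(E, 2)
rhs = np.linalg.norm(x) * np.sqrt(1 - eps**2 / (s_r - eps) ** 2)
assert lhs >= rhs - 1e-9
print(f"{lhs:.4f} >= {rhs:.4f}")
```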

A.3 Calculus

This appendix contains a few technical lemmas based on basic calculus. We begin by stating a simple lemma which is commonly used when approximating conditional probabilities using statistical queries (see e.g. [AD98]).

Lemma A.3.1. Let 0 ≤ a ≤ b ≤ 1 with b > 0, and let 0 ≤ γ ≤ 1. If |â − a| ≤ bγ/3 and |b̂ − b| ≤ bγ/3, then |â/b̂ − a/b| ≤ γ.

Note that, in order to obtain an absolute approximation of a quotient, a relative approximation of the denominator is needed. Though the lemma will be used when estimating fractions where the denominator is unknown, in general a lower bound on b is enough to obtain the desired approximation.
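Lemma A.3.1 can be checked by randomized testing; the snippet below samples admissible perturbations (with a ≤ b, as in the conditional-probability setting where the lemma is applied) and verifies the conclusion.

```python
import numpy as np

rng = np.random.default_rng(6)
for _ in range(10_000):
    b = rng.uniform(1e-3, 1.0)
    a = rng.uniform(0.0, b)
    gamma = rng.random()
    # Perturb a and b by at most b*gamma/3 each, as in the hypothesis;
    # note bhat >= b*(1 - gamma/3) > 0, so the quotient is well defined.
    ahat = a + rng.uniform(-1, 1) * b * gamma / 3
    bhat = b + rng.uniform(-1, 1) * b * gamma / 3
    assert abs(ahat / bhat - a / b) <= gamma + 1e-9
print("Lemma A.3.1 verified on random instances")
```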

Lemma A.3.2. For any x, y, z > 0, if x ≥ 3y ln(yz) and yz ≥ 2.48, then x ≥ y ln(xz).

Proof. By the concavity of the logarithm it is enough to show that x ≥ y ln(xz) holds for x = x₀ = 3y ln(yz); that is: 3y ln(yz) ≥ y ln(3yz ln(yz)). This is equivalent to 2 ln(yz) ≥ ln(3 ln(yz)). Since we have yz ≥ 2.48 > e^{e/3}, the result follows by using again the concavity of the logarithm.

Lemma A.3.3. For any 0 < ε ≤ c < 1 and x > 0, if |ln(x)| ≤ c⁻¹ ln(1 + c) ε then |1 − x| ≤ ε.

Proof. First note that for any 0 < ε ≤ c < 1 we have

    min{ ln(1 + ε) , ln(1/(1 − ε)) } = ln(1 + ε) ≥ (ln(1 + c)/c) ε

by the concavity of the logarithm. Now suppose x ≥ 1 and note that by hypothesis we have | ln(x)| = ln(x) ≤ ln(1 + ε) which implies

|1 − x| = x − 1 = exp(ln(x)) − 1 ≤ exp(ln(1 + ε)) − 1 = ε .

On the other hand, for x < 1 we have − ln(x) = | ln(x)| ≤ ln(1/(1−ε)) = − ln(1−ε), i.e. ln(x) ≥ ln(1−ε). Therefore we get |1 − x| = 1 − x = 1 − exp(ln(x)) ≤ 1 − exp(ln(1 − ε)) = ε .

Lemma A.3.4. For any x, y ≥ 0 we have √( (1 + √2)(x + y) ) ≥ √x + √y.

Proof. By taking squares on each side of the inequality twice we see that the inequality holds if and only if 0 ≤ 2(x² + y²).

A.4 Convex Analysis

Recall that a set Ω ⊆ X of some vector space X is convex if for any x, y ∈ Ω and any 0 ≤ t ≤ 1 one has tx + (1 − t)y ∈ Ω. That is, every point in the segment between x and y also lies in Ω. A function f : Ω → ℝ defined over a convex set is convex if for any x, y ∈ Ω and any 0 ≤ t ≤ 1 one has f(tx + (1 − t)y) ≤ tf(x) + (1 − t)f(y); that is, if the set of all points that lie above the graph of f is convex. The function f is said to be strictly convex if the inequality is strict for any 0 < t < 1 and x ≠ y. From the point of view of optimization, the most appealing property of convex functions is that a local minimum of f is always a global minimum. Furthermore, if f is strictly convex, then it has at most one global minimum.

If a convex function f is smooth, then the usual characterization of local minima based on the gradient of f works: x is a local minimum if and only if ∇f(x) = 0. When f is not smooth, one needs to resort to the subdifferential of f. If X is equipped with an inner product, then v ∈ X is a subgradient of f at x ∈ Ω if for any y ∈ Ω one has f(y) − f(x) ≥ ⟨v, y − x⟩. An arbitrary subgradient of f at x is sometimes denoted by δf(x). The set of all such subgradients is the subdifferential ∂f(x) of f at x. With these definitions we can see that x is a local minimum of f if and only if 0 ∈ ∂f(x).

Let H be a vector space of real functions defined over some convex domain Ω, and let ⟨·, ·⟩ be an inner product on H. Suppose that F : H → ℝ is a convex functional over H. The Bregman divergence B_F of F on f, g ∈ H is defined as

    B_F(f‖g) = F(f) − F(g) − ⟨f − g, δF(g)⟩ ,

for some arbitrary subgradient δF(g). Note that B_F(f‖g) is basically the difference between F(f) and the first-order Taylor expansion of F around g evaluated at f. A useful property of the Bregman divergence is that if F = F₁ + τF₂ is the sum of two convex functionals with τ > 0, then by the "linearity" of subdifferentials one can always choose subgradients such that B_F = B_{F₁} + τB_{F₂}.

A.5 Properties of L∞ and L∞^p

This appendix proves technical results about the L∞ and L∞^p distances used for testing distinguishability between states in a PDFA. If D and D′ are probability distributions over a free monoid Σ*, these distances are defined as follows:

    L∞(D, D′) = sup_{x∈Σ*} |D(x) − D′(x)| ,
    L∞^p(D, D′) = sup_{x∈Σ*} |D(xΣ*) − D′(xΣ*)| .

In what follows we derive bounds between L∞ and L∞^p that justify our preference for L∞^p as a distinguishability measure between states in PDFA. Furthermore, we show how to compute the shattering coefficients needed to bound the accuracy of empirical estimates of L∞ and L∞^p distances using VC bounds.
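Empirical versions of these two distances over finite samples can be computed directly; the following Python sketch is one straightforward implementation, where strings are represented as Python strings.

```python
from collections import Counter

def empirical_linf(sample1, sample2):
    """Empirical L_infinity distance: sup over strings of the difference
    between empirical probabilities."""
    c1, c2 = Counter(sample1), Counter(sample2)
    m1, m2 = len(sample1), len(sample2)
    return max(abs(c1[x] / m1 - c2[x] / m2) for x in set(c1) | set(c2))

def empirical_linf_prefix(sample1, sample2):
    """Empirical prefix distance L_infinity^p: sup over prefixes x of the
    difference between the empirical probabilities of x Sigma*."""
    def prefix_counts(sample):
        c = Counter()
        for w in sample:
            for i in range(len(w) + 1):
                c[w[:i]] += 1
        return c
    c1, c2 = prefix_counts(sample1), prefix_counts(sample2)
    m1, m2 = len(sample1), len(sample2)
    return max(abs(c1[x] / m1 - c2[x] / m2) for x in set(c1) | set(c2))

print(empirical_linf(["ab", "a", "ab"], ["b", "a", "a"]))          # 2/3
print(empirical_linf_prefix(["ab", "a", "ab"], ["b", "a", "a"]))   # 2/3
```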

A.5.1 Bounds

It is obvious from the definition that L∞^p is non-negative, symmetric, and satisfies the triangle inequality. In addition, D = D′ clearly implies L∞^p(D, D′) = 0. To verify that L∞^p is indeed a metric one needs to establish the converse of this last fact; we do so using the following result, which gives a lower bound for L∞^p(D, D′) in terms of L∞(D, D′). Since L∞ is a metric, the bound implies that L∞^p(D, D′) is a metric too.

Proposition A.5.1. For any two distributions D and D′ over Σ* we have

    L∞(D, D′) ≤ (2|Σ| + 1) L∞^p(D, D′) .

Proof. The inequality is obvious if D = D′. Thus, suppose D ≠ D′ and let x ∈ Σ* be such that 0 < L∞(D, D′) = |D(x) − D′(x)|. Note that for any partition x = u·v we can write D(x) = D^u(v) D(uΣ*), where D^u denotes the distribution on suffixes conditioned on the prefix u. In particular, taking v = λ, the triangle inequality and D′^x(λ) ≤ 1 yield the following:

    L∞(D, D′) = |D(x) − D′(x)| ≤ |D(xΣ*) − D′(xΣ*)| + D(xΣ*) |D^x(λ) − D′^x(λ)| .

Note that |D(xΣ*) − D′(xΣ*)| ≤ L∞^p(D, D′). It remains to show that we also have D(xΣ*) |D^x(λ) − D′^x(λ)| ≤ 2|Σ| L∞^p(D, D′). Observe the following, which is just a consequence of D(λ) + Σ_σ D(σΣ*) = 1 for any distribution over Σ*:

    D^x(λ) + Σ_σ D^x(σΣ*) = D′^x(λ) + Σ_σ D′^x(σΣ*) = 1 .

From these equations it is easy to show that there must exist a σ ∈ Σ such that

    |D^x(λ) − D′^x(λ)| ≤ |Σ| |D^x(σΣ*) − D′^x(σΣ*)| .

Therefore, using D(xΣ*) D^x(σΣ*) = D(xσΣ*) we obtain

    D(xΣ*) |D^x(λ) − D′^x(λ)| ≤ |Σ| D(xΣ*) |D^x(σΣ*) − D′^x(σΣ*)|
                              ≤ |Σ| |D(xσΣ*) − D′(xσΣ*)| + |Σ| D′^x(σΣ*) |D(xΣ*) − D′(xΣ*)|
                              ≤ 2|Σ| L∞^p(D, D′) .

Recall that given a PDFA A we denote by f_A the probability distribution it defines. Furthermore, if q ∈ Q is a state of A, then A_q represents the PDFA A where q is taken as the new starting state. The L∞-distinguishability of A is then given by

    µ_{L∞} = min_{q,q′∈Q} L∞(f_{A_q}, f_{A_{q′}}) ,

where the minimum is taken over all pairs of distinct states in Q. The L∞^p-distinguishability µ_{L∞^p} is defined similarly. The bound given above implies that in general one has

    µ_{L∞} / (2|Σ| + 1) ≤ µ_{L∞^p} .

Since the distinguishability of a PDFA quantifies how hard it is to learn its distribution by a state-merging method – with smaller distinguishability meaning harder to learn – this implies that using L∞^p as a distinguishability measure one may in some cases slightly increase the hardness of learning a particular PDFA. However, this is not the case in general, and it turns out that L∞^p-distinguishability yields exponential advantages in some cases, like the following example. Recall the class D_n of PDFA that define probability distributions with support defined by parities of n bits, as defined in Section 1.1. Let T ∈ D_{n/2} be an acyclic PDFA with n + 2 states Q = {q₀, ..., q_{n+1}} over Σ = {0, 1} and a single stopping state q_{n+1}. Note that for 0 ≤ i ≤ n we have γ_T(q_i, 0) = γ_T(q_i, 1) = 1/2. Now let T′ be the following perturbation of T: for 0 ≤ i ≤ n let γ′(q_i, 0) = 1/2 − i/(2n) and γ′(q_i, 1) = 1/2 + i/(2n). Then it is easy to check that T′ has µ_{L∞} = (1/2)^{Θ(n)} and µ_{L∞^p} = Θ(1/n).

A.5.2 Shattering Coefficients

Since we will be using uniform convergence bounds to estimate L∞ and L∞^p distances from samples, we will need to compute the corresponding growth functions. Suppose Ω = Σ* and let S = {1_x | x ∈ Σ*}. Then it is obvious by construction that Π_S(m) = m + 1, which is the growth function needed for applying VC bounds to L∞ estimation. The case of L∞^p is slightly more complicated, as can be seen in the following lemma.

Lemma A.5.2. Let P = {1_{xΣ*} | x ∈ Σ*}. Then Π_P(m) = 2m.

Proof. In the first place note that for any z ∈ (Σ*)^m the quantity |P_z| is invariant under a permutation of the strings in z. Thus, without loss of generality we assume that the strings z = (z¹, ..., z^m) are all different and are given in lexicographical order: z¹ < ... < z^m. Now let us define the prefixes w^i = lcp(z¹, z^i) for 2 ≤ i ≤ m, where lcp denotes the longest common prefix. By the ordering assumption on the z^i it is easy to see that w^j ⊑ w^i for any i < j, where ⊑ denotes the prefix order. Furthermore, we have w^i ⊑ z^j for any i ≤ j. Now let w be any string such that 1_{wΣ*}(z^i) = 0 for all i ∈ [m]. We define the set of strings Z = {z¹, ..., z^m, w², ..., w^m, w} and assume without loss of generality that |Z| = 2m. We also define the set of functions Z = {1_{xΣ*} | x ∈ Z}. Now we will show that P_z = Z_z and |Z_z| = 2m, from which it follows that Π_P(m) = 2m.

For any u, v ∈ Σ* such that u ⊑ v we define the following set of strings

    (u, v] = {u′ | u′ ⊑ v ∧ |u| < |u′|}

containing all the prefixes of v longer than u. The closed analog [u, v] is defined likewise. Now we consider the following partition of Σ*:

    Σ* = ⨆_{i=2}^{m−1} (w^{i+1}, w^i] ⊔ [λ, w^m] ⊔ ⨆_{i=2}^{m} (w^i, z^i] ⊔ (w², z¹] ⊔ W ,

where W is defined to be the complement of the rest. See Figure A.1 for an example of how this partition covers the subtree of Σ* containing Z. To see that P_z = Z_z holds, one just needs to verify that Z contains exactly one string in each set of the partition, and that:

    ∀x ∈ (w^{i+1}, w^i] :  1_{xΣ*}^{−1}(1) ∩ z = {z¹, ..., z^i} ,
    ∀x ∈ [λ, w^m]       :  1_{xΣ*}^{−1}(1) ∩ z = {z¹, ..., z^m} ,
    ∀x ∈ (w^i, z^i]     :  1_{xΣ*}^{−1}(1) ∩ z = {z^i} ,
    ∀x ∈ (w², z¹]       :  1_{xΣ*}^{−1}(1) ∩ z = {z¹} ,
    ∀x ∈ W              :  1_{xΣ*}^{−1}(1) ∩ z = ∅ .

From these assertions it also follows that |Z_z| = 2m.

[Figure A.1: Setup in the proof of Lemma A.5.2 with m = 4: the subtree of Σ* containing {z¹, ..., z⁴, w², ..., w⁴}, with the intervals (w⁴, z⁴] and (w³, w²] marked on the tree.]
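The counting in the proof of Lemma A.5.2 can be checked by brute force on small samples: only prefixes of the sample strings (plus one string that is a prefix of none of them) can induce distinct labelings, so the enumeration below is exhaustive. It verifies the upper bound Π_P(m) ≤ 2m over a small pool of strings; the pool itself is an arbitrary choice.

```python
from itertools import product

def dichotomies(z):
    """All labelings of the tuple z induced by indicators of sets x Sigma*."""
    prefixes = {w[:i] for w in z for i in range(len(w) + 1)}
    prefixes.add("#")   # a string over a disjoint alphabet: a prefix of no z_i
    return {tuple(w.startswith(x) for w in z) for x in prefixes}

m = 4
worst = max(len(dichotomies(z))
            for z in product(["ab", "aa", "b", "ba", "abb", "a"], repeat=m))
print(worst, "<=", 2 * m)
```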
