Inducing Regular Grammars Using Recurrent Neural Networks

Mor Cohen,∗ Avi Caciularu,∗ Idan Rejwan,∗ Jonathan Berant
School of Computer Science, Tel-Aviv University, Tel-Aviv, 6997801 Israel
∗ These authors contributed equally to this work.

Abstract

Grammar induction is the task of learning a grammar from a set of examples. Recently, neural networks have been shown to be powerful learning machines that can identify patterns in streams of information. In this work we investigate their effectiveness in inducing a regular grammar from data, without any assumptions about the grammar. We train a recurrent neural network to distinguish between strings that are in or outside a regular language, and utilize an algorithm for extracting the learned finite-state automaton. We apply this method to several regular languages and find unexpected results regarding the connections between the network's states that may be regarded as evidence for generalization.

1 Introduction

Grammar induction is the task of learning a grammar from a set of examples, thus constructing a model that captures the patterns within the observed data. It plays an important role in scientific research of sequential phenomena, such as human language or genetics. We focus on the most basic level of grammars, regular grammars, that is, the set of all languages that can be decided by a Deterministic Finite Automaton (DFA).

Inducing regular grammars is an old and extensively studied problem (De la Higuera, 2010). However, most suggested methods involve prior assumptions about the grammar being learned. In this work, we aim to induce a grammar from examples that are in or outside a language, without any assumption about its structure.

Recently, neural networks were shown to be powerful learning models for identifying patterns in data, including on language-related tasks (Linzen et al., 2016; Kuncoro et al., 2017). This work investigates how good neural networks are at inducing a regular grammar from data. More specifically, we investigate whether RNNs, neural networks that specialize in processing sequential streams, can learn a DFA from data.

RNNs are suitable for this task since they resemble DFAs. At each time step the network has a current state, and given the next input symbol it produces the next state. Formally, let s_t be the current state and x_{t+1} the next input symbol; the RNN then computes the next state s_{t+1} = δ(s_t, x_{t+1}), where δ is the function learned by the RNN. Consequently, δ is effectively a transition function between states, similar to that of a DFA.

This analogy between RNNs and DFAs suggests a way to understand RNNs. It enables us to "open the black box" and analyze the network by converting it into the corresponding DFA and examining the learned language.

Inspired by this, we explore a method for grammar induction. Given a labeled dataset of strings that are in and outside a language, we wish to train a network to classify them. If the network succeeds, it must have learned the latent patterns underlying the data. This allows us to extract the states used by the network and reconstruct the grammar it has learned.

There is one major difference between the states of DFAs and RNNs: while the former are discrete and finite, the latter are continuous. In theory, this difference makes RNNs much more powerful than DFAs (Siegelmann and Sontag, 1995). In practice, however, simple RNNs, without the aid of additional memory such as in LSTMs (Hochreiter and Schmidhuber, 1997), are not strong enough to deal with languages beyond the regular domain (Gers and Schmidhuber, 2001).
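To make this analogy concrete, the following minimal sketch (illustrative only and not part of our released implementation; all names and sizes here are assumptions) contrasts a hand-written DFA transition function for the regex (01)* with an untrained RNN step of the kind described above. Both map a current state and an input symbol to a next state; the only difference is that the DFA state is discrete while the RNN state is a continuous vector.

    import numpy as np

    # DFA for (01)*: state 0 is the accepting start state, state 2 is a reject sink.
    DFA_DELTA = {(0, "0"): 1, (0, "1"): 2,
                 (1, "0"): 2, (1, "1"): 0,
                 (2, "0"): 2, (2, "1"): 2}

    def dfa_step(state, symbol):
        return DFA_DELTA[(state, symbol)]

    # An (untrained) RNN transition delta(s, x) = tanh(W s + U x + v) over continuous states.
    rng = np.random.default_rng(0)
    HIDDEN, VOCAB = 8, 2
    W = rng.normal(size=(HIDDEN, HIDDEN))
    U = rng.normal(size=(HIDDEN, VOCAB))
    v = rng.normal(size=HIDDEN)

    def rnn_step(state, symbol):
        x = np.eye(VOCAB)[int(symbol)]          # one-hot encoding of the input symbol
        return np.tanh(W @ state + U @ x + v)   # continuous next state

    # Both read a string symbol by symbol, carrying their state forward.
    s_dfa, s_rnn = 0, np.zeros(HIDDEN)
    for symbol in "0101":
        s_dfa = dfa_step(s_dfa, symbol)
        s_rnn = rnn_step(s_rnn, symbol)
    print(s_dfa)      # 0, i.e. the accepting state of the DFA
    print(s_rnn[:3])  # first coordinates of the continuous RNN state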
Similar ideas were already investigated in the 1990s (Cleeremans et al., 1989; Giles et al., 1990; Elman, 1991; Omlin and Giles, 1996; Morris et al., 1998; Tino et al., 1998); for a more thorough survey of relevant work, see section 5.13 in (Schmidhuber, 2015). These works also presented techniques to extract a DFA from a trained RNN. However, most of them applied an a priori quantization of the RNN's continuous state space, yielding an exponential number of states even for simple grammars. Several works instead used clustering techniques to quantize the state space, which yielded a much smaller number of states (Zeng et al., 1993); however, they fixed the number of clusters in advance. In this work, we introduce a technique for extracting a DFA from an RNN using clustering without needing to know the number of clusters in advance. For that purpose, we present a heuristic for finding the most suitable number of states for the DFA, making the process much more general and unconstrained.

2 Problem Statement

Given a regular language L and a labeled dataset {(X_i, y_i)} of strings X_i with binary labels y_i = 1 ⇔ X_i ∈ L, the goal is to output a DFA A that accepts exactly L, i.e., L(A) = L.

Since the target language L is unknown, we relax this goal to A(X̄_i) = ȳ_i. Namely, A should accept or reject correctly on a test set {(X̄_i, ȳ_i)} that has not been used for training. Accordingly, the accuracy of A is defined as the proportion of strings classified correctly by A.

3 Method

The method consists of the following main steps (Python code is available at https://github.com/acrola/RnnInduceRegularGrammar).

1. Data Generation - creating a labeled dataset of positive and negative strings: 15,000 strings for training the network, a validation set of 10,000 strings for constructing the DFA, and a test set of 10,000 strings for evaluating the results.

2. Learning - training an RNN for up to 15 epochs to classify the dataset with high accuracy. In almost all the conducted experiments (fully described in the next section), a nearly perfect validation accuracy of approximately 99% is achieved.

3. DFA Construction - extracting the states produced by the RNN, quantizing them, and constructing a minimized DFA.

3.1 Data Generation

The following method was used to create a balanced dataset of positive and negative strings. Given a regular expression, we randomly generate sequences from it. For the negative strings, two different methods were used. The first was to generate random sequences of words from the vocabulary, such that their length distribution is identical to that of the positive strings. The other approach was to generate ungrammatical strings that are almost identical to the grammatical ones, making learning more difficult for the RNN; we produced these negative strings by applying minor modifications such as word deletions, additions or movements. The two methods did not yield any significant difference in the results, so we report results only for the first method.

3.2 Learning

We used the most basic RNN architecture, with one layer and a cross-entropy loss. In more detail, the RNN's transition function is a single fully-connected layer given by

s_t = tanh(W s_{t-1} + U x_t + v).

Prediction is made by another fully-connected layer, which receives the RNN's final state s_n as input and returns a prediction ŷ ∈ [0, 1] by

ŷ = σ(A σ(B s_n + c) + d),

where σ stands for the sigmoid function, W, U, A, B are learned matrices and v, c, d are learned vectors.

The loss used for training is the cross entropy,

l = -(1/n) ∑_{i=1}^{n} [ y_i log ŷ_i + (1 - y_i) log(1 - ŷ_i) ],

where y_1, ..., y_n are the true labels and ŷ_1, ..., ŷ_n are the network's predictions.
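As an illustration of the equations above, the following is a minimal NumPy sketch of the forward pass and the loss (a simplified re-implementation for exposition, not our released code; the hidden size, the binary alphabet, and the random initialization are assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    HIDDEN, VOCAB = 16, 2  # assumed hidden size and a binary alphabet

    # The learned parameters: matrices W, U, B, A and vectors v, c, d
    # (randomly initialized here; trained in practice).
    W = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN))
    U = rng.normal(scale=0.1, size=(HIDDEN, VOCAB))
    v = np.zeros(HIDDEN)
    B = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN))
    c = np.zeros(HIDDEN)
    A = rng.normal(scale=0.1, size=(1, HIDDEN))
    d = np.zeros(1)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(string):
        """Run the RNN over a string and return the prediction y_hat in [0, 1]."""
        s = np.zeros(HIDDEN)                    # initial state s_0
        for ch in string:
            x = np.eye(VOCAB)[int(ch)]          # one-hot input x_t
            s = np.tanh(W @ s + U @ x + v)      # s_t = tanh(W s_{t-1} + U x_t + v)
        return sigmoid(A @ sigmoid(B @ s + c) + d)[0]  # y_hat = sigma(A sigma(B s_n + c) + d)

    def cross_entropy(strings, labels):
        """l = -(1/n) sum_i [y_i log y_hat_i + (1 - y_i) log(1 - y_hat_i)]"""
        preds = np.array([forward(seq) for seq in strings])
        y = np.asarray(labels, dtype=float)
        return -np.mean(y * np.log(preds) + (1 - y) * np.log(1 - preds))

    print(cross_entropy(["0101", "1100", "01"], [1, 0, 1]))

In the actual experiments the parameters are of course trained rather than left at their random initial values.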
To optimize our loss, we employ the Adam algorithm (Kingma and Ba, 2014). Our goal is to reach perfect accuracy on the validation set, in order to make sure the network has succeeded in generalizing and inducing the regular grammar underlying the dataset. This assures that the DFA we extract later is as reliable as possible.

3.3 DFA Construction

When training is finished, we extract the DFA learned by the network. This process consists of the following four steps.

Collecting the states. First, we collect the RNN's continuous state vectors by feeding the network with strings from the validation set and recording the states it outputs while reading them.

Quantization. After collecting the continuous states, we need to transform them into a discrete set of states. We can achieve this by using any conventional clustering method with the Euclidean distance measure. More specifically, we use the K-means clustering algorithm, where K is taken to be the minimal value such that the quantized graph's classifications match the network's at a high rate. That is, for each K we build the quantized DFA (as described next) and count the number of matches between the DFA's classifications and the network's over a validation set. We return the minimal K that exceeds 99% matches. It should be noted that the initial state is left as is and is not associated with any of the clusters.

Building the DFA. Given the RNN's transition function δ and the clustering method c, we use the following algorithm to build the DFA.

Algorithm 1 DFA Construction
  V, E ← ∅, ∅
  for each sequence X_i do
      s_{t-1}, v_{t-1} ← s_0, c(s_0)
      for each symbol x_t ∈ X_i do
          s_t ← δ(s_{t-1}, x_t)
          v_t ← c(s_t)
          Add v_t to V
          Add (v_{t-1}, x_t) → v_t to E
          s_{t-1}, v_{t-1} ← s_t, v_t
      end for
      Mark v_t as accepting if ŷ_i = 1
  end for
  return V, E

Minimization. Finally, we use the Myhill-Nerode algorithm to find the minimal equivalent DFA (Downey and Fellows, 2012).
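The following sketch illustrates the quantization heuristic and Algorithm 1 together (illustrative code under assumed helper names, not our released implementation): continuous states collected on the validation set are clustered with K-means for increasing K, a quantized transition graph is built by replaying the validation strings, and the smallest K whose classifications agree with the network on at least 99% of the strings is returned.

    import numpy as np
    from sklearn.cluster import KMeans

    def extract_dfa(rnn_states, rnn_step, rnn_predict, strings, max_k=20, match_rate=0.99):
        """rnn_states: continuous state vectors collected from the validation strings.
        rnn_step(s, sym): the network's transition function delta (assumed helper).
        rnn_predict(seq): 1 if the network accepts the string, else 0 (assumed helper)."""
        net_labels = np.array([rnn_predict(seq) for seq in strings])
        hidden = rnn_states.shape[1]
        for k in range(2, max_k + 1):
            km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(rnn_states)
            # Build the quantized transition graph (Algorithm 1); the initial
            # state is kept as a separate node, outside the clusters.
            edges, accepting = {}, set()
            for seq, label in zip(strings, net_labels):
                s, v = np.zeros(hidden), "init"
                for sym in seq:
                    s = rnn_step(s, sym)                     # s_t = delta(s_{t-1}, x_t)
                    v_next = int(km.predict(s[None, :])[0])  # cluster id = quantized state
                    edges[(v, sym)] = v_next
                    v = v_next
                if label == 1:
                    accepting.add(v)

            def dfa_accepts(seq):
                v = "init"
                for sym in seq:
                    v = edges.get((v, sym))
                    if v is None:
                        return 0
                return int(v in accepting)

            # Keep the smallest K whose classifications match the network's well enough.
            dfa_labels = np.array([dfa_accepts(seq) for seq in strings])
            if np.mean(dfa_labels == net_labels) >= match_rate:
                return edges, accepting, k
        return edges, accepting, max_k

Conflicting transitions are silently overwritten in this sketch; as discussed in Section 5, no conflicts were observed in our experiments.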
4 Experiments

To demonstrate our method, we applied it to the following regular expressions.

4.1 Simple Binary Regexes

The resulting DFAs for the two binary regexes (01)* and (0|1)*100 are shown in Figure 1. The method produced perfect DFAs that accept exactly the given languages, and the DFAs' accuracy was indeed 100%.

Figure 1: Minimized DFAs for the binary regexes: (a) (01)*, (b) (0|1)*100.

A finding worth mentioning is the emergence of cycles within the continuous state transitions. That is, even before quantization, the RNN mapped new states onto the exact same state it had already visited. This is surprising because, if we think of the continuous states as random vectors, the probability of seeing the exact same vector twice is zero. This finding, which was reproduced for several regexes, may be evidence of generalization, as we discuss later.

4.2 Part of Speech Regex

To test our model on a more complicated grammar, we created a synthetic regex inspired by natural language. The regex we used describes a simplified part-of-speech grammar:

DET? ADJ* NOUN VERB (DET? ADJ* NOUN)?

The resulting DFA is shown in Figure 2. The DFA's accuracy was 99.6%, i.e., the learned language is not exactly the target one, but it is very close. For example, it accepts sentences like "The nice boy kissed a beautiful lovely girl" and rejects sentences like "The boy nice". However, by examining the DFA we can find sentences that the network misclassifies; for example, it accepts sentences like "The the boy stands".

Figure 2: Minimized DFA for the synthetic POS regex.

Inspecting the DFA's errors might also be useful for training, as a technique for targeted data augmentation. By the pumping lemma, each of the states where the network is wrong stands for an infinite class of sequences that end at that same state. In other words, those states are effectively a "formula" for generating as much data as we want on which the network is wrong. This way, the network can be re-trained on its own errors.
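As a sketch of this augmentation idea (illustrative only; the DFA representation follows the extraction sketch above, with "init" as the assumed start-state label), one can locate a cycle along the state trajectory of a misclassified string and pump it, producing arbitrarily many strings that reach the same erroneous state:

    def pump_errors(edges, misclassified, n_copies=3):
        """Given the extracted DFA (edges maps (state, symbol) -> state) and one string the
        network misclassifies, return pumped variants that traverse the same states and
        therefore end in the same erroneous state. 'init' is the assumed start-state label."""
        # Record the quantized state reached after each prefix of the string.
        trajectory = ["init"]
        for sym in misclassified:
            trajectory.append(edges[(trajectory[-1], sym)])
        # A repeated state means the symbols between its two visits form a pumpable cycle.
        first_visit = {}
        for i, state in enumerate(trajectory):
            if state in first_visit:
                start, end = first_visit[state], i
                cycle = misclassified[start:end]
                return [misclassified[:start] + cycle * k + misclassified[end:]
                        for k in range(2, 2 + n_copies)]
            first_visit[state] = i
        return []  # no cycle along this string

The same code applies to word-level grammars if strings are represented as lists or tuples of words rather than character strings.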

5 Discussion

Learning

The network reached 100% accuracy quickly on the synthetic datasets. This may indicate that deciding a regular language is a reasonable task for an RNN, and it illustrates the similarity between RNNs and DFAs discussed earlier.

Quantization

The process used for quantizing the states may introduce conflicts if two different states in the same cluster lead to two different clusters for the same input. However, none of our experiments yielded any conflict, which means that the clusters reflect the RNN's different states well. This is reasonable due to the continuity of the RNN's transition function, which maps "close" states to "close" states in terms of Euclidean distance. Thus, states within the same cluster have similar transitions and are therefore mapped into the same cluster.

Another way to confirm the validity of the clusters is to check their compatibility with the network's decisions, i.e., whether accepting and rejecting states are clustered separately. Figure 3 presents the continuous state vectors for the binary regex (0|1)*100, with dimensionality reduced using PCA. Clearly, the states are divided into five distinct clusters, and only one of them is accepting.

Figure 3: RNN's continuous states.

Emergence of cycles

The emergence of cycles within the continuous states, mentioned in Section 4.1, might be understood as evidence of generalization. Having learned those cycles means that for an infinite set of sequences the network will always traverse the same path and predict the same label. In other words, the model has generalized to sequences of arbitrary length even before quantization.

It should be noted that such cycles have also been observed by Tino et al. (1998). Nevertheless, in their work the learning objective was predicting the next state rather than classifying the whole sequence. As a result, the emergence of cycles is expected there, since the network was forced to learn them by the supervision. This differs from our classification setting, where learning the states and the cycles is not supervised.

6 Conclusions

We investigated a method for grammar induction using recurrent neural networks. This method gives some insights about RNNs as a learning model, and raises several questions, for example regarding the cycles within the continuous states.

Quantization via clustering proved to reflect the true states of the network well. Identifying an infinite set of vectors as one state may reduce the network's sensitivity to noise. Thus, state quantization during or after training may be considered a technique for injecting some robustness into the model and reducing overfitting.

Finally, this method may serve as a tool for scientific research, by finding regular patterns within a real-world sequence. For example, it would be interesting to use this method for natural language, more specifically to induce grammar rules of phonology, which is claimed to be regular (Kaplan and Kay, 1994). Another example is to use it to find regularity within the structure of DNA, which is also regarded as regular (Gusfield, 1997).

7 Acknowledgements

This work was supported by the Yandex Initiative in Machine Learning.

References

[Cleeremans et al.1989] Axel Cleeremans, David Servan-Schreiber, and James L. McClelland. 1989. Finite state automata and simple recurrent networks. Neural Computation, 1(3):372–381.

[De la Higuera2010] Colin De la Higuera. 2010. Grammatical Inference: Learning Automata and Grammars. Cambridge University Press.

[Downey and Fellows2012] Rodney G. Downey and Michael R. Fellows. 2012. Parameterized Complexity. Springer Science & Business Media.

[Elman1991] Jeffrey L. Elman. 1991. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7(2-3):195–225.

[Gers and Schmidhuber2001] Felix A. Gers and E. Schmidhuber. 2001. LSTM recurrent networks learn simple context-free and context-sensitive languages. IEEE Transactions on Neural Networks, 12(6):1333–1340.

[Giles et al.1990] C. Lee Giles, Guo-Zheng Sun, Hsing-Hen Chen, Yee-Chun Lee, and Dong Chen. 1990. Higher order recurrent networks and grammatical inference. In Advances in Neural Information Processing Systems, pages 380–387.

[Gusfield1997] Dan Gusfield. 1997. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press.

[Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

[Kaplan and Kay1994] Ronald M. Kaplan and Martin Kay. 1994. Regular models of phonological rule systems. Computational Linguistics, 20(3):331–378.

[Kingma and Ba2014] D. Kingma and J. Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kuncoro et al.2017] Adhiguna Kuncoro, Miguel Ballesteros, Lingpeng Kong, Chris Dyer, Graham Neubig, and Noah A. Smith. 2017. What do recurrent neural network grammars learn about syntax? In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, volume 1, pages 1249–1258.

[Linzen et al.2016] Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. arXiv preprint arXiv:1611.01368.

[Morris et al.1998] William C. Morris, Garrison W. Cottrell, and Jeffrey Elman. 1998. A connectionist simulation of the empirical acquisition of grammatical relations. In International Workshop on Hybrid Neural Systems, pages 175–193. Springer.

[Omlin and Giles1996] Christian W. Omlin and C. Lee Giles. 1996. Extraction of rules from discrete-time recurrent neural networks. Neural Networks, 9(1):41–52.

[Schmidhuber2015] Jürgen Schmidhuber. 2015. Deep learning in neural networks: An overview. Neural Networks, 61:85–117.

[Siegelmann and Sontag1995] Hava T. Siegelmann and Eduardo D. Sontag. 1995. On the computational power of neural nets. Journal of Computer and System Sciences, 50(1):132–150.

[Tino et al.1998] Peter Tino, Bill G. Horne, and C. Lee Giles. 1998. Finite state machines and recurrent neural networks – automata and dynamical systems approaches. Technical report.

[Zeng et al.1993] Zheng Zeng, Rodney M. Goodman, and Padhraic Smyth. 1993. Learning finite state machines with self-clustering recurrent networks. Neural Computation, 5(6):976–990.