Practical MAT Learning of Natural Languages Using Treebanks

Petter Ericson
Supervisor: Johanna Björklund
Department of Computing Science, Umeå University
S-901 87 Umeå, Sweden, [email protected]

Abstract. In this thesis, an implementation of the MAT algorithm for query learning is improved and tested, specifically by attempting to simulate the MAT oracle using large corpora. In order to arrive at suitable test corpora, algorithms for generating positive and negative examples in relation to regular tree languages are developed.

Table of Contents

1 Introduction
  1.1 Motivation
  1.2 Language identification
  1.3 MAT learners
      Basics
  1.4 Outline
2 Preliminaries
  2.1 Introduction to trees and tree languages
      Alphabets, strings and trees
      Automata
      Tree automata
      Grammars
  2.2 MAT learners
      Contexts and substitution
      Myhill-Nerode and equivalence classes
      Observation tables
      Putting it all together
      Implementation status
3 Method
  3.1 Implementation of the algorithm
      Optimising for speed
      Adding required functionality
  3.2 Tree generation
  3.3 Test runs
      Optimisations
      Tree generation
      Corpus runs
4 Results
  4.1 Tower languages
  4.2 Optimisation
  4.3 Generalisation
  4.4 Corpora
      Real-world corpus gaps
  4.5 Tree generation
5 Major difficulties
  5.1 Lack of robustness
  5.2 Selecting counterexample
  5.3 Suitable testing languages
6 Future work

1 Introduction

Achieving a concise and specific model for natural languages has been a major goal of natural language research in the last decades, as such a representation could lead to a better understanding of language and, specifically, to computerised simulations of various natural language tasks. This thesis considers the Minimal Adequate Teacher (MAT) model of algorithmic learning of languages, specifically how it can be used to achieve a compact and precise tree-based model for natural languages by simulating complete knowledge of the target language with large positive and negative corpora.

1.1 Motivation

While probabilistic models of language appear to be adequate for provisionally recognising and translating natural languages (e.g. the Google Translate service), and small, hand-crafted grammars are used to correct and rearrange words according to heuristics, a comprehensive, accurate and complete model of a natural language is still not within the reach of computers. Such a model would be very helpful in many applications, such as parsing, translation and natural language interfaces in general. Furthermore, having an exact computational model of a natural language could lead to insights into more general linguistics, such as how language is represented in the brain, what drives linguistic development, et cetera.

1.2 Language identification

The general problem of language identification has been extensively explored, notably by Gold [Gol67] and Angluin [Ang80], among others. Simply put, the problem is "what can we say about a language, given examples inside and outside that language?" Unfortunately, given no prior knowledge about the language, the answer is "not very much". However, if we know that the underlying language is of a certain class, opportunities start to arise.

Identification of natural languages has proved to be a difficult problem, mainly due to two facts: the simplest class of string grammars expressive enough to model natural languages is the context-free class, and language identification for context-free languages has been shown to be NP-complete.
Thus the simplest reasonable way of producing a grammar for a natural language has been assumed to be for a human to type up the rule set and submit it to a minimisation process (or at least an optimiser). However, as context-free string languages can be produced from the yield of regular tree languages (see Section 2), and language identification of regular tree languages is not NP-complete, there would seem to be room for a way of learning "natural tree languages", as long as the parse trees of the string examples are provided. Fortunately, treebanks (i.e. large sets of parse trees) have been compiled by linguists for many different languages, providing both training and test data for various learning algorithm approaches.

1.3 MAT learners

The language identification model of choice in this thesis is the Minimal Adequate Teacher (MAT) due to Angluin [Ang87], where the associated learning algorithm is named L*. Indeed, the aim of the project was to arrive at a practically usable model for language identification of natural languages from corpora using MAT learners. While the theoretical basis of MAT learning is covered further in the next section, and specifically in Subsection 2.2, let us briefly describe the overall algorithm, and why it is thought to be a suitable candidate for real-world language identification duties.

Basics

MAT learning was introduced to explore the minimum information required for a perfectly rational student to learn a given regular language, and as such introduces a restricted set of messages that the student and the teacher may pass between each other. The model is as follows: the teacher (or MAT oracle) has full knowledge about the target language, and is required to respond to the queries of the student according to that knowledge.
The student initially has no knowledge of the target language, besides it being a regular language, but by submitting queries to the teacher it will eventually build up an internal model consistent with the target language. The student may submit two kinds of queries:

- First, it may submit a model (automaton) to the teacher (an equivalence query). If the model is consistent with the language, the teacher will return a special token, indicating that the learning is complete. Otherwise, it will return a counterexample, i.e. an item that the model classifies wrongly: either the submitted automaton accepts the counterexample although it is not in the language, or it rejects the counterexample although it is.
- Second, it may submit an item (string, tree etc.) to the teacher (a membership query). If the item is part of the target language, the teacher will return true, otherwise it will return false.

The data from the membership queries is used to build an observation table of membership data for various combinations of string prefixes and suffixes, which eventually can be synthesised into an automaton. Receiving a counterexample from an equivalence query using that automaton results in more rows and/or columns in the table, which are filled in by further membership queries, giving another automaton which is submitted to the teacher, and so on until the teacher returns the accepting token, indicating that the algorithm has run its course.

1.4 Outline

Section 2 explains the theory behind the MAT algorithm in more detail, while Section 3 gives an overview of previous work, as well as the project plan for the thesis. Section 3 also contains certain algorithms developed for the purpose of testing the results obtained. Section 4 contains a run-down of the results that were accomplished, as well as some reasoning about the test runs and what they actually measure.
Finally, Section 5 reasons about how the results illustrate the difficulties of the problem at hand, and what potential exists for alleviating these problems.

2 Preliminaries

2.1 Introduction to trees and tree languages

The theory of tree languages is in essence an extension of the theory of string languages. Specifically, the class of regular tree languages is the class of regular string languages, extended by allowing symbols to have more than one successor. To make this extension obvious, it is informative to view the class of regular string languages in the context of finite automata, because the hierarchy of string languages depends on the hierarchy of automata used to recognise them. Thus, the extension of the classes of string languages to trees becomes a problem of redefining the automata in terms of trees as opposed to in terms of strings.

Alphabets, strings and trees

Formally, an alphabet is a nonempty set $\Sigma$ of symbols. A ranked alphabet is a pair $(\Sigma, R)$ where

- $\Sigma$ is an alphabet, i.e., a finite set of symbols, and
- $R$ is a mapping $\Sigma \to K \subset \mathbb{N}$.

We call the number $k = R(s)$ the rank of the symbol $s$. Furthermore, for every $k \in \mathrm{range}(R)$, we define the set $\Sigma_k = \{ s \in \Sigma \mid R(s) = k \}$. A symbol $s$ of rank $k$ may be written $s_k$ to make the rank explicit. The requirement that symbols have one rank only is unimportant, but useful.

Informally, a tree (or term) is an acyclic graph with a designated node called the root. Looking at it from a string perspective, we arrive at the following definition, however: Let $\{ [, ] \}$ be a set of auxiliary symbols, disjoint from every other alphabet considered herein. The set $T_\Sigma$ of trees over the (ranked) alphabet $\Sigma$ is the set of strings defined inductively as follows:

- for $a \in \Sigma_0$, $t = a \in T_\Sigma$;
- for $a \in \Sigma_k$, $k \geq 1$, and $t_1 \ldots t_k \in T_\Sigma$, $t = a[t_1 \ldots t_k] \in T_\Sigma$.

Fig. 1. A simple graphical representation of the tree $a[b[c], d]$

In the tree $t = a[b[c], d]$ (shown graphically in Figure 1), the symbol $a$ is the root of the tree, while $b[c]$ and $d$ are child trees, or direct subtrees.
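
The inductive definition of trees over a ranked alphabet can be illustrated with a small sketch. It is not taken from the thesis implementation: the names `Tree`, `parse_tree` and `RANK` are hypothetical, symbols are assumed to be single characters, and direct subtrees are assumed to be separated by commas as in the example tree a[b[c], d]. The parser builds the tree and rejects any string in which a symbol does not have exactly as many direct subtrees as its rank.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tree:
    """A tree over a ranked alphabet: a root symbol plus its direct subtrees."""
    symbol: str
    children: tuple = ()

    def __str__(self):
        if not self.children:
            return self.symbol
        return f"{self.symbol}[{', '.join(map(str, self.children))}]"


def parse_tree(text, rank):
    """Parse bracket notation such as 'a[b[c], d]' and check symbol ranks.

    `rank` plays the role of the mapping R: it maps each single-character
    symbol to its rank, i.e. its required number of direct subtrees.
    (Single-character symbols and comma separators are assumptions.)
    """
    pos = 0

    def skip_spaces():
        nonlocal pos
        while pos < len(text) and text[pos].isspace():
            pos += 1

    def parse():
        nonlocal pos
        skip_spaces()
        if pos >= len(text) or text[pos] not in rank:
            raise ValueError(f"expected a symbol at position {pos}")
        symbol = text[pos]
        pos += 1
        children = []
        skip_spaces()
        if pos < len(text) and text[pos] == "[":
            pos += 1
            while True:
                children.append(parse())
                skip_spaces()
                if pos < len(text) and text[pos] == ",":
                    pos += 1
                elif pos < len(text) and text[pos] == "]":
                    pos += 1
                    break
                else:
                    raise ValueError(f"expected ',' or ']' at position {pos}")
        # Enforce the rank: a symbol of rank k must have exactly k subtrees.
        if len(children) != rank[symbol]:
            raise ValueError(
                f"symbol {symbol!r} has rank {rank[symbol]}, "
                f"got {len(children)} subtrees"
            )
        return Tree(symbol, tuple(children))

    tree = parse()
    skip_spaces()
    if pos != len(text):
        raise ValueError(f"unexpected input at position {pos}")
    return tree


# Hypothetical ranked alphabet for the example tree of Figure 1.
RANK = {"a": 2, "b": 1, "c": 0, "d": 0}
t = parse_tree("a[b[c], d]", RANK)
print(t.symbol, [str(child) for child in t.children])
```

Here the root symbol of `t` is `a` and its direct subtrees are `b[c]` and `d`, matching the description above. A string such as `a[b]` is rejected under this `RANK`, since `b` has rank 1 but no subtree and `a` has rank 2 but only one.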
