FSMNLP 2012

Proceedings of the

10th International Workshop on Finite State Methods and Natural Language Processing

July 23–25, 2012
University of the Basque Country
Donostia-San Sebastián

© 2012 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360 USA
Tel: +1-570-476-8006
Fax: +1-570-476-0860
[email protected]

Preface

These proceedings contain the papers presented at the 10th International Workshop on Finite State Methods and Natural Language Processing (FSMNLP 2012), held in Donostia-San Sebastián (Basque Country), July 23–25, 2012.

The workshop covers a wide range of topics from morphology to formal language theory. A special theme was chosen for FSMNLP 2012: “practical issues in finite-state technology,” including:

• Practical implementations of linguistic descriptions with finite-state technology
• Software tools and utilities for finite-state NLP
• Finite-state models of linguistic theories
• Applications of finite-state-based NLP in closely related fields

This volume contains the 7 long and 12 short papers presented at the workshop. In total, 31 papers (13 long and 18 short papers) were submitted and double-blind refereed. Each paper was reviewed by 3 program committee members. The overall acceptance rate was 61 per cent.

The program committee was composed of internationally leading researchers and practitioners selected from academia, research labs, and companies.

The organizing committee would like to thank the program committee for their hard work, the referees for their valuable feedback, the invited speaker and the presenters of tutorials for their contributions and the local organizers for their tireless efforts. We are particularly indebted to the University of the Basque Country (UPV/EHU) and the Basque Government (Eusko Jaurlaritza) for significant financial support and to the Cursos de Verano/Uda Ikastaroak and the IXA research group for their support in organizing the event.

IÑAKI ALEGRIA
MANS HULDEN


Local Organizing Committee:
Iñaki Alegria (University of the Basque Country)
Koldo Gojenola (University of the Basque Country)
Izaskun Etxeberria (University of the Basque Country)
Nerea Ezeiza (University of the Basque Country)
Mans Hulden (University of the Basque Country / Ikerbasque, Basque Foundation for Science)
Amaia Lorenzo (University of the Basque Country)
Esther Miranda (University of the Basque Country)
Maite Oronoz (University of the Basque Country)

Invited Speaker: Kimmo Koskenniemi (University of Helsinki)

Tutorials by:
Tommi Pirinen (University of Helsinki)
Iñaki Alegria (University of the Basque Country)
Mans Hulden (University of the Basque Country / Ikerbasque, Basque Foundation for Science)
Miikka Silfverberg (University of Helsinki)
Josef Novak (University of Tokyo)

Program Committee:
Kenneth R. Beesley (SAP Business Objects, USA)
Francisco Casacuberta (Instituto Tecnológico de Informática, Spain)
Jan Daciuk (Gdańsk University of Technology, Poland)
Frank Drewes (Umeå University, Sweden)
Dale Gerdemann (University of Tübingen, Germany)
Mike Hammond (University of Arizona, USA)
Thomas Hanneforth (University of Potsdam, Germany)
Colin de la Higuera (University of Nantes, France)
Jan Holub (Czech Technical University in Prague, Czech Republic)
Mans Hulden (Ikerbasque, Basque Country)
André Kempe (CADEGE Technologies & Consulting, France)
András Kornai (Eötvös Loránd University, Hungary)
Andreas Maletti (University of Stuttgart, Germany)
Mark-Jan Nederhof (University of St Andrews, Scotland)
Kemal Oflazer (Carnegie Mellon University, Qatar)
Maite Oronoz (University of the Basque Country)
Laurette Pretorius (University of South Africa, South Africa)
Strahil Ristov (Ruđer Bošković Institute, Croatia)
Frédérique Segond (ObjectDirect, France)
Max Silberztein (Université de Franche-Comté, France)
Richard Sproat (University of Illinois at Urbana-Champaign, USA)
Trond Trosterud (University of Tromsø, Norway)
Shuly Wintner (University of Haifa, Israel)
Anssi Yli-Jyrä (University of Helsinki, Finland)
Menno van Zaanen (Tilburg University, Netherlands)
Lynette van Zijl (Stellenbosch University, South Africa)

Additional Reviewers:
Alicia Pérez (University of the Basque Country)
Suna Bensch (Umeå University, Sweden)
Emad Mohamed (Carnegie Mellon University, Qatar)
Johanna Björklund (Umeå University, Sweden)

Table of Contents

Effect of Language and Error Models on Efficiency of Finite-State Spell-Checking and Correction
Tommi A. Pirinen and Sam Hardwick

Practical Finite State Optimality Theory
Dale Gerdemann and Mans Hulden

Handling Unknown Words in Arabic FST Morphology
Khaled Shaalan and Mohammed Attia

Urdu – Roman Transliteration via Finite State Transducers
Tina Bögel

Integrating Aspectually Relevant Properties of Verbs into a Morphological Analyzer for English
Katina Bontcheva

Finite-State Technology in a Verse-Making Tool
Manex Agirrezabal, Iñaki Alegria, Bertol Arrieta and Mans Hulden

DAGGER: A Toolkit for Automata on Directed Acyclic Graphs
Daniel Quernheim and Kevin Knight

WFST-Based Grapheme-to-Phoneme Conversion: Open Source Tools for Alignment, Model-Building and Decoding
Josef R. Novak, Nobuaki Minematsu and Keikichi Hirose

Kleene, a Free and Open-Source Language for Finite-State Programming
Kenneth R. Beesley

Implementation of Replace Rules Using Preference Operator
Senka Drobac, Miikka Silfverberg and Anssi Yli-Jyrä

First Approaches on Spanish Medical Record Classification Using Diagnostic Term to Class Transduction
A. Casillas, A. Díaz de Ilarraza, K. Gojenola, M. Oronoz and Alicia Pérez

Developing an Open-Source FST Grammar for Verb Chain Transfer in a Spanish-Basque MT System
Aingeru Mayor, Mans Hulden and Gorka Labaka

Conversion of Procedural Morphologies to Finite-State Morphologies: A Case Study of Arabic
Mans Hulden and Younes Samih

A Methodology for Obtaining Concept Graphs from Word Graphs
Marcos Calvo, Jon Ander Gómez, Lluís-F. Hurtado and Emilio Sanchis

A Finite-State Temporal Ontology and Event-Intervals
Tim Fernando

A Finite-State Approach to Phrase-Based Statistical Machine Translation
Jorge González

Finite-State Acoustic and Translation Model Composition in Statistical Speech Translation: Empirical Assessment
Alicia Pérez, M. Inés Torres and Francisco Casacuberta

Refining the Design of a Contracting Finite-State Dependency Parser
Anssi Yli-Jyrä, Jussi Piitulainen and Atro Voutilainen

Lattice-Based Minimum Error Rate Training Using Weighted Finite-State Transducers with Tropical Polynomial Weights
Aurelien Waite, Graeme Blackwood and William Byrne

Conference Program

Monday, July 23rd, 2012

15:00–18:00 TUTORIALS

15:00–16:00 Tommi Pirinen and Mans Hulden: Spelling and Grammar Correction with FSTs

16:00–17:00 Miikka Silfverberg: Probabilistic Parsing with Weighted FSTs

17:00–18:00 Josef Novak: Grapheme-to-Phoneme Training and Conversion with Weighted FSTs

Tuesday, July 24th, 2012

9:00 Opening

9:30–10:30 Invited Speaker: Kimmo Koskenniemi: The Simplicity of Two-Level Morphology

10:30–11:00 Coffee Break

11:00–12:00 LONG PAPERS I

Effect of Language and Error Models on Efficiency of Finite-State Spell-Checking and Correction Tommi A. Pirinen and Sam Hardwick

Practical Finite State Optimality Theory Dale Gerdemann and Mans Hulden

12:00–13:00 SHORT PAPERS I

Handling Unknown Words in Arabic FST Morphology Khaled Shaalan and Mohammed Attia

Urdu – Roman Transliteration via Finite State Transducers Tina Bögel

Integrating Aspectually Relevant Properties of Verbs into a Morphological Analyzer for English Katina Bontcheva

Finite-State Technology in a Verse-Making Tool Manex Agirrezabal, Iñaki Alegria, Bertol Arrieta and Mans Hulden

13:00–14:30 Lunch

Tuesday, July 24th, 2012 (continued)

14:30–15:30 SHORT PAPERS II

DAGGER: A Toolkit for Automata on Directed Acyclic Graphs Daniel Quernheim and Kevin Knight

WFST-Based Grapheme-to-Phoneme Conversion: Open Source Tools for Alignment, Model-Building and Decoding Josef R. Novak, Nobuaki Minematsu and Keikichi Hirose

Kleene, a Free and Open-Source Language for Finite-State Programming Kenneth R. Beesley

Implementation of Replace Rules Using Preference Operator Senka Drobac, Miikka Silfverberg and Anssi Yli-Jyrä

15:30–16:30 SHORT PAPERS III

First Approaches on Spanish Medical Record Classification Using Diagnostic Term to Class Transduction A. Casillas, A. Díaz de Ilarraza, K. Gojenola, M. Oronoz and A. Pérez

Developing an Open-Source FST Grammar for Verb Chain Transfer in a Spanish-Basque MT System Aingeru Mayor, Mans Hulden and Gorka Labaka

Conversion of Procedural Morphologies to Finite-State Morphologies: A Case Study of Arabic Mans Hulden and Younes Samih

A Methodology for Obtaining Concept Graphs from Word Graphs Marcos Calvo, Jon Ander Gómez, Lluís-F. Hurtado and Emilio Sanchis

16:30–18:00 POSTERS AND COFFEE

20:30 Dinner in the Old Town

Wednesday, July 25th, 2012

9:30–11:00 LONG PAPERS II

A Finite-State Temporal Ontology and Event-Intervals Tim Fernando

A Finite-State Approach to Phrase-Based Statistical Machine Translation Jorge González

Finite-State Acoustic and Translation Model Composition in Statistical Speech Translation: Empirical Assessment Alicia Pérez, M. Inés Torres and Francisco Casacuberta

11:30–12:30 LONG PAPERS III

Refining the Design of a Contracting Finite-State Dependency Parser Anssi Yli-Jyrä, Jussi Piitulainen and Atro Voutilainen

Lattice-Based Minimum Error Rate Training using Weighted Finite-State Transducers with Tropical Polynomial Weights Aurelien Waite, Graeme Blackwood and William Byrne

12:30–13:30 Business Meeting


Effect of Language and Error Models on Efficiency of Finite-State Spell-Checking and Correction

Tommi A Pirinen and Sam Hardwick
University of Helsinki
Department of Modern Languages
FI-00014 Univ. of Helsinki, PO box 24
[email protected], [email protected]

Abstract

We inspect the viability of finite-state spell-checking and contextless correction of non-word errors in three languages with a large degree of morphological variety. Overviewing previous work, we conduct large-scale tests involving three languages — English, Finnish and Greenlandic — and a variety of error models and algorithms, including proposed improvements of our own. Special reference is made to on-line three-way composition of the input, the error model and the language model. Tests are run on real-world text acquired from freely available sources. We show that the finite-state approaches discussed are sufficiently fast for high-quality correction, even for Greenlandic which, due to its morphological complexity, is a difficult task for non-finite-state approaches.

1 Introduction

In most implementations of spell-checking, efficiency is a limiting factor for selecting or discarding spell-checking solutions. In the case of finite-state spell-checking it is known that finite-state language models can efficiently encode dictionaries of natural languages (Beesley and Karttunen, 2003), even for polysynthetic languages. Most contemporary spell-checking and correction systems are still based on programmatic solutions (e.g. hunspell[1], and its *spell relatives), or at most specialised algorithms for implementing error-tolerant traversal of finite-state dictionaries (Oflazer, 1996; Huldén, 2009a). There have also been a few fully finite-state implementations that both detect and correct errors (Schulz and Mihov, 2002; Pirinen and Lindén, 2010). In this paper we further evaluate the use of finite-state dictionaries with two-tape finite-state automatons as a mechanism for correcting misspellings, and optimisations to the finite-state error models, intending to demonstrate that purely finite-state algorithms can be made sufficiently efficient.

To evaluate the general usability and efficiency of finite-state spell-checking, we test a number of possible implementations of such a system with three languages of typologically different morphological features[2] and reference implementations for contemporary spell-checking applications: English, as a morphologically more isolating language with essentially a word-list approach to spell-checking; Finnish, whose computational complexity has been just beyond the edge of being too hard to implement nicely in e.g. hunspell (Pitkänen, 2006); and Greenlandic, a polysynthetic language which is implemented as a finite-state system using Xerox's original finite-state morphology formalism (Beesley and Karttunen, 2003). As a general-purpose finite-state library we use HFST[3], which also contains our spell-checking code.

[1] http://hunspell.sf.net
[2] We will not go into details regarding the morphological features of these languages. We thank the anonymous reviewer for guiding us to make a rough comparison using a piece of translated text. We observe from the translations of the Universal Declaration of Human Rights (with preamble included) the following: the number of word-like tokens is 1,746 for English, 1,275 for Finnish and 1,063 for Greenlandic. The counts of the 15 most frequent tokens range from 120 to 28 for English, from 85 to 10 for Finnish and from 38 to 7 for Greenlandic. The average word length is 5.0 characters for English, 7.8 for Finnish and 14.9 for Greenlandic. For the complexity of the computational models, refer to Table 2 in this article.
[3] http://hfst.sf.net

As neither Finnish nor Greenlandic have been successfully implemented in the hunspell formalism, we mainly use them to evaluate how the complexity of a language model affects the efficiency of finite-state spell-checking. For a full-scale survey on the state of the art in non-finite-state spell-checking, refer to Mitton (2009).

The efficiency results are contrasted with the existing research on finite-state spell-checking in Hassan et al. (2008) and the theoretical results on finite-state error models in Mitankin (2005). Our contribution primarily comprises the addition of morphologically complex languages with actual cyclic dictionary automata (i.e. infinite dictionaries formed by compounding and recurring derivation) and more complex structure in general, compared to those of English and Arabic. Our goal is to demonstrate that finite-state spelling is tractable for these complex languages, to document their implications for performance and to present an algorithm for the task. We also point out that previous approaches have neglected to simultaneously constrain the error model and the dictionary with each other in on-line composition, which affords a significant speed benefit compared to generating the two component compositions separately.

The rest of the paper is organised as follows. In Section 2 we discuss the spell-checking task, current non-finite-state spell-checkers and previously used finite-state methods for spell-checking and correction, and propose some possible speed optimisations for the error models. We also investigate algorithmic limitations of finite-state approaches and ways to remedy them. In Section 3 we present the language models, error models and the testing corpora. In Section 4 we present the comparisons of speed and quality with combinations of different language and error models and corpora for spell-checking. In Section 5 we summarise our findings and results, and outline future goals.

2 Methods

A finite-state spell-checker is typically (Pirinen and Lindén, 2010) composed of at least two finite-state automata: one for the dictionary of the language, or the language model, which contains the valid strings of the language, and one automaton to map misspelt words into correct strings, or the error model. Both the language model and the error model are usually (Pirinen and Lindén, 2010) weighted finite-state automata, where the weights represent the probabilities of a word being correctly spelled in the language model and of specific misspellings, respectively. We evaluate here the effect of both the language and error model automatons' structure and complexity on the efficiency of the finite-state spelling task.[4]

[4] The methods introduced in this research as well as all materials are free/libre open source. Please see our svn repository https://hfst.svn.sf.net/svnroot/trunk/fsmnlp-2012-spellers/ for the detailed implementation and scripts to reproduce all the results.

2.1 Language Models

The most basic language model for a spell-checking dictionary is a list of correctly spelled word forms. One of the easiest ways of creating such a spell-checker is to collect the word forms from a reasonably large corpus of (mostly) correctly spelt texts. Additionally we can count the frequency of words and use that as the likelihood, P(w) = c(w) / Σ_{w'∈D} c(w'), where c(w) is the count of the word w and D is the set of corpus word forms. For morphologically more isolating languages such as English, this is often a sufficient approach (Norvig, 2010), and we use it to create a dictionary for our English spell-checker as well. As a non-finite-state reference point we use hunspell.
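The word-list model is easy to make concrete. The sketch below builds the counts and the likelihood P(w) exactly as in the formula above; it is a minimal plain-Python illustration, not the weighted-automaton encoding used in the paper, and the corpus file name is hypothetical.

import re
from collections import Counter

def train(corpus_text):
    """Collect word forms and their counts, c(w), from running text."""
    return Counter(re.findall(r"\w+", corpus_text.lower()))

def likelihood(counts, word):
    """P(w) = c(w) / sum of c(w') over all word forms w' in D."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

# Hypothetical usage with some large text file:
counts = train(open("corpus.txt", encoding="utf-8").read())
print(likelihood(counts, "the"))

In a weighted automaton the same information would be stored as, for example, tropical weights -log P(w) on the accepting paths.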

For agglutinative languages like Finnish, for which the word-list approach is likely to miss a much greater number of words, one of the most common approaches is to use right-linear grammars, possibly combined with finite-state rule languages to implement morphophonological alternations (Koskenniemi, 1983). This approach also applies to the newest available free/open-source and full-fledged finite-state Finnish morphological dictionary we found (Pirinen, 2011). This language model features productive derivations, compounding and rudimentary probabilistic models. We take, as a reference non-finite-state language model for Finnish, Voikko's implementation in Malaga, which is currently used as a spell-checking component in open-source software. It is implemented in a left-associative grammar formalism, which is a potentially less efficient system with more expressive power. It is similar to finite-state formulations in terms of linguistic coverage.

For polysynthetic languages it will be obvious that the coverage of any word-list-based approach will be even lower. Furthermore, most simple extensions to it, such as affix stripping (as in hunspell), are not adequate for describing word forms. To our knowledge, the only approaches that have been widely used for spell-checking and morphological analysis of Greenlandic have been based on traditional finite-state solutions, such as the Xerox formalisms. In our case we have obtained a freely available finite-state morphology implementation from the Internet.[5] For further details we refer to the authors' website http://oqaaserpassualeriffik.org/.

[5] https://victorio.uit.no/langtech/trunk/st/kal

2.2 Error Models

The ubiquitous formula for modeling typing errors since computer-assisted spelling correction began has been the edit distance metric, sometimes attributed to Levenshtein (1966) and/or Damerau (1964). It maps four typical slips of the fingers on a keyboard to events in the fuzzy matching of misspelt word forms to correct ones: the deletion of a character (i.e. failing to press a key), the addition of a character (i.e. hitting an extra key accidentally), the changing of a character (i.e. hitting the wrong key) and the transposition of adjacent characters (i.e. hitting two keys in the wrong order).

When modeling edit distance as a finite-state automaton, a relatively simple two-tape automaton is sufficient to implement the algorithm (Hassan et al., 2008). The automaton will consist of one arc for each type of error, and additionally one state for each transposition pair. This means that the trivial nondeterministic finite-state automaton implementing the algorithm is of space complexity S(V, E, Σ) = O(|Σ|²|V| + |Σ|²|E|), where Σ is the alphabet of the language, V is the set of vertices in the automaton and E is the set of edges in the automaton. This edit distance formulation is roughly feature-equivalent to hunspell's TRY mechanism.
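The four error types can also be enumerated directly. The following sketch, in the style of Norvig (2010), generates the edit-distance-1 neighbourhood of a word, which is exactly the string set that such an error-model automaton relates to the input; the lower-case alphabet is an assumption for the example.

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings at edit distance 1 from word: a deletion, a
    transposition of adjacent characters, a substitution or an
    insertion of a single character."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {L + R[1:] for L, R in splits if R}
    transposes = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
    substitutes = {L + c + R[1:] for L, R in splits if R for c in ALPHABET}
    inserts = {L + c + R for L, R in splits for c in ALPHABET}
    return deletes | transposes | substitutes | inserts

print(len(edits1("spelling")))

The size of this set grows linearly with the alphabet size and the word length; iterating the function gives higher distances, with the growth rate discussed in section 2.3.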

To further fine-tune this finite-state formulation of the edit distance algorithm, it is possible to attach a probability to each of the error events as a weight in a weighted finite-state automaton, corresponding to the likelihood of an error, or a confusion factor. This can be used to implement features like keyboard adjacency or an OCR confusion factor in the error correction model. This will not modify the structure of the finite-state error models or the search space—which is why we did not test their effects in this article—but the introduction of non-homogeneous weights into the resulting finite-state network may have an effect on search time. This addition is equivalent to hunspell's KEY mechanism.

For spelling correction there is also an additional type of error model to deal with competence-related misspellings—as opposed to models that mainly deal with mistypings—implemented in the form of phonemic folding and unfolding. This type of error is very specific to certain types of English text and is not in the scope of this experiment. This is the PHON part of hunspell's correction mechanism.

After fine-tuning the error models to reimplement hunspell's feature set, we propose variations of this edit distance scheme to optimise the speed of error correction with little or no negative effect on the quality of the correction suggestions. The time requirement of the algorithm is determined by the size of the search space, i.e. the complexity of the resulting network when the error model is applied to the misspelt string and intersected with the dictionary.[6]

[6] For non-finite-state solutions, the search space is simply the number of possible strings given the error corrections made in the algorithm. For finite-state systems the amount of generated strings with cyclic language and error models is infinite, so complexity calculations are theoretically slightly more complex; however, for the basic edit distance implementations used in this article the search space complexities are always the same and the amount of suggestions generated finite.

To optimise the application of edit distance by limiting the search space, many traditional spell-checkers will not attempt to correct the very first letter of the word form. We investigated whether this decision is a particularly effective way to limit the search space, but it does not appear to significantly differ from restricting edits at any other position in the input.

Dividing the states of a dictionary automaton into classes corresponding to the minimum number of input symbols consumed by that state, we found that the average ambiguity in a particular class is somewhat higher for the first input symbols, but then stabilises quickly at a lower level. This was accomplished by performing the following state-categorisation procedure:

1. The start state is assigned to class 0, and all other states are assigned to a candidate pool.
2. All states to which there is an (input) epsilon transition from the start state are assigned to class 0 and removed from the candidate pool.
3. This is repeated for each state in class 0 until no more states are added to class 0. This completes class 0 as the set of states in which the automaton can be before consuming any input.
4. For each state in class 0, states in the candidate pool to which there is a non-epsilon transition are assigned to class 1 and removed from the candidate pool.
5. Class 1 is epsilon-completed as in (2–3).
6. After the completion of class n, class n+1 is constructed. This continues until the candidate pool is empty, which will happen as long as there are no unreachable states.

With this categorisation, we tallied the total number of arcs from states in each class and divided the total by the number of states in the class. This is intended as an approximate measure of the ambiguity present at a particular point in the input. Some results are summarized in Table 1.

Table 1: State classification by minimum input consumed for the Finnish dictionary

  Class  Transitions  States  Average
  0           156          3    52
  1         1,015        109    9.3
  2         6,439      1,029    6.3
  3        22,436      5,780    3.9
  4        38,899     12,785    3.0
  5        44,973     15,481    2.9
  6        47,808     17,014    2.8
  7        47,495     18,866    2.5
  8        39,835     17,000    2.3
  9        36,786     14,304    2.6
  10       45,092     14,633    3.1
  11       66,598     22,007    3.0
  12       86,206     30,017    2.9

Further, the size of a dictionary automaton that is restricted to have a particular symbol in a particular position does not apparently depend on the choice of position. This result was acquired by intersecting e.g. the automaton e.+ with the dictionary to restrict the first position to have the symbol e, the automaton .e.+ to restrict the second position, and so on. The automata acquired by this intersection vary in the size of the language, the number of states and the number of transitions, but without any trend depending on the position of the restriction. This is in line with the rather obvious finding that the size of the restricted dictionary in terms of number of strings is similarly position-agnostic.

Presumably, the rationale is a belief that errors predominately occur at other positions in the input. As far as we know, the complete justification for this belief remains to be made with a high-quality, hand-checked error corpus. On the error model side this optimisation has been justified by findings where between 1.5% and 15% of spelling errors happen in the first character of the word, depending on the text type (Bhagat, 2007); the 1.5% from a small corpus of academic texts (Yannakoudakis and Fawthrop, 1983) and the 15% from dictated corpora (Kukich, 1992). We also performed a rudimentary classification of the errors in the small error corpus of 333 entries from Pirinen et al. (2012), and found errors at the first position in 1.2% of the entries. Furthermore, we noticed that when evenly splitting the word forms in three parts, 15% of the errors are in the first third of the word form, while the second has 47% and the third 38%, which would be in favor of discarding initial errors.[7]

[7] By crude classification we mean that all errors were forced to one of the three classes at a weight of one; e.g. a series of three consecutive instances of the same letter was counted as a deletion at the first position.

A second form of optimisation that is used by many traditional spell-checking systems is to apply a lower order edit distance separately before trying higher order ones. This is based on the assumption that the vast majority of spelling errors will be of lower order. In the original account of edit distance for spell-checking, 80% of the spelling errors were found to be correctable with distance 1 (Pollock and Zamora, 1984).
For material with more errors, we n!/((d 1)! (n d + 1)!) d − · − have used a simple script to introduce (further, ar- indicating that when d is small, increases in it pro- bitrary) errors at a uniform probability of 1/33 per duce an exponential increase in complexity. For character; using this method we can also obtain a an English 26-letter lowercase alphabet, edit dis- corpus of errors with correct corrections along them. tance 2 and the 8-letter word “spelling”, 700 strings Finally we have used a text corpus from a language are stored in a transducer. With transpositions, different than the one being spelled to ensure that the deletions, insertions and edit weights this grows to majority of words are not in the vocabulary and (al-

5 most always) not correctable by standard error mod- Automaton En Fi Kl Σ set size 28 60 64 els. Edit distance 1 nodes 652 3,308 3,784 The Wikipedia corpora were sourced from Edit distance 1 arcs 2,081 10,209 11,657 Edit distance 2 nodes 1,303 6,615 7,567 wikimedia.org. For exact references, see our Edit distance 2 arcs 4136 20,360 23,252 previously mentioned repository. From the dumps No firsts ed 1 nodes 652 3,308 3,784 No firsts ed 1 arcs 2,107 10,267 11,719 we extracted the contents of the articles and picked No firsts ed 2 nodes 1,303 6,615 7,567 the first 100,000 word tokens for evaluation. No firsts ed 2 arcs 4,162 20,418 23,314 No redundancy and 1st ed 2 nodes 1,303 6,615 7,567 In Table 2 we summarize the sizes of automata in No redundancy and 1st ed 2 arcs 4,162 20,418 23,314 terms of structural elements. On the first row, we Lower order first ed 1 to 2 arcs 6,217 30,569 34,909 give the size of the alphabet needed to represent the Lower order first ed 1 to 2 nodes 1,955 9,923 11,351 entire dictionary. Next we give the sizes of automata as nodes and arcs of the finite-state automaton en- Table 3: The sizes of error models as automata coding the dictionary. Finally we give the size of the automaton as serialised on the hard disk. While this is to establish that simpler error models lead to de- is not the same amount of memory as its loaded data graded recall—and not to more generally evaluate structures, it gives some indication of memory usage the present system as a spell-checker. of the program while running the automaton in ques- The evaluations in this section are performed on tion. As can be clearly seen from the table, the mor- quad-core Intel Xeon E5450 running at 3 GHz with phologically less isolating languages do fairly con- 64 GiB of RAM memory. The times are averaged sistently have larger automata in every sense. over five test runs of 10,000 words in a stable server

Automaton En Fi Kl environment with no server processes or running Σ set size 43 117 133 graphical interfaces or other uses. The test results Dictionary FSM nodes 49,778 286,719 628,177 are measured using the getrusage C function on Dictionary FSM arcs 86,523 783,461 11,596,911 Dictionary FSM on disk 2.3 MiB 43 MiB 290 MiB a system that supports the maximum resident stack size ru maxrss and user time ru utime fields. Table 2: The sizes of dictionaries as automata The times are also verified with the GNU time command. The results for hunspell, Voikkospell In Table 3 we give the same figures for the sizes of and foma processes are only measured with time error models we’ve generated. The Σ size row here and top. The respective versions of the soft- shows the number of symbols left when we have re- ware are Voikkospell 3.3, hunspell 1.2.14, and Foma moved the symbols that are usually not considered 0.9.16alpha. The reference systems are tested with to be a part of a spell-checking mechanism, such as default settings, meaning that they will only give all punctuation that does not occur word-internally some fixed number of suggestions whereas our sys- and white-space characters8. Note that sizes of error tem will calculate all strings within the given error models can be directly computed from their parame- model. ters; i.e., the distance, the Σ set size and the optimi- As a reference implementation for English we use sation, so this table is provided for reference only. hunspell’s en-US dictionary9 and for a finite-state implementation we use a weighted word-list from 4 Evaluation Norvig (2010). As a Finnish reference implementa- tion we use Voikko10, with a LAG-based dictionary We ran various combinations of language and error 11 models on the corpora described in section 3. We using Malaga . The reference correction task for give tabular results of the speed of the system and Greenlandic is done with foma’s (Hulden,´ 2009b) the effect of the error model on recall. The latter 9http://wiki.services.openoffice.org/ 8The method described here does not handle run-on words wiki/Dictionaries or extraneous spaces, as they introduce lot of programmatic 10http://voikko.sf.net complexity which we believe is irrelevant to the results of this 11http://home.arcor.de/bjoern-beutel/ experiment. malaga/

6 apply med function with default settings12. of error models gives the expected timing result be- The baseline feature set and the efficiency of tween its relevant primary and secondary error mod- spell-checking we are targeting is defined by the cur- els. It should be noteworthy that, when thinking of rently de facto standard spelling suite in open source real world applications, the speed of the most of the systems, hunspell. models described here is greater than 1 word per sec- In Table 4 we measure the speed of the spell- ond (i.e. 10,000 seconds per 10,000 words). checking process on native language Wikipedia We measured memory consumption when per- text with real-world spelling errors and unknown forming the same tests. Varying the error model had strings. The error model rows are defined as fol- little to no effect. Memory consumption was almost lows: on the Reference impl. row, we test the spell- entirely determined by the language model, giving checking speed of the hunspell tool for English, and consumptions of 13-7 MiB for English, 0.2 GiB for Voikkospell tool for Finnish. On the edit distance 2 Finnish and 1.6 GiB for Greenlandic. row we use the basic traditional edit distance 2 with- To measure the degradation of quality when us- out any modifications. On the No first edits row we ing different error models we count the proportion use the error model that does not modify the first of suggestion sets that contain the correct correction character of the word. On the No redundancy row among the corrected strings. The suggestion sets are we use the edit distance 2 error model with the re- the entire (unrestricted by number) results of correc- dundant edit combinations removed. On the No re- tion, with no attempt to evaluate precision13. For dundancy and firsts rows we use the combined er- this test we use automatically generated corpus of ror model of No first edits and No redundancy func- spelling errors to get the large-scale results. tionalities. On the row Lower order first we apply a lower order edit distance model first, then if no re- Error model En Fi Kl sults are found, a higher order model is used. In the Edit distance 1 0.89 0.83 0.81 Edit distance 2 0.99 0.95 0.92 tables and formulae we routinely use the language Edit distance 3 1.00 0.96 — codes to denote the languages: en for English, fi for No firsts ed 1 0.74 0.73 0.60 No firsts ed 2 0.81 0.82 0.69 Finnish and kl for Greenlandic (Kalaallisut). No firsts ed 3 0.82 — —

Error model En Fi Kl Reference impl. 9.93 7.96 11.42 Table 5: Effect of language and error models to quality Generate all edits 2 3818.20 118775.60 36432.80 (recall, proportion of suggestion sets containing a cor- Edit distance 1 0.26 6.78 4.79 rectly suggested word) Edit distance 2 7.55 220.42 568.36 No first edits 1 0.44 3.19 3.52 No firsts ed 2 1.38 61.88 386.06 This test with automatically introduced errors No redundancy ed 2 7.52 4230.94 6420.66 shows us that with uniformly distributed errors the No redundancy and firsts ed 2 1.51 62.05 386.63 Lower order first ed 1 to 2 4.31 157.07 545.91 penalty of using an error model that ignores word- initial corrections could be significant. This con- Table 4: Effect of language and error models to speed trasts to our findings with real world errors, that the (time in seconds per 10,000 word forms) distribution of errors tends towards the end of the word, described in 2.2 and (Bhagat, 2007), but it The results show that not editing the first posi- should be noted that degradation can be as bad as tion does indeed give significant boost to the speed, given here. regardless of language model, which is of course Finally we measure how the text type used caused by the significant reduction in search space. will affect the speed of spell-checking. As the However, the redundancy avoidance does not seem best-case scenario we use the unmodified texts of to make a significant difference. This is most likely Wikipedia, which contain probably the most real- because the amount of duplicate paths in the search istic native-language-speaker-like typing error dis- space is not so proportionally large and their traver- 13 sal will be relatively fast. The separate application Which, in the absence of suitable error corpora and a more full-fledged language model taking context into account, would 12http://code.google.com/p/Foma/ be irrelevant for the goal at hand.

For text with more errors, where the majority of errors should be recoverable, we introduce automatically generated errors in the Wikipedia texts. Finally, to see the performance in the worst-case scenario where most of the words have unrecoverable spelling errors, we use texts from other languages, in this case English texts for Finnish and Greenlandic spell-checking and Finnish texts for English spell-checking, which should bring us close to the lower bounds on performance. The effect of text type (i.e. frequency of non-words) on the speed of spell-checking is given in Table 6. All of the tests in this category were performed with the error model from the No redundancy and firsts ed 2 row in the previous tables, which gave us the best speed/quality ratio.

Table 6: Effect of text type on the speed of error models (in seconds per 10,000 word forms)

  Error model               En     Fi      Kl
  Native Lang. Corpus       1.38   61.88   386.06
  Added automatic errors    6.91   95.01   551.81
  Text in another language  22.40  148.86  783.64

Here we chiefly note that the amount of non-words in a text directly reflects the speed of spell-checking. This shows that the dominating factor in the speed of spell-checking is indeed the correcting of misspelled words.

5 Conclusions and Future Work

In this article, we built a full-fledged finite-state spell-checking system from existing finite-state language models and generated error models. This work uses the system initially described in Pirinen and Lindén (2010) and an algorithm described in Lindén et al. (2012), providing an extensive quantitative evaluation of various combinations of constituents for such a system, and applying it to the most challenging linguistic environments available for testing. We showed that using on-line composition of the word form, the error model and the dictionary is usable for morphologically complex languages. Furthermore, we showed that the error models can be automatically optimised in several ways to gain some speed at the cost of recall.

We showed that the memory consumption of the spell-checking process is mainly unaffected by the selection of the error model, apart from the need to store a greater set of suggestions for models that generate more suggestions. The error models may therefore be quite freely changed in real-world applications as needed.

We verified that not correcting the first input letter affords a significant speed improvement, but that this improvement is not greatly dependent on the position of such a restriction. This practice is somewhat supported by our tentative finding that it may cause the least drop in practical recall figures, at least in Finnish. It is promising especially in conjunction with a fallback model that does correct the first letter.

We described a way to avoid having a finite-state error model perform redundant work, such as deleting and inserting the same letter in succession. The practical improvement from doing this is extremely modest, and it increases the size of the error model.

In this research we focused on the differences between automatically generated error models and their optimisations in the case of morphologically complex languages. For future research we intend to study more realistic error models induced from actual error corpora (e.g. Brill and Moore (2000)). Research into different ways to induce weights into the language models, as well as further use of context in finite-state spell-checking (as in Pirinen et al. (2012)), is warranted.

Acknowledgements

We thank the anonymous reviewers for their comments and the HFST research team for fruitful discussions on the article's topics. The first author thanks the people of Oqaaserpassualeriffik for introducing the problems and possibilities of finite-state applications to the morphologically complex language of Greenlandic.

References

Cyril Allauzen and Mehryar Mohri. 2009. N-way composition of weighted finite-state transducers. International Journal of Foundations of Computer Science, 20:613–627.

Kenneth R Beesley and Lauri Karttunen. 2003. Finite State Morphology. CSLI publications.

Meenu Bhagat. 2007. Spelling error pattern analysis of Punjabi typed text. Master's thesis, Thapar University.

Eric Brill and Robert C. Moore. 2000. An improved error model for noisy channel spelling correction. In ACL '00: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, pages 286–293, Morristown, NJ, USA. Association for Computational Linguistics.

Fred J Damerau. 1964. A technique for computer detection and correction of spelling errors. Commun. ACM, (7).

Ahmed Hassan, Sara Noeman, and Hany Hassan. 2008. Language independent text correction using finite state automata. In Proceedings of the Third International Joint Conference on Natural Language Processing, volume 2, pages 913–918.

Måns Huldén. 2009a. Fast approximate string matching with finite automata. Procesamiento del Lenguaje Natural, 43:57–64.

Måns Huldén. 2009b. Foma: a finite-state compiler and library. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics: Demonstrations Session, EACL '09, pages 29–32, Stroudsburg, PA, USA. Association for Computational Linguistics.

Kimmo Koskenniemi. 1983. Two-level Morphology: A General Computational Model for Word-Form Recognition and Production. Ph.D. thesis, University of Helsinki.

Karen Kukich. 1992. Techniques for automatically correcting words in text. ACM Comput. Surv., 24(4):377–439.

Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics—Doklady 10, 707–710. Translated from Doklady Akademii Nauk SSSR, pages 845–848.

Krister Lindén, Erik Axelson, Senka Drobac, Sam Hardwick, Miikka Silfverberg, and Tommi A Pirinen. 2012. Using HFST for creating computational linguistic applications. In Proceedings of Computational Linguistics - Applications, 2012, page to appear.

Petar Nikolaev Mitankin. 2005. Universal Levenshtein automata. Building and properties. Master's thesis, University of Sofia.

Mitton. 2009. Ordering the suggestions of a spellchecker without using context. Nat. Lang. Eng., 15(2):173–192.

Peter Norvig. 2010. How to write a spelling corrector. Referred 2011-01-11, available http://norvig.com/spell-correct.html.

Kemal Oflazer. 1996. Error-tolerant finite-state recognition with applications to morphological analysis and spelling correction. Comput. Linguist., 22(1):73–89.

Tommi A Pirinen and Krister Lindén. 2010. Finite-state spell-checking with weighted language and error models. In Proceedings of the Seventh SaLTMiL workshop on creation and use of basic lexical resources for less-resourced languages, pages 13–18, Valletta, Malta.

Tommi Pirinen, Miikka Silfverberg, and Krister Lindén. 2012. Improving finite-state spell-checker suggestions with part of speech n-grams. In International Journal of Computational Linguistics and Applications IJCLA (to appear).

Tommi A Pirinen. 2011. Modularisation of Finnish finite-state language description—towards wide collaboration in open source development of morphological analyser. In Proceedings of Nodalida, volume 18 of NEALT proceedings.

Harri Pitkänen. 2006. Hunspell-in kesäkoodi 2006: Final report. Technical report.

Joseph J. Pollock and Antonio Zamora. 1984. Automatic spelling correction in scientific and scholarly text. Commun. ACM, 27(4):358–368, April.

Klaus Schulz and Stoyan Mihov. 2002. Fast string correction with Levenshtein-automata. International Journal of Document Analysis and Recognition, 5:67–85.

Emmanuel J Yannakoudakis and D Fawthrop. 1983. An intelligent spelling error corrector. Information Processing and Management, 19(2):101–108.


Practical Finite State Optimality Theory

Dale Gerdemann
University of Tübingen
[email protected]

Mans Hulden
University of the Basque Country, IXA Group
IKERBASQUE, Basque Foundation for Science
[email protected]

Abstract

Previous work for encoding Optimality Theory grammars as finite-state transducers has included two prominent approaches: the so-called 'counting' method where constraint violations are counted and filtered out to some set limit of approximability in a finite-state system, and the 'matching' method, where constraint violations in alternative strings are matched through violation alignment in order to remove suboptimal candidates. In this paper we extend the matching approach to show how not only markedness constraints, but also faithfulness constraints and the interaction of the two types of constraints can be captured by the matching method. This often produces exact and small FST representations for OT grammars, which we illustrate with two practical example grammars. We also provide a new proof of nonregularity of simple OT grammars.

1 Introduction

The possibility of representing Optimality Theory (OT) grammars (Prince and Smolensky, 1993) as computational models and finite-state transducers, in particular, has been widely studied since the inception of the theory itself. In particular, constructing an OT grammar step-by-step as the composition of a set of transducers, akin to rewrite rule composition in (Kaplan and Kay, 1994), has offered the attractive possibility of simultaneously modeling OT parsing and generation as a natural consequence of the bidirectionality of finite-state transducers. Two main approaches have received attention as practical options for implementing OT with finite-state transducers: that of Karttunen (1998) and Gerdemann and van Noord (2000).[1] Both approaches model constraint interaction by constructing a GEN-transducer, which is subsequently composed with filtering transducers that mark violations of constraints, and remove suboptimal candidates—candidates that have received more violation marks than the optimal candidate—with the general template:

Grammar = Gen .o. MarkC1 .o. FilterC1 .o.
          ... .o. MarkCN .o. FilterCN

In Karttunen's system, auxiliary 'counting' transducers are created that first remove candidates with maximally k violation marks for some fixed k, then k−1, and so on, until nothing can be removed without emptying the candidate set, using a finite-state operation called priority union. Gerdemann and van Noord (2000) present a similar system that they call a 'matching' approach, but which does not rely on fixing a maximal number of distinguishable violations k. The matching method is a procedure by which we can in many cases (though not always) distinguish between infinitely many violations in a finite-state system—something that is not possible when encoding OT by the alternative approach of counting violations.

In this paper our primary purpose is to both extend and simplify this 'matching' method. We will include interaction of both markedness and faithfulness constraints (MAX, DEP, and IDENT violations)—going beyond both Karttunen (1998) and Gerdemann and van Noord (2000), where only markedness constraints were modeled. We shall also clarify the notation and markup used in the matching approach as well as present a set of generic transducer templates for EVAL by which modeling varying OT grammars becomes a simple matter of modifying the necessary constraint transducers and ordering them correctly in a series of compositions.

[1] Earlier finite-state approaches do exist, see e.g. Ellison (1994) and Hammond (1997).


We will first give a detailed explanation of the 'matching' approach in section 2—our encoding, notation, and tools differ somewhat from those of Gerdemann and van Noord (2000), although the core techniques are essentially alike. This is followed by an illustration of our encoding and method through a standard OT grammar example in section 3. In that section we also give examples of debugging OT grammars using standard finite state calculus methods. In section 4 we also present an alternate encoding of an OT account of prosody in Karttunen (2006), illustrating devices where GEN is assumed to add metrical and stress markup in addition to changing, inserting, or deleting segments. We also compare this grammar to both a non-OT grammar and an OT grammar of the same phenomenon described in Karttunen (2006). In section 5, we conclude with a brief discussion about the limitations of FST-based OT grammars in light of the method developed in this paper, as well as show a new proof of nonregularity of some very simple OT constraint systems.

1.1 Notation

All the examples discussed are implemented with the finite-state toolkit foma (Hulden, 2009b). The regular expressions are also compilable with the Xerox tools (Beesley and Karttunen, 2003), although some of the tests of properties of finite-state transducers, crucial for debugging, are unavailable. The regular expression formalism used is summarized in table 1.

Table 1: Regular expression notation in foma.

  A B           Concatenation
  A | B         Union
  ~A            Complement
  ?             Any symbol in alphabet
  %             Escape symbol
  [ and ]       Grouping brackets
  A:B           Cross product
  T.l           Output projection of T
  A -> B        Rewrite A as B
  A (->) B      Optionally rewrite A as B
  || C _ D      Context specifier
  [..] -> A     Insert one instance of A
  A -> B ... C  Insert B and C around A
  .#.           End or beginning of string

2 OT evaluation with matching

In order to clarify the main method used in this paper to model OT systems, we will briefly recapitulate the 'matching' approach to filtering out suboptimal candidates, or candidates with more violation marks in a string representation, developed in Gerdemann and van Noord (2000).[2]

[2] Also discussed in Jäger (2002).

2.1 Worsening

The fundamental technique behind the finite-state matching approach to OT is a device which we call 'worsening', used to filter out strings from a transducer containing more occurrences of some designated special symbol s (e.g. a violation marker) than some other candidate string in the same pool of strings. This method of transducer manipulation is perhaps best illustrated through a self-contained example.

Consider a simple morphological analyzer encoded as an FST, say of English, that only adds morpheme boundaries—+-symbols—to input words, perhaps consulting a dictionary of affixes and stems. Some of the mappings of such a transducer could be ambiguous: for example, the words deconstruction or incorporate could be broken down in two ways by such a morpheme analyzer.

Suppose our task was now to remove alternate morpheme breakdowns from the transducer so that, if an analysis with a smaller number of morphemes was available for any word, a longer analysis would not be produced. In effect, deconstruction should then only map to de+construction, since the other alternative, de+construct+ion, has one more morpheme boundary.

The worsening trick is based on the idea that we can use the existing set of words from the output side of the morphology, add at least one morpheme boundary to all of them, and use the resulting set of words to filter out longer 'candidates' from the original morphology. For example, one way of adding a +-symbol to de+construction produces de+construct+ion, which coincides with the original output in the morphology, and can now be used to knock out this suboptimal division. This process can be captured through:

AddBoundary = [?* 0:%+ ?*]+;
Worsen = Morphology .o. AddBoundary;
Shortest = Morphology .o. ~Worsen.l;

the effect of which is illustrated for the word deconstruction in figure 1. Here, AddBoundary is a transducer that adds at least one +-symbol to the input. The Worsen transducer is simply the original transducer composed with the AddBoundary transducer. The Shortest morphology is then constructed by extracting the output projection of Worsen, and composing its negation with the original morphology.

Figure 1: Illustration of a worsening filter for morpheme boundaries.
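Over a finite set of analyses, the effect of this construction can be mimicked with plain sets: an analysis is knocked out if inserting one or more +-symbols into some other analysis reproduces it. The sketch below is only an illustration of the idea; the point of the transducer construction is that it also applies to infinite sets of mappings, which plain sets cannot represent, and the bound of two extra boundaries is an assumption that suffices for this example.

from itertools import combinations

def worsenings(analysis, n):
    """All strings obtained by inserting n +-symbols into analysis."""
    out = set()
    for positions in combinations(range(len(analysis) + n), n):
        chars, result = list(analysis), []
        for i in range(len(analysis) + n):
            result.append("+" if i in positions else chars.pop(0))
        out.add("".join(result))
    return out

def shortest(analyses, max_extra=2):
    """Discard every analysis that is a worsened form of another one."""
    worse = set()
    for a in analyses:
        for n in range(1, max_extra + 1):   # at least one extra boundary
            worse |= worsenings(a, n)
    return {a for a in analyses if a not in worse}

print(shortest({"de+construction", "de+construct+ion"}))
# -> {'de+construction'}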

2.2 Worsening in OT

The above 'worsening' maneuver is what the 'matching' approach to model OT syllabification is based upon. Evaluation of competing candidates with regard to a single OT constraint can be performed in the same manner. This, of course, presupposes that we are using transducers to mark constraint violations in input strings, say by the symbol *. Gerdemann and van Noord (2000) illustrate this by constructing a GEN-transducer that syllabifies words,3 and another set of transducers that mark violations of some constraint. Then, having a constraint, NOCODA, implemented as a transducer that adds violation marks when syllables end in consonants, we can achieve such a markup by composition of GEN and NOCODA, for a particular example input bebop. The above transducers could be implemented very simply, by epenthesis replacement rules:

# Insert periods arbitrarily inside words
Gen = [..] (->) %. || \.#. _ \.#. ;
# Insert *-marks after C . or C .#.
NoCoda = [..] -> %* || C+ [%. | .#.] _ ;

3 Although using a much more complex set of markup symbols than here.
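These two rules can be tried out in isolation over a small hypothetical alphabet; the definition of the consonant class C below is a stand-in for whatever the full grammar uses:

define C [b|p];                               # hypothetical mini consonant set
define Gen    [..] (->) %. || \.#. _ \.#. ;
define NoCoda [..] -> %* || C+ [%. | .#.] _ ;
regex Gen .o. NoCoda;
# apply down> bebop
# outputs every optional syllabification of bebop, each with a
# violation mark inserted after a consonant-final syllable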

Naturally, at this point in the composition chain we would like to filter out the suboptimal candidates—that is, retain only the ones with the fewest violation marks—then remove the marks, and continue with the next constraint, until all constraints have been evaluated. The problem of filtering out the suboptimal candidates is now analogous to the 'worsening' scenario above: we can create a 'worsening' filter automaton by adding violation marks to the entire set of candidates. In this example, the candidate be.bop* would produce a worse candidate be*.bop*, which (disregarding for the moment syllable boundary marks and the exact position of the violation) can be used to filter out the suboptimal beb*.op*.

3 An OT grammar with faithfulness and markedness constraints

As previous work has been limited to working with only markedness constraints as well as a somewhat impoverished GEN—one that only syllabifies words—our first task when approaching a more complete finite-state methodology of OT needs to address this point. In keeping with the 'richness of the base' concept of OT, we require a suitable

GEN to be able to perform arbitrary deletions (elisions), insertions (epentheses), and changes to the input. A GEN-FST that only performs this task (maps Σ* → Σ*) on input strings is obviously fairly easy to construct. However, we need to do more than this: we also need to keep track of which parts of the input have been modified by GEN in any way to later be able to pinpoint and mark faithfulness violations—places where GEN has manipulated the input—through an FST.

3.1 Encoding of GEN

Perhaps the simplest possible encoding that meets the above criteria is to have GEN not only change the input, but also mark each segment in its output with a marker whereby we can later distinguish how the input was changed. To do so, we perform the following markup:

• Every surface segment (output) is surrounded by brackets [ ... ].
• Every input segment that was manipulated by GEN is surrounded by parentheses ( ... ).

For example, given the input a, GEN would produce an infinite number of outputs, and among them:

[a]            GEN did nothing
(a)[]          GEN deleted the a
(a)[e]         GEN changed the a to e
()[d](a)[i]    GEN inserted a d and changed a to i
...

This type of generic GEN can be defined through:

Gen = S -> %( ... %) %[ (S) %] ,,
      S -> %[ ... %] ,,
      [..] (->) [%( %) %[ S %]]* ;
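As a quick sanity check, Gen can be compiled over a tiny hypothetical segment inventory and applied to the input a; the commented outputs below are the ones listed above:

define S [a|d|e|i];        # hypothetical mini segment inventory
define Gen S -> %( ... %) %[ (S) %] ,,
           S -> %[ ... %] ,,
           [..] (->) [%( %) %[ S %]]* ;
regex Gen;
# apply down> a
#   [a]           GEN did nothing
#   (a)[]         GEN deleted the a
#   (a)[e]        GEN changed the a to e
#   ()[d](a)[i]   GEN inserted a d and changed a to i
#   ... (infinitely many more)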

assuming here that S represents the set of segments available.

3.2 Evaluation of faithfulness and markedness constraints

As an illustrative grammar, let us consider a standard OT example of word-final obstruent devoicing—as in Dutch or German—achieved through the interaction of faithfulness and markedness constraints. The constraints model the fact that underlyingly voiced obstruents surface as devoiced in word-final position, as in pad → pat. A set of core constraints to illustrate this include:

• *VF: a markedness constraint that disallows final voiced obstruents.
• IDENTV: a faithfulness constraint that militates against change in voicing.
• VOP: a markedness constraint against voiced obstruents in general.

The interaction of these constraints to achieve devoicing can be illustrated by the following tableau.4

bed     *VF   IDENTV   VOP
bet            *        *
pet            **!
bed     *!              **
ped     *!     *        *

The tableau above represents a kind of shorthand often given in the linguistic literature where, for the sake of conciseness, higher-ranked faithfulness constraints are omitted. For example, there is nothing preventing the candidate bede to rank equally with bet, were it not for an implicit high-ranked DEP-constraint disallowing epenthesis. As we are building a complete computational model with an unrestricted GEN, and no implicit assumptions, we need to add a few constraints not normally given when arguing about OT models. These include:

• DEP: a faithfulness constraint against epenthesis.
• MAX: a faithfulness constraint against deletion.
• IDENTPL: a faithfulness constraint against changes in place of articulation of segments. This is crucial to avoid e.g. bat or bap being equally ranked with bet in the above example.5

4 The illustration roughly follows (Kager, 1999), p. 42.
5 Note that a generic higher-ranked IDENT will not do, because then we would never get the desired devoicing in the first place.

Including these constraints explicitly allows us to rule out unwanted candidates that may otherwise rank equal with the candidate where word-final obstruents are devoiced, as illustrated in the following:

bed     DEP   MAX   IDENTPL   *VF   IDENTV   VOP
bet                                  *        *
pet                                  **!
bed                            *!             **
ped                            *!    *        *
bat                  *!              *        *
bep                  *!              *        *
be            *!                              *
bede    *!                                    **

Once we have settled for the representation of GEN, the basic faithfulness constraint markup transducers—whose job is to insert asterisks wherever violations occur—can be defined as follows:

Dep = [..] -> {*} || %( %) _ ;
Max = [..] -> {*} || %[ %] _ ;
Ident = [..] -> {*} || %( S %) %[ S %] _ ;

That is, DEP inserts a *-symbol after ()-sequences, which is how GEN marks epenthesis. Likewise, MAX-violations are identified by the sequence [], and IDENT-violations by a parenthesized segment followed by a bracketed segment.
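A small check of the first of these, assuming the definition above has been loaded; the input is a GEN-marked epenthesis candidate:

regex Dep;
# apply down> ()[d][a]
# ()*[d][a]        the mark lands right after GEN's epenthesis markup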

To define the remaining markup transducers, we shall take advantage of some auxiliary template definitions, defined as functions:

def Surf(X) [X .o. [0:%[ ? 0:%]]*].l/
            [ %( (S) %) | %[ %] ];
def Change(X,Y) [%( X %) %[ Y %]];

Here, Surf(X) in effect changes the language X so that it can match every possible surface encoding produced by GEN; for example, a surface sequence ab may look like [a][b], or [a](a)[b], etc., since it may spring from various different underlying forms. This is a useful auxiliary definition that will serve to identify markedness violations. Likewise Change(X,Y) reflects the GEN representation of changing a segment X to Y needed to concisely identify changed segments. Using the above we may now define the remaining violation markups needed.

CVOI = [b|d|g];
Voiced = [b|d|g|V];
Unvoiced = [p|t|k];
define VC Change(Voiced,Unvoiced) |
          Change(Unvoiced,Voiced);
define Place Change(p,?-b)|Change(t,?-d)|
             Change(k,?-g)|Change(b,?-p)|
             Change(d,?-t)|Change(g,?-k)|
             Change(a,?)|Change(e,?)|
             Change(i,?)|Change(o,?)|
             Change(u,?);
VF = [..] -> {*} || Surf(CVOI) _ .#. ;
IdentV = [..] -> {*} || VC _ ;
VOP = [..] -> {*} || Surf(CVOI) _ ;
IdentPl = [..] -> {*} || Place _ ;

The final remaining element for a complete implementation concerns the question of 'worsening' and its introduction into a chain of transducer composition. To this end, we include a few more definitions:

AddViol = [?* 0:%* ?*]+;
Worsen = [Gen.i .o. Gen]/%* .o. AddViol;
def Eval(X) X .o. ~[X .o. Worsen].l .o. %*->0;
Cleanup = %[|%] -> 0 .o. %( \%)* %) -> 0;

Here, AddViol is the basic worsening method discussed above whereby at least one violation mark is added. However, because GEN adds markup to the underlying forms, we need to be a bit more flexible in our worsening procedure when matching up violations. It may be the case that two different competing surface forms have the same underlying form, but the violation marks will not align correctly because of interfering brackets. Given two competing candidates with a different number of violations, for example (a)[b]* and [a], we would like the latter to match the former after adding a violation mark since they both originate in the same underlying form a. The way to achieve this is to undo the effect of GEN, and then redo GEN in every possible configuration before adding the violation marks. The transducer Worsen, above, does this by a composition of the inverse GEN, followed by GEN, ignoring already existing violations. For the above example, this leads to representations such as:

[a] → (Gen.i) → a → (Gen) → (a)[b] → (AddViol) → (a)[b]*
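This can be observed directly in a foma session, assuming the definitions above are loaded; the outputs are infinite, so only the line of interest is shown:

regex Worsen;
# apply down> [a]
# ... (a)[b]* ...   among infinitely many other worsened variants of [a]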



Figure 2: OT grammar for devoicing compiled into an FST.

Figure 3: Violation permutation transducer.

We also define a Cleanup transducer that removes brackets and parts of the underlying form. Now we are ready to compile the entire system into an FST. To apply only GEN and the first constraint, for example, we can calculate:

Eval(Gen .o. Dep) .o. Cleanup;

and likewise the entire grammar can be calculated by:

Eval(Eval(Eval(Eval(Eval(Eval(
  Gen .o. Dep) .o. Max) .o. IdentPl) .o.
  VF) .o. IdentV) .o. VOP) .o. Cleanup;

This yields an FST of 6 states and 31 transitions (see figure 2)—it can be ascertained that the FST indeed does represent a relation where word-final voiced obstruents are always devoiced.
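With all the definitions above loaded, the compiled grammar can be exercised at the foma prompt; if everything compiles as intended, the single winner predicted by the tableau is returned:

regex Eval(Eval(Eval(Eval(Eval(Eval(
  Gen .o. Dep) .o. Max) .o. IdentPl) .o.
  VF) .o. IdentV) .o. VOP) .o. Cleanup;
# apply down> bed
# bet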

3.3 Permutation of violations

As mentioned in Gerdemann and van Noord (2000), there is an additional complication with the 'worsening' approach. It is not always the case that in the pool of competing candidates, the violation markers line up, which is a prerequisite for filtering out suboptimal ones by adding violations—although in the above grammar the violations do line up correctly. However, for the vast majority of OT grammars, this can be remedied by inserting a violation-permuting transducer that moves violation markers around before worsening, to attempt to produce a correct alignment. Such a permuting transducer can be defined as in figure 3. If the need for permutation arises, repeated permutations can be included as many times as warranted in the definition of Worsen:

Permute = [%*:0 ?* 0:%*|0:%* ?* %*:0]*/?;
Worsen = [Gen.i .o. Gen]/%* .o.
         Permute .o. ... .o. Permute .o.
         AddViol;

Knowing how many permutations are necessary for the transducer to be able to distinguish between any number of violations in a candidate pool is possible as follows: we can calculate, for some constraint ConsN in a sequence of constraints,

Eval( ... Eval(Gen .o. Cons1) ... .o. ConsN) .o.
ConsN .o. \%* -> 0;

Now, this yields a transducer that maps every underlying form to n asterisks, n being the number of violations with respect to ConsN in the candidates that have successfully survived ConsN. If this transducer represents a function (is single-valued), then we know that two candidates with a different number of violations have not survived ConsN, and that the worsening yielded the correct answer. Since the question of transducer functionality is known to be decidable (Blattner and Head, 1977), and an efficient algorithm is given in Hulden (2009a), which is included in foma (with the command test functional), we can address this question by calculating the above for each constraint, if necessary, and then permute the violation markers until the above transducer is functional.
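Scripted in foma, the check for, say, the second constraint in a chain looks as follows (a sketch; Cons1 and Cons2 are hypothetical stand-ins for actual constraint transducers):

regex Eval(Eval(Gen .o. Cons1) .o. Cons2) .o.
      Cons2 .o. \%* -> 0;
test functional
# if the test fails, add a round of Permute to Worsen and recompile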

3.4 Equivalence testing

In many cases, the purpose of an OT grammar is to capture accurately some linguistic phenomenon through the interaction of constraints rather than by other formalisms. However, as has been noted by Karttunen (2006), among others, OT constraint debugging is an arduous task due to the sheer number of unforeseen candidates. One of the advantages in encoding an OT grammar through the worsening approach is that we can produce an exact representation of the grammar, which is not an approximation bounded by the number of constraint violations it can distinguish (as in Karttunen (1998)), or by the length of strings it can handle. This allows us to formally calculate, among other things, the equivalence of an OT grammar represented as an FST and some other transducer. For example, in the above grammar, the intention was to model end-of-word obstruent devoicing through optimality constraints. Another way to model the same thing would be to compile the replacement rule:

Rule = b -> p, d -> t, g -> k || _ .#. ;

The transducer resulting from this is shown in figure 4.

Figure 4: Devoicing transducer compiled through a rule.

As is seen, the OT transducer (figure 2) and the rule transducer (figure 4) are not structurally identical. However, both transducers represent a function—i.e. for any given input, there is always a unique winning candidate. Although transducer equivalence is not testable by algorithm in the general case, it is decidable in the case where one of the two transducers is functional. If this is the case it is sufficient to test that domain(τ₁) = domain(τ₂) and that τ₂⁻¹ ∘ τ₁ represents identity relations only. As an algorithm to decide if a transducer is an identity transducer is also included in foma, it can be used to ascertain that the two above transducers are in fact identical, and that the linguistic generalization captured by the OT constraints is correct:

regex Rule.i .o. Grammar;
test identity

which indeed returns TRUE. For a small grammar, such as the devoicing grammar, determining the correctness of the result by other means is certainly feasible. However, for more complex systems the ability to test for equivalence becomes a valuable tool in analyzing constraint systems.
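The domain half of the test can likewise be scripted; a sketch, where test null checks whether the compiled language is empty:

regex [Rule.u - Grammar.u] | [Grammar.u - Rule.u];
test null
# an empty result here means domain(Rule) = domain(Grammar)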
4 Variations on GEN: an OT grammar of stress assignment

Most OT grammars that deal with phonological phenomena with faithfulness and markedness constraints are implementable through the approach given above, with minor variations according to what specific constraints are used. In other domains, however, it may be the case that GEN, as described above, needs modification. A case in point are grammars that mark prosody or perform syllabification and that often take advantage of only markedness constraints. In such cases, there is often no need for GEN to insert, change, and delete material if all faithfulness constraints are assumed to outrank all markedness constraints—or alternatively, if the OT grammar is assumed to operate on a different stratum where no faithfulness constraints are present. However, GEN still needs to insert material into strings, such as stress marks or syllable boundaries.

To test the approach with a larger 'real-world' grammar we have reimplemented a Finnish stress assignment grammar, originally implemented through the counting approach of Karttunen (1998) in Karttunen (2006), following a description in Kiparsky (2003). The grammar itself contains nine constraints, and is intended to give a complete account of stress placement in Finnish words. Without going into a line-by-line analysis of the grammar, the crucial main differences in this implementation to that of the previous sections are as follows (a sketch of the first point is given after the list):

• GEN only inserts the symbols ( ) ´ and ‘ to mark feet and stress
• Violations need to be permuted in Worsen to yield an exact representation
• GEN syllabifies words correctly through a replacement rule (no constraints are given in the grammar to model syllabification; this is assumed to be already performed)
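Such a purely inserting GEN can be sketched in the same style as the generic GEN of section 3.1; this is an illustrative shape, not the actual definition used in the reimplemented grammar:

define StressGen [..] (->) [%( | %) | "´" | "‘"]* ;
# maps any Finnish string to itself with foot brackets and
# stress marks optionally inserted at arbitrary positions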

kainostelijat  -> (ka´i.nos).(te‘.li).jat
kalastelemme   -> (ka´.las).te.(le‘m.me)
kalasteleminen -> *(ka´.las).te.(le‘.mi).nen
kalastelet     -> (ka´.las).(te‘.let)
kuningas       -> (ku´.nin).gas
strukturalismi -> (stru´k.tu).ra.(li‘s.mi)
ergonomia      -> (e´r.go).(no‘.mi).a
matematiikka   -> (ma´.te).ma.(ti‘ik.ka)

Figure 5: Example outputs of matching implementation of Finnish OT.

Compiling the entire grammar through the same procedure as above outputs a transducer with 134 states, and produces the same predictions as Karttunen's counting OT grammar.6 As opposed to the previous devoicing grammar, compiling the Finnish prosody grammar requires permutation of the violation markers, although only one constraint requires it (STRESS-TO-WEIGHT, and in that case, composing Worsen with one round of permutation is sufficient for convergence).

Unlike the counting approach, the current approach confers two significant advantages. The first is that we can compile the entire grammar into an FST that does not restrict the inputs in any way. That is, the final product is a stand-alone transducer that accepts as input any sequence of any length of symbols in the Finnish alphabet, and produces an output where the sequence is syllabified, marked with feet, and primary and secondary stress placement (see figure 5). The counting method, in order to compile at all, requires that the set of inputs be fixed to some very limited set of words, and that the maximum number of distinguishable violations (and indirectly word length) be fixed to some k.7 The second advantage is that, as mentioned before, we are able to formally compare the OT grammar (because it is not an approximation), to a rule-based grammar (FST) that purports to capture the same phenomena. For example, Karttunen (2006), apart from the counting OT implementation, also provides a rule-based account of Finnish stress, which he discovers to be distinct from an OT account by finding two words where their respective predictions differ. However, by virtue of having an exact transducer, we can formally analyze the OT account together with the rule-based account to see if they differ in their predictions for any input, without having to first intuit a differing example:

regex RuleGrammar.i .o. OTGrammar;
test identity

Further, we can subject the two grammars to the usual finite-state calculus operations to gain possible insight into what kinds of words yield different predictions with the two—something useful for linguistic debugging. Likewise, we can use similar techniques to analyze for redundancy in grammars. For example, we have assumed that the VOP-constraint plays no role in the above devoicing tableaux. Using finite-state calculus, we can prove it to be so for any input if the grammar is constructed with the method presented here.

6 Including replicating errors in Kiparsky's OT analysis discovered by Karttunen, as seen in figure 5.
7 Also, compiling the grammar is reasonably quick: 7.04s on a 2.8GHz Intel Core 2, vs. 2.1s for a rewrite-rule-based account of the same phenomena.

5 Limits on FST implementation

We shall conclude the presentation here with a brief discussion of the limits of FST representability, even of simple OT grammars. Previous analyses have shown that OT systems are beyond the generative capacity of finite-state systems, under some assumptions of what GEN looks like. For example, Frank and Satta (1998) present such a constraint system where GEN is taken to be defined through a transduction equivalent to:8

Gen = [a:b|b:a]* | [a|b]*;

That is, a relation which either maps all a's to b's and vice versa, or leaves the input unchanged. Now, let us assume the presence of a single markedness constraint *a, militating against the letter a. In that case, given an input of the format a*b* the effective mapping of the entire system is one that is an identity relation if there are fewer a's than b's; otherwise the a's and b's are swapped (for instance, aab maps to bba, while abb maps to itself). As is easily seen, this is not a regular relation.

One possible objection to this analysis of nonregularity is that linguistically GEN is usually assumed to perform any transformation to the input

8 The idea is attributed to Markus Hiller in the article.

whatsoever—not just limiting itself to a proper subset of Σ* → Σ*. However, it is indeed the case that even with a canonical GEN-function, some very simple OT systems fall outside the purview of finite-state expressibility, as we shall illustrate by a different example here.

5.1 A simple proof of OT nonregularity

Assume a grammar that has four very basic constraints: IDENT, forbidding changes, DEP, forbidding epenthesis, *ab, a markedness constraint against the sequence ab, and MAX, forbidding deletion, ranked IDENT, DEP ≫ *ab ≫ MAX. We assume GEN to be as general as possible—performing arbitrary deletions, insertions, and changes.

It is clear, as is illustrated in table 2, that for all inputs of the format aⁿbᵐ the grammar in question describes a relation that deletes all the a's or all the b's depending on which there are fewer instances of, i.e. aⁿbᵐ → aⁿ if m < n, and aⁿbᵐ → bᵐ if n < m. This can be shown by a simple pumping argument to not be realizable through an FST.

aaabb    IDENT   DEP   *ab   MAX
aaaaa    *!*
aaacbb           *!
aaabb                  *!
aaab                   *!     *
bb                            ***!
aaa                           **

Table 2: Illustrative tableau for a simple constraint system not capturable as a regular relation.

Implementing this constraint system with the methods presented here is an interesting exercise and serves to examine the behavior of the method. We define GEN, DEP, MAX, and IDENT as before, define a universal alphabet (excluding markup symbols), and the constraint *ab naturally as:

S = ? - %( - %) - %[ - %] - %* ;
NotAB = [..] -> {*} || Surf(a b) _ ;

Now, with one round of permutation of the violation markers in Worsen as follows:

Worsen = [Gen.i .o. Gen]/{*} .o.
         AddViol .o. Permute;

we calculate

define Grammar Eval(Eval(Eval(Eval(
  Gen .o. Ident) .o. Dep) .o. NotAB) .o.
  Max) .o. Cleanup;

which produces an FST that cannot distinguish between more than two a's or b's in a string. While it correctly maps aab to aa and abb to bb, the tableau example of aaabb is mapped to both aaa and bb. However, with one more round of permutation in Worsen, we produce an FST that can indeed cover the example, mapping aaabb uniquely to aaa, while failing with aaaabbb (see figure 6). This illustrates the approximation characteristic of the matching method: for some grammars (probably most natural language grammars) the worsening approach will at some point of permutation of the violation markers terminate and produce an exact FST representation of the grammar, while for some grammars such convergence will never happen. However, if the permutation of markers terminates and produces a functional transducer when testing each violation as described above, the FST is guaranteed to be an exact representation.

Figure 6: A non-regular OT approximation.

It is an open question if it is decidable by examining a grammar whether it will yield an exact FST representation. We do not expect this question to be easy, since it cannot be determined by the nature of the constraints alone. For example, the above four-constraint system does have an exact FST representation in some orderings of the constraints, but not in the particular one given above.
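With the one-round grammar loaded, the behavior reported above can be replayed at the prompt; a sketch of the session:

# apply down> aab
# aa
# apply down> abb
# bb
# apply down> aaabb
# aaa
# bb      (spurious: one round of Permute cannot align three violations)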

6 Conclusion

We have presented a practical method of implementing OT grammars as finite-state transducers. The examples, definitions, and templates given should be sufficient and flexible enough to encode a wide variety of OT grammars as FSTs. Although no method can encode all OT grammars as FSTs, the fundamental advantage with the system outlined is that for a large majority of practical cases, an FST can be produced which is not an approximation that can only tell apart a limited number of violations. As has been noted elsewhere (e.g. Eisner (2000b,a)), some OT constraints, such as Generalized Alignment constraints, are on the face of it not suitable for FST implementation. We may add to this that some very simple constraint systems, assuming a canonical GEN, and only using the most basic faithfulness and markedness constraints, are likewise not encodable as regular relations, and seem to have the generative power to encode phenomena not found in natural language. However, for most practical purposes—and this includes modeling actual phenomena in phonology and morphology—the present approach offers a fruitful way to implement, analyze, and debug OT grammars.

References

Beesley, K. R. and Karttunen, L. (2003). Finite State Morphology. CSLI Publications, Stanford, CA.

Blattner, M. and Head, T. (1977). Single-valued a-transducers. Journal of Computer and System Sciences, 15(3):328–353.

Eisner, J. (2000a). Directional constraint evaluation in optimality theory. In Proceedings of the 18th conference on Computational linguistics, pages 257–263. Association for Computational Linguistics.

Eisner, J. (2000b). Easy and hard constraint ranking in optimality theory. In Finite-state phonology: Proceedings of the 5th SIGPHON, pages 22–33.

Ellison, T. M. (1994). Phonological derivation in optimality theory. In Proceedings of COLING'94—Volume 2, pages 1007–1013.

Frank, R. and Satta, G. (1998). Optimality theory and the generative complexity of constraint violability. Computational Linguistics, 24(2):307–315.

Gerdemann, D. and van Noord, G. (2000). Approximation and exactness in finite state optimality theory. In Proceedings of the Fifth Workshop of the ACL Special Interest Group in Computational Phonology.

Hammond, M. (1997). Parsing syllables: Modeling OT computationally. Rutgers Optimality Archive (ROA), 222-1097.

Hulden, M. (2009a). Finite-state Machine Construction Methods and Algorithms for Phonology and Morphology. PhD thesis, The University of Arizona.

Hulden, M. (2009b). Foma: a finite-state compiler and library. In EACL 2009 Proceedings, pages 29–32.

Jäger, G. (2002). Gradient constraints in finite state OT: the unidirectional and the bidirectional case. More than Words. A Festschrift for Dieter Wunderlich, pages 299–325.

Kager, R. (1999). Optimality Theory. Cambridge University Press.

Kaplan, R. M. and Kay, M. (1994). Regular models of phonological rule systems. Computational Linguistics, 20(3):331–378.

Karttunen, L. (1998). The proper treatment of optimality theory in computational phonology. In Finite-state Methods in Natural Language Processing.

Karttunen, L. (2006). The insufficiency of paper-and-pencil linguistics: the case of Finnish prosody. Rutgers Optimality Archive.

Kiparsky, P. (2003). Finnish noun inflection. Generative approaches to Finnic linguistics. Stanford: CSLI.

Prince, A. and Smolensky, P. (1993). Optimality theory: Constraint interaction in generative grammar. ms. Rutgers University Cognitive Science Center.

Riggle, J. (2004). Generation, recognition, and learning in finite state Optimality Theory. PhD thesis, University of California, Los Angeles.

Handling Unknown Words in Arabic FST Morphology

Khaled Shaalan and Mohammed Attia
Faculty of Engineering & IT, The British University in Dubai
[email protected]
[email protected]

Abstract

A morphological analyser only recognizes words that it already knows in the lexical database. It needs, however, a way of sensing significant changes in the language in the form of newly borrowed or coined words with high frequency. We develop a finite-state morphological guesser in a pipelined methodology for extracting unknown words, lemmatizing them, and giving them a priority weight for inclusion in a lexicon. The processing is performed on a large contemporary corpus of 1,089,111,204 words and passed through a machine-learning-based annotation tool. Our method is tested on a manually-annotated gold standard of 1,310 forms and yields good results despite the complexity of the task. Our work shows the usability of a highly non-deterministic finite state guesser in a practical and complex application.

1 Introduction

Due to the complex and semi-algorithmic nature of Arabic morphology, it has always been a challenge for computational processing and analysis (Kiraz, 2001; Beesley 2003; Shaalan et al., 2012). A lexicon is an indispensable part of a morphological analyser (Dichy and Farghaly, 2003; Attia, 2006; Buckwalter, 2004; Beesley, 2001), and the coverage of the lexical database is a key factor in the coverage of the morphological analyser. This is why an automatic method for updating a lexical database is crucially important.

We present the first attempt, to the best of our knowledge, to address lemmatization of Arabic unknown words. The specific problem with lemmatizing unknown words is that they cannot be matched against a morphological lexicon. We develop a rule-based finite-state morphological guesser and use a machine learning disambiguator, MADA (Roth et al., 2008), in a pipelined approach to lemmatization.

This paper is structured as follows. The remainder of the introduction reviews previous work on Arabic unknown word extraction and lemmatization, and explains the data used in our experiments. Section 2 presents the methodology followed in extracting and analysing unknown words. Section 3 provides details on the morphological guesser we have developed to help deal with the problem. Section 4 shows and discusses the testing and evaluation results, and finally Section 5 gives the conclusion.

1.1 Previous Work

Lemmatization of Arabic words has been addressed in (Roth et al., 2008; Dichy, 2001). Lemmatization of unknown words has been addressed for Slovene in (Erjavec and Džerosk, 2004), for Hebrew in (Adler et al., 2008) and for English, Finnish, Swedish and Swahili in (Lindén, 2008). Lemmatization means the normalization of text data by reducing surface forms to their canonical underlying representations, which, in Arabic, means verbs in their perfective, indicative, 3rd person, masculine, singular forms, such as شَكَرَ


$akara "to thank"; and nominals in their nominative, singular, masculine forms, such as طالِب TAlib "student"; and nominative plural for pluralia tantum nouns (or nouns that appear only in the plural form and are not derived from a singular word), such as ناس nAs "people". To the best of our knowledge, the study presented here is the first to address lemmatization of Arabic unknown words. The specific problem with lemmatizing unknown words is that they cannot be matched against a lexicon. In our method, we use a machine learning disambiguator, develop a rule-based finite-state morphological guesser, and combine them in a pipelined process of lemmatization. We test our method against a manually created gold standard of 1,310 types (unique forms) and show a significant improvement over the baseline. Furthermore, we develop an algorithm for weighting and prioritizing new words for inclusion in a lexicon depending on three factors: number of form variations of the lemmas, cumulative frequency of the forms, and the POS tags.

1.2 Data Used

In our work we rely on a large-scale corpus of 1,089,111,204 words, consisting of 925,461,707 words from the Arabic Gigaword Fourth Edition (Parker et al., 2009), and 163,649,497 words from news articles collected from the Al-Jazeera web site.1 In this corpus, unknown words appear at a rate of between 2% of word tokens (when we ignore possible spelling variants) and 9% of word tokens (when possible spelling variants are included).

1 http://aljazeera.net/portal. Collected in January 2010.

2 Methodology

To deal with unknown words, or out-of-vocabulary words (OOVs), we use a pipelined approach, which predicts part-of-speech tags and morpho-syntactic features before lemmatization. First, a machine learning, context-sensitive tool is used. This tool, MADA (Roth et al., 2008), performs POS tagging and morpho-syntactic analysis and disambiguation of words in context. MADA internally uses the Standard Arabic Morphological Analyser (SAMA) (Maamouri et al., 2010), an updated version of the Buckwalter Arabic Morphological Analyser (BAMA) (Buckwalter, 2004). Second, we develop a finite-state morphological guesser that gives all possible interpretations of a given word. The morphological guesser first takes an Arabic form as a whole and then strips off all possible affixes and clitics one by one until all potential analyses are exhausted. As the morphological guesser is highly non-deterministic, all the interpretations are matched against the morphological analysis of MADA that receives the highest probabilistic scores. The guesser's analysis that bears the closest resemblance (in terms of morphological features) with the MADA analysis is selected.

These are the steps followed in extracting and lemmatizing Arabic unknown words:

• A corpus of 1,089,111,204 words is analysed with MADA. The number of types for which MADA could not find an analysis in SAMA is 2,116,180.
• These unknown types are spell checked by the Microsoft Arabic spell checker using MS Office 2010. Among the unknown types, the number of types accepted as correctly spelt is 208,188.
• We then select types with frequency of 10 or more. This leaves us with 40,277 types.
• We randomly select 1,310 types and manually annotate them with the gold lemma, the gold POS and lexicographic preference for inclusion in a dictionary.
• We use the full POS tags and morpho-syntactic features produced by MADA.
• We use the finite-state morphological guesser to produce all possible morphological interpretations and corresponding lemmatizations.
• We compare the POS tags and morpho-syntactic features in MADA output with the output of the morphological guesser and choose the one with the highest matching score.

3 Morphological Guesser

We develop a morphological guesser for Arabic that analyses unknown words with all possible clitics, morpho-syntactic affixes and all relevant alteration operations, including insertion, assimilation, and deletion. Beesley and Karttunen

(2003) show how to create a basic guesser. The core idea of a guesser is to assume that a stem is composed of any arbitrary sequence of Arabic non-numeric characters, and this stem can be prefixed and/or suffixed with a predefined set of prefixes, suffixes or clitics. The guesser marks clitic boundaries and tries to return the stem to its underlying representation, the lemma. Due to the nondeterministic nature of the guesser, there will be a large number of possible lemmas for each form.

The XFST finite-state compiler (Beesley and Karttunen, 2003) uses the "substitute defined" command for creating the guesser. The XFST commands in our guesser are stated as follows.

define PossNounStem
  [[Alphabet]^{2,24}] "+Guess":0;
define PossVerbStem
  [[Alphabet]^{2,6}] "+Guess":0;
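For illustration, a minimal sketch of how such a guess stem could be spliced into a lexc continuation class; the lexicon names, the placeholder symbol and the suffix inventory here are hypothetical, not the actual grammar's:

! lexc sketch (hypothetical names); GUESSNOUNSTEM is a placeholder
! symbol later replaced by the network PossNounStem via the xfst
! command: substitute defined PossNounStem for GUESSNOUNSTEM
! (the tags would be declared under Multichar_Symbols)
LEXICON GuessNoun
    GUESSNOUNSTEM  NounGuessFlex ;
LEXICON NounGuessFlex
    +masc+pl+nom:ون   # ;
    +sg:0             # ;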

This rule states that a possible noun stem is defined as any sequence of Arabic non-numeric characters of length between 2 and 24 characters. A possible verb stem is between 2 and 6 characters. The length is the only constraint applied to an Arabic word stem. This word stem is surrounded by prefixes, suffixes, proclitics and enclitics. Clitics are considered as independent tokens and are separated by the '@' sign, while prefixes and suffixes are considered as morpho-syntactic features and are interpreted with tags preceded by the '+' sign. Below we present the analysis of the unknown noun والمُسَوِّقونَ wa-Al-musaw~iquwna "and-the-marketers".

4.1 Evaluating Lemmatization MADA output: form:wAlmswqwn num:p gen:m per:na In the evaluation experiment we measure accuracy case:n asp:na mod:na vox:na pos:noun calculated as the number of correct tags divided by prc0:Al_det prc1:0 prc2:wa_conj the count of all tags. The baseline is given by the prc3:0 enc0:0 stt:d assumption that new words appear in their base form, i.e., we do not need to lemmatize them. The Finite-state guesser output: baseline accuracy is 45% as shown in Table 1. The Guess+masc+pl+nom@ POS tagging baseline proposes the most frequent+واﻟﻤﺴﻮقadj+ واﻟﻤﺴﻮﻗﻮن Guess+sg@ tag (proper name) for all unknown words. In our+واﻟﻤﺴﻮﻗﻮنadj+ واﻟﻤﺴﻮﻗﻮن Guess+masc+pl+nom@ test data this stands at 45%. We notice that MADA+واﻟﻤﺴﻮقnoun+ واﻟﻤﺴﻮﻗﻮن .(Guess+sg@ POS tagging accuracy is unexpectedly low (60%+واﻟﻤﺴﻮﻗﻮنnoun+ واﻟﻤﺴﻮﻗﻮن We use Voted POS Tagging, that is when a lemma ﻣﺴﻮقdefArt@+adj+ال@conj+و واﻟﻤﺴﻮﻗﻮن +Guess+masc+pl+nom@ gets a different POS tag with a higher frequency, .the new tag replaces the old low frequency tag ﻣﺴﻮﻗﻮنdefArt@+adj+ال@conj+و واﻟﻤﺴﻮﻗﻮن

This method has improved the tagging results significantly (69%).

                         Accuracy
1  POS Tagging baseline  45%
2  MADA POS tagging      60%
3  Voted POS Tagging     69%

Table 1. Evaluation of POS tagging

As for the lemmatization process itself, we notice that our experiment in the pipelined lemmatization approach gains a higher (54%) score than the baseline (45%), as shown in Table 2. This score significantly rises to 63% when the difference in the definite article 'Al' is ignored. The testing results indicate significant improvements over the baseline.

                                                   Lemmatization
1  Lemmas found among corpus forms                 64%
2  Lemmas found among FST guesser forms            97%
3  Lemma first-order baseline                      45%
4  Pipelined lemmatization (first-order decision)
   with strict definite article matching           54%
5  Pipelined lemmatization (first-order decision)
   ignoring definite article matching              63%

Table 2. Evaluation of lemmatization

4.2 Evaluating Lemma Weighting

In our data we have 40,277 unknown token types. After lemmatization they are reduced to 18,399 types (that is a 54% reduction of the surface forms) which are presumably ready for manual validation before being included in a lexicon. This number is still too big for manual inspection. In order to facilitate human revision, we devise a weighting algorithm for ranking so that the top n number of words will include the most lexicographically relevant words. We call surface forms that share the same lemma 'sister forms', and we call the lemma that they share the 'mother lemma'. This weighting algorithm is based on three criteria: frequency of the sister forms, number of sister forms, and a POS factor which penalizes proper nouns (due to their disproportionately high frequency). The parameters of the weighting algorithm have been tuned through several rounds of experimentation.

Word Weight = ((number of sister forms having the same mother lemma * 800)
              + cumulative sum of frequencies of sister forms
                having the same mother lemma) / 2
              + POS factor
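To make the formula concrete with invented numbers: a hypothetical lemma with 3 sister forms whose frequencies sum to 600 would receive ((3 × 800) + 600) / 2 = 1,500, plus the applicable POS factor.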

Good words                               In top 100   In bottom 100
relying on frequency alone (baseline)    63           50
relying on number of sister forms * 800  87           28
relying on POS factor                    58           30
using the combined criteria              78           15

Table 3. Evaluation of lemma weighting and ranking

Table 3 shows the evaluation of the weighting criteria. We notice that the combined criteria give the best balance between increasing the number of good words in the top 100 words and reducing the number of good words in the bottom 100 words.

5 Conclusion

We develop a methodology for automatically extracting unknown words in Arabic and lemmatizing them in order to relate multiple surface forms to their base underlying representation, using a finite-state guesser and a machine learning tool for disambiguation. We develop a weighting mechanism for simulating a human decision on whether or not to include the new words in a general-domain lexical database. We show the feasibility of a highly non-deterministic finite state guesser in an essential and practical application.

Out of a word list of 40,255 unknown words, we create a lexicon of 18,399 lemmatized, POS-tagged and weighted entries. We make our unknown word lexicon available as a free open-source resource2.

Acknowledgments

This research is funded by the UAE National Research Foundation (NRF) (Grant No. 0514/2011).

2 http://arabic-unknowns.sourceforge.net/

References

Adler, M., Goldberg, Y., Gabay, D. and Elhadad, M. 2008. Unsupervised Lexicon-Based Resolution of Unknown Words for Full Morphological Analysis. In: Proceedings of Association for Computational Linguistics (ACL), Columbus, Ohio.

Attia, M. 2006. An Ambiguity-Controlled Morphological Analyzer for Modern Standard Arabic Modelling Finite State Networks. In: Challenges of Arabic for NLP/MT Conference, The British Computer Society, London, UK.

Attia, Mohammed, Pavel Pecina, Lamia Tounsi, Antonio Toral, Josef van Genabith. 2011. An Open-Source Finite State Morphological Transducer for Modern Standard Arabic. International Workshop on Finite State Methods and Natural Language Processing (FSMNLP). Blois, France.

Beesley, K. R. 2001. Finite-State Morphological Analysis and Generation of Arabic at Xerox Research: Status and Plans in 2001. In: The ACL 2001 Workshop on Arabic Language Processing: Status and Prospects, Toulouse, France.

Beesley, K. R., and Karttunen, L. 2003. Finite State Morphology: CSLI studies in computational linguistics. Stanford, Calif.: CSLI.

Buckwalter, T. 2004. Buckwalter Arabic Morphological Analyzer (BAMA) Version 2.0. Linguistic Data Consortium (LDC) catalogue number LDC2004L02, ISBN 1-58563-324-0.

Dichy, J. 2001. On lemmatization in Arabic, A formal definition of the Arabic entries of multilingual lexical databases. ACL/EACL 2001 Workshop on Arabic Language Processing: Status and Prospects. Toulouse, France.

Dichy, J., and Farghaly, A. 2003. Roots & Patterns vs. Stems plus Grammar-Lexis Specifications: on what basis should a multilingual lexical database centred on Arabic be built? In: The MT-Summit IX workshop on Machine Translation for Semitic Languages, New Orleans.

Erjavec, T., and Džerosk, S. 2004. Machine Learning of Morphosyntactic Structure: Lemmatizing Unknown Slovene Words. Applied Artificial Intelligence, 18:17–41.

Kiraz, G. A. 2001. Computational Nonlinear Morphology: With Emphasis on Semitic Languages. Cambridge University Press.

Lindén, K. 2008. A Probabilistic Model for Guessing Base Forms of New Words by Analogy. In CICling-2008, 9th International Conference on Intelligent Text Processing and Computational Linguistics, Haifa, Israel, pp. 106–116.

Maamouri, M., Graff, D., Bouziri, B., Krouna, S., and Kulick, S. 2010. LDC Standard Arabic Morphological Analyzer (SAMA) v. 3.1. LDC Catalog No. LDC2010L01. ISBN: 1-58563-555-3.

Parker, R., Graff, D., Chen, K., Kong, J., and Maeda, K. 2009. Arabic Gigaword Fourth Edition. LDC Catalog No. LDC2009T30. ISBN: 1-58563-532-4.

Roth, R., Rambow, O., Habash, N., Diab, M., and Rudin, C. 2008. Arabic Morphological Tagging, Diacritization, and Lemmatization Using Lexeme Models and Feature Ranking. In: Proceedings of Association for Computational Linguistics (ACL), Columbus, Ohio.

Shaalan, K., Magdy, M., Fahmy, A. 2012. Morphological Analysis of Ill-formed Arabic Verbs for Second Language Learners. In Eds. McCarthy P., Boonthum, C., Applied Natural Language Processing: Identification, Investigation and Resolution, pp. 383–397, IGI Global, PA, USA.

Urdu – Roman Transliteration via Finite State Transducers

Tina Bögel
University of Konstanz
Konstanz, Germany
[email protected]

Abstract

This paper introduces a two-way Urdu–Roman transliterator based solely on a non-probabilistic finite state transducer that solves the encountered scriptural issues via a particular architectural design in combination with a set of restrictions. In order to deal with the enormous amount of overgeneration caused by inherent properties of the Urdu script, the transliterator depends on a set of phonological and orthographic restrictions and a word list; additionally, a default component is implemented to allow for unknown entities to be transliterated, thus ensuring a large degree of flexibility in addition to robustness.

1 Introduction

This paper introduces a way of transliterating Urdu and Roman via a non-probabilistic finite state transducer (TURF), thus allowing for easier machine processing.1 The TURF transliterator was originally designed for a grammar of Hindi/Urdu (Bögel et al., 2009), based on the grammar development platform XLE (Crouch et al., 2011). This grammar is written in Roman script to serve as a bridge/pivot language between the different scripts used by Urdu and Hindi. It is in principle able to parse input from both Hindi and Urdu and can generate output for both of these language varieties. In order to achieve this goal, transliterators converting the scripts of Urdu and Hindi, respectively, into the common Roman representation are of great importance.

The TURF system presented in this paper is concerned with the Urdu–Roman transliteration. It deals with the Urdu-specific orthographic issues by integrating certain restrictional components into the finite state transducer to cut down on overgeneration, while at the same time employing an architectural design that allows for a large degree of flexibility. The transliterator is based solely on a non-probabilistic finite state transducer implemented with the Xerox finite state technology (XFST) (Beesley and Karttunen, 2003), a robust and easy-to-use finite state tool.

This paper is organized as follows: In section 2, one of the (many) orthographic issues of Urdu is introduced. Section 3 contains a short review of earlier approaches. Section 4 gives a brief introduction into the transducer and the set of restrictions used to cut down on overgeneration. Following this is an account of the architectural design of the transliteration process (section 5). The last two sections provide a first evaluation of the TURF system and a final conclusion.

1 I would like to thank Tafseer Ahmed and Miriam Butt for their help with the content of this paper. This research was part of the Urdu ParGram project funded by the Deutsche Forschungsgemeinschaft.

2 Urdu script issues

Urdu is an Indo-Aryan language spoken mainly in Pakistan and India. It is written in a version of the Persian alphabet and includes a substantial amount of Persian and Arabic vocabulary. The direction of the script is from right to left and the shapes of most characters are context sensitive; i.e., depending on the position within the word, a character assumes a certain form.

Urdu has a set of diacritical marks which appear above or below a character defining a particular vowel, its absence or compound forms. In total, there are 15 of these diacritics (Malik, 2006, 13);

the four most frequent ones are shown in Table 1 in combination with the letter ب 'b'.

ب + diacritic   Name      Roman transliteration
بَ               Zabar     ba
بِ               Zer       bi
بُ               Pesh      bu
بّ               Tashdid   bb

Table 1: The four most frequently used diacritics

When transliterating from the Urdu script to another script, these diacritics present a huge problem because in standard Urdu texts, the diacritics are rarely used. Thus, for example, we generally are only confronted with the letter ب 'b' and have to guess at the pronunciation that was intended. Take, e.g., the following example, where the word کتا kuttA 'dog' is to be transliterated. Without diacritics, the word consists of three letters: k, t and A. If, in the case of transliteration, the system takes a guess at possible short vowels and geminated consonants, the output contains multiple possibilities ((1)).

(1) kuttA (and five further candidates with other guesses at the missing short vowels and geminations)

In addition to the correct transliteration kuttA, the transliterator proposes five other possibilities for the missing diacritics. These examples show that this property of the Urdu script makes it extremely difficult for any transliterator to correctly transliterate undiacriticized input without the help of word lists.

3 Earlier approaches

Earlier approaches to Urdu transliteration have almost always been concerned with the process of transliterating Urdu to Hindi or Hindi to Urdu (see, e.g., Lehal and Saini (2010) (Hindi → Urdu), Malik et al. (2009) (Urdu → Hindi), Malik et al. (2010) (Urdu → Roman) or Ahmed (2009) (Roman → Urdu)). An exception is Malik (2006), who explored the general idea of using finite state transducers and an intermediate/pivot language to deal with the issues of the scripts of Urdu and Hindi.

All of these approaches are highly dependent on word lists due to the properties of the Urdu script and the problems arising with the use of diacritics. Most systems dealing with undiacriticized input are faced with low accuracy rates: the original system of Malik (2006), e.g., drops from approximately 80% to 50% accuracy (cf. Malik et al. (2009, 178)) – others have higher accuracy rates at the cost of being unidirectional. While Malik et al. (2009) have claimed that the non-probabilistic finite state model is not able to handle the orthographic issues of Urdu in a satisfying way, this paper shows that there are possibilities for allowing a high accuracy of interpretation, even if the input text does not include diacritics.

4 The TURF Transliterator

The TURF transliterator has been implemented as a non-probabilistic finite state transducer compiled with the lexc language (Lexicon Compiler), which is explicitly designed to build finite state networks and analyzers (Beesley and Karttunen, 2003, 203). The resulting network is completely compatible with one that was written with, e.g., regular expressions, but has the advantage that it is easily readable. The transliteration scheme used here was developed by Malik et al. (2010), following Glassman (1986).

As has been shown in section 1, Urdu transliteration with simple character-to-character mapping is not sufficient. A default integration of short vowels and geminated consonants will, on the other hand, cause significant overgeneration. In order to reduce this overgeneration and to keep the transliterator as efficient as possible, the current approach integrates several layers of restrictions.

4.1 The word list

When dealing with Urdu transliteration it is not possible to not work with a word list in order to exclude a large proportion of the overgenerated output. In contrast to other approaches, which depend on Hindi or Urdu wordlists, TURF works with a Roman wordlist. This wordlist is derived from an XFST finite state morphology (Bögel et al., 2007) independently created as part of the Urdu ParGram development effort for the Roman intermediate language (Bögel et al., 2009).

4.2 Regular expression filters

The regular expression filters are based on knowledge about the phonotactics of the language and are a powerful tool for reducing the number of possibilities proposed by the transliterator. As a concrete example, consider the filter in (2).

(2) ~[ [ y A [a|i|u] ]]

In Urdu a combination of [ y A short vowel ] is not allowed (~). A filter like in (2) can thus be used to disallow any generations that match this sequence.
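In deployment, such a filter is simply composed onto the output side of the transliteration network; a sketch, where Translit is a hypothetical stand-in for the mapping network and ~$[...] is the equivalent "contains no ..." formulation of the filter:

define YAFilter ~$[y A [a|i|u]];           ! no y-A-short-vowel sequence anywhere
define Restricted Translit .o. YAFilter;   ! outputs violating the filter are removed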

4.3 Flag diacritics

The XFST software also provides the user with a method to store 'memory' within a finite state network (cf. Beesley and Karttunen (2003, 339)). These so-called flag diacritics enable the user to enforce desired constraints within a network, keeping the transducers relatively small and simple by removing illegal paths and thus reducing the number of possible analyses.

5 The overall TURF architecture

However, the finite state transducer should also be able to deal with unknown items; thus, the constraints on transliteration should not be too restrictive, but should allow for a default transliteration as well. Word lists in general have the drawback that a matching of a finite state transducer output against a word list will delete any entities not on the word list. This means that a methodology needs to be found to deal with unknown but legitimate words without involving any further (non-finite state) software. Figure 1 shows the general architecture to achieve this goal. For illustrative purposes two words are transliterated: کتاب kitAb 'book' and کت, which transliterates to an unknown word kt, potentially having the surface forms kut, kat or kit.

Figure 1: Transliteration of کتاب and کت

5.1 Step 1: Transliteration Part 1

The finite state transducer itself consists of a network containing the Roman–Urdu character mapping with the possible paths regulated via flag diacritics. Apart from these regular mappings, the network also contains a default Urdu and a default Roman component where the respective characters are simply matched against themselves (e.g. k:k, r:r). On top of this network, the regular expression filters provide further restrictions for the output form. The Urdu script default 1-1 mappings are marked with a special identification tag ([+Uscript]) for later processing purposes. Thus, an Urdu script word will not only be transliterated into the corresponding Roman script, but will also be 'transliterated' into itself plus an identificational tag. The output of the basic transliterator shows part of the vast overgeneration caused by the underspecified nature of the script, even though the restricting filters and flags are compiled into this component.

5.2 Step 2: Word list matching and tag deletion

In step 2, the output is matched against a Roman word list. In case there is a match, the respective word is tagged [+match]. After this process, a

filter is applied, erasing all output forms that contain neither a [+match] nor a [Uscript+] tag. This way we are left with two choices for the word کتاب — one transliterated 'matched' form and one default Urdu form — while the word کت is left with only the default Urdu form.

5.3 Step 3: Distinguishing unknown and overgenerated entities

The Urdu word list applied in step 3 is a transliteration of the original Roman word list (4.1), which was transliterated via the TURF system. Thus, the Urdu word list is a mirror image of the Roman word list. During this step, the Urdu script words are matched against the Urdu word list, this time deleting all the words that find a match. As was to be expected from matching against a mirror word list of the original Roman word list, all of the words that found a match in the Roman word list will also find a match in the Urdu word list, while all unknown entities fail to match. As a result, any Urdu script version of an already correctly transliterated word is deleted, while the Urdu script unknown entity is kept for further processing – the system has now effectively separated known from unknown entities.

In a further step, the tags of the remaining entities are deleted, which leaves us with the correct transliteration of the known word kitAb and the unknown Urdu script word کت.

5.4 Step 4: Transliteration Part 2

The remaining words are once again sent into the finite state transducer of step 1. The Roman transliteration kitAb passes unhindered through the Default Roman part. The Urdu word on the other hand is transliterated to all possible forms (in this case three) within the range of the restrictions applied by flags and filters.

5.5 Step 5: Final adjustments

Up to now, the transliterator is only applicable to single words. With a simple (recursive) regular expression it can be designed to apply to larger strings containing more than one word, as sketched below. The output can then be easily composed with a standard tokenizer (e.g. Kaplan (2005)) to enable smooth machine processing.
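A sketch of that extension in regular expression terms; Word stands in for the single-word transliterator and the space separator is an assumption:

define Word ?+ ;                 ! stand-in for the single-word transliterator
define Text Word [" " Word]* ;   ! apply the transliterator word by word across a string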
6 Evaluation

A first evaluation of the TURF transliterator with unseen texts resulted in an accuracy of 86% if the input was not diacriticized. The accuracy rate for undiacriticized text always depends on the size of the word list. The word list used in this application is currently being extended from formerly 20,000 to 40,000 words; thus, a significant improvement of the accuracy rate can be expected within the next few months.

If the optional inclusion of short vowels is removed from the network, the accuracy rate for diacriticized input is close to 97%.

When transliterating from Roman to Urdu, the accuracy rate is close to 100%, iff the Roman script is written according to the transliteration scheme proposed by Malik et al. (2010).

Transliteration   U→R              U→R             R→U
Input             diacritics       no diacritics   —
Diacritics        opt. / compuls.  optional        —
Accuracy          86% / 97%        ~86%            100%

Table 2: Accuracy rates of the TURF transliterator

7 Conclusion

This paper has introduced a finite state transducer for Urdu ↔ Roman transliteration. Furthermore, this paper has shown that it is possible for applications based only on non-probabilistic finite state technology to return output with a high, state-of-the-art accuracy rate; as a consequence, the application profits from the inherently fast and small nature of finite state transducers.

While the transliteration from Roman to Urdu is basically a simple character-to-character mapping, the transliteration from Urdu to Roman causes a substantial amount of overgeneration due to the underspecified nature of the Urdu script. This was solved by applying different layers of restrictions. The specific architectural design enables TURF to distinguish between unknown-to-the-word-list and overgenerated items; thus, when matched against a word list, unknown items are not deleted along with the overgenerated items, but are transliterated along with the known items. As a consequence, a transliteration is always given, resulting in an efficient, highly accurate and robust system.

References

Tafseer Ahmed. 2009. Roman to Urdu transliteration using wordlist. In Proceedings of the Conference on Language and Technology 2009 (CLT09), CRULP, Lahore.

Kenneth R. Beesley and Lauri Karttunen. 2003. Finite State Morphology. CSLI Publications, Stanford, CA.

Tina Bögel, Miriam Butt, Annette Hautli, and Sebastian Sulger. 2007. Developing a finite-state morphological analyzer for Urdu and Hindi. In T. Hanneforth and K. M. Würzner, editors, Proceedings of the Sixth International Workshop on Finite-State Methods and Natural Language Processing, pages 86–96, Potsdam. Potsdam University Press.

Tina Bögel, Miriam Butt, Annette Hautli, and Sebastian Sulger. 2009. Urdu and the modular architecture of ParGram. In Proceedings of the Conference on Language and Technology 2009 (CLT09), CRULP, Lahore.

Dick Crouch, Mary Dalrymple, Ron Kaplan, Tracy King, John Maxwell, and Paula Newman. 2011. XLE Documentation. Palo Alto Research Center, Palo Alto, CA. URL: http://www2.parc.com/isl/groups/nltt/xle/doc/xle_toc.html.

Eugene H. Glassman. 1986. Spoken Urdu. Nirali Kitaben Publishing House, Lahore, 6th edition.

Ronald M. Kaplan. 2005. A method for tokenizing text. In Festschrift in Honor of Kimmo Koskenniemi's 60th Anniversary. CSLI Publications, Stanford, CA.

Gurpreet S. Lehal and Tejinder S. Saini. 2010. A Hindi to Urdu transliteration system. In Proceedings of ICON-2010: 8th International Conference on Natural Language Processing, Kharagpur.

Abbas Malik, Laurent Besacier, Christian Boitet, and Pushpak Bhattacharyya. 2009. A hybrid model for Urdu Hindi transliteration. In Proceedings of the 2009 Named Entities Workshop, ACL-IJCNLP, pages 177–185, Suntec, Singapore.

Muhammad Kamran Malik, Tafseer Ahmed, Sebastian Sulger, Tina Bögel, Atif Gulzar, Ghulam Raza, Sarmad Hussain, and Miriam Butt. 2010. Transliterating Urdu for a broad-coverage Urdu/Hindi LFG grammar. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC 2010). European Language Resources Association (ELRA).

Abbas Malik. 2006. Hindi Urdu machine transliteration system. Master's thesis, University of Paris.

Integrating Aspectually Relevant Properties of Verbs into a

Morphological Analyzer for English

Katina Bontcheva
Heinrich-Heine-University Düsseldorf
[email protected]

Abstract

The integration of semantic properties into morphological analyzers can significantly enhance the performance of any tool that uses their output as input, e.g., for derivation or for syntactic parsing. In this paper I present my approach to the integration of aspectually relevant properties of verbs into a morphological analyzer for English.

1 Introduction

Heid, Radtke and Klosa (2012) have recently surveyed morphological analyzers and interactive online dictionaries for German and French. They have established that most of them do not utilize semantic properties. The integration of semantic properties into morphological analyzers can significantly enhance the performance of any tool that uses their output as input, e.g., for derivation or for syntactic parsing. In this paper I present my approach to the integration of aspectually relevant properties of verbs into a morphological analyzer for English.

In section 2 I will describe a prototypical finite-state morphological analyzer for English that doesn't utilize semantic properties. Some classifications of English verbs with respect to the aspectually relevant properties that they lexicalize will be outlined in section 3. In section 4 I will present my approach to the integration of the semantic classes in the lexicon. I will describe the modified morphological analyzer for English in section 5 and point out in section 6 the challenges that inflectionally-rich languages present to the techniques outlined in section 4. Finally, in section 7 I will draw some conclusions and outline future work on other languages.

2 A Prototypical Finite-State Morphological Analyzer for English

English is an inflectionally-poor language which for this reason has been chosen to illustrate my approach to the integration of grammatically relevant lexicalized meaning into morphological analyzers. It has a finite number of irregular (strong) verbs. The rest of the verbs are regular and constitute a single inflectional class.

This prototypical morphological analyzer for English has parallel implementations in xfst (cf. Beesley and Karttunen (2003)) and foma (cf. Hulden (2009a) and (2009b)). It consists of a lexicon that describes the morphotactics of the language, and of phonological and orthographical alternations and realizational rules that are handled by finite-state replace rules elsewhere. The bases of the regular verbs are stored in a single text file. Here is an excerpt from the lexc lexicon without semantic features:

LEXICON Root
   Verb ;
   …
LEXICON Verb
   ^VREG   VerbReg ;
   …
LEXICON VerbReg
   +V:0    VerbRegFlex ;
   …
! This lexicon contains the morphotactic rules.

LEXICON VerbRegFlex
   < ["+Pres"] ["+3P"] ["+Sg"] >   # ;
   < ["+Pres"] ["+Non3PSg"] >      # ;
   < ["+Past"] >                   # ;
   < ["+PrPart"|"+PaPart"] >       # ;
   < ["+Inf"] >                    # ;
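To make the behavior of this lexicon concrete, here is a toy Python stand-in that enumerates the (analysis, surface) pairs licensed by LEXICON VerbRegFlex for a regular base. It is only illustrative and deliberately omits the orthographic alternations (e-deletion, consonant doubling, etc.) that the real replace rules handle.

# Toy stand-in for the lexc fragment above; the CELLS table mirrors the
# paradigm cells of LEXICON VerbRegFlex.
CELLS = [
    ("+V+Pres+3P+Sg", "s"),
    ("+V+Pres+Non3PSg", ""),
    ("+V+Past", "ed"),
    ("+V+PaPart", "ed"),
    ("+V+PrPart", "ing"),
    ("+V+Inf", ""),
]

def paradigm(base):
    """Return (analysis, surface) pairs for a regular verb base."""
    return [(base + tags, base + suffix) for tags, suffix in CELLS]

for analysis, surface in paradigm("walk"):
    print(analysis, "->", surface)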


3 Aspectually Relevant Properties of Verbs

The information that is provided by the prototypical analyzer described above contains the lemma, W(ord)-features (morphosyntactic features that exhibit different specifications in different cells of the same inflectional paradigm) and L(exeme)-features that "specify a lexeme's invariant morphosyntactic properties" (e.g., gender of nouns, cf. Stump (2001), p. 137, emphasis mine).

L-features should not be confused with lexicalized meaning. I adopt the definition in Rappaport Hovav and Levin (2010), p. 23: "In order to distinguish lexicalized meaning from inferences derived from particular uses of verbs in sentences, we take lexicalized meaning to be those components of meaning that are entailed in all uses of (a single sense of) a verb, regardless of context" (emphasis mine). Obviously, this definition is applicable not only to verbs but to all word classes. However, in this paper I will limit myself to the description of lexicalized aspectually relevant properties of verbs.

3.1 Vendler's Classification

In his famous paper "Verbs and Times", Vendler (1957) introduced his "time schemata presupposed by various verbs" (ibid.). He proposes four time schemata: states, activities, accomplishments and achievements.

It is important to point out from the beginning that although he didn't declare explicitly that he was classifying VPs, he did imply this: "Obviously these differences cannot be explained in terms of time alone: other factors, like the presence or absence of an object, conditions, intended state of affairs, also enter the picture." (ibid., p. 143, emphasis mine).

The properties that are often used to define Vendler's classes are dynamicity, duration and telicity. States are non-dynamic, achievements are non-durative. States and activities are inherently unbounded (non-telic); accomplishments and achievements are inherently bounded. Since three features are needed to differentiate between only four classes that cannot be represented as, e.g., a right-branching tree, one wonders if these are the right features to be used for the classification.

Vendler's classification was widely accepted and is used in most current studies on aspect. However, Vendlerian classes cannot be implemented in a lexc lexicon for the following reasons:

• Vendler does not classify verbs but VPs.
• Part of the features used to differentiate between the classes are not lexicalized by the verb but can be determined at the VP level.
• This classification allows multiple class membership even for the same word sense. Thus run can be activity and accomplishment, cf. above running/running a mile.

3.2 Levin and Rappaport Hovav's Approach to English Verb Classes

Sets of semantically related verbs that share a range of linguistic properties form verb classes. There are different criteria for grouping and granularity; e.g., Levin (1993) classifies the verbs in two ways: a) according to semantic content, with 48 broad classes and 192 smaller classes; b) according to their participation in argument alternations, with 79 alternations. The account of Beth Levin and Malka Rappaport Hovav of verb classes developed over the years in a steady and consistent way that can be traced in the following publications: (Levin 1993; Levin and Rappaport Hovav 1991, 1995, 2005; Rappaport Hovav 2008; Rappaport Hovav and Levin 1998, 2001, 2005, 2010), among others.

Here I will just summarize the most important ideas and implications for the non-stative verbs:

• Dynamic verbs either lexicalize scales (scalar verbs) or do not (non-scalar verbs).
• Non-scalar verbs lexicalize manner.
• Scalar verbs lexicalize result.
• Scalar verbs lexicalize two major types of scales – multi-point scales and two-point scales.
• The chosen aspectually relevant properties are complementary.
• All lexical distinctions described here have grammatical consequences which are relevant to aspectual composition.

This interpretation of non-stative verbs has some very attractive properties:

• The verbs fall into disjunctive classes. There is no multiple class membership (for the same word sense).
• The aspectual properties are lexicalized exclusively by the verb and are not computed at the VP level.

• The lexicalized aspectual properties constrain the syntactical behavior of the verb.
• Manner verbs in English show a uniform argument-realization pattern: they can appear with unspecified and non-subcategorized objects.
• Result verbs are more constrained and less uniform in their argument realization patterns. Transitivity (in contrast to the manner verbs) is an issue.

4 Intersection of Semantic Classes and Inflectional Classes

The main difficulties here arise from the fact that the set of bases that belong to one inflectional class of verbs usually is not identical with the set of bases that lexicalize a particular aspectually relevant property. As a rule, it has intersections with more than one semantic class. The situation is relatively manageable in inflectionally-poor languages like English but becomes very complicated in inflectionally-rich languages.

The distribution of verbs in inflectional classes is in general complementary. There are some exceptions that will not be discussed here.

Vendler's approach to verb classification described in 3.1 has the undesirable property that most of the verbs have multiple class membership, while the approach of Levin and Rappaport Hovav described in 3.2 has advantages which make the task easier.

Thus, for English we have the set of bases of regular verbs that is monolithic, and the same set of bases but this time divided into complementary subsets of aspectual semantic classes in the sense of Levin and Rappaport Hovav. The cross product of the number of subsets in the first set and the number of subsets in the second set equals the number of aspectual semantic classes, since there is only one inflectional class of regular verbs.

5 The modified Prototypical Lexicon for English

The following modifications need to be introduced to the lexicon in order to incorporate the aspectual properties of English verbs.

The single placeholder pointing to the single file containing the bases of regular verbs must be replaced with several placeholders that point to the files containing the complementary subsets of bases of verbs belonging to the different aspectual classes.

New continuation lexicons introducing each aspectual class must be added immediately after LEXICON Verb. Since the union of the sets of aspectual-class bases of regular verbs is identical with the set of the bases of the regular verbs, all aspectual-class lexicons have the same continuation lexicon: LEXICON VerbRegFlex. Irregular verbs get the semantic tags added to the lexical entry, and suppletive verbs get them in the master lexicon.

Multichar_Symbols
   +V +VIrrTT …

LEXICON Root
   Verb ;
   VerbSuppl ;
   …

LEXICON VerbSuppl
   go%+V+Inf:go # ;
   go%+V+Pres+3P+Sg:goes # ;
   go%+V+Pres+Non3PSg:go # ;
   go%+V+Past:went # ;
   go%+V+PaPart:gone # ;
   go%+V+PrPart:going # ;
   …

LEXICON Verb
   ^VREGM VerbRegManner ;
   …

LEXICON VerbRegManner
   +V%<manner%>:0 VerbRegFlex ;

LEXICON VerbRegFlex
   …

Below is an excerpt from the file holding the bases of irregular verbs that build identical past-tense and perfect-participle forms by adding '-t':

…
{creep}:{creep} |
{feel} |
{keep} |
{sleep} |
{sweep}:{sweep} |
…
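Preparing the per-class base files is ordinary preprocessing. A hypothetical Python sketch is given below; the file names and the CLASS_OF classification table are illustrative assumptions, not part of the analyzer.

# Hypothetical preprocessing: split the single file of regular-verb bases
# into complementary per-class files (manner, result, ...). The union of
# the resulting subsets is identical with the original set of bases.

CLASS_OF = {"sweep": "manner", "wipe": "manner", "break": "result"}

def split_bases(path="verb-bases.txt"):
    handles = {}
    with open(path) as src:
        for line in src:
            base = line.strip()
            if not base:
                continue
            cls = CLASS_OF.get(base, "unclassified")
            if cls not in handles:
                handles[cls] = open("bases-%s.txt" % cls, "w")
            handles[cls].write(base + "\n")
    for h in handles.values():
        h.close()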

In order to be able to rewrite the semantic-class tags, which appear only on the lexical (upper) side of the transducer containing the lexicon, I invert the network, apply the semantic-tag rewriting rules and invert the resulting net again. The network is then composed with the realization rules and the phonological and orthographical alternations that operate on the surface (lower) side of the transducer:

! Semantic-features tag-rewriting
define LEX2 [LEX1.i] ;
define Mnr [ %< m a n n e r %> ->
    %<%+SV%>%<%+SVO%>%<%+SVOOC%> ] ;

! alternative RRG tags
!define Mnr [%< m a n n e r %> ->
!    … ] ;

define LEX3 [LEX2 .o. Mnr] ;
define LEX [LEX3.i] ;

! Inflectional morphology: realization
…

Here is the output of the analysis of 'swept' with dependency-grammar valency-pattern tags (S=subject, V=verb, O=object, OC=object complement):

swept
sweep<+SV><+SVO><+SVOOC>+V+Past
sweep<+SV><+SVO><+SVOOC>+V+PaPart

and the alternative output with Role and Reference Grammar logical structures:

swept
sweep<…>+V+Past
sweep<…>+V+PaPart

Valency information is necessary for syntactic parsing and has been used in Constraint Grammar shallow parsers and in dependency parsers. The advantage of this approach over already existing morphological analyzers for English is that the valency-pattern tags are added to classes of verbs rather than to individual lexical entries. The ability to provide alternative outputs for the integrated aspectually relevant semantic information is a novelty of this morphological analyzer.

6 Beyond English: the Challenges of Inflectionally-Rich Languages

We have seen a simplified example that shows the modeling and the implementation of a morphological analyzer that utilizes semantic-class tags for aspectually relevant lexical properties of English verbs. Things become much more challenging if we want to model inflectionally-rich languages such as Bulgarian, Russian or Finnish. Bulgarian verbs, for example, can be divided (depending on the modeling) into some 15 complementary inflectional classes. This number multiplied by 4 Levin-Rappaport-Hovav classes would result in some 60 sets of verb bases that share the same inflectional class and Levin-Rappaport-Hovav class. If a finer-grained semantic classification is adopted, this will lead to a lexicon that exclusively requires manual lexicographical work.

7 Conclusion

This paper illustrates the integration of aspectually relevant properties of verbs into a morphological analyzer for English. I showed that these features can be integrated while the computational efficiency of the analyzer can still be maintained, if the linguistic modelling is adequate. However, this only scratches the surface of the challenge of integrating semantic features into morphological analyzers. In the future, it is planned (together with other researchers) to extend the integration of semantic features to nouns, adjectives and adverbs. We also plan to model and implement morphological analyzers for other languages such as German, Russian, Polish and Bulgarian.

References

Kenneth R. Beesley and Lauri Karttunen. 2003. Finite State Morphology. Palo Alto, CA: CSLI Publications.

David Dowty. 1979. Word Meaning and Montague Grammar. Dordrecht: Reidel.

William Foley and Robert Van Valin, Jr. 1984. Functional Syntax and Universal Grammar. Cambridge: Cambridge University Press.

Ulrich Heid, Janina Radtke and Anette Klosa. 2012. Morphology and Semantics: A Survey of Morphological Analyzers and Interactive Online Dictionaries – and Proposals for their Improvement. 15th International Morphology Meeting: Morphology and Meaning. Vienna, February 9-12, 2012.

Mans Hulden. 2009a. Finite-State Machine Construction Methods and Algorithms for Phonology and Morphology. PhD Thesis, University of Arizona.

Mans Hulden. 2009b. Foma: a Finite-State Compiler and Library. In: Proceedings of the EACL 2009 Demonstrations Session, pp. 29-32.

Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. Chicago, IL: University of Chicago Press.

Beth Levin and Malka Rappaport Hovav. 1991. Wiping the Slate Clean: A Lexical Semantic Exploration. In: B. Levin and S. Pinker (eds.), Special Issue on Lexical and Conceptual Semantics. Cognition 41, pp. 123-151.

Beth Levin and Malka Rappaport Hovav. 1995. Unaccusativity: At the Syntax-Lexical Semantics Interface. Cambridge, MA: MIT Press.

Beth Levin and Malka Rappaport Hovav. 2005. Argument Realization. Cambridge, UK: Cambridge University Press.

Malka Rappaport Hovav. 2008. Lexicalized Meaning and the Internal Temporal Structure of Events. In: S. Rothstein (ed.), Theoretical and Crosslinguistic Approaches to the Semantics of Aspect. Amsterdam: John Benjamins, pp. 13-42.

Malka Rappaport Hovav and Beth Levin. 1998. Building Verb Meanings. In: M. Butt and W. Geuder (eds.), The Projection of Arguments: Lexical and Compositional Factors. Stanford, CA: CSLI Publications, pp. 97-134.

Malka Rappaport Hovav and Beth Levin. 2001. An Event Structure Account of English Resultatives. Language 77, pp. 766-797.

Malka Rappaport Hovav and Beth Levin. 2005. Change of State Verbs: Implications for Theories of Argument Projection. In: N. Erteschik-Shir and T. Rapoport (eds.), The Syntax of Aspect. Oxford: Oxford University Press, pp. 274-286.

Malka Rappaport Hovav and Beth Levin. 2010. Reflections on Manner/Result Complementarity. In: M. Rappaport Hovav, E. Doron, and I. Sichel (eds.), Syntax, Lexical Semantics, and Event Structure. Oxford: Oxford University Press, pp. 21-38.

Gregory Stump. 2001. Inflectional Morphology: A Theory of Paradigm Structure. Cambridge: Cambridge University Press.

Zeno Vendler. 1957. Verbs and Times. The Philosophical Review, Vol. 66, No. 2, pp. 143-160.

Finite-state technology in a verse-making tool

Manex Agirrezabal, Iñaki Alegria, Bertol Arrieta
University of the Basque Country (UPV/EHU)
[email protected], [email protected], [email protected]

Mans Hulden
Ikerbasque (Basque Science Foundation)
[email protected]

Abstract

This paper presents a set of tools designed to assist traditional Basque verse writers during the composition process. In this article we are going to focus on the parts that have been created using finite-state technology: this includes tools such as syllable counters, rhyme checkers and a rhyme search utility.

1 The BAD tool and the Basque singing tradition

The BAD tool is an assistant tool for verse-makers in the Basque bertsolari tradition. This is a form of improvised verse composition and singing where participants are asked to produce impromptu compositions around themes which are given to them, following one of many alternative verse formats. The variety of verse schemata that exist all impose fairly strict structural requirements on the composer. Verses in the bertsolari tradition must consist of a specified number of lines, each with a fixed number of syllables. Also, strict rhyme patterns must be followed. The structural requirements are considered the most difficult element in the bertsolaritza—however, well-trained bertsolaris can usually produce verses that fulfill the structural prerequisites in a very limited time.

The BAD tool presented here is mainly directed at those with less experience in the tradition, such as students. One particular target group are the bertso-eskola-s (verse-making schools) that have been growing in popularity—these are schools found throughout the Basque Country that train young people in the art of bertsolaritza.

The primary functionality of the tool is illustrated in figure 1, which shows the main view of the utility. The user is offered a form in which a verse can be written, after which the system checks the technical correctness of the poem. To perform this task, several finite state transducer-based modules are used, some of them involving the metrics (syllable counter) of the verse, and others the rhyme (rhyme searcher and checker). The tool has support for 150 well-known verse meters.

In the following sections, we will outline the technology used in each of the parts of the system.

2 Related work

Much of the existing technology for Basque morphology and phonology uses finite-state technology, including earlier work on rhyme patterns (Arrieta et al., 2001). In our work, we have used the Basque morphological description (Alegria et al., 1996) in the rhyme search module. Arrieta et al. (2001) develop a system where, among other things, users can search for words that rhyme with an introduced pattern. It is implemented in the formalism of two-level morphology (Koskenniemi, 1983) and compiled into finite-state transducers.

We have used the open-source foma finite-state compiler to develop all the finite-state based parts of our tool.¹ After compiling the transducers, we use them in our own application through the C/C++ API provided with foma.

3 Syllable counter

As mentioned, each line in a verse must contain a specified number of syllables. The syllable counter module that checks whether this is the case consists of a submodule that performs the syllabification itself, as well as a module that yields variants produced by optional apocope and syncope effects. For the syllabification itself, we use the approach described in Hulden (2006), with some modifications to capture Basque phonology.

¹ In our examples, FST expressions are written using foma syntax. For details, visit http://foma.googlecode.com


Figure 1: A verse written in the BAD web application.

3.1 Syllabification

Basque syllables can be modeled by assuming a maximum onset principle together with a sonority hierarchy where obstruents are the least sonorous element, followed in sonority by the liquids, the nasals and the glides. The syllable nuclei are always a single vowel (a,e,i,o,u), a combination of a low vowel (a,e,o) and a high vowel (i,u), or a high vowel and another high vowel.

The syllabifier relies on a chain of composed replacement rules (Beesley and Karttunen, 2003) compiled into finite-state transducers. These definitions are shown in figure 2. The overall strategy is to first mark off the nuclei in a word by the rule MarkNuclei, which takes advantage of a left-to-right longest replacement rule. This is to ensure that diphthongs do not get split into separate syllables by the subsequent syllabification process. Following this, syllables are marked off by the markSyll rule, which inserts periods after legitimate syllables. This rule takes advantage of the shortest-leftmost replacement strategy—in effect minimizing the coda and maximizing the size of the onset of a syllable to the extent permitted by the allowed onsets and codas, defined in Onset and Coda, respectively.

define Obs [f|h|j|k|p|s|t|t s|t z|t x|x|
            z|b|d|g|v|d d|t t];
define LiqNasGli [l|r|r r|y|n|m];
define LowV [a|e|o];
define HighV [i|u];
define V LowV | HighV;
define Nucleus [V | LowV HighV |
    [HighV HighV - [i i] - [u u]]];
define Onset (Obs) (LiqNasGli);
define Coda C^<4;

define MarkNuclei Nucleus @-> %{ ... %};
define Syll Onset %{ Nucleus %} Coda;
define markSyll Syll @> ... "." || _ Syll ;
define cleanUp %{|%} -> 0;
regex MarkNuclei .o. markSyll .o. cleanUp;

Figure 2: Syllable definition

To illustrate this process, suppose that we are syllabifying the Basque word intransitiboa. The first step in the syllabification process is to mark the nuclei in the word, resulting in {i}ntr{a}ns{i}t{i}b{o}{a}. In the more complex syllabification step, the markSyll rule assures that the juncture ntr gets divided as n.tr, because nt.r would produce a non-maximal onset, and i.ntr would in turn produce an illegal onset in the second syllable. The final syllabification, after markup removal by the cleanUp rule, is then in.tran.si.ti.bo.a. This process is illustrated in figure 3.

In bertsolaritza, Basque verse-makers follow this type of syllable counting in the majority of cases; however, there is some flexibility as regards the syllabification process. For example, suppose that the phrase ta lehenengo urtian needs to fit a line which must contain six syllables. If we count the syllables using the algorithm shown above, we receive a count of eight (ta le.hen.en.go ur.ti.an). However, in the word lehenengo we can identify the syncope pattern vowel-h-vowel, with the two vowels being identical. In such cases, we may simply replace the entire sequence by a single vowel (ehe → e). This is phonetically equivalent to shortening the ehe-sequence (for those dialects where the orthographical h is silent).
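A rough Python analogue of this mark-nuclei-then-split strategy may help make the basic procedure concrete. It is an illustrative approximation of the foma grammar in figure 2, under stated simplifications: single-letter segments only (the digraph consonants ts, tz, tx, dd, tt, rr are ignored), and no syllabification variants.

import re

OBS = set("fhjkpstxzbdgv")      # obstruents (digraphs ignored)
LNG = set("lrynm")              # liquids, nasals, glides
VOW = set("aeiou")
LOW, HIGH = set("aeo"), set("iu")

def split_nuclei(vrun):
    """Greedy longest-match nuclei: a single vowel, low+high, or
    high+high (except ii and uu), mirroring the Nucleus definition."""
    out, i = [], 0
    while i < len(vrun):
        pair = vrun[i:i + 2]
        if len(pair) == 2 and (
                (pair[0] in LOW and pair[1] in HIGH) or
                (pair[0] in HIGH and pair[1] in HIGH and pair[0] != pair[1])):
            out.append(pair); i += 2
        else:
            out.append(vrun[i]); i += 1
    return out

def legal_onset(c):
    return (c == "" or (len(c) == 1 and (c in OBS or c in LNG))
            or (len(c) == 2 and c[0] in OBS and c[1] in LNG))

def syllabify(word):
    runs = re.findall(r"[aeiou]+|[^aeiou]+", word.lower())
    syllables, onset = [], ""
    for idx, run in enumerate(runs):
        if run[0] not in VOW:
            if idx == 0:        # word-initial consonants: onset of syllable 1
                onset = run
            continue
        nuclei = split_nuclei(run)
        coda = next_onset = ""
        if idx + 1 < len(runs):                  # following consonant cluster
            cluster = runs[idx + 1]
            if idx + 2 < len(runs):              # intervocalic: maximize onset
                k = max(j for j in range(min(2, len(cluster)) + 1)
                        if legal_onset(cluster[len(cluster) - j:] if j else ""))
                coda = cluster[:len(cluster) - k]
                next_onset = cluster[len(cluster) - k:]
            else:                                # word-final cluster: all coda
                coda = cluster
        for n, nucleus in enumerate(nuclei):
            syl = (onset if n == 0 else "") + nucleus
            if n == len(nuclei) - 1:
                syl += coda
            syllables.append(syl)
        onset = next_onset
    return ".".join(syllables)

print(syllabify("intransitiboa"))   # -> in.tran.si.ti.bo.a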

With this modification, we can fit the line in a 7 syllable structure. We can, however, further reduce the line to 6 syllables by a second type of process that merges the last syllable of one word with the first of the next one and then resyllabifies. Hence, ta lehenengo urtian, using the modifications explained above, could be reduced to ta.le.nen.gour.ti.an, which would fit the 6 syllable structure. This production of syllabification variants is shown in figure 4.

transformazioei
   | markNuclei
tr{a}nsf{o}rm{a}z{i}{o}{ei}
   | syllabify
tr{a}ns.f{o}r.m{a}.z{i}.{o}.{ei}
   | cleanUp
trans.for.ma.zi.o.ei

Figure 3: Normal syllabification.

lehentasun                    etxera etorri
   | syllabification             | syllabification
le.hen.ta.sun                 e.txe.ra e.to.rri
   | alternates                  | alternates
le.hen.ta.sun  len.ta.sun     e.txe.ra e.to.rri  e.txe.rae.to.rri

Figure 4: Flexible syllabification.

4 Finite-state technology for rhymes

4.1 Basque rhyme patterns and rules

Similar to the flexibility in syllabification, Basque rhyme schemes also allow for a certain amount of leeway that bertsolaris can take advantage of. The widely consulted rhyming dictionary Hiztegi Errimatua (Amuriza, 1981) documents a number of phonological alternations that are acceptable as off-rhymes: for example, the stops p, t, and k are often interchangeable, as are some other phonological groups. Figure 5 illustrates the definitions for interchangeable phonemes when rhyming. The interchangeability is handled as a prelude to rhyme checking, whereby phonemes in certain groups, such as p, are replaced by an abstract symbol denoting the group (e.g. PTK).

define plosvl [p | t | k];
define rplosv [b | d | g | r];
define sib [s | z | x];
define nas [n | m];

define plosvlconv plosvl -> PTK;
define rplosvconv rplosv -> BDGR;
define sibconv sib -> SZX;
define nasconv nas -> NM;
define phoRules plosvlconv .o. rplosvconv .o.
                sibconv .o. nasconv ;

Figure 5: Conflation of consonant groups before rhyme checking.

4.2 Rhyme checker

The rhyme checker itself in BAD was originally developed as a php-script, and then reimplemented as a purely finite-state system. In this section we will focus on the finite-state based one.

While the php version takes advantage of syllabification, the one developed with transducers does not. Instead, it relies on a series of replacement rules and the special eq() operator available in foma. An implementation of this is given in figure 6. As input to the system, the two words to be checked are assumed to be provided one after the other, joined by a hyphen. Then, the system (by rule rhympat1) identifies the segments that do not participate in the rhyme and marks them off with "{" and "}" symbols (e.g. landa-ganga → {l}anda-{g}anga). The third rule (rhympat3) removes everything that is between "{" and "}", leaving us only with the segments relevant for the rhyming pattern (e.g. <anda>-<anga>). Subsequent to this rule, we apply the phonological grouping reductions mentioned above in section 4.1, producing, for example (<aNMBDGRa>-<aNMBDGRa>).

After this reduction, we use the eq(X,L,R) operator in foma, which, from a transducer X, filters out those words in the output where the material between the specified delimiter symbols L and R is unequal. In our case, we use the < and > symbols as delimiters, yielding a final transducer that does not accept non-rhyming words.

define rhympat1 [0:"{" ?* 0:"}"
    [[[V+ C+] (V) V] | [(C) V V]] C* ];
# constraining V V C pattern
define rhympat2 ~[?* V "}" V C];
# cleaning non-rhyme part
define rhympat3 "{" ?* "}" -> 0;
define rhympat rhympat1 .o. rhympat2 .o. rhympat3;
# rhyming pattern on each word
# and phonological changes
define MarkPattern rhympat .o. phoRules .o. patroiak;
# verifying if elements between < and >
# are equal
define MarkTwoPatterns
    0:%< MarkPattern 0:%> %- 0:%< MarkPattern 0:%> ;
define Verify _eq(MarkTwoPatterns, %<, %>) ;
regex Verify .o. Clean;

Figure 6: Rhyme checking using foma.

4.3 Rhyme search

The BAD tool also includes a component for searching for words that rhyme with a given word. It is developed in php and uses a finite-state component likewise developed with foma.

Similarly to the techniques previously described, it relies on extracting the segments relevant to the rhyme, after which phonological rules are applied (as in 4.1) to yield phonetically related forms. For example, introducing the pattern era, the system returns four phonetically similar forms: era, eda, ega, and eba. Then, these responses are fed to a transducer that returns a list of words with the same endings. To this end, we take advantage of a finite-state morphological description of Basque (Alegria et al., 1996).

As this transducer returns a set of words which may be very comprehensive—including words not commonly used, or very long compounds—we then apply a frequency-based filter to reduce the set of possible rhymes. To construct the filter, we used a newspaper corpus (Egunkaria²) and extracted the frequencies of each word form. Using the frequency counts, we defined a transducer that returns a word's frequency, using which we can extract only the n most frequent candidates for rhymes. The system also offers the possibility to limit the number of syllables that desired rhyming words may contain. The syllable filtering system and the frequency limiting parts have been developed in php. Figure 7 shows the principle of the rhyme search's finite-state component.

regex phoRules .o. phoRules.i .o.
    0:?* ?* .o. dictionary ;

Figure 7: Rhyme search using foma

5 Evaluation

As we had available to us a rhyme checker written in php before implementing the finite-state version, it allowed for a comparison of the application speed of each. We ran an experiment introducing 250,000 pairs of words to the two rhyme checkers and measured the time each system needed to reply. The FST-based checker was roughly 25 times faster than the one developed in php.

It is also important to mention that these tools are going to be evaluated in an academic environment. As that evaluation has not been done yet, we made another evaluation in our NLP group in order to detect errors in terms of syllabification and rhyme quality. The general feeling of the experiment was that the BAD tool works well, but we had some efficiency problems when many people worked together. To face this problem, some tools are being implemented as a server.

6 Discussion & Future work

Once the main tools of the BAD have been developed, we intend to focus on two different lines of development. The first one is to extend the flexibility of rhyme checking. There are as of yet patterns which are acceptable as rhymes to bertsolaris that the system does not yet recognize. For example, the words filma and errima will not be accepted by the current system, as the two rhymes ilma and ima are deemed to be incompatible. In reality, these two words are acceptable as rhymes by bertsolaris, as the l is not very phonetically prominent. However, adding flexibility also involves controlling for overgeneration in rhymes. Other reduction patterns not currently covered by the system include phenomena such as synaloepha—omission of vowels at word boundaries when one word ends and the next one begins with a vowel.

Also, we intend to include a catalogue of melodies in the system. These are traditional melodies that usually go along with a specific meter. Some 3,000 melodies are catalogued (Dorronsoro, 1995). We are also using the components described in this article in another project whose aim is to construct a robot capable of finding, generating and singing verses automatically.

² http://berria.info
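The conflation-plus-equality idea of section 4.2 can be emulated in a few lines of Python for experimentation. This is a rough, illustrative analogue rather than the foma transducers: the group tables follow figure 5, while the rhyme-segment extraction is a simplified regular-expression version of the rhympat pattern.

import re

GROUPS = {}
for group, symbol in (("ptk", "PTK"), ("bdgr", "BDGR"),
                      ("szx", "SZX"), ("nm", "NM")):
    for ch in group:
        GROUPS[ch] = symbol

V = "aeiou"
# approximate analogue of [[[V+ C+] (V) V] | [(C) V V]] C* at word end
RHYME_RE = re.compile(
    r"(?:[%s]+[^%s]+[%s]?[%s]|[^%s]?[%s][%s])[^%s]*$" % ((V,) * 8))

def rhyme_key(word):
    m = RHYME_RE.search(word)
    part = m.group(0) if m else word
    return "".join(GROUPS.get(ch, ch) for ch in part)

def rhymes(word1, word2):
    return rhyme_key(word1) == rhyme_key(word2)

print(rhyme_key("landa"), rhyme_key("ganga"))  # aNMBDGRa aNMBDGRa
print(rhymes("landa", "ganga"))                # True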

Acknowledgments

This research has been partially funded by the Spanish Ministry of Education and Science (OpenMT-2, TIN2009-14675-C03) and partially funded by the Basque Government (Research Groups, IT344-10). We would like to acknowledge Aitzol Astigarraga for his help in the development of this project. He has been instrumental in our work, and we intend to continue working with him. We must also mention the Association of Friends of Bertsolaritza, whose verse corpora have been used to test and develop these tools and to develop new ones.

References

Alegria, I., Artola, X., Sarasola, K., and Urkia, M. (1996). Automatic morphological analysis of Basque. Literary and Linguistic Computing, 11(4):193–203.

Amuriza, X. (1981). Hiztegi errimatua [Rhyme Dictionary]. Alfabetatze Euskalduntze Koordinakundea.

Arrieta, B., Alegria, I., and Arregi, X. (2001). An assistant tool for verse-making in Basque based on two-level morphology. Literary and Linguistic Computing, 16(1):29–43.

Beesley, K. R. and Karttunen, L. (2003). Finite State Morphology. CSLI.

Dorronsoro, J. (1995). Bertso doinutegia [Verse melodies repository]. Euskal Herriko Bertsolari Elkartea.

Hulden, M. (2006). Finite-state syllabification. Finite-State Methods and Natural Language Processing, pages 86–96.

Koskenniemi, K. (1983). Two-level morphology: A general computational model for word-form production and generation. Publications of the Department of General Linguistics, University of Helsinki. Helsinki: University of Helsinki.

DAGGER: A Toolkit for Automata on Directed Acyclic Graphs

Daniel Quernheim
Institute for Natural Language Processing
Universität Stuttgart, Germany
Pfaffenwaldring 5b, 70569 Stuttgart
[email protected]

Kevin Knight
University of Southern California
Information Sciences Institute
Marina del Rey, California 90292
[email protected]

Abstract

This paper presents DAGGER, a toolkit for finite-state automata that operate on directed acyclic graphs (dags). The work is based on a model introduced by (Kamimura and Slutzki, 1981; Kamimura and Slutzki, 1982), with a few changes to make the automata more applicable to natural language processing. Available algorithms include membership checking in bottom-up dag acceptors, transduction of dags to trees (bottom-up dag-to-tree transducers), k-best generation and basic operations such as union and intersection.

1 Introduction

Finite string automata and finite tree automata have proved to be useful tools in various areas of natural language processing (Knight and May, 2009). However, some applications, especially in semantics, require graph structures, in particular directed acyclic graphs (dags), to model reentrancies. For instance, the dags in Fig. 1 represent the semantics of the sentences "The boy wants to believe the girl" and "The boy wants the girl to believe him." The double role of "the boy" is made clear by the two parent edges of the BOY node, making this structure non-tree-like.

Figure 1: (a) "The boy wants to believe the girl." and (b) "The boy wants the girl to believe him." First edge represents the :agent role, second edge represents the :patient role.

Powerful graph rewriting systems have been used for NLP (Bohnet and Wanner, 2010), yet we consider a rather simple model: finite dag automata that have been introduced by (Kamimura and Slutzki, 1981; Kamimura and Slutzki, 1982) as a straightforward extension of tree automata. We present the toolkit DAGGER (written in PYTHON) that can be used to visualize dags and to build dag acceptors and dag-to-tree transducers similar to their model. Compared to those devices, in order to use them for actual NLP tasks, our machines differ in certain aspects:

• We do not require our dags to be planar, and we do not only consider derivation dags.
• We add weights from any commutative semiring, e.g. real numbers.

The toolkit is available under an open source licence.¹

2 Dags and dag acceptors

DAGGER comes with a variety of example dags and automata. Let us briefly illustrate some of them. The dag of Fig. 1(a) can be defined in a human-readable format called PENMAN (Bateman, 1990):

(1 / WANT :agent (2 / BOY)
   :patient (3 / BELIEVE
      :agent 2 :patient (4 / GIRL)))

¹ http://www.ims.uni-stuttgart.de/~daniel/dagger/
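The reentrancy is what distinguishes these structures from trees: node 2 (BOY) is the tail of two edges. For readers who want to manipulate such dags programmatically, here is a minimal Python encoding of Fig. 1(a); it is illustrative only, not DAGGER's internal representation.

# Node ids follow the PENMAN expression above.
nodes = {1: "WANT", 2: "BOY", 3: "BELIEVE", 4: "GIRL"}
edges = [(1, ":agent", 2), (1, ":patient", 3),
         (3, ":agent", 2),   # reentrancy: node 2 has two parent edges
         (3, ":patient", 4)]

def parents(node):
    """Head edges of a node; BOY (2) has two, making the dag non-tree-like."""
    return [(h, lbl) for (h, lbl, t) in edges if t == node]

print(parents(2))   # -> [(1, ':agent'), (3, ':agent')]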

s
s -> (WANT :agent i :patient s)
s -> (BELIEVE :agent i :patient s)
i -> (0)
s -> (0)
i -> (GIRL)
i -> (BOY)
s -> (GIRL)
s -> (BOY)
i i -> (GIRL)
i i -> (BOY)
i s -> (GIRL)
i s -> (BOY)
s s -> (GIRL)
s s -> (BOY)

Figure 2: Example dag acceptor example.bda.

s
s -> (WANT :agent i :patient s) @ 0.6
s -> (BELIEVE :agent i :patient s) @ 0.4
i -> (0) @ 0.2
s -> (0) @ 0.4
i -> (GIRL) @ 0.3
s -> (GIRL) @ 0.3
i -> (BOY) @ 0.2
s -> (BOY) @ 0.2
i i -> (2=1 :arg e) @ 0.3
i s -> (2=1 :arg e) @ 0.3
s s -> (2=1 :arg e) @ 0.3
e -> (GIRL) @ 0.4
e -> (BOY) @ 0.6

Figure 3: Simplified dag acceptor simple.bda.

In this format, every node has a unique identifier, and edge labels start with a colon. The tail node of an edge is specified as a whole subdag or, in the case of a reentrancy, is referred to with its identifier.

Fig. 2 shows a dag acceptor. The first line contains the final state, and the remaining lines contain rules. Mind that the rules are written in a top-down fashion, but are evaluated bottom-up for now. Let us consider a single rule:

s -> (WANT :agent i :patient s)

The right-hand side is a symbol (WANT :agent :patient) whose tail edges are labeled with states (i and s), and after applying the rule, its head edges are labeled with new states (s). All rules are height one, but in the future we will allow for larger subgraphs.

In order to deal with symbols of arbitrary head rank (i.e. symbols that can play multiple roles), we can use rules with special symbols such as 2=1 and 3=1 that split one edge into more than one:

i s -> (2=1 :arg e)

Using these state-changing rules, the ruleset can be simplified (see Fig. 3); however, the dags look a bit different now:

(1 / WANT :agent (2 / 2=1
      :arg (3 / BOY))
   :patient (4 / BELIEVE
      :agent 2
      :patient (5 / GIRL)))

Note that we also added weights to the ruleset now. Weights are separated from the rest of a rule by the @ sign. The weight semantics is the usual one, where weights are multiplied along derivation steps, while weights of alternative derivations are added.

2.1 Membership checking and derivation forests

DAGGER is able to perform various operations on dags. The instructions can be given in a simple expression language. The general format of an expression is:

(command f1 .. fm p1 .. pn)

Every command has a number of (optional) features fi and a fixed number of arguments pi. Most commands have a short and a long name; we will use the short names here to save space. In order to evaluate an expression, you can either

• supply it on the command-line:
  ./dagger.py -e EXPRESSION
• or read it from a file:
  ./dagger.py -f FILE

We will now show a couple of example expressions that are composed of smaller expressions. Assume that the dag acceptor of Fig. 2 is saved in the file example.bda, and the file boywants.dag contains the example dag in PENMAN format. We can load the dag with the expression (g (f boywants.dag)), and the acceptor with the expression (a w (f example.bda)), where w means that the acceptor is weighted. We could also specify the dag directly in PENMAN format using p instead of f. We can use the command r:

(r (a w (f example.bda)) (g (f boywants.dag)))

to check whether example.bda recognizes boywants.dag.
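Since the expression language is exposed through the -e and -f flags, DAGGER can also be driven from another program. Below is a hypothetical Python wrapper over the documented command line; the working directory and file names are assumptions.

import subprocess

def dagger_eval(expression):
    """Evaluate one DAGGER expression via the documented -e flag and
    return the raw textual output. Assumes dagger.py and the .bda/.dag
    files are in the current directory."""
    result = subprocess.run(
        ["./dagger.py", "-e", expression],
        capture_output=True, text=True, check=True)
    return result.stdout

# Membership check from section 2.1:
print(dagger_eval("(r (a w (f example.bda)) (g (f boywants.dag)))"))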


Figure 4: Derivation from graph to tree "the boy wants the girl to believe him".

q
q.S(x1 wants x2) -> (WANT :agent nomb.x1 :patient inf.x2)
inf.INF(x1 to believe x2) -> (BELIEVE :agent accg.x1 :patient accb.x2)
accg.NP(the girl) -> (GIRL)
nomb.NP(the boy) accb.NP(him) -> (BOY)

Figure 5: Example dag-to-tree transducer example.bdt.

This will output one list item for each successful derivation (and, if the acceptor is weighted, their weights), in this case: ('s', '0.1', 0, '0'), which means that the acceptor can reach state s with a derivation weighted 0.1. The rest of the output concerns dag-to-tree transducers and will be explained later.

Note that in general, there might be multiple derivations due to ambiguity (non-determinism). Fortunately, the whole set of derivations can be efficiently represented as another dag acceptor with the d command. This derivation forest acceptor has the set of rules as its symbol set and the set of configurations (state-labelings of the input dag) as its state set.

(d (a w (f example.bda)) (g (f boywants.dag)))

will write the derivation forest acceptor to the standard output.

2.2 k-best generation

To obtain the highest-weighted 7 dags generated by the example dag acceptor, run:

(k 7 (a w (f example.bda)))

(1 / BOY)
(1 / GIRL)
(1 / BELIEVE :agent (2 / GIRL) :patient 2)
(1 / WANT :agent (2 / GIRL) :patient 2)
(1 / 0)
(1 / BELIEVE :agent (2 / BOY) :patient 2)
(1 / WANT :agent (2 / BOY) :patient 2)

If the acceptor is unweighted, the smallest dags (in terms of derivation steps) are returned:

(1 / 0)
(1 / BOY)
(1 / GIRL)
(1 / BELIEVE :agent (2 / GIRL) :patient 2)
(1 / BELIEVE :agent (2 / BOY) :patient 2)
(1 / BELIEVE :agent (2 / GIRL) :patient (3 / 0))
(1 / BELIEVE :agent (2 / GIRL) :patient (3 / GIRL))

2.3 Visualization of dags

Both dags and dag acceptors can be visualized using GRAPHVIZ². For this purpose, we use the q (query) command and the v feature:

(v (g (f boywants.dag)) boywants.pdf)
(v (a (f example.bda)) example.pdf)

Dag acceptors are represented as hypergraphs, where the nodes are the states and each hyperedge represents a rule labeled with a symbol.

2.4 Union and intersection

In order to construct complex acceptors from simpler building blocks, it is helpful to make use of union (u) and intersection (i). The following code will intersect two acceptors and return the 5 best dags of the intersection acceptor:

(k 5 (i (a (f example.bda)) (a (f someother.bda))))

Weighted union, as usual, corresponds to sum, weighted intersection to product.

² Available under the Eclipse Public Licence from http://www.graphviz.org/
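The stated weight semantics can be sanity-checked with a toy computation (plain Python, not DAGGER code): if the same dag is assigned weights w1 and w2 by two acceptors, union yields their sum and intersection their product in the real-number semiring.

w1, w2 = 0.1, 0.3
print("union weight:", w1 + w2)          # 0.4
print("intersection weight:", w1 * w2)   # 0.03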

                        string automata                       tree automata                               dag automata
compute …               … strings (sentences)                 … (syntax) trees                            … semantic representations
k-best …                … paths through a WFSA                … derivations in a weighted forest          ✓
                        (Viterbi, 1967; Eppstein, 1998)       (Jiménez and Marzal, 2000;
                                                              Huang and Chiang, 2005)
EM training             Forward-backward EM                   Tree transducer EM training                 ?
                        (Baum et al., 1970; Eisner, 2003)     (Graehl et al., 2008)
Determinization         … of weighted string acceptors        … of weighted tree acceptors                ?
                        (Mohri, 1997)                         (Borchardt and Vogler, 2003;
                                                              May and Knight, 2006a)
Transducer composition  WFST composition                      Many transducers not closed under           ?
                        (Pereira and Riley, 1997)             composition (Maletti et al., 2009)
General tools           AT&T FSM (Mohri et al., 2000),        Tiburon (May and Knight, 2006b),            DAGGER
                        Carmel (Graehl, 1997),                ForestFIRE (Cleophas, 2008;
                        OpenFST (Riley et al., 2009)          Strolenberg, 2007)

Table 1: General-purpose algorithms for strings, trees and feature structures.

3 Dag-to-tree transducers

Dag-to-tree transducers are dag acceptors with tree output. In every rule, the states on the right-hand sides have tree variables attached that are used to build one tree for each state on the left-hand side. A fragment of an example dag-to-tree transducer can be seen in Fig. 5.

Let us see what happens if we apply this transducer to our example dag:

(r (a t (f example.bdt)) (g (f boywants.dag)))

All derivations including output trees will be listed:

('q', '1.0',
 S(NP(the boy) wants INF(NP(the girl) to believe NP(him))),
 'the boy wants the girl to believe him')

A graphical representation of this derivation (top-down instead of bottom-up for illustrative purposes) can be seen in Fig. 4.

3.1 Backward application and force decoding

Sometimes, we might want to see which dags map to a certain input tree in a dag-to-tree transducer. This is called backward application, since we use the transducer in the reverse direction. We are currently implementing this by "generation and checking", i.e. a process that generates dags and trees at the same time. Whenever a partial tree does not match the input tree, it is discarded, until we find a derivation and a dag for the input tree. If we also restrict the dag part, we have force decoding.

4 Future work

This work describes the basics of a dag automata toolkit. To the authors' knowledge, no such implementation already exists. Of course, many algorithms are missing, and there is a lot of room for improvement, both from the theoretical and the practical viewpoint. This is a brief list of items for future research (Quernheim and Knight, 2012):

• Complexity analysis of the algorithms.
• Closure properties of dag acceptors and dag-to-tree transducers, as well as composition with tree transducers.
• Extended left-hand sides to condition on a larger semantic context, just like extended top-down tree transducers (Maletti et al., 2009).
• Handling flat, unordered, sparse sets of relations that are typical of feature structures. Currently, rules are specific to the rank of the nodes. A first step in this direction could be made by getting rid of the explicit n=m symbols.
• Hand-annotated resources such as (dag, tree) pairs, similar to treebanks for syntactic representations, as well as a reasonable probabilistic model and training procedures.
• Useful algorithms for NLP applications that exist for string and tree automata (cf. Table 1). The long-term goal could be to build a semantics-based machine translation pipeline.

Acknowledgements

This research was supported in part by ARO grant W911NF-10-1-0533. The first author was supported by the German Research Foundation (DFG) grant MA 4959/1-1.

References

John A. Bateman. 1990. Upper modeling: organizing knowledge for natural language processing. In Proc. Natural Language Generation Workshop, pages 54–60.

L. E. Baum, T. Petrie, G. Soules, and N. Weiss. 1970. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist., 41(1):164–171.

Bernd Bohnet and Leo Wanner. 2010. Open source graph transducer interpreter and grammar development environment. In Proc. LREC.

Björn Borchardt and Heiko Vogler. 2003. Determinization of finite state weighted tree automata. J. Autom. Lang. Comb., 8(3):417–463.

Loek G. W. A. Cleophas. 2008. Tree Algorithms: Two Taxonomies and a Toolkit. Ph.D. thesis, Department of Mathematics and Computer Science, Eindhoven University of Technology.

Jason Eisner. 2003. Learning non-isomorphic tree mappings for machine translation. In Proc. ACL, pages 205–208.

David Eppstein. 1998. Finding the k shortest paths. SIAM J. Comput., 28(2):652–673.

Jonathan Graehl, Kevin Knight, and Jonathan May. 2008. Training tree transducers. Comput. Linguist., 34(3):391–427.

Jonathan Graehl. 1997. Carmel finite-state toolkit. http://www.isi.edu/licensed-sw/carmel.

Liang Huang and David Chiang. 2005. Better k-best parsing. In Proc. IWPT.

Víctor M. Jiménez and Andrés Marzal. 2000. Computation of the n best parse trees for weighted and stochastic context-free grammars. In Proc. SSPR/SPR, pages 183–192.

Tsutomu Kamimura and Giora Slutzki. 1981. Parallel and two-way automata on directed ordered acyclic graphs. Inf. Control, 49(1):10–51.

Tsutomu Kamimura and Giora Slutzki. 1982. Transductions of dags and trees. Math. Syst. Theory, 15(3):225–249.

Kevin Knight and Jonathan May. 2009. Applications of weighted automata in natural language processing. In Manfred Droste, Werner Kuich, and Heiko Vogler, editors, Handbook of Weighted Automata. Springer.

Andreas Maletti, Jonathan Graehl, Mark Hopkins, and Kevin Knight. 2009. The power of extended top-down tree transducers. SIAM J. Comput., 39(2):410–430.

Jonathan May and Kevin Knight. 2006a. A better n-best list: Practical determinization of weighted finite tree automata. In Proc. HLT-NAACL.

Jonathan May and Kevin Knight. 2006b. Tiburon: A weighted tree automata toolkit. In Oscar H. Ibarra and Hsu-Chun Yen, editors, Proc. CIAA, volume 4094 of LNCS, pages 102–113. Springer.

Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. 2000. The design principles of a weighted finite-state transducer library. Theor. Comput. Sci., 231(1):17–32.

Mehryar Mohri. 1997. Finite-state transducers in language and speech processing. Computational Linguistics, 23(2):269–311.

Fernando Pereira and Michael Riley. 1997. Speech recognition by composition of weighted finite automata. In Finite-State Language Processing, pages 431–453. MIT Press.

Daniel Quernheim and Kevin Knight. 2012. Towards probabilistic acceptors and transducers for feature structures. In Proc. SSST. (to appear).

Michael Riley, Cyril Allauzen, and Martin Jansche. 2009. OpenFST: An open-source, weighted finite-state transducer library and its applications to speech and language. In Proc. HLT-NAACL (Tutorial Abstracts), pages 9–10.

Roger Strolenberg. 2007. ForestFIRE and FIREWood. A toolkit & GUI for tree algorithms. Master's thesis, Department of Mathematics and Computer Science, Eindhoven University of Technology.

Andrew Viterbi. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269.

WFST-based Grapheme-to-Phoneme Conversion: Open Source Tools for Alignment, Model-Building and Decoding

Josef R. Novak, Nobuaki Minematsu, Keikichi Hirose
Graduate School of Information Science and Technology
The University of Tokyo, Japan
{novakj,mine,hirose}@gavo.t.u-tokyo.ac.jp

Abstract

This paper introduces a new open source, WFST-based toolkit for Grapheme-to-Phoneme conversion. The toolkit is efficient, accurate and currently supports a range of features including EM sequence alignment and several decoding techniques novel in the context of G2P. Experimental results show that a combination RNNLM system outperforms all previously reported results on several standard G2P test sets. Preliminary experiments applying Lattice Minimum Bayes-Risk decoding to G2P conversion are also provided. The toolkit is implemented using OpenFst.

1 Introduction

Grapheme-to-Phoneme (G2P) conversion is an important problem related to Natural Language Processing, Speech Recognition and Spoken Dialog Systems development. The primary goal of G2P conversion is to accurately predict the pronunciation of a novel input word given only the spelling. For example, we would like to be able to predict,

PHOENIX → /f i n I k s/

given only the input spelling and a G2P model or set of rules. This problem is straightforward for some languages like Spanish or Italian, where pronunciation rules are consistent. For languages like English and French, however, inconsistent conventions make the problem much more challenging.

In this paper we present a fully data-driven, state-of-the-art, open-source toolkit for G2P conversion, Phonetisaurus [1]. It includes a novel modified Expectation-Maximization (EM)-driven G2P sequence alignment algorithm, support for joint-sequence language models, and several decoding solutions. The paper also provides preliminary investigations of the applicability of Lattice Minimum Bayes-Risk (LMBR) decoding [2; 3] and N-best rescoring with a Recurrent Neural Network Language Model (RNNLM) [4; 5] to G2P conversion. The Weighted Finite-State Transducer (WFST) framework is used throughout, and the open source implementation relies on OpenFst [6]. Experimental results are provided illustrating the speed and accuracy of the proposed system.

The remainder of the paper is structured as follows. Section 2 provides background, Section 3 outlines the alignment approach, Section 4 describes the joint-sequence LM, Section 5 describes decoding approaches, Section 6 discusses preliminary experiments, Section 7 provides simple usage commands and Section 8 concludes the paper.

2 G2P problem outline

Grapheme-to-Phoneme conversion has been a popular research topic for many years. Many different approaches have been proposed, but perhaps the most popular is the joint-sequence model [6]. Most joint-sequence modeling techniques focus on producing an initial alignment between corresponding grapheme and phoneme sequences, and then modeling the aligned dictionary as a series of joint tokens. The gold standard in this area is the EM-driven joint-sequence modeling approach described in [6] that simultaneously infers both alignments and subsequence chunks. Due to space constraints the reader is referred to [6] for a detailed background of previous research.

The G2P conversion problem is typically broken down into several sub-problems: (1) Sequence alignment, (2) Model training and (3) Decoding. The goal of (1) is to align the grapheme and phoneme sequence pairs in a training dictionary. The goal of (2) is to produce a model able to generate new pronunciations for novel words, and the

goal of (3) is to find the most likely pronunciation given the model.

3 Alignment

The proposed toolkit implements a modified WFST-based version of the EM-driven multiple-to-multiple alignment algorithm proposed in [7] and elaborated in [8]. This algorithm is capable of learning natural G-P relationships like igh → /AY/ which were not possible with previous 1-to-1 algorithms like [9].

The proposed alignment algorithm includes three modifications to [7]: (1) A constraint is imposed such that only m-to-one and one-to-m arcs are considered during training. (2) During initialization a joint alignment lattice is constructed for each input entry, and any unconnected arcs are deleted. (3) All arcs, including deletions and insertions, are initialized to, and constrained to maintain, a non-zero weight.

These minor modifications appear to result in a small but consistent improvement in terms of Word Accuracy (WA) on G2P tasks. The Expectation and Maximization steps for the EM training procedure are outlined in Algorithms 2 and 3. The EM algorithm is initialized by generating an alignment FSA for each dictionary entry, which encodes all valid G-P alignments, given max subsequence parameters supplied by the user. Any unconnected arcs are deleted and all remaining arcs are initialized with a non-zero weight. In Algorithm 2, lines 2-3 compute the forward and backward probabilities. Lines 4-8 compute the arc posteriors and update the current model. In Algorithm 3, lines 1-2 normalize the probability distribution. Lines 3-6 update the alignment lattice arc weights with the new model.

Algorithm 1: EM-driven M2One/One2M
Input: x^T, y^V, mX, mY, dX, dY
Output: γ, AlignedLattices
1 foreach sequence pair (x^T, y^V) do
2   InitFSA(x^T, y^V, mX, mY, dX, dY)
3 foreach sequence pair (x^T, y^V) do
4   Expectation(x^T, y^V, mX, mY, γ)
5   Maximization(γ)

Algorithm 2: Expectation step
Input: AlignedLattices
Output: γ, total
1 foreach FSA alignment lattice F do
2   α ← ShortestDistance(F)
3   β ← ShortestDistance(F^R)
4   foreach state q ∈ Q[F] do
5     foreach arc e ∈ E[q] do
6       v ← ((α[q] ⊗ w[e]) ⊗ β[n[e]]) ⊘ β[0];
7       γ[i[e]] ← γ[i[e]] ⊕ v;
8       total ← total ⊕ v;

Algorithm 3: Maximization step
Input: γ, total
Output: AlignedLattices
1 foreach arc e in E[γ] do
2   γ_new[i[e]] ← w[e]/total; γ[i[e]] ← 0;
3 foreach FSA alignment lattice F do
4   foreach state q ∈ Q[F] do
5     foreach arc e ∈ E[q] do
6       w[e] ← γ_new[i[e]];

4 Joint Sequence N-gram model

The pronunciation model implemented by the toolkit is a straightforward joint N-gram model. The training corpus is constructed by extracting the best alignment for each entry, e.g.:

a}x b}b a}@ c|k}k
a}x b}b a}@ f}f t}t

The training procedure is then: (1) Convert aligned sequence pairs to sequences of aligned joint label pairs, (g1:p1, g2:p2, ..., gn:pn); (2) Train an N-gram model from (1); (3) Convert the N-gram model to a WFST. Step (3) may be performed with any language modeling toolkit. In this paper mitlm [11] is utilized.

5 Decoding

The proposed toolkit provides varying support for three different decoding schemes. The default decoder provided by the distribution simply extracts the shortest path through the phoneme lattice created via composition with the input word,

Hbest = ShortestPath(Projecto(w ∘ M))    (1)
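To make training step (1) concrete, the aligned entries above can be converted to joint-label sequences with a few lines of Python. This sketch assumes the alignment format shown (graphemes and phonemes joined by "}", multi-character subsequences by "|") and is not part of the toolkit itself.

def joint_tokens(aligned_entry):
    """'a}x b}b a}@ c|k}k' -> ['a:x', 'b:b', 'a:@', 'c|k:k']"""
    tokens = []
    for chunk in aligned_entry.split():
        graphemes, phonemes = chunk.split("}")
        tokens.append(graphemes + ":" + phonemes)
    return tokens

corpus = ["a}x b}b a}@ c|k}k", "a}x b}b a}@ f}f t}t"]
for entry in corpus:
    print(" ".join(joint_tokens(entry)))
# a:x b:b a:@ c|k:k
# a:x b:b a:@ f:f t:t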

where H_best refers to the lowest cost path, Project_o refers to projecting the output labels, w refers to the input word, M refers to the G2P model, and ∘ indicates composition.

5.1 RNNLM N-best rescoring

Recurrent Neural Network Language Models have recently enjoyed a resurgence in popularity in the context of ASR applications [4]. In another recent publication we investigated the applicability of this approach to G2P conversion with joint sequence models by providing support for the rnnlm toolkit [5]. The training corpus for the G2P LM is a corpus of joint sequences, thus it can be used without modification to train a parallel RNNLM. N-best reranking is then accomplished with the proposed toolkit by causing the decoder to output the N-best joint G-P sequences, and employing rnnlm to rerank the N-best joint sequences,

  H_Nbest = NShortestPaths(w ∘ M)
  H_best = Project_o(Rescore_rnn(H_Nbest)).    (2)

In practice the rnnlm models require considerable tuning, and somewhat more time to train, but provide a consistent WA boost. For further details on the algorithm as well as tuning for G2P see [4; 10].

5.2 Lattice Minimum Bayes-Risk decoding for G2P

In [2] the authors note that the aim of MBR decoding is to find the hypothesis that has the "least expected loss under the model". MBR decoding was successfully applied to Statistical Machine Translation (SMT) lattices in [2], and significantly improved in [3]. Noting the similarities between G2P conversion and SMT, we have begun work implementing an integrated LMBR decoder for the proposed toolkit.

Our approach closely follows that described in [3], and the algorithm implementation is summarized in Algorithm 4.

Algorithm 4: G2P Lattice MBR-Decode
  Input: E ← Project_o(w ∘ M), α, θ_0−N
  1  E ← ScaleLattice(α × E)
  2  N_N ← ExtractN-grams(E)
  3  for n ← 1 to N do
  4    Φ_n ← MakeMapper(N_n)
  5    Ψ_n^R ← MakePathCounter(N_n)
  6    U_n ← Opt((E ∘ Φ_n) ∘ Ψ_n^R)
  7    Ω_n = Φ_n
  8    for state q ∈ Q[Ω_n] do
  9      for arc e ∈ E[q] do
  10       w[e] ← θ_n × U_n(o[e])
  11 P ← Project_input(E_θ0 ∘ Ω_1)
  12 for n ← 2 to N do
  13   P ← Project_input(P ∘ Ω_n)
  14 H_best = ShortestPath(P)

The inputs are the full phoneme lattice E that results from composing the input word with the G2P model and projecting output labels, an exponential scale factor α, and N-gram precision factors θ_0−N. The θ_n are computed using a linear corpus BLEU [2] N-gram precision p, and a match ratio r using the following equations: θ_0 = −1/T; θ_n = 1/(NT·p·r^(n−1)). T is a constant which does not affect the MBR decision [2]. Line 1 applies α to the raw lattice. In effect this controls how much we trust the raw lattice weights. After applying α, E is normalized by pushing weights to the final state and removing any final weights. In line 2 all unique N-grams up to order N are extracted from the lattice. Lines 4-10 create, for each order, a context-dependency FST (Φ_n) and a special path-posterior counting WFST (Ψ_n^R), which are then used to compute N-gram posteriors (U_n), and finally to create a decoder WFST (Ω_n). The full MBR decoder is then computed by first making an unweighted copy of E, applying θ_0 uniformly to all arcs, and iteratively composing and input-projecting with each Ω_n. The MBR hypothesis is then the best path through the result P. See [2; 3] for further details.
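The θ factors above are simple enough to transcribe directly. The following Python helper merely restates θ_0 = −1/T and θ_n = 1/(NT·p·r^(n−1)) from [2]; the function name and the example values of p and r are ours, not the toolkit's.

  def ngram_factors(N, p, r, T=1.0):
      """theta_0 = -1/T, theta_n = 1/(N*T*p*r**(n-1)) for n = 1..N."""
      return [-1.0 / T] + [1.0 / (N * T * p * r ** (n - 1))
                           for n in range(1, N + 1)]

  print(ngram_factors(4, p=0.85, r=0.72))   # theta_0 .. theta_4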
6 Experimental results

Experimental evaluations were conducted utilizing three standard G2P test sets. These included replications of the NetTalk, CMUdict, and OALD English language dictionary evaluations described in detail in [6]. Results comparing various configurations of the proposed toolkit to the joint sequence model Sequitur [6] and an alternative discriminative training toolkit direcTL+ [8] are described in Table 1. Here m2m-P indicates the proposed toolkit using the alignment algorithm from [7], m2m-fst-P indicates the alternative FST-based alignment algorithm, and rnnlm-P indicates the use of RNNLM N-best reranking.

  System         NT15k   CMUdict   OALD
  Sequitur [6]   66.20   75.47     82.51
  direcTL+ [8]   –       ∼75.52    83.32
  m2m-P          66.39   75.08     81.20
  m2m-fst-P      66.41   75.25     81.86
  rnnlm-P        67.77   75.56     83.52

Table 1: Comparison of G2P WA(%) for previous systems and variations of the proposed toolkit.

The results show that the improved alignment algorithm contributes a small but consistent improvement to WA, while RNNLM reranking contributes a further small but significant boost to WA which produces state-of-the-art results on all three test sets.

The WA gains are interesting, however a major plus point for the toolkit is speed. Table 2 compares training times for the proposed toolkit with previously reported results. The m2m-fst-P system for CMUdict performs 0.27% worse than the state-of-the-art, but requires just a tiny fraction of the training time. This turn-around time may be very important for rapid system development.

  System         NETtalk-15k   CMUdict
  Sequitur [6]   Hours         Days
  direcTL+ [8]   Hours         Days
  m2m-P          2m56s         21m58s
  m2m-fst-P      1m43s         13m06s
  rnnlm-P        20m           2h

Table 2: Training times for the smallest (15k entries) and largest (112k entries) training sets.

Finally, Figure 1 plots WA versus decoding time for m2m-fst-P on the largest test set, further illustrating the speed of the decoder, and the impact of using larger models.

Figure 1: Decoding speed vs. WA plot for various N-gram orders for the CMUdict 12k/112k test/train set. Times averaged over 5 runs using ctime.

Preliminary experiments with the LMBR decoder were also carried out using the smaller NT15k dataset. The θ_n values were computed using p, r, and T from [2] while α was tuned to 0.6. Results are described in Table 3. The system matched the basic WA for N=6, and achieved a small improvement in PA over m2m-fst-P (91.80% versus 91.82%). Tuning the loss function for the G2P task should improve performance.

  NT15k   N=1     N=2     N=3     N=4     N=5     N=6
  WA      28.88   65.48   66.03   66.41   66.37   66.50
  PA      83.17   91.74   91.79   91.87   91.82   91.82

Table 3: LMBR decoding Word Accuracy (WA) and Phoneme Accuracy (PA) for order N=1-6.

7 Toolkit distribution and usage

The preceding sections introduced various theoretical aspects of the toolkit as well as preliminary experimental results. The current section provides several introductory usage commands.

The toolkit is open source and released under the liberal BSD license. It is available for download from [1], which also includes detailed compilation instructions, tutorial information and additional examples. The examples that follow utilize the NETTalk dictionary.

Align a dictionary:

  $ phonetisaurus-align --input=test.dic \
      --ofile=test.corpus

Train a 7-gram model with mitlm:

  $ estimate-ngram -o 7 -t test.corpus \
      -wl test.arpa

Convert the model to a WFSA:

  $ phonetisaurus-arpa2fst --input=test.arpa \
      --prefix=test

Apply the default decoder:

  $ phonetisaurus-g2p --model=test.fst \
      --input=abbreviate --nbest=3 --words

  abbreviate 25.66 @ b r i v i e t
  abbreviate 28.20 @ b i v i e t
  abbreviate 29.03 x b b r i v i e t

Apply the LMBR decoder:

  $ phonetisaurus-g2p --model=test.fst \
      --input=abbreviate --nbest=3 --words \
      --mbr --order=7

  abbreviate 1.50 @ b r i v i e t
  abbreviate 2.62 x b r i v i e t
  abbreviate 2.81 a b r i v i e t

8 Conclusion and Future work

This work introduced a new Open Source WFST-driven G2P conversion toolkit which is both highly accurate as well as efficient to train and test. It incorporates a novel modified alignment algorithm. To our knowledge the RNNLM N-best reranking and LMBR decoding are also novel applications in the context of G2P.

Both the RNNLM N-best reranking and LMBR decoding are promising but further work is required to improve usability and performance. In particular RNNLM training requires considerable tuning, and we would like to automate this process. The provisional LMBR decoder achieved a small improvement but further work will be needed to tune the loss function. Several known optimizations are also planned to speed up the LMBR decoder.

Nevertheless the current release of the toolkit provides several novel G2P solutions, achieves state-of-the-art WA on several test sets and is efficient for both training and decoding.

References

[1] J. Novak, et al. Phonetisaurus. http://code.google.com/p/phonetisaurus

[2] R. Tromble, S. Kumar, F. Och and W. Macherey. Lattice Minimum Bayes-Risk Decoding for Statistical Machine Translation, Proc. EMNLP 2008, pp. 620-629.

[3] G. Blackwood, A. de Gispert and W. Byrne. Efficient path counting transducers for minimum Bayes-risk decoding of statistical machine translation lattices, Proc. ACL 2010, pp. 27-32.

[4] T. Mikolov, M. Karafiát, L. Burget, J. Černocký and S. Khudanpur. Recurrent Neural Network based Language Model, Proc. Interspeech 2010.

[5] T. Mikolov, S. Kombrink, A. Deoras, L. Burget and J. Černocký. RNNLM - Recurrent Neural Network Language Modeling Toolkit, ASRU 2011, demo session.

[6] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut and M. Mohri. OpenFst: A General and Efficient Weighted Finite-State Transducer Library, Proc. CIAA 2007, pp. 11-23.

[6] M. Bisani and H. Ney. Joint-sequence models for grapheme-to-phoneme conversion, Speech Communication 50, 2008, pp. 434-451.

[7] S. Jiampojamarn, G. Kondrak and T. Sherif. Applying Many-to-Many Alignments and Hidden Markov Models to Letter-to-Phoneme Conversion, NAACL HLT 2007, pp. 372-379.

[8] S. Jiampojamarn and G. Kondrak. Letter-to-Phoneme Alignment: an Exploration, Proc. ACL 2010, pp. 780-788.

[9] E. Ristad and P. Yianilos. Learning String Edit Distance, IEEE Trans. PAMI 1998, pp. 522-532.

[10] J. Novak, P. Dixon, N. Minematsu, K. Hirose, C. Hori and H. Kashioka. Improving WFST-based G2P Conversion with Alignment Constraints and RNNLM N-best Rescoring, Interspeech 2012 (Accepted).

[11] B. Hsu and J. Glass. Iterative Language Model Estimation: Efficient Data Structure & Algorithms, Proc. Interspeech 2008.

Kleene, a Free and Open-Source Language for Finite-State Programming

Kenneth R. Beesley
SAP Labs, LLC
P.O. Box 540475
North Salt Lake, UT 84054, USA
[email protected]

Abstract

Kleene is a high-level programming language, based on the OpenFst library, for constructing and manipulating finite-state acceptors and transducers. Users can program using regular expressions, alternation-rule syntax and right-linear phrase-structure grammars; and Kleene provides variables, lists, functions and familiar program-control syntax. Kleene has been approved by SAP AG for release as free, open-source code under the Apache License, Version 2.0, and will be available by August 2012 for downloading from http://www.kleene-lang.org. The design, implementation, development status and future plans for the language are discussed.

1 Introduction

Kleene(1) is a finite-state programming language in the tradition of the AT&T Lextools (Roark and Sproat, 2007),(2) the SFST-PL language (Schmid, 2005),(3) the Xerox/PARC finite-state toolkit (Beesley and Karttunen, 2003)(4) and FOMA (Huldén, 2009b),(5) all of which provide higher-level programming formalisms built on top of low-level finite-state libraries. Kleene itself is built on the OpenFst library (Allauzen et al., 2007),(6) developed by Google Labs and NYU's Courant Institute.

The design and implementation of the language were motivated by three main principles, summarized as Syntax Matters, Licensing Matters and Open Source Matters. As for the syntax, Kleene allows programmers to specify weighted or unweighted finite-state machines (FSMs)—including acceptors that encode regular languages and two-projection transducers that encode regular relations—using regular expressions, alternation-rule syntax and right-linear phrase-structure grammars. The regular-expression operators are borrowed, as far as possible, from familiar Perl-like and academic regular expressions, and the alternation rules are based on the "rewrite rules" made popular by Chomsky and Halle (Chomsky and Halle, 1968). Borrowing from general-purpose programming languages, Kleene also provides variables, lists and functions, plus nested code blocks and familiar control structures such as if-else statements and while loops.

As for the licensing, Kleene, like the OpenFst library, is released under the Apache License, Version 2.0, and its other dependencies are also released under this and similar permissive licenses that allow commercial usage. In contrast, many notable finite-state implementations, released under the GPL and similar licenses, are restricted to academic and other non-commercial use. The Kleene code is also open-source, allowing users to examine, correct, augment and even adopt the code if the project should ever be abandoned by its original maintainer(s).

(1) Kleene is named after American mathematician Stephen Cole Kleene (1909–1994), who investigated the properties of regular sets and invented the metalanguage of regular expressions.
(2) http://www.research.att.com/~alb/lextools/
(3) http://www.ims.uni-stuttgart.de/projekte/gramotron/SOFTWARE/SFST.html
(4) http://www.fsmbook.com
(5) http://code.google.com/p/foma/
(6) http://www.openfst.org


It is hoped that Kleene will provide an attractive development environment for experts and students. Pre-edited Kleene scripts can be run from the command line, but a graphical user interface is also provided for interactive learning, programming, testing and drawing of FSMs.

Like comparable implementations of finite-state machines, Kleene can be used to implement a variety of useful applications, including spell-checking and -correction, phonetic modeling, morphological analysis and generation, and various kinds of pattern matching. The paper continues with a brief description of the Kleene language, the current state of development, and plans for the future.

2 Implementation

The Java-language Kleene parser, implemented with JavaCC and JJTree (Copeland, 2007),(7) is Unicode-capable and portable. Successfully parsed statements are reduced to abstract syntax trees (ASTs), which are interpreted by calling C++ functions in the OpenFst library via the Java Native Interface (JNI).

3 Kleene Syntax

3.1 Regular Expressions

Basic assignment statements have a regular expression on the right-hand side, as shown in Table 1. As in Perl regular expressions, simple alphabetic characters are literal, and concatenation is indicated by juxtaposition, with no overt operator. Parentheses can be used to group expressions. The postfixed * (the "Kleene star"), + (the "Kleene plus"), and ? denote zero-or-more, one-or-more, and optionality, respectively. Square-bracketed expressions have their own internal syntax to denote character sets, including character ranges such as [A-Z]. The union operator is |. Basic regular operations missing from Perl regular expressions include composition (∘ or _o_), crossproduct (:), language intersection (&), language negation (~) and language subtraction (-). Weights are indicated inside angle brackets, e.g. <0.1>.

Special characters can be literalized with a preceding backslash or inside double quotes, e.g. \* or "*" denotes a literal asterisk rather than the Kleene star. To improve the readability of expressions, spaces are not significant, unless they appear inside square brackets or are explicitly literalized inside double quotes or with a preceding backslash.

In a language like Kleene where alphabetic symbols are literal, and the expression dog denotes three literal symbols, d, o and g, concatenated together, there must be a way to distinguish variable names from simple concatenations. The Kleene solution is to prefix variable names that are bound to FSM values with a dollar-sign sigil, e.g. $myvar. Once defined, a variable name can be used inside subsequent regular expressions, as in the following example, which models a fragment of Esperanto verb morphology.

$vroot = don | dir | pens | ir ;
// "give", "say", "think", "go"
$aspect = ad ;
// optional repeated aspect
$vend = as | is | os | us | u | i ;
// pres, past, fut, cond, subj, inf
$verbs = $vroot $aspect? $vend ;
// use of pre-defined variables

Similarly, names of functions that return FSMs are distinguished with the $^ sigil. To denote less common operations, rather than inventing and proliferating new and arguably cryptic regular-expression operators, Kleene provides a set of predefined functions including

$^reverse(regexp)
$^invert(regexp)
$^inputProj(regexp)
$^outputProj(regexp)
$^contains(regexp)
$^ignore(regexp, regexp)
$^copy(regexp)

Users can also define their own functions, and function calls are regular expressions that can appear as operands inside larger regular expressions.

3.2 Alternation-Rule Syntax

Kleene provides a variety of alternation-rule types, comparable to Xerox/PARC Replace Rules (Beesley and Karttunen, 2003, pp. 130–82), but implemented using algorithms by Måns Huldén (Huldén, 2009a).

(7) https://javacc.dev.java.net

$var = dog ;
$var = d o g ;                    // equivalent to dog
$var = ~( a+ b* c? ) ;
$var = \~ \+ \* \? ;              // literalized special characters
$var = "~+*?" ;                   // literalized characters inside double quotes
$var = "dog" ;                    // unnecessary literalization, equivalent to dog
$myvar = (dog | cat | horse) s? ;
$yourvar = [A-Za-z] [A-Za-z0-9]* ;
$hisvar = ([A-Za-z]-[aeiouAEIOU])+ ;
$hervar = (bird|cow|elephant|pig) & (pig|ant|bird) ;
$ourvar = (dog):(chien) ∘ (chien):(Hund) ;
$theirvar = [a-z]+ ( a <0.91629> | b <0.1> ) ;   // weights in brackets

Table 1: Kleene Regular-Expression Assignment Examples.

input-expression -> output-expression / left-context _ right-context

Table 2: The Simplest Kleene Alternation-Rule Template.
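As an informal aid, the effect of a rule such as a -> b / c _ d can be emulated on plain strings with Python lookaround regexes. This is only a paraphrase of the rule semantics for simple cases, not Kleene code, and it ignores the overlap and directionality subtleties that real alternation-rule compilation handles.

  import re

  # Apply the alternation rule  a -> b / c _ d :
  # rewrite "a" as "b" only between a preceding "c" and a following "d";
  # lookaround leaves the contexts themselves unconsumed.
  rule = re.compile(r"(?<=c)a(?=d)")
  print(rule.sub("b", "cad acad ad"))   # -> "cbd acbd ad"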

The simplest rules have the template shown in Table 2, and are interpreted into transducers that map the input to the output in the specified context. Such rules, which cannot be reviewed in detail here, are commonly used to model phonetic and orthographical alternations.

3.3 Right-Linear Phrase Structure Grammars

While regular expressions are formally capable of describing any regular language or regular relation, some linguistic phenomena—especially productive morphological compounding and derivation—can be awkward to describe this way. Kleene therefore provides right-linear phrase-structure grammars that are similar in semantics, if not in syntax, to the Xerox/PARC lexc language (Beesley and Karttunen, 2003, pp. 203–78).

A Kleene phrase-structure grammar is defined as a set of productions, each assigned to a variable with a $> sigil. Productions may include right-linear references to themselves or to other productions, which might not yet be defined. The productions are parsed immediately but are not evaluated until the entire grammar is built into an FSM via a call to the built-in function $^start(), which takes one production variable as its argument and treats it as the starting production of the whole grammar. The following example models a fragment of Esperanto noun morphotactics, including noun-root compounding.

$>Root = (kat | hund | elefant | dom)
         ( $>Root | $>AugDim ) ;
$>AugDim = ( eg | et )? $>Noun ;
$>Noun = o $>Plur ;
$>Plur = j? $>Case ;
$>Case = n? ;

$net = $^start($>Root) ;

The syntax on the right-hand side of productions is identical to the regular-expression syntax, but allowing right-linear references to productions of the form $>Name.
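To see why such right-linear productions stay within finite-state power, here is a small self-contained Python sketch that interprets a grammar of (prefix, next-production) alternatives directly: structurally the same data an FSM construction would walk, one state per production. The encoding is ours and is not how Kleene compiles grammars.

  # Each production is a list of alternatives; an alternative is a
  # (prefix string, next production or None) pair, right-linear by design.
  GRAMMAR = {
      "Root":   [(w, nxt) for w in ("kat", "hund", "elefant", "dom")
                          for nxt in ("Root", "AugDim")],
      "AugDim": [("eg", "Noun"), ("et", "Noun"), ("", "Noun")],
      "Noun":   [("o", "Plur")],
      "Plur":   [("j", "Case"), ("", "Case")],
      "Case":   [("n", None), ("", None)],
  }

  def accepts(s, prod="Root"):
      """Does the right-linear grammar derive s starting from prod?"""
      if prod is None:
          return s == ""
      return any(s.startswith(w) and accepts(s[len(w):], nxt)
                 for w, nxt in GRAMMAR[prod])

  for word in ("katoj", "hundegon", "kathundoj", "katj"):
      print(word, accepts(word))   # True, True, True (compound), False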

4 Kleene FSMs

Each Kleene finite-state machine consists of a standard OpenFst FSM, under the default Tropical Semiring, wrapped with a Java object(8) that stores the private alphabet(9) of each machine.

In Kleene, it is not necessary or possible to declare the characters being used; characters appearing in regular expressions, alternation rules and right-linear phrase-structure grammars are stored automatically as FSM arc labels using their Unicode code point value, and this includes Unicode supplementary characters. Programmer-defined multi-character symbols, represented in the syntax with surrounding single quotes, e.g. '+Noun' and '+Verb', or, using another common convention, '[Noun]' and '[Verb]', also need no declaration and are automatically stored using code point values taken from a Unicode Private Use Area. The dot (.) denotes any character, and it translates non-trivially into reserved arc labels that represent OTHER (i.e. unknown) characters.(10)

(8) Each Java object of the class Fst contains a long integer field that stores a pointer to the OpenFst machine, which actually resides in OpenFst's C++ memory space.
(9) The alphabet, sometimes known as the sigma, contains just the symbols that appear explicitly in the labels of the FSM.

5 Status

5.1 Currently Working

As of the date of writing, Kleene is an advanced beta project offering the following:

• Compilation of regular expressions, right-linear phrase-structure grammars, and several alternation-rule variations into FSMs.
• Robust handling of Unicode, including supplementary characters, plus support for user-defined multi-character symbols.
• Variables and maintenance of symbol tables in a frame-based environment.
• Pre-defined and user-defined functions.
• Handling of lists of FSMs, iteration over lists, and functions that handle and return lists.
• A graphical user interface, including tools to draw FSMs and test them manually.
• File I/O of FSMs in an XML format.
• Interpretation of arithmetic expressions, arithmetic variables and functions, including boolean functions; and if-then statements and while loops that use boolean operators and functions.

5.2 Future Work

The work remaining to be done includes:

• Completion of the implementation of alternation-rule variations.
• Writing of runtime code and APIs to apply FSMs to input and return output.
• Conversion of FSMs into stand-alone executable code, initially in Java and C++.
• Expansion to handle semirings other than the default Tropical Semiring of OpenFst.
• Testing in non-trivial applications to determine memory usage and performance.

6 History and Licensing

Kleene was begun informally in late 2006, became part of a company project in 2008, and was under development until early 2011, when the project was canceled. On 4 May 2012, SAP AG released Kleene as free, open-source code under the Apache License, Version 2.0.(11)

The Kleene source code will be repackaged according to Apache standards and made available for download by August of 2012 at http://www.kleene-lang.org. A user manual, currently over 100 pages, and an engineering manual will also be released. Precompiled versions will be provided for Linux, OS X and, if possible, Windows.

Acknowledgments

Sincere thanks are due to the OpenFst team and all who made that library available. A special personal thanks goes to Måns Huldén, who graciously released his algorithms for interpreting alternation rules and language-restriction expressions, and who went to great lengths to help me understand and reimplement them. I also acknowledge my SAP Labs colleagues Paola Nieddu and Phil Sours, who contributed to the design and implementation of Kleene, and my supervisor Michael Wiesner, who supported the open-source release. Finally, I thank Lauri Karttunen, who introduced me to finite-state linguistics and has always been a good friend and mentor.

(10) The treatment of FSM-specific alphabets and the handling of OTHER characters is modeled on the Xerox/PARC implementation (Beesley and Karttunen, 2003, pp. 56–60).
(11) http://www.apache.org/licenses/LICENSE-2.0.html

References

Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. 2007. OpenFst: A general and efficient weighted finite-state transducer library. In Proceedings of the Ninth International Conference on Implementation and Application of Automata (CIAA 2007), volume 4783 of Lecture Notes in Computer Science, pages 11–23. Springer.

Kenneth R. Beesley and Lauri Karttunen. 2003. Finite State Morphology. CSLI Publications, Palo Alto, CA.

Noam Chomsky and Morris Halle. 1968. The Sound Pattern of English. Harper and Row, New York.

Tom Copeland. 2007. Generating Parsers with JavaCC. Centennial Books, Alexandria, VA.

Måns Huldén. 2009a. Finite-State Machine Construction Methods and Algorithms for Phonology and Morphology. Ph.D. thesis, The University of Arizona, Tucson, AZ.

Måns Huldén. 2009b. Foma: a finite-state compiler and library. In Proceedings of the EACL 2009 Demonstrations Session, pages 29–32, Athens, Greece.

Brian Roark and Richard Sproat. 2007. Computational Approaches to Morphology and Syntax. Oxford Surveys in Syntax & Morphology. Oxford University Press, Oxford.

Helmut Schmid. 2005. A programming language for finite state transducers. In FSMNLP'05, Helsinki.

Implementation of replace rules using preference operator

Senka Drobac, Miikka Silfverberg, and Anssi Yli-Jyrä
University of Helsinki
Department of Modern Languages
Unioninkatu 40 A
FI-00014 Helsingin yliopisto, Finland
{senka.drobac, miikka.silfverberg, anssi.yli-jyra}@helsinki.fi

Abstract

We explain the implementation of replace rules with the .r-glc. operator and preference relations. Our modular approach combines various preference constraints to form different replace rules. In addition to describing the method, we present illustrative examples.

1 Introduction

The idea of HFST - Helsinki Finite-State Technology (Lindén et al. 2009, 2011) is to provide open-source replicas of well-known tools for building morphologies, including XFST (Beesley and Karttunen 2003). HFST's lack of replace rules such as those supported by XFST prompted us to implement them using the present method, which replicates XFST's behavior (with minor differences which will be detailed in later work), but will also allow easy expansion with new functionalities.

The semantics of replacement rules mixes contextual conditions with replacement strategies that are specified by replace rule operators. This paper describes the implementation of replace rules using a preference operator, .r-glc., that disambiguates alternative replacement strategies according to a preference relation. The use of preference relations (Yli-Jyrä 2008b) is similar to the worsener relations used by Gerdemann (2009). The current approach was first described in Yli-Jyrä (2008b), and is closely related to the matching-based finite-state approaches to optimality in OT phonology (Noord and Gerdemann 1999; Eisner 2000). The preference operator, .r-glc., is the reversal of generalized lenient composition (glc), a construct proposed by Jäger (2001). The implementation is developed using the HFST library, and is now a part of the same.

The purpose of this paper is to explain a general method of compiling replace rules with the .r-glc. operator and to show how the preference constraints described in Yli-Jyrä (2008b) can be combined to form different replace rules.

2 Notation

The notation used in this paper is the standard regular expression notation extended with the replace rule operators introduced and described in Beesley and Karttunen (2003).

In a simple rule

  A op B dir L1 _ R1, …, Ln _ Rn

op is a replace rule operator such as →, (→), @→, @>, ←, (←), …; A ⊆ Σ* is the set of patterns in the input text that are overwritten in the output text by the alternative patterns, which are given as the set B ⊆ Σ*, where Σ* is a universal language and Σ a set of alphabetical symbols; Li and Ri are left and right contexts and dir is the context direction (||, //, \\ and \/).

Rules can also be parallel. Then they are divided with a double comma (,,), or alternately with a single comma if context is not specified.

  Operation   Name
  X Y         The concatenation of Y after X
  X | Y       The disjunction of X and Y
  X:Y         The cross product of X and Y, where X and Y denote languages
  X .o. Y     The composition of X and Y, where X and Y denote relations
  X+          The Kleene plus
  X*          The Kleene star
  proj1(X)    The projection of the input language of the relation X
  proj2(X)    The projection of the output language of the relation X

Table 1 – List of operations
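For readers less used to the relational reading of Table 1, the following Python sketch spells out composition, cross product and the two projections extensionally on finite sets of string pairs; actual transducers implement the same algebra over possibly infinite regular relations.

  def compose(X, Y):
      """X .o. Y: pairs (a, c) such that (a, b) in X and (b, c) in Y."""
      return {(a, c) for a, b in X for b2, c in Y if b == b2}

  def cross(X, Y):
      """X:Y, the cross product of two languages (sets of strings)."""
      return {(a, b) for a in X for b in Y}

  def proj1(X):
      """Input-side projection of the relation X."""
      return {a for a, _ in X}

  def proj2(X):
      """Output-side projection of the relation X."""
      return {b for _, b in X}

  R = cross({"a"}, {"b"})           # {("a", "b")}
  S = cross({"b"}, {"c"})           # {("b", "c")}
  print(compose(R, S))              # {("a", "c")}
  print(proj1(R), proj2(S))         # {"a"} {"c"}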


Operators used in the paper are listed in Table 1, where X and Y stand for regular expressions. Additionally, parentheses ( ) are used to mark optionality, square brackets [ ] for precedence, and the question mark ? is used to denote the set Σ in regular expressions.

3 Method

The general idea for compiling replace rules with the .r-glc. operator and preference constraints is shown in Figure 1.

Figure 1: General method of building a replace rule (the UBT is restricted by Constraint 1 through Constraint N via .r-glc., after which the brackets are removed, yielding the replace rule).

The method consists of the following steps:

1. Building an Unconstrained Bracketed Transducer (UBT) – a transducer which applies or skips contextually valid replacements freely in all possible portions of the inputs. Every application of the replacement rule is marked with special brackets. Similar replace rules that differ only with respect to their replacement strategies will use the same UBT. Thus, the compilation of the UBT is independent of the replacement strategy, which increases the modularity of the compilation algorithm.

2. Implementing the functionality of the replace rule operator by constraining the UBT with the respective preference relation.

3. Removing brackets from the transducer.

The major advantage of this method is its modularity. The algorithm is divided into small components which are combined in the desired way. This approach allows every part of the algorithm to be separately and clearly defined, tested and changed. Furthermore, modularity makes it possible to easily integrate new functionalities such as weighted replace rules or two level contexts.

3.1 Unconstrained Bracketed Transducer

As mentioned earlier, it is first necessary to build the UBT. This step can be seen as a variant of Yli-Jyrä and Koskenniemi's (2007) method for compiling contextually restricted changes in two-level grammars. The main difference now is that the rule applications cannot overlap because they will be marked with brackets.

Step 1: Bracketed center

The first step is to create a bracketed center, CenterB – the replace relation surrounded by brackets <1, >1. For optional replacement, it is necessary that CenterB also contains the upper side of the relation bracketed with another pair of brackets B2 = { <2, >2 }. This is necessary for filtering out all the results without any brackets (see the later filter Cbracket) and getting non-optional replacement:

  CenterB = ⋃i [ <1 Ai : Bi >1 ∪ <2 Ai >2 ]

In the case of parallel replace rules, the bracketed center is the union of all the individual bracketed centers. Like XFST, this implementation requires parallel replace rules to have the same replace operator (and optionality) in all replacements.

Step 2: The change centers in free context

The second step is to expand the bracketed center to be valid in any context. If B = { <1, >1, <2, >2 }, we can define:

  D = [ Σ − B ∪ CenterB ]*

Then, the center in free context is:

  CenterFree = D ⋄ CenterB D

where ⋄ is a diamond, which is used to align centers and contexts during compilation.

Step 3: Expanded center in context

The next step is to compile contexts. The method used for constructing CenterInContext depends on whether the context must match on the upper or the lower side. Since it is possible to have multiple contexts, each replacement should be surrounded with all applicable contexts:

  CenterInContext = C1 | C2 | … | Cn

The center surrounded with one context is:

  Ci = [ D | L′ ] ⋄ CenterB [ D | R′ ]

where L and R are the left and right contexts from the replace rule, and L′ and R′ are the expanded contexts, depending on which side the context matches. In the case when the context must match on the upper side, L′ and R′ are:

  L′ = [ L ≪ B ] .o. D
  R′ = [ R ≪ B ] .o. D

If they must match on the lower side:

  L′ = D .o. [ L ≪ B ]
  R′ = D .o. [ R ≪ B ]

where brackets are freely inserted (≪) in the contexts, which are then composed with D. In this example:

  a → b ∥ c _ d

both contexts should match on the upper side of the replacement, so CenterInContext is:

  L′ = [ c ≪ B ] .o. D
  R′ = [ d ≪ B ] .o. D
  C1 = [ D | L′ ] ⋄ [ a : b ∪ a ] [ D | R′ ]
  CenterInContext = C1

This way of compiling contexts allows every rule in a parallel replace rule to have its own context direction (||, //, \\, \/). Therefore, rules like the following one are valid in this implementation:

  A1 → B1 \\ L1 _ R1 ,, A2 → B2 // L2 _ R2

Step 4: Final operations

Finally, to get the unconstrained replace transducer it is necessary to subtract CenterInContext from CenterFree, remove the diamond and take the negation of that relation. Let U = [ Σ − B − ⋄ ∪ CenterB ]*; then:

  UBT = U − d⋄( CenterFree − CenterInContext )

where d⋄ denotes removal of the diamond.

3.2 Constraints

All the preference constraints were defined in Yli-Jyrä (2008b), but since they were mostly difficult to interpret and implement, the paper gives the list of the constraints written with regular expressions over the set of finite binary relations, built around a helper expression RP that is used throughout. The constraints are:

- the left most preference Clm and the right most preference Crm, which prefer candidates whose brackets are placed as early (respectively as late) as possible;
- the longest match constraints, left to right (Clongest) and right to left, which among bracketed matches with the same starting position prefer the longest one;
- the shortest match constraints, left to right (Cshortest) and right to left, which prefer the shortest bracketed match;
- a filter Cep′ used when compiling epenthesis rules, to avoid more than one epsilon in a row;
- a filter Cbracket for non-optional replacements, which removes results containing fewer brackets at some position;
- a filter CB2′ that removes paths containing the auxiliary brackets B2 = { <2, >2 }.

Since Cep′ and CB2′ are reflexive, they are not preference relations. Instead, they are filters applied after the preference relations.

3.3 Applying constraints with the .r-glc. operator

To apply a preference constraint in order to restrict a transducer t, we use the .r-glc. operator. The .r-glc. operation between a transducer t and a constraint is shown in Figure 2. The input language of a transducer is noted as proj1 and the output language as proj2.

Figure 2: Breakdown of the operation t .r-glc. constraint: [ proj1(t) − proj2( proj1(t) .o. constraint .o. proj1(t) ) ] .o. t.

Constraint combinations

As shown in Figure 1, in order to achieve the desired replace rules, it is often necessary to use several constraints. For example, to achieve left to right longest match, it is necessary to combine Clm and Clongest.
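Extensionally, applying a preference relation to a finite candidate set amounts to discarding every candidate that the relation derives from some other candidate. The sketch below illustrates that filtering effect with a toy left-most preference; the representation and helper names are ours, and the actual .r-glc. operation is the transducer construction of Figure 2.

  def prefer(candidates, worsens):
      """Keep candidates that no other candidate 'worsens' into.

      worsens(x, y) == True means x is preferred and y is the worsened
      variant, so y is filtered out whenever such an x is present.
      """
      return {y for y in candidates
              if not any(x != y and worsens(x, y) for x in candidates)}

  # toy left-most preference on bracketed strings: an earlier opening
  # bracket beats a later one
  cands = {"<ab>c", "a<b>c", "ab<c>"}
  leftmost = lambda x, y: x.index("<") < y.index("<")
  print(prefer(cands, leftmost))       # {'<ab>c'}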

If the same longest match contains epenthesis, the Cep′ constraint should also be used.

3.4 Removing brackets

Removing brackets is simply achieved by applying a constraint of the form ?* B ?* in which the brackets are mapped to epsilon, where B is the set of brackets we want to remove. Additionally, in the HFST implementation, it is also required to remove the brackets from the transducers' alphabets.

4 Examples

Let us show how the replace rule is compiled on different examples. Since it would take too much space to show whole transducers, we will show only the output of the intermediate results applied to an input string.

The first example shows how to achieve a non-optional replacement. The intermediate results of the replace rule a → b || c _ d are shown in Table 2. Since the arrow demands non-optional replacement, the unconstrained bracketed replace, if applied to the input string c a d, contains three possible results. The first result is the input string itself, which would be part of the non-optional replacement. The second result is necessary to filter out the first one. In this example, because of the restricting context, replacement is possible only in the middle, and therefore it is bracketed with special brackets. Finally, the third result contains the bracketed replace relation.

  a → b || c _ d         (input string: c a d)
  UBT:       c a d ;  c <2 a >2 d ;  c <1 a:b >1 d
  Cbracket:  c <2 a >2 d ;  c <1 a:b >1 d
  CB2′:      c <1 a:b >1 d

Table 2: Steps of the non-optional replacement.

Once we have the unconstrained bracketed replace transducer, we are ready to apply filters. The first filter, Cbracket, will filter out all the results that contain a smaller number of brackets in every position, without making a difference as to the type of brackets. In this example, it will filter out the first result, the one that does not have any brackets at all.

The second filter, CB2′, will filter out all the results containing B2 brackets because they don't contain the replace relation. Finally, to get the final result, it is necessary to remove the brackets from the relation.

The following examples are shown on a four-symbol input string. Table 3 shows the steps of building left to right longest match for a rule of the form A+ @→ B || L _ R, and Table 4 left to right shortest match (A+ @> B || L _ R); each table lists the bracketed candidates surviving the columns UBT, Clm and Clongest (respectively Cshortest).

Both longest match and shortest match have the same first two steps. After building the unconstrained bracketed replace, we apply the Clm filter, which finds all the results with left most brackets in every position and filters out all the rest. This characteristic of the constraint filters out the results without brackets as well, so the result will be non-optional. In order to get the longest match, we apply another filter (Clongest) to the result of the left most filter. This filter finds the longest of the bracketed matches with the same starting position. In the final step, if we apply the filter Cshortest instead of Clongest, we will get the shortest match (Table 4).

Table 3: Left to right longest match.

Table 4: Left to right shortest match.

5 Conclusion

The large number of different replace operators makes it quite complicated and error-prone to build a supporting framework for them. However, the .r-glc. operator and preference relations allow splitting the algorithm into small reusable units which are easy to maintain and upgrade with new functionalities.
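On plain strings, the combined effect of Clm and Clongest can be mimicked with a greedy left-to-right, longest-match scan. The following Python function is our own simplification, operating on literal patterns rather than transducers, and only reproduces the @→-style behaviour of the examples above for simple inputs.

  def replace_lm_longest(s, pats, repl):
      """Left-to-right, longest-match replacement of any pattern in pats."""
      out, i = [], 0
      while i < len(s):
          best = max((p for p in pats if s.startswith(p, i)),
                     key=len, default=None)
          if best:                 # longest pattern starting at position i
              out.append(repl)
              i += len(best)       # matching never restarts inside a match
          else:
              out.append(s[i])
              i += 1
      return "".join(out)

  # 'aa' beats 'a' at the same starting position (longest match)
  print(replace_lm_longest("baaab", ["a", "aa"], "x"))   # -> "bxxb"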

The replace rules are now part of the HFST library and can be used through the hfst-regexp2fst command line tool, but there is still some work to be done to build an interactive interface. Additionally, we are planning to add support for two level contexts and parallel weighted rules.

Acknowledgments

The research leading to these results has received funding from the European Commission's 7th Framework Program under grant agreement n° 238405 (CLARA).

References

Beesley, K.R., Karttunen, L.: Finite State Morphology. CSLI Publications (2003)

Eisner, J.: Directional constraint evaluation in Optimality Theory. In: 20th COLING 2000, Proceedings of the Conference, Saarbrücken, Germany (2000) 257–263

Gerdemann, D.: Mix and match replacement rules. In: Proceedings of the RANLP 2009 Workshop on Adaptation of Language Resources and Technology to New Domains, Borovets, Bulgaria (2009) 39–47

Gerdemann, D., van Noord, G.: Approximation and exactness in finite-state Optimality Theory. In Eisner, J., Karttunen, L., Thériault, A., eds.: SIGPHON 2000, Finite State Phonology (2000)

Gerdemann, D., van Noord, G.: Transducers from rewrite rules with backreferences. In: 9th EACL 1999, Proceedings of the Conference (1999) 126–133

Jäger, G.: Gradient constraints in finite state OT: The unidirectional and the bidirectional case. In: Proceedings of FSMNLP 2001, an ESSLLI Workshop, Helsinki (2001) 35–40

Karttunen, L.: The replace operator. In: 33rd ACL 1995, Proceedings of the Conference, Cambridge, MA, USA (1995) 16–23

Karttunen, L.: Directed replace operator. In Roche, E., Schabes, Y., eds.: Finite-State Language Processing, A Bradford Book, The MIT Press, Cambridge, Massachusetts (1996) 117–147

Kempe, A., Karttunen, L.: Parallel replacement in finite state calculus. In: 16th COLING 1996, Proceedings of the Conference, Volume 2, Copenhagen, Denmark (1996) 622–627

Lindén, K., Axelson, E., Hardwick, S., Silfverberg, M., Pirinen, T.: HFST - Framework for Compiling and Applying Morphologies. Communications in Computer and Information Science, vol. 100, pp. 67-85. Springer Berlin Heidelberg (2011)

Lindén, K., Silfverberg, M., Pirinen, T.: HFST tools for morphology - an efficient open-source package for construction of morphological analyzers. In: Mahlow, C., Piotrowski, M. (eds.) State of the Art in Computational Morphology. Communications in Computer and Information Science, vol. 41, pp. 28-47. Springer Berlin Heidelberg (2009)

Yli-Jyrä, A., Koskenniemi, K.: A new method for compiling parallel replacement rules. In Holub, J., Žďárek, J., eds.: Implementation and Application of Automata, 12th International Conference, CIAA 2007, Revised Selected Papers. Volume 4783 of LNCS, Springer (2007) 320–321

Yli-Jyrä, A.: Applications of diamonded double negation. In: Finite-State Methods and Natural Language Processing, 6th International Workshop, FSMNLP 2007, Potsdam, Germany, September 14-16, Revised Papers. Universitätsverlag Potsdam (2008a) 6–30

Yli-Jyrä, A.: Transducers from parallel replace rules and modes with generalized lenient composition. In: Finite-State Methods and Natural Language Processing, 6th International Workshop, FSMNLP 2007, Potsdam, Germany, September 14-16, Revised Papers. Universitätsverlag Potsdam (2008b) 197–212

First approaches on Spanish medical record classification using Diagnostic Term to class transduction

A. Casillas(1), A. Díaz de Ilarraza(2), K. Gojenola(2), M. Oronoz(2), A. Pérez(2)
(1) Dep. Electricity and Electronics
(2) Dep. Computer Languages and Systems
University of the Basque Country (UPV/EHU)
[email protected]

Abstract

This paper presents an application of finite-state transducers to the domain of medicine. The objective is to assign disease codes to each Diagnostic Term in the medical records generated by the Basque Health Hospital System. As a starting point, a set of manually coded medical records was collected in order to code new medical records on the basis of this set of positive samples. Since the texts are written in natural language by doctors, the same Diagnostic Term might show alternative forms. Hence, trying to code a new medical record by exact matching against the samples in the set is not always feasible due to sparsity of data. In an attempt to increase the coverage of the data, our work centered on applying a set of finite-state transducers that helped the matching process between the positive samples and a set of new entries. That is, these transducers allowed not only exact matching but also approximate matching. While there are related works in languages such as English, this work presents the first results on the automatic assignment of disease codes to medical records written in Spanish.

1 Introduction

During the last years an exponential increase in the number of electronic documents in the medical domain has occurred. The automatic processing of these documents allows retrieving information, helping the health professionals in their work. There are different sorts of valuable data that help to exploit medical information. Our framework relies on the classification of Medical Records (MRs) according to a standard. In our context, the MRs produced in a hospital have to be classified with respect to the World Health Organization's 9th Revision of the International Classification of Diseases(1) (ICD-9). ICD-9 is designed for the classification of morbidity and mortality information and for the indexing of hospital records by disease and procedure. The already classified MRs are stored in a database that serves for further classification purposes. Each MR consists of two pieces of information:

Diagnostic Terms (DTs): one or more terms that describe the diseases corresponding to the MR.

Body-text: a description of the patient's details, antecedents, symptoms, adverse effects, methods of administration of medicines, etc.

Even though the DTs are within a limited domain, their description is not subject to a standard. Doctors express the DTs in natural language with their own style and different degrees of precision. Usually, a given concept might be expressed by alternative DTs with variations due to modifiers, abbreviations, acronyms, dates, names, misspellings or style. This is a typical problem that arises in natural language processing due to the fact that doctors focus on the patients and not so much on the writing of the MR. On account of this, there is ample variability in the presentation of the DTs. Consequently, it is not a straightforward task to get the corresponding ICD-codes. That is, the task is by far more complex than a standard dictionary lookup.

(1) http://www.cdc.gov/nchs/icd/icd9.htm


The Basque Health Hospital System is concerned with the automatization of this ICD-code assignment task. So far, the hospital processes the daily produced documents in the following sequence:

1. Automatic: exact match of the DTs in a set of manually coded samples.
2. Semi-automatic: through semantic match, ranking the DTs by means of machine-learning techniques. This stage requires that experts select amongst the ranked choices.
3. Manual: the documents that were not matched in the previous two stages are examined by professional coders assigning the codes manually.

The goal of this paper is to bypass the variability associated with natural language descriptions in an attempt to maximize the proportion of automatically assigned codes, as the Hospital System aims to expand the use of the automatic codification of MRs to more hospitals. According to experts, even an increase of 1% in exact match would represent a significant improvement, allowing to gain time and resources.

Related work can be found in the literature. For instance, Pestian et al. (2007) reported on a shared task involving the assignment of ICD-codes to radiology reports written in English from a reduced set of 45 codes. In general it implied the examination of the full MR (including body-text). In our case, the number of ICD-codes is above 1,000, although we restrict ourselves to exact and approximate match over the diagnoses.

Farkas and Szarvas (2008) used machine learning for the automatic assignment of ICD-9 codes. Their results showed that hand-crafted systems could be reproduced by replacing several laborious steps in their construction with machine learning models.

Tsuruoka et al. (2008) presented a system that tried to normalize different variants of the terms contained in a medical dictionary, automatically deriving normalizing rules for genes, proteins, chemicals and diseases in English.

The contribution of this work is: i) to collect manually coded MRs in Spanish; ii) to apply approximate transduction with finite-state (FS) models for automatic MR coding and, iii) to assess the performance of the proposed FS transduction approaches.

2 Approximate transduction

As was previously mentioned, there are variations regarding the DT descriptions due to style, misspellings, etc. Table 1 shows several pairs of DTs and ICD-codes within the collected samples that illustrate some of those variations.

     DT                                     ICD
  1  Adenocarcinoma de prostata             185
  2  Adenocarcinomas prostata.              185
  3  Ca. prostata                           185
  4  CANCER DE PROSTATA                     185
  5  adenocarcinoma de pulmon estadio IV    1629
  6  CA pulmon estadio 4                    1629
  7  ADENOCARCINOMA PANCREAS                1579

Table 1: Examples of DTs and their ICD-codes.

There are differences in the use of uppercase/lowercase; omissions of accents; use of both standard and non-standard abbreviations (e.g. ca. for both cáncer and adenocarcinoma); punctuation marks (incidental use of full-stop as commas, etc.); omission of prepositions (see rows 1 and 2); equivalence between Roman and Arabic numerals (rows 5 and 6). Due to these variations, our problem can be defined as an approximate lookup in a dictionary.

2.1 Finite-state models

The Foma toolkit was used to build the FS machines and code the evaluation sets. Foma (Hulden, 2009) is a freely available(2) toolkit that allows both building and parsing FS automata and transducers. Foma offers a versatile layout that supports imports/exports from/to other tools such as Xerox XFST (Beesley and Karttunen, 2003), AT&T (Mehryar Mohri and Riley, 2003) and OpenFST (Riley et al., 2009). There are, as well, outstanding alternatives such as HFST (Lindén et al., 2010). Refer to (Yli-Jyrä et al., 2006) for a thorough inventory of FS resources.

The FS models in Figure 1 perform the conversions necessary to carry out a soft match between the dictionary entries and their variants.

(2) http://code.google.com/p/foma

define Accents [a:á|e:é|i:í|o:ó|u:ú|...];
define Case [a:A|b:B|c:C|d:D|e:E|f:F|...];
define Spaces [..] (->) " " || [.#. | "."] _ .#.;
define Punctuation ["."|"-"|" "]:["."|"-"|" "];
define Plurals [..] -> ([s|es]) || _ [.#. | "." | " "];
define PluralsI [s|es] (->) "" || _ [.#. | "." | "," | " "];
define Preps [..] (->) [de |del |con |por ] || " " _ ;
define Disease [enf|enf.|enfermedad]:[enf|enf.|enfermedad];
define AltCa [tumor|ca|ca.|carcinoma|adenocarcinoma|cáncer];
define TagNormCa AltCa:AltCa;
define AltIzq [izquierdo|izquierda|izq|izq.|izqda|izqda.|
               izqdo|izqdo.|izda|izda.|izdo|izdo.];
define TagNormIzq AltIzq:AltIzq;
Figure 1: A simplified version in Foma source code of the regular expressions and transducers used to bypass several sources of distortion within the DTs in order to parse variations of unseen input DTs.

The expression Case matches uppercase and sources of distortion together. • lowercase versions of the DTs. 3 Experimental results There is a set of transducers (Spaces, • Punctuation, Plurals and PluralsI) To begin with, coded MRs produced in the hospi- that deal with the addition or deletion of spaces tal throughout 12 months were collected summing and separators (as full-stop, comma, and hy- up a total of 8,020 MRs as described in Table 2. phen) between words or at the end of the DT. Note that there are ambiguities in our data-set since there are 3,313 different DTs that have resulted in Prepositions. Many DTs can be differen- • 3,407 (DT, ICD-code) different pairs (as shown in tiated by the use or absence of prepositions, al- Table 2). That is, the same DT was not always as- though they correspond to the same ICD-code. signed the same ICD-code. For that reason, we designed a transducer that inserts or deletes the prepositions from a re- DT ICD-code duced set that were identified by inspection of entries 8,020 the training set. In this way, expressions as different entries 3,407 ”Adenocarcinoma prostata” and ”Adenocarci- different forms 3,313 1,011 noma de prostata” can be mapped to each other. Table 2: The data-set of (DT, ICD-code) pairs. Tag Normalization of synonyms, vari- • ants and abbreviations. The examination of the Next, the data-set was shuffled and divided into 3 DTs in the training set revealed that there were disjoint sets for training, development and test pur- several terms used indistinctly, including syn- poses as shown in Table 3. onyms and different kinds of variants (mascu- line and feminine) and abbreviations. For ex- train dev test ample, the words adenocarcinoma, adenoca., entries 6,020 1,000 1,000 carcinoma, ca, ca. and cancer serve to name different entries 2,825 734 728 the same disease. There are also multiple vari- ants of left/right, indicating the location of an Table 3: The data-set shuffled and divided into 3 sets illness, that do not affect the assignment of the ICD-code (e.g. izquierdo, izq., izda.). Using the set of mappings derived from the train- ing set we performed the experiments on the devel- Finally, all the FS transducers were composed opment set. After several rounds of tuning the sys- into a single machine that served to overcome all the tem, the resulting system was applied to the test set.

62 PERCENTAGE OF UNCLASSIFIED DTs TRAIN EVAL-SET exact-match + case-ins. + punct. + plurals +preps. + tag-norm. train dev 30.6 27.0 25.2 24.4 23.9 23.2 train test 29.8 26.7 25.1 24.8 24.3 23.2 train+dev test 27.7 24.5 23.0 22.9 22.5 21.4

Table 4: Performance of different FS machines in terms of the percentage of unclassified entries. All the classified entries were correctly classified, yielding, as a result, a precision of 100%.

Given a DT, the goal is to find its corresponding ICD-code despite the variations. Different FS approaches (described in Section 2.1) were proposed to bypass particular sources of noise in the DT. Their performance was assessed by means of the percentage of unclassified DTs, as summarized in Table 4. Note that the lower the number of unclassified DTs, the better the performance. Each of the three rows of Table 4 shows the results of a different experimental setup: in the first two rows the training set was used to build the models and either the development or the test set was evaluated in turn; in the third row, both the training and the development sets were used to build the model and the test set was evaluated. The impact of progressively adding the FS machines built to tackle particular sources of noise is shown by columns. Thus, the results of the last column represent the performance of the transducer allowing exact-match search together with case-insensitive search, bypassing punctuation marks, allowing plurals, bypassing prepositions and allowing tag-normalization. The composition of each transducer outperforms the previous result, yielding an improvement on the test set of 6 absolute points over the exact-match baseline, from 27.7% to 21.4%. As can be derived from the first column of Table 4, the test set contributed 27.7% new DTs with respect to the training+development set. Overall, the FSMs progressively improved the results of the three series of experiments carried out by more than 6 points. As a result, fewer and fewer DTs are left unclassified. In other words, the FS machines tackling different sources of errors contribute to assigning ICD-codes to previously unassigned DTs.

A manual inspection of the results associated with the evaluation of the development set (focus on the first row of Table 4) showed that all the classified DTs were correctly classified according to the training data. Overall, the resulting transducer was unable to classify 232 DTs out of 1,000 (see the last column in the first row). Among the unclassified DTs, 10 out of 232 were due to misspellings: e.g. cic atriz (instead of cicatriz), desprendimineot (instead of desprendimiento). In fact, spelling correction has reported improvements in related tasks (Patrick et al., 2010). The remaining DTs showed wider variations in their forms, such as an unexpected degree of specificity (e.g. named entities) or spurious dates and numbers.

4 Conclusions

Medical records in Spanish were collected, yielding a data set of 8,020 DT and ICD-code pairs. While there are a number of references dealing with English medical records, there are few for Spanish. The goal of this work was to build a system that, given a DT, would find its corresponding ICD-code as in a standard key-value dictionary. Yet, the DTs are far from standard, since they contain a number of variations. We proposed the use of several FS models to bypass different variants and provide ICD-codes even when the exact DT was not found. Each source of variation was tackled with a specific transducer based on handwritten rules. The composition of each machine improved the performance of the system gradually, leading to an improvement of up to 6 points in accuracy, from 27.7% unclassified DTs with the exact-match baseline to 21.4% with the tag-normalization transducer.

Future work will focus on the unclassified DTs. Together with FS models, other strategies shall be explored. Machine-learning strategies from the field of information retrieval might help to make the most of the piece of information that was discarded here (i.e. the body-text). All in all, regardless of the approach, the demand in this MR classification context is to reach an accuracy of 100%, possibly through the interactive inference framework (Toselli et al., 2011).
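To make the cascade concrete, here is a schematic Python analogue of the lookup (our illustration, not the authors' foma transducers): each layer relaxes the match a little further, mirroring the columns of Table 4. The layer names and the preposition list are invented for the example, and the plural- and tag-normalization layers are omitted.

    import string

    def fold_case(dt):
        return dt.lower()

    def drop_punctuation(dt):
        return dt.translate(str.maketrans("", "", string.punctuation))

    def strip_prepositions(dt):
        preps = {"de", "del", "a", "en", "por", "con"}  # illustrative only
        return " ".join(w for w in dt.split() if w not in preps)

    LAYERS = [lambda dt: dt, fold_case, drop_punctuation, strip_prepositions]

    def classify(dt, icd_dict):
        """Apply the layers cumulatively until some key matches."""
        key, table = dt, dict(icd_dict)
        for layer in LAYERS:
            key = layer(key)
            table = {layer(k): v for k, v in table.items()}
            if key in table:
                return table[key]
        return None  # the DT is left unclassified

In the real system each layer is a transducer and the layers are combined by composition; the dictionary simulation above only reproduces the observable behavior.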

Acknowledgments

The authors would like to thank the Hospital Galdakao-Usansolo for their contributions and support, in particular Javier Yetano, responsible for the Clinical Documentation Service. This research was supported by the Department of Industry of the Basque Government (IT344-10, S-PE11UN114, GIC10/158 IT375-10), the University of the Basque Country (GIU09/19) and the Spanish Ministry of Science and Innovation (MICINN, TIN2010-20218).

References

[Beesley and Karttunen2003] Kenneth R. Beesley and Lauri Karttunen. 2003. Finite State Morphology. CSLI Publications.

[Farkas and Szarvas2008] Richárd Farkas and György Szarvas. 2008. Automatic construction of rule-based ICD-9-CM coding systems. BMC Bioinformatics, 9(Suppl 3):S10.

[Hulden2009] Mans Hulden. 2009. Foma: a finite-state compiler and library. In EACL (Demos), pages 29–32. The Association for Computational Linguistics.

[Lindén et al.2010] Krister Lindén, Miikka Silfverberg, and Tommi Pirinen. 2010. HFST tools for morphology – an efficient open-source package for construction of morphological analyzers.

[Mohri et al.2003] Mehryar Mohri, Fernando C. N. Pereira, and Michael D. Riley. 2003. AT&T FSM Library – Finite-State Machine Library. www.research.att.com/sw/tools/fsm.

[Patrick et al.2010] Jon Patrick, Mojtaba Sabbagh, Suvir Jain, and Haifeng Zheng. 2010. Spelling correction in clinical notes with emphasis on first suggestion accuracy. In 2nd Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2010), LREC. ELRA.

[Pestian et al.2007] John P. Pestian, Chris Brew, Pawel Matykiewicz, D. J. Hovermale, Neil Johnson, K. Bretonnel Cohen, and Wlodzislaw Duch. 2007. A shared task involving multi-label classification of clinical free text. In Biological, translational, and clinical language processing, pages 97–104, Prague, Czech Republic, June. Association for Computational Linguistics.

[Riley et al.2009] Michael Riley, Cyril Allauzen, and Martin Jansche. 2009. OpenFst: an open-source, weighted finite-state transducer library and its applications to speech and language. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Tutorial Abstracts, pages 9–10, Boulder, Colorado, May. Association for Computational Linguistics.

[Toselli et al.2011] Alejandro H. Toselli, Enrique Vidal, and Francisco Casacuberta. 2011. Multimodal Interactive Pattern Recognition and Applications. Springer.

[Tsuruoka et al.2008] Yoshimasa Tsuruoka, John McNaught, and Sophia Ananiadou. 2008. Normalizing biomedical terms by minimizing ambiguity and variability. BMC Bioinformatics, 9(Suppl 3):S2.

[Yli-Jyrä et al.2006] A. Yli-Jyrä, K. Koskenniemi, and K. Lindén. 2006. Common infrastructure for finite-state based methods and linguistic descriptions. In Proceedings of the International Workshop Towards a Research Infrastructure for Language Resources, Genoa, May.

Developing an open-source FST grammar for verb chain transfer in a Spanish-Basque MT System

Aingeru Mayor, Mans Hulden, Gorka Labaka
Ixa Group, University of the Basque Country
[email protected], [email protected], [email protected]

Abstract

This paper presents the current status of development of a finite state transducer grammar for the verbal-chain transfer module in Matxin, a Rule-Based Machine Translation system between Spanish and Basque. Due to the distance between Spanish and Basque, the verbal-chain transfer is a very complex module in the overall system. The grammar is compiled with foma, an open-source finite-state toolkit, and yields a translation speed of 2,000 verb chains/second.

1 Introduction

This paper presents the current status of development of an FST (Finite State Transducer) grammar we have developed for Matxin, a Machine Translation system between Spanish and Basque.

Basque is a minority language isolate, and it is likely that an early form of this language was already present in Western Europe before the arrival of the Indo-European languages.

Basque is a highly inflected language with free order of sentence constituents. It is an agglutinative language, with a rich inflectional morphology.

Basque is also a so-called ergative-absolutive language, where the subjects of intransitive verbs appear in the absolutive case (which is unmarked), and where the same case is used for the direct object of a transitive verb. The subject of the transitive verb (that is, the agent) is marked differently, with the ergative case (in Basque by the suffix -k). The presence of this morpheme also triggers main and auxiliary verbal agreement. Auxiliary verbs, or 'periphrastic' verbs, which accompany most main verbs, agree not only with the subject, but also with the direct object and the indirect object, if present. Among European languages, this polypersonal system (multiple verb agreement) is rare, found only in Basque, some Caucasian languages, and Hungarian.

The fact that Basque is both a morphologically rich and less-resourced language makes the use of statistical approaches to Machine Translation difficult and raises the need to develop a rule-based architecture which in the future could be combined with statistical techniques.

The Matxin es-eu (Spanish-Basque) MT engine is a classic transfer-based system comprising three main modules: analysis of the Spanish text (based on FreeLing (Atserias et al., 2006)), transfer, and generation of the Basque target text.

In the transfer process, lexical transfer is first carried out using a bilingual dictionary coded in the XML format of Apertium dictionary files (.dix) (Forcada et al., 2009), and compiled, using the FST library implemented in the Apertium project (the lttoolbox library), into a finite-state transducer that can be processed very quickly.

Following this, structural transfer at the sentence level is performed, and some information is transferred from some chunks¹ to others while some chunks may be deleted. Finally, structural transfer at the verb chunk level is carried out.

¹A chunk is a non-recursive phrase (noun phrase, prepositional phrase, verbal chain, etc.) which expresses a constituent (Abney, 1991; Civit, 2003). In our system, chunks play a crucial part in simplifying the translation process, due to the fact that each module works only at a single level, either inside or between chunks.

The verbal chunk transfer is a very complex module because of the nature of Spanish and Basque auxiliary verb constructions, and it is the main subject of this paper.

This verb chain transfer module is implemented as a series of ordered replacement rules (Beesley and Karttunen, 2003) using the foma finite-state toolkit (Hulden, 2009). In total, the system consists of 166 separate replacement rules that together perform the verb chunk translation. In practice, the input is given to the first transducer, after which its output is passed to the second, and so forth, in a cascade. Each rule in the system is unambiguous in its output; that is, for each input in a particular step along the verb chain transfer, the transducers never produce multiple outputs (i.e. the transducers in question are functional). Some of the rules are joined together with composition, yielding a total of 55 separate transducers. In principle, all the rules could be composed together into one monolithic transducer, but in practice the size of the composed transducer is too large to be feasible. The choice to combine some transducers while leaving others separate is largely a memory/translation-speed tradeoff.

2 Spanish and Basque verb features and their translation

In the following, we will illustrate some of the main issues in translating Spanish verb chains to Basque. Since both languages make frequent use of auxiliary verb constructions, and since periphrastic verb constructions are frequent in Basque, transfer rules can get quite complex in their design.

For example, in translating the phrase

    (Yo) compro (una manzana)
    (I) buy (an apple)

we can translate it using the imperfective participle form (erosten) of the verb erosi (to buy), and a transitive auxiliary (dut) which itself contains both subject agreement information (I: 1st sg.) and number agreement with the object (an apple: 3rd sg.): (nik) (sagar bat) erosten dut. The participle carries information concerning meaning, aspect and tense, whereas the auxiliaries convey information about argument structure, tense and mood.

Table 1 illustrates the central idea of the verb chunk transfer. In the first four examples the form of the transitive auxiliary changes to express agreement with different ergative arguments (the subject of the clause), absolutive arguments (the direct object) and dative arguments (the indirect object). In the fifth example the future participle is used. The last example shows the translation of a periphrastic construction, in which the Spanish and the Basque word orders are completely different: this is reflected in the Spanish tengo que construction (have to), which appears before the main verb, whereas in Basque the equivalent (behar) appears after the main verb (erosi).

3 The FST grammar

We carry out the verbal chunk transfer using finite-state transducers (Alegria et al., 2005). The grammar rules take as input the Spanish verbal chunk, perform a number of transformations on the input, and then create and output the verbal chunk for Basque.

To illustrate the functioning of the grammar, let us consider the following example sentence in Spanish: "Un tribunal ha negado los derechos constitucionales a los presos políticos" (A court has denied constitutional rights to political prisoners). The correct translation into Basque given by the system for this example is as follows: Auzitegi batek eskubide konstituzionalak ukatu dizkie preso politikoei. Figure 1 shows a detailed overview of how the whole transfer of the verbal chunk is performed for this particular example.

First, the input to the grammar is assumed to be a string containing the following information (separated by the '&' symbol):

• the morphological information, using EAGLES-style tags (Leech and Wilson, 1996), for all nodes (separated by the '+' symbol) in the Spanish verbal chunk (haber[VAIP3S0]+negar[VMP00SM]);

• the morphological information of the subject ([sub3s]), the direct object ([obj3p]) and the indirect object ([iobj3p]);

• the translation of the main verb in Basque (ukatu) and information about its transitivity

Spanish sentence                    English                    Basque translation
(Yo) compro (una manzana)           (I) buy (an apple)         (Nik) (sagar bat) erosten dut
(Yo) compro (manzanas)              (I) buy (apples)           (Nik) (sagarrak) erosten ditut
(Tú) compras (manzanas)             (You) buy (apples)         (Zuk) (sagarrak) erosten dituzu
(Yo) (te) compro (una manzana)      (I) buy (you) (an apple)   (Nik) (zuri) (sagar bat) erosten dizut
(Yo) compraré (una manzana)         (I) will buy (an apple)    (Nik) (sagar bat) erosiko dut
(Yo) tengo que comprar (manzanas)   (I) must buy (apples)      (Nik) (sagarrak) erosi behar ditut

Table 1: Examples of translations

[Figure 1 spans this region; its recoverable content follows.]

Example sentence: Un tribunal ha negado los derechos constitucionales a los presos políticos, glossed word by word as "A court has denied (the) rights constitutional to (the) prisoners political", with Un tribunal as Subject, ha negado as Verb, los derechos constitucionales as Object and a los presos políticos as Indirect Object.

Input:
    haber[VAIP3S0]+negar[VMP00SM] & [sub3s] [obj3p] [iobj3p] & ukatu [DIO]

1. Identification of the schema: [ SimpleVerbEsType -> ... SimpleVerbEuSchema ]
    haber[VAIP3S0]+negar[VMP00SM] & [sub3s] [obj3p] [iobj3p] & ukatu [DIO]
      + SimpleVerb (main) AspectMain / Aux TenseMood Abs Dat Erg

2. Resolution of the values (Attrib. -> Value || Context):
    AspectMain -> [perfPart] || ?* VAIP ?* SimpleVerb ?* _
    Aux        -> edun(aux)  || ?* DIO ?* _
    TenseMood  -> [indPres]  || ?* VAIP ?* _
    Abs        -> [abs3p]    || ?* [obj3p] ?* edun(aux) ?* _
    Dat        -> [dat3p]    || ?* [iobj3p] ?* _
    Erg        -> [erg3s]    || ?* V???3S ?* edun(aux) ?* _

   (after resolution the string carries both source and target parts:
    ... + SimpleVerb (main)[perfPart] / edun(aux) [indPres] [abs3p][dat3p][erg3s])

3. Elimination of source information.

Output:
    ukatu(main)[perfPart] / edun(aux) [indPres] [abs3p][dat3p][erg3s]

Morpheme by morpheme, the generated form is uka+tu (deny, perfective participle) and d+i+zki+e+Ø (indicative present transitive auxiliary, 3rd pl. absolutive, 3rd pl. dative, 3rd sg. ergative).

Figure 1: Example of the transfer of a verbal chunk.

([DIO]), indicating a ditransitive construction:

    haber[VAIP3S0]+negar[VMP00SM] &
    [sub3s][obj3p][iobj3p] & ukatu[DIO]

The grammatical rules are organized into three groups according to the three main steps defined for translating verbal chunks:

1. Identification of the Basque verbal chunk schema corresponding to the source verbal chunk. There are twelve rules which perform this task, each of which corresponds to one of the following verbal chunks in Spanish: non-conjugated verbs, simple non-periphrastic verbs, as well as four different groups reserved for the periphrastic verbs.

The verbal chunk of the example in figure 1 is a simple non-periphrastic one, and the rule that handles this particular case is as follows:

    [simpleVerbEsType -> ... simpleVerbEuSchema]

When this rule matches the input string representing a simple non-periphrastic verbal chunk (simpleVerbEsType) it adds the corresponding Basque verbal chunk schema (simpleVerbEuSchema) to the end of the input string. simpleVerbEsType is a complex automaton that has the definition of the Spanish simple verbs. simpleVerbEuSchema is the type of the verbal chunk (SimpleVerb) and an automaton that contains as strings the pattern of elements (separated by the '/' symbol) that the corresponding Basque verb chunk will need to have (in this case, the main verb and the auxiliary verb):

    SimpleVerb (main) AspectMain / Aux TenseMood Abs Dat Erg

2. Resolution of the values for the attributes in the Basque schema. A total of 150 replacement rules of this type have been written in the grammar. Here are some rules that apply to the above example:

    [AspectMain -> [perfPart] || ?* VAIP ?* SimpleVerb ?* _ ]
    [Aux -> edun(aux) || ?* DIO ?* _ ]
    [Abs -> [abs3p] || ?* [obj3p] ?* edun(aux) ?* _ ]

3. Elimination of source-language information (4 rules in total).

The output of the grammar for the example is:

    ukatu(main)[perfPart] /
    edun(aux)[indPres][abs3p][dat3p][erg3s]

The first node has the main verb (ukatu) with the perfective participle aspect, and the second one contains the auxiliary verb (edun) with all its morphological information: indicative present and argument structure.

In the output string, each of the elements contains the information needed by the subsequent syntactic generation and morphological generation phases.

4 Implementation

When the verbal chunk transfer module was first developed, there did not exist any efficient open-source tools for the construction of finite-state transducers. At the time, the XFST toolkit (Beesley and Karttunen, 2003) was used to produce the earlier versions of the module: this included 25 separate transducers of moderate size, occupying 2,795 kB in total. The execution speed was roughly 250 verb chains per second. Since Matxin was designed to be open source, we built a simple compiler that converted the XFST rules into regular expressions that could then be applied without FST technology, at the cost of execution speed. This verbal chunk transfer module read and applied these regular expressions at a speed of 50 verbal chunks per second.

In the work presented here, we have reimplemented and expanded the original rules written for XFST with the foma² toolkit (Hulden, 2009). After adapting the grammar and compiling it, the 55 separate transducers occupy 607 kB and operate at roughly 2,000 complete verb chains per second.³ Passing the strings from one transducer to the next in the chain of 55 transducers is accomplished by the depth-first-search transducer chaining functionality available in the foma API.

²http://foma.sourceforge.net
³On a 2.8 GHz Intel Core 2 Duo.
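As a rough illustration of this cascade architecture (our sketch, not the actual Matxin rules; the contextual conditions of the real replacement rules are omitted), each step can be thought of as a functional string rewrite whose output feeds the next step:

    import re
    from typing import Callable, List

    Rewrite = Callable[[str], str]

    def rule(pattern: str, replacement: str) -> Rewrite:
        """One unambiguous replacement rule, compiled to a rewrite."""
        rx = re.compile(pattern)
        return lambda s: rx.sub(replacement, s)

    # Stand-ins for two of the value-resolution rules shown above.
    cascade: List[Rewrite] = [
        rule(r"AspectMain", "[perfPart]"),
        rule(r"Aux", "edun(aux)"),
    ]

    def transfer(chunk: str) -> str:
        for step in cascade:   # 55 transducers in the real system
            chunk = step(chunk)
        return chunk

Because every rule is functional, the cascade as a whole maps each input chunk to exactly one output, which is what makes the step-by-step chaining through the foma API safe.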

References

Abney, S. (1991). Parsing by chunks. In Principle-Based Parsing: Computation and Psycholinguistics, pages 257–278. Kluwer Academic, Boston.

Alegria, I., Díaz de Ilarraza, A., Labaka, G., Lersundi, M., Mayor, A., and Sarasola, K. (2005). An FST grammar for verb chain transfer in a Spanish–Basque MT system. In Finite-State Methods and Natural Language Processing, volume 4002, pages 295–296, Germany. Springer Verlag.

Atserias, J., Casas, B., Comelles, E., González, M., Padró, L., and Padró, M. (2006). FreeLing 1.3: Syntactic and semantic services in an open-source NLP library. In Proceedings of LREC, volume 6, pages 48–55.

Beesley, K. R. and Karttunen, L. (2003). Finite State Morphology. CSLI Publications, Stanford, CA.

Civit, M. (2003). Criterios de etiquetación y desambiguación morfosintáctica de corpus en español. PhD thesis, Universidad de Barcelona.

Forcada, M., Bonev, B. I., Ortiz-Rojas, S., Pérez-Ortiz, J. A., Ramírez-Sánchez, G., Sánchez-Martínez, F., Armentano-Oller, C., Montava, M. A., Tyers, F. M., and Ginestí-Rosell, M. (2009). Documentation of the open-source shallow-transfer machine translation platform Apertium. Technical report, Departament de Llenguatges i Sistemes Informàtics, Universitat d'Alacant. Available at http://xixona.dlsi.ua.es/~fran/apertium2-documentation.pdf.

Hulden, M. (2009). Foma: a finite-state compiler and library. In Proceedings of EACL 2009, pages 29–32.

Leech, G. and Wilson, A. (1996). EAGLES recommendations for the morphosyntactic annotation of corpora. Technical report, EAGLES Expert Advisory Group on Language Engineering Standards, Istituto di Linguistica Computazionale, Pisa, Italy.

Conversion of Procedural Morphologies to Finite-State Morphologies: a Case Study of Arabic

Mans Hulden
University of the Basque Country, IXA Group
IKERBASQUE, Basque Foundation for Science
[email protected]

Younes Samih
Heinrich-Heine-Universität Düsseldorf
[email protected]

Abstract

In this paper we describe a conversion of the Buckwalter Morphological Analyzer for Arabic, originally written as a Perl script, into a pure finite-state morphological analyzer. Representing a morphological analyzer as a finite-state transducer (FST) confers many advantages over running a procedural affix-matching algorithm. Apart from application speed, an FST representation immediately offers various possibilities to flexibly modify a grammar. In the case of Arabic, this is illustrated through the addition of the ability to correctly parse partially vocalized forms without overgeneration, something not possible in the original analyzer, as well as to serve both as an analyzer and a generator.

1 Introduction

Many lexicon-driven morphological analysis systems rely on a general strategy of breaking down input words into constituent parts by consulting customized lexicons and rules designed for a particular language. The constraints imposed by the lexica designed are then implemented as program code that handles co-occurrence restrictions and analysis of possible orthographic variants, finally producing a parse of the input word. Some systems designed along these lines are meant for general use, such as the hunspell tool (Halácsy et al., 2004), which allows users to specify lexicons and constraints, while others are language-dependent, such as the Buckwalter Arabic Morphological Analyzer (BAMA) (Buckwalter, 2004).

In this paper we examine the possibility of converting such morphological analysis tools to FSTs that perform the same task. As a case study, we have chosen to implement a one-to-one faithful conversion of the Buckwalter Arabic analyzer into a finite-state representation using the foma finite-state compiler (Hulden, 2009b), while also adding some extensions to the original analyzer. These are useful extensions which are difficult to add to the original Perl-based analyzer because of its procedural nature, but very straightforward to perform in a finite-state environment using standard design techniques.

There are several advantages to representing morphological analyzers as FSTs, as is well noted in the literature. Here, in addition to documenting the conversion, we shall also discuss and give examples of the flexibility, extensibility, and speed of application which result from using a finite-state representation of a morphology.¹

2 The Buckwalter Analyzer

Without going into an extensive linguistic discussion, we shall briefly describe the widely used Buckwalter morphological analyzer for Arabic. BAMA accepts as input Arabic words, with or without vocalization, and produces as output a breakdown of the affixes participating in the word and the stem, together with information about conjugation classes. For example, for the input word ktb (كتب), BAMA returns, among others:

    LOOK-UP WORD: ktb
    SOLUTION 1: (kataba) [katab-u_1]
    katab/VERB_PERFECT +a/PVSUFF_SUBJ:3MS
    (GLOSS): + write + he/it

¹The complete code and analyzer are available at http://buckwalter-fst.googlecode.com/


Figure 1: The Buckwalter Arabic Morphological Analyzer’s lookup process exemplified for the word lilkitAbi.
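The lookup process that Figure 1 illustrates, and that the following subsection describes step by step, can be sketched in Python roughly as follows (our toy stand-in, with invented mini-lexicons; the real system reads dictPrefixes, dictStems, dictSuffixes and the three compatibility tables):

    from itertools import product

    # Toy lexicons mapping morphemes to their classes.
    PREFIXES = {"": {"Pref-0"}, "k": {"NPref-Bi"}}
    STEMS    = {"ktb": {"PV"}, "tb": {"PV_V"}}
    SUFFIXES = {"": {"Suff-0"}, "a": {"NSuff-a"}}

    # Toy co-occurrence tables (standing in for tableAB, tableBC, tableAC).
    AB = {("Pref-0", "PV")}
    BC = {("PV", "Suff-0"), ("PV", "NSuff-a")}
    AC = {("Pref-0", "Suff-0"), ("Pref-0", "NSuff-a")}

    def factorizations(word):
        """All prefix-stem-suffix splits with a non-empty stem (step 2)."""
        for i in range(len(word)):
            for j in range(i + 1, len(word) + 1):
                yield word[:i], word[i:j], word[j:]

    def analyses(word):
        """Keep splits that are listed and compatible (steps 3 and 4)."""
        for p, s, x in factorizations(word):
            classes = product(PREFIXES.get(p, ()), STEMS.get(s, ()),
                              SUFFIXES.get(x, ()))
            for pc, sc, xc in classes:
                if (pc, sc) in AB and (sc, xc) in BC and (pc, xc) in AC:
                    yield (p, pc), (s, sc), (x, xc)

    print(list(analyses("ktb")))   # only the <∅, ktb, ∅> split survives

With the toy tables above, the split <k, tb, ∅> is found in the lexicons but rejected by the compatibility check, mirroring the example discussed below.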

2.1 BAMA lookup

In the BAMA system, every Arabic word is assumed to consist of a sometimes optional prefix, an obligatory stem, and a sometimes optional suffix.² The analysis is performed by a Perl script that carries out the following tasks:

1. Strips all diacritics (vowels) from the input word (since Arabic words may contain vocalization marks, which are not included in the lexicon lookup). Example: kataba → ktb.

2. Factors the input word into all possible combinations of prefix-stem-suffix. Stems may not be empty, while affixes are optional. Example: ktb → {<∅,k,tb>, <∅,kt,b>, <∅,ktb,∅>, <k,t,b>, <k,tb,∅>, <kt,b,∅>}.

3. Consults three lexicons (dictPrefixes, dictStems, dictSuffixes) to rule out impossible divisions. For example, <kt,b,∅> is rejected since kt does not appear as a prefix in dictPrefixes, while <k,tb,∅> is accepted since k appears in dictPrefixes, tb in dictStems, and ∅ in dictSuffixes.

4. Consults three co-occurrence constraint lists to further rule out incompatible prefix-stem combinations, stem-suffix combinations, and prefix-suffix combinations. For example, <k,tb,∅>, while accepted in the previous step, is now rejected because the file dictPrefixes lists k as a prefix belonging to class NPref-Bi, and the stem tb as belonging to one of PV_V, IV_V, NF, PV_C, or IV_C. However, the compatibility file tableAB does not permit a combination of prefix class NPref-Bi and any of the above-mentioned stem classes.

5. In the event that the lookup fails, the analyzer considers various alternative spellings of the input word, and runs through the same steps using the alternate spellings.

The BAMA lookup process is illustrated using a different example in figure 1.

²In reality, these are often conjoined prefixes treated as a single entry within the system.

3 Conversion

Our goal in the conversion of the Perl code and the lookup tables is to produce a single transducer that maps input words directly to their morphological analysis, including class and gloss information. In order to do this, we break the process down into three major steps:

(1) We construct a transducer Lexicon that accepts on its output side strings consisting of any combinations of fully vocalized prefixes, stems, and suffixes listed in dictPrefixes, dictStems, and dictSuffixes. On the input side, we find a string that represents the class each morpheme on the output side corresponds to, as well as the line number in the corresponding file where the morpheme appears. For example, the Lexicon transducer would contain the mapping

    [Pref-0]{P:34}[PV]{S:102658}[NSuff-a]{X:72}
    kataba

indicating that for the surface form kataba, the prefix class is Pref-0, appearing on line 34 in the file dictPrefixes; the stem class is PV, appearing on line 102,658 in dictStems; and the suffix class is NSuff-a, appearing on line 72 in dictSuffixes.

To construct the Lexicon, we produced a Perl script that reads the contents of the BAMA files and automatically constructs a LEXC-format file (Beesley and Karttunen, 2003), which is compiled with foma into a finite transducer (see figure 2).

    LEXICON Root
         Prefixes ;

    LEXICON Prefixes
    [Pref-%0]{P%:34}:0     Stems;
    [Pref-Wa]{P%:37}:wa    Stems;
    ...

    LEXICON Stems
    [Nprop]{S%:23}:|b      Suffixes;
    [Nprop]{S%:27}:...     Suffixes;
    ...

Figure 2: Skeleton of basic lexicon transducer in LEXC generated from BAMA lexicons.

(2) We construct rule transducers that filter out impossible combinations of morpheme classes based on the data in the constraint tables tableAB, tableBC, and tableAC. We then iteratively compose the Lexicon transducer with each rule transducer. This is achieved by converting each class mentioned in each of the constraint files to a constraint rule, which is compiled into a finite automaton. For example, the file tableBC, which lists co-occurrence constraints between stems and suffixes, contains only the following lines beginning with Nhy:

    Nhy NSuff-h
    Nhy NSuff-iy

indicating that the Nhy class only combines with the suffix classes NSuff-h and NSuff-iy. This is converted into a constraint of the form

    ["[Nhy]" => _ ?* ["[NSuff-h]"|"[NSuff-iy]"]];

This in effect defines the language where each instance of [Nhy] is always followed sometime later in the string by either [NSuff-h] or [NSuff-iy]. By composing this, and the other constraints, with the Lexicon transducer, we can filter out all illegitimate combinations of morphemes as dictated by the original Buckwalter files, by calculating:

    def Grammar Lexicon.i .o. Rule1 .o.
    ... RuleNNN ;

In this step, it is crucial to note that one cannot in practice build a separate, single transducer (or automaton) that models the intersection of all the lexicon constraints, i.e. Rule1 .o. Rule2 .o. ... RuleNNN, and then compose that transducer with the Lexicon transducer. The reason for this is that the size of the intersection of all co-occurrence rules grows exponentially with each rule. To avoid this intermediate exponential size, the Lexicon transducer must be composed with the first rule, whose composition is then composed with the second rule, etc., as above.

(3) As the previous two steps leave us with a transducer that accepts only legitimate combinations of fully vocalized prefixes, stems, and suffixes, we proceed to optionally remove short vowel diacritics, as well as to perform optional normalization of the letter Alif (ا), from the output side of the transducer. This means, for instance, that an intermediate kataba would be mapped to the surface forms kataba, katab, katba, katb, ktaba, ktab, ktba, and ktb. This last step assures that we can parse partially vocalized forms, fully vocalized forms, completely unvocalized forms, and common variants of Alif.

    def RemoveShortVowels
    [a|u|i|o|%~|%`] (->) 0;

    def NormalizeAlif
    ["|"|"<"|">"] (->) A .o.
    "{" (->) [A|"<"] ;

    def RemovefatHatAn [F|K|N] -> 0;

    def BAMA 0 <- %{|%} .o.
    Grammar .o.
    RemoveShortVowels .o.
    NormalizeAlif .o.
    RemovefatHatAn;

4 Results

Converting the entire BAMA grammar as described above produces a final FST of 855,267 states and 1,907,978 arcs, which accepts 14,563,985,397 Arabic surface forms. The transducer occupies 8.5 MB. An optional auxiliary transducer for mapping line numbers to complete long glosses and class names occupies an additional 10.5 MB. This is slightly more than the original BAMA files, which occupy 4.0 MB. However, having an FST representation of the grammar provides us with a number of advantages not available in the original BAMA, some of which we will briefly discuss.

4.1 Orthographical variants

The original BAMA deals with spelling variants and substandard spelling by performing Perl-regex replacements on the input string if lookup fails. In the BAMA documentation, we find replacements such as:

    - word final Y' should be y'
    - word final Y' should be }
    - word final y' should be }

In a finite-state system, once the grammar is converted, we can easily build such search heuristics into the FST itself using phonological replacement rules and various composition strategies such as priority union (Kaplan, 1987). We can thus mimic the behavior of the BAMA, albeit without incurring any extra lookup time.

4.2 Vocalization

As noted above, constructing the analyzer from the fully vocalized forms and then optionally removing vowels in surface variants allows us to parse partially vocalized Arabic forms more accurately. We thus rectify one of the drawbacks of the original BAMA, which makes no use of vocalization information even when it is provided. For example, given an input word qabol, BAMA would as a first step strip off all the vocalization marks, producing qbl. During the parsing process, BAMA could then match qbl with, for instance, qibal, an entirely different word, even though vowels were indicated. The FST design addresses this problem elegantly: if the input word is qabol, it will never match qibal, because the vocalized morphemes are used throughout the construction of the FST and only optionally removed from the surface forms, whereas BAMA used the unvocalized forms to match input. This behavior is in line with other finite-state implementations of Arabic, such as Beesley (1996), where diacritics, if they happen to be present, are taken advantage of in order to disambiguate and rule out illegitimate parses.

This is of practical importance when parsing Arabic, as writers often partially disambiguate words depending on context. For example, the word Hsbt is ambiguous (Hasabat = compute, charge; Hasibat = regard, consider). One would partially vocalize Hsbt as Hsibt to denote "she regards", or as Hsabt to imply "she computes." The FST-based system correctly narrows down the parses accordingly, while BAMA would produce all ambiguities regardless of the vocalization in the input.

4.3 Surface lexicon extraction

Having the BAMA represented as an FST also allows us to extract the output projection of the grammar, producing an automaton that only accepts legitimate words in Arabic. This can then be used in spell-checking applications, for example, by integrating the lexicon with weighted transducers reflecting frequency information and error models (Hulden, 2009a; Pirinen et al., 2010).
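The effect of the optional RemoveShortVowels rule can be imitated in a few lines of Python (an illustration of the mapping, not of how the FST computes it):

    from itertools import product

    SHORT_VOWELS = set("auio")   # plus the %~ and %` marks in the real rule

    def surface_variants(vocalized):
        """All surface forms reachable by optionally deleting short vowels."""
        choices = [("", c) if c in SHORT_VOWELS else (c,) for c in vocalized]
        return {"".join(pick) for pick in product(*choices)}

    # surface_variants("kataba") yields exactly the eight forms listed
    # in step (3): kataba, katab, katba, katb, ktaba, ktab, ktba, ktb.

Because deletion is optional rather than obligatory, a partially vocalized input like qabol remains distinguishable from qibal, which is the basis of the disambiguation behavior described in Section 4.2.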

4.4 Constraint analysis

Interestingly, the BAMA itself contains a vast amount of redundant information in the co-occurrence constraints. That is, some suffix-stem-lexicon constraints are entirely subsumed by other constraints and could be removed without affecting the overall system. This can be observed during the chain of compositions of the various transducers representing lexicon constraints. If a constraint X fails to remove any words from the lexicon—something that can be ascertained by noting that the number of paths through the new transducer is the same as in the transducer before composition—it is an indication that a previous constraint Y has already subsumed X. In short, the constraint X is redundant. The original grammar cannot be consistently analyzed for redundancies as it stands. However, redundant constraints can be detected when compiling the Lexicon FST together with the set of rules, offering a way to streamline the original grammar.

5 Conclusion

We have shown a method for converting the table-based and procedural constraint-driven Buckwalter Arabic Morphological Analyzer into an equivalent finite-state transducer. By doing so, we can take advantage of established finite-state methods to provide faster and more flexible parsing and also use the finite-state calculus to produce derivative applications that were not possible using the original table-driven Perl parser, such as spell checkers, normalizers, etc. The finite-state transducer implementation also allows us to parse words with any vocalization without sacrificing accuracy.

While the conversion method in this case is specific to the BAMA, the general principle illustrated in this paper can be applied to many other procedural morphologies that rule out morphological parses by first consulting a base lexicon and subsequently applying a batch of serial or parallel constraints over affix occurrence.

References

Attia, M., Pecina, P., Toral, A., Tounsi, L., and van Genabith, J. (2011). An open-source finite state morphological transducer for modern standard Arabic. In Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing, pages 125–133. Association for Computational Linguistics.

Beesley, K. R. (1996). Arabic finite-state morphological analysis and generation. In Proceedings of COLING'96, Volume 1, pages 89–94. Association for Computational Linguistics.

Beesley, K. R. and Karttunen, L. (2003). Finite State Morphology. CSLI Publications, Stanford, CA.

Buckwalter, T. (2004). Arabic Morphological Analyzer 2.0. Linguistic Data Consortium (LDC).

Habash, N. (2010). Introduction to Arabic natural language processing. Synthesis Lectures on Human Language Technologies.

Halácsy, P., Kornai, A., Németh, L., Rung, A., Szakadát, I., and Trón, V. (2004). Creating open language resources for Hungarian. In Proceedings of the Language Resources and Evaluation Conference (LREC04). European Language Resources Association.

Hulden, M. (2009a). Fast approximate string matching with finite automata. Procesamiento del Lenguaje Natural, 43:57–64.

Hulden, M. (2009b). Foma: a finite-state compiler and library. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 29–32. Association for Computational Linguistics.

Kaplan, R. M. (1987). Three seductions of computational psycholinguistics. In Whitelock, P., Wood, M. M., Somers, H. L., Johnson, R., and Bennett, P., editors, Linguistic Theory and Computer Applications, London. Academic Press.

Pirinen, T., Lindén, K., et al. (2010). Finite-state spell-checking with weighted language and error models. In Proceedings of the LREC 2010 Workshop on Creation and Use of Basic Lexical Resources for Less-Resourced Languages.

A methodology for obtaining concept graphs from word graphs

Marcos Calvo, Jon Ander Gómez, Lluís-F. Hurtado, Emilio Sanchis
Departament de Sistemes Informàtics i Computació
Universitat Politècnica de València
Camí de Vera s/n, 46022, València, Spain
{mcalvo,jon,lhurtado,esanchis}@dsic.upv.es

Abstract

In this work, we describe a methodology based on the Stochastic Finite State Transducers paradigm for Spoken Language Understanding (SLU) for obtaining concept graphs from word graphs. In the edges of these concept graphs, both semantic and lexical information are represented. This makes these graphs a very useful representation of the information for SLU. The best path in these concept graphs provides the best sequence of concepts.

1 Introduction

The task of SLU can be seen as the process that, given an utterance, computes a semantic interpretation of the information contained in it. This semantic interpretation will be based on a task-dependent set of concepts.

An area where SLU systems are typically applied is the construction of spoken dialog systems. The goal of the SLU subsystem in the context of a dialog system is to process the information given by the Automatic Speech Recognition (ASR) module, and provide the semantic interpretation of it to the Dialog Manager, which will determine the next action of the dialog. Thus, the work of the SLU module can be split into two subtasks: the first of them is the identification of the sequence of concepts and the segments of the original sentence according to them, and the other is the extraction of the relevant information underlying these labeled segments. In this work we will focus on concept labeling, but we will also consider the other subtask in our evaluation.

We can distinguish between the SLU systems that work with the 1-best transcription and those that take a representation of the n-best (Hakkani-Tür et al., 2006; Tur et al., 2002). The use of a word graph as the input of the SLU module makes this task more difficult, as the search space becomes larger. On the other hand, the advantage of using word graphs is that there is more information that could help to find the correct semantic interpretation, rather than just taking the best sentence given by the ASR.

In the recent literature, a variety of approaches for automatic SLU have been proposed, like those explained in (Hahn et al., 2010; Raymond and Riccardi, 2007; McCallum et al., 2000; Macherey et al., 2001; Lefèvre, 2007; Lafferty et al., 2001). The methodology that we propose in this paper is based on Stochastic Finite State Transducers (SFST). This is a generative approach that composes several transducers containing acoustic, lexical and semantic knowledge. Our method performs this composition on-the-fly, obtaining as a result a concept graph, where semantic information is associated with segments of words. To carry out this step, we use a different language model for each concept, and we also study the use of lexical categorization and lemmas. The best sequence of concepts can be determined by finding the best path in the concept graph, with the help of a language model of sequences of concepts.

The rest of this paper is structured as follows. In Section 2, the theoretical model for SLU based on SFST is briefly presented. Then, in Section 3 the methodology for converting word graphs into concept graphs is described. An experimental evaluation of this methodology for the SLU task is shown in Section 4. Finally, we draw some conclusions and future work.

2 The SFST approach for SLU

The Bayes classifier for the SLU problem can be expressed as stated in equation 1, where C represents a sequence of concepts or semantic labels and A is the utterance that constitutes the input to the system.

    Ĉ = arg max_C p(C|A)                        (1)

Taking into account the underlying sequence of words W, and assuming that the acoustics may depend on W but not on C, this equation can be rewritten as follows:

    Ĉ = arg max_C max_W p(A|W) · p(W, C)        (2)

To compute the best sequence of concepts Ĉ expressed as in equation 2, the proposal made by the paradigm based on SFST is to search for the best path in a transducer λ_SLU, the result of composing four SFSTs:

    λ_SLU = λ_G ∘ λ_gen ∘ λ_W2C ∘ λ_SLM         (3)

In this equation λ_G is an SFST provided by the ASR module where the acoustic probabilities p(A|W) are represented, λ_gen introduces prior information of the task by means of a lexical categorization, λ_W2C provides the probability of a sequence of words and labels it with a semantic label, and λ_SLM models a language model of sequences of concepts.

3 From word graphs to concept graphs

The output of an ASR can be represented as a word graph. This word graph can be enriched with semantic information, obtaining a concept graph. This concept graph constitutes a useful representation of the possible semantics, considering the uncertainty expressed in the original word graph. Finally, finding the best path in the concept graph using a language model of sequences of concepts provides as a result the best sequence of concepts Ĉ, the recognized sentence W̃, and its segmentation according to Ĉ.

3.1 Topology and semantics of the word graph

To perform the transformations for obtaining the concept graph, the input graph given by the ASR should represent the information in the following way. First, its nodes will be labeled with timestamps. Also, for every two nodes i, j such that i < j−1, there will be an edge from i to j labeled with w and weight s if the ASR detected w between the instants i and j−1 with an acoustic score s. Finally, there may exist a λ-transition between any pair of adjacent nodes. The score of this edge should be computed by means of a smoothing method.

Defining the word graph in this way allows us to model on it the distribution p(A|w), where A is the sequence of acoustic frames between the initial and final nodes of any edge, and w the word attached to it. This probability distribution is represented in the theoretical model by λ_G.

3.2 Building the concept graph

The concept graph that is obtained has the following features. First, its set of nodes is the same as in the word graph, and its meaning is kept. There is at most one edge between every two nodes i and j (i < j) labeled with the concept c. Every edge is labeled with a pair (W, c), where W is a sequence of words and c the concept that they represent. The weight of the edge is max_W (p(A_i^j|W) · p(W|c)), where A_i^j are the acoustic frames in the interval [i, j[ and W the argument that maximizes the former expression. In this specification appears the probability distribution p(W|c), which can be estimated by using a language model for each available concept.

This concept graph can be built using a Dynamic Programming algorithm that finds, for each concept c and each pair of nodes i, j, with i < j, the path from i to j on the word graph that maximizes p(A_i^j|W) · p(W|c). In this case, W is the sequence of words obtained by concatenating the words attached to the edges of the path. Each of the "best paths" computed in this way will become an edge in the resulting concept graph.

Thus, in the concept graph it is represented information about possible sequences of words that might have been uttered by the speaker, along with the concepts each of these sequences expresses. This pair is weighted with a score that is the result of combining the acoustic score expressed in the word graph and the lexical and syntactic score given by the language model, which is dependent on the current concept.
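The Dynamic Programming construction just outlined can be sketched as follows (a simplified illustration with invented function names, using a unigram concept model where the paper uses per-concept language models):

    def build_concept_edges(n_nodes, edges, concepts, word_logprob):
        """
        edges[i]: list of (j, word, acoustic_logscore) word-graph edges, i < j.
        word_logprob(c, w): log p(w | c) under the concept-specific model.
        Returns best[(i, j, c)] = (logscore, words), the concept-graph edges.
        """
        best = {}
        for c in concepts:
            for i in range(n_nodes):
                dp = {i: (0.0, [])}
                for u in range(i, n_nodes):  # nodes are ordered by timestamp
                    if u not in dp:
                        continue
                    sc, ws = dp[u]
                    for (v, w, a) in edges.get(u, []):
                        cand = (sc + a + word_logprob(c, w), ws + [w])
                        if v not in dp or cand[0] > dp[v][0]:
                            dp[v] = cand
                for j, (sc, ws) in dp.items():
                    if j > i and ws:
                        best[(i, j, c)] = (sc, ws)
        return best

Each surviving entry corresponds to one edge of the concept graph, carrying both the concept c and the word sequence W that maximizes the combined score.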

Furthermore, this information is enriched with temporal information, since the initial and final nodes of every edge represent the beginning and ending timestamps of the sequence of words. Consequently, this way of building the concept graph corresponds to the transducer λ_W2C of equation 3, since we find sequences of words and attach them to a concept. However, we also take advantage of and keep other information, such as the temporal one.

4 Experiments and results

To evaluate this methodology, we have performed SLU experiments using the concept graphs obtained as explained in Section 3 and then finding the best path in each of them. For this experimentation we have used the DIHANA corpus (Benedí et al., 2006). This is a corpus of telephone spontaneous speech in Spanish composed of 900 dialogs acquired from 225 speakers using the Wizard of Oz technique, with a total of 6,229 user turns. All these dialogs simulate real conversations in an automatic train information phone service. The experiments reported here were performed using the user turns of the dialogs, splitting them into a set of 1,340 utterances (turns) for test and all the remaining 4,889 for training. Some interesting statistics about the DIHANA corpus are given in Table 1.

    Number of words                        47,222
    Vocabulary size                           811
    Average number of words per user turn     7.6
    Number of concepts                         30

Table 1: Characteristics of the DIHANA corpus.

In the DIHANA corpus, the orthographic transcriptions of the utterances are semi-automatically segmented and labeled in terms of semantic units. This segmentation is used by our methodology as a language model of sequences of words for each concept. All the language models involved in this experimentation are bigram models trained using Witten-Bell smoothing and linear interpolation.

In our experimentation, we have considered three different ways of building the λ_gen transducer explained in Section 2. The first consists of a transducer that, given a word as its input, outputs that word with probability 1. This means that no generalization is being done.

The second λ_gen transducer performs a lexical categorization of some of the nouns of the vocabulary. Some extra words have been added to some lexical categories, in order to make the task more realistic, as the lexical coverage is increased. Nevertheless, it also makes the task harder, as the size of the vocabulary increases. We have used a total of 11 lexical categories.

Finally, the third λ_gen transducer we have generated performs the same lexical categorization but also includes a lemmatization of the verbs. This process is normally needed for real-world systems that work with spontaneous (and maybe telephonic) speech.

We have generated three sets of word graphs to take as the input for the method. The first of these sets, G1, is made up of the whole graphs obtained from a word-graph builder module that works without using any language model. The Oracle WER of these graphs is 4.10. With Oracle WER we mean the WER obtained considering the sequence of words S(G) corresponding to the path in the graph G that is the nearest to the reference sentence.

The second set, G2, is composed of word graphs that only contain the path corresponding to S(G) for each graph G ∈ G1. These graphs give an idea of the best results we could achieve if we could minimize the confusion due to misrecognized words.

The third set, G3, is formed by a synthetic word graph for each reference sentence, in which only that sentence is contained. This set of graphs allows us to simulate an experimentation on plain text.

For our evaluation, we have taken two measures. First, we have evaluated the Concept Error Rate (CER) over the best sequence of concepts. The definition of the CER is analogous to that of the WER but taking concepts instead of words. Second, we have also evaluated the slot-level error (SLE). The SLE is similar to the CER but deleting the non-relevant segments (such as courtesies) and substituting the relevant concepts by a canonic value for the sequence of words associated to them.

Tables 2, 3, and 4 show the results obtained using the different λ_gen transducers explained before.

    Input word graphs    CER      SLE
    G1                   31.794   35.392
    G2                   11.230    9.104
    G3                    9.933    5.321

Table 2: CER and SLE without any categorization.

    Input word graphs    CER      SLE
    G1                   34.565   38.760
    G2                   11.755    8.714
    G3                    9.633    4.516

Table 3: CER and SLE with lexical categorization.

    Input word graphs    CER      SLE
    G1                   36.536   40.640
    G2                   11.605    8.445
    G3                    9.458    4.064

Table 4: CER and SLE with lemmatization and lexical categorization.

From the results of Tables 2, 3, and 4 several facts come to light. First, we can see that, in all the experiments performed with the G1 set, the CER is lower than the SLE, while with the other sets the CER is larger than the SLE. This is due to the fact that the whole graphs obtained from the word-graph builder have more lexical confusion than those from G2 and G3, which are based on the reference sentence. This lexical confusion may cause a well-recognized concept to be associated with a misrecognized sequence of words. This would imply that a hit is counted for the CER calculation, while the value for this slot is missed.

Another interesting fact is that, for the G1 set, the more complex λ_gen transducers give the worse results. This is because in these graphs there is significant confusion between phonetically similar words, as the graphs were generated without any language model. This phonetic confusion, combined with the generalizations expressed by the lexical categorization and the lemmas, makes the task harder, which leads to worse results. Nevertheless, in a real-world application of this system these generalizations would be needed in order to have a larger coverage of the lexicon of the language. The experiments on G2 and G3 show that when the confusion introduced in the graphs due to misrecognized words is minimized, the use of lexical categorization and lemmatization helps to improve the results.

5 Conclusions and future work

In this paper we have described a methodology, based on the SFST paradigm for SLU, for obtaining concept graphs from word graphs. The edges of the concept graphs represent information about possible sequences of words that might have been uttered by the speaker, along with the concept each of these sequences expresses. Each of these edges is weighted with a score that combines acoustic, lexical, syntactic and semantic information. Furthermore, this information is enriched with temporal information, as the nodes represent the beginning and ending of the sequence of words. These concept graphs constitute a very useful representation of the information for SLU.

To evaluate this methodology we have performed an experimental evaluation in which different types of lexical generalization have been considered. The results show that a trade-off between the lexical confusion expressed in the word graphs and the generalizations encoded in the other transducers should be achieved, in order to obtain the best results.

It would be interesting to apply this methodology to word graphs generated with a language model, although this way of generating the graphs would not fit the theoretical model exactly. If a language model is used to generate the graphs, then their lexical confusion could be reduced, so better results could be achieved. Another interesting task in which this methodology could help is in performing SLU experiments on a combination of the outputs of several different ASR engines. All these interesting applications constitute a line of our future work.

Acknowledgments

This work has been supported by the Spanish Ministerio de Economía y Competitividad, under the project TIN2011-28169-C05-01, by the Vicerrectorat d'Investigació, Desenvolupament i Innovació de la Universitat Politècnica de València, under the project PAID-06-10, and by the Spanish Ministerio de Educación under FPU Grant AP2010-4193.

References

José-Miguel Benedí, Eduardo Lleida, Amparo Varona, María-José Castro, Isabel Galiano, Raquel Justo, Iñigo López de Letona, and Antonio Miguel. 2006. Design and acquisition of a telephone spontaneous speech dialogue corpus in Spanish: DIHANA. In Proceedings of LREC 2006, pages 1636–1639, Genoa (Italy).

S. Hahn, M. Dinarelli, C. Raymond, F. Lefèvre, P. Lehnen, R. De Mori, A. Moschitti, H. Ney, and G. Riccardi. 2010. Comparing stochastic approaches to spoken language understanding in multiple languages. Audio, Speech, and Language Processing, IEEE Transactions on, 6(99):1569–1583.

D. Hakkani-Tür, F. Bechet, G. Riccardi, and G. Tur. 2006. Beyond ASR 1-best: Using word confusion networks in spoken language understanding. Computer Speech & Language, 20(4):495–514.

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning, pages 282–289.

F. Lefèvre. 2007. Dynamic Bayesian networks and discriminative classifiers for multi-stage semantic interpretation. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, volume 4, pages 13–16. IEEE.

K. Macherey, F. J. Och, and H. Ney. 2001. Natural language understanding using statistical machine translation. In European Conference on Speech Communication and Technology, pages 2205–2208.

A. McCallum, D. Freitag, and F. Pereira. 2000. Maximum entropy Markov models for information extraction and segmentation. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 591–598.

C. Raymond and G. Riccardi. 2007. Generative and discriminative algorithms for spoken language understanding. In Proceedings of Interspeech 2007, Antwerp, Belgium, pages 1605–1608.

G. Tur, J. Wright, A. Gorin, G. Riccardi, and D. Hakkani-Tür. 2002. Improving spoken language understanding using word confusion networks. In Proceedings of the ICSLP.

A Finite-State Temporal Ontology and Event-intervals

Tim Fernando
Computer Science Department, Trinity College Dublin, Ireland
[email protected]

Abstract

A finite-state approach to temporal ontology for natural language text is described under which intervals (of the real line) paired with event descriptions are encoded as strings. The approach is applied to an interval temporal logic linked to TimeML, a standard mark-up language for time and events, for which various finite-state mechanisms are proposed.

1 Introduction

A model-theoretic perspective on finite-state methods is provided by an important theorem due to Büchi, Elgot and Trakhtenbrot (Thomas, 1997). Given a finite alphabet Σ, a system MSO_Σ of monadic second-order logic is set up with a binary relation symbol (for successor) and a unary relation symbol for each symbol in Σ, so that the formulae of MSO_Σ define precisely the regular languages over Σ (minus the null string). Extensions of this theorem to infinite strings and trees are fundamental to work on formal verification associated with Model Checking (Clarke et al., 1999). In that work, a well-defined computational system (of hardware or software) can be taken for granted, against which to evaluate precise specifications. The matter is far more delicate, however, with natural language semantics. It is not clear what models, if any, are appropriate for natural language. Nor is it obvious what logical forms natural language statements translate to. That said, there is a considerable body of work in linguistic semantics that uses model theory, and no shortage of natural language text containing information that cries out for extraction.

A step towards (semi-)automated reasoning about temporal information is taken in TimeML (Pustejovsky et al., 2003), a "mark-up language for temporal and event expressions" (www.timeml.org). The primary aim of the present paper is to show how finite-state methods can push this step further, by building strings, regular languages and regular relations to represent some basic semantic ingredients proposed for TimeML. An instructive example is sentence (1), which is assigned in (Pratt-Hartmann, 2005a; ISO, 2007) the logical form (2).

(1) After his talk with Mary, John drove to Boston.
(2) p(e) ∧ q(e′) ∧ after(e, e′)

If we read p(e) as "e is an event of John talking with Mary" and q(e′) as "e′ is an event of John driving to Boston", then (2) says "an event e′ of John driving to Boston comes after an event e of John talking with Mary." Evidently, (1) follows from (3) and (4) below (implicitly quantifying the variables e and e′ in (2) existentially).

(3) John talked with Mary from 1pm to 2pm.
(4) John drove to Boston from 2pm to 4pm.

But is (3) not compatible with (5) — and indeed implied by (5)?

(5) John talked with Mary from 1pm to 4pm.

Could we truthfully assert (1), given (4) and (5)? Or if not (1), perhaps (6)?

(6) After talking with Mary for an hour, John drove to Boston.


The acceptability of (6) suffers, however, if we are told (7).

(7) John drove toward Boston from 1pm to 2pm.

Clearly, individuating events, as (2) does, opens up a can of worms. But since at least (Davidson, 1967), there has been no retreating from events (Parsons, 1990; Kamp and Reyle, 1993; Pratt-Hartmann, 2005). Be that as it may, an appeal to events carries with it an obligation to provide a minimal account of what holds during these events and perhaps even a bit beyond. It is for such an account that finite-state methods are deployed below, viewed through the lens of the Büchi-Elgot-Trakhtenbrot theorem.¹ That lens gives temporal logic, the formulae of which — hereafter called fluents (for brevity) — may or may not hold at a string position, conceived as time and ordered according to succession within the string. For example, we can introduce a fluent p for "John talked with Mary" and a fluent q for "John drove to Boston" to form the string [p][q] (of length 2) for "after John talked with Mary, John drove to Boston." The idea is that a string α₁⋯αₙ of boxes αᵢ describes a sequence t₁, …, tₙ of n times, tᵢ coming before tᵢ₊₁, such that every fluent in αᵢ holds at tᵢ. To a first approximation, a box αᵢ is a snapshot at time tᵢ, making α₁⋯αₙ a cartoon or filmstrip. But just what is a time tᵢ: a temporal point or an interval?

For [p][q] to apply to (3) and (4), it is natural to regard tᵢ as an interval, setting up an account of the entailment from (5) to (3) in terms of the so-called subinterval property of John-talking-with-Mary (Bennett and Partee, 1972). John-driving-to-Boston, by contrast, does not have this property, necessitating the change from to Boston in (4) to toward Boston in (7). We can bring out this fact by representing individual events as strings, refining, for instance, our picture [q] of John's drive to Boston by adding a fluent r for "John in Boston" to form [q][q,r]. An event of motion is conceptualized as a finite sequence of snapshots in (Tenny, 1987) and elsewhere — a conceptualization resoundingly rejected in (Jackendoff, 1996) because

    it misrepresents the essential continuity of events of motion. For one thing, aside from the beginning and end points, the choice of a finite set of subevents is altogether arbitrary. How many subevents are there, and how is one to choose them? Notice that to stipulate the subevents as equally spaced, for instance one second or 3.5 milliseconds apart, is as arbitrary and unmotivated as any other choice. Another difficulty with a snapshot conceptualization concerns the representation of non-bounded events (activities) such as John ran along the river (for hours). A finite sequence of subevents necessarily has a specified beginning and ending, so it cannot encode the absence of endpoints. And excluding the specified endpoints simply exposes other specified subevents, which thereby become new endpoints. Thus encoding nonbounded events requires major surgery in the semantic representation. [page 316]

Jackendoff's objections are overcome below by finite-state manipulations that may well be called surgery. Following details supplied in the next section,² strings are formed from a finite set X of fluents that is allowed to vary so that

(i) the continuity desired by Jackendoff arises in the inverse limit of a system of projections π_X (defined below; Table 1), and

(ii) the temporal span of any finite string may, on expanding the set X, stretch without bound to the left (past) and/or to the right (future).

Applying π_X, section 2 proceeds to encode a model 𝒜 of an interval temporal logic as a string s(𝒜). Building on that encoding, section 3 develops finite-state methods for interval temporal logic. Section 4 concludes with proposals (drawing on work of the earlier sections) for extending the empirical (linguistic) coverage.

¹The alphabet Σ from which strings are formed is the family Pow(X) of subsets of some set X of fluents. A fluent corresponds to a monadic second-order variable in the Büchi-Elgot-Trakhtenbrot theorem.

²The present work extends a line of research most recently reported in (Fernando, 2011, 2011a, 2011b, 2012). That line is related to (Niemi and Koskenniemi, 2009), from which it differs in adopting an alphabet Pow(X) that equates succession in a string with temporal succession.
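As a concrete illustration (a minimal sketch of our own, not code from the paper), such strings of boxes can be represented directly, with a box as a set of fluents and a string as a tuple of such sets:

```python
# A box is a set of fluents; a string of boxes is a tuple of frozensets.
# p: "John talked with Mary", q: "John drove to Boston", r: "John in Boston".
p, q, r = "p", "q", "r"

after_talk = (frozenset({p}), frozenset({q}))     # the string [p][q]
drive      = (frozenset({q}), frozenset({q, r}))  # [q] refined to [q][q,r]

def holds(s, i, fluent):
    """Every fluent in the i-th box holds at the i-th time."""
    return fluent in s[i]

assert holds(after_talk, 1, q) and not holds(after_talk, 1, p)
```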

2 From event-intervals to strings

Before equating the set X of fluents with a model interpreting TimeML, let us bring out the intuition underlying the function π_X through a familiar example. We can represent a calendar year by the string

    s_mo = [Jan][Feb]⋯[Dec]

of length 12 (with a month in each box), or (adding one of 31 days d1, d2, …, d31) the string

    s_mo,dy = [Jan,d1][Jan,d2]⋯[Jan,d31][Feb,d1]⋯[Dec,d31]

of length 365 (a box per day in a non-leap year).³ Unlike the points in, say, the real line ℝ, a box can split if we enlarge the set X of fluents we can put in it, as illustrated by the change from [Jan] in s_mo to [Jan,d1][Jan,d2]⋯[Jan,d31] in s_mo,dy. Two functions link the strings s_mo,dy and s_mo:

(i) a function ρ_mo that keeps only the months in a box, so that

    ρ_mo(s_mo,dy) = [Jan]³¹[Feb]²⁸⋯[Dec]³¹

(ii) block compression bc, which compresses consecutive occurrences of a box into one, mapping ρ_mo(s_mo,dy) to

    bc([Jan]³¹[Feb]²⁸⋯[Dec]³¹) = s_mo

so that bc(ρ_mo(s_mo,dy)) = s_mo. As made precise in Table 1, ρ_X "sees only X" (setting mo = {Jan, Feb, …, Dec} to make ρ_mo an instance of ρ_X), while bc eliminates stutters, hardwiring the view that time passes only if there is change (or rather: we observe time passing only if we observe a change within a box). As this example shows, temporal granularity depends on the set X of observables that may go inside a box. Writing bc_X for the composition mapping s to bc(ρ_X(s)), we have

    bc_{Jan}(s_mo,dy) = bc_{Jan}(s_mo) = [Jan][ ]
    bc_{Feb}(s_mo,dy) = bc_{Feb}(s_mo) = [ ][Feb][ ]
    bc_{d3}(s_mo,dy) = ([ ][d3])¹²[ ]

(writing [ ] for the empty box). Now, the function π_X is bc_X followed by the deletion unpad of any initial or final empty boxes (Table 1). We can then define a fluent x to be an s-interval if π_{x}(s) is [x]. Next, let L_π(X) be the language consisting of strings π_X(s) for s ∈ Pow(X)* such that π_{x}(s) = [x] for all x ∈ X.⁴ Note that L_π({x}) = {[x]}, while for x ≠ x′, L_π({x, x′}) consists of 13 strings s_R, one per interval relation R in (Allen, 1983); see columns 1 and 2 of Table 2:

    L_π({x, x′}) = {s_R | R ∈ Allen}.

Table 1: Behind π_X(s) = unpad(bc(ρ_X(s)))

    ρ_X(α₁⋯αₙ)  ≝  (α₁ ∩ X)⋯(αₙ ∩ X)

    bc(s)  ≝  bc(αs′)      if s = ααs′
              α bc(α′s′)   if s = αα′s′ and α ≠ α′
              s            otherwise

    unpad(s)  ≝  unpad(s′)  if s = [ ]s′ or s = s′[ ]
                 s          otherwise

Table 2: The Allen relations via π_{x,x′}

    R ∈ Allen | s_R ∈ L_π({x,x′}) | χ_R([a,b],[a′,b′])
    x = x′    | [x,x′]            | a = a′, b = b′
    x s x′    | [x,x′][x′]        | a = a′, b < b′
    x si x′   | [x,x′][x]         | a = a′, b′ < b
    x f x′    | [x′][x,x′]        | a′ < a, b = b′
    x fi x′   | [x][x,x′]         | a < a′, b = b′
    x d x′    | [x′][x,x′][x′]    | a′ < a, b < b′
    x di x′   | [x][x,x′][x]      | a < a′, b′ < b
    x o x′    | [x][x,x′][x′]     | a < a′ ≤ b < b′
    x oi x′   | [x′][x,x′][x]     | a′ < a ≤ b′ < b
    x m x′    | [x][x′]           | --
    x mi x′   | [x′][x]           | --
    x < x′    | [x][ ][x′]        | b < a′
    x > x′    | [x′][ ][x]        | b′ < a

³In (Niemi and Koskenniemi, 2009), s_mo is represented as the string [m Jan ]m [m Feb ]m [m Mar ]m … [m Dec ]m of length 36 over 14 symbols (the 12 months plus the 2 brackets [m and ]m) on which finite-state transducers operate. (See the previous footnote.)

⁴Restricted to a finite alphabet, the maps ρ_X, bc, unpad and π_X are computable by finite-state transducers (Fernando, 2011).
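To make the machinery of Table 1 concrete, here is a minimal Python sketch (ours, not the paper's own code) of ρ_X, bc, unpad and their composite π_X, checked against the calendar example; only the month and day names are assumed:

```python
import calendar
from itertools import groupby

def rho(s, X):
    """See only the fluents in X: intersect every box with X."""
    return tuple(box & X for box in s)

def bc(s):
    """Block compression: collapse consecutive identical boxes."""
    return tuple(k for k, _ in groupby(s))

def unpad(s):
    """Delete initial and final empty boxes."""
    i, j = 0, len(s)
    while i < j and not s[i]: i += 1
    while j > i and not s[j - 1]: j -= 1
    return s[i:j]

def pi(s, X):
    """The projection pi_X(s) = unpad(bc(rho_X(s)))."""
    return unpad(bc(rho(s, X)))

months = [calendar.month_abbr[m] for m in range(1, 13)]           # Jan..Dec
days = {m: calendar.monthrange(2011, m)[1] for m in range(1, 13)} # non-leap
s_mo_dy = tuple(frozenset({months[m - 1], f"d{d}"})
                for m in range(1, 13) for d in range(1, days[m] + 1))
s_mo = tuple(frozenset({m}) for m in months)

assert len(s_mo_dy) == 365
assert pi(s_mo_dy, frozenset(months)) == s_mo          # months survive
assert pi(s_mo_dy, frozenset({"Jan"})) == (frozenset({"Jan"}),)  # s-interval
```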

For example, in the case of the "finish" relation f ∈ Allen,

    π_{x,x′}(s) = [x′][x,x′] ⟺ x f x′

provided x and x′ are s-intervals. The third column of Table 2 characterizes R ∈ Allen as a condition χ_R on pairs [a,b] and [a′,b′] of real numbers (in ℝ) denoting closed intervals⁵ — e.g.,

    [a,b] f [a′,b′] ⟺ a′ < a and b = b′.

This brings us to the semantics of TimeML proposed in (Pratt-Hartmann, 2005a). A system TPL of Temporal Preposition Logic is built from an infinite set E of event-atoms, and interpreted relative to the family

    𝕀 ≝ {[a,b] | a,b ∈ ℝ and a ≤ b}

of closed, bounded, non-empty intervals in ℝ. A TPL-model 𝒜 is defined to be a finite subset of 𝕀 × E. The intuition is that a pair ⟨I,e⟩ in 𝒜 represents "an occurrence of an event of type e over the interval" I (Pratt-Hartmann, 2005; page 17), reversing the construal in line (2) above of e as a token. Identifying occurrences with events, we can think of 𝒜 as a finite set of events, conceived as "intervals cum description" (van Benthem, 1983; page 113). Treating events as fluents, we have

Proposition 1. For every TPL-model 𝒜, there is a unique string s(𝒜) ∈ L_π(𝒜) such that for all x, x′ ∈ 𝒜 with x = ⟨I,e⟩ and x′ = ⟨I′,e′⟩,

    π_{x,x′}(s(𝒜)) = s_R ⟺ χ_R(I,I′)

for R ∈ Allen and s_R, χ_R specified in Table 2.

To construct the string s(𝒜), let Ends(𝒜) be the set of endpoints

    Ends(𝒜) ≝ ⋃_{I ∈ dom(𝒜)} ends(I)

where dom(𝒜) is the domain {I | (∃e ∈ E) ⟨I,e⟩ ∈ 𝒜} of 𝒜, and ends([a,b]) is the unordered pair {a,b}. Sorting gives

    Ends(𝒜) = {r₁, r₂, …, rₙ}

with r₁ < r₂ < ⋯ < rₙ. Breaking [r₁,rₙ] up into 2n−1 intervals, let

    αᵢ ≝ {⟨I,e⟩ ∈ 𝒜 | rᵢ ∈ I}          for 1 ≤ i ≤ n

and

    βᵢ ≝ {⟨I,e⟩ ∈ 𝒜 | [rᵢ,rᵢ₊₁] ⊆ I}   for 1 ≤ i < n.

Interleaving and block-compressing give

    s(𝒜) ≝ bc(α₁β₁ ⋯ αₙ₋₁βₙ₋₁αₙ)

(see Table 3 for an example). One may then verify (by induction on the cardinality of the domain of 𝒜) that s(𝒜) is the unique string in L_π(𝒜) satisfying the equivalence in Proposition 1.

Table 3: Example

    x₁ ≝ ⟨[1,5], e⟩          r₁ = 1, r₂ = 4, r₃ = 5,
    x₂ ≝ ⟨[4,9], e⟩          r₄ = 9, r₅ = 50
    x₃ ≝ ⟨[9,50], e′⟩
    𝒜 ≝ {x₁, x₂, x₃}         s(𝒜) = [x₁][x₁,x₂][x₂][x₂,x₃][x₃]

But is encoding 𝒜 as a string s(𝒜) adequate for TPL-satisfaction? Let us introduce TPL-formulae through an English example.

(8) During each of John's drives to Boston, he ate a donut.

(8) translates in TPL to (9), which is interpreted relative to a TPL-model 𝒜 and an interval I ∈ 𝕀 according to (10) and (11), with [e]φ abbreviating ¬⟨e⟩¬φ (as usual), ⊤ a tautology (in that 𝒜 ⊨_I ⊤ always) and ⊂ as strict (irreflexive) subset.

(9) [John-drive-to-Boston]⟨John-eat-a-donut⟩⊤

(10) 𝒜 ⊨_I ⟨e⟩φ ⟺ (∃J ⊂ I s.t. 𝒜(J,e)) 𝒜 ⊨_J φ

(11) 𝒜 ⊨_I ¬φ ⟺ not 𝒜 ⊨_I φ

Clause (10) shows off a crucial feature of TPL: quantification over intervals is bounded by the domain of 𝒜; that is, quantification is restricted to intervals that are paired up with an event-atom by the TPL-model 𝒜 (making TPL "quasi-guarded"; Pratt-Hartmann, 2005; page 5).

⁵Over non-empty closed intervals that include points [a,a], the Allen relations m and mi collapse to o and oi, respectively. Alternatively, we can realize m and mi by trading closed intervals for open intervals (required to be non-empty); see the Appendix below.

This is not to say that the only intervals I that may appear in forming 𝒜 ⊨_I φ are those in the domain of 𝒜. Indeed, for [a,b] ∈ 𝕀 and [a′,b′] ∈ dom(𝒜) such that [a′,b′] ⊂ [a,b], TPL uses the intervals

    init([a′,b′],[a,b]) ≝ [a,a′]
    fin([a′,b′],[a,b]) ≝ [b′,b]

to interpret {e}_<φ and {e}_>φ according to (12).

(12) 𝒜 ⊨_I {e}_<φ ⟺ (∃!J ⊂ I s.t. 𝒜(J,e)) 𝒜 ⊨_{init(J,I)} φ
     𝒜 ⊨_I {e}_>φ ⟺ (∃!J ⊂ I s.t. 𝒜(J,e)) 𝒜 ⊨_{fin(J,I)} φ

The bang ! in ∃!J in (12) expresses uniqueness, which means that under the translation of (1) as (13) below, the interval I of evaluation is required to contain a unique event of John talking with Mary.

(1) After his talk with Mary (p), John drove to Boston (q).

(13) {p}_>⟨q⟩⊤

For a translation of (1) more faithful to

(2) p(e) ∧ q(e′) ∧ after(e,e′)

than (13),⁶ it suffices to drop ! in (12) for ⟨e⟩_< and ⟨e⟩_> in place of {e}_< and {e}_> respectively (Fernando, 2011a), and to revise (13) to ⟨p⟩_>⟨q⟩⊤. Relaxing uniqueness, we can form [p]_>⟨q⟩⊤ for after every talk with Mary, John drove to Boston, as well as ⟨p⟩_>⟨p⟩⊤ for after a talk with Mary, John talked with Mary again. TPL has further constructs eᶠ and eˡ for the (minimal) first and (minimal) last e-events in an interval.

Returning to the suitability of s(𝒜) for TPL, consider the question: when do two pairs 𝒜,I and 𝒜′,I′ of TPL-models 𝒜, 𝒜′ and intervals I, I′ ∈ 𝕀 satisfy the same TPL-formulae? Some definitions are in order. A bijection f : A → B between finite sets A and B of real numbers is order-preserving if for all a, a′ ∈ A,

    a < a′ ⟺ f(a) < f(a′)

in which case we write f : A ≅ B. Given a TPL-model 𝒜 and a function f : Ends(𝒜) → ℝ, let 𝒜f be 𝒜 with all its intervals renamed by f

    𝒜f ≝ {⟨[f(a),f(b)], e⟩ | ⟨[a,b],e⟩ ∈ 𝒜}.

Now, we say 𝒜 is congruent with 𝒜′, and write 𝒜 ≅ 𝒜′, if there is an order-preserving bijection between Ends(𝒜) and Ends(𝒜′) that renames 𝒜 to 𝒜′

    𝒜 ≅ 𝒜′ ⟺ (∃f : Ends(𝒜) ≅ Ends(𝒜′)) 𝒜f = 𝒜′.

Finally, we bring I into the picture by defining the restriction 𝒜_I of 𝒜 to I to be the subset

    𝒜_I ≝ {⟨J,e⟩ ∈ 𝒜 | J ⊂ I}

of 𝒜 with intervals strictly contained in I.

Proposition 2. For all finite subsets 𝒜 and 𝒜′ of 𝕀 × E and all intervals I, I′ ∈ 𝕀, if 𝒜_I ≅ 𝒜′_{I′} then for every TPL-formula φ,

    𝒜 ⊨_I φ ⟺ 𝒜′ ⊨_{I′} φ.

Proposition 2 suggests normalizing a TPL-model 𝒜 with endpoints r₁ < r₂ < ⋯ < rₙ to nr(𝒜) with rᵢ renamed to i

    nr(𝒜) ≝ 𝒜f  where f = {⟨r₁,1⟩, …, ⟨rₙ,n⟩}.

Assigning every TPL-formula φ the truth set

    T(φ) ≝ {s(nr(𝒜_I)) | 𝒜 is a TPL-model, I ∈ 𝕀 and 𝒜 ⊨_I φ}

gives

Proposition 3. For every TPL-formula φ, TPL-model 𝒜, and interval I ∈ 𝕀,

    𝒜 ⊨_I φ ⟺ s(nr(𝒜_I)) ∈ T(φ).

To bolster the claim that T(φ) encodes TPL-satisfaction, we may construct T(φ) by induction on φ, mimicking the clauses for TPL-satisfaction, as in (14).

(14) T(φ ∧ φ′) = T(φ) ∩ T(φ′)

Details are provided in the next section, where we consider the finite-state character of the clauses, and may verify Propositions 2 and 3.

⁶Caution: e and e′ are tokens in (2), but types in 𝒜.
3 Regularity and relations behind truth

A consequence of Proposition 3 is that the entailment from φ to φ′ given by

    φ ⊢_{𝕀,E} φ′ ≝ (∀ finite 𝒜 ⊆ 𝕀 × E)(∀I ∈ 𝕀) 𝒜 ⊨_I φ implies 𝒜 ⊨_I φ′

becomes equivalent to the inclusion T(φ) ⊆ T(φ′), or to the unsatisfiability of φ ∧ ¬φ′

    φ ⊢_{𝕀,E} φ′ ⟺ T(φ ∧ ¬φ′) = ∅

assuming classical interpretations (14) and (15) of conjunction ∧ and negation ¬.

(15) T(¬φ) = Σ⁺ − T(φ)

Finite-state methods are of interest as regular languages are closed under intersection and complementation. (Context-free languages are not; nor is containment between context-free languages decidable.) The alphabet Σ in (15) is, however, infinite; Σ is the set Fin(𝕁 × E) of finite subsets of 𝕁 × E, where 𝕁 is the set

    𝕁 ≝ {[n,m] ∈ 𝕀 | n,m ∈ ℤ⁺}

of intervals in 𝕀 with endpoints in the set ℤ⁺ of positive integers 1, 2, … (containing the domain of a normalized TPL-model). As with π_X, regularity demands restricting Σ to a finite subalphabet — or better: subalphabets given by the set F of pairs ⟨𝕀₀,E₀⟩ of finite subsets 𝕀₀ and E₀ of 𝕁 and E respectively, for which

    Σ = ⋃_{⟨𝕀₀,E₀⟩ ∈ F} Pow(𝕀₀ × E₀).

The basis of the decidability/complexity results in (Pratt-Hartmann, 2005) is a lemma (number 3 in page 20) that, for any TPL-formula φ, bounds the size of a minimal model of φ. We get a computable function mapping a TPL-formula φ to a finite subset 𝕀_φ of 𝕁 just big enough so that if φ is TPL-satisfiable,

    (∃𝒜 ∈ Pow(𝕀_φ × E_φ))(∃I ∈ 𝕀_φ) 𝒜 ⊨_I φ

where E_φ is the finite subset of E occurring in φ. To minimize notational clutter, we leave out the choice ⟨𝕀₀,E₀⟩ ∈ F of a finite alphabet below.

Next, keeping intersection and complementation in (14) and (15) in mind, let us call an operation regularity-preserving (rp) if its output is regular whenever all its inputs are regular. To interpret TPL, we construe operations broadly to allow their inputs and output to range over relations between strings (and not just languages), construing a relation to be regular if it is computable by a finite-state transducer. For instance, the modal diamond ⟨e⟩ labelled by an event-atom e ∈ E is interpreted via an accessibility relation R(e) in the usual Kripke semantics

    T(⟨e⟩φ) = R(e)⁻¹ T(φ)

where R⁻¹L is the set {s ∈ Σ* | (∃s′ ∈ L) s R s′} of strings related by R to a string in L. The operation that outputs R⁻¹L on inputs R and L is rp. But what is the accessibility relation R(e)? Three ingredients go into making R(e):

(i) a notion of strict containment ⊐ between strings,

(ii) the demarcation s• of a string s, and

(iii) a set D(e) of strings representing full occurrences of e.

We take up each in turn, starting with ⊐, which combines two ways a string can be part of another. To capture strict inclusion ⊂ between intervals, we say a string s′ is a proper factor of a string s, and write s pfac s′, if s′ is s with some prefix u and suffix v deleted, and uv is non-empty

    s pfac s′ ⟺ (∃u,v) s = us′v and uv ≠ ε.

(Dropping the requirement uv ≠ ε gives factors simpliciter.) The second way a string s′ may be part of s applies specifically to strings of sets. We say s subsumes s′, and write s ⊵ s′, if they are of the same length, and ⊇ holds componentwise between them

    α₁⋯αₙ ⊵ α′₁⋯α′ₘ ⟺ n = m and α′ᵢ ⊆ αᵢ for 1 ≤ i ≤ n.

Now, writing R;R′ for the relational composition of binary relations R and R′ in which the output of R is fed as input to R′

    s R;R′ s′ ⟺ (∃s″) s R s″ and s″ R′ s′,

we compose pfac with ⊵ for strict containment ⊐

    ⊐ ≝ pfac ; ⊵  (= ⊵ ; pfac).

(It is well-known that relational composition ; is rp.)
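The two containment relations and their composite are straightforward to compute on the tuple-of-frozensets representation used in the sketches above (again ours, not the paper's):

```python
def pfac(s, s1):
    """s1 is s with a prefix u and suffix v deleted, uv non-empty."""
    n, m = len(s), len(s1)
    return m < n and any(s[i:i + m] == s1 for i in range(n - m + 1))

def subsumes(s, s1):
    """Same length, with componentwise superset (s_i contains s1_i)."""
    return len(s) == len(s1) and all(a >= b for a, b in zip(s, s1))

def properly_contains(s, s1):
    """Strict containment: pfac composed with subsumption."""
    m = len(s1)
    return m < len(s) and any(subsumes(s[i:i + m], s1)
                              for i in range(len(s) - m + 1))
```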

Next, the idea behind demarcating a string s is to mark the beginning and ending of every interval I mentioned in s, with fresh fluents bgn-I and I-end. The demarcation (α₁α₂⋯αₙ)• of α₁α₂⋯αₙ adds bgn-I to αᵢ precisely if

    there is some e such that ⟨I,e⟩ ∈ αᵢ and either i = 1 or ⟨I,e⟩ ∉ αᵢ₋₁

and adds I-end to αᵢ precisely if

    there is some e such that ⟨I,e⟩ ∈ αᵢ and either i = n or ⟨I,e⟩ ∉ αᵢ₊₁.

For s = s(𝒜) given by the example in Table 3,

    s• = [x₁,bgn-I₁][x₁,x₂,I₁-end,bgn-I₂][x₂][x₂,x₃,I₂-end,bgn-I₃][x₃,I₃-end]

(writing Iᵢ for the interval of xᵢ). We then form the denotation D_{𝕀₀}(e) of e relative to a finite subset 𝕀₀ of 𝕀 by demarcating every string in [⟨I,e⟩]⁺ with I ∈ 𝕀₀, as in (16).

(16) D_{𝕀₀}(e) ≝ ⋃_{I ∈ 𝕀₀} {s• | s ∈ [⟨I,e⟩]⁺}

To simplify notation, we suppress the subscript 𝕀₀ on D_{𝕀₀}(e). Restricting strict containment ⊐ to D(e) gives

    s R∘(e) s′ ⟺ s ⊐ s′ and s′ ∈ D(e)

from which we define R(e), making adjustments for demarcation

    s R(e) s′ ⟺ s• R∘(e) s′•.

That is, R(e) is the composition of demarcation ·•, R∘(e), and the inverse of demarcation. As TPL's other constructs are shown in §4.1 of (Fernando, 2011a) to be interpretable by rp operations, we have

Proposition 4. All TPL-connectives can be interpreted by rp operations.

Beyond TPL, the interval temporal logic HS of (Halpern and Shoham, 1991) suggests variants of ⟨e⟩φ with strict containment ⊐ in R(e) replaced by any of Allen's 13 interval relations R.

(17) 𝒜 ⊨_I ⟨e⟩_R φ ⟺ (∃J s.t. I R J) 𝒜(J,e) and 𝒜 ⊨_J φ

To emulate (17), we need to mark the evaluation interval I in 𝒜 by some r ∉ E, setting

    𝒜_r[I] ≝ 𝒜 ∪ {⟨I,r⟩}

rather than simply forming 𝒜_I (which will do if we can always assume the model's full temporal extent is marked).⁷ A string s = α₁⋯αₙ r-marks I if ⟨I,r⟩ ∈ ⋃ᵢ₌₁ⁿ αᵢ. If that interval is unique, we say s is r-marked, and write I(s) for the interval it r-marks, and s₋ for s with the fluent ⟨I(s),r⟩ deleted (so that s(𝒜_r[I])₋ = s(𝒜)). For any of the relations R ∈ Allen, we let ≈_R hold between r-marked strings that are identical except possibly for the intervals they r-mark, which are related by R

    s ≈_R s′ ⟺ s₋ = s′₋ and I(s) R I(s′).

Next, given an event-atom e, we let R(e)_R be a binary relation that holds between r-marked strings related by ≈_R, the latter of which picks out a factor subsuming some string in D(e)

    s R(e)_R s′ ⟺ s ≈_R s′ and (∃d ∈ D(e)) s′_r ⊵ d

where s′_r is the factor of s′• that begins with bgn-I(s′) and ends with I(s′)-end. Replacing 𝒜_I by 𝒜_r[I] in T(φ) for

    T_r(φ) ≝ {s(nr(𝒜_r[I])) | 𝒜 is a TPL-model, I ∈ 𝕀 and 𝒜 ⊨_I φ},

(17) corresponds to

    T_r(⟨e⟩_R φ) = (R(e)_R)⁻¹ T_r(φ).

⁷The markers bgn-I and I-end are analogous to the brackets [g and ]g in (Niemi and Koskenniemi, 2009), an essential difference being that a grain (type) g supports multiple occurrences of [g and ]g, in contrast to the (token) interval I.

4 Conclusion and future work

The key notion behind the analysis above of time in terms of strings is the map π_X, which, for X consisting of interval-event pairs ⟨I,e⟩, is applied in Proposition 1 to turn a TPL-model 𝒜 into a string s(𝒜). As far as TPL-satisfaction 𝒜 ⊨_I φ is concerned, we can normalize the endpoints of the intervals to an initial segment of the positive integers, after restricting 𝒜 to intervals contained in the evaluation interval I (Proposition 3). For a finite-state encoding of TPL-satisfaction, it is useful to demarcate the otherwise homogeneous picture [⟨I,e⟩]⁺ of ⟨I,e⟩, and to define a notion ⊐ of proper containment between strings. We close with further finite-state enhancements.

Demarcation is linguistically significant, bearing directly on telicity and the so-called Aristotle-Ryle-Kenny-Vendler classification (Dowty, 1979), illustrated by the contrasts in (18) and (19).

(18) John was driving ⊢ John drove
     John was driving to L.A. ⊬ John drove to L.A.

(19) John drove for an hour
     John drove to L.A. in an hour

The difference at work in (18) and (19) is that John driving to L.A. has a termination condition, in(John, L.A.), missing from John driving. Given a fluent φ such as in(John, L.A.), we call a language L φ-telic if for every s ∈ L, there is an n ≥ 0 such that s ⊵ [¬φ]ⁿ[φ] (which is to say: a string in L ends as soon as φ becomes true). L is telic if it is φ-telic, for some φ.
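The telicity test on an individual string is a one-line check once ¬φ is represented as an explicit fluent (a minimal sketch under that assumption; the fluent names are illustrative):

```python
def ends_as_soon_as(s, phi, not_phi):
    """s subsumes [not-phi]^n [phi]: the explicit fluent not-phi fills every
    box but the last, where phi has just become true."""
    return bool(s) and phi in s[-1] and all(not_phi in box for box in s[:-1])

def is_phi_telic(L, phi, not_phi):
    """A language is phi-telic when every string in it ends as soon as
    phi becomes true."""
    return all(ends_as_soon_as(s, phi, not_phi) for s in L)
```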
Now, the contrasts in (18) and (19) can be put down formally to the language for John driving to L.A. being telic, but not that for John driving (Fernando, 2008).

The demarcation (via φ) just described does not rely on some set 𝕀₀ of intervals from which fluents bgn-I and I-end are formed (as in s• from section 3). There are at least two reasons for attempting to avoid 𝕀₀ when demarcating or, for that matter, building the set D(e) of denotations of e. The first is that under a definition such as (16), the number of e-events (i.e., events of type e) is bounded by the cardinality of 𝕀₀.

(16) D_{𝕀₀}(e) ≝ ⋃_{I ∈ 𝕀₀} {s• | s ∈ [⟨I,e⟩]⁺}

The second is that an interval arguably has little to do with an e-event being an e-event. An interval [4,9] does not, in and of itself, make ⟨[4,9],e⟩ an e-event; ⟨[4,9],e⟩ is an e-event only in a TPL-model that says it is. An alternative is to express in strings what holds during an event that makes it an e-event. Consider the event type e of Pat walking a mile. Incremental change in an event of that type can be represented through a parametrized fluent f(r) with parameter r ranging over the reals in the unit interval [0,1], such that f(r) says Pat has walked r · (a mile). Let D(e) be

    [f(0)][↑f]⁺[f(1)]

where ↑f abbreviates the fluent

    (∃r < 1) f(r) ∧ Previous(¬f(r)).

Previous is a temporal operator that constrains strings α₁⋯αₙ so that whenever Previous(φ) belongs to αᵢ₊₁, φ belongs to αᵢ; that is,

    Previous(φ) ⇒ φ

using an rp binary operator ⇒ on languages that combines subsumption ⊵ with constraints familiar from finite-state morphology (Beesley and Karttunen, 2003).

The borders and interior of ⟨I,e⟩ aside, there is the matter of locating an e-string in a larger string (effected in TPL through strict inclusion ⊃, the string-analog of which is proper containment ⊐). But what larger string? The influential theory of tense and aspect in (Reichenbach, 1947) places e relative not only to the speech S but also to a reference time r, differentiating, for instance, the simple past [e,r][S] from the present perfect [e][S,r], as required by differences in defeasible entailments |∼, (20), and acceptability, (21).

(20) Pat has left Paris |∼ Pat is not in Paris
     Pat left Paris |≁ Pat is not in Paris

(21) Pat left Paris. (?Pat has left Paris.)
     But Pat is back in Paris.

The placement of r provides a bound on the inertia applying to the postcondition of Pat's departure (Fernando, 2008). The extension 𝒜_r[I] proposed in section 3 to the combination 𝒜_I (adequate for TPL, but not HS) explicitly r-marks the evaluation interval I, facilitating an account more intricate than simply of e's occurrence in the larger string. TPL goes no further than Ramsey in analyzing That Caesar died as an ontological claim that an event of certain sort exists (Parsons, 1990), leading to the view of an event as a truthmaker (Davidson, 1967; Mulligan et al., 1984). The idea of an event (in isolation) as some sort of proof runs into serious difficulties, however, as soon as tense and aspect are brought into the picture; complications such as the Imperfective Paradox (Dowty, 1979), illustrated in (22), raise tricky questions about what it means for an event to exist and how to ground it in the world (speaking loosely) in which the utterance is made.

(22) John was drawing a circle when he ran out of ink.

But while the burden of proof may be too heavy to be borne by a single pair ⟨I,e⟩ of interval I and event-atom e, the larger picture in which the pair is embedded can be strung out, and a temporal statement φ interpreted as a binary relation R_φ between such strings that goes well beyond ⊐. The inputs to R_φ serve as indices, with those in the domain of R_φ supporting the truth of φ

    φ is true at s ⟺ (∃s′) s R_φ s′

(Fernando, 2011, 2012). In witnessing truth at particular inputs, the outputs of R_φ constitute denotations more informative than truth values, from which indices can be built bottom-up, in harmony with a semantic analysis of text from its parts (to which presumably TimeML is committed). An obvious question is how far finite-state methods will take us. Based on the evidence at hand, we have much further to go.

Acknowledgments

My thanks to Daniel Isemann for useful, sustained discussions. The work is supported by EI Grant # CC-2011-2601B:GRCTC.

References

James F. Allen. 1983. Maintaining knowledge about temporal intervals. Communications of the Association for Computing Machinery 26(11): 832–843.

James F. Allen and George Ferguson. 1994. Actions and events in interval temporal logic. J. Logic and Computation, 4(5):531–579.

Kenneth R. Beesley and Lauri Karttunen. 2003. Finite State Morphology. CSLI, Stanford, CA.

Michael Bennett and Barbara Partee. 1972. Toward the logic of tense and aspect in English. Indiana University Linguistics Club, Bloomington, IN.

J.F.A.K. van Benthem. 1983. The Logic of Time. Reidel.

Edmund M. Clarke, Jr., Orna Grumberg and Doron A. Peled. 1999. Model Checking. MIT Press.

Donald Davidson. 1967. The logical form of action sentences. In The Logic of Decision and Action, pages 81–95. University of Pittsburgh Press.

David Dowty. 1979. Word Meaning and Montague Grammar. Kluwer.

Tim Fernando. 2008. Branching from inertia worlds. J. Semantics 25:321–344.

Tim Fernando. 2011. Regular relations for temporal propositions. Natural Language Engineering 17(2): 163–184.

Tim Fernando. 2011a. Strings over intervals. TextInfer 2011 Workshop on Textual Entailment, pages 50–58, Edinburgh (ACL Archives).

Tim Fernando. 2011b. Finite-state representations embodying temporal relations. In 9th International Workshop FSMNLP 2011, Blois, pages 12–20. A revised, extended version is on the author's webpage.

Tim Fernando. 2012. Steedman's temporality proposal and finite automata. In Amsterdam Colloquium 2011, Springer LNCS 7218, pages 301–310.

Joseph Y. Halpern and Yoav Shoham. 1991. A Propositional Modal Logic of Time Intervals. J. Association for Computing Machinery 38(4): 935–962.

ISO. 2007. ISO Draft International Standard 24617-1 Semantic annotation framework — Part 1: Time and events. ISO (Geneva).

Ray Jackendoff. 1996. The proper treatment of measuring out, telicity, and perhaps even quantification in English. Natural Language and Linguistic Theory 14:305–354.

Hans Kamp and Uwe Reyle. 1993. From Discourse to Logic. Kluwer.

Kevin Mulligan, Peter Simons and Barry Smith. 1984. Truth-makers. Philosophy and Phenomenological Research 44: 287–321.

Jyrki Niemi and Kimmo Koskenniemi. 2009. Representing and combining calendar information by using finite-state transducers. In 7th International Workshop FSMNLP 2008, pages 122–133. Amsterdam.

Terry Parsons. 1990. Events in the Semantics of English. MIT Press.

Ian Pratt-Hartmann. 2005. Temporal prepositions and their logic. Artificial Intelligence 166: 1–36.

Ian Pratt-Hartmann. 2005a. From TimeML to TPL*. In Annotating, Extracting and Reasoning about Time and Events, Schloss Dagstuhl.

James Pustejovsky, José Castaño, Robert Ingria, Roser Saurí, Robert Gaizauskas, Andrea Setzer and Graham Katz. 2003. TimeML: Robust Specification of Event and Temporal Expressions in Text. In 5th International Workshop on Computational Semantics, pages 337–353. Tilburg.

Hans Reichenbach. 1947. Elements of Symbolic Logic. Macmillan & Co (New York).

Carol Tenny. 1987. Grammaticalizing Aspect and Affectedness. PhD dissertation, Department of Linguistics and Philosophy, MIT.

Wolfgang Thomas. 1997. Languages, automata and logic. In Handbook of Formal Languages: Beyond Words, Volume 3, pages 389–455. Springer-Verlag.

Appendix: a case of "less is more"?

Because the set 𝕀 of intervals from which a TPL-model is constructed includes singleton sets [a,a] = {a} (for all real numbers a), there can never be events x and x′ in 𝒜 such that x meets (or abuts) x′, x m x′, according to Table 2 above. It is, however, easy enough to throw out sets [a,a] from 𝕀, requiring that for [a,b] ∈ 𝕀, a be strictly less than b. (In doing so, we follow (Allen, 1983) and (Pratt-Hartmann, 2005a), but stray from (Pratt-Hartmann, 2005).) The result is that the overlap at b between [a,b] and [b,c] is deemed un-observable (effectively re-interpreting closed intervals by their interiors, understood to be non-empty). The third column χ_R([a,b],[a′,b′]) in Table 2 is modified to a condition [a,b] R° [a′,b′] that differs on the cases where R is one of the four Allen relations o, m, oi, mi, splitting the disjunction a′ ≤ b in o with m, and a ≤ b′ in oi with mi.

    R ∈ Allen | s_R ∈ L_π({x,x′}) | [a,b] R° [a′,b′]
    x o x′    | [x][x,x′][x′]     | a < a′ < b < b′
    x m x′    | [x][x′]           | b = a′
    x oi x′   | [x′][x,x′][x]     | a′ < a < b′ < b
    x mi x′   | [x′][x]           | b′ = a

All other rows in Table 2 are the same for [a,b] R° [a′,b′]. The somewhat wasteful encoding s(𝒜) in Proposition 1 then becomes s°(𝒜) in

Proposition 1°. For every TPL-model 𝒜 such that a < b for all [a,b] ∈ dom(𝒜), there is a unique string s°(𝒜) ∈ L_π(𝒜) such that for all x, x′ ∈ 𝒜 with x = ⟨I,e⟩ and x′ = ⟨I′,e′⟩, and R ∈ Allen,

    π_{x,x′}(s°(𝒜)) = s_R ⟺ I R° I′.

The encoding s°(𝒜) is formed exactly as s(𝒜) is in section 2 above from the endpoints r₁ < r₂ < ⋯ < rₙ of dom(𝒜), except that the αᵢ's for the endpoints rᵢ are dropped (these being un-observable), leaving us with the βᵢ's for [rᵢ,rᵢ₊₁]

    s°(𝒜) ≝ bc(β₁ ⋯ βₙ₋₁).

Beyond Proposition 1, the arguments above for s(𝒜) carry over to s°(𝒜), with the requirement on a TPL-model 𝒜 that a < b for all [a,b] ∈ dom(𝒜). It is noteworthy that (Pratt-Hartmann, 2005a) makes no mention that this requirement is a departure from (Pratt-Hartmann, 2005). Although the restriction a < b rules out TPL-models with points [a,a] in their domain, it also opens TPL up to strings in which events meet — a trade-off accepted in (Allen and Ferguson, 1994). To properly accommodate points alongside larger intervals, we can introduce a fluent indiv marking out boxes corresponding to points [a,a] (as opposed to divisible intervals [a,b] where a < b), and re-define π_X to leave boxes with indiv in them alone. From this perspective, the restriction a < b is quite compatible with π_X as defined above. But can we justify the notational overhead in introducing indiv and complicating π_X? We say no more here.

A finite-state approach to phrase-based statistical machine translation

Jorge González
Departamento de Sistemas Informáticos y Computación
Universitat Politècnica de València
València, Spain
[email protected]

Abstract

This paper presents a finite-state approach to phrase-based statistical machine translation where a log-linear modelling framework is implemented by means of an on-the-fly composition of weighted finite-state transducers. Moses, a well-known state-of-the-art system, is used as a machine translation reference in order to validate our results by comparison. Experiments on the TED corpus achieve a similar performance to that yielded by Moses.

1 Introduction

Statistical machine translation (SMT) is a pattern recognition approach to machine translation which was defined by Brown et al. (1993) as follows: given a sentence s from a certain source language, a corresponding sentence t̂ in a given target language that maximises the posterior probability Pr(t|s) is to be found. State-of-the-art SMT systems model the translation distribution Pr(t|s) via the log-linear approach (Och and Ney, 2002):

    t̂ = argmax_t Pr(t|s)                       (1)
      ≈ argmax_t Σ_{m=1}^{M} λ_m h_m(s,t)      (2)

where h_m(s,t) is a logarithmic function representing an important feature for the translation of s into t, M is the number of features (or models), and λ_m is the weight of h_m in the log-linear combination.

This feature set typically includes several translation models so that different relations between a source and a target sentence can be considered. Nowadays, these models are strongly based on phrases, i.e. variable-length n-grams, which means that they are built from some other lower-context models that, in this case, are defined at phrase level. Phrase-based (PB) models (Tomás and Casacuberta, 2001; Och and Ney, 2002; Marcu and Wong, 2002; Zens et al., 2002) constitute the core of the current state-of-the-art in SMT. The basic idea of PB-SMT systems is:

1. to segment the source sentence into phrases, then

2. to translate each source phrase into a target phrase, and finally

3. to reorder them in order to compose the final translation in the target language.

In a monotone translation framework, however, the third step is omitted as the final translation is just generated by concatenation of the target phrases.

Apart from translation functions, the log-linear approach is also usually composed by means of a target language model and some other additional elements such as word penalties or phrase penalties. The word and phrase penalties allow an SMT system to limit the number of words or target phrases, respectively, that constitute a translation hypothesis.

In this paper, a finite-state approach to a PB-SMT state-of-the-art system, Moses (Koehn et al., 2007), is presented. Experimental results validate our work because they are similar to those yielded by Moses. A related study can be found in Kumar et al. (2006) for the alignment template model (Och et al., 1999).
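The log-linear combination in Equations 1 and 2 can be pictured with a small sketch (ours, not from the paper; a real decoder searches the hypothesis space implicitly rather than enumerating candidates):

```python
def log_linear_score(weights, features, s, t):
    """The weighted sum of Equation 2; each feature is a function h_m(s, t)."""
    return sum(lam * h(s, t) for lam, h in zip(weights, features))

def decode(candidates, weights, features, s):
    """Equations 1-2 over an explicitly enumerated candidate set."""
    return max(candidates,
               key=lambda t: log_linear_score(weights, features, s, t))
```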


2 Log-linear features for monotone SMT

As a first approach to Moses using finite-state models, a monotone PB-SMT framework is adopted. Under this constraint, Moses' log-linear model usually takes into account the following 7 features:

Translation features

1. Direct PB translation probability
2. Inverse PB translation probability
3. Direct PB lexical weighting
4. Inverse PB lexical weighting

Penalty features

5. PB penalty
6. Word penalty

Language features

7. Target language model

2.1 Translation features

All 4 features related to translation are PB models, that is, their associated feature functions h_m(s,t), which are in any case defined for full sentences, are modelled from other PB distributions η_m(s̃,t̃), which are based on phrases.

Direct PB translation probability

The first feature h₁(s,t) = log P(t|s) is based on modelling the posterior probability by using the segmentation between s and t as a hidden variable β₁. In this manner, Pr(t|s) = Σ_{β₁} Pr(t,β₁|s) is then approximated by P(t|s) by using maximization instead of summation: P(t|s) = max_{β₁} P(t,β₁|s). Given a monotone segmentation between s and t, P(t,β₁|s) is generatively computed as the product of the translation probabilities for each segment pair according to some PB probability distributions:

    P(t,β₁|s) = Π_{k=1}^{|β₁|} P(t̃_k|s̃_k)

where |β₁| is the number of phrases that s and t are segmented into, i.e. every s̃_k and t̃_k, respectively, whose dependence on β₁ is omitted for the sake of an easier reading. Feature 1 is finally formulated as follows:

    h₁(s,t) = log max_{β₁} Π_{k=1}^{|β₁|} P(t̃_k|s̃_k)    (3)

where η₁(s̃,t̃) = P(t̃|s̃) is a set of PB probability distributions estimated from bilingual training data, once statistically word-aligned (Brown et al., 1993) by means of GIZA++ (Och and Ney, 2003), which Moses relies on as far as training is concerned. This information is organized as a translation table where a pool of phrase pairs is previously collected.

Inverse PB translation probability

Similar to what happens with Feature 1, Feature 2 is formulated as follows:

    h₂(s,t) = log max_{β₂} Π_{k=1}^{|β₂|} P(s̃_k|t̃_k)    (4)

where η₂(s̃,t̃) = P(s̃|t̃) is another set of PB probability distributions, which are simultaneously trained together with the ones for Feature 1, P(t̃|s̃), over the same pool of phrase pairs already extracted.

Direct PB lexical weighting

Given the word-alignments obtained by GIZA++, it is quite straightforward to estimate a maximum likelihood stochastic dictionary P(t_i|s_j), which is used to score a weight D(s̃,t̃) to each phrase pair in the pool. Details about the computation of D(s̃,t̃) are given in Koehn et al. (2007). However, as far as this work is concerned, these details are not relevant. Feature 3 is then similarly formulated as follows:

    h₃(s,t) = log max_{β₃} Π_{k=1}^{|β₃|} D(s̃_k,t̃_k)    (5)

where η₃(s̃,t̃) = D(s̃,t̃) is yet another score to use with the pool of phrase pairs aligned during training.

Inverse PB lexical weighting

Similar to what happens with Feature 3, Feature 4 is formulated as follows:

    h₄(s,t) = log max_{β₄} Π_{k=1}^{|β₄|} I(s̃_k,t̃_k)    (6)

where η₄(s̃,t̃) = I(s̃,t̃) is another weight vector, which is computed by using a dictionary P(s_j|t_i), with which the translation table is expanded again, thus scoring a new weight per phrase pair in the pool.
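The maximization over monotone segmentations in Equations 3 to 6 is a small dynamic program. Here is a sketch of ours for Equation 3, assuming the phrase table is a Python dict keyed by (source-phrase, target-phrase) word tuples and that max_len bounds phrase length:

```python
import math
from functools import lru_cache

def h1(s, t, table, max_len=7):
    """Equation 3: log of the best monotone segmentation score, where
    table[(src_phrase, tgt_phrase)] holds P(t~|s~); -inf if no
    segmentation covers both sentences."""
    s, t = tuple(s), tuple(t)

    @lru_cache(maxsize=None)
    def best(i, j):  # best log-score covering the prefixes s[:i], t[:j]
        if i == 0 and j == 0:
            return 0.0
        scores = [best(i - a, j - b) + math.log(table[(s[i-a:i], t[j-b:j])])
                  for a in range(1, min(max_len, i) + 1)
                  for b in range(1, min(max_len, j) + 1)
                  if (s[i-a:i], t[j-b:j]) in table]
        return max(scores, default=float("-inf"))

    return best(len(s), len(t))
```

The same routine serves h₂ to h₅ by swapping in the corresponding η column of the translation table.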
2.2 Penalty features

The penalties are not modelled in the same way. The PB penalty is similar to a translation feature, i.e. it is based on a monotone sentence segmentation. The word penalty, however, is formulated as a whole, being taken into account by Moses at decoding time.

PB penalty

The PB penalty scores e = 2.718 per phrase pair, thus modelling somehow the segmentation length. Therefore, Feature 5 is defined as follows:

    h₅(s,t) = log max_{β₅} Π_{k=1}^{|β₅|} e    (7)

where η₅(s̃,t̃) = e extends the PB table once again.

Word penalty

Word penalties are not modelled as PB penalties. In fact, this feature is not defined from PB scores, but it is formulated at sentence level just as follows:

    h₆(s,t) = log e^{|t|}    (8)

where the exponent of e is the number of words in t.

2.3 Language features

Language models approach the a priori probability that a given sentence belongs to a certain language. In SMT, they are usually employed to guarantee that translation hypotheses are built according to the peculiarities of the target language.

Target language model

An n-gram is used as target language model P(t), where a word-based approach is usually considered. Then, h₇(s,t) = log P(t) is based on a model where sentences are generatively built word by word under the influence of the last n−1 previous words, with the cutoff derived from the start of the sentence:

    h₇(s,t) = Σ_{i=1}^{|t|} log P(t_i | t_{i−n+1} … t_{i−1})    (9)

where P(t_i | t_{i−n+1} … t_{i−1}) are word-based probability distributions learnt from monolingual corpora.

3 Data structures

This section shows how the features from Section 2 are actually organized into different data structures in order to be efficiently used by the Moses decoder, which implements the search defined by Equation 2 to find out the most likely translation hypothesis t̂ for a given source sentence s.

3.1 PB models

The PB distributions associated to Features 1 to 5 are organized in table form as a translation table for the collection of phrase pairs previously extracted. That builds a PB database similar to that in Table 1, where each phrase pair is scored by all five models.

Table 1: A Spanish-into-English PB translation table. Each source-target phrase pair is scored by all η models.

    Source    | Target   | η₁  | η₂  | η₃  | η₄  | η₅
    barato    | low cost | 1   | 0.3 | 1   | 0.6 | 2.718
    me gusta  | I like   | 0.6 | 1   | 0.9 | 1   | 2.718
    es decir  | that is  | 0.8 | 0.5 | 0.7 | 0.9 | 2.718
    por favor | please   | 0.4 | 0.2 | 0.1 | 0.4 | 2.718
    …         | …        | …   | …   | …   | …   | …

3.2 Word-based models

Whereas PB models are an interesting approach to deal with translation relations between languages, language modelling itself is usually based on words. Feature 6 is a length model of the target sentence, and Feature 7 is a target language model.

Word penalty

Penalties are not models that need to be trained. However, while PB penalties are provided to Moses to take them into account during the search process (see for example the last column of Table 1, η₅), word penalties are internally implemented in Moses as part of the log-linear maximization in Equation 2, and are automatically computed on-the-fly at search.

Target n-gram model

Language models, and n-grams in particular, suffer from a sparseness problem (Rosenfeld, 1996). The n-gram probability distributions are smoothed to be able to deal with the unseen events out of training data, thus aiming for a larger language coverage.
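Ignoring smoothing for a moment, Equation 9 is a plain table lookup over n-grams. A minimal sketch (ours; the lm dict mapping word tuples to probabilities is an assumed stand-in for the trained model):

```python
import math

def h7(t, lm, n=3):
    """Equation 9: sum of log P(t_i | t_{i-n+1} ... t_{i-1}), the history
    being cut off at the start of the sentence; smoothing is deferred
    to Equation 10 below."""
    return sum(math.log(lm[tuple(t[max(0, i - n + 1):i + 1])])
               for i in range(len(t)))
```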

This smoothing is based on the backoff method, which introduces some penalties for level downgrading within hierarchical language models. For example, let M be a trigram language model, which, as regards smoothing, needs both a bigram and a unigram model trained on the same data. Any trigram probability, P(c|ab), is then computed as follows:

    if abc ∈ M:     P_M(c|ab)
    elseif bc ∈ M:  BO_M(ab) · P_M(c|b)
    elseif c ∈ M:   BO_M(ab) · BO_M(b) · P_M(c)
    else:           BO_M(ab) · BO_M(b) · P_M(unk)    (10)

where P_M is the probability estimated by M for the corresponding n-gram, BO_M is the backoff weight to deal with the unseen events out of training data, and finally, P_M(unk) is the probability mass reserved for unknown words.

The P(t_i | t_{i−n+1} … t_{i−1}) term from Equation 9 is then computed according to the algorithm above, given the model data organized again in table form as a collection of probabilities and backoff weights for the n-grams appearing in the training corpus. This model displays similarly to that in Table 2.

Table 2: An English word-based backoff n-gram model. The likelihood and the backoff model score for each n-gram.

    n-gram   | P    | BO
    please   | 0.02 | 0.2
    low cost | 0.05 | 0.3
    I like   | 0.1  | 0.7
    that is  | 0.08 | 0.5
    …        | …    | …
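Extending the h7 sketch above, the backoff recursion of Equation 10 generalizes to any order (our sketch; P and BO are the two columns of Table 2 as dicts over word tuples, and "<unk>" is an assumed placeholder token):

```python
import math

def backoff_logp(ngram, P, BO):
    """Equation 10 for an arbitrary-order model: P maps n-grams (tuples)
    to probabilities, BO maps histories to backoff weights; '<unk>'
    holds the mass reserved for unknown words."""
    if ngram in P:
        return math.log(P[ngram])
    if len(ngram) == 1:
        return math.log(P[("<unk>",)])
    bo = BO.get(ngram[:-1], 1.0)   # penalty for dropping the longest history
    return math.log(bo) + backoff_logp(ngram[1:], P, BO)
```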
4 Weighted finite-state transducers

Weighted finite-state transducers (Mohri et al., 2002) (WFSTs) are defined by means of a tuple (Σ, Δ, Q, q₀, f, P), where Σ is the alphabet of input symbols, Δ is the alphabet of output symbols, Q is a finite set of states, q₀ ∈ Q is the initial state, f : Q → ℝ is a state-based weight distribution to quantify that states may be final states, and finally, the partial function P : Q × Σ* × Δ* × Q → ℝ defines a set of edges between pairs of states in such a way that every edge is labelled with an input string in Σ*, with an output string in Δ*, and is assigned a transition weight.

When weights are probabilities, i.e. the range of functions f and P is constrained between 0 and 1, and under certain conditions, a weighted finite-state transducer may define probability distributions. Then, it is called a stochastic finite-state transducer.

4.1 WFSTs for SMT models

Here, we show how the SMT models described in Section 3 (that is, the five η scores in the PB translation table, the word penalty, and the n-gram language model) are represented by means of WFSTs.

First of all, the word penalty feature in Equation 8 is equivalently reformulated as another PB score, as in Equations 3 to 7:

    h₆(s,t) = log e^{|t|} = log max_{β₆} Π_{k=1}^{|β₆|} e^{|t̃_k|}    (11)

where the length of t is split up by summation using the length of each phrase in a segmentation β₆. Actually, this feature is independent of β₆, that is, any segmentation produces the expected value e^{|t|}, and therefore the maximization by β₆ is not needed. However, the main goal is to introduce this feature as another PB score similar to those in Features 1 to 5, and so it is redefined following the same framework. The PB table can now be extended by means of η₆(s̃,t̃) = e^{|t̃|}, just as Table 3 shows.

Table 3: A word-penalty-extended PB translation table. The exponent of e in η₆ is the number of words in Target.

    Source    | Target   | η₁ … η₅ | η₆
    barato    | low cost | …       | e²
    me gusta  | I like   | …       | e²
    es decir  | that is  | …       | e²
    por favor | please   | …       | e

Now, the translation table including 6 PB scores and the target-language backoff n-gram model can be expressed by means of (some stochastic) WFSTs.

Translation table

Each PB model included in the translation table, i.e. any PB distribution in {η₁(s̃,t̃), …, η₆(s̃,t̃)}, can be represented as a particular case of a WFST. Figure 1 shows a PB score encoded as a WFST, using a different looping transition per table row within a WFST of only one state.

[Figure 1: Equivalent WFST representation of PB scores. Table rows are embedded within as many looping transitions of a WFST which has no topology at all; η-scores are correspondingly stored as transition weights. The figure depicts a single state q₀ with looping edges such as "barato / low cost" (weight x₁) and "me gusta / I like" (weight x₂), one per row of the table.]

It is straightforward to see that the application of the Viterbi method (Viterbi, 1967) on these WFSTs provides the corresponding feature value h_m(s,t) for all Features 1 to 6 as defined in Equations 3 to 8.

Language model

It is well known that n-gram models are a subclass of stochastic finite-state automata where backoff can also be adequately incorporated (Llorens, 2000). Then, they can be equivalently turned into transducers by means of the concept of identity, that is, transducers which map every input label to itself. Figure 2 shows a WFST for a backoff bigram model.

[Figure 2: A WFST example for a backoff bigram model. Backoff (BO) is dealt with failure transitions from the bigram layer to the unigram layer. Unigrams go in the other direction and bigrams link states within the bigram layer. The figure shows bigram-layer states q₀ … q₃ linked by edges such as "cost / cost" with weight P_M(cost|low), backoff edges BO(q₀) … BO(q₃) down to a unigram layer, and unigram edges such as "low / low" with weight P_M(low).]

It is also quite straightforward to see that h₇(s,t) (as defined in Equation 9 for a target n-gram model where backoff is adopted according to Equation 10) is also computed by means of a parsing algorithm, which is actually a process that is simple to carry out given that these backoff n-gram WFSTs are deterministic.
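The one-state machine of Figure 1 can be sketched directly as data plus a Viterbi pass over source positions (our sketch; the phrase pairs and weights below are toy values in the spirit of Table 1, not the paper's):

```python
import math

# Figure 1 as data: one state, one looping edge per row of the PB table.
edges = [
    (("barato",), ("low", "cost"), 1.0),
    (("me", "gusta"), ("I", "like"), 0.6),
    (("por", "favor"), ("please",), 0.4),
]

def viterbi(s):
    """Best monotone cover of the source sentence s by looping edges;
    returns (log-score, target words), or -inf when s cannot be covered."""
    s = tuple(s)
    best = {0: (0.0, ())}                 # source position -> best hypothesis
    for i in range(len(s)):
        if i not in best:
            continue
        score, out = best[i]
        for src, tgt, w in edges:
            j = i + len(src)
            if s[i:j] == src:
                cand = (score + math.log(w), out + tgt)
                if j not in best or cand > best[j]:
                    best[j] = cand
    return best.get(len(s), (float("-inf"), ()))

print(viterbi(("me", "gusta", "por", "favor")))   # -> ... ('I', 'like', 'please')
```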

To sum up, our log-linear combination scenario considers 7 (some stochastic) WFSTs, 1 per feature: 6 of them are PB models related to a translation table while the 7th one is a target-language n-gram model. Next, in Section 4.2, we show how these WFSTs are used in conjunction in a homogeneous framework.

4.2 Search

Equation 2 is a general framework for log-linear approaches to SMT. This framework is adopted here in order to combine several features based on WFSTs, which are modelled as their respective Viterbi score. As already mentioned, the computation of h_m(s,t) for each PB-WFST, let us say T_m (with 1 ≤ m ≤ 6), provides the most likely segmentation β_m for s and t according to T_m. However, a constraint is used here so that all T_m models define the same segmentation β:

    |β| > 0
    s = s̃₁ … s̃_{|β|}
    t = t̃₁ … t̃_{|β|}

where the PB scores corresponding to Features 1 to 6 are directly applied on that particular segmentation for each phrase pair (s̃_k, t̃_k) monotonically aligned. Equations 3 to 7 and 11 can be simplified as follows:

    ∀m = 1, …, 6:
    h_m(s,t) = log max_β Π_{k=1}^{|β|} η_m(s̃_k,t̃_k)    (12)

Then, Equation 2 can be instanced as follows:

    t̂ = argmax_t Σ_{m=1}^{7} λ_m h_m(s,t)    (13)

      = argmax_t [ Σ_{m=1}^{6} λ_m max_β Σ_{k=1}^{|β|} log η_m(s̃_k,t̃_k)
                   + λ₇ Σ_{i=1}^{|t|} log P(t_i | t_{i−n+1} … t_{i−1}) ]

      = argmax_t [ max_β Σ_{k=1}^{|β|} Σ_{m=1}^{6} λ_m log η_m(s̃_k,t̃_k)
                   + λ₇ Σ_{i=1}^{|t|} log P(t_i | t_{i−n+1} … t_{i−1}) ]

as logarithm rules are applied to Equations 9 and 12.

The square-bracketed expression of Equation 13 is a Viterbi-like score which can be incrementally built through the contribution of all the PB-WFSTs (along with their respective λ_m-weights) over some phrase pair (s̃_k, t̃_k) that extends a partial hypothesis. As these models share their topology, we implement them jointly, including as many scores per transition as needed (González and Casacuberta, 2008). These models can also be merged by means of union once their λ_m-weights are transferred into them. That allows us to model the whole translation table (see Table 3) by means of just 1 WFST structure T. Therefore, the search framework for single models can also be used for their log-linear combination.

As regards the remaining term from Equation 13, i.e. the target n-gram language model for Feature 7, it is seen as a rescoring function (Och et al., 2004) which is applied once the PB-WFST T is explored. The translation model returns the best hypotheses that are later input to the n-gram language model L, where they are reranked, to finally choose the best t̂. However, these two steps can be processed at once if both the T WFST and the L WFST are merged by means of their composition T∘L (Mohri, 2004). The product of such an operation is another WFST as WFSTs are closed under a composition operation. In practice though, the size of T∘L can be very large, so composition is done on-the-fly (Caseiro, 2003), which actually does not build the WFST for T∘L but explores both T and L as if they were composed, using the n-gram scores in L on the target hypotheses from T as soon as they are partially produced.

Equation 13 represents a Viterbi-based composition framework where all the (weighted) models contribute to the overall score to be maximized, provided that the set of λ_m-weights is instantiated. Using a development corpus, the set of λ_m-weights can be empirically determined by means of running several iterations of this framework, where different values for the λ_m-weights are tried in each iteration.
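Two ingredients of this search are easy to spell out in miniature (our sketch, not the paper's implementation): collapsing the six per-transition η-scores into one λ-weighted transition weight, and the rescoring reading of the language-model term:

```python
import math

def collapse(etas, lams):
    """One transition of T carries all six eta-scores; their lambda-weighted
    logs can be summed once, yielding a single transition weight."""
    return sum(lam * math.log(eta) for lam, eta in zip(lams, etas))

def rescore(nbest, lm_logprob, lam7):
    """Rescoring view of Equation 13: T proposes scored hypotheses and the
    identity WFST L reranks them; on-the-fly composition interleaves the
    two instead of waiting for complete hypotheses."""
    return max(((score + lam7 * lm_logprob(t), t) for score, t in nbest),
               key=lambda pair: pair[0])
```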
5 Experiments

Experiments were carried out on the TED corpus, which is described in depth throughout Section 5.1. Automatic evaluation for SMT is often considered, and we use the measures enumerated in Section 5.2. Results are shown, and also discussed, in Section 5.3.

5.1 Corpora data

The TED corpus is composed of a collection of English-French sentences from audiovisual content whose main statistics are displayed in Table 4.

Table 4: Main statistics from the TED corpus and its split.

    Subset  |               | English | French
    Train   | Sentences     |      47.5K
            | Running words | 747.2K  | 792.9K
            | Vocabulary    | 24.6K   | 31.7K
    Develop | Sentences     |       571
            | Running words | 9.2K    | 10.3K
            | Vocabulary    | 1.9K    | 2.2K
    Test    | Sentences     |       641
            | Running words | 12.6K   | 12.8K
            | Vocabulary    | 2.4K    | 2.7K

As shown in Table 4, develop and test partitions are statistically comparable. The former is used to train the λ_m-weights in the log-linear approach, in the hope that they can also work well for the latter.

5.2 Evaluation measures

Since its appearance as a translation quality measure, the BLEU metric (Papineni et al., 2002), which stands for bilingual evaluation understudy, has become consolidated in the area of automatic evaluation as the most widely used SMT measure. Nevertheless, it was later found that its correlation factor with subjective evaluations (the original reason for its success) is actually not so high as first thought (Callison-Burch et al., 2006). Anyway, it is still the most popular SMT measure in the literature.

However, the word error rate (WER) is a very common measure in the area of speech recognition which is also quite usually applied in SMT (Och et al., 1999). Although it is not so widely employed as BLEU, there exists some work that shows a better correlation of WER with human assessments (Paul et al., 2007). Of course, the WER measure has some bad reviews as well (Chen and Goodman, 1996; Wang et al., 2003), and one of the main criticisms that it receives in SMT areas is about the fact that there is only one translation reference to compare with. The MWER measure (Nießen et al., 2000) is an attempt to relax this dependence by means of an average error rate with respect to a set of multiple references of equivalent meaning, provided that they are available.

Another measure also based on the edit distance concept has recently arisen as an evolution of WER towards SMT. It is the translation edit rate (TER), and it has become popular because it takes into account the basic post-process operations that professional translators usually do during their daily work. Statistically, it is considered as a measure highly correlated with the result of one or more subjective evaluations (Snover et al., 2006).

The definition of these evaluation measures is as follows:

BLEU: It computes the precision of the unigrams, bigrams, trigrams, and fourgrams that appear in the hypotheses with respect to the n-grams of the same order that occur in the translation reference, with a penalty for too short sentences. Unlike the WER measure, BLEU is not an error rate but an accuracy measure.

WER: This measure computes the minimum number of editions (replacements, insertions or deletions) that are needed to turn the system hypothesis into the corresponding reference.

TER: It is computed similarly to WER, using an additional edit operation. TER allows the movement of phrases, besides replacements, insertions, and deletions.
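WER, being a word-level Levenshtein distance, is short enough to sketch in full (our code, assuming a non-empty reference and tokenized word lists; BLEU and TER are more involved and are omitted):

```python
def wer(hyp, ref):
    """Word error rate: minimum number of substitutions, insertions and
    deletions turning hyp into ref, divided by the reference length."""
    d = list(range(len(ref) + 1))          # edit-distance row for "" vs ref
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            prev, d[j] = d[j], min(d[j] + 1,          # deletion
                                   d[j - 1] + 1,      # insertion
                                   prev + (h != r))   # substitution / match
    return d[len(ref)] / len(ref)

assert wer("the cat sat".split(), "the cat sat".split()) == 0.0
```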

96 is worse than the baseline, which feeds us back Log-linear Develop Test features BLEU BLEU WER TER on the fact that the λm-weights can be better trained, that is, the log-linear model for Features 1 and 2 1 (baseline) 7.1 7.4 101.6 100.0 1+2 4.1 3.5 117.5 116.0 can be upgraded until baseline’s results with λ = 0. 2 1+2+3 24.2 21.1 58.9 56.5 This battery of experiments on Features 1 to 6 1+2+3+4 24.4 20.8 58.0 55.7 allows us to see the benefits of a log-linear approach. 1+2+3+4+5 24.9 21.2 56.9 54.8 The baseline results are clearly outperformed now, 1+2+3+4+5+6 25.2 21.2 57.1 55.0 and we can say that the more features are included, 1+7 24.7 22.5 60.0 57.7 the better are the results. 1+2+7 26.0 23.2 58.8 56.5 The next block of experiments in Table 5 always 1+2+3+7 28.5 23.0 56.1 54.0 1+2+3+4+7 28.4 23.1 56.0 53.8 include Feature 7, i.e. the target language model . 1+2+3+4+5+7 28.8 23.4 56.0 53.9 Features 1 to 6 are progressively introduced into L. T 1+2+3+4+5+6+7 28.7 23.8 55.8 53.7 These results confirm that the target language model Moses (1+...+7) 28.9 23.5 55.8 53.6 is still an important feature to take into account, even though PB models are already providing a sur- Table 6: French-to-English results for development rounding context for their translation hypotheses be- and test data according to different log-linear scenarios. cause translation itself is modelled at phrase level. These models can also be implemented by means These results are significantly better than the ones of WFSTs on the basis of the Viterbi algorithm. where the target language model is not considered. The word penalty can also be equivalently redefined Again, the more translation features are included, as another PB model, similar to the five others, the better are the results on the development data. which allows us to constitute a translation model However, an overtraining is presumedly occurring composed of six parallel WFSTs that are constrainedT with regard to the optimization of the λ -weights, m to share the same monotonic bilingual segmentation. as results on the test partition do not reach their top A backoff n-gram model for the target language the same way the ones for the development data do, can be represented as an identity WFST where L i.e. when using all 7 features, but when combining P (t) is modelled on the basis of the Viterbi algorithm. Features 1, 2, 3, and 7, instead. These differences The whole log-linear approach to Moses is attained are not statistically significant though. by means of the on-the-fly WFST composition . Finally, our finite-state approach to PB-SMT T ◦L Our finite-state log-linear approach to PB-SMT is validated by comparison, as it allows us to achieve is validated by comparison, as it has allowed us similar results to those yielded by Moses itself. to achieve similar results to those yielded by Moses. On the other hand, a translation direction where Monotonicity is an evident limitation of this work, French is translated into English gets now the focus. as Moses can also feature some limited reordering. Their corresponding results are presented in Table 6. However, future work on that line is straight-forward A similar behaviour can be observed in Table 6 since the framework described in this paper can be for the series of French-to-English empirical results. easily extended to include a PB reordering model , by means of the on-the-fly composition R. 
6 Conclusions and future work

In this paper, a finite-state approach to Moses, a state-of-the-art PB-SMT system, is presented. A monotone framework is adopted, where 7 models in log-linear combination are considered: a direct and an inverse PB translation probability model, a direct and an inverse PB lexical weighting model, PB and word penalties, and a target language model. Five of these models are based on PB scores which are organized in a PB translation table. These models can also be implemented by means of WFSTs on the basis of the Viterbi algorithm. The word penalty can also be equivalently redefined as another PB model, similar to the five others, which allows us to constitute a translation model T composed of six parallel WFSTs that are constrained to share the same monotonic bilingual segmentation. A backoff n-gram model for the target language can be represented as an identity WFST L, where P(t) is modelled on the basis of the Viterbi algorithm. The whole log-linear approach to Moses is attained by means of the on-the-fly WFST composition T ∘ L.

Our finite-state log-linear approach to PB-SMT is validated by comparison, as it has allowed us to achieve results similar to those yielded by Moses. Monotonicity is an evident limitation of this work, as Moses can also feature some limited reordering. However, future work on that line is straightforward, since the framework described in this paper can easily be extended to include a PB reordering model R by means of the on-the-fly composition T ∘ R ∘ L.

Acknowledgments

The research leading to these results has received funding from the European Union 7th Framework Programme (FP7/2007-2013) under grant agreement no. 287576. Work also supported by the EC (FEDER, FSE), the Spanish government (MICINN, MITyC, "Plan E", grants MIPRCV "Consolider Ingenio 2010" and iTrans2 TIN2009-14511), and the Generalitat Valenciana (grant Prometeo/2009/014).

References

P.F. Brown, S.A. Della Pietra, V.J. Della Pietra, and R.L. Mercer. 1993. The mathematics of machine translation. In Computational Linguistics, volume 19, pages 263–311, June.

C. Callison-Burch, M. Osborne, and P. Koehn. 2006. Re-evaluating the role of Bleu in machine translation research. In Proceedings of the 11th conference of the European Chapter of the Association for Computational Linguistics, pages 249–256.

D. Caseiro. 2003. Finite-State Methods in Automatic Speech Recognition. PhD thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa.

S.F. Chen and J. Goodman. 1996. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th annual meeting of the Association for Computational Linguistics, pages 310–318.

J. González and F. Casacuberta. 2008. A finite-state framework for log-linear models in machine translation. In Proc. of European Association for Machine Translation, pages 41–46.

P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. 2007. Moses: open source toolkit for statistical machine translation. In Proc. of Association for Computational Linguistics, pages 177–180.

S. Kumar, Y. Deng, and W. Byrne. 2006. A weighted finite state transducer translation template model for statistical machine translation. Natural Language Engineering, 12(1):35–75.

David Llorens. 2000. Suavizado de autómatas y traductores finitos estocásticos. PhD thesis, Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia.

D. Marcu and W. Wong. 2002. A phrase-based, joint probability model for statistical machine translation. In Proc. of Empirical Methods in Natural Language Processing, pages 133–139.

Mehryar Mohri, Fernando Pereira, and Michael Riley. 2002. Weighted finite-state transducers in speech recognition. Computer Speech and Language, 16(1):69–88.

Mehryar Mohri. 2004. Weighted finite-state transducer algorithms: An overview. Formal Languages and Applications, 148:551–564.

S. Nießen, F.J. Och, G. Leusch, and H. Ney. 2000. An evaluation tool for machine translation: Fast evaluation for MT research. In Proceedings of the 2nd International Conference on Language Resources and Evaluation, pages 39–45.

F.J. Och and H. Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proc. of Association for Computational Linguistics, pages 295–302.

F.J. Och and H. Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29:19–51, March.

F.J. Och, C. Tillmann, and H. Ney. 1999. Improved alignment models for statistical machine translation. In Proceedings of the joint conference on Empirical Methods in Natural Language Processing and the 37th annual meeting of the Association for Computational Linguistics, pages 20–28.

F.J. Och, D. Gildea, S. Khudanpur, A. Sarkar, K. Yamada, A. Fraser, S. Kumar, L. Shen, D. Smith, K. Eng, V. Jain, Z. Jin, and D. Radev. 2004. A smorgasbord of features for statistical machine translation. In D. Marcu, S. Dumais, and S. Roukos, editors, HLT-NAACL 2004: Main Proceedings, pages 161–168, Boston, Massachusetts, USA, May 2 – May 7. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.

M. Paul, A. Finch, and E. Sumita. 2007. Reducing human assessment of machine translation quality to binary classifiers. In Proceedings of the conference on Theoretical and Methodological Issues in Machine Translation, pages 154–162.

R. Rosenfeld. 1996. A maximum entropy approach to adaptive statistical language modeling. Computer Speech and Language, 10:187–228.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the 7th biennial conference of the Association for Machine Translation in the Americas, pages 223–231.

J. Tomás and F. Casacuberta. 2001. Monotone statistical translation using word groups. In Proc. of the Machine Translation Summit, pages 357–361.

A. Viterbi. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269.

Y. Wang, A. Acero, and C. Chelba. 2003. Is word error rate a good indicator for spoken language understanding accuracy. In Proceedings of the IEEE workshop on Automatic Speech Recognition and Understanding, pages 577–582.

Richard Zens, Franz Josef Och, and Hermann Ney. 2002. Phrase-based statistical machine translation. In Proc. of Advances in Artificial Intelligence, pages 18–32.

Finite-state acoustic and translation model composition in statistical speech translation: empirical assessment

Alicia Pérez(2), M. Inés Torres(2), Francisco Casacuberta(1)

(1) Dep. Computer Languages and Systems, Instituto Tecnológico de Informática, Universidad Politécnica de Valencia, Valencia (Spain) — [email protected]
(2) Dep. Electricidad y Electrónica, University of the Basque Country UPV/EHU, Bilbao (Spain) — [email protected], [email protected]

Abstract

Speech translation can be tackled by means of the so-called decoupled approach: a speech recognition system followed by a text translation system. The major drawback of this two-pass decoding approach lies in the fact that the translation system has to cope with the errors derived from the speech recognition system. There is hardly any cooperation between the acoustic and the translation knowledge sources. There is a line of research focusing on alternatives to implement speech translation efficiently, ranging from semi-decoupled to tightly integrated approaches. The goal of integration is to make acoustic and translation models cooperate in the underlying decision problem. That is, the translation is built by virtue of the joint action of both models. As a side-advantage of the integrated approaches, the translation is obtained in a single-pass decoding strategy. The aim of this paper is to assess the quality of the hypotheses explored within different speech translation approaches. Evidence of the performance is given through experimental results on a limited-domain task.

1 Introduction

Statistical speech translation (SST) was typically implemented as a pair of consecutive steps in the so-called decoupled approach: an automatic speech recognition (ASR) system placed before a text-to-text translation system. This approach involves two independent decision processes: first, getting the most likely string in the source language and, next, getting the expected translation into the target language. Since the ASR system is not an ideal device, it might make mistakes. Hence, the text translation system has to cope with the transcription errors. Since the translation models (TMs) are trained with positive samples of well-formed source strings, they are very sensitive to ill-formed strings in the source language. Hence, it seems ambitious for TMs to aspire to cope with both well- and ill-formed sentences in the source language.

1.1 Related work

Regarding the coupling of acoustic and translation models, there are some contributions in the literature that propose the use of semi-decoupled approaches. On the one hand, in (Zhang et al., 2004), SST is carried out by

an ASR placed before a TM, with an additional stage that re-scores the obtained hypotheses within a log-linear framework, gathering features from both the ASR system (lexicon and language model) and the TM (e.g. distortion, fertility) and also additional features (POS, length, etc.).

On the other hand, in (Quan et al., 2005), the N-best hypotheses derived from an ASR system were next translated by a TM; finally, a last stage re-scored the hypotheses and made a choice. Within the list of the N-best hypotheses, typically a number of them include some n-grams that are identical; hence, the list turns out to be an inefficient means of storing data. Alternatively, in (Zhou et al., 2007) the search space extracted from the ASR system, represented as a word-graph (WG), was explored by a TM following a multilayer search algorithm.

Still, a further approach can be assumed in order to make the graph-decoding computationally cheaper, that is, confusion networks (Bertoldi et al., 2007). Confusion networks implement a linear approximation of the word-graphs; however, as a result, dummy hypotheses might be introduced and probabilities mis-computed. Confusion networks trade the accuracy and storage ability of word-graphs for decoding time. Indeed, in (Matusov and Ney, 2011) an efficient means of doing the decoding with confusion networks was presented. Note that these approaches follow a two-pass decoding strategy.

The aforementioned approaches implemented phrase-based TMs within a log-linear framework. In this context, in (Casacuberta et al., 2008) a fully integrated approach was examined. Under this approach, the translation was carried out in a single-pass decoding, involving a single decision process in which acoustic and translation models cooperated. This integration paradigm was earlier proposed in (Vidal, 1997), showing that a single-pass decoding was enough to carry out SST.

Finally, in (Pérez et al., 2010) several SST decoding approaches, including decoupled, N-best lists and integrated ones, were compared. Nevertheless, that paper focused on the potential scope of the approaches, comparing the theoretical upper threshold of their performance.

1.2 Contribution

All the models assessed in this work rely upon exactly the same acoustic and translation models; it is their combination on which we are focusing. In brief, the aim of this paper is to compare different approaches to carry out speech translation decoding. The comparison is carried out using exactly the same underlying acoustic and translation models in order to allow a fair comparison of the abilities inherent to each decoding strategy. Apart from the decoupled and semi-decoupled strategies, we also focus on the fully-integrated approach. While the fully integrated approach provides the most likely hypothesis, we also explored a variant: an integrated architecture with a re-scoring LM that provides alternatives derived from the integrated approach and uses re-scoring to make the final decision. Not only is an oracle evaluation provided as an upper threshold of the experiments, but also an experimental set-up to give empirical evidence.

The paper is arranged as follows: Section 2 introduces the formulation of statistical speech translation (SST); Section 3 describes different approaches to put SST into practice, placing emphasis on the assumptions behind each of them.
Section 4 is devoted to assessing experimentally the performance of each approach. Finally, in Section 5 the conclusions drawn from the experiments are summarized.

2 Statistical speech translation

The goal of speech translation, formulated under the probabilistic framework, is to find the most likely string in the target language (t̂) given the spoken utterance in the source language. The speech signal in the source language is characterized in terms of an array of acoustic features in the source language, x. The decision problem involved is formulated as follows:

    t̂ = arg max_t P(t | x)    (1)

In this context, the text transcription in the source language (denoted as s) is introduced as a hidden variable and Bayes' rule is applied:

    t̂ = arg max_t Σ_s P(x | s, t) P(s, t)    (2)

Assuming P(x | s, t) ≈ P(x | s), and using the maximum term involved in the sum as an approximation to the sum itself for the sake of computational affordability, we arrive at:

    t̂ ≈ arg max_t max_s P(x | s) P(s, t)    (3)

As a result, the expected translation is built relying upon both a translation model (P(s, t)) and an acoustic model in the source language (P(x | s)). This approach requires the joint cooperation of both models to implement the decision problem, since the maximum over s concerns both of them.

2.1 Involved models

Since the goal of this paper is to compare different techniques to combine acoustic and translation models, it is important to keep the underlying models constant while varying the strategies to combine them. Before delving into the composition strategies, and because some combination strategies are based on the finite-state topology of the models, a summary of the relevant features of the underlying models is given in this section.

2.1.1 Translation model

The translation model used in this work to tackle all the approaches consists of a stochastic finite-state transducer (SFST) encompassing phrases in the source and target languages together with a probability of joint occurrence. The SFST (T) is a tuple T = ⟨Σ, Δ, Q, q0, R, F, P⟩, where:

- Σ is a finite set of input symbols;
- Δ is a finite set of output symbols;
- Q is a finite set of states;
- q0 ∈ Q is the initial state;
- R ⊆ Q × Σ+ × Δ* × Q is a set of transitions; (q, s̃, t̃, q′) ∈ R represents a transition from the state q ∈ Q to the state q′ ∈ Q, consuming the source phrase s̃ ∈ Σ+ and producing the substring t̃ ∈ Δ*, where t̃ might consist of zero or more target words (|t̃| ≥ 0);
- F: Q → [0, 1] is a final-state probability;
- P: R → [0, 1] is a transition probability;

subject to the stochastic constraint:

    ∀q ∈ Q:  F(q) + Σ_{s̃,t̃,q′} P(q, s̃, t̃, q′) = 1    (4)

For further reading on the formulation and properties of these machines, turn to (Vidal et al., 2005). The SFST can be understood as a statistical bi-language implemented by means of a finite-state regular grammar (Casacuberta and Vidal, 2004) (in the same way as a stochastic finite-state automaton can be used to model a single language): A = ⟨Γ, Q, q0, R, F, P⟩, where Γ ⊆ Σ+ × Δ* is a finite set of bilingual phrases. Likewise, bilingual n-gram models can be inferred in practice (Mariño et al., 2006).
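As an illustration of the SFST definition above, a minimal sketch (Python, assuming a plain in-memory representation; not the authors' implementation) that stores such a transducer and verifies the stochastic constraint (4):

    from collections import defaultdict

    class SFST:
        """A stochastic finite-state transducer over phrase pairs."""
        def __init__(self, initial_state):
            self.q0 = initial_state
            self.transitions = []            # (q, src_phrase, tgt_phrase, q_next, prob)
            self.final = defaultdict(float)  # F: state -> final-state probability

        def add_transition(self, q, src_phrase, tgt_phrase, q_next, prob):
            self.transitions.append((q, src_phrase, tgt_phrase, q_next, prob))

        def is_stochastic(self, eps=1e-9):
            # Constraint (4): for every state q, F(q) plus the outgoing mass is 1.
            out = defaultdict(float)
            for q, _s, _t, _qn, p in self.transitions:
                out[q] += p
            states = set(out) | set(self.final)
            return all(abs(self.final[q] + out[q] - 1.0) < eps for q in states)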

2.1.2 Acoustic models

The acoustic model consists of a mapping between text transcriptions of lexical units in the source language and their acoustic representation. It comprises the composition of: 1) a lexical model, mapping the textual representation of each unit to its phone-like representation in terms of a left-to-right sequence; and 2) an inventory of phone-like units, each consisting of a typical three-state hidden Markov model (Rabiner, 1989). Thus, the acoustic model relies on the composition of two finite-state models (depicted in Figure 1).

[Figure 1: Acoustic model requires composing phone-like units within the phonetic representation of lexical units. (a) Phonetic representation of a text lexical unit, e.g. "cielo" as /T/ /j/ /e/ /l/ /o/. (b) HMM phone-like units.]

3 Decoding strategies

In the previous section the formulation of SST was summarized. Let us now turn to practice and show the different strategies explored to combine acoustic and translation models to tackle SST. The approaches considered are: decoupled, semi-decoupled and integrated architectures. While the former two are implementable by virtue of alternative TMs, the latter is achieved thanks to the integration allowed by the finite-state framework. Thus, in order to compare the combination strategies rather than the TMs themselves, all of the combinations shall be put into practice using the same SFST as TM.

3.1 Decoupled approach

Possibly the most widely used approach to tackle speech translation is the so-called serial, cascade or decoupled approach. It consists of a text-to-text translation system placed after an ASR system. This process is formally stated as:

    t̂ ≈ arg max_t max_s P(x | s) P(s) P(t | s)    (5)

In practice, the previous expression is implemented in two independent stages as follows:

1st stage: an ASR system finds the most likely transcription (ŝ):

    ŝ ≈ arg max_s P(x | s) P(s)    (6)

2nd stage: next, given the expected string in the source language (ŝ), a TM finds the most likely translation:

    t̂ ≈ arg max_t P(t | ŝ) = arg max_t P(ŝ, t)    (7)

The TM involved in eq. (7) can be based on either posterior or joint probability, as the difference between the two is a normalization term that does not intervene in the maximization process. The second stage has to cope with the expected transcription of speech (ŝ), which does not necessarily convey the exact reference source string (s). That is, the ASR might introduce errors in the source string to be translated in the next stage. However, the TMs are typically trained with correct source-target pairs. Thus, transcription errors are seldom foreseen, even in models including smoothing (Martin et al., 1999). In addition, TMs are extremely sensitive to errors in the input, in particular to substitutions (Vilar et al., 2006).

This architecture represents a suboptimal means of contending with SST as referred to in eq. (3). This approach barely takes advantage of the involved knowledge sources, namely, acoustic and translation models.
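A minimal sketch of the two-stage decoupled decoding of eqs. (6)-(7) (illustrative Python; asr_decode and translate stand for hypothetical 1-best decoders, not functions from the paper):

    def decoupled_translate(x, asr_decode, translate):
        """Two-pass decoding, eqs. (6)-(7): the TM only ever sees the 1-best ASR output."""
        s_hat = asr_decode(x)      # 1st stage: most likely source transcription
        t_hat = translate(s_hat)   # 2nd stage: most likely translation of that string
        return t_hat

Any recognition error in s_hat is silently propagated to the second stage, which is precisely the drawback discussed above.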

3.2 Semi-Decoupled approach

Occasionally, the most probable translation does not turn out to be the most accurate one with respect to a given reference. That is, it might happen that hypotheses with a slightly lower probability than that of the expected hypothesis turn out to be more similar to the reference than the expected hypothesis. This happens due to several factors, amongst others the sparsity of the data with which the model was trained. In brief, some disparity between the probability of the hypotheses and their quality might arise in practice. The semi-decoupled approach arose to address this issue. Hence, rather than translating a single transcription hypothesis, a number of them are provided by the ASR to the TM, and it is the latter that makes the decision, giving as a result the most likely translation. The decoupled approach is implemented in two steps, and so is the semi-decoupled approach. Details on the process are as follows:

1st stage: for a given utterance in the source language, an ASR system, relying on a source acoustic model and a source language model (LM), provides a search sub-space. This sub-space is traced in the search process for the most likely transcription of speech, but without getting rid of other highly probable hypotheses. For our purposes, this sub-space is represented in terms of a graph of words in the source language (S). The word-graph gathers the hypotheses with a probability within a threshold with respect to the optimal hypothesis at each time-frame, as formulated in (Ney et al., 1997). The obtained graph is an acyclic directed graph where the nodes are associated with word-prefixes of a variable length, and the edges join the word sequences allowed in the recognition process with an associated recognition probability. The edges carry the acoustic and language model probabilities that the ASR system handles throughout the trellis.

2nd stage: translating the hypotheses within S (the graph derived in the 1st stage) allows us to take into account alternative translations for the given spoken utterance. The search space being explored is limited by the source strings conveyed by S. The combination of the recognition probability with the translation probability results in a score that accounts for both recognition and translation likelihood:

    t̂ ≈ arg max_t max_{s ∈ S} P(s) P(s, t)    (8)

Thus, acoustic and translation models re-score one another.

All in all, this semi-decoupled approach results in an extension of the decoupled one. It takes into account alternative transcriptions of speech in an attempt to get good quality transcriptions (rather than the most probable transcription, as in the case of the decoupled approach). Amongst all the transcriptions, those with high quality are expected to provide the best quality in the target language. That is, by avoiding errors derived from the transcription process, the TM should perform better and thus get translations of higher quality. Note that, finally, a single translation hypothesis is selected. To do so, the hypothesis with the highest combined probability is chosen.
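The semi-decoupled selection of eq. (8) can be sketched as follows (illustrative Python; the word graph is simplified here to an explicit list of scored transcription hypotheses, which the actual WG encodes compactly):

    def semi_decoupled_translate(wg_hypotheses, translate_nbest):
        """Eq. (8): combine recognition and translation scores over ASR alternatives.

        wg_hypotheses: iterable of (s, p_s) pairs, source strings with recognition scores
        translate_nbest: function s -> iterable of (t, p_st) pairs with joint TM scores
        """
        best, best_score = None, float('-inf')
        for s, p_s in wg_hypotheses:
            for t, p_st in translate_nbest(s):
                score = p_s * p_st          # recognition score re-scored by the TM
                if score > best_score:
                    best, best_score = t, score
        return best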

3.3 Fully-integrated approach

The finite-state framework (by contrast to other frameworks) makes a tight composition of models possible; in our case, of acoustic and translation finite-state models. The fully-integrated approach, proposed in (Vidal, 1997), encompassed acoustic and translation models within a single model. To develop the fully-integrated approach, a finite-state acoustic model on the source language (A: X → S), providing the text transcription of a given acoustic utterance, can be composed with a text translation model (T: S → T), which provides the translation of a given text in the source language. The result is a transducer (Z = A ∘ T) that renders acoustic utterances in the source language into strings in the target language. For the sake of efficiency in terms of spatial cost, the models are integrated on-the-fly in the same manner as is done in ASR (Caseiro and Trancoso, 2006).

The way in which the integrated architecture approaches eq. (3) is by looking for the most likely source-target translation pair as follows:

    (ŝ, t̂) = arg max_{(s,t)} P(s, t) P(x | s)    (9)

That is, the search is driven by bilingual phrases made up of acoustic elements in the source language integrated within bilingual phrases of words together with target phrases. Then, the expected translation is simply approached as the target projection of (ŝ, t̂), the expected source-target string (also known as the lower projection); likewise, the expected transcription is obtained as a side-result by the source projection (aka the upper projection). It is well worth mentioning that this approach implements eq. (3) fairly, without further assumptions apart from those made in the decoding stage, such as Viterbi-like decoding with beam search. All in all, acoustic and translation models cooperate to find the expected translation. Moreover, it is carried out in a single-pass decoding strategy, by contrast to either the decoupled or semi-decoupled approaches.

3.4 Integrated WG and re-scoring LM

The fully-integrated approach looks for the single-best hypothesis within the integrated acoustic-and-translation network. Following the reasoning of Section 3.2, the most likely path, together with other locally close paths in the integrated search space, can be extracted and arranged in terms of a word graph. While the WG derived in Section 3.2 was in the source language, this one is bilingual.

Given a bilingual WG, the lower-side net (WG.l) can be extracted by keeping the topology and the associated probability distributions while getting rid of the input string of each transition; this gives as a result the projection of the WG onto the target language. Next, a target language model (LM) helps to make the choice of the most likely hypothesis amongst those in WG.l:

    t̂ ≈ arg max_t P_{WG.l}(t) P_{LM}(t)    (10)

In other words, while in Section 3.2 the translation model was used to re-score alternative transcriptions of speech, in this approach a target language model re-scores alternative translations provided by the bilingual WG. Note that this approach, as well as the semi-decoupled one, entails a two-pass decoding strategy. Both rely upon two models: the former focuses on the source-language WG, this one on the target-language WG.
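The target-side projection and LM re-scoring of eq. (10) admit a compact sketch (illustrative Python; the bilingual WG is simplified to a list of scored source-target path pairs, and lm_score stands for a hypothetical target language model):

    def project_and_rescore(bilingual_paths, lm_score):
        """Eq. (10): drop the source side of each bilingual path, re-score with a target LM.

        bilingual_paths: iterable of (s, t, p) triples, full paths of the bilingual WG
        lm_score: function t -> P_LM(t)
        """
        lower_side = [(t, p) for (_s, t, p) in bilingual_paths]  # the projection WG.l
        return max(lower_side, key=lambda tp: tp[1] * lm_score(tp[0]))[0]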

4 Experiments

The aim of this section is to assess empirically the performance of each of the four approaches previously introduced: decoupled, semi-decoupled, fully-integrated, and integrated WG with re-scoring LM. The four approaches differ in the decoding strategy implemented to sort out the decision problem, but all of them rely on the very same knowledge sources (that is, the same acoustic and translation model).

The main features of the corpus used to carry out the experimental layout are summarized in Table 1. The training set was used to infer the TM consisting of an SFST, and the test set to assess the SST decoding approaches. The test set consisted of 500 training-independent pairs, different from each other, each of them uttered by at least 3 speakers.

Table 1: Main features of the Meteus corpus.

                            Spanish    Basque
    Train  Sentences              15,000
           Running words    191,000    187,000
           Vocabulary           702      1,135
    Test   Sentences               1,800
           Hours of speech      3.0        3.5

The performance of each experiment is assessed through well-known evaluation metrics, namely: bilingual evaluation understudy (BLEU) (Papineni et al., 2002), word error rate (WER), and translation edit rate (TER).

4.1 Results

The obtained results are given in Table 2. The performance of the most likely or single-best translation derived by either the decoupled or the fully-integrated architecture is shown in the first row of Tables 2a and 2b, respectively. The performance of the semi-decoupled approach and of the integrated WG with re-scoring LM is shown in the second row. The highest performance achievable by both the semi-decoupled approach and the integrated WG with re-scoring LM is given in the third row. To do so, an oracle evaluation of the alternatives was carried out, and the score associated with the best achievable choice was given as in (Pérez et al., 2010). Since the oracle evaluation provides an upper threshold of the quality achievable, the scope of each decoupled or integrated approach can be assessed regardless of the underlying decoding algorithms and approaches. The highest performance achievable is reflected in the last row of Tables 2a and 2b.

4.2 Discussion

While the results with two-pass decoding strategies (either the decoupled or the semi-decoupled approach) require an ASR engine, the integrated approaches have the ability to get both the source string and its translation. This is why we make a distinction between ASR-WER in the former and source-WER in the latter. Nevertheless, our aim focuses on translation rather than on recognition.

The results show that the semi-decoupled approach outperforms the decoupled one. Similarly, the approach based on the integrated WG with the re-scoring target LM outperforms the integrated approach. As a result, exploring different hypotheses and making the selection with a second model allows more refined decisions. On the other hand, comparing the first row of Table 2a with the first row of Table 2b (or equally the second row of the former with the second row of the latter), we conclude that slightly better performance can be obtained with the integrated approach.

Finally, comparing the third row of both Table 2a and Table 2b, the conclusion is that the eventual quality of the hypotheses within the integrated approach is significantly better than that of the hypotheses in the semi-decoupled approach. That is, what we can learn is that the integrated decoding strategy keeps much better hypotheses than the semi-decoupled one throughout the decoding process. Still, while good quality hypotheses exist within the integrated approach, the re-scoring with a target LM used to select a single hypothesis from the entire network has not resulted in getting the best possible hypothesis. The oracle evaluation shows that the integrated approach offers leeway to achieve improvements in quality; yet, alternative strategies have to be explored.

(a) Decoupled and semi-decoupled

                     ASR      target
                     WER      BLEU   WER    TER
    D 1-best         7.9      40.8   50.3   47.7
    SD               7.9      42.2   47.6   44.7
    SD tgt-oracle    7.5      57.6   36.2   32.8

(b) Integrated and integrated WG with LM

                     source   target
                     WER      BLEU   WER    TER
    I 1-best         9.6      40.9   49.6   46.8
    I WG + LM        9.3      42.6   46.7   43.9
    I tgt-oracle     6.6      64.0   32.2   28.5

Table 2: Assessment of the SST approaches: decoupled and semi-decoupled (2a), and integrated approaches (2b).

5 Conclusions

Different approaches to cope with the SST decoding methodology were explored, namely, the decoupled approach, the semi-decoupled approach, the fully-integrated approach, and the integrated approach with a re-scoring LM. The first two follow a two-pass decoding strategy and focus on exploring alternatives in the source language, while the integrated one follows a single-pass decoding and presents tight cooperation between acoustic and translation models.

All the experimental layouts used exactly the same translation and acoustic models, differing only in the methodology used to overcome the decision problem. In this way, we can assert that the differences lie in the decoding strategies rather than in the models themselves. Note that implementing all the models in terms of finite-state models allows building both decoupled and integrated approaches.

Both decoupled and integrated decoding approaches aim at finding the most likely translation under different assumptions. Occasionally, the most probable translation does not turn out to be the most accurate one with respect to a given reference. On account of this, we turned to analyzing alternatives and making use of re-scoring techniques in both approaches in an attempt to make the most accurate hypothesis emerge. This resulted in the semi-decoupled and integrated-WG with re-scoring target LM approaches.

What we can learn from the experiments is that integrating the models allows keeping good quality hypotheses in the decoding process. Nevertheless, the re-scoring model has not proved able to make the most of the integrated approach. In other words, there are better quality hypotheses within the word-graph than the one selected by the re-scoring target LM. Hence, further work should focus on other means of selecting hypotheses from the integrated word-graph.

However, undoubtedly, significantly better performance can be reached with the integrated decoding strategy than with the semi-decoupled one. It seems as though knowledge sources modeling the syntactic differences between source and target languages should be tackled in order to improve the performance; particularly in our case, a strategy for further work could go along the line of the recently tackled approach (Durrani et al., 2011).

Acknowledgments

This work was partially funded by the Spanish Ministry of Science and Innovation through the Tímpano (TIN2011-28169-C05-04) and iTrans2 (TIN2009-14511) projects; also through the MIPRCV (CSD2007-00018) project within the Consolider-Ingenio 2010 program; by the Basque Government to the PR&ST research group (GIC10/158, IT375-10); and by the Generalitat Valenciana under grants ALMPR (Prometeo/2009/01) and GV/2010/067.

References

[Bertoldi et al. 2007] N. Bertoldi, R. Zens, and M. Federico. 2008. Efficient speech translation by confusion network decoding. IEEE International Conference on Acoustics, Speech and Signal Processing, pg. 1696–1705.

[Casacuberta and Vidal 2004] F. Casacuberta and E. Vidal. 2004. Machine translation with inferred stochastic finite-state transducers. Computational Linguistics, 30(2): pg. 205–225.

[Casacuberta et al. 2008] F. Casacuberta, M. Federico, H. Ney, and E. Vidal. 2008. Recent efforts in spoken language translation. IEEE Signal Processing Magazine, 25(3): pg. 80–88.

[Caseiro and Trancoso 2006] D. Caseiro and I. Trancoso. 2006. A specialized on-the-fly algorithm for lexicon and language model composition. IEEE Transactions on Audio, Speech & Language Processing, 14(4): pg. 1281–1291.

[Durrani et al. 2011] N. Durrani, H. Schmid, and A. Fraser. 2011. A joint sequence translation model with integrated reordering. In 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pg. 1045–1054.

[Mariño et al. 2006] J. B. Mariño, R. E. Banchs, J. M. Crego, A. de Gispert, P. Lambert, J. A. R. Fonollosa, and M. R. Costa-jussà. 2006. N-gram-based machine translation. Computational Linguistics, 32(4): pg. 527–549.

[Martin et al. 1999] S. C. Martin, H. Ney, and J. Zaplo. 1999. Smoothing methods in maximum entropy language modeling. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pg. 545–548.

[Matusov and Ney 2011] E. Matusov and H. Ney. 2011. Lattice-based ASR-MT interface for speech translation. IEEE Transactions on Audio, Speech, and Language Processing, 19(4): pg. 721–732.

[Ney et al. 1997] H. Ney, S. Ortmanns, and I. Lindam. 1997. Extensions to the word graph method for large vocabulary continuous speech recognition. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, pg. 1791–1794.

[Papineni et al. 2002] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. Annual Meeting of the Association for Computational Linguistics, pg. 311–318.

[Pérez et al. 2010] A. Pérez, M. I. Torres, and F. Casacuberta. 2010. Potential scope of a fully-integrated architecture for speech translation. Annual Conference of the European Association for Machine Translation, pg. 1–8.

[Quan et al. 2005] V. H. Quan, M. Federico, and M. Cettolo. 2005. Integrated n-best re-ranking for spoken language translation. European Conference on Speech Communication and Technology, Interspeech, pg. 3181–3184.

[Rabiner 1989] L. R. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2): pg. 257–286.

[Vidal et al. 2005] E. Vidal, F. Thollard, C. de la Higuera, F. Casacuberta, and R. C. Carrasco. 2005. Probabilistic finite-state machines - part II. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(7): pg. 1026–1039.

[Vidal 1997] E. Vidal. 1997. Finite-state speech-to-speech translation. International Conference on Acoustics, Speech and Signal Processing, vol. 1, pg. 111–114.

[Vilar et al. 2006] David Vilar, Jia Xu, Luis Fernando D'Haro, and H. Ney. 2006. Error analysis of machine translation output. International Conference on Language Resources and Evaluation, pg. 697–702.

[Zhang et al. 2004] R. Zhang, G. Kikui, H. Yamamoto, T. Watanabe, F. Soong, and W. K. Lo. 2004. A unified approach in speech-to-speech translation: integrating features of speech recognition and machine translation. International Conference on Computational Linguistics, pg. 1168–1174.

[Zhou et al. 2007] B. Zhou, L. Besacier, and Y. Gao. 2007. On efficient coupling of ASR and SMT for speech translation. IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 4, pg. 101–104.

Refining the Design of a Contracting Finite-State Dependency Parser

Anssi Yli-Jyrä, Jussi Piitulainen and Atro Voutilainen
The Department of Modern Languages
PO Box 3
00014 University of Helsinki
{anssi.yli-jyra, jussi.piitulainen, atro.voutilainen}@helsinki.fi

Abstract

This work complements a parallel paper on a new finite-state dependency parser architecture (Yli-Jyrä, 2012) by a proposal for a linguistically elaborated morphology-syntax interface and its finite-state implementation. The proposed interface extends Gaifman's (1965) classical dependency rule formalism by separating lexical word forms and morphological categories from syntactic categories. The separation lets the linguist take advantage of the morphological features in order to reduce the number of dependency rules and to make them lexically selective. In addition, the relative functional specificity of parse trees gives rise to a measure of parse quality. By filtering worse parses out from the parse forest using finite-state techniques, the best parses are saved. Finally, we present a synthesis of strict grammar parsing and robust text parsing by connecting fragmental parses into trees with additional linear successor links.

1 Introduction

Finite-state dependency parsing aims to combine dependency syntax and finite-state automata into a single elegant system. Deterministic systems such as (Elworthy, 2000) are fast but susceptible to garden-path type errors, although some ambiguity is encoded in the output. Some other systems such as (Oflazer,

2003; Yli-Jyrä, 2005) carry out full projective dependency parsing while being much slower, especially if the syntactic ambiguity is high. In the worst case, the size of the minimal finite-state automaton storing the forest is exponentially larger than the sentence: an 80-word sentence has potentially 1.1 × 10^62 unrooted unlabeled dependency trees that are stored "compactly" into a finite-state lattice that requires at least 2.4 × 10^24 states; see Table 4 in Yli-Jyrä (2012).

A truly compact representation of the parse forest is provided by an interesting new extended finite-state parsing architecture (Yli-Jyrä, 2012) that first recognizes the grammatical sentences in quadratic time and space if the nested dependencies are limited by a constant (in cubic time if the length of the sentence limits the nesting). The new system (Yli-Jyrä, 2012) replaces the additive (Oflazer, 2003) and the intersecting (Yli-Jyrä, 2005) validation of dependency links with reductive validation that gradually contracts the dependencies until the whole tree has been reduced into a trivial one. The idea of the contractions is illustrated in Example (1). In practice, our parser operates on bracketed trees (i.e., strings), but the effect will be similar.

(1) a. time flies like an arrow
       [the full dependency tree, with the links SUBJ, ADVL, NOBJ, and DET]
    b. time flies like an arrow
       [the same sentence after contracting the innermost links; the NOBJ link remains]
    c. time flies like an arrow
       [after the final contraction, the tree has been reduced to a trivial one]


Despite being non-deterministic and efficient, there are two important requirements that are not fulfilled by the core of the new architecture (Yli-Jyrä, 2012):

1. A mature finite-state dependency parser must be robust. The outputs should not be restricted to complete grammatical parses. For example, Oflazer (2003) builds fragmental parses but later drops those fragmental parses for which there are alternative parses with fewer fragments. However, his approach handles only gap-free bottom-up fragments and optimizes the number of fragments by a counting method whose capacity is limited.

2. Besides robustness, a wide-coverage parser should be able to assign reasonably well-motivated syntactic categories to every word in the input. This amounts to having a morphological guesser and an adequate morphology-syntax interface. Most prior work trivializes the complexity of the interface, being comparable to Gaifman's (1965) legacy formalism that is mathematically elegant but based on word-form lists. A good interface formalism is provided, e.g., by Constraint Grammar parsers (Karlsson et al., 1995), where syntactic rules can refer to morphological features. Oflazer (2003) tests morphological features in complicated regular expressions. The state complexity of the combination of such expressions is, however, a potential problem if many more rules were added to the system.

This paper makes two main contributions:

1. The paper adapts Gaifman's elegant dependency rule formalism to the requirements of morphologically rich languages. With the adapted formalism, grammar writing becomes easier. However, an efficient implementation of the rule lookup becomes inherently trickier because testing several morphological conditions in parallel increases the size of the finite-state automata. Fortunately, the new formalism comes with an efficient implementation that keeps the finite-state representation of the rule set as elegant as possible.

2. The paper introduces a linguistically motivated ranking for complete trees. According to it, a tree is better than another tree if a larger proportion of its dependency links is motivated by the linguistic rules. In contrast to Oflazer (2003), our method counts the number of links needed to connect the fragments into a spanning tree. Moreover, since such additional links are indeed included in the parses, the ranking method turns a grammar parser into a robust text parser.

The paper is structured as follows. The next section gives an overview of the new parser architecture. After it, we present the new morphology-syntax interface in Section 3 and the parse ranking method in Section 4. The paper ends with a theoretical evaluation and discussion of the proposed formalism in Section 5.

2 The General Design

2.1 The Internal Linguistic Representation

We need to define a string-based representation for the structures that are processed by the parser. For this purpose, we encode the dependency trees and then augment the representation with morphological features.

Dependency brackets encode dependency links between pairs of tokens that are separated by an (implicit) token boundary. The four-token string abcd has 12 distinct undirected unlabeled dependency bracketings: a((()b)c)d, a((b())c)d, a(()b()c)d, a(()bc())d, a(()b)c()d, a(b(()c))d, a(b(c()))d, a(b()c())d, a(b())c()d, a()b(()c)d, a()b(c())d, and a()b()c()d.[1]

The basic dependency brackets extend with labels such as in (LBL and LBL) and directions such as in <LBL, LBL\, /LBL, and LBL>. Directed dependency links designate one of the linked words as the head and another as the dependent. The extended brackets let us encode a full dependency tree in a string format as indicated in (2).[2] The dependent word of each

[1] Dependency bracketing differs clearly from binary phrase-structure bracketings that put brackets around phrases: the string abcd has only five distinct bracketings ((ab)(cd)), (((ab)c)d), ((a(bc))d), (a((bc)d)), and (a(b(cd))).
[2] The syntactic labels used in this paper are: AG=Agent, by=Preposition 'by' as a phrasal verb complement, D=Determiner, EN=Past Participle, FP=Final Punctuation, P=adjunctive preposition, PC=Preposition Complement, S=Subject, sgS=Singular Subject.

link is indicated in trees with an arrowhead, but in bracketing with an angle bracket.

(2) it was inspired by the writings .
    [Example (2) shows the sentence with its dependency tree and the corresponding bracketed string; the links carry the labels S, EN, AG, PC, D, and FP, and the string encoding uses directed brackets such as /AG, AG>, /PC, and FP>.]

In Table 1, the dependency bracketing is combined with a common textual format for morphological analyses. In this format, the base forms are defined over the alphabet of orthographical symbols Ω, whereas the morphological symbols and syntactic categories are multi-character symbols that belong, respectively, to the alphabets Π and Γ. In addition, there is a token boundary symbol #.

Table 1: One morpho-syntactic analysis of a sentence

    1  i t      PRON NOM SG3   <S #
    …
    …                          … /AG #
    4  b y      PREP           AG> /PC #
    5  t h e    DET SG/PL      <D #
    …
    7  .        PUNCT          FP> #

Depending on the type of the language, one orthographical word can be split into several parts, such as the inflectional groups in Turkish (Oflazer, 2003). In this case, a separate word-initial token boundary can be used to separate such parts into lines of their own.

The current dependency bracketing captures projective and weakly non-projective (1-planar) trees only, but an extended encoding for 2-planar and multi-planar dependency trees seems feasible (Yli-Jyrä, 2012).
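To make the string encoding concrete, the following sketch (illustrative Python, not the authors' code) converts a projective head-indexed tree into labeled, directed dependency brackets of the kind described above; it assumes brackets are placed between a word and the following token boundary #:

    def brackets(heads, labels):
        """Encode a projective dependency tree as a dependency-bracketed string.

        heads[i]  -- index of token i's head (0-based), or None for the root
        labels[i] -- label of the link from token i to its head
        """
        n = len(heads)
        # A link is (left, right, label, direction); '<' means the head is on the right.
        links = []
        for d, h in enumerate(heads):
            if h is None:
                continue
            if d < h:
                links.append((d, h, labels[d], '<'))   # dependent ... head
            else:
                links.append((h, d, labels[d], '/'))   # head ... dependent
        out = []
        for i in range(n):
            out.append('w%d' % i)  # placeholder for the word and its morphology
            # closing brackets: innermost (shortest) links first
            for _l, _r, lbl, dr in sorted((l for l in links if l[1] == i),
                                          key=lambda l: -l[0]):
                out.append(lbl + '\\' if dr == '<' else lbl + '>')
            # opening brackets: outermost (longest) links first
            for _l, _r, lbl, dr in sorted((l for l in links if l[0] == i),
                                          key=lambda l: -l[1]):
                out.append('<' + lbl if dr == '<' else '/' + lbl)
            out.append('#')
        return ' '.join(out)

    # brackets([1, None, 1, 4, 2], ['SUBJ', None, 'ADVL', 'DET', 'NOBJ'])
    # -> 'w0 <SUBJ # w1 SUBJ\ /ADVL # w2 ADVL> /NOBJ # w3 <DET # w4 DET\ NOBJ> #'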

2.2 The Valid Trees

We are now going to define precisely the semantics of the syntactic grammar component using finite-state relations.

The finite-state languages will be defined over a finite alphabet Σ, and they include all finite subsets of the universal language Σ*. The (binary) finite-state relations are defined over Σ* and include all finite subsets of Σ* × Σ*. In addition, they are closed under the operations over finite-state languages L and M and finite-state relations R and S according to Table 2. The language relation Id(L) restricts the identity relation to a language L. The composition of language relations corresponds to the intersection of their languages.

Table 2: The relevant closure properties

    language     relation          meaning
    L M          R S               concatenation
    L*           R*                (Kleene) star
    L+           R+                (Kleene) plus
    L ∪ M        R ∪ S             union
                 Id(L)             language relation
    Id⁻¹(R)                        L for R = Id(L)
    L − M        Id(L) − Id(M)     set difference
                 L × M             cross product
                 R|_L              input restriction
                 R ∘ S             composition
                 Proj₁(R)          Id(the input side of R)

For notational convenience, the empty string is denoted by ε. A string x is identified with the singleton set {x}.

The syntactic component of the grammar defines a set of parse strings where the bracketing is a valid dependency tree. In these parses, there is no morphological information. One way to express the set is to intersect a set of constraints as in (Yli-Jyrä, 2005). However, the contracting dependency parser expresses the Id relation of the set through a composition of simple finite-state relations:

    Syn_t = Proj₁(Abst ∘ R ∘ ... ∘ R ∘ Root)    (t copies of R),    (1)
    Root = Id(#).    (2)

In (1), Abst is a relation that removes all non-syntactic information from the strings:

    Abst = (Id(Γ) ∪ Id(#) ∪ Delete)*,    (3)
    Delete = {(x, ε) | x ∈ Ω ∪ Π},    (4)

and R is a relation that performs one layer of contractions in dependency bracketing:

    R = (Id(Γ) ∪ Id(#) ∪ Left ∪ Right)*,    (5)
    Left = {(<α # α\, ε) | <α, α\ ∈ Γ},    (6)
    Right = {(/α # α>, ε) | /α, α> ∈ Γ}.    (7)

The parameter t determines the maximum number of layers of dependency links in the validated bracketings. The limit of Syn_t as t approaches ∞ is not necessarily a finite-state language, but it remains context-free because only projective trees are assigned to the sentences.

2.3 The Big Picture

We are now ready to embed the contraction-based grammar into the bigger picture. Let x ∈ Ω* be an orthographical string to be parsed. Assume that it is segmented into n tokens. The string x is parsed by composition of four relations: the relation {(x, x)}, the lexical transducer (Morph), the morphology-syntax interface (Iface), and the syntactic validator Syn_{n−1}:

    Parses(x) = Id(x) ∘ Morph ∘ Iface ∘ Syn_{n−1}.    (8)

The language relation Proj₂(Parses(x)) encodes the parse forest of the input x.

In practice, the syntactic validator Syn_{n−1} cannot be compiled into a finite-state transducer due to its large state complexity. However, when each copy of the contracting transducer R in (1) is restricted by its admissible input-side language, a compact representation for the input-side restriction (Syn_{n−1})|_X, where X = Proj₂(Id(x) ∘ Morph ∘ Iface), is computed efficiently as described in (Yli-Jyrä, 2012).
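The contraction relation R of eqs. (5)-(7) can be emulated directly on bracket strings. The following sketch (illustrative Python, operating on explicit token lists instead of transducers, and assuming Abst has already removed the words and morphology) applies one layer of contractions and uses it for reductive validation:

    def contract_once(tokens):
        """One application of R: delete matched pairs '<A # A\\' and '/A # A>'."""
        out, i = [], 0
        while i < len(tokens):
            if (i + 2 < len(tokens) and tokens[i + 1] == '#'
                    and ((tokens[i].startswith('<') and tokens[i + 2] == tokens[i][1:] + '\\')
                         or (tokens[i].startswith('/') and tokens[i + 2] == tokens[i][1:] + '>'))):
                i += 3                       # drop the link and the boundary between
            else:
                out.append(tokens[i])
                i += 1
        return out

    def reduces_to_trivial_tree(tokens, max_layers):
        """Contract layer by layer; a valid tree leaves a single boundary, cf. Root = Id(#)."""
        for _ in range(max_layers):
            nxt = contract_once(tokens)
            if nxt == tokens:
                break                        # nothing left to contract
            tokens = nxt
        return tokens == ['#']

Outer links only become adjacent after their inner links have been contracted, which is why the function iterates, mirroring the layered composition R ∘ ... ∘ R in (1).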

3 The Grammar Formalism

In the parser, the linguistic knowledge is organized into Morph (the morphology) and Iface (the lexicalized morphology-syntax interface), while Syn has mainly a technical role as a tree validator. Implementing the morphology-syntax interface is far from an easy task, since it is actually the place that lexicalizes the whole syntax.

3.1 Gaifman's Dependency Rules

Gaifman's legacy notation (Gaifman, 1965; Hays, 1964) for dependency grammars assigns word forms to a finite number of potential morpho-syntactic categories that relate word forms to their syntactic functions. The words of particular categories are then related by dependency rules:

    X0(Xp, ..., X−1, *, X1, ..., Xm).    (9)

The rule (9) states that a word in category X0 is the head of dependent words in categories Xp, ..., X−1 before it and words in categories X1, ..., Xm after it, in the given order. The rule expresses, in a certain sense, the frame or the argument structure of the word. Rule X(*) indicates that the word in category X can occur without dependents. In addition, there is a root rule *(X) that states that a word in category X can occur independently, that is, as the root of the sentence.

In the legacy notation, the distinction between complements and adjuncts is not made explicit, as both need to be listed as dependents. To compact the notation, we introduce optional dependents that will be indicated by categories Xp?, ..., X−1? and categories X1?, ..., Xm?. This extension potentially saves a large number of rules in cases where several dependents are actually adjuncts, some kinds of modifiers.[3]

3.2 The Decomposed Categories

In practice, atomic morpho-syntactic categories are often too coarse for morphological description but too refined for a convenient description of syntactic frames. A practical description requires a more expressive and flexible formalism.

In our new rule formalism, each morpho-syntactic category X is viewed as a combination of a morphological category M (including the information on the lexical form of the word) and a syntactic category S. The morphological category M is a string of orthographical and morphological feature labels, while S is an atomic category label.

The morphological category M0 and the syntactic category S0 are specified for the head of each dependency rule. Together, they specify the morpho-syntactic category (M0, S0). In contrast, the rule specifies only the syntactic categories Sp, ..., S−1 and S1, ..., Sm of the dependent words, and thus delegates the selection of the morphological categories to the respective rules of the dependent words. The categories Sp, ..., S−1 and S1, ..., Sm may again be marked optional with the question mark.

The rules are separated according to the direction of the head dependency. Rules (10), (11) and (12) attach the head to the right, to the left, and in any direction, respectively. In addition, the syntactic category of the root is specified with a rule of the form (13).

    S0→(Sp, ..., S−1, *[M0], S1, ..., Sm),    (10)
    S0←(Sp, ..., S−1, *[M0], S1, ..., Sm),    (11)
    S0(Sp, ..., S−1, *[M0], S1, ..., Sm),    (12)
    *(S0).    (13)

The interpretations of rules (10)-(12) are similar to rule (9), but the rules are lexicalized and directed. The feature string M0 ∈ (Ω*%Ω* ∪ Ω*)Π* defines the relevant head word forms using the features provided by Morph. The percent symbol (%) stands for the unspecified part of the lexical base form.

The use of the extended rule formalism is illustrated in Table 3. According to the rules in the table, a phrase headed by the preposition by has three uses: an adjunctive preposition (P), the complement of a phrasal verb (by), or the agent of a passive verb (AG). Note that the last two uses correspond to a fully lexicalized rule where the morphological category specifies the lexeme. The fourth rule illustrates how morphological features are combined in N NOM SG and then partly propagated to the atomic name of the syntactic category.

Table 3: Extended Gaifman rules

    1  P    (*[% PREP], PC)                  % prepos.
    2  by   (*[b y PREP], PC)                % phrasal
    3  AG   (*[b y PREP], PC)                % agent
    4  sgS  (D?, M?, *[% N NOM SG], M?)      % noun

[3] Optional dependents may be a worthwhile extension even in descriptions that treat the modified word as a complement of a modifier.

3.3 Making a Gaifman Grammar Robust

Dependency syntax describes complete trees where every link can be motivated by the linguistic knowledge. To glue the fragments together, we interpret the roots of fragments as linear successors - thus dependents - for the word that immediately precedes the fragment.

The link to a linear successor is indicated with a special category ++ having a default rule ++(*). Since any word can act as a root of a fragment, every word is provided with this potential category. In addition, there is, for every rule (12), an automatic rule ++(Sp, ..., S−1, *[M], S1, ..., Sm) that allows the roots of the fragments to have the corresponding dependents. Similar automatic rules are defined for the directed rules.

The category ++ is used to indicate dependent words that do not have any linguistically motivated syntactic function. The root rule *(++) states that this special category can act as the root of the whole dependency tree. In addition to the root function expressed by that rule, an optional dependent ++? is appended to the end of every dependency rule. This connects fragments to their left contexts.

With the above extensions, all sentences will have at least one complete tree as a parse. Parses with some dependents of the type ++ are linguistically inferior to parses that do not have such dependents or have fewer of them. Removing such inferior analyses from the output of the parser is proposed in Section 4.
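To illustrate how the decomposed categories of Section 3.2 behave, here is a sketch of rule lookup in which % acts as a wildcard over the unspecified part of the base form, in the spirit of Table 3 (illustrative Python with a hypothetical rule encoding; the parser itself compiles the rules into the transducer Chk instead of matching them one by one):

    def morph_matches(pattern, analysis):
        """Match a morphological category like '% N NOM SG' or 'b y PREP'
        against a token's analysis string; '%' matches any base-form part."""
        if '%' not in pattern:
            return pattern == analysis
        prefix, suffix = pattern.split('%', 1)
        return (len(analysis) >= len(prefix) + len(suffix)
                and analysis.startswith(prefix) and analysis.endswith(suffix))

    # Each rule pairs a syntactic category with a morphological pattern for the
    # head and the syntactic categories of its dependents (hypothetical encoding).
    RULES = [
        ('P',   '% PREP',     ('PC',)),        # adjunctive preposition
        ('by',  'b y PREP',   ('PC',)),        # phrasal-verb complement 'by'
        ('AG',  'b y PREP',   ('PC',)),        # agent of a passive verb
        ('sgS', '% N NOM SG', ('D?', 'M?')),   # singular subject noun
    ]

    def categories_for(analysis):
        """Return the syntactic categories licensed for a morphological analysis."""
        return [cat for cat, pat, _deps in RULES if morph_matches(pat, analysis)]

    # categories_for('b y PREP') -> ['P', 'by', 'AG']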

3.4 The Formal Semantics of the Interface

Let there be r dependency rules. For each rule i, i ∈ {1, ..., r} of type (10), let

    Fi = M0,    (14)
    Gi = S−1\ ... Sp\ S0> /Sm ... /S1,    (15)

where S−1\, ..., Sp\, S0>, /Sm, ..., /S1 ∈ Γ. For each rule of type (11), S0> in (15) is replaced with <S0, and rules of type (12) contribute both variants of the directed dependency rules. The interface is then defined by the following relations:

    Iface = Intro ∘ Chk,    (16)
    Intro = (Id(Ω*Π*)(ε × Γ*) Id(#))*,    (17)
    Chk = Proj₁(Match ∘ Rules),    (18)
    Rules = Id((∪_{i=1..r} Fi Gi #)*),    (19)
    Match = (Id(Ω*) Mid Id(Ω*) Tag* Id(#))*,    (20)
    Mid = Id(ε) ∪ (Ω* × %),    (21)
    Tag = Id(Π) ∪ (Π × ε).    (22)

Iface is the composition of the relations Intro and Chk. The relation Intro inserts dependency brackets between the morphological analysis of each token and the following token boundary. The relation Chk verifies that the inserted brackets are supported by dependency rules that are represented by the relation Rules.

In order to allow generalizations in the specification of morphological categories, the relation Intro does not match dependency rules directly, but through a filter. This filter, Match, optionally replaces the middle part of each lexeme with % and arbitrary morphological feature labels with the empty string.

In addition to the dependency rules, we need to define the semantics of the root rules. Let H be the set of the categories having a root rule. The category of the root word will be indicated in the dependency bracketing as an unmatched bracket. It is checked by the relation Root = Id(H#) that replaces Root = Id(#) in the composition formulas (1).

3.5 An Efficient Implementation

The definition of Iface gives rise to a naive parser implementation that is based on the formula

    Parses(x) = MI_x ∘ Chk ∘ Syn_{n−1},    (23)
    MI_x = Id(x) ∘ Morph ∘ Intro.    (24)

The naive implementation is inefficient in practice. The main efficiency problem is that the state complexity of the relation Chk can be exponential in the number of rules. To avoid this, we replace it with Chk_x, a restriction of Chk. This restriction is computed lazily when the input is known:

    Parses(x) = MI_x ∘ Chk_x ∘ Syn_{n−1},    (25)
    Chk_x = Proj₁(Match_x ∘ Rules),    (26)
    Match_x = Proj₂(MI_x) ∘ Match.    (27)

In this improved method, the application of Iface demands only linear space according to the number of rules. This method is also fast to apply to the input, as far as the morphology-syntax interface is concerned. Meanwhile, one efficient implementation of Syn_{n−1} is already provided in (Yli-Jyrä, 2012).

4 The Most Specific Parse

The parsing method of (Yli-Jyrä, 2012) builds the parse forest efficiently using several transducers, but there is no guarantee that the whole set of parses could be extracted efficiently from the compact representation constructed during the recognition phase. We will now assume, however, that the number of parses is, in practice, substantially smaller than in the theoretically possible worst case. Moreover, it is even more important to assume that the set of parses is compactly packed into a finite automaton. These two assumptions let us proceed by refining the parse forest without using weights such as in (Yli-Jyrä, 2012).

In the following, we restrict the parse forest to those parses that have the smallest number of 'linear successor' dependencies (++). The number of such dependencies is compared with a finite-state relation Cp ⊆ (Γ ∪ {#})* × (Γ ∪ {#})* constructed as follows:

    Σ′ = Σ − {++>},    (28)
    Cp = Map_i ∘ (Id(++>*)(ε × ++>+)) ∘ Map_i⁻¹,    (29)
    Map_i = (Id(++>) ∪ (Σ′ × ε))*.    (30)

In practice, the reduction of the parse forest is possible only if the parse forest Proj₂(Parses(x)) is recognized by a sufficiently small finite-state automaton that can then be operated on in Formula (33). The parses that minimize the number of 'linear successor' dependencies are obtained as the output of the relation Parses′(x):

    Parses′(x) = MI_x ∘ Chk_x ∘ T_{x,1},    (31)
    T_{x,0} = Proj₂(Parses(x)),    (32)
    T_{x,1} = T_{x,0} − Proj₂(T_{x,0} ∘ Cp ∘ T_{x,0}).    (33)

This restriction technique could be repeatedly applied to further levels of specificity. For example, lexically motivated complements could be preferred over adjuncts and other grammatically possible dependents.
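Paraphrased without transducers, eqs. (31)-(33) keep exactly the parses with the fewest ++ links. A sketch under that reading (illustrative Python; the real system performs the comparison with the relation Cp over a finite automaton rather than an explicit list of parses):

    def most_specific_parses(parse_forest):
        """Keep the parses with the minimal number of '++>' dependencies, cf. eq. (33)."""
        def penalty(parse_tokens):
            return sum(1 for t in parse_tokens if t == '++>')
        best = min(penalty(p) for p in parse_forest)
        return [p for p in parse_forest if penalty(p) == best]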

113 5 Evaluation and Discussion 5.4 Computational Complexity Thanks to dynamically applied finite-state opera- 5.1 Elegance tions and the representation of feature combinations We have retained most of the elegancy in the as strings rather than regular languages, the depen- contracting finite-state dependency parser (Yli-Jyra,¨ dency rules can be compiled quickly into the trans- 2012). The changes introduced in this paper are ducers used by the parser. For example, the actual modular and implementable with standard opera- specifications of dependency rules are now com- tions on finite-state transducers. piled into a linear-size finite-state transducer, Chk. Our refined design for a parser can be imple- The proposed implementation for the morphology- mented largely in similar lines as the general ap- syntax interface is, thus, a significant improvement proach (Yli-Jyra,¨ 2012) up to the point when the in comparison to the common approach that com- parses are extracted from the compact parse forest. piles and combines replacement rules into a single Parsing by arc contractions is closely related transducer where the morphological conditions of to the idea of reductions with restarting automata the rules are potentially mixed in a combinatorial (Platek´ et al., 2003). manner. Although we have started to write an experimental 5.2 Coverage grammar, we do not exactly know how many rules a mature grammar will contain. Lexicalization of The representation of the parses can be extended to the rules will increase the number of rules signifi- handle word-internal token boundaries, which facil- cantly. The number of syntactic categories will in- itates the adequate treatment of agglutinative lan- crease even more if complements are lexicalized. guages, cf. (Oflazer, 2003). The limit for nested brackets is based on the psy- 5.5 Robustness cholinguistic reality (Miller, 1956; Kornai and Tuza, In case the grammar does not fully disambiguate or 1992) and the observed tendency for short depen- build a complete dependency structure, the parser dencies (Lin, 1995; Eisner and Smith, 2005) in nat- should be able to build and produce a partial anal- ural language. ysis. (In interactive treebanking, it would be useful The same general design can be used to produce if an additional knowledge source, e.g. a human, can non-projective dependency analyses as required by be used to provide additional information to help the many European languages. The crossing dependen- parser carry on the analysis to a complete structure.) cies can be assigned to two or more planes as sug- The current grammar system indeed assumes that gested in (Yli-Jyra,¨ 2012). 2-planar bracketing al- it can build complete trees for all input sentences. ready achieves very high recall in practice (Gomez-´ This assumption is typical for all generative gram- Rodr´ıguez and Nivre, 2010). mars, but seems to contradict the requirement of ro- bustness. To support robust parsing, we have now 5.3 Ambiguity Management proposed a simple technique where partial analyses Oflazer (2003) uses the lenient composition opera- are connected into a tree with the “linear succes- tion to compute the number of bottom-up fragments sor” links. The designed parser tries its best to avoid in incomplete parses. The current solution improves these underspecific links, but uses the smallest pos- above this by supporting gapped fragments and un- sible number of them to connect the partial analyses restricted counting of the graph components. 
into a tree if more grammatical parses are not avail- Like in another extended finite-state approach able. (Oflazer, 2003), the ambiguity in the output of our 5.6 Future Work parsing method can be reduced by removing parses with high total link length and by applying filters Although Oflazer (2003) does not report significant that enforce barrier constraints to the dependency problems with long sentences, it may be difficult to links. construct a single automaton for the parse forest of a

114 sentence that contains many words. In the future, David Elworthy. 2000. A finite state parser with de- a more efficient method for finding the most spe- pendency structure output. In Proceedings of Sixth In- cific parse from the forest can be worked out us- ternational Workshop on Parsing Technologies (IWPT ing weighted finite-state automata. Such a method 2000, Trento, Italy, February 23-25. Institute for Sci- entific and Technological Research. would combine the approaches of the companion pa- Haim Gaifman. 1965. Dependency systems and per (Yli-Jyra,¨ 2012) and the current paper. phrase-structure systems. Information and Control, It seems interesting to study further how the speci- 8:304–37. ficity reasoning and statistically learned weights Carlos Gomez-Rodr´ ´ıguez and Joakim Nivre. 2010. A could complement each other in order to find the transition-based parser for 2-planar dependency struc- best analyses. Moreover, the parser can be modified tures. In Proceedings of the 48th Annual Meeting of in such a way that debugging information is pro- the Association for Computational Linguistics, pages duced. This could be very useful, especially when 1492—-1501, Uppsala, Sweden, 11-16 July. learning contractions that handle the crossing depen- David G. Hays. 1964. Dependency theory: A formalism and some observations. Language, 40:511–525. dencies of non-projective trees. Fred Karlsson, Atro Voutilainen, Juha Heikkia,¨ and A dependency parser should enable the building Arto Anttila, editors. 1995. Constraint Grammar: of multiple types of analyses, e.g. to account for a Language-Independent System for Parsing Unre- syntactic and semantic dependencies. Also adding stricted Text, volume 4 of Natural Language Process- more structure to the syntactic categories could be ing. Mouton de Gruyter, Berlin and New York. useful. Andras´ Kornai and Zsolt Tuza. 1992. Narrowness, path- width, and their application in natural language pro- 6 Conclusions cessing. Discrete Applied Mathematics, 36:87–92. Dekang Lin. 1995. A dependency-based method for The current theoretical work paves the way for a full evaluating broad-coverage parsers. In Proceedings parser implementation. The parser should be able to of the Fourteenth International Joint Conference on cope with large grammars to enable efficient devel- Artificial Intelligence, IJCAI 95, Montreal´ Quebec,´ Canada, August 20-25, 1995, volume 2, pages 1420– opment, testing and application cycles. 1425. The current work has sketched an expressive and George A. Miller. 1956. The magical number seven, compact formalism and its efficient implementation plus or minus two: Some limits on our capacity for the morphology-syntax interface of the contract- for processing information. Psychological Review, ing dependency parser. In addition, the work has 63(2):343–355. elaborated strategies that help to make the grammar Kemal Oflazer. 2003. Dependency parsing with an ex- more robust without sacrificing the optimal speci- tended finite-state approach. Computational Linguis- ficity of the analysis. tics, 29(4):515–544. Martin Platek,´ Marketa´ Lopatkova,´ and Karel Oliva. Acknowledgments 2003. Restarting automata: motivations and applica- tions. In M. Holzer, editor, Workshop ’Petrinetze’ and The research has received funding from the 13. 
Theorietag ’Formale Sprachen und Automaten’, pages 90–96, Institut fur¨ Informatik, Technische Uni- Academy of Finland under the grant agreement # versitat¨ Munchen.¨ 128536 and the FIN-CLARIN project, and from the Anssi Yli-Jyra.¨ 2005. Approximating dependency European Commission’s 7th Framework Program grammars through intersection of star-free regular lan- under the grant agreement # 238405 (CLARA). guages. International Journal of Foundations of Com- puter Science, 16(3). Anssi Yli-Jyra.¨ 2012. On dependency analysis via con- References tractions and weighted FSTs. In Diana Santos, Krister Linden,´ and Wanjiku Ng’ang’a, editors, Shall we Play Jason Eisner and Noah A. Smith. 2005. Parsing with the Festschrift Game? Essays on the Occasion of Lauri soft and hard constraints on dependency length. In Carlson’s 60th Birthday. Springer-Verlag, Berlin. Proceedings of the International Workshop on Parsing Technologies (IWPT), pages 30–41, Vancouver, Octo- ber.

115 Lattice-Based Minimum Error Rate Training using Weighted Finite-State Transducers with Tropical Polynomial Weights

Aurelien Waite‡ Graeme Blackwood∗ William Byrne‡ ‡ Department of Engineering, University of Cambridge, Trumpington Street, CB2 1PZ, U.K. aaw35|wjb31 @cam.ac.uk { } ∗ IBM T.J. Watson Research, Yorktown Heights, NY-10598 [email protected]

Abstract It has been noted that line optimisation over a lat- tice can be implemented as a semiring of sets of lin- Minimum Error Rate Training (MERT) is a ear functions (Dyer et al., 2010). Sokolov and Yvon method for training the parameters of a log- (2011) provide a formal description of such a semir- linear model. One advantage of this method ing, which they denote the MERT semiring. The dif- of training is that it can use the large num- ference between the various algorithms derives from ber of hypotheses encoded in a translation lat- tice as training data. We demonstrate that the the differences in their formulation and implemen- MERT line optimisation can be modelled as tation, but not in the objective they attempt to opti- computing the shortest distance in a weighted mise. finite-state transducer using a tropical polyno- Instead of an algebra defined in terms of trans- mial semiring. formations of sets of linear functions, we propose an alternative formulation using the tropical polyno- mial semiring (Speyer and Sturmfels, 2009). This 1 Introduction semiring provides a concise formalism for describ- Minimum Error Rate Training (MERT) (Och, 2003) ing line optimisation, an intuitive explanation of the is an iterative procedure for training a log-linear sta- MERT shortest distance, and draws on techniques tistical machine translation (SMT) model (Och and in the currently active field of Tropical Geometry 1 Ney, 2002). MERT optimises model parameters (Richter-Gebert et al., 2005) . directly against a criterion based on an automated We begin with a review of the line optimisation translation quality metric, such as BLEU (Papineni procedure, lattice-based MERT, and the weighted et al., 2002). Koehn (2010) provides a full descrip- finite-state transducer formulation in Section 2. In tion of the SMT task and MERT. Section 3, we introduce our novel formulation MERT uses a line optimisation procedure (Press of lattice-based MERT using tropical polynomial et al., 2002) to identify a range of points along a line weights. Section 4 compares the performance of our in parameter space that maximise an objective func- approach with k-best and lattice-based MERT. tion based on the BLEU score. A key property of the 2 Minimum Error Rate Training line optimisation is that it can consider a large set of hypotheses encoded as a weighted directed acyclic Following Och and Ney (2002), we assume that graph (Macherey et al., 2008), which is called a lat- we are given a tuning set of parallel sentences tice. The line optimisation procedure can also be ap- (r , f ), ..., (r , f ) , where r is the reference { 1 1 S S } s plied to a hypergraph representation of the hypothe- translation of the source sentence fs. We also as- ses (Kumar et al., 2009). sume that sets of hypotheses C = e , ..., e s { s,1 s,K } 1 ∗The work reported in this paper was carried out while the An associated technical report contains an extended discus- author was at the University of Cambridge. sion of our approach (Waite et al., 2011)

116 Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing, pages 116–125, Donostia–San Sebastian,´ July 23–25, 2012. c 2012 Association for Computational Linguistics are available for each source sentence fs. Env(fs; γ) Under the log-linear model formulation with fea- M M `e4 ture functions h1 and model parameters λ1 , the most probable translation in a set Cs is selected as `e1 M `e2 M `e3 eˆ(fs; λ1 ) = argmax λmhm(e, fs) . (1) e Cs (m=1 ) ∈ X γ With an error function of the form E(rS, eS) = 1 1 E(rs, eˆ(fs; γ)) S E(r , e ) s=1 s s , MERT attempts to find model pa- e4 rameters to minimise the following objective: e P 1 S e3 ˆM M γ λ1 = argmin E(rs, eˆ(fs; λ1 )) . (2) M γ1 γ2 λ1 (s=1 ) X Note that for MERT the hypotheses set Cs is Figure 1: An upper envelope and projected error. Note a k-best list of explicitly enumerated hypotheses, that the upper envelope is completely defined by hypothe- whereas lattice-based MERT uses a larger space. ses e4, e3, and e1, together with the intersection points γ1 and γ2 (after Macherey et al. (2008), Fig. 1).

2.1 Line Optimisation For any value of γ the linear functions `e(γ) associ- ated with Cs take (up to) K values. The function in Although the objective function in Eq. (2) cannot be Eq. (4) defines the ‘upper envelope’ of these values solved analytically, the line optimisation procedure over all γ. The upper envelope has the form of a con- of Och (2003) can be used to find an approxima- tinuous piecewise linear function in γ. The piece- tion of the optimal model parameters. Rather than wise linear function can be compactly described by evaluating the decision rule in Eq. (1) over all pos- the linear functions which form line segments and sible points in parameter space, the line optimisa- the values of γ at which they intersect. The example tion considers a subset of points defined by the line M M M in the upper part of Figure 1 shows how the upper λ1 +γd1 , where λ1 corresponds to an initial point M envelope associated with a set of four hypotheses in parameter space and d1 is the direction along can be represented by three associated linear func- which to optimise. Eq. (1) can be rewritten as: tions and two values of γ. The first step of line op- M M T M eˆ(fs; γ) = argmax (λ1 + γd1 ) h1 (e, f s) timisation is to compute this compact representation e Cs ∈  of the upper envelope. = argmax λmhm(e, f s) +γ dmhm(e, f s) Macherey et al. (2008) use methods from com- e Cs m m ∈ n X X o putational geometry to compute the upper envelope. a(e,f s) b(e,f s) The SweepLine algorithm (Bentley and Ottmann, 1979) computes the upper envelope from a set of lin- = argmax a|(e, f s) +{zγb(e, f}s) | {z (3)} e Cs { } ∈ ear functions with a complexity of (K log(K)). `e(γ) O Computing the upper envelope reduces the run- This decision| rule{z shows that} each hypothesis time cost of line optimisation as the error function e C is associated with a linear function of γ: ∈ s need only be evaluated for the subset of hypotheses ` (γ) = a(e, f ) + γb(e, f ), where a(e, f ) is the e s s s in Cs that contribute to the upper envelope. These y-intercept and b(e, f s) is the gradient. The opti- errors are projected onto intervals of γ, as shown in misation problem is further simplified by defining a the lower part of Figure 1, so that Eq. (2) can be subspace over which optimisation is performed. The readily solved. subspace is found by considering a form of the func- tion in Eq. (3) defined with a range of real numbers 2.2 Incorporation of Line Optimisation into (Macherey et al., 2008; Och, 2003): MERT Env(f) = max a(e, f) + γb(e, f) : γ R (4) The previous algorithm finds the upper envelope e C { ∈ } ∈ `e(γ) along a particular direction in parameter space over | {z } 117 a hypothesis set Cs. The line optimisation algorithm transforms all the partial hypotheses leading to q by is then embedded within a general optimisation pro- concatenating the word e. The upper envelope as- cedure. A common approach to MERT is to select sociated with the edge from q to q0 is changed by the directions using Powell’s method (Press et al., adding `e(γ) to the set of linear functions. The in- 2002). A line optimisation is performed on each co- tersection points are not changed by this operation. ordinate axis. The axis giving the largest decrease The second operation is a union. Suppose q0 in error is replaced with a vector between the initial has another incoming edge from a state q00 where parameters and the optimised parameters. Powell’s q = q . There are now two upper envelopes rep- 6 00 method halts when there is no decrease in error. 
resenting two sets of linear functions. The first up- Instead of using Powell’s method, the Downhill per envelope is associated with the paths from the Simplex algorithm (Press et al., 2002) can be used initial state to state q0 via the state q. Similarly the to explore the criterion in Eq. (2). This is done by second upper envelope is associated with paths from defining a simplex in parameter space. Directions the initial state to state q0 via the state q00. The upper where the error count decreases can be identified by envelope that is associated with all paths from the considering the change in error count at the points initial state to state q0 via both q and q00 is the union of the simplex. This has been applied to parameter of the two sets of linear functions. This union is no searching over k-best lists (Zens et al., 2007). longer a compact representation of the upper enve- Both Powell’s method and the Downhill Simplex lope as there may be functions which never achieve algorithms are approaches based on heuristics to se- a maximum for any value of γ. The SweepLine al- M M lect lines λ1 + γd1 . It is difficult to find theoret- gorithm (Bentley and Ottmann, 1979) is applied to ically sound reasons why one approach is superior. the union to discard redundant linear functions and Therefore Cer et al. (2008) instead choose the di- their associated hypotheses (Macherey et al., 2008). M rection vectors d1 at random. They report that this The union and extend operations are applied to method can find parameters that are as good as the states in topological order until the final state is parameters produced by more complex algorithms. reached. The upper envelope computed at the final state compactly encodes all the hypotheses that max- 2.3 Lattice Line Optimisation M M imise Eq. (1) along the line λ1 + γd1 . Macherey’s Macherey et al. (2008) describe a procedure for con- theorem (Macherey et al., 2008) states that an upper ducting line optimisation directly over a word lattice bound for the number of linear functions in the up- encoding the hypotheses in Cs. Each lattice edge is per envelope at the final state is equal to the number labelled with a word e and has a weight defined by of edges in the lattice. the vector of word specific feature function values M 2.4 Line Optimisation using WFSTs h1 (e, f) so that the weight of a path in the lattice is found by summing over the word specific feature Formally, a weighted finite-state transducer (WFST) function values on that path. Given a line through T = (Σ, ∆, Q, I, F, E, λ, ρ) over a semiring parameter space, the goal is to extract from a lattice (K, , , 0¯, 1)¯ is defined by an input alphabet Σ, an ⊕ ⊗ its upper envelope and the associated hypotheses. output alphabet ∆, a set of states Q, a set of initial Their algorithm proceeds node by node through states I Q, a set of final states F Q, a set ⊆ ⊆ the lattice. Suppose that for a state q the upper enve- of weighted transitions E, an initial state weight as- lope is known for all the partial hypotheses on all signment λ : I K, and a final state weight assign- → paths leading to q. The upper envelope defines a ment ρ : F K (Mohri et al., 2008). The weighted → set of functions `e˜1 (γ), ..., `e˜N (γ) over the partial transitions of T form the set E Q Σ ∆ K Q, { } ⊆ × × × × hypotheses e˜n. Two operations propagate the upper where each transition includes a source state from Q, envelope to other lattice nodes. 
input symbol from Σ, output symbol from ∆, cost We refer to the first operation as the ‘extend’ op- from the weight set K, and target state from Q. eration. Consider a single edge from state q to state For each state q Q, let E[q] denote the set of ∈ q . This edge defines a linear function associated edges leaving state q. For each transition e E[q], 0 ∈ with a single word `e(γ). A path following this edge let p[e] denote its source state, n[e] its target state,

118 and w[e] its weight. Let π = e e denote a sists of a real valued coefficient multiplied by one or 1 ··· K path in T from state p[e1] to state n[eK ], so that more variables, and these variables may have expo- n[ek 1] = p[ek] for k = 2,...,K. The weight as- nents that are non-negative integers. In this section − sociated by T to path π is the generalised product we limit ourselves to a description of a polynomial ⊗ of the weights of the individual transitions: in a single variable. A polynomial function is de- K fined by evaluating a polynomial: w[π] = w[e ] = w[e ] w[e ] (5) n n 1 2 k 1 ⊗ · · · ⊗ K f(γ) = anγ + an 1γ − + + a2γ + a1γ + a0 k=1 − ··· O A useful property of these polynomials is that they If (q) denotes the set of all paths in T start- P form a ring2 (Cox et al., 2007) and therefore are can- ing from an initial state in I and ending in state q, didates for use as weights in WFSTs. then the shortest distance d[q] is defined as the gen- Speyer and Sturmfels (2009) apply the defini- eralised sum of the weights of all paths leading to ⊕ tion of a classical polynomial to the formulation of q (Mohri, 2002): a tropical polynomial. The tropical semiring uses d[q] = π (q) w[π] (6) summation for the generalised product and a min ⊕ ∈P ⊗ For some semirings, such as the tropical semir- operation for the generalised sum . In this form, ⊕ ing, the shortest distance is the weight of the short- let γ be a variable that represents an element in the est path. For other semirings, the shortest distance tropical semiring weight set R , + . We ∪ {−∞ ∞} is associated with multiple paths (Mohri, 2002); for can write a monomial of γ raised to an integer expo- these semirings there are shortest distances but need nent as not any be shortest paths. That will be the case in γi = γ γ what follows. However, the shortest distance algo- ⊗ · · · ⊗ rithms rely only on general properties of semirings, i and once the semiring is specified, the general short- where i is a non-negative| {z integer.} The monomial est distance algorithms can be directly employed. can also have a constant coefficient: a γi, a R. ⊗ ∈ Sokolov and Yvon (2011) define the MERT We can define a function that evaluates a tropical semiring based on operations described in the pre- monomial for a particular value of γ. For example, vious section. The extend operation is used for the the tropical monomial a γi is evaluated as: ⊗ generalised product . The union operation fol- i ⊗ f(γ) = a γ = a + iγ lowed by an application of the SweepLine algorithm ⊗ This shows that a tropical monomial is a linear becomes the generalised sum . The word lattice ⊕ function with the coefficient a as its y-intercept and is then transformed for an initial parameter λM and 1 the integer exponent i as its gradient. A tropical direction dM . The weight of edge is mapped from 1 polynomial is the generalised sum of tropical mono- a word specific feature function hM (e, f) to a word 1 mials where the generalised sum is evaluated using specific linear function ` (γ). The weight of each e the min operation. For example: path is the generalised product of the word spe- ⊗ f(γ) = (a γi) (b γj) = min(a + iγ, b + jγ) cific feature linear functions. The upper envelope is ⊗ ⊕ ⊗ the shortest distance of all the paths in the WFST. Evaluating tropical polynomials in classical arith- metic gives the minimum of a finite collection of 3 The Tropical Polynomial Semiring linear functions. 
In this section we introduce the tropical polynomial Tropical polynomials can also be multiplied by a semiring (Speyer and Sturmfels, 2009) as a replace- monomial to form another tropical polynomial. For ment for the MERT semiring (Sokolov and Yvon, example: 2011). We then provide a full description and a f(γ) = [(a γi) (b γj)] (c γk) worked example of our MERT algorithm. ⊗ ⊕ ⊗ ⊗ ⊗ = [(a + c) γi+k] [(b + c) γj+k] ⊗ ⊕ ⊗ 3.1 Tropical Polynomials = min((a + c) + (i + k)γ, (b + c) + (j + k)γ) A polynomial is a linear combination of a finite number of non-zero monomials. A monomial con- 2A ring is a semiring that includes negation.

119 Our re-formulation of Eq. (4) negates the feature f(γ) function weights and replaces the argmax by an argmin. This allows us to keep the usual formu- c γk ⊗ lation of tropical polynomials in terms of the min b γj i ⊗ operation when converting Eq. (4) to a tropical rep- a γ ⊗ resentation. What remains to be addressed is the role of integer exponents in the tropical polynomial. (a γi) (b γj) (c γk) 3.2 Integer Realisations for Tropical ⊗ ⊕ ⊗ ⊕ ⊗ γ Monomials 0 In the previous section we noted that the function Figure 2: Redundant terms in a tropical polynomial. In defined by the upper envelope in Eq. (4) is simi- this case (a γi) (b γj) (c γk) = (a γi) (c γk). lar to the function represented by a tropical poly- ⊗ ⊕ ⊗ ⊕ ⊗ ⊗ ⊕ ⊗ nomial. A significant difference is that the formal we widen the definition of a tropical polynomial to definition of a polynomial only allows integer expo- allow for these negative exponents. nents, whereas the gradients in Eq. (4) are real num- bers. The upper envelope therefore encodes a larger 3.3 Canonical Form of a Tropical Polynomial set of model parameters than a tropical polynomial. We noted in Section 2.1 that linear functions induced To create an equivalence between the upper enve- by some hypotheses do not contribute to the upper lope and tropical polynomials we can approximate envelope and can be discarded. Terms in a tropi- the linear functions ` (γ) = a(e, f ) + γ b(e, f ) { e s · s } cal polynomial can have similar behaviour. Figure that compose segments of the upper envelope. We n ˜ 2 plots the lines associated with the three terms of define a˜(e, f s) = [a(e, f s) 10 ]int and b(e, f s) = i n · the example polynomial function f(γ) = (a γ ) [b(e, f ) 10 ]int where [x]int denotes the integer part ⊗ ⊕ s · (b γj) (c γk). We note that the piecewise linear of x. The approximation to `e(γ) is: ⊗ ⊕ ⊗ function can also be described with the polynomial ˜ ˜ a˜(e, f s) b(e, f s) f(γ) = (a γi) (c γk). The latter representation `e(γ) `e(γ) = n + γ n (7) ⊗ ⊕ ⊗ ≈ 10 · 10 is simpler but equivalent. The result of this operation is to approximate Having multiple representations of the same poly- the y-intercept and gradient of `e(γ) to n decimal nomial causes problems when implementing the places. We can now represent the linear function ˜ shortest distance algorithm defined by Mohri (2002). ˜ b(e,fs) `e(γ) as the tropical monomial a˜(e, fs) γ− . This algorithm performs an equality test between ˜ − ⊗ Note that a˜(e, fs) and b(e, fs) are negated since trop- values in the semiring used to weight the WFST. The ical polynomials define the lower envelope as op- behaviour of the equality test is ambiguous when posed to the upper envelope defined by Eq. (4). there are multiple polynomial representations of the The linear function represented by the tropical same piecewise linear function. We therefore re- monomial is a scaled version of `e(γ), but the up- quire a canonical form of a tropical polynomial so per envelope is unchanged (to the accuracy allowed that a single polynomial represents a single function. by n). If for a particular value of γ, `ei (γ) > `ej (γ), We define the canonical form of a tropical polyno- ˜ ˜ then `ei (γ) > `ej (γ). Similarly, the boundary mial to be the tropical polynomial that contains only points are unchanged: if `ei (γ) = `ej (γ), then the monomial terms necessary to describe the piece- ˜ ˜ `ei (γ) = `ej (γ). Setting n to a very large value re- wise linear function it represents. 
moves numerical differences between the upper en- We remove redundant terms from a tropical poly- velope and the tropical polynomial representation, nomial after computing the generalised sum. For a as shown by the identical results in Table 1. tropical polynomial of one variable we can take ad- Using a scaled version of `e(γ) as the basis for a vantage of the equivalence with Lattice MERT and tropical monomial may cause negative exponents to compute the canonical form using the SweepLine al- be created. Following Speyer and Sturmfels (2009), gorithm (Bentley and Ottmann, 1979). Each term

120 corresponds to a linear function; linear functions and γ = γn + 1 to find the MAP hypothesis in the that do not contribute to the upper envelope are dis- first and last intervals, respectively. carded. Only monomials which correspond to the remaining linear functions are kept in the canonical 3.5 The TGMERT Algorithm form. The canonical form of a tropical polynomial We now describe an alternative algorithm to Lat- thus corresponds to a unique and minimal represen- tice MERT that is formulated using the tropical tation of the upper envelope. polynomial shortest distance in one variable. We call the algorithm TGMERT, for Tropical Geome- 3.4 Relationship to the Tropical Semiring try MERT. As input to this procedure we use a word Tropical monomial weights can be transformed into lattice weighted with word specific feature functions M M M regular tropical weights by evaluating the tropical h1 (e, f), a starting point λ1 , and a direction d1 in monomial for a specific value of γ. For example, a parameter space. tropical polynomial evaluated at γ = 1 corresponds 1. Convert the word specific feature functions to the tropical weight: M M h (e, f) to a linear function `e(γ) using λ ˜ 1 1 b(e,fs) ˜ M f(1) = a˜(e, fs) 1− = a˜(e, fs) b(e, fs) and d1 , as in Eq. (3). − ⊗ − − ˜ Each monomial term in the tropical polynomial 2. Convert `e(γ) to `e(γ) by approximating y- shortest distance represents a linear function. The intercepts and gradients to n decimal places, as intersection points of these linear functions define in Eq. (7). ˜ intervals of γ (as in Fig. 1). This suggests an alter- 3. Convert `e(γ) in Eq. (7) to the tropical mono- ˜ nate explanation for what the shortest distance com- mial a˜(e, f ) γ b(e,fs). − s ⊗ − puted using the tropical polynomial semiring rep- 4. Compute the WFST shortest distance to the exit resents. Conceptually, there is a continuum of lat- states (Mohri, 2002) with generalised sum ⊕ tices which have identical edges and vertices but and generalised product defined by the trop- ⊗ with varying, real-valued edge weights determined ical polynomial semiring. The resulting trop- by values of γ R, so that each lattice in the contin- ical polynomial represents the upper envelope ∈ uum is indexed by γ. The tropical polynomial short- of the lattice. est distance agrees with the shortest distance through 5. Compute the intersection points of the linear each lattice in the continuum. functions corresponding to the monomial terms Our alternate explanation is consistent with the of the tropical polynomial shortest distance. Theorem of Macherey (Section 2.3), as there could These intersection points define intervals of γ never be more paths than edges in the lattice. There- in which the MAP hypothesis does not change. fore the upper bound for the number of monomial 6. Using the midpoint of each interval convert the ˜ terms in the tropical polynomial shortest distance is tropical monomial a˜(e, f ) γ b(e,fs) to a reg- − s ⊗ − the number of edges in the input lattice. ular tropical weight. Find the MAP hypothesis We can use the mapping to the tropical semiring for this interval by extracting the shortest path to compute the error surface. Let us assume we have using the WFST shortest path algorithm (Mohri n + 1 intervals separated by n interval boundaries. and Riley, 2002). We use the midpoint of each interval to transform the lattice of tropical monomial weights into a lattice of 3.6 TGMERT Worked Example tropical weights. 
The sequence of words that label This section presents a worked example showing the shortest path through the transformed lattice is how we can use the TGMERT algorithm to compute the MAP hypothesis for the interval. The shortest the upper envelope of a lattice. We start with a three path can be extracted using the WFST shortest path state lattice with a two dimensional feature vector algorithm (Mohri and Riley, 2002). As a technical shown in the upper part of Figure 3. matter, the midpoints of the first interval [ , γ ) We want to optimise the parameters along a line −∞ 1 and last interval [γ , ) are not defined. We there- in two-dimensional feature space. Suppose the ini- n ∞ fore evaluate the tropical polynomial at γ = γ 1 tial parameters are λ2 = [0.7, 0.4] and the direction 1 − 1

121 z/[ 0.2, 0.7]0 z/-2.4 −

x/[ 1.4, 0.3]0 z/[ 0.2, 0.6]0 x/75.2 z/23.6 0 − 1 − − 2 0 1 2 y/[ 0.9, 0.8]0 y/68.2 − −

29 z/55.6 z/ 14 γ− − ⊗

27 36 x/21.2 z/-48.4 x/86 γ z/38 γ 0 1 2 0 ⊗ 1 ⊗ 2 y/-65.8 y/95 γ67 ⊗ Figure 4: The lattice in the lower part of Figure 3 trans- Figure 3: The upper part is a translation lattice with 2- formed to regular tropical weights: γ = 0.4 (top) and dimensional log feature vector weights hM (e, f) where − 1 γ = 1.4 (bottom). M = 2. The lower part is the lattice from the upper part − with weights transformed into tropical monomials. The shortest distance from initial to final state is is d2 = [0.3, 0.5]. Step 1 of the TGMERT algorithm the generalised sum of the path weights: (24 γ7) 1 ⊗ ⊕ (Section 3.5) maps each edge weight to a word spe- (133 γ103). The monomial term 122 γ63 corre- ⊗ ⊗ cific linear function. For example, the weight of the sponding to “x z” can be dropped because it is not edge labelled “x” between states 0 and 1 is trans- part of the canonical form of the polynomial (Sec- formed as follows: tion 3.3). The shortest distance to the exit state can 2 2 be represented as the minimum of two linear func- M M `e(γ) = λmh1 (e, f) +γ dmh1 (e,fs) tions: min(24 + 7γ, 133 + 103γ). m=1 m=1 X X We now wish to find the hypotheses that define a(e,f) b(e,f) the error surface by performing Steps 5 and 6 of the = 0.7 | 1.4 +{z 0.4 0.3}+γ |0.3 {z1.4 + 0.5} 0.3 TGMERT algorithm. These two linear functions de- · − · · · − · a(e,f) b(e,f) fine two intervals of γ. The linear functions intersect at γ 1.4; at this value of γ the MAP hypothesis = | 0.86 0{z.27γ } | {z } ≈ − − − changes. Two lattices with regular tropical weights Step 2 of the TGMERT algorithm converts the are created using γ = 0.4 and γ = 2.4. These − − word specific linear functions into tropical mono- are shown in Figure 4. For the lattice shown in the mial weights. Since all y-intercepts and gradients upper part the value for the edge labelled “x” is com- have a precision of two decimal places, we scale the puted as 86 0.427 = 86 + 0.4 27 = 75.2. ` (γ) 102 ⊗ − · linear functions e by and negate them to cre- When γ = 0.4 the lattice in the upper part in − ate tropical monomials (Step 3). The edge labelled Figure 4 shows that the shortest path is associated “x” now has the monomial weight of 86 γ27. The ⊗ with the hypothesis “z z”, which is the MAP hy- transformed lattice with weights mapped to the trop- pothesis for the range γ < 1.4. The lattice in the ical polynomial semiring is shown in the lower part lower part of Figure 4 shows that when γ = 2.4 − of Figure 3. the shortest path is associated with the hypothesis We can now compute the shortest distance “y z”, which is the MAP hypothesis when γ > 1.4. (Mohri, 2002) from the transformed example lattice with tropical monomial weights. There are three 3.7 TGMERT Implementation unique paths through the lattice corresponding to three distinct hypotheses. The weights associated TGMERT is implemented using the OpenFst Toolkit with these hypotheses are: (Allauzen et al., 2007). A weight class is added 29 36 7 14 γ− 38 γ = 24 γ z z for tropical polynomials which maintains them in − ⊗ ⊗ ⊗ ⊗ 27 36 63 canonical form. The and operations are im- 86 γ 38 γ = 122 γ x z ⊗ ⊕ ⊗ ⊗ ⊗ ⊗ plemented for piece-wise linear functions, with the 95 γ67 38 γ36 = 133 γ103 y z ⊗ ⊗ ⊗ ⊗ SweepLine algorithm included as discussed.

122 Iteration Arabic-to-English sias et al., 2009b). We use a hierarchical phrase- MERT LMERT TGMERT based decoder (Iglesias et al., 2009a; de Gispert et Tune Test Tune Test Tune Test al., 2010a) which directly generates word lattices 36.2 36.2 36.2 1 from recursive translation networks without any in- 42.1 40.9 39.7 38.9 39.7 38.9 42.0 44.5 44.5 termediate hypergraph representation (Iglesias et al., 2 45.1 43.2 45.8 44.3 45.8 44.3 2011). The LMERT and TGMERT optimisation al- 44.5 gorithms are particularly suitable for this realisation 3 45.5 44.1 of hiero in that the lattice representation avoids the 45.6 4 need to use the hypergraph formulation of MERT 45.7 44.0 given by Kumar et al. (2009). Iteration Chinese-to-English MERT optimises the weights of the following fea- MERT LMERT TGMERT tures: target language model, source-to-target and Tune Test Tune Test Tune Test target-to-source translation models, word and rule 19.5 19.5 19.5 1 penalties, number of usages of the glue rule, word 25.3 16.7 29.3 22.6 29.3 22.6 deletion scale factor, source-to-target and target-to- 16.4 22.5 22.5 2 18.9 23.9 31.4 32.1 31.4 32.1 source lexical models, and three count-based fea- 23.6 31.6 31.6 tures that track the frequency of rules in the parallel 3 28.2 29.1 32.2 32.5 32.2 32.5 data (Bender et al., 2007). In both Arabic-to-English 29.2 32.2 32.2 and Chinese-to-English experiments all MERT im- 4 31.3 31.5 32.2 32.5 32.2 32.5 plementations start from a flat feature weight initial- 31.3 5 ization. At each iteration new lattices and k-best lists 31.8 32.1 are generated from the best parameters at the previ- 32.1 6 ous iteration, and each subsequent iteration includes 32.4 32.3 32.4 100 hypotheses from the previous iteration. For 7 32.4 32.3 Arabic-to-English we consider an additional twenty random starting parameters at every iteration. All Table 1: GALE AR EN and ZH EN BLEU scores translation scores are reported for the IBM imple- → → by MERT iteration. BLEU scores at the initial and final mentation of BLEU using case-insensitive match- points of each iteration are shown for the Tune sets. ing. We report BLEU scores for the Tune set at the start and end of each iteration. 4 Experiments The results for Arabic-to-English and Chinese- We compare feature weight optimisation using k- to-English are shown in Table 1. Both TGMERT best MERT (Och, 2003), lattice MERT (Macherey and LMERT converge to a small gain over MERT et al., 2008), and tropical geometry MERT. We refer in fewer iterations, consistent with previous re- to these as MERT, LMERT, and TGMERT, resp. ports (Macherey et al., 2008). We investigate MERT performance in the context of the Arabic-to-English GALE P4 and Chinese- 5 Discussion to-English GALE P3 evaluations3. For Arabic-to- English translation, word alignments are generated We have described a lattice-based line optimisation over around 9M sentences of GALE P4 parallel text. algorithm which can be incorporated into MERT Following de Gispert et al. (2010b), word align- for parameter tuning of SMT systems and systems ments for Chinese-to-English translation are trained based on log-linear models. Our approach recasts from a subset of 2M sentences of GALE P3 paral- the optimisation procedure used in MERT in terms lel text. 
Hierarchical rules are extracted from align- of Tropical Geometry; given this formulation imple- ments using the constraints described in (Chiang, mentation is relatively straightforward using stan- 2007) with additional count and pattern filters (Igle- dard WFST operations and algorithms. 3See http://projects.ldc.upenn.edu/gale/data/catalog.html

123 References Gonzalo Iglesias, Cyril Allauzen, William Byrne, Adria` de Gispert, and Michael Riley. 2011. Hierarchical C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and phrase-based translation representations. In Proceed- M. Mohri. 2007. OpenFst: A general and efficient ings of the 2011 Conference on Empirical Methods in weighted finite-state transducer library. In Proceed- Natural Language Processing, pages 1373–1383. As- ings of the Ninth International Conference on Imple- sociation for Computational Linguistics. mentation and Application of Automata, pages 11–23. Philipp Koehn. 2010. Statistical Machine Translation. O. Bender, E. Matusov, S. Hahn, S. Hasan, S. Khadivi, Cambridge University Press. and H. Ney. 2007. The RWTH Arabic-to-English spo- Shankar Kumar, Wolfgang Macherey, Chris Dyer, and ken language translation system. In Automatic Speech Franz Och. 2009. Efficient minimum error rate train- Recognition Understanding, pages 396 –401. ing and minimum bayes-risk decoding for translation J.L. Bentley and T.A. Ottmann. 1979. Algorithms for hypergraphs and lattices. In Proceedings of the Joint reporting and counting geometric intersections. Com- Conference of the 47th Annual Meeting of the ACL and puters, IEEE Transactions on, C-28(9):643 –647. the 4th International Joint Conference on Natural Lan- Daniel Cer, Daniel Jurafsky, and Christopher D. Man- guage Processing of the AFNLP, pages 163–171. ning. 2008. Regularization and search for minimum Wolfgang Macherey, Franz Och, Ignacio Thayer, and error rate training. In Proceedings of the Third Work- Jakob Uszkoreit. 2008. Lattice-based minimum error shop on Statistical Machine Translation, pages 26–34. rate training for statistical machine translation. In Pro- David Chiang. 2007. Hierarchical phrase-based transla- ceedings of the 2008 Conference on Empirical Meth- tion. Computational Linguistics, 33. ods in Natural Language Processing, pages 725–734. David A. Cox, John Little, and Donal O’Shea. 2007. Mehryar Mohri and Michael Riley. 2002. An efficient Ideals, Varieties, and Algorithms: An Introduction to algorithm for the n-best-strings problem. In Proceed- Computational Algebraic Geometry and Commutative ings of the International Conference on Spoken Lan- Algebra, 3/e (Undergraduate Texts in Mathematics). guage Processing 2002. Mehryar Mohri, Fernando C. N. Pereira, and Michael Ri- Adria` de Gispert, Gonzalo Iglesias, Graeme Blackwood, ley. 2008. Speech recognition with weighted finite- Eduardo R. Banga, and William Byrne. 2010a. Hier- state transducers. Handbook on Speech Processing archical phrase-based translation with weighted finite- and Speech Communication. state transducers and shallow-n grammars. Computa- Mehryar Mohri. 2002. Semiring frameworks and algo- tional Linguistics, 36(3):505–533. rithms for shortest-distance problems. J. Autom. Lang. Adria` de Gispert, Juan Pino, and William Byrne. 2010b. Comb., 7(3):321–350. Hierarchical phrase-based translation grammars ex- Franz Josef Och and Hermann Ney. 2002. Discrimi- tracted from alignment posterior probabilities. In Pro- native training and maximum entropy models for sta- ceedings of the 2010 Conference on Empirical Meth- tistical machine translation. In Proceedings of the ods in Natural Language Processing, pages 545–554. 40th Annual Meeting on Association for Computa- Chris Dyer, Adam Lopez, Juri Ganitkevitch, Jonathan tional Linguistics, pages 295–302. Weese, Ferhan Ture, Phil Blunsom, Hendra Setiawan, Franz Josef Och. 2003. 
Minimum error rate training Vladimir Eidelman, and Philip Resnik. 2010. cdec: A in statistical machine translation. In Proceedings of decoder, alignment, and learning framework for finite- the 41st Annual Meeting on Association for Computa- state and context-free translation models. In Proceed- tional Linguistics, pages 160–167. ings of the ACL 2010 System Demonstrations, pages K. A. Papineni, S. Roukos, T. Ward, and W. Zhu. 2002. 7–12, July. BLEU: a method for automatic evaluation of ma- Gonzalo Iglesias, Adria` de Gispert, Eduardo R. Banga, chine translation. In Proceedings of the 40th Annual and William Byrne. 2009a. Hierarchical phrase-based Meeting on Association for Computational Linguis- translation with weighted finite state transducers. In tics, pages 311–318. Proceedings of HLT: The 2009 Annual Conference of W. H. Press, W. T. Vetterling, S. A. Teukolsky, and B. P. the North American Chapter of the Association for Flannery. 2002. Numerical Recipes in C++: the art Computational Linguistics, pages 433–441. of scientific computing. Cambridge University Press. Gonzalo Iglesias, Adria` de Gispert, Eduardo R. Banga, J. Richter-Gebert, B. Sturmfels, and T. Theobald. 2005. and William Byrne. 2009b. Rule filtering by pattern First steps in tropical geometry. In Idempotent mathe- for efficient hierarchical translation. In Proceedings matics and mathematical physics. of the 12th Conference of the European Chapter of Artem Sokolov and Franc¸ois Yvon. 2011. Minimum er- the Association for Computational Linguistics, pages ror rate training semiring. In Proceedings of the Euro- 380–388. pean Association for Machine Translation.

124 David Speyer and Bernd Sturmfels. 2009. Tropical mathematics. Mathematics Magazine. Aurelien Waite, Graeme Blackwood, and William Byrne. 2011. Lattice-based minimum error rate training using weighted finite-state transducers with tropical polyno- mial weights. Technical report, Department of Engi- neering, University of Cambridge. Richard Zens, Sasa Hasan, and Hermann Ney. 2007. A systematic comparison of training criteria for sta- tistical machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natu- ral Language Processing and Computational Natural Language Learning, pages 524–532.

125

Author Index

Agirrezabal, Manex, 35 Piitulainen, Jussi, 108 Alegria, Inaki,˜ 35 Pirinen, Tommi A.,1 Arrieta, Bertol, 35 Attia, Mohammed, 20 Quernheim, Daniel, 40

Bogel,¨ Tina, 25 Samih, Younes, 70 Beesley, Kenneth R., 50 Sanchis, Emilio, 75 Blackwood, Graeme, 116 Shaalan, Khaled, 20 Bontcheva, Katina, 30 Silfverberg, Miikka, 55 Byrne, William, 116 Torres, M. Ines,´ 99 Calvo, Marcos, 75 Voutilainen, Atro, 108 Casacuberta, Francisco, 99 Casillas, A., 60 Waite, Aurelien, 116

D´ıaz de Ilarraza, A., 60 Yli-Jyra,¨ Anssi, 55, 108 Drobac, Senka, 55

Fernando, Tim, 80

Gomez,´ Jon Ander, 75 Gerdemann, Dale, 10 Gojenola, K., 60 Gonzalez,´ Jorge, 90

Hardwick, Sam,1 Hirose, Keikichi, 45 Hulden, Mans, 10, 35, 65, 70 Hurtado, Llu´ıs-F., 75

Knight, Kevin, 40

Labaka, Gorka, 65

Mayor, Aingeru, 65 Minematsu, Nobuaki, 45

Novak, Josef R., 45

Oronoz, M., 60

Perez,´ Alicia, 60, 99

127