Hunmorph: open source word analysis

Viktor Trón (IGK, U of Edinburgh, 2 Buccleuch Place, EH8 9LW Edinburgh) [email protected]
György Gyepesi (K-PRO Ltd., Villám u. 6., H-2092 Budakeszi) [email protected]
Péter Halácsy (Centre of Media Research and Education, Stoczek u. 2, H-1111 Budapest) [email protected]

András Kornai (MetaCarta Inc., 350 Massachusetts Avenue, Cambridge MA 02139) [email protected]
László Németh (CMRE, Stoczek u. 2, H-1111 Budapest) [email protected]
Dániel Varga (CMRE, Stoczek u. 2, H-1111 Budapest) [email protected]

Abstract

Common tasks involving orthographic words include spellchecking, stemming, morphological analysis, and morphological synthesis. To enable significant reuse of the language-specific resources across all such tasks, we have extended the functionality of the open source spellchecker MySpell, yielding a generic word analysis library, the runtime layer of the hunmorph toolkit. We added an offline resource management component, hunlex, which complements the efficiency of our runtime layer with a high-level description language and a configurable precompiler.

0 Introduction

Word-level analysis and synthesis problems range from strict recognition and approximate matching to full morphological analysis and generation. Our technology is predicated on the observation that all of these problems are, when viewed algorithmically, very similar: the central problem is to dynamically analyze complex structures derived from some lexicon of base forms. Viewing word analysis routines as a unified problem means sharing the same codebase for a wider range of tasks, a design goal carried out by finding the parameters which optimize each of the analysis modes independently of the language-specific resources.

The C/C++ runtime layer of our toolkit, called hunmorph, was developed by extending the codebase of MySpell, a reimplementation of the well-known Ispell spellchecker. Our technology, like the Ispell family of spellcheckers it descends from, enforces a strict separation between the language-specific resources (known as dictionary and affix files) and the runtime environment, which is independent of the target natural language.

Compiling accurate wide-coverage machine-readable dictionaries and coding the morphology of a language can be an extremely labor-intensive task, so the benefit expected from reusing the language-specific input database across tasks can hardly be overestimated. To facilitate this resource sharing and to enable systematic task-dependent optimizations from a central lexical knowledge base, we designed and implemented a powerful offline layer we call hunlex. Hunlex offers an easy-to-use general framework for describing the lexicon and morphology of any language. Using this description it can generate the language-specific aff/dic resources, optimized for the task at hand. The architecture of our toolkit is depicted in Figure 1. Our toolkit is released under a permissive LGPL-style license and can be freely downloaded from mokk.bme.hu/resources/hunmorph.

Figure 1: Architecture

The rest of this paper is organized as follows. Section 1 is about the runtime layer of our toolkit. We discuss the algorithmic extensions and implementational enhancements in the C/C++ runtime layer over MySpell, and also describe the newly created Java port jmorph. Section 2 gives an overview of the offline layer hunlex. In Section 3 we consider the free open source software alternatives and offer our conclusions.

1 The runtime layer

Our development is a prime example of code reuse, which gives open source software development most of its power. Our codebase is a direct descendant of MySpell, a thread-safe C++ spellchecking library by Kevin Hendricks, which descends from Ispell (Peterson, 1980), which in turn goes back to Ralph Gorin's spell (1971), making it probably the oldest piece of linguistic software that is still in active use and development (see fmg-www.cs.ucla.edu/fmg-members/geoff/ispell.html).

The key operation supported by this codebase is affix stripping. Affix rules are specified in a static resource (the aff file) by a sequence of conditions, an append string, and a strip string: for example, in the rule forming the plural of body the strip string would be y, and the append string would be ies. The rules are reverse applied to complex input wordforms: after the append string is stripped and the edge conditions are checked, a pseudo-stem is hypothesized by appending the strip string to the stem, which is then looked up in the base dictionary (the other static resource, called the dic file).

Lexical entries (base forms) are all associated with sets of affix flags, and affix flags in turn are associated with sets of affix rules. If the hypothesized base is found in the dictionary after the reverse application of an affix rule, the algorithm checks whether its flags contain the one that the affix rule is assigned to. This is a straight table-driven approach, where affix flags can be interpreted directly as lexical features that license entire subparts of morphological paradigms. To pick applicable affix rules efficiently, MySpell uses a fast indexing technique to check affixation conditions.

In theory, affix rules should only specify genuine prefixes and suffixes to be stripped before lexical lookup. But in practice, for languages with rich morphology, the affix stripping mechanism is (ab)used to strip complex clusters of affix morphs in a single step. For instance, in Hungarian, due to productive combinations of derivational and inflectional affixation, a single nominal base can yield up to a million word forms. To treat all these combinations as affix clusters, legacy ispell resources for Hungarian required so many combined affix rule entries that their resource file sizes were not manageable.

To solve this problem we extended the affix stripping technique to a multistep method: after stripping an affix cluster in step i, the resulting pseudo-stem can be stripped of affix clusters in step i+1. Restrictions on rule application are checked with the help of flags associated with affixes analogously to lexical entries: this only required a minor modification of the data structure coding affix entries and a recursive call for affix stripping.
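As an illustration, the reverse application of a single affix rule can be sketched as follows. This is a toy model, not the actual MySpell/hunmorph code: the class and method names, the regex form of the condition, and the two-entry dictionary are all invented for the example.

```java
// Hypothetical sketch of MySpell-style reverse affix application.
import java.util.Map;
import java.util.regex.Pattern;

public class AffixDemo {
    // An affix rule: append string, strip string, an edge condition on the
    // hypothesized stem, and the flag that must license the rule.
    static final class Rule {
        final String append, strip; final Pattern cond; final char flag;
        Rule(String append, String strip, String cond, char flag) {
            this.append = append; this.strip = strip;
            this.cond = Pattern.compile(".*" + cond + "$"); this.flag = flag;
        }
    }

    // Base dictionary: lemma -> affix flags licensing parts of its paradigm.
    static final Map<String, String> dic = Map.of("body", "P", "sky", "P");

    // Reverse application: strip the append string, hypothesize a pseudo-stem
    // by adding back the strip string, check the edge condition, then look the
    // stem up and verify that its entry carries the rule's flag.
    static String analyze(String word, Rule r) {
        if (!word.endsWith(r.append)) return null;
        String pseudo = word.substring(0, word.length() - r.append.length()) + r.strip;
        if (!r.cond.matcher(pseudo).matches()) return null;
        String flags = dic.get(pseudo);
        return (flags != null && flags.indexOf(r.flag) >= 0) ? pseudo : null;
    }

    public static void main(String[] args) {
        Rule plural = new Rule("ies", "y", "[^aeiou]y", 'P'); // body -> bodies
        System.out.println(analyze("bodies", plural)); // body
        System.out.println(analyze("ties", plural));   // null: "ty" is not in the dictionary
    }
}
```

The multistep extension amounts to calling such a step recursively on the pseudo-stem, with affixes themselves carrying flags that license the next stripping step.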
By cross-checking flags of prefixes on the suffix (as opposed to the stem only), simultaneous prefixation and suffixation can be made interdependent, extending the functionality to describe circumfixes like the German participle ge+t, or the Hungarian superlative leg+bb, and in general to provide the correct handling of prefix-suffix dependencies like English undrinkable (cf. *undrink); see Németh et al. (2004) for more details.

Due to productive compounding in many languages, proper handling of composite bases is a feature indispensable for achieving wide coverage. Ispell incorporates the possibility of specifying lexical restrictions on compounding, implemented as switches in the base dictionary. However, the algorithm allows any affixed form of a base that has the relevant switch to be a potential member of a compound, which proves not to be restrictive enough. We have improved on this by the introduction of position-sensitive compounding. This means that lexical features can specify whether a base or affix can occur as the leftmost, rightmost or middle constituent of a compound, and whether it can only appear in compounds. Since these features can also be specified on affixes, this provides a welcome solution to a number of residual problems hitherto problematic for open-source spellcheckers. In some Germanic languages, 'fogemorphemes', morphemes which serve to link compound constituents, can now be handled easily by allowing position-specific compound licensing on the foge-affixes. Another important example is the German common noun: although it is capitalized in isolation, lowercase variants should be accepted when the noun is a compound constituent. By handling lowercasing as a prefix with the compound flag enabled, this phenomenon can be handled in the resource file without resorting to language-specific knowledge hard-wired in the codebase.
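A minimal sketch of position-sensitive compounding is given below. The single-letter flags L, M, and R (licensing leftmost, middle, and rightmost positions) are invented for the example and are not the actual flag encoding of the aff/dic format.

```java
// Toy model of position-sensitive compound licensing (flags invented).
import java.util.List;
import java.util.Map;

public class CompoundDemo {
    // L = may stand leftmost, M = middle, R = rightmost in a compound.
    static final Map<String, String> dic = Map.of(
        "book", "LMR", "shelf", "MR", "self", "");  // "self" may not compound at all

    static boolean positionOk(String word, int index, int last) {
        String f = dic.getOrDefault(word, "");
        if (index == 0)    return f.indexOf('L') >= 0;
        if (index == last) return f.indexOf('R') >= 0;
        return f.indexOf('M') >= 0;
    }

    // Accept a compound split iff every constituent is licensed for its position.
    static boolean accept(List<String> parts) {
        for (int i = 0; i < parts.size(); i++)
            if (!positionOk(parts.get(i), i, parts.size() - 1)) return false;
        return true;
    }

    public static void main(String[] args) {
        System.out.println(accept(List.of("book", "shelf"))); // true
        System.out.println(accept(List.of("shelf", "book"))); // false: "shelf" not licensed leftmost
        System.out.println(accept(List.of("book", "self")));  // false: "self" never compounds
    }
}
```

In the real system the same positional features can also sit on affixes, which is what makes the fogemorpheme and lowercasing tricks expressible purely in the resource files.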
1.1 From spellchecking to morphological analysis

We now turn to the extensions of the MySpell algorithm that were required to equip hunmorph with stemming and morphological analysis functionality. The core engine was extended with an optional output handling interface that can process arbitrary string tags associated with the affix rules read from the resources. Once this is done, simply outputting the stem found at the stage of dictionary lookup already yields a stemmer. In multistep affix stripping, registering the output information associated with the rules that apply renders the system capable of morphological analysis or other word annotation tasks. Thus the processing of output tags becomes a mode-dependent parameter that can be:

• switched off (spell-checking)

• turned on only for tag lookup in the dictionary (simple stemming)

• turned on fully to register tags with all rule applications (morphological analysis)

The single most important algorithmic aspect that distinguishes the recognition task from analysis is the handling of ambiguous structures. In the original MySpell design, identical bases are conflated, and once the switch-sets licensing their affixes are merged, there is no way to tell them apart. The correct handling of homonyms is crucial for morphological analysis, since base ambiguities can sometimes be resolved by the affixes. Interestingly, our improvement made it possible to rule out homonymous bases with incorrect simultaneous prefixing and suffixing such as English out+number+'s. Earlier these could be handled only by lexical pregeneration of the relevant forms or by duplication of affixes.

Most importantly, ambiguity arises in relation to the number of analyses output by the system. While in spell-checking the algorithm can terminate after the first analysis found, performing an exhaustive search for all alternative analyses is a reasonable requirement in morphological analysis mode as well as in some stemming tasks. Thus the exploration of the search space also becomes an active parameter in our enhanced implementation of the algorithm:

• search until the first correct analysis

• search restricted multiple analyses (e.g., disabling compounds)

• search all alternative analyses

Search until the first analysis is a functionality for recognizers used for spell-checking, and for stemming for accelerated document indexing. Preemption of potential compound analyses by existing lexical bases serves as a general way of filtering out spurious ambiguities when a reduction is required in the space of alternative analyses. In these cases, frequent compounds which trick the analyzer can be precompiled into the lexicon. Finally, there is the possibility of giving back the full set of possible analyses. This output can then be passed to a tagger that disambiguates among the candidate analyses.

Parameters can also be used to guide the search (such as 'do lexical lookup first at all stages' or 'strip the shortest affix first'), which yield candidate rankings without the use of numerical weights or statistics. These rankings can be used as disambiguation heuristics based on a general idea of blocking (e.g., Times would block an analysis of time+s). All further parametrization is managed offline by the resource compiler layer, see Section 2.
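The three output modes above can be pictured with a toy routine. All names here are invented, and the real engine registers tags during multistep stripping rather than receiving them ready-made as arguments; this only illustrates what each mode exposes of the same underlying analysis.

```java
// Toy rendering of one analysis under the three output modes (names invented).
import java.util.List;

public class ModeDemo {
    enum Mode { SPELL, STEM, ANALYZE }

    static String process(String stem, String lexTag, List<String> ruleTags, Mode mode) {
        if (mode == Mode.SPELL) return "OK";            // tag processing switched off
        if (mode == Mode.STEM)  return stem + lexTag;   // tag lookup in the dictionary only
        StringBuilder sb = new StringBuilder(stem).append(lexTag);
        for (String t : ruleTags) sb.append(t);         // tags of all rule applications
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> tags = List.of("[PLUR]", "[POSS]");
        System.out.println(process("haz", "[NOUN]", tags, Mode.SPELL));   // OK
        System.out.println(process("haz", "[NOUN]", tags, Mode.STEM));    // haz[NOUN]
        System.out.println(process("haz", "[NOUN]", tags, Mode.ANALYZE)); // haz[NOUN][PLUR][POSS]
    }
}
```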
1.2 Reimplementing the runtime layer

In our efforts to gear up the MySpell codebase to a fully functional word analysis library we identified various resource-related, algorithmic and implementational bottlenecks of the affix-rule based technology. With these lessons learned, a new project has been launched in order to provide an even more flexible and efficient open source runtime layer. A principled object-oriented refactorization of the same basic algorithm described above has already been implemented in Java. This port, called jmorph, also uses the aff/dic resource formats.

In jmorph, various algorithmic options guiding the search (shortest/longest matching affix) can be controlled for each individual rule. The implementation keeps track of affix and compound matches, checking conditions only once for a given substring and caching partial results. As a consequence, it ends up being measurably faster than the C++ implementation with the same resources.

The main loop of jmorph is driven by configuring consumers, i.e., objects which monitor the recursive step that is running. For example, the analysis of the form beszédesek 'talkative.PLUR' begins by inspecting the global configuration of the analysis: this initial consumer specifies how many analyses, and of what kind, need to be found. In Step 1, the initial consumer finds the rule that strips ek with stem beszédes, builds a consumer that can apply this rule to the output of the analysis returned by the next consumer, and launches the next step with this consumer and stem. In Step 2, this consumer finds the rule stripping es with stem beszéd, which is found in the lexicon. beszéd is not just a string: it is a complete lexical object which lists the rules that can apply to it, and all its homonyms. The consumer creates a new analysis reflecting that beszédes is formed from beszéd by suffixing es (a suffix object), and passes this back to its parent consumer, which verifies whether the ek suffixation rule is applicable. If not, the Step 1 consumer requests further analyses from the Step 2 consumer. If, however, the answer is positive, the Step 1 consumer returns its analysis to the Step 0 (initial) consumer, which decides whether further analyses are needed.

In terms of functionality, there are a number of differences between the Java and the C++ variants. jmorph records the full parse tree of rule applications. By offering various ways of serializing this data structure, it allows for more structured information in the outputs than would be possible by simple concatenation of the tag chunks associated with the rules. Class-based restriction on compounding is implemented and will eventually supersede the overgeneralizing position-based restrictions that the C++ variant and our resources currently use.

Two major additional features of jmorph are its capability of morphological synthesis and of acting as a guesser (hypothesizing lemmas). Synthesis is implemented by forward application of affix rules starting with the base. Rules have to be indexed by their tag chunks for the search, so synthesis introduces the non-trivial problem of chunking the input tag string. This is currently implemented by plug-ins for individual tag systems; however, this should ideally be precompiled offline, since the space of possible tags is limited.
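The consumer-driven recursion can be approximated by a small sketch. This is a drastic simplification of jmorph's consumer objects: the lexicon, the suffix table, the tag strings, and the `firstOnly` switch are all invented for the example (the accented stems are written without diacritics).

```java
// Simplified recursive multistep stripping with a first-vs-all search switch.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SearchDemo {
    static final Set<String> lexicon = Set.of("beszed");         // stems
    static final Map<String, String> suffixes = Map.of(          // suffix -> tag
        "es", "[ADJZ]", "ek", "[PLUR]");

    // Strip suffixes recursively; each level plays the role of one consumer,
    // asking the next step for analyses of the shorter pseudo-stem.
    static void analyze(String form, String tags, boolean firstOnly, List<String> out) {
        if (lexicon.contains(form)) out.add(form + tags);
        if (firstOnly && !out.isEmpty()) return;       // terminate after first analysis
        for (var e : suffixes.entrySet()) {
            String suf = e.getKey();
            if (form.endsWith(suf) && form.length() > suf.length())
                analyze(form.substring(0, form.length() - suf.length()),
                        e.getValue() + tags, firstOnly, out);
        }
    }

    public static void main(String[] args) {
        List<String> all = new ArrayList<>();
        analyze("beszedesek", "", false, all);
        System.out.println(all); // [beszed[ADJZ][PLUR]]
    }
}
```

What the sketch leaves out is precisely what the consumer design adds: per-rule search options, caching of condition checks per substring, and the parse tree of rule applications.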
2 Resource development and offline precompilation

Due to the backward compatibility of the runtime layer with MySpell-style resources, our software can be used as a spellchecker and simplistic stemmer for some 50 languages for which MySpell resources are available, see lingucomponent.openoffice.org/spell_dic.html.

For languages with complex morphology, compiling and maintaining these resources is a painful undertaking. Without a unified framework for morphological description and a principled method of precompilation, resource developers for highly agglutinative languages like Hungarian (see magyarispell.sourceforge.net) have to resort to a maze of scripts to maintain and precompile the aff and dic files. This problem is intolerably magnified once morphological tags or additional lexicographic information are to be entered in order to provide resources for the analysis routines of our runtime layer.

The offline layer of our toolkit seeks to remedy this by offering a high-level description language in which grammar developers can specify rule-based morphologies and lexicons (somewhat in the spirit of lexc (Beesley and Karttunen, 2003), the frontend to Xerox's Finite State Toolkit). This promises rapid resource development which can then be used in various tasks. Once the primary resources are created, hunlex, the offline precompiler, can generate aff and dic resources optimized for the runtime layer based on various compile-time configurations.
Figure 2 illustrates the description language with a fragment of English morphology describing plural formation. Individual rules are separated by commas. The syntax of the rule descriptions is organized around the notion of information blocks. Blocks are introduced by keywords (like IF:) and allow the encoding of various properties of a rule (or a lexical entry), among others affixation (+es), substitution, character truncation before affixation (CLIP: 1), regular expression matches (MATCH: [^o]o), positive and negative lexical feature conditions on application (IF: f-v_altern), feature inheritance, output (continuation) references (OUT: PL_POSS), and output tags (TAG: "[PLUR]").

    PL
    TAG: "[PLUR]"
    OUT: PL_POSS
    # house -> houses
    , +s    MATCH: [^shoxy]    IF: regular
    # kiss -> kisses
    , +es   MATCH: [^c]s       IF: regular
    # ...
    # ethics
    , +     MATCH: cs          IF: regular
    # body -> bodies
    , +ies  MATCH: y  CLIP: 1  IF: regular
    # zloty -> zlotys
    , +s    MATCH: y           IF: y-ys
    # macro -> macros
    , +s    MATCH: [^o]o       IF: regular
    # potato -> potatoes
    , +es   MATCH: [^o]o       IF: o-oes
    # wife -> wives
    , +ves  MATCH: fe CLIP: 2  IF: f-ves
    # leaf -> leaves
    , +ves  MATCH: f  CLIP: 1  IF: f-ves
    ;

Figure 2: hunlex grammar fragment

One can specify the rules that can be applied to the output of a rule, and one can also specify application conditions on the input to a rule. These two possibilities allow for many different styles of morphological description: one based on input feature constraints, one based on continuation classes (paradigm indexes), and any combination between these two extremes. On top of this, regular expression matches on the input can also be used as conditions on rule application.

Affixation rules grouped together here under PL can be thought of as allomorphic rules of the plural morpheme. Practically, this allows information about the morpheme shared among the variants (e.g., morphological tag, recursion level, some output information) to be abstracted into a preamble which then serves as a default for the individual rules. Most importantly, the grouping of rules into morphemes serves to index those rules which can be referenced in output conditions. For example, in the above, the plural morpheme specifies that the plural possessive rules can be applied to its output (OUT: PL_POSS). This design makes it possible to handle some morphosyntactic dimensions (part of speech) very cleanly, separated from the conditions regulating the choice of allomorphs, since the latter can be taken care of by the input feature checking and pattern matching conditions of rules.
The lexicon has the same syntax as the grammar, only morphemes stand for lemmas and the variant rules within a morpheme correspond to stem allomorphs.

Rules with a zero affix morph can be used as filters that decorate their inputs with features based on their orthographic shape or other features present. This architecture enables one to let only the exceptions specify certain features in the lexicon, while regular words left unspecified are assigned a default feature by the filters (see PL_FILTER in Figure 3), potentially conditioned the same way as any rule application. Feature inheritance is fully supported; that is, filters for particular dimensions of features (such as the plural filter in Figure 3) can be written as independent units. This design makes it possible to engineer sophisticated filter chains decorating lexical items with the various features relevant for their morphological behavior. With this at hand, extending the lexicon with a regular lexeme just boils down to specifying its base and part of speech. On the other hand, independent sets of filter rules make feature assignments transparent and maintainable.

In order to support concise and maintainable grammars, the description language also allows (potentially recursive) macros to abbreviate arbitrary sets of blocks or regular expressions, illustrated in Figure 3.

    REGEXP: C [bcdfgklmnprstvwxyz];

    DEFINE: N
    OUT: SG PL_FILTER
    TAG: NOUN
    ;

    PL_FILTER
    OUT: PL
    FILTER:
      f-ves
      y-ys
      o-oes
      regular
    , DEFAULT:
      regular
    ;

Figure 3: Macros and filters in hunlex

The resource compiler hunlex is a stand-alone program written in OCaml which comes with a command line as well as a Makefile as its toplevel control interface. The internal workings of hunlex are as follows.

As the morphological grammar is parsed by the precompiler, rule objects are created. A block is read and parsed into functions which each transform the 'affix-rule' data structure by enriching its internal representation according to the semantic content of the block. At the end of each unit, the empty rule is passed to the composition of the block functions to result in a specific rule. Thanks to OCaml's flexibility of function abstraction and composition, this design makes it easy to implement macros of arbitrary blocks directly as functions. When the grammar is parsed, rules are arranged in a directed (possibly cyclic) graph, with edges representing the possible rule applications as given by the output specifications.

Precompilation proceeds by performing a recursive closure on this graph starting from the lexical nodes. Rules are indexed by 'levels', and contiguous rule-nodes that are on the same level are merged along the edges if the constraints on rule application (feature and match conditions, etc.) are satisfied. These precompiled affix clusters and complex lexical items are to be placed in the aff and dic file, respectively.

Instead of affix merging, closure between rules a and b on different levels causes the affix clusters in the closure of b to be registered as rules in a hash, and their indexes recorded on a. After the entire lexicon is read, these index sets registered on rules are considered. The affix cluster rules to be output into the affix file are arranged into maximal subsets such that if two output affix cluster rules a and b are in the same set, then every item or affix to which a can be applied, b can also be applied to. These sets of affix clusters correspond to partial paradigms which each full paradigm either includes or is disjoint with. The resulting sets of output rules are each assigned to a flag, and the items referencing them specify the appropriate combination of flags in the output dic and aff files.
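The parse-blocks-into-composable-functions design described above can be mimicked in Java along the following lines. The field names and block constructors are hypothetical; hunlex's actual rule representation is richer, and in OCaml a macro is simply a function bundling several block functions.

```java
// Sketch of the "each block is a function Rule -> Rule" design (names invented).
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

public class BlockDemo {
    // A rule under construction: each description block enriches one field.
    static final class Rule {
        String affix = "", tag = ""; List<String> out = new ArrayList<>();
        public String toString() { return "+" + affix + " " + tag + " OUT:" + out; }
    }

    // Each block is parsed into a function; a whole entry is their composition
    // applied to the empty rule.
    static UnaryOperator<Rule> affix(String a)  { return r -> { r.affix = a; return r; }; }
    static UnaryOperator<Rule> tag(String t)    { return r -> { r.tag = t; return r; }; }
    static UnaryOperator<Rule> outRef(String o) { return r -> { r.out.add(o); return r; }; }

    @SafeVarargs
    static Rule compile(UnaryOperator<Rule>... blocks) {
        Rule r = new Rule();
        for (UnaryOperator<Rule> b : blocks) r = b.apply(r);
        return r;
    }

    public static void main(String[] args) {
        // PL: TAG "[PLUR]", OUT PL_POSS, variant "+s"
        Rule pl = compile(tag("[PLUR]"), outRef("PL_POSS"), affix("s"));
        System.out.println(pl); // +s [PLUR] OUT:[PL_POSS]
    }
}
```

A macro then falls out for free: it is just another function that applies a fixed sequence of block functions, which is exactly the property the paper credits to OCaml's function composition.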
Since equivalent affix cluster rules are conflated, the compiled resources are always optimal in the following three ways.

First, the affix file is redundancy-free: no two affix rules have the same form. With hand-coded affix files this can almost never be guaranteed, since one is always inclined to group affix rules by linguistically motivated paradigms, thereby possibly duplicating entries. A redundancy-free set of affix rules enhances performance by minimizing the search space for affixes. Note that conflation of identical rules by the runtime layer is not possible without reindexing the flags, which would be very computationally intensive if done at runtime.

Second, given the redundancy-free affix set, maximizing the homogeneous rulesets assigned to a flag minimizes the number of flags used. Since the internal representation of flags depends on their number, this has the practical advantage of reducing the memory requirements of the runtime layer.

Third, identity of output affix rules is calculated relative to the mode and configuration settings; therefore identical morphs with different morphological tags will be conflated for recognizers (spellchecking), where ambiguity is irrelevant, while for analysis they can be kept apart. This is impossible to achieve without a precompilation stage. Note that finite state transducer-based systems perform essentially the same type of optimizations, eliminating symbol redundancy when two symbols behave the same in every rule, and eliminating state redundancy when two states have the exact same continuations.

Though the bulk of the knowledge used by spellcheckers, by stemmers, and by morphological analysis and generation tools is shared (how affixes combine with stems, which words allow compounding), the ideal resources for these various tasks differ to some extent. Spellcheckers are meant to help one conform to orthographic norms and therefore should be error sensitive; stemmers and morphological analyzers are expected to be more robust and error tolerant, especially towards common violations of standard use. Although this seems at first to justify the individual effort one has to invest in tailoring one's resources to the task at hand, most of the resource specifics are systematic, and therefore allow for automatic fine-tuning from a central knowledge base. Configuration within hunlex allows the specification of various features, among others:

• selection of registers and degree of normativity based on usage qualifiers in the database (allows for boosting robustness for analysis or sticking to normativity for synthesis and spellchecking)

• flexible selection of output information: choice of tagset for different encodings, support for sense indexes

• arbitrary selection of morphemes

• setting the levels of morphemes (the grouping of morphs that are precompiled as a cluster to be stripped with one rule application by the runtime layer)

• fine-tuning which morphemes are stripped during stemming

• arbitrary selection of the morphophonological features that are to be observed or ignored (allows for enhancing robustness by, e.g., tolerating non-standard regularizations)

The input description language allows arbitrary attributes (ones encoding part of speech, origin, register, etc.) to be specified in the description. Since any set of attributes can be selected to be compiled into the runtime resources, it takes no more than precompiling the central database with the appropriate configuration for the runtime analyzer to be used as an arbitrary word annotation tool, e.g., a style annotator or part of speech tagger. We also provide an implementation of a feature-tree based tag language which we successfully used for the description of Hungarian morphology.

If the resources are created for some filtering task, say, extracting (possibly inflected) proper nouns in a text, the resource optimization described above can save considerable amounts of time compared to full analysis followed by post-processing. While the relevant portion of the dictionary might be easily filtered, thereby speeding up lookup, tailoring a corresponding redundancy-free affix file would be a hopeless enterprise without the precompiler.

As we mentioned, our offline layer can be configured to cluster any or no sets of affixes together on various levels, and therefore resources can be optimized for either memory use (affix-by-affix stripping) or speed (generally toward one-level stripping). This is a major advantage given potential applications as diverse as spellchecking on the word processor of an old 386 at one end, and industrial-scale stemming on terabytes of web content for IR at the other.
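The conflation of equivalent rulesets under a shared flag, the core of the second optimization above, can be sketched as follows. This is a simplified model with invented names: the real precompiler also arranges affix clusters into maximal subsets corresponding to partial paradigms.

```java
// Toy flag assignment: items licensing the same ruleset share one flag.
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class ConflateDemo {
    static Map<String, Character> assignFlags(Map<String, Set<String>> itemRules) {
        Map<Set<String>, Character> flagOf = new LinkedHashMap<>();
        Map<String, Character> result = new LinkedHashMap<>();
        char next = 'A';
        for (Map.Entry<String, Set<String>> e : itemRules.entrySet()) {
            Set<String> rules = e.getValue();
            Character f = flagOf.get(rules);            // conflate equivalent rulesets
            if (f == null) { f = next++; flagOf.put(rules, f); }
            result.put(e.getKey(), f);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> items = new LinkedHashMap<>();
        items.put("house", Set.of("+s"));
        items.put("kiss",  Set.of("+es"));
        items.put("book",  Set.of("+s"));   // same ruleset as "house": same flag
        System.out.println(assignFlags(items)); // {house=A, kiss=B, book=A}
    }
}
```

Because the number of distinct flags drives the size of their internal representation, collapsing equivalent rulesets in this way directly reduces the runtime layer's memory footprint.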
In sum, our offline layer allows for the principled maintenance of a central resource, saving the redundant effort that would otherwise have to be invested in encoding very similar knowledge in a task-specific manner for each word-level analysis task.

3 Conclusion

The importance of word-level analysis can hardly be questioned: spellcheckers reach the extremely wide audience of all word processor users, stemmers are used in a variety of areas ranging from information retrieval to statistical machine translation, and for non-isolating languages morphological analysis is the initial phase of every natural language processing pipeline.

Over the past decades, two closely intertwined methods emerged to handle word analysis tasks: affix stripping and finite state transducers (FSTs). Since both technologies can provide industrial-strength solutions for most tasks, when it comes to the choice of actual software and its practical use, the differences that have the greatest impact are not lodged in the algorithmic core. Rather, two other factors play a role: the ease with which one can integrate the software into applications, and the infrastructure offered to translate the knowledge of the grammarian into efficient and maintainable computational blocks.

To be sure, in an end-to-end machine learning paradigm, the mundane differences between how the systems interact with the human grammarians would not matter. But as long as the grammars are written and maintained by humans, an offline framework providing a high-level language to specify morphologies and supporting configurable precompilation that allows for resource sharing across word-analysis tasks addresses a major bottleneck in resource creation and management.

The Xerox Finite State Toolkit provides comprehensive high-level support for morphology and lexicon development (Beesley and Karttunen, 2003). These descriptions are compiled into minimal deterministic FSTs, which give excellent runtime performance and can also be extended to error-tolerant analysis for spellchecking (Oflazer, 1996). Nonetheless, XFST is not free software, and as long as the work is not driven by academic curiosity alone, the LGPL-style license of our toolkit, explicitly permitting reuse for commercial purposes as well, can already decide the choice.

There are other free open source analyzer technologies, either stand-alone analyzers such as the Stuttgart Finite State Toolkit (SFST, available only under the GPL, see www.ims.uni-stuttgart.de/projekte/gramotron/SOFTWARE/SFST.html, Schmid et al. (2004)), or parts of a powerful integrated NLP platform such as Intex/NooJ (freely available for academic research only to individuals affiliated with a university, see intex.univ-fcomte.fr; a clone called Unitex is available under LGPL, see www-igm.univ-mlv.fr/~unitex). Unfortunately, NooJ has its limitations when it comes to implementing complex morphologies (Vajda et al., 2004), and SFST provides no high-level offline component for grammar description and configurable resource creation.

We believe that the liberal license policy and the powerful offline layer contributed equally to the huge interest that our project has generated, in spite of its relative novelty. MySpell was not just our choice: it is also the spell-checking library incorporated into OpenOffice.org, a free open-source office suite with an ever wider circle of users. The Hungarian build of OpenOffice is already running our C++ runtime library, and OpenOffice is now considering replacing MySpell completely with our code. This would open up the possibility of introducing morphological analysis capabilities into the program, which in turn could serve as the first step towards enhanced grammar checking and hyphenation.

Though in-depth grammars and lexica are available for nearly as many languages in FST-based frameworks (InXight Corporation's LinguistX platform supports 31 languages), very little of this material is available for grammar hacking or open source dictionary development.
In addition to the permissive license and easy-to-integrate infrastructure, the fact that the hunmorph routines are backward compatible with already existing and freely available spellchecking resources for some 50 languages goes a long way toward explaining its rapid spread.

For Hungarian, hunlex already serves as the development framework for the MORPHDB project, which merges three independently developed lexical databases by critically unifying their contents and supplying them with a comprehensive morphological grammar. It also provided a framework for our English morphology project, which used the XTAG morphological database for English (see ftp.cis.upenn.edu/pub/xtag/morph-1.5, Karp et al. (1992)). A project describing the morphology of the Beás dialect of Romani with hunlex is also under way.

The hunlex resource precompiler is not architecturally bound to the aff/dic format used by our toolkit, and we are investigating the possibility of generating FST resources with it. This would decouple the offline layer of our toolkit from the details of the runtime technology, and would be an important step towards a unified open source solution for method-independent resource development for word analysis software.

Acknowledgements

The development of hunmorph and hunlex is financially supported by the Hungarian Ministry of Informatics and Telecommunications and by MATÁV Telecom Co., and is led by the Centre for Media Research and Education at the Budapest University of Technology. We would like to thank the anonymous reviewers of the ACL Software Workshop for their valuable comments on this paper.

References

Kenneth R. Beesley and Lauri Karttunen. 2003. Finite State Morphology. CSLI Publications.

Daniel Karp, Yves Schabes, Martin Zaidel, and Dania Egedi. 1992. A freely available wide coverage morphological analyzer for English. In Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), Nantes, France.

László Németh, Viktor Trón, Péter Halácsy, András Kornai, András Rung, and István Szakadát. 2004. Leveraging the open-source ispell codebase for minority language analysis. In Proceedings of SALTMIL 2004. European Language Resources Association. http://www.lrec-conf.org/lrec2004.

Kemal Oflazer. 1996. Error-tolerant finite-state recognition with applications to morphological analysis and spelling correction. Computational Linguistics, 22(1):73-89.

James Lyle Peterson. 1980. Computer Programs for Spelling Correction: An Experiment in Program Design, volume 96 of Lecture Notes in Computer Science. Springer.

Helmut Schmid, Arne Fitschen, and Ulrich Heid. 2004. SMOR: A German computational morphology covering derivation, composition, and inflection. In Proceedings of the IVth International Conference on Language Resources and Evaluation (LREC 2004), pages 1263-1266.

Péter Vajda, Viktor Nagy, and Emília Dancsecs. 2004. A Ragozási szótártól a NooJ morfológiai moduljáig [From a morphological dictionary to a morphological module for NooJ]. In 2nd Hungarian Computational Linguistics Conference, pages 183-190.

Availability

Due to the conflicting needs of Unix, Windows, and MacOS users, the packaging/build environment for our software has not yet been finalized. However, a version of our tools, and some of the major language resources that have been created using these tools, are available at mokk.bme.hu/resources.