Modeling and Encoding Traditional Wordlists for Machine Applications

Modeling and Encoding Traditional Wordlists for Machine Applications Shakthi Poornima Jeff Good Department of Linguistics Department of Linguistics University at Buffalo University at Buffalo Buffalo, NY USA Buffalo, NY USA [email protected] [email protected] Abstract Clearly, descriptive linguistic resources can be This paper describes work being done on of potential value not just to traditional linguis- the modeling and encoding of a legacy re- tics, but also to computational linguistics. The source, the traditional descriptive wordlist, difficulty, however, is that the kinds of resources in ways that make its data accessible to produced in the course of linguistic description NLP applications. We describe an abstract are typically not easily exploitable in NLP appli- model for traditional wordlist entries and cations. Nevertheless, in the last decade or so, then provide an instantiation of the model it has become widely recognized that the devel- in RDF/XML which makes clear the re- opment of new digital methods for encoding lan- lationship between our wordlist database guage data can, in principle, not only help descrip- and interlingua approaches aimed towards tive linguists to work more effectively but also al- machine translation, and which also allow them, with relatively little extra effort, to pro- lows for straightforward interoperation duce resources which can be straightforwardly re- with data from full lexicons. purposed for, among other things, NLP (Simons et al., 2004; Farrar and Lewis, 2007). 1 Introduction Despite this, it has proven difficult to create When looking at the relationship between NLP significant electronic descriptive resources due to and linguistics, it is typical to focus on the dif- the complex and specific problems inevitably as- ferent approaches taken with respect to issues sociated with the conversion of legacy data. One like parsing and generation of natural language exception to this is found in the work done in data—for example, to compare statistical NLP ap- the context of the ODIN project (Xia and Lewis, proaches to those involving grammar engineering. 2009), a significant database of interlinear glossed Such comparison is undoubtedly important insofar text (IGT), a standard descriptive linguistic data as it helps us understand how computational meth- format (Palmer et al., 2009), compiled by search- ods that are derived from these two lines of re- ing the Web for legacy instances of IGT. search can complement each other. However, one This paper describes another attempt to trans- thing that the two areas of work have in common form an existing legacy dataset into a more read- is that they tend to focus on majority languages ily repurposable format. Our data consists of tra- and majority language resources. Even where this ditional descriptive wordlists originally collected 1 is not the case (Bender et al., 2002; Alvarez et al., for comparative and historical linguistic research. 2006; Palmer et al., 2009), the resulting products Wordlists have been widely employed as a first still cover relatively few languages from a world- step towards the creation of a dictionary or as a wide perspective. This is in part because such means to quickly gather information about a lan- work cannot easily make use of the extensive language for the purposes of language comparison guage resources produced by descriptive linguists, (especially in parts of the world where languages the group of researchers that are most actively in- 1These wordlists were collected by Timothy Usher and volved in documenting the world’s entire linguis- Paul Whitehouse and represent an enormous effort without tic diversity. In fact, one particular descriptive lin- which the work described here would not have been possible. The RDF/XML implementations discussed in this paper will guistic product, the wordlist—which is the focus be made available at http://lego.linguistlist.org of this paper—can be found for at least a quarter within the context of the Lexicon Enhancement via the of the world’s languages. GOLD Ontology project. 1 Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground, ACL 2010, pages 1–9, Uppsala, Sweden, 16 July 2010. c 2010 Association for Computational Linguistics are poorly documented). Because of this, they 3 Related Work in Descriptive exist for many more languages than do full lexi- Linguistics cons. While the lexical information that wordlists Recent years have seen a fair amount of attention contain is quite sparse, they are relatively consis- paid to the modeling of traditional linguistic data tent in their structure across resources. This al- types, including lexicons, glossed texts, and gram- lows for the creation of a large-scale multilingual mars (Bell and Bird, 2000; Good, 2004; Palmer database consisting of rough translational equiva- and Erk, 2007; Nordhoff, 2008). The data type of lents which may lack precision but has coverage focus here, wordlists, has not seen serious treat- well-beyond what would otherwise be available. ment. Superficially, wordlists resemble lexicons 2 The Data and Project Background and, of course, they can be considered a kind of lexical resource. However, as will be shown in The data we are working with consists of 2,700 section 5, there are important differences between wordlists drawn from more than 1,500 languages lexicons and wordlists which have implications for (some wordlists represent dialects) and close to how they should be modeled. 500,000 forms. This is almost certainly the largest Most of the work on modeling descriptive lin- collection of wordlists in a standardized format. guistic data types has proceeded without special The average size of the individual wordlists is consideration for possible NLP applications for rather small, around 200 words, making them the data being encoded. This is largely because the comparable in size to the resources found in work was initially a response to issues relating to a project like NEDO (Takenobu, 2006), though the longevity of digital descriptive data which was, smaller than in other related projects like those otherwise, quite often being encoded solely in (of- discussed in section 4. While the work described ten proprietary) presentation formats (Bird and Si- here was originally conceived to support descrip- mons, 2003). However, the possibility for fruitful tive and comparative linguistics, our data model interaction between computational linguistics and and choice of encoding technologies has had the descriptive linguistics is apparent and has been the additional effect of making these resources read- subject of some work (Palmer et al., 2009). ily exploitable in other domains, in particular NLP. The work described here is also interested in We have approached the data initially as tradi- this possibility. In particular, we address the ques- tional, not computational, linguists, and our first tion of how to model and encode a large-scale goal has been to encode the available materials dataset that was originally intended to be used for not with any new information but rather to trans- descriptive purposes in ways that not only allow us fer the information they originally contained in a to faithfully represent the intention of the original more exploitable way. creator but also permit the data to be straightfor- By way of introduction, the hypothetical exam- wardly exploitable for new uses, including NLP. ple in (1) illustrates a traditional presentation for- To the best of our knowledge, our work is innova- mat of a wordlist, with English as the source lan- tive both because of the data type being explored guage and French as the target language. and because the data modeling is being done par- allel with the transformation of a legacy resource (1) MAN homme with significant coverage of the world’s languages. WOMAN femme This stands in contrast to most other work (again, As we will describe in more detail in section 5, with the exception of work done within ODIN they key features of a wordlist entry are an index (Xia and Lewis, 2009)) whose data, while repre- to a concept assumed to be of general provenance sentative, is not of the same scale. (e.g., MAN) and a form drawn from a specific language (e.g. homme) determined to be the counter- 4 Related Work on Lexicon part for that concept within that language. Most Interoperability in NLP typically, the elements indexing the relevant con- The relevant related work in NLP is that focused cepts are words drawn from languages of wider on interoperation among lexical resources. One communication (e.g., English or Spanish). way to achieve this is to make use of language in- dependent ontologies (or comparable objects) for word meanings which can serve as pivots for mul- 2 tilingual applications (Ide et al., 1998; Vossen, one begins with an abstract meaning, for example 2004; Nirenburg et al., 2004; Ronzano et al., MAN, and then tries to find the word in the tar- 2010). The word senses provided by WordNet, for get language which represents the best semantic example, have been used for this purpose (O’Hara fit for that meaning. Lexicons, on the other hand, et al., 1998). prototypically map in the opposite direction from A recognized data modeling standard for lexi- form to meaning. Furthermore, as will be elab- cal interoperation is the Lexical Markup Frame- orated in section 5.3, the meanings employed in work (LMF), which provides standardized frame- wordlists are not intended to refer to meanings of work for the description and representation of lex- lexical items in specific languages. In this way, icons (Francopoulo et al., 2009). Instantiations of they are quite distinct from bilingual dictionaries. LMF have also been extended to represent Word- We can therefore view a wordlist as a set of de- Nets, e.g., Wordnet-LMF (Soria et al., 2009), in fective signs—containing information on the form ways which facilitate interoperation.

Load more