<<

Modeling and Encoding Traditional Wordlists for Machine Applications

Shakthi Poornima Jeff Good Department of Linguistics Department of Linguistics University at Buffalo University at Buffalo Buffalo, NY USA Buffalo, NY USA [email protected] [email protected]

Abstract Clearly, descriptive linguistic resources can be This paper describes work being done on of potential value not just to traditional linguis- the modeling and encoding of a legacy re- tics, but also to computational linguistics. The source, the traditional descriptive wordlist, difficulty, however, is that the kinds of resources in ways that make its data accessible to produced in the course of linguistic description NLP applications. We describe an abstract are typically not easily exploitable in NLP appli- model for traditional wordlist entries and cations. Nevertheless, in the last decade or so, then provide an instantiation of the model it has become widely recognized that the devel- in RDF/XML which makes clear the re- opment of new digital methods for encoding lan- lationship between our wordlist guage data can, in principle, not only help descrip- and interlingua approaches aimed towards tive linguists to work more effectively but also al- , and which also al- low them, with relatively little extra effort, to pro- lows for straightforward interoperation duce resources which can be straightforwardly re- with data from full . purposed for, among other things, NLP (Simons et al., 2004; Farrar and Lewis, 2007). 1 Introduction Despite this, it has proven difficult to create When looking at the relationship between NLP significant electronic descriptive resources due to and linguistics, it is typical to focus on the dif- the complex and specific problems inevitably as- ferent approaches taken with respect to issues sociated with the conversion of legacy data. One like and generation of natural exception to this is found in the work done in data—for example, to compare statistical NLP ap- the context of the ODIN project (Xia and Lewis, proaches to those involving grammar engineering. 2009), a significant database of interlinear glossed Such comparison is undoubtedly important insofar text (IGT), a standard descriptive linguistic data as it helps us understand how computational meth- format (Palmer et al., 2009), compiled by search- ods that are derived from these two lines of re- ing the Web for legacy instances of IGT. search can complement each other. However, one This paper describes another attempt to trans- thing that the two areas of work have in common form an existing legacy dataset into a more read- is that they tend to focus on majority ily repurposable format. Our data consists of tra- and majority language resources. Even where this ditional descriptive wordlists originally collected 1 is not the case (Bender et al., 2002; Alvarez et al., for comparative and historical linguistic research. 2006; Palmer et al., 2009), the resulting products Wordlists have been widely employed as a first still cover relatively few languages from a world- step towards the creation of a or as a wide perspective. This is in part because such means to quickly gather information about a lan- work cannot easily make use of the extensive lan- guage for the purposes of language comparison guage resources produced by descriptive linguists, (especially in parts of the world where languages the group of researchers that are most actively in- 1These wordlists were collected by Timothy Usher and volved in documenting the world’s entire linguis- Paul Whitehouse and represent an enormous effort without tic diversity. In fact, one particular descriptive lin- which the work described here would not have been possible. The RDF/XML implementations discussed in this paper will guistic product, the wordlist—which is the focus be made available at http://lego.linguistlist.org of this paper—can be found for at least a quarter within the context of the Enhancement via the of the world’s languages. GOLD project.

1 Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground, ACL 2010, pages 1–9, Uppsala, Sweden, 16 July 2010. c 2010 Association for Computational Linguistics are poorly documented). Because of this, they 3 Related Work in Descriptive exist for many more languages than do full lexi- Linguistics cons. While the lexical information that wordlists Recent years have seen a fair amount of attention contain is quite sparse, they are relatively consis- paid to the modeling of traditional linguistic data tent in their structure across resources. This al- types, including lexicons, glossed texts, and gram- lows for the creation of a large-scale multilingual mars (Bell and Bird, 2000; Good, 2004; Palmer database consisting of rough translational equiva- and Erk, 2007; Nordhoff, 2008). The data type of lents which may lack precision but has coverage focus here, wordlists, has not seen serious treat- well-beyond what would otherwise be available. ment. Superficially, wordlists resemble lexicons 2 The Data and Project Background and, of course, they can be considered a kind of . However, as will be shown in The data we are working with consists of 2,700 section 5, there are important differences between wordlists drawn from more than 1,500 languages lexicons and wordlists which have implications for (some wordlists represent dialects) and close to how they should be modeled. 500,000 forms. This is almost certainly the largest Most of the work on modeling descriptive lin- collection of wordlists in a standardized format. guistic data types has proceeded without special The average size of the individual wordlists is consideration for possible NLP applications for rather small, around 200 , making them the data being encoded. This is largely because the comparable in size to the resources found in work was initially a response to issues relating to a project like NEDO (Takenobu, 2006), though the longevity of digital descriptive data which was, smaller than in other related projects like those otherwise, quite often being encoded solely in (of- discussed in section 4. While the work described ten proprietary) presentation formats (Bird and Si- here was originally conceived to support descrip- mons, 2003). However, the possibility for fruitful tive and comparative linguistics, our data model interaction between computational linguistics and and choice of encoding technologies has had the descriptive linguistics is apparent and has been the additional effect of making these resources read- subject of some work (Palmer et al., 2009). ily exploitable in other domains, in particular NLP. The work described here is also interested in We have approached the data initially as tradi- this possibility. In particular, we address the ques- tional, not computational, linguists, and our first tion of how to model and encode a large-scale goal has been to encode the available materials dataset that was originally intended to be used for not with any new information but rather to trans- descriptive purposes in ways that not only allow us fer the information they originally contained in a to faithfully represent the intention of the original more exploitable way. creator but also permit the data to be straightfor- By way of introduction, the hypothetical exam- wardly exploitable for new uses, including NLP. ple in (1) illustrates a traditional presentation for- To the best of our knowledge, our work is innova- mat of a wordlist, with English as the source lan- tive both because of the data type being explored guage and French as the target language. and because the data modeling is being done par- allel with the transformation of a legacy resource (1) MAN homme with significant coverage of the world’s languages. WOMAN femme This stands in contrast to most other work (again, As we will describe in more detail in section 5, with the exception of work done within ODIN they key features of a wordlist entry are an index (Xia and Lewis, 2009)) whose data, while repre- to a concept assumed to be of general provenance sentative, is not of the same scale. (e.g., MAN) and a form drawn from a specific lan- guage (e.g. homme) determined to be the counter- 4 Related Work on Lexicon part for that concept within that language. Most Interoperability in NLP typically, the elements indexing the relevant con- The relevant related work in NLP is that focused cepts are words drawn from languages of wider on interoperation among lexical resources. One communication (e.g., English or Spanish). way to achieve this is to make use of language in- dependent (or comparable objects) for meanings which can serve as pivots for mul-

2 tilingual applications (Ide et al., 1998; Vossen, one begins with an abstract meaning, for example 2004; Nirenburg et al., 2004; Ronzano et al., MAN, and then tries to find the word in the tar- 2010). The word senses provided by WordNet, for get language which represents the best semantic example, have been used for this purpose (O’Hara fit for that meaning. Lexicons, on the other hand, et al., 1998). prototypically map in the opposite direction from A recognized data modeling standard for lexi- form to meaning. Furthermore, as will be elab- cal interoperation is the Lexical Markup Frame- orated in section 5.3, the meanings employed in work (LMF), which provides standardized frame- wordlists are not intended to refer to meanings of work for the description and representation of lex- lexical items in specific languages. In this way, icons (Francopoulo et al., 2009). Instantiations of they are quite distinct from bilingual . LMF have also been extended to represent Word- We can therefore view a wordlist as a set of de- Nets, e.g., Wordnet-LMF (Soria et al., 2009), in fective signs—containing information on the form ways which facilitate interoperation. and meaning parts of the triple, but not the gram- While we do not attempt to express the data mar. The meaning information is not directly asso- model we develop here in LMF, doing so should ciated with the specific form but, rather, is a kind be relatively straightforward. The key conceptual of “tag” indicating that the entire sign that a given observation is to recognize that the sets of mean- form is associated with is the best counterpart in ing labels found in wordlists (see section 2) can the language for a general concept. be treated either as a shared language-neutral on- Figure 1 compares the kind of information asso- tology or as a kind of interlingua, both of which ciated with signs in a lexicon to those in a wordlist. have already been the subject of LMF modeling The box on the left gives a schematic form- (Vossen, 2004). As such, they are also compa- grammar-meaning triple for the Spanish word rable to language-independent ontologies of word perro ‘dog’, containing the sort of information that meaning, bringing them in line with the work on might be found in a simple . multilingual NLP mentioned above. The box on the right schematizes the content of These similarities should not be too surprising. a parallel French wordlist entry for chien ‘dog’. After all, one of the functions of wordlists has Here, no grammatical or semantic information is been to facilitate language comparison, something associated with the form, but there is an indication which is also at the heart of multilingual NLP. that in French, this lexical item is the closest coun- An important development, however, is that new terpart to the general concept DOG. Of course, in data encoding technologies can allow us to en- this case, the word chien is not only the counter- code data in ways that facilitate its re- part of DOG in French, but can be translated as purposing for NLP applications much more easily dog in English. The semantic connection between than would have been possible previously. We will a concept label and a lexical item may not always come back to this in section 6. be so straightforward, as we will see in section 5.2. 5 Modeling Wordlists 5.1 Wordlist Entries as Defective Signs perro chien A common linguistic conceptualization of a lexi- DOG cal item is to treat it as a sign triple: an association of a form with meaning and grammar. Lexical dog items in a lexicon generally contain information on all three aspects of this triple. Wordlists do not, and the information they encode is quite sparse. Figure 1: Lexicon sign versus wordlist sign In general, they give no indication of grammatical information (e.g., part of speech), nor of language- specific . 5.2 Mapping between Form and Concept In addition, from a descriptive standpoint, lex- A challenge in comparing lexical data among nu- icons and wordlists differ in the direction of the merous languages is that a complete match be- form-meaning mapping. As the example in (1) tween a word’s meaning and a general concept suggests, in order to create or interpret a wordlist, rarely occurs within a single language, let alone

3 across languages (Haspelmath and Tadmor, 2009). the target language. There are other logical kinds Therefore, in order to describe the relationship be- of counterpart relationships as well (e.g., super- tween form and meaning in a wordlist, we use counterpart), and the example is primarily for il- the term counterpart, in the sense developed by lustrative purposes. In our database, we only em- Haspelmath and Tadmor (2009). This is in con- ploy the counterpart relation since that was the trast to related notions like definition or trans- level of precision found in the original data. lation. While the meanings found in wordlists could, in some cases, be interpreted as definitions (3) PARENT’SSIBLING hasSubCounterpart or translations, this is not how they are conceived aunt, uncle of in their core function. Rather, they are intended Though the canonical case for the counterpart to refer to language-independent concepts which relation is that there will be one counterpart for have been determined to be a useful way to begin a given concept, this is often not the case in lan- to explore the lexicon of a language. guages and in our data. To take an example from A key property of the counterpart relationship a familiar language, the English counterpart for is that that even if one particular language (e.g., MOVIE could reasonably be film or movie, and it English or Spanish) is used to refer to a particular is quite easy to imagine a wordlist for English concept (e.g., MAN), it is not the idiosyncratic se- containing both words. The entry in (4) from the mantics of the word in that language that is used to dataset we are working with gives an example of determine the relevant wordlist entry in the target this from a wordlist of North Asmat, a language language. For instance, the meaning of the English spoken in Indonesia. The concept GRANDFATHER word MAN is ambiguous between human and male has two counterparts, whose relationship to each human but the term in (1) only refers to human. other has not been specified in our source. In using a language of wider communication, the goal is to find the closest counterpart in the target (4) GRANDFATHER hasCounterpart -ak, afak language for a general concept, not to translate. We therefore distinguish between the meanings Data like that in (4) has led us to add an ad- associated with words in a given language from ditional layer in our model for the mapping be- the more general meanings found in wordlists by tween concept and form allowing for the possibil- using the term concept for the latter. Thus, a ity that the mapping may actually refer to a group wordlist entry can be schematized as in (2) where of forms. With more information, of course, one a concept and a lexical item are related by the may be able to avoid mapping to a group of forms hasCounterpart relation. In attested wordlist by, for example, determining that each member of entries, the concept is, as discussed, most typically the group is a sub-counterpart of the relevant con- indexed via a language of wider communication cept. However, this information is not available to and a lexical item is indexed via a transcription us in our dataset. representing the lexical item’s form. 5.3 The (2) CONCEPT hasCounterpart lexicalItem The concepts found in wordlists have generally been grouped into informally standardized lists. The counterpart relation is, by design, a rela- Within our model, we treat these lists as an object tively imprecise one since a lack of precision fa- to be modeled in their own right and refer to them cilitates the relatively rapid data collection that is as concepticons (i.e., “concept lexicon”). As will considered an important feature of wordlist cre- be discussed in section 6, a concepticon is simi- ation. The meaning of a given counterpart could lar to an interlingua, though this connection has be broader or narrower than that of the relevant rarely, if ever, been explicitly made. concept, for example (Haspelmath and Tadmor, As understood here, concepticons are simply 2009, p. 9). In principle, the counterpart relation curated sets of concepts, minimally indexed via could be made more precise by specifying, for ex- one or more words from a language of wider com- sub-counterpart ample, that the relevant relation is munication but, perhaps, also more elaborately for cases where a word in a target language refers described using multiple languages (e.g., English to a concept narrower than the one referred to in and Spanish) and illustrative example sentences. the word list, as illustrated in (3) for English as Concepticons may include terms for concepts of

4 such general provenance that counterpart words (7) 1. LWT: TOBREAK would be expected to occur in almost all lan- 2. IDS: BREAK, TR guages, such as TO EAT, as well as terms that may 3. Usher-Whitehouse: occur commonly in only a certain region or lan- (a) BREAK, INTOPIECES guage family. For instance, Amazonian languages (b) BREAK, BY IMPACT do not have words for SNOWSHOE or MOSQUE, (c) BREAK, BY MANIPULATION and Siberian languages do not have a term for BREAK STRINGSETC TOUCAN (Haspelmath and Tadmor, 2009, p. 5–6). (d) , . The concepticon we are employing has been (e) BREAK, LONGOBJECTS based on three different concept lists. Of these, (f) BREAK, BRITTLE SURFACES the most precise and recently published list is the Our unified concepticon combines information Loanword Typology (LWT) concepticon (Haspel- from the LWT, IDS, and Usher-Whitehouse lists. math and Tadmor, 2009), which consists of 1,460 This allow us to leverage the advantages of the dif- entries and was developed from the Intercontinen- ferent lists (e.g., the expanded term list in Usher- tal Dictionary Series2 (IDS) concepticon (1,200 Whitehouse against the more detailed concept de- entries). The LWT concepticon often offers more scriptions of LWT). No wordlist in our database precision for the same concept than the IDS list. has entries corresponding to all of the concepts For instance, the same concept in both LWT and in our concepticon. Nonetheless, we now have a IDS is described in the LWT list by labeling an dataset with several thousand wordlists whose en- English noun with the article the (5) in order to tries, where present, are linked to the same con- clearly distinguish it from a homophonous . cepticon, thereby facilitating certain multilingual (5) LWT: THEDUST and cross-lingual applications. IDS: DUST 5.4 The Overall Structure of a Wordlist In addition, certain concepts in the IDS concep- We schematize our abstract wordlist model in Fig- ticon have been expanded in the LWT list to make ure 2. The oval on the left represents the language it clearer what kinds of words might be treatable being described, from which the word forms are as counterparts. drawn (see section 5.1). On the right, the box represents a concepticon (see section 5.3) where (6) IDS: THELOUSE the concepts are listed as a set of identifiers (e.g., LWT: THELOUSE, HEADLOUSE, BODY 1.PERSON) that are associated with labels and re- LOUSE lated to their best English counterpart. Of course, the labels could be drawn from languages other The concepts in LWT and IDS concepticons re- than English, and other indexing devices, such as fer to a wide range of topics but, for historical pictures, could also be used. reasons, they are biased towards the geograph- Counterparts from the language being described ical and cultural settings of Europe, southwest for the relevant concepts are mapped to blocks of Asia, and (native) South America (Haspelmath defective signs (most typically containing just one and Tadmor, 2009, p. 6). The unpublished Usher- sign, but not always—see section 5.2) which are, Whitehouse concepticon (2,656 entries), used to in turn, associated with a concept. The schema- collect the bulk of the data used in the work de- tization further illustrates a possibility not yet ex- scribed here, includes LWT and IDS concepticons plicitly discussed that, due to the relatively impre- but also adds new concepts, such as WILDEBEEST cise nature of the counterpart relation, one group or WATTLE, in order to facilitate the collection of of forms may be the counterpart for multiple con- terms in languages from regions like Africa and cepts. In short, the mapping between forms and Papua New Guinea. Furthermore, certain concepts concepts is not necessarily particularly simple. in the LWT and IDS lists are subdivided in the Usher-Whitehouse concepticon, as shown in (7). 6 Implementing the Model 2http://lingweb.eva.mpg.de/ids/ We have used the conceptual model for wordlists developed in section 5 to create a wordlist

5 i nEngli sh 1. PERSON "person"

inEnglish 2. MAN "man"

inEnglish 3. WOMAN "woman" Described inEnglish Variety 4. HORSE "horse" of a Language i nEngli sh 5. EWE "ewe"

. . .

Concepticon

Figure 2: Wordlist modeled as a mapping between a language and a concepticon via blocks of signs database using Semantic Web technologies, in par- ontology for language documentation and descrip- ticular RDF/XML, which we expect to have both tion (Farrar and Lewis, 2007). The key data en- research and practical applications. coded by our RDF representation is the counter- Each wordlist in our database consists of two part mapping between a particular wordlist con- components: a set of metadata and a set of en- cept (lego:concept) drawn from our concep- tries. The metadata gives the various identifying ticon and a form (gold:formUnit) found in a names and codes for the wordlist e.g., a unique given wordlist. (The “lego” prefix refers to our identifier, the ISO 639-3 code, the related Ethno- internal project namespace.) logue language name3, alternate language names, An important feature of our RDF encoding is reference(s), the compilers of the wordlist, etc. All that the counterpart relation does not relate a con- forms in the wordlist are expressed as a sequence cept directly to a form but rather to a linguis- of Unicode characters and annotated with appro- tic sign (gold:LinguisticSign) whose form priate contextual information. In cases where feature contains the relevant specification. This there is more than one form attached to a concept, would allow for additional information about the we create multiple concept-form mappings. We do lexical element specified by the given form (e.g., not explicitly model form groups (see section 2) in part of speech, definition) to be added to the rep- our RDF at present since the data we are working resentation without modification of the model. with is not sufficiently detailed for us to need to Our RDF encoding, at present, is inspired by the attach information to any particular form group. traditional understanding of wordlists, building di- Expressing the data encoded in our wordlist rectly on work done by linguists (Haspelmath and database as RDF triples ensures Semantic Web Tadmor, 2009). While our use of RDF and an compatibility and allows our work to build on OWL ontology brings the data into a format allow- more general work that facilitates sharing and in- ing for much greater interoperability than would teroperating on linguistic data in a Semantic Web otherwise be possible, in order to achieve maxi- context (Farrar and Lewis, 2007). An RDF frag- mal integration with current efforts in NLP more ment describing the wordlist entry in (6) is given could be done. For example, we could devise in Figure 3 for illustrative purposes. In addition an RDF expression of our model compatible with to drawing on standard RDF constructs, we also LMF (Francopoulo et al., 2009) (see section 3). make use of descriptive linguistic concepts from The most difficult aspect of our model to en- GOLD4 (General Ontology for Linguistic De- code in LMF would appear to be the counterpart scription), which is intended be a sharable OWL relation since core LMF assumes that meanings will be expressed primarily as language-specific 3http://ethnologue.com/ 4http://linguistics-ontology.org/ senses. However, there is work in LMF encod-

6 106 the grandfather LEGO Project Unified Concepticon -ak Voorhoeve 1980 afak Voorhoeve 1980

Figure 3: Wordlist Entry RDF Fragment ing something quite comparable to our notion of 7 Evaluation counterpart, namely a SenseAxis, intended to sup- We have identified the following dimensions port interlingual pivots for multilingual resources across which it seems relevant to evaluate our (Soria et al., 2009). work against the state of the art: (i) the extent to As discussed in section 3, the concept labels which it can be applied generally to wordlist data, used in traditional wordlists can be understood as (ii) how it compares to existing wordlist , a kind of interlingua. Therefore, it seems that (iii) how it compares to other work which devel- a promising approach for adapting our model to ops data models intended to serve as targets for an LMF model would involve making use of the migration of legacy linguistic data, and (iv) the ex- SenseAxes. Because of this we believe that it tent to which our model can create lexical data that would be relatively straightforward to adapt our can straightforwardly interoperate with other lexi- database in a way which would make it even more cal data. We discuss each of these dimensions of accessible for NLP applications than it is in its evaluation in turn. present form, though we leave this as a task for (i) The RDF/XML model described here has future work. been successfully used to represent the entire core

7 dataset being used for this project (see section 2). believe that on a qualitative level it can be evalu- This represents around 2,700 wordlists and half a ated at or above the state of the art across several million forms, suggesting the model is reasonable, key dimensions. at least as a first attempt. Further testing will re- quire attempting to incorporate wordlist data from 8 Applications other sources into our model. Unlike typical research in NLP, our dataset cov- (ii) Wordlists databases have been constructed ers thousands of minority languages that are oth- for comparative linguistic work for decades. How- erwise poorly represented. Therefore, while our ever, there have not been extensive systematic at- data is sparse in many ways, it has a coverage well- tempts to encode them in interoperable formats to beyond what is normally found. the best of our knowledge, and certainly not in- Crucially, our data model makes visible the sim- volving a dataset of the size explored here. The ilarities between a concepticon and an interlingua, only comparable project is found in the World thus opening up a data type produced for descrip- Loanword Database (Haspelmath and Tadmor, tive linguistics for use in NLP contexts. In partic- 2010) (WOLD) which includes, as a possibility, ular, we have created a resource that we believe an RDF/XML export. This feature of the database could be exploited for NLP applications where is not explicitly documented, making a direct com- simple word-to-word mapping across languages is parison difficult. An examination of the data pro- useful, as in the PanImages5 search of the Pan- duced makes it appear largely similar to the model Lex project, which facilitates cross-lingual image proposed here. The database itself covers many searching. Such a database can also be readily fewer languages (around 40) but has much more exploited for machine identification of cognates data for each of its entries. In any event, we be- and recurrent sound correspondences to test algo- lieve our project and WOLD are roughly simi- rithms for language family reconstruction (Kon- lar regarding the extent to which the produced re- drak et al., 2007; Nerbonne et al., 2007) or to assist sources can be used for multiple purposes, though in the automatic identification of phonemic sys- it is difficult to examine this in detail at this time tems and, thereby, enhance relevant existing work in the absence of better documentation of WOLD. (Moran and Wright, 2009). We, therefore, think (iii) As discussed in section 3, most work on it represents a useful example of using data mod- designing data models to facilitate migration of eling and legacy data conversion to find common legacy descriptive data to more modern formats ground between descriptive linguistics and NLP. has used representative data rather than producing a substantial new resource in its own right. Fur- Acknowledgments thermore, while the data models have been gen- Funding for the work described here was provided eral in nature, the data encoding has often been in by NSF grant BCS-0753321, and the work is being parochial XML formats. By producing a substan- done in the context of the larger Lexicon Enhance- tial resource in a Semantic Web encoding in paral- ment via the GOLD Ontology project, headed by lel with the data modeling, we believe our results researchers at the Institute for Language Informa- exceed most of the comparable work on legacy lin- tion and Technology at Eastern Michigan Univer- guistic data, with the exception of ODIN (Xia and sity. Partial funding for the collection and cura- Lewis, 2009) which has also produced a substan- tion of the wordlists was provided by the Rosetta tial resource. Project (NSF DUE-0333727), along with the Max (iv) Finally, by building our wordlist model Planck Institute for Evolutionary Anthropology. around the abstract notion of the linguistic sign, and explicitly referring to the concept of sign through an OWL ontology, we believe we have References produced a wordlist data model which can produce Alison Alvarez, Lori Levin, Robert Frederking, Simon data which can straightforwardly interoperate with Fung, Donna Gates, and Jeff Good. 2006. The data from full lexicons since lexicon entries, too, MILE corpus for less commonly taught languages. can be modeled as signs, as in Figure 1. In NAACL ’06: Proceedings of the Human Lan- Therefore, while our work cannot be straight- guage Technology Conference of the NAACL, Com- panion Volume, pages 5–8. ACL. forwardly evaluated with quantitative metrics, we 5http://www.panimages.org/

8 John Bell and Steven Bird. 2000. A preliminary study Sergei Nirenburg, Marge McShane, and Steve Beale. of the structure of lexicon entries. In Proceedings 2004. The rationale for building resources expressly from the Workshop on Web-Based Language Docu- for NLP. In 4th International Conference on Lan- mentation and Description. Philadelphia, December guage Resources and Evaluation, Lisbon, Portugal. 12–15, 2000. Sebastian Nordhoff. 2008. Electronic reference gram- Emily M. Bender, Dan Flickinger, and Stephan Oepen. mars for typology: Challenges and solutions. Lan- 2002. The Grammar Matrix: An open-source guage Documentation & Conservation, 2:296–324. starter-kit for the rapid development of cross- linguistically consistent broad-coverage precision Tom O’Hara, Kavi Mahesh, and Sergei Nirenburg. grammars. In John Carroll, Nelleke Oostdijk, and 1998. Lexical Acquisition with WordNet and the In Proceedings of the ACL Richard Sutcliffe, editors, Proceedings of the Work- Mikrokosmos Ontology. shop on Grammar Engineering and Evaluation at Workshop on the Use of WordNet in NLP, pages 94– the 19th International Conference on Computational 101. Linguistics, pages 8–14, Taipei, . Alexis Palmer and Katrin Erk. 2007. IGT-XML: An XML format for interlinearized glossed text. In Steven Bird and Gary Simons. 2003. Seven dimen- Proceedings of the Linguistic Annotation Workshop, sions of portability for language documention and pages 176–183. ACL. description. Language, 79:557–582. Alexis Palmer, Taesun Moon, and Jason Baldridge. Scott Farrar and William D. Lewis. 2007. The GOLD 2009. Evaluating automation strategies in language Community of Practice: An infrastructure for lin- documentation. In Proceedings of the NAACL HLT guistic data on the Web. Language Resources and 2009 Workshop on Active Learning for Natural Lan- Evaluation, 41:45–60. guage Processing, pages 36–44. ACL. Gil Francopoulo, Nuria Bel, Monte George, Nicoletta Francesco Ronzano, Maurizio Tesconi, Salvatore Min- Calzolari, Monica Monachini, Mandy Pet, and Clau- utoli, Andrea Marchetti. 2010. Collaborative man- dia Soria. 2009. Multilingual resources for NLP agement of KYOTO Multilingual : in the (LMF). Language The Wikyoto Knowledge Editor. In Proceedings Resources and Evaluation, 43:57–70. of the 5th International Conference of the Global WordNet Association (GWC-2010). Mumbai, India. Jeff Good. 2004. The descriptive grammar as a (meta)database. In Proceedings of the E-MELD Gary Simons, Brian Fitzsimons, Terence Langendoen, Workshop on Linguistic Databases and Best Prac- William Lewis, Scott Farrar, Alexis Lanham, Ruby tice. Detroit, Michigan. Basham, and Hector Gonzalez. 2004. The descrip- tive grammar as a (meta)database. In Proceedings Martin Haspelmath and Uri Tadmor. 2009. The Loan- of the E-MELD Workshop on Linguistic Databases word Typology Project and the World Loanword and Best Practice. Detroit, Michigan. Database. Loanwords in the world’s languages: A comparative handbook, pages 1–33. Berlin: De Claudia Soria, Monica Monachini, and Piek Vossen. Gruyter. 2009. Wordnet-LMF: Fleshing out a Standard- ized Format for Wordnet Interoperability. In Inter- Martina Haspelmath and Uri Tadmor, editors. 2010. national Workshop on Intercultural Collaboration World Loanword Database. Munich: Max Planck (IWIC), pages 139–146. ACM. Digital Library. http://wold.livingsources.org. Tokunaga Takenobu, Nicoletta Calzolari, Chu-Ren Nancy Ide, Daniel Greenstein, and Piek Vossen. 1998. Huang, Laurent Prevot, Virach Sornlertlamvanich, Introduction to EuroWordNet. Computers and the Monica Monachini, Xia YingJu, Shirai Kiyoaki, Humanities, 32:73–89. Thatsanee Charoenporn, Claudia Soria, and Hao, Yu. 2006. Infrastructure for standardization of Grzegorz Kondrak, David Beck, and Philip Dilts. Asian language resources In Proceedings of the 2007. Creating a comparative dictionary of COLING/ACL Main Conference Poster Sessions, Totonac-Tepehua. In Proceedings of Ninth Meeting pages 827–834. ACL. of the ACL Special Interest Group in Computational Morphology and , pages 134–141. ACL. Piek Vossen. 2004. EuroWordNet: A multilin- gual database of autonomous and language-specific Steven Moran and Richard Wright. 2009. Pho- connected via an Inter-Lingual-Index. In- netics Information Base and Lexicon (PHOIBLE). ternational Journal of , 17:161–173. http://phoible.org. Fei Xia and William D. Lewis. 2009. Applying NLP John Nerbonne, T. Mark Ellison, and Grzegorz Kon- technologies to the collection and enrichment of lan- drak. 2007. Computing and historical phonology. guage data on the Web to aid linguistic research. In In Proceedings of Ninth Meeting of the ACL Special LaTeCH-SHELT&R ’09: Proceedings of the EACL Interest Group in Computational Morphology and 2009 Workshop on Language Technology and Re- Phonology, pages 1–5. ACL. sources for Cultural Heritage, Social Sciences, Hu- manities, and Education, pages 51–59. ACL.

9