Cognate Production Using Character-Based Machine Translation

Total Page:16

File Type:pdf, Size:1020Kb

Cognate Production Using Character-Based Machine Translation Cognate Production using Character-based Machine Translation Lisa Beinborn, Torsten Zesch and Iryna Gurevych Ubiquitous Knowledge Processing Lab (UKP-TUDA) Department of Computer Science, Technische Universitat¨ Darmstadt Ubiquitous Knowledge Processing Lab (UKP-DIPF) German Institute for International Educational Research www.ukp.tu-darmstadt.de Abstract A strict definition only considers two words as Cognates are words in different languages cognates, if they have the same etymological ori- that are associated with each other by lan- gin, i.e. they are genetic cognates (Crystal, 2011). guage learners. Thus, cognates are im- Language learners usually lack the linguistic back- portant indicators for the prediction of the ground to make this distinction and will use all perceived difficulty of a text. We in- similar words to facilitate comprehension regard- troduce a method for automatic cognate less of the linguistic derivation. For example, the production using character-based machine English word strange has the Italian correspondent translation. We show that our approach strano. The two words have different roots and are is able to learn production patterns from therefore genetically unrelated. However, for lan- noisy training data and that it works for guage learners the similarity is more evident than a wide range of language pairs. It even for example the English-Italian genetic cognate works across different alphabets, e.g. we father-padre. Therefore, we aim at identifying all obtain good results on the tested language words that are sufficiently similar to be associated pairs English-Russian, English-Greek, and by a language learner no matter whether they are English-Farsi. Our method performs sig- genetic cognates. As words which are borrowed nificantly better than similarity measures from another language without any modification used in previous work on cognates. (such as cappuccino) can be easily identified by direct string comparison, we focus on word pairs 1 Introduction that do not have identical spelling. In order to improve comprehension of a text in a If the two associated words have the same or foreign language, learners use all possible infor- a closely related meaning, they are true cognates, mation to make sense of an unknown word. This while they are called false cognates or false friends includes context and domain knowledge, but also in case they have a different meaning. On the one knowledge from the mother tongue or any other hand, true cognates are instrumental in construct- previously acquired language. Thus, a student is ing easily understandable foreign language exam- more likely to understand a word if there is a sim- ples, especially in early stages of language learn- ilar word in a language she already knows (Ring- ing. On the other hand, false friends are known bom, 1992). For example, consider the following to be a source of errors and severe confusion for German sentence: learners (Carroll, 1992) and need to be practiced Die internationale Konferenz zu kritischen more frequently. For these reasons, both types Infrastrukturen im Februar ist eine Top- need to be considered when constructing teaching Adresse fur¨ Journalisten. materials. However, existing lists of cognates are Everybody who knows English might grasp the usually limited in size and only available for very gist of the sentence with the help of associ- few language pairs. In order to improve language ated words like Konferenz-conference or Februar- learning support, we aim at automatically creating February. Such pairs of associated words are lists of related words between two languages, con- called cognates. taining both, true and false cognates. 883 International Joint Conference on Natural Language Processing, pages 883–891, Nagoya, Japan, 14-18 October 2013. In order to construct such cognate lists, we need to decide whether a word in a source language has a cognate in a target language. If we already have candidate pairs, string similarity measures can be used to distinguish cognates and unrelated Figure 1: Character-based machine translation pairs (Montalvo et al., 2012; Sepulveda´ Torres and Aluisio, 2011; Inkpen et al., 2005; Kondrak rithm selects the best combination of sequences. and Dorr, 2004). However, these measures do not The transformation is thus not performed on iso- take the regular production processes into account lated characters, it also considers the surrounding that can be found for most cognates, e.g. the En- sequences and can account for context-dependent glish suffix ~tion becomes ~cion´ in Spanish like in phenomena. The goal of the approach is to directly nation-nacion´ or addition-adicion´ . Thus, an alter- produce a cognate in the target language from an native approach is to manually extract or learn pro- input word in another language. Consequently, in duction rules that reflect the regularities (Gomes the remainder of the paper, we refer to our method and Pereira Lopes, 2011; Schulz et al., 2004). as COP (COgnate Production). All these methods are based on string align- Exploiting the orthographic similarity of cog- ment and thus cannot be directly applied to lan- nates to improve the alignment of words has al- guage pairs with different alphabets. A possible ready been analyzed as a useful preparation for workaround would be to first transliterate foreign MT (Tiedemann, 2009; Koehn and Knight, 2002; alphabets into Latin, but unambiguous translitera- Ribeiro et al., 2001). As explained above, we ap- tion is only possible for some languages. Methods proach the phenomenon from the opposite direc- that rely on the phonetic similarity of words (Kon- tion and use statistical MT for cognate production. drak, 2000) require a phonetic transcription that is Previous experiments with character-based MT not always available. Thus, we propose a novel have been performed for different purposes. Pen- production approach using statistical character- nell and Liu (2011) expand text message abbre- based machine translation in order to directly pro- viations into proper English. In Stymne (2011), duce cognates. We argue that this has the follow- character-based MT is used for the identification ing advantages: (i) it captures complex patterns in of common spelling errors. Several other ap- the same way machine translation captures com- proaches also apply MT algorithms for translit- plex rephrasing of sentences, (ii) it performs bet- eration of named entities to increase the vocabu- ter than similarity measures from previous work lary coverage (Rama and Gali, 2009; Finch and on cognates, and (iii) it also works for language Sumita, 2008). For transliteration, characters from pairs with different alphabets. one alphabet are mapped onto corresponding let- ters in another alphabet. Cognates follow more 2 Character-Based Machine Translation complex production patterns. Nakov and Tiede- mann (2012) aim at improving MT quality using Our approach relies on statistical phrase-based cognates detected by character-based alignment. machine translation (MT). As we are not inter- They focus on the language pair Macedonian- ested in the translation of phrases, but in the trans- Bulgarian and use English as a bridge language. formation of character sequences from one lan- As they use cognate identification only as an in- guage into the other, we use words instead of sen- termediary step and do not provide evaluation re- tences and characters instead of words, as shown sults, we cannot directly compare with their work. in Figure 1. In the example, the English charac- To the best of our knowledge, we are the first to ter sequence cc is mapped to a single c in Spanish use statistical character-based MT for the goal of and the final e becomes ar. It is important to note directly producing cognates. that these mappings only apply in certain contexts. For example, accident becomes accidente with a 3 Experimental Setup double c in Spanish and not every word-final e is changed into ar. In statistical MT, the training pro- Figure 2 gives an overview of the COP architec- cess generates a phrase table with transformation ture. We use the existing statistical MT engine probabilities. This information is combined with Moses (Koehn et al., 2007). The main difference language model probabilities and a search algo- of character-based MT to standard MT is the lim- 884 Figure 2: Architecture of our Cognate Production (COP) approach ited lexicon. Our tokens are character n-grams in- types of cognates as foreign words also trigger stead of words, therefore we need much less train- wrong associations in learners (see Section 5.4). ing data. Additionally, distortion effects can be neglected as reordering of ngrams is not a regu- Evaluation Metrics In order to estimate the lar morphological process for cognates.1 Thus, we cognate production quality without having to rely deal with less variation than standard MT. on repeated human judgment, we evaluate COP against a list of known cognates. Existing cog- Training As training data, we use existing lists nate lists only contain pairs of true cognates, but of cognates or lists of closely related words and a word might have several true cognates. For ex- perform some preprocessing steps. All duplicates, ample, the Spanish word musica´ has at least three multiwords, conjugated forms and all word pairs English cognates: music, musical, and musician. that are identical in source and target are removed. Therefore, not even a perfect cognate production We lowercase the remaining words and introduce process will be able to always rank the right true # as start symbol and $ as end symbol of a word. cognate on the top position. In order to account for Then all characters are divided by blanks. Moses the issue, we evaluate the coverage using a relaxed additionally requires a language model. We build metric that counts a positive match if the gold stan- an SRILM language model (Stolcke, 2002) from a dard cognate is found in the n-best list of cognate list of words in the target language converted into productions.
Recommended publications
  • A Collaborativeplatform for Multilingual Ontology
    PhD Dissertation International Doctorate School in Information and Communication Technologies DIT - University of Trento A COLLABORATIVE PLATFORM FOR MULTILINGUAL ONTOLOGY DEVELOPMENT Ahmed Maher Ahmed Tawfik Moustafa Advisor: Professor. Fausto Giunchiglia Università degli Studi di Trento Abstract The world is extremely diverse and its diversity is obvious in the cultural differences and the large number of spoken languages being used all over the world. In this sense, we need to collect and organize a huge amount of knowledge obtained from multiple resources differing from one another in many aspects. A possible approach for doing that is to think of designing effective tools for construction and maintenance of linguistic resources and localized domain ontologies based on well-defined knowledge representation methodologies capable of dealing with diversity and the continuous evolvement of human knowledge. In this thesis, we present a collaborative platform which allows for knowledge organization in a language-independent manner and provides the appropriate mapping from a language independent concept to one specific lexicalization per language. This representation ensures a smooth multilingual enrichment process for linguistic resources and a robust construction of ontologies using language-independent concepts. The collaborative platform is designed following a workflow-based development methodology that models linguistic resources as a set of collaborative objects and assigns a customizable workflow to build and maintain each collaborative object in a community driven manner, with extensive support of modern web 2.0 social and collaborative features. Keywords Knowledge Representation, Multilingual Resources, Ontology Development, Computer Supported Collaborative Work 2 Acknowledgments I am particularly grateful for my supervisor, Professor. Fausto Giunchiglia, for the guidance and advices he has provided throughout my time as a PhD student.
    [Show full text]
  • Latent Semantic Network Induction in the Context of Linked Example Senses
    Latent semantic network induction in the context of linked example senses Hunter Scott Heidenreich Jake Ryland Williams Department of Computer Science Department of Information Science College of Computing and Informatics College of Computing and Informatics [email protected] [email protected] Abstract of a network—much like the Princeton Word- Net (Miller, 1995; Fellbaum, 1998)—that is con- The Princeton WordNet is a powerful tool structed solely from the semi-structured data of for studying language and developing nat- ural language processing algorithms. With Wiktionary. This relies on the noisy annotations significant work developing it further, one of the editors of Wiktionary to naturally induce a line considers its extension through aligning network over the entirety of the English portion of its expert-annotated structure with other lex- Wiktionary. In doing so, the development of this ical resources. In contrast, this work ex- work produces: plores a completely data-driven approach to network construction, forming a wordnet us- • an induced network over Wiktionary, en- ing the entirety of the open-source, noisy, user- riched with semantically linked examples, annotated dictionary, Wiktionary. Compar- forming a directed acyclic graph (DAG); ing baselines to WordNet, we find compelling evidence that our network induction process • an exploration of the task of relationship dis- constructs a network with useful semantic structure. With thousands of semantically- ambiguation as a means to induce network linked examples that demonstrate sense usage construction; and from basic lemmas to multiword expressions (MWEs), we believe this work motivates fu- • an outline for directions of expansion, includ- ture research. ing increasing precision in disambiguation, cross-linking example usages, and aligning 1 Introduction English Wiktionary with other languages.
    [Show full text]
  • Visualization Design for a Web Interface to the Large-Scale Linked Lexical Resource UBY
    Visualization Design for a Web Interface to the Large-Scale Linked Lexical Resource UBY We present the results of a collaboration of visualization experts and computational linguists which aimed at the re-design of the visualization component in the Web user interface (Web UI) to the large-scale linked lexical resource UBY. UBY combines a wide range of information from expert- constructed (e.g., WordNet, FrameNet, VerbNet) and collaboratively constructed (e.g., Wiktionary, Wikipedia) resources for English and German, see https://www.ukp.tu-darmstadt.de/uby. All resources contained in UBY distinguish not only different words but also their senses. A distinguishing feature of UBY is that the different resources are aligned to each other at the word sense level, i.e. there are links connecting equivalent word senses from different resources in UBY. For senses that are linked, information from the aligned resources can be accessed and the resulting enriched sense representations can be used to enhance the performance of Natural Language Processing tasks. Targeted user groups of the UBY Web UI are researchers in the field of Natural Language Processing and in the Digital Humanities (e.g., lexicographers, linguists). In the context of exploring the usually large number of senses for an arbitrary search word, the UBY Web UI should support these user groups in assessing the added value of sense links for particular applications. It is important to emphasize that this is an open research question for most applications. We will present the results of our detailed requirements analysis that revealed a number of central requirements a visualization of all the senses for a given search word and the links between them must meet in order to be useful for this purpose.
    [Show full text]
  • Interdisciplinary Approaches to Stratifying the Peopling of Madagascar
    INTERDISCIPLINARY APPROACHES TO STRATIFYING THE PEOPLING OF MADAGASCAR Paper submitted for the proceedings of the Indian Ocean Conference, Madison, Wisconsin 23-24th October, 2015 Roger Blench McDonald Institute for Archaeological Research University of Cambridge Correspondence to: 8, Guest Road Cambridge CB1 2AL United Kingdom Voice/ Ans (00-44)-(0)1223-560687 Mobile worldwide (00-44)-(0)7847-495590 E-mail [email protected] http://www.rogerblench.info/RBOP.htm This version: Makurdi, 1 April, 2016 1 Malagasy - Sulawesi lexical connections Roger Blench Submission version TABLE OF CONTENTS TABLE OF CONTENTS................................................................................................................................. i ACRONYMS ...................................................................................................................................................ii 1. Introduction................................................................................................................................................. 1 2. Models for the settlement of Madagascar ................................................................................................. 2 3. Linguistic evidence...................................................................................................................................... 2 3.1 Overview 2 3.2 Connections with Sulawesi languages 3 3.2.1 Nouns..............................................................................................................................................
    [Show full text]
  • A Large, Interlinked, Syntactically-Rich Lexical Resource for Ontologies
    Semantic Web 0 (0) 1 1 IOS Press lemonUby - a large, interlinked, syntactically-rich lexical resource for ontologies Judith Eckle-Kohler, a;∗ John Philip McCrae b and Christian Chiarcos c a Ubiquitous Knowledge Processing (UKP) Lab, Department of Computer Science, Technische Universität Darmstadt and Information Center for Education, German Institute for International Educational Research, Germany, http://www.ukp.tu-darmstadt.de b Cognitive Interaction Technology (CITEC), Semantic Computing Group, Universität Bielefeld, Germany, http://www.sc.cit-ec.uni-bielefeld.de c Applied Computational Linguistics (ACoLi), Department of Computer Science and Mathematics, Goethe-University Frankfurt am Main, Germany, http://acoli.cs.uni-frankfurt.de Abstract. We introduce lemonUby, a new lexical resource integrated in the Semantic Web which is the result of converting data extracted from the existing large-scale linked lexical resource UBY to the lemon lexicon model. The following data from UBY were converted: WordNet, FrameNet, VerbNet, English and German Wiktionary, the English and German entries of Omega- Wiki, as well as links between pairs of these lexicons at the word sense level (links between VerbNet and FrameNet, VerbNet and WordNet, WordNet and FrameNet, WordNet and Wiktionary, WordNet and German OmegaWiki). We linked lemonUby to other lexical resources and linguistic terminology repositories in the Linguistic Linked Open Data cloud and outline possible applications of this new dataset. Keywords: Lexicon model, lemon, UBY-LMF, UBY, OLiA, ISOcat, WordNet, VerbNet, FrameNet, Wiktionary, OmegaWiki 1. Introduction numerous mappings and linkings of lexica, as well as standards for representing lexical resources, such Recently, the language resource community has begun as the ISO 24613:2008 Lexical Markup Framework to explore the opportunities offered by the Semantic (LMF) [13].
    [Show full text]
  • Mining Translations from the Web of Open Linked Data
    Mining translations from the web of open linked data John Philip McCrae Philipp Cimiano University of Bielefeld University of Bielefeld [email protected] [email protected] Abstract OmegaWiki on the web of data should ameliorate In this paper we consider the prospect of the process of harvesting translations from these extracting translations for words from the resources. web of linked data. By searching for We consider two sources for translations from entities that have labels in both English linked data: firstly, we consider mining labels for and German we extract 665,000 transla- concepts from the data contained in the 2010 Bil- tions. We then also consider a linguis- lion Triple Challenge (BTC) data set, as well as tic linked data resource, lemonUby, from DBpedia (Auer et al., 2007) and FreeBase (Bol- which we extract a further 115,000 transla- lacker et al., 2008). Secondly, we mine transla- tions. We combine these translations with tions from lemonUby (Eckle-Kohler et al., 2013), the Moses statistical machine translation, a resource that integrates a number of distinct dic- and we show that the translations extracted tionary language resources in the lemon (Lexi- from the linked data can be used to im- con Model for Ontologies) format (McCrae et al., prove the translation of unknown words. 2012), which is a model for representing rich lex- ical information including forms, sense, morphol- 1 Introduction ogy and syntax of ontology labels. We then con- In recent years there has been a massive explo- sider the process of including these extra transla- sion in the amount and quality of data available tions into an existing translation system, namely as linked data on the web.
    [Show full text]
  • Cognate Words in Mehri and Hadhrami Arabic
    Cognate Words in Mehri and Hadhrami Arabic Hassan Obeid Alfadly* Khaled Awadh Bin Mukhashin** Received: 18/3/2019 Accepted: 2/5/2019 Abstract The lexicon is one important source of information to establish genealogical relations between languages. This paper is an attempt to describe the lexical similarities between Mehri and Hadhrami Arabic and to show the extent of relatedness between them, a very little explored and described topic. The researchers are native speakers of Hadhrami Arabic and they paid many field visits to the area where Mehri is spoken. They used the Swadesh list to elicit their data from more than 20 Mehri informants and from Johnston's (1987) dictionary "The Mehri Lexicon and English- Mehri Word-list". The researchers employed lexicostatistical techniques to analyse their data and they found out that Mehri and Hadhrmi Arabic have so many cognate words. This finding confirms Watson (2011) claims that Arabic may not have replaced all the ancient languages in the South-Western Arabian Peninsula and that dialects of Arabic in this area including Hadhrami Arabic are tinged, to a greater or lesser degree, with substrate features of the Pre- Islamic Ancient and Modern South Arabian languages. Introduction: three branches including Central Semitic, Historically speaking, the Semitic language Ethiopian and Modern south Arabian languages family from which both of Arabic and Mehri (henceforth MSAL). Though Arabic and Mehri descend belong to a larger family of languages belong to the West Semitic, Arabic descends called Afro-Asiatic or Hamito-Semitic that from the Central Semitic and Mehri from includes Semitic, Egyptian, Cushitic, Omotic, (MSAL) which consists of two branches; the Berber and Chadic (Rubin, 2010).
    [Show full text]
  • Comparing the Cognate Effect in Spoken and Written Second Language Word Production
    1 Short title: Cognate effect in spoken and written word production Comparing the cognate effect in spoken and written second language word production 1 2 1 Merel Muylle , Eva Van Assche & Robert J. Hartsuiker 1Department of Experimental Psychology, Ghent University, Ghent, Belgium 2 Thomas More University of Applied Sciences, Antwerp, Belgium Address for correspondence: Merel Muylle Department of Experimental Psychology Ghent University Henri Dunantlaan 2 B-9000 Gent (Belgium) E-mail: [email protected] 2 Abstract Cognates – words that share form and meaning between languages – are processed faster than control words. However, it is unclear whether this effect is merely lexical (i.e., central) in nature, or whether it cascades to phonological/orthographic (i.e., peripheral) processes. This study compared the cognate effect in spoken and typewritten production, which share central, but not peripheral processes. We inquired whether this effect is present in typewriting, and if so, whether its magnitude is similar to spoken production. Dutch-English bilinguals performed either a spoken or written picture naming task in English; picture names were either Dutch-English cognates or control words. Cognates were named faster than controls and there was no cognate-by-modality interaction. Additionally, there was a similar error pattern in both modalities. These results suggest that common underlying processes are responsible for the cognate effect in spoken and written language production, and thus a central locus of the cognate effect. Keywords: bilingualism, word production, cognate effect, writing 3 Converging evidence suggests that bilinguals activate both their mother tongue (L1) and their second language (L2) simultaneously when processing linguistic information (e.g., Dijkstra & Van Heuven, 2002; Van Hell & Dijkstra, 2002).
    [Show full text]
  • 1 Roger Schwarzschild Rutgers University 18 Seminary Place New
    Roger Schwarzschild Rutgers University 18 Seminary Place New Brunswick, NJ 08904 [email protected] to appear in: Recherches Linguistiques de Vincennes November 30, 2004 ABSTRACT In some languages, measure phrases can appear with non-compared adjectives: 5 feet tall. I address three questions about this construction: (a) Is the measure phrase an argument of the adjective or an adjunct? (b) What are we to make of the markedness of this construction *142lbs heavy? (c) Why is it that the markedness disappears once the adjective is put in the comparative (2 inches taller alongside 2lbs heavier)? I claim that because degree arguments are ‘functional’, the measure phrase has to be an adjunct and not a syntactic argument of the adjective. Like event modifiers in extended NP’s and in VPs, the measure phrase predicates of a degree argument of the adjective. But given the kind of meaning a measure phrase must have to do its job in comparatives and elsewhere, it is not of the right type to directly predicate of a degree argument. I propose a lexically governed type-shift which applies to some adjectives allowing them to combine with a measure phrase. KEY WORDS adjective, measure phrase, degree, functional category, lexical, adjunct, argument, antonymy. Measure Phrases as Modifiers of Adjectives1 1. Introduction There is a widely accepted account of expressions like five feet tall according to which the adjective tall denotes a relation between individuals and degrees of height and the measure phrase, five feet, serves as an argument of the adjective, saturating the degree-place in the relation2.
    [Show full text]
  • Palatalization/Velar Softening: What It Is and What It Tells Us About the Nature of Language Morris Halle
    Palatalization/Velar Softening: What It Is and What It Tells Us about the Nature of Language Morris Halle This study proposes an account both of the consonantal changes in- volved in palatalization/velar softening and of the fact that this change is encountered before front vowels. The change is a straightforward case of feature assimilation provided that segments/phonemes are viewed as complexes of features organized into the ‘‘bottle brush’’ model illustrated in example (4) and elsewhere in the text, and that the universal set of features includes, in addition to the familiar binary features, six unary features, which specify the designated articulator(s) for every segment (not only for consonants). Keywords: phonetics, palatalization, velar softening, features, articula- tors For Ken Stevens, friend and colleague, on his 80th birthday My purpose in this study is to present an account of the very common alternation between dorsal and coronal consonants often referred to as palatalization or velar softening. This alternation, exemplified by English electri[k] ϳ electri[s]ity, occurs most often before front vowels. In spite of its extremely common occurrence in the languages of the world, to this time there has been no proper account of palatalization that would relate it to the other properties of language, in particular, to the fact that it is found most commonly before front vowels. In the course of working on these remarks, it became clear to me that palatalization raises numerous theoretical questions about which there is at present no agreement among phonologists. Since these include matters of the most fundamental importance for phonology, I start by exten- sively discussing the main issues and the views on them that seem to me most persuasive at this time (section 1).
    [Show full text]
  • Jicaque As a Hokan Language Author(S): Joseph H
    Jicaque as a Hokan Language Author(s): Joseph H. Greenberg and Morris Swadesh Source: International Journal of American Linguistics, Vol. 19, No. 3 (Jul., 1953), pp. 216- 222 Published by: The University of Chicago Press Stable URL: http://www.jstor.org/stable/1263010 Accessed: 11-07-2017 15:04 UTC REFERENCES Linked references are available on JSTOR for this article: http://www.jstor.org/stable/1263010?seq=1&cid=pdf-reference#references_tab_contents You may need to log in to JSTOR to access the linked references. JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms of scholarship. For more information about JSTOR, please contact [email protected]. Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at http://about.jstor.org/terms The University of Chicago Press is collaborating with JSTOR to digitize, preserve and extend access to International Journal of American Linguistics This content downloaded from 12.14.13.130 on Tue, 11 Jul 2017 15:04:26 UTC All use subject to http://about.jstor.org/terms JICAQUE AS A HOKAN LANGUAGE JOSEPH H. GREENBERG AND MORRIS SWADESH COLUMBIA UNIVERSITY 1. The problem 2. The phonological equivalences in Hokan 2. Phonological note have been largely established by Edward 3. Cognate list Sapir's work.3 The Jicaque agreements are 4. Use of lexical statistics generally obvious. A special point is that 5.
    [Show full text]
  • Proceedings of KONVENS 2012 (Main Track: Poster Presentations), Vienna, September 19, 2012 Ments Between Its Lsrs
    Navigating Sense-Aligned Lexical-Semantic Resources: THE WEB INTERFACE TO UBY Iryna Gurevych1,2, Michael Matuschek1, Tri-Duc Nghiem1, Judith Eckle-Kohler1, Silvana Hartmann1, Christian M. Meyer1 1Ubiquitous Knowledge Processing Lab (UKP-TUDA) Department of Computer Science, Technische Universitat¨ Darmstadt 2Ubiquitous Knowledge Processing Lab (UKP-DIPF) German Institute for Educational Research and Educational Information http://www.ukp.tu-darmstadt.de Abstract to the large sense-aligned LSR UBY (Gurevych et al., 2012). UBY is represented in compli- In this paper, we present the Web inter- ance with the ISO standard LMF (Francopoulo face to UBY, a large-scale lexical resource based on the Lexical Markup Framework et al., 2006) and currently contains interoper- (LMF). UBY contains interoperable ver- able versions of nine heterogeneous LSRs in sions of nine resources in two languages. two languages, as well as pairwise sense align- The interface allows to conveniently exam- ments for a subset of them: English WordNet ine and navigate the encoded information (WN), Wiktionary (WKT-en), Wikipedia (WP- in UBY across resource boundaries. Its en), FrameNet (FN), and VerbNet (VN); German main contributions are twofold: 1) The vi- Wiktionary (WKT-de), Wikipedia (WP-de), and sual view allows to examine the sense clus- GermaNet (GN), and the English and German en- ters for a lemma induced by alignments between different resources at the level of tries of OmegaWiki (OW-en/de). word senses. 2) The textual view uniformly The novel aspects of our interface can be sum- presents senses from different resources in marized as 1) A graph-based visualization of detail and offers the possibility to directly sense alignments between the LSRs integrated in compare them in a parallel view.
    [Show full text]