Representing the Translation Relation in a Bilingual Wordnet
Total Page:16
File Type:pdf, Size:1020Kb
Representing the Translation Relation in a Bilingual Wordnet Jyrki Niemi, Krister Lindén Language Technology, Department of Modern Languages, University of Helsinki PO Box 24, FI-00014 University of Helsinki, Finland {jyrki.niemi, krister.linden}@helsinki.fi Abstract This paper describes representing translations in the Finnish wordnet, FinnWordNet (FiWN), and constructing the FiWN database. FiWN was created by translating all the word senses of the Princeton WordNet (PWN) into Finnish and by joining the translations with the semantic and lexical relations of PWN extracted into a relational (database) format. The approach naturally resulted in a translation relation between PWN and FiWN. Unlike many other multilingual wordnets, the translation relation in FiWN is not primarily on the synset level, but on the level of an individual word sense, which allows more precise translation correspondences. This can easily be projected into a synset-level translation relation, used for linking with other wordnets, for example, via Core WordNet. Synset-level translations are also used as a default in the absence of word-sense translations. The FiWN data in the relational database can be converted to other formats. In the PWN database format, translations are attached to source-language words, allowing the implementation of a Web search interface also working as a bilingual dictionary. Another representation encodes the translation relation as a finite-state transducer. Keywords: wordnet, bilingual representation, translation relation 1. Introduction 1.2. FinnWordNet Data Representations Different uses of and different software for a wordnet re- The Finnish wordnet, FinnWordNet (FiWN) is a translation quire or benefit from different formats or representations of the word senses in the Princeton WordNet (PWN, version of the wordnet data. Different formats include the PWN 3.0) (Fellbaum, 1998). This paper describes representations textual database format, relational database formats, such of the PWN–FiWN translation relation and the method of as the GermaNet database (Henrich and Hinrichs, 2010a), constructing the FiWN database by combining the structure XML formats, such as Wordnet-LMF (Soria et al., 2009) (semantic and lexical relations) of the PWN and the Finnish and its extensions (Henrich and Hinrichs, 2010b), and translations of the word senses. simple plain-text lists of tab- or comma-separated values.1 A wordnet in the PWN database format can be searched using the PWN search software and various wordnet lib- 1.1. FinnWordNet as a Translation raries; a relational database can be processed efficiently with standard database systems and libraries; a standard- Previous approaches to multilingual wordnets define the ized XML format can be used for data interchange; and translation relation between senses (synonym sets, syn- plain-text lists are often simple to use in rapidly-written sets), which we regard as too coarse for other than text- scripts. understanding purposes; for example, the word sense trans- To ensure the consistency of the FiWN data in the differ- lation relation {Amur $ Amur, Amur River $ Amurjoki} ent formats, we have chosen a relational (database) format would be reduced to the synset translation relation {Amur, as the primary format to which changes to the content Amur River} $ {Amur, Amurjoki}, losing the fine distinc- are made and from which other formats are generated, tions in the word sense translation correspondences. In this either directly or via intermediate formats. Many common paper, we propose a solution with the translation relation formats are monolingual, so we have extended them with a defined between word senses. translation relation. We have conversion paths to the PWN database format, plain-text lists and finite-state transducers. The word senses of all the synsets of PWN were trans- Figure 1 shows the process of constructing FiWN, along lated into Finnish by professional translators as described with the different types of files and programs used. The by Lindén and Carlson (2010). The direct translation ap- FiWN data is freely available and can be downloaded in proach is based on the assumption that most synsets in several formats from the FiWN project Web page.2 PWN represent language-independent real-world concepts. The Finnish translations of the PWN word senses as Thus also the semantic relations between synsets are as- provided by the translators were in an XML format con- sumed to be mostly language-independent, so the structure taining only a small part of the structure of PWN, serving of PWN can be reused as well. This approach made it pos- as a translation context. To construct a usable wordnet for sible to create an extensive Finnish wordnet directly aligned Finnish having the structure of PWN, we thus first com- with PWN, even though some words and concepts specific bined the relations extracted from the PWN database with to the Finnish culture and society were left missing at this the Finnish translations of the PWN word senses. stage. FiWN currently contains 117,659 synsets, as does PWN. The translation relation also makes it possible to use 1Tab- or comma-separated-values files can also be used as a FiWN linked with PWN as a bilingual English–Finnish– plain-text representation of the tables of a relational database. English dictionary, as well as a Finnish-only wordnet. 2http://www.ling.helsinki.fi/en/lt/research/finnwordnet/ 2439 PWN FiWN translations richs, 2010a) to browse and edit FiWN data, which would lexicographer file (XML) make extending and modifying FiWN more user-friendly. files The rest of this paper is organized as follows. Section 2. de- Extractor scribes the information content of the PWN database. Sec- Grind tion 3. presents different ways of representing translations. FiWN translations PWN data files relation Section 4. describes how we constructed the FiWN data- bases, and Section 5. how we succeeded in it. Section 6. Relation extractor discusses the process and directions for further work, and Section 8. concludes the paper. PWN data relations 2. Wordnet Information Content The information in PWN is stored in the specialized PWN database format. To simplify constructing the FiWN data- Translation base, we first extract the relations implicitly present in the combiner PWN database. FiWN data 2.1. Princeton WordNet Database relations Lex id generator (incl. translations) The PWN database uses a textual file format of its own, comprising data and index files, generated from the source FiWN data 5 relations for lexicographer files by the PWN Grind program. lexicographer Each part of speech has a separate data and index file. files Converter to Database creator (Since we do not need the index files in constructing the format X Lexicographer file FiWN database, we omit their description.) Each line in a FiWN database creator data file contains the entry for a single synset with the in- FiWN data in with translation formation shown in Table 1. Figure 2 shows an example of format X relation FiWN a synset entry in a data file and the corresponding source lexicographer Database relation files entry in a lexicographer file. A detailed description of the extractor database file formats is found in the PWN documentation.6 Modified Grind Offset of the line in the file FiWN data files Source lexicographer file number (PWN format) Synset type (generally, part of speech) Number of words in the synset Figure 1: Constructing FiWN by combining the PWN List of words: structure and Finnish translations, with the result in mul- Word Lexical identifier number (lex id) tiple target formats. Rectangles denote files; double- Number of pointers (semantic and lexical relations) bordered rectangles denote the main representations; roun- List of pointers: ded rectangles denote programs or scripts; dashed parts rep- Relation type resent generic conversion paths to other target formats. Target synset offset Part of speech Source and target word number (for lexical relations) Since FiWN is a translation of the PWN word senses and Sentence frame numbers (for verbs only) retains the structure of PWN, we first converted the data to Gloss the PWN database format searchable with the PWN search software,3 which is used in the FiWN Web search inter- Table 1: Information in a PWN data file entry (synset). face.4 The major advantage of this approach was that we were able to develop the search interface more rapidly than otherwise. The PWN database format has no support for 09833536 18 n 02 baker 0 bread_maker 0 002 @ translations, but we attached them to word senses by using 10605985 n 0000 + 01663749 v 0101 | someone who a special syntax processed by the search interface. bakes bread or cake Since the PWN database format is very specialized and dif- { [ baker, verb.creation:bake,+ ] bread_maker, skilled_worker,@ (someone who bakes bread or ficult to extend, we converted the FiWN data into a rela- cake) } tional database format based on that used for the German wordnet, GermaNet (Henrich and Hinrichs, 2010a), mod- Figure 2: PWN data file (data.noun) and lexicographer file ified and extended with word-sense and synset translation (noun.person) entry for the synset {baker, bread maker}. relations. Using the GermaNet database as the basis has the additional advantage that we could use an appropriately PWN synsets have no persistent identifiers. Within one modified version of the GernEdiT editor (Henrich and Hin- PWN version, the synset offsets in a data file can be used as 3http://wordnet.princeton.edu/wordnet/man/wn.1WN.html 5http://wordnet.princeton.edu/wordnet/man/grind.1WN.html 4http://www.ling.helsinki.fi/cgi-bin/fiwn/search 6http://wordnet.princeton.edu/wordnet/documentation/ 2440 identifiers of a kind within a part of speech (e.g., 09833536 en:109833536|18||someone who bakes bread or cake for {baker, bread maker} in Figure 2). The offsets change, en:109833536|en:110605985|@ however, if the length of the data in any synset changes.