Semantic Web 0 (0) 1
IOS Press

Bilingual dictionary generation and enrichment via graph exploration

Shashwat Goel a, Jorge Gracia b,* and Mikel Lorenzo Forcada c

a IIIT Hyderabad, Professor CR Rao Rd, Gachibowli, Hyderabad, Telangana, 500032, India
E-mail: [email protected]
b Aragon Institute of Engineering Research, Universidad de Zaragoza, Mariano Esquillor s/n, 50018 Zaragoza, Spain
E-mail: [email protected]
c Dept. de Llenguatges i Sistemes Informàtics, Universitat d'Alacant, Ctra. St. Vicent–Alacant, s/n, 03690 St. Vicent del Raspeig, Spain
E-mail: [email protected]

Abstract. In recent years, we have witnessed a steady growth of linguistic information represented and exposed as linked data on the Web. This is allowing the growth and application of openly available linguistic knowledge graphs. In this work we explore techniques that exploit the graph nature of bilingual dictionaries to automatically infer new links (translations). We build upon a cycle density based method: partitioning the graph into biconnected components for a speedup, and simplifying the pipeline through a careful structural analysis that reduces hyperparameter tuning requirements. We also analyse the shortcomings of traditional evaluation metrics used for translation inference and propose to complement them with new ones, both-word precision (BWP) and both-word recall (BWR), aimed at being more informative of algorithmic improvements. On average over twenty-seven language pairs, our algorithm produces dictionaries about 70% the size of existing Apertium RDF dictionaries at a high BWP of 85% from scratch within a minute. Human evaluation shows that 78% of the additional translations generated for enrichment are correct as well. We further describe an interesting use-case: inferring synonyms within a single language, on which our initial human-based evaluation shows an average accuracy of 84%. We release our tool as free/open-source software which can not only be applied to RDF data and Apertium dictionaries, but is also easily usable for other formats and communities.

Keywords: Bilingual dictionaries, RDF, Apertium, Graph, Linguistic linked data, Evaluation methods, Polysemy

1. Introduction

Bilingual electronic dictionaries contain translations between collections of lexical entries in two different languages. They constitute very useful language resources, both for professionals (such as translators) and for language technologies (such as machine translation or the alignment of sentences in translated documents). Currently, we are witnessing an increase of such resources as linked data on the Web, owing to the adoption of linguistic linked data (LLD) techniques [1] by the language technologies community. That is the case of the Apertium RDF graph, built on the basis of the family of bilingual dictionaries of Apertium,1 a free/open-source platform [2]. A subset of 22 such dictionaries was initially converted into RDF (resource description framework),2 published as linked open data on the Web, and made available for access and querying in a way compliant with semantic web standards [3]. More recently, an updated version of the Apertium RDF graph has been released, covering 53 language pairs [4] (see Figure 1).

* Corresponding author. E-mail: [email protected].
1 http://apertium.org
2 http://www.w3.org/TR/rdf-primer


Publishing bilingual dictionaries, and linguistic data in general, as linked data on the Web has a number of advantages. First, the data are described by using standard mechanisms of the Semantic Web, such as RDF and OWL (web ontology language3), and use consensual ontologies, agreed by the community, for their conceptual representation (such as Ontolex-lemon4 [5] in the case of Apertium RDF). Further, the data can be accessed through standard languages and query means such as the SPARQL protocol and RDF query language,5 avoiding any dependence on proprietary application programming interfaces (APIs). This enhances the availability, the interoperability, and the usability of the linguistic information that uses such techniques [1]. In fact, every piece of lexical information (such as lexical entries, lexical senses, or translations) has its own URI, which identifies it at the web scale and makes it easier to be linked to, or to be reused by, any other dataset or semantic-aware system developed for any purpose. For instance, the Apertium data, initially intended for use in machine translation, has been successfully used, in its RDF version, for cross-lingual model transfer in the pharmaceutical domain [4]. In this application, an embeddings-based model for sentiment analysis in English was efficiently transferred into Spanish without retraining it from scratch, by injecting the EN-ES translations contained in Apertium RDF.

Fig. 1. Apertium RDF graph (figure taken from [4]), which covers 44 languages and 53 language pairs. The nodes represent monolingual lexicons and the edges the translation sets among them. Darker nodes correspond to more interconnected languages.

In this work, we explore a method to automatically expand a graph of bilingual translations by inferring new translations based on the already existing ones. The method is agnostic of the particular graph formalism used to represent the data, although we apply it to the particular case of Apertium RDF. We further show that automatic dictionary generation is actually far more effective than reported in existing literature by developing evaluation methods that are more reflective and indicative. The motivation is two-fold: to support the evolution and enrichment of the Apertium RDF graph with new, automatically obtained high-quality data, and to support the Apertium developers when building new bilingual dictionaries from scratch, as validating and adapting automatically predicted translations is easier for dictionary developers than writing new translation entries from scratch.

Imagine for instance that we have the following correspondences between nouns in a set of five bilingual dictionaries:

– book (English)–libro (Spanish),
– book (English)–llyfr (Welsh),
– libro (Spanish)–llibre (Catalan),
– llyfr (Welsh)–llibre (Catalan), and
– llibre (Catalan)–book (English).

In view of these correspondences, it would be very reasonable to propose the Welsh–Spanish entry llyfr (Welsh)–libro (Spanish).

Our method tries to discover such indirect translations and assigns a confidence score to them. Our work is grounded on the approach proposed by Villegas et al. [6], based on the identification of dense cycles in the graph. We improve and extend this work in several directions:

– We improve the cycle-based algorithm by exploiting properties of biconnected graphs, which reduces time response while discovering the same translations.
– We reduce the number of hyperparameters, and provide a rationale for previously empirical ones based on new insights, enhancing accuracy and coverage of results.
– We scale the experimental set-up to a much larger dataset, from two language pairs in the initial work (English–Spanish and English–French) up to 27 language pairs.

3 http://www.w3.org/TR/owl-primer
4 https://www.w3.org/2016/05/ontolex/
5 http://www.w3.org/TR/sparql11-protocol/

– We also measure the quality of predictions not found in the evaluation set through human evaluation and through validation with other available dictionaries such as MUSE.6

We also contribute to the general translation inference problem in the following aspects:

– We demonstrate weaknesses of evaluation metrics used in existing literature that underestimate progress in translation inference. To this effect, we introduce and discuss novel metrics which are representative of performance and provide insights for further improvement.
– We propose a novel use-case of the cycle-based method, the generation of synonyms.

As a result of this work, we release a modular and easily extensible software tool that can be used for practical dictionary generation (see Section 7). Currently, it supports RDF and the original Apertium format, but it can be easily extended to other communities and systems that need dictionary generation, simply by converting between their format and the internal tabulator-separated value (TSV) format.

The remainder of this paper is organised as follows: First, related work is summarised in Section 2. Then, Section 3 describes translation graphs, Section 4 describes the original cycle density algorithm [6], Section 5 describes the optimized version of the algorithm described in this work, Section 6 analyses the role and need of each hyperparameter, Section 7 describes the software implementation, Section 8 describes the experimental settings of the study, novel evaluation metrics are proposed in Section 9, and the results are presented and discussed in Section 10. Use cases are discussed in Section 11 and, finally, Section 12 presents some conclusions and future work.

2. Related work

The automatic generation of electronic bilingual dictionaries, based on existing ones, is not a new research topic. Several approaches have been proposed over the last few decades. In this section we give an overview of the main ones.

2.1. Pivot-based methods

The simplest approach is to assume that the translation relation is transitive. This method assumes the existence of two bilingual dictionaries, one containing translations from language L1 to another language L2, and another from L2 to L3. Language L2 acts as pivot language and a new set of translations from L1 to L3 is discovered through direct transitivity. For example, the translations perro to dog and dog to chien, contained in a Spanish–English and an English–French dictionary, respectively, would lead to discovering the translation perro (Spanish) to chien (French). This method, despite its simplicity, is still quite effective in a scenario in which human supervision is assumed. Actually, this is the basis of the cross option provided in the Apertium framework as part of the apertium-dixtools tool,7 which is used to cross two language pairs to generate a new language pair [7].

However, the translation relation is not always transitive, as polysemy in the pivot language could lead to wrong translations being inferred. For instance, the Spanish (L2) word muñeca is a translation of wrist in English (L1) in one sense ('where hand meets arm'), while it translates to poupée in French (L3) in another sense ('doll'). Clearly French poupée is not a good translation for English wrist, but simple transitive approaches will propose it.

In order to overcome this issue and identify incorrect translations when constructing bilingual dictionaries mediated by a third language, Tanaka and Umemura [8] proposed in 1994 a method called one-time inverse consultation (OTIC).
In short, the idea of the OTIC method is, for each word w in L1, to assign a score to each candidate translation ti of w in L3 based on the overlap of pivot translations in L2 shared by both w and ti.

OTIC was later adapted for different purposes, such as the creation of multilingual lexicons from bilingual lists of words [9]. While OTIC is intended for generic dictionaries, similar methods have been developed for the generation of domain-adapted bilingual dictionaries [10]. Other authors have enriched OTIC with the inclusion of semantic features, such as Bond and Ogura [11] for the creation of a Japanese–Malay dictionary. Saralegi et al. [12] studied how to use distributional semantics computed from comparable corpora to prune pivot-based translations.

6 https://github.com/facebookresearch/MUSE
7 https://wiki.apertium.org/wiki/Apertium-dixtools
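To make the scoring idea concrete, the following C++ sketch rates each candidate t in L3 by a Dice-style overlap between the pivot translations of the source word and those of the candidate. The dictionary representation and all identifiers are ours, for illustration only, and the score is a simplification in the spirit of OTIC rather than the exact Tanaka–Umemura formulation.

    #include <cstddef>
    #include <string>
    #include <unordered_map>
    #include <unordered_set>

    // A bilingual dictionary as a map from a word to its translations.
    using Dict = std::unordered_map<std::string, std::unordered_set<std::string>>;

    // Dice-style overlap score in the spirit of OTIC: a candidate t in L3,
    // reached from w (in L1) through the pivot L2, is scored by how many
    // pivot translations it shares with w.
    std::unordered_map<std::string, double>
    oticScores(const std::string& w, const Dict& l1ToPivot,
               const Dict& pivotToL3, const Dict& l3ToPivot) {
        std::unordered_map<std::string, double> scores;
        auto itW = l1ToPivot.find(w);
        if (itW == l1ToPivot.end()) return scores;
        const auto& pivotsOfW = itW->second;
        for (const auto& p : pivotsOfW) {              // each pivot word of w
            auto itP = pivotToL3.find(p);
            if (itP == pivotToL3.end()) continue;
            for (const auto& t : itP->second) {        // each candidate in L3
                if (scores.count(t)) continue;         // already scored
                auto itT = l3ToPivot.find(t);
                if (itT == l3ToPivot.end()) continue;
                std::size_t shared = 0;                // overlapping pivots
                for (const auto& q : itT->second)
                    if (pivotsOfW.count(q)) ++shared;
                scores[t] = 2.0 * shared /
                            (pivotsOfW.size() + itT->second.size());
            }
        }
        return scores;
    }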

2.2. Graph-based methods

The works referred to so far illustrate techniques that take into account the existence of (at least) two language pairs connected through a common pivot. However, when dictionaries can be connected in a richer way as part of a larger graph, other algorithms based on graph exploration may come into play. That is the case, for instance, of the CQC algorithm, developed by Flati and Navigli [13], which exploits the notion of cycles and quasi-cycles for the automated disambiguation of translations in a bilingual dictionary. Notice that, unlike our approach, this method is not intended for dictionary building but for dictionary validation. Another remarkable method based on graph exploration is the SenseUniformPaths algorithm proposed by Mausam et al. [14], which relies on probabilistic methods to infer lexical translations. SenseUniformPaths was used in the generation of PanDictionary [15], a massive translation graph built from 630 machine-readable dictionaries, which contain over 10 million words in different languages and 60 million translation pairs. The method applies probabilistic sense matching to infer lexical translations between two languages that do not share a translation dictionary. To that end, they define circuits (cycles with no repeated vertices), calculate scores for the different translation paths, and prune those circuits that contain nodes that exhibit undesirable behaviour, called correlated polysemy (see section 4.1).

The SenseUniformPaths algorithm served as the basis for the cycle density method proposed by Villegas et al. [6] that we explore and expand in this work. Both approaches coincide in the use of cycles to identify potential translation targets, but differ in that Villegas et al. use graph density to rate the confidence value. The cycle density algorithm does not need to identify ambiguous cycles, thus being computationally less expensive. Moreover, SenseUniformPaths exploits dictionary senses, as found in resources such as Wiktionary,8 while the cycle density method operates on dictionaries without such sense information, such as the Apertium bilingual dictionaries, and therefore solves a harder problem.

2.3. Comparison with OTIC

The OTIC algorithm continues to be a powerful method for translation inference and it has proven to be very effective even in comparison with more contemporary methods [16, 17]. However, OTIC needs a pivot language to operate, while cycle-based systems can discover translations between more distantly connected languages. For instance, this is the case of Sardinian (sc)–Italian (it) in the Apertium RDF graph (see Figure 1), for which OTIC cannot work. On the other hand, there can be less connected parts of the graph where cycles are difficult to find but a pivot language is still available, for instance, English (en)–Russian (ru) in Apertium RDF. Therefore, we consider OTIC and cycle density complementary approaches.

Cycle-based methods can be used for additional tasks such as synonym generation (see section 11.2), which OTIC is not able to address. Further, cycle-based procedures are more suitable for iterative dictionary enrichment, as the produced translations for a target language pair (L1 → L2), once manually validated and integrated into the ground-truth input sets, can help produce more cycles. However, for the OTIC algorithm, the pivot dictionaries would also have to be enriched to be able to produce further entries. Lastly, the richer Apertium RDF gets in terms of language-pair connections and number of translations, the more capable the cycle-based algorithm becomes, as it can harness the entire graph, not just information from pivot dictionaries.

2.4. Distributional semantics-based methods

Other methods have also been proposed that do not rely on graph exploration for bilingual lexicon induction but are based on distributional semantics.
Initial approaches were based on vector space models [18, 19] or on leveraging statistical similarities between two languages [20, 21]. More recent approaches that exploit distributional semantics rely on the inference and use of cross-lingual word embeddings. An initial contribution in that direction was made by Mikolov et al. [22]. Methods using word embeddings to infer new dictionary entries require an initial seed dictionary that is used to learn a linear transformation that maps the monolingual embeddings into a shared cross-lingual space. Then, the resulting cross-lingual embeddings are used to induce the translations of words that are missing in the seed dictionary [23].

Such ideas evolved and new embeddings-based methods appeared that did not need such initial training, such as the work by Lample et al. [24], who propose a method to develop a bilingual dictionary between two languages without the need of using parallel corpora.

8 http://wiktionary.org

The method needs large monolingual corpora for the source and target languages and leverages adversarial training to learn a linear mapping between the source and target spaces. The software implementing the method and the ground-truth bilingual dictionaries used to test it are publicly available.9 We use these ground-truth dictionaries as part of our evaluation (section 9.3).

Another remarkable method of this embeddings-based family of techniques was developed by Artetxe et al. [25]. In their work, instead of directly inducing a bilingual lexicon from cross-lingual embeddings as in [24], they use the embeddings to build a phrase-table, combine it with a language model, and use the resulting machine translation system to generate a synthetic parallel corpus, from which the bilingual lexicon is extracted using statistical word alignment techniques.

In contrast to graph-based methods, the embeddings-based approaches still need large corpora to operate; this limits their applicability for under-resourced languages. Further, in principle they operate at the word representation level, thus not taking into account other lexical information such as the part of speech, which graph-based methods can use in a straightforward way.

2.5. Systematic evaluation

The fact that inferring new translations is not a trivial problem has motivated the Translation Inference Across Dictionaries (TIAD)10 periodic shared task, as a coherent experimental framework to enable reliable comparison between methods and techniques for automatically generating new bilingual (and multilingual) dictionaries from existing ones. A number of works, based on graph exploration, word embeddings, parallel corpora, etc. have participated so far in the campaign to test their ideas, many of them preliminary and subject to continuous improvement [16, 17, 26].

3. Translation Graph

We define a translation graph as an undirected graph G(V, E), where V and E denote the set of vertices and edges respectively. Each vertex v ∈ V represents a dictionary lexical entry defined by the tuple ⟨rep, lang, pos⟩, where rep is the written representation of its canonical form, lang is the language, and pos is the part of speech (POS). An edge e(u, v) ∈ E indicates that the lexical entries u and v are connected through a translation relation, therefore sharing at least one lexical sense. For simplicity, we will use word to refer to a lexical entry in the remainder of the paper.

Note that G is initially populated using multiple bilingual dictionaries at once. Since the input translations (and hence graph edges) are between words of different languages (∀(u, v) ∈ E, u.lang ≠ v.lang), the graph is K-partite, where K is the number of distinct languages in the input data.

In particular, we have used the latest version (v2.1) of the Apertium RDF graph as our initial translation graph.11 It contains 44 languages and 53 language pairs, with a total number of translations |E| = 1,540,996 and words |V| = 1,750,917.

Our problem is now to enrich this initial graph G, that is, to infer edges that do not initially exist but define valid translations.
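These definitions map directly onto a small in-memory representation. The following C++ sketch is illustrative (the identifiers are ours, not the actual tool's API): words are ⟨rep, lang, pos⟩ tuples and translations are undirected adjacency lists.

    #include <map>
    #include <string>
    #include <tuple>
    #include <vector>

    // A lexical entry ("word"): written representation of the canonical
    // form, ISO-639-3 language code, and part of speech.
    struct Word {
        std::string rep, lang, pos;
        bool operator<(const Word& o) const {
            return std::tie(rep, lang, pos) < std::tie(o.rep, o.lang, o.pos);
        }
    };

    // Undirected translation graph; since every edge joins two different
    // languages, the graph is K-partite.
    struct TranslationGraph {
        std::map<Word, int> id;             // word -> vertex index
        std::vector<Word> words;            // vertex index -> word
        std::vector<std::vector<int>> adj;  // adjacency lists

        int vertex(const Word& w) {         // get or create the vertex of w
            auto [it, fresh] = id.try_emplace(w, static_cast<int>(words.size()));
            if (fresh) { words.push_back(w); adj.emplace_back(); }
            return it->second;
        }
        void addTranslation(const Word& u, const Word& v) {
            int a = vertex(u), b = vertex(v);
            adj[a].push_back(b);            // undirected: store both directions
            adj[b].push_back(a);
        }
    };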
4. The original cycle density method

In this section we describe in more detail the cycle density method developed by Villegas et al. [6], on which we base our work.

4.1. Cycles in the Translation Graph

As mentioned in Section 2, the original algorithm relies on simple cycles instead of transitive chains, following the idea of circuits introduced in [14]. A simple cycle is a sequence of vertices starting and ending in the same vertex with no repetitions of vertices and edges. For ease of explanation, we will refer to simple cycles as just cycles.

9 https://github.com/facebookresearch/MUSE
10 https://tiad2021.unizar.es/
11 The Apertium RDF data dumps, developed by Goethe University Frankfurt, are available in Zenodo through this URL: https://tinyurl.com/apertiumrdfv2. More details on the generation of Apertium RDF v2 can be found at [4]. A stable version of Apertium RDF v2 will be uploaded to http://linguistic.linkeddata.es/apertium/ and hosted by Universidad Politécnica de Madrid (UPM) as part of the Prêt-à-LLOD H2020 project.

For a sequence of words u1, u2, ..., un, u1 to be in a cycle while simultaneously containing a pair of words that are not a valid translation, there needs to be a stronger condition than polysemy, which we will call correlated polysemy. This means that two words in the cycle, say, vertices ui and uj (i < j without loss of generality), contain the same distinct set of possible senses. This may lead to ui, ui+1, ..., uj and uj, uj+1, ..., un, u1, ..., ui being paths along two different senses, but still completing a cycle. We want to ensure words from these differing sense-paths do not get considered valid translation pairs, but this is non-trivial considering the source data does not carry any sense information.

4.2. Confidence metric: Cycle Density

There can be multiple instances of correlated polysemy in a single cycle. In fact, considering that many language pairs in Apertium are linguistically close, this is not unexpected. Therefore, approaches based on exploiting sparsely inter-connected partitions of cycles may not be fruitful, as our experience has shown. Accordingly, we stick to using cycle density as proposed in [6] as a metric to avoid invalid predictions due to correlated polysemy.

The density of a subgraph G′(V′, E′) is defined here as 2|E′| / (|V′|(|V′|−1)), that is, as the ratio of the actual number of edges |E′| to the number of edges |V′|(|V′|−1)/2 that would be found in a fully connected graph (or clique). Cycle density is therefore the density of the subgraph induced by a cycle, that is, the subgraph made up of the vertices in the cycle and the edges between them.

A higher density implies that the nodes in a cycle are closer to forming a clique. Therefore, intuitively, for high-density cycles, completing the clique by predicting the pairs of words with no edge between them as possible translations becomes a useful strategy, one which we adopt. In cases with correlated polysemy, subsets of vertices corresponding to a different set of senses are unlikely to have any edges between them, leading to a lower cycle density.
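In code, computing this density reduces to counting the edges present among the cycle's vertices. A minimal C++ sketch, assuming an O(1) adjacency test is available for the translation graph (the callback and names are ours):

    #include <cstddef>
    #include <functional>
    #include <vector>

    // Density of the subgraph induced by a cycle: 2|E'| / (|V'|(|V'|-1)).
    // `hasEdge` answers adjacency queries on the translation graph, e.g.
    // backed by a hash set of vertex pairs.
    double cycleDensity(const std::vector<int>& cycle,
                        const std::function<bool(int, int)>& hasEdge) {
        const std::size_t n = cycle.size();
        std::size_t edges = 0;
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = i + 1; j < n; ++j)
                if (hasEdge(cycle[i], cycle[j])) ++edges;
        return 2.0 * edges / (static_cast<double>(n) * (n - 1));
    }

For a chordless 4-cycle this yields 4/6 = 2/3, a value that matters again in the discussion of minimum cycle lengths in Section 6.1.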
4.3. Original Algorithm

The algorithm outlined in [6] is as follows. For each word w, do the following:

1. Find the D-context of w, which is the subgraph GD(w)(VD(w), ED(w)) formed by vertices within a certain distance D from w; D is referred to as the context depth.
2. Find all cycles in GD(w) traversing w. Since there can be up to O(2^(|VD(w)|^2)) [27] such cycles, and even real translation graphs are not sparse enough to compute them all, limit the cycle length to Lmax.
3. The confidence score of a translation from w to a word u not directly connected to w in the input data is the density of the densest cycle containing both w and u.
4. Impose further constraints on possible translations to improve empirical results, such as:
(a) not allowing language repetition within a cycle, that is, one word per language only;
(b) ignoring cycles with size under a particular minimum length Dmin, which differs for small and large GD(w);
(c) requiring a lower confidence score for target nodes u that have a degree higher than 2 in the subgraph induced by the cycle.

5. Improving efficiency of the cycle density method

In order to allow for application scenarios in which time performance is an important feature, as well as to facilitate scalability to larger graphs, our first modifications to the cycle density method were aimed at making computation more efficient, without having any negative impact on the quality of the inferred translations.

Notice that the algorithm requires us to operate on cycles, and cycle-finding is essentially a bottleneck in terms of computation time, even after having established a bound on the maximum cycle length. Optimizing cycle-finding is essential for scaling to larger graphs, and also allows end users to make multiple runs with different hyperparameter settings, which can be important for achieving optimal results on their target language pair.

The original implementation of the cycle density method12 computes and stores all cycles up to length Lmax by trying all possible combinations of nodes. Then, it proceeds to filter these and calculate the metrics. This can clearly be improved, as will be described in the following section.

5.1. Precomputing Biconnected Components

A biconnected graph is one in which no vertex exists such that removing it disconnects the graph. A biconnected component is a maximal biconnected subgraph.

12 https://github.com/martavillegas/ApertiumRDF

Different biconnected components can only share cut vertices, that is, vertices whose removal renders the graph disconnected. The decomposition of a graph G(V, E) into all its biconnected components can be done through a well-known algorithm by Hopcroft and Tarjan, with a time complexity in O(V + E) [28]. Consider the following fundamental properties of a biconnected component:

Property 1: Consider any two vertices u, v in the same biconnected component. There must be at least one simple cycle with vertices only in the biconnected component containing both u and v.
Proof: There must be at least two vertex-disjoint (apart from u and v) paths from u to v. Combining these two vertex-disjoint paths would give a simple cycle, as the graph is undirected. Otherwise, if all paths from u to v within the component had a common vertex w, removing w would disconnect u and v, and this is not possible in a biconnected component by definition. □

Property 2: For every simple cycle C, there exists a biconnected component B such that ∀v ∈ V(C), v ∈ V(B), where V(G) denotes the set of vertices of a graph G.
Proof: A simple cycle is in itself a biconnected graph. It must therefore be a subgraph of some biconnected component. □

Table 1
For our lexicon-wide experiments (see section 8), splitting into biconnected components reduces computation time by about a third on average compared to not splitting. Note that the improvement is partially shadowed by the cycle finding algorithm we use (see 5.3), which also implicitly exploits the locality of the graph.

Data set      Computed on     Time range (sec)
Development   biconnected     10–15
              entire graph    17–23
Large         biconnected     22–57
              entire graph    33–80

Table 2
Size of the entire graph (G) and largest biconnected component (Bmax) when using the 27 language-pair large set (see section 8). Ignoring just 5,278 cross-POS translations almost halves the size of the largest biconnected component (a bottleneck in our procedure).

Case              |V(G)|    |E(G)|    |V(Bmax)|   |E(Bmax)|
All translations  704,284   774,986   84,088      188,116
No cross-POS      704,284   769,708   44,542      99,081

Table 3
Distribution of biconnected component size in the main graph containing 27 language pairs.

Size   4–5     6–10    11–20   21–100   >100
#      7,193   5,520   1,847   262      7
Property 2 tells us that we can run the algorithm separately for each biconnected component without missing any cycle, and consequently any translation. Property 1 shows us that biconnected components are in some sense the minimal units, such that decomposing them further would make us lose information. Finding all cycles takes a number of operations exponential in the number of edges, so by splitting a graph into smaller subgraphs, we require much less computation. Mathematically, this follows from Σi (xi)^k ≤ (Σi xi)^k (for xi > 0 and k > 1), which comes from the multinomial theorem.

The partition into biconnected components can be particularly useful when searching for translations of specific words instead of the entire vocabulary in one go. A lookup table file mapping words to biconnected components can be maintained. This helps load only the biconnected components the word belongs to, which are sufficient for our method. This cuts runtime and memory requirements during execution.
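For reference, the decomposition itself can be sketched as follows, a recursive rendering of the Hopcroft–Tarjan idea over adjacency lists, assuming a simple graph; the identifiers are ours, and a production implementation on large components may prefer an explicit stack.

    #include <algorithm>
    #include <stack>
    #include <utility>
    #include <vector>

    // Hopcroft-Tarjan decomposition into biconnected components in O(V+E):
    // DFS keeps tree and back edges on a stack and pops one component
    // whenever low[child] >= disc[u] signals an articulation condition.
    struct Biconnected {
        const std::vector<std::vector<int>>& adj;
        std::vector<int> disc, low;
        std::stack<std::pair<int, int>> edges;
        std::vector<std::vector<std::pair<int, int>>> components;
        int timer = 0;

        explicit Biconnected(const std::vector<std::vector<int>>& g)
            : adj(g), disc(g.size(), -1), low(g.size(), -1) {
            for (int v = 0; v < static_cast<int>(g.size()); ++v)
                if (disc[v] == -1) dfs(v, -1);
        }

        void dfs(int u, int parent) {
            disc[u] = low[u] = timer++;
            for (int v : adj[u]) {
                if (v == parent) continue;          // simple graph assumed
                if (disc[v] == -1) {                // tree edge
                    edges.push({u, v});
                    dfs(v, u);
                    low[u] = std::min(low[u], low[v]);
                    if (low[v] >= disc[u]) popComponent(u, v);
                } else if (disc[v] < disc[u]) {     // back edge
                    edges.push({u, v});
                    low[u] = std::min(low[u], disc[v]);
                }
            }
        }

        void popComponent(int u, int v) {           // edges down to (u, v)
            components.emplace_back();
            std::pair<int, int> e;
            do {
                e = edges.top(); edges.pop();
                components.back().push_back(e);
            } while (e != std::make_pair(u, v));
        }
    };

Each word can then be mapped to the identifiers of the components it appears in, yielding the lookup table mentioned above.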

5.2. Filtering cross-part-of-speech translations

Although it is not frequent, there might be cases in RDF dictionaries in which translations connect words with different POS. As shown in Table 2, by removing such cases, large biconnected components containing multiple POS can be further split into smaller ones with just one POS. Our method takes that step in order to reduce run-time and to prevent potentially spurious inferred translations.

In Table 3 we present the distribution of the number of biconnected components by size on our large set of 27 language pairs (see section 8) after discarding the cross-POS translations. While clearly most biconnected components are small, some are in fact quite large, which is evident from the size of the largest one as shown in Table 2.

There are situations, however, where users might need to keep such cross-POS translations. For instance, in Apertium one can find a translation of the English noun computer into the Spanish adjective informático. This allows for rules, in a machine translation system, that may translate noun modifiers such as computer engineer as adjectives to get fluent translations such as ingeniero informático by looking up these cross-POS entries in the dictionary. This is especially important when the target languages do not have the same set of POS, requiring changes in POS during translation. Therefore, any implementation of the method should make this filter optional.

5.3. Backtracking Search per Source Word

We iterate over all words as the source word and repeat the following procedure for finding possible target words: First, we find and store the context graph of depth D for word w, GD(w), using breadth-first search. Then, we launch a backtracking search maintaining a stack to find relevant cycles, using an algorithm originally described in [29]. While more efficient algorithms for generating cycles exist [30], the chosen one has a good balance between ease of modification and time response (see Section 10).
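The following C++ sketch is our simplified rendering of this search, not the exact algorithm of [29]: a depth-bounded BFS delimits GD(w), and a backtracking DFS reports every simple cycle through w up to length Lmax to a callback. As written it reports each cycle once per traversal direction, so a real pipeline would deduplicate.

    #include <functional>
    #include <queue>
    #include <vector>

    // Enumerate simple cycles through w of length <= Lmax, restricted to
    // the context graph G_D(w) computed by a depth-bounded BFS.
    void cyclesThroughWord(const std::vector<std::vector<int>>& adj, int w,
                           int D, int Lmax,
                           const std::function<void(const std::vector<int>&)>& report) {
        const int n = static_cast<int>(adj.size());
        std::vector<int> dist(n, -1);            // -1: outside the context
        std::queue<int> q;
        dist[w] = 0; q.push(w);
        while (!q.empty()) {                     // BFS up to depth D
            int u = q.front(); q.pop();
            if (dist[u] == D) continue;
            for (int v : adj[u])
                if (dist[v] == -1) { dist[v] = dist[u] + 1; q.push(v); }
        }
        std::vector<int> path{w};                // current backtracking path
        std::vector<char> onPath(n, 0);
        onPath[w] = 1;
        std::function<void(int)> dfs = [&](int u) {
            for (int v : adj[u]) {
                if (v == w && path.size() >= 3)  // edge back to w: a cycle
                    report(path);
                if (dist[v] == -1 || onPath[v] ||
                    static_cast<int>(path.size()) >= Lmax)
                    continue;
                path.push_back(v); onPath[v] = 1;
                dfs(v);
                path.pop_back(); onPath[v] = 0;
            }
        };
        dfs(w);
    }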
6. Analysis of hyperparameters

The heuristic indicators used in the original cycle density method were derived empirically by studying small parts of the RDF graph. Scaling to larger, vocabulary-wide experiments requires generalizing these heuristics to tunable hyperparameters. To that end, a deeper understanding of such hyperparameters is needed. Ideally, language pair developers should be able to iteratively run the algorithm, verify and include the correct translations, and then repeat the process on the updated, richer graph, getting new entries each time. This necessitates reducing the 8 distinct indicators proposed originally [6] and limiting the combinations that need to be tried to obtain optimal results. Through our exploratory analysis of the hyperparameters, which we describe in this section, we reduce both runtime and human effort.

6.1. Removed hyperparameters

After a careful analysis, we found that a number of heuristics and hyperparameters proposed in [6] were not leading to better results, therefore we propose to discard them:

Language repetition: The intuition for removing cycles with language repetition, as in [6], is that two words in the same language are likely to not have the exact same set of semantic senses. Example: pupil in English can mean the aperture of the iris, but will also come up in cycles with student. This can become a breeding ground for correlated polysemy, leading to incorrect translations. But allowing language repetition is necessary for the use-case of generating synonyms (see Section 11.2), which would necessitate making this optional by adding a binary hyperparameter. However, we find that not allowing repetition reduces recall significantly, and the same higher precision can instead be achieved by increasing the confidence threshold, for a lesser loss in recall. Handling correlated polysemy is exactly what measuring cycle density is aimed at, so it is a natural way to mitigate this issue. We therefore proceed to remove this hyperparameter.

Minimum cycle length: Notice that cycle density as defined in Section 4.2 favors shorter cycles, as the denominator grows quadratically with cycle length. In particular, a 4-cycle without any other edges among its vertices would have a density of 4 / (4(4−1)/2) = 2/3, which would mean a good chance of being selected unless the threshold is set very high. So, essentially, being part of any 4-cycle (which can easily be two correlated polysemous senses) ensures that a pair of words becomes a predicted translation. This is why the original algorithm [6] requires a minimum cycle length. In fact, it requires different minimum lengths for a large context and a small context, because in relatively sparse areas of the graph many cycles might be small and yet correct. This leads to having three hyperparameters that require tuning: the minimum length in small contexts, the minimum length in large contexts, and the threshold defining when a context becomes large.

However, we found that eliminating all three hyperparameters gives similar results on average, while simplifying the pipeline significantly. Allowing small cycles allows more translations to be produced, as ignoring them completely would also leave out the probably valid translations we could predict using the dense small cycles.

6.2. Used hyperparameters

Having eliminated four hyperparameters of the original cycle density method, we now justify why we need the remaining ones and find the most suitable range of values for them.

Maximum cycle length (Lmax): The cycle length was originally bounded above to reduce computation time. Our large-scale experiments reveal that increasing the cycle length beyond a certain threshold does not even help much, for two main reasons:

1. With increasing distance13 from the source vertex, the chances of sense shifting increase. Thus, larger cycles have a higher chance of being polysemous.
2. Induced graphs of large cycles will probably have as subgraphs smaller cycles with higher density. These smaller cycles get selected as the ones with the highest density for most pairs of words. This means allowing larger cycles does not lead to the generation of many new translations.

The diminishing returns upon increasing context depth and maximum cycle length (see Table 6) also tell us that, while the cycle-density procedure in the worst case is exponential in the square of the number of vertices, computing for small context subgraphs and considering only cycles of a small length suffices, which is tractable.

Context Depth (D): Similarly to the cycle length, the original motivation for limiting context depth was to reduce computation time. We find that increasing it beyond a certain threshold is not helpful because of the following property:

Property 3: It is lossless to keep D ≤ int(Lmax/2).
Proof: If the cycle length is limited to Lmax, the maximum distance between any two vertices that share a common cycle can be int(Lmax/2). Therefore, there will be no cycle containing both the source vertex and potential target vertices at more than int(Lmax/2) distance. Therefore, having D > int(Lmax/2) leads to no new predictions. □

This effectively sets an upper bound on the context depth based on the maximum cycle length. We recommend using int(Lmax/2) as the value throughout experiments, as lowering it would lead to loss of information.

13 Defined here as the length of the shortest path between two vertices.
Target degree multiplier (M): Here, by the degree of a vertex we refer to the number of edges from the vertex within the subgraph induced by the cycle. Originally, in [6], a fixed lower confidence threshold (density) of 0.5 instead of 0.7 was required for cycles if the target word had degree > 2. We model this as a tunable hyperparameter: the multiplier applied to the density of the given cycle when the degree of the target is > 2. This allows its effect to scale relatively when changing the density thresholds used for final translation selection. But why is this multiplier important in the first place? If the target word has only degree 2, it means that, apart from the two edges on its vertex in the cycle itself, the target word is not connected to other words in the cycle. This points to a chance of the target word not sharing a sense with a large part of the cycle (if it does, it can still be detected through a different cycle).

Transitivity (T): While an essential premise of a cycle-based approach is that transitivity does not always hold for translations due to polysemy, we realized that it can still perform well for non-polysemous categories of words.14 Specifically, we identify proper nouns and numerals as being largely non-polysemous. Both categories are part of the LexInfo ontology,15 a catalogue of grammar categories used by Apertium RDF. We model transitivity as a ternary hyperparameter:

– T = 0: The default cycle density metric (no transitivity).
– T = 1: Transitive closure within biconnected components.16
– T = 2: Transitive closure up to the distance D (context depth).

Notice that it is only required to tune the context depth for T = 2, as the other hyperparameters are irrelevant with that method.

Confidence threshold (C): For a given pair of words u, v, let Su,v be the set of all cycles that follow the constraints set by the first two hyperparameters, D and Lmax. For a cycle c ∈ Su,v, let dc be the density of its induced subgraph. Let Mc be the target degree multiplier M if the condition for target degree multiplication is satisfied, and 1 otherwise.

14 This is partly why we allow different hyperparameter settings for different POS.
15 http://www.lexinfo.net/ontology/2.0/lexinfo
16 Not used in this paper; it exists for ongoing exploration towards future work.

We define the confidence score of a predicted translation between u and v as:

max_{c ∈ Su,v} min(Mc·dc, 1)

This effectively keeps the confidence score within the range [0, 1].

Note that we stick to only the cycle with the highest confidence. We prefer this over metrics that use aggregate statistics over all the cycles containing the two words, such as the number of cycles or their average density. These are more likely to be sensitive to some subgraphs or contexts being richer in the number of translations added during creation, leading to superfluously better aggregates than others. The maximum is more robust to these differences, as only one dense cycle is then sufficient.

For translations produced using transitivity (T = 1 and T = 2) instead of the cycle density method, we assign a full confidence score of 1. This allows combining these methods so that they can be used for particular POS where effective, leading to the generation of a unified set of possible translations with their confidence scores. We expect the user to use transitivity only in cases where they are sure it is highly precise, considering the full confidence assigned.

This gives us the final choice the user makes, the confidence threshold C. Generated translations above this confidence threshold are all included in the final prediction set of the algorithm.

To conclude, we have a set of just four simple hyperparameters: context depth D, maximum cycle length Lmax, target degree multiplier M, and transitivity T, with guidelines for their tuning based on the insights described above. This produces a set of possible translations along with their confidence scores, from which the user can select those above a certain confidence threshold C, based on whether they want a large set or a highly precise one.

We have shown that a fixed value for context depth relative to maximum cycle length is optimal in most cases. Moreover, the procedure is not very sensitive to the target degree multiplier within a reasonable range of [1, 1.5].17 From our tests, only proper nouns and numerals benefit from transitivity. So when generating bilingual dictionaries, only the maximum cycle length needs to be tuned in most cases, followed by trying different confidence thresholds. This simplifies the experience for end users, who may have to find values preferable to them for their input datasets and target language pair.
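As a sketch, this scoring rule is a single pass over the per-cycle records kept for a candidate pair (the records correspond to the Metrics vectors of Section 7.3; field and function names are ours):

    #include <algorithm>
    #include <vector>

    // Per-cycle indicators stored for a candidate pair (u, v).
    struct CycleStats {
        double density;    // d_c, density of the cycle's induced subgraph
        int targetDegree;  // degree of the target inside that subgraph
    };

    // Confidence of the pair: the best cycle wins, with the multiplier M
    // applied when the target's degree in the induced subgraph exceeds 2.
    double confidence(const std::vector<CycleStats>& cycles, double M) {
        double best = 0.0;
        for (const CycleStats& c : cycles) {
            double m = (c.targetDegree > 2) ? M : 1.0;
            best = std::max(best, std::min(m * c.density, 1.0));
        }
        return best;   // within [0, 1]; later compared against threshold C
    }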
7. Implementation

We implemented the algorithmic pipeline in C++. It has been strategically decoupled into multiple stages, as we will briefly describe below. Detailed instructions for usage are provided in our GitHub repository.18

7.1. Input

A simple format has been defined to represent the data of an input translation graph G(V, E). For every translation e(s, t) ∈ E we represent the data in TSV format as:

(s.rep, s.pos, s.lang, t.rep, t.pos, t.lang)

Throughout our pipeline, wherever a language has to be specified, it is done with ISO-639-3 codes. We wrote parsers that query RDF graphs for the required data using SPARQL, or directly use Apertium bilingual dictionaries in their XML format, converting them to our TSV format. Using such an intermediate TSV format allows the system to be easily adapted to different data formats, if needed.

7.2. Configuration files

We provide a command-line interface (CLI) to run the algorithm. The path to a folder containing the input dictionaries for all language pairs needs to be passed. The language pairs to be output in a single run, as well as those provided as input, need to be specified in a language configuration file. It is also possible to specify no fixed target language pair, producing predictions across all input language pairs in one go. Translations for specific individual words can be generated instead by providing a different kind of configuration file and using the word option instead of the language option. A separate configuration file can be used to specify the hyperparameters, with the possibility of using different ones for different POS.

17 We still keep it as a hyperparameter as it can make the confidence score more reflective of the similarity, which might be important for some use-cases.
18 https://github.com/shash42/ApertiumBidixGen
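For illustration, two made-up English–Spanish entries in the TSV format of Section 7.1 would look as follows (tab-separated fields, ISO-639-3 language codes; the entries are invented, not taken from the actual data):

    book	noun	eng	libro	noun	spa
    read	verb	eng	leer	verb	spa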

7.3. Generating translations

To infer new translations, the system first obtains candidate translations along with their confidence scores and then filters out all those translations below a certain threshold (C). These two steps are decoupled and are called by specifying the relevant options on the CLI. This allows trying different confidence thresholds without having to rerun the entire cycle finding and scoring procedure.

Once the algorithm is run, it finds cycles for every source word s and stores a matrix containing Metrics vectors for each pair of words (s, t). Each vector component corresponds to one valid cycle where s and t were present, and can store any variables required as indicators. This allows for easy experimentation with potentially different metrics. For the purpose of this paper, we need to store the density of the cycle and the degree of the target word in its context graph.

7.4. Post-processing

The generated predictions can be brought back into the target format (RDF or Apertium bidix). Of particular interest is the Compare class, which provides detailed metrics given the generated predictions, the original data used, and the evaluation data. These are further described in Section 9.2.
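The decoupling means the expensive scoring stage runs once, while the cheap filtering stage can be re-run for any C. A minimal sketch with illustrative types:

    #include <string>
    #include <vector>

    // A scored candidate translation, as produced by the scoring stage.
    struct Prediction {
        std::string sourceRep, targetRep;  // word forms
        double confidence;                 // score in [0, 1]
    };

    // Filtering stage: keep candidates at or above the threshold C. It can
    // be re-run for several C values without repeating cycle finding.
    std::vector<Prediction> filterByThreshold(
            const std::vector<Prediction>& candidates, double C) {
        std::vector<Prediction> kept;
        for (const Prediction& p : candidates)
            if (p.confidence >= C) kept.push_back(p);
        return kept;
    }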
8. Experimental setting

There are two main possible usage scenarios for the cycle density method: (i) generation of a new bilingual dictionary from scratch for a new language pair, and (ii) enrichment of an already existing language pair. In order to measure the effectiveness of the method in the first case, we remove one already existing language pair (dictionary) from the graph, try to re-create it with the algorithm, and then compare the result with the removed dictionary. In the second case, entirely new translations are created, which are more difficult to evaluate automatically. Nevertheless, to provide a quality indication, we propose the use of external bilingual dictionaries that are not part of the Apertium graph, as well as a human evaluation.

8.1. Datasets

For initial experimentation with metrics and hyperparameter tuning, we picked a development set of 11 language pairs across 6 languages, including English, Catalan, Spanish, French, and Occitan (see Figure 2). We leave out each language pair and use the remaining 10 to re-generate it. This allows us to measure average metrics across the 11 language pairs.

Fig. 2. Fragment of the Apertium RDF graph taken as development set, containing 6 languages and 11 language pairs.

Since one of our goals in this research is to scale up the initial cycle density algorithm, it is important to validate that the method performs well even when more language pairs come into play. To that end, we define what we call the large evaluation set, by taking all 27 language pairs across the 13 languages which constitute the largest biconnected component in the RDF language graph (see Figure 3).19 These languages are: Aragonese, Basque, Catalan, English, Esperanto, French, Galician, Italian, Occitan, Portuguese, Romanian, Sardinian and Spanish.

Fig. 3. Largest biconnected component in the Apertium RDF graph. This contains 13 languages and 27 language pairs.

We use the same leaving-one-pair-out experiment as for the development set in this new setting, reporting average metrics across the 27 language pairs.

19 Notice that, although loosely related to the notion of training, development and test sets in machine learning, the purpose of creating our development and large sets is slightly different, being targeted towards confirming scalability. The hyperparameters will ultimately be manually tuned by the users to adapt the system to their particular necessities.

In general, if a user wants to predict translations for a target language pair (L1, L2), using all language pairs in the biconnected component containing both L1 and L2 is a good strategy. However, if L1 and L2 do not belong to a common biconnected component, then pivot-based methods like OTIC should be preferred over cycle-based procedures. This is because fewer cycles between words in L1 and L2 will be found, and those that are found will rely strongly on having multiple words from L1, L2 or intermediate languages. As discussed earlier, this can lead to higher polysemy and hence lower quality predictions.

8.2. Hyperparameter settings

After experimenting with the development set, the final hyperparameter setting, used across the rest of the evaluation, is the following (unless otherwise specified): transitive closure (T = 2) with D = 4 for proper nouns and numerals; and T = 0, D = 3, Lmax = 6, M = 1.4, C = 0.5 for other POS. See Section 10.1 for more details on the hyperparameter selection process.

8.3. Baseline

As mentioned earlier, one of the main usage scenarios of the cycle density algorithm is the generation of new bilingual dictionaries for the Apertium framework. The idea is to substitute the cross option provided as part of apertium-dixtools in Apertium (see Section 2), which relies on transitive inference [7]. Therefore, we introduce in our experiments an explicit comparison with a method that is purely based on transitivity, which we emulate by running our system with two sets of hyperparameters: T = 2 and D = 2 (baseline 1), and T = 2 and D = 4 (baseline 2). The latter is expected to produce a higher number of candidate translations since it will traverse more vertices in the graph.

8.4. External dictionaries

In order to validate the enrichment of translations in already existent language pairs, we use the MUSE ground-truth dictionaries20 as an additional corpus. MUSE dictionaries, while rich in nouns, verbs, adjectives, adverbs and numerals, have almost no proper nouns. We thus stick to translations in these five POS categories for our experiments. There are five MUSE dictionaries whose language pairs are common to our large set: English–Catalan, English–Spanish, French–Spanish, Spanish–Italian and Spanish–Portuguese, which we use for our experiments.

20 https://github.com/facebookresearch/MUSE
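For reference, the setting of Section 8.2 and the two baselines of Section 8.3 can be written down as a small selection helper; this is a sketch with illustrative POS labels, not the tool's actual configuration format:

    #include <string>

    // Hyperparameters of Section 6: transitivity T, context depth D,
    // maximum cycle length Lmax, degree multiplier M, threshold C.
    struct Hyperparams { int T, D, Lmax; double M, C; };

    // Final evaluation setting (Section 8.2). For T = 2 (transitive
    // closure up to depth D) the remaining fields are unused.
    Hyperparams settingFor(const std::string& pos) {
        if (pos == "properNoun" || pos == "numeral")
            return {2, 4, 0, 0.0, 0.0};
        return {0, 3, 6, 1.4, 0.5};   // cycle density for all other POS
    }

    // Purely transitive baselines (Section 8.3).
    const Hyperparams kBaseline1{2, 2, 0, 0.0, 0.0};
    const Hyperparams kBaseline2{2, 4, 0, 0.0, 0.0};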
9. Evaluation

When evaluating procedures aimed at generation, a fundamental problem arises when an exhaustive ground-truth set is not available. Having an exhaustive set of translations turns out to be particularly hard because languages keep evolving: today's complete set will become incomplete soon, as the vocabulary of the languages grows. Moreover, many languages are low-resourced, and obtaining sets with high coverage is itself a difficult task.

While the in-production Apertium language pairs that are used in RDF are large enough to be useful for a machine translation engine, their coverage can clearly be increased. In the forthcoming discussion, we assume that while available evaluation sets may reasonably be considered to have 100% precision (ground truth),21 they do not carry all possible valid translations. Moreover, it is even difficult to estimate the total number of valid translations.

9.1. Notation

We begin by defining some notation we will use throughout the following sections.

– I denotes the input translation graph, which could include multiple bilingual dictionaries.
– P denotes the translation graph containing the predictions (output) of the algorithm. In our experiments, we generate a single language pair, and hence P can be considered the graph of the predicted target language pair.
– L1 and L2 denote the languages in the target language pair.
– T denotes the translation graph of the chosen test dictionary for evaluation. This is the left-out RDF dictionary for the pair L1–L2.
– A denotes the hypothetical complete set of valid translations for the language pair L1–L2. As mentioned before, neither A nor |A| is known.
– V(X) (resp. E(X)) denotes the set of vertices (resp. edges) of a translation graph X.
– v1(e) (resp. v2(e)) denotes the vertex of edge e belonging to language L1 (resp. L2).

21 Apertium dictionaries (and hence the derived RDF dictionaries) do have some translations that may not be considered "correct", so they are not 100% precise as assumed, but such errors are limited in number and do not affect evaluation significantly.

– BWA(B), where A and B are translation graphs, denotes those translations (edges) in B both of whose words exist in A.

9.2. Metrics for automatic evaluation

The unavailability of a complete test dictionary makes it difficult to measure traditional metrics like precision (|E(P) ∩ E(T)| / |E(P)|) and recall (|E(P) ∩ E(T)| / |E(T)|). One might ask: is a translation produced by the algorithm that is not found in the test dictionary incorrect, or is it correct but missing from the test dictionary? Should the algorithm really be penalized for one of its main tasks, namely dictionary enrichment, generating valid translations that humans might have overlooked? The unequivocal answer to that should be no, and yet precision and recall as defined here (which we will call vanilla precision and recall in the remainder of this paper) do penalize the algorithm for this. For instance, an algorithm which covers, say, 90 of 100 test dictionary translations would end up being considered better (precision 0.9, recall 0.9) than one producing not only those 90, but 30 additional correct ones not in the test dictionary (precision: 0.75, recall: 0.9). In fact, the precision only goes down with additional translations (which are all considered wrong in the vanilla precision formula). Clearly, using these vanilla metrics would move our algorithms away from the goal of enrichment.

9.2.1. Both-Word Precision (BWP)

It is clear that precision computed using T is a lower bound on the actual precision of the algorithm, as T ⊂ A. But this lower bound can be arbitrarily loose, depending on both |T| and how close the distributions of translations in T and P are. We have no strategic incentive to make P imitate T, as translations in T, even if chosen to be the most frequently used ones, may still be insufficient to obtain good text coverage, which could improve with translations from A \ T. Therefore, can we come up with a metric more independent of T?

A predicted translation e ∈ E(P) which is not in E(T) can be classified as belonging to one of four categories:

1. v1(e) ∈ V(T) and v2(e) ∈ V(T), that is, both words are in the test dictionary.
2. v1(e) ∈ V(T) and v2(e) ∉ V(T), that is, the word in L1 is in the evaluation set, but its proposed translation in L2 is not.
3. v1(e) ∉ V(T) and v2(e) ∈ V(T), that is, the word in L2 is in the evaluation set, but its proposed translation in L1 is not.
4. v1(e) ∉ V(T) and v2(e) ∉ V(T), that is, neither word is in the evaluation set.
Categories 2–4 point to a clear insufficiency in the test dictionary itself. Our algorithm cannot come up with words on its own, as any word in the output must belong to some input dictionary of that language (that is, ∀v ∈ V(P), v ∈ V(I)), and hence the word is a valid part of that language.

Therefore, we propose measuring precision only among those predicted translations both of whose words are in the test dictionary. We define this as both-word precision, BWP = |BW_T(P) ∩ E(T)| / |BW_T(P)|.^22 While it is theoretically possible that, for some combinations of input and test dictionaries, the distribution of inaccuracies in BW_T(P) may not accurately represent the entire set of predictions, it is a necessary approximation.

Note that if the test dictionary contains both words but no edge between them, this could still be due to an oversight by its creators: the computed BWP is again a lower bound for the actual BWP, and using a larger test dictionary T′ such that T ⊂ T′ would never decrease BWP, and might increase it. However, this bound is definitely tighter than the one given by vanilla precision, as now at least the test dictionary is aware of the existence of these words, and perhaps its creators deliberately left the translation out because it is wrong. One could assume that this more than counteracts any "over-estimation" caused by differing distributions of BW_T(P) and E(P). To conclude, we propose BWP as a heuristic indicator of the actual precision, assuming that

|BW_T(P) ∩ E(T)| / |BW_T(P)| ≈ |E(P) ∩ E(A)| / |E(P)|.

This clearly holds as an equality in the limiting case T = A, since then BW_A(P) = E(P) because A, being exhaustive, contains all words.

^22 See Figure 6 and the corresponding discussion for a visual understanding of this metric and the above categories.
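Reusing the Edge encoding above, vanilla precision and BWP can be sketched as follows (an illustration of the definitions, not the released tool's implementation):

    def vanilla_precision(P: Set[Edge], T: Set[Edge]) -> float:
        """|E(P) ∩ E(T)| / |E(P)|: every enrichment outside T counts as wrong."""
        return len(P & T) / len(P) if P else 0.0

    def both_word_precision(P: Set[Edge], T: Set[Edge]) -> float:
        """BWP = |BW_T(P) ∩ E(T)| / |BW_T(P)|: precision restricted to the
        predicted edges both of whose words occur somewhere in T."""
        v1_T = {t[0] for t in T}
        v2_T = {t[1] for t in T}
        bw = {e for e in P if e[0] in v1_T and e[1] in v2_T}
        return len(bw & T) / len(bw) if bw else 0.0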

9.2.2. Both-Word Recall (BWR)
Achieving a high precision often entails that the prediction set produced has less coverage and hence low recall. However, this could be due to insufficient input data, or to a misalignment between the distributions of the input set and the test dictionary. It can thus be useful to normalize by how much of the test dictionary could possibly be generated from a given input set by a hypothetically perfect algorithm. Specifically, if for e ∈ T we have v1(e) ∉ V(I) or v2(e) ∉ V(I), there is no way any algorithm could predict e as a valid translation, as it would be completely unaware of the existence of at least one of those two words.

It is thus useful to limit our evaluation set to those translations for which both words are in the input set. Therefore, we define both-word recall as BWR = |BW_I(T) ∩ E(P)| / |BW_I(T)| and use it as an indicator. An auxiliary benefit of this metric is that we can now gain insights into how performance can be improved further, by comparing it to traditional recall: a recall significantly lower than BWR signals that the input data is inadequate to cover the test dictionary, while a low BWR itself signals that the algorithm is too conservative in predicting translations. In this way, we decouple input data insufficiency from algorithmic incapability.

Moreover, in our experiments I consists of other RDF dictionaries. These dictionaries are likely to contain the frequently used words, and evaluation set translations which do not have the corresponding words in the input graphs are probably among the less frequent, more specific words. Thus, BWR can in some tasks be a measure of how much of the important part of the left-out language pair we cover.

We would like to note that the BWR metric, while good at evaluating different modifications of the same algorithmic pipeline (such as different hyperparameter settings), should be used carefully when comparing two methods that use different input datasets. This is because BW_I(T) is by definition a function of I, the input data. BWR can directly only be used to compare the data-effectiveness of the algorithms, i.e. how large a prediction set they produce given the input provided.
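A sketch of BWR in the same style; the input graph I is reduced here to its two per-language vocabularies, which is all the definition needs:

    def both_word_recall(P: Set[Edge], T: Set[Edge],
                         vocab1_I: Set[str], vocab2_I: Set[str]) -> float:
        """BWR = |BW_I(T) ∩ E(P)| / |BW_I(T)|: recall restricted to the
        test edges that a perfect algorithm could conceivably predict,
        i.e. those whose both words already occur in the input data."""
        reachable = {e for e in T
                     if e[0] in vocab1_I and e[1] in vocab2_I}
        return len(reachable & P) / len(reachable) if reachable else 0.0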
9.3. Measuring dictionary enrichment with external data

While external corpora can be used for evaluating additional translations, they often have significantly different properties from the input and target sets, which can lead to erroneous conclusions due to noisy results. However, evaluating against additional data is particularly important in the context of this paper, for the following reasons:

– It is a direct accuracy indicator on the task of enrichment.
– It can also verify how tight the lower bounds provided by BWP are, as well as whether BWP is a good substitute for precision.
– For the case of recall, while computing it against completely new data is generally not a good idea, we can verify that the BWR metric is much more robust to dataset distribution differences than vanilla recall by computing both on this new test set.

We use the MUSE dictionaries (see Section 8.4) to evaluate the quality of additional translations.

9.4. Measuring dictionary enrichment with human assessment

As a final sanity check of our additional entries, we took random samples of 150 predicted dictionary entries that were not present in the test set T. These belonged to the open POS categories (nouns, adjectives, adverbs, verbs and proper nouns) and also to numerals. This was done directly using Apertium bilingual dictionaries as input data, with a rather liberal (recall-oriented) set of hyperparameters (T = 0, D = 4, Lmax = 8, M = 1.4, C = 0.1), for five different language pairs (Spanish–English, Spanish–Catalan, French–Occitan, Esperanto–English, French–Catalan). We asked bilingual experts to evaluate them, obtaining a sort of gold standard for each language pair.^23

As these sets were produced with a different version of the data and different hyperparameters, we took the intersection between these sets and the additional translations produced by the setting that needs to be evaluated. While this is not directly a random sample of the additional translations, it is a good approximation.

^23 We had 3 evaluators for the first 3 language pairs, so we consolidated their differences using a simple majority vote to produce the gold standard. For the last two we had a single evaluator, whose results became the gold standard.
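The consolidation described in footnote 23 is a plain majority vote; a minimal sketch, assuming per-entry boolean judgements (True = correct translation):

    def consolidate(judgements) -> bool:
        """Majority vote over an odd number of annotator judgements,
        e.g. [True, True, False] -> True; a single judgement is
        returned unchanged."""
        return sum(judgements) > len(judgements) / 2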

10. Results and Discussion

In this section we show and discuss the experimental results of our evaluation. For the sake of space, we show here an aggregated view of the results, but we make all the experimental data available online to allow further inspection.^24

In the following discussion, we define relative size as |E(P)| / |E(T)|. This metric puts the actual size of the prediction set in context by comparing it with the left-out evaluation set.^25

Further, we use the BWP and vanilla recall metrics for computing F-scores. We prefer BWP over precision as it is far more indicative, and the vanilla precision metric is noisy, as shown in Table 5. We have to choose vanilla recall over BWR because recall uses the same target set (E(T)) as BWP, whereas BWR picks a subset of the target set (BW_I(T)); this matters because the F-score is also used to compare algorithms, which may use a different input set (I). Thus, we define the F1 score as the harmonic mean between BWP and vanilla recall. Note that the F1 score reported in this section is the macro-average of the individual F1 scores over all language pairs tested.

^24 See https://github.com/shash42/ApertiumBidixGen
^25 This can be larger than 100% if the prediction set is larger than the evaluation set.
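In code, continuing the earlier sketches (both_word_precision as defined in Section 9.2):

    from statistics import mean

    def relative_size(P: Set[Edge], T: Set[Edge]) -> float:
        """|E(P)| / |E(T)|; may exceed 1.0 (see footnote 25)."""
        return len(P) / len(T)

    def f1(P: Set[Edge], T: Set[Edge]) -> float:
        """Harmonic mean of BWP and vanilla recall, as defined above."""
        p = both_word_precision(P, T)
        r = len(P & T) / len(T)  # vanilla recall
        return 2 * p * r / (p + r) if (p + r) else 0.0

    def macro_f1(pairs) -> float:
        """Macro-average of per-language-pair F1; `pairs` is an
        iterable of (P, T) edge-set tuples, one per language pair."""
        return mean(f1(P, T) for P, T in pairs)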
We begin by showing our optimal results on the development set in Table 10 (first row). Each metric is averaged across the 11 language pairs. The computation time differs for each language pair to be generated, so we report the range (minimum–maximum). Notice the significant gap between BWR and recall, showing the scope for improvement in the algorithm's results if the input RDF graph is enriched. Since our procedure itself can iteratively aid such enrichment, we hope that in the future this gap will be bridged. Moreover, the BWP is much larger than the precision, which shows that evaluation schemes based on precision could be grossly under-estimating the performance of algorithms on this task when the actual insufficiency lies in the test data.

We also report statistics in Table 4 for the 9 most frequent POS; the other POS have too few translations to give reliable results. Table 4 shows that the transitive closure adopted produces many more translations of proper nouns and numerals than the existing Apertium dictionaries. Yet, the BWP in Figure 4 is quite high for proper nouns, and decent for numerals. Moreover, our procedure has high precision for open POS, and lower precision for closed POS. This is because the translations of closed POS in Apertium, and hence in RDF, often encode grammatical changes instead of semantic similarity. It might still be helpful to use more restrictive hyperparameters or a higher confidence threshold when producing translations of closed POS.

10.1. Demonstrating hyperparameter changes

Table 5 shows how decreasing the confidence threshold C, while holding all other hyperparameters constant, leads to lower BWP and higher BWR, recall and prediction set size, as expected. The end-user can tune this precision-recall tradeoff to their liking (see Section 11.1 for a discussion in the context of Apertium), but we stick to maximizing the F1. Notice, however, the noisy trend in vanilla precision, which points to how it is not a reflective metric in many cases.

Table 6 shows that increasing the maximum allowed cycle length, keeping all else constant, leads to decreasing BWP but increasing BWR, recall and computation time. We see diminishing returns in the F1 score beyond cycle length 6, as expected.

Next, Table 7 compares, for proper nouns and numerals, the hyperparameter setting used for the other POS with the chosen T = 2, D = 4 one. Clearly, we achieve much higher BWR and recall using transitivity, with a lower but decent enough BWP. In particular, the high BWR shows that this method generates a large portion of the target set that it could possibly have generated with the given input data.

To demonstrate the extent of polysemy in the RDF graph, we use the T = 2, D = 4 transitive setting for all POS. The dismal results produced are shown in Table 8. Despite the extremely large prediction set produced, vanilla recall remains low. This illustrates a limitation of vanilla recall: it gives a pessimistic view of the algorithm's performance. In that sense, BWR is more reflective of the actual performance given the data provided.

Finally, we show in Table 9 that allowing language repetition actually improves performance, which led us to remove the hyperparameter.

10.2. Dictionary generation

In Table 10 we present the results of transferring the same hyperparameter setting derived on our development set to our large set, continuing the dictionary generation experiment but with a larger graph. It takes longer to run, but still finishes within one minute for all language pairs. Once again, the accuracy metrics reported are averaged across the 27 language pairs that are left out one by one. The drop in the metrics on the large set compared to the development set is not necessarily a sign of overfitting: the large set has more less-connected, under-resourced language pairs, which bring down the averages.

Table 4
Relative size for different parts of speech, averaged across the language pairs in the development set

POS       Noun    Verb    properNoun  Adjective  Adverb  Determiner  Numeral  Pronoun  Preposition
Rel. Sz.  55.04%  57.03%  497.99%     43.44%     43.22%  45.65%      211.28%  59.25%   74.27%

[Fig. 4. Breakdown by POS of BWP (left bars) and BWR (right bars), averaged across language pairs in the development set.]

Table 5
Change in results when varying the confidence threshold

C    BWP     Prec.   BWR     Recall  F1
0.4  81.60%  46.36%  52.11%  29.96%  50.86%
0.5  85.00%  48.37%  50.96%  29.32%  50.93%
0.6  86.39%  48.93%  48.68%  27.94%  50.25%
0.7  87.80%  48.55%  42.24%  24.19%  47.35%
0.8  89.52%  48.40%  37.98%  21.69%  45.24%

Table 6
Change in results when varying the maximum cycle length

Lmax  BWP     BWR     Recall  F1      Time (s)
4     89.53%  30.50%  15.94%  30.24%  7–9
5     86.69%  49.39%  28.33%  50.43%  8–10
6     85.00%  50.96%  29.32%  50.93%  10–15
7     81.96%  51.56%  29.66%  50.31%  15–22
8     76.67%  52.35%  30.12%  47.92%  30–50

Table 7
Benefit from using transitivity for proper nouns and numerals

POS           T  BWP     BWR     Recall
Proper nouns  0  98.47%  24.75%  11.95%
              2  91.81%  78.11%  32.81%
Numerals      0  97.66%  55.64%  35.22%
              2  77.17%  90.55%  61.27%

Table 8
Average metrics across the 11 language pairs in the development set when using T = 2, D = 4 for all POS

BWP     BWR     Recall  Rel. Sz.  F1
15.26%  81.11%  46.81%  718.75%   23.94%

Table 9
Average metrics across the 11 language pairs when not allowing language repetition, compared to when it is allowed

Repetition  BWP     BWR     Recall  Rel. Sz.  F1
Allowed     85.00%  50.96%  29.32%  75.73%    50.93%
Restricted  86.89%  46.81%  28.49%  73.03%    49.45%

In Table 11 we demonstrate a significant gain in relative size across POS when using the large set, compared to Table 4 for the development set. Similar trends are observed for BWR in Figure 5.

We show a comparison between the cycle density algorithm and the transitive baselines in Table 12. We see that transitive closure tends to increase relative coverage and recall, but at the cost of a much lower precision (55% and 13% for the respective baselines vs. 84% for cycle density). Depending on the use case, a balance towards recall might be acceptable, but usually a decent level of precision is important, which the transitive procedures do not provide. As expected, baseline 2 produces many more candidate translations, leading to a much larger prediction set but dropping precision significantly.

In order to measure the impact of graph enrichment on improving results, in Table 13 we also restrict the averaged metrics to the 11 development set language pairs, still using the whole large set as input. There is a significant improvement in comparison with the original development set experiments (see Table 10). This demonstrates the advantages of adding more input data. Notice how, in this scenario, due to the more well-connected nature of the language pairs, the cycle density algorithm beats the transitive baselines in terms of the F1-score as well.

In conclusion, our produced translations are on average 71.39% the size of existing Apertium dictionaries, at a high BWP of 84.07%, across 27 language pairs over 13 languages.

10.3. Dictionary enrichment

The coloured bar graph in Figure 6 shows the ratio of translations that match the test set and contrasts it with those that do not match, which are further classified. These are average metrics over the 27 language pairs in the large set. In particular, it classifies the additional translations (the left bar) in terms of the words they share with the test set T, and classifies the missed translations (the right bar) of T in terms of the words they share with the input set I. This also provides a way to visualize the BWP and BWR metrics defined earlier: from bottom to top, the BWP (resp. BWR) is the ratio of the height of the first sub-bar to the combined height of the first two sub-bars in the left (resp. right) bars.

Figure 6 is shown for reference of the ratios of the number of additional translations belonging to each class across POS. Almost 30% of predicted translations have neither word in the test set (demonstrating the vast scope for enrichment), a share that is especially high due to the large relative size in proper nouns and numerals. On the other hand, while the number of missed translations with neither word in the input is low, for many test set translations only one word is present. This demonstrates the scope for improving recall by improving the coverage of the input data for specific languages.

We now evaluate the additional translations for the 5 language pairs also present in MUSE (see Section 8.4). In Table 14, we calculate the overall BWP of our additional translation set with MUSE as the test set. We further show the breakdown of the additional translations in terms of the number of words shared with our original test set, that is, the left-out RDF dictionary. In particular, n_extra(i) denotes the number of additional translations with i words shared, and BWP_extra(i) denotes the BWP with respect to MUSE for these n_extra(i) translations. Notice the declining trend of BWP as the number of words shared with the test set increases. This confirms our hypothesis that many additional translations are actually correct, and that if both words are present in the test set without a translation between them, the prediction is much less likely to be correct.
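The breakdown in Table 14 can be reproduced along these lines, reusing both_word_precision from Section 9.2 (here `muse` is an assumed name for the MUSE edge set of the pair):

    def extra_breakdown(P: Set[Edge], T: Set[Edge], muse: Set[Edge]) -> None:
        """n_extra(i) and BWP_extra(i): bucket the additional translations
        (P minus T) by the number of words they share with T, and score
        each bucket against MUSE."""
        v1_T = {t[0] for t in T}
        v2_T = {t[1] for t in T}
        extra = P - T
        for i in (0, 1, 2):
            bucket = {e for e in extra
                      if (e[0] in v1_T) + (e[1] in v2_T) == i}
            bwp = both_word_precision(bucket, muse)
            print(f"n_extra({i}) = {len(bucket)}, BWP_extra({i}) = {bwp:.2%}")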

Table 10
Results of the cycle density algorithm on average metrics in the development and large sets

           BWP     BWR     Recall  Rel. Sz.  F1      Time range (s)
dev set    85.00%  50.96%  29.32%  75.73%    50.93%  10–15
large set  84.07%  40.98%  25.95%  71.39%    35.90%  22–57

Table 11
Relative size for different POS, averaged across the 27 language pairs in the large set

POS       Noun    Verb    properNoun  Adjective  Adverb  Determiner  Numeral  Pronoun  Preposition
Rel. Sz.  70.78%  81.75%  539.29%     59.39%     61.99%  53.48%      241.53%  80.32%   107.10%

[Fig. 5. Breakdown by POS of BWP (left bars) and BWR (right bars), averaged across language pairs in the large set.]

Table 12
Comparing baselines 1 and 2 (D = 2 and D = 4, respectively) with the cycle density algorithm. Metrics have been averaged across all the language pairs in the large set

               BWP  BWR  Recall  Rel. Sz.  F1
Cycle Density  84%  41%  26%     71%       36%
Baseline 1     55%  85%  56%     237%      42%
Baseline 2     13%  88%  57%     1038%     19%

Table 13
Comparing baselines 1 and 2 (D = 2 and D = 4, respectively) with the cycle density algorithm. Metrics have been averaged over the 11 pairs in the development set, using the other 26 languages in the large set as input

               BWP  BWR  Recall  Rel. Sz.  F1
Cycle Density  81%  56%  36%     94%       51%
Baseline 1     48%  81%  53%     267%      44%
Baseline 2     11%  83%  54%     1176%     17%

[Fig. 6. Breakdown by POS of precision (left bars) and recall (right bars), showing the different classes of additional (resp. missed) translations in terms of the number of shared words, averaged across the 27 language pairs in the large set.]

This shows that BWP with respect to the test set is indeed a tight approximation, and a much more useful metric than vanilla precision. These results also show that our method produces a decent number of additional translations for many language pairs, with especially encouraging results for French–Spanish. Note that the BWP for these additional translations reported using MUSE would again be a lower bound, and many of those not counted could well be correct, just not present in MUSE.

Table 15 shows a comparison between BWR and recall with MUSE as the test set (T), using the algorithm's output prediction sets (P) when given the large set as input (I). While the recall is dismal, the BWR is much higher, in fact somewhat similar to the BWR on the actual test sets in the large experiments, where we evaluate on the left-out dictionary instead of MUSE. This shows that the BWR is relatively insensitive to the dataset distributions and is thus a robust metric for evaluation.

Finally, we report the accuracy of the additional translations predicted during the large experiment on the human evaluation sets in Table 16. The precision obtained is encouraging, confirming that our procedure works well for enrichment. Note that the varying sample size is due to the dataset difference and the intersection taken, as explained in Section 9.4.^26

10.4. Recommending the use of BWP and BWR

BWR should be used as a valuable test metric when the same input dataset is used, particularly by creators when tuning and improving a particular algorithm. It also signifies the data-effectiveness of any proposed algorithm, and it gives a clear picture of the optimization landscape, growing more consistently than recall as the prediction set becomes more liberal, as shown in Table 8.

BWP can be used for comparison across systems, to make the evaluation of translation inference pipelines more accurate. A natural usage scenario for this metric could be the TIAD shared task. In particular, the current TIAD evaluation procedure modifies traditional precision metrics in the same direction as in this paper, by limiting the output set for precision to translations for which the word in the source language is in the evaluation set. This is somewhat what we could call one-word precision, but its key drawback is that it is asymmetric: the precision computed for language pair A → B can differ from that for language pair B → A, even though the choice of source and target is actually arbitrary. Moreover, as shown in Table 14, additional translations missed by one-word precision are correct far more often than those missed by BWP. In fact, past TIAD results show relatively low precision values (up to a maximum of 0.7) [17], making bilingual dictionary inference seem a harder problem than it actually is. This is unlike our symmetric definition, which makes the evaluation process simpler and directly conveys an indicator of the algorithm's performance that is independent of the setting.

^26 The small sample for eng–cat is potentially due to some substantial difference between the Apertium bilingual dictionaries and the Apertium RDF version we use, analyzing which is beyond the scope of this paper.
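To make the asymmetry concrete, here is a sketch of such a one-word precision under the same edge-set encoding as in Section 9.2 (an illustration of the idea, not the exact TIAD procedure):

    def one_word_precision(P: Set[Edge], T: Set[Edge],
                           source_side: int = 0) -> float:
        """Keep only predictions whose source-language word occurs in
        the evaluation set. source_side=0 scores A -> B and
        source_side=1 scores B -> A; the two results generally differ,
        whereas BWP is symmetric in the two sides."""
        vocab = {t[source_side] for t in T}
        kept = {e for e in P if e[source_side] in vocab}
        return len(kept & T) / len(kept) if kept else 0.0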
11. Use cases

In this section we give more details on the application of the cycle density algorithm to support the generation and enrichment of dictionaries in the Apertium framework, and introduce another use case: the discovery of synonymous words within a single language.

11.1. Apertium

As mentioned above, Apertium^27 [2] is a free/open-source rule-based machine translation system; that is, the information it needs to translate from one language to another is provided in the form of dictionaries and rule sets. Bilingual dictionaries are, therefore, a key component of the data used for a language pair. Apertium language data is all released under the GNU General Public Licence,^28 a free/open-source licence. This has, in particular, made it easy for derivatives of Apertium bilingual dictionaries to be published, such as lexical markup framework (LMF) dictionaries^29 and the RDF dictionaries mentioned in this paper [3, 4]. Methods like ours can be used to extend or create Apertium bilingual dictionaries.

^27 http://www.apertium.org
^28 Versions 2 (https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html) and 3 (https://www.gnu.org/licenses/gpl-3.0-standalone.html).
^29 Such as the LMF Apertium dictionaries at https://repositori.upf.edu/handle/10230/13034 and https://github.com/apertium-lmf.

Table 14
BWP for different classes of additional translations, based on words shared with the original test set, evaluated using MUSE

Lang. pair  BWP     n_extra(0)  BWP_extra(0)  n_extra(1)  BWP_extra(1)  n_extra(2)  BWP_extra(2)
eng–cat     15.57%  156         80.00%        978         41.99%        2268        6.90%
eng–spa     35.64%  315         94.02%        575         51.89%        374         17.78%
fra–spa     65.97%  2218        94.94%        239         39.83%        105         11.08%
spa–ita     37.88%  108         76.05%        53          32.31%        47          19.34%
spa–por     54.72%  353         78.97%        131         38.98%        31          19.62%

Table 15
Recall vs. BWR when comparing predictions made using the large set with the MUSE dictionaries

Lang. pair  Recall  BWR
eng–spa     7.05%   53.24%
eng–cat     8.84%   49.81%
epo–eng     8.76%   59.69%
fra–cat     0.97%   7.94%
oci–fra     2.63%   23.51%

Table 16
Accuracy of additional translations based on human evaluation

Lang. pair  Sample size  Precision
eng–spa     133          78.94%
eng–cat     67           77.61%
epo–eng     146          80.13%
fra–cat     130          69.23%
oci–fra     134          84.32%

It has to be noted, though, that the LMF and RDF dictionaries do not contain all the information present in the Apertium bilingual dictionaries, but just the orthographic representation (spelling) of the lemma and the POS. Apertium dictionaries may contain additional information, for instance, to signal the fact that the gender of a word changes (so that, for instance, target-side gender agreement with adjectives is ensured by an appropriate rule), as in this Spanish–Catalan entry,

    <e><p><l>almohada<s n="n"/><s n="f"/></l><r>coixí<s n="n"/><s n="m"/></r></p></e>

where, in addition to the fact that the Spanish (left, l) lemma almohada ('pillow') is a noun (n) corresponding to the Catalan (right, r) lemma coixí with the same part of speech, the entry also encodes the fact that almohada is a feminine (f) noun and coixí is a masculine (m) noun. In entries where the gender does not change, this is usually not indicated. Therefore, some of the bilingual correspondences predicted by our tool in its current form may need to be completed after being validated by an Apertium expert. Work is in progress to add this information to the RDF graphs, and to improve the method described here so that it can transfer morphological information to predicted entries directly.

When Apertium experts process each predicted dictionary entry, they first validate its usefulness and then either discard it or adopt it, perhaps adapting it by inserting additional information (like gender), before adding it to the dictionary. The actual time required to do these operations may be used to inform the weight β that should be given to recall in an Fβ indicator, which can in turn be used to determine the confidence threshold. On the one hand, if validating and discarding is much faster than validating, adopting and adapting, perhaps one could sacrifice precision to improve recall, selecting a higher value of β.^30 On the other hand, having to discard a deluge of useless entries may have a fatigue effect that reduces the expert's productivity, so β cannot be too high either. Determining an optimal value of β would require extensive measurement of actual expert work, which has not been attempted so far in the literature.

^30 Note that validating and adopting may be done faster than validating and rejecting, as to do the latter operation safely one would need to think harder about possible contexts.
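The Fβ indicator mentioned above is the standard weighted F-score; as a sketch:

    def f_beta(precision: float, recall: float, beta: float) -> float:
        """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R).
        beta > 1 favours recall (cheap to discard bad entries);
        beta < 1 favours precision (expert fatigue dominates)."""
        if precision == 0.0 and recall == 0.0:
            return 0.0
        b2 = beta * beta
        return (1 + b2) * precision * recall / (b2 * precision + recall)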

11.2. Discovering Synonyms

We have until now discussed the usage of our algorithm for creating bilingual dictionaries. The same cycle density procedure, however, can also be used to generate edge predictions between words of the same language, that is, synonyms. This allows extending from generating just dictionaries to even creating thesauri. In fact, some popular translation engines like Google Translate also provide synonyms to end-users, unlike Apertium, so automatic synonym generation from Apertium data could help create a useful feature. Synonymy relations may also be usefully represented as linguistic linked data.

The confidence score obtained in our method can also possibly be used as a measure of conceptual similarity. Traditional methods based on word co-occurrence suffer from several issues for this task, requiring further augmentation [31]. In fact, it has even been shown that having translation information is one way to improve their performance [32]. Our method takes that hypothesis to the extreme, relying purely on translation graphs rather than documents. Some advantages of this are as follows:

– Our method does not need large corpora, which can be hard to obtain for under-resourced languages. Further, it does not require complex augmentation procedures, and requires no training time.
– Since antonyms often occur in similar contexts in sentences, they are assigned high similarity by co-occurrence based methods. Our method does not appreciably suffer from such problems.
– Co-occurrence based methods ignore polysemy. They are known [33] to perform suboptimally when finding synonyms for polysemous words because they aggregate statistics across the different semantic senses [34]. Our approach, on the other hand, is explicitly polysemy-aware.

We demonstrate here a preliminary experiment. We produce synonym pairs for English, Spanish and Catalan using the large set of 27 language pairs as input. Random samples of 150 words drawn from the 3 prediction sets are evaluated by 2 human annotators each, and we report the results in Table 17. To the evaluators, we pose a similar question as before: "Is there a context where the two words are replaceable?"

Table 17
Human evaluation of predicted synonym pairs. κ denotes Cohen's kappa, as a measure of inter-annotator agreement between the 2 annotators for each language

Language  Prediction set (# pairs)  Precision  κ
English   9,652                     76%        0.4291
Spanish   7,664                     88.66%     0.5574
Catalan   10,254                    87.33%     0.1581

Note that we took the same hyperparameters and confidence threshold as in the translation setting (see Section 8.2); modifications specific to synonym generation might be required for optimal results. The results demonstrated are just a proof of concept; more detailed studies and exact comparisons with existing methods are left for future work.
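As a rough illustration of the idea (not the released implementation, which scores candidates by cycle density and a confidence threshold), the following sketch pairs same-language words that share at least two translations in the multilingual graph; nodes are assumed to be (word, language) tuples:

    from collections import defaultdict
    from itertools import combinations

    def synonym_candidates(edges, language, min_shared=2):
        """Propose synonym pairs within `language`: two words are
        paired if they share at least `min_shared` neighbours
        (translations) in the graph. A crude stand-in for the
        cycle-density confidence score; quadratic in vocabulary size."""
        neighbours = defaultdict(set)
        for a, b in edges:  # each edge joins two (word, language) nodes
            neighbours[a].add(b)
            neighbours[b].add(a)
        words = sorted(n for n in neighbours if n[1] == language)
        return {(a[0], b[0])
                for a, b in combinations(words, 2)
                if len(neighbours[a] & neighbours[b]) >= min_shared}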

12. Conclusions and future work

This paper has explored techniques that exploit the graph nature of openly available bilingual dictionaries to infer new bilingual entries, making the most of the knowledge that independent developers have separately encoded in different language pairs. The techniques build upon the cycle density method of [6], which has been modified (taking advantage of graph-theoretical features of the dictionary graphs) so that it is faster and has fewer hyperparameters to adjust when applying it to a task. We release our tool as free/open-source software which can be applied to RDF graphs but also directly to Apertium dictionaries.

We further show that existing automatic evaluation metrics for dictionary inference have limitations. To this end, we propose novel metrics for automatic evaluation of the dictionary inference task, Both-Word Precision and Both-Word Recall, and show extensively how they help to compare and improve existing dictionary algorithms.

We experiment with a large portion of the Apertium RDF graph in two bilingual dictionary development scenarios: dictionary creation for a new language pair, and dictionary enrichment. The results show how the progressive enrichment of the translation graph leads to better results with the cycle density method, and confirm its superiority with respect to transitive-based baselines. Further, our evaluations based on external dictionaries (MUSE) and on human assessment show that a significant number of extra translations, not initially found in the graph, are correct. We also illustrate that the algorithm can be used to infer synonyms, unlike pivot-based methods, and report a human-based evaluation on this.

We highlight the following opportunities for future work:

– Enrich the cloud of linguistic linked data by applying the method to other datasets, and connect Apertium RDF with other families of dictionaries on the Web.
– Exploit the fine-grained morphological information, beyond POS, that is sometimes present in Apertium bilingual entries.
– Improve the performance of the synonym generation through more elaborate evaluation, and include it as a feature in Apertium.
– Explore the ability of k-connectivity to automatically partition the graph into multilingual synsets.
– Use topological methods like the ones highlighted here to study polysemy from the perspective of linguistic typology, and connect it with studies in cognitive science that use translation graphs to study semantic mapping across languages [35].
– Study the feasibility of an iterative approach in which validated predictions are fed back in each round to produce additional predictions.
– Explore more optimal methods that directly find the maximum density cycles under the additional constraints posed by our hyperparameters, without having to compute all cycles within a certain length, perhaps drawing inspiration from maximum density subgraph methods [36].
– Study how language pair developers use our tool, to understand their preferences in the precision-recall trade-off and use these insights to improve evaluation metrics.
– Participate in upcoming editions of the TIAD task for more direct comparisons with other systems.
– Explore the complementary use of cycle-based techniques with pivot-based ones like OTIC.

13. Acknowledgements

We thank Google for its support through Google Summer of Code 2020. S. Goel thanks the Apertium community and Kunwar Shaanjeet Grover for helpful discussions. We also thank Gema Ramírez, Hèctor Alós i Font, Aure Séguier, Silvia Olmos, Mayank Goel and Xavi Ivars for their evaluation of predicted dictionary entries. This work was partially funded by the Prêt-à-LLOD project within the European Union's Horizon 2020 research and innovation programme under grant agreement no. 825182. This work is also based upon work from COST Action CA18209 – NexusLinguarum, "European network for Web-centred linguistic data science", supported by COST (European Cooperation in Science and Technology). It has also been partially supported by the Spanish project TIN2016-78011-C4-3-R (AEI/FEDER, UE).

References

[1] P. Cimiano, C. Chiarcos, J.P. McCrae and J. Gracia, Linguistic Linked Data, Springer International Publishing, 2020. ISBN 978-3-030-30224-5.
[2] M.L. Forcada, M. Ginestí-Rosell, J. Nordfalk, J. O'Regan, S. Ortiz-Rojas, J.A. Pérez-Ortiz, F. Sánchez-Martínez, G. Ramírez-Sánchez and F.M. Tyers, Apertium: a free/open-source platform for rule-based machine translation, Machine Translation 25(2) (2011), 127–144.
[3] J. Gracia, M. Villegas, A. Gomez-Perez and N. Bel, The Apertium bilingual dictionaries on the web of data, Semantic Web 9(2) (2018), 231–240.
[4] J. Gracia, C. Fäth, M. Hartung, M. Ionov, J. Bosque-Gil, S. Veríssimo, C. Chiarcos and M. Orlikowski, Leveraging Linguistic Linked Data for Cross-Lingual Model Transfer in the Pharmaceutical Domain, in: Proc. of the 19th International Semantic Web Conference (ISWC 2020), B. Fu and A. Polleres, eds, Springer, 2020, pp. 499–514. ISBN 978-3-030-62465-1.
[5] J.P. McCrae, J. Bosque-Gil, J. Gracia, P. Buitelaar and P. Cimiano, The OntoLex-Lemon Model: Development and Applications, in: Electronic Lexicography in the 21st Century. Proc. of the eLex 2017 Conference, Leiden, Netherlands, Lexical Computing CZ s.r.o., 2017, pp. 587–597. ISSN 2533-5626.
[6] M. Villegas, M. Melero, N. Bel and J. Gracia, Leveraging RDF graphs for crossing multiple bilingual dictionaries, in: Proc. of the Tenth International Conference on Language Resources and Evaluation (LREC'16), 2016, pp. 868–876.
[7] A. Toral, M. Ginestí-Rosell and F.M. Tyers, An Italian to Catalan RBMT system reusing data from existing language pairs, in: Proc. of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation, Barcelona (Spain), 2011, pp. 77–81. http://www.mt-archive.info/10/FreeRBMT-2011-Toral-2.pdf.
[8] K. Tanaka and K. Umemura, Construction of a Bilingual Dictionary Intermediated by a Third Language, in: Proc. of the 15th International Conference on Computational Linguistics (COLING'94), 1994, pp. 297–303.
[9] L.T. Lim, B. Ranaivo-Malançon and E.K. Tang, Low Cost Construction of a Multilingual Lexicon from Bilingual Lists, Polibits 43 (2011), 45–51. doi:10.17562/pb-43-6.
[10] H. Kaji, S. Tamamura and D. Erdenebat, Automatic Construction of a Japanese-Chinese Dictionary via English, in: Proc. of the Sixth International Conference on Language Resources and Evaluation (LREC'08), European Language Resources Association (ELRA), 2008. http://www.lrec-conf.org/proceedings/lrec2008/pdf/175_paper.pdf.
[11] F. Bond and K. Ogura, Combining linguistic resources to create a machine-tractable Japanese-Malay dictionary, Language Resources and Evaluation 42(2) (2008), 127–136. doi:10.1007/s10579-007-9038-4.
[12] X. Saralegi, I. Manterola and I. San Vicente, Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries, in: Proc. of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP'11), ACL, Edinburgh, Scotland, UK, 2011, pp. 846–856. doi:10.5555/2145432.2145526. http://www.wiktionary.org/.
[13] T. Flati and R. Navigli, The CQC Algorithm: Cycling in graphs to semantically enrich and enhance a bilingual dictionary, Journal of Artificial Intelligence Research 43 (2012), 135–171. doi:10.1613/jair.3456.
[14] Mausam, S. Soderland, O. Etzioni, D. Weld, M. Skinner and J. Bilmes, Compiling a Massive, Multilingual Dictionary via Probabilistic Inference, in: Proc. of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, ACL, Suntec, Singapore, 2009, pp. 262–270. https://www.aclweb.org/anthology/P09-1030.
[15] Mausam, S. Soderland, O. Etzioni, D.S. Weld, K. Reiter, M. Skinner, M. Sammer and J. Bilmes, Panlingual lexical translation via probabilistic inference, Artificial Intelligence 174 (2010), 619–637. doi:10.1016/j.artint.2010.04.020.
[16] J. Gracia, B. Kabashi, I. Kernerman, M. Lanau-Coronas and D. Lonke, Results of the Translation Inference Across Dictionaries 2019 Shared Task, in: Proc. of the TIAD-2019 Shared Task – Translation Inference Across Dictionaries, co-located with the 2nd Language, Data and Knowledge Conference (LDK 2019), J. Gracia, B. Kabashi and I. Kernerman, eds, CEUR Press, Leipzig (Germany), 2019, pp. 1–12. doi:10.5281/ZENODO.3555155. http://ceur-ws.org/Vol-2493/summary.pdf.
[17] I. Kernerman, S. Krek, J.P. McCrae, J. Gracia, S. Ahmadi and B. Kabashi, Introduction to the Globalex 2020 Workshop on Linked Lexicography, in: Proc. of the Globalex'20 Workshop on Linked Lexicography at LREC 2020, I. Kernerman, S. Krek, J.P. McCrae, J. Gracia, S. Ahmadi and B. Kabashi, eds, ELRA, 2020. ISBN 979-10-95546-46-7.
[18] R. Rapp, Identifying word translations in non-parallel texts, in: Proc. of the 33rd Annual Meeting of the Association for Computational Linguistics, ACL, Morristown, NJ, USA, 1995, p. 320. doi:10.3115/981658.981709.
[19] I. Vulić and M.-F. Moens, A Study on Bootstrapping Bilingual Vector Spaces from Non-Parallel Data (and Nothing Else), in: Proc. of the 2013 Conference on Empirical Methods in Natural Language Processing, ACL, 2013, pp. 1613–1624. https://www.aclweb.org/anthology/D13-1168.
[20] P. Fung and L. Yuen Yee, An IR Approach for Translating New Words from Nonparallel, Comparable Texts, in: Proc. of the 17th International Conference on Computational Linguistics (COLING 1998), ACL, 1998, pp. 414–420. https://www.aclweb.org/anthology/C98-1066.
[21] A. Irvine and C. Callison-Burch, Supervised Bilingual Lexicon Induction with Multiple Monolingual Signals, in: Proc. of NAACL-HLT 2013, ACL, 2013, pp. 9–14. https://www.aclweb.org/anthology/C98-1066/.
[22] T. Mikolov, Q.V. Le and I. Sutskever, Exploiting Similarities among Languages for Machine Translation, Technical Report, 2013. http://arxiv.org/abs/1309.4168.
[23] M. Artetxe, G. Labaka and E. Agirre, Learning principled bilingual mappings of word embeddings while preserving monolingual invariance, in: Proc. of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 2289–2294.
[24] G. Lample, A. Conneau, A. Ranzato, L. Denoyer and H. Jégou, Word translation without parallel data, in: Proc. of the 6th International Conference on Learning Representations (ICLR 2018), 2018. https://github.com/facebookresearch/MUSE, https://openreview.net/forum?id=H196sainb.
[25] M. Artetxe, G. Labaka and E. Agirre, Bilingual lexicon induction through unsupervised machine translation, in: Proc. of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), ACL, 2019, pp. 5002–5007. ISBN 9781950737482. doi:10.18653/v1/p19-1494.
[26] J.P. McCrae, F. Bond, P. Buitelaar, P. Cimiano, T. Declerck, J. Gracia, I. Kernerman, E. Montiel-Ponsoda, N. Ordan and M. Piasecki (eds), Proceedings of LDK Workshops: OntoLex, TIAD and Challenges for Wordnets, 2017. ISSN 1613-0073. http://ceur-ws.org/Vol-1899/.
[27] R.E.L. Aldred and C. Thomassen, On the maximum number of cycles in a planar graph, Journal of Graph Theory 57(3) (2008), 255–264.
[28] J. Hopcroft and R. Tarjan, Algorithm 447: Efficient Algorithms for Graph Manipulation, Commun. ACM 16(6) (1973), 372–378. doi:10.1145/362248.362272.
[29] H. Weinblatt, A New Search Algorithm for Finding the Simple Cycles of a Finite Directed Graph, J. ACM 19(1) (1972), 43–56. doi:10.1145/321679.321684.
[30] D.B. Johnson, Finding All the Elementary Circuits of a Directed Graph, SIAM J. Comput. 4(1) (1975), 77–84. http://dblp.uni-trier.de/db/journals/siamcomp/siamcomp4.html#Johnson75.
[31] A. Lu, W. Wang, M. Bansal, K. Gimpel and K. Livescu, Deep Multilingual Correlation for Improved Word Embeddings, in: Proc. of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, ACL, Denver, Colorado, 2015, pp. 250–256. doi:10.3115/v1/N15-1028. https://www.aclweb.org/anthology/N15-1028.
[32] F. Hill, K. Cho, S. Jean, C. Devin and Y. Bengio, Not All Neural Embeddings are Born Equal, 2014.
[33] F. Liu, H. Lu and G. Neubig, Handling Homographs in Neural Machine Translation, in: Proc. of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), ACL, New Orleans, Louisiana, 2018, pp. 1336–1345. doi:10.18653/v1/N18-1121. https://www.aclweb.org/anthology/N18-1121.
[34] S. Arora, Y. Li, Y. Liang, T. Ma and A. Risteski, Linear Algebraic Structure of Word Senses, with Applications to Polysemy, Transactions of the Association for Computational Linguistics 6 (2018), 483–495. doi:10.1162/tacl_a_00034. https://www.aclweb.org/anthology/Q18-1034.
[35] H. Youn, L. Sutton, E. Smith, C. Moore, J.F. Wilkins, I. Maddieson, W. Croft and T. Bhattacharya, On the universal structure of human lexical semantics, Proceedings of the National Academy of Sciences 113(7) (2016), 1766–1771. doi:10.1073/pnas.1520752113. https://www.pnas.org/content/113/7/1766.
[36] S. Khuller and B. Saha, On Finding Dense Subgraphs, in: ICALP '09, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 597–608. ISBN 9783642029264. doi:10.1007/978-3-642-02927-1_50.