
Fully unsupervised crosslingual semantic textual similarity based on BERT for identifying parallel data

Chi-kiu Lo and Michel Simard NRC-CNRC National Research Council Canada 1200 Montreal Road, Ottawa, Ontario K1A 0R6, Canada {Chikiu.Lo|Michel.Simard}@nrc-cnrc.gc.ca

Abstract

We present a fully unsupervised crosslingual semantic textual similarity (STS) metric, based on contextual embeddings extracted from BERT – Bidirectional Encoder Representations from Transformers (Devlin et al., 2019). The goal of crosslingual STS is to measure to what degree two segments of text in different languages express the same meaning. Not only is it a key task in crosslingual natural language understanding (XLU), it is also particularly useful for identifying parallel resources for training and evaluating downstream multilingual natural language processing (NLP) applications, such as machine translation. Most previous crosslingual STS methods relied heavily on existing parallel resources, thus leading to a circular dependency problem. With the advent of massively multilingual context representation models such as BERT, which are trained on the concatenation of non-parallel data from each language, we show that the deadlock around parallel resources can be broken. We perform intrinsic evaluations on crosslingual STS data sets and extrinsic evaluations on parallel corpus filtering and human translation equivalence assessment tasks. Our results show that the unsupervised crosslingual STS metric using BERT without fine-tuning achieves performance on par with supervised or weakly supervised approaches.

1 Introduction

Crosslingual semantic textual similarity (STS) (Agirre et al., 2016a; Cer et al., 2017) aims at measuring the degree of meaning overlap between two texts written in different languages. It is a key task in crosslingual natural language understanding (XLU), with applications in crosslingual information retrieval (Franco-Salvador et al., 2014; Vulić and Moens, 2015), crosslingual plagiarism detection (Franco-Salvador et al., 2016a,b), etc. It is also particularly useful for identifying parallel resources (Resnik and Smith, 2003; Aziz and Specia, 2011) for training and evaluating downstream multilingual NLP applications, such as machine translation systems.

Unlike crosslingual textual entailment (Negri et al., 2013) or crosslingual natural language inference (XNLI) (Conneau et al., 2018), which are directional classification tasks, crosslingual STS produces continuous values that reflect a range of similarity going from complete semantic unrelatedness to complete semantic equivalence. Machine translation quality estimation (MTQE) (Specia et al., 2018) is perhaps the field of work most related to crosslingual STS: in MTQE, one tries to estimate translation quality by comparing an original source-language text with its machine translation. In contrast, in crosslingual STS, neither the direction nor the origin (human or machine) of the translation is taken into account. Furthermore, MTQE also typically considers the fluency and grammaticality of the target text; these aspects are usually not perceived as relevant for crosslingual STS.

Many previous crosslingual STS methods rely heavily on existing parallel resources to first build a machine translation (MT) system and translate one of the test sentences into the other language before applying monolingual STS methods (Brychcín and Svoboda, 2016). Methods that do not rely explicitly on MT, such as that of Lo et al. (2018), still require parallel resources to build bilingual representations for evaluating crosslingual lexical semantic similarity. It is clear that there is a circular dependency problem on parallel resources.

Massively multilingual context representation models, such as MUSE (Conneau et al., 2017), BERT (Devlin et al., 2019), and XLM (Lample and Conneau, 2019), that are trained in an unsupervised manner with non-parallel data from each

language, have shown improved performance in XNLI classification tasks using task-specific fine-tuning.

In this paper, we propose a crosslingual STS metric based on fully unsupervised contextual embeddings extracted from BERT without fine-tuning. In an intrinsic crosslingual STS evaluation and extrinsic parallel corpus filtering and human translation error detection tasks, we show that our BERT-based metric achieves performance on par with similar metrics based on supervised or weakly supervised approaches. With the availability of multilingual context representation models, we show that the deadlock around parallel resources for crosslingual textual similarity can be broken.

2 Crosslingual STS metric

Our crosslingual STS metric is based on YiSi (Lo, 2019). YiSi is a unified adequacy-oriented MT quality evaluation and estimation metric for languages with different levels of available resources. Lo et al. (2018) showed that YiSi-2, the crosslingual MT quality estimation metric, performed almost as well as the "MT + monolingual MT evaluation metric (YiSi-1)" pipeline for identifying parallel sentence pairs from a noisy web-crawled corpus in the Parallel Corpus Filtering task of WMT 2018 (Koehn et al., 2018b).

To measure semantic similarity between pairs of segments, YiSi-2 proceeds by finding alignments between the lexical units of these segments that maximize semantic similarity at the lexical level. For evaluating crosslingual lexical semantic similarity, it relies on a crosslingual embedding model, using cosine similarity of the embeddings from the crosslingual lexical representation model. Following the approach of Corley and Mihalcea (2005), these lexical semantic similarities are weighed by lexical specificity, using inverse document frequency (IDF) collected from each side of the tested corpus.

As an MTQE metric, YiSi-2 also takes into account fluency and grammaticality of each side of the sentence pairs, using bag-of-ngrams and the semantic parses of the tested sentence pairs. But since crosslingual STS focuses primarily on measuring the meaning similarity between the tested sentence pairs, here we set the size of ngrams to 1 and opt not to use semantic parses in YiSi-2. In addition, rather than compute IDF weights w(e) and w(f) for lexical units e and f in each language directly on the texts under consideration, we rely on precomputed weights from monolingual corpora E and F of the two tested languages.

The YiSi metrics are formulated as an F-score: by viewing the source text as a "query" and the target as an "answer", precision and recall can be computed. Depending on the intended application, precision and recall can be weighed differently. For example, in MT evaluation applications, we typically assign more weight to recall ("every word in the source should find an equivalent in the target"). For this application, we give equal weights to precision and recall.

Thus, the crosslingual STS of sentences \mathbf{e} and \mathbf{f} using YiSi-2 in this work can be expressed as follows, where e and f range over the lexical units of \mathbf{e} and \mathbf{f} respectively:

\begin{align*}
v(u) &= \text{embedding of unit } u \\
s(e, f) &= \cos(v(e), v(f)) \\
w(e) &= \mathrm{idf}(e) = \log\left(1 + \frac{|E| + 1}{|E_{\exists e}| + 1}\right) \\
w(f) &= \mathrm{idf}(f) = \log\left(1 + \frac{|F| + 1}{|F_{\exists f}| + 1}\right) \\
\mathrm{precision} &= \frac{\sum_{e \in \mathbf{e}} \max_{f \in \mathbf{f}} w(e) \cdot s(e, f)}{\sum_{e \in \mathbf{e}} w(e)} \\
\mathrm{recall} &= \frac{\sum_{f \in \mathbf{f}} \max_{e \in \mathbf{e}} w(f) \cdot s(e, f)}{\sum_{f \in \mathbf{f}} w(f)} \\
\text{YiSi-2} &= \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
\end{align*}

where s(e, f) is the cosine similarity of the vector representations v(e) and v(f) in the bilingual embeddings model.
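For illustration, a minimal Python sketch of this computation, assuming the two sentences are already tokenized into lexical units and that the embedding lookups (embed_src, embed_tgt) and precomputed IDF weights (idf_src, idf_tgt) are provided as functions; all of these names are illustrative and not taken from the authors' implementation:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def yisi2(src_units, tgt_units, embed_src, embed_tgt, idf_src, idf_tgt):
    """Crosslingual STS of one sentence pair, following the F-score formulation above."""
    # s(e, f): lexical similarity matrix from the crosslingual embedding model.
    sim = np.array([[cosine(embed_src(e), embed_tgt(f)) for f in tgt_units]
                    for e in src_units])
    w_src = np.array([idf_src(e) for e in src_units])   # w(e)
    w_tgt = np.array([idf_tgt(f) for f in tgt_units])   # w(f)
    # Each unit is aligned to its best-matching unit on the other side.
    precision = np.sum(w_src * sim.max(axis=1)) / np.sum(w_src)
    recall = np.sum(w_tgt * sim.max(axis=0)) / np.sum(w_tgt)
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```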
In the following, we present the approaches we experimented with to obtain the crosslingual embedding space in supervised, weakly supervised and unsupervised manners.

2.1 Supervised crosslingual word embeddings with BiSkip

Luong et al. (2015) proposed BiSkip (with open source implementation bivec[1]) to jointly learn bilingual representations from the context cooccurrence information in the monolingual data and the meaning equivalence signals in the parallel data. It trains bilingual word embeddings with the objective to preserve the clustering structures of words in each language. We train our crosslingual word embeddings using bivec on the parallel resources as described in each experiment.

[1] https://github.com/lmthang/bivec

2.2 Weakly supervised crosslingual word embeddings with vecmap

Artetxe et al. (2016) generalized a framework to learn the linear transformation between two monolingual embedding spaces by minimizing the distances between equivalences listed in a collection of bilingual lexicons (with open source implementation vecmap[2]). We train our monolingual word embeddings using word2vec[3] (Mikolov et al., 2013) on the monolingual resources and then learn the linear transformation between the two monolingual embedding spaces using vecmap on the dictionary entries as described in each experiment.

[2] https://github.com/artetxem/vecmap
[3] https://code.google.com/archive/p/word2vec/

2.3 Unsupervised crosslingual contextual embeddings with multilingual BERT

The two embedding models mentioned above produce static word embeddings that capture the semantic space of the training data. The shortcoming of these static embedding models is that they provide the same embedding representation for a given word, without reflecting the variation of the contexts in which it is used in different sentences. In contrast, BERT (Devlin et al., 2019) uses a bidirectional transformer encoder (Vaswani et al., 2017) to capture the sentence context in the output embeddings, such that the embedding for the same word unit in different sentences is different and better represented in the embedding space. The multilingual BERT model is trained on the Wikipedia pages of 104 languages with a shared WordPiece subword vocabulary. Pires et al. (2019) showed that multilingual BERT works well on different monolingual NLP tasks across different languages.

Following the recommendation in Devlin et al. (2019), we use embeddings extracted from the ninth layer of the pretrained multilingual cased BERT-Base model[4] to represent subword units in the two sentences under assessment for the crosslingual lexical semantic similarity.

[4] https://github.com/google-research/bert
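As an illustration only, the following sketch extracts such ninth-layer contextual subword embeddings with the HuggingFace transformers library; the paper itself used the original google-research/bert release, and the function and variable names below are ours, not the authors':

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased",
                                  output_hidden_states=True)
model.eval()

def layer9_subword_embeddings(sentence):
    """Return one contextual vector per subword unit, taken from the ninth encoder layer."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states[0] is the embedding layer; hidden_states[9] is the ninth transformer layer.
    vectors = outputs.hidden_states[9].squeeze(0)        # shape: (num_subwords, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return list(zip(tokens, vectors))
```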
3 Experiment on crosslingual STS

We first evaluate the performance of YiSi-2 on the intrinsic crosslingual STS task, before testing its ability on the downstream task of identifying parallel data.

3.1 Setup

We use data from the SemEval-2016 Semantic Textual Similarity (STS) evaluation's crosslingual track (task 1) (Agirre et al., 2016b), in which the goal was to estimate the degree of equivalence between pairs of Spanish-English bilingual fragments of text.[5] The test data is partitioned into two evaluation sets: the News data set has 301 pairs, manually harvested from comparable Spanish and English news sources; the Multi-source data set consists of 294 pairs, sampled from English pairs of snippets used in the SemEval-2016 monolingual STS task and translated into Spanish.

[5] In this task, the order of languages in pairs was randomized, so that it was first necessary to detect which fragment was in which language. Here, we work from properly ordered pairs.

We apply YiSi-2 directly to these pairs of text fragments, using bilingual word embeddings (BWE's) trained under three different conditions (details of the training sets are given in Table 1):

bivec: BWE's are produced with bivec, trained on WMT 2013 ES-EN parallel training data.

vecmap: BWE's are produced with vecmap, trained on all WMT 2013 ES and WMT 2019 EN monolingual data, using Wikititles as bilingual lexicon.[6]

BERT: BWE's are obtained from pre-trained multilingual BERT models.

[6] https://linguatools.org/tools/corpora/wikipedia-parallel-titles-corpora/

We compare the YiSi-2 approach to direct cosine computations on sums of bilingual word embeddings (bivec sum, vecmap sum and bert sum). We also compare our approach to an MT-based approach, in which each Spanish fragment is first machine-translated into English, then compared to the original English fragment, using English word embeddings produced with word2vec trained on WMT 2019 news translation task monolingual data. Similarity is measured either as the cosine of the sums of word vectors from each fragment (w2v sum), or with YiSi using monolingual embeddings as if they were bilingual (YiSi-1w2v).
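A small sketch of the "* sum" baselines, for contrast with YiSi-2, under the same assumptions about the embedding lookup functions as in the earlier sketch (names are illustrative):

```python
import numpy as np

def sum_cosine(src_units, tgt_units, embed_src, embed_tgt):
    # Each fragment is represented as the sum of its word (or subword) vectors;
    # similarity is the cosine between the two resulting sentence vectors.
    v_src = np.sum([embed_src(e) for e in src_units], axis=0)
    v_tgt = np.sum([embed_tgt(f) for f in tgt_units], axis=0)
    return float(np.dot(v_src, v_tgt) /
                 (np.linalg.norm(v_src) * np.linalg.norm(v_tgt) + 1e-12))
```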

Model   | lang. | Training data: domain            | #sent | #words | Dictionary #pairs | Embedding vocab #words
bivec   | es    | WMT 2013: EU Parliament and web  | 3.8M  | 107M   | —                 | 291k
        | en    | WMT 2013: EU Parliament and web  | 3.8M  | 102M   | —                 | 220k
vecmap  | es    | WMT 2013: News and EU Parliament | 45M   | 1B     | 373k              | 883k
        | en    | WMT 2019: News                   | 779M  | 13B    | 373k              | 3M

Table 1: Statistics of data used in training the bilingual word embeddings for evaluating crosslingual lexical semantic similarity in YiSi-2.

The MT system used is a phrase-based SMT system, trained using standard resources – Europarl, Common Crawl (CC) and News & Commentary (NC) – totaling approximately 110M words in each language. We bias the SMT decoder to produce a translation that is as close as possible on the surface to the English sentence. This is done by means of log-linear model features that aim at maximizing n-gram precision between the MT output and the English sentence. More details on this method can be found in Lo et al. (2016).

SemEval-16 crosslingual STS
system        | news   | multisource
MT + monolingual STS
UWB           | 0.9062 | 0.8190
MT+w2v sum    | 0.5883 | 0.2021
MT+YiSi-1w2v  | 0.8965 | 0.6212
crosslingual STS
bivec sum     | 0.5302 | 0.2684
vecmap sum    | 0.3075 | 0.5398
bert sum      | 0.7223 | 0.6071
YiSi-2bivec   | 0.8744 | 0.6550
YiSi-2vecmap  | 0.7854 | 0.7028
YiSi-2bert    | 0.8723 | 0.7190

Table 2: Pearson's correlation of the system scores with the gold standard on the two test sets from the SemEval-16 crosslingual STS task.

3.2 Results

The results of these experiments are presented in Table 2, where performance is measured in terms of Pearson's correlation with the test sets' gold standard annotations. For reference, we also include results obtained by the UWB system (Brychcín and Svoboda, 2016), which was the best performing system in the SemEval 2016 crosslingual STS shared task. The UWB system is an MT-based system with an STS system trained on assorted lexical, syntactic and semantic features.

Globally, using the YiSi metric to measure semantic similarity performs much better than sentence-level cosine (the "* sum" systems). On the News data set, the best results are obtained by combining an MT-based approach with YiSi-1 using monolingual word embeddings (MT+YiSi-1w2v), reflecting the in-domain nature of the text for the MT system. However, this is followed very closely by both the supervised BWE's (YiSi-2bivec) and BERT (YiSi-2bert), which yield very similar results, and clearly outperform weakly supervised BWE's (YiSi-2vecmap). The nature of the Multi-source translations appears to be quite different from what the supervised BWE's and the MT system have been exposed to in training (YiSi-2bivec and MT+YiSi-1w2v), which possibly explains their much poorer performance on this data set. In contrast, weakly supervised BWE's and BERT behave much more reliably on this data.

Overall, while MT and supervised BWE's seem to work best with YiSi when large quantities of in-domain training data are available, the fully unsupervised alternative of using a pretrained BERT model comes very close, and behaves much better in the face of out-of-domain data.
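For reference, the correlations in Table 2 can be computed directly from the metric scores and the gold annotations, e.g. with SciPy (a minimal sketch; variable names are illustrative):

```python
from scipy.stats import pearsonr

def sts_correlation(system_scores, gold_scores):
    # Pearson's r between the metric's similarity scores and the gold-standard
    # similarity annotations for the same pairs of text fragments.
    r, _p_value = pearsonr(system_scores, gold_scores)
    return r
```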

Model   | lang. | Training data: domain | #sent | #words | Dictionary #pairs | Embedding vocab #words
bivec   | ne    | IT and religious      | 563k  | 8M     | —                 | 34k
        | en    | IT and religious      | 563k  | 5M     | —                 | 46k
vecmap  | ne    | wiki                  | 92k   | 5M     | 9k                | 55k
        | en    | news                  | 779M  | 13B    | 9k                | 3M

Table 3: Statistics of data used in training the bilingual word embeddings for evaluating crosslingual lexical semantic similarity in YiSi-2.

4 Experiment on Parallel Corpus Filtering

Next, we evaluate YiSi on the task of Parallel Corpus Filtering (PCF). Quality – or "cleanliness" – of parallel training data for MT has been shown to affect MT quality to different degrees, and various characteristics of the data – parallelism of the sentence pairs and the grammaticality of target-language data – impact MT systems in different ways (Goutte et al., 2012; Simard, 2014; Khayrallah and Koehn, 2018).

Here, we use data from the WMT19 shared task on PCF. In this shared task, participants were challenged to find good quality translations from noisy sentence-aligned parallel corpora, for the purpose of training MT systems for translating from two low-resource languages, Nepali and Sinhala, into English.[7] Both corpora were crawled from the web, using ParaCrawl (Koehn et al., 2018a). Specifically, the task is to produce a score for each sentence pair in these noisy corpora, reflecting the quality of that pair. The scoring schemes are evaluated by extracting the top-scoring sentence pairs from each corpus, then using them to train MT systems; these systems are run on test sets of Wikipedia articles (Guzmán et al., 2019), and the results are evaluated using BLEU (Papineni et al., 2002). In addition to the noisy corpora, participants are allowed to use a few small sets of parallel data, covering different domains, for each of the two low-resource languages, as well as a third, related language, Hindi (which uses the same script as Nepali). The provided data also included much larger monolingual corpora for each of English, Hindi, Nepali and Sinhala.

[7] http://www.statmt.org/wmt19/parallel-corpus-filtering.html

4.1 Setup

In these experiments, we focus on the Nepali-English corpus, and perform PCF in three steps:

1. pre-filtering: apply ad hoc filters to remove sentences that are exact duplicates (masking numbers, emails and web addresses), that contain mismatching numbers, that are in the wrong language according to the pyCLD2 language detector[8] or that are excessively long (either side has more than 150 tokens). We also filter out all pairs where over 50% of the Nepali text is comprised of English, numbers or punctuation.

2. scoring: we score sentence pairs using YiSi-2.

3. re-ranking: to optimize vocabulary coverage in the resulting MT system, we apply a form of re-ranking: going down the ranked list of scored sentence pairs, we apply a 20% penalty to the pair's score if it does not contain at least one "new" source-language word bigram, i.e., a pair of consecutive source-language tokens not observed in previous (higher-scoring) sentence pairs. This has the effect of down-ranking sentences that are too similar to previously selected sentences (a sketch of this step is given below).

[8] https://github.com/aboSamoor/pycld2
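A minimal sketch of the re-ranking step, assuming scored_pairs is a list of (score, source_tokens, pair) tuples already sorted by decreasing YiSi-2 score; the names and the final re-sorting are illustrative assumptions, not taken from the authors' implementation:

```python
def rerank_for_coverage(scored_pairs, penalty=0.20):
    """Down-rank sentence pairs that contribute no new source-language bigram."""
    seen_bigrams = set()
    reranked = []
    for score, src_tokens, pair in scored_pairs:
        bigrams = set(zip(src_tokens, src_tokens[1:]))
        if not (bigrams - seen_bigrams):
            # No "new" pair of consecutive source tokens: apply the 20% penalty.
            score *= (1.0 - penalty)
        seen_bigrams |= bigrams
        reranked.append((score, pair))
    # Pairs are finally re-ordered by their (possibly penalized) scores.
    return sorted(reranked, key=lambda x: x[0], reverse=True)
```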

The scoring step is performed with YiSi-2, using bilingual word embeddings obtained under three different conditions (details of the various training sets used can be found in Table 3):

bivec: supervised BWE's produced using bivec, trained on the WMT19 PCF (clean) parallel data.

vecmap: weakly supervised BWE's produced with vecmap, trained on all monolingual WMT19 PCF data, using Wikititles and the provided dictionary entries as bilingual lexicon.

BERT: BWE's obtained from pretrained multilingual BERT models.

As in the WMT19 PCF shared task, we evaluate the quality of our scoring by training MT systems and measuring their performance on the official test set. We used the provided software to extract the 1M-word and 5M-word samples from the original test corpora, using the scores of each of our systems in turn. We then trained MT systems using the extracted data: our MT systems are standard phrase-based SMT systems, with components and parameters similar to the German-English SMT system in Williams et al. (2016).

4.2 Results

BLEU scores of the resulting MT systems are shown in Table 4. For comparison, we present the results of random scoring, as well as results obtained by the Zipporah PCF method (Xu and Koehn, 2017). Zipporah combines fluency and adequacy features to score sentence pairs; adequacy features are derived from existing parallel corpora, and the feature combination (logistic regression) is optimized on in-domain parallel data. Therefore, Zipporah can be seen as a fully supervised method. The Zipporah-based MT systems were trained similarly to the other systems in the results reported here.

WMT19 parallel corpus filtering
system        | 1M-word | 5M-word
random        | 1.30    | 3.01
Zipporah      | 4.14    | 4.42
YiSi-2bivec   | 3.86    | 3.76
YiSi-2vecmap  | 4.00    | 3.76
YiSi-2bert    | 3.77    | 3.77

Table 4: Uncased BLEU scores on the official WMT19 PCF dev ("dev-test") sets achieved by the SMT systems trained on the 1M- and 5M-word corpora subselected by the scoring systems.

All systems produced with YiSi-2 produce similar results. Interestingly, the MT systems produced with YiSi-2 in the 5M-word condition are not better than those of the 1M-word condition. This is possibly explained by the large quantity of noisy data in the WMT19 Nepalese-English corpus: it is not even clear that there are 5M words of proper translations in that corpus. In such harsh conditions, pre- and post-processing steps become crucially important, and deduplicating the data may even turn out to be harmful, if that means allowing more space for noise. The MT systems produced with Zipporah all achieve higher BLEU scores than YiSi-2, which may be explained by Zipporah's explicit modeling of target-language fluency. This is especially apparent in the 5M-word condition, but it may explain Zipporah's slightly better performance in the 1M-word condition as well. Overall, the benefits of supervised and weakly supervised approaches over using a pre-trained BERT model for PCF appear to be minimal, even in very low-resource conditions such as this.
5 Experiments on Translation Equivalence Error Detection

Given a text and its translation, Translation Equivalence Error Detection (TEED) is the task of identifying pairs of corresponding text segments whose meanings are not strictly equivalent. Note that, while in practice "translation errors" can take many forms, here we are strictly focusing on meaning errors. In this formulation of the problem, we are also assuming that the source and target texts have been properly segmented into sentences and aligned.

The TEED problem is essentially the same as that of Parallel Corpus Filtering (PCF), discussed in the previous section. However, the usage scenario is quite different: in PCF, one is typically dealing with a very large collection of segment pairs, only a fraction of which are true translations; the PCF task is then to filter out pairs which are not proper translations, possibly with some tolerance for pairs of segments that do share partial meaning. In TEED, the data is mostly expected to be high-quality translations; the task is then to identify those pairs that deviate from this norm, even on small details.

5.1 Setup

We experiment with the TEED task using a data set obtained from the Canadian government's Public Service Commission (PSC). As part of its mandate, the PSC periodically audits Canadian government job ads, to ensure that they conform with Canada's Official Languages Act: as such, job ads must be posted in both of Canada's official languages, English and French, and both versions must be equivalent in meaning.

Our PSC data set consists of 175 000 "Statement of merit criteria" paragraphs, identifying any skill, ability, academic specialization, relevant experience or any other essential or asset criteria required for a position to be filled. Of these, 3521 have been manually annotated for equivalence errors by PSC auditors. Out of the 3521 pairs, 164 (4.6%) were reported to contain equivalence errors. The majority of these errors result from missing information in one language or the other (45%). In a slightly smaller proportion (43%), we find pairs of segments that don't express exactly the same meaning – a surprisingly large proportion of this last group consists of cases where the word "and" is translated as "or", or vice-versa. The rest consist of terminology issues and untranslated segments.

We experimented with applying the YiSi-2 metric to this task, using bilingual word embeddings obtained under four different conditions:

Model       | lang. | Training data: domain  | #sent | #words | Dictionary #pairs | Embedding vocab #words
WMT.bivec   | fr    | News and EU Parliament | 40.7M | 1.2B   | —                 | 878k
            | en    | News and EU Parliament | 40.7M | 1.4B   | —                 | 791k
PSC.bivec   | fr    | Job ads                | 175k  | 3.0M   | —                 | 10k
            | en    | Job ads                | 175k  | 2.5M   | —                 | 11k
PSC.vecmap  | fr    | Job ads                | 175k  | 3.0M   | 6k                | 10k
            | en    | Job ads                | 175k  | 2.5M   | 6k                | 11k

Table 5: Statistics of data used in training the bilingual word embeddings for evaluating translation equivalence assessment.

PSC Translation Error Detection
model       | ROC AUC | mean F1 | mean F2
PSC.bivec   | 0.807   | 0.160   | 0.281
PSC.vecmap  | 0.717   | 0.136   | 0.241
BERT        | 0.702   | 0.132   | 0.234
WMT.bivec   | 0.641   | 0.112   | 0.205

Table 6: Sentence-level translation error detection results on PSC test data, expressed in terms of area under the ROC curve, mean F1 and mean F2.

PSC.bivec : BWE’s are produced with bivec, trained on all unannotated PSC data.

PSC.vecmap : BWE’s are produced with vecmap, trained on all unannotated PSC data, using wikititles as bilingual lexicon. Figure 1: ROC curves of Sentence-level translation er- WMT.bivec : BWE’s are produced with bivec, ror detection results on PSC test data. trained on all bilingual French-English data provided for the WMT 2015 News translation shared task. of the Area under the ROC curve (ROC AUC), which can be interpreted as the probability that a BERT : BWE’s obtained from pretrained multi- system will score a randomly chosen faulty trans- lingual BERT models. lation lower than a randomly chosen good transla- Details about the training data can be found in Ta- tion. The ROC curves themselves can be seen in ble5. Figure1. Globally, YiSi-2 clearly performs best at this 5.2 Results task when using BWE’s trained on domain- For these experiments, we considered an applica- specific parallel data (PSC.bivec), even when there tion scenario in which a text and its translation, in is very limited quantities of such data, as is the the form of pairs of matching segments, are scored case here. However, BERT models perform com- using YiSi-2, and presented to a user, ranked in in- parably to vector-mapped BWE’s trained with in- creasing order of score, so that pairs most likely to domain data (PSC.vecmap), and substantially bet- contain a translation error are presented first. The ter than BWE’s trained on large quantities of performance of the system can then be measured generic, out-of-domain parallel data (WMT). We in terms of true and false positive rates, precision conclude that, in the absence of in-domain paral- and recall, over subsets of increasing sizes of the lel data, for TEED applications, an unsupervised test set. In Table6, we report results in terms of YiSi-2 method will perform at least as well as su- mean F -score, with β = 1 and β = 2, and in terms pervised methods trained on out-of-domain data.

6 Conclusion

We presented a fully unsupervised crosslingual semantic textual similarity (STS) metric, based on contextual embeddings extracted from BERT without fine-tuning. We perform intrinsic evaluations on crosslingual STS data sets and extrinsic evaluations on parallel corpus filtering and human translation equivalence assessment tasks. Our results show that the unsupervised metric we propose achieves performance on par with supervised or weakly supervised approaches. We show that the circular dependency on the existence of parallel resources for using crosslingual STS to identify parallel data can be broken.

In this paper, we have only experimented with the contextual embeddings extracted from the pretrained multilingual BERT model. For domain-specific applications, such as the job advertisement domain in the PSC translation equivalence error detection task, the performance of YiSi-2 could potentially be improved by fine-tuning BERT with in-domain data, something we plan to examine in the near future. We will also want to explore the use of other multilingual context representation models, such as MUSE (Conneau et al., 2017), XLM (Lample and Conneau, 2019), etc.

Acknowledgement

We would like to thank Dominic Demers and his colleagues at the Canadian government's Public Service Commission (PSC) for providing the PSC data set and managing the human annotation in the experiments.

References

Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016a. SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 497–511, San Diego, California. Association for Computational Linguistics.

Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016b. SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 497–511, San Diego, California. Association for Computational Linguistics.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2289–2294.

Wilker Aziz and Lucia Specia. 2011. Fully automatic compilation of Portuguese-English and Portuguese-Spanish parallel corpora. In Proceedings of the 8th Brazilian Symposium in Information and Human Language Technology.

Tomáš Brychcín and Lukáš Svoboda. 2016. UWB at SemEval-2016 task 1: Semantic textual similarity using lexical, syntactic, and semantic information. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 588–594, San Diego, California. Association for Computational Linguistics.

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association for Computational Linguistics.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics.

Courtney Corley and Rada Mihalcea. 2005. Measuring the semantic similarity of texts. In Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, EMSEE '05, pages 13–18, Stroudsburg, PA, USA. Association for Computational Linguistics.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Marc Franco-Salvador, Parth Gupta, Paolo Rosso, and Rafael E. Banchs. 2016a. Cross-language plagiarism detection over continuous-space- and knowledge graph-based representations of language. Know.-Based Syst., 111(C):87–99.

Marc Franco-Salvador, Paolo Rosso, and Manuel Montes-y Gómez. 2016b. A systematic study of knowledge graph analysis for cross-language plagiarism detection. Inf. Process. Manage., 52(4):550–570.

Marc Franco-Salvador, Paolo Rosso, and Roberto Navigli. 2014. A knowledge-based representation for cross-language document retrieval and categorization. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 414–423, Gothenburg, Sweden. Association for Computational Linguistics.

Cyril Goutte, Marine Carpuat, and George Foster. 2012. The impact of sentence alignment errors on phrase-based machine translation performance. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.

Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc'Aurelio Ranzato. 2019. Two new evaluation datasets for low-resource machine translation: Nepali-English and Sinhala-English. CoRR, abs/1902.01382.

Huda Khayrallah and Philipp Koehn. 2018. On the impact of various types of noise on neural machine translation. CoRR, abs/1805.12282.

Philipp Koehn, Kenneth Heafield, Mikel L. Forcada, Miquel Esplà-Gomis, Sergio Ortiz-Rojas, Gema Ramírez Sánchez, Víctor M. Sánchez Cartagena, Barry Haddow, Marta Bañón, Marek Střelec, Anna Samiotou, and Amir Kamran. 2018a. ParaCrawl corpus version 1.0. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Philipp Koehn, Huda Khayrallah, Kenneth Heafield, and Mikel Forcada. 2018b. Findings of the WMT 2018 shared task on parallel corpus filtering. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, Brussels, Belgium. Association for Computational Linguistics.
Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.

Chi-kiu Lo. 2019. YiSi - a unified semantic MT quality evaluation and estimation metric for languages with different levels of available resources. In Proceedings of the Fourth Conference on Machine Translation, pages 706–712, Florence, Italy. Association for Computational Linguistics.

Chi-kiu Lo, Cyril Goutte, and Michel Simard. 2016. CNRC at SemEval-2016 task 1: Experiments in crosslingual semantic textual similarity. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 668–673, San Diego, California. Association for Computational Linguistics.

Chi-kiu Lo, Michel Simard, Darlene Stewart, Samuel Larkin, Cyril Goutte, and Patrick Littell. 2018. Accurate semantic textual similarity for cleaning noisy parallel corpora using semantic machine translation evaluation metric: The NRC supervised submissions to the parallel corpus filtering task. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 908–916, Belgium, Brussels. Association for Computational Linguistics.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Bilingual word representations with monolingual quality in mind. In NAACL Workshop on Vector Space Modeling for NLP, Denver, United States.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS'13, pages 3111–3119, USA. Curran Associates Inc.

Matteo Negri, Alessandro Marchetti, Yashar Mehdad, Luisa Bentivogli, and Danilo Giampiccolo. 2013. SemEval-2013 task 8: Cross-lingual textual entailment for content synchronization. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 25–33, Atlanta, Georgia, USA. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? CoRR, abs/1906.01502.

Philip Resnik and Noah A. Smith. 2003. The web as a parallel corpus. Comput. Linguist., 29(3):349–380.

Michel Simard. 2014. Clean data for training statistical MT: the case of MT contamination. In Proceedings of the Eleventh Conference of the Association for Machine Translation in the Americas, pages 69–82, Vancouver, BC, Canada.

Lucia Specia, Frédéric Blain, Varvara Logacheva, Ramón Astudillo, and André F. T. Martins. 2018. Findings of the WMT 2018 shared task on quality estimation. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 702–722, Belgium, Brussels. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Ivan Vulić and Marie-Francine Moens. 2015. Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '15, pages 363–372, New York, NY, USA. ACM.

Philip Williams, Rico Sennrich, Maria Nadejde, Matthias Huck, Barry Haddow, and Ondřej Bojar. 2016. Edinburgh's statistical machine translation systems for WMT16. In Proceedings of the First Conference on Machine Translation, pages 399–410, Berlin, Germany. Association for Computational Linguistics.

Hainan Xu and Philipp Koehn. 2017. Zipporah: a fast and scalable data cleaning system for noisy web-crawled parallel corpora. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2945–2950, Copenhagen, Denmark. Association for Computational Linguistics.
