
Fully unsupervised crosslingual semantic textual similarity based on BERT for identifying parallel data

Chi-kiu Lo and Michel Simard NRC-CNRC National Research Council Canada 1200 Montreal Road, Ottawa, Ontario K1A 0R6, Canada {Chikiu.Lo|Michel.Simard}@nrc-cnrc.gc.ca

Abstract

We present a fully unsupervised crosslingual semantic textual similarity (STS) metric, based on contextual embeddings extracted from BERT – Bidirectional Encoder Representations from Transformers (Devlin et al., 2019). The goal of crosslingual STS is to measure to what degree two segments of text in different languages express the same meaning. Not only is it a key task in crosslingual natural language understanding (XLU), it is also particularly useful for identifying parallel resources for training and evaluating downstream multilingual natural language processing (NLP) applications, such as machine translation. Most previous crosslingual STS methods relied heavily on existing parallel resources, thus leading to a circular dependency problem. With the advent of massively multilingual context representation models such as BERT, which are trained on the concatenation of non-parallel data from each language, we show that the deadlock around parallel resources can be broken. We perform intrinsic evaluations on crosslingual STS data sets and extrinsic evaluations on parallel corpus filtering and human translation equivalence assessment tasks. Our results show that the unsupervised crosslingual STS metric using BERT without fine-tuning achieves performance on par with supervised or weakly supervised approaches.

1 Introduction

Crosslingual semantic textual similarity (STS) (Agirre et al., 2016a; Cer et al., 2017) aims at measuring the degree of meaning overlap between two texts written in different languages. It is a key task in crosslingual natural language understanding (XLU), with applications in crosslingual information retrieval (Franco-Salvador et al., 2014; Vulić and Moens, 2015), crosslingual plagiarism detection (Franco-Salvador et al., 2016a,b), etc. It is also particularly useful for identifying parallel resources (Resnik and Smith, 2003; Aziz and Specia, 2011) for training and evaluating downstream multilingual NLP applications, such as machine translation systems.

Unlike crosslingual textual entailment (Negri et al., 2013) or crosslingual natural language inference (XNLI) (Conneau et al., 2018), which are directional classification tasks, crosslingual STS produces continuous values that reflect a range of similarity going from complete semantic unrelatedness to complete semantic equivalence. Machine translation quality estimation (MTQE) (Specia et al., 2018) is perhaps the field of work most related to crosslingual STS: in MTQE, one tries to estimate translation quality by comparing an original source-language text with its machine translation. In contrast, in crosslingual STS, neither the direction nor the origin (human or machine) of the translation is taken into account. Furthermore, MTQE also typically considers the fluency and grammaticality of the target text; these aspects are usually not perceived as relevant for crosslingual STS.

Many previous crosslingual STS methods rely heavily on existing parallel resources to first build a machine translation (MT) system and translate one of the test sentences into the other language before applying monolingual STS methods (Brychcín and Svoboda, 2016). Methods that do not rely explicitly on MT, such as that of Lo et al. (2018), still require parallel resources to build bilingual representations for evaluating crosslingual lexical semantic similarity. It is clear that there is a circular dependency problem on parallel resources.

Massively multilingual context representation models, such as MUSE (Conneau et al., 2017), BERT (Devlin et al., 2019), and XLM (Lample and Conneau, 2019), that are trained in an unsupervised manner with non-parallel data from each

language, have shown improved performance in XNLI classification tasks using task-specific fine-tuning.

In this paper, we propose a crosslingual STS metric based on fully unsupervised contextual embeddings extracted from BERT without fine-tuning. In an intrinsic crosslingual STS evaluation and extrinsic parallel corpus filtering and human translation error detection tasks, we show that our BERT-based metric achieves performance on par with similar metrics based on supervised or weakly supervised approaches. With the availability of multilingual context representation models, we show that the deadlock around parallel resources for crosslingual textual similarity can be broken.

2 Crosslingual STS metric

Our crosslingual STS metric is based on YiSi (Lo, 2019). YiSi is a unified adequacy-oriented MT quality evaluation and estimation metric for languages with different levels of available resources. Lo et al. (2018) showed that YiSi-2, the crosslingual MT quality estimation metric, performed almost as well as the "MT + monolingual MT evaluation metric (YiSi-1)" pipeline for identifying parallel sentence pairs from a noisy web-crawled corpus in the Parallel Corpus Filtering task of WMT 2018 (Koehn et al., 2018b).

To measure semantic similarity between pairs of segments, YiSi-2 proceeds by finding alignments between the lexical units of these segments that maximize semantic similarity at the lexical level. For evaluating crosslingual lexical semantic similarity, it relies on a crosslingual embedding model, using cosine similarity of the embeddings from the crosslingual lexical representation model. Following the approach of Corley and Mihalcea (2005), these lexical semantic similarities are weighed by lexical specificity, using inverse document frequency (IDF) collected from each side of the tested corpus.

As an MTQE metric, YiSi-2 also takes into account fluency and grammaticality of each side of the sentence pairs, using bag-of-ngrams and the semantic parses of the tested sentence pairs. But since crosslingual STS focuses primarily on measuring the meaning similarity between the tested sentence pairs, here we set the size of ngrams to 1 and opt not to use semantic parses in YiSi-2. In addition, rather than compute IDF weights w(e) and w(f) for lexical units e and f in each language directly on the texts under consideration, we rely on precomputed weights from monolingual corpora E and F of the two tested languages.

The YiSi metrics are formulated as an F-score: by viewing the source text as a "query" and the target as an "answer", precision and recall can be computed. Depending on the intended application, precision and recall can be weighed differently. For example, in MT evaluation applications, we typically assign more weight to recall ("every word in the source should find an equivalent in the target"). For this application, we give equal weights to precision and recall.

Thus, the crosslingual STS of sentences \mathbf{e} and \mathbf{f} using YiSi-2 in this work can be expressed as follows, where e and f range over the lexical units of \mathbf{e} and \mathbf{f} respectively:

\begin{align*}
v(u) &= \text{embedding of unit } u \\
s(e, f) &= \cos(v(e), v(f)) \\
w(e) &= \mathrm{idf}(e) = \log\left(1 + \frac{|E| + 1}{|E_{\exists e}| + 1}\right) \\
w(f) &= \mathrm{idf}(f) = \log\left(1 + \frac{|F| + 1}{|F_{\exists f}| + 1}\right) \\
\mathrm{precision} &= \frac{\sum_{e \in \mathbf{e}} \max_{f \in \mathbf{f}} w(e) \cdot s(e, f)}{\sum_{e \in \mathbf{e}} w(e)} \\
\mathrm{recall} &= \frac{\sum_{f \in \mathbf{f}} \max_{e \in \mathbf{e}} w(f) \cdot s(e, f)}{\sum_{f \in \mathbf{f}} w(f)} \\
\text{YiSi-2} &= \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
\end{align*}

where s(e, f) is the cosine similarity of the vector representations v(e) and v(f) in the bilingual embeddings model.
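For illustration, a minimal Python sketch of this computation, assuming the two sentences are already tokenized into lexical units and that the embedding lookups (embed_src, embed_tgt) and precomputed IDF weights (idf_src, idf_tgt) are provided as functions; all of these names are illustrative and not taken from the authors' implementation:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def yisi2(src_units, tgt_units, embed_src, embed_tgt, idf_src, idf_tgt):
    """Crosslingual STS of one sentence pair, following the F-score formulation above."""
    # s(e, f): lexical similarity matrix from the crosslingual embedding model.
    sim = np.array([[cosine(embed_src(e), embed_tgt(f)) for f in tgt_units]
                    for e in src_units])
    w_src = np.array([idf_src(e) for e in src_units])   # w(e)
    w_tgt = np.array([idf_tgt(f) for f in tgt_units])   # w(f)
    # Each unit is aligned to its best-matching unit on the other side.
    precision = np.sum(w_src * sim.max(axis=1)) / np.sum(w_src)
    recall = np.sum(w_tgt * sim.max(axis=0)) / np.sum(w_tgt)
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```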
In the following, we present the approaches we experimented with to obtain the crosslingual embedding space in supervised, weakly supervised and unsupervised manners.

2.1 Supervised crosslingual word embeddings with BiSkip

Luong et al. (2015) proposed BiSkip (with open source implementation bivec[1]) to jointly learn bilingual representations from the context cooccurrence information in the monolingual data and the meaning equivalence signals in the parallel data. It trains bilingual word embeddings with the objective to preserve the clustering structures of words in each language. We train our crosslingual word embeddings using bivec on the parallel resources as described in each experiment.

[1] https://github.com/lmthang/bivec

2.2 Weakly supervised crosslingual word embeddings with vecmap

Artetxe et al. (2016) generalized a framework to learn the linear transformation between two monolingual embedding spaces by minimizing the distances between equivalences listed in a collection of bilingual lexicons (with open source implementation vecmap[2]). We train our monolingual word embeddings using word2vec[3] (Mikolov et al., 2013) on the monolingual resources and then learn the linear transformation between the two monolingual embedding spaces using vecmap on the dictionary entries as described in each experiment.

[2] https://github.com/artetxem/vecmap
[3] https://code.google.com/archive/p/word2vec/

2.3 Unsupervised crosslingual contextual embeddings with multilingual BERT

The two embedding models mentioned above produce static word embeddings that capture the semantic space of the training data. The shortcoming of these static embedding models is that they provide the same embedding representation for a given word, without reflecting the variation of the contexts in which it is used in different sentences. In contrast, BERT (Devlin et al., 2019) uses a bidirectional transformer encoder (Vaswani et al., 2017) to capture the sentence context in the output embeddings, such that the embedding for the same word unit in different sentences is different and better represented in the embedding space. The multilingual BERT model is trained on the Wikipedia pages of 104 languages with a shared WordPiece subword vocabulary. Pires et al. (2019) showed that multilingual BERT works well on different monolingual NLP tasks across different languages.

Following the recommendation in Devlin et al. (2019), we use embeddings extracted from the ninth layer of the pretrained multilingual cased BERT-Base model[4] to represent subword units in the two sentences under assessment for the crosslingual lexical semantic similarity.

[4] https://github.com/google-research/bert
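As an illustration only, the following sketch extracts such ninth-layer contextual subword embeddings with the HuggingFace transformers library; the paper itself used the original google-research/bert release, and the function and variable names below are ours, not the authors':

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased",
                                  output_hidden_states=True)
model.eval()

def layer9_subword_embeddings(sentence):
    """Return one contextual vector per subword unit, taken from the ninth encoder layer."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states[0] is the embedding layer; hidden_states[9] is the ninth transformer layer.
    vectors = outputs.hidden_states[9].squeeze(0)        # shape: (num_subwords, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return list(zip(tokens, vectors))
```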
3 Experiment on crosslingual STS

We first evaluate the performance of YiSi-2 on the intrinsic crosslingual STS task, before testing its ability on the downstream task of identifying parallel data.

3.1 Setup

We use data from the SemEval-2016 Semantic Textual Similarity (STS) evaluation's crosslingual track (task 1) (Agirre et al., 2016b), in which the goal was to estimate the degree of equivalence between pairs of Spanish-English bilingual fragments of text.[5] The test data is partitioned into two evaluation sets: the News data set has 301 pairs, manually harvested from comparable Spanish and English news sources; the Multi-source data set consists of 294 pairs, sampled from English pairs of snippets used in the SemEval-2016 monolingual STS task and translated into Spanish.

[5] In this task, the order of languages in pairs was randomized, so that it was first necessary to detect which fragment was in which language. Here, we work from properly ordered pairs.

We apply YiSi-2 directly to these pairs of text fragments, using bilingual word embeddings (BWE's) trained under three different conditions (details of the training sets are given in Table 1):

bivec: BWE's are produced with bivec, trained on WMT 2013 ES-EN parallel training data.

vecmap: BWE's are produced with vecmap, trained on all WMT 2013 ES and WMT 2019 EN monolingual data, using Wikititles as bilingual lexicon.[6]

BERT: BWE's are obtained from pre-trained multilingual BERT models.

[6] https://linguatools.org/tools/corpora/wikipedia-parallel-titles-corpora/

We compare the YiSi-2 approach to direct cosine computations on sums of bilingual word embeddings (bivec sum, vecmap sum and bert sum). We also compare our approach to an MT-based approach, in which each Spanish fragment is first machine-translated into English, then compared to the original English fragment, using English word embeddings produced with word2vec trained on WMT 2019 news translation task monolingual data. Similarity is measured either as the cosine of the sums of word vectors from each fragment (w2v sum), or with YiSi using monolingual embeddings as if they were bilingual (YiSi-1w2v).
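A small sketch of the "* sum" baselines, for contrast with YiSi-2, under the same assumptions about the embedding lookup functions as in the earlier sketch (names are illustrative):

```python
import numpy as np

def sum_cosine(src_units, tgt_units, embed_src, embed_tgt):
    # Each fragment is represented as the sum of its word (or subword) vectors;
    # similarity is the cosine between the two resulting sentence vectors.
    v_src = np.sum([embed_src(e) for e in src_units], axis=0)
    v_tgt = np.sum([embed_tgt(f) for f in tgt_units], axis=0)
    return float(np.dot(v_src, v_tgt) /
                 (np.linalg.norm(v_src) * np.linalg.norm(v_tgt) + 1e-12))
```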

Model   | lang. | Training data: domain            | #sent | #words | Dictionary #pairs | Embedding vocab #words
bivec   | es    | WMT 2013: EU Parliament and web  | 3.8M  | 107M   | —                 | 291k
        | en    | WMT 2013: EU Parliament and web  | 3.8M  | 102M   | —                 | 220k
vecmap  | es    | WMT 2013: News and EU Parliament | 45M   | 1B     | 373k              | 883k
        | en    | WMT 2019: News                   | 779M  | 13B    | 373k              | 3M

Table 1: Statistics of data used in training the bilingual word embeddings for evaluating crosslingual lexical semantic similarity in YiSi-2.

The MT system used is a phrase-based SMT system, trained using standard resources – Europarl, Common Crawl (CC) and News & Commentary (NC) – totaling approximately 110M words in each language. We bias the SMT decoder to produce a translation that is as close as possible on the surface to the English sentence. This is done by means of log-linear model features that aim at maximizing n-gram precision between the MT output and the English sentence. More details on this method can be found in Lo et al. (2016).

SemEval-16 crosslingual STS
system        | news   | multisource
MT + monolingual STS
UWB           | 0.9062 | 0.8190
MT+w2v sum    | 0.5883 | 0.2021
MT+YiSi-1w2v  | 0.8965 | 0.6212
crosslingual STS
bivec sum     | 0.5302 | 0.2684
vecmap sum    | 0.3075 | 0.5398
bert sum      | 0.7223 | 0.6071
YiSi-2bivec   | 0.8744 | 0.6550
YiSi-2vecmap  | 0.7854 | 0.7028
YiSi-2bert    | 0.8723 | 0.7190

Table 2: Pearson's correlation of the system scores with the gold standard on the two test sets from the SemEval-16 crosslingual STS task.

3.2 Results

The results of these experiments are presented in Table 2, where performance is measured in terms of Pearson's correlation with the test sets' gold standard annotations. For reference, we also include results obtained by the UWB system (Brychcín and Svoboda, 2016), which was the best performing system in the SemEval 2016 crosslingual STS shared task. The UWB system is an MT-based system with an STS system trained on assorted lexical, syntactic and semantic features.

Globally, using the YiSi metric to measure semantic similarity performs much better than sentence-level cosine (the "* sum" systems). On the News data set, the best results are obtained by combining an MT-based approach with YiSi-1 using monolingual word embeddings (MT+YiSi-1w2v), reflecting the in-domain nature of the text for the MT system. However, this is followed very closely by both the supervised BWE's (YiSi-2bivec) and BERT (YiSi-2bert), which yield very similar results, and clearly outperform weakly supervised BWE's (YiSi-2vecmap). The nature of the Multi-source translations appears to be quite different from what the supervised BWE's and the MT system have been exposed to in training (YiSi-2bivec and MT+YiSi-1w2v), which possibly explains their much poorer performance on this data set. In contrast, weakly supervised BWE's and BERT behave much more reliably on this data.

Overall, while MT and supervised BWE's seem to work best with YiSi when large quantities of in-domain training data are available, the fully unsupervised alternative of using a pretrained BERT model comes very close, and behaves much better in the face of out-of-domain data.
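For reference, the correlations in Table 2 can be computed directly from the metric scores and the gold annotations, e.g. with SciPy (a minimal sketch; variable names are illustrative):

```python
from scipy.stats import pearsonr

def sts_correlation(system_scores, gold_scores):
    # Pearson's r between the metric's similarity scores and the gold-standard
    # similarity annotations for the same pairs of text fragments.
    r, _p_value = pearsonr(system_scores, gold_scores)
    return r
```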

Model   | lang. | Training data: domain | #sent | #words | Dictionary #pairs | Embedding vocab #words
bivec   | ne    | IT and religious      | 563k  | 8M     | —                 | 34k
        | en    | IT and religious      | 563k  | 5M     | —                 | 46k
vecmap  | ne    | wiki                  | 92k   | 5M     | 9k                | 55k
        | en    | news                  | 779M  | 13B    | 9k                | 3M

Table 3: Statistics of data used in training the bilingual word embeddings for evaluating crosslingual lexical semantic similarity in YiSi-2.

4 Experiment on Parallel Corpus Filtering

Next, we evaluate YiSi on the task of Parallel Corpus Filtering (PCF). Quality – or "cleanliness" – of parallel training data for MT has been shown to affect MT quality to different degrees, and various characteristics of the data – parallelism of the sentence pairs and the grammaticality of target-language data – impact MT systems in different ways (Goutte et al., 2012; Simard, 2014; Khayrallah and Koehn, 2018).

Here, we use data from the WMT19 shared task on PCF. In this shared task, participants were challenged to find good quality translations from noisy sentence-aligned parallel corpora, for the purpose of training MT systems for translating from two low-resource languages, Nepali and Sinhala, into English.[7] Both corpora were crawled from the web, using ParaCrawl (Koehn et al., 2018a). Specifically, the task is to produce a score for each sentence pair in these noisy corpora, reflecting the quality of that pair. The scoring schemes are evaluated by extracting the top-scoring sentence pairs from each corpus, then using them to train MT systems; these systems are run on test sets of Wikipedia articles (Guzmán et al., 2019), and the results are evaluated using BLEU (Papineni et al., 2002). In addition to the noisy corpora, participants are allowed to use a few small sets of parallel data, covering different domains, for each of the two low-resource languages, as well as a third, related language, Hindi (which uses the same script as Nepali). The provided data also included much larger monolingual corpora for each of English, Hindi, Nepali and Sinhala.

[7] http://www.statmt.org/wmt19/parallel-corpus-filtering.html

4.1 Setup

In these experiments, we focus on the Nepali-English corpus, and perform PCF in three steps:

1. pre-filtering: apply ad hoc filters to remove sentences that are exact duplicates (masking numbers, emails and web addresses), that contain mismatching numbers, that are in the wrong language according to the pyCLD2 language detector[8] or that are excessively long (either side has more than 150 tokens). We also filter out all pairs where over 50% of the Nepali text is comprised of English, numbers or punctuation.

2. scoring: we score sentence pairs using YiSi-2.

3. re-ranking: to optimize vocabulary coverage in the resulting MT system, we apply a form of re-ranking: going down the ranked list of scored sentence pairs, we apply a 20% penalty to the pair's score if it does not contain at least one "new" source-language word bigram, i.e., a pair of consecutive source-language tokens not observed in previous (higher-scoring) sentence pairs. This has the effect of down-ranking sentences that are too similar to previously selected sentences (a sketch of this step is given below).

[8] https://github.com/aboSamoor/pycld2
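A minimal sketch of the re-ranking step, assuming scored_pairs is a list of (score, source_tokens, pair) tuples already sorted by decreasing YiSi-2 score; the names and the final re-sorting are illustrative assumptions, not taken from the authors' implementation:

```python
def rerank_for_coverage(scored_pairs, penalty=0.20):
    """Down-rank sentence pairs that contribute no new source-language bigram."""
    seen_bigrams = set()
    reranked = []
    for score, src_tokens, pair in scored_pairs:
        bigrams = set(zip(src_tokens, src_tokens[1:]))
        if not (bigrams - seen_bigrams):
            # No "new" pair of consecutive source tokens: apply the 20% penalty.
            score *= (1.0 - penalty)
        seen_bigrams |= bigrams
        reranked.append((score, pair))
    # Pairs are finally re-ordered by their (possibly penalized) scores.
    return sorted(reranked, key=lambda x: x[0], reverse=True)
```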

The scoring step is performed with YiSi-2, using bilingual word embeddings obtained under three different conditions (details of the various training sets used can be found in Table 3):

bivec: supervised BWE's produced using bivec, trained on the WMT19 PCF (clean) parallel data.

vecmap: weakly supervised BWE's produced with vecmap, trained on all monolingual WMT19 PCF data, using Wikititles and the provided dictionary entries as bilingual lexicon.

BERT: BWE's obtained from pretrained multilingual BERT models.

As in the WMT19 PCF shared task, we evaluate the quality of our scoring by training MT systems and measuring their performance on the official test set. We used the provided software to extract the 1M-word and 5M-word samples from the original test corpora, using the scores of each of our systems in turn. We then trained MT systems using the extracted data: our MT systems are standard phrase-based SMT systems, with components and parameters similar to the German-English SMT system in Williams et al. (2016).

4.2 Results

BLEU scores of the resulting MT systems are shown in Table 4. For comparison, we present the results of random scoring, as well as results obtained by the Zipporah PCF method (Xu and Koehn, 2017). Zipporah combines fluency and adequacy features to score sentence pairs; adequacy features are derived from existing parallel corpora, and the feature combination (logistic regression) is optimized on in-domain parallel data. Therefore, Zipporah can be seen as a fully supervised method. The Zipporah-based MT systems were trained similarly to the other systems in the results reported here.

WMT19 parallel corpus filtering
system        | 1M-word | 5M-word
random        | 1.30    | 3.01
Zipporah      | 4.14    | 4.42
YiSi-2bivec   | 3.86    | 3.76
YiSi-2vecmap  | 4.00    | 3.76
YiSi-2bert    | 3.77    | 3.77

Table 4: Uncased BLEU scores on the official WMT19 PCF dev ("dev-test") sets achieved by the SMT systems trained on the 1M- and 5M-word corpora subselected by the scoring systems.

All systems produced with YiSi-2 produce similar results. Interestingly, the MT systems produced with YiSi-2 in the 5M-word condition are not better than those of the 1M-word condition. This is possibly explained by the large quantity of noisy data in the WMT19 Nepalese-English corpus: it is not even clear that there are 5M words of proper translations in that corpus. In such harsh conditions, pre- and post-processing steps become crucially important, and deduplicating the data may even turn out to be harmful, if that means allowing more space for noise. The MT systems produced with Zipporah all achieve higher BLEU scores than YiSi-2, which may be explained by Zipporah's explicit modeling of target-language fluency. This is especially apparent in the 5M-word condition, but it may explain Zipporah's slightly better performance in the 1M-word condition as well. Overall, the benefits of supervised and weakly supervised approaches over using a pre-trained BERT model for PCF appear to be minimal, even in very low-resource conditions such as this.
5 Experiments on Translation Equivalence Error Detection

Given a text and its translation, Translation Equivalence Error Detection (TEED) is the task of identifying pairs of corresponding text segments whose meanings are not strictly equivalent. Note that, while in practice "translation errors" can take many forms, here we are strictly focusing on meaning errors. In this formulation of the problem, we are also assuming that the source and target texts have been properly segmented into sentences and aligned.

The TEED problem is essentially the same as that of Parallel Corpus Filtering (PCF), discussed in the previous section. However, the usage scenario is quite different: in PCF, one is typically dealing with a very large collection of segment pairs, only a fraction of which are true translations; the PCF task is then to filter out pairs which are not proper translations, possibly with some tolerance for pairs of segments that do share partial meaning. In TEED, the data is mostly expected to be high-quality translations; the task is then to identify those pairs that deviate from this norm, even on small details.

5.1 Setup

We experiment with the TEED task using a data set obtained from the Canadian government's Public Service Commission (PSC). As part of its mandate, the PSC periodically audits Canadian government job ads, to ensure that they conform with Canada's Official Languages Act: as such, job ads must be posted in both of Canada's official languages, English and French, and both versions must be equivalent in meaning.

Our PSC data set consists of 175 000 "Statement of merit criteria" paragraphs, identifying any skill, ability, academic specialization, relevant experience or any other essential or asset criteria required for a position to be filled. Of these, 3521 have been manually annotated for equivalence errors by PSC auditors. Out of the 3521 pairs, 164 (4.6%) were reported to contain equivalence errors. The majority of these errors result from missing information in one language or the other (45%). In a slightly smaller proportion (43%), we find pairs of segments that don't express exactly the same meaning – a surprisingly large proportion of this last group consists of cases where the word "and" is translated as "or", or vice-versa. The rest consist of terminology issues and untranslated segments.

We experimented with applying the YiSi-2 metric to this task, using bilingual word embeddings obtained under four different conditions:

Model       | lang. | Training data: domain  | #sent | #words | Dictionary #pairs | Embedding vocab #words
WMT.bivec   | fr    | News and EU Parliament | 40.7M | 1.2B   | —                 | 878k
            | en    | News and EU Parliament | 40.7M | 1.4B   | —                 | 791k
PSC.bivec   | fr    | Job ads                | 175k  | 3.0M   | —                 | 10k
            | en    | Job ads                | 175k  | 2.5M   | —                 | 11k
PSC.vecmap  | fr    | Job ads                | 175k  | 3.0M   | 6k                | 10k
            | en    | Job ads                | 175k  | 2.5M   | 6k                | 11k

Table 5: Statistics of data used in training the bilingual word embeddings for evaluating translation equivalence assessment.

PSC Translation Error Detection
model       | ROC AUC | mean F1 | mean F2
PSC.bivec   | 0.807   | 0.160   | 0.281
PSC.vecmap  | 0.717   | 0.136   | 0.241
BERT        | 0.702   | 0.132   | 0.234
WMT.bivec   | 0.641   | 0.112   | 0.205

Table 6: Sentence-level translation error detection results on PSC test data, expressed in terms of area under the ROC curve, mean F1 and mean F2.

PSC.bivec : BWE’s are produced with bivec, trained on all unannotated PSC data.

PSC.vecmap : BWE’s are produced with vecmap, trained on all unannotated PSC data, using wikititles as bilingual lexicon. Figure 1: ROC curves of Sentence-level translation er- WMT.bivec : BWE’s are produced with bivec, ror detection results on PSC test data. trained on all bilingual French-English data provided for the WMT 2015 News translation shared task. of the Area under the ROC curve (ROC AUC), which can be interpreted as the probability that a BERT : BWE’s obtained from pretrained multi- system will score a randomly chosen faulty trans- lingual BERT models. lation lower than a randomly chosen good transla- Details about the training data can be found in Ta- tion. The ROC curves themselves can be seen in ble5. Figure1. Globally, YiSi-2 clearly performs best at this 5.2 Results task when using BWE’s trained on domain- For these experiments, we considered an applica- specific parallel data (PSC.bivec), even when there tion scenario in which a text and its translation, in is very limited quantities of such data, as is the the form of pairs of matching segments, are scored case here. However, BERT models perform com- using YiSi-2, and presented to a user, ranked in in- parably to vector-mapped BWE’s trained with in- creasing order of score, so that pairs most likely to domain data (PSC.vecmap), and substantially bet- contain a translation error are presented first. The ter than BWE’s trained on large quantities of performance of the system can then be measured generic, out-of-domain parallel data (WMT). We in terms of true and false positive rates, precision conclude that, in the absence of in-domain paral- and recall, over subsets of increasing sizes of the lel data, for TEED applications, an unsupervised test set. In Table6, we report results in terms of YiSi-2 method will perform at least as well as su- mean F -score, with β = 1 and β = 2, and in terms pervised methods trained on out-of-domain data.

6 Conclusion

We presented a fully unsupervised crosslingual semantic textual similarity (STS) metric, based on contextual embeddings extracted from BERT without fine-tuning. We perform intrinsic evaluations on crosslingual STS data sets and extrinsic evaluations on parallel corpus filtering and human translation equivalence assessment tasks. Our results show that the unsupervised metric we propose achieves performance on par with supervised or weakly supervised approaches. We show that the circular dependency on the existence of parallel resources for using crosslingual STS to identify parallel data can be broken.

In this paper, we have only experimented with the contextual embeddings extracted from the pretrained multilingual BERT model. For domain-specific applications, such as the job advertisement domain in the PSC translation equivalence error detection task, the performance of YiSi-2 could potentially be improved by fine-tuning BERT with in-domain data, something we plan to examine in the near future. We will also want to explore the use of other multilingual context representation models, such as MUSE (Conneau et al., 2017), XLM (Lample and Conneau, 2019), etc.

Acknowledgement

We would like to thank Dominic Demers and his colleagues at the Canadian government's Public Service Commission (PSC) for providing the PSC data set and managing the human annotation in the experiments.

References

Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016a. SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 497–511, San Diego, California. Association for Computational Linguistics.

Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016b. SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 497–511, San Diego, California. Association for Computational Linguistics.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2289–2294.

Wilker Aziz and Lucia Specia. 2011. Fully automatic compilation of Portuguese-English and Portuguese-Spanish parallel corpora. In Proceedings of the 8th Brazilian Symposium in Information and Human Language Technology.

Tomáš Brychcín and Lukáš Svoboda. 2016. UWB at SemEval-2016 task 1: Semantic textual similarity using lexical, syntactic, and semantic information. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 588–594, San Diego, California. Association for Computational Linguistics.

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association for Computational Linguistics.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics.

Courtney Corley and Rada Mihalcea. 2005. Measuring the semantic similarity of texts. In Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, EMSEE '05, pages 13–18, Stroudsburg, PA, USA. Association for Computational Linguistics.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Marc Franco-Salvador, Parth Gupta, Paolo Rosso, and Rafael E. Banchs. 2016a. Cross-language plagiarism detection over continuous-space- and knowledge graph-based representations of language. Know.-Based Syst., 111(C):87–99.

Marc Franco-Salvador, Paolo Rosso, and Manuel Montes-y Gómez. 2016b. A systematic study of knowledge graph analysis for cross-language plagiarism detection. Inf. Process. Manage., 52(4):550–570.

Marc Franco-Salvador, Paolo Rosso, and Roberto Navigli. 2014. A knowledge-based representation for cross-language document retrieval and categorization. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 414–423, Gothenburg, Sweden. Association for Computational Linguistics.

Cyril Goutte, Marine Carpuat, and George Foster. 2012. The impact of sentence alignment errors on phrase-based machine translation performance. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.

Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc'Aurelio Ranzato. 2019. Two new evaluation datasets for low-resource machine translation: Nepali-English and Sinhala-English. CoRR, abs/1902.01382.

Huda Khayrallah and Philipp Koehn. 2018. On the impact of various types of noise on neural machine translation. CoRR, abs/1805.12282.

Philipp Koehn, Kenneth Heafield, Mikel L. Forcada, Miquel Esplà-Gomis, Sergio Ortiz-Rojas, Gema Ramírez Sánchez, Víctor M. Sánchez Cartagena, Barry Haddow, Marta Bañón, Marek Střelec, Anna Samiotou, and Amir Kamran. 2018a. ParaCrawl corpus version 1.0. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Philipp Koehn, Huda Khayrallah, Kenneth Heafield, and Mikel Forcada. 2018b. Findings of the WMT 2018 shared task on parallel corpus filtering. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, Brussels, Belgium. Association for Computational Linguistics.
Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.

Chi-kiu Lo. 2019. YiSi - a unified semantic MT quality evaluation and estimation metric for languages with different levels of available resources. In Proceedings of the Fourth Conference on Machine Translation, pages 706–712, Florence, Italy. Association for Computational Linguistics.

Chi-kiu Lo, Cyril Goutte, and Michel Simard. 2016. CNRC at SemEval-2016 task 1: Experiments in crosslingual semantic textual similarity. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 668–673, San Diego, California. Association for Computational Linguistics.

Chi-kiu Lo, Michel Simard, Darlene Stewart, Samuel Larkin, Cyril Goutte, and Patrick Littell. 2018. Accurate semantic textual similarity for cleaning noisy parallel corpora using semantic machine translation evaluation metric: The NRC supervised submissions to the parallel corpus filtering task. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 908–916, Belgium, Brussels. Association for Computational Linguistics.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Bilingual word representations with monolingual quality in mind. In NAACL Workshop on Vector Space Modeling for NLP, Denver, United States.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS'13, pages 3111–3119, USA. Curran Associates Inc.

Matteo Negri, Alessandro Marchetti, Yashar Mehdad, Luisa Bentivogli, and Danilo Giampiccolo. 2013. SemEval-2013 task 8: Cross-lingual textual entailment for content synchronization. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 25–33, Atlanta, Georgia, USA. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? CoRR, abs/1906.01502.

Philip Resnik and Noah A. Smith. 2003. The web as a parallel corpus. Comput. Linguist., 29(3):349–380.

Michel Simard. 2014. Clean data for training statistical MT: the case of MT contamination. In Proceedings of the Eleventh Conference of the Association for Machine Translation in the Americas, pages 69–82, Vancouver, BC, Canada.

Lucia Specia, Frédéric Blain, Varvara Logacheva, Ramón Astudillo, and André F. T. Martins. 2018. Findings of the WMT 2018 shared task on quality estimation. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 702–722, Belgium, Brussels. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Ivan Vulić and Marie-Francine Moens. 2015. Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '15, pages 363–372, New York, NY, USA. ACM.

Philip Williams, Rico Sennrich, Maria Nadejde, Matthias Huck, Barry Haddow, and Ondřej Bojar. 2016. Edinburgh's statistical machine translation systems for WMT16. In Proceedings of the First Conference on Machine Translation, pages 399–410, Berlin, Germany. Association for Computational Linguistics.

Hainan Xu and Philipp Koehn. 2017. Zipporah: a fast and scalable data cleaning system for noisy web-crawled parallel corpora. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2945–2950, Copenhagen, Denmark. Association for Computational Linguistics.
