
Measuring sentence parallelism using Mahalanobis distances: The NRC unsupervised submissions to the WMT18 Parallel Corpus Filtering shared task

Patrick Littell, Samuel Larkin, Darlene Stewart, Michel Simard, Cyril Goutte, Chi-kiu Lo
National Research Council of Canada
1200 Montreal Road, Ottawa ON, K1A 0R6
[email protected]

Abstract

The WMT18 shared task on parallel corpus filtering (Koehn et al., 2018b) challenged teams to score sentence pairs from a large high-recall, low-precision web-scraped parallel corpus (Koehn et al., 2018a). Participants could use existing sample corpora (e.g. past WMT data) as a supervisory signal to learn what a "clean" corpus looks like. However, in lower-resource situations it often happens that the target corpus of the language is the only sample of parallel text in that language. We therefore made several unsupervised entries, setting ourselves an additional constraint that we not utilize the additional clean parallel corpora. One such entry fairly consistently scored in the top ten systems in the 100M-word conditions, and for one task—translating the European Medicines Agency corpus (Tiedemann, 2009)—scored among the best systems even in the 10M-word conditions.

1 Introduction and motivation

The WMT18 shared task on parallel corpus filtering assumes (but does not require) a supervised learning approach. Given

1. a set of "clean" German-English parallel corpora including past WMT data, Europarl (Koehn, 2005), etc., and

2. a large, potentially "dirty" corpus (i.e., one that may contain non-parallel data, non-linguistic data, etc.) scraped from the internet (Koehn et al., 2018a),

can one identify which sentences from (2) are clean? This is an obvious approach in well-resourced languages like German and English, in which there exist well-cleaned parallel corpora across various domains.

However, in much lower-resourced languages, we generally do not have multiple parallel corpora in a given language pair to assess the quality of the corpus at hand; the corpus to be evaluated is often the only one available.¹ If we want to assess the quality of one corpus, we cannot rely on a supervisory signal derived from additional, cleaner corpora. We therefore do not utilize the additional parallel corpora (except as additional sources of monolingual data).

The systems described in this paper were inspired instead by anomaly detection approaches: can we instead attempt to identify sentence pairs that are, in some way, "strange" for this dataset? Considering each sentence pair as a draw from a distribution of high-dimensional vectors, we define an anomalous sentence pair as one whose draw was improbable compared to the probability of drawing its component sentences independently. The resulting measure, conceptually similar to pointwise mutual information albeit couched in terms of Mahalanobis distances rather than actual probabilities, is detailed in §3.

A submission based primarily on this one measurement (with some pre- and post-processing to avoid duplicate and near-duplicate sentences) performed consistently above the median in the 100M-word conditions, and for a few tasks (particularly EMEA translation) was among the top systems even for the 10M-word conditions. It was also the #2 system in one of the dev conditions (WMT newstest2017, NMT trained on 100M words), which is surprising given that it could not have overfit to the development set; it did not utilize the WMT17 development set in any way.

¹We are thinking in particular of the English-Inuktitut translation pair, which is a long-standing research interest of NRC (e.g. Martin et al., 2003).

2 Overall architecture

The highest-ranked of our unsupervised submissions, NRC-seve-bicov, shares the same general skeleton as NRC's highest-ranked supervised submission, NRC-yisi-bicov (Lo et al., 2018); it differs primarily in the parallelism estimation component (§2.3).

2.1 Training sentence embeddings

We began by training monolingual sentence embeddings using sent2vec (Pagliardini et al., 2018) on all available monolingual data. This included the monolingual data available in the "clean" parallel training data; that is to say, we did not completely throw out the clean parallel data for this task, we simply used it as two unaligned monolingual corpora.

We trained sentence vectors of 10, 50, 100, 300, and 700 dimensions; our final submissions used the 300-dimensional vectors as a compromise between accuracy (lower-dimensional vectors had lower accuracy during sanity-checking) and efficiency (higher-dimensional vectors ended up exceeding our memory capacity in downstream components).

In a system such as this, which is looking for "strange" sentence pairs, training on additional monolingual data beyond the target corpus carries some risks. If the additional monolingual data were to have very different domain characteristics (say, mostly religious text in the first language and mostly medical text in the second), then the two vector spaces could encode different types of sentence as "normal". On the other hand, not using additional monolingual data carries its own risks; monolingual data that is domain-balanced could help to mitigate domain mismatches in the target parallel data (say, newswire text erroneously misaligned to sequences of dates).
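To make this step concrete, here is a minimal sketch of loading and applying such embeddings with the sent2vec toolkit; the model file name, the training command, and the example sentences are illustrative placeholders, not our exact setup.

import sent2vec  # https://github.com/epfml/sent2vec

# Training itself is done with sent2vec's command-line tool, e.g.
# (hypothetical paths and options):
#   ./fasttext sent2vec -input mono.de.tok -output s2v.de.300 -dim 300

model = sent2vec.Sent2vecModel()
model.load_model('s2v.de.300.bin')

sentences = ['ein beispielsatz .', 'noch ein satz .']
L1 = model.embed_sentences(sentences)  # array of shape (n_sentences, 300)
print(L1.shape)
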
2.2 Pre-filtering

Although the input data had already been de-duplicated by the shared task organizers, we did an additional de-duplication step in which email addresses and URLs were replaced with a placeholder token and numbers were removed, before deciding which sentences were duplicates. We had noticed that large amounts of data consisted of short sentences that were largely numbers (for example, long lists of dates). Although these sentences were indeed unique, we noticed that several of our parallelism measurements ended up preferring such sentences to such an extent that the resulting MT training sets were disproportionately dates, and performed comparatively poorly when tasked with translating full sentences. To mitigate this, we ran an additional de-duplication step on the English side in which two sentences that differ only in numbers (e.g., "14 May 2017" and "19 May 1996") were considered duplicates.

Without numerical de-duplication, we believe the parallelism estimation step in §2.3 would have had too much of a bias towards short numerical sentences. It is, after all, essentially just looking for sentence pairs that it considers likely given the distribution of sentence pairs in the target corpus; if the corpus has a large number of short numerical sentences (and it appears to), the measurement will come to prefer those, whether or not they are useful for the downstream task.

The additional de-duplication also had a practical benefit in that the resulting corpus was much smaller, allowing us to perform calculations in memory (e.g., that in §3.2) on the entire corpus at once rather than having to approximate them in mini-batches.

We also discarded sentence pairs that were exactly the same on each side, in which one sentence contained more than 150 tokens, in which the two sentences' numbers did not match, or in which there were suspiciously non-German or non-English sentences according to the pyCLD2 language detector². When pyCLD2 believed a putatively German sentence to be something other than German with certainty greater than 0.5, or a putatively English sentence to be something other than English with certainty greater than 0.5, it was discarded.

²https://github.com/aboSamoor/pycld2
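A sketch of the pre-filtering ideas above; the regular expressions, the placeholder tokens, and our reading of the 0.5 certainty threshold are illustrative assumptions, not the exact implementation.

import re
import pycld2  # https://github.com/aboSamoor/pycld2

def dedup_key(sentence):
    # Near-duplicate key: email addresses and URLs become placeholder
    # tokens, and numbers are removed entirely.
    s = re.sub(r'\S+@\S+', '<EMAIL>', sentence)
    s = re.sub(r'https?://\S+', '<URL>', s)
    s = re.sub(r'\d+', '', s)
    return ' '.join(s.split())

def plausibly(text, lang_code):
    # True unless pycld2 confidently identifies more than half of the
    # text as some other language (our reading of "certainty > 0.5").
    is_reliable, _, details = pycld2.detect(text)
    _, best_code, percent, _ = details[0]
    return (not is_reliable) or best_code == lang_code or percent <= 50

seen, kept = set(), []
for de, en in [('Hallo Welt .', 'Hello world .')]:  # placeholder corpus
    key = dedup_key(en)
    if key in seen:
        continue
    seen.add(key)
    if plausibly(de, 'de') and plausibly(en, 'en'):
        kept.append((de, en))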

2.3 Parallelism estimation

With sentence vectors (§2.1) for the reduced corpus (§2.2) in hand, we set out to estimate the degree of parallelism of sentence pairs. A novel measure of parallelism, based on ratios of squared Mahalanobis distances, performed better on a synthetic dataset than some more obvious measurements, and the single-feature submission based on it was our best unsupervised submission.

We also made several other unsupervised measurements:

1. Perplexity of the German sentence according to a 6-gram KenLM language model³ (Heafield, 2011)

2. Perplexity of the English sentence according to a 6-gram KenLM language model

3. The ratio between (1) and (2), to find sentence pairs that contain different amounts of information

4. Cosine distances between German and English sentence vectors, in a bilingual sent2vec space trained only on the target corpus

As we did not have a supervisory signal, we did not have a principled way of choosing weights for these features. Instead, we simply took an unweighted average of the above four features and the Mahalanobis feature in §3.2, after rescaling each to the interval [0.0, 1.0] (sketched below). As seen in §5, systems based on this feature combination (NRC-mono-bicov and NRC-mono) were outperformed by our single-feature system in most conditions.

We also considered combinations of these unsupervised measurements with supervised measurements, but this attempt was also unsuccessful compared to a system that used only a single supervised measurement for sentence pair ranking (Lo et al., 2018).

³Although we assumed that high-perplexity sentences would be worse—that they might be ungrammatical, for example—sanity checking suggested higher-perplexity sentences were actually better. Error analysis later suggested that many non-parallel (or parallel but non-informative) sentences were short, possibly explaining why taking perplexity as a positive feature resulted in higher scores in sanity-checking.
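As an illustration of this combination, the following sketch assumes the kenlm Python bindings, hypothetical language model paths, and precomputed arrays for the cosine-distance and Mahalanobis features:

import numpy as np
import kenlm  # https://github.com/kpu/kenlm

lm_de = kenlm.Model('de.6gram.arpa')  # hypothetical model paths
lm_en = kenlm.Model('en.6gram.arpa')

def rescale(f):
    # Min-max rescaling of one feature to the interval [0.0, 1.0].
    return (f - f.min()) / (f.max() - f.min())

def combined_score(de_sents, en_sents, cosine_dist, mahalanobis_ratio):
    ppl_de = np.array([lm_de.perplexity(s) for s in de_sents])
    ppl_en = np.array([lm_en.perplexity(s) for s in en_sents])
    ppl_ratio = ppl_de / ppl_en  # mismatched information content
    feats = [ppl_de, ppl_en, ppl_ratio, cosine_dist, mahalanobis_ratio]
    # Unweighted average of the rescaled features.
    return np.mean([rescale(f) for f in feats], axis=0)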

2.4 Post-filtering

After scoring each sentence pair for parallelism, we performed another de-duplication step. In this step, we iterated over each target-language sentence in order of parallelism (that is, sentences assessed to have the highest parallelism were considered first), and removed pairs that consisted only of bigrams that had already been seen. (That is to say, a sentence pair was kept only if it contained a bigram that had not previously been seen.)

This step has to occur after quality assessment because, in contrast to regular de-duplication, the sentences in question are not identical; the sentence (and the pair it comes from) may differ in quality from the sentence(s) that make it a duplicate, so we want to keep the best such sentence, not just the one that happened to come first in the original corpus.
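A minimal sketch of this post-filtering pass; we assume pairs arrive sorted from most to least parallel, and that the bigrams in question are target-side token bigrams (an assumption on our part).

def bigram_coverage_filter(scored_pairs):
    # scored_pairs: (source, target) pairs, already sorted from most to
    # least parallel. Keep a pair only if its target side contains at
    # least one bigram not seen in any previously kept pair.
    seen, kept = set(), []
    for src, tgt in scored_pairs:
        tokens = tgt.split()
        bigrams = set(zip(tokens, tokens[1:]))
        if bigrams - seen:
            kept.append((src, tgt))
            seen |= bigrams
    return kept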

3 Mahalanobis ratios for parallelism assessment

As mentioned in §2.3, we performed several unsupervised measurements on each sentence pair; of these, the measurement that best predicted parallelism (on synthetic data and on our small 300-sentence annotated set) was a novel measurement based on squared Mahalanobis distances.

This measurement rests on two insights:

• If sentence vectors (or in our case, sentence-pair vectors) are normally distributed, the probability that we draw a particular vector (or a more extreme vector) is related to the squared Mahalanobis distance via the χ² distribution.

• If the two sentences relate the same information, the probability of drawing the vector for that pair should not be much less than the probability of drawing the individual sentence vectors in isolation.

While Mahalanobis distance is a common statistical measurement, particularly in anomaly detection (e.g. Reed and Yu, 1990), it is not commonly used in machine translation, so we briefly introduce it below.⁴

⁴The following relies heavily on the explanation in Boggs (2014). Note that that explanation is also concerned with the square of the Mahalanobis distance rather than the Mahalanobis distance; it is typical for authors to describe both as "Mahalanobis distance" in prose (cf. Warren et al., 2011, p. 10). It is also typical to use "Mahalanobis distance" to specifically refer to Mahalanobis distance from a point to the mean, although this distance is defined for any two points.
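The first insight above can be illustrated numerically; in the sketch below (ours, assuming scipy; the dimensions and distances are made up), the χ² survival function converts a squared Mahalanobis distance into the probability of a draw at least that extreme. It also foreshadows the floating-point trouble discussed in §4.2: in high dimensions these tail probabilities become vanishingly small.

from scipy.stats import chi2

for dim in (2, 50, 300, 600):
    d2 = 1.5 * dim            # a hypothetical squared Mahalanobis distance
    p = chi2.sf(d2, df=dim)   # P(a draw at least this extreme)
    print(dim, p)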

(a) Euclidean distance    (b) Mahalanobis distance

Figure 1: Euclidean distance to the mean in a multivariate normal distribution is not necessarily related to probability; in figure (a), the red vector, despite being an outlier, is closer to the mean. In figure (b), we have rescaled and decorrelated the distribution; Euclidean distance measured in the resulting space (the Mahalanobis distance) can be related to probability through the χ² distribution.

3.1 Mahalanobis distance

The probability of a draw from a univariate normal distribution can be related to its distance to the mean in terms of standard deviations (the z-score). In a multivariate normal distribution, however, just measuring the Euclidean distance to the mean can lead to incorrect conclusions; visual inspection of Figure 1a illustrates that the red vector, despite being a clear outlier, is nonetheless closer to the mean than the blue vector.

Rather, the appropriate measurement for relating distance to probability is the square of the Mahalanobis distance (Mahalanobis, 1936); for a vector x from a distribution X with covariance Σ and mean µ:

    d^2(x) = (x - µ)^T Σ^{-1} (x - µ)    (1)

This is equivalent to decorrelating and rescaling to unit variance in all dimensions, via the inverse square root of the covariance matrix ("Mahalanobis whitening"), and then measuring the squared Euclidean distance to the mean in the resulting space:

    d^2(x) = (x - µ)^T Σ^{-1/2} Σ^{-1/2} (x - µ)         (2)
           = (Σ^{-1/2} (x - µ))^T (Σ^{-1/2} (x - µ))     (3)
           = ‖Σ^{-1/2} (x - µ)‖_2^2                      (4)

Figure 1b illustrates the same distribution transformed by Σ^{-1/2}; we can see that now the magnitude of the outlier red vector is greater than the magnitude of the blue vector.

As mentioned above, the squared magnitudes can be used to calculate probabilities, but in practice the probabilities were so similar in higher-dimensional spaces as to be identical. There remains the possibility, however, that the magnitudes themselves remain sufficiently informative; this was borne out in practice.

3.2 Calculating the magnitude ratios

We have high-dimensional vectors, trained monolingually, of German and English sentences (§2.1). We consider their joint distribution by simply concatenating their vectors; there is no additional utility here in learning a translation between the monolingual spaces. We recenter the distribution to have zero mean—this simply makes the calculation and presentation easier—and transform the resulting matrix by Σ^{-1/2}.

For each sentence vector pair ⟨l_1, l_2⟩ (after recentering), we consider three vectors in the transformed space:

• the vector e_1 corresponding only to l_1's contribution to the concatenated and transformed vector (as if l_2 = 0)

• the vector e_2 corresponding only to l_2's contribution (as if l_1 = 0)

• the vector e corresponding to the transformation of the concatenation of l_1 and l_2

    e_1 = Σ^{-1/2} (l_1, 0)               (5)
    e_2 = Σ^{-1/2} (0, l_2)               (6)
    e   = Σ^{-1/2} (l_1, l_2) = e_1 + e_2 (7)

The measurement m we are interested in is the squared magnitude of the combined vector, divided by the sum of the squared magnitudes of e_1 and e_2 alone:

    m = ‖e‖_2^2 / (‖e_1‖_2^2 + ‖e_2‖_2^2)    (8)

Roughly speaking: does the sentence pair vector e in Mahalanobis space give more information (expressed in terms of its squared magnitude) than the component sentence vectors e_1 and e_2 do on their own? If so, we consider them unlikely to be parallel. We take the resulting value m to be the ranking (with lower values being better) for the post-filtering step described in §2.4.

Implementation-wise, we do not actually have to concatenate l_1 or l_2 with zeros in order to calculate (5) and (6); we can just multiply l_1 and l_2 by the relevant sub-matrix of Σ^{-1/2}. It is also unnecessary to actually transform the vector corresponding to the concatenation of ⟨l_1, l_2⟩; the result is just the element-wise sum of e_1 and e_2.

In code, this is a very simple calculation (only about 15 lines of Python+NumPy) and efficient (taking only a few minutes for millions of sentences), provided one has enough system memory to calculate it in one fell swoop. A sample implementation is given in Figure 2.
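As an aside (our observation, not part of the original presentation), expanding e = e_1 + e_2 makes explicit why lower values of m indicate parallelism:

    m = ‖e_1 + e_2‖_2^2 / (‖e_1‖_2^2 + ‖e_2‖_2^2)
      = 1 + 2⟨e_1, e_2⟩ / (‖e_1‖_2^2 + ‖e_2‖_2^2),

which lies in [0, 2] and falls below 1 exactly when ⟨e_1, e_2⟩ < 0, i.e. when the concatenated pair lands closer to the mean of the whitened space than the magnitudes of its components alone would predict.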

p            0.1    0.2    0.3    0.4    0.5
Mahalanobis  0.977  0.976  0.974  0.972  0.972
Linear       0.944  0.930  0.920  0.914  0.913
Nonlinear    0.871  0.871  0.897  0.900  0.905

Table 1: Accuracy of distinguishing parallel (i.e., related by a translation matrix T) vs. non-parallel (i.e., random) vectors, from a synthetic dataset of 100,000 pairs of 50-dimensional vectors, plus standard normal additive noise. p represents the proportion of parallel pairs in the dataset.

σ            1.0    2.0    3.0    4.0    5.0
Mahalanobis  .974   .778   .665   .617   .597
Linear       .920   .722   .640   .606   .592
Nonlinear    .897   .658   .600   .586   .582

Table 2: Accuracy of distinguishing parallel (i.e., related by a translation matrix T) vs. non-parallel (i.e., random) vectors, from a synthetic dataset of 100,000 pairs of 50-dimensional vectors and "true" proportion p = 0.3, with varying degrees of additive noise. σ represents the standard deviation of the additive noise added to each of L1 and L2.

import numpy as np

def mahalanobis_whitening(X):
    # inverse square root of the covariance matrix of X
    cov = np.cov(X, rowvar=False)
    inv_cov = np.linalg.inv(cov)
    L, V = np.linalg.eig(inv_cov)
    diag = np.diag(np.sqrt(L))
    return V.dot(diag).dot(V.T)

def ssq(X):
    # row-wise sum of squares
    return np.sum(X*X, axis=1)

def mahalanobis_ratio(L1, L2):
    # recenter each language's vectors, then whiten their concatenation
    L1 -= L1.mean(axis=0)
    L2 -= L2.mean(axis=0)
    L = np.concatenate([L1, L2], axis=1)
    whitener = mahalanobis_whitening(L)
    # multiplying by the relevant sub-matrix of the whitener is equivalent
    # to transforming (l1, 0) and (0, l2); see the note in Section 3.2
    E1 = L1.dot(whitener[:L1.shape[1], :])
    E2 = L2.dot(whitener[L1.shape[1]:, :])
    return ssq(E1 + E2) / (ssq(E1) + ssq(E2))

Figure 2: Sample implementation of the Mahalanobis ratio calculation in Python, for two n × d NumPy arrays representing n samples of d-dimensional sentence vectors for two languages.
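As a hypothetical usage of the Figure 2 functions (the arrays below are random stand-ins for real sentence vectors):

import numpy as np

rng = np.random.default_rng(0)
L1 = rng.normal(size=(1000, 300))  # e.g. German sentence vectors
L2 = rng.normal(size=(1000, 300))  # e.g. English sentence vectors

m = mahalanobis_ratio(L1, L2)  # one score per pair; recenters in place
ranking = np.argsort(m)        # lower m = judged more parallel
print(ranking[:10])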

4 Internal results

4.1 Synthetic data

The unsupervised measurements on the sentence vectors were first tested on purely synthetic data: two sets of random normal vectors L1 and L2, in which some proportion p of vectors in L1 corresponded to L2 via a linear transformation T, and some proportion of vectors did not. We also added some Gaussian noise to each of L1 and L2, so that this transformation would not be perfect (as it would not be in real data). We varied the proportion of "true" pairs, and the amount of additive noise, to test how robust these measurements would be in a variety of noise conditions.

Accuracy measurements on this data were made by thresholding scores so that the top p scores are set to 1.0 and the rest to 0.0.⁵ This is also how we evaluate accuracy during sanity checking, below.

⁵Since the overall task is a ranking task, rather than a classification task, we do not at any point have to set a particular threshold for keeping data; this is a way in which the task at hand is easier than a typical anomaly detection task. We therefore simply use the correct proportion to set the thresholds.

Table 1 contrasts three systems:

1. (Mahalanobis) We perform the Mahalanobis ratio calculation described in §3.2.

2. (Linear) We learn a linear regression between L1 and L2, transform L1 according to the resulting matrix, and measure the cosine similarity between the result and L2.

3. (Nonlinear) System (2), but instead of a linear regression we construct a simple two-layer perceptron with a ReLU nonlinearity.⁶

⁶We did not expect this to outperform the linear version—after all, there is no actual nonlinearity in the relationship between L1 and L2—but nonetheless wanted to see how a nonlinear regression would perform in different noise conditions. We observe, for example, that it does unsurprisingly poorly when only a low proportion p of sentences are related, a condition in which a linear regression performs comparatively well.

In each condition, the Mahalanobis measurement outperformed the other measurements. It may, of course, be that the conditions of this synthetic data are unlike real data—the relationship between the German and English sentence vectors might, for example, be better approximated with a nonlinear relationship—but, given the comparatively robust performance of the Mahalanobis measurement against a variety of noise conditions, we prioritized our development time to exploring it further. A sketch of this synthetic setup follows.
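The sketch below shows how such a synthetic trial can be constructed and scored, reusing mahalanobis_ratio from Figure 2 (our reconstruction; the constants correspond to one cell of the grid in Tables 1 and 2):

import numpy as np

n, d, p, sigma = 100_000, 50, 0.3, 1.0
rng = np.random.default_rng(0)

T = rng.normal(size=(d, d))                  # the "translation" matrix
L1 = rng.normal(size=(n, d))
L2 = L1.dot(T)                               # start with all pairs parallel,
fake = rng.random(n) > p                     # then break a 1-p proportion,
L2[fake] = rng.normal(size=(fake.sum(), d)).dot(T)  # keeping marginals alike
L1 = L1 + sigma * rng.normal(size=(n, d))    # additive noise on both sides
L2 = L2 + sigma * rng.normal(size=(n, d))

m = mahalanobis_ratio(L1.copy(), L2.copy())  # copies: it recenters in place

# Threshold so that the top p (lowest-m) pairs are labelled parallel.
predicted_parallel = m <= np.quantile(m, p)
print(np.mean(predicted_parallel == ~fake))  # accuracy
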
4.2 Sanity checking

We also annotated about 300 random sentence pairs from the target corpus, according to whether we judged them to be parallel or not. We did not tune any parameters to this set, except to make sure that one hyperparameter, the dimensionality of the sentence vectors, did not lead to a numerical underflow condition as dimensionality increased. Many of our initial attempts at measuring probabilities (and log probabilities) of sentence draws in higher dimensions (e.g. higher than 50) led to the differences between probabilities being so small that they could not be distinguished by floating-point representations, leading to a situation in which almost all probabilities were equivalent and no meaningful comparisons could be made, and thus to random performance when ranking sentence pairs. Keeping the measurements in terms of distances, and not converting them to probabilities, did appear to allow fine-grained comparison in higher dimensions, but we wanted to ensure that continuing to increase the dimensionality did not lead to indistinguishable measurements again.

Sanity checking (Table 3) confirmed that higher dimensionality does not necessarily lead to poorer discrimination: while 10-dimensional vectors only led to 44.1% accuracy in discriminating parallel from non-parallel pairs, 300-dimensional vectors gave 63.4% accuracy.

Dimensionality  10    50    100   300
Accuracy        .441  .548  .483  .634

Table 3: Sanity-checking results on 300 annotated sentences, for the Mahalanobis calculation (§3.2) on 10-, 50-, 100-, and 300-dimensional sentence vectors.

It is unclear why 100-dimensional vectors perform more poorly than both 50- and 300-dimensional vectors, but in any case this dataset only has 300 samples and we do not want to put too much stock in the results. The real purpose of this trial was to determine whether the curse of dimensionality affects the Mahalanobis measurement adversely, and it does not appear to do so. We therefore used 300-dimensional vectors in our final submissions.

5 Official Results

Table 4 presents the results of the official evaluation, on seven corpora in four conditions. To help navigate the wall of numbers, keep in mind that we are mostly interested in the top unsupervised system, NRC-seve-bicov, and that each table also presents average scores across the seven corpora, in its bottom right corner.

SMT, 10M-word
            dev.        test
domain      news        news        speech      laws        medical     news          IT
corpus      newstest17  newstest18  iwslt17     Acquis      EMEA        GlobalVoices  KDE         average
top score   23.23 (1)   29.59 (1)   22.16 (1)   21.45 (1)   28.70 (1)   22.67 (1)     25.51 (1)   24.58 (1)
seve-bicov  19.66 (33)  25.96 (32)  18.64 (35)  18.78 (23)  27.94 (5)   20.05 (28)    21.38 (41)  22.13 (29)
mono-bicov  19.61 (35)  25.13 (36)  17.86 (39)  16.59 (35)  24.21 (37)  19.97 (34)    22.07 (37)  20.97 (38)
mono        17.98 (41)  23.49 (41)  16.63 (41)  15.49 (40)  23.09 (40)  18.65 (40)    21.39 (40)  19.79 (41)

SMT, 100M-word
top score   25.80 (1)   31.35 (1)   23.17 (1)   22.51 (1)   31.45 (1)   24.00 (1)     26.93 (1)   26.49 (1)
seve-bicov  25.61 (11)  31.11 (8)   22.84 (10)  22.19 (15)  31.20 (3)   23.67 (10)    26.47 (18)  26.25 (9)
mono-bicov  25.65 (5)   31.12 (5)   22.84 (10)  22.37 (8)   31.11 (7)   23.75 (7)     26.19 (30)  26.23 (10)
mono        25.45 (14)  30.63 (21)  22.72 (20)  22.06 (21)  30.74 (20)  23.70 (9)     26.20 (28)  26.01 (19)

NMT, 10M-word
            dev.        test
domain      news        news        speech      laws        medical     news          IT
corpus      newstest17  newstest18  iwslt17     Acquis      EMEA        GlobalVoices  KDE         average
top score   29.44 (1)   36.04 (1)   25.64 (1)   25.57 (1)   32.72 (1)   26.72 (1)     28.25 (1)   28.62 (1)
seve-bicov  24.49 (27)  30.32 (27)  21.47 (24)  22.57 (15)  31.71 (2)   23.08 (27)    22.89 (27)  25.34 (21)
mono-bicov  23.38 (30)  28.86 (32)  19.33 (34)  19.03 (29)  26.45 (32)  22.03 (32)    23.72 (23)  23.07 (30)
mono        20.83 (35)  24.97 (37)  17.19 (37)  16.57 (38)  23.79 (38)  19.75 (35)    21.85 (31)  20.69 (35)

NMT, 100M-word
top score   32.41 (1)   39.85 (1)   27.43 (1)   28.43 (1)   36.72 (1)   29.26 (1)     30.92 (1)   32.06 (1)
seve-bicov  32.10 (2)   39.39 (7)   27.09 (6)   28.31 (5)   36.30 (10)  28.94 (9)     30.12 (16)  31.69 (8)
mono-bicov  31.67 (9)   38.86 (15)  27.10 (5)   28.15 (9)   35.96 (15)  28.87 (11)    30.41 (11)  31.56 (11)
mono        31.39 (16)  38.42 (21)  26.80 (12)  27.94 (12)  35.71 (21)  28.00 (27)    30.32 (14)  31.20 (19)

Table 4: BLEU scores (and ranking, out of 48 submissions) of NRC's unsupervised submissions: "seve" indicates single-feature (Mahalanobis ratio) parallelism assessment, "mono" indicates parallelism assessment using an unweighted ensemble of unsupervised features, "bicov" indicates that the final bigram coverage step (§2.4) was performed. Results in the top 10 performers are bolded.

In the 100M-word conditions (that is to say, in the conditions where a statistical or neural machine translation system was trained on the top 100M words, as ranked by our filters), we find generally strong performance, with NRC-seve-bicov always performing above the median system and with most results in the top 10 (among 48 submissions).

However, we generally observe weaker downstream MT performance in the 10M conditions, compared to other competitors, performing roughly near the median system in the NMT 10M condition and frequently below the median in the SMT 10M condition. This suggests to us that the unsupervised systems are adequate in finding a 100M-word training set⁷ but relatively poor at sub-selecting higher-quality sentences from that set. We think this may be because our system might have a bias towards picking relatively similar sentences, rather than the more diverse set of sentences that an MT training set needs, which is amplified in the 10M condition.

A surprising exception to this weakness is the European Medicines Agency (EMEA) corpus, on which NRC-seve-bicov is the #5 and #2 system in the SMT 10M and NMT 10M conditions, respectively. This could suggest that competitors are overfitting to the domain(s) of the training data, and performing correspondingly poorly on the out-of-domain EMEA, whereas NRC-seve-bicov cannot overfit in this manner. However, the other NRC unsupervised submissions, which also cannot overfit, have no special advantage on EMEA, and nor does NRC-seve-bicov perform notably well on other out-of-domain corpora in the 10M conditions.

⁷Spot-checking a random sample of sentences suggested to us that there were indeed roughly 100M words' worth of genuinely parallel data, but much of it would not have been particularly informative for machine translation. We therefore interpret the 100M results as representing one's success at identifying parallel data, and the 10M results as representing how well one assesses usefulness-for-MT beyond parallelism.

6 Future research

The unsupervised methods described here seem promising in distinguishing parallel from non-parallel sentence pairs, but we interpret the 10M-word results as suggesting they are comparatively poor at distinguishing other MT-relevant features of sentence-pair quality. Considering bigram coverage (§2.4) appears to help somewhat, but more research is needed into mitigating the tendency of these measurements to prefer an uninteresting selection of sentences.

Also, it is likely that a sentence vector, even a high-dimensional one, is not sufficiently fine-grained to choose the highest-quality pairs; the process described in this paper essentially says that two sentences with sufficiently similar topics are to be considered parallel, even if there is little word-level correlation between the sentences. We therefore intend to investigate a word-level analogue of the sentence-level Mahalanobis ratio measurement.

References

Thomas Boggs. 2014. Whitening characteristics of the Mahalanobis distance. http://blog.bogatron.net/blog/2014/03/11/mahalanobis-whitening/.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, WMT '11, pages 187–197, Stroudsburg, PA, USA. Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. MT Summit 2005.

Philipp Koehn, Kenneth Heafield, Mikel L. Forcada, Miquel Esplà-Gomis, Sergio Ortiz-Rojas, Gema Ramírez Sánchez, Víctor M. Sánchez Cartagena, Barry Haddow, Marta Bañón, Marek Střelec, Anna Samiotou, and Amir Kamran. 2018a. ParaCrawl corpus version 1.0. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Philipp Koehn, Huda Khayrallah, Kenneth Heafield, and Mikel Forcada. 2018b. Findings of the WMT 2018 shared task on parallel corpus filtering. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, Brussels, Belgium. Association for Computational Linguistics.

Chi-kiu Lo, Michel Simard, Darlene Stewart, Samuel Larkin, Cyril Goutte, and Patrick Littell. 2018. Accurate semantic textual similarity for cleaning noisy parallel corpora using semantic machine translation evaluation metric: The NRC supervised submissions to the parallel corpus filtering task. In Proceedings of the Third Conference on Machine Translation (WMT 2018).

Prasanta Chandra Mahalanobis. 1936. On the generalised distance in statistics. Proceedings of the National Institute of Sciences of India, 2:49–55.

Joel Martin, Howard Johnson, Benoît Farley, and Anna Maclachlan. 2003. Aligning and using an English-Inuktitut parallel corpus. In Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, Volume 3, pages 115–118. Association for Computational Linguistics.

Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. 2018. Unsupervised learning of sentence embeddings using compositional n-gram features. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 528–540. Association for Computational Linguistics.

Irving S. Reed and Xiaoli Yu. 1990. Adaptive multiple-band CFAR detection of an optical pattern with unknown spectral distribution. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(10):1760–1770.

Jörg Tiedemann. 2009. News from OPUS: A collection of multilingual parallel corpora with tools and interfaces. In N. Nicolov, K. Bontcheva, G. Angelova, and R. Mitkov, editors, Recent Advances in Natural Language Processing, volume V, pages 237–248. John Benjamins, Amsterdam/Philadelphia, Borovets, Bulgaria.

Rik Warren, Robert F. Smith, and Anne K. Cybenko. 2011. Use of Mahalanobis distance for detecting outliers and outlier clusters in markedly non-normal data: A vehicular traffic example. Technical report, SRA International Inc., Dayton, OH.
