uniblock: Scoring and Filtering Corpus with Unicode Block Information
Yingbo Gao, Weiyue Wang, Hermann Ney
Human Language Technology and Pattern Recognition Group, RWTH Aachen University, Aachen, Germany

The Illegal Character Problem

What are illegal characters?
• emojis, special marks, full-width characters, mathematical symbols, Greek letters, scientific notations...
• depends on task, language and domain

Why remove them from training data?
• drop noisy sentences
• reduce vocabulary size

How is it done so far?
• tiresome and repetitive scripting in a case-by-case fashion
• all WMT18 corpus filtering systems are rule-based

Why clustering instead of classification?
• difficult to obtain negatively labelled data
  – unrealistic to cover all types of noise with synthetic data
  – not appropriate to deem all sentences in a noisy corpus dirty
• easy to obtain positively labelled data
  – a development set is often available
  – it can be deemed clean with relatively high confidence

uniblock

What is uniblock?
• a simple and effective statistical alternative to scripting rules
• filters out sentences with too many illegal characters
• task-independent, language-independent and domain-independent

An Overview?
• construct feature vectors based on Unicode block information
• model clean data with a Bayesian Gaussian Mixture Model (BGMM)
• score the corpus with weighted log probability densities
• filter the corpus with various thresholding methods

Constructing Feature Vectors

Unicode block?
• Unicode → the de facto standard for character encoding
• characters with similar origins or functions are grouped in blocks

  0000..007F; Basic Latin
  0080..00FF; Latin-1 Supplement
  ...
  4E00..9FFF; CJK Unified Ideographs
  ...
  (source: the Unicode standard)

How exactly is a feature vector constructed?
• consider "我爱NLP!"
• "我爱" → CJK Unified Ideographs block
• "NLP!" → Basic Latin block
• fixed-width count vector → [4, 0, 0, ..., 0, 2, 0, ...]
• dimensionality reduction → [4, 2]
• normalize the counts → [0.67, 0.33]
• supports custom block ranges: e.g. ASCII digits and ...
• additional features: e.g. total character count, word count
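As a rough illustration of the construction above, a minimal Python sketch follows. It is not the uniblock implementation: the BLOCKS list covers only two Unicode blocks (which here also stands in for the dimensionality-reduction step), and block_index / feature_vector are illustrative names.

```python
# Minimal sketch of the per-block count features described above.
# Not the uniblock implementation: only two blocks are listed, and the
# function names are illustrative.

# (start, end, name) ranges in the format of the Unicode Blocks.txt file
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
]

def block_index(ch):
    """Return the index of the block containing ch, or None if unlisted."""
    cp = ord(ch)
    for i, (start, end, _name) in enumerate(BLOCKS):
        if start <= cp <= end:
            return i
    return None

def feature_vector(sentence):
    """Count characters per block, then normalize the counts to sum to 1."""
    counts = [0] * len(BLOCKS)
    for ch in sentence:
        i = block_index(ch)
        if i is not None:
            counts[i] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]

print(feature_vector("我爱NLP!"))  # ≈ [0.67, 0.33]: 4 Basic Latin characters, 2 CJK characters
```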

Modeling Clean Data

Why BGMM?
• classic and powerful algorithm to describe data and find clusters
• infer number of components K from data
• good convergence with feature vectors in reduced dimensions

  p(e) = Σ_{k=1}^{K} π_k N(e | µ_k, Σ_k)

  priors:  π_k ∼ DP(α),   µ_k ∼ N(µ_0, (1/τ) B_0),   Σ_k ∼ W(V, n)
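A minimal sketch of fitting such a model on clean dev-set feature vectors, using scikit-learn's BayesianGaussianMixture as a stand-in; the toy data, the component count and the other hyperparameters are assumptions, not the authors' settings.

```python
# Sketch: fit a Bayesian GMM on feature vectors of a clean dev set and use the
# weighted log probability density of each sentence as its score.
# Toy data and hyperparameters are illustrative, not the paper's.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# toy "clean" feature vectors: mostly-Latin and mostly-CJK sentences
dev_features = np.vstack([
    rng.normal([0.9, 0.1], 0.03, size=(200, 2)),
    rng.normal([0.1, 0.9], 0.03, size=(200, 2)),
])

bgmm = BayesianGaussianMixture(
    n_components=8,                                        # upper bound on K
    weight_concentration_prior_type="dirichlet_process",   # DP prior on the mixing weights
    covariance_type="full",
    max_iter=500,
    random_state=0,
).fit(dev_features)

print(bgmm.weights_.round(2))               # unused components shrink towards zero weight
scores = bgmm.score_samples(dev_features)   # log p(e), used as the sentence score
print(scores[:3].round(2))
```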

Scoring and Filtering Corpus

How are the scores used for filtering?
• the weighted log probability densities are directly used as scores
• versatile filtering on monolingual or parallel data, with e.g.
  – absolute thresholds
  – relative thresholds
  – the minimum scores seen during training
  – linear combinations of scores across languages
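The thresholding options above could be realized roughly as in the following sketch; filter_relative, the toy scores and the simple sum across languages are illustrative assumptions, not the uniblock API.

```python
# Sketch of corpus filtering with per-sentence scores; illustrative only.
# filter_relative drops the lowest `drop_ratio` fraction of sentence pairs,
# combining the two language sides by a simple sum of their scores.
import numpy as np

def filter_relative(pairs, src_scores, tgt_scores, drop_ratio=0.1):
    """Keep the best-scoring (1 - drop_ratio) fraction of sentence pairs."""
    combined = np.asarray(src_scores) + np.asarray(tgt_scores)  # combine scores across languages
    cutoff = np.quantile(combined, drop_ratio)                  # relative threshold
    return [pair for pair, score in zip(pairs, combined) if score >= cutoff]

# toy example: one clean pair, one pair with a very noisy source side
pairs = [("clean source", "clean target"), ("noisy ☂☂☂ source", "clean target")]
kept = filter_relative(pairs, src_scores=[60.0, -500.0], tgt_scores=[55.0, 50.0], drop_ratio=0.5)
print(kept)  # the low-scoring pair is dropped; an absolute threshold (score >= c) works analogously
```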

Are scores meaningful?
• training sentence pairs from WMT19 zh-en news translation
• higher score → fewer illegal characters (underlined in the original poster table)

zh-en sentence pair | score
リリスノト | -48523.31
Open Directory - Unix | 53.82
健康、整洁、卫生和清洁技术组织 | 67.71
Salubrité, propreté, hygiène et techniques d'assainissement | -158.57
从系统目录→位置→传送我的位置... | 0.00
From System Menu → Location → Send My Location... | 0.00
在25℃下测定了样品的总交换容量和平衡等温线. | 36.01
Cation exchange capacity and equilibria isotherm at 25℃ were determined. | 40.65
财务政策及发展小组 | 68.63
Financial Policy and Development Unit | 53.82

Experiments

Language Modeling on WMT19 (Test Perplexity)
filtered | zh | en | ru
0% | 40.9 | 135.0 | 181.0
10% | 36.9 | 133.5 | 176.7
20% | 37.1 | 132.9 | 180.6
30% | 38.7 | 131.4 | 178.6

Sentiment Analysis on STS (Test Accuracy [%])
filtered | en
0% | 81.6
1% | 82.2

Machine Translation on WMT19 (Case-sensitive BLEU [%])
filtered | zh-en test17 | zh-en test18 | en-zh test17 | en-zh test18 | ru-en test17 | ru-en test18 | en-ru test17 | en-ru test18
0% | 25.0 | 24.5 | 30.1 | 33.0 | 32.9 | 28.3 | 26.9 | 23.3
10% | 25.2 | 25.6 | 30.9 | 33.1 | 33.5 | 29.3 | 27.8 | 23.9
20% | 24.3 | 25.3 | 30.3 | 33.2 | 33.3 | 28.9 | 27.2 | 23.6
30% | 24.3 | 24.8 | 30.2 | 33.0 | 32.9 | 28.1 | 26.6 | 23.2

Takeaways?
• a method to get meaningful scores for "character legalness"
• an open-sourced tool for corpus filtering

Paper & Code

Acknowledgments
This work has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 694537, project "SEQCLAS") and the Deutsche Forschungsgemeinschaft (DFG; grant agreement NE 572/8-1, project "CoreTec"). The GPU cluster used for the experiments was partially funded by DFG Grant INST 222/1168-1. The work reflects only the authors' views and none of the funding agencies is responsible for any use that may be made of the information it contains.

Human Language Technology and Pattern Recognition, RWTH Aachen University, Aachen, Germany
Email: @i6.informatik.rwth-aachen.de
WWW: http://www.hltpr.rwth-aachen.de