uniblock: Scoring and Filtering Corpus with Unicode Block Information
Yingbo Gao, Weiyue Wang, Hermann Ney
Human Language Technology and Pattern Recognition Group, RWTH Aachen University, Aachen, Germany

The Illegal Character Problem

What are illegal characters?
• emojis, special punctuation marks, full-width characters, mathematical symbols, Greek letters, scientific notations...
• what counts as illegal depends on the task, language and domain

Why remove them from training data?
• drop noisy sentences
• reduce vocabulary size

How is it done so far?
• tiresome and repetitive scripting in a case-by-case fashion
• all WMT18 corpus filtering systems are rule-based

uniblock

What is uniblock?
• a simple and effective statistical alternative to scripting rules
• filters out sentences with too many illegal characters
• task-independent, language-independent and domain-independent

An Overview?
• construct feature vectors based on Unicode block information
• model clean data with a Bayesian Gaussian Mixture Model (BGMM)
• score the corpus with weighted log probability densities
• filter the corpus with various thresholding methods

Constructing Feature Vectors

Unicode block?
• Unicode → the de facto standard for character encoding
• characters with similar origins or functions are grouped into blocks

  0000..007F; Basic Latin
  0080..00FF; Latin-1 Supplement
  ...
  4E00..9FFF; CJK Unified Ideographs
  ...
  (source: the Unicode Consortium)

How exactly is a feature vector constructed? (see the first code sketch below)
• consider "我爱NLP!"
• "我爱" → CJK Unified Ideographs block
• "NLP!" → Basic Latin block
• fixed-width count vector → [4, 0, 0, ..., 0, 2, 0, ...]
• dimensionality reduction → [4, 2]
• normalize the counts → [0.67, 0.33]
• supports custom block ranges: e.g. ASCII digits and punctuation
• additional features: e.g. total character count, word count

Modeling Clean Data

Why clustering instead of classification?
• negatively labelled data is difficult to obtain
  – unrealistic to cover all types of noise with synthetic data
  – not appropriate to deem every sentence in a noisy corpus dirty
• positively labelled data is easy to obtain
  – a development set is often available
  – it can be deemed clean with relatively high confidence

Why BGMM?
• a classic and powerful algorithm to describe data and find clusters
• infers the number of components K from the data
• good convergence with feature vectors in reduced dimensions

  p(e) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(e \mid \mu_k, \Sigma_k)

  \pi_k \sim \mathrm{DP}(\alpha), \quad \mu_k \sim \mathcal{N}(\mu_0, 1/\tau), \quad \Sigma_k \sim \mathcal{W}^{-1}(V, n)

Scoring and Filtering Corpus

How are the scores used for filtering?
• the weighted log probability densities are used directly as scores
• versatile filtering on monolingual or parallel data, with e.g.
  – absolute thresholds
  – relative thresholds
  – the minimum score seen during training
  – linear combinations of scores across languages

Are scores meaningful?
• training sentence pairs from the WMT19 zh-en news translation task
• higher score → fewer illegal (underlined) characters

  zh-en sentence pair                                                          score
  êê¹ÎÈ                                                                    -48523.31
  Open Directory - Unix                                                        53.82
  e·、t洁、k生和清洁技/ÄÇ                                                        67.71
  Salubrité, propreté, hygiène et techniques d'assainissement                -158.57
  Îû统目U→Mn→ 送我的Mn...                                                         0.00
  From System Menu → Location → Send My Location...                             0.00
  (25℃下K定了7Á的;¤b¹Ï和saI)¿.                                                   36.01
  Cation exchange capacity and equilibria isotherm at 25℃ were determined.     40.65
  "¡?V及发展小Ä                                                                  68.63
  Financial Policy and Development Unit                                        53.82
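To make the feature construction described above concrete, here is a minimal Python sketch of per-block character counting and normalization. It is not the released uniblock code: the block table is truncated to a handful of ranges, the dimensionality-reduction step (dropping blocks unseen in the clean data) is omitted, and the helper names block_index and featurize are illustrative.

```python
import bisect

# A few Unicode block ranges (start, end, name); the full table from the
# Unicode Consortium has several hundred entries.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x1F300, 0x1F5FF, "Miscellaneous Symbols and Pictographs"),
]
STARTS = [start for start, _, _ in BLOCKS]


def block_index(ch):
    """Return the index of the block containing ch, or None if not covered."""
    i = bisect.bisect_right(STARTS, ord(ch)) - 1
    if i >= 0 and ord(ch) <= BLOCKS[i][1]:
        return i
    return None


def featurize(sentence):
    """Count characters per Unicode block and normalize to relative frequencies."""
    counts = [0] * len(BLOCKS)
    for ch in sentence:
        i = block_index(ch)
        if i is not None:
            counts[i] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]


print(featurize("我爱NLP!"))  # ~[0.67, 0.0, 0.0, 0.33, 0.0]
```

For the worked example on the poster, "NLP!" contributes 4 Basic Latin characters and "我爱" contributes 2 CJK Unified Ideographs, which normalizes to roughly [0.67, 0.33] after empty blocks are dropped.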
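Clean-data modeling and scoring can be sketched with a standard Bayesian Gaussian mixture implementation; the snippet below assumes scikit-learn's BayesianGaussianMixture is acceptable and reuses the hypothetical featurize helper from the previous sketch. The hyperparameters are placeholders, not the settings used in the paper.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture


def fit_clean_model(clean_sentences, featurize, n_components=10):
    """Fit a BGMM on feature vectors of sentences assumed to be clean,
    e.g. a development set. With a Dirichlet-process prior the model can
    shrink the weights of unneeded components, so n_components is only
    an upper bound on K."""
    X = np.array([featurize(s) for s in clean_sentences])
    bgmm = BayesianGaussianMixture(
        n_components=n_components,
        weight_concentration_prior_type="dirichlet_process",
        covariance_type="full",
        max_iter=500,
    )
    bgmm.fit(X)
    return bgmm


def score_sentences(bgmm, sentences, featurize):
    """Weighted log probability density of each sentence under the mixture;
    higher means the sentence looks more like the clean data."""
    X = np.array([featurize(s) for s in sentences])
    return bgmm.score_samples(X)
```

In this sketch the clean sentences would be a development set, which, as argued above, is often available and can be deemed clean with relatively high confidence.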
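Finally, a sketch of how the thresholding variants listed under "How are the scores used for filtering?" might look; drop_ratio, threshold and the equal weights are arbitrary illustrations, not recommended settings.

```python
import numpy as np


def filter_relative(sentences, scores, drop_ratio=0.3):
    """Relative threshold: keep the highest-scoring (1 - drop_ratio) portion."""
    order = np.argsort(scores)                   # ascending: worst scores first
    keep = order[int(drop_ratio * len(order)):]  # drop the lowest-scoring part
    return [sentences[i] for i in sorted(keep)]  # preserve corpus order


def filter_parallel_absolute(pairs, src_scores, tgt_scores,
                             threshold, weights=(0.5, 0.5)):
    """Linear combination of per-language scores plus an absolute threshold;
    threshold could, e.g., be the minimum combined score seen on the clean
    development data."""
    w_src, w_tgt = weights
    combined = w_src * np.asarray(src_scores) + w_tgt * np.asarray(tgt_scores)
    return [pair for pair, c in zip(pairs, combined) if c >= threshold]
```

A relative threshold of 10-30% corresponds to the "filtered" settings reported in the experiments below.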
Experiments

Language Modeling on WMT19 (Test Perplexity)

  filtered   zh     en      ru
  0%         40.9   135.0   181.0
  10%        36.9   133.5   176.7
  20%        37.1   132.9   180.6
  30%        38.7   131.4   178.6

Sentiment Analysis on STS (Test Accuracy [%])

  filtered   en
  0%         81.6
  1%         82.2

Machine Translation on WMT19 (Case-sensitive BLEU [%])

             zh-en            en-zh            ru-en            en-ru
  filtered   test17  test18   test17  test18   test17  test18   test17  test18
  0%         25.0    24.5     30.1    33.0     32.9    28.3     26.9    23.3
  10%        25.2    25.6     30.9    33.1     33.5    29.3     27.8    23.9
  20%        24.3    25.3     30.3    33.2     33.3    28.9     27.2    23.6
  30%        24.3    24.8     30.2    33.0     32.9    28.1     26.6    23.2

Takeaways?
• a method to obtain meaningful scores for "character legalness"
• an open-sourced tool for corpus filtering

Acknowledgments

This work has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 694537, project "SEQCLAS") and from the Deutsche Forschungsgemeinschaft (DFG; grant agreement NE 572/8-1, project "CoreTec"). The GPU cluster used for the experiments was partially funded by DFG grant INST 222/1168-1. The work reflects only the authors' views and none of the funding agencies is responsible for any use that may be made of the information it contains.

Paper & Code

Human Language Technology and Pattern Recognition Group, RWTH Aachen University, Aachen, Germany
Email: <surname>@i6.informatik.rwth-aachen.de
WWW: http://www.hltpr.rwth-aachen.de