uniblock: Scoring and Filtering Corpus with Block Information

Yingbo Gao, Weiyue Wang, Hermann Ney
Human Language Technology and Pattern Recognition Group
Computer Science Department, RWTH Aachen University, D-52056 Aachen, Germany
@i6.informatik.rwth-aachen.de

Abstract

The preprocessing pipelines in Natural Language Processing (NLP) usually involve a step of removing sentences consisting of illegal characters. The definition of illegal characters and the specific removal strategy depend on the task, language, domain, etc., which often leads to tiresome and repetitive scripting of rules. In this paper, we introduce a simple statistical method, uniblock¹, to overcome this problem. For each sentence, uniblock generates a fixed-size feature vector using the Unicode block information of its characters. A Gaussian mixture model is then estimated on some clean corpus using variational inference. The learned model can then be used to score sentences and filter corpora. We present experimental results on Sentiment Analysis, Language Modeling and Machine Translation, and show the simplicity and effectiveness of our method.

¹ The source code is available at https://github.com/ringoreality/uniblock

1 Introduction

Identification and removal of sentences with illegal characters is a common heuristic in NLP preprocessing pipelines. While it has the benefits of controlling the vocabulary size and dropping noisy data, it is often tedious to come up with appropriate rules and removal strategies when facing different tasks, languages and domains. The lack of a clear definition of what is illegal exacerbates the problem. For example, modern Chinese text may allow characters such as: traditional and simplified Chinese characters, special punctuation marks, full-width characters, emojis, mathematical symbols, Latin characters, currency symbols, scientific notations, etc. As a result, scripting robust rules often requires a considerable amount of time and effort.

In this paper, we introduce a simple statistical method, uniblock, to address the problem. The motivation of our approach is straightforward: good rules may vary greatly depending on the situation, so instead of designing rules by hand, we use "rules" defined by data. We assume that some clean corpus is available, in which all the characters are deemed legal and the character distributions are similar to the test cases. It is then possible to learn a probabilistic model which describes the clean corpus and assigns scores to sentences in another corpus. The scores can further be used for filtering. Since the legality of characters is in question, Unicode block information is a natural choice for obtaining feature vectors. Note that, by designing alternative feature vectors, one can potentially adapt uniblock to implement other corpus filtering heuristics.

We develop uniblock mainly for Machine Translation (MT). It can also be easily applied to other NLP tasks. We present experimental results on Sentiment Analysis (SA), Language Modeling (LM) and MT, and show the simplicity and effectiveness of our method.

2 Related Work

Raw data in NLP is often noisy. Khayrallah and Koehn (2018) categorize five common noise sources in parallel corpora and count only about 23% of the sentences in the raw 2016 ParaCrawl corpus² to be "Okay". Although illegal characters are not listed as a separate noise source, misuse of characters and shifting of character distributions may result in a sentence being classified into one of the five noise sources. In previous work, a supervised model using bag-of-words translation features is developed to classify clean and noisy data (Xu and Koehn, 2017; Khayrallah et al., 2018). In contrast, our model, which is trained in an unsupervised manner, tackles the illegal character problem explicitly.

Koehn et al. (2018) describe a shared task on parallel corpus filtering. While participating systems focus on addressing both monolingual fluency and bilingual adequacy, character-level filtering is common to all submissions. Junczys-Dowmunt (2018) applies a language identification model to implicitly remove sentences with illegal characters. Rossenbach et al. (2018) keep sentences with more than three words, with each word having at least one character from the predefined alphabet of the language. Lu et al. (2018) remove characters outside of a predefined alphabet. Ash et al. (2018) count the most frequent characters, set a cutoff around eighty for each language, and remove sentences with illegal characters. Erdmann and Gwinnup (2018) get rid of lines containing characters from the Unicode general category "Other". Papavassiliou et al. (2018) simply consider Latin Unicode characters to be legal.

Unicode is the de facto standard for encoding characters from various languages, domains and sources (The Unicode Consortium, 2019). It uses "blocks" to group characters with similar origins or functions. The current version 12.0 defines 300 blocks, including Basic Latin, Latin-1 Supplement, CJK (Chinese, Japanese and Korean) Symbols and Punctuation, etc. To identify the legality of characters, the Unicode block information therefore provides meaningful discriminative signals.

The Gaussian Mixture Model (GMM) is a classic algorithm that assumes data is generated from a mixture of a finite number of Gaussian distributions, whose parameters are typically estimated with the Expectation-Maximization (EM) algorithm. An extension to the EM algorithm is variational inference, which has the advantage of automatically choosing the number of components. Bishop (2006) gives a comprehensive introduction to the topic. We use the implementation of variational Bayesian estimation of Gaussian mixtures from scikit-learn (Pedregosa et al., 2011).

² https://paracrawl.eu/index.html

3 Methodology

3.1 Feature Vectors

In order to assign meaningful scores to sentences and eventually filter out those which contain illegal characters, we first need to design appropriate features. We believe the Unicode block information has the useful property that blocks are grouped by origin and function. For instance, CJK Symbols and Punctuation has a dedicated Unicode block in the range U+3000...U+303F. Specifically, we count the appearances of characters in each of the 300 blocks. This results in feature vectors c1, c2, ..., cB with a fixed length of B = 300. If we further normalize them by the total number of characters in each sentence, probability distributions over the 300 blocks are obtained. We observe that only a few blocks have non-zero counts in natural language texts, which calls for dimensionality reduction. Empirically, we drop the zero-count dimensions directly during training and assign conceptually low scores³ when a non-zero count in a dropped dimension is seen during scoring. That is, for training we use normalized feature vectors e = e1, e2, ..., eB′, where the dimensions 1, 2, ..., B′ are those among the original B dimensions whose counts are non-zero.

³ Zeros in our experiments.
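As a rough illustration of this feature extraction (this is not the released uniblock code; the block table below is only a hand-picked subset of the 300 blocks, and the function names are ours), the counting and normalization could look as follows:

```python
# Illustrative sketch of the block-count features (not the released uniblock code).
# BLOCKS is only a tiny subset of the 300 Unicode 12.0 blocks; a full implementation
# would load all ranges, e.g. from the Unicode "Blocks.txt" data file.
from typing import List

BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x3000, 0x303F, "CJK Symbols and Punctuation"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
]

def block_index(char: str) -> int:
    """Index of the block containing `char`, or -1 if the block is not in BLOCKS."""
    code_point = ord(char)
    for i, (low, high, _name) in enumerate(BLOCKS):
        if low <= code_point <= high:
            return i
    return -1

def block_feature_vector(sentence: str) -> List[float]:
    """Count characters per block and normalize to a distribution over blocks."""
    counts = [0.0] * len(BLOCKS)
    for char in sentence:
        index = block_index(char)
        if index >= 0:
            counts[index] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total > 0 else counts

if __name__ == "__main__":
    print(block_feature_vector("财务政策及发展小组"))
    print(block_feature_vector("Financial Policy and Development Unit"))
```

In such a sketch, the dimensions whose counts are zero over the entire clean corpus would then be dropped, leaving the B′ dimensions described above.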

3.2 Bayesian Gaussian Mixture Model

Although many corpora are noisy, it is not appropriate to deem all sentences in them "dirty". While generating synthetic noisy data is always an option, it is unrealistic to cover all types of noise. Compared to the difficulty of obtaining negatively labelled data, the development set is often available and can be deemed "clean" with high confidence. Therefore, we take the development set as training data for our scoring system and treat the problem as a clustering task rather than a classification task. We assume that the training feature vectors e are generated by a mixture of Gaussian distributions:

p(e) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(e \mid \mu_k, \Sigma_k)
\pi_k \sim \mathrm{DP}(\alpha)
\mu_k \sim \mathcal{N}\big(\mu_0, \sqrt{1/\tau}\big)                               (1)
\Sigma_k \sim \mathcal{W}_{B'}(V, n)

In the equation above, k is a running index over the number of mixture components K, \pi_k is the mixture weight of the k-th Gaussian, \mu_k is a B′-dimensional vector parametrizing the k-th mean, and \Sigma_k is the k-th B′ × B′ covariance matrix. We further impose priors on the model parameters: \pi_k follows a Dirichlet Process (DP), which is parametrized by the concentration prior \alpha; \mu_k follows a Normal distribution (\mathcal{N}), which is parametrized by the B′-dimensional mean prior \mu_0 and the precision prior \tau; \Sigma_k follows a Wishart distribution (\mathcal{W}) in B′ dimensions, which is parametrized by the covariance prior V and the degrees-of-freedom prior n. We estimate the model parameters using the EM algorithm. Note that operating in B′ dimensions leads to significantly better model convergence than training in B dimensions.
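For illustration, this estimation step could be sketched with scikit-learn's BayesianGaussianMixture as below. The hyperparameter values and the dummy training data are placeholders, not the settings used in the paper:

```python
# Sketch: variational Bayesian GMM on clean-corpus block features (scikit-learn).
# `train_features` stands in for an (N, B') array of normalized block counts with
# zero-count dimensions removed; all hyperparameter values here are placeholders.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
train_features = rng.dirichlet(alpha=[5.0, 1.0, 0.5], size=1000)  # dummy stand-in data

gmm = BayesianGaussianMixture(
    n_components=8,                      # upper bound; unused components shrink towards zero weight
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1.0,      # concentration prior alpha
    covariance_type="full",
    max_iter=1000,
    random_state=0,
)
gmm.fit(train_features)
print(gmm.weights_)  # effective components chosen by variational inference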

3.3 Scoring and Filtering

Once the model is trained to convergence, it can be used to assign scores to unseen feature vectors. We directly use the weighted log probabilities as scores. Compared to the sentences used during uniblock training, higher-scored sentences have more similar Unicode block count distributions, or fewer illegal characters. Regarding how to remove noisy sentences, we implement two straightforward ideas: absolute thresholding and relative thresholding. The former removes sentences with scores lower than a given threshold, while the latter removes a given percentage of low-scored sentences. We also add support for filtering parallel data. A simple method is to combine the scores of parallel sentences with a specified reduction method (minimum, maximum, average) and then apply thresholds. Alternatively, one can use the lowest score seen during training as an absolute threshold for each language and filter parallel data in a "one bad, remove all" manner.
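A minimal sketch of this scoring and filtering logic, assuming a fitted model as in the previous sketch (the helper names and interfaces are ours, not the released tool's):

```python
# Sketch: scoring and filtering with a fitted mixture model (illustrative helpers).
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def score_sentences(gmm: BayesianGaussianMixture, features: np.ndarray) -> np.ndarray:
    """Weighted log probability of each feature vector under the trained model."""
    return gmm.score_samples(features)

def filter_absolute(scores: np.ndarray, threshold: float) -> np.ndarray:
    """Absolute thresholding: keep indices whose score is at least `threshold`."""
    return np.where(scores >= threshold)[0]

def filter_relative(scores: np.ndarray, drop_fraction: float) -> np.ndarray:
    """Relative thresholding: drop the lowest-scored `drop_fraction` of sentences."""
    cutoff = np.quantile(scores, drop_fraction)
    return np.where(scores >= cutoff)[0]

def filter_parallel(src_scores: np.ndarray, tgt_scores: np.ndarray,
                    threshold: float, reduce=np.minimum) -> np.ndarray:
    """Combine per-side scores (e.g. elementwise minimum) and apply an absolute threshold."""
    combined = reduce(src_scores, tgt_scores)
    return np.where(combined >= threshold)[0]
```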

4 Experiments

We conduct experiments on three NLP tasks: SA, LM and MT. As shown in the related work, many different heuristics are used for illegal character filtering, and there is no generally accepted "golden rule". Therefore, we focus on comparing "no filtering" versus "filtering" to examine how our method performs.

4.1 Sentiment Analysis

We conduct SA experiments on the Stanford Twitter Sentiment (STS) corpus (Go et al., 2009) using fastText (Joulin et al., 2017). Given 1.6M automatically annotated tweets, the task is to predict positive or negative sentiment on 498 test tweets. For the baseline, we use minimal preprocessing to handle casing, weblinks, usernames, hashtags and spacing in the raw training tweets, without explicit character-level filtering. For uniblock, we additionally train a GMM on the tweets in the test set⁴ and use the minimum score as an absolute threshold to filter the training corpus. In total, about 0.9% of the training tweets are filtered out. We observe that only one Unicode block exists in the test set, which means all sentences with characters in other blocks are assigned conceptually low scores and removed. In this particular case, our general method reduces to a simple filtering rule similar to that of Papavassiliou et al. (2018). As shown in Table 1, our method improves the test accuracy over the baseline by 0.6%.

Table 1: Test accuracies on STS.
            test accuracy [%]
baseline    81.6
uniblock    82.2

⁴ No development set is available for STS, which is why we trained the GMM on the test set. For the LM and MT experiments, uniblock is trained on the development sets.

4.2 Language Modeling

In MT pipelines, language models are often used for corpus filtering, fusion techniques and N-best reranking. Therefore, we perform LM experiments on MT corpora. Specifically, we take the monolingual data from the WMT19 (Barrault et al., 2019) Chinese-English (zh-en) and Russian-English (ru-en) parallel corpora as training data. We use newsdev2017 for both sides of zh-en and newstest2016 for both sides of ru-en as development data. We concatenate newstest2017 and newstest2018 as test data for all four sides. The sizes of the corpora are summarized in Table 2.

Table 2: Number of sentence pairs in WMT.
        zh-en    ru-en
train   26M      25M
valid   2K       3K
test    6K       6K

Table 3: Test perplexities on WMT.
        zh-en            ru-en
data    zh      en       ru       en
100%    40.9    135.0    181.0    145.7
90%     36.9    133.5    176.7    147.6
80%     37.1    132.9    180.6    149.3
70%     38.7    131.4    178.6    151.1
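For illustration only, the absolute-threshold variant used in the SA experiment above (cutting at the minimum score of the clean set) could be sketched as follows, reusing the kind of fitted model and feature arrays from the earlier sketches; the names are hypothetical:

```python
# Sketch: use the minimum score on the clean set as an absolute filtering threshold,
# mirroring the SA setup above (illustrative; names and arrays are hypothetical).
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def minimum_clean_score_filter(gmm: BayesianGaussianMixture,
                               clean_features: np.ndarray,
                               corpus_features: np.ndarray) -> np.ndarray:
    """Keep corpus sentences scoring at least the minimum score of the clean set."""
    threshold = gmm.score_samples(clean_features).min()
    corpus_scores = gmm.score_samples(corpus_features)
    return np.where(corpus_scores >= threshold)[0]
```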

Table 4: Several text samples from the zh-en parallel corpus with uniblock scores. We find empirically: sentences with only foreign characters are scored very low; sentences with some foreign characters are scored low; sentences with characters in unseen blocks are scored 0; sentences with only a few foreign characters are scored high; sentences with no illegal characters are scored very high. Foreign (or illegal) characters are underlined.

side   score        parallel text
zh     -48523.31    リリスノト
en     53.82        Open Directory - Unix
zh     67.71        健康、整洁、卫生和清洁技术组织
en     -158.57      Salubrité, propreté, hygiène et techniques d'assainissement
zh     0.00         从系统目录→位置→传送我的位置...
en     0.00         From System Menu → Location → Send My Location...
zh     36.01        在25℃下测定了样品的总交换容量和平衡等温线.
en     40.65        Cation exchange capacity and equilibria isotherm at 25 ℃ were determined.
zh     68.63        财务政策及发展小组
en     53.82        Financial Policy and Development Unit

For each of the four LM subtasks, we train uniblock on the corresponding development set and filter out 10%, 20% and 30% of the raw training corpus. We train 2-layer LSTM language models with a hidden dimension of 1024. For the zh subtask, we train character-level language models and include all seen characters in the vocabulary. For the other three subtasks, the top 100k most frequent words are used as the vocabulary and the remaining words are replaced with unknown tokens. All LM experiments are conducted using the RETURNN framework (Doetsch et al., 2017; Zeyer et al., 2018). As shown in Table 3, we see consistent improvements in the uniblock-filtered setups for zh, for ru, and for en under zh-en. For en under ru-en, the baseline system outperforms the uniblock-filtered ones.

4.3 Machine Translation

We train MT systems to show the simplicity and effectiveness of our method on a more challenging task. We use the same parallel corpora as in Section 4.2 and train systems in four directions: zh-en, en-zh, ru-en and en-ru. The raw training corpora are filtered with uniblock as before. Before examining the performance of the translation systems, we first manually inspect some filtered text samples. Five parallel text samples with uniblock scores are shown in Table 4. We observe that the trained GMM works properly and indeed assigns low scores to unreasonable or "dirty" sentences. We believe that uniblock is superior to other rule-based methods because the relative quality of the sentences is modeled. For example, a hard rule may eliminate a complete sentence pair on encountering even one ℃ symbol, while uniblock only applies a small penalty, because one occurrence of the ℃ symbol does not alter the count-based feature vector e much.

Table 5: Case-sensitive BLEU(%) scores on four different translation tasks.
        zh-en             en-zh
data    test17   test18   test17   test18
100%    25.0     24.5     30.1     33.0
90%     25.2     25.6     30.9     33.1
80%     24.3     25.3     30.3     33.2
70%     24.3     24.8     30.2     33.0

        ru-en             en-ru
data    test17   test18   test17   test18
100%    32.9     28.3     26.9     23.3
90%     33.5     29.3     27.8     23.9
80%     33.3     28.9     27.2     23.6
70%     32.9     28.1     26.6     23.2

For all MT systems, we perform minimal preprocessing to handle whitespace and casing, and use SentencePiece (Kudo and Richardson, 2018) to obtain the subword units. All models adopt the Transformer (Vaswani et al., 2017) architecture and use the exact same hyperparameters as the base model. Trainings are done with the Sockeye (Hieber et al., 2017) toolkit and share the same optimization hyperparameters. No synthetic data is used and no ensembling is done during decoding. The only difference across the models is the training corpus: the baseline model uses the full parallel corpus, while the ones filtered

by uniblock use a subset of the full corpus. We take the checkpoint with the lowest perplexity on the validation set and report case-sensitive BLEU(%) (Papineni et al., 2002) scores on newstest2017 and newstest2018 using the sacreBLEU (Post, 2018) software. The translation qualities of the systems are shown in Table 5. As can be seen, without altering the architecture or the optimization process, applying uniblock as a corpus filtering step alone leads to consistent improvements over the baseline systems in all four directions, by up to 1.1 BLEU(%).

Linguistically, the zh-en and ru-en language pairs are rather distant. For more closely related language pairs such as French and English, our method should also perform well. As shown in Table 4, a Chinese sentence can be corrupted with Japanese characters and an English sentence can be corrupted with French characters. In both cases, our method is able to discriminate the illegal characters and assign low scores to these sentences. Note that our method would fail if "clean" and "dirty" sentences share exactly one and the same block, because after normalization all the feature vectors would be essentially the same and therefore indistinguishable. In this case, one should carefully design other features so that the GMM is able to assign meaningful scores. However, when "clean" and "dirty" sentences share the same set of blocks, our method still works, because after normalization the empirical distributions are probably different.

5 Conclusion

In this work, we develop a simple and effective statistical method called uniblock. Using Unicode block information as feature vectors, a GMM is estimated with variational inference on some clean corpus, which can then be used to score and filter corpora. We release our implementation, which supports parallel corpus filtering and different thresholding strategies. Our experiments show concrete and consistent improvements in SA, LM and MT. We believe that the method can be extended and improved by examining other dimensionality reduction methods and alternatives to the GMM, and by introducing other heuristics into the feature vector, such as sentence length and punctuation marks⁵.

⁵ Common punctuation marks lie in the same Unicode block, U+0000..U+007F, as the English alphabet and are currently not separated.

Acknowledgements

This work has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 694537, project "SEQCLAS") and from the Deutsche Forschungsgemeinschaft (DFG; grant agreement NE 572/8-1, project "CoreTec"). The GPU computing cluster was supported by DFG under grant INST 222/1168-1 FUGG.

References

Tom Ash, Remi Francis, and Will Williams. 2018. The Speechmatics parallel corpus filtering system for WMT18. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 853–859.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 Conference on Machine Translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.

Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

Patrick Doetsch, Albert Zeyer, Paul Voigtlaender, Ilia Kulikov, Ralf Schlüter, and Hermann Ney. 2017. RETURNN: The RWTH extensible training framework for universal recurrent neural networks. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5345–5349. IEEE.

Grant Erdmann and Jeremy Gwinnup. 2018. Coverage and cynicism: The AFRL submission to the WMT 2018 parallel corpus filtering task. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 872–876.

Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(12):2009.

Felix Hieber, Tobias Domhan, Michael Denkowski, David Vilar, Artem Sokolov, Ann Clifton, and Matt Post. 2017. Sockeye: A toolkit for neural machine translation. arXiv preprint arXiv:1712.05690.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431. Association for Computational Linguistics.

Marcin Junczys-Dowmunt. 2018. Dual conditional cross-entropy filtering of noisy parallel corpora. arXiv preprint arXiv:1809.00197.

Huda Khayrallah and Philipp Koehn. 2018. On the impact of various types of noise on neural machine translation. arXiv preprint arXiv:1805.12282.

Huda Khayrallah, Hainan Xu, and Philipp Koehn. 2018. The JHU parallel corpus filtering systems for WMT 2018. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 896–899.

Philipp Koehn, Huda Khayrallah, Kenneth Heafield, and Mikel L. Forcada. 2018. Findings of the WMT 2018 shared task on parallel corpus filtering. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 726–739.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.

Jun Lu, Xiaoyu Lv, Yangbin Shi, and Boxing Chen. 2018. Alibaba submission to the WMT18 parallel corpus filtering task. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 917–922.

Vassilis Papavassiliou, Sokratis Sofianopoulos, Prokopis Prokopidis, and Stelios Piperidis. 2018. The ILSP/ARC submission to the WMT 2018 parallel corpus filtering shared task. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 928–933.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Nick Rossenbach, Jan Rosendahl, Yunsu Kim, Miguel Graça, Aman Gokrani, and Hermann Ney. 2018. The RWTH Aachen University filtering system for the WMT 2018 parallel corpus filtering task. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 946–954.

The Unicode Consortium. 2019. The Unicode Standard, Version 12.0.0. http://www.unicode.org/versions/Unicode12.0.0/.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Hainan Xu and Philipp Koehn. 2017. Zipporah: a fast and scalable data cleaning system for noisy web-crawled parallel corpora. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2945–2950.

Albert Zeyer, Tamer Alkhouli, and Hermann Ney. 2018. RETURNN as a generic flexible neural toolkit with application to translation and speech recognition. In Annual Meeting of the Association for Computational Linguistics, pages 128–133, Melbourne, Australia.
