ClassifierGuesser: A Context-based Classifier Prediction System for Chinese Language Learners

Nicole Peinelt (1,2) and Maria Liakata (1,2) and Shu-Kai Hsieh (3)
(1) The Alan Turing Institute, London, UK
(2) Department of Computer Science, University of Warwick, Coventry, UK
(3) Graduate Institute of Linguistics, National Taiwan University, Taipei, Taiwan
{n.peinelt, m.liakata}@warwick.ac.uk, [email protected]

Abstract

Classifiers are function words that are used to express quantities in Chinese and are especially difficult for language learners. In contrast to previous studies, we argue that the choice of classifiers is highly contextual, and we train context-aware machine learning models based on a novel publicly available dataset, outperforming previous baselines. We further present use cases for our database and models in an interactive demo system.

1 Introduction

Languages such as Chinese are characterized by the existence of a class of words commonly referred to as 'classifiers' or 'measure words'. Based on syntactic criteria, classifiers are the obligatory component of a quantifier phrase which is contained in a noun phrase or verb phrase.[1] Semantically, a classifier modifies the quantity or frequency of its head word and requires a certain degree of shared properties between classifier and head. Although native speakers select classifiers intuitively, language learners often struggle with the correct usage of classifiers due to the lack of a similar word class in their native language. Moreover, no dictionary or finite set of rules covers all possible classifier-head combinations exhaustively.

Previous research has focused on associations between classifiers and nominal head words in isolation and has included approaches based on ontologies (Mok et al., 2012; Morgado da Costa et al., 2016), databases with semantic features of Chinese classifiers (Gao, 2011), as well as an SVM with syntactic and ontological features (Guo and Zhong, 2005). However, without any context, classifier assignment can be ambiguous. For instance, the noun 球 'ball' can be modified by ke (颗), a classifier for round objects, when referring to the object itself as in (1), but requires the event classifier chang (场) in the context of a ball match as in (2). We argue that context is an important factor for classifier selection, since a head word may have multiple associated classifiers, but the final classifier selection is restricted by the context.

(1) 一 颗 红色 的 球
    one ke red DE ball
    'a red ball'

(2) 一 场 精彩 的 球
    one chang exciting DE ball
    'an exciting match'

This study introduces a large-scale dataset of everyday Chinese classifier usage for machine learning experiments. We present a model that outperforms previous frequency and ontology baselines for classifier prediction without the need for extensive linguistic preprocessing and head word identification. We further demonstrate the usefulness of the database and our models in use cases.

[1] Following Huang (1998) and 何杰 (2008), we include verbal as well as nominal classifiers.

2 System Design

Figure 1: Overview of the proposed system. (The diagram shows Chinese corpora passing through a preprocessing pipeline of sentence filtering, parsing, classifier extraction and head extraction into the Chinese Classifier Database, which feeds the prediction models and the interactive web interface.)

Figure 1 gives an overview of our system. It comprises data collection, pre-processing and the compilation of the Chinese Classifier Database (section 2.1), the training of classifier prediction models (section 2.2), and the interactive online interface (section 3).

2.1 The Chinese Classifier Database

The database is based on three openly available POS-tagged Chinese language corpora: the Lancaster Corpus of Mandarin Chinese (McEnery and Xiao, 2004), the UCLA Corpus of Written Chinese (Tao and Xiao, 2012) and the Leiden Weibo Corpus (van Esch, 2012). Sentences from the corpora were assigned unique ids, filtered for the occurrence of classifier POS tags and cleaned in a number of filtering steps in order to improve the data quality (Table 1). We further parsed the remaining sentences with the Stanford constituent parser (Levy and Manning, 2003) and extracted the head of the classifier in each sentence based on the parse tree.[2] By manually evaluating 100 randomly sampled sentences from the database, we estimate a classifier identification accuracy of 91% and a head identification accuracy of 78%. Based on our observations, most errors are due to accumulating tokenisation, tagging and parsing errors, as well as elliptic classifier usage. In addition to the example sentences, we also included lexical information from CC-CEDICT[3] for the 176 unique classifier types.

Applied filters                                               Sentences   %
None (initial corpus)                                         2,258,003   100
1. duplicate sentence                                         1,553,430   69
2. <4 or >60 tokens in sentence                               1,470,946   65
3. classifiers consisting of letters/numbers;
   or <70% of Chinese material in sentence                    1,437,491   64
4. tagged classifiers are in fact measure units (e.g. 毫米)    1,150,749   51
5. classifiers with <10 examples                              1,109,871   49
6. classifier fails manual check                              1,103,338   49
7. frequent error patterns                                    1,083,135   48
8. multiple classifiers in a single sentence                    858,472   38

Table 1: Number of remaining sentences in the database after each filtering step. Matching sentences are excluded.

[2] Starting from the position of the classifier, we move one node up in the tree at a time until reaching a noun or verb phrase and extract its head word.
[3] https://cc-cedict.org/
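Footnote 2 describes the head extraction procedure only in prose. The following is a minimal sketch of that idea, assuming NLTK-style constituency trees; the toy bracketing, the tag set and the head heuristic (last nominal or verbal leaf of the phrase) are illustrative assumptions rather than the authors' actual implementation.

```python
# Minimal sketch of the head extraction step from footnote 2, assuming an
# NLTK-style constituency tree over a tokenised sentence. The toy bracketing
# and the head heuristic are illustrative, not the authors' exact code.
from nltk.tree import Tree

def extract_head(tree, classifier_leaf_index):
    """Walk upwards from the classifier's leaf until a noun phrase (NP) or
    verb phrase (VP) is reached, then return a head word for that phrase."""
    pos = tree.leaf_treeposition(classifier_leaf_index)[:-1]  # pre-terminal of the classifier
    while len(pos) > 0:
        pos = pos[:-1]  # move one node up in the tree
        subtree = tree[pos]
        if subtree.label() in ("NP", "VP"):
            # Heuristic head: the last nominal/verbal leaf inside the phrase
            for word, tag in reversed(subtree.pos()):
                if tag.startswith("N") or tag.startswith("V"):
                    return word
            return subtree.leaves()[-1]
    return None

# Toy parse of example (1), 一 颗 红色 的 球 ('a red ball'); the classifier 颗 is leaf 1
toy = Tree.fromstring(
    "(NP (QP (CD 一) (CLP (M 颗))) (ADJP (JJ 红色)) (DEG 的) (NN 球))")
print(extract_head(toy, 1))  # -> 球
```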
2.2 Classifier Prediction

2.2.1 Task

Following the only previous machine learning approach (Guo and Zhong, 2005), we frame classifier prediction as a multi-class classification problem. However, in contrast to previous work that focused on word-based classifier prediction, we adapt the prediction task to a sentence-based scenario, which is a more natural and less ambiguous task than predicting classifiers without context. Not all sentences in the Chinese Classifier Database contain head words, due to co-referential and anaphoric usage. Hence, we query the database for sentences in which both the head word and the corresponding classifier were identified, resulting in 681,102 sentences. This subset is randomly split into a training (50%), development (25%) and test set (25%). In each sentence with an identified classifier and corresponding head word, we substitute the classifier with a gap token and use the classifier as its class label. For example, the tagged sentence 我们是一[家]。 ('we are one family') is transformed into the training example 我们是一___。 with the label '家'. Labels are simplified from tokens to types by reducing duplicated classifiers (e.g. 个个 → 个) and mapping traditional characters to simplified characters (e.g. 個 → 个), resulting in a dataset[4] with 172 distinct classes.[5] Given a training set of observed sentences and classifiers, the task is to fill the gap in a sentence with the most appropriate classifier.

[4] We make our dataset publicly available at https://github.com/wuningxi/ChineseClassifierDataset.
[5] The number of unique classifiers differs from the full database because only example sentences with identified head words are taken into account.
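The transformation from a tagged sentence to a gap-filling training example can be sketched as follows. The "<GAP>" token and the tiny traditional-to-simplified mapping table are placeholders for illustration and do not reflect the exact conventions of the released dataset.

```python
# Illustrative conversion of a tagged sentence into a (gapped sentence, label)
# training pair. "<GAP>" and the small normalisation table are assumptions
# made for this sketch, not the released dataset's exact scheme.
TRAD_TO_SIMP = {"個": "个", "場": "场", "顆": "颗"}  # toy traditional -> simplified map

def normalise_label(classifier):
    """Reduce a classifier token to its type: collapse reduplication
    (e.g. 个个 -> 个) and map traditional to simplified characters."""
    if len(classifier) == 2 and classifier[0] == classifier[1]:
        classifier = classifier[0]
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in classifier)

def make_example(tokens, classifier_index):
    """Replace the classifier token with a gap token and return the gapped
    sentence together with the normalised classifier label."""
    label = normalise_label(tokens[classifier_index])
    gapped = tokens[:classifier_index] + ["<GAP>"] + tokens[classifier_index + 1:]
    return gapped, label

# e.g. 我们 是 一 家 。 with the classifier 家 at index 3
sentence, label = make_example(["我们", "是", "一", "家", "。"], 3)
print(sentence, label)  # ['我们', '是', '一', '<GAP>', '。'] 家
```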

2.2.2 Baseline approaches

As previous studies have evaluated their algorithms on individually collected, unpublished data, we implement the following baselines to compare our models with previous results:

• ge: always assign the universal and most common noun classifier 个 (Guo and Zhong, 2005; Morgado da Costa et al., 2016).

• pairs: assign the classifier most frequently observed in combination with the head word during training; assign 个 for unseen head words (Guo and Zhong, 2005). A short sketch of this baseline is given after the list.

• concepts: assign classifiers based on classifier-concept pair counts using the Chinese Open Wordnet, assigning 个 for unseen words (Morgado da Costa et al., 2016).
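The pairs baseline can be implemented in a few lines. The sketch below assumes the training data has already been reduced to (head word, classifier) pairs; the class and variable names are illustrative.

```python
# Minimal sketch of the "pairs" baseline: predict the classifier most often
# observed with a head word in training, falling back to 个 for unseen heads.
from collections import Counter, defaultdict

class PairsBaseline:
    def __init__(self, fallback="个"):
        self.fallback = fallback
        self.counts = defaultdict(Counter)  # head word -> classifier counts

    def fit(self, head_classifier_pairs):
        for head, classifier in head_classifier_pairs:
            self.counts[head][classifier] += 1
        return self

    def predict(self, head):
        if head in self.counts:
            return self.counts[head].most_common(1)[0][0]
        return self.fallback

baseline = PairsBaseline().fit([("球", "颗"), ("球", "颗"), ("球", "场"), ("人", "个")])
print(baseline.predict("球"))    # 颗 (most frequent classifier for this head)
print(baseline.predict("电影"))  # 个 (unseen head word)
```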
2.2.3 Context-based models

Previous approaches predominantly rely on ontological resources, which require considerable human effort to build and maintain, resulting in limited coverage for new words and domains. We instead use distributed representations to capture word similarity based on syntactic behaviour, as they can be trained without supervision on a large scale and are easily adapted to new language material. We train word embeddings with word2vec (Mikolov et al., 2013) on sentences from the original three corpora and also obtain pre-trained word embeddings from Bojanowski et al. (2017). The pre-trained embeddings consistently achieve better results and are hence used in all subsequent experiments.

Since the head word is linguistically the most important factor for classifier selection, we first train two widely used machine learning models (SVM, Logistic Regression) on the embedding vector of the head word (head). In order to investigate to what extent context may help with classifier prediction, we then gradually add more contextual features to the models: with the motivation of reducing head word ambiguity, we include embedding vectors of words within a window of size n=2 around the head word (cont_h). Furthermore, we add embedding vectors of words surrounding the classifier gap (cont_cl) to capture the typical immediate environment of different classifiers. As preliminary experiments indicate that increasing the window size to n>2 raises computation costs without significant performance gains, a better approach to including more context is needed. We hence use a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) to encode the entire sentence, excluding any head word annotation (cont_s), and predict classifiers based on the last hidden state (Figure 2).

Figure 2: LSTM architecture for context-based classifier prediction.
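As a rough illustration of the sentence-level cont_s model, the Keras sketch below encodes the gapped sentence with a bidirectional LSTM and predicts the classifier from its final states. The vocabulary size, embedding initialisation and layer sizes are placeholders (the hidden size and dropout rate are drawn from the search space in Table 2 but are not necessarily the final settings); only the optimiser and learning rate follow the values reported in the results.

```python
# Rough sketch of the sentence-level (cont_s) model: a bidirectional LSTM over
# the gapped sentence with a softmax over the classifier classes. Sizes below
# are placeholders, not the tuned configuration from the paper.
from tensorflow.keras import Input, layers, models, optimizers

VOCAB_SIZE = 50000   # placeholder vocabulary size
EMB_DIM = 300        # dimensionality of the pre-trained embeddings
MAX_LEN = 60         # sentences longer than 60 tokens were filtered out
NUM_CLASSES = 172    # distinct classifier types in the dataset

model = models.Sequential([
    Input(shape=(MAX_LEN,)),
    # The paper uses pre-trained embeddings; a randomly initialised
    # embedding layer is used here for brevity.
    layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True),
    # Bidirectional LSTM; the concatenated final states feed the classifier.
    layers.Bidirectional(layers.LSTM(320, dropout=0.25)),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(
    optimizer=optimizers.Adam(learning_rate=0.001),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()
```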

2.2.4 Results

We report micro F1 (accuracy) and macro F1 scores for each model after hyper-parameter tuning in Table 3. The head-classifier combination baseline (pairs) gives a strong result, which the SVM and Logistic Regression models trained on only the head word embedding vector cannot surpass: global corpus statistics on classifiers outperform the local information captured by the word embeddings in this case. Adding head word context features successfully reduces the ambiguity of head words and results in a significant improvement over the baseline. Including contextual features of the classifier gap instead slightly decreases performance, but these models still outperform the context-unaware ones. The best model is the LSTM, which achieves a micro F1 of 71.51 and a macro F1 of 30.56 on the test set based on the full sentence context, without the need for head word identification (hyper-parameters as reported in Table 2; optimiser: Adam, learning rate: 0.001).

Parameter      Values
Hidden units   160, 224, 320, 384, 480
Dropout rate   0.0, 0.25, 0.5
Batch size     32, 64, 96, 128

Table 2: Tuned hyper-parameters for the LSTM. Terms in bold represent final settings.

                        Micro F1            Macro F1
          Features      dev      test       dev      test
baseline  ge            45.12    45.21      0.36     0.37
          pairs         61.82    61.72      24.40    23.80
          concepts      49.08    49.11      8.40     7.94
svm       head          53.67    53.72      13.33    13.56
          +cont_h       66.02    66.02      24.86    24.39
          +cont_cl      58.97    58.83      22.23    21.75
logreg    head          57.61    57.72      15.99    15.66
          +cont_h       67.81    67.67      28.95    27.37
          +cont_cl      67.43    67.29      27.51    26.70
lstm      cont_s        71.69    71.51      31.56    30.56

Table 3: Model performance on the classifier prediction task (logreg = Logistic Regression).
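To make the two scores concrete, the small example below uses scikit-learn (an assumption about tooling; the paper does not state which implementation was used): micro F1 equals accuracy in this single-label setting, while macro F1 averages per-classifier F1 scores and therefore weights rare classifiers equally.

```python
# Micro-averaged F1 equals accuracy in a single-label multi-class setting,
# while macro-averaged F1 gives every classifier class equal weight.
# scikit-learn is used here purely for illustration.
from sklearn.metrics import f1_score

y_true = ["个", "个", "家", "颗"]
y_pred = ["个", "家", "家", "颗"]

print(f1_score(y_true, y_pred, average="micro"))  # 0.75 (= accuracy)
print(f1_score(y_true, y_pred, average="macro"))  # ~0.78
```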

3 Use Cases

When learning new classifiers, Chinese language learners can obtain frequency statistics from the online interface of the Chinese Classifier Database[6] to focus on the most commonly used and most important classifiers. Learners can explore a visualisation of frequently used classifier-head word combinations in an interactive bar plot (Figure 3, left), which displays example sentences from the database when clicking on the bars. Furthermore, the ClassifierGuesser (Figure 3, right) can be used when learners want to compose a sentence but do not know the appropriate classifier. After inputting a sentence with a gap, the system predicts the best classifier candidate based on the pairs baseline and the best LSTM model.

[6] chinese-classifier-database.azurewebsites.net

Figure 3: Screenshot of the classifier-head pair visualisation (left) and the ClassifierGuesser (right).

4 Conclusion

This paper introduced a system for predicting Chinese classifiers in a sentence. Based on a novel dataset of example sentences with authentic usage of Chinese classifiers, we conducted multiple machine learning experiments and found that incorporating context improves Chinese classifier prediction over word-based models. Our best model clearly outperforms the baselines and does not require manual feature engineering or extensive preprocessing. We argue that including contextual features can help resolve ambiguities and that context-based classifier prediction is a more realistic task than isolated head word-based prediction. We further presented an interactive web system to access our database and pre-trained models, and demonstrated possible use cases for language learners.

Acknowledgments

This work was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1, Microsoft Azure and the German Academic Exchange Service (DAAD).

References

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Helena Hong Gao. 2011. E-learning design for Chinese classifiers: Reclassification. Communications in Computer and Information Science, 177:186–199.

Hui Guo and Huayan Zhong. 2005. Chinese classifier assignment using SVMs. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

C.-T. James Huang. 1998. Logical Relations in Chinese and the Theory of Grammar. Taylor & Francis, New York & London.

Roger Levy and Christopher Manning. 2003. Is it harder to parse Chinese, or the Chinese Treebank? In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 439–446.

Anthony McEnery and Zhonghua Xiao. 2004. The Lancaster Corpus of Mandarin Chinese: A corpus for monolingual and contrastive language study.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Hazel Mok, Shu Wen, Gao Huini Eshley, and Francis Bond. 2012. Using WordNet to predict numeral classifiers in Chinese and Japanese. In Proceedings of the 6th International Global WordNet Conference (GWC 2012), pages 264–271.

Luis Morgado da Costa, Francis Bond, and Helena Gao. 2016. Mapping and generating classifiers using an open Chinese ontology. In Proceedings of the 8th Global WordNet Conference, pages 247–254.

Hongyin Tao and Richard Xiao. 2012. The UCLA Chinese Corpus (2nd edition). UCREL.

Daan van Esch. 2012. Leiden Weibo Corpus. http://lwc.daanvanesch.nl/.

何杰. 2008. 现代汉语量词研究: 增编版 [A study of measure words in modern Chinese: expanded edition]. 北京语言大学出版社 [Beijing Language and Culture University Press], 北京市 [Beijing].
