A Context-Based Classifier Prediction System for Chinese Language Learners
ClassifierGuesser: A Context-based Classifier Prediction System for Chinese Language Learners

Nicole Peinelt¹,² and Maria Liakata¹,² and Shu-Kai Hsieh³
¹The Alan Turing Institute, London, UK
²Department of Computer Science, University of Warwick, Coventry, UK
³Graduate Institute of Linguistics, National Taiwan University, Taipei, Taiwan
{n.peinelt, m.liakata}@warwick.ac.uk

The Companion Volume of the IJCNLP 2017 Proceedings: System Demonstrations, pages 41–44, Taipei, Taiwan, November 27 – December 1, 2017. © 2017 AFNLP

Abstract

Classifiers are function words that are used to express quantities in Chinese and are especially difficult for language learners. In contrast to previous studies, we argue that the choice of classifiers is highly contextual and train context-aware machine learning models based on a novel publicly available dataset, outperforming previous baselines. We further present use cases for our database and models in an interactive demo system.

1 Introduction

Languages such as Chinese are characterized by the existence of a class of words commonly referred to as ‘classifiers’ or ‘measure words’. Based on syntactic criteria, classifiers are the obligatory component of a quantifier phrase which is contained in a noun phrase or verb phrase.¹ Semantically, a classifier modifies the quantity or frequency of its head word and requires a certain degree of shared properties between classifier and head. Although native speakers select classifiers intuitively, language learners often struggle with the correct usage of classifiers due to the lack of a similar word class in their native language. Moreover, no dictionary or finite set of rules covers all possible classifier-head combinations exhaustively.

¹ Following Huang (1998) and 何杰 (2008) we include verbal as well as nominal classifiers.

Previous research has focused on associations between classifiers and nominal head words in isolation and included approaches based on ontologies (Mok et al., 2012; Morgado da Costa et al., 2016), databases with semantic features of Chinese classifiers (Gao, 2011), as well as an SVM with syntactic and ontological features (Guo and Zhong, 2005). However, without any context classifier assignment can be ambiguous. For instance, the noun 球 ‘ball’ can be modified by ke - a classifier for round objects - when referring to the object itself as in (1), but requires the event classifier chang in the context of a ball match as in (2). We argue that context is an important factor for classifier selection, since a head word may have multiple associated classifiers, but the final classifier selection is restricted by the context.

(1) 一 颗 红色 的 球
    one ke red DE ball
    ‘a red ball’

(2) 一 场 精彩 的 球
    one chang exciting DE ball
    ‘an exciting match’

This study introduces a large-scale dataset of everyday Chinese classifier usage for machine learning experiments. We present a model that outperforms previous frequency and ontology baselines for classifier prediction without the need for extensive linguistic preprocessing and head word identification. We further demonstrate the usefulness of the database and our models in use cases.

2 System Design

[Figure 1: Overview of proposed system. Components shown: Corpora; Preprocessing (sentence filtering, parsing, classifier extraction, head extraction); Chinese Classifier Database; Prediction Models; Interactive Web Interface.]

Figure 1 gives an overview of our system.
It comprises data collection, pre-processing and the compilation of the Chinese Classifier Database (section 2.1), the training of classifier prediction models (section 2.2), and the interactive online interface (section 3).

2.1 The Chinese Classifier Database

The database is based on three openly available POS tagged Chinese language corpora: The Lancaster Corpus of Mandarin Chinese (McEnery and Xiao, 2004), the UCLA Corpus of Written Chinese (Tao and Xiao, 2012) and the Leiden Weibo Corpus (van Esch, 2012). Sentences from the corpora were assigned unique ids, filtered for the occurrence of classifier POS tags and cleaned in a number of filtering steps in order to improve the data quality (Table 1). We further parsed the remaining sentences with the Stanford constituent parser (Levy and Manning, 2003) and extracted the head of the classifier in each sentence based on the parse tree.² By manually evaluating 100 randomly sampled sentences from the database, we estimate a classifier identification accuracy of 91% and head identification accuracy of 78%. Based on our observations, most errors are due to accumulating tokenisation, tagging and parsing errors, as well as elliptic classifier usage. In addition to the example sentences, we also included lexical information from CC-Cedict³ for the 176 unique classifier types.

² Starting from the position of the classifier, we move one node up in the tree at a time until reaching a noun or verb phrase and extract its head word.
³ https://cc-cedict.org/

Applied filters                                  Sentences     %
None (initial corpus)                            2,258,003   100
1. duplicate sentence                            1,553,430    69
2. <4 or >60 tokens in sentence                  1,470,946    65
3. classifiers consisting of letters/numbers;    1,437,491    64
   or <70% of Chinese material in sentence
4. tagged classifiers are in fact measure        1,150,749    51
   units (e.g. 毫米)
5. classifiers with <10 examples                 1,109,871    49
6. classifier fails manual check                 1,103,338    49
7. frequent error patterns                       1,083,135    48
8. multiple classifiers in a single sentence       858,472    38

Table 1: Number of remaining sentences in database. Matching sentences are excluded.

2.2 Classifier Prediction

2.2.1 Task

Following the only previous machine learning approach (Guo and Zhong, 2005), we frame classifier prediction as a multi-class classification problem. However, in contrast to previous work that focused on word-based classifier prediction, we adapt the prediction task for a sentence-based scenario, which is a more natural and less ambiguous task than predicting classifiers without context. Not all sentences in the Chinese classifier database contain head words, due to co-referential and anaphoric usage. Hence, we query the database for sentences in which both the head word and corresponding classifier were identified, resulting in 681,102 sentences. This subset is randomly split into training (50%), development (25%) and test set (25%). In each sentence with an identified classifier and corresponding head word, we substitute the classifier with the gap token <CL> and use the classifier as its class label. For example, the tagged sentence

    我们是一 <c> 家 </c> <h> 人 </h>。

is transformed into the training example

    我们是一 <CL> <h> 人 </h>。

with the label ‘家’. Labels are simplified from tokens to types by reducing duplicate classifiers (e.g. 个个 → 个) and mapping traditional characters to simplified characters (e.g. 個 → 个), resulting in a dataset⁴ with 172 distinct classes.⁵ Given a training set of observed sentences and classifiers, the task is to fill the gap in a sentence with the most appropriate classifier.

⁴ We make our dataset publicly available at https://github.com/wuningxi/ChineseClassifierDataset.
⁵ The number of unique classifiers differs from the full database because only example sentences with identified head words are taken into account.

2.2.2 Baseline approaches

As previous studies have evaluated algorithms on individually collected unpublished data, we implement the following baselines to compare our models with previous results:

• ge: always assign the universal and most common noun classifier 个 (Guo and Zhong, 2005; Morgado da Costa et al., 2016).

• pairs: assign the classifier most frequently observed in combination with this head word during training; assign 个 for unseen words (Guo and Zhong, 2005).

• concepts: assign classifiers based on classifier-concept pair counts using the Chinese Open Wordnet and 个 for unseen words (Morgado da Costa et al., 2016).

2.2.3 Context-based models

Previous approaches predominantly rely on ontological resources, which require a lot of human effort to build and maintain, resulting in limited coverage for new words and domains. We use distributed representations to capture word similarity based on syntactic behaviour, as they can be trained unsupervised on a large scale and are easily adapted on new language material. We train word embeddings with word2vec (Mikolov et al., 2013) on sentences from the original three corpora and also obtain pre-trained word embeddings from Bojanowski et al. (2017). The pre-trained embeddings consistently achieve better results and are hence used in all subsequent experiments.

Since the head word is linguistically the most important factor for classifier selection, we first train two widely used machine learning models (SVM, Logistic Regression) on the embedding vector of the head word (head).

[...] preliminary experiments indicate that increasing the window size to n>2 increases computation costs without significant performance gains, a better approach to include more context is needed. We hence use a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) to encode the entire sentence excluding any head word annotation (cont_cl) and predict classifiers based on the last hidden state (Figure 2).

2.2.4 Results

We report micro F1 (accuracy) and macro F1 scores for each model after hyper-parameter tuning in Table 3. The head-classifier combination baseline gives a strong result, which the SVM and Logistic Regression models trained on only headword embedding vectors cannot surpass. Global corpus statistics on classifiers outperform the local information captured by the word embeddings in this case. Adding head word context features successfully reduces the ambiguity of head words and results in a significant improvement over the baseline. Including contextual features of the classifier gap slightly decreases the performance, but still outperforms the context-unaware models. The best model is the LSTM which achieves micro F1 71.51 and macro F1 30.56 on the test set based on the full sentence context without the need for headword identification (hyper-
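The conversion of a tagged sentence into a training example, as described in section 2.2.1, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the regular expression for the <c> annotation and the three-entry character table are assumptions made for illustration, and the handling of reduplicated classifiers covers only the two-character case given in the text.

```python
import re

# Assumed annotation format, taken from the example in section 2.2.1:
# the classifier is wrapped as "<c> X </c>".
CLASSIFIER_TAG = re.compile(r"<c>\s*(.+?)\s*</c>")

# Tiny illustrative table (hypothetical); a real system would need a
# complete traditional-to-simplified conversion resource.
TRADITIONAL_TO_SIMPLIFIED = {"個": "个", "場": "场", "顆": "颗"}


def simplify_label(classifier: str) -> str:
    """Map a classifier token to its type (simplified, non-reduplicated)."""
    simplified = "".join(
        TRADITIONAL_TO_SIMPLIFIED.get(ch, ch) for ch in classifier
    )
    # Collapse reduplicated classifiers such as 个个 to 个.
    if len(simplified) == 2 and simplified[0] == simplified[1]:
        return simplified[0]
    return simplified


def make_example(tagged_sentence: str) -> tuple[str, str]:
    """Replace the tagged classifier with <CL> and return (sentence, label)."""
    match = CLASSIFIER_TAG.search(tagged_sentence)
    if match is None:
        raise ValueError("no <c>...</c> annotation found")
    label = simplify_label(match.group(1))
    gapped = (
        tagged_sentence[: match.start()] + "<CL>" + tagged_sentence[match.end():]
    )
    return gapped, label


sentence, label = make_example("我们是一 <c> 家 </c> <h> 人 </h>。")
```

Applied to the tagged sentence from section 2.2.1, this yields the gapped sentence and the label 家 shown there. The head annotation is left in place, since whether it is stripped depends on the model variant.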
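The ge and pairs baselines from section 2.2.2 are simple enough to sketch directly. The following Python code is an illustration under assumptions rather than the evaluated implementation: the function names and the toy training pairs are hypothetical, and ties between equally frequent classifiers are resolved however `Counter.most_common` orders them.

```python
from collections import Counter, defaultdict

GENERAL_CLASSIFIER = "个"  # fallback used by both baselines


def ge_baseline(head: str) -> str:
    """ge: always predict the general classifier, ignoring the head word."""
    return GENERAL_CLASSIFIER


def train_pairs(examples):
    """Count (head, classifier) pairs and keep the most frequent per head."""
    counts = defaultdict(Counter)
    for head, classifier in examples:
        counts[head][classifier] += 1
    return {head: c.most_common(1)[0][0] for head, c in counts.items()}


def pairs_baseline(model, head: str) -> str:
    """pairs: most frequent classifier for a seen head, else the fallback."""
    return model.get(head, GENERAL_CLASSIFIER)


# Toy training data (hypothetical), echoing examples (1) and (2).
training_pairs = [("球", "颗"), ("球", "颗"), ("球", "场"), ("人", "个")]
pairs_model = train_pairs(training_pairs)
```

On this toy data the pairs baseline always predicts 颗 for 球, including in the match reading of example (2), which is the kind of context-dependent case the sentence-based models are meant to address.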
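The two metrics reported in section 2.2.4 can be made concrete with a short Python sketch. It assumes single-label predictions and averages macro F1 over all classes that occur in either the gold labels or the predictions; the paper does not state which class set was used, so this choice and the function name are assumptions.

```python
from collections import Counter


def micro_macro_f1(gold, predicted):
    """Return (micro F1, macro F1) for single-label multi-class predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was missed

    # Micro F1 pools the counts of all classes before computing F1.
    total_tp = sum(tp.values())
    pooled = 2 * total_tp + sum(fp.values()) + sum(fn.values())
    micro = 2 * total_tp / pooled if pooled else 0.0

    # Macro F1 is the unweighted mean of the per-class F1 scores.
    classes = set(gold) | set(predicted)
    per_class = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        per_class.append(2 * tp[c] / denom if denom else 0.0)
    macro = sum(per_class) / len(per_class) if per_class else 0.0
    return micro, macro


# Toy data (hypothetical): a model that always predicts 个.
gold_labels = ["个", "个", "个", "场"]
predictions = ["个", "个", "个", "个"]
micro_f1, macro_f1 = micro_macro_f1(gold_labels, predictions)
```

In the single-label setting micro F1 coincides with accuracy, here 3/4. Macro F1 weights every class equally, so a classifier that is never predicted correctly pulls the average down regardless of how rare it is; on the toy data it is 3/7.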