Towards Universal Dependencies for Learner Chinese

John Lee, Herman Leung, Keying Li
Department of Linguistics and Translation
City University of Hong Kong
[email protected], [email protected], [email protected]

Abstract

We propose an annotation scheme for learner Chinese in the Universal Dependencies (UD) framework. The scheme was adapted from a UD scheme for Mandarin Chinese to take interlanguage characteristics into account. We applied the scheme to a set of 100 sentences written by learners of Chinese as a foreign language, and we report inter-annotator agreement on syntactic annotation.

1 Introduction

A learner corpus consists of texts written by non-native speakers. Recent years have seen a rising number of learner corpora, many of which are error-tagged to support analysis of grammatical mistakes made by learners (Yannakoudakis et al., 2011; Dahlmeier et al., 2013; Lee et al., 2016b). In order to derive overuse and underuse statistics on syntactic structures, some corpora have also been part-of-speech (POS) tagged (Díaz-Negrillo et al., 2010; Reznicek et al., 2013), and syntactically analyzed (Ragheb and Dickinson, 2014; Berzak et al., 2016). These corpora are valuable as training data for robust parsing of learner texts (Geertzen et al., 2013; Rehbein et al., 2012; Napoles et al., 2016), and can also benefit a variety of downstream tasks, including grammatical error correction, learner proficiency identification, and language learning exercise generation.

While most annotation efforts have focused on learner English, a number of large learner Chinese corpora have also been compiled (Zhang, 2009; Wang et al., 2015; Lee et al., 2016a). However, POS analysis in these corpora has been limited to the erroneous words, and there has not yet been any attempt to annotate syntactic structures. This study presents the first attempt to annotate Chinese learner text in the Universal Dependencies (UD) framework. One advantage of UD is the potential for contrastive analysis, e.g., comparisons between a UD treebank of standard Chinese, a UD treebank of language X, and portions of a UD treebank of learner Chinese produced by native speakers of X.

The rest of the paper is organized as follows. Section 2 reviews existing treebanks for learner texts. Section 3 describes the adaptation of a Mandarin Chinese UD scheme to account for non-canonical characteristics in learner text. Section 4 reports inter-annotator agreement.

2 Previous work

Two major treebanks for learner language — the Treebank of Learner English (TLE) (Berzak et al., 2016) and the project on Syntactically Annotating Learner Language of English (SALLE) (Ragheb and Dickinson, 2014) — contain English texts written by non-native speakers. TLE annotates a subset of sentences from the Cambridge FCE corpus (Yannakoudakis et al., 2011), while SALLE has been applied to essays written by university students. Both adapt annotation guidelines for standard English: TLE is based on the UD guidelines for standard English; SALLE is based on the POS tagset in the SUSANNE Corpus (Sampson, 1995) and the dependency relations in CHILDES (Sagae et al., 2010).

Both treebanks adopt the principle of “literal annotation”, i.e., to annotate according to a literal reading of the sentence, and to avoid considering its “intended” meaning or target hypothesis.

2.1 Lemma

SALLE allows an exception to “literal annotation” when dealing with lexical violations. When there is a spelling error (e.g., “*ballence”), the annotator puts the intended, or corrected, form of the word (“balance”) as the lemma. For real-word spelling errors, the distinction between a word selection error and a spelling error can be blurred. SALLE requires a spelling error to involve “reasonable orthographic or phonetic changes” (Ragheb and Dickinson, 2013). For a sentence such as “... *loss its ballence”, the lemma of the word “loss” would be considered to be “lose”. The lemma forms the basis for further analysis in POS and dependencies.

To identify spelling errors, TLE follows the decisions in the underlying error-annotated corpus (Nicholls, 2003). Further, when a word is mistakenly segmented into two (e.g., “*be cause”), it uses the UD relation goeswith to connect them.

2.2 POS tagging

For each word, SALLE annotates two POS tags, a “morphological tag” and a “distributional tag”. The former takes into account “morphological evidence”, i.e., the linguistic form of the word; the latter reflects its “distributional evidence”, i.e., its syntactic use in the sentence. In a well-formed sentence, these two tags should agree; in learner text, however, there may be conflicts between the morphological evidence and the distributional evidence. Consider the word “see” in the sentence “*I have see the movie.” The spelling of “see” provides morphological evidence to interpret it as a base form (VV0). However, its word position, following the auxiliary “have”, points towards a past participle (VVN). It is thus assigned the morphological tag VV0 and the distributional tag VVN.

These two kinds of POS tags are similarly incorporated into a constituent treebank of learner English (Nagata et al., 2011; Nagata and Sakaguchi, 2016). They are also implicitly encoded in a POS tagset designed for Classical Chinese poems (Wang, 2003). This tagset includes, for example, “adjective used as verb”, which can be understood as a morphological tag for adjective that doubles as a distributional tag for verb. Consider the sentence 春風又綠江南岸 chūnfēng yòu lǜ jiāngnán àn “Spring wind again greens Yangtze’s southern shore”.[1] The word lǜ ‘green’, normally an adjective, serves as a causative verb in this sentence. It is therefore tagged as “adjective used as a verb”.

TLE also supplies similar information for spelling and word formation errors, but in a different format. Consider the phrase “a *disappoint unknown actor”. On the one hand, the POS tag reflects the “intended” usage, and so “disappoint” is tagged as an adjective on the basis of its target hypothesis “disappointing”. On the other hand, the “most common usage” of the original word, if different from the POS tag, is indicated in the TYPO field of the metadata; there, “disappoint” is marked as a verb.

[1] English translation taken from (Kao and Mei, 1971).
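To make SALLE's two-tag representation concrete, the following is a minimal Python sketch of how such a token might be recorded. The class and field names are our own illustration, not part of SALLE; the tag values follow the "see" example above.

```python
from dataclasses import dataclass

@dataclass
class Token:
    """A token with SALLE-style dual POS tags (illustrative only)."""
    form: str        # surface form as written by the learner
    lemma: str       # corrected form, in case of a spelling error
    pos_morph: str   # tag supported by the word's linguistic form
    pos_dist: str    # tag supported by its syntactic position

    def has_conflict(self) -> bool:
        # In well-formed text the two tags agree; a mismatch
        # signals a possible learner error.
        return self.pos_morph != self.pos_dist

# "*I have see the movie." -- 'see' looks like a base form (VV0)
# but sits where a past participle (VVN) is expected.
see = Token(form="see", lemma="see", pos_morph="VV0", pos_dist="VVN")
assert see.has_conflict()
```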

2.3 Dependency annotation

In both treebanks, “literal annotation” requires dependencies to describe the way two words are apparently related, rather than the intended usage. For example, in the verb phrase “*ask the money” (with “ask you for the money” as the target hypothesis), the word “money” is considered the direct object of “ask”.

SALLE adds two new relations to handle non-canonical structures. First, when the morphological POS of two words do not usually participate in any relation, the special label ‘-’ is used. Second, the relation INCROOT is used when an extraneous word apparently serves as a second root. In addition, SALLE also gives subcategorization information, indicating what a word can select for. This information complements distributional POS tags, enabling a comparison between the expected relations and those that are realized.

3 Proposed annotation scheme

Our proposed scheme for learner Chinese is based on a UD scheme for Mandarin Chinese (Leung et al., 2016). We adapt this scheme in terms of word segmentation (Section 3.1), POS tagging (Section 3.2) and dependency annotation (Section 3.3). We follow SALLE and TLE in adhering to the principle of “literal annotation”, with some exceptions to be discussed below.

3.1 Word segmentation

There are no word boundaries in written Chinese; the first step of analysis is thus to perform word segmentation.
“Literal annotation” demands an analysis “as if the sentence were as syntactically well-formed as it can be, possibly ignoring meaning” (Ragheb and Dickinson, 2014). As a rule of thumb, we avoid segmentations that yield non-existing words.

A rigid application of this rule, however, may result in difficult and unhelpful interpretations in the face of “spelling” errors. Consider the two possible segmentations for the string 不關 bù guān ‘not concern’ in Table 1. Literal segmentation should in principle be preferred, since bù guān are two words, not one. Given the context, however, the learner likely confused the character guān with the homophonous guǎn; the latter combines with bù to form one word, namely the subordinating conjunction 不管 bùguǎn ‘no matter’. If so, the literal segmentation would misrepresent the semantic intention of the learner and yield an unhelpful syntactic analysis. We thus opt for the segmentation that assumes the spelling error; this interpretation, in turn, leads to bùguǎn as the lemma and SCONJ as the POS tag.

We follow SALLE in limiting spelling errors to orthographic or phonetic confusions. Specifically, for Chinese, the surface form and the lemma must have similar pronunciation[2] or appearance.[3]

[2] We allow different tones, such as {guān, guǎn}, and easily confusable pairs such as {j, zh} and {x, sh}.
[3] E.g., confusion between the characters 了 le and 子 zǐ.

         Literal segmentation               Segmentation w/ spelling error
Text     不 bù ‘not’ / 關 guān ‘concern’     不關 bùguān ‘not-concern’
Lemma    不 bù ‘not’ / 關 guān ‘concern’     不管 bùguǎn ‘no matter’
POS      ADV / VERB                          SCONJ

Table 1: Word segmentation of the string 不關 bù guān into two words (left) or one word (right), and the consequences on the lemma and POS tag.

[Figure 1 shows a dependency tree over 我 wǒ ‘I’, 可怕 kěpà ‘scary’, 他 tā ‘him’. REL layer (above the words): kěpà is the root, with wǒ as subj and tā as dep. POS tags: PRON ADJ PRON; POSd tags: PRON VERB PRON. RELd layer (below the words): tā is the obj of kěpà.]

Figure 1: Parse tree for the sentence wǒ kěpà tā ‘I scary him’, likely intended as ‘I scare him’. The POS tags and REL relations reflect the morphological evidence. Additionally, the POSd tags (Section 3.2) and RELd relations (Section 3.3) consider the distributional evidence.
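The spelling-error criterion of Section 3.1 (footnotes 2 and 3) lends itself to a simple check. Below is a minimal sketch under our own assumptions: pronunciations are given as pinyin strings with a trailing tone digit, and the confusable initial pairs are those listed in footnote 2; neither representation is prescribed by the scheme.

```python
# Two readings count as phonetically "similar" if they match after
# ignoring tone and collapsing easily confused initials.

CONFUSABLE_INITIALS = [("zh", "j"), ("sh", "x")]

def normalize(pinyin: str) -> str:
    """Strip the trailing tone digit and collapse confusable initials."""
    syllable = pinyin.rstrip("12345")
    for a, b in CONFUSABLE_INITIALS:
        if syllable.startswith(a):
            syllable = b + syllable[len(a):]
    return syllable

def similar_pronunciation(surface: str, lemma: str) -> bool:
    return normalize(surface) == normalize(lemma)

# guan1 (關 'concern') vs. guan3 (管, as in 不管 'no matter'):
# same syllable, different tone -- an admissible spelling error.
assert similar_pronunciation("guan1", "guan3")
```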

3.2 POS tagging

Similar to SALLE, we consider both morphological and distributional evidence (Section 2.2). When non-native errors create conflicts between them, the former drives our decision on the POS tag, while the latter is acknowledged in a separate, “distributional” POS tag (henceforth, “POSd tag”). In Figure 1, the POS tag for kěpà ‘scary’ is ADJ, reflecting its normal usage as an adjective; but its POSd tag is VERB, since the pronoun tā ‘him’ suggests its use as a verb with a direct object.

The POSd tag is useful for highlighting specific word selection errors involving misused POS (e.g., kěpà as a verb). It can also yield more general statistics, such as the use of adjectives where verbs are expected. In some cases, it suggests a target hypothesis (e.g., in Figure 1, to replace kěpà with a verb); but in others, a word insertion or deletion elsewhere might be preferred.

3.3 Dependency annotation

We now discuss how typical learner errors — word selection errors, extraneous words and missing words — may affect dependency annotation.

3.3.1 Word selection error

The dependency relation (henceforth, “REL”) is determined on the basis of the POS tags rather than the POSd tags. As long as these two agree, word selection errors should have no effect on dependency annotation. If a word’s POS tag differs from its POSd tag, however, it can be difficult to characterize its grammatical relation with the rest of the sentence. In this case, we also annotate its “distributional relation” (henceforth, “RELd”) on the basis of its POSd tag.[4]

Consider the sentence in Figure 1. From the point of view of POS tags, the relation between the adjective kěpà ‘scary’ and the pronoun tā ‘him’ is unclear. We thus assign the unspecified dependency, dep, as their REL.[5] From the point of view of POSd tags, however, kěpà functions as a verb and takes tā as a direct object, with the relation obj as their RELd.

[4] Similarly, Nagata and Sakaguchi (2016) use error nodes (e.g., VP-ERR) to annotate ungrammatical phrases (e.g., “*I busy”).
[5] Similar to the underspecified tag ‘-’ in SALLE, dep is used in English UD “when the system is unable to determine a more precise dependency relation”, for example due to a “weird grammatical construction.”
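To show how the four annotation layers fit together, here is a minimal sketch of the Figure 1 sentence in a Python structure of our own devising; the scheme itself prescribes no particular file format. Head indices are 1-based, with 0 for the root, as in CoNLL-U.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class AnnotatedWord:
    form: str
    pos: str                                 # tag from morphological evidence
    pos_d: str                               # tag from distributional evidence
    rel: Tuple[int, str]                     # (head, label) based on POS
    rel_d: Optional[Tuple[int, str]] = None  # (head, label) at the RELd layer,
                                             # annotated where a conflict applies

# 我 可怕 他 wǒ kěpà tā 'I scary him' (Figure 1)
sentence = [
    AnnotatedWord("我", "PRON", "PRON", (2, "subj")),
    AnnotatedWord("可怕", "ADJ", "VERB", (0, "root"), (0, "root")),  # root at both
                                                                     # layers (our
                                                                     # reading)
    AnnotatedWord("他", "PRON", "PRON", (2, "dep"), (2, "obj")),
]

# Words whose two POS tags conflict, e.g. kěpà used as a verb:
conflicts = [w.form for w in sentence if w.pos != w.pos_d]
assert conflicts == ["可怕"]
```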

3.3.2 Extraneous words

When a word seems extraneous, we choose its head based on syntactic distribution. For example, the aspect marker 了 le must modify the verb that immediately precedes it, with the relation ‘auxiliary’ (aux). Even when le is extraneous — i.e., when the verb should not take an aspect marker — we annotate it in the same way.

A more difficult situation arises when there is no verb before the extraneous le, e.g., in the sentence *我被了他打 wǒ bèi le tā dǎ ‘I PASS ASP he hit’ (“I was hit by him”). In this case, we choose bèi as the head of le on account of word order, but the relation is dep rather than aux.

3.3.3 Missing words

When a word seems missing, we annotate according to the UD guidelines on promotion by head elision. For example, in the sentence fragment 在中國最近幾年 zài zhōngguó zuìjìn jǐ nián ‘in China recent few years’, we promote nián ‘year’ to be the root. Although both zhōngguó ‘China’ and nián would be obl dependents if a verb were present, nián is promoted because it is closer to the expected location of the verb.
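The head-choice decisions in Sections 3.3.2 and 3.3.3 can be stated as two small rules. The sketch below is our own rendering of them in Python, not code from the annotation project; in particular, the assumption that the elided verb would follow the fragment in the 3.3.3 example is ours.

```python
from typing import List, Tuple

def attach_extraneous_le(words: List[Tuple[str, str]], i: int) -> Tuple[int, str]:
    """Head and relation for an extraneous 了 le at 0-based index i,
    given (form, POS) pairs. le attaches to the immediately preceding
    word: with 'aux' if that word is a verb (Section 3.3.2), otherwise
    with the underspecified 'dep', as in *我被了他打, where le
    attaches to bèi."""
    head = i - 1
    relation = "aux" if words[head][1] == "VERB" else "dep"
    return head, relation

def promote_root(candidates: List[int], verb_position: int) -> int:
    """Among the would-be dependents of an elided verb (given by their
    word positions), promote the one closest to the verb's expected
    position (Section 3.3.3)."""
    return min(candidates, key=lambda pos: abs(pos - verb_position))

# 在中國最近幾年: zhōngguó (position 1) and nián (position 4) would both
# be obl dependents of the missing verb, which we assume would follow
# the fragment (position 5); nián is therefore promoted to root.
assert promote_root([1, 4], verb_position=5) == 4
```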
4 Evaluation

We harvested a 100-sentence evaluation dataset from the training data of the most recent shared task on Chinese grammatical error diagnosis (Lee et al., 2016b). The dataset included 20 sentences with extraneous words, 20 with missing words, 20 with word-order errors, and 40 with word selection errors. In order to include challenging cases of word selection errors, i.e., those involving misuse of POS (Section 3.3.1), we examined the target hypotheses in the corpus. We selected 20 sentences where the replacement word has a different POS, and 20 sentences where it is the same. Two annotators, one of whom had access to the target hypothesis, independently annotated these sentences.

Word segmentation achieved 97.0% precision and 98.9% recall when one of the annotators was taken as gold. After reconciling their segmentations, each annotator independently annotated POS tags and dependency relations. The inter-annotator agreement is reported in Table 2. Overall agreement is 94.0% for POS tags and 82.8% for REL (labeled attachment). These agreement levels are comparable to those reported by Ragheb and Dickinson (2013), where agreement on labeled attachment ranges from 73.6% to 88.7% depending on the text and annotator. One must bear in mind, however, that annotation agreement for standard Chinese is generally lower than for English.

Agreement    Overall    Error span only
POS          94.0       91.0
POSd         93.7       89.7
REL          82.8       75.1
RELd         82.1       73.8

Table 2: The percentage of POS tags and labeled attachment on which the two annotators agree, measured overall and within text spans marked as erroneous.

Annotation agreement based on distributional evidence — i.e., POSd and RELd — is slightly lower. This is not unexpected, since it requires a higher degree of subjective interpretation. The most frequent discrepancies between morphological and distributional tags are ADJ vs. VERB, i.e., an adjective used as a verb (as in Figure 1), and VERB vs. NOUN, i.e., a verb used as a noun.

Annotation agreement is also lower within text spans marked as erroneous in the corpus, dropping to 91.0% for POS tags and 75.1% for labeled attachment. Further analysis revealed that agreement is especially challenging for word selection errors whose target hypothesis has a different POS. A post-hoc discussion among the annotators suggests that multiple plausible interpretations of an ungrammatical sentence were the main source of disagreement. For these cases, more specific guidelines are needed on which interpretation — e.g., considering a word as extraneous, as missing, or as misused in terms of POS — entails the most literal reading.

5 Conclusions and Future Work

We have adapted existing UD guidelines for Mandarin Chinese to annotate learner Chinese texts. Our scheme characterizes POS and dependency relations with respect to both morphological and distributional evidence. While the scheme adheres to the principle of “literal annotation”, it also recognizes spelling errors when determining the lemma. Evaluation results suggest a reasonable level of annotator agreement.

Acknowledgments

This work is partially supported by a Strategic Research Grant (Project no. 7004494) from City University of Hong Kong.

References

Yevgeni Berzak, Jessica Kenney, Carolyn Spadine, Jing Xian Wang, Lucia Lam, Keiko Sophie Mori, Sebastian Garza, and Boris Katz. 2016. Universal Dependencies for Learner English. In Proc. ACL.

Daniel Dahlmeier, Hwee Tou Ng, and Siew Mei Wu. 2013. Building a Large Annotated Corpus of Learner English: The NUS Corpus of Learner English. In Proc. 8th Workshop on Innovative Use of NLP for Building Educational Applications.

Ana Díaz-Negrillo, Detmar Meurers, Salvador Valera, and Holger Wunsch. 2010. Towards Interlanguage POS Annotation for Effective Learner Corpora in SLA and FLT. Language Forum, 36(1-2):139–154.

Jeroen Geertzen, Theodora Alexopoulou, and Anna Korhonen. 2013. Automatic Linguistic Annotation of Large Scale L2 Databases: The EF-Cambridge Open Language Database (EFCAMDAT). In Proc. 31st Second Language Research Forum (SLRF).

Yu-Kung Kao and Tsu-Lin Mei. 1971. Syntax, Diction, and Imagery in T’ang Poetry. Harvard Journal of Asiatic Studies, 31:49–136.

Lung-Hao Lee, Li-Ping Chang, and Yuen-Hsien Tseng. 2016a. Developing Learner Corpus Annotation for Chinese Grammatical Errors. In Proc. International Conference on Asian Language Processing (IALP).

Lung-Hao Lee, Gaoqi Rao, Liang-Chih Yu, Endong Xun, Baolin Zhang, and Li-Ping Chang. 2016b. Overview of NLP-TEA 2016 Shared Task for Chinese Grammatical Error Diagnosis. In Proc. 3rd Workshop on Natural Language Processing Techniques for Educational Applications.

Herman Leung, Rafaël Poiret, Tak-sum Wong, Xinying Chen, Kim Gerdes, and John Lee. 2016. Developing Universal Dependencies for Mandarin Chinese. In Proc. Workshop on Asian Language Resources.

Ryo Nagata and Keisuke Sakaguchi. 2016. Phrase Structure Annotation and Parsing for Learner English. In Proc. ACL.

Ryo Nagata, Edward Whittaker, and Vera Sheinman. 2011. Creating a Manually Error-tagged and Shallow-parsed Learner Corpus. In Proc. ACL.

Courtney Napoles, Aoife Cahill, and Nitin Madnani. 2016. The Effect of Multiple Grammatical Errors on Processing Non-Native Writing. In Proc. 11th Workshop on Innovative Use of NLP for Building Educational Applications.

Diane Nicholls. 2003. The Cambridge Learner Corpus - error coding and analysis for lexicography and ELT. In Proc. Computational Linguistics Conference.

Marwa Ragheb and Markus Dickinson. 2013. Inter-annotator Agreement for Dependency Annotation of Learner Language. In Proc. 8th Workshop on Innovative Use of NLP for Building Educational Applications.

Marwa Ragheb and Markus Dickinson. 2014. Developing a Corpus of Syntactically-Annotated Learner Language for English. In Proc. 13th International Workshop on Treebanks and Linguistic Theories (TLT).

Ines Rehbein, Hagen Hirschmann, Anke Lüdeling, and Marc Reznicek. 2012. Better tags give better trees — or do they? LiLT, 7(10):1–18.

Marc Reznicek, Anke Lüdeling, and Hagen Hirschmann. 2013. Competing Target Hypotheses in the Falko Corpus: A Flexible Multi-Layer Corpus Architecture. In Ana Díaz-Negrillo, editor, Automatic Treatment and Analysis of Learner Corpus Data, pages 101–123, Amsterdam. John Benjamins.

Kenji Sagae, Eric Davis, Alon Lavie, Brian MacWhinney, and Shuly Wintner. 2010. Morphosyntactic Annotation of CHILDES Transcripts. Journal of Child Language, 37(3):705–729.

Geoffrey Sampson. 1995. English for the Computer: The SUSANNE Corpus and Analytic Scheme. Clarendon Press, Oxford, UK.

Maolin Wang, Shervin Malmasi, and Mingxuan Huang. 2015. The Jinan Chinese Learner Corpus. In Proc. 10th Workshop on Innovative Use of NLP for Building Educational Applications.

Li Wang. 2003. The metric of Chinese poems (Hanyu shiluxue 漢語詩律學). Zhonghua shuju, Hong Kong.

Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A New Dataset and Method for Automatically Grading ESOL Texts. In Proc. ACL.

Baolin Zhang. 2009. The Characteristics and Functions of the HSK Dynamic Composition Corpus. International Chinese Language Education, 4(11).
