Universal Dependencies Parsing for Colloquial Singaporean English
Total Page:16
File Type:pdf, Size:1020Kb
Universal Dependencies Parsing for Colloquial Singaporean English Hongmin Wangy, Yue Zhangy, GuangYong Leonard Chanz, Jie Yangy, Hai Leong Chieuz y Singapore University of Technology and Design fhongmin wang, yue [email protected] jie [email protected] z DSO National Laboratories, Singapore fcguangyo, [email protected] Abstract translation task of the EMNLP 2011 Workshop on Statistical Machine Translation (WMT11) to Singlish can be interesting to the ACL translate Haitian Creole SMS messages sent dur- community both linguistically as a ma- ing the 2010 Haitian earthquake. This work high- jor creole based on English, and compu- lights the importance of NLP tools on creoles in tationally for information extraction and crisis situations for emergency relief (Hu et al., sentiment analysis of regional social me- 2011; Hewavitharana et al., 2011). dia. We investigate dependency pars- Singlish is one of the major languages in Sin- ing of Singlish by constructing a depen- gapore, with borrowed vocabulary and grammars1 dency treebank under the Universal De- from a number of languages including Malay, pendencies scheme, and then training a Tamil, and Chinese dialects such as Hokkien, Can- neural network model by integrating En- tonese and Teochew (Leimgruber, 2009, 2011), glish syntactic knowledge into a state-of- and it has been increasingly used in written forms the-art parser trained on the Singlish tree- on web media. Fluent English speakers unfamiliar bank. Results show that English knowl- with Singlish would find the creole hard to com- edge can lead to 25% relative error reduc- prehend (Harada, 2009). Correspondingly, fun- tion, resulting in a parser of 84.47% ac- damental English NLP components such as POS curacies. To the best of our knowledge, taggers and dependency parsers perform poorly on we are the first to use neural stacking to such Singlish texts as shown in Table2 and4. For improve cross-lingual dependency parsing example, Seah et al.(2015) adapted the Socher on low-resource languages. We make both et al.(2013) sentiment analysis engine to the our annotation and parser available for fur- Singlish vocabulary, but failed to adapt the parser. ther research. Since dependency parsers are important for tasks such as information extraction (Miwa and Bansal, 1 Introduction 2016) and discourse parsing (Li et al., 2015), this Languages evolve temporally and geographically, hinders the development of such downstream ap- both in vocabulary as well as in syntactic struc- plications for Singlish in written forms and thus tures. When major languages such as English or makes it crucial to build a dependency parser that arXiv:1705.06463v1 [cs.CL] 18 May 2017 French are adopted in another culture as the pri- can perform well natively on Singlish. mary language, they often mix with existing lan- To address this issue, we start with investigat- guages or dialects in that culture and evolve into a ing the linguistic characteristics of Singlish and stable language called a creole. Examples of cre- specifically the causes of difficulties for under- oles include the French-based Haitian Creole, and standing Singlish with English syntax. We found Colloquial Singaporean English (Singlish) (Mian- that, despite the obvious attribute of inheriting a Lian and Platt, 1993), an English-based creole. large portion of basic vocabularies and grammars While the majority of the natural language pro- from English, Singlish not only imports terms cessing (NLP) research attention has been focused from regional languages and dialects, its lexical on the major languages, little work has been done 1We follow Leimgruber(2011) in using “grammar” to de- on adapting the components to creoles. One no- scribe “syntactic constructions” and we do not differentiate table body of work originated from the featured the two expressions in this paper. Singlish dependency trees 2 Related Work Neural networks have led to significant advance in Singlish dependency the performance for dependency parsing, includ- parser trained with small ing transition-based parsing (Chen and Manning, Singlish treebank 2014; Zhou et al., 2015; Weiss et al., 2015; Dyer et al., 2015; Ballesteros et al., 2015; Andor et al., English syntactic and 2016), and graph-based parsing (Kiperwasser and semantic knowledge learnt from large treebank Goldberg, 2016; Dozat and Manning, 2017). In particular, the biaffine attention method of Dozat and Manning(2017) uses deep bi-directional long Singlish sentences short-term memory (bi-LSTM) networks for high- order non-linear feature extraction, producing the highest-performing graph-based English depen- Figure 1: Overall model diagram dency parser. We adopt this model as the basis for our Singlish parser. Our work belongs to a line of work on trans- semantics and syntax also deviate significantly fer learning for parsing, which leverages En- from English (Leimgruber, 2009, 2011). We cate- glish resources in Universal Dependencies to im- gorize the challenges and formalize their interpre- prove the parsing accuracies of low-resource lan- tation using Universal Dependencies (Nivre et al., guages (Hwa et al., 2005; Cohen and Smith, 2009; 2016), which extends to the creation of a Singlish Ganchev et al., 2009). Seminal work employed dependency treebank with 1,200 sentences. statistical models. McDonald et al.(2011) inves- Based on the intricate relationship between tigated delexicalized transfer, where word-based Singlish and English, we build a Singlish parser by features are removed from a statistical model for leveraging knowledge of English syntax as a ba- English, so that POS and dependency label knowl- sis. This overall approach is illustrated in Figure1. edge can be utilized for training a model for low- In particular, we train a basic Singlish parser with resource language. Subsequent work considered the best off-the-shelf neural dependency parsing syntactic similarities between languages for better model using biaffine attention (Dozat and Man- feature transfer (Tackstr¨ om¨ et al., 2012; Naseem ning, 2017), and improve it with knowledge trans- et al., 2012; Zhang and Barzilay, 2015). fer by adopting neural stacking (Chen et al., 2016; Recently, a line of work leverages neural net- Zhang and Weiss, 2016) to integrate the English work models for multi-lingual parsing (Guo et al., syntax. Since POS tags are important features for 2015; Duong et al., 2015; Ammar et al., 2016). dependency parsing (Chen and Manning, 2014; The basic idea is to map the word embedding Dyer et al., 2015), we train a POS tagger for spaces between different languages into the same Singlish following the same idea by integrating vector space, by using sentence-aligned bilingual English POS knowledge using neural stacking. data. This gives consistency in tokens, POS and dependency labels thanks to the availability of Results show that English syntax knowledge Universal Dependencies (Nivre et al., 2016). Our brings 51.50% and 25.01% relative error reduction work is similar to these methods in using a neu- on POS tagging and dependency parsing respec- ral network model for knowledge sharing between tively, resulting in a Singlish dependency parser different languages. However, ours is different in with 84.47% unlabeled attachment score (UAS) the use of a neural stacking model, which respects and 77.76% labeled attachment score (LAS). the distributional differences between Singlish and We make our Singlish dependency treebank, the English words. This empirically gives higher ac- source code for training a dependency parser and curacies for Singlish. the trained model for the parser with the best per- Neural stacking was previously used for formance freely available online2. cross-annotation (Chen et al., 2016) and cross- task (Zhang and Weiss, 2016) joint-modelling on monolingual treebanks. To the best of our knowl- 2https://github.com/wanghm92/Sing_Par edge, we are the first to employ it on cross-lingual feature transfer from resource-rich languages to UD English Singlish Sentences Words Sentences Words improve dependency parsing for low-resource lan- Train 12,543 204,586 900 8,221 guages. Besides these three dimensions in deal- Dev 2,002 25,148 150 1,384 ing with heterogeneous text data, another popular Test 2,077 25,096 150 1,381 area of research is on the topic of domain adap- Table 1: Division of training, development, and tion, which is commonly associated with cross- test sets for Singlish Treebank lingual problems (Nivre et al., 2007). While this large strand of work is remotely related to ours, 3 we do not describe them in details. corpus in Universal Dependencies v1.4 collec- Unsupervised rule-based approaches also offer tion is constructed from the English Web Tree- an competitive alternative for cross-lingual depen- bank (Bies et al., 2012), comprising of web me- dency parsing (Naseem et al., 2010; Gillenwater dia texts, which potentially smooths the knowl- et al., 2010; Gelling et al., 2012; Søgaard, 2012a,b; edge transfer to our target Singlish texts in similar Mart´ınez Alonso et al., 2017), and recently been domains. The statistics of this dataset, from which benchmarked for the Universal Dependencies for- we obtain English syntactic knowledge, is shown malism by exploiting the linguistic constraints in in Table1 and we refer to this corpus as UD-Eng. the Universal Dependencies to improve the robust- This corpus uses 47 dependency relations and we ness against error propagation and domain adap- show below how to conform to the same standard tion (Mart´ınez Alonso et al., 2017). However, we while adapting to unique Singlish grammars. choose a data-driven supervised approach given 3.2 Challenges and Solutions for Annotating the relatively higher parsing accuracy owing to the Singlish availability of resourceful treebanks from the Uni- versal Dependencies project. The deviations of Singlish from English come from both the lexical and the grammatical lev- els (Leimgruber, 2009, 2011), which bring chal- 3 Singlish Dependency Treebank lenges for analysis on Singlish using English NLP 3.1 Universal Dependencies for Singlish tools.