Parsing Tweets into Universal Dependencies

Yijia Liu (Harbin Institute of Technology) [email protected]
Yi Zhu (University of Cambridge) [email protected]
Wanxiang Che and Bing Qin (Harbin Institute of Technology)
Nathan Schneider (Georgetown University)
Noah A. Smith (University of Washington)

Abstract

We study the problem of analyzing tweets with Universal Dependencies (UD; Nivre et al., 2016). We extend the UD guidelines to cover special constructions in tweets that affect tokenization, part-of-speech tagging, and labeled dependencies. Using the extended guidelines, we create a new tweet treebank for English (TWEEBANKV2) that is four times larger than the (unlabeled) TWEEBANKV1 introduced by Kong et al. (2014). We characterize the disagreements between our annotators and show that it is challenging to deliver consistent annotation due to ambiguity in understanding and explaining tweets. Nonetheless, using the new treebank, we build a pipeline system to parse raw tweets into UD. To overcome annotation noise without sacrificing computational efficiency, we propose a new method to distill an ensemble of 20 transition-based parsers into a single one. Our parser achieves an improvement of 2.2 in LAS over the un-ensembled baseline and outperforms parsers that are state-of-the-art on other datasets in both accuracy and speed.

1 Introduction

NLP for social media messages is challenging, requiring domain adaptation and annotated datasets (e.g., treebanks) for training and evaluation. Pioneering work by Foster et al. (2011) annotated 7,630 tokens' worth of tweets according to the phrase-structure conventions of the Penn Treebank (PTB; Marcus et al., 1993), enabling conversion to Stanford Dependencies. Kong et al. (2014) further studied the challenges in annotating tweets and presented a tweet treebank (TWEEBANK), consisting of 12,149 tokens and largely following conventions suggested by Schneider et al. (2013), fairly close to Yamada and Matsumoto (2003) dependencies (without labels). Both annotation efforts were highly influenced by the PTB, whose guidelines have good grammatical coverage on newswire. However, when it comes to informal, unedited, user-generated text, the guidelines may leave many annotation decisions unspecified.

Universal Dependencies (Nivre et al., 2016, UD) were introduced to enable consistent annotation across different languages. To allow such consistency, UD was designed to be adaptable to different genres (Wang et al., 2017) and languages (Guo et al., 2015; Ammar et al., 2016). We propose that analyzing the syntax of tweets can benefit from such adaptability. In this paper, we introduce a new English tweet treebank of 55,607 tokens that follows the UD guidelines, but also contends with social media–specific challenges that were not covered by the UD guidelines.¹ Our annotation includes tokenization, part-of-speech (POS) tags, and (labeled) Universal Dependencies. We characterize the disagreements among our annotators and find that consistent annotation is still challenging to deliver even with the extended guidelines.

¹We developed our treebank independently of a similar effort for Italian tweets (Sanguinetti et al., 2017). See §2.5 for a comparison.

Based on these annotations, we nonetheless designed a pipeline to parse raw tweets into Universal Dependencies. Our pipeline includes: a bidirectional LSTM (bi-LSTM) tokenizer, a word cluster–enhanced POS tagger (following Owoputi et al., 2013), and a stack LSTM parser with character-based word representations (Ballesteros et al., 2015), which we refer to as our "baseline" parser. To overcome the noise in our annotated data and achieve better performance without sacrificing computational efficiency, we distill a 20-parser ensemble into a single greedy parser (Hinton et al., 2015). We show further that learning directly from the exploration of the ensemble parser is more beneficial than learning from the gold standard "oracle" transition sequence. Experimental results show that an improvement of more than 2.2 points in LAS over the baseline parser can be achieved with our distillation method. It outperforms other state-of-the-art parsers in both accuracy and speed.

The contributions of this paper include:

• We study the challenges of annotating tweets in UD (§2) and create a new tweet treebank (TWEEBANKV2), which includes tokenization, part-of-speech tagging, and labeled Universal Dependencies. We also characterize the difficulties of creating such annotation.

• We introduce and evaluate a pipeline system to parse raw tweet text into Universal Dependencies (§3). Experimental results show that it performs better than a pipeline of the state-of-the-art alternatives.

• We propose a new distillation method for training a greedy parser, leading to better performance than existing methods and without efficiency sacrifices.

Our dataset and system are publicly available at https://github.com/Oneplus/Tweebank and https://github.com/Oneplus/twpipe.

2 Annotation

We first review TWEEBANKV1 of Kong et al. (2014), the previous largest Twitter dependency annotation effort (§2.1). Then we introduce the differences in our tokenization (§2.2) and part-of-speech (§2.3) (re)annotation with respect to O'Connor et al. (2010) and Gimpel et al. (2011), respectively, on which TWEEBANKV1 was built. We describe our effort of adapting the UD conventions to cover tweet-specific constructions (§2.4). Finally, we present our process of creating a new tweet treebank, TWEEBANKV2, and characterize the difficulties in reaching consistent annotations (§2.6).

2.1 Background: TWEEBANK

The annotation effort we describe stands in contrast to the previous work by Kong et al. (2014). Their aim was the rapid development of a dependency parser for tweets, and to that end they contributed a new annotated corpus, TWEEBANK, consisting of 12,149 tokens. Their annotations added unlabeled dependencies to a portion of the data annotated with POS tags by Gimpel et al. (2011) and Owoputi et al. (2013) after rule-based tokenization (O'Connor et al., 2010). Kong et al. also contributed a system for parsing; we defer the discussion of their parser to §3.

Kong et al.'s rapid, small-scale annotation effort was heavily constrained. It was carried out by annotators with only cursory training, no clear annotation guidelines, and no effort to achieve consensus on controversial cases. Annotators were allowed to underspecify their analyses. Most of the work was done in a very short amount of time (a day). Driven both by the style of the text they sought to annotate and by exigency, some of their annotation conventions included:

• Allowing an annotator to exclude tokens from the dependency tree. A clear criterion for exclusion was not given, but many tokens were excluded because they were deemed "non-syntactic."

• Allowing an annotator to merge a multiword expression into a single node in the dependency tree, with no internal structure. Annotators were allowed to take the same step with noun phrases.

• Allowing multiple roots, since a single tweet might contain more than one sentence.

These conventions were justified on the grounds of making the annotation easier for non-experts, but they must be revisited in our effort to apply UD to tweets.

2.2 Tokenization

Our tokenization strategy lies between the strategy of O'Connor et al. (2010) and that of UD. There is a tradeoff between preservation of original tweet content and respecting the UD guidelines.

The regex-based tokenizer of O'Connor et al. (2010)—which was originally designed for an exploratory search interface called TweetMotif, not for NLP—preserves most whitespace-delimited tokens, including hashtags, at-mentions, emoticons, and unicode glyphs. They also treat contractions and acronyms as whole tokens and do not split them. UD tokenization,² in order to better serve dependency annotation, treats each syntactic word as a token. It therefore more aggressively splits clitics from contractions (e.g., gonna is tokenized as gon and na; its is tokenized as it and s when s is a copula). But acronyms are not touched in the UD tokenization guidelines. Thus, we follow the UD tokenization for contractions and leave acronyms like idc ("I don't care") as a single token.

²http://universaldependencies.org/u/overview/tokenization.html

In the other direction, merging rather than splitting tokens, the UD guidelines also suggest merging multi-token words (e.g., 20 000) into one single token in some special cases. We witnessed a small number of tweets that contain multi-token words (e.g., YO and RETWEET) but didn't combine them, for simplicity. Such tokens only account for 0.07%, and we use the UD goeswith relation to resolve these cases in the dependency annotations.

2.3 Part-of-Speech Annotation

Before turning to UD annotations, we (re)annotated the data with POS tags, for consistency with other UD efforts, which adopt the universal POS tagset.³ In some cases, non-corresponding tag conflicts arose between the conventions of the UD English Web Treebank (UD_English-EWT; de Marneffe et al., 2014)⁴ and the conventions of Gimpel et al. (2011). In these cases, we always conformed to UD, enabling consistency (e.g., when we exploit the existing UD_English-EWT treebank in our parser for tweets, §3). For example, the nominal URL in Figure 2 is tagged as other (X) and + is tagged as symbol (SYM) rather than conjunction (CCONJ). Tokens that do not have a syntactic function (see Figure 1, discussed at greater length in the next section) were usually annotated as other (X), except for emoticons, which are tagged as symbol (SYM), following UD_English-EWT.

³A revised and extended version of Petrov et al. (2012) with 17 tags.
⁴https://github.com/UniversalDependencies/UD_English-EWT

Tokens that abbreviate multiple words (such as idc) are resolved to the POS of the syntactic head of the expression, following UD conventions (in this example, the head care is a verb, so idc is tagged as a verb). When the token is not phrasal, we use the POS of the left-most sub-phrase. For example, mfw ("my face when") is tagged as a noun (for face).

Compared to the effort of Gimpel et al. (2011), our approach simplifies some matters. For example, if a token is not considered syntactic by UD conventions, it gets an other (X) tag (Gimpel et al. had more extensive conventions). Other phenomena, like abbreviations, are more complicated for us, as discussed above; Gimpel et al. used a single part of speech for such expressions.

Another important difference follows from the difference in tokenization. As discussed in §2.2, UD calls for more aggressive tokenization than that of O'Connor et al. (2010), which opted out of splitting contractions and possessives. As a consequence of adopting O'Connor et al.'s (2010) tokenization, Gimpel et al. introduced new parts of speech for these cases instead.⁵ For us, these tokens must be split, but universal parts of speech can be applied.

⁵These tags only account for 2.7% of tokens, leading to concerns about data sparseness in tagging and all downstream analyses.

2.4 Universal Dependencies Applied to Tweets

We adopt UD version 2 guidelines to annotate the syntax of tweets. In applying UD annotation conventions to tweets, the choices of Kong et al. (2014) must be revisited. We consider the key questions that arose in our annotation effort, and how we resolved them.

Acronym abbreviations. We follow Kong et al. (2014) and annotate the syntax of an acronym as a single word without normalization. Their syntactic functions are decided according to their context. Eisenstein (2013) studied the necessity of normalization in social media text and argued that such normalization is problematic. Our solution to the syntax of abbreviations follows the spirit of his argument. Because abbreviations which clearly carry syntactic functions only constitute 0.06% of the tokens in our dataset, we believe that normalization for acronyms is an unnecessarily complicated step.

Non-syntactic tokens. The major characteristic that distinguishes tweets from standard texts is that a large proportion of tokens don't carry any syntactic function. In our annotation, there are five types of non-syntactic tokens commonly seen in tweets: sentiment emoticons, retweet markers and

[Figure 1 shows the tweet "Perfect ♥ RT @coldplay : Fix You from the back #ColdplayMinneapolis http://bit.ly/2dj2WCl Nice p …" with its POS tags (ADJ SYM X X PUNCT VERB NOUN ADP DET NOUN X X ADJ X PUNCT) and dependency arcs (root, list, discourse, obl, obj, det, case, punct).]

Figure 1: An example to illustrate non-syntactic tokens: sentiment emoticon, retweet marker and its following at-mention, topical hashtag, referential URL, and truncated word. This is a concatenation of three real tweets.

[Figure 2 shows the tweet "@username its #awesome u gonna ♥ it Chk out our cooool project on http://project_link + RT it" with its POS tags (PROPN PRON ADJ PRON VERB VERB PRON VERB ADP PRON ADJ NOUN ADP X SYM VERB PRON) and dependency arcs (root, conj, parataxis, vocative, nmod:poss, nsubj, xcomp, obj, compound:prt, amod, case, cc, obl).]

Figure 2: An example to illustrate informal but syntactic tokens. This is a contrived example inspired by several tweets.

their following at-mentions, topical hashtags, referential URLs, and truncated words.⁶ Figure 1 illustrates examples of these non-syntactic tokens. As discussed above, these are generally tagged with the other (X) part of speech, except emoticons, which are tagged as symbol (SYM). In our annotation, 7.55% of all tokens belong to one of the five types; detailed statistics can be found in Table 1.

⁶The tweets we analyze have at most 140 characters. Although Twitter has doubled the tweet length limit to 280 characters since our analysis, we believe this type of token will still remain.

                  syntactic (%)   non-syntactic (%)
emoticons             0.25             0.95
RT                    0.14             2.49
hashtag               1.02             1.24
URL                   0.67             2.38
truncated words       0.00             0.49
total                 2.08             7.55

Table 1: Proportions of non-syntactic tokens in our annotation. These statistics are obtained on 140 character–limited tweets.

It is important to note that these types may, in some contexts, have syntactic functions. For example, besides being a discourse marker, RT can abbreviate the verb retweet; emoticons and hashtags may be used as content words within a sentence; and at-mentions can be normal vocative proper nouns: see Figure 2. Therefore, the criteria for annotating a token as non-syntactic must be context-dependent. Inspired by the way UD deals with punctuation (which is canonically non-syntactic), we adopt the following conventions:

• If a non-syntactic token is within a sentence that has a clear predicate, it will be attached to this predicate. The retweet construction is a special case and we will discuss its treatment in the following paragraph.

• If the whole sentence is a sequence of non-syntactic tokens, we attach all these tokens to the first one.

• Non-syntactic tokens are mostly labeled as discourse, but URLs are always labeled as list, following the UD_English-EWT dataset.

Kong et al. (2014) proposed an additional preprocessing step, token selection, in their annotation process. They required the annotators to first select the non-syntactic tokens and exclude them from the final dependency annotation. In order to keep our annotation conventions in line with UD norms and preserve the original tweets as much as possible, we include non-syntactic tokens in our annotation, following the conventions above. Compared with Kong et al. (2014), we also gave a clear definition of non-syntactic tokens, which helped us avoid confusion during annotation.

Retweet construction. Figure 1 shows an example of the retweet construction (RT @coldplay :). This might be treated as a verb phrase, with RT as a verb and the at-mention as an argument. This solution would lead to an uninformative root word and, since this expression is idiomatic to Twitter, might create unnecessary confusion for downstream applications aiming to identify the main predicate(s) of a tweet. We therefore treat the whole expression as non-syntactic, including assigning the other (X) part of speech to both RT and @coldplay, attaching the at-mention to RT with the discourse label and the colon to RT with the punct(uation) label, and attaching RT to the predicate of the following sentence.

Constructions handled by UD. A number of constructions that are especially common in tweets are handled by UD conventions: ellipsis, irregular word orders, and paratactic phrases and sentences not explicitly delineated by punctuation.

Vocative at-mentions. Another idiomatic construction on Twitter is a vocative at-mention (sometimes a signal that a tweet is a reply to a tweet by the mentioned user). We treat these at-mentions as vocative expressions, labeling them with the POS tag proper noun (PROPN) and attaching them to the main predicate of the sentence they are within with the label vocative, as in the UD guidelines (see Figure 2 for an example).

2.5 Comparison to PoSTWITA-UD

The first Twitter treebank annotated with Universal Dependencies was the PoSTWITA-UD corpus for Italian (Sanguinetti et al., 2017), which consists of 6,738 tweets (119,726 tokens). In their convention, tokenization tends to preserve the original tweet content, but two special cases, articulated prepositions (e.g., nella as in la) and clitic clusters (e.g., guardandosi as guardando si), are tokenized. Their lemmas include spelling normalization, whereas our lemmas only normalize casing and inflectional morphology. The current UD guidelines on lemmas are flexible, so variation between treebanks is expected.⁷

⁷http://universaldependencies.org/u/overview/morphology.html#lemmas

With respect to tweet-specific constructions, Sanguinetti et al.'s (2017) and our interpretations of headedness are the same, but we differ in the relation label. For topical hashtags, we use discourse while they used parataxis. In referential URLs, we use list (following the precedent of UD_English-EWT) while they used dep. Our choice of discourse for sentiment emoticons is inspired by the observation that emoticons are annotated as discourse by UD_English-EWT; Sanguinetti et al. (2017) used the same relation for emoticons. Retweet constructions and truncated words were not explicitly touched by Sanguinetti et al. (2017). Judging from the released treebank,⁸ the RT marker, at-mention, and colon in the retweet construction are all attached to the predicate of the following sentence with dep, vocative:mention, and punct. We expect that the official UD guidelines will eventually adopt standards for these constructions so the treebanks can be harmonized.

⁸https://github.com/UniversalDependencies/UD_Italian-PoSTWITA

2.6 TWEEBANKV2

Following the guidelines presented above, we create a new Twitter dependency treebank, which we call TWEEBANKV2.

2.6.1 Data Collection

TWEEBANKV2 is built on the original data of TWEEBANKV1 (840 unique tweets, 639/201 for training/test), along with an additional 210 tweets sampled from the POS-tagged dataset of Gimpel et al. (2011) and 2,500 tweets sampled from the Twitter stream from February 2016 to July 2016.⁹ The latter data source consists of 147.4M English tweets after being filtered by the lang attribute in the tweet JSON and langid.py.¹⁰ As done by Kong et al. (2014), the annotation unit is always the tweet in its entirety—which may consist of multiple sentences—not the sentence alone. Before annotation, we use a simple regular expression to anonymize usernames and URLs.

⁹Data downloaded from https://archive.org/.
¹⁰https://github.com/saffsd/langid.py

        TWEEBANKV1          TWEEBANKV2
split   tweets   tokens     tweets   tokens
train      639    9,310      1,639   24,753
dev.         –        –        710   11,742
test       201    2,839      1,201   19,112
total      840   12,149      3,550   55,607

Table 2: Statistics of TWEEBANKV2 and comparison with TWEEBANKV1.

2.6.2 Annotation Process

Our annotation process was conducted in two stages. In the first stage, 18 researchers worked on the TWEEBANKV1 portion and the additional 210 tweets and created the initial annotations in one day.
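The anonymization step in §2.6.1 is described only as "a simple regular expression." A sketch of what such a step might look like is below; the exact patterns and the `http://url` placeholder are our own assumptions (the `@username` placeholder matches what appears in Figure 2), not the authors' implementation.

```python
import re

# Hypothetical patterns: the paper states only that usernames and
# URLs are anonymized with a simple regular expression.
USERNAME = re.compile(r"@\w+")
URL = re.compile(r"https?://\S+")

def anonymize(tweet):
    """Replace at-mentions and URLs with fixed placeholders."""
    tweet = USERNAME.sub("@username", tweet)
    return URL.sub("http://url", tweet)

print(anonymize("RT @coldplay : Fix You http://bit.ly/2dj2WCl"))
# → RT @username : Fix You http://url
```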

[Figure 3 shows two competing dependency parses of "Olympic gold medallist whipped by John Fisher": one annotator's arcs (amod, compound, acl, obl, case, flat) above the sentence, the other annotator's disagreeing arcs (root, nsubj) below.]

Figure 3: An example of disagreement; one annotator's parse is shown above, disagreeing arcs from the other annotator are shown below. This is a real example in our annotation.

Before annotating, they were given a tutorial overview of the general UD annotation conventions and our guidelines specifically for annotating tweets. Both the guidelines and annotations were further refined by the authors of this paper to increase the coverage of our guidelines and resolve inconsistencies between different annotators during this exercise. In the second stage, a tokenizer, a POS tagger, and a parser were trained on the annotated data from the first stage (1,050 tweets in total), and used to automatically analyze the sampled 2,500 tweets. Authors of this paper manually corrected the parsed data and finally achieved 3,550 labeled tweets.¹¹ Newly created annotations are split into train, development, and test sets and appended to the original splits of TWEEBANKV1. Statistics of our annotations and data splits are shown in Table 2.

¹¹Manual annotation was done with Arborator (Gerdes, 2013), a web platform for drawing dependency trees.

We report the inter-annotator agreement between the annotators in the second stage. There is very little disagreement on the tokenization annotation. The agreement rate is 96.6% on POS, 88.8% on unlabeled dependencies, and 84.3% on labeled dependencies. Further analysis shows the major disagreements on POS involve entity names (30.6%) and topical hashtags (18.1%). Taking the example in Figure 1, "Fix You" can be understood as a verbal phrase but also as the name of Coldplay's single, and thus tagged as a proper noun. An example of a disagreement on dependencies is shown in Figure 3. Depending on whether this is an example of a zero copula construction or a clause-modified noun, either annotation is plausible.

3 Parsing Pipeline

We present a pipeline system to parse tweets into Universal Dependencies. We evaluate each component individually, and the system as a whole.

3.1 Tokenizer

Tokenization, as the initial step of many NLP tasks, is non-trivial for informal tweets, which include hashtags, at-mentions, and emoticons (O'Connor et al., 2010). Context is often required for tokenization decisions; for example, the asterisk in 4*3 is a separate token signifying multiplication, but the asterisk in sh*t works as a mask to evoke censorship and should not be segmented.

We introduce a new character-level bidirectional LSTM (bi-LSTM) sequence-labeling model (Huang et al., 2015; Ma and Hovy, 2016) for tokenization. Our model takes the raw sentence and tags each character in the sentence according to whether it is the beginning of a word (1 for the beginning and 0 otherwise). Figure 4 shows the architecture of our tokenization model. Space is treated as an input but deterministically assigned a special tag $.

[Figure 4 shows the bi-LSTM tokenizer architecture: character embeddings for "… i t s $ g o n n a $ b e …" feed a bi-LSTM that outputs the tag sequence 1 0 1 $ 1 0 0 1 0 $ 1 0.]

Figure 4: The bi-LSTM tokenizer that segments 'its gonna be' into 'it s gon na be'.

Experimental results. Our preliminary results showed that our model trained on the combination of UD_English-EWT and TWEEBANKV2 outperformed the one trained only on UD_English-EWT or TWEEBANKV2, consistent with previous work on dialect treebank parsing (Wang et al., 2017). So we trained our tokenizer on the training portion of TWEEBANKV2 combined with the UD_English-EWT training set and tested on the TWEEBANKV2 test set. We report F1 scores, combining precision and recall for token identification. Table 3 shows the tokenization results, compared to other available tokenizers.
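The per-character labeling scheme of §3.1 can be sketched in plain Python. The actual system predicts the 1/0/$ tags with a bi-LSTM; here we only show, under our reading of the scheme, how a gold tokenization converts to and from the label sequence in Figure 4.

```python
def encode_labels(raw, tokens):
    """Emit one label per character of `raw`: '1' = beginning of a
    token, '0' = inside a token, '$' = whitespace (deterministic)."""
    labels = []
    i = 0  # position in raw
    for tok in tokens:
        while raw[i].isspace():  # whitespace between tokens
            labels.append('$')
            i += 1
        assert raw[i:i + len(tok)] == tok, "tokens must align with raw text"
        labels.append('1')
        labels.extend('0' * (len(tok) - 1))
        i += len(tok)
    return labels

def decode_labels(raw, labels):
    """Recover the token sequence from per-character labels."""
    tokens = []
    for ch, lab in zip(raw, labels):
        if lab == '$':
            continue
        if lab == '1':
            tokens.append(ch)
        else:
            tokens[-1] += ch
    return tokens

raw = "its gonna be"
labels = encode_labels(raw, ["it", "s", "gon", "na", "be"])
print("".join(labels))             # → 101$10010$10
print(decode_labels(raw, labels))  # → ['it', 's', 'gon', 'na', 'be']
```

Note that the label string matches the tag sequence shown in Figure 4 for the same example.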

11Manual annotation was done with Arborator (Gerdes, 12We use the updated version of Twokenizer from Owoputi 2013), a web platform for drawing dependency trees. et al.(2013). System F1 System Accuracy Stanford CoreNLP 97.3 Stanford CoreNLP 90.6 Twokenizer 94.6 Owoputi et al., 2013 (greedy) 93.7 UDPipe v1.2 97.4 Owoputi et al., 2013 (CRF) 94.6 our bi-LSTM tokenizer 98.3 Ma and Hovy, 2016 92.5

Table 3: Tokenizer comparison on the TWEEBANKV2 test Table 4: POS tagger comparison on gold-standard tokens in set. the TWEEBANKV2 test set. systems and were not adapted to the UD tok- Tokenization System F1 enization scheme. The UDPipe v1.2 (Straka and Stanford CoreNLP 92.3 Straková, 2017) model was re-trained on the same our bi-LSTM tokenizer (§3.1) 93.3 data as our system. Compared with UDPipe, we Table 5: Owoputi et al.(2013) POS tagging performance use an LSTM instead of a GRU in our model with automatic tokenization on the TWEEBANKV2 test set. and we also use a larger size for hidden units (64 vs. 20), which has stronger representational power. Results of the Owoputi et al.(2013) tagger with Our bi-LSTM tokenizer achieves the best accuracy non-greedy inference on automatically tokenized among all these tokenizers. These results speak to data are shown in Table5. We see that errors in to- the value of statistical modeling in tokenization for kenization do propagate, but tagging performance informal texts. is above 93% with our tokenizer. 3.2 Part-of-Speech Tagger Part-of-speech tagging for tweets has been exten- 3.3 Parser sively studied (Ritter et al., 2011; Gimpel et al., Social media applications typically require pro- 2011; Derczynski et al., 2013; Owoputi et al., cessing large volumes of data, making speed an 2013; Gui et al., 2017). We therefore consider important consideration. We therefore begin with existing POS taggers for tweets instead of devel- the neural greedy stack LSTM parser introduced oping our own. On the annotation scheme de- by Ballesteros et al.(2015), which can parse a sen- signed in §2.3, based on UD and adapted for Twit- tence in linear time and harnesses character repre- ter, we compared several existing systems: the sentations to construct word vectors, which should Stanford CoreNLP tagger, Owoputi et al.’s (2013) help mitigate the challenge of spelling variation. 
word cluster–enhanced tagger (both greedy and We encourage the reader to refer their paper for CRF variants), and Ma and Hovy’s (2016) neu- more details about the model. ral network tagger which achieves the state-of-the- In our initial experiments, we train our parser on art performance on PTB. Gui et al.(2017) pre- the combination of UD_English-EWT and TWEE- sented a state-of-the-art neural tagger for Twit- BANKV2 training sets. Gold-standard tokeniza- ter, but their implementation works only with the tion and automatic POS tags are used. Automatic PTB tagset, so we exclude it. All compared sys- POS tags are assigned with 5-fold jackknifing. tems were re-trained on the combination of the Hyperparameters are tuned on the TWEEBANKV2 UD_English-EWT and TWEEBANKV2 training development set. Unlabeled attachment score and sets. We use Twitter-specific GloVe embeddings labeled attachment score (including punctuation) released by Pennington et al.(2014) in all neural are reported. All the experiments were run on a taggers and parsers.13 Xeon E5-2670 2.6 GHz machine. Experimental results. We tested the POS tag- Reimers and Gurevych(2017) and others have gers on the TWEEBANKV2 test set. Results pointed out that neural network training is nonde- with gold-standard tokenization are shown in Ta- terministic and depends on the seed for the random ble4. Careful feature engineering and Brown number generator. Our preliminary experiments et al.(1992) clusters help Owoputi et al.’s (2013) confirm this finding, with a gap of 1.4 LAS on de- feature-based POS taggers to outperform Ma and velopment data between the best (76.2) and worst Hovy’s (2016) neural network model. (74.8) runs. To control for this effect, we report the

13http://nlp.stanford.edu/data/glove.twitter. average of five differently-seeded runs, for each of 27B.zip our models and the compared ones. System UAS LAS Kt/s 78 77.6 Kong et al.(2014) 81.4 76.9 0.3 77.3 77.3 ● 77.4 ● 77 ● ● Dozat et al.(2017) 81.8 77.7 1.7 ● 77 76.7 76.4 ●

Ballesteros et al.(2015) 80.2 75.7 2.3 las ● 76.1 ● 76 75.8 Ensemble (20) 83.4 79.4 0.2 ●

System                         UAS    LAS    Kt/s
Distillation (α = 1.0)         81.8   77.6   2.3
Distillation (α = 0.9)         82.0   77.8   2.3
Distillation w/ exploration    82.1   77.9   2.3

Table 6: Dependency parser comparison on the TWEEBANKV2 test set, with automatic POS tags. We use Ballesteros et al. (2015) as our baseline and build the ensemble and the distilled model over it. The "Kt/s" column shows the parsing speed, in thousands of tokens processed per second.

[Figure 5: The effect of α on distillation.]

Initial results. The first section of Table 6 compares the stack LSTM with TWEEBOPARSER (the system of Kong et al., 2014) and the state-of-the-art parser in the CoNLL 2017 evaluations, due to Dozat et al. (2017). Kong et al.'s (2014) parser is a graph-based parser with lexical and word cluster features, and it uses dual decomposition for decoding. The parser of Dozat et al. (2017) is also graph-based, but includes character-based word representations and uses a biaffine classifier to predict whether an attachment exists between two words. Both compared systems require superlinear runtime due to graph-based parsing. They are re-trained on the same data as our system. Our baseline lags behind them by nearly two LAS points but runs faster than both.

Ensemble. Due to ambiguity in the training data, to which most loss functions are not robust (Frénay and Verleysen, 2014), including the log loss we use following Ballesteros et al. (2015), and due to the instability of neural network training, we follow Dietterich (2000) and consider an ensemble of twenty parsers trained with different random initializations. To parse at test time, the transition probabilities of the twenty ensemble members are averaged. The result achieves an LAS of 79.4, outperforming all three systems above (Table 6).

Distillation. The shortcoming of the 20-parser ensemble is, of course, that it requires twenty times the runtime of a single greedy parser, making it the slowest system in our comparison. Kuncoro et al. (2016) proposed distilling an ensemble of 20 greedy transition-based parsers into a single graph-based parser by transforming the votes of the ensemble into a structured loss function. However, as Kuncoro et al. pointed out, it is not straightforward to use a structured loss in a transition-based parsing algorithm. Because fast runtime is so important for NLP on social media, we introduce a new way to distill our greedy ensemble into a single transition-based parser (the first such attempt, to our knowledge).

Our approach applies techniques from Hinton et al. (2015) and Kim and Rush (2016) to parsing. Note that training a transition-based parser typically involves transforming the training data into a sequence of "oracle" state-action pairs. Let q(a | s) denote the distilled model's probability of an action a given parser state s; let p(a | s) be the probability under the ensemble (i.e., the average of the 20 separately trained ensemble members). To train the distilled model, we minimize the interpolation between the distillation loss and the conventional log loss:

  argmin_q  α Σ_i Σ_a −p(a | s_i) · log q(a | s_i)    (distillation loss)
          + (1 − α) Σ_i −log q(a_i | s_i)             (log loss)          (1)

Distilling from the ensemble leads to a single greedy transition-based parser with 77.8 LAS: better than past systems, but worse than our more expensive ensemble. The effect of α is illustrated in Figure 5; generally, paying closer attention to the ensemble, rather than to the conventional log loss objective, leads to better performance.

Learning from exploration. When we set α = 1, we eliminate the oracle from the estimation procedure for the distilled model. This presents an opportunity to learn with exploration, by randomly sampling transitions from the ensemble, a strategy found useful in recent methods for training greedy models with dynamic oracles (Goldberg and Nivre, 2012, 2013; Kiperwasser and Goldberg, 2016; Ballesteros et al., 2016). We find that this approach outperforms the conventional distillation model, coming in 1.5 points behind the ensemble (last line of Table 6).

Pipeline stage    Score     Ours    SOTA
Tokenization      F1        98.3    97.3
POS tagging       F1        93.3    92.2
UD parsing        LAS F1    74.0    71.4

Table 7: Evaluating our pipeline against a state-of-the-art pipeline.
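The interpolated objective in Eq. 1 reduces, per parser state, to a weighted sum of a cross-entropy against the ensemble's averaged action distribution and the conventional log loss against the oracle action. A minimal sketch of the per-state loss (function and variable names here are illustrative, not taken from the authors' implementation):

```python
import math

def distilled_state_loss(q, p, oracle_action, alpha):
    """Interpolated training loss for one parser state; Eq. 1 sums this
    quantity over all states i in the training data.

    q             -- distilled model's action probabilities (list of floats)
    p             -- ensemble-averaged action probabilities (list of floats)
    oracle_action -- index of the gold "oracle" transition a_i
    alpha         -- weight on the distillation term
    """
    eps = 1e-12  # numerical safety for log of tiny probabilities
    distillation = -sum(p_a * math.log(q_a + eps) for p_a, q_a in zip(p, q))
    log_loss = -math.log(q[oracle_action] + eps)
    return alpha * distillation + (1.0 - alpha) * log_loss
```

With α = 1 the oracle term vanishes, which is what opens the door to the exploration variant; with α = 0 this reduces to ordinary log-loss training against the oracle.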
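At test time the ensemble simply averages the members' transition distributions, and in the exploration variant the training transitions are sampled from that averaged distribution rather than read off the oracle. A hypothetical sketch, with parser internals elided and an assumed `action_probs(state)` interface on each member:

```python
import random

def ensemble_action_probs(members, state):
    """Average the transition probabilities of all ensemble members.
    Each member is assumed to expose action_probs(state) -> list of floats."""
    per_member = [m.action_probs(state) for m in members]
    n = len(members)
    return [sum(col) / n for col in zip(*per_member)]

def sample_exploration_action(probs, rng=random):
    """Sample a transition from the averaged distribution (used when
    alpha = 1, i.e., learning with exploration instead of the oracle)."""
    r, cum = rng.random(), 0.0
    for action, p in enumerate(probs):
        cum += p
        if r < cum:
            return action
    return len(probs) - 1  # guard against floating-point round-off
```

Greedy decoding with the ensemble instead takes the argmax of the averaged distribution at each state.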

Pipeline evaluation. Finally, we report our full pipeline's performance in Table 7. We also compare our model with a pipeline of state-of-the-art systems (labeled "SOTA"): the Stanford CoreNLP tokenizer,14 Owoputi et al.'s (2013) tagger, and Dozat et al.'s (2017) parser. Our system differs from the state-of-the-art pipeline in the tokenization and parser components. As Table 7 shows, our pipeline outperforms the state of the art when evaluated in a pipeline manner. The results also emphasize the importance of tokenization: without gold tokenization, UD parsing performance drops by about four points.

4 Conclusion

We study the problem of parsing tweets into Universal Dependencies. We adapt the UD guidelines to cover special constructions in tweets and create TWEEBANKV2, which has 55,607 tokens. We characterize the disagreements among our annotators and argue that inherent ambiguity in this genre makes consistent annotation a challenge. Using this new treebank, we build a pipeline system to parse tweets into UD. We also propose a new method to distill an ensemble of 20 greedy parsers into a single one, to overcome annotation noise without sacrificing efficiency. Our parser achieves an improvement of 2.2 LAS over a strong baseline and outperforms other state-of-the-art parsers in both accuracy and speed.

Acknowledgments

We thank Elizabeth Clark, Lucy Lin, Nelson Liu, Kelvin Luu, Phoebe Mulcaire, Hao Peng, Maarten Sap, Chenhao Tan, and Sam Thomson at the University of Washington, and Austin Blodgett, Lucia Donatelli, Joe Garman, Emma Manning, Angela Yang, and Yushi Zhang at Georgetown University for their annotation efforts in the first round. We are grateful for the support from Lingpeng Kong at the initial stage of this project. We also thank the anonymous reviewers for their helpful comments and suggestions. This work was supported by the National Key Basic Research Program of China via grant 2014CB340503 and the National Natural Science Foundation of China (NSFC) via grant 61632011.

References

Waleed Ammar, George Mulcaire, Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2016. Many languages, one parser. TACL 4.

Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based parsing by modeling characters instead of words with LSTMs. In Proc. of EMNLP.

Miguel Ballesteros, Yoav Goldberg, Chris Dyer, and Noah A. Smith. 2016. Training with exploration improves a greedy stack LSTM parser. In Proc. of EMNLP.

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics 18(4).

Marie-Catherine de Marneffe, Timothy Dozat, Natalia Silveira, Katri Haverinen, Filip Ginter, Joakim Nivre, and Christopher D. Manning. 2014. Universal Stanford Dependencies: A cross-linguistic typology. In Proc. of LREC.

Leon Derczynski, Alan Ritter, Sam Clark, and Kalina Bontcheva. 2013. Twitter part-of-speech tagging for all: Overcoming sparse and noisy data. In Proc. of RANLP.

Thomas G. Dietterich. 2000. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning 40(2):139–157.

Timothy Dozat, Peng Qi, and Christopher D. Manning. 2017. Stanford's graph-based neural dependency parser at the CoNLL 2017 shared task. In Proc. of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies.
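Because the pipeline predicts its own tokenization, the scores in Table 7 are F1 over predicted versus gold units (token spans, tagged tokens, or attachments) rather than plain accuracy. A generic sketch of this span-based F1, with hypothetical character-offset inputs:

```python
def span_f1(gold, pred):
    """F1 over two sets of units, e.g. (start, end) character-offset
    token spans produced by a gold and a predicted tokenization."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # units present in both analyses
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, if the gold tokenization is {(0, 2), (3, 7)} and the predicted one is {(0, 2), (3, 5)}, precision and recall are both 0.5, so F1 is 0.5. Tagging and parsing F1 work the same way, with the unit enriched by a POS tag or a labeled head attachment.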

14 We choose the Stanford CoreNLP tokenizer in the spirit of comparing rule-based and statistical methods.

Jacob Eisenstein. 2013. What to do about bad language on the internet. In Proc. of NAACL.

Jennifer Foster, Özlem Çetinoğlu, Joachim Wagner, Joseph Le Roux, Stephen Hogan, Joakim Nivre, Deirdre Hogan, and Josef van Genabith. 2011. #hardtoparse: POS tagging and parsing the Twitterverse. In Proc. of the 5th AAAI Conference on Analyzing Microtext.

Benoît Frénay and Michel Verleysen. 2014. Classification in the presence of label noise: A survey. IEEE Transactions on Neural Networks and Learning Systems 25:845–869.

Kim Gerdes. 2013. Collaborative dependency annotation. In Proc. of DepLing.

Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proc. of ACL.

Yoav Goldberg and Joakim Nivre. 2012. A dynamic oracle for arc-eager dependency parsing. In Proc. of COLING.

Yoav Goldberg and Joakim Nivre. 2013. Training deterministic parsers with non-deterministic oracles. TACL 1.

Tao Gui, Qi Zhang, Haoran Huang, Minlong Peng, and Xuanjing Huang. 2017. Part-of-speech tagging for Twitter with adversarial neural networks. In Proc. of EMNLP.

Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, and Ting Liu. 2015. Cross-lingual dependency parsing based on distributed representations. In Proc. of ACL.

Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. CoRR abs/1503.02531.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR abs/1508.01991.

Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In Proc. of EMNLP.

Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. TACL 4.

Lingpeng Kong, Nathan Schneider, Swabha Swayamdipta, Archna Bhatia, Chris Dyer, and Noah A. Smith. 2014. A dependency parser for tweets. In Proc. of EMNLP.

Adhiguna Kuncoro, Miguel Ballesteros, Lingpeng Kong, Chris Dyer, and Noah A. Smith. 2016. Distilling an ensemble of greedy dependency parsers into one MST parser. In Proc. of EMNLP.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proc. of ACL.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL System Demonstrations.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19(2):313–330.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proc. of LREC.

Brendan O'Connor, Michel Krieger, and David Ahn. 2010. TweetMotif: Exploratory search and topic summarization for Twitter. In Proc. of ICWSM.

Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In Proc. of NAACL.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proc. of EMNLP.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proc. of LREC.

Nils Reimers and Iryna Gurevych. 2017. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proc. of EMNLP.

Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named entity recognition in tweets: An experimental study. In Proc. of EMNLP.

Manuela Sanguinetti, Cristina Bosco, Alessandro Mazzei, Alberto Lavelli, and Fabio Tamburini. 2017. Annotating Italian social media texts in Universal Dependencies. In Proc. of DepLing.

Nathan Schneider, Brendan O'Connor, Naomi Saphra, David Bamman, Manaal Faruqui, Noah A. Smith, Chris Dyer, and Jason Baldridge. 2013. A framework for (under)specifying dependency syntax without overloading annotators. In Proc. of the 7th Linguistic Annotation Workshop and Interoperability with Discourse.

Milan Straka and Jana Straková. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proc. of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies.

Hongmin Wang, Yue Zhang, GuangYong Leonard Chan, Jie Yang, and Hai Leong Chieu. 2017. Universal Dependencies parsing for colloquial Singaporean English. In Proc. of ACL.

Hiroyasu Yamada and Yuji Matsumoto. 2003. Statistical dependency analysis with support vector machines. In Proc. of IWPT.