AlpacaTag: An Active Learning-based Crowd Annotation Framework for Sequence Tagging

Bill Yuchen Lin†∗  Dong-Ho Lee†∗  Frank F. Xu‡  Ouyu Lan†  Xiang Ren†
{yuchen.lin,dongho.lee,olan,xiangren}@usc.edu, [email protected]
†Computer Science Department, University of Southern California
‡Language Technologies Institute, Carnegie Mellon University
∗Both authors contributed equally.

Abstract

We introduce an open-source web-based data annotation framework (AlpacaTag) for sequence tagging tasks such as named-entity recognition (NER). The distinctive advantages of AlpacaTag are three-fold. 1) Active intelligent recommendation: dynamically suggesting annotations and sampling the most informative unlabeled instances with a back-end active learned model; 2) Automatic crowd consolidation: enhancing real-time inter-annotator agreement by merging inconsistent labels from multiple annotators; 3) Real-time model deployment: users can deploy their models in downstream systems while new annotations are being made. AlpacaTag is a comprehensive solution for sequence labeling tasks, ranging from rapid tagging with recommendations powered by active learning and auto-consolidation of crowd annotations to real-time model deployment.

Figure 1: Overview of the AlpacaTag framework.

1 Introduction

Sequence tagging is a major type of task in natural language processing (NLP), including named-entity recognition (detecting and typing entity names), keyword extraction (e.g., extracting aspect terms in reviews or essential terms in queries), chunking (extracting phrases), and word segmentation (identifying word boundaries in languages like Chinese). State-of-the-art supervised approaches to sequence tagging are highly dependent on numerous annotations. New annotations are usually necessary for a new domain or task, even though transfer learning techniques (Lin and Lu, 2018) can reduce the amount needed by reusing data from other related tasks. However, manually annotating sequences can be time-consuming, expensive, and thus hard to scale. Therefore, it remains an important research question how to develop a better annotation framework that largely reduces human effort.

Existing open-source sequence annotation tools (Stenetorp et al., 2012; Druskat et al., 2014a; de Castilho et al., 2016; Yang et al., 2018a) mainly focus on enhancing the friendliness of user interfaces (UI), such as data management, fast tagging with shortcut keys, support for more platforms, and multi-annotator analysis. We argue that there are still three important yet underexplored directions of improvement: 1) active intelligent recommendation, 2) automatic crowd consolidation, and 3) real-time model deployment. Therefore, we propose a novel web-based annotation tool named AlpacaTag¹ to address these three problems.

Active intelligent recommendation (§3) aims to reduce human effort at both the instance level and the corpus level by learning a back-end sequence tagging model incrementally with incoming human annotations. At the instance level, AlpacaTag applies the model predictions on current unlabeled sentences as tagging suggestions. Apart from that, we also greedily match frequent noun phrases and a dictionary of already annotated spans as recommendations. Annotators can easily confirm or decline such recommendations with our specialized UI. At the corpus level, we use active learning algorithms (Settles, 2009) for selecting the next batches of instances to annotate based on the model. The goal is to select the most informative instances for humans to tag and thus achieve a more cost-effective use of human effort.

Automatic crowd consolidation (§4) of the annotations from multiple annotators is an underexplored topic in developing crowd-sourcing annotation frameworks. As a crowdsourcing framework, AlpacaTag collects annotations from multiple (non-expert) contributors at lower cost and higher speed. However, annotators usually have different confidences, preferences, and biases in annotating, which can lead to high inter-annotator disagreement, and training models with such noisy crowd annotations has been shown to be very challenging (Nguyen et al., 2017; Yang et al., 2018b). We argue that consolidating crowd labels during annotation can lead annotators to reach real-time consensus, and thus decrease disagreement among annotations instead of requiring exhaustive post-processing.

Real-time model deployment (§5) is also a desired feature for users. We sometimes need to deploy a state-of-the-art sequence tagging model while the crowdsourcing is still ongoing, such that users can develop their tagging-dependent systems with our APIs.

To the best of our knowledge, there is no existing annotation framework that enjoys these three features. AlpacaTag is the first unified framework to address these problems, while inheriting the advantages of existing tools. It thus provides a more comprehensive solution to crowdsourcing annotations for sequence tagging tasks.

In this paper, we first present the high-level structure and design of the proposed AlpacaTag framework. Three key features are then introduced in detail: active intelligent recommendation (§3), automatic crowd consolidation (§4), and real-time model deployment (§5). Experiments (§6) are conducted to show the effectiveness of the proposed three features. Comparisons with related works are discussed in §7. Section §8 presents conclusions and future directions.

¹The source code is publicly available at http://inklab.usc.edu/AlpacaTag/.
2 Overview of AlpacaTag

As shown in Figure 1, AlpacaTag has an actively learned back-end model (top-left) in addition to a front-end web UI (bottom). Thus, we have two separate servers: a back-end model server and a front-end annotation server. The model server is built with PyTorch and supports a set of APIs for communication between the two servers and for model deployment. The annotation server is built on Django in Python, which interacts with administrators and annotators.

To start annotating for a domain of interest, admins first log in and create a new project. Then, they import their raw corpus (in CSV or JSON format) and define the tag space with associated colors and shortcut keys. Admins can further set the batch size used to sample instances and to update back-end models. Annotators can log in and annotate (actively sampled) unlabeled instances with tagging suggestions. We further present our three key features in the next sections.

3 Active Intelligent Recommendation

This section first introduces the back-end model (§3.1) and then presents how we use it for both instance-level recommendations (tagging suggestions, §3.2) and corpus-level recommendations (active sampling, §3.3).

3.1 Back-end Model: BLSTM-CRF

The core component of the proposed AlpacaTag framework is the back-end sequence tagging model, which is learned with an incremental active learning scheme. We use a state-of-the-art sequence tagging model as our back-end model (Lample et al., 2016; Lin et al., 2017; Liu et al., 2018), which is based on bidirectional LSTM networks with a CRF layer (BLSTM-CRF). It captures character-level patterns, encodes token sequences with pre-trained word embeddings, and uses CRF layers to capture structural dependencies in tagging. In this section, we assume the model is fixed, as we focus on how to use it to infer recommendations. How to update it by consolidating crowd annotations is described in Section §4.
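To make the described architecture concrete, the following is a minimal PyTorch-style sketch of such a tagger, assuming a character-level BiLSTM combined with pre-trained word embeddings; the CRF layer is only indicated by the emission (tag-score) layer it would sit on, and all module names and dimensions are illustrative rather than AlpacaTag's actual implementation.

```python
# A minimal sketch (not AlpacaTag's actual code): word + char BiLSTM encoder
# that produces per-token tag scores; a CRF layer would normally sit on top
# of these emission scores to model tag-transition dependencies.
import torch
import torch.nn as nn

class BLSTMTagger(nn.Module):
    def __init__(self, n_words, n_chars, n_tags, w_dim=100, c_dim=25, hidden=200):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, w_dim)          # ideally initialized from pre-trained vectors
        self.char_emb = nn.Embedding(n_chars, c_dim)
        self.char_lstm = nn.LSTM(c_dim, c_dim, bidirectional=True, batch_first=True)
        self.lstm = nn.LSTM(w_dim + 2 * c_dim, hidden, bidirectional=True, batch_first=True)
        self.emission = nn.Linear(2 * hidden, n_tags)          # per-token tag scores (CRF input)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, word_len)
        b, t, w = char_ids.shape
        chars = self.char_emb(char_ids).view(b * t, w, -1)
        _, (h, _) = self.char_lstm(chars)                      # final hidden states of both directions
        char_feat = torch.cat([h[0], h[1]], dim=-1).view(b, t, -1)
        tokens = torch.cat([self.word_emb(word_ids), char_feat], dim=-1)
        out, _ = self.lstm(tokens)
        return self.emission(out)                              # (batch, seq_len, n_tags)

# tiny smoke test with random ids
scores = BLSTMTagger(1000, 80, 9)(torch.randint(0, 1000, (2, 6)),
                                  torch.randint(0, 80, (2, 6, 10)))
print(scores.shape)  # torch.Size([2, 6, 9])
```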
3.2 Tagging Suggestions (Instance-level Rec.)

Apart from the inference results of the back-end model on the sentences, we also include the two other kinds of tagging suggestions mentioned in Fig. 1: (frequent) noun phrases and the dictionary of already tagged spans. Specifically, after admins upload raw corpora, AlpacaTag runs a phrase mining algorithm (Shang et al., 2018) to gather a list of frequent noun phrases, which are more likely to be entities. Admins can also optionally enable the framework to consider all noun phrases obtained by chunking as span candidates. These suggestions are not typed. Additionally, we maintain a dictionary mapping from already tagged spans in previous sentences to their majority tagged types. Frequent noun phrases and dictionary entries are matched by a greedy forward maximum matching algorithm. The final tagging suggestions are therefore merged from three sources with the following priority (ordered by their precision): dictionary matches > back-end model inferences > (frequent) noun phrases, while the coverage of the suggestions is in the opposite order.
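As an illustration of how these three suggestion sources could be combined, here is a small sketch of greedy forward maximum matching against a span dictionary plus a priority-based merge; the function names and the exact tie-breaking rules are our assumptions, not AlpacaTag's documented behavior.

```python
# Sketch (assumed logic, not AlpacaTag's actual implementation): greedily match
# the longest dictionary/noun-phrase span starting at each token, then merge
# suggestion sources by precision: dictionary > model inference > noun phrase.
def greedy_forward_match(tokens, span_dict, max_len=6):
    """Yield (start, end, type) for the longest dictionary match at each position."""
    i, n = 0, len(tokens)
    while i < n:
        for j in range(min(n, i + max_len), i, -1):       # try the longest span first
            span = " ".join(tokens[i:j]).lower()
            if span in span_dict:
                yield (i, j, span_dict[span])
                i = j
                break
        else:
            i += 1

def merge_suggestions(dict_spans, model_spans, np_spans):
    """Merge spans from the three sources; higher-precision sources win on overlap."""
    taken, merged = set(), []
    for source in (dict_spans, model_spans, np_spans):     # priority order
        for start, end, label in source:
            if not taken & set(range(start, end)):
                merged.append((start, end, label))
                taken |= set(range(start, end))
    return sorted(merged)

tokens = "Donald Trump had a dinner with Hillary Clinton".split()
span_dict = {"hillary clinton": "PER"}                     # majority type of previously tagged spans
dict_spans = list(greedy_forward_match(tokens, span_dict))
model_spans = [(0, 2, "PER")]                              # e.g., back-end model prediction
np_spans = [(4, 5, None)]                                  # untyped frequent noun phrase
print(merge_suggestions(dict_spans, model_spans, np_spans))
# [(0, 2, 'PER'), (4, 5, None), (6, 8, 'PER')]
```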

Figure 2 illustrates how AlpacaTag presents the tagging suggestions for the sentence "Donald Trump had a dinner with Hillary Clinton in the white house." The two suggestions are shown in the bottom section as underlined spans "Donald Trump" and "Hillary Clinton" (Fig. 2a). When annotators want to confirm "Hillary Clinton" as a true annotation, they first click the span and then click the right type in the floating window that appears (Fig. 2b). They can also press the customized shortcut key for fast tagging. Note that the PER button is underlined in Fig. 2b, meaning that it is the recommended type for annotators to choose. Fig. 2c shows that after confirming, the span is added to the final annotations. We want to emphasize that our designed UI solves the problem of confirming and declining suggestions well: the floating windows greatly reduce mouse movement and let annotators easily confirm or change types by clicking or pressing shortcuts.

Figure 2: The workflow for annotators to confirm a given tagging suggestion ("Hillary Clinton" as PER). (a) The sentence and annotations are in the upper section; tagging suggestions are shown as underlined spans in the lower section. (b) After clicking a suggested span, a floating window shows up nearby for confirming the type (the suggested type is outlined in red). (c) After clicking a suggested type or pressing a shortcut key (e.g., 'p'), confirmed annotations appear in the upper annotation section.

What if annotators want to tag spans that are not recommended? Normal annotation in AlpacaTag is as friendly as in other tools: annotators simply select the spans they want to tag in the upper section and click or press the associated type (Fig. 3). We implement this UI design based on an open-source Django framework named doccano².

Figure 3: Annotating a span without recommendations (1. select a span; 2. click a button or press a shortcut key; 3. get a tag).

²http://doccano.herokuapp.com/

3.3 Active Sampling (Corpus-level Rec.)

Tagging suggestions operate at the instance level, but which instances we ask annotators to label is also very important for saving human effort. The research question here is how to actively sample a set of the most informative instances from all the unlabeled sentences, which is known as active learning. The measure of informativeness is usually specific to an existing model: it reflects which instances, if labeled correctly, can most improve the

model performance. A typical example of active learning is Least Confidence (Culotta and McCallum, 2005), which applies the model to all unlabeled sentences and treats the negative of the inference confidence as the informativeness:

$$ 1 - \max_{y_1,\dots,y_n} P\big[\, y_1,\dots,y_n \mid \{x_{ij}\} \,\big] $$

For AlpacaTag, we apply an improved version named Maximum Normalized Log-Probability (MNLP) (Shen et al., 2018), which eliminates the effect of sequence length by averaging:

$$ \max_{y_1,\dots,y_n} \frac{1}{n} \sum_{i=1}^{n} \log P\big[\, y_i \mid y_1,\dots,y_{i-1}, \{x_{ij}\} \,\big] $$

Simply put, we utilize the current back-end model to measure the informativeness of unlabeled sentences and then sample the optimal next batch for annotators to label.
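To show how this score could drive batch selection, here is a small sketch of MNLP-style ranking, assuming the back-end model exposes per-token log-probabilities of its best tag sequence; the scoring function and its interface are illustrative assumptions, not AlpacaTag's exact API.

```python
# Sketch (assumed interface): rank unlabeled sentences by the length-normalized
# log-probability of the model's best tag sequence (MNLP) and take the batch
# with the lowest scores, i.e., the least confident / most informative ones.
def mnlp_score(token_logprobs):
    """Average log-probability of the best tag sequence; lower = less confident."""
    return sum(token_logprobs) / max(len(token_logprobs), 1)

def sample_next_batch(unlabeled, best_path_logprobs, batch_size=100):
    """unlabeled: list of sentences; best_path_logprobs(sent) -> per-token log-probs."""
    scored = [(mnlp_score(best_path_logprobs(s)), s) for s in unlabeled]
    scored.sort(key=lambda pair: pair[0])          # least confident first
    return [s for _, s in scored[:batch_size]]

# toy example with fake per-token log-probabilities
fake_logprobs = {
    "Donald Trump visited Seoul .": [-0.1, -0.2, -0.05, -1.9, -0.01],
    "The cat sat .": [-0.02, -0.03, -0.01, -0.01],
}
batch = sample_next_batch(list(fake_logprobs), lambda s: fake_logprobs[s], batch_size=1)
print(batch)  # ['Donald Trump visited Seoul .']
```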
Figure 4: Reducing inter-annotator disagreement by training personal models for personalized active sampling and constructing consensus models.

4 Automatic Crowd Consolidation

As a crowd-sourcing annotation framework, AlpacaTag aims to reduce inter-annotator disagreement while keeping each annotator's individual strengths in labeling specific kinds of instances. We achieve this by applying annotator-specific back-end models ("personal models") and consolidating them into a shared back-end model named the "consensus model". Personal models are used for personalized active sampling, such that each annotator labels the regions of the data space that are most important for them to label.

Assume there are K annotators, referred to as {A_1, ..., A_K}. For each annotator $A_i$, we denote the sentences processed by them as $x_i = \{x_{i,j}\}_{j=1}^{l_i}$ and the corresponding annotations as $y^{A_i} \in \mathcal{Y}^{l_i}$. The target of consolidation is to generate predictions ŷ for the test data x. Note that annotators may annotate different portions of the data, and sentences can have inconsistent crowd labels.

As shown in Fig. 4, we model the behavior of every annotator by constructing annotator representation matrices C_k from the network parameters of the personal back-end models. The personal models share the same BLSTM layers for encoding sentences, which avoids over-parameterization. They are incrementally updated every batch, which usually consists of 50–100 sentences sampled actively by the algorithms mentioned in §3.3. We further merge the crowd representations C_k into a representation matrix C, and thus construct a consensus model by using C to generate tag scores in the BLSTM-CRF architecture, which improves on the approach proposed by Nguyen et al. (2017).

In summary, AlpacaTag periodically updates the back-end consensus model by consolidating crowd annotations for annotated sentences through merging personal models. The updated back-end model continues to offer suggestions on future instances for all annotators, and thus the consensus can be further solidified through recommendation confirmations.
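The merging step is not spelled out in detail here, so the following is only a rough sketch under the assumption that each personal model contributes a per-annotator linear scoring layer on top of the shared BLSTM and that the consensus layer is obtained by averaging them; the averaging choice and all names are our assumptions, not AlpacaTag's actual consolidation procedure.

```python
# Rough sketch (assumptions only): each annotator A_k has a personal linear layer
# C_k over the shared BLSTM features; a consensus scoring matrix C is formed here
# by simply averaging the personal layers' parameters.
import torch
import torch.nn as nn

hidden, n_tags, n_annotators = 400, 9, 3

# personal tag-scoring layers C_k, one per annotator, over shared BLSTM features
personal_layers = [nn.Linear(hidden, n_tags) for _ in range(n_annotators)]

with torch.no_grad():
    consensus = nn.Linear(hidden, n_tags)
    consensus.weight.copy_(torch.stack([p.weight for p in personal_layers]).mean(dim=0))
    consensus.bias.copy_(torch.stack([p.bias for p in personal_layers]).mean(dim=0))

# the consensus layer now produces tag scores for the shared encoder's output
features = torch.randn(2, 6, hidden)      # (batch, seq_len, hidden) from the shared BLSTM
print(consensus(features).shape)          # torch.Size([2, 6, 9])
```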

5 Real-time Model Deployment

Waiting for a model to be trained on finished human annotations can be a huge bottleneck when developing complicated pipeline systems in information extraction. Instead, users may want a sequence tagging model that is constantly updated, even while the annotation process is still ongoing. AlpacaTag naturally enjoys this feature since we maintain a back-end tagging model with incremental active learning techniques. Thus, we provide a suite of APIs for interacting with the back-end consensus model server. They serve both the communication with the annotation server and the real-time deployment of the back-end model. One can obtain inference results for a developing downstream system by querying such APIs.
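As an example of what querying the deployed model might look like, here is a small client-side sketch; the endpoint path, payload, and response fields are hypothetical placeholders we assume for illustration, since the concrete API of the model server is not specified here.

```python
# Hypothetical client sketch: query a deployed AlpacaTag-style model server for
# tag predictions on new sentences. The URL, route, and JSON fields below are
# placeholders assumed for illustration, not AlpacaTag's documented API.
import json
from urllib import request

def tag_sentences(sentences, server="http://localhost:8000", route="/api/predict"):
    payload = json.dumps({"sentences": sentences}).encode("utf-8")
    req = request.Request(server + route, data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())      # e.g., {"tags": [["B-PER", "I-PER", "O", ...], ...]}

if __name__ == "__main__":
    print(tag_sentences(["Donald Trump had a dinner with Hillary Clinton."]))
```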

6 Experiments

To investigate the performance of our implemented back-end model with incremental active learning and consolidation, we conduct a preliminary experiment on the CoNLL2003 dataset. We set the iteration step size to 100 with the default order for single-user annotation, and incrementally update the back-end model for 10 iterations. Then, we apply our implementation of active sampling to dynamically select the optimal batches of samples to annotate. All results are tested on the official test set. As shown in Fig. 5, the performance with active learning is indeed improved by a large margin over vanilla training. We also find that, with active sampling, using annotations for only about 35% of the training data can match the performance of using 100%, confirming the results reported in previous studies (Siddhant and Lipton, 2018).

Figure 5: Effectiveness of incremental updating with and without active sampling (F1-score (%) vs. number of instances).

For consolidating multiple annotations, we test our implementation on the real-world crowdsourced data collected by Rodrigues et al. (2013) using Amazon Mechanical Turk (AMT). Our methods outperform existing approaches including MVT-SLM (Wang et al., 2018) and Crowd-X (Nguyen et al., 2017) (see Tab. 1).

7 Related Works

Many sequence annotation tools have been developed (Ogren, 2006; Bontcheva et al., 2013; Chen and Styler, 2013; Druskat et al., 2014b) covering basic annotation features including data management, shortcut-key annotation, and multi-annotation presentation and analysis. However, few of them enjoy the three features of AlpacaTag introduced above. It is true that a few existing tools also support tagging suggestions (instance-level recommendations), as follows:

• BRAT (Stenetorp et al., 2012) and GATE (Bontcheva et al., 2013) can offer suggestions with a fixed tagging model such as CoreNLP (Manning et al., 2014). However, this is hardly helpful when users need to annotate sentences with a customized label set specific to their domains of interest, which is the most common motivation for using an annotation framework.

• YEDDA (Yang et al., 2018a) simply generates suggestions by exact matching against a continuously updated lexicon of already annotated spans. This is a subset of AlpacaTag's recommendations; note that YEDDA cannot suggest any unseen spans.

• WebAnno (Yimam et al., 2013) integrates a learning component for suggestions, which is based on hand-crafted features and a generic online learning framework (MIRA). Nonetheless, its back-end model is far from the state-of-the-art methods for sequence tagging and is hard to customize for consolidation and real-time deployment.

AlpacaTag's tagging suggestions are more comprehensive, since they come from a state-of-the-art BLSTM-CRF architecture, an annotation dictionary, and frequent noun phrase mining. A unique advantage is that it also supports active learning to sample the instances most worth labeling, reducing human effort. In addition, AlpacaTag attempts to reduce inter-annotator disagreement during annotation by automatically consolidating personal models into consensus models, which goes further than merely presenting inter-annotator agreement.

While recent papers have shown the power of deep active learning with simulations (Siddhant and Lipton, 2018), a practical annotation framework with active learning is missing. AlpacaTag is not only an annotation framework but also a practical environment for evaluating different active learning methods.

Methods     Precision (%)    Recall (%)       F1 (%)
MVT-SLM     88.88 (±0.25)    65.04 (±0.80)    75.10 (±0.44)
Crowd-Add   89.74 (±0.10)    64.50 (±1.48)    75.03 (±1.02)
Crowd-Cat   89.72 (±0.47)    63.55 (±1.20)    74.39 (±0.98)
Ours        88.77 (±0.25)    72.79 (±0.04)    79.99 (±0.08)
Gold        92.12 (±0.31)    91.73 (±0.09)    91.92 (±0.21)

Table 1: Comparisons of consolidation methods.

8 Conclusion and Future Directions

To sum up, we propose an open-source web-based annotation framework, AlpacaTag, that provides users with advanced features for reducing human effort in annotation, while keeping existing basic annotation functions such as shortcut keys. Incremental active learning of back-end models with crowd consolidation facilitates intelligent recommendations and reduces disagreement. Future directions include supporting the annotation of other tasks, such as relation extraction and event extraction. We also plan to design a real-time console for showing the status of back-end models based on TensorBoard, so that admins can better track the quality of deployed models and stop annotating early to save human effort.

References

Kalina Bontcheva, Hamish Cunningham, Ian Roberts, Angus Roberts, Valentin Tablan, Niraj Aswani, and Genevieve Gorrell. 2013. GATE Teamware: a web-based, collaborative text annotation framework. In Proc. of LREC.

Richard Eckart de Castilho, Éva Mújdricza-Maydt, Seid Muhie Yimam, Silvana Hartmann, Iryna Gurevych, Anette Frank, and Christian Biemann. 2016. A web-based tool for the integrated annotation of semantic and syntactic structures. In Proc. of LT4DH@COLING.

Wei-Te Chen and Will Styler. 2013. Anafora: a web-based general purpose annotation tool. In Proc. of NAACL-HLT.

Aron Culotta and Andrew McCallum. 2005. Confidence estimation for information extraction. In Proc. of AAAI.

Stephan Druskat, Lennart Bierkandt, Volker Gast, Christoph Rzymski, and Florian Zipser. 2014a. Atomic: an open-source platform for multi-level corpus annotation. In KONVENS.

Stephan Druskat, Lennart Bierkandt, Volker Gast, Christoph Rzymski, and Florian Zipser. 2014b. Atomic: An open-source software platform for multi-level corpus annotation.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proc. of NAACL-HLT.

Bill Y. Lin, Frank F. Xu, Zhiyi Luo, and Kenny Q. Zhu. 2017. Multi-channel BiLSTM-CRF model for emerging named entity recognition in social media. In NUT@EMNLP.

Bill Yuchen Lin and Wei Lu. 2018. Neural adaptation layers for cross-domain named entity recognition. In Proc. of EMNLP.

Liyuan Liu, Jingbo Shang, Frank F. Xu, Xiang Ren, Huan Gui, Jian Peng, and Jiawei Han. 2018. Empower sequence labeling with task-aware neural language model. In Proc. of AAAI.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proc. of ACL.

An Thanh Nguyen, Byron C. Wallace, Junyi Jessy Li, Ani Nenkova, and Matthew Lease. 2017. Aggregating and predicting sequence labels from crowd annotations. In Proc. of ACL.

Philip V. Ogren. 2006. Knowtator: a Protégé plug-in for annotated corpus construction. In Proc. of NAACL-HLT.
Filipe Rodrigues, Francisco C. Pereira, and Bernardete Ribeiro. 2013. Sequence labeling with multiple annotators. Machine Learning.

Burr Settles. 2009. Active learning literature survey. Technical report, University of Wisconsin-Madison, Department of Computer Sciences.

Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R. Voss, and Jiawei Han. 2018. Automated phrase mining from massive text corpora. TKDE.

Yanyao Shen, Hyokun Yun, Zachary C. Lipton, Yakov Kronrod, and Animashree Anandkumar. 2018. Deep active learning for named entity recognition. In Proc. of ICLR.

Aditya Siddhant and Zachary Chase Lipton. 2018. Deep Bayesian active learning for natural language processing: Results of a large-scale empirical study. In Proc. of EMNLP.

Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou, and Jun'ichi Tsujii. 2012. brat: a web-based tool for NLP-assisted text annotation. In Proc. of EACL.

Xuan Wang, Yu Zhang, Xiang Ren, Yuhao Zhang, Marinka Zitnik, Jingbo Shang, Curtis P. Langlotz, and Jiawei Han. 2018. Cross-type biomedical named entity recognition with deep multi-task learning. Bioinformatics.

Jie Yang, Yue Zhang, Linwei Li, and Xingxuan Li. 2018a. YEDDA: A lightweight collaborative text span annotation tool. In Proc. of ACL.

YaoSheng Yang, Meishan Zhang, Wenliang Chen, Wei Zhang, Haofen Wang, and Min Zhang. 2018b. Adversarial learning for Chinese NER from crowd annotations. In Proc. of AAAI.

Seid Muhie Yimam, Iryna Gurevych, Richard Eckart de Castilho, and Christian Biemann. 2013. WebAnno: A flexible, web-based and visually supported system for distributed annotations. In Proc. of ACL.