Evaluating the Utility of Hand-Crafted Features in Sequence Labelling∗
Minghao Wu♠♥†   Fei Liu♠   Trevor Cohn♠
♠The University of Melbourne, Victoria, Australia
♥JD AI Research, Beijing, China

∗ https://github.com/minghao-wu/CRF-AE
† Work carried out at The University of Melbourne.

Abstract

Conventional wisdom is that hand-crafted features are redundant for deep learning models, as they already learn adequate representations of text automatically from corpora. In this work, we test this claim by proposing a new method for exploiting hand-crafted features as part of a novel hybrid learning approach, incorporating a feature auto-encoder loss component. We evaluate on the task of named entity recognition (NER), where we show that including manual features for part-of-speech, word shapes and gazetteers can improve the performance of a neural CRF model. We obtain an F1 of 91.89 for the CoNLL-2003 English shared task, which significantly outperforms a collection of highly competitive baseline models. We also present an ablation study showing the importance of auto-encoding, over using features as either inputs or outputs alone, and moreover show that including the auto-encoder components reduces training requirements to 60%, while retaining the same predictive accuracy.

1 Introduction

Deep neural networks have proven to be a powerful framework for natural language processing, and have demonstrated strong performance on a number of challenging tasks, ranging from machine translation (Cho et al., 2014b,a) to text categorisation (Zhang et al., 2015; Joulin et al., 2017; Liu et al., 2018b). Not only do such deep models outperform traditional machine learning methods, they also come with the benefit of not requiring difficult feature engineering. For instance, both Lample et al. (2016) and Ma and Hovy (2016) propose end-to-end models for sequence labelling tasks and achieve state-of-the-art results.

Orthogonal to the advances in deep learning is the effort spent on feature engineering. A representative example is the task of named entity recognition (NER), one that requires both lexical and syntactic knowledge, where, until recently, most models relied heavily on statistical sequence labelling models taking in manually engineered features (Florian et al., 2003; Chieu and Ng, 2002; Ando and Zhang, 2005). Typical features include POS and chunk tags, prefixes and suffixes, and external gazetteers, all of which represent years of accumulated knowledge in the field of computational linguistics.

The work of Collobert et al. (2011) started the trend of feature-engineering-free modelling by learning internal representations of compositional components of text (e.g., word embeddings). Subsequent work has shown impressive progress through capturing syntactic and semantic knowledge with dense real-valued vectors trained on large unannotated corpora (Mikolov et al., 2013a,b; Pennington et al., 2014). Enabled by the powerful representational capacity of such embeddings and neural networks, feature engineering has largely been replaced by taking off-the-shelf pre-trained word embeddings as input, thereby making models fully end-to-end, and the research focus has shifted to neural network architecture engineering.

More recently, there has been increasing recognition of the utility of linguistic features (Li et al., 2017; Chen et al., 2017; Wu et al., 2017; Liu et al., 2018a), where such features are integrated to improve model performance. Inspired by this, taking NER as a case study, we investigate the utility of hand-crafted features in deep learning models, challenging conventional wisdom in an attempt to refute the utility of manually engineered features. Of particular interest to this paper is the work by Ma and Hovy (2016), where they introduce a strong end-to-end model combining a bi-directional Long Short-Term Memory (Bi-LSTM) network with Convolutional Neural Network (CNN) character encoding in a Conditional Random Field (CRF). Their model is highly capable of capturing not only word- but also character-level features. We extend this model by integrating an auto-encoder loss, allowing the model to take hand-crafted features as input and reconstruct them as output, and show that, even with such a highly competitive model, incorporating linguistic features is still beneficial. Perhaps the closest to this study are the works by Ammar et al. (2014) and Zhang et al. (2017), who show how CRFs can be framed as auto-encoders in unsupervised or semi-supervised settings.

With our proposed model, we achieve strong performance on the CoNLL 2003 English NER shared task with an F1 of 91.89, significantly outperforming an array of competitive baselines. We conduct an ablation study to better understand the impact of each manually crafted feature. Finally, we provide an in-depth analysis of model performance when trained with varying amounts of data and show that the proposed model is highly competent with only 60% of the training set.
2 Methodology

In this section, we first outline the model architecture, then the manually crafted features, and finally how they are incorporated into the model.

2.1 Model Architecture

We build on a highly competitive sequence labelling model, namely Bi-LSTM-CNN-CRF, first introduced by Ma and Hovy (2016). Given an input sequence x = {x_1, x_2, ..., x_T} of length T, the model is capable of tagging each input with a predicted label ŷ, resulting in a sequence ŷ = {ŷ_1, ŷ_2, ..., ŷ_T} closely matching the gold label sequence y = {y_1, y_2, ..., y_T}. Here, we extend the model by incorporating an auto-encoder loss taking hand-crafted features as input and output, thereby forcing the model to preserve crucial information stored in such features and allowing us to evaluate the impact of each feature on model performance. Specifically, our model, referred to as Neural-CRF+AE, consists of four major components: (1) a character-level CNN (char-CNN); (2) a word-level bi-directional LSTM (Bi-LSTM); (3) a conditional random field (CRF); and (4) an auto-encoder auxiliary loss. An illustration of the model architecture is presented in Figure 1.

Figure 1: Main architecture of our neural network. Character representations are extracted by a character-level CNN. The dashed line indicates we use an auto-encoder loss to reconstruct hand-crafted features. (The figure depicts the input sentence "EU rejects German call to ...", the concatenation of word embeddings, character representations and hand-crafted features, the bi-directional LSTM, and the NER and auto-encoder output layers.)

Char-CNN. Previous studies (Santos and Zadrozny, 2014; Chiu and Nichols, 2016; Ma and Hovy, 2016) have demonstrated that CNNs are highly capable of capturing character-level features. Here, our character-level CNN is similar to that used in Ma and Hovy (2016) but differs in that we use a ReLU activation (Nair and Hinton, 2010).¹

¹ While the hyperbolic tangent activation function results in comparable performance, the choice of ReLU is mainly due to faster convergence.

Bi-LSTM. We use a Bi-LSTM to learn contextual information of a sequence of words. As inputs to the Bi-LSTM, we first concatenate the pre-trained embedding of each word w_i with its character-level representation c_{w_i} (the output of the char-CNN) and a vector of manually crafted features f_i (described in Section 2.2):

    \overrightarrow{h}_i = \overrightarrow{\mathrm{LSTM}}(\overrightarrow{h}_{i-1}, [w_i; c_{w_i}; f_i])    (1)
    \overleftarrow{h}_i = \overleftarrow{\mathrm{LSTM}}(\overleftarrow{h}_{i+1}, [w_i; c_{w_i}; f_i])    (2)

where [;] denotes concatenation. The outputs of the forward and backward passes of the Bi-LSTM are then concatenated, h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i], to form the output of the Bi-LSTM, where dropout is also applied.
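To make the concatenation in Equations (1) and (2) concrete, the following is a minimal PyTorch sketch, not the authors' released implementation: each word embedding w_i is concatenated with its char-CNN representation c_{w_i} and hand-crafted feature vector f_i before being passed to a bi-directional LSTM. The dimensions, kernel size and dropout rate are illustrative assumptions.

```python
# Minimal sketch of Eqs. (1)-(2): concatenate word embedding, char-CNN output
# and hand-crafted feature vector, then run a Bi-LSTM over the sequence.
# Hyper-parameters here are illustrative, not taken from the paper.
import torch
import torch.nn as nn


class CharCNN(nn.Module):
    """Character-level CNN: embed characters, convolve, max-pool over time."""

    def __init__(self, n_chars, char_emb_dim=30, n_filters=30, kernel_size=3):
        super().__init__()
        self.emb = nn.Embedding(n_chars, char_emb_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_emb_dim, n_filters, kernel_size, padding=1)

    def forward(self, char_ids):                   # (n_words, max_word_len)
        x = self.emb(char_ids).transpose(1, 2)     # (n_words, char_emb_dim, len)
        x = torch.relu(self.conv(x))               # ReLU activation, as in the paper
        return x.max(dim=2).values                 # (n_words, n_filters)


class FeatureAwareBiLSTM(nn.Module):
    """Bi-LSTM over the concatenation [w_i ; c_{w_i} ; f_i] per Eqs. (1)-(2)."""

    def __init__(self, word_emb_dim, char_repr_dim, feat_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(word_emb_dim + char_repr_dim + feat_dim,
                            hidden_dim, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.5)

    def forward(self, word_emb, char_repr, feats):
        # word_emb:  (batch, T, word_emb_dim)   pre-trained word embeddings
        # char_repr: (batch, T, char_repr_dim)  char-CNN outputs
        # feats:     (batch, T, feat_dim)       multi-hot hand-crafted features
        inputs = torch.cat([word_emb, char_repr, feats], dim=-1)
        h, _ = self.lstm(inputs)                   # h_i = [h_fwd_i ; h_bwd_i]
        return self.dropout(h)                     # (batch, T, 2 * hidden_dim)
```

The resulting vectors h_i are what the CRF layer described next consumes.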
CRF. For sequence labelling tasks, it is intuitive and beneficial to utilise information carried between neighbouring labels to predict the best sequence of labels for a given sentence. Therefore, we employ a conditional random field layer (Lafferty et al., 2001) taking as input the output of the Bi-LSTM, h_i. Training is carried out by maximising the log probability of the gold sequence, \mathcal{L}_{\mathrm{CRF}} = \log p(y|x), while decoding can be efficiently performed with the Viterbi algorithm.

Auto-encoder loss. Alongside sequence labelling as the primary task, we also deploy, as auxiliary tasks, three auto-encoders for reconstructing the hand-engineered feature vectors.

x                U.N.      official  Ekeus     heads  for   Baghdad  .
POS              NNP       NN        NNP       VBZ    IN    NNP      .
Word shape       X.X.      xxxx      Xxxxx     xxxx   xxx   Xxxxx    .
Dependency tags  compound  compound  compound  ROOT   prep  pobj     punct
Gazetteer        O         O         PER       O      O     LOC      O
y                B-ORG     O         B-PER     O      O     B-LOC    O

Table 1: Example sentence (top), showing the different types of linguistic features used in this work as additional inputs and auxiliary outputs (middle), and its labelling (bottom).

For the gazetteer feature, we focus on PERSON and LOCATION and compile a list for each. The PERSON gazetteer is collected from U.S. census 2000, U.S. census 2010 and DBpedia, whereas GeoNames is the main source for LOCATION, taking in both official and alternative names. All the tokens on both lists are then filtered to exclude frequently occurring common words. Each category is converted into a one-hot sparse feature vector f_i^t and then concatenated to form a multi-hot vector f_i = [f_i^POS; f_i^shape; f_i^gazetteer] for the i-th word.
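As an illustration of how such a multi-hot feature vector f_i = [f_i^POS; f_i^shape; f_i^gazetteer] can be assembled, here is a small self-contained sketch that one-hot encodes each feature type and concatenates the results, using the token "Ekeus" from Table 1 (POS NNP, word shape Xxxxx, gazetteer PER). The tag inventories below are toy subsets chosen for the example, not the full feature sets used in the paper.

```python
# Illustrative construction of the multi-hot feature vector
# f_i = [f_i^POS ; f_i^shape ; f_i^gazetteer] for a single word.
import numpy as np

POS_TAGS = ["NNP", "NN", "VBZ", "IN", "."]        # assumed toy subset
SHAPES = ["X.X.", "xxxx", "Xxxxx", "xxx", "."]    # assumed toy subset
GAZETTEER = ["O", "PER", "LOC"]                    # categories shown in Table 1


def one_hot(value, vocabulary):
    """Return a one-hot vector for `value` over `vocabulary`."""
    vec = np.zeros(len(vocabulary), dtype=np.float32)
    vec[vocabulary.index(value)] = 1.0
    return vec


def feature_vector(pos, shape, gazetteer):
    """Concatenate per-type one-hot vectors into a single multi-hot f_i."""
    return np.concatenate([one_hot(pos, POS_TAGS),
                           one_hot(shape, SHAPES),
                           one_hot(gazetteer, GAZETTEER)])


# Example: the token "Ekeus" from Table 1 (POS=NNP, shape=Xxxxx, gazetteer=PER).
f_i = feature_vector("NNP", "Xxxxx", "PER")
print(f_i.shape)      # (13,)
print(int(f_i.sum())) # 3 -> one active entry per feature type
```

In the model, these vectors serve both as part of the Bi-LSTM input (Equations 1 and 2) and as the reconstruction targets of the auto-encoder loss.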