Automated Concatenation of Embeddings for Structured Prediction

Automated Concatenation of Embeddings for Structured Prediction Xinyu Wang‡, Yong Jiangy∗, Nguyen Bachy, Tao Wangy, Zhongqiang Huangy, Fei Huangy, Kewei Tu∗ School of Information Science and Technology, ShanghaiTech University Shanghai Engineering Research Center of Intelligent Vision and Imaging Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences yDAMO Academy, Alibaba Group {wangxy1,tukw}@shanghaitech.edu.cn, [email protected] {nguyen.bach,leeo.wangt,z.huang,f.huang}@alibaba-inc.com Abstract language processing. Approaches based on contextualized embeddings, such as ELMo (Peters et al., Pretrained contextualized embeddings are 2018), Flair (Akbik et al., 2018), BERT (Devlin powerful word representations for structured et al., 2019), and XLM-R (Conneau et al., 2020), prediction tasks. Recent work found that bet- have been consistently raising the state-of-the-art ter word representations can be obtained by concatenating different types of embeddings. for various structured prediction tasks. Concur- However, the selection of embeddings to form rently, research has also showed that word represen- the best concatenated representation usually tations based on the concatenation of multiple pre- varies depending on the task and the collec- trained contextualized embeddings and traditional tion of candidate embeddings, and the ever- non-contextualized embeddings (such as word2vec increasing number of embedding types makes (Mikolov et al., 2013) and character embeddings it a more difficult problem. In this paper, we (Santos and Zadrozny, 2014)) can further improve propose Automated Concatenation of Embed- dings (ACE) to automate the process of find- performance (Peters et al., 2018; Akbik et al., 2018; ing better concatenations of embeddings for Straková et al., 2019; Wang et al., 2020b). Given structured prediction tasks, based on a formu- the ever-increasing number of embedding learn- lation inspired by recent progress on neural ing methods that operate on different granularities architecture search. Specifically, a controller (e.g., word, subword, or character level) and with alternately samples a concatenation of embed- different model architectures, choosing the best em- dings, according to its current belief of the ef- beddings to concatenate for a specific task becomes fectiveness of individual embedding types in non-trivial, and exploring all possible concatena- consideration for a task, and updates the belief based on a reward. We follow strategies tions can be prohibitively demanding in computing in reinforcement learning to optimize the pa- resources. rameters of the controller and compute the re- Neural architecture search (NAS) is an active ward based on the accuracy of a task model, area of research in deep learning to automati- which is fed with the sampled concatenation cally search for better model architectures, and has as input and trained on a task dataset. Empir- achieved state-of-the-art performance on various ical results on 6 tasks and 21 datasets show that our approach outperforms strong base- tasks in computer vision, such as image classifi- lines and achieves state-of-the-art performance cation (Real et al., 2019), semantic segmentation with fine-tuned embeddings in all the evalua- (Liu et al., 2019a), and object detection (Ghiasi tions.1 et al., 2019). In natural language processing, NAS has been successfully applied to find better RNN 1 Introduction structures (Zoph and Le, 2017; Pham et al., 2018b) and recently better transformer structures (So et al., Recent developments on pretrained contextualized 2019; Zhu et al., 2020). In this paper, we propose embeddings have significantly improved the per- Automated Concatenation of Embeddings (ACE) formance of structured prediction tasks in natural to automate the process of finding better concatena- ∗ Yong Jiang and Kewei Tu are the corresponding authors. tions of embeddings for structured prediction tasks. z : This work was conducted when Xinyu Wang was interning ACE is formulated as an NAS problem. In this at Alibaba DAMO Academy. 1Our code is publicly available at https://github. approach, an iterative search process is guided by com/Alibaba-NLP/ACE. a controller based on its belief that models the effectiveness of individual embedding candidates in 2004), syntactic dependency parsing (Tesnière, consideration for a specific task. At each step, the 1959) and semantic dependency parsing (Oepen controller samples a concatenation of embeddings et al., 2014) over 21 datasets. Besides, we also according to the belief model and then feeds the analyze the advantage of ACE and reward function concatenated word representations as inputs to a design over the baselines and show the advantage task model, which in turn is trained on the task of ACE over ensemble models. dataset and returns the model accuracy as a reward signal to update the belief model. We use the policy 2 Related Work gradient algorithm (Williams, 1992) in reinforce- 2.1 Embeddings ment learning (Sutton and Barto, 1992) to solve the optimization problem. In order to improve the Non-contextualized embeddings, such as word2vec efficiency of the search process, we also design (Mikolov et al., 2013), GloVe (Pennington et al., a special reward function by accumulating all the 2014), and fastText (Bojanowski et al., 2017), help rewards based on the transformation between the lots of NLP tasks. Character embeddings (San- current concatenation and all previously sampled tos and Zadrozny, 2014) are trained together with concatenations. the task and applied in many structured prediction Our approach is different from previous work on tasks (Ma and Hovy, 2016; Lample et al., 2016; NAS in the following aspects: Dozat and Manning, 2018). For pretrained contextualized embeddings, ELMo (Peters et al., 2018), 1. Unlike most previous work, we focus on search- a pretrained contextualized word embedding gen- ing for better word representations rather than erated with multiple Bidirectional LSTM layers, better model architectures. significantly outperforms previous state-of-the-art 2. We design a novel search space for the embed- approaches on several NLP tasks. Following this ding concatenation search. Instead of using idea, Akbik et al.(2018) proposed Flair embed- RNN as in previous work of Zoph and Le(2017), dings, which is a kind of contextualized character we design a more straightforward controller to embeddings and achieved strong performance in generate the embedding concatenation. We de- sequence labeling tasks. Recently, Devlin et al. sign a novel reward function in the objective of (2019) proposed BERT, which encodes contex- optimization to better evaluate the effectiveness tualized sub-word information by Transformers of each concatenated embeddings. (Vaswani et al., 2017) and significantly improves the performance on a lot of NLP tasks. Much re- 3. ACE achieves high accuracy without the need search such as RoBERTa (Liu et al., 2019c) has for retraining the task model, which is typically focused on improving BERT model’s performance required in other NAS approaches. through stronger masking strategies. Moreover, multilingual contextualized embeddings become 4. Our approach is efficient and practical. Al- popular. Pires et al.(2019) and Wu and Dredze though ACE is formulated in a NAS framework, (2019) showed that Multilingual BERT (M-BERT) ACE can find a strong word representation on could learn a good multilingual representation ef- a single GPU with only a few GPU-hours for fectively with strong cross-lingual zero-shot trans- structured prediction tasks. In comparison, a lot fer performance in various tasks. Conneau et al. of NAS approaches require dozens or even thou- (2020) proposed XLM-R, which is trained on a sands of GPU-hours to search for good neural larger multilingual corpus and significantly outper- architectures for their corresponding tasks. forms M-BERT on various multilingual tasks. Empirical results show that ACE outperforms strong baselines. Furthermore, when ACE is 2.2 Neural Architecture Search applied to concatenate pretrained contextualized Recent progress on deep learning has shown that embeddings fine-tuned on specific tasks, we can network architecture design is crucial to the model achieve state-of-the-art accuracy on 6 structured performance. However, designing a strong neu- prediction tasks including Named Entity Recog- ral architecture for each task requires enormous nition (Sundheim, 1995), Part-Of-Speech tagging efforts, high level of knowledge, and experiences (DeRose, 1988), chunking (Tjong Kim Sang and over the task domain. Therefore, automatic design Buchholz, 2000), aspect extraction (Hu and Liu, of neural architecture is desired. A crucial part of NAS is search space design, which defines the dis- Given an embedding concatenation generated from coverable NAS space. Previous work (Baker et al., the controller, the task model is trained over the 2017; Zoph and Le, 2017; Xie and Yuille, 2017) task data and returns a reward to the controller. The designs a global search space (Elsken et al., 2019) controller receives the reward to update its param- which incorporates structures from hand-crafted eter and samples a new embedding concatenation architectures. For example, Zoph and Le(2017) de- for the task model. Figure1 shows the general signed a chained-structured search space with skip architecture of our approach. connections. The global search space usually has a considerable degree of freedom.

Load more