LIMIT-BERT : Linguistics Informed Multi-Task BERT
Junru Zhou 1,2,3, Zhuosheng Zhang 1,2,3, Hai Zhao 1,2,3,*, Shuailiang Zhang 1,2,3
1 Department of Computer Science and Engineering, Shanghai Jiao Tong University
2 Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, China
3 MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
{zhoujunru,zhangzs}@sjtu.edu.cn, [email protected]

* Corresponding author. This paper was partially supported by the National Key Research and Development Program of China (No. 2017YFB0304100), Key Projects of the National Natural Science Foundation of China (U1836222 and 61733011), the Huawei-SJTU long term AI project, and Cutting-edge Machine Reading Comprehension and Language Model.

Abstract

In this paper, we present Linguistics Informed Multi-Task BERT (LIMIT-BERT) for learning language representations across multiple linguistics tasks by Multi-Task Learning. LIMIT-BERT includes five key linguistics tasks: Part-Of-Speech (POS) tagging, constituent and dependency syntactic parsing, and span and dependency semantic role labeling (SRL). Different from the recent Multi-Task Deep Neural Network (MT-DNN), our LIMIT-BERT is fully linguistically motivated and is thus capable of adopting an improved masked training objective based on syntactic and semantic constituents. Besides, LIMIT-BERT takes a semi-supervised learning strategy to offer the same large amount of linguistics task data as is used for language model training. As a result, LIMIT-BERT not only improves performance on linguistics tasks, but also benefits from a regularization effect and from linguistic information that lead to more general representations and help it adapt to new tasks and domains. LIMIT-BERT outperforms the strong baseline Whole Word Masking BERT on both dependency and constituent syntactic/semantic parsing, the GLUE benchmark, and the SNLI task. Our practice on the proposed LIMIT-BERT also enables us to release a well pre-trained model for multiple natural language processing purposes once and for all.

1 Introduction
Recently, pre-trained language models have proven greatly effective across a range of linguistically inspired natural language processing (NLP) tasks such as syntactic parsing, semantic parsing, and so on (Zhou and Zhao, 2019; Zhou et al., 2020; Ouchi et al., 2018; He et al., 2018b; Li et al., 2019), when the latter are taken as downstream tasks of the former. In the meantime, introducing linguistic clues such as syntax and semantics into pre-trained language models may further enhance other downstream tasks such as various Natural Language Understanding (NLU) tasks (Zhang et al., 2020a,b). However, nearly all existing language models are trained on large amounts of unlabeled text data (Peters et al., 2018; Devlin et al., 2019), without explicitly exploiting linguistic knowledge. Such observations motivate us to jointly consider both types of tasks: pre-training language models and solving linguistically inspired NLP problems. We argue that such a treatment may bring two-fold benefits. (1) Joint learning is a better way to let the former help the latter in a bidirectional mode, rather than in the unidirectional mode of taking the latter as downstream tasks of the former. (2) Naturally empowered by linguistic clues from joint learning, pre-trained language models will be more powerful for enhancing downstream tasks. Thus we propose Linguistics Informed Multi-Task BERT (LIMIT-BERT), an attempt to incorporate linguistic knowledge into pre-trained language representation models. The proposed LIMIT-BERT is implemented in terms of Multi-Task Learning (MTL) (Caruana, 1993), which has proven useful for alleviating overfitting to a specific task and thus makes the learned representations universal across tasks.

Universal language representations are learned by leveraging large amounts of unlabeled data, whose volume is quite different from that of linguistics task datasets such as the Penn Treebank (PTB) [1] (Marcus et al., 1993). To alleviate this data imbalance in multi-task learning, we apply a semi-supervised learning approach that uses a pre-trained linguistics model [2] to annotate large amounts of unlabeled text data and combines them with the gold linguistics task datasets as our final training data. With such pre-processing, it is easy to train our LIMIT-BERT on large amounts of data with many tasks concurrently by simply summing up all the concerned losses. Moreover, since every sentence is labeled with predicted syntax and semantics, we can further improve the masked training objective by fully exploiting the known syntactic or semantic constituents during language model training. Unlike the previous work MT-DNN (Liu et al., 2019b), which only fine-tunes BERT on GLUE tasks, our LIMIT-BERT is trained on large amounts of data in a semi-supervised way and is firmly supported by explicit linguistic clues.

We verify the effectiveness and applicability of LIMIT-BERT on Propbank semantic parsing [3] in both the span style (CoNLL-2005) (Carreras and Màrquez, 2005) and the dependency style (CoNLL-2009) (Hajič et al., 2009), and on the Penn Treebank (PTB) (Marcus et al., 1993) for both constituent and dependency syntactic parsing. Our empirical results show that semantics and syntax can indeed benefit the language representation model via multi-task learning, and that LIMIT-BERT outperforms the strong baseline Whole Word Masking BERT (BERT-WWM).

[1] PTB is an English treebank with syntactic tree annotations that contains only 50k sentences.
[2] The model jointly predicts syntax and semantics in both span and dependency annotation styles; it is from (Zhou et al., 2020) and is jointly learned with POS tags.
[3] The semantic parsing task over the Propbank is also called semantic role labeling (SRL).
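To make the training recipe above concrete, the following is a minimal, hypothetical sketch (not the authors' released code) of how silver-annotated web text can be mixed with gold treebank/SRL data and how the per-task losses are simply summed; the corpus and model interfaces (sample_batch, the output dictionary keys) are illustrative assumptions.

```python
import random

GOLD_PROB = 0.10  # the paper uses gold PTB/CoNLL data with 10% probability

def next_batch(silver_corpus, gold_corpus, rng=random):
    """Mostly sample silver data (BooksCorpus/Wikipedia auto-annotated by the
    pre-trained linguistics model); occasionally sample gold-labeled data."""
    source = gold_corpus if rng.random() < GOLD_PROB else silver_corpus
    return source.sample_batch()  # assumed corpus interface

def multi_task_loss(outputs):
    """Every sentence carries labels for all five linguistics tasks plus the
    masked LM objective, so the total loss is just the sum of all of them."""
    return (outputs["masked_lm"]      # linguistics-guided masked LM loss
            + outputs["pos"]          # POS tagging
            + outputs["constituent"]  # constituent parsing
            + outputs["dependency"]   # dependency parsing
            + outputs["span_srl"]     # span (CoNLL-2005 style) SRL
            + outputs["dep_srl"])     # dependency (CoNLL-2009 style) SRL
```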
2 Tasks and Datasets

LIMIT-BERT includes five types of downstream tasks: Part-Of-Speech tagging, constituent and dependency syntactic parsing, and span and dependency semantic role labeling (SRL).

Both span (constituent) and dependency are broadly adopted annotation styles for either semantics or syntax, and they have been well studied and discussed from both linguistic and computational perspectives (Chomsky, 1981; Li et al., 2019). Constituency parsing aims to build a constituency-based parse tree from a sentence that represents its syntactic structure according to a phrase structure grammar, while dependency parsing identifies syntactic relations (such as an adjective modifying a noun) between word pairs in a sentence. The constituent structure is better at disclosing phrasal continuity, while the dependency structure is better at indicating dependency relations among words.

Semantic role labeling (SRL) is dedicated to recognizing the predicate-argument structure of a sentence, such as who did what to whom, where, and when. For argument annotation, there are two formalizations. One is based on text spans, namely span-based SRL. The other is dependency-based SRL, which annotates the syntactic head of an argument rather than the entire argument span. SRL is an important method for obtaining semantic information beneficial to a wide range of NLP tasks (Zhang et al., 2019; Mihaylov and Frank, 2019).

BERT is typically trained on quite large unlabeled text datasets, BooksCorpus and English Wikipedia, which amount to 13GB of plain text, while the datasets for specific linguistics tasks are smaller than 100MB. Thus we employ semi-supervised learning to alleviate this data imbalance in multi-task learning, using the pre-trained linguistics model to label the BooksCorpus and English Wikipedia data. The pre-trained model jointly learns POS tags and the four types of semantic and syntactic structures; it is the XLNet version of (Zhou et al., 2020) and gives state-of-the-art or comparable performance on the four parsing tasks concerned. During training, we set a 10% probability of using the gold syntactic parsing and SRL data: the Penn Treebank (PTB) (Marcus et al., 1993), span-style SRL (CoNLL-2005) (Carreras and Màrquez, 2005), and dependency-style SRL (CoNLL-2009) (Hajič et al., 2009).

2.1 Linguistics-Guided Mask Strategy

BERT applies two training objectives, Masked Language Model (LM) and Next Sentence Prediction (NSP), based on WordPiece embeddings (Wu et al., 2016) with a 30,000-token vocabulary. For the Masked LM objective, BERT uses a training data generator to choose 15% of the token positions at random for mask replacement and predicts the masked tokens [4]. Since a different masking strategy, such as Whole Word Masking [5], which masks all of the tokens corresponding to a word at once, can improve model performance, we further improve the masking strategy by exploiting the known syntactic or semantic constituents.

[4] Actually, BERT applies three replacement strategies: (1) the [MASK] token 80% of the time, (2) a random token 10% of the time, and (3) the unchanged i-th token 10% of the time. This work uses the same replacement strategies.
[5] https://github.com/huggingface/transformers

[Figure: (a) Semantic Phrase Masking. Span and dependency SRL structures (arguments A0, A1, A2) and a constituent syntactic tree for the sentence "Federal Paper Board sells paper and wood products .", together with the masked input "federal paper board [MASK] paper and wood [MASK] ."]

[...] syntactic and semantic scorers and decoders. We take a multi-task learning (MTL) (Caruana, 1993) approach, sharing the parameters of the token representation and the Transformer encoder, while the language modeling layers and the top task-specific layers have independent parameters. The training procedure is simple: we just sum up the language model loss and the task-specific losses.

3.2 Token Representation

Following the BERT token representation (Devlin et al., 2019), the first token is always the [CLS] token. If the input X is packed as a sentence pair X1, X2, we separate the two sentences with the special token [SEP] ("packed" here means concatenating the two sentences, as in BERT training).
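As a rough illustration of the linguistics-guided masking in Section 2.1, the sketch below masks whole syntactic or semantic spans instead of independent token positions, reusing BERT's 80/10/10 replacement rule from footnote [4]. It is a simplified assumption of how span-level masking could be implemented, not the paper's actual preprocessing code; the span list and vocabulary arguments are illustrative.

```python
import random

MASK_BUDGET = 0.15            # BERT masks roughly 15% of token positions
P_MASK, P_RANDOM = 0.8, 0.1   # 80% [MASK], 10% random token, 10% unchanged

def phrase_mask(tokens, spans, vocab, rng=random):
    """tokens: word/wordpiece list; spans: (start, end) constituent or SRL
    argument spans, end exclusive; vocab: pool for random replacements.
    Returns the masked sequence and the positions to be predicted."""
    budget = max(1, int(len(tokens) * MASK_BUDGET))
    masked, targets = list(tokens), []
    for start, end in rng.sample(spans, len(spans)):  # visit spans in random order
        if len(targets) >= budget:
            break
        for i in range(start, end):
            targets.append(i)
            r = rng.random()
            if r < P_MASK:                    # 80%: replace with [MASK]
                masked[i] = "[MASK]"
            elif r < P_MASK + P_RANDOM:       # 10%: replace with a random token
                masked[i] = rng.choice(vocab)
            # remaining 10%: keep the original token unchanged
    return masked, targets

# Example with the sentence from the figure above:
sentence = "Federal Paper Board sells paper and wood products .".split()
srl_spans = [(0, 3), (4, 8)]   # the A0 and A1 argument spans
print(phrase_mask(sentence, srl_spans, vocab=sentence))
```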
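For the input packing in Section 3.2, the [CLS]/[SEP] layout can be reproduced with the HuggingFace tokenizer referenced in footnote [5]; this is only an illustration of the convention, not code from the paper, and the example sentences are arbitrary.

```python
from transformers import BertTokenizer  # library referenced in footnote [5]

# Packing a sentence pair X1, X2 in BERT style:
# [CLS] tokens of X1 [SEP] tokens of X2 [SEP]
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("Federal Paper Board sells paper and wood products.",
                    "This is the second sentence.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'federal', 'paper', 'board', ..., '[SEP]', 'this', ..., '[SEP]']
print(encoded["token_type_ids"])  # 0 for the first segment, 1 for the second
```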