Unleash GPT-2 Power for Event Detection

Amir Pouran Ben Veyseh1, Viet Dac Lai1, Franck Dernoncourt2, and Thien Huu Nguyen1
1 Department of Computer and Information Science, University of Oregon, Eugene, OR 97403, USA
2 Adobe Research, San Jose, CA, USA
{apouranb,vietl,thien}@cs.uoregon.edu, [email protected]

Abstract

Event Detection (ED) aims to recognize mentions of events (i.e., event triggers) and their types in text. Recently, several ED datasets in various domains have been proposed. However, the major limitation of these resources is the lack of enough training data for individual event types, which hinders the efficient training of data-hungry deep learning models. To overcome this issue, we propose to exploit the powerful pre-trained language model GPT-2 to generate training samples for ED. To prevent the noises that are inevitable in automatically generated data from hampering the training process, we propose to exploit a teacher-student architecture in which the teacher is supposed to learn anchor knowledge from the original data. The student is then trained on the combination of the original and GPT-generated data while being led by the anchor knowledge from the teacher. Optimal transport is introduced to facilitate the anchor knowledge-based guidance between the two networks. We evaluate the proposed model on multiple ED benchmark datasets, gaining consistent improvement and establishing state-of-the-art results for ED.

1 Introduction

An important task of Information Extraction (IE) involves Event Detection (ED), whose goal is to recognize and classify words/phrases that evoke events in text (i.e., event triggers). For instance, in the sentence "The organization donated 2 million dollars to humanitarian helps.", ED systems should recognize "donated" as an event trigger of type Pay. We differentiate two subtasks in ED, i.e., Event Identification (EI): a binary classification problem to predict if a word in text is an event trigger or not, and Event Classification (EC): a multi-class classification problem to classify event triggers according to predefined event types.

Several methods have been introduced for ED, extending from feature-based models (Ahn, 2006; Liao and Grishman, 2010a; Miwa et al., 2014) to advanced methods (Nguyen and Grishman, 2015; Chen et al., 2015; Nguyen et al., 2016c; Sha et al., 2018; Zhang et al., 2020b; Nguyen et al., 2021). Although deep learning models have achieved substantial improvement, their requirement of large training datasets together with the small sizes of existing ED datasets constitutes a major hurdle to building high-performing ED models. Recently, there have been some efforts to enlarge training data for ED models by exploiting unsupervised (Huang et al., 2016; Yuan et al., 2018) or distantly-supervised (Keith et al., 2017; Nguyen and Nguyen, 2018; Araki and Mitamura, 2018) techniques. The common strategy in these methods is to exploit unlabeled text data that are rich in event mentions to aid the expansion of training data for ED. In this work, we explore a novel approach for training data expansion in ED by leveraging the existing pre-trained language model GPT-2 (Radford et al., 2019) to automatically generate training data for models. Motivated by the promising performance of GPT models for text generation, we expect our approach to produce effective data for ED in different domains.

Specifically, we aim to fine-tune GPT-2 on existing training datasets so it can generate new sentences annotated with event triggers and/or event types, serving as additional training data for ED models. One direction to achieve this idea is to explicitly mark event triggers along with their event types in sentences of an existing ED dataset that can be used to fine-tune the GPT model for new data generation. However, one issue with this direction is that in existing ED datasets, the numbers of examples for some rare event types might be small, potentially leading to poor tuning performance of GPT and impairing the quality of generated examples for such rare events. In addition, the large numbers of event types in some ED datasets might make it more challenging for the fine-tuning of GPT to differentiate event types and produce high-quality data.
To this end, instead of directly generating data for ED, we propose to use GPT-2 to only generate samples for the event identification task to simplify the generation and achieve data with better annotated labels (i.e., output sentences are only marked with the positions of event triggers). As such, to effectively leverage the generated EI data to improve ED performance, we propose a multi-task learning framework to train the ED models on the combination of the generated EI data and the original ED data. In particular, for every event trigger candidate in a sentence, our framework seeks to perform two tasks, i.e., EI to predict a binary label for being an event trigger or not, and ED to predict the event type (if any) evoked by the word via a multi-class classification problem. An input encoder is shared for both tasks, allowing training on both generated EI data and original ED data to contribute to the representation learning in the encoder (i.e., transferring knowledge in generated EI data to ED models).

Despite the simplification to EI for better annotated labels of data, the generated sentences might still involve noises due to the inherent nature of the language generation, e.g., grammatically wrong sentences, inconsistent information, or incorrect event trigger annotations. As such, it is crucial to introduce mechanisms to filter the noises in generated data to enable effective transfer learning from generated EI data. To this end, prior works for GPT-based data generation for other tasks have attempted to directly remove noisy generated examples before actual usage for model training via some heuristic rules (Anaby-Tavor et al., 2020; Yang et al., 2020). However, heuristic rules are brittle and restricted in their coverage, so they might overly filter the generated data or incorrectly retain some noisy generated samples. To address this issue, we propose to preserve all generated data for training and devise methods to explicitly limit the impacts of noisy generated sentences in the models. In particular, we expect that the inclusion of generated EI data into the training process for ED models might help to shift the representations of the models to better regions for ED. As such, we argue that this representation transition should only occur at a reasonable rate, as drastic divergence of representations due to the generated data might be associated with noises in the data. Motivated by this intuition, we propose a novel teacher-student framework for our multi-task learning problem where the teacher is trained on the original clean ED datasets to induce anchor representation knowledge for the data. The student, on the other hand, will be trained on both generated EI data and original ED data to accomplish transfer learning. Here, the anchor knowledge from the teacher will be leveraged to guide the student to prevent drastic divergence of representation vectors for noisy information penalization. Consequently, we propose a novel anchor information to implement this idea, seeking to maintain the same level of differences between the generated and original data (in terms of representation vectors) for both the teacher and the student (i.e., the generated-vs-original data difference as the anchor). At the core of this technique is the computation of the distance/difference between samples in generated and original data. In this work, we envision two types of information that models should consider when computing such distances for our problem: (1) representation vectors of the models for the examples, and (2) event trigger likelihood scores of examples based on the models (i.e., two examples in the generated and original data are more similar if they both correspond to event triggers). As such, we propose to cast this distance computation problem of generated and original data into an Optimal Transport (OT) problem. OT is an established method to compute the optimal transportation between two data distributions based on the probability masses of data points and their pair-wise distances, thus facilitating the integration of the two criteria of event trigger likelihoods and representation vectors into the distance computation between data point sets.

Extensive experiments and analysis reveal the effectiveness of the proposed approach for ED in different domains, establishing new state-of-the-art performance on the ACE 2005, CySecED, and RAMS datasets.

2 Model

We formulate the task of Event Detection as a word-level classification problem as in prior work (Nguyen and Grishman, 2015; Ngo et al., 2020). Formally, given the sentence S = [w1, w2, . . . , wn] and the candidate trigger word wt, the goal is to predict the event type l from a pre-defined set of event types L. Note that if the word wt is not a trigger word, the gold event type is None.

Our proposed approach for this task consists of two stages: (1) Data Augmentation: to employ natural language generation to augment existing training datasets for ED; (2) Task Modeling: to propose a deep learning model for ED, exploiting available training data.

2.1 Data Augmentation

As presented in the introduction, our motivation in this work is to explore a novel approach for training data augmentation for ED based on the powerful pre-trained language model for text generation GPT-2. Our overall strategy involves using some existing training dataset O for ED (i.e., original data) to fine-tune GPT-2. The fine-tuned model is then employed to generate a new labeled training set G (i.e., synthetic data) that will be combined with the original data O to train models for ED. To simplify the training data generation task and enhance the quality of the synthetic data, we seek to generate data only for the subtask EI of ED, where synthesized sentences are annotated with the positions of their event triggers (i.e., event types for triggers are not required for the generation to avoid the complication with rare event types for fine-tuning). To this end, we first enrich each sentence S ∈ O with the positions of the event triggers that it contains to facilitate the GPT fine-tuning process. Formally, assume that S = w1, w2, . . . , wn is a sentence of n words with only one event trigger word located at wt; the enriched sentence S' for S would have the form: S' = [BOS, w1, . . . , TRGs, wt, TRGe, . . . , wn, EOS], where TRGs and TRGe are special tokens to mark the position of the event trigger, and BOS and EOS are special tokens to identify the beginning and the end of the sentence. Next, the GPT-2 model will be fine-tuned on the enriched sentences S' of O in an auto-regressive fashion (i.e., predicting the next token in S' given the prior ones).
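To make this step concrete, the sketch below shows one way the enrichment and auto-regressive fine-tuning could be implemented with the HuggingFace transformers library; the enrich helper, the special-token strings, the sampling settings, and all training details are illustrative assumptions rather than the authors' released code.

```python
# A minimal sketch (not the authors' implementation) of enriching ED sentences
# with trigger markers and fine-tuning GPT-2 on them auto-regressively.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

SPECIALS = {"bos_token": "<BOS>", "eos_token": "<EOS>", "pad_token": "<PAD>",
            "additional_special_tokens": ["TRGs", "TRGe"]}

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.add_special_tokens(SPECIALS)
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))  # make room for the new special tokens

def enrich(words, trigger_index):
    """Build the enriched sentence S' = [BOS, ..., TRGs, w_t, TRGe, ..., EOS]."""
    marked = (words[:trigger_index] + ["TRGs", words[trigger_index], "TRGe"]
              + words[trigger_index + 1:])
    return "<BOS> " + " ".join(marked) + " <EOS>"

def fine_tune(original_data, epochs=3, lr=5e-5, batch_size=8):
    """original_data: list of (word_list, trigger_index) pairs from O (assumed format)."""
    texts = [enrich(words, t) for words, t in original_data]
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for i in range(0, len(texts), batch_size):
            batch = tokenizer(texts[i:i + batch_size], return_tensors="pt",
                              padding=True, truncation=True)
            labels = batch["input_ids"].clone()
            labels[batch["attention_mask"] == 0] = -100  # ignore padding positions
            loss = model(**batch, labels=labels).loss    # next-token LM loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

def sample_synthetic(num_sentences, max_len=60):
    """Sample sentences and keep only those containing the trigger markers."""
    model.eval()
    kept = []
    while len(kept) < num_sentences:
        out = model.generate(
            tokenizer("<BOS>", return_tensors="pt")["input_ids"],
            do_sample=True, top_p=0.95, max_length=max_len,
            pad_token_id=tokenizer.pad_token_id)
        text = tokenizer.decode(out[0], skip_special_tokens=False)
        if "TRGs" in text and "TRGe" in text:
            kept.append(text)
    return kept
```

The filter at the end of the sketch mirrors the requirement below that only sentences containing TRGs and TRGe are added into G.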
Finally, using the fine-tuned GPT-2, we generate a new dataset G of |O| sentences (|G| = |O|) to achieve a balanced size. Here, we ensure that only generated sentences that contain the special tokens TRGs and TRGe (i.e., involving event trigger words) are added into G, allowing us to identify the candidate trigger word in our word-level classification formulation for ED. As such, the combination A of the synthetic data G and the original data O (A = O ∪ G) will be leveraged to train our ED model in the next step.

To assess the quality of the synthetic data, we randomly select 200 sentences from G (generated by the fine-tuned GPT-2 model over the popular ACE 2005 training set for ED) and evaluate them regarding grammatical soundness, meaningfulness, and the inclusion and correctness of annotated event triggers (i.e., whether the words between the tokens TRGs and TRGe evoke events or not). Among the sampled set, we find that 17% of the sentences contain at least one type of such errors.

2.2 Task Modeling

This section describes our model for ED to overcome the noises in the generated data G for model training. As discussed in the introduction, we employ the teacher-student framework with multi-task learning to achieve this goal. In the proposed framework, the teacher and the student employ a base deep learning model with the same architecture but different parameters.

Base Model: Following prior work (Wang et al., 2019), our base model employs the BERTbase model to represent each word wi in the input sentence S with a vector ei. Formally, the input sentence [[CLS], w1, w2, . . . , wn, [SEP]] is fed into the BERTbase model and the hidden states of the last layer of BERT are taken as the contextualized embeddings of the input words, i.e., E = [e1, e2, . . . , en]. Note that if wi contains more than one word-piece, the average of its word-piece embeddings is used for ei. In our experiments, we find that fixing the BERTbase parameters achieves higher performance. As such, to fine-tune the contextualized embeddings E for ED, we employ a Bi-directional Long Short-Term Memory (BiLSTM) network to consume E; its hidden states, i.e., H = [h1, h2, . . . , hn], are then employed as the final representations for the words in S. Finally, to create the final vector V for ED prediction, the max-pooled representation of the sentence, i.e., h̄ = MAX_POOL(h1, h2, . . . , hn), is concatenated with the representation of the trigger candidate, i.e., ht. V is consumed by a feed-forward network, whose last layer has |L| neurons, followed by a softmax layer to predict the distribution P(·|S, t) over possible event types in L. To train the model, we use the negative log-likelihood as the loss function: Lpred = −log P(l|S, t), where l is the gold label.
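As an illustration of this base architecture, the sketch below wires a frozen BERT encoder to a BiLSTM and the ED prediction head; the class name, the cased BERT checkpoint, and the omission of word-piece averaging are our own simplifying assumptions, while the 300/200 dimensions echo the settings reported in Section 3.1.

```python
# A simplified sketch (our illustration, not the released model) of the base network:
# frozen BERT embeddings -> BiLSTM -> [max-pool ; trigger state] -> softmax over L.
import torch
import torch.nn as nn
from transformers import BertModel

class BaseEDModel(nn.Module):
    def __init__(self, num_event_types, lstm_dim=300, ffn_dim=200):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-cased")  # checkpoint assumed
        for p in self.bert.parameters():
            p.requires_grad = False            # BERT parameters are kept fixed
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_dim,
                            batch_first=True, bidirectional=True)
        self.ed_head = nn.Sequential(          # main ED head over the event types L
            nn.Linear(4 * lstm_dim, ffn_dim), nn.ReLU(),
            nn.Linear(ffn_dim, num_event_types))

    def forward(self, input_ids, attention_mask, trigger_pos):
        # Word-piece-to-word averaging is omitted here for brevity.
        E = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        H, _ = self.lstm(E)                                 # BiLSTM word states
        h_bar = H.max(dim=1).values                         # sentence max-pooling
        h_t = H[torch.arange(H.size(0)), trigger_pos]       # trigger candidate state
        V = torch.cat([h_bar, h_t], dim=-1)                 # final vector V
        return self.ed_head(V), V                           # ED logits and V for reuse

# L_pred is then the negative log-likelihood of the gold event type, e.g.:
# logits, V = model(input_ids, attention_mask, trigger_pos)
# l_pred = torch.nn.functional.cross_entropy(logits, gold_event_types)
```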

As the synthetic sentences in G only involve information about the positions of event triggers (i.e., no event types are included), we cannot directly combine G with O to train ED models with the loss Lpred. To facilitate the integration of G into the training process, we introduce an auxiliary task of EI for multi-task learning in the training process, seeking to predict the binary label laux for the trigger candidate wt in S, i.e., laux = 1 if wt is an event trigger. To perform this auxiliary task, we employ another feed-forward network, i.e., FFaux, which also consumes the overall vector V as input. This feed-forward network has one neuron with the sigmoid activation function in the last layer to estimate the event trigger likelihood score: P(laux = 1|S, t) = FFaux(V). Finally, to train the base model with the auxiliary task, we exploit the binary cross-entropy loss: Laux = −(laux log(FFaux(V)) + (1 − laux) log(1 − FFaux(V))). Note that the main ED task and the auxiliary EI task are done jointly in a single training process where the loss Lpred for ED is computed only for the original data O. The loss Laux, in contrast, will be obtained for both the original and synthetic data in A.

Knowledge Consistency: The generated data G is not noise-free. As such, training the ED model on A could lead to inferior performance. To address this issue, as discussed in the introduction, we propose to first learn the anchor knowledge from the original data O, then use it to lead the model training on A to prevent drastic divergence from the anchor knowledge (i.e., knowledge consistency promotion), thus constraining the noises. Hence, we propose a teacher-student network, in which the teacher is first trained on O to learn the anchor knowledge. The student network will be trained on A afterward, leveraging the consistency guidance with the induced anchor knowledge from the teacher. We will also use the student network as the final model for our ED problem in this work.

In our framework, both the teacher and student networks will be trained in the multi-task setting with the ED and EI tasks. In particular, the training losses for both ED and EI will be computed based on O for the teacher (the loss to train the teacher is: Lpred + τLaux, where τ is a trade-off parameter). In contrast, the combined data A will be used to compute the EI loss for the student, while the ED loss for the student can only be computed on the original data O. As such, we propose to enforce the knowledge consistency between the two networks for both the main task ED and the auxiliary task EI during the training of the student model.
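A compact sketch of the auxiliary EI head FFaux and the teacher's joint objective Lpred + τLaux is given below; it builds on the base-model sketch above, and the module and function names are our own.

```python
# Sketch of the auxiliary EI head and the teacher's multi-task loss (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxEIHead(nn.Module):
    """FF_aux: maps the vector V to an event-trigger likelihood in (0, 1)."""
    def __init__(self, in_dim, hidden_dim=200):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, 1))

    def forward(self, V):
        return torch.sigmoid(self.ffn(V)).squeeze(-1)

def teacher_loss(ed_logits, ed_labels, ei_scores, ei_labels, tau=0.1):
    """Teacher objective L_pred + tau * L_aux, computed on the original data O."""
    l_pred = F.cross_entropy(ed_logits, ed_labels)                 # L_pred (ED)
    l_aux = F.binary_cross_entropy(ei_scores, ei_labels.float())   # L_aux (EI)
    return l_pred + tau * l_aux
```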
First, to achieve the knowledge consistency for ED, we seek to minimize the KL divergence between the teacher-predicted label-probability distribution and the student-predicted label-probability distribution. Formally, for a sentence S ∈ O, the label-probability distributions of the teacher and the student, i.e., Pt(·|S, t) and Ps(·|S, t) respectively, are employed to compute the KL-divergence loss: L_KL = −Σ_{l∈L} Pt(l|S, t) log(Ps(l|S, t)/Pt(l|S, t)). By decreasing the KL-divergence during the student's training, the model is encouraged to make similar predictions as the teacher for the same original sentence, thereby preventing noises from misleading the student. Note that, different from traditional teacher-student networks that employ KL to achieve knowledge distillation on unlabelled data (Hinton et al., 2015), the KL divergence in our model is leveraged to enforce knowledge consistency to prevent noises in labeled data automatically generated by GPT-2.

Second, for the auxiliary task EI, instead of enforcing the student-teacher knowledge consistency via similar predictions, we argue that it will be more beneficial to leverage the difference between the original data O and the generated data G as an anchor knowledge to promote consistency. In particular, we expect that the student, which is trained on A, should discern the same difference between G and O as the teacher, which is trained only on the original data O. Formally, during student training, for each mini-batch, the distances between the original data and the generated data detected by the teacher and the student are denoted by d^T_{O,G} and d^S_{O,G}, respectively. To enforce the O-G distance consistency between the two networks, the following loss is added into the overall loss function: L_dist = |d^T_{O,G} − d^S_{O,G}| / |B|, where |B| is the mini-batch size. The advantage of this novel knowledge consistency enforcement compared to the KL-divergence is that it explicitly exploits the different nature of the original and generated data to facilitate the mitigation of noises in the generated data.
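The two consistency terms can be written down directly; the sketch below is our reading of the formulas above (treating the teacher distance as a fixed anchor via detach), not the authors' code.

```python
# Sketch of the ED knowledge-consistency term (KL) and the O-G distance anchor term.
import torch
import torch.nn.functional as F

def kl_consistency(teacher_logits, student_logits):
    """L_KL: KL(P_t || P_s) between teacher and student ED distributions,
    averaged over the original sentences in the mini-batch."""
    p_t = F.softmax(teacher_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    log_p_s = F.log_softmax(student_logits, dim=-1)
    return (p_t * (log_p_t - log_p_s)).sum(dim=-1).mean()

def dist_consistency(d_teacher, d_student, batch_size):
    """L_dist = |d^T_{O,G} - d^S_{O,G}| / |B|; the teacher side is not updated."""
    return torch.abs(d_teacher.detach() - d_student) / batch_size
```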

A remaining question for our proposed knowledge consistency concerns how to assess the difference between the original and the generated data from the perspectives of the teacher, i.e., d^T_{O,G}, and the student network, i.e., d^S_{O,G}. In this section, we describe our method from the perspective of the student (the same method is employed for the teacher network). In particular, we define the difference between the original and the generated data as the cost of transforming O into G such that for the transformed data the model will make the same predictions as for G. How can we compute the cost of such a transformation? To answer this question, we propose to employ Optimal Transport (OT), which is an established method to find the most efficient transportation (i.e., the transformation with the lowest cost) of one distribution to another.

Formally, given the probability distributions p(x) and q(y) over the domains X and Y, and the cost function C(x, y): X × Y → R+ for mapping X to Y, OT finds the optimal joint distribution π*(x, y) (over X × Y) with marginals p(x) and q(y), i.e., the cheapest transportation from p(x) to q(y), by solving the following problem:

π*(x, y) = min_{π∈Π(x,y)} ∫_Y ∫_X π(x, y) C(x, y) dx dy    s.t. x ∼ p(x) and y ∼ q(y),    (1)

where Π(x, y) is the set of all joint distributions with marginals p(x) and q(y). Note that if the distributions p(x) and q(y) are discrete, the integrals in Equation (1) are replaced with sums and the joint distribution π*(x, y) is represented by a matrix whose entry (x, y) represents the probability of transforming the data point x ∈ X to y ∈ Y to convert the distribution p(x) to q(y). By solving the problem in Equation (1) (this problem is intractable in general, so we solve its entropy-based approximation using the Sinkhorn algorithm (Peyre and Cuturi, 2019)), the cost of transforming the discrete distribution p(x) to q(y) (i.e., the Wasserstein distance Dist_W) is defined as: Dist_W = Σ_{x∈X} Σ_{y∈Y} π*(x, y) C(x, y).
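For concreteness, a small sketch of the entropy-regularized (Sinkhorn) approximation for the discrete case follows; the regularization strength and iteration count are illustrative choices, not values reported in the paper.

```python
# Sketch of a Sinkhorn approximation of Equation (1) for discrete p, q (illustrative).
import torch

def sinkhorn_distance(p, q, cost, epsilon=0.1, n_iters=50):
    """Approximate Dist_W = sum_{x,y} pi*(x, y) C(x, y).

    p: (n,) marginal over original sentences, q: (m,) marginal over generated
    sentences, cost: (n, m) matrix of pairwise transport costs C(x, y).
    """
    K = torch.exp(-cost / epsilon)              # Gibbs kernel of the cost matrix
    u = torch.ones_like(p)
    v = torch.ones_like(q)
    for _ in range(n_iters):                    # alternate the marginal projections
        u = p / (K @ v + 1e-9)
        v = q / (K.t() @ u + 1e-9)
    pi = u.unsqueeze(1) * K * v.unsqueeze(0)    # approximate transport plan pi*
    return (pi * cost).sum()                    # entropy-regularized OT cost
```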
In order to utilize OT to compute the transformation cost between O and G, i.e., d^S_{O,G}, we propose to define the domains X and Y as the representation spaces of the sentences in O and G, respectively, obtained from the student network. In particular, a data point x ∈ X represents a sentence Xo ∈ O. Similarly, a data point y ∈ Y stands for a sentence Yg ∈ G. To define the cost function C(x, y) for OT, we compute the Euclidean distance between the representation vectors of the sentences Xo and Yg (obtained by max-pooling over the representations of their words): C(x, y) = ||h̄^X_o − h̄^Y_g||, where h̄^X_o = MAX_POOL(h^X_{o,1}, . . . , h^X_{o,|Xo|}), h̄^Y_g = MAX_POOL(h^Y_{g,1}, . . . , h^Y_{g,|Yg|}), and h^X_{o,i} and h^Y_{g,i} are the representation vectors of the i-th words of Xo and Yg, respectively, obtained from the student's BiLSTM. Also, to define the discrete distribution p(x) for OT over X, we employ the event trigger likelihood Score^X_o for the trigger candidate of each sentence Xo in X that is returned by the feed-forward network FF^S_aux for the auxiliary task EI in the student model, i.e., Score^X_o = FF^S_aux(Xo). Afterward, we apply the softmax function over the scores of the original sentences in the current mini-batch to obtain p(x), i.e., p(x) = Softmax(Score^X_o). Similarly, the discrete distribution q(y) is defined as q(y) = Softmax(Score^Y_g). To this end, by solving the OT problem in Equation (1) and obtaining the efficient transport plan π*(x, y) using this setup, we can obtain the distance d^S_{O,G}. In the same way, the distance d^T_{O,G} can be computed using the representations and event trigger likelihoods from the teacher network. Note that in this way, we can integrate both the representation vectors of sentences and the event trigger likelihoods into the distance computation between data as motivated in the introduction. Finally, to train the student model, the following combined loss function is used in our framework: L = Lpred + αLaux + βLKL + γLdist, where α, β, and γ are the trade-off parameters.
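Putting the pieces together, the batch-level distance and the student objective could look like the sketch below, which reuses sinkhorn_distance from the previous sketch; the default trade-off values simply echo the hyper-parameters reported in Section 3.1, and all names are illustrative.

```python
# Sketch of the O-G distance for one mini-batch and the combined student loss.
import torch
import torch.nn.functional as F

def o_g_distance(h_orig, h_gen, score_orig, score_gen):
    """h_*: (n, d)/(m, d) max-pooled sentence vectors; score_*: EI trigger scores."""
    cost = torch.cdist(h_orig, h_gen, p=2)   # C(x, y) = ||h_bar_o - h_bar_g||
    p = F.softmax(score_orig, dim=0)         # p(x) over the original sentences
    q = F.softmax(score_gen, dim=0)          # q(y) over the generated sentences
    return sinkhorn_distance(p, q, cost)     # d_{O,G} for this network

def student_loss(l_pred, l_aux, l_kl, d_teacher, d_student, batch_size,
                 alpha=0.1, beta=0.05, gamma=0.08):
    """L = L_pred + alpha * L_aux + beta * L_KL + gamma * L_dist."""
    l_dist = torch.abs(d_teacher.detach() - d_student) / batch_size
    return l_pred + alpha * l_aux + beta * l_kl + gamma * l_dist
```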

3 Experiments

3.1 Datasets, Baselines & Hyper-Parameters

To evaluate the effectiveness of the proposed model, called the GPT-based data augmentation model for ED with OT (GPTEDOT), we conduct experiments on the following ED datasets:

ACE 2005 (Walker et al., 2006): This dataset annotates 599 documents for 33 event types that cover different text domains (e.g., news, weblog, or conversation documents). We use the same pre-processing script and data split as prior works (Lai et al., 2020c; Tong et al., 2020b) to achieve fair comparisons. In particular, the data split involves 529/30/40 articles for the train/dev/test sets respectively. For this dataset, we compare our model with prior state-of-the-art models reported in the recent works (Lai et al., 2020c; Tong et al., 2020b), including BERT-based models such as DMBERT, AD-DMBERT (Wang et al., 2019), DRMM, EKD (Tong et al., 2020b), and GatedGCN (Lai et al., 2020c).

CySecED (Man Duc Trong et al., 2020): This dataset provides 8,014 event triggers for 30 event types from 300 articles of the cybersecurity domain (i.e., cybersecurity events). We follow the same pre-processing and data split as the original work (Man Duc Trong et al., 2020) with 240/30/30 documents for the train/dev/test sets. To be consistent with other experiments and to facilitate the data generation based on GPT-2, the experiments on CySecED are conducted at the sentence level where the inputs for models involve sentences. As such, we employ the state-of-the-art sentence-level models reported in (Man Duc Trong et al., 2020), i.e., DMBERT (Wang et al., 2019) and BERT-ED (Yang et al., 2019), as the baselines for CySecED.

RAMS (Ebner et al., 2020): This dataset annotates 9,124 event triggers for 38 event types. We use the official data split with 3,194, 399, and 400 documents for training, development, and testing respectively for RAMS. We also perform ED at the sentence level in this dataset. For the baselines, we utilize recent state-of-the-art BERT-based models for ED, i.e., DMBERT (Wang et al., 2019) and GatedGCN (Lai et al., 2020c). For a fair comparison, the performance of such baseline models is obtained via their official implementations from the original papers that are fine-tuned for RAMS.

For each dataset, we use its training and development data to fine-tune the GPT-2 model. We tune the hyper-parameters for the proposed teacher-student architecture using a random search. All the hyper-parameters are selected based on the F1 scores on the development set of the ACE 2005 dataset. The same hyper-parameters from this fine-tuning are then applied to the other datasets for consistency. In our model, we use the small version of GPT-2 to generate data. In the base model, we use BERTbase, 300 dimensions for the hidden states of the BiLSTM, and 2 layers of feed-forward neural networks with 200 hidden dimensions to predict events. The trade-off parameters τ, α, β, and γ are set to 0.1, 0.1, 0.05, and 0.08, respectively. The learning rate is set to 0.3 for the Adam optimizer and a batch size of 50 is employed during training. Finally, note that we do not update the BERT model for word embeddings in this work due to its better performance on the development data of ACE 2005.
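For quick reference, the settings reported above can be collected into a single configuration; this is only a convenience sketch of the stated values, not a released configuration file.

```python
# Hyper-parameters reported above, gathered into one illustrative configuration.
CONFIG = {
    "generator": "gpt2",                 # small GPT-2 variant for data generation
    "encoder": "bert-base (frozen)",     # BERT word embeddings are not updated
    "lstm_hidden_dim": 300,
    "ffn_layers": 2,
    "ffn_hidden_dim": 200,
    "tau": 0.1,                          # teacher trade-off for L_aux
    "alpha": 0.1,                        # student weight for L_aux
    "beta": 0.05,                        # student weight for L_KL
    "gamma": 0.08,                       # student weight for L_dist
    "optimizer": "Adam",
    "learning_rate": 0.3,
    "batch_size": 50,
}
```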
Model  P  R  F1
CNN (Nguyen and Grishman, 2015)  71.8  66.4  69.0
DMCNN (Chen et al., 2015)  75.6  63.6  69.1
DLRNN (Duan et al., 2017)  77.2  64.9  70.5
ANN-S2 (Liu et al., 2017)  78.0  66.3  71.7
GMLATT (Liu et al., 2018)  78.9  66.9  72.4
GCN-ED (Nguyen and Grishman, 2018)  77.9  68.8  73.1
Lu's DISTILL (Lu et al., 2019)  76.3  71.9  74.0
TS-DISTILL (Liu et al., 2019)  76.8  72.9  74.8
DMBERT* (Wang et al., 2019)  77.6  71.8  74.6
AD-DMBERT* (Wang et al., 2019)  77.9  72.5  75.1
DRMM* (Tong et al., 2020a)  77.9  74.8  76.3
GatedGCN* (Lai et al., 2020c)  78.8  76.3  77.6
EKD* (Tong et al., 2020b)  79.1  78.0  78.6
GPTEDOT*  82.3  76.3  79.2

Table 1: Performance on the ACE 2005 test set. * indicates models that use BERT for the encoding.

Model  P  R  F1
CNN (Nguyen and Grishman, 2015)  51.8  36.7  43.0
DMCNN (Chen et al., 2015)  47.5  38.7  43.2
GCN-ED (Nguyen and Grishman, 2018)  46.3  51.8  48.9
MOGANED (Yan et al., 2019)  53.7  59.6  56.5
CyberLSTM (Satyapanich et al., 2020)  42.5  29.0  34.5
DMBERT (Wang et al., 2019)  59.4  51.3  55.1
BERT-ED (Man Duc Trong et al., 2020)  60.2  56.1  58.1
GPTEDOT  65.9  64.1  65.0

Table 2: Comparison with state-of-the-art models on CySecED. All the models in this table use BERT.

Model  P  R  F1
DMBERT (Wang et al., 2019)  62.6  44.0  51.7
GatedGCN (Lai et al., 2020c)  66.5  59.0  62.5
GPTEDOT  55.5  78.6  65.1

Table 3: Models' performance on RAMS. All the models use BERT in this table.

3.2 Results

Results of experiments on the ACE 2005 test set are shown in Table 1. The most important observation is that the proposed model GPTEDOT significantly outperforms all the baseline models (p < 0.01), thus showing the benefits of the GPT-generated data and the teacher-student framework with knowledge consistency for ED in this work. In particular, compared to the BERT-based models that leverage data augmentation, i.e., AD-DMBERT (Wang et al., 2019) with semi-supervised and adversarial learning, DRMM (Tong et al., 2020a) with image-enhanced models, and EKD (Tong et al., 2020b) with external open-domain event triggers, the better performance of GPTEDOT highlights the advantages of GPT-2 to generate data for ED models.

Results of experiments on the CySecED test set are presented in Table 2. This table reveals that the teacher-student architecture GPTEDOT significantly improves the performance over previous state-of-the-art models for ED in the cybersecurity domain. This is important as it shows that the proposed model is effective in different domains. In addition, our results also suggest that GPT-2 can be employed to generate effective data for ED in domains where data annotation for ED requires extensive domain expertise and expensive cost to obtain, such as the cybersecurity events. Moreover, the higher margin of improvement for GPTEDOT on CySecED compared to those on the ACE 2005 dataset suggests the necessity of using more training data for ED in technical domains.

Finally, results of experiments on the RAMS test set are reported in Table 3. Consistent with our experiments on ACE 2005 and CySecED, our proposed model achieves significantly higher performance than existing state-of-the-art models (p < 0.01), thus further confirming the advantages of GPTEDOT for ED.

Model  P  R  F1
GPTEDOT (full)  82.4  75.0  78.5
BaseO  78.2  73.7  75.9
BaseA  75.8  73.9  74.9
Teacher−A  76.9  78.1  77.5
Teacher−M  75.8  77.9  76.9
Student−OT  75.4  79.3  77.3
Student−KL  76.8  77.3  77.0
Student+OT  76.1  76.6  76.4
Student+KL  77.1  76.7  76.9
OT−Rep  76.8  77.3  77.0
OT−Score  78.0  77.1  77.6

Table 4: Ablation study on the ACE 2005 dev set.

3.3 Ablation Study

This ablation study evaluates the effectiveness of different components in GPTEDOT for ED. First, for the importance of the generated data G from GPT-2 and the teacher-student architecture to mitigate noises, we examine the following baselines: (1) BaseO: This baseline is the base model trained only on the original data O, thus being equivalent to the teacher model and not using the student model; and (2) BaseA: This baseline trains the base model on the combination of the original and generated data, i.e., A, using the multi-task learning setting (i.e., the teacher model is excluded).

Second, for the multi-task learning design in the teacher network, we explore the following ablated models: (3) Teacher−A: This baseline removes the auxiliary task EI in the teacher from GPTEDOT. As such, the OT-based knowledge consistency for EI is also eliminated; (4) Teacher−M: In this model, the main task ED is not utilized to train the teacher, so the corresponding KL-based knowledge consistency for ED is also removed.

Third, for the design of the knowledge consistency losses in the student network, we evaluate the following baselines: (5) Student−OT: This ablated model eliminates the OT-based knowledge consistency loss for the auxiliary task EI in the student's training of GPTEDOT (the auxiliary task is still employed for the teacher and the student); (6) Student−KL: For this model, the KL-based knowledge consistency for the main task ED is ignored in the student's training; (7) Student+OT: In this baseline, we use OT for the knowledge consistency on both the main and the auxiliary tasks. Here, for the main task ED, the cost function C(x, y) for OT is still obtained via the Euclidean distances between representation vectors while the distributions p(x) and q(y) are based on the maximum values of the label-probability distributions Ps(·|Xo, to) and Ps(·|Yg, tg) for the ED task; and (8) Student+KL: This baseline employs the KL divergence between the models' predicted distributions to enforce the teacher-student consistency for both the main task and the auxiliary task. To this end, for the auxiliary task EI, we convert the final activation of FFaux into a distribution with two data points (i.e., [FFaux(X), 1 − FFaux(X)]) to compute the KL divergence between the teacher and the student.
Finally, for the importance of the Euclidean distances and the event trigger likelihoods in the OT-based distance between O and G for knowledge consistency in EI, we investigate two baselines: (9) OT−Rep: Here, to compute OT, we use a constant cost between every pair of sentences, i.e., C(x, y) = 1 (i.e., ignoring representation-based distances); and (10) OT−Score: This model uses uniform distributions for p(x) and q(y) to compute the OT (i.e., ignoring event trigger likelihoods).

We report the performance of the models (on the ACE 2005 development set) for the ablation study in Table 4. There are several observations from this table. First, the generated data G and the teacher-student architecture are necessary for GPTEDOT to achieve the highest performance. In particular, compared with BaseO, the better performance of GPTEDOT indicates the benefits of the GPT-generated data. Moreover, the better performance of BaseO over BaseA reveals that the simple combination of the synthetic and original data without any effective method to mitigate noises might be harmful. Second, the lower performance of Teacher−A and Teacher−M shows that both the auxiliary and the main task (i.e., multi-task learning) in the teacher are necessary to produce the best performance. Third, the choice of methods to promote knowledge consistency is important, and the proposed combination of KL and OT for the ED and EI tasks (respectively) is necessary. In particular, removing or replacing each of them with the other one (i.e., Student+OT and Student+KL) would decrease the performance significantly. Finally, in the proposed consistency method based on OT for EI, it is beneficial to employ both representation-level distances (i.e., OT−Rep) and the models' predictions for event trigger likelihoods (i.e., OT−Score), as removing any of them hurts the performance.

Dataset  Sentence
ACE 2005  I was totally shocked by the court's decision to agree with Sam Sloan after he TRGs sued TRGe his children.
CySecED  According to the last update by the company, the following techniques are used to protect against such TRGs malware TRGe.
RAMS  The Russian officials TRGs vowed TRGe to bomb the ISIS bases after the last week's TRGs attack TRGe.

Table 5: Generated sentences by GPT-2 for different datasets. Event triggers are shown in boldface and are surrounded by the special tokens TRGs and TRGe generated by GPT-2.

Error Type  Sentence Example  Proportion
Incompleteness  A federal judge on Monday settled TRGs charges TRGe against seven members of  18%
Repetition  Do you think the TRGs attack TRGe will happen to you or do you think the TRGs attack TRGe will happen to you?  15%
Inconsistency  this morning we were watching the news and heard the news about the tragic TRGs death TRGe of a young boy and her mother in Iraq.  12%
Missing Labels  Aaron Tramailer's story is the story of a woman who was forced into suicide.  29%
Incorrect Labels  The SEC is a very good place to TRGs hide TRGe money.  26%

Table 6: Samples of noisy generated sentences for the ACE 2005 dataset from GPT-2. Event triggers are shown in boldface and the special tokens TRGs and TRGe are generated by GPT-2.

3.4 Analysis

To provide more insights into the quality of the synthetic data G, we provide samples of sentences that are generated by the fine-tuned GPT-2 model on each dataset in Table 5. This table illustrates that the generated sentences also belong to the domains of the original data (e.g., the cybersecurity domain for CySecED). As such, combining synthetic data with original data is promising for improving ED performance, as demonstrated in our experiments.

As discussed earlier, the generated data G is not free of noise. In order to better understand the types of errors existing in generated sentences, we manually assess 200 sentences randomly selected from the set G generated by the fine-tuned GPT-2 model on the ACE 2005 dataset. We categorize the errors into five types and provide their proportions along with an example for each error type in Table 6. This table shows that the majority of errors are due to missing labels (i.e., no special tokens TRGs and TRGe are generated) or incorrect labels (i.e., marked words are not event triggers of the types of interest) generated by the language model.

Finally, to study the importance of the size of the generated data to augment the training set for ED, we conduct an experiment in which different numbers of generated samples in G (for the ACE 2005 dataset) are combined with the original data O. The results are shown in Table 7. According to this table, the highest performance of the proposed model is achieved when the numbers of the generated and original data are equal. More specifically, decreasing the number of generated samples potentially limits the benefits of data augmentation. On the other hand, increasing the size of the generated data might introduce extensive noises and become harmful to the ED models.

|G|  P  R  F1
0.5 * |O|  80.3  72.4  76.2
1.0 * |O|  82.4  75.0  78.5
2.0 * |O|  81.3  73.3  77.1
3.0 * |O|  78.4  71.8  75.0

Table 7: The performance of GPTEDOT on the ACE 2005 dev set with different sizes of the generated data G.

4 Related Work

Early methods for ED have employed feature-based techniques (Ahn, 2006; Ji and Grishman, 2008; Patwardhan and Riloff, 2009; Liao and Grishman, 2010a,b; Hong et al., 2011; McClosky et al., 2011; Li et al., 2013; Miwa et al., 2014; Yang and Mitchell, 2016). Later, advanced deep learning methods (Nguyen and Grishman, 2015; Chen et al., 2015; Nguyen et al., 2016a,b; Sha et al., 2018; Zhang et al., 2019; Yang et al., 2019; Nguyen and Nguyen, 2019; Zhang et al., 2020b) have been applied for ED. One challenge for ED research is the limited size of existing datasets that hinders the training of effective models. Prior works have attempted to address this issue via unsupervised (Huang et al., 2016; Yuan et al., 2018), semi-supervised (Liao and Grishman, 2010a; Huang and Riloff, 2012; Ferguson et al., 2018), distantly supervised (Keith et al., 2017; Nguyen and Nguyen, 2018; Zeng et al., 2017; Araki and Mitamura, 2018), and few/zero-shot (Huang et al., 2018; Lai et al., 2020a,b) learning.

In this work, we propose a novel method to augment training data for ED by exploiting the powerful language model GPT-2 to automatically generate new samples.

Leveraging GPT-2 for augmenting training data has also been studied for other NLP tasks recently (e.g., relation extraction, commonsense reasoning) (Papanikolaou and Pierleoni, 2020; Zhang et al., 2020a; Yang et al., 2020; Madaan et al., 2020; Bosselut et al., 2019; Kumar et al., 2020; Anaby-Tavor et al., 2020; Peng et al., 2020). However, none of those works has explored GPT-2 for ED. In addition, existing methods only resort to heuristics to filter out noisy samples generated by GPT-2. In contrast, we propose a novel differentiable method capable of preventing noises from diverging the representation vectors of the models for ED.

5 Conclusion

We propose a novel method for augmenting training data for ED using the samples generated by the language model GPT-2. To avoid noises in the generated data, we propose a novel teacher-student architecture in a multi-task learning framework. We introduce a mechanism for knowledge consistency enforcement to mitigate noises from generated data based on optimal transport. Experiments on various ED benchmark datasets demonstrate the effectiveness of the proposed method.

Acknowledgments

This research has been supported by the Army Research Office (ARO) grant W911NF-21-1-0112 and the NSF grant CNS-1747798 to the IUCRC Center for Big Learning. This research is also based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA Contract No. 2019-19051600006 under the Better Extraction from Text Towards Enhanced Retrieval (BETTER) Program. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ARO, ODNI, IARPA, the Department of Defense, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein. This document does not contain technology or technical data controlled under either the U.S. International Traffic in Arms Regulations or the U.S. Export Administration Regulations.

References

David Ahn. 2006. The stages of event extraction. In Proceedings of the Workshop on Annotating and Reasoning about Time and Events.

Ateret Anaby-Tavor, Boaz Carmeli, Esther Goldbraich, Amir Kantor, George Kour, Segev Shlomov, Naama Tepper, and Naama Zwerdling. 2020. Do not have enough data? Deep learning to the rescue! In Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI).

Jun Araki and Teruko Mitamura. 2018. Open-domain event detection using distant supervision. In Proceedings of the International Conference on Computational Linguistics (COLING).

Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. 2019. Comet: Commonsense transformers for automatic knowledge graph construction. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Yubo Chen, Liheng Xu, Kang Liu, Daojian Zeng, and Jun Zhao. 2015. Event extraction via dynamic multi-pooling convolutional neural networks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Shaoyang Duan, Ruifang He, and Wenli Zhao. 2017. Exploiting document level information to improve event detection via recurrent neural networks. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP).

Seth Ebner, Patrick Xia, Ryan Culkin, Kyle Rawlins, and Benjamin Van Durme. 2020. Multi-sentence argument linking. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

James Ferguson, Colin Lockard, Daniel S. Weld, and Hannaneh Hajishirzi. 2018. Semi-supervised event extraction with paraphrase clusters. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. In Proceedings of the Deep Learning Workshop at NeurIPS.

Yu Hong, Jianfeng Zhang, Bin Ma, Jianmin Yao, Guodong Zhou, and Qiaoming Zhu. 2011. Using cross-entity inference to improve event extraction. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Lifu Huang, Taylor Cassidy, Xiaocheng Feng, Heng Ji, Clare Voss, Jiawei Han, and Avirup Sil. 2016. Liberal event extraction and event schema induction. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Lifu Huang, Heng Ji, Kyunghyun Cho, and Clare R. Voss. 2018. Zero-shot transfer learning for event extraction. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Ruihong Huang and Ellen Riloff. 2012. Bootstrapped training of event extraction classifiers. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL).

Heng Ji and Ralph Grishman. 2008. Refining event extraction through cross-document inference. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Katherine Keith, Abram Handler, Michael Pinkham, Cara Magliozzi, Joshua McDuffie, and Brendan O'Connor. 2017. Identifying civilians killed by police with distantly supervised entity-event extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Varun Kumar, Ashutosh Choudhary, and Eunah Cho. 2020. Data augmentation using pre-trained transformer models. arXiv preprint arXiv:2003.02245.

Viet Dac Lai, Franck Dernoncourt, and Thien Huu Nguyen. 2020a. Exploiting the matching information in the support set for few shot event classification. In Proceedings of the 24th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD).

Viet Dac Lai, Franck Dernoncourt, and Thien Huu Nguyen. 2020b. Extensively matching for few-shot learning event detection. In Proceedings of the 1st Joint Workshop on Narrative Understanding, Storylines, and Events (NUSE) at ACL 2020.

Viet Dac Lai, Tuan Ngo Nguyen, and Thien Huu Nguyen. 2020c. Event detection: Gate diversity and syntactic importance scores for graph convolution neural networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Qi Li, Heng Ji, and Liang Huang. 2013. Joint event extraction via structured prediction with global features. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Shasha Liao and Ralph Grishman. 2010a. Filtered ranking for bootstrapping in event extraction. In Proceedings of the International Conference on Computational Linguistics (COLING).

Shasha Liao and Ralph Grishman. 2010b. Using document level cross-event inference to improve event extraction. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Jian Liu, Yubo Chen, and Kang Liu. 2019. Exploiting the ground-truth: An adversarial imitation based knowledge distillation approach for event detection. In Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI).

Jian Liu, Yubo Chen, Kang Liu, and Jun Zhao. 2018. Event detection via gated multilingual attention mechanism. In Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI).

Shulin Liu, Yubo Chen, Kang Liu, and Jun Zhao. 2017. Exploiting argument information to improve event detection via supervised attention mechanisms. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Yaojie Lu, Hongyu Lin, Xianpei Han, and Le Sun. 2019. Distilling discrimination and generalization knowledge for event detection via delta-representation learning. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Aman Madaan, Dheeraj Rajagopal, Yiming Yang, Abhilasha Ravichander, Eduard Hovy, and Shrimai Prabhumoye. 2020. Eigen: Event influence generation using pre-trained language models. arXiv preprint arXiv:2010.11764.

Hieu Man Duc Trong, Duc Trong Le, Amir Pouran Ben Veyseh, Thuat Nguyen, and Thien Huu Nguyen. 2020. Introducing a new dataset for event detection in cybersecurity texts. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

David McClosky, Mihai Surdeanu, and Christopher Manning. 2011. Event extraction as dependency parsing. In BioNLP Shared Task Workshop.

Makoto Miwa, Paul Thompson, Ioannis Korkontzelos, and Sophia Ananiadou. 2014. Comparable study of event extraction in newswire and biomedical domains. In Proceedings of the International Conference on Computational Linguistics (COLING).

Nghia Ngo, Tuan Ngo Nguyen, and Thien Huu Nguyen. 2020. Learning to select important context words for event detection. In Proceedings of the 24th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD).

Minh Van Nguyen, Viet Dac Lai, and Thien Huu Nguyen. 2021. Cross-task instance representation interactions and label dependencies for joint information extraction with graph convolutional networks. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).

Minh Van Nguyen and Thien Huu Nguyen. 2018. Who is killed by police: Introducing supervised attention for hierarchical lstms. In Proceedings of the International Conference on Computational Linguistics (COLING).

Thien Huu Nguyen, Kyunghyun Cho, and Ralph Grishman. 2016a. Joint event extraction via recurrent neural networks. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).

Thien Huu Nguyen, Lisheng Fu, Kyunghyun Cho, and Ralph Grishman. 2016b. A two-stage approach for extending event detection to new types via neural networks. In Proceedings of the 1st ACL Workshop on Representation Learning for NLP (RepL4NLP).

Thien Huu Nguyen and Ralph Grishman. 2015. Event detection and domain adaptation with convolutional neural networks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Thien Huu Nguyen and Ralph Grishman. 2018. Graph convolutional networks with argument-aware pooling for event detection. In Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI).

Thien Huu Nguyen, Adam Meyers, and Ralph Grishman. 2016c. New York University 2016 system for KBP event nugget: A deep learning approach. In Proceedings of the Text Analysis Conference (TAC).

Trung Minh Nguyen and Thien Huu Nguyen. 2019. One for all: Neural joint modeling of entities and events. In Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI).

Yannis Papanikolaou and Andrea Pierleoni. 2020. Dare: Data augmented relation extraction with GPT-2. In SciNLP Workshop at the Conference on Automated Knowledge Base Construction (AKBC).

Siddharth Patwardhan and Ellen Riloff. 2009. A unified model of phrasal and sentential evidence for information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Baolin Peng, Chenguang Zhu, Michael Zeng, and Jianfeng Gao. 2020. Data augmentation for spoken language understanding via pretrained models. arXiv preprint arXiv:2004.13952.

Gabriel Peyre and Marco Cuturi. 2019. Computational optimal transport: With applications to data science. In Foundations and Trends in Machine Learning.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Taneeya Satyapanich, Francis Ferraro, and Tim Finin. 2020. CASIE: Extracting cybersecurity event information from text. In Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI).

Lei Sha, Feng Qian, Baobao Chang, and Zhifang Sui. 2018. Jointly extracting event triggers and arguments by dependency-bridge RNN and tensor-based argument interaction. In Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI).

Meihan Tong, Shuai Wang, Yixin Cao, Bin Xu, Juanzi Li, Lei Hou, and Tat-Seng Chua. 2020a. Image enhanced event detection in news articles. In Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI).

Meihan Tong, Bin Xu, Shuai Wang, Yixin Cao, Lei Hou, Juanzi Li, and Jun Xie. 2020b. Improving event detection via open-domain trigger knowledge. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Christopher Walker, Stephanie Strassel, Julie Medero, and Kazuaki Maeda. 2006. ACE 2005 multilingual training corpus. Technical report, Linguistic Data Consortium.

Xiaozhi Wang, Xu Han, Zhiyuan Liu, Maosong Sun, and Peng Li. 2019. Adversarial training for weakly supervised event detection. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).

Haoran Yan, Xiaolong Jin, Xiangbin Meng, Jiafeng Guo, and Xueqi Cheng. 2019. Event detection with multi-order graph convolution and aggregated attention. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Bishan Yang and Tom M. Mitchell. 2016. Joint extraction of events and entities within a document context. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).

Sen Yang, Dawei Feng, Linbo Qiao, Zhigang Kan, and Dongsheng Li. 2019. Exploring pre-trained language models for event extraction and generation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Yiben Yang, Chaitanya Malaviya, Jared Fernandez, Swabha Swayamdipta, Ronan Le Bras, Ji-Ping Wang, Chandra Bhagavatula, Yejin Choi, and Doug Downey. 2020. Generative data augmentation for commonsense reasoning. In Findings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Quan Yuan, Xiang Ren, Wenqi He, Chao Zhang, Xinhe Geng, Lifu Huang, Heng Ji, Chin-Yew Lin, and Jiawei Han. 2018. Open-schema event profiling for massive news corpora. In Proceedings of the Conference on Information and Knowledge Management (CIKM).

Ying Zeng, Yansong Feng, Rong Ma, Zheng Wang, Rui Yan, Chongde Shi, and Dongyan Zhao. 2017. Scale up event extraction learning via automatic training data generation. In Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI).

Danqing Zhang, Tao Li, Haiyang Zhang, and Bing Yin. 2020a. On data augmentation for extreme multi-label classification. arXiv preprint arXiv:2009.10778.

Junchi Zhang, Yanxia Qin, Yue Zhang, Mengchi Liu, and Donghong Ji. 2019. Extracting entities and events as a single task using a transition-based neural model. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI).

Yunyan Zhang, Guangluan Xu, Yang Wang, Daoyu Lin, Feng Li, Chenglong Wu, Jingyuan Zhang, and Tinglei Huang. 2020b. A question answering-based framework for one-step event argument extraction. IEEE Access, vol. 8, 65420-65431.