EXPLOITING UNLABELLED DATA FOR RELATION EXTRACTION

A THESIS SUBMITTED TO THE UNIVERSITY OF MANCHESTER FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN THE FACULTY OF SCIENCE AND ENGINEERING

2020

Thy Thy Tran

Department of Computer Science

Contents

Abstract

Declaration

Copyright

Acknowledgements

Acronyms and Abbreviations

1 Introduction
  1.1 Motivation
  1.2 Research Questions, Hypotheses and Objectives
  1.3 Contributions
  1.4 Dissertation Outline and Publications

2 Background
  2.1 Introduction: Relation Extraction
    2.1.1 Related Concepts
    2.1.2 Relation Extraction Tasks
    2.1.3 Datasets
    2.1.4 Evaluation Metrics
  2.2 Related Work on Relation Extraction
    2.2.1 Early Systems and Classical Machine Learning
    2.2.2 Neural Networks and Deep Learning
  2.3 Features for Relation Extraction
    2.3.1 Linguistic Features
    2.3.2 Word Representations
    2.3.3 External Resources
  2.4 Neural Components for Relation Extraction
    2.4.1 Convolutional Neural Networks
    2.4.2 Recurrent Neural Networks
    2.4.3 Graph Neural Networks
    2.4.4 Attention Mechanisms
    2.4.5 Pretrained Models
    2.4.6 Hybrid Architectures
  2.5 Relation Candidate Representation
  2.6 Relation Classification Layer
  2.7 Learning
    2.7.1 Fully Supervised Learning
    2.7.2 Few-shot Learning
    2.7.3 Weakly-Supervised Learning
    2.7.4 Unsupervised Learning
    2.7.5 Transfer Learning
    2.7.6 Semi-Supervised Learning
    2.7.7 Open Information Extraction
  2.8 Conclusions

3 Enriching Word Representations
  3.1 Introduction
  3.2 Proposed Approach
    3.2.1 Base Representation
    3.2.2 Part-of-Speech Tags and Dependencies
    3.2.3 The SIWR Model
    3.2.4 Pretraining the SIWR Model
    3.2.5 Syntactically-Informed Word Representations
  3.3 Pretraining Settings
    3.3.1 Datasets and Base Representations Used for Pretraining
    3.3.2 Pretraining Implementation Details
  3.4 Evaluation Settings
    3.4.1 Binary Relation Extraction
    3.4.2 Ternary Relation Extraction
  3.5 Results
  3.6 Analysis
    3.6.1 Effects of the Number of Pretraining Samples
    3.6.2 Ablation Studies
    3.6.3 Impact of Syntactic Information
    3.6.4 Computational Cost
  3.7 Related Work
  3.8 Conclusion

4 Unsupervised Relation Extraction
  4.1 Motivation
  4.2 Background: Unsupervised Relation Extraction
    4.2.1 Generative Approach
    4.2.2 Discriminative Approaches
  4.3 Our Methods
  4.4 Experimental Settings
    4.4.1 Evaluation Metrics
    4.4.2 Datasets
    4.4.3 Model Settings
  4.5 Results and Discussion
    4.5.1 Results
    4.5.2 Analysis
  4.6 Conclusion

5 Language Models as Weak Supervision
  5.1 Motivation
  5.2 Using Language Models as Weak Annotators
    5.2.1 Defining Relation Types
    5.2.2 Language Model Annotator
  5.3 Noisy Channel Auto-encoder (NoelA)
    5.3.1 Encoder
    5.3.2 Decoder
    5.3.3 Learning
  5.4 Experimental Settings
    5.4.1 Datasets
    5.4.2 Pretrained Language Models
    5.4.3 Relation Classification Settings
  5.5 Results
    5.5.1 Data Annotation
    5.5.2 Relation Classification
  5.6 Analysis
    5.6.1 Relation Distribution
    5.6.2 The Accuracy of BERT Annotator
    5.6.3 The Accuracy of NoelA
    5.6.4 The Impact of Entity Type Reconstruction
  5.7 Related Work
    5.7.1 Relation Classification
    5.7.2 Pretrained Language Models
  5.8 Conclusion

6 Conclusions
  6.1 Summary of Research Objectives
  6.2 Open Problems and Future Work
    6.2.1 External Information for Enriching Word Representations
    6.2.2 Graph Generalisation and Construction
    6.2.3 Cluster Definition
    6.2.4 Improvement for Language Model Annotation
    6.2.5 Noise Reduction
    6.2.6 Multiple Sources of Supervision
    6.2.7 Document-level Relation Extraction

A Named Entity Recognition
  A.1 Named Entity Recognition
  A.2 Experimental Settings
  A.3 Results
  A.4 Comparison between Different Base Representations

B Language Models as Weak Supervision
  B.1 BERT Annotator Confusion Matrices
  B.2 Relation Exemplars

Bibliography

Word Count: 35,786

List of Tables

2.1 Available relation extraction datasets for general domain
2.2 Annotation examples of distant supervision and the corresponding gold relation categories (DS)
2.3 Relation examples from classical relation extraction methods and open information extraction systems

3.1 Word representations, training data and dependency parsers that are used in our experiments
3.2 Value range and best value of tuned hyperparameters for our SIWR
3.3 Evaluation datasets and related models used in our experiments
3.4 Statistics and hyperparameters for the ACE2005 binary relation extraction task
3.5 Statistics and hyperparameters for the drug-gene-mutation dataset
3.6 Test set results with different embeddings over two relation extraction tasks
3.7 Comparison of contextual representations and fine-tuning large-scale language model
3.8 Binary relation extraction performance on ACE2005 test set
3.9 N-ary relation extraction accuracy on the drug-gene-mutation data
3.10 Binary relation extraction performance of ablated SIWRs variants on ACE2005 development set
3.11 Pretrained model parameters and downstream trainable parameters

4.1 The statistics of the NYT-FB and the TACRED datasets. #r indicates the number of relation types in each dataset
4.2 Hyper-parameter values used in our experiments
4.3 Average results (%) across three runs of different models (except the rule-based EType) on NYT-FB and TACRED
4.4 Study of EType+ in combination with different features

5.1 Data statistics of TACRED and reWiki datasets
5.2 Hyper-parameters of NoelA and its variants
5.3 Accuracy (%) of LM annotators on two datasets
5.4 RC accuracy (Acc.) across five runs of NoelA with its variants
5.5 Mutual information between entity type pairs (ET) and gold relations (R) on the development sets

A.1 Data statistics for the ACE2005 named entity recognition dataset
A.2 Nested NER
A.3 Test set results with different embeddings on the nested named entity recognition dataset (ACE2005)
A.4 Performance comparison on Nested NER – ACE 2005 test set
A.5 Nested named entity recognition results on ACE2005 development set with different base representations and their enriched alternatives

B.1 Exemplars created for each relation in TACRED
B.2 Exemplars created for each relation in reWiki

List of Figures

1.1 The overview of our contributions
1.2 The thesis roadmap

2.1 An example of binary relation in a sentence
2.2 A neuron or a perceptron visualisation adapted from CS231n (2020)
2.3 One-layer neural network
2.4 Two-layer neural network
2.5 Block diagram of a neural framework for relation extraction (RE)
2.6 Part-of-speech tags (yellow rectangles) and dependency structure (red arcs) of a sentence
2.7 A convolutional neural network architecture for relation extraction by Zeng et al. (2017)
2.8 A recurrent neural network and its unrolled visualisation (Olah, 2015)
2.9 Long short-term memory cell (Olah, 2015)
2.10 A recursive neural network (Ebrahimi and Dou, 2015)
2.11 A graph convolutional neural network (GCN)
2.12 General intuition about few-shot learning, 3 way 2 shot in this case
2.13 An illustration of the data programming framework, adapted from Ratner et al. (2017)
2.14 The comparison between previous dependency word representations and our work

3.1 Overview of our neural graph model
3.2 The use of SIWRs in downstream tasks, i.e., relation extraction models
3.3 A binary relation example from the ACE2005 dataset (Walker et al., 2006)
3.4 An n-ary relation example from the drug-gene-mutation dataset (Peng et al., 2018)
3.5 Binary relation extraction performance of SIWRsELMo with different numbers of pretraining sentences on the ACE2005 development set
3.6 Missing predictions when removing a component
3.7 Comparison of the binary relation extraction performance on entity pairs with different distances on the ACE2005 development set using (left) SIWRsELMo and ELMo, (right) SIWRsBERT and BERT-feature
3.8 Relation prediction examples from the ACE2005 dataset and the automatically parsed tree

4.1 The idea of a link predictor
4.2 Intuition of using entity types
4.3 Abstract idea of testifying the link predictor
4.4 Average negative log likelihood losses across three runs of the link predictor on the training data

5.1 Language models as weak supervision for relation classification
5.2 Similarity computation using an LM
5.3 Overview of our model NoelA
5.4 Relation distributions on the development sets
5.5 Accuracy (%) w.r.t. relation type of BERT annotator on the development sets. Relation types with highest and lowest performance are marked
5.6 Accuracy differences (%) w.r.t. relation types between NoelA and BERT annotator on the development sets. Relation types with the most and least accuracy differences are labelled

B.1 Gold relations and BERT annotator relations confusion matrix. The indices of the relation types are given in Table B.1 and Table B.2

Abstract

EXPLOITING UNLABELLED DATA FOR RELATION EXTRACTION
Thy Thy Tran
A thesis submitted to The University of Manchester for the degree of Doctor of Philosophy, 2020

Information extraction transforms unstructured text into structured form by annotating semantic information on raw data. A crucial step in information extraction is relation extraction, which identifies semantic relationships between named entities in text. The resulting relations can be used to construct and populate knowledge bases, as well as in various applications such as information retrieval and question answering. Relation extraction has been widely studied using fully supervised and distantly supervised approaches, both of which require either manually or automatically annotated data. In contrast, the massive amount of freely available unlabelled text is underused. We hence focus on leveraging unlabelled data to improve and extend relation extraction. We approach the use of unlabelled text from three directions: (i) using it to pretrain word representations, (ii) conducting unsupervised learning, and (iii) performing weak supervision. Regarding the first direction, we want to leverage syntactic information for relation extraction. Instead of directly tuning such information on a relation extraction corpus, we propose a novel graph neural model for learning syntactically-informed word representations. The proposed method allows us to enrich pretrained word representations with syntactic information rather than re-training language models from scratch as in previous work. Through this work, we confirm that our novel representations are beneficial for relations in two different domains. In the second direction, we study unsupervised relation extraction, which is a promising approach because it does not require manually or automatically labelled data. We hypothesise that inductive biases are extremely important to guide unsupervised relation extraction. We hence employ two simple methods using only entity types to infer relations. Despite their simplicity, our methods can outperform existing approaches on two popular datasets.

These surprising results suggest that entity types provide a strong inductive bias for unsupervised relation extraction. The last direction is inspired by recent evidence that large-scale pretrained language models capture some sort of relational facts. We want to investigate whether these pretrained language models can serve as weak annotators. To this end, we evaluate three large pretrained language models by matching sentences against relation exemplars. The matching scores indicate how likely it is that a given sentence expresses a relation. The top relations are then used as weak annotations to train a relation classifier. We observe that pretrained language models are confused by highly similar relations; thus, we propose a method that models the labelling confusion to correct relation predictions. We validate the proposed method on two datasets with different characteristics, showing that it can effectively model labelling noise from our weak annotator. Overall, we illustrate that exploring the use of unlabelled data is an important step towards improving relation extraction. The use of unlabelled data is a promising path for relation extraction and should receive more attention from researchers.

Declaration

No portion of the work referred to in this thesis has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.

Copyright

i. The author of this thesis (including any appendices and/or schedules to this thesis) owns certain copyright or related rights in it (the “Copyright”) and s/he has given The University of Manchester certain rights to use such Copyright, including for administrative purposes.

ii. Copies of this thesis, either in full or in extracts and whether in hard or electronic copy, may be made only in accordance with the Copyright, Designs and Patents Act 1988 (as amended) and regulations issued under it or, where appropriate, in accordance with licensing agreements which the University has from time to time. This page must form part of any such copies made.

iii. The ownership of certain Copyright, patents, designs, trade marks and other intellectual property (the “Intellectual Property”) and any reproductions of copyright works in the thesis, for example graphs and tables (“Reproductions”), which may be described in this thesis, may not be owned by the author and may be owned by third parties. Such Intellectual Property and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication and commercialisation of this thesis, the Copyright and any Intellectual Property and/or Reproductions described in it may take place is available in the University IP Policy (see http://documents.manchester.ac.uk/DocuInfo.aspx?DocID=24420), in any relevant Thesis restriction declarations deposited in the University Library, The University Library's regulations (see http://www.library.manchester.ac.uk/about/regulations/) and in The University's policy on Presentation of Theses.

Acknowledgements

For me, the PhD journey has been a roller coaster; without the support of many people I would never have gone this far. First of all, I would like to thank my supervisor Prof. Sophia Ananiadou for offering me this great opportunity, as well as for her encouragement and support during these years. I want to thank my thesis committee, Dr. Andre Freitas and Prof. Naoaki Okazaki, for their insightful comments and advice towards the improvement of my thesis. I would like to thank in particular Prof. Tsujii for giving me the opportunity to join AIST/AIRC. I also want to thank Prof. Makoto Miwa for his dedicated guidance, supportive comments and discussion during our collaboration. I would also like to thank Nhung for helping me from the very first days in Manchester, for your thoughtful suggestions and encouragement from work to life, and for the memorable gatherings and holidays we spent together. I also want to express my thanks to Phong, who is a mentor and a friend, who shares interesting discussions and gives me invaluable advice. I am grateful to all my current and former colleagues in NaCTeM with whom I shared my Ph.D. life: Meizhi, Maolin, Fenia, Paul, Kurt, Yifei, Hassan, Jiarun, Panos, Erxue, Chryssa, Jock, Sam, Emrah, Minh, Piotr, Austin, and Sunil. Extra thanks to Meizhi for the several meals you invited me to. I owe a lot to my best friends, Dinh and Vy, who always listen and encourage me through countless gloomy days. I also want to thank Marco, my best flatmate! He has accompanied me through the entire journey, always stands by my side, makes Manchester homey and patiently helps me get through many disappointments. I am also glad for all the friends who accompanied me from a distance. Finally, words cannot express how grateful I am to my family, especially my mother, who always supports and believes in me, and who loves me unconditionally.

Acronyms and Abbreviations

ARI adjusted Rand index.

CNN convolutional neural network.
conv convolutional neural layer.

DS distant supervision.
DVAE discrete-state variational auto-encoder.

FSDA few-shot domain adaptation.
FSL few-shot learning.

GCN graph convolutional network.
GNN graph neural network.

KB knowledge base.
KL Kullback-Leibler divergence.

LDA latent Dirichlet allocation.
LM language model.
LSTM long short-term memory.

MIL multi-instance learning.
ML machine learning.
MLM masked language model.

NA not a relation.
NER named entity recognition.

NLP natural language processing.
NOTA none-of-the-above.

OIE open information extraction.

PCNN piece-wise convolutional neural network.
PGM probabilistic graphical model.
PLM pre-trained language model.
POS part of speech.

RC relation classification.
RE relation extraction.
RecNN recursive neural network.
RL reinforcement learning.
RNN recurrent neural network.

SDP shortest dependency path.
SIWR syntactically-informed word representation.
SVM support vector machine.

URE unsupervised relation extraction.

Chapter 1

Introduction

1.1 Motivation

In this digital age, people can freely share and obtain any information online, which promotes the exponential growth of digital content, and of text in particular. As it is infeasible for humans to read through such a large amount of text, we expect computers to automatically extract important information and communicate it to us in the forms we desire. Natural language processing (NLP) is the field of study that develops methods to support computers for this purpose. As human language is highly ambiguous and ever-changing, NLP is challenging. To process a piece of text, NLP needs to solve different tasks for understanding various aspects of the text. A set of studies in NLP, named information extraction, develops methods to transform unstructured text into machine-readable structures for further applications such as question answering, knowledge base population and information retrieval. In particular, information extraction methods reveal the underlying structures by detecting semantic concepts and the relationships between them. Such methods help readers navigate large amounts of textual data and quickly locate information of interest. In this dissertation, we study relation extraction (RE), which is a sub-field of information extraction. Relation extraction refers to the identification of semantic associations (relations) between concepts (named entities) within a given text. The study of relations can be dated back to the discussion of lexical cohesion, or lexical semantic relations, in 1976 (Halliday and Hasan, 1976), which identifies the relations between words, e.g., hypernym and hyponym. In contrast, the field of extracting relations between concepts was formally introduced at the Message Understanding Conference (MUC-7) in 1998 (Chinchor, 1998).

Early pioneering work captured relationships using patterns written by humans (Huffman, 1995; Brin, 1998; Agichtein and Gravano, 2000). Such pattern-based systems, however, were too specific to the particular datasets and domains they had been designed for. Therefore, statistical machine learning, which can automatically learn patterns from data, has come to dominate relation extraction approaches. For more than a decade, core relation extraction techniques were built on feature-based and kernel-based models. Consequently, human effort shifted from writing patterns to generating meaningful features for these models. Feature engineering is time-consuming and also limited to specific domains. Since 2012, neural networks have proven more effective than non-neural, feature-based models and have become the model of choice for building RE methods (Socher et al., 2012). Neural models can automatically learn abstract features and thus reduce the need for feature engineering. Neural relation extraction is therefore the theme of this study.

The standard way to train a model is to learn a mapping function from input to output using a large number of annotated examples. This is known as supervised learning, which has attracted a lot of attention from the community, and the number of publications has increased significantly in recent years. However, current relation corpora are mostly imbalanced: results on these corpora often reflect the performance on highly frequent relations, while long-tail relations usually receive much lower scores. This inconsistency is likely due to the design and learning process of neural models, which often rely on a large amount of data and tend to predict frequently observed relations while down-weighting the less frequent ones. One way to deal with this problem is to annotate more instances for the less frequent relations, which increases the cost of data collection. Furthermore, if we want to perform relation extraction in a different domain, or even to identify a different set of relation types in the same domain, the annotation process must be repeated in order to train a supervised learning model, requiring yet more human effort.

Unlike annotated corpora, there is an abundant amount of unlabelled text freely available. Can we utilise such data to improve RE and discover relations with minimal human effort? This question motivates us to exploit the use of unlabelled text for relation extraction in this dissertation. We attempt to investigate three ways of using such unlabelled data: (i) transfer learning from word representations, (ii) unsupervised learning, and (iii) weakly-supervised learning. We first try to transfer knowledge from large unlabelled data by enriching word representations with syntactic information.

In particular, we train a graph convolutional neural model using automatically parsed text and base representations as input. We then extract the intermediate representations of the model as syntactically-informed word representations (SIWRs). We evaluate SIWRs on two imbalanced textual relation datasets in the general and biomedical domains, showing the effectiveness of our enriched word representations.

Since enriching word representations does not directly address RE, we experiment with unsupervised learning, which directly uses unlabelled text for RE. Unsupervised relation extraction (URE) enables the discovery of new relations since there is no predefined set of relations. Following previous work, we investigate unsupervised relation extraction in a discrete-state variational auto-encoder setting. This work presents the surprising result that our two simple methods, which use only entity types to infer relations, can outperform the existing state of the art. The work also includes an extensive analysis regarding data quality, training setting and evaluation metrics.

Although URE is promising, its output cannot be directly used in applications: the number of relation categories is small, and the meaning of each category must be further specified, either by manually defining the relation types or by extracting frequent words and phrases within individual clusters as representatives. To address this unnamed-relation problem, we propose a weakly supervised approach as the third task in this dissertation. Our setting requires little human effort: one exemplar per relation of interest. Our study leverages the exemplars and large-scale pretrained language models (PLMs) as labelling functions to generate weakly-labelled data for relation extraction. The intuition behind using PLMs comes from recent evidence that these models capture some sort of relational facts. The resulting data are used for training a relation classifier, whose performance we enhance by explicitly modelling the labelling noise. We validate our proposed method on two datasets with different characteristics: imbalanced and balanced, different numbers of relation types, and different domains (news and encyclopedia).

1.2 Research Questions, Hypotheses and Objectives

Our research can be framed as the following research questions (RQ), accompanied by their hypotheses (H).

RQ1 Can pre-encoding syntactic information support the detection of semantic relations between two entities?

H1 Detection of entity relations can benefit from pre-encoded syntactic information in word representations.

RQ2 What are the existing approaches to unsupervised relation extraction built on neural models?

RQ3 Can inductive biases benefit unsupervised relation extraction?

H2 The use of entity types offers a strong inductive bias for unsupervised relation extraction.

RQ4 Is it possible to use pretrained language models to annotate relations on raw text without training?

H3 Pretrained language models can be used as weak supervision for relation extraction, which reduces the need for manual annotations and human-curated knowledge bases.

RQ5 Can modelling the confusion between similar relations be beneficial for identifying interactions between entities in text?

H4 Modelling relation confusion to estimate correct relations from noisy ones can be effective for classifying relations between entities.

Based on the above research questions and hypotheses, we establish the following research objectives (O):

O1 Develop a model that can enrich pretrained static or contextual word representations with syntactic information.

O2 Validate the enriched word representations on sentence-level relation datasets in different domains (news and biomedical).

O3 Investigate the existing unsupervised relation extraction approaches and provide extensive analysis on the current unsupervised setting, e.g., data quality, training setting and evaluation metrics.

O4 Introduce methods using entity types and assess their effectiveness on two relational datasets.

O5 Validate the use of pretrained language models as weak annotators for relation annotation on raw text with given entities, in order to reduce the cost of manual annotation.

O6 Propose a noisy-channel auto-encoder approach to deal with the annotation noise and validate its effectiveness on two datasets with different relation properties.

1.3 Contributions

The main contributions (C) of this dissertation are summarised as follows and illustrated in Figure 1.1.

C1 We propose new syntactically-informed word representations based on a graph neural model. We show that our word representations can improve relation extraction in both general and biomedical domains.

C2 We investigate the current learning setting of neural relation extraction in the case where no labelled text is given. We summarise notable points in unsupervised relation extraction, covering data, training signals and evaluation metrics.

C3 We introduce two simple baselines (with fewer parameters than previous work) using entity types that outperform previous methods, drawing attention to the field of unsupervised relation extraction.

C4 We test the use of pretrained language models to weakly annotate relations in raw text, assuming that entities are given and that a set of desired relation types with their corresponding examples is provided.

C5 We propose a noisy-channel neural model that mitigates the label confusion inherited from the annotation process. We show that explicitly modelling noise improves the detection of relations on both imbalanced and balanced datasets.

C6 We show the benefit of using unlabelled text in three different directions, namely transfer learning, unsupervised learning and weakly supervised learning using language models.

[Figure 1.1 depicts unlabelled textual data for relation extraction branching into three directions: transfer learning through enriching word representations (Chapter 3; Tran et al., 2020c, Neurocomputing), unsupervised learning through unsupervised relation extraction (Chapter 4; Tran et al., 2020b, ACL), and weakly-supervised learning through supervision from language models (Chapter 5; Tran et al., 2020a, under review).]

Figure 1.1: The overview of our contributions.

1.4 Dissertation Outline and Publications

We present the dissertation structure, as well as the corresponding peer-reviewed publications, as follows. In most cases, the content of these publications is replicated with very little change. Figure 1.2 illustrates an overview of our dissertation. In Chapter 2, we first provide definitions of related concepts used in this dissertation. Next, we briefly introduce different relation extraction tasks and describe the related corpora. We then give an overview of the history and recent development of relation extraction, where neural-based models have become the mainstream framework. Since neural models constitute the main techniques used in this dissertation, we present the fundamental background of neural networks to facilitate the reader in the remainder of this chapter. Building on these fundamentals, we present the individual blocks of neural relation extraction models and describe related work for each block. In Chapter 3, we introduce the syntactically-informed word representations extracted from a graph-based model. We describe the graph-based model architecture and its training objectives. The resulting word representations are evaluated on two sentential relation datasets from different domains (news and biomedical). At this point, we

examine the first hypothesis (H1) and validate our proposed word representations. Extensive analysis is conducted on different aspects to better understand the behaviour of the models and our contributions. This chapter includes the following publication:

Thy Thy Tran, Makoto Miwa, and Sophia Ananiadou. 2020c. Syntactically- Informed Word Representations. Neurocomputing

In Chapter 4, we study the problem of unsupervised relation extraction, where the training data are plain text with given entity mentions. We describe our two simple methods to infer relations from entity types and compare them with existing unsupervised models.

In this regard, we address our second and third research questions (RQ2 and RQ3) and the second hypothesis (H2). We further discuss our findings on the current unsupervised relation extraction setting, covering training and testing data,

evaluation metrics and learning signals of implemented models. We include work from our published paper:

Thy Thy Tran, Phong Le, and Sophia Ananiadou. 2020b. Revisiting unsupervised relation extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, pages 7498–7505

Chapter 5 begins with examining our third hypothesis (H3), testing the use of pretrained language models as annotators. Predictions from the language models show that these models get confused by highly similar relations. We hence propose a noisy-channel neural model to deal with labelling noise. We describe in detail the proposed method to estimate the correct relations from the noisy ones. Baselines and our proposed model are evaluated on two textual datasets with different relation characteristics. This

accordingly addresses our last hypothesis (H4). Finally, we analyse the annotation results to understand the biases captured in pretrained language models. We also discuss the contributions of each component in our noisy-channel model. The work in this chapter is under review.

Thy Thy Tran, Phong Le, and Sophia Ananiadou. 2020a. Exploiting Language Models for Weakly-Supervised Relation Classification. In Under Review

Chapter 6 finally concludes this dissertation, where we summarise our findings and present the overall conclusion of this work. We also discuss shortcomings of the methods proposed in this dissertation and potential future directions.

Additional Publications

The following articles have also been completed during my Ph.D.:

Fenia Christopoulou*, Thy Thy Tran*, Sunil Kumar Sahu, Makoto Miwa, and Sophia Ananiadou. 2020. Adverse Drug Events and Medication Relation Extraction in EHRs with Ensemble Deep Learning Methods. Journal of the American Medical Informatics Association

Hai-Long Trieu, Thy Thy Tran, Khoa N. A. Duong, Anh Nguyen, Makoto Miwa, and Sophia Ananiadou. 2020. DeepEventMine: End-to-end Neural Nested Event Extraction from Biomedical Texts. Bioinformatics

Figure 1.2: The thesis roadmap. Purple shows concepts that are directly relevant to this work as well as the paradigms that our contributions fall into, red nodes highlight our contributions, and cyan indicates the common concepts in neural relation extraction.

Chapter 2

Background

This chapter aims to:

• Describe the task of relation extraction
• Briefly introduce the history of relation extraction methods
• Present common neural layers and building blocks for relation extraction
• Summarise existing methods for neural relation extraction

In this chapter, we first present the key definitions of relation extraction. Next, we briefly introduce various relation extraction tasks, categorised with regard to multiple aspects of a relationship, as well as the available corpora and popular evaluation metrics. After this general introduction, we describe the early history of relation extraction, followed by a discussion of the rise of neural networks. Since this dissertation relies on neural-based models, we introduce the neural building blocks for relation extraction that have been used in most related work. We further describe these blocks as well as related studies.

2.1 Introduction: Relation Extraction

We first describe the main concepts used in relation extraction as well as related tasks, then present an overview of the existing annotated corpora, followed by a discussion of commonly used evaluation metrics.


2.1.1 Related Concepts

Relation extraction (RE) aims at extracting semantic associations between concepts, also called named entities or arguments. The extracted semantic relationships can contribute to knowledge base construction and completion (Ji and Grishman, 2011). These results are also used in downstream NLP applications such as information retrieval (Singhal, 2012; Wei et al., 2013; Soto et al., 2019), textual entailment (Szpektor et al., 2004) and question answering (Xu et al., 2016). Figure 2.1 illustrates a binary relation between two entities.

Figure 2.1: An example of a binary relation in a sentence from the TACRED dataset (Zhang et al., 2017b). In the sentence “Murat Kurnaz, a Turkish national who was born and grew up in Germany.”, the PERSON entity “Murat Kurnaz” and the LOCATION entity “Germany” share the relation place_of_birth.

Named entity A named entity, often simply called an entity, can be a word or a set of words that indicates a concept of interest. In Figure 2.1, the named entities are “Murat Kurnaz” and “Germany”. We also refer to them as entity mentions.

Entity type Entity type defines the semantic category of a named entity, such as PERSON and LOCATION in the example above.

Relation candidate A relation candidate involves at least two participating entities, which are known as relation arguments. We refer to a relation candidate that includes only two entities as a pair, e.g., (Murat Kurnaz, Germany) and (Turkish, Germany).

Relation type We use the term relation type (r) to define the semantic relation category of the relation candidate, such as place of birth.

Relation instance A relation instance, or relation for short, includes a relation candidate and its relation type. A pair together with its relation type is referred to as a triple

$(e_1, r, e_2)$, where $e_1, e_2$ are the two entities, e.g., (Murat Kurnaz, place of birth, Germany) and (Turkish, no relation, Germany). When the direction is considered, $e_1$ represents the head and $e_2$ refers to the tail.
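To make this data model concrete, the sketch below shows one possible in-memory representation of entity mentions and relation instances; the class and field names are illustrative only and are not taken from the thesis or from any dataset's official schema.

```python
from dataclasses import dataclass

@dataclass
class EntityMention:
    text: str   # surface form, e.g. "Murat Kurnaz"
    etype: str  # entity type, e.g. "PERSON"

@dataclass
class RelationInstance:
    head: EntityMention  # e1, the head argument
    tail: EntityMention  # e2, the tail argument
    rtype: str           # relation type r, e.g. "place_of_birth"

# The example from Figure 2.1 expressed as an (e1, r, e2) triple:
example = RelationInstance(
    head=EntityMention("Murat Kurnaz", "PERSON"),
    tail=EntityMention("Germany", "LOCATION"),
    rtype="place_of_birth",
)
```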

2.1.2 Relation Extraction Tasks

Relation extraction (RE) is an important task in natural language processing, which aims to identify semantic relations between spans of interest such as named entities or events. We categorise RE tasks based on different aspects of relations.

Textual Granularity. Arguments participating in a relation can be located within a sentence, in a short paragraph, or in a document. Previous work mainly studies sentence-level relation extraction (intra-sentence RE), and this setting accounts for most of the related datasets. In this work, we also focus on identifying sentence-level relations, and in the following sections of this background overview we will provide more details of past work and recent advancements in the field. We note that, in reality, many entities can be related across sentences (inter-sentence), either in a paragraph or a document. Inter-sentence relations have gained more attention in the last few years, motivated by the introduction of labelled corpora. Approaches dealing with document-level relations can utilise existing methods for intra-sentence relations along with higher-level information such as paragraph and document structures (Verga et al., 2018; Christopoulou et al., 2019; Nan et al., 2020).

Domain. Besides textual granularity, we can categorise relation extraction tasks based on the data domain. Relation extraction was originally applied to generic data obtained from the Web and Wikipedia. Later on, several approaches have been adapted and developed for biomedical and scientific literature. The domain-specific approaches often encode specialised features and information in order to achieve high performance.

Number of Arguments. A relation can be shared between two entities (binary), among three entities (ternary), or among n entities (n-ary). Existing RE approaches mostly focus on extracting binary relations, while n-ary relations have also gained attention in recent years. n-ary relation extraction approaches typically generalise from binary ones, in which word representations are obtained similarly while relation candidate representations are constructed from multiple entity representations rather than from two as in the binary case. Our introduction mainly describes binary relation construction as it is the focus of this study.

Besides the above categories, a group of approaches focuses on particular relation types such as temporal (Verhagen et al., 2007) and causal relations (Blanco et al., 2008). Temporal and causal relations are highly dependent (Mirza, 2014) and sometimes considered as narrative extraction (Caselli and Vossen, 2017). Other studies also aim at extracting entities jointly with their relations (Li and Ji, 2014; Miwa and Bansal, 2016), namely end-to-end relation extraction. These

approaches perform either in a pipeline setting or with joint training. The former may inherit errors from previous modules in the pipeline, while the latter allows the two tasks to benefit from each other. Lastly, depending on how the annotations were generated, we can also group relation extraction into several learning settings that are considered different tasks. We will discuss these learning settings in §2.7.

2.1.3 Datasets

Dataset      Text Granularity   Domain             Annotation
ACE 03-05    sentence           news               gold
SE10-T8      sentence           news               gold
NYT10        sentence           news               DS (Freebase)
KBP          sentence           news               DS
Google-RE    sentence           Wikipedia          DS (Freebase) + gold
TACRED       sentence           news               gold
NYT13        sentence           Wikipedia          DS (Freebase)
T-REx        sentence           DBpedia            DS (Wikidata)
FewRel 1.0   sentence           Wikipedia          DS (Wikidata) + gold
FewRel 2.0   sentence           Wikipedia/PubMed   DS (Wikidata/UMLS) + gold
DocRED       document           Wikipedia          DS (Wikidata) + gold

Table 2.1: Available relation extraction datasets for the general domain. DS stands for distant supervision.

The availability of labelled datasets has substantially motivated the fast development of relation extraction. These datasets have been created by human annotators for different domains, known as gold data, or automatically generated using distant supervision (DS), known as silver data. In this section, we summarise existing datasets in terms of domains and textual granularity.

Generic domain. We present an overview of the most commonly used RE datasets for the general domain in Table 2.1, along with their properties, including data sources, classification types, and annotation sources. The first official relation dataset was introduced at the Seventh Message Understanding Conference (MUC-7) (Chinchor, 1998), created from newswire reports. The ACE 2003, 2004 and 2005 datasets (Grishman et al., 2005), published by the Automatic Content Extraction (ACE) project, include annotations of named entities, relations and events in several languages from news articles. A shared task for extracting relationships between two nominals (SE10-T8) was organised at SemEval 2010 (Hendrickx et al., 2010). The created data has been widely used since then. Riedel et al. (2010) published the NYT dataset, usually referred to as NYT10 or NYT-Riedel, constructed by aligning New York Times articles against Freebase (Bollacker et al., 2008). Similarly, the KBP corpus (Angeli et al., 2014) was created from the 2010 and 2013 KBP document collections and a July 2013 dump of Wikipedia as text corpus, using distant supervision and active learning. The Google-RE corpus (https://code.google.com/archive/p/relation-extraction-corpus) contains facts from Freebase manually aligned to Wikipedia text, but only five relation types are covered. A recent manually curated dataset is TACRED (Zhang et al., 2017b), originated from the TAC KBP data. TACRED contains more relation instances than previous gold corpora (e.g., SE10-T8 and ACE 03-05). T-REx (Elsahar et al., 2018) is the largest DS dataset, including a large number of automatic alignments between DBpedia abstracts (Brümmer et al., 2016) and Wikidata triples (Vrandečić, 2012). The authors of the dataset reported a high performance of 97.8% accuracy on a crowdsourced test set (not available for download). However, the above corpora do not discriminate between frequent and long-tail relations, which are commonly seen in the real world. This leads to the need for models that can learn long-tail relations more efficiently; hence, few-shot relation extraction datasets were created. FewRel 1.0 (Han et al., 2018b) was the first few-shot RE dataset, constructed via distant supervision and curated by crowdsourcing. The data follow the n-way k-shot setting (Vinyals et al., 2016), where n new relation types are sampled from the test set and k instances are given for each relation type. Wiki80 (Han et al., 2019) is a redistributed version of FewRel 1.0 where training and testing sets share relation types. FewRel 2.0 (Gao et al., 2019b) extended FewRel 1.0 with biomedical domain annotations and a simple additional evaluation where a testing instance does not necessarily belong to the n categories. A recent dataset was introduced for document-level relation extraction, namely DocRED (Yao et al., 2019).

Biomedical literature. Manual annotations are more costly to obtain for biomedical domains due to the requirement of domain expertise. Thus, existing RE datasets mostly focus on particular types of named entities and aim to identify whether a relation is shared between them rather than a specific relation type.
Early datasets target protein-protein interactions (PPI), such as AIMed (Mooney and Bunescu, 2006) and BioInfer (Pyysalo et al., 2007). Later on, the i2b2 challenge was organised for extracting relations between biomedical entities in clinical text (Uzuner et al., 2011). A range of shared tasks has been held since 2013 focusing on relation extraction from biomedical text, which has resulted in significant resources for the field.


The SemEval 2013 DDI extraction task (Segura-Bedmar et al., 2013) is the most widely-used drug-drug interactions dataset. Other relation arguments include bacteria and biotopes from BioNLP 2013 (Bossy et al., 2013), chemicals and diseases from BioCreative V CDR (Li et al., 2016a), and chemicals and proteins from BioCreative VI ChemProt (Krallinger et al., 2017). Gurulingappa et al. (2012) introduced the ADE corpus, annotated with relations between drugs and adverse events. Another adverse event relation extraction dataset was released as part of the 2017 Text Analysis Conference (TAC) (Roberts et al., 2017). Henry et al. (2020) introduced the n2c2 challenge, which aims at extracting relations between drugs and medical entities as well as adverse events. Relations between metabolites and other entities were also explored by Shardlow et al. (2018). Recently, distant supervision has also been used to create datasets for biomedical RE. Quirk and Poon (2017) generated a binary RE dataset between drugs and genes located within spans of one to three sentences, and later the corpus was extended to ternary relations among drug-gene-mutation entities (Peng et al., 2017). Verga et al. (2018) aligned PubMed abstracts with the CTD database (http://ctdbase.org/) to create the CTD corpus.

Scientific literature. Most relation types in scientific text are related to hypotheses, methods and experimental results. Jain et al. (2020) published SciREx as the first scientific document-level n-ary relation dataset, which deals with relations among four types of entities: dataset, method, metric and task.

2.1.4 Evaluation Metrics

We describe here various metrics commonly used for evaluating the performance of relation extraction models. Since RE is usually treated as a classification task, most of the metrics are classification metrics. We note that these measurements are performed under the assumption that named entities are given. Before getting into the formulation of these metrics, we first introduce the statistical terms used to compute them. True positives (TP) and false positives (FP) refer to the numbers of instances that are classified by a model as sharing a relation: TP is the number of such instances that are correctly classified, and FP the number that are incorrect. On the other hand, true negatives (TN) and false negatives (FN) refer to the numbers of instances that are classified by a model as not sharing a relation: TN refers to the correctly classified ones and FN to the incorrect ones.


Accuracy. This is a primary metric to evaluate a classification model, defined in Eq. (2.1). Essentially, it measures the proportion of correctly classified instances among all instances in the test set. All categories are treated equally, even negative ones.

\[ \text{Accuracy} = \frac{TP + TN}{N} \tag{2.1} \]

where N = TP + TN + FN + FP is the total number of instances.

Precision / Recall / F-score. These are also primary metrics for classification, usually used when true negatives are not of interest. Precision (P) measures the percentage of true positives (TP) out of the number of instances predicted as positive (TP + FP). Recall (R) is the proportion of correctly predicted positives (TP) divided by the number of possible positives (TP + FN).

The $F_\beta$-score is the weighted harmonic mean of precision (P) and recall (R) (van Rijsbergen, 1979). The popular setting is β = 1, in which precision and recall are equally weighted.

\[ P = \frac{TP}{TP + FP}; \quad R = \frac{TP}{TP + FN}; \quad F_\beta = (1 + \beta^2) \times \frac{P \times R}{\beta^2 \times P + R} \tag{2.2} \]

The above metrics are for binary classification, i.e., the number of categories is equal to two. In a multi-class setting, we would like a summary over all relation categories. We can compute the micro-average or the macro-average as the average F-score across categories. The micro-average takes into account instances from all relation types, while the macro-average computes the mean of the per-category metrics,

\[ P_{micro} = \frac{\sum_c TP_c}{\sum_c (TP_c + FP_c)}; \quad R_{micro} = \frac{\sum_c TP_c}{\sum_c (TP_c + FN_c)}; \tag{2.3} \]

\[ P_{macro} = \frac{1}{C}\sum_c P_c; \quad R_{macro} = \frac{1}{C}\sum_c R_c; \tag{2.4} \]

where C is the number of relation categories and c is a relation category. For imbalanced datasets, the macro-average is lower than the micro-average if instances of the least populated relation types are poorly classified. In contrast, a micro-average lower than the macro-average indicates that instances of the most populated types are incorrectly classified.
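To make these definitions concrete, the sketch below (an illustrative example, not code from the thesis) computes accuracy, Eq. (2.1), per-class precision/recall/F-score, Eq. (2.2), and the micro and macro averages of Eqs. (2.3)-(2.4) from gold and predicted label lists; scikit-learn's precision_recall_fscore_support offers equivalent functionality.

```python
from collections import Counter

def prf(tp, fp, fn, beta=1.0):
    # Precision, recall and F-beta from raw counts, as in Eq. (2.2).
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r) if p + r else 0.0
    return p, r, f

def micro_macro(gold, pred, beta=1.0):
    # Per-class TP/FP/FN counts, then micro (Eq. 2.3) and macro (Eq. 2.4) averages.
    classes = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()), beta)
    per_class = [prf(tp[c], fp[c], fn[c], beta) for c in classes]
    macro = tuple(sum(vals) / len(classes) for vals in zip(*per_class))
    return micro, macro

gold = ["place_of_birth", "no_relation", "no_relation", "employee_of"]
pred = ["place_of_birth", "no_relation", "employee_of", "no_relation"]
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)   # Eq. (2.1)
print(accuracy, micro_macro(gold, pred))
```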

When relation extraction is framed as a clustering task (a cluster corresponding to a particular relation type), there is no named relation type for each category/cluster, and the number of predicted relation categories is usually different from the number of annotated relation categories. In some sense, each method may define its own relation categories/clusters. In this case, we evaluate these methods using standard clustering metrics: B-cubed, V-measure and the adjusted Rand index. We denote by $x, x'$ two instances in a corpus; $l(x)$ returns the gold label (relation type) of instance $x$, and $m(x)$ refers to the cluster, predicted by a model, that instance $x$ belongs to.

B-cubed. B-cubed ($B^3$) is the harmonic mean of pairwise precision and recall (Bagga and Baldwin, 1998). The pairwise precision and recall here are slightly different from those above:

\[ P_{B^3} = \mathbb{E}_{x,x'}\left[P\big(l(x) = l(x') \mid m(x) = m(x')\big)\right]; \tag{2.5} \]
\[ R_{B^3} = \mathbb{E}_{x,x'}\left[P\big(m(x) = m(x') \mid l(x) = l(x')\big)\right]; \tag{2.6} \]
\[ F_{B^3} = (1 + \beta^2) \times \frac{P_{B^3} \times R_{B^3}}{\beta^2 \times P_{B^3} + R_{B^3}}; \tag{2.7} \]

where $P(\cdot \mid \cdot)$ is the conditional probability. $P_{B^3}$ measures whether instances in a cluster share the same gold label, while $R_{B^3}$ estimates the probability that instances sharing a label belong to the same cluster. $F_{B^3}$ is equal to $F_\beta(P_{B^3}, R_{B^3})$; typically we use β = 1.
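As an illustration (not from the thesis), the sketch below uses the commonly adopted per-instance formulation of B-cubed, which averages, for each instance, the fraction of its cluster that shares its gold label (precision) and the fraction of its gold class that shares its cluster (recall); this may differ slightly in weighting from the pairwise expectations written in Eqs. (2.5)-(2.6), but captures the same idea.

```python
from collections import Counter

def b_cubed(gold, pred, beta=1.0):
    # gold[i]: gold relation label of instance i; pred[i]: cluster id assigned by a model.
    n = len(gold)
    cluster_label = Counter(zip(pred, gold))   # joint (cluster, label) counts
    cluster_size = Counter(pred)
    label_size = Counter(gold)
    precision = recall = 0.0
    for i in range(n):
        same_both = cluster_label[(pred[i], gold[i])]
        precision += same_both / cluster_size[pred[i]]   # cf. Eq. (2.5)
        recall += same_both / label_size[gold[i]]        # cf. Eq. (2.6)
    p, r = precision / n, recall / n
    f = (1 + beta**2) * p * r / (beta**2 * p + r) if p + r else 0.0   # Eq. (2.7)
    return p, r, f

# Toy example: two gold relations that a model merges into a single cluster.
print(b_cubed(["born_in", "born_in", "employee_of"], [0, 0, 0]))
```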

Homogeneity / Completeness / V-measure. Homogeneity (Homo) measures whether each cluster contains only data points of a single class, while completeness (Comp) measures whether all data points of a single class are assigned to a single cluster. The V-measure (Rosenberg and Hirschberg, 2007) balances homogeneity and completeness.

\[ \mathrm{Homo} = 1 - \frac{H\big(l(x) \mid m(x)\big)}{H\big(l(x)\big)}; \tag{2.8} \]
\[ \mathrm{Comp} = 1 - \frac{H\big(m(x) \mid l(x)\big)}{H\big(m(x)\big)}; \tag{2.9} \]
\[ V = (1 + \beta^2) \times \frac{\mathrm{Homo} \times \mathrm{Comp}}{\beta^2 \times \mathrm{Homo} + \mathrm{Comp}}; \tag{2.10} \]

where $H(\cdot)$ is the entropy and $H(\cdot \mid \cdot)$ is the conditional entropy.

V is equal to $F_\beta(\mathrm{Homo}, \mathrm{Comp})$; typically we use β = 1.
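A small, self-contained sketch of Eqs. (2.8)-(2.10) using empirical entropies is given below; it is an assumed illustration rather than the thesis implementation, and scikit-learn's homogeneity_completeness_v_measure provides an equivalent off-the-shelf computation.

```python
from collections import Counter
from math import log

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def conditional_entropy(a, b):
    # Empirical H(a | b) from joint and marginal counts.
    n = len(a)
    joint = Counter(zip(a, b))
    marginal_b = Counter(b)
    return -sum((c / n) * log(c / marginal_b[bv]) for (av, bv), c in joint.items())

def v_measure(gold, pred, beta=1.0):
    h_gold, h_pred = entropy(gold), entropy(pred)
    homo = 1.0 - conditional_entropy(gold, pred) / h_gold if h_gold else 1.0   # Eq. (2.8)
    comp = 1.0 - conditional_entropy(pred, gold) / h_pred if h_pred else 1.0   # Eq. (2.9)
    v = (1 + beta**2) * homo * comp / (beta**2 * homo + comp) if homo + comp else 0.0  # Eq. (2.10)
    return homo, comp, v

# A perfect clustering gives homogeneity = completeness = V = 1.
print(v_measure(["born_in", "born_in", "employee_of"], [0, 0, 1]))
```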

Adjusted Rand Index (ARI). The adjusted Rand index (ARI) (Hubert and Arabie, 1985) is the normalised Rand index, which usually ranges between 0 and 1 but can be negative if the index is less than the expected index (Wagner and Wagner, 2007). The Rand index was

motivated by classification problems, counting correctly classified pairs of instances.

\[ \mathrm{RI} = \mathbb{E}_{x,x'}\, P\big(m(x) = m(x') \Leftrightarrow l(x) = l(x')\big) \tag{2.11} \]
\[ \mathrm{ARI} = \frac{\mathrm{RI} - \text{Expected RI}}{\max(\mathrm{RI}) - \text{Expected RI}} \tag{2.12} \]
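For completeness, a hypothetical sketch: the unadjusted Rand index of Eq. (2.11) can be computed directly over instance pairs, while the chance correction of Eq. (2.12) is easiest to obtain from scikit-learn's adjusted_rand_score.

```python
from itertools import combinations
from sklearn.metrics import adjusted_rand_score  # assumes scikit-learn is installed

def rand_index(gold, pred):
    # Eq. (2.11): fraction of instance pairs on which the clustering and the gold
    # labels agree (both place the pair together, or both keep it apart).
    pairs = list(combinations(range(len(gold)), 2))
    agree = sum((pred[i] == pred[j]) == (gold[i] == gold[j]) for i, j in pairs)
    return agree / len(pairs)

gold = ["born_in", "born_in", "employee_of", "employee_of"]
pred = [0, 0, 0, 1]
print(rand_index(gold, pred))           # unadjusted Rand index
print(adjusted_rand_score(gold, pred))  # chance-corrected ARI, cf. Eq. (2.12)
```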

2.2 Related Work on Relation Extraction

In general, researchers have worked on different paradigms of relation extraction (RE). We will present the progress and development of RE in this section. We divide existing approaches to automatic relation extraction into the following categories based on their core computational techniques: pattern-based methods (early systems), classical machine learning methods and recent neural models. We first summarise the former two groups of approaches and then give a brief overview of how they were superseded by neural models. Since this work targets sentence-level RE, we do not include paragraph- or document-level approaches, as they involve dealing with cross-sentence connections and document structures.

2.2.1 Early Systems and Classical Machine Learning

The pioneering explorations of RE relied on text analysis tools to identify linguistic features in text. These features were then manually selected based on their usefulness to the task of RE. These approaches can be grouped into the following categories:

Pattern-based methods typically employ patterns as ways of searching in text for relation instances.

Feature-based methods use a set of features carefully selected from textual analysis such as part-of-speech tags and syntactic structure.

Kernel methods represent relation instances in linguistic structures, e.g., sequences, dependency trees, and syntactic trees. The structured instances are then transformed into a high-dimensional latent space to capture the similarity between them.

Probabilistic graphical models (PGMs), also known as symbolic methods, represent dependencies between words, entities and relations in the form of an acyclic graph, and then infer the correct relations from the graph.

Clustering methods induce relation categories based on example representations which can be obtained via feature engineering or representation learning.

Pattern-based systems were the earliest RE methods, using hand-designed patterns to match relation instances (Huffman, 1995; Califf and Mooney, 1997; Brin, 1998; Agichtein and Gravano, 2000). Lin and Pantel (2001) generated inference rules, motivated by the distributional hypothesis on words (Harris, 1954), including dependency paths connecting two entities. Later studies worked on semantically organising and filtering relation patterns, which can then be utilised to automatically generate new patterns (Nakashole et al., 2012; Jiang et al., 2017; Eichler et al., 2017). Although pattern-based methods can achieve high precision, their recall is considerably low, as it is impossible to cover all language expressions. Machine learning (ML) approaches were then adopted to bring better coverage while requiring less human intervention.

Feature-based methods were, among classical ML approaches, the first used for RE. These methods rely on lexical, syntactic and semantic information about an entity pair and their corresponding context. The features include context words, entity mentions, part-of-speech (POS) tags, base phrase chunking, the syntactic parse tree, and the dependency tree (Kambhatla, 2004; Zhou et al., 2005). The tree-based features were either used separately (Miller et al., 2000; Nguyen et al., 2007) or were selected jointly using feature selection techniques (Kambhatla, 2004; Zhou et al., 2005; Jiang and Zhai, 2007). Other linguistic features were also considered for RE. Word clusters were used given their ability to group similar words into the same cluster (Boschee et al., 2005; Chan and Roth, 2010; Sun et al., 2011). Besides, coreference has been shown to benefit RE, as co-referring mentions share no other semantic relations (Chan and Roth, 2010). Since relations are associations between entities, the construction of relations also depends on entity information. Several attempts introduced such information, including semantic entity categories (Roth and Yih, 2007; Zhou et al., 2005) and entity statistics from the Web (Rosenfeld and Feldman, 2007) and from Wikipedia (Chan and Roth, 2010). Others also studied the dependencies between relation categories (Chan and Roth, 2010). These features were transformed into an n-dimensional vector and then passed into relation classifiers.

Feature-based methods require carefully designed features to perform well on a particular corpus. To reduce this limitation, kernel methods were proposed, which can exploit the original representation of a given instance. Kernel methods compute similarities between representations via kernel functions operating on subsequences, entire sequences, and grammatical structures such as constituent trees and dependency trees.

In this group of approaches, the support vector machine (SVM; Cortes and Vapnik, 1995) is the most well-studied classifier. A typical group of kernel-based methods for RE is sequence kernels, which measure similar subsequences between instances. Mooney and Bunescu (2006), inspired by Lodhi et al. (2002), used the types of subsequence patterns that are typically informative in RE: words before, between and after the relation arguments. Sentence structures can provide important information regarding relations between entities; hence, tree-based kernels were proposed for this reason. Zelenko et al. (2003) proposed a tree-based kernel operating on base phrase chunking information for RE. Culotta and Sorensen (2004) followed this idea to compute kernel scores between dependency trees augmented with features for each node, such as POS tags. Both previous methods operate on the subtrees containing the two entities, while Bunescu and Mooney (2005) argued that the shortest dependency path (SDP) between two entities provides the most relevant information about their relation. They showed that using the SDP yielded substantially higher performance than previous subtree approaches. Zhang et al. (2006) explored multiple tree representations constructed from the constituent tree structure of a sentence for RE using a convolutional tree kernel. Qian et al. (2007) extended this work by augmenting the tree with entity features. Another improvement was the work of Zhou et al. (2007), which dynamically includes necessary context information by expanding the tree span. Several studies added richer features to the tree or modified the kernel functions (Qian et al., 2008; Khayyamian et al., 2009; Sun et al., 2014). Zhao and Grishman (2005) and Nguyen et al. (2009) combined different information such as subsequences, dependency paths and entity features using different kernels.

Probabilistic graphical models identify the dependencies between entities, text and relations in the form of directed acyclic graphs for relation inference. The first generative probabilistic models were proposed by Yao et al. (2011) for unsupervised learning. The authors introduced three topic models, namely Rel-LDA, Rel-LDA1 and Type-LDA, extended from standard latent Dirichlet allocation (LDA; Blei et al., 2003), in which topics correspond to semantic relation categories. Rel-LDA is a simple extension using only the textual surface of the entity mentions and the shortest dependency path between them. Rel-LDA1 adds more linguistic features to the set: the trigger (content words from the shortest dependency path), lexical and part-of-speech (POS) patterns of the intervening context, and the semantic and syntactic categories of the named entities. The last model, Type-LDA, considers the two entity semantic categories as a constraint to cluster relations. Yao et al. (2012) later improved on the previous LDA models by disambiguating the senses of the context between entities. Later on, Lopez de Lacalle and Lapata (2013) used logical rules to incorporate global relation constraints into the LDA.

Clustering approaches were mostly proposed for unsupervised relation extraction, where no labelled data is provided for training. Since no labels exist, relation types are not predefined and the clusters may reveal meanings different from expectation. We note that the aforementioned LDA approaches are also clustering methods. Hasegawa et al. (2004) made the first proposal for using a clustering method for RE.
The authors collected co-occurring named entities and the context words intervening between them from large corpora. The context is then clustered based on cosine similarity to form relations. However, the number of clusters is heuristically defined. To address this, Chen et al. (2005) proposed stability-based criteria to automatically decide the optimal number of clusters. Since the dependency structure of text has been shown to be effective for RE, Yan et al. (2009) proposed a clustering method based on both the context words and the dependency paths connecting two entities. More recently, Elsahar et al. (2017) applied Ward's hierarchical agglomerative clustering (Ward Jr, 1963) to sentence representations obtained from word embeddings and entity types.

Feature-based and kernel-based methods lack generalisation to other domains, or even to different datasets in the same domain, due to their reliance on feature engineering and kernel function design, while probabilistic graphical methods are limited in model capacity, e.g., models have fixed complexity, and dependencies between variables must be carefully designed to avoid correlated variables.

2.2.2 Neural Networks and Deep Learning

Neural networks have shown higher performance than classical machine learning across a wide variety of natural language processing (NLP) tasks (Collobert et al., 2011). Neural models can automatically extract informative features from the input data, which significantly reduces the need for the manual feature engineering of previous systems. These models take advantage of distributional embeddings and representation learning techniques, replacing the hand-crafted features of classical machine learning methods with rich learned representations. Many ideas proposed in the past were also adapted into the design of neural models to enhance text representation. Since our work focuses on neural relation extraction, we present in the following an introduction to neural networks, their inspiration as well as training techniques, before describing the particular components used for building a neural relation extractor in the next sections.

2.2.2.1 Neural networks

Artificial neural networks (ANN), or neural networks (NN) for short, are information processing models that were originally inspired by the operation of biological neural systems (Mcculloch and Pitts, 1943). Neural networks learn, memorise and generalise from observations, simulating the human mind. We often use directed acyclic graphs to represent a neural network, where a node is a neuron and a directed edge indicates the information flow. Following developments in cognitive science, different types of neural networks have been proposed over the years. We present the intuition and fundamental formulation of neural architectures as well as their training procedure in the following parts. We refer the reader to the deep learning book by Goodfellow et al. (2016) for more details.

Figure 2.2: A neuron or a perceptron visualisation, adapted from CS231n (2020). (a) Classic form; (b) Vector form.

Neuron An artificial neuron, also known as a perceptron, is the core processing unit in neural networks. Figure 2.2 illustrates the computation of a neuron. The neuron first takes $n$ scalar inputs $x_1,\ldots,x_n \in \mathbb{R}$, multiplies them with their corresponding weights $w_1,\ldots,w_n \in \mathbb{R}$ and adds a bias $b \in \mathbb{R}$. The resulting value is then passed through an activation function $f$ that controls the value range of the output $z \in \mathbb{R}$. We can denote the input and weights using the vector notation $x \in \mathbb{R}^{n}$ and $w \in \mathbb{R}^{n}$, respectively. The computation is given in Eq. (2.13).

$$z = f\Big(\sum_{i=1}^{n} w_i x_i + b\Big) = f(w^{\top}x + b). \qquad (2.13)$$

The most common neural layer is the fully-connected layer, in which $m$ neurons share the same input without internal connections.

Figure 2.3: One-layer neural network. Figure 2.4: Two-layer neural network.

A one-layer neural network can be computed as follows:

$$z = f(Wx + b), \qquad (2.14)$$

where $W \in \mathbb{R}^{m \times n}$ is a weight matrix and $b \in \mathbb{R}^{m}$ is a bias vector; we call $W$ and $b$ the parameters of a one-layer neural network, while $x \in \mathbb{R}^{n}$ and $z \in \mathbb{R}^{m}$ are the input and output of the layer, respectively. In this equation, $f$ is an element-wise activation function, $f(x) = f([x_1,\ldots,x_n]) = [f(x_1),\ldots,f(x_n)]$. Figure 2.3 illustrates a one-layer neural network in vector form. One-layer neural networks cannot solve linearly inseparable data, as pointed out by the famous example of the exclusive-or (XOR) operator by Minsky and Papert (1969). The authors suggested that inserting one layer between the input and output layers of a one-layer network can address this shortcoming (Figure 2.4). The additional layer is known as the hidden layer, which can project the data into another vector space where linear separability is possible. The computation can be written as below:

$$z = f_2(W_2 f_1(W_1 x + b_1) + b_2), \qquad (2.15)$$

where $W_1 \in \mathbb{R}^{d \times n}, b_1 \in \mathbb{R}^{d}$ are the parameters of the hidden layer and $W_2 \in \mathbb{R}^{m \times d}, b_2 \in \mathbb{R}^{m}$ are the parameters of the output layer, while $x \in \mathbb{R}^{n}$ and $z \in \mathbb{R}^{m}$ remain the input and output. In theory, two-layer neural networks can approximate any function if we increase the number of hidden units and cover all examples (Cybenko, 1989). In practice, it is impossible to include all examples: the network can fit any training data but may not generalise to new data points by reasoning from the observed data. Besides, the advantage of using more hidden layers is that features can be re-used, and as we go deeper we can find more abstract features (Bengio et al., 2013), e.g., deeper layers capture semantic meanings while initial layers capture lexical and syntactic features.
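As a concrete illustration of Eqs. (2.13)–(2.15), the following is a minimal sketch of a two-layer network forward pass in NumPy; the layer sizes and the choice of tanh/identity activations are illustrative assumptions rather than part of the original formulation.

```python
import numpy as np

def layer(x, W, b, f):
    """One fully-connected layer: z = f(Wx + b), Eq. (2.14)."""
    return f(W @ x + b)

rng = np.random.default_rng(0)
n, d, m = 4, 8, 3                                # input, hidden and output sizes (arbitrary)
W1, b1 = rng.normal(size=(d, n)), np.zeros(d)    # hidden-layer parameters
W2, b2 = rng.normal(size=(m, d)), np.zeros(m)    # output-layer parameters

x = rng.normal(size=n)                           # an input vector
h = layer(x, W1, b1, np.tanh)                    # hidden representation, f1 = tanh
z = layer(h, W2, b2, lambda a: a)                # output, f2 = identity (Eq. 2.15)
print(z.shape)                                   # (3,)
```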

Classification For multi-class classification, the output layer typically maps the previous output vector to a vector $z \in \mathbb{R}^{C}$, known as the logits, where $C$ refers to the number of categories. We compute the probability of a category $c \in \{1,\ldots,C\}$ as follows:

$$p(c \mid x) = \mathrm{softmax}_c(z) = \frac{\exp(z_c)}{\sum_{c'}\exp(z_{c'})}, \qquad (2.16)$$

where $z_c$ is the $c$-th value of the logits $z$, referring to the unnormalised score of category $c$. We typically initialise $W$ and $b$ randomly from a distribution (commonly a Gaussian distribution). Their values are updated during training, a process known as model optimisation.

Loss function To optimise a model, we need to define criteria that estimate how well the model performs on a target task. Each criterion can be considered as an objective, which can be either minimised or maximised. In the context of neural networks, we typically compute errors, so we refer to the objective as a loss function or cost function.

Let us assume that we have a dataset of $n$ examples $D = \{(x_1,c_1),\ldots,(x_n,c_n)\}$, where $x_i$ is the $i$-th example and $c_i$ is the category of this example. We denote by $p(c_i \mid x_i)$ the relation probability of example $i$ predicted by a model, and by $\theta$ the parameter set of the model (such as $W$ and $b$). The loss function $L(\theta)$ over the entire dataset is computed by averaging the losses of every example. The commonly-used loss function for classification is the cross-entropy, which is the negative log likelihood of the correct category:

$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n} -\log p(c_i \mid x_i). \qquad (2.17)$$
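The following is a minimal NumPy sketch of Eqs. (2.16) and (2.17), computing the softmax over logits and the averaged cross-entropy loss; the toy logits and labels are made up for illustration.

```python
import numpy as np

def softmax(z):
    """Eq. (2.16): normalise logits into a probability distribution."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # subtract the max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """Eq. (2.17): average negative log-likelihood of the correct categories."""
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

logits = np.array([[2.0, 0.5, -1.0],   # one row of logits per example, C = 3 categories
                   [0.1, 1.2,  0.3]])
labels = np.array([0, 1])              # gold category indices
print(cross_entropy(logits, labels))
```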

Model Optimisation After computing the loss, we update the parameters of the model accordingly using the back-propagation algorithm (Rumelhart et al., 1985). The main idea is to back-propagate errors from the output layer to the input layer using the chain rule. Given a scalar input $x$, let $z = g(x)$ and $y = f(z)$; the gradient with respect to the input is:

$$\frac{dy}{dx} = \frac{dy}{dz}\frac{dz}{dx}.$$

If the input is a vector, we have the following computation:

$$\frac{\partial y}{\partial x} = \frac{\partial z}{\partial x}\frac{\partial y}{\partial z},$$

where for any vectors $u \in \mathbb{R}^{m}$ and $v \in \mathbb{R}^{n}$, we have $\frac{\partial u}{\partial v} \in \mathbb{R}^{n \times m}$. We recall the one-layer neural network in Eq. (2.14):

$$a = Wx + b; \quad z = f(a),$$

and the loss $L(W; b)$. We compute the gradients with respect to $W$ and $b$ starting from the loss and moving backwards:

$$\frac{\partial L}{\partial W} = \frac{\partial a}{\partial W}\frac{\partial z}{\partial a}\frac{\partial L}{\partial z} = x^{\top}\frac{\partial z}{\partial a}\frac{\partial L}{\partial z}, \qquad (2.18)$$

$$\frac{\partial L}{\partial b} = \frac{\partial a}{\partial b}\frac{\partial z}{\partial a}\frac{\partial L}{\partial z} = \frac{\partial z}{\partial a}\frac{\partial L}{\partial z}. \qquad (2.19)$$

The process can be generalised to networks with multiple layers.

Next, we need to update the model parameters $\theta$ using the computed gradients. Let $\theta_t$ denote the parameter set at iteration $t$; we subtract the scaled partial derivative from it in order to update the parameters:

$$\theta_{t+1} = \theta_t - \eta\,\frac{\partial L}{\partial \theta_t}, \qquad (2.20)$$

where $\eta$ refers to the learning rate that scales the amount of update. This computation is known as gradient descent (GD), attributed to Cauchy (1847) but studied for non-linear optimisation problems by Curry (1944).
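To make the update rule concrete, here is a minimal sketch of one gradient-descent step for the one-layer network of Eq. (2.14) with a softmax output and cross-entropy loss; the gradients follow Eqs. (2.18)–(2.20), and the data, sizes and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, C, eta = 5, 3, 0.1                        # input size, number of classes, learning rate
W, b = rng.normal(size=(C, n)), np.zeros(C)

x = rng.normal(size=n)                       # a single training example
c = 1                                        # its gold category

# Forward pass: a = Wx + b, p = softmax(a), loss = -log p[c]
a = W @ x + b
p = np.exp(a - a.max()); p /= p.sum()
loss = -np.log(p[c])

# Backward pass: dL/da = p - onehot(c); dL/dW = dL/da x^T (Eq. 2.18); dL/db = dL/da (Eq. 2.19)
dL_da = p.copy(); dL_da[c] -= 1.0
dL_dW = np.outer(dL_da, x)
dL_db = dL_da

# Gradient-descent update, Eq. (2.20)
W -= eta * dL_dW
b -= eta * dL_db
```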

2.2.2.2 Neural relation extraction

Figure 2.5: Block diagram of a neural framework for relation extraction (RE): word representations, context-dependent word representations, relation candidate representation, and relation classifier.

Most current neural relation extraction follows the general framework illustrated in Figure 2.5. A textual input is first transformed into numerical representations based on the smallest meaningful units, words, by a word representation layer. To understand the correct meaning of words, we need to consider their context; hence, a context-dependent word representation layer is included to encode contextual information into each word. Recent work includes such contextual information directly in the word representations, namely contextual word representations, learned from large-scale training data (Peters et al., 2018; Devlin et al., 2019). A context-dependent word

representation block can be built upon the contextual word representations to obtain task-specific features. The context-dependent word representations are then used to form relation candidate representations before passing through a classifier. Although neural networks have shown their capability in extracting relations from text, future work could combine classical and symbolic methods with neural models to boost the development of the field. For instance, a recent proposal suggested that combining probabilistic models and neural approaches is a promising way to take advantage of both worlds (Bai and Ritter, 2019). Additionally, heuristic rules are also utilised to provide pseudo/weak labels for training neural models (Ratner et al., 2016), in an effort to reduce the cost of manual annotation. In the following sections, we go through the individual blocks that can be used for building a relation extraction model and the corresponding related work.
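To make the block diagram in Figure 2.5 concrete, the following is a schematic sketch of the framework as a single forward function; the component names and interfaces are our own illustrative assumptions rather than a prescribed implementation.

```python
def relation_extraction_forward(tokens, head_span, tail_span, model):
    """Schematic forward pass mirroring Figure 2.5 (hypothetical interfaces)."""
    # 1. Word representation layer: tokens -> vectors
    word_reprs = model.embed(tokens)
    # 2. Context-dependent word representations (e.g., a CNN, BiLSTM or transformer encoder)
    context_reprs = model.encode(word_reprs)
    # 3. Relation candidate representation built from the two entity spans
    candidate_repr = model.pool_pair(context_reprs, head_span, tail_span)
    # 4. Relation classifier producing a distribution over relation categories
    return model.classify(candidate_repr)
```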

2.3 Features for Relation Extraction

In the previous section, we discussed the general idea of neural models, which take input vectors $x$ and produce predictions. Textual data usually appear as a sequence of discrete symbols, although their underlying structures play an important role in conveying their meaning to readers. The sequence needs to be transformed into numerical data $x$ that reflect various linguistic properties of the text so that computers can process it. The transformation from text to real-valued vectors is called feature representation or feature extraction, and is done by a feature mapping function. Although neural networks reduce the need for feature engineering, a good set of core features still needs to be defined. These features can be learned automatically during training or obtained from pretrained models. Features for relation extraction (RE) can typically be divided into three main categories:

Linguistic features include lexical, syntactic and semantic features, such as the semantic categories of entities and the dependency path between two entities.

Pretrained features are automatically obtained from statistical approaches and pretrained models, and are usually word representations.

Features from external resources are typically obtained from dictionaries, ontologies, knowledge bases or other semantic inventories.

In the following, we describe typical linguistic features used in RE, followed

by a closer look at word representations. We then give a brief introduction to the external resources that can be used for RE.

2.3.1 Linguistic Features

Although neural networks have been shown to automatically extract useful features from large corpora, some syntactic and semantic features have been shown to be beneficial for neural RE. In the following, we introduce a list of common features for RE that will be used in this work, including syntactic and entity-related features. Finally, we describe how to encode these features so that they can be used in a neural network.

2.3.1.1 Categorical features

Word position in text. Absolute positions of words in a sequence are important information when processing text, because a change in word order can impact the entire meaning of the sequence. For example, the cat chased the mouse certainly differs from the mouse chased the cat.

Word position with regard to arguments. The relative distance of a word to the relation arguments may reflect its contribution to the association between the entities. In relation extraction especially, words that are closer to the entities probably carry more useful information regarding the relation between these entities (Zeng et al., 2014). Neural methods require explicit integration of such relative position information. Relative positions based on a particular grammatical structure were also proposed, such as relative positions in a dependency tree (Yang et al., 2016). Different from word positions, which are non-negative values, relative word positions can be negative (in the left context of an argument), positive (in the right context) or zero (words constituting the entities).

Position markers. Another indicator of the position of entities in text is the position markers proposed by Zhang and Wang (2015). The idea is to insert special position tokens such as “[e], [/e]” into the input sequence immediately before and after entity mentions, indicating the starts and ends of the mentions, respectively. For instance, we can modify the previous example as follows: “[e] Murat Kurnaz [/e], a [e] Turkish [/e] national who was born and grew up in [e] Germany [/e]”. We can distinguish the entities of the target candidate by using different markers that also indicate their direction, “[head], [/head], [tail], [/tail]” (a small sketch of this insertion is given below).

POS tags. Part-of-speech (POS) tagging assigns each word a syntactic category such as noun, verb, or adjective. POS tags enable a model to learn explicitly which words are more informative for the target task. For example, RE may focus more on the verbs between entities, while sentiment analysis may attend to adjectives that are likely to express sentiment.

Entity types are semantic categories of an entity, as mentioned in §2.1.1. Entity types can restrict the relation categories that a group of entities can hold, e.g., LOCATION and LOCATION cannot share the relation place of birth. Entity types can suggest coarse relation clusters, which we show in our experiments in Chapter 4.

The above-mentioned features are categorical items which often take the form of indicators. In neural models, we usually transform them into real-valued representations so that statistical models can operate on them. The transformation from categorical to real-valued representations (typically in the form of vectors) is known as embedding, performed by an embedding layer. An embedding layer can be viewed as a lookup table, where each category is mapped to a continuous vector that is usually randomly initialised (Collobert and Weston, 2008). Besides random initialisation, position embeddings, for instance, can be computed by a particular function (Vaswani et al., 2017; Wang et al., 2020).
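Below is a minimal sketch of inserting entity markers and computing relative word positions for a relation candidate; the token list, span indices and marker strings follow the conventions described above but are otherwise illustrative assumptions.

```python
def add_markers(tokens, head_span, tail_span):
    """Insert [head]...[/head] and [tail]...[/tail] around the two entity spans."""
    marked = []
    for i, tok in enumerate(tokens):
        if i == head_span[0]:
            marked.append("[head]")
        if i == tail_span[0]:
            marked.append("[tail]")
        marked.append(tok)
        if i == head_span[1]:
            marked.append("[/head]")
        if i == tail_span[1]:
            marked.append("[/tail]")
    return marked

def relative_positions(n_tokens, span):
    """Relative position of each word w.r.t. an entity span: negative to the left,
    zero inside the span, positive to the right."""
    start, end = span
    return [i - start if i < start else (0 if i <= end else i - end) for i in range(n_tokens)]

tokens = "Murat Kurnaz , a Turkish national born in Germany".split()
head, tail = (0, 1), (8, 8)            # token index spans of the two entities (inclusive)
print(add_markers(tokens, head, tail))
print(relative_positions(len(tokens), head))
```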

Figure 2.6: Part-of-speech tags (yellow rectangles) and dependency structure (red arcs) of the sentence “The smallest feline is a masterpiece.”

2.3.1.2 Structural features

Structural analyses of a sentence are also used to design neural architectures.

Dependency. Dependency grammar (see Jurafsky and H. Martin, 2019, page 300) defines the syntactic structure of a sentence as a tree, where the words in a sentence constitute nodes and binary grammatical relations between words correspond to directed edges. The root of a dependency tree is usually the main verb of the sentence. Each word except the root has exactly one parent in the tree. The direction of an edge generally goes from the head to the dependent word, e.g., from the nominal head “feline” to the determiner “the” in Figure 2.6. Essentially, a dependency relation is a directed, labelled arc from a head to its dependent word. A dependency structure with labelled arcs is called a typed dependency structure. Such head-dependent relations provide short

connections between entities that can approximate the semantic relationships between them.

Shortest Dependency Path (SDP). The SDP hypothesis, introduced by Bunescu and Mooney (2005), states that the shortest path connecting two entities along the dependency structure usually provides information regarding their relation. Other linguistic properties such as coreference and predicate-argument structures can also be considered for RE. With coreference, we can extract relations between the closer entity mentions instead of two distant entities, in order to reduce redundant context. On the other hand, predicate-argument structures essentially reveal direct relations between verbs and their arguments. The arguments may contain entity mentions, and the verbs can explicitly express the semantic relations between them. Since our experiments do not involve these structures, we do not present them in detail. However, we note that our graph-based model for enriching word representations can be generalised to any of these linguistic structures.
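As a small illustration of extracting the SDP, the sketch below treats the (undirected) dependency tree of the sentence in Figure 2.6 as a graph and queries the shortest path between two words; the edge list is transcribed from that figure, and the use of the networkx library is our own choice, not part of the thesis.

```python
import networkx as nx

# Dependency edges (head, dependent) for "The smallest feline is a masterpiece ."
edges = [("masterpiece", "feline"),   # nsubj
         ("feline", "The"),           # det
         ("feline", "smallest"),      # amod
         ("masterpiece", "is"),       # cop
         ("masterpiece", "a"),        # det
         ("masterpiece", ".")]        # punct

graph = nx.Graph(edges)               # undirected view of the dependency tree
sdp = nx.shortest_path(graph, source="smallest", target="a")
print(sdp)                            # ['smallest', 'feline', 'masterpiece', 'a']
```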

2.3.2 Word Representations

Word representations are the initial step of an NLP model. At the very beginning, words were represented as vectors of identifiers, such as the word's index in a vocabulary, with each dimension corresponding to a separate word. These vectors are known as one-hot vectors. A one-hot vector is a sparse vector whose dimension is equal to the number of words in the vocabulary; all values are equal to zero except at the word's index, which is equal to one, indicating the occurrence of the word.

$$v_{\text{cat}}^{\top} = [0,0,\ldots,0,1,0,\ldots,0,0,0]$$
$$v_{\text{dog}}^{\top} = [0,0,\ldots,0,0,0,\ldots,0,1,0]$$

These vectors have extremely high dimensions and they do not share any information between words. Distributional semantic models have been proposed in an effort to represent the semantic similarity between similar words, following the distributional hypothesis (Firth, 1935; Harris, 1954; Firth, 1957). A popular group of approaches assigns a class to each word so that similar words belong to the same class, known as clustering-based word representations. The most famous method in this group is Brown clustering, which clusters words bottom-up to build a semantic hierarchy (Brown et al., 1992). Another major approach to converting words into vectors was to use matrices that

contain co-occurrence information, namely vector space models. These approaches often consist of two stages: word matrix construction and factorisation. The first stage essentially constructs co-occurrence matrices, such as a word-context count matrix, and the second stage applies matrix factorisation algorithms to reduce the dimension of the co-occurrence matrix. Principal component analysis (PCA) and latent semantic analysis (LSA; Deerwester et al., 1990) are typical methods of this category. However, matrix factorisation is often computationally expensive. With the advent of neural networks, we can learn dense, low-dimensional continuous representations of words using large-scale unlabelled data (i.e., language modelling) or resource-rich task-specific corpora (e.g., machine translation). We group word representations into two main categories: static and contextual word representations. In static word representations, each word is represented by a single continuous vector regardless of the word's context, while in contextual word representations, each occurrence of a word in a different context is represented by a different embedding. We summarise the most influential work in the two categories. Methods for contextual word representations can be further categorised into two sets, namely feature-based and finetuning methods. Feature-based contextual word representations can be used similarly to static word representations, although we note that this type of contextual representation is not updated in downstream tasks. In contrast, finetuning methods update the parameters of pretrained models in downstream tasks.

2.3.2.1 Language models

Before going into the three types of word representations, we would like to introduce the concept of language models (LMs). A conventional language model computes the probability of a sentence belonging to a target language. Given a sentence of $n$ words

$s = (w_1,\ldots,w_n)$, a unidirectional LM computes:

$$p(s) = p(w_1)\prod_{i=2}^{n} p(w_i \mid w_1,\ldots,w_{i-1}), \qquad (2.21)$$

where $p(w_i \mid w_1,\ldots,w_{i-1})$ is the probability of the current word given its previous/left context. Considering the example “A dog is a man's best friend”, the probability of “best” is conditioned on the previous words, $p(\text{“best”} \mid \text{“A dog is a man's”})$. On the other hand, a bidirectional LM takes into account both the left and right contexts

to predict the current word, $p(w_i \mid w_1,\ldots,w_{i-1},w_{i+1},\ldots,w_n)$. Although an LM is defined as taking only the left context, a bidirectional model can also be considered an LM according to the pseudo-likelihood (Besag, 1975):

$$p(s) \approx \prod_{i=1}^{n} p(w_i \mid w_1,\ldots,w_{i-1},w_{i+1},\ldots,w_n). \qquad (2.22)$$

LMs implicitly encode different linguistic properties from large-scale textual data. In classical approaches, the probability of a word given its previous context is estimated by counting co-occurrences of n-grams. In neural models, the probability is computed by a softmax function over the vocabulary. However, the vocabulary size is relatively large, which makes this expensive to compute. One way to overcome this problem is to use a hierarchical softmax (Morin and Bengio, 2005), which reduces the complexity to the logarithm of the vocabulary size by encoding the vocabulary into a binary tree hierarchy. Another way is to use noise contrastive estimation (NCE), which distinguishes the current word from negative samples (Mnih and Teh, 2012).
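As a toy illustration of Eq. (2.21), the sketch below scores a sentence with a bigram language model whose conditional probabilities are made-up numbers; a real LM would estimate these from corpus counts or a neural network.

```python
import math

# Made-up bigram probabilities p(w_i | w_{i-1}), for illustration only
bigram = {("<s>", "a"): 0.4, ("a", "dog"): 0.2, ("dog", "barks"): 0.5, ("barks", "</s>"): 0.6}

def sentence_log_prob(words):
    """Chain-rule decomposition of Eq. (2.21), truncated to a bigram history."""
    tokens = ["<s>"] + words + ["</s>"]
    return sum(math.log(bigram[(prev, cur)]) for prev, cur in zip(tokens, tokens[1:]))

print(math.exp(sentence_log_prob(["a", "dog", "barks"])))  # p(s) = 0.4 * 0.2 * 0.5 * 0.6
```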

2.3.2.2 (Static) Word representations

Static word representations are distributed representations of each unique word in a corpus, in the form of real-valued vectors. Similar to categorical features, the transformation of words into real-valued representations is performed by a word embedding layer. The embedding layer is essentially a lookup table LT(·) which maps each element to a real-valued representation, typically a vector (Collobert and Weston, 2008). We refer to the resulting representations as word representations or word embeddings; the word static comes from the fact that these representations are context-independent, to distinguish them from contextual word representations. These representations can be trained on a target task, starting from random initialisation. Another way to initialise word representations is to use pretrained ones, which are typically learned with neural language models. Since text sequences usually have arbitrary lengths, we can reduce the difficulty of computing the probability of a word by considering its neighbouring context instead of the full history in Eq. (2.21):

$$p(w_i \mid w_1,\ldots,w_{i-1}) \approx p(w_i \mid w_{i-t+1},\ldots,w_{i-1}),$$

where $t$ is the context window size, typically from 3 to 10. Early work on using neural models to produce word representations is that of Bengio

et al. (2003). The authors first mapped every word in the context window to real-valued vectors $v_{i-t+1},\ldots,v_{i-1}$, then applied a tanh over the sum of these vectors in order to obtain a context vector $x$. The context vector is then projected into a nonlinear vector space to estimate the probability of the $i$-th word. The network is trained by maximising the log-likelihood of the correct word, i.e., minimising the cross-entropy loss. Several subsequent models followed this idea using different neural architectures and various linguistic properties, e.g., document context and semantic hierarchies (Kombrink et al., 2011; Huang et al., 2012).

Distributed representations became popular with the introduction of word2vec (Mikolov et al., 2013a,b). The idea is similar to the work of Bengio et al. (2003) in that the probability is computed from the surrounding context. The difference is that word2vec does not include a nonlinear function, which reduces the models' complexity without sacrificing performance. Word2vec consists of two different but related models: Continuous Bag-of-Words (CBOW) and Skip-gram. CBOW predicts the current word using its context in a small window, whereas Skip-gram aims to predict the context given the current word. A number of studies used word2vec in RE (Nguyen and Grishman, 2015b; dos Santos et al., 2015), showing improvements over previous clustering-based and count-based word representations. Another well-known approach for static word representations is GloVe (Pennington et al., 2014), a log-bilinear regression model which combines count-based and context methods. The approach first aggregates word co-occurrence statistics from a corpus, which are then used to train word representations by forcing the dot products of word vectors to be close to the logarithm of their co-occurrence counts.

Word representations are typically trained with a fixed-size vocabulary from the training corpora, hence out-of-vocabulary words (OOVs) do not have pretrained embeddings. We usually initialise a single embedding vector randomly to represent unknown words (UNK) and use it for all OOVs; thus, different OOVs cannot be distinguished. One way to address this limitation is to construct word embeddings from character embeddings, used in conjunction with pretrained word embeddings. The intuition is that character n-grams can provide partial meaning of a word, e.g., bi-gram prefixes such as “un-, in-, im-” often indicate “not”, while bi-gram suffixes such as “-ee, -er/or” may indicate “person”. Word embeddings can be composed from character embeddings using neural networks (Dos Santos and Zadrozny, 2014; Ballesteros et al., 2015).

Following the success of word representations, a line of research has focused on encoding linguistic information into word representations. Levy and Goldberg (2014) generalised the Skip-gram model to contexts derived from dependency grammar. In particular, instead of predicting the surrounding words in a fixed-size window, the authors predict the directly-connected words on the dependency paths. The later approach of Vashishth et al. (2019a) incorporates syntactic dependencies and semantic hierarchies into word embeddings using a graph-based language model. The above methods encode additional information into word representations by training from scratch. In contrast, retrofitting methods modify pretrained static word representations to incorporate lexical and linguistic information.
The first such approach was proposed by Faruqui et al. (2015), which minimises the Euclidean distance between the representations of similar words. A range of subsequent models introduced new information for retrofitting: antonyms (Mrkšić et al., 2016), sense-specific information (Ettinger et al., 2016), medical term ontologies (Yu et al., 2016), knowledge bases (Lengerich et al., 2018), and paraphrases (Shi et al., 2019).
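The following is a minimal sketch of a static embedding lookup table with a shared UNK vector for out-of-vocabulary words, as described above; the vocabulary, dimensionality and random initialisation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"<unk>": 0, "cat": 1, "dog": 2, "chased": 3, "the": 4}
dim = 50
embeddings = rng.normal(scale=0.1, size=(len(vocab), dim))  # one row (vector) per word

def lookup(words):
    """Map each word to its vector; all OOV words share the single <unk> vector."""
    ids = [vocab.get(w, vocab["<unk>"]) for w in words]
    return embeddings[ids]

x = lookup(["the", "platypus", "chased", "the", "dog"])  # "platypus" falls back to <unk>
print(x.shape)  # (5, 50)
```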

2.3.2.3 Contextual word representations

Static word representations have a fundamental problem: the same word in different contexts is represented by a single vector. For this reason, static word representations tend to encode highly frequent contexts (popular senses), while rare contexts may be missed. Previous work attempted to address this problem by embedding individual word senses, which results in multiple vectors for each word (Neelakantan et al., 2014; Chen et al., 2014; Iacobacci et al., 2015). However, using sense vectors requires an additional word sense disambiguation or induction step, which may introduce noise. Contextual word representations were proposed to address this limitation; they aim at producing dynamic representations for each word conditioned on the word's context. Contextual word representations are often obtained from models pretrained on large corpora, either labelled or unlabelled. Depending on how contextual word representations are used in downstream tasks, we separate the existing work into two main categories: feature-based and finetuning methods.

Feature-based approaches

Feature-based approaches share a similar procedure with static word representations, but a word in different contexts is mapped to different embeddings. McCann et al. (2017) proposed contextual word vectors (CoVe), which are extracted from the pretrained LSTM encoder of a sequence-to-sequence machine translation model. The resulting word vectors improve the performance of several NLP tasks such as text classification and

question answering. CoVe, however, requires labelled corpora (i.e., machine translation data) to train the encoder. It also suffers from out-of-vocabulary problems when using only the word vocabulary from GloVe. To address these two problems, Peters et al. (2018) introduced embeddings from language models (ELMo). As the name suggests, ELMo is extracted from the intermediate representations of a pretrained language model. While conventional language models typically use only the left context to predict a word, ELMo combines representations from two independent LMs, one for each direction. ELMo includes character embeddings in the first layer to deal with the out-of-vocabulary problem. Another attempt to capture words in context is the work of Akbik et al. (2018), who proposed contextual character-level word embeddings by modelling words and context as a sequence of characters. Their method constructs the representation of a word by combining the representations of the two characters located right before the word in the left-to-right and right-to-left directions.

Finetuning approaches

Finetuning approaches utilise an entire pretrained model in downstream tasks, updating it during training. The idea of finetuning an entire model is motivated by the success of using models pretrained on ImageNet for computer vision tasks. The idea was adapted to NLP by Dai and Le (2015), in which the pretrained models are language models. They tried two approaches to pretrain LSTMs: one performs language modelling by predicting the next word in a sequence, and the other uses a sequence auto-encoder which predicts the input sequence from the last encoder state. Recently, the rise of pretrained language models started from the superior performance on a wide variety of tasks shown in recent studies (Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2019; Radford et al., 2019). Howard and Ruder (2018) utilised an LSTM-based language model from Merity et al. (2018). On the other hand, Radford et al. (2018), Devlin et al. (2019) and Radford et al. (2019) used transformer architectures (Vaswani et al., 2017) to obtain word representations via language modelling. Radford et al. (2018) proposed OpenAI GPT, which is a uni-directional language model, i.e., the model predicts words from the left-to-right context. Radford et al. (2019) and Brown et al. (2020) later released the subsequent large-scale language models, namely GPT-2 and GPT-3 respectively, whose larger model sizes grant them higher accuracy compared to previous versions. GPT-2 and GPT-3 have successfully demonstrated

zero-shot and few-shot transfer learning, achieving promising results in these settings on a range of canonical NLP tasks. In contrast, the method of Devlin et al. (2019) makes use of bidirectional context, i.e., both the left and right context. The method first randomly replaces a word in a sentence with a special mask token [MASK] and uses the resulting sentence to predict the original word. The problem formulation is similar to a cloze test, and is known as masked language modelling (MLM). The approach also predicts, given two sentences as input, whether one sentence follows the other, namely next sentence prediction (NSP). Although these transformer-based LMs have achieved high performance on a wide range of NLP tasks, the huge number of parameters in BERT is a concern, leading to several approaches for compressing BERT with minimal performance loss. The work of Lan et al. (2019) aims at reducing the high number of parameters residing in the subword-level embeddings (Wu et al., 2016) by factorising the embeddings. A second way to improve parameter efficiency is to share parameters across layers. Other studies perform knowledge distillation and quantisation for parameter reduction. The knowledge distillation framework (Hinton et al., 2015) trains a smaller student model to reproduce the behaviour of a large (teacher) model. This is realised via a distillation loss, either by mimicking the probability distribution of the teacher (Sanh et al., 2019a) or by learning the activation outputs of intermediate layers (Sun et al., 2019c). Despite the superior performance brought by finetuning pretrained language models, static word representations are still used for lexical-semantic tasks (e.g., word analogy and word similarity) and are widely used in industry because of their lower computational complexity compared to contextual word representations.
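As an illustration of the MLM formulation, the snippet below queries a pretrained BERT model for a masked token; it assumes the Hugging Face transformers library and the bert-base-uncased checkpoint are available, and is only meant to show the cloze-style prediction, not any method from this thesis.

```python
from transformers import pipeline

# Fill-mask pipeline around a pretrained masked language model (assumed available)
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The smallest feline is a [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```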

2.3.3 External Resources

Web texts are abundantly available and contain a high number of relational facts. An arbitrary set of entities (e.g., a random pair of entities) can be mentioned multiple times in Web text. One assumption suggests that texts mentioning the same set of entities are likely to express the same semantic relationship between them, even if the relation categories are unknown (Baldini Soares et al., 2019). Such textual data can be used to pretrain RE models based on semantic similarity. Encyclopedic knowledge such as Wikipedia can provide useful information regarding entities and relation dependencies. Chan and Roth (2010) mapped entity mentions to their corresponding Wikipedia articles to create two features: (i) whether one mention occurs in the article of the other mention in a relation pair, and (ii) whether a parent-child relation

holds between two entities in the Wikipedia ontology. Later work also used information from Wikipedia as background knowledge in order to support the extraction of relations (Sorokin and Gurevych, 2017; Elsahar et al., 2018; Vashishth et al., 2018). Knowledge bases (KBs) are graph-structured representations where entities constitute graph nodes and edges correspond to relations between two entities. Entities are unique concepts, while relations are often grounded to a specific set. Semi-structured knowledge bases allow free relation labels, i.e., they are not limited to a particular relation set. However, the freedom of relation labels is also the limitation of semi-structured KBs, because the same relation expressed with different textual expressions might not be grouped. Existing knowledge bases that are commonly used for relation extraction include Freebase (Bollacker et al., 2008), Wikidata (Vrandečić and Krötzsch, 2014), YAGO 1–3 (Suchanek et al., 2007; Hoffart et al., 2013; Mahdisoltani et al., 2015) and DBPedia (Auer et al., 2007). There are other private KBs that are used internally by the companies that created them, such as the Google Knowledge Graph (Singhal, 2012). Knowledge bases are usually created following one of these procedures: (i) manually curated by a group of experts (curated), (ii) manually curated by the public/crowd (collaborative), or (iii) automatically extracted from semi-structured text such as infoboxes from Wikipedia. KBs constructed by manual curation typically have high accuracy; however, these approaches heavily depend on human experts, which makes them difficult to scale up. Automatic KBs constructed from semi-structured data often have high accuracy as well, e.g., Freebase has 99% accuracy based on manual evaluation over sampled facts (Bollacker et al., 2008). KBs are often used in distant supervision to automatically annotate text, which we will discuss in §2.7.3.1. A line of research has also focused on extending KBs based on existing relational facts, namely knowledge base completion or link prediction, ranging from vector space models (Socher et al., 2013; Bordes et al., 2013) to graph-based algorithms (Perozzi et al., 2014; Grover and Leskovec, 2016). KB completion methods can reduce the number of false negatives in distant supervision.

2.4 Neural Components for Relation Extraction

We now present the different neural components that have been used to encode sentential context. Convolutional neural networks extract local correlations over sub-sequences, which are typical aspects we look at when extracting features from text (§2.4.1). Sequential information is also important for text processing, which motivates the use of recurrent neural networks (§2.4.2). Although a piece of text is presented in sequential form, the sequence is constructed from underlying structures defined by the grammar. In NLP, we often consider such structures in the form of grammatical trees (constituent or dependency trees); recursive neural networks were proposed to operate on such grammatical trees, discussed in §2.4.2.3. Grammatical trees can also be viewed as directed acyclic graphs, and entities and relations can be represented in a graph structure as well; thus, graph neural networks are employed for RE to operate on these graphs (§2.4.3). Additionally, attention mechanisms were proposed to aggregate information from the entire sequence by attending to the parts most relevant to the task (§2.4.4). Attention mechanisms can also be used on their own, without building on sub-sequence, sequential or even graph-based models (§2.4.4.1). Recent work utilised pretrained models to improve RE by adding task-specific neural layers on top of these pretrained models (§2.4.5). Combinations of different neural architectures have also been proposed to take advantage of different views (§2.4.6). Although most of the following architectures were proposed for supervised learning, they can be used in any learning setting by modifying the final relation detection blocks and loss functions.

2.4.1 Convolutional Neural Networks

Convolutional neural networks (CNNs) are a special type of feed-forward neural network originally proposed for computer vision. The intuition behind CNNs comes from the visual cortex system (Hubel and Wiesel, 1959), whose neurons correspond to different visual properties such as edge detection, colour and orientation. The core of a CNN is the convolution layer (Conv), whose name comes from the mathematical operation between two matrices. CNNs were adapted for RE by operating on windows of contiguous words, capturing local context information (Figure 2.7). Zeng et al. (2014) proposed position features to capture information regarding the target entities in the sentence. These position features are the relative distances of a word to the two entities, which are mapped into continuous-valued vectors, also known as position embeddings. However, these embeddings were randomly initialised and left unchanged during training. Nguyen and Grishman (2015b) improved the position embeddings by finetuning them during training. The authors utilised multiple window sizes to gather information from subsequences of different lengths. They conducted experiments showing that CNNs are more robust compared to feature-based methods.

Figure 2.7: A convolutional neural network architecture for relation extraction by Zeng et al. (2017)

Zeng et al. (2015) separated the information from three segments of a sentence based on the positions of the relation arguments: the left, middle and right contexts. Rather than using relative positions, Zeng et al. (2017) used absolute word positions as input to a CNN.
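The sketch below shows, under our own simplifying assumptions about sizes and inputs, how word and relative-position embeddings can be concatenated and fed to a 1-D convolution with max-pooling, in the spirit of the CNN-based RE models discussed above; it is not a reimplementation of any specific paper.

```python
import torch
import torch.nn as nn

class CNNRelationEncoder(nn.Module):
    def __init__(self, vocab_size=1000, n_positions=101, n_relations=5,
                 word_dim=50, pos_dim=5, n_filters=100, window=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.pos_emb = nn.Embedding(n_positions, pos_dim)        # bucketed relative positions
        in_dim = word_dim + 2 * pos_dim
        self.conv = nn.Conv1d(in_dim, n_filters, kernel_size=window, padding=1)
        self.classifier = nn.Linear(n_filters, n_relations)

    def forward(self, words, pos_head, pos_tail):
        # Concatenate word and position embeddings: (batch, seq, in_dim)
        x = torch.cat([self.word_emb(words), self.pos_emb(pos_head), self.pos_emb(pos_tail)], dim=-1)
        h = torch.tanh(self.conv(x.transpose(1, 2)))             # (batch, filters, seq)
        pooled, _ = h.max(dim=-1)                                # max-pooling over the sequence
        return self.classifier(pooled)                           # relation logits

model = CNNRelationEncoder()
words = torch.randint(0, 1000, (2, 20))       # a toy batch of 2 sentences, 20 tokens each
pos = torch.randint(0, 101, (2, 20))          # toy relative-position buckets
print(model(words, pos, pos).shape)           # torch.Size([2, 5])
```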

2.4.2 Recurrent Neural Networks

As we comprehend language, we normally process it sequentially; some information from previous words remains and supports our understanding. Recurrent neural networks (RNNs; Elman, 1990) and their variants were proposed precisely to process each element in a sequence with access to previous information. Thus, RNNs are favoured by the NLP community due to their capability to deal with sequential data. The core of an RNN is a recurrent cell that connects to itself; this connection is known as a recurrent connection. As illustrated in Figure 2.8, the connection allows the recurrent cell to operate on the present element given information from previous ones.

Figure 2.8: A recurrent neural network and its unrolled visualisation (Olah, 2015)

Figure 2.9: Long short-term memory cell (Olah, 2015)

2.4.2.1 Vanilla recurrent neural networks

RNNs have been shown to capture associations between entities located at longer distances than CNNs (Zhang and Wang, 2015). The authors argued that the key information for RE can appear anywhere in the text; hence, a bidirectional RNN was used to model both the previous and following contexts.

2.4.2.2 Long short-term memory

However, recurrent neural networks tend to break down when the input is too long; this is due to gradients traversing the entire sequence during back-propagation. Gradients with respect to the weight matrix are repeatedly multiplied by themselves, leading to exponentially decreasing (vanishing) or increasing (exploding) gradients when the largest singular value of the weight matrix is less than or greater than 1, respectively. Exploding gradients can be mitigated by restricting the norm of the gradients to a particular range, a technique called gradient clipping. To partially address vanishing gradients, long short-term memory (LSTM) was proposed (Hochreiter and Schmidhuber, 1997). Long short-term memory simulates our memory process, preserving the most useful information in the long term, while the most recent information is kept in a different area where we can access it directly. In order to control the information, a number of gates are needed, as shown in Figure 2.9. The first gate, the input gate, decides which data should be read into the cell. The second gate, the forget gate, controls the contents from previous cells; this gate decides which information is still useful for further steps and removes the unused information. Lastly, we pass the current contents to the next cell through an output gate. Zhang et al. (2015) employed a bidirectional LSTM model to better represent words from their contexts. The model further splits the output of the LSTM into three parts by

entity locations: before, between and after the pair; then aggregates information from these parts to form a pair representation using a max-pooling operation.
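A minimal sketch of a bidirectional LSTM encoder with piecewise max-pooling over the before/between/after segments, under our own assumptions about sizes and segment boundaries; it mirrors the idea described above rather than any exact published model.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(1000, 50)
bilstm = nn.LSTM(input_size=50, hidden_size=64, bidirectional=True, batch_first=True)

words = torch.randint(0, 1000, (1, 12))            # one toy sentence of 12 tokens
outputs, _ = bilstm(emb(words))                    # (1, 12, 128) contextual representations

# Piecewise max-pooling: before / between / after the entity pair (boundaries are illustrative)
head_end, tail_start = 3, 8
pieces = [outputs[:, :head_end], outputs[:, head_end:tail_start], outputs[:, tail_start:]]
pair_repr = torch.cat([p.max(dim=1).values for p in pieces], dim=-1)
print(pair_repr.shape)                             # torch.Size([1, 384])
```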

2.4.2.3 Grammatical structures

As syntactic and dependency parse trees were shown to be beneficial for RE in the early systems, these grammatical structures have also been explored in neural models. Tree structures are more complicated than sequential order, so the easiest way to incorporate them into neural models is to transform them into sequences or features. In particular, shortest dependency paths (SDPs) have long been believed to carry more information regarding entity associations than other surrounding words. These paths can eliminate redundant words in text where the entities are located far away from each other. Xu et al. (2015) first applied an LSTM along the shortest dependency path (SDP) between entities, in which the SDP was treated as a sequence of words and dependency connections. The authors also included other linguistic features such as POS tags and a semantic lexical hierarchy for RE. Su et al. (2018) also used an LSTM to model the shortest dependency path between entities, but added other channels for additional information such as lexical and syntactic sequences and part of speech.

Figure 2.10: A recursive neural network (Ebrahimi and Dou, 2015)

Different from previous work, which used syntactic structures as features, another set of methods directly models tree structures in their neural architectures. Recursive neural networks (RecNNs) can serve as a compositional function to combine smaller textual units, i.e., words, into broader units such as phrases and sentences (Figure 2.10). Socher et al. (2012) presented a RecNN that learns compositional vectors for phrases and sentences from a constituent tree. Hashimoto et al. (2013) extended the previous work by explicitly weighting important phrases that contribute to RE. Shortest dependency paths (SDPs) were also utilised in Ebrahimi and Dou (2015). To retain the binary compositional function used in previous work on constituent trees, the authors combined dependent-head word pairs into unknown parent nodes, constituting new tree structures. Li et al. (2015) showed that recursive networks are more effective than recurrent networks in the case of long-distance entities. Instead of applying RecNNs on the SDP, Liu et al. (2015) modelled the subtrees attached to each word on the SDP as additional context to distinguish two similar paths that express different relations. The SDP was encoded using a CNN, showing the effectiveness of combining RecNNs and CNNs. Later on, Miwa and Bansal (2016) adapted a tree-based LSTM (Tai et al., 2015) that is able to operate on tree structures with arbitrary branching.

2.4.3 Graph Neural Networks

A tree can be considered a directed acyclic graph with a special single node named the root. Each node has a single parent, and the edges are directed is-child-of/is-parent-of relationships. A graph, more generally, is a collection of nodes and of edges connecting pairs of nodes with arbitrary types. An edge can be directed or undirected.

Figure 2.11: A graph convolutional neural network (GCN) (Fu et al., 2019). A node (filled in red) is updated using information from its adjacent neighbours (filled in blue). The process is repeated to include further information from nodes located two hops away (filled in yellow) from the target node.

Zhang et al. (2018) first adopted graph convolutional neural networks (GCNs; Kipf and Welling, 2017) to operate on the dependency tree of a sentence. Observing that using only the shortest dependency path (SDP) might lead to the loss of crucial information, the authors proposed a pruning strategy to include words relevant to the SDP. On the other hand, Vashishth et al. (2018) applied GCNs to the entire

dependency tree. Although these methods have demonstrated promising results, a grammatical graph obtained from an automatic syntactic tool is not always correct. Song et al. (2019) alleviated the grammatical errors using a graph LSTM on dependency forests. Another limitation is that methods relying on grammatical structures cannot be directly applied to low-resource languages, where textual analysis tools are not available. To avoid the use of external (i.e., syntactic) tools, many attempts have been made to construct heuristic graphs, showing promising results. One way to construct graphs without the need for textual analysis tools is to rely on shallow textual structures that can be easily obtained. An interesting trend is to operate on an entity-relation graph rather than a word structure, since graphs are natural representations of relational facts. We note that the proposed approaches in this trend often assume that the entity-relation graph is a fully-connected graph at the beginning. Christopoulou et al. (2018) represented an entity pair by aggregating edge representations of contextual entities. Zhu et al. (2019) proposed graph neural networks to incorporate edge information (potential relations) into node representations (entities). Sun et al. (2019a) introduced an entity-relation bipartite graph in which both entities and relations constitute nodes, and edges connect every relation node to its entity arguments. Chai et al. (2019) developed a relation graph where relation candidates correspond to nodes and edges are the overlapping entities of two relations. Other attempts have also combined word graphs and entity-relation graphs, taking advantage of both structures. Fu et al. (2019) proposed a two-stage approach in which the first stage gathers information from a word graph using an LSTM-GCN model; at this point, the model tries to achieve high coverage of relations. The second stage of the model considers all potential relations with overlapping entities for the final prediction.
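For reference, a single GCN layer (Kipf and Welling, 2017) updates each node from its neighbours as H' = σ(ÂHW), where Â is the normalised adjacency matrix with self-loops; the sketch below is a minimal NumPy version over a toy adjacency matrix, with the sizes and the toy graph being our own assumptions.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H' = ReLU(Â H W), with Â = D^{-1/2}(A + I)D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))  # D^{-1/2}
    A_norm = (A_hat * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],                           # toy dependency graph over 3 words
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = rng.normal(size=(3, 8))                        # node (word) features
W = rng.normal(size=(8, 8))                        # layer weights
print(gcn_layer(A, H, W).shape)                    # (3, 8)
```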

2.4.4 Attention Mechanisms

Attention mechanisms are inspired by the way humans focus on particular regions of an object, e.g., a particular area of an image, or correlate words in a sentence. Attention is represented as a vector of importance weights that can explain the relations between target elements, e.g., the relations between words in a sentence. The first work using attention in NLP was proposed by Bahdanau et al. (2015) for machine translation; the proposed attention is an alignment matrix between the source and target sequences. The mechanism was adapted to several NLP tasks such as document summarisation (Rush

et al., 2015), syntactic parsing (Vinyals et al., 2015) and question answering (Sukhbaatar et al., 2015). Attention has hence become an essential component of neural architectures. Given the weights of important words, attention mechanisms are often used to explain the prediction of a model by highlighting the key elements leading to the decision. Shen and Huang (2016) performed word-level attention on the output of a CNN. Wang et al. (2016) introduced two attention mechanisms to select the relevant parts of a sentence in a Conv-based model: apart from the word-level attention applied to the input representations, the other attention is used to aggregate information from the convolutional representations, replacing regular max-pooling. A recent study by Huang and Du (2019) applied a self-attention mechanism between the convolution operations of a piecewise CNN to better encode sentence context. Besides CNNs, attention has also been shown to support gathering information in RNNs. Zhou et al. (2016) proposed attention applied to the contextual word representations from an LSTM to compute the influence of each word on the associations of the entities. Zhang et al. (2017b) developed an entity position-aware attention mechanism in an LSTM model to focus on context words instead of entity mentions, unlike previous attention approaches. Later on, Sorokin and Gurevych (2017) considered multiple entity pairs in a sentence as additional context with an attention mechanism when predicting the relation of a target pair. A recent work of Can et al. (2019) performed attention on the subtrees attached to each word on the shortest dependency path in order to select relevant information. Attention has also been applied to graph neural networks: Guo et al. (2019) proposed attention mechanisms to softly prune the dependency tree instead of the heuristic rules used in Zhang et al. (2018).

2.4.4.1 Multi-head attention

While previous studies mostly use attention as a supporting block in their models, the transformer uses multi-head attention as its core computation (Vaswani et al., 2017). Although the transformer was proposed for the machine translation task, the idea of using only attention blocks as the core processing has been used in RE as well. Multi-head attention is an extension of self-attention, which is performed by computing dot products between all possible element pairs, such as word pairs in a sentence. While self-attention is directly applied to the input, multi-head attention first divides the input into multiple heads and applies self-attention to each head. The intuition behind this division is to allow the model to have different views of the data. ShafieiBavani et al. (2020) experimented with a

range of multi-head attention neural models, showing their effectiveness over CNNs. Li et al. (2019) proposed a neural model consisting of knowledge-attention, which encodes information from external resources, along with multi-head self-attention for modelling text.
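For completeness, the sketch below implements scaled dot-product self-attention, the core operation behind multi-head attention (Vaswani et al., 2017); splitting the projections into several heads is omitted, and the toy dimensions are our own assumptions.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: softmax(QK^T / sqrt(d_k)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # pairwise word-word scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # attention weights per word
    return weights @ V                                 # context-dependent representations

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))                           # 6 words, 16-dimensional inputs
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)             # (6, 16)
```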

2.4.5 Pretrained Models

Recently, with the arrival of pretrained language models (PLMs), performance on a wide range of NLP tasks has been significantly improved, including relation extraction. A straightforward reason is that PLMs may capture factual information from their huge training sets; this hidden information can therefore be retrieved and exploited for downstream tasks during finetuning. The task-specific neural block on top of PLMs is often simple, such as a feed-forward network taking the relation candidate representation as input and outputting the relation prediction via a softmax classifier. Alt et al. (2019) extended the generative pretrained transformer (GPT; Radford et al., 2018) for distantly supervised learning. Other approaches utilised bidirectional encoder representations from transformers (BERT; Devlin et al., 2019) as contextualised word representations for finetuning. Baldini Soares et al. (2019) inserted entity marker tokens to indicate the entity locations in a sentence, while the classification block remains simple. Differently, Zhang et al. (2019b) incorporated knowledge bases into pretrained masked language models using knowledge base embeddings (Bordes et al., 2013). Wang et al. (2019b) built upon BERT with an entity-aware self-attention mechanism to incorporate information from all entity pairs in a sentence.

2.4.6 Hybrid Architectures

The simplest approach to take advantage of different models is to apply ensembling methods such as majority voting. Vu et al. (2016) applied a simple voting scheme to combine the predictions of a CNN and an RNN, while Nguyen and Grishman (2015a) explored several ways of combination, including ensembling, stacking and voting schemes. However, most existing architectures compose a mixture of the aforementioned structures rather than training them separately. Some approaches stack LSTMs and CNNs in either order to obtain diverse views for sentence encoding (Liu et al., 2015; Cai et al., 2016). Differently, Yang et al. (2016) first encoded context into word representations using RecNNs and then applied convolutional channels to construct a relation candidate representation. Others also combined representations from RecNNs and RNNs (Liu et al., 2015; Miwa and Bansal, 2016). Most of the graph-based models are hybrid approaches, which usually apply an LSTM over text to obtain contextualised representations. These approaches then operate on particular graphs such as grammatical word graphs, heuristic word graphs or entity-relation graphs (Zhang et al., 2018; Vashishth et al., 2018; Christopoulou et al., 2018).

2.5 Relation Candidate Representation

After encoding context information into word representations, we obtain the context-dependent representations of a sequence of n words {h_1, h_2, ..., h_n}. The next step is to construct relation candidate representations from them. A typical approach to form a relation representation is to first construct the two entity representations e_1 and e_2. Each entity representation e ∈ R^{d_e} is usually computed as follows:

e = f_e(h_i, h_{i+1}, ..., h_{i+m}),  (2.23)

where h_i ∈ R^{d_w} is the word representation at position i in the text, i.e., the start of the entity mention, and m is the length of the entity span. f_e can be any aggregation operation such as sum, max or mean over the text span; attention can also be used, in which case d_e = d_w. Additional neural layers can be applied to e to obtain an arbitrary dimension d_e. Next, a relation candidate representation r ∈ R^{d_r} is usually computed by passing the concatenation of the two entity representations e_1, e_2 from Eq. (2.23):

r = f_r([e_1, e_2]),  (2.24)

where [·,·] refers to concatenation and f_r can simply be a fully-connected neural layer. Eq. (2.24) considers only two arguments to form a relation, but it is also applicable to more than two arguments.
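A minimal NumPy sketch of Eqs. (2.23)–(2.24) is given below; mean pooling for f_e and a single tanh layer for f_r are just one possible instantiation among those listed above, and the entity spans and dimensions are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n, d_w, d_r = 10, 8, 16
H = rng.standard_normal((n, d_w))            # context-dependent word representations h_1..h_n

def entity_rep(H, start, length):
    # f_e as mean pooling over the entity span (Eq. 2.23); sum or max would work similarly
    return H[start:start + length].mean(axis=0)

e1 = entity_rep(H, start=1, length=2)        # entity spans chosen arbitrarily for illustration
e2 = entity_rep(H, start=6, length=3)

W_r = rng.standard_normal((2 * d_w, d_r))
r = np.tanh(np.concatenate([e1, e2]) @ W_r)  # f_r as a fully-connected layer (Eq. 2.24)
assert r.shape == (d_r,)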

2.6 Relation Classification Layer

A classification layer is usually the final layer of a neural relation extraction model. In this layer, we pass the relation candidate representation through a linear layer to obtain the unnormalised relation predictions, known as logits z ∈ R^C:

z = r W,  (2.25)

where W ∈ R^{d_r × C} and C is the number of relation categories, resulting in z ∈ R^C. z is passed to a softmax function to obtain the relation probabilities (as in Eq. (2.16)).
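Continuing the previous sketch, the classification layer of Eq. (2.25) is a single matrix multiplication followed by a softmax; the relation representation and the number of categories are again illustrative.

import numpy as np

rng = np.random.default_rng(1)
d_r, C = 16, 5                       # relation representation size and number of relation categories
r = rng.standard_normal(d_r)         # relation candidate representation from Eq. (2.24)

W = rng.standard_normal((d_r, C))
z = r @ W                            # unnormalised logits, z in R^C (Eq. 2.25)

probs = np.exp(z - z.max())
probs /= probs.sum()                 # softmax over relation categories
predicted_relation = int(probs.argmax())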

2.7 Learning

In the previous sections, we introduced the basic building blocks used to obtain relation representations in neural-based models. Next, we introduce the typical learning settings used to train a neural RE model, either from raw, manually-annotated or automatically-annotated data.

Fully-supervised Fully supervised learning requires human-annotated data.

Semi-supervised Since unlabelled data are readily available, semi-supervised learning utilises them in conjunction with existing labelled data.

Weakly-supervised A limitation of the two learning settings above is that they require accurate human annotations; weakly-supervised learning addresses this by leveraging weak labelling sources such as knowledge bases or heuristic labelling functions. The alignment against knowledge bases is known as distant supervision, which dominates this line of research. Another paradigm, using labelling functions, is named data programming.

Transfer learning Transfer learning transfers the knowledge acquired on a resource-rich task to a target task, which may lie in a different domain or be a completely different problem.

Unsupervised learning Unsupervised learning methods learn directly from raw text without task-specific annotations.

Listed above are prominent learning methods that automatically train a model given a training set. Other learning strategies, such as active learning, which relies on intensive human feedback (Sun and Grishman, 2012; Fu and Grishman, 2013), are not included in this introduction. Most deep learning models are supervised models, which require large amounts of in-domain labels. In practice, it is expensive to collect such labels for every new domain. Semi-supervised, transfer, few-shot and unsupervised learning approaches have been proposed to reduce the number of training labels, thus opening more opportunities for relation discovery. We will selectively summarise important work for each supervision setting for the completeness of this overview. Since the proposed methods of this dissertation focus on using unlabelled data, we elaborate on the related work in the following chapters. As we already presented studies that improve text representations in the previous section (§2.4), this section only reviews work that made improvements on the learning setting itself, e.g., different training objectives, noise reduction strategies and supervision signals from other sources involving knowledge bases (§2.7.3.1) or entity/link prediction (§2.7.4). In most cases, the softmax classifier is used to identify relations regardless of the training setting, while different supervision settings consider varying information; for example, distant learning takes multiple instances into account. We introduce them in the following sections.

2.7.1 Fully Supervised Learning

The most studied relation extraction setting is supervised learning, where we are typically given a training set containing manually-labelled instances. Supervised RE is typically framed as a classification problem, where the goal is to classify a group of potential relation arguments into a particular set of pre-defined relation categories. We present here attempts proposed to improve the supervised learning setting. Most supervised methods follow the neural relation extraction framework using the cross-entropy loss introduced in §2.2.2 and introduce various modifications to the model architecture (§2.4). Apart from the cross-entropy loss, we can use a ranking loss to estimate the parameters of a neural model. dos Santos et al. (2015) presented a pairwise ranking loss for RE in order to reduce the impact of negative relation instances, which has been used in subsequent models (Vu et al., 2016; Ye et al., 2019). Besides, regularisation can be introduced to improve training. Zhang et al. (2020b) incorporated relation constraints based on semantic entity categories to train a teacher model; a student model was then proposed to distil knowledge from the teacher without explicitly modelling the heuristic constraints. Lin et al. (2020) proposed a model that automatically learns global constraints between relations by introducing constraint patterns into their model. Their model automatically fills in these patterns to generate corresponding features and learns the weights of these features during training.
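For concreteness, the pairwise ranking loss of dos Santos et al. (2015) is roughly of the following form (notation adapted here, so details such as the treatment of the artificial negative class may differ from the original):

L = log(1 + exp(γ (m⁺ − s_θ(x)_{y⁺}))) + log(1 + exp(γ (m⁻ + s_θ(x)_{c⁻}))),

where s_θ(x)_{y⁺} is the score of the correct relation class, c⁻ is the most competitive incorrect class, m⁺ and m⁻ are margins, and γ is a scaling factor.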

Supervised learning requires annotated training data which is time-consuming and costly to create. This limitation makes supervised learning hard to extend, i.e., if we want to detect new relations, new training data are required.

2.7.2 Few-shot Learning

Supervised learning approaches have been shown to perform well on relation extraction. However, they have a very high demand for manually-annotated data, which is costly to construct and not easy to scale up. In contrast, a human can generalise well after only a few observations. Therefore, it is important to study RE models that can learn from a limited number of examples and generalise to unseen instances. This type of learning from a small set of instances is known as few-shot learning (FSL). One set of FSL approaches is essentially supervised learning, which trains a classifier on one set of labels and then evaluates it on another, non-overlapping set of n categories. The classifier is given k instances for each new category. This setting is known as n-way k-shot classification (Vinyals et al., 2016). Figure 2.12a depicts the abstract idea of the n-way k-shot setting. Essentially, a way is a new category during testing, i.e., n-way indicates n new categories that are not in the training set. A shot is an example available for a new category, i.e., k-shot means that we have k examples per category. Formally, we have k examples corresponding to each of the n categories, which results in a total of n × k new examples. The set of n × k examples is called the support set S. FSL methods are evaluated by using the support set S to classify whether a test instance q belongs to one of the n categories. When k = 1 we call it one-shot learning, and when k = 0 we refer to it as zero-shot learning.

Figure 2.12: General intuition about few-shot learning (3-way 2-shot in this case): (a) conventional few-shot learning; (b) few-shot learning without fine-tuning on target tasks.

There are two typical training strategies for n-way k-shot learning: metric learning and meta-learning. Metric learning learns a semantic distance function between instances, which can classify test instances by comparing them with the support set.

Meta-learning, also known as "learning to learn", targets a parameter initialisation that can then be adapted quickly on the support set. Few-shot learning can sometimes be confused with transfer learning (§2.7.5): both learning settings start from a training set and adapt to an evaluation set. The difference is that meta-learning aims at optimising quickly on the target task.

Han et al. (2018b) implemented several baselines for few-shot relation extraction, covering both metric- and meta-learning approaches, as follows. Their first model is meta networks (Munkhdalai and Yu, 2017), which learn two sets of parameters for general features and new-task adaptation, respectively. The second method embeds every instance, from the support set or the test set, as a node in a graph, and classification is made by propagating information from the support set to the test instance as posterior inference, namely graph neural networks (Garcia and Bruna, 2017). The next one is a simple attentive meta-learner (SNAIL; Mishra et al., 2017), which combines the support set and a test instance into a sequence, inspired by the temporal order of the learning process; the authors then used a temporal CNN to aggregate information along the sequence for the final classification. The last one is a prototypical network (Snell et al., 2017) that learns a representation for each category as a prototype and classifies test instances using the Euclidean distance. Other studies tried to improve prototype representations by introducing hierarchical attention mechanisms (Gao et al., 2019a; Sun et al., 2019b), while Ye and Ling (2019b) proposed to perform matching at the token level. Baldini Soares et al. (2019) leveraged pretrained language models (PLMs) to improve the performance. Qu et al. (2020) improved the prototype representations of relation categories by modelling the dependencies between relations as a prior distribution.
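As a concrete illustration of the metric-learning view, the sketch below computes one prototype per relation category from a support set and classifies a query by Euclidean distance, in the spirit of prototypical networks (Snell et al., 2017); the random feature vectors simply stand in for an actual sentence encoder.

import numpy as np

rng = np.random.default_rng(0)
n_way, k_shot, dim = 3, 2, 4                           # a 3-way 2-shot episode with a toy feature dimension

support = rng.standard_normal((n_way, k_shot, dim))    # encoded support instances per category
query = rng.standard_normal(dim)                       # one encoded test instance

prototypes = support.mean(axis=1)                      # one prototype per relation category
distances = np.linalg.norm(prototypes - query, axis=1)
predicted_category = int(distances.argmin())           # nearest prototype wins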

The aforementioned studies assume that a relation between entities always belongs to one of the n categories, but this is not the case in practice. Co-occurring entities may not be related, or they may share a relation that is not in the pre-defined set. This scenario is introduced as a none-of-the-above (NOTA) category during testing. NOTA is common in conventional RE, but it is difficult to detect in FSL. The difficulty comes from the dynamic nature of the relation set, i.e., the set of categories is not fixed during testing, so the semantic meaning of NOTA also changes accordingly. Another case is that, for some domains, large amounts of annotated data are difficult to obtain, whereas a few examples are feasible. As we already have some large annotated corpora in the general domain, adapting models trained on such data to resource-poor domains with few examples is crucial, often referred to as few-shot domain adaptation (FSDA).

The task of FSDA is to evaluate models on a domain-specific dataset that is different from the training data. PLMs were utilised for improving FSDA and detecting the NOTA relation category (Gao et al., 2019b): the authors employed adversarial learning to learn domain-invariant features for FSDA, while NOTA prediction is inferred from the probabilities of the known categories. Another definition of FSL refers to directly evaluating models without any labelled instance. This can also be considered as zero-shot learning, a specific case of few-shot learning where k = 0. Zero-shot learning aims at classifying unseen/novel categories without a single training example on the target task. This simulates how humans perform reasoning: we can classify new relations if we have adequate information about their properties and functionality. Learning without direct supervision is hard; thus, zero-shot learning receives less attention than other settings and usually casts relation classification into other tasks. These tasks include question answering, which aims at extracting spans corresponding to the relation arguments from text (Levy et al., 2017), and natural language inference, where the text mentioning an entity pair corresponds to a premise P and a relation description constitutes a hypothesis H (Obamuyide and Vlachos, 2018). Recent work also shows the ability to train models on auxiliary tasks without fine-tuning them on the target task, such as GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020). These studies suggest the potential of future work to learn from a few labelled instances.

2.7.3 Weakly-Supervised Learning

Although few-shot learning can partly address the demand for labelled instances of new categories, these approaches mostly require a fair amount of labelled training data. One way to reduce the cost of human annotation effort is weakly-supervised learning, which can be seen as a weaker form of fully supervised learning. Weak labels usually come from higher-level and/or noisy sources. Several approaches have been proposed to generate weak supervision data as well as to learn from weak and noisy annotations for relation extraction (RE). These approaches include the broadly-used distant supervision and the recently proposed data programming framework.

2.7.3.1 Distant Supervised Learning

Distant supervision (DS) for RE automatically generates relation data from plain text, typically using existing knowledge bases (KBs; see the definition of KB in §2.3.3), which contain a large number of relational facts.

Knowledge Base Triple: (Barack Obama, place of birth, the United States)

ID | Sentence | DS | Gold
S1 | [Barack Obama] was born in [the U.S.]. | place of birth | place of birth
S2 | [Barack Obama] was the 44th president of [the United States]. | place of birth | president of
S3 | [Stephen Hawking] was born in [England]. | no relation | place of birth

Table 2.2: Annotation examples of distant supervision (DS) and the corresponding gold relation categories (Gold).

The initial intuition of DS is that given two entities in KBs, any sentence mentioning these two entities probably expresses the relation between them in the KBs (Mintz et al., 2009). To create the labelled data, the first step involves named entity recognition (NER) on the raw text to identify potential entities. After obtaining entity mentions, entity linking is performed to match each mention to the specific entity in the KBs that it refers to. At this point, we can label any sentence containing two entities that participate in a relation in the KBs as supporting that relation. This labelling process is referred to as aligning against KBs. Such sentences are considered positive instances, e.g., S1 and S2 in Table 2.2. Meanwhile, sentences with entity pairs that do not have a known relation in the KBs are labelled as "no relation" (NA), referred to as negative instances (see S3 in Table 2.2). However, the aligned text might express a completely different relation from the KB one, or no relation at all, known as a false positive (FP; S2 in Table 2.2). False positives are mostly caused by the strong assumption that co-occurrence implies association. In practice, two entities may co-occur in the same sentence because they are associated with a particular topic but are not related. Another type of noise introduced by the DS assumption concerns the annotation of negative instances. As mentioned above, we obtain negative instances by labelling sentences whose entity pairs are unrelated in the KBs. But existing KBs are usually incomplete, which might result in missing relational facts, i.e., wrongly labelling a certain relation as no relation. Specifically, 93.8% of "persons" in Freebase lack "place of birth" information according to the incompleteness measurement in Min et al. (2013). This type of error is referred to as a false negative (S3 in Table 2.2).
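The alignment step can be sketched with a toy example; the knowledge base, the entity pairs and the sentences below are illustrative, and a real pipeline would first run NER and entity linking as described above.

# toy knowledge base of (head, tail) -> relation facts
KB = {("Barack Obama", "the United States"): "place of birth"}

def distant_label(head, tail):
    # align an entity pair against the KB (Mintz et al., 2009)
    if (head, tail) in KB:
        return KB[(head, tail)]   # positive instance (possibly a false positive, cf. S2)
    return "no relation"          # negative instance (possibly a false negative, cf. S3)

print(distant_label("Barack Obama", "the United States"))   # place of birth
print(distant_label("Stephen Hawking", "England"))           # no relation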

To alleviate false positives, Riedel et al. (2010) followed multi-instance learning (MIL; Dietterich et al., 1997), relying on the expressed-at-least-once assumption. The intuition is that an entity pair may not be related in every sentence in which they co-appear, but at least one of these sentences reveals the KB relation. In this case, all sentences mentioning the same entity pair are referred to as a bag, and the relation label is assigned to each bag (bag-level) instead of to mention-level instances. The task of DS hence becomes (i) finding informative sentences from a bag of sentences mentioning an entity pair and (ii) identifying whether the pair shares a certain relation. This relaxed assumption has become standard in DS approaches. Initial work employed graphical models to identify the relation of an entity pair and the supporting sentence (Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012). Riedel et al. (2010) assumed that there is only a single relation between a pair of entities, which is not always true in practice; as in our examples (Table 2.2), Barack Obama was not only born in the United States but was also the country's president. Hoffmann et al. (2011) and Surdeanu et al. (2012) hence proposed multi-instance multi-label models to tackle multiple relations shared by an entity pair. A number of neural approaches were explored later on for MIL. In particular, attention mechanisms were used to select informative sentences from a bag, and CNNs and their variants were employed to represent the local context. Zeng et al. (2015) proposed a piecewise convolutional neural network (PCNN) to automatically learn informative features without relying on feature engineering; the model then selects the single most confident sentence as the representative for the relation of an entity pair. The elimination of the other sentences consequently loses some valuable information. Instead of selecting only one sentence, Jiang et al. (2016) employed max pooling over a bag to allow information sharing among multiple sentences. Lin et al. (2016) introduced an attention mechanism across sentences to compute how much each sentence contributes to the relation realisation of an entity pair. Several studies followed this attention idea, adding ranking losses (Ye et al., 2017), syntactic sub-trees (Liu et al., 2018), and word- and sentence-level structured attention (Du et al., 2018). Alt et al. (2019) replaced previous context encoders with a pretrained transformer language model, demonstrating the use of PLMs for DS RE. Another set of methods tackles the dependencies between relation categories and internal information from the data or model. Zeng et al. (2017) used supporting sentences that contain one of the target entities to improve relation detection. Han et al. (2018a) employed a relation hierarchy to incorporate relation dependencies into their model. Yan et al. (2019) observed that the relations between an entity pair may differ depending on the time period; hence, they proposed a model taking temporal information into account for RE. Since sentences expressing a particular relation usually share similar contexts, Ye and Ling (2019a) proposed intra- and inter-bag attention to leverage this similarity.

Huang and Du (2019)'s model learns from the disagreements of two collaborative relation networks to alleviate noise from DS. External information, e.g., knowledge bases and entity descriptions, has been used in several methods. Ritter et al. (2013) proposed a graphical model that penalises disagreement between relations extracted from text and those in KBs; entity popularity is also considered, as rare entities are potentially not found in KBs. Ji et al. (2017) extended the idea by utilising entity descriptions for the computation of attention weights. Vashishth et al. (2018) used fine-grained entity categories and relation alias information from KBs, open information extraction and paraphrasing as soft constraints for relation prediction. Hu et al. (2019) combined label embeddings from KBs and entity descriptions using a gating mechanism. Xu and Barbosa (2019) unified relation prediction using KBs and textual contexts. Different from previous work, Wang et al. (2018) directly used the relation labels from KB embeddings. The latest work uses KBs to deal with long-tail relations (Zhang et al., 2019a; Deng and Sun, 2019) or jointly trains with named entity disambiguation (Trisedya et al., 2019). The above methods neither estimate noise explicitly nor deal with false negatives. We present in this paragraph a few attempts to mitigate noise in DS. Takamatsu et al. (2012) filtered noisy instances by predicting wrong relation patterns using a generative model. Other approaches estimate gold labels given the predicted ones using a dynamic transition matrix (Luo et al., 2017), or predict relations using entity-pair representations together with the reliability of the DS labels (Liu et al., 2017b). Jia et al. (2019) proposed an attention regularisation to locate relevant relation patterns in text; an instance selector is then applied to redistribute the data based on reliable patterns. Another set of distant supervision approaches aims at discriminating noisy instances from true instances using adversarial networks. Generative adversarial networks were first introduced by Goodfellow et al. (2014) and consist of a generator that generates adversarial examples and a discriminator that distinguishes "real" instances from generated ones. Wu et al. (2017) first applied this approach to relation extraction by adding random adversarial noise to word embeddings. Unlike previous work, Qin et al. (2018a) used low-weight instances in a bag as adversarial examples to train the discriminator. Regular linguistic noise introduced in other NLP tasks can also be used, including syntactic reordering, surface-based word modification and synonym replacement (Li et al., 2017). In addition, reinforcement learning (RL) has also been adopted for denoising DS corpora. RL trains a model, namely an agent, to make a sequence of decisions according to a policy in order to maximise a reward (see Sutton and Barto, 2018, page 1).

The idea is ideologically inspired by human interaction with the environment during learning. RL aims at removing or altering noisy instances (Feng et al., 2018; Qin et al., 2018b) by modelling the consistency and differences between DS labels and the models' predictions. However, the removal or altering decisions cannot be explained in these models. Zheng et al. (2019) hence proposed a pattern-extraction agent in an effort to produce explainable predictions. The authors also involved humans to curate the extracted patterns, which are then used to generate more labelled data, and adopted a weak label fusion model to infer correct labels from the DS labels and the pattern-generated ones. One drawback of this approach is the cost of training, which results in a small number of relation categories extracted by the model. DS has also been used in the biomedical domain (Quirk and Poon, 2017). Biomedical datasets are usually small, and annotated data are more expensive to obtain than in the general domain because of the need for expertise; distant supervision is thus even more crucial for the biomedical domain. Quirk and Poon (2017) introduced a model that considers binary relations across multiple sentences, i.e., from one to three consecutive sentences. The authors used a series of binary logistic regression models with several graph-based features including coreference, shortest dependency paths and discourse structure. Peng et al. (2017) proposed an extension of the previous model to n-ary relations, in particular ternary relations; the authors proposed graph LSTMs that operate on a document graph constructed from the above graph-based features. Other attempts to address n-ary relation extraction include graph state LSTMs (Song et al., 2018) and combining probabilistic logic with neural embeddings (Wang and Poon, 2018). While distant supervision is the most popular weak supervision technique, it only uses one data source (i.e., knowledge bases) to heuristically label corpora. We next present the concept of data programming, which allows us to combine multiple supervision sources.

2.7.3.2 Data Programming

The term data programming was first proposed by Ratner et al. (2016); it takes multiple labelling functions to annotate data and construct weak supervision corpora (Figure 2.13). The main idea of data programming is to combine multiple sources for data annotation. The sources usually come from human heuristics used as labelling functions, each of which provides a label for some subset of the data.

def lf1(x):
    cid = (x.chemical_id, x.disease_id)
    return 1 if cid in KB else 0

def lf2(x):
    m = re.search(r".*cause.*", x.between)
    return 1 if m else 0

def lf3(x):
    m = re.search(r".*not cause.*", x.between)
    return 1 if m else 0

Figure 2.13: An illustration of the data programming framework (domain expert → labelling functions → generative model → noise-aware discriminative model), adapted from Ratner et al. (2017).

Labelling functions are not a new idea but rather a unifying name for weak supervision approaches. Many previous weak supervision approaches can be viewed as labelling functions, such as using relation triplets in KBs, as in distant supervision, or domain-specific heuristics. The data created by these heuristic functions are inevitably noisy, and different functions may generate conflicting labels for certain data points. Thus, most approaches following this framework focus on denoising the generated data. To "denoise" the data, the labelling process can be represented as a generative model to which noise-aware mechanisms can be applied. Ratner et al. (2016, 2017) modelled the labelling noise using generative models that take into account statistical dependencies between the labelling functions. Liu et al. (2017a) argued that the reliability of each labelling function can be inconsistent across instances; the authors therefore considered instance-level weights for each labelling function, utilising context representations.
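As a minimal illustration of how labelling functions are aggregated, the sketch below resolves their (possibly conflicting) votes by simple majority; the generative models of Ratner et al. (2016, 2017) replace this vote with a learned, noise-aware aggregation, so the snippet only conveys the overall data flow.

from collections import Counter

def combine(labelling_functions, x):
    # apply every labelling function to a candidate x and aggregate their votes
    votes = [lf(x) for lf in labelling_functions]
    votes = [v for v in votes if v != 0]          # treat 0 as an abstention
    if not votes:
        return 0                                  # no function fired: leave x unlabelled
    return Counter(votes).most_common(1)[0][0]

# usage, assuming lf1, lf2 and lf3 from Figure 2.13 and a relation candidate x:
# weak_label = combine([lf1, lf2, lf3], x)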

2.7.4 Unsupervised Learning

We discussed in previous sections fully and weakly supervised learning, which can perform well on a task given enough manually or automatically labelled data. In contrast with these approaches, unsupervised methods do not require any annotated data for training, neither from humans nor from curated knowledge bases. This line of research imitates human and animal learning by discovering the structure of the world from observation (LeCun et al., 2015). In particular, URE only needs large amounts of text and the entity mentions in it, without a particular set of pre-defined relations. Since there is no given relation set, new relation categories can be revealed. In general, unsupervised relation extraction (URE) clusters similar semantic contexts or co-occurrences to form relations. The most recent framework, using discrete-state variational autoencoders (DVAE), was proposed by Marcheggiani and Titov (2016). The method consists of two components: (i) a feature-rich relation extractor that identifies a semantic relation between two entities within a sentence; and (ii) a decoder component that reconstructs the entities in the input relying on the predicted relation. We refer to models developed on this framework as discriminative approaches, since the encoder is a discriminative relation extractor. Simon et al. (2019) stabilised the learning process of this DVAE by introducing two regularisers that encourage the encoder to predict a wide range of relation categories across data points and to commit to a single relation for each instance.
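Schematically, and with notation simplified relative to Marcheggiani and Titov (2016), the discriminative encoder q_ψ predicts a relation r for a sentence x with entity pair (e_1, e_2), and the decoder p_θ is trained to reconstruct each entity given the predicted relation and the other entity:

max_{ψ,θ}  E_{r ∼ q_ψ(r | x, e_1, e_2)} [ log p_θ(e_1 | r, e_2) + log p_θ(e_2 | r, e_1) ],

to which Simon et al. (2019) add their two regularisers on q_ψ.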

2.7.5 Transfer Learning

Previous learning settings, using either labelled or unlabelled data, assume that the training and testing data share the same distribution. Different from those, transfer learning allows us to train on different domains, tasks or distributions (the source task/domain) and then transfer the knowledge to a related task (the target task/domain). Transfer learning has been considered in different contexts and receives different names depending on the particular type. The most prominent set of approaches is unsupervised pretrained language models (PLMs), usually referred to as word representations. Word representation learning approaches were discussed in §2.3.2 and their usage has also been mentioned in previous sections; here we mention some specific transfer learning approaches in which word representations play the key role in the performance gain. Plank and Moschitti (2013) studied unsupervised domain adaptation, where labels for target domains are not available during training; the adaptation comes from the introduction of word clusters, which bring distributional information into tree kernel methods. Nguyen and Grishman (2014) later explored the use of word embeddings in feature-based and kernel methods, and Nguyen et al. (2015) suggested that tree-based kernels perform better than feature-based methods when combined with word embeddings. Sanh et al. (2019b) further used contextual word representations in a multi-task learning setting; their model leverages training signals from related information extraction tasks, including NER and coreference resolution, in a hierarchical setting. The latest approaches fine-tune PLMs for RE, showing great improvements over the state of the art (Baldini Soares et al., 2019; Alt et al., 2019) in different settings, including fully supervised, distantly supervised and few-shot learning. Hu et al. (2020) adapted Deep Embedded Clustering (DEC; Xie et al., 2016) to perform RE in a self-supervised manner. The authors first obtained contextual entity pair representations from the pretrained language model BERT. The resulting representations are then passed into k-means to obtain k initial cluster centroids. Next, the centroids are used to initialise a deep neural auto-encoder, which is trained to cluster sentences using the Kullback-Leibler (KL) divergence. Since NLP tasks are closely related, another group of pretraining approaches operates on related tasks that can be beneficial for representing text structure. Ebrahimi and Dou (2015) suggested pretraining a recursive neural network by reconstructing its tree structure. Li et al. (2016b) pretrained a model with a sentence reconstruction loss; the authors also learned entity embeddings by treating entities as tokens and training them using the Skip-gram model (Mikolov et al., 2013b). A different set of transfer learning approaches utilises knowledge acquired from the same task in a different domain where resources are available, namely domain adaptation. The target domain may have few or even no labelled data. Xu et al. (2008) and Jiang (2009) adapted knowledge learned on auxiliary relation types to target relation types with only a few training instances. The studies using word representations (Plank and Moschitti, 2013; Nguyen and Grishman, 2014; Nguyen et al., 2015) can also be considered as domain adaptation, since they were trained on auxiliary domains and tested on different ones. Fu et al. (2017) replaced the feature-based and tree-based methods with a convolutional neural network (CNN).
The network was trained jointly with a domain classifier (Ganin and Lempitsky, 2015) in order to learn domain invariant representations.

2.7.6 Semi-Supervised Learning

Semi-supervised learning falls between supervised learning (training with only labelled data) and unsupervised learning (training with no labelled data), combining a large amount of unlabelled data with existing manually-labelled data. Semi-supervised RE can be divided into three categories as follows. The simplest semi-supervised learning method is self-training, which can be applied to any existing model. The idea is to first train a model with the existing labelled data, then use it to annotate additional (unlabelled) data, which is finally added to the training set for model re-training. Bootstrapping is the earliest and most common self-training approach, in which the most reliable, highest-confidence unlabelled instances are selected for re-training. DIPRE (Brin, 1998) and Snowball (Agichtein and Gravano, 2000) were the first two bootstrapping systems for RE, using patterns to extract a particular category of relations from the Web. Xu et al. (2007) extended the idea to n-ary relations. Later studies mainly proposed extensions to include more features, such as coreference (Gabbard et al., 2011) and word embeddings (Batista et al., 2015). Others also used different underlying relation extraction models, such as SVMs (Zhang, 2004) or the expectation maximisation (EM) algorithm (Pawar et al., 2014).
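The self-training loop described above can be sketched in a few lines; the scikit-learn classifier, the synthetic feature vectors and the confidence threshold are illustrative stand-ins for an actual RE model and its corpus.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labelled = rng.standard_normal((20, 5))
y_labelled = np.tile([0, 1], 10)                      # toy binary relation labels
X_unlabelled = rng.standard_normal((100, 5))

model = LogisticRegression()
for _ in range(3):                                    # a few self-training rounds
    model.fit(X_labelled, y_labelled)
    probs = model.predict_proba(X_unlabelled)
    confident = probs.max(axis=1) > 0.9               # keep only high-confidence predictions
    if not confident.any():
        break
    X_labelled = np.vstack([X_labelled, X_unlabelled[confident]])
    y_labelled = np.concatenate([y_labelled, probs[confident].argmax(axis=1)])
    X_unlabelled = X_unlabelled[~confident]           # move newly labelled instances out of the pool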

A key challenge of self-training is semantic drift, which highly depends on the accuracy of the added instances. Gupta et al. (2018) combined entity and pattern constraints in bootstrapping to alleviate this problem, while Qian et al. (2009) and Qian and Zhou (2010) investigated seed set selection for bootstrapping. Instead of obtaining labels from the model itself, another set of approaches, namely graph-based algorithms, directly utilises the annotations of the labelled data. Graph-based algorithms operate on a connected graph, where nodes correspond to relation instances and edges usually encode the similarity between two instances. These algorithms infer relations for unlabelled data from neighbouring labelled nodes in the graph. Chen et al. (2006) showed the effectiveness of a graph-based algorithm over bootstrapping, while Zhou et al. (2008) combined bootstrapping with such an algorithm using an SVM as the underlying classifier, showing better results. Although the previous two sets of approaches were proposed using classical machine learning, neural-based models can be used within their setting. The most recent set of semi-supervised learning approaches is self-ensembling, which mostly uses a single model under different noise configurations to make the model more robust. A recent approach of this kind is the Mean Teacher (Tarvainen and Valpola, 2017), which was adapted for RE by Luo et al. (2019). As its name suggests, the method consists of a teacher that is an exponential moving average of its student, and it is trained using a consistency loss for ensembling (Laine and Aila, 2017). Although the demand for manual effort is reduced in semi-supervised learning, there are two requisites for this learning setting. The first is a high-quality initial labelled set, to learn a good initial model in self-training and self-ensembling approaches. The second is a good inductive bias for inference in graph-based methods.

2.7.7 Open Information Extraction

Open information extraction (OIE) finds all possible relation expressions between named entities in large amounts of text. OIE started as an attempt to facilitate domain-independent discovery of relations between entities from the Web (Banko et al., 2007). Table 2.3 illustrates examples of OIE in comparison with classical RE. In OIE, the relations of interest are unspecified and their number may increase as the corpus size expands.

Input sentence | Classical relation extraction | Open information extraction
[J.K. Rowling]head is the author of [Harry Potter]tail. | author_of(head, tail) | (head, is the author of, tail)
[The Lord of the Rings]tail is written by [J.R.R. Tolkien]head. | author_of(head, tail) | (tail, is written by, head)

Table 2.3: Relation examples from classical relation extraction methods and open information extraction systems.

Early OIE systems mainly implemented rules, such as TextRunner (Banko et al., 2007), StateSnowBall (Zhu et al., 2009), WOE (Wu and Weld, 2010), REVERB (Fader et al., 2011), OLLIE (Mausam et al., 2012) and DepOE (Gamallo et al., 2012). These approaches divide OIE into three steps: (i) extract patterns from sentences to form relation candidates (manually defined, dependency or predicate-argument structures), (ii) assign a binary label (true/false) to the extracted relations based on lexical, syntactic and semantic constraints, and (iii) train a relation detection model using the resulting corpus. Instead of using shallow parsing as in previous work, Mesquita et al. (2013) proposed EXEMPLAR to identify both binary and n-ary relations using semantic role labelling (SRL) systems (Surdeanu et al., 2003; Johansson and Nugues, 2008). For a more detailed review of OIE methods, we refer the reader to the survey of Niklaus et al. (2018), which includes literature up to 2018. Since OIE aims at extracting relation patterns from text, recent work treats it as a sequence-to-sequence task using neural-based models (Zhang et al., 2017a; Cui et al., 2018); in particular, the output sequence is a list of relation triples copied from the input. In a related direction, Stanovsky and Dagan (2016) released an OIE corpus that frames OIE as a sequence tagging task, generated from a question-answering (QA) dataset (He et al., 2015). The dataset was later expanded by Stanovsky et al. (2018) and is large enough to enable the training of neural networks. Jiang et al. (2019) proposed a binary classification loss to rank relation patterns generated from different sentences.

2.8 Conclusions

This chapter presented an overview of relation extraction. We first introduced the related concepts as well as the task formulation. We separated RE into multiple tasks according to different aspects of relations, among which we focus on sentence-level RE in this work. We also covered a range of available resources for this task and the commonly-used evaluation metrics.

We then briefly presented a historical account of RE approaches, from classical machine learning to current neural models, followed by a description of the neural relation extraction framework, viewed as blocks of layers; we elaborated on the individual building blocks in the corresponding sections. We described in detail the common features for RE, divided into three categories: word representations, linguistic features and external resources. One of our proposals focuses on enriching word representations using syntactic information in order to improve RE; hence, we paid more attention to these features. On the other hand, external resources were mentioned for completeness, introducing several resources that can be utilised to improve performance.

In neural models, representing words in context is the most crucial part; hence, we devoted an entire section to these representation blocks. We categorised prior methods based on their core neural networks: CNNs, RNNs, GNNs, attention and pretrained models. We can see the development trend of RE moving from capturing local context (CNNs) to processing entire sequences (RNNs) and later generalising to graph structures (GNNs). To further understand which words in the context play an important role in RE, recent approaches utilise attention mechanisms. The latest approaches leverage pretrained models to provide common-sense information in order to support the extraction of relations. We then moved on to the relation candidate representation and classification blocks, in which we described the most common way to obtain relation representations and predictions. Finally, we described the different types of learning used to optimise a neural RE model and the related developments introduced for each. In particular, we mentioned fully supervised techniques only briefly, as most of the approaches proposed for this setting are neural architecture modifications, which were reported in §2.4. For the other learning techniques, we discussed interesting ideas that have been proposed in an effort to reduce the reliance of neural models on labelled data.

Our investigation reveals that relation extraction is a well-studied task in NLP, especially with the advent of deep learning-based models. Many novel ideas have been proposed which have significantly contributed to the momentum of relation extraction. Despite this progress, the majority of approaches rely on a considerable amount of annotated text. Recent attempts have been proposed to alleviate the need for labelled data. Specifically, current few-shot learning relies on a large labelled training set despite the few-shot setting during testing. In addition, distantly supervised learning requires highly accurate KBs, and noise is still an issue when learning in this setting. This motivates our research on using unlabelled data for RE in different settings.

Throughout this thesis, we aim to investigate three different settings of using unlabelled data for RE: (i) pre-training using syntactic information, (ii) unsupervised relation extraction using entity types, and (iii) using language models to provide weak supervision. The first setting concerns transfer learning via word representations: since RE has been shown to benefit from syntactic features, and modelling syntax from RE data alone may lead to overfitting, in Chapter 3 we propose a method for encoding syntactic information into word representations in a self-supervised manner, training on task-agnostic data in order to capture general syntactic structure.
The differences between our work and previous work that encodes syntactic information into word representations are illustrated in Figure 2.14. Dependency information has been explored in previous work, as mentioned in §2.3.2.2, but these studies encode syntactic information by training language models from scratch and only consider static word representations. In contrast, our work takes advantage of pretrained static/contextual word representations and enriches them. A later work by Glavaš and Vulić (2020) fine-tunes pretrained language models on dependency parsing in order to inject dependency information into the models. The major differences are: (i) the training data (they use manually-labelled data while we use automatically-labelled data), and (ii) the encoding technique (they train on dependency parsing, whereas we also explicitly encode the dependency information using a graph neural model). In general, we propose a unified approach that can enrich both static and contextual representations.

Figure 2.14: The comparison between previous dependency word representations and our work. (a) Previous work: language modelling on automatically-labelled data for static representations (Levy and Goldberg, 2014; Vashishth et al., 2019), and intermediate training of pretrained language models on manually-labelled dependency parsing data for contextual representations (Glavaš and Vulić, 2020). (b) Our work: a graph-based encoder trained on automatically-labelled dependency parsing and POS tagging data (using existing syntactic tools, e.g., Stanford CoreNLP) on top of static/contextual pretrained word representations (e.g., word2vec, GloVe, ELMo, BERT) used as base representations.

We further investigate unsupervised relation extraction (URE), where no label is given (Chapter 4). The community has paid more attention to relation extraction with supervision from human-labelled data or human-curated knowledge bases (KBs). In contrast to annotated data and knowledge bases, there are massive amounts of unlabelled data freely available, which can be used for relation extraction and new relation discovery. We investigate the current URE setting and present a strong baseline showing the importance of inductive biases, i.e., entity types. A drawback of unsupervised learning is that there are no pre-defined relation categories; categories are usually derived after the clusters are formed. In reality, we often have a desired set of relation categories when building a relation extraction system. This is similar to building an annotation guideline, where we provide a set of categories and their corresponding examples. Our last work thus uses one exemplar for each relation of interest and pretrained language models to provide weak supervision for relation extraction. We note that previous work stops at probing relational facts from pretrained language models without further using the extracted facts for relation classification, which is what our work does. We also propose a noisy-channel method that explicitly models the noise in weak supervision to improve the detection of relations (Chapter 5).

Chapter 3

Enriching Word Representations for Relation Extraction

Our first study of using unlabelled data is to enrich word representations with dependency information in order to improve relation extraction. We address our first hypothesis (H1) by incorporating syntactic information into pretrained word representations in an effort to support the identification of semantic relationships between named entities. We introduce a graph convolutional neural model that learns to encode syntactic information into static/contextual word representations by reconstructing dependency structures and predicting part-of-speech tags. We then extract the syntactically-informed word representations from the intermediate layers of the model and evaluate them on a binary and an n-ary relation extraction corpus. Our word representations improve performance on the two corpora over the base representations, showing the effectiveness of incorporating syntactic information. We also show that our enriched contextual word representations can outperform a fine-tuned large language model (BERT) on the binary relation extraction corpus. This chapter contains part of the work published in Tran et al. (2020c).

3.1 Introduction

As shown in the background chapter, word representations constitute the initial step in building neural models for relation extraction. Most approaches rely on language models (LMs) to obtain static word representations (Arora et al., 2016; Tissier et al., 2017; Xin et al., 2018), which conflate all possible meanings of a word into a single real-valued vector.


Recent work investigated contextual word representations that assign a different representation to each occurrence of a word based on its local context (McCann et al., 2017; Peters et al., 2018). These contextual word representations have demonstrated improvements in downstream tasks over the static ones. In contrast, rather than using word representations as input features, large-scale pretrained language models (PLMs) can be used in downstream application models by fine-tuning the entire model (Radford et al., 2018; Devlin et al., 2019; Radford et al., 2019). Task-specific neural layers are added on top of PLMs to perform a target task. These fine-tuning methods have shown promising results, with higher performance than feature-based contextual word representations in some applications such as text classification and textual entailment (Devlin et al., 2019). However, despite the impressive empirical results of fine-tuning, the process remains unstable: identical learning settings with different seeds may result in a substantial difference in performance (Dodge et al., 2020; Mosbach et al., 2020; Zhang et al., 2020a). On the contrary, contextual representations can easily be adopted as a plug-and-play input. Additionally, contextual representations are cheaper to run, as they are pre-computed only once for each instance and reused in many experiments with smaller models on top. In other words, the computational costs of fine-tuning are much higher than those of contextual approaches.

All of the LMs mentioned above are mainly trained on a large amount of raw text, and thus do not explicitly encode any linguistic structures. Previous studies have shown that downstream task performance can benefit from linguistic structures such as syntactic information (Miwa and Bansal, 2016; Kuncoro et al., 2018), even when contextual word representations and pretrained models are also used (Strubell et al., 2018; Lauscher et al., 2019; Peng et al., 2019). We consider part-of-speech (POS) tags and dependencies as the source of syntactic information, which has been well studied and can be obtained efficiently with high accuracy using existing dependency parsing tools for English corpora (Manning et al., 2014; Neumann et al., 2019). While encoding syntactic information into task-oriented neural models requires explicit neural network design, we here propose to incorporate such information into pretrained word representations. We demonstrate that syntactic features can be pre-encoded in contextual word representations, which can be beneficial for subsequent applications.

Although previous work has also incorporated dependency context into word representations (Levy and Goldberg, 2014; Vashishth et al., 2019a), it typically trains the language models from scratch.

Recent attempts have started exploring intermediate training, which trains a PLM on certain resource-rich supervised tasks and then performs fine-tuning on the target task (Phang et al., 2018; Glavaš and Vulić, 2020). Our method is in line with this research direction; one of the main differences is that we explore the use of automatic parsing tools instead of depending on gold corpora. Another difference is that we explicitly model the syntactic structure using a graph convolutional network, while previous work only uses the structure as a training signal (Levy and Goldberg, 2014). In other words, we can directly encode syntactic relations into word representations, while previous work learns these relations implicitly during training. The intuition is that a syntactic structure connecting two entities can imply a semantic relation as well as rule one out. Despite the few discrepancies that are expected when using automatically-parsed data, our proposed method works well on two relation extraction tasks. We introduce Syntactically-Informed Word Representations (SIWRs) that can incorporate syntactic information in neural models without explicitly changing their architecture. Syntactic information is integrated into existing word representations such as GloVe (Pennington et al., 2014), ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) by learning from automatically annotated data, which are task-independent. The proposed SIWR model extends a graph convolutional neural network (GCN) and builds on top of these word representations. Since word order is important in English, we preserve it by adding sequential connections to the graph layer. We then obtain SIWRs from the pretrained SIWR model using a contextualised word representation extraction scheme. Finally, we incorporate SIWRs into downstream models by simply replacing the word representations with SIWRs. We show that SIWRs enrich the base word representations with syntactic information and boost performance in downstream tasks. Unlike previous work (Swayamdipta et al., 2019), our findings demonstrate that syntactic information is helpful even in the presence of contextual word representations and large pretrained models.

3.2 Proposed Approach

We propose the syntactically-informed word representation (SIWR) model, which utilises a GCN as its core computational component to integrate syntactic information into word representations, as illustrated in Figure 3.1. We first prepare pretrained static and contextual word representations, e.g., GloVe, ELMo and contextual BERT, as the base representations. We then feed them to our SIWR model, which consists of two stacked GCN layers over dependency trees along with self and sequential information. The GCN is used to include syntactic information into the base word representations.

Figure 3.1: Overview of our neural graph model. The model addresses syntactic tasks (dependency parsing and POS tagging) to incorporate syntactic information into word representations. We only show one graph convolutional neural layer; other edges are omitted for simplicity.

The SIWR model jointly predicts part-of-speech (POS) tags and syntactic dependencies. We only pretrain the SIWR model once, with a relatively modest amount of task-agnostic data that are automatically annotated using existing syntactic tools. The final SIWRs are obtained by combining the outputs of all layers in the model. We can simply integrate SIWRs into downstream tasks by replacing the original word representations with the resulting SIWRs.

3.2.1 Base Representation

As shown in Figure 3.1, pretrained word representations are used as the input of our model, namely base representations. Base representations can be either static or contextualised word representations. Static word representations encode a word as a single continuous vector regardless of the context in which the word occurs. If the base representations are pretrained static representations, we create an embedding layer that maps each word to a single continuous vector e_s ∈ R^{d_w}, where e_s is the static embedding and d_w is the dimension of the embedding. In contrast, in contextualised word representations, word vectors change dynamically based on context. To use contextualised word representations, we construct the base representations e_c ∈ R^{d_w} by combining the intermediate word-based vectors e_l ∈ R^{d_w} of the L layers of a pretrained language model:

e_c = ∑_{l=0}^{L} γ_l e_l,  (3.1)

where γ_l are the softmax-normalised weights of the L layers and the vector e_l is the internal word embedding from the l-th layer of the pretrained model. These base representations are obtained from existing pretrained LMs, which are fixed during pretraining and then used in downstream tasks.
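A NumPy sketch of Eq. (3.1) is given below; the random tensors stand in for the internal layer outputs of a pretrained LM, and γ is left as a free parameter vector that would normally be learned.

import numpy as np

rng = np.random.default_rng(0)
num_layers, n_words, d_w = 3, 6, 8
layer_outputs = rng.standard_normal((num_layers, n_words, d_w))   # e_l for each layer

gamma = rng.standard_normal(num_layers)            # learnable scalar weights
gamma = np.exp(gamma) / np.exp(gamma).sum()        # softmax-normalised, as in Eq. (3.1)

e_c = np.tensordot(gamma, layer_outputs, axes=1)   # weighted sum over layers -> (n_words, d_w)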

3.2.2 Part-of-Speech Tags and Dependencies

We consider two types of syntactic information: part-of-speech tags and dependencies, which are two fundamental syntactic features as introduced in §2.3.1. POS tags are necessary for identifying the essential components of a sentence, e.g., nouns and verbs. Dependency structure, or dependency for short, provides information on the relations between words in a sentence, which can be leveraged to extract the subject, the object and their modifiers. Raw word sequences work well enough for short-range relations, but performance drops for entities located at longer distances. In this case, dependency connections provide a shorter information flow through the syntactic structure, directly connecting distant but related words and avoiding redundant neighbouring words.

3.2.3 The SIWR Model

The goal of our SIWR model is to incorporate dependency context into word representations. To achieve this, we utilise the dependency tree structure of a sentence, where nodes correspond to words and edges are dependency relations between words. We also preserve the word order by adding previous- and next-word connections to the word graph. GCNs have been successfully used for operating on labelled word graphs in previous work (Marcheggiani and Titov, 2017; Marcheggiani et al., 2018); we hence follow their edge-wise gating GCN. We create five types of edges (Figure 3.1): dependent edges representing head-to-dependent connections (dependency direction), reversed edges for the dependent-to-head direction (reversed direction), previous- and next-word connections, as well as self-informed edges to retain information from the target word. The representation h_v^{l+1} of word v at layer l + 1 is defined as follows:

h_v^{l+1} = ReLU( ∑_{u ∈ N(v)} g_{u,v}^{l+1} ( h_u^l W_{edge(u,v)}^l + b_{dep(u,v)}^l ) ),  (3.2)

where ReLU(x) = max(0, x), N(v) denotes the set of words that directly connect to word v, h_v^* ∈ R^{d_w} is the representation of word v, W_{edge(u,v)}^* ∈ R^{d_w × d_w} is a weight matrix that learns the connective information from word u to word v, and b_{dep(u,v)}^* ∈ R^{d_w} is the edge type embedding of the edge (u, v). The gate g_{u,v} controls the information weight of edge (u, v), which is calculated as follows:

g_{u,v}^{l+1} = \sigma\left( h_u^l \hat{W}_{edge(u,v)}^l + \hat{b}_{dep(u,v)}^l \right),   (3.3)

where \sigma(x) = 1/(1 + e^{-x}) is the sigmoid function, and \hat{W}_{edge(u,v)}^* ∈ R^{d_w} and \hat{b}_{dep(u,v)}^* ∈ R are the gate weights and biases. Since a GCN layer only encodes information into a word from its adjacent nodes in the word graph, we employ two stacked GCN layers to capture longer dependency context (Marcheggiani and Titov, 2017; Marcheggiani et al., 2018).
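The following is a minimal PyTorch sketch of one edge-wise gated GCN layer following Eqs. (3.2)–(3.3). It assumes a simple edge list with integer edge-type and dependency-label ids and the five edge types described above; it is an illustrative re-implementation, not the original Theano code.

```python
import torch
import torch.nn as nn

class GatedGCNLayer(nn.Module):
    """One edge-wise gated GCN layer following Eqs. (3.2)-(3.3).

    Assumed edge types: dependent, reversed, previous, next, self.
    """
    def __init__(self, dim: int, num_edge_types: int = 5, num_dep_labels: int = 50):
        super().__init__()
        self.w_edge = nn.Parameter(torch.randn(num_edge_types, dim, dim) * 0.01)
        self.b_dep = nn.Parameter(torch.zeros(num_dep_labels, dim))
        # Gate parameters produce one scalar gate per edge, as in Eq. (3.3).
        self.w_gate = nn.Parameter(torch.randn(num_edge_types, dim) * 0.01)
        self.b_gate = nn.Parameter(torch.zeros(num_dep_labels))

    def forward(self, h, edges):
        # h: (n, dim); edges: list of (u, v, edge_type, dep_label) meaning u -> v
        agg = [torch.zeros(h.size(1)) for _ in range(h.size(0))]
        for u, v, etype, dlabel in edges:
            msg = h[u] @ self.w_edge[etype] + self.b_dep[dlabel]                    # Eq. (3.2) inner term
            gate = torch.sigmoid(h[u] @ self.w_gate[etype] + self.b_gate[dlabel])   # Eq. (3.3)
            agg[v] = agg[v] + gate * msg
        return torch.relu(torch.stack(agg))

# Toy graph: 3 words, one dependency edge 0->1 (type 0, label 3) plus self loops (type 4, label 0)
h = torch.randn(3, 8)
edges = [(0, 1, 0, 3), (0, 0, 4, 0), (1, 1, 4, 0), (2, 2, 4, 0)]
layer = GatedGCNLayer(dim=8)
h1 = layer(h, edges)
```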

3.2.4 Pretraining the SIWR Model

We pretrain our SIWR model to jointly capture POS tags and syntactic dependencies. In particular, for each word we predict its dependency connection, i.e., the head of the word (a dependent arc) and the label of the dependent arc. We first obtain the POS representation h_v^{POS} of word v for POS prediction. We then construct arc and label representations of word v as a dependent, h_v^{arc dep} and h_v^{label dep}, respectively. Similarly, we compute arc and label representations of word v serving as a head word, h_v^{arc head} and h_v^{label head}, respectively. For simplicity, we denote these representations as h_v^{syn} and compute them as follows:

h_v^{syn} = f(h_v^L W^{syn} + b^{syn}),   (3.4)

where h_v^L is the output of the last GCN layer corresponding to word v and syn ∈ {POS, arc head, arc dependent, label head, label dependent}. f is an activation function, which is tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}) for POS representations and ReLU(x) = max(0, x) for dependency representations. The POS representation h_v^{POS} of a word v is then fed into a softmax classifier to predict the correct POS tag of the target word. We optimise the negative log-likelihood

objective of POS tagging (J_{POS}) as follows:

J_{POS} = -\sum_v \log p(y_v^{POS}),   (3.5)

where p(y_v^{POS}) is the probability that the correct POS label is assigned to word v according to the softmax function.

Meanwhile, to predict syntactic dependencies, we need to predict (i) heads of words (dependency arcs) and (ii) arc labels. First, we predict the head of word v (the dependent arc) s_v^{arc} by passing the arc dep representation h_v^{arc dep} of word v and the arc head representations H^{arc head} of the other words in the sentence to a biaffine classifier (Dozat and Manning, 2016) (Eq. (3.6)). Likewise, we use another biaffine layer for label prediction (Eq. (3.7)). In particular, we pass the label dep representation h_v^{label dep} of word v and the label head representations H^{label head} of the other words to the biaffine classifier, resulting in the label probability s_v^{label_i}. The label probability is then multiplied by the predicted arc probability s_v^{arc} to get the final arc label probability s_v^{arc label_i} (Eq. (3.8)). The computation is as follows:

s_v^{arc} = H^{arc head} W^{arc} h_v^{arc dep},   (3.6)

s_v^{label_i} = H^{label head} W^{label_i} h_v^{label dep},   (3.7)

s_v^{arc label_i} = s_v^{label_i} \odot s_v^{arc},   (3.8)

where H^* = \{h_1^*, ..., h_n^*\} ∈ R^{n × d_w} is the matrix whose rows are the head representations from Eq. (3.4), W^* ∈ R^{d_w × d_w} and h_v^* ∈ R^{d_w}, which results in s^{arc}, s^{label_*} ∈ R^n; n is the number of words in a sentence and \odot is element-wise multiplication. We design the model to behave similarly to an auto-encoder (taking the dependency graph as input and predicting it later) to ensure that dependency information is embedded in the representations. Let S_v^{arc label} = \{s_v^{arc label_1}, ..., s_v^{arc label_m}\} ∈ R^{n × m}, where m is the number of dependency labels. The probability of assigning the correct head word and dependency label to a word v is:

p(y_v^{arc label}) = softmax_{label\ index}\left( S_v^{arc label}(head\ index, \cdot) \right),   (3.9)

where S_v^{arc label}(head\ index, \cdot) ∈ R^{1 × m}, and head index and label index are the indices of the correct head word and the correct dependency label, respectively. We then employ the negative log-likelihood to train dependency parsing (J_{DP}); the objective function is defined as follows:

J_{DP} = -\sum_v \log p(y_v^{arc label}),   (3.10)

where p(y_v^{arc label}) is the probability assigned to the correct head word and the correct dependency label for a word v.
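A schematic of the biaffine arc and label scoring in Eqs. (3.6)–(3.10) is given below, assuming the head and dependent representations of Eq. (3.4) are already computed. Dimensions and the gold head/label indices are made up for illustration; the thesis implementation may differ in detail.

```python
import torch

def biaffine(H_head, W, h_dep):
    """Biaffine score H_head @ W @ h_dep, as used in Eqs. (3.6)-(3.7)."""
    return H_head @ W @ h_dep                                            # (n,)

# Toy example with n = 4 words and d_w = 16 (illustrative sizes)
n, d_w, num_labels = 4, 16, 10
H_arc_head, h_arc_dep = torch.randn(n, d_w), torch.randn(d_w)
W_arc = torch.randn(d_w, d_w)
s_arc = biaffine(H_arc_head, W_arc, h_arc_dep)                           # arc scores, Eq. (3.6)

# One label-specific biaffine per dependency label, Eq. (3.7)
W_label = torch.randn(num_labels, d_w, d_w)
H_label_head, h_label_dep = torch.randn(n, d_w), torch.randn(d_w)
s_label = torch.stack([biaffine(H_label_head, W_label[i], h_label_dep)
                       for i in range(num_labels)], dim=1)               # (n, num_labels)

# Combine arc and label scores element-wise, Eq. (3.8)
S_arc_label = s_label * s_arc.unsqueeze(1)                               # (n, num_labels)

# Probability of the gold (head, label) pair, Eqs. (3.9)-(3.10)
head_index, label_index = 2, 5                                           # hypothetical gold indices
log_p = torch.log_softmax(S_arc_label[head_index], dim=0)[label_index]
loss_dp = -log_p
```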

Figure 3.2: The use of SIWRs in downstream tasks, i.e., relation extraction models.

We jointly train the model on both POS tagging and dependency parsing by minimising the following joint loss:

J = J_{POS} + J_{DP} + \lambda ||W||^2,   (3.11)

where \lambda ||W||^2 is the L2-norm regularisation term, which encourages the model's parameters to stay close to zero, and \lambda is a hyperparameter.

3.2.5 Syntactically-Informed Word Representations

As illustrated in Figure 3.2, we extract the intermediate representations of the SIWR model and use them as input replacing word representations in the downstream models. In particular, within a specific domain, we only need to train the SIWR model once, to compute word representations for downstream linguistic tasks. SIWRs are modelled as a weighted combination of the intermediate embeddings: the base representations and the outputs of the two-stacked graph convolutional neural layers in the SIWR model. The base representations provide distributional features derived from large-scale data, while the other layers include syntactic information and long-range dependency context. We employ the combination scheme of contextualised representations in Eq. (3.1),

SIWRs = \sum_{i=0}^{L+2} \beta_i e_i,   (3.12)

where L is equal to 1 if the base representations are static embeddings, or to the number of intermediate layers if the base representations are contextualised embeddings, and the +2 accounts for the two stacked GCN layers. \beta ∈ R^{L+2} plays the same role as \gamma in Eq. (3.1): it is trained in downstream tasks while the intermediate embeddings serve as the input and are fixed during training.

3.3 Pretraining Settings

3.3.1 Datasets and Base Representations Used for Pretraining

                          Generic (BinaryRE)      Biomedical (N-aryRE)
  Embeddings – Static     W2V, GloVe              PubMed
  Embeddings – Base       ELMo                    PubMed
  SIWRs – Data            1B Word                 PubMed
  SIWRs – Parser          Stanford CoreNLP        ScispaCy

Table 3.1: Word representations, training data and dependency parsers used in our experiments.

  Hyperparameter            Range            Best
  Batch size                [10, 32]         20
  Dropout input             [0.1, 0.5]       0.34
  Dropout GCN               [0.1, 0.5]       0.19
  Dropout classification    0.33             0.33
  Learning rate             Default          0.001
  Gradient clipping         [5, 30]          10
  Weight decay (L2)         [1e-2, 1e-5]     1e-4
  Dim. of POS               [16, 50]         16
  Dim. of dependency        100              100

Table 3.2: Value range and best value of the tuned hyperparameters for our SIWR model. Dropout was applied to all representation layers before classification (POS tag, dependency arc and label) with the recommended rate of 0.33; we used a fixed dimension of 100 for dependency arc and label embeddings (Dozat and Manning, 2016).

Table 3.1 presents the word representations used in our experiments, as well as the training data and syntactic parsers. These settings are chosen depending on the domain. The static embeddings used in our experiments include W2V (200d), trained using word2vec (Mikolov et al., 2013b) on Wikipedia 2015 (Miwa and Bansal, 2016)1; GloVe (100d), which was trained on Wikipedia 2014 and Gigaword 5 (Pennington et al., 2014)2; and PubMed (200d), trained on MEDLINE/PubMed 2018 using word2vec (Mikolov et al., 2013b)3. Regarding contextualised representations, we used the small version of ELMo (256d) released by Peters et al. (2018).4

1 http://tti-coin.jp/data/wikipedia200.bin
2 http://nlp.stanford.edu/projects/glove/
3 http://nlp.cs.aueb.gr/software.html
4 https://allennlp.org/elmo

As shown in Table 3.1, we used SIWRsELMo (the enriched ELMo) on general data for the ACE2005 RE dataset. To obtain SIWRsELMo, around 0.2% of the one billion word benchmark (Chelba et al., 2014), which was also used by ELMo, was randomly selected and automatically parsed using Stanford CoreNLP (Manning et al., 2014). The processed data was divided into two parts for training (30k) and development (30k). We

then used the training set to train our SIWRsELMo. As ELMo was trained on the general domain, the meanings that ELMo captures are likely biased towards general, commonsense usage. Thus, for the biomedical domain, we used

PubMed as our base representation for the n-aryRE task (SIWRsPubMed). Our model enriches PubMed using a PubMed subset containing the same number of sentences as used for enriching ELMo (30k). We used ScispaCy, a syntactic parser trained on biomedical data (Neumann et al., 2019), to obtain dependency structures. The current trend of using large pretrained models, e.g., BERT (Devlin et al., 2019), might call into question the need for explicit syntactic information. BERT demonstrated that fine-tuning the entire model can perform better than using pretrained word representations. To compare with such fine-tuning methods and to show the role of explicitly including syntactic structures, we performed experiments using BERT as the base representation and fine-tuned it for binaryRE.5 We employed the pretrained BERT base model (word representation dimension of 768) in our experiments. To use BERT as the base representation, we extracted the intermediate embeddings from the last four hidden layers of BERT, as these layers were reported to be the best performing representations (Devlin et al., 2019). Due to memory constraints, we further added a feed-forward hidden layer with 256 dimensions on top of the BERT contextual representations to reduce the representation size when applying them in the downstream linguistic tasks. While all the non-BERT training and experiments were done on a Tesla K20 (5GB GPU memory and CUDA compute capability 3.5), the experiments with BERT were conducted on a GTX 1080 Ti (11GB memory) due to its computational requirements.

3.3.2 Pretraining Implementation Details

We implemented our SIWR model using the Theano library (Theano Development Team, 2016) trained using Adam optimiser with default hyperparameters (Kingma and Ba, 2014).6 During pretraining, we implemented gradient clipping, dropout, and L2 regularisation to avoid over-fitting. We also incorporated early stopping with patience equal to 10 updating steps. The base representations were not updated during pretraining.
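As a rough illustration of the training procedure described above (Adam with default hyperparameters, gradient clipping, L2 weight decay and early stopping with patience 10), the sketch below shows a generic PyTorch loop; the model, batches and loss function are placeholders, and the original implementation was written in Theano.

```python
import torch

def pretrain(model, train_batches, dev_batches, compute_loss,
             max_epochs=50, clip=10.0, weight_decay=1e-4, patience=10):
    """Generic Adam + gradient clipping + L2 + early stopping loop (placeholders)."""
    optimiser = torch.optim.Adam(model.parameters(), weight_decay=weight_decay)
    best_dev, wait = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for batch in train_batches:
            optimiser.zero_grad()
            loss = compute_loss(model, batch)        # e.g. J_POS + J_DP
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
            optimiser.step()
        model.eval()
        with torch.no_grad():
            dev_loss = sum(compute_loss(model, b).item() for b in dev_batches)
        if dev_loss < best_dev:
            best_dev, wait = dev_loss, 0
        else:
            wait += 1
            if wait >= patience:                     # early stopping
                break
```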

5 We did not conduct experiments on n-aryRE because the n-ary baseline is implemented in the Theano framework and there is no available BERT implementation in that framework.
6 Our code is available at https://github.com/ttthy/siwrs.

We tuned the hyperparameters of our model by applying the Tree Parzen Estimator optimisation algorithm using the Hyperopt toolkit.7 The final set of hyperparameters is listed in Table 3.2.
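The sketch below illustrates how such a TPE search could be set up with the Hyperopt toolkit over ranges similar to those in Table 3.2; the objective function is a placeholder standing in for a full pretraining run, not the actual tuning script.

```python
from hyperopt import fmin, tpe, hp

# Search space mirroring the ranges in Table 3.2.
space = {
    "batch_size": hp.quniform("batch_size", 10, 32, 1),
    "dropout_input": hp.uniform("dropout_input", 0.1, 0.5),
    "dropout_gcn": hp.uniform("dropout_gcn", 0.1, 0.5),
    "grad_clip": hp.uniform("grad_clip", 5, 30),
    "weight_decay": hp.loguniform("weight_decay", -11.5, -4.6),  # roughly [1e-5, 1e-2]
    "pos_dim": hp.quniform("pos_dim", 16, 50, 1),
}

def train_and_evaluate(**params):
    # Placeholder: in practice this would pretrain the SIWR model with `params`
    # and return the development loss; here it is a dummy value.
    return params["dropout_input"] + params["dropout_gcn"]

def objective(params):
    return train_and_evaluate(**params)

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
print(best)
```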

3.4 Evaluation Settings

  Task       Dataset           Annotation         Baseline                             Related Work
  BinaryRE   ACE2005           Binary relation    Walk (Christopoulou et al., 2018)    BIO tag (Ye et al., 2019)
  N-aryRE    Drug-Gene-Mut.    Ternary relation   BiLSTM (Peng et al., 2018)           Graph LSTM (Peng et al., 2018)

Table 3.3: Evaluation datasets and related models used in our experiments.

We evaluated SIWRs on two sentence-level relation extraction tasks: binary relation extraction (RE) and n-ary relation extraction (n-aryRE). Table 3.3 provides a summary of the data and evaluation settings, e.g., evaluation tasks, datasets, baselines and related models. Once our model was trained, we extracted SIWRs on automatically parsed evaluation data without any further fine-tuning. We used different neural models as baselines for the individual tasks. We kept all hyperparameters the same as in the reported papers, except for the learning rate in the n-ary model. We found that, due to the masked entity word form, it is more difficult to capture the context with many masked tokens; hence, we increased the learning rate to speed up the training process. A larger learning rate may cause the model to over-fit the training data more easily, which we mitigated by adding a small amount of weight decay along with a word dropout rate. Other hyperparameters remained the same as reported in Peng et al. (2018). Table 3.4b and Table 3.5b report the hyperparameters for the ACE2005 and drug-gene-mutation datasets, respectively. We used a general domain dataset, ACE2005 (Walker et al., 2006), for binaryRE. For n-aryRE, we used the sentence-level biomedical drug-gene-mutation dataset of Peng et al. (2018), since there is no n-aryRE dataset in the general domain. We followed previous work in using Precision (P), Recall (R) and F1-score (F1) as evaluation metrics for binaryRE. In the following sections, we briefly explain the evaluation tasks, the data, the baselines we used and the related work corresponding to each task.

7 https://github.com/hyperopt/hyperopt

(a) Data statistics

  Relation      Train   Dev    Test
  Positive      4,780   1,131  1,151
  ORG-AFF       1,469   365    359
  PHYS          1,099   278    278
  PART-WHOLE    774     162    182
  GEN-AFF       511     124    104
  ART           489     96     151
  PER-SOC       438     106    77

(b) Hyperparameters

  Hyperparameter        Original / ELMo & SIWRsELMo
  Unk. word prob        0.01 / 0.0
  Batch size            10
  Word dimension        200 / 256
  Position dimension    25
  Type dimension        20
  LSTM dimension        100
  Pair dimension        100
  β                     0.77
  Input dropout rate    0.11
  Output dropout rate   0.32
  Learning rate         0.002
  Weight decay (L2)     0.000057
  Gradient clipping     24.4

Table 3.4: Statistics and hyperparameters for the ACE2005 binary relation extraction task.

3.4.1 Binary Relation Extraction

Figure 3.3: A binary relation example from the ACE2005 dataset (Walker et al., 2006): in the sentence "Firefighters continue their work at the Ramallah oil field.", PER, GPE and FAC entity mentions are connected by PHYS relations.

Binary relation extraction (RE) identifies relations between two entity mentions in a sentence. We evaluated our representations on the ACE2005 dataset that provides a total of 6 relation types: ORG-AFF (organisation affiliation), PHYS (physical), PART-WHOLE (part-whole), GEN-AFF (gen-affiliation), ART (artifact) and PER-SOC (person social); and a specific NA category to indicate no relation. An example from the dataset is illustrated in Figure 3.3, and the statistics of the dataset are shown in Table 3.4a. For this task, we employed the walk-based model proposed in Christopoulou et al. (2018) as our baseline.8 The walk model supports relation detection by leveraging interactions between all entities in a sentence. In particular, entities are formed as nodes in a graph structure where edges are considered as relation paths. Then, a relation

candidate embedding is constructed by walking from the head entity to the tail entity through different path lengths. The constructed embeddings are then fed into a softmax classifier to identify the underlying relation. We compared our baselines, using different base representations and the corresponding SIWRs, to related work (Miwa and Bansal, 2016; Christopoulou et al., 2018; Ye et al., 2019). Miwa and Bansal (2016) integrated dependencies in a long short-term memory (LSTM) model, while we used the model proposed by Christopoulou et al. (2018) as our baseline. For reference, we also show the results of Ye et al. (2019) on the ACE2005 dataset. They used BIO (begin-inner-outer) tags as entity indicators, which may give a hint to the relation extraction. However, they employed a slightly different evaluation setting compared to ours; the details are explained in §3.5. We implemented the baseline model using the Chainer library (Tokui et al., 2015). We did not change the hyperparameters in binaryRE, which allows us to do a controlled comparison with the base representations.

8 This is the latest work on sentential binary relation extraction at the time we conducted this experiment.

3.4.2 Ternary Relation Extraction

Figure 3.4: An n-ary relation example from the drug-gene-mutation dataset (Peng et al., 2018): the sentence "Also, we found that EGFR-WT model at lower concentrations of EGF is as sensitive to gefitinib for ERK phosphorylation as L858R model A." expresses a sensitivity relation among the gene (EGFR-WT), drug (gefitinib) and variant (L858R) mentions.

N-ary relation extraction (n-aryRE) is fundamentally an extension of the RE task, which further detects relations among several entity mentions in a sentence. For this task, we used the drug-gene-mutation dataset (Peng et al., 2018) with the same split as the released data.9 The drug-gene-mutation data defines four semantic relation categories: resistance or non-response, sensitivity, response and resistance; n-ary instances that do not express a defined relation are referred to as None. An example from the dataset is illustrated in Figure 3.4. We followed Peng et al. (2018) in evaluating the task using five-fold cross-validation, where 200 instances from the training set were

9 http://hanover.azurewebsites.net

randomly selected for development in each fold. The statistics of the dataset are shown in Table 3.5a.

(a) Data statistics

  Fold                          1     2     3     4     5
  Positive                      758   705   543   815   586
  resistance or non-response    308   283   287   363   238
  sensitivity                   262   292   142   257   196
  response                      124   89    83    92    100
  resistance                    64    41    31    103   52

(b) Hyperparameters

  Hyperparameter       PubMed   SIWRsPubMed
  Batch size           8        15
  LSTM dimension       150      150
  Learning rate        0.02     0.3735
  Weight decay (L2)    0.0      0.0021
  Word dropout rate    0.0      0.0019

Table 3.5: Statistics and hyperparameters for the drug-gene-mutation dataset.

We adopted the LSTM model of Peng et al. (2018) as the baseline for our experiments. The LSTM model does not include syntactic information; we replaced its word representations with our pre-encoded syntactic word representations. The n-ary model was implemented in the Theano framework (Theano Development Team, 2016). The n-aryRE data was difficult to train on and dependent on precise hyperparameter tuning, since it was automatically constructed using a distant supervision approach, i.e., it contains noise. We therefore fine-tuned the hyperparameters of the n-ary baseline, although most of them remained the same as reported in Peng et al. (2018). Given an identical base architecture across representations for each task, we can then attribute any difference in performance to the proposed SIWRs over the base representations. We also compare with their proposed graph-based LSTM that models dependency structure, which contrasts the syntactic information in our SIWRs with directly incorporating syntax into a task-oriented model. We did not compare with Song et al. (2018) since we found that they did not blind entity names in their experiments. When we replaced entities by their entity types, Song's model performance was lower than that of Peng et al. (2018), 69.36% and 77.9% respectively.

3.5 Results

We first summarise the main results compared to the base representations, then compare our SIWRs with fine-tuning the large pretrained model BERT. We also discuss the performance of our method on each evaluation task. Table 3.6 depicts the performance of SIWRs on the two relation extraction tasks

  Method                            BinaryRE                     N-aryRE
                                    P       R       F1           Acc.
  W2V                               68.77   60.64   64.45        –
  PubMed                            –       –       –            77.16
  ELMo                              70.48   61.60   65.74        75.77
  SIWRsELMo                         69.74   64.47   67.00        –
  SIWRsPubMed                       –       –       –            79.30
  Improvement (Absolute / Relative) –       –       1.25 / 3.79%  2.2 / 9.6%

Table 3.6: Test set results with different embeddings over two relation extraction tasks. We use F1-score (%) for binaryRE and accuracy (%) for n-aryRE following previous work. The “Improvement” row lists the absolute and relative improvements over the base representations of SIWRs, i.e., SIWRsELMo for binaryRE and SIWRsPubMed/Static for n-aryRE.

in comparison with the baseline models using static embeddings (Static) and ELMo. For n-aryRE, unlike the other tasks, we employed the first layer of ELMo as the input representations because this layer provides the highest performance among the different layer compositions; the setting used for the other tasks performs worse than the first layer alone, 70.6% versus 75.8% accuracy. Adding SIWRs to the baselines improves the performance by 3–9% relative compared to the base

representations, i.e., SIWRsELMo versus ELMo for binaryRE, and SIWRsPubMed versus Static/PubMed for n-aryRE.

Furthermore, we compare the performance of using contextual embeddings as well as fine-tuning BERT in Table 3.7. With contextual embeddings, we add only k scalar

weights rather than fine-tuning a large pretrained LM (θ_LM) in downstream tasks, where k equals the number of intermediate layers so that k ≪ |θ_LM|. Once the representations are extracted, many experiments using less computationally intensive model architectures can be run on top of them. Moreover, our results indicate that pretrained embeddings can perform better than fine-tuning in terms of binaryRE. This may be due to the complexity of BERT compared to the small data size of the task, and the prerequisite syntactic information embedded in SIWRs. In other words, fine-tuning may require a task-specific model design to be added in order to fit downstream linguistic tasks.

We also report the performance of SIWRs compared with other representations

  RE Method               DEV                       TEST
                          P       R       F1        P       R       F1
  Static Embeddings
  Baseline                64.51   64.28   64.39     68.77   60.64   64.45
  Contextual Embeddings
  ELMo                    66.19   61.80   63.92     70.48   61.60   65.74
  SIWRsELMo               65.96   65.61   65.78     69.74   64.47   67.00
  BERT–feature            68.10   70.03   69.05     71.59   69.85   70.71
  SIWRsBERT               68.27   71.35   69.78     73.78   71.16   72.45
  Fine-tuning LM
  BERT–fine-tuning        63.37   65.78   64.56     66.02   67.68   66.84

Table 3.7: Comparison of contextual representations and fine-tuning a large-scale language model. The subscript indicates the base representations used in SIWRs.

          Previous work                                          Ours (Walk-based)
  Model   SPTree   Walk-based   CNN+RL   CNN+RL +BIO Tag         Base W2V   +E      +SE     +B-ft   +B-f    +SB
  P       70.1     69.7         58.8     61.3                    68.77      70.48   69.74   66.02   71.59   73.78
  R       61.2     59.5         57.3     76.7                    60.64      61.60   64.47   67.68   69.85   71.16
  F1      65.3     64.2         57.2     67.4                    64.45      65.74   67.00   66.84   70.71   72.45

Table 3.8: Binary relation extraction performance on the ACE2005 test set. The subscript indicates the base representations used in SIWRs. +E denotes that ELMo is used as word representations, +B-f is BERT-feature, +B-ft is BERT-fine-tune, and +SE and +SB are SIWRs with ELMo or BERT as base representations, respectively.

used in the binary relation extraction task (Table 3.8). SIWRs improved the performance of the baseline model using ELMo to 67% F1-score (an improvement of 1.3 percentage points overall). In comparison, fine-tuning BERT is less effective than our method in this case, i.e., our model produces comparable performance when enriching only ELMo (see §3.6.4 for analysis). Furthermore, we report a new SOTA

performance by using BERT as our base representation (SIWRsBERT), with an F1-score of 72.45%. This indicates that syntactic information is helpful even when using large-scale pretrained models. We also show the performance of a recent model, CNN+RL, proposed by Ye et al. (2019), which uses a ranking loss for relation classification. Unlike our experimental setting, they did not consider directionality when generating relation

candidates (i.e., relations from the first to the second argument or vice-versa) in the experiments on the English part of ACE2005. Not considering relation direction makes the task easier, hence they got higher performance than our baseline. This performance gap can be resolved by our SIWRs which provide syntactic dependencies to infer relation direction.

              Peng et al. (2018)        Ours – LSTM
  Model       LSTM    GraphLSTM         PubMed   ELMo (all)   ELMo (1st)   SIWRsPubMed (CoreNLP)   SIWRsPubMed (ScispaCy)
  Accuracy    75.3    77.9              77.2     70.6         75.8         78.7                    79.3

Table 3.9: N-ary relation extraction accuracy on the drug-gene-mutation data (Peng et al., 2018). The results of LSTM and GraphLSTM were reported in Peng et al. (2018).

Lastly, the results on the n-ary relation extraction task are shown in Table 3.9, where the effectiveness of our model in capturing syntactic information is demonstrated. We observe that ELMo did not perform well in this domain-specific task compared to static word representations. Hence, we included an experiment using the first layer representations from ELMo which are the combination of character and static word representations. Surprisingly, the static representations from ELMo performed comparably to the baseline model using GloVe. Meanwhile, using domain-specific word representations (PubMed) produced better results than general domain word representations. The baseline performance of using PubMed is comparable to the GraphLSTM model which included syntactic information. We followed previous work and used a general domain parser – Stanford CoreNLP. Our SIWRs outperformed the GraphLSTM by 0.9% and the base representations setting by 1.54% in terms of accuracy. We additionally applied the in-domain parser – ScispaCy to the dataset. As expected, our SIWRs further improved the performance to 79.3% accuracy. This indicates that our method can enrich word representations with both in- and out-of-domain parsers.

Overall, empirical results show the flexibility of our model to enrich static and contextualised word representations: PubMed and ELMo, respectively. The improvement over the static embeddings, i.e., PubMed, may be partially attributed to the introduction of context.

3.6 Analysis

We analysed our model with respect to the amount of pretraining data, the breakdown of contributions from each component, the need for syntax given the advent of large pretrained models, the number of model parameters and the computational environment.

Figure 3.5: Binary relation extraction performance (F1-score, %) of SIWRsELMo and ELMo with different numbers of pretraining sentences on the ACE2005 development set. 60k corresponds to about 0.2% of the 1B Word dataset. The exact numbers of sentences used are: 10,253 (10k); 20,506 (20k); 30,759 (30k); 41,012 (40k); 51,265 (50k); 61,326 (60k).

3.6.1 Effects of the Number of Pretraining Samples

We show the performance of SIWRsELMo with different numbers of pretraining sentences on the binaryRE development set in Figure 3.5. The learning curve reveals that a relatively modest number of pretraining sentences, i.e., 10k, is enough to improve the performance of downstream tasks, and that performance remains almost stable beyond 10k sentences. This confirms that our choice of 30k sentences for pretraining is reasonable.

3.6.2 Ablation Studies

We investigate the contributions and effects of the various components in our model by

evaluating the binaryRE performance using SIWRsELMo on the ACE2005 development set (Table 3.10).

–POS (POS tagging): We re-trained our model without POS tagging by considering only dependency parsing when computing the loss.

–DP (Dependency Parsing): Contrary to –POS, we re-trained our model without the dependency parsing loss and retained POS tagging.

  Model         P       R       F1
  SIWRsELMo†    65.96   65.61   65.78
  –POS          66.36   64.01   65.17
  –DP           67.30   59.86   63.36
  –SO           66.48   61.72   64.01
  –SI           66.86   62.25   64.47
  ELMo          66.19   61.80   63.92

Table 3.10: Binary relation extraction performance of ablated SIWRs variants on ACE2005 development set, i.e., without POS tagging, dependency parsing, sequential information and self-information in GCN encoder. † denotes significance at p < 0.05 compared to ELMo.

–SO (Sequential Order): We removed the sequential order connections in the GCN to measure their importance to the learning process.

–SI (Self-Information): We removed the self-information connections in the GCN to see whether a word can obtain its information only from the dependency context.

As we observe in Table 3.10, removing each of these components substantially reduced the performance. We performed the Approximate Randomisation test (Noreen, 1989) on the results to measure the difference among the different ablations.

The full setting of SIWRsELMo (POS, DP, SO and SI) is significantly different from ELMo with p < 0.05. This observation indicates the importance of the selected information added to our model, along with our learning objectives, for incorporating syntactic information into word representations. We also showed that merely adding randomly-initialised parameters is not enough to obtain better performance: when each component is removed, the significance test shows no significant difference compared to ELMo. To show the contribution of each component for relation extraction, we report some qualitative examples where SIWRsELMo can predict the illustrated relations but removing the corresponding component fails (Figure 3.6). In the first case, the POS sequence between the two entities (PER VB [IN—TO] (DT) GPE) can signal the relation between "players" and "canada". Secondly, sequence order (SO) along with POS reveals the implicit relationship between "U.S." and "Special Forces", while the sentence does not have a specific textual trigger. In the last case, the syntactic dependency structure of a sentence provides helpful clues for RE (Miwa and Bansal, 2016); self-information (SI) reveals a helpful multi-hop dependency path between entities by capturing direct dependent words in the first GCN layer and the multi-hop path in the second GCN layer.

Figure 3.6: Relation predictions obtained using the information of the component whose name is shown at the top left of each example in SIWRsELMo: (POS, PHYS) "major league baseball plans to warn its players heading to canada"; (SO, PART-WHOLE) "we had joined another convoy of U.S. Special Forces troops"; (DP/SI, PHYS) "... the equipment of choice for a task force like this to go into baghdad." Other POS tags, dependency connections and relations are omitted for simplicity.

However, as shown in Table 3.10, the precision when removing a component is higher than when including it. An explanation is that the patterns from the components can signal more relationships between entities and thus include more false positives. These false positives then lead to a drop in precision while increasing recall.

3.6.3 Impact of Syntactic Information

Figure 3.7: Comparison of binary relation extraction performance (micro F1) on entity pairs with different distances (in tokens) on the ACE2005 development set, using (left) SIWRsELMo and ELMo, and (right) SIWRsBERT and BERT-feature.

To evaluate the benefits of dependency information, we further analyse the binaryRE performance of SIWRsELMo and ELMo in capturing distantly-related entity pairs on the ACE2005 development set.

Figure 3.8: Relation prediction examples from the ACE2005 dataset for the sentence "Maybe some day we 'll be stuck in traffic in hydrogen-powered cars.": (a) ELMo, (b) the automatically-parsed dependency tree from CoreNLP, and (c) SIWRsELMo. PER and VEH denote the person and vehicle entity types, and ART stands for artifact. Other dependency connections and relations are omitted for simplicity.

Figure 3.7 shows that our SIWRsELMo performed better than ELMo in capturing long-distance context. As can be observed, for entity pairs with a distance shorter than 5, the performance of ELMo and SIWRsELMo is nearly the same. The performance gap then becomes wider as the distance between the entities grows. The performance gain of SIWRsELMo can be attributed to dependency-based context information, which is useful for detecting relations between distant entities. We present an example of such a case in Figure 3.8: using only ELMo, the downstream model could not capture the relation between PER ("we") and VEH ("cars"), whereas with SIWRsELMo it was able to detect the relation by leveraging the dependency connections. However, when the distance is longer than 14, the performance of SIWRsELMo drops to a level comparable to ELMo (Figure 3.7, left). This may be partly because such distances are beyond the length of the dependency paths we capture. In future work, co-reference as well as discourse connections can be considered to address long-distance dependencies. The right side of Figure 3.7 shows the comparison between BERT-feature and SIWRsBERT. Unlike ELMo, BERT considers the entire sentence for each word and thus does not receive much benefit from dependency connections. Surprisingly, BERT-feature shows better performance in Figure 3.7, although the performance of

SIWRsBERT is slightly higher (0.73%). One reason is that the number of long-distance

entity pairs is smaller than the number of short-distance ones, so a substantially higher F1-score for long-distance entities results in less improvement in the total F1-score.

3.6.4 Computational Cost

Table 3.11 compares the numbers of parameters used for training different representations and for applying them to downstream NLP tasks, including the parameters used for training SIWRs with different base representations. In general, with contextual representations, we require fewer parameters for training the downstream models, even compared to static word representations. In particular, our SIWRs require only L + 2 scalar numbers to be trained when applied in NLP

models, compared to the entire model parameters in fine-tuning methods, i.e., θ_BERT, where L + 2 ≪ |θ_BERT|. Most of the current work on adding a syntactic inductive bias into word representations or large pretrained models requires re-training the language models (Vashishth et al., 2019b; Peng et al., 2019; Lauscher et al., 2019). Although we need to train the SIWR model, we keep the computational costs down because we do not need to adapt the available well-pretrained models. Such training-from-scratch methods are costly when computational resources are limited.

  Model                      Downstream (Original)   SIWRs                            Downstream (SIWRs)
  word2vec, GloVe, PubMed    |V| × d_w               θ_SIWR + (|V| + 1) × d_w         +(1 + 2)
  ELMo                       θ_ELMo + L              θ_SIWR                           +(L + 2)
  BERT                       θ_BERT                  θ_BERT − θ_MaskedLM − θ_NSP      –

Table 3.11: Pretrained model parameters and downstream trainable parameters.

In particular, since the number of parameters for training the SIWR model is relatively small compared to BERT, we trained our model on a Tesla K20 (5GB GPU memory and CUDA compute capability 3.5). The other downstream models using static, contextual and SIWR representations were also trained using a Tesla K20. Meanwhile, as BERT required large memory for fine-tuning, we ran its corresponding models on a GTX 1080 Ti (11GB memory) for every downstream task. In this case, contextual representations and our SIWRs require fewer computational resources than the fine-tuning method (BERT).

Although our model is computationally inexpensive, our SIWRs can boost the performance of downstream linguistic tasks to be competitive with the fine-tuning model. This finding opens a potential direction for investigating the role of syntactic information and other rich linguistic information in general.

3.7 Related Work

The usefulness of explicitly inserting syntax into neural networks has become a controversial topic. Especially given the impressive results of PLMs, a growing number of studies question the role of syntax in such models. A group of studies showed that large-scale PLMs can implicitly encode some kind of syntax, including shallow syntactic information (Tenney et al., 2019; Lin et al., 2019) and distance in dependency trees (Hewitt and Manning, 2019; Kulmizev et al., 2020; Chi et al., 2020). In contrast, several approaches incorporate linguistic constraints (syntactic and lexical semantic knowledge) into word representations, showing improved performance. A set of studies incorporate syntax by training word representations from scratch (Levy and Goldberg, 2014; Vashishth et al., 2019a; Lauscher et al., 2019; Peng et al., 2019). Training from scratch or fine-tuning in a multitask learning setting is computationally expensive and time-consuming. Meanwhile, our method enriches pretrained word representations by injecting syntactic information, which requires fewer computational resources than training from scratch. Another line of models retrofits pretrained static embeddings with distance-based objectives (Bansal et al., 2014; Faruqui et al., 2015; Vulić, 2018; Vulić et al., 2018); these, however, cannot be directly applied to contextual word representations. A similar work to ours is that of Vashishth et al. (2019a), who also used a GCN to incorporate syntax into word representations. They trained a neural LM from scratch with dependencies and lexical structures, compressing all meanings of a word, including polysemous ones, into a single vector. By contrast, our work builds on existing pretrained LMs, which uses far less data and fewer computational resources than training from scratch. Other attempts present many promising ideas for explicitly injecting syntax into large-scale PLMs, i.e., BERT (Chrupała and Alishahi, 2019; Peng et al., 2019; Du et al., 2020; Kuncoro et al., 2020). A recent group of studies, also known as intermediate training, trains PLMs on one or more data-rich supervised tasks before fine-tuning for the target task, showing additional improvements (Phang et al., 2018; Glavaš and Vulić, 2020). Our method can be considered in this line of research. The closely related work of Glavaš and Vulić (2020)

trained BERT to perform dependency parsing on an English universal dependency treebank (Nivre et al., 2016). The difference of our work compared to theirs is that we used automatically-parsed corpora, while their data has gold annotations.

3.8 Conclusion

We proposed a novel method to include syntactic information in word representations, addressing our first hypothesis (H1). The enhanced representations are called syntactically-informed word representations (SIWRs). SIWRs were obtained by training a graph-based model to capture two types of syntactic information on the data: POS tags and dependencies. Our method incorporates such syntactic information without retraining language models, by leveraging existing well-pretrained models. The computational resources and cost required by our syntactically informative model during the training phase are lower than those of previous methods. We empirically demonstrated the contributions of including syntactic information in our experiments. In particular, SIWRs achieved gains over the base representations, i.e., ELMo and PubMed, on binary and n-ary RE, from the generic and biomedical domains, respectively. SIWRs show improvements over the base representations, achieving 6.64% error reduction in terms of F1-score for binaryRE and 6.98% error reduction in terms of accuracy for n-aryRE. The experimental results also demonstrated the flexibility and effectiveness of SIWRs in enriching different pretrained representations. In addition, we implemented BERT (Devlin et al., 2019) in both feature-based and fine-tuning settings on binaryRE.

We also employed BERT as the base representation in our SIWRsBERT for comparison. Surprisingly, our SIWRsBERT based on contextual BERT features even performed better than fine-tuning in binaryRE, with F1-scores of 72.45% and 66.84% respectively. Our extensive analysis also shows that the syntactic bias can be beneficial for subsequent NLP tasks. Although we only presented the effectiveness of our SIWR model on relation extraction tasks, our SIWRs can be applied to other downstream tasks; we include an evaluation on named entity recognition in Appendix A. We hope that this study will encourage the NLP community to revisit the incorporation of rich linguistic information in NLP tasks. Enriching word representations is an indirect approach to improving relation extraction. To address the task more directly using unlabelled text, we next investigate unsupervised relation extraction, in which we classify relationships between named entities without access to any relation annotation.

Chapter 4

Unsupervised Relation Extraction

Highlights in this chapter:

• Describe the current discrete-state variational auto-encoder (DVAE) framework for unsupervised relation extraction
• Propose a simple rule-based method to infer relations using only entity types, and a simple neural method taking only entity types as input
• Conduct an extensive analysis of the current experimental settings of neural-based unsupervised relation extraction

In the previous chapter, we introduced word representations from a graph-based neural model, which can take advantage of syntactic information. This method indirectly improves the performance of supervised relation extraction, which still relies on manually-labelled corpora. To reduce this manual effort, in this chapter we aim to perform relation extraction in an unsupervised manner, i.e., no relation labels are given during training. We first survey existing unsupervised relation extraction (URE) methods and conduct experiments using the current neural methods. We address our

two research questions (RQ2, RQ3) and a hypothesis (H2) in Chapter 1, stating that inductive biases, such as entity types, are crucial for URE. We demonstrate that by using only named entities to induce relation types, we can outperform existing methods on two popular datasets. We also conduct additional analysis of combining entity types with other commonly used features, such as the shortest dependency path between two entities, showing that using only entity types surprisingly performs better than any naive combination. This leads to our conclusion that inductive biases such as entity types should be included in unsupervised models for RE.


4.1 Motivation

Unsupervised relation extraction (URE) is the task of extracting relations between named entities from raw text without manually-labelled data or existing knowledge bases (KBs). URE is promising since it does not require manually annotated data nor human-curated KBs, which are expensive to produce. Thus, URE can be applied to domains and languages where annotated data and KBs are not available. Moreover, URE can discover new relation types, since it is not restricted to pre-defined relation categories as in other supervised methods. As we mentioned in §2.7.4, open information extraction (OIE) can also discover new relations. However, OIE extracts relations in predicate-argument form, in which similar semantic relation categories with varying textual expressions may not be grouped together. Unlike OIE, URE groups similar relations into clusters. Despite these advantages, URE methods have not been explored as much as fully or distantly supervised learning techniques. There are only a few attempts tackling URE using machine learning (ML) (Hasegawa et al., 2004; Banko et al., 2007; Yao et al., 2011; Marcheggiani and Titov, 2016; Simon et al., 2019). Similarly to other unsupervised learning tasks, a challenge in URE is how to evaluate results. Recent approaches (Yao et al., 2011; Marcheggiani and Titov, 2016; Simon et al., 2019) employ a data generation setting widely used in distantly supervised RE, i.e., aligning a large amount of raw text against triplets in a curated KB. A standard metric score is computed by comparing the output relation clusters against the automatically annotated relations. In particular, the evaluation dataset NYT-FB (Marcheggiani and Titov, 2016) was created by mapping relation triplets in Freebase (Bollacker et al., 2008) against plain-text articles in the New York Times (NYT) corpus (Sandhaus, 2008). We note that NYT-FB is different from the NYT Riedel 2010 dataset (Riedel et al., 2010). Standard clustering evaluation metrics for URE include B3 (Bagga and Baldwin, 1998), V-measure (Rosenberg and Hirschberg, 2007) and ARI (Hubert and Arabie, 1985). Although the above-mentioned experimental setting can be created automatically, there are three challenges to overcome. Firstly, the development and test sets of NYT-FB are silver, i.e., they include noisy labelled instances, since they are not human-curated. Secondly, the development and test sentences are part of the training set, i.e., a transductive setting; it is thus unclear how well the existing models perform on unseen sentences. Finally, NYT-FB can be considered highly imbalanced, since only 2.1% of the training sentences can be aligned with Freebase's triplets. Due to the noisy nature of silver data (e.g., NYT-FB), evaluation on silver data will not accurately

reflect the system performance. We also need unseen data during testing to examine the system's generalisation. To overcome these challenges, we will employ the test set of TACRED (Zhang et al., 2017b), a widely used manually annotated corpus. Regarding the imbalanced data, we will demonstrate that in fact around 60% (instead of 2.1%) of the instances in the training set express relation types defined in Freebase. In this work, we present a simple URE approach relying only on entity types that can obtain improved performance compared to current methods. Specifically, given a sentence containing two entities and their corresponding entity types, e.g., PERSON and LOCATION, we induce relations as the combination of entity types, e.g., PERSON-LOCATION. It should be noted that we employ only entity types because their combinations form reasonably coarse relation types (e.g., PERSON-LOCATION covers /people/person/place_of_birth defined in Freebase).

4.2 Background: Unsupervised Relation Extraction

We introduced unsupervised relation extraction (URE) in §2.7.4. This section serves as a recap, so that readers do not need to go back and forth to Chapter 2, and elaborates on related work.

The goal of URE is to predict the relation r between two entities e1 and e2 in a sentence s. The input x to the model is a tuple of (s,e1,e2). We will describe three related machine learning (ML) methods tackling URE. We note that all are generative approaches in terms of learning, though we categorise based on the way that we predict relations. URE methods can be categorised into generative and discriminative approaches, which rely either on hand-crafted features or surface form. In this chapter, we categorise a method as generative if the model predicts a relation based on p(r,x) and as discriminative if the relation prediction conditions on input p(r|x).

4.2.1 Generative Approach

Yao et al. (2011) extended topic modelling – Latent Dirichlet Allocation (LDA) (Blei et al., 2003) – for RE, developing two models, herewith RelLDA and RelLDA1. In both models, a sentence and an entity pair act as a document in topic modelling, while a relation type corresponds to a topic. RelLDA uses three features, i.e., the shortest dependency path between the two entities and the two entity mentions. RelLDA1 is an extension of RelLDA with five more features, i.e., the entity types, words and part-of-speech tags between the two entities.

4.2.2 Discriminative Approaches

Most neural-based RE follows the diagram below:

x = (s, e_1, e_2) → classifier → r

where x = (s, e_1, e_2) corresponds to the input sentence and the two entity mentions in the sentence, and r is the predicted relation. However, since URE does not have access to any relation label, an additional component, namely a link predictor, is introduced by Marcheggiani and Titov (2016) to provide a supervision signal to the relation classifier.

x = (s,e1,e2) → classifier → r → link predictor

At this point, the framework consists of two core components: a relation classifier and a link predictor. Any relation classifier can be used; it is expected to be discriminative, i.e., modelling p(r|x), and can be trained by backpropagating errors from the link predictor. The link predictor then uses the (soft) predicted relation r to predict the missing entity e_ĩ ∈ {e_1, e_2} \ {e_i}, where ĩ = 2 if i = 1 and vice versa.

Figure 4.1: The idea of a link predictor: predicting the missing entity (e.g., "Venezuela") given the other entity ("Hugo Chávez") and the predicted relation (e.g., per:country_of_birth).

As shown in Figure 4.1, we first take “Hugo Chavez” and the predicted relation as input, and try to predict “Venezuela” from a list of entities in the training set, and vice versa. In other words, entity prediction, in a self-supervised manner, provides training signals to learn the relation classifier. The URE task can be formulated as below:

p(e_ĩ | x, e_i) = \sum_r p(r|x) \, p(e_ĩ | r, e_i),   (4.1)

where the first factor is the relation classifier and the second factor is the link predictor.

During evaluation, only the relation classifier p(r|x) is used.
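The following toy example illustrates the marginalisation in Eq. (4.1); all probabilities are made up for illustration.

```python
import numpy as np

# Hypothetical distributions for one instance with 3 relation types
# and 4 candidate entities.
p_r_given_x = np.array([0.7, 0.2, 0.1])        # relation classifier p(r|x)
p_e_given_r_e2 = np.array([                     # link predictor p(e1|r, e2)
    [0.6, 0.2, 0.1, 0.1],
    [0.1, 0.5, 0.2, 0.2],
    [0.25, 0.25, 0.25, 0.25],
])

# Eq. (4.1): marginalise over relations to score the missing entity.
p_e1 = p_r_given_x @ p_e_given_r_e2             # shape (4,)
print(p_e1, p_e1.sum())                         # sums to 1
```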

Relation classifier. Marcheggiani and Titov (2016)'s encoder and classifier is a multinomial logistic regression (softmax regression). The model receives features extracted from the input sentence, including the bag of words between the two entities, the words on the dependency path between them, a trigger, and the two entity types, in order to predict the relation. In contrast, Simon et al. (2019) replaced the softmax regression with a piecewise convolutional neural network (PCNN) that takes only the raw sentence as input. The authors replaced the entity mentions with a special token "[MASK]", making the independence hypothesis that the context of the two entities alone can reveal the relation. Thus, in the Simon model, the relation classifier conditions on the sentence

context s without e_1, e_2, resulting in the relation probability p(r|s). However, for legibility, we write the following equations as conditioning on x (Marcheggiani and Titov, 2016).

Link predictor. The relation classifier p(r|x) and the link predictor p(e_ĩ | r, e_i) are jointly trained to reconstruct the missing entity. It is crucial that the link predictor does not have direct access to the input, so that the essential information has to be encoded into r, which serves as a bottleneck. This bottleneck encourages the classifier to predict semantic relations between entities rather than assigning random information. Assuming that we

are predicting e1, the probability of the missing entity e1 given the predicted relation r and the entity e2 is defined as follows.

p(e_1 | r, e_2) ∝ exp(ψ(e_1, r, e_2)),   (4.2)

where ψ is a scoring function, which is the sum of two relational learning models: RESCAL (Nickel et al., 2011) and selectional preferences (Riedel et al., 2013). We note that other relational models could also be used here. The following computation is used in previous work and in our implementation:

ψ(e_1, r, e_2) = u_{e_1}^T A_r u_{e_2} + u_{e_1}^T B_r + u_{e_2}^T C_r,   (4.3)

in which the first term corresponds to RESCAL and the last two terms to selectional preferences,

where u ∈ R^{|E| × d_e} is an entity embedding matrix, A ∈ R^{|R| × d_e × d_e} is a three-way tensor scoring the interaction of the two entities, B, C ∈ R^{|R| × d_e} are two matrices computing the selectional preferences of each relation, and d_e is the dimension of an entity embedding. As the set of entities is large, the partition function of Eq. (4.2) cannot be computed efficiently. In order to avoid the sum over all entities, negative sampling proposed by Mikolov et al. (2013b) is employed. The model is trained by discriminating the correct

triplet (e_1, r, e_2) from the corrupted ones (e', r, e_2) or (e_1, r, e'). The negative log-likelihood can be written as follows:

L_{LP} = E_{(x, e_1, e_2) ∼ X, \, r ∼ classifier(x)} \Big[ −2 \log σ(ψ(e_1, r, e_2))
          − \sum_{e' ∈ E} \log σ(−ψ(e_1, r, e'))
          − \sum_{e' ∈ E} \log σ(−ψ(e', r, e_2)) \Big],   (4.4)

where σ(x) = 1/(1 + e^{−x}), X is the dataset, and E is a random sample of k negative entities drawn from the distribution of entities raised to the 3/4 power, p(e)^{3/4}.
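The sketch below illustrates the scoring function of Eq. (4.3) and the negative-sampling loss of Eq. (4.4) for a single, hard-assigned relation r (in the actual objective, r is drawn from the classifier's distribution); tensor sizes and indices are illustrative only.

```python
import torch
import torch.nn.functional as F

def psi(u, A, B, C, e1, r, e2):
    """Scoring function of Eq. (4.3): RESCAL + selectional preferences."""
    rescal = u[e1] @ A[r] @ u[e2]
    sel_pref = u[e1] @ B[r] + u[e2] @ C[r]
    return rescal + sel_pref

def link_predictor_loss(u, A, B, C, e1, r, e2, neg_entities):
    """Negative-sampling objective of Eq. (4.4) for one instance and one relation."""
    loss = -2.0 * F.logsigmoid(psi(u, A, B, C, e1, r, e2))
    for e_neg in neg_entities:
        loss = loss - F.logsigmoid(-psi(u, A, B, C, e1, r, e_neg))
        loss = loss - F.logsigmoid(-psi(u, A, B, C, e_neg, r, e2))
    return loss

# Toy setup: 10 entities, 4 relations, embedding size 8.
num_e, num_r, d_e = 10, 4, 8
u = torch.randn(num_e, d_e, requires_grad=True)
A = torch.randn(num_r, d_e, d_e, requires_grad=True)
B = torch.randn(num_r, d_e, requires_grad=True)
C = torch.randn(num_r, d_e, requires_grad=True)

loss = link_predictor_loss(u, A, B, C, e1=0, r=2, e2=3, neg_entities=[5, 7])
loss.backward()
```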

The link predictor loss L_{LP} (Eq. (4.4)), however, depends on the accuracy of the relation classifier, while learning the relation classifier requires a good link predictor. This leads to a causality dilemma: the classifier might be uncertain about which relation is expressed, or it might predict the same relation for all instances. To prevent the classifier from always predicting the same relation, Marcheggiani and Titov (2016) proposed a regulariser that forces the relation distribution over the dataset to be close to the uniform distribution.

L_{reg} = E_{(x, e_1, e_2) ∼ X} \big[ −H(R | x, e_1, e_2) \big],   (4.5)

where R corresponds to the predicted relation.

Differently, Simon et al. (2019) proposed two regularisers: skewness (L_S) and dispersion (L_D). Skewness (L_S) computes the entropy of the relation distribution predicted by the classifier. When an optimiser tries to minimise the skewness loss, it consequently pushes the classifier towards predicting one relation for each instance; in other words, the classifier becomes more confident about its prediction. We note that L_S is equivalent to the negation of L_{reg}. On the other hand, dispersion (L_D) encourages the classifier to predict multiple relation types over the dataset. The two terms are presented in the following equations:

L_S = E_{(x, e_1, e_2) ∼ X} \big[ H(R | x, e_1, e_2) \big],   (4.6)

L_D = D_{KL}(p(R) \,||\, U),   (4.7)

where p(R) is the prior relation distribution, which in practice is estimated at the level of a mini-batch, and U is the uniform distribution. We refer to the model of Marcheggiani and Titov (2016) as March and the model of Simon et al. (2019) as Simon.
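A small sketch of how the two regularisers in Eqs. (4.6)–(4.7) can be estimated from a mini-batch of classifier outputs is shown below; it is an illustration rather than the released implementation.

```python
import torch

def skewness_and_dispersion(p_r_given_x, eps=1e-12):
    """L_S and L_D of Eqs. (4.6)-(4.7), estimated over a mini-batch.

    p_r_given_x: (batch, num_relations) relation distributions from the classifier.
    """
    # L_S: average entropy of the per-instance relation distribution.
    entropy = -(p_r_given_x * (p_r_given_x + eps).log()).sum(dim=1)
    l_s = entropy.mean()

    # L_D: KL divergence between the batch-level prior p(R) and the uniform distribution.
    p_prior = p_r_given_x.mean(dim=0)
    uniform = torch.full_like(p_prior, 1.0 / p_prior.numel())
    l_d = (p_prior * ((p_prior + eps) / uniform).log()).sum()
    return l_s, l_d

# Example: batch of 3 instances over 4 relation types (softmax outputs)
probs = torch.softmax(torch.randn(3, 4), dim=1)
l_s, l_d = skewness_and_dispersion(probs)
```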

4.3 Our Methods

Figure 4.2: Intuition of using entity types: the entity-type pair (Person, Country) in "Hugo Chávez was born in Venezuela" induces a coarse relation that covers per:country_of_birth, per:country_of_death and per:countries_of_residence.

We introduce two entity-based methods, herewith EType and EType+. Entity types have been shown to be beneficial for RE in both supervised learning (Zhang et al., 2017b) and distant learning (Ren et al., 2017). In URE, previous work (Yao et al., 2011; Marcheggiani and Titov, 2016) also used entity types. Furthermore, as illustrated in Figure 4.2, entity types can induce coarse relation types which cover the correct fine-grained relation. Our first method, EType, performs relation classification in this way. In particular, given only two entity types a, b in the entire dataset, EType defines two relations ab and ba; if the input head and tail entity types are a and b respectively, then EType outputs ab as the relation. This leads to a fixed number of relation types equal to the square of the number of entity types; for instance, 4 entity types lead to 4^2 = 16 relation types. We note that EType is a rule-based method that considers no features other than the entity types given in a dataset, either gold or automatically typed using a named entity recogniser. To extract an arbitrary number of relation types, we build EType+, which is trained by the link predictor. EType+ replaces the relation classifier of Marcheggiani and Titov (2016) with a single-layer perceptron taking the two entity types as features; no other feature is used.

p(r | x, e_1, e_2) = softmax_r(W v_e + b),   (4.8)

where v_e ∈ R^{|ET|} is the two-hot encoding of the entity types corresponding to the two entities, W ∈ R^{|R| × |ET|} is the relation matrix associated with entity types, b ∈ R^{|R|} is a bias, and |R| is the number of predicted relation types. The model can be considered a multinomial logistic regression. We then employ the link predictor used in March and the two regularisers used in Simon to produce a new method, herewith EType+.
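The sketch below shows a minimal version of the EType+ classifier of Eq. (4.8), using the four entity types shared by NYT-FB and TACRED and an arbitrary number of output relations; the class name and the number of relations are illustrative.

```python
import torch
import torch.nn as nn

ENTITY_TYPES = ["PERSON", "ORGANIZATION", "LOCATION", "MISC"]

class ETypePlus(nn.Module):
    """Single-layer perceptron over the two-hot entity-type encoding of Eq. (4.8)."""
    def __init__(self, num_entity_types: int, num_relations: int):
        super().__init__()
        self.linear = nn.Linear(num_entity_types, num_relations)   # W and b

    def forward(self, head_type: int, tail_type: int):
        v_e = torch.zeros(len(ENTITY_TYPES))
        v_e[head_type] += 1.0
        v_e[tail_type] += 1.0              # two-hot encoding of the pair's entity types
        return torch.softmax(self.linear(v_e), dim=-1)   # p(r | x, e1, e2)

model = ETypePlus(len(ENTITY_TYPES), num_relations=10)
p_r = model(ENTITY_TYPES.index("PERSON"), ENTITY_TYPES.index("LOCATION"))
```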

4.4 Experimental Settings

4.4.1 Evaluation Metrics

We use the following evaluation metrics for our analysis: a) B3 (Bagga and Baldwin, 1998), used in previous work, which is the harmonic mean of precision and recall for the clustering task; b) V-measure (Rosenberg and Hirschberg, 2007); and c) ARI (Hubert and Arabie, 1985), used in Simon et al. (2019).1 V-measure is analysed in terms of homogeneity and completeness, while ARI measures the pairwise similarity between two clusterings. We note that V-measure is sensitive to the dependency between the number of clusters and the number of instances; a relatively small number of clusters compared to the number of instances should be used to maintain comparability when using V-measure. In particular, we evaluated the V-measure of the trivial homogeneity, where there are only singleton clusters (i.e., each instance is its own cluster). The V-measure of the trivial homogeneity on NYT-FB reached 43.77%, which is higher than all the methods implemented in this study and in previous work. Meanwhile, neither B3 nor ARI encounters this problem. However, ARI is shown to be suitable when there are large equal-sized clusters (Romano et al., 2016), while relation datasets, including both NYT-FB and TACRED in this study, are generally imbalanced.
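For reference, the snippet below computes V-measure (with its homogeneity and completeness components) and ARI with scikit-learn, as noted in footnote 1; the gold and predicted labels are made up, and B3 would require a custom implementation.

```python
from sklearn.metrics import (adjusted_rand_score, completeness_score,
                             homogeneity_score, v_measure_score)

# Hypothetical gold relation labels and predicted cluster ids for 8 instances.
gold = [0, 0, 1, 1, 1, 2, 2, 2]
pred = [1, 1, 0, 0, 2, 2, 2, 2]

print("homogeneity :", homogeneity_score(gold, pred))
print("completeness:", completeness_score(gold, pred))
print("V-measure   :", v_measure_score(gold, pred))
print("ARI         :", adjusted_rand_score(gold, pred))
# B3 precision/recall is not provided by scikit-learn and needs custom code.
```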

4.4.2 Datasets

                NYT-FB (|R| = 262)                      TACRED (|R| = 41)
                Train       Dev       Test              Train    Dev     Test
Raw instances   1,950,557   389,819   1,560,738         68,124   22,631  15,509
Positive        41,685      7,786²    33,808            13,012   5,436   3,325

Table 4.1: The statistics of the NYT-FB and TACRED datasets. |R| indicates the number of relation types in each dataset.

We employed NYT-FB for training and evaluation, following previous work (Yao et al., 2011; Marcheggiani and Titov, 2016; Simon et al., 2019). According to the statistics, only 2.1% of the sentences in NYT-FB were aligned against Freebase's triplets, which raises the concern of whether this dataset contains enough positive

¹ We used the sklearn.metrics package to compute V-measure and ARI.
² This is a typo in our published paper, which stated 7,793.

sentences for a model to learn relation types from Freebase. We thus examined 100 randomly chosen instances from the 1.86 million non-aligned sentences. We found that 61% of them (or 60% of the whole dataset) express relation types defined in Freebase. This suggests that the NYT-FB dataset can be employed to train a relation extractor. However, there are two further issues when evaluating URE methods on NYT-FB. Firstly, the development and test sets are automatically-aligned sentences without human curation, leading to wrongly/noisily labelled instances in the data. In particular, we found that 35 out of 100 randomly chosen sentences were given incorrect relations. Secondly, the two validation sets, i.e., the development and test sets, are part of the training set. This setting is clearly not inductive, as it does not evaluate how a model performs on unseen sentences. Therefore, we additionally evaluate all methods (except topic modelling) on the test set of TACRED (Zhang et al., 2017b), a widely used manually annotated corpus for supervised RE. Table 4.1 shows the statistics of the NYT-FB (Marcheggiani and Titov, 2016) and TACRED (Zhang et al., 2017b) datasets. We used the preprocessed data from Marcheggiani and Titov (2016). We note that entity mentions and their semantic categories are given in both datasets, and were automatically annotated (Finkel et al., 2005; Manning et al., 2014). Both datasets have 4 overlapping entity categories: PERSON, ORGANIZATION, LOCATION and MISC. For all methods, we trained on NYT-FB and evaluated on both NYT-FB and TACRED. We found that the 15 most frequent of the 262 relations account for 82.97% of the total number of instances in NYT-FB. Meanwhile, the 15 most frequent of the 41 relations account for 74.94% of the total number of instances in TACRED. This indicates the imbalanced nature of both datasets.

4.4.3 Model Settings

We implemented all neural methods using the PyTorch library (Paszke et al., 2019); our implementation is publicly available.³

Hyper-parameters We employ March and Simon with the reported hyper-parameters (Yao et al., 2011; Marcheggiani and Titov, 2016; Simon et al., 2019). Meanwhile, for RelLDA1, we only report the results from the original paper (Yao et al., 2011). For comparison, we also evaluate March with the two regularisers of Simon, namely March (Ls+Ld). To evaluate on TACRED, we employed the original March with the number of relations set to |R| = 100, using the published repository written in Theano.⁴

³ https://github.com/ttthy/ure

Parameter Ls Ls + Ld Parameter Value Parameter Value Optimiser AdaGrad Optimiser Adam Number of epochs 10 Optimiser Adam Learning rate 0.001 Batch size 100 Learning rate 0.005 0.25 Batch size 100 L2 regularisation 1e-7 Learning rate annealing 0.5 Early stop patience 10 Feature dimension 10 Batch size 100 L2 regularisation 1e-5 Learning rate 0.1 0.005 Early stop patience 10 Entity type dimension 10 L2 regularisation 2e-11 Ls coefficient 0.1 0.01 Ls coefficient 0.0001 Word dimension 50 Ld coefficient – 0.02 Ld coefficient 0.02 Entity type dimension 10 Ls coefficient 0.01 Ld coefficient 0.02. (a) EType+. (b) Marcheggiani and Titov (2016)'s model. (c) Simon et al. (2019)'s model.

Table 4.2: Hyper-parameter values used in our experiments.

Meanwhile, for March (Ls+Ld) and Simon, we reimplemented the models and evaluated them on TACRED. Regarding our methods, EType has no hyper-parameters, while EType+ uses the same optimiser and entity type dimension as Simon. We used the development set to stop the training process. For every model, we conducted three runs with different initialised parameters and computed the average performance. We list the hyper-parameters of the different models in Table 4.2.

4.5 Results and Discussion

4.5.1 Results

Table 4.3 reports the average performance of our methods across three runs in comparison with the three ML models on NYT-FB and TACRED. Our models outperform the best performing system of Simon et al. (2019) on both datasets, except for ARI on NYT-FB. As mentioned in §4.4.1, ARI might not be appropriate for comparing methods on imbalanced datasets. In addition, the ML methods consistently exhibit lower performance on TACRED than on NYT-FB. Overall, our models outperform previous methods despite being simpler. However, we note that the two models proposed by Marcheggiani and Titov (2016) and Simon et al. (2019) are sensitive to the hyper-parameters and thus difficult to train. We could not replicate the performance of Simon on the NYT-FB dataset.

⁴ github.com/diegma/relation-autoencoder

Model                 B3                      V-measure               ARI
                      F1     P      R         F1     Homo   Comp
NYT-FB
 |R| = 10
  RelLDA              29.1   24.8   35.2      30.0   26.1   35.1      13.3
  RelLDA1             36.9   30.4   47.0      37.4   31.9   45.1      24.2
  March (Ls+Ld)       37.5   31.1   47.4      38.7   32.6   47.8      27.6
  March (Ls+Ld)       38.7   30.9   51.7      37.6   31.0   47.7      26.1
  Simon               39.4   32.2   50.7      38.3   32.2   47.2      33.8
  Simon               32.6   28.2   38.9      30.5   26.1   36.8      23.8
  EType+              41.9   31.3   63.7      40.6   31.8   56.2      30.7
 |R| = 16
  March (Ls+Ld)       36.9   32.0   43.7      37.4   32.6   43.9      28.1
  EType               41.7   32.5   58.0      42.1   34.7   53.6      30.7
  EType+              41.5   32.0   59.0      41.3   33.6   53.9      30.5
 |R| = 100
  RelLDA1             29.6   –      –         –      –      –         –
  March               35.8   –      –         –      –      –         –
  March               34.8   24.4   62.4      25.9   18.7   42.7      13.1
TACRED
 |R| = 10
  March (Ls+Ld)       31.0   21.7   54.9      43.8   35.5   57.2      22.6
  Simon               15.7   12.1   22.4      17.1   14.6   20.6       6.1
  EType+              43.3   28.0   96.9      59.7   43.4   96.0      25.7
 |R| = 16
  March (Ls+Ld)       34.6   24.3   61.3      47.6   38.9   61.4      23.2
  EType               48.3   32.3   96.3      64.4   48.6   95.6      29.1
  EType+              46.1   30.3   96.9      62.0   45.8   96.1      27.4
 |R| = 100
  March               33.13  21.83  69.20     43.63  32.96  64.66     20.21

Table 4.3: Average results (%) across three runs of different models (except the rule-based EType) on two datasets: the distantly supervised NYT-FB and the large supervised dataset TACRED. The model of Marcheggiani and Titov (2016) is March and the model of Simon et al. (2019) is Simon. |R| indicates the number of clusters/relation types used in each group of methods. Where a model is listed twice within a group, the rows correspond to previously reported results and to our implementation of the corresponding model. We note that all methods were trained on NYT-FB and evaluated on the test sets of both NYT-FB and TACRED.

4.5.2 Analysis

4.5.2.1 Do ML models employ inductive biases supporting relation extraction?

In common with other unsupervised learning approaches, there is no guarantee that a URE model would learn the relation types defined in KBs and/or annotated data. A common solution is to employ inductive biases (Wagstaff, 2000) to guide the learning process towards desired relation types. Inductive biases can emanate from pre-processed data. Since our models outperform other methods, we conclude that entity type information alone constitutes a better bias than the biases employed by existing ML models.

Indeed, entity types constitute a useful bias for this task. Among the topic modelling-based methods, RelLDA1, a model that uses entity types, outperforms RelLDA, a model that does not. In a separate experiment, we found that adding entity types to the Simon model achieved higher performance than the original version, i.e., 42.74% vs. 39.4% B3 F1 on the NYT-FB test set. However, although both RelLDA1 and March also employ entity types, their performance is still lower than ours. This may be because the other syntactic and word features used in these two models cancel out the useful bias of entity types.

Figure 4.3: Abstract idea of testing the link predictor. We replace the relation classifier in the discrete-state variational auto-encoder with various relation input settings.

Figure 4.4: Average negative log-likelihood losses across three runs of the link predictor on the training data (not including negative instances). Each line corresponds to a different relation input setting (Rand10, Rand10 with silver frequencies, One relation, EType, Silver relations (10), Silver relations (full)).

Inductive biases can emanate from training signals. March and Simon are trained via a link predictor, which provides indirect signals from predicting entities. Hence, the question here is: can the link predictor induce good training signals? To answer this, we examine the link predictor with alternative settings on the NYT-FB dataset; the intuition is illustrated in Figure 4.3. Instead of getting soft relations from a relation classifier, we directly define the relation of a given input as follows (a small sketch of these settings is given at the end of this subsection):

• Rand10 randomly assigns one of ten relation types to each entity pair;
• Rand10 with silver frequencies, similar to Rand10, randomly generates relation types but follows the silver relation distribution;
• One relation assumes all entity pairs share the same relation type;
• EType uses 16 relation types induced from the 4 coarse entity types;
• Silver relations (10) takes the top nine most frequent relation types and groups the rest together to form the tenth relation type;
• Silver relations (full) considers the full set of (silver) annotated relations, i.e., 262 relation types in NYT-FB.

Figure 4.4 illustrates the average loss values under these settings. If high-quality relations were critical for training the link predictor, we would expect lower losses when using annotated relations. As observed from the figure, the loss curve of the silver relation types is consistently below the others. This implies that the link predictor is able to provide reasonable signals for training a relation classifier. So why are the Simon and March models outperformed by our models? As mentioned in Simon et al. (2019), the link predictor itself cannot be trained without a good relation classifier. This suggests that the relation classifiers in both methods need to be improved. Empirical evidence shows that both the Simon and March models are outperformed (in B3 and V-measure) by our EType+, which uses the same link predictor. We also notice that One relation and EType eventually share similar performance. This might imply that the link predictor is very expressive: only one relation (matrix) is needed to overfit head/tail entity pairs. However, the silver relations are clearly helpful, because during the first 15 epochs their losses are much lower than those of the other settings.
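The sketch below shows one way the fixed relation assignments above could be constructed; the function names and the silver-label interface are ours for illustration and are not part of the released implementation.

import random
from collections import Counter

def rand10(num_instances):
    return [random.randrange(10) for _ in range(num_instances)]

def rand10_silver_freq(num_instances, silver_labels):
    # sample relation ids following the silver relation distribution
    rels, weights = zip(*Counter(silver_labels).items())
    return random.choices(rels, weights=weights, k=num_instances)

def one_relation(num_instances):
    return [0] * num_instances

def etype(type_pairs, entity_types=("PER", "ORG", "LOC", "MISC")):
    # 4 coarse entity types induce 16 relation ids, one per ordered type pair
    index = {t: i for i, t in enumerate(entity_types)}
    return [index[h] * len(entity_types) + index[t] for h, t in type_pairs]

def silver_top10(silver_labels):
    # keep the nine most frequent silver relations; merge the rest into a tenth class
    top9 = [r for r, _ in Counter(silver_labels).most_common(9)]
    mapping = {r: i for i, r in enumerate(top9)}
    return [mapping.get(r, 9) for r in silver_labels]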

4.5.2.2 Why was the performance on TACRED lower?

Despite the fact that TACRED shares similar relation types with Freebase, we observed that both the March and Simon models consistently perform worse on the TACRED dataset. In particular, the Simon model yields significantly worse performance on TACRED, with 15.7% B3, less than half of its score on NYT-FB (39.4%). The performance drop of both approaches might be attributed to the distributional shift between the two datasets: variation and semantic shift in vocabulary and language structure over time, since NYT was collected long before TACRED.

4.5.2.3 How is the performance when combining entity types with other features?

Our experiments using only entity types surprisingly outperform previous state-of-the-art methods, including feature engineering and deep learning models. However, we know that context information is crucial to distinguish the relation between two entities, as many RE studies have been proposed to integrate context to improve RE performance.

Model        B3                      V-measure               ARI
             F1     P      R         F1     Homo   Comp
EType+       42.5   30.3   70.8      40.1   29.9   60.9      29.2
 +Entity     40.5   30.7   59.7      39.9   32.0   53.1      28.6
 +BOW        37.7   30.2   50.7      38.0   31.4   48.2      20.5
 +DepPath    41.4   30.3   65.9      39.4   30.2   57.0      26.7
 +POS        41.6   30.9   63.6      40.4   31.8   55.6      27.8
 +Trigger    41.7   31.3   63.0      41.3   32.6   56.4      29.0
 +PCNN       40.8   30.2   63.1      39.6   30.6   55.8      27.1

Table 4.4: Study of EType+ in combination with different features. The results are averaged across three runs on the development set.

We conduct experiments combining entity types with common features for RE, reported in Table 4.4. The list of features includes:

• Entity: textual surface form of two entities,
• BOW: bag of words between two entities,
• DepPath: words on the dependency path between two entities,
• POS: part-of-speech tag sequence between two entities,
• Trigger: DepPath without stop words.

In general, naively combining entity types with other features could not improve the model performance. Additionally, the BOW feature had negative effects on RE performance. This indicates that the bag of words between two entities often includes uninformative and redundant words, i.e., noise, that are difficult to eliminate using simple models. Furthermore, the precision (P) and homogeneity (Homo) scores when using triggers are higher than those of EType+. This is reasonable and expected, as a trigger may indicate the relation type explicitly, e.g., "born" is the trigger between "Hugo Chavez" and "Venezuela" in Figure 4.2. While the above features are widely used hand-crafted features for RE, we also incorporated a neural context encoder, PCNN, which combines Simon's PCNN encoder with the entity masking and position-aware attention proposed in Zhang et al. (2017b). However, the performance when combining PCNN is also lower than when using entity types alone. We leave exploring the best performing model for future work.

4.6 Conclusion

We have shown the importance of entity types in unsupervised relation extraction (URE). Our methods use only entity types, yet they yield higher performance than previous work on both NYT-FB and TACRED. These surprising results raise questions about the current state of unsupervised relation extraction. We analysed the experimental setting, including evaluation metrics, datasets, and the supervision signal. We showed that the link predictor provides a good signal to train a URE model. We also illustrated that entity types are a strong inductive bias for URE. The task of URE remains challenging and requires improved methods to deal with silver data.

Our experiments focused on unsupervised relation extraction without a predefined relation set. Although this setting can discover new relations, further processing is required to name the relation classes. In practice, however, when we want to extract relations from text, we usually already have a set of desired relation types in mind. Additionally, we are likely to have one or a few simple examples (exemplars) of the desired relations. The set of desired relation types and their exemplars can be treated as weak supervision for training a relation extraction model; we study this in more detail in the following chapter. In particular, we will use pretrained language models to provide noisy labels for relation extraction via similarity matching. The resulting data are then used to train a relation classifier, in which we propose a noise-aware mechanism to deal with noisy labels.

Chapter 5

Language Models as Weak Supervision

The previous chapter studied the task of unsupervised relation extraction (URE). Since it can extract relations without labelled data, URE is interesting for research, but it is far from being directly applicable. The main reasons are that (i) the number of relation categories in URE is relatively small, and (ii) despite discovering new relations, we still need to manually assign relation categories to individual clusters. Inspired by recent evidence that language models (LMs) capture some relational facts as in knowledge bases (KBs), our work investigates whether LMs can provide weak supervision for relation classification (RC). We first employ LMs as annotators, matching raw sentences with exemplars of desired relation types. We propose an auto-encoder with a noisy channel, namely NoelA, to learn from the noisy data. Our experiments on TACRED and reWiki show promising results, as NoelA outperforms the BERT annotator, which in turn outperforms the baselines. Our results indicate the possibility of using LMs as annotators.

5.1 Motivation

Recent studies (Radford et al., 2019; Petroni et al., 2019; Jiang et al., 2020) show that large-scale pretrained language models (PLMs) such as GPT2 and BERT capture some relational facts as found in knowledge bases. They proposed to probe factual and commonsense information from PLMs. In particular, they used an LM to answer questions related to a certain piece of information to test whether the LM captures such information. The work of Petroni et al. (2019) uses cloze questions such


as "Dante was born in ___". If an LM successfully answers "Florence", it is said to "capture" the fact born_in(Dante, Florence). As a result, LMs can support commonsense reasoning (Trinh and Le, 2018; Schick and Schütze, 2020) and also fact extraction from text (Bouraoui et al., 2020). A common idea shared among the aforementioned approaches is the use of cloze questions where entities are masked and expressed as single tokens. Differently, we see the potential use of LMs from another perspective. We investigate whether the implicit relational facts in PLMs can be used as weakly-supervised signals to train a relation classifier, similar to distant learning. To this end, we attempt

to address two main research questions (RQ4, RQ5):

RQ4 Is it possible to use pretrained language models to annotate relations on raw text without training?

RQ5 Can modelling the confusion between similar relations be beneficial for identifying interactions of entities in text?

For simplicity, following Han et al. (2018b) and Baldini Soares et al. (2019), we consider only sentences that express one of the predefined relation types. We refer to this setting as relation classification (RC), to distinguish it from relation extraction, which also involves detecting whether an input expresses a relation or not.

Figure 5.1: Language models as weak supervision for relation classification. A pretrained language model scores the input sentence "Murat Kurnaz, a Turkish national who was born and grew up in Germany" against relation exemplars (e.g., "Obama was born in US." for per:country_of_birth, "William Penn founded Pennsylvania." for org:founded_by, "Marie Curie is married to Pierre Curie." for per:spouse); the highest-scoring relation (per:country_of_birth) is kept as weak supervision data for training a relation classifier.

To address the RQ4 research question, we employ PLMs to annotate sentences with given entity mentions. For this purpose, we require different techniques than 5.1. MOTIVATION 121

the previous single-word probing, because the surface form of a relation type or an entity mention often consists of multiple words. Figure 5.1 illustrates our intuition, where a PLM is used to annotate relations between entities in a sentence by matching the sentence with exemplars of individual relation types. A set of predefined relation types and a simple exemplar for each relation are required for the annotation process. This process is similar to recruiting human annotators, where we also need to provide annotation guidelines with simple expected examples. Next, we compute the score of assigning a relation type to a given sentence by matching the sentence with the exemplars. As shown in Figure 5.1, the box "Relation exemplars" contains one simple exemplar for each relation type. We compute the similarity scores between the input sentence "[Murat Kurnaz], a Turkish national who was born and grew up in [Germany]" and those exemplars, resulting in a list of relation scores. We treat these scores as the likelihood of the sentence expressing a particular relation between the given entities. Since the exemplar "[Obama] was born in [US]." results in the highest score, we assign the relation country_of_birth to the input sentence. In this setting, we refer to the PLM as an annotator.

Although an LM-based annotator can be used directly for RC, it is uncertain whether the resulting data can be used to train relation extraction models that reach higher accuracy. We hypothesise that we can deal with the uncertainty of automatic annotation using inductive biases and learning-with-noise techniques. We propose NoelA (short for Noisy Channel Auto-encoder), which employs two mechanisms to prevent over-fitting to noisy annotations. Since entity types have been shown to be helpful for RC (Hancock et al., 2018; Ma et al., 2019; Tran et al., 2020b), NoelA reconstructs the entity types of the two input entities so that the entity type bias is used when predicting relations. As the second mechanism, we use a noisy channel (Sukhbaatar et al., 2014; Goldberger and Ben-Reuven, 2016; Wang et al., 2019a) to explicitly model the noise, i.e., computing the probability of a noisy label given an unknown correct label.

As a result, we address our second research question (RQ5). We conducted experiments on two RC datasets that have significantly different characteristics: the relation type distribution (skewed and uniform), the number of relation types, and the source of text (news and encyclopedia). To answer the first question, we carried out experiments to demonstrate the annotation capability of LMs. We also show that NoelA can reduce the negative impact of the noisy labels, which addresses our second question. These results demonstrate the potential of using LMs as weak supervision for RC.

Task Formulation

The task of relation classification focuses on classifying associations between entity pairs into a set of (predefined) relation types. It is worth noting that this set does not include a "no relation" type; all entity pairs are assumed to express one of the relation types.

We denote by R = {r_1, r_2, ..., r_m} the set of m relation types. Given a sentence s of n words, s = (w_1, ..., w_n), two entities h, t (namely the head and tail entities, respectively), and their corresponding semantic types e_h, e_t, the task is to identify the relation r from the set of predefined semantic relation types R for the entity pair. As illustrated in Figure 5.1, the two entities h and t are "Murat Kurnaz" and "Germany". Their corresponding entity types are e_h = PERSON and e_t = LOCATION, respectively. The relation between them is r = country_of_birth.

A parametric probabilistic relation classifier predicts p_r(r|s,h,t;θ), the probability of a relation type r given a sentence s and the head and tail entities h, t residing in the sentence, where θ denotes the model's parameters. Recent RC approaches have adopted the pipeline below, taking the input ⟨s,h,t⟩ and producing a fixed-size vector representation of the input before passing it to a relation classification layer. The classification layer is often a linear-softmax pair over the relation type set R.

s,h,t → encoder → classifier → r

5.2 Using Language Models as Weak Annotators

5.2.1 Defining Relation Types

In order to use pretrained language models for annotation, we first need a list of desired relation types and their exemplars, e.g., "[Obama] was born in [the USA]" for country_of_birth. The exemplars can be short and simple, often fewer than 10 words. Since our work does not aim at creating a new dataset, we use the gold relation types of the evaluation datasets (TACRED and reWiki) and manually create an exemplar for each relation. The relation types and exemplars used in our experiments are shown in Appendix B.2.

Figure 5.2: Similarity computation using an LM. Token representations of the input sentence ("Murat Kurnaz, ... was born ... in Germany") and of a relation exemplar ("Marie Curie is married to Pierre Curie") are mean-pooled and compared using a dot product.

5.2.2 Language Model Annotator

We hypothesise that an LM trained on massive raw data (e.g., BERT) can capture some level of semantic similarity. We use PLMs and exemplars to annotate raw sentences based on similarity scores. For instance, "[A] is the mother of [B]" is more similar to "[A] gave birth to [B]" than to "[A] works for [B]". Therefore, we can use similarity matching computed by a PLM to produce similarity scores between an unseen sentence and the exemplars. Finally, we assign to the sentence the relation type with the highest score. To perform matching, we first define a mapping function f that transforms a sentence s and two entities h, t into a vector representation in R^d. The similarity score between an unseen sentence ⟨s_1,h_1,t_1⟩ and an exemplar ⟨s_2,h_2,t_2⟩ is computed by a function sim, which can be any function that computes the similarity between two vectors. This work uses the dot product as the similarity function:

sim(f(s_1, h_1, t_1), f(s_2, h_2, t_2))

The annotation process is illustrated in Figure 5.2. We note that when BERT is used as our annotator, the model is similar to the BERT-based relation classifier with mention max pooling of Baldini Soares et al. (2019). We apply mean pooling rather than max pooling because the former outperformed the latter in our preliminary experiments. Although Baldini Soares et al. (2019) show that BERT with entity markers achieves the best performance, it is unclear how the entity markers are initialised. The resulting data D inevitably inherits noise from the weak annotation; we therefore present our attempt to learn from noisy data in the next section.
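A minimal sketch of the annotator is given below, assuming mention-level mean pooling over a frozen BERT and a dot product of the concatenated head/tail vectors; the exact pooling details of our implementation may differ, and all names here are illustrative.

import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def encode(sentence, head_span, tail_span):
    """Mean-pool BERT token states over each entity mention (character spans) and
    concatenate the two mention vectors into a single representation f(s, h, t)."""
    enc = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        states = bert(**enc).last_hidden_state[0]           # (num_tokens, 768)
    def pool(span):
        keep = [s < span[1] and e > span[0] and e > s for s, e in offsets]
        return states[torch.tensor(keep)].mean(dim=0)
    return torch.cat([pool(head_span), pool(tail_span)])

def annotate(sentence, head_span, tail_span, exemplars):
    """exemplars: {relation: (sentence, head_span, tail_span)}, one entry per type."""
    x = encode(sentence, head_span, tail_span)
    scores = {r: torch.dot(x, encode(s, h, t)).item() for r, (s, h, t) in exemplars.items()}
    return max(scores, key=scores.get)                       # highest-scoring relation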

5.3 Noisy Channel Auto-encoder (NoelA)

Our model NoelA is depicted in Figure 5.3. The input to the model is a sentence and the named entity mentions in the sentence. We follow the pipeline introduced in the task formulation to build a relation classifier, which includes an encoder and a classification layer. The difference is the introduction of a noise-modelling component after the classifier to deal with noisy labels. We refer to the classifier and the noise-modelling component together as the decoder.

Figure 5.3: Overview of our model NoelA, consisting of an encoder (BERT followed by linear and ReLU layers) and a decoder. The encoder converts the input ⟨s,h,t⟩ to a fixed-size vector representation x_{s,h,t}. The decoder then reconstructs the entity types of h, t (e.g., PER, LOC) with p_e(e_h,e_t|s,h,t), and predicts the relation expressed in the input with p_r(r|s,h,t), followed by a noisy channel relating the predicted relation (e.g., country_of_birth) to the noisy label (e.g., country_of_death).

5.3.1 Encoder

The encoder transforms an input ⟨s,h,t⟩ into a fixed-size vector x_{s,h,t} ∈ R^d, where s is a sentence and h and t are two entities. The encoder first produces context-dependent word representations using a neural architecture, for which we use BERT in this work. We then construct representations for the two entities, x_h and x_t, by taking the mean pooling over the word representations within each entity span (we refer the readers to §2.5 for details). The two entity vectors are then concatenated with their entity type embeddings (x_{e_h}, x_{e_t} ∈ R^{d_e}) and passed to a linear and ReLU layer, forming the relation candidate representation x_{s,h,t}:

x_{s,h,t} = ReLU(Linear([x_h, x_t, x_{e_h}, x_{e_t}])),   (5.1)

where [·, ·] denotes concatenation. We note that a linear layer is defined as Linear(q) = Wq + b.
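A minimal PyTorch sketch of this encoder is shown below, with dimensions taken from Table 5.2 (768-dimensional BERT vectors, 20-dimensional entity type embeddings, 200-dimensional output); the mention vectors x_h and x_t are assumed to be pre-computed by mean pooling over a frozen BERT, as in the annotator sketch above, and the class name is ours.

import torch
import torch.nn as nn

class NoelAEncoder(nn.Module):
    """Eq. (5.1): concatenate the two mention vectors with their entity-type
    embeddings, then apply a linear layer followed by ReLU."""
    def __init__(self, num_entity_types, bert_dim=768, type_dim=20, enc_dim=200):
        super().__init__()
        # subject and object types are embedded separately (e.g. PER-SUBJ vs PER-OBJ)
        self.type_emb = nn.Embedding(2 * num_entity_types, type_dim)
        self.proj = nn.Linear(2 * bert_dim + 2 * type_dim, enc_dim)

    def forward(self, x_h, x_t, head_type_id, tail_type_id):
        # head_type_id / tail_type_id are scalar LongTensors indexing the type table
        x = torch.cat([x_h, x_t,
                       self.type_emb(head_type_id),
                       self.type_emb(tail_type_id)], dim=-1)
        return torch.relu(self.proj(x))   # x_{s,h,t}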

5.3.2 Decoder

Unlike traditional decoders, our decoder does not completely reconstruct the input ⟨s,h,t⟩. Instead, it reconstructs only the entity types e_h, e_t of h and t, and predicts the relation r expressed in the input by computing p_r(r|s,h,t).

Relation Classifier

Given the vector representation of ⟨s,h,t⟩, we apply a linear layer and a softmax (over the relation type set R) to compute p_r(r|s,h,t):

p_r(·|s,h,t) = Softmax_R(Linear(x_{s,h,t})),   (5.2)

where p_r(·|s,h,t) corresponds to the distribution over the relation type set R.

Noisy Channel

We expect the relation classifier to predict the correct relation type r. However, in order to leverage the noisy relation label r′ obtained using LMs, we explicitly model the annotation noise. The probability of transferring the correct relation type r to the noisy relation label r′ is denoted by q(r′|r,s,h,t). This probabilistic function is called a "noisy channel" (Goldberger and Ben-Reuven, 2016). Since the correct r is unknown, we marginalise the transition over the relation type set R:

p_{r′}(r′|s,h,t) = Σ_{r∈R} q(r′|r,s,h,t) p_r(r|s,h,t)   (5.3)

It is often assumed that the noise comes from the labelling model, hence r′ is independent of ⟨s,h,t⟩. We base q(r′|r,s,h,t) = q(r′|r) on a relational matrix C ∈ R^{|R|×|R|}, which represents the dependencies between correct labels and noisy ones. The computation can be written as follows:

q(r′|r) = exp(c_{r′r}) / Σ_{r″} exp(c_{r″r}),   (5.4)

where c_{ij} is the entry of C at row i, column j. Initialising q(r′|r) has been shown to be crucial for learning p(r|s,h,t) (Goldberger and Ben-Reuven, 2016). In this work, we initialise C with a matrix computed from the confusion of the LM annotator when choosing relation types. Formally, let count(r′,r) be the number of times that r′ ≠ r appear together in the top-k candidate lists over all sentences; then each entry is defined as follows:

c_{r′r} = log ( count(r′,r) / Σ_{r″} count(r″,r) )   (5.5)

This initialisation provides the learning process with information about the extent to which the LM confuses r with r′. For instance, country_of_birth and country_of_death are likely to be confused; one reason is that, in the past, an average person was often born and died in the same place. In contrast, country_of_birth and spouse are easy to distinguish from each other. In our experiments, we did not fine-tune q(r′|r) and chose k = ⌊|R|/4⌋. Future work will consider dynamically adapting q(r′|r) along with the updates of the model's parameters.
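The following sketch shows one way to implement the noisy channel of Eqs. (5.3)-(5.5) in PyTorch; the small epsilon added before the logarithm is our own numerical safeguard for zero counts and is not described in the text.

import torch
import torch.nn as nn

class NoisyChannel(nn.Module):
    """Fixed transition distribution q(r'|r), initialised from how often the LM
    annotator ranks r' and r together in its top-k candidate lists (Eq. 5.5)."""
    def __init__(self, counts, eps=1e-8):
        super().__init__()
        # counts[i, j] = count(r' = i, r = j); column-normalise, then take the log
        c = torch.log(counts / (counts.sum(dim=0, keepdim=True) + eps) + eps)
        self.register_buffer("C", c)          # not fine-tuned in our experiments

    def forward(self, p_r):
        # q(r'|r): a softmax over r' for each correct relation r (Eq. 5.4)
        q = torch.softmax(self.C, dim=0)      # shape (|R|, |R|), columns indexed by r
        # marginalise over the unknown correct relation r (Eq. 5.3)
        return p_r @ q.t()                    # (batch, |R|) -> p_{r'}(r'|s,h,t)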

Entity Type Reconstruction

Another way to tolerate the annotation noise is to inject appropriate biases into the model. Our encoder uses the entity types of h and t to compute the vector representation x_{s,h,t}, since entity types have been shown to be helpful for RC (Hancock et al., 2018; Ma et al., 2019; Tran et al., 2020b). However, if trained on noisy labels only, the model may not be able to make use of entity types to tolerate the noise. Therefore, we force the model to capture the entity type bias by reconstructing the entity types of h and t. Formally, denoting by E the entity type set, we compute the reconstruction probability using a linear layer and a softmax (over E × E):

p_e(·|s,h,t) = Softmax_{E×E}(Linear(x_{ee})),   (5.6)

where x_{ee} = ReLU(Linear(x_{s,h,t})) ∈ R^{d_{ee}}.

5.3.3 Learning

Given a noisy dataset D, we train NoelA by minimising the following loss:

L(θ) = L_nc(θ) + L_rc(θ) + λ L_reg(θ),   (5.7)

where L_nc is the negative log-likelihood of predicting the noisy labels,

L_nc(θ) = − (1/|D|) Σ_{⟨s,h,t,r′⟩∈D} log p_{r′}(r′|s,h,t;θ),   (5.8)

L_rc is the entity type reconstruction loss, i.e., the negative log-likelihood of predicting the entity types,

L_rc(θ) = − (1/|D|) Σ_{⟨s,h,t,r′⟩∈D} log p_e(e_h, e_t|s,h,t;θ),   (5.9)

and L_reg is a regularisation term with coefficient λ ∈ R. We use the dispersion (Eq. (4.7)) proposed by Simon et al. (2019) to encourage the encoder to predict diverse relation types across all instances. We set λ to 0.01 in all of our experiments.
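A sketch of the combined objective is given below, assuming the probabilities come from the classifier, noisy channel and entity-type reconstruction heads described above; the flattened entity-type-pair index and the epsilon inside the logarithms are our own illustrative choices.

import torch
import torch.nn.functional as F

def noela_loss(p_noisy, p_clean, p_etype, noisy_labels, etype_pair_labels, lam=0.01):
    """Eq. (5.7): L = L_nc + L_rc + lambda * L_reg.
    p_noisy: p_{r'}(r'|s,h,t) after the noisy channel, shape (batch, |R|)
    p_clean: p_r(r|s,h,t) from the relation classifier, shape (batch, |R|)
    p_etype: p_e(.|s,h,t) over flattened entity-type pairs, shape (batch, |E|*|E|)"""
    l_nc = F.nll_loss(torch.log(p_noisy + 1e-8), noisy_labels)        # Eq. (5.8)
    l_rc = F.nll_loss(torch.log(p_etype + 1e-8), etype_pair_labels)   # Eq. (5.9)
    # dispersion regulariser (Eq. 4.7): KL between the batch-level prior and uniform
    prior = p_clean.mean(dim=0)
    uniform = torch.full_like(prior, 1.0 / prior.numel())
    l_reg = torch.sum(prior * (torch.log(prior + 1e-8) - torch.log(uniform)))
    return l_nc + l_rc + lam * l_reg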

5.4 Experimental Settings

Our implementation was developed using the Transformers library (Wolf et al., 2019) and PyTorch (Paszke et al., 2019). We use accuracy as an evaluation metric. Our source code will be available at https://github.com/.

5.4.1 Datasets

We conducted experiments on two English datasets, TACRED (Zhang et al., 2017b) and reWiki, whose statistics are shown in Table 5.1. TACRED is a widely used dataset for supervised relation extraction. We removed sentences labelled "no relation", resulting in a total of 41 relation types. reWiki is a rearranged variant of the Wiki80 dataset used in Han et al. (2019), which originated from FewRel (Han et al., 2018b). We note that there are no "no relation" instances in reWiki. Since the test set of Wiki80 is not provided, we used the development set for testing and name the resulting data reWiki to distinguish it. Additionally, 20% of the training data of Wiki80 is used as the development set for analysis. This rearrangement is reasonable since the relation instances remain unseen during training and tuning of the model. Entity types are provided in TACRED, while for reWiki we obtain the entity types using the Stanford named entity recogniser (Manning et al., 2014).

The two datasets differ in multiple aspects. In particular, reWiki has almost double the number of relation types of TACRED (80 vs 41). The relation distribution of TACRED is skewed while that of reWiki is uniform. Furthermore, TACRED involves text extracted from news, and the source of reWiki is Wikipedia. The distinction between the two datasets thus showcases the generalisability of our approach. For each dataset, we manually created an exemplar for each relation, in which the head and tail entities were randomly selected and are mostly unseen in the data. These exemplars were used to annotate the sentences in the original training sets of the two datasets, which are treated as unlabelled data. The exemplars corresponding to each corpus are presented in Appendix B.

TACRED              Train    Dev     Test
Relation types      41
Entity types        17
Instances           13,012   5,436   3,325
Entity pairs        8,426    3,229   2,036
Distribution        Skewed

reWiki80            Train    Dev     Test
Relation types      80
Entity types        8
Instances           50,400   10,080  5,600
Entity pairs        50,213   10,080  5,597
Distribution        Uniform

Parameter                                               Value
Optimiser                                               Adam
Learning rate                                           3e-4
Patience                                                5
Batch size                                              128
Dropout                                                 0.5
BERT token dimension                                    768
Entity type dimension d_e                               20
Encoder dimension d                                     200
Entity type representation from encoder output d_ee     50
Max length                                              512

Table 5.1: Data statistics of the TACRED and reWiki datasets. Each instance is a sentence given entity spans and automatically-labelled entity types.

Table 5.2: Hyper-parameters of NoelA and its variants.

5.4.2 Pretrained Language Models

We examined three annotators based on the small versions (12 layers, 768 hidden units and 12 heads) of three LMs: BERT (Devlin et al., 2019), GPT2 (Radford et al., 2019), and SpanBERT (Joshi et al., 2020). The BERT version used in this work is uncased, while GPT2 and SpanBERT do not have uncased small versions, so we used the cased ones.

5.4.3 Relation Classification Settings

NoelA was trained with the Adam optimiser (Kingma and Ba, 2014) with a widely used learning rate of 3e-4. We used the exemplars as the development set and stopped training early if the accuracy on the development set did not increase after a certain number of epochs (patience). We list the hyper-parameters of NoelA in Table 5.2. We compare NoelA with its variants obtained by removing one component at a time: the entity type reconstruction (–ETR), then the regularisation (–Reg), and finally the noisy channel (–NC). We call the variant with all proposed components removed (i.e., –NC) BERTwET. We also include the BERT annotator and BERTwET trained with the bootstrap-hard loss (bootstrap-hard; Reed et al., 2014) for comparison. The idea of bootstrap-hard is to consider the relation type predicted by the classifier for an instance at each training step as the noisy label. In particular, we employ BERTwET with the loss L_bootstrap-hard(θ), a combination of L_nc(θ) and L_model(θ), the negative log-likelihood loss of the label predicted by the model with the current θ:

L_bootstrap-hard(θ) = β L_nc(θ) + (1 − β) L_model(θ),

where β is set to 0.8 following Reed et al. (2014) and L_model is computed as follows:

L_model(θ) = − (1/|D|) Σ_{⟨s,h,t⟩∈D} log p(r′|s,h,t;θ),  where r′ = argmax p(·|s,h,t;θ).
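For reference, a minimal sketch of this loss (with our own naming) is:

import torch
import torch.nn.functional as F

def bootstrap_hard_loss(log_probs, noisy_labels, beta=0.8):
    """Hard bootstrapping (Reed et al., 2014): mix the loss on the noisy labels
    with the loss on the model's own current predictions."""
    l_noisy = F.nll_loss(log_probs, noisy_labels)
    self_labels = log_probs.argmax(dim=-1).detach()   # r' = argmax p(.|s,h,t; theta)
    l_model = F.nll_loss(log_probs, self_labels)
    return beta * l_noisy + (1.0 - beta) * l_model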

We also tried their "soft" bootstrapping, which minimises the entropy of the predicted label probability distribution, H(p(·|s,h,t;θ)). However, the entropy regulariser caused the model to collapse, so we did not include it in our comparison. We also show the results of BERTwET trained on the gold relations of TACRED and reWiki as an upper bound. For every model, we conducted five runs with different initialised parameters and computed the average performance. We note that a small number of instances were eliminated when training NoelA and its variants due to the maximum length constraint. The numbers of removed instances in the TACRED (Zhang et al., 2017b) train/dev/test sets are 148, 47 and 20, respectively. No instance exceeds the restricted length in reWiki (Han et al., 2019). Regarding entity type embeddings, we distinguish the entity types of subject and object, e.g., PER-SUBJ and PER-OBJ, resulting in an entity type embedding matrix E ∈ R^{2|E|×d_e}, where |E| is the number of entity types reported in Table 5.1. For a fair comparison with the annotators, we did not fine-tune BERT during training, in order to show the contribution of the additional components rather than of the large number of fine-tuned parameters. All experiments were performed on a compute node with an Intel Skylake CPU and an Nvidia V100 GPU (16GB GPU RAM).

5.5 Results

5.5.1 Data Annotation

                 TACRED                               reWiki80
                 Top1    Top3    Top5    ⌊|R|/4⌋      Top1    Top3    Top5    Top10   ⌊|R|/4⌋
Random            2.44    7.32   12.20   24.39         1.25    3.75    6.25   12.50   25.00
Frequency        15.04   33.38   43.04   64.09         1.25    3.75    6.25   12.50   25.00
GPT2-small        0.27    5.05    6.05   12.60         1.73    4.14    6.52   12.98   26.66
BERT-base        15.46   31.16   40.72   56.00        27.48   42.45   50.09   60.39   71.32
SpanBERT-base     8.36   17.50   29.74   45.86         6.45   14.75   21.63   32.84   46.82

Table 5.3: Accuracy (%) of LM annotators on the two datasets. |R| denotes the number of predefined relation types in a dataset.

Table 5.3 shows the top-k accuracy of three LM annotators on the TACRED and reWiki test sets. Since, to our knowledge, this work is the first to utilise LMs as annotators for relation extraction, we include the performance of two simple and trivial baseline annotators: (i) Random (randomly selecting relation types), and (ii) Frequency (choosing the most frequent relation types). In spite of being simple and deterministic, Frequency can be a strong baseline when no human supervision is given (Petroni et al., 2019). In general, the BERT annotator yields substantially higher accuracy than the baselines and the other two LM annotators at top-1 (the Top1 column) on both datasets. However, the BERT annotator performs worse than Frequency at top-k with k > 1 on the TACRED test set. We partially attribute this lower performance to the skewness of the gold relation distribution of TACRED.

Surprisingly, SpanBERT, trained on span prediction, yields substantially lower performance than BERT. Meanwhile, GPT2 performs the worst among the three LMs in this setting. The low performance is because word embeddings from GPT2 capture only left-to-right contexts, whereas the other two are bidirectional.

5.5.2 Relation Classification

                              TACRED                  reWiki
                              Acc. (%)       Abs.+    Acc. (%)       Abs.+
BERT                          15.46                   27.48
NoelA                         24.79 ±0.68    9.33     33.17 ±0.39    5.69
 –ETR                         21.54 ±0.69    6.08     32.48 ±0.67    5.00
 –Reg                         21.28 ±0.54    5.82     32.65 ±0.11    5.17
 –NC (BERTwET)                19.03 ±0.34    3.57     30.06 ±0.14    2.58
BERTwET (bootstrap-hard)      19.28 ±0.42    3.82     29.76 ±0.16    2.28
BERTwET (sup.)                82.73 ±0.99    67.27    73.92 ±3.46    46.44

Table 5.4: RC accuracy (Acc.) across five runs of NoelA and its variants, and the absolute improvement (Abs.+) compared to the BERT annotator (BERT). The models were trained on the BERT-annotated data: ETR (entity type reconstruction), Reg (dispersion regulariser), NC (noisy channel). BERTwET is the variant without any of the components that we added to the BERT-based relation classifier. BERTwET (sup.) was trained with the gold relations.

Table 5.4 presents the results of our model and the compared ones. Our proposed model (NoelA) substantially improves over the BERT annotator and BERTwET. Each component incrementally contributes to the improvement of NoelA. The noisy channel is shown to make the model more robust against the annotation noise. Nevertheless, it is unclear whether the relation dispersion regulariser is helpful, since removing it slightly decreases the performance on TACRED but increases the result on reWiki (though the differences are not substantial). On the other hand, the entity type reconstruction has less impact on the reWiki test set, since removing it only reduces the score marginally (0.69%). Bootstrap-hard performs only similarly to BERTwET; this indicates that bootstrapping may not work on such noisy data, where the seed set is too small, i.e., one example per category.

5.6 Analysis

To better understand the BERT annotator's preferences and how they affect the subsequent RC training, we carried out analyses in four aspects: the gold versus BERT's relation distribution, the accuracy of the BERT annotator, the accuracy of NoelA, and the impact of entity type reconstruction.

Figure 5.4: The gold relation distributions and the predicted relation distributions from the BERT annotator on the development sets. Relation types with high frequency differences between the gold and the predicted distributions are labelled: (a) TACRED (org:top_members/employees, per:title, org:shareholders, per:charges, org:city_of_headquarters, per:state_of_death); (b) reWiki (religion, military_branch, residence, country_of_citizenship, instance_of, subsidiary, follows).

5.6.1 Relation Distribution

As the two datasets have different relation distributions, we first look at them and at those produced by the BERT annotator (Figure 5.4). In TACRED, the gold distribution is skewed towards a few relation types such as per:title and org:top_members/employees. The BERT annotator, however, favours infrequent ones such as org:shareholders and per:charges. In reWiki80, although the gold distribution is uniform, the BERT annotator's distribution is multi-modal. This observation reveals inappropriate biases of the BERT annotator, suggesting that one can improve the annotation by injecting an inductive bias into the BERT annotator to bring the predicted relation distribution closer to the gold one.

5.6.2 The Accuracy of BERT Annotator

We show the accuracy of the BERT annotator with respect to relation types in Figure 5.5. On TACRED, the BERT annotator performs exceptionally well for per:charges and per:age, but poorly for org:dissolved and org:subsidiaries.

Figure 5.5: Accuracy (%) w.r.t. relation type of the BERT annotator on the development sets. Relation types with the highest and lowest performance are marked: (a) TACRED (per:age, per:charges, org:number_of_employees, org:subsidiaries, org:dissolved, per:country_of_residence); (b) reWiki (constellation, mountain_range, operating_system, country_of_citizenship, subsidiary, part_of).

Although Figure 5.5a gives the intuition that the overall accuracy should be substantially higher than 15.46%, this is not the case, because the most frequent types have low accuracy. This observation again suggests the need to bias the BERT annotator towards frequent relation types. On reWiki, the highest accuracy is for mountain_range, and the lowest ones are for part_of, subsidiary and operating_system. Because the gold relation distribution is uniform rather than skewed, the overall accuracy of the BERT annotator on reWiki80 (27.48%) is substantially higher than that on TACRED (15.46%). Generally, on both datasets we observe that the BERT annotator performs poorly for (i) human-human relations such as parent, mother, siblings, and spouse, and (ii) human-location relations such as residence, citizenship, and place of birth/death. The poor performance on human-human relations may be because family relationships usually occur in the same context and the different relationships can only be implied by logical reasoning, e.g., parents implies father and mother. In addition, human-location relationships can overlap, e.g., the residential location is typically the country of citizenship, and the place of birth and death are often the same. We could analyse the original text used to train BERT for further understanding; we leave this analysis for future work. We refer the readers to Appendix B.1 for the BERT annotator's confusion matrices.

5.6.3 The Accuracy of NoelA

Figure 5.6: Accuracy differences (%) w.r.t. relation types between NoelA and the BERT annotator on the development sets. Relation types with the most and least accuracy differences are labelled: (a) TACRED (org:founded, org:political/religious_affiliation, per:siblings, org:website, per:country_of_birth, per:parents); (b) reWiki (contains_administrative_territorial_entity, successful_candidate, licensed_to_broadcast_to, religion, league, main_subj).

Next, we show the accuracy difference between NoelA and the BERT annotator in Figure 5.6. The per-relation improvement of NoelA over the BERT annotator is modest on TACRED but clearly visible on reWiki. Nevertheless, the overall accuracy gain of NoelA on TACRED (9.33%) is substantial. This is due to the skewness of the gold relation distribution of TACRED (Figure 5.4a): a slight improvement for highly frequent relation types leads to a substantial improvement in the overall accuracy. This observation indicates an interesting behaviour of NoelA: it seems to adjust its attention according to the hidden gold relation distribution. On TACRED, NoelA trades off an accuracy loss for some infrequent relation types against an accuracy gain for some frequent ones. On reWiki, in contrast, NoelA pays attention to multiple relation types, since all of them are equally frequent.

5.6.4 The Impact of Entity Type Reconstruction

We examine to what extent entity types can help to predict the gold relations. To do so, we measure the mutual information between entity type pairs (ET) and gold relations (R), and between gold relations (R) and themselves.

                              TACRED    reWiki
I(ET;R)                       2.604     1.450
I(R;R)                        3.210     4.382
Î(ET;R) = I(ET;R) / I(R;R)    0.811     0.331

Table 5.5: Mutual information between entity type pairs (ET) and gold relations (R) on the development sets.

In information theory, given two random variables X and Y, the mutual information I(X;Y) measures the amount of information in X that tells us about Y, and vice versa. Therefore, the more helpful entity types (ET) are for predicting relations (R), the larger I(ET;R) is. Since the maximum value of I(ET;R) is I(R;R), we propose the following normalisation:

Î(ET;R) = I(ET;R) / I(R;R) ∈ [0,1]

If entity types are not related to the gold relations, I(ET;R) = 0 and thus Î(ET;R) = 0. Conversely, if the gold relations are fully determined by the entity types, I(ET;R) = I(R;R), leading to Î(ET;R) = 1. Table 5.5 shows Î(ET;R) on the development sets. We can see that Î(ET;R) in TACRED is close to 1, whereas Î(ET;R) in reWiki is less than half of that. This explains the low impact of the entity type reconstruction loss on reWiki compared to TACRED.
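This normalised score can be computed directly from the labelled pairs, for example with scikit-learn; the snippet below is a small illustration with our own function name.

from sklearn.metrics import mutual_info_score

def normalised_mi(entity_type_pairs, relations):
    """I_hat(ET;R) = I(ET;R) / I(R;R); note that I(R;R) equals the entropy of R."""
    # encode each (head type, tail type) pair as a single categorical label
    pair_labels = ["%s|%s" % (h, t) for h, t in entity_type_pairs]
    return mutual_info_score(pair_labels, relations) / mutual_info_score(relations, relations)

# toy usage: relations fully determined by entity types give a score of 1.0
pairs = [("PER", "LOC"), ("PER", "LOC"), ("PER", "PER"), ("ORG", "LOC")]
rels = ["country_of_birth", "country_of_birth", "spouse", "city_of_headquarters"]
print(normalised_mi(pairs, rels))   # 1.0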

5.7 Related Work

5.7.1 Relation Classification

Our work is in line with distant learning approaches (Mintz et al., 2009; Riedel et al., 2010; Lin et al., 2016; Bai and Ritter, 2019) that automatically annotate raw sentences using KBs. The key difference is that our work does not require intensive human labour (for annotating data or for building up KBs). We only need human labour to provide a set of relation types, each of which comes with a simple exemplar. Moreover, unlike unsupervised learning (Marcheggiani and Titov, 2016; Simon et al., 2019), which uses unlabelled data, the resulting relations of our work are well aligned to desired relation types. One might compare our work with the supervision from KBs or self-learning in

semi-supervised learning, which also relies on a set of seed examples. However, we emphasise that our aim is to examine the relational facts captured in PLMs. In order to evaluate our model, the best setting is to compare with gold labels, which we did in our experiments. For future work, we consider the comparison between the supervision from PLMs and other supervision sources, as well as their combination. We also note that URE is a clustering task where relation labels are not determined. Hence, this work and unsupervised learning methods are not directly comparable. Our work differs from few-shot learning in two main points. The most important difference is that few-shot learning uses in-domain examples, i.e., examples from annotated data (Han et al., 2018b). In contrast, our exemplars are completely domain-agnostic. Another crucial difference is that, for each testing instance, few-shot learning often randomly selects a small subset of relation types for classification, e.g., 5 to 10. Our work deals with all relation types when classifying each testing instance. The two major differences make our setting more difficult, but also more realistic and user-friendly because of the domain-agnostic exemplars. The most relevant study to ours is the one-shot RC (without task-aware learning) based on BERT proposed by Baldini Soares et al. (2019). They employ BERT for relation matching where the number of relation types is small (5 or 10) and the exemplars are from the same domain as the test data. Our work differs in two crucial points. Firstly, the number of relation types is larger (41 in TACRED and 80 in reWiki80), and the exemplars in our approach are simple (often fewer than 10 words) and domain-agnostic. Secondly, whereas they use relation matching directly for RC, our work shows that higher accuracy can be achieved by training a classifier on data annotated by BERT.

5.7.2 Pretrained Language Models

GPT2 (Radford et al., 2019), a unidirectional LM, and BERT (Devlin et al., 2019), a bidirectional one, are recent large-scale PLMs trained on massive data; for example, BERT was trained on BooksCorpus (800M words) and English Wikipedia (2,500M words), around 16GB in total. Using the Transformer (Vaswani et al., 2017), both BERT and GPT2 are able to compute contextualised word representations, which represent words in a particular context. There are several variants of BERT, among which SpanBERT (Joshi et al., 2020) extends BERT by predicting contiguous spans of text rather than single words. It also adds the span boundary objective (SBO), which uses adjacent contextual words (e.g., one word to the left and right of the current span) for span prediction.

Probing from Pretrained Language Models

Most recent work uses cloze-style templates to probe knowledge from LMs without fine-tuning. Petroni et al. (2019) extracted factual knowledge about relations between entities from language models. They manually define a single template for each relation type in a predefined set. Jiang et al. (2020) improved the manually-defined relation templates using pattern mining and paraphrasing methods. Both Petroni et al. (2019) and Jiang et al. (2020) limited the number of instances to those that have single-token entities. Differently from the above studies, Bouraoui et al. (2020) examined the commonsense relations between words captured by language models. They extracted relation templates from data using a seed set of instances and used the resulting templates to predict the relation between a word pair. These studies probe relational knowledge from LMs but do not target relation classification. Recently, Schick and Schütze (2020) provided a semi-supervised training procedure that uses the probing information in combination with supervised data for few-shot text classification. The authors also used the probing results to assign soft labels to unlabelled data. All of these prior studies only consider using LMs to fill in the blank. In contrast, our work uses LMs to discriminate information, i.e., the relation between named entities in text.

5.8 Conclusion

We demonstrated our hypothesis that LMs can be used as annotators to generate labelled data by matching sentences against exemplars of a predefined relation set, answering our fourth research question (RQ4). Although the resulting data are noisy, we showed that learning from them can yield substantial gains over the LM annotator itself. To reduce the impact of noisy labels, we proposed NoelA (Noisy Channel Auto-encoder), which can learn the latent correct labels by explicitly modelling the noise and using the entity type bias. NoelA gains a promising 6% and 9% accuracy over BERT on reWiki and TACRED, respectively, demonstrating the potential of using supervision from LMs. This also

addresses our last research question and hypothesis (RQ5 and H4). We analysed the gold and the BERT-annotated relation distributions, showing that BERT captures different biases during pre-training. This observation is promising for future work, in which we could inject biases during annotation to better match the predicted relation distribution with the gold one. The analysis of the BERT annotator's accuracy sheds some light on BERT's preferred relation types and confusing ones. Interestingly, we

observed from the analysis of NoelA's accuracy that our model can adjust towards the latent gold label distribution. The adjustment indicates the effectiveness of the two noise reduction mechanisms that we proposed, i.e., the noisy channel and entity type reconstruction. Finally, we observed that the contribution of entity type information to the performance of relation classification highly depends on the mutual information between entity types and relations in a dataset.

In our experiments we used simple and artificial exemplars. Jiang et al. (2020) showed that different exemplars may work better when probing relational knowledge from LMs. Our future work is thus to investigate how to automatically create more appropriate exemplars that increase the annotation quality using LMs.

Chapter 6

Conclusions

This dissertation contributed to the development of relation extraction (RE), which aims at identifying relationships between named entities in text. As stated in Chapter 1, we focused on the specific use of unlabelled text in three scenarios:

• Support RE by encoding syntactic features into word representations using automatically-parsed text;
• Perform RE when no annotation is given during training;
• Leverage pretrained language models as weak supervision for relation annotation.

We presented the essence of relation extraction in the background chapter (Chapter 2). We first formulated the task and described related concepts, and presented existing corpora and the evaluation metrics used for RE. We then walked the readers through the history and fundamentals of RE, in which conventional machine learning methods are grouped into five categories, along with an introduction to neural networks. As our work focuses on neural models, we provided readers with the necessary knowledge to understand and build a neural RE model. The remainder of the chapter covered related work on the individual building blocks of RE, from input representation to learning techniques. In this dissertation, we explored three ways of using unlabelled text, which were mostly evaluated on sentence-level relation extraction datasets.


6.1 Summary of Research Objectives

In this dissertation, we investigated several techniques to address relation extraction using unlabelled data, which involve a pretraining approach to support supervised learning, two simple neural relation extraction methods working on unlabelled data, and a weak supervision approach using pretrained language models.

In Chapter 3, we confirmed our first hypothesis (H1), which stated:

Detection of entity relations can benefit from pre-encoded syntactic information in word representations.

To test this hypothesis, we evaluated on a binary and an n-ary relation extraction dataset, involving text from newswire articles (ACE2005) and biomedical text (drug-gene-mutation). We used unlabelled text from the two domains, independent from the training data, for pretraining. Two syntactic analysis tools, Stanford CoreNLP and ScispaCy, corresponding to each domain, were used to obtain part-of-speech tags and dependency trees for individual sentences. We proposed a representation learning method based on a graph convolutional neural model to enrich existing pretrained word representations. The model operates on a dependency word graph enhanced with adjacent word connections and takes pretrained word representations as the input embeddings. We trained the model with syntactic objectives, i.e., POS tagging and dependency parsing. The proposed method generates dependency-based contextualised word representations, namely syntactically-informed word representations (SIWRs).

Our SIWRs showed performance gains over the base representations on relation extraction in both the generic and biomedical domains. Our experimental results also indicated the effectiveness of SIWRs not only on contextualised word representations (ELMo) but also on static ones (PubMed). In addition, we conducted experiments using BERT, both by extracting contextualised word representations and by fine-tuning the entire model, and compared them with our SIWRs using BERT as the base representations. We empirically demonstrated that injecting explicit syntactic bias into BERT representations improves the performance of relation extraction in the generic domain, showcasing the structural bias in large PLMs. By analysing the results on distant entity pairs, we observed a steady decrease in performance. This is reasonable, as longer distances might not be directly inferred from the dependency structure.

In Chapter 4, we investigated binary relation extraction in the general domain without access to annotated data. We addressed RQ2 and RQ3:

RQ2 What is the current setting of unsupervised relation extraction built on neural models?

RQ3 Can inductive biases benefit unsupervised relation extraction?

We answered the first question by implementing two previous neural approaches to URE, i.e., March and Simon, based on a discrete variational auto-encoder framework. We analysed the experimental settings used in previous work, including evaluation metrics, datasets and training signals. Previous studies were evaluated on a distant supervision corpus, which inherently contains noise due to the automatic alignment process. A small portion of the data is aligned with relations in Freebase, but we observed that a large portion of the unaligned sentences also reveals semantic relations between entities. Thus, we confirmed the possibility of using such data for training. Since the annotations are not completely accurate, we additionally used a manually-annotated corpus for evaluation (without training on it) in order to better reflect model performance. Regarding evaluation metrics, we highlighted that (i) ARI might not be appropriate for evaluating imbalanced data, and (ii) when using V-measure, the number of clusters should be small compared to the number of examples. In addition, we illustrated that the link predictor can provide a good training signal for URE.
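To make the metric discussion concrete, the snippet below is a small illustrative sketch (not the thesis code) of computing the clustering metrics used for URE with scikit-learn; the gold labels and cluster assignments are invented toy data mimicking a skewed relation distribution.

```python
# Toy sketch of the URE clustering metrics discussed above, using scikit-learn.
# Gold labels are deliberately imbalanced to mimic the skewed relation
# distribution of distantly supervised corpora (hypothetical data).
from sklearn.metrics import adjusted_rand_score, v_measure_score

gold = [0] * 90 + [1] * 10                      # 90 dominant vs. 10 rare instances

# A degenerate clustering that ignores the rare relation entirely.
pred_one_cluster = [0] * 100
# A clustering that recovers the rare relation but splits the dominant one.
pred_split = [0] * 45 + [1] * 45 + [2] * 10

for name, pred in [("one cluster", pred_one_cluster), ("split", pred_split)]:
    print(name,
          "ARI:", round(adjusted_rand_score(gold, pred), 3),
          "V-measure:", round(v_measure_score(gold, pred), 3))
```

Running such checks on skewed gold distributions is a simple way to see how differently the two metrics reward degenerate solutions.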

Our two simple methods using entity types substantially outperform previous methods. This indicates that semantic entity types provide a strong inductive bias for URE.

Thus, we confirmed hypothesis H2, which stated:

Inductive biases are beneficial for building unsupervised relation extraction models.

Although unsupervised relation extraction can discover new relations, the semantic meanings of the resulting relation clusters are undefined. In Chapter 5, we therefore investigated the use of weak supervision, i.e., one example for each relation type in a predefined relation set. Since recent work suggested that large-scale PLMs capture some relational facts, we evaluated the use of PLMs to annotate semantic relationships between entities. In particular, we defined a set of relation types and one example per relation type; in this work, we used the relation set of the evaluation data. We relied on the PLMs to obtain weak relation annotations for individual examples based on similarity scores against these examples. We demonstrated that BERT provided higher accuracy than two trivial baselines as well as two other PLMs, implying that some relational facts are captured in the model. Our findings thus confirmed our hypothesis:

Pretrained language models can be used as weak supervision for relation extraction, which reduces the need for manual annotations and human-curated knowledge bases.

We then examined the performance of a relation classifier trained on the resulting weakly-labelled data. In particular, we proposed a noisy channel auto-encoder (NoelA) to model the labelling noise explicitly. We presented an ablation study showing the effectiveness of the introduced noise modelling component. Thus, we addressed our final hypothesis H4:

Modelling relation confusion to estimate correct relations from noisy ones can be effective for classifying relations between entities.
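To make the noisy-channel idea concrete, the sketch below illustrates one way such relation confusion can be modelled; it is a minimal illustration under our own assumptions, not the NoelA implementation. A classifier predicts a distribution over latent true relations, and a fixed confusion matrix maps it to a distribution over the observed noisy labels from the LM annotator (here the channel is hand-set; in the thesis it is learned or estimated).

```python
# Minimal noisy-channel sketch: p(noisy label | x) = p(true relation | x) @ C,
# where C[true, noisy] models relation confusion. Sizes are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_rel, hidden = 10, 64
classifier = nn.Linear(hidden, n_rel)              # stands in for the full encoder

# Hypothetical channel: mostly the identity, with some mass on confusable labels.
channel = 0.8 * torch.eye(n_rel) + 0.2 / n_rel     # each row sums to 1

def noisy_label_loss(features, noisy_labels):
    p_true = F.softmax(classifier(features), dim=-1)   # p(true relation | x)
    p_noisy = p_true @ channel                          # p(noisy label | x)
    return F.nll_loss(torch.log(p_noisy + 1e-8), noisy_labels)

loss = noisy_label_loss(torch.randn(4, hidden), torch.tensor([1, 3, 0, 7]))
loss.backward()    # gradients flow into the classifier through the channel
```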

Our further analysis showed that BERT's annotations exhibit biases that differ from the gold relation distribution. Interestingly, our model moved towards the latent gold relation distribution during training. This analysis indicated the benefit of our noise reduction mechanism. Furthermore, our analysis of the mutual information between entity types and relation categories again confirmed the dependency between the two variables.

Overall, this dissertation has shown the potential of using unlabelled text to automatically identify relations between named entities, with experiments from three lines of approaches. The first approach enriched word representations to indirectly improve relation extraction. We showed that syntactic information, i.e., part-of-speech tags and dependency structures, provides an effective clue for detecting associations between entities. Next, we addressed relation extraction when no labels are given (URE). This work included an extensive analysis of data quality, training settings and evaluation metrics for URE. We showed that entity types are a strong inductive bias for relation extraction. Finally, we showed that pretrained language models can be used as labelling functions for relation extraction, generating weakly-labelled data, and we then improved performance by explicitly modelling the label noise.

6.2 Open Problems and Future Work

In this dissertation, our efforts at using unlabelled text have centred around (i) enriching word representations, (ii) analysing unsupervised relation extraction and confirming a strong inductive bias for RE, and (iii) utilising pretrained language models as a

weakly-supervised source and coping with noise in the resulting corpora. In this section, we highlight potential directions for future work by first addressing limitations in our proposed methods and looking at other possibilities of using unlabelled data such as semi-supervised learning.

6.2.1 External Information for Enriching Word Representations

We have demonstrated the effectiveness of our enriched word representations for relation extraction. However, we found that the advantage of dependency connections is reduced when the distance between two entities exceeds the typical dependency range (see Figure 3.7). A potential improvement is to consider coreference connections, which have shown promising results in the work of Luan et al. (2019). The authors proposed a method that jointly predicts entity spans, relations and coreference links. Moreover, we can incorporate lexical semantic relations into the word graph, such as hypernyms/hyponyms and synonyms/antonyms, as in Vashishth et al. (2019b). A sketch of such typed-edge augmentation is given below.
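The following sketch, written under our own assumptions rather than taken from the cited work, shows how extra typed edges (coreference links, lexical-semantic relations) could be added to a dependency word graph represented as (head, dependent, label) triples over token indices.

```python
# Sketch of extending a dependency word graph with extra typed edges.
def build_word_graph(dep_edges, coref_links=(), lexical_links=()):
    """dep_edges: [(head_idx, dep_idx, dep_label)];
    coref_links / lexical_links: [(idx_a, idx_b)] pairs to add as typed edges."""
    graph = [(h, d, lab) for h, d, lab in dep_edges]
    graph += [(a, b, "coref") for a, b in coref_links]
    graph += [(a, b, "lexical") for a, b in lexical_links]
    return graph

# Toy usage: "Obama visited Paris . He liked it ." with two coreference links.
dep = [(1, 0, "nsubj"), (1, 2, "obj"), (5, 4, "nsubj"), (5, 6, "obj")]
graph = build_word_graph(dep, coref_links=[(0, 4), (2, 6)])
```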

6.2.2 Graph Generalisation and Construction

An information mask similar to the masked LM objective of Devlin et al. (2019) would help generalise the graph-based model. In particular, we can randomly replace a word in the input sentence with a special token "[MASK]" and then predict that word. Drop-edge (Rong et al., 2020) can be adopted for the same reason, to avoid overfitting. Drop-edge works like dropout but randomly masks a number of edges in the word graph at each iteration. One consideration when using drop-edge is that it could produce a disconnected graph; hence, heuristic constraints should be applied when removing edges (a minimal sketch is given at the end of this subsection). Another idea is to create a special "mask" edge: we randomly replace an edge in the word graph with this special edge and predict the correct dependency label. In addition, our current method for enriching word representations relies on syntactic tools to obtain dependency structures. To relax that constraint, we could construct a fully-connected word graph and use attention mechanisms to focus on important edges. We can assign position-aware edges connecting words and then encourage the model to attend to dependency connections. In this case, the external syntactic tool is only needed during training, while at test time the model depends only on attention mechanisms.
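Below is a minimal sketch of the drop-edge idea mentioned above, adapted to an edge list over the word graph; the drop rate and the choice to protect adjacent-word edges (as a simple heuristic against disconnection) are our own illustrative assumptions.

```python
# Drop-edge sketch: randomly remove a fraction of edges at each iteration,
# keeping "protected" edge types so the graph is less likely to disconnect.
import random

def drop_edges(edges, drop_rate=0.1, protected=frozenset({"adjacent"})):
    """edges: [(head_idx, dep_idx, label)]; returns the kept edges."""
    return [e for e in edges
            if e[2] in protected or random.random() > drop_rate]

edges = [(1, 0, "nsubj"), (1, 2, "obj"), (0, 1, "adjacent"), (1, 2, "adjacent")]
print(drop_edges(edges, drop_rate=0.3))
```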

6.2.3 Cluster Definition

Hu et al. (2020) showed that using a large pretrained language model (BERT) and the deep embedding clustering (DEC) approach proposed by Xie et al. (2016) can provide self-supervision for URE. However, as we stated in the conclusion of Chapter 4, although URE can discover new relations, we need to either manually annotate clusters or extract frequent terms in order to obtain the relation type corresponding to each cluster. A straightforward solution is to initialise the clusters with a set of seed examples representing each relation type of interest.
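As a simple illustration of this idea, the sketch below initialises a clustering from one seed exemplar embedding per relation type; the encoder, the seed embeddings and the cluster count are placeholders rather than any specific system.

```python
# Sketch of seeding relation clusters with one exemplar embedding per relation.
import numpy as np
from sklearn.cluster import KMeans

def seeded_clusters(seed_embs, instance_embs):
    """seed_embs: (n_relations, dim) embeddings of one exemplar per relation;
    instance_embs: (n_instances, dim) embeddings of unlabelled candidates."""
    km = KMeans(n_clusters=len(seed_embs), init=seed_embs, n_init=1)
    return km.fit_predict(instance_embs)

rng = np.random.default_rng(0)
seeds = rng.normal(size=(5, 32))           # 5 hypothetical relation types
instances = rng.normal(size=(200, 32))     # unlabelled relation candidates
assignments = seeded_clusters(seeds, instances)
```

Because each cluster starts from a named relation's exemplar, the cluster index directly maps back to a relation type, addressing the labelling problem noted above.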

6.2.4 Improvement for Language Model Annotation

For the use of pretrained language models as weak annotators, we can improve the annotation by providing more examples for individual relation types. The examples can be obtained either manually or automatically. The accuracy would likely increase if we combined manual and automatic examples; however, the latter, automatically obtaining exemplars, is an interesting and challenging task in itself. A simple scheme can leverage existing neural machine translation models to generate different translations of a particular example in one or more languages (e.g., German, French, Chinese) and then translate them back to the target language (English in our experiments). This process, which translates a text into another language and then back into the original language, is called back-translation. The challenge here is to obtain diverse back-translations, which cannot be addressed by naively tuning the output generation parameter (beam size) of a neural machine translation model. Another suggestion is to finetune pretrained language models on natural language inference or paraphrasing corpora before using them to compute relation similarity scores. This can encourage the model to produce similar embeddings for contexts sharing close meanings. This kind of transfer learning from natural language inference has also been used in zero-shot relation extraction (Obamuyide and Vlachos, 2018). As mentioned in §3.7, this can also be considered as intermediate training, in which large-scale pretrained language models are trained on intermediate tasks before fine-tuning on a target task (Phang et al., 2018; Glavaš and Vulić, 2020).
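The snippet below sketches the back-translation scheme described above using publicly available MarianMT checkpoints from the Hugging Face transformers library; the specific model names, pivot language and beam settings are examples, not choices made in this thesis.

```python
# Back-translation sketch for generating extra relation exemplars (English ->
# pivot language -> English). Model names are example MarianMT checkpoints.
from transformers import MarianMTModel, MarianTokenizer

def back_translate(sentences, pivot="de", num_beams=5, num_outputs=3):
    fwd_name = f"Helsinki-NLP/opus-mt-en-{pivot}"
    bwd_name = f"Helsinki-NLP/opus-mt-{pivot}-en"
    fwd_tok = MarianTokenizer.from_pretrained(fwd_name)
    fwd = MarianMTModel.from_pretrained(fwd_name)
    bwd_tok = MarianTokenizer.from_pretrained(bwd_name)
    bwd = MarianMTModel.from_pretrained(bwd_name)

    batch = fwd_tok(sentences, return_tensors="pt", padding=True)
    pivot_ids = fwd.generate(**batch, num_beams=num_beams,
                             num_return_sequences=num_outputs)
    pivot_sents = fwd_tok.batch_decode(pivot_ids, skip_special_tokens=True)

    batch = bwd_tok(pivot_sents, return_tensors="pt", padding=True)
    back_ids = bwd.generate(**batch, num_beams=num_beams)
    return bwd_tok.batch_decode(back_ids, skip_special_tokens=True)

paraphrases = back_translate(["Facebook was founded by Mark Zuckerberg ."])
```

As the paragraph above notes, simply increasing the beam size tends to produce near-duplicate outputs, so sampling strategies or multiple pivot languages would likely be needed for genuinely diverse exemplars.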

6.2.5 Noise Reduction

Our current noisy channel is static, i.e., unchanged during training. However, the relation classifier adapts towards predicting the correct relations as training progresses, and the transition from unknown correct relations to observable noisy labels should be refined accordingly. Thus, we would like to investigate a way to dynamically adjust the noisy channel in future work.

6.2.6 Multiple Sources of Supervision

This work used existing labelled corpora and unlabelled text separately, without considering their dependencies. One future direction worth trying is to jointly incorporate the two types of data. In particular, semi-supervised learning can be a way to leverage a small set of exemplars. We can finetune pretrained language models to perform relation extraction on a small set of annotated data. The resulting models can then be used to annotate a pool of unlabelled text, from which high-confidence examples can be added to the training data (a sketch of this loop is given below). At this stage, we can model the dependency between gold and newly-annotated data, and heuristic assumptions can be adopted to restrict the added examples. While semi-supervised learning considers manual and self-annotated data, another potential direction is to combine different supervision sources, e.g., distant supervision using knowledge bases and heuristic labelling functions as in data programming. Modelling the relationships between existing resources can also be challenging, i.e., we need to resolve conflicts and decide which sources can be trusted for individual examples.
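The following is a high-level sketch of the self-training loop described above; the model interface (fit / predict_proba), the confidence threshold and the number of rounds are hypothetical placeholders rather than a concrete implementation.

```python
# Self-training sketch: repeatedly train on labelled data, then promote
# confident predictions on unlabelled data into the training set.
def self_train(model, labelled, unlabelled, threshold=0.9, rounds=3):
    """labelled: list of (example, relation); unlabelled: list of examples."""
    for _ in range(rounds):
        model.fit(labelled)
        remaining = []
        for example in unlabelled:
            relation, prob = model.predict_proba(example)
            if prob >= threshold:
                labelled.append((example, relation))   # confident self-annotation
            else:
                remaining.append(example)
        unlabelled = remaining
    return model, labelled
```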

6.2.7 Document-level Relation Extraction

Our work mainly focuses on intra-sentence relation extraction (RE); thus, we consider tackling document-level RE as part of future work. A straightforward extension is to apply our graph-based model to learn word representations that include document-level context. In particular, we can construct a document graph based on sentence structure and dependencies, i.e., sentences constitute nodes and associations between sentences correspond to edges. After applying our model on the sentence-level word graph, we can introduce a graph model operating on the constructed document graph (a sketch of the graph construction is given below). At this point, the main problem is how to construct a document graph that is informative but requires little human intervention. Another concern for this approach is the memory usage when considering all sentences in a document; heuristic pruning strategies should be introduced to address this issue.
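Below is a sketch of one possible document-graph construction along the lines described above: sentences become nodes, and edges connect adjacent sentences and sentences sharing an entity mention. The inputs (sentence strings and per-sentence mention sets) and the edge types are our own illustrative assumptions.

```python
# Sketch of a document graph for document-level RE: sentence nodes with
# "adjacent" (discourse order) and "shared-entity" edges.
from itertools import combinations

def build_document_graph(sentences, mentions_per_sentence):
    """sentences: list of sentence strings;
    mentions_per_sentence: list of sets of entity mentions per sentence."""
    edges = set()
    for i in range(len(sentences) - 1):
        edges.add((i, i + 1, "adjacent"))             # consecutive sentences
    for i, j in combinations(range(len(sentences)), 2):
        if mentions_per_sentence[i] & mentions_per_sentence[j]:
            edges.add((i, j, "shared-entity"))        # same entity mentioned
    return edges
```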

Appendix A

Named Entity Recognition

In Chapter 3, we evaluated our syntactically-informed word representations (SIWRs) on two relation extraction corpora. However, our SIWRs can be used in other natural language processing tasks. In this appendix, we evaluate them on a nested named entity recognition corpus, which is part of our publication (Tran et al., 2020c).

A.1 Named Entity Recognition

                      Train    Dev    Test
Entities              24,440   2,972  3,554
FAC                   924      83     173
GPE                   4,725    486    671
LOC                   763      81     69
ORG                   3,702    479    559
PER                   13,050   1,668  1,949
VEH                   624      81     66
WEA                   652      94     67
Nested level          6        4      5
Entities in level 1   19,676   2,429  2,936
Entities in level 2   3,934    448    505
Entities in level 3   731      85     102
Entities in level 4   90       10     10
Entities in level 5   7        0      1
Entities in level 6   2        0      0

Table A.1: Data statistics for the ACE2005 named entity recognition dataset.

Nested named entity recognition (nested NER) detects complex entities that include

both flat and nested entities, i.e., embedded entities included in other entities. We evaluated the performance of nested NER on the ACE2005 dataset, annotated with 7 nested entity types: person (PER), location (LOC), organisation (ORG), geo-political entity (GPE), facility (FAC), vehicle (VEH) and weapon (WEA). We follow Ju et al. (2018) in data splitting, keeping an 8:1:1 ratio for the training, development and testing sets, respectively. We also used the conventional BIO tagging scheme in this experiment. The data statistics of ACE2005 for named entity recognition are presented in Table A.1. The stacked bidirectional long short-term memory - conditional random field (BiLSTM-CRF) model proposed by Ju et al. (2018) was used as our baseline since it achieves top performance without using syntactic information. The model extracts nested entities by dynamically stacking flat BiLSTM-CRF blocks to predict entities from the inside out, which allows it to encode the dependencies between nested entities. We also compared our results with the recent model of Fisher and Vlachos (2019), which predicts real-valued segmentation structures for nested NER using a merge-and-label approach. The method includes two stages: first, it detects the entity boundaries at all nested levels; second, the embeddings of the predicted entities are used to predict their entity labels.
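As an illustration of the layered BIO scheme mentioned above (the example is invented, not taken from ACE2005), a nested mention can be encoded as one BIO tag sequence per nesting level, which the stacked BiLSTM-CRF baseline predicts from the inside out.

```python
# Illustrative layered BIO encoding of a nested mention: the inner GPE
# "United States" is embedded in the outer PER mention.
tokens  = ["The", "United", "States", "president", "visited", "Paris"]
layer_1 = ["O",    "B-GPE",  "I-GPE",  "O",         "O",       "B-GPE"]  # inner entities
layer_2 = ["B-PER", "I-PER", "I-PER",  "I-PER",     "O",       "O"]      # outer entity
```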

A.2 Experimental Settings

Table A.2: Nested NER hyperparameters.

Hyperparameter                              Value (Original / ELMo & SIWRs)
Batch size                                  91
No. of hidden units                         200 / 256
Dim. of char. emb.                          28
Dropout rate                                0.1708
Learning rate                               0.00426
Gradient clipping                           11
Weight decay (L2)                           9.43e-5

Table A.3: Test set results with different embeddings on the nested named entity recognition dataset (ACE2005).

Method        P       R       F1
W2V           79.54   64.55   71.26
ELMo          80.17   72.12   75.93
SIWRs-ELMo    79.57   75.61   77.54

We employed the layered named entity model (Ju et al., 2018) implemented with the Chainer library (Tokui et al., 2015).1 Table A.2 presents the hyperparameters used in the model.
1 https://github.com/meizhiju/layered-bilstm-crf

        Previous work                                    Ours (Stacked BiLSTM-CRF)
        Stacked        Merge & Label
        BiLSTM-CRF     GloVe   +E      +B-f       W2V     +E      +SE     +B-ft   +B-f    +SB
P       74.2           75.1    79.7    82.7       79.54   80.17   79.57   84.24   81.49   83.32
R       70.3           74.1    78.0    82.1       64.55   72.12   75.61   81.49   80.02   80.84
F1      72.2           74.6    78.9    82.4       71.26   75.93   77.54   82.84   80.75   82.06

Table A.4: Performance comparison on the nested NER ACE2005 test set. +E denotes ELMo used as base representations, +B-f is BERT-feature, +B-ft is BERT-fine-tuned, and +SE and +SB are SIWRs with ELMo or BERT as base representations, respectively.

The base model uses W2V embeddings from Miwa and Bansal (2016). Meanwhile, we use SIWRs trained on different base representations, e.g., W2V (Miwa and Bansal, 2016), GloVe (Pennington et al., 2014), and ELMo (Peters et al., 2018).

A.3 Results

Table A.3 shows the results on the ACE2005 test set. For this nested named entity recognition task we used SIWRs-ELMo (the enriched ELMo) pretrained on generic-domain data, as in the binary relation extraction task. We compared the performance of our nested NER baseline with different base representations and with current models in Table A.4. While the stacked BiLSTM-CRF (Ju et al., 2018) is the baseline that we employed, the SOTA (Merge and Label) performance with different embeddings was reported by Fisher and Vlachos (2019). As shown in Table A.4, our SIWRs consistently improved performance over the base representations. Although our baseline is around 2 percentage points lower than the SOTA in F1-score using either static or contextual embeddings, i.e., ELMo and contextual BERT (BERT-feature), adding syntactic information improved the baseline to a level comparable to the SOTA.

A.4 Comparison between Different Base Representations

Our model is not restricted to ELMo; Table A.5 shows the flexibility of our model with several word representations, including static and contextualised ones: W2V, GloVe, and ELMo. W2V and GloVe are static embeddings with a dimensionality of 200, while

Partition   Embedding   W2V      GloVe     ELMo
Dev         Original    71.03    66.31     76.71
            SIWRs       71.00    71.86     75.09
Test        Original    71.26    68.58     75.93
            SIWRs       72.67    72.93     77.54
Improvements (Absolute / Relative)   1.41 / 4.91%   4.35 / 13.84%   1.61 / 6.68%

Table A.5: Nested named entity recognition results on the ACE2005 development and test sets with different base representations and their enriched alternatives.

ELMo provides contextualised embeddings with a dimensionality of 256 and 3 layers. As shown in the table, we obtained 1-4% absolute improvements in F1-score over the original representations for nested named entity recognition. Although ELMo already encodes contextual and statistically inferred syntactic features (Peters et al., 2018), our SIWR model provides explicit syntactic information to the embeddings. GloVe obtained the largest improvement of 4.35% absolute F1-score and a relative error reduction of 13.84%. The distinct improvement over the static embeddings, i.e., GloVe, may be partly due to the contextual information our model embeds. Our experimental results show that incorporating syntactic information can further boost the performance of deep neural models on nested named entity recognition.

Appendix B

Language Models as Weak Supervision

B.1 BERT Annotator Confusion Matrices

Figure B.1 illustrates the confusion matrices of the gold and BERT-annotated relations. The diagonal is clearly visible for reWiki, while it is fainter for TACRED. This explains why the annotation performance on TACRED is low: BERT confuses many relations with one another.

B.2 Relation Exemplars

We present all the exemplars used for TACRED and reWiki in Table B.1 and Table B.2, respectively. All exemplars were manually created by the author and partially revised by a colleague.

Table B.2: Exemplars created for each relation in reWiki. Entities are denoted in italic.

ID Relation Exemplar

0   place served by transport hub   Luton Airport is an international airport in London .
1   mountain range   The Tour Noir is a mountain in the Mont Blanc massif .
2   religion   Henry VIII 's religion is Church of England .




3   participating team   Manchester United F.C. competes in the Premier League .
4   contains administrative territorial entity   Ho Chi Minh City is a territorial entity in Vietnam .
5   head of government   Barack Obama is the 44th president of the United States .
6   country of citizenship   Marco Polo was an Italian explorer .
7   original network   One litre of tears was first aired on Fuji TV .
8   heritage designation   City of Bath is listed on UNESCO World Heritage Site .
9   performer   Abbey Road is the eleventh studio album by the Beatles .
10   participant of   participated in UK 2019 .
11   position held   Barack Obama is the 44th president of the United States .
12   has part   Germany is part of European Union .
13   location of formation   Facebook was founded in Massachusetts .
14   located on terrain feature   Heard Island is located in the Indian Ocean .
15   architect   The architecture of Eiffel Tower was designed by Gustave Eiffel .
16   country of origin   Parasite is a 2019 South Korean black comedy .
17   publisher   Harry Potter was published by Scholastic .
18   director   Joker was directed by Todd Phillips .
19   father   Fred Trump is Donald Trump 's father .
20   developer   The Witcher was developed by CD Projekt .
21   military branch   Arthur Mackenzie Power was a Royal Navy admiral .
22   mouth of the watercourse   The White Nile river is a tributary of the Nile .
23   nominated for   Spirited Away was nominated for Best Animated Feature .



24   movement   Post-impressionist movement is associated with Vincent Willem van Gogh .
25   successful candidate   Obama was elected in 2009 .
26   followed by   iPad Air 2 was followed by iPad Air 3 .
27   manufacturer   iPhone was made by Foxconn .
28   instance of   Siamese is a cat breed .
29   after a work by   Harry Potter and the Cursed Child is based on a work by J. K. Rowling .
30   member of political party   David Cameron was a member of the Conservative Party .
31   licensed to broadcast to   Tokyo FM is a radio station in Chiyoda, Tokyo, Japan .
32   headquarters location   Facebook 's headquarter is located in Menlo Park, California, United States .
33   sibling   Alexander Watson is the brother of Emma Watson .
34   instrument   Yiruma plays piano .
35   country   Corfu island is in Greece .
36   occupation   Richard Phillips Feynman was an American theoretical physicist .
37   residence   Richard Feynman lived in New York .
38   work location   Stephen Hawking worked in Cambridge .
39   subsidiary   Cafe Nero is a child organization of Rome Bidco .
40   participant   Molly Hocking participated in The Voice UK 2019 .
41   operator   Stagecoach Manchester operated the local bus services in Greater Manchester .
42   characters   Hermione is a character in Harry Potter .
43   occupant   Old Trafford Stadium is occupied by Manchester United .
44   genre   The Beatles were an English rock band .
45   operating system   Microsoft Word can be installed on Android operating system .
46   owned by   WhatsApp is owned by Facebook .



47   platform   Contra: Rogue Corps was released for Playstation 4 .
48   tributary   The White Nile river is a tributary of the Nile .
49   winner   Lara Dutta was the winner of the Miss Universe 2000 pageant .
50   said to be the same as   Mary I of England was also known as bloody Mary .
51   composer   River flows in you was written by Yiruma .
52   league   Alessandro del Piero plays in Serie A league .
53   record label   Abbey Road was released by Apple Records .
54   distributor   Spirited Away was released by Toho .
55   screenwriter   Andrew Lloyd Webber is the screenwriter of the phantom of the opera .
56   sports season of league or competition   There is a season of UEFA Champions League in 2016 .
57   taxon rank   Felidae is a family in the taxonomic hierarchy .
58   location   The 2008 Summer Olympics was located in Beijing .
59   field of work   Alan Turing was a pioneer of computer science .
60   language of work or name   Les Miserables is a French historical novel .
61   applies to jurisdiction   Mayor of Paris applies jurisdiction to Paris .
62   notable work   Vincent van Gogh is known for the Starry Night .
63   located in the administrative territorial entity   Ho Chi Minh city is located in the South of Vietnam .
64   crosses   Channel Tunnel crosses English Channel .
65   original language of film or TV show   Friends is one of the most-watched English language TV shows .
66   competition class   Mike Tyson was a heavyweight boxer .
67   part of   Netherlands is part of Europe .
68   sport   Roger Federer is a tennis player .



69   constellation   Andromeda Galaxy is in the constellation Andromeda .
70   position played on team / speciality   Cristiano Ronaldo plays as a forward for Juventus .
71   located in or next to body of water   Easter Island is an island in Pacific Ocean .
72   voice type   Enrico Caruso has a voice of tenor .
73   follows   Monday is after Sunday .
74   spouse   Marie Curie is married to Pierre Curie .
75   military rank   Napoleon served as a general in the French army .
76   mother   Marie Curie is the mother of Irène Joliot-Curie .
77   member of   Iron Man is a member of Avengers .
78   child   Michael Douglas is a child of Kirk Douglas .
79   main subject   Robert Langdon is the main subject of The Da Vinci Code .

ID   Relation   Exemplar
0   org:alternate names   The World Health Organization (WHO) is a specialized agency of the United Nations responsible for international public health .
1   org:city of headquarters   Facebook 's headquarter is located in Menlo Park, California, United States .
2   org:country of headquarters   Facebook 's headquarter is located in Menlo Park, California, United States .
3   org:dissolved   President Truman dissolved the O.S.S. in 1945 .
4   org:founded   Facebook was founded in 2004 .
5   org:founded by   Facebook was founded by Mark Zuckerberg .
6   org:member of   Germany is a founding member of the European Union .
7   org:members   Germany is a founding member of the European Union .
8   org:number of employees/members   IBM total number of employees in 2019 was 383800 .
9   org:parents   Alphabet is the parent of Google .
10   org:political/religious affiliation   Tearfund is an international Christian relief and development agency .
11   org:shareholders   The largest shareholder of Google is Larry Page .
12   org:stateorprovince of headquarters   Facebook 's headquarter is located in Menlo Park, California, United States .
13   org:subsidiaries   Cafe Nero is a child organization of Rome Bidco .
14   org:top members/employees   Tedros Adhanom is the WHO current director .
15   org:website   gov.uk is a United Kingdom public sector information website .
16   per:age   Peter Higgs is now at the age of 90 .
17   per:alternate names   Mary I of England was also known as bloody Mary .
18   per:cause of death   Richard Feynman died of abdominal cancer .
19   per:charges   Jeffrey Dahmer was convicted of 15 murders .
20   per:children   Michael Douglas is a child of Kirk Douglas .
21   per:cities of residence   Richard Feynman lived in New York .
22   per:city of birth   Obama was born in Honolulu, Hawaii .
23   per:city of death   Richard Feynman died in Los Angeles , California , US .
24   per:countries of residence   Richard Feynman lived in US .
25   per:country of birth   Obama was born in the USA .
26   per:country of death   Richard Feynman died in Los Angeles , California , US .
27   per:date of birth   Obama was born in 1961 .
28   per:date of death   Richard Feynman died in 1988 .
29   per:employee of   Kayleigh McEnany is the current White House press secretary .
30   per:origin   Barack Obama is an American politician .
31   per:other family   Craig Robinson is Barack Obama 's brother in law .
32   per:parents   Fred Trump is Donald Trump 's father .
33   per:religion   Maximilian Kolbe is Catholic .
34   per:schools attended   Peter Higgs was awarded a PhD degree from King 's College London .
35   per:siblings   Alexander Watson is the brother of Emma Watson .
36   per:spouse   Marie Curie is married to Pierre Curie .
37   per:stateorprovince of birth   Obama was born in Honolulu, Hawaii .
38   per:stateorprovince of death   Richard Feynman died in Los Angeles , California , U.S .
39   per:stateorprovinces of residence   Barack Obama lives in Washington .
40   per:title   Barack Obama was the 44th president of the United States .

Table B.1: Exemplars created for each relation in TACRED. Entities are denoted in italic.

[Figure B.1 appears here: heatmaps of the confusion between gold relations (rows) and BERT-annotated relations (columns), with panel (a) TACRED and panel (b) reWiki.]

Figure B.1: Confusion matrices of gold relations and BERT-annotated relations. The indices of the relation types are given in Table B.1 and Table B.2.

Bibliography

Eugene Agichtein and Luis Gravano. 2000. Snowball: Extracting relations from large plain-text collections. In Proceedings of the fifth ACM conference on Digital libraries. pages 85–94.

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics, Santa Fe, New Mexico, USA, pages 1638–1649.

Christoph Alt, Marc Hübner, and Leonhard Hennig. 2019. Fine-tuning pre-trained transformer language models to distantly supervised relation extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, pages 1388–1398.

Gabor Angeli, Julie Tibshirani, Jean Wu, and Christopher D. Manning. 2014. Com- bining distant and partial supervision for relation extraction. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pages 1556–1567.

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2016. A latent variable model approach to PMI-based word embeddings. Transactions of the Association for Computational Linguistics 4:385–399.

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In The Semantic Web, Springer, pages 722–735.

Amit Bagga and Breck Baldwin. 1998. Algorithms for scoring coreference chains. In Proceedings of the First International Conference on Language Resources and Evaluation Workshop on Linguistics Coreference. volume 1, pages 563–566.


Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Fan Bai and Alan Ritter. 2019. Structured Minimally Supervised Learning for Neural Relation Extraction. In Proceedings of the 2019 Conference of the North Ameri- can Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, pages 3057–3069.

Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Proceed- ings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, pages 2895–2905.

Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based parsing by modeling characters instead of words with LSTMs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, pages 349–359.

Michele Banko, Michael J Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. 2007. Open information extraction from the web. In IJCAI. volume 7, pages 2670–2676.

Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring continuous word rep- resentations for dependency parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Baltimore, Maryland, pages 809–815.

David S. Batista, Bruno Martins, and Mário J. Silva. 2015. Semi-supervised bootstrapping of relationship extractors with distributional semantics. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, pages 499–504.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35(8):1798–1828.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research 3(Feb):1137–1155.

Julian Besag. 1975. Statistical analysis of non-lattice data. Journal of the Royal Statistical Society: Series D (The Statistician) 24(3):179–195.

Eduardo Blanco, Nuria Castell, and Dan I Moldovan. 2008. Causal relation extraction. In Lrec. volume 66, page 74.

David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research 3(Jan):993–1022.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, pages 1247–1250.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems. pages 2787–2795.

Elizabeth Boschee, Ralph Weischedel, and Alex Zamanian. 2005. Automatic infor- mation extraction. In Proceedings of the International Conference on Intelligence Analysis. Citeseer, volume 71.

Robert Bossy, Wiktoria Golik, Zorana Ratkovic, Philippe Bessières, and Claire Nédellec. 2013. BioNLP shared task 2013 – an overview of the bacteria biotope task. In Proceedings of the BioNLP Shared Task 2013 Workshop. pages 161–169.

Zied Bouraoui, Jose Camacho-Collados, and Steven Schockaert. 2020. Inducing relational knowledge from bert. In Proceedings of the Thirty-Fourth Innovative Applications of Artificial Intelligence Conference.

Sergey Brin. 1998. Extracting patterns and relations from the world wide web. In International Workshop on The World Wide Web and Databases. Springer, pages 172–183.

Peter F Brown, Vincent J Della Pietra, Peter V Desouza, Jennifer C Lai, and Robert L Mercer. 1992. Class-based n-gram models of natural language. Computational linguistics 18(4):467–480.

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 .

Martin Brümmer, Milan Dojchinovski, and Sebastian Hellmann. 2016. DBpedia abstracts: A large-scale, open, multilingual NLP training corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). European Language Resources Association (ELRA), Portorož, Slovenia, pages 3339–3343.

Razvan Bunescu and Raymond Mooney. 2005. A shortest path dependency kernel for relation extraction. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Vancouver, British Columbia, Canada, pages 724–731.

Rui Cai, Xiaodong Zhang, and Houfeng Wang. 2016. Bidirectional recurrent convolu- tional neural network for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 756–765.

Mary Elaine Califf and Raymond J. Mooney. 1997. Relational learning of pattern-match rules for information extraction. In CoNLL97: Computational Natural Language Learning.

Duy-Cat Can, Hoang-Quynh Le, Quang-Thuy Ha, and Nigel Collier. 2019. A richer-but- smarter shortest dependency path with attentive augmentation for relation extraction. In Proceedings of the 2019 Conference of the North American Chapter of the Asso- ciation for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, pages 2902–2912.

Tommaso Caselli and Piek Vossen. 2017. The event storyline corpus: A new benchmark for causal and temporal relation extraction. In Proceedings of the Events and Stories in the News Workshop. pages 77–86.

Augustin Cauchy. 1847. Méthode générale pour la résolution des systèmes d'équations simultanées. Comp. Rend. Sci. Paris 25(1847):536–538.

Zi Chai, Xiaojun Wan, Zhao Zhang, and Minjie Li. 2019. Harvesting drug effectiveness from social media. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. pages 55–64.

Yee Seng Chan and Dan Roth. 2010. Exploiting background knowledge for relation extraction. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010). Coling 2010 Organizing Committee, Beijing, China, pages 152–160.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2014. One billion word benchmark for measuring progress in statistical language modeling. INTERSPEECH .

Jinxiu Chen, Donghong Ji, Chew Lim Tan, and Zhengyu Niu. 2005. Unsupervised feature selection for relation extraction. In Companion Volume to the Proceedings of Conference including Posters/Demos and tutorial abstracts.

Jinxiu Chen, Donghong Ji, Chew Lim Tan, and Zhengyu Niu. 2006. Relation extraction using label propagation based semi-supervised learning. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Sydney, Australia, pages 129–136.

Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. 2014. A unified model for word sense representation and disambiguation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pages 1025–1035.

Ethan A. Chi, John Hewitt, and Christopher D. Manning. 2020. Finding universal gram- matical relations in multilingual BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, pages 5564–5577.

Nancy A. Chinchor. 1998. Overview of MUC-7. In Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29 - May 1, 1998.

Fenia Christopoulou, Makoto Miwa, and Sophia Ananiadou. 2018. A walk-based model on entity graphs for relation extraction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Melbourne, Australia, pages 81–88. 162 BIBLIOGRAPHY

Fenia Christopoulou, Makoto Miwa, and Sophia Ananiadou. 2019. Connecting the dots: Document-level neural relation extraction with edge-oriented graphs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP- IJCNLP). Association for Computational Linguistics, Hong Kong, China, pages 4925–4936.

Fenia Christopoulou*, Thy Thy Tran*, Sunil Kumar Sahu, Makoto Miwa, and Sophia Ananiadou. 2020. Adverse Drug Events and Medication Relation Extraction in EHRs with Ensemble Deep Learning Methods. Journal of the American Medical Informatics Association .

Grzegorz Chrupała and Afra Alishahi. 2019. Correlating neural and symbolic represen- tations of language. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, pages 2952–2962.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? an analysis of BERT’s attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, Florence, Italy, pages 276–286.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning. pages 160–167.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of machine learning research 12(ARTICLE):2493–2537.

Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine learning 20(3):273–297.

Stanford CS231n. 2020. Neural networks 1. https://cs231n.github.io/ .

Lei Cui, Furu Wei, and Ming Zhou. 2018. Neural open information extraction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Melbourne, Australia, pages 407–413.

Aron Culotta and Jeffrey Sorensen. 2004. Dependency tree kernels for relation extrac- tion. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04). Barcelona, Spain, pages 423–429.

Haskell B Curry. 1944. The method of steepest descent for non-linear minimization problems. Quarterly of Applied Mathematics 2(3):258–261.

George Cybenko. 1989. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems 2(4):303–314.

Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In NIPS. pages 3079–3087.

Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American society for information science 41(6):391–407.

Xiang Deng and Huan Sun. 2019. Leveraging 2-hop distant supervision from table entity pairs for relation extraction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, pages 410–420.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Pro- ceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, pages 4171–4186.

Thomas G Dietterich, Richard H Lathrop, and Tomás Lozano-Pérez. 1997. Solving the multiple instance problem with axis-parallel rectangles. Artificial intelligence 89(1-2):31–71.

Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith. 2020. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305 .

Cícero dos Santos, Bing Xiang, and Bowen Zhou. 2015. Classifying relations by ranking with convolutional neural networks. In Proceedings of the 53rd Annual Meeting

of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Beijing, China, pages 626–634.

Cicero Dos Santos and Bianca Zadrozny. 2014. Learning character-level representations for part-of-speech tagging. In International Conference on Machine Learning. pages 1818–1826.

Timothy Dozat and Christopher D Manning. 2016. Deep biaffine attention for neural dependency parsing. ICLR .

Jinhua Du, Jingguang Han, Andy Way, and Dadong Wan. 2018. Multi-level structured self-attentions for distantly supervised relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, pages 2216–2225.

Wenyu Du, Zhouhan Lin, Yikang Shen, Timothy J. O’Donnell, Yoshua Bengio, and Yue Zhang. 2020. Exploiting syntactic structure for better language modeling: A syntactic distance approach. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, pages 6611–6628.

Javid Ebrahimi and Dejing Dou. 2015. Chain based RNN for relation classification. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Denver, Colorado, pages 1244–1249.

Kathrin Eichler, Feiyu Xu, Hans Uszkoreit, and Sebastian Krause. 2017. Generating pattern-based entailment graphs for relation extraction. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017). Association for Computational Linguistics, Vancouver, Canada, pages 220–229.

Jeffrey L Elman. 1990. Finding structure in time. Cognitive science 14(2):179–211.

Hady Elsahar, Elena Demidova, Simon Gottschalk, Christophe Gravier, and Frederique Laforest. 2017. Unsupervised open relation extraction. In European Semantic Web Conference. Springer, pages 12–16.

Hady Elsahar, Pavlos Vougiouklis, Arslen Remaci, Christophe Gravier, Jonathon Hare, Frederique Laforest, and Elena Simperl. 2018. T-REx: A large scale alignment

of natural language with knowledge base triples. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki, Japan.

Allyson Ettinger, Philip Resnik, and Marine Carpuat. 2016. Retrofitting sense-specific word vectors using parallel text. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, California, pages 1378–1383.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Edinburgh, Scotland, UK., pages 1535–1545.

Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Compu- tational Linguistics: Human Language Technologies. Association for Computational Linguistics, Denver, Colorado, pages 1606–1615.

Jun Feng, Minlie Huang, Li Zhao, Yang Yang, and Xiaoyan Zhu. 2018. Reinforcement learning for relation classification from noisy data. In Thirty-Second AAAI Conference on Artificial Intelligence.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Lin- guistics (ACL’05). Association for Computational Linguistics, Ann Arbor, Michigan, pages 363–370.

J. R. Firth. 1957. A synopsis of linguistic theory 1930-55. Studies in Linguistic Analysis (special volume of the Philological Society) 1952-59:1–32.

John Rupert Firth. 1935. The technique of semantics. Transactions of the philological society 34(1):36–73.

Joseph Fisher and Andreas Vlachos. 2019. Merge and label: A novel neural network

architecture for nested NER. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, pages 5840–5850.

Lisheng Fu and Ralph Grishman. 2013. An efficient active learning framework for new relation types. In Proceedings of the Sixth International Joint Conference on Natural Language Processing. Asian Federation of Natural Language Processing, Nagoya, Japan, pages 692–698.

Lisheng Fu, Thien Huu Nguyen, Bonan Min, and Ralph Grishman. 2017. Domain adaptation for relation extraction with domain adversarial neural network. In Proceed- ings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Asian Federation of Natural Language Processing, Taipei, Taiwan, pages 425–429.

Tsu-Jui Fu, Peng-Hsuan Li, and Wei-Yun Ma. 2019. GraphRel: Modeling text as relational graphs for joint entity and relation extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, pages 1409–1418.

Ryan Gabbard, Marjorie Freedman, and Ralph Weischedel. 2011. Coreference for learn- ing to extract relations: Yes virginia, coreference matters. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Portland, Oregon, USA, pages 288–293.

Pablo Gamallo, Marcos Garcia, and Santiago Fernández-Lanza. 2012. Dependency-based open information extraction. In Proceedings of the Joint Workshop on Unsupervised and Semi-Supervised Learning in NLP. Association for Computational Linguistics, Avignon, France, pages 10–18.

Yaroslav Ganin and Victor Lempitsky. 2015. Unsupervised domain adaptation by backpropagation. In International conference on machine learning. pages 1180– 1189.

Tianyu Gao, Xu Han, Zhiyuan Liu, and Maosong Sun. 2019a. Hybrid attention-based prototypical networks for noisy few-shot relation classification. In Proceedings of the AAAI Conference on Artificial Intelligence. volume 33, pages 6407–6414. BIBLIOGRAPHY 167

Tianyu Gao, Xu Han, Hao Zhu, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. 2019b. FewRel 2.0: Towards more challenging few-shot relation classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Pro- cessing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, pages 6250–6255.

Victor Garcia and Joan Bruna. 2017. Few-shot learning with graph neural networks. In ICLR.

Goran Glavaš and Ivan Vulić. 2020. Is supervised syntactic parsing beneficial for language understanding? an empirical investigation. arXiv preprint arXiv:2008.06788 .

Jacob Goldberger and Ehud Ben-Reuven. 2016. Training deep neural-networks using a noise adaptation layer. ICLR .

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in neural information processing systems. pages 2672–2680.

Ralph Grishman, David Westbrook, and Adam Meyers. 2005. Nyu’s english ace 2005 system description. ACE 5.

Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. pages 855–864.

Zhijiang Guo, Yan Zhang, and Wei Lu. 2019. Attention guided graph convolutional networks for relation extraction. In Proceedings of the 57th Annual Meeting of the As- sociation for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, pages 241–251.

Pankaj Gupta, Benjamin Roth, and Hinrich Schütze. 2018. Joint bootstrapping machines for high confidence relation extraction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human

Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana, pages 26–36.

Harsha Gurulingappa, Abdul Mateen Rajput, Angus Roberts, Juliane Fluck, Martin Hofmann-Apitius, and Luca Toldo. 2012. Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. Journal of biomedical informatics 45(5):885–892.

Michael Alexander Kirkwood Halliday and Ruqaiya Hasan. 1976. Cohesion in english. London: Longman.

Xu Han, Tianyu Gao, Yuan Yao, Deming Ye, Zhiyuan Liu, and Maosong Sun. 2019. OpenNRE: An open and extensible toolkit for neural relation extraction. In Proceed- ings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP- IJCNLP): System Demonstrations. Association for Computational Linguistics, Hong Kong, China, pages 169–174.

Xu Han, Pengfei Yu, Zhiyuan Liu, Maosong Sun, and Peng Li. 2018a. Hierarchical relation extraction with coarse-to-fine grained attention. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, pages 2236–2245.

Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, Zhiyuan Liu, and Maosong Sun. 2018b. FewRel: A large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, pages 4803–4809.

Braden Hancock, Paroma Varma, Stephanie Wang, Martin Bringmann, Percy Liang, and Christopher Ré. 2018. Training classifiers with natural language explanations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, pages 1884–1895.

Zellig S Harris. 1954. Distributional structure. Word 10(2-3):146–162.

Takaaki Hasegawa, Satoshi Sekine, and Ralph Grishman. 2004. Discovering relations among named entities from large corpora. In Proceedings of the 42nd Annual Meeting

of the Association for Computational Linguistics (ACL-04). Barcelona, Spain, pages 415–422.

Kazuma Hashimoto, Makoto Miwa, Yoshimasa Tsuruoka, and Takashi Chikayama. 2013. Simple customization of recursive neural networks for semantic relation classification. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Seattle, Washington, USA, pages 1372–1376.

Luheng He, Mike Lewis, and Luke Zettlemoyer. 2015. Question-answer driven semantic role labeling: Using natural language to annotate natural language. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, pages 643–653.

Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2010. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation. Association for Computational Linguistics, Uppsala, Sweden, pages 33–38.

Sam Henry, Kevin Buchan, Michele Filannino, Amber Stubbs, and Ozlem Uzuner. 2020. 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records. Journal of the American Medical Informatics Association 27(1):3–12.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North Amer- ican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, pages 4129–4138.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 .

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9(8):1735–1780.

Johannes Hoffart, Fabian M Suchanek, Klaus Berberich, and Gerhard Weikum. 2013. Yago2: A spatially and temporally enhanced knowledge base from wikipedia. Artificial Intelligence 194:28–61.

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Compu- tational Linguistics: Human Language Technologies. Association for Computational Linguistics, Portland, Oregon, USA, pages 541–550.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, pages 328–339.

Linmei Hu, Luhao Zhang, Chuan Shi, Liqiang Nie, Weili Guan, and Cheng Yang. 2019. Improving distantly-supervised relation extraction with joint label embedding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, pages 3821–3829.

Xuming Hu, Lijie Wen, Yusong Xu, Chenwei Zhang, and Philip S Yu. 2020. SelfORE: Self-supervised relational feature learning for open relation extraction. arXiv preprint arXiv:2004.02438.

Eric Huang, Richard Socher, Christopher Manning, and Andrew Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Jeju Island, Korea, pages 873–882.

Yuyun Huang and Jinhua Du. 2019. Self-attention enhanced CNNs and collaborative curriculum learning for distantly supervised relation extraction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, pages 389–398.

David H Hubel and Torsten N Wiesel. 1959. Receptive fields of single neurones in the cat’s striate cortex. The Journal of physiology 148(3):574.

Lawrence Hubert and Phipps Arabie. 1985. Comparing partitions. Journal of Classification 2(1):193–218.

Scott B Huffman. 1995. Learning information extraction patterns from examples. In International Joint Conference on Artificial Intelligence. Springer, pages 246–260.

Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. SensEmbed: Learning sense embeddings for word and relational similarity. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Beijing, China, pages 95–105.

Sarthak Jain, Madeleine van Zuylen, Hannaneh Hajishirzi, and Iz Beltagy. 2020. SciREX: A challenge dataset for document-level information extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, pages 7506–7516.

Guoliang Ji, Kang Liu, Shizhu He, Jun Zhao, et al. 2017. Distant supervision for relation extraction with sentence-level attention and entity descriptions. In AAAI. volume 3060.

Heng Ji and Ralph Grishman. 2011. Knowledge base population: Successful approaches and challenges. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Portland, Oregon, USA, pages 1148–1158.

Wei Jia, Dai Dai, Xinyan Xiao, and Hua Wu. 2019. ARNOR: Attention regularization based noise reduction for distant supervision relation classification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, pages 1399–1408.

Jing Jiang. 2009. Multi-task transfer learning for weakly-supervised relation extraction. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Association for Computational Linguistics, Suntec, Singapore, pages 1012–1020.

Jing Jiang and ChengXiang Zhai. 2007. A systematic exploration of the feature space for relation extraction. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference. Association for Computational Linguistics, Rochester, New York, pages 113–120.

Meng Jiang, Jingbo Shang, Taylor Cassidy, Xiang Ren, Lance M Kaplan, Timothy P Hanratty, and Jiawei Han. 2017. Metapad: Meta pattern discovery from massive text corpora. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pages 877–886.

Xiaotian Jiang, Quan Wang, Peng Li, and Bin Wang. 2016. Relation extraction with multi-instance multi-label convolutional neural networks. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. The COLING 2016 Organizing Committee, Osaka, Japan, pages 1471–1480.

Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics .

Zhengbao Jiang, Pengcheng Yin, and Graham Neubig. 2019. Improving open information extraction via iterative rank-aware learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, pages 5295–5300.

Richard Johansson and Pierre Nugues. 2008. Dependency-based semantic role labeling of PropBank. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Honolulu, Hawaii, pages 69–78.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics 8:64–77.

Meizhi Ju, Makoto Miwa, and Sophia Ananiadou. 2018. A neural layered model for nested named entity recognition. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana, pages 1446–1459.

Dan Jurafsky and James H. Martin. 2019. Speech & language processing. Stanford.

Nanda Kambhatla. 2004. Combining lexical, syntactic, and semantic features with maximum entropy models for information extraction. In Proceedings of the ACL Interactive Poster and Demonstration Sessions. Association for Computational Linguistics, Barcelona, Spain, pages 178–181.

Mahdy Khayyamian, Seyed Abolghasem Mirroshandel, and Hassan Abolhassani. 2009. Syntactic tree-based relation extraction using a generalization of Collins and Duffy convolution tree kernel. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Student Research Workshop and Doctoral Consortium. Association for Computational Linguistics, Boulder, Colorado, pages 66–71.

Diederik P Kingma and Jimmy Lei Ba. 2014. Adam: A method for stochastic optimization. ICLR.

Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. ICLR.

Stefan Kombrink, Tomáš Mikolov, Martin Karafiát, and Lukáš Burget. 2011. Recurrent neural network based language modeling in meeting recognition. In Twelfth annual conference of the international speech communication association.

Martin Krallinger, Obdulia Rabal, Saber A Akhondi, Martín Pérez Pérez, Jesús Santamaría, GP Rodríguez, et al. 2017. Overview of the BioCreative VI chemical-protein interaction track. In Proceedings of the sixth BioCreative challenge evaluation workshop. volume 1, pages 141–146.

Artur Kulmizev, Vinit Ravishankar, Mostafa Abdou, and Joakim Nivre. 2020. Do neural language models show preferences for syntactic formalisms? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, pages 4077–4091.

Adhiguna Kuncoro, Chris Dyer, John Hale, Dani Yogatama, Stephen Clark, and Phil Blunsom. 2018. LSTMs can learn syntax-sensitive dependencies well, but modeling structure makes them better. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, pages 1426–1436.

Adhiguna Kuncoro, Lingpeng Kong, Daniel Fried, Dani Yogatama, Laura Rimell, Chris Dyer, and Phil Blunsom. 2020. Syntactic structure distillation pretraining for bidirectional encoders. arXiv preprint arXiv:2005.13482 .

Samuli Laine and Timo Aila. 2017. Temporal ensembling for semi-supervised learning. In ICLR.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.

Anne Lauscher, Ivan Vulić, Edoardo Maria Ponti, Anna Korhonen, and Goran Glavaš. 2019. Informing unsupervised pretraining with external linguistic knowledge. arXiv preprint arXiv:1909.02339.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521(7553):436–444.

Ben Lengerich, Andrew Maas, and Christopher Potts. 2018. Retrofitting distributional embeddings to knowledge graphs with functional relations. In Proceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics, Santa Fe, New Mexico, USA, pages 2423–2436.

Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Baltimore, Maryland, pages 302–308.

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). Association for Computational Linguistics, Vancouver, Canada, pages 333–342.

Jiao Li, Yueping Sun, Robin J Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J Mattingly, Thomas C Wiegers, and Zhiyong Lu. 2016a. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database 2016.

Jiwei Li, Thang Luong, Dan Jurafsky, and Eduard Hovy. 2015. When are tree structures necessary for deep learning of representations? In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, pages 2304–2314.

Pengfei Li, Kezhi Mao, Xuefeng Yang, and Qi Li. 2019. Improving relation extraction with knowledge-attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, pages 229–239.

Qi Li and Heng Ji. 2014. Incremental joint extraction of entity mentions and relations. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Baltimore, Maryland, pages 402–412.

Yitong Li, Trevor Cohn, and Timothy Baldwin. 2017. Robust training under linguistic adversity. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Association for Computational Linguistics, Valencia, Spain, pages 21–27.

Zhuang Li, Lizhen Qu, Qiongkai Xu, and Mark Johnson. 2016b. Unsupervised pretraining with Seq2Seq reconstruction loss for deep relation extraction models. In Proceedings of the Australasian Language Technology Association Workshop 2016. Melbourne, Australia, pages 54–64.

Dekang Lin and Patrick Pantel. 2001. DIRT - discovery of inference rules from text. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. pages 323–328.

Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 2124–2133.

Ying Lin, Heng Ji, Fei Huang, and Lingfei Wu. 2020. A joint neural model for information extraction with global features. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, pages 7999–8009.

Yongjie Lin, Yi Chern Tan, and Robert Frank. 2019. Open sesame: Getting inside BERT’s linguistic knowledge. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, Florence, Italy, pages 241–253.

Liyuan Liu, Xiang Ren, Qi Zhu, Shi Zhi, Huan Gui, Heng Ji, and Jiawei Han. 2017a. Heterogeneous supervision for relation extraction: A representation learning approach. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, pages 46–56.

Lizhen Liu, Xiao Hu, Wei Song, Ruiji Fu, Ting Liu, and Guoping Hu. 2018. Neural multitask learning for simile recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, pages 1543–1553.

Tianyu Liu, Kexiang Wang, Baobao Chang, and Zhifang Sui. 2017b. A soft-label method for noise-tolerant distantly supervised relation extraction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, pages 1790–1795.

Yang Liu, Furu Wei, Sujian Li, Heng Ji, Ming Zhou, and Houfeng Wang. 2015. A dependency-based neural network for relation classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Association for Computational Linguistics, Beijing, China, pages 285–290.

Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. 2002. Text classification using string kernels. Journal of Machine Learning Research 2(Feb):419–444.

Oier Lopez de Lacalle and Mirella Lapata. 2013. Unsupervised relation extraction with general domain knowledge. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Seattle, Washington, USA, pages 415–425.

Yi Luan, Dave Wadden, Luheng He, Amy Shah, Mari Ostendorf, and Hannaneh Hajishirzi. 2019. A general framework for information extraction using dynamic span graphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, pages 3036–3046.

Bingfeng Luo, Yansong Feng, Jianbo Xu, Xiang Zhang, and Dongyan Zhao. 2017. Learning to predict charges for criminal cases with legal basis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, pages 2727–2736.

Fan Luo, Ajay Nagesh, Rebecca Sharp, and Mihai Surdeanu. 2019. Semi-supervised teacher-student architecture for relation extraction. In Proceedings of the Third Workshop on Structured Prediction for NLP. Association for Computational Linguistics, Minneapolis, Minnesota, pages 29–37.

Shuai Ma, Gang Wang, Yansong Feng, and Jinpeng Huai. 2019. Easy first relation extraction with information redundancy. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, pages 3851–3861.

Farzaneh Mahdisoltani, Joanna Biega, and Fabian M Suchanek. 2015. YAGO3: A knowledge base from multilingual Wikipedias. In Proceedings of the 7th Biennial Conference on Innovative Data Systems Research (CIDR 2015).

Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Association for Computational Linguistics, Baltimore, Maryland, pages 55–60.

Diego Marcheggiani, Joost Bastings, and Ivan Titov. 2018. Exploiting semantics in neural machine translation with graph convolutional networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). Association for Computational Linguistics, New Orleans, Louisiana, pages 486–492.

Diego Marcheggiani and Ivan Titov. 2016. Discrete-state variational autoencoders for joint discovery and factorization of relations. Transactions of the Association for Computational Linguistics 4:231–244.

Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, pages 1506–1515.

Mausam, Michael Schmitz, Stephen Soderland, Robert Bart, and Oren Etzioni. 2012. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, Jeju Island, Korea, pages 523–534.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems. pages 6294–6305.

Warren S. McCulloch and Walter H. Pitts. 1943. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics 5.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. Regularizing and optimizing LSTM language models. In ICLR.

Filipe Mesquita, Jordan Schmidek, and Denilson Barbosa. 2013. Effectiveness and efficiency of open relation extraction. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Seattle, Washington, USA, pages 447–457.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 .

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. pages 3111–3119.

Scott Miller, Heidi Fox, Lance Ramshaw, and Ralph Weischedel. 2000. A novel use of statistical parsing to extract information from text. In 1st Meeting of the North American Chapter of the Association for Computational Linguistics.

Bonan Min, Ralph Grishman, Li Wan, Chang Wang, and David Gondek. 2013. Distant supervision for relation extraction with an incomplete knowledge base. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Atlanta, Georgia, pages 777–782.

Marvin Minsky and Seymour Papert. 1969. Perceptrons: An introduction to computational geometry. MIT Press, Cambridge, MA.

Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Association for Computational Linguistics, Suntec, Singapore, pages 1003–1011.

Paramita Mirza. 2014. Extracting temporal and causal relations between events. In Proceedings of the ACL 2014 Student Research Workshop. pages 10–17.

Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. 2017. A simple neural attentive meta-learner. In ICLR.

Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 1105–1116.

Andriy Mnih and Yee Whye Teh. 2012. A fast and simple algorithm for training neural probabilistic language models. In ICML.

Raymond J Mooney and Razvan C Bunescu. 2006. Subsequence kernels for relation extraction. In Advances in neural information processing systems. pages 171–178.

Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. In AISTATS. Citeseer, volume 5, pages 246–252.

Marius Mosbach, Maksym Andriushchenko, and Dietrich Klakow. 2020. On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines. arXiv preprint arXiv:2006.04884.

Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. Counter-fitting word vectors to linguistic constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, California, pages 142–148.

Tsendsuren Munkhdalai and Hong Yu. 2017. Meta networks. Proceedings of machine learning research 70:2554.

Ndapandula Nakashole, Gerhard Weikum, and Fabian Suchanek. 2012. PATTY: A taxonomy of relational patterns with semantic types. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, Jeju Island, Korea, pages 1135–1145.

Guoshun Nan, Zhijiang Guo, Ivan Sekulic, and Wei Lu. 2020. Reasoning with latent structure refinement for document-level relation extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, pages 1546–1557.

Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014. Efficient non-parametric estimation of multiple embeddings per word in vector space. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pages 1059–1069.

Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. 2019. ScispaCy: Fast and robust models for biomedical natural language processing. CoRR abs/1902.07669.

Dat PT Nguyen, Yutaka Matsuo, and Mitsuru Ishizuka. 2007. Relation extraction from wikipedia using subtree mining. In Proceedings of the National Conference on Artificial Intelligence. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, volume 22, page 1414.

Thien Huu Nguyen and Ralph Grishman. 2014. Employing word representations and regularization for domain adaptation of relation extraction. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Baltimore, Maryland, pages 68–74.

Thien Huu Nguyen and Ralph Grishman. 2015a. Combining neural networks and log-linear models to improve relation extraction. arXiv preprint arXiv:1511.05926 .

Thien Huu Nguyen and Ralph Grishman. 2015b. Relation extraction: Perspective from convolutional neural networks. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing. Association for Computational Linguistics, Denver, Colorado, pages 39–48.

Thien Huu Nguyen, Barbara Plank, and Ralph Grishman. 2015. Semantic representations for domain adaptation: A case study on the tree kernel-based method for relation extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Beijing, China, pages 635–644.

Truc-Vien T. Nguyen, Alessandro Moschitti, and Giuseppe Riccardi. 2009. Convolution kernels on constituent, dependency and sequential structures for relation extraction. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Singapore, pages 1378–1387.

Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective learning on multi-relational data. In ICML. volume 11, pages 809–816.

Christina Niklaus, Matthias Cetto, André Freitas, and Siegfried Handschuh. 2018. A survey on open information extraction. In Proceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics, Santa Fe, New Mexico, USA, pages 3866–3878.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16). European Language Resources Association (ELRA), Portorož, Slovenia, pages 1659–1666.

Eric W Noreen. 1989. Computer-Intensive Methods for Testing Hypotheses. Wiley New York.

Abiola Obamuyide and Andreas Vlachos. 2018. Zero-shot relation classification as textual entailment. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER). Association for Computational Linguistics, Brussels, Belgium, pages 72–78.

Christopher Olah. 2015. Understanding LSTM networks. colah.github.io.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, Curran Associates, Inc., pages 8024–8035.

Sachin Pawar, Pushpak Bhattacharyya, and Girish Keshav Palshikar. 2014. Semi-supervised relation extraction using EM algorithm. In International Conference on NLP (ICON 2013).

Hao Peng, Roy Schwartz, and Noah A. Smith. 2019. PaLM: A hybrid parser and language model. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, pages 3644–3651.

Minlong Peng, Qi Zhang, Yu-gang Jiang, and Xuanjing Huang. 2018. Cross-domain sentiment classification with target domain specific information. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, pages 2505–2513.

Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. 2017. Cross-sentence n-ary relation extraction with graph LSTMs. Transactions of the Association for Computational Linguistics 5:101–115.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pages 1532–1543.

Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. pages 701–710.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana, pages 2227–2237.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, pages 2463–2473.

Jason Phang, Thibault Févry, and Samuel R Bowman. 2018. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088.

Barbara Plank and Alessandro Moschitti. 2013. Embedding semantic similarity in tree kernels for domain adaptation of relation extraction. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Sofia, Bulgaria, pages 1498–1507.

Sampo Pyysalo, Filip Ginter, Juho Heimonen, Jari Björne, Jorma Boberg, Jouni Järvinen, and Tapio Salakoski. 2007. BioInfer: a corpus for information extraction in the biomedical domain. BMC Bioinformatics 8(1):50.

Longhua Qian and Guodong Zhou. 2010. Clustering-based stratified seed sampling for semi-supervised relation classification. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Cambridge, MA, pages 346–355.

Longhua Qian, Guodong Zhou, Fang Kong, and Qiaoming Zhu. 2009. Semi-supervised learning for semantic relation classification using stratified sampling strategy. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Singapore, pages 1437–1445.

Longhua Qian, Guodong Zhou, Fang Kong, Qiaoming Zhu, and Peide Qian. 2008. Exploiting constituent dependencies for tree kernel-based semantic relation extraction. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008). Coling 2008 Organizing Committee, Manchester, UK, pages 697–704.

Longhua Qian, Guodong Zhou, Qiaomin Zhu, and Peide Qian. 2007. Relation extraction using convolution tree kernel expanded with entity features. In Proceedings of the 21st Pacific Asia Conference on Language, Information and Computation. The Korean Society for Language and Information (KSLI), Seoul National University, Seoul, Korea, pages 415–421.

Pengda Qin, Weiran Xu, and William Yang Wang. 2018a. DSGAN: Generative adversarial training for distant supervision relation extraction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, pages 496–505.

Pengda Qin, Weiran Xu, and William Yang Wang. 2018b. Robust distant supervision relation extraction via deep reinforcement learning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, pages 2137–2147.

Meng Qu, Tianyu Gao, Louis-Pascal AC Xhonneux, and Jian Tang. 2020. Few-shot relation extraction via Bayesian meta-learning on relation graphs. In ICML.

Chris Quirk and Hoifung Poon. 2017. Distant supervision for relation extraction beyond the sentence boundary. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. Association for Computational Linguistics, Valencia, Spain, pages 1171–1182.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1(8):9.

Alexander Ratner, Stephen H Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. 2017. Snorkel: Rapid training data creation with weak supervision. In Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases. NIH Public Access, volume 11, page 269.

Alexander J Ratner, Christopher M De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. 2016. Data programming: Creating large training sets, quickly. In Advances in neural information processing systems. pages 3567–3575.

Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. 2014. Training deep neural networks on noisy labels with bootstrapping. ICLR Workshop .

Xiang Ren, Zeqiu Wu, Wenqi He, Meng Qu, Clare R Voss, Heng Ji, Tarek F Abdelzaher, and Jiawei Han. 2017. CoType: Joint extraction of typed entities and relations with knowledge bases. In Proceedings of the 26th International Conference on World Wide Web. pages 1015–1024.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, pages 148–163.

Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Atlanta, Georgia, pages 74–84.

Alan Ritter, Luke Zettlemoyer, Mausam, and Oren Etzioni. 2013. Modeling missing data in distant supervision for information extraction. Transactions of the Association for Computational Linguistics 1:367–378.

Kirk Roberts, Dina Demner-Fushman, and Joseph M Tonning. 2017. Overview of the TAC 2017 adverse reaction extraction from drug labels track. In TAC.

Simone Romano, Nguyen Xuan Vinh, James Bailey, and Karin Verspoor. 2016. Adjusting for chance clustering comparison measures. Journal of Machine Learning Research 17(1):4635–4666.

Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. 2020. DropEdge: Towards deep graph convolutional networks on node classification. In International Conference on Learning Representations.

Andrew Rosenberg and Julia Hirschberg. 2007. V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). Association for Computational Linguistics, Prague, Czech Republic, pages 410–420.

Benjamin Rosenfeld and Ronen Feldman. 2007. Using corpus statistics on entities to improve semi-supervised relation extraction from the web. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. Association for Computational Linguistics, Prague, Czech Republic, pages 600–607.

Dan Roth and Wen-tau Yih. 2007. Global inference for entity and relation identification via a linear programming formulation. Introduction to statistical relational learning pages 553–580.

David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1985. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, pages 379–389.

Evan Sandhaus. 2008. The New York Times annotated corpus. Linguistic Data Consortium, Philadelphia 6(12):e26752.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019a. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Victor Sanh, Thomas Wolf, and Sebastian Ruder. 2019b. A hierarchical multi-task approach for learning embeddings from semantic tasks. In Proceedings of the AAAI Conference on Artificial Intelligence. volume 33, pages 6949–6956.

Timo Schick and Hinrich Schütze. 2020. Exploiting cloze questions for few-shot text classification and natural language inference. arXiv preprint arXiv:2001.07676.

Isabel Segura-Bedmar, Paloma Martínez, and María Herrero-Zazo. 2013. SemEval-2013 task 9: Extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013). In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013). Association for Computational Linguistics, Atlanta, Georgia, USA, pages 341–350.

Elaheh ShafieiBavani, Antonio Jimeno Yepes, Xu Zhong, and David Martinez Iraola. 2020. Global locality in biomedical relation and event extraction. In Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing. Association for Computational Linguistics, Online, pages 195–204.

MJ Shardlow, Nhung Nguyen, Gareth Owen, Claire O’Donovan, Andrew Leach, John McNaught, Steve Turner, and Sophia Ananiadou. 2018. A new corpus to support text mining for the curation of metabolites in the ChEBI database. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). pages 280–285.

Yatian Shen and Xuanjing Huang. 2016. Attention-based convolutional neural network for semantic relation extraction. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. The COLING 2016 Organizing Committee, Osaka, Japan, pages 2526–2536.

Weijia Shi, Muhao Chen, Pei Zhou, and Kai-Wei Chang. 2019. Retrofitting contextualized word embeddings with paraphrases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, pages 1198–1203.

Étienne Simon, Vincent Guigue, and Benjamin Piwowarski. 2019. Unsupervised information extraction: Regularizing discriminative approaches with relation distribution losses. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, pages 1378–1387.

Amit Singhal. 2012. Introducing the knowledge graph: things, not strings. Official Google Blog 5.

Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. In Advances in neural information processing systems. pages 4077–4087.

Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In Advances in neural information processing systems. pages 926–934.

Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, Jeju Island, Korea, pages 1201–1211.

Linfeng Song, Yue Zhang, Daniel Gildea, Mo Yu, Zhiguo Wang, and Jinsong Su. 2019. Leveraging dependency forest for neural medical relation extraction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, pages 208–218.

Linfeng Song, Yue Zhang, Zhiguo Wang, and Daniel Gildea. 2018. N-ary relation extraction using graph-state LSTM. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, pages 2226–2235.

Daniil Sorokin and Iryna Gurevych. 2017. Context-aware representations for knowledge base relation extraction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, pages 1784–1789.

Axel J Soto, Piotr Przybyła, and Sophia Ananiadou. 2019. Thalia: semantic search engine for biomedical abstracts. Bioinformatics 35(10):1799–1801.

Gabriel Stanovsky and Ido Dagan. 2016. Creating a large benchmark for open information extraction. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, pages 2300–2305.

Gabriel Stanovsky, Julian Michael, Luke Zettlemoyer, and Ido Dagan. 2018. Supervised open information extraction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana, pages 885–895.

Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, pages 5027–5038.

Yu Su, Honglei Liu, Semih Yavuz, Izzeddin Gür, Huan Sun, and Xifeng Yan. 2018. Global relation embedding for relation extraction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana, pages 820–830.

Fabian M Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. YAGO: a core of semantic knowledge. In Proceedings of the 16th international conference on World Wide Web. pages 697–706.

Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. 2014. Training convolutional networks with noisy labels. ICLR Workshop .

Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in neural information processing systems. pages 2440–2448.

Ang Sun and Ralph Grishman. 2012. Active learning for relation type extension with local and global data views. In Proceedings of the 21st ACM international conference on Information and knowledge management. pages 1105–1112.

Ang Sun, Ralph Grishman, and Satoshi Sekine. 2011. Semi-supervised relation extraction with large-scale word clustering. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Portland, Oregon, USA, pages 521–529.

Changzhi Sun, Yeyun Gong, Yuanbin Wu, Ming Gong, Daxin Jiang, Man Lan, Shiliang Sun, and Nan Duan. 2019a. Joint type inference on entities and relations via graph convolutional networks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, pages 1361–1370.

Shengli Sun, Qingfeng Sun, Kevin Zhou, and Tengchao Lv. 2019b. Hierarchical attention prototypical networks for few-shot text classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, pages 476–485.

Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019c. Patient knowledge distillation for BERT model compression. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, pages 4323–4332.

Xu Sun, Wenjie Li, Houfeng Wang, and Qin Lu. 2014. Feature-frequency–adaptive on-line training for fast and accurate natural language processing. Computational Linguistics 40(3):563–586.

Mihai Surdeanu, Sanda Harabagiu, John Williams, and Paul Aarseth. 2003. Using predicate-argument structures for information extraction. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Sapporo, Japan, pages 8–15.

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, Jeju Island, Korea, pages 455–465.

Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT press.

Swabha Swayamdipta, Matthew Peters, Brendan Roof, Chris Dyer, and Noah A. Smith. 2019. Shallow syntax in deep water. arXiv preprint arXiv:1906.04341.

Idan Szpektor, Hristo Tanev, Ido Dagan, and Bonaventura Coppola. 2004. Scaling web-based acquisition of entailment relations. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Barcelona, Spain, pages 41–48.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Beijing, China, pages 1556–1566.

Shingo Takamatsu, Issei Sato, and Hiroshi Nakagawa. 2012. Reducing wrong labels in distant supervision for relation extraction. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Jeju Island, Korea, pages 721–729.

Antti Tarvainen and Harri Valpola. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems. pages 1195–1204.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas McCoy, Najoung Kim, Benjamin Van Durme, Sam Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? probing for sentence structure in contextualized word representations. In ICLR.

Theano Development Team. 2016. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints abs/1605.02688.

Julien Tissier, Christophe Gravier, and Amaury Habrard. 2017. Dict2vec: Learning word embeddings using lexical dictionaries. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, pages 254–263.

Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. 2015. Chainer: a next-generation open source framework for deep learning. In Proceedings of workshop on machine learning systems (LearningSys) in the twenty-ninth annual conference on neural information processing systems (NIPS). volume 5, pages 1–6.

Thy Thy Tran, Phong Le, and Sophia Ananiadou. 2020a. Exploiting Language Models for Weakly-Supervised Relation Classification. In Under Review.

Thy Thy Tran, Phong Le, and Sophia Ananiadou. 2020b. Revisiting unsupervised relation extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, pages 7498–7505.

Thy Thy Tran, Makoto Miwa, and Sophia Ananiadou. 2020c. Syntactically-Informed Word Representations. Neurocomputing .

Hai-Long Trieu, Thy Thy Tran, Khoa N. A. Duong, Anh Nguyen, Makoto Miwa, and Sophia Ananiadou. 2020. DeepEventMine: End-to-end Neural Nested Event Extraction from Biomedical Texts. Bioinformatics .

Trieu H Trinh and Quoc V Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847 .

Bayu Distiawan Trisedya, Gerhard Weikum, Jianzhong Qi, and Rui Zhang. 2019. Neural relation extraction for knowledge base enrichment. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, pages 229–240.

Özlem Uzuner, Brett R South, Shuying Shen, and Scott L DuVall. 2011. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association 18(5):552–556.

Cornelius Joost van Rijsbergen. 1979. Information Retrieval. Butterworth-Heinemann.

Shikhar Vashishth, Manik Bhandari, Prateek Yadav, Piyush Rai, Chiranjib Bhattacharyya, and Partha Talukdar. 2019a. Incorporating syntactic and semantic information in word embeddings using graph convolutional networks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, pages 3308–3318.

Shikhar Vashishth, Rishabh Joshi, Sai Suman Prayaga, Chiranjib Bhattacharyya, and Partha Talukdar. 2018. RESIDE: Improving distantly-supervised neural relation extraction using side information. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, pages 1257–1266.

Shikhar Vashishth, Naganand Yadati, and Partha Talukdar. 2019b. Graph-based deep learning in natural language processing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): Tutorial Abstracts. Association for Computational Linguistics, Hong Kong, China.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. pages 5998–6008.

Patrick Verga, Emma Strubell, and Andrew McCallum. 2018. Simultaneously self-attending to all mentions for full-abstract biological relation extraction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana, pages 872–884.

Marc Verhagen, Robert Gaizauskas, Frank Schilder, Mark Hepple, Graham Katz, and James Pustejovsky. 2007. SemEval-2007 task 15: TempEval temporal relation identification. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007). Association for Computational Linguistics, Prague, Czech Republic, pages 75–80.

Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. In Advances in neural information processing systems. pages 3630–3638.

Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In Advances in neural information processing systems. pages 2773–2781.

Denny Vrandečić. 2012. Wikidata: A new platform for collaborative data collection. In Proceedings of the 21st international conference on world wide web. pages 1063–1064.

Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Communications of the ACM 57(10):78–85.

Ngoc Thang Vu, Heike Adel, Pankaj Gupta, and Hinrich Schütze. 2016. Combining recurrent and convolutional neural networks for relation classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, California, pages 534–539.

Ivan Vulić. 2018. Injecting lexical contrast into word vectors by guiding vector space specialisation. In Proceedings of The Third Workshop on Representation Learning for NLP. Association for Computational Linguistics, Melbourne, Australia, pages 137–143.

Ivan Vulić, Goran Glavaš, Nikola Mrkšić, and Anna Korhonen. 2018. Post-specialisation: Retrofitting vectors of words unseen in lexical resources. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana, pages 516–527.

Silke Wagner and Dorothea Wagner. 2007. Comparing clusterings: an overview. Universität Karlsruhe, Fakultät für Informatik, Karlsruhe.

Kiri Wagstaff. 2000. Refining inductive bias in unsupervised learning via constraints. In AAAI/IAAI. page 1112.

Christopher Walker, Stephanie Strassel, Julie Medero, and Kazuaki Maeda. 2006. ACE 2005 multilingual training corpus. Linguistic Data Consortium, Philadelphia 57.

Benyou Wang, Donghao Zhao, Christina Lioma, Qiuchi Li, Peng Zhang, and Jakob Grue Simonsen. 2020. Encoding word order in complex embeddings. In ICLR.

Guanying Wang, Wen Zhang, Ruoxu Wang, Yalin Zhou, Xi Chen, Wei Zhang, Hai Zhu, and Huajun Chen. 2018. Label-free distant supervision for relation extraction via knowledge graph embedding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, pages 2246–2255.

Hai Wang and Hoifung Poon. 2018. Deep probabilistic logic: A unifying framework for indirect supervision. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, pages 1891–1902.

Hao Wang, Bing Liu, Chaozhuo Li, Yan Yang, and Tianrui Li. 2019a. Learning with noisy labels for sentence-level sentiment classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, pages 6286–6292.

Haoyu Wang, Ming Tan, Mo Yu, Shiyu Chang, Dakuo Wang, Kun Xu, Xiaoxiao Guo, and Saloni Potdar. 2019b. Extracting multiple-relations in one-pass with pre-trained transformers. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, pages 1371–1377.

Linlin Wang, Zhu Cao, Gerard de Melo, and Zhiyuan Liu. 2016. Relation classification via multi-level attention CNNs. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 1298–1307.

Joe H Ward Jr. 1963. Hierarchical grouping to optimize an objective function. Journal of the American statistical association 58(301):236–244.

Chih-Hsuan Wei, Hung-Yu Kao, and Zhiyong Lu. 2013. PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Research 41.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace’s Transformers: State-of-the-art natural language processing. ArXiv abs/1910.03771.

Fei Wu and Daniel S. Weld. 2010. Open information extraction using Wikipedia. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Uppsala, Sweden, pages 118–127.

Yi Wu, David Bamman, and Stuart Russell. 2017. Adversarial training for relation extraction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, pages 1778–1783.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Junyuan Xie, Ross Girshick, and Ali Farhadi. 2016. Unsupervised deep embedding for clustering analysis. In International conference on machine learning. pages 478–487.

Xin Xin, Fajie Yuan, Xiangnan He, and Joemon M. Jose. 2018. Batch IS NOT heavy: Learning word representations from all samples. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, pages 1853–1862.

Feiyu Xu, Hans Uszkoreit, and Hong Li. 2007. A seed-driven bottom-up machine learning framework for extracting relations of various complexity. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. Association for Computational Linguistics, Prague, Czech Republic, pages 584–591.

Feiyu Xu, Hans Uszkoreit, Hong Li, and Niko Felger. 2008. Adaptation of relation extraction rules to new domains. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08). European Language Resources Association (ELRA), Marrakech, Morocco.

Kun Xu, Siva Reddy, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2016. Question answering on Freebase via relation extraction and textual evidence. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 2326–2336.

Peng Xu and Denilson Barbosa. 2019. Connecting language and knowledge with heterogeneous representations for neural relation extraction. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, pages 3201–3206.

Yan Xu, Lili Mou, Ge Li, Yunchuan Chen, Hao Peng, and Zhi Jin. 2015. Classifying relations via long short term memory networks along shortest dependency paths. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, pages 1785–1794.

Jianhao Yan, Lin He, Ruqin Huang, Jian Li, and Ying Liu. 2019. Relation extraction with temporal reasoning based on memory augmented distant supervision. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, pages 1019–1030.

Yulan Yan, Naoaki Okazaki, Yutaka Matsuo, Zhenglu Yang, and Mitsuru Ishizuka. 2009. Unsupervised relation extraction by mining Wikipedia texts using information from the web. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Association for Computational Linguistics, Suntec, Singapore, pages 1021–1029.

Yunlun Yang, Yunhai Tong, Shulei Ma, and Zhi-Hong Deng. 2016. A position encoding convolutional neural network based on dependency tree for relation classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, pages 65–74.

Limin Yao, Aria Haghighi, Sebastian Riedel, and Andrew McCallum. 2011. Structured relation discovery using generative models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Edinburgh, Scotland, UK., pages 1456–1466.

Limin Yao, Sebastian Riedel, and Andrew McCallum. 2012. Unsupervised relation discovery with sense disambiguation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Jeju Island, Korea, pages 712–720.

Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, and Maosong Sun. 2019. DocRED: A large-scale document-level relation extraction dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, pages 764–777.

Hai Ye, Wenhan Chao, Zhunchen Luo, and Zhoujun Li. 2017. Jointly extracting relations with class ties via effective deep ranking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vancouver, Canada, pages 1810–1820.

Wei Ye, Bo Li, Rui Xie, Zhonghao Sheng, Long Chen, and Shikun Zhang. 2019. Exploiting entity BIO tag embeddings and multi-task learning for relation extraction with imbalanced data. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, pages 1351–1360.

Zhi-Xiu Ye and Zhen-Hua Ling. 2019a. Distant supervision relation extraction with intra-bag and inter-bag attentions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, pages 2810–2819.

Zhi-Xiu Ye and Zhen-Hua Ling. 2019b. Multi-level matching and aggregation network for few-shot relation classification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, pages 2872–2881.

Zhiguo Yu, Trevor Cohen, Byron Wallace, Elmer Bernstam, and Todd Johnson. 2016. Retrofitting word vectors of MeSH terms to improve semantic similarity measures. In Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis. Association for Computational Linguistics, Austin, TX, pages 43–51.

Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 2003. Kernel methods for relation extraction. Journal of machine learning research 3(Feb):1083–1106.

Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, pages 1753–1762.

Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation classification via convolutional deep neural network. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. Dublin City University and Association for Computational Linguistics, Dublin, Ireland, pages 2335–2344.

Wenyuan Zeng, Yankai Lin, Zhiyuan Liu, and Maosong Sun. 2017. Incorporating relation paths in neural relation extraction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, pages 1768–1777.

Dongxu Zhang and Dong Wang. 2015. Relation classification via recurrent neural network. arXiv preprint arXiv:1508.01006.

Min Zhang, Jie Zhang, and Jian Su. 2006. Exploring syntactic features for relation extraction using a convolution tree kernel. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference. Association for Computational Linguistics, New York City, USA, pages 288–295.

Ningyu Zhang, Shumin Deng, Zhanlin Sun, Guanying Wang, Xi Chen, Wei Zhang, and Huajun Chen. 2019a. Long-tail relation extraction via knowledge graph embeddings and graph convolution networks. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, pages 3016–3025.

Sheng Zhang, Kevin Duh, and Benjamin Van Durme. 2017a. MT/IE: Cross-lingual open information extraction with neural sequence-to-sequence models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Association for Computational Linguistics, Valencia, Spain, pages 64–70.

Shu Zhang, Dequan Zheng, Xinchen Hu, and Ming Yang. 2015. Bidirectional long short-term memory networks for relation classification. In Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation. Shanghai, China, pages 73–78.

Tianyi Zhang, Felix Wu, Arzoo Katiyar, Kilian Q Weinberger, and Yoav Artzi. 2020a. Revisiting few-sample BERT fine-tuning. arXiv preprint arXiv:2006.05987.

Yuhao Zhang, Peng Qi, and Christopher D. Manning. 2018. Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, pages 2205–2215.

Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017b. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, pages 35–45.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019b. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, pages 1441–1451.

Zhenyu Zhang, Xiaobo Shu, Bowen Yu, Tingwen Liu, Jiapeng Zhao, Quangang Li, and Li Guo. 2020b. Distilling knowledge from well-informed soft labels for neural relation extraction. In AAAI. pages 9620–9627.

Zhu Zhang. 2004. Weakly-supervised relation classification for information extraction. In Proceedings of the thirteenth ACM international conference on Information and knowledge management. pages 581–588.

Shubin Zhao and Ralph Grishman. 2005. Extracting relations with integrated information using kernel methods. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05). Association for Computational Linguistics, Ann Arbor, Michigan, pages 419–426.

Shun Zheng, Xu Han, Yankai Lin, Peilin Yu, Lu Chen, Ling Huang, Zhiyuan Liu, and Wei Xu. 2019. DIAG-NRE: A neural pattern diagnosis framework for distantly supervised neural relation extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, pages 1419–1429.

GuoDong Zhou, JunHui Li, LongHua Qian, and QiaoMing Zhu. 2008. Semi-supervised learning for relation extraction. In Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I.

GuoDong Zhou, Jian Su, Jie Zhang, and Min Zhang. 2005. Exploring various knowledge in relation extraction. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05). Association for Computational Linguistics, Ann Arbor, Michigan, pages 427–434.

GuoDong Zhou, Min Zhang, Dong Hong Ji, and QiaoMing Zhu. 2007. Tree kernel-based relation extraction with context-sensitive structured parse tree information. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). Association for Computational Linguistics, Prague, Czech Republic, pages 728–736.

Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Berlin, Germany, pages 207–212.

Hao Zhu, Yankai Lin, Zhiyuan Liu, Jie Fu, Tat-Seng Chua, and Maosong Sun. 2019. Graph neural networks with generated parameters for relation extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, pages 1331–1339.

Jun Zhu, Zaiqing Nie, Xiaojiang Liu, Bo Zhang, and Ji-Rong Wen. 2009. StatSnowball: a statistical approach to extracting entity relationships. In Proceedings of the 18th international conference on World wide web. pages 101–110.