Exploiting Unlabelled Data for Relation Extraction
EXPLOITING UNLABELLED DATA FOR RELATION EXTRACTION

A thesis submitted to the University of Manchester for the degree of Doctor of Philosophy in the Faculty of Science and Engineering

2020

Thy Thy Tran
Department of Computer Science

Contents

Abstract  11
Declaration  13
Copyright  14
Acknowledgements  15
Acronyms and Abbreviations  16

1 Introduction  18
  1.1 Motivation  18
  1.2 Research Questions, Hypotheses and Objectives  20
  1.3 Contributions  22
  1.4 Dissertation Outline and Publications  23

2 Background  26
  2.1 Introduction: Relation Extraction  26
    2.1.1 Related Concepts  27
    2.1.2 Relation Extraction Tasks  28
    2.1.3 Datasets  29
    2.1.4 Evaluation Metrics  31
  2.2 Related Work on Relation Extraction  34
    2.2.1 Early Systems and Classical Machine Learning  34
    2.2.2 Neural Networks and Deep Learning  37
  2.3 Features for Relation Extraction  42
    2.3.1 Linguistic Features  43
    2.3.2 Word Representations  45
    2.3.3 External Resources  51
  2.4 Neural Components for Relation Extraction  52
    2.4.1 Convolutional Neural Networks  53
    2.4.2 Recurrent Neural Networks  54
    2.4.3 Graph Neural Networks  57
    2.4.4 Attention Mechanisms  58
    2.4.5 Pretrained Models  60
    2.4.6 Hybrid Architectures  60
  2.5 Relation Candidate Representation  61
  2.6 Relation Classification Layer  61
  2.7 Learning  62
    2.7.1 Fully Supervised Learning  63
    2.7.2 Few-shot Learning  64
    2.7.3 Weakly-Supervised Learning  66
    2.7.4 Unsupervised Learning  71
    2.7.5 Transfer Learning  72
    2.7.6 Semi-Supervised Learning  73
    2.7.7 Open Information Extraction  74
  2.8 Conclusions  75

3 Enriching Word Representations  79
  3.1 Introduction  79
  3.2 Proposed Approach  81
    3.2.1 Base Representation  82
    3.2.2 Part-of-Speech Tags and Dependencies  83
    3.2.3 The SIWR Model  83
    3.2.4 Pretraining the SIWR Model  84
    3.2.5 Syntactically-Informed Word Representations  86
  3.3 Pretraining Settings  87
    3.3.1 Datasets and Base Representations Used for Pretraining  87
    3.3.2 Pretraining Implementation Details  88
  3.4 Evaluation Settings  89
    3.4.1 Binary Relation Extraction  90
    3.4.2 Ternary Relation Extraction  91
  3.5 Results  92
  3.6 Analysis  96
    3.6.1 Effects of the Number of Pretraining Samples  96
    3.6.2 Ablation Studies  96
    3.6.3 Impact of Syntactic Information  98
    3.6.4 Computational Cost  100
  3.7 Related Work  101
  3.8 Conclusion  102

4 Unsupervised Relation Extraction  104
  4.1 Motivation  105
  4.2 Background: Unsupervised Relation Extraction  106
    4.2.1 Generative Approach  106
    4.2.2 Discriminative Approaches  107
  4.3 Our Methods  110
  4.4 Experimental Settings  111
    4.4.1 Evaluation Metrics  111
    4.4.2 Datasets  111
    4.4.3 Model Settings  112
  4.5 Results and Discussion  113
    4.5.1 Results  113
    4.5.2 Analysis  114
  4.6 Conclusion  118

5 Language Models as Weak Supervision  119
  5.1 Motivation  119
  5.2 Using Language Models as Weak Annotators  122
    5.2.1 Defining Relation Types  122
    5.2.2 Language Model Annotator  123
  5.3 Noisy Channel Auto-encoder (NoelA)  124
    5.3.1 Encoder  124
    5.3.2 Decoder  125
    5.3.3 Learning  127
  5.4 Experimental Settings  127
    5.4.1 Datasets  127
    5.4.2 Pretrained Language Models  128
    5.4.3 Relation Classification Settings  129
  5.5 Results  130
    5.5.1 Data Annotation  130
    5.5.2 Relation Classification  131
  5.6 Analysis  132
    5.6.1 Relation Distribution  132
    5.6.2 The Accuracy of the BERT Annotator  132
    5.6.3 The Accuracy of NoelA  134
    5.6.4 The Impact of Entity Type Reconstruction  134
  5.7 Related Work  135
    5.7.1 Relation Classification  135
    5.7.2 Pretrained Language Models  136
  5.8 Conclusion  137

6 Conclusions  139
  6.1 Summary of Research Objectives  140
  6.2 Open Problems and Future Work  142
    6.2.1 External Information for Enriching Word Representations  143
    6.2.2 Graph Generalisation and Construction  143
    6.2.3 Cluster Definition  144
    6.2.4 Improvement for Language Model Annotation  144
    6.2.5 Noise Reduction  144
    6.2.6 Multiple Sources of Supervision  145
    6.2.7 Document-level Relation Extraction  145

A Named Entity Recognition  146
  A.1 Named Entity Recognition  146
  A.2 Experimental Settings  147
  A.3 Results  148
  A.4 Comparison between Different Base Representations  148

B Language Models as Weak Supervision  150
  B.1 BERT Annotator Confusion Matrices  150
  B.2 Relation Exemplars  150

Bibliography  157

Word Count: 35,786

List of Tables

2.1 Available relation extraction datasets for the general domain  29
2.2 Annotation examples of distant supervision (DS) and the corresponding gold relation categories  67
2.3 Relation examples from classical relation extraction methods and open information extraction systems  75
3.1 Word representations, training data and dependency parsers used in our experiments  87
3.2 Value range and best value of tuned hyperparameters for our SIWR  87
3.3 Evaluation datasets and related models used in our experiments  89
3.4 Statistics and hyperparameters for the ACE2005 binary relation extraction task  90
3.5 Statistics and hyperparameters for the drug-gene-mutation dataset  92
3.6 Test set results with different embeddings over two relation extraction tasks  93
3.7 Comparison of contextual representations and fine-tuning a large-scale language model  94
3.8 Binary relation extraction performance on the ACE2005 test set  94
3.9 N-ary relation extraction accuracy on the drug-gene-mutation data  95
3.10 Binary relation extraction performance of ablated SIWR variants on the ACE2005 development set  97
3.11 Pretrained model parameters and downstream trainable parameters  100
4.1 Statistics of the NYT-FB and TACRED datasets; #r indicates the number of relation types in each dataset  111
4.2 Hyper-parameter values used in our experiments  113
4.3 Average results (%) across three runs of different models (except the rule-based EType) on NYT-FB and TACRED  114
4.4 Study of EType+ in combination with different features  117
5.1 Data statistics of the TACRED and reWiki datasets  128
5.2 Hyper-parameters of NoelA and its variants  128
5.3 Accuracy (%) of LM annotators on two datasets  130
5.4 RC accuracy (Acc.) across five runs of NoelA and its variants  131
5.5 Mutual information between entity type pairs (ET) and gold relations (R) on the development sets  135
A.1 Data statistics for the ACE2005 named entity recognition dataset  146
A.2 Nested NER  147
A.3 Test set results with different embeddings on the nested named entity recognition dataset (ACE2005)  147
A.4 Performance comparison on nested NER, ACE2005 test set  148
A.5 Nested named entity recognition results on the ACE2005 development set with different base representations and their enriched alternatives  149
B.2 Exemplars created for each relation in reWiki  150
B.1 Exemplars created for each relation in TACRED  155

List of Figures

1.1 Overview of our contributions  23
1.2 The thesis roadmap  25
2.1 An example of a binary relation in a sentence  27
2.2 A neuron or perceptron visualisation, adapted from CS231n (2020)  38
2.3 One-layer neural network  39
2.4 Two-layer neural network