Deep Neural Network Models for Sequence Labeling and Coreference Tasks
Federal State Autonomous Educational Institution of Higher Education
«Moscow Institute of Physics and Technology (National Research University)»

As a manuscript

Le The Anh

DEEP NEURAL NETWORK MODELS FOR SEQUENCE LABELING AND COREFERENCE TASKS

Specialty 05.13.01 – «System analysis, control theory, and information processing (information and technical systems)»

A dissertation submitted in fulfillment of the requirements for the degree of candidate of technical sciences

Supervisor: PhD in physical and mathematical sciences Burtsev Mikhail Sergeevich

Dolgoprudny, 2020

Contents

Abstract
Acknowledgments
Abbreviations
List of Figures
List of Tables

1 Introduction
  1.1 Overview of Deep Learning
    1.1.1 Artificial Intelligence, Machine Learning, and Deep Learning
    1.1.2 Milestones in Deep Learning History
    1.1.3 Types of Machine Learning Models
  1.2 Brief Overview of Natural Language Processing
  1.3 Dissertation Overview
    1.3.1 Scientific Actuality of the Research
    1.3.2 The Goal and Task of the Dissertation
    1.3.3 Scientific Novelty
    1.3.4 Theoretical and Practical Value of the Work in the Dissertation
    1.3.5 Statements to be Defended
    1.3.6 Presentations and Validation of the Research Results
    1.3.7 Publications
    1.3.8 Dissertation Structure

2 Deep Neural Network Models for NLP Tasks
  2.1 Word Representation Models
    2.1.1 Word Representation
    2.1.2 Prediction-based Models
    2.1.3 Count-based Models
  2.2 Deep Neural Network Models
    2.2.1 Convolutional Neural Network
    2.2.2 Recurrent Neural Network
    2.2.3 Long Short-Term Memory Cells
    2.2.4 LSTM Networks
  2.3 Pre-trained Language Models
    2.3.1 ELMo
    2.3.2 Transformer
    2.3.3 OpenAI’s GPT
    2.3.4 Google’s BERT
  2.4 Summary

3 Sequence Labeling with Character-aware Deep Neural Networks and Language Models
  3.1 Introduction to the Sequence Labeling Tasks
  3.2 Related Work
    3.2.1 Rule-based Models
    3.2.2 Feature-based Models
    3.2.3 Deep Learning-based Models
    3.2.4 Related Work on Vietnamese Named Entity Recognition
    3.2.5 Related Work on Russian Named Entity Recognition
  3.3 Tagging Schemes
  3.4 Evaluation Metrics
  3.5 WCC-NN-CRF Models for Sequence Labeling Tasks
    3.5.1 Backbone WCC-NN-CRF Architecture
    3.5.2 Language Model-based Architecture
  3.6 Application of WCC-NN-CRF Models for Named Entity Recognition
    3.6.1 Overview of Named Entity Recognition Task
    3.6.2 Datasets and Pre-trained Word Embeddings
    3.6.3 Evaluation of Backbone WCC-NN-CRF Model
    3.6.4 Evaluation of ELMo-based WCC-NN-CRF Model
    3.6.5 Evaluation of BERT-based Multilingual WCC-NN-CRF Model
  3.7 Application of WCC-NN-CRF Model for Sentence Boundary Detection
    3.7.1 Introduction to the Sentence Boundary Detection Task
    3.7.2 Sentence Boundary Detection as a Sequence Labeling Task
    3.7.3 Evaluation of WCC-NN-CRF SBD Model
  3.8 Conclusions

4 Coreference Resolution with Sentence-level Coreferential Scoring
  4.1 The Coreference Resolution Task
  4.2 Related Work
    4.2.1 Rule-based Models
    4.2.2 Deep Learning Models
  4.3 Coreference Resolution Evaluation Metrics
  4.4 Baseline Model Description
  4.5 Sentence-level Coreferential Relation-based Model
  4.6 BERT-based Coreference Model
  4.7 Experiments and Results
    4.7.1 Datasets
    4.7.2 Evaluation of Proposed Models
  4.8 Conclusions

5 Conclusions
  5.1 Conclusions for Sequence Labeling Task
  5.2 Conclusions for Coreference Resolution Task
  5.3 Summary of the Main Contributions of the Dissertation

Bibliography

Abstract

Deep neural network models have recently received tremendous attention from both academia and industry and have achieved remarkable results in a variety of domains ranging from Computer Vision and Speech Recognition to Natural Language Processing (NLP). They have lifted the performance of machine learning-based systems to a whole new level, close to human-level performance. Accordingly, the number of deep learning projects has grown year by year. The IPavlov project¹, based at the Neural Networks and Deep Learning Lab of the Moscow Institute of Physics and Technology (MIPT), is one of them, aiming to build a set of pre-trained network models, predefined dialogue system components, and pipeline templates. This thesis is based on the work carried out as part of this project, focusing on deep neural network models for the Sequence Labeling and Coreference Resolution tasks.

¹ https://ipavlov.ai/

This thesis consists of three main parts. First, we systematically synthesize three key concepts in the field of Deep Learning for NLP that are closely related to the work carried out in this thesis: (1) the two approaches to word representation learning, (2) the deep neural network models often used to address machine learning tasks in general and NLP tasks in particular, and (3) cutting-edge pre-trained language models and their applications to downstream tasks. Second, we propose three deep neural network models for Sequence Labeling tasks: (1) a hybrid model consisting of three sub-networks that fully captures character-level and capitalization features as well as word context features, (2) a language modeling-based model, and (3) a multilingual model. The proposed models were evaluated on the task of Named Entity Recognition. Experiments on six datasets covering four languages (Russian, Vietnamese, English, and Chinese) showed that our models achieve state-of-the-art performance. In addition, we reformulated the task of Sentence Boundary Detection as a Sequence Labeling task and used the proposed model to address it; the results obtained on two conversational datasets show that the model achieves high accuracy. Third, we propose two models for the Coreference Resolution task: (1) a Sentence-level Coreferential Relation-based model that can take as input a paragraph, a discourse, or even a document with hundreds of sentences and predict the coreference relations between sentences, and (2) a language modeling-based
model that leverages the power of modern language models to boost performance. Experimental results on two Russian datasets and comparisons with other existing models show that our models achieve state-of-the-art results on both the Anaphora and Coreference Resolution tasks.

Acknowledgments

The pursuit of a Ph.D. is a long and challenging journey through unknown lands. I would like to express my sincere thanks to those who have accompanied and encouraged me on this journey. Without their support, this thesis would not have been completed. First and foremost, I would like to thank my supervisor Burtsev Mikhail Sergeevich, the Head of the Neural Networks and Deep Learning Lab² at the Moscow Institute of Physics and Technology, who provided me with a good research environment and gave me the right directions during my Ph.D. study. I would also like to thank my colleagues Yura Kuratov, Mikhail Arkhipov, Valentin Malykh, Maxim Petrov, Alexey Sorokin, and the others in our lab for the highly fruitful and enlightening discussions we shared. Besides, I would like to express my thanks to the reviewers Artem Shelmanov, Natalia Lukashevich, and Aleksandr Panov (the chair of the pre-defense seminar) for their valuable comments and suggestions on my dissertation. Finally, this thesis is dedicated to my family for their love and encouragement, and for always being by my side.

This work was supported by the National Technology Initiative and PAO Sberbank, project ID 0000000007417F630002.

Dolgoprudny, 2020

² https://mipt.ru/science/labs/laboratoriya-neyronnykh-sistem-i-glubokogo-obucheniya/

Abbreviations

General notation
a.k.a.  also known as
e.g.  exempli gratia, meaning “for example”
et al.  et alia, meaning “and others”
etc.  et cetera, meaning “and the other things”
i.e.  id est, meaning “in other words”
w.r.t.  with respect to

Natural language processing
BERT  Bidirectional Encoder Representations from Transformers model [44]
Bi-LM  Bidirectional Language Model
BPE  Byte Pair Encoding
CBOW  Continuous Bag-of-Words model [123]
CEAF  Constrained Entity Alignment F-measure
CR  Coreference Resolution
CRF  Conditional Random Field
ELMo  Embeddings from Language Models [69]
GPT  Generative Pre-Training [14]
IOB  Inside-Outside-Beginning tagging scheme
IOBES  Inside-Outside-Beginning-End-Single tagging scheme
IOE  Inside-Outside-End tagging scheme
LM  Language Modeling
LR-MVL  Low-Rank Multi-View Learning model [87]
LSA  Latent Semantic Analysis
Masked LM  Masked Language Modeling
MT  Machine Translation
NER  Named Entity Recognition
NLP  Natural Language Processing
NPs  Noun Phrases
OOV  Out-Of-Vocabulary
POS  Part-Of-Speech
QA  Question Answering
SRL  Semantic Role Labeling
WER  Word Error Rate

Neural networks
AGI  Artificial General Intelligence
AI  Artificial Intelligence
ANI  Artificial Narrow Intelligence
ANN  Artificial Neural Network
Bi-LSTM  Bi-directional Long Short-Term Memory
BPTT  Back-Propagation Through Time
CNN  Convolutional Neural Network
DL  Deep Learning
FFNN  Feed-Forward Neural Network
GRU  Gated Recurrent Unit
HMM  Hidden Markov Model
LSTM  Long Short-Term Memory
ML  Machine Learning
ReLU  Rectified Linear Unit
RNN  Recurrent Neural Network

Probability and statistics
CCA  Canonical Correlation Analysis
PCA  Principal Component Analysis
PPL  Perplexity
SVD  Singular Value Decomposition
t-SNE  t-Distributed Stochastic Neighbor Embedding

List of Figures

1.1 The Artificial Intelligence taxonomy.